Predicting Customer Behaviour from British Airways Customer Buying Data

GitHub file

As part of Forage's digital experience, I used the XGBoost algorithm to predict the outcome of a customer booking process and to extract the features that most affect the decision to purchase an airline ticket.

The machine learning process starts with exploring and understanding the data. I did some exploratory data analysis with pandas to get a thorough understanding of the dataset.

The image above shows the column index: there are 6 numerical columns and 8 categorical columns.
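A quick way to get this kind of column split with pandas is `select_dtypes`. The snippet below is only a sketch on a toy frame: the column names and values are assumptions for illustration, not the actual dataset.

```python
import pandas as pd

# Toy stand-in for the booking data; column names and values are assumed.
df = pd.DataFrame({
    "num_passengers": [1, 2, 3, 1],
    "purchase_lead": [10, 120, 45, 3],
    "flight_hour": [7, 19, 21, 14],
    "flight_day": ["Mon", "Wed", "Wed", "Sat"],
    "route": ["AKLKUL", "PENTPE", "AKLKUL", "DMKKIX"],
    "booking_complete": [0, 1, 0, 0],
})

# Columns with a numeric dtype versus everything else (here, text columns).
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numerical:", numeric_cols)
print(len(categorical_cols), "categorical:", categorical_cols)
```

Note that a dtype-based split treats integer-coded flags as numerical, so the counts it gives may differ from a manual classification of the columns.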

Calling the value counts method on the target column shows that the dataset is highly imbalanced. This can be alleviated by undersampling the booking-not-complete class or oversampling the booking-complete class.
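As a rough illustration, the check and a simple random undersampling step could look like the sketch below. The toy data, the `booking_complete` column name, and the 15/85 split are assumptions made up for the example; this is one possible way to rebalance, not necessarily what the project did.

```python
import pandas as pd

# Toy data: 3 completed bookings (1) and 17 incomplete ones (0); values are made up.
df = pd.DataFrame({
    "purchase_lead": range(20),
    "booking_complete": [1] * 3 + [0] * 17,
})

# Share of each class in the target column.
class_share = df["booking_complete"].value_counts(normalize=True)
print(class_share)

minority = df[df["booking_complete"] == 1]
majority = df[df["booking_complete"] == 0]

# Randomly keep only as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=0)

# Combine and shuffle so the classes are interleaved.
balanced = pd.concat([minority, majority_down]).sample(frac=1, random_state=0)
balanced_counts = balanced["booking_complete"].value_counts()
print(balanced_counts)
```

Undersampling throws away majority-class rows, so it trades data volume for balance. Oversampling the minority class keeps all rows but repeats some of them.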

For some of the numerical features, I plotted the values against the percentage of bookings completed to learn how these factors affect customer behaviour.

It can be seen that these features do have some influence on the decision to book a ticket. The first graph shows that most people tend to book as a group of 3 or 5. The second shows that people are more likely to complete a booking if the flight is on a Wednesday. The third graph shows that most people tend to book flights that depart in the evening. The fourth graph shows that people are more likely to purchase a ticket for a flight in the near future, with the likelihood decreasing as the number of days until departure increases.
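The percentage of completed bookings per feature value can be computed with a pandas group-by. The sketch below uses a made-up toy frame and assumed column names (`flight_day`, `booking_complete`), so the numbers it prints are not the project's results.

```python
import pandas as pd

# Toy data; the values are invented purely to show the calculation.
df = pd.DataFrame({
    "flight_day": ["Mon", "Mon", "Wed", "Wed", "Wed", "Sat"],
    "booking_complete": [0, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 target within each group is the completion rate;
# multiply by 100 to express it as a percentage.
completion_pct = (
    df.groupby("flight_day")["booking_complete"].mean().mul(100).round(1)
)
print(completion_pct)
```

The resulting series can then be passed to a bar or line plot, one per feature, to produce graphs like the ones described above.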

More analysis can be found on the GitHub page for this project.

Training

The XGBoost Python library was used to train the prediction model for this dataset. Categorical values were encoded with target encoders, which encode each category based on how often the target occurs for that category. BayesSearchCV from scikit-optimize was used to search for the best combination of hyperparameters for the model.
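To make the idea of target encoding concrete, here is a minimal hand-rolled sketch in pandas. It is plain mean encoding on an invented toy frame with an assumed `route` column; it is not the encoder class used in the project, and library encoders usually add smoothing on top of this.

```python
import pandas as pd

# Toy training data; routes and outcomes are made up.
train = pd.DataFrame({
    "route": ["A", "A", "A", "B", "B", "C"],
    "booking_complete": [1, 0, 1, 0, 0, 1],
})

# Overall completion rate, used as a fallback for unseen categories.
global_mean = train["booking_complete"].mean()

# Completion rate per category: this is the encoded value.
route_means = train.groupby("route")["booking_complete"].mean()
train["route_encoded"] = train["route"].map(route_means)

# Apply the mapping learned on the training data to new rows.
new_rows = pd.DataFrame({"route": ["A", "Z"]})
new_rows["route_encoded"] = new_rows["route"].map(route_means).fillna(global_mean)

print(train)
print(new_rows)
```

The mapping should be learned on the training split only and then applied to validation or test rows. Otherwise the target leaks into the features.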

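The model predicted customer buying behaviour with an accuracy of 81%, but the F1-score for the booking-complete class was only 0.26. On a highly imbalanced dataset, a model can reach high accuracy mostly by predicting the majority class, so the low F1-score shows that it struggles to identify completed bookings. However, the main goal of this project was to find the factors that affect customer buying behaviour.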
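The gap between accuracy and F1-score is easier to see with a small worked example. The labels and predictions below are invented for illustration and are not the project's confusion matrix; they only show how an imbalanced dataset can yield high accuracy alongside a low F1-score for the minority class.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the positive class) from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# 100 hypothetical bookings: 85 incomplete (0) and 15 complete (1).
y_true = [0] * 85 + [1] * 15
# Hypothetical predictions: 80 true negatives, 5 false positives,
# 3 true positives, 12 false negatives.
y_pred = [0] * 80 + [1] * 5 + [1] * 3 + [0] * 12

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

In this made-up case most of the accuracy comes from correctly predicting incomplete bookings, while precision (3/8) and recall (3/15) for completed bookings are both low.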

The XGBoost model has a built-in way to extract feature importances. They are shown in the image below.

It can be seen that purchase lead has the highest feature importance, followed by the route that the flight takes and the flight hour. This is consistent with the graphs above, which show the relationship between these features and the booking percentage.