The company is interested in identifying profitable customers who are likely to purchase a ticket when given a promotional offer. My goal is to build a model that predicts whether a customer will buy a ticket, and specifically to improve recall and precision for the minority class in order to maximize profit. The target variable is imbalanced: 9% of customers are buyers and 91% are non-buyers.
Dataset Description: The dataset contains the following features:
MARKETING_SCORE
STATUS_PANTINUM
STATUS_GOLD
STATUS_SILVER
NUM_DEAL
LAST_DEAL
ADVANCE_PURCHASE
CALL_FLAG
CREDIT_PROBLEM
RETURN_FLAG
BENEFIT_FLAG
AVG_FARE
AVG_POINTS
BUYER_FLAG: Target variable (1 if the customer bought the ticket, 0 otherwise)
I used both correlation analysis and random forest importance to select features. The most important features based on random forest were:
MARKETING_SCORE
ADVANCE_PURCHASE
LAST_DEAL
NUM_DEAL
CALL_FLAG
CREDIT_PROBLEM
STATUS_SILVER
BENEFIT_FLAG
STATUS_GOLD
RETURN_FLAG
STATUS_PANTINUM
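For illustration, the two rankings described above (correlation with the target and random forest importance) could be computed along these lines. This is only a minimal sketch on synthetic data, not the code behind the list above: the data from `make_classification`, the `FEATURE_j` names, and the forest settings are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption): 5 features, roughly 9% positives.
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.91, 0.09], random_state=0,
)
names = [f"FEATURE_{j}" for j in range(X.shape[1])]  # hypothetical column names

# Absolute Pearson correlation of each feature with the binary target.
abs_corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Impurity-based importances from a random forest (settings are assumptions).
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)
importances = rf.feature_importances_

# Print features from most to least important according to the forest.
order = np.argsort(importances)[::-1]
for j in order:
    print(f"{names[j]:<12} rf_importance={importances[j]:.3f}  abs_corr={abs_corr[j]:.3f}")
```

Impurity-based importances tend to favor continuous features with many distinct values over binary flags, so a ranking like this may understate flag-type columns. Permutation importance on held-out data is a common cross-check.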
I first trained a logistic regression model on the important features, but it did not perform well. I then used XGBoost with hyperparameter optimization. After training the final XGBoost model and calculating the optimal cutoff, I evaluated it on the training and test sets. The results were as follows:
Training set:

              precision    recall  f1-score   support

           0       0.94      0.88      0.91     27265
           1       0.29      0.47      0.36      2735

    accuracy                           0.85     30000
   macro avg       0.62      0.68      0.64     30000
weighted avg       0.88      0.85      0.86     30000

Test set:

              precision    recall  f1-score   support

           0       0.94      0.88      0.91      9088
           1       0.27      0.42      0.33       912

    accuracy                           0.84     10000
   macro avg       0.60      0.65      0.62     10000
weighted avg       0.88      0.84      0.86     10000
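Since the objective is profit rather than F1, one way a cutoff could be chosen is by scanning candidate thresholds and keeping the one with the highest profit. The sketch below is not the code that produced the results above. The toy labels and scores, and the `gain_per_buyer` and `cost_per_offer` values, are hypothetical placeholders, and the profit formula (every predicted buyer receives an offer, only true buyers generate revenue) is an assumption.

```python
def confusion_at(y_true, scores, cutoff):
    """Count TP, FP, FN, TN when predicting 1 for scores >= cutoff."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted_buyer = s >= cutoff
        if predicted_buyer and y == 1:
            tp += 1
        elif predicted_buyer and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def profit_at(y_true, scores, cutoff, gain_per_buyer, cost_per_offer):
    """Assumed profit model: offers go to all predicted buyers, revenue comes from TPs."""
    tp, fp, _, _ = confusion_at(y_true, scores, cutoff)
    return tp * gain_per_buyer - (tp + fp) * cost_per_offer


def best_cutoff(y_true, scores, cutoffs, gain_per_buyer, cost_per_offer):
    """Return (cutoff, profit) with the highest profit; ties keep the lowest cutoff."""
    best = None
    for c in cutoffs:
        p = profit_at(y_true, scores, c, gain_per_buyer, cost_per_offer)
        if best is None or p > best[1]:
            best = (c, p)
    return best


# Toy validation data (hypothetical): 3 buyers among 10 customers.
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.04, 0.11, 0.82, 0.31, 0.42, 0.22, 0.61, 0.36, 0.16, 0.03]
cutoffs = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

cutoff, profit = best_cutoff(
    y_true, scores, cutoffs, gain_per_buyer=100.0, cost_per_offer=10.0
)
print(f"best cutoff={cutoff:.2f}, profit={profit:.1f}")
```

In practice the scan would run on a validation set rather than the training or test set, so that the reported test metrics remain an honest estimate.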
How can I further improve recall and precision for the minority class (buyers) to maximize expected profit? Any insights or suggestions would be greatly appreciated. I'm a student and still learning :)