Data Scientist with M.A. in sociology, B.A. in environmental sociology, and 5+ years' experience teaching statistics. Completed TripleTen's 10-month data science bootcamp and a real-world data science externship with DataSpeak. Currently accepting data analysis and statistics consulting projects as of May 2024.
View My LinkedIn Profile
Background: Telecom operator Interconnect would like to forecast the churn of their clients. If the customer is likely to leave, they will be sent promotions and special plan offers.
Purpose: Fit an imbalanced classification model that accurately predicts which customers are likely to leave the company.
Techniques: CatBoost, LGBM, XGBoost, AdaBoost, pipelines, GridSearchCV, class balancing.
numpy, time
pandas
matplotlib, seaborn
sklearn, imblearn
catboost, lightgbm, xgboost
contract_df.csv, internet_df.csv, personal_df.csv, phone_df.csv
Target:
Features:
The data were provided by TripleTen’s Data Science bootcamp. The full dataset is loaded into the notebook but is proprietary information and cannot be shared online.
Data were checked for missing values and duplicates. The 4 missing values in total charges were filled with the median of total charges. No duplicates or other missing values were found.
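The cleaning step described above can be sketched roughly as follows. This is a minimal illustration on a toy table, not the project's notebook code; the column names `customer_id` and `total_charges` and all values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the contract table (hypothetical column names and values).
contract = pd.DataFrame({
    "customer_id": ["a1", "b2", "c3", "d4", "e5"],
    "total_charges": [29.85, np.nan, 108.15, 1889.50, 151.65],
})

# Count missing values and fully duplicated rows before changing anything.
n_missing = int(contract["total_charges"].isna().sum())
n_duplicates = int(contract.duplicated().sum())

# Fill the gaps with the median, which is less sensitive to the long
# right tail of total charges than the mean would be.
median_charge = contract["total_charges"].median()
contract["total_charges"] = contract["total_charges"].fillna(median_charge)
```

The median is a reasonable default here because total charges is heavily skewed, so a mean fill would be pulled toward the few very large values.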
The four datasets were merged on customer id to make a comprehensive dataset for all variables.
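One way such a merge could look in pandas is sketched below. The tiny frames, the key name `customer_id`, and the choice of a left join anchored on the contract table are assumptions for illustration; the original notebook may have used a different join type.

```python
from functools import reduce

import pandas as pd

# Hypothetical miniature versions of the four source tables.
contract = pd.DataFrame({"customer_id": ["a1", "b2", "c3"],
                         "monthly_charges": [29.85, 56.95, 53.85]})
personal = pd.DataFrame({"customer_id": ["a1", "b2", "c3"],
                         "senior_citizen": [0, 0, 1]})
internet = pd.DataFrame({"customer_id": ["a1", "c3"],
                         "internet_service": ["DSL", "Fiber optic"]})
phone = pd.DataFrame({"customer_id": ["b2", "c3"],
                      "multiple_lines": ["No", "Yes"]})

# Chain left joins so every customer in the contract table is kept,
# even if they have no internet or phone record (assumed join strategy).
merged = reduce(
    lambda left, right: left.merge(right, on="customer_id", how="left"),
    [contract, personal, internet, phone],
)
```

With a left join, customers missing from a service table end up with NaN in that table's columns, which then need an explicit fill such as "No service" before modeling.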
Four additional features were created:
There are fewer customers who churned than did not churn. This is an imbalanced classification problem. Class balancing and weighting techniques will be applied.
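As a sketch of what class weighting means in practice, the helper below computes "balanced" weights with the formula n_samples / (n_classes * class_count), the same heuristic scikit-learn applies for `class_weight="balanced"`. The function name and the 73/27 label split are hypothetical and chosen only for illustration; they are not the project's actual class counts.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative imbalance only: 73 non-churners (0) and 27 churners (1).
labels = [0] * 73 + [1] * 27
weights = balanced_class_weights(labels)
```

The minority class receives the larger weight, so each misclassified churner costs the model more than a misclassified non-churner during training.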
Customers who began their contracts in 2014 - 2018 are almost all still with the company.
About 50% of customers who began their contracts in 2019 - 2020 have already churned.
New customers are more likely to leave than old customers.
The distribution of monthly charges has three peaks at $20, $50, and $80 per month.
Total charges is highly right-skewed, with most people paying close to $0 total and only a few people paying over $6000 over the life of their plan.
Contract length is bimodal, with many people having contracts shorter than 100 days or longer than 2000 days.
The correlation heatmap shows high correlations between several numeric features, which indicates multicollinearity. Some features will need to be removed from the model.
Total charges, while highly correlated with begin year (r = -0.82), shares only a moderate correlation with monthly charges (r = 0.65). Tree models are not highly affected by slight multicollinearity. Total charges will be kept in the model.
Begin year and monthly charges have a low correlation with each other and will be kept in the model (r = -0.26).
Contract length and total internet services will be removed from the model.
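The kind of check behind these decisions can be sketched as below: compute the correlation matrix and list feature pairs whose absolute correlation exceeds a cutoff. The synthetic data, the column names, and the 0.9 threshold are all assumptions for illustration, and the correlations it produces are not the project's values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical); contract length is derived
# from begin year, so the two are strongly correlated by construction.
rng = np.random.default_rng(0)
n = 500
begin_year = rng.integers(2014, 2021, size=n)
monthly_charges = rng.uniform(20, 110, size=n)
contract_length = (2020 - begin_year) * 365 + rng.integers(0, 365, size=n)

df = pd.DataFrame({
    "begin_year": begin_year,
    "monthly_charges": monthly_charges,
    "contract_length": contract_length,
    "total_charges": monthly_charges * contract_length / 30,
})

# Pearson correlations between all numeric features.
corr = df.corr()

# Collect each unordered pair once if |r| meets the (assumed) cutoff.
threshold = 0.9
columns = list(corr.columns)
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(columns)
    for b in columns[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]
```

A list like `high_pairs` only flags candidates; which member of a pair to drop remains a judgment call, as with keeping total charges above.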
The best model was the LightGBM classifier trained on SMOTE-upsampled data. This model achieved the highest ROC-AUC and accuracy scores during model selection (ROC-AUC = 0.88, accuracy = 0.81). The LightGBM model will be evaluated on the test set.
The LightGBM classifier, fit on SMOTE-upsampled training data, achieved a lower ROC-AUC on the test set (ROC-AUC = 0.80, down from 0.88). This model is likely slightly overfit but still achieves a reasonable test score.
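To make the metric concrete: ROC-AUC can be read as the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. The pure-Python sketch below computes it from that pairwise definition. The function name and the scores are invented for illustration, and for real datasets a library routine such as scikit-learn's `roc_auc_score` is the practical choice.

```python
def pairwise_roc_auc(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted churn probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
auc = pairwise_roc_auc(y_true, y_score)  # 10 of 12 pairs ranked correctly
```

Computing this score separately on the training-time folds and on the held-out test set is what exposes a gap like the one reported above.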
The LightGBM classifier achieved the best model fit (test ROC-AUC = 0.80, test accuracy = 0.80). If this model predicts a customer will churn, there’s about an 80% chance the customer will actually churn (test precision = 0.80). Of the customers who do churn, the model can be expected to identify about 57% (test recall = 0.57).
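The relationship between these two numbers and the underlying confusion counts can be sketched as follows. The counts are hypothetical, picked only so that the resulting rates round to the precision and recall reported above; they are not the project's actual confusion matrix.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of predicted churners, how many churned
    recall = tp / (tp + fn)     # of actual churners, how many were caught
    return precision, recall


# Hypothetical counts for illustration only.
tp = 57  # churners correctly flagged
fp = 14  # loyal customers wrongly flagged
fn = 43  # churners the model missed
precision, recall = precision_recall(tp, fp, fn)
```

High precision with moderate recall means promotions sent to flagged customers are rarely wasted, but a sizeable share of churners never gets flagged.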
Interconnect can feel confident implementing this model to predict which customers will churn. If the model says a customer will churn, it is likely that the customer will indeed churn. The model will miss some customers who churn, so it is worth offering some additional small promotions across the clientele. Interconnect would do well to focus on keeping new customers, as older customers are likely to stay with the company.
More research should be done on the company’s newer customers to determine why they are leaving. Follow-up surveys and additional promotions for this group could help strengthen new-client retention.