Abstract
Customer churn is a common problem in many industries, including
telecommunications, and this has led to the development of advanced
techniques for predicting and preventing it.
Customer churn occurs when customers terminate their contracts with their
service provider. This research introduces an effective way to further improve
the churn prediction capability of different machine learning (ML) algorithms
through the use of topological data analysis (TDA). TDA is a relatively
new method in data analysis that can be used to extract topological and
geometric information from large datasets. TDA can find the topological
structure associated with each class in the data (if any), thereby simplifying
the pattern-finding task of the ML algorithms and hence improving their
performance on it.
Some of the most pressing challenges in ML-based customer churn prediction
addressed by this research include the effective preprocessing and analysis of
large customer datasets and the effective tuning of ML hyperparameters
to achieve good churn prediction. Firstly, a data preprocessing
technique was implemented that consisted of several stages, including
handling of missing data (numerical and categorical), feature engineering,
encoding of categorical features using hashing encoding, and feature
selection. Secondly, barcode statistics, a TDA summary of the 0- and
1-dimensional holes in the topological structure of the data, were applied to
the preprocessed data. Barcode statistics were computed using three subsets of
the data (L5, L10, and L15), which represent 5%, 10%, and 15% of the customer
dataset, respectively. Thirdly, three ML algorithms, namely k-nearest neighbour
(KNN), support vector machine (SVM), and extreme gradient boosting
(XGBoost), were implemented, and their hyperparameters were tuned using
GridSearchCV. Lastly, to account for the imbalance between the two
classes (churn and non-churn), undersampling was used.
The churn prediction capability of the implemented models was evaluated
using metrics such as accuracy, precision, recall, and F1-measure.
Without barcode statistics as an additional feature, the XGBoost algorithm
with tuned hyperparameters achieved the best results, with an accuracy of
92.71%, precision of 85.95%, recall of 92.71%, and F1-measure of 89.20%.
With barcode statistics added as a feature, the XGBoost algorithm with tuned
hyperparameters again achieved the best results, which were much improved:
an accuracy of 98.50%, precision of 98.50%, recall of 98.50%, and F1-measure
of 98.50%. The use of TDA and barcode statistics
significantly improved the churn prediction capability of the ML algorithms.
It was also observed that hyperparameter tuning was not needed when an
effective data preprocessing technique was used, and this remained true when
TDA was used.
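
As a rough illustration of the barcode statistics described above, the
sketch below computes the 0-dimensional barcode of a small point cloud and
summarises its bar lengths. It assumes a Vietoris-Rips filtration on Euclidean
distances, in which every connected component is born at scale 0 and dies at
the scale where it merges with another component. The function names
(h0_barcode, barcode_stats) and the choice of summary statistics (count, mean,
standard deviation, maximum) are hypothetical and are not taken from this
research. The 1-dimensional holes require a persistent homology library and
are not covered here.

```python
import math
import statistics
from itertools import combinations


def h0_barcode(points):
    """Finite 0-dimensional bars (birth, death) of a Vietoris-Rips filtration.

    Every point is born at scale 0; a component dies at the length of the
    edge that merges it into another one (single linkage, as in Kruskal).
    The one component that never dies (the infinite bar) is omitted.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components, so one bar ends
            parent[ri] = rj
            bars.append((0.0, length))
    return bars


def barcode_stats(bars):
    """Summary statistics of bar lengths (an assumed, illustrative choice)."""
    lengths = [death - birth for birth, death in bars]
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "std": statistics.pstdev(lengths),
        "max": max(lengths),
    }


# Toy point cloud (not customer data): two tight pairs that are far apart.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
bars = h0_barcode(cloud)
stats = barcode_stats(bars)
```

In this toy cloud the two short bars correspond to the points within each
pair merging, and the single long bar reflects the gap between the two pairs.
Summaries of this kind can be appended to the preprocessed data as extra
features.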
Keywords: Support vector machine (SVM), k-nearest neighbour (KNN),
extreme gradient boosting (XGBoost), topological data analysis (TDA), customer
churn, landmark, and barcode statistics.
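
The four evaluation metrics named in the abstract can be sketched for a
two-class problem as follows. The abstract does not state how the per-class
scores were averaged. This sketch assumes support-weighted averaging, under
which recall always equals accuracy, as it does in the reported figures. The
function name weighted_metrics and the example labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0  # per-class precision
        r_c = tp / count                            # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = count / n                          # class share of the data
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration: 1 = churn, 0 = non-churn.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
scores = weighted_metrics(y_true, y_pred)
```

With these labels the weighted recall and the accuracy are both 0.8, while
the weighted precision is 0.85, so the metrics need not coincide in general.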