Title: Churn prediction in telecommunication
Author: de Groot, D.S.
Contributor: van der Meulen, F.H. (mentor)
Faculty: Electrical Engineering, Mathematics and Computer Science
Department: Delft Institute of Applied Mathematics (DIAM)
Date: 2017-05-19

Abstract:
For telecommunication businesses it is important to retain as many customers as possible. For this purpose it is useful to predict which customers are likely to cancel their subscription. When a customer leaves the company by cancelling the subscription, this is called customer churn. In this thesis customer churn was predicted for KPN, a telecommunication business in the Netherlands. To predict customer churn for KPN, a data set with information about the network, subscriptions and calls to the call centres was used. The data set contains numerical, ordinal and categorical features and just over two million customers. One of the features is the response feature churn, which suffers from class imbalance: only 3.39 percent of the customers had churned. First, the data was explored with a cluster analysis, using partitioning around medoids with the Gower distance as dissimilarity measure. The clustering was visualised using t-distributed stochastic neighbour embedding. The cluster analysis showed that it is hard to distinguish churners from non-churners, since they are not clearly grouped. To find a model that predicts customer churn, several classification methods were compared: logistic regression, lasso, adaptive boosting, naive Bayes, random forest and a super learner. The classification methods were applied both to the data as is and to a version in which the training set was balanced by up-sampling. To assess which model performs best, an in-depth study of performance measures was carried out.
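As an illustration, the up-sampling used to balance the training set can be sketched as follows. This is a minimal sketch with hypothetical rows and a hypothetical helper name, not the thesis's actual implementation or the KPN data:

```python
import random

def upsample_minority(rows, label_index, rng):
    """Resample the minority class with replacement until both
    classes are the same size. `label_index` points at the 0/1
    churn label inside each row tuple."""
    pos = [r for r in rows if r[label_index] == 1]   # churners
    neg = [r for r in rows if r[label_index] == 0]   # non-churners
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw extra minority rows with replacement until sizes match
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

rng = random.Random(0)
# Hypothetical training rows: (some feature, churn label)
train = [(0.5, 1), (0.9, 1)] + [(x / 10, 0) for x in range(10)]
balanced = upsample_minority(train, label_index=1, rng=rng)
churners = sum(1 for r in balanced if r[1] == 1)
print(len(balanced), churners)  # 20 10
```

Note that in practice the resampling is applied to the training set only, so that the test set keeps the original class proportions and performance estimates stay honest.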
In the end, three different performance measures were used to assess the models: the weighted risk, the H-measure and the area under the precision-recall curve. It turned out that the super learner, an ensemble of logistic regression and random forest, performed best on the KPN data set. Furthermore, it turned out that all features should be included in the modelling process, since a model that contains only the most important features, as determined by a feature importance method, gives worse predictions.

To reference this document use: http://resolver.tudelft.nl/uuid:c061a03a-84a5-4ac3-ba22-c5b315f08edd
Part of collection: Student theses
Document type: master thesis
Rights: (c) 2017 de Groot, D.S.
Files: Thesis_Confidential.pdf (PDF, 382.26 KB)