Title: Churn prediction in telecommunication
Author: de Groot, D.S.
Contributor: van der Meulen, F.H. (mentor)
Faculty: Electrical Engineering, Mathematics and Computer Science
Department: Delft Institute of Applied Mathematics (DIAM)
Date: 2017-05-19

Abstract:
For telecommunication businesses it is important to retain as many customers as possible. For this purpose it is useful to predict which customers are likely to cancel their subscription. When a customer leaves the company by cancelling the subscription, this is called customer churn. In this thesis customer churn was predicted for KPN, a telecommunication business in the Netherlands. To predict customer churn for KPN, a data set with information about the network, subscriptions and calls to the call centres was used. The data set contains numerical, ordinal and categorical features and just over two million customers. One of the features is the response feature churn, which suffers from class imbalance: only 3.39 percent of the customers had churned. First, the data was explored with a cluster analysis, using partitioning around medoids with the Gower distance as dissimilarity measure. The clustering was visualised using t-distributed stochastic neighbour embedding. The cluster analysis showed that it is hard to distinguish churners from non-churners, since they are not clearly grouped. To find a model that predicts customer churn, several classification methods were compared: logistic regression, lasso, adaptive boosting, naive Bayes, random forest and a super learner. The classification methods were applied both to the data as is and to a version in which the training set was balanced by up-sampling. To assess which model performs best, an in-depth study of performance measures was carried out.
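As an illustration, the up-sampling used to balance the training set can be sketched as follows. This is a minimal sketch with hypothetical rows and a hypothetical helper name, not the thesis's actual implementation or the KPN data:

```python
import random

def upsample_minority(rows, label_index, rng):
    """Resample the minority class with replacement until both
    classes are the same size. `label_index` points at the 0/1
    churn label inside each row tuple."""
    pos = [r for r in rows if r[label_index] == 1]   # churners
    neg = [r for r in rows if r[label_index] == 0]   # non-churners
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw extra minority rows with replacement until sizes match
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

rng = random.Random(0)
# Hypothetical training rows: (some feature, churn label)
train = [(0.5, 1), (0.9, 1)] + [(x / 10, 0) for x in range(10)]
balanced = upsample_minority(train, label_index=1, rng=rng)
churners = sum(1 for r in balanced if r[1] == 1)
print(len(balanced), churners)  # 20 10
```

Note that in practice the resampling is applied to the training set only, so that the test set keeps the original class proportions and performance estimates stay honest.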
In the end, three different performance measures were used to assess the models: the weighted risk, the H-measure and the area under the precision-recall curve. It turned out that the super learner, an ensemble of logistic regression and random forest, performed best on the KPN data set. Furthermore, it turned out that all features should be included in the modelling process, since a model that contains only the most important features, as determined by a feature importance method, gives worse predictions.

To reference this document use: http://resolver.tudelft.nl/uuid:c061a03a-84a5-4ac3-ba22-c5b315f08edd
Part of collection: Student theses
Document type: master thesis
Rights: (c) 2017 de Groot, D.S.
Files: Thesis_Confidential.pdf (PDF, 382.26 KB)