Abstract-Customer churn is the business term used to describe the loss of clients or customers. Banks, telecom companies, ISPs, insurance firms, and similar businesses use customer churn analysis and the customer churn rate as key business metrics, because retaining an existing customer costs far less than acquiring a new one. Corporations have dedicated departments that attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients. Customer churn can be categorized into voluntary churn and involuntary churn. In voluntary churn, the customer decides to switch to another service provider, whereas in involuntary churn, the customer leaves the service due to relocation, death, etc. Businesses usually exclude involuntary churn from churn prediction models and focus on voluntary churn, because it usually arises from the company-customer relationship, over which the company has full control. Churn is usually measured as gross churn and net churn. Gross churn is calculated as the loss of previous customers and of the recurring revenue associated with them. Net churn is measured as gross churn combined with the addition of new, similar customers. It is often measured in terms of Recurring Monthly Revenue (RMR) in financial systems.
Predicting and preventing customer churn is becoming the primary focus of many enterprises. Every enterprise wants to retain each of its customers in order to maximize the profit and revenue it earns from them. With the introduction of business and management systems and the automation of operational workflows, corporations have gathered large amounts of customer and business data during their daily operations, which gives data mining techniques a good basis for prediction. Many data mining algorithms and models have emerged to address the problem of customer loss, and they have been widely used in this field over the past decades.
Many algorithms and models have been applied to the prediction of customer churn. The most common are Decision Tree, Artificial Neural Network, and Logistic Regression. In addition, other algorithms such as Bayesian Network, Support Vector Machine, Rough Set, and Survival Analysis have also been used.
In addition to algorithms and models, other techniques such as input variable selection, feature selection, and outlier detection have been applied to obtain better results from the above algorithms.
The first three models, i.e. Decision Tree, Artificial Neural Network, and Logistic Regression, have been applied maturely at multiple corporations. Each algorithm has been improved over multiple iterations and is now fairly stable. However, as business operations and activities grow, solving the problem of customer churn is becoming an increasingly complex challenge. This calls for new churn prediction models that are fast and robust and that can be trained and scored quickly on large amounts of data.
Jiayin and Yuanquan presented a step-by-step approach to selecting effective input variables for customer churn prediction models in the telecommunication industry. In this industry, a very large number of input variables is usually available for churn prediction. Some of these variables have a positive effect on the model, while others are redundant, and the redundant variables overload the churn prediction model. It is therefore better to select only the important features and to remove redundant, noisy, and less informative variables. In their study, they proposed using the area under the ROC curve (AUC), where ROC stands for Receiver Operating Characteristic, to measure the classifying ability of each variable, and then selecting the variables with the highest classifying ability. In addition, they proposed computing the mutual information among all selected variables and finally keeping the variables that have relatively low mutual information coefficients.
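The two-stage selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy data, and the thresholds `min_ability` and `max_mi` are all hypothetical, and the variables are assumed to be discrete so that mutual information can be estimated from simple counts.

```python
from collections import Counter
from math import log2


def auc(values, labels):
    """AUC of a single variable used directly as a churn score.

    Computed as the fraction of (churner, non-churner) pairs in which the
    churner has the larger value; ties count as one half.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classifying_ability(values, labels):
    """An AUC of 0.1 separates the classes as well as 0.9, only inverted."""
    a = auc(values, labels)
    return max(a, 1.0 - a)


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete variables."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def select_variables(data, labels, min_ability=0.7, max_mi=1.5):
    """Rank by classifying ability, then drop variables redundant with kept ones."""
    ranked = sorted(
        (name for name in data if classifying_ability(data[name], labels) >= min_ability),
        key=lambda name: classifying_ability(data[name], labels),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(mutual_information(data[name], data[k]) <= max_mi for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: 1 = churned, 0 = stayed.
churned = [1, 1, 1, 0, 0, 0, 0, 0]
customers = {
    "complaints":      [3, 2, 2, 0, 1, 0, 0, 1],
    "complaints_copy": [3, 2, 2, 0, 1, 0, 0, 1],  # redundant duplicate
    "tenure_bucket":   [0, 1, 0, 2, 2, 1, 2, 2],
    "noise":           [1, 0, 1, 0, 1, 0, 1, 0],
}
selected = select_variables(customers, churned)
print(selected)
```

In this toy run the noise variable is rejected in the first stage because its AUC is close to 0.5, and the duplicated variable is rejected in the second stage because it shares too much information with a variable that has already been kept.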
Huang and Kechadi proposed a new feature selection technique for churn prediction models. Their primary focus was the telecommunication industry, where the number of input variables (features) is very large. In that setting it is better to select the subset of features with the greatest ability to separate the target classes, because running an algorithm on all input variables consumes too much time and too many resources. The most commonly used feature selection techniques only judge whether an input feature as a whole helps to separate the classes. The approach proposed by Huang and Kechadi instead takes into account the relationship between a specific categorical value of a feature and a class when deciding whether to select or remove the feature.
Luo, Shoa and Lie proposed customer churn prediction using a Decision Tree for the Personal Handyphone System Service (PHSS), where the number of variables in the input dataset is very small. The Decision Tree is probably the most commonly used data mining algorithm. It is a predictive model that makes predictions through a classification process. It is drawn as an upside-down tree, with the root at the top and the leaves at the bottom. A decision tree is a representation of rules, which helps us understand why a record has been classified in a particular way, and these rules can be used to find records that fall into a specific category. In their work, they determined the best settings for the input dataset with respect to the time sub-period, the misclassification cost, and the sampling method. They concluded that a 10-day sub-period, a 1:5 misclassification cost, and random sampling are the best parameters for training a decision tree model when the number of input variables is very small.
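To illustrate how a misclassification cost ratio changes the rules a tree learns, the sketch below fits a one-level decision tree (a single threshold on one variable) by minimizing total misclassification cost. It assumes that the 1:5 ratio means a missed churner costs five times as much as a false alarm; that reading, the variable `days_inactive`, and the data are hypothetical and are not taken from the cited study.

```python
def leaf_decision(n_churn, n_stay, cost_false_alarm=1.0, cost_missed_churn=5.0):
    """Return (label, cost) for the cheaper way to label a leaf."""
    cost_if_stay = n_churn * cost_missed_churn   # every churner in the leaf is missed
    cost_if_churn = n_stay * cost_false_alarm    # every stayer in the leaf is a false alarm
    if cost_if_churn < cost_if_stay:
        return "churn", cost_if_churn
    return "stay", cost_if_stay


def best_split(values, labels, **costs):
    """One-level decision tree: pick the threshold with the lowest total cost."""
    best = None
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        left_label, left_cost = leaf_decision(sum(left), len(left) - sum(left), **costs)
        right_label, right_cost = leaf_decision(sum(right), len(right) - sum(right), **costs)
        total = left_cost + right_cost
        if best is None or total < best[0]:
            best = (total, threshold, left_label, right_label)
    return best


# Hypothetical toy data: days since last activity, and whether the customer churned.
days_inactive = [1, 2, 3, 5, 8, 12, 15, 20, 25, 30]
churned       = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

weighted = best_split(days_inactive, churned)  # 1:5 costs (assumed direction)
equal = best_split(days_inactive, churned, cost_false_alarm=1.0, cost_missed_churn=1.0)
print(weighted)  # (2.0, 8, 'stay', 'churn')
print(equal)     # (1.0, 20, 'stay', 'churn')
```

With equal costs the tree flags only customers inactive for more than 20 days and misses one churner. With the 1:5 weighting it lowers the threshold to 8 days, accepting two false alarms in order to catch every churner.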
Ming, Huili and Yuwei proposed a model for churn prediction using a Bayesian Network. The concept of the Bayesian Network was initially proposed by Judea Pearl (1986). It is a kind of graphical model used to represent the joint probability distribution over a set of variables. It provides a natural way to describe causal information, which can be used to discover potential relations in data. The approach has been used successfully in knowledge representation for expert systems, data mining, and machine learning. More recently, it has also been applied in areas of artificial intelligence including causal reasoning, uncertain knowledge representation, pattern recognition, and cluster analysis.
A Bayesian network consists of nodes that represent attributes, connected by edges. Because more than one attribute may determine another, the model relies on the theory of multivariate probability distributions. Moreover, different Bayesian networks have different structures, and concepts from graph theory such as tree, graph, and directed acyclic graph describe these structures clearly. Graph theory is therefore as important a theoretical foundation for Bayesian networks as probability theory. The results reported for customer churn prediction using Bayesian networks are very promising.
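As a small illustration of how a Bayesian network encodes a joint distribution, the sketch below defines a three-node directed acyclic graph in which two attributes jointly determine churn, and answers queries by enumeration. The node names and every probability in it are hypothetical values chosen for illustration; they are not taken from the cited study.

```python
from itertools import product

# Structure (a DAG): complaint -> churn <- short_tenure
P_COMPLAINT = {1: 0.2, 0: 0.8}        # P(complaint)
P_SHORT_TENURE = {1: 0.3, 0: 0.7}     # P(short_tenure)
P_CHURN_GIVEN = {                     # P(churn = 1 | complaint, short_tenure)
    (0, 0): 0.05,
    (0, 1): 0.15,
    (1, 0): 0.30,
    (1, 1): 0.60,
}


def joint(complaint, short_tenure, churn):
    """Joint probability as the product of each node's conditional table."""
    p_churn = P_CHURN_GIVEN[(complaint, short_tenure)]
    return (
        P_COMPLAINT[complaint]
        * P_SHORT_TENURE[short_tenure]
        * (p_churn if churn == 1 else 1.0 - p_churn)
    )


def posterior_churn(**evidence):
    """P(churn = 1 | evidence); unobserved attributes are summed out."""
    numerator = denominator = 0.0
    for c, t, ch in product((0, 1), repeat=3):
        world = {"complaint": c, "short_tenure": t, "churn": ch}
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(c, t, ch)
        denominator += p
        if ch == 1:
            numerator += p
    return numerator / denominator


print(posterior_churn())                             # prior churn probability
print(posterior_churn(complaint=1))                  # tenure unknown, summed out
print(posterior_churn(complaint=1, short_tenure=1))  # both attributes observed
```

The second query shows why such a model can still produce an answer when a record is incomplete: a missing attribute is marginalized rather than imputed. Enumeration is exponential in the number of nodes, so it is only practical for very small networks.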
Jiayin, Yangming, Yingying and Shuang proposed a new algorithm for churn prediction and called it TreeLogit. The algorithm is a combination of the ADTree and Logistic Regression models. Because it incorporates the advantages of both base algorithms, it is a very powerful tool for customer churn prediction and performs as well as the TreeNet® model, which won the top prize in the 2003 customer churn prediction contest.
The modeling process of TreeLogit starts by designing the customer's characteristic variables based on prior knowledge. The characteristic variables are then divided into m sub-vectors, and a decision tree is created for each sub-vector. Once the decision tree for each sub-vector is available, a logistic regression model is developed for each sub-vector. Finally, the accuracy and interpretability of the model are evaluated. If they are acceptable, the customer retention process is started; otherwise, the model is re-tuned for better results.
Jing and Xinghua, in their work on customer churn prediction, presented a model based on Support Vector Machines (SVM). Support Vector Machines were developed on the basis of statistical learning theory, which is regarded as the best theory for small-sample estimation and predictive learning. Vapnik began studying machine learning from finite samples in the 1960s, and a relatively complete theoretical system called statistical learning theory was established in the 1990s. After that, the Support Vector Machine was proposed as a new learning machine. SVM is built on the structural risk minimization principle, which aims to minimize the true error probability, and it is mainly used to solve pattern recognition problems. Because of its complete theoretical framework and its good results in practical applications, SVM has been widely valued in the machine learning field.
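The trade-off behind structural risk minimization can be made concrete with a linear soft-margin SVM, which minimizes a regularization term on the weights plus the average hinge loss on the training data. The sketch below trains such a model by plain subgradient descent. It is an illustration only and is not necessarily the formulation used by Jing and Xinghua: there is no kernel, the hyperparameters `lam`, `lr`, and `epochs` are arbitrary, and the two standardized features (say, drop in usage and number of complaints) are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise (lam/2)*||w||^2 + mean(max(0, 1 - y*(w.x + b))) by subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point lies inside the margin or is misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j]
                grad_b -= yi
        # Regularisation gradient is lam * w; hinge gradient is averaged over n.
        w = [wj - lr * (lam * wj + gj / n) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """+1 = predicted churner, -1 = predicted stayer."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical standardised features; label +1 = churned, -1 = stayed.
X = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
y = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(w, b, predictions)
```

Only the points with a margin below 1 contribute to each update, which mirrors the fact that the final separator is determined by the support vectors rather than by the whole sample.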
Xu E, Liangeshan Shao, Xuedong Gao and Zhai Baofeng introduced the Rough Set algorithm for customer churn prediction. Dengh Hu also studied the application of rough sets to customer churn prediction. According to them, Rough Set is a data analysis theory proposed by Z. Pawlak. Its main idea is to derive decision or classification rules through knowledge reduction while keeping the classification ability unchanged. The theory has some unique views, such as knowledge granularity, which make it especially suitable for data analysis. Rough set theory is built on a classification mechanism, and the partition of the space induced by an equivalence relation is regarded as knowledge. Generally speaking, it describes imprecise or uncertain knowledge in terms of knowledge that is already established. In this theory, knowledge is regarded as a kind of classification ability on data, and the objects in the universe are usually described by a decision table, a two-dimensional table in which each row represents an object and each column an attribute. The attributes consist of condition attributes and a decision attribute. The objects in the universe can be divided into decision classes with different decision attribute values according to their condition attributes. One of the core concepts in rough set theory is reduction, a process in which unimportant or irrelevant knowledge is deleted while the classification ability is kept unchanged. A decision table may have several reducts, and their intersection is defined as the core of the decision table. The attributes in the core are important because of their effect on classification.
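The notions of decision table, reduct, and core can be illustrated with a brute-force sketch. It treats a set of condition attributes as preserving the classification ability when it leaves the positive region unchanged, i.e. the set of objects whose equivalence class maps to a single decision value. The table, the attribute names, and the exhaustive search are hypothetical choices for illustration; practical rough set tools use heuristics, because enumerating all subsets is exponential in the number of attributes.

```python
from itertools import combinations


def positive_region(table, attrs, decision):
    """Objects whose equivalence class under `attrs` has a single decision value."""
    blocks = {}
    for i, row in enumerate(table):
        blocks.setdefault(tuple(row[a] for a in attrs), []).append(i)
    region = set()
    for members in blocks.values():
        if len({table[i][decision] for i in members}) == 1:
            region.update(members)
    return region


def reducts(table, conditions, decision):
    """All minimal attribute subsets that keep the positive region unchanged."""
    full = positive_region(table, conditions, decision)
    found = []
    for size in range(1, len(conditions) + 1):
        for subset in combinations(conditions, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # contains a smaller reduct, so it is not minimal
            if positive_region(table, subset, decision) == full:
                found.append(subset)
    return found


def core(all_reducts):
    """Attributes shared by every reduct."""
    return set.intersection(*(set(r) for r in all_reducts))


# Hypothetical decision table: one row per customer.
table = [
    {"plan": "basic",   "usage": "low",  "complaints": "yes", "churn": "yes"},
    {"plan": "basic",   "usage": "low",  "complaints": "no",  "churn": "no"},
    {"plan": "premium", "usage": "high", "complaints": "yes", "churn": "no"},
    {"plan": "premium", "usage": "high", "complaints": "no",  "churn": "no"},
]
conditions = ["plan", "usage", "complaints"]

all_reducts = reducts(table, conditions, "churn")
print(all_reducts)        # [('plan', 'complaints'), ('usage', 'complaints')]
print(core(all_reducts))  # {'complaints'}
```

In this table `plan` and `usage` carry the same information, so either can be dropped, whereas `complaints` appears in every reduct and therefore forms the core.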
Survival analysis is a statistical analysis method used to analyze and infer the life expectancy of living beings or products from data obtained in surveys or experiments. It combines the outcomes of events with the corresponding time spans in order to analyze a problem. It was initially used in medical science to study the influence of medicines on the life expectancy of research subjects. Survival time should be understood broadly, as the duration of some condition in nature, society, or a technical process. In this paper, the churn of a customer is regarded as the end of the customer's survival time. In the 1950s, statisticians began to study the reliability of industrial products, which advanced the development of survival analysis in both theory and application. The proportional hazards regression model, first proposed by Cox in 1972, is a commonly used survival analysis technique.
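Treating churn as the end of a customer's survival time can be illustrated with the Kaplan-Meier estimator, a simpler nonparametric technique than the Cox model named above, chosen here because it fits in a few lines. Customers who are still active at the end of the observation window are treated as censored: they count as being at risk up to their last observed month but never as churn events. The tenure data are hypothetical.

```python
def kaplan_meier(durations, churned):
    """Return [(t, S(t))]: estimated probability of staying beyond each churn time t."""
    curve, survival = [], 1.0
    churn_times = sorted({d for d, e in zip(durations, churned) if e})
    for t in churn_times:
        at_risk = sum(1 for d in durations if d >= t)  # still customers just before t
        events = sum(1 for d, e in zip(durations, churned) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: tenure in months, and whether the tenure ended in churn (1)
# or the customer was still active when observation stopped (0, censored).
tenure_months = [2, 3, 3, 5, 8, 8, 12, 12]
churned       = [1, 1, 0, 1, 1, 0, 0, 0]

for month, s in kaplan_meier(tenure_months, churned):
    print(month, round(s, 3))
```

For this data the estimated probability of a customer remaining beyond month 8 is 0.45, even though only four of the eight customers were observed to churn, because the censored customers leave the at-risk set without being counted as churners.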
Jiayin and Yuanquan proposed a very simple method for variable selection. The proposed method is effective and practical, but more systematic methods are available, which use advanced neural networks, induction algorithms, and rough sets.
Huang and Kechadi's idea of taking categorical values into account during feature selection is a good one. However, their concept is limited to categorical values and cannot be applied directly to continuous values. Continuous values must be discretized into categorical values before their feature selection concept can be applied, and this conversion from continuous to discrete may result in a loss of information.
Luo, Shoa and Lie selected the Decision Tree as their data mining algorithm for churn prediction; it is the simplest and most understandable algorithm for classification, and its simplicity also makes it the most widely used. However, decision trees have their own limitations. They are very unstable, and even a small change in the input variables, such as the addition of new ones, requires the complete tree to be rebuilt and retrained. In addition, the authors could also have examined how to enrich the input variables by adding new derived variables that might enhance the efficiency of the model.
The Bayesian network model of Ming, Huili and Yuwei has both advantages and shortcomings. It can produce good results even when the input datasets are incomplete. In addition, it can take the connections between variables into account when predicting churn, and it can incorporate prior knowledge. The algorithm can also effectively prevent overfitting. However, if the dataset is large, learning the structure of the Bayesian network becomes too difficult. This model is therefore not well suited to telecom, where the dataset is always very large.
The TreeLogit model of Jiayin, Yangming, Yingying and Shuang combines the advantages of both of its base algorithms, i.e. ADTree and logistic regression. It is therefore both data-driven and assumption-driven, and it is capable of analyzing objects with incomplete information. Moreover, its efficiency is not affected by poor-quality data, and it generates continuous output with relatively low complexity.
Jing and Xinghua used the Support Vector Machine algorithm for churn prediction. This algorithm works best when only a limited number of sample records is available. On the other hand, its theory is very complex and it has many variations, so it is difficult to find the version that best suits a given problem.
Multiple solutions are available for customer churn prediction, and each has its own advantages and disadvantages. A single solution might therefore not be the best for every organization. An organization may have to use a combination of algorithms and techniques to get the best churn prediction results.