The modeling process includes the following four major steps. Exploratory Data Analysis (EDA) – Exploratory data analysis was conducted to prepare the data for the survival analysis. A univariate frequency analysis was used to pinpoint value distributions, missing values, and outliers.
Variable transformations were applied to some numerical variables to reduce skewness, because transformations help improve the fit of a model to the data. Observations with outliers or other extreme values that should not enter the data mining analysis were filtered out. Filtering extreme values from the training data tends to produce better models because the parameter estimates are more stable. Missing values were not a major issue, except for the demographic variables; demographic variables with more than 20% missing values were eliminated. For observations with missing values, one choice is to use only complete observations, but that discards useful information from the variables that have nonmissing values. It may also bias the sample, since observations with missing values may have other things in common as well. Therefore, in this study, missing values were replaced using appropriate methods.
For interval variables, replacement values were calculated from random percentiles of the variable's distribution; that is, values were assigned based on the probability distribution of the nonmissing observations. Missing values for class variables were replaced with the most frequent value (the mode).
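The paper performs this imputation in SAS; as a minimal illustration of the two rules (random-percentile draws for interval variables, the mode for class variables), the following Python sketch uses made-up data and hypothetical variable names:

```python
import random
from collections import Counter

def impute_interval(values, rng=random.Random(0)):
    """Replace missing (None) numeric values by drawing from the empirical
    distribution of the nonmissing observations, which approximates
    assigning values from random percentiles of the distribution."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

def impute_class(values):
    """Replace missing (None) class values with the most frequent level."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [v if v is not None else mode for v in values]

# Hypothetical columns: monthly revenue (interval) and rate plan (class)
revenue = [30.0, None, 45.5, 52.0, None, 41.0]
plan = ["A", "B", None, "A", "A", None]
print(impute_interval(revenue))
print(impute_class(plan))  # missing entries become the mode, "A"
```

Drawing from the observed values directly keeps the imputed column's distribution close to the nonmissing one, which is the point of percentile-based (rather than mean) replacement.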
Variable reduction – Starting with 212 variables in the original data set, an initial univariate analysis using PROC FREQ of all categorical variables crossed with customer churn status (STATUS) was carried out to determine the statistically significant categorical variables to include in the next modeling step. All categorical variables with a chi-square (or t-test) p-value of 0.05 or less were kept. This step reduced the number of variables to 115 (&VARLIST1), including all the numerical variables and the categorical variables retained from step one.
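The screening idea behind this step can be sketched without SAS. The following Python example (with hypothetical counts, not the paper's data) computes the Pearson chi-square statistic for a binary variable crossed with churn status and keeps the variable when the p-value is at most 0.05; for 1 degree of freedom the p-value reduces to erfc(sqrt(x/2)):

```python
import math

def chi2_stat(table):
    """Pearson chi-square statistic for a 2x2 contingency table
    [[a, b], [c, d]] (rows: churned / retained, columns: variable levels)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_tot = [a + b, c + d]
    col_tot = [a + c, b + d]
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            exp = row_tot[i] * col_tot[j] / n   # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

def p_value_df1(stat):
    """Upper-tail p-value of chi-square with 1 df: P = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical counts: churned/retained by whether the customer has a contract
table = [[120, 380],    # churned:  contract, no contract
         [480, 1020]]   # retained: contract, no contract
stat = chi2_stat(table)
keep = p_value_df1(stat) <= 0.05  # keep the variable if significant
print(round(stat, 2), keep)
```

With real data, each of the 212 candidate variables would be screened the same way against STATUS, and only the significant ones carried into the stepwise PHREG step.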
The next step used PROC PHREG to further reduce the number of variables. A stepwise selection method produced a final model with statistically significant effects of 29 explanatory variables on customer churn over time.
PROC PHREG DATA = SASOUT2.ALL2 OUTEST = SASOUT2.BETA;
MODEL DUR*STATUS(0) = &VARLIST1 / SELECTION = STEPWISE
SLENTRY = 0.0025 SLSTAY = 0.0025 DETAILS;
RUN;
Model Estimation – With only 29 explanatory variables, the final data set has a reasonable number of variables for survival analysis. Before applying survival analysis procedures to the final data set, the customer survival function and hazard function were estimated using the following code. The purpose of estimating the customer survival and hazard functions is to learn the characteristics of the customer churn hazard. From the shape of the hazard function, customer churn in this study demonstrates a typical log-normal hazard. As previously discussed, since the shape of the survival distribution and hazard function is known, PROC LIFEREG produces more efficient estimates (with smaller standard errors) than PROC PHREG does.
PROC LIFETEST DATA = SASOUT2.ALL3 OUTSURV = SASOUT2.OUTSURV
METHOD = LIFE PLOT = (S, H) WIDTH = 1 GRAPHICS;
TIME DUR*STATUS(0);
RUN;
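The life-table (actuarial) estimates that PROC LIFETEST METHOD=LIFE WIDTH=1 produces can be sketched directly. The following Python example, using a tiny made-up sample of tenures, computes the survival function at each interval start and the within-interval hazard, adjusting the risk set for censoring:

```python
def life_table(durations, status, width=1.0, n_intervals=5):
    """Actuarial (life-table) estimates of survival and hazard, analogous
    to PROC LIFETEST METHOD=LIFE. status: 1 = churned (event), 0 = censored.
    Returns (interval start, survival at start, hazard) per interval."""
    out, surv = [], 1.0
    for j in range(n_intervals):
        lo, hi = j * width, (j + 1) * width
        at_risk = sum(1 for t in durations if t >= lo)
        events = sum(1 for t, s in zip(durations, status) if lo <= t < hi and s == 1)
        censored = sum(1 for t, s in zip(durations, status) if lo <= t < hi and s == 0)
        n_eff = at_risk - censored / 2.0  # censored cases count half
        hazard = events / (width * (n_eff - events / 2.0)) if n_eff > events / 2.0 else 0.0
        out.append((lo, surv, hazard))
        surv *= 1.0 - (events / n_eff if n_eff > 0 else 0.0)
    return out

# Tiny hypothetical sample: months of tenure; 1 = churned, 0 = still active
dur = [0.5, 1.2, 1.7, 2.3, 2.8, 3.5, 4.2, 4.9]
st = [1, 1, 0, 1, 1, 0, 1, 0]
for lo, s, h in life_table(dur, st, width=1.0, n_intervals=5):
    print(f"[{lo:.0f}, {lo + 1:.0f}): S = {s:.3f}  h = {h:.3f}")
```

Plotting the hazard column over many intervals is what reveals the non-monotone, log-normal-like shape the paper refers to.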
The final step is to estimate customer churn. PROC LIFEREG was used to calculate each customer's survival probability. At this step the final data set was divided 50/50 into two data sets: a model data set and a validation data set. The model data set is used to fit the model, and the validation data set is used to score the survival probability for each customer. A variable USE distinguishes the model data set (USE = 0) from the validation data set (USE = 1). In the validation data set, both DUR and STATUS were set to missing so that those cases would not be used in model estimation.
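For intuition about this scoring step: a log-normal model of the kind PROC LIFEREG fits implies the survival probability S(t) = 1 - Φ((ln t - x'β)/σ), which can be evaluated for any customer once the linear predictor is known. The sketch below uses illustrative coefficients, not the paper's estimates:

```python
import math

def lognormal_survival(t, xb, sigma):
    """Survival probability S(t) = 1 - Phi((ln t - x'beta) / sigma)
    implied by a log-normal accelerated-failure-time model."""
    z = (math.log(t) - xb) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # 1 - Phi(z) via erfc

# Hypothetical fitted coefficients (illustrative only)
intercept, beta_revenue, sigma = 3.2, 0.015, 1.1
monthly_revenue = 60.0
xb = intercept + beta_revenue * monthly_revenue  # linear predictor x'beta

# Probability this customer survives past 12 and 24 months of tenure
for t in (12.0, 24.0):
    print(t, round(lognormal_survival(t, xb, sigma), 3))
```

Scoring the validation half of the data amounts to evaluating this function at chosen horizons for every customer's covariates, which is why DUR and STATUS can be missing there: only x'β is needed.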
Source: Jun Xiang Lu, Ph.D. Predicting Customer Churn in the Telecommunications Industry – An Application of Survival Analysis Modeling Using SAS. SAS User Group International (SUGI27) Online Proceedings, 2002, Paper No. 114-27.
Predicting Customer Churn in the Telecommunications Industry – An Application of Survival Analysis Modeling Using SAS
Jun Xiang Lu, Ph.D. Sprint Communications Company
Overland Park, Kansas
Abstract
Traditional statistical methods (e.g., logistic regression, decision trees) are very successful at predicting customer churn. However, these methods can hardly predict when customers will churn, or how long customers will stay. The purpose of this study is to apply survival analysis techniques to predict customer churn using data from a telecommunications company. This study will help telecom companies understand customer churn risk and churn hazard in a time-based manner by predicting which customers will churn and when. The results can help telecom companies optimize their customer retention and/or treatment resources in their efforts to reduce customer churn.
Introduction
In the telecommunications industry, customers can choose among multiple service providers and actively exercise their right to switch from one provider to another. In this fiercely competitive market, customers demand tailored products and better services at lower prices, while service providers constantly focus on acquisition as their business goal. Given that the telecom industry experiences an average churn rate of 30-35%, and that acquiring a new customer costs 5-10 times as much as retaining an existing one, retaining highly profitable customers is the number one business headache for many incumbent operators. Many telecom companies deploy retention strategies, coordinating programs and processes to keep customers longer by offering tailored products and services. With retention strategies in place, many companies have begun to include churn reduction as one of their business goals.
To support telecom companies in managing churn reduction, we need not only to predict which customers are at high risk of churn, but also to know when these high-risk customers will churn. With that knowledge, telecom companies can optimize their marketing resources to prevent as many churns as possible. In other words, if telecom companies know which customers are at high risk of churn and when they will churn, they can design customized and timely communication programs for those customers.
Traditional statistical methods (e.g., logistic regression, decision trees) are very successful at predicting customer churn. However, these methods can hardly predict when customers will churn, or how long customers will stay. Survival analysis, by contrast, was designed from the start to handle survival data, and is therefore an effective and powerful tool for predicting customer churn.
Objectives
The objectives of this churn study are twofold. The first is to estimate the customer survival function and customer hazard function to gain knowledge of customer churn over the customer's tenure. The second is to demonstrate how survival analysis techniques are used to identify which customers are at high risk of churn and when they will churn.
Definitions and Exclusions
This section clarifies some important concepts used in this study and notes what is excluded.
Churn – In the telecom industry, the broad definition of customer churn is the cancellation of a customer's telecom service. This includes both service-provider-initiated churn and customer-initiated churn. An example of service-provider-initiated churn is a customer's account being closed for nonpayment. Customer-initiated churn is more complicated, and its causes vary. This study examines only customer-initiated churn, defined by a series of cancellation reason codes. Examples of reason codes are: unacceptable call quality, a competitor's more favorable pricing plan, misinformation given by sales, customer expectations not being met, billing problems, moving, changes in business, and so on.
High-value customers – Only customers who have received at least three monthly invoices are counted. High-value customers are those with average monthly revenue of $x or more over the past three months. If a customer's first invoice covers fewer than 30 days of service, that month's revenue is prorated to a full month.
Scale – This study addresses customer churn at the account level.
Exclusions – This study does not distinguish between domestic and international customers, although separating international churn from domestic churn would in fact be worthwhile. In addition, this study excludes employee accounts, since churn on employee accounts is not so much a problem as a prerogative of the business.
Survival Analysis and Customer Churn
Survival analysis is a statistical method for studying the occurrence and timing of events. From its beginnings, survival analysis was designed for longitudinal data on the occurrence of events. Tracking customer churn is a good example of survival data. Survival data have two common features that are difficult to handle with traditional statistical methods: censoring and time-dependent covariates.
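Censoring is easiest to see in how churn records are encoded. A customer who is still active at the end of the study window contributes an observed tenure but no event; following the paper's DUR/STATUS convention, a minimal sketch (with a hypothetical helper and study window) looks like this:

```python
# DUR is observed tenure in months; STATUS is 1 if the customer churned,
# 0 if the observation is right-censored (still active at study end).
STUDY_END = 24.0  # hypothetical end of the observation window

def to_survival_record(start_month, churn_month):
    """churn_month is None for customers still active at STUDY_END."""
    if churn_month is None or churn_month > STUDY_END:
        return (STUDY_END - start_month, 0)  # censored: tenure observed, no event
    return (churn_month - start_month, 1)    # churn event observed

print(to_survival_record(0.0, 7.5))   # churned after 7.5 months -> (7.5, 1)
print(to_survival_record(6.0, None))  # still active at month 24 -> (18.0, 0)
```

Traditional regression would have to either drop the censored customers or treat their tenure as complete; survival methods use both kinds of record correctly.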