• 【大数据部落】 Predicting telecom customer churn in R with a k-nearest-neighbors (KNN) model


    Original link: http://tecdat.cn/?p=5521

    Data background

    A telephone company wants to determine which customer characteristics are useful for predicting churn, that is, which customers will leave its service.

    The data set is Churn. The fields are as follows:

    state: discrete
    account length: continuous
    area code: continuous
    phone number: discrete
    international plan: discrete
    voice mail plan: discrete
    number vmail messages: continuous
    total day minutes: continuous
    total day calls: continuous
    total day charge: continuous
    total eve minutes: continuous
    total eve calls: continuous
    total eve charge: continuous
    total night minutes: continuous
    total night calls: continuous
    total night charge: continuous
    total intl minutes: continuous
    total intl calls: continuous
    total intl charge: continuous
    number customer service calls: continuous
    churn: discrete


    Data Preparation and Exploration 

    Overview of the data:
    
    ##      state      account.length    area.code        phone.number 
    ##  WV     : 158   Min.   :  1.0   Min.   :408.0    327-1058:   1  
    ##  MN     : 125   1st Qu.: 73.0   1st Qu.:408.0    327-1319:   1  
    ##  AL     : 124   Median :100.0   Median :415.0    327-2040:   1  
    ##  ID     : 119   Mean   :100.3   Mean   :436.9    327-2475:   1  
    ##  VA     : 118   3rd Qu.:127.0   3rd Qu.:415.0    327-3053:   1  
    ##  OH     : 116   Max.   :243.0   Max.   :510.0    327-3587:   1  
    ##  (Other):4240                                   (Other)  :4994  
    ##  international.plan voice.mail.plan number.vmail.messages
    ##   no :4527           no :3677       Min.   : 0.000       
    ##   yes: 473           yes:1323       1st Qu.: 0.000       
    ##                                     Median : 0.000       
    ##                                     Mean   : 7.755       
    ##                                     3rd Qu.:17.000       
    ##                                     Max.   :52.000       
    ##                                                          
    ##  total.day.minutes total.day.calls total.day.charge total.eve.minutes
    ##  Min.   :  0.0     Min.   :  0     Min.   : 0.00    Min.   :  0.0    
    ##  1st Qu.:143.7     1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4    
    ##  Median :180.1     Median :100     Median :30.62    Median :201.0    
    ##  Mean   :180.3     Mean   :100     Mean   :30.65    Mean   :200.6    
    ##  3rd Qu.:216.2     3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1    
    ##  Max.   :351.5     Max.   :165     Max.   :59.76    Max.   :363.7    
    ##                                                                      
    ##  total.eve.calls total.eve.charge total.night.minutes total.night.calls
    ##  Min.   :  0.0   Min.   : 0.00    Min.   :  0.0       Min.   :  0.00   
    ##  1st Qu.: 87.0   1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00   
    ##  Median :100.0   Median :17.09    Median :200.4       Median :100.00   
    ##  Mean   :100.2   Mean   :17.05    Mean   :200.4       Mean   : 99.92   
    ##  3rd Qu.:114.0   3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00   
    ##  Max.   :170.0   Max.   :30.91    Max.   :395.0       Max.   :175.00   
    ##                                                                        
    ##  total.night.charge total.intl.minutes total.intl.calls total.intl.charge
    ##  Min.   : 0.000     Min.   : 0.00      Min.   : 0.000   Min.   :0.000    
    ##  1st Qu.: 7.510     1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300    
    ##  Median : 9.020     Median :10.30      Median : 4.000   Median :2.780    
    ##  Mean   : 9.018     Mean   :10.26      Mean   : 4.435   Mean   :2.771    
    ##  3rd Qu.:10.560     3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240    
    ##  Max.   :17.770     Max.   :20.00      Max.   :20.000   Max.   :5.400    
    ##                                                                          
    ##  number.customer.service.calls     churn     
    ##  Min.   :0.00                   False.:4293  
    ##  1st Qu.:1.00                   True. : 707  
    ##  Median :1.00                                
    ##  Mean   :1.57                                
    ##  3rd Qu.:2.00                                
    ##  Max.   :9.00                                
    ## 
    
    

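    The overview shows no missing values. It also shows that phone number and area code carry no useful information for prediction, so they can be dropped.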
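    The check and the column removal described above can be sketched in base R as follows. The data frame churn_df is a hypothetical three-row stand-in with made-up values; only the column names come from the summary output.

```r
# Hypothetical stand-in for the Churn data frame (values are made up).
churn_df <- data.frame(
  state          = c("WV", "MN", "AL"),
  account.length = c(128, 107, 137),
  area.code      = c(415, 415, 408),
  phone.number   = c("327-1058", "327-1319", "327-2040"),
  churn          = c("False.", "False.", "True."),
  stringsAsFactors = TRUE
)

# Count missing values per column; all zeros means nothing to impute.
na_counts <- colSums(is.na(churn_df))
print(na_counts)

# Drop the columns judged to have no predictive value.
drop_cols <- c("phone.number", "area.code")
churn_clean <- churn_df[, !(names(churn_df) %in% drop_cols), drop = FALSE]
print(names(churn_clean))
```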

    Examine the variables graphically 

       

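    The distribution of churn shows that customers who did not churn far outnumber those who did (4293 versus 707 of 5000), so non-churners make up the large majority of the sample.

    The distributions also show that, apart from emailcode (presumably number.vmail.messages) and area.code, the numeric variables are approximately normally distributed.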
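    As a small illustration of the class imbalance, the churn column can be rebuilt from the two counts printed in the summary (4293 "False." and 707 "True.") and tabulated. This is only a sketch that reuses those two numbers, not the original plotting code.

```r
# Rebuild the churn column from the counts shown in the summary output.
churn <- factor(rep(c("False.", "True."), times = c(4293, 707)))

counts <- table(churn)        # absolute class counts
shares <- prop.table(counts)  # class proportions
print(counts)
print(round(shares, 4))
```

    With roughly 14% churners, overall accuracy should be read alongside per-class results.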

    ##  account.length    area.code     number.vmail.messages total.day.minutes
    ##  Min.   :  1.0   Min.   :408.0   Min.   : 0.000        Min.   :  0.0    
    ##  1st Qu.: 73.0   1st Qu.:408.0   1st Qu.: 0.000        1st Qu.:143.7    
    ##  Median :100.0   Median :415.0   Median : 0.000        Median :180.1    
    ##  Mean   :100.3   Mean   :436.9   Mean   : 7.755        Mean   :180.3    
    ##  3rd Qu.:127.0   3rd Qu.:415.0   3rd Qu.:17.000        3rd Qu.:216.2    
    ##  Max.   :243.0   Max.   :510.0   Max.   :52.000        Max.   :351.5    
    ##  total.day.calls total.day.charge total.eve.minutes total.eve.calls
    ##  Min.   :  0     Min.   : 0.00    Min.   :  0.0     Min.   :  0.0  
    ##  1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4     1st Qu.: 87.0  
    ##  Median :100     Median :30.62    Median :201.0     Median :100.0  
    ##  Mean   :100     Mean   :30.65    Mean   :200.6     Mean   :100.2  
    ##  3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1     3rd Qu.:114.0  
    ##  Max.   :165     Max.   :59.76    Max.   :363.7     Max.   :170.0  
    ##  total.eve.charge total.night.minutes total.night.calls total.night.charge
    ##  Min.   : 0.00    Min.   :  0.0       Min.   :  0.00    Min.   : 0.000    
    ##  1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00    1st Qu.: 7.510    
    ##  Median :17.09    Median :200.4       Median :100.00    Median : 9.020    
    ##  Mean   :17.05    Mean   :200.4       Mean   : 99.92    Mean   : 9.018    
    ##  3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00    3rd Qu.:10.560    
    ##  Max.   :30.91    Max.   :395.0       Max.   :175.00    Max.   :17.770    
    ##  total.intl.minutes total.intl.calls total.intl.charge
    ##  Min.   : 0.00      Min.   : 0.000   Min.   :0.000    
    ##  1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300    
    ##  Median :10.30      Median : 4.000   Median :2.780    
    ##  Mean   :10.26      Mean   : 4.435   Mean   :2.771    
    ##  3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240    
    ##  Max.   :20.00      Max.   :20.000   Max.   :5.400    
    ##  number.customer.service.calls
    ##  Min.   :0.00                 
    ##  1st Qu.:1.00                 
    ##  Median :1.00                 
    ##  Mean   :1.57                 
    ##  3rd Qu.:2.00                 
    ##  Max.   :9.00

    Relationships between variables


    The result shows a significant positive linear relationship between the two variables.



    Using the statistics node, report

    ##                               account.length    area.code
    ## account.length                  1.0000000000 -0.018054187
    ## area.code                      -0.0180541874  1.000000000
    ## number.vmail.messages          -0.0145746663 -0.003398983
    ## total.day.minutes              -0.0010174908 -0.019118245
    ## total.day.calls                 0.0282402279 -0.019313854
    ## total.day.charge               -0.0010191980 -0.019119256
    ## total.eve.minutes              -0.0095913331  0.007097877
    ## total.eve.calls                 0.0091425790 -0.012299947
    ## total.eve.charge               -0.0095873958  0.007114130
    ## total.night.minutes             0.0006679112  0.002083626
    ## total.night.calls              -0.0078254785  0.014656846
    ## total.night.charge              0.0006558937  0.002070264
    ## total.intl.minutes              0.0012908394 -0.004153729
    ## total.intl.calls                0.0142772733 -0.013623309
    ## total.intl.charge               0.0012918112 -0.004219099
    ## number.customer.service.calls  -0.0014447918  0.020920513
    ##                               number.vmail.messages total.day.minutes
    ## account.length                        -0.0145746663      -0.001017491
    ## area.code                             -0.0033989831      -0.019118245
    ## number.vmail.messages                  1.0000000000       0.005381376
    ## total.day.minutes                      0.0053813760       1.000000000
    ## total.day.calls                        0.0008831280       0.001935149
    ## total.day.charge                       0.0053767959       0.999999951
    ## total.eve.minutes                      0.0194901208      -0.010750427
    ## total.eve.calls                       -0.0039543728       0.008128130
    ## total.eve.charge                       0.0194959757      -0.010760022
    ## total.night.minutes                    0.0055413838       0.011798660
    ## total.night.calls                      0.0026762202       0.004236100
    ## total.night.charge                     0.0055349281       0.011782533
    ## total.intl.minutes                     0.0024627018      -0.019485746
    ## total.intl.calls                       0.0001243302      -0.001303123
    ## total.intl.charge                      0.0025051773      -0.019414797
    ## number.customer.service.calls         -0.0070856427       0.002732576
    ##                               total.day.calls total.day.charge
    ## account.length                   0.0282402279     -0.001019198
    ## area.code                       -0.0193138545     -0.019119256
    ## number.vmail.messages            0.0008831280      0.005376796
    ## total.day.minutes                0.0019351487      0.999999951
    ## total.day.calls                  1.0000000000      0.001935884
    ## total.day.charge                 0.0019358844      1.000000000
    ## total.eve.minutes               -0.0006994115     -0.010747297
    ## total.eve.calls                  0.0037541787      0.008129319
    ## total.eve.charge                -0.0006952217     -0.010756893
    ## total.night.minutes              0.0028044650      0.011801434
    ## total.night.calls               -0.0083083467      0.004234934
    ## total.night.charge               0.0028018169      0.011785301
    ## total.intl.minutes               0.0130972198     -0.019489700
    ## total.intl.calls                 0.0108928533     -0.001306635
    ## total.intl.charge                0.0131613976     -0.019418755
    ## number.customer.service.calls   -0.0107394951      0.002726370
    ##                               total.eve.minutes total.eve.calls
    ## account.length                    -0.0095913331     0.009142579
    ## area.code                          0.0070978766    -0.012299947
    ## number.vmail.messages              0.0194901208    -0.003954373
    ## total.day.minutes                 -0.0107504274     0.008128130
    ## total.day.calls                   -0.0006994115     0.003754179
    ## total.day.charge                  -0.0107472968     0.008129319
    ## total.eve.minutes                  1.0000000000     0.002763019
    ## total.eve.calls                    0.0027630194     1.000000000
    ## total.eve.charge                   0.9999997749     0.002778097
    ## total.night.minutes               -0.0166391160     0.001781411
    ## total.night.calls                  0.0134202163    -0.013682341
    ## total.night.charge                -0.0166420421     0.001799380
    ## total.intl.minutes                 0.0001365487    -0.007458458
    ## total.intl.calls                   0.0083881559     0.005574500
    ## total.intl.charge                  0.0001593155    -0.007507151
    ## number.customer.service.calls     -0.0138234228     0.006234831
    ##                               total.eve.charge total.night.minutes
    ## account.length                   -0.0095873958        0.0006679112
    ## area.code                         0.0071141298        0.0020836263
    ## number.vmail.messages             0.0194959757        0.0055413838
    ## total.day.minutes                -0.0107600217        0.0117986600
    ## total.day.calls                  -0.0006952217        0.0028044650
    ## total.day.charge                 -0.0107568931        0.0118014339
    ## total.eve.minutes                 0.9999997749       -0.0166391160
    ## total.eve.calls                   0.0027780971        0.0017814106
    ## total.eve.charge                  1.0000000000       -0.0166489191
    ## total.night.minutes              -0.0166489191        1.0000000000
    ## total.night.calls                 0.0134220174        0.0269718182
    ## total.night.charge               -0.0166518367        0.9999992072
    ## total.intl.minutes                0.0001320238       -0.0067209669
    ## total.intl.calls                  0.0083930603       -0.0172140162
    ## total.intl.charge                 0.0001547783       -0.0066545873
    ## number.customer.service.calls    -0.0138363623       -0.0085325365
     
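    Keeping highly correlated variables in the model may cause multicollinearity. Each minutes/charge pair is almost perfectly correlated (for example, 0.99999995 between total.day.minutes and total.day.charge), so one variable of each such pair should be removed.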
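    One way to find and drop such variables is sketched below in base R. The data are simulated, the 0.9 cutoff is an assumed threshold, and the rate of 0.17 per minute is only chosen to be consistent with the summary maxima (351.5 minutes, 59.76 charge). None of this is taken from the original analysis code.

```r
set.seed(1)
n <- 200
minutes <- runif(n, 0, 350)

# Simulated numeric columns; charge is a fixed rate times minutes.
num_df <- data.frame(
  total.day.minutes = minutes,
  total.day.charge  = 0.17 * minutes,
  total.day.calls   = rpois(n, 100)
)

cor_mat <- cor(num_df)

# List variable pairs whose absolute correlation exceeds the cutoff,
# looking only at the upper triangle so each pair appears once.
cutoff <- 0.9  # assumed threshold
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(pairs)

# Keep the first variable of each pair and drop the second.
reduced <- num_df[, !(names(num_df) %in% pairs$var2), drop = FALSE]
print(names(reduced))
```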

    Data Manipulation

     
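    The result shows some correlation between total.day.calls and total.day.charge.
    In particular, the relationship is negative among customers without a voice mail plan (voice.mail.plan = no).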

     Discretize (make categorical) a relevant numeric variable  

    The variable is discretized.
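    A minimal sketch of such a discretization with cut() follows. The cut points 120 and 240 and the level name "long" are hypothetical; the source only shows the level names medium and short through the coefficient names total.day.minutes1medium and total.day.minutes1short.

```r
# Illustrative values only (not rows from the Churn data).
total.day.minutes <- c(0, 95.5, 143.7, 180.1, 216.2, 351.5)

# Bin into three ordered ranges; the break points are assumptions.
total.day.minutes1 <- cut(
  total.day.minutes,
  breaks = c(-Inf, 120, 240, Inf),
  labels = c("short", "medium", "long")
)

# Put "long" first so it acts as the reference level in a model,
# leaving "medium" and "short" as the reported contrasts.
total.day.minutes1 <- factor(total.day.minutes1,
                             levels = c("long", "medium", "short"))
print(table(total.day.minutes1))
```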

     construct a distribution of the variable with a churn overlay 

    construct a histogram of the variable with a churn overlay

     Find a pair of numeric variables which are interesting with respect to churn. 

     
    The result shows some correlation between total.day.calls and total.day.charge, most visibly among customers with churn = no.

    Model Building

    ##                                 Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)                    0.3082150  0.0735760   4.189 2.85e-05 ***
    ## stateAL                        0.0151188  0.0462343   0.327 0.743680    
    ## stateAR                        0.0894792  0.0490897   1.823 0.068399 .  
    ## stateAZ                        0.0329566  0.0494195   0.667 0.504883    
    ## stateCA                        0.1951511  0.0567439   3.439 0.000588 ***
    ## international.plan yes         0.3059341  0.0151677  20.170  < 2e-16 ***
    ## voice.mail.plan yes           -0.1375056  0.0337533  -4.074 4.70e-05 ***
    ## number.vmail.messages          0.0017068  0.0010988   1.553 0.120402    
    ## total.day.minutes              0.3796323  0.2629027   1.444 0.148802    
    ## total.day.calls                0.0002191  0.0002235   0.981 0.326781    
    ## total.day.charge              -2.2207671  1.5464583  -1.436 0.151056    
    ## total.eve.minutes              0.0288233  0.1307496   0.220 0.825533    
    ## total.eve.calls               -0.0001585  0.0002238  -0.708 0.478915    
    ## total.eve.charge              -0.3316041  1.5382391  -0.216 0.829329    
    ## total.night.minutes            0.0083224  0.0695916   0.120 0.904814    
    ## total.night.calls             -0.0001824  0.0002225  -0.820 0.412290    
    ## total.night.charge            -0.1760782  1.5464674  -0.114 0.909355    
    ## total.intl.minutes            -0.0104679  0.4192270  -0.025 0.980080    
    ## total.intl.calls              -0.0063448  0.0018062  -3.513 0.000447 ***
    ## total.intl.charge              0.0676460  1.5528267   0.044 0.965254    
    ## number.customer.service.calls  0.0566474  0.0033945  16.688  < 2e-16 ***
    ## total.day.minutes1medium       0.0502681  0.0160228   3.137 0.001715 ** 
    ## total.day.minutes1short        0.2404020  0.0322293   7.459 1.02e-13 ***
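    The results show that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium, and total.day.minutes1short have significant effects. The table also marks international.plan and voice.mail.plan as highly significant.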
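    The "t value" column suggests that the table comes from an ordinary linear model fitted to a 0/1 churn indicator, although the original call is not shown. Under that assumption, a coefficient table of the same shape can be produced as sketched below. The data are simulated and the effect sizes are made up.

```r
set.seed(42)
n <- 500

# Simulated predictors; names follow the article, values are made up.
toy <- data.frame(
  international.plan = factor(sample(c("no", "yes"), n, replace = TRUE,
                                     prob = c(0.9, 0.1))),
  number.customer.service.calls = rpois(n, 1.5)
)

# Simulated churn indicator whose probability rises with both predictors.
p <- plogis(-3 + 0.5 * toy$number.customer.service.calls +
              1.5 * (toy$international.plan == "yes"))
toy$churn <- rbinom(n, 1, p)

# Linear model on the 0/1 outcome (assumed to mirror the article's model).
fit <- lm(churn ~ international.plan + number.customer.service.calls,
          data = toy)
coefs <- summary(fit)$coefficients
print(round(coefs, 4))

# Terms whose p-value falls below 0.05.
print(rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05])
```

    For a binary outcome, glm(..., family = binomial) is the more common choice; it reports z values instead of t values.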

    Use K-Nearest-Neighbors (K-NN) algorithm to develop a model for predicting Churn 

    ##         Direction.2005
    ## knn.pred   1   2
    ##        1 760  97
    ##        2 100  43
    
    
     [1] 0.803
     
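    A confusion matrix is a tabulation tool used mainly in supervised learning; in unsupervised learning it is usually called a matching matrix. In the tables shown here, each row is a predicted class (knn.pred) and each column is an actual class.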
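    The reported accuracy can be recomputed from the first confusion matrix above. The sketch assumes that class 2 corresponds to churn = True., which the output itself does not state.

```r
# First confusion matrix from the output above:
# rows are predictions (knn.pred), columns are actual classes.
conf_mat <- matrix(c(760, 100, 97, 43), nrow = 2,
                   dimnames = list(knn.pred = c("1", "2"),
                                   actual   = c("1", "2")))

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)  # prints [1] 0.803

# Share of actual class-2 cases that were predicted as class 2.
recall_class2 <- conf_mat["2", "2"] / sum(conf_mat[, "2"])
print(round(recall_class2, 3))
```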
    ##         Direction.2005
    ## knn.pred   1   2
    ##        1 827 104
    ##        2  33  36
     
    
    
     [1] 0.863
    On the test set, the accuracy reaches about 86% (0.863).
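    The original KNN code is not shown. A minimal sketch of the workflow (split, fit, confusion matrix, accuracy) using knn() from the class package might look like the following. The simulated data, the 70/30 split, and k = 5 are assumptions for illustration.

```r
library(class)  # provides knn()

set.seed(123)
n <- 200

# Hypothetical two-feature data set with two well-separated classes.
x <- rbind(
  matrix(rnorm(n, mean = 0), ncol = 2),
  matrix(rnorm(n, mean = 4), ncol = 2)
)
y <- factor(rep(c("False.", "True."), each = n / 2))

# Standardize features so that no single variable dominates the distance.
x <- scale(x)

# Random 70/30 train/test split (assumed proportions).
train_idx <- sample(nrow(x), size = 0.7 * nrow(x))

knn.pred <- knn(train = x[train_idx, ], test = x[-train_idx, ],
                cl = y[train_idx], k = 5)

# Rows are predictions, columns are actual classes.
conf_mat <- table(knn.pred, actual = y[-train_idx])
print(conf_mat)

accuracy <- mean(knn.pred == y[-train_idx])
print(accuracy)
```

    On the real data, trying several values of k and comparing test accuracy is a common way to choose it.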

    Findings  

    We find some correlation between total.day.calls and total.day.charge, most visibly among customers with churn = no. The regression indicates that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium, and total.day.minutes1short have significant effects. For the KNN model, accuracy is about 80% on the training set and about 86% on the test set, which suggests that the model predicts churn reasonably well.


  • Original address: https://www.cnblogs.com/tecdat/p/11327631.html