• R语言学习笔记之十


    摘要: 仅用于记录R语言学习过程:

    内容提要:

    描述性统计;t检验;数据转换;方差分析;卡方检验;回归分析与模型诊断;生存分析;COX回归

    写在正文前的话,关于基础知识,此篇为终结篇,笔记来自医学方的课程,仅用于学习R的过程。

    正文:

      描述性统计

    n  如何去生成table1 用table()函数,快速汇总频数

    u  生成四格表:table(行名,列名)

    > table(tips$sex,tips$smoker)

          No Yes

    Female 54  33

          Male   97  60

    addmargins()函数:对生成的table表格进行计算

    > table(esoph$agegp,esoph$ncases)

          

             0  1  2  3  4  5  6  8  9 17

      25-34 14  1  0  0  0  0  0  0  0  0

      35-44 10  2  2  1  0  0  0  0  0  0

      45-54  3  2  2  2  3  2  2  0  0  0

      55-64  0  0  2  4  3  2  2  1  2  0

      65-74  1  4  2  2  2  2  1  0  0  1

      75+    1  7  3  0  0  0  0  0  0  0

    > tt <- table(esoph$agegp,esoph$ncases)

    > addmargins(tt,margin = c(1,2))  # margin 1表示行,2表示列

          

             0  1  2  3  4  5  6  8  9 17 Sum

      25-34 14  1  0  0  0  0  0  0  0  0  15

      35-44 10  2  2  1  0  0  0  0  0  0  15

      45-54  3  2  2  2  3  2  2  0  0  0  16

      55-64  0  0  2  4  3  2  2  1  2  0  16

      65-74  1  4  2  2  2  2  1  0  0  1  15

      75+    1  7  3  0  0  0  0  0  0  0  11

          Sum   29 16 11  9  8  6  5  1  2  1  88

    n  xtabs()函数:在数据集中取子集,生成表格

    > hightip <- tips[,'tip'] > mean(tips[,'tip'] )

    > as.data.frame(xtabs(~tips$sex + hightip,subset = tips$smoker=='No'))

    #as.data.frame转换成数据框;~后面的公式类似table括号中的内容,为分类变量;~左边需添加的是连续型变量;有一个子集subset可进行提取

      tips.sex hightip Freq

    1   Female   FALSE   31

    2     Male   FALSE   49

    3   Female    TRUE   23

    4     Male    TRUE   48

    > as.data.frame(xtabs(~tips$sex + hightip,subset = tips$day %in% c('Sun','Sat')))

      tips.sex hightip Freq

    1   Female   FALSE   21

    2     Male   FALSE   53

    3   Female    TRUE   25

    4     Male    TRUE   64

    示例2:

    > xtabs(ncontrols~agegp + alcgp,data = esoph)

           alcgp

    agegp   0-39g/day 40-79 80-119 120+

      25-34        61    45      5    5

      35-44        89    80     20   10

      45-54        78    81     39   15

      55-64        89    84     43   26

      65-74        71    53     29    8

      75+          27    12      2    3

    > addmargins(xtabs(ncontrols~agegp + alcgp,data = esoph),margin = c(1,2))

           alcgp

    agegp   0-39g/day 40-79 80-119 120+ Sum

      25-34        61    45      5    5 116

      35-44        89    80     20   10 199

      45-54        78    81     39   15 213

      55-64        89    84     43   26 242

      65-74        71    53     29    8 161

      75+          27    12      2    3  44

      Sum         415   355    138   67 975

    #cbind(数值型,数值型)

    > xtabs(cbind(ncases,ncontrols)~agegp + alcgp,data = esoph)

    , ,  = ncases

           alcgp

    agegp   0-39g/day 40-79 80-119 120+

      25-34         0     0      0    1

      35-44         1     4      0    4

      45-54         1    20     12   13

      55-64        12    22     24   18

      65-74        11    25     13    6

      75+           4     4      2    3

    , ,  = ncontrols

           alcgp

    agegp   0-39g/day 40-79 80-119 120+

      25-34        61    45      5    5

      35-44        89    80     20   10

      45-54        78    81     39   15

      55-64        89    84     43   26

      65-74        71    53     29    8

      75+          27    12      2    3

    n  ftable()函数:扁平表格,接受xtabs对象,进行调整

    > ftable(xtabs(cbind(ncases,ncontrols)~agegp +alcgp,data = esoph))

                     ncases ncontrols

    agegp alcgp                     

    25-34 0-39g/day       0        61

          40-79           0        45

          80-119          0         5

          120+            1         5

    35-44 0-39g/day       1        89

          40-79           4        80

          80-119          0        20

          120+            4        10

    45-54 0-39g/day       1        78

          40-79          20        81

          80-119         12        39

          120+           13        15

    55-64 0-39g/day      12        89

          40-79          22        84

          80-119         24        43

          120+           18        26

    65-74 0-39g/day      11        71

          40-79          25        53

          80-119         13        29

          120+            6         8

    75+   0-39g/day       4        27

          40-79           4        12

          80-119          2         2

          120+            3         3

    n  数据汇总可结合前面学习的函数,如:

    u  summary(数据集)函数

    u  describe(数据集)函数(psych包里的)

    u  describe.data.frame(数据集)函数 (Hmisc包里的)

    u  apply()家族等

      t检验

    n  假设检验:服从正态分布方差齐性(如果不齐,可以用t’检验),

    n  示例:

    u  生成随机数据,并检验是否符合正态分布:(shapiro.test()函数)

    > data1 <- sample(1:100,50)

    >

    > #检验正态性  shapiro.test()函数

    > shapiro.test(data1)

           Shapiro-Wilk normality test

    data:  data1

    W = 0.96424, p-value = 0.1338  #不能拒绝原假设,服从正态分布

    除此之外,还有以下5个函数可用于检验是否符合正态分布:

    library(nortest)

    lillie.test(data1)

    ad.test(data1)

    cvm.test(data1)

    pearson.test(data1)

    sf.test(data1)

    如:

    > lillie.test(data1)

     

           Lilliefors (Kolmogorov-Smirnov) normality test

     

    data:  data1

    D = 0.064492, p-value = 0.8693

     

    > ad.test(data1)

     

           Anderson-Darling normality test

     

    data:  data1

    A = 0.40918, p-value = 0.333

     

    > cvm.test(data1)

     

           Cramer-von Mises normality test

     

    data:  data1

    W = 0.054212, p-value = 0.4437

     

    > pearson.test(data1)

     

           Pearson chi-square normality test

     

    data:  data1

    P = 4, p-value = 0.7798

     

    > sf.test(data1)

     

           Shapiro-Francia normality test

     

    data:  data1

    W = 0.97496, p-value = 0.3075

    生成服从正态分布的数据:rnorm()函数

    > #生成服从正态分布的数据

    > data3 <- rnorm(100,3,5)

    > data4 <- rnorm(200,3.4,8)

    >

    > shapiro.test(data3)

           Shapiro-Wilk normality test

    data:  data3

    W = 0.99369, p-value = 0.9258

    >

    > shapiro.test(data4)

           Shapiro-Wilk normality test

    data:  data4

    W = 0.98946, p-value = 0.149

    检验方差齐性:var.test()函数:仅适用于两组

    > var.test(data3,data4)

           F test to compare two variances

    data:  data3 and data4

    F = 0.40348, num df = 99, denom df = 199, p-value = 1.088e-06

    alternative hypothesis: true ratio of variances is not equal to 1

    95 percent confidence interval:

     0.2893355 0.5739700

    sample estimates:

    ratio of variances

             0.4034794   #p值小于0.05,说明方差不齐性

    u  进行t检验函数:t.test()函数

    两组均数t检验

    > t.test(data3,data4,var.equal = F)   #方差不齐则var.equal设置为F,默认也为FALSE,若其设置为TRUE,如为FALSE,则会使用t’检验

           Welch Two Sample t-test

    data:  data3 and data4

    t = -1.3681, df = 281.41, p-value = 0.1724   #说明两组均数无显著统计学差异

    alternative hypothesis: true difference in means is not equal to 0

    95 percent confidence interval:

     -2.587004  0.465494

    sample estimates:

    mean of x mean of y

          2.468549  3.529304

    样本均数与总体均数t检验

    > #样本均数与总体均数的比较

    > t.test(data3,mu =3.2)

           One Sample t-test

    data:  data3

    t = -1.4117, df = 99, p-value = 0.1612  #样本与总体无统计学差异

    alternative hypothesis: true mean is not equal to 3.2

    95 percent confidence interval:

     1.440424 3.496675

    sample estimates:

    mean of x

     2.468549

    配对数据的t检验

    data3 <- rnorm(200,3,5)

    > data4 <- rnorm(200,3.4,5)

    >

    > #配对数据

    > #进行配对t检验

    > t.test(data3,data4,paired = TRUE)

           Paired t-test

    data:  data3 and data4

    t = 0.59079, df = 199, p-value = 0.5553  #p大于0.05,说明无显著差异

    alternative hypothesis: true difference in means is not equal to 0

    95 percent confidence interval:

     -0.6946177  1.2888581

    sample estimates:

    mean of the differences

                  0.2971202

      数据变换

    n  数据不满足正态分布时该如何处理:有时可采用取log()、sqrt()、1/x等等方式,

    n  runif()平均分布函数

    > mydata <- runif(100,min = 1,max = 2)

    > shapiro.test(mydata)

           Shapiro-Wilk normality test

    data:  mydata

    W = 0.93149, p-value = 6.05e-05

    > shapiro.test(log(mydata))

           Shapiro-Wilk normality test

    data:  log(mydata)

    W = 0.93082, p-value = 5.54e-05

    > shapiro.test(sqrt(mydata))

           Shapiro-Wilk normality test

    data:  sqrt(mydata)

    W = 0.93236, p-value = 6.794e-05

    > shapiro.test(1/mydata)

           Shapiro-Wilk normality test

    data:  1/mydata

    W = 0.9195, p-value = 1.325e-05

    #几种方式均不可行。

    boxcox转换---对公式的,加载MASS包,运用里面的boxcox()函数,#boxcox()转换,核心为找到lambda的值进行相应的转换

    bc <- boxcox(Volume ~ log(Height) + log(Girth),data = trees,

                 lambda = seq(-0.25,0.25,length =10))

    #Volume为拟操作的变量,Height和Girth为用此两个数据进行估计,但要取log,trees为数据集,lambda后面的公式为取最佳值

    u  用公式:(x^lambda-1)/lambda  进行数据转换(lambda 不等于1),新的Volume_bc就服从正态分布了。

    > Volume_bc <- (trees$Volume^lambda-1)/lambda

    > shapiro.test(Volume_bc)

           Shapiro-Wilk normality test

    data:  Volume_bc

    W = 0.96431, p-value = 0.3776

    u  可用qqnorm(Volume_bc)和qqline(Volume_bc)c查看图,是否符合正态分布

    boxcox转换---对数值的,加载forecast包,利用包中的BoxCox.lambda()函数

    > BoxCox.lambda(trees$Volume)

    [1] -0.4954451   #lambda的值  ,采用1/sqrt(x)

    分以下几种情况:

    ü  lambda接近-1时    1/x

    ü             -0.5     1/sqrt(x)

    ü              0       ln(x) 或log(x)

    ü              0.5      sqrt()

    ü               1       不用变换了

    完整示例:

    > BoxCox.lambda(trees$Volume)

    [1] -0.4954451

    > new_volum <- 1/sqrt(trees$Volume)

    > shapiro.test(new_volum)

           Shapiro-Wilk normality test

    data:  new_volum

    W = 0.94523, p-value = 0.1152

    利用car包中powerTransform得到lambda值,再进行正态分布分析

    > library(car)

    > powerTransform(trees$Volume)

    Estimated transformation parameter

    trees$Volume

     -0.07476608

    > new_volum <- trees$Volume^-0.07476608

    > shapiro.test(new_volum)

           Shapiro-Wilk normality test

    data:  new_volum

    W = 0.96428, p-value = 0.3768

      方差分析

    n  用于多组均值的比较。两组均值的t检验与方差分析等价,但是对于>=3组的均数比较,t检验不适用,需用方差分析。

    n  假设检验,需满足的条件:

    u  正态性

    u  方差齐性

    u  独立性

    n  方差齐性的检验

    u  安装multcomp包,加载cholesterol数据集(里面包含response组和trt组)

    u  方差齐性的检验:

    l  R语言的内置包:bartlett.test()

    > bartlett.test(response~trt,data = cholesterol)

           Bartlett test of homogeneity of variances

    data:  response by trt

    Bartlett's K-squared = 0.57975, df = 4, p-value =

    0.9653

    > #p = 0.9653   方差是齐性的

    >

    > #正态性检验

    > shapiro.test(cholesterol$response)

           Shapiro-Wilk normality test

    data:  cholesterol$response

    W = 0.97722, p-value = 0.4417

    l  R语言的内置包:fligner.test() 与bartlett.test()检验原理不同,但公式一样

    > #方差齐性检验

    > fligner.test(response~trt,data = cholesterol)

           Fligner-Killeen test of homogeneity of variances

    data:  response by trt

    Fligner-Killeen:med chi-squared = 0.74277, df = 4,

    p-value = 0.946

    car包中的ncvTest()函数:

    > ncvTest(lm(response~trt,data = cholesterol))

    Non-constant Variance Score Test

    Variance formula: ~ fitted.values

    Chisquare = 0.1338923    Df = 1     p = 0.71443

    car包中的leveneTest()函数:

    > leveneTest(response~trt,data = cholesterol)

    Levene's Test for Homogeneity of Variance (center = median)

          Df F value Pr(>F)

    group  4  0.0755 0.9893

          45

     

    n  方差分析

    单因素方差分析aov()函数

    > aov(response~trt,data = cholesterol)

    Call:

       aov(formula = response ~ trt, data = cholesterol)

    Terms:

                          trt Residuals

    Sum of Squares  1351.3690  468.7504

    Deg. of Freedom         4        45

    Residual standard error: 3.227488

    Estimated effects may be unbalanced

    > summary(fit)

                Df Sum Sq Mean Sq F value   Pr(>F)   

    trt          4 1351.4   337.8   32.43 9.82e-13 ***

    Residuals   45  468.8    10.4                    

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    #用gplots包中的plotmeans可视化

    plotmeans(response~trt,data = cholesterol)

    单因素方差分析:oneway.test()函数

    > oneway.test(response~trt,data = cholesterol)

           One-way analysis of means (not assuming equal

           variances)

    data:  response and trt

    F = 30.709, num df = 4.00, denom df = 22.46, p-value =

    8.227e-09

    > oneway.test(response~trt,data = cholesterol,var.equal = T)

           One-way analysis of means

    data:  response and trt

    F = 32.433, num df = 4, denom df = 45, p-value =

    9.819e-13

    两两比较:TukeyHSD()函数

    > TukeyHSD(fit)

      Tukey multiple comparisons of means

        95% family-wise confidence level

    Fit: aov(formula = response ~ trt, data = cholesterol)

    $trt

                      diff        lwr       upr     p adj

    2times-1time   3.44300 -0.6582817  7.544282 0.1380949

    4times-1time   6.59281  2.4915283 10.694092 0.0003542

    drugD-1time    9.57920  5.4779183 13.680482 0.0000003

    drugE-1time   15.16555 11.0642683 19.266832 0.0000000

    4times-2times  3.14981 -0.9514717  7.251092 0.2050382

    drugD-2times   6.13620  2.0349183 10.237482 0.0009611

    drugE-2times  11.72255  7.6212683 15.823832 0.0000000

    drugD-4times   2.98639 -1.1148917  7.087672 0.2512446

    drugE-4times   8.57274  4.4714583 12.674022 0.0000037

    drugE-drugD    5.58635  1.4850683  9.687632 0.0030633

    #可视化

    plot(TukeyHSD(fit))

    多因素方差分析:aov()函数

    > data('ToothGrowth')

    > head('ToothGrowth')

    len supp dose

    1  4.2   VC  0.5

    2 11.5   VC  0.5

    3  7.3   VC  0.5

    4  5.8   VC  0.5

    5  6.4   VC  0.5

    6 10.0   VC  0.5

    > levels(ToothGrowth$supp)

    [1] "OJ" "VC"

    > levels(ToothGrowth$dose)

    NULL

    > levels(as.factor(ToothGrowth$dose))

    [1] "0.5" "1"   "2" 

    > table(ToothGrowth$supp,ToothGrowth$dose)

       

         0.5  1  2

      OJ  10 10 10

      VC  10 10 10

    #公式的写法   关注交互作用

    > fangcha <- aov (len~supp+dose,data = ToothGrowth)

    > summary(fangcha)

                Df Sum Sq Mean Sq F value   Pr(>F)   

    supp         1  205.4   205.4   11.45   0.0013 **

    dose         1 2224.3  2224.3  123.99 6.31e-16 ***

    Residuals   57 1022.6    17.9                    

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    #交互作用

    > fangcha <- aov (len~supp*dose,data = ToothGrowth)

    > summary(fangcha)

                Df Sum Sq Mean Sq F value   Pr(>F)   

    supp         1  205.4   205.4  12.317 0.000894 ***

    dose         1 2224.3  2224.3 133.415  < 2e-16 ***

    supp:dose    1   88.9    88.9   5.333 0.024631 *    #交互作用不能忽视

    Residuals   56  933.6    16.7                    

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    #可视化  交互作用图

    >interaction.plot(ToothGrowth$dose,ToothGrowth$supp,ToothGrowth$len,type = 'b',

    +                  pch = c(1,10),col = c('red','green'))

    单因素协方差分析:aov()函数

    > data('litter')

    > head(litter)

      dose weight gesttime number

    1    0  28.05     22.5     15

    2    0  33.33     22.5     14

    3    0  36.37     22.0     14

    4    0  35.52     22.0     13

    5    0  36.77     21.5     15

    6    0  29.60     23.0      5

    > facn <- aov(weight~gesttime + dose + gesttime:dose,data = litter)

    > summary(facn)

                  Df Sum Sq Mean Sq F value  Pr(>F)  

    gesttime       1  134.3  134.30   8.289 0.00537 **  #协变量确实有影响,该如何去除?

    dose           3  137.1   45.71   2.821 0.04556 *

    gesttime:dose  3   81.9   27.29   1.684 0.17889  

    Residuals     66 1069.4   16.20                  

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    去除协变量的影响:加载effects包,用其中的effect()函数

    > effect('dose',facn)

    NOTE: dose is not a high-order term in the model

     dose effect

    dose

           0        5       50      500

    32.30722 28.50625 30.61078 29.29005

    #常见研究设计的表达式

      卡方检验

    n  对分类变量的检验方法

    n  分类:

    u  拟合优度检验:用chisq.test()函数

    (针对样本数据的分布与已知总体分布是否一致)

    > men <- c(11,120,60,45)

    > women <- c(20,102,39,30)

    > df <- as.data.frame(rbind(men,women))

    > colnames(df) <- c('AB','O','A','B')

    > chisq.test(men)

           Chi-squared test for given probabilities

    data:  men

    X-squared = 105.46, df = 3, p-value < 2.2e-16

    > chisq.test(men,p = c(0.1,0.5,0.2,0.2))  #p 中为总体人群中各血型的比例

           Chi-squared test for given probabilities

    data:  men

    X-squared = 10.335, df = 3, p-value = 0.01592

    u  卡方齐性检验:用于比较不同分类水平下,各个类型的比例是否一致。

    > chisq.test(df)  #行变量与列变量的相关性检验

           Pearson's Chi-squared test

    data:  df

    X-squared = 6.8607, df = 3, p-value = 0.07647 #男女不同血型的分布一致

    u  卡方独立性检验:

    > chisq.test(df)

           Pearson's Chi-squared test

    data:  df

    X-squared = 6.8607, df = 3, p-value = 0.07647 # 行变量和列变量无关联,血型分布和性别无关

    u  CMH检验:分层检验 三维,行变量为无序,列变量为有序数据;判断是否有混杂因素

    #采用mantelhaen.test()函数

    > Rabbits <-

    +   array(c(0,0,6,5,

    +           3,0,3,6,

    +           6,2,0,4,

    +           5,6,1,0,

    +           2,5,0,0),

    +         dim = c(2,2,5),

    +         dimnames = list(

    +           Delay = c('None','1.5h'),

    +           Response = c('Cured','Died'),

    +           Penicillin.Level = c('1/8','1/4','1/2','1','4')))

    > Rabbits

    , , Penicillin.Level = 1/8

          Response

    Delay  Cured Died

      None     0    6

      1.5h     0    5

    , , Penicillin.Level = 1/4

          Response

    Delay  Cured Died

      None     3    3

      1.5h     0    6

    , , Penicillin.Level = 1/2

          Response

    Delay  Cured Died

      None     6    0

      1.5h     2    4

    , , Penicillin.Level = 1

          Response

    Delay  Cured Died

      None     5    1

      1.5h     6    0

    , , Penicillin.Level = 4

          Response

    Delay  Cured Died

      None     2    0

      1.5h     5    0

    > mantelhaen.test(Rabbits)

           Mantel-Haenszel chi-squared test with continuity

           correction

    data:  Rabbits

    Mantel-Haenszel X-squared = 3.9286, df = 1, p-value =

    0.04747

    alternative hypothesis: true common odds ratio is not equal to 1

    95 percent confidence interval:

      1.026713 47.725133

    sample estimates:

    common odds ratio

                    7    #可以判定盘尼西林是否为混杂因素

    u  CMH检验:有序分类的卡方检验:连续型变量

    #采用mantelhaen.test()函数

    > Satisfaction <-

    +   as.table(array(c(1,2,0,0,3,3,1,2,

    +                    11,17,8,4,2,3,5,2,

    +                    1,0,0,0,1,3,0,1,

    +                    2,5,7,9,1,1,3,6),

    +                  dim = c(4,4,2),

    +                  dimnames =

    +                    list(Income =

    +                           c('<5000','5000-15000',

    +                             '15000-25000','>25000'),

    +                         'Job Satisfaction' =

    +                           c('V_D','L_S','M_S','V_S'),

    +                         Gender = c('Female','Male'))))

    > Satisfaction

    , , Gender = Female

                 Job Satisfaction

    Income        V_D L_S M_S V_S

      <5000         1   3  11   2

      5000-15000    2   3  17   3

      15000-25000   0   1   8   5

      >25000        0   2   4   2

    , , Gender = Male

                 Job Satisfaction

    Income        V_D L_S M_S V_S

      <5000         1   1   2   1

      5000-15000    0   3   5   1

      15000-25000   0   0   7   3

      >25000        0   1   9   6

    > mantelhaen.test(Satisfaction)

           Cochran-Mantel-Haenszel test

    data:  Satisfaction

    Cochran-Mantel-Haenszel M^2 = 10.2, df = 9, p-value =

    0.3345   #工资水平与满意度无关

    u  配对四格表的卡方检验:采用mcnemar.test()函数

    > paired <- as.table(matrix(c(157,24,69,18),nrow = 2,dimnames = list(case =c('A','B'),control = c('A','B'))))

    > paired

        control

    case   A   B

       A 157  69

       B  24  18

    > mcnemar.test(paired)

           McNemar's Chi-squared test with continuity correction

    data:  paired

    McNemar's chi-squared = 20.817, df = 1, p-value =

    5.053e-06   

    > chisq.test(paired)

           Pearson's Chi-squared test with Yates' continuity

           correction

    data:  paired

    X-squared = 1.9244, df = 1, p-value = 0.1654

    u  的

      回归分析与模型诊断

    前提:是否适合做线性回归,是否满足正态分布

    只要因变量是连续型变量即可做线性回归,因变量是分类变量需要做Logistic回归;自变量是连续型或者分类变量无影响

    n  一元回归分析:lm(因变量~自变量)

    x为连续型变量

    > x<- seq(1,5,len =100)

    > noise <- rnorm(n=100,mean = 0,sd = 1)

    > beta0 = 1

    > beta1 <-2

    > y <- beta0 + beta1*x + noise

    > plot(y~x)   #看一下是否适合做线性回归

    > model <- lm(y~x)

    > summary(model)

    Call:

    lm(formula = y ~ x)

    Residuals:

         Min       1Q   Median       3Q      Max

    -2.49732 -0.64794  0.00215  0.75355  3.06579

    Coefficients:

                Estimate Std. Error t value Pr(>|t|)   

    (Intercept)  0.77938    0.28733   2.712  0.00789 **

    x            2.05561    0.08927  23.027  < 2e-16 *** 

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 1.041 on 98 degrees of freedom

    Multiple R-squared:  0.844,    Adjusted R-squared:  0.8424 #模型可解释度

    F-statistic: 530.3 on 1 and 98 DF,  p-value: < 2.2e-16  #模型可靠性

    x为分类变量

    > x <- factor(rep(c(0,1,2),each = 20))

    > y <- c(rnorm(20,0,1),rnorm(20,1,1),rnorm(20,2,1))

    > model <- lm(y~x)

    > summary(model)

    Call:

    lm(formula = y ~ x)

    Residuals:

         Min       1Q   Median       3Q      Max

    -2.50566 -0.82826  0.01137  0.87966  2.41338

    Coefficients:

                Estimate Std. Error t value Pr(>|t|)    

    (Intercept)  0.05338    0.24331   0.219    0.827   

    x1           0.81708    0.34409   2.375    0.021 * 

    x2           1.99903    0.34409   5.810 2.94e-07 ***

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 1.088 on 57 degrees of freedom

    Multiple R-squared:  0.3745,      Adjusted R-squared:  0.3525

    F-statistic: 17.06 on 2 and 57 DF,  p-value: 1.558e-06

    > exp( 1.99903)  #转化成OR值

    [1] 7.381892

    n  模型诊断

    非正态性:shapiro.test()判断

    > #检验残差是否服从正态分布

    > data(LMdata,package = 'rinds')

    > model <- lm(y~x,data = LMdata$NonL)

    > #找残差

    >

    > res1 <- residuals(model)

    > shapiro.test(res1)   #残差不符合正态分布,需要重新做

           Shapiro-Wilk normality test

    data:  res1

    W = 0.93524, p-value = 1e-04   #残差不符合正态分布,需要重新做

    u  非线性

    > #2.非线性

    > model2 <- lm(y~x,data = LMdata$NonL)

    > plot(model2)

    #y 与x2的关系是否成线性

    model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)

    summary(model2)

    #踢掉x

    > model3 <- update(model2,y~.-x)

    > summary(model3)

    Call:

    lm(formula = y ~ 1, data = LMdata$NonL)

    Residuals:

        Min      1Q  Median      3Q     Max

    -19.456 -12.907  -2.464  10.725  28.697

    Coefficients:

                Estimate Std. Error t value Pr(>|t|)   

    (Intercept)   21.829      1.429   15.28   <2e-16 ***

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 14.29 on 99 degrees of freedom

    > model2 <- lm(y~x+ I(x^2),data = LMdata$NonL)

    > summary(model2)

    Call:

    lm(formula = y ~ x + I(x^2), data = LMdata$NonL)

    Residuals:

         Min       1Q   Median       3Q      Max

    -2.32348 -0.60702  0.00701  0.56018  2.27346

    Coefficients:

                Estimate Std. Error t value Pr(>|t|)   

    (Intercept)  0.98702    0.62216   1.586    0.116   

    x            0.11085    0.45405   0.244    0.808   

    I(x^2)       1.97966    0.07456  26.552   <2e-16 ***

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.907 on 97 degrees of freedom

    Multiple R-squared:  0.9961,      Adjusted R-squared:  0.996

    F-statistic: 1.224e+04 on 2 and 97 DF,  p-value: < 2.2e-16

    > model3 <- update(model2,y~.-x)

    > summary(model3)

    Call:

    lm(formula = y ~ I(x^2), data = LMdata$NonL)

    Residuals:

         Min       1Q   Median       3Q      Max

    -2.34289 -0.60692  0.01722  0.58186  2.29601

    Coefficients:

                Estimate Std. Error t value Pr(>|t|)   

    (Intercept)  1.13378    0.15963   7.103 1.97e-10 ***

    I(x^2)       1.99760    0.01271 157.195  < 2e-16 ***

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.9026 on 98 degrees of freedom

    Multiple R-squared:  0.996,       Adjusted R-squared:  0.996

    F-statistic: 2.471e+04 on 1 and 98 DF,  p-value: < 2.2e-16

    > AIC(model,model2,model3)

           df      AIC

    model   3 478.4558

    model2  4 269.2121

    model3  3 267.2736  #拟合效果越来越好

     

    plot(model3$residuals~LMdata$NonL)  #在0左右分布

    u  异方差:考虑的是噪声对模型的影响

    model4 <- lm(y~x,data = LMdata$Hetero)

    plot(model4$residuals~LMdata$Hetero$x)

    #处理方法:使用加权最小二乘法:对于不同的样本点给予不同的权重

    > summary(model5)

    Call:

    lm(formula = y ~ x, data = LMdata$Hetero, weights = 1/x^2)

    Weighted Residuals:

         Min       1Q   Median       3Q      Max

    -2.25132 -0.68409 -0.03924  0.63997  2.61098

    Coefficients:

                Estimate Std. Error t value Pr(>|t|)   

    (Intercept)   1.2611     0.5139   2.454   0.0159 * 

    x             1.8973     0.2317   8.190 9.98e-13 ***

    ---

    Signif. codes: 

    0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 1.025 on 98 degrees of freedom

    Multiple R-squared:  0.4063,      Adjusted R-squared:  0.4003

    F-statistic: 67.07 on 1 and 98 DF,  p-value: 9.977e-13

    > summary(model4)

    Call:

    lm(formula = y ~ x, data = LMdata$Hetero)

    Residuals:

        Min      1Q  Median      3Q     Max

    -8.3371 -1.6383 -0.1533  1.4075 12.3423

    Coefficients:

                Estimate Std. Error t value Pr(>|t|)   

    (Intercept)   1.6175     1.0019   1.615     0.11   

    x             1.7671     0.3113   5.677  1.4e-07 ***

    ---

    Signif. codes: 

    0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.63 on 98 degrees of freedom

    Multiple R-squared:  0.2475,      Adjusted R-squared:  0.2398

    F-statistic: 32.23 on 1 and 98 DF,  p-value: 1.396e-07

    #处理方法:使用迭代重复加权最小二乘法  采用nlme包中的gls()函数,找到最合适的weights公式,得到最小AIC

    > glsmodel <- gls(y~x,weights = varFixed(~x),data = LMdata$Hetero)

    > summary(glsmodel)

    Generalized least squares fit by REML

      Model: y ~ x

      Data: LMdata$Hetero

           AIC      BIC    logLik

      517.5286 525.2835 -255.7643

    Variance function:

     Structure: fixed weights

     Formula: ~x

    Coefficients:

                   Value Std.Error  t-value p-value

    (Intercept) 1.421324 0.7091780 2.004184  0.0478

    x           1.832543 0.2603653 7.038351  0.0000

     Correlation:

      (Intr)

    x -0.908

    Standardized residuals:

            Min          Q1         Med          Q3

    -2.17451000 -0.55322766 -0.03454023  0.51060224

            Max

     2.93969900

    Residual standard error: 1.890127

    Degrees of freedom: 100 total; 98 residual

    u  自相关:利用lmtest包中的DW检验函数:dwtest()函数

    > model6 <- lm(y~x,data = LMdata$AC)

    > dwtest(model6)

           Durbin-Watson test

    data:  model6

    DW = 0.65556, p-value = 2.683e-12  #存在自相关,不能使用最小二乘法,可使用gls()函数,实现广义最小二乘法,注意gls中的correlation 参数,设置为corAR1()

    alternative hypothesis: true autocorrelation is greater than 0

    > glsmodel2 <- gls(y~x,correlation = corAR1(),data = LMdata$AC)

    > summary(glsmodel2)

    Generalized least squares fit by REML

      Model: y ~ x

      Data: LMdata$AC

           AIC      BIC    logLik

      268.7617 279.1016 -130.3809

    Correlation Structure: AR(1)

     Formula: ~1

     Parameter estimate(s):

          Phi

    0.7041665

    Coefficients:

                   Value Std.Error  t-value p-value

    (Intercept) 1.335059 0.7792019 1.713367  0.0898

    x           2.072936 0.2405513 8.617438  0.0000

     Correlation:

      (Intr)

    x -0.926

    Standardized residuals:

             Min           Q1          Med           Q3

    -1.667086282 -0.571601900  0.001724114  0.568360354

             Max

     2.306177988

    Residual standard error: 1.25329

    Degrees of freedom: 100 total; 98 residual

    > summary(model6)

    Call:

    lm(formula = y ~ x, data = LMdata$AC)

    Residuals:

         Min       1Q   Median       3Q      Max

    -2.10131 -0.72894 -0.03329  0.66138  2.77461

    Coefficients:

                Estimate Std. Error t value Pr(>|t|)   

    (Intercept)   1.2626     0.3303   3.822 0.000232 ***

    x             2.1118     0.1026  20.578  < 2e-16 ***

    ---

    Signif. codes: 

    0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 1.197 on 98 degrees of freedom

    Multiple R-squared:  0.8121,      Adjusted R-squared:  0.8101

    F-statistic: 423.4 on 1 and 98 DF,  p-value: < 2.2e-16

    u  异常值:离群点,杠杆点、高影响点

    model7 <- lm(y~x,data = LMdata$Outlier)

    plot(y~x,data = LMdata$Outlier)

    abline(model7)  #画出回归直线

    #利用car包中的infuencePlot()函数进行判断,圆圈越大,对模型影响越大,做线性回归模型时需踢掉该点,用update函数

    model8 <- update(model7,y~x,subset = -32,data = LMdata$Outlier)

    plot(y~x,data = LMdata$Outlier)   #比较去除前后的差异

    abline(model7,col = 'red')   #比较去除前后的差异

    abline(model8,col = 'green')  #比较去除前后的差异

    u  多重共线性的判断:自变量之间存在线性关系

    #x1,x2,x3与y之间p值无显著差异,但是总体上的p值和R值非常显著,说明x1,x2,x3之间可能是存在相关性的,但与y没有相关性,不能进行回归分析,需要用函数进行检验确认:vif()函数计算方差膨胀因子

    > model9 <- lm(y~x1+x2+x3,data = LMdata$Mult)

    > summary(model9)

    Call:

    lm(formula = y ~ x1 + x2 + x3, data = LMdata$Mult)

    Residuals:

         Min       1Q   Median       3Q      Max

    -1.29100 -0.26369  0.00141  0.32529  0.91182

    Coefficients:

                Estimate Std. Error t value Pr(>|t|) 

    (Intercept)   0.3848     0.5813   0.662   0.5095 

    x1            7.2022     4.8552   1.483   0.1412  

    x2          -14.0916    12.1385  -1.161   0.2486 

    x3            8.2312     4.8559   1.695   0.0933 .

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.499 on 96 degrees of freedom

    Multiple R-squared:  0.9984,      Adjusted R-squared:  0.9984

    F-statistic: 2.057e+04 on 3 and 96 DF,  p-value: < 2.2e-16

    #vif计算方差膨胀因子,一般而言,大于10则认为存在共线性

    > vif(model9)

            x1         x2         x3

      7560.819 214990.752 222630.742

    #运用自动回归判断是x中的哪些变量存在共线性:step()函数

    > model10 <- step(model9)

    Start:  AIC=-135.11

    y ~ x1 + x2 + x3

           Df Sum of Sq    RSS     AIC

    - x2    1   0.33560 24.241 -135.71

    <none>              23.905 -135.11

    - x1    1   0.54795 24.453 -134.84

    - x3    1   0.71550 24.621 -134.16

    Step:  AIC=-135.71

    y ~ x1 + x3

           Df Sum of Sq     RSS     AIC

    <none>                 24.2 -135.71

    - x1    1     189.2   213.4   79.81

    - x3    1   15276.4 15300.7  507.05

    > summary(model10)

    Call:

    lm(formula = y ~ x1 + x3, data = LMdata$Mult)

    Residuals:

         Min       1Q   Median       3Q      Max

    -1.24046 -0.27519 -0.02751  0.32824  0.91344

    Coefficients:

                Estimate Std. Error t value Pr(>|t|)   

    (Intercept)  0.38158    0.58230   0.655    0.514   

    x1           1.56614    0.05692  27.514   <2e-16 ***

    x3           2.59393    0.01049 247.241   <2e-16 ***

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.4999 on 97 degrees of freedom

    Multiple R-squared:  0.9984,      Adjusted R-squared:  0.9984

    F-statistic: 3.074e+04 on 2 and 97 DF,  p-value: < 2.2e-16

    n  Logistic回归

    与线性回归的区别:y为分类变量。

    u  使用glm()函数:下载HSAUR2包的数据集plasma

    > library(HSAUR2)

    > data('plasma')

    > head(plasma)

      fibrinogen globulin      ESR

    1       2.52       38 ESR < 20

    2       2.56       31 ESR < 20

    3       2.19       33 ESR < 20

    4       2.18       31 ESR < 20

    5       3.41       37 ESR < 20

    6       2.46       36 ESR < 20

    > fit1 <- glm(ESR~fibrinogen+globulin,data = plasma,

    +             family = binomial())  #family选择二分类

    > summary(fit1)

    Call:

    glm(formula = ESR ~ fibrinogen + globulin, family = binomial(),

        data = plasma)

    Deviance Residuals:

        Min       1Q   Median       3Q      Max 

    -0.9683  -0.6122  -0.3458  -0.2116   2.2636 

    Coefficients:

                Estimate Std. Error z value Pr(>|z|) 

    (Intercept) -12.7921     5.7963  -2.207   0.0273 *

    fibrinogen    1.9104     0.9710   1.967   0.0491 *

    globulin      0.1558     0.1195   1.303   0.1925 

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 30.885  on 31  degrees of freedom

    Residual deviance: 22.971  on 29  degrees of freedom

    AIC: 28.971

    Number of Fisher Scoring iterations: 5

    > exp(coef(fit1)['fibrinogen'])

    fibrinogen

      6.755579

    > exp(1.9104)

    [1] 6.755791   #ORR值

    > exp(confint(fit1,parm = 'fibrinogen'))

    Waiting for profiling to be done...

        2.5 %    97.5 %   #95%的可信区间

        1.404131 73.000836

    n  生存分析与COX回归

    生存分析

    u  用coin包中的glima数据集演示,用survival包进行生存分析,利用其中的survfit()函数做出生存曲线图

    > g3 <- subset(glioma,histology =='Grade3')

    > fit <- survfit(Surv(time,event)~group,data = g3)  #注意公式的写法

    > plot(fit,lty = c(2,3),col = c(2,1))

    > legend('bottomright',legend = c('Control','Treatment'),lty = c(2,1),col = c(2,1))

    用survdiff()函数计算两组间生存差异

    > survdiff(Surv(time,event)~group,data = g3)  #接受传入的形式

    Call:

    survdiff(formula = Surv(time, event) ~ group, data = g3)

                   N Observed Expected (O-E)^2/E (O-E)^2/V

    group=Control  6        4     1.49      4.23      6.06

    group=RIT     11        2     4.51      1.40      6.06

          Chisq= 6.1  on 1 degrees of freedom, p= 0.0138

    u  用survdiff()函数计算两组间生存差异

    > logrank_test(Surv(time,event)~group,data = g3,distribution='exact')

    # 注意公式的写法,选择精确分布

           Exact Two-Sample Logrank Test

    data:  Surv(time, event) by group (Control, RIT)

    Z = -2.1711, p-value = 0.02877

    alternative hypothesis: true theta is not equal to 1

    #讨论不同histology的对照组与治疗组是否有差异

    logrank_test(Surv(time,event)~group|histology,data = glioma,distribution = approximate(B =1000))

           Approximative Two-Sample Logrank Test

    data:  Surv(time, event) by

            group (Control, RIT)

            stratified by histology

    Z = -3.6704, p-value < 2.2e-16

    alternative hypothesis: true theta is not equal to 1

                  COX回归

    u  用GBSG2数据集

    > install.packages('TH.data')

    Error in install.packages : Updating loaded packages

    > data('GBSG2',package = 'TH.data')

    > head(GBSG2)

      horTh age menostat tsize tgrade pnodes progrec estrec time cens

    1    no  70     Post    21     II      3      48     66 1814    1

    2   yes  56     Post    12     II      7      61     77 2018    1

    3   yes  58     Post    35     II      9      52    271  712    1

    4   yes  59     Post    17     II      4      60     29 1807    1

    5    no  73     Post    35     II      1      26     65  772    1

    6    no  32      Pre    57    III     24       0     13  448    1

    > plot(survfit(Surv(time,cens)~horTh,data = GBSG2),lty = c(2,1),col = c(2,1), mark.Time = T)  #如果需要标记生存时间,在后面加上mark.Time = T

    > legend('bottomright',legend = c('yes','no'),lty = c(2,1),col = c(2,1))

    #画出了生存曲线图

    u  生存分析:使用survival包中的coxph()函数

    > coxreg <- coxph(Surv(time,cens)~.,data = GBSG2)

    > summary(coxreg)

    Call:

    coxph(formula = Surv(time, cens) ~ ., data = GBSG2)

      n= 686, number of events= 299

                       coef  exp(coef)   se(coef)      z Pr(>|z|)   

    horThyes     -0.3462784  0.7073155  0.1290747 -2.683 0.007301 **

    age          -0.0094592  0.9905854  0.0093006 -1.017 0.309126   

    menostatPost  0.2584448  1.2949147  0.1834765  1.409 0.158954   

    tsize         0.0077961  1.0078266  0.0039390  1.979 0.047794 * 

    tgrade.L      0.5512988  1.7355056  0.1898441  2.904 0.003685 **

    tgrade.Q     -0.2010905  0.8178384  0.1219654 -1.649 0.099199 . 

    pnodes        0.0487886  1.0499984  0.0074471  6.551  5.7e-11 ***

    progrec      -0.0022172  0.9977852  0.0005735 -3.866 0.000111 ***

    estrec        0.0001973  1.0001973  0.0004504  0.438 0.661307   

    ---

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                 exp(coef) exp(-coef) lower .95 upper .95

    horThyes        0.7073     1.4138    0.5492    0.9109

    age             0.9906     1.0095    0.9727    1.0088

    menostatPost    1.2949     0.7723    0.9038    1.8553

    tsize           1.0078     0.9922    1.0001    1.0156

    tgrade.L        1.7355     0.5762    1.1963    2.5178

    tgrade.Q        0.8178     1.2227    0.6439    1.0387

    pnodes          1.0500     0.9524    1.0348    1.0654

    progrec         0.9978     1.0022    0.9967    0.9989

    estrec          1.0002     0.9998    0.9993    1.0011

    Concordance= 0.692  (se = 0.018 )

    Rsquare= 0.142   (max possible= 0.995 )

    Likelihood ratio test= 104.8  on 9 df,   p=0

    Wald test            = 114.8  on 9 df,   p=0

    Score (logrank) test = 120.7  on 9 df,   p=0

           #画树

    install.packages('party')

    tree <- ctree(Surv(time,cens)~.,data = GBSG2)

    plot(tree)

  • 相关阅读:
    node.js----服务器http
    node.js---对文件操作
    node.js
    历史管理
    h5
    git与github
    js中面向对象(创建对象的几种方式)
    jq基础
    POJ 2492 A Bug's Life
    POJ 1742 Coins
  • 原文地址:https://www.cnblogs.com/ppjs/p/9448999.html
Copyright © 2020-2023  润新知