• R Language Digest: R Function of the Day: cut


    Original article: https://www.r-bloggers.com/r-function-of-the-day-cut-2/

    The R Function of the Day series describes in plain language how certain R functions work, using simple examples that you can apply to gain insight into your own data.

    Today, I will discuss the cut function.

    What situation is cut useful in?

    In many data analysis settings, it might be useful to break up a continuous variable such as age into a categorical variable. Or you might want to group a discrete variable such as year into wider bins, such as 1990-2000. There are many reasons not to do this when performing regression analysis, but for simple displays of demographic data in tables, it could make sense. The cut function in R makes this task simple!

    How do I use cut?

    First, we will simulate some data from a hypothetical clinical trial that includes variables for patient ID, age, and year of enrollment.

    > ## generate data for clinical trial example
    > clinical.trial <-
        data.frame(patient = 1:100,
                   age = rnorm(100, mean = 60, sd = 8),
                   year.enroll = sample(paste("19", 85:99, sep = ""),
                     100, replace = TRUE),
                   ## needed in R >= 4.0.0, where strings stay character by default
                   stringsAsFactors = TRUE)
    > summary(clinical.trial)
        patient            age         year.enroll
     Min.   :  1.00   Min.   :41.18   1991   :12  
     1st Qu.: 25.75   1st Qu.:52.99   1988   :11  
     Median : 50.50   Median :60.08   1985   : 9  
     Mean   : 50.50   Mean   :59.67   1993   : 7  
     3rd Qu.: 75.25   3rd Qu.:65.67   1995   : 7  
     Max.   :100.00   Max.   :76.40   1997   : 7  
                                      (Other):47   
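
Because the data are simulated without a fixed seed, the counts shown in this post will differ from run to run. A minimal sketch of a repeatable version follows; the wrapper name simulate.trial and the seed value 42 are hypothetical choices for illustration, not part of the original example.

```r
## Sketch: the same simulation as above, wrapped so it can be repeated.
## simulate.trial and the seed 42 are illustrative assumptions.
simulate.trial <- function(seed = 42, n = 100) {
  set.seed(seed)  # fix the random number stream
  data.frame(patient = seq_len(n),
             age = rnorm(n, mean = 60, sd = 8),
             year.enroll = factor(sample(paste("19", 85:99, sep = ""),
                                         n, replace = TRUE)))
}

trial.a <- simulate.trial()
trial.b <- simulate.trial()
identical(trial.a, trial.b)  # same seed, so the two data frames match
```

The numbers produced this way will not match the tables in this post, which came from an unseeded run.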
    
    

    Now, we will use the cut function to make age a factor, which is what R calls a categorical variable. Our first example calls cut with the breaks argument set to a single number, which makes cut split the range of age into 4 intervals of equal width. The default labels use standard mathematical notation for open and closed intervals: (50,58.8] means greater than 50 and up to and including 58.8.

    > ## basic usage of cut with a numeric variable
    > c1 <- cut(clinical.trial$age, breaks = 4)
    > table(c1)
    c1
      (41.1,50]   (50,58.8] (58.8,67.6] (67.6,76.4] 
              9          34          41          16  
    > ## year.enroll is a factor: go through as.character first, because
    > ## as.numeric on a factor returns the level codes, not the years
    > c2 <- cut(as.numeric(as.character(clinical.trial$year.enroll)),
                breaks = 3)
    > table(c2)
    c2
    (1985,1990] (1990,1994] (1994,1999] 
             36          34          30  
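
To see what a single-number breaks argument does, it can help to try it on a small vector where the answer is easy to check by hand. The sketch below uses a made-up vector x, not the trial data. As documented for cut, the range of the data is divided into equal-length pieces and the outer limits are moved outward slightly so the minimum and maximum are included.

```r
## Sketch on a toy vector (x is an illustrative assumption).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

## Ask for 2 intervals: the range 1..10 is split at its midpoint, 5.5
two.bins <- cut(x, breaks = 2)

nlevels(two.bins)  # 2 intervals
table(two.bins)    # 5 values fall in each interval
```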
    
    

    The intervals that cut chose by default are not the nicest looking for age, although they are fine for year, since it was already discrete. Luckily, we can specify the exact intervals we want for age. Our next example shows how.

    > ## specify break points explicitly using seq function
    > 
    > ## look what seq does  
    > seq(30, 80, by = 10)
    [1] 30 40 50 60 70 80 
    > ## cut the age variable using the seq defined above
    > c1 <- cut(clinical.trial$age, breaks = seq(30, 80, by = 10))
    > ## table of the resulting factor           
    > table(c1)
    c1
    (30,40] (40,50] (50,60] (60,70] (70,80] 
          0       9      40      42       9  
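
With explicit break points, it matters what happens to a value that sits exactly on a break. The sketch below uses three made-up values that all lie on break points to show the default behaviour and the effect of the include.lowest and right arguments of cut. None of the simulated ages falls exactly on a break, so this does not change the table above.

```r
## Sketch: values lying exactly on the break points (illustrative data).
edges <- c(30, 40, 50)

## Default: intervals are open on the left, closed on the right,
## so 30 is not in (30,40] and becomes NA
default.cut <- cut(edges, breaks = c(30, 40, 50))
as.character(default.cut)  # NA "(30,40]" "(40,50]"

## include.lowest = TRUE closes the first interval on the left
lowest.cut <- cut(edges, breaks = c(30, 40, 50), include.lowest = TRUE)
as.character(lowest.cut)   # "[30,40]" "[30,40]" "(40,50]"

## right = FALSE flips the intervals to closed-left, open-right,
## so now 50 is the value that falls outside
left.cut <- cut(edges, breaks = c(30, 40, 50), right = FALSE)
as.character(left.cut)     # "[30,40)" "[40,50)" NA
```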
    
    

    That looks pretty good. The break points do not have to be equally spaced, as they are above; any set of cut points that suits your grouping will work.
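
As a small illustration of unequal spacing, the sketch below groups a few made-up ages into three bands of different widths. The vector ages, the cut points 18 and 65, and the labels are hypothetical choices for this example only.

```r
## Sketch: unequally spaced breaks with custom labels (illustrative values).
ages <- c(5, 18, 40, 65, 90)

## Bands [0,18), [18,65), [65,Inf); right = FALSE puts 18 and 65
## into the band that starts at them
life.stage <- cut(ages, breaks = c(0, 18, 65, Inf), right = FALSE,
                  labels = c("child", "adult", "senior"))

data.frame(ages, life.stage)
```

Using Inf as the last break point gives an open-ended top band, which is the same trick the age.cat function in this post relies on.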

    Finally, I am going to show you an example of a custom R function to categorize ages. It uses cut inside of it, but does some preprocessing and uses the labels argument to cut to make the output look nice.

    age.cat <- function(x, lower = 0, upper, by = 10,
                       sep = "-", above.char = "+") {
    
     labs <- c(paste(seq(lower, upper - by, by = by),
                     seq(lower + by - 1, upper - 1, by = by),
                     sep = sep),
               paste(upper, above.char, sep = ""))
    
     cut(floor(x), breaks = c(seq(lower, upper, by = by), Inf),
         right = FALSE, labels = labs)
    }
    

    This function categorizes age in a fairly flexible way. The assignment to labs inside the function creates a vector of labels, one per interval. Then cut is called to do the work, with the custom labels as an argument. Here are some examples using our simulated data from above. Rather than saving each result to a variable and calling table on it, I now nest the call to age.cat inside a call to table. I previously did a post on the table function.
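
As a brief aside before the examples, the label vector can be built by hand to see what the two seq calls inside age.cat contribute. The sketch below repeats the same expressions outside the function for lower = 30, upper = 70, by = 10. The names starts and ends are introduced here for readability and do not appear in the function.

```r
## Sketch: rebuilding the labs vector step by step (starts/ends are
## illustrative names; the expressions mirror those inside age.cat).
lower <- 30
upper <- 70
by <- 10

starts <- seq(lower, upper - by, by = by)        # 30 40 50 60
ends <- seq(lower + by - 1, upper - 1, by = by)  # 39 49 59 69

labs <- c(paste(starts, ends, sep = "-"),
          paste(upper, "+", sep = ""))
labs  # "30-39" "40-49" "50-59" "60-69" "70+"

## cut needs exactly one label per interval, i.e. one fewer than
## the number of break points
breaks <- c(seq(lower, upper, by = by), Inf)
length(labs) == length(breaks) - 1  # TRUE
```

This count only lines up when upper - lower is a multiple of by, which holds for every call shown in this post.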

    > ## only specifying an upper bound, uses 0 as lower bound, and
    > ## breaks up categories by 10
    > table(age.cat(clinical.trial$age, upper = 70))
      0-9 10-19 20-29 30-39 40-49 50-59 60-69   70+ 
        0     0     0     0     9    40    42     9  
    > ## now specifying a lower bound
    > table(age.cat(clinical.trial$age, lower = 30, upper = 70))
    30-39 40-49 50-59 60-69   70+ 
        0     9    40    42     9  
    > ## now specifying a lower bound AND the "by" argument 
    > table(age.cat(clinical.trial$age, lower = 30, upper = 70, by = 5))
    30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69   70+ 
        0     0     3     6    22    18    22    20     9  
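
One case the examples above do not exercise is an age below the lower bound. Such a value falls outside every interval, so cut returns NA for it. The sketch below copies age.cat from this post so that it runs on its own, and applies it to a few made-up ages (test.ages is illustrative).

```r
## age.cat as defined earlier in this post, repeated so the block is
## self-contained
age.cat <- function(x, lower = 0, upper, by = 10,
                    sep = "-", above.char = "+") {
  labs <- c(paste(seq(lower, upper - by, by = by),
                  seq(lower + by - 1, upper - 1, by = by),
                  sep = sep),
            paste(upper, above.char, sep = ""))
  cut(floor(x), breaks = c(seq(lower, upper, by = by), Inf),
      right = FALSE, labels = labs)
}

## Illustrative ages: one below the lower bound, two on boundaries
test.ages <- c(25, 30, 39.9, 70, 85)
grouped <- age.cat(test.ages, lower = 30, upper = 70)

as.character(grouped)            # NA "30-39" "30-39" "70+" "70+"
table(grouped, useNA = "ifany")  # shows the NA instead of dropping it
```

By default table leaves NA values out, so it is worth checking for them whenever the lower bound is raised above 0.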
    
    

    Summary of cut

    The cut function is useful for turning continuous variables into factors. You saw how to specify the number of intervals, how to specify the exact cutpoints, and a function built around cut that simplifies categorizing an age variable and giving it appropriate labels.

  • Source: https://www.cnblogs.com/chickenwrap/p/10241140.html