• With our powers combined! xgboost and pipelearner


    @drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.

     Why a post on xgboost and pipelearner?

    xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-fold validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!

    The only problem - out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.

     Setup

    To follow this post you’ll need the following packages:

    # Install (if necessary)
    install.packages(c("xgboost", "tidyverse", "devtools"))
    devtools::install_github("drsimonj/pipelearner")
    
    # Attach
    library(tidyverse)
    library(xgboost)
    library(pipelearner)
    library(lazyeval)
    

    Our example will be to try and predict whether tumours are cancerous or not using the Breast Cancer Wisconsin (Diagnostic) Data Set. Set up as follows:

    data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
    
    d <- read_csv(
      data_url,
      col_names = c('id', 'thinkness', 'size_uniformity',
                    'shape_uniformity', 'adhesion', 'epith_size',
                    'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>% 
      select(-id) %>%            # Remove id; not useful here
      filter(nuclei != '?') %>%  # Remove records with missing data
      mutate(cancer = cancer == 4) %>% # one-hot encode 'cancer' as 1=malignant;0=benign
      mutate_all(as.numeric)     # All to numeric; needed for XGBoost
    
    d
    #> # A tibble: 683 × 10
    #>    thinkness size_uniformity shape_uniformity adhesion epith_size nuclei
    #>        <dbl>           <dbl>            <dbl>    <dbl>      <dbl>  <dbl>
    #> 1          5               1                1        1          2      1
    #> 2          5               4                4        5          7     10
    #> 3          3               1                1        1          2      2
    #> 4          6               8                8        1          3      4
    #> 5          4               1                1        3          2      1
    #> 6          8              10               10        8          7     10
    #> 7          1               1                1        1          2     10
    #> 8          2               1                2        1          2      1
    #> 9          2               1                1        1          2      1
    #> 10         4               2                1        1          2      1
    #> # ... with 673 more rows, and 4 more variables: chromatin <dbl>,
    #> #   nucleoli <dbl>, mitoses <dbl>, cancer <dbl>
    

     pipelearner

    pipelearner makes it easy to do lots of routine machine learning tasks, many of which you can check out in this post. For this example, we’ll use pipelearner to perform a grid search of some xgboost hyperparameters.

    Grid searching is easy with pipelearner. For detailed instructions, check out my previous post: tidy grid search with pipelearner. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here’s an example that would grid search multiple values of minsplit and maxdepth for an rpart decision tree:

    pipelearner(d, rpart::rpart, cancer ~ .,
                minsplit = c(2, 4, 6, 8, 10),
                maxdepth = c(2, 3, 4, 5))
    

    The challenge for xgboost:

    pipelearner expects a model function that has two arguments: data andformula

     xgboost

    Here’s an xgboost model:

    # Prep data (X) and labels (y)
    X <- select(d, -cancer) %>% as.matrix()
    y <- d$cancer
    
    # Fit the model
    fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic")
    #> [1]  train-rmse:0.372184 
    #> [2]  train-rmse:0.288560 
    #> [3]  train-rmse:0.230171 
    #> [4]  train-rmse:0.188965 
    #> [5]  train-rmse:0.158858
    
    # Examine accuracy
    predicted <- as.numeric(predict(fit, X) >= .5)
    mean(predicted == y)
    #> [1] 0.9838946
    

    Look like we have a model with 98.39% accuracy on the training data!

    Regardless, notice that first two arguments to xgboost() are a numeric data matrix and a numeric label vector. This is not what pipelearner wants!

     Wrapper function to parse data and formula

    To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data and formula, and uses these to pass a feature matrix and label vector to xgboost:

    pl_xgboost <- function(data, formula, ...) {
      data <- as.data.frame(data)
    
      X_names <- as.character(f_rhs(formula))
      y_name  <- as.character(f_lhs(formula))
    
      if (X_names == '.') {
        X_names <- names(data)[names(data) != y_name]
      }
    
      X <- data.matrix(data[, X_names])
      y <- data[[y_name]]
    
      xgboost(data = X, label = y, ...)
    }
    

    Let’s try it out:

    pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic")
    #> [1]  train-rmse:0.372184 
    #> [2]  train-rmse:0.288560 
    #> [3]  train-rmse:0.230171 
    #> [4]  train-rmse:0.188965 
    #> [5]  train-rmse:0.158858
    
    # Examine accuracy
    pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5)
    mean(pl_predicted == y)
    #> [1] 0.9838946
    

    Perfect!

     Bringing it all together

    We can now use pipelearner and pl_xgboost() for easy grid searching:

    pl <- pipelearner(d, pl_xgboost, cancer ~ .,
                      nrounds = c(5, 10, 25),
                      eta = c(.1, .3),
                      max_depth = c(4, 6))
    
    fits <- pl %>% learn()
    #> [1]  train-rmse:0.453832 
    #> [2]  train-rmse:0.412548 
    #> ...
    
    fits
    #> # A tibble: 12 × 9
    #>    models.id cv_pairs.id train_p               fit target      model
    #>        <chr>       <chr>   <dbl>            <list>  <chr>      <chr>
    #> 1          1           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 2         10           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 3         11           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 4         12           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 5          2           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 6          3           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 7          4           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 8          5           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 9          6           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 10         7           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 11         8           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 12         9           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> # ... with 3 more variables: params <list>, train <list>, test <list>
    

    Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:

    accuracy <- function(fit, data, target_var) {
      # Convert resample object to data frame
      data <- as.data.frame(data)
      # Get feature matrix and labels
      X <- data %>%
        select(-matches(target_var)) %>% 
        as.matrix()
      y <- data[[target_var]]
      # Obtain predicted class
      y_hat <- as.numeric(predict(fit, X) > .5)
      # Return accuracy
      mean(y_hat == y)
    }
    
    results <- fits %>% 
      mutate(
        # hyperparameters
        nrounds   = map_dbl(params, "nrounds"),
        eta       = map_dbl(params, "eta"),
        max_depth = map_dbl(params, "max_depth"),
        # Accuracy
        accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
        accuracy_test  = pmap_dbl(list(fit, test,  target), accuracy)
      ) %>% 
      # Select columns and order rows
      select(nrounds, eta, max_depth, contains("accuracy")) %>% 
      arrange(desc(accuracy_test), desc(accuracy_train))
    
    results
    #> # A tibble: 12 × 5
    #>    nrounds   eta max_depth accuracy_train accuracy_test
    #>      <dbl> <dbl>     <dbl>          <dbl>         <dbl>
    #> 1       25   0.3         6      1.0000000     0.9489051
    #> 2       25   0.3         4      1.0000000     0.9489051
    #> 3       10   0.3         6      0.9981685     0.9489051
    #> 4        5   0.3         6      0.9945055     0.9489051
    #> 5       10   0.1         6      0.9945055     0.9489051
    #> 6       25   0.1         6      0.9945055     0.9489051
    #> 7        5   0.1         6      0.9926740     0.9489051
    #> 8       25   0.1         4      0.9890110     0.9489051
    #> 9       10   0.3         4      0.9871795     0.9489051
    #> 10       5   0.3         4      0.9853480     0.9489051
    #> 11      10   0.1         4      0.9853480     0.9416058
    #> 12       5   0.1         4      0.9835165     0.9416058
    

    Our top model, which got 94.89% on a test set, had nrounds = 25, eta = 0.3, and max_depth = 6.

    Either way, the trick was the wrapper function pl_xgboost() that let us bridge xgboost and pipelearner. Note that this same principle can be used for any other machine learning functions that don’t play nice with pipelearner.

     Bonus: bootstrapped cross validation

    For those of you who are comfortable, below is a bonus example of using 100 boostrapped cross validation samples to examine consistency in the accuracy. It doesn’t get much easier than using pipelearner!

    results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>% 
      learn_cvpairs(n = 100) %>% 
      learn() %>% 
      mutate(
        test_accuracy  = pmap_dbl(list(fit, test,  target), accuracy)
      )
    #> [1]  train-rmse:0.357471 
    #> [2]  train-rmse:0.256735 
    #> ...
    
    results %>% 
      ggplot(aes(test_accuracy)) +
        geom_histogram(bins = 30) +
        scale_x_continuous(labels = scales::percent) +
        theme_minimal() +
        labs(x = "Accuracy", y = "Number of samples",
             title = "Test accuracy distribution for
    100 bootstrapped samples")
    

    unnamed-chunk-11-1.jpg

     Sign off

    Thanks for reading and I hope this was useful for you.

    For updates of recent blog posts, follow @drsimonj on Twitter, or email me atdrsimonjackson@gmail.com to get in touch.

    If you’d like the code that produced this blog, check out the blogR GitHub repository.

    转自:https://drsimonj.svbtle.com/with-our-powers-combined-xgboost-and-pipelearner

  • 相关阅读:
    游戏编程模式--原型模式
    游戏编程模式--观察者模式
    游戏编程模式--享元模式
    游戏编程模式--命令模式
    mybatis的线程安全
    开发遇到的问题
    spring的ThreadLocal解决线程安全
    i++
    jvm内存初步了解
    注解@RequestMapping,@RequestBody
  • 原文地址:https://www.cnblogs.com/payton/p/6375101.html
Copyright © 2020-2023  润新知