• Fast data loading from files to R


    Recently we were building a Shiny App in which we had to load data from a very large dataframe. This directly affected the app initialization time, so we looked into different ways of reading data from files into R (in our case, customer-provided csv files) to identify the best one.

    The goal of my post is to compare:

    1. read.csv from utils, which was the standard way of reading csv files to R in RStudio,
    2. read_csv from readr which replaced the former method as a standard way of doing it in RStudio,
    3. load and readRDS from base, and
    4. read_feather from feather and fread from data.table.

    Data

    First let’s generate some random data:

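    # Fix the seed so the generated data is reproducible
    set.seed(123)
    # 10 integer columns plus 10 columns of random 5-character strings
    df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, replace = TRUE)),
                     replicate(10, stringi::stri_rand_strings(1000, 5)))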

    and save the files to disk to evaluate the loading time. Besides the csv format we will also need feather, RDS and RData files.

    path_csv <- '../assets/data/fast_load/df.csv'
    path_feather <- '../assets/data/fast_load/df.feather'
    path_rdata <- '../assets/data/fast_load/df.RData'
    path_rds <- '../assets/data/fast_load/df.rds'
    library(feather)
    library(data.table)
    # Write the same data frame in each of the four formats
    write.csv(df, file = path_csv, row.names = FALSE)
    write_feather(df, path_feather)
    save(df, file = path_rdata)
    saveRDS(df, path_rds)

    Next, let’s check the file sizes:

    # Reuse the paths defined above instead of repeating the literals
    files <- c(path_csv, path_feather, path_rdata, path_rds)
    info <- file.info(files)
    # file.info() reports bytes; convert to megabytes
    info$size_mb <- info$size / (1024 * 1024)
    print(subset(info, select = c("size_mb")))
    ##                                       size_mb
    ## ../assets/data/fast_load/df.csv     1780.3005
    ## ../assets/data/fast_load/df.feather 1145.2881
    ## ../assets/data/fast_load/df.RData    285.4836
    ## ../assets/data/fast_load/df.rds      285.4837

    As we can see, the csv and feather files take up much more storage space than the RDS and RData files: the csv file is more than 6 times larger and the feather file more than 4 times larger.
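
    The ratios quoted above can be checked with a few lines of arithmetic. This is only a sketch: the sizes are copied by hand from the file.info() output above, and the vector names are made up for the example.

```r
# Sizes in MB, copied by hand from the file.info() output above
size_mb <- c(csv = 1780.3005, feather = 1145.2881,
             RData = 285.4836, rds = 285.4837)

# How many times larger each file is than the RDS file
ratio_to_rds <- round(size_mb / size_mb[["rds"]], 2)
print(ratio_to_rds)
```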

    Benchmark

    We will use the microbenchmark package to compare the reading times of the following methods:

    • utils::read.csv
    • readr::read_csv
    • data.table::fread
    • base::load
    • base::readRDS
    • feather::read_feather

    in 10 rounds.

    library(microbenchmark)
    # Each expression is evaluated 10 times; progress output is switched off
    benchmark <- microbenchmark(readCSV = utils::read.csv(path_csv),
                                readrCSV = readr::read_csv(path_csv, progress = FALSE),
                                fread = data.table::fread(path_csv, showProgress = FALSE),
                                loadRdata = base::load(path_rdata),
                                readRds = base::readRDS(path_rds),
                                readFeather = feather::read_feather(path_feather),
                                times = 10)
    print(benchmark, signif = 2)
    ## Unit: seconds
    ##        expr   min    lq       mean median    uq   max neval
    ##     readCSV 200.0 200.0 211.187125  210.0 220.0 240.0    10
    ##    readrCSV  27.0  28.0  29.770890   29.0  32.0  33.0    10
    ##       fread  15.0  16.0  17.250016   17.0  17.0  22.0    10
    ##   loadRdata   4.4   4.7   5.018918    4.8   5.5   5.9    10
    ##     readRds   4.6   4.7   5.053674    5.1   5.3   5.6    10
    ## readFeather   1.5   1.8   2.988021    3.4   3.6   4.1    10

    And the winner is… feather! However, using feather requires converting the file to the feather format first.
    Using load or readRDS also improves performance considerably (they take second and third place in terms of speed) and has the added benefit of storing a smaller, compressed file. In both cases you will have to convert your file to the proper format first.
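
    The smaller file size comes from compression: saveRDS compresses with gzip by default, and its compress argument can switch that off. The sketch below uses a small made-up data frame and temporary files (not the benchmark data) to show the size difference. It says nothing about read times, which were not measured here.

```r
# Small made-up data frame; not the benchmark data from this post
set.seed(1)
small_df <- data.frame(x = sample(0:2000, 1e5, replace = TRUE))

gz_path  <- tempfile(fileext = ".rds")
raw_path <- tempfile(fileext = ".rds")

saveRDS(small_df, gz_path)                     # default: gzip compression
saveRDS(small_df, raw_path, compress = FALSE)  # no compression

# Compare the resulting file sizes in KB
sizes_kb <- file.size(c(gz_path, raw_path)) / 1024
names(sizes_kb) <- c("gzip", "uncompressed")
print(round(sizes_kb))

# Both files hold the same object
stopifnot(identical(readRDS(gz_path), readRDS(raw_path)))
```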

    When it comes to reading from the csv format, fread significantly beats read_csv and read.csv, and thus is the best option for reading a csv file.
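
    To put rough numbers on these comparisons, the median times from the benchmark table can be turned into ratios. This is a sketch only: the medians are copied by hand from the rounded output above, so the ratios are approximate.

```r
# Median read times in seconds, copied from the benchmark output above
median_s <- c(readCSV = 210, readrCSV = 29, fread = 17,
              loadRdata = 4.8, readRds = 5.1, readFeather = 3.4)

# Slowdown of every method relative to the fastest one
slowdown <- round(median_s / min(median_s), 1)
print(sort(slowdown))

# Among the csv readers only: how much slower than fread
csv_readers <- median_s[c("readCSV", "readrCSV", "fread")]
vs_fread <- round(csv_readers / csv_readers[["fread"]], 1)
print(vs_fread)
```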

    In our case we decided to go with the feather file, since converting from csv to this format is just a one-time job and we didn’t have a strict storage limit that would have made us consider the Rds or RData format.

    The final workflow was:

    1. reading a csv file provided by our customer using fread,
    2. writing it to feather using write_feather, and
    3. loading a feather file on app initialization using read_feather.

    The first two tasks were done once, outside of the Shiny App context.
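
    The workflow above can be sketched roughly as follows. The paths, the tiny stand-in csv file and the helper name convert_csv_to_feather are assumptions made for this example, not our actual code. The sketch assumes the data.table and feather packages are installed.

```r
library(data.table)
library(feather)

# Hypothetical paths and a tiny stand-in for the customer csv file
csv_path     <- file.path(tempdir(), "customer.csv")
feather_path <- file.path(tempdir(), "customer.feather")
write.csv(data.frame(id = 1:3, label = c("a", "b", "c")),
          csv_path, row.names = FALSE)

# Steps 1 and 2: one-time conversion, run outside of the Shiny app
convert_csv_to_feather <- function(csv_path, feather_path) {
  dt <- fread(csv_path, showProgress = FALSE)  # fast csv read
  write_feather(dt, feather_path)              # store in feather format
  invisible(feather_path)
}
convert_csv_to_feather(csv_path, feather_path)

# Step 3: what the app does on initialization
app_data <- read_feather(feather_path)
print(app_data)
```

    Keeping the conversion in its own script means the slow csv parse is never paid at app startup.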

    There is also quite an interesting benchmark by Hadley on reading complete files into R. Unfortunately, if you use the functions defined in that post, you will end up with a character object, and you will have to apply string manipulations to obtain the more commonly used dataframe.
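
    As a rough illustration of that last point (these are not the functions from Hadley’s post), the base R sketch below reads a small made-up csv file into a single string with readChar and then has to parse that string to get a dataframe. The file and its contents are assumptions for the example.

```r
# Hypothetical example file; not the benchmark data from this post
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, label = c("a", "b", "c")),
          path, row.names = FALSE)

# Read the complete file into one character string
raw_text <- readChar(path, nchars = file.size(path), useBytes = TRUE)
stopifnot(is.character(raw_text), length(raw_text) == 1)

# Extra parsing step needed to get from the string to a data frame
parsed <- read.csv(text = raw_text, stringsAsFactors = FALSE)
str(parsed)
```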

    Reposted from: http://blog.appsilondatascience.com/rstats/2017/04/11/fast-data-load.html

  • Original address: https://www.cnblogs.com/payton/p/6697764.html