• Is the “*apply” family really not vectorized?


    Question:

    So we are used to say to every R new user that "apply isn't vectorized, check out the Patrick Burns R Inferno Circle 4" which says (I quote):

    A common reflex is to use a function in the apply family. This is not vectorization, it is loop-hiding. The apply function has a for loop in its definition. The lapply function buries the loop, but execution times tend to be roughly equal to an explicit for loop.

    Indeed, a quick look on the apply source code reveals the loop:

    grep("for", capture.output(getAnywhere("apply")), value = TRUE)
    ## [1] "        for (i in 1L:d2) {"  "    else for (i in 1L:d2) {"

    Ok so far, but a look at lapply or vapply actually reveals a completely different picture:

    lapply
    ## function (X, FUN, ...) 
    ## {
    ##     FUN <- match.fun(FUN)
    ##     if (!is.vector(X) || is.object(X)) 
    ##        X <- as.list(X)
    ##     .Internal(lapply(X, FUN))
    ## }
    ## <bytecode: 0x000000000284b618>
    ## <environment: namespace:base>

    So apparently there is no R for loop hiding there, rather they are calling internal C written function.

    A quick look in the rabbit hole reveals pretty much the same picture

    Moreover, let's take the colMeans function for example, which was never accused in not being vectorised

    colMeans
    # function (x, na.rm = FALSE, dims = 1L) 
    # {
    #   if (is.data.frame(x)) 
    #     x <- as.matrix(x)
    #   if (!is.array(x) || length(dn <- dim(x)) < 2L) 
    #     stop("'x' must be an array of at least two dimensions")
    #   if (dims < 1L || dims > length(dn) - 1L) 
    #     stop("invalid 'dims'")
    #   n <- prod(dn[1L:dims])
    #   dn <- dn[-(1L:dims)]
    #   z <- if (is.complex(x)) 
    #     .Internal(colMeans(Re(x), n, prod(dn), na.rm)) + (0+1i) * 
    #     .Internal(colMeans(Im(x), n, prod(dn), na.rm))
    #   else .Internal(colMeans(x, n, prod(dn), na.rm))
    #   if (length(dn) > 1L) {
    #     dim(z) <- dn
    #     dimnames(z) <- dimnames(x)[-(1L:dims)]
    #   }
    #   else names(z) <- dimnames(x)[[dims + 1]]
    #   z
    # }
    # <bytecode: 0x0000000008f89d20>
    #   <environment: namespace:base>

    Huh? It also just calls .Internal(colMeans(... which we can also find in the rabbit hole. So how is this different from .Internal(lapply(..?

    Actually a quick benchmark reveals that sapply performs no worse than colMeans and much better than a for loop for a big data set

    m <- as.data.frame(matrix(1:1e7, ncol = 1e5))
    system.time(colMeans(m))
    # user  system elapsed 
    # 1.69    0.03    1.73 
    system.time(sapply(m, mean))
    # user  system elapsed 
    # 1.50    0.03    1.60 
    system.time(apply(m, 2, mean))
    # user  system elapsed 
    # 3.84    0.03    3.90 
    system.time(for(i in 1:ncol(m)) mean(m[, i]))
    # user  system elapsed 
    # 13.78    0.01   13.93 

    In other words, is it correct to say that lapply and vapply are actually vectorised (compared to apply which is a for loop that also calls lapply) and what did Patrick Burns really mean to say?



    Answer:

    First of all, in your example you make tests on a "data.frame" which is not fair for colMeans, apply and "[.data.frame" since they have an overhead:

    system.time(as.matrix(m))  #called by `colMeans` and `apply`
    #   user  system elapsed 
    #   1.03    0.00    1.05
    system.time(for(i in 1:ncol(m)) m[, i])  #in the `for` loop
    #   user  system elapsed 
    #  12.93    0.01   13.07

    On a matrix, the picture is a bit different:

    mm = as.matrix(m)
    system.time(colMeans(mm))
    #   user  system elapsed 
    #   0.01    0.00    0.01 
    system.time(apply(mm, 2, mean))
    #   user  system elapsed 
    #   1.48    0.03    1.53 
    system.time(for(i in 1:ncol(mm)) mean(mm[, i]))
    #   user  system elapsed 
    #   1.22    0.00    1.21

    Regading the main part of the question, the main difference between lapply/mapply/etc and straightforward R-loops is where the looping is done. As Roland notes, both C and R loops need to evaluate an R function in each iteration which is the most costly. The really fast C functions are those that do everything in C, so, I guess, this should be what "vectorised" is about? An example where we find the mean in each of a "list"s elements:

    #all computations in C
    all_C = inline::cfunction(sig = c(R_ls = "list"), body = '
        SEXP tmp, ans;
        PROTECT(ans = allocVector(REALSXP, LENGTH(R_ls)));
    
        double *ptmp, *pans = REAL(ans);
    
        for(int i = 0; i < LENGTH(R_ls); i++) {
            pans[i] = 0.0;
    
            PROTECT(tmp = coerceVector(VECTOR_ELT(R_ls, i), REALSXP));
            ptmp = REAL(tmp);
    
            for(int j = 0; j < LENGTH(tmp); j++) pans[i] += ptmp[j];
    
            pans[i] /= LENGTH(tmp);
    
            UNPROTECT(1);
        }
    
        UNPROTECT(1);
        return(ans);
    ')
    
    #a very simple `lapply(x, mean)`
    C_and_R = inline::cfunction(sig = c(R_ls = "list"), body = '
        SEXP call, ans, ret;
    
        PROTECT(call = allocList(2));
        SET_TYPEOF(call, LANGSXP);
        SETCAR(call, install("mean"));
    
        PROTECT(ans = allocVector(VECSXP, LENGTH(R_ls)));
        PROTECT(ret = allocVector(REALSXP, LENGTH(ans)));
    
        for(int i = 0; i < LENGTH(R_ls); i++) {
            SETCADR(call, VECTOR_ELT(R_ls, i));
            SET_VECTOR_ELT(ans, i, eval(call, R_GlobalEnv));
        }
    
        double *pret = REAL(ret);
        for(int i = 0; i < LENGTH(ans); i++) pret[i] = REAL(VECTOR_ELT(ans, i))[0];
    
        UNPROTECT(3);
        return(ret);
    ')                    
    
    R_lapply = function(x) unlist(lapply(x, mean))                       
    
    R_loop = function(x) 
    {
        ans = numeric(length(x))
        for(i in seq_along(x)) ans[i] = mean(x[[i]])
        return(ans)
    } 
    
    R_loopcmp = compiler::cmpfun(R_loop)
    
    
    set.seed(007); myls = replicate(1e4, runif(1e3), simplify = FALSE)
    all.equal(all_C(myls), C_and_R(myls))
    #[1] TRUE
    all.equal(all_C(myls), R_lapply(myls))
    #[1] TRUE
    all.equal(all_C(myls), R_loop(myls))
    #[1] TRUE
    all.equal(all_C(myls), R_loopcmp(myls))
    #[1] TRUE
    
    microbenchmark::microbenchmark(all_C(myls), 
                                   C_and_R(myls), 
                                   R_lapply(myls), 
                                   R_loop(myls), 
                                   R_loopcmp(myls), 
                                   times = 15)
    #Unit: milliseconds
    #            expr       min        lq    median        uq      max neval
    #     all_C(myls)  37.29183  38.19107  38.69359  39.58083  41.3861    15
    #   C_and_R(myls) 117.21457 123.22044 124.58148 130.85513 169.6822    15
    #  R_lapply(myls)  98.48009 103.80717 106.55519 109.54890 116.3150    15
    #    R_loop(myls) 122.40367 130.85061 132.61378 138.53664 178.5128    15
    # R_loopcmp(myls) 105.63228 111.38340 112.16781 115.68909 128.1976    15

  • 相关阅读:
    软件测试 Lab1 实验报告
    软件测试 Homework2
    谈谈最近的一个让我印象深刻的错误
    Bill Manager Problem Statement
    C#学习记录(九)Windows Phone开发中的Binding
    C#学习记录(八) XML Serializer尝试
    C#学习记录(七)LINQ语句及LAMDA表达式
    C#学习记录(六)
    软件测试之作业三
    软件测试之实验一
  • 原文地址:https://www.cnblogs.com/vigorz/p/10499222.html
Copyright © 2020-2023  润新知