Семейство «* apply» действительно не векторизовано?

Question 1

Таким образом, мы привыкли говорить каждому новому пользователю R, что « applyне векторизован, посмотрите Patrick Burns R Inferno Circle 4 », в котором говорится (цитирую):

Распространенный рефлекс - использовать функцию из семейства apply. Это не векторизация, это скрытие петель . В определении функции apply есть цикл for. Функция lapply скрывает цикл, но время выполнения обычно примерно равно явному циклу for.

Действительно, беглый взгляд на applyисходный код обнаруживает цикл:

grep("for", capture.output(getAnywhere("apply")), value = TRUE)
## [1] "        for (i in 1L:d2) {"  "    else for (i in 1L:d2) {"

Пока что хорошо, но если взглянуть на нее lapplyили vapplyувидеть ее, то картина будет совершенно иной:

lapply
## function (X, FUN, ...) 
## {
##     FUN <- match.fun(FUN)
##     if (!is.vector(X) || is.object(X)) 
##        X <- as.list(X)
##     .Internal(lapply(X, FUN))
## }
## <bytecode: 0x000000000284b618>
## <environment: namespace:base>

Таким образом, очевидно, что там нет скрытого forцикла R , скорее они вызывают внутреннюю написанную функцию C.

Беглый взгляд в кроличью нору показывает почти ту же картину

Более того, возьмем colMeansдля примера функцию, которую никогда не обвиняли в том, что она не векторизована.

colMeans
# function (x, na.rm = FALSE, dims = 1L) 
# {
#   if (is.data.frame(x)) 
#     x <- as.matrix(x)
#   if (!is.array(x) || length(dn <- dim(x)) < 2L) 
#     stop("'x' must be an array of at least two dimensions")
#   if (dims < 1L || dims > length(dn) - 1L) 
#     stop("invalid 'dims'")
#   n <- prod(dn[1L:dims])
#   dn <- dn[-(1L:dims)]
#   z <- if (is.complex(x)) 
#     .Internal(colMeans(Re(x), n, prod(dn), na.rm)) + (0+1i) * 
#     .Internal(colMeans(Im(x), n, prod(dn), na.rm))
#   else .Internal(colMeans(x, n, prod(dn), na.rm))
#   if (length(dn) > 1L) {
#     dim(z) <- dn
#     dimnames(z) <- dimnames(x)[-(1L:dims)]
#   }
#   else names(z) <- dimnames(x)[[dims + 1]]
#   z
# }
# <bytecode: 0x0000000008f89d20>
#   <environment: namespace:base>

А? Это также просто вызовы, .Internal(colMeans(...которые мы также можем найти в кроличьей норе . Так чем это отличается от .Internal(lapply(..?

Фактически, быстрый тест показывает, что он sapplyработает не хуже colMeansи намного лучше, чем forцикл для большого набора данных.

m <- as.data.frame(matrix(1:1e7, ncol = 1e5))
system.time(colMeans(m))
# user  system elapsed 
# 1.69    0.03    1.73 
system.time(sapply(m, mean))
# user  system elapsed 
# 1.50    0.03    1.60 
system.time(apply(m, 2, mean))
# user  system elapsed 
# 3.84    0.03    3.90 
system.time(for(i in 1:ncol(m)) mean(m[, i]))
# user  system elapsed 
# 13.78    0.01   13.93

Другими словами, это правильно сказать , что lapplyи vapply на самом деле vectorised ( по сравнению с applyкоторой является forцикл , который также называет lapply) , и что же Патрик Бернс действительно хотел сказать?

Question 2

Прежде всего, в вашем примере вы проводите тесты на "data.frame", что несправедливо colMeans, applyи "[.data.frame"поскольку они имеют накладные расходы:

system.time(as.matrix(m))  #called by `colMeans` and `apply`
#   user  system elapsed 
#   1.03    0.00    1.05
system.time(for(i in 1:ncol(m)) m[, i])  #in the `for` loop
#   user  system elapsed 
#  12.93    0.01   13.07

На матрице картина немного другая:

mm = as.matrix(m)
system.time(colMeans(mm))
#   user  system elapsed 
#   0.01    0.00    0.01 
system.time(apply(mm, 2, mean))
#   user  system elapsed 
#   1.48    0.03    1.53 
system.time(for(i in 1:ncol(mm)) mean(mm[, i]))
#   user  system elapsed 
#   1.22    0.00    1.21

Что касается основной части вопроса, основное различие между lapply/ mapply/ etc и простыми R-циклами заключается в том, где выполняется цикл. Как отмечает Роланд, циклы C и R должны оценивать функцию R на каждой итерации, что является наиболее затратным. Действительно быстрые функции C - это те, которые делают все на C, так что, я думаю, это должно быть то, о чем "векторизация"?

Пример, в котором мы находим среднее значение в каждом элементе «списка»:

( ИЗМЕНИТЬ 11 мая '16 : я считаю, что пример с нахождением «среднего» не является хорошей установкой для различий между итеративным вычислением функции R и скомпилированным кодом, (1) из-за особенностей алгоритма среднего R для «числового» s по сравнению с простым sum(x) / length(x)и (2) должно иметь больше смысла тестировать на "list" s с length(x) >> lengths(x). Таким образом, "средний" пример перемещается в конец и заменяется другим.)

В качестве простого примера мы могли бы рассмотреть нахождение противоположности каждому length == 1элементу «списка»:

В tmp.cфайле:

#include <R.h>
#define USE_RINTERNALS 
#include <Rinternals.h>
#include <Rdefines.h>

/* call a C function inside another */
double oppC(double x) { return(ISNAN(x) ? NA_REAL : -x); }
SEXP sapply_oppC(SEXP x)
{
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));
    for(int i = 0; i < LENGTH(x); i++) 
        REAL(ans)[i] = oppC(REAL(VECTOR_ELT(x, i))[0]);

    UNPROTECT(1);
    return(ans);
}

/* call an R function inside a C function;
 * will be used with 'f' as a closure and as a builtin */    
SEXP sapply_oppR(SEXP x, SEXP f)
{
    SEXP call = PROTECT(allocVector(LANGSXP, 2));
    SETCAR(call, install(CHAR(STRING_ELT(f, 0))));

    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));     
    for(int i = 0; i < LENGTH(x); i++) { 
        SETCADR(call, VECTOR_ELT(x, i));
        REAL(ans)[i] = REAL(eval(call, R_GlobalEnv))[0];
    }

    UNPROTECT(2);
    return(ans);
}

А в стороне R:

system("R CMD SHLIB /home/~/tmp.c")
dyn.load("/home/~/tmp.so")

с данными:

set.seed(007)
myls = rep_len(as.list(c(NA, runif(3))), 1e7)

#a closure wrapper of `-`
oppR = function(x) -x

for_oppR = compiler::cmpfun(function(x, f)
{
    f = match.fun(f)  
    ans = numeric(length(x))
    for(i in seq_along(x)) ans[[i]] = f(x[[i]])
    return(ans)
})

Бенчмаркинг:

#call a C function iteratively
system.time({ sapplyC =  .Call("sapply_oppC", myls) }) 
#   user  system elapsed 
#  0.048   0.000   0.047 

#evaluate an R closure iteratively
system.time({ sapplyRC =  .Call("sapply_oppR", myls, "oppR") }) 
#   user  system elapsed 
#  3.348   0.000   3.358 

#evaluate an R builtin iteratively
system.time({ sapplyRCprim =  .Call("sapply_oppR", myls, "-") }) 
#   user  system elapsed 
#  0.652   0.000   0.653 

#loop with a R closure
system.time({ forR = for_oppR(myls, "oppR") })
#   user  system elapsed 
#  4.396   0.000   4.409 

#loop with an R builtin
system.time({ forRprim = for_oppR(myls, "-") })
#   user  system elapsed 
#  1.908   0.000   1.913 

#for reference and testing 
system.time({ sapplyR = unlist(lapply(myls, oppR)) })
#   user  system elapsed 
#  7.080   0.068   7.170 
system.time({ sapplyRprim = unlist(lapply(myls, `-`)) }) 
#   user  system elapsed 
#  3.524   0.064   3.598 

all.equal(sapplyR, sapplyRprim)
#[1] TRUE 
all.equal(sapplyR, sapplyC)
#[1] TRUE
all.equal(sapplyR, sapplyRC)
#[1] TRUE
all.equal(sapplyR, sapplyRCprim)
#[1] TRUE
all.equal(sapplyR, forR)
#[1] TRUE
all.equal(sapplyR, forRprim)
#[1] TRUE

(Follows the original example of mean finding):

#all computations in C
all_C = inline::cfunction(sig = c(R_ls = "list"), body = '
    SEXP tmp, ans;
    PROTECT(ans = allocVector(REALSXP, LENGTH(R_ls)));

    double *ptmp, *pans = REAL(ans);

    for(int i = 0; i < LENGTH(R_ls); i++) {
        pans[i] = 0.0;

        PROTECT(tmp = coerceVector(VECTOR_ELT(R_ls, i), REALSXP));
        ptmp = REAL(tmp);

        for(int j = 0; j < LENGTH(tmp); j++) pans[i] += ptmp[j];

        pans[i] /= LENGTH(tmp);

        UNPROTECT(1);
    }

    UNPROTECT(1);
    return(ans);
')

#a very simple `lapply(x, mean)`
C_and_R = inline::cfunction(sig = c(R_ls = "list"), body = '
    SEXP call, ans, ret;

    PROTECT(call = allocList(2));
    SET_TYPEOF(call, LANGSXP);
    SETCAR(call, install("mean"));

    PROTECT(ans = allocVector(VECSXP, LENGTH(R_ls)));
    PROTECT(ret = allocVector(REALSXP, LENGTH(ans)));

    for(int i = 0; i < LENGTH(R_ls); i++) {
        SETCADR(call, VECTOR_ELT(R_ls, i));
        SET_VECTOR_ELT(ans, i, eval(call, R_GlobalEnv));
    }

    double *pret = REAL(ret);
    for(int i = 0; i < LENGTH(ans); i++) pret[i] = REAL(VECTOR_ELT(ans, i))[0];

    UNPROTECT(3);
    return(ret);
')                    

R_lapply = function(x) unlist(lapply(x, mean))                       

R_loop = function(x) 
{
    ans = numeric(length(x))
    for(i in seq_along(x)) ans[i] = mean(x[[i]])
    return(ans)
} 

R_loopcmp = compiler::cmpfun(R_loop)


set.seed(007); myls = replicate(1e4, runif(1e3), simplify = FALSE)
all.equal(all_C(myls), C_and_R(myls))
#[1] TRUE
all.equal(all_C(myls), R_lapply(myls))
#[1] TRUE
all.equal(all_C(myls), R_loop(myls))
#[1] TRUE
all.equal(all_C(myls), R_loopcmp(myls))
#[1] TRUE

microbenchmark::microbenchmark(all_C(myls), 
                               C_and_R(myls), 
                               R_lapply(myls), 
                               R_loop(myls), 
                               R_loopcmp(myls), 
                               times = 15)
#Unit: milliseconds
#            expr       min        lq    median        uq      max neval
#     all_C(myls)  37.29183  38.19107  38.69359  39.58083  41.3861    15
#   C_and_R(myls) 117.21457 123.22044 124.58148 130.85513 169.6822    15
#  R_lapply(myls)  98.48009 103.80717 106.55519 109.54890 116.3150    15
#    R_loop(myls) 122.40367 130.85061 132.61378 138.53664 178.5128    15
# R_loopcmp(myls) 105.63228 111.38340 112.16781 115.68909 128.1976    15

Question 3

To me, vectorisation is primarily about making your code easier to write and easier to understand.

The goal of a vectorised function is to eliminate the book-keeping associated with a for loop. For example, instead of:

means <- numeric(length(mtcars))
for (i in seq_along(mtcars)) {
  means[i] <- mean(mtcars[[i]])
}
sds <- numeric(length(mtcars))
for (i in seq_along(mtcars)) {
  sds[i] <- sd(mtcars[[i]])
}

You can write:

means <- vapply(mtcars, mean, numeric(1))
sds   <- vapply(mtcars, sd, numeric(1))

That makes it easier to see what's the same (the input data) and what's different (the function you're applying).

A secondary advantage of vectorisation is that the for-loop is often written in C, rather than in R. This has substantial performance benefits, but I don't think it's the key property of vectorisation. Vectorisation is fundamentally about saving your brain, not saving the computer work.

Question 4

I agree with Patrick Burns' view that it is rather loop hiding and not code vectorisation. Here's why:

Consider this C code snippet:

for (int i=0; i<n; i++)
  c[i] = a[i] + b[i]

What we would like to do is quite clear. But how the task is performed or how it could be performed isn't really. A for-loop by default is a serial construct. It doesn't inform if or how things can be done in parallel.

The most obvious way is that the code is run in a sequential manner. Load a[i] and b[i] on to registers, add them, store the result in c[i], and do this for each i.

However, modern processors have vector or SIMD instruction set which is capable of operating on a vector of data during the same instruction when performing the same operation (e.g., adding two vectors as shown above). Depending on the processor/architecture, it might be possible to add, say, four numbers from a and b under the same instruction, instead of one at a time.

We would like to exploit the Single Instruction Multiple Data and perform data level parallelism, i.e., load 4 things at a time, add 4 things at time, store 4 things at a time, for example. And this is code vectorisation.

Note that this is different from code parallelisation -- where multiple computations are performed concurrently.

It'd be great if the compiler identifies such blocks of code and automatically vectorises them, which is a difficult task. Automatic code vectorisation is a challenging research topic in Computer Science. But over time, compilers have gotten better at it. You can check the auto vectorisation capabilities of GNU-gcc here. Similarly for LLVM-clang here. And you can also find some benchmarks in the last link compared against gcc and ICC (Intel C++ compiler).

gcc (I'm on v4.9) for example doesn't vectorise code automatically at -O2 level optimisation. So if we were to execute the code shown above, it'd be run sequentially. Here's the timing for adding two integer vectors of length 500 million.

We either need to add the flag -ftree-vectorize or change optimisation to level -O3. (Note that -O3 performs other additional optimisations as well). The flag -fopt-info-vec is useful as it informs when a loop was successfully vectorised).

# compiling with -O2, -ftree-vectorize and  -fopt-info-vec
# test.c:32:5: note: loop vectorized
# test.c:32:5: note: loop versioned for vectorization because of possible aliasing
# test.c:32:5: note: loop peeled for vectorization to enhance alignment

This tells us that the function is vectorised. Here are the timings comparing both non-vectorised and vectorised versions on integer vectors of length 500 million:

x = sample(100L, 500e6L, TRUE)
y = sample(100L, 500e6L, TRUE)
z = vector("integer", 500e6L) # result vector

# non-vectorised, -O2
system.time(.Call("Csum", x, y, z))
#    user  system elapsed 
#   1.830   0.009   1.852

# vectorised using flags shown above at -O2
system.time(.Call("Csum", x, y, z))
#    user  system elapsed 
#   0.361   0.001   0.362

# both results are checked for identicalness, returns TRUE

This part can be safely skipped without losing continuity.

Compilers will not always have sufficient information to vectorise. We could use OpenMP specification for parallel programming, which also provides a simd compiler directive to instruct compilers to vectorise the code. It is essential to ensure that there are no memory overlaps, race conditions etc.. when vectorising code manually, else it'll result in incorrect results.

#pragma omp simd
for (i=0; i<n; i++) 
  c[i] = a[i] + b[i]

By doing this, we specifically ask the compiler to vectorise it no matter what. We'll need to activate OpenMP extensions by using compile time flag -fopenmp. By doing that:

# timing with -O2 + OpenMP with simd
x = sample(100L, 500e6L, TRUE)
y = sample(100L, 500e6L, TRUE)
z = vector("integer", 500e6L) # result vector
system.time(.Call("Cvecsum", x, y, z))
#    user  system elapsed 
#   0.360   0.001   0.360

which is great! This was tested with gcc v6.2.0 and llvm clang v3.9.0 (both installed via homebrew, MacOS 10.12.3), both of which support OpenMP 4.0.

In this sense, even though Wikipedia page on Array Programming mentions that languages that operate on entire arrays usually call that as vectorised operations, it really is loop hiding IMO (unless it is actually vectorised).

In case of R, even rowSums() or colSums() code in C don't exploit code vectorisation IIUC; it is just a loop in C. Same goes for lapply(). In case of apply(), it's in R. All of these are therefore loop hiding.

In short, wrapping an R function by:

just writing a for-loop in C != vectorising your code.
just writing a for-loop in R != vectorising your code.

Intel Math Kernel Library (MKL) for example implements vectorised forms of functions.

HTH

References:

Talk by James Reinders, Intel (this answer is mostly an attempt to summarise this excellent talk)

Question 5

So to sum the great answers/comments up into some general answer and provide some background: R has 4 types of loops (in from not-vectorized to vectorized order)

R for loop that repeatedly calls R functions in each iterations (Not vectorised)
C loop that repeatedly calls R functions in each iterations (Not vectorised)
C loop that calls R function only once (Somewhat vectorized)
A plain C loop that doesn't call any R function at all and uses it own compiled functions (Vectorized)

So the *apply family is the second type. Except apply which is more of the first type

You can understand this from the comment in its source code

/* .Internal(lapply(X, FUN)) */

/* This is a special .Internal, so has unevaluated arguments. It is
called from a closure wrapper, so X and FUN are promises. FUN must be unevaluated for use in e.g. bquote . */

That means that lapplys C code accepts an unevaluated function from R and later evaluates it within the C code itself. This is basically the difference between lapplys .Internal call

.Internal(lapply(X, FUN))

Which has a FUN argument that holds an R function

And the colMeans .Internal call which does not have a FUN argument

.Internal(colMeans(Re(x), n, prod(dn), na.rm))

colMeans, unlike lapply knows exactly what function it needs to use, thus it calculates the mean internally within the C code.

You can clearly see the evaluation process of the R function in each iteration within lapply C code

 for(R_xlen_t i = 0; i < n; i++) {
      if (realIndx) REAL(ind)[0] = (double)(i + 1);
      else INTEGER(ind)[0] = (int)(i + 1);
      tmp = eval(R_fcall, rho);   // <----------------------------- here it is
      if (MAYBE_REFERENCED(tmp)) tmp = lazy_duplicate(tmp);
      SET_VECTOR_ELT(ans, i, tmp);
   }

To sum things up, lapply is not vectorized, though it has two possible advantages over the plain R for loop

Accessing and assigning in a loop seems to be faster in C (i.e. in lapplying a function) Although the difference seems big, we, still, stay at the microsecond level and the costly thing is the valuation of an R function in each iteration. A simple example:

ffR = function(x)  {
    ans = vector("list", length(x))
    for(i in seq_along(x)) ans[[i]] = x[[i]]
    ans 
}

ffC = inline::cfunction(sig = c(R_x = "data.frame"), body = '
    SEXP ans;
    PROTECT(ans = allocVector(VECSXP, LENGTH(R_x)));
    for(int i = 0; i < LENGTH(R_x); i++) 
           SET_VECTOR_ELT(ans, i, VECTOR_ELT(R_x, i));
    UNPROTECT(1);
    return(ans); 
')

set.seed(007) 
myls = replicate(1e3, runif(1e3), simplify = FALSE)     
mydf = as.data.frame(myls)

all.equal(ffR(myls), ffC(myls))
#[1] TRUE 
all.equal(ffR(mydf), ffC(mydf))
#[1] TRUE

microbenchmark::microbenchmark(ffR(myls), ffC(myls), 
                               ffR(mydf), ffC(mydf),
                               times = 30)
#Unit: microseconds
#      expr       min        lq    median        uq       max neval
# ffR(myls)  3933.764  3975.076  4073.540  5121.045 32956.580    30
# ffC(myls)    12.553    12.934    16.695    18.210    19.481    30
# ffR(mydf) 14799.340 15095.677 15661.889 16129.689 18439.908    30
# ffC(mydf)    12.599    13.068    15.835    18.402    20.509    30

As mentioned by @Roland, it runs a compiled C loop rather an interpreted R loop

Though when vectorizing your code, there are some things you need to take into account.

If your data set (let's call it df) is of class data.frame, some vectorized functions (such as colMeans, colSums, rowSums, etc.) will have to convert it to a matrix first, simply because this is how they were designed. This means that for a big df this can create a huge overhead. While lapply won't have to do this as it extracts the actual vectors out of df (as data.frame is just a list of vectors) and thus, if you have not so many columns but many rows, lapply(df, mean) can sometimes be better option than colMeans(df).
Another thing to remember is that R has a great variety of different function types, such as .Primitive, and generic (S3, S4) see here for some additional information. The generic function have to do a method dispatch which sometimes a costly operation. For example, mean is generic S3 function while sum is Primitive. Thus some times lapply(df, sum) could be very efficient compared colSums from the reasons listed above

Answer 1

Таким образом, мы привыкли говорить каждому новому пользователю R, что « applyне векторизован, посмотрите Patrick Burns R Inferno Circle 4 », в котором говорится (цитирую):

Распространенный рефлекс - использовать функцию из семейства apply. Это не векторизация, это скрытие петель . В определении функции apply есть цикл for. Функция lapply скрывает цикл, но время выполнения обычно примерно равно явному циклу for.

Действительно, беглый взгляд на applyисходный код обнаруживает цикл:

grep("for", capture.output(getAnywhere("apply")), value = TRUE)
## [1] "        for (i in 1L:d2) {"  "    else for (i in 1L:d2) {"

Пока что хорошо, но если взглянуть на нее lapplyили vapplyувидеть ее, то картина будет совершенно иной:

lapply
## function (X, FUN, ...) 
## {
##     FUN <- match.fun(FUN)
##     if (!is.vector(X) || is.object(X)) 
##        X <- as.list(X)
##     .Internal(lapply(X, FUN))
## }
## <bytecode: 0x000000000284b618>
## <environment: namespace:base>

Таким образом, очевидно, что там нет скрытого forцикла R , скорее они вызывают внутреннюю написанную функцию C.

Беглый взгляд в кроличью нору показывает почти ту же картину

Более того, возьмем colMeansдля примера функцию, которую никогда не обвиняли в том, что она не векторизована.

colMeans
# function (x, na.rm = FALSE, dims = 1L) 
# {
#   if (is.data.frame(x)) 
#     x <- as.matrix(x)
#   if (!is.array(x) || length(dn <- dim(x)) < 2L) 
#     stop("'x' must be an array of at least two dimensions")
#   if (dims < 1L || dims > length(dn) - 1L) 
#     stop("invalid 'dims'")
#   n <- prod(dn[1L:dims])
#   dn <- dn[-(1L:dims)]
#   z <- if (is.complex(x)) 
#     .Internal(colMeans(Re(x), n, prod(dn), na.rm)) + (0+1i) * 
#     .Internal(colMeans(Im(x), n, prod(dn), na.rm))
#   else .Internal(colMeans(x, n, prod(dn), na.rm))
#   if (length(dn) > 1L) {
#     dim(z) <- dn
#     dimnames(z) <- dimnames(x)[-(1L:dims)]
#   }
#   else names(z) <- dimnames(x)[[dims + 1]]
#   z
# }
# <bytecode: 0x0000000008f89d20>
#   <environment: namespace:base>

А? Это также просто вызовы, .Internal(colMeans(...которые мы также можем найти в кроличьей норе . Так чем это отличается от .Internal(lapply(..?

Фактически, быстрый тест показывает, что он sapplyработает не хуже colMeansи намного лучше, чем forцикл для большого набора данных.

m <- as.data.frame(matrix(1:1e7, ncol = 1e5))
system.time(colMeans(m))
# user  system elapsed 
# 1.69    0.03    1.73 
system.time(sapply(m, mean))
# user  system elapsed 
# 1.50    0.03    1.60 
system.time(apply(m, 2, mean))
# user  system elapsed 
# 3.84    0.03    3.90 
system.time(for(i in 1:ncol(m)) mean(m[, i]))
# user  system elapsed 
# 13.78    0.01   13.93

Другими словами, это правильно сказать , что lapplyи vapply на самом деле vectorised ( по сравнению с applyкоторой является forцикл , который также называет lapply) , и что же Патрик Бернс действительно хотел сказать?

Answer 2

8

Это все в семантике, но я бы не стал считать их векторизованными. Я считаю подход векторизованным, если функция R вызывается только один раз и ей можно передать вектор значений. *applyфункции многократно вызывают функции R, что делает их циклами. Относительно хорошей производительности sapply(m, mean): Возможно, C-код lapplyметода отправляет только один раз, а затем вызывает метод повторно? mean.defaultдовольно оптимизирован.

Роланд

Answer 3

4

Отличный вопрос, спасибо за проверку исходного кода. Я искал, было ли это недавно изменено, но ничего об этом не было в примечаниях к выпуску R, начиная с версии 2.13.0 и новее.

ilir

Answer 4

1

В какой степени производительность зависит как от платформы, так и от используемых флагов C-компилятора и компоновщика?

smci

Answer 5

3

@DavidArenburg На самом деле, я не думаю, что это четко определено. По крайней мере, я не знаю канонической ссылки. В определении языка упоминаются «векторизованные» операции, но не определяется векторизация.

Роланд

Answer 6

3

Очень похоже: семейство приложений R больше, чем синтаксический сахар? (И, как эти ответы, тоже хорошее чтение.)

Грегор Томас

Answer 7

Прежде всего, в вашем примере вы проводите тесты на "data.frame", что несправедливо colMeans, applyи "[.data.frame"поскольку они имеют накладные расходы:

system.time(as.matrix(m))  #called by `colMeans` and `apply`
#   user  system elapsed 
#   1.03    0.00    1.05
system.time(for(i in 1:ncol(m)) m[, i])  #in the `for` loop
#   user  system elapsed 
#  12.93    0.01   13.07

На матрице картина немного другая:

mm = as.matrix(m)
system.time(colMeans(mm))
#   user  system elapsed 
#   0.01    0.00    0.01 
system.time(apply(mm, 2, mean))
#   user  system elapsed 
#   1.48    0.03    1.53 
system.time(for(i in 1:ncol(mm)) mean(mm[, i]))
#   user  system elapsed 
#   1.22    0.00    1.21

Что касается основной части вопроса, основное различие между lapply/ mapply/ etc и простыми R-циклами заключается в том, где выполняется цикл. Как отмечает Роланд, циклы C и R должны оценивать функцию R на каждой итерации, что является наиболее затратным. Действительно быстрые функции C - это те, которые делают все на C, так что, я думаю, это должно быть то, о чем "векторизация"?

Пример, в котором мы находим среднее значение в каждом элементе «списка»:

( ИЗМЕНИТЬ 11 мая '16 : я считаю, что пример с нахождением «среднего» не является хорошей установкой для различий между итеративным вычислением функции R и скомпилированным кодом, (1) из-за особенностей алгоритма среднего R для «числового» s по сравнению с простым sum(x) / length(x)и (2) должно иметь больше смысла тестировать на "list" s с length(x) >> lengths(x). Таким образом, "средний" пример перемещается в конец и заменяется другим.)

В качестве простого примера мы могли бы рассмотреть нахождение противоположности каждому length == 1элементу «списка»:

В tmp.cфайле:

#include <R.h>
#define USE_RINTERNALS 
#include <Rinternals.h>
#include <Rdefines.h>

/* call a C function inside another */
double oppC(double x) { return(ISNAN(x) ? NA_REAL : -x); }
SEXP sapply_oppC(SEXP x)
{
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));
    for(int i = 0; i < LENGTH(x); i++) 
        REAL(ans)[i] = oppC(REAL(VECTOR_ELT(x, i))[0]);

    UNPROTECT(1);
    return(ans);
}

/* call an R function inside a C function;
 * will be used with 'f' as a closure and as a builtin */    
SEXP sapply_oppR(SEXP x, SEXP f)
{
    SEXP call = PROTECT(allocVector(LANGSXP, 2));
    SETCAR(call, install(CHAR(STRING_ELT(f, 0))));

    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));     
    for(int i = 0; i < LENGTH(x); i++) { 
        SETCADR(call, VECTOR_ELT(x, i));
        REAL(ans)[i] = REAL(eval(call, R_GlobalEnv))[0];
    }

    UNPROTECT(2);
    return(ans);
}

А в стороне R:

system("R CMD SHLIB /home/~/tmp.c")
dyn.load("/home/~/tmp.so")

с данными:

set.seed(007)
myls = rep_len(as.list(c(NA, runif(3))), 1e7)

#a closure wrapper of `-`
oppR = function(x) -x

for_oppR = compiler::cmpfun(function(x, f)
{
    f = match.fun(f)  
    ans = numeric(length(x))
    for(i in seq_along(x)) ans[[i]] = f(x[[i]])
    return(ans)
})

Бенчмаркинг:

#call a C function iteratively
system.time({ sapplyC =  .Call("sapply_oppC", myls) }) 
#   user  system elapsed 
#  0.048   0.000   0.047 

#evaluate an R closure iteratively
system.time({ sapplyRC =  .Call("sapply_oppR", myls, "oppR") }) 
#   user  system elapsed 
#  3.348   0.000   3.358 

#evaluate an R builtin iteratively
system.time({ sapplyRCprim =  .Call("sapply_oppR", myls, "-") }) 
#   user  system elapsed 
#  0.652   0.000   0.653 

#loop with a R closure
system.time({ forR = for_oppR(myls, "oppR") })
#   user  system elapsed 
#  4.396   0.000   4.409 

#loop with an R builtin
system.time({ forRprim = for_oppR(myls, "-") })
#   user  system elapsed 
#  1.908   0.000   1.913 

#for reference and testing 
system.time({ sapplyR = unlist(lapply(myls, oppR)) })
#   user  system elapsed 
#  7.080   0.068   7.170 
system.time({ sapplyRprim = unlist(lapply(myls, `-`)) }) 
#   user  system elapsed 
#  3.524   0.064   3.598 

all.equal(sapplyR, sapplyRprim)
#[1] TRUE 
all.equal(sapplyR, sapplyC)
#[1] TRUE
all.equal(sapplyR, sapplyRC)
#[1] TRUE
all.equal(sapplyR, sapplyRCprim)
#[1] TRUE
all.equal(sapplyR, forR)
#[1] TRUE
all.equal(sapplyR, forRprim)
#[1] TRUE

(Follows the original example of mean finding):

#all computations in C
all_C = inline::cfunction(sig = c(R_ls = "list"), body = '
    SEXP tmp, ans;
    PROTECT(ans = allocVector(REALSXP, LENGTH(R_ls)));

    double *ptmp, *pans = REAL(ans);

    for(int i = 0; i < LENGTH(R_ls); i++) {
        pans[i] = 0.0;

        PROTECT(tmp = coerceVector(VECTOR_ELT(R_ls, i), REALSXP));
        ptmp = REAL(tmp);

        for(int j = 0; j < LENGTH(tmp); j++) pans[i] += ptmp[j];

        pans[i] /= LENGTH(tmp);

        UNPROTECT(1);
    }

    UNPROTECT(1);
    return(ans);
')

#a very simple `lapply(x, mean)`
C_and_R = inline::cfunction(sig = c(R_ls = "list"), body = '
    SEXP call, ans, ret;

    PROTECT(call = allocList(2));
    SET_TYPEOF(call, LANGSXP);
    SETCAR(call, install("mean"));

    PROTECT(ans = allocVector(VECSXP, LENGTH(R_ls)));
    PROTECT(ret = allocVector(REALSXP, LENGTH(ans)));

    for(int i = 0; i < LENGTH(R_ls); i++) {
        SETCADR(call, VECTOR_ELT(R_ls, i));
        SET_VECTOR_ELT(ans, i, eval(call, R_GlobalEnv));
    }

    double *pret = REAL(ret);
    for(int i = 0; i < LENGTH(ans); i++) pret[i] = REAL(VECTOR_ELT(ans, i))[0];

    UNPROTECT(3);
    return(ret);
')                    

R_lapply = function(x) unlist(lapply(x, mean))                       

R_loop = function(x) 
{
    ans = numeric(length(x))
    for(i in seq_along(x)) ans[i] = mean(x[[i]])
    return(ans)
} 

R_loopcmp = compiler::cmpfun(R_loop)


set.seed(007); myls = replicate(1e4, runif(1e3), simplify = FALSE)
all.equal(all_C(myls), C_and_R(myls))
#[1] TRUE
all.equal(all_C(myls), R_lapply(myls))
#[1] TRUE
all.equal(all_C(myls), R_loop(myls))
#[1] TRUE
all.equal(all_C(myls), R_loopcmp(myls))
#[1] TRUE

microbenchmark::microbenchmark(all_C(myls), 
                               C_and_R(myls), 
                               R_lapply(myls), 
                               R_loop(myls), 
                               R_loopcmp(myls), 
                               times = 15)
#Unit: milliseconds
#            expr       min        lq    median        uq      max neval
#     all_C(myls)  37.29183  38.19107  38.69359  39.58083  41.3861    15
#   C_and_R(myls) 117.21457 123.22044 124.58148 130.85513 169.6822    15
#  R_lapply(myls)  98.48009 103.80717 106.55519 109.54890 116.3150    15
#    R_loop(myls) 122.40367 130.85061 132.61378 138.53664 178.5128    15
# R_loopcmp(myls) 105.63228 111.38340 112.16781 115.68909 128.1976    15

Answer 8

10

Great point about the costs of converting the data.frame to a matrix, and thanks for providing benchmarks.

Joshua Ulrich

Answer 9

That is a very nice answer, though I wasn't able to compile your all_C and C_and_R functions. I also found in the documentations of compiler::cmpfun an old R version of lapply which contains an actual R for loop, I'm starting to suspect that Burns was referring to that old version which was vectorised since then and this is the actual answer to my question....

David Arenburg

Answer 10

@DavidArenburg : Benchmarking la1 from ?compiler::cmpfun seems, still, to yield same efficiency with all but all_C functions. I guess, it -indeed- comes to be a matter of definition; is "vectorised" meaning any function that accepts not only scalars, any function that has C code, any function that uses computations in C only?

alexis_laz

Answer 11

1

I guess all functions in R have C code in them, simply because everything in R is a function (which had to be written in some language). So basically, if I understand it right, you are saying that lapply isn't vectorized simply because it's evaluating an R function in each iteration wihin its C code?

David Arenburg

Answer 12

5

@DavidArenburg : If I must define "vectorization" in some way, I guess, I would choose a linguistic approach; i.e. a function that accepts and knows how to handle a "vector", whether it's fast, slow, written in C, in R or anything else. In R, the importance of vectorisation is in that many functions are written in C and handle vectors while in other languages users would , usually, loop over the input to -e.g.- find the mean. That makes vectorisation to relate, indirectly, with speed, efficiency, safety, and robustness.

alexis_laz

Answer 13

65

To me, vectorisation is primarily about making your code easier to write and easier to understand.

The goal of a vectorised function is to eliminate the book-keeping associated with a for loop. For example, instead of:

means <- numeric(length(mtcars))
for (i in seq_along(mtcars)) {
  means[i] <- mean(mtcars[[i]])
}
sds <- numeric(length(mtcars))
for (i in seq_along(mtcars)) {
  sds[i] <- sd(mtcars[[i]])
}

You can write:

means <- vapply(mtcars, mean, numeric(1))
sds   <- vapply(mtcars, sd, numeric(1))

That makes it easier to see what's the same (the input data) and what's different (the function you're applying).

A secondary advantage of vectorisation is that the for-loop is often written in C, rather than in R. This has substantial performance benefits, but I don't think it's the key property of vectorisation. Vectorisation is fundamentally about saving your brain, not saving the computer work.

hadley
источник

5

I don't think there is a meaningful performance difference between C and R for loops. OK, a C loop might be optimized by the compiler, but the main point for performance is whether the content of the loop is efficient. And obviously compiled code is usually faster than interpreted code. But that's probably what you meant to say.

Roland

3

@Roland yeah, it's not the for-loop itself per se, it's all the stuff around it (the cost of a function call, the ability to modify in place, ...).

hadley

10

@DavidArenburg "Needless consistency is the hobgoblin of small minds" ;)

hadley

6

No, I don't think performance is the main point of vectorising your code. Rewriting a loop into an lapply is beneficial even if it isn't any faster. The main point of dplyr is that it makes it easier to express data manipulation (and it's just very nice that it's also fast).

hadley

12

@DavidArenburg that's because you're an experienced R user. Most new users find loops much more natural, and need to be encouraged to vectorise. To me, using a function like colMeans isn't necessarily about vectorisation, it's about reusing fast code that someone has already written

hadley

Answer 14

5

I don't think there is a meaningful performance difference between C and R for loops. OK, a C loop might be optimized by the compiler, but the main point for performance is whether the content of the loop is efficient. And obviously compiled code is usually faster than interpreted code. But that's probably what you meant to say.

Roland

Answer 15

3

@Roland yeah, it's not the for-loop itself per se, it's all the stuff around it (the cost of a function call, the ability to modify in place, ...).

hadley

Answer 16

10

@DavidArenburg "Needless consistency is the hobgoblin of small minds" ;)

hadley

Answer 17

6

No, I don't think performance is the main point of vectorising your code. Rewriting a loop into an lapply is beneficial even if it isn't any faster. The main point of dplyr is that it makes it easier to express data manipulation (and it's just very nice that it's also fast).

hadley

Answer 18

12

@DavidArenburg that's because you're an experienced R user. Most new users find loops much more natural, and need to be encouraged to vectorise. To me, using a function like colMeans isn't necessarily about vectorisation, it's about reusing fast code that someone has already written

hadley

Answer 19

I agree with Patrick Burns' view that it is rather loop hiding and not code vectorisation. Here's why:

Consider this C code snippet:

for (int i=0; i<n; i++)
  c[i] = a[i] + b[i]

What we would like to do is quite clear. But how the task is performed or how it could be performed isn't really. A for-loop by default is a serial construct. It doesn't inform if or how things can be done in parallel.

The most obvious way is that the code is run in a sequential manner. Load a[i] and b[i] on to registers, add them, store the result in c[i], and do this for each i.

However, modern processors have vector or SIMD instruction set which is capable of operating on a vector of data during the same instruction when performing the same operation (e.g., adding two vectors as shown above). Depending on the processor/architecture, it might be possible to add, say, four numbers from a and b under the same instruction, instead of one at a time.

We would like to exploit the Single Instruction Multiple Data and perform data level parallelism, i.e., load 4 things at a time, add 4 things at time, store 4 things at a time, for example. And this is code vectorisation.

Note that this is different from code parallelisation -- where multiple computations are performed concurrently.

It'd be great if the compiler identifies such blocks of code and automatically vectorises them, which is a difficult task. Automatic code vectorisation is a challenging research topic in Computer Science. But over time, compilers have gotten better at it. You can check the auto vectorisation capabilities of GNU-gcc here. Similarly for LLVM-clang here. And you can also find some benchmarks in the last link compared against gcc and ICC (Intel C++ compiler).

gcc (I'm on v4.9) for example doesn't vectorise code automatically at -O2 level optimisation. So if we were to execute the code shown above, it'd be run sequentially. Here's the timing for adding two integer vectors of length 500 million.

We either need to add the flag -ftree-vectorize or change optimisation to level -O3. (Note that -O3 performs other additional optimisations as well). The flag -fopt-info-vec is useful as it informs when a loop was successfully vectorised).

# compiling with -O2, -ftree-vectorize and  -fopt-info-vec
# test.c:32:5: note: loop vectorized
# test.c:32:5: note: loop versioned for vectorization because of possible aliasing
# test.c:32:5: note: loop peeled for vectorization to enhance alignment

This tells us that the function is vectorised. Here are the timings comparing both non-vectorised and vectorised versions on integer vectors of length 500 million:

x = sample(100L, 500e6L, TRUE)
y = sample(100L, 500e6L, TRUE)
z = vector("integer", 500e6L) # result vector

# non-vectorised, -O2
system.time(.Call("Csum", x, y, z))
#    user  system elapsed 
#   1.830   0.009   1.852

# vectorised using flags shown above at -O2
system.time(.Call("Csum", x, y, z))
#    user  system elapsed 
#   0.361   0.001   0.362

# both results are checked for identicalness, returns TRUE

This part can be safely skipped without losing continuity.

Compilers will not always have sufficient information to vectorise. We could use OpenMP specification for parallel programming, which also provides a simd compiler directive to instruct compilers to vectorise the code. It is essential to ensure that there are no memory overlaps, race conditions etc.. when vectorising code manually, else it'll result in incorrect results.

#pragma omp simd
for (i=0; i<n; i++) 
  c[i] = a[i] + b[i]

By doing this, we specifically ask the compiler to vectorise it no matter what. We'll need to activate OpenMP extensions by using compile time flag -fopenmp. By doing that:

# timing with -O2 + OpenMP with simd
x = sample(100L, 500e6L, TRUE)
y = sample(100L, 500e6L, TRUE)
z = vector("integer", 500e6L) # result vector
system.time(.Call("Cvecsum", x, y, z))
#    user  system elapsed 
#   0.360   0.001   0.360

which is great! This was tested with gcc v6.2.0 and llvm clang v3.9.0 (both installed via homebrew, MacOS 10.12.3), both of which support OpenMP 4.0.

In this sense, even though Wikipedia page on Array Programming mentions that languages that operate on entire arrays usually call that as vectorised operations, it really is loop hiding IMO (unless it is actually vectorised).

In case of R, even rowSums() or colSums() code in C don't exploit code vectorisation IIUC; it is just a loop in C. Same goes for lapply(). In case of apply(), it's in R. All of these are therefore loop hiding.

In short, wrapping an R function by:

just writing a for-loop in C != vectorising your code.
just writing a for-loop in R != vectorising your code.

Intel Math Kernel Library (MKL) for example implements vectorised forms of functions.

HTH

References:

Talk by James Reinders, Intel (this answer is mostly an attempt to summarise this excellent talk)

Answer 20

So to sum the great answers/comments up into some general answer and provide some background: R has 4 types of loops (in from not-vectorized to vectorized order)

R for loop that repeatedly calls R functions in each iterations (Not vectorised)
C loop that repeatedly calls R functions in each iterations (Not vectorised)
C loop that calls R function only once (Somewhat vectorized)
A plain C loop that doesn't call any R function at all and uses it own compiled functions (Vectorized)

So the *apply family is the second type. Except apply which is more of the first type

You can understand this from the comment in its source code

/* .Internal(lapply(X, FUN)) */

/* This is a special .Internal, so has unevaluated arguments. It is
called from a closure wrapper, so X and FUN are promises. FUN must be unevaluated for use in e.g. bquote . */

That means that lapplys C code accepts an unevaluated function from R and later evaluates it within the C code itself. This is basically the difference between lapplys .Internal call

.Internal(lapply(X, FUN))

Which has a FUN argument that holds an R function

And the colMeans .Internal call which does not have a FUN argument

.Internal(colMeans(Re(x), n, prod(dn), na.rm))

colMeans, unlike lapply knows exactly what function it needs to use, thus it calculates the mean internally within the C code.

You can clearly see the evaluation process of the R function in each iteration within lapply C code

 for(R_xlen_t i = 0; i < n; i++) {
      if (realIndx) REAL(ind)[0] = (double)(i + 1);
      else INTEGER(ind)[0] = (int)(i + 1);
      tmp = eval(R_fcall, rho);   // <----------------------------- here it is
      if (MAYBE_REFERENCED(tmp)) tmp = lazy_duplicate(tmp);
      SET_VECTOR_ELT(ans, i, tmp);
   }

To sum things up, lapply is not vectorized, though it has two possible advantages over the plain R for loop

Accessing and assigning in a loop seems to be faster in C (i.e. in lapplying a function) Although the difference seems big, we, still, stay at the microsecond level and the costly thing is the valuation of an R function in each iteration. A simple example:

ffR = function(x)  {
    ans = vector("list", length(x))
    for(i in seq_along(x)) ans[[i]] = x[[i]]
    ans 
}

ffC = inline::cfunction(sig = c(R_x = "data.frame"), body = '
    SEXP ans;
    PROTECT(ans = allocVector(VECSXP, LENGTH(R_x)));
    for(int i = 0; i < LENGTH(R_x); i++) 
           SET_VECTOR_ELT(ans, i, VECTOR_ELT(R_x, i));
    UNPROTECT(1);
    return(ans); 
')

set.seed(007) 
myls = replicate(1e3, runif(1e3), simplify = FALSE)     
mydf = as.data.frame(myls)

all.equal(ffR(myls), ffC(myls))
#[1] TRUE 
all.equal(ffR(mydf), ffC(mydf))
#[1] TRUE

microbenchmark::microbenchmark(ffR(myls), ffC(myls), 
                               ffR(mydf), ffC(mydf),
                               times = 30)
#Unit: microseconds
#      expr       min        lq    median        uq       max neval
# ffR(myls)  3933.764  3975.076  4073.540  5121.045 32956.580    30
# ffC(myls)    12.553    12.934    16.695    18.210    19.481    30
# ffR(mydf) 14799.340 15095.677 15661.889 16129.689 18439.908    30
# ffC(mydf)    12.599    13.068    15.835    18.402    20.509    30

As mentioned by @Roland, it runs a compiled C loop rather an interpreted R loop

Though when vectorizing your code, there are some things you need to take into account.

If your data set (let's call it df) is of class data.frame, some vectorized functions (such as colMeans, colSums, rowSums, etc.) will have to convert it to a matrix first, simply because this is how they were designed. This means that for a big df this can create a huge overhead. While lapply won't have to do this as it extracts the actual vectors out of df (as data.frame is just a list of vectors) and thus, if you have not so many columns but many rows, lapply(df, mean) can sometimes be better option than colMeans(df).
Another thing to remember is that R has a great variety of different function types, such as .Primitive, and generic (S3, S4) see here for some additional information. The generic function have to do a method dispatch which sometimes a costly operation. For example, mean is generic S3 function while sum is Primitive. Thus some times lapply(df, sum) could be very efficient compared colSums from the reasons listed above

Answer 21

1

Very cohesive summary. Just a few notes: (1) C knows how to handle "data.frame"s, since they are "list"s with attributes; it's that colMeans etc. that are built to handle only matrices. (2) I'm a bit confused by your third category; I can't tell what -exaclty- you're referring to. (3) Since you're referring specifically to lapply, I believe it doesn't make a difference between "[<-" in R and C; they both pre-allocate a "list" (an SEXP) and fill it in each iteration (SET_VECTOR_ELT in C), unless I'm missing your point.

alexis_laz

Answer 22

2

I get your point about do.call in that it builts a function call in the C environmen and just evaluates it; although I'm having a hard time to compare it to looping or vectorization since it does a different thing. You're, actually, right about accessing and assigning differences between C and R, although both stay at the microsecond level and don't affect the result the rerult hugely, since the costly is the iterative R function call (compare R_loop and R_lapply in my answer). (I'll edit your post with a benchmark; I hope you, still, won't mind)

alexis_laz

Answer 23

2

I'm not trying to disagree---and I'm confused, honestly, about what you're disagreeing with. My earlier comment could have been worded better. I'm trying refine the terminology being used because the term "vectorization" has two definitions that are often conflated. I don't think this is arguable. Burns and you seem to want to use it only in the sense of implementation, but Hadley and many R-Core members (taking Vectorize() as an example) use it in the UI sense too. I think that much of the disagreement in this thread is caused by using one term for two separate-but-related concepts.

Gregor Thomas

Answer 24

3

@DavidArenburg and is that not vectorization in a UI sense, regardless of whether there's a for loop in R or C underneath?

Gregor Thomas

Answer 25

2

@DavidArenburg, Gregor, I think the confusion is between "code vectorisation" and "vectorised functions". In R, the usage seems inclined towards the latter. "Code vectorisation" describes operating on a vector of length 'k' in the same instruction. Wrapping a fn. around loopy code results in "vectorised functions" (yes, it doesn't make sense and is confusing, I agree, better would be loop hiding or vector i/p functions) and need not have anything to do with code vectorisation. In R, apply would be a vectorised function, but it doesn't vectorise your code, rather operates on vectors.

Arun

Семейство «* apply» действительно не векторизовано?

Ответы: