Таким образом, мы привыкли говорить каждому новому пользователю R, что « apply
не векторизован, посмотрите Patrick Burns R Inferno Circle 4 », в котором говорится (цитирую):
Распространенный рефлекс - использовать функцию из семейства apply. Это не векторизация, это скрытие петель . В определении функции apply есть цикл for. Функция lapply скрывает цикл, но время выполнения обычно примерно равно явному циклу for.
Действительно, беглый взгляд на apply
исходный код обнаруживает цикл:
grep("for", capture.output(getAnywhere("apply")), value = TRUE)
## [1] " for (i in 1L:d2) {" " else for (i in 1L:d2) {"
Пока что хорошо, но если взглянуть на нее lapply
или vapply
увидеть ее, то картина будет совершенно иной:
lapply
## function (X, FUN, ...)
## {
## FUN <- match.fun(FUN)
## if (!is.vector(X) || is.object(X))
## X <- as.list(X)
## .Internal(lapply(X, FUN))
## }
## <bytecode: 0x000000000284b618>
## <environment: namespace:base>
Таким образом, очевидно, что там нет скрытого for
цикла R , скорее они вызывают внутреннюю написанную функцию C.
Беглый взгляд в кроличью нору показывает почти ту же картину
Более того, возьмем colMeans
для примера функцию, которую никогда не обвиняли в том, что она не векторизована.
colMeans
# function (x, na.rm = FALSE, dims = 1L)
# {
# if (is.data.frame(x))
# x <- as.matrix(x)
# if (!is.array(x) || length(dn <- dim(x)) < 2L)
# stop("'x' must be an array of at least two dimensions")
# if (dims < 1L || dims > length(dn) - 1L)
# stop("invalid 'dims'")
# n <- prod(dn[1L:dims])
# dn <- dn[-(1L:dims)]
# z <- if (is.complex(x))
# .Internal(colMeans(Re(x), n, prod(dn), na.rm)) + (0+1i) *
# .Internal(colMeans(Im(x), n, prod(dn), na.rm))
# else .Internal(colMeans(x, n, prod(dn), na.rm))
# if (length(dn) > 1L) {
# dim(z) <- dn
# dimnames(z) <- dimnames(x)[-(1L:dims)]
# }
# else names(z) <- dimnames(x)[[dims + 1]]
# z
# }
# <bytecode: 0x0000000008f89d20>
# <environment: namespace:base>
А? Это также просто вызовы, .Internal(colMeans(...
которые мы также можем найти в кроличьей норе . Так чем это отличается от .Internal(lapply(..
?
Фактически, быстрый тест показывает, что он sapply
работает не хуже colMeans
и намного лучше, чем for
цикл для большого набора данных.
m <- as.data.frame(matrix(1:1e7, ncol = 1e5))
system.time(colMeans(m))
# user system elapsed
# 1.69 0.03 1.73
system.time(sapply(m, mean))
# user system elapsed
# 1.50 0.03 1.60
system.time(apply(m, 2, mean))
# user system elapsed
# 3.84 0.03 3.90
system.time(for(i in 1:ncol(m)) mean(m[, i]))
# user system elapsed
# 13.78 0.01 13.93
Другими словами, это правильно сказать , что lapply
и vapply
на самом деле vectorised ( по сравнению с apply
которой является for
цикл , который также называет lapply
) , и что же Патрик Бернс действительно хотел сказать?
источник
*apply
функции многократно вызывают функции R, что делает их циклами. Относительно хорошей производительностиsapply(m, mean)
: Возможно, C-кодlapply
метода отправляет только один раз, а затем вызывает метод повторно?mean.default
довольно оптимизирован.Ответы:
Прежде всего, в вашем примере вы проводите тесты на "data.frame", что несправедливо
colMeans
,apply
и"[.data.frame"
поскольку они имеют накладные расходы:system.time(as.matrix(m)) #called by `colMeans` and `apply` # user system elapsed # 1.03 0.00 1.05 system.time(for(i in 1:ncol(m)) m[, i]) #in the `for` loop # user system elapsed # 12.93 0.01 13.07
На матрице картина немного другая:
mm = as.matrix(m) system.time(colMeans(mm)) # user system elapsed # 0.01 0.00 0.01 system.time(apply(mm, 2, mean)) # user system elapsed # 1.48 0.03 1.53 system.time(for(i in 1:ncol(mm)) mean(mm[, i])) # user system elapsed # 1.22 0.00 1.21
Что касается основной части вопроса, основное различие между
lapply
/mapply
/ etc и простыми R-циклами заключается в том, где выполняется цикл. Как отмечает Роланд, циклы C и R должны оценивать функцию R на каждой итерации, что является наиболее затратным. Действительно быстрые функции C - это те, которые делают все на C, так что, я думаю, это должно быть то, о чем "векторизация"?Пример, в котором мы находим среднее значение в каждом элементе «списка»:
( ИЗМЕНИТЬ 11 мая '16 : я считаю, что пример с нахождением «среднего» не является хорошей установкой для различий между итеративным вычислением функции R и скомпилированным кодом, (1) из-за особенностей алгоритма среднего R для «числового» s по сравнению с простым
sum(x) / length(x)
и (2) должно иметь больше смысла тестировать на "list" s сlength(x) >> lengths(x)
. Таким образом, "средний" пример перемещается в конец и заменяется другим.)В качестве простого примера мы могли бы рассмотреть нахождение противоположности каждому
length == 1
элементу «списка»:В
tmp.c
файле:#include <R.h> #define USE_RINTERNALS #include <Rinternals.h> #include <Rdefines.h> /* call a C function inside another */ double oppC(double x) { return(ISNAN(x) ? NA_REAL : -x); } SEXP sapply_oppC(SEXP x) { SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x))); for(int i = 0; i < LENGTH(x); i++) REAL(ans)[i] = oppC(REAL(VECTOR_ELT(x, i))[0]); UNPROTECT(1); return(ans); } /* call an R function inside a C function; * will be used with 'f' as a closure and as a builtin */ SEXP sapply_oppR(SEXP x, SEXP f) { SEXP call = PROTECT(allocVector(LANGSXP, 2)); SETCAR(call, install(CHAR(STRING_ELT(f, 0)))); SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x))); for(int i = 0; i < LENGTH(x); i++) { SETCADR(call, VECTOR_ELT(x, i)); REAL(ans)[i] = REAL(eval(call, R_GlobalEnv))[0]; } UNPROTECT(2); return(ans); }
А в стороне R:
system("R CMD SHLIB /home/~/tmp.c") dyn.load("/home/~/tmp.so")
с данными:
set.seed(007) myls = rep_len(as.list(c(NA, runif(3))), 1e7) #a closure wrapper of `-` oppR = function(x) -x for_oppR = compiler::cmpfun(function(x, f) { f = match.fun(f) ans = numeric(length(x)) for(i in seq_along(x)) ans[[i]] = f(x[[i]]) return(ans) })
Бенчмаркинг:
#call a C function iteratively system.time({ sapplyC = .Call("sapply_oppC", myls) }) # user system elapsed # 0.048 0.000 0.047 #evaluate an R closure iteratively system.time({ sapplyRC = .Call("sapply_oppR", myls, "oppR") }) # user system elapsed # 3.348 0.000 3.358 #evaluate an R builtin iteratively system.time({ sapplyRCprim = .Call("sapply_oppR", myls, "-") }) # user system elapsed # 0.652 0.000 0.653 #loop with a R closure system.time({ forR = for_oppR(myls, "oppR") }) # user system elapsed # 4.396 0.000 4.409 #loop with an R builtin system.time({ forRprim = for_oppR(myls, "-") }) # user system elapsed # 1.908 0.000 1.913 #for reference and testing system.time({ sapplyR = unlist(lapply(myls, oppR)) }) # user system elapsed # 7.080 0.068 7.170 system.time({ sapplyRprim = unlist(lapply(myls, `-`)) }) # user system elapsed # 3.524 0.064 3.598 all.equal(sapplyR, sapplyRprim) #[1] TRUE all.equal(sapplyR, sapplyC) #[1] TRUE all.equal(sapplyR, sapplyRC) #[1] TRUE all.equal(sapplyR, sapplyRCprim) #[1] TRUE all.equal(sapplyR, forR) #[1] TRUE all.equal(sapplyR, forRprim) #[1] TRUE
(Follows the original example of mean finding):
#all computations in C all_C = inline::cfunction(sig = c(R_ls = "list"), body = ' SEXP tmp, ans; PROTECT(ans = allocVector(REALSXP, LENGTH(R_ls))); double *ptmp, *pans = REAL(ans); for(int i = 0; i < LENGTH(R_ls); i++) { pans[i] = 0.0; PROTECT(tmp = coerceVector(VECTOR_ELT(R_ls, i), REALSXP)); ptmp = REAL(tmp); for(int j = 0; j < LENGTH(tmp); j++) pans[i] += ptmp[j]; pans[i] /= LENGTH(tmp); UNPROTECT(1); } UNPROTECT(1); return(ans); ') #a very simple `lapply(x, mean)` C_and_R = inline::cfunction(sig = c(R_ls = "list"), body = ' SEXP call, ans, ret; PROTECT(call = allocList(2)); SET_TYPEOF(call, LANGSXP); SETCAR(call, install("mean")); PROTECT(ans = allocVector(VECSXP, LENGTH(R_ls))); PROTECT(ret = allocVector(REALSXP, LENGTH(ans))); for(int i = 0; i < LENGTH(R_ls); i++) { SETCADR(call, VECTOR_ELT(R_ls, i)); SET_VECTOR_ELT(ans, i, eval(call, R_GlobalEnv)); } double *pret = REAL(ret); for(int i = 0; i < LENGTH(ans); i++) pret[i] = REAL(VECTOR_ELT(ans, i))[0]; UNPROTECT(3); return(ret); ') R_lapply = function(x) unlist(lapply(x, mean)) R_loop = function(x) { ans = numeric(length(x)) for(i in seq_along(x)) ans[i] = mean(x[[i]]) return(ans) } R_loopcmp = compiler::cmpfun(R_loop) set.seed(007); myls = replicate(1e4, runif(1e3), simplify = FALSE) all.equal(all_C(myls), C_and_R(myls)) #[1] TRUE all.equal(all_C(myls), R_lapply(myls)) #[1] TRUE all.equal(all_C(myls), R_loop(myls)) #[1] TRUE all.equal(all_C(myls), R_loopcmp(myls)) #[1] TRUE microbenchmark::microbenchmark(all_C(myls), C_and_R(myls), R_lapply(myls), R_loop(myls), R_loopcmp(myls), times = 15) #Unit: milliseconds # expr min lq median uq max neval # all_C(myls) 37.29183 38.19107 38.69359 39.58083 41.3861 15 # C_and_R(myls) 117.21457 123.22044 124.58148 130.85513 169.6822 15 # R_lapply(myls) 98.48009 103.80717 106.55519 109.54890 116.3150 15 # R_loop(myls) 122.40367 130.85061 132.61378 138.53664 178.5128 15 # R_loopcmp(myls) 105.63228 111.38340 112.16781 115.68909 128.1976 15
источник
all_C
andC_and_R
functions. I also found in the documentations ofcompiler::cmpfun
an old R version of lapply which contains an actual Rfor
loop, I'm starting to suspect that Burns was referring to that old version which was vectorised since then and this is the actual answer to my question....la1
from?compiler::cmpfun
seems, still, to yield same efficiency with all butall_C
functions. I guess, it -indeed- comes to be a matter of definition; is "vectorised" meaning any function that accepts not only scalars, any function that has C code, any function that uses computations in C only?lapply
isn't vectorized simply because it's evaluating an R function in each iteration wihin its C code?To me, vectorisation is primarily about making your code easier to write and easier to understand.
The goal of a vectorised function is to eliminate the book-keeping associated with a for loop. For example, instead of:
means <- numeric(length(mtcars)) for (i in seq_along(mtcars)) { means[i] <- mean(mtcars[[i]]) } sds <- numeric(length(mtcars)) for (i in seq_along(mtcars)) { sds[i] <- sd(mtcars[[i]]) }
You can write:
means <- vapply(mtcars, mean, numeric(1)) sds <- vapply(mtcars, sd, numeric(1))
That makes it easier to see what's the same (the input data) and what's different (the function you're applying).
A secondary advantage of vectorisation is that the for-loop is often written in C, rather than in R. This has substantial performance benefits, but I don't think it's the key property of vectorisation. Vectorisation is fundamentally about saving your brain, not saving the computer work.
источник
for
loops. OK, a C loop might be optimized by the compiler, but the main point for performance is whether the content of the loop is efficient. And obviously compiled code is usually faster than interpreted code. But that's probably what you meant to say.I agree with Patrick Burns' view that it is rather loop hiding and not code vectorisation. Here's why:
Consider this
C
code snippet:for (int i=0; i<n; i++) c[i] = a[i] + b[i]
What we would like to do is quite clear. But how the task is performed or how it could be performed isn't really. A for-loop by default is a serial construct. It doesn't inform if or how things can be done in parallel.
The most obvious way is that the code is run in a sequential manner. Load
a[i]
andb[i]
on to registers, add them, store the result inc[i]
, and do this for eachi
.However, modern processors have vector or SIMD instruction set which is capable of operating on a vector of data during the same instruction when performing the same operation (e.g., adding two vectors as shown above). Depending on the processor/architecture, it might be possible to add, say, four numbers from
a
andb
under the same instruction, instead of one at a time.It'd be great if the compiler identifies such blocks of code and automatically vectorises them, which is a difficult task. Automatic code vectorisation is a challenging research topic in Computer Science. But over time, compilers have gotten better at it. You can check the auto vectorisation capabilities of
GNU-gcc
here. Similarly forLLVM-clang
here. And you can also find some benchmarks in the last link compared againstgcc
andICC
(Intel C++ compiler).gcc
(I'm onv4.9
) for example doesn't vectorise code automatically at-O2
level optimisation. So if we were to execute the code shown above, it'd be run sequentially. Here's the timing for adding two integer vectors of length 500 million.We either need to add the flag
-ftree-vectorize
or change optimisation to level-O3
. (Note that-O3
performs other additional optimisations as well). The flag-fopt-info-vec
is useful as it informs when a loop was successfully vectorised).# compiling with -O2, -ftree-vectorize and -fopt-info-vec # test.c:32:5: note: loop vectorized # test.c:32:5: note: loop versioned for vectorization because of possible aliasing # test.c:32:5: note: loop peeled for vectorization to enhance alignment
This tells us that the function is vectorised. Here are the timings comparing both non-vectorised and vectorised versions on integer vectors of length 500 million:
x = sample(100L, 500e6L, TRUE) y = sample(100L, 500e6L, TRUE) z = vector("integer", 500e6L) # result vector # non-vectorised, -O2 system.time(.Call("Csum", x, y, z)) # user system elapsed # 1.830 0.009 1.852 # vectorised using flags shown above at -O2 system.time(.Call("Csum", x, y, z)) # user system elapsed # 0.361 0.001 0.362 # both results are checked for identicalness, returns TRUE
This part can be safely skipped without losing continuity.
Compilers will not always have sufficient information to vectorise. We could use OpenMP specification for parallel programming, which also provides a simd compiler directive to instruct compilers to vectorise the code. It is essential to ensure that there are no memory overlaps, race conditions etc.. when vectorising code manually, else it'll result in incorrect results.
#pragma omp simd for (i=0; i<n; i++) c[i] = a[i] + b[i]
By doing this, we specifically ask the compiler to vectorise it no matter what. We'll need to activate OpenMP extensions by using compile time flag
-fopenmp
. By doing that:# timing with -O2 + OpenMP with simd x = sample(100L, 500e6L, TRUE) y = sample(100L, 500e6L, TRUE) z = vector("integer", 500e6L) # result vector system.time(.Call("Cvecsum", x, y, z)) # user system elapsed # 0.360 0.001 0.360
which is great! This was tested with gcc v6.2.0 and llvm clang v3.9.0 (both installed via homebrew, MacOS 10.12.3), both of which support OpenMP 4.0.
In this sense, even though Wikipedia page on Array Programming mentions that languages that operate on entire arrays usually call that as vectorised operations, it really is loop hiding IMO (unless it is actually vectorised).
In case of R, even
rowSums()
orcolSums()
code in C don't exploit code vectorisation IIUC; it is just a loop in C. Same goes forlapply()
. In case ofapply()
, it's in R. All of these are therefore loop hiding.HTH
References:
источник
So to sum the great answers/comments up into some general answer and provide some background: R has 4 types of loops (in from not-vectorized to vectorized order)
for
loop that repeatedly calls R functions in each iterations (Not vectorised)So the
*apply
family is the second type. Exceptapply
which is more of the first typeYou can understand this from the comment in its source code
That means that
lapply
s C code accepts an unevaluated function from R and later evaluates it within the C code itself. This is basically the difference betweenlapply
s.Internal
callWhich has a
FUN
argument that holds an R functionAnd the
colMeans
.Internal
call which does not have aFUN
argumentcolMeans
, unlikelapply
knows exactly what function it needs to use, thus it calculates the mean internally within the C code.You can clearly see the evaluation process of the R function in each iteration within
lapply
C codefor(R_xlen_t i = 0; i < n; i++) { if (realIndx) REAL(ind)[0] = (double)(i + 1); else INTEGER(ind)[0] = (int)(i + 1); tmp = eval(R_fcall, rho); // <----------------------------- here it is if (MAYBE_REFERENCED(tmp)) tmp = lazy_duplicate(tmp); SET_VECTOR_ELT(ans, i, tmp); }
To sum things up,
lapply
is not vectorized, though it has two possible advantages over the plain Rfor
loopAccessing and assigning in a loop seems to be faster in C (i.e. in
lapply
ing a function) Although the difference seems big, we, still, stay at the microsecond level and the costly thing is the valuation of an R function in each iteration. A simple example:ffR = function(x) { ans = vector("list", length(x)) for(i in seq_along(x)) ans[[i]] = x[[i]] ans } ffC = inline::cfunction(sig = c(R_x = "data.frame"), body = ' SEXP ans; PROTECT(ans = allocVector(VECSXP, LENGTH(R_x))); for(int i = 0; i < LENGTH(R_x); i++) SET_VECTOR_ELT(ans, i, VECTOR_ELT(R_x, i)); UNPROTECT(1); return(ans); ') set.seed(007) myls = replicate(1e3, runif(1e3), simplify = FALSE) mydf = as.data.frame(myls) all.equal(ffR(myls), ffC(myls)) #[1] TRUE all.equal(ffR(mydf), ffC(mydf)) #[1] TRUE microbenchmark::microbenchmark(ffR(myls), ffC(myls), ffR(mydf), ffC(mydf), times = 30) #Unit: microseconds # expr min lq median uq max neval # ffR(myls) 3933.764 3975.076 4073.540 5121.045 32956.580 30 # ffC(myls) 12.553 12.934 16.695 18.210 19.481 30 # ffR(mydf) 14799.340 15095.677 15661.889 16129.689 18439.908 30 # ffC(mydf) 12.599 13.068 15.835 18.402 20.509 30
As mentioned by @Roland, it runs a compiled C loop rather an interpreted R loop
Though when vectorizing your code, there are some things you need to take into account.
df
) is of classdata.frame
, some vectorized functions (such ascolMeans
,colSums
,rowSums
, etc.) will have to convert it to a matrix first, simply because this is how they were designed. This means that for a bigdf
this can create a huge overhead. Whilelapply
won't have to do this as it extracts the actual vectors out ofdf
(asdata.frame
is just a list of vectors) and thus, if you have not so many columns but many rows,lapply(df, mean)
can sometimes be better option thancolMeans(df)
..Primitive
, and generic (S3
,S4
) see here for some additional information. The generic function have to do a method dispatch which sometimes a costly operation. For example,mean
is genericS3
function whilesum
isPrimitive
. Thus some timeslapply(df, sum)
could be very efficient comparedcolSums
from the reasons listed aboveисточник
colMeans
etc. that are built to handle only matrices. (2) I'm a bit confused by your third category; I can't tell what -exaclty- you're referring to. (3) Since you're referring specifically tolapply
, I believe it doesn't make a difference between"[<-"
in R and C; they both pre-allocate a "list" (an SEXP) and fill it in each iteration (SET_VECTOR_ELT
in C), unless I'm missing your point.do.call
in that it builts a function call in the C environmen and just evaluates it; although I'm having a hard time to compare it to looping or vectorization since it does a different thing. You're, actually, right about accessing and assigning differences between C and R, although both stay at the microsecond level and don't affect the result the rerult hugely, since the costly is the iterative R function call (compareR_loop
andR_lapply
in my answer). (I'll edit your post with a benchmark; I hope you, still, won't mind)Vectorize()
as an example) use it in the UI sense too. I think that much of the disagreement in this thread is caused by using one term for two separate-but-related concepts.