Using apply, sapply, and lapply in R

This introductory article on apply, sapply and lapply is best suited for people who have recently started working with R or are unfamiliar with these functions. I will give a few examples of using the apply family of functions, since they come up all the time when working in R.

I was comparing three methods on a data set: a sample was drawn from the data, and each method was applied to it. I wanted to see how their results differed.

The test harness returned a matrix with three columns and about 30 rows. It looked like this:
         method1     method2     method3
 [1,] 0.05517714 0.014054038 0.017260447
 [2,] 0.08367678 0.003570883 0.004289079
 [3,] 0.05274706 0.028629661 0.071323030
 [4,] 0.06769936 0.048446559 0.057432519
 [5,] 0.06875188 0.019782518 0.080564474
 [6,] 0.04913779 0.100062929 0.102208706

Similar data can be simulated using rnorm to create three sets of 30 values each: the first with a mean of 0, the second with a mean of 2, and the third with a mean of 5.

 m <- matrix(data=cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5)), nrow=30, ncol=3) 

Apply


When should you use apply? When you have a structured block of data to operate on, for example a set of values in the form of a matrix. The operations may be informational (extracting a summary), or they may transform the data, select a subset of it, or do anything else.

If you pass a data frame, all of its columns must be of the same type, otherwise the data will be coerced to a common type. This may be exactly what you need, or it may not. If the data frame contains both character and numeric columns, the numeric data will be converted to strings, and numeric operations will probably not give the results you expect.
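The coercion is easy to see on a small example. The sketch below uses a made-up data frame (the names df, name and value are purely illustrative, not from the original data) and shows that apply hands every column to the function as character data, plus one possible way around it.

```r
# Hypothetical data frame mixing a character and a numeric column.
df <- data.frame(name = c("a", "b", "c"), value = c(1.5, 2.5, 3.5))

# apply() converts the data frame to a matrix first, so every column
# reaches the function as character data, including the numeric one.
col_classes <- apply(df, 2, class)
print(col_classes)

# One way around it: keep only the numeric columns before applying.
numeric_cols <- df[sapply(df, is.numeric)]
numeric_means <- apply(numeric_cols, 2, mean)
print(numeric_means)
```

Filtering the columns first is only one option; depending on the task, sapply over the data frame's columns may be a better fit.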

Situations that call for apply come up all the time when working in R, so it is worth taking the time to get to know what the function can do; it will greatly increase your productivity. Which function of the apply family you need depends on your data, what you want to do with it, and what the result should look like. Hopefully these examples will make the right choice a little easier.

First, I want to make sure that I created the matrix correctly, with three columns whose means are 0, 2 and 5 respectively. We can use apply and the base mean function to check this. The second argument tells apply which dimension to apply the function over: 1 for rows, 2 for columns. In this case we want three numbers at the end, so we should tell apply to work over columns by passing 2 as the second argument. But first let's get it wrong, to illustrate:

 apply(m, 1, mean) 

 # [1] 2.408150 2.709325 1.718529 0.822519 2.693614 2.259044 1.849530 2.544685 2.957950 2.219874
 #[11] 2.582011 2.471938 2.015625 2.101832 2.189781 2.319142 2.504821 2.203066 2.280550 2.401297
 #[21] 2.312254 1.833903 1.900122 2.427002 2.426869 1.890895 2.515842 2.363085 3.049760 2.027570

Passing 1 as the second argument gives us 30 values, the mean of each row. Not the three numbers we wanted. Let's try again:

 apply(m, 2, mean) 

 #[1] -0.02664418 1.95812458 4.86857792 

Good. As you can see, the mean of each column is approximately 0, 2, and 5, as expected.
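For a repeatable version of the two calls, here is a minimal sketch. It builds a fresh matrix with an arbitrary seed, so the numbers will differ from the output printed above; the seed and the variable names row_means and col_means are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
m <- cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5))

row_means <- apply(m, 1, mean)  # margin 1: one value per row
col_means <- apply(m, 2, mean)  # margin 2: one value per column

length(row_means)  # 30
length(col_means)  # 3

# colMeans() is base R's dedicated helper for the same calculation.
all.equal(col_means, colMeans(m))
```

Checking the length of the result is a quick way to confirm that the margin argument was the one you intended.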

Writing your own function


Let's imagine that after seeing that negative number, I decided I only want to work with positive values. Let's find out how many negative numbers each column contains, using apply again:

 apply(m, 2, function(x) length(x[x<0])) 

 #[1] 14 1 0 

So there are 14 negative numbers in the first column, one in the second and none in the third. That is more or less what you would expect from three normal distributions with the means given above and a standard deviation of 1.

Here we used a simple function defined directly in the call to apply, rather than a built-in one. Note that the function has no explicit return statement: an R function returns the value of the last expression it evaluates. The function uses subsetting to select all elements less than 0 and then counts them with length. It takes one argument, which I arbitrarily named x. In this case, x is one of the columns of the matrix. Is it a single-column matrix or just a vector? Let's take a look:

 apply(m, 2, function(x) is.matrix(x)) 

 #[1] FALSE FALSE FALSE 

Not a matrix. There was no need to define a function here; we could simply have passed is.matrix itself, since it takes one argument and already exists. Let's make sure the columns are vectors, as expected:

 apply(m, 2, is.vector) 

 #[1] TRUE TRUE TRUE 

Why, then, did we have to wrap length in a function? When we define our own handler for apply, we must at least give the input a name so that we can refer to it inside the function. Without the wrapper, this happens:

 apply(m, 2, length(x[x<0])) 

 #Error in match.fun(FUN) : object 'x' not found 

We refer to some value x, but R does not know what it is, and so reports an error. Other factors are also at play here, but for simplicity just remember to wrap any code in a function. For example, let's look at the mean of only the positive values:

 apply(m, 2, function(x) mean(x[x>0])) 

 #[1] 0.4466368 2.0415736 4.8685779 
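The same idea can be written with a named function instead of an anonymous one. The sketch below uses a tiny hand-made matrix and a helper called positive_mean; both are hypothetical, chosen so the result is easy to check by hand. It also wraps the failing call in tryCatch to show that a bare expression is rejected.

```r
# Small hand-made matrix so the results can be verified by eye.
m <- cbind(c(-1, 2, 3), c(4, -5, 6), c(7, 8, 9))

# Hypothetical helper: mean of the strictly positive entries of a vector.
positive_mean <- function(x) mean(x[x > 0])

positive_means <- apply(m, 2, positive_mean)
print(positive_means)  # 2.5 5.0 8.0

# A bare expression is not a function, so apply() raises an error,
# which tryCatch() converts into the string "error" here.
result <- tryCatch(
  apply(m, 2, mean(x[x > 0])),
  error = function(e) "error"
)
print(result)
```

Giving the function a name costs one extra line, but it can be reused and tested on its own.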


Using sapply and lapply


These two functions work in a similar way: they traverse a set of data, such as a list or vector, and call the given function on each element.

Sometimes we need something more than a simple element-by-element transformation. For example, we might want to compare the current value with the value five steps back. rollapply (from the zoo package) may be worth using for that, but a quick, if not very elegant, way is to run sapply or lapply over a vector of indices.
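As a rough illustration of the index trick, the sketch below compares each value with the one five positions earlier. The series values and the name steps_back are invented for the example.

```r
values <- c(3, 5, 8, 13, 21, 34, 55, 89)  # made-up series
steps_back <- 5

# Only indices that actually have a value `steps_back` positions earlier.
idx <- (steps_back + 1):length(values)

# For each index, subtract the value five positions back.
changes <- sapply(idx, function(i) values[i] - values[i - steps_back])
print(changes)  # 31 50 81

# Base R's diff() computes the same lagged differences.
all.equal(changes, diff(values, lag = steps_back))
```

Note that the function reaches outside itself for values, which is exactly the kind of hidden dependency that makes this approach quick but fragile.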

Here we will use sapply, which works on a list or a vector:

 sapply(1:3, function(x) x^2) 

 #[1] 1 4 9 

lapply is very similar, but returns a list rather than a vector:

 lapply(1:3, function(x) x^2) 

 #[[1]]
 #[1] 1
 #
 #[[2]]
 #[1] 4
 #
 #[[3]]
 #[1] 9

Passing simplify=FALSE to sapply also gives a list:

 sapply(1:3, function(x) x^2, simplify=FALSE) 

 #[[1]]
 #[1] 1
 #
 #[[2]]
 #[1] 4
 #
 #[[3]]
 #[1] 9

You can also use unlist with lapply to get a vector:

 unlist(lapply(1:3, function(x) x^2)) 

 #[1] 1 4 9 

Which of the two is best depends on your data and on the result you want. If you want a list, use lapply; if you want a vector, use sapply.
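The difference in return type can be checked directly. The sketch below reuses the squaring example; the variable names are illustrative. It also shows one caveat worth keeping in mind: sapply only simplifies when it can, so results of different lengths still come back as a list.

```r
squares_list <- lapply(1:3, function(x) x^2)
squares_vec <- sapply(1:3, function(x) x^2)

is.list(squares_list)  # TRUE
is.list(squares_vec)   # FALSE

# Flattening the list gives the same vector that sapply returned.
identical(unlist(squares_list), squares_vec)  # TRUE

# Results of different lengths cannot be simplified to a vector,
# so sapply falls back to returning a list.
ragged <- sapply(1:3, function(x) seq_len(x))
is.list(ragged)  # TRUE
```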

Workarounds


In any case, a simple trick is to pass sapply a vector of indices and write your function so that it makes assumptions about the structure of the underlying data. Let's take another look at the example with mean:

 sapply(1:3, function(x) mean(m[,x])) 

 #[1] -0.02664418 1.95812458 4.86857792 

Our function receives the column indices (1, 2, 3) and assumes that a variable m holding our data exists. This is fine as a quick hack, but it is not good practice in general, and it is quite likely to turn into a maintenance headache later.

You can do a little better by passing the data as an argument to the function, using the special argument "..." that all the apply functions accept for passing extra parameters on to the function:

 sapply(1:3, function(x, y) mean(y[,x]), y=m) 

 #[1] -0.02664418 1.95812458 4.86857792 

This time our function has two arguments, x and y. As before, x is the current element of whatever sapply is iterating over. The variable y is passed along by sapply through its optional extra arguments.

In this case we passed in m, explicitly naming the argument y in the call to sapply. Naming it is not strictly necessary, but it makes the code easier to read and maintain. The value of y is the same every time sapply calls our function.
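To see that naming the extra argument is optional, here is a small sketch with a hand-made matrix and a named helper, col_mean; both are assumptions made for illustration. The named and positional forms give the same result.

```r
# Hand-made matrix with column means 2.5, 6.5 and 10.5.
m <- cbind(1:4, 5:8, 9:12)

# Hypothetical helper: x is a column index, y is the matrix to index into.
col_mean <- function(x, y) mean(y[, x])

# The extra argument travels through "..." and reaches col_mean as y.
named <- sapply(1:3, col_mean, y = m)
positional <- sapply(1:3, col_mean, m)

print(named)  # 2.5 6.5 10.5
identical(named, positional)  # TRUE
```

The named form is a little longer, but it makes clear which parameter of the helper receives the data.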

Passing indices around like this is strongly discouraged: it is a source of bugs and makes the code hard to follow for other people who read it.
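For comparison, the sketch below computes the same column means three ways on a small hand-made matrix: with the index trick, with apply over columns, and with sapply over the columns of a data frame. The variable names are illustrative.

```r
m <- cbind(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))

# Index-based version: works, but silently depends on m existing.
by_index <- sapply(1:3, function(x) mean(m[, x]))

# Iterating over the data itself avoids the hidden dependency.
by_apply <- apply(m, 2, mean)

# A data frame is a list of columns, so sapply can walk it directly.
by_column <- sapply(as.data.frame(m), mean)

print(by_apply)  # 2 5 8
all.equal(by_index, by_apply)
all.equal(by_index, unname(by_column))
```

The last two forms hand the data to the function directly, so nothing outside the call needs to exist for them to work.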

Hope these examples were helpful.

Source: https://habr.com/ru/post/274611/

