One of the most common questions data processing and analysis specialists face is: "Which programming language is best for solving machine learning problems?" The answer usually comes down to a difficult choice between R, Python and MATLAB. Generally speaking, no one can say objectively which language is better: the right choice depends on the constraints of the specific task and data, the preferences of the specialist, and the machine learning methods that need to be applied. According to a survey of Kaggle users' favorite tools, 543 out of 1714 respondents prefer R for solving data analysis problems.
CRAN currently offers 8341 packages, and beyond CRAN there are other repositories with a large number of packages as well. The syntax for installing any of them is simple:
install.packages("Name_Of_R_Package")
Here are a few packages a data analyst can hardly do without: dplyr, ggplot2, reshape2. Of course, this is not a complete list; in this article we will focus on the packages used in machine learning.
5. randomForest: collect many trees into a forest
The random forest algorithm is one of the most widely used machine learning algorithms. The randomForest package builds a large number of decision trees; each observation is then passed through every tree, and the class predicted by the largest number of trees is taken as the final result. To use randomForest, you need to make sure that all variables are either numeric or factors, and that no factor has more than 32 levels.
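Both conditions are easy to verify before fitting; here is a minimal sketch in base R, using the iris data:
# Every column must be numeric or a factor
sapply(iris, function(col) is.numeric(col) || is.factor(col))
nlevels(iris$Species)   # 3 levels, well below the 32-level ceiling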
As you may know, the random forest algorithm draws random subsets of variables and observations and builds many trees. In the end these trees are combined, and the class of the dependent variable is decided by the vote across the trees.
Let's take the iris dataset as an example and build a random forest using the randomForest package.
library(randomForest)
Rf_fit <- randomForest(formula = Species ~ ., data = iris)
As with the other packages, a single line of code like this is all it takes to use the random forest algorithm. Let's see how it works.
print(Rf_fit)
Call:
 randomForest(formula = Species ~ ., data = iris)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08
importance(Rf_fit)
             MeanDecreaseGini
Sepal.Length        10.200682
Sepal.Width          2.673111
Petal.Length        43.116951
Petal.Width         43.246585
Judging by MeanDecreaseGini, the petal measurements carry most of the predictive information here. You may also need to vary randomForest's control parameters, for example the number of variables tried at each split (mtry), the number of trees to grow (ntree), etc. Typically, data analysts perform several iterations and select the best combination.
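One such iteration might look like the sketch below; the values of ntree and mtry are purely illustrative, not recommendations:
# Grow more trees and try a different number of variables per split
Rf_tuned <- randomForest(Species ~ ., data = iris, ntree = 1000, mtry = 3)
print(Rf_tuned)   # compare the OOB error rate with the default fit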
6. nnet: it's all about hidden layers
This is the most widely used and most straightforward package for working with neural networks, but it is limited to a single hidden layer of nodes. However, according to some studies, more is not required: additional layers not only fail to improve the model's performance, but also increase computation time and model complexity.
The package does not provide any special method for choosing the number of nodes in the hidden layer, so when big data specialists use nnet, the usual rule of thumb is to pick a value between the number of input and output nodes. The nnet package implements Artificial Neural Networks (ANNs), an algorithm inspired by how the human brain processes input and output signals. ANNs are widely used for forecasting, in aviation in particular, where neural networks built with nnet functions have given better predictions than standard methods such as exponential smoothing, regression, etc.
R has many packages for building neural networks, for example nnet, neuralnet and RSNNS. Let's use the iris dataset one more time (I suspect you're tired of it already) and try to predict Species with nnet to see how it looks.
library(nnet)
nnet_model <- nnet(Species ~ ., data = iris, size = 10)

In the resulting network you can see 10 nodes in the hidden layer, because we set size = 10 when building the network.
Unfortunately, there is no built-in way to plot the resulting network, but there are many special functions for this purpose on GitHub; one of them was used to draw the network above.
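Beyond plotting, you can check how well the fitted network classifies by comparing its predictions against the true labels. A minimal sketch (note that we predict on the training data, so the resulting accuracy is optimistic):
nnet_pred <- predict(nnet_model, iris, type = "class")
table(predicted = nnet_pred, true = iris$Species)   # confusion matrix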
7. e1071: vectors as a support for your model
This is a very important R package, with special functions for the naive Bayes classifier (conditional probability), Support Vector Machines (SVM), the Fourier transform, fuzzy clustering, etc. In fact, the first R implementation of SVM appeared in the e1071 package. It is great for cases where, say, you need to estimate the probability that a person who bought an iPhone 6S will also buy a case for it.
This type of analysis is based on conditional probability, so data analysts turn to e1071 and its functions implementing the naive Bayes classifier.
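Here is a minimal naive Bayes sketch with e1071, once again on the iris data (the nb_model name is just for illustration):
library(e1071)
nb_model <- naiveBayes(Species ~ ., data = iris)   # P(class | features) via Bayes' rule
predict(nb_model, iris[1:5, ])                     # predicted classes for the first five rows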
The support vector method is useful when your dataset is not separable in its original dimension and the data must be mapped to a higher-dimensional space before classification or regression methods can be applied. SVM uses kernel functions (to keep the mathematics tractable) and maximizes the margin between the two classes.
The syntax of functions implementing SVM is similar:
svm_model <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris)
To visualize the SVM, we use the plot() function with the corresponding data:
plot(svm_model, data = iris[, c(1, 2, 5)])   # the two predictors used, plus Species

The plot above clearly shows the decision boundaries obtained after applying SVM to the iris data.
There are many parameters you may need to change to get the best accuracy (kernel, cost, gamma, coefficients, etc.). To get a good classification with SVM you have to experiment with many of them; the kernel alone can take several values: linear, polynomial, radial basis (Gaussian), sigmoid.
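e1071 ships a helper for exactly this kind of experimentation. The sketch below grid-searches gamma and cost with cross-validation; the grids are illustrative:
tuned <- tune.svm(Species ~ Sepal.Length + Sepal.Width, data = iris,
                  gamma = 10^(-2:1), cost = 10^(0:2))
summary(tuned)   # best parameter combination and its cross-validation error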
8. kernlab: a well-packaged kernel
kernlab takes advantage of R's S4 object model to let data analysts use kernel-based machine learning algorithms. It includes implementations of SVM, kernel feature analysis, dot-product primitives, a ranking algorithm, Gaussian processes, and a spectral clustering algorithm. Kernel-based machine learning methods are used when classification, clustering and regression problems are hard to solve in the original observation space.
The kernlab package is widely used as an SVM implementation that eases pattern recognition tasks. It offers many kernel functions, for example tanhdot (hyperbolic tangent kernel), polydot (polynomial kernel), laplacedot (Laplace kernel) and others used in pattern recognition.
Kernel functions are crucial for SVM: the method would be impossible without them.
SVM is not the only technique that uses kernels. There are many other useful and well-known kernel-based algorithms, such as the Relevance Vector Machine (RVM), kernel principal component analysis, dimensionality reduction, etc. The kernlab package contains about 20 such algorithms. It ships with predefined kernels, but users can also create and use their own kernel functions.
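Defining your own kernel only takes a function of two vectors tagged with the "kernel" class. The hand-rolled linear kernel below is a minimal sketch chosen purely for illustration:
library(kernlab)
mylinear <- function(x, y) sum(x * y)   # an ordinary dot product
class(mylinear) <- "kernel"             # now usable wherever kernlab expects a kernel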
Let's initialize our own radial basis function (RBF) kernel with a sigma value of 0.01.
Myrbf <- rbfdot(sigma = 0.01)
Myrbf
Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.01
You can check the class of Myrbf simply by applying the class() function to the newly created object.
class(Myrbf)
[1] "rbfkernel"
attr(,"package")
[1] "kernlab"
Each kernel takes two vectors as input and returns their dot product in the feature space. Let's define two vectors and compute it.
x <- rnorm(10)
y <- rnorm(10)
Myrbf(x, y)
          [,1]
[1,] 0.8443782
We created two random, normally distributed vectors, x and y, each of length 10, and evaluated their dot product with the Myrbf kernel function.
Now let's take a look at SVM with the Myrbf kernel, again using the iris dataset to understand how SVM works in kernlab.
Kernlab_svm <- ksvm(Species ~ Sepal.Length + Sepal.Width, data = iris,
                    kernel = Myrbf, C = 4)
Kernlab_svm
Support Vector Machine object of class "ksvm"

SV type: C-svc  (classification)
 parameter : cost C = 4

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.01

Number of Support Vectors : 103

Objective Function Value : -95.3715 -70.6262 -291.6249
Training error : 0.2
Let's use the newly built SVM to predict:
predicted <- predict(Kernlab_svm, iris)
table(predicted = predicted, true = iris$Species)
            true
predicted    setosa versicolor virginica
  setosa         49          0         0
  versicolor      1         37        16
  virginica       0         13        34
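Reading the table, 49 + 37 + 34 = 120 of the 150 observations are classified correctly, which matches the training error of 0.2 reported above. The same number can be computed directly:
mean(predicted == iris$Species)   # 0.8 accuracy, i.e. 0.2 training error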
Conclusion
Each package or function in R has its own default values. Before applying any algorithm, it makes sense to find out what options are available. The defaults will give you some result, but there is no guarantee it will be the optimal or most accurate one.
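In R those defaults are easy to inspect before fitting anything; for example, for e1071's svm (a quick illustration using base R's getS3method):
library(e1071)
args(getS3method("svm", "default"))   # shows kernel = "radial", cost = 1, gamma, etc.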
CRAN hosts other machine learning packages as well, for example igraph, glmnet, gbm, tree, CORElearn, mboost and others, used in various fields to build efficient models. You may run into situations where changing a single parameter completely changes the output. So don't lean too heavily on the defaults: study your data and requirements before applying any algorithm.