
Deep (Learning + Random) Forest and a breakdown of the papers

We continue our coverage of AISTATS 2019, a conference on statistics and machine learning. In this post we discuss papers on deep models built from tree ensembles, a mixed regularization for highly sparse data, and a time-efficient approximation of cross-validation.



Deep Forest Algorithm: An Exploration to Non-NN Deep Models based on Non-Differentiable Modules


Zhi-Hua Zhou (Nanjing University)
Presentation
Article
Implementations - below


A professor from China presented a tree-ensemble model that the authors call the first deep learning approach built from non-differentiable modules. That may sound like an overly bold claim, but the professor, with his h-index of 95, was an invited speaker, which makes it worth taking the statement more seriously. The basic theory of Deep Forest was developed a while ago: the original paper appeared back in 2017 (almost 200 citations), and since then the authors have been releasing libraries and speeding up the algorithm every year. Now, it seems, they have reached the point where this elegant theory can finally be applied in practice.


General view of the Deep Forest architecture


Background


Deep models, which today usually means deep neural networks, are used to capture complex dependencies in data. It also turned out that increasing the number of layers is more effective than increasing the number of units in each layer. But neural networks have their downsides:



In addition, the building blocks of deep neural networks are differentiable modules, which are not necessarily effective for every task. And despite the complexity of neural networks, conceptually simple algorithms such as random forests often work better, or not much worse. But such algorithms require manual feature engineering, which is also hard to do optimally.


Researchers had also noticed that ensembles are very powerful in practice (on Kaggle, for instance), and, inspired by the remarks of Scholl and Hinton that differentiation is the weakest side of deep learning, they decided to build a tree ensemble with the properties of deep learning.


Slide “How to make a good ensemble”


The architecture follows from the properties of good ensembles: the individual elements should not be too weak, and they should be diverse.


gcForest consists of two stages: Cascade Forest and Multi-Grained Scanning. To keep the cascade from overfitting, it combines two types of trees, one of which is completely random trees, which can even be used on unlabeled data. The number of layers is determined inside the algorithm by cross-validation.
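To make the cascade idea concrete, here is a minimal sketch in scikit-learn (not the authors' implementation): each layer combines a random forest with a completely-random forest (ExtraTrees), and their class-probability vectors are appended to the original features before being passed to the next layer. In the real gcForest the per-layer training probabilities are produced with k-fold fitting to avoid overfitting, and the number of layers is chosen by cross-validation; both are simplified away here.

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

    def cascade_predict(X_train, y_train, X_test, n_layers=3):
        """Toy cascade: each layer = random forest + completely-random forest."""
        aug_train, aug_test = X_train, X_test
        for layer in range(n_layers):
            forests = [
                RandomForestClassifier(n_estimators=100, random_state=layer),
                ExtraTreesClassifier(n_estimators=100, random_state=layer),
            ]
            train_probas, test_probas = [], []
            for forest in forests:
                forest.fit(aug_train, y_train)
                # NOTE: gcForest obtains these training-set probabilities via
                # k-fold fitting to avoid overfitting; skipped here for brevity
                train_probas.append(forest.predict_proba(aug_train))
                test_probas.append(forest.predict_proba(aug_test))
            # augment the original features with the class-probability vectors
            aug_train = np.hstack([X_train] + train_probas)
            aug_test = np.hstack([X_test] + test_probas)
        # final prediction: average the class probabilities of the last layer
        return np.mean(test_probas, axis=0).argmax(axis=1)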


Two types of trees


Results


In addition to the results on standard datasets, the authors tried applying gcForest to transactions of a Chinese payment system for fraud detection, and obtained F1 and AUC scores much higher than those of logistic regression and a DNN. These results are only in the presentation, but the code for running the algorithm on some standard datasets is on GitHub.



Algorithm results on Shutter. mdDF (optimal margin distribution Deep Forest) is a variant of gcForest



Pros:



Cons:



Neural networks suffer from vanishing gradients, while deep forests suffer from "diversity vanishing". Since this is an ensemble, the more "diverse" and "good" the elements it uses, the higher the quality. The problem is that the authors have already gone through almost all the classical approaches (sampling, randomization). Until new fundamental research on diversity appears, it will be hard to improve the quality of deep forests. What can be improved right now is the speed of computation.


Reproducing the results


Such a convincing win over XGBoost on tabular data made me want to reproduce the result. I took the Adult dataset and applied gcForestCS (a slightly accelerated version of gcForest) with the parameters from the paper's authors, and XGBoost with default parameters. In the authors' example the categorical features had already been pre-processed somehow, but it was not stated how. In the end I used CatBoostEncoder and a different metric, ROC AUC. The results were statistically different: XGBoost won. XGBoost's running time is negligible, while gcForestCS took about 20 minutes per run of 5-fold cross-validation. On the other hand, the authors tested the algorithm on different datasets, and their parameters for this dataset were tuned to their own pre-processing.
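For reference, here is a rough sketch of the XGBoost side of this comparison (the file name, column names, and exact pre-processing are my assumptions, since the original pre-processing is not described): Adult dataset, CatBoostEncoder for categorical features, XGBoost with default parameters, ROC AUC over 5-fold cross-validation.

    import pandas as pd
    from category_encoders import CatBoostEncoder
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from xgboost import XGBClassifier

    df = pd.read_csv("adult.csv")                  # hypothetical local copy of Adult
    X = df.drop(columns=["income"])
    y = (df["income"] == ">50K").astype(int)       # binary target

    # CatBoostEncoder converts string-valued columns to target statistics;
    # fitting it inside the pipeline keeps the encoding out of the validation folds
    model = make_pipeline(CatBoostEncoder(), XGBClassifier())
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print("XGBoost ROC AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
    # gcForestCS was evaluated separately with the authors' code (links below)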


The code can be found here.


Implementations


Official code from the authors of the article
Official improved version: works faster, but has no documentation
A simpler implementation


PcLasso: the lasso meets principal components regression


J. Kenneth Tay, Jerome Friedman, Robert Tibshirani (Stanford University)


Article
Presentation
Example of use


In early 2019, J. Kenneth Tay, Jerome Friedman and Robert Tibshirani from Stanford University proposed a new supervised learning method that is especially well suited to sparse data.


The authors of the article used the method to analyze gene-expression data described in Zeng & Breheny (2016). The target is the mutation status of the p53 gene, which regulates gene expression in response to various cellular stress signals. The aim of the study is to identify predictors that correlate with the mutation status of p53. The data consist of 50 cell lines, 17 of which are classified as normal while the remaining 33 carry mutations in the p53 gene. Following the analysis in Subramanian et al. (2005), 308 gene sets of size between 15 and 500 are included in the analysis. These gene sets contain a total of 4,301 genes and are available in the grpregOverlap R package. After expanding the data to handle overlapping groups, the result is 13,237 columns. The authors applied the pcLasso method, which improved the model's results.


In the figure we see the increase in AUC when using pcLasso


The essence of the method


The method combines $\ell_1$-regularization with an $\ell_2$ penalty that shrinks the coefficient vector towards the leading principal components of the feature matrix. They called the proposed method the "principal components lasso" ("pcLasso", available in R). The method can be especially powerful if the variables are pre-grouped (the user chooses what to group and how). In that case pcLasso shrinks each group's coefficients towards the leading principal components of that group. Along the way, it also selects significant groups among those provided.


Write the singular value decomposition of the centered feature matrix $X$ as $X = UDV^T$, where $D$ is the diagonal matrix of singular values $d_1 \ge d_2 \ge \dots \ge d_m$. In this notation the $\ell_2$ penalty can be written in the more general form $\beta^T V Z V^T \beta$, where $Z$ is a diagonal matrix whose entries are functions of the squared singular values: $Z_{11} = f_1(d_1^2, d_2^2, \dots, d_m^2)$, $Z_{22} = f_2(d_1^2, d_2^2, \dots, d_m^2)$, and so on.


In ordinary $\ell_2$-regularization, $Z_{jj} = 1$ for all $j$, which corresponds to the usual penalty $\beta^T \beta$. The authors propose to minimize the following objective:
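The formula itself did not survive the export, but from the SVD notation above and the description of $D_{d_1^2 - d_j^2}$ and $\theta$ below, the minimized objective can be reconstructed roughly as (a sketch for centered data, not the paper's exact typesetting):

\min_{\beta} \; \frac{1}{2} \| y - X\beta \|_2^2 + \lambda \|\beta\|_1 + \frac{\theta}{2} \, \beta^T V D_{d_1^2 - d_j^2} V^T \beta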



Here $D_{d_1^2 - d_j^2}$ is the diagonal matrix of differences $d_1^2 - d_1^2,\ d_1^2 - d_2^2,\ \dots,\ d_1^2 - d_m^2$. In other words, we also control the vector $\beta$ through the hyperparameter $\theta$.
Transforming this expression, we get the solution:



But the main "feature" of the method, of course, is the ability to group variables and, based on these groups, shrink towards the principal components of each group. The objective is then rewritten in the form:
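As a rough reconstruction based on the definitions in the next paragraph (again, not the paper's exact formula), the grouped objective has the form:

\min_{\beta} \; \frac{1}{2} \Big\| y - \sum_{k} X_k \beta_k \Big\|_2^2 + \lambda \|\beta\|_1 + \frac{\theta}{2} \sum_{k} \beta_k^T V_k D_{d_{k1}^2 - d_{kj}^2} V_k^T \beta_k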



Here $\beta_k$ is the subvector of $\beta$ corresponding to group $k$, $d_k = (d_{k1}, \dots, d_{k m_k})$ are the singular values of $X_k$ arranged in decreasing order, and $D_{d_{k1}^2 - d_{kj}^2}$ is the diagonal matrix with entries $d_{k1}^2 - d_{kj}^2$, $j = 1, 2, \dots, m_k$.


Some comments on optimizing the objective:


  1. The objective function is convex and its non-smooth component is separable, so it can be optimized efficiently with gradient descent. The approach is to fix several values of $\theta$ (including zero, which recovers standard $\ell_1$-regularization) and then, for each of them, optimize over $\lambda$. The parameters $\theta$ and $\lambda$ are thus chosen by cross-validation.


  2. The parameter $\theta$ is hard to interpret. In the software (the "pcLasso" package) the user instead specifies a value in the interval $[0, 1]$, where 1 corresponds to $\theta = 0$ (the lasso).



In practice, varying this value over 0.25, 0.5, 0.75, 0.9, 0.95, and 1 covers a wide range of models.


The algorithm itself is as follows:


The algorithm is already implemented in R, so you can use it right away. The package is called 'pcLasso'.


A Swiss Army Infinitesimal Jackknife


Ryan Giordano (UC Berkeley); William Stephenson (MIT); Runjing Liu (UC Berkeley);
Michael Jordan (UC Berkeley); Tamara Broderick (MIT)


Article
Code


The quality of machine learning algorithms is often evaluated by repeatedly refitting the model on resampled data (cross-validation or the bootstrap). These methods are powerful, but slow on large datasets.


In this paper the authors use a linear approximation with respect to the data weights, which produces results much faster. This linear approximation is known in the statistical literature as the "infinitesimal jackknife"; it has mainly been used as a theoretical tool for proving asymptotic results. The results of the paper apply regardless of whether the weights and the data are stochastic or deterministic. As a consequence, the approximation consistently estimates true cross-validation for any fixed k.


Presenting the paper award to one of the authors


The essence of the method


Consider the problem of estimating an unknown parameter $\theta \in \Omega_\theta \subset \mathbb{R}^D$, where $\Omega_\theta$ is compact and the size of our dataset is $N$. The analysis is carried out for a fixed dataset. We define our estimate $\hat{\theta} \in \Omega_\theta$ as follows:


  1. For each $n = 1, 2, \dots, N$, let $g_n(\theta)$ be a function on $\Omega_\theta \subset \mathbb{R}^D$.
  2. Let $\omega_n$ be a real number and $\omega$ the vector made up of the $\omega_n$.

Then $\hat{\theta}$ can be defined as follows:
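The defining equation was lost in the export, but judging by the fixed-point condition quoted below, $\hat{\theta}(\omega)$ is the root of the weighted estimating equation:

\frac{1}{N} \sum_{n=1}^{N} \omega_n \, g_n\big(\hat{\theta}(\omega)\big) = 0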



To solve this problem with gradient-based methods, we assume that the functions are differentiable and that we can compute the Hessian. The main issue being addressed is the computational cost of evaluating $\hat{\theta}(\omega)$ for all $\omega \in W$. The key contribution of the authors is to compute only the estimate $\hat{\theta}_1 = \hat{\theta}(1_\omega)$, where $1_\omega = (1, 1, \dots, 1)$. In other words, the approximation depends only on the derivatives of $g_n(\theta)$, which we assume exist, and on the Hessian:
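A natural form of the weighted Hessian, consistent with the $H_1 = H(\hat{\theta}_1, 1_\omega)$ used below (a reconstruction, not the original figure), is:

H(\theta, \omega) = \frac{1}{N} \sum_{n=1}^{N} \omega_n \, \nabla_\theta g_n(\theta)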



Next, we define a fixed-point equation and its derivative:
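Judging by the note that follows, the fixed-point function is presumably

G(\theta, \omega) = \frac{1}{N} \sum_{n=1}^{N} \omega_n \, g_n(\theta), \qquad \frac{\partial G(\theta, \omega)}{\partial \theta} = H(\theta, \omega)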


It is worth noting here that $G(\hat{\theta}(\omega), \omega) = 0$, because $\hat{\theta}(\omega)$ is a solution of $\frac{1}{N} \sum_{n=1}^{N} \omega_n g_n(\theta) = 0$. We also define $H_1 = H(\hat{\theta}_1, 1_\omega)$ and the weight difference $\Delta\omega = \omega - 1_\omega \in \mathbb{R}^N$. When $H_1$ is invertible, we can use the implicit function theorem and the chain rule:
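In the paper's general form (reconstructed here, not copied from the original figure), this gives the derivative of the estimate with respect to each weight at $\omega = 1_\omega$:

\frac{\partial \hat{\theta}(\omega)}{\partial \omega_n} \bigg|_{\omega = 1_\omega} = -\frac{1}{N} H_1^{-1} \, g_n(\hat{\theta}_1)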



This derivative allows us to form a linear approximation of $\hat{\theta}(\omega)$ around $\hat{\theta}_1$, which looks like this:
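In its standard infinitesimal-jackknife form (a reconstruction consistent with the description below), the approximation reads:

\hat{\theta}_{IJ}(\omega) = \hat{\theta}_1 - \frac{1}{N} H_1^{-1} \sum_{n=1}^{N} \Delta\omega_n \, g_n(\hat{\theta}_1)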



Because $\hat{\theta}_{IJ}$ depends only on $\hat{\theta}_1$ and $\Delta\omega$, and not on the solutions for any other values of $\omega$, there is no need to re-solve the optimization problem for each new $\omega$. Instead, one only has to solve a system of linear equations.


Results


In practice, this significantly reduces the time compared to cross-validation:



Source: https://habr.com/ru/post/458388/

