
Multi-output in machine learning

Machine learning algorithms learn from a provided sample in order to make predictions on new data. However, the problem discussed in most textbooks is predicting a single value from some set of features. What if we need the opposite: to obtain a whole set of features based on one or a few values?

Faced with a task of this kind, and lacking deep knowledge of mathematical statistics and probability theory, I ended up doing a small piece of research.

The first thing I learned about was imputing missing data with averages. Accordingly, I worked with the class scikit-learn provides for this: Imputer. Referring to the documentation, I can quote:
The Imputer class provides basic strategies for imputing missing values, either using the mean, the median, or the most frequent value of the row or column containing the missing values.
Even though I understood that the result would not be useful, I still decided to try this class, and here is what actually came of it:

import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.model_selection import train_test_split

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
df = pd.read_csv(url, header=None)
df.columns = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash',
              'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
              'Proanthocyanins', 'Color intensity', 'Hue',
              'OD280/OD315 of diluted wines', 'Proline']

imp = Imputer(missing_values='NaN', strategy='mean')
imp.fit(df)
imp.transform([[3, 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN',
                'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN']])

array([[3.00000000e+00, 1.30006180e+01, 2.33634831e+00, 2.36651685e+00,
        1.94949438e+01, 9.97415730e+01, 2.29511236e+00, 2.02926966e+00,
        3.61853933e-01, 1.59089888e+00, 5.05808988e+00, 9.57449438e-01,
        2.61168539e+00, 7.46893258e+02]])
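A side note: in recent scikit-learn releases the Imputer class has been removed; its equivalent today is sklearn.impute.SimpleImputer. A minimal sketch of the same call, not part of the original experiment:

import numpy as np
from sklearn.impute import SimpleImputer

# Fit a mean imputer on the wine DataFrame and fill a row where
# everything except the class label is missing.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(df)
print(imp.transform([[3] + [np.nan] * 13]))  # each NaN becomes the column mean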

When I tried to verify the resulting data with the RandomForestClassifier class, it turned out that the classifier disagrees with us: it is convinced that this array of values corresponds to the first class, not the third.
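That verification step is not shown in the text; a minimal sketch of what it could look like (assuming the class label sits in the first column of df):

from sklearn.ensemble import RandomForestClassifier

# Train a classifier on the real data: class label vs. the 13 features.
labels = df.iloc[:, 0].values
features = df.iloc[:, 1:].values
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(features, labels)

# Ask which class the imputed feature vector (label column stripped) belongs to.
imputed = imp.transform([[3] + [np.nan] * 13])
print(clf.predict(imputed[:, 1:]))  # the article reports class 1, not the expected class 3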
Now that we have realized this method does not suit us, let us turn to the MultiOutputRegressor class. MultiOutputRegressor is designed specifically for regressors that do not support multi-target regression out of the box. Let's check how it behaves with ordinary least squares (LinearRegression):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_features=1, n_targets=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
multioutput = MultiOutputRegressor(LinearRegression()).fit(X_train, y_train)
print("Test set score: {:.2f}".format(multioutput.score(X_test, y_test)))
print("Training set score: {:.2f}".format(multioutput.score(X_train, y_train)))

Test set score: 0.82
Training set score: 0.83

The result is quite good. The logic behind it is very simple: it all comes down to fitting a separate regressor for each element of the set of output targets. That is, roughly:

import numpy as np
import sklearn.base

class MultiOutputRegressor__:
    def __init__(self, est):
        self.est = est

    def fit(self, X, y):
        # Fit a clone of the base estimator for each target column separately.
        n_samples, n_targets = y.shape
        self.estimators_ = [sklearn.base.clone(self.est).fit(X, y[:, i])
                            for i in range(n_targets)]
        return self

    def predict(self, X):
        # Stack the per-target predictions back into a single 2D array.
        res = [est.predict(X)[:, np.newaxis] for est in self.estimators_]
        return np.hstack(res)
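A quick sanity check of this simplified wrapper on the same synthetic data as above (a sketch; its score should essentially match the library class):

from sklearn.metrics import r2_score

manual = MultiOutputRegressor__(LinearRegression()).fit(X_train, y_train)
# Averaged R^2 across the 10 targets, comparable to the 0.82 reported above.
print("Test set score: {:.2f}".format(r2_score(y_test, manual.predict(X_test))))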


Now let's see how the RandomForestRegressor class, which supports multi-target regression natively, performs on real data.

from sklearn.ensemble import RandomForestRegressor

df = df.drop(['Class'], axis=1)
# Two columns serve as inputs (the choice here is illustrative), the rest are targets.
X, y = df[['Alcohol', 'Hue']], df.drop(['Alcohol', 'Hue'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
forest = RandomForestRegressor(n_estimators=30, random_state=13)
forest.fit(X_train, y_train)
print("Test set score: {:.2f}".format(forest.score(X_test, y_test)))
print("Training set score: {:.2f}".format(forest.score(X_train, y_train)))

Test set score: 0.65
Training set score: 0.87

In order not to mislead anyone about proanthocyanidins:

Reference
Proanthocyanidins are natural chemical compounds. They are found mainly in the seeds and skins of grapes; they also occur in oak and pass into wine when it is aged in oak barrels. The molecular mass of proanthocyanidins varies with how long the wine is aged: the older the wine, the higher it is (for very old wines the molecular weight decreases again).

To a large extent they determine the stability of red wines.

The result is worse than on the synthetic data (on which the random forest scores around 0.99). However, as expected, it improves as more input features are added.
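For instance, one more column can be moved from the targets to the inputs and the model refit. A sketch (the column choice is illustrative, and the exact numbers will depend on it):

# Three input columns instead of two; everything else remains a target.
input_cols = ['Alcohol', 'Hue', 'Color intensity']
X3, y3 = df[input_cols], df.drop(input_cols, axis=1)
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, random_state=1)
forest3 = RandomForestRegressor(n_estimators=30, random_state=13).fit(X3_train, y3_train)
print("Test set score: {:.2f}".format(forest3.score(X3_test, y3_test)))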

Multi-output methods make it possible to solve many interesting problems and obtain exactly the data you need.

Source: https://habr.com/ru/post/358954/

