I am not a real big-data guy, I just found xgboost on GitHub. The pursuit of the 500 thousand rubles from Beeline pushed me to dive into the world of machine learning, which I had been interested in before, but never confident about, and so had never taken the plunge. A quick search showed that in this field xgboost, from Chinese comrades at the University of Washington, currently rules. As I understand it, it is something like Apple in the field of machine learning: you press one button and quickly get a nice-looking result.
On closer inspection, settings came to light: by tweaking them you can speed up or slow down training and make the program's predictions finer or coarser. The most convenient input format for the program is libsvm. Surely there are plenty of libraries that can convert anything (from csv to avi) into libsvm, but in this case I got by with my own home-grown converter in JavaScript.
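A minimal sketch (in Python, not my original JavaScript) of what such a conversion and a basic xgboost run might look like; the file names, the number of classes and the parameter values are illustrative assumptions, not what I actually used:

```python
import csv
import xgboost as xgb

def csv_to_libsvm(csv_path, libsvm_path, label_col=0):
    """Write 'label index:value ...' lines; assumes the features are already numeric."""
    with open(csv_path) as src, open(libsvm_path, "w") as dst:
        reader = csv.reader(src)
        next(reader)                            # skip the header row
        for row in reader:
            label = row[label_col]
            feats = [v for i, v in enumerate(row) if i != label_col]
            # 1-based feature indices, as in the libsvm convention; zeros/blanks are omitted (sparse)
            pairs = ["%d:%s" % (i + 1, v) for i, v in enumerate(feats) if v not in ("", "0")]
            dst.write(label + " " + " ".join(pairs) + "\n")

csv_to_libsvm("train.csv", "train.libsvm")
dtrain = xgb.DMatrix("train.libsvm")            # xgboost reads libsvm files directly

params = {
    "objective": "multi:softmax",               # predict the age-group class directly
    "num_class": 7,                             # assumed number of age groups
    "eta": 0.1,                                 # lower = slower but usually finer learning
    "max_depth": 6,                             # deeper trees = finer (and riskier) splits
}
bst = xgb.train(params, dtrain, num_boost_round=200)
```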
So, already on September 30, with xgboost's help I had 76.58% correct answers according to the preliminary scoring. Seasoned big-data guys, meanwhile, were already hitting 77%+! After reading about the benefits of ensembles (combining several predictions by majority vote), I started looking for other methods in order to obtain predictions uncorrelated with the ones I already had: if you build an ensemble out of predictions of roughly the same origin, their accuracy merely averages out, whereas with completely different ones the accuracy goes up.
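For clarity, here is roughly what such a majority-vote ensemble looks like in code; the prediction files are hypothetical placeholders, each assumed to hold one predicted class label per line in the same row order:

```python
import numpy as np

preds = np.column_stack([
    np.loadtxt("pred_xgboost.txt", dtype=int),
    np.loadtxt("pred_knn.txt", dtype=int),
    np.loadtxt("pred_svm.txt", dtype=int),
])
# Majority vote: for each row, take the class predicted by the most models
ensemble = np.array([np.bincount(row).argmax() for row in preds])
np.savetxt("pred_ensemble.txt", ensemble, fmt="%d")
```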
I tried support vector machines (SVM), nearest neighbours weighted by distance to the neighbour, and random forest. These methods already required some data preparation: normalize here, weight the significance of the factors there. I had to learn a bit of Python's sklearn (Sci-Kit Learn), since its documentation and examples are clear even to a cat, even though my Python is at the "read/write with a dictionary" level. The results were much worse, from 57 to 73% correct, because of my inability to prepare the data, which these methods are sensitive to. And yet I seemed to do everything the books say: categories into dummy variables, numbers into the range [0; 1], insignificant factors discarded.
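A rough sketch of that preparation and the three models in sklearn; the column names, neighbour count and tree count are assumptions for illustration, not tuned values:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("train.csv")
y = data.pop("y")                        # target column name is a placeholder
X = pd.get_dummies(data)                 # categories -> dummy variables
X = MinMaxScaler().fit_transform(X)      # numbers -> the range [0; 1]

models = {
    "svm": SVC(),
    "knn": KNeighborsClassifier(n_neighbors=15, weights="distance"),
    "rf":  RandomForestClassifier(n_estimators=300),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```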
To diversify the models, I started getting rid of factors, both significant and insignificant. The variable x8 alone contains more than 90% of the information about the age group available in all the variables (on it alone you can get 72%+ correct). Without this information xgboost began to miss a lot, dropping to around 55% correct, but, surprisingly, nearest neighbours stayed at 73%. Having created several dozen variants with various cuts of the information available for training, I gathered them into an ensemble and... 77.05%. Sad...
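To show what "several dozen variants with various cuts of the information" can mean in practice, here is a hedged sketch: the same model trained on random feature subsets gives less correlated predictions to feed into the ensemble above. The file names, the model choice and the drop rate are assumptions, not my actual setup:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

train = pd.read_csv("train_prepared.csv")   # assumed already numeric/preprocessed
test = pd.read_csv("test_prepared.csv")
y = train.pop("y")

rng = np.random.RandomState(0)
variant_preds = []
for _ in range(30):                          # a few dozen variants
    # keep a random ~70% of the columns, so x8 (or anything else) may be cut
    cols = [c for c in train.columns if rng.rand() > 0.3]
    model = KNeighborsClassifier(n_neighbors=15, weights="distance")
    model.fit(train[cols], y)
    variant_preds.append(model.predict(test[cols]))
# variant_preds can then be stacked and majority-voted as in the earlier sketch
```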
So there was evidently some useful way of processing the input data, or some method, that I had not applied but knowledgeable people had. That got annoying. What cheered me up a little was discovering that Beeline's preliminary rating is always computed on the first 15 thousand lines of the test set, not on a random 15 thousand, as I had originally assumed. This means you can pull out some more lines of data for training and cross-validation. Who knows, maybe there is something unknown and useful in them?!
It is not clear whether the organizers of the competition intended this, but on October 27, over a few hours and roughly 23 thousand attempts, I obtained the first 15 thousand lines of the test sample with the correct answers (there were no captchas and no limits on how often solutions could be submitted, so it all went quickly and cheaply even in 1-2 threads). These additional 15 thousand did not bring much benefit, though, beyond stroking my ego: by my estimate, the methods I was using gave no chance of overtaking the leaders on the remaining 70% of the test data.
The practical results of this ML lesson from Beeline:
- xgboost is optimal when looking for dependencies in data: it lets you get 95% of the best possible result at the cost of 5% of the effort, and it suits a lazy, consumer-style approach at the entry level.
- For the advanced level, Python's sklearn will do. If you work with it regularly, it is not difficult either: the library's classes are convenient and implement almost any fantasy.
- By combining uncorrelated results, even poor-quality ones, you can raise accuracy above the level of the best of the individual sources (the usefulness of ensembles).
And how did you take part in the Beeline contest, and what did you get out of it?