
Big Data and Machine Learning? Then HighLoad++ Is for You



Contrary to the name and the first impression most people get, "Big Data" is not simply "a lot of data", and it does not even cover every unbounded (or constantly updated and growing) dataset.

In fact, "Big Data" is first and foremost about the approaches, tools, and methods for processing data that is, more often than not, unstructured, diverse, and heterogeneous.
And, most importantly, "Big Data" is a new section in the 2015 HighLoad++ program, first proposed, incidentally, at a meetup of speakers. Occasional individual talks on the subject had already appeared in previous years.

By the way, the galaxy illustration in the header of this article is no accident: astronomy, as a customer, stood at the origin of several databases that are now widely used in web development.

And this year the Program Committee decided to break the "Big Data" theme out into a separate section, and, as it turned out, with good reason: twelve talk proposals have already been submitted to the "Big Data and Machine Learning" section, and we want to introduce you to the speakers.

Bright minds


The keynote talk, without a doubt, is the one by Pavel Velikhov, Director of Science at Toprater.com.

Pavel received his Master of Computer Science degree in 2000 from the University of California, San Diego. In graduate school he worked on database technology and machine learning, including applications to statistical linguistics. Since then he has been building systems in the areas of DBMSs, machine learning, and Natural Language Processing.

His first startup, which grew out of his master's work, built data integration technology on top of XML and XQuery. In 2003 the startup was acquired by BEA Systems, and its technology is now part of Oracle. In 2004 Pavel returned to Russia with the dream of creating a high-tech startup.

At first, Pavel helped develop the Sedna DBMS at the Institute for System Programming: an XML DBMS written by a very strong team. When the team realized how hard it would be to promote it, the idea arose to sell it. Oracle showed interest but, unfortunately, due to difficulties around intellectual property and M&A, the deal fell through. After that, Pavel spent some time on semantic technologies, building the Texterra system. In 2008 he joined Michael Stonebraker's SciDB project: the creation of a massively parallel DBMS for projects such as the LHC and LSST. The team built a great deal of mathematics and machine learning into SciDB, and Pavel caught fire with the idea of working in this area more closely again. After SciDB, he worked for a while as Director of Science at the News360 startup, where he mainly worked on the recommendation system.

We asked Pavel a few questions, and he kindly found the time to answer them.

- Pavel, why did you start doing what you do, and why do you keep doing it?

- From the very beginning of my career, when I was still in graduate school, I dreamed of bringing together my two main interests: databases and artificial intelligence. Back then the term AI was practically a dirty word (after yet another hype cycle in the area), and for a long time I studied the internals of DBMSs. In the Toprater.com project we are creating a platform for social multi-criteria choice. That is, across a large number of domains we process people's opinions with our technology and turn them into a system of criteria. We have put a lot of effort into our opinion-understanding technology, and its quality is now at a world-class level. So in this project my scientific and engineering interests have finally come together. And, of course, we also have to build our own systems along the way, such as a search engine.
- What interesting problems do you get to solve at work?

- We are now at the most interesting stage in the startup's development: we are preparing to enter the B2C market while also working with leading e-commerce companies in a B2B format. On the one hand, we solve complex research problems in text understanding, scale out clusters to process terabytes of data quickly, and have built a system for end-to-end data versioning. On the other hand, we sometimes have to solve data cleaning and matching tasks on very short notice in order to launch the next domain quickly or integrate the data of the next partner. We use machine learning tools as much as we can, and we already have a tradition of holding internal hackathons to tackle such problems.

We also hide all of the system's functionality behind an API, so that products on top of it can be built not only by our team but also by third-party developers. This is part of our strategy: we actively invite outside teams to take part in our project. But building a good API is no easy task either. It should be as simple and minimal as possible, with good documentation, while the constant stream of new requirements threatens to turn it into a complex monster. Here, too, we are constantly looking for balance and compromise.
Mikhail Trofimov from Avito will talk about competitive data analysis; we asked him similar questions and even got answers!

Mikhail was born and went to school in the town of Shakhty, Rostov region; he won prizes in regional mathematics olympiads and graduated with honors from a physics and mathematics gymnasium. In 2010 he entered the Moscow Institute of Physics and Technology, Department of Control and Applied Mathematics. In his third year he became interested in data analysis, and a year later he won his first Kaggle competition. Since 2014 Mikhail has been working at Avito.

- Mikhail, which competitions have you taken part in?

- I have taken part in a wide variety of contests, and on the surface they were formulated quite differently.

Despite this outward "heterogeneity", such tasks are solved in much the same way.
- Are those all Kaggle competitions? Are there other platforms?

- Kaggle is not the only platform (and not the oldest one), but it is definitely the most "alive" at the moment.
So yes, I mostly compete on Kaggle, but I have also taken part in a couple of offline hackathons and in competitions on other platforms.
- Why did you start doing what you do, and why do you keep doing it?

- Once, a couple of years ago, I took part in a business case competition with a team, and one of the tasks involved data analysis.
Neither I nor my teammates had any experience with it, so we came up with a very simple solution based on hand-written rules and heuristics. We spent a lot of time on it and ended up in 4th place, while the winning team simply applied one of the standard machine learning methods.

Since the prizes were iPads, and I really wanted one, I decided after the fact to figure out what the winners had done and what we had been missing. That was the start of my acquaintance with machine learning, which later grew into a real hobby and my thesis work. By the way, I did get an iPad, just in a different contest :)

For me, competitions are a chance to learn something new and to exercise wit and observation. Getting ahead of myself: more often than not it is careful data analysis and a non-standard view of the task, rather than machine learning algorithms, that play the decisive role. I keep taking part in such competitions to broaden my horizons and stay on top of fresh ideas and tools.
- Could you share a couple of stories? For example, what was the task, and how hard was it to solve?

- One of the first competitions I took part in was about predicting a multi-dimensional time series. Having tried a whole bunch of options, I could not come up with a solution better than "the last value multiplied by a constant", and I tuned that constant to the third decimal place. A friend of mine, who took part in the same competition, ended up in 4th place because he tuned the same constant to the fourth decimal place. That competition taught me that a winning solution is not always about complex models or crazy mathematical formulas, and that numerical precision sometimes matters a great deal.
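That baseline is easy to reproduce. Below is a minimal sketch (on synthetic data, since the original competition data is not named here) of the "last value times a constant" predictor, with the constant tuned on held-out targets:

```python
import numpy as np

def persistence_forecast(history: np.ndarray, c: float) -> np.ndarray:
    """Predict the next step of each series as (last observed value) * c."""
    return history[:, -1] * c

# Synthetic stand-in data: 100 random-walk series, 50 observed steps each,
# with the 51st value held out as the prediction target.
rng = np.random.default_rng(0)
walks = np.cumsum(rng.normal(size=(100, 51)), axis=1)
X, y_true = walks[:, :-1], walks[:, -1]

# Tune the constant to the third decimal place on the held-out targets.
best_c, best_err = 1.0, np.inf
for c in np.arange(0.900, 1.100, 0.001):
    err = np.mean((persistence_forecast(X, c) - y_true) ** 2)
    if err < best_err:
        best_c, best_err = c, err

print(f"best constant: {best_c:.3f}, MSE: {best_err:.4f}")
```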

Another funny story happened not so long ago, when our team was solving the Microsoft Malware Classification Challenge. One of the tricky aspects of that competition was the sheer volume of data: 400 GB of it. And so, a day before the deadline, trying to push our score into the top 10, we decided to count byte 10-grams (that is, to compute the frequency of occurrence of every sequence of 10 bytes). It is easy to estimate that there are 256^10 such sequences: merely counting them is a problem, let alone running feature selection over them. But we were bold enough: with a few technical tricks we still counted them, selected features, and added them to the model. We submitted the solution in the last 15 seconds before closing. We made it, and in the end shot up from fifteenth place to third. Unforgettable!
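The "technical tricks" are not spelled out here, but one standard way to make 256^10 possible n-grams tractable is the hashing trick: hash each n-gram into a fixed number of buckets and count those instead. A minimal sketch (the bucket count and file name are purely illustrative):

```python
import zlib
from collections import Counter

def hashed_ngram_counts(data: bytes, n: int = 10, buckets: int = 2**20) -> Counter:
    """Count byte n-grams, hashed into a fixed number of buckets.

    256**10 distinct 10-grams are far too many to store explicitly, so each
    n-gram is mapped to one of `buckets` counters; hash collisions are
    tolerated as noise, which is usually acceptable for ML features.
    """
    counts = Counter()
    for i in range(len(data) - n + 1):
        counts[zlib.crc32(data[i:i + n]) % buckets] += 1
    return counts

# Hypothetical usage on a single malware sample:
# with open("sample.bytes", "rb") as f:
#     features = hashed_ngram_counts(f.read())
```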
- What interesting problems have to be solved at work?

- There are many tasks, all of them different, and each unique in its own way. Sometimes the complexity (= the interest) lies in the volume of data (there can be either a great deal of it or very little), sometimes in the problem statement itself. One vivid memory is the task of estimating the price of a car. In essence, it is the task of adjusting a database query so as to get the most similar slice of cars that still contains at least a specified number of objects, and Ivan (Goose) and I managed to come up with quite a beautiful solution, in my opinion.
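The actual Avito solution is not described here, but the core "query relaxation" idea can be sketched: start with a tight filter and widen it step by step until the result set holds at least the required number of objects (all field names and tolerances below are made up for illustration):

```python
from statistics import median

def similar_price(cars, target, min_objects=30, max_steps=10):
    """Estimate a car's price from the most similar slice of listings.

    cars: list of dicts with 'model', 'year', 'mileage', 'price' keys.
    The filter starts tight and is relaxed until the slice contains
    at least `min_objects` cars.
    """
    year_tol, mileage_tol = 0, 5_000
    for _ in range(max_steps):
        cut = [c for c in cars
               if c["model"] == target["model"]
               and abs(c["year"] - target["year"]) <= year_tol
               and abs(c["mileage"] - target["mileage"]) <= mileage_tol]
        if len(cut) >= min_objects:
            return median(c["price"] for c in cut)
        year_tol += 1            # relax the query and try again
        mileage_tol += 10_000
    return None                  # not enough similar cars found
```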

There is no need to explain here why "3" is a beautiful number, so we also talked to a third "practicing" speaker: Konstantin Ignatov from Qrator Labs, who applies machine learning in preparing for gigabit-scale DDoS attacks.

Konstantin is a graduate of Bauman Moscow State Technical University (Informatics and Control Systems, Automatic Control Systems) and the Higher School of Economics (Business Informatics, Corporate Information Systems), and a development engineer in the research department of Qrator Labs. Konstantin's talk at HighLoad++ will be something of an introduction to machine learning: he will go through the main types of ML tasks, show how they relate to each other, explain the principles used to solve them, and, finally, discuss how long training and prediction take.

- To begin with, I would like to emphasize that we are a security company. Our goal is to keep our clients' resources available on the Internet; the attackers' goal is to make a client unavailable for at least some time. As a result, we are locked in a constant struggle, and the mistakes of one side will surely be exploited by the other.

As usual, the principle of "think like a criminal" applies: to stop intruders, we need to understand how they will act.

For example, a user browses a site, clicks the mouse, types something into forms, and the browser sends requests. Some of these requests the browser makes almost of its own accord (for example, loading static assets), while others are directly tied to the user's actions. Some requests may take longer to process than others, say GET /search?q=.... An attacker may notice this and try to send many such requests to the site in a row, exhausting the resources needed to serve "normal" requests.

In response you can, for example, introduce response caching, but then the attacker simply adds variety to the q=... parameter. You can then impose a limit on the number of requests per second from a single IP to /search, but the attacker will start using many IP addresses (say, by buying a botnet). In response you can, for example, forbid GET /search as the first request, i.e. require a cookie set from some other page of the site, but the attacker will quickly figure out the trick and start making requests in pairs...
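As an illustration of one of these countermeasures, here is a minimal per-IP rate limiter for /search in the token-bucket style (a sketch, not Qrator's implementation; the rate and burst numbers are made up):

```python
import time
from collections import defaultdict

RATE = 5.0    # allowed requests per second per IP (illustrative)
BURST = 10.0  # short-term burst allowance (illustrative)

# ip -> (tokens remaining, time of last update)
_buckets = defaultdict(lambda: (BURST, time.monotonic()))

def allow_request(ip: str) -> bool:
    """Return True if this IP may make another /search request now."""
    tokens, last = _buckets[ip]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last hit
    if tokens < 1.0:
        _buckets[ip] = (tokens, now)
        return False          # over the limit: reject (or send a challenge)
    _buckets[ip] = (tokens - 1.0, now)
    return True
```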

Taken to the limit, this interaction can be viewed from two points of view: strategic and tactical.

Strategically, what matters in the end is simply who spends more money: if any noticeable success requires a budget the attacker does not have and will not get, the threat can be considered eliminated. Our "moves" here consist of forcing intruders to use the most expensive equipment and, perhaps more importantly, manual intellectual work: we want every resource under our protection to have to be attacked individually. We, in turn, automate our actions as much as possible and use expensive equipment sparingly.

Tactically, the task described above comes down to:
  1. For us: finding site visitors who behave strangely and at the same time create a significant load;
  2. For the attacker: writing bots that are very well disguised as real users and yet still "knock down" the site.

Since, as I just mentioned, we want to automate our side of this task as much as possible, we use machine learning. And here we run into virtually every kind of ML task:
  • The need to predict load (for example, to understand under what conditions the backend will "fall over"), i.e. regression;
  • The need to understand whether an attack is happening at all, or whether someone has simply attracted many legitimate users, i.e. classification;
  • The need to find groups of visitors sharing some property, i.e. clustering or anomaly detection.

And the list, of course, is not complete.
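To make the three task types concrete, here is a purely illustrative scikit-learn sketch with made-up per-visitor traffic features (none of this is Qrator's actual pipeline):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))   # e.g. requests/s, URL entropy, session length

# Regression: predict backend load from traffic features.
load = X @ np.array([3.0, 1.0, 0.5]) + rng.normal(size=1000)
reg = LinearRegression().fit(X, load)

# Classification: attack vs. a spike of legitimate users (toy label).
is_attack = (X[:, 0] > 1.0).astype(int)
clf = LogisticRegression().fit(X, is_attack)

# Clustering / anomaly detection: group visitors by behavior;
# DBSCAN labels outliers as -1, a crude anomaly flag.
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)

print(reg.coef_, clf.score(X, is_attack), np.unique(labels))
```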
Well, the "full list" you yourself know where and when , so - see you soon!
And finally: Habrahabr users get a special 15% discount on the conference; all you need to do is use the code "IAmHabr" when booking tickets.

Source: https://habr.com/ru/post/267379/

