- Pavel, why did you start and continue to do what you do?
- From the very beginning of my career, back in graduate school, I dreamed of combining my two main interests: databases and artificial intelligence. At the time, "AI" was almost a dirty word (after yet another hype cycle in the field had collapsed), and for a long time I studied the internals of DBMSs. In the Toprater.com project we are building a platform for social multi-criteria choice: across a large number of domains, we process people's opinions with our technology and turn them into a system of criteria. We have put a lot of effort into our opinion-understanding technology, whose quality is now at the global state of the art. So in this project my scientific and engineering interests finally came together. And, of course, we also have to build our own systems along the way, such as a search engine.
- What interesting problems do you have to solve at work?
- We are now at the most interesting period in the startup's development: we are preparing to enter the B2C market while cooperating with leading e-commerce companies in a B2B format. On the one hand, we solve complex research problems in text understanding, scale out clusters to quickly process terabytes of data, and have built a system for end-to-end data versioning. On the other hand, we sometimes have to clean and match data on very short notice in order to launch the next domain or integrate the next partner's data. We use machine learning tools as much as possible, and we already have a tradition of holding internal hackathons to solve such problems.
We also hide all the functionality of our system behind an API, so that products on top of it can be built not only by our team but also by third-party developers. This is part of our strategy: we actively invite outside teams to join the project. But creating a quality API is not an easy task either. It should be as simple and minimal as possible, with good documentation, while a constant stream of new requirements threatens to turn it into a complex monster. Here, too, we are constantly looking for balance and compromise.
- Michael, which competitions have you participated in?
- I have participated in a variety of contests; roughly, the tasks were formulated as:
- Determining the relevance of a query/document pair
- Detection of bots in online auctions
- Virus classification
- Predicting contextual advertising clicks (CTR)
- Search for the Higgs boson
- Text ad classification
- Predicting the number of views/likes of a post on a social network
- Borrower's default prediction
- Predicting customer returns
Despite their apparent heterogeneity, these tasks are solved in much the same way.
- Are these all Kaggle competitions? Anything else?
- Kaggle is not the only platform (and not the oldest), but it is definitely the most "live" at the moment.
Basically, yes, I compete on Kaggle, but I have also taken part in a couple of offline hackathons and in competitions on other platforms.
- Why did you start and continue to do what you do?
- Once, a couple of years ago, I took part in a business case competition with a team, and one of the tasks involved data analysis.
Neither I nor my teammates had any experience with that, so we came up with a very simple solution based on manual rules and heuristics. We spent a lot of time and ended up in 4th place, while the winning team simply applied one of the standard machine learning methods.
Since the prizes were iPads, and I really wanted one, I decided after the fact to figure out what the winners had done and what we had lacked. That is when my acquaintance with machine learning began; it later grew into a real hobby and my thesis work. By the way, I did get the iPad, but in a different contest :)
For me, competitions are an opportunity to learn something new and to exercise ingenuity and observation. More often than not it is data analysis and a non-standard view of the task, rather than machine learning algorithms, that play the decisive role. I keep participating in such competitions to broaden my horizons and stay on top of fresh ideas and tools.
- Could you tell a couple of stories? For example, what was the task, and how hard was it to solve?
- One of the first competitions I took part in was the prediction of a multi-dimensional time series. Having tried a bunch of options, I could not beat the solution "the last value multiplied by a constant", where I had tuned the constant to three decimal places. A friend of mine, who participated in the same competition, ended up in 4th place because he had tuned the same constant to four decimal places. That competition taught me that winning is not always about complex models or crazy mathematical formulas, and that numerical precision sometimes matters a great deal.
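The baseline he describes can be sketched as a one-line forecast with a grid-searched multiplier. This is a hypothetical reconstruction: the actual competition data, metric, and grid are unknown, so a synthetic series stands in for the real one.

```python
import numpy as np

def tune_constant(series, grid=np.linspace(0.8, 1.2, 4001)):
    """Find the multiplier c minimizing mean squared error of the
    'last value times c' one-step-ahead forecast over the history."""
    prev, curr = series[:-1], series[1:]
    errors = [((c * prev - curr) ** 2).mean() for c in grid]
    return grid[int(np.argmin(errors))]

# toy series with a mild downward drift (stands in for the real data)
rng = np.random.default_rng(0)
series = 100 * np.cumprod(0.99 + 0.001 * rng.standard_normal(500))

c = tune_constant(series)          # should land near 0.99
forecast = c * series[-1]          # 'last value times constant'
```

For squared error the optimal constant actually has a closed form, `(prev * curr).sum() / (prev ** 2).sum()`; the grid search above just illustrates how tuning one more decimal place of the constant can separate adjacent leaderboard positions.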
Another funny incident happened not so long ago, when our team was working on the Microsoft Malware Classification Challenge. One of the tricky parts of that competition was the data itself: 400 GB of it. A day before the deadline, trying to climb into the top 10, we decided to count byte 10-grams (that is, to compute the frequency of occurrence of every sequence of 10 bytes). It is easy to estimate that the number of such sequences is 256^10; merely counting them is a problem, let alone running feature selection over them. But we were bold enough: with a few technical tricks we counted them anyway, performed the selection, and added the features to the model. We submitted the solution in the last 15 seconds before the deadline. We made it, and in the end jumped from fifteenth place to third. Unforgettable!
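One standard way to make 256^10 possible 10-grams tractable (not necessarily the trick the team actually used) is the hashing trick: fold every n-gram into a fixed number of buckets and count, tolerating collisions.

```python
from collections import Counter

def byte_ngram_counts(data: bytes, n: int = 10, buckets: int = 2**20):
    """Count byte n-grams with the hashing trick: the 256**n possible
    n-grams cannot be enumerated explicitly, so each one is hashed
    into a fixed number of buckets and collisions are tolerated.
    Note: Python's hash() of bytes is salted per process, so bucket
    ids are only stable within a single run."""
    counts = Counter()
    for i in range(len(data) - n + 1):
        counts[hash(data[i:i + n]) % buckets] += 1
    return counts

# a repeating 2-byte pattern has only two distinct 10-byte windows
counts = byte_ngram_counts(b"\x00\x01" * 50, n=10)
```

In practice one would stream the 400 GB file in chunks and merge per-chunk counters, but the core idea, replacing an astronomically large key space with a fixed-size bucket array, is the same.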
- What interesting problems do you have to solve at work?
- There are many tasks, all different, and each is unique in its own way. Sometimes the complexity (and hence the interest) lies in the volume of data (there can be either far too much or far too little), sometimes in the problem statement itself. One of my vivid impressions is the task of estimating the price of a car. In essence, it is the task of correcting a database query so as to obtain the most similar slice of the data that still contains at least a specified number of objects. Ivan (Goose) and I managed to come up with a rather elegant solution, in my opinion.
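The problem he hints at might look like iterative query relaxation: drop the least important filter until the result slice is large enough. Everything below, the toy table, the field names, and the relaxation order, is invented for illustration and is not his actual solution.

```python
# A toy in-memory "database" of cars; names and values are hypothetical.
CARS = [
    {"model": "A", "year": 2014, "mileage": 40_000},
    {"model": "A", "year": 2013, "mileage": 70_000},
    {"model": "B", "year": 2014, "mileage": 45_000},
    {"model": "A", "year": 2011, "mileage": 120_000},
]

def match(car, filters):
    # mileage acts as an upper bound, other fields must match exactly
    return all(car[k] <= v if k == "mileage" else car[k] == v
               for k, v in filters.items())

def relaxed_query(query, min_objects=3):
    """Drop filters, least important first, until the resulting
    slice contains at least `min_objects` similar cars."""
    filters = dict(query)
    relax_order = ["mileage", "year", "model"]  # least important first
    while True:
        rows = [c for c in CARS if match(c, filters)]
        if len(rows) >= min_objects or not filters:
            return rows
        for key in relax_order:
            if key in filters:
                filters.pop(key)
                break

# strict query matches 1 car; relaxing mileage and year yields 3
similar = relaxed_query({"model": "A", "year": 2014, "mileage": 50_000})
```

The resulting slice can then feed a price estimate (say, the median price of the similar cars); the interesting design choice is the relaxation order, i.e. which constraint costs the least "similarity" to give up.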
To begin with, I would like to emphasize that we are a security company. Our goal is to ensure the availability of client resources on the Internet, while the attackers' goal is to make our client unavailable for at least some time. As a result, we are in a constant struggle, and the mistakes of either side will surely be exploited by the other.
As usual, the principle of "thinking like a criminal" applies: to stop intruders, we need to understand how they will act.
For example, a user browses a site, clicks the mouse, types something into forms, and the browser sends requests. Some of these requests the browser makes almost of its own accord (for example, loading static assets), while others are directly tied to the user's actions. Some requests may take longer to process than others, say GET /search?q=... An attacker may notice this and try to send many such requests to the site in a row, exhausting the resources needed to process "normal" requests.
In response you can, for example, introduce response caching, but then the attacker simply adds variety to the q=... parameter.
In that case you can limit the number of requests per second from a single IP to /search, but the attacker will start using many IP addresses (say, by buying a botnet). In response you can, for example, forbid GET /search as the first request, i.e. require that a cookie be set from some other page of the site first, but the attacker will quickly figure out the trick and start making requests in pairs...
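The per-IP limit mentioned above can be sketched as a sliding-window counter. This is illustrative only: a real mitigation layer sits at the network edge, not in application Python, and the limits here are arbitrary.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` requests per `window` seconds per IP,
    e.g. for an expensive path like GET /search."""
    def __init__(self, limit=10, window=1.0):
        self.limit, self.window = limit, window
        self.hits = defaultdict(deque)  # ip -> request timestamps

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # evict timestamps that have slid out of the window
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) < self.limit:
            q.append(now)
            return True
        return False
```

As the text notes, this only raises the attacker's cost: with a botnet each bot stays under the per-IP limit, which is exactly why the defense then has to move on to behavioral signals.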
Ultimately, this interaction can be viewed from two points of view: strategic and tactical.
Strategically, what matters is simply who spent more money: if any significant success requires the attacker to have a budget he does not have and will not get, then we consider the threat eliminated. Our "moves" here are to force intruders to use the most expensive equipment and, perhaps more importantly, manual intellectual work: we want attackers to have to attack each resource we protect separately. We, in turn, automate our actions as much as possible and use expensive equipment rationally.
Tactically, the task described above comes down to:
- For us: finding site visitors who behave strangely and at the same time create a significant load;
- For the attacker: writing bots that are very well disguised as real users but can still "knock out" the site.
Since, as I just mentioned, we want to automate our work as much as possible, we use machine learning. Here we run into all kinds of ML tasks:
- The need to predict load (for example, to understand under what conditions the backend will "fall over"), i.e. regression;
- The need to understand whether an attack is happening at all, or whether someone has simply attracted many legitimate users, i.e. classification;
- The need to find groups of visitors with one property or another, i.e. clustering or anomaly detection.
And the list, of course, is not complete.
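As a toy illustration of the anomaly-detection flavor of this work (the features, the threshold, and the synthetic data below are all made up, not their production pipeline), one can flag visitors whose robust z-score is extreme:

```python
import numpy as np

# Hypothetical per-visitor features: requests per minute, and the
# fraction of requests hitting an expensive endpoint like /search.
rng = np.random.default_rng(42)
normal = np.column_stack([rng.poisson(5, 200), rng.beta(2, 8, 200)])
bots = np.column_stack([rng.poisson(120, 5), rng.beta(9, 1, 5)])
visitors = np.vstack([normal, bots])  # 200 humans + 5 bots

# Robust z-score: median/MAD resists contamination by the bots
# themselves, unlike mean/std.  The cutoff of 10 is arbitrary.
med = np.median(visitors, axis=0)
mad = np.median(np.abs(visitors - med), axis=0) + 1e-9
z = np.abs(visitors - med) / mad
suspicious = (z > 10).any(axis=1)
```

A real system would of course use many more features and a proper model, but the principle stands: bots have to mimic the whole joint distribution of human behavior, and each feature they get wrong is a handle for detection.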
Source: https://habr.com/ru/post/267379/