
School of Data: is it possible to influence elections with the help of Big Data?



Hi, Habr! Is it possible to control the world with data? Well, the answer is obvious. The question is, how ...

By now everyone has heard about the success of Cambridge Analytica in Trump's election campaign and in the notorious Brexit vote.
The article about it gathered a large number of fans. It describes the tremendous results that modern analytics can achieve. However, these results are achievable only if certain nuances are taken into account, which the authors of the article did not mention and which we would like to discuss. These nuances can turn the problem from easily solvable into impossible, or vice versa.

The first nuance is data access. Undoubtedly, to draw any conclusions about people, we need data about them. According to the article, the data were obtained in one of three ways: by purchase, through third-party applications, or through an API.

As for the API, it should be said right away that it is a rather limited tool, so this option will not work for processing large amounts of data (as a rule, after a certain number of API requests a CAPTCHA pops up).
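To get a feel for why request limits matter at this scale, here is a rough back-of-the-envelope sketch. The rate limit and batch size below are hypothetical numbers chosen for illustration, not the documented limits of any real social network API; the 200 million figure is the database size claimed in the article.

```python
def days_to_crawl(n_profiles, requests_per_second, profiles_per_request=1):
    """Days needed to fetch n_profiles under a fixed request-rate limit."""
    # Ceiling division: a partially filled batch still costs one request.
    requests_needed = -(-n_profiles // profiles_per_request)
    seconds = requests_needed / requests_per_second
    return seconds / 86_400  # seconds in a day

# Hypothetical limits (assumptions, not real API documentation).
one_by_one = days_to_crawl(200_000_000, requests_per_second=3)
batched = days_to_crawl(200_000_000, requests_per_second=3, profiles_per_request=100)
print(f"one profile per request: {one_by_one:.0f} days")
print(f"100 profiles per request: {batched:.1f} days")
```

Under these assumed limits, a single access token would need roughly two years to fetch the profiles one at a time, and that is before any CAPTCHA or ban interrupts the crawl.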

The second option, third-party applications, is in our opinion very narrow: you first need to “lure” all users into the application. According to the authors of the article, they have a database of more than 200 million people; it is unlikely that 200 million people installed a special app and took surveys. In addition, collecting data through surveys is an art in itself: it is quite difficult to get accurate answers, because the result depends on the interface, the wording of the question, how it is perceived, and many other things.

The third option, buying data, looks the most likely, but it is not a given that purchased data is exhaustive. Although the data market in the West is more advanced and transparent, it nevertheless has limitations. Of course, if the company had connections with the largest social network, then there would be no need for the API and no need to buy anything. But the question of A/B testing remains: you need a huge number of impressions to learn to distinguish people's preferences at the level of detail described in the article. Judging by the programmatic market, even targeting by fairly basic interests or group membership is far from always achievable.
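How large "a huge number of impressions" is can be estimated with the standard sample-size approximation for comparing two proportions. This is only a sketch: the click-through rates, significance level, and power below are assumed values for illustration, not figures from the article.

```python
from math import ceil
from statistics import NormalDist

def impressions_per_variant(p_base, p_variant, alpha=0.05, power=0.8):
    """Approximate impressions needed per variant to detect the difference
    between two response rates with a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_base - p_variant) ** 2)

# Assumed rates: 1.0% for the baseline banner, 1.2% for the tailored one.
n = impressions_per_variant(0.010, 0.012)
print(f"about {n} impressions per variant")
```

With these assumptions a single comparison already needs tens of thousands of impressions per variant, and the number grows quadratically as the difference between the rates shrinks. Multiplying that by every audience segment and every message variant shows why fine-grained targeting is expensive to validate.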

The second nuance is the algorithms. Here there is no doubt that the authors are telling the truth: today both computing power and mathematics are advanced enough to build fairly high-quality algorithms. Anyone who wants to verify this personally can just look at the papers from the major machine learning conferences to see that all of this is possible, and in a relatively short time.

These are mainly supervised machine learning algorithms, whose training sample is formed from thematic groups in social networks. For example, to form a pool of people who actively support Trump, you can find the corresponding groups on Facebook and analyze them carefully, selecting typical members. Similarly, you can find people who support Hillary Clinton. This gives us two sets of people that we need to tell apart. The rest is a matter of technique. For these people, a large number of features are downloaded through the API (just look at how much is available, for example, for Facebook or VKontakte).

Next, a simple machine learning method such as logistic regression is used to build a classifier that can separate these two sets. After that, little remains to be done: “walk” the ready-made classifier across the social network (in fact, the relevant groups are enough) and select a target audience that you can then work with, for example through targeted advertising.
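The two-group setup described above can be sketched with a tiny logistic regression trained by gradient descent. Everything here is an illustrative assumption: the data is synthetic, and the three binary features (for example, whether a user follows some page) are hypothetical stand-ins for the features one would actually download. In practice a library implementation would normally be used instead of this hand-written loop.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    """Probability that a user with features x belongs to group 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logreg(X, y, lr=0.5, epochs=300):
    """Batch gradient descent on the average log-loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def make_toy_data(n_per_class, rng):
    """Synthetic users: each feature is 1 with a group-specific probability."""
    probs = {1: [0.8, 0.2, 0.5], 0: [0.2, 0.8, 0.5]}  # assumed, not real data
    X, y = [], []
    for label, p in probs.items():
        for _ in range(n_per_class):
            X.append([1.0 if rng.random() < pj else 0.0 for pj in p])
            y.append(label)
    return X, y

rng = random.Random(0)
X, y = make_toy_data(200, rng)
w, b = train_logreg(X, y)
accuracy = sum(
    (predict_proba(w, b, xi) >= 0.5) == bool(yi) for xi, yi in zip(X, y)
) / len(X)
print(f"weights={w}, bias={b:.2f}, training accuracy={accuracy:.2f}")
```

The learned weights show which features pull a user toward one group or the other; "walking" the classifier over the network then amounts to calling `predict_proba` on each new user's feature vector.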

There are also heuristic algorithms. For example, at one time it was impossible to see which people had the largest number of subscribers on the social network VKontakte (for quite a while this could be done via the People tab, and then this feature was removed for a time). In this case, heuristics can be applied based on the well-known idea of preferential attachment from the theory of web graphs, for example as described here.
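One way such a heuristic might look is sketched below. It is not the method from the linked article, only an illustration of the underlying idea: in a network grown by preferential attachment, a random contact of a random user tends to have far more connections than the user does, so popular accounts surface quickly without ever listing the whole network. The graph is synthetic and all sizes are assumed.

```python
import random

def preferential_attachment_graph(n, m, rng):
    """Grow a graph where each new node links to m existing nodes,
    chosen with probability proportional to their current degree."""
    neighbors = {i: set() for i in range(n)}
    endpoints = []  # every node appears once per incident edge
    for i in range(m + 1):  # seed: a small fully connected core
        for j in range(i):
            neighbors[i].add(j)
            neighbors[j].add(i)
            endpoints += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))  # degree-proportional pick
        for t in targets:
            neighbors[new].add(t)
            neighbors[t].add(new)
            endpoints += [new, t]
    return neighbors

def find_hubs_by_neighbor_sampling(neighbors, n_samples, top_k, rng):
    """Pick random users, look at one random contact of each,
    and return the contacts that show up most often."""
    nodes = list(neighbors)
    hits = {}
    for _ in range(n_samples):
        user = rng.choice(nodes)
        contact = rng.choice(sorted(neighbors[user]))
        hits[contact] = hits.get(contact, 0) + 1
    return sorted(hits, key=hits.get, reverse=True)[:top_k]

rng = random.Random(42)
graph = preferential_attachment_graph(2000, 3, rng)
hubs = find_hubs_by_neighbor_sampling(graph, 5000, 10, rng)
mean_degree = sum(len(v) for v in graph.values()) / len(graph)
hub_degree = sum(len(graph[h]) for h in hubs) / len(hubs)
print(f"mean degree: {mean_degree:.1f}, mean degree of found hubs: {hub_degree:.1f}")
```

Each sample needs only one user's contact list, so the approach uses a number of requests proportional to the sample size rather than to the size of the network.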

It is noteworthy that the method described in that article makes practically no use of the social network API (and therefore does not run into its limitations) and builds no graphs; nevertheless, as you can see, the algorithm works quite well. There is also a range of algorithms and methods for analyzing text from social networks, which use ideas from topic modeling, recurrent neural networks, and much more. An example of such analytics is BrandWatch.

The third nuance is audience reach. At the moment this is one of the most acute problems in digital advertising: where to show an ad so that it reaches the intended target audience. In the RTB ecosystem, this role is played by the so-called publishers, the sites on which targeted ad banners are actually placed. In this case our colleagues had no problem: they advertised within the social network itself.

In other words, the main obstacle to getting the results described in the article is obtaining the data. Using the algorithms is in this case more a craft than know-how. But it is very important to apply this craft correctly, which means, for example, being able to answer the following questions:

- What algorithm quality is sufficient for the solution to pay for itself?

- How do you choose the right target group and training set for the algorithm?

- How can you get around the limitations of social network APIs using heuristic algorithms?

- How often should the model be retrained?

- How do you put the model into production, and which metrics should you monitor?
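Two of these questions lend themselves to simple numeric checks, sketched here under stated assumptions. The first function estimates the minimum targeting precision at which a campaign breaks even; the costs and conversion values are hypothetical. The second computes the population stability index (PSI), one commonly used drift metric for deciding when a model in production may need retraining; it assumes model scores lie in [0, 1].

```python
import math

def break_even_precision(cost_per_impression, value_per_conversion, conversion_rate):
    """Minimum share of correctly targeted users for the campaign to pay off.
    Expected profit per impression is
    precision * conversion_rate * value_per_conversion - cost_per_impression."""
    return cost_per_impression / (conversion_rate * value_per_conversion)

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between two samples of model scores in [0, 1].
    Zero means identical binned distributions; larger means more drift."""
    def shares(scores):
        counts = [0] * n_bins
        for s in scores:
            counts[min(int(s * n_bins), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(scores), eps) for c in counts]
    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

# Hypothetical economics: 0.01 per impression, 5.0 per conversion,
# and 2% of correctly targeted users convert.
print(f"break-even precision: {break_even_precision(0.01, 5.0, 0.02):.2f}")

# Scores at training time versus scores observed later (made-up samples).
train_scores = [i / 100 for i in range(100)]
live_scores = [min(s + 0.2, 1.0) for s in train_scores]
print(f"PSI: {population_stability_index(train_scores, live_scores):.2f}")
```

A classifier whose precision is below the break-even value loses money no matter how sophisticated it is, and a PSI that keeps growing between checks is a signal to look at retraining before the quality drop shows up in campaign results.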

We teach this craft at our School.

Unfortunately, our experience shows that even participating and performing well in Kaggle competitions does not help in solving industrial problems (fans of competitive programming came to a similar conclusion: participation in contests like ACM has little to do with industrial software development). Moreover, this experience is acquired only by trial and error and will never be described in books; even in our lectures we do not cover all the subtleties that we have applied in practice.

A reminder of the start dates of our courses:

Analyst Course - May 22
Course for Managers - May 23
Course in St. Petersburg - May 22

We also have a new course. We received many requests for distance learning and, in response, created an online introductory course. It is an introduction to machine learning and data analysis: on the one hand it lets you get acquainted with these disciplines, and on the other it prepares students for our main courses. Sign up for the preparatory course here.

Source: https://habr.com/ru/post/327528/

