
In search of the optimal point to apply human effort

One of the paradoxes of modern Internet platforms is that, although they are largely automated and the content end users see is shown without any human moderation, they are completely dependent on human behavior: in essence, all they do is observe, collect and draw conclusions from the actions of hundreds of millions or billions of people.


PageRank was the original embodiment of this principle. Instead of relying on manually written rules to understand the meaning of each individual page, or on working with the raw text, PageRank looks at what people have done or said about that page: who links to it in any way, what text they used, and who links to the people who link to it. At the same time, Google lets every user rank each set of search results by hand: you are shown ten blue links, and you simply tell Google which one is right. The same goes for Facebook: Facebook does not really know who you are, what you are interested in, or what a given piece of content is about. But it knows whom you follow, what you like, who else likes the same things you do, and what else those people like and follow. Facebook is PageRank applied to people. Much the same is true of YouTube: it never knew what a particular video was about, only what people wrote beneath it and what else they watched and liked.
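To make the mechanism concrete, here is a minimal power-iteration sketch of the PageRank idea (not Google's production algorithm, which adds much more): a page's score comes from who links to it, not from its own text. The four-page link graph and the damping factor of 0.85 below are illustrative assumptions, not details from the article.

```python
# A minimal power-iteration sketch of the PageRank idea (not Google's
# production algorithm). The four-page link graph and the damping factor
# of 0.85 are illustrative assumptions.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # page -> pages it links to
n, d = len(links), 0.85

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)        # start with equal scores
for _ in range(100):              # iterate until the scores settle
    rank = (1 - d) / n + d * M @ rank

print(rank)  # pages that well-linked pages link to end up with the highest score
```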


At their core, these systems are huge “Mechanical Turks”. They do not actually understand the content they work with; they only try to capture, record and relay human opinion about that content. They are huge distributed computing systems in which people act as the processors and the platform itself is a collection of routers and interconnects. (This reminds me a little of the idea from The Hitchhiker's Guide to the Galaxy that the whole Earth is actually a giant computer performing a calculation, and our daily activity is part of the computation.)


This means that much of the design of such a system comes down to finding the optimal point at which to apply human effort to an automated system. Do you capture activity that is already happening? Google started by using links that already existed. Do you need to stimulate activity in order to extract value from it? Facebook had to create the activity before it could get any benefit from it. Can you rely directly on a small amount of human effort? That is the approach of Apple Music, with its manually curated playlists served automatically to tens of millions of users. Or do you have to pay people to do everything?


Yahoo’s original web directory was an attempt at the “pay people to do everything” approach: Yahoo paid people to catalog the entire Internet. At first this seemed achievable, but the Internet grew too fast, and the task soon became impossible; by the time Yahoo gave up, the directory already exceeded 3 million pages. PageRank solved this problem. Google Maps, by contrast, uses a large number of camera cars driven (so far) by people down almost every street in the world, plus many more people who look at the resulting photos, and this is not an impossibly large task, just a very expensive one. Google Maps is a kind of private Mechanical Turk. We are now asking exactly the same question about human content moderation: how many tens of thousands of people would you need to review every post, and how much of that task can be automated? Is the task impossibly large, or merely very expensive?


If you look at these platforms as using billions of people to do the actual computation, two interesting questions follow: what vulnerabilities do such platforms have, and how does machine learning change this picture?


In the past, when we thought about attacks on computer systems, we thought of technical vulnerabilities: stolen or weak passwords, unpatched systems, bugs, buffer overflows, SQL injection. We pictured “hackers” looking for holes in software. But if you think of YouTube or Facebook as distributed computing systems in which the routers are ordinary software but people play the role of processors, then any attacker will immediately think about looking for vulnerabilities not only in the software but also in the people. Common cognitive biases start to play the same role as common software defects.


In other words, there are two ways to rob a bank: you can bypass the alarm and pick the lock on the safe, or you can bribe a bank employee. In either case the system has failed, and now one of those systems is you and me. Consequently, as I already wrote in my article about Facebook's recent pivot toward privacy and user security, human content moderation on these platforms is inherently similar to the antivirus industry that flourished in response to malware on Windows two decades ago: one part of the computer watches whether another part is doing something it should not.


Even leaving deliberate attacks aside, there are other problems that arise when you use one person's activity to analyze another's. When you start using a computer to analyze another computer, you risk creating feedback loops; this shows up in concepts such as the “filter bubble”, “YouTube radicalization” and search spam. At the same time, one of the problems Facebook ran into is that sometimes the sheer availability and volume of data destroys the value of that data. Call it the news feed overload problem: say you have 50 or 150 friends and you post 5 or 10 items a day, or thereabouts, but all your friends do the same, and now there are 1,500 items in your feed every day. Dunbar's number plus Zuckerberg's law equals overload... which leads us to Goodhart's Law.


“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” - Charles Goodhart

So how can machine learning change things? Earlier I said that the main difficulty is how to apply human effort to a software system in the optimal way, but there is another option: just let the computer do all the work. Until very recently, the difficulty, and the very reason such systems existed, was that there was a large class of tasks that computers could not solve even though people solved them instantly. We called them “tasks that are easy for a person but hard for a computer”, but in reality they were tasks that are easy for a person yet practically impossible for a person to describe to a computer. The breakthrough of machine learning is that it lets the computer work out the necessary description by itself.
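As a toy illustration of “the computer works out the description itself”, the sketch below contrasts a hand-written spam rule with a classifier that learns the rule from labeled examples. The messages, the labels and the use of scikit-learn are my own illustrative assumptions, not anything from the article.

```python
# A toy contrast: a hand-written rule vs. a "description" the computer learns
# from examples. The five messages and their labels are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# The old way: a person tries to write the rule down explicitly.
def rule_based_is_spam(text: str) -> bool:
    return "free" in text.lower() or "winner" in text.lower()

# The ML way: people only supply labeled examples; the model derives the rule.
texts = [
    "free prize winner!!!",
    "meeting moved to 3pm",
    "claim your free gift now",
    "lunch tomorrow?",
    "you are a winner, click here",
]
labels = [1, 0, 1, 0, 1]  # 1 = spam, 0 = not spam

vec = CountVectorizer().fit(texts)
model = LogisticRegression().fit(vec.transform(texts), labels)

print(rule_based_is_spam("any free lunch for the winner?"))             # the rule fires
print(model.predict(vec.transform(["any free lunch for the winner?"])))  # learned verdict
```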


The comic below (from 2014, just as machine learning and computer vision systems were starting to take off) illustrates this shift perfectly. The first task was easy to accomplish; the second was not, at least until machine learning arrived.



The old way to solve this problem was to find people to classify the image, that is, to resort to a kind of crowdsourcing: in other words, to use a Mechanical Turk. But today we may no longer need anyone to look at this image at all, because machine learning can very often automate the solution of this particular problem.
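For instance, a pretrained image classifier can label the photo with no human looking at it at all. A minimal sketch, assuming torchvision 0.13 or newer and a local file named photo.jpg (the file name and the choice of ResNet-18 are illustrative assumptions):

```python
# A minimal sketch of labeling a photo with no human in the loop, assuming
# torchvision >= 0.13 and a local file "photo.jpg" (both are assumptions).
import torch
from PIL import Image
from torchvision.models import resnet18, ResNet18_Weights

weights = ResNet18_Weights.DEFAULT            # model pretrained on ImageNet
model = resnet18(weights=weights).eval()      # no new training, no crowdsourcing
preprocess = weights.transforms()             # matching resize/normalize pipeline

img = Image.open("photo.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)          # shape (1, 3, 224, 224)
with torch.no_grad():
    probs = model(batch).softmax(dim=1)

top = probs.argmax(dim=1).item()
print(weights.meta["categories"][top])        # e.g. "tabby" for a cat photo
```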


So: how many problems that you could previously solve only by analyzing the actions of millions or hundreds of millions of people can you now solve with machine learning, without needing to involve users at all?


Of course, there is a certain contradiction here, because machine learning always needs a large amount of data. The obvious objection is that if you have a large platform, you automatically have a lot of data, so machine learning will come easier to you as well. That is certainly true, at least at first, but I think it is worth asking how many tasks really have to be solved by your own existing users. In the past, a photo of a cat could only be labeled “cat” if you had enough users and one of them looked at that particular photo and tagged it. Today you do not need your own users to process that particular cat photo at all; you only need some other users, anywhere in the world, at some point in the past, to have classified enough other cat pictures to produce the necessary recognition model.
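A sketch of that idea under stated assumptions: the new photo is classified by comparing its feature vector to vectors that other users labeled in the past, so no one has to look at it now. The 512-dimensional embeddings and the labels below are random placeholders standing in for real user-labeled data.

```python
# A sketch of classifying a new photo using labels other users created in the
# past: compare its feature vector to previously labeled vectors. The 512-dim
# embeddings and labels below are random placeholders for real labeled data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
past_embeddings = rng.normal(size=(1000, 512))   # features of photos labeled long ago
past_labels = rng.integers(0, 2, size=1000)      # 1 = "cat", tagged by other users

clf = KNeighborsClassifier(n_neighbors=5).fit(past_embeddings, past_labels)

new_photo = rng.normal(size=(1, 512))            # your photo: no user ever sees it
print(clf.predict(new_photo))                    # labeled with zero new human effort
```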


This is just another way of making the best use of human effort: you still need people to classify things (and to write the rules by which people classify them). But here the lever shifts, and perhaps the number of people required changes fundamentally, and with it, to some extent, the “winner takes all” rules of the game. In the end, these large social platforms are just huge collections of manually classified data, so is their glass half full or half empty? On the one hand, half full: they hold the largest collection of manually classified data (in their particular domain). On the other hand, half empty: that data was gathered and classified by hand.


Even where this data could form one of these platforms (which most likely will not happen, or probably will not, as I wrote here), it would still become, well, a platform. Just as AWS made it possible for startups to get infrastructure economies of scale without needing millions of users of their own, such tools would mean that you no longer need millions or billions of users to recognize a cat. You can simply automate it.


Translation: Alexander Tregubov
Edited by: Alexey Ivanov
Community: @ponchiknews



Source: https://habr.com/ru/post/452716/

