
How to build your own dataset of Kirkorov and Face with Yandex Toloka


Neural networks no longer surprise anyone. Almost everyone knows what machine learning, linear regression, and random forests are. Every year, thousands of people take machine learning courses from ODS and Coursera. In a couple of weeks, any student can now pick up keras and start churning out neural networks. But neural networks, like all of machine learning, need more than a good algorithm: they also need data to train that algorithm on.


Let me introduce myself: my name is Roman Kutsev. I work on the R&D team at Prisma AI and create datasets for our tasks.


So, out of the blue a product request arrives: we urgently need a neural network that can tell Kirkorov from Face (a fictional example, of course; we do not actually work on such nonsense).
I ask Anton (our lead on neural networks) what form the data should take, and he replies: “Two folders, one with images of Face and the other with Kirkorov, preferably about 500 images per class, all 299x299.”


Where to find data?


  1. Look at public datasets such as ImageNet, COCO, and openimages.
  2. If popular public datasets do not contain the labeled data we need, google it and go through papers on the topic on arxiv.org, hoping that one of them links to the dataset we need.
  3. If the first two steps fail, the dataset we need does not exist and has to be created!

Obviously, no one has tackled the task of classifying Kirkorov and Face before, so we have to create such a dataset ourselves.


The pipeline is:


  1. Download 1k images each of Kirkorov and Face from Google.
  2. Resize them to the required size.
  3. Have freelancers on Yandex Toloka check them.

For downloading, we use Google Images Download.


In the terminal we write:


    googleimagesdownload -k '' -l 1000 -t photo -s '>400*300' -o 'Kirkorov'
    googleimagesdownload -k ' Face' -l 1000 -t photo -s '>400*300' -o 'Face'

(In reality it is not that simple: a single request can download at most 100 images, so you have to run the download several times with different settings.)
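One possible way to get around the 100-image limit is to repeat the same query with a different color filter each time and save every batch to its own folder. The sketch below only builds the command lines. The `-co` color flag, the list of colors, and the per-color output folders are assumptions that should be checked against the Google Images Download documentation. `build_commands` is a hypothetical helper, not part of the tool, and the keyword is a placeholder.

```python
import shlex

# Assumed color filters; check the tool's documentation for supported values.
COLORS = ["red", "green", "blue", "pink"]


def build_commands(keyword, output_dir, colors=COLORS, limit=100):
    """Return one googleimagesdownload command line per color filter."""
    commands = []
    for color in colors:
        args = [
            "googleimagesdownload",
            "-k", keyword,
            "-l", str(limit),               # at most 100 images per request
            "-t", "photo",
            "-s", ">400*300",
            "-co", color,                   # assumed color-filter flag
            "-o", f"{output_dir}-{color}",  # separate folder per batch
        ]
        # shlex.join quotes arguments that the shell would otherwise interpret.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for command in build_commands("Kirkorov", "Kirkorov"):
        print(command)
```

The batches can then be merged into one folder per class before resizing. The same image may turn up under more than one filter, so duplicates are worth checking for.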


Right away we resize the images to 299x299. For this I sketched the following script:


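    from PIL import Image
    import multiprocessing, time, os

    def resize_img(img_path):
        # Read from 'Kirkorov', resize to 299x299, write to 'Kirkorov_resize'.
        img = Image.open(os.path.join('Kirkorov', img_path))
        # Image.LANCZOS is the current name of the filter formerly called ANTIALIAS.
        img = img.resize((299, 299), Image.LANCZOS)
        img.save(os.path.join('Kirkorov_resize', img_path))

    if __name__ == '__main__':
        # The output folder must exist before saving.
        os.makedirs('Kirkorov_resize', exist_ok=True)
        st = time.time()
        # One worker process per CPU core.
        with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
            pool.map(resize_img, os.listdir('Kirkorov'))
        print("Execution time: ", time.time() - st)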

Looking through the images, we see that most of the downloads are what we need, but there is some garbage here and there.






Well, the first two steps are done; only the third remains.


If you still do not know what Yandex Toloka is, then I advise you to read this and this article.


In a nutshell: Toloka is a freelance marketplace on steroids, where the customer creates a task and uploads data, and freelancers complete it. Of course, there are other labeling tools, but for this task Toloka is the most convenient.


How to start using Yandex Toloka:


1. Register on sandbox.toloka.yandex.ru and toloka.yandex.ru as a customer (the sandbox is where you create tasks and check that they display correctly for freelancers; if everything is fine, you transfer them to toloka.yandex.ru).


2. In our toloka.yandex.ru account, top up the balance.


3. In sandbox.toloka.yandex.ru in the "Projects" tab, select "Create project".



4. You will be offered ready-made templates. Of all the templates, the "Image Categorization" template is best for us. Choose it.



5. Write the instructions for freelancers and give the project a name.



6. The task's input parameter is a URL to the image; the output is a string field, output. The task interface is written in HTML and JavaScript (this is very convenient, since we can build a page for almost any task to show to the performers). Tweak the template HTML slightly, and you're done. Save the project.



7. The project is ready. Now you can add a pool of tasks. In this article I will not create a training pool (in the hope that freelancers can handle our extremely difficult task right away), but in your own projects you should always create one.



Come up with a name for the pool (only we can see it). Set the price to $0.01 per page of tasks; greedy Yandex takes a commission of $0.005 on top. Time per task: 10 minutes. Overlap: 3 (each of our photos will be shown to three different people). Turn on "CONTENT FOR ADULTS", since who knows what we may have downloaded from the Internet. And enable "POSTPONED ACCEPTANCE" so that we do not pay freelancers who do the task poorly.



Next comes the "REQUIREMENTS TO USERS" section, where we can set requirements for the performers. In our case, the performer must be from Russia and speak Russian. Each performer has a rating, and we can select only performers with a good one. Choose the top 80% of performers.



The most important block, and the hardest to set up, is quality control. Describing it properly would take a whole article, so I recommend reading up on it yourself. In our project I chose the rule that limits quick answers: if a performer answers 3 times out of 5 in less than 40 seconds, the system blocks them.
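To make the quick-answer rule concrete, here is a minimal sketch of the logic as described above. Toloka evaluates this rule on its own side; the function name `should_block`, its parameters, and the handling of performers with fewer than five answers are illustrative assumptions, not Toloka's implementation.

```python
def should_block(response_times, window=5, min_seconds=40, max_fast=3):
    """Return True if at least `max_fast` of the last `window` answers
    took less than `min_seconds` seconds each."""
    recent = response_times[-window:]
    fast = sum(1 for seconds in recent if seconds < min_seconds)
    return fast >= max_fast


# Time in seconds that a hypothetical performer spent on the last pages.
print(should_block([95, 30, 100, 20, 35]))   # True: three answers under 40 s
print(should_block([95, 30, 100, 120, 35]))  # False: only two answers under 40 s
```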



Great, we're at the finish line, just a little bit left. Save the pool.


8. In the created pool, download the sample upload file. Upload the images to Yandex Disk or to our own server so that they can be accessed by link. Put the image URLs in the "INPUT:image" column. Example:


    INPUT:image
    http://kucev.ru/Kirkorov/Kirkorov-pink_17_21787_l_jpg.jpg
    http://kucev.ru/Kirkorov/Kirkorov-green_82_hqdefault_jpg.jpg
    http://kucev.ru/Kirkorov/Kirkorov-pink_25_hqdefault_jpg.jpg
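A file in this shape does not have to be written by hand. The following is a minimal sketch that lists a local folder and writes one URL per image. It assumes the files are served under the same names at some base URL. `write_tasks_tsv`, the folder name, and the output file name are hypothetical choices for illustration.

```python
import csv
import os
from urllib.parse import quote


def write_tasks_tsv(image_dir, base_url, tsv_path):
    """Write a one-column TSV: the header INPUT:image, then one URL per file."""
    names = sorted(os.listdir(image_dir))
    with open(tsv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["INPUT:image"])
        for name in names:
            # quote() escapes characters that are not valid in a URL path.
            writer.writerow([f"{base_url.rstrip('/')}/{quote(name)}"])
    return len(names)


if __name__ == "__main__":
    count = write_tasks_tsv("Kirkorov_resize", "http://kucev.ru/Kirkorov", "tasks.tsv")
    print(f"Wrote {count} task rows")
```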

Upload the resulting tsv file to Yandex Toloka. Specify how many images will be shown to a performer at a time.



Great, no errors came up, so we did everything correctly and can start our pool.



9. Now we need to check that everything is displayed correctly on the performer's side. To do this, create a new account on sandbox.toloka.yandex.ru, this time as a performer. In the customer's account, select the Users tab and add the new account to the list of trusted users.



If our pool is running, the task is now available from the performer's test account.



Click "Continue". This is what the performer's page will look like:



Check that everything works correctly and no bugs come up.


10. If you are satisfied with the result, it is time to release the baby into the big world, that is, to transfer the project from the sandbox to the main Toloka. To do this, in the sandbox customer account, open "Actions over the project" and choose "Export".



We go to toloka.yandex.ru and see that our project has been successfully exported.



Open it and check once again that everything is correct. I once made a mistake by setting up a task incorrectly, and 2000 performers who had completed it could not submit the result (it is still unknown how many people I killed).


11. Start the pool, wait 20 minutes and get the result:



In total: we had 1159 photos, with 40 photos per page shown to a freelancer. That is, there were 1159/40 ≈ 29 sets of tasks. But we used an overlap of 3, so 29 x 3 = 87 pages were shown in total. For one page we pay $0.01 + $0.005, so checking 1159 photos cost us about $1.3, or 80 rubles. Of that, freelancers received only $0.87, or 50 rubles, while spending approximately 97 seconds x 87 pages = 2 hours and 20 minutes. Are you ready to work for roughly 25 rubles per hour?
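The same arithmetic can be reproduced in a few lines, which is handy for estimating a budget before launching a larger pool. All numbers come from the paragraph above; the variable names are just for illustration, and prices are kept in cents to avoid floating-point rounding.

```python
import math

photos = 1159
photos_per_page = 40
overlap = 3
price_cents = 1.0        # paid to the performer per page
commission_cents = 0.5   # kept by Toloka per page
seconds_per_page = 97    # approximate time a performer spends on a page

page_sets = math.ceil(photos / photos_per_page)   # the last page is partial
pages_shown = page_sets * overlap
total_cost_cents = pages_shown * (price_cents + commission_cents)
payout_cents = pages_shown * price_cents
hours, remainder = divmod(pages_shown * seconds_per_page, 3600)
minutes = remainder // 60

print(page_sets, pages_shown)           # 29 87
print(total_cost_cents, payout_cents)   # 130.5 87.0
print(hours, minutes)                   # 2 20
```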


12. Check the tasks and download the result.



Open the resulting file in pandas and group by "INPUT:image":


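    import pandas as pd

    data = pd.read_csv('assignments_v0(14.05.18).tsv', sep='\t')
    # Encode answers as numbers so that they can be summed per image.
    data['OUTPUT:output'] = data['OUTPUT:output'].map({'NO': 0, 'YES': 1})
    # Sum only the vote column; the other columns are not numeric.
    data_groupby = data.groupby('INPUT:image')[['OUTPUT:output']].sum()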

We get:



    data_groupby['OUTPUT:output'].value_counts()
    3.0    634
    0.0    445
    2.0     57
    1.0     23

Using majority voting, that is, treating a photo as Kirkorov if at least two performers voted for it, we get 634 + 57 = 691 photos.
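The voting step itself is simple enough to sketch without pandas. The example below assumes each answer arrives as an (image, answer) pair with the answer being 'YES' or 'NO', as in the result file. The function name `majority_yes` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def majority_yes(rows, min_yes=2):
    """Return the images that received at least `min_yes` 'YES' answers."""
    yes_votes = defaultdict(int)
    for image, answer in rows:
        yes_votes[image] += answer == "YES"  # True counts as 1, False as 0
    return sorted(image for image, votes in yes_votes.items() if votes >= min_yes)


# Made-up answers: three performers per image, as with an overlap of 3.
rows = [
    ("a.jpg", "YES"), ("a.jpg", "YES"), ("a.jpg", "YES"),
    ("b.jpg", "YES"), ("b.jpg", "NO"), ("b.jpg", "YES"),
    ("c.jpg", "NO"), ("c.jpg", "YES"), ("c.jpg", "NO"),
    ("d.jpg", "NO"), ("d.jpg", "NO"), ("d.jpg", "NO"),
]
print(majority_yes(rows))  # ['a.jpg', 'b.jpg']
```

With an overlap of 3, a threshold of two votes is a plain majority; raising `min_yes` to 3 would keep only the images on which all performers agreed.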



To summarize, I want to say that:


  1. Toloka is a flexible tool that can be customized for any task.
  2. Toloka is a very cheap source of labor.
  3. There are about 10,000 performers on Toloka, which makes it possible to label huge amounts of data in a very short time.
  4. Toloka has very convenient tools for controlling performers, which lets you create a dataset with high-quality labels.

One huge drawback: personal data cannot be labeled on Toloka, since no NDAs are signed with the performers, and doing so would violate 152-FZ and GDPR.


If you are interested in the topic of creating datasets, upvote this post, and in the next article I will tell you how to create datasets for the sticky-ai application.



Source: https://habr.com/ru/post/358574/

