Uklon, a popular Ukrainian taxi service, regularly holds prize drawings among its drivers and customers. After each drawing, it publishes a video on its Facebook page showing the participants' contact information: names and email addresses (the example video has since been deleted). A fragment of a frame:
In the original video the addresses are not blurred; I blurred them for this article so as not to spread the data further. I wrote to their support team that this is not a good idea:
Good day!
You should not publish people's names and addresses in the video (link).
1. First, your competitors can obtain a list of your customers and drivers.
2. Second, you are violating the law of Ukraine on access to personal data.
They replied that they did not care:
Hello, Denis. When registering, each user consents to the use of their personal data (clause 2.6). Many thanks for your feedback.
So I decided to write a short post showing how to extract contact information from the video without any special skills.
Disclaimer: this post is educational in nature and demonstrates how not to handle customer data.
1. Download the video
There are many services for downloading videos from Facebook. I used http://www.fbdown.net/, which gives a direct link to the video.
All the following examples are for Ubuntu, but should work similarly on other operating systems.
2. Split into frames
In the original video, the list of contacts is shown in the first 17 seconds of the video. With ffmpeg, we save the first 17 seconds of the video as a sequence of png images:
$ ffmpeg -i video.mp4 -t 00:00:17 out%d.png
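Saving every frame produces many near-identical images. The sketch below samples fewer frames using ffmpeg's fps filter. It assumes that each entry stays on screen for at least a second, which may not hold for every video. The extract_frames helper and the FFMPEG variable are hypothetical names introduced here for illustration.

```shell
# Command used to run ffmpeg; overridable so the call can be dry-run.
FFMPEG=${FFMPEG:-ffmpeg}

# extract_frames VIDEO SECONDS FPS
# Save the first SECONDS of VIDEO as out1.png, out2.png, ...,
# keeping only FPS frames per second of video.
extract_frames() {
    "$FFMPEG" -i "$1" -t "$2" -vf "fps=$3" out%d.png
}
```

For example, extract_frames video.mp4 17 1 would keep roughly one frame per second from the first 17 seconds. If entries go missing, raise the rate.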
3. Prepare for OCR

For recognition we will use the free tesseract OCR engine, which works quite well but is sensitive to the quality of the source images.
Crop away everything unneeded from the frames using ImageMagick (a 345x421 region starting at coordinates 40, 202):
convert '*.png[345x421+40+202]' thumbnail%03d.png
It should turn out as in the picture on the right, without blurring, of course.
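The bracketed suffix in the crop command is ImageMagick's geometry notation: width x height, followed by the x and y offsets of the region's top-left corner. As a minimal sketch, the helper below builds that string from the four numbers, which makes it easier to adjust the region for a different video. The name crop_geometry is hypothetical and not part of ImageMagick.

```shell
# crop_geometry WIDTH HEIGHT X Y
# Print an ImageMagick geometry string for a WIDTHxHEIGHT region
# whose top-left corner is at (X, Y).
crop_geometry() {
    printf '%sx%s+%s+%s\n' "$1" "$2" "$3" "$4"
}

# Geometry for the region used in this article.
geometry=$(crop_geometry 345 421 40 202)
echo "$geometry"   # prints 345x421+40+202
```

The result can then be substituted into the crop command, for example convert "*.png[$geometry]" thumbnail%03d.png.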
Tesseract recognizes small text poorly, so its documentation recommends simply scaling the images up 2-3 times:
convert thumbnail*.png -filter Lanczos -resize 300% final%d.png
4. Recognition
Loop over all the files and run recognition on each.
The -psm 4 option tells tesseract to treat the text as a single column (Tesseract 4 and later spell this option --psm). Setting load_system_dawg=0 tells it not to use dictionaries during recognition:
for i in final*.png; do tesseract "$i" stdout -psm 4 -l eng+rus -c load_system_dawg=0; done > text.txt
Remove the duplicates, and our database is ready:
sort -u text.txt > uniq.txt
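OCR output tends to contain blank lines and stray leading or trailing spaces, so identical entries can survive sort -u as separate lines. Here is a minimal sketch of a slightly stricter cleanup, assuming one entry per line. The name clean_lines is a hypothetical helper introduced for illustration.

```shell
# clean_lines: read OCR text on stdin, trim leading and trailing
# whitespace, drop empty lines, and print unique lines in sorted order.
clean_lines() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep -v '^$' \
        | sort -u
}
```

It could be used in place of the plain sort, for example clean_lines < text.txt > uniq.txt.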
Conclusions
The resulting database contains quite a few errors. There are two ways to improve it:
- use a commercial OCR engine;
- configure patterns for tesseract so that it knows we are recognizing email addresses.
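A cheaper third option, which is not the tesseract-side configuration mentioned in the list, is to post-filter the recognized text and keep only tokens that look like email addresses. The sketch below uses a deliberately loose pattern of my own choosing, not a full address validator, and assumes a grep that supports -o, as GNU grep on Ubuntu does. The name extract_emails is hypothetical.

```shell
# extract_emails: read OCR text on stdin and print the unique,
# lowercased tokens that look like email addresses.
extract_emails() {
    LC_ALL=C grep -Eo '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' \
        | tr 'A-Z' 'a-z' \
        | sort -u
}
```

For example, extract_emails < text.txt discards lines where recognition produced something that cannot be an address. It cannot repair addresses that were misread but still look plausible.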
In any case, the purpose of the article was to show not the quality of recognition, but the fundamental possibility of doing it quickly and with minimal resources.
Update: Legality of the service's actions
In 2012, the head of the State Service of Ukraine on Personal Data Protection commented that even a database of email addresses falls under the Law of Ukraine “On the protection of personal data”. Accordingly, publishing such data may entail administrative or criminal liability (source, in Ukrainian).
Update 2: Service Comments
The service responded to my article, saying that it does not consider a database of email addresses to be personal data:
Hello, Denis, we read your article. Personal data - information or a set of information about an individual that is identified or can be specifically identified; This is information that can be used to identify an individual. Such information includes the surname, name, patronymic; date and place of birth, address and telephone number; identification code; passport details; educational documents and more.
Update 3: The service has deleted the video with contacts from its page.
This, I think, is the right decision on their part:
Good afternoon. Due to users' concerns about the security of their personal data, the team of the Uklon online car-hailing service has decided to no longer publish all of the information about a promotion's participant (winner), and to publish only the winners' full names in future drawings.
Recall that we previously published the user's name or nickname and email address in the random.org program.
At the same time, we want to note that by taking part in a promotion, the user agrees to the use and publication of their personal data.
The Promotion Rules, which are always posted on the official Uklon website and whose terms are accepted by all participants, state that each Participant of the Promotion attests and confirms that they are familiar with their rights concerning their personal data, and that voluntarily providing personal data constitutes consent to its processing and dissemination by the Organizer / Contractor of the Promotion at its discretion, by any means, for marketing, advertising and / or any other purpose that does not contradict the legislation of Ukraine. This consent is given in accordance with the requirements of Art. 7, Art. 8 and Art. 11 of the Law of Ukraine "On the protection of personal data" and is valid indefinitely and without territorial limitation.