
The story of one test task

Some time ago, while browsing Habr, I came across a "Python Backend Developer" job posting. What attracted me most was the location of the office: it was near my home, so I sent a reply. An answer came quickly, asking whether I was ready to do a test task. I said I would think about it if they sent it over. Then no letter with the task arrived for two weeks.

And then, just before the May holidays, a reply arrived with the test task. The task looked simple, but I had already decided to drop the conversation altogether: for some reason, over those two weeks the urge to find a new job had passed, and the holidays were ahead. However, that same day I fell ill, with a rather serious head cold and all its consequences. The next day I decided to try the test task anyway and see what came of it. And this is my story.

Below is the text of the task itself:
You need to create a small web application using any web framework (preferably the one you know best). Layout is not important; pay attention to the backend, code design, and the small details.

In fact, the application should consist of one page, but it can optionally be divided into several (e.g., a separate upload form).
The required elements are a photo upload form and a table listing the uploaded photos. Authorization is not required.
Upload form:
Text field for entering the photo name
File selection

Table:
Photo preview (a smaller copy of the photo, i.e. a thumbnail; the preview should also link to the original full-size image, which opens when the preview is clicked)
Photo name (the one the user enters when uploading)
Camera manufacturer and model (from EXIF, if present)
File size
Date the photo was taken (from EXIF)
Photo upload date
Delete button

Requirements:
Do not save photos that already exist: check for a duplicate file and return an error if one is found.
Check whether the uploaded file is an image; if not, return an error. (Do not use the EXIF data check as validation.)
Do not allow saving photos taken more than a year ago (check the creation date in the photo's EXIF).
If there is no creation date in the EXIF, return an error and do not add the file.

So, Tornado was chosen as the Python web framework; I have known it for a long time. Since we will be running several backend servers, we will need a load balancer and Supervisor. Initially I was thinking about HAProxy as the balancer, but then it dawned on me that NGINX could serve the pictures well, too. So at first the architecture looked like this to me: NGINX balances connections and serves static files from disk, four Tornado servers handle requests, and Redis keeps the backends in sync.
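
In NGINX terms, that initial picture could be sketched roughly like this; the ports, paths, and names are my assumptions for illustration:

```nginx
# Sketch only: four Tornado processes behind NGINX, statics from disk.
upstream tornado_pool {
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    server 127.0.0.1:8003;
    server 127.0.0.1:8004;
}

server {
    listen 80;

    # Uploaded images and thumbnails are served straight from disk.
    location /static/ {
        root /var/www/photoapp;
    }

    # Everything else goes to the Tornado pool.
    location / {
        proxy_pass http://tornado_pool;
    }
}
```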

The burden of analyzing incoming pictures and creating thumbnails fell on Tornado. The task does not say which formats need to be supported, so I looked up the EXIF description on Wikipedia, which refers to the TIFF and JPEG formats. If that is all, things are not so bad: the Pillow library for Python supports both formats, as well as EXIF metadata. But there is a nuance: browsers do not open TIFF images. That makes it impossible to open the original file in the browser, so I decided to transcode such images to JPEG and additionally store the EXIF data together with the file, so that all the information needed for the table could be recovered from it.
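
As a sketch of that idea, assuming only JPEG and TIFF need handling, the validation and transcoding step could look something like this (the function name is mine):

```python
# A sketch, not the project's code: reject non-images and re-encode
# TIFF to JPEG so the browser can display the original.
from io import BytesIO
from PIL import Image

def to_browser_friendly(data):
    try:
        img = Image.open(BytesIO(data))
    except IOError:
        raise ValueError("the uploaded file is not an image")
    if img.format == 'JPEG':
        return data
    if img.format == 'TIFF':
        out = BytesIO()
        img.convert('RGB').save(out, 'JPEG', quality=90)
        return out.getvalue()
    raise ValueError("unsupported format: %s" % img.format)
```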

The table itself will be stored in Redis. Although Redis lives entirely in memory, and after a crash the chances of recovering the latest database changes are not great, I believe it can hold a very large number of picture descriptions, and that will be enough for a long time. In an emergency, the missing information can be recovered from the JPEG files' metadata.
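
A single picture record in Redis could look roughly like this; the key scheme and field names are my assumptions:

```python
# Sketch of the assumed Redis layout: one hash per picture, keyed by
# the file's md5, plus a list that preserves ordering for the table.
import redis

r = redis.StrictRedis()

def save_record(md5_hex, fields):
    key = 'photo:%s' % md5_hex
    if r.exists(key):                  # duplicate file -> error
        raise ValueError('duplicate file')
    r.hmset(key, fields)               # name, camera, size, dates, paths
    r.rpush('photos', key)             # ordering for the listing page
```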

The solution with metadata in JPEG seemed elegant to me, and although Pillow is quite able to save EXIF into a JPEG, the metadata must already be in binary form. That is, Pillow can parse metadata into a dictionary, but it does not know how to turn a dictionary back into metadata. I also found the Gexiv2 library, which works with metadata too, but installing it required some skill.

Attempting to compile Gexiv2 from source failed again and again with errors about missing libraries. While searching for another library of this kind, I came across an Ubuntu package for Gexiv2. But there was a problem: I had installed Python via pyenv and was going to run the scripts from a virtualenv, and in that case a Gexiv2 installed system-wide is not visible. There are known workarounds for this, but having already spent an hour on Gexiv2, I decided to abandon virtualenv and use the system Python 2.7.6.

Gexiv2 successfully edits EXIF in files, but working with data in memory is awkward, and I fundamentally did not want to touch the file twice: once to write the JPEG, and a second time to write the metadata into the same file. I did not know what to expect next. And this is what awaited me: the Gexiv2 documentation listed supported formats such as EXV, CR2, CRW and many others. So Pillow alone could no longer cope with reading the uploaded images. That is how I found ImageMagick and its Python adapter, Wand.

Wand looked promising: support for many formats, EXIF reading, relatively simple installation. But to save a JPEG with my metadata I still needed Pillow. After some searching I was lucky to find the piexif library, which let me edit metadata alongside Pillow, and that was one problem fewer. Having spent a few hours, I could finally sit down and program.
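
The missing piece, roughly: piexif serializes an EXIF dictionary into the binary blob that Pillow's save() accepts. The tag values below are only examples:

```python
# Sketch: turn an EXIF dict into bytes with piexif and attach it to a JPEG.
import piexif
from PIL import Image

exif_bytes = piexif.dump({
    '0th': {
        piexif.ImageIFD.Make: b'Canon',
        piexif.ImageIFD.Model: b'EOS 5D',
    },
    'Exif': {
        piexif.ExifIFD.DateTimeOriginal: b'2015:05:01 12:00:00',
    },
})

img = Image.open('photo.tif').convert('RGB')
img.save('photo.jpg', 'JPEG', exif=exif_bytes)
```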

The algorithm was simple: Wand loads a picture from memory and yields the EXIF data, then Wand yields the RGB buffer; we compute its md5 hash to check for duplicates, convert the buffer to JPEG and save it with its metadata, plus save the thumbnail. The corresponding data in Redis is updated, of course. It remained to test it. However, finding pictures on the Internet with metadata, and recent ones at that, is a problem, and I spent a long time looking for a program that could conveniently edit EXIF data.
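
Condensed into code, the pipeline could look roughly like this; the thumbnail size and JPEG quality are my assumptions, and error handling is omitted:

```python
# Sketch of the pipeline: Wand reads the upload, Pillow re-encodes it.
import hashlib
from io import BytesIO

from PIL import Image as PILImage
from wand.image import Image as WandImage

def process_image(data):
    # Read with Wand: EXIF, dimensions, and the raw RGB buffer.
    with WandImage(blob=data) as img:
        exif = {k[len('exif:'):]: v for k, v in img.metadata.items()
                if k.startswith('exif:')}
        width, height = img.width, img.height
        rgb = img.make_blob('RGB')

    # Duplicate check: md5 over the raw pixel data.
    digest = hashlib.md5(rgb).hexdigest()

    # Full-size JPEG (metadata is attached via piexif, as shown above).
    pil = PILImage.frombytes('RGB', (width, height), rgb)
    full = BytesIO()
    pil.save(full, 'JPEG', quality=90)

    # Thumbnail for the table preview.
    small = pil.copy()
    small.thumbnail((200, 200))
    thumb = BytesIO()
    small.save(thumb, 'JPEG')

    return digest, exif, full.getvalue(), thumb.getvalue()
```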

And here, the first JPEG sample is ready; we upload it, and it works! But the second sample, a 7 MB CR2 file, brought a few surprises. The first was that Wand could not read it from the buffer; it needed a format hint, taken from the extension of the source file. But even then there was a problem: the library started complaining that it could not find some temporary file. More searching; it turned out the ufraw utility had to be installed, and then the file was read. In 11 seconds. And then the resulting JPEG looked more like noise than the original picture.

At first I blamed Wand: it seemed to me that it converted the image to the RGB buffer incorrectly. However, after running the calculator, I found that the buffer was exactly twice as large as necessary; that is, each channel took not 8 but 16 bits. Hooray, one line and everything works. But what to do about the long upload times? Even with four servers, the same number of large CR2 files would simply make the service unavailable.
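
Both CR2 fixes together, as I understand them, in sketch form:

```python
# Sketch: the format hint from the file extension (ufraw must be installed
# for CR2), and 8 bits per channel so the RGB buffer has the expected size.
from wand.image import Image as WandImage

with WandImage(blob=data, format='cr2') as img:   # 'data' is the upload bytes
    img.depth = 8
    rgb = img.make_blob('RGB')   # now exactly width * height * 3 bytes
```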

During the first day I spent a considerable part of my time installing the system, finding and installing various libraries, and researching the subject area, yet in the end I wrote only about 150 lines of code. As a result, the test task had so far been failed: the way the application performed was unacceptable. And my sick body was asking for rest.

On the second day, I decided to abandon image processing during the request, hand the image upload over to NGINX, leave only one server in the backend, and run three scripts that would process the images.
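
One plausible shape for those processing scripts; the hand-off mechanism is not spelled out here, so the Redis list used as a queue is my assumption:

```python
# Sketch of a worker: the web server saves the upload to disk and pushes
# the path onto a Redis list; each of the three workers blocks on that list.
import redis

r = redis.StrictRedis()

def worker_loop(process_image):
    # process_image is the pipeline sketched earlier
    while True:
        _, path = r.blpop('upload_queue')      # block until a file arrives
        try:
            with open(path, 'rb') as f:
                digest, exif, full, thumb = process_image(f.read())
            # ...write the JPEG and thumbnail to disk, add the row to Redis...
        except Exception as exc:
            # a failed upload keeps its status for one day (see below)
            r.setex('upload_error:%s' % path, 86400, str(exc))
```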

I started by building the NGINX Upload module, and, of course, without success. After some time I realized that the author had abandoned it and the module would not work with the latest NGINX version. Fine, then: let the Tornado server save incoming files to disk. Next came, essentially, copy-pasting into the new image-processing scripts. We configure Supervisor and NGINX and finish the installation script. And it works.
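
For completeness, a guess at the Supervisor side; program names and paths are invented for illustration:

```ini
; Sketch only: one Tornado server plus three image-processing workers.
[program:tornado]
command=python /opt/photoapp/server.py --port=8001
autorestart=true

[program:worker]
command=python /opt/photoapp/worker.py
numprocs=3
process_name=%(program_name)s_%(process_num)02d
autorestart=true
```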

The downsides: after uploading an image, the user sees a page with the upload status, and it takes some time before the image appears in the table; the hashing of raw RGB data had to be abandoned, and now the whole file is hashed for the duplicate check. In case of an error, the result of the upload is kept for one day.

In the end, I believe that under a relatively low load this application will work stably, and on a powerful server with many cores one can even speak of good performance. Unfortunately, this test task consisted mostly of hunting for libraries and doing administration; there is very little programming in it. What the employer wanted to learn from this task is not very clear to me. Maybe I was supposed to write my own library for extracting EXIF data? And you cannot call such a test task small: it took over 8 hours. The task would have been greatly simplified by specifics about the supported image formats and a more detailed explanation of the application's intended use.

The sources can be viewed on GitHub. And I am still sick.

Source: https://habr.com/ru/post/257817/

