
Computer vision: recognizing clothes in a photo using a mobile application

Not long ago we decided to build a project that lets users search for clothes in various online stores from a photo. The idea is simple: the user uploads an image, highlights the area of interest (a t-shirt, pants, etc.), optionally specifies refining parameters (gender, size, and so on), and the system looks for similar clothes in our catalogs, sorted by how closely they match the original.

The idea itself is not new, but no one has implemented it well. The project www.snapfashion.co.uk has been on the market for several years, but its search relevance is very low: matching is based mainly on the dominant color of the image. It can find a red dress, but not a dress with a particular cut or pattern. The project's audience, incidentally, is not growing; we attribute this to the low relevance of the search, which in practice is no different from picking a color filter while browsing a store's own catalog.

In 2013 the project www.asap54.com appeared, and its search is slightly better. The emphasis is still on color, plus a few coarse options picked manually from a special catalog (short dress, long dress, medium-length dress). Faced with the difficulties of visual search, this project drifted toward social networking, where fashion lovers can share their looks, somewhere between a clothing chat and a fashion Instagram.
Even though projects already exist in this area, there is clearly an unmet need for search by image, and it is very relevant today. Solving it with a mobile application, as SnapFashion and Asap54 did, fits the trends of the e-commerce market best: according to various forecasts, the share of mobile sales in the US may grow from 11% in 2013 to 25-50% by 2017. Such rapid growth of mobile commerce promises growing popularity for all kinds of shopping-assistant applications, and the stores themselves will most likely invest in developing and promoting such applications, as well as actively cooperate with them.

After analyzing the competitors, we decided to try tackling this problem ourselves and launched the project Sarafan www.getsarafan.com.
From the start we wanted a bright corporate identity, and we worked through many options:
[image]

In the end we settled on a style with bright colors.
[image]

For the first client we chose iOS (iPhone). The design is built around bright paints; the app works through a REST service, and the main screen offers a choice: take a photo or pick one from the gallery.
[image]

This was probably the easiest part of the whole project. On the backend development front, however, things were far less rosy. Here is the story of our search: what we tried and where we ended up.

Visual search

We tried several approaches, but none of them produced results good enough for a highly relevant search. In this article we describe what we tried and how it worked on different data. We hope this experience will be useful to readers.

So, the main problem of our search is the so-called semantic gap: the difference between which images (in this case, images of clothes) a person considers similar and which a machine does. For example, suppose a person wants to find a black short-sleeved t-shirt:
[image]
A person will easily say that in the list below it is the second image. The machine, however, will most likely pick image 3, which is a women's t-shirt but whose scene has a very similar composition and the same color distribution.
[image]

A person expects the search results to be items of the same type (t-shirt, jersey, jeans, ...), in roughly the same style, and with roughly the same color distribution (color, texture, or pattern). In practice, ensuring that all three conditions were met turned out to be problematic.

Let's start with the simplest case: searching for images with a similar color. The most common technique for comparing images by color is the color histogram method. The idea is as follows: the entire color space is divided into a set of non-intersecting subsets that completely cover it, and for each image a histogram is built that reflects the share of each color subset in the image's color gamut. To compare histograms, a distance between them is introduced. There are many ways to form the color subsets; in our case it would be reasonable to derive them from our image catalog (a minimal sketch of the approach follows this list). However, even such a simple comparison requires two conditions:
- images in the catalog must contain only one item on an easily separable background;
- we must reliably separate the background from the clothing region of interest in the user's photo.
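
Here is a minimal sketch of histogram comparison in Python with OpenCV; the bin counts and the Bhattacharyya distance are our illustrative choices, not necessarily what the project used.

```python
# A minimal sketch of color-histogram comparison: images are compared
# by the distance between their HSV color histograms.
import cv2

def color_histogram(path, bins=(8, 8, 8)):
    """Build a normalized 3D HSV histogram for an image file."""
    image = cv2.imread(path)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, bins,
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def histogram_distance(path_a, path_b):
    """Smaller values mean more similar color distributions."""
    ha = color_histogram(path_a)
    hb = color_histogram(path_b)
    return cv2.compareHist(ha, hb, cv2.HISTCMP_BHATTACHARYYA)

# Usage: rank catalog images by color similarity to the query, e.g.
# print(histogram_distance("query.jpg", "catalog/red_dress.jpg"))
```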
In practice, the first condition (a single item on a clean background) is never satisfied; we describe our attempts to work around this below. The second condition is comparatively easier, because the region of interest in the user's image is selected with the user's active participation. There is a fairly effective background removal algorithm, GrabCut ( http://en.wikipedia.org/wiki/GrabCut ). We proceeded from the assumption that the region of interest lies closer to the center of the area the user circled than to its border, and that the background within this area is relatively uniform in color. Using GrabCut and a few heuristics, we arrived at an algorithm that works correctly in most cases; a sketch is shown below.
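
Below is a hedged sketch of GrabCut-based background removal; we assume the user's selection arrives as a bounding rectangle, and we omit the center-bias and uniformity heuristics mentioned above.

```python
# GrabCut background removal, initialized from a user-drawn rectangle.
import cv2
import numpy as np

def extract_foreground(image_bgr, rect, iterations=5):
    """Return the image with the estimated background zeroed out.

    rect is (x, y, width, height) around the user's region of interest.
    """
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)  # internal GrabCut state
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, rect, bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_RECT)
    # Keep pixels marked as definite or probable foreground.
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
    return image_bgr * fg[:, :, np.newaxis].astype(np.uint8)

# Usage (hypothetical file and rectangle):
# img = cv2.imread("user_photo.jpg")
# cutout = extract_foreground(img, rect=(50, 80, 200, 300))
```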

Now about selecting the region of interest in the catalog images. The first thing that comes to mind is to segment the image by color; for example, the watershed algorithm ( http://en.wikipedia.org/wiki/Watershed_(image_processing) ) is suitable, as sketched below.
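
A minimal sketch of watershed segmentation with scikit-image; the gradient threshold used to seed the markers is an illustrative assumption.

```python
# Watershed segmentation: regions grow from seed markers placed in
# flat areas and stop at strong edges.
from skimage import io, color, filters, segmentation
from scipy import ndimage as ndi

def watershed_segments(path):
    """Segment an image into regions separated by strong edges."""
    image = color.rgb2gray(io.imread(path))
    gradient = filters.sobel(image)          # edge strength map
    # Seed markers in low-gradient (flat) areas; labels grow from there.
    markers = ndi.label(gradient < 0.05)[0]
    return segmentation.watershed(gradient, markers)

# labels = watershed_segments("catalog/red_skirt.jpg")
# Each integer label in `labels` is one candidate region.
```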
However, a red skirt in the catalog may be photographed in several ways:
[image]

While in the first and second cases it is relatively easy to segment the region of interest, in the third case we would also pick up the jacket. For more complex cases this method does not work at all, for example:
[image]

It should be noted that the image segmentation problem is not fully solved: there are no methods that can extract the region of interest as a single fragment the way a person can:
[image]

Instead, the image is divided into superpixels; here it is worth looking at the n-cuts and turbopixel algorithms.
[image]

A combination of superpixels is then used: for example, the task of finding and localizing an object is reduced to finding the combination of superpixels belonging to the object, instead of searching for a bounding box (a superpixel sketch follows the image below).
[image]
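
As a readily available stand-in for the algorithms above, here is a sketch using SLIC superpixels from scikit-image; SLIC is our substitution, since the text mentions n-cuts and turbopixels.

```python
# Superpixel segmentation with SLIC: the image is split into small,
# roughly uniform regions that respect color boundaries.
from skimage import io, segmentation

def superpixels(path, n_segments=200):
    """Split an image into roughly uniform superpixels."""
    image = io.imread(path)
    return segmentation.slic(image, n_segments=n_segments, compactness=10)

# labels = superpixels("catalog/item.jpg")
# An object hypothesis is then a subset of these labels, not a bounding box.
```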

So, the task of labeling catalog images was reduced to finding the combination of superpixels that corresponds to an item of a given type. This is a machine learning task. The idea was to take a set of hand-labeled images, train a classifier on it, and classify the various regions of a segmented image; the region with the maximum response is taken as the region of interest. But here we again had to decide how to compare images, since simple color matching is guaranteed not to work: we have to compare shape, or some overall representation of the scene. At the time, the gist descriptor ( http://people.csail.mit.edu/torralba/code/spatialenvelope/ ) seemed suitable for this purpose. The gist descriptor is, roughly, a histogram of the distribution of edges in an image: the image is divided into equal cells by a grid of some size, and in each cell the distribution of edges of different orientations and scales is computed and sampled. The resulting n-dimensional vectors can then be compared; a simplified sketch follows below.
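
Below is a simplified, gist-like descriptor: per-cell histograms of gradient orientations over a grid. This is our approximation of the idea, not Torralba's original implementation (which uses Gabor filters at several scales).

```python
# A gist-like descriptor: orientation histograms of image gradients,
# computed per cell on a regular grid and concatenated.
import cv2
import numpy as np

def gist_like_descriptor(gray, grid=(4, 4), orientations=8):
    """Concatenate gradient-orientation histograms over a grid of cells.

    `gray` is a 2D grayscale image array.
    """
    gray = np.float32(gray)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    angle = np.arctan2(gy, gx) % np.pi       # orientation in [0, pi)
    h, w = gray.shape
    ch, cw = h // grid[0], w // grid[1]
    features = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = (slice(i * ch, (i + 1) * ch),
                    slice(j * cw, (j + 1) * cw))
            hist, _ = np.histogram(angle[cell], bins=orientations,
                                   range=(0, np.pi),
                                   weights=magnitude[cell])
            features.append(hist)
    vec = np.concatenate(features)
    return vec / (np.linalg.norm(vec) + 1e-8)  # L2-normalize

# Two descriptors can be compared with, e.g., Euclidean distance:
# d = np.linalg.norm(gist_like_descriptor(a) - gist_like_descriptor(b))
```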

A training set was created: images of about ten different classes were labeled by hand. Unfortunately, even with cross-validation and parameter tuning we could not push classification accuracy above 50%. Part of the reason is that a shirt, in terms of edge distribution, does not differ much from a jacket; part is that the training set was not large enough (gist is usually used to search very large image collections); and part may be that the descriptor simply does not apply here at all.

Another method of comparing images is matching local features. The idea is to detect salient points (local features) in the images, describe the neighborhoods of these points in some way, and compare the number of feature matches between the two images. We used SIFT as the descriptor. But local feature matching also gave poor results, mostly because the method is designed to compare images of the same scene taken from different angles; a sketch of such matching follows below.
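
A sketch of local-feature matching with SIFT and Lowe's ratio test; the exact matching strategy used in the project is not described, so the ratio test here is our assumption.

```python
# SIFT feature matching: count keypoint correspondences that clearly
# beat their second-best alternative (Lowe's ratio test).
import cv2

def count_sift_matches(path_a, path_b, ratio=0.75):
    """Count 'good' SIFT matches between two images."""
    sift = cv2.SIFT_create()
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    _, desc_a = sift.detectAndCompute(img_a, None)
    _, desc_b = sift.detectAndCompute(img_b, None)
    matches = cv2.BFMatcher().knnMatch(desc_a, desc_b, k=2)
    good = []
    for pair in matches:
        # Some keypoints may have fewer than two candidate matches.
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return len(good)

# print(count_sift_matches("query.jpg", "catalog/item.jpg"))
```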

Thus, we failed to label the catalog images. Searching over unlabeled images with the methods described above sometimes gave roughly similar results, but in most cases the output had nothing in common with the query from a person's point of view.

When it became clear that we could not partition the catalog, we tried to build a classifier for users' images, i.e. to automatically determine the type of item the user wants to find (t-shirt, jeans, etc.). The main problem was the lack of a training set. Catalog images are not suitable: first, they are unlabeled, and second, they come in a rather limited set of poses, with no guarantee that the user will photograph an item in a similar view. To obtain a wide range of poses for an item, we filmed a person wearing it on video, then cut the item out of each frame and built a training sample from the resulting set; the item was deliberately chosen to contrast with the background so it could be separated easily (a frame-harvesting sketch follows the image below).
[image]
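
A minimal sketch of harvesting training frames from such a video, assuming the garment contrasts strongly with the background; the file names, sampling step, and the simple binary threshold are all illustrative.

```python
# Build training masks from every n-th frame of a video in which the
# garment was deliberately chosen to stand out from the background.
import cv2

def frames_to_samples(video_path, every_n=10, thresh=40):
    """Yield (frame, binary garment mask) pairs from a video."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Crude contrast-based separation; works only because the
            # garment contrasts strongly with the background.
            _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
            yield frame, mask
        index += 1
    cap.release()

# for i, (frame, mask) in enumerate(frames_to_samples("tshirt_walk.mp4")):
#     cv2.imwrite("samples/%04d.png" % i,
#                 cv2.bitwise_and(frame, frame, mask=mask))
```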
Unfortunately, this approach was quickly abandoned once it became clear how much video would have to be shot and processed to cover all possible styles of clothing.

Computer vision is a very broad field, and we have (so far) failed to achieve the highly relevant search we want. We do not want to drift sideways into secondary features; we will keep at it, building a search tool. We would be glad to hear any advice and comments.

Source: https://habr.com/ru/post/241343/

