We reconstruct the detailed geometry of objects to validate product assortments more accurately.

When working on search quality, sooner or later you face the problem of visually validating products. Let's skip the simple tasks that an ordinary classifier can handle and focus on cases that need reasonably accurate object geometry:

Suppose you need to select only good photos of certain objects for later use in e-commerce. By good we mean photos in which the main object dominates and there are no unnecessary details.

Why do you need it?

Any non-standard product image will certainly attract attention, but a potential buyer's reaction can be either positive or negative. The task of preliminary validation is to reduce, preferably significantly, the likelihood of the negative scenario.
Below is an example of inconsistent styles within one category of a test store:

Without complicating things further: if a T-shirt gets a little lost in the photo, or if you find yourself looking at details you do not really need, something is likely to go wrong (or already has).

Thus, one strategy for preliminary validation can be stated very simply: photos with a dominant product win. It is that simple; you just need to let them win.
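As a rough illustration of the "dominant product wins" rule, here is a minimal sketch that scores a photo by the share of the frame covered by the product's bounding box. The function names and the `threshold=0.4` cut-off are hypothetical, chosen only for this example, and the box is assumed to come from some upstream detector.

```python
def dominance_ratio(box, image_size):
    """Share of the image area covered by the box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    width, height = image_size
    # Clip the box to the image so an overshooting detection cannot exceed 1.0.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    box_area = max(0, x1 - x0) * max(0, y1 - y0)
    return box_area / (width * height)


def is_dominant(box, image_size, threshold=0.4):
    """True if the product fills at least `threshold` of the frame.

    The 0.4 default is an illustrative assumption, not a tuned value.
    """
    return dominance_ratio(box, image_size) >= threshold
```

In practice the threshold would have to be picked per category, since a belt and a sofa fill a frame very differently.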

Early results looked quite good and made it possible to simplify and automate validation significantly:

What is wrong with the bounding box approach?

The main problem is the accuracy of the results. Complex objects, non-standard photos, real life, you know. So even if you have a bounding box, you still do not have enough information.
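To make the point concrete, here is a small sketch (with hypothetical helper names) that compares two binary masks sharing exactly the same bounding box. The masks are plain lists of 0/1 rows, an assumption made to keep the example dependency-free.

```python
def bounding_box(mask):
    """Tight box (x0, y0, x1, y1) around all non-zero pixels of a 2D mask."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask contains no object pixels")
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


def fill_ratio(mask):
    """Fraction of the bounding box that the object actually occupies."""
    x0, y0, x1, y1 = bounding_box(mask)
    object_pixels = sum(sum(row) for row in mask)
    return object_pixels / ((x1 - x0) * (y1 - y0))


# A thin diagonal object and a solid square: identical boxes, different shapes.
diagonal = [[1 if x == y else 0 for x in range(10)] for y in range(10)]
square = [[1] * 10 for _ in range(10)]
```

Both masks produce the box `(0, 0, 10, 10)`, yet the diagonal object fills only a tenth of it, which is exactly the information a box alone cannot give you.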

The conclusion is somewhat disappointing, since it immediately rules out proven, well-working solutions (or makes them significantly harder to apply). For example, using neural networks to obtain accurate geometry of any kind requires a lot of resources to prepare a training set, with no guarantee of the necessary accuracy.

But with reasonably exact geometry, one could use more complex analysis and validation logic. What's more, you could also take a swing at video (selecting the required segment, automatic cropping, etc.).

Solution

The current solution cannot be called universal, because it relies on a fairly large number of limitations and simplifications.

Simplification #1: Contrast

One of the simplifications can be stated as follows: the object in the photo always contrasts with the background. A contrasting object is easy to find, after which you can scan it (adaptively, with a dynamic step, etc.):

Naturally, if necessary, the contrast can be enhanced, which makes the solution more stable.

By the way, the example above implemented a search for implanted hair. A very strange task that appeared on Stack Overflow and successfully “ate up” one evening.
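The contrast idea can be sketched roughly as follows. This is not the original implementation: it assumes a grayscale image given as a list of rows, estimates the background from the border pixels, and scans with a fixed `step` instead of the adaptive one mentioned above. The function names and the `min_contrast=40` default are hypothetical.

```python
def background_level(gray):
    """Estimate the background as the median brightness of the border pixels."""
    border = list(gray[0]) + list(gray[-1])
    border += [row[0] for row in gray[1:-1]] + [row[-1] for row in gray[1:-1]]
    border.sort()
    return border[len(border) // 2]


def contrast_mask(gray, min_contrast=40):
    """Mark pixels that differ from the background by at least `min_contrast`."""
    bg = background_level(gray)
    return [[abs(p - bg) >= min_contrast for p in row] for row in gray]


def scan_rows(mask, step=1):
    """For every `step`-th row, return (y, leftmost x, rightmost x) of the object."""
    profile = []
    for y in range(0, len(mask), step):
        xs = [x for x, hit in enumerate(mask[y]) if hit]
        if xs:
            profile.append((y, xs[0], xs[-1]))
    return profile
```

The list returned by `scan_rows` is a crude outline of the object: one left and one right edge per scanned row. A smaller step gives finer geometry at a higher cost.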

Simplification #2: Only one object should be dominant

In this case, a very small number of products with deliberate design choices suffer, but the other cases are handled quite easily.
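One way to check the single-object assumption, sketched here under my own assumptions rather than taken from the original solution, is to label the connected regions of a foreground mask and require that the largest one holds most of the foreground. The mask format (rows of 0/1 or booleans) and the `min_share=0.8` default are hypothetical.

```python
from collections import deque


def component_sizes(mask):
    """Sizes of 4-connected foreground regions, largest first."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    sizes = []
    for sy in range(height):
        for sx in range(width):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[sy][sx] = True
            queue, size = deque([(sy, sx)]), 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sorted(sizes, reverse=True)


def has_single_dominant_object(mask, min_share=0.8):
    """True if the largest region holds at least `min_share` of the foreground."""
    sizes = component_sizes(mask)
    return bool(sizes) and sizes[0] / sum(sizes) >= min_share
```

Small specks of noise barely affect the share, while two comparable objects in one frame fail the check.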

Difficult cases

Having worked on this topic for some time, I can say with confidence that all cases are complex in their own way. However, dynamic scenes and scenes with varying distances cause the most problems.