In this article I want to share a curious story about an unusual solution to an interesting problem that I came across a year ago. Everything described here was done, above all, “just for fun” and out of pure academic interest ...
It all started a year ago: I simply had free time and a desire to do something useful. There was a certain intellectual hunger, an acute shortage of something new, of an interesting problem ... Hence this attempt to reinvent the wheel and ride it where no wheel was needed at all ... That wheel is, in fact, everything described below ...
1. Task
At one trading and procurement company, the question of purchase optimization was quite acute. The company had several dozen major suppliers, and the product ranges of many of them overlapped by 20–30%, with each supplier charging different prices. Unfortunately, most goods were bought out of habit: everyone was used to group A coming from supplier X and group B from supplier Y, although selecting goods individually rather than by group could save money. For clarity, an example:
Product | Supplier X | Supplier Y
A1      | 11 rub.    | 10 rub.
A2      | 10 rub.    | 11 rub.
B1      | 10 rub.    | 11 rub.
B2      | 11 rub.    | 10.5 rub.
From the table it is obvious that A1 and B2 are cheaper to buy from “Y”, and A2 and B1 from “X”: picking each item from the cheaper supplier costs 10 + 10 + 10 + 10.5 = 40.5 rubles instead of the 42.5 rubles of the habitual group split (group A from X, group B from Y). That is quick and simple with 4 items and 2 suppliers, but what if the catalog runs to tens of thousands of items and there are dozens of suppliers? A toy version of this selection rule is sketched below.
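To make the selection rule concrete, here is a toy sketch in PHP (the language used for the prototype later in the article). The prices come from the table above; everything else is purely illustrative:

<?php
// For each product, pick the supplier with the lowest price.
$prices = array(
    'A1' => array('X' => 11, 'Y' => 10),
    'A2' => array('X' => 10, 'Y' => 11),
    'B1' => array('X' => 10, 'Y' => 11),
    'B2' => array('X' => 11, 'Y' => 10.5),
);

$total = 0;
foreach ($prices as $product => $bySupplier) {
    $bestPrice    = min($bySupplier);                      // cheapest offer
    $bestSupplier = array_search($bestPrice, $bySupplier); // who offers it
    $total += $bestPrice;
    echo "$product -> supplier $bestSupplier ($bestPrice rub.)\n";
}
echo "Total: $total rub.\n"; // 40.5 rub. instead of 42.5 rub. for the group split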
It would seem that the problem is elementary: it just requires a huge number of simple calculations, far too many to do by hand, so move it to a computer and reap the benefits. And so it would be, if there were a single database with the entire catalog and all the prices. Alas, that was a utopia ... The database was in the usual dreadful, garbage-filled state in which only those who dug through it every day could find anything. The prices were even worse: they were scattered across various catalogs, price lists, and websites. Any attempt to catalog and aggregate them made it obvious that this was extremely hard to do without manual intervention. One supplier used the manufacturer's article number, another assigned his own, and a third did not list article numbers in his price lists at all. The names almost never matched; the same product could be called “Vase with a 'triangular pattern'”, “Vase A-563”, “Vase with a pattern”, or, finally, just “Vase” :)
So matching by name was absolutely useless as well.
2. Crazy idea
And then a crazy idea came up. The specifics of the product were such that almost every item had a picture, and the suppliers often took their photos from the same source, or simply from each other. There was one “but”: the pictures naturally came in different sizes and could even differ in proportions, so a head-on, pixel-by-pixel comparison would not help. But Yandex somehow manages to find similar images, doesn't it? So why not try the same? Decided!
Studying the literature, googling, and reading articles gave an idea of how similar-image search works. In short: a hash is built for each image, and then the hashes are compared with each other. The approaches differ mainly in the hash function used. The simplest and fastest, both to implement and to compute, is the perceptual “average hash”.
In short, the principle of operation is this: the image is shrunk to a small square, say 16x16, the color image is converted to grayscale, the average brightness over all pixels of the square is computed, and then, in a second pass, each pixel is assigned 0 or 1 depending on whether its brightness is below or above that average. The resulting sequence of zeros and ones is the hash we are looking for.
To build a quick prototype, ImageMagick was chosen for image processing and PHP as the scripting language.
As test material I took “Chrysanthemum.jpg” from the standard Windows sample pictures and named it test1.jpg, then took the same source file, distorted its proportions, and saved the result as test2.jpg. As the third sample I took “Desert.jpg” and named it test3.jpg. Here they are:
Next, call ImageMagick for preprocessing:
E:\ImageMagick\convert.exe E:\Article\img\test1.jpg -resize 16x16! -colorspace gray E:\Article\img\_test1.jpg
E:\ImageMagick\convert.exe E:\Article\img\test2.jpg -resize 16x16! -colorspace gray E:\Article\img\_test2.jpg
E:\ImageMagick\convert.exe E:\Article\img\test3.jpg -resize 16x16! -colorspace gray E:\Article\img\_test3.jpg
We get:
Next, we compute the hash itself:
<?php
// Average hash of an already preprocessed (resized, grayscale) image,
// following the steps described above.
function getPerceptHash($imgName) {
    $im = imagecreatefromjpeg($imgName);
    $width  = imagesx($im);
    $height = imagesy($im);

    // First pass: average brightness. The image is already grayscale,
    // so any single channel carries the gray value.
    $sum = 0;
    for ($y = 0; $y < $height; $y++) {
        for ($x = 0; $x < $width; $x++) {
            $sum += imagecolorat($im, $x, $y) & 0xFF;
        }
    }
    $mean = $sum / ($width * $height);

    // Second pass: one bit per pixel (1 = brighter than the mean),
    // packed into a hex string four bits at a time.
    $hash = '';
    $nibble = 0;
    $bits = 0;
    for ($y = 0; $y < $height; $y++) {
        for ($x = 0; $x < $width; $x++) {
            $nibble = ($nibble << 1) | (((imagecolorat($im, $x, $y) & 0xFF) > $mean) ? 1 : 0);
            if (++$bits == 4) {
                $hash .= dechex($nibble);
                $nibble = 0;
                $bits = 0;
            }
        }
    }
    imagedestroy($im);
    return $hash;
}

echo '_test1.jpg ' . getPerceptHash('E:\Article\img\_test1.jpg') . "\n";
echo '_test2.jpg ' . getPerceptHash('E:\Article\img\_test2.jpg') . "\n";
echo '_test3.jpg ' . getPerceptHash('E:\Article\img\_test3.jpg') . "\n";
We get the result:
_test1.jpg f9f6f0f0e0b0f000
_test2.jpg fbf6f0f0c0b0f000
_test3.jpg 1e1e3e3e3e3e3c00
We can see that the first and second hashes are very close, while the third is nowhere near them. A series of further experiments confirmed it: the method works.
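For reference, “close” here means a small Hamming distance: the number of bit positions in which two hashes differ. A minimal sketch of such a comparison, applied to the hashes obtained above:

<?php
// Hamming distance between two hex hashes: XOR the hashes nibble by nibble
// and count the set bits. Smaller distance = more similar images.
function hammingDistance($hexA, $hexB) {
    $dist = 0;
    for ($i = 0; $i < strlen($hexA); $i++) {
        $x = hexdec($hexA[$i]) ^ hexdec($hexB[$i]);
        while ($x) {
            $dist += $x & 1;
            $x >>= 1;
        }
    }
    return $dist;
}

echo hammingDistance('f9f6f0f0e0b0f000', 'fbf6f0f0c0b0f000') . "\n"; // 2  - near-duplicates
echo hammingDistance('f9f6f0f0e0b0f000', '1e1e3e3e3e3e3c00') . "\n"; // 34 - unrelated images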
3. Attempt to put into practice
The comparison method was found and verified on test examples; it was time to try it in battle. Several days went into writing assorted parsers for catalogs, price lists, and supplier sites, and into the parsing itself. All the information, together with the hashes, went into a database, and the images themselves onto disk for further experiments. In total, about 3 million records were collected.
At first the method worked, but as the base grew the results got worse and worse, until they became completely worthless: the false positives piled up. The method coped well with images that had simply been resized, but if an image had been color-corrected, cropped, or, worse still, had a watermark applied (which happened very, very often), the result was poor.
4. pHash
It was obvious that the hash function we were using would not do; something more serious was needed. It was decided to switch to pHash, which is based on the discrete cosine transform. The algorithm itself is heavy, and attempts to implement it in PHP came to nothing good. The solution was to write a console utility in C that takes a file path as input, computes the hash, and prints it. A suitable library was found at http://www.phash.org/ and a quick test program using it was written:
#include "stdafx.h" #include <iostream> #include <windows.h> using namespace std; #ifndef _WIN32 typedef unsigned _uint64 ulong64; typedef signed _int64 long64; #else typedef unsigned long long ulong64; typedef signed long long long64; typedef unsigned char uint8_t; typedef unsigned int uint32_t; #endif typedef int (*ph_dct_imagehash)(const char* file,ulong64 &hash); int _tmain(int argc, TCHAR* argv[]) { if (argc > 1) { char* filename = (char *)malloc( 255 ); wcstombs( filename, argv[1], 255 ); ph_dct_imagehash _ph_dct_imagehash; HINSTANCE hInstLibrary = LoadLibrary(L"pHash.dll"); if (hInstLibrary) { _ph_dct_imagehash = (ph_dct_imagehash)GetProcAddress(hInstLibrary, "ph_dct_imagehash"); int err = 0; err = GetLastError(); ulong64 myhash=0; _ph_dct_imagehash(filename, myhash); std::cout << myhash << std::endl; FreeLibrary(hInstLibrary); } else { return 0; } }else{ return 0; } return 0; }
We compile it and get pHashConsole.exe at the output.
From PHP, this utility is invoked through an ordinary exec:
$pHash=exec('E:\Article\pHashConsole.exe "'.$filename.'"');
The hashes were stored in a MySQL table, and the search result can then be obtained with a very simple query:
'SELECT * FROM (
    SELECT *, BIT_COUNT( hash ^ '.$pHash.' ) AS distance
    FROM images
    ORDER BY distance
    LIMIT 0, 100
) s1 WHERE distance < '.$precision.';'
Here $pHash is the hash for which we are looking for similar images, and $precision is the accuracy threshold: the larger it is, the more results are returned, but the further they may be from the original. It is chosen empirically.
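Putting the pieces together, a hypothetical end-to-end lookup might look like the sketch below. The connection parameters and the id column are illustrative assumptions; only the images table, the hash column, and the query itself come from the text above:

<?php
// Sketch: hash an incoming image with the console utility,
// then ask MySQL for the closest stored hashes.
$mysqli = new mysqli('localhost', 'user', 'password', 'catalog'); // placeholder credentials

$filename  = 'E:\Article\img\test1.jpg';
$pHash     = exec('E:\Article\pHashConsole.exe "'.$filename.'"');
$precision = 8; // chosen empirically, as noted above

$sql = 'SELECT * FROM ('
     .    'SELECT *, BIT_COUNT( hash ^ '.$pHash.' ) AS distance '
     .    'FROM images ORDER BY distance LIMIT 0, 100'
     . ') s1 WHERE distance < '.(int)$precision;

$result = $mysqli->query($sql);
while ($row = $result->fetch_assoc()) {
    echo $row['id'].' -> distance '.$row['distance']."\n"; // id column is assumed
}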
Using pHash made it possible to find not only rescaled images, but also images with added captions or watermarks, color-corrected copies, and cropped fragments.
5. The End!
As a result of all these manipulations the task was solved: a unified product base was assembled without manual matching, and thanks to it the company's procurement costs were genuinely optimized. The seemingly insane idea turned out to be not only feasible but useful in practice, and it produced a tangible result.
And that is the whole funny story ...
Thank you all for your attention! The End!