Initial data:
- a set of objects, each possessing attributes
- the ability to identify an object, with reasonable accuracy, by matching its checksum.
Final goal:
- obtain lists of objects in which matches are easy to identify.
The idea of the algorithm is to build a prefix tree (trie) in which each node stores one byte of a checksum. When the checksum of the next object arrives, we walk from the root of the tree downwards; if there is no node for the next byte of the sequence, we create it. Having reached the end of the checksum and created (or found) the end node, we write the object's parameters into it. As a result we get a set of leaf nodes; if a leaf node holds descriptions of more than one object, we assume those objects are identical. For example, two files whose CRC32 is 0xA1B2C3D4 both traverse the nodes A1 → B2 → C3 → D4 and end up in the same leaf, so their descriptions land in the same list.
For a better understanding, here is a general view of the source code; it can be used as a template for real-world tasks.
using System;
using System.Collections.Generic;

/// <summary>
/// An object being checked for duplicates. attribute1/attribute2 are
/// placeholders for real attributes (for example, a file path).
/// </summary>
public class HashObject
{
    public string attribute1 { get; set; }
    public string attribute2 { get; set; }
}

/// <summary>
/// A node of the checksum tree: stores one byte of the checksum,
/// the child nodes for the following byte and, in end nodes,
/// the list of objects whose checksum ends here.
/// </summary>
public class HashTreeNode
{
    /// <summary>
    /// The checksum byte stored by this node.
    /// </summary>
    private byte myPart;
    public byte MyPart
    {
        get { return myPart; }
    }

    public HashTreeNode(byte part)
    {
        myPart = part;
    }

    /// <summary>
    /// Descriptions of the objects whose checksum ends at this node
    /// (filled by the caller for end nodes only; null until then).
    /// </summary>
    public List<HashObject> files = null;

    /// <summary>
    /// Child nodes, one per distinct value of the next checksum byte.
    /// </summary>
    private List<HashTreeNode> NextNodes = new List<HashTreeNode>();

    /// <summary>
    /// Recursively descends through the remaining checksum bytes and returns the end node.
    /// </summary>
    public HashTreeNode FindEndNode(byte[] crc, int position)
    {
        HashTreeNode endNode = FindSubNodes(crc[position]);
        if (position < crc.Length - 1)
            return endNode.FindEndNode(crc, position + 1);
        else
            return endNode;
    }

    /// <summary>
    /// Finds the child node for the given byte; creates it if it does not exist yet.
    /// </summary>
    private HashTreeNode FindSubNodes(byte part)
    {
        lock (NextNodes)
        {
            for (int i = 0; i < NextNodes.Count; i++)
                if (NextNodes[i].MyPart.Equals(part))
                    return NextNodes[i];
            NextNodes.Add(new HashTreeNode(part));
            return NextNodes[NextNodes.Count - 1];
        }
    }
}

/// <summary>
/// The tree itself: holds the first-level nodes and serves as the
/// entry point for adding and looking up checksums.
/// </summary>
public class HashTree
{
    /// <summary>
    /// First-level nodes, one per distinct first byte of a checksum.
    /// </summary>
    List<HashTreeNode> Nodes = new List<HashTreeNode>();

    /// <summary>
    /// Walks the tree along the checksum bytes, creating missing nodes on the way,
    /// and returns the end node for this checksum.
    /// </summary>
    /// <param name="crc">Crc32 checksum of the object as a byte array</param>
    /// <returns>The end node corresponding to the checksum</returns>
    public HashTreeNode CheckOnTree(byte[] crc)
    {
        HashTreeNode root = FindNode(crc[0]);
        return root.FindEndNode(crc, 1);
    }

    /// <summary>
    /// Finds the first-level node for the given byte; creates it if it does not exist yet.
    /// </summary>
    private HashTreeNode FindNode(byte part)
    {
        lock (Nodes)
        {
            for (int i = 0; i < Nodes.Count; i++)
                if (Nodes[i].MyPart.Equals(part))
                    return Nodes[i];
            Nodes.Add(new HashTreeNode(part));
            return Nodes[Nodes.Count - 1];
        }
    }
}
As an example, a simple application was implemented that finds duplicate images in a specified directory; the result of its work is an HTML report file.
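To show how the classes above fit together, here is a minimal, hypothetical driver in the same spirit (the class and method names below are made up for illustration). For the sake of a self-contained sketch it uses MD5 from the .NET base library instead of CRC32 — the tree accepts any byte-array checksum — and it simply prints the groups of presumed duplicates to the console instead of building an HTML report.

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class DuplicateFinderSketch
{
    public static void Run(string directory)
    {
        HashTree tree = new HashTree();
        List<HashTreeNode> endNodes = new List<HashTreeNode>();

        // Register every file of the directory in the tree.
        foreach (string path in Directory.GetFiles(directory))
        {
            byte[] checksum;
            using (MD5 md5 = MD5.Create())
            using (FileStream fs = File.OpenRead(path))
                checksum = md5.ComputeHash(fs); // any byte-array checksum will do

            HashTreeNode endNode = tree.CheckOnTree(checksum);
            if (endNode.files == null)
            {
                endNode.files = new List<HashObject>();
                endNodes.Add(endNode); // remember each distinct end node once
            }
            endNode.files.Add(new HashObject { attribute1 = path });
        }

        // Every end node that collected more than one file is a group of presumed duplicates.
        foreach (HashTreeNode endNode in endNodes)
        {
            if (endNode.files.Count > 1)
            {
                Console.WriteLine("Presumed duplicates:");
                foreach (HashObject obj in endNode.files)
                    Console.WriteLine("  " + obj.attribute1);
            }
        }
    }
}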
The project was built in Microsoft Visual Studio 2008 and compiled for Windows x86.
The project was tested on my own photo folder; the results:
Files in the directory: 4326
Of these, images: 4131
Total image size: 8.33 GB
Duplicate search time: 117 seconds (about 2 minutes), plus 6 seconds to generate the report and copy the duplicate files to the report folder.
Found: 90 duplicates.
Link to the application and source:
ifolder.ru/19211139
If anyone is interested: whether a file is an image is determined not by its extension but by its signature, i.e. by examining the file header (see the Signatures.cs class in the project). A sketch of the idea follows.
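The contents of Signatures.cs are not reproduced here, so the following is only a sketch of the general technique and assumes nothing about the real class: read the first few bytes of a file and compare them against the standard magic numbers of common image formats (the IsImage name and the chosen set of formats are my own, for illustration).

using System.IO;

public static class SignatureSketch
{
    // Standard magic numbers that open JPEG, PNG, GIF and BMP files.
    private static readonly byte[][] ImageSignatures = new byte[][]
    {
        new byte[] { 0xFF, 0xD8, 0xFF },                               // JPEG
        new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }, // PNG
        new byte[] { 0x47, 0x49, 0x46, 0x38 },                         // GIF ("GIF8")
        new byte[] { 0x42, 0x4D }                                      // BMP ("BM")
    };

    // Returns true if the file header matches one of the known image signatures.
    public static bool IsImage(string path)
    {
        byte[] header = new byte[8];
        int read;
        using (FileStream fs = File.OpenRead(path))
            read = fs.Read(header, 0, header.Length);

        foreach (byte[] signature in ImageSignatures)
        {
            if (read < signature.Length)
                continue;
            bool match = true;
            for (int i = 0; i < signature.Length; i++)
            {
                if (header[i] != signature[i])
                {
                    match = false;
                    break;
                }
            }
            if (match)
                return true;
        }
        return false;
    }
}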
Pros:
- relatively high speed
- ease of implementation and storage
Cons:
- The entire data structure is kept in RAM. If the number of images grows to hundreds of thousands or millions, the way the tree is stored will have to be rethought, or a different algorithm found.
I would be grateful if those who downvote the post could share links to better algorithms they know of, or point out the shortcomings of the one described.