⬆️ ⬇️

Quick search for matching objects by their checksums using the example of searching for duplicate images

Initial data:





Final goal:



The idea of ​​the algorithm is to create a suffix tree, each node of which stores one byte of checksum. When we receive the checksum of the next object, we start moving from the root of the tree deep into, if we do not find the node for the next byte in the sequence, we create it. Reaching the end of the checksum and creating the end node, we will write object parameters into it. As a result, we get a list of leaf nodes, if in the leaf node there is a description of more than one object, we assume that these objects are identical.



I give a general view of the source text for a better understanding, which can be used as a template for real-world tasks.

using System;

using System.Collections. Generic ;

/// <summary>

///

/// </summary>

public class HashObject

{

public string attribute1 { get ; set ; }

public string attribute2 { get ; set ; }

}

/// <summary>

/// ,

///

///

/// </summary>

public class HashTreeNode

{

/// <summary>

///

/// </summary>

private byte myPart;

public byte MyPart

{

get { return myPart; }

}

public HashTreeNode( byte part)

{

myPart = part;

}

/// <summary>

/// ,

/// </summary>

public List <HashObject> files = null ;

/// <summary>

///

/// </summary>

private List <HashTreeNode> NextNodes = new List <HashTreeNode>();

/// <summary>

///

/// </summary>

public HashTreeNode FindEndNode( byte [] crc, int position)

{

HashTreeNode endNode = FindSubNodes(crc[position]);

if (position < crc.Length - 1)

return endNode.FindEndNode(crc, position + 1);

else return endNode;

}

/// <summary>

/// /

/// </summary>

private HashTreeNode FindSubNodes( byte part)

{

lock (NextNodes)

{

for ( int i = 0; i < NextNodes.Count; i++)

if (NextNodes[i].MyPart.Equals(part))

return NextNodes[i];

NextNodes.Add( new HashTreeNode(part));

return NextNodes[NextNodes.Count - 1];

}

}

}

/// <summary>

/// ,

/// </summary>

public class HashTree

{

/// <summary>

///

/// </summary>

List <HashTreeNode> Nodes = new List <HashTreeNode>();

/// <summary>

///

/// ,

/// </summary>

/// <param name="crc">Crc32 </param>

/// <returns></returns>

public HashTreeNode CheckOnTree( byte [] crc)

{

HashTreeNode root = FindNode(crc[0]);

return root.FindEndNode(crc, 1);

}

/// <summary>

///

/// </summary>

private HashTreeNode FindNode( byte part)

{

lock (Nodes)

{

for ( int i = 0; i < Nodes.Count; i++)

if (Nodes[i].MyPart.Equals(part))

return Nodes[i];

Nodes.Add( new HashTreeNode(part));

return Nodes[Nodes.Count - 1];

}

}

}




* This source code was highlighted with Source Code Highlighter .




As an example, the simplest application is implemented to find duplicate images in the specified directory. As a result of the work we get the html report file.

The project was executed in Microsoft Visual Studio 2008, and compiled for Windows x86.

The project was tested on its own folder with photos, test result:

Files in directory: 4326

Of these photos: 4131

Image weight: 8.33 GB

Duplicate search time: 117 seconds (2 minutes) + 6 seconds to generate a report and copy duplicate files to the report folder.

Found: 90 duplicates.

Link to the application and source:

ifolder.ru/19211139

')

If anyone is interested, checking files for belonging to the images is not done by extension, in by signature, examining the file header. Signatures.cs - class in the project.



Pros:





Minuses:





I would be grateful to those minus the post, which will give links to better algorithms known to them or indicate the shortcomings of the described.

Source: https://habr.com/ru/post/103638/



All Articles