
Results of the competition to reconstruct shredded documents

Task 5: about 6,200 fragments, each roughly 150 x 60 px.

DARPA has announced the results of its competition to reconstruct shredded documents. Almost 9,000 teams took part.

Each “puzzle” consisted of fragments of handwritten text that had been shredded in a new commercial shredder and scanned at 400 DPI. The hardest task, number 5, contained about 6,200 fragments from an unknown number of pages; only two teams solved it.

The winner was the team All Your Shreds Are Belong To US, which scored the maximum possible 50 points by completing all the tasks. Its closest competitors scored 30 and 26 points.

No one managed to develop a fully automatic solution: every team relied on one or more human operators to verify that fragments matched. A Polish team tried crowdsourcing. A couple of dozen users working together solved the first puzzle relatively quickly, but got no further.

Programmer Mark Newlin (team wasabi), who finished third, has published his document recovery technique. All modules were written in C# / .NET 4.0 / MSSQL. The first stage prepares for assembly: the scanned image is split into separate fragments, the background is removed, and each fragment is aligned.



Fragment borders are detected after the background has been filled. Fragments are aligned automatically along the side with the largest number of pixels, with manual alignment for ambiguous cases (about 1%, according to Mark). The top and bottom edges of a fragment are also easy to identify from the characteristic marks left by the shredder, so a fragment can be rotated 180° if necessary. Each puzzle piece is saved to a file. A “cleaned” version of the fragment, trimmed along the long sides, is stored separately; it is needed to find the points where pen strokes cross the edge.
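The background-filling step can be illustrated with a small flood fill. This is only a sketch in Python, not Mark's C# code: the function names, the grayscale representation, and the tolerance value are all assumptions made for the example. It marks every pixel that looks like background and is connected to the image edge, then takes the bounding box of what remains as the fragment.

```python
from collections import deque

def background_mask(img, bg_value, tol=10):
    """Return the set of (row, col) pixels that are background-colored
    and connected to the image edge (hypothetical helper)."""
    h, w = len(img), len(img[0])

    def is_bg(r, c):
        return abs(img[r][c] - bg_value) <= tol

    # Seed the fill with every background-colored pixel on the image edge.
    edge = [(r, c) for r in range(h) for c in range(w)
            if r in (0, h - 1) or c in (0, w - 1)]
    seen = {p for p in edge if is_bg(*p)}
    queue = deque(seen)
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < h and 0 <= nc < w
                    and (nr, nc) not in seen and is_bg(nr, nc)):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen

def bounding_box(img, background):
    """Return (top, left, bottom, right) of the non-background pixels."""
    fg = [(r, c) for r in range(len(img)) for c in range(len(img[0]))
          if (r, c) not in background]
    if not fg:
        return None
    rows = [r for r, _ in fg]
    cols = [c for _, c in fg]
    return (min(rows), min(cols), max(rows), max(cols))

# Toy grayscale scan: 255 = scanner background, 200 = paper, 0 = ink.
scan = [
    [255, 255, 255, 255, 255, 255],
    [255, 255, 200, 200, 255, 255],
    [255, 255, 200,   0, 255, 255],
    [255, 255, 200, 200, 255, 255],
    [255, 255, 255, 255, 255, 255],
]
box = bounding_box(scan, background_mask(scan, bg_value=255))
print(box)  # (1, 2, 3, 3)
```

Filling from the edge rather than thresholding the whole image means that background-colored pixels enclosed by the fragment are kept as part of it.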

Before assembly, a database is built with information about each fragment: its “dirty” and clean dimensions, the coordinates of ruled lines (if the fragment comes from ruled paper), the shape of the border, the points where pen strokes leave the fragment, the color of each point on the border, and the recognized character. Since OCR programs handle this task poorly, the characters were recognized manually, Mark says, with a glass of beer after every thousand fragments.
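One way to picture such a per-fragment record is the structure below. It is a hypothetical sketch in Python: the original used an MSSQL database whose schema is not described in the article, so every field name and type here is an assumption that simply mirrors the items listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class FragmentRecord:
    """Hypothetical per-fragment record; field names are illustrative only."""
    fragment_id: int
    raw_size: Tuple[int, int]      # "dirty" width and height, in pixels
    clean_size: Tuple[int, int]    # size after trimming the long sides
    ruled_line_ys: List[int] = field(default_factory=list)   # empty if paper is unruled
    border_shape: List[int] = field(default_factory=list)    # edge offset per row
    stroke_exits: List[Tuple[str, int]] = field(default_factory=list)  # (side, y)
    border_colors: List[Tuple[int, int, int]] = field(default_factory=list)  # RGB
    symbol: Optional[str] = None   # character entered by hand, if any

example = FragmentRecord(
    fragment_id=1,
    raw_size=(62, 150),
    clean_size=(59, 148),
    ruled_line_ys=[40, 110],
    stroke_exits=[("left", 37), ("right", 52)],
    symbol="e",
)
print(example.symbol, example.clean_size)
```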

The probability that two fragments are neighbors was calculated for each pair from three signals: how well the pen strokes meet at the fragment borders (by coordinates and by the number of such points), how well the ruled lines on the paper meet, and how similar the fragments are in color.
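The article does not give the actual formula, so the following Python sketch only illustrates how three such signals could be combined into one score. The weights, the pixel tolerance, the dictionary keys, and the function names are all assumptions, and the result is a heuristic score between 0 and 1 rather than a calibrated probability.

```python
def match_fraction(ys_a, ys_b, tol=2):
    """Fraction of edge points that have a partner within tol pixels
    on the facing edge; 0.5 when neither edge has any points (no evidence)."""
    if not ys_a and not ys_b:
        return 0.5
    unused = sorted(ys_b)
    matched = 0
    for y in sorted(ys_a):
        for i, other in enumerate(unused):
            if abs(y - other) <= tol:
                matched += 1
                del unused[i]   # each point may be matched only once
                break
    return 2 * matched / (len(ys_a) + len(ys_b))

def color_similarity(rgb_a, rgb_b):
    """1.0 for identical colors, 0.0 for opposite extremes."""
    diff = sum(abs(a - b) for a, b in zip(rgb_a, rgb_b)) / 3
    return 1 - diff / 255

def adjacency_score(left, right, weights=(0.5, 0.3, 0.2)):
    """Weighted mix of stroke, ruled-line and color agreement (assumed weights)."""
    w_stroke, w_line, w_color = weights
    return (w_stroke * match_fraction(left["strokes"], right["strokes"])
            + w_line * match_fraction(left["lines"], right["lines"])
            + w_color * color_similarity(left["color"], right["color"]))

# "strokes": y-coordinates where ink crosses the shared edge,
# "lines": y-coordinates of ruled lines, "color": mean ink color.
a = {"strokes": [10, 40], "lines": [20], "color": (30, 30, 120)}
b = {"strokes": [11, 39], "lines": [21], "color": (32, 28, 118)}
c = {"strokes": [25], "lines": [33], "color": (200, 30, 30)}

print(adjacency_score(a, b) > adjacency_score(a, c))  # True
```

Scores like these would only rank candidates; as the article notes, a human operator still confirmed each match.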

Based on this information, the document is assembled manually in a graphics editor. Mark used GIMP and Paint.NET, but for the complex fourth and fifth puzzles, with thousands of fragments, he had to build a separate interface that filters the fragments in the database by various parameters: adjacency probability, pen color, coffee stains, and so on.



An interface was also added that displays the best candidate fragments, which improved the accuracy and speed of assembly.



The master document containing all the matches found so far was extended gradually, and the probabilities were recalculated.



Mark Newlin says he spent all his free time on this project over the past few weeks. He solved four of the five tasks, all except the hardest, the fifth puzzle of 6,200 pieces, which was worth 24 points. Apparently Mark simply ran out of time, since he worked alone. He now plans to buy a couple of commercial shredders to continue experimenting and improve his technique. Perhaps Mark will eventually write a book or start his own company to compete with Unshredder.com. He will not be alone, though: after the DARPA competition, a large community of people interested in the topic has formed.

The winning team, All Your Shreds Are Belong To US, also promises to publish its algorithm soon. In the comments on Mark’s blog post, team members said their methods were similar in many respects. In an accompanying note, they reported that solving all the tasks took about 600 man-hours.

DARPA has published scans of the solutions (PDF) submitted by the winning team. For example, the originals and the reconstructed fragments of three pages from the fifth task are shown below. In this task the fragments of all pages were mixed together, each page had missing fragments, and the second page was almost entirely missing. To earn points, it was not enough to assemble the puzzle; the message also had to be deciphered. In the fifth task, the message was encoded in Morse code (solution of each task, PDF).

Page 1, Morse code in the last line


Page 2 was shredded upside down


Page 3


The DIN 32757 security standard for shredders specifies the maximum size of the fragments for each security level:

Level 1 = 12 mm strips or 11 x 40 mm fragments
Level 2 = 6 mm strips or 8 x 40 mm fragments
Level 3 = 2 mm strips or 4 x 30 mm fragments (marked Confidential)
Level 4 = 2 x 15 mm fragments (marked Commercially Sensitive)
Level 5 = 0.8 x 12 mm fragments (marked Top Secret or Classified)
Level 6 = 0.8 x 4 mm fragments (marked Top Secret or Classified)

In the fifth task of the contest, the fragments measure about 148 x 59 pixels, which at 400 DPI is 9.4 x 3.7 mm and roughly corresponds to a level 4 shredder under DIN 32757. According to Wikipedia, CIA security standards for shredders require fragments no larger than 1 x 5 mm; in the Russian Federation the limit is 1 x 1 mm.
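The pixel-to-millimeter conversion follows directly from the 400 DPI scan resolution, since one inch is 25.4 mm. A minimal Python sketch of the arithmetic (the function name is chosen for this example):

```python
MM_PER_INCH = 25.4

def px_to_mm(pixels, dpi=400):
    """Convert a length in scanned pixels to millimeters."""
    return pixels * MM_PER_INCH / dpi

width_mm = px_to_mm(148)
height_mm = px_to_mm(59)
print(f"{width_mm:.1f} x {height_mm:.1f} mm")  # 9.4 x 3.7 mm

# Compare the fragment area with the cross-cut sizes listed for each level.
fragment_area = width_mm * height_mm   # about 35 square mm
level_4_area = 2 * 15                  # 30 square mm
level_3_area = 4 * 30                  # 120 square mm
print(level_4_area < fragment_area < level_3_area)  # True
```

By area, the fragments fall between the level 3 and level 4 limits and sit much closer to level 4, which is consistent with the "roughly level 4" estimate, even though their shape (9.4 x 3.7 mm) differs from the 2 x 15 mm particle given for that level.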

Source: https://habr.com/ru/post/134047/

