
On approaches to comparing file versions

People who use source code version control systems (SVN, Mercurial, Git, etc.) are most likely familiar with comparing file versions to see the changes users have made. There are also many standalone version comparison programs (WinMerge, BeyondCompare, etc.). When comparing versions, as a rule, the two versions of a file are shown side by side so that identical (unchanged) parts of the documents sit opposite each other, while the changed parts (additions and deletions) are highlighted in the corresponding color.

I am sure many would be interested to know which algorithms can be used to implement such a comparison.



As the simplest case, consider comparing plain-text files.

A text file is a sequence of lines (or, more precisely, paragraphs) of text. A paragraph is a string of characters terminated by a carriage return and line feed (on Windows) or by a single line feed (on Unix/Linux). We will not be distracted by the fact that characters can be represented in different encodings, and will assume that we are dealing with single-byte ASCII. The task of comparing versions of such a file is to determine which paragraphs are unchanged, and which have been changed, deleted, or added.
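
As a minimal illustration (in Python, which I will use for all sketches in this post; the function name is my own, not from the article), loading a file as a list of paragraphs might look like this:

```python
def load_paragraphs(path):
    """Load a text file as a list of paragraphs (lines).

    A sketch: assumes the single-byte ASCII encoding discussed above.
    """
    with open(path, encoding="ascii") as f:
        # splitlines() handles both "\r\n" (Windows) and "\n" (Unix) endings
        return f.read().splitlines()
```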



Most file comparison algorithms are based on the longest common subsequence (LCS) algorithm. Indeed, the longest common subsequence of two files can be considered their unchanged part, while all remaining sections are, depending on which version they belong to, deletions (if they belong to the old version) or additions (if they belong to the new version). There are several implementations of the LCS algorithm, ranging from the classic dynamic-programming solution, whose running time is proportional to the product of the lengths of the compared sequences, to faster specialized variants. These algorithms are also quite memory-hungry, so it becomes clear that a character-by-character representation of whole documents is unrealistic here; it is more convenient to use paragraphs as the unit of comparison. They could be compared directly as strings, but as an optimization it is better to hash them and compare the hashes, consulting the strings themselves only when the hashes match (since no hash function is immune to collisions).
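
A hedged sketch of this idea (the classic quadratic dynamic-programming LCS; the hash-then-verify comparison implements the optimization just described, and the function names are my own):

```python
import zlib

def lcs_matches(old, new):
    """Longest common subsequence of two paragraph lists.

    Classic dynamic programming: O(len(old) * len(new)) time and memory.
    Paragraphs are compared by hash first; the strings themselves are
    consulted only when the hashes agree, to rule out collisions.
    Returns the matching index pairs (i, j) in ascending order.
    """
    oh = [zlib.crc32(p.encode()) for p in old]
    nh = [zlib.crc32(p.encode()) for p in new]

    def eq(i, j):
        return oh[i] == nh[j] and old[i] == new[j]

    n, m = len(old), len(new)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if eq(i, j)
                                else max(dp[i][j + 1], dp[i + 1][j]))

    # Backtrack through the table to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if eq(i - 1, j - 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```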



Thus, the first stage of the comparison algorithm is to load the paragraph lists of both documents into memory (yes, we really do have to load both compared documents into memory), hash them, and apply the longest common subsequence (LCS) algorithm to the resulting hashes. The output of this stage is information about the guaranteed unchanged (i.e., matching) sections of the documents; all other sections are changed. If a changed section of the old version corresponds to an empty section in the new version, that section was deleted. If a changed section of the new version corresponds to an empty section in the old version, that section was added.
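
Given the matched pairs, classifying the sections between them is a matter of walking both paragraph lists in parallel. A sketch under the same assumptions (the tag names are my own):

```python
def classify_regions(old, new, pairs):
    """Yield (tag, old_part, new_part) regions from the LCS match pairs:
    'same', 'changed' (present in both versions but different),
    'deleted' (only in the old version) or 'added' (only in the new one)."""
    i = j = 0
    for oi, nj in pairs + [(len(old), len(new))]:  # sentinel closes the tail
        if i < oi and j < nj:
            yield ("changed", old[i:oi], new[j:nj])
        elif i < oi:
            yield ("deleted", old[i:oi], [])
        elif j < nj:
            yield ("added", [], new[j:nj])
        if oi < len(old):
            yield ("same", [old[oi]], [new[nj]])
        i, j = oi + 1, nj + 1
```

With the earlier sketches, the whole first pass is then simply `classify_regions(old, new, lcs_matches(old, new))`.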


The situation is more complicated with changed sections that are present in both the old and the new version. For example, between two matching sections there may be a changed section that consisted of two paragraphs in the old version and three paragraphs in the new one, with no equal (matching) paragraphs among these five. If we stop at this step, we can simply show the user whole groups of paragraphs marked as modified and shift the task of analyzing the detailed changes onto them. But we can be cleverer and compute an optimal correspondence between the paragraphs of the changed section, so that we know exactly which paragraph corresponds to which. To assess the "similarity" of each pair of paragraphs, we can apply the algorithm for computing the so-called edit distance (Levenshtein distance): the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Having computed the edit distance between every pair of paragraphs of the changed section, we can use dynamic programming to find the optimal correspondence between them. The least optimal arrangement is the one in which all paragraphs of the changed section in the old version are considered deleted and all paragraphs of the new version added (that is, nothing matches anything); every other arrangement is better. Applying dynamic programming (exactly how it can be applied here, I will describe in another post), we can find the best correspondence. Now we know exactly which paragraph of the changed section corresponds to which, and we can apply the comparison algorithm once more (this time character by character) to compute the exact changes inside each paragraph.
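
The edit distance itself is a textbook dynamic program; a compact sketch with the usual two-row memory optimization (the pairing of paragraphs that uses these distances is the author's own know-how and is left to the promised follow-up post):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions turning string a into string b.
    O(len(a) * len(b)) time, O(len(b)) memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]
```

A common way to turn this distance into a "similarity" score for a pair of paragraphs is `1 - levenshtein(a, b) / max(len(a), len(b))`, though the article does not commit to a particular normalization.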



If dear readers are interested in the details and implementation features of the longest common subsequence search, the edit distance, and the application of dynamic programming to finding the optimal correspondence for a group of changed paragraphs (the last algorithm I consider my own know-how), please write about it in the comments, and I will publish more detailed posts (each of these algorithms is very interesting in its own right and deserves a separate post).

Source: https://habr.com/ru/post/142676/


