The task of comparing similar strings is encountered in practice quite often: personally, I recently ran into it when trying to import email addresses from one program to another.
For example, one address may look like “Tver Region, Kashin g, Sovetskaya St., 1, 5”, and the other - as “Tver Region; Kashin city; Sovetskaya street; House number 1; apartment 5". Are these lines similar and how much? Undoubtedly similar. And “with the naked eye” their structure is visible: region - settlement - street - house - apartment. It is logical that for addresses such division of lines into groups is important; that is, we should not compare “two porridges” of similar words (where one “porridge” consists of the words of the first line, and the second - of the words of the second), namely, to perform a “group-wise” comparison of the words from the first line with the words from the second. The criterion for splitting into groups is also viewed: in the first line the separator for groups is “,”, and in the second line - “; ".
At the same time, there are lines where no explicit grouping is seen. For example, let's take the “classics”: “When there is no agreement in comrades, they will not do anything right, and it will not matter - only flour.” And the second line: “Mischievous monkey, Donkey, Goat and clubfoot Bear, started a quartet. “Obviously, the lines are different (and even the moral of these fables is different, although parallels can be found).
The task in question is not new. There are algorithms (sometimes very complex) that try to solve it, and even sometimes successfully solve it. I propose another one for the algorithms. When compiling it, I proceeded from the following principles:
')
- ease of calling the comparison function;
- ease of implementation;
- sufficient versatility.
In addition, the algorithm is implemented on VBA Excel, so it is very “democratic”, and it can be applied everywhere: Excel not only exists among the software of various computers “by itself”, but also data from various DBMS and applications are exported to it.
So, let's begin.
The comparison function is called StrCompare. It will have 4 arguments, two of which are optional: the first line of str1, the second line of str2, the group separator of the first line div1 and the group separator of the second line div2. If div1 or div2 is omitted, the default delimiter is “|”. “|” Is chosen because it is unlikely to occur in the “average” line, and therefore can be used to compare monolithic (not divisible into groups) lines. Such monolithic lines can also be considered as rows consisting of one group. That is, the title of the comparison function looks like this:
Public Function StrCompare(str1 As String, str2 As String, Optional div1 As String = "|", Optional div2 As String = "|") As Single
Single - because the result of the function will be a number indicating the degree of similarity of the compared strings.
All groups of line 1 are sequentially compared with all groups of line 2 word for word, and the number of matches in each pair of groups is considered. For each group of row 1, the “best group” is finally selected from row 2 (i.e. the group with the most matches). The matches for each pair of words are checked by the word with the minimum length: that is, “street = ul”, and “r = city”. This rule does not apply to numbers: that is, 200 <> 20. When selecting words, all “insignificant characters” within groups are the word separators, but they are ignored, that is, words can consist only of characters WordSymbols = “0123456789 ABB GWDWGII CLEANLEYES ABSTRACTHSCCHSCHSCHYyAABCDEFGHIJJKLMNOPQRSTU It is clear that the case of characters is not taken into account.
To search for a matching word in the current group of the second line, a quick half-division method is used (but slightly modernized compared to the “classical” one, since the matches are checked by the method described above). And since the work of the method of half division requires sorted arrays, it also uses the quick sort algorithm.
The result of the StrCompare function will be the result of dividing the number of matching words by the total number of words in lines 1 and 2:
StrCompare = (da * 2) / ((kon1_2 - nach1_2 + 1) * (kon1_1 - nach1_1 + 1) + (kon2_2 - nach2_2 + 1) * (kon2_1 - nach2_1 + 1))
Here, for example, kon1_2 is the end border of array 1 (array of words contained in the groups of the first line) in the 2nd dimension (the 1st dimension is the number of groups, and the 2nd is the number of words in the group).
It is time to submit the code:
' "" . : '1, 2, 1, 2 Public Function StrCompare(str1 As String, str2 As String, Optional div1 As String = "|", Optional div2 As String = "|") As Single WordSymbols = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" Dim massiv1() As String, massiv2() As String, mass1() As String, mass2() As String, m1() As Variant, m2() As Variant ' Dim mm1() As String, mm2() As String Dim nach1_1 As Integer, kon1_1 As Integer, nach1_2 As Integer, kon1_2 As Integer, nach2_1 As Integer, kon2_1 As Integer, nach2_2 As Integer, kon2_2 As Integer Dim item As String, itemnumber As Integer Dim yes As Integer, maxyes As Integer, da As Integer Dim counter As Integer ' noname str1 = UCase(str1): str2 = UCase(str2) massiv1 = Split(str1, div1) ReDim mass1(LBound(massiv1) To UBound(massiv1), 0 To 1000) maxk = 0 counter = 0 For i = LBound(massiv1) To UBound(massiv1) item = massiv1(i) dlina = Len(item) slovo = "" NewWord = False k = 0 ' For j = 1 To dlina bukva = mid(item, j, 1) If (InStr(1, WordSymbols, bukva) > 0) And Not NewWord Then NewWord = True slovo = slovo + bukva Else If InStr(1, WordSymbols, bukva) > 0 Then slovo = slovo + bukva Else If (InStr(1, WordSymbols, bukva) = 0) And NewWord Then NewWord = False mass1(i, k) = slovo If k > maxk Then maxk = k k = k + 1 slovo = "" End If End If End If Next j If NewWord Then mass1(i, k) = slovo If k > maxk Then maxk = k End If Next i ReDim Preserve mass1(LBound(massiv1) To UBound(massiv1), 0 To maxk) '*************************************************************' massiv2 = Split(str2, div2) ReDim mass2(LBound(massiv2) To UBound(massiv2), 0 To 1000) maxk = 0 For i = LBound(massiv2) To UBound(massiv2) item = massiv2(i) dlina = Len(item) slovo = "" NewWord = False k = 0 ' For j = 1 To dlina bukva = mid(item, j, 1) If (InStr(1, WordSymbols, bukva) > 0) And Not NewWord Then NewWord = True slovo = slovo + bukva Else If InStr(1, WordSymbols, bukva) > 0 Then slovo = slovo + bukva Else If (InStr(1, WordSymbols, bukva) = 0) And NewWord Then NewWord = False mass2(i, k) = slovo If k > maxk Then maxk = k k = k + 1 slovo = "" End If End If End If Next j If NewWord Then mass2(i, k) = slovo If k > maxk Then maxk = k End If Next i ReDim Preserve mass2(LBound(massiv2) To UBound(massiv2), 0 To maxk) ' "" ; : kon1_2 - 1 2- nach1_1 = LBound(mass1, 1) kon1_1 = UBound(mass1, 1) nach1_2 = LBound(mass1, 2) kon1_2 = UBound(mass1, 2) nach2_1 = LBound(mass2, 1) kon2_1 = UBound(mass2, 1) nach2_2 = LBound(mass2, 2) kon2_2 = UBound(mass2, 2) For i = nach1_1 To kon1_1 For j = nach1_2 To kon1_2 If mass1(i, j) = "" Then counter = counter + 1 mass1(i, j) = "noname" + Trim(Str(counter)) End If 'MsgBox ("mass1(" + Trim(Str(i)) + "," + Trim(Str(j)) + ")=" + mass1(i, j)) Next j Next i For i = nach2_1 To kon2_1 For j = nach2_2 To kon2_2 If mass2(i, j) = "" Then counter = counter + 1 mass2(i, j) = "noname" + Trim(Str(counter)) End If 'MsgBox ("mass2(" + Trim(Str(i)) + "," + Trim(Str(j)) + ")=" + mass2(i, j)) Next j Next i ' " -" ReDim m2(nach2_2 To kon2_2) As Variant For i = nach2_1 To kon2_1 For j = nach2_2 To kon2_2 m2(j) = mass2(i, j) Next j Call QuickSort(m2, nach2_2, kon2_2) For j = nach2_2 To kon2_2 mass2(i, j) = m2(j) Next j Next i ' : 1 2 ReDim mm2(nach2_2 To kon2_2) da = 0 For k = nach1_1 To kon1_1 ' 1 maxyes = 0 For i = nach2_1 To kon2_1 ' 2 yes = 0 For j = nach2_2 To kon2_2: mm2(j) = mass2(i, j): Next j ' 2 For l = nach1_2 To kon1_2 ' 1 If BinarySearch(mm2, nach2_2, kon2_2, mass1(k, l)) <> -1 Then yes = yes + 1 Next l If yes > maxyes Then maxyes = yes Next i da = da + maxyes Next k StrChange = (da * 2) / ((kon1_2 - nach1_2 + 1) * (kon1_1 - nach1_1 + 1) + (kon2_2 - nach2_2 + 1) * (kon2_1 - nach2_1 + 1)) 'StrChange = da End Function Public Sub QuickSort(ByRef vArray() As Variant, inLow As Integer, inHi As Integer) Dim pivot As Variant Dim tmpSwap As Variant Dim tmpLow As Integer Dim tmpHi As Integer tmpLow = inLow tmpHi = inHi pivot = vArray((inLow + inHi) \ 2) While (tmpLow <= tmpHi) While (vArray(tmpLow) < pivot And tmpLow < inHi) tmpLow = tmpLow + 1 Wend While (pivot < vArray(tmpHi) And tmpHi > inLow) tmpHi = tmpHi - 1 Wend If (tmpLow <= tmpHi) Then tmpSwap = vArray(tmpLow) vArray(tmpLow) = vArray(tmpHi) vArray(tmpHi) = tmpSwap tmpLow = tmpLow + 1 tmpHi = tmpHi - 1 End If Wend If (inLow < tmpHi) Then QuickSort vArray, inLow, tmpHi If (tmpLow < inHi) Then QuickSort vArray, tmpLow, inHi End Sub Public Function BinarySearch(vArray() As String, inLow As Integer, inHi As Integer, key As String) As Integer Dim lev As Integer, prav As Integer, mid As Integer Dim key_ As String, arritem As String, arritem_ As String Dim minlen As Integer, keylen As Integer, arritemlen As Integer If key = Trim(Str(Val(key))) Then ' lev = inLow: prav = inHi While lev <= prav mid = lev + (prav - lev) \ 2 arritem = vArray(mid) If key < arritem Then prav = mid - 1 ElseIf key > arritem Then lev = mid + 1 Else BinarySearch = mid Exit Function End If Wend Else keylen = Len(key) lev = inLow prav = inHi While lev <= prav mid = lev + (prav - lev) \ 2 arritem = vArray(mid) arritemlen = Len(arritem) minlen = IIf(keylen < arritemlen, keylen, arritemlen) key_ = left(key, minlen) arritem_ = left(arritem, minlen) If key_ < arritem_ Then prav = mid - 1 ElseIf key_ > arritem_ Then lev = mid + 1 Else BinarySearch = mid Exit Function End If Wend End If BinarySearch = -1 End Function
I think it makes no sense to comment on everything: you can navigate through the code. Just analyze the operation of the comparison function on several lines of different nature.
- str1 = ”Tver region, Kashin g, Sovetskaya str., 1, 5” str2 = ”Tver region; Kashin city; Sovetskaya street; House number 1; apartment 5".
First, compare the lines without groups:
StrCompare (str1, str2) gives the result 0.8888889.
And now, taking into account:
StrCompare (str1, str2, ",", ";") - the result is 0.8.
As you can see, the groups are more strictly related to the comparison; in this case, it is important for them that “the house is a house, and the apartment is an apartment”. If you ignore groups, this does not matter. - str1 = ”There was a gray goat at my grandmother” str2 = ”There was a gray goat at my grandmother”
StrCompare (str1, str2) -> 0.6666667 - str1 = ”Ivanov Ivan Ivanovich m. Kaluga 1950 ”str2 =” Ivanov I.I. 01.20.1950 ”
StrCompare (str1, str2) -> 0.6153846 - str1 = "When there is no agreement in comrades, their business will not work, and nothing will come out of it - only flour." str2 = "Prankish monkey, Donkey, Goat and Bruin Mishka have started a quartet."
StrCompare (str1, str2) -> 0 - str1 = ”In accordance with paragraph 1 of Art. 540 of the Civil Code of the Russian Federation in the case when the subscriber under the energy supply contract is a citizen using energy for household consumption, the contract is considered to be concluded from the moment the subscriber actually connects to the network. | According to Part 1 of Article 153 of the Housing Code of the Russian Federation, citizens are obliged to make timely and full payments for housing and utilities. | During the period from ____________ 2017 to ____________ 2017, the Guaranteed Provider supplied you with electricity worth ______________________. | In connection with your violation of your obligations to pay for electricity, which led to the formation of consumer debt to the guaranteeing supplier in the amount of more than 2 billing periods, in respect of the consumer’s living quarters, actions were taken to limit the resumption of the provision of public utilities for electricity. | In accordance with paragraph 121 (1) of the Rules for the provision of public utilities to owners and users of premises in an apartment building x and residential buildings, approved by Government Decree 06.05.2011g. No. 354, the costs of the contractor related to the imposition of a restriction, suspension and resumption of the provision of public utilities to a debtor consumer shall be reimbursed at the expense of the consumer in respect of whom the indicated actions were carried out. | For the Guarantee, the cost of paying for the actions for imposing the restriction and the subsequent restoration of electricity supply supplier amount ______________________________________________. | Based on the above, the TverAtomEnergoSbyt PO asks you to pay the debt for actions of restriction / renewal of municipal services for electricity in the amount of _____________________ rubles. the following details with the account number and purpose of payment: "
str2 = ”“ ____ ”__________ 2017 between AtomEnergoSbyt JSC - the Guaranteeing Supplier and _____________________ - the Consumer has concluded the power supply contract No. ___________________, valid from _________________ year, subject to its further prolongation (clause 8.1 of the contract, article 540 of the Civil Code of the Russian Federation ), according to clause 1.1. which the guaranteeing supplier has undertaken to sell electricity (power), as well as independently or through third parties involved to provide services for the transmission of electricity and services, the provision of which is an integral part of the process of supplying electricity to the consumer, and the Buyer is obliged to pay for the purchased electricity (power) . | In connection with the violation by the Consumer of his obligations to pay for electric energy (clause 5.2. Of the power supply contract No. __________________ of ___________), which led to the formation of consumer debt to the guaranteeing supplier in the amount of more than one settlement period in respect of the consumer's electric grid facility -_________________ actions were taken to limit / renew the mode of power supply consumption in accordance with the Rules for complete and (or) partial restriction of the mode of consumption of electric energy, approved by the Decree of the Government of the Russian Federation of 04.05.2012 No. 442 (hereinafter referred to as the Rules). | According to clause 24 of the Rules, the consumer is obliged to compensate the contractor for the costs of paying for actions to impose a restriction and subsequent restoration of the mode of consumption of electric energy. | The cost of expenses of the Guarantee supplier to pay for actions to impose restrictions and the subsequent restoration of the mode of consumption of electric energy amounts to ______________________________________________. | of the above, the TverAtomEnergoSbyt PS asks you to pay the costs of the Guarantee Provider for actions on limited w / resume mode electric power consumption in the amount of _______________ rubles. by the following details indicating the contract number and purpose of payment: | Purpose of payment: payment of restriction / renewal of the mode of consumption of electric energy under the contract number ____________ ”
Here, str1 and str2 are fragments of very similar documents (energy supply contracts for individuals and legal entities, respectively). Comparison without StrCompare groups (str1, str2, ”*”, ”*”) can be used for “rough evaluation” of document similarities (the “|” symbol is not suitable in this case, since it is used in the source lines to split them into groups), which reveals a decent similarity of 0.75 (that is, documents are clearly of the same nature!). And to specify the similarity, we apply a partition into groups: StrCompare (str1, str2, ”|”, ”|”) (or simply StrCompare (str1, str2)). Result: 0.3790227.
And now, perhaps the most interesting example. The plot of the fable about the crow and the fox has been known since the time of Aesop. Using StrCompare, we compare two fables: the classic version from I.A. Krylov and less known from A.P. Sumarokov:
str1 = ”For how many times they told the world That flattery is vile, harmful; but all is not in vain, And in the heart a flatterer will always find a corner. A raven somewhere God sent a piece of cheese; On the spruce tree of the Crow perching, the Breakfast was very much gathered, Yes, she became thoughtful, and kept the cheese in her mouth. Unfortunately, the Fox ran very close; Suddenly the cheesy spirit of Fox stopped: Fox sees cheese, Fox captivated cheese. Cheating on a tiptoe tree; He twirls his tail, does not take his eyes off the Crow And speaks so sweetly, breathing a little: “Darling, how good! Well, what a neck, what eyes! Tell, so, right, fairy tales! What kind of parushes! what a sock! And, truly, an angelic voice should be! Sing, little light, do not be ashamed! What, if, sister, With such beauty, and you are a master of singing, “Why, you would have had a king-bird!” Things got a head spin, from praise, the breath curled into joy, and the Lisitsyny’s affable words crow crowed at all crows throat: Cheese fell out - with him was a cheat like that. ”
str2 = ”And the birds hold on to human craft. A crow crow was once killed by a cheese And sat down on an oak tree. Yes, just did not eat a ditty. She saw a fox in her mouth in her piece And she thinks: “I will give Crow juice! Although I will not pick it up there, I will get this piece, Oak, however high. ” “It's great,” says Fox, “My friend, Voronushka, my little sister!” You are a beautiful bird! What are the legs, what a sock, And you can tell you without hypocrisy, What more than all of you measures, my light, good! And the parrot is nothing in front of you, soul, More beautiful than a hundred of your peacock feathers! ”(Praise is pleasant for us to endure). “Oh, if you could still sing and sing, So you wouldn’t have such a bird in the world!” Crow opened her neck wide, To be a nightingale, “And cheese,” she thinks, “and then I will sing. At this moment I am not here about a feast! ”Razinil lips And waited for the post. Slightly sees only the end of the Lisitsyn tail. I wanted to sing, I didn't sing, I wanted to eat, I didn't eat.
The reason is that the cheese is no more. The cheese fell out of the company, - Fox for lunch. ”
StrCompare (str1, str2) gives the result 0.5590062 - so the similarity of the plot is obvious!