
Search for repetitions in a two-dimensional array, or computational complexity by example

Good day, dear Habr community.

When I was in my second or third year at university (which is to say, not so long ago), one of my subjects was called "Algorithms and Data Structures". It covered not only the algorithms and structures themselves, but also the concept of "computational complexity". I admit that at the time it did not interest me much.

"Surely analyzing an algorithm's space and time complexity only matters when you are building very high-performance, high-load systems, or working with really large amounts of data" — thoughts along these lines visited me (and, probably, not only me) back then.
However, recently I had to greatly change my opinion because of a seemingly simple task.

The task was as follows: there were Excel tables with some (mostly numerical) data. Without going into details, these were identifiers used in various subsystems. It was necessary to find duplicate values in the tables and, depending on various factors, decide what to do with them. At this point you may already be asking: "Why not do it automatically?" In fact, the tables contained precisely those data sets that the automated system could not cope with and that required human intervention.
Naturally, combing through that much data by hand (on average about 5-10 columns and 1000 rows per document) is a thankless job, so I decided to simplify the process a little, at least as far as searching for repetitions was concerned.
Since the organization's strict rules made it impossible to write a separate utility in C, I had to read up a little on VBA (which I did not know at all) and come up with a solution in the form of a VBA macro.

So, we proceed directly to the algorithms.
In fact, the task of finding repetitions on an Excel sheet is simply the task of finding repetitions in a two-dimensional array. But how do you do that? Oddly enough, Google was completely unable to help me with such a seemingly trivial task, and I had to think about it a little myself.

A simple solution


The simplest, most straightforward solution that comes to mind is plain element-wise comparison. In VBA, it looks like this:
Sub FindMatches()
    Dim sCol As Integer, sRow As Integer, cCol As Integer, cRow As Integer
    For sCol = 1 To 10
        For sRow = 1 To 10
            ' Compare the current cell with every cell in the following columns.
            For cCol = sCol + 1 To 10
                For cRow = 1 To 10
                    If Cells(sRow, sCol) = Cells(cRow, cCol) Then
                        ' Mark the first occurrence green (if it has not been marked yet)...
                        If Cells(sRow, sCol).Font.Color = RGB(0, 0, 0) Then
                            Cells(sRow, sCol).Font.Color = RGB(0, 150, 0)
                        End If
                        ' ...and the repetition red.
                        Cells(cRow, cCol).Font.Color = RGB(150, 0, 0)
                    End If
                Next cRow
            Next cCol
        Next sRow
    Next sCol
End Sub


Thus, we take each cell of the table and compare it with all the cells that come after it (it is worth noting that within the column where an element occurs for the first time there can be no repetitions of it, so we start the comparison from the next column). When we find a match, we color the first occurrence green and the repetition red. Convenient? Quite. However, this is exactly where computational complexity bares its teeth. Here is how long the processing takes on my machine, which did not seem all that weak:



In general, we can say that if the number of input elements grows by a factor of n, the running time grows by a factor of n². Of course, such a solution was unacceptable to me.
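To see the quadratic growth for yourself, the easiest check is to time the macro on ranges of different sizes. Below is a minimal timing sketch of my own (the wrapper name MeasureFindMatches is not from the original post); it uses VBA's built-in Timer function, which returns the number of seconds since midnight:

Sub MeasureFindMatches()
    ' A rough measurement wrapper, assuming the FindMatches macro from the
    ' listing above is available in the same module.
    Dim startTime As Double
    Dim elapsed As Double

    startTime = Timer
    FindMatches
    elapsed = Timer - startTime

    MsgBox "FindMatches took " & Format(elapsed, "0.000") & " seconds"
End Sub

Doubling the number of cells in the range should roughly quadruple the time reported for the brute-force version.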

A slightly less simple solution


So, let's try to speed up our algorithm a bit. The first thing I realized is that accessing the cells of the sheet directly takes a huge amount of time, so the number of such operations should be kept to a minimum. My solution was to first read the data from the cells into memory and then work with that copy. A two-dimensional array is ideally suited for this.
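As an aside, reading a whole block of cells into memory does not even require a cell-by-cell loop: a Range's Value property returns its contents as a two-dimensional Variant array in a single call. A small illustrative sketch (the range address and names here are my own, not from the original macro):

Sub ReadRangeIntoArray()
    ' One sheet access instead of 25,000: the Value property of a Range
    ' returns its contents as a 1-based two-dimensional Variant array.
    Dim data As Variant
    data = Range("A1:E5000").Value

    ' Example: the value in the 10th row, 2nd column of the block.
    Debug.Print data(10, 2)
End Sub

The macro below sticks to an explicit loop, as in the original post.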

To reduce the overall number of operations, I decided to write each unique value into a separate array and compare new values only against this array of unique elements. It looks like this:

Sub FindMatches()
    Dim row As Integer                 ' row counter
    Dim col As Integer                 ' column counter
    Dim Values(5000, 5) As Integer     ' cell values read from the sheet
    Dim DistinctValues() As Integer    ' unique values encountered so far
    Dim DistinctCount As Integer       ' number of unique values
    Dim i As Integer                   ' loop counter
    Dim IsUnique As Boolean            ' is the current value unique?

    ' Initialize the array of unique values.
    ReDim DistinctValues(0 To 0)
    DistinctCount = 0

    ' Read the data from the sheet into the Values array.
    For col = 1 To 5
        For row = 1 To 5000
            Values(row, col) = Cells(row, col)
        Next row
    Next col

    ' Search.
    For col = 1 To 5
        For row = 1 To 5000
            ' Compare the current value from Values with the unique values
            ' already collected in DistinctValues.
            IsUnique = True
            For i = 0 To DistinctCount - 1
                If DistinctValues(i) = Values(row, col) Then
                    ' A match is found: the value is a repetition, color it red.
                    IsUnique = False
                    Cells(row, col).Font.Color = RGB(255, 75, 75)
                    Exit For
                End If
            Next i
            ' No match found: the value is unique, add it to DistinctValues
            ' and color the cell green.
            If IsUnique = True Then
                DistinctCount = DistinctCount + 1
                ReDim Preserve DistinctValues(0 To DistinctCount)
                DistinctValues(DistinctCount - 1) = Values(row, col)
                Cells(row, col).Font.Color = RGB(75, 175, 75)
            End If
        Next row
    Next col
End Sub


So, we have significantly reduced the number of comparisons by comparing each value only with the unique elements already encountered. We have also reduced the number of sheet operations to two per cell (the first reads the cell value into the array, the second colors the cell). How much did the running time decrease?

Unfortunately, the speed of this algorithm depends heavily on the entropy of the input data. Since we compare against an array of unique elements, the larger this array is, the more time we need. In other words, the more unique elements there are in the input data, the slower the algorithm works. I ran two tests on 25,000 cells (5 columns and 5000 rows). In the first test, all cells were filled with the same value (1); in the second, on the contrary, they were filled with distinct values without repetitions.
For the first case, the algorithm took 816 milliseconds; for the second, almost 19 seconds. And yet even that is much faster than our initial full brute force, which took a breathtaking 58 minutes to digest the same 25,000 cells.
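For reference, the two test sheets are easy to reproduce with a small helper macro like the one below (the macro name and the fill pattern are my own; the original post does not show how the test data was generated):

Sub FillTestData(ByVal allSame As Boolean)
    ' Fills 5 columns x 5000 rows either with one repeated value or with
    ' guaranteed-unique values, matching the two test cases above.
    Dim row As Long, col As Long
    For col = 1 To 5
        For row = 1 To 5000
            If allSame Then
                Cells(row, col) = 1                        ' best case: a single repeated value
            Else
                Cells(row, col) = (col - 1) * 5000 + row   ' worst case: all values distinct
            End If
        Next row
    Next col
End Sub

Run FillTestData True for the first test and FillTestData False for the second.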

However, the pursuit of something better knows no limits. After some thought, I decided not to reinvent the wheel and to use proven algorithms instead. My choice fell on quicksort, which is known to have O(n log n) average complexity.
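A rough back-of-the-envelope comparison shows why this is worth the effort: for n = 25,000 elements, pairwise comparison performs on the order of n² ≈ 6·10⁸ elementary operations, while sorting needs on the order of n·log₂ n ≈ 25,000 · 15 ≈ 4·10⁵, a difference of more than three orders of magnitude (the exact constants depend on the implementation, of course).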

A fast solution


So, how can quicksort be used to solve my problem?

' A structure describing a cell: its position on the sheet and its value.
Type MyCell
    row As Long
    col As Long
    Value As Long
End Type

' Quicksort over an array of cells. Cells are ordered by value; cells with
' equal values are ordered by their position on the sheet (column, then row).
Sub QSort(sortArray() As MyCell, ByVal leftIndex As Long, _
          ByVal rightIndex As Long)
    Dim compValue As MyCell
    Dim i As Long
    Dim J As Long
    Dim tempNum As MyCell

    i = leftIndex
    J = rightIndex
    compValue = sortArray(CLng((i + J) / 2))

    ' Partitioning.
    Do
        Do While (sortArray(i).Value < compValue.Value And i < rightIndex Or _
            (sortArray(i).Value = compValue.Value And sortArray(i).col < compValue.col) Or _
            (sortArray(i).Value = compValue.Value And sortArray(i).col = compValue.col And sortArray(i).row < compValue.row))
            i = i + 1
        Loop
        Do While (compValue.Value < sortArray(J).Value And J > leftIndex Or _
            (sortArray(J).Value = compValue.Value And sortArray(J).col > compValue.col) Or _
            (sortArray(J).Value = compValue.Value And sortArray(J).col = compValue.col And sortArray(J).row > compValue.row))
            J = J - 1
        Loop
        If i <= J Then
            tempNum = sortArray(i)
            sortArray(i) = sortArray(J)
            sortArray(J) = tempNum
            i = i + 1
            J = J - 1
        End If
    Loop While i <= J

    If leftIndex < J Then QSort sortArray(), leftIndex, J
    If i < rightIndex Then QSort sortArray(), i, rightIndex
End Sub

Sub FindMatches()
    ' The selected area on the sheet.
    Dim myRange As Range
    Set myRange = Selection

    ' Size and position of the selection.
    Dim ColCount As Long, RowCount As Long
    ColCount = myRange.Columns.Count
    RowCount = myRange.rows.Count
    Dim FirstCol As Long, FirstRow As Long
    FirstCol = myRange.Column
    FirstRow = myRange.row

    ' The array of cells; fill it with data from the sheet.
    Dim MyCells() As MyCell
    ReDim MyCells(ColCount * RowCount - 1)
    Dim col As Long, row As Long
    Dim i As Long
    Dim idx As Long
    For col = FirstCol To FirstCol + ColCount - 1
        For row = FirstRow To FirstRow + RowCount - 1
            idx = (col - FirstCol) * RowCount + row - FirstRow
            MyCells(idx).row = row
            MyCells(idx).col = col
            MyCells(idx).Value = CLng(Val(Cells(row, col)))
        Next row
    Next col

    ' Sort the array.
    Call QSort(MyCells, 0, ColCount * RowCount - 1)

    ' Coloring: the first element of each group of equal values is green,
    ' the rest are red. The very first element of the sorted array is always
    ' a first occurrence.
    Cells(MyCells(0).row, MyCells(0).col).Font.Color = RGB(0, 255, 0)
    For i = 1 To ColCount * RowCount - 1
        If MyCells(i).Value <> MyCells(i - 1).Value Then
            Cells(MyCells(i).row, MyCells(i).col).Font.Color = RGB(0, 255, 0)
        Else
            Cells(MyCells(i).row, MyCells(i).col).Font.Color = RGB(255, 0, 0)
        End If
    Next i
End Sub


In addition to the sorting itself, I also made the macro work on the currently selected area of the sheet, because it is more convenient.

The principle of this algorithm is very simple. We define a data structure that stores a cell's address (row and column number) and its value. An array of these structures is then created and filled with all the data from the sheet, and the array is sorted with quicksort.
After that, it is enough to walk through the sorted array once, coloring the elements: the first element in each group of equal values is colored green, all the rest red.

It should be noted that quicksort itself is unstable, that is, it does not preserve the relative order of elements with equal keys. Since in our case the key is the cell value, in the classical version the elements within each group of equal values would end up in arbitrary order. Obviously, that did not suit me, because finding the first occurrence in the table would then require sorting each group once more, this time by cell position. To avoid this, I extended the sort key by adding two extra conditions to the element-swapping rules: now, if the values match, the elements are ordered by the position in which they occur on the sheet.
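The extended key is easier to see when pulled out into a separate comparison function. A sketch (the helper CellLess is my own; the macro above inlines this logic directly into the Do While conditions):

' Returns True if cell a should come before cell b in the sorted array:
' first by value, then by column, then by row, so that equal values keep
' the order in which they appear on the sheet.
Function CellLess(a As MyCell, b As MyCell) As Boolean
    If a.Value <> b.Value Then
        CellLess = (a.Value < b.Value)
    ElseIf a.col <> b.col Then
        CellLess = (a.col < b.col)
    Else
        CellLess = (a.row < b.row)
    End If
End Function

In the macro above the same three-way comparison appears twice, once for each scanning direction of the partition loop.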

How big was the performance gain? A sheet with 100,000 cells of random values is processed by this algorithm in 4.2 seconds, which, in my opinion, is an acceptable result.

So, what conclusions can be drawn from all this?
  1. First, the computational complexity of an algorithm matters even for seemingly simple tasks, and you do not need to work with some incredible amount of data to run into it;
  2. Second, do not try to reinvent the wheel. In many cases the best option is to adapt proven classical algorithms to the specific task.



Source: https://habr.com/ru/post/141258/

