Search for common elements in an array

Task: in an array of length N, find an element that repeats more than N / 2 times.

It would seem, what is there to think? Take a Dictionary <element value, number of occurrences>, for one pass through the array, count the occurrences of each element, then select the desired element from the dictionary. Solution for O ( N ), where could it be even faster?

There is one caveat: for the dictionary, we need O ( N ) additional memory - several times larger than the size of the original array, and this is when implementing the dictionary even with a hash table, even a tree. What will we do if our goal is to process the signal with a device with a small memory? Array - measurements of the signal level, of which one is the “real” transmitted level, and the rest are noise and interference. Do you really have to mess around with hash tables and trees to determine the “real” level?

Fortunately, no: there is enough O (1) additional memory, and still one pass through the array. The algorithm of Boyer-Moore — those of Boyer and Moore who invented the much better known algorithm for searching for a substring — is easiest to imagine as follows : N people gathered at the party , and each had one element from the array. When there are two people whose elements are different, they sit down to discuss it. In the end, only people with the same elements will remain standing; Obviously, this is the very element that has been met more than N / 2 times.
')
The implementation of the Boyer-Moore algorithm takes only a few lines:

int* majority(int[] array, int N) { int confidence = 0; //  ,       int* candidate = NULL; //  ,    -- // ,      //          for (int i = 0; i<N; i++) { //      ,      if (confidence == 0) { candidate = array+i; confidence++; } //        , //      ,       else if (*candidate == array[i])) confidence++; else confidence--; } return confidence > 0 ? candidate : NULL; }

In the end, we get the “most likely candidate” for presence in N / 2 copies: if such an element really exists in the array, then it will be found; if there is no such element, then perhaps it will remain just some poor fellow who lacked a pair. For a more stringent implementation of the majority it is required to go through the array a second time and check whether the found element actually occurs the required number of times.

Let's complicate the task. Now, in an array of length N, we need to find elements that repeat more than N / 3 times - if there are two, then both, if there is one, then one.

For example, our device with a small memory needs to accept a binary signal with unknown levels of zero and one, but it is known that about half of the time is zero, about half of the time is one, and any signal levels that are different from them are interferences and bounce from switching.

The idea of the past algorithm is also easy to generalize for triples: let people with different elements sit in three. So, at the end of the party there will be a maximum of people with two different elements. If an element has been met more than N / 3 times, then the person with him is guaranteed to remain standing, after all, the sitting threes will get no more than N / 3. As in the past case, if the required elements exist, then they will be found, but if they are not there, then someone can be found.

The code is a little different from the previous one: there is still one pass through the array and O (1) additional memory.

 void majority(int[] array, int N, int** cand1, int** cand2) { int conf1 = 0, conf2 = 0; //       *cand1 = NULL; *cand2 = NULL; //       //          for (int i = 0; i<N; i++) { //     ,   ?      if (conf1 > 0 && **cand1 == array[i]) conf1++; else if (conf2 > 0 && **cand2 == array[i]) conf2++; else // ,     ,    ? if (conf1 == 0) { *cand1 = array+i; conf1++; } else if (conf2 == 0) { *cand2 = array+i; conf2++; } else { //     ,      conf1--; conf2--; } } if(conf1 == 0) *cand1 = NULL; if(conf2 == 0) *cand2 = NULL; }

This algorithm was published in 1982 by American scientists Jayadev Misra and David Gries (Jayadev Misra & David Gries), and their work uses a more boring metaphor - a bag with N numbers from which they extract 3 different numbers for each operation. In addition, their pseudo-code is not similar to any language understandable to the modern programmer. I liked more the explanation of their algorithm in the last year’s lecture notes by American professor Amit Chakrabarti.

In the most general form, when in an array of length N you need to find elements that repeat more than N / k times, you will have to use a dictionary. Store in the dictionary, we will only those elements with which people are - that is, no more than k -1 entries.

 int[] majority(int[] array, int N, int k) { //      Dictionary<int,uint> candidates = new Dictionary<int,uint>{}; //        k for (int i = 0; i<N; i++) { //     ,   ?      if (candidates.ContainsKey(array[i])) candidates[array[i]]++; else // ,     k-1 ? if (candidates.Count < k - 1) candidates[array[i]] = 1; else //   k-1  ,      foreach(int l in candidates.Keys) if (candidates[l]-- == 0) // (**) candidates.Remove(l); // (*) } return candidates.Keys.ToArray(); }

In this most general form of the algorithm, there is still one pass through the array and O ( k ) additional memory. If we use a hash table to implement the dictionary, and link all entries in the dictionary to the list, then the overall complexity of the algorithm will remain linear: the line (*) will be executed at most N times after deleting the entry from the dictionary, because at each iteration of the main loop the dictionary is added no more than one entry. As an exercise, readers are asked to understand why the line (**) does not violate the linearity of the algorithm.

Thus, our device with a small memory would be able to communicate with one fluffy beast , recently prepared by habraumelets. The signals of this animal have five logical levels: we assume k = 6, and we get all five levels right on the move, even without storing the signal in memory. It is only necessary to provide a protocol so that all five levels occur in the signal equally often.

Other applications are mentioned for the Misra-Gris algorithm. For example, you can monitor real-time traffic on the network, and if a single host consumes a disproportionately large part of the traffic, start an investigation. You can also monitor clicks on banners, financial transactions, the flow of instructions in the simulated processor ... In general, everywhere where a large number of repetitions is a suspicious anomaly.

For animating the text with illustrations, thank Nitatunarabe.

Source: https://habr.com/ru/post/167177/

All Articles

Search for common elements in an array

More articles: