Algorithms for searching palindromes

Today I want to tell you about the algorithms for counting the number of palindromes per line: what it is for, where it is used, how to do it quickly, what pitfalls await us and much more. Consider the various ways to solve this problem, find out the pros and cons of each method. This article will be an overview: if I do not describe something here, I will try to always give you a set of links, where everything is described and painted in detail. I hope that the material will be of interest to both newcomers in the field of algorithms and experienced programmers. Well, if I could interest you, then please under the cat!

For a start, it is worth remembering what a palindrome is.

A palindrome is a number (for example, 404), a letter combination, a word, or text that can be read in the same direction in both directions. Sometimes a palindrome is called any symmetrical set of symbols with respect to its middle. (Excerpt from Wikipedia )
')
The definition is very easy and understandable. We in this post will consider palindromes-words (pop, cook, etc.). But this does not mean that the algorithms are not applicable for other situations. Just slightly narrow the area.

What might you need to look for palindromes?

In fact, it is not very often that one has to look for palindromes in everyday life. Painfully specific is the task. If we talk about string algorithms, then we find the same search for a substring in a string more often than finding the number of palindromes. Therefore, the coverage in the literature is much weaker.

One of the main applications is sports / Olympiad programming. There are enough tasks for this. Tasks are written, and participants already think how to solve it. From personal experience, I would say that I met such tasks only a couple of times, but they were all from the category “here’s your task, you don’t think, but just write an algorithm.” That is, not interesting tasks that develop simply mechanical memory by stuffing algorithms.

A more practical application of finding palindromes is biological algorithms. According to the same Wikipedia and Wiki links below, the palindromic nature of biological compounds plays an important role on the properties of various biological compounds. I am weak in biology, so if you are interested, you can find more detailed information yourself.

Great, we found out that we will be engaged in not entirely useless things. Let's move on to the algorithms!

Note: in all the algorithms below, a single character is not counted as a palindrome. As you understand, it is easy to fix, if single characters also need to be taken into account.

0) The most banal algorithm with asymptotics O (N ^ 3)

bool SlowN3::isPalindrom(int leftBorder, int rightBorder) { while(leftBorder <= rightBorder) { if(str[leftBorder] != str[rightBorder]) { return false; } ++leftBorder; --rightBorder; } return true; } long long SlowN3::operator ()() { for(int leftBorder = 0;leftBorder < str.length(); ++leftBorder) { for(int rightBorder = leftBorder + 1; rightBorder < str.length(); ++rightBorder) { if( isPalindrom(leftBorder, rightBorder) ) { ++cntPalindromes; } } } return cntPalindromes; }

I want to warn you right away that all the sources will be in C ++, and these will be significant parts of the classes that may be of interest to us.

The algorithm is the most “forehead”: there is a string, there is a pointer to the beginning of the substring of the given string, there is a pointer to the end of the given substring. In two nested loops, we iterate over the boundaries of the substrings and tritely check whether our substring is a palindrome. Nothing complicated.

But this is all terribly slow. Why? Yes, because we are square N times (N we will always have the length of the string) iterate over the boundaries of the weaving substring, and then also for O (N) in the worst case we perform a check for each substring for palindrome.

There are some advantages to this method:

+ Easy to implement. Indeed, it is very difficult to leave a bug here.
+ Easy to read. You just looked, and already understood that yes how it works here.
+ Small hidden constant. As it is known, not only an asymptotic estimate (here we work only with the worst cases), but also a hidden constant, affects the running time of the algorithm. This algorithm has a hidden constant that is extremely small.

Minuses:

- Extremely low speed. As you can see, with a row of one thousand letters 'a', this algorithm will do about 10 ^ 9 iterations, which is very bad. And what if the string is 100,000 long?

1) The most banal algorithm with asymptotics O (N ^ 2)

Code:

 void SlowN2::oddCount() { for(int indMiddle = 0; indMiddle < str.length(); ++indMiddle) { int leftBorder = indMiddle - 1, rightBorder = indMiddle + 1; while(leftBorder >= 0 && rightBorder < str.length() && str[leftBorder] == str[rightBorder]) { ++cntPalindromes; --leftBorder; ++rightBorder; } } } void SlowN2::evenCount() { for(int indMiddle = 0; indMiddle < str.length(); ++indMiddle) { int leftBorder = indMiddle, rightBorder = indMiddle + 1; while(leftBorder >= 0 && rightBorder < str.length() && str[leftBorder] == str[rightBorder]) { ++cntPalindromes; --leftBorder; ++rightBorder; } } }

Here is a little more interesting. We have a string, and two temporary arrays for palindromes of even and odd length. And the number in cell i of the array will store the number of palindromes in string s (source line) with center at point i. For the odd length of the palindrome, the center is easy, but for the even one, we will just play around with the indices and shift the center slightly (as seen in the code). Our result is the sum of numbers from two arrays.

As you can see, nothing complicated again. But it works much faster than the previous algorithm. Why? Let's look at how it all works.

We run along the line, sorting through the central symbol of our potential palindrome. And then just run left and right from the central element as long as the characters are the same. As soon as they became different, it means that there will be no further palindromes with a center in our selected element, and proceed to the next.

And it works quickly on ordinary lines very quickly because our second, nested cycle, which runs until the extreme characters are equal, terminates extremely quickly (assuming that the palindromes in the line are much less than the total number of substrings).

Consider the pros and cons of the method.

Pros:

+ Everything is also easy to code. It is very difficult to make a mistake.
+ Small hidden constant
+ Behaves well on random strings

Minuses:

- Still a long time work

After this method, the head begins to think, think, think. And here is the idea! And what if we modify this method, but we will jump from the central element not by 1 symbol with each iteration, but a little more? And here we come to the rescue ...

2) Use hashes!

This method is a bit more complicated, so I’ll immediately give you the code, and then I will try to explain what kind of magic is going on there (although there is no magic, as you yourself know very well):

 inline long long Hash::getHash(int indLeft, int indRight) const { return prefixHash[indRight] - (indLeft > 0 ? prefixHash[indLeft - 1] : 0); } inline long long Hash::getRevHash(int indLeft, int indRight) const { return suffixHash[indLeft] - (indRight < suffixHash.size() - 1 ? suffixHash[indRight + 1] : 0); } void Hash::PrepareSimpleNumbers() { if(str.length() > simpleNumbers.size()) { size_t oldSize = simpleNumbers.size(); simpleNumbers.resize(str.length()); simpleNumbers[0] = 1LL; for(int i = oldSize; i < simpleNumbers.size(); ++i) simpleNumbers[i] = simpleNumbers[i - 1] * SimpleBase; } } void Hash::CountingPrefixHash() { prefixHash[0] = str[0]; for(int i = 1; i < prefixHash.size(); ++i) { prefixHash[i] = prefixHash[i - 1] + str[i] * simpleNumbers[i]; } } void Hash::CountingSuffixHash() { suffixHash[suffixHash.size() - 1] = str[str.length() - 1]; for(int i = (int) str.length() - 2, j = 1; i >= 0; --i, ++j) { suffixHash[i] = suffixHash[i + 1] + str[i] * simpleNumbers[j]; } } bool Hash::isPalindrome(int indLeft, int indRight) const { return getHash(indLeft, indRight) * simpleNumbers[prefixHash.size() - indRight - 1] == getRevHash(indLeft, indRight) * simpleNumbers[indLeft]; } void Hash::CountingOdd() { for (int i = 0; i < oddCount.size(); i++) { int indLeft = 1, indRight = min(i + 1, static_cast<int> (oddCount.size() - i)); while (indLeft <= indRight) { int middle = (indLeft + indRight) / 2; if (isPalindrome(i - middle + 1, i + middle - 1)) { oddCount[i] = middle - 1; indLeft = middle + 1; } else { indRight = middle - 1; } } } } void Hash::CountingEven() { for (int i = 0; i < evenCount.size(); i++) { int indLeft = 1, indRight = min(i, static_cast<int> (evenCount.size() - i)); while (indLeft <= indRight) { int middle = (indLeft + indRight) / 2; if (isPalindrome(i - middle, i + middle - 1)) { evenCount[i] = middle; indLeft = middle + 1; } else { indRight = middle - 1; } } } } long long Hash::operator()() { PrepareSimpleNumbers(); CountingPrefixHash(); CountingSuffixHash(); CountingOdd(); CountingEven(); for(int i = 0; i < str.length(); ++i) { cntPalindromes += oddCount[i] + evenCount[i]; } return cntPalindromes; }

A brief summary of this method: we iterate over the central element of our palindrome, and then dichotomy we try to find the largest radius of our palindrome (radius here means the distance from the central element to the extreme, if we have an even-length palindrome, then we just add a little game with indices to worked). During the selection, we must somehow quickly compare the substrings for identity. do it with hashes. Asymptotics, as it is easy to guess: N is spent on enumeration of the central element, logN in the worst case, we spend on the selection of the palindrome radius, for a unit we compare substrings using hashes. Total we have O (NlogN), which is very good indeed.

Now let's take a closer look at this method.

What is our method based on? Since we iterate through the central element, and then by dichotomy, we select the radius of the largest palindrome with the center in this element, we must quickly take the hash of the left side of our potential palindrome and compare it with the hash of the right side of our potential palindrome.

How to do it? Let's first calculate the hash for each prefix and suffix of the source string. In the code, this is done by the CountingPrefixHash () and CountingSuffixHash () methods, respectively.

Using the getHash () and getRevHash () methods, we can quickly get hashes of the left and right sides of the potential palindrome considered at this stage. There may be a question why it is impossible to use the same function of hash calculation for the left and right parts. And all because when we check for palindrome, we read the left side from left to right, and the second from right to left. And we need to support the hash from left to right and from right to left.

The only difficulty left is how to compare these 2 hashes? And we solve this problem with the help of the isPalindrome () method, which leads the hashes to one base and compares them.

The result of each iteration will be the number of palindromes with center i. We run 2 times for palindromes of even and odd length, the answer to our problem is the sum of all values of the oddCount arrays and evenCount

In more detail about this method you can read in this article .

Pros:

+ Asymptotically we reduced to NlogN, which is good news. If we take only the worst cases, then in theory someday we will win

Minuses:

- Quite hard to write. Easy to sow bug. We need some theoretical training in the field of hashes.
- Slowly working on random tests. You can see for yourself. And all because of the large hidden constant and because even if we do not have a single palindrome with a center in this element, the algorithm will make logN jumps, and this all takes time.
- Collisions. As you can see, my source uses a hash, which is usually written in programming contests (yes, I once did this a bit). So, at competitions this method shows itself well. Personally, my collisions did not happen. But I know about ways to calmly fill up this hash (in particular, the method using Thue-Morse lines). I once heard that there is some kind of fibonacci hash that doesn’t seem to break on this test, but the information is unreliable. So be careful with collisions.

And if we want a 100% solution without any collision correction and entering a hash on another basis, and so on?

3) Algorithm Manaker

Code:

 //Find palindroms like 2*N+1 void Manacker::PalN2() { int leftBorder = 0, rightBorder = -1, tempMirror;//start digits for algortihm for(int i = 0; i < str.length(); ++i) { tempMirror = (i > rightBorder ? 0 : std::min(ansPalN2[leftBorder + rightBorder - i], rightBorder - i)) + 1;//find mirror of current index while(i + tempMirror < str.length() && i - tempMirror >= 0 && str[i - tempMirror] == str[i + tempMirror])//increase our index { ++tempMirror; } ansPalN2[i] = --tempMirror; if(i + tempMirror > rightBorder)//try to increase our right border of palindrom { leftBorder = i - tempMirror; rightBorder = i + tempMirror; } } } void Manacker::Pal2() { int leftBorder = 0, rightBorder = -1, tempMirror; for(int i = 0; i < str.length(); ++i) { tempMirror = (i > rightBorder ? 0 : std::min(ansPal2[leftBorder + rightBorder - i + 1], rightBorder - i + 1)) + 1; while(i + tempMirror - 1 < str.length() && i - tempMirror >= 0 && str[i - tempMirror] == str[i + tempMirror - 1]) { ++tempMirror; } ansPal2[i] = --tempMirror; if(i + tempMirror - 1 > rightBorder) { leftBorder = i - tempMirror; rightBorder = i + tempMirror - 1; } } }

So, we obtained with the help of hashes the algorithm for NlogN. But I want faster. I want something linear. And here we are in a hurry to help the Manaker Algorithm (by reference you can also see the implementation of the algorithm in Java), which is not only linear, it also has an extremely small hidden constant, which has a positive effect on its speed. I will not describe the method in detail, since it will be a retelling of this wonderful link (I myself taught the algorithm for this link). Here I will give a slightly retold explanation.

The algorithm is that we process the string character by symbol, supporting the rightmost palindrome in our string. And at each iteration we look, our current element is inside the borders of the rightmost palindrome or not. If it is, then we can extract the answer from the previously calculated values, by simple manipulations with the indices. If it is not, then we go with exactly the same algorithm described in paragraph 1) of this review: go character by character and compare the mirror elements relative to the center. Go as long as they are equal. And do not forget to update after this the boundaries of the right-most palindrome found. This is in brief. You can read about some of the pitfalls in this article.

What else is interesting: in all the tasks that I solved on the contests (on Olympiad programming), this algorithm was enough. Very easy to write if you wrote it N times at home and you already know by heart.

Pros:

+ Pretty short code.
+ It works very fast. Not only is the asymptotic behavior of O (N), but also a small hidden constant.

Minuses:

- The idea is not so simple to come up with this algorithm right away
- You can get confused in all sorts of indices, if you show carelessness when writing code

Are there any other ways to solve this problem in linear time?

4) Palindrome tree

A little background. This relatively new data structure was discovered by Mikhail Rubinchik rumi13 and presented it at the Petrozavodsk training camp. The structure is extremely interesting, on the one hand, simplicity, and on the other, it allows you to quickly solve a fairly wide range of tasks about palindromes. And most importantly, it makes it quite simple to count the number of palindromes in a row for O (N) (but I think the palindrome tree itself was far from being invented for this purpose, but for more serious tasks with palindromes).

Code:

 PalindromicTree::PalindromicTree(const std::string& s) : FindPalindrome(s) { initTree(); } PalindromicTree::~PalindromicTree() { for(int i = 0;i < pullWorkNodes.size(); ++i) { delete pullWorkNodes[i]; } } void PalindromicTree::initTree() { root1 = new Node; root1->len = -1; root1->sufflink = root1; root2 = new Node; root2->len = 0; root2->sufflink = root1; suff = root2; pullWorkNodes.push_back(root1); pullWorkNodes.push_back(root2); } void PalindromicTree::findAddSuffix(Node* &cur, int pos, int& curlen) { while (true) { curlen = cur->len; if (pos - 1 - curlen >= 0 && str[pos - 1 - curlen] == str[pos]) { break; } cur = cur->sufflink; } } void PalindromicTree::makeSuffixLink(Node* &cur, int pos, int& curlen,int let) { while (true) { cur = cur->sufflink; curlen = cur->len; if (pos - 1 - curlen >= 0 && str[pos - 1 - curlen] == str[pos]) { suff->sufflink = cur->next[let]; break; } } } void PalindromicTree::addLetter(int pos) { Node* cur = suff; int let = str[pos] - 'a', curlen = 0; findAddSuffix(cur, pos, curlen); if (cur->next[let] != nullptr) { suff = cur->next[let]; return; } suff = new Node; pullWorkNodes.push_back(suff); suff->len = cur->len + 2; cur->next[let] = suff; if (suff->len == 1) { suff->sufflink = root2; suff->num = 1; return; } makeSuffixLink(cur, pos, curlen, let); suff->num = 1 + suff->sufflink->num; } long long PalindromicTree::operator ()() { for (int i = 0; i < str.length(); ++i) { addLetter(i); cntPalindromes += suff->num - 1; } return cntPalindromes; }

Again, I will retell in brief the essence of this algorithm. A more detailed explanation can be found in this wonderful article .

So, the idea of a tree. In the tops of our tree there will be only the palindromes themselves: 'a', 'b', 'aba' and so on. It is clear that we will not store palindromes ourselves, but simply keep from which vertex we came here, and which symbol we added from both sides. We also have two dummy vertices for convenient implementation of the algorithm.

But as in any interesting tree, we also have suffix links. The suffix link will lead to the summit, which is also a palindrome (well, because we only have palindromes in the vertices) and which is the largest proper suffix of this vertex. That is, from the vertex 'aba' the link will lead to the vertex 'a'.

Next, we in turn add characters to the tree one by one. And thanks to the tricky arrangement of the tree and the recursive operation of adding (as well as the suffix links under which the transition is made), we update the entire tree.

In short, read more information on the link above (if you are not afraid of English)

Pros:

+ If you have previously worked with trees, then it will be very easy for you to understand this idea.
+ Allows you to solve a wide range of tasks for palindromes

Minuses:

- It works slower than the Manaker algorithm.
- You can put the bug. But, purely subjective, it is more difficult to do here than in the same Manaker algorithm.

It is also worth mentioning that using trees there is another solution to this problem. It consists in using the suffix tree and the fast LCA algorithm (which works for preprocessing O (N) and answering the query O (1). Farah-Colton-Bender algorithm). But it is not applied in practice, as it is rather complicated and it has an extremely large hidden constant. It is more of academic interest.

What could be more interesting about the algorithms? That's right, memory consumption and running time.
On the github you can also download a test code that generates random strings and looks for palindromes in them.

My testing showed that the expected algorithm number 0 is extremely slow. The leader, as you can guess, is the Manaker algorithm. But what is most interesting: the algorithm with O (N ^ 2) wins with an approximately two-fold separation from the algorithm using hashes with O (NlogN), which once again proves that algorithms are not measured by a single asymptotics. This is due to the peculiarities of the clipping algorithm number 1, and the lack thereof in the method with hashes. As for the tree of palindromes, it loses to Manucker mainly due to memory operations, as it is necessary to allocate memory for each new vertex. But if you use, for example, new with placement, the gap will be reduced.

Memory all algorithms consume linearly from the size of the input data.

This concludes our review. I hope that you have learned something new for yourself and it was just a little interesting for you! You can find all the sources in my public repository on Github .

PS: If you notice any typos, inaccuracies or you have additions - please unsubscribe in the comments and write in the LAN.
PPS: Feel free to ask questions in the comments - I will try to answer if my competence is enough.

Useful links that may be of interest to you after reading this article (some links may be repeated, as they might be skipped in the article itself):

0) What is a palindrome
1) The Manaker Algorithm: Wiki , Codeforces , e-maxx
2) A bit about hashes and their use: e-maxx , Habrahabr
3) Discussion about the obstruction of hashes on Codeforces: tyts
4) Thue-Morse lines (words): one , two
5) Articles about palindrome tree: good description , codeforces
6) Here is another series of articles about the search for numbers-palindromes: Habr

Source: https://habr.com/ru/post/276195/

All Articles