Continuation of the article
Non-visual methods to protect the site from spam
Part 2. The true face of characters
Non-visual methods of protecting a site from spam rely, among other things, on analyzing the submitted text. Spammers use many tricks to complicate that analysis. This article shows examples of one of them: character substitution. The examples are taken from real data of the company CleanTalk.
Character substitution is very simple, yet it can make stop-word filters miss their targets and make Bayesian and language-specific filters work worse. It therefore makes sense to restore the characters to their true face before applying these filters.
I should say up front that blunt replacement of characters, for example replacing national letters that resemble the Latin a with the Latin a itself, is completely unacceptable without analyzing the language and the context. Likewise, letters that look like a zero may be replaced with a zero only if you know exactly what you are looking for in the text (for example, phone numbers).
However, replacing characters is valid when the meaning of the written text is preserved after the replacement. It is also necessary to reduce a certain set of service characters to a single one.
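The caveat about zero look-alikes can be illustrated with a minimal Python sketch. It replaces the Latin letters O and o with 0 only inside runs that already look like a phone number. The pattern, the threshold of five real digits, and the name `fix_zero_lookalikes` are assumptions made for this illustration, not the rules used in production.

```python
import re

# Hypothetical pattern: at least 7 digit-or-O characters, optionally joined by
# single separators, and not glued to a neighbouring Latin letter.
PHONE_LIKE = re.compile(r"(?<![A-Za-z])[0-9Oo](?:[-.()]?[0-9Oo]){6,}(?![A-Za-z])")

# Latin O and o are treated as zero look-alikes.
ZERO_TABLE = str.maketrans("Oo", "00")


def fix_zero_lookalikes(text, min_digits=5):
    """Replace O/o with 0 only inside chunks that look like phone numbers."""
    def repl(match):
        chunk = match.group(0)
        # Require enough real digits, so ordinary words are left alone.
        if sum(ch.isdigit() for ch in chunk) < min_digits:
            return chunk
        return chunk.translate(ZERO_TABLE)

    return PHONE_LIKE.sub(repl, text)


print(fix_zero_lookalikes("call 8-8OO-555-12-12 now"))
```

Ordinary words such as "too" or "good" never match the pattern, so their letters stay untouched.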
Of the substitutions we have encountered, I will show the two that are, in my opinion, the most interesting.
1. Replacing ordinary characters
Spammers do everything they can to make the text conspicuous, even at a quick glance. Fortunately for them, Unicode provides sets of Latin characters with decorated outlines. Fortunately for us, this is easy to undo.
Below are the most common ways in which Latin characters are replaced with the same Latin letters taken from outside the Basic Latin range.
Kind of characters | Beginning of range | Example |
---|---|---|
fullwidth (extended) | U+FF01 | ＶｉａＧｒａ |
within the Basic Multilingual Plane | U+2460 | ⑧-⑧⑪⑪-①②③-④⑤-⑥⑦ |
within the Supplementary Multilingual Plane | U+1F130 | 🄲🄰🄻🄻 |
within the Supplementary Multilingual Plane | U+1F150 | 🅝🅞🅦 |
within the Supplementary Multilingual Plane | U+1F170 | 🅵🅾🆁 |
within the Supplementary Multilingual Plane | U+1F1E6 | 🇫🇷🇪🇪 |
Replacing such Latin characters with regular ones is done with a simple regular expression. After the replacement, subsequent filters work better and faster, since the range of input values is greatly narrowed.
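As an illustration, the ranges from the table can be folded back to plain ASCII with a translation table. This is a minimal Python sketch rather than the filter's actual code. The names `build_latin_table` and `normalize_latin` are hypothetical. The sketch assumes that fullwidth forms mirror ASCII at a fixed offset, that the circled numbers starting at U+2460 stand for 1 through 20, and that each of the four supplementary ranges is a run of 26 capital letters.

```python
def build_latin_table():
    """Build a code point -> ASCII mapping for the decorated ranges."""
    table = {}
    # Fullwidth forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E.
    for cp in range(0xFF01, 0xFF5F):
        table[cp] = chr(cp - 0xFEE0)
    # Circled numbers U+2460..U+2473 stand for 1..20.
    for i in range(20):
        table[0x2460 + i] = str(i + 1)
    # Squared, negative circled, negative squared and regional indicator
    # letters: four runs of 26 capital letters A..Z.
    for start in (0x1F130, 0x1F150, 0x1F170, 0x1F1E6):
        for i in range(26):
            table[start + i] = chr(ord("A") + i)
    return table


LATIN_TABLE = build_latin_table()


def normalize_latin(text):
    """Return text with decorated Latin letters and digits folded to ASCII."""
    return text.translate(LATIN_TABLE)


print(normalize_latin("\U0001F132\U0001F130\U0001F13B\U0001F13B"))  # squared letters
```

An explicit table is used instead of Unicode compatibility normalization because the negative circled, negative squared and regional indicator letters have no compatibility decomposition.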
2. Replacing the dot
The dot is used far more widely than as a punctuation mark: it also serves as a field separator, a decimal separator, a separator between groups of digits in the phone numbers spammers post, and so on.
We therefore need to reduce the variety of dots used in spam to a single character.
The dot substitutes we have encountered most often are listed below.
Substitute code | Substitute appearance |
---|---|
U+3002 | 。 |
U+0701 | ܁ |
U+0702 | ܂ |
U+2024 | ․ |
U+FE12 | ︒ |
U+FE52 | ﹒ |
U+FF61 | ｡ |
Dot replacement can be done with a single transliteration (Perl's tr operator):
# Map every dot look-alike to the ASCII full stop U+002E.
# The replacement list is shorter than the search list, so tr repeats its last character.
tr/\N{U+3002}\N{U+0701}\N{U+0702}\N{U+2024}\N{U+FE12}\N{U+FE52}\N{U+FF61}/\N{U+002E}/;
In our experience, subsequent filters work noticeably better once the dots have been replaced.
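For readers who do not use Perl, the same mapping can be sketched in Python with `str.maketrans`. The names `DOT_LOOKALIKES` and `normalize_dots` are hypothetical, and the list covers only the seven code points from the table above.

```python
# The seven dot look-alikes listed in the table.
DOT_LOOKALIKES = "\u3002\u0701\u0702\u2024\uFE12\uFE52\uFF61"

# Map each look-alike to the ASCII full stop U+002E.
DOT_TABLE = str.maketrans(DOT_LOOKALIKES, "." * len(DOT_LOOKALIKES))


def normalize_dots(text):
    """Return text with every listed dot look-alike replaced by '.'."""
    return text.translate(DOT_TABLE)


print(normalize_dots("example\u3002com"))
```

Since each look-alike maps to exactly one character, the length of the text does not change, so offsets computed by earlier processing steps remain valid.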
3. Conclusion
I have shown two kinds of character substitution. Reversing them is simple, cheap in resources, and greatly improves the accuracy of filters that analyze words and expressions.