Non-visual methods to protect the site from spam. Part 3. Repeats

Continuation of the article Non-visual methods to protect the site from spam

Part 3. Duplicate substrings

As already mentioned, non-visual methods of protecting the site from spam use text analysis. One of the most frequently encountered spam signals is the presence of duplicate lines. As always, the examples given are taken from the real data of the company CleanTalk .

The search for such repetitions should be minimally resource-intensive. It is better if it will be called after the tests from parts 1 and 2 of the article, which will eliminate obvious spam and bring the text into a form suitable for analysis. Here I will give some statistics, as well as sample code.

')

1. Sample code

The function we use to determine the longest repeating substrings is made using the naive algorithm described here http://algolist.manual.ru/search/lrs/naive.php

An example of the output is shown below.

s a l e f o r s a l e f o r s a l e 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 s 0 + . . . . . . . . + . . . . . . . . + . . . a 1 . + . . . . . . . . + . . . . . . . . + . . l 2 . . + . . . . . . . . + . . . . . . . . + . e 3 . . . + . . . . . . . . + . . . . . . . . + 4 . . . . + . . . + . . . . + . . . + . . . . f 5 . . . . . + . . . . . . . . + . . . . . . . o 6 . . . . . . + . . . . . . . . + . . . . . . r 7 . . . . . . . + . . . . . . . . + . . . . . 8 . . . . . . . . + . . . . + . . . + . . . . s 9 . . . . . . . . . + . . . . . . . . + . . . a 10 . . . . . . . . . . + . . . . . . . . + . . l 11 . . . . . . . . . . . + . . . . . . . . + . e 12 . . . . . . . . . . . . + . . . . . . . . + 13 . . . . . . . . . . . . . + . . . + . . . . f 14 . . . . . . . . . . . . . . + . . . . . . . o 15 . . . . . . . . . . . . . . . + . . . . . . r 16 . . . . . . . . . . . . . . . . + . . . . . 17 . . . . . . . . . . . . . . . . . + . . . . s 18 . . . . . . . . . . . . . . . . . . + . . . a 19 . . . . . . . . . . . . . . . . . . . + . . l 20 . . . . . . . . . . . . . . . . . . . . + . e 21 . . . . . . . . . . . . . . . . . . . . . + $VAR1 = { 'sale' => 3, 'for sale' => 2 };

The pluses on the diagonals are marked with pluses. It can be seen that the number of repetitions is equal to the number of segments of the diagonals of adjacent pluses parallel to the main one and having the same vertical displacement.

In the implementation of the algorithm, the actual search for duplicate characters takes a bit. More time is done exactly the analysis of the diagonals of the matrix with the allocation of substrings. For this, a threshold of minimum repetition length was introduced. Also had to take into account repetitions, starting with spaces. It was also necessary to ensure that repetitions do not overlap. Finding repetitions is done from short to long.

And here is the Perl function itself with minimal changes. For convenience, the full text that displays the matrix above.

 #!/usr/bin/perl -w use strict; use utf8; use Data::Dumper; binmode(STDOUT, ':utf8'); my $min_longest_repeat_length = 4; my $message = 'sale for sale for sale'; my %longest_repeates = (); get_longest_repeates(\$message, \%longest_repeates); print Dumper(\%longest_repeates); sub get_longest_repeates { my $test_ref = shift; #      my $reps_ref = shift; #     my @symbols = split //, $$test_ref; my $m_len = scalar @symbols; my @matrix = (); #     #       for (my $i = 0; $i < $m_len; $i++) { #  $matrix[$i] = []; for (my $j = $i; $j < $m_len; $j++) { #       $matrix[$i][$j] = 1 if $symbols[$i] eq $symbols[$j]; } } #           my %repeats_tmp = (); #   my ($i, $j); #    , ..      for ($i = $m_len - 1; $i > 0; $i--) { my $repeat = ''; my $repeat_pos = undef; my $repeat_temp; for ($j = $i; $j < $m_len; $j++) { if (defined($matrix[$j-$i][$j]) && $matrix[$j-$i][$j] == 1) { $repeat_temp = $repeat; $repeat_temp =~ s/^ //; #          if (defined($repeats_tmp{$repeat_temp})) { $repeat_pos = $j - length($repeat_temp); $repeats_tmp{$repeat_temp}{$repeat_pos} = 1; $repeat = $symbols[$j]; } else { $repeat .= $symbols[$j]; } } else { if ($repeat ne '') { $repeat =~ s/^ //; $repeat_pos = $j - length($repeat); if (length($repeat) >= $min_longest_repeat_length) { if (defined($repeats_tmp{$repeat})) { $repeats_tmp{$repeat}{$repeat_pos} = 1; } else { $repeats_tmp{$repeat} = {$repeat_pos => 1}; } } $repeat = ''; } } } if ($repeat ne '') { $repeat =~ s/^ //; $repeat_pos = $j - length($repeat); if (length($repeat) >= $min_longest_repeat_length) { if (defined($repeats_tmp{$repeat})) { $repeats_tmp{$repeat}{$repeat_pos} = 1; } else { $repeats_tmp{$repeat} = {$repeat_pos => 1}; } } $repeat = ''; } } foreach (keys %repeats_tmp){ $$reps_ref{$_} = 1 + scalar keys %{$repeats_tmp{$_}}; } #     print "\n"; print ' '; for (my $i = 0; $i < $m_len; $i++) { print ' ' . $symbols[$i]; } print "\n"; print ' '; for (my $i = 0; $i < $m_len; $i++) { printf '%3d', $i; } print "\n"; print "\n"; for (my $i = 0; $i < $m_len; $i++) { print $symbols[$i]; printf '%3d ', $i; for (my $j = 0; $j < $m_len; $j++) { my $value = '.'; $value = '+' if (defined $matrix[$i][$j] && $matrix[$i][$j] == 1); printf(' %1s', $value); } print "\n"; } print "\n"; }

2. Repeat statistics

We have chosen the threshold of the minimum repetition length (I do not specifically cite it), which gave maximum efficiency in tests. His results in the number of repetitions are as follows:

Number of repetitions	In spam,%	Not spam,%
2	78.58	90.28
3	11.93	4.86
four	4.45	2.08
five	2.30	1.39
6	1.93	0
7	0.22	0
eight	0.37	0
9	0.07	0

3. Conclusion

I showed the implementation of a naive algorithm for finding duplicate substrings in a text. For the analysis, you can use both the number of repetitions and the repetitions themselves (for example, by stop words). I repeat that complex tests are most effective in the fight against spam.

Source: https://habr.com/ru/post/301302/

All Articles

Non-visual methods to protect the site from spam. Part 3. Repeats

Part 3. Duplicate substrings

1. Sample code

2. Repeat statistics

3. Conclusion

More articles: