
“Digitizing” the captcha of the unified register of sites that protects people from information

The portal of the Unified State Register of sites opened recently. Among other things, its very weak captcha caught my eye, and I decided to beat it.

I have dealt with such things before, though not on this scale. If you are interested in how to reach a 57% recognition rate using only GNU Bash, ImageMagick and Tesseract OCR, read on.

The instructions below are easy to adapt to any other similarly weak captcha.

Introduction


Recognizing the captcha images themselves breaks down into exactly two stages: preparing the picture and recognizing the characters.

Fetching the pictures and submitting the results is easily done with curl (or wget, but I never got along with it).

Picture preparation


ImageMagick has a built-in language, Fx, for creating special effects. Since the letters in the registry's captcha are always black, the simplest script that removes everything non-black looks like this:

convert file.png -colorspace cmyk -fx 'k * (k >= 1.0)' file.ppm 


It runs in a tenth of a second, which is acceptable. The picture quality is not, though, so a second pass is needed. We can rely on the fact that wherever a line crossed a letter and was removed along with the rest of the non-black pixels, the resulting gap should have filled-in neighbouring pixels on two opposite sides at once:

 convert file.ppm -colorspace cmyk -fx 'k || (p[-1].k && p[+1].k) || (p[0,-1].k && p[0,+1].k)' out.ppm 


This costs us another half a second. As it turns out later, it raises the success rate from 1% to 22% (on 100 control pictures).
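
To make the two -fx passes less abstract, here is a rough sketch of the same per-pixel logic on a tiny text grid instead of a real image. It is not how ImageMagick works internally: the function names keep_black and fill_gaps, the grid format (rows of space-separated blackness values from 0 to 1) and the treatment of out-of-range neighbours as empty are all assumptions made for this illustration.

```shell
# Toy model of the two passes; input is rows of space-separated
# "blackness" values in 0..1, standing in for the k channel.

# Pass 1: keep a pixel only if it is fully black (k >= 1), else clear it.
keep_black() {
    awk '{ for (i = 1; i <= NF; i++) $i = ($i >= 1) ? 1 : 0; print }'
}

# Pass 2: a pixel is set if it was already set, or if both horizontal
# neighbours are set, or if both vertical neighbours are set.
# Neighbours outside the grid count as empty (an assumption of this sketch).
fill_gaps() {
    awk '
    { rows = NR; cols = NF; for (i = 1; i <= NF; i++) px[NR, i] = $i }
    END {
        for (r = 1; r <= rows; r++) {
            line = ""
            for (c = 1; c <= cols; c++) {
                v = px[r, c] || (px[r, c-1] && px[r, c+1]) || (px[r-1, c] && px[r+1, c])
                line = line (c > 1 ? " " : "") (v ? 1 : 0)
            }
            print line
        }
    }'
}

# A gray "line" pixel (0.4) cuts through a black stroke in the first row.
printf '1 0.4 1\n0.2 1 0\n' | keep_black | fill_gaps
```

On this input the sketch prints "1 1 1" and "0 1 0": the middle pixel of the first row is restored because both horizontal neighbours are black, while the isolated gray pixel in the second row stays cleared. Both passes read from the unmodified input grid, which matches the way the second -fx expression looks up neighbours in the source image.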

Character Recognition


This part is even easier. We point tesseract at the picture and read the result from a file. If you like, you can train it on the captcha font as a separate language, but that is not worth the effort.

  tesseract out.ppm result -psm 8 nobatch digits 2>/dev/null
  res=$(cat result.txt | sed -e 's/[^0-9]//g')
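
The clean-up step after tesseract can be tried on its own, without any OCR involved. The sketch below does the same job as the sed expression; the helper name digits_only is hypothetical, and tr is swapped in for cat and sed purely for illustration.

```shell
# digits_only: hypothetical helper, not part of the original scripts.
# Deletes every character that is not a decimal digit, then adds a newline.
digits_only() {
    printf '%s' "$1" | tr -cd '0-9'
    echo
}

# OCR output often carries stray spaces, letters or punctuation.
digits_only ' 12a3 4:5 '
```

This prints 12345. As far as I know, Tesseract 4 and later spell the page segmentation option as --psm rather than -psm, so the command above may need that change on a current installation.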

Testing in real conditions


As I said above, this combination gave the correct answer for 22 out of 100 test images. But tests are boring, so I decided to check how fast it works in real conditions.

Main script
 #!/bin/bash
 c=$(curl -c cook.txt http://zapret-info.gov.ru/ | iconv -f cp1251 | grep capcha | sed -e 's/"/\n/g' | grep services)
 url="http://zapret-info.gov.ru$c"
 r=work01

 #get
 curl -b cook.txt "$url" > "$r.png"

 #prepare
 convert $r.png -colorspace cmyk -fx 'k * (k >= 1.0)' $r.png
 convert $r.png -colorspace cmyk -fx 'k || (p[-1].k && p[+1].k) || (p[0,-1].k && p[0,+1].k)' $r.png

 #exterminate
 tesseract $r.png $r -psm 8 nobatch digits
 res=`cat $r.txt | sed -e 's/[^0-9]//g'`

 #check
 if [ "$(echo $res | wc -c)" -ne 9 ]; then
     echo fail && exit
 fi

 #send
 code=$(echo $c | sed -e 's/[^0-9]//g')
 fin=`curl -b cook.txt -d "act=search&secretcodeId=$code&searchstring=ya.ru&secretcodestatus=$res" 'http://zapret-info.gov.ru/'| iconv -f cp1251`
 if echo "$fin" | grep -q ' '; then
     echo succ
 else
     echo tesfail
 fi
 rm $r.png $r.txt cook.txt
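
The check step in the main script compares the output of wc -c with 9 because echo appends a newline, so the test really asks for exactly 8 digits. A possible way to state that more directly is sketched below; the helper name is_full_code is hypothetical, and the length of 8 is inferred from that check rather than taken from any documentation of the site.

```shell
# is_full_code: hypothetical helper. Succeeds only for exactly 8 decimal
# digits, the length implied by the "wc -c ... -ne 9" check above.
is_full_code() {
    case $1 in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

for candidate in 12345678 1234567 '1234 5678'; do
    if is_full_code "$candidate"; then
        echo "$candidate: send"
    else
        echo "$candidate: fail"
    fi
done
```

Rejecting short results locally means an obviously wrong guess never reaches the server, which is what the "fail" outcome in the main script stands for.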
Wrapper
 #!/bin/bash
 score=0
 all=0
 while [ "$score" -lt 41 ]; do
     r=`bash per.sh 2>/dev/null`
     [ "$r" = 'succ' ] && let score++
     let all++
     echo "$score/$all; $r"
 done

Launch string
 time bash tt.sh | tee -a log.txt 


Results, statistics


The scripts above run until they have made 41 successful checks for the presence of ya.ru in the registry. Some statistics on the run:

  Successful attempts: 41
  Total attempts: 218
  Time: 5 minutes 4 seconds (5m4.178s)
  Success rate: 19%
  , : 57%
  : 33%
  : 12%
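
Figures like these can be recomputed from the log that the launch string writes. The sketch below tallies the three outcomes the main script can print (succ, tesfail, fail) from lines of the form "score/all; result", which is what the wrapper echoes. The function name summarize_log is hypothetical, and the sketch assumes the log holds a single run, whereas tee -a appends to whatever is already in the file.

```shell
# summarize_log: hypothetical helper. Reads wrapper output on stdin and
# prints how often each outcome occurred, with a rounded percentage.
summarize_log() {
    awk -F'; ' '
    NF == 2 { count[$2]++; total++ }
    END {
        if (total == 0) exit 1
        n = split("succ tesfail fail", order, " ")
        for (i = 1; i <= n; i++) {
            k = order[i]
            printf "%s: %d of %d (%.0f%%)\n", k, count[k], total, 100 * count[k] / total
        }
    }'
}

# Four invented log lines, just to show the output format.
printf '0/1; fail\n1/2; succ\n1/3; tesfail\n1/4; fail\n' | summarize_log
```

Against the real file this would be run as summarize_log < log.txt. If 218 is the total number of attempts, then 41 successes out of 218 comes to about 19%, which agrees with the figure reported above.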


Ways to improve

Source: https://habr.com/ru/post/157145/

