
Breaking an audio CAPTCHA, using Digg.com as an example

Introduction


Many owners of news sites need to post links to their articles on popular services such as Digg.com (one of the largest news aggregators). The problem is that you either have to go to the site and add each link by hand, or shift that work onto your visitors. Naturally, I wanted to automate the process.

Digg.com provides a fairly powerful API that lets you do many things: comment, vote for stories, search, and so on. But it does not allow the main thing: publishing your own news. In principle, nothing stops us from writing a script that submits news from our own source automatically. The only obstacle is the CAPTCHA, which is what we will deal with here.


Captcha structure


Let us stop here for a moment. Methods for recognizing characters in images with Python, OCR and neural networks have already been discussed on Habr. The topic is covered most fully in the article by the distinguished Indalo. But that method does not give a 100% recognition rate and is relatively difficult to implement. Knowing that there is always an easier way to solve a problem, I happened to notice an interesting phrase: “Can't read the text? Listen it.”
Having listened, I noticed that all the letters are read by a single speaker, always the same one, without interference or extraneous sounds. Indeed, the voice-over is there to help people who cannot make out all the letters to enter the correct characters. If this route is easier for a human, then it should be easier for a bot as well.

When the form is filled out, the site serves an image like this (note: cookies are required!):
http://digg.com/captcha/2c7ea3845d5ddfc5a7461c5429b6a7e5.jpg



The sound file looks like this (note: cookies are required!):
http://digg.com/captcha/2c7ea3845d5ddfc5a7461c5429b6a7e5.mp3
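
Judging by the two addresses above, the audio URL differs from the image URL only in its extension. A minimal sketch of that substitution follows; the function name `audio_url_for` is hypothetical, and the code only rewrites the address, so fetching the file with the session cookies is left to the caller.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def audio_url_for(image_url):
    """Return the audio captcha URL that corresponds to an image captcha URL."""
    parts = urlsplit(image_url)
    root, ext = posixpath.splitext(parts.path)
    if ext.lower() != ".jpg":
        raise ValueError("expected a .jpg captcha URL, got %r" % image_url)
    # Swap only the extension; scheme, host and the captcha id stay untouched.
    return urlunsplit(parts._replace(path=root + ".mp3"))


print(audio_url_for("http://digg.com/captcha/2c7ea3845d5ddfc5a7461c5429b6a7e5.jpg"))
# prints http://digg.com/captcha/2c7ea3845d5ddfc5a7461c5429b6a7e5.mp3
```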



Experiments showed that the fragment for each letter is about 2000 bytes long. There is noise in the background, but it is not randomly generated, so the same letter is absolutely identical across different captchas. Our mp3 files can therefore be treated as plain byte arrays in which we search for such fragments.
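
To illustrate the idea of treating the files as byte arrays, here is a small sketch that finds the longest run of bytes shared by two byte strings. It uses `difflib.SequenceMatcher` from the standard library, which is my choice for the illustration and not necessarily what the original scripts do; the function name and the toy data are hypothetical. On real files of tens of kilobytes this approach can be slow, so treat it as a demonstration only.

```python
from difflib import SequenceMatcher


def longest_shared_fragment(data1, data2):
    """Return the longest run of bytes that occurs in both byte strings."""
    # autojunk=False turns off difflib's popularity heuristic, which would
    # otherwise ignore byte values that occur very often in long inputs.
    matcher = SequenceMatcher(None, data1, data2, autojunk=False)
    match = matcher.find_longest_match(0, len(data1), 0, len(data2))
    return data1[match.a:match.a + match.size]


# Toy stand-ins for two captcha files that share one "letter" fragment.
letter = b"\x10\x20\x30\x40\x50"
first = b"\x01\x02\x03" + letter + b"\x04\x05"
second = b"\x09\x08" + letter + b"\x07\x06\x0a"
assert longest_shared_fragment(first, second) == letter
```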


Character Recognition


The recognition process is as follows. I used Python for this work, but nothing prevents you from porting the project to other languages.
  1. Manually build a base of solved captchas (about 100 of them).
  2. For each character, pick a pair of sound captchas in which it occurs exactly once and all the other characters are unique, i.e. are not repeated in the other captcha. For example, for the digit 2 take AS2DE and 2ZTKJ.
  3. In the selected captchas, use an ordinary search to find the longest sequence that matches in both. The output is about 2000 bytes.
  4. Check that we have not picked up a fragment of a 'pause'.
  5. Add the result to the database.
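
Step 2 can be sketched as follows. This is only an illustration under my reading of the rule: each captcha in a pair contains the character exactly once, and the two captchas have no other character in common. The function name `find_pairs` and the sample list of solved captchas are hypothetical.

```python
from itertools import combinations


def find_pairs(char, captchas):
    """Yield pairs of captcha texts whose only shared character is `char`."""
    # Keep only captchas in which the character occurs exactly once.
    candidates = [text for text in captchas if text.count(char) == 1]
    for first, second in combinations(candidates, 2):
        # The two texts must have nothing in common except `char`.
        if set(first) & set(second) == {char}:
            yield first, second


solved = ["AS2DE", "2ZTKJ", "A2ZQW", "KJ22P"]
print(list(find_pairs("2", solved)))
# prints [('AS2DE', '2ZTKJ')]
```

Here "A2ZQW" is rejected because it shares A or Z with the other candidates, and "KJ22P" because it contains the digit twice.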

An example of a simple search across two captchas:

    def compare(letter, filename1, filename2):
        """Save a k-byte fragment shared by two captcha files as sources/<letter>."""
        # Read both files as raw bytes; text mode would mangle binary mp3 data.
        with open(filename1 + '.mp3', 'rb') as f:
            test1 = f.read()
        with open(filename2 + '.mp3', 'rb') as f2:
            test2 = f2.read()
        k = 3000  # length of the window to look for, in bytes
        # Slide a k-byte window over the first file and search for it in the second.
        for i in range(len(test1) - k + 1):
            cnt = test2.find(test1[i:i + k])
            if cnt >= 0:  # find() returns -1 on a miss; offset 0 is a valid hit
                with open('sources/' + letter, 'wb') as f3:
                    f3.write(test2[cnt:cnt + k])
                return

That's all; the recognition rate is 100%. Now, when the robot submits our news to digg.com, it finds the address of the CAPTCHA image on the page, replaces the extension with mp3, requests the audio using the cookies, finds the required 6 characters by comparing against its own database, and sends the result. All news from your site is published on digg.com within seconds.
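
The lookup against the database could look roughly like the sketch below. It assumes the database has been loaded into a dictionary that maps each character to its byte fragment; the name `recognize`, the dictionary layout, and the toy data are all hypothetical and are not taken from the original scripts.

```python
def recognize(audio, fragments):
    """Return the captcha text spelled out by the known fragments in `audio`."""
    hits = []
    for char, fragment in fragments.items():
        # Record every offset at which this character's fragment occurs.
        start = audio.find(fragment)
        while start >= 0:
            hits.append((start, char))
            start = audio.find(fragment, start + len(fragment))
    # Characters are spoken in order, so sorting by offset restores the text.
    return "".join(char for _, char in sorted(hits))


# Toy database and a toy "recording" with pauses between the characters.
fragments = {"A": b"\xaa\xab\xac", "2": b"\x22\x23\x24", "Z": b"\x5a\x5b\x5c"}
pause = b"\x00\x00"
audio = pause + fragments["A"] + pause + fragments["2"] + pause + fragments["A"] + pause
print(recognize(audio, fragments))
# prints A2A
```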


If your site uses audio captchas, I recommend either dropping them or hardening them, keeping the following in mind:

To break a more secure audio CAPTCHA, simply comparing pieces of mp3 files may not work. In that case it is advisable to run the audio through filters that remove noise. After that you can, for example, use neural networks for sequence analysis. The result will of course be below 100%, but it will remain at a usable level. You can also try speech recognition services. The best I have come across is Google Voice: you just send a voice mail containing the mp3 and get a transcription back after a while (it would be interesting to see the results).


Conclusions


Many sites on the Internet get so carried away with hardening their protection against bots that they end up alienating real users. Then, trying to re-establish contact with those users, they create weak spots of their own, which someone will certainly exploit. Among the very large websites affected by this vulnerability I can name GoDaddy.com, which uses exactly the same audio CAPTCHA in its whois service for checking domains.

All scripts are written in Python and are available here.

Upd: Moved to the Information Security blog.

Source: https://habr.com/ru/post/101208/

