
Inspired by the article “MintEye CAPTCHA Solution in 23 Lines of Code”, and wanting to dig deeper into edge-detection methods such as the Sobel operator and the Canny operator, I decided to reproduce the algorithm described in that article myself.
Having quickly sketched a script that downloads a set of “experimental” images from the MintEye website, I was about to open my favorite IDE and start experimenting with “high technologies”, but when I looked into the directory with the downloaded pictures, I noticed a very interesting pattern.
All images (in JPEG format, which is very important) belonging to one captcha had the same size in bytes! “But this is impossible!” I thought. To explain why it is impossible, I’ll briefly recall the main ideas behind the JPEG compression algorithm. For a more detailed description, see Wikipedia.
The JPEG algorithm compresses the original image in the following sequence:
- Color model conversion: the source image is converted from the RGB color space to the YCbCr color space. After this step, the image is split into a luminance channel (Y) and two “color difference” channels, Cb and Cr;
- The Cb and Cr color channels are downsampled (decimated), as a rule simply by halving their dimensions. Because the human eye is more sensitive to changes in brightness than in color, this step alone already gives a tangible reduction in image size with very little quality loss;
- Each channel is divided into so-called “coding blocks”. In the most common case, a “coding block” is an 8x8 pixel block of the image, 64 values in total. Later, just before the final encoding stage, its values are read out as a linear array along a tricky zigzag trajectory that resembles a “snake”;
- A discrete cosine transform is applied to each coding block. I will not go into details, but after this transform the “coding block” turns, roughly speaking, into a set of coefficients closely tied to the amount of fine detail and smooth color transitions in the original image;
- Quantization is performed. This is where the “compression ratio” or “desired image quality” setting comes into play. Again very roughly, the idea is that every coefficient is divided by a step determined by the desired degree of compression and rounded, so all coefficients below a certain threshold are bluntly reset to zero. The remaining coefficients still allow the original image to be restored with some accuracy. It is this stage that produces the compression artifacts so well known to everyone;
- And finally, whatever is left of our poor original image is squeezed further by a lossless compression algorithm. For JPEG, this is almost always Huffman coding.
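To make the transform and quantization steps a bit more tangible, here is a minimal sketch in Java. It is not taken from any real encoder: the class and method names are hypothetical, and a single uniform quantization step is assumed, whereas real JPEG uses a table of 64 separate steps. It only shows that a block without detail keeps far fewer non-zero coefficients than a block full of sharp transitions.

```java
final class JpegBlockSketch {
    static final int N = 8;

    /** 2D DCT-II of one 8x8 block of samples in 0..255 (level-shifted by 128 first). */
    static double[][] dct(int[][] block) {
        double[][] out = new double[N][N];
        for (int u = 0; u < N; u++) {
            for (int v = 0; v < N; v++) {
                double sum = 0;
                for (int x = 0; x < N; x++) {
                    for (int y = 0; y < N; y++) {
                        sum += (block[x][y] - 128)
                                * Math.cos((2 * x + 1) * u * Math.PI / 16)
                                * Math.cos((2 * y + 1) * v * Math.PI / 16);
                    }
                }
                double cu = (u == 0) ? Math.sqrt(0.5) : 1.0;
                double cv = (v == 0) ? Math.sqrt(0.5) : 1.0;
                out[u][v] = 0.25 * cu * cv * sum;
            }
        }
        return out;
    }

    /** Assumption: one uniform step for all coefficients; small ones round to 0. */
    static long[][] quantize(double[][] coeffs, double step) {
        long[][] q = new long[N][N];
        for (int u = 0; u < N; u++) {
            for (int v = 0; v < N; v++) {
                q[u][v] = Math.round(coeffs[u][v] / step);
            }
        }
        return q;
    }

    /** Number of coefficients that are still non-zero after quantization. */
    static int survivingCoefficients(int[][] block, double step) {
        int count = 0;
        for (long[] row : quantize(dct(block), step)) {
            for (long c : row) {
                if (c != 0) count++;
            }
        }
        return count;
    }

    /** A block with no detail at all: every sample has the same value. */
    static int[][] flatBlock(int value) {
        int[][] block = new int[N][N];
        for (int[] row : block) java.util.Arrays.fill(row, value);
        return block;
    }

    /** A block made of nothing but sharp transitions: alternating 255 and 0. */
    static int[][] checkerboardBlock() {
        int[][] block = new int[N][N];
        for (int x = 0; x < N; x++) {
            for (int y = 0; y < N; y++) {
                block[x][y] = ((x + y) % 2 == 0) ? 255 : 0;
            }
        }
        return block;
    }

    public static void main(String[] args) {
        System.out.println("flat block:         " + survivingCoefficients(flatBlock(200), 16.0));
        System.out.println("checkerboard block: " + survivingCoefficients(checkerboardBlock(), 16.0));
    }
}
```

The flat block keeps only its single DC coefficient, while the checkerboard keeps several, and it is exactly these surviving coefficients that the final lossless stage has to encode.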
How can this knowledge of JPEG's “inner workings” help us solve the MintEye CAPTCHA? Knowing all this, it becomes obvious that two different images that have the same size (in pixels) and are compressed with the same quality settings will, with a probability of almost 100%, have a different size in bytes! Moreover, the largest file will belong to the picture with more fine detail and fewer smooth color transitions.
To prove this, we take good old Lena and run the following experiment (all three images are saved with Photoshop's standard “Save for Web” at quality 40):
| Gaussian Noise 5% | Original | Gaussian Blur 1.5px |
| --- | --- | --- |
| Size: 10342 bytes | Size: 7184 bytes | Size: 4580 bytes |
Well, Q.E.D.: all other things being equal, the more noise, the larger the file; the more blur, the smaller the file.
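The same effect can be reproduced without Photoshop. The sketch below is only an approximation of the experiment above: instead of Lena it uses two synthetic 128x128 images (a smooth gradient and seeded random noise), and it encodes them with the JDK's standard `javax.imageio` JPEG writer at quality `0.4f`, which is only loosely comparable to Photoshop's quality 40. The class and method names are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;

final class JpegSizeExperiment {

    /** Encodes the image as JPEG in memory and returns the size in bytes. */
    static int jpegSize(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.size();
    }

    /** Smooth picture: a horizontal gray gradient, no fine detail at all. */
    static BufferedImage gradient(int size) {
        BufferedImage img = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        for (int x = 0; x < size; x++) {
            int gray = x * 255 / (size - 1);
            int rgb = (gray << 16) | (gray << 8) | gray;
            for (int y = 0; y < size; y++) {
                img.setRGB(x, y, rgb);
            }
        }
        return img;
    }

    /** Detailed picture: every pixel gets an independent random gray level. */
    static BufferedImage noise(int size, long seed) {
        Random random = new Random(seed);
        BufferedImage img = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        for (int x = 0; x < size; x++) {
            for (int y = 0; y < size; y++) {
                int gray = random.nextInt(256);
                img.setRGB(x, y, (gray << 16) | (gray << 8) | gray);
            }
        }
        return img;
    }

    public static void main(String[] args) {
        System.out.println("gradient: " + jpegSize(gradient(128), 0.4f) + " bytes");
        System.out.println("noise:    " + jpegSize(noise(128, 1L), 0.4f) + " bytes");
    }
}
```

Same pixel dimensions, same quality setting, and the noisy image still comes out noticeably larger than the gradient.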
Why then do all MintEye CAPTCHA pictures have the same size? The secret of the trick turned out to be ridiculously primitive: the files are simply padded with zeros to the size of the largest of them!
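Such padding is trivial both to apply and to undo. The sketch below uses hypothetical helper names and a made-up five-byte stand-in for a JPEG file; the only real-world fact it relies on is that a JPEG stream ends with the end-of-image marker `FF D9`, so genuine image data never ends in a zero byte and every trailing zero can be treated as padding.

```java
import java.util.Arrays;

final class ZeroPadding {

    /** Pads the data with zero bytes up to the given total size. */
    static byte[] padTo(byte[] data, int size) {
        if (size < data.length) {
            throw new IllegalArgumentException("target size is smaller than the data");
        }
        return Arrays.copyOf(data, size); // copyOf fills the new tail with zeros
    }

    /** Counts the zero bytes at the very end of the data. */
    static int trailingZeros(byte[] data) {
        int count = 0;
        for (int i = data.length - 1; i >= 0 && data[i] == 0; i--) {
            count++;
        }
        return count;
    }

    /** Size of the data once the padding is ignored. */
    static int payloadLength(byte[] data) {
        return data.length - trailingZeros(data);
    }

    public static void main(String[] args) {
        // Made-up stand-in for a JPEG file: SOI marker, one byte, EOI marker.
        byte[] fakeJpeg = {(byte) 0xFF, (byte) 0xD8, 0x11, (byte) 0xFF, (byte) 0xD9};
        byte[] padded = padTo(fakeJpeg, 12);
        System.out.println("padded length:  " + padded.length);
        System.out.println("trailing zeros: " + trailingZeros(padded));
        System.out.println("payload length: " + payloadLength(padded));
    }
}
```

In other words, the padding hides the real size only from someone who looks at the file length and nothing else.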
Having discovered this, I almost immediately came up with a somewhat “brazen” but extremely simple and effective way to solve this, if I may call it that, “captcha”. Take a look at these two pictures:

On the left is a slightly distorted image, one position to the left of the “correct” one. On the right is the undistorted image, which is the correct answer.
At first glance the left picture is warped only slightly. In reality, though, even such a small-angle “twist” strongly blurs sharp edges and fine details. So, based on the JPEG compression properties we now know, such a slightly distorted picture should differ in size from the correct one, and be sharply smaller! To test this bold assumption, I opened the IDE and in literally 10 minutes wrote the following:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class MintEye {
        public static void main(String[] args) throws IOException {
            int maxDelta = 0;    // largest drop in padding seen so far
            int correctNum = 0;  // zero-based index of the best candidate
            int zeroes = 0;      // trailing zero bytes in the current file
            int prevZeroes = 0;  // trailing zero bytes in the previous file

            for (int n = 0; n < 30; n++) {
                if (n > 0) prevZeroes = zeroes;
                zeroes = 0;

                RandomAccessFile raf = new RandomAccessFile(
                        String.format("images/%1$02d.jpg", n + 1), "r");
                long fileLen = raf.length();

                // Walk backwards from the end of the file, counting padding zeros.
                for (int i = (int) fileLen - 1; i >= 0; i--) {
                    raf.seek(i);
                    if (raf.read() != 0) break;
                    zeroes++;
                }

                // Less padding than in the previous file means more real JPEG data.
                int delta = prevZeroes - zeroes;
                if (delta > maxDelta) {
                    maxDelta = delta;
                    correctNum = n;
                }
                raf.close();
            }
            System.out.printf("Correct image: %d%n", correctNum + 1);
        }
    }

In a nutshell: we go through the pictures (MintEye has 30 of them), counting the zeros at the end of the current file and comparing that count with the number of zeros at the end of the previous one. The file that shows the largest drop in padding relative to its predecessor is assumed to be the original, undistorted picture, that is, the correct answer.
The idea turned out to be absolutely right: 100%! 10 out of 10! All downloaded sets of pictures were recognized without a single mistake. And that without using any image processing libraries, and without even loading the pictures into memory!
As a result, I was once again convinced that for every tricky captcha a “recognizer” will turn up sooner or later, and MintEye has been added to my personal “collection of scalps” (they can now safely shut down their startup and commit hara-kiri).
And Habrahabr, in turn, gained this article.
P.S. I know that the code above is far from perfect; in particular, it will not work if the very first picture is the correct one. But I was not striving for perfection, I just wanted to illustrate the idea as briefly as possible.
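For the curious, here is one possible way to patch the first-picture case. This is a sketch and not the code used for the results above. It works on payload sizes (file length minus trailing zeros), which for equally sized files gives the same differences as comparing zero counts. The untested assumption is that the picture right after the correct one is blurred about as much as the one right before it, so the first picture can be scored against its successor instead of a missing predecessor. The class and method names are hypothetical.

```java
final class NeighbourDelta {

    /**
     * Returns the zero-based index of the picture whose payload grows the most
     * relative to its neighbour.
     */
    static int pickCorrect(long[] payloadSizes) {
        if (payloadSizes.length < 2) {
            throw new IllegalArgumentException("need at least two images");
        }
        // Assumption: the first picture has no predecessor, so compare it
        // with the picture that follows it.
        int best = 0;
        long bestDelta = payloadSizes[0] - payloadSizes[1];

        // Every other picture is compared with its predecessor, as before.
        for (int n = 1; n < payloadSizes.length; n++) {
            long delta = payloadSizes[n] - payloadSizes[n - 1];
            if (delta > bestDelta) {
                bestDelta = delta;
                best = n;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up payload sizes, for illustration only.
        long[] firstIsCorrect = {900, 500, 520, 540};
        long[] thirdIsCorrect = {500, 520, 900, 510};
        System.out.println("Correct image: " + (pickCorrect(firstIsCorrect) + 1));
        System.out.println("Correct image: " + (pickCorrect(thirdIsCorrect) + 1));
    }
}
```

With the made-up sizes in `main`, the first call picks image 1 and the second picks image 3.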