
Exploring OpenCL through password cracking

Introduction


Recently, after reading various articles and presentations about GPGPU, I decided to try programming for a video card myself. The choice of technologies in this area is actually not large: only CUDA (nVidia's proprietary standard) and OpenCL (a free standard that runs on ATI and nVidia GPUs, as well as on CPUs) are alive and well. Since my laptop has an ATI video card (Mobility Radeon HD 5650), the choice came down to a single option: OpenCL. This article is about the process of learning OpenCL from scratch, and about what came of it.

OpenCL and PyOpenCL Overview


At first glance, everything seemed very confusing to me: both the host code in C and the code of the so-called kernels. With the provided C API, even launching the simplest program takes a large number of lines, especially with at least minimal error handling, so I wanted to find something more convenient and humane. The choice fell on the PyOpenCL library, whose name makes clear that the host code is written in Python. In it everything looks much more understandable, even to someone seeing OpenCL code for the first time (of course, this applies only to simple examples). However, the kernel code itself is still written in a slightly modified C, so you will have to study that anyway. Full documentation on the language is available on the standards body's website (Khronos), and information on specific implementations on the ATI and nVidia sites, respectively.
You can get the first impression about the language by the simplest example (adding two arrays):

__kernel void sum(__global const float *a,
                  __global const float *b,
                  __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}


And here is the complete code needed to run this example and validate the result (taken from the PyOpenCL documentation):
import pyopencl as cl
import numpy
import numpy.linalg as la

a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)

ctx = cl.create_some_context()   # create a context
queue = cl.CommandQueue(ctx)     # and a command queue in it

# create device buffers and copy the input arrays to the device
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes)

# compile the kernel and run it, one work-item per array element
prg = cl.Program(ctx, """
    __kernel void sum(__global const float *a,
                      __global const float *b,
                      __global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }
    """).build()
prg.sum(queue, a.shape, None, a_buf, b_buf, dest_buf)

# read the result back and check it against numpy
a_plus_b = numpy.empty_like(a)
cl.enqueue_copy(queue, a_plus_b, dest_buf)
print la.norm(a_plus_b - (a+b))   # should print a value close to 0



The characteristic lines are immediately visible: creating a context and an execution queue, creating buffers and copying them to the device, and compiling and launching the actual kernel. You can read about contexts and queues in detail in the documentation; for relatively simple programs, one queue and one context are enough, created with lines very similar to those in the example. In general, the structure of computations in an OpenCL program often follows the same flow: create a context and a queue, copy the input data to the device, compile and launch the kernel, and read the results back.



Hashing SHA1


It is time to go down a level and understand how the kernel code itself is structured. For an OpenCL function to be callable from the outside, it must be marked with the __kernel attribute, have a return type of void, and take a number of arguments, which can be either values themselves (int, float4, ...) or pointers to memory regions (__global, __constant, __local). For convenience, other functions that are called from the kernel can also be declared, and this does not affect performance: all functions are automatically inlined (as with the inline directive). A related consequence is that recursion is not supported in OpenCL at all.

Taking advantage of the fact that the OpenCL language is a modified C, you can take a ready-made implementation of a hash function, for example SHA1, and use it with minor modifications:

#define K1 0x5A827999
#define K2 0x6ED9EBA1
#define K3 0x8F1BBCDC
#define K4 0xCA62C1D6

#define a0 0x67452301
#define b0 0xEFCDAB89
#define c0 0x98BADCFE
#define d0 0x10325476
#define e0 0xC3D2E1F0

#define f1(x,y,z) ( z ^ ( x & ( y ^ z ) ) )          /* Rounds  0-19 */
#define f2(x,y,z) ( x ^ y ^ z )                      /* Rounds 20-39 */
#define f3(x,y,z) ( ( x & y ) | ( z & ( x | y ) ) )  /* Rounds 40-59 */
#define f4(x,y,z) ( x ^ y ^ z )                      /* Rounds 60-79 */

#define ROTL(n,X) ( ( ( X ) << n ) | ( ( X ) >> ( 32 - n ) ) )

#define expand(W,i) ( W[ i & 15 ] = ROTL( 1, ( W[ i & 15 ] ^ W[ (i - 14) & 15 ] ^ \
                      W[ (i - 8) & 15 ] ^ W[ (i - 3) & 15 ] ) ) )

#define subRound(a, b, c, d, e, f, k, data) \
    ( e += ROTL( 5, a ) + f( b, c, d ) + k + data, b = ROTL( 30, b ) )

#define REVERSE(value) value = ((value & 0xFF000000) >> 24) | ((value & 0x00FF0000) >> 8) | \
                               ((value & 0x0000FF00) << 8)  | ((value & 0x000000FF) << 24)

long sha1(uint *eData, const int length)
{
    unsigned int A = a0;
    unsigned int B = b0;
    unsigned int C = c0;
    unsigned int D = d0;
    unsigned int E = e0;

    /* append the mandatory 0x80 padding byte right after the message */
    ((char *)eData)[length] = 0x80;
    /* convert the words to the big-endian order SHA1 expects */
    for (int i = 0; i <= length / 4; i++) {
        REVERSE(eData[i]);
    }
    eData[14] = 0;
    eData[15] = length * 8;   /* message length in bits */

    subRound( A, B, C, D, E, f1, K1, eData[ 0 ] );
    subRound( E, A, B, C, D, f1, K1, eData[ 1 ] );
    subRound( D, E, A, B, C, f1, K1, eData[ 2 ] );
    subRound( C, D, E, A, B, f1, K1, eData[ 3 ] );
    subRound( B, C, D, E, A, f1, K1, eData[ 4 ] );
    subRound( A, B, C, D, E, f1, K1, eData[ 5 ] );
    subRound( E, A, B, C, D, f1, K1, eData[ 6 ] );
    subRound( D, E, A, B, C, f1, K1, eData[ 7 ] );
    subRound( C, D, E, A, B, f1, K1, eData[ 8 ] );
    subRound( B, C, D, E, A, f1, K1, eData[ 9 ] );
    subRound( A, B, C, D, E, f1, K1, eData[ 10 ] );
    subRound( E, A, B, C, D, f1, K1, eData[ 11 ] );
    subRound( D, E, A, B, C, f1, K1, eData[ 12 ] );
    subRound( C, D, E, A, B, f1, K1, eData[ 13 ] );
    subRound( B, C, D, E, A, f1, K1, eData[ 14 ] );
    subRound( A, B, C, D, E, f1, K1, eData[ 15 ] );
    subRound( E, A, B, C, D, f1, K1, expand( eData, 16 ) );
    subRound( D, E, A, B, C, f1, K1, expand( eData, 17 ) );
    subRound( C, D, E, A, B, f1, K1, expand( eData, 18 ) );
    subRound( B, C, D, E, A, f1, K1, expand( eData, 19 ) );
    subRound( A, B, C, D, E, f2, K2, expand( eData, 20 ) );
    subRound( E, A, B, C, D, f2, K2, expand( eData, 21 ) );
    subRound( D, E, A, B, C, f2, K2, expand( eData, 22 ) );
    subRound( C, D, E, A, B, f2, K2, expand( eData, 23 ) );
    subRound( B, C, D, E, A, f2, K2, expand( eData, 24 ) );
    subRound( A, B, C, D, E, f2, K2, expand( eData, 25 ) );
    subRound( E, A, B, C, D, f2, K2, expand( eData, 26 ) );
    subRound( D, E, A, B, C, f2, K2, expand( eData, 27 ) );
    subRound( C, D, E, A, B, f2, K2, expand( eData, 28 ) );
    subRound( B, C, D, E, A, f2, K2, expand( eData, 29 ) );
    subRound( A, B, C, D, E, f2, K2, expand( eData, 30 ) );
    subRound( E, A, B, C, D, f2, K2, expand( eData, 31 ) );
    subRound( D, E, A, B, C, f2, K2, expand( eData, 32 ) );
    subRound( C, D, E, A, B, f2, K2, expand( eData, 33 ) );
    subRound( B, C, D, E, A, f2, K2, expand( eData, 34 ) );
    subRound( A, B, C, D, E, f2, K2, expand( eData, 35 ) );
    subRound( E, A, B, C, D, f2, K2, expand( eData, 36 ) );
    subRound( D, E, A, B, C, f2, K2, expand( eData, 37 ) );
    subRound( C, D, E, A, B, f2, K2, expand( eData, 38 ) );
    subRound( B, C, D, E, A, f2, K2, expand( eData, 39 ) );
    subRound( A, B, C, D, E, f3, K3, expand( eData, 40 ) );
    subRound( E, A, B, C, D, f3, K3, expand( eData, 41 ) );
    subRound( D, E, A, B, C, f3, K3, expand( eData, 42 ) );
    subRound( C, D, E, A, B, f3, K3, expand( eData, 43 ) );
    subRound( B, C, D, E, A, f3, K3, expand( eData, 44 ) );
    subRound( A, B, C, D, E, f3, K3, expand( eData, 45 ) );
    subRound( E, A, B, C, D, f3, K3, expand( eData, 46 ) );
    subRound( D, E, A, B, C, f3, K3, expand( eData, 47 ) );
    subRound( C, D, E, A, B, f3, K3, expand( eData, 48 ) );
    subRound( B, C, D, E, A, f3, K3, expand( eData, 49 ) );
    subRound( A, B, C, D, E, f3, K3, expand( eData, 50 ) );
    subRound( E, A, B, C, D, f3, K3, expand( eData, 51 ) );
    subRound( D, E, A, B, C, f3, K3, expand( eData, 52 ) );
    subRound( C, D, E, A, B, f3, K3, expand( eData, 53 ) );
    subRound( B, C, D, E, A, f3, K3, expand( eData, 54 ) );
    subRound( A, B, C, D, E, f3, K3, expand( eData, 55 ) );
    subRound( E, A, B, C, D, f3, K3, expand( eData, 56 ) );
    subRound( D, E, A, B, C, f3, K3, expand( eData, 57 ) );
    subRound( C, D, E, A, B, f3, K3, expand( eData, 58 ) );
    subRound( B, C, D, E, A, f3, K3, expand( eData, 59 ) );
    subRound( A, B, C, D, E, f4, K4, expand( eData, 60 ) );
    subRound( E, A, B, C, D, f4, K4, expand( eData, 61 ) );
    subRound( D, E, A, B, C, f4, K4, expand( eData, 62 ) );
    subRound( C, D, E, A, B, f4, K4, expand( eData, 63 ) );
    subRound( B, C, D, E, A, f4, K4, expand( eData, 64 ) );
    subRound( A, B, C, D, E, f4, K4, expand( eData, 65 ) );
    subRound( E, A, B, C, D, f4, K4, expand( eData, 66 ) );
    subRound( D, E, A, B, C, f4, K4, expand( eData, 67 ) );
    subRound( C, D, E, A, B, f4, K4, expand( eData, 68 ) );
    subRound( B, C, D, E, A, f4, K4, expand( eData, 69 ) );
    subRound( A, B, C, D, E, f4, K4, expand( eData, 70 ) );
    subRound( E, A, B, C, D, f4, K4, expand( eData, 71 ) );
    subRound( D, E, A, B, C, f4, K4, expand( eData, 72 ) );
    subRound( C, D, E, A, B, f4, K4, expand( eData, 73 ) );
    subRound( B, C, D, E, A, f4, K4, expand( eData, 74 ) );
    subRound( A, B, C, D, E, f4, K4, expand( eData, 75 ) );
    subRound( E, A, B, C, D, f4, K4, expand( eData, 76 ) );
    subRound( D, E, A, B, C, f4, K4, expand( eData, 77 ) );
    subRound( C, D, E, A, B, f4, K4, expand( eData, 78 ) );
    subRound( B, C, D, E, A, f4, K4, expand( eData, 79 ) );

    A += a0;
    B += b0;
    C += c0;
    D += d0;
    E += e0;

    /* return only the last 8 bytes of the digest (words D and E) */
    return as_ulong((uint2)(D, E));
}



A few explanations are in order here. As the "experimental" password hashes for cracking, I took the leaked LinkedIn hashes, of which there are almost 6 million (unique). There are several options for a reasonably fast membership check against such a list; I used hash tables (more on this later). To reduce memory consumption and speed things up, the idea arose to store not the full 20 bytes of each SHA1 digest but only its last 8 bytes, i.e. a single long / ulong value. Of course, this increases the probability of a false match, but it remains very small: among all the passwords I recovered, there were only 6 such cases, which is not critical at all. The function above therefore immediately returns the truncated value (the last 8 bytes). Otherwise everything is standard: the SHA1 algorithm is implemented directly from the specification, for the case of strings shorter than 56 bytes.
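The truncation idea is easy to check in plain Python with hashlib. A minimal sketch (the exact word order produced by as_ulong((uint2)(D, E)) in the kernel is an implementation detail; here the tail is simply read as a big-endian integer):

```python
import hashlib
import struct

def sha1_tail64(password: bytes) -> int:
    """Last 8 bytes of the SHA-1 digest, packed into one 64-bit integer."""
    digest = hashlib.sha1(password).digest()        # full 20-byte digest
    return struct.unpack(">Q", digest[12:])[0]      # keep only bytes 12..19
```

With ~6 million 64-bit values the expected number of accidental collisions is on the order of n^2 / 2^65, i.e. well below one, which matches the handful of false matches observed in practice.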

Brute force organization


Next, the search itself needs to be organized. The simplest option is brute force with the same character set for all positions, and it can be implemented directly, for example, like this:

__kernel void do_brute(
    __global const long *table,
    const ulong start_num,
    __global ulong *result,
    __global uint *res_ind)
{
    char s[64];
    uint *s_l = (uint *)s;   /* the same private buffer viewed as words for sha1() */
    int i, j;
    ulong _n, n;
    ulong start = start_num + get_global_id(0) * HASHES_PER_WORKITEM;

    for (j = 0; j < HASHES_PER_WORKITEM; j++) {
        n = _n = j + start;
        /* zero the whole 64-byte buffer (16 uints) */
        for (i = 15; i >= 0; i--) {
            s_l[i] = 0;
        }
        /* decode the candidate number into a password, mixed-radix style */
        for (i = COMB_LEN - 1; i >= 0; i--) {
            s[i] = charset[n % CHARS_CNT];
            n /= CHARS_CNT;
        }
        if (check_in_table(table, sha1(s_l, COMB_LEN))) {
            result[atomic_inc(res_ind)] = _n;
        }
    }
}


Here HASHES_PER_WORKITEM is the number of hashes processed by one work-item (thread) in a single run, COMB_LEN is the length of a combination, charset is the array of characters, and CHARS_CNT is the number of characters in it. As you can see, the kernel is launched with a pointer to the hash table, the number of the password from which this run starts, a pointer to the array for outputting results, and the current index into it.

In OpenCL, more than one thread is launched at a time: a whole batch of them, whose size is called the global work size, and all of the threads receive the same arguments. So that each of them covers its own part of the key space, the line ulong start = start_num + get_global_id(0) * HASHES_PER_WORKITEM; computes the starting number for that particular thread (get_global_id(0) is a built-in function that returns the thread's index, from 0 up to the current global work size).
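This partitioning and the number-to-password decoding from the kernel can be sketched in Python; the charset and length below are hypothetical stand-ins for the kernel's compile-time constants:

```python
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"   # hypothetical alphabet
COMB_LEN = 6                                       # hypothetical password length

def num_to_password(n: int) -> str:
    """Decode candidate number n into a fixed-length password, mirroring the
    kernel's inner loop: n is read as a COMB_LEN-digit number in base len(CHARSET)."""
    chars = []
    for _ in range(COMB_LEN):
        chars.append(CHARSET[n % len(CHARSET)])    # least significant digit first
        n //= len(CHARSET)
    return "".join(reversed(chars))

def workitem_range(start_num: int, gid: int, hashes_per_workitem: int) -> range:
    """Candidate numbers handled by the work-item with global id gid, as in
    start = start_num + get_global_id(0) * HASHES_PER_WORKITEM."""
    start = start_num + gid * hashes_per_workitem
    return range(start, start + hashes_per_workitem)
```

Adjacent work-items thus cover adjacent, non-overlapping slices of the key space.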

Next, each work-item iterates over its HASHES_PER_WORKITEM passwords, each of which is hashed by the sha1 function and looked up with the check_in_table function, shown below. In this form, with the simplest hash-table implementation, I got about 20 million passwords per second, which frankly is not much compared to, say, oclHashcat, which reaches more than 300 million on my laptop (even the fact that every candidate is checked against a large list of hashes does not excuse this). Looking ahead: for simple brute force I eventually reached a speed of 160 million per second, which is more than half the speed of oclHashcat (with a single hash).

Hash table


So, the check for the existence of a hash. The first variant implemented was the simplest open-addressing hash table. To keep things simple it was filled by the CPU rather than by the video card; OpenCL only performed lookups in it. That version looked like this:

bool check_in_table(__global const long *table, const long value)
{
    uint index = calc_index(value);
    uint step = calc_step(value);

    for (uint a = 1; ; a++) {
        index %= TABLE_SIZE;
        if (table[index] == 0) {
            return false;
        }
        if (table[index] == value) {
            return true;
        }
        index += a * step + 1;
    }
}
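The CPU-side fill must use exactly the same probe sequence as the kernel's lookup, otherwise entries inserted on the host would not be found on the device. A Python sketch of both sides (calc_index and calc_step are not shown in the listing, so the hash functions below are hypothetical stand-ins; zero is assumed to mark an empty slot, as in the kernel):

```python
TABLE_SIZE = 1 << 10   # hypothetical size; the real table holds millions of entries

def calc_index(value: int) -> int:     # hypothetical first hash
    return value % TABLE_SIZE

def calc_step(value: int) -> int:      # hypothetical second hash for the probe step
    return (value >> 32) % TABLE_SIZE

def fill_table(values):
    """Host-side fill that walks the same probe sequence as check_in_table."""
    table = [0] * TABLE_SIZE
    for value in values:               # values must be nonzero (0 = empty marker)
        index, step = calc_index(value), calc_step(value)
        a = 1
        while True:
            index %= TABLE_SIZE
            if table[index] in (0, value):
                table[index] = value
                break
            index += a * step + 1      # same increment as the OpenCL lookup
            a += 1
    return table

def check_in_table(table, value: int) -> bool:
    """Python mirror of the OpenCL lookup; terminates on the first empty slot,
    so the table must be kept sparse enough that empty slots exist."""
    index, step = calc_index(value), calc_step(value)
    a = 1
    while True:
        index %= TABLE_SIZE
        if table[index] == 0:
            return False
        if table[index] == value:
            return True
        index += a * step + 1
        a += 1
```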


I tried different table sizes and probing methods, but the speed did not improve significantly. While searching for material on hash tables on the GPU, I came across the article "Building an Efficient Hash Table on the GPU" by Vasily Volkov, which mentions cuckoo hashing, which I had not heard of before. In short, its essence is to use several hash functions instead of one, together with a special filling procedure, after which any element is found in no more than k memory accesses, where k is the number of hash functions. Since speed matters more to me than memory consumption, I used k = 2. The table is again filled on the CPU.
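A minimal sketch of cuckoo hashing with k = 2, in Python (the two hash functions and the kick limit are arbitrary choices for illustration, not the ones used in the actual cracker):

```python
class CuckooTable:
    """Cuckoo hash table with two hash functions: every element lives in one of
    its two home slots, so a lookup touches at most two memory locations."""

    def __init__(self, size: int, max_kicks: int = 500):
        self.size = size
        self.slots = [None] * size
        self.max_kicks = max_kicks

    def _h1(self, v: int) -> int:
        return v % self.size

    def _h2(self, v: int) -> int:
        return (v * 2654435761 >> 7) % self.size   # arbitrary second hash

    def insert(self, v: int) -> None:
        if self.contains(v):
            return
        i = self._h1(v)
        for _ in range(self.max_kicks):
            if self.slots[i] is None:
                self.slots[i] = v
                return
            v, self.slots[i] = self.slots[i], v    # evict the current occupant
            # the evicted element moves to its *other* home slot
            i = self._h2(v) if self._h1(v) == i else self._h1(v)
        raise RuntimeError("eviction cycle: rebuild with new hashes or a bigger table")

    def contains(self, v: int) -> bool:
        return self.slots[self._h1(v)] == v or self.slots[self._h2(v)] == v
```

The constant-bounded contains is what matters on a GPU: the worst-case lookup cost is fixed, while the potentially expensive insertion with evictions happens once, on the CPU.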

Conclusion


Of course, the optimizations also touched another part of the code: password generation. In the variant above several suboptimal spots are immediately visible; for example, each successive password is generated from scratch, although the previous one could simply be modified. There are other places to optimize, in particular ones specific to OpenCL: the character array lives in global or constant memory instead of the much faster local memory (read about the memory regions in the documentation of the specific implementation). Various kernel-code optimizations deserve a separate article with details, but here I will just say that when programming for a GPU it is worth trying different variants and measuring their speed, because it is not always possible to tell by eye which will be faster. Sometimes even removing some instructions slows execution down, and significantly.
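The "modify the previous password instead of rebuilding it" idea is just an odometer increment. A Python sketch (with a deliberately tiny hypothetical alphabet to keep the example readable; a real kernel would also avoid the linear index lookup):

```python
CHARSET = "abc"   # tiny hypothetical alphabet for illustration

def next_password(s: list) -> None:
    """Advance the candidate in place, odometer-style: only the positions that
    actually change are touched, instead of regenerating the whole string."""
    i = len(s) - 1
    while i >= 0:
        pos = CHARSET.index(s[i])          # O(len(CHARSET)); fine for a sketch
        if pos + 1 < len(CHARSET):
            s[i] = CHARSET[pos + 1]        # no carry: done after one position
            return
        s[i] = CHARSET[0]                  # carry into the next position left
        i -= 1
```

Most calls touch only the last position; a carry into earlier positions happens only once every len(CHARSET) steps, which is exactly why this beats rebuilding the candidate from its number each time.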

Later I added support for different alphabets at different positions and, more significantly, for alphabets of words rather than just individual characters. For convenience and flexibility, the kernel code is preprocessed with the Mako template engine. All of this is in the archive (see below).

Findings


So, in the end, I gained:


Tangible results:


Impressions:

PS:


Various comments and recommendations are welcome.

Source: https://habr.com/ru/post/150289/

