Is it legal? Good question. Since these files are only a small part of the MS-DOS program, are useless by themselves, and are posted here purely for educational purposes, I believe this falls under fair use. I hope all interested parties will agree. (If I do receive a threatening letter from lawyers, I can change the article to present the data files in a humorous light and then declare it a parody.)
$ ls levels
all_full.pak cake_wal.pak eeny_min.pak iceberg.pak lesson_5.pak mulligan.pak playtime.pak southpol.pak totally_.pak alphabet.pak castle_m.pak elementa.pak ice_cube.pak lesson_6.pak nice_day.pak potproproppropppropp. pak special.pak traffic_.pak amsterda.pak catacomb.pak fireflie.pak icedeath.pak lesson_7.pak nightmar.pak problems.pak spirals.pak trinity.pak apartmen.pak cellbloc.pak firetrap.pak icehouse.pak lesson_8.pak now_you_.pak refracti.pak spooks.pak trust_me.pak arcticfl.pak chchchip.pak floorgas.pak invincib.pak lobster_.pak nuts_and.pak reverse_.pak steam.pak undergro.pak balls_o_.pak chiller.pak forced_e.pak i.pak lock_blo.pak on_the_r.pak rink.pak stripes.pak up_the_b.pak beware_o.pak chipmine.pak force_fi.pak i_slide.pak loop_aro.pak oorto_ge.pak roadsign.pak suicide.pak vanishin.pak blink.pak mad. pak jailer.pak memory.pak open_que.pak sampler.pak telebloc.pak victim.pak blobdanc.pak colony.pak fortune_.pak jumping_.pak metastab.pak oversea_.pak scavenge.pak telenet.pak vortex.pak blobnet.pak corridor.pak four_ple.pak kablam.pak mind_blo.pak pain.pak scoundre.pak t_fair.pak wars.pak block_fa.pak cypher.pak four_squ.pak knot.pak mishmeshppakkot.pak mishmeshp .pak seeing_s.pak the_last.pak writers_.pak block_ii.pak deceptio.pak glut.pak ladder.pak miss_dir.pak partial_.pak short_ci.pak the_mars.pak yorkhous.pak block_n_.pak deepfree.pak goldkey.pak mixed_nu.pak pentagra.pak shrinkin.pak the_pris.pak block_ou.pak digdirt.pak go_with_.pak lesson_1.pak mix_up.pak perfect_.pak skelzie.pak three_do.pak block.pak digger.pak grail.pak lesson_2.pak pak pier_sev.pak slide_st.pak time_lap.pak bounce_c.pak doublema.pak hidden_d.pak lesson_3.pak morton.pak ping_pon.pak slo_mo.pak torturec.pak brushfir.pak drawn_an.pak hunt.pak lesson_4.pak mugger_.pak .pak socialis.pak tossed_s.pak

As you can see, all of the files end in .pak. .pak is a standard extension for an application data file, and unfortunately it tells us nothing about the internal structure. The file names are the first eight characters of the level name, with some exceptions. (For example, the word “buster” is omitted from the file names of the “BLOCK BUSTER” and “BLOCK BUSTER II” levels so that the two names do not collide.)

There are 148 data files in the directory, and there are in fact 148 levels in the game, so the numbers match.

$ ls levels | wc
     17     148    1974
xxd is a standard utility for producing hex dumps. Let's see what the inside of LESSON 1 looks like.

$ xxd levels/lesson_1.pak
00000000: 1100 cb00 0200 0004 0202 0504 0407 0505  ................
00000010: 0807 0709 0001 0a01 010b 0808 0d0a 0a11  ................
00000020: 0023 1509 0718 0200 2209 0d26 0911 270b  .#......"..&..'.
00000030: 0b28 0705 291e 0127 2705 020d 0122 0704  .(..)..''....."..
00000040: 0902 090a 0215 0426 0925 0111 1502 221d  .......&.%....".
00000050: 0124 011d 0d01 0709 0020 001b 0400 1a00  .$....... ......
00000060: 2015 2609 1f00 3300 2911 1522 2302 110d   .&...3.).."#...
00000070: 0107 2609 1f18 2911 1509 181a 0223 021b  ..&...)......#..
00000080: 0215 2201 1c01 1c0d 0a07 0409 0201 0201  ..".............
00000090: 2826 0123 1505 0902 0121 1505 220a 2727  (&.#.....!..".''
000000a0: 0b05 0400 060b 0828 0418 780b 0828 0418  .......(..x..(..
000000b0: 700b 0828 0418 6400 1710 1e1e 1a19 0103  p..(..d.........
000000c0: 000e 1a17 1710 0e1f 010e 1314 1b29 1f1a  .............)..
000000d0: 0012 101f 011b 0c1e 1f01 1f13 1001 0e13  ................
000000e0: 141b 001e 1a0e 1610 1f2d 0020 1e10 0116  .........-. ....
000000f0: 1024 291f 1a01 1a1b 1019 000f 1a1a 1d1e  .$).............
00000100: 2d02                                     -.
What is a hex dump? A hex dump is a standard way of displaying the exact bytes of a binary file. Most byte values do not correspond to printable ASCII characters, or correspond to characters that are hard to see (a tab, for example). In a hex dump, the individual bytes are shown as numeric values. The values are displayed in hexadecimal, hence the name. In the example above, 16 bytes are displayed per line of output. The leftmost column shows the line's position within the file, also in hexadecimal, so the number increases by 16 on each line. The bytes are displayed in eight columns, each showing two bytes. On the right, the hex dump shows what the bytes look like as characters, except that all non-printable ASCII values are replaced by dots. This makes it easy to spot strings embedded in a binary file.

Obviously, reverse engineering these files will not be as simple as looking at the contents and seeing what is there. So far, nothing tells us what purpose the data serves.
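The hex dump layout described in the sidebar can be reproduced in a few lines of Python. This is only a sketch that imitates the default xxd layout (offset, two-byte groups, ASCII column); the function name `hexdump` and its `cols` parameter are illustrative choices, not part of any tool used in this article.

```python
def hexdump(data: bytes, cols: int = 16) -> list[str]:
    """Format bytes in the style of xxd's default output (a sketch)."""
    # Width of the hex column for a full line: two digits per byte,
    # plus one space between each two-byte group.
    width = cols * 2 + (cols + 1) // 2 - 1
    lines = []
    for offset in range(0, len(data), cols):
        chunk = data[offset:offset + cols]
        # Hex digits, grouped two bytes at a time.
        hexpart = " ".join(chunk[i:i + 2].hex() for i in range(0, len(chunk), 2))
        # Printable ASCII is shown as-is, everything else as a dot.
        text = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        lines.append(f"{offset:08x}: {hexpart:<{width}}  {text}")
    return lines


if __name__ == "__main__":
    # The last two bytes of lesson_1.pak plus a few printable characters.
    for line in hexdump(b"\x2d\x02 hello\x00"):
        print(line)
```

Padding the hex column to a fixed width keeps the ASCII column aligned even on a short final line, which is what makes embedded strings easy to spot.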
There is nothing visible except arbitrary fragments of ASCII garbage.

$ strings levels/* | less
: !!; # &> '' :: 4 # . ,,! -54 "; / & 67 !) 60 <171 * (0 * 82> '= / 8> <171 && 9> # 2 ') ( , )9 0hX `@PX ) "" * 24 ** 5 ;)) < B777: .. 22C1 E ,, F -GDED EGFF16G ;; H < IECJ 9K444 = MBBB >> N9 "O" 9P3? Q
lines 1-24/1544 (more)
The -S option sorts the files in descending order of size.

$ ls -lS levels | head
total 592
-rw-r--r-- 1 breadbox breadbox 680 Jun 23  2015 mulligan.pak
-rw-r--r-- 1 breadbox breadbox 675 Jun 23  2015 shrinkin.pak
-rw-r--r-- 1 breadbox breadbox 671 Jun 23  2015 balls_o_.pak
-rw-r--r-- 1 breadbox breadbox 648 Jun 23  2015 cake_wal.pak
-rw-r--r-- 1 breadbox breadbox 647 Jun 23  2015 citybloc.pak
-rw-r--r-- 1 breadbox breadbox 639 Jun 23  2015 four_ple.pak
-rw-r--r-- 1 breadbox breadbox 636 Jun 23  2015 trust_me.pak
-rw-r--r-- 1 breadbox breadbox 625 Jun 23  2015 block_n_.pak
-rw-r--r-- 1 breadbox breadbox 622 Jun 23  2015 mix_up.pak
The -r option tells ls to reverse the order.

$ ls -lSr levels | head
total 592
-rw-r--r-- 1 breadbox breadbox 206 Jun 23  2015 kablam.pak
-rw-r--r-- 1 breadbox breadbox 214 Jun 23  2015 fortune_.pak
-rw-r--r-- 1 breadbox breadbox 219 Jun 23  2015 digdirt.pak
-rw-r--r-- 1 breadbox breadbox 226 Jun 23  2015 lesson_2.pak
-rw-r--r-- 1 breadbox breadbox 229 Jun 23  2015 lesson_8.pak
-rw-r--r-- 1 breadbox breadbox 237 Jun 23  2015 partial_.pak
-rw-r--r-- 1 breadbox breadbox 239 Jun 23  2015 knot.pak
-rw-r--r-- 1 breadbox breadbox 247 Jun 23  2015 cellbloc.pak
-rw-r--r-- 1 breadbox breadbox 248 Jun 23  2015 torturec.pak
$ for f in levels/*; do xxd $f | sed -n 1p; done | less
00000000: 2300 dc01 0300 0004 0101 0a03 030b 2323  #.............##
00000000: 2d00 bf01 0300 0015 0101 2203 0329 2222  -........."..)""
00000000: 2b00 a101 0301 0105 0000 0601 0207 0505  +...............
00000000: 1d00 d300 0200 0003 0101 0402 0205 0102  ................
00000000: 2d00 7a01 0300 0006 1414 0701 0109 0303  -.z.............
00000000: 3100 0802 0200 0003 0101 0502 0206 1313  1...............
00000000: 1a00 b700 0200 0003 0100 0502 0206 0101  ................
00000000: 1a00 0601 0300 0005 0001 0601 0107 0303  ................
00000000: 2000 7a01 0200 0003 0202 0401 0105 0028   .z............(
00000000: 3a00 a400 0200 0003 2828 0428 0205 0303  :.......((.(....
00000000: 2600 da00 0300 0004 0507 0901 010a 0303  &...............
00000000: 2400 f000 0300 0004 0303 0504 0407 0101  $...............
00000000: 2a00 ef01 0300 0005 0101 0614 0007 0303  *...............
00000000: 2c00 8c01 0300 0004 0303 0500 0107 0101  ,...............
00000000: 2a00 0001 0300 0004 0303 0501 0107 0404  *...............
00000000: 1b00 6d01 0200 0003 0101 0502 0206 0003  ..m.............
00000000: 1e00 1701 0200 0003 0202 0401 0105 0013  ................
00000000: 3200 ee01 0f00 0015 0101 270f 0f29 1414  2.........'..)..
00000000: 2a00 5b01 0300 0005 0303 0601 0107 1414  *.[.............
00000000: 2c00 8a01 0200 0003 0202 0401 0105 0303  ,...............
00000000: 1d00 9c00 0216 1604 0000 0516 0107 0205  ................
00000000: 2000 e100 0200 0003 0101 0402 0205 0303   ...............
00000000: 2000 2601 0300 0004 0303 0502 0207 0101   .&.............
00000000: 1f00 f600 0132 0403 0000 0532 3206 0404  .....2.....22...
lines 1-24/148 (more)
The first value in each file ranges from roughly 10 to 40 hexadecimal (or approximately 20–60 in decimal). This is a rather specific feature.

What is little-endian? When storing a numeric value that is larger than a single byte, you have to choose the order in which the bytes will be stored. If you store the byte representing the smallest part of the number first, that is called little-endian; if you store the bytes representing the largest part of the number first, that is big-endian. For example, we write decimal numbers in big-endian order: the string "42" means "forty-two", not "four and twenty". Little-endian is the natural order for many microprocessor families, so it tends to be more common, except in network protocols, which usually require big-endian.
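To make the byte-order distinction concrete, here is a small Python sketch that decodes the first four bytes of lesson_1.pak (11 00 cb 00, as shown in the hex dump earlier) both by hand and with the standard library. The helper name `read_u16_le` is just an illustrative choice.

```python
import struct


def read_u16_le(data: bytes, offset: int = 0) -> int:
    """Decode an unsigned 16-bit little-endian integer by hand."""
    low, high = data[offset], data[offset + 1]
    return low | (high << 8)


# First four bytes of lesson_1.pak, taken from the hex dump.
header = bytes.fromhex("1100cb00")

first = read_u16_le(header, 0)
second = read_u16_le(header, 2)
print(first, second)

# The standard library agrees when told to use little-endian ("<").
assert (first, second) == struct.unpack("<HH", header)

# Reading the same two bytes as big-endian gives a very different number.
print(int.from_bytes(header[2:4], "big"))
```

Read as little-endian, the two values come out small and plausible; read as big-endian, the second one becomes tens of thousands, which is one practical way to guess the byte order of an unknown format.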
The fourth byte is always 00, 01, or 02, and 01 is the most common. This also hints that the third and fourth bytes together make up another 16-bit value, one that lies roughly in the range 0–700 decimal. This hypothesis is supported by the fact that the third byte tends to be low when the fourth byte is 02, and tends to be high when the fourth byte is 00.

Try replacing xxd with xxd -g1 to disable grouping completely, and you will notice that picking out pairs of bytes in the middle of a line takes a lot of effort. This is a simple example of how the tools we use to study unfamiliar data predispose us to notice certain kinds of patterns. It is good that xxd highlights this pattern by default, because it is a very common one (even today, when 64-bit computers are everywhere). But it is also useful to know how to change these settings when they do not help.

The fifth byte is usually 02 or 03, and the maximum value seems to be 05. The sixth byte of a file is very often zero, but sometimes it contains a much larger value, such as 32 or 2C. For this pair, our guess about values spread across a range is not particularly well supported.

We can also use od to generate a hex dump. The od utility is similar to xxd, but offers a much wider choice of output formats. We can use it to dump the output as 16-bit decimal integers:

The -t option of od specifies the output format. In this case, u stands for unsigned decimal numbers, and 2 stands for two bytes per entry. (You can also select this format with the -d option.)

$ for f in levels/*; do od -tu2 $f | sed -n 1p; done | less
0000000    35   476     3  1024   257   778  2819  8995
0000000    45   447     3  5376   257   802 10499  8738
0000000    43   417   259  1281     0   262  1794  1285
0000000    29   211     2   768   257   516  1282   513
0000000    45   378     3  1536  5140   263  2305   771
0000000    49   520     2   768   257   517  1538  4883
0000000    26   183     2   768     1   517  1538   257
0000000    26   262     3  1280   256   262  1793   771
0000000    32   378     2   768   514   260  1281 10240
0000000    58   164     2   768 10280 10244  1282   771
0000000    38   218     3  1024  1797   265  2561   771
0000000    36   240     3  1024   771  1029  1796   257
0000000    42   495     3  1280   257  5126  1792   771
0000000    44   396     3  1024   771     5  1793   257
0000000    42   256     3  1024   771   261  1793  1028
0000000    27   365     2   768   257   517  1538   768
0000000    30   279     2   768   514   260  1281  4864
0000000    50   494    15  5376   257  3879 10511  5140
0000000    42   347     3  1280   771   262  1793  5140
0000000    44   394     2   768   514   260  1281   771
0000000    29   156  5634  1046     0  5637  1793  1282
0000000    32   225     2   768   257   516  1282   771
0000000    32   294     3  1024   771   517  1794   257
0000000    31   246 12801   772     0 12805  1586  1028
lines 1-24/148 (more)
The file size can be obtained with wc and its -c option. Similarly, we can give od options that make it output only the values we are interested in. Then we can use command substitution to store these values in shell variables and display them together:

The -An option of od suppresses the leftmost column, which shows the file offset, and -N4 tells od to stop after the first 4 bytes of the file.

$ for f in levels/*; do size=$(wc -c <$f); data=$(od -tuS -An -N4 $f); echo "$size: $data"; done | less
585:     35   476
586:     45   447
550:     43   417
302:     29   211
517:     45   378
671:     49   520
265:     26   183
344:     26   262
478:     32   378
342:     58   164
336:     38   218
352:     36   240
625:     42   495
532:     44   396
386:     42   256
450:     27   365
373:     30   279
648:     50   494
477:     42   347
530:     44   394
247:     29   156
325:     32   225
394:     32   294
343:     31   246
We can use read to extract the two numbers from od's output into separate variables, and then use shell arithmetic to find their sum:

read cannot be used on the right-hand side of a pipe, because the commands in a pipeline run in a child shell (a subshell), which takes its variables with it to the bit bucket when it exits. So instead we need to use bash's process substitution feature, which exposes od's output as a temporary file that can then be redirected into the read command.

$ for f in levels/*; do size=$(wc -c <$f); read v1 v2 < <(od -tuS -An -N4 $f); sum=$(($v1 + $v2)); echo "$size: $v1 + $v2 = $sum"; done | less
585: 35 + 476 = 511
586: 45 + 447 = 492
550: 43 + 417 = 460
302: 29 + 211 = 240
517: 45 + 378 = 423
671: 49 + 520 = 569
265: 26 + 183 = 209
344: 26 + 262 = 288
478: 32 + 378 = 410
342: 58 + 164 = 222
336: 38 + 218 = 256
352: 36 + 240 = 276
625: 42 + 495 = 537
532: 44 + 396 = 440
386: 42 + 256 = 298
450: 27 + 365 = 392
373: 30 + 279 = 309
648: 50 + 494 = 544
477: 42 + 347 = 389
530: 44 + 394 = 438
247: 29 + 156 = 185
325: 32 + 225 = 257
394: 32 + 294 = 326
343: 31 + 246 = 277
lines 1-24/148 (more)
$ for f in levels/*; do size=$(wc -c <$f); read v1 v2 < <(od -tuS -An -N4 $f); diff=$(($size - $v1 - $v2)); echo "$size = $v1 + $v2 + $diff"; done | less
585 = 35 + 476 + 74
586 = 45 + 447 + 94
550 = 43 + 417 + 90
302 = 29 + 211 + 62
517 = 45 + 378 + 94
671 = 49 + 520 + 102
265 = 26 + 183 + 56
344 = 26 + 262 + 56
478 = 32 + 378 + 68
342 = 58 + 164 + 120
336 = 38 + 218 + 80
352 = 36 + 240 + 76
625 = 42 + 495 + 88
532 = 44 + 396 + 92
386 = 42 + 256 + 88
450 = 27 + 365 + 58
373 = 30 + 279 + 64
648 = 50 + 494 + 104
477 = 42 + 347 + 88
530 = 44 + 394 + 92
247 = 29 + 156 + 62
325 = 32 + 225 + 68
394 = 32 + 294 + 68
343 = 31 + 246 + 66
lines 1-24/148 (more)
$ echo P5 200 148 255 >hdr.pgm
What is PGM? PGM, “portable graymap”, is an extremely simple image file format: a short ASCII header followed by the pixel data, one byte per pixel. Its two sibling formats are PBM (“portable bitmap”), which packs 8 pixels into each byte, and PPM (“portable pixmap”), which uses 3 bytes per pixel.

P5 is the initial signature that identifies the PGM file format. The next two numbers, 200 and 148, give the width and height of the image, and the final 255 specifies the maximum value per pixel. The PGM header ends with a newline, after which the pixel data follows. (Note that the PGM header is most often laid out as three separate lines of text, but the PGM standard only requires that the elements be separated by some whitespace.)

We can use head to extract the first 200 bytes from each file:

$ for f in levels/*; do head -c200 $f; done >out.pgm
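The same header-plus-pixels construction can be sketched in Python, which makes the structure of a binary PGM file explicit. The function name `make_pgm` and the sample rows are hypothetical; unlike the shell pipeline, this version pads short rows with zero bytes so that every row has exactly `width` pixels.

```python
def make_pgm(rows: list[bytes], width: int) -> bytes:
    """Build a binary (P5) PGM image with one image row per input byte string.

    Each row is truncated or zero-padded to exactly `width` bytes.
    """
    # Header: signature, width, height, maximum pixel value, then a newline.
    header = f"P5 {width} {len(rows)} 255\n".encode("ascii")
    body = b"".join(row[:width].ljust(width, b"\x00") for row in rows)
    return header + body


if __name__ == "__main__":
    # Stand-in data; in practice each row would be the start of one level file.
    rows = [bytes(range(10)), b"\xff" * 3]
    image = make_pgm(rows, width=8)
    print(len(image), image[:11])
```

To mirror the shell version, the rows could be read from the level files (for example with `pathlib.Path("levels").glob("*.pak")`, assuming that directory layout) and the result written to a file opened in binary mode.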
xview is an old X program for displaying images in a window. You can substitute your favorite image viewer, such as ImageMagick's display utility, but note that a surprising number of image viewers will not accept an image file redirected to standard input.

$ cat hdr.pgm out.pgm | xview /dev/stdin
We can use pgmtoppm, from the Netpbm toolkit, to map the pixels to a different range of colors. This version produces a “negative” image:

$ cat hdr.pgm out.pgm | pgmtoppm white-black | xview /dev/stdin

$ cat hdr.pgm out.pgm | pgmtoppm yellow-blue | xview /dev/stdin

$ cat hdr.pgm out.pgm | xview -zoom 300 /dev/stdin
We can replace head with tail and see what the last 200 bytes look like:

$ for f in levels/*; do tail -c200 $f; done >out.pgm; cat hdr.pgm out.pgm | xview -zoom 300 /dev/stdin
The following Python script uses PIL ("Pillow"):

import sys
from PIL import Image

# Retrieve the full list of data files.
filenames = sys.argv[1:]

# Create a grayscale image, its height equal to the number of data files.
width = 750
height = len(filenames)
image = Image.new('L', (width, height))

# Fill in the image, one row at a time.
for y in range(height):

    # Retrieve the contents of one data file, as raw bytes.
    with open(filenames[y], 'rb') as f:
        data = f.read()
    linewidth = len(data)

    # Turn the data into a pixel-high image, each byte becoming one pixel.
    line = Image.new(image.mode, (linewidth, 1))
    linepixels = line.load()
    for x in range(linewidth):
        linepixels[x,0] = data[x]

    # Stretch the line out to fit the final image, and paste it into place.
    line = line.resize((width, 1))
    image.paste(line, (0, y))

# Magnify the final image and display it.
image = image.resize((width, 3 * height))
image.show()

$ python showbytes.py levels/*
import sys

# Read all of the input as raw bytes and count each possible byte value.
data = sys.stdin.buffer.read()
for c in range(256):
    print(c, data.count(bytes([c])))

$ cat levels/* | python ./census.py | less
0 2458
1 2525
2 1626
3 1768
4 1042
5 1491
6 1081
7 1445
8 958
9 1541
10 1279
11 1224
12 845
13 908
14 859
15 1022
16 679
17 1087
18 881
19 1116
20 100
21 1189
22 1029
23 733
lines 1-24/256 (more)
We can use gnuplot to turn this census into a histogram:

The -p option tells gnuplot not to close the plot window after gnuplot exits.

$ cat levels/* | python ./census.py | gnuplot -p -e 'plot "-" with boxes'
import sys
from PIL import Image

# Retrieve the full list of data files.
filenames = sys.argv[1:]

# Create a color image, its height equal to the number of data files.
width = 750
height = len(filenames)
image = Image.new('RGB', (width, height))

# Fill in the image, one row at a time.
for y in range(height):

    # Retrieve the contents of one data file, as raw bytes.
    with open(filenames[y], 'rb') as f:
        data = f.read()
    linewidth = len(data)

    # Turn the data into a pixel-high image, each byte becoming one pixel.
    line = Image.new(image.mode, (linewidth, 1))
    linepixels = line.load()

    # Determine which group each byte belongs to and assign it a color.
    for x in range(linewidth):
        byte = data[x]
        if byte < 0x04:
            linepixels[x,0] = (255, 0, 0)
        elif byte < 0x40:
            linepixels[x,0] = (0, 255, 0)
        elif byte % 8 == 0:
            linepixels[x,0] = (0, 0, 255)
        else:
            linepixels[x,0] = (255, 255, 255)

    # Paste the line of pixels into the final image, stretching to fit.
    line = line.resize((width, 1))
    image.paste(line, (0, y))

# Magnify the final image and display it.
image = image.resize((width, 3 * height))
image.show()

$ python showbytes2.py levels/*
The -s4 option tells xxd to skip the first 4 bytes of the file.

$ for f in levels/*; do xxd -s4 $f | sed -n 1p; done | less
00000004: 0200 0003 0202 0401 0105 0303 0700 0108  ................
00000004: 0201 0104 0000 0504 0407 0202 0902 010a  ................
00000004: 0300 0004 0303 0504 0407 0505 0801 0109  ................
00000004: 0300 0009 0202 1203 0313 0909 1401 0115  ................
00000004: 0203 0305 0000 0602 0207 0505 0901 010a  ................
00000004: 0203 0304 0000 0502 0206 0404 0901 010a  ................
00000004: 0300 0005 022a 0602 2907 0303 0902 000a  .....*..).......
00000004: 0203 0305 0000 0605 0507 0101 0802 0209  ................
00000004: 0300 0007 0303 0901 010a 0707 0b09 090c  ................
00000004: 0300 0004 0101 0903 030e 0404 1609 0920  ............... 
00000004: 0200 0003 1313 0402 0205 0013 0701 0109  ................
00000004: 0500 0006 0505 0701 0109 0606 0e07 070f  ................
00000004: 0100 0003 0101 0a03 030b 0a0a 0e32 3216  .............22.
00000004: 0300 0004 0705 0903 030a 0606 0b08 080c  ................
00000004: 0200 0003 0701 0402 0209 0501 0a08 080b  ................
00000004: 0200 0003 0202 0901 010a 0303 0b05 010d  ................
00000004: 0200 0003 0202 0403 0305 0101 0904 040a  ................
00000004: 0300 0007 0303 0f01 0115 0707 2114 1422  ............!.."
00000004: 0200 0003 0202 0403 0309 0101 0a04 040b  ................
00000004: 0231 3103 0202 0500 0006 0303 0701 0109  .11.............
00000004: 0200 0003 0202 0b32 320c 0303 0e08 0811  .......22.......
00000004: 0201 0103 0000 0902 020a 0303 0b09 090c  ................
00000004: 0200 0003 0202 0a01 010b 0303 0d0b 0b0f  ................
00000004: 0300 0005 0303 0701 0109 0001 0b05 051b  ................
lines 27-50/148 (more)
The -g3 option sets the grouping to three bytes instead of two. The -c18 option specifies 18 bytes (a multiple of 3) per line instead of 16.

$ for f in levels/*; do xxd -s4 -g3 -c18 $f | sed -n 1p; done | less
00000004: 050000 060505 070101 090606 0e0707 0f0001  ..................
00000004: 010000 030101 0a0303 0b0a0a 0e3232 161414  .............22...
00000004: 030000 040705 090303 0a0606 0b0808 0c0101  ..................
00000004: 020000 030701 040202 090501 0a0808 0b0101  ..................
00000004: 020000 030202 090101 0a0303 0b0501 0d0302  ..................
00000004: 020000 030202 040303 050101 090404 0a0302  ..................
00000004: 030000 070303 0f0101 150707 211414 221313  ............!.."..
00000004: 020000 030202 040303 090101 0a0404 0b0001  ..................
00000004: 023131 030202 050000 060303 070101 090505  .11...............
00000004: 020000 030202 0b3232 0c0303 0e0808 110b0b  .......22.........
00000004: 020101 030000 090202 0a0303 0b0909 0c0a0a  ..................
00000004: 020000 030202 0a0101 0b0303 0d0b0b 0f2323  ................##
00000004: 030000 050303 070101 090001 0b0505 1b0707  ..................
00000004: 022323 030000 040202 050303 030101 070505  .##...............
00000004: 031414 050000 060303 070505 080101 090707  ..................
00000004: 030000 050202 060303 070505 080101 090606  ..................
00000004: 030202 040000 050303 070404 080005 090101  ..................
00000004: 030202 040000 050303 090404 1d0101 1f0909  ..................
00000004: 020000 050303 060101 070202 0f0300 110505  ..................
00000004: 050000 070101 0c0505 0d0007 110c0c 120707  ..................
00000004: 030202 050000 060303 070505 080101 090606  ..................
00000004: 020000 030101 050202 060505 070100 080303  ..................
00000004: 020000 030202 050303 090101 0a0505 0b0302  ..................
00000004: 022c2c 030000 040202 020303 050101 060202  .,,...............
lines 38-61/148 (more)
Many of the triplets end in 0000. But nonzero values also frequently occur in pairs, such as 0101 or 2323. This pattern is not perfect either, but it holds too often to be a coincidence. And looking at the ASCII column on the right, we can see that when byte values correspond to printable ASCII characters, they often appear in pairs.

$ xxd -s4 -g3 -c18 levels/lesson_1.pak
00000004: 020000 040202 050404 070505 080707 090001  ..................
00000016: 0a0101 0b0808 0d0a0a 110023 150907 180200  ...........#......
00000028: 22090d 260911 270b0b 280705 291e01 272705  "..&..'..(..)..''.
0000003a: 020d01 220704 090209 0a0215 042609 250111  ...".........&.%..
0000004c: 150222 1d0124 011d0d 010709 002000 1b0400  .."..$....... ....
0000005e: 1a0020 152609 1f0033 002911 152223 02110d  .. .&...3.).."#...
00000070: 010726 091f18 291115 09181a 022302 1b0215  ..&...)......#....
00000082: 22011c 011c0d 0a0704 090201 020128 260123  ".............(&.#
00000094: 150509 020121 150522 0a2727 0b0504 00060b  .....!..".''......
000000a6: 082804 18780b 082804 18700b 082804 186400  .(..x..(..p..(..d.
000000b8: 17101e 1e1a19 010300 0e1a17 17100e 1f010e  ..................
000000ca: 13141b 291f1a 001210 1f011b 0c1e1f 011f13  ...)..............
000000dc: 10010e 13141b 001e1a 0e1610 1f2d00 201e10  .............-. ..
000000ee: 011610 24291f 1a011a 1b1019 000f1a 1a1d1e  ...$).............
00000100: 2d02                                       -.
The triplet pattern appears to end at around offset 00000036. This is not certain, but the first byte of each triplet steadily increases in value, and then drops at the eighteenth triplet. Further evidence: in the eighteenth triplet, the second byte has the same value as the first. We had not noticed it before, but if we go back and look, we will see that the first byte of a triplet is never equal to its second or third byte.

$ od -An -tuS -N4 levels/lesson_1.pak
    17   203
$ for f in levels/*; do size=$(wc -c <$f); read v1 v2 < <(od -tuS -An -N4 $f); diff=$(($size - 3 * $v1 - $v2)); echo "$size = 3 * $v1 + $v2 + $diff"; done | less
585 = 3 * 35 + 476 + 4
586 = 3 * 45 + 447 + 4
550 = 3 * 43 + 417 + 4
302 = 3 * 29 + 211 + 4
517 = 3 * 45 + 378 + 4
671 = 3 * 49 + 520 + 4
265 = 3 * 26 + 183 + 4
344 = 3 * 26 + 262 + 4
478 = 3 * 32 + 378 + 4
342 = 3 * 58 + 164 + 4
336 = 3 * 38 + 218 + 4
352 = 3 * 36 + 240 + 4
625 = 3 * 42 + 495 + 4
532 = 3 * 44 + 396 + 4
386 = 3 * 42 + 256 + 4
450 = 3 * 27 + 365 + 4
373 = 3 * 30 + 279 + 4
648 = 3 * 50 + 494 + 4
477 = 3 * 42 + 347 + 4
530 = 3 * 44 + 394 + 4
247 = 3 * 29 + 156 + 4
325 = 3 * 32 + 225 + 4
394 = 3 * 32 + 294 + 4
343 = 3 * 31 + 246 + 4
lines 1-24/148 (more)
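The relationship shown in this listing (file size = 4-byte header + 3 bytes per table entry + data size) can also be checked in Python. This is a minimal sketch; the function name `check_layout` is illustrative, and the sample bytes are synthetic rather than taken from a real level file.

```python
import struct


def check_layout(blob: bytes) -> tuple[int, int, bool]:
    """Return (table entries, data size, whether the sizes add up)."""
    # The header is assumed to hold two little-endian 16-bit integers.
    entries, datasize = struct.unpack_from("<HH", blob, 0)
    expected = 4 + 3 * entries + datasize
    return entries, datasize, expected == len(blob)


if __name__ == "__main__":
    # Synthetic stand-in for a level file: 2 table entries and 5 data bytes.
    sample = struct.pack("<HH", 2, 5) + bytes(3 * 2) + bytes(5)
    print(check_layout(sample))
```

Running the same check over every file in the levels directory would reproduce the constant "+ 4" column of the shell output.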
(The compression used by gzip and zlib, for example, allows the dictionary to be recreated directly from the data stream. But such cases are the exception rather than the rule.)

import struct
import sys

# Read the compressed data file as raw bytes.
data = sys.stdin.buffer.read()

# Extract the two little-endian integers of the four-byte header.
tablesize, datasize = struct.unpack('<HH', data[0:4])
data = data[4:]

# Separate the dictionary table and the compressed data.
tablesize *= 3
table = data[0:tablesize]
data = data[tablesize:tablesize + datasize]

# Apply the dictionary entries to the data section.
for n in range(0, len(table), 3):
    key = table[n:n+1]
    val = table[n+1:n+3]
    data = data.replace(key, val)

# Output the expanded result.
sys.stdout.buffer.write(data)
$ python ./decompress.py <levels/lesson_1.pak | xxd
00000000: 0b0b 0b0b 0404 0000 0a0a 0109 0d05 0502 ................
00000010: 0200 0100 0000 0101 0100 0009 0702 0209 ................
00000020: 1100 0125 0100 2309 0700 0009 0d1d 0124 ...%..#........$
00000030: 011d 0a0a 0105 0500 0100 2000 1b02 0200 .......... .....
00000040: 1a00 2009 0709 1100 011f 0033 001e 0100 .. ........3....
00000050: 2309 0709 0d23 0000 0023 0a0a 0105 0509 #....#...#......
00000060: 1100 011f 0200 1e01 0023 0907 0001 0200 .........#......
00000070: 1a00 0023 0000 1b00 0009 0709 0d01 1c01 ...#............
00000080: 1c0a 0a01 0105 0502 0200 0100 0001 0000 ................
00000090: 0107 0509 1101 2309 0704 0400 0100 0001 ......#.........
000000a0: 2109 0704 0409 0d01 010b 0b0b 0b08 0804 !...............
000000b0: 0402 0200 0608 0807 0707 0502 0202 0078 ...............x
000000c0: 0808 0707 0705 0202 0200 7008 0807 0707 ..........p.....
000000d0: 0502 0202 0064 0017 101e 1e1a 1901 0300 .....d..........
000000e0: 0e1a 1717 100e 1f01 0e13 141b 1e01 1f1a ................
000000f0: 0012 101f 011b 0c1e 1f01 1f13 1001 0e13 ................
00000100: 141b 001e 1a0e 1610 1f2d 0020 1e10 0116 .........-. ....
00000110: 1024 1e01 1f1a 011a 1b10 1900 0f1a 1a1d .$..............
00000120: 1e2d 0000
Many byte values, such as 0b, 04, 00, and 0a, occur in pairs. Looking at the compressed original, we can see that all of these pairs were produced by dictionary substitutions. But along the way we immediately notice that all of these duplicated values also correspond to entries in the dictionary. That is, if we applied the dictionary again, the data would expand again. Perhaps we are not decompressing enough?

Although re-pair compression does not produce particularly impressive results, it has one advantage: the decompressor can be implemented with a minimum of code. I have used re-pair myself in situations where I needed to minimize the total size of the compressed data plus the decompression code.
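A toy example may help show why a single pass over the dictionary can leave the data only partly expanded. The sketch below assumes a re-pair style table in which later entries may refer to keys defined by earlier ones; the two-entry table and the function name `expand` are made up for illustration and are not taken from the level files.

```python
def expand(table: list[tuple[int, bytes]], data: bytes) -> bytes:
    """Undo pair substitutions, applying the newest dictionary entry first."""
    for key, pair in reversed(table):
        data = data.replace(bytes([key]), pair)
    return data


# Hypothetical dictionary: 0x02 stands for the pair 00 00, and the later
# entry 0x04 stands for the pair 02 02 (that is, four 00 bytes in total).
table = [(0x02, b"\x00\x00"), (0x04, b"\x02\x02")]
packed = b"\x04\x01"

if __name__ == "__main__":
    # Newest-first expansion resolves nested entries completely.
    print(expand(table, packed).hex())

    # Oldest-first expansion leaves the 02 bytes unexpanded.
    forward = packed
    for key, pair in table:
        forward = forward.replace(bytes([key]), pair)
    print(forward.hex())
```

Like the article's own scripts, this relies on key bytes never appearing as literal data; a format without that guarantee would need an escape mechanism.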
import struct
import sys

# Read the compressed data file as raw bytes.
data = sys.stdin.buffer.read()

# Extract the two little-endian integers of the four-byte header.
tablesize, datasize = struct.unpack('<HH', data[0:4])
data = data[4:]

# Separate the dictionary table and the compressed data.
tablesize *= 3
table = data[0:tablesize]
data = data[tablesize:tablesize + datasize]

# Apply the dictionary entries to the data section in reverse order.
for n in range(len(table) - 3, -3, -3):
    key = table[n:n+1]
    val = table[n+1:n+3]
    data = data.replace(key, val)

# Output the expanded result.
sys.stdout.buffer.write(data)
$ python ./decompress2.py <levels/lesson_1.pak | xxd | less
00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000100: 0000 0000 0000 0000 0000 0101 0101 0100  ................
00000110: 0101 0101 0100 0000 0000 0000 0000 0000  ................
00000120: 0000 0000 0000 0000 0000 0100 0000 0101  ................
00000130: 0100 0000 0100 0000 0000 0000 0000 0000  ................
00000140: 0000 0000 0000 0000 0000 0100 2300 0125  ............#..%
00000150: 0100 2300 0100 0000 0000 0000 0000 0000  ..#.............
00000160: 0000 0000 0000 0000 0101 0101 011d 0124  ...............$
00000170: 011d 0101 0101 0100 0000 0000 0000 0000  ................
lines 1-24/93 (more)
$ python ./decompress2.py <levels/lesson_1.pak | xxd -c32 | less
00000000: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
00000020: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
00000100: 0000 0000 0000 0000 0000 0101 0101 0100 0101 0101 0100 0000 0000 0000 0000 0000  ................................
00000120: 0000 0000 0000 0000 0000 0100 0000 0101 0100 0000 0100 0000 0000 0000 0000 0000  ................................
00000140: 0000 0000 0000 0000 0000 0100 2300 0125 0100 2300 0100 0000 0000 0000 0000 0000  ............#..%..#.............
00000160: 0000 0000 0000 0000 0101 0101 011d 0124 011d 0101 0101 0100 0000 0000 0000 0000  ...............$................
00000180: 0000 0000 0000 0000 0100 2000 1b00 0000 0000 1a00 2000 0100 0000 0000 0000 0000  .......... ......... ...........
000001a0: 0000 0000 0000 0000 0100 2300 011f 0033 001e 0100 2300 0100 0000 0000 0000 0000  ..........#....3....#...........
000001c0: 0000 0000 0000 0000 0101 0101 0123 0000 0023 0101 0101 0100 0000 0000 0000 0000  .............#...#..............
000001e0: 0000 0000 0000 0000 0100 2300 011f 0000 001e 0100 2300 0100 0000 0000 0000 0000  ..........#.........#...........
00000200: 0000 0000 0000 0000 0100 0000 1a00 0023 0000 1b00 0000 0100 0000 0000 0000 0000  ...............#................
00000220: 0000 0000 0000 0000 0101 0101 0101 1c01 1c01 0101 0101 0100 0000 0000 0000 0000  ................................
00000240: 0000 0000 0000 0000 0000 0000 0100 0001 0000 0100 0000 0000 0000 0000 0000 0000  ................................
00000260: 0000 0000 0000 0000 0000 0000 0100 2301 2300 0100 0000 0000 0000 0000 0000 0000  ..............#.#...............
00000280: 0000 0000 0000 0000 0000 0000 0100 0001 2100 0100 0000 0000 0000 0000 0000 0000  ................!...............
000002a0: 0000 0000 0000 0000 0000 0000 0101 0101 0101 0100 0000 0000 0000 0000 0000 0000  ................................
000002c0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
000002e0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  ................................
lines 1-24/47 (more)
xcd is a non-standard tool, but you can download it from here. Note the -r option to less, which tells it to pass control sequences through to the terminal.
00 encodes an empty tile, 01 encodes a wall, and 23 denotes a chip. 1A denotes a red door, 1B a blue door, and so on. We can assign exact values to chips, keys, doors, and all the other tiles that make up the level map.
The map evidently occupies the first 1024 bytes of the data (offsets 00000000 - 000003FF).
To see what follows the map, we can trim the output with sed. Since we display 32 bytes per line, dropping the first 32 lines skips the first 1024 bytes.
Next come 384 bytes (00000400 - 0000057F), almost all of which are zero, with a few nonzero values scattered among them. After that comes a completely different byte pattern, so it is logical to assume that this 384-byte sequence is a separate section.
$ python ./decompress2.py <levels/lesson_2.pak | xxd | less
00000400: 0608 1c1c 0808 0000 0000 0000 0000 0000  ................
00000410: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000420: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000430: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000440: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000450: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000460: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000470: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000480: a870 98a0 6868 0000 0000 0000 0000 0000  .p..hh..........
00000490: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000004a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000004b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000004c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000004d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000004e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000004f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000500: 6060 6060 5868 0000 0000 0000 0000 0000  ````Xh..........
00000510: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000520: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000530: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000540: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000550: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000560: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000570: 0000 0000 0000 0000 0000 0000 0000 0000  ................
lines 64-87/93 (more)
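To make the map layout concrete, here is a minimal Python sketch that renders the first 1024 bytes of a decompressed level as text, using only the tile codes identified so far. The names render_map and TILE_GLYPHS and the glyphs themselves are illustrative choices, not anything from the game; the 32-tile row width is an assumption based on how the map lines up in the 32-bytes-per-line dump.

```python
# Tile codes identified in the text so far; every other value is left unnamed.
TILE_GLYPHS = {
    0x00: " ",  # empty tile
    0x01: "#",  # wall
    0x23: "c",  # chip
    0x1A: "R",  # red door
    0x1B: "B",  # blue door
}

MAP_SIZE = 1024  # offsets 0x000 - 0x3FF
ROW_WIDTH = 32   # assumption: one map row per 32-byte dump line


def render_map(data):
    """Return the map as a list of text rows; unknown tile codes become '?'."""
    tiles = data[:MAP_SIZE]
    rows = []
    for start in range(0, len(tiles), ROW_WIDTH):
        row = tiles[start:start + ROW_WIDTH]
        rows.append("".join(TILE_GLYPHS.get(code, "?") for code in row))
    return rows
```

Feeding it the output of decompress2.py (for example, read from sys.stdin.buffer) and printing the rows gives a rough picture of the level, with question marks wherever a tile value has not been identified yet.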
Here the section begins with one byte 06, three bytes 08, and two bytes 1C. It is reasonable to conclude that 06 means Chip, 08 a bug, and 1C a block.
(Chip turns out to be represented by 04, 05, 06, or 07. This list in fact contains all the mobs. After careful study of the various values, we eventually realize that the byte value indicating the type has 0, 1, 2, or 3 added to it, denoting the mob's initial direction: north, east, south, or west. Thus, for example, a byte value of 06 denotes Chip facing south.)
xxd accepts a hex value for its -s option.
$ for f in levels/*; do python ./decompress2.py <$f | xxd -s 0x580 | sed -n 1p; done | less
00000580: 9001 0c17 1701 1120 1717 00              ....... ...
00000580: 0000 0c17 1b13 0c0d 101f 011e 1a20 1b00  ............. ..
00000580: f401 0c18 1e1f 101d 0f0c 1800            ............
00000580: 2c01 0c1b 0c1d 1f18 1019 1f00            ,...........
00000580: 9001 0c1d 0e1f 140e 1117 1a22 00         ...........".
00000580: 2c01 0d0c 1717 1e01 1a01 1114 1d10 00    ,..............
00000580: 2c01 0d10 220c 1d10 011a 1101 0d20 1200  ,..."........ ..
00000580: 5802 0d17 1419 1600                      X.......
00000580: 0000 0d17 1a0d 0f0c 190e 1000            ............
00000580: f401 0d17 1a0d 1910 1f00                 ..........
00000580: f401 0d17 1a0e 1601 110c 0e1f 1a1d 2400  ..............$.
00000580: ee02 0d17 1a0e 1601 0d20 1e1f 101d 0114  ......... ......
00000580: 5802 0d17 1a0e 1601 1901 1d1a 1717 00    X..............
00000580: 5e01 0d17 1a0e 1601 1a20 1f00            ^........ ..
00000580: c201 0d17 1a0e 1601 0d20 1e1f 101d 00    ......... .....
00000580: 2c01 0d1a 2019 0e10 010e 141f 2400       ,... .......$.
00000580: 5000 0d1d 201e 1311 141d 1000            P... .......
00000580: e703 0e0c 1610 0122 0c17 1600            .......positional"....
00000580: 5802 0e0c 1e1f 1710 0118 1a0c 1f00       X.............
00000580: 8f01 0e0c 1f0c 0e1a 180d 1e00            ............
00000580: 0000 0e10 1717 0d17 1a0e 1610 0f00 1b1d  ................
00000580: 2c01 0e13 0e13 0e13 141b 1e00            ,...........
00000580: 8f01 0e13 1417 1710 1d00                 ..........
00000580: bc02 0e13 141b 1814 1910 00              ...........
lines 1-24/148 (more)
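Returning briefly to the mob bytes discussed above: the "type plus direction" encoding can be sketched in a few lines of Python. The names decode_mob and MOB_NAMES are hypothetical, and the sketch assumes that every mob type starts at a multiple of four, as Chip's range 04 - 07 does. The text only confirms that range for Chip; treating 08 and 1C as base values for the bug and the block is a guess.

```python
DIRECTIONS = ("north", "east", "south", "west")

# Base values: 0x04 for Chip comes from the text (04 - 07).
# 0x08 (bug) and 0x1C (block) are assumed to be bases as well.
MOB_NAMES = {0x04: "Chip", 0x08: "bug", 0x1C: "block"}


def decode_mob(value):
    """Split a mob byte into (type name, initial direction)."""
    base = value & ~0x03                  # clear the two direction bits
    direction = DIRECTIONS[value & 0x03]  # low two bits: 0=N, 1=E, 2=S, 3=W
    return MOB_NAMES.get(base, "unknown"), direction
```

With this, decode_mob(0x06) yields ("Chip", "south"), matching the example in the text.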
od uses the -j option instead of -s to skip to the starting offset. The printf command is also worth noting: besides providing formatting, it is a convenient way to print text without a trailing newline.
$ for f in levels/*; do printf "%-20s" $f; python ./decompress2.py <$f | od -An -j 0x580 -tuS -N2; done | less
levels/all_full.pak    400
levels/alphabet.pak      0
levels/amsterda.pak    500
levels/apartmen.pak    300
levels/arcticfl.pak    400
levels/balls_o_.pak    300
levels/beware_o.pak    300
levels/blink.pak       600
levels/blobdanc.pak      0
levels/blobnet.pak     500
levels/block_fa.pak    500
levels/block_ii.pak    750
levels/block_n_.pak    600
levels/block_ou.pak    350
levels/block.pak       450
levels/bounce_c.pak    300
levels/brushfir.pak     80
levels/cake_wal.pak    999
levels/castle_m.pak    600
levels/catacomb.pak    399
levels/cellbloc.pak      0
levels/chchchip.pak    300
levels/chiller.pak     399
levels/chipmine.pak    700
lines 1-24/148 (more)
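The same two-byte read can be done in Python, which also makes the byte order explicit: the dump bytes 90 01 come out as 400 only when read as a little-endian 16-bit value. This is a small sketch; read_u16_le and HEADER_OFFSET are illustrative names, and it assumes the input is the already decompressed level data.

```python
import struct

HEADER_OFFSET = 0x580  # start of the section that follows the 384-byte block


def read_u16_le(data, offset=HEADER_OFFSET):
    """Read an unsigned 16-bit little-endian integer from data at offset."""
    (value,) = struct.unpack_from("<H", data, offset)
    return value
```

Unlike od -tuS, which uses the byte order of the machine it runs on, the "<H" format always reads little-endian, so the result does not depend on the host.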
$ python ./decompress2.py <levels/all_full.pak | xxd -s 0x0582
00000582: 0c17 1701 1120 1717 00                   ..... ...
The byte 17 occurs four times, grouped in two pairs. We immediately notice that the pattern of 17s matches the pattern of the letter L in the level title, “ALL FULL”. The title is eight characters long, so the zero byte at the end is most likely a string terminator. Having discovered this, it is trivial to look through all the other levels and use their titles to build up the complete character list:
00 | end of string |
01 | space |
02 - 0B | digits 0-9 |
0C - 25 | letters A-Z |
26 - 30 | punctuation marks |
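The table translates directly into a small decoder. The sketch below uses hypothetical names (decode_char, decode_string) and covers only the ranges listed above. Because the individual punctuation marks for 26 - 30 are not spelled out here, those codes, and anything else outside the table, are shown as '?'.

```python
def decode_char(code):
    """Map one byte of the game's character set to ASCII."""
    if code == 0x01:
        return " "
    if 0x02 <= code <= 0x0B:
        return chr(ord("0") + code - 0x02)  # digits 0-9
    if 0x0C <= code <= 0x25:
        return chr(ord("A") + code - 0x0C)  # letters A-Z
    return "?"  # punctuation (0x26 - 0x30) and unknown values


def decode_string(data, offset=0):
    """Decode bytes starting at offset until the 00 terminator."""
    chars = []
    for code in data[offset:]:
        if code == 0x00:
            break
        chars.append(decode_char(code))
    return "".join(chars)
```

Applied to the nine bytes shown in the dump at offset 0582, decode_string returns "ALL FULL".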
The game uses not only the chip value 23 that we discovered initially, but also the value 31, denoting a chip that does not count toward the total needed to open the chip socket. (From a gameplay point of view, however, both types of chip are the same. If a level contains one chip of type 31, then any one chip on the level can be left uncollected.)
tr makes it easy to convert the data files' custom character set to ASCII.
$ tr '\001-\045' ' 0-9A-Z' <levels/lesson_1.pak | xxd
00000000: 4600 cb00 3000 0032 3030 3332 3235 3333  F...0..200322533
00000010: 3635 3537 0020 3820 2039 3636 4238 3846  6557. 8  966B88F
00000020: 0058 4a37 354d 3000 5737 4226 3746 2739  .XJ75M0.W7B&7F'9
00000030: 3928 3533 2953 2027 2733 3042 2057 3532  9(53)S ''30B W52
00000040: 3730 3738 304a 3226 375a 2046 4a30 5752  70780J2&7Z FJ0WR
00000050: 2059 2052 4220 3537 0055 0050 3200 4f00   Y RB 57.U.P2.O.
00000060: 554a 2637 5400 3300 2946 4a57 5830 4642  UJ&7T.3.)FJWX0FB
00000070: 2035 2637 544d 2946 4a37 4d4f 3058 3050   5&7TM)FJ7MO0X0P
00000080: 304a 5720 5120 5142 3835 3237 3020 3020  0JW Q QB85270 0 
00000090: 2826 2058 4a33 3730 2056 4a33 5738 2727  (& XJ370 VJ3W8''
000000a0: 3933 3200 3439 3628 324d 7839 3628 324d  932.496(2Mx96(2M
000000b0: 7039 3628 324d 6400 4c45 5353 4f4e 2031  p96(2Md.LESSON 1
000000c0: 0043 4f4c 4c45 4354 2043 4849 5029 544f  .COLLECT CHIP)TO
000000d0: 0047 4554 2050 4153 5420 5448 4520 4348  .GET PAST THE CH
000000e0: 4950 0053 4f43 4b45 542d 0055 5345 204b  IP.SOCKET-.USE K
000000f0: 4559 2954 4f20 4f50 454e 0044 4f4f 5253  EY)TO OPEN.DOORS
00000100: 2d30                                     -0
(Near offset 00000035 there is a right parenthesis, followed by a capital S and a space.) From this I worked out the compression scheme, through a process similar to the one described in the article. Everything else was fairly simple.
Source: https://habr.com/ru/post/447562/