How (and why) we ported Shenzhen Solitaire under MS-DOS

Kate Holman and Zach Barth - developers from the game studio Zachtronics. If you like complicated technical articles about creating games for 25-year-old computers, you might like our puzzle games: SpaceChem , Infinifactory , TIS-100 and SHENZHEN I / O !

Getting started

Can two programmers, accustomed to creating games for modern computers with gigabytes of RAM and full-color HD displays, port their games to MS-DOS? None of us had the experience of developing on such old equipment, but working in artificially limited systems is the basis of the design of Zachtronics games, so we were obliged to try!
')
The project began with Zack creating the layout of SHENZHEN SOLITAIRE, a mini-game of solitaire from our game SHENZHEN I / O (also sold as a separate game ). This is how solitaire could look on a 256-color VGA display:

It looks like the games that could be seen on the PC screens of the early 90s! Now you just write the code, right?

Development environment

First we need to figure out how to write programs that can be run on an ancient computer with DOS. The target equipment will be a vintage IBM-compatible PC from the Zack collection:

Intel 80386SX processor with a frequency of 20 MHz
5 MB of RAM
VGA card
3½-inch drive
Serial mouse
MS-DOS version 6.22

The only reasonable variant of the programming language for the machine of that era will be C. We were not going to write the whole game in x86 assembly language! Thinking over the different versions of the working tools, we settled on Borland C ++ 3.1, released in 1992. Launched in DOSBox, Borland C ++ has provided a convenient and accurate emulation of the target machine.

Graphics

Computers with VGA graphics had a couple of drawing modes. A popular and simple option was Mode 13h : a resolution of 320 x 200 pixels with 256 colors. A more difficult option was the unofficial Mode X mode, which had a higher resolution: 320 x 240 pixels. There are many manuals on working with the Mode 13h mode on the Internet, and programming for it is very simple: 320x200 pixels are represented by an array of 64,000 bytes, each byte is one pixel. Mode X is more mysterious and difficult to use, but it has obvious advantages: a resolution of 320 by 240, one of the few VGA resolutions with square pixels. It also supports drawing with 256 colors selected from the palette. Most colors can be used in any VGA mode. We wanted to make the graphics better and complicate our work because we love it. Therefore, we decided to use Mode X.

So, we have a graphics mode with a lot of pixels and colors (for computers of the time). What are we going to draw? The most important elements are:

Full-screen backgrounds (desktop wallpaper, green card table)
Maps (including map icons and numbers on maps in three colors)
Text (for buttons and tags)

We already had full-color versions in high resolution of all these resources from the original version of SHENZHEN SOLITAIRE, but they need to be converted to a much lower resolution and use in general no more than 256 colors. Some tricks would not have been able to perform such a transformation, only a few hours of work in Photoshop with manual redrawing of maps, symbols and interface elements, scaling resolutions and background color palettes.

And here the main drawback of Mode X becomes important: for presentation of 320 x 240 pixels, 76,800 bytes are needed. VGA cards have a total of 256 kilobytes of video memory, but they are divided into four “planes” of 64 KB each. You can access only one of them at the same time *. This is quite suitable for the Mode 13h mode, which requires only 64,000 bytes, but in Mode X you need to divide its video data into several planes.

* Everything is a bit more complicated: in some VGA modes, including in Mode 13h, you can get access to several planes at the same time, but this approach has its drawbacks. Mode X allows the programmer to access only one plane at a time.

At this stage, the graphics code starts to look complicated, so we turned to Michael Abrash's Graphics Programming Black Book , the best source of knowledge about VGA. As the Black Book explains, Mode X divides pixel data into four planes. Each plane stores a quarter of pixels in an interlaced pattern. In plane 0, pixels 0, 4, 8, 12, etc. are stored. In plane 1, pixels 1, 5, 9, 13, and so on are stored.

This is a classic situation in the programming of games: we know the output that we need to create, and the choice of how to structure the input data (in our case, images) will have a tremendous impact on the complexity and speed of the rendering code. Fortunately, we didn’t have to figure it out on our own: in the book of Abrash there are a lot of useful tips for efficient rendering in Mode X. Since the entire screen is divided into four planes, and switching between planes is a relatively slow process, the most convenient (and fast! ) option - split each image into four data blocks so that the data portions of each of the planes are in one adjacent fragment. Thanks to this, the code becomes incredibly simple and as fast as possible. Here is the code that draws a full-screen (320 x 240) image into VGA memory:

//  :     VGA void far *vgaMemory = MAKE_FAR_POINTER(0xA000, 0x0000); short bytesPerPlane = (320 / 4) * 240; for (plane = 0; plane < 4; plane++) { SetCurrentVgaPlane(plane); _fmemcpy(vgaMemory, imageData, bytesPerPlane); imageData += bytesPerPlane; }

This code snippet also demonstrates some oddities that we will have to deal with when writing a 16-bit program for DOS. Using 16-bit pointers, you can directly access only 64 KB of memory. (These are “near” pointers.) However, most computers under DOS have much more memory, and addresses related to VGA memory already occupy 64 KB! In the assembler, this problem is handled with the help of segment registers. In C, “long pointers” are commonly used, that is, a 32-bit type of pointers, allowing to access 1 megabyte. (Life became much simpler with the advent of 32-bit processors, because it became possible to use 4 GB of memory without worrying about the segment registers.)

Fortunately, 64 KB is a lot, and almost the whole game fits into this framework. Only two parts of the code require the use of far pointers. I have already mentioned the first one: VGA memory is located in the whole 64-kilobyte range from the address 0xA0000. The second fragment is image data. Converting all the graphics into low resolution with one byte per pixel, we got about 250 kB of image data stored in one large file. This is more than 64 KB, so a long pointer was also required for it. In addition, this part was the only one in the code for which we used dynamic memory allocation.

Memory management

The main source of errors and difficulties in many C programs is the management of dynamically allocated memory. In Shenzhen Solitaire, we made everything simple: we knew exactly how many cards we needed to track, so we allocated memory for them in advance. There are no malloc() calls in the code that can lead to errors, and there are no free() calls that you can forget about. (The same strategy applies to the state in the rest of the game.) This is how the state of the solitaire engine looks like:

 //  :     struct CardOrCell { byte Suit; byte Value; byte CellType; struct CardOrCell *Parent; struct CardOrCell *Child; }; CardOrCell Cards[NUM_CARDS]; CardOrCell FreeCells[NUM_FREE_CELLS]; CardOrCell FoundationCells[NUM_FOUNDATION_CELLS]; CardOrCell TableauCells[NUM_TABLEAU_CELLS]; CardOrCell FlowerCell; CardOrCell *DraggedCard;

As I mentioned above, the only place where we could not use this strategy is the image data, because they occupied much more than 64 KB. Therefore, we saved all the images into one large file (as explained above, the image data was divided into four planes) and loaded them when loading the game along with the display of the loading bar:

 //  :      //        , //  ,        . FILE *f = fopen("SHENZHEN.IMG", "rb"); assert(f); long size = GetSizeOfFile(f); ImageData = farmalloc(size); assert(ImageData); long loaded = 0; byte huge *dest = ImageData; while (loaded < size) { long result = fread(dest, 1, 1024, f); assert(result != 0); loaded += result; dest += result; // (  ) } fclose(f);

The memory allocated for image data is never released, because we use it throughout the program.

There is an amusing detail in the boot code: on the boot screen above the boot process bar, we draw the boot logo. However, we draw the loading screen before reading the image data! To cope with this cyclic dependency, we added a special rule to the image data packaging program: first of all, place the image data for the boot logo. They are loaded in the first one or two seconds after starting the game, and then can be displayed for the rest of the loading process.

 //  :     bool logoDrawn = false; while (loaded < size) { ... //       : if (!logoDrawn && loaded > ImageOffsets[IMG_BOOT_BACKGROUND + 1]) { Blit(IMG_BOOT_BACKGROUND, 101, 100); logoDrawn = true; } }

Preliminary image processing

How to find and use these images when drawing it? All images used in the game were saved in PNG files created in modern software. These PNGs were then processed in a special content baking program. The content baker converted PNGs into a format that can be easily processed with a 16-bit DOS program. Three files were created: a large binary file containing all the pixel data as palette indexes, a small binary file containing a palette of 256 colors common to all images, and a header file C containing metadata to name and find the images . Here is the metadata:

 //  :  ,   //     #define IMG_BUTTON 112 #define IMG_BUTTON_HOVER 113 #define IMG_CARD_FRONT 114 // ID  ... long ImageOffsets[] = { 0L, 4464L, 4504L, 4544L, ... }; short ImageWidths[] = { 122, 5, 5, 5, ... }; short ImageHeights[] = { 36, 5, 5, 5, ... }; byte huge *ImageData = /*      */

For each PNG file processed by a content maker, they assign an ID (for example, IMG_CARD_FRONT), write its width, height, and location of the data. To render an image, we call the render function, for example, Blit (IMG_CARD_FRONT, x, y). The render function then calls ImageInfo (IMG_CARD_FRONT, ...) to get the image data and metadata.

 //  :     void ImageInfo(short imageID, short *w, short *h, byte huge **image) { assert(imageID >= 0 && imageID < IMAGE_COUNT); *w = ImageWidths[imageID]; *h = ImageHeights[imageID]; *image = ImageData + ImageOffsets[imageID]; }

Low level optimization: some assembly language

After completing the implementation of work with graphics, the logic of the solitaire gameplay was completed quickly, and everything looked beautiful, although it worked very slowly. As always, the only way to efficiently optimize is code profiling. Borland C ++ includes Turbo Debugger, a very useful profiler, considering that it is 25 years old:

In our case, the results of profiling are not surprised. The program spent almost all the time on drawing procedures, copying image data into VGA memory. Copying hundreds of thousands of pixels per second to 386 at 20 MHz is a very demanding process! Studying the assembler code generated by the Borland compiler made us understand that there are many overheads that slow down the internal cycles of the drawing procedures:

Although we tried to adhere to C to the maximum, it was obvious that in order to optimize the critical internal cycles of the drawing procedures, we could not do without assembly language. Assembly language was often used in programming games of the 90s, when processors were slow, and compiler optimizations were not as effective. Therefore, we rewrote only the internal parts of the drawing functions on the built-in assembly code. Here is the built-in assembly code inside the Blit () function, which is used for most of the game rendering process:

 //  :     byte far *vgaPtr = /*   VGA */; byte far *imgPtr = /*     */; asm PUSHA asm PUSH DS asm LES DI, vgaPtr asm LDS SI, imgPtr asm MOV BX, h asm CLD each_row: asm MOV CX, lineWidth asm REP MOVSB //     ! asm ADD DI, dstSkip asm ADD SI, srcSkip asm DEC BX asm JNZ each_row asm POP DS asm POPA

Before this assembler code, there is a lot of C code that calculates where to draw, where to read the image data, how to insert the drawn image into the visible area, but most of the execution time is spent on the MOVSB REP command. This command is similar to memcpy () from C, but natively supported by x86 processors, and this is the fastest * way to copy memory on early x86 processors, such as 386. As in memcpy (), you need to specify a pointer to the source, a pointer to the recipient and the number (in registers), after which the processor automatically copies the necessary amount of data in a loop.

In Shenzhen Solitaire for MS-DOS, the built-in assembly language is used in just three sections of code, and they are all very similar to this code fragment. Using the built-in assembler in these critical drawing functions increased the speed of these functions by 5-10 times. Despite the many lines of setup code and other code running in the game, most of the CPU time was spent on simply copying bytes.

* In fact, REP MOVSW or REP MOVSD would be even faster, because instead of one, they copy 2 or 4 bytes at a time. However, in our version of Borland C ++, only 16-bit commands were supported, which prevented the use of MOVSD. We tried to use MOVSW, but we had difficulties with its proper operation. Even if REP MOVSW would have doubled the speed of the drawing procedure, it would still not be enough to solve our high-level performance problems.

Double buffering

The next problem to be solved was flicker.

In Mode X, 75 kB of VGA memory is required to describe the full screen graphics. But 256 kB of VGA memory was enough to store three “screens” simultaneously. In our code, we mentally called them Screen 0, Screen 1, and Screen 2. So far, we used Screen 0. To implement double buffering, we used Screen 0 and 1. At one point in time, one screen was the “front buffer” displayed on the screen, and the second was used as a “back buffer”, an area in which the game draws new or changed elements. When we are ready to display a new frame of graphics, we perform a “flip page”, a VGA function that instantly changes the displayed portion of memory.

Double buffering solved the flicker problem, but the game was still too slow. Simple editing of low-level rendering code was not enough. As usual, Abrash foresaw this. The main message of the Graphics Programming Black Book is that the choice of suitable algorithms and high-level design of the program gives a greater performance gain than low-level assembly tricks.

High-level optimization: static background

First, we noticed that most of the screen almost always remains the same and does not change from frame to frame.

The mouse cursor usually moves across the screen, sometimes several cards are dragged, but the background and most of the cards remain completely unchanged. To take advantage of this, we have assigned the last full screen of VGA Screen 2 memory as a “static background”. We began to draw all the rarely changeable interface elements (that is, almost everything except the mouse cursor) on Screen 2. That is, the procedure for drawing each frame was usually this:

Copy the static background into the back buffer.
Draw the mouse (and all drag cards) into the back buffer.
Make the back buffer visible.

This made it possible to get rid of the solid part of the work in most frames, but copying the entire static background was still a lot of work. Worse, when the static background changed, for example, when the map was lifted with the mouse, then at first the static background needed to be completely redrawn in addition to other time-lapse work. This meant that the game usually worked smoothly, but when you clicked on the cards or buttons there were noticeable and unacceptable braking.

At this stage of development, we had two main performance costs: redrawing the static background (from time to time) and copying the static background to the back buffer (in each frame). We found a simple solution to reduce the cost of redrawing the background: each stack of maps has its own area on the screen, and usually only one or two piles should be redrawn per frame. This meant that only 10-20% of the static background needed to be redrawn, which took proportionally less time.

High-level optimization: dirty rectangles

The last problem that prevented the game from operating at an acceptable frame rate was the copying of a static background that occurred every frame. Our solution was another classic technique from the era before the onset of hardware-accelerated graphics: tracking dirty rectangles. You may remember “dirty rectangles” on Windows 95; when a program hangs, such situations occurred often. When the program hung, the user could drag another window across the hung program window until it became empty or covered with graphical garbage. The areas that the user redrawn, dragging the window, were “dirty”. The hung program had to redraw them as soon as the user removed the foreground window from them.

We used dirty rectangles in a similar way. When the player moves the mouse or a part of the static background changes, we add this area to the list of dirty rectangles. Then, instead of copying the entire static background into the back buffer, we only copy dirty rectangles. Copying fewer pixels means that it will take less time to draw a frame, which increases the frame rate.

Thanks to this improvement, the game finally managed to work smoothly on our test machine. We decided that the majority of computers equipped with VGA-cards have better characteristics than our 20-MHz braking machine with a 386 processor, so we were confident that the performance of the game would be high enough for the release of the game.

Input

Compared to the graphics, all other aspects of the game were unusually simple. On IBM-compatible computers, most essential basic services are provided by the BIOS and DOS using an interrupt-based API. These services include reading keyboard and mouse data, reading and writing to disks, opening files and allocating memory. Here are some of the interruptions we used in the Shenzhen Solitaire:

INT 16h: keyboard API
INT 21h: DOS system calls
Int 33h: mouse API

These APIs should be used from assembly code, like this:

 //  :   BIOS    //  INT 16h, 1h     . mov AL, 16h mov AH, 01h int 16h //      . jnz no_key_pressed //   … no_key_pressed: //   ...

Reading the state of the mouse is done in a similar way: by calling INT 33h, 3h to get the current position of the mouse and the state of the buttons. Compared to direct access to hardware or even using modern operating system APIs, this is incredibly easy.

Sound

Sound effects are an important part of any game, but playing music and sound on very old PCs is a difficult task. The simplest option available to all IBM-compatible PCs is the “PC speaker” speaker. This speaker is controlled by a simple timer chip on the motherboard. This means that the speaker is very easy to configure for continuous sound playback at a constant frequency. Some PCs also have special sound cards, but they are much more difficult to use and they are not always in the system.

Due to the limitations of the PC speaker, we only reproduced short “musical” fragments at certain points in the game: when the game starts, closes it and after winning. In the PC era of the 386 processors, there is little opportunity to accurately measure time, and it is difficult to play even simple music through the PC speaker at the same time as other actions. Therefore, we again chose the simplest option: since the music fragments are very short, we simply stop the game for one or two seconds, that is, for the duration of the music effect. The code for setting and playing such fragments is very simple:

 void PlayStartupMusic(void) { sound(73); delay(220); sound(110); delay(220); sound(165); delay(220); sound(98); delay(220); sound(110); delay(220); sound(147); delay(440); nosound(); }

In addition to musical fragments, we wanted to add a sound effect with which cards would be drawn and lay down so that the game would feel more physical. It was also implemented through the PC speaker by temporarily turning off the automatic sound generator based on a timer and turning it on and off manually to create a short “click”.

Release the game!

Porting Shenzhen Solitaire to MS-DOS was both simpler and harder than we expected. Despite the enormous difference in the power of the processors of the target equipment (single-core Intel 80386SX with a frequency of 20 MHz) and our usual equipment (quad-core Intel Core i7-4770K with a frequency of 3500 MHz), the non-optimized gameplay logic worked with lightning speed and had little effect on performance. But the drawback was that everything seems to be fast when drawing all the pixels on the screen takes almost 150 milliseconds! In the process, we learned a lot about the architecture of computers, which now can only be considered understandable only to the initiate. Despite the informative and interesting process, we are unlikely to use segment registers in future Zachtronics games. We are not so sad!

If you are interested in playing at the port of Shenzhen Solitaire under MS-DOS and reading this article until September 12, 2017, rate our project on Kickstarter and order a copy of the game as part of our release only for floppy disks. Full source code included!

Source: https://habr.com/ru/post/336840/

All Articles