The first part is here.
8080 processor disassembler
Getting acquainted
We will need information about opcodes and their corresponding instructions. When you search the Internet, you will notice that information about the 8080 and the Z80 is often mixed together. The Z80 was the successor to the 8080: it executes all 8080 instructions with the same hex codes, but also has additional instructions. I think you should avoid Z80 material for now, so as not to get confused. I created an opcode table for our work; it is here.
Every processor has a reference manual from its manufacturer, often called something like a "Programmer's Environment Manual". The one for the 8080 is called the "Intel 8080 Microcomputer Systems User's Manual". Such a manual has traditionally been called a "data book", so I will call it that too. I managed to download the 8080 data book from http://www.datasheetarchive.com/. This PDF is a poor-quality scan, so if you find a better version, use it.
Let's get started and look at the Space Invaders ROM file. (The ROM file can be found on the Internet.) I work on Mac OS X, so I simply use the "hexdump" command to view its contents. To follow along, find a hex editor for your platform. Here are the first 128 bytes of the "invaders.h" file:
$ hexdump -v invaders.h
0000000 00 00 00 c3 d4 18 00 00 f5 c5 d5 e5 c3 8c 00 00
0000010 f5 c5 d5 e5 3e 80 32 72 20 21 c0 20 35 cd cd 17
0000020 db 01 0f da 67 00 3a ea 20 a7 ca 42 00 3a eb 20
0000030 fe 99 ca 3e 00 c6 01 27 32 eb 20 cd 47 19 af 32
0000040 ea 20 3a e9 20 a7 ca 82 00 3a ef 20 a7 c2 6f 00
0000050 3a eb 20 a7 c2 5d 00 cd bf 0a c3 82 00 3a 93 20
0000060 a7 c2 82 00 c3 65 07 3e 01 32 ea 20 c3 3f 00 cd
0000070 40 17 3a 32 20 32 80 20 cd 00 01 cd 48 02 cd 13
...
This is the start of the Space Invaders program. Each hexadecimal number is either an instruction or data for the program. We can use the data book or other reference material to work out what these hex codes mean. Let's explore the ROM image a little more.
The first byte of this program is $00. Looking at the table, we see that this is a NOP, as are the next two bytes. (Don't worry about them: Space Invaders probably used these instructions as a delay to let the system settle after power-on.)
The fourth instruction is $C3, which according to the table is JMP. The definition of JMP says that it takes a two-byte address, so the next two bytes are the jump target. Then two more NOPs follow... you know what? Let me just annotate the first few instructions by hand...
0000 00 NOP
0001 00 NOP
0002 00 NOP
0003 c3 d4 18 JMP $18d4
0006 00 NOP
0007 00 NOP
0008 f5 PUSH PSW
0009 c5 PUSH B
000a d5 PUSH D
000b e5 PUSH H
000c c3 8c 00 JMP $008c
000f 00 NOP
0010 f5 PUSH PSW
0011 c5 PUSH B
0012 d5 PUSH D
0013 e5 PUSH H
0014 3e 80 MVI A,#0x80
0016 32 72 20 STA $2072
It seems there must be some way to automate this process ...
Disassembler, part 1
A disassembler is a program that simply translates a stream of hex numbers back to source code in assembly language. We performed this task by hand in the previous section - a great opportunity to automate this work. Writing this piece of code, we get to know the processor and get a handy piece of debugging code, which is useful when writing a CPU emulator.
Here is the disassembly algorithm for 8080 code:
- Read the code into a buffer
- Get a pointer to the start of the buffer
- Use the byte at the pointer to determine the opcode
- Print the name of the opcode, using the bytes after the opcode as data if necessary
- Advance the pointer by the number of bytes used by this instruction (1, 2 or 3 bytes)
- If the end of the buffer has not been reached, go back to the third step
To lay the groundwork for the routine, I have included a few instructions below. I will make the full routine available for download, but I recommend that you try to write it yourself. It will not take long, and along the way you will learn the 8080 instruction set.
int Disassemble8080Op(unsigned char *codebuffer, int pc)
{
    unsigned char *code = &codebuffer[pc];
    int opbytes = 1;
    printf ("%04x ", pc);
    switch (*code)
    {
        case 0x00: printf("NOP"); break;
        case 0x01: printf("LXI B,#$%02x%02x", code[2], code[1]); opbytes=3; break;
        case 0x02: printf("STAX B"); break;
        case 0x03: printf("INX B"); break;
        case 0x04: printf("INR B"); break;
        case 0x05: printf("DCR B"); break;
        case 0x06: printf("MVI B,#$%02x", code[1]); opbytes=2; break;
        case 0x07: printf("RLC"); break;
        case 0x08: printf("NOP"); break;
        case 0x3e: printf("MVI A,#0x%02x", code[1]); opbytes = 2; break;
        case 0xc3: printf("JMP $%02x%02x",code[2],code[1]); opbytes = 3; break;
    }
    printf("\n");
    return opbytes;
}
While writing this routine and studying each opcode, I learned a lot about the 8080 processor.
- I realized that most instructions take one byte, and the rest take two or three. The code above assumes that an instruction is one byte long, but two- and three-byte instructions change the value of the variable "opbytes" so that the correct instruction size is returned.
- The 8080 has registers named A, B, C, D, E, H and L. There is also a program counter (PC) and a separate stack pointer (SP).
- Some instructions work with registers in pairs: B and C form a pair, as do D and E, and H and L.
- A is a special register; many instructions work with it.
- HL is also a special register pair; it is used as the address whenever data is read from or written to memory.
- I was curious about the RST instruction, so I read a little of the data book. I noticed that it runs code at fixed locations, and that the data book mentions it in connection with interrupt handling. On further reading, it turned out that all the code at the start of the ROM consists of interrupt service routines (ISRs). Interrupts can be generated in software with the RST instruction, or by external sources (outside the 8080 processor).
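The register pairs from the list above can be pictured as two 8-bit registers read together as one 16-bit value. Here is a minimal sketch of that idea; the struct `Registers8080` and the helper names are my own illustrative choices, not something taken from the data book.

```c
#include <stdint.h>

/* Hypothetical register file; the layout and names are illustrative only. */
typedef struct {
    uint8_t a, b, c, d, e, h, l;
    uint16_t sp, pc;
} Registers8080;

/* Combine a high and a low 8-bit register into one 16-bit pair value. */
static uint16_t pair_value(uint8_t high, uint8_t low)
{
    return (uint16_t)((high << 8) | low);
}

/* HL used as an address: H holds the high byte, L the low byte. */
static uint16_t hl_address(const Registers8080 *r)
{
    return pair_value(r->h, r->l);
}
```

For example, with H = $20 and L = $c0, the pair HL refers to address $20c0.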
To turn it all into a working program, I simply concocted a procedure that would do the following:
- Opens a file containing compiled 8080 code
- Reads it into a memory buffer
- Steps through the memory buffer, calling Disassemble8080Op
- Advances PC by the value returned by Disassemble8080Op
- Exits at the end of the buffer
It might look something like this:
int main (int argc, char**argv)
{
    FILE *f= fopen(argv[1], "rb");
    if (f==NULL)
    {
        printf("error: Couldn't open %s\n", argv[1]);
        exit(1);
    }
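The listing above stops right after the file is opened. The remaining steps (read the file into a buffer, walk the buffer, advance by the returned size) could be sketched roughly as follows. This is my own sketch, not the original code: `DisassembleFile` and `StubDisassembleOp` are hypothetical names, and the stub only knows that $c3 (JMP) is three bytes long, so that the example stays self-contained.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the real Disassemble8080Op (an assumption for this sketch):
   it treats 0xc3 (JMP) as three bytes and everything else as one byte. */
static int StubDisassembleOp(const unsigned char *codebuffer, int pc)
{
    printf("%04x %02x\n", pc, codebuffer[pc]);
    return codebuffer[pc] == 0xc3 ? 3 : 1;
}

/* Read the whole file into memory and walk it instruction by instruction.
   Returns the number of instructions visited, or -1 on error
   (an empty file is treated as an error here). */
static int DisassembleFile(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    /* Find the file size, then read everything into one buffer. */
    fseek(f, 0L, SEEK_END);
    long fsize = ftell(f);
    fseek(f, 0L, SEEK_SET);
    if (fsize <= 0) {
        fclose(f);
        return -1;
    }

    unsigned char *buffer = malloc((size_t)fsize);
    if (buffer == NULL) {
        fclose(f);
        return -1;
    }
    size_t got = fread(buffer, 1, (size_t)fsize, f);
    fclose(f);

    /* Advance pc by the size each instruction reports. */
    int pc = 0;
    int count = 0;
    while (pc < (int)got) {
        pc += StubDisassembleOp(buffer, pc);
        count++;
    }
    free(buffer);
    return count;
}
```

A main function would then only need to check argc and call DisassembleFile(argv[1]). Note that a real opcode routine reads code[1] and code[2], so a multi-byte instruction at the very end of the buffer would need a bounds check that this sketch leaves out.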
In the second part, we will study the output produced by disassembling the Space Invaders ROM.
Memory map
Before we start writing a processor emulator, we need to look at one more aspect. Every CPU can address a certain range of memory. Older processors had 16-, 24- or 32-bit addresses. The 8080 has 16 address pins, so its addresses range from 0 to $FFFF.
To work out the game's memory map, we need to do a little investigating. Piecing together information from here and here, I learned that the ROM is located at address 0 and that the game has 8 KB of RAM starting at $2000.
The author of one of those pages found that the video buffer starts in RAM at $2400, and also explained how the 8080's I/O ports are used to communicate with the controls and the sound hardware. Great!
Inside the invaders.zip ROM archive, which can be found on the Internet, there are four files: invaders.e, .f, .g and .h. After some googling, I came across an informative article that explains how to place these files in memory:
Space Invaders, (C) Taito 1978, Midway 1979
CPU: Intel 8080, 2 MHz (similar to the Zilog Z80)
Interrupts: $cf (RST 8) vblank, $d7 (RST $10) vblank.
Video: 256(x)*224(y), 60 Hz.
Video hardware: 7168 bytes, 1 bpp (32 bytes per scanline).
Sound: SN76477.
Memory map:
ROM
$0000-$07ff: invaders.h
$0800-$0fff: invaders.g
$1000-$17ff: invaders.f
$1800-$1fff: invaders.e
RAM
$2000-$23ff: work RAM
$2400-$3fff: video RAM
$4000-: RAM mirror
That article contains more useful information, but we are not ready to use it yet.
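To make the memory map a little more concrete, here is a small sketch that classifies an address by region, using the ROM ranges listed above and the RAM and video addresses mentioned earlier ($2000 and $2400). The enum and function names are hypothetical, chosen only for this illustration.

```c
#include <stdint.h>

/* Hypothetical region labels for the Space Invaders address space. */
typedef enum {
    REGION_ROM,
    REGION_WORK_RAM,
    REGION_VIDEO_RAM,
    REGION_OTHER
} Region;

/* Classify a 16-bit address according to the memory map. */
static Region classify_address(uint16_t addr)
{
    if (addr <= 0x1fff) return REGION_ROM;        /* $0000-$1fff: invaders.h/g/f/e */
    if (addr <= 0x23ff) return REGION_WORK_RAM;   /* $2000-$23ff */
    if (addr <= 0x3fff) return REGION_VIDEO_RAM;  /* $2400-$3fff */
    return REGION_OTHER;                          /* $4000 and up */
}

/* Which 2 KB ROM file an address falls into:
   0 = invaders.h, 1 = .g, 2 = .f, 3 = .e, or -1 if the address is not in ROM. */
static int rom_file_index(uint16_t addr)
{
    return addr <= 0x1fff ? addr / 0x0800 : -1;
}
```

For instance, the jump target $18d4 seen at the start of the ROM falls into the fourth file, invaders.e.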
Gory details
If you want to know how large a processor's address space is, you can work it out from its specifications. The 8080 specification tells us that the processor has 16 address pins, that is, it uses 16-bit addressing. (Instead of the specification, you could just as well consult the data book, Wikipedia, Google and so on...)
There is quite a lot of information about the Space Invaders hardware on the Internet. If you could not find this information, you could obtain it in a couple of ways:
- Watch the code running in an emulator and see what it does. Take notes and watch carefully. It should be simple enough to work out, for example, where the game expects RAM to be. It is also easy to determine where it looks for video memory (we will spend some time studying this).
- Find the schematic of the arcade machine and trace the signals from the CPU's address pins to see where they go. For example, A15 (the highest address line) might go only to the ROM. From that we could conclude that ROM addresses start at $8000.
It can be very interesting and informative to find out on your own, watching the execution of the code. Someone had to deal with all this for the first time.
Command line development
The task of this tutorial is not to teach you to write code for a specific platform, although we will not be able to avoid platform-specific code. I hope that before the start of the project you already knew how to compile for your target platform.
When you work with standalone code that simply reads files and prints text to the console, there is no need for a heavyweight development environment. In fact, it only complicates things. All you need is a text editor and a terminal.
I think that anyone who wants to program at a low level should know how to create simple programs from the command line. You may think that I am teasing you, but your elite hacker skills are not worth much if you cannot function outside of Visual Studio.
On a Mac, you can use TextEdit and Terminal. On Linux, you can use gedit and Konsole. On Windows, you can install cygwin and its tools, and then use Notepad++ or another text editor. If you want to be really cool, all of these platforms support vi and emacs for text editing.
Compiling a single-file program from the command line is trivial. Suppose you saved your program in a file called 8080dis.c. Navigate to the folder containing this file and compile it like this: cc 8080dis.c. If you do not specify the name of the output file, it will be called a.out, and you can run it by typing ./a.out.
That's all.
Using a debugger
If you are working on a Unix-based system, here is a brief introduction to debugging command-line programs with GDB. Compile the program like this: cc -g -O0 8080dis.c. The -g option generates debugging information (so that you can debug at the source level), and -O0 disables optimization so that, as you step through the program, the debugger can follow the code exactly as it appears in the source.
Here is an annotated log of the start of the debug session. My comments are in lines marked with a pound sign (#).
$ gdb a.out
GNU gdb 6.3.50-20050815 (Apple version gdb-1708) (Mon Aug 8 20:32:45 UTC 2011)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-apple-darwin"...Reading symbols for shared libraries .. done
# Set a breakpoint on the function we want to examine
(gdb) b Disassemble8080Op
Breakpoint 1 at 0x1000012ef: file 8080dis.c, line 7.
# Run the program with "invaders.h" as its argument
(gdb) run invaders.h
Starting program: /Users/bob/Desktop/invaders/a.out invaders.h
Reading symbols for shared libraries +........................ done
Breakpoint 1, Disassemble8080Op (codebuffer=0x100801000 "", pc=0) at 8080dis.c:7
7 unsigned char *code = &codebuffer[pc];
# In gdb, n is short for "next"; "next" executes one line
(gdb) n
8 int opbytes = 1;
# p is short for "print"; this prints the value of *code
(gdb) p *code
$1 = 0 '\0'
(gdb) n
9 printf("%04x ", pc);
# On an empty line, gdb repeats the previous command, here "next"
(gdb)
10 switch (*code)
(gdb) n
# As expected, it goes to the "NOP" case
12 case 0x00: printf("NOP"); break;
(gdb) n
285 printf("\n");
# c is short for "continue": run until the next breakpoint
(gdb) c
Continuing.
0000 NOP
# We are back in Disassemble8080Op. Judging by *code,
# this should be another NOP, so just continue.
Breakpoint 1, Disassemble8080Op (codebuffer=0x100801000 "", pc=1) at 8080dis.c:7
7 unsigned char *code = &codebuffer[pc];
(gdb) c
Continuing.
0001 NOP
Breakpoint 1, Disassemble8080Op (codebuffer=0x100801000 "", pc=2) at 8080dis.c:7
7 unsigned char *code = &codebuffer[pc];
(gdb) n
8 int opbytes = 1;
(gdb) p *code
$2 = 0 '\0'
# Another NOP, so continue
(gdb) c
Continuing.
0002 NOP
Breakpoint 1, Disassemble8080Op (codebuffer=0x100801000 "", pc=3) at 8080dis.c:7
7 unsigned char *code = &codebuffer[pc];
(gdb) n
8 int opbytes = 1;
# This one is different!
(gdb) p *code
$3 = 195 '?'
# print can show the value in hex with /x
(gdb) p /x *code
$4 = 0xc3
(gdb) n
9 printf("%04x ", pc);
(gdb)
10 switch (*code)
(gdb)
# C3 is JMP, as expected.
219 case 0xc3: printf("JMP $%02x%02x",code[2],code[1]); opbytes = 3; break;
(gdb)
285 printf("\n");
Disassembler, part 2
Run the disassembler on the ROM file invaders.h and look at the output.
0000 NOP
0001 NOP
0002 NOP
0003 JMP $18d4
0006 NOP
0007 NOP
0008 PUSH PSW
0009 PUSH B
000a PUSH D
000b PUSH H
000c JMP $008c
000f NOP
0010 PUSH PSW
0011 PUSH B
0012 PUSH D
0013 PUSH H
0014 MVI A,#$80
0016 STA $2072
0019 LXI H,#$20c0
001c DCR M
001d CALL $17cd
0020 IN #$01
0022 RRC
0023 JC $0067
0026 LDA $20ea
0029 ANA A
002a JZ $0042
002d LDA $20eb
0030 CPI #$99
0032 JZ $003e
0035 ADI #$01
0037 DAA
0038 STA $20eb
003b CALL $1947
003e XRA A
003f STA $20ea
The first instructions match the ones we wrote down by hand earlier. After them come a few new instructions. Use the hex data shown earlier as a reference. Note that if you compare memory with the instructions, the addresses appear to be stored in memory backwards. They are. This is called little-endian: little-endian machines like the 8080 store the low byte of a number first in memory. (More on endianness below.)
I mentioned above that this code is the ISR code for Space Invaders. The code for interrupts 0, 1, 2, ... 7 starts at addresses $0, $8, $10, ... $38. It appears that the 8080 sets aside just 8 bytes for each ISR. Sometimes the Space Invaders program gets around this by simply jumping to another address with more room. (This happens at $000c.)
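The interrupt addresses follow a simple rule: multiply the interrupt number by 8. The RST opcodes also have the bit pattern 11NNN111, where NNN is the interrupt number, so the target address can be read straight out of the opcode. A small sketch of both calculations, with function names of my own choosing:

```c
#include <stdint.h>

/* Address that RST n jumps to: eight bytes per interrupt number (n = 0..7). */
static uint16_t rst_vector(unsigned n)
{
    return (uint16_t)((n & 7u) * 8u);
}

/* RST opcodes look like 11NNN111 in binary, so masking the opcode
   with 0x38 leaves NNN * 8, which is the target address. */
static uint16_t rst_vector_from_opcode(uint8_t opcode)
{
    return (uint16_t)(opcode & 0x38);
}
```

With this, opcode $cf gives address $08 and opcode $d7 gives address $10, which are the two interrupts that Space Invaders uses.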
In addition, ISR 2 appears to be longer than the memory set aside for it. Its code runs on into $0018 (the slot for ISR 3). I assume Space Invaders does not expect anything to use interrupt 3.
The Space Invaders ROM from the Internet comes in four parts. I will explain this below, but for now, to move on to the next section, we need to combine these four files into one. On Unix:
cat invaders.h > invaders
cat invaders.g >> invaders
cat invaders.f >> invaders
cat invaders.e >> invaders
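If you would rather not concatenate the files, the same result can be had by reading each file into a 64 KB array at its own offset. This is only a sketch of one possible approach; `load_rom_at` is a hypothetical helper, not part of my disassembler.

```c
#include <stdio.h>
#include <stdint.h>

/* Copy one ROM file into `memory` starting at `offset`.
   Returns the number of bytes read, or -1 if the offset is out of range
   or the file cannot be opened. */
static long load_rom_at(uint8_t *memory, size_t memory_size,
                        const char *filename, size_t offset)
{
    if (offset >= memory_size)
        return -1;

    FILE *f = fopen(filename, "rb");
    if (f == NULL)
        return -1;

    /* Never read past the end of the memory array. */
    size_t n = fread(memory + offset, 1, memory_size - offset, f);
    fclose(f);
    return (long)n;
}
```

With a 0x10000-byte array called memory (again, a name chosen for illustration), the four calls would use offsets 0x0000 for invaders.h, 0x0800 for invaders.g, 0x1000 for invaders.f and 0x1800 for invaders.e.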
Now run the disassembler on the resulting file "invaders". When the program starts at $0000, the first thing it does is jump to $18d4. I will treat this as the start of the program. Let's take a quick look at that code.
18d4 LXI SP,#$2400
18d7 MVI B,#$00
18d9 CALL $01e6
So it performs two operations and calls $01e6. I'm going to paste in the code from there, following the jumps:
01e6 LXI D,#$1b00
01e9 LXI H,#$2000
01ec JMP $1a32
.....
1a32 LDAX D
1a33 MOV M,A
1a34 INX H
1a35 INX D
1a36 DCR B
1a37 JNZ $1a32
1a3a RET
As we saw in the Space Invaders memory map, some of these addresses are interesting. $2000 is the start of the program's "work RAM". $2400 is the start of video memory.
Let's add comments to the code to explain what it does right at launch:
18d4 LXI SP,#$2400 ; SP=$2400 (stack pointer)
18d7 MVI B,#$00 ; B=0
18d9 CALL $01e6
.....
01e6 LXI D,#$1b00 ; DE=$1B00
01e9 LXI H,#$2000 ; HL=$2000
01ec JMP $1a32
.....
1a32 LDAX D ; A = (DE); on the first pass, the byte at $1B00
1a33 MOV M,A ; store A into (HL); on the first pass, $2000
1a34 INX H ; HL = HL + 1 (now $2001 on the first pass)
1a35 INX D ; DE = DE + 1 (now $1B01 on the first pass)
1a36 DCR B ; B = B - 1 (now 0xff on the first pass, since it started at 0)
1a37 JNZ $1a32 ; loop until b=0
1a3a RET
It looks like this code copies 256 bytes from $1b00 to $2000. Why? I don't know. You can study the program in more detail and think about what it is doing.
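Why 256 bytes when B starts at 0? Because B is an 8-bit register, the first DCR B wraps it around to $ff, and the loop only ends when B comes back to 0. Here is a rough C rendering of the loop at $1a32; the function name and the idea of passing the registers as parameters are my own choices for illustration.

```c
#include <stdint.h>

/* C rendering of the loop at $1a32. `memory` stands for the 64 KB address
   space; de, hl and b are the register values on entry.
   Returns the number of bytes copied. */
static int copy_block(uint8_t *memory, uint16_t de, uint16_t hl, uint8_t b)
{
    int copied = 0;
    do {
        memory[hl] = memory[de];  /* LDAX D ; MOV M,A */
        hl++;                     /* INX H */
        de++;                     /* INX D */
        b--;                      /* DCR B: 0 wraps around to 0xff */
        copied++;
    } while (b != 0);             /* JNZ $1a32 */
    return copied;
}
```

Calling copy_block(memory, 0x1b00, 0x2000, 0) performs 256 iterations, matching the reading of the disassembly above.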
There is a problem here. In an arbitrary chunk of memory that contains code, data is likely to be interleaved with it.
For example, sprites for the game characters may be mixed in with the code. When the disassembler reaches such a region of memory, it will treat it as code and keep chewing through it. If you are unlucky, any code disassembled after this chunk of data may be wrong as well.
For now there is almost nothing we can do about it. Just keep in mind that the problem exists. If you see something like this:
- a jump from known-good code to an instruction that does not appear in the disassembly listing
- a stream of meaningless code (for example, POP B POP B POP B POP C XTHL XTHL XTHL)
then there is probably data here that has thrown off part of the disassembly. If this happens, you need to restart the disassembly at a different offset.
It turns out that Space Invaders has runs of zeros at intervals. If our disassembly ever gets out of step, the zeros will bring it back into sync.
A detailed analysis of the Space Invaders code can be found here.
Endian
Different processor models store bytes in memory in different orders, and this matters for any data larger than one byte. Big-endian machines store data from the most significant byte to the least significant. Little-endian machines store it from the least significant to the most significant. If the 32-bit integer 0xAABBCCDD is written to memory on each kind of machine, it will look like this:
In little-endian: $DD $CC $BB $AA
In big-endian: $AA $BB $CC $DD
I started to program on Motorola processors, which used big-endian, so it seemed to me more "natural", but then I got used to little-endian.
My disassembler and emulator avoid endianness problems entirely, because they only ever read one byte at a time. If you want, for example, to use a 16-bit read to fetch an address from the ROM, keep in mind that such code is not portable between CPU architectures.
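Reading one byte at a time and combining the bytes by hand is what keeps the code portable. A minimal sketch of that approach, using a helper name (`read_le16`) that I made up for this example:

```c
#include <stdint.h>

/* Build a 16-bit value from two bytes stored low byte first.
   Because the bytes are combined with shifts, the result is the same
   on little-endian and big-endian hosts. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}
```

For the JMP at address $0003 in the ROM (bytes c3 d4 18), applying read_le16 to the two bytes after the opcode yields $18d4.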