I want to talk about my implementation of the Nintendo Entertainment System (NES) game console on an FPGA. In the former Soviet Union the console is known as the Dendy.
If you want to watch the video and feel nostalgic, read on. I think most people my age remember this game console well. I had one too. In the 90s there was not much money in our family, so I did not even have a Dendy, but an entirely Chinese clone, the Subor. I must say that it worked without any complaints, apart from the joysticks, which broke often and had to be repaired many times. Of course, before long I could not resist the temptation and took the console apart. It was built on two printed circuit boards. One carried the RF modulator and the power supply on an LM7805, which was mounted without a heatsink and ran very hot. The other was the processor board, which, unfortunately, was built around a single unpackaged chip under a blob of epoxy. As far as I remember, there was nothing else on it except a crystal, a pair of capacitors and the cartridge connector. In those days it was very hard to find any information, and I did not even know which processor the Dendy ran on. Only once did I see a NES built from discrete chips, at the home of a neighbor who was a radio amateur.
A little more than six months ago I ordered an STM32 development board on eBay, saw relatively inexpensive FPGA kits based on the Altera Cyclone II, and without thinking twice ordered one of those as well. Judging by the forums and the opinions of electronics engineers I know, FPGAs are still surrounded by an aura of inaccessibility and complexity. I, too, once fell for this misconception and did not give FPGAs the attention they deserve, which, as it turned out, was a mistake. I fell in love with FPGAs at first sight! Things that are difficult for an ordinary radio amateur who works with microcontrollers suddenly became achievable: for example, proper work with SDRAM, or driving a laptop LCD panel over an LVDS interface (where the frequencies are simply brutal). My appetite grew, and I bought an Altera DE2-115 board, which is the one used for the NES project. I now very much regret not taking up FPGAs 10 years ago. Do not repeat my mistake: FPGAs are fun and not that difficult at all!
Having blinked an LED (by the way, unlike on the STM32, where you first have to configure a bunch of peripherals, on an FPGA this is almost indecently simple) and experimented with output to a monitor and with sound, I decided to build something more serious.
What is all this for? As they say, just for fun. Of course, someone may say that this is a very old platform and there is no particular reason to recreate it on an FPGA, but I personally enjoyed working on this project and seeing the final result. It is like accidentally finding and restoring a toy from your childhood, one with warm memories attached to it.
Besides, a hardware implementation feels noticeably different in use from a software emulator. Part of this is, of course, a psychological effect, but the NES on an FPGA keeps the timings more precisely, and there are none of the subtle delays and video artifacts that, in a software implementation, come from attempts to optimize the graphics processor algorithm, which is quite hard to do correctly.
A description of the NES architecture can be found online in Russian, for example, here.
On the whole the description is quite good, but you cannot build a working emulator from it, because the PPU registers are described very superficially in that article and some very important points are omitted.
In this article I will not describe the architecture and every control register in detail, but I will try to point out possible problems in implementing an emulator and describe the solutions I used in my project.
This is the schematic of the Nintendo Famicom:
Image from nesdev.com
The core of the 2A03 processor (U6) is based on the MOS Technology 6502 eight-bit microprocessor. The same chip also contains the DMA controller and the audio processor.
The CPU address decoder is built on the 74LS139 chip (U3). The address bus is 16 bits wide, so the processor can address up to 64 KB. The address space is laid out as follows.
The graphics processor (PPU) is the 2C02, chip U5. To save PPU pins, the lower 8 bits of the address bus are multiplexed with the data bus, so a memory access takes two clock cycles. The address is demultiplexed by chip U2, a 74LS373 latch (the Soviet analogue is the IR22).
Storing one video page takes 1 KB of memory (the background image is made up of 8x8 pixel blocks, or tiles, and the screen holds 30 rows of 32 tiles). The PPU architecture provides for 4 video pages, but the NES itself has only 2 KB of video memory (chip U4), and when all 4 pages are used the missing 2 KB must be located on the cartridge (the mirroring of the pages and the PPU address space are discussed below). Honestly, to me as an engineer this looks a little wild. Clearly, such measures were driven by marketing considerations: the console was made cheaper at the cost of a more expensive cartridge. I do not know how much 4 KB of SRAM cost at that time, but perhaps this is why there are not many games that use all 4 pages.
The master oscillator is built on transistors Q2 and Q3 and generates 21.47727 MHz in the NTSC version and 26.6017 MHz in the PAL version. The CPU core runs at about 1.79 MHz, and the PPU pixel clock is three times the CPU frequency, approximately 5.37 MHz. The relatively high oscillator frequency of 21.47727 MHz is needed to encode the color information in the composite video signal and to generate the color subcarrier burst.
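The relationship between these frequencies is easy to check with a little arithmetic. The sketch below assumes divide-by-12 and divide-by-4 ratios for the NTSC CPU and PPU clocks. These dividers are not stated in the text above, but they reproduce the quoted 1.79 MHz and 5.37 MHz figures.

```python
# Clock derivation for the NTSC console, using the figures quoted in the text.
MASTER_NTSC_HZ = 21_477_270  # 21.47727 MHz master oscillator

# Assumption: the CPU divides the master clock by 12, the PPU by 4.
CPU_DIVIDER = 12
PPU_DIVIDER = 4


def derived_clocks(master_hz):
    """Return (cpu_hz, ppu_hz) for a given master oscillator frequency."""
    return master_hz / CPU_DIVIDER, master_hz / PPU_DIVIDER


cpu_hz, ppu_hz = derived_clocks(MASTER_NTSC_HZ)
print(f"CPU: {cpu_hz / 1e6:.3f} MHz, PPU: {ppu_hz / 1e6:.3f} MHz")
```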
Cartridge emulation
At first I wanted to find original NES cartridges and use them, but I did not manage to, and that turned out well, because it forced me to emulate the cartridge. In the simplest case a cartridge is just two ROM chips: the CPU program memory (PRG ROM) and the PPU character generator memory (CHR ROM). In this case the PRG ROM is at most 32 KB and the CHR ROM at most 8 KB. This is enough to run simple games such as Super Mario Bros., Lode Runner, Popeye and so on. Of course, 32 KB is far too little for a more or less serious game, so special circuits (mappers) are used to switch memory banks, which extends the available space to several megabytes. There is a huge variety of mappers. Some are built from 74-series discrete logic: the UxROM mapper, for example, uses a 74HC161 counter as a latch together with a 74HC32 (four OR gates). Others are specialized ASICs, such as the MMC3. So far only these two mappers are implemented in the project. The MMC3 was not chosen by chance: most of everyone's favorite games were released on it.
A typical cartridge holds about 256 KB, and cartridges larger than 1 MB are rare. The DE2-115 has 2 MB of SRAM (1M x 16) and 128 MB of SDRAM. I decided that, in the spirit of "640 KB ought to be enough for anybody", 2 MB is more than enough for a cartridge, and allocated 1 MB for PRG ROM and 1 MB for CHR ROM. The address space is shared: PRG data is stored in the high byte of each word and CHR data in the low byte. Images are loaded from an SD card, from files in the iNES format. I used a Nios II processor to handle FAT and load the files (as far as the NES itself is concerned, everything is implemented in hardware and the Nios is not involved).
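To make the loading step concrete, here is a minimal sketch of what a loader has to do with an iNES file: decode the 16-byte header, then interleave PRG and CHR bytes into 16-bit words as described above. The function names `parse_ines_header` and `pack_words` are hypothetical and are not taken from the project. The header layout is the standard iNES one, and only the fields needed here are decoded.

```python
import struct

INES_MAGIC = b"NES\x1a"
PRG_UNIT = 16 * 1024   # PRG ROM size unit in the iNES header
CHR_UNIT = 8 * 1024    # CHR ROM size unit in the iNES header


def parse_ines_header(header):
    """Decode the 16-byte iNES header into the fields a loader needs."""
    if len(header) < 16 or header[:4] != INES_MAGIC:
        raise ValueError("not an iNES image")
    prg_units, chr_units, flags6, flags7 = struct.unpack_from("4B", header, 4)
    return {
        "prg_size": prg_units * PRG_UNIT,
        "chr_size": chr_units * CHR_UNIT,   # 0 means the cartridge uses CHR RAM
        "mapper": (flags7 & 0xF0) | (flags6 >> 4),
        "vertical_mirroring": bool(flags6 & 0x01),
        "has_trainer": bool(flags6 & 0x04),
    }


def pack_words(prg, chr_data):
    """Pack PRG into the high byte and CHR into the low byte of 16-bit words."""
    length = max(len(prg), len(chr_data))
    prg = prg.ljust(length, b"\x00")
    chr_data = chr_data.ljust(length, b"\x00")
    return [(p << 8) | c for p, c in zip(prg, chr_data)]
```

Padding the shorter image with zeros is a choice made for this sketch. The hardware only needs each ROM to be addressable within its own half of the word.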
The diagram shows a cartridge emulation unit:
The unit is connected to the Nios II processor by a 4-bit address bus and an 8-bit data bus. The unit's controller has 7 control registers, which are used to control the state of the NES: suspend it, perform a hardware reset, and set the video page mirroring mode, the mapper type and the PRG and CHR ROM address widths. For loading an image there are commands to select the target area (CHR or PRG) and to reset the address. After each byte is written, the address is auto-incremented.
Since the memory address bus is shared by the PPU and the CPU, the multiplexer unit runs at a higher frequency, in my case 32.22 MHz, which is 6 times the PPU frequency. The now separate address buses then enter the mapper block, where MMC3 and UxROM are implemented, and the active mapper is selected by the state of a control register. Adding support for any other mapper, if needed, is very easy.
For example, the UxROM implementation looks like this:
On the original NES, nothing in the hardware prevents a cartridge from carrying RAM instead of a character generator ROM and initializing it at run time by writing through the PPU registers. Such cartridges and games exist, for example Contra, Castlevania and many other UxROM games. This approach has its advantages: some of the character generator images can be synthesized in software, and the character generator contents can be stored in compressed form and decompressed into CHR RAM, which saves space on the cartridge.
Since I already store CHR data in RAM, I got this feature by default.
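As an illustration of how little logic a discrete mapper needs, here is a behavioural sketch of UxROM-style PRG banking in Python. It is not the project's Verilog. It assumes the usual UxROM arrangement, in which a write anywhere in 0x8000-0xFFFF loads the bank latch, 0x8000-0xBFFF shows the selected 16 KB bank and 0xC000-0xFFFF is fixed to the last bank. The class and method names are hypothetical.

```python
class UxROM:
    """Behavioural model of UxROM PRG banking (16 KB banks)."""

    BANK_SIZE = 16 * 1024

    def __init__(self, prg_rom):
        if len(prg_rom) == 0 or len(prg_rom) % self.BANK_SIZE:
            raise ValueError("PRG ROM must be a whole number of 16 KB banks")
        self.prg_rom = prg_rom
        self.bank_count = len(prg_rom) // self.BANK_SIZE
        self.bank = 0  # value held in the bank-select latch

    def cpu_write(self, addr, value):
        """Any write to 0x8000-0xFFFF loads the bank-select latch."""
        if addr >= 0x8000:
            self.bank = value % self.bank_count

    def cpu_read(self, addr):
        """0x8000-0xBFFF is switchable; 0xC000-0xFFFF is the last bank."""
        if addr < 0x8000:
            raise ValueError("address is outside PRG ROM")
        bank = self.bank if addr < 0xC000 else self.bank_count - 1
        return self.prg_rom[bank * self.BANK_SIZE + (addr & 0x3FFF)]
```

In hardware the same effect comes from a latch for the bank number and OR gates that force the upper bank bits high whenever the upper half of the ROM window is addressed.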
Input
The input device is a USB joystick. USB is also handled by the Nios II processor:
The state of all joystick buttons is packed into a word and passed to a serializer block over a parallel bus. By writing to address 0x4016, the NES processor generates a strobe pulse, which loads the status byte (8 buttons) into a shift register. After that, each read of 0x4016 (for the first joystick) or 0x4017 (for the second) returns the state of the next button and shifts the register. On the original NES the "Turbo" buttons are implemented in the joystick itself and simulate rapid pressing of the A and B buttons, so the NES cannot actually tell whether "Turbo A" or "A" is being held. In the project this function is performed in the serializer block, at a rate of about 20 Hz.
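The strobe-and-shift protocol described above can be modelled in a few lines. The sketch assumes the standard button order (A, B, Select, Start, Up, Down, Left, Right) and that reads after the eighth return 1, as a standard controller does. Neither detail is stated in the text, and the class name `PadSerializer` is hypothetical.

```python
BUTTON_ORDER = ("A", "B", "Select", "Start", "Up", "Down", "Left", "Right")


class PadSerializer:
    """Shift-register model of one joystick port."""

    def __init__(self):
        self.buttons = set()   # names of the buttons currently held
        self.strobe = False
        self.shift = 0

    def _latch(self):
        # Bit 0 is the first button to be read out (A).
        self.shift = sum(1 << i for i, name in enumerate(BUTTON_ORDER)
                         if name in self.buttons)

    def write_strobe(self, value):
        """CPU write to 0x4016: while bit 0 is high the register stays loaded."""
        self.strobe = bool(value & 1)
        if self.strobe:
            self._latch()

    def read(self):
        """CPU read of 0x4016 or 0x4017: return one button, then shift."""
        if self.strobe:
            self._latch()      # while strobed, button A is returned every time
        bit = self.shift & 1
        # Assumption: ones are shifted in, so reads past the eighth return 1.
        self.shift = (self.shift >> 1) | 0x80
        return bit
```

A game typically writes 1 and then 0 to 0x4016 and performs eight reads. A turbo function could be added by gating the A and B bits in `_latch` with a slow square wave.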
PPU
Functionally, the PPU can be divided into a background rendering block and a sprite rendering block. As for software emulators, I want to note that although the rendering algorithm itself is simple, the CPU and PPU run in lockstep, which makes writing a correct software emulator quite a challenge. More precisely, a naive, head-on software PPU is easy to write, but such an implementation consumes a lot of resources and is far from optimal. I think the first idea that occurs to any programmer is that, given an array of tiles on the screen and a character generator, you can draw the whole frame, or at least part of it, in one go. You can, of course, but this is where the problems with keeping the PPU and the CPU in sync begin. To create graphical effects, the CPU can change the PPU control registers right in the middle of a frame and can switch the character generator memory bank, more than once per frame. This is how split-screen scrolling, both vertical and horizontal, is implemented, as well as fixed areas that display all kinds of static information. An emulator therefore has to account for these moments, down to the number of cycles each CPU instruction takes, in order to know exactly when the CPU operates on the PPU control registers.
Because of this, some software emulators suffer from visual artifacts in games, especially in those that use the PPU in unconventional ways.
The PPU address space is laid out as follows:
The video pages are arranged so that, for example, with horizontal or vertical scrolling the screen shows the contents of two pages at once, and with diagonal scrolling, four. The layout of the pages and the coordinates are shown in the picture:
As I wrote above, the NES itself provides 2 KB of RAM for the page memory. Therefore, if the missing memory is not installed on the cartridge, page mirroring is applied. With vertical mirroring the second page mirrors page zero and the third mirrors the first; with horizontal mirroring the first page mirrors page zero and the third mirrors the second.
There is no point in saving 2 KB in an FPGA project, so all 4 KB are allocated for video memory. Mirroring is very simple to implement in hardware. For vertical mirroring, the RAM address bus is disconnected from the PPU at bit 11 and that line is tied to zero. For horizontal mirroring, lines 11 and 10 are swapped, and the high line is again tied to zero:
Some mappers can change the mirroring mode dynamically by switching the address bus lines. One example is Super Mario Bros. 2, where horizontal mirroring is used during the fall at the very beginning of the game, and the game then switches to vertical mirroring.
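The address manipulation just described can be expressed as a small function. This is only a sketch of the idea in Python, not the project's Verilog, and the mode names are chosen for this example.

```python
def vram_address(ppu_addr, mirroring):
    """Map a PPU video page address onto a 4 KB video RAM block."""
    a = ppu_addr & 0x0FFF          # offset inside 0x2000-0x2FFF
    a10 = (a >> 10) & 1
    a11 = (a >> 11) & 1
    low = a & 0x03FF               # offset inside one 1 KB page
    if mirroring == "vertical":
        return (a10 << 10) | low   # bit 11 tied to zero
    if mirroring == "horizontal":
        return (a11 << 10) | low   # bits 11 and 10 swapped, high bit tied to zero
    if mirroring == "four-screen":
        return a                   # all four pages are distinct
    raise ValueError(f"unknown mirroring mode: {mirroring}")
```

With vertical mirroring pages 0 and 2 (0x2000 and 0x2800) land on the same RAM, and with horizontal mirroring pages 0 and 1 (0x2000 and 0x2400) do.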
PPU registers
Eight registers are used to interact with the PPU (the addresses given are in the CPU address space).
Register 0x2000 (write only)
This register controls the state of the PPU: for example, it sets the sprite size (8x8 or 8x16), selects the video memory page and enables NMI interrupt generation.
Register 0x2001 (write only)
This register enables drawing of the sprites and of the background. It can also be used to disable rendering altogether, after which the CPU can access the PPU memory areas at any time. Games use this when initializing the video pages on a change of scenery and when filling CHR RAM, if RAM is used instead of ROM. Without disabling rendering, PPU memory can be accessed only during the VBLANK period, when the PPU itself does not access it.
Register 0x2002 (read only)
This register holds the PPU status flags: the flag indicating that the first opaque pixel of sprite zero has been drawn, the flag marking the start of the VBLANK period, and the flag indicating that more than 8 sprites fall on the current line.
Register 0x2003 (write only)
Sets the address for subsequent accesses to sprite memory (Object Attribute Memory, OAM).
Register 0x2004 (write / read)
Read and write OAM data. After the operation, the address value is auto-incremented. Usually, games do not write to OAM in this way, but use DMA.
Register 0x2005 (write only)
Scrolling register. The first write to the register sets the horizontal scroll value, the second sets the vertical one.
Register 0x2006 (write only)
Sets the address for subsequent accesses to video page memory. The first write sets the high part of the address, the second the low part.
Register 0x2007 (write / read)
Reads and writes video memory data. After the operation, the address is auto-incremented by 1 (next column) or by 32 (next row), depending on the state of bit 2 of register 0x2000.
The double-write registers share a single flip-flop, so you cannot, for example, write once to 0x2005, then to 0x2006, and then return to 0x2005 and expect a first write. More precisely, you can, but you have to understand why you are doing it. If the state of the flip-flop is unknown, you can read register 0x2002, which resets it.
It sounds easy! But there is an extremely important feature that is not mentioned in the Russian-language description referred to above.
The point is that the registers listed are, so to speak, not fully independent.
There are two 15-bit registers, v (the current video memory address) and t (a temporary address), and a 3-bit fine X scroll register, which sets the scroll position (0..7) within one tile.
The v and t addresses are laid out as follows:
Thus, selecting the page by writing to register 0x2000 changes bits 11:10 of register t. Writing to register 0x2005 sets bits 4:0 (first write) or bits 9:5 and 14:12 (second write) of register t, as well as the fine X scroll value. Writing to register 0x2006, however, can upset all of this, since it can change every bit of register t at once, and the second write also copies t into v.
While drawing a line, the PPU increments the coarse X scroll value in register v and switches the video page accordingly when coarse X overflows. At the end of the visible line (pixels 256-257), the PPU increments the Y value and copies the components related to horizontal scrolling (v[4:0] = t[4:0] and v[10] = t[10]). Before the start of a new frame (on the pre-render line), the vertical scrolling components are copied (v[9:5] = t[9:5], v[14:12] = t[14:12], v[11] = t[11]). And it all starts over again.
Now it is clear how changing the values of the PPU registers produces the various split-background effects.
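The interplay of the registers is easier to follow as code. The sketch below models only the register writes and the two copy operations described above, in Python rather than Verilog. The bit layout assumed for v and t is fine Y in bits 14:12, page select in 11:10, coarse Y in 9:5 and coarse X in 4:0, which matches the bit ranges quoted in the text. The class name, the method names and the toggle name w are hypothetical.

```python
class ScrollRegs:
    """v, t, fine X and the shared write toggle w; d is an 8-bit value."""

    def __init__(self):
        self.v = 0      # current VRAM address (15 bits)
        self.t = 0      # temporary VRAM address (15 bits)
        self.x = 0      # fine X scroll (3 bits)
        self.w = 0      # shared first/second write toggle

    def write_2000(self, d):
        self.t = (self.t & ~0x0C00) | ((d & 0x03) << 10)      # t[11:10]

    def write_2005(self, d):
        if self.w == 0:
            self.t = (self.t & ~0x001F) | (d >> 3)            # t[4:0]
            self.x = d & 0x07
        else:                                                 # t[14:12], t[9:5]
            self.t = (self.t & ~0x73E0) | ((d & 0x07) << 12) | ((d >> 3) << 5)
        self.w ^= 1

    def write_2006(self, d):
        if self.w == 0:
            self.t = (self.t & 0x00FF) | ((d & 0x3F) << 8)    # t[14] is cleared
        else:
            self.t = (self.t & 0x7F00) | d
            self.v = self.t                                   # v = t
        self.w ^= 1

    def read_2002(self):
        self.w = 0      # reading the status register resets the toggle

    def copy_horizontal(self):
        self.v = (self.v & ~0x041F) | (self.t & 0x041F)       # v[10], v[4:0]

    def copy_vertical(self):
        self.v = (self.v & ~0x7BE0) | (self.t & 0x7BE0)       # v[14:12], v[11], v[9:5]
```

A mid-frame write pair to 0x2006 therefore changes v immediately, while writes to 0x2005 only take effect at the next horizontal or vertical copy.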
A bit about image formation
The color of each pixel is selected from a palette. There are separate palettes for the background and for the sprites. A palette is a 16-byte memory area. Element zero of the background palette sets the backdrop color. When a background pixel is rendered, a 4-bit address is formed that points to a color entry in the palette. The two high-order bits of the address are the attribute value of the tile group, and the two low-order bits come from the image in the character generator. By changing the attributes of a group, you can change the color of tiles that use the same character generator image. The first 960 bytes of a video page (the name table) hold the numbers of the character generator (CHR) tiles shown on the page. Each of the remaining 64 bytes of the video page (the attribute table) holds the attributes for a group of 16 tiles (a 32x32 pixel area), two bits for each 2x2-tile quadrant.
The image of one tile occupies 16 bytes in character generator memory. Each pixel, as mentioned above, is encoded with two bits. The first 8 bytes hold the low bit of each pixel, and the next 8 bytes hold the high bit. That is, the byte pairs 0 and 8, 1 and 9, and so on each define one row of the tile.
The colors in the example are arbitrary. Pixels with a value of zero are transparent, and the attribute value does not affect them.
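The two-bit-plane encoding can be decoded as follows. This is a small illustrative sketch in which the function names are hypothetical, and the leftmost pixel is assumed to be stored in bit 7 of each byte.

```python
def decode_tile(tile_bytes):
    """Turn 16 bytes of character generator data into 8 rows of 2-bit pixels."""
    if len(tile_bytes) != 16:
        raise ValueError("a tile is exactly 16 bytes")
    rows = []
    for y in range(8):
        low, high = tile_bytes[y], tile_bytes[y + 8]
        # Bit 7 is taken to be the leftmost pixel of the row.
        rows.append([((low >> (7 - x)) & 1) | (((high >> (7 - x)) & 1) << 1)
                     for x in range(8)])
    return rows


def palette_index(attribute, pixel):
    """Combine the 2-bit group attribute with the 2-bit pixel value."""
    # Pixel value 0 is transparent, so the attribute does not apply to it.
    return 0 if pixel == 0 else (attribute << 2) | pixel
```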
Attributes of a group are defined as follows:
One pixel must be drawn per PPU clock cycle. Keep in mind that on the original NES only one value can be read from memory every 2 PPU cycles. Therefore, before a line begins, at the end of the HBLANK period, the data for the first two tiles of the new line are fetched.
Fetching the data needed to display one row of a tile takes 8 PPU clock cycles. First the tile number in the character generator is read from video memory (2 cycles), then the group attribute value (2 cycles), then the low byte of the tile row from the character generator (2 cycles), and finally the high byte of the tile row (also 2 cycles). And it all starts over again.
Since I do not multiplex the address bus with the data bus, in my project I can receive data on each PPU clock cycle.
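The four fetches for one tile row can be written down as address calculations on the v register. The sketch below assumes the v layout commonly documented for the PPU (fine Y in bits 14:12, page in 11:10, coarse Y in 9:5, coarse X in 4:0) and a name table base of 0x2000. `fetch_tile_row`, `read_vram` and `pattern_table` are hypothetical names used only for this example.

```python
def fetch_tile_row(v, read_vram, pattern_table=0):
    """Perform the four fetches for one 8-pixel tile row.

    v is the 15-bit VRAM address register, read_vram a callable that returns
    the byte at a PPU address, pattern_table the background bank (0 or 1).
    """
    fine_y = (v >> 12) & 0x07
    tile = read_vram(0x2000 | (v & 0x0FFF))                  # 1. tile number
    attr_addr = 0x23C0 | (v & 0x0C00) | ((v >> 4) & 0x38) | ((v >> 2) & 0x07)
    attr_byte = read_vram(attr_addr)                         # 2. group attribute
    # Select the 2-bit field of the 2x2-tile quadrant this tile belongs to.
    shift = ((v >> 4) & 0x04) | (v & 0x02)
    attribute = (attr_byte >> shift) & 0x03
    pattern = (pattern_table << 12) | (tile << 4) | fine_y
    low = read_vram(pattern)                                 # 3. low bit plane
    high = read_vram(pattern + 8)                            # 4. high bit plane
    return attribute, low, high
```

On the original hardware each of the four reads takes two PPU cycles. With a non-multiplexed bus each can complete in one.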
Sprites
The section on the PPU registers mentioned the sprite memory, Object Attribute Memory (OAM). It is 256 bytes in size and sits on the PPU chip in a separate address space, which means it can be accessed simultaneously with, and independently of, VRAM and CHR ROM, for example.
Each sprite occupies 4 bytes in OAM, so no more than 64 sprites can be on the screen at the same time.
Sprite images are also stored in the character generator (CHR ROM).
Each sprite is described by its X and Y position on the screen, the number of its image in the character generator, an attribute (the same as the group attribute used when drawing the background), horizontal and vertical flip flags (so a single character generator image can be reused for mirrored halves) and a priority flag.
Sprites can be 8x8 and 8x16 pixels.
The formation of the image of the sprite is no different from the formation of the image of the background. However, in the hardware implementation, again there are some peculiarities.
While a line is being drawn, the PPU simultaneously searches for the sprites that will be visible on the next line (in-range evaluation). The PPU has a memory area (secondary OAM) that can hold information about 8 sprites. If the search finds more than 8 sprites on the next line, the extra sprites are ignored and a flag signaling this is set in PPU register 0x2002.
Since the CHR ROM bus is busy while the line is being drawn, the character generator data for the pixels of these 8 sprites are fetched during the HBLANK period.
The NES generates a picture with a resolution of 256x240 pixels. To display it, I use the standard 640x480 VGA mode. The PPU renders the image into a frame buffer. Data from the frame buffer go to a block that doubles the resolution (an upscaler). In the future I want to implement hq2x. Before the data are fed to the video DAC, the color is converted to RGB.
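A simplified model of this search is shown below. It takes byte 0 of each OAM entry to be the Y coordinate of the sprite's top row and deliberately ignores the finer hardware quirks of the real evaluation logic. The function name is hypothetical.

```python
def evaluate_sprites(oam, next_line, sprite_height=8):
    """Return (secondary_oam, overflow) for the line that will be drawn next.

    oam is the 256-byte sprite memory; secondary_oam is a list of
    (oam_index, entry_bytes) pairs holding at most 8 sprites.
    """
    secondary = []
    overflow = False
    for index in range(64):
        entry = oam[index * 4:index * 4 + 4]
        if 0 <= next_line - entry[0] < sprite_height:
            if len(secondary) < 8:
                secondary.append((index, bytes(entry)))
            else:
                overflow = True    # ninth sprite on the line: ignored, flag set
                break
    return secondary, overflow
```

Because OAM is scanned in address order, the eight sprites with the lowest addresses win, which is why games that need more sprites per line rotate their OAM order and accept flicker.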
DMA controller
To fill OAM quickly, the NES processor can use the DMA controller, which is implemented very simply. Before starting the copy, the CPU must set the OAM start address (register 0x2003), or rather reset it to 0. Then the CPU writes the high byte of the source address (0x??00) in the CPU address space to address 0x4014. The DMA controller halts the CPU and copies 256 bytes from the area 0x??00 - 0x??FF (where ?? is the value written by the CPU) to PPU register 0x2004. The PPU increments the OAM address by one on each write. When the transfer is complete, the DMA controller returns control to the CPU.
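In software terms the transfer amounts to the following loop. This is only a sketch of the data movement, with no cycle timing, and `oam_dma` and `cpu_read` are hypothetical names.

```python
def oam_dma(cpu_read, oam, page, oam_addr=0):
    """Copy 256 bytes from CPU page 0x??00-0x??FF into OAM.

    cpu_read is a callable returning the byte at a CPU address, oam is a
    256-byte bytearray, and oam_addr is the value previously written to
    0x2003. Returns the OAM address after the transfer.
    """
    base = (page & 0xFF) << 8
    for offset in range(256):
        oam[oam_addr] = cpu_read(base + offset)   # same effect as a write to 0x2004
        oam_addr = (oam_addr + 1) & 0xFF          # the PPU auto-increments the address
    return oam_addr
```

If the OAM address is not reset to 0 beforehand, the data simply land rotated in OAM, which is why games reset it first.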
Priority and Sprite 0 Hit
The final pixel color value is formed as follows:
It is possible, and often happens, that more than one sprite occupies the same area.
In this case two factors affect the display priority: the sprite's address in OAM and its priority flag (0 for foreground, 1 for background). For example, if a sprite whose priority flag places it behind the background has a lower OAM address than a foreground sprite, the rear sprite can hide the front one, so that the background pixel is output instead, provided that the background pixel is not transparent.
When the first opaque pixel of the sprite at OAM address 0 is drawn (the background at that point must also be opaque), a flag is set in register 0x2002 indicating that a Sprite 0 hit has occurred. One example of the use of this flag is Super Mario Bros., where it is used to split the screen and separate the static score and time display from the playing field. The image of the coin serves as the background, and its shadow is sprite zero. The CPU polls the flag, and when the event occurs, the output of the playing field begins.
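The selection rule can be captured in a few lines. The sketch assumes that only the first opaque sprite in OAM order takes part in the decision, which is what produces the masking effect described above. The function name and the string return values are chosen for this example only.

```python
def compose_pixel(bg_pixel, sprites):
    """Choose the source of one screen pixel.

    bg_pixel is the 2-bit background value (0 means transparent). sprites is
    a list of (pixel, behind_background) pairs for the sprites covering this
    position, ordered by ascending OAM address.
    Returns "backdrop", "background" or "sprite".
    """
    # Only the first opaque sprite in OAM order takes part in the decision.
    winner = next(((p, behind) for p, behind in sprites if p != 0), None)
    if winner is None:
        return "background" if bg_pixel else "backdrop"
    _, behind = winner
    if bg_pixel and behind:
        return "background"    # hides this sprite and any front sprite after it
    return "sprite"
```

For instance, `compose_pixel(2, [(1, True), (3, False)])` yields the background: the rear sprite at the lower OAM address wins the sprite selection and then loses to the opaque background, so the front sprite is never shown.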
Interrupts from the MMC3 mapper
The MMC3 mapper has a scanline counter whose value decreases each time the PPU outputs another line. On reaching zero, the counter is reloaded with a value that can be set in advance by writing to the corresponding mapper register, and, if the enable flag is set, a CPU interrupt is raised. The way the counter is clocked is rather original: it is connected to address line A12 of the PPU. The lower bank (0x0000 - 0x0FFF) is usually used for background tiles and the upper bank (0x1000 - 0x1FFF) for sprites. While a visible line is being drawn the PPU accesses one bank, and while fetching sprite image data in the HBLANK period it accesses the other. The frequency on the A12 line therefore matches the line output frequency. Mapper interrupts are mainly used to split the screen and to switch the character generator bank.
Examples can be seen in the video.
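A behavioural sketch of such a counter is given below. It follows the commonly documented MMC3 behaviour (reload when the counter is zero, otherwise decrement, and raise the interrupt when the counter is zero after the clock), which is slightly more specific than the description above and should be treated as an assumption. Filtering of short A12 pulses is left out, so `clock()` is expected to be called once per line. All names are hypothetical.

```python
class Mmc3IrqCounter:
    """Scanline counter clocked once per (filtered) rising edge of PPU A12."""

    def __init__(self):
        self.latch = 0            # reload value written by the CPU
        self.counter = 0
        self.irq_enabled = False
        self.irq_pending = False

    def clock(self):
        """Advance the counter by one scanline."""
        if self.counter == 0:
            self.counter = self.latch     # reload from the latch
        else:
            self.counter -= 1
        if self.counter == 0 and self.irq_enabled:
            self.irq_pending = True

    def acknowledge(self):
        """Clear the pending interrupt, as the CPU's handler would."""
        self.irq_pending = False
```

A game that wants a split after N lines writes N to the latch, forces a reload and enables the interrupt during VBLANK, then changes the scroll or bank registers in its IRQ handler.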
APU
The NES audio processor is on the same chip as the CPU. Functionally, an APU is a set of control registers, a frame counter, and 5 blocks of audio channels.
From the hardware point of view the APU is a bunch of counters with no pitfalls, so the description will be brief.
The frame counter generates clock pulses of approximately 240 Hz and 120 Hz for the APU blocks, as well as an IRQ interrupt for the CPU. Interrupt generation can be disabled through the APU registers. Do not be misled by the word "frame": here it has nothing to do with the PPU.
By the way, an unpleasant incident happened during development. The APU frame counter was already implemented, and I got the interrupt enable flag wrong (it turned out to be inverted), so an interrupt was generated at about 60 Hz. The outward symptom was very unexpected: in Prince of Persia the gate on the left side of the screen did not open, and this was the only graphical artifact. I racked my brains until I found the real cause. Naturally, I suspected the PPU and checked its code a hundred times, running it in ModelSim. It was unpleasant!
The APU has five channels:
Two rectangular (pulse) channels, which, as the name implies, produce a rectangular signal; one triangle channel; one noise channel; and a delta modulation channel (DMC).
The rectangular channels can produce a signal with a variable duty cycle (4 settings) and period, with a programmable duration, and they also have sweep and envelope blocks. The sweep block can change (successively increase or decrease) the signal period over time, and the envelope block reduces the signal amplitude over time, optionally looping, in which case the envelope has a sawtooth shape.
The triangular channel generates a triangular signal with variable frequency and duration, but does not have the ability to adjust the volume.
The noise channel generates a pseudo-random signal with variable duration. Like the rectangular channel, the noise channel has an envelope control block.
The delta modulation channel in the project has not yet been implemented.
Channel mixing is done with lookup tables, so as to avoid operations that are "heavy" in terms of logic element (LE) consumption.
After mixing, the signal passes through a clock domain synchronization unit and a filter, and then through a serializer unit over the I2S bus to the WM8731 audio codec.
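The contents of the author's tables are not given. As an illustration of the table-based approach, the sketch below builds two tables from the approximation formulas commonly quoted in NES documentation, one for the two rectangular channels and one for triangle, noise and DMC combined. The constants and the name `mix` are assumptions and are not taken from the project.

```python
# Assumption: the lookup formulas commonly quoted in NES documentation.
PULSE_TABLE = [0.0] + [95.52 / (8128.0 / n + 100) for n in range(1, 31)]
TND_TABLE = [0.0] + [163.67 / (24329.0 / n + 100) for n in range(1, 203)]


def mix(pulse1, pulse2, triangle, noise, dmc=0):
    """Mix channel levels into one sample in the range 0.0 to about 1.0.

    pulse1, pulse2, triangle and noise are 4-bit levels (0-15) and dmc is a
    7-bit level (0-127); each table is indexed by a weighted sum of levels.
    """
    return (PULSE_TABLE[pulse1 + pulse2]
            + TND_TABLE[3 * triangle + 2 * noise + dmc])
```

In hardware the two tables become small ROMs and the mixer reduces to two additions for the indexes, two lookups and one final addition, with no multipliers or dividers.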
When implementing the project, I used the Verilog 2001 hardware description language.
Anticipating questions about the source code: since this is my first big FPGA project, I have surely implemented many things very inefficiently. Besides, this is the very first version, and the code needs to be optimized and cleaned up, because I rewrote many blocks several times. So I would rather not publish it in its current form. If I find the time, pull myself together and put the code in order, it may be released under the GNU GPL license.