The history of the 3dfx Voodoo1

This is the second article in the series “3D cards of the late 90s that ran Quake”. In the first part, we looked at the Rendition Vérité 1000 from late 1996 and the special port of the game written for it, vQuake. Rendition beat everyone else to the Quake market: for a short time, its board was the only one capable of running id Software's blockbuster with hardware acceleration.

But that all changed in January 1997, when id Software released a new version of Quake called GLQuake. Since the port was built on miniGL (a subset of the OpenGL 1.1 standard), any hardware accelerator manufacturer could write miniGL drivers and join the 3D card race. From this point on, the competition was open to everyone. The goal was to generate as many frames per second as possible. The reward was fame and gamers' money. A brief look at history shows that two authorities of the time had no doubt about who was king of the hill.

So far, there is no doubt about this: the world of Quake is ruled by Voodoo. And since Quake rules the world of games, the purchase of 3Dfx Voodoo is almost inevitable for gamers.
- Tom's Hardware, November 30, 1997

3DFX Voodoo 1
- The benchmark against which all other cards are measured.

- John Carmack's .plan file. February 12, 1998 ^[2]

Just looking at the specifications ^[3], which claimed a fill rate of 50 megapixels/s, made me want to study this card and understand what 3dfx did to create such a powerful product.

3dfx Interactive

Ross Smith, Scott Sellers and Gary Tarolli met while working together at SGI ^[4]. After a short stint at Pellucid, where they tried to sell IrisVision boards for PCs (in 1994, these boards cost $4,000 apiece), they founded their own company with the backing of Gordie Campbell's TechFarm. 3dfx Interactive, headquartered in San Jose, California, was founded in 1994.

Initially, the company intended to build powerful hardware for arcade machines, but it changed course and developed PC boards instead. There were three reasons for this.

RAM had become reasonably cheap.
With FastPage RAM and then EDO RAM, memory latency had dropped by 30%. Memory could now run at up to 50 MHz.
3D (or pseudo-3D) games were becoming more and more popular. The success of titles such as DOOM, Descent and Wing Commander III showed that a market for 3D accelerators would soon emerge.

The founders realized that they needed to create something powerful, designed for games, with a retail price in the range of 300-400 dollars. In 1996, the company announced the SST1 architecture (named after the founders: Sellers-Smith-Tarolli-1), which was soon licensed by several OEMs, such as Diamond, Canopus, Innovision, and ColorMAX. The product was given the marketing name "Voodoo1", emphasizing its magical performance.

As with the V1000, manufacturers building cards could only change the type of RAM (EDO or DRAM), the color of the board, and the physical placement of the chips. Almost everything else was standardized.

Diamond Monster 3D, image taken from vgamuseum.info.

Canopus Pure3D, image taken from vgamuseum.info.

BIOSTAR Venus 3D, image taken from vgamuseum.info.

ORCHID Righteous 3D, image taken from vgamuseum.info.

Looking at an SST1 board, it is striking how different it was from its competitors, the Rendition Vérité 1000 and the NVidia NV1.

First, 3dfx made a bold move by refusing to support 2D rendering. The Voodoo1 had two VGA ports, one used as an output and the other as an input. The card was designed as an add-on: it took as its input the output of the 2D VGA card already installed in the computer. While the user worked in the operating system (DOS or Windows), the Voodoo1 simply passed the signal from its VGA input through to its VGA output. When switching to 3D mode, the Voodoo1 took control of the VGA output and ignored the signal on its VGA input. Some boards had a mechanical switch that audibly clicked when switching between 2D and 3D modes. This design meant that the card could only be used for full-screen rendering; there was no windowed mode.

The second remarkable aspect of the SST1 was that it was built not around a single processor but around two non-programmable ASICs (Application-Specific Integrated Circuits). If you follow the bus traces, you can see that each of the chips, labeled “TMU” and “FBI”, has its own RAM. On a 4 MiB card the RAM was split equally: 2 MiB for the TMU to store textures, and 2 MiB for the FBI to store the color buffer and the z-buffer, with values stored as 16-bit RGBA and 16-bit integer / half-float respectively. A 4 MiB card supported resolutions up to 640x480: two color buffers for double buffering (2 x 640 x 480 x 2 bytes) plus one depth buffer (640 x 480 x 2 bytes) = 1,843,200 bytes. Later models with 4 MiB of FBI memory allowed resolutions up to 800x600 (2 x 800 x 600 x 2 + 800 x 600 x 2 = 2,880,000 bytes).
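The memory budget above can be checked with a few lines of C. This is only a sketch of the arithmetic, assuming the layout described in the text (two 16-bit color buffers plus one 16-bit depth buffer); the function names are made up for illustration.

```c
#include <stdint.h>

/* Bytes of FBI memory needed for a given resolution, assuming 16-bit color,
   16-bit depth, two color buffers (double buffering) and one depth buffer. */
static uint32_t fbi_bytes_needed(uint32_t width, uint32_t height)
{
    const uint32_t bytes_per_value = 2; /* 16 bits */
    const uint32_t color_buffers   = 2; /* front + back */
    const uint32_t depth_buffers   = 1;
    return width * height * bytes_per_value * (color_buffers + depth_buffers);
}

/* Returns 1 if the mode fits in `fbi_mib` mebibytes of FBI memory, else 0. */
static int mode_fits(uint32_t width, uint32_t height, uint32_t fbi_mib)
{
    return fbi_bytes_needed(width, height) <= fbi_mib * 1024u * 1024u;
}
```

With these assumptions, 640x480 fits in the 2 MiB given to the FBI, while 800x600 only fits once the FBI has 4 MiB.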

SST1 rendering pipeline

The specification does not describe the pipeline in detail. In my interpretation, the life of a triangle consisted of five stages.

The triangle is created and transformed on the host CPU (usually a Pentium). This includes multiplication by the model / projection matrix, clipping, perspective division, homogeneous-coordinate clipping, and the viewport transformation. At the end of this process, only visible screen-space triangles remain (because of clipping, one triangle may turn into two).
Using the triangleCMD command, the triangles are sent over the PCI bus to the Frame Buffer Interface (FBI). There they are converted into scanline requests for the Texture Mapping Unit (TMU). For each element of a scanline (called a fragment), the TMU performs up to four texture lookups if the developer asked for bilinear filtering. The per-fragment perspective division is also performed in the TMU.
The TMU sends each textured fragment to the FBI as a 16-bit RGBA value + a 16-bit z-value.
The FBI runs the depth test on each fragment against its dedicated RAM, which stores the RGBA values and z-values of the frame buffer.
Finally, lighting is applied to the fragment based on its color attribute and a lookup in a 64-entry fog table. If blending is required, the FBI combines the resulting fragment with what is already in the color buffer.

Interesting fact: if you are a 3D enthusiast, you probably know the fast inverse square root code made famous by the Quake 3 source code:

    float Q_rsqrt(float number)
    {
        long i;
        float x2, y;
        const float threehalfs = 1.5f;

        x2 = number * 0.5f;
        y  = number;
        i  = * (long*) &y;                        // evil floating point bit level hacking
        i  = 0x5f3759df - ( i >> 1 );             // what the fuck?
        y  = * ( float * ) &i;
        y  = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
        return y;
    }

While searching for the original source of Q_rsqrt ^[5], Rys Sommefeldt contacted Gary Tarolli, who said that he had used this code back when he was still working at SGI. So it is fair to assume that it was used in the SST1 pipeline.

Something doesn't add up

Having looked at the pipeline, and knowing that each component (TMU, FBI, EDO RAM) runs at 50 MHz, the numbers do not seem to add up: the card should not be able to reach 50 megapixels/s. Two problems had to be solved.

First, the TMU had to read four texels to perform bilinear texture filtering. That means four RAM access cycles per fragment, which would starve the TMU and limit the fill rate to 50/4 = 12.5 megapixels/s.

There is another bottleneck at the FBI level. If the z-buffer test is enabled, the z-value of the incoming fragment must be compared with what is already in the z-buffer before the fragment is written or discarded. If the test passes, the new value must be written. That is two RAM operations, which would halve the fill rate: 50/2 = 25 megapixels/s.
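To make the four reads concrete, here is a minimal software sketch of bilinear filtering. It is not how the TMU is implemented; it assumes a single-channel 8-bit texture and non-negative coordinates in texel units, and the names `fetch_texel`, `sample_bilinear` and `texel_reads` are hypothetical.

```c
#include <stdint.h>

/* Counts texture memory reads, to make the cost of filtering visible. */
static unsigned texel_reads = 0;

/* Fetch one texel from a w x h single-channel texture (assumed layout),
   clamping coordinates at the right and bottom edges. */
static uint8_t fetch_texel(const uint8_t *tex, int w, int h, int x, int y)
{
    if (x > w - 1) x = w - 1;
    if (y > h - 1) y = h - 1;
    texel_reads++;
    return tex[y * w + x];
}

/* Bilinear sample at (u, v), given in texel units, with u >= 0 and v >= 0. */
static float sample_bilinear(const uint8_t *tex, int w, int h, float u, float v)
{
    int   x0 = (int)u;            /* truncation equals floor for u >= 0 */
    int   y0 = (int)v;
    float fx = u - (float)x0;     /* horizontal blend weight */
    float fy = v - (float)y0;     /* vertical blend weight */

    /* The four neighbouring texels: four separate memory reads. */
    float t00 = fetch_texel(tex, w, h, x0,     y0);
    float t10 = fetch_texel(tex, w, h, x0 + 1, y0);
    float t01 = fetch_texel(tex, w, h, x0,     y0 + 1);
    float t11 = fetch_texel(tex, w, h, x0 + 1, y0 + 1);

    float top    = t00 + fx * (t10 - t00);
    float bottom = t01 + fx * (t11 - t01);
    return top + fy * (bottom - top);
}
```

Every filtered fragment costs four fetches, which is why a memory delivering one texel per clock would cap the TMU at a quarter of its clock rate.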
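The two estimates above are simply the clock rate divided by the number of RAM accesses per pixel. The sketch below spells this out and shows where the two accesses of the depth test come from. It assumes one RAM access per clock and a "smaller z is closer" convention; both are simplifications made for illustration, and the function names are hypothetical.

```c
#include <stdint.h>

/* Naive fill rate in megapixels/s, assuming one RAM access per clock and
   `accesses_per_pixel` accesses needed for every pixel. */
static double naive_fill_rate(double clock_mhz, unsigned accesses_per_pixel)
{
    return clock_mhz / (double)accesses_per_pixel;
}

/* Depth test for one fragment (assumed convention: smaller z is closer).
   Returns the number of RAM accesses it performed. */
static unsigned depth_test(uint16_t *zbuf, uint32_t index, uint16_t frag_z)
{
    unsigned accesses = 1;        /* read the stored z-value */
    if (frag_z < zbuf[index]) {
        zbuf[index] = frag_z;     /* test passed: write the new z-value */
        accesses++;
    }
    return accesses;
}
```

Calling `naive_fill_rate(50.0, 4)` gives the 12.5 megapixels/s of the TMU case, and `naive_fill_rate(50.0, 2)` the 25 megapixels/s of the FBI case.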

Four-way TMU interleaving

The solution to the four-sample problem at the TMU stage is mentioned in the SST1 specification.

The texture memory data path is fully interleaved, which allows an individual bank to access data regardless of the address used to access data in other banks.

- SST1 specification

The specification does not say whether the buses use multiplexed addresses, or whether data and address lines are shared. It is easier to reason about them if you picture them as non-multiplexed and not shared.

Regardless of the details, the TMU architecture made it possible to fetch 4 x 16-bit texels per clock. Fed at that rate, the TMU could perform the per-fragment division by w and then generate the fragment z-value (16-bit) and fragment color (16-bit), which were sent to the FBI.

Two-way FBI interleaving

The solution to the problem of two RAM accesses at the FBI stage is not described in the specification. However, the document mentions a fill rate of 100 megapixels/s, achieved with glClear thanks to the ability to write two pixels per clock, which suggests that two-way interleaving was used here.

The FBI read and wrote two pixels at a time (2 x 1 pixels, each made of a 16-bit color and a 16-bit z-value = 64 bits). To do this, a 21-bit address generates two 20-bit addresses, in which the least significant bit is dropped so that two consecutive pixels are read or written together. Since the scanline algorithm reads and writes along horizontal lines, moving from left to right, accessing two consecutive pixels at a time worked very well.
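One way to picture this addressing is the sketch below. It is an interpretation of the paragraph above, not something taken from the specification: the struct, the function names and the idea of a "lane" selecting the even or odd pixel of a pair are all assumptions made for illustration.

```c
#include <stdint.h>

typedef struct {
    uint32_t pair_addr; /* 20-bit address presented to both memory banks */
    uint32_t lane;      /* 0 = even pixel of the pair, 1 = odd pixel */
} PairAddress;

/* Split a 21-bit pixel address into the shared 20-bit pair address
   (least significant bit dropped) and the lane inside the pair. */
static PairAddress split_address(uint32_t addr21)
{
    PairAddress p;
    addr21 &= 0x1FFFFFu;          /* keep only 21 bits */
    p.pair_addr = addr21 >> 1;    /* drop the least significant bit */
    p.lane      = addr21 & 1u;
    return p;
}

/* Number of two-pixel accesses needed to cover pixels [x_start, x_end)
   of a scanline, walking left to right. */
static uint32_t pair_accesses_for_span(uint32_t x_start, uint32_t x_end)
{
    if (x_end <= x_start)
        return 0;
    return ((x_end - 1u) >> 1) - (x_start >> 1) + 1u;
}
```

Under these assumptions a full 640-pixel scanline needs 320 accesses instead of 640, which is where the doubling of throughput comes from.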

64-bit TMU -> FBI bus

The last piece of the puzzle is the 64-bit bus between the TMU and the FBI. The specification says almost nothing about it, but its behavior can be inferred from what the FBI consumes. Since the FBI processes two pixels at a time, it is reasonable to assume that the TMU does not send fragments one by one as soon as they are ready, but groups them in pairs: 2 x (16-bit color + 16-bit z-value).
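The pairing can be sketched as packing two fragments into one 64-bit word. The field order used here is purely an assumption for illustration; the specification does not document the bus layout, and the type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint16_t color; /* 16-bit color */
    uint16_t z;     /* 16-bit depth */
} Fragment;

/* Pack two consecutive fragments into a single 64-bit bus word.
   Assumed (hypothetical) layout, from low to high bits:
   even color, even z, odd color, odd z. */
static uint64_t pack_pair(Fragment even, Fragment odd)
{
    return  (uint64_t)even.color
         | ((uint64_t)even.z    << 16)
         | ((uint64_t)odd.color << 32)
         | ((uint64_t)odd.z     << 48);
}

/* Extract the fragment in lane 0 (even) or lane 1 (odd) from a bus word. */
static Fragment unpack_fragment(uint64_t word, unsigned lane)
{
    Fragment f;
    unsigned shift = lane ? 32u : 0u;
    f.color = (uint16_t)(word >> shift);
    f.z     = (uint16_t)(word >> (shift + 16u));
    return f;
}
```

Two fragments of 32 bits each fill the 64-bit bus exactly, matching the two-pixel accesses of the FBI.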

Programming Voodoo1

At the lowest level, the Voodoo1 was programmed through memory-mapped registers. The API consists of a surprisingly small number of commands, only five of them: TRIANGLECMD (fixed-point), FTRIANGLECMD (floating-point), NOPCMD (no-op), FASTFILLCMD (buffer clear) and SWAPBUFFERCMD. Alongside them are data registers used to configure blending, the z-test, fog colors and much more. Textures were uploaded to VRAM through an 8 MiB write-only memory-mapped PCI region.
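For readers unfamiliar with memory-mapped registers, the sketch below shows the general idea: a command is issued simply by storing a value at a given offset from the card's base address. The offsets and the written values are HYPOTHETICAL placeholders, not the real SST1 register map (which is given in the specification), and the helper names are made up.

```c
#include <stdint.h>

/* HYPOTHETICAL byte offsets, for illustration only; these are not the
   actual SST1 register addresses. */
enum {
    REG_FASTFILLCMD   = 0x10,
    REG_SWAPBUFFERCMD = 0x14
};

/* Write a 32-bit value to a memory-mapped register. `base` would be the
   address at which the card's registers are mapped; `volatile` keeps the
   compiler from reordering or removing the store. */
static void reg_write(volatile uint32_t *base, uint32_t byte_offset, uint32_t value)
{
    base[byte_offset / 4u] = value;
}

/* Clear the back buffer, then swap buffers, by writing to the two command
   registers. The value 1 is a placeholder, not a documented encoding. */
static void clear_and_swap(volatile uint32_t *base)
{
    reg_write(base, REG_FASTFILLCMD, 1u);
    reg_write(base, REG_SWAPBUFFERCMD, 1u);
}
```

The same functions can be exercised without hardware by passing a plain array as `base`, which is a convenient way to check the offsets.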

Programming Voodoo1 (for real)

In practice, developers programmed the Voodoo1 through the Glide API ^[6]. The API design was inspired by IRIS GL / OpenGL: it used a state machine and a prefix on every function (gr instead of gl). Programmers also had to manage VRAM themselves, much as is done today in Vulkan.

    #include <glide.h>

    int main( void )
    {
        GrHwConfiguration hwconfig;
        GrVertex A, B, C;

        // Initialize Glide, detect the hardware and select the first board.
        grGlideInit();
        grSstQueryHardware( &hwconfig );
        grSstSelect( 0 );

        // Full screen 640x480 at 60 Hz, two color buffers, no aux buffer.
        grSstWinOpen( 0, GR_RESOLUTION_640x480, GR_REFRESH_60Hz,
                      GR_COLORFORMAT_RGBA, GR_ORIGIN_LOWER_LEFT, 2, 0 );

        grBufferClear( 0, 0, 0 );

        ... // Init A, B, and C.

        // Use the iterated (per-vertex) RGB color, draw, then swap buffers.
        guColorCombineFunction( GR_COLORCOMBINE_ITRGB );
        grDrawTriangle( &A, &B, &C );
        grBufferSwap( 1 );

        grGlideShutdown();
        return 0;
    }

"Standard" MiniGL

Although MiniGL was a subset of the OpenGL 1.1 standard, no specification was ever published for it. MiniGL was simply "whatever functions Quake uses". Running objdump on the glquake.exe binary makes it easy to build the “official” list.

    $ objdump -p glquake.exe | grep "gl"

    glAlphaFunc     glDepthMask     glLoadIdentity  glShadeModel
    glBegin         glDepthRange    glLoadMatrixf   glTexCoord2f
    glBlendFunc     glDisable       glMatrixMode    glTexEnvf
    glClear         glDrawBuffer    glOrtho         glTexImage2D
    glClearColor    glEnable        glPolygonMode   glTexParameterf
    glColor3f       glEnd           glPopMatrix     glTexSubImage2D
    glColor3ubv     glFinish        glPushMatrix    glTranslatef
    glColor4f       glFrustum       glReadBuffer    glVertex2f
    glColor4fv      glGetFloatv     glReadPixels    glVertex3f
    glCullFace      glGetString     glRotatef       glVertex3fv
    glDepthFunc     glHint          glScalef        glViewport

If you started learning OpenGL recently, you may be puzzled by function names such as glColor3f, glTexCoord2f, glVertex3f, glTranslatef, glBegin, and glEnd. They belong to what was called “immediate mode”, in which vertex coordinates, texture coordinates, matrix manipulations, and colors were each specified with one function call at a time.

Here is how, back in the day, a single textured, Gouraud-shaded triangle was drawn.

    #include <GL/gl.h>

    void Render(void)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glEnable(GL_TEXTURE_2D);
        glShadeModel(GL_SMOOTH);
        glBindTexture(GL_TEXTURE_2D, 1); // Assume a texture was loaded in textureId=1

        glMatrixMode(GL_PROJECTION);
        glLoadIdentity();
        glOrtho(-1.0, 1.0, -1.0, 1.0, -1.0, 1.0);
        glMatrixMode(GL_MODELVIEW);
        glLoadIdentity();

        // One color, one texture coordinate and one position per vertex.
        glBegin(GL_TRIANGLES);
            glColor3f(1.0f, 1.0f, 1.0f);
            glTexCoord2f(0.0f, 0.0f);
            glVertex3f(-1.0f, -0.25f, 0.0f);

            glColor3f(0.0f, 0.0f, 0.0f);
            glTexCoord2f(1.0f, 0.0f);
            glVertex3f(-0.5f, -0.25f, 0.0f);

            glColor3f(0.5f, 0.5f, 0.5f);
            glTexCoord2f(0.0f, 1.0f);
            glVertex3f(-0.75f, 0.25f, 0.0f);
        glEnd();
    }

GLQuake

The theoretical maximum fill rate of 50 megapixels/s was supposed to deliver almost 50 frames per second at a resolution of 640x480. However, since Quake applied two texture layers to each surface (one for color and another for the light map), the SST1 had to draw each frame twice, with additional blending in the second pass. As a result, on a 166 MHz Pentium, Quake ran at 26 fps.

By reducing the resolution to 512x384 on the same machine, it was possible to achieve a smooth 41 fps ^[7] , which no competitor could provide at that time.

Software rendering

GLQUAKE VOODOO1

Interesting fact: the SST1 was not for everyone. Some people liked their pixels and found bilinear filtering “blurry”. Others were annoyed by the loss of gamma correction.

Glquake sucks. I think someone might argue with that, but let's admit it looks awful, especially on NVidia cards. On 3dfx boards, things are not so bad ... but the colors are still washed out. On TNT2, the picture is disgusting; it is too dark and gloomy.

- @Frib, Unofficial Glquake & QW Guide ^[8]

3dfx Voodoo2

To say that 3dfx ruled the market from 1996 to 1998 would be an understatement. After the SST1, the Voodoo2 raised the bar even higher thanks to 100 MHz EDO RAM, 90 MHz ASICs, and not one but two TMUs, which allowed a multi-textured Quake frame (color + lighting) to be drawn in a single pass ^[9]. This technology was a real monster, and even the cards themselves looked luxurious.

The fill rate of the Voodoo2 nearly doubled, reaching 90 megapixels/s. Quake benchmarks soared to a staggering 80 fps on a Pentium II 266 MMX (compared to 56 fps with a Voodoo1), effectively reaching the limits of the game logic and of what monitors could display.

Super Voodoo 2 12MB, image taken from vgamuseum.info.

Unfortunately, after the release of the Voodoo3 in 1999, the 3dfx story took a sharp turn. Faced with growing competition, the company set out to build its own cards and stopped selling its technology to OEMs.

This transition did not go as expected, and the performance of the Voodoo3 was disappointing compared to NVidia's GeForce 256, which offered hardware transform and lighting (the part of the pipeline that had until then been handled by the Pentium).

In response to NVidia, 3dfx canceled the development of the Voodoo4 in order to start work on the Voodoo5, built on VSA-100 technology (Voodoo Scalable Architecture). The result was not what the company had hoped for: when “Napalm” (the card's codename) was released, it ran into the more powerful NVidia GeForce 2 and ATI Radeon cards. Finally, on December 15, 2000, 3dfx announced the sale of its assets to NVidia, and the company later went bankrupt.

For those who lived through the late 90s and had the pleasure of playing on a Voodoo1 or Voodoo2, 3dfx remains a landmark company symbolizing excellence. It stands as an ode to well-deserved success achieved through courage, outstanding talent and hard work. Thank you guys!

Reference materials

[1] Source: The story of the Rendition Vérité 1000

[2] Source: John Carmack .plan. Feb 12, 1998

[3] Source: SST-1, HIGH PERFORMANCE GRAPHICS ENGINE FOR 3D GAME ACCELERATION

[4] Source: 3dfx Oral History Panel

[5] Source: Origin of Quake3's Fast InvSqrt ()

[6] Source: Glide Programming Guide

[7] Source: Comparison of GLQuake Using Voodoo & Voodoo 2 3D Cards

[8] Source: Frib, Unofficial Glquake & QW Guide

[9] Source: VOODOO2 GRAPHICS HIGH PERFORMANCE GRAPHICS ENGINE FOR 3D GAME ACCELERATION

Source: https://habr.com/ru/post/446860/
