A bit of history
At the dawn of computing, dynamic memory kept pace with the processor clock. My first computing experience was with a clone of the ZX Spectrum. Its Z80 processor executed instructions in 4 clocks per operation on average, with two clocks of each instruction fetch used for dynamic memory refresh, which at a frequency of 3.5 MHz gives no more than 875,000 operations per second.
After some time, however, processor frequencies reached a level where dynamic memory could no longer keep up with the load. To compensate, an intermediate link was introduced in the form of cache memory, which smooths out the speed difference between the processor and main memory for operations performed on small amounts of data.
Let's look at what a computer's RAM is today, and what can be done with it to increase the speed of the computer system.

Briefly about static and dynamic memory
Memory is built as a table of rows and columns. Each cell of the table holds one bit of information (we are discussing semiconductor memory, though many other implementations are built on the same principle). Each such table is called a "bank". A chip/module can contain several banks. A set of memory modules is mapped into the processor's linear address space according to the capacity of the individual elements.
A static memory cell is built around a flip-flop, which sits in one of two stable states, "A" or "B" (A = !B). The minimum number of transistors per cell is 6, and the routing complexity inside the cells apparently rules out making a 1 GB static memory module at the price of a regular 8 GB one.
A dynamic memory cell consists of one capacitor, responsible for storing information, and one transistor, responsible for isolating the capacitor from the data bus. The capacitor is not a discrete component: it is the parasitic p-n junction capacitance between the "substrate" and the transistor electrode (for this purpose it is deliberately enlarged, whereas normally designers try to get rid of it). The drawback of the capacitor is leakage current (both its own and through the key transistor), which is very hard to eliminate and which grows with temperature, creating a risk of corrupting the stored information. To maintain reliability, dynamic memory uses "refresh" ("regeneration"): the stored information is rewritten at least once per a predetermined period during which it is guaranteed to remain valid. The typical refresh period is 8 ms; refreshing more often is allowed, less often is not recommended.
The rest of the operating principle is the same for both and goes as follows:
- the initial selection of a memory row opens access to its entire contents, which is placed into a row buffer that all further work goes through; alternatively, column accesses are multiplexed directly (the old, slow approach);
- the requested data is transferred to the host device (usually the CPU), or the specified cells are modified in a write operation (there is a subtle difference: in static memory a cell of the selected row can be modified directly, while in dynamic memory the row buffer is modified and the contents of the whole row are then written back in a special cycle);
- closing and switching rows also differs between the memory types: static memory can switch rows instantly if the data has not changed, while dynamic memory must first write the row buffer back in place, and only then can another row be selected.
At the dawn of computing, each read or write operation involved a full memory cycle:
- row selection;
- read/write of the cell;
- row change / re-selection.
A modern operation on "synchronous memory à la DDRx" chips looks like this:
- row selection;
- read/write of the row's cells in bursts of 4-8 bits/words (repeated accesses within a single row are allowed);
- closing the row, with its contents written back in place;
- row change / re-selection.
This solution saves access time when, after reading the value from cell "1", we need to access cells "2, 3, 4, or 7" located in the same row, or when a changed value must be written back immediately after the read operation.
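To make the difference tangible, here is a minimal sketch in C, not a real controller model: the timing values are illustrative (roughly DDR4-2133 15-15-15, counted in memory clocks), and the one-word-per-clock burst is a deliberate simplification.

```c
#include <stdio.h>

/* Illustrative timings in memory clocks (roughly DDR4-2133 15-15-15). */
#define tRCD 15   /* row activation: ACT -> first column command */
#define CL   15   /* column read: READ -> first data word        */
#define tRP  15   /* precharge: close the row, write buffer back */

/* Old scheme: every access pays the full row cycle. */
static int old_full_cycle(int n_words)
{
    return n_words * (tRCD + CL + tRP);
}

/* Modern scheme: open the row once, stream the data, close once.
   Simplification: one clock per word after the first CL wait.    */
static int modern_row_burst(int n_words)
{
    return tRCD + CL + n_words + tRP;
}

int main(void)
{
    int n = 64; /* read 64 words from one row */
    printf("old scheme:    %d clocks\n", old_full_cycle(n));   /* 2880 */
    printf("modern scheme: %d clocks\n", modern_row_burst(n)); /*  109 */
    return 0;
}
```

Reading 64 words from one row pays the full row cycle 64 times in the old scheme, but only once in the modern one.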
More about how dynamic memory works in conjunction with the cache
The memory controller (in the chipset, or integrated into the processor) puts the bank address and the row number (the most significant bits of the address) onto the memory chip/module. The corresponding bank is selected (further discussion stays within one bank), the received "binary number" is decoded into the positional address of the row, and the row's contents are transferred into the buffer through which all subsequent data accesses go. The time in clocks required for this operation is called tRCD and occupies second place in timing schemes like "9-9-9" / "9-9-9-27".
Once the row is activated, the "columns" can be accessed: the memory controller transmits the cell address within the row, and after the time CL (the first number in the "9-9-9" scheme above) data starts streaming from the memory chips to the processor (why the plural? because the cache steps in here) as a burst of 4-8 bits (per chip) filling a cache line (its size depends on the processor; the typical value is 64 bytes, i.e. eight 64-bit words, though other values exist). After the number of clocks needed to transfer the burst, the next request for data from other cells of the selected row can be issued, or a command to close the row, which is reflected in tRP, the third parameter of the scheme. While the row is being closed, the data from the buffer is written back into the bank's row; once the write-back finishes, another row of the bank can be selected. Besides these three parameters there is the minimum time a row must remain active, tRAS, and the minimum time of a full row cycle separating two row-activation commands (it affects random access).
(CL - CAS latency; tRCD - RAS-to-CAS delay; tRP - row precharge; CAS - column address strobe; RAS - row address strobe.)
The speed of semiconductor technology is determined by the delays of the circuit elements: to obtain reliable information at the output, one must wait a certain time for all elements to settle into a steady state. The data access time depends on the current state of the memory bank, but in general the following transitions can be described:
If the bank is idle (no active row), the controller issues a row-select command; the binary row number is converted into a positional one, and the contents of the row are read out during the time tRCD.
Once the row's contents have been read into the row buffer, a column-select command can be issued, which converts the binary column number into a positional one during the time CL; depending on the alignment of the low-order address bits, the order in which the bits are transmitted may change.
Before changing/closing a row, the data must be written back in place, because reading actually destroys the stored information. The time needed to restore the information in the row is tRP.
The full dynamic memory specification defines many more timing parameters governing the sequence of and delays between control-signal changes. One of them is tRCmin, which defines the minimum time of a complete row cycle, including row selection, data access and write-back.
The RAS signal indicates that a row address is being issued;
the CAS signal indicates that a column address is being issued.
Whereas earlier all control sat on the memory controller's side and was driven directly by these signals, there is now a command mode: a command is issued to the module/chip, and some time later the data is transmitted. For more details, see the standard specification, for example DDR4.
Speaking of working with DRAM in general, a bulk read usually looks like this:
- set the row address;
- assert RAS (and deassert it a clock later);
- wait tRCD;
- set the address of the column being read (and set the next column number on every following clock);
- assert CAS;
- wait CL, start reading the data;
- deassert CAS, read the rest of the data (CL more cycles).
When moving to the next row, a precharge (RAS + WE) is issued, tRP is waited out, RAS is asserted with the new row address, and then reading proceeds as described above.
The random cell reading latency naturally follows from the above: tRP + tRCD + CL.
More precisely, it depends on the previous state of the "memory bank" being accessed.
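This dependence can be written out as a small C sketch; the nanosecond values assume the DDR4-2133 15-15-15 module taken for the calculations below (15 memory clocks ~ 14 ns), and the state names are mine:

```c
#include <stdio.h>

/* Bank state at the moment of the request. */
enum bank_state { ROW_HIT, BANK_IDLE, ROW_MISS };

static double access_latency_ns(enum bank_state s)
{
    /* DDR4-2133 15-15-15: each timing is about 14 ns. */
    const double tRCD = 14.0, CL = 14.0, tRP = 14.0;
    switch (s) {
    case ROW_HIT:   return CL;              /* row already open       */
    case BANK_IDLE: return tRCD + CL;       /* open row, then read    */
    case ROW_MISS:  return tRP + tRCD + CL; /* close, open, then read */
    }
    return 0;
}

int main(void)
{
    printf("row hit:   %.0f ns\n", access_latency_ns(ROW_HIT));   /* 14 */
    printf("bank idle: %.0f ns\n", access_latency_ns(BANK_IDLE)); /* 28 */
    printf("row miss:  %.0f ns\n", access_latency_ns(ROW_MISS));  /* 42 */
    return 0;
}
```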
It must be remembered that DDR has two frequencies:
- the base clock frequency, which determines the rate at which commands are issued and in which the timings are expressed;
- the effective data-transfer frequency (double the clock frequency; this is the number memory modules are labeled with).
Integrating the memory controller into the processor increased the speed of the memory subsystem by eliminating an intermediate transmitting link. A larger number of memory channels has to be taken into account by the application; for example, four-channel mode with an unfortunate placement of data gives no performance boost (configurations 12 and 14 on the chart).

[Chart: processing one element of a linked list at different strides (1 step = 16 bytes)]
Now a little math
Processor: operating frequencies now reach 5 GHz. According to manufacturers, circuit techniques (pipelines, prediction and other tricks) allow executing one instruction per clock. To keep the calculations round, take a clock frequency of 4 GHz, which gives one operation per 0.25 ns.
RAM: let's take for example the RAM of the new DDR4-2133 format with timings of 15-15-15.
Given:
CPU
Fclk = 4 GHz
Tclk = 0.25 ns (conditionally, the execution time of one operation)
DDR4-2133 RAM
Fclk = 1066 MHz
Fdata = 2133 MHz
tclk = 0.94 ns
tdata = 0.47 ns
SPDmax = 2133 MHz * 64 bit = 17,064 MB/s (peak data transfer rate)
tRCmin = 50 ns (minimum time between two row activations)
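The derived values above can be reproduced in a few lines of C (a sketch assuming a 64-bit bus and decimal megabytes, as in the text):

```c
#include <stdio.h>

int main(void)
{
    /* DDR4-2133 with 15-15-15 timings, as in the text. */
    double f_clk  = 1066e6;  /* command clock, Hz      */
    double f_data = 2133e6;  /* data rate, transfers/s */
    double bus    = 8;       /* 64-bit bus = 8 bytes   */

    printf("t_clk  = %.2f ns\n", 1e9 / f_clk);          /* 0.94 ns */
    printf("t_data = %.2f ns\n", 1e9 / f_data);         /* 0.47 ns */
    printf("SPDmax = %.0f MB/s\n", f_data * bus / 1e6); /* 17064   */
    return 0;
}
```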
Data acquisition time
From registers and cache, data can be delivered within the working cycle (registers, L1 cache) or with a delay of several processor cycles for the L2 and L3 caches.
For RAM, the situation is worse:
- row selection time: 15 clk * 0.94 ns = 14 ns;
- time from the column-select command to the data: 15 clk * 0.94 ns = 14 ns;
- row closing time: 15 clk * 0.94 ns = 14 ns (who would have thought).
From this it follows that the time between a command requesting data from a memory cell (if the cache is not involved) and its arrival may vary:
- 14 ns - the data is in the already selected row;
- 28 ns - the data is in a non-selected row, provided the previous one is already closed (the bank is "idle");
- 42-50 ns - the data is in another row, and the current row must first be closed.
The number of operations the processor (as above) can perform during this time ranges from 56 (14 ns) to 200 (50 ns, row change). Separately note that the time between the column-select command and the receipt of the entire packet is extended by the cache-line load delay: 8 burst transfers * 0.47 ns = 3.76 ns. In the case where the data becomes available to the "program" only after the whole cache line is loaded (who knows what the processor designers have wired up; by specification the memory can deliver the needed data first), that is about 15 more processor clocks.
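A quick recalculation of these figures (a sketch with the same assumptions: a 4 GHz processor at one operation per clock, 0.47 ns per DDR4-2133 transfer):

```c
#include <stdio.h>

int main(void)
{
    double cpu_op_ns = 0.25; /* 4 GHz, one op per clock (assumed) */
    double t_data    = 0.47; /* DDR4-2133 transfer time, ns       */

    double lat[] = { 14.0, 28.0, 50.0 }; /* row hit / idle / change */
    for (int i = 0; i < 3; i++)
        printf("%2.0f ns -> %3.0f CPU ops lost\n",
               lat[i], lat[i] / cpu_op_ns);   /* 56, 112, 200 */

    /* extra delay while the 8-transfer burst fills the cache line */
    double burst_ns = 8 * t_data;             /* 3.76 ns */
    printf("burst: %.2f ns = %.0f CPU clocks\n",
           burst_ns, burst_ns / cpu_op_ns);   /* ~15 clocks */
    return 0;
}
```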
As part of one project I studied memory speed; the results showed that memory bandwidth can be fully "utilized" only with sequential access. With random access the processing time per element grows (the test case was a linked list of a 32-bit pointer and three double words, one of which is updated) from 4-10 ns (sequential access) to 60-120 ns (row changes), a 12-15x difference in processing speed.
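The original experiment is not reproduced here, but a minimal pointer-chasing sketch in its spirit is easy to assemble. The node layout follows the text (a pointer plus three words, one of which is updated; on a 64-bit system the node is 32 bytes rather than the 16 in the chart), while the pool size, the shuffle and the timing method are my assumptions:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node {
    struct node *next;
    long a, b, c;            /* three words; "c" is the one updated */
};

#define N (1u << 22)         /* ~4M nodes, far larger than the caches */

static unsigned long long rng = 88172645463325252ULL;
static size_t xorshift(void) /* small PRNG for the shuffle */
{
    rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
    return (size_t)rng;
}

int main(int argc, char **argv)
{
    struct node *pool  = malloc(sizeof *pool  * N);
    size_t      *order = malloc(sizeof *order * N);
    if (!pool || !order) return 1;

    for (size_t i = 0; i < N; i++) order[i] = i;
    if (argc > 1)            /* any argument: random traversal order */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = xorshift() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
    for (size_t i = 0; i < N; i++)  /* link the nodes in that order */
        pool[order[i]].next = &pool[order[(i + 1) % N]];

    clock_t t0 = clock();
    struct node *p = &pool[order[0]];
    for (size_t i = 0; i < N; i++) {
        p->c++;              /* touch one word per node */
        p = p->next;
    }
    double ns = (double)(clock() - t0) / CLOCKS_PER_SEC * 1e9 / N;
    printf("%.1f ns per node\n", ns);
    free(pool); free(order);
    return 0;
}
```

Run it without arguments for sequential order, and with any argument for shuffled order; the difference in per-node time illustrates the row-change penalty described above.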
Data processing speed
For the selected module we have a peak bandwidth of 17,064 MB/s. At a frequency of 4 GHz this allows processing a 32-bit word per clock (17,064 MB/s / 4000 MHz = 4.266 bytes per clock). The following restrictions apply here:
- without explicit planning of cache loading, the processor will be forced to idle (the higher the frequency, the more the core simply waits for data);
- in read-modify-write cycles the processing speed is halved;
- multi-core processors divide the memory bus bandwidth between the cores, and when requests compete (a degenerate case) memory performance may degrade by "200x (row change) * N cores".
Calculate:
17,064 MB/s / 8 cores = 2133 MB/s per core in the optimal case;
17,064 MB/s / (8 cores * 200 missed operations) = 10 MB/s per core in the degenerate case.
Translated into operations, for an 8-core processor we get from 15 to 400 operations to process one byte of data, or from 60 to 1600 operations/clocks to process a 32-bit word.
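The same division as a two-line check (assumptions as in the text: 8 cores, 200 lost operations per row change):

```c
#include <stdio.h>

int main(void)
{
    double spd_max = 17064.0; /* MB/s, DDR4-2133 peak           */
    int    cores   = 8;
    int    stall   = 200;     /* CPU ops lost per row change    */

    printf("best case:  %.0f MB/s per core\n", spd_max / cores);
    printf("worst case: %.1f MB/s per core\n",
           spd_max / (cores * stall));
    return 0;
}
```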
Somehow slow, in my opinion. Compare with DDR3-1333 9-9-9 memory, where the full row-cycle time is also approximately 50 ns, but the timings differ:
- data access time is reduced to 13.5 ns (1.5 ns * 9 cycles);
- the transmission time of an eight-word burst is 6 ns (0.75 ns * 8, versus 3.76 ns), so with random memory access the difference in transfer rate almost disappears;
- peak speed will be 10,664 MB/s.
Not that far behind. The situation is somewhat saved by the presence of "banks" in memory modules: each "bank" is a separate memory table that can be accessed independently, making it possible to change the row in one bank while reading/writing data from a row of another; by cutting idle time this allows, in optimized scenarios, loading the data bus to the fullest.
Actually, some crazy ideas emerged
The memory table contains a fixed number of columns: 512, 1024 or 2048. Taking a row-activation cycle time of 50 ns, we get a potential exchange rate of 1 / 50 ns * 512 columns * 64-bit words = 81,920 MB/s instead of the current 17,064 MB/s (163,840 and 327,680 MB/s for rows of 1024 and 2048 columns). You will say: "only 5 times (4.8x) faster", to which I answer: "this is the exchange rate when all competing requests go to a single memory bank; the available bandwidth grows in proportion to the number of banks and to the row length of each table (which would require a longer row buffer), and it runs up mainly against the speed of the data-exchange bus."
Changing the data-exchange mode would require transferring the entire contents of a row into a lower-level cache, for which the cache levels would have to differ not only in operating speed but also in cache-line size. For example, by making the line of the Nth-level cache 32,768 bits "long" (512 columns * 64-bit words), we can reduce the number of comparison operations, increase the total number of cache lines and, accordingly, its maximum volume. But a parallel bus across a cache of this size would lower its operating frequency, so a different cache organization suggests itself: splitting this "jumbo" cache line into blocks the size of an upper-level cache line and exchanging data in small portions preserves the operating frequency and divides the access delay into stages: finding the cache line, then selecting the needed "word" within the found line.
As for the exchange between the cache and main memory: the data must be transferred at the row-access rate of a single bank or, with some margin, requests must be spread across different banks. There is also the difficulty of access time to data located in different parts of a row: with serial transmission, besides the initial row-selection delay, there is a transmission delay that depends on the amount of data "in the packet" and on the transfer rate. Even the RAMBUS approach may not cope with the increased load. The situation could be saved by moving to a serial (possibly differential) bus: by further reducing the data width we can raise the per-channel rate, reduce the time between the transfer of the first and last bits of data, and split the row transfer across several channels, which allows a lower clock frequency per channel.
Estimate the speed of this channel:
1 / 50 ns = 20 MHz (the row-change rate within one bank)
20 MHz * 32,768 bits = 655,360 Mbit/s
For differential transmission with the same data-bus width we get:
655,360 Mbit/s / 32 channels = 20,480 Mbit/s per channel.
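These estimates are straightforward to re-derive (a sketch with the same assumed 50 ns row cycle and 64-bit words):

```c
#include <stdio.h>

int main(void)
{
    double t_rc_ns = 50.0;           /* full row cycle, ns          */
    double f_row   = 1e3 / t_rc_ns;  /* 20 MHz row-change rate      */

    int cols[] = { 512, 1024, 2048 };
    for (int i = 0; i < 3; i++)      /* 81920, 163840, 327680 MB/s  */
        printf("%4d columns: %8.0f MB/s\n",
               cols[i], f_row * cols[i] * 64 / 8);

    /* the serial-channel variant from the text */
    double row_bits = 512.0 * 64;    /* 32,768 bits per row         */
    printf("per channel: %.0f Mbit/s\n",
           f_row * row_bits / 32);   /* 20480 */
    return 0;
}
```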
Such a speed looks acceptable for an electrical signal (10 Gbit/s with embedded clocking over 15 meters is available today, so why not master 20 Gbit/s with external clocking over 1 meter); however, further raising the transfer rate to reduce the delay between the first and last bits of information may require more bandwidth, possibly with an integrated optical transmission channel, but that is a question for circuit designers; I have little experience with such frequencies.
and here Ostap got carried away
Changing the concept of mapping the cache onto main memory toward "main memory as an intermediate ultra-fast block storage device" would shift the prediction of data loading from the controller circuitry to the processing algorithm (and who knows better which data will be needed in a while? certainly not the memory controller), which in turn would allow growing the outer-level cache size without sacrificing performance.
Going further, one could also change the orientation of the processor architecture from "switching the context of an execution unit" to "the working environment of a program". Such a change could significantly improve code security by defining a program as a set of functions with specified entry points for individual procedures, an accessible region for the data being processed, and hardware control over which other processes may call a function. It would also make multi-core processors more efficient by eliminating context switching for some of the threads and by handling events in a separate thread within the available "process" environment, allowing more efficient use of 100+ core systems.
PS: any resemblance to registered trademarks or patents is accidental. All original ideas are available for use under the "anthill" license agreement.