
RDMA: a view from the inside

The growing popularity of cluster systems as a platform for high performance computing (HPC) brings to the fore the task of making the platforms that form a cluster interact efficiently.



The leading place here is occupied by the RDMA (Remote Direct Memory Access) technology, which generalizes the concept of direct memory access from data transmission within the local platform to the interaction of several systems within a cluster.
To understand the meaning of this technology, let us view the NIC (Network Interface Controller) outside the context of network protocols such as TCP, purely as a machine that can perform two actions:
  1. receive data from the network and transfer it to RAM
  2. read data from RAM and transfer it to the network.

Then, for several computers connected to the network, the set of their NIC controllers can be viewed as a kind of generalized DMA controller, which differs from a regular DMA controller in that it can interact not only with local memory, but also with remote memory.

How DMA works within a cluster


So, what has to be added to the traditional operation of copying a block of memory when the source and the destination of the copy are on different platforms of the same cluster?

RDMA Read Request
Figure 6 of RFC 5040 (A Remote Direct Memory Access Protocol Specification) reveals the low-level meaning of tagging. It shows the format of one of the RDMA requests, the Read Request, which asks for data to be read.

The request contains information identifying the source buffer and the address inside it, the destination buffer and the address inside it, as well as the length of the transmitted block. Passing such a complete set of parameters is a property of tagged requests that address specific address ranges in specific buffers.

Data Sink Steering Tag [32 bits] - the number under which the receiving buffer is registered within the cluster.
Data Sink Tagged Offset [64 bits] - the offset of the beginning of the block being written relative to the beginning of the receive buffer.
RDMA Read Message Size [32 bits] - the size of the block to be sent.
Data Source Steering Tag [32 bits] - the number under which the source buffer is registered within the cluster.
Data Source Tagged Offset [64 bits] - the offset of the beginning of the block being read relative to the beginning of the source buffer.

This set of five parameters differs from the canonical three parameters of a copy operation (source address, destination address, block length) only in that buffer numbers (tags) are used alongside the 64-bit addresses (offsets). The reason for this complication is straightforward: several platforms interact, each with its own address space, so an address alone is not enough to identify an object.
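The contrast between the two parameter sets can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `Node`, `register`, `local_copy` and `rdma_read` are hypothetical names, the "network" is an ordinary function call, and the two nodes are passed explicitly, whereas in a real exchange they would be implied by the connection.

```python
class Node:
    """A platform with its own address space: tag -> registered buffer."""

    def __init__(self):
        self.buffers = {}  # maps an integer tag to a bytearray

    def register(self, stag, size):
        buf = bytearray(size)
        self.buffers[stag] = buf
        return buf


def local_copy(memory, src, dst, length):
    # Canonical three-parameter copy inside a single address space.
    memory[dst:dst + length] = memory[src:src + length]


def rdma_read(sink, sink_stag, sink_offset,
              source, source_stag, source_offset, length):
    # Five-parameter copy: each address is a (tag, offset) pair, because
    # an offset alone does not say which buffer on which node is meant.
    src_buf = source.buffers[source_stag]
    dst_buf = sink.buffers[sink_stag]
    if min(source_offset, sink_offset, length) < 0:
        raise ValueError("offsets and length must be non-negative")
    if source_offset + length > len(src_buf) or sink_offset + length > len(dst_buf):
        raise ValueError("block does not fit in the registered buffer")
    dst_buf[sink_offset:sink_offset + length] = \
        src_buf[source_offset:source_offset + length]


a, b = Node(), Node()
b.register(stag=7, size=16)[:5] = b"hello"   # source buffer on node b
a.register(stag=3, size=16)                  # sink buffer on node a
rdma_read(a, 3, 8, b, 7, 0, 5)               # a reads 5 bytes from b
```

After the call, bytes 8 to 12 of buffer 3 on node `a` hold the data taken from buffer 7 on node `b`.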

Buffers that are logically contiguous may be fragmented in the physical address spaces of the platforms, as follows from the way the operating system implements virtual memory. The 64-bit Tagged Offsets therefore specify logical (virtual) addresses rather than physical ones. These addresses must be translated through the page mappings, and the fragments gathered using Scatter-Gather mechanisms, so that, ultimately, the RNIC (RDMA-capable NIC) operates directly on the memory of user applications, as prescribed by the Direct Data Placement (DDP) data transfer model.
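The translation from a logical range to physical fragments can be sketched as follows. The example assumes a 4096-byte page and a per-buffer table of physical page addresses. Both the table and the function name `to_scatter_gather` are hypothetical, and real hardware keeps such tables in its own formats.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


def to_scatter_gather(page_frames, offset, length):
    """Split a logically contiguous range into physical segments.

    page_frames[i] is the (made-up) physical base address of the i-th page
    of the buffer. Returns a list of (physical_address, length) pairs.
    """
    if offset < 0 or length < 0 or offset + length > len(page_frames) * PAGE_SIZE:
        raise ValueError("range lies outside the registered buffer")
    segments = []
    while length > 0:
        page, within = divmod(offset, PAGE_SIZE)
        chunk = min(length, PAGE_SIZE - within)  # stop at the page boundary
        segments.append((page_frames[page] + within, chunk))
        offset += chunk
        length -= chunk
    return segments


# Three pages that are adjacent logically but scattered physically.
frames = [0x7000, 0x2000, 0x9000]
segments = to_scatter_gather(frames, offset=4000, length=5000)
```

Here a 5000-byte block starting at logical offset 4000 crosses two page boundaries and is split into three physical segments of 96, 4096 and 808 bytes.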

Summary


RDMA technology, like the traditional network protocol stack, transfers data between computing platforms. How is it better than the traditional approach to exchanging information over a local network?

When tagged requests are used, each of the platforms forming a cluster has the information necessary to address the memory of another platform remotely. The physically separate address spaces of the systems forming a cluster can be combined into a single logical address space.

The addressing scheme used by the RDMA drivers ensures that, when a data transfer operation is initiated, the final target addresses of both the source buffer and the destination buffer are known. Both of these buffers are in the address space of user applications. This way of organizing data transfer, which eliminates the need for transit buffers and additional data copy operations, is called Zero-Copy. CPU involvement during such an operation is minimal, since all the information specifying the base addresses and sizes of the copied blocks is programmed on the requesting side at the moment the request is generated.

Source: https://habr.com/ru/post/271877/

