Resource Binding in Microsoft DirectX 12. Performance Issues

Let's take a closer look at resource binding on Intel platforms. The topic is especially relevant now, with the release of the 6th generation Intel Core processor family (Skylake) and of the Windows 10 operating system on July 29, 2015.
In a previous article, Introducing Resource Binding in Microsoft DirectX* 12, new ways to bind resources in DirectX 12 were described. Its conclusion was that, with such a wide choice, the main task is to pick the best binding mechanism for the target GPU, the resource types, and their update frequency.
This article describes how to choose among the resource binding mechanisms to run applications efficiently on specific Intel GPUs.

Tools

The development of games based on DirectX 12 requires the following:

Windows 10.
Visual Studio* 2013 or later.
The DirectX 12 SDK included in Visual Studio.
GPU and drivers that support DirectX 12.

Overview

A descriptor is a block of data that describes an object to the GPU in an "opaque" GPU-specific format. DirectX 12 supports the following descriptors, which were called “resource views” in DirectX 11:

Constant buffer view (CBV).
Shader resource view (SRV).
Unordered access view (UAV).
Sampler view (SV).
Render target view (RTV).
Depth stencil view (DSV).
Others.

These descriptors, or resource views, can be thought of as a structure (block) consumed by the GPU front end. A descriptor is approximately 32–64 bytes in size and holds information such as texture dimensions, format, and layout.
Descriptors are stored in a descriptor heap, which is a sequence of such structures in memory.

A descriptor table points to descriptors in the heap using offsets. The table maps a contiguous range of descriptors to shader slots and makes them accessible through a root signature. A root signature can also contain root constants, root descriptors, and static samplers.

Figure 1. Descriptors, descriptor heaps, descriptor tables, and the root signature
Figure 1 shows the relationship between descriptors, descriptor heaps, descriptor tables, and the root signature.
The code that sets up the root signature in Figure 1 looks like this.

// Requires <d3d12.h> and the d3dx12.h helper header (CD3DX12_* types).
// Init() sets the shader registers.
// Parameters: descriptor range type, number of descriptors, base shader register.
// The first descriptor table entry in the root signature in
// Figure 1 sets shader registers t1, b1, t4, and t5.
// Performance: order ranges from most to least frequently used.
CD3DX12_DESCRIPTOR_RANGE Param0Ranges[3];
Param0Ranges[0].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 1, 1); // t1
Param0Ranges[1].Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 1, 1); // b1
Param0Ranges[2].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 2, 4); // t4-t5

// The second descriptor table entry in the root signature
// in Figure 1 sets shader registers u0 and b2.
CD3DX12_DESCRIPTOR_RANGE Param1Ranges[2];
Param1Ranges[0].Init(D3D12_DESCRIPTOR_RANGE_TYPE_UAV, 1, 0); // u0
Param1Ranges[1].Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 1, 2); // b2

// Set the descriptor tables in the root signature.
// Parameters: number of descriptor ranges, descriptor ranges, visibility.
// Visibility to all stages allows sharing binding tables
// with all types of shaders.
CD3DX12_ROOT_PARAMETER Param[4];
Param[0].InitAsDescriptorTable(3, Param0Ranges, D3D12_SHADER_VISIBILITY_ALL);
Param[1].InitAsDescriptorTable(2, Param1Ranges, D3D12_SHADER_VISIBILITY_ALL);
// Root descriptor
Param[2].InitAsShaderResourceView(0, 0); // t0
// Root constants
Param[3].InitAsConstants(4, 0); // b0 (4x32-bit constants)

// Writing into the command list. srvGPUHandle, uavGPUHandle, and
// srvGPUAddress are placeholders for values created elsewhere.
cmdList->SetGraphicsRootDescriptorTable(0, srvGPUHandle);
cmdList->SetGraphicsRootDescriptorTable(1, uavGPUHandle);
// A root SRV is bound by GPU virtual address, not by descriptor handle.
cmdList->SetGraphicsRootShaderResourceView(2, srvGPUAddress);
// Arguments: root parameter index, value count, source data, destination offset.
const UINT rootConstants[4] = {1, 3, 3, 7};
cmdList->SetGraphicsRoot32BitConstants(3, 4, rootConstants, 0);

The code above sets up a root signature with two descriptor tables, one root descriptor, and one set of root constants. It also shows that root constants involve no indirection: they are supplied directly by the SetGraphicsRoot32BitConstants call and map straight to shader registers, with no actual constant buffer, constant buffer descriptor, or binding involved. Root descriptors have one level of indirection because they store a pointer to memory (descriptor -> memory), and descriptor tables have two levels (descriptor table -> descriptor -> memory).
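The three levels of indirection can be illustrated with a small CPU-side model. This is only a sketch in plain C++, not Direct3D 12 code: the Descriptor and DescriptorTable structs, the function names, and the use of a std::vector as "GPU memory" are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model, not D3D12 API code. All names here are hypothetical.
struct Descriptor {            // stands in for an opaque 32-64 byte descriptor
    std::size_t memoryOffset;  // where the resource data lives in "GPU memory"
};

struct DescriptorTable {       // a contiguous range inside a descriptor heap
    std::size_t heapOffset;    // index of the first descriptor in the heap
    std::size_t count;         // number of descriptors in the range
};

// Root constant: the value itself is a root argument. No indirection.
inline std::uint32_t readRootConstant(std::uint32_t rootValue) {
    return rootValue;
}

// Root descriptor: root argument -> memory. One level of indirection.
inline std::uint32_t readViaRootDescriptor(
        const Descriptor& root, const std::vector<std::uint32_t>& memory) {
    assert(root.memoryOffset < memory.size());
    return memory[root.memoryOffset];
}

// Descriptor table: table -> descriptor in heap -> memory. Two levels.
inline std::uint32_t readViaDescriptorTable(
        const DescriptorTable& table, std::size_t slot,
        const std::vector<Descriptor>& heap,
        const std::vector<std::uint32_t>& memory) {
    assert(slot < table.count);
    assert(table.heapOffset + slot < heap.size());
    const Descriptor& d = heap[table.heapOffset + slot];  // first hop
    assert(d.memoryOffset < memory.size());
    return memory[d.memoryOffset];                         // second hop
}
```

Each extra hop in the model corresponds to one extra memory access the GPU has to make before it reaches the data.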

Descriptors live in different heaps depending on their type, for example samplers in one heap and CBV / SRV / UAV in another. This is because descriptor sizes differ greatly between types and across hardware platforms. Only one heap should be allocated for each descriptor heap type, because switching heaps can be an extremely expensive operation.

In general, DirectX 12 supports allocating more than one million descriptors up front, which is enough for an entire game level. In previous versions of DirectX, allocation happened at run time on the driver's own terms, whereas in DirectX 12 allocation during execution can be avoided entirely. This means that descriptor allocation no longer has to affect performance.

Note. With 3rd generation Intel® Core™ processors (Ivy Bridge) and the 4th generation family (Haswell) under DirectX 11 and the Windows Display Driver Model (WDDM) version 1.x, resources were dynamically mapped into memory with a page table mapping operation, based on the resources referenced in the command buffer. This avoided copying data. Dynamic mapping mattered because these architectures make only 2 GB of memory available to the GPU (the Intel® Xeon® processor E3-1200 v4 product family (Broadwell) provides more).
With DirectX 12 and WDDM version 2.x, resources can no longer be remapped into the GPU virtual address space as needed, because a resource is assigned a static virtual address at creation and that address cannot change afterwards. Even when a resource is evicted from GPU memory, it keeps its virtual address for when it becomes resident again.
Therefore, the 2 GB of memory available to the GPU on the Ivy Bridge / Haswell families can become a limiting factor.

As stated in the previous article, a well-balanced application can combine all types of bindings: root constants, root descriptors, descriptor tables for descriptors gathered on the fly as draw calls are issued, and dynamic indexing of large descriptor tables.
Different architectures may perform differently when large sets of root constants and root descriptors are used instead of descriptor tables. For this reason, the ratio of root parameters to descriptor tables may need to be tuned for the target hardware platforms.

Expected patterns of change

To understand which changes incur additional cost, we first need to look at how game engines typically change data, descriptors, descriptor tables, and root signatures.

Let's start with constant data. Most game engines keep all constant data in “system memory”. The engine changes the data in CPU-accessible memory, and later in the frame an entire block of constant data is copied or mapped into GPU memory and then read by the GPU through a constant buffer view or a root descriptor.

If constant data is provided through SetGraphicsRoot32BitConstants() as root constants, the entry in the root signature does not change, but the data may. If it is provided through a CBV (that is, a descriptor) and a descriptor table, the descriptor does not change, but the data may.

If several constant buffer views are needed (for example, for double or triple buffering of rendered frames), the CBV or descriptor in the root signature may change every frame.
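One common way to handle the multi-buffered case is to cycle through a fixed set of per-frame buffers. The sketch below only shows the index arithmetic; the function name and the idea of keying the choice on a running frame counter are assumptions for illustration, not something the original article prescribes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: picks which of `bufferCount` per-frame constant
// buffers (and therefore which CBV or root descriptor) a frame should use.
inline std::uint32_t frameResourceIndex(std::uint64_t frameNumber,
                                        std::uint32_t bufferCount) {
    assert(bufferCount > 0);  // 2 for double buffering, 3 for triple buffering
    return static_cast<std::uint32_t>(frameNumber % bufferCount);
}
```

With triple buffering, frames 0, 1, 2, 3 map to buffers 0, 1, 2, 0, so the descriptor referenced by the root signature changes once per frame while each individual descriptor stays untouched.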

For texture data, the texture is expected to be allocated in GPU memory at startup. An SRV (that is, a descriptor) is then created, stored in a descriptor table or a static sampler, and referenced in the root signature. After that, neither the data nor the descriptor or static sampler changes.

For dynamic data, such as changing texture or buffer data (for example, textures with rendered localized text, buffers of animated vertices, or procedurally generated meshes), we allocate a render target or buffer and provide an RTV or UAV, which are descriptors. These descriptors may not change afterwards, while the data in the render target or buffer may.
If several render targets or buffers are needed (for example, for double or triple buffering of rendered frames), the descriptors in the root signature may change every frame.

In the discussion that follows, a change is considered relevant to resource binding if it involves one of the following:

Changing or replacing a descriptor in a descriptor table, such as a CBV, RTV, or UAV.
Changing an entry in the root signature.

Descriptors in descriptor tables in Haswell / Broadwell processor families

On Haswell / Broadwell platforms, changing one descriptor table in the root signature costs as much as changing all descriptor tables. Changing one argument forces the hardware to create a copy (version) of all current arguments. The number of root parameters in a root signature therefore determines the amount of data the hardware has to version (that is, copy in full) whenever any subset changes.

Note. For all other types of memory in DirectX 12, such as heaps of descriptors, buffer resources, etc., the hardware does not create new versions.

In other words, changing all parameters costs about the same as changing one (see [Lauritzen] and [MSDN]). The cheapest option is, of course, to change nothing, but that is not useful.
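To get a feel for how much data is versioned, the root signature size can be counted in DWORDs. The sketch below uses the documented Direct3D 12 accounting (a descriptor table costs 1 DWORD, a root descriptor 2 DWORDs, each 32-bit root constant 1 DWORD, with a limit of 64 DWORDs). The types, function names, and the simple "any change copies everything" cost model are assumptions that follow the description above, not measured hardware behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; not part of the D3D12 API.
enum class RootParamKind { DescriptorTable, RootDescriptor, RootConstants };

struct RootParam {
    RootParamKind kind;
    std::uint32_t num32BitValues;  // only meaningful for RootConstants
};

// Size of one root parameter in DWORDs, per the D3D12 root signature limits.
inline std::uint32_t rootParamDwords(const RootParam& p) {
    switch (p.kind) {
        case RootParamKind::DescriptorTable: return 1;
        case RootParamKind::RootDescriptor:  return 2;
        case RootParamKind::RootConstants:   return p.num32BitValues;
    }
    return 0;
}

// Total root signature size in DWORDs.
inline std::uint32_t rootSignatureDwords(const std::vector<RootParam>& params) {
    std::uint32_t total = 0;
    for (const RootParam& p : params) total += rootParamDwords(p);
    assert(total <= 64);  // D3D12 maximum root signature size
    return total;
}

// Assumed cost model: changing any non-empty subset of root arguments
// versions (copies) the whole set, so the cost ignores how many changed.
inline std::uint32_t versionedDwords(const std::vector<RootParam>& params,
                                     std::uint32_t changedCount) {
    return changedCount == 0 ? 0 : rootSignatureDwords(params);
}
```

For the root signature in Figure 1 (two descriptor tables, one root descriptor, four root constants), this model gives 8 DWORDs copied whether one argument changes or all four do.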

Note. On other hardware, where root argument storage is split into fast and slow memory, a change versions only the region of memory in which the argument changed, that is, either the fast region or the slow region.

On Haswell / Broadwell platforms, the additional cost of changing descriptor tables may be due to the limited size of the binding table in the hardware.

Descriptor tables on these hardware platforms use hardware “binding tables”. Each binding table entry is a single DWORD that can be thought of as an offset into the descriptor heap. A 64 KB ring can hold 16,384 binding table entries (64 KB divided by 4 bytes per entry).

In other words, the amount of memory consumed by each draw call depends on the total number of descriptors that are indexed in a descriptor table and referenced through the root signature.
If the 64 KB of binding table entries runs out, the driver allocates another 64 KB binding table. Switching between these tables causes a pipeline stall, as shown in Figure 2.

Figure 2. Pipeline stall (figure courtesy of Andrew Lauritzen)

Suppose that a root signature references 64 descriptors in a descriptor table. A stall will then occur every 16,384 / 64 = 256 draw calls.
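The same arithmetic can be written as a small helper for trying other table sizes. This is a sketch under the assumption used in the example above, namely that every draw call consumes fresh binding table entries for all descriptors it references; the constant and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Figures taken from the text: a 64 KB ring of one-DWORD entries.
constexpr std::uint32_t kBindingTableRingBytes  = 64 * 1024;
constexpr std::uint32_t kBindingTableEntryBytes = 4;
constexpr std::uint32_t kBindingTableEntries =
    kBindingTableRingBytes / kBindingTableEntryBytes;
static_assert(kBindingTableEntries == 16384, "64 KB / 4 bytes per entry");

// Number of draw calls that fit in one ring before the driver has to
// switch to another binding table (and the pipeline stalls).
inline std::uint32_t drawCallsPerStall(std::uint32_t descriptorsPerDraw) {
    assert(descriptorsPerDraw > 0);
    return kBindingTableEntries / descriptorsPerDraw;
}
```

For example, drawCallsPerStall(64) yields 256, matching the figure above, while drawCallsPerStall(16) yields 1024, which is why smaller descriptor tables stall less often on these platforms.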

Because changing a root signature is cheap, it is preferable to use many root signatures with few descriptors in their descriptor tables rather than root signatures with many descriptors in their descriptor tables.

Therefore, on Haswell / Broadwell platforms, descriptor tables should reference as few descriptors as possible.
What does this mean for rendering? Using more descriptor tables with fewer descriptors each (and therefore more root signatures) increases the number of pipeline state objects (PSOs), because PSOs and root signatures are in a one-to-one relationship.
More pipeline state objects can in turn mean more shaders, which in this case can be more specialized rather than longer shaders covering a wider range of functionality, as is generally recommended.

Root constants and descriptors in Haswell / Broadwell processor families

As mentioned above, changing one descriptor table costs as much as changing all descriptor tables. Similarly, changing one root constant or root descriptor costs as much as changing all of them (see [Lauritzen]).

Root constants are implemented using “push constants”, a buffer that the hardware uses to pre-populate Execution Unit (EU) registers. Because the values are available as soon as an EU thread starts, storing constant data as root constants rather than in descriptor tables can improve performance.
Root descriptors are also implemented using “push constants”. They are pointers passed to shaders as constants, and the data is then read through the regular memory access path.

Descriptor Tables and Root Constants / Descriptors in Haswell / Broadwell Processor Families

We have now reviewed how descriptor tables, root constants, and root descriptors are implemented, so we can answer the main question of this article: which is preferable? Because of the limited size of the hardware binding table and the stalls that can occur when it is exceeded, changing root constants and root descriptors appears to be cheaper on Haswell / Broadwell platforms, since they do not use the hardware binding table. The gain is greatest when the data changes with every draw call.

Static samplers in Haswell / Broadwell processor families

As described in the previous article, you can define the samplers in the root signature or directly in the shader using the HLSL root signature language. Such samplers are called static.

On Haswell / Broadwell platforms, the driver places static samplers in the regular sampler heap. This is equivalent to placing them in descriptors manually. On other hardware platforms, samplers are placed in shader registers, so static samplers can be compiled directly into the shader.

In general, static samplers perform well on all platforms, so they can be used without reservation. However, on Haswell / Broadwell platforms, the more descriptors a descriptor table contains, the more often a pipeline stall is likely, since the hardware binding table holds only 16,384 entries.
Here is the static sampler syntax in HLSL.

 StaticSampler( sReg,
               [ filter = FILTER_ANISOTROPIC,
                 addressU = TEXTURE_ADDRESS_WRAP,
                 addressV = TEXTURE_ADDRESS_WRAP,
                 addressW = TEXTURE_ADDRESS_WRAP,
                 mipLODBias = 0.f,
                 maxAnisotropy = 16,
                 comparisonFunc = COMPARISON_LESS_EQUAL,
                 borderColor = STATIC_BORDER_COLOR_OPAQUE_WHITE,
                 minLOD = 0.f,
                 maxLOD = 3.402823466e+38f,
                 space = 0,
                 visibility = SHADER_VISIBILITY_ALL ])

Most of the parameters need no explanation, since they mirror those used at the C++ level. The main difference is the border color: the C++ level supports the full color range, while the HLSL level offers only opaque white, opaque black, and transparent black. An example of a static sampler:

 StaticSampler(s4, filter=FILTER_MIN_MAG_MIP_LINEAR)

Skylake

Skylake supports dynamic indexing of the entire descriptor heap (about 1 million resources) through a single descriptor table. This means that one descriptor table may be enough to index the whole descriptor heap.
Compared to previous architectures, descriptor table entries in the root signature do not need to change as often. The number of root signatures can also be reduced. Different materials will of course still require different shaders and therefore different pipeline state objects (PSOs), but these PSOs can reference the same root signature.
Modern graphics engines use fewer shaders than their DirectX 9 and DirectX 11 predecessors to avoid the cost of changing shaders and the associated state, so reducing the number of root signatures and, accordingly, PSOs should yield performance gains on any hardware platform.
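The effect of dynamic indexing on per-draw binding work can be sketched with a toy model. It assumes that one descriptor table covers the whole heap and that each draw passes only an index (for example, as a root constant) to select its texture; the DrawBinding struct and both functions are hypothetical and do not correspond to Direct3D 12 calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model; all names are hypothetical.
struct DrawBinding {
    std::size_t   tableBase;     // first heap slot covered by the bound table
    std::uint32_t textureIndex;  // per-draw index, e.g. passed as a root constant
};

// Heap slot a shader would read for this draw when indexing dynamically.
inline std::size_t resolveHeapSlot(const DrawBinding& b, std::size_t heapSize) {
    const std::size_t slot = b.tableBase + b.textureIndex;
    assert(slot < heapSize);
    return slot;
}

// How many times the descriptor table binding changes across a draw list.
inline std::size_t countTableRebinds(const std::vector<DrawBinding>& draws) {
    std::size_t rebinds = 0;
    for (std::size_t i = 1; i < draws.size(); ++i) {
        if (draws[i].tableBase != draws[i - 1].tableBase) ++rebinds;
    }
    return rebinds;
}
```

With a single table over the whole heap, every draw shares the same tableBase and only textureIndex varies, so countTableRebinds returns 0. With one small table per material, the table binding changes whenever the material does.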

Conclusion

For the Haswell / Broadwell and Skylake platforms, the recommendations for improving the performance of DirectX 12 applications depend on the platform. On Haswell / Broadwell, descriptor tables should contain few descriptors, whereas on Skylake the recommendation is to have fewer descriptor tables that each contain more descriptors.
For optimal performance, an application can check the hardware platform at startup and then select the resource binding method accordingly. (A GPU detection sample that shows how to detect various Intel® hardware architectures is available here.) The choice of resource binding method determines how the system's shaders are written.
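The startup-time selection could be organized along the following lines. This is a minimal sketch: the enums and the function are hypothetical, the architecture value is assumed to come from a separate detection step such as the sample mentioned above, and the fallback for unknown hardware is an arbitrary assumption that the article does not cover.

```cpp
#include <cassert>

// Hypothetical enums; detection of the architecture itself is not shown.
enum class IntelGpuArch { Haswell, Broadwell, Skylake, Unknown };

enum class BindingStrategy {
    ManySmallTables,  // few descriptors per table, lean on root constants/descriptors
    FewLargeTables    // large tables with dynamic indexing, fewer root signatures
};

// Maps the detected architecture to the recommendation from the conclusion.
inline BindingStrategy chooseBindingStrategy(IntelGpuArch arch) {
    switch (arch) {
        case IntelGpuArch::Haswell:
        case IntelGpuArch::Broadwell:
            return BindingStrategy::ManySmallTables;
        case IntelGpuArch::Skylake:
            return BindingStrategy::FewLargeTables;
        case IntelGpuArch::Unknown:
            break;
    }
    // Assumption: default to the conservative layout on unrecognized hardware.
    return BindingStrategy::ManySmallTables;
}
```

Because the strategy affects how shaders declare and index their resources, the choice would typically also select between shader variants rather than only between root signature layouts.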

Links and useful materials

[Lauritzen] Andrew Lauritzen et al., “Efficient Rendering with DirectX 12 on Intel Graphics”, GDC, 2015.
[MSDN] MSDN, “Advanced Use of Descriptor Tables”.
Microsoft DirectX Blog.
DirectX 12 on Twitter: @DirectX12.
Direct3D* 12: Console API Efficiency and Performance on PCs.
Microsoft DirectX 12 graphics tutorials (YouTube channel).

Source: https://habr.com/ru/post/277121/
