
Eliminating CPU-GPU synchronization stalls in Galactic Civilizations 3



Galactic Civilizations 3 (GC3) is a turn-based 4X strategy game developed and published by Stardock Entertainment. The game was released on May 14, 2015. During early access and beta testing, we captured and analyzed the game's rendering performance. One of the major improvements we made was eliminating several sources of CPU-GPU synchronization stalls, which were breaking the parallelism between the two processors. This article describes the problem and its solution, and discusses the importance of using performance analysis tools during development, keeping in mind the strengths and weaknesses of each tool.

Identifying the problem


We began studying rendering performance with the Graphics Performance Analyzers (GPA) included in the Intel INDE suite. The screenshot below shows trace data (with vertical synchronization disabled) before the improvements were made. The GPU queue has gaps both within and between frames, and at any given moment less than one frame of work is queued. If the CPU does not feed the GPU queue with enough work, gaps appear that the application cannot use to improve performance or rendering quality.


Before: frame time of about 21 ms; less than 1 frame queued; gaps in the GPU queue; very long Map calls
In addition, the GPA Platform Analyzer shows how long each Direct3D 11 API call takes (that is, the time to send the command along the path application - runtime - driver and get a response back). The following screenshot shows an ID3D11DeviceContext::Map call that takes about 15 ms to return. During this time, the main application thread does nothing.

The picture below zooms in on the timeline for a single frame (from the start of CPU work to the end of GPU work). The idle gaps are marked with pink rectangles; they total about 3.5 ms per frame. Platform Analyzer also shows the total duration of the various API calls in this trace (4.306 seconds), of which Map calls account for 4.015 seconds!



Note that the Frame Analyzer tool cannot detect a long Map call from a frame capture. Frame Analyzer uses GPU timer queries to measure the time taken by each erg, which covers state changes, resource binding, and the draw itself. Map calls execute on the CPU without any GPU involvement.
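Because the stall happens on the CPU, a CPU-side timer around the suspect call can expose it where a GPU timer cannot. The following is a minimal sketch in standard C++. `measureMs` and `blockingCall` are hypothetical names introduced here for illustration; they are not part of GPA or Direct3D, and `blockingCall` merely stands in for code that issues a long Map call.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` once and returns the wall-clock time
// spent on the calling (CPU) thread, in milliseconds.
template <typename F>
double measureMs(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical stand-in for a call that keeps the CPU thread from making
// progress (for example, a Map call that waits on the GPU).
// Here it simply busy-waits for the requested number of milliseconds.
inline void blockingCall(int ms) {
    assert(ms >= 0);
    const auto until =
        std::chrono::steady_clock::now() + std::chrono::milliseconds(ms);
    while (std::chrono::steady_clock::now() < until) {
        // busy-wait
    }
}
```

Calling `measureMs([] { blockingCall(15); })` returns a value of at least 15, which is the kind of figure a trace tool such as Platform Analyzer attributes to the call.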

Finding the source of the problem


(The section on Direct3D resources at the end of the article gives a basic overview of how resources are used and updated.)
Driver debugging showed that the long Map call uses the D3D11_MAP_WRITE_DISCARD flag (Platform Analyzer does not display the arguments of a Map call) to update a large vertex buffer created with the D3D11_USAGE_DYNAMIC flag.

This technique is very common in games for optimizing the data flow to frequently updated resources. When a dynamic resource is mapped with D3D11_MAP_WRITE_DISCARD, the call returns an alias taken from that resource's alias heap. Each alias is a separate memory allocation that backs the resource for one mapping. When the current heap runs out of space for aliases, a shadow alias heap is allocated. This continues until the maximum number of heaps for the resource is reached.

This was exactly the problem in Galactic Civilizations 3. Each time this happened (several times per frame, for several large resources that were mapped many times), the driver waited for a Draw call that was using a previously assigned alias to finish, so that the alias could be reused for the new request. The problem was not specific to the Intel driver. It also occurred with the NVIDIA driver, where we used GPUView to confirm the data obtained with Platform Analyzer.

The vertex buffer was about 560 KB in size (as reported by the driver) and was mapped with discard about 50 times per frame. To store aliases, the Intel driver allocates heaps on demand (1 MB each) per resource. Aliases are allocated from a heap until it is full, after which the resource is given another 1 MB shadow heap for aliases, and so on. For the buffer behind the long Map call, a heap could hold only one alias, so each Map call on the resource created a new shadow heap for the new alias until the heap limit was reached. This happened in every frame (which explains the repeating pattern in the trace). Once the limit was reached, each Map call made the driver wait until an earlier Draw call (from the same frame) had finished with its alias so that the alias could be reused.
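The arithmetic behind "one alias per heap" can be checked with a few lines of standard C++. This is a sketch under the figures quoted above (1 MB heaps, a buffer of about 560 KB, about 50 discard mappings per frame); the function names are hypothetical, and the real driver's bookkeeping is certainly more involved.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t KiB = 1024;
constexpr std::size_t MiB = 1024 * KiB;

// Number of whole aliases (full-size copies of the buffer) that fit
// into one alias heap.
inline std::size_t aliasesPerHeap(std::size_t heapBytes,
                                  std::size_t bufferBytes) {
    assert(bufferBytes > 0);
    return heapBytes / bufferBytes;
}

// Number of heaps needed so that every discard mapping in a frame gets
// its own alias. Returns 0 if the buffer does not fit into a heap at all.
inline std::size_t heapsNeeded(std::size_t mapsPerFrame,
                               std::size_t heapBytes,
                               std::size_t bufferBytes) {
    const std::size_t perHeap = aliasesPerHeap(heapBytes, bufferBytes);
    if (perHeap == 0) {
        return 0;
    }
    // Round up: a partially filled heap still has to be allocated.
    return (mapsPerFrame + perHeap - 1) / perHeap;
}
```

With these inputs, `aliasesPerHeap(1 * MiB, 560 * KiB)` is 1 and `heapsNeeded(50, 1 * MiB, 560 * KiB)` is 50, so a frame would need 50 heaps for this one resource to avoid any waiting. The result is the same whether KB and MB are read as binary or decimal units.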

We examined the API log in Frame Analyzer and filtered for resources that were mapped several times. Several vertex buffers were mapped more than 50 times per frame, and the user interface system was the main culprit. Driver debugging showed that each mapping updated only a small portion of the buffer.


The same resource (with identifier 2322) is mapped many times within a single frame.

Solution to the problem


At Stardock, we instrumented all the rendering systems to emit additional markers on the Platform Analyzer timeline, both to confirm that the user interface system was responsible for the long calls and to support future profiling.
We had several possible ways to solve the problem.

We chose the second option and changed a constant in the buffer creation code. The vertex buffer sizes for each subsystem were hard-coded and only needed to be reduced. Each 1 MB heap could now hold several aliases and, given the relatively small number of Draw calls in Galactic Civilizations 3, the problem should have disappeared.
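Why a smaller buffer helps can be illustrated with a deliberately simplified model of the alias pool. The sketch below assumes that no alias is released before the frame ends, and it uses a heap limit of 8 and a reduced buffer size of 64 KB purely as hypothetical values: the article states neither the driver's actual limit nor the new buffer sizes.

```cpp
#include <cassert>
#include <cstddef>

// Simplified model of the per-resource alias pool kept by the driver.
// Assumptions (not taken from the article): a hard limit of `maxHeaps`
// heaps, and no alias becomes free again before the frame ends.
struct AliasPool {
    std::size_t heapBytes;  // size of each alias heap
    std::size_t maxHeaps;   // hypothetical limit on heaps per resource
};

// Number of discard mappings in one frame that find no free alias and
// would therefore have to wait for an earlier Draw call to finish.
inline std::size_t stalledMaps(const AliasPool& pool,
                               std::size_t bufferBytes,
                               std::size_t mapsPerFrame) {
    assert(bufferBytes > 0);
    const std::size_t capacity =
        (pool.heapBytes / bufferBytes) * pool.maxHeaps;
    return mapsPerFrame > capacity ? mapsPerFrame - capacity : 0;
}
```

Under these assumptions, a pool of `AliasPool{1024 * 1024, 8}` leaves 42 of 50 mappings waiting for a 560 KB buffer (one alias per heap, 8 aliases in total) and none for a 64 KB buffer (16 aliases per heap, 128 in total).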
Fixing the problem in one rendering subsystem made it more pronounced in another, so we applied the same change to all subsystems. The screenshot below shows the trace with the fixes and the new instrumentation, along with a zoomed-in view of one frame.


After: frame time of about 16 ms; 3 frames queued; no gaps in the GPU queue; no long Map calls



The total duration of Map calls dropped from 4 seconds to 157 milliseconds! The gaps in the GPU queue disappeared. The queue consistently held 3 frames, so when the GPU finished a frame, the next one was already waiting! A few simple changes were enough to keep the GPU busy. Performance improved by about 24%: frame time dropped from about 21 ms to about 16 ms.
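The 24% figure follows directly from the two frame times. The sketch below redoes the calculation in standard C++; the function names are hypothetical, and the inputs are the rounded values quoted above, so the results are approximate.

```cpp
#include <cassert>
#include <cmath>

// Relative reduction in frame time, in percent.
inline double frameTimeReductionPct(double beforeMs, double afterMs) {
    assert(beforeMs > 0.0);
    return (beforeMs - afterMs) / beforeMs * 100.0;
}

// Frames per second that correspond to a frame time in milliseconds.
inline double framesPerSecond(double frameMs) {
    assert(frameMs > 0.0);
    return 1000.0 / frameMs;
}
```

`frameTimeReductionPct(21.0, 16.0)` gives roughly 23.8, which rounds to the 24% quoted above, and `framesPerSecond` gives roughly 47.6 frames per second before the fix and 62.5 after it.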

Conclusion


Optimizing rendering performance in games is a complex task. Frame capture and replay tools and trace tools each provide different, important information about a game's performance. This article focused on CPU-GPU synchronization stalls, which require a tracing tool such as GPA Platform Analyzer or GPUView to diagnose.

Direct3D* Resource Basics


The Direct3D API can be divided into calls for creating and destroying resources, setting the state of the rendering pipeline, binding resources to the pipeline, and updating certain resources. Most resource creation happens while levels and scenes are loading.

A typical game frame consists of binding various resources to the pipeline, setting pipeline state, updating CPU-side resources (constant, vertex, and index buffers) according to the simulation state, and updating GPU-side resources (render targets, unordered access views [UAVs]) through draw, dispatch, and clear operations.

When a resource is created, a value of the D3D11_USAGE enumeration specifies how it will be accessed:
  1. GPU read and write access (DEFAULT - for render targets, UAVs, rarely updated constant buffers);
  2. GPU read-only access (IMMUTABLE - for textures);
  3. CPU write access + GPU read access (DYNAMIC - for frequently updated buffers);
  4. CPU access, with the GPU able to copy data to the resource (STAGING).

Note that for cases 3 and 4 the resource's D3D11_CPU_ACCESS_FLAG value must be set correctly.
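The four usage cases can be summarized as a small lookup table. The sketch below is plain C++ that mirrors the list above; it does not include the Direct3D headers, and the `Usage`, `Access`, `accessFor`, and `needsCpuAccessFlags` names are hypothetical stand-ins rather than real API types.

```cpp
#include <cassert>

// Plain-C++ mirror of the four D3D11_USAGE values (hypothetical type,
// named differently so it cannot clash with the real headers).
enum class Usage { Default, Immutable, Dynamic, Staging };

// Who may read or write a resource created with a given usage.
struct Access {
    bool gpuRead;
    bool gpuWrite;
    bool cpuRead;
    bool cpuWrite;
};

inline Access accessFor(Usage usage) {
    switch (usage) {
        case Usage::Default:   return {true, true, false, false};
        case Usage::Immutable: return {true, false, false, false};
        case Usage::Dynamic:   return {true, false, false, true};
        // For staging resources, GPU access is limited to copy operations.
        case Usage::Staging:   return {true, true, true, true};
    }
    assert(false && "unknown usage value");
    return {false, false, false, false};
}

// Cases 3 and 4: the resource needs CPU access flags at creation time.
inline bool needsCpuAccessFlags(Usage usage) {
    const Access access = accessFor(usage);
    return access.cpuRead || access.cpuWrite;
}
```

In this model, `needsCpuAccessFlags` is true only for `Usage::Dynamic` and `Usage::Staging`, which matches the note about cases 3 and 4.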
The Direct3D 11 API provides three ways to update resource data, each suited to specific usage patterns (as described above):
  1. Map / Unmap;
  2. UpdateSubresource;
  3. CopyResource / CopySubresourceRegion.

One interesting scenario requires implicit synchronization: the CPU has write access to a resource and the GPU has read access. This scenario comes up often during frame processing. Examples include updating the view, model, and projection matrices or the bone transforms of an animated model. Waiting for the GPU to finish using the resource would cause an unacceptable slowdown. Creating several independent copies of the resource to handle this would make things too complicated for application developers. For this reason, Direct3D versions 9 through 11 hand the task to the driver through the discard flag (D3D11_MAP_WRITE_DISCARD in Direct3D 11). Each time a resource is mapped with this flag, the driver provides a new memory region for the CPU to write to. As a result, Draw calls issued after different updates work with different aliases of the resource, which of course increases GPU memory usage.
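The idea of aliasing on discard can be modelled without any graphics API. The sketch below is a conceptual model in standard C++, not Direct3D code: `RenamedBuffer`, `mapDiscard`, and `recordDraw` are hypothetical names, and unlike a real driver this model never recycles or limits its aliases.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Conceptual model of a dynamic buffer that is renamed on every
// discard mapping. Each alias is an independent block of memory.
class RenamedBuffer {
public:
    using Storage = std::vector<unsigned char>;

    explicit RenamedBuffer(std::size_t bytes)
        : bytes_(bytes), current_(std::make_shared<Storage>(bytes)) {}

    // Models a discard mapping: the CPU receives a fresh alias to write
    // to and never waits for draws that still read an older alias.
    Storage& mapDiscard() {
        current_ = std::make_shared<Storage>(bytes_);
        ++aliases_;
        return *current_;
    }

    // Models a Draw call: the recorded command keeps a reference to the
    // alias that was current when the call was issued.
    std::shared_ptr<const Storage> recordDraw() const { return current_; }

    // Total number of aliases created so far, including the initial one.
    std::size_t aliasCount() const { return aliases_; }

private:
    std::size_t bytes_;
    std::shared_ptr<Storage> current_;
    std::size_t aliases_ = 1;
};
```

If a buffer is mapped, filled with one value, and drawn, then mapped again, filled with another value, and drawn, each recorded draw still sees the data that was current when it was issued. The price is one extra alias per mapping, which is the memory cost described above.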
More information about managing resources in Direct3D:

Source: https://habr.com/ru/post/267635/

