
At WWDC 2014, a surprise was waiting for all of us: the announcement of a new 3D graphics API called Metal. This time we are not dealing with a new high-level API on top of OpenGL ES (as was the case with SceneKit), but with a new low-level API for rendering and compute that can serve as a replacement for OpenGL in games. According to Apple, Metal can be up to 10 times faster than OpenGL ES (more precisely, it can generate draw calls, the commands that submit rendering work and data to the GPU, 10 times faster) and is available only on iOS devices with the latest-generation A7 processor.
This announcement provoked a new wave of discussion and controversy about whether we need new graphics APIs that should (or should not, who knows) replace OpenGL. This post does not intend to take part in that discussion; its purpose is to clarify how Metal differs from OpenGL ES, which it is meant to replace. To understand what is so special (or, on the contrary, not special at all) about the Metal API, we will have to look a bit under the hood of graphics APIs and GPUs.
How do GPUs and graphics APIs work?
A naive reader might assume that an API call directly makes something happen on the GPU. An even more naive reader might assume that the GPU has finished handling the call by the time the API returns. Both assumptions are far from reality. If the driver executed rendering commands at the moment they were issued and waited for the rendering to complete before returning from the API call, neither the CPU nor the GPU could work efficiently, since one of the processors would always be blocked waiting for the other.
A simple way to improve GPU utilization is to run this process asynchronously; then the GPU does not block the CPU, and API calls return almost instantly. Even so, the GPU may not be used to 100%, since it may have to wait for the CPU to issue new commands (the start of a frame), while the remaining buffered commands wait for the previous ones to complete. This is why most graphics drivers collect all the draw calls (and the other tasks that need to be performed on the GPU, such as state changes) needed to draw the entire frame before sending anything to the GPU. The buffered commands are then submitted once the command to draw the next frame arrives, so that the GPU is used as efficiently as possible. Of course, this adds one frame of latency: while the CPU builds the commands for the current frame, the GPU renders the previous one. It is in fact possible to buffer more than one frame and thereby achieve a higher frame rate, at the cost of even more latency.
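The buffering idea above can be sketched in a few lines. This is a toy Python simulation, not real driver code; all names (`ToyDriver`, `end_frame`, and so on) are invented for illustration:

```python
from collections import deque

class ToyDriver:
    """Toy model of a buffering driver: draw calls are only recorded,
    and the whole frame's command list is handed to the 'GPU' when the
    frame ends -- hence the GPU runs one frame behind the CPU."""
    def __init__(self):
        self.current = []          # commands recorded for the frame in progress
        self.gpu_queue = deque()   # finished frames waiting on the 'GPU'

    def draw(self, mesh):
        self.current.append(f"draw {mesh}")   # returns immediately, nothing drawn yet

    def end_frame(self):
        self.gpu_queue.append(self.current)   # submit the whole frame at once
        self.current = []

    def gpu_execute_one(self):
        # the 'GPU' consumes one buffered frame, if any
        return self.gpu_queue.popleft() if self.gpu_queue else None

driver = ToyDriver()
driver.draw("bunny"); driver.draw("teapot")
driver.end_frame()                  # frame 0 submitted
driver.draw("bunny")                # CPU already records frame 1...
print(driver.gpu_execute_one())     # ...while the GPU executes frame 0
# -> ['draw bunny', 'draw teapot']
```

Buffering two or three frames instead of one is just a longer queue, which is exactly the latency-for-throughput trade described above.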
The second mistake in our naive assumption concerns what the state-change calls actually do. In practice the driver often defers them: the new state is only resolved at the next draw call, when the full combination of states is known, so an innocent-looking state change can turn out to be surprisingly expensive.
So we have learned at least two important things about what happens behind the scenes when OpenGL talks to a modern GPU: state changes can be expensive when a new combination of states is required, and all GPU operations are executed with a certain delay.
Inside the application, a single stream of commands for one frame is built and sent to the GPU all at once (in reality things are a bit more complicated, but let's not go deeper yet).
You can read more about how the modern graphics pipeline works in Fabian Giesen's series of articles, "A trip down the Graphics Pipeline".
Why a different programming model may have advantages
As you have already seen, a huge number of complications and clever tricks are hidden from the programmer (there are probably even more than I mentioned), obscuring what actually happens. Some of them make a developer's life easier; others force him to look for ways to outsmart the driver or to exploit side effects of API calls.
Some of today's graphics APIs try to remove most of these tricks by exposing the machinery they used to hide, in some cases leaving the application to solve all the related problems itself. The graphics API of the PS3 went in this direction, AMD is going there with Mantle, and the upcoming DirectX 12 and Apple's Metal are headed the same way.
What has changed?
Command buffers are now exposed: the application must fill these buffers and submit them to a command queue that executes them in the specified order on the GPU. The application thus has full control over the work sent to the GPU and decides how many frames of latency to add (more latency, but higher GPU utilization). Buffering commands for the GPU and submitting them asynchronously for the next frame must now be implemented by the application itself.
Since these buffers are clearly not executed immediately (that is, at creation time), and since multiple buffers can be created and enqueued for execution in a specific order, the application can afford to build them in multiple threads in parallel. It also becomes more obvious to the programmer which results of the computations are already available and which are not.
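The key point, encoding in parallel but executing in a fixed order, can be illustrated with a toy Python sketch (plain threads and a queue standing in for command buffers; none of this is Metal API):

```python
import threading
from queue import Queue

# Each thread encodes one command buffer in parallel; the queue then
# executes the buffers in the order the application enqueued them,
# not in the order the threads happened to finish encoding.
def encode(name, commands, out, idx):
    out[idx] = [f"{name}:{c}" for c in commands]   # pretend-encoding work

buffers = [None] * 3
threads = [
    threading.Thread(target=encode, args=(f"pass{i}", ["clear", "draw"], buffers, i))
    for i in range(3)
]
for t in threads: t.start()
for t in threads: t.join()

command_queue = Queue()
for buf in buffers:            # enqueue in a deterministic, app-chosen order
    command_queue.put(buf)

while not command_queue.empty():
    print(command_queue.get()[0])
# -> pass0:clear
#    pass1:clear
#    pass2:clear
```

The separation between building a buffer (parallel, unordered) and committing it to the queue (serial, ordered) is what makes multithreaded command generation safe.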
State changes are now organized into state objects that can be switched cheaply, while creating these objects is more expensive. For example, MTLRenderPipelineState contains the shaders and all the state that is implemented by patching them.
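The trade-off, validate once at creation, switch cheaply afterwards, can be sketched as a toy Python analogue (the class names only mimic the Metal concept; this is not the real API):

```python
class ToyPipelineState:
    """Toy analogue of a precompiled state object (cf. MTLRenderPipelineState):
    validation happens once at creation time, so switching states later
    is just a reference swap with no per-draw checks."""
    def __init__(self, shader, blend):
        if blend not in ("opaque", "alpha"):          # validate up front...
            raise ValueError(f"unknown blend mode: {blend}")
        self.key = (shader, blend)                    # ...and bake the combination

class ToyEncoder:
    def __init__(self):
        self.state = None
    def set_state(self, pso):     # cheap: no validation per draw
        self.state = pso
    def draw(self):
        return f"draw with {self.state.key}"

pso = ToyPipelineState("lit.vert+lit.frag", "alpha")  # expensive, done once
enc = ToyEncoder()
enc.set_state(pso)                                    # cheap, done per frame
print(enc.draw())
```

Contrast this with the classic model, where every state-setting call may force the driver to re-validate and re-patch at the next draw.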
Another advantage of the new API is that it does not have to carry the burden of compatibility with previous versions, and therefore does not need to be as conservative.
There is also a consequence of targeting the A7: Metal is designed for systems with shared memory, i.e. the CPU and GPU can access the same data directly without having to transfer it over a bus. Metal gives the application direct CPU access to buffers, and the responsibility for ensuring that this data is not used by the GPU at the same time rests on the programmer's shoulders. This useful feature makes it possible to mix the results of computations on the GPU and the CPU.
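The hazard the programmer must now handle can be shown with a toy Python sketch: one list plays the shared buffer, a thread plays the "GPU", and an event is the synchronization the application must add itself (all names invented; not Metal code):

```python
import threading

# One buffer visible to both 'CPU' and 'GPU' (shared memory): no copy is
# needed, but the application must make sure the GPU is done with the data
# before the CPU overwrites it -- here a plain event does that job.
buffer = [1, 2, 3, 4]          # CPU fills the shared buffer
results = []
gpu_done = threading.Event()

def gpu_pass():
    results.append(sum(buffer))   # 'GPU' reads the shared buffer directly
    gpu_done.set()                # signal that the buffer may be reused

t = threading.Thread(target=gpu_pass)
t.start()
gpu_done.wait()                # CPU must wait before touching the buffer again
buffer[:] = [5, 6, 7, 8]       # safe to overwrite only after the wait
t.join()
print(results, buffer)
# -> [10] [5, 6, 7, 8]
```

Skipping the `gpu_done.wait()` would be exactly the race Metal allows you to create: the "GPU" could read a half-overwritten buffer.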
So how is it 10 times faster?
Every draw call costs some time on the CPU and some time on the GPU. The Metal API reduces the time spent on the CPU by simplifying state tracking and by reducing the number of checks the driver must make for valid state combinations. Precomputed state objects help here too: not only can errors be checked when a state object is built, but setting state afterwards requires fewer API calls. The ability to build command buffers in parallel raises the possible number of draw calls even further when the application is CPU-bound.
Rendering on the GPU, on the other hand, does not become faster: an application that issues very few draw calls on large meshes (a mesh is a part of a model, the collection of an object's vertices) will gain nothing by switching to Metal.
Can the same thing be done in OpenGL?
At GDC 2014 there was a great presentation, "Approaching Zero Driver Overhead", by Cass Everitt, John McDonald, Graham Sellers and Tim Foley. Its main idea was to reduce the driver's work in OpenGL by increasing the amount of work performed per draw call, and to use new GL objects and fewer GL calls to increase efficiency.
This and other ideas will require further extensions to OpenGL and new versions of the API, but much of it could also be brought to OpenGL ES. What we would lose is direct control over the command buffers, with all the pros and cons that entails.
What is the probability of seeing this in the future? Because of backward compatibility, we can only hope for the appearance of a subset of functions that could be called a "modern core", but it will most likely have to remain compatible with everything back to the original glBegin(). This restriction will hold throughout the entire potential future of OpenGL and will limit its evolution, making alternatives such as the Metal API increasingly attractive...
Original article:
http://renderingpipeline.com/2014/06/whats-the-big-deal-with-apples-metal-api/