Optimizing the rendering of scenes from the Disney cartoon "Moana". Part 2

Inspired by the first victory in parsing using the description of the island scene from the Disney cartoon “Moana” , I further delved into the study of memory usage. Much more could have been done with the execution time, but I decided that it would be useful to first explore the situation.

I started the run-time study with pbrt built-in statistics; pbrt has a manual setting of significant memory allocations to track memory usage, and after rendering is complete, a memory allocation report is displayed. Here is what the original memory allocation report for this scene was:


BVH- 9,01
1,44
MIP- 2,00
11,02

As for the execution time, the built-in statistics turned out to be brief and issued a report only on the allocation of memory for known objects of 24 GB in size. top reported that about 70 GB of memory was actually used, that is, the statistics do not take into account 45 GB. Small deviations are quite understandable: dynamic memory allocators require additional space to register resource use, some are lost due to fragmentation, and so on. But 45 GB? Something bad is definitely hiding here.

To understand what we are missing (and to make sure that the tracing found was correct), I used massif to trace valid allocations of dynamic memory. It is rather slow, but at least it works well.
')

Primitives

The first thing I found when tracing massif was two lines of code that contained in memory instances of the Primitive base class, which are not counted in statistics. A little oversight that is fairly easy to fix . After that we see the following:

Primitives 24,67

Oh, her. So what is a primitive, and why all this memory?

pbrt distinguishes Shape , which is a pure geometry (sphere, triangle, etc.) and Primitive , which is a combination of geometry, material, sometimes the function of radiation and the involved medium inside and outside the surface of the geometry.

There are several variants of the Primitive base class: GeometricPrimitive , which is a standard case: a “vanilla” combination of geometry, material, etc., as well as a TransformedPrimitive , which is a primitive with transformations applied to it, either as an instance of an object or for moving primitives with time-varying transformations. It turns out that in this scene both of these types are a waste of space.

GeometricPrimitive: 50% extra space

Note: some erroneous assumptions are made in this analysis; they are revised in the fourth post of the series .

4.3 GB used on GeometricPrimitive . It’s funny to live in a world where 4.3 GB of used RAM is not your biggest problem, but let's still see where we have 4.3 GB of GeometricPrimitive . Here are the relevant parts of the class definition:

 class GeometricPrimitive : public Primitive { std::shared_ptr<Shape> shape; std::shared_ptr<Material> material; std::shared_ptr<AreaLight> areaLight; MediumInterface mediumInterface; };

We have a pointer to a vtable , three more pointers, and then a MediumInterface containing two more pointers with a total size of 48 bytes. In this scene, there are only a few meshes radiating illumination; therefore, areaLight almost always a null pointer, and there is no environment influencing the scene, so both mediumInterface pointers mediumInterface also null. Thus, if we had a specialized implementation of the Primitive class, which could be used in the absence of the radiation function and the environment, we would have saved almost half of the disk space occupied by GeometricPrimitive - in our case, about 2 GB.

However, I did not make a correction and add a new Primitive implementation to pbrt. We strive as much as possible to minimize the differences between the pbrt-v3 source code on github and the system described in my book, for a very simple reason — keeping them in sync makes it easier to read the book and work with the code. In this case, I decided that a completely new implementation of Primitive , never mentioned in the book, would be too big a discrepancy. But this fix will definitely appear in the new version of pbrt.

Before proceeding further, we will perform a verification rendering:

A beach from an island from the movie “Moana”, rendered pbrt-v3 with a resolution of 2048x858 and 256 samples per pixel. The total rendering time on a 12-core / 24-threaded Google Compute Engine instance with a frequency of 2 GHz with the latest version of pbrt-v3 was 2 h 25 min 43 s.

TransformedPrimitives: 95% of wasted space

The memory allocated for 4.3 GB GeometricPrimitive was a pretty painful hit, but what about 17.4 GB under TransformedPrimitive ?

As mentioned above, TransformedPrimitive used for both time-varying transformations and for instances of objects. In both cases, we need to apply an additional transformation to the existing Primitive . There are only two members in the TransformedPrimitive class:

  std::shared_ptr<Primitive> primitive; const AnimatedTransform PrimitiveToWorld;

So far so good: a pointer to a primitive and a time-varying transformation. But what is actually stored in the AnimatedTransform ?

  const Transform *startTransform, *endTransform; const Float startTime, endTime; const bool actuallyAnimated; Vector3f T[2]; Quaternion R[2]; Matrix4x4 S[2]; bool hasRotation; struct DerivativeTerm { // ... Float kc, kx, ky, kz; }; DerivativeTerm c1[3], c2[3], c3[3], c4[3], c5[3];

In addition to the pointers to the two transition matrices and the associated time, there is also a decomposition of the matrices into components of translation, rotation and scaling, as well as pre-calculated values used to limit the volume occupied by the moving bounding boxes (see section 2.4.9 of our book) Physically Based Rendering ). All this adds up to 456 bytes.

But nothing is moving in this scene. From the point of view of transformations, for instances of objects we need one pointer to transformation, and the values for decomposition and moving bounding boxes do not need it. (That is all you need 8 bytes). If you create a separate Primitive implementation for fixed instances of objects, 17.4 GB are compressed to only 900 MB (!).

As for GeometricPrimitive , fixing it is a non-trivial change from what was described in the book, so we will also postpone it for the next version of pbrt. At least, we now understand what is happening with the chaos of 24.7 GB of Primitive memory.

Conversion Cache Trouble

The next largest block of unrecorded memory, defined by massif, was TransformCache , which occupied approximately 16 GB. (Here is a link to the original implementation .) The idea is that the same transformation matrix is often used in the scene several times, so it’s best to have a single copy in memory so that everyone using its elements simply stores a pointer to the same thing. transformation.

For storing the cache, TransformCache used std::map , and massif reported that 6 out of 16 GB were used for black-mahogany nodes in std::map . This is a terrible lot: 60% of this volume is used for the transformations themselves. Let's look at the announcement for this distribution:

 std::map<Transform, std::pair<Transform *, Transform *>> cache;

Here the work is done perfectly: Transform is used entirely as keys for distribution. Even better, the pbrt Transform stores two 4x4 matrices (the transformation matrix and its inverse matrix), which results in 128 bytes of storage at each node of the tree. All this is absolutely unnecessary for the value stored for it.

Perhaps such a structure is quite normal in a world where it is important for us that the same transformation matrix is used in hundreds or thousands of primitives, and in general there are not so many transformation matrices. But for the scene with a bunch of mostly unique transformation matrices, as in our case, this is just a terrible approach.

Besides the fact that space is wasted on keys, a search in std::map to bypass the red-black tree involves a lot of navigation operations on the pointers, so it seems logical to try something completely new. Fortunately, there is little written about TransformCache in the book, so it is perfectly acceptable to completely rewrite it.

And finally, before we get started: after examining the signature of the Lookup() method, another problem becomes apparent:

 void Lookup(const Transform &t, Transform **tCached, Transform **tCachedInverse)

When the calling function provides a Transform , the cache stores and returns pointers to the transformation equal to the transmitted one, but also passes the inverse matrix. To make this possible, in the original implementation, when adding a transformation to the cache, the inverse matrix is always calculated and saved so that it can be returned.

The stupidity here is that most of the dial peers that use the transform cache do not query and do not use the inverse matrix. That is, different types of memory are wasted on unused reversible conversions.

The new implementation adds the following improvements:

It uses a hash table that allows you to speed up the search, and does not require storing anything other than the Transform * array, which, in effect, reduces the amount of memory used to the value that is really needed to store all Transform .
The search method signature now has the form Transform *Lookup(const Transform &t) Transform *Lookup(const Transform &t) Transform *Lookup(const Transform &t) ; in one place where the calling function wants to get an inverse matrix from the cache, it simply calls Lookup() twice.

For hashing, I used the hash function FNV1a . After its implementation, I discovered Aras’s post about hash functions ; Maybe I just needed to use xxHash or CityHash, because their performance is better; maybe someday my shame will win and i will fix it.

Thanks to the new TransformCache implementation, the overall system startup time has been significantly reduced, to 21 minutes 42 seconds. That is, we saved another 5 minutes 7 seconds, or accelerated 1.27 times. Moreover, more efficient use of memory has reduced the space occupied by the transformation matrices from 16 to 5.7 GB, which is almost equal to the amount of stored data. This allowed us not to try to take advantage of the fact that they are not projective, and store 3x4 matrices instead of 4x4. (In the usual case, I would be skeptical about the importance of this kind of optimization, but here it would save us more gigabytes - a lot of memory! This is definitely worth doing in the renderer of production.)

Minor performance optimization to complete

The too generalized TransformedPrimitive structure costs us both memory and time: the profiler reported that a significant amount of time was spent on launching in the AnimatedTransform::Decompose() function, which decomposes the transformation of the matrix into a quaternion rotation, transfer, and scale. Since nothing is moving in this scene, this work is not needed, and a thorough check of the implementation of the AnimatedTransform showed that none of these values are accessed if the two transformation matrices are actually identical.

By adding two lines to the constructor so that decomposition transformations are not performed when they are not required, we saved another 1 min 31 of the launch time: as a result, we came to 20 min 9 s, that is, we generally accelerated 1.73 times.

In the next article, we will seriously take up the parser and analyze what became important when we accelerated the work of other parts.

Source: https://habr.com/ru/post/417445/

All Articles