
Optimizing the rendering of scenes from the Disney cartoon "Moana". Parts 4 and 5



I maintain a branch of pbrt, pbrt-next, that I use to try out new ideas, implement interesting techniques from papers, and generally explore the things that usually end up in the next edition of the book Physically Based Rendering. Unlike pbrt-v3, which we try to keep as close as possible to the system described in the book, in this branch we are free to change anything. Today we will see how more radical changes to the system significantly reduce memory use for the island scene from Disney's "Moana".



A note on methodology: in the previous three posts, all statistics were measured with the WIP (work in progress) version of the scene that I worked with before its release. In this post we move to the final version, which is slightly more complex.



When rendering the final Moana island scene, pbrt-v3 used 81 GB of RAM to store the scene description. pbrt-next currently uses 41 GB, about half as much. Getting there took only small changes amounting to a few hundred lines of code.



Smaller primitives



Recall that in pbrt a Primitive is a combination of geometry, its material, its emission function (if it is a light source), and a record of which media are inside and outside the surface. The pbrt-v3 GeometricPrimitive stores the following:


    std::shared_ptr<Shape> shape;
    std::shared_ptr<Material> material;
    std::shared_ptr<AreaLight> areaLight;
    MediumInterface mediumInterface;


As mentioned earlier, most of the time areaLight is nullptr and the MediumInterface holds a pair of nullptrs. So in pbrt-next I added a Primitive variant called SimplePrimitive, which stores only pointers to the shape and the material. It is used instead of GeometricPrimitive wherever possible:



    class SimplePrimitive : public Primitive {
        // ...
        std::shared_ptr<Shape> shape;
        std::shared_ptr<Material> material;
    };
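To get a feel for what dropping the two rarely used members buys per primitive, here is a minimal sketch with stand-in types. Shape, Material, AreaLight, Medium and MediumInterface below are empty placeholders rather than the real pbrt classes, and bytesSavedPerPrimitive is a hypothetical helper; the exact sizes depend on the compiler and standard library.

```cpp
#include <cstddef>
#include <memory>

// Stand-in types (assumptions; the real pbrt classes are much richer).
struct Shape {};
struct Material {};
struct AreaLight {};
struct Medium {};
struct MediumInterface {
    const Medium *inside = nullptr, *outside = nullptr;
};

struct Primitive {
    virtual ~Primitive() = default;
};

// Layout modeled on the pbrt-v3 members listed in the text.
struct GeometricPrimitive : Primitive {
    std::shared_ptr<Shape> shape;
    std::shared_ptr<Material> material;
    std::shared_ptr<AreaLight> areaLight;  // usually nullptr
    MediumInterface mediumInterface;       // usually two nullptrs
};

// Reduced variant: only the shape and the material.
struct SimplePrimitive : Primitive {
    std::shared_ptr<Shape> shape;
    std::shared_ptr<Material> material;
};

// Hypothetical helper: bytes saved per primitive by the smaller layout.
constexpr std::size_t bytesSavedPerPrimitive() {
    return sizeof(GeometricPrimitive) - sizeof(SimplePrimitive);
}
```

Multiplying the per-primitive difference by the number of primitives in the scene gives a rough upper bound on the savings; allocator overhead is not captured by sizeof.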


For non-animated object instances, we now have a TransformedPrimitive that stores only a pointer to a primitive and a transformation. This saves about 500 bytes of wasted space per instance that the AnimatedTransform added to TransformedPrimitive in pbrt-v3.



    class TransformedPrimitive : public Primitive {
        // ...
        std::shared_ptr<Primitive> primitive;
        std::shared_ptr<Transform> PrimitiveToWorld;
    };


(For cases where an animated transformation is needed, pbrt-next has AnimatedPrimitive.)



After all these changes, the statistics report that Primitives use only 7.8 GB, down from 28.9 GB in pbrt-v3. Saving 21 GB is great, but it is less than the reduction we might have expected from the earlier estimates; we will return to this discrepancy at the end of this part.



Smaller geometry



pbrt-next also significantly reduces the memory used by geometry: the space used for triangle meshes dropped from 19.4 GB to 9.9 GB, and the space for curves from 1.4 GB to 1.1 GB. Slightly more than half of these savings came from simplifying the Shape base class.



In pbrt-v3, Shape carries a few members that are passed on to every Shape implementation; they are things that are convenient to have access to in Shape implementations.



    class Shape {
        // ....
        const Transform *ObjectToWorld, *WorldToObject;
        const bool reverseOrientation;
        const bool transformSwapsHandedness;
    };


To see why these member variables cause problems, it helps to understand how triangle meshes are represented in pbrt. First, there is the TriangleMesh class, which stores the vertex and index buffers for the entire mesh:



    struct TriangleMesh {
        int nTriangles, nVertices;
        std::vector<int> vertexIndices;
        std::unique_ptr<Point3f[]> p;
        std::unique_ptr<Normal3f[]> n;
        // ...
    };


Each triangle in the mesh is represented by the Triangle class, which inherits from Shape. The idea is to keep Triangle as small as possible: it stores only a pointer to the mesh it belongs to and a pointer to the position in the index buffer where its vertex indices start:



    class Triangle : public Shape {
        // ...
        std::shared_ptr<TriangleMesh> mesh;
        const int *v;
    };


When the Triangle implementation needs the positions of its vertices, it performs the appropriate indexing to get them from TriangleMesh.



The problem with the pbrt-v3 Shape is that the values stored in it are the same for every triangle of a mesh, so it would be better to store them once per mesh in TriangleMesh and give each Triangle access to that single shared copy.



This is fixed in pbrt-next: its Shape base class has no such members, and each Triangle is therefore 24 bytes smaller. The Curve shape uses a similar strategy and also benefits from the more compact Shape.
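The idea can be sketched as follows. This is a simplified illustration, not the pbrt-next code: Transform is an empty stand-in, and the accessors ReverseOrientation and VertexIndex are hypothetical names. On a typical 64-bit system the two pointers and two bools removed from the base class plausibly account for the 24 bytes once padding is included.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct Transform {};  // stand-in

// Values that are identical for every triangle are stored once, in the mesh.
struct TriangleMesh {
    const Transform *ObjectToWorld = nullptr, *WorldToObject = nullptr;
    bool reverseOrientation = false;
    bool transformSwapsHandedness = false;
    std::vector<int> vertexIndices;
};

// The base class no longer holds per-shape copies of the shared values.
class Shape {
  public:
    virtual ~Shape() = default;
    virtual bool ReverseOrientation() const = 0;
};

class Triangle : public Shape {
  public:
    Triangle(std::shared_ptr<TriangleMesh> m, int triNumber)
        : mesh(std::move(m)) {
        assert(3 * triNumber + 2 < (int)mesh->vertexIndices.size());
        v = &mesh->vertexIndices[3 * triNumber];
    }
    // Shared values are fetched through the mesh pointer on demand.
    bool ReverseOrientation() const override {
        return mesh->reverseOrientation;
    }
    int VertexIndex(int i) const { return v[i]; }

  private:
    std::shared_ptr<TriangleMesh> mesh;
    const int *v;  // start of this triangle's three indices
};
```

The cost is one extra indirection when a triangle needs one of the shared values, which is usually a good trade when there are millions of triangles per scene.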



Shared triangle buffers



Even though the Moana island scene makes heavy use of object instancing for geometry that is obviously repeated, I was curious how often index buffers, texture coordinate buffers, and so on are repeated across different triangle meshes.



I wrote a small class that hashes these buffers as they come in and stores them in a cache, and modified TriangleMesh to check the cache and reuse the already stored version of any redundant buffer it needs. The payoff was very good: I got rid of 4.7 GB of redundant storage, far more than I expected.
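A minimal sketch of such a cache, under the assumption that buffers are plain std::vector values; BufferCache, Lookup and UniqueBuffers are hypothetical names, not the pbrt-next interface. The hash only narrows down the candidates; equality is still verified so that a hash collision can never merge two different buffers.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical deduplicating cache: identical buffers are stored once.
template <typename T>
class BufferCache {
  public:
    // Returns a pointer to a stored buffer with the same contents as buf.
    // The pointer stays valid for the lifetime of the cache.
    const std::vector<T> *Lookup(std::vector<T> buf) {
        std::size_t h = Hash(buf);
        auto range = cache.equal_range(h);
        for (auto it = range.first; it != range.second; ++it)
            if (*it->second == buf) return it->second.get();
        auto stored = std::make_unique<std::vector<T>>(std::move(buf));
        const std::vector<T> *result = stored.get();
        cache.emplace(h, std::move(stored));
        return result;
    }

    std::size_t UniqueBuffers() const { return cache.size(); }

  private:
    // Simple hash combiner over the elements (any reasonable hash works).
    static std::size_t Hash(const std::vector<T> &buf) {
        std::size_t h = buf.size();
        for (const T &x : buf)
            h ^= std::hash<T>()(x) + 0x9e3779b9u + (h << 6) + (h >> 2);
        return h;
    }

    std::unordered_multimap<std::size_t, std::unique_ptr<std::vector<T>>> cache;
};
```

A mesh would then hold the returned pointer instead of its own copy of the buffer. For very large scenes the hashing cost at load time is the price paid for the memory saved.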



The std::shared_ptr catastrophe



After all these changes, the statistics reported about 36 GB of known allocated memory, while top showed 53 GB in use at the start of rendering. What was the cause?



I was dreading another round of slow massif runs to find out which allocations were missing from the statistics, but then an email from Arseny Kapoulkine showed up in my inbox. Arseny explained that my earlier estimates of GeometricPrimitive memory use were badly off. It took me a while to get it, but eventually I did; many thanks to Arseny for pointing out the error and for the detailed explanations.



Before Arseny's email, my mental model of the std::shared_ptr implementation was roughly this: there is a shared descriptor that stores the reference count and a pointer to the object itself:



    template <typename T> class shared_ptr_info {
        std::atomic<int> refCount;
        T *ptr;
    };


Then I assumed that the shared_ptr instance simply points to it and uses it:



    template <typename T> class shared_ptr {
        // ...
        T *operator->() { return info->ptr; }
        shared_ptr_info<T> *info;
    };


In short, I assumed that sizeof(shared_ptr<>) is the same as pointer size, and that for every shared pointer, 16 bytes of extra space are wasted.



But that is not the case.



In the implementation on my system, the shared descriptor is 32 bytes and sizeof(shared_ptr<>) is 16 bytes. Consequently GeometricPrimitive, which consists mostly of std::shared_ptrs, is about twice as large as my estimates. If you are wondering why this is so, these two Stack Overflow posts explain the reasons in detail: 1 and 2.
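The handle sizes are easy to check on a given toolchain. The sketch below uses a stand-in Shape type and hypothetical helper names; the values are implementation-defined, but the mainstream standard libraries store two pointers in a shared_ptr (one to the object and one to a separately allocated control block that holds the reference counts and the deleter), while unique_ptr with the default deleter is the size of a raw pointer.

```cpp
#include <cstddef>
#include <cstdio>
#include <memory>

struct Shape {};  // stand-in

// True when shared_ptr is laid out as object pointer + control block pointer.
bool sharedPtrIsTwoPointers() {
    return sizeof(std::shared_ptr<Shape>) == 2 * sizeof(Shape *);
}

// True when unique_ptr (default deleter) adds nothing to a raw pointer.
bool uniquePtrIsOnePointer() {
    return sizeof(std::unique_ptr<Shape>) == sizeof(Shape *);
}

// Prints the sizes for the current compiler and standard library.
void printPointerSizes() {
    std::printf("raw pointer: %zu, unique_ptr: %zu, shared_ptr: %zu\n",
                sizeof(Shape *), sizeof(std::unique_ptr<Shape>),
                sizeof(std::shared_ptr<Shape>));
}
```

The control block itself is a separate heap allocation (unless std::make_shared merges it with the object), so its size is not visible through sizeof on the handle.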



In almost all uses of std::shared_ptr in pbrt-next, the pointers do not actually need to be shared. In a burst of frantic hacking, I replaced everything I could with std::unique_ptr, which really is the same size as a regular pointer. For example, here is what SimplePrimitive looks like now:



    class SimplePrimitive : public Primitive {
        // ...
        std::unique_ptr<Shape> shape;
        const Material *material;
    };


The payoff was bigger than I expected: memory use at the start of rendering dropped from 53 GB to 41 GB. That is a 12 GB saving I would not have anticipated a few days earlier, and the total is almost half of what pbrt-v3 used. Excellent!



In the next part we will finally wrap up this series: we will look at rendering speed in pbrt-next and discuss ideas for other ways to reduce the memory needed for this scene.



Part 5.



To wrap up this series, we will start by looking at rendering speed for the island scene from Disney's "Moana" in pbrt-next, the pbrt branch I use to try out new ideas. There we can make more radical changes than are possible in pbrt-v3, which has to stick to the system described in our book. We will finish with a discussion of directions for further improvement, from the simple to the slightly extreme.



Rendering time



pbrt-next includes many changes to the light transport algorithms, including changes to BSDF sampling and improvements to the Russian roulette algorithms. As a result, it traces more rays than pbrt-v3 when rendering this scene, so the run times of the two renderers cannot be compared directly. Speed is generally similar, with one important exception: when rendering the view of the Moana island scene shown below, pbrt-v3 spends 14.5% of its run time in ptex texture lookups. That used to seem reasonable to me, but pbrt-next spends only 2.2% of its run time there. That is awfully interesting.



Looking at the statistics, we see [1]:



pbrt-v3:

Ptex reads from disk   20828624
Ptex lookups           712324767



pbrt-next:

Ptex reads from disk   3378524
Ptex lookups           825826507




As we can see, pbrt-v3 reads ptex texture data from disk on average once every 34 texture lookups (712,324,767 / 20,828,624 ≈ 34). pbrt-next reads only once every 244 lookups (825,826,507 / 3,378,524 ≈ 244), so there are about 7 times fewer disk reads per lookup. My guess was that this happens because pbrt-next computes ray differentials for indirect rays, which leads to accesses to higher MIP levels of the textures, which in turn gives a more coherent set of accesses to the ptex texture cache, fewer cache misses, and hence fewer I/O operations [2]. A quick check confirmed the guess: with ray differentials disabled, ptex performance became much worse.



The faster ptex performance affected more than just computation and I/O. On a 32-CPU system, pbrt-v3 achieved a speedup of only 14.9x once parsing of the scene description was finished. pbrt usually exhibits close to linear parallel scaling, so this was rather disappointing. Thanks to much less contention on locks in ptex, pbrt-next achieved a 29.2x speedup on the 32-CPU system and 94.9x on a 96-CPU system; we are back where we expect to be.





Roots from the Moana island scene, rendered by pbrt at 2048x858 resolution with 256 samples per pixel. Total rendering time with pbrt-next on a Google Compute Engine instance with 96 virtual CPUs at 2 GHz was 41 minutes 22 seconds. The speedup from multithreading during rendering was 94.9x. (I don't quite understand what is going on with the bump mapping here.)



Future work



Reducing the amount of memory used in such complex scenes is a fascinating exercise: saving a few gigabytes with a small change is much more pleasing than dozens of megabytes saved in a simpler scene. I have a good list of what I hope to explore in the future, if time allows. Here is a quick overview.



Further reducing triangle buffer memory



Even with reuse of buffers that store the same values for multiple triangle meshes, the triangle buffers still use quite a lot of memory. Here is a breakdown of memory use by type of triangle buffer in the scene:



Type        Memory
Positions   2.5 GB
Normals     2.5 GB
UV          98 MB
Indices     252 MB


I figure that nothing can be done about the vertex positions as supplied, but there are opportunities to save on the other data. There are many memory-efficient representations of normal vectors, offering different trade-offs between memory use and computation. Using one of the 24-bit or 32-bit representations would reduce the space used for normals to 663 MB or 864 MB, respectively, saving more than 1.5 GB of RAM.
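The text does not say which representation would be chosen. As one illustration of a 32-bit option, here is a sketch of octahedral encoding with two 16-bit components, which shrinks a normal from three floats (12 bytes) to 4 bytes. The choice of this particular encoding, the Vec3 type, and the function names are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };  // stand-in for a unit normal

static float signNotZero(float v) { return v < 0 ? -1.f : 1.f; }

// Map [-1, 1] to a 16-bit unsigned integer and back.
static std::uint16_t quantize(float v) {
    float t = std::clamp(v, -1.f, 1.f) * 0.5f + 0.5f;
    return (std::uint16_t)std::lround(t * 65535.f);
}
static float dequantize(std::uint16_t q) { return q / 65535.f * 2.f - 1.f; }

// Encode a unit normal into 32 bits using the octahedral mapping.
std::uint32_t encodeNormal(Vec3 n) {
    // Project onto the octahedron |x| + |y| + |z| = 1.
    float invL1 = 1.f / (std::abs(n.x) + std::abs(n.y) + std::abs(n.z));
    float u = n.x * invL1, v = n.y * invL1;
    if (n.z < 0) {  // fold the lower hemisphere over the diagonals
        float fu = (1.f - std::abs(v)) * signNotZero(u);
        float fv = (1.f - std::abs(u)) * signNotZero(v);
        u = fu;
        v = fv;
    }
    return ((std::uint32_t)quantize(u) << 16) | quantize(v);
}

// Decode back to an (approximately) unit-length normal.
Vec3 decodeNormal(std::uint32_t e) {
    float u = dequantize((std::uint16_t)(e >> 16));
    float v = dequantize((std::uint16_t)(e & 0xffff));
    float z = 1.f - std::abs(u) - std::abs(v);
    if (z < 0) {  // undo the fold for the lower hemisphere
        float fu = (1.f - std::abs(v)) * signNotZero(u);
        float fv = (1.f - std::abs(u)) * signNotZero(v);
        u = fu;
        v = fv;
    }
    float len = std::sqrt(u * u + v * v + z * z);
    return {u / len, v / len, z / len};
}
```

Decoding costs a few arithmetic operations and a square root per normal, which is the kind of memory-versus-computation trade-off mentioned above. The code needs C++17 for std::clamp.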



In this scene, the memory used for texture coordinates and index buffers is surprisingly small. I suspect this is because the scene contains many procedurally generated plants, and all variations of the same plant type share the same topology (and hence index buffer) and the same parameterization (and hence UV coordinates). As a result, reusing matching buffers is quite effective.



For other scenes, it may well be appropriate to quantize UV texture coordinates to 16 bits or to use half-precision floats, depending on their range of values. In this scene it appears that all texture coordinates are either zero or one, so they could be represented with a single bit each, reducing their memory by a factor of 32. This is probably a consequence of using the ptex format for texturing, which removes the need for UV atlases. Given how little space the texture coordinates take now, implementing this optimization is not particularly pressing.



pbrt always uses 32-bit integers for index buffers. For small meshes with fewer than 256 vertices, 8 bits per index are enough, and for meshes with fewer than 65,536 vertices, 16 bits suffice. Modifying pbrt to support this would not be very hard. If we wanted to optimize to the limit, we could allocate exactly as many bits as are needed to represent the required range of indices, at the cost of more complex index lookups. Given that only a quarter of a gigabyte is currently used for vertex indices, this task does not look very interesting compared to the others.
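A rough sketch of choosing the index width per mesh, assuming C++17 and using std::variant to hold whichever width was picked. IndexBuffer, packIndices, bytesUsed and indexAt are hypothetical names, not pbrt code.

```cpp
#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

// One of three storage widths, chosen per mesh.
using IndexBuffer = std::variant<std::vector<std::uint8_t>,
                                 std::vector<std::uint16_t>,
                                 std::vector<std::uint32_t>>;

// Copy 32-bit indices into a narrower element type.
template <typename T>
static std::vector<T> narrow(const std::vector<std::uint32_t> &in) {
    return std::vector<T>(in.begin(), in.end());
}

// Pick the narrowest width that can address every vertex of the mesh.
IndexBuffer packIndices(const std::vector<std::uint32_t> &indices,
                        std::size_t nVertices) {
    if (nVertices <= 256) return narrow<std::uint8_t>(indices);
    if (nVertices <= 65536) return narrow<std::uint16_t>(indices);
    return indices;
}

// Storage used by the index data itself.
std::size_t bytesUsed(const IndexBuffer &b) {
    return std::visit(
        [](const auto &v) -> std::size_t { return v.size() * sizeof(v[0]); },
        b);
}

// Width-independent lookup; the dispatch here is the extra run-time cost.
std::uint32_t indexAt(const IndexBuffer &b, std::size_t i) {
    return std::visit(
        [i](const auto &v) { return static_cast<std::uint32_t>(v[i]); }, b);
}
```

Every index access now goes through a small dispatch on the stored width, which is the lookup complexity referred to above.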



Peak memory use during BVH construction



There is one more detail of memory use that we have not discussed yet: just before rendering there is a short-lived spike of 10 GB of additional memory. It occurs when the (large) BVH for the entire scene is built. pbrt's BVH construction code is written to run in two phases: first it builds a BVH with the traditional representation, with two child pointers in each node. Once the tree is built, it is converted to a memory-efficient layout in which the first child of a node is stored immediately after it in memory and the offset to the second child is stored as an integer.
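The second phase can be sketched as a depth-first flattening. This is a simplified illustration with bounding boxes omitted; BuildNode, LinearNode and flatten are hypothetical names loosely modeled on the description above, not the pbrt source.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Pointer-based node, as produced by the first build phase.
struct BuildNode {
    std::unique_ptr<BuildNode> children[2];  // both null for a leaf
    int primitive = -1;                      // leaf payload (hypothetical)
};

// Compact node: the first child is implicitly at index + 1.
struct LinearNode {
    std::int32_t secondChildOffset = -1;  // set for interior nodes only
    std::int32_t primitive = -1;          // set for leaves only
};

// Depth-first flattening; returns the index at which node was written.
int flatten(const BuildNode &node, std::vector<LinearNode> &out) {
    int myIndex = (int)out.size();
    out.push_back(LinearNode{});
    if (!node.children[0]) {
        out[myIndex].primitive = node.primitive;
    } else {
        assert(node.children[1]);
        flatten(*node.children[0], out);  // lands at myIndex + 1
        int second = flatten(*node.children[1], out);
        out[myIndex].secondChildOffset = second;
    }
    return myIndex;
}
```

While flatten runs, the pointer-based tree and the linear array are both alive, which is consistent with a temporary spike of this kind; building the compact form directly would avoid holding both.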



This split was motivated by teaching: it is much easier to understand BVH construction algorithms without the mess of also converting the tree to the compact form during construction. The result, however, is this spike in memory use; given its impact on this scene, eliminating it looks attractive.



Converting pointers to integers



Various data structures contain many 64-bit pointers that could be represented as 32-bit integers. For example, each SimplePrimitive holds a pointer to a Material. Most Material instances are shared among many primitives in the scene, and there are never more than a few thousand of them; therefore we could keep a single global vector of all materials:



 std::vector<Material *> allMaterials; 


and just store 32-bit integer offsets to this vector in SimplePrimitive , which saves us 4 bytes. The same trick can be used with a pointer to TriangleMesh in each Triangle , as well as in many other places.
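A sketch of what this might look like, with a stand-in Material type and a hypothetical registerMaterial helper (the linear search is for brevity only; a real implementation would use a map):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Material { int id; };  // stand-in
struct Shape;                 // stand-in, only used through a pointer

// Single global table of all materials, as described in the text.
std::vector<Material *> allMaterials;

// Hypothetical helper: returns the table index for m, adding it if needed.
std::uint32_t registerMaterial(Material *m) {
    for (std::uint32_t i = 0; i < allMaterials.size(); ++i)
        if (allMaterials[i] == m) return i;
    allMaterials.push_back(m);
    return (std::uint32_t)(allMaterials.size() - 1);
}

struct SimplePrimitive {
    Shape *shape;                 // raw pointer here for brevity
    std::uint32_t materialIndex;  // 4 bytes instead of an 8-byte pointer

    const Material *GetMaterial() const {
        assert(materialIndex < allMaterials.size());
        return allMaterials[materialIndex];
    }
};
```

One caveat: because of alignment, the 4 bytes are only actually recovered if another small member fills the padding next to the index, or if several pointers in the same structure are converted together.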



After such a change there would be a slight overhead in accessing the actual pointers, and the system would become a little less clear to students trying to understand how it works; this is probably a case where, in the context of pbrt, it is better to keep the implementation a bit clearer at the cost of not fully optimizing memory use.



Arena-based allocation



A separate call to new (actually make_unique, but it amounts to the same thing) is made for every individual Triangle and primitive. These allocations incur bookkeeping overhead in the allocator that takes up about five gigabytes of memory not accounted for in the statistics. Since the lifetime of all of these allocations is the same (until rendering finishes), we could get rid of this overhead by allocating them from a memory arena.
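A minimal bump-allocator sketch of the idea follows. Arena, Alloc and BlocksAllocated are hypothetical names and this is not pbrt's own arena class; the sketch assumes objects need no more than the default alignment of operator new and that it is acceptable never to run their destructors, which fits allocations that live until the renderer exits.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Objects are carved out of large blocks, so the general-purpose allocator's
// per-allocation bookkeeping is paid once per block, not once per object.
class Arena {
  public:
    explicit Arena(std::size_t blockSize = 1 << 20) : blockSize(blockSize) {}

    // Constructs a T in arena memory. Destructors are never run; all memory
    // is released at once when the arena is destroyed.
    template <typename T, typename... Args>
    T *Alloc(Args &&...args) {
        void *p = AllocBytes(sizeof(T), alignof(T));
        return new (p) T(std::forward<Args>(args)...);
    }

    std::size_t BlocksAllocated() const { return blocks.size(); }

  private:
    void *AllocBytes(std::size_t size, std::size_t align) {
        assert(size + align <= blockSize);
        // Round the current offset up to the required alignment.
        std::size_t aligned = (offset + align - 1) & ~(align - 1);
        if (blocks.empty() || aligned + size > blockSize) {
            blocks.push_back(std::make_unique<unsigned char[]>(blockSize));
            aligned = 0;
        }
        offset = aligned + size;
        return blocks.back().get() + aligned;
    }

    std::size_t blockSize;
    std::size_t offset = 0;
    std::vector<std::unique_ptr<unsigned char[]>> blocks;
};
```

With a block size of a megabyte or more, the bookkeeping cost is amortized over many thousands of triangles.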



Vtable hacks



My last idea is terrible, and I apologize for it, but it intrigued me.



Each triangle in the scene carries the overhead of at least two vtable pointers: one for Triangle and one for SimplePrimitive. That is 16 bytes. The Moana island scene has a total of 146,162,124 unique triangles, which adds up to almost 2.2 GB of redundant vtable pointers.



What if we had no abstract base class for Shape and each shape implementation did not inherit from anything? That would save the space for the vtable pointers, but of course, given a pointer to a shape, we would not know what kind of shape it was, which would make the pointer useless.



It turns out that on modern x86 CPUs only 48 bits of a 64-bit pointer are actually used. So there are 16 spare bits we can borrow to store some information, such as the type of the shape being pointed to. With a bit more work, we can then find our way back to something equivalent to virtual function calls.



Here is how it would go: first, we define a ShapeMethods structure that holds function pointers, for example [3]:



    struct ShapeMethods {
        Bounds3f (*WorldBound)(void *);
        // Intersect, etc.
        ...
    };


Each shape implementation would implement a bounds function, an intersection function, and so on, taking the equivalent of the this pointer as its first argument:



    Bounds3f TriangleWorldBound(void *t) {
        // Only pointers to Triangle are passed to this function.
        Triangle *tri = (Triangle *)t;
        // ...
    }


We would have a global table of structures ShapeMethods , in which the n -th element would be for a geometry type with index n :



    ShapeMethods shapeMethods[] = {
        { TriangleWorldBound, /*...*/ },
        { CurveWorldBound, /*...*/ },
        // ...
    };


When creating a shape, we would encode its type in some of the unused bits of the returned pointer. Then, given a pointer to a shape on which we want to make a particular call, we would extract the type index from the pointer and use it as an index into shapeMethods to find the appropriate function pointer. In effect we would be implementing a vtable by hand, handling dispatch ourselves. If we did this for both shapes and primitives, we would save 16 bytes per Triangle, though by a rather convoluted route.
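Put together, the scheme might look like the sketch below. Everything beyond ShapeMethods, shapeMethods, TriangleWorldBound and CurveWorldBound is an assumption made for illustration: the ShapeHandle wrapper, the placeholder Triangle, Curve and Bounds3f types, and the choice to put the tag in bits 48 and up. It also assumes a 64-bit platform where user-space addresses leave the upper 16 bits zero, which the constructor checks.

```cpp
#include <cassert>
#include <cstdint>

struct Bounds3f { float min[3], max[3]; };  // stand-in

struct ShapeMethods {
    Bounds3f (*WorldBound)(void *);
};

// Placeholder shapes with no base class and hence no vtable pointer.
struct Triangle { float lo, hi; };
struct Curve { float width; };

Bounds3f TriangleWorldBound(void *t) {
    Triangle *tri = static_cast<Triangle *>(t);
    return Bounds3f{{tri->lo, tri->lo, tri->lo}, {tri->hi, tri->hi, tri->hi}};
}
Bounds3f CurveWorldBound(void *c) {
    Curve *cv = static_cast<Curve *>(c);
    return Bounds3f{{-cv->width, -cv->width, -cv->width},
                    {cv->width, cv->width, cv->width}};
}

// The n-th entry holds the methods for the shape type with index n.
enum ShapeType : std::uint64_t { kTriangle = 0, kCurve = 1 };
const ShapeMethods shapeMethods[] = {{TriangleWorldBound}, {CurveWorldBound}};

constexpr int kTagShift = 48;
constexpr std::uint64_t kPtrMask = (std::uint64_t(1) << kTagShift) - 1;

// A pointer-sized handle with the shape type stored in the upper 16 bits.
class ShapeHandle {
  public:
    ShapeHandle(void *p, ShapeType type) {
        std::uint64_t raw = reinterpret_cast<std::uintptr_t>(p);
        assert((raw & ~kPtrMask) == 0);  // upper bits must be free
        bits = raw | (std::uint64_t(type) << kTagShift);
    }
    // Hand-rolled virtual call: look up the method by type, pass the pointer.
    Bounds3f WorldBound() const {
        return shapeMethods[bits >> kTagShift].WorldBound(Ptr());
    }

  private:
    void *Ptr() const {
        return reinterpret_cast<void *>(
            static_cast<std::uintptr_t>(bits & kPtrMask));
    }
    std::uint64_t bits;
};
```

The handle stays the size of a pointer, and the shapes themselves carry no vtable pointer. The approach breaks on any platform that hands out addresses using the upper bits, so a real implementation would need a fallback.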



I suspect that this hack for implementing virtual function dispatch is not new, but I could not find any references to it online. There is a Wikipedia page on tagged pointers, but it covers things like reference counts. If you know of a better reference, please send me an email.



Having shared this clunky hack, I can wrap up this series of posts. Once again, my sincere thanks to Disney for publishing this scene. It has been a delight to work with, and the gears in my head are still turning.



Notes



  1. In the end, pbrt-next traces more rays in this scene than pbrt-v3, which probably explains the larger number of lookups.
  2. Ray differentials for indirect rays in pbrt-next are computed with the same hack used in the texture cache extension for pbrt-v3. It seems to work quite well, but its underlying principles do not strike me as well founded.
  3. This is how Rayshade, which is written in C, handles virtual function dispatch.

Source: https://habr.com/ru/post/417939/


