Although Habr has already published some notes on the technical side of making the Avatar movie (for example, HP's blog had a post on building a render farm out of HP blade systems), it seemed to me that how everything is done "behind the scenes" on a project of this scale is still very interesting to "techies", especially since NetApp systems were directly involved in making the film. So I decided to translate for Habr an article from the latest issue of the Tech OnTap e-newsletter; by the way, you can subscribe to its Russian edition.

The Avatar film, released this year, broke all box office records, grossing 2.7 billion dollars and still counting.
Weta Digital, the visual effects company that created the special effects and computer animation for the film, also broke several records of its own while creating the impressive 3D world of Avatar. Weta Digital became widely known in the industry after the Lord of the Rings trilogy and several later films such as King Kong and District 9, but Avatar demanded a very special technical effort.
Avatar pushed the boundaries of Weta Digital's computing and storage infrastructure far beyond anything we had worked with before. When work on Avatar began in 2006, Weta Digital had just finished King Kong. At that point it had approximately 4,400 CPU cores in its render farm and about 100TB of storage. By the time Avatar was completed, the company had about 35,000 CPU cores and over 3,000TB of storage. The RAM in today's render farm alone exceeds Weta Digital's total disk capacity at the time King Kong was finished.
I began working at Weta Digital in 2003 as a system administrator, as work on the last part of the Lord of the Rings was being completed. Before long my main job was to lead Weta Digital's infrastructure solutions department, which is responsible for all servers, networks and storage systems. Our task was to build an infrastructure capable of supporting a film on the scale of Avatar, and to solve every technical problem that came up along the way.
Battle for growth
Despite the enormous growth Weta Digital went through during the work on Avatar, managing the expanded infrastructure never became the problem we had feared. To a large extent that was thanks to a team that knew how to work together. The team stuck together, and when something went wrong we all jumped in and fixed it. We worked hard, and in most cases we managed to anticipate problems instead of fixing things after they had already happened.
We quickly realized that to cope with Avatar we would need to take two important steps.
- Build a new data center. Until then, Weta Digital had used several small server rooms scattered across several buildings. The new data center provided a centralized place to deploy and consolidate all the new infrastructure that would have to be added over the course of the Avatar project.
- Build a high-speed fiber-optic ring. Weta Digital does not have a single localized campus; instead, our "campus" consists of several separate buildings scattered around a suburb of Wellington. We designed and built a high-speed fiber ring connecting all of these buildings with the new data center. Each building was connected redundantly, with 10Gbit/s links aggregated by EtherChannel into 40Gbit/s, linking the storage systems and the render farm servers.
These two elements gave us the physical capacity to scale our infrastructure and the bandwidth to move data freely between locations. The new server infrastructure for the updated render farm was built on HP blade servers. With 8 cores and 24GB of RAM per blade, we could fit up to 1,024 cores and up to 3TB of RAM into a rack. The new data center was organized in blocks of 10 racks, 10,240 cores per block. We installed the first 10,000 cores, ran on them for a while, added another 10,000, ran some more, added another 10,000, and finally installed the last 5,000 cores.
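For the curious, the rack figures quoted above follow directly from the per-blade specs. A quick back-of-the-envelope check (the 128-blades-per-rack count is inferred from the numbers in the text, not stated explicitly):

```python
# Back-of-the-envelope rack math for the render farm described above.
# Assumption: 128 blades per rack, inferred from 1024 cores / 8 cores per blade.
cores_per_blade = 8
ram_per_blade_gb = 24

blades_per_rack = 1024 // cores_per_blade            # 128 blades
cores_per_rack = blades_per_rack * cores_per_blade   # 1024 cores
ram_per_rack_tb = blades_per_rack * ram_per_blade_gb / 1024  # 3.0 TB

racks_per_block = 10
cores_per_block = racks_per_block * cores_per_rack   # 10,240 cores per 10-rack block

print(cores_per_rack, ram_per_rack_tb, cores_per_block)  # 1024 3.0 10240
```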
The data center has a number of unique design elements:
- Water cooling. Since the weather in Wellington, New Zealand, where Weta Digital is located, is rarely very hot, most of the time the water that cools the racks can be pumped straight to radiators on the roof of the building for natural heat exchange. The servers run relatively "warm", at a temperature of 23°C.
- High power density. The data center can supply up to 30 kilowatts of power per rack.
- High-strength racks. Each Rittal rack with integrated water cooling can support up to one ton of equipment and cooling hardware.
Our storage infrastructure uses products from several manufacturers, but its core consists of NetApp storage systems holding about 1,000TB of data. By the end of the work on Avatar we had replaced all of our old FAS980 and FAS3050 systems with clustered FAS6080s. In the last eight months of the project we also added four SA600 storage accelerators to solve one particularly painful performance problem.
Accelerating access to texture files using adaptive caching
In the visual effects industry, "textures" are special image files that are "applied" to a 3D model to make it look realistic. The model is "wrapped" in textures that provide the detail, color, shading, and surface character without which a 3D model looks uniformly gray. A "texture set" is the collection of images that must be applied to a particular model to make it look like a tree, a character, or a creature. Most renders that include objects also include the textures applied to those objects, so the contents of the texture files are constantly being requested by the farm's servers while a scene is being created.
A given group of texture sets may be requested simultaneously by several thousand render cores. Another group, partly overlapping the first, may be requested by another thousand cores, and so on. Anything we could do to speed up access to textures would dramatically increase the performance of the rendering farm as a whole.
No single file server could provide enough bandwidth to serve all of our texture sets, so we developed a special "publish" process to create replicas of each texture set. This is shown in Figure 1.

Figure 1) The old method of increasing bandwidth when transferring texture sets.
When a render job runs on the farm and needs access to a texture set used in the project, it picks a file server at random and reads the texture contents from one of the replicas. By letting us distribute texture content across several file servers, this process significantly increased system performance. Although this was better than relying on a single file server, the publish-and-replicate process was fairly complicated and required time-consuming integrity checks to make sure that all replicas on all servers were absolutely identical.
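To make that older workflow concrete, here is a minimal sketch of the "pick a random replica, verify replicas at publish time" idea. The mount points, paths, and helper functions are hypothetical and only illustrate the shape of the process, not Weta Digital's actual tooling:

```python
import hashlib
import random
from pathlib import Path

# Hypothetical mount points of file servers holding full replicas of every texture set.
TEXTURE_SERVERS = ["/mnt/texsrv01", "/mnt/texsrv02", "/mnt/texsrv03"]

def read_texture(relative_path: str) -> bytes:
    """Read a texture from a randomly chosen replica, as a render job would."""
    server = random.choice(TEXTURE_SERVERS)
    return (Path(server) / relative_path).read_bytes()

def replicas_identical(relative_path: str) -> bool:
    """Integrity check used at 'publish' time: all replicas must hash identically."""
    digests = {
        hashlib.sha256((Path(s) / relative_path).read_bytes()).hexdigest()
        for s in TEXTURE_SERVERS
    }
    return len(digests) == 1
```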
We turned to NetApp FlexCache and the SA600 storage accelerator as a simple way to solve the performance problems associated with texture sets. FlexCache software creates a caching layer in the storage infrastructure that automatically adapts to changing access patterns and eliminates performance bottlenecks. It automatically replicates and serves active datasets, regardless of where the data resides in the storage infrastructure, using local caching volumes.
Instead of manually copying our textures to several file servers, FlexCache lets us dynamically cache the textures currently in use and serve them to the render farm from the SA600 devices. We tested this solution, saw that it worked very well in our environment, and so, 8 months before the end of the film, we decided to take the risk and installed four SA600 systems, each with two 16GB Performance Acceleration Module (PAM) cards. (PAM acts as a read cache, reducing response time when accessing data; more details: eng and rus.)

Figure 2) The improved method of increasing bandwidth for texture sets using NetApp FlexCache, SA600, and PAM.
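FlexCache itself is configured on the storage controllers rather than programmed against, but the idea it implements can be illustrated with a tiny read-through cache: on a miss, data is fetched from the origin file server and kept locally, so repeated requests for the same texture are served from the cache and the cached working set follows the access pattern automatically. The sketch below is purely conceptual and is not NetApp's implementation or API:

```python
from collections import OrderedDict

class ReadThroughCache:
    """Tiny LRU read-through cache illustrating the FlexCache idea (conceptual only)."""

    def __init__(self, fetch_from_origin, capacity_bytes: int):
        self._fetch = fetch_from_origin    # callable: path -> bytes (the origin file server)
        self._capacity = capacity_bytes
        self._size = 0
        self._cache: OrderedDict[str, bytes] = OrderedDict()

    def read(self, path: str) -> bytes:
        if path in self._cache:            # cache hit: serve locally
            self._cache.move_to_end(path)
            return self._cache[path]
        data = self._fetch(path)           # cache miss: go to the origin server
        self._cache[path] = data
        self._size += len(data)
        while self._size > self._capacity: # evict least recently used entries
            _, evicted = self._cache.popitem(last=False)
            self._size -= len(evicted)
        return data
```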
The total size of our texture sets is about 5TB, but when we turned on FlexCache we discovered that only about 500GB of it is actively used at any given time. Each SA600 has enough space on its local disks to hold this active dataset, and when the textures in the active set change, the cache reloads its contents automatically, without any manual intervention on our part. The aggregate bandwidth came to 4GB/s, far more than we had ever achieved before.
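A quick sanity check of those numbers (the even split of bandwidth across the four SA600s is my assumption, not something stated above):

```python
# Working-set and bandwidth arithmetic for the figures quoted in the text.
total_texture_tb = 5.0
active_set_gb = 500
active_fraction = active_set_gb / (total_texture_tb * 1024)   # ~9.8% of textures are "hot"

aggregate_bw_gbps = 4.0
sa600_count = 4
per_cache_bw_gbps = aggregate_bw_gbps / sa600_count           # ~1 GB/s per SA600 (assumed even split)

print(f"{active_fraction:.1%} of textures hot, ~{per_cache_bw_gbps} GB/s per cache node")
```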
Texture caching with FlexCache turned out to be an excellent solution: everything runs faster, and textures and texture sets are easier to manage. This was the final year of a project that took four years of work to bring the film to life; if we had installed the SA600s and run into problems with them, we would have had to switch everything back in a hurry to avoid missing deadlines. But after a week had passed, we had almost forgotten they existed (apart, of course, from the increased speed). That is the easiest way to make IT people happy.
The performance of the storage subsystem has a significant impact on how fast render jobs are processed: bottlenecks and lack of performance keep the render farm from running at full power. In the last year of Avatar's creation we dug deep into the details of our processes and added extensive capabilities for monitoring and collecting statistics on every job.
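The article does not describe that monitoring tooling, but per-job statistics collection of the kind mentioned here could be as simple as the following sketch; every name in it is hypothetical:

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def job_stats(job_id: str, log_path: str = "render_stats.jsonl"):
    """Record wall-clock time (plus any counters the job adds) for one render job."""
    start = time.time()
    stats = {"job_id": job_id}
    try:
        yield stats                      # the render job can add its own counters here
    finally:
        stats["wall_seconds"] = round(time.time() - start, 2)
        with open(log_path, "a") as f:
            f.write(json.dumps(stats) + "\n")

# Illustrative usage (run_render and texture count are placeholders):
#   with job_stats("shot_042_frame_0107") as s:
#       s["textures_read"] = run_render()
```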
We were constantly behind schedule, with delayed jobs always waiting in the launch queue; every day more and more jobs piled up, waiting their turn. The team of so-called "wranglers" at Weta Digital is responsible for making sure that all render jobs run correctly and in the right order. The morning after we installed and turned on FlexCache overnight, the senior wrangler came to us to report that all of the jobs had finished. Everything had run so unexpectedly fast that he assumed something was broken.
Why NetApp?
I'm a longtime fan of NetApp products. I first used NetApp systems while working at an Internet provider in Alaska during the dot-com boom of the late 90s, and I was impressed enough with their capabilities to keep promoting them at the companies I worked for afterwards. I was pleased to find that NetApp storage systems were already in use at Weta Digital when I arrived.
For a company like Weta Digital it's quite normal to break things, because hardly anyone else does to their infrastructure what we have to do to ours. The key point is that you may need the vendor's help to put things right (for example, analyzing the root cause of a problem, or even urgently releasing a software patch when a bug is found). Even if you are a small company, there will always be someone at NetApp who will work on your problem with you until it is finally solved. You would think that this is how it should always be? My experience is that other vendors rarely do this.
Data storage can be complicated. NetApp technologies make storage systems as simple as possible. There are things I would like NetApp to do better than it does now, but compared with everyone else, I see that NetApp makes a reliable, versatile product that is simple to use and backed by solid support. That's why we still use NetApp.

Adam Shand
Former head of the infrastructure department,
Weta Digital
Adam began his career in New Zealand, where he co-founded one of the country's first Internet companies together with his father. His next job, in 1997, took him to Alaska, and he later worked in Portland, Oregon, in the EDA industry. The similarity between EDA and visual effects production, plus the chance to work closer to home, brought Adam to Weta Digital in 2003. After several years at Weta Digital, Adam decided to leave the company and set off on a year-long journey through Southeast Asia.