
Tupperware: Facebook's Kubernetes “killer”?

Efficient and reliable cluster management at any scale with Tupperware



Today at Systems @Scale we presented Tupperware, our cluster management system, which orchestrates containers on millions of servers running almost all of our services. We first deployed Tupperware in 2011, and since then our infrastructure has grown from 1 data center to 15 geo-distributed data centers. Tupperware has not stood still all this time; it has evolved with us. We will describe how Tupperware provides first-class cluster management, including convenient support for stateful services, a single control plane for all data centers, and the ability to redistribute capacity between services in real time. We will also share the lessons we have learned as our infrastructure has grown.


Tupperware serves several audiences. Application developers use it to deliver and manage applications: it packages an application's code and dependencies into an image and delivers it to servers as containers. Containers provide isolation between applications on a single server, so developers can focus on application logic without worrying about finding servers or monitoring updates. Tupperware also monitors server health, and when it detects a failure it moves containers off the problem server.
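
As a rough illustration of that last point, here is a minimal Python sketch of a health-check-and-reschedule loop. The class and function names are hypothetical and only mirror the behavior described above, not Tupperware's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical view of a server and the containers it hosts."""
    name: str
    healthy: bool = True
    containers: list = field(default_factory=list)


def reschedule_failed(servers: list[Server]) -> None:
    """Move containers off unhealthy servers onto the least-loaded healthy ones."""
    healthy = [s for s in servers if s.healthy]
    if not healthy:
        raise RuntimeError("no healthy servers left to receive containers")
    for server in servers:
        if server.healthy or not server.containers:
            continue
        for container in server.containers:
            target = min(healthy, key=lambda s: len(s.containers))
            target.containers.append(container)
            print(f"moved {container} from {server.name} to {target.name}")
        server.containers.clear()


# Example: web3 has failed, so its container is moved to an idle healthy server.
fleet = [
    Server("web1", containers=["feed:1"]),
    Server("web2"),
    Server("web3", healthy=False, containers=["feed:2"]),
]
reschedule_failed(fleet)
```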


Capacity planning engineers use Tupperware to allocate server capacity to teams according to budgets and constraints, and to improve server utilization. Data center operators turn to Tupperware to distribute containers appropriately across data centers and to stop or move containers during maintenance. Thanks to this, maintaining servers, networks, and equipment requires minimal human involvement.


Tupperware Architecture



Tupperware architecture. PRN is one of our data center regions. A region consists of several data center buildings (PRN1 and PRN2) located close to each other. We plan to build a single control plane that manages all servers in one region.


Application developers deliver services as Tupperware tasks. A task consists of several containers, all of which typically run the same application code.
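
A task might be expressed with a declarative specification along these lines; the field names are assumptions made for illustration, not Tupperware's real spec format.

```python
# Hypothetical task specification: one image, several identical containers.
task_spec = {
    "name": "my-web-service",
    "image": "my-web-service:2019-06-19",     # code + dependencies packaged together
    "replicas": 3,                            # number of containers in the task
    "resources": {"cpu": 4, "memory_gb": 8},  # per-container resource requirements
}
```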


Tupperware is responsible for allocating containers and managing their life cycle. It consists of several components, including the scheduler, the scheduler proxy, and the resource broker, which are described below.



Distinctive features of Tupperware


Tupperware is in many ways similar to other cluster management systems, such as Kubernetes and Mesos, but there are also differences, described in the sections below.



We developed these capabilities to support a variety of stateless and stateful applications in a huge, globally shared server fleet.


Built-in support for stateful services


Tupperware manages a variety of critical stateful services that store persistent product data for Facebook, Instagram, Messenger, and WhatsApp. These include large key-value stores (for example, ZippyDB) and monitoring data stores (for example, ODS Gorilla and Scuba). Maintaining stateful services is not easy, because the system must ensure that container placement survives large-scale failures, including network partitions and power outages. And while the usual techniques, such as spreading containers across failure domains, work well for stateless services, stateful services need additional support.


For example, if one database replica becomes unavailable because of a server failure, should automatic maintenance that updates the kernels on 50 servers out of a pool of 10,000 still go ahead? It depends. If another replica of the same database sits on one of those 50 servers, it is better to wait rather than lose 2 replicas at once. To make such maintenance and operational decisions dynamically, we need visibility into each stateful service's internal data replication and placement logic.


The TaskControl interface lets stateful services influence decisions that affect data availability. Through this interface, the scheduler notifies external applications about container operations (restart, update, migration, maintenance). A stateful service implements a controller that tells Tupperware when each operation is safe to perform, and operations can be reordered or temporarily postponed. In the example above, the database controller might tell Tupperware to update 49 of the 50 servers but leave one specific server (X) untouched for now. If the kernel update deadline passes and the database still has not restored the problematic replica, Tupperware will update server X anyway.
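
Here is a minimal sketch of what such a controller could look like, assuming a simple allow/postpone decision per proposed operation. The class, method, and field names are hypothetical, since the real TaskControl interface is not shown in the post.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"        # safe to perform the operation now
    POSTPONE = "postpone"  # ask the scheduler to come back later


class DatabaseTaskController:
    """Tells the scheduler which container operations are currently safe."""

    def __init__(self, replica_locations: dict[str, set[str]]):
        # server name -> databases that keep a replica on that server
        self.replica_locations = replica_locations
        # databases that have already lost a replica elsewhere
        self.degraded_databases: set[str] = set()

    def on_operation_proposed(self, server: str, operation: str) -> Decision:
        """Called before a restart/update/migration/maintenance on `server`."""
        databases_here = self.replica_locations.get(server, set())
        # Postpone if this server holds a replica of an already-degraded database;
        # otherwise we would lose two replicas of the same data at once.
        if databases_here & self.degraded_databases:
            return Decision.POSTPONE
        return Decision.ALLOW
```

In the scenario above, such a controller would keep returning POSTPONE for server X until the lost replica is restored, or until the scheduler's deadline forces the update anyway.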



Many stateful services in Tupperware use TaskControl not directly but through ShardManager, a common platform for building stateful services at Facebook. With Tupperware, developers declare their intent about how containers should be distributed across data centers; with ShardManager, they declare their intent about how data shards should be distributed across containers. ShardManager is aware of its applications' data placement and replication, and it interacts with Tupperware through the TaskControl interface to schedule container operations without direct application involvement. This integration greatly simplifies the management of stateful services, but TaskControl is capable of more. For example, our extensive web tier is stateless and uses TaskControl to dynamically adjust the rollout speed of container updates. As a result, the web tier can ship several software releases per day without compromising availability.


Server management in data centers


When Tupperware first appeared in 2011, each server cluster was managed by a separate scheduler. At the time, a Facebook cluster was a group of server racks connected to the same network switch, and a data center housed several clusters. A scheduler could manage servers in only one cluster, so a task could not span multiple clusters. As our infrastructure grew, we decommissioned clusters more and more often. Because Tupperware could not move a task from a decommissioned cluster to other clusters without changes, decommissioning took a lot of effort and careful coordination between application developers and data center operators. This process wasted resources: servers sat idle for months because of the decommissioning procedure.


We built a resource broker to solve the cluster decommissioning problem and to coordinate other maintenance tasks. The resource broker tracks all the physical information associated with a server and dynamically decides which scheduler controls each server. Dynamically binding servers to schedulers lets a scheduler manage servers in different data centers. Because a Tupperware task is no longer limited to a single cluster, Tupperware users can specify how containers should be distributed across failure domains. For example, a developer can declare an intent ("run my task on 2 failure domains in the PRN region") without naming specific availability zones, and Tupperware will find suitable servers to satisfy that intent even during cluster decommissioning or maintenance.
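
A toy version of how a scheduler could satisfy such an intent: verify that the region offers enough failure domains, then round-robin replicas across them. This is only an illustration under assumed data structures, not Tupperware's placement algorithm.

```python
from collections import defaultdict


def place(servers: list[dict], replicas: int, min_domains: int) -> list[dict]:
    """Pick `replicas` servers, round-robining across failure domains."""
    by_domain = defaultdict(list)
    for server in servers:
        by_domain[server["failure_domain"]].append(server)
    if len(by_domain) < min_domains:
        raise RuntimeError("region does not offer enough failure domains")

    chosen, domain_order, i = [], sorted(by_domain), 0
    while len(chosen) < replicas:
        if not any(by_domain.values()):
            raise RuntimeError("not enough servers for the requested replicas")
        domain = domain_order[i % len(domain_order)]
        i += 1
        if by_domain[domain]:
            chosen.append(by_domain[domain].pop())
    return chosen


# "Run my task on 2 failure domains in the PRN region" (illustrative inventory).
prn_servers = [
    {"name": "srv1", "failure_domain": "PRN1"},
    {"name": "srv2", "failure_domain": "PRN1"},
    {"name": "srv3", "failure_domain": "PRN2"},
]
print(place(prn_servers, replicas=2, min_domains=2))
```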


Scaling to support the entire global system


Historically, our infrastructure was divided into hundreds of dedicated server pools owned by individual teams. Because of fragmentation and a lack of standards, our operational costs were high and idle servers were hard to reuse. At last year's Systems @Scale conference, we presented infrastructure as a service (IaaS), which is meant to consolidate our infrastructure into one large, shared server fleet. But a single server fleet brings its own difficulties; in particular, the control plane must scale to manage the entire shared pool.



To handle such a large shared pool, we split the scheduler into shards. Each scheduler shard manages its own set of tasks in the region, which reduces the risk associated with any single scheduler. As the shared pool grows, we can add more scheduler shards. To Tupperware users, the shards and the scheduler proxy look like a single control plane; they do not have to deal with the many shards that orchestrate their tasks. Scheduler shards are fundamentally different from the cluster schedulers we used before: the control plane is partitioned without statically splitting the shared server pool along network topology.
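
A bare-bones sketch of the proxy idea: users submit tasks to one front end, which routes each task to a shard. The hash-based routing rule here is an assumption made for illustration, not a description of Tupperware's internals.

```python
import hashlib


class SchedulerShard:
    """One scheduler shard managing its own subset of tasks in the region."""

    def __init__(self, name: str):
        self.name = name
        self.tasks: dict[str, dict] = {}

    def schedule(self, task_name: str, spec: dict) -> None:
        self.tasks[task_name] = spec
        print(f"shard {self.name} now manages task {task_name}")


class SchedulerProxy:
    """Single entry point that hides the shards from users."""

    def __init__(self, shards: list[SchedulerShard]):
        self.shards = shards

    def submit(self, task_name: str, spec: dict) -> None:
        self.shards[self._shard_index(task_name)].schedule(task_name, spec)

    def _shard_index(self, task_name: str) -> int:
        # Deterministic routing: a hash of the task name picks the shard.
        digest = hashlib.sha256(task_name.encode()).hexdigest()
        return int(digest, 16) % len(self.shards)


proxy = SchedulerProxy([SchedulerShard("shard-0"), SchedulerShard("shard-1")])
proxy.submit("my-web-service", {"replicas": 3})
```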


Improving Efficiency with Elastic Computing


The larger our infrastructure gets, the more important it is to use our servers efficiently in order to optimize infrastructure costs and reduce load. There are two ways to increase server utilization: stack more workloads onto each server, and reuse idle capacity through elastic computing. Both are discussed below.



The bottleneck in our data centers is power consumption, so we prefer small, energy-efficient servers that together provide more computing power. Unfortunately, stacking workloads is less effective on small servers with limited CPU and memory. We can, of course, place several containers of small services, which consume little CPU and memory, on one small energy-efficient server, but large services perform poorly in that setup. We therefore advise developers of our large services to optimize them to use entire servers.
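
As a sketch of that stacking trade-off, assuming illustrative capacities: small containers can share one energy-efficient server as long as its limited CPU and memory hold out, while a large service effectively needs the whole machine. The class and numbers below are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SmallServer:
    """A small, energy-efficient server with limited CPU and memory (assumed numbers)."""
    cpu: int = 16
    memory_gb: int = 64
    containers: list = field(default_factory=list)

    def free_cpu(self) -> int:
        return self.cpu - sum(c["cpu"] for c in self.containers)

    def free_memory_gb(self) -> int:
        return self.memory_gb - sum(c["memory_gb"] for c in self.containers)

    def try_stack(self, container: dict) -> bool:
        """Stack a container only if it still fits in the remaining capacity."""
        if container["cpu"] <= self.free_cpu() and container["memory_gb"] <= self.free_memory_gb():
            self.containers.append(container)
            return True
        return False


server = SmallServer()
print(server.try_stack({"name": "small-svc-a", "cpu": 2, "memory_gb": 4}))   # True: fits alongside others
print(server.try_stack({"name": "large-svc", "cpu": 16, "memory_gb": 64}))   # False: wants the whole server
```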


We mainly improve utilization through elastic computing. The load on many of our major services, such as News Feed, messaging, and the front-end web tier, varies with the time of day. We deliberately scale down online services during quiet hours and use the freed servers for offline workloads, such as machine learning and MapReduce jobs.



We know from experience that it is best to lend entire servers as the unit of elastic capacity, because large services are both the main donors and the main consumers of elastic capacity, and they are optimized to use entire servers. When a server is freed from an online service during quiet hours, the resource broker lends it to a scheduler for temporary use to run offline workloads. If a load peak hits the online service, the resource broker quickly reclaims the loaned server and, together with the scheduler, returns it to the online service.
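
A minimal sketch of that lend/reclaim cycle, with whole servers as the unit of elastic capacity; the class and method names are invented for illustration and are not part of Tupperware.

```python
class ResourceBroker:
    """Tracks which whole servers are loaned from the online pool to offline work."""

    def __init__(self, online_servers: set[str]):
        self.online_pool = set(online_servers)
        self.loaned_to_offline: set[str] = set()

    def lend_for_offline(self, server: str) -> None:
        """Quiet hours: hand an idle online server to the offline scheduler."""
        if server in self.online_pool:
            self.online_pool.remove(server)
            self.loaned_to_offline.add(server)
            print(f"{server} loaned for offline work (machine learning, MapReduce)")

    def reclaim_for_online(self, server: str) -> None:
        """Peak load: take the server back and return it to the online service."""
        if server in self.loaned_to_offline:
            self.loaned_to_offline.remove(server)
            self.online_pool.add(server)
            print(f"{server} reclaimed for the online service")


broker = ResourceBroker({"web-001", "web-002"})
broker.lend_for_offline("web-002")    # off-peak hours
broker.reclaim_for_online("web-002")  # traffic spike
```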


Lessons learned and future plans


Over the past 8 years, we have evolved Tupperware to keep pace with Facebook's rapid growth. We share what we have learned in the hope that it will help others manage their own fast-growing infrastructures.



We are only beginning to roll out a single global shared server fleet. Today about 20% of our servers are in the shared pool. To reach 100%, we need to solve many problems, including shared-pool support for storage systems, maintenance automation, handling the requirements of different tenants, improving server utilization, and better support for machine learning workloads. We can't wait to tackle these challenges and share our progress.



Source: https://habr.com/ru/post/455579/

