
Tupperware: Facebook's Kubernetes “killer”?

Efficient and reliable cluster management at any scale with Tupperware



Today at Systems @Scale we presented Tupperware, our cluster management system, which orchestrates containers on millions of servers running almost all of our services. We first deployed Tupperware in 2011, and since then our infrastructure has grown from 1 data center to 15 geo-distributed data centers. Tupperware has not stood still all this time; it has evolved with us. We will describe how Tupperware provides first-class cluster management, including convenient support for stateful services, a single control plane for all data centers, and the ability to redistribute capacity between services in real time. We will also share the lessons we have learned as our infrastructure has grown.


Tupperware serves several audiences. Application developers use it to deliver and manage applications: it packages an application's code and dependencies into an image and delivers it to servers as containers. Containers provide isolation between applications on a single server, so developers can focus on application logic without worrying about finding servers or monitoring updates. Tupperware also monitors server health, and when it detects a failure it moves containers off the problem server.
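
As a rough illustration of that last point, here is a minimal Python sketch of a health-check-and-reschedule loop. The class and function names are hypothetical and only mirror the behavior described above, not Tupperware's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical view of a server and the containers it hosts."""
    name: str
    healthy: bool = True
    containers: list = field(default_factory=list)


def reschedule_failed(servers: list[Server]) -> None:
    """Move containers off unhealthy servers onto the least-loaded healthy ones."""
    healthy = [s for s in servers if s.healthy]
    if not healthy:
        raise RuntimeError("no healthy servers left to receive containers")
    for server in servers:
        if server.healthy or not server.containers:
            continue
        for container in server.containers:
            target = min(healthy, key=lambda s: len(s.containers))
            target.containers.append(container)
            print(f"moved {container} from {server.name} to {target.name}")
        server.containers.clear()


# Example: web3 has failed, so its container is moved to an idle healthy server.
fleet = [
    Server("web1", containers=["feed:1"]),
    Server("web2"),
    Server("web3", healthy=False, containers=["feed:2"]),
]
reschedule_failed(fleet)
```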


Capacity planning engineers use Tupperware to allocate server capacity to teams according to budgets and constraints, and to improve server utilization. Data center operators turn to Tupperware to distribute containers appropriately across data centers and to stop or move containers during maintenance. Thanks to this, maintaining servers, networks, and equipment requires minimal human involvement.


Tupperware Architecture



Tupperware architecture. PRN is one of our data center regions. A region consists of several data center buildings (PRN1 and PRN2) located close to each other. We plan to build a single control plane that manages all servers in one region.


Application developers deliver services as Tupperware tasks. A task consists of several containers, all of which typically run the same application code.
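
A task might be expressed with a declarative specification along these lines; the field names are assumptions made for illustration, not Tupperware's real spec format.

```python
# Hypothetical task specification: one image, several identical containers.
task_spec = {
    "name": "my-web-service",
    "image": "my-web-service:2019-06-19",     # code + dependencies packaged together
    "replicas": 3,                            # number of containers in the task
    "resources": {"cpu": 4, "memory_gb": 8},  # per-container resource requirements
}
```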


Tupperware is responsible for allocating containers and managing their life cycle. It consists of several components, including the scheduler, the scheduler proxy, and the resource broker, which are described below.



Distinctive features of Tupperware


Tupperware is in many ways similar to other cluster management systems, such as Kubernetes and Mesos, but there are also differences, described in the sections below.



We developed these capabilities to support a variety of stateless and stateful applications in a huge, globally shared server fleet.


Built-in support for stateful services


Tupperware manages a variety of critical stateful services that store persistent product data for Facebook, Instagram, Messenger, and WhatsApp. These include large key-value stores (for example, ZippyDB) and monitoring data stores (for example, ODS Gorilla and Scuba). Maintaining stateful services is not easy, because the system must ensure that container placement survives large-scale failures, including network partitions and power outages. And while the usual techniques, such as spreading containers across failure domains, work well for stateless services, stateful services need additional support.


For example, if one database replica becomes unavailable because of a server failure, should automatic maintenance that updates the kernels on 50 servers out of a pool of 10,000 still go ahead? It depends. If another replica of the same database sits on one of those 50 servers, it is better to wait rather than lose 2 replicas at once. To make such maintenance and operational decisions dynamically, we need visibility into each stateful service's internal data replication and placement logic.


The TaskControl interface lets stateful services influence decisions that affect data availability. Through this interface, the scheduler notifies external applications about container operations (restart, update, migration, maintenance). A stateful service implements a controller that tells Tupperware when each operation is safe to perform, and operations can be reordered or temporarily postponed. In the example above, the database controller might tell Tupperware to update 49 of the 50 servers but leave one specific server (X) untouched for now. If the kernel update deadline passes and the database still has not restored the problematic replica, Tupperware will update server X anyway.
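
Here is a minimal sketch of what such a controller could look like, assuming a simple allow/postpone decision per proposed operation. The class, method, and field names are hypothetical, since the real TaskControl interface is not shown in the post.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"        # safe to perform the operation now
    POSTPONE = "postpone"  # ask the scheduler to come back later


class DatabaseTaskController:
    """Tells the scheduler which container operations are currently safe."""

    def __init__(self, replica_locations: dict[str, set[str]]):
        # server name -> databases that keep a replica on that server
        self.replica_locations = replica_locations
        # databases that have already lost a replica elsewhere
        self.degraded_databases: set[str] = set()

    def on_operation_proposed(self, server: str, operation: str) -> Decision:
        """Called before a restart/update/migration/maintenance on `server`."""
        databases_here = self.replica_locations.get(server, set())
        # Postpone if this server holds a replica of an already-degraded database;
        # otherwise we would lose two replicas of the same data at once.
        if databases_here & self.degraded_databases:
            return Decision.POSTPONE
        return Decision.ALLOW
```

In the scenario above, such a controller would keep returning POSTPONE for server X until the lost replica is restored, or until the scheduler's deadline forces the update anyway.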



Many stateful services in Tupperware use TaskControl not directly but through ShardManager, a common platform for building stateful services at Facebook. With Tupperware, developers declare their intent about how containers should be distributed across data centers; with ShardManager, they declare their intent about how data shards should be distributed across containers. ShardManager is aware of its applications' data placement and replication, and it interacts with Tupperware through the TaskControl interface to schedule container operations without direct application involvement. This integration greatly simplifies the management of stateful services, but TaskControl is capable of more. For example, our extensive web tier is stateless and uses TaskControl to dynamically adjust the rollout speed of container updates. As a result, the web tier can ship several software releases per day without compromising availability.


Server management in data centers


When Tupperware first appeared in 2011, each server cluster was managed by a separate scheduler. At the time, a Facebook cluster was a group of server racks connected to the same network switch, and a data center housed several clusters. A scheduler could manage servers in only one cluster, so a task could not span multiple clusters. As our infrastructure grew, we decommissioned clusters more and more often. Because Tupperware could not move a task from a decommissioned cluster to other clusters without changes, decommissioning took a lot of effort and careful coordination between application developers and data center operators. This process wasted resources: servers sat idle for months because of the decommissioning procedure.


We built a resource broker to solve the cluster decommissioning problem and to coordinate other maintenance tasks. The resource broker tracks all the physical information associated with a server and dynamically decides which scheduler controls each server. Dynamically binding servers to schedulers lets a scheduler manage servers in different data centers. Because a Tupperware task is no longer limited to a single cluster, Tupperware users can specify how containers should be distributed across failure domains. For example, a developer can declare an intent ("run my task on 2 failure domains in the PRN region") without naming specific availability zones, and Tupperware will find suitable servers to satisfy that intent even during cluster decommissioning or maintenance.
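
A toy version of how a scheduler could satisfy such an intent: verify that the region offers enough failure domains, then round-robin replicas across them. This is only an illustration under assumed data structures, not Tupperware's placement algorithm.

```python
from collections import defaultdict


def place(servers: list[dict], replicas: int, min_domains: int) -> list[dict]:
    """Pick `replicas` servers, round-robining across failure domains."""
    by_domain = defaultdict(list)
    for server in servers:
        by_domain[server["failure_domain"]].append(server)
    if len(by_domain) < min_domains:
        raise RuntimeError("region does not offer enough failure domains")

    chosen, domain_order, i = [], sorted(by_domain), 0
    while len(chosen) < replicas:
        if not any(by_domain.values()):
            raise RuntimeError("not enough servers for the requested replicas")
        domain = domain_order[i % len(domain_order)]
        i += 1
        if by_domain[domain]:
            chosen.append(by_domain[domain].pop())
    return chosen


# "Run my task on 2 failure domains in the PRN region" (illustrative inventory).
prn_servers = [
    {"name": "srv1", "failure_domain": "PRN1"},
    {"name": "srv2", "failure_domain": "PRN1"},
    {"name": "srv3", "failure_domain": "PRN2"},
]
print(place(prn_servers, replicas=2, min_domains=2))
```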


Scaling to support the entire global system


Historically, our infrastructure was divided into hundreds of dedicated server pools owned by individual teams. Because of fragmentation and a lack of standards, our operational costs were high and idle servers were hard to reuse. At last year's Systems @Scale conference, we presented infrastructure as a service (IaaS), which is meant to consolidate our infrastructure into one large, shared server fleet. But a single server fleet brings its own difficulties; in particular, the control plane must scale to manage the entire shared pool.



To handle such a large shared pool, we split the scheduler into shards. Each scheduler shard manages its own set of tasks in the region, which reduces the risk associated with any single scheduler. As the shared pool grows, we can add more scheduler shards. To Tupperware users, the shards and the scheduler proxy look like a single control plane; they do not have to deal with the many shards that orchestrate their tasks. Scheduler shards are fundamentally different from the cluster schedulers we used before: the control plane is partitioned without statically splitting the shared server pool along network topology.
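
A bare-bones sketch of the proxy idea: users submit tasks to one front end, which routes each task to a shard. The hash-based routing rule here is an assumption made for illustration, not a description of Tupperware's internals.

```python
import hashlib


class SchedulerShard:
    """One scheduler shard managing its own subset of tasks in the region."""

    def __init__(self, name: str):
        self.name = name
        self.tasks: dict[str, dict] = {}

    def schedule(self, task_name: str, spec: dict) -> None:
        self.tasks[task_name] = spec
        print(f"shard {self.name} now manages task {task_name}")


class SchedulerProxy:
    """Single entry point that hides the shards from users."""

    def __init__(self, shards: list[SchedulerShard]):
        self.shards = shards

    def submit(self, task_name: str, spec: dict) -> None:
        self.shards[self._shard_index(task_name)].schedule(task_name, spec)

    def _shard_index(self, task_name: str) -> int:
        # Deterministic routing: a hash of the task name picks the shard.
        digest = hashlib.sha256(task_name.encode()).hexdigest()
        return int(digest, 16) % len(self.shards)


proxy = SchedulerProxy([SchedulerShard("shard-0"), SchedulerShard("shard-1")])
proxy.submit("my-web-service", {"replicas": 3})
```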


Improving Efficiency with Elastic Computing


The larger our infrastructure gets, the more important it is to use our servers efficiently in order to optimize infrastructure costs and reduce load. There are two ways to increase server utilization: stack more workloads onto each server, and reuse idle capacity through elastic computing. Both are discussed below.



The bottleneck in our data centers is power consumption, so we prefer small, energy-efficient servers that together provide more computing power. Unfortunately, stacking workloads is less effective on small servers with limited CPU and memory. We can, of course, place several containers of small services, which consume little CPU and memory, on one small energy-efficient server, but large services perform poorly in that setup. We therefore advise developers of our large services to optimize them to use entire servers.
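
As a sketch of that stacking trade-off, assuming illustrative capacities: small containers can share one energy-efficient server as long as its limited CPU and memory hold out, while a large service effectively needs the whole machine. The class and numbers below are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SmallServer:
    """A small, energy-efficient server with limited CPU and memory (assumed numbers)."""
    cpu: int = 16
    memory_gb: int = 64
    containers: list = field(default_factory=list)

    def free_cpu(self) -> int:
        return self.cpu - sum(c["cpu"] for c in self.containers)

    def free_memory_gb(self) -> int:
        return self.memory_gb - sum(c["memory_gb"] for c in self.containers)

    def try_stack(self, container: dict) -> bool:
        """Stack a container only if it still fits in the remaining capacity."""
        if container["cpu"] <= self.free_cpu() and container["memory_gb"] <= self.free_memory_gb():
            self.containers.append(container)
            return True
        return False


server = SmallServer()
print(server.try_stack({"name": "small-svc-a", "cpu": 2, "memory_gb": 4}))   # True: fits alongside others
print(server.try_stack({"name": "large-svc", "cpu": 16, "memory_gb": 64}))   # False: wants the whole server
```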


We mainly improve utilization through elastic computing. The load on many of our major services, such as News Feed, messaging, and the front-end web tier, varies with the time of day. We deliberately scale down online services during quiet hours and use the freed servers for offline workloads, such as machine learning and MapReduce jobs.



We know from experience that it is best to lend entire servers as the unit of elastic capacity, because large services are both the main donors and the main consumers of elastic capacity, and they are optimized to use entire servers. When a server is freed from an online service during quiet hours, the resource broker lends it to a scheduler for temporary use to run offline workloads. If a load peak hits the online service, the resource broker quickly reclaims the loaned server and, together with the scheduler, returns it to the online service.
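
A minimal sketch of that lend/reclaim cycle, with whole servers as the unit of elastic capacity; the class and method names are invented for illustration and are not part of Tupperware.

```python
class ResourceBroker:
    """Tracks which whole servers are loaned from the online pool to offline work."""

    def __init__(self, online_servers: set[str]):
        self.online_pool = set(online_servers)
        self.loaned_to_offline: set[str] = set()

    def lend_for_offline(self, server: str) -> None:
        """Quiet hours: hand an idle online server to the offline scheduler."""
        if server in self.online_pool:
            self.online_pool.remove(server)
            self.loaned_to_offline.add(server)
            print(f"{server} loaned for offline work (machine learning, MapReduce)")

    def reclaim_for_online(self, server: str) -> None:
        """Peak load: take the server back and return it to the online service."""
        if server in self.loaned_to_offline:
            self.loaned_to_offline.remove(server)
            self.online_pool.add(server)
            print(f"{server} reclaimed for the online service")


broker = ResourceBroker({"web-001", "web-002"})
broker.lend_for_offline("web-002")    # off-peak hours
broker.reclaim_for_online("web-002")  # traffic spike
```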


Lessons learned and future plans


Over the past 8 years, we have evolved Tupperware to keep pace with Facebook's rapid growth. We share what we have learned in the hope that it will help others manage their own fast-growing infrastructures.



We are only beginning to roll out a single global shared server fleet. Today about 20% of our servers are in the shared pool. To reach 100%, we need to solve many problems, including shared-pool support for storage systems, maintenance automation, handling the requirements of different tenants, improving server utilization, and better support for machine learning workloads. We can't wait to tackle these challenges and share our progress.



Source: https://habr.com/ru/post/455579/

