
Cluster of Puppets: Amazon ECS Experience with iFunny



Despite the title, this article has nothing to do with the Puppet configuration management system.


Along with the trend of splitting large monoliths into small microservices came a trend toward container orchestration for web applications. Right after the hype around Docker rose the hype around tools for running services on top of Docker. Kubernetes is the one most often talked about, but many of its alternatives are alive and developing as well.


At iFunny, we weighed the benefits and value of orchestrators and eventually chose Amazon Elastic Container Service. In short: ECS is a container management platform for EC2 instances. For details and battle-tested experience, read on.


Why do we need container orchestration




After numerous articles about Docker on Habr and beyond, everyone presumably has an idea of what it is and is not meant for. Let's now clarify what a platform on top of Docker is needed for:


Out-of-the-box deployment automation


You've thrown containers onto a machine? Great! But how do you update them without degrading the service while switching the traffic that goes to the static port the client is attached to? And how do you quickly and painlessly roll back if there are problems with the new version of the application? Docker by itself does not solve these issues.


Yes, you can write this yourself, and at first iFunny did just that. For blue-green deployment we used an Ansible playbook that managed good old iptables, switching the relevant rules to the IP of the new container. Draining connections from the old container was handled by tracking them with the equally good old conntrack.
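
Purely for illustration, a minimal Python sketch of that rule-switching idea; the real logic lived in an Ansible playbook, and the rule layout here is hypothetical:

```python
import subprocess

def switch_traffic(host_port: int, old_ip: str, new_ip: str, app_port: int) -> None:
    """Repoint the static host port from the old container to the new one."""
    def dnat_rule(ip: str) -> list:
        return ["PREROUTING", "-p", "tcp", "--dport", str(host_port),
                "-j", "DNAT", "--to-destination", f"{ip}:{app_port}"]

    # Insert the rule for the new container first, so there is no window
    # during which no rule matches at all...
    subprocess.run(["iptables", "-t", "nat", "-I"] + dnat_rule(new_ip), check=True)
    # ...then drop the rule that pointed at the old container.
    subprocess.run(["iptables", "-t", "nat", "-D"] + dnat_rule(old_ip), check=True)
    # Existing connections to the old container were then drained by
    # watching them with conntrack.
```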


It seems transparent and understandable, but, like any homemade solution, it brought its share of problems.



Summing up those problems, the conclusion is that declarative SCM tools are ill-suited to an imperative task like deployment. What do orchestrators have to do with it, you ask? Simple: any orchestrator lets you ship a service with a single command, with no need to describe the process yourself. All the well-known deployment patterns, along with graceful handling of failures, are at your disposal.


In my opinion, orchestration platforms are still the only way to get fast, simple and reliable deployment with Docker. Fellow AWS fans may point to Elastic Beanstalk as a counterexample. We also used it in production for a while, but it had enough problems of its own that they wouldn't fit into this article.


Simplifying configuration management


I once heard a very apt comparison of an orchestration platform with an operating system scheduling processes onto CPU cores. After all, you don't care which core your program runs on, do you?


The same applies to orchestrators. By and large, you stop caring which machines a service runs on and in how many copies, since the balancer configuration is updated dynamically. Host configuration in production shrinks to a bare minimum: ideally, just installing Docker. Even better, with CoreOS you can drop configuration management altogether.


Thus, your fleet of machines is no longer something you have to watch day and night, but a simple pool of resources, any part of which can be replaced at a minute's notice.


Service-centric infrastructure approach


In recent years, web application infrastructure has been shifting from a host-centric to a service-centric approach. In essence this continues the previous point: instead of watching host-level indicators, you watch the external indicators of the service itself. The philosophy of an orchestration platform fits this paradigm far more naturally than keeping a service pinned to a strictly fixed pool of hosts.


Microservices fit here as well. Besides deployment automation, orchestration makes it quick and easy to create new services and wire them together (Service Discovery tools usually come in the box with orchestrators).


Bringing infrastructure closer to developers


For the iFunny development team, DevOps is not an empty phrase, and certainly not a job title. Here, we strive to give developers maximum leeway in order to speed up the famous Flow, Feedback and Experimentation.


Over the past year or two, the API monolith has been actively broken up, and new microservices are constantly being launched. In practice, containers with orchestration help a great deal in standardizing and speeding up service launches as a technical process. With the right approach, a developer can create a new service himself at any time, without waiting weeks (or even a month) for his ticket to work its way through the admins' backlog.


There are plenty more reasons to use orchestrators. We could go on about resource utilization as well, but then even the most attentive reader wouldn't make it to the end.


Choosing an orchestrator




This is where one could compare a dozen solutions on the market, run endless benchmarks of container startup and cluster deployment, and publish exposés about the multitude of bugs blocking this or that product feature.


In reality, things are much more boring. iFunny tries to use AWS services as much as possible: the team is small, and there is never enough time, knowledge or experience to reinvent wheels or, as usual, endlessly file things into shape. So we decided to take the beaten path and pick a simple, understandable tool. And yes, ECS as a service is free in itself: you only pay the standard rate for the EC2 instances your agents and containers run on.


A small spoiler: the approach worked, but ECS raised plenty of new questions of its own. The confession about how sad we are that it isn't Kubernetes comes at the end of the article.


Terminology




Let's get to know the basic concepts in ECS.


Cluster → Service → Task


This is the chain to memorize first. At its base, the ECS platform is a cluster, which you don't have to host on any instance yourself and which is managed through the AWS API.


In a cluster you can run tasks: one or more containers launched as a single unit. In other words, a Task is the analogue of a Pod in Kubernetes. For flexible container management (scaling, deployment and the like) there is the concept of a service.


A service consists of a certain number of tasks. Continuing the Kubernetes comparison, a service is the analogue of a Deployment.
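
A small boto3 sketch of the hierarchy, just to make the bundle tangible:

```python
import boto3

ecs = boto3.client("ecs")  # region and credentials come from your environment

# Walk the hierarchy: a cluster contains services, a service runs tasks.
for cluster_arn in ecs.list_clusters()["clusterArns"]:
    services = ecs.list_services(cluster=cluster_arn)["serviceArns"]
    tasks = ecs.list_tasks(cluster=cluster_arn)["taskArns"]
    print(f"{cluster_arn}: {len(services)} services, {len(tasks)} tasks")
```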


Task definition


A JSON description of a task's launch parameters. Essentially a common spec that can be read as a wrapper over the docker run command. All the knobs for tagging and logging are exposed here, which is not the case with, for example, the Docker + Elastic Beanstalk combination.
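
For example, registering a minimal (entirely hypothetical) Task Definition via boto3 looks like this; the container block maps almost one-to-one onto docker run flags:

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="demo-task",
    containerDefinitions=[
        {
            "name": "app",
            "image": "nginx:1.25",
            "cpu": 256,      # CPU units; 1024 units = one core (see below)
            "memory": 512,   # hard memory limit, MiB
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
            "essential": True,
        }
    ],
)
```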


ECS-agent


A local agent that runs on each instance as a container. It monitors the state of the instance, accounts for its resources, and sends container launch commands to the local Docker daemon. The agent's source code is available on GitHub.


Application Load Balancer (ALB)


The new generation of load balancer from AWS. It differs from ELB mostly in concept: where ELB balances traffic at the host level, ALB balances at the application level. In the ECS ecosystem, the balancer serves as the destination for user traffic. You don't need to think about how to route traffic to the new version of the application; you simply hide the containers behind the balancer.


ALB has the concept of a target group, to which instances of an application are attached, and it is to a target group that you can bind an ECS service. In this setup, the cluster collects information about which ports the service's containers are listening on and feeds it to the target group, which distributes traffic from the balancer. So you don't have to worry about which port a container exposes or how to avoid conflicts between several services on one machine; ECS resolves this automatically.
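
A hedged sketch of such a binding (the ARNs and names are placeholders; the service role parameter was what ECS required for ALB registration at the time):

```python
import boto3

ecs = boto3.client("ecs")

# With a dynamic host port in the Task Definition, ECS itself reports
# each container's port to the target group, so several services can
# share one machine without port conflicts.
ecs.create_service(
    cluster="demo-cluster",
    serviceName="web",
    taskDefinition="demo-task:1",
    desiredCount=2,
    role="arn:aws:iam::123456789012:role/ecsServiceRole",
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123",
        "containerName": "app",
        "containerPort": 80,
    }],
)
```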


Task Placement Strategy


The strategy for distributing tasks across available cluster resources. A strategy consists of a type and a field. As in other orchestrators, there are three types: binpack (pack one machine to capacity before moving to the next), spread (distribute evenly across the cluster) and random (self-explanatory). The field can be CPU, memory, Availability Zone or instance ID.


In practice, spreading tasks across Availability Zones (data centers, in other words) proved to be the rock-solid option. It reduces contention for machine resources between containers and cushions the blow if one of the AWS Availability Zones unexpectedly fails.
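
In code, that strategy might look like this hedged sketch (names are hypothetical; the second rule additionally spreads tasks across instances within each zone):

```python
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="demo-cluster",
    serviceName="spread-demo",
    taskDefinition="demo-task:1",
    desiredCount=4,
    placementStrategy=[
        # Spread across Availability Zones first...
        {"type": "spread", "field": "attribute:ecs.availability-zone"},
        # ...then across instances within each zone.
        {"type": "spread", "field": "instanceId"},
    ],
)
```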


Healthy percentage


The minimum and maximum share of the desired task count within which the service is considered healthy. This parameter is useful for configuring service deployments.


The application version itself can be updated in two ways:

  • start a full set of new tasks alongside the old ones, then stop the old ones (minimum 100%, maximum 200%);
  • first stop part of the old tasks, then start new ones in the freed capacity (minimum below 100%, maximum 100%).

The first option resembles a blue-green deployment and would look perfect were it not for the need to keep twice the resources in the cluster. The second option wins on utilization, but can lead to degradation of the application if a decent amount of traffic lands on it. Which one to choose is up to you.
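In API terms this is the service's deploymentConfiguration; a sketch of both options with hypothetical numbers:

```python
import boto3

ecs = boto3.client("ecs")

# Option one: bring up a full replacement set of tasks before stopping
# the old ones (needs up to 2x cluster capacity during the deploy).
ecs.update_service(
    cluster="demo-cluster",
    service="web",
    deploymentConfiguration={"minimumHealthyPercent": 100, "maximumPercent": 200},
)

# Option two: free capacity first by stopping half of the old tasks.
# ecs.update_service(cluster="demo-cluster", service="web",
#     deploymentConfiguration={"minimumHealthyPercent": 50, "maximumPercent": 100})
```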


Autoscaling


Besides scaling at the level of EC2 instances, ECS also scales at the level of tasks. Just as with an Auto Scaling Group, you can set up CloudWatch Alarm triggers for an ECS service. The best option is to scale by the percentage of CPU used relative to the value specified in the Task Definition.


An important point: the CPU parameter, as in Docker itself, is specified not in cores but in CPU units. Processor time is then divided according to which task has more units. In ECS terminology, one CPU core equals 1024 units.
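
For illustration, here is how such a trigger might be wired up with boto3; the step-scaling policy plus CloudWatch alarm below is a hedged sketch, and all names and thresholds are hypothetical:

```python
import boto3

aas = boto3.client("application-autoscaling")
cw = boto3.client("cloudwatch")

# Make the service's DesiredCount scalable.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/demo-cluster/web",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

# Scale out aggressively: +50% of current capacity per trigger.
policy = aas.put_scaling_policy(
    PolicyName="web-scale-out",
    ServiceNamespace="ecs",
    ResourceId="service/demo-cluster/web",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "PercentChangeInCapacity",
        "StepAdjustments": [{"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 50}],
        "Cooldown": 60,
    },
)

# CPUUtilization for a service is a percentage of the CPU units
# reserved in the Task Definition (1024 units = one core).
cw.put_metric_alarm(
    AlarmName="web-cpu-high",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterName", "Value": "demo-cluster"},
                {"Name": "ServiceName", "Value": "web"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=75.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```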


Elastic Container Registry (ECR)


An AWS service for hosting Docker images. In other words, you get a Docker Registry without having to host one yourself. The plus is simplicity; the minuses are that you cannot have more than one domain, and every service needs its own separately created repository.


Integration into existing infrastructure




Now for the most interesting part: how ECS took root at iFunny.


Deployment pipeline


ECS may be good as an orchestrator and resource scheduler, but it does not provide a deployment tool. In ECS terminology, a deploy is a service update. To update, you create a new Task Definition revision, update the service pointing it at that revision, wait until the update finishes completely, and roll back to the old revision if something goes wrong. At the time of writing, AWS had no ready-made tool that did all of this in one go. There is a separate CLI for ECS, but it is more of a Docker Compose analogue than a tool for deploying individual services.


Fortunately, the open-source world filled this gap with the ecs-deploy utility. It is really just a shell script of a few hundred lines, but it handles its direct job very well: you specify the cluster, the service and the Docker image you want to roll out, and it walks through the whole algorithm, rolling back on failed updates and cleaning up stale Task Definitions along the way. A rough sketch of its algorithm follows below.
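
ecs-deploy itself is a shell script; this boto3 sketch (single-container task assumed, rollback omitted) just mirrors its main steps:

```python
import boto3

ecs = boto3.client("ecs")

def deploy(cluster: str, service: str, image: str) -> None:
    """Copy the current Task Definition, swap the image, register a new
    revision, update the service and wait until it stabilizes."""
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    td = ecs.describe_task_definition(
        taskDefinition=svc["taskDefinition"])["taskDefinition"]

    containers = td["containerDefinitions"]
    containers[0]["image"] = image  # single-container task assumed

    new_td = ecs.register_task_definition(
        family=td["family"],
        containerDefinitions=containers,
    )["taskDefinition"]

    ecs.update_service(cluster=cluster, service=service,
                       taskDefinition=new_td["taskDefinitionArn"])
    ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])
    # ecs-deploy additionally rolls back to the previous revision on timeout
    # and cleans up outdated Task Definitions.
```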


Initially, its only drawback was that you couldn't update the whole Task Definition through the utility: say, to change CPU limits or reconfigure the log driver. But it's a shell script, the easiest thing to hack on! The feature took a couple of hours to add, and since then we update services exclusively through Task Definitions stored in the root of each application's repository.


True, our pull request has been ignored for half a year now, along with a dozen others. So much for the downsides of open source.


Terraform


At iFunny, every AWS resource is provisioned through Terraform, and the resources behind our services are no exception: besides the service itself, there are the Application Load Balancer with its associated Listeners and Target Group, the ECR repository, the first Task Definition revision, alarm-driven autoscaling and the necessary DNS records.


The first idea was to combine all these resources into a single Terraform module and use it for every new service. At first it looked great: just 20 lines and you have a production-ready service! But maintaining such a module turned out to be far more expensive over time. Services are not homogeneous, new requirements kept appearing, and the module had to be edited almost every time it was used.


To stop fighting the syntactic sugar, we went back to square one: describing all the resources in Terraform state step by step, and wrapping into small modules only the things that wrap cleanly, namely load balancing and autoscaling.


At some point the state grew so large that a single plan with refresh took about 5-7 minutes, and it could also be locked by another engineer bringing something up at that very moment. We solved this by splitting the one big state into a small state per service.


Monitoring & logging


Here everything turned out transparent and simple. We added a couple of new metrics for service and cluster resource utilization to dashboards and alerts, so it is clearly visible when services start scaling and how well it works out in the end.


As before, we write logs to a local Fluentd agent, which delivers them to Elasticsearch for reading in Kibana. Unlike Beanstalk, ECS supports any log driver that Docker has, and this is configured right in the Task Definition.


You can also try the awslogs driver, which ships logs straight to CloudWatch Logs in the management console. Useful if you don't have enough log volume to justify running and maintaining a separate log collection system. Both variants are sketched below.
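
As a sketch of both variants, a container definition fragment with the fluentd driver, and the awslogs alternative commented out (addresses and group names are made up):

```python
# The log driver is set per container inside the Task Definition.
container_definition = {
    "name": "app",
    "image": "nginx:1.25",
    "essential": True,
    "logConfiguration": {
        "logDriver": "fluentd",
        "options": {
            "fluentd-address": "localhost:24224",  # local Fluentd agent
            "tag": "app.{{.ID}}",                  # tag by container ID
        },
    },
    # awslogs alternative, writing straight to CloudWatch Logs:
    # "logConfiguration": {
    #     "logDriver": "awslogs",
    #     "options": {
    #         "awslogs-group": "/ecs/app",
    #         "awslogs-region": "us-east-1",
    #         "awslogs-stream-prefix": "app",
    #     },
    # },
}
```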


Scaling & resource distribution


This is where most of the pain lived. The service scaling strategy took a long time of trial and error to settle.



In the end, we decided to scale services under load instantly and in large increments, and to scale instances upon reaching 75% of resource reservation. This may not be the best option in terms of hardware utilization, but at least all services in the cluster work stably without getting in each other's way.


Pitfalls




Try to recall a case where adopting something new ended in a one-hundred-percent happy ending for the engineers. Can't? The iFunny story with ECS was no exception.


Lack of flexibility in health checks


Unlike Kubernetes, with its flexible liveness and readiness probes, ECS has a single criterion: does the application return a 200 (or whatever code you configure) on one fixed URL. There are only two signs that a service is unhealthy: either the container didn't start at all, or it started but doesn't answer the health check.


This creates problems when, for example, a deploy breaks a key part of the service but it still answers the health check. In that case you will have to redeploy the old version yourself.


Lack of service discovery as such


AWS offers its own take on Service Discovery, but the solution looks, let's say, so-so. The best option in this situation is to run a Consul agent + Registrator bundle on the hosts, which is what the iFunny team is doing now.


Raw implementation of running tasks on a schedule


If that isn't clear: I'm talking about cron. Only in June of last year did ECS gain the concept of Scheduled Tasks, which lets you run tasks on a cluster on a schedule. Customers had long been waiting for the feature, but in operation it still feels raw, for several reasons.


First, in the API a scheduled task is not a single resource but two: a CloudWatch Event rule carrying the schedule, and a CloudWatch Event target carrying the launch parameters. From the outside this looks opaque. Second, there is no decent declarative tool for deploying these tasks.
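
A minimal boto3 sketch of those two resources (all ARNs and names are placeholders):

```python
import boto3

events = boto3.client("events")

# Resource one: the rule with the schedule.
events.put_rule(
    Name="nightly-cleanup",
    ScheduleExpression="cron(0 3 * * ? *)",
)

# Resource two: the target with the actual ECS launch parameters.
events.put_targets(
    Rule="nightly-cleanup",
    Targets=[{
        "Id": "cleanup-task",
        "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/demo-cluster",
        "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
        "EcsParameters": {
            "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/cleanup:1",
            "TaskCount": 1,
        },
    }],
)
```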


We tried to solve the problem with Ansible, but its support for standardizing these tasks is still in poor shape.


In the end, iFunny uses a self-written Python utility that describes the tasks in a YAML file, with plans to build a full-fledged tool for deploying cron tasks to ECS.


No direct connection between cluster and hosts


When an EC2 instance is terminated for whatever reason, it is not deregistered from the cluster, and all tasks running on it simply die. Since the balancer received no signal that the targets left the cluster, it keeps sending them requests until it figures out on its own that the containers are unreachable. That takes 10-15 seconds, during which you collect a pile of errors from the server.


Today the problem can be solved with a Lambda function that reacts to instance termination in the Auto Scaling Group and asks the cluster to move the machine's tasks off it (instance draining, in ECS terminology). The instance is then killed only after all tasks have left it. This works well, but a Lambda in the infrastructure always looks like a crutch: this is something that could have been built into the platform. A sketch of such a function is shown below.
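
A hedged Python sketch of that draining Lambda (cluster name is hardcoded for brevity; the real version also completes the Auto Scaling lifecycle hook once the tasks are gone):

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER = "demo-cluster"  # hypothetical; could also be derived from tags

def handler(event, context):
    """React to an EC2 Auto Scaling termination event and put the
    matching container instance into DRAINING, so ECS reschedules its
    tasks before the machine actually disappears."""
    ec2_instance_id = event["detail"]["EC2InstanceId"]

    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    if not arns:
        return
    instances = ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns)["containerInstances"]

    for inst in instances:
        if inst["ec2InstanceId"] == ec2_instance_id:
            ecs.update_container_instances_state(
                cluster=CLUSTER,
                containerInstances=[inst["containerInstanceArn"]],
                status="DRAINING",
            )
```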


Lack of detailed monitoring


From a cluster, the AWS API gives only the number of registered machines and the share of reserved capacity; from a service, only the task count and CPU and memory utilization as a percentage of the values set in the Task Definition. That is painful for the devout of the church of metrics. The lack of per-container resource detail can play a cruel joke when debugging service overload, and metrics on I/O and network utilization wouldn't hurt either.


Container deregistration in ALB


An important point gleaned from the AWS documentation: the balancer's deregistration_delay parameter is not a timeout for the target's deregistration, but the total waiting time. In other words, if the parameter is 30 seconds and your container is stopped after 15, the balancer will still send requests to the target and hand the client 500 errors.


The way out is to set the service's deregistration delay higher than the analogous ALB parameter. This seems obvious, yet it is not spelled out anywhere in the documentation, which causes problems at first.
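
The target group side of this can be tuned via the API; a hedged sketch (the ARN is a placeholder), keeping the delay below the time a stopping container stays alive:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Keep deregistration_delay shorter than the time a stopping container
# keeps running, so the balancer never routes to an already-dead target.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123",
    Attributes=[{"Key": "deregistration_delay.timeout_seconds", "Value": "30"}],
)
```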


Vendor lock-in inside AWS


Like any AWS cloud service, ECS cannot be used outside of AWS. If for some reason you ever think of moving to Google Cloud or (for some reason) Azure, you will have to redo your service orchestration from scratch.


Simplicity


Yes, ECS and its surrounding AWS products are so simple that implementing anything out of the ordinary in your application architecture becomes difficult. Say you need full HTTP/2 support for a service: you can't have it, because ALB does not support Server Push.


Or your application needs to accept traffic at layer 4 (TCP or UDP, doesn't matter): ECS again offers no way to route it to the service, since ALBs speak only HTTP/HTTPS, while the old ELB doesn't work with ECS services and sometimes mangles traffic in general (as we discovered with gRPC, for instance).


Retrospective




  1. Summing up all the advantages of orchestration listed at the beginning of the article, we can confidently say that all of them hold true. iFunny now has:


    • simple and painless deployment;
    • less self-written code and configuration in Ansible;
    • managing application units instead of managing hosts;
    • launching services in production from scratch in 20-30 minutes, directly by developers.
      But the question of resource utilization still remains open.

  2. The final step of fully migrating our applications to ECS was moving the main API. It went quickly, smoothly and without downtime, yet the question remained whether orchestrators make sense for large monolithic applications at all. A single unit of the application often needs a dedicated host, reliable deployment requires keeping headroom in the form of several unused machines, and configuration management is still present in one form or another. Of course, ECS solved many other issues in a positive way, but the fact remains: with monoliths, orchestration doesn't buy you much.


  3. At the current scale the picture is: 4 clusters (one of them a test environment), 36 services in production, around 210-230 running containers at peak, plus 80 tasks running on a schedule. Time has shown that orchestration scales much faster and more easily. But if you have a fairly small number of services and running containers, you should think twice about whether you need orchestration at all.


  4. As luck would have it, after all these battles AWS started rolling out its own managed Kubernetes service, EKS. It is at a very early stage and there are no production reviews yet, but everyone understands that you can now get the most popular orchestration platform in AWS with a couple of clicks, while keeping access to most of its knobs. If we were choosing an orchestrator today, Kubernetes would take priority thanks to its flexibility, rich functionality and the project's rapid development.

AWS has also introduced the ECS Fargate service, which launches containers without your having to run EC2 instances at all. At iFunny we have already tried it on a couple of test services; so far it is too early to draw conclusions about its capabilities.


P.S. The article came out rather long, and even so it doesn't cover all of our ECS stories. Ask any questions on the topic in the comments, or share your own successful adoption stories.



Source: https://habr.com/ru/post/348676/

