
This article discusses the different types of testing in production, the conditions under which each of them is most useful, and how to organize safe testing of various services in production.
It is worth noting that the content of this article applies only to services whose deployment is controlled by their developers. I should also warn up front that applying any of the types of testing described here is a difficult undertaking that often requires substantial changes to how systems are designed, developed, and tested. And, despite the title of the article, I do not claim that any type of testing in production is absolutely reliable. I only argue that such testing can significantly reduce the level of risk going forward, and that the investment required is reasonable.
(Translator's note: since the original article is quite long, it has been split into two parts for readers' convenience.)
Why do you need testing in production if it can be performed on staging?
The value of a staging cluster (or staging environment) is perceived differently by different people. For many companies, deploying and testing a product on staging is an essential step before its final release.
Many well-known organizations treat staging as a miniature copy of the production environment. In such cases, the two have to be kept as closely in sync as possible. This usually means running separate instances of stateful systems, such as databases, and regularly synchronizing data from production to staging. The only exception is confidential information that can identify a user (this is necessary to comply with GDPR, PCI, HIPAA, and other regulations).
The problem with this approach (in my experience) is that the differences go beyond which database instance holds the actual production data. The difference often extends to the following aspects:
- The size of the staging cluster (if it can even be called a “cluster”: sometimes it is just one server posing as a cluster);
- The fact that staging usually runs on a much smaller cluster also means that the configuration parameters of each service will differ. This applies to load balancers, databases, and queues: for example, the number of open file descriptors, the number of open database connections, the size of the thread pool, and so on. If the configuration is stored in a database or a key-value store (for example, Zookeeper or Consul), those supporting systems must also be present in the staging environment;
- The number of live connections handled by a stateless service, or the way a proxy server reuses TCP connections (if it does so at all);
- The lack of monitoring on staging. And even if staging is monitored, some signals may be completely misleading, since what is being monitored is not the production environment. For example, even if you monitor MySQL query latency or response time, it is hard to tell whether new code introduces a query that triggers a full table scan in MySQL, because a full scan of the small table used in the test database is much faster (and sometimes even preferable), whereas against the production database the same query may have a completely different performance profile.
To be fair, one could argue that the discrepancies listed above are not arguments against staging as such, but rather antipatterns to be avoided. At the same time, doing everything “right” often demands enormous engineering effort to keep the environments in sync. Production is constantly changing and subject to many influences, so chasing such parity is chasing a moving target.
Moreover, even if staging matches the production environment as closely as possible, there are other kinds of testing that are better performed against real production traffic. A good example is soak testing, which verifies the reliability and stability of a service over an extended period of time under real levels of concurrency and load. It is used to detect memory leaks, GC pause durations, CPU utilization, and other indicators over a period of time.
None of the above implies that staging is absolutely useless (this will become apparent in the section on shadow data duplication for testing services). It only means that staging is often relied upon more heavily than necessary, and in many organizations it remains the only kind of testing performed before a full release.
The Art of Testing in Production
Historically, the notion of “testing in production” has carried certain stereotypes and negative connotations (cowboy programming, little or no unit and integration testing, recklessness or lack of care for the end-user experience).
Testing in production certainly deserves that reputation if it is done carelessly and poorly. It in no way replaces pre-production testing, and it is by no means a simple task. Moreover, I argue that successful and safe testing in production requires a significant level of automation, a solid understanding of established practices, and systems designed from the outset with this kind of testing in mind.
To organize a holistic and safe process for effectively testing services in production, it is important not to treat it as one generic catch-all for a collection of tools and techniques. I made this mistake myself: in my previous post I offered a not particularly rigorous classification of testing methods, and in the “Testing in Production” section a variety of methodologies and tools were lumped together.
From the post Testing Microservices, the sane way.
Since that post was published at the end of December 2017, I have discussed its content, and the topic of testing in production more generally, with several people.
During these discussions, as well as after a number of separate conversations, it became clear to me that the topic of testing in production cannot be reduced to the few points listed above.
The concept of “testing in production” covers a whole range of techniques applied at three different stages. Let's work out which ones.

Three stages of production
Usually, discussions about production happen only in the context of deploying code to production, monitoring, or incident response when something goes wrong.
Until now I have myself used terms such as “deployment,” “release,” and “delivery” as synonyms without thinking much about their meaning. A few months ago I would have dismissed any attempt to distinguish between them as unimportant.
Having thought it over, I have come to believe that there is a real need to distinguish between the different stages of production.
Stage 1. Deployment
Testing, even in production, is at best a best-effort verification, and the accuracy of testing (indeed, of any testing) is ensured only if the way the tests are run resembles as closely as possible the way the service is actually used in production.
In other words, tests must be run in an environment that best mimics production.
And the best imitation of the production environment is the production environment itself. To run as many tests as possible in production, it must be the case that the failure of any one of them does not affect end users.
This, in turn, is possible only if deploying a service to production does not give users direct access to that service.
In this article I use the terminology from the post Deploy != Release by Turbine Labs. It defines the term “deployment” as follows:
“Deployment is your team's process of installing the new version of the service's code on production infrastructure. When we say a new version of software is deployed, we mean it is running somewhere within the production infrastructure. That could be a new EC2 instance on AWS or a Docker container running in a pod in a Kubernetes cluster. The service has started successfully, passed its health checks, and is ready (you hope!) to handle production traffic, but it may not actually be receiving any. This is an important point, so I will repeat it: deployment does not require that users be given access to the new version of your service. Given this definition, deployment can be called an almost zero-risk process.”
The words “zero-risk process” are balm for the souls of the many people who have suffered through failed deployments. The ability to install software in the real environment without exposing it to users has several advantages when it comes to testing.
First, it minimizes (and may eliminate altogether) the need to maintain separate development, test, and staging environments that inevitably have to be kept in sync with production.
In addition, services must then be designed to be isolated from one another in such a way that a failed test of a particular service instance in production does not cascade or affect users of other services. One way to achieve this is to design the data model and database schema so that non-idempotent requests (mainly write operations) can:
- Be run against the production database for any service under test launched in production (this is the approach I prefer);
- Be safely discarded at the application level before they reach the persistence layer;
- Be distinguished or isolated at the persistence layer in some way (for example, by storing additional metadata), as in the sketch below.
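As a rough illustration of the third option, here is a minimal sketch, with made-up table and column names, of how writes produced by test traffic might be tagged at the persistence layer and purged later. It is an assumption about how such a schema could look, not a prescription from the original article.

```python
# Hypothetical sketch: isolate non-idempotent test writes at the persistence
# layer by tagging rows with metadata. Table and column names are made up.
import sqlite3

def save_signup(conn, email, *, is_test_request=False):
    """Persist a signup; rows created by test traffic carry a marker column."""
    conn.execute(
        "INSERT INTO signups (email, created_by_test) VALUES (?, ?)",
        (email, 1 if is_test_request else 0),
    )
    conn.commit()

def cleanup_test_rows(conn):
    """Periodically purge rows produced by in-production test runs."""
    conn.execute("DELETE FROM signups WHERE created_by_test = 1")
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE signups (email TEXT, created_by_test INTEGER DEFAULT 0)"
    )
    save_signup(conn, "real@example.com")
    save_signup(conn, "synthetic@test.example", is_test_request=True)
    cleanup_test_rows(conn)
```

The same marker column also lets reporting queries exclude test rows, which is what makes running tests against the production database tolerable in the first place.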
Stage 2. Release
Deploy != Release gives the term “release” the following definition:
“When we say a version of a service is released, we mean that it is serving production traffic. In other words, a release is the process of directing production traffic to the new version of the software. Given this definition, all the risks we associate with shipping new code (outages, unhappy customers, snarky write-ups in The Register) relate to the release of new software rather than to its deployment.” (In some companies this stage is also called rollout; in this article we will use the term release.)
Google's SRE book uses the term “rollout” in the chapter on release engineering:
“A rollout is a logical unit of work consisting of one or more individual tasks. Our goal is to fit the deployment process to the risk profile of a given service. In development or pre-production environments, we may build hourly and push releases automatically once all tests pass. For large user-facing services, we may push by starting with one cluster and expanding until all clusters are updated. For sensitive pieces of infrastructure, we may stretch the rollout over several days, performing it in turn across different geographic regions.”
In this terminology, “rollout” refers to what is most commonly understood as “deployment,” while the terms often used to describe various deployment strategies (for example, blue-green deployment or canary deployment) really pertain to the release of new software.
Moreover, a failed release can cause partial or major outages. It is also at this stage that a rollback or a hotfix is performed if the newly released version of the service turns out to be unstable.
The release process works best when it is automated and incremental. Likewise, rollbacks and hotfixes work best when error rates and request rates are automatically correlated against baseline metrics.
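To make the idea of correlating against a baseline concrete, here is a minimal, hypothetical sketch of a canary check: the error rate of the newly released instances is compared with that of the baseline (currently released) version, and the result drives an automated rollback decision. The metric values, thresholds, and the way metrics are fetched are all assumptions.

```python
# Hypothetical sketch: decide whether to roll back a canary by comparing its
# error rate with the baseline version. Thresholds are illustrative.

def error_rate(errors: int, requests: int) -> float:
    return errors / requests if requests else 0.0

def should_roll_back(canary: dict, baseline: dict,
                     tolerance: float = 0.005, min_requests: int = 500) -> bool:
    """Roll back if the canary's error rate exceeds the baseline's by more
    than `tolerance`, once enough traffic has been observed."""
    if canary["requests"] < min_requests:
        return False  # not enough data yet to judge
    delta = error_rate(canary["errors"], canary["requests"]) - \
            error_rate(baseline["errors"], baseline["requests"])
    return delta > tolerance

# Example: a 2.4% canary error rate vs. a 0.3% baseline triggers a rollback.
print(should_roll_back({"errors": 24, "requests": 1000},
                       {"errors": 3, "requests": 1000}))
```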
Stage 3. After release
If the release went smoothly and the new version of the service is handling production traffic without obvious problems, we can consider it successful. A successful release is followed by a stage that can be called “post-release.”
Any sufficiently complex system will always exist in a state of gradual degradation. This does not mean that a rollback or a hotfix is necessarily required. Instead, such degradation needs to be monitored (for various operational and business purposes) and debugged when necessary. For this reason, testing after the release resembles debugging or analytics gathering more than conventional testing.
In general, I believe every component of a system should be built with the understanding that no large system ever works 100% flawlessly, and that failures must be acknowledged and accounted for during the design, development, testing, deployment, and monitoring stages of the software.
Now that we have defined the three stages of production, let's look at the different testing mechanisms available in each of them. Not everyone has the opportunity to work on new projects or rewrite code from scratch. In this article, I tried to clearly identify the techniques that would best show themselves in the development of new projects, and also tell about what else we can do to take advantage of the proposed methods, without making significant changes to existing projects.
Testing in production at the deployment stage
We have separated the deployment and release stages from each other, and now we will consider some types of testing that can be applied after deploying the code in the production environment.
Integration testing
Typically, integration testing is performed by a continuous integration server in an isolated test environment for every Git branch. A copy of the entire service topology (including databases, queues, proxy servers, and so on) is spun up so that the test suites of all services can run against each other.
I believe this is not particularly effective, for several reasons. First of all, a test environment, just like staging, cannot be made identical to the real production environment, even if the tests run in the same Docker container that will be used in production. This is especially true when the only thing running in the test environment is the tests themselves.
Regardless of whether a test runs as a Docker container or a POSIX process, it most likely opens one or a handful of connections to an upstream service, database, or cache, whereas in production the same service handles many concurrent connections, often reusing idle TCP connections (so-called HTTP connection reuse).
Another problem is that most tests create a fresh database table or cache keyspace on every run, on the same node where the test executes (which isolates the tests from network failures). At best, this kind of testing can show that the system works correctly for one very specific request. It is rarely effective at simulating serious, well-understood failure modes, let alone the various kinds of partial failure.
There is extensive research confirming that distributed systems often exhibit unpredictable behavior that cannot be foreseen by analyzing anything less than the entire system.
But this does not mean that integration testing is fundamentally useless. It only means that running integration tests in an artificial, fully isolated environment is, as a rule, pointless. Integration testing is still needed to verify that a new version of a service:
- Does not break its interaction with upstream or downstream services;
- Does not adversely affect the goals and objectives of upstream or downstream services.
The first can be ensured to some extent with contract testing. Because it verifies nothing more than the proper operation of the interfaces between services, contract testing is an effective way to develop and test individual services at the pre-production stage without deploying the entire service topology.
Consumer-driven contract testing platforms such as Pact currently support only RESTful JSON RPC interaction between services, although work appears to be underway to support asynchronous interaction via WebSockets, serverless applications, and message queues. Support for the gRPC and GraphQL protocols will probably be added in the future, but it is not available yet.
However, before releasing a new version it may be necessary to verify more than just the correctness of the interfaces. For example, we may want to make sure that the duration of an RPC call between two services stays within an acceptable limit when the interface between them changes, or that the cache hit ratio remains steady when, say, an extra parameter is added to incoming requests.
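A very small smoke check along these lines might look like the sketch below: it sends a batch of marked test requests to the deployed (not yet released) instance and asserts that latency stays within an assumed budget. The URL, header name, and budget are all placeholders for illustration.

```python
# Hypothetical smoke check against a *deployed* (not yet released) instance:
# verify that call latency stays within an assumed budget.
import time
import statistics
import requests

DEPLOYED_URL = "http://service-c-deployed.internal:8080/api/profile/42"
LATENCY_BUDGET_MS = 150        # assumed p95 budget for this call
SAMPLES = 200

def measure_latencies(url: str, n: int) -> list[float]:
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        resp = requests.get(url, headers={"X-Test-Request": "1"}, timeout=2)
        resp.raise_for_status()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

if __name__ == "__main__":
    latencies = measure_latencies(DEPLOYED_URL, SAMPLES)
    p95 = statistics.quantiles(latencies, n=20)[-1]  # rough p95 estimate
    assert p95 <= LATENCY_BUDGET_MS, f"p95 {p95:.1f} ms exceeds budget"
```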
It turns out, then, that integration testing is not optional: its goal is to make sure that the change under test does not cause the severe, widespread kinds of system failure (usually the ones for which alerts are configured).
This raises the question: how can integration testing be conducted safely in production?
To answer it, consider the following example. The figure below shows an architecture I worked with a couple of years ago: our mobile and web clients connected to a web server (service C) backed by MySQL (service D) and fronted by a memcache cluster (service B).
Even though this is a fairly traditional architecture (you would hardly call it microservices), the combination of stateful and stateless services makes this system a good example for this article.
Separating release from deployment means we can safely deploy a new instance of a service into the production environment.
Modern service discovery tools allow services with the same name to carry tags (labels) that distinguish the released and the deployed versions of a service with the same name. Thanks to this, clients connect only to the released version of the service they need.
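As an illustration of this pattern, here is a hypothetical sketch using the python-consul client: two instances of the same service are registered under different tags, so ordinary clients resolve only the "released" instances while the test harness explicitly targets the "deployed" ones. Service names, ports, and tag values are assumptions, not anything prescribed by the article.

```python
# Illustrative sketch with python-consul: distinguish "released" and "deployed"
# versions of the same service via service-discovery tags. Names are made up.
import consul

c = consul.Consul()

# The version currently serving users.
c.agent.service.register(
    "serviceC", service_id="serviceC-v41", port=8080, tags=["released"])

# The freshly deployed version, not yet receiving user traffic.
c.agent.service.register(
    "serviceC", service_id="serviceC-v42", port=8081, tags=["deployed"])

# Ordinary clients ask only for instances tagged "released".
_, released = c.health.service("serviceC", tag="released", passing=True)
# The test harness explicitly targets the "deployed" instances.
_, deployed = c.health.service("serviceC", tag="deployed", passing=True)
```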
Suppose we are deploying a new version of service C to production.
To verify that the deployed version works correctly, we must be able to exercise it and make sure that none of its contracts is violated. The main advantage of loosely coupled services is that they allow teams to develop, deploy, and scale them independently. The same benefit extends to testing, which, somewhat paradoxically, applies even to integration testing.
Google's testing blog has an article called “Just Say No to More End-to-End Tests,” which describes integration tests as follows:
“An integration test takes a small group of units, often two, and tests their behavior as a whole, verifying that they work together coherently. If two units do not integrate properly, why write an end-to-end test when you can write a much smaller, more focused integration test that will detect the same bug? While you do need to think larger, you only need to think a little larger to verify that the units work together.”
Integration testing in production should follow the same philosophy: it makes sense, and is clearly useful, only for thorough testing of small groups of modules together. With proper design, all upstream dependencies should be sufficiently isolated from the service under test so that a malformed request from service A cannot cause a cascading failure across the architecture.
For our example, this means testing the deployed version of service C and its interaction with MySQL, as shown in the figure below.
Testing read operations should in most cases be straightforward (unless the reads performed by the service under test populate the cache and thereby “poison” it with data later served to the released services). Testing the interaction of the deployed code with MySQL becomes harder when non-idempotent requests that can modify data are involved.
My preference is to run integration tests against the production database. In the past I have maintained a whitelist of clients allowed to send requests to the service under test. Some teams maintain a dedicated set of test user accounts for running tests against the production system, so that any resulting data changes are confined to a small, well-known subset.
But if it is absolutely essential that production data not change under any circumstances during the test, then write/modify operations must either:
- Be rejected at the application level of service C, or be written to a separate table/collection in the database;
- Be written to the database as new records marked as having been created during a test.
In the second case, if test writes need to be distinguished at the database level, the schema must be designed for this kind of testing in advance (for example, by adding an extra field).
In the first case, write operations can be rejected at the application level if the application can determine that the request should not be processed. That can be done by checking the IP address of the client sending the test request, by the user ID contained in the incoming request, or by looking for a header that clients running in test mode are expected to set.
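Here is a minimal sketch of that idea, assuming a small Flask service; the header name, test user IDs, and test-harness IP are placeholders invented for the example.

```python
# Hypothetical sketch: reject or divert writes at the application level when
# the request is recognized as a test request (header, user ID, or source IP).
from flask import Flask, request, jsonify

app = Flask(__name__)
TEST_USER_IDS = {"test-0001", "test-0002"}   # assumed allow-list of test accounts
TEST_CLIENT_IPS = {"10.0.42.7"}              # assumed test-harness host

def is_test_request() -> bool:
    return (
        request.headers.get("X-Test-Request") == "1"
        or request.headers.get("X-User-Id") in TEST_USER_IDS
        or request.remote_addr in TEST_CLIENT_IPS
    )

@app.route("/signup", methods=["POST"])
def signup():
    payload = request.get_json(force=True)
    if is_test_request():
        # Short-circuit before the persistence layer: validate and respond,
        # but never touch the production table.
        return jsonify({"status": "ok", "test": True}), 201
    # ... the real write to the production database would go here ...
    return jsonify({"status": "ok"}), 201
```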
What I am proposing may sound like mocking or stubbing at the level of an entire service, and that is not far from the truth. This approach comes with a fair share of problems.
Facebook's Kraken paper states the following:
“An alternative design is to use shadow traffic, where an incoming request is logged and replayed in a test environment. For a web server, most operations have side effects that propagate deep into the system. Shadow tests must not trigger these side effects, since that would lead to user-visible changes. Stubbing out the side effects for shadow testing is not only impractical given how frequently server logic changes, it also reduces the fidelity of the test, since dependencies that would otherwise be exercised are not loaded.”
Although new projects can be designed so that side effects are minimized, prevented, or even eliminated entirely, retrofitting stubs onto an existing infrastructure can create more problems than it solves.
A service mesh can help here to some extent. In a service mesh architecture, services know nothing about the network topology and listen for connections on the local node; all incoming and outgoing traffic passes through sidecar proxies, which can inspect and modify requests. For example:
If we are testing service B, its outgoing (sidecar) proxy can be configured to add a special X-ServiceB-Test header to every test request. The incoming proxy of the upstream service C can then:
- Detect this header and return a canned (stub) response to service B;
- Inform service C that the request is a test request.
Integration testing of the deployed version of service B against the released version of service C, where write operations never reach the database.
Performing integration testing this way also exercises the interaction of service B with upstream services while they handle normal production traffic, which is probably a much closer simulation of how service B will behave once it is released to production.
It would also help if every service in the architecture supported making real API calls in a test or mock mode, allowing service contracts with downstream services to be exercised without changing real data. That would amount to contract testing, but at the network level.
Shadow data duplication (dark data flow testing or mirroring)
Shadow duplication (in an article from a Google blog, it is called a
dark launch , and
Istio uses the term
mirroring ) in many cases has more advantages than integration testing.
The Principles of Chaos Engineering state the following:
“Systems behave differently depending on environment and traffic patterns. Since the utilization behavior can change at any time, sampling real traffic is the only way to reliably capture the request path.”
Shadow duplication is a technique in which the production traffic arriving at a given service is captured and replayed against the newly deployed version of that service. The process can run either in real time, when the incoming traffic is split and sent to both the released and the deployed versions of the service, or asynchronously, when a copy of previously captured traffic is replayed against the deployed service.
When I worked at imgix (a startup with a staff of 7 engineers, only four of whom were systems engineers), dark traffic was used heavily to test changes to our image rendering infrastructure. We captured a certain percentage of all incoming requests and sent them to a Kafka cluster: HAProxy access logs were fed into a heka pipeline, which in turn pushed the parsed request stream into Kafka. Before the release stage, the new version of our image processing application was tested against the captured dark traffic to make sure it handled requests correctly. Our rendering infrastructure was, however, largely a stateless service, which made it particularly well suited to this kind of testing.
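A rough sketch of asynchronous shadow replay of this general shape is shown below. It is inspired by, but not copied from, the setup described above; the topic name, broker address, message format, and dark-instance URL are all assumptions.

```python
# Hypothetical sketch: consume previously captured requests from a Kafka topic
# and replay them against the deployed (dark) instance of the service.
import json
import requests
from kafka import KafkaConsumer  # kafka-python

DARK_INSTANCE = "http://render-deployed.internal:9000"

consumer = KafkaConsumer(
    "captured-http-requests",
    bootstrap_servers=["kafka-1:9092"],
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:
    req = record.value              # e.g. {"method": "GET", "path": "/img/abc?w=300"}
    if req["method"] != "GET":
        continue                    # replay only idempotent requests
    try:
        requests.get(DARK_INSTANCE + req["path"], timeout=5)
    except requests.RequestException:
        pass                        # the dark instance must never affect users
```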
Some companies prefer not to capture a fraction of the traffic but to direct a full copy of it at the new version of the application.
Facebook's McRouter (a memcached proxy) supports this kind of shadow duplication of memcache traffic:
“When testing a new cache setup, we found it very useful to be able to route a complete copy of the traffic from clients. McRouter supports flexible shadowing configuration. It is possible to shadow traffic to a pool of a different size (re-hashing the key space), to shadow only a fraction of the key space, or to vary shadowing parameters dynamically at runtime.”
The downside of shadowing the entire traffic stream to a deployed service in production is that if this is done at peak traffic, the deployed service may need twice the capacity.
Proxy servers such as Envoy support shadowing traffic to another cluster in fire-and-forget mode. Its documentation says:
“The router is able to shadow traffic from one cluster to another. The current implementation is fire-and-forget: the Envoy proxy does not wait for the shadow cluster to respond before returning the response from the primary cluster. All normal statistics are collected for the shadow cluster, which makes this feature useful for testing. During shadowing, the host/authority header is altered so that -shadow is appended. This is useful for logging. For example, cluster1 becomes cluster1-shadow.”
However, it is often impractical or impossible to build a cluster replica kept in sync with production for testing purposes (for the same reasons that keeping a staging cluster in sync is problematic). If shadow duplication is used to test a newly deployed service that has many dependencies, it can trigger unintended state changes in the services upstream of the one under test. Shadowing a day's worth of user signups into a deployed version of a service that writes to the production database can drive the error rate toward 100%, because the shadow traffic will be treated as duplicate signup attempts and rejected.
My personal experience suggests that shadow duplication is best suited for testing idempotent requests, or for testing stateless services with their stateful backends stubbed out. In such cases shadow duplication is used more to test load, resilience, and configuration. Testing how a service interacts with a stateful backend when non-idempotent requests are involved is then better handled with integration testing or on staging.
Tap comparison
The only mention of this term I have found is in an article on the Twitter blog about launching services with a high standard of quality:
“To verify the correctness of the new implementation of an existing system, we used a technique called tap compare. Our tap compare tool replays a sample of production traffic against the new system and compares the responses to those of the old one. The results helped us find and fix bugs in the system before end users ever encountered them.”
Another article on the Twitter blog defines tap compare as follows:
"Sending requests to service instances in both production and staging environments with validation of results and evaluation of performance characteristics."The difference between tap-comparison and shadow duplication is that in the first case, the answer returned by the
released version is compared with the answer returned by the
deployed version, and in the second, the request is duplicated into the
deployed version in the autonomous mode, like fire-and-forget.
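A bare-bones, Diffy-like sketch of tap comparison is given below: the same sampled request is sent to the released and the deployed versions, and any differences in the normalized responses are reported. The endpoints and the list of fields treated as noise are assumptions for illustration.

```python
# Hypothetical tap-compare sketch: send one request to both versions and
# report differing fields, ignoring known sources of noise.
import requests

RELEASED = "http://serviceC-released.internal:8080"
DEPLOYED = "http://serviceC-deployed.internal:8081"
IGNORED_FIELDS = {"timestamp", "request_id"}   # assumed noisy fields

def normalize(body: dict) -> dict:
    return {k: v for k, v in body.items() if k not in IGNORED_FIELDS}

def tap_compare(path: str) -> list[str]:
    old = normalize(requests.get(RELEASED + path, timeout=2).json())
    new = normalize(requests.get(DEPLOYED + path, timeout=2).json())
    return [
        f"{key}: {old.get(key)!r} != {new.get(key)!r}"
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    ]

if __name__ == "__main__":
    for diff in tap_compare("/api/profile/42"):
        print("regression candidate:", diff)
```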
Another tool in this area is the scientist library from GitHub. It was built for testing Ruby code and has since been ported to several other languages. It is useful for some kinds of testing but has a number of unsolved problems. Here is what an engineer from GitHub wrote in a professional Slack community:
“This tool simply runs two code branches and compares the results. You have to be careful about what those branches do. Care should be taken not to duplicate database writes if that would cause problems. I think this applies not just to scientist but to any situation where you do something twice and then compare the results. Scientist was created to verify that a new permission system behaved the same way as the old one, and at times it was used to compare data touched by virtually every Rails request. I suppose the process takes longer, since the two branches run sequentially, but that is a limitation of Ruby, which does not use threads here.
In most of the cases I know of, scientist was used for read operations rather than writes, for example to check whether new, improved queries and permission schemes returned the same answers as the old ones. Both variants run in the production environment (against replicas). If the resources under test have side effects, I suppose the comparison would have to be done at the application level.”
Diffy is an open source tool written in Scala that Twitter introduced in 2015.
The Twitter blog article called “Testing without Writing Tests” is probably the best resource for understanding how tap compare works in practice.
“Diffy finds potential bugs in a service by running the new and the old version of the code side by side. It behaves as a proxy and multicasts every request it receives to each of the running instances. It then compares the responses and reports any deviations that surface from those comparisons. Diffy rests on the following premise: if two implementations of a service return the same responses for a sufficiently large and diverse set of requests, then the two implementations can be treated as equivalent and the newer one as regression-free. Diffy's novel noise cancellation technique distinguishes it from other comparison-based regression analysis tools.”
Tap compare works well when you need to check whether two versions produce the same results. According to Mark McBride,
“Diffy was used a lot when redesigning systems. In our case we were splitting a Rails code base into several services built in Scala, and a large number of API clients used features in ways we did not expect. Things like date formatting were especially dangerous.”
Tap compare is not the best option for testing user activity, or for verifying that two versions of a service behave identically under maximum load. As with shadow duplication, side effects remain an unsolved problem, especially when both the deployed version and the production version write to the same database. As with integration testing, one way around this is to run tap comparisons only against a limited set of accounts.
Load Testing
For those not familiar with load testing, this article is a good starting point. There is no shortage of open source load testing tools and frameworks. The most popular are Apache Bench, Gatling, wrk2, the Erlang-based Tsung, Siege, and Twitter's Iago, written in Scala (which replays logs from an HTTP server, a proxy server, or a network sniffer against a test instance). Some consider the best load generator to be mzbench, which supports a variety of protocols, including MySQL, Postgres, Cassandra, MongoDB, TCP, and others. Netflix's NDBench is another open source tool for load testing data stores, and it supports most of the well-known protocols.
Iago's official documentation describes in more detail what qualities a good load generator should have:
“Non-blocking requests are generated at a specified rate, based on an internal, configurable statistical distribution (a Poisson process is modeled by default). The request rate can be varied as needed, for example to warm up caches before running at full load.
In general the focus is on the request rate, in accordance with Little's law, rather than on the number of concurrent users, which is allowed to float depending on the latency of the service. This makes it possible to compare results across tests and prevents a degradation of the service from slowing down the load generator.
In other words, Iago strives to model a system in which requests arrive independently of your service's ability to handle them. This is unlike load generators that model closed systems, where users patiently put up with whatever latency the service exhibits. This distinction lets us model quite accurately the failure modes that can be encountered in production.”
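To make the open-loop idea tangible, here is a minimal sketch (not Iago itself) of a load generator that fires requests with exponentially distributed inter-arrival times, i.e. a Poisson arrival process, without waiting for previous requests to finish. The target URL, rate, and duration are placeholders.

```python
# Minimal open-loop load generator sketch: requests arrive at a target mean
# rate regardless of how quickly the service responds.
import random
import threading
import time
import requests

TARGET = "http://service-under-test.internal/health"
RATE_RPS = 50          # desired mean request rate
DURATION_S = 60

def fire():
    try:
        requests.get(TARGET, timeout=5)
    except requests.RequestException:
        pass  # failures show up in the service's own metrics

def open_loop_load():
    deadline = time.time() + DURATION_S
    while time.time() < deadline:
        # Exponential inter-arrival times model a Poisson arrival process;
        # each request is fired on its own thread, never awaited here.
        time.sleep(random.expovariate(RATE_RPS))
        threading.Thread(target=fire, daemon=True).start()

if __name__ == "__main__":
    open_loop_load()
```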
Another kind of load testing is redline stress testing: production traffic is shifted onto a cluster smaller than the one normally serving it, and if problems appear, the traffic is shifted back to the larger cluster. Facebook uses this technique, as described in one of the articles on its engineering blog:
“We deliberately direct extra traffic to individual clusters or nodes, measure the resource consumption on those nodes, and establish the limits of the service's sustainability. This type of testing is particularly useful for determining the CPU resources needed to support the maximum number of simultaneous Facebook Live broadcasts.”
Here is what a former LinkedIn engineer wrote in a professional Slack community:
“LinkedIn also used redline tests in production: servers were removed from the load balancer until load reached certain thresholds or errors began to appear.”
Indeed, a Google search turns up a full technical paper and a post on the LinkedIn engineering blog on this topic:
“Redliner measures capacity using live traffic from the production environment, thereby avoiding the errors that make accurate performance measurement in a lab environment so difficult.
Redliner redirects part of the traffic to the service under test and analyzes its performance in real time. The solution has been adopted by hundreds of internal LinkedIn services and is used daily for various kinds of performance analysis.
Redliner supports running tests in parallel on canary and production instances. This lets engineers direct the same amount of traffic to two different instances of a service: 1) an instance that contains changes such as new configuration, properties, or code; 2) an instance running the current production version. The results of load testing feed into release decisions and prevent deploying code that could degrade performance.”
Facebook has taken load testing with real traffic to a whole new level with its Kraken system, whose description is also well worth reading.
Testing is implemented by shifting traffic: the weights (read from a distributed configuration store) for edge devices and clusters in the configuration of Proxygen (Facebook's load balancer) are changed, and these values determine how much live traffic is sent to each cluster and region at a given point of presence.
Data from the Kraken technical paper.
The monitoring system (Gorilla) reports the performance of the various services (as shown in the table above). Based on the monitoring data and threshold values, a decision is made whether to keep sending traffic according to the current weights or to reduce, or even completely stop, sending traffic to a particular cluster.
Configuration Tests
The new wave of open source infrastructure tooling has made it not just possible but relatively easy to capture every change to the infrastructure as code. It has also become possible to test those changes to varying degrees, although most infrastructure-as-code testing at the pre-production stage can only confirm that specifications and syntax are correct.
At the same time, failing to test new configuration before releasing it has caused a significant number of outages.
For thorough testing of configuration changes, it is important to distinguish between different kinds of configuration. Fred Hebert once suggested the following quadrant:
This scheme is, of course, not universal, but the distinction makes it possible to decide how best to test each kind of configuration and at what stage to do so. Build-time configuration makes sense only if builds are genuinely reproducible. Not all configuration is static, and on modern platforms dynamic configuration changes are inevitable (even when we are dealing with “immutable infrastructure”).
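One of the cheaper defenses, validating configuration before it ships, can look something like the sketch below, assuming configs are JSON documents checked against a schema at build time. The schema, file name, and fields are illustrative, not taken from the article.

```python
# Hypothetical build-time configuration check: validate a JSON config against
# a schema before it is allowed to ship. Schema and fields are made up.
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

SCHEMA = {
    "type": "object",
    "properties": {
        "max_connections": {"type": "integer", "minimum": 1, "maximum": 10000},
        "timeout_ms": {"type": "integer", "minimum": 1},
        "feature_flags": {"type": "object",
                          "additionalProperties": {"type": "boolean"}},
    },
    "required": ["max_connections", "timeout_ms"],
    "additionalProperties": False,
}

def check_config(path: str) -> None:
    with open(path) as fh:
        cfg = json.load(fh)
    try:
        validate(instance=cfg, schema=SCHEMA)
    except ValidationError as err:
        raise SystemExit(f"config rejected before deployment: {err.message}")

if __name__ == "__main__":
    check_config("service_c.json")
```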
Feature flags, dynamic configuration changes, and mechanisms such as blue-green releases all mean that configuration changes reach production frequently, and they deserve the same level of suspicion as code changes. Jamie Wilkinson, an engineer at Google, has made a similar point about configuration changes being a frequent cause of outages precisely because they tend to be rolled out faster, more widely, and with less testing than code.
Facebook's approach to taming configuration errors, as described in its engineering publications, combines several defenses: treating configuration as code that is reviewed and version-controlled rather than edited by hand; validating configuration before deployment (configs expressed as data, such as JSON, are checked by validators, while typed definitions, such as Facebook Thrift structures, catch whole classes of errors even earlier); canarying changes and running A/B tests on a small fraction (on the order of 1%) of users before a wider rollout; and, finally, simple and convenient rollback of changes.
In some cases, despite all preventive measures, a bad configuration still makes it into production. Quickly finding and reverting the change is critical in such situations; Facebook's configuration system provides version-control tooling that makes undoing changes much easier.
To be continued!
UPD: continued here .