
Nightmare "Knight": an instructive story about DevOps



The year is 1066, and more than 200 years have passed since the Viking invasions of England began. King Harold, having gathered a force of knights, marched to the River Derwent for a decisive battle with the army of his namesake, the Norwegian king Harald. Armorers had worked for a whole month to forge enough new-generation armor capable of protecting a knight from a Scandinavian axe. And how many experiments and tournament trials had come before that! But the wait was bound to pay off: light yet reliable equipment would let the knights scatter the Viking hird even on foot, without heavy losses. At last the armies met at Stamford Bridge. The main body of knights, led by a commander in shining armor, clashed with the enemy in the middle of the bridge. Yes, the steel of the Piedmont masters holds up under the blows!

Slowly but surely, the Vikings fall back into a defensive circle. Victory seems near. And on the battlefield, at last, the knight commander and the Norwegian jarl find each other.

The jarl's two-handed axe is already broken, and he is forced to defend himself with an ordinary seax, which is no match for a knight's bastard (hand-and-a-half) sword. A swing of the dagger: against the commander's armor it is no more than a penknife, yet it pierces the armor and... as if by magic, the armor of the neighboring knights on the right flank falls apart! Another swing, again on target, and now the knights on the left flank suddenly find themselves in red-hot armor. A third blow: the knights' vision blurs, they stumble, fall, and never rise again.

"*** ** **** ****!" cried Petrovich, waking up in a cold sweat at 5 am on Monday. Everything had gone sideways: at 11 pm he had opened Wikipedia to look for material for his child's school report on the nature of the tundra, but by one o'clock he had somehow ended up reading about the Viking invasion of England. On top of that, over the weekend the team had rolled out the next release, which was due to go live today. As always, it had been tested long and thoroughly and run through the continuous integration systems a million times, but would everything go smoothly?

Fortunately for some, and unfortunately for the victims, there is always a chance to learn from the mistakes of others, improve something, and gain a little extra confidence. With this translated article we want to recall, once again and in more detail, one such case from “our” industry.



Last year at a conference I talked about DevOps, configuration as code, and continuous delivery. I used the story below to explain why fully automated, repeatable deployments matter as part of a DevOps / continuous delivery initiative. After the conference, several people asked me to share the story on my blog. It is an absolutely true story. I am retelling what I have read; I did not take part in it myself.

So, here is the story of how a company with almost $400 million in assets went bankrupt in 45 minutes because of a failed deployment.

A bit of background


Knight Capital Group ("Knight", as in the armored knights above) was an American global financial services company engaged in market making, electronic execution, and institutional sales and trading. In 2012, Knight was the largest trader in US equities, with a market share of about 17% on both the NYSE and NASDAQ. Knight's Electronic Trading Group (ETG) handled an average daily volume of more than 3.3 billion trades, worth more than $21 billion... a day. That is no joke!

On July 31, 2012, Knight had about $365 million in cash and cash equivalents.

On August 1, 2012, the NYSE planned to launch its new Retail Liquidity Program (a program designed to provide better pricing for retail investors through retail brokers such as Knight). In preparation for this event, Knight updated SMARS, its automated, high-speed, algorithmic router that sends orders to the market for execution. One of the main functions of SMARS is to receive orders from other components of Knight's trading platform (“parent” orders) and then send one or more “child” orders out for execution. In other words, SMARS receives large orders from the trading platform and breaks them into smaller ones in order to find a buyer or seller for the shares. The larger the parent order, the more child orders are created.
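To make the parent/child relationship concrete, here is a minimal sketch in Python. The function name `split_parent_order` and the fixed child size are assumptions made purely for illustration; the source does not describe how SMARS actually sized or routed its child orders.

```python
def split_parent_order(parent_qty: int, child_size: int) -> list[int]:
    """Split a parent order into child order quantities (illustrative only)."""
    if parent_qty < 0 or child_size <= 0:
        raise ValueError("parent_qty must be >= 0 and child_size must be > 0")
    full, remainder = divmod(parent_qty, child_size)
    children = [child_size] * full
    if remainder:
        # The last child carries whatever is left of the parent quantity.
        children.append(remainder)
    return children


children = split_parent_order(1_000, 300)
print(children)  # [300, 300, 300, 100]
```

The point to notice is that the child quantities always add up to the parent quantity. That invariant is exactly what was lost later in this story.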

The SMARS update was meant to replace old, unused code known as “Power Peg”, functionality that Knight had not used for 8 years (why code that had been dead for so long was still in the codebase is a mystery, but that is not the point). The updated code repurposed an old flag that had been used to activate the Power Peg functionality. The code was thoroughly tested and worked correctly and reliably. What could go wrong?

What could go wrong? Indeed!


Between July 27 and July 31, 2012, Knight manually deployed the new software to a limited number of servers per day, eight (8) servers in all. Here is what the SEC filing says about the manual deployment (the SEC is the Securities and Exchange Commission, the US securities market regulator):

“During the deployment of the new code, one of Knight's technicians did not copy the new code to one of the eight SMARS servers. Knight did not have a second technician review this deployment, so no one noticed that the Power Peg code had not been removed from the eighth server and that the new RLP code had not been added. Knight had no written procedures requiring such a review.” Release number 70694, October 16, 2013

On August 1, 2012, at 9:30 am ET, the markets opened and Knight began processing orders from broker-dealers on behalf of their customers under the new Retail Liquidity Program. The seven (7) servers that had been deployed correctly processed these orders correctly. Orders sent to the eighth server, however, triggered the repurposed flag and resurrected the old Power Peg code.
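The danger of reusing a flag across a partial rollout can be sketched as follows. Everything here is hypothetical (the handler names, the `flag` field, the `"XYZ"` symbol); the only thing taken from the story is the shape of the failure: the same flag means one thing to the new code and something else to the old code.

```python
def handle_order_new(order: dict) -> str:
    # New release (assumed behavior): the flag routes the order to RLP logic.
    return "RLP" if order.get("flag") else "NORMAL"


def handle_order_old(order: dict) -> str:
    # Old release (assumed behavior): the very same flag activates Power Peg.
    return "POWER_PEG" if order.get("flag") else "NORMAL"


# Seven servers received the new code; the eighth still runs the old one.
servers = [handle_order_new] * 7 + [handle_order_old]

order = {"symbol": "XYZ", "flag": True}  # hypothetical order
results = [handler(order) for handler in servers]
print(results)
```

Each handler is correct when considered alone. The fault only appears when both versions are live at once and receive the same input.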

Zombie attack: killer code


To understand what happened, it helps to explain what the “dead” Power Peg code was for. This functionality was meant to count the shares bought or sold against a parent order as its child orders were executed. Once the parent order was filled, Power Peg would instruct the system to stop routing child orders. In essence, Power Peg kept track of the child orders and stopped them once the parent order had been completed. In 2005, Knight moved this cumulative tracking to an earlier stage of code execution (thereby removing the quantity tracking from Power Peg).

When the Power Peg flag was triggered on the eighth server, Power Peg began routing child orders for execution but did not check them against the number of shares in the parent order, which produced something like an endless loop.
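A rough simulation shows the difference between routing with and without cumulative tracking. This is a sketch under stated assumptions, not a reconstruction of Knight's code: `route_children`, the assumption that every child fills completely, and the `max_children` safety cap (needed so the demo terminates) are all invented for illustration.

```python
def route_children(parent_qty: int, child_size: int, track_fills: bool,
                   max_children: int = 1_000) -> int:
    """Return the total number of shares sent to the market."""
    filled = 0
    sent = 0
    for _ in range(max_children):  # demo-only cap; the real loop had none
        if track_fills and filled >= parent_qty:
            break  # parent order is complete, stop routing children
        qty = min(child_size, parent_qty - filled) if track_fills else child_size
        sent += qty
        filled += qty  # assumption: every child order fills completely
    return sent


print(route_children(1_000, 300, track_fills=True))   # 1000
print(route_children(1_000, 300, track_fills=False))  # 300000
```

With tracking, the shares sent equal the parent quantity. Without it, the only thing that stops the loop is the artificial cap, which is the software analogue of someone finally pulling the plug.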

Hellish 45 minutes


Imagine a system capable of sending automated, high-speed orders to the market with no tracking at all of whether enough orders have been executed. Yes, it was that bad.

When the market opened at 9:30 am, people quickly realized that something was wrong. By 9:31 am it was obvious to many on Wall Street that something serious was happening. The market was flooded with orders, and trading volume in certain stocks was far outside the norm. By 9:32 am many on Wall Street were wondering why it had not stopped. That is an eternity in high-speed trading. Why hadn't someone hit the “kill” switch on whatever system was doing this? As it turned out, there was no kill switch. During the first 45 minutes of trading, Knight's executions accounted for more than 50% of trading volume, driving certain stocks up by more than 10% of their value. As a result, other stocks fell in price in response to the erroneous trades.
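For readers wondering what a kill switch might look like in code, here is a deliberately simple sketch. The `KillSwitch` class, its order-count limit, and the manual `trip` method are assumptions for illustration only; real trading controls are far more involved, and the source does not describe any such design.

```python
class KillSwitch:
    """Block all further orders once a limit is exceeded or an operator trips it."""

    def __init__(self, max_orders: int) -> None:
        self.max_orders = max_orders
        self.count = 0
        self.tripped = False

    def trip(self) -> None:
        # Manual stop: an operator can halt order flow at any time.
        self.tripped = True

    def allow(self) -> bool:
        # Automatic stop: once the limit is exceeded, stay tripped.
        if self.tripped:
            return False
        self.count += 1
        if self.count > self.max_orders:
            self.tripped = True
            return False
        return True


switch = KillSwitch(max_orders=3)
decisions = [switch.allow() for _ in range(5)]
print(decisions)  # [True, True, True, False, False]
```

The design choice worth noting is that the switch latches: once tripped, it stays tripped until a human resets it, so a runaway process cannot talk its way back into the market.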

To make matters worse, Knight's system had begun sending automated email messages even earlier, at 8:01 am (when SMARS processed orders eligible for pre-market trading). The messages referenced SMARS and identified an error described as “Power Peg disabled.” Between 8:01 am and 9:30 am, 97 of these emails were sent to Knight employees. Of course, the emails were not designed as system alerts, so no one looked at them right away. Oops.
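One way to keep repeated error messages from being lost in an inbox is to count them and escalate anything that repeats. The sketch below is an assumption-laden illustration: the function `errors_to_escalate`, the threshold of 5, and the message texts are invented, with only the count of 97 taken from the story.

```python
from collections import Counter


def errors_to_escalate(messages: list[str], threshold: int) -> list[str]:
    """Return message texts that occur at least `threshold` times."""
    counts = Counter(messages)
    return sorted(text for text, n in counts.items() if n >= threshold)


# Hypothetical inbox: 97 identical error emails plus one routine message.
inbox = ["SMARS: Power Peg error"] * 97 + ["Daily report ready"]
escalate = errors_to_escalate(inbox, threshold=5)
print(escalate)  # ['SMARS: Power Peg error']
```

In practice the escalation would page a person rather than print a list, but the principle is the same: a message that repeats dozens of times before the market opens should not rely on someone happening to read it.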

During those hellish 45 minutes, Knight tried to stop the erroneous trades. There was no way to switch the system off (and no documented procedures for responding to such a situation), so the staff had to diagnose the problem in a live trading environment in which 8 million shares were being traded every minute. Because they could not determine where the erroneous orders were coming from, they uninstalled the new code from the servers where it had been deployed correctly. In other words, they removed the working code and left the broken code in place. This only made things worse, causing additional parent orders to activate the Power Peg code on all servers, not just the one that had been deployed incorrectly. In the end they managed to stop the system, after 45 minutes of trading.

While trading was under way, the Power Peg code received and processed 212 parent orders. As a result, SMARS sent millions of child orders into the market, resulting in 4 million executions in 154 stocks for more than 397 million shares. For stock market experts, this meant that Knight had bought $3.5 billion worth of shares in 80 stocks and sold $3.15 billion worth of shares in 74 stocks. In layman's terms, Knight Capital Group lost $460 million in 45 minutes. But Knight had only $365 million in cash and cash equivalents. In 45 minutes, Knight went from being the largest US equities trader and a major market maker on the NYSE and NASDAQ to being bankrupt. The company had 48 hours to raise the capital needed to cover its losses (which it managed to do with a $400 million investment from about half a dozen investors). Ultimately, Knight Capital Group was acquired by Getco LLC (in December 2012), and the combined company is now called KCG Holdings.
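The scale of these figures is easier to grasp with a little arithmetic. The snippet below only uses the numbers quoted above; the variable names are ours, and the per-minute figure is a simple average, not a claim about how the loss actually accrued.

```python
loss = 460_000_000            # reported loss, in dollars
cash = 365_000_000            # cash and cash equivalents on July 31, 2012
minutes = 45

shortfall = loss - cash                 # loss not covered by cash on hand
loss_per_minute = loss / minutes        # average only
gross_position = 3_500_000_000 + 3_150_000_000  # bought plus sold

print(f"Shortfall: ${shortfall:,}")                      # $95,000,000
print(f"Average loss per minute: ${loss_per_minute:,.0f}")
print(f"Gross unwanted position: ${gross_position:,}")   # $6,650,000,000
```

In other words, the loss exceeded the company's cash by $95 million, at an average rate of roughly $10 million per minute.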

What conclusions should be drawn


The events of August 1, 2012, should be a lesson for all development and project teams. It is not enough to build great software and test it; you also have to make sure it is delivered to market properly, so that your customers receive exactly the value you intended (and so that you do not bankrupt your company). The engineer(s) who deployed SMARS are not solely to blame: the process Knight had in place did not account for the risk the company was exposed to. The process (or lack of one) was clearly flawed. Any time your deployment process depends on people reading and following instructions, you are exposing yourself to risk. People make mistakes. The mistakes can be in the instructions, in the interpretation of the instructions, or in their execution.

Deployment should be automated, repeatable, and as free of human error as possible. If Knight had implemented an automated deployment system, complete with configuration, deployment, and test automation, the error that turned into Knight's nightmare could have been avoided.
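One small piece of such automation is a post-deployment check that every server runs the same build. The sketch below compares SHA-256 digests of the deployed artifact against the expected one. The server names, the byte strings standing in for release artifacts, and the helper functions are all hypothetical; a real pipeline would read the files from each host.

```python
import hashlib


def artifact_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a release artifact."""
    return hashlib.sha256(data).hexdigest()


def find_mismatched_servers(expected: str, deployed: dict[str, bytes]) -> list[str]:
    """List servers whose deployed artifact does not match the expected digest."""
    return sorted(
        name for name, data in deployed.items()
        if artifact_digest(data) != expected
    )


# Stand-ins for the real binaries (hypothetical content).
new_release = b"smars-with-rlp"
old_release = b"smars-with-power-peg"

deployed = {f"server-{i}": new_release for i in range(1, 8)}
deployed["server-8"] = old_release  # the server that was missed

mismatched = find_mismatched_servers(artifact_digest(new_release), deployed)
if mismatched:
    print("Deployment incomplete, do not go live:", mismatched)
```

A check like this takes seconds to run and does not depend on anyone remembering how many servers there are, which is the whole point of taking the human out of the loop.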

Here are a couple of principles of continuous delivery (even if you do not implement the full process of continuous delivery):

Source: https://habr.com/ru/post/458304/

