
The world is not perfect

The world is not perfect. At any moment something might go wrong. Fortunately, most of us do not launch rockets into space and do not build airplanes.

A modern person depends on the applications on their phone, and our task is to make sure that at any time, under any circumstances, they can open the application and look at pictures of cats.

People are not perfect. We constantly make mistakes. We make typos, we forget things, we give in to laziness. A person can simply fall ill or get hit by a car.

Hardware is not perfect. Hard drives die. Datacenters lose their uplinks. Processors overheat and power grids fail.

Software is not perfect. Memory leaks. Connections drop. Replicas break and data vanishes into oblivion.

Shit happens, as our overseas friends say. What can we do about all this? The answer is almost too simple: nothing. We can test endlessly, spin up a ton of environments, copy production and keep a hundred thousand backup servers, but it still will not save us: the world is not perfect.

The only right decision here is to accept it. We need to accept the world as it is and minimize the losses. Every time you set up a new service, remember: it will break at the most inopportune moment.

It will break. You will surely make a mistake. Hardware will fail. The cluster will fall apart. And according to the laws of this imperfect world, it will happen exactly when you least expect it.

What do most of us do to deceive everyone (including ourselves)? We set up alerts. We write clever metrics, collect logs and create alerts: thousands, hundreds of thousands of alerts. Our mailboxes overflow. Our phones are swamped with SMS and calls. We fill entire floors with people watching charts. And when we lose access to the service once again, the post-mortem begins: what did we forget to do?

All this is only the appearance of reliability. No amount of alerts, metrics, or monitoring will help on its own.

Today they called you and you fixed the service, and no one noticed that something was broken. Tomorrow you go to the mountains. And the day after tomorrow you simply forget. People are not perfect. Fortunately, we are engineers: we live in a non-ideal world and we learn how to conquer it.

So why should you have to wake up at night, or read mail in the morning instead of drinking coffee? Why should a business depend on one person and on that person's stamina? Why? I do not understand.

I only understand that it is impossible to live this way, and I do not want to. The answer is simple: Automate it (yes, with a capital letter). Alerts and night calls are not enough. We need automatic reactions to those alerts. We must be sure that the system can repair itself. The system must be flexible and able to change.

Unfortunately, we do not have a sufficiently intelligent AI yet. Fortunately, all our problems can be formalized.

I don't have a silver bullet, but I have a Proof of Concept for AWS.

AWS Lambda


Serverless - first of all, something that is not running cannot break.
Event-based - it receives an event, processes it, and shuts down.
JVM support - so all the experience of the Java world is available (which also means I can use Clojure).
Third-party - AWS runs it, so there is no need to monitor and maintain AWS Lambda yourself.

The pipeline looks like this:

Event -> SNS Topic -> AWS Lambda -> Reaction

By the way, an SNS topic can have several endpoints. So you can also subscribe a plain e-mail address and receive the same notification there. And we can extend the Lambda function to make notifications much more useful: for example, send alerts together with charts, or add SMS delivery.
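To make the "Event -> SNS Topic -> AWS Lambda" step more concrete, here is a minimal sketch of what the function receives. It assumes the event has already been parsed from JSON into a map; the sample event is hypothetical and heavily trimmed, and `sns-message` is an illustrative helper name, not part of the example repository. The `[:Records 0 :Sns :Message]` path is the same one the handler in this article reads.

```clojure
(require '[clojure.walk :as walk])

;; Hypothetical, trimmed-down SNS event as it looks right after JSON parsing
;; (string keys). Real events carry many more fields.
(def sample-event
  {"Records" [{"EventSource" "aws:sns"
               "Sns" {"Subject" "ALARM: example"
                      "Message" "ELB has unhealthy hosts"}}]})

;; Illustrative helper: turn string keys into keywords,
;; then pull out the message text of the first record.
(defn sns-message [event]
  (get-in (walk/keywordize-keys event) [:Records 0 :Sns :Message]))

(sns-message sample-event)
```

Keeping this extraction in a small pure function makes it easy to try out in a local REPL before anything is deployed.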

A full example of one Lambda function can be found at: github.com/lowl4tency/aws-lambda-example
The Lambda function terminates all nodes in the ELB that are not in the InService state.

Code walkthrough


In this example, we terminate all nodes that are not in the InService state. By the way, the entire Lambda function takes about 50 lines of code in a single file, which makes it easy to maintain and easy to get started with.

Any Clojure project starts with project.clj.

I used the official Java SDK and the excellent Amazonica library, which is a wrapper around this SDK. To avoid dragging in a lot of excess, we exclude the parts of the SDK that we do not need:

    [amazonica "0.3.52" :exclusions [com.amazonaws/aws-java-sdk]]
    [com.amazonaws/aws-java-sdk-core "1.10.62"]
    [com.amazonaws/aws-lambda-java-core "1.1.0"]
    [com.amazonaws/aws-java-sdk-elasticloadbalancing "1.11.26" :exclusions [joda-time]]
    [com.amazonaws/aws-java-sdk-ec2 "1.10.62" :exclusions [joda-time]]
    [com.amazonaws/aws-lambda-java-events "1.1.0"
     :exclusions [com.amazonaws/aws-java-sdk-dynamodb
                  com.amazonaws/aws-java-sdk-kinesis
                  com.amazonaws/aws-java-sdk-cognitoidentity
                  com.amazonaws/aws-java-sdk-sns
                  com.amazonaws/aws-java-sdk-s3]]]

To keep each Lambda function flexible, I use a configuration file in plain edn. To be able to handle events, we need to change the namespace declaration slightly.

    (ns aws-lambda-example.core
      ;; :require clauses are not shown in this excerpt
      (:gen-class
       :implements [com.amazonaws.services.lambda.runtime.RequestStreamHandler]))
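The edn configuration file mentioned above can be tiny. As a sketch, assuming it only needs the `:load-balancer-name` key that the handler in this article reads, it could look like the string below. The ELB name `example-elb` is made up, and the config is read from a string here (rather than from `config.edn` on the classpath) only so that the snippet runs on its own.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical contents of resources/config.edn; the ELB name is invented.
(def config-text "{:load-balancer-name \"example-elb\"}")

;; edn/read-string parses the text into an ordinary Clojure map.
(def config (edn/read-string config-text))

(:load-balancer-name config)
```

Because edn is just Clojure data, adding another setting later means adding another key to the map, with no parsing code to change.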

Entry point. We read the event from the input stream, process it with handle-event, and write the result to the output stream as JSON.

    (defn -handleRequest
      "Parses the JSON input and generates the JSON output."
      [this is os context]
      (let [w (io/writer os)]
        (-> (io/reader is)
            json/read
            walk/keywordize-keys
            handle-event
            (json/write w))
        (.flush w)))

The workhorse:

    (defn handle-event [event]
      (let [instances (get-elb-instances-status
                       (:load-balancer-name
                        (edn/read-string (slurp (io/resource "config.edn")))))
            unhealthy (unhealthy-elb-instances instances)]
        (when (seq unhealthy)
          (pprint "The next instances are unhealthy: ")
          (pprint unhealthy)
          (ec2/terminate-instances :instance-ids unhealthy))
        {:message (get-in event [:Records 0 :Sns :Message])
         :elb-instance-ids (mapv :instance-id instances)}))


We get the list of nodes in the ELB and filter them by status. All nodes in the InService state are removed from the list. The rest are terminated.

Everything we print with pprint ends up in the CloudWatch logs. Since the Lambda function is not running constantly and there is no way to connect a REPL to it, this is handy for debugging.

        {:message (get-in event [:Records 0 :Sns :Message])
         :elb-instance-ids (mapv :instance-id instances)}))

Here, the whole structure that we build and return from this function is written out as JSON, and it is shown as the execution result in the Lambda web console.

In the unhealthy-elb-instances function, we filter the list and keep the instance-id only of those nodes that the ELB considers unhealthy, that is, nodes whose state is not InService.

    (defn unhealthy-elb-instances [instances-status]
      (->> instances-status
           (remove #(= (:state %) "InService"))
           (map :instance-id)))

In the get-elb-instances-status function, we call the API method and get the list of all nodes with their statuses for one specific ELB.

    (defn get-elb-instances-status [elb-name]
      (->> (elb/describe-instance-health :load-balancer-name elb-name)
           :instance-states
           (map get-health-status)))

For convenience, we drop the excess and build a list containing only the information we care about: the instance-id and the state of each instance.

    (defn get-health-status [instance]
      {:instance-id (:instance-id instance)
       :state       (:state instance)})
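Since the Lambda itself cannot be poked from a REPL, it can help to check the pure parts locally. The sketch below repeats the two small functions from this article and runs them over hand-written sample data instead of a real API response. The instance IDs and the extra `:reason-code` key are invented for illustration; only `:instance-id` and `:state` are taken from the code above.

```clojure
;; Hypothetical sample shaped like the :instance-states entries used above.
(def sample-states
  [{:instance-id "i-aaa111" :state "InService"    :reason-code "N/A"}
   {:instance-id "i-bbb222" :state "OutOfService" :reason-code "Instance"}
   {:instance-id "i-ccc333" :state "InService"    :reason-code "N/A"}])

;; Keep only the two fields we care about.
(defn get-health-status [instance]
  {:instance-id (:instance-id instance)
   :state       (:state instance)})

;; Drop everything that is InService and return the remaining instance ids.
(defn unhealthy-elb-instances [instances-status]
  (->> instances-status
       (remove #(= (:state %) "InService"))
       (map :instance-id)))

(def unhealthy
  (unhealthy-elb-instances (map get-health-status sample-states)))
```

With this sample, only the single OutOfService instance remains in `unhealthy`, which is the list the handler would pass on for termination.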

Finally, the unhealthy-elb-instances function shown earlier filters this list, removing the nodes that are in the InService state.

And that's all: 50 lines that let you sleep through the night and calmly go to the mountains.

Deployment


For simplicity, I use a small bash script.

    #!/bin/bash
    # Loader AWS Lambda
    aws lambda create-function --debug \
        --function-name example \
        --handler aws-lambda-example.core \
        --runtime java8 \
        --memory 256 \
        --timeout 59 \
        --role arn:aws:iam::611066707117:role/lambda_exec_role \
        --zip-file fileb://./target/aws-lambda-example-0.1.0-SNAPSHOT-standalone.jar

Configure the alert and attach it to the SNS topic. Subscribe the Lambda function to the SNS topic as an endpoint. Then calmly go to the mountains, or get hit by a car.

By the way, thanks to this flexibility, you can program any system behavior, reacting not only to system metrics but also to business metrics.

Thanks.

Source: https://habr.com/ru/post/309040/

