6 Lessons from Working with CloudFormation That I Learned for Life

I started working with CloudFormation 4 years ago. Since then, I have broken many infrastructures, including some that were already in production. But every time I broke something, I learned something new. In this article I share some of the most important lessons from that experience.

Lesson 1: Verify Changes Before Deploying

I learned this lesson soon after I started working with CloudFormation. I do not remember exactly what I broke, but I clearly remember that I used the aws cloudformation update-stack command. This command simply rolls out the template without any review of the changes that will be deployed. I do not think it needs explaining why you should check all changes before you deploy them.

After this failure, I immediately changed the deployment pipeline, replacing the update-stack command with the create-change-set command:

    # OPERATION is either "UPDATE" or "CREATE"
    changeset_id=$(aws cloudformation create-change-set \
        --change-set-name "$CHANGE_SET_NAME" \
        --stack-name "$STACK_NAME" \
        --template-body "$TPL_PATH" \
        --change-set-type "$OPERATION" \
        --parameters "$PARAMETERS" \
        --output text \
        --query Id)

    aws cloudformation wait \
        change-set-create-complete --change-set-name "$changeset_id"

Creating a change set does not affect the existing stack. Unlike the update-stack command, the change set approach does not trigger an actual deployment. Instead, it creates a list of changes that you can review before deploying. You can view the changes in the AWS console. But if you prefer to automate everything you can, check them in the CLI:

    # this command is presented only for demonstration purposes.
    # the real command should take pagination into account
    aws cloudformation describe-change-set \
        --change-set-name "$changeset_id" \
        --query 'Changes[*].ResourceChange.{Action:Action,Resource:ResourceType,ResourceId:LogicalResourceId,ReplacementNeeded:Replacement}' \
        --output table

This command should produce output similar to the following:

    --------------------------------------------------------------------
    |                        DescribeChangeSet                         |
    +---------+--------------------+----------------------+------------+
    | Action  | ReplacementNeeded  |      Resource        | ResourceId |
    +---------+--------------------+----------------------+------------+
    |  Modify | True               |  AWS::ECS::Cluster   |  MyCluster |
    |  Replace| True               |  AWS::RDS::DBInstance|  MyDB      |
    |  Add    | None               |  AWS::SNS::Topic     |  MyTopic   |
    +---------+--------------------+----------------------+------------+

Pay particular attention to changes where Action is Replace or Delete, or where ReplacementNeeded is True. These are the most dangerous changes, and they usually lead to data loss.
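Because the riskiest entries are easy to miss in a long table, this check can also be automated. The following is a minimal sketch, not part of the original pipeline: a hypothetical helper, find_dangerous_changes, reads tab-separated Action, Replacement and LogicalResourceId rows on stdin and prints the ones that look destructive. The set of values treated as dangerous is an assumption; it accepts the names used in the table above as well as Remove and Conditional, which the API can also report.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads rows of "Action<TAB>Replacement<TAB>LogicalResourceId"
# on stdin and prints the rows that look destructive.
# Returns 0 if at least one dangerous row was found, 1 otherwise.
find_dangerous_changes() {
  local tab action replacement resource found=1
  tab=$(printf '\t')
  while IFS="$tab" read -r action replacement resource; do
    case "$action:$replacement" in
      Remove:*|Delete:*|Replace:*|*:True|*:Conditional)
        printf '%s\t%s\t%s\n' "$action" "$replacement" "$resource"
        found=0
        ;;
    esac
  done
  return "$found"
}

# Possible usage (requires AWS credentials, so it is left commented out):
#   if aws cloudformation describe-change-set \
#        --change-set-name "$changeset_id" \
#        --query 'Changes[*].ResourceChange.[Action,Replacement,LogicalResourceId]' \
#        --output text | find_dangerous_changes; then
#     echo "Dangerous changes found. Aborting" >&2
#     exit 1
#   fi
```

Failing the pipeline on such rows, instead of only printing them, means a destructive change cannot slip through just because nobody scrolled far enough.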

Once the changes have been reviewed, they can be deployed:

    aws cloudformation execute-change-set --change-set-name "$changeset_id"

    operation_lowercase=$(echo "$OPERATION" | tr '[:upper:]' '[:lower:]')
    aws cloudformation wait "stack-${operation_lowercase}-complete" \
        --stack-name "$STACK_NAME"

Lesson 2: Use a stack policy to prevent replacement or deletion of stateful resources.

Sometimes just reviewing the changes is not enough. We are all human and we all make mistakes. Shortly after we started using change sets, my teammate unknowingly ran a deployment that replaced the database. Nothing terrible happened because it was a testing environment.

Even though our scripts displayed the list of changes and asked for confirmation, the Replace change was overlooked because the list was too long to fit on the screen. And since it was a routine update in the test environment, nobody paid much attention to the changes.

There are resources that you never want to replace or remove. These are stateful services, such as an RDS database instance or an Elasticsearch cluster. It would be nice if AWS automatically refused to deploy when the operation requires the removal of such a resource. Fortunately, CloudFormation has a built-in way to do this. It is called a stack policy, and you can learn more about it in the documentation:

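    STACK_NAME=$1
    RESOURCE_ID=$2

    # A stack policy denies every update it does not explicitly allow,
    # so allow updates in general and deny only the destructive ones.
    POLICY_JSON=$(cat <<EOF
    {
      "Statement" : [{
        "Effect" : "Allow",
        "Action" : "Update:*",
        "Principal": "*",
        "Resource" : "*"
      }, {
        "Effect" : "Deny",
        "Action" : [
          "Update:Replace",
          "Update:Delete"
        ],
        "Principal": "*",
        "Resource" : "LogicalResourceId/$RESOURCE_ID"
      }]
    }
    EOF
    )

    aws cloudformation set-stack-policy --stack-name "$STACK_NAME" \
        --stack-policy-body "$POLICY_JSON"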
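A stack often holds more than one stateful resource. As a rough sketch under that assumption, the hypothetical helper build_stack_policy below prints a policy with one Deny statement per logical resource ID passed to it. It also emits a general Allow statement, because a stack policy protects every resource that it does not explicitly allow to be updated. The resource names in the usage comment are made up for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints a stack policy that allows updates in general
# but denies replacement and deletion of each logical resource ID given
# as an argument. Fails if no resource ID is passed.
build_stack_policy() {
  if [ "$#" -eq 0 ]; then
    echo "usage: build_stack_policy LOGICAL_ID..." >&2
    return 1
  fi
  local id
  printf '{\n  "Statement": [\n'
  printf '    {"Effect": "Allow", "Action": "Update:*", "Principal": "*", "Resource": "*"}'
  for id in "$@"; do
    # One Deny statement per protected resource, each on its own line.
    printf ',\n    {"Effect": "Deny", "Action": ["Update:Replace", "Update:Delete"], "Principal": "*", "Resource": "LogicalResourceId/%s"}' "$id"
  done
  printf '\n  ]\n}\n'
}

# Possible usage (MyDB and MySearchDomain are made-up logical IDs):
#   aws cloudformation set-stack-policy --stack-name "$STACK_NAME" \
#     --stack-policy-body "$(build_stack_policy MyDB MySearchDomain)"
```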

Lesson 3: Use UsePreviousValue when updating a stack with secret parameters.

When you create an RDS MySQL instance, AWS requires you to provide MasterUsername and MasterUserPassword. Since it is better not to keep secrets in source code, and I wanted to automate everything, I implemented a “smart mechanism”: before deployment, credentials are fetched from S3, and if none are found, new credentials are generated and stored in S3.

These credentials are then passed as parameters to the cloudformation create-change-set command. While I was experimenting with the script, the connection to S3 was lost, and my “smart mechanism” treated this as a signal to generate new credentials.

If I had started using this script in production and the connection problem had occurred again, it would have updated the stack with new credentials. In this particular case, nothing bad would happen. However, I abandoned this approach in favor of another one: providing the credentials only once, when the stack is created. Later, when the stack needs an update, instead of specifying the secret parameter value I simply use UsePreviousValue=true:

    aws cloudformation create-change-set \
        --change-set-name "$CHANGE_SET_NAME" \
        --stack-name "$STACK_NAME" \
        --template-body "$TPL_PATH" \
        --change-set-type "UPDATE" \
        --parameters "ParameterKey=MasterUserPassword,UsePreviousValue=true"
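Real stacks usually mix ordinary parameters with secret ones. The sketch below shows one possible way to build the --parameters list so that secrets always reuse the stored value. The helper name build_parameters and the InstanceClass parameter are hypothetical. Arguments of the form KEY=VALUE are passed through as new values, while a bare KEY becomes UsePreviousValue=true. Values containing spaces or commas are not handled.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints one --parameters entry per line.
#   KEY=VALUE -> ParameterKey=KEY,ParameterValue=VALUE
#   KEY       -> ParameterKey=KEY,UsePreviousValue=true
build_parameters() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      *=*) printf 'ParameterKey=%s,ParameterValue=%s\n' "${arg%%=*}" "${arg#*=}" ;;
      *)   printf 'ParameterKey=%s,UsePreviousValue=true\n' "$arg" ;;
    esac
  done
}

# Possible usage (the substitution is left unquoted on purpose so that
# every line becomes a separate argument):
#   aws cloudformation create-change-set \
#     --change-set-name "$CHANGE_SET_NAME" \
#     --stack-name "$STACK_NAME" \
#     --template-body "$TPL_PATH" \
#     --change-set-type "UPDATE" \
#     --parameters $(build_parameters InstanceClass=db.t3.small MasterUserPassword)
```

With this split, the deployment script never needs to read the secret after the stack has been created.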

Lesson 4: Use rollback configuration

Another team I worked with used a CloudFormation feature called rollback configuration. I had not come across it before and quickly realized that it would make the deployment of my stacks even cooler. Now I use it every time I deploy my code to Lambda or ECS using CloudFormation.

How it works: you specify a CloudWatch alarm ARN in the --rollback-configuration parameter when you create a change set. Later, when you execute the change set, AWS monitors the alarm for at least one minute. If the alarm changes state to ALARM during this time, the deployment is rolled back.

Below is a snippet of a CloudFormation template in which I create a CloudWatch alarm that tracks a custom metric: the number of errors in the CloudWatch logs (the metric is created via MetricFilter):

    Resources:
      # this metric tracks number of errors in the cloudwatch logs. In this
      # particular case it's assumed logs are in json format and the error logs are
      # identified by level "error". See FilterPattern
      ErrorMetricFilter:
        Type: AWS::Logs::MetricFilter
        Properties:
          LogGroupName: !Ref LogGroup
          FilterPattern: !Sub '{$.level = "error"}'
          MetricTransformations:
            - MetricNamespace: !Sub "${AWS::StackName}-log-errors"
              MetricName: Errors
              MetricValue: 1
              DefaultValue: 0

      ErrorAlarm:
        Type: AWS::CloudWatch::Alarm
        Properties:
          AlarmName: !Sub "${AWS::StackName}-errors"
          Namespace: !Sub "${AWS::StackName}-log-errors"
          MetricName: Errors
          Statistic: Maximum
          ComparisonOperator: GreaterThanThreshold
          Period: 60 # 1 minute (the period is given in seconds)
          EvaluationPeriods: 1
          Threshold: 0
          TreatMissingData: notBreaching
          ActionsEnabled: yes

Now the alarm can be used as a rollback trigger when creating the change set:

    ALARM_ARN=$1

    ROLLBACK_TRIGGER=$(cat <<EOF
    {
      "RollbackTriggers": [
        {
          "Arn": "$ALARM_ARN",
          "Type": "AWS::CloudWatch::Alarm"
        }
      ],
      "MonitoringTimeInMinutes": 1
    }
    EOF
    )

    aws cloudformation create-change-set \
        --change-set-name "$CHANGE_SET_NAME" \
        --stack-name "$STACK_NAME" \
        --template-body "$TPL_PATH" \
        --change-set-type "UPDATE" \
        --rollback-configuration "$ROLLBACK_TRIGGER"

Lesson 5: Make sure you deploy the latest version of the template.

It is easy to deploy a version of the CloudFormation template that is not the latest, and doing so can cause a lot of damage. This happened to us once: a developer did not pull the latest changes from Git and unknowingly deployed a previous version of the stack. This led to downtime of the application that used the stack.

Something simple, such as checking that the branch is up to date before deploying, will do (assuming that Git is your version control tool):

    git fetch
    HEADHASH=$(git rev-parse HEAD)
    UPSTREAMHASH=$(git rev-parse master@{upstream})

    if [[ "$HEADHASH" != "$UPSTREAMHASH" ]] ; then
        echo "Branch is not up to date with origin. Aborting"
        exit 1
    fi

Lesson 6: Don't reinvent the wheel.

It may seem that deploying with CloudFormation is easy. You just need a bunch of bash scripts running AWS CLI commands.

4 years ago I started with simple scripts that called the aws cloudformation create-stack command. Soon the script was no longer simple. Each lesson learned made it more and more complex. It was not only complex, but also full of bugs.

Now I work in a small IT department. Experience shows that each team has its own way of deploying CloudFormation stacks. And that's bad. It would be better if everyone used a unified approach. Fortunately, there are many tools that help deploy and configure CloudFormation stacks.

These lessons will help you avoid mistakes.

Source: https://habr.com/ru/post/446918/
