Operations Infrastructure Month in Review #2

What’s this about?

The Remind OpsEng team has “Open Sourced” our monthly status reports. This post briefly describes some of the bigger tasks and projects we have worked on over the past month.

If you want us to elaborate on a specific topic, let us know!

Back to school 2017!

The back-to-school season has caused a surge of traffic to our services. We have found ourselves busy monitoring, scaling, and optimizing our infrastructure alongside the greater engineering org.

Fun fact: we doubled our raw capacity in the last month!

Multiple fleet-wide AMI releases

At Remind, our infrastructure is immutable, so when we want to introduce change at the EC2 level, we rebuild every host in the fleet.
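
As a heavily simplified sketch of what a fleet rebuild involves (this is not our actual release tooling, and the group, launch configuration, and AMI names below are hypothetical), rolling a new AMI through a single Auto Scaling group with boto3 looks roughly like this:

    # Heavily simplified sketch; all names and the AMI ID are hypothetical.
    import boto3

    autoscaling = boto3.client("autoscaling")

    # Point the Auto Scaling group at a launch configuration built from the new AMI.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName="r101-example-lc-new-ami",
        ImageId="ami-0123456789abcdef0",
        InstanceType="c4.large",
    )
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="r101-example-asg",
        LaunchConfigurationName="r101-example-lc-new-ami",
    )

    # Recycle the old instances; the group replaces each one with a host
    # running the new AMI.
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=["r101-example-asg"]
    )["AutoScalingGroups"][0]
    for instance in group["Instances"]:
        autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=instance["InstanceId"],
            ShouldDecrementDesiredCapacity=False,
        )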

This month we performed the following AMI releases:

  • Applied security patches
  • Cut over to using SSM Parameters for host-level secrets
  • Upgraded Threat Stack to resolve a defect that caused high CPU usage
  • Removed the New Relic sysmond agent (deprecated by the vendor)

DNS cleanups

Last month, we moved all of our Route 53 hosted zones to CloudFormation managed by stacker. As a result, we quickly became aware of DNS records that were unused or no longer needed. This month, we spent some time cleaning up our DNS records and refactoring our stacker blueprints.
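
For context, stacker blueprints are written in Python with troposphere, so each DNS record is now a resource in a template, which is what makes unused records easy to spot and delete. A minimal sketch (the zone, record, and target below are made up, not one of our real records):

    # Rough sketch of a Route 53 record in a troposphere-based blueprint.
    from troposphere import Template
    from troposphere.route53 import RecordSetType

    template = Template()
    template.add_resource(RecordSetType(
        "ExampleApiRecord",
        HostedZoneName="example-zone.com.",
        Name="api.example-zone.com.",
        Type="CNAME",
        TTL="300",
        ResourceRecords=["example-elb-1234567890.us-east-1.elb.amazonaws.com"],
    ))

    print(template.to_json())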

EMR Permissions Refactor

When our data engineering team first started using EMR, we were hesitant to let EMR run in our primary AWS account because of the breadth of unscoped IAM permissions it needs by default. Over time, managing a cross-account EMR cluster that needed access to resources in three separate AWS accounts became an operational burden and introduced a lot of complexity.

This month, we spent some time developing a set of least-privilege IAM permissions for EMR so that we could bring our EMR cluster into our primary AWS account.

We created a specific role for EMR with minimal permissions (a subset of the AWS default EMR role) and integrated it into our stacker blueprints, which allowed us to safely run the EMR cluster in our primary account.
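
As a purely hypothetical illustration of the approach (not our actual policy), the role's policy document is a hand-picked subset of the actions granted by the AWS default EMR role, along the lines of:

    # Illustrative fragment only; the real policy is larger and tuned to
    # what our cluster actually does.
    EMR_SERVICE_POLICY = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeInstances",
                    "ec2:DescribeSubnets",
                    "ec2:DescribeSecurityGroups",
                    "ec2:RunInstances",
                    "ec2:TerminateInstances",
                    "ec2:CreateTags",
                ],
                "Resource": "*",
            }
        ],
    }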

More ECS Task roles

We moved a few more internal services over to using ECS task roles.

This required:

  • capturing the IAM policies that the service needs in our stacker blueprints
  • cutting the service code over to use task roles (or instance profiles) instead of access/secret keys (see the sketch after this list)
  • coordinating with the service owner to test in staging
  • ultimately releasing the change to production
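
On the code side, the change is usually just dropping explicit keys and letting the SDK's default credential chain pick up the task role's credentials. A hypothetical before/after with boto3 (the bucket name is made up):

    import boto3

    # Before: long-lived keys injected through the environment.
    # s3 = boto3.client(
    #     "s3",
    #     aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    #     aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    # )

    # After: no keys at all; boto3 picks up temporary credentials for the
    # ECS task role (or the instance profile) automatically.
    s3 = boto3.client("s3")
    s3.list_objects_v2(Bucket="r101-example-bucket")  # hypothetical bucket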

Refactored how we manage secrets (encrypted SSM parameters with KMS)

Previously, each instance role had its own KMS key, so if the same secret was used across many instance roles, we had to encrypt that secret multiple times (once for each key).

Now we use a single KMS key for secrets across all roles.

To keep fine-grained control over which secrets a role can access, we use IAM policies. We came up with a simple pattern for managing these policies that keeps access tightly scoped.
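
A hypothetical sketch of that pattern (the parameter prefix convention, account ID, and key ARN below are all made up): each role gets ssm:GetParameters scoped to its own parameters, plus kms:Decrypt on the one shared key.

    # Illustrative only; ARNs and the naming convention are hypothetical.
    SHARED_KMS_KEY_ARN = "arn:aws:kms:us-east-1:111111111111:key/example-key-id"

    def secrets_policy(service_name):
        """Least-privilege policy document for one service's secrets."""
        return {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": ["ssm:GetParameters"],
                    "Resource": [
                        "arn:aws:ssm:us-east-1:111111111111:parameter/%s.*" % service_name,
                    ],
                },
                {
                    "Effect": "Allow",
                    "Action": ["kms:Decrypt"],
                    "Resource": [SHARED_KMS_KEY_ARN],
                },
            ],
        }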

Canary environment

We set up a homegrown “canary” solution for sending small amounts of real traffic to new code.

We created a separate Empire daemon instance in both the prod and stage environments for canary deployments. This environment is like a subset of the top-level prod/stage environment, and jobs get scheduled into the same ECS cluster as the top-level environment.

For example, a developer may “canary” a change with slashdeploy to deploy into the canary environment:

/deploy r101-api@master to prod/canary

They may also scale up (or down) processes in the canary app:

emp-prod-canary scale web=10 -a r101-api

This enables a cool workflow where git branches and continuous integration (/deploy) act as a release pipeline.

For example, in the frontend’s case, a developer could merge a PR into a develop branch, which would trigger a continuous-integration deploy to the prod/canary environment.

RDS Enhanced Monitoring

Previously, if we wanted Enhanced Monitoring for RDS, we had to enable it manually. Now that Enhanced Monitoring is supported in CloudFormation, we’ve added it to our stacker blueprint for RDS, allowing us to manage it entirely from stacker.
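
In blueprint terms, this comes down to two properties on the AWS::RDS::DBInstance resource: MonitoringInterval and MonitoringRoleArn. A stripped-down troposphere sketch (this is not our full blueprint, and the resource names, credentials, and role ARN are hypothetical):

    from troposphere import Template
    from troposphere.rds import DBInstance

    template = Template()
    template.add_resource(DBInstance(
        "ExampleDatabase",
        DBInstanceClass="db.r3.large",
        Engine="postgres",
        AllocatedStorage="100",
        MasterUsername="example",
        MasterUserPassword="not-a-real-password",
        MonitoringInterval=60,  # seconds between Enhanced Monitoring samples
        MonitoringRoleArn="arn:aws:iam::111111111111:role/rds-monitoring-role",
    ))

    print(template.to_json())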

Updated Empire to support ECS placement constraints

Before ECS Placement Constraints, an ECS cluster was relatively flat: when you ran an ECS task, it could get scheduled on ANY host registered within the ECS cluster. Placement Constraints allow you to specify host-level requirements for a task, like instance type, operating system, availability zone, or arbitrary custom metadata. This allows us to register multiple different instance profiles in a single ECS cluster so that tasks can be placed on instances that are more tailored to their type of workload (e.g. a profile for background jobs, and a profile for user-facing web requests).
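
For example (the cluster, task definition, and attribute names here are hypothetical), the ECS cluster query language can pin a task to instances carrying a custom attribute:

    # Run a task only on container instances tagged with a custom
    # "workload" attribute, via a memberOf placement constraint.
    import boto3

    ecs = boto3.client("ecs")

    ecs.run_task(
        cluster="prod",
        taskDefinition="r101-example-worker",
        count=1,
        placementConstraints=[
            {
                "type": "memberOf",
                "expression": "attribute:workload == background",
            }
        ],
    )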

Using placement constraints brings its own set of operational challenges, since it’s harder to autoscale multiple pools of instances in a single cluster, so for now we’re just dipping our toes in.

AWS WAF

We’re experimenting with placing AWS WAF in front of our primary gateway to block malicious CIDRs, something we previously handled in our Nginx and application layers. We will also investigate using this WAF for high-level XSS and SQL injection mitigation.
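
A rough sketch of the kind of thing we’re experimenting with (the IP set name and CIDR are hypothetical): maintain a WAF IP set of blocked CIDRs that a WAF rule can match on, instead of blocking them in Nginx or the application.

    import boto3

    waf = boto3.client("waf-regional")

    # Every classic WAF mutation requires a fresh change token.
    token = waf.get_change_token()["ChangeToken"]
    ip_set = waf.create_ip_set(Name="blocked-cidrs", ChangeToken=token)["IPSet"]

    token = waf.get_change_token()["ChangeToken"]
    waf.update_ip_set(
        IPSetId=ip_set["IPSetId"],
        ChangeToken=token,
        Updates=[
            {
                "Action": "INSERT",
                "IPSetDescriptor": {"Type": "IPV4", "Value": "192.0.2.0/24"},
            }
        ],
    )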

A lock to prevent change is an anti-pattern

We have found that locking a resource to prevent it from changing blocks destructive changes, but at the expense of also skipping routine upgrades.

When a resource is locked, the deltas grow over time, which means more risk when the time finally comes to update. As a result, we have started to unlock some of our resources so that change can happen on them more frequently. This leaves us with smaller, more frequent deltas to deal with instead of large, tricky ones.

Instead of a lock, we want our tooling to support annotations for “protected” areas. This gives us a chance to review changes instead of skipping them.

We have already started adding a protected mode for stacks to stacker.

Cross-team PRs to our stacker blueprints

This month we had many pull requests to our internal stacker blueprints repo from Remind colleagues on other teams. I call this out because managing infrastructure as code allows for transparency and collaboration, which is awesome.

Thank you for the help!

That’s it!

If you have any questions or want us to elaborate on a specific topic, please ping us at @RemindEng on Twitter.