Migrate the Empire to C5

Migrate the Empire to C5 — or why it took me a month to move from C4 to C5 AWS instances.

Around Dec 15th, 2017, my team reviewed our Instance Reservations and chose to switch our fleet’s scaling unit from c4.2xlarge to c5.2xlarge. The C5 instance type family is Amazon’s first cloud offering running on the KVM hypervisor.

Additionally, C5 runs a newer generation of CPU, offers an extra 1 GiB of memory, and provides significantly faster intra-VPC networking (through the use of ENA), all for less cost than the C4 family. Essentially, we get the best bang for our buck by choosing C5 over C4.

As a result I took on the task of switching our stage environment to c5.2xlarge instances.

AMI Complications

As a first test, I naively adjusted the instance type in our stacker config and released to stage. This test revealed a set of complications related to our AMI.

The first complication was obvious: CloudFormation refuses to launch an AMI that lacks ENA support on the C5 family and rolls back. You can check whether your current AMI supports ENA by running the following:

$ ami_id="ami-764a210c"
$ aws ec2 describe-images --image-ids $ami_id --query "Images[].EnaSupport"
[
    true
]
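
The EnaSupport flag only tells you that the image is tagged as ENA-capable. If you want to confirm the driver is actually available on an instance built from that AMI, a quick sanity check looks something like this (the interface name varies by OS, so eth0 is an assumption):

# Is the ena kernel module available on the instance?
$ modinfo ena | head -1

# Is the active interface actually using the ena driver?
$ ethtool -i eth0 | grep '^driver'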

We use Packer, Ansible, and git to document and automate our homegrown AMI builds.

There are two main ways to add ENA support in a homegrown AMI: the easy way and the hard way.

The easy way: simply rebase onto an AMI that already happens to include the ENA kernel modules and let your upstream maintain updates.

The hard way: Install an OS package or compile the ENA kernel drivers yourself and flag the resulting AMI as ENA supported during publication.

I chose the easy way; in my case, the latest official Ubuntu 14.04 LTS cloud image includes the needed kernel modules for ENA.

If you choose the hard way, either because you need more control over driver versions or because your upstream OS doesn’t include the ENA kernel modules, I learned that newer versions of Packer allow you to set ena_support in your .json so that the flag is present when you publish.
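
For reference, a minimal amazon-ebs builder with the flag set might look like the sketch below. Everything other than ena_support is a placeholder for your own build, and I deliberately build on a c4.2xlarge since the source AMI presumably lacks ENA support:

{
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "us-east-1",
      "source_ami": "ami-00000000",
      "instance_type": "c4.2xlarge",
      "ssh_username": "ubuntu",
      "ami_name": "ena-enabled-base {{timestamp}}",
      "ena_support": true
    }
  ]
}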

The next complication is that the C5 family presents EBS volumes as NVMe block devices, so the device path changed to /dev/nvme1n1. To solve this, I adjusted our Ansible playbooks, which run on boot, to use a conditional include so that on first boot the proper device path is detected, formatted, and mounted for use by Docker.

Here is a snippet of Ansible which shows off the conditional include:

# Detect the docker device path.
- stat: path=/dev/xvdh
  register: dev_xvdh
- stat: path=/dev/nvme1n1
  register: dev_nvme1n1

# Manage the docker mount, passing the detected device_path.
- include: manage_mount.yml device_path=/dev/xvdh
  when: dev_xvdh.stat.exists
- include: manage_mount.yml device_path=/dev/nvme1n1
  when: dev_nvme1n1.stat.exists
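
For context, manage_mount.yml receives the detected device_path and is responsible for formatting and mounting the device. Its contents aren’t shown in this post; a minimal sketch, assuming ext4 and a /var/lib/docker mount point (both assumptions), would be:

# manage_mount.yml -- sketch only; the real playbook may differ.
- name: Create a filesystem on the detected device (skipped if one already exists)
  filesystem:
    fstype: ext4
    dev: "{{ device_path }}"

- name: Mount the device for use by Docker and persist it in /etc/fstab
  mount:
    path: /var/lib/docker
    src: "{{ device_path }}"
    fstype: ext4
    state: mounted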

Once these changes were made, I tried another test in stage and this time it worked. Our Empire Minion Auto Scaling Group (ASG) started to bring up C5 hosts, and Empire (ECS) started scheduling containers to do work!

Subnet Complications

Unfortunately, after closer inspection the next day I learned that one of our existing private subnets resides in an availability zone (in our case us-east-1e) which has no c5.2xlarge capacity. Our ASG was only able to successfully bring up C5 instances in 3 of 4 private subnets/availability zones (AZs).

So I started working towards adding an additional subnet to our vpc stack, specifically a subnet in the us-east-1d AZ. After a couple of days of trying to get this stack to manage an additional subnet, I found that no matter what I changed, any modification would try to recreate two subnets and all of their dependent resources!

Then I remembered: at some point a new AZ (us-east-1d) was introduced to our AWS account. At first glance this seems harmless, until our team realized that the GetAZs function, which we rely on in our vpc stack, will recompute the order of the AZs whenever the CloudFormation template is changed. This cascades into recreating multiple subnets and all dependent resources, which would in our case, and likely your own, result in an outage.

To deal with this, I created a new stacker blueprint called networks which does not rely on GetAZs. This new stack manages 8 new subnets (4 private, 4 public) across the availability zones us-east-1{a, b, c, d}.
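
In CloudFormation terms, the difference boils down to the fragment below; the choice of index 0 and the us-east-1a literal are just illustrations:

# Risky: the subnet's AZ depends on its index into the GetAZs list,
# which shifts when a new AZ is introduced to the account.
AvailabilityZone: !Select
  - 0
  - Fn::GetAZs: !Ref "AWS::Region"

# Safer: declare the AZ statically (or pass it in as an explicit variable).
AvailabilityZone: us-east-1a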

Our subnets always have a 1-to-1 pairing with a route table. Each of the public subnets has a NAT Gateway and each of the private subnets has a default route (0.0.0.0/0) to the NAT Gateway running in the public subnet of the same AZ. Each NAT Gateway has a static Elastic IP (EIP).
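
Per AZ, that wiring reduces to something like the following CloudFormation fragment; the logical names are mine, not the blueprint’s, and the referenced VPC, subnet, and NAT Gateway are assumed to be defined elsewhere in the stack:

PrivateRouteTable1a:
  Type: AWS::EC2::RouteTable
  Properties:
    VpcId: !Ref Vpc

PrivateDefaultRoute1a:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref PrivateRouteTable1a
    DestinationCidrBlock: 0.0.0.0/0
    NatGatewayId: !Ref NatGateway1a

PrivateSubnetAssociation1a:
  Type: AWS::EC2::SubnetRouteTableAssociation
  Properties:
    SubnetId: !Ref PrivateSubnet1a
    RouteTableId: !Ref PrivateRouteTable1a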

I ported each of these details over to the new networks stack and tried my test again in stage, moving just the Empire Minion ASG to the new subnets. This appeared to work: the ASG batched replacement instances into the new subnets, then drained and terminated the instances in the old subnets.

Redshift Security Group Complications

Our main legacy Redshift clusters were born a while back, before clusters could run inside a VPC and even before we created stacker. We have not had a compelling reason to rebuild these clusters, so they have been managed outside of our VPC and stacker. As a result, we use Redshift Security Groups and manage a short list of CIDRs that are allowed to ingress into the clusters. Unfortunately, the EIPs of the new NAT Gateways were denied access to the stage cluster by default, resulting in issues for our Data Engineering jobs that night.

The next day, to solve this, I started managing a few Redshift Security Groups in stacker and added these new groups to the clusters. Now if a NAT Gateway’s EIP changes, the rule granting ingress access will always stay up to date.
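
We drive this through stacker, but the underlying API calls amount to roughly the following; the group name, CIDR, and cluster identifier are placeholders:

$ aws redshift create-cluster-security-group \
    --cluster-security-group-name nat-gateway-ingress \
    --description "Allow ingress from our NAT Gateway EIPs"
$ aws redshift authorize-cluster-security-group-ingress \
    --cluster-security-group-name nat-gateway-ingress \
    --cidrip 203.0.113.10/32
$ aws redshift modify-cluster \
    --cluster-identifier stage-cluster \
    --cluster-security-groups nat-gateway-ingress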

We ran this configuration in stage over the weekend.

Elastic Load Balancer Complications

On Monday we learned that, because our Elastic Load Balancers (ELBs) did not have a subnet for the us-east-1d Availability Zone, we were not sending traffic to a portion of our upstream instances. For example, if a service had only one container, there was a 1-in-4 chance it would be scheduled in us-east-1d and be completely out of service.

We currently manage our application and service ELBs with Empire, a tool separate from stacker.

My first try at fixing this was to move the ELBs over to the new subnets, but CloudFormation failed and rolled back when I tried. An ELB can have one or many subnets but not two subnets in the same AZ. The problem is that CloudFormation adds new subnets to the ELB before removing old subnets.

Therefore trying to replace all the old subnets with new ones (most in the same AZs) caused CloudFormation to error with:

ELB cannot be attached to multiple subnets in the same AZ
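
The ordering constraint is worth spelling out. If you were swapping a subnet by hand with the classic ELB API rather than through CloudFormation, you would detach the old subnet before attaching its same-AZ replacement; the load balancer name and subnet IDs here are placeholders:

# Detach the old us-east-1a subnet first, then attach its replacement.
$ aws elb detach-load-balancer-from-subnets \
    --load-balancer-name my-service-elb --subnets subnet-0a1b2c3d
$ aws elb attach-load-balancer-to-subnets \
    --load-balancer-name my-service-elb --subnets subnet-4e5f6a7b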

As a short-term fix, my team found a way to let our ELBs send traffic to 5 AZs.

To do this, I added just the new us-east-1d subnet from the networks stack to the original 4 subnets from the vpc stack and passed all 5 to our empire_daemon stack. I then did a release and forced Empire to recompute its CloudFormation templates for each app using this short bash script:

for app in $(emp-prod apps | awk '{ print $1 }'); do
  # Set a var to force Empire to recompute CloudFormation and perform a new release.
  emp-prod set -a "$app" CLOUDFORMATION_RELEASE=2 -m 'forcing cloudformation to run'
  sleep 30
done

In the long term we plan to completely migrate off the old subnets.

This configuration ran in staging for another day.

Route Table Complications

After considering all the migration details, I thought I was finally ready for a production release.

Little did I know that in production we rely on a VPC Peering connection to an essential 3rd party. This peer and its route were not well documented and not managed with stacker. Worse yet, the stage environment does not use this special route, so the fact that it was missing was of no consequence during testing. It wasn’t until the production release that it cropped up as a major issue.

When I performed the release in production, the new route tables were missing the peer entry, so as instances came up in the new subnets they couldn’t reach the 3rd party service. At the 40% point of the release we called it and rolled back. Fortunately, the rollback and recovery from this event were quick and automated.

To fix this, I refactored our networks stack to support extra routes so we don’t forget any in the future.
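
For illustration, an extra route of this kind boils down to a fragment like the one below in the generated CloudFormation; the route table name, CIDR, and peering connection ID are placeholders, not the real 3rd party’s:

PartnerPeeringRoute:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref PrivateRouteTable1a
    DestinationCidrBlock: 198.51.100.0/24
    VpcPeeringConnectionId: pcx-0123456789abcdef0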

A bit bummed out, I closed my laptop for the night.

Success

Not one to stay discouraged by breaking production, I tried the migration again after my 2nd cup of coffee the next day. Needless to say, with all the missing details now in place, the change worked. Over the next few hours, I methodically scaled down the c4.2xlarge ASG and scaled up the c5.2xlarge ASG until we had a 50:50 split.

Once our current C4 Reserved Instances expire, we will almost certainly transition completely to C5. I’m patiently waiting for the time when we can fully realize the cost savings from this effort, and I hope to blog more about my adventures in finding additional cost-saving opportunities.

Retrospective and Summary

  • The universe will always hide important details in plain sight. Expect this and have a plan when things don’t work.
  • Trust your tools and intuition when deciding to continue or fall back. Failing and falling back early gives you a chance to return the next day with fresh “morning eyes”. If you try to push forward in the middle of a partial outage, without fully understanding the fault, you are likely to make a poor decision.
  • Relying on GetAZs is dangerous. Personally, I consider it an anti-pattern because the subnet-to-AZ mapping should be declared statically.
  • Try to document all changes. It doesn’t even have to be with an automated tool. People who are new to the team do not have all the history or institutional knowledge, so important details will be missed.
  • Slow down and pay off tech debt when it bites you, especially during a major change. Your whole organization will be faster in the long run.

Next steps

We still need to do the following:

  • move RDS databases out of the old subnets
  • move application and service ELBs, managed by Empire, out of the old subnets
  • move our RabbitMQ cluster out of the old subnets
  • refactor the vpc stack to destroy the old subnets and eliminate the scary GetAZs issue

Questions

If you have any questions, suggestions, or just want to share a war story please reach out to @russellbal or @RemindEng on Twitter.

Also, if you want to work with me to help give every student an opportunity to succeed, you should check out our openings on the Remind careers page.

Of course, what technical post would be complete without a chart?

This one shows load and CPU averages across the fleet partitioned by the two generations (C4 is blue):