Operations Infrastructure Month in Review #4

What’s this about?

The Remind OpsEng team has “Open Sourced” our monthly status reports. This post briefly describes some of the bigger tasks and projects we have worked on over the past month.

If you want us to elaborate on a specific topic, let us know!

Tuned EBS

We noticed that the Docker volumes on our ECS container instances were showing unusually high disk latency, sometimes spiking above 30 seconds. We’ve taken a few steps to make disk operations on the Docker volume faster and more predictable.

  • We disabled ext4 journaling. Since our ECS instances are highly ephemeral, filesystem journaling introduces extra overhead without a durability benefit we need.
  • We moved our Docker volumes from standard magnetic volumes to gp2 SSDs, which also allowed us to enable EBS optimization on these instances (see the sketch after this list).
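
For concreteness, here’s a minimal boto3 sketch of the second change, with hypothetical IDs and sizes. Disabling ext4 journaling itself happens when the filesystem is created (e.g. `mkfs.ext4 -O ^has_journal`), not through the API.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical ID for illustration only.
INSTANCE_ID = "i-0123456789abcdef0"

# Create the Docker volume as gp2 (SSD) instead of "standard" (magnetic).
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=100,            # GiB; size is illustrative
    VolumeType="gp2",
)

# With gp2 volumes on a supported instance type we can also turn on
# EBS optimization, giving the instance dedicated throughput to EBS.
# (The instance must be stopped when modifying this attribute.)
ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    EbsOptimized={"Value": True},
)
```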

While this increased our costs a little, the effect was noticeable: here is a measurement of max wait times on disk reads. After the change, disk reads are much more predictable, with fewer outliers.

[Figure: IO Wait, max disk read wait times before and after the change]

Hardened Docker configuration

As part of a series of security improvements, we’ve made some changes to our Docker configuration (a sketch of the resulting daemon.json follows the list):

  1. Disabled inter-container communication. Every app should run isolated.
  2. Disabled the Docker v1 registry, which Empire doesn’t use or need.
  3. Disabled the userland proxy for port forwarding (we now use iptables).
  4. Enabled live restore, so containers keep running if the daemon goes down.
  5. Enabled user namespace remapping.
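
Here’s a rough sketch of how those five settings map onto Docker’s daemon.json. Our actual configuration is baked into our AMIs, so treat this as illustrative rather than our literal setup:

```python
#!/usr/bin/env python
# Sketch: write the hardening settings above to Docker's daemon.json.
import json

daemon_config = {
    "icc": False,                     # 1. no inter-container communication
    "disable-legacy-registry": True,  # 2. turn off the v1 registry protocol
    "userland-proxy": False,          # 3. iptables rules instead of docker-proxy
    "live-restore": True,             # 4. containers survive daemon restarts
    "userns-remap": "default",        # 5. remap container root to an unprivileged user
}

with open("/etc/docker/daemon.json", "w") as f:
    json.dump(daemon_config, f, indent=2)
```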

Dnsmasq Upgrade

We’ve upgraded Dnsmasq in our AMIs to address a recently disclosed set of security vulnerabilities.

Autospotting reroll

Due to concerns about some poor performance we experienced, we rolled back Autospotting in order to eliminate variables (lots of things change in our system daily). To make sure there weren’t any open questions going forward, we built dashboards to monitor these issues and then slowly rolled it back out.

We didn’t experience the issues again; rather the opposite. Our standard operational unit is the c4.2xlarge, and since Autospotting treats the previous generation (c3.2xlarge) as a compatible instance type, it scaled those into our autoscaling groups. When this happened, we noticed better performance on some metrics.

Stacker 1.1.1 release

New Stacker releases: 1.1.0/1.1.1!

Among other improvements:

  • DynamoDB lookup to get values from DynamoDB tables (contributed by syphon7)
  • Better handling of stack errors (contributed by danielkza)
  • The environment file is now optional
  • Templates can now be uploaded directly to CloudFormation (no bucket needed). This is useful for testing (see above), but it also means that your CloudFormation templates must be smaller than 51,200 bytes (see the sketch after this list)
  • Stack-specific tags
  • Protected mode for stacks: stacker will switch to interactive mode for changes to these stacks
  • Remote configuration support, allowing you to keep additional configuration files in external storage (currently git)
  • New functional test suite using Bats
  • S3 templates are now stored in sub-directories
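
To make the direct-upload point concrete, here’s a hedged boto3 sketch (not stacker’s internal code; the stack name and file are hypothetical) of why the 51,200-byte limit applies:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

with open("stack.yaml") as f:
    template = f.read()

# CloudFormation accepts an inline TemplateBody of up to 51,200 bytes;
# anything larger still has to go through an S3 bucket (TemplateURL).
if len(template.encode("utf-8")) <= 51200:
    cfn.create_stack(StackName="my-stack", TemplateBody=template)
else:
    raise ValueError("template too large for direct upload; use an S3 bucket")
```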

See the full Changelog.

AWS Network Load Balancer testing

AWS released a new Network Load Balancer (NLB), similar to the Application Load Balancer but operating at layer 4 instead of layer 7. It seems poised to replace Classic Load Balancers in TCP mode, offering better performance and scalability.

We currently use a classic ELB to load balance Postgres queries across a pool of PgBouncer hosts. Proxying through a classic ELB noticeably increased the latency of these queries, so the introduction of a new, higher-performance TCP load balancer was intriguing, and we wanted to see if we could get better query latency with it.

Unfortunately, we found that in its current version it’s not possible to set security groups or to attach subnet mappings on an NLB; instead, the NLB you set up is automatically assigned an internal DNS name per subnet.

This makes it difficult to set up ingress rules limiting access to or from them, since ingress rules created from CloudFormation require either a source security group or an IP address.

While there are workarounds (allowing access from an entire internal subnet, or creating a security group rule outside CloudFormation by resolving the NLB’s address, as sketched below), we’ve decided to wait for the ability to set security groups before testing the NLB further.
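
For reference, here’s a hedged sketch of that second workaround. All names, IDs, and the port are hypothetical, and the fact that the resolved addresses can change over time is part of why we didn’t pursue it:

```python
import socket

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical values for illustration only.
NLB_DNS_NAME = "my-nlb-0123456789abcdef.elb.us-east-1.amazonaws.com"
SECURITY_GROUP_ID = "sg-0123456789abcdef0"
PGBOUNCER_PORT = 5432  # illustrative; use whatever port PgBouncer listens on

# Resolve the NLB's internal DNS name to its current per-subnet addresses...
_, _, addresses = socket.gethostbyname_ex(NLB_DNS_NAME)

# ...and allow them through the backend security group. These addresses
# aren't guaranteed to stay stable, which is a weakness of this approach.
ec2.authorize_security_group_ingress(
    GroupId=SECURITY_GROUP_ID,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": PGBOUNCER_PORT,
        "ToPort": PGBOUNCER_PORT,
        "IpRanges": [{"CidrIp": "%s/32" % addr} for addr in addresses],
    }],
)
```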

Improvements to our internal workflow

We keep an internal repository of Stacker blueprints where all of Remind’s infrastructure is declared as code, and we encourage every team to propose changes and additions via pull requests to this repository.

In order to improve our workflow, we’ve made use of GitHub’s .github/CODEOWNERS file so the OpsEng team gets automatically added as a reviewer on any submitted PR. We’ve also set up a .github/PULL_REQUEST_TEMPLATE file with helper boilerplate (nature of the change, security questions, estimated associated costs, etc.) to make the process smoother and get PRs rolling as fast as possible.
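
For example, a single .github/CODEOWNERS line like `* @remind101/opseng` (the team handle here is hypothetical) is enough to request a review from the whole team on every PR.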