How we reached 99.9% uptime during our busiest season

Availability is the lifeblood of any app. Downtime blocks new users from signing up, damages the company's reputation, and erodes trust. For Remind, serving education administrators as they work to increase engagement across entire communities raises the stakes even further. While administrators share many of the same expectations that teachers have of Remind, their ultimate responsibility to their communities makes it crucial for them to know that their messages were delivered, and delivered on time.

The impact of downtime is most apparent during the back-to-school season. Because Remind usage is so closely tied to the school year, we see our traffic increase 5 times and our sign-ups increase 15 times between July 1st and September 30th each year.

As a result, we are always looking for ways to improve our availability. But before we can improve it, we need to be able to measure it.

Internally, we define downtime at Remind as an error rate above 5%. When we set out to measure our availability two years ago, the first step was to establish an internal uptime Service Level Agreement, or SLA. Once we had a baseline, we prioritized work to help us conform to that SLA. And once we met that SLA, we tightened it further.
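
To make that concrete, here is a rough sketch of how downtime minutes can be tallied from per-minute request counts. It is an illustration only, not our production tooling; the counts below are placeholder data.

```python
# Sketch: count a minute as "downtime" when its error rate exceeds 5%.
# The per-minute request and error counts here are placeholders.
DOWNTIME_THRESHOLD = 0.05  # our internal definition: error rate over 5%

# (minute, total_requests, errored_requests)
minutes = [
    ("12:00", 12_000, 30),
    ("12:01", 11_500, 700),  # e.g. a bad deploy
    ("12:02", 11_800, 45),
]

downtime_minutes = 0
for label, total, errors in minutes:
    error_rate = errors / total if total else 0.0
    if error_rate > DOWNTIME_THRESHOLD:
        downtime_minutes += 1
        print(f"{label}: {error_rate:.1%} error rate -> counts as downtime")

print(f"Total downtime: {downtime_minutes} minute(s)")
```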

We have built this cycle of honing our targets and tuning our systems to meet them into our product delivery cadence. The investment has paid off: during back-to-school 2018 we had over 1,000 minutes of downtime across August and September. In 2019, we reduced that to 80 minutes.

Looking back, here are some tactics that enabled us to conform to progressively more aggressive SLAs at Remind.

Elastic systems are more forgiving

(Our original viewpoint on autoscaling systems was financially motivated. If we scale systems down at night, then we save money. This is still true, but improvements to our cost model have made the financial savings less of a motivator.)

As we focused on conforming to our availability SLA, we saw that one cause of downtime was improperly scaled resources: a database would become CPU constrained, or a background queue would back up because there weren’t enough consumers. To address those situations, we set out to move as much as we could onto autoscaled systems. We’ve found that autoscaling removes the need for, and the latency of, manual changes to system configuration, and ultimately eliminates a whole class of downtime. We use autoscaling everywhere we can (one such policy is sketched after the list below), e.g.:

  1. Amazon EC2 Auto Scaling Groups
  2. Amazon ECS Task Auto Scaling
  3. Amazon DynamoDB Auto Scaling
  4. Amazon Aurora Reader Auto Scaling
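
As an example of what these policies look like in practice, here is a hedged sketch of ECS Service Auto Scaling with a CPU target-tracking policy using boto3. The cluster and service names, capacity bounds, and target value are illustrative assumptions, not our production settings.

```python
import boto3

# Sketch: target-tracking autoscaling for an ECS service.
# All names, bounds, and targets below are illustrative placeholders.
autoscaling = boto3.client("application-autoscaling")

resource_id = "service/prod-cluster/web-api"  # hypothetical cluster/service

# Register the service's desired count as a scalable target with a hard maximum.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=4,
    MaxCapacity=100,
)

# Track average CPU so capacity follows traffic up and down automatically.
autoscaling.put_scaling_policy(
    PolicyName="web-api-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,  # aim for ~60% average CPU
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,   # scale out quickly
        "ScaleInCooldown": 300,   # scale in conservatively
    },
)
```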

Taking human eyes off the system can also be troublesome. To mitigate that risk, we apply scaling maximums, and we set alerts for when we hit 80% of those maximums. We’ve found that this practice helps us absorb the blow from poorly performing code while still being loud enough to prompt us to investigate and identify the issue. Previously, a slow bit of code would have caused downtime. Now, autoscaling provides a big spike in our capacity, and we can roll back the code without increasing our error rate or impacting the user experience.
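
The other half of that practice is the alert at 80% of the ceiling. Here is a hedged sketch, continuing the hypothetical service above: it assumes Container Insights metrics are enabled and uses a placeholder SNS topic for paging.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

MAX_TASKS = 100                  # the scaling maximum from the policy above
ALERT_AT = int(MAX_TASKS * 0.8)  # get loud at 80% of that ceiling

# Alarm when the running task count approaches the scaling maximum, so a human
# investigates before autoscaling runs out of headroom. Assumes Container
# Insights is enabled; names and the SNS topic ARN are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="web-api-near-scaling-maximum",
    Namespace="ECS/ContainerInsights",
    MetricName="RunningTaskCount",
    Dimensions=[
        {"Name": "ClusterName", "Value": "prod-cluster"},
        {"Name": "ServiceName", "Value": "web-api"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=ALERT_AT,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)
```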

We write better code through collaboration

Over the past year, we have placed a substantial emphasis on building correct and scalable systems. One way we have done this is by pushing for earlier and more frequent feedback in the development process. For example:

  1. RFC (Request for Comment)
  2. Distributed Code Review Process
  3. Operations Review

When writing RFC documents, engineers present the context around the problem we are trying to solve and explore multiple possible solutions. We then schedule time to review and discuss as a group, rigorously diving into the problem together to make sure we are accurately capturing the constraints of the system, including security, performance, and cost. Once the problem is properly understood, we evaluate the proposed solutions and provide guidance on a path forward. Often, we have multiple rounds of RFC review, giving us the opportunity to iterate. These rounds can be as large or as small as needed to be productive, and not all of them lead to a scheduled review; we want the barrier to seeking feedback to be as low as possible. We’ve found that even small RFCs bring a huge benefit to the team while keeping process overhead to a minimum.

Code review is something we expect every engineer to drive. While the person writing the code is ultimately responsible for its correctness and performance, code review is an opportunity to leverage teammates to discuss design, catch bugs, and share knowledge. It’s not (primarily) a gatekeeping process; it’s an opportunity for an individual to seek feedback and get answers to questions. Often, we’ll open a draft PR to get early feedback and then request another review once we think the change is ready to merge.

Operations Review is a final opportunity to look at how we have operationalized the system before we launch it. One guiding question is, “How will we know if the system is working?” We walk through the metrics and alerting we have in place, along with any eventing we need for product analysis. We revisit any security concerns, as well as any incidents that occurred during development, and we follow up on concerns raised in the RFC process. After operations review, we schedule a follow-up review after launch to revisit any new incidents or usage patterns we did not foresee.

Again, the goal of these reviews is not to be gatekeepers but instead to provide valuable feedback. To defend against process bloat, we try to end every review with the simple question, “Was this helpful?”, and we adjust our practices based on how we answer that question.

Codify processes into systems

Despite our best efforts to prevent incidents by building more forgiving systems and writing higher-quality code, incidents still happen. When they do, we try to maximize what we learn from each one. Perhaps an area we thought was well monitored turns out not to be, or a pattern of user behavior we thought was rare turns out to be common.

To facilitate this learning process, we write up a post-mortem for each incident, which we review together while looking for improvements we can make. We capture any human judgments that were made and look for ways to codify those human processes into systems. That might mean introducing a watchdog that crashes a process left in an inconsistent state, updating our code to gracefully handle an error condition, or building tools to do data lookups more directly and efficiently. These learnings and follow-ups are critical to ensuring that past errors do not resurface in the future.
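
As a minimal illustration of the watchdog idea (a sketch only; the heartbeat invariant and timeout are hypothetical), it can be as simple as a background thread that crashes the process when the main work loop stops making progress, so the supervisor restarts it in a clean state:

```python
import os
import threading
import time

# Sketch: a watchdog that codifies the human check "is this worker healthy?"
# The heartbeat invariant and timeout below are hypothetical examples.
last_heartbeat = time.monotonic()

def record_heartbeat():
    """Called by the main work loop each time it completes a unit of work."""
    global last_heartbeat
    last_heartbeat = time.monotonic()

def watchdog(max_silence_seconds: float = 120.0) -> None:
    """Crash the process if the work loop stops making progress.

    The supervisor (ECS, systemd, etc.) then restarts it in a clean state,
    instead of a human noticing the stall and bouncing it by hand.
    """
    while True:
        time.sleep(10)
        if time.monotonic() - last_heartbeat > max_silence_seconds:
            os._exit(1)  # exit hard rather than keep running in a bad state

threading.Thread(target=watchdog, daemon=True, name="watchdog").start()
```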

So what did cause downtime? What are we focusing on now?

So, where did the 80 minutes of downtime in 2019 come from?

  1. Six minutes came from an Amazon Aurora writer failover. We’ve determined that 1) failovers are rare and 2) if you are going to have one, six minutes isn’t bad at all. We are, in general, very happy with Amazon Aurora.
  2. A bug in Amazon Aurora caused a boot loop. In preparation for back to school, we upgraded our Amazon Aurora cluster to the latest available version. That version had a few bugs that caused instances to restart, and we hit pretty much every problem outlined in the update they later released.
  3. Our Redis cluster hit its capacity limits. Our metrics for Redis are based on latency, which degrades aggressively at high load: the cluster performs fine until a small spike in traffic causes latencies to rise drastically. We’ve been working to get better alerting in place.
  4. PGBouncer did not autoscale in sync with our applications. We use PGBouncer to multiplex database connections between our applications and Amazon Aurora, configuring it with a fixed number of connections to Amazon Aurora that are shared between application instances. As our applications scaled up via autoscaling, the number of connections available to share stayed static, and application instances ended up waiting for an available connection, which added latency. One way to surface this is sketched below.
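
For the PGBouncer case specifically, one way to make that failure mode visible is to watch PGBouncer's own pool statistics. The sketch below is hedged: the host, credentials, and names are placeholders, and it assumes psycopg2 plus access to PGBouncer's admin console. It polls SHOW POOLS and flags clients left waiting for a server connection.

```python
import psycopg2

# Sketch: surface clients stuck waiting on PGBouncer for a server connection.
# Host, credentials, and user names are placeholders, not real settings.
conn = psycopg2.connect(
    host="pgbouncer.internal",  # hypothetical host
    port=6432,
    dbname="pgbouncer",         # PGBouncer's admin console pseudo-database
    user="stats_user",
    password="example-password",
)
conn.autocommit = True  # the admin console does not support transactions

with conn.cursor() as cur:
    cur.execute("SHOW POOLS")
    columns = [desc[0] for desc in cur.description]
    for row in cur.fetchall():
        pool = dict(zip(columns, row))
        # cl_waiting = clients queued because every server connection is busy.
        if pool["cl_waiting"] > 0:
            print(
                f"{pool['database']}/{pool['user']}: "
                f"{pool['cl_waiting']} clients waiting, "
                f"{pool['sv_active']} server connections active"
            )
```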

Our service uptime SLAs have helped us build a more reliable product for our users, but we are far from done. Our next focus is on performance. Not only should we fix and prevent errors on requests, but we should also serve each request within a tight performance window, background jobs included. This is a longer path, because it requires some new infrastructure to serve complicated data with consistent performance, but the pattern remains: measure, conform, optimize.