Operations Infrastructure Month in Review #1

What’s this about?

Recently, the Operations Engineering team at Remind has been talking about ways to share more of what we're working on with the rest of the company. At one of those discussions, someone had the idea:

What if we just shared with the world some of the stuff we’re working on?

A lot of what we do here isn't specific to Remind, and the OpsEng team is really enthusiastic about Open Source, so why not "Open Source" our status reports?

So this is the inaugural “Operations Infrastructure Month in Review” post. The idea is to give small blurbs about some of the bigger things we’ve been working on over the past month. Some of these may be turned into larger, more detailed blog posts in the future - especially if people let us know they are interested in hearing more about them!

With that out of the way, let's get on with it!

What OpsEng has been up to this month

stacker managed DNS

We try to manage everything we can with CloudFormation/stacker. Until now that didn't include our public-facing domains, like remind.com. That has changed recently: we now have stacker-managed AWS Route53 hosted zones for our domains. We haven't switched resolution over to the new zones for all of them yet, but many have been moved.
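
To make that concrete, here's a minimal sketch of what a stacker blueprint for a public hosted zone can look like. It's illustrative only - the class and variable names are invented, not our actual blueprint.

```python
# A minimal sketch of a stacker blueprint managing a Route53 hosted zone.
# "PublicDomain" and "DomainName" are made-up names for illustration.
from troposphere import GetAtt, Join, Output
from troposphere.route53 import HostedZone

from stacker.blueprints.base import Blueprint


class PublicDomain(Blueprint):
    VARIABLES = {
        "DomainName": {
            "type": str,
            "description": "Public domain to manage, e.g. example.com",
        },
    }

    def create_template(self):
        domain = self.get_variables()["DomainName"]
        zone = self.template.add_resource(HostedZone("PublicZone", Name=domain))
        # Surface the delegated name servers so the registrar can be updated.
        self.template.add_output(
            Output("NameServers", Value=Join(",", GetAtt(zone, "NameServers")))
        )
```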

Encrypted cross-region RDS snapshots

For a while now we've been creating copies of our RDS database snapshots, but since our RDS instances use encrypted storage, we weren't able to copy those snapshots to other regions. Not-so-recently AWS made it possible to do this, so we've since updated our internal script (called dbsnap-copy), and now we keep snapshot copies in our prod region as well as in us-west-1, in case of a region-specific issue.
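
For the curious, the underlying API call looks roughly like the following. This is a hedged sketch of what dbsnap-copy does, not the script itself - identifiers, regions, and key aliases are made up.

```python
# Sketch of a cross-region copy of an encrypted RDS snapshot with boto3.
import boto3

# The client runs in the destination region (us-west-1 in this example).
rds = boto3.client("rds", region_name="us-west-1")

rds.copy_db_snapshot(
    # Cross-region copies reference the source snapshot by ARN.
    SourceDBSnapshotIdentifier=(
        "arn:aws:rds:us-east-1:123456789012:snapshot:rds:mydb-2017-08-01"
    ),
    TargetDBSnapshotIdentifier="mydb-2017-08-01-us-west-1",
    # Encrypted snapshots need a KMS key that lives in the destination region.
    KmsKeyId="alias/rds-snapshot-copies",
    # boto3 uses SourceRegion to generate the pre-signed URL required for
    # cross-region copies of encrypted snapshots.
    SourceRegion="us-east-1",
)
```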

S3/DynamoDB VPC Endpoints

S3 endpoints for your VPC have been around for a while now, and with the "recent" announcement of DynamoDB getting its own endpoints, we decided to go ahead and start working with them in our VPC. This required an update to our internal VPC stacker blueprint (which led to other issues - up next!).

We’ve rolled out S3 endpoints in production, and are currently testing the DynamoDB endpoints in staging in us-east-1.
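
Adding a gateway endpoint to a troposphere-based blueprint is a fairly small change. The sketch below is not our actual VPC blueprint; it assumes the blueprint already has its VPC ref and route table IDs in hand.

```python
# Sketch: S3 and DynamoDB gateway endpoints added via troposphere.
from troposphere import Join, Region
from troposphere.ec2 import VPCEndpoint


def add_gateway_endpoints(template, vpc_id, route_table_ids):
    """Attach S3 and DynamoDB gateway endpoints to the given route tables."""
    for service in ("s3", "dynamodb"):
        template.add_resource(
            VPCEndpoint(
                "%sEndpoint" % service.capitalize(),
                # e.g. com.amazonaws.us-east-1.s3
                ServiceName=Join(".", ["com.amazonaws", Region, service]),
                VpcId=vpc_id,
                RouteTableIds=route_table_ids,
            )
        )
```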

New Availability Zone Scare

In rolling out the S3 VPC endpoints in staging, we had to update our VPC stack/blueprint in our private stacker repo. We hadn't had to modify our VPC in a long time (probably 2+ years), and wouldn't you know it, the update ran into issues.

The issues arose from the fact that we use Fn::GetAZs in the blueprint to figure out which Availability Zones are available to it. It turns out (and is documented - not sure if it always has been!) that the result of GetAZs can change if your AZs have changed, and since the last time we actually modified our VPC, AWS has given us 2 additional Availability Zones. This led to the update failing - fortunately it failed in a safe way and rolled back the VPC stack. In the end we had to specify the exact AZs we wanted in our stack, rather than relying on GetAZs, which isn't optimal but allowed us to move ahead.
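
A tiny illustration of the shape of the change (not our actual blueprint):

```python
# Illustrative only: pinning AZs explicitly instead of relying on GetAZs.
from troposphere import GetAZs, Select

# Before: let CloudFormation pick from whatever AZs the account currently
# has. That set can grow as AWS adds zones, which is what broke our update.
first_az_dynamic = Select(0, GetAZs(""))

# After: pin the zones explicitly (e.g. via a stacker variable), so a VPC
# update years later still sees the same set it was built with.
AVAILABILITY_ZONES = ["us-east-1a", "us-east-1b", "us-east-1d"]
first_az_pinned = AVAILABILITY_ZONES[0]
```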

Disabled the ability to delete app specific DynamoDB tables

We updated our generic "App" blueprint (used to set up AWS infrastructure around the applications managed by Empire, including IAM roles & policies) to disallow the deletion of tables created by the app. Until this change, we allowed an app to do anything with any table whose name was prefixed with <environment>-<app_name>-, but we decided that as a safety precaution we'd take away DeleteTable. A small update to the app-specific DynamoDB policy generator function in stacker was rolled out to remove this permission.
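
Roughly, the generated policy now looks something like this. It's an approximation - the exact action list in our stacker function may differ.

```python
# Hedged approximation of an app-scoped DynamoDB policy without DeleteTable.
import json


def app_dynamodb_policy(environment, app_name, region, account_id):
    """Allow an app full use of its own tables, minus DeleteTable."""
    table_arn = "arn:aws:dynamodb:%s:%s:table/%s-%s-*" % (
        region, account_id, environment, app_name,
    )
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The usual table/item actions, with DeleteTable deliberately
                # left out of the allowed list.
                "Action": [
                    "dynamodb:BatchGetItem",
                    "dynamodb:BatchWriteItem",
                    "dynamodb:CreateTable",
                    "dynamodb:DeleteItem",
                    "dynamodb:DescribeTable",
                    "dynamodb:GetItem",
                    "dynamodb:PutItem",
                    "dynamodb:Query",
                    "dynamodb:Scan",
                    "dynamodb:UpdateItem",
                    "dynamodb:UpdateTable",
                ],
                "Resource": [table_arn, table_arn + "/*"],
            },
        ],
    })
```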

Three-way AWS account S3 Bucket access woes

We have many different AWS accounts here at Remind, and we've found that managing S3 bucket access across three of them is pretty painful. The pain mostly shows up when you have a bucket in accountA, a "writer" to that bucket in accountB, and a reader of that bucket in accountC.
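
Part of why this is painful is object ownership: objects written by accountB stay owned by accountB, so accountA's bucket policy can't grant accountC read access to them on its own. The usual workaround is for the writer to hand ownership to the bucket owner on every put - a hedged sketch with made-up names, and not necessarily what we settled on, as the next item explains.

```python
# Sketch: the writer in accountB grants the bucket owner (accountA) full
# control of each object so accountA can then share it with accountC.
import boto3

s3 = boto3.client("s3")  # credentials for accountB, the "writer"

s3.put_object(
    Bucket="accounta-shared-data",
    Key="emr-output/part-00000",
    Body=b"...",
    # Without this ACL the object stays owned by accountB, and accountA's
    # bucket policy can't grant accountC permission to read it.
    ACL="bucket-owner-full-control",
)
```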

For us this largely arises when EMR (which runs in its own account) writes to a bucket in our production account, and that data then has to be read from a third account that runs another data processing tool. It's bitten us enough times that we've decided to bite the bullet and…

… bring EMR into our production account

The only reason we haven't done this before now is that the default IAM policy has some pretty scary permissions in it. Given the amount of pain this has caused us in the past, we've decided to try to come up with a more limited permission set (using tags, etc., in conditions) to limit the potential damage EMR could cause if it went haywire. This is still very much in progress - hopefully we'll have more info about it in the next update!
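
Since this is still in progress, here's only a sketch of the general shape of a tag-conditioned statement - not the policy we'll actually ship, and the tag key is invented for the example.

```python
# Sketch of a tag-conditioned IAM statement for the EMR role.
EMR_EC2_STATEMENT = {
    "Effect": "Allow",
    "Action": [
        "ec2:TerminateInstances",
        "ec2:CreateTags",
    ],
    # Instead of the default role's blanket EC2 permissions on "*", only
    # allow the destructive calls against instances carrying the EMR tag.
    "Resource": "arn:aws:ec2:*:*:instance/*",
    "Condition": {
        "StringEquals": {"ec2:ResourceTag/remind:emr": "true"},
    },
}
```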

Main API database upgrade (version & storage)

In preparation for Back to School (a crazy time of year for us) we decided to upgrade the very old database behind our main API. The old database was running Postgres 9.3, and over the course of a few evenings we upgraded it to 9.6. We learned a lot about how Postgres/RDS handles upgrades, and will probably write another blog post in the future sharing our findings and the gotchas we discovered.

We also increased the amount of storage allocated to the database. This was easy to kick off over a weekend, since our traffic is very low during the summer, and it could be done with no downtime.
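
For reference, the equivalent boto3 calls look roughly like this - identifiers and values are examples, not our actual instance or however we actually kicked the changes off.

```python
# Sketch of an RDS engine version upgrade and a storage increase.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Major version upgrade. RDS only allows certain jumps at a time, which is
# part of why 9.3 -> 9.6 took a few evenings.
rds.modify_db_instance(
    DBInstanceIdentifier="main-api-db",
    EngineVersion="9.6.2",
    AllowMajorVersionUpgrade=True,
    ApplyImmediately=True,  # major upgrades take an outage window
)

# Storage increase: can run with no downtime, so a low-traffic weekend is a
# good time to kick it off and let it finish.
rds.modify_db_instance(
    DBInstanceIdentifier="main-api-db",
    AllocatedStorage=1000,  # GiB; example value
    ApplyImmediately=True,
)
```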

ECS Task based roles for most of our apps

Earlier in the summer we started the cross-company process of moving all of our apps to ECS task-based roles, with a goal of enabling them in 80% of our applications by the end of summer. It was a big cross-functional effort (once Empire supported it, something we added much earlier), and I'm happy to say we now use ECS task-based roles in all but 2 of our applications, greatly increasing the security of our services!
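
At the ECS level this boils down to a taskRoleArn on the task definition. Empire manages this for us, so we never make the call by hand - the sketch below just shows the moving piece, with example role and image names.

```python
# Sketch of registering a task definition that uses an ECS task role.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.register_task_definition(
    family="acme-inc-web",
    # The containers get credentials from this per-application role instead
    # of from the shared EC2 instance profile.
    taskRoleArn="arn:aws:iam::123456789012:role/production-acme-inc",
    containerDefinitions=[
        {
            "name": "web",
            "image": "remind101/acme-inc:latest",
            "memory": 256,
            "essential": True,
        }
    ],
)
```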

Enabled ECS instance attributes on our Instances

In preparation for Back to School we realized we might want to take our more resource-intensive applications (read: our main API web processes) and isolate them on their own cluster, rather than having them compete with all the other ECS tasks that Empire manages. With placement constraints in ECS this became possible a while back, and we've started the process of adding this functionality to Empire. In the meantime we needed to add some ECS instance attributes to our instances, so that work was done in stacker and should be rolled out soon.

While we don't plan to move the expensive processes proactively before Back to School, we wanted to make sure we had the option to do so if it looked like the increased load was causing resource contention.
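
For reference, here's roughly what the moving pieces look like at the ECS API level - a hedged sketch with made-up attribute names, not our actual stacker change (which sets the attributes at instance registration time).

```python
# Sketch: tag a container instance with a custom attribute, then constrain
# placement to instances that carry it.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Tag a container instance with a custom attribute...
ecs.put_attributes(
    cluster="production",
    attributes=[
        {
            "name": "workload",
            "value": "api",
            "targetType": "container-instance",
            "targetId": "arn:aws:ecs:us-east-1:123456789012:container-instance/example",
        }
    ],
)

# ...so a service (or Empire, once it supports placement constraints) can be
# pinned to only those instances.
placement_constraints = [
    {"type": "memberOf", "expression": "attribute:workload == api"}
]
```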

Started a demo of Gremlin

For a long time we've admired the "Simian Army" that Netflix deploys, and we really love their Chaos Engineering philosophy. We've talked about writing/deploying tools to do this kind of testing internally - but then we found out about Gremlin, Inc. When we initially spoke with the Gremlin team back at the beginning of the summer they didn't have much in the way of Docker support, but in the past month they've pulled together what looks to be really great Docker support (labels! yes!) and we've started testing/demoing the tool. It's a bit early to give a full review of the product, but it's been pretty cool watching some of our engineers break our (staging) environment to discover how to make the user experience better in the case of actual service failures.

That’s it!

There we have it: the first of what I hope will become a regular practice here at Remind. If you have any questions, or want to hear more about any of these, please ping us at @RemindEng on Twitter or leave a comment on the Medium post we'll be creating for this as well. Thanks!