The Remind OpsEng team has “Open Sourced” our monthly status reports. This post briefly describes some of the bigger tasks and projects we have worked on over the past month.
If you want us to elaborate on a specific topic, let us know!
We noticed that the Docker volumes on our ECS container instances were showing unusually high disk latency, sometimes spiking above 30 seconds. We’ve taken a few steps to ensure that disk operations on the Docker volume are faster and more predictable.
While this increased our costs a little, the effects were noticeable in the maximum wait times on disk reads. After the change, disk reads are much more predictable, with fewer outliers.
As part of a series of security improvements, we’ve made some changes to our Docker configuration:
We’ve upgraded Dnsmasq in our AMIs to deal with a recent security incident.
Due to concerns about some poor performance we experienced, we rolled back autospotting in order to eliminate variables (lots of things change in our system daily). To ensure there weren’t any questions about it going forward, we built dashboards to monitor these issues and slowly rolled it back out.
We didn’t experience the issues again – rather the opposite. Our standard operational unit is c4.2xlarge, and since Autospotting treats the previous generation (c3.2xlarge) as a compatible instance type, it scaled them into our autoscaling groups. When this happened we noticed better performance on some metrics.
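As a rough illustration of what "compatible" can mean here, the sketch below treats a candidate instance type as a drop-in replacement when it offers at least as many vCPUs and as much memory as the base type. That rule is an assumption made for illustration, not Autospotting's actual matching logic, and the `InstanceSpec` type and `is_compatible` helper are hypothetical names. The vCPU and memory figures are the published specs for these instance types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    """Hypothetical container for the specs we compare."""
    name: str
    vcpus: int
    memory_gib: float


def is_compatible(base: InstanceSpec, candidate: InstanceSpec) -> bool:
    # Assumed rule: the candidate must not be smaller than the base
    # in either vCPU count or memory.
    return (candidate.vcpus >= base.vcpus
            and candidate.memory_gib >= base.memory_gib)


C4_2XLARGE = InstanceSpec("c4.2xlarge", vcpus=8, memory_gib=15.0)
C3_2XLARGE = InstanceSpec("c3.2xlarge", vcpus=8, memory_gib=15.0)
C4_XLARGE = InstanceSpec("c4.xlarge", vcpus=4, memory_gib=7.5)

if __name__ == "__main__":
    for candidate in (C3_2XLARGE, C4_XLARGE):
        print(candidate.name, is_compatible(C4_2XLARGE, candidate))
```

Under this assumed rule the previous-generation c3.2xlarge qualifies as a replacement for c4.2xlarge, while the smaller c4.xlarge does not.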
New Stacker releases: 1.1.0/1.1.1!
Among other improvements:
AWS released a new Network Load Balancer (NLB), similar to Application Load Balancer but working at layer 4 instead of layer 7. This will seemingly replace classic load balancers in TCP mode, and offer better performance and scalability.
We currently use a classic ELB to load balance Postgres queries to a pool of PgBouncer hosts. Proxying them through a classic ELB did noticeably increase the latency of these queries, so the introduction of a new, higher-performance TCP load balancer was intriguing, and we wanted to see if we could get better query latency with it.
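A comparison like this usually comes down to timing the same operation through each path and looking at the tail, not just the average. The following is a minimal sketch of such a harness, not the tooling we used: `time_calls` and `summarize` are hypothetical helpers, and the callable passed in stands in for whatever runs a query through a given load balancer.

```python
import math
import time
from typing import Callable, Dict, List


def time_calls(operation: Callable[[], object], runs: int) -> List[float]:
    """Run `operation` `runs` times and return each duration in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return samples


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: List[float]) -> Dict[str, float]:
    """Median, p99, and max of the collected samples."""
    return {
        "median": percentile(samples, 50),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Placeholder operation; a real run would issue a query via each path.
    print(summarize(time_calls(lambda: sum(range(1000)), runs=100)))
```

Running the same harness against each path (for example, direct to PgBouncer versus through the load balancer) gives directly comparable summaries.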
Unfortunately, we found that in its current version it’s not possible to set security groups or to attach subnet mappings to an NLB. Instead, the NLB you set up is automatically assigned an internal DNS name per subnet.
This makes it difficult to set up ingress rules to limit access to/from them, since ingress rules require either a source security group or an IP address to set up from CloudFormation.
While there are workarounds to this (allowing access from an internal subnet, or creating a security rule outside CloudFormation by resolving the address), we’ve decided to wait for the ability to set security groups to be released before testing the NLB.
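For reference, the second workaround (resolving the address and creating the rule outside CloudFormation) could look roughly like the sketch below. It is an illustration under stated assumptions and not something we deployed: `resolve_ipv4` and `ingress_permission` are hypothetical helpers, the hostname and port are placeholders, and the returned dictionary follows the `IpPermissions` shape accepted by the EC2 `AuthorizeSecurityGroupIngress` API.

```python
import socket
from typing import Callable, Dict, List


def resolve_ipv4(hostname: str,
                 resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses for `hostname`."""
    infos = resolver(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is an (address, port) tuple.
    return sorted({info[4][0] for info in infos})


def ingress_permission(addresses: List[str], port: int) -> Dict:
    """Build one TCP ingress permission allowing each address as a /32."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": f"{address}/32"} for address in addresses],
    }


if __name__ == "__main__":
    # Placeholder hostname and port, chosen only so the sketch runs.
    addresses = resolve_ipv4("localhost")
    print(ingress_permission(addresses, port=6432))
```

The resulting dictionary could then be passed in the `IpPermissions` list of a call such as boto3's `authorize_security_group_ingress`. The rule would live outside the CloudFormation stack, which is part of why we chose to wait.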
We keep an internal repository of Stacker blueprints where all of Remind’s infrastructure is declared as code, and we encourage every team to propose changes and additions via pull requests to this repository.
In order to improve our workflow, we’ve made use of GitHub’s .github/OWNERS file so the OpsEng team gets automatically added as a reviewer to any submitted PR.
We’ve also set up a .github/PULL_REQUEST_TEMPLATE file with helper boilerplate (nature of the change, security questions, estimated associated costs, etc.) in order to make the process smoother and get the PRs rolling as fast as possible.