Downtime Postmortem

February 08, 2010 ∴ code

Since joining oneforty last summer, lots of things have gone well, but the mistakes we’ve made are usually more educational. The following is an attempt to capture the events that led to a brief site outage and some lessons learned.

A few weeks ago we rolled out an alpha version of our ecommerce platform and the news was covered on a few blogs, including TechCrunch. At roughly the same time (it seemed) there were alerts about the amount of swap space on one or more of our servers. The alerts would typically flap, rising to a warning and then returning to normal levels. I figured the two events were related and that the alerts were due to increased traffic, but not a serious issue.

Later in the evening as the alerts continued I investigated the situation. The site is built on Rails, running in Passenger and hosted on Engine Yard’s EC2-based cloud service. Running passenger-memory-stats on our “application master” instance showed that there were about twice as many Rails processes as there should be, and there was a discrepancy between what passenger-memory-stats showed (total rails processes) and what passenger-status revealed (those that Passenger is actively using). There was less than 15MB of free memory and little swap left due to the stale processes. Not good.
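The comparison I was doing by eye can be sketched in a few lines of Ruby. This is only an illustration: stale_pids is a hypothetical helper, and how you collect the two PID lists (every Rails process on the box versus the ones Passenger reports it is managing) is left out on purpose, since the exact output of those tools isn't shown here.

```ruby
# Hypothetical helper: not part of Passenger or any other tool.
# all_rails_pids:         PIDs of every Rails process found on the machine
# passenger_managed_pids: PIDs Passenger reports it is actively using
# Returns the PIDs that are running but that Passenger no longer manages.
def stale_pids(all_rails_pids, passenger_managed_pids)
  (all_rails_pids - passenger_managed_pids).uniq.sort
end

# Four Rails processes on the box, but Passenger only knows about two.
stale_pids([101, 102, 103, 104], [101, 103]) # => [102, 104]
```

Anything this returns is a candidate for cleanup, though as the next part of the story shows, it pays to check free memory and swap before touching processes on a box that is already struggling.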

Then I put on the straw that broke the camel’s back. While trying to kill one of the stale processes, the machine locked up when it ran out of swap space. The Engine Yard configuration has the “app master” server double as both an application server and the load balancer, through haproxy, to the other application instances. This means that when that instance became unresponsive, the whole site went down. So now the clock is ticking (and I’m swearing to myself).

Engine Yard’s service noticed within 60 seconds that the app master was unresponsive. It automatically killed the existing app master instance, promoted one of the other app clones to be the master and created a fresh app instance to replace the clone. This worked smoothly, except for two issues. When a new instance is created it is added to the load balancer before our gem installation is run, so there is a window of time when it would throw 500 errors. The EY flow of specifying required gems is through their web interface, instead of in our application’s git repository. This is less than ideal (and it appears it might change soon), and we hadn’t yet invested in a better workaround. Not wanting to wait for the gems to be installed, I terminated the newly booted clone.

Once the new app master was promoted, the site was back alive. The second problem was that EY doesn’t automatically update the memcached config on each app server when an instance is terminated (also a known issue), so we were suffering increased cache misses that made the site very slow. I fixed the memcached config issue manually and the site was back to full functionality. Total damage was about 10 minutes of downtime, and another 10 minutes of slow-to-unusable site performance.

Lessons Learned

I’m a fan of the idea of proportional investment when reacting to problems like this. The first instinct of most engineers, myself included, is that we need to build a sophisticated monitoring system, remove all single points of failure and have the site failover to redundant systems. Those are good goals, and maybe you eventually get there, but not until that level of investment is truly called for. Instead, we’ve taken the following steps:

Signed up for a more robust uptime monitor, Pingdom, for better email/sms alerts.
Fixed the issues causing stale processes. Initially it wasn't clear what was causing them to hang around after a deploy. The first step was to write a quick capistrano task that would kill any stale processes detected during deploys. This at least addressed the symptom. After more research (and a helpful pointer from EY's Ezra) it became clear that the cause was an interaction between the way the A/B testing framework vanity handles redis connections and Passenger's forking model. A patch to vanity fixed the underlying problem by stopping it from accidentally sharing a redis connection between processes. (Passenger's model has real advantages but alters the "shared nothing" assumption many components make.)
Working to get to a point where alerts and notifications do not become background noise. When they do it is too easy to ignore them and miss real issues. I think this always sounds easier than it is. Webapps have a lot of moving parts and receive many odd requests that can trigger alerts from machines, exception trackers and performance monitoring tools. There will be ongoing work to find the right thresholds and to address issues as they crop up.
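For anyone curious about the connection-sharing problem mentioned above, here is a minimal sketch of one common way to avoid it: remember which process opened a connection, and open a fresh one if the current process is a different one (that is, a forked child). This is not the actual vanity patch and not redis-rb's API; ForkSafeConnection and the block passed to it are hypothetical names used only for illustration.

```ruby
# Hypothetical wrapper for illustration only.
# It caches a connection together with the PID of the process that opened it.
class ForkSafeConnection
  # The block is whatever opens a real connection (for example, a redis client).
  def initialize(&connect)
    @connect = connect
    @owner_pid = nil
    @connection = nil
  end

  # Returns a connection owned by the current process. If the cached one was
  # opened by another process (the parent, before a fork), open a new one
  # instead of sharing the parent's socket.
  def connection
    if @connection.nil? || @owner_pid != Process.pid
      @connection = @connect.call
      @owner_pid = Process.pid
    end
    @connection
  end
end
```

Within a single process, repeated calls to connection return the same object, so the check costs little. After a fork, the child's first call sees a PID mismatch and reconnects. As far as I know, Passenger also provides a starting_worker_process event hook for re-establishing connections after a fork, which may be a better fit when you control the application's setup code.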