Postmortem: 2 hours of downtime on October 23rd, 2014

Features & Updates · Per Fragemann

We encountered roughly two hours of downtime today, from about 5:45am to 7:45am CET (8:45pm to 10:45pm PST, or 1:45pm to 3:45pm Sydney time).

We’re very sorry for the inconvenience it caused. Here’s what happened and what we’ll improve:

Cause and solution
The problem was that we had exceeded the daily budget for our servers, so the application started failing with a “quota exceeded” exception, and users were left with our generic error screen. Increasing the budget was the quick fix that got the servers back up and running.
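We’d also like the failure mode itself to be easier to recognize next time. As an illustration only (a minimal sketch, not our actual code), a servlet filter on App Engine’s Java runtime could catch the over-quota exception and render a dedicated page instead of the generic error screen:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

import com.google.apphosting.api.ApiProxy;

// Illustrative filter (not our production code): if a request fails because the
// daily budget is exhausted, show a dedicated "over quota" page instead of the
// generic error screen, so the failure mode is obvious at a glance.
public class OverQuotaFilter implements Filter {

  @Override
  public void init(FilterConfig filterConfig) {}

  @Override
  public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
      throws IOException, ServletException {
    try {
      chain.doFilter(request, response);
    } catch (ApiProxy.OverQuotaException e) {
      // The App Engine runtime throws this once the daily budget/quota is used up.
      HttpServletResponse httpResponse = (HttpServletResponse) response;
      httpResponse.setStatus(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
      httpResponse.setContentType("text/html");
      httpResponse.getWriter().write(
          "<h1>Temporarily unavailable</h1>"
              + "<p>We have hit our daily resource budget and are working on it.</p>");
    }
  }

  @Override
  public void destroy() {}
}
```

A distinct error page like this also makes it easier for staff and monitoring to tell an over-quota outage apart from other failures.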

Background
We had changed our performance settings a month ago, and we didn’t realize that this increased our daily spend quite a bit. We usually set a daily budget that’s 5 times the average actual spend. But the performance settings increased our spend while we left the maximum budget untouched, so the ratio dropped from 1:5 to roughly 1:2. Everything still worked fine until today, when our servers suddenly misbehaved: after a spike in load they didn’t spin down as usual, but kept running idle for hours. Idle or not, they accumulated costs at a very high rate over several hours, until we finally exceeded the limit.

The error message was very generic, and during the first hour the site was still intermittently available, so it looked like a temporary glitch at first. Only when our US staff realized the problem wasn’t going away did they start calling the Berlin dev team, which was asleep at the time, delaying things further.

The worst part came when the dev team realized that only a single person (me) had access to the billing settings needed to increase the budget, and my phone was on mute. It took some 30 minutes of calls from my coworkers before the vibration finally woke me. This is clearly unacceptable and the most embarrassing part of the story.

Fixes and longer term improvements

As you can see, quite a few things went wrong, and the problem could have been caught or at least mitigated earlier. Things we’ve done or are doing to prevent a similar case:

  • We’re in touch with a Google representative to figure out why our servers were costing us so much more money despite idling. This was the main reason for the downtime, and we need to address it to reduce our bills.
  • We’ve reverted the performance change and doubled our budget, so our safety factor is now 1:10 instead of 1:5.
  • Additional staff can now adjust the daily server budget to provide a quick fix in case a similar problem strikes. We’ve been quite good at removing single points of failure so far, but we entirely missed this one.
  • We’ve added landline numbers to our internal contacts list, since mobile phones are just too prone to being muted, running out of battery, or simply being in the wrong room.
  • We’ll add automatic early warnings for SI staff whenever spend approaches 50% of the daily budget, as soon as our platform supports these kinds of queries. Unfortunately it’s not yet possible to automate this due to Google App Engine limitations (a rough sketch of the check we have in mind follows below). Until it becomes available, we’ll watch our average spend a lot more closely.
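
For what it’s worth, here’s the purely hypothetical sketch of the early-warning check we’d like to run from a scheduled job once spend data becomes queryable. The method names (getDailySpend, getDailyBudget, notifyStaff) are placeholders, not an existing App Engine API:

```java
// Purely hypothetical sketch of the early-warning check we'd like to schedule.
// getDailySpend(), getDailyBudget() and notifyStaff() are placeholders for
// whatever billing/alerting hooks become available, not an existing API.
public class BudgetEarlyWarning {

  private static final double WARN_RATIO = 0.5; // alert at 50% of the daily budget

  public static void checkBudget() {
    double spend = getDailySpend();   // placeholder: today's accumulated spend in USD
    double budget = getDailyBudget(); // placeholder: the configured daily budget in USD

    if (spend >= WARN_RATIO * budget) {
      notifyStaff(String.format(
          "Daily spend at %.0f%% of budget ($%.2f of $%.2f), please investigate.",
          100 * spend / budget, spend, budget));
    }
  }

  private static double getDailySpend() { return 0.0; }    // stub
  private static double getDailyBudget() { return 100.0; } // stub

  private static void notifyStaff(String message) {
    System.out.println(message); // stub: would page or email on-call staff
  }
}
```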

Affected client reimbursement
We don’t provide formal SLAs, but we take these kinds of downtime really seriously and will reimburse affected clients.

If you’re a client and you’ve been affected during an important phase of your project, please let us know at support@small-improvements.com, and we’ll either reimburse you for this month, or extend your license by a free month, whichever you prefer.