Post mortem: Service disruption on Dec 5 2013

By Per Fragemann

The Small Improvements application was unavailable for roughly 8 hours on Thursday Dec 5, starting at 3pm PST and recovering at about 11pm PST. We’re very sorry about this. Here’s what happened and what we’ll do about it.

Starting at roughly 3pm PST, many of our clients encountered our general error screen or a timeout, or were able to log in but then ran into errors while using SI. While some managed to get some work done intermittently, we consider the application to have been effectively unusable for about 8 hours. This was a very painful experience, especially since many of our customers had a Friday deadline, and many end users tried logging in but couldn't complete their work.

The error can be attributed to a service disruption at our data center. We’re using Google App Engine to host our service and our database, and we love the platform for its ease of use, robustness and security features. While no system is perfect and we do encounter occasional hiccups, these hiccups usually last seconds or minutes at most. The last real downtime we encountered was 18 months ago in July 2012.

In this case, the Google memcache service became unavailable for an extended period and generated error messages that bubbled up through the application in a way we didn't anticipate. This caused many of the error messages users were seeing. Also, due to the unavailability of memcache, the entire application became very slow (page load times longer than 20 seconds, which is unusable), and some server instances didn't even manage to start up. We disabled parts of our memcache code to avoid the error pages. This made the situation a bit better, because memcache was occasionally bouncing back and offering good service for a couple of minutes before disappearing again. However, the general slowness and random errors continued.
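For readers curious about the mechanics: the App Engine memcache client lets you choose whether backend errors throw or are swallowed. The snippet below is a minimal sketch of that idea, not our production code (the CacheClient class and getOrLoad helper are made up for illustration). It shows the general approach of logging memcache errors and treating them as cache misses, so a broken memcache degrades performance instead of producing error pages.

```java
import com.google.appengine.api.memcache.ErrorHandlers;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import java.util.logging.Level;

// Illustrative only: class and method names are hypothetical.
public class CacheClient {

    private static final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    static {
        // Log memcache backend errors and carry on as if the key were missing,
        // instead of letting the exception bubble up into page rendering.
        cache.setErrorHandler(ErrorHandlers.getConsistentLogAndContinue(Level.INFO));
    }

    /** Returns the cached value, or loads it from the datastore on a miss or cache error. */
    public static Object getOrLoad(String key, Loader loader) {
        Object value = cache.get(key);   // null on a miss and (with the handler above) on an error
        if (value == null) {
            value = loader.load(key);    // fall back to the datastore
            cache.put(key, value);       // best-effort re-populate
        }
        return value;
    }

    /** Stand-in for whatever actually reads the entity from the datastore. */
    public interface Loader {
        Object load(String key);
    }
}
```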

We reached out to our Google contacts immediately, and they confirmed that they would deal with the problem. It took quite a while for them to react, though, and in the meantime we saw on the forums that a couple of other clients had similar problems with the platform. The situation stabilized at 10pm PST and has been back to normal since 11pm PST. We've been stable for 13 hours as of this writing, and other clients have reported that their apps are back up as well, so we feel the situation is under control at last.

Lessons learned

Improve response time: We initially didn't react as quickly as we should have. We used Intercom to update clients who logged in, responded to support tickets swiftly, and posted an error message into Small Improvements about 2 hours into the problem. But we didn't update our Twitter account for 3 hours, and we only published a blog post around 4 hours into the problem, after our initial attempts at solving it had failed. We should have done both earlier.

Improve resilience of the code: Although our code is pretty resilient to failure, we didn't anticipate this kind of error message in our write-through cache. We have disabled that particular cache for now, and will only re-enable it once it survives that error. We had tested the application without memcache available, and that all worked fine, but we'll add this undocumented error code to our set of test cases.
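To make the write-through idea concrete, here's a rough sketch of the shape such a wrapper might take once it tolerates unexpected memcache errors. It's not our actual code (WriteThroughCache and the reader/writer interfaces are invented for this post); the point is simply that the datastore write happens first, and any runtime exception from memcache is logged rather than allowed to fail the request.

```java
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import java.util.logging.Logger;

/** Illustrative write-through wrapper; names are hypothetical, not our production code. */
public class WriteThroughCache<T> {

    private static final Logger log = Logger.getLogger(WriteThroughCache.class.getName());
    private final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    /** Writes to the datastore first, then tries to update memcache; never fails the request on a cache error. */
    public void save(String key, T value, DatastoreWriter<T> writer) {
        writer.write(key, value);                  // the datastore remains the source of truth
        try {
            cache.put(key, value);
        } catch (RuntimeException e) {             // includes undocumented backend errors
            log.warning("memcache put failed for " + key + ": " + e);
        }
    }

    /** Reads from memcache if possible, otherwise falls back to the datastore. */
    @SuppressWarnings("unchecked")
    public T load(String key, DatastoreReader<T> reader) {
        try {
            Object cached = cache.get(key);
            if (cached != null) {
                return (T) cached;
            }
        } catch (RuntimeException e) {
            log.warning("memcache get failed for " + key + ": " + e);
        }
        return reader.read(key);
    }

    public interface DatastoreWriter<T> { void write(String key, T value); }
    public interface DatastoreReader<T> { T read(String key); }
}
```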

Single Page App: Another thing that will help in the future is that we're gradually moving towards a so-called Single Page App architecture. The advantage is that each user interaction requires fewer database operations on the server on average, so if memcache goes AWOL again, the impact will be much smaller. Instead of rendering a full page that triggers between 5 and 8 database operations (via memcache) on each click, we'll only trigger 2 or 3 db operations each time, which means that without memcache available we're looking at maybe 8 seconds per request instead of 20. Not great, but tons better.
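As a hypothetical illustration of the difference (the servlet and its stub methods below aren't our real API), a Single Page App only calls small JSON endpoints like this one on each interaction, each touching two or three entities, whereas a full page render has to assemble everything on the page in a single request.

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Hypothetical JSON endpoint, for illustration only.
 * A classic full-page render performs 5-8 datastore/memcache reads per click;
 * a Single Page App calls small endpoints like this one, each doing 2-3 reads,
 * so a slow or missing memcache hurts proportionally less.
 */
public class ReviewStatusServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String reviewId = req.getParameter("id");
        String status = loadStatusFromDatastore(reviewId);     // read #1
        String reviewer = loadReviewerFromDatastore(reviewId); // read #2
        resp.setContentType("application/json");
        resp.getWriter().write("{\"status\":\"" + status + "\",\"reviewer\":\"" + reviewer + "\"}");
    }

    // Stubs standing in for the real datastore access.
    private String loadStatusFromDatastore(String id) { return "in-progress"; }
    private String loadReviewerFromDatastore(String id) { return "Alice"; }
}
```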

International engineers: Our engineering HQ is in Berlin. While we have customer support agents in the US and in Australia as well, our tech team is mostly unavailable (or at least very tired) between 1am and 7am Berlin time. We are considering hiring one or two developers outside of Europe who can help us cover more time zones when server problems occur. As mentioned above, the Google App Engine servers are very stable, so this has not been a real problem so far, but this incident means we'll focus on hiring international developers earlier than originally planned.