Postmortem: intermittent mail delivery on July 20th and 21st, 2015

Features & Updates · Alison Winters

On Monday and Tuesday this week, we experienced intermittent mail failures that resulted in many notifications from Small Improvements not being delivered. The failures started around 1:00pm CEST and continued until around 5:00pm CEST the next day (July 20th, 4:00am PDT – July 21st, 8:00am PDT).

We are sorry for any inconvenience this caused, and we have taken steps to prevent similar problems in the future.

Cause and solution

The cause of the outage was our application servers hitting a hidden limit that affected only outgoing network traffic, such as email. The application therefore behaved normally for customers visiting the website, but the limit also set up a perfect storm: because outgoing network traffic was blocked, none of the notifications that were trying to report the failure ever reached us.

We have now adjusted the application so that email sent from Small Improvements does not get affected by this particular limit.
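To illustrate the shape of this change: the idea is to route outgoing mail through a transport that is metered separately from the quota-limited one, failing over when the quota is hit. This is a minimal Python sketch under stated assumptions, not our actual code; the class names, the `QuotaExceededError` exception, and the transports are all hypothetical stand-ins.

```python
# Hypothetical sketch: fail over to a separately-metered mail transport
# when the primary one exhausts a hidden daily quota. None of these
# names come from the real Small Improvements codebase.

class QuotaExceededError(Exception):
    """Raised when a transport hits its daily sending quota."""

class SocketSmtpTransport:
    """Stand-in for a transport counted against the low-level
    outgoing-network quota described above."""
    def send(self, message):
        raise QuotaExceededError("daily socket quota exhausted")

class MailApiTransport:
    """Stand-in for a platform mail API metered separately."""
    def __init__(self):
        self.sent = []
    def send(self, message):
        self.sent.append(message)

def send_mail(message, primary, fallback):
    """Try the primary transport; on a quota error, fail over."""
    try:
        primary.send(message)
    except QuotaExceededError:
        fallback.send(message)

primary = SocketSmtpTransport()
fallback = MailApiTransport()
send_mail("performance review reminder", primary, fallback)
print(fallback.sent)  # -> ['performance review reminder']
```

In practice the simpler fix is to use only the separately-metered transport, but the failover form shows why the change removes the single point of failure.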

Background

We are long-time users of the Google AppEngine platform, which allows us to easily scale our application to serve large amounts of data to customers all over the world. The trade-off of this automated scalability is that we have to run our application within a proprietary sandbox. The problem occurred because our application servers exceeded a daily quota applied to that sandbox. The specific limit we hit governs an undocumented, low-level network operation and is not displayed in any of the standard monitoring tools Google provides to its customers.

To make matters worse, this particular limit did not just block email to customers; it also blocked our internal alerting system that ordinarily notifies our development team of any problems. The icing on the cake was that our external logging system was no longer being updated due to the limit, so we did not see any errors there either. This series of unfortunate events resulted in the worst-case scenario where our customers noticed a problem before we did.

Fixes and longer term improvements

As soon as we realized what had happened, we contacted Google and had the limit temporarily lifted, which restored email delivery. Meanwhile our development team went to work on modifying our application to send email using a different mechanism that would avoid triggering the limit in the future.

  • We are continuing to work with Google to find out why this limit only recently began to be triggered.
  • We have already deployed a change to our email subsystem to avoid usage of the quota-limited functionality.
  • We have begun an audit of our application's integrations to ensure that no other feature can be affected by a similar limit.
  • We have tuned the alerts delivered by our fallback error reporting service so that a failure in the email subsystem will still be reported to our developers.
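The last point above can be sketched as a multi-channel alert path: an alert is delivered to every configured channel, so a failure in the email channel cannot silence the others. Again, this is a hedged illustration with hypothetical names, not our actual monitoring configuration.

```python
# Hypothetical sketch: deliver each alert over several independent
# channels so that a broken email subsystem still leaves a working
# path to the development team. All names here are illustrative.

class EmailAlertChannel:
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.delivered = []
    def notify(self, alert):
        if not self.healthy:
            raise ConnectionError("outgoing mail blocked")
        self.delivered.append(alert)

class WebhookAlertChannel:
    """Stand-in for a non-email path, e.g. an external pager webhook."""
    def __init__(self):
        self.delivered = []
    def notify(self, alert):
        self.delivered.append(alert)

def raise_alert(alert, channels):
    """Deliver to every channel; one failing channel must not stop the rest."""
    failures = []
    for channel in channels:
        try:
            channel.notify(alert)
        except Exception as exc:
            failures.append((channel, exc))
    return failures

email = EmailAlertChannel(healthy=False)   # simulates the July 20th state
webhook = WebhookAlertChannel()
failures = raise_alert("mail quota exceeded", [email, webhook])
print(webhook.delivered)  # -> ['mail quota exceeded']
```

The key design point is that the fallback channel shares no infrastructure with the thing it is monitoring.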

We have already contacted the companies that were most seriously affected by the outage. Once more, we sincerely apologize to everyone affected.