Our product Small Improvements is a hosted, lightweight performance review and feedback system. Companies like Atlassian and Quiksilver use it to run their employee performance reviews and 360 degree feedback. It's all hosted on Google App Engine. Speed is crucial: Writing your employee self-assessment is not exactly everyone's favourite activity, and the last thing you want is an unresponsive application!
App Engine is really great, especially in terms of scaling. Load goes up? You automatically get additional servers. Load goes down, they stop. Perfect. All was well until early August 2011, when suddenly everything got really slow. A new release of App Engine changed the way new instances were spun up, and datastore issues complicated matters. Combined, these events exposed how parts of our application were inefficiently coded. It had worked fine before, but not anymore. Here's what we learned.
Although queries are almost as fast as GETs on the local dev server, they are usually a lot slower on the production server. What's worse, when App Engine has a bad day, they degrade so much that every single one of them really hurts. Also, frameworks like Objectify automatically cache every GET in memcache -- but not QUERY results.
Naturally, if you already know an object's ID you will load it by ID (e.g. if you got a person's ID from the URL).
We learned the hard way though that quite often we had been fetching objects via a secondary key rather than by ID (which means a QUERY). We went through our DAOs and removed plenty of those, replacing them with GETs. In some cases, we even decided to change secondary objects to use the *same* uuid for their ID as their parent object, just so we'd always be able to get those secondary 1:1 objects by ID.
GET access on a random day:
QUERY access on the same day:
Quite often, we display lists of data, and have to pull additional information from the database. Think about an Employee list which also lists each employee's manager. Querying the Employees is easy, but how do we display the managers' names efficiently? The naive approach is to GET every boss from the database individually, but that's not really fast: for a list of 100 employees it means up to 100 roundtrips to the database. Admittedly, some people will have the same boss so those GETs are cached by Objectify, but still, it's inefficient to GET 20 bosses one by one too. So after getting the staff list, we iterate over the employees, collect the boss keys into a Set (avoiding duplicates) and fetch them in one batch GET. This is substantially faster.
Now we can store all the bosses in a HashMap, and retrieve them from there while rendering.
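To illustrate, here is a minimal sketch of the collect-then-batch-GET pattern. It is not our production code: `Employee`, `FakeDatastore` and `BossLookup` are hypothetical names, and the in-memory map merely stands in for the real datastore so that the number of roundtrips can be counted.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical employee record; managerId is null for people without a boss. */
final class Employee {
    final String id;
    final String name;
    final String managerId;

    Employee(String id, String name, String managerId) {
        this.id = id;
        this.name = name;
        this.managerId = managerId;
    }
}

/** Hypothetical in-memory stand-in for the datastore that counts roundtrips. */
class FakeDatastore {
    private final Map<String, Employee> rows = new HashMap<>();
    int roundtrips = 0;

    void put(Employee e) {
        rows.put(e.id, e);
    }

    /** One roundtrip, no matter how many IDs are requested. */
    Map<String, Employee> batchGet(Set<String> ids) {
        roundtrips++;
        Map<String, Employee> found = new HashMap<>();
        for (String id : ids) {
            Employee e = rows.get(id);
            if (e != null) {
                found.put(id, e);
            }
        }
        return found;
    }
}

class BossLookup {
    /** Collects the distinct manager IDs and loads them in a single batch GET. */
    static Map<String, Employee> loadBosses(List<Employee> staff, FakeDatastore ds) {
        Set<String> bossIds = new LinkedHashSet<>(); // a Set drops duplicate bosses
        for (Employee e : staff) {
            if (e.managerId != null) {
                bossIds.add(e.managerId);
            }
        }
        if (bossIds.isEmpty()) {
            return new HashMap<>(); // nothing to fetch, so no roundtrip at all
        }
        return ds.batchGet(bossIds);
    }
}
```

While rendering, each row then looks up its boss with `bosses.get(employee.managerId)` instead of going back to the database.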
Our application is quite complex though, and we prefer a modularized approach: We have many self-contained panels, and we try to avoid passing data around. But this also means that if we just queried a user for the employee list panel, another panel may want to display the same person in a different context (e.g. my favourites list). We really don't want to pass the list of users around in the app!
So we added a simple ThreadLocal cache to the DAO. It's really fast, and this way we don't even have to keep objects in a HashMap after batch GETs: We simply fire a batch GET after collecting the bosses' keys, and then the UI just calls the DAO.get() method. The rest is handled transparently by the DAO and its ThreadLocal.
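A DAO with such a per-thread cache could look roughly like the sketch below. The class and method names (`CachingDao`, `prime`, `clear`) are made up for illustration, and the `loader` function is an assumption standing in for the real datastore GET.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical DAO with a per-thread (i.e. per-request) cache in front of a slow loader. */
class CachingDao<K, V> {
    private final Function<K, V> loader; // stands in for the real datastore GET
    private final ThreadLocal<Map<K, V>> cache = ThreadLocal.withInitial(HashMap::new);

    CachingDao(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the value cached for this thread, loading it once if it is missing. */
    V get(K key) {
        Map<K, V> local = cache.get();
        if (local.containsKey(key)) {
            return local.get(key);
        }
        V value = loader.apply(key);
        local.put(key, value);
        return value;
    }

    /** Called right after a batch GET so that later get() calls are served from memory. */
    void prime(Map<K, V> loaded) {
        cache.get().putAll(loaded);
    }

    /** Should run at the end of every request, since servlet containers typically reuse threads. */
    void clear() {
        cache.remove();
    }
}
```

The important design point is the `clear()` call: without it, a later request handled by the same thread could see stale objects from an earlier one.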
It definitely makes little sense to memcache everything, since caches add complexity (you have to clear them whenever anything changes). Using the awesome appstats tool, you will soon see what queries cost you time frequently. Cache those.
In our case, we cache based on the company-ID of people. If someone logs in from Company X, we memcache the request for "give me all employees in this company". The query is just used so often, and subsequent users will find a warm cache, speeding up their page loads. However, when you're the first user of that company, the memcache will be empty.
Now that's a problem! So we eagerly fill caches beforehand:
About 50% of our users enter our performance review application via the login screen, the others use cookies. (Cookies expire after 4 weeks.) If a user comes via the login form, there's a good chance that the login field is prefilled by the browser, OR that the user types it in, and then switches to the password field to fill that in. In either case, this gives us roughly 5 seconds before the user has actually logged in and needs access to stuff that we like to keep in memcache (e.g. the employee list, her favourite coworkers, the dashboard activity list etc). So we fire an AJAX request once we know the login ID, which pre-warms the memcaches for this user. Note that a user may of course enter random garbage, or another user's login ID. So while we warm the memcache, we don't store the results anywhere in the session: the user might not be who they pretend to be.
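The combination of a per-company cache and pre-warming from the login form can be sketched as follows. This is an illustration under assumptions, not our actual code: `CompanyEmployeeCache` and `warmForLogin` are hypothetical names, a `ConcurrentHashMap` stands in for memcache, and the `slowQuery` and `companyOfLogin` functions stand in for datastore lookups.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache of "all employees in this company", keyed by company ID. */
class CompanyEmployeeCache {
    private final Map<String, List<String>> memcache = new ConcurrentHashMap<>(); // stand-in for memcache
    private final Function<String, List<String>> slowQuery; // stand-in for the datastore QUERY

    CompanyEmployeeCache(Function<String, List<String>> slowQuery) {
        this.slowQuery = slowQuery;
    }

    /** Read-through access: only the first caller per company pays for the QUERY. */
    List<String> employeesOf(String companyId) {
        return memcache.computeIfAbsent(companyId, slowQuery);
    }

    /**
     * Pre-warm handler for the AJAX request fired from the login form.
     * It fills the cache but returns nothing, because the person typing
     * the login ID has not been authenticated yet.
     */
    void warmForLogin(String loginId, Function<String, String> companyOfLogin) {
        String companyId = companyOfLogin.apply(loginId);
        if (companyId != null) { // random garbage maps to no company: do nothing
            employeesOf(companyId);
        }
    }

    /** Caches add complexity: this has to be called whenever the company's staff changes. */
    void invalidate(String companyId) {
        memcache.remove(companyId);
    }
}
```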
Naturally, this approach does not work as well for people who log in using a cookie. We do try to run a background task here as well (in the backend, even before we check the cookie token). But the Task only gets to warm the caches before the page actually loads if access is really slow.
Still, improving performance for half your staff (those without cookies) isn't all that bad either.
Some pages are accessed more frequently than others in our application: The dashboard, the Feedback Overview page, and the Performance Review page. Managers also frequently access their team's performance review overview page, and admins frequently check out the user directory. It makes sense to warm the caches for these pages too, once a user has logged in. The very first page you access will have to load its data from the datastore, but once a user logs in, we fire another AJAX request which warms up the memcache for the other 5 pages that you are most likely to visit next.
Upon login, we measure the time for the first queries, and make a decision whether we really can let people continue to the performance management dashboard or if we need an interstitial page. If some datastore accesses take more than 5 times longer than the expected average (e.g. 300ms instead of 60ms) we assume that we're currently seeing a datastore spike, and can't send the user to the dashboard without warming all memcaches first. That's when we send people to our "Please give us some feedback" page. It's a page that asks people how they like our product, and if they have suggestions on how to make it better. This page has little data and just some HTML, so it loads fast. While the user is reading the page, we run the memcache warmup we mentioned above.
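The decision itself boils down to a threshold check. A minimal sketch, assuming a hypothetical `SpikeDetector` helper and hypothetical page names; whether a value exactly at the threshold counts as a spike is also an assumption made here:

```java
/** Hypothetical helper that decides whether to show the interstitial page after login. */
class SpikeDetector {
    private final long expectedMillis; // expected average datastore access time, e.g. 60
    private final int factor;          // how many times slower counts as a spike, e.g. 5

    SpikeDetector(long expectedMillis, int factor) {
        this.expectedMillis = expectedMillis;
        this.factor = factor;
    }

    /** True if any measured access took more than factor times the expected average. */
    boolean isSpike(long... measuredMillis) {
        for (long measured : measuredMillis) {
            if (measured > expectedMillis * factor) {
                return true;
            }
        }
        return false;
    }

    /** Picks the page to send the user to, based on the first measured accesses. */
    String landingPage(long... measuredMillis) {
        return isSpike(measuredMillis) ? "feedback" : "dashboard";
    }
}
```

With `new SpikeDetector(60, 5)`, a single access of 500ms is enough to route the user to the feedback page, while accesses of 50ms and 70ms go straight to the dashboard.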

There's quite a lot of data that we need to memcache, and as you saw above we try to prewarm the memcaches even before a user logs in. But when a memcache has expired, or there was no time yet to load it, or some required data has not been denormalised yet, we may have to resort to graceful degradation. For instance, the dashboard's slowest two operations are retrieving the activity stream, and getting the user's favourites and closest coworkers. The activity stream especially can take up to 2 seconds when not cached and when the datastore is slow. Way too long for the user to see no progress.
In that case, we make an on-the-fly decision not to render the activity list panel directly, but to wrap it in an asynchronous panel (Wicket makes this easy, but you could do it with a few lines of jQuery as well). The page renders fast, and the content gets loaded 2 seconds later - no problem, since the page is responsive already and maybe you didn't care about the activity stream anyway:
Quite often, users will use the application for a few minutes, then do something else, and come back after 30 minutes. By now, the session-based caching is too old and will get reloaded, but there's a good chance that even some of the memcaches have expired. So we render the timestamp of a page, and then a little JavaScript snippet will check if the user returned after more than 5 minutes and fire a little warmup query again. So by the time the user clicks the next link, there's a good chance that an AJAX request has already warmed up at least some of the caches. It doesn't even matter if not all caches managed to warm up, the more the merrier. :)
Whenever we add new features, we usually add them in a quick-and-dirty way. After all, we need user feedback fast, so we learn what's really important and what not. Often, we ditch features again, or at least substantially modify them after getting feedback. That's why we don't optimize early.
But once we know the feature is here to stay, we'll consider how we can denormalise any necessary queries. For instance, we used to query "who are this manager's direct subordinates", which takes time, e.g. 300ms on average. This is only calculated once during login, but still, it makes a login slow. Nowadays we store the direct reports in a "UserReports" object, and recalculate it only when the reporting structure changes. So that's a GET instead of a QUERY, speeding up the login by some 295ms on average. Adds complexity, but makes users happy. Worth it.
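In code, the trade-off looks roughly like this. The sketch is an illustration only: `ReportingStructure` and its methods are hypothetical names, two in-memory maps stand in for the datastore, and the `userReports` map plays the role of the "UserReports" objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a reporting structure with a denormalised "direct reports" view. */
class ReportingStructure {
    private final Map<String, String> managerOf = new HashMap<>();         // person -> manager (source of truth)
    private final Map<String, List<String>> userReports = new HashMap<>(); // manager -> direct reports (denormalised)
    int scans = 0; // counts how often the slow path runs

    /** Slow path, comparable to the QUERY: scan everybody to find one manager's reports. */
    List<String> queryDirectReports(String managerId) {
        scans++;
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : managerOf.entrySet()) {
            if (managerId.equals(e.getValue())) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    /** Write path: change the structure and recalculate only the affected entries. */
    void setManager(String personId, String newManagerId) { // newManagerId must not be null
        String oldManagerId = managerOf.put(personId, newManagerId);
        if (oldManagerId != null) {
            userReports.put(oldManagerId, queryDirectReports(oldManagerId));
        }
        userReports.put(newManagerId, queryDirectReports(newManagerId));
    }

    /** Fast path used at login, comparable to a GET by ID: no scan needed. */
    List<String> getDirectReports(String managerId) {
        return userReports.getOrDefault(managerId, Collections.emptyList());
    }
}
```

The extra work moves to `setManager`, which runs rarely, so the frequent login path no longer pays for the scan.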
I have been thinking classloading is slow as hell for a while now -- some 10 Class.forName() calls taking 20ms on the local dev server, but 200ms to 400ms on the production site. Given that we have thousands of classes, and given that the new 1.5.2 release changed how frequently new instances got spun up, performance was miserable in early August.
Until a post pointed us in the right direction: In fact, it's not the classloading, but the disk access that's slow! We had been deploying our classes "as is" without JARing them up. Once we changed the process, everything started to fly! And surprisingly, the 1.5MB jar file containing the Small Improvements sources and HTML files is faster to deploy than the changed class files ever were. This might have been fixed in a more recent SDK, but if you add some logging you will easily find out if this is biting you too!
I thought by now everyone would have heard about warmup requests, but it turns out I was mistaken. So let's mention them again. You should cram as much initialisation work into your warmup page as you can! Try to initialise all your subsystems and to exercise all crucial code paths you have. Even though classloading is not so much of an issue once you use JARs, it still does take time. In our case, our warmup request exercises the whole application stack: We query some items from the datastore, render some of the core UI components, make some date calculations with Jodatime, and simulate a login. Just to make sure the VM has exercised all code paths, loaded all classes, and started the VM-level optimisations. Oh, and of course we warm some memcaches, but you should have guessed that by now! :)
Defer writes: And then there's writes and puts. We try to defer them if we know that they either take a lot of time, or if they happen while many other things are going on already. For instance, we increase the login counter and store the IP address when someone logs in. This only costs 50ms to 100ms on average, but log in is one of the slowest bits in the application already, since so much other data needs to be collected. Saving 100ms is good. And on a bad day this can grow to 300ms easily (just to increase the counter that is!). So we defer this operation to the task queue. Not that the task queue is without its performance quirks, when the datastore is in serious trouble the Queue seems to be affected as well. But it seems to work better this way.
Asynchronous mails: Sounds like a no-brainer, after all mails can take ages to send or even totally refuse to work. We only send from Tasks these days. But in the beginning we didn't, and of course when it broke, it broke where it hurt most. Registration-confirmation emails, and support-requests. Dang.
Asynchronous queries: Queries can run in parallel. Trouble is, our application is so modularised that we rarely have a chance to run two queries in parallel. We do in three cases, but in most other cases we'd have to pass the Iterables around and that wouldn't be fun. We're not that desperate yet. But of course it would be possible to fire a few queries right at the top of the page, store the Iterable in a ThreadLocal cache, then retrieve it before actually painting the component.
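That last idea could be sketched like this. Note the assumptions: we have not built this, `PrefetchCache` is a hypothetical name, and a plain `CompletableFuture` stands in for the datastore's asynchronous query result.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical per-request holder for queries that were started early. */
class PrefetchCache {
    private static final ThreadLocal<Map<String, CompletableFuture<List<String>>>> PENDING =
            ThreadLocal.withInitial(HashMap::new);

    /** Called at the top of the page: starts the query without waiting for the result. */
    static void prefetch(String name, Supplier<List<String>> query) {
        PENDING.get().put(name, CompletableFuture.supplyAsync(query));
    }

    /** Called by a component just before painting: blocks only if the result is not ready yet. */
    static List<String> await(String name) {
        CompletableFuture<List<String>> future = PENDING.get().get(name);
        if (future == null) {
            throw new IllegalStateException("No prefetch registered for " + name);
        }
        return future.join();
    }

    /** Should run at the end of the request so results do not leak into the next one. */
    static void clear() {
        PENDING.remove();
    }
}
```

The components stay decoupled: they only need to agree on a name such as `"favourites"`, not pass Iterables to each other.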
Right now (Edit: this was in late August), most requests take between 300ms and 800ms, with the odd ones up at 2s, especially those that generate reports on the fly, or when memcache is stale. Commonly used pages like the dashboard load in 300ms though, sometimes even in mere 150ms. Pretty good, given that our application is complex, user-centric (no common HTML that could be cached!), strongly componentised, and given that Wicket would be considered a "heavy" web framework which does lots of magic in the background.
It's difficult to say which was the single most important change. The simplest and most fun one was the JAR issue though. Also, maybe some of the suggestions won't work for you, or there are better ways to achieve the same goals. Use at your own risk.
Also note that we had been profiling the application for CPU speed before, and we've been optimising memory consumption too. Not to mention all the endless hours spent fine-tuning the CSS. So the above App Engine specific suggestions are just one of many steps to make an application feel responsive.
Thanks to the Highscalability website for linking us! A quick update on the current situation might be interesting as we get tons of visitors as we speak.
Since we wrote this some weeks ago, overall GAE performance has been improving, which means our optimisations paid off well. The GAE dashboard reports average response times of some 100ms for dynamic requests (but that also includes AJAX requests that don't do much). What's more interesting is our pingdom report IMO:
Note that it is saying 562ms because it takes the horrible August month into account. This week, we averaged at 390ms. Okay, 400ms+ on average is obviously still way way slower than Facebook, Twitter etc, but given that we're a tiny company, don't have performance engineers, are totally self-taught and did this on the side next to much bigger projects, hey, it's a good start! :) While speed can kill usability, it's not the only factor either. Back to that 360 degree feedback redesign job...