
Reverse proxy cache invalidation service

Liip is currently working on a news site. As news is all about being up to date, but still managing to serve a large number of users with millisecond response times, we obviously run into a bit of a dilemma. We can use Varnish to cache the content, but then we need to use a relatively short cache timeout or we risk not getting updates to our users quickly enough. A better approach is to use invalidation, where we can set a relatively long cache time but still ensure that no stale content is served. At Liip we provide a considerable innovation budget that is available to all employees, so I figured it would be cool to use this to create a solution that allows us to move to cache invalidation for this site.

So far so good. Now the tricky bit in all of this is knowing which pages need to be invalidated if an article changes. The content of each article can appear not only on the article page itself, but also on various overviews, search results and even other articles that reference it. Now there is a nice standard for exactly this scenario called LCI. While it is not yet implemented in Varnish, there are ways to make it happen already by using a regexp when sending a purge.

With this approach we can quite elegantly handle the invalidation, provided we somehow manage to tell Varnish which articles were used when generating any given page, so that the regexp can do its job when purging. For this purpose we are considering creating some kind of listener that picks up which article ids were returned from the backend and then automatically adds a custom header to all responses, which can then be used in the purge regexp to determine which pages should be purged. As we are using Symfony2, we will need to add some logic to the layer that communicates with the backend REST API and a Response listener to add the header. This way no additional code is required in the controllers to leverage this approach. Sweet!
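As a rough illustration, a minimal sketch of such a response listener could look like the following; the ArticleIdCollector service and the X-Article-Ids header name are made up for this example, and the listener would be registered on the kernel.response event:

```php
<?php

use Symfony\Component\HttpKernel\Event\FilterResponseEvent;

/**
 * Adds the ids of all articles used while rendering the current page as a
 * response header, so Varnish can later match them in a purge regexp.
 */
class ArticleIdResponseListener
{
    private $collector;

    public function __construct(ArticleIdCollector $collector)
    {
        // hypothetical service filled by the layer talking to the backend REST API
        $this->collector = $collector;
    }

    public function onKernelResponse(FilterResponseEvent $event)
    {
        $ids = $this->collector->getArticleIds();
        if ($ids) {
            $event->getResponse()->headers->set('X-Article-Ids', implode(',', $ids));
        }
    }
}
```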

In this context one might start to wonder whether such a regexp wouldn't take too much time to execute if one has a lot of data cached in Varnish. While we haven't done tests, from my reading it seems like Varnish should be able to handle this quite nicely. Basically Varnish maintains a purge list (I think in 3.0 it's called the ban list). Before returning anything from the cache, the entire purge list is checked against the item in the cache. If the item is marked to be purged, it is purged, causing a cache miss. Obviously, if the purge list gets longer it will take longer and longer to determine whether an item in the cache can actually be returned. Now the good news is that Varnish has a separate thread that keeps working on shortening the purge list. Basically it continuously scans through the entire cache, applying the purge list. Whenever it knows that a purge list item has been checked against the entire cache, it removes that entry from the list. This way the list should never grow too large.
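To give an idea of what sending such a purge could look like from PHP, here is a minimal sketch; the PURGE method, the X-Purge-Regex header and the Varnish address are assumptions that only work together with a matching rule in the VCL that bans all cached objects whose X-Article-Ids header matches the submitted regexp:

```php
<?php

/**
 * Ask Varnish to invalidate every cached page that used the given article.
 * Assumes the VCL translates this request into a ban/purge that matches the
 * regexp against obj.http.X-Article-Ids.
 */
function purgeArticle($articleId, $varnishUrl = 'http://127.0.0.1:6081/')
{
    $ch = curl_init($varnishUrl);
    curl_setopt_array($ch, array(
        CURLOPT_CUSTOMREQUEST  => 'PURGE',
        CURLOPT_HTTPHEADER     => array(
            // matches e.g. "X-Article-Ids: 12,42,117"
            'X-Purge-Regex: (^|,)' . (int) $articleId . '(,|$)',
        ),
        CURLOPT_RETURNTRANSFER => true,
    ));
    $result = curl_exec($ch);
    curl_close($ch);

    return false !== $result;
}
```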

In our setup things are even more complicated, as there can be any number of frontend applications reading data from the central backend. Obviously the backend shouldn't have to know about the frontends and how they cache their content. Some may choose to use no caching at all, some might be entirely OK using a TTL, while others will need the invalidation described above. The solution we are looking into is a message queue such as RabbitMQ to which the frontends can subscribe consumers. The queue just gets informed when an article changes and the subscribers can then choose what to do.
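As a sketch of what such a consumer could look like on a frontend that opts for invalidation, here is an example using php-amqplib; the exchange name, the queue setup and the JSON message format are assumptions made for this example:

```php
<?php

use PhpAmqpLib\Connection\AMQPStreamConnection;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel    = $connection->channel();

// assumed fanout exchange the backend publishes article changes to
$channel->exchange_declare('article', 'fanout', false, true, false);
list($queue, , ) = $channel->queue_declare('', false, false, true, false);
$channel->queue_bind($queue, 'article');

$channel->basic_consume($queue, '', false, true, false, false, function ($msg) {
    $data = json_decode($msg->body, true);

    // this frontend chooses to purge; other frontends could simply ignore the message
    if (in_array($data['action'], array('update', 'remove'))) {
        purgeArticle($data['id']); // see the purge helper sketched above
    }
});

while (count($channel->callbacks)) {
    $channel->wait();
}
```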

We are using PHPCR in the backend to store the data in Jackrabbit. Right now we do not yet have support for JCR's observation concept in PHPCR, but as we have listener support inside PHPCR ODM, we can leverage this for notifying the message queue whenever an article is updated or removed. Additionally, we can also send a message when a new article is added, as some frontends may want to use this information to determine whether a newly added article should be added to an overview page. Obviously in this case the frontend will not be able to send the normal regexp purge request, so the frontends would likely need to set an additional header for overview pages to enable purging these. So in this scenario, for added articles the frontend consumer could construct a different purge request to, for example, purge all overview pages that match the article's publishing day and category.
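A minimal sketch of what such a PHPCR ODM listener could look like is shown below; the Article document, the producer wrapper and the message format are made up for this example, and depending on the PHPCR ODM version the event argument exposes the document via getObject() or getDocument():

```php
<?php

/**
 * PHPCR ODM lifecycle listener that pushes a message onto the queue whenever
 * an Article document is added, updated or removed.
 */
class ArticleChangeListener
{
    private $producer;

    public function __construct($producer)
    {
        // hypothetical thin wrapper around a php-amqplib channel
        $this->producer = $producer;
    }

    public function postPersist($args) { $this->notify('add', $args); }
    public function postUpdate($args)  { $this->notify('update', $args); }
    public function postRemove($args)  { $this->notify('remove', $args); }

    private function notify($action, $args)
    {
        $document = $args->getObject(); // getDocument() on older versions

        if (!$document instanceof Article) {
            return;
        }

        $this->producer->publish(json_encode(array(
            'action' => $action,
            'id'     => $document->getId(),
        )));
    }
}
```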

So as a summary, this is what we need to implement:

  • Implement a listener or observer in the backend that sends a message whenever an article is added, updated or removed
  • Implement a message queue that can receive these messages
  • Provide some helper code to generate purge requests based on these messages that can be used to write message queue consumers
  • Implement a listener that automatically adds the article ids used to generate any given page as a header in the response
  • Implement a VCL script to handle the purge requests and that strips the custom headers before sending the content to the browser

Once this is implemented, we can aggressively cache all content for hours or even days while still being able to almost instantaneously provide updated content on any number of frontends. BTW, Liip's innovation budget encourages open source development, so if my proposal gets accepted the entire world should be able to reap the benefits.

Update: Here is a blog post that explains in great detail how Varnish manages the purge/ban list.

Comments



Re: Reverse proxy cache invalidation service

Well done, cache channels are the final solution for cache invalidation.

Nottingham had such a great idea...

Re: Reverse proxy cache invalidation service

Hi Lukas,

your link to Liip is not working correctly. Thanks for the nice article, there is a lot of important information that I now have to read up on (PHPCR is interesting). I also have to check whether my reverse proxy nginx is able to invalidate/purge cache entries the way Varnish does it.

Re: Reverse proxy cache invalidation service

Thanks, fixed the link. As for Nginx vs. Varnish, we are actually using both. Varnish is just a lot more powerful when it comes to caching, especially as it supports ESI. Right now we are actually using Nginx <-> Varnish <-> Nginx, since we are still on Varnish 2.x which does not support Gzip compression with ESI. One of these days we will upgrade to Varnish 3.x, which will allow us to drop the additional Nginx in front.

Re: Reverse proxy cache invalidation service

This is probably my favourite subject so it's good to see other people are working on this as well. I think LCI is a pretty smart way to communicate with the cache via mechanisms already in place, but I don't see how it tracks resources so you know which cache entries to invalidate.

For instance, given a new comment on a news post, how does it know to invalidate not only the comments overview, but also the most recent comments and the RSS feed? My own solution to this problem involves adding code to my controllers that tracks all resources on the site, so you saying you don't need to add code to the controllers has me very curious.

I'd love to get some details on that so I can figure out if I can steal some ideas for my own solution. :)

In the nature of sharing, I put up my solution on github here: https://github.com/mfjordvald/Evil-Genius-Framework

The relevant parts are https://github.com/mfjordvald/Evil-Genius-Framework/blob/master/system/core/cachetracker.php for the backend logic.

and https://github.com/mfjordvald/Evil-Genius-Framework/blob/master/system/controllers/news.php and https://github.com/mfjordvald/Evil-Genius-Framework/blob/master/system/libraries/cachetest/news.php for the code required in controllers/libraries.

Re: Reverse proxy cache invalidation service

Well there are two types of invalidations:
updated/deleted items:
For these all I need is a listener that is able to pick up which article ids are read from the backend, which are then added to the response headers via yet another listener. This way, if the given article gets updated or deleted, Varnish will be able to find the cache items that used the given article during generation. So here all I need are two listeners, entirely independent of any controller.

added items:
Now these are more tricky. Adding items will be relevant for cache invalidation of overview pages, i.e. this new article should be shown on the front page. For this our invalidation service will need to be able to determine which overview this article should (potentially) be shown on. In order to be able to purge the given overview page, I assume we will indeed need to add the relevant header inside the controller responsible for generating the given overview page.
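For example, the overview controller could tag its responses with something like the following; the X-Overview-Tag header name and the tag format are just an idea, and the consumer for added articles would then send a purge whose regexp matches the tag:

```php
<?php

use Symfony\Component\HttpFoundation\Response;

class OverviewController
{
    public function indexAction($category, $day)
    {
        $response = new Response($this->renderOverview($category, $day));

        // e.g. "overview-sports-2011-11-21", so that an added article in the
        // "sports" category published on that day can purge this page
        $response->headers->set(
            'X-Overview-Tag',
            sprintf('overview-%s-%s', $category, $day)
        );

        return $response;
    }

    private function renderOverview($category, $day)
    {
        // rendering of the actual overview page omitted
        return '...';
    }
}
```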

Another topic that I ignored here is caching of search pages. Here the above approach probably doesn't work. Not sure yet if we will even bother with cache invalidation there at all. I think for now we will just accept a relatively short cache TTL for search pages. I can think of some approaches, but I am not sure if they are worth it, especially since I do not yet have a feel for how large our purge list will be on average.