Acunote is online project management and Scrum software. Acunote is fast and easy to use. It shows actual progress, not just wishful thinking. Click here to learn more.
« Back to posts

The Road to Hell Is Paved With Caching

In Acunote we are heavily using fragment caching. We cache whenever we can and whatever we can. Sounds like a great idea to do? Well, almost... Sometimes caching hurts and here's why.

Prehistory

Acunote is a project management tool and therefore the most prominent page in Acunote is the task list in the current sprint. The page itself shows some information specific to the user who logged in and some information that is shared among the users within the user's organization.

We can cache the shared part of the page in the fragment cache taking date and page filters into account. The code roughly looks like this:

<%- cache(:part => "#{filters}_#{DateUtils::today}") do -%>
    <%= render :partial => '...' -%>
<%- end -%>

Using filesystem caching store we end up having this (simplified) cache directory layout:

- organization1.acunote.com
  - sprints
    - show.part=filter1.date.cache
    - show.part=filter2.date.cache
    - show.part=filter3.date.cache
- organization2.acunote.com
  - sprints
    - show.part=filter1.date.cache
    - show.part=filter2.date.cache
    - show.part=filter3.date.cache

Problem

Problems start when we try to expire caches. We can't expire by url or by hash because we want to expire all cache parts (for all possible filters and for all dates). Therefore we use regexp:

expire_fragment(%r{show.part=.*})

That turned into a problem in production. The reason is that expire_fragment in Rails searches the whole cache directory recursively and removes files with names matching the regular expression. On machines with local disks that's usually fast, but on network filesystems or on shared disk cluster filesystems (like GFS in our case) stat'ing lots of files can be really slow. To make things worse, the performance of stat call on GFS really depends on the phase of the moon or something like that ;)

Our logs showed that operations that would usually take less than a second took on occasion took 70 seconds! They were so slow so I could actually log in to our production servers, find the process taking too much time to complete and attach strace to it. Strace blessed me, immediately showing multiple stat's for files in the cache directory.

Solution

Longer term we'll switch to a more scalable fragment caching strategy. There is a good reason why high-performance caches do not support fancy invalidation, and there are known way to code to their more restricted interface. Meanwhile, immediate solution was really simple. We only had to expire fragments in specific directories and to make that easier, I wrote a little extension to Rails' fragment caching code:

module ActionController
  module Caching
    module Fragments

    #dir is the cache path relative to the cache root
    def expire_matched_fragment_in_dir(dir, regexp, options = nil)
      return unless perform_caching
      self.class.benchmark("Expired fragments in dir matching: #{regexp.source}") do
        fragment_cache_store.delete_matched_in_dir(dir, regexp, options)
      end
    end

    class UnthreadedFileStore

      def delete_matched_in_dir(dir, matcher, options = nil)
        path = @cache_path + dir
        return unless File.exist?(path) #it's ok to not have the cache dir
        search_dir(path) do |f|
          if f =~ matcher
          begin
            File.delete(f)
          rescue SystemCallError => e
            # If there's no cache, then there's nothing to complain about
          end
          end
         end
        end
      end

    end

  end
end

#include only if you use cache test plugin
module Cosinux
  module FragmentCacheTest
    class TestStore
      def delete_matched_in_dir(dir, matcher, options = nil)
        @deleted_matchers.push(matcher)
      end
    end
  end
end

Conclusions

Monitoring the performance of production application is sometimes even more important than profiling it on the development machines. We already had monitoring system in place so we could detect when our application got slower, figure out the root cause and and fix the problem.

Another thing I highly recommend is of course strace. You may indeed see that your application is doing something so weird you'd never think of just looking at the code or at the profiler. So, I now declare strace to be officially added into my own arsenal of invaluable profiling tools.