The Battle For Performance
Rails is slow. Your rails application which does fairy simple database operations will be born slow and will eventually become slower and slower while still not doing anything fancy. That's like getting a new powerful locomotive (on Rails) together with a free bonus of overladen freight cars.
Having said that, I can virtually hear the herd of voices screaming "dude, optimize your own application and don't blame Ruby and Rails for everything". Sure thing, our application isn't perfect and I do believe it can be optimized but you can't win the performance battle just by that.
The winning tactics for this battle is guerrilla tactics. It turns out you can guerrilla-patch (aka monkey-patch) Ruby and Rails and easily get up to 3x speedup in some cases. We started doing that to our own application - Acunote and got amazing results so far. Read on to find out how.
Ruby-Prof: The Guerrilla's Weapon
Acunote, our pet rails application, often need to copy a bunch of ActiveRecord (AR) objects. As any other project management tools, our application works with tasks organized into iterations (sprints). Sometimes you want to copy tasks from current iteration/sprint to the next iteration/sprint. And boy, that copy can become really slow. As an example, let's look at the performance and profiling data of one request to copy 120 tasks.
Let's measure 3 things:
- the real time request takes in production environment;
- processing time profile information using the excellent ruby-prof profiler by Charlie Savage and Shugo Maeda;
- memory profile information gathered with patched ruby-prof (I took Sylvain Joyeux's memory allocations measurement mode patches and hacked together my own ruby-prof memory size measurement mode).
So, for "copy 120" request we get:
Request time (without profiler): 13.6 sec.
Time:
 Total: 16.78
 %self     total     self     wait    child    calls  name
 18.36      3.34     3.08     0.00     0.26    20766  Kernel#clone
  ...
Allocated Memory Size (in KB):
 Total: 1 001 730.00
 %self     total       self     wait    child    calls  name
 61.87  646795.00 619745.00     0.00 27049.00    20766  Kernel#clone
  ...
"WTF" politely describes my initial reaction when I first saw that. How on earth we can spend 17 seconds and consume 1 Gigabyte of memory by just copying 120 ActiveRecord objects? Ok, ok, we not only copy them but store some more information to the database like task tags, copy events and so on. Therefore we end up saving actually 360 AR objects instead of 120. Also the memory profiler is inaccurate and usually shows a bit more memory than the process actually takes.
Still, what is going on during those 17 seconds? Why do we need a gig of RAM? Gee, on the same machine I can compile the whole ruby interpreter in 57 seconds and the compiler will never take more than 130M...
Heh, let's calm down and see what profiler says. The obvious suspect is Kernel#clone method. The (stripped) call tree for it as reported by ruby-prof is
Task#save
...
ActiveRecord::Base#create_without_callbacks
ActiveRecord::Base#attributes_with_quotes
ActiveRecord::Base#attributes
ActiveRecord::Base#clone_attributes
ActiveRecord::Base#clone_attribute_value
Kernel#clone
From Charlie's blog we know that AR::Base::attributes method is evil. It clones attributes before returning them and Charlie gave a wise advice to not use it. And indeed in your application is a good idea to call, for example, task['foo'] instead of task.foo or task.attributes['foo']. But here AR::Base::create itself does the evil thing.
Let me cite Rails code (from rails/activerecord/lib/activerecord/base.rb):
def update
    connection.update(
        "UPDATE #{self.class.table_name} " +
        "SET #{quoted_comma_pair_list(connection, attributes_with_quotes(false))} " +
        "WHERE #{self.class.primary_key} = #{quote_value(id)}",
        "#{self.class.name} Update"
    )
end
def create if self.id.nil? && connection.prefetch_primary_key?(self.class.table_name) self.id = connection.next_sequence_value(self.class.sequence_name) end self.id = connection.insert( "INSERT INTO #{self.class.table_name} " + "(#{quoted_column_names.join(', ')}) " + "VALUES(#{attributes_with_quotes.values.join(', ')})", "#{self.class.name} Create", self.class.primary_key, self.id, self.class.sequence_name ) @new_record = false id end
def quoted_column_names(attributes = attributes_with_quotes) attributes.keys.collect do |column_name| self.class.connection.quote_column_name(column_name) end end
def attributes_with_quotes(include_primary_key = true) attributes.inject({}) do |quoted, (name, value)| if column = column_for_attribute(name) quoted[name] = quote_value(value, column) unless !include_primary_key && column.primary end quoted end end
Great! AR::create clones all attributes twice (first with quoted_column_names and then with attributes_with_quotes). AR::update is a nicer guy, it clones attributes only once. For each of those 120 copied tasks we use AR::create twice and AR::update once. Therefore we call clone_attributes not less than 12022+120 = 600 times.
Profiler says we lose 3 seconds and 650M memory because of that:
ActiveRecord::Base#clone_attributes:
Time:          3.21 sec
Memory:  650 301.00 K
Ok. Time to employ our promised guerrilla tactics, but first let's talk about memory.
Lyrical Digression: Why Do We Care About Memory?
While guerrilla warriors take a nap before the battle, let's digress and look at why we measure the size of allocated memory. Sure thing, allocating literally tons of memory is bad. But we're using Ruby and Ruby has its own guerrilla warrior which strikes our application when it innocently eats its tasty memory.
The name of our enemy is "Ruby Garbage Collector" and the cunning enemy it is. As ruby's gc.c source says and Why the Lucky Stiff explains, ruby GC kicks in when you allocate more memory than is available on the heap and heap boundary is 8M. When we sequentially allocate 650 megabytes there's a chance that GC will be called 650/8=81 times. Each GC call accounts for approximately 70-80ms which in our case should add roundabout 5 seconds to the running time.
Let's do some math... Kernel#clone itself takes 3 seconds and we expect garbage collection to take 5 more seconds. In total, 8 seconds are wasted because of cloning. See the trend? The more memory we allocate the worse our performance is.
Here's the morale: "watch your memory". Large memory consumption costs you more than you'd otherwise think because of garbage collection.
Guerrilla patching AR::create and AR::update
Now it's time to employ our guerrilla tactics and fix ActiveRecord. It's easy to rewrite attributes_with_quotes and avoid cloning the attributes:
module ActiveRecord
    class Base
    private
        def attributes_with_quotes(include_primary_key = true)
            result = {}
            @attributes.each_key do |name|
                if column = column_for_attribute(name)
                    result[name] = quote_value(read_attribute(name), column) unless !include_primary_key && column.primary
                end
            end
            result
        end
    end
end
Now let's put that to environment.rb and see what we get:
Request time (without profiler): 6.7 sec.
Time:
 Total: 9.5
 %self     total     self     wait    child    calls  name
  4.84      0.69     0.46     0.00     0.23    24023  ActiveRecord::ConnectionAdapters::Quoting#quote
  ...
  2.22      0.52     0.18     0.00     0.34     2779  <Module::Benchmark>#realtime
  ...
  0.00      0.00     0.00     0.00     0.00        6  Kernel#clone
  ...
Allocated Memory Size (in KB):
 Total: 350 966.00
 %self     total       self     wait    child    calls  name
 34.31  125655.00 120420.00     0.00  5235.00     2779  <Module::Benchmark>#realtime
  ...
  0.00       0.00      0.00     0.00     0.00        6  Kernel#clone
  ...
Ok, much better: 650 megabytes of memory saved with that little monkey patch! We estimated that will save 8 seconds. Let's look:
Original request time:          13.6 sec
Request time after AR patch:     6.7 sec
Time saved:                      6.9 sec
It turns out our estimation was almost correct. We indeed saved not only the time taken by Kernel#clone but also the time spent by garbage collector and gained 2x speedup for free. Nice! Our guerrilla tactics clearly works and we can for sure do better than 2x.
Guerrilla-patching Benchmark.realtime
Next suspect, as seen from memory profile above, is Benchmark#realtime. From time profiler the method doesn't look bad. We spend only 180 milliseconds in it but look at memory profiler! It 180ms our Benchmark#realtime friend allocates 120 megabytes of memory. Guess what? Ruby garbage collector will be happy to kick our ass again at least 15 times and that estimates to about 1 second loss.
Let's see what's wrong in benchmark.rb:
def measure(label = "")
    t0, r0 = Benchmark.times, Time.now
    yield
    t1, r1 = Benchmark.times, Time.now
    Benchmark::Tms.new(t1.utime  - t0.utime, t1.stime  - t0.stime,
        t1.cutime - t0.cutime, t1.cstime - t0.cstime, r1.to_f - r0.to_f, label)
end
def realtime(&blk) Benchmark::measure(&blk).real end
Great, we calculate lots of useless information (like cpu time, user time, etc) in Benchmark#measure and then just throw it away. Let's do better, simpler and more memory efficient:
module Benchmark
    def realtime
        r0 = Time.now
        yield
        r1 = Time.now
        r1.to_f - r0.to_f
    end
    module_function :realtime
end
That's it. No extra object creation. And here's our income:
Request time (without profiler): 5.8 sec.
Time:
 Total: 8.63
 %self     total     self     wait    child    calls  name
  4.29      0.69     0.46     0.00     0.23    24023  ActiveRecord::ConnectionAdapters::Quoting#quote
  ...
  0.35      0.19     0.03     0.00     0.16     2779  <Module::Benchmark>#realtime
  ...
Allocated Memory Size (in KB):
 Total: 225 668.00
 %self     total       self     wait    child    calls  name
 43.78   98805.00  98805.00     0.00     0.00     5280  Kernel#binding
  0.00    1918.00      2.00     0.00  1916.00     2779  <Module::Benchmark>#realtime
  ...
Whoa! 120 megabytes and 1 second were saved with 9-liner guerrilla patch. Nice, isn't it? Instead of original 13.6 sec our request now runs in 5.8 sec. That's more than 2x improvement already and we can do even better, but I'll leave that for the next blog.
When Guerrilla Tactics Help
Nothing with Ruby and Rails is wrong until you do things in the loop. For example, Benchmark#realtime doesn't have any impact if all you do is save one AR object. You just allocate 25 extra objects that take 45 extra kilobytes. But you're doomed once you do that in a loop. Do the benchmark 200 times and you'll guaranteed the garbage collector will happily shoot you in the back.
Also bear in mind that garbage collector should not be running while you profile your code. With GC you'll never get repeatable results because it will kick in at different times randomly increasing the time and memory consumption of your methods.
Make Your Application Fast: Optimize MEMORY
Let me reiterate, by optimizing memory usage instead of (or in parallel with) processing time you can get significant performance improvement, and guerrilla patching memory-inefficient bits is the right way to make Ruby and Rails faster I think we're on the right track here. Memory is what we need to optimize. Memory is what wasn't optimized before. And you've just seen what we can get by optimizing memory.
We (Acunote devs) will continue the memory profiling and will blog as soon as we get new results. Our previous blog post briefly summarize our current results, be sure to read it if you haven't done that yet. Meanwhile, don't wait for us, grab ruby-prof, my memory profiler patches and dig by yourself ;) I bet you'll find more big memory eaters and when you do so, please drop us a note. Let's collect our guerrilla patches and then file them upstream.
