My previous post about MagLev and the planning of the next Ruby shootout received a lot of attention. MagLev’s speed claims have met with plenty of skepticism: many believe its impressive figures are due to a combination of clever optimization of trivial tests and incompleteness. The skepticism is understandable. Very bright people have been working on alternative VMs for years, and here is a new product that shows up after only three months and claims to be far faster than anything seen before.
Except that it’s not entirely new. What makes it credible that they may be onto something and deliver a faster implementation is that they are leveraging decades’ worth of Smalltalk experience, during which Smalltalk VMs faced similar development challenges. Ruby and Smalltalk are family; there is no inherent reason why Ruby has to be dramatically slower than certain Smalltalk implementations, and parsing and compiling Ruby code into “Smalltalk-ish” bytecode is not the hardest thing to do. So, from a certain perspective, MagLev is 30 years old, not 3 months old. I’m enthusiastic about the VM because I think MagLev is promising, but don’t let people tell you that I’m naive. Despite the fact that MagLev is incomplete, I want to challenge it so that we can verify and clarify what kind of speed improvements are really on offer at this stage. Let’s investigate how.
One very valid point raised by several people, both in my comment section and on Slashdot, is that many of the benchmarks employed so far are not very useful. Some of them are meaningless, not because of the usual “micro-benchmarks must be taken with a grain of salt” logic, but because they give VMs an opportunity to optimize them out entirely. When a lazy/smart VM realizes that a given loop produces no results that are ever used, it can simply skip it, giving the impression of being many, many times faster than the standard Ruby 1.8 implementation by Matz et al.
In the real world, when that loop has to do something meaningful and the results of the computation have to be printed on the screen, that impressive performance is nowhere to be seen. So far this set of tests, which YARV employed to measure its own progress, has also been used to compare different implementations (these benchmarks can be found in Rubinius’ repository as well). This was the easy thing to do, but if we’re going to get serious about it, we need to produce a better set of benchmarks, especially since the current ones cast doubt on both YARV’s and MagLev’s impressive results.
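To make the flaw concrete, here is a minimal sketch in plain Ruby; the method names are illustrative and not taken from any existing suite. A dead-code-eliminating VM may legally skip the first loop, because its result never escapes; in the second version the result is returned and checked, so the work cannot be discarded:

```ruby
require 'benchmark'

# Flawed: the loop's result is never used, so a sufficiently
# smart VM can eliminate it and report a near-zero time.
def dead_loop
  x = 0
  1_000_000.times { x += 1 }
  nil # nothing escapes the method
end

# Better: the result escapes and is verified afterwards,
# so the computation cannot be optimized away.
def live_loop
  x = 0
  1_000_000.times { x += 1 }
  x
end

puts Benchmark.measure { dead_loop }
result = nil
puts Benchmark.measure { result = live_loop }
raise "unexpected result" unless result == 1_000_000
```

Under MRI both versions take about the same time, since Ruby 1.8 performs no such optimization, but a smarter VM could report a wildly misleading figure for the first one.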
In the long run, it would be good to come up with some serious benchmarks based on AST nodes, in order to exercise each of Ruby’s features. We can work on that, but let’s get started with some “beefed up” micro-benchmarks for the imminent shootout. In one of my comments I wrongly called the YARV tests “standard”. That was unfortunate wording, because there are no “standard” benchmarks that the Ruby community can rely on even minimally. Let’s fix that.
I created an empty project on GitHub, called Ruby Benchmark Suite. This project will hold a set of benchmarks that VM implementers can use to monitor their own progress and that I can use to run periodic shootouts between all of the major implementations.
I also created a Lighthouse project, so that we can have some support for communication and project management. For ongoing discussion about the project, I created a public Google Group, which I invite you to join if you’re interested in helping out. I’d like to see VM implementers get involved, in order to arrive at a set of reasonable, standard benchmarks that we can all agree upon.
For the next shootout, I’d like to start testing the various implementations within the next week or two, so it’d be great if we could come up with a bunch of new tests and revisit the existing ones. What I’d like to see is the following:
- Eliminate or modify flawed tests from the YARV collection of benchmarks. That means removing the chance that a given VM could optimize out the actual computation, yielding surprising yet useless results;
- Focus on a multitude of simple, cross-platform programs that exercise the Core and Standard libraries. These small programs should test a variety of operations (e.g. number crunching, text processing, and so on);
- There can be several types of application benchmarks, from simple algorithms to a program that performs statistical analysis of a large web server log file (for example) — and anything in between;
- The code in the project will be released under the MIT license. That means that if you contribute code that is not your own, you need to ensure that it’s released under a compatible license before we can include it. You’re free to place your name and a copyright note at the top of the file, in a comment, but by sending it to us you agree to release it under the MIT license.
- If you’re not too familiar with GitHub, send over your programs by email (acangiano (at) gmail.com) or by submitting them to the Google Group (assuming they are small). Please specify whether you’d like a “Submitted by” line and if you want your email to be included.
- Not all of the contributions we receive will be included, but the greater the variety, the smaller the gap between the benchmarks and real-world performance will be.
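As a rough illustration of how such a suite might be driven, here is a hypothetical mini-harness in plain Ruby. The `bench/bm_*.rb` layout and the `run_suite` name are my own invention, not part of the actual project; the idea is simply that each benchmark is a self-contained script, timed under whichever interpreter runs the harness:

```ruby
require 'benchmark'

# Hypothetical mini-harness: time each benchmark file under the
# current interpreter and report the best wall-clock time.
def run_suite(files, iterations = 3)
  results = {}
  files.each do |file|
    # Run each file several times and keep the fastest run,
    # to reduce noise from GC pauses and OS scheduling.
    times = (1..iterations).map { Benchmark.realtime { load file } }
    results[File.basename(file)] = times.min
    puts format("%-30s best: %.4fs (of %d runs)",
                File.basename(file), times.min, iterations)
  end
  results
end

# Assumed layout: one self-contained script per benchmark.
run_suite(Dir.glob("bench/bm_*.rb"))
```

Running the same harness under each implementation would then yield directly comparable wall-clock numbers per benchmark file.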
I hope I can count on your help for this project.