My previous post about MagLev and the planning of the next Ruby shootout received a lot of attention. MagLev’s speed claims have been met with plenty of skepticism, and many believe that these impressive figures are due to a combination of clever optimization for trivial tests and incompleteness. The skepticism is understandable: very bright people have been working on alternative VMs for years, yet this new product shows up after only 3 months and claims to be way faster than anything seen before.
Except that it’s not entirely new. What makes it credible that they may be onto something and deliver a faster implementation is that they are leveraging decades’ worth of Smalltalk experience, in which Smalltalk VMs went through similar development challenges. Ruby and Smalltalk are family; there is no inherent reason why Ruby has to be dramatically slower than certain Smalltalk implementations. Parsing and compiling Ruby code into “Smalltalk-ish” bytecode is not the hardest thing to do. So, from a certain perspective, MagLev is 30 years old, not 3 months old. I’m enthusiastic about the VM because I think MagLev is promising, but don’t let anyone tell you that I’m naive. Despite the fact that MagLev is incomplete, I want to challenge it so that we can verify and clarify what kind of speed improvements it really offers at this stage. Let’s investigate how.
One very valid point raised by several people, both in my comments section and on Slashdot, is that many of the benchmarks employed so far are not very useful. Some of them are meaningless, not because of the usual sound advice that micro-benchmarks must be taken with a grain of salt, but because they give VMs an opportunity to optimize them out entirely. When a lazy/smart VM realizes that a given loop doesn’t produce any results that will ever be used, it can simply skip it, giving the impression that it is many, many times faster than the standard Ruby 1.8 implementation by Matz et al.
In the real world, when that loop has to do something meaningful and the results of the computation have to be printed on the screen, that impressive performance is nowhere to be seen. So far the set of tests that Yarv employed to track its own progress has also been used to compare different implementations (and these benchmarks can be found in Rubinius’ repository as well). This was the easy thing to do, but if we’re going to get serious about it, we need to produce a better set of benchmarks, especially since the current ones call both Yarv’s and MagLev’s impressive results into question.
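To make this concrete, here is a hypothetical pair of micro-benchmarks (illustrations of mine, not tests from the Yarv suite): the first discards its result, so a sufficiently clever VM is free to skip the loop entirely; the second prints it, which forces the work to actually happen.

    require 'benchmark'

    # Flawed version: the counter is never used afterwards, so an aggressive VM
    # could legally skip the whole loop and report a near-zero time.
    flawed = Benchmark.measure do
      counter = 0
      1_000_000.times { counter += 1 }
    end

    # Safer version: printing the counter makes the result observable, so the
    # loop has to run and the measured time reflects real work.
    honest = Benchmark.measure do
      counter = 0
      1_000_000.times { counter += 1 }
      puts counter
    end

    puts flawed
    puts honest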
In the long run, it would be good to come up with some serious benchmarks based on AST nodes in order to test each of Ruby’s features. We can work on that, but let’s get started with some “beefed up” micro-benchmarks for the imminent shootout. In one of my comments I wrongly called the Yarv tests “standard”. That was unfortunate wording, because there are no “standard” benchmarks that we can rely on even minimally in the Ruby community. Let’s fix that.
I created an empty project on GitHub, called Ruby Benchmark Suite. This project will hold a set of benchmarks that VM implementers can use to monitor their own progress and that I can use to run periodic shootouts between all of the major implementations. I also created a Lighthouse project, so that we can have some support for communication and project management. For ongoing discussion about the project, I created a public Google Group, which I invite you to join if you’re interested in helping out. I’d like to see VM implementers get involved with this, in order to make it a set of reasonable, standard benchmarks that we can all agree upon.
For the next shootout, I’d like to start testing the various implementations within the next week or two, so it’d be great if we could come up with a bunch of new tests and revisit the existing ones. What I’d like to see is the following:
- Eliminate or modify flawed tests from the Yarv collection of benchmarks. That means removing any chance that a given VM could optimize away the actual computation, yielding surprising yet useless results;
- Focus on a multitude of simple, cross-platform programs that employ the Core and Standard libraries. These small programs should test a variety of operations (e.g. number crunching, text processing, etc…);
- There can be several types of application benchmarks, from simple algorithms to a program that performs statistical analysis of a large web server log file, and anything in between (a rough sketch of one such log-analysis benchmark follows this list);
- The code in the project will be released under the MIT license. That means that if you contribute code that is not your own, you need to ensure that it’s released under a compatible license before we can include it. You’re free to place your name and copyright notice at the top of the file, in a comment, but by sending it to us, you agree to release it under the MIT license.
- If you’re not too familiar with GitHub, send over your programs by email (acangiano (at) gmail.com) or submit them to the Google Group (assuming they are small). Please specify whether you’d like a “Submitted by” line and whether you want your email to be included.
- Not all contributions that we receive will be included, but the greater the variety, the smaller the gap between the benchmarks and real-world performance will be.
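As a conversation starter, here is a rough sketch of the kind of log-analysis application benchmark mentioned in the list above. The file name, the assumption of common log format, and the field positions are placeholders of mine; an actual submission would ship with its own sample data or a generator.

    require 'benchmark'

    LOG_FILE = "access.log"  # hypothetical sample log shipped with the benchmark

    elapsed = Benchmark.realtime do
      hits  = Hash.new(0)
      bytes = 0
      File.foreach(LOG_FILE) do |line|
        fields = line.split
        next if fields.size < 10
        hits[fields[6]] += 1     # request path, assuming common log format
        bytes += fields[9].to_i  # response size in bytes
      end
      # Print a summary so the work cannot be optimized away.
      top = hits.sort_by { |path, count| -count }.first(5)
      puts "Total bytes served: #{bytes}"
      top.each { |path, count| puts "#{count}\t#{path}" }
    end

    puts "Elapsed: #{elapsed} seconds"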
I hope I can count on your help for this project.
May I make a suggestion?
I suggest you create a virtual machine for each of your platforms. Presuming none of your benchmarks are disk I/O bound, you can test all of the platforms on the exact same machine under the same sort of conditions (same amount of memory, same CPU allocation, etc.).
VMware Workstation allows you to create a base image and then create a set of “diffs”, so it might be an ideal solution.
Thanks for inviting us, Antonio. This sounds like a great idea. We’ll be glad to participate.
Since I’ve been at GemStone since it started in 1982, I have a small correction to your timeline. Your readers can choose what age they wish to view our MagLev VM.
The first GemStone VM (32-bit) was released in 1986, and version 6.2.2 of that product was released last Friday. The first 64-bit GemStone VM was released in 2005, and version 2.2.5.2 of that product was also released last Friday. MagLev is *based* on our upcoming 3.x VM. That VM will be released at such time as it is feature complete and passes the tests required for current production customers to upgrade their existing (i.e. non-Ruby) applications.
Running Ruby is new functionality for us, and as Charles has pointed out, Ruby compliance is no small task. We haven’t had time to implement anything besides Webrick, the microbenchmarks, and a bit of XML parsing. So I can’t give an estimate of when we will pass enough RubySpecs for people to view MagLev as real.
But we’ll be glad to share our Ruby code and performance benchmarks (good and bad) along the way.
— Monty
What about using the ones at shootout.alioth.debian.org as a start? They at least all require output, I believe, so there shouldn’t be any cases of pointless loops. It also comes with the added bonus of implementations in many languages.
I think they were just added recently to the rubinius benchmark directory, as an additional set of benchmarks to test perf.
Hi Monty,
thanks for stopping by. With my 30-year remark, I was mostly thinking about the kind of implementation challenges solved by pioneers like Alan Kay. But it’s clear that the latest advancements made by GemStone will be the real foundation for MagLev. I also want to clarify that I think you have quite a few challenges ahead, but having a fast-performing VM created for a similar language is a huge jump start.
@Charles L, I’m not sure about their license, but I’d like to add the ones that are missing (Yarv already includes a few of them).
This kind of thing is why I have a lot of respect for the Ruby community. When they encounter a problem, instead of just whining about it, they try to solve it with a combination of programming skill and open source.
Some comments I would add:
* I’ve never liked benchmarks that just have a single number result. I’d rather see the same test done with e.g. more and more data and get a nice graph that shows where bottlenecks get hit.
* Similarly, if a result takes 1 second half the time and 11 seconds the rest, it shouldn’t just be averaged to 6; the variability should be shown somewhere (a rough sketch of this kind of per-run reporting is at the end of this comment)
* Memory is important as well as speed. I believe Alioth already reports on this?
* For VMs that get faster with use, Alioth also has a graph that shows the time taken for the first and subsequent runs. In effect this gives an upper and lower bound on times, depending on how much you want to punish/reward systems in how they trade startup speed against ongoing performance. (http://shootout.alioth.debian.org/debian/miscfile.php?file=dynamic&title=Java%20Dynamic%20Compilation)
* Is the software that displays the Alioth results open source, as well as the benchmarks themselves? It could be a good way to centrally report the results and give something back by letting other languages (e.g. Iron/J/python) use a similar setup customised for comparing across multiple implementations of a single language.
* I would have thought that also testing on a dual-core machine would be sensible. Don’t modern VMs use the other thread to do VM-stuff in the background? (http://java.sun.com/performance/reference/whitepapers/6_performance.html#2.1.6)
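For what it’s worth, here’s a minimal sketch of the kind of reporting I mean, using a throwaway placeholder workload rather than any of the proposed benchmarks: the same test is run several times over growing input sizes, and each run is reported individually along with the min and max instead of a single average.

    require 'benchmark'

    RUNS  = 5
    SIZES = [10_000, 100_000, 1_000_000]  # growing inputs, to expose where bottlenecks kick in

    SIZES.each do |n|
      times = (1..RUNS).map do
        result = nil
        elapsed = Benchmark.realtime do
          # Placeholder workload standing in for a real benchmark.
          result = (1..n).inject(0) { |sum, i| sum + i }
        end
        puts result  # use the result so the work cannot be optimized away
        elapsed
      end
      formatted = times.map { |t| format('%.4f', t) }.join(', ')
      printf("n=%d  min=%.4fs  max=%.4fs  runs=[%s]\n", n, times.min, times.max, formatted)
    end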
While Ruby has improved by 5x with the newer 1.9 release, it is still VERY slow compared to other languages like Lua, Java, or even Python.
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=all&lang=yarv&lang2=psyco
A quick glance shows that Ruby 1.9 is still one of the slowest implementations according to the tests linked above.
Will MagLev bring us up to speed with Lua / Java?