Benchmarking

Guidelines

Micro-benchmarks are notoriously inaccurate in any system. Here are some guidelines you should read carefully before trying to construct an accurate benchmark in the Strongtalk system.

  1. Put your benchmark in a real method. As mentioned in the tour, to get compiled performance results in Strongtalk, the primary computation (the code where your benchmark spends most of its time) needs to be in an actual method, not in a "do it" from a workspace. This is because the current version of the VM doesn't use the optimized method until the next time it is called after compilation, and a "do it" method by definition is never called more than once. (In a real program or a normal "do it" this effect is never an issue; only micro-benchmarks have loops that iterate zillions of times with the loop itself in the "do it".) This is not a fundamental limitation of the technology; we simply hadn't implemented "on-stack replacement" in the Smalltalk system at the time of release (we did implement it for Java).

    Note that this does not mean that the code your "do it" invokes won't be optimized and used the first time around; it will. But the big performance gains for micro-benchmarks come from inlining all the called methods directly into the performance-critical benchmark loop, and if that loop is literally in the "do it", that isn't possible.

    A good way to run your benchmark is to create a method in the Test class (which is there for this kind of thing) that runs for at least 100 milliseconds, then call that method a number of times until it becomes optimized (see the sketch after these guidelines). The Test>benchmark: method will do this for you and report the fastest time. A good rule of thumb for telling whether your code is being run enough: if your method doesn't get faster and then stabilize at some speed, it's not being run enough.

  2. Know how to choose a benchmark. Micro-benchmarks are notorious for producing misleading results in all systems, which is why all real benchmarks are bigger programs that use, as much as possible, the same code on both systems. If you insist on writing a micro-benchmark, keep these issues in mind:

    1. Your code should spend its time in Smalltalk, not down in rarely-used system primitives or C-callouts. For example, 'factorial' spends almost all of its time in the LargeInteger multiplication primitive, not in Smalltalk code.

    2. Use library methods that are commonly used in real performance-critical code. Take factorial as an example: when was the last time your program was performance-bound on LargeInteger multiplication?

    3. Use code that is like normal Smalltalk code (use of core data structures, allocation, message sending in a normal pattern, instance variable access, blocks). This is the biggest reason most micro-benchmarks aren't accurate. Real code is broken up into many methods, with lots of message sends, instance variable reads, boolean operations, SmallInteger operations, temporary allocations, and Array accesses, all mixed together. These are the things Strongtalk is designed to optimize (see the sketch after this list).

    4. Use the same code and input data on both systems. Running a highly implementation-dependent operation like "compile all methods" is not a good benchmark, because the set of methods is totally different and the bytecode compilers are implemented completely differently. (Also, the bytecode compiler is not a performance-critical routine in applications, so it has not been tuned at all in Strongtalk. When was the last time your users were twiddling their thumbs waiting for the bytecode compiler?)
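
To make guideline 1 and point 3 of guideline 2 concrete, here is a minimal sketch of a micro-benchmark written the right way. The selector mixedSum: and the invocation through an instance of Test are illustrative assumptions rather than actual code in the image, and the timing doit uses the standard Smalltalk-80 Time>millisecondsToRun: method; the Test>benchmark: method mentioned above packages the same run-repeatedly-and-report pattern for you.

    "Hypothetical benchmark method, to be filed into the Test class: the hot
     loop lives in a real method, and its body mixes message sends, block
     evaluation, SmallInteger arithmetic, and Array access."
    mixedSum: anArray
        | sum |
        sum := 0.
        anArray do: [ :each |
            each > 0 ifTrue: [ sum := sum + each ] ].
        ^sum

    "Workspace doit: call the method repeatedly and watch the reported times
     drop and then stabilize once the method has been recompiled."
    | data |
    data := (1 to: 10000) asArray.
    10 timesRepeat: [
        Transcript show: (Time millisecondsToRun:
            [ 200 timesRepeat: [ Test new mixedSum: data ] ]) printString; cr ]

If the times never drop and then stabilize, lengthen the loops until each call runs for at least 100 milliseconds, per guideline 1.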

How we did benchmarking

When we benchmarked the system ourselves, we assembled a large suite of accepted OO benchmarks, such as Richards, DeltaBlue (a constraint solver), the Stanford benchmarks, Slopstones, and Smopstones. These benchmarks are already in the image if you want to run them: try evaluating "VMSuite run" and look at the code it runs. If you want a real performance comparison, run these on other VMs.
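
Running the suite from a workspace is a one-line doit; a minimal sketch (VMSuite run is the expression given above):

    "Runs the assembled benchmark suite that ships in the image; browse the
     VMSuite class to see exactly what each benchmark executes."
    VMSuite run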

As an example, I put a couple of very small micro-benchmarks, run the right way, in the system tour (the code is in the Test class). You can try running them on other Smalltalks as a start.

Other benchmarking problems people have been having