For over a year now, I’ve been collaborating with Christian on a lightweight and portable runtime for the occam programming language which we call the Transterpreter. This has been a fun project, because occam is a clean little language, and I enjoy architecting and writing compilers, interpreters, and similar kinds of software.
We wrote our first version of the bytecode interpreter in Scheme. Because we were reverse engineering the Transputer (a processor that ran compiled ocacm bytecodes) from a variety of sources, we felt that we had enough concerns without worrying about, say, memory management or pointer arithmetic. Our first version was slow, but it was clear, and the first version of the rewrite into C was almost a line-for-line translation.
Early in the project, Christian had written a Python script to clean up the bytecodes emitted by the existing compiler. It did the job, but with the rewrite into C we began thinking about the infrastructure for supporting a virtual machine that should be able to run anywhere. The occam compiler runs under any *NIX-like environment, and can be compiled for Windows using Cygwin or an old Borland toolchain, so any extensions to the toolchain should be at least as portable, if not more-so.
We considered a number of alternatives, but because of my experience programming in Scheme, we decided that PLT Scheme would be the implementation platform of choice. PLT Scheme (which encompasses MzScheme and DrScheme) runs damn near everywhere; there is excellent support for executing Scheme programs (interpreted or bytecode compiled with a run-time support library) on Windows, Mac (OS 9 and X), most flavours of *NIX, and even some PDAs. This way, the backend of our toolchain would be very portable, and let us program in a clean, well-understood and well-used language.
So, the slinker was born. Although, it is more than a linker—it’s like a compiler backend-ish, really. It started off with a 15-20 micropass (pipelined) architecture (inspired by this work), has grown in reasonable ways, currently needs some cleanup, but the architecture has proved to be flexible enough that we’ve been able to accommodate interesting additions without having to do ugly things like, say, scrap the codebase. We successfully designed for extension, and that’s good.
Unfortunately, when we started running the CG test suite, we suffered; the slinker was taking as much as 20-30 minutes on some tests. The compiler took seconds, the slinker took half-an-hour, and then running the test took a few seconds. This was, we agreed, bad, but optimization takes time, and we didn’t have that. We did have machines laying around that could run the test suite over night, which was good enough.
An exercise in optimizing Scheme
Yesterday, we started optimizing. Now, I learned some interesting things (I should have known) about PLT Scheme in the process. For example, we had a loop that looked like
(while (
This code was killing us. Slaughtering us. On 150,000 element lists, we would run for 200,000ms or more (and this was a large, but not huge, bytecode array). All the other passes in the slinker ran in 0 to 60 ms, and the DrScheme profiler was really quite clear that our problem was buried in this loop---not elsewhere.
This unrolled to a 'named let' in Scheme, which should be every bit as efficient as a for in most other languages, so we assumed our loop was not a problem. Or, put another way, we had perfectly fast code elsewhere that used the same macro, so that wasn't the problem.
Immediately, I decided the consing to build a list was certainly a problem. So, we got rid of that. To find out what was a good structure to use, I wrote a little "benchmarking" script to test insertions into vectors and hash-tables. It was faster to do an insert into vectors, so the structure build-up was modified.
(while (
Now, this code ran faster. We got a barely perceptible speedup at this point. This struck me as odd, as we had just benchmarked this, and clearly we should have seen a large speedup.
So, that wasn't the bottleneck. Keep in mind now, we only have a few things going on in this loop:
- The test
(,
- A
vector-set!
- A
list-ref on the bytecode list
- A variable increment (the
set! on pos)
Lets eliminate a few things right now. First, the increment on the variable is not the source of the problem. The vector-set! had just been benchmarked, and we revisited our benchmarking script to verify, but decided that we certainly were OK on the vector insert. The test, also, isn't a problem: even with dynamic typing, the check to see if one integer is smaller than another is still fast. Which left our list lookup.
We achieved a 5x speedup by changing one data structure and one function call in our code above.
(while (
See it? vector-ref instead of list-ref. Why? I haven't looked at the MzScheme source yet (I will, as I'd like to know what's going on) but I have a good idea as to why this accounted for a massive speedup in our code. Vectors are probably treated as contiguous locations in memory, and so the entire structure is allocated on the heap in one big chunk. When you operate on it, and in particular, reference into a vector, it's a simple index into memory from the head pointer.
Lists in Scheme are different; they're a bunch of cons pairs (or, they can be). I suspect that MzScheme handles lists very differently from vectors, and when we build a list with 150,000 elements, the cons pairs end up all over the heap. Referencing into the list is certainly not a constant-time operation; in fact, we were being badly, badly abused by our references into the list. I have no idea how many trips out to main memory had to be made to reference into element 126449 of our list... perhaps dozens.
This morning, Christian worked on a few other passes that we suspected had similar ugliness---as they, too, took many minutes to run. Now, our largest test case, which previously took 30 minutes to link, now links in approximately 30 seconds. That's a 60x speedup.
Lessons learned
I don't think we've done anything wrong here; we wrote simple, clean code to start that allowed for flexibility while developing. When we finally reached a good place to stop and look at the entire project, we made some key optimizations that, perhaps, could have been implemented from the start. However, early in the project, speed didn't didn't matter. Good design and sound implementation mattered. The software worked. And, we think, it continues to prove itself as extensible and maintainable.
And, now, we see that our linker is easily optimized as well. Our entire compiler test suite of several thousand cases runs in 30 minutes, and perhaps even less before the day is over. A few days ago, we had to run it overnight because it took so long; now you can start a run before going to lunch, and have results when you get back. Sweet
But yes. I believe it's true: premature optimization continues to be the root of all evil.
(PS. I'll drop a note here when I take some time out to look at how MzScheme handles vectors vs. lists; that was an interesting revelation.)