It's time for celebration. My quad 2.5GHz G5 is now benching 1760ms on SunSpider using the internal 4.0.1pre I've whipped up, and a whopping 93 runs/sec on Dromaeo. That's a little over half the raw interpreter's SunSpider runtime (about 3370ms), and over half of the limping 3710ms it gets with the current hybrid interpreter-nanojit used by the G5 build of 4.0s. I bet you lucky dogs with dual 2.7GHz G5 systems will see even better. Everything is faster. Birds are chirping. The river is high. Gold bricks are falling from the sky, and not on people. This is so tasty it will be in 4.0.1, as so far it's just as stable as the G3 and G4 versions and runs even faster.
Since this blog is designed around some technical nerdosity, let's engage in a little, and also collect what I've learned over the difficult process of getting the nanojit tuned for the G5.
(In Firefox 4, Mozilla added a more conventional method-oriented compiler, based in part on Apple's Nitro JIT used in Safari. Since Firefox 4 is Intel and ARM only officially, the methodjit only runs on x86, x86_64 and ARM. This combination of the nanojit and methodjit is JaegerMonkey. There is no PowerPC methodjit.)
These are useful things to know not only for those of us trying to wring more performance out of our old Macs, but also for people using later PowerPC and POWER designs, such as Wii, Xbox 360 or PlayStation 3 hackers, because the PowerPC 970 (being a modified POWER4) is more closely related to the modern day IBM POWER systems and the Cell, Broadway and Xenon CPUs than the G3s and G4s that preceded them. So, for posterity, here's how we juiced the G5 nanojit.
The G5's data cache acts differently. The G5, because of its deep pipeline, is constantly trying to keep that pipeline full and reduce latency, and tight code trying to prime the data cache needs to be aware of the difference. The AltiVec dst (to hint the processor about where the data stream is coming from) instruction, for example, requires that pipeline to drain and can seriously impact performance. We don't manipulate the D-cache presently in the nanojit, but I mention for completeness that the more basic dcbt is preferred to dst as it does not need to be serialized; the G5 uses a 128-byte cacheline, not a 32-byte one, so dcbtl (which uses the native cacheline size) would be better still. Analogously, for zeroing out a cacheline use dcbzl, not dcbz, which only operates on 32 bytes even on the G5 and is therefore inefficient by wasting valuable cache space.
dcba and dcbi should never be used; they are illegal on the G5. Mac OS X emulates dcba on the G5 by simply ignoring it, since it's just a hint, but this causes a software interrupt to do so -- more about that in our final point.
Remember dispatch groups. The nanojit we use has a construct called the "swaptimizer," which swaps independent instructions around in such a way that more of the CPU's execution units can be running at the same time (i.e., improve instruction-level parallelism) by hoisting up instructions to run overlapped with other instructions that don't depend on that earlier step's result. In certain cases, this can be effective enough to get some instructions seemingly "for free," particularly comparisons that can write out independent comparison results (the PPC has a series of "mini-registers" for this which is quite convenient). For in-order CPUs like the G3 (remember that the G3 is essentially an evolved 603 with all the advantages and disadvantages), this is very valuable, as it only retires instructions in program order despite being superscalar, and it should also be useful for other in-order POWER chips like Xenon, the Cell PPE and POWER6. It is less valuable on the 604 and G4, which both have some limited out-of-order execution, but the G4s in most Macs don't have the reordering logic of later G4 designs.
On the other hand, the G5 is an aggressively out-of-order architecture to improve its instruction-level parallelism in hardware, and can have over 200 instructions in-flight (compared to around 30 for the G4). To reduce the amount of silicon needed to track each and every one of these flying instructions, IBM designed the G5 to take dispatch groups of instructions instead and these groups are what the CPU tracks and what travel through the pipeline. Prior to grabbing the instructions, the G5 will attempt to reorder them for maximum performance. As a result, the swaptimizer is less effective here because it's cherry-picking the low-hanging optimization fruit that the G5 already schedules for, but it does help to pack groups better so that an earlier group is less likely to need the result of a later group.
Dispatch groups actually contain operations rather than individual instructions. In most cases this is an academic point, as the operation is usually the same as the instruction. Ordinarily dispatch groups contain five operation slots: four for individual operations, and an optional branch instruction, so most dispatch groups fall between every fourth instruction. The branch instruction always terminates a dispatch group, and a dispatch group may never have more than one branch. The distinction between operations and instructions will be covered in my next point. The pieces of the group then enter the processor and are issued and executed, and the groups, not the already reordered instructions, are then retired in-order.
Because the instructions execute for most intents and purposes as a unit, certain interdependencies can really hurt, most notoriously loads and stores to nearby addresses in the same group. For example, in the LIR operation d2i that converts a double to an integer, the stfd instruction should be in an earlier group than the lwz that follows it (remember that instructions in the nanojit are emitted working down from the top of memory) because they work on memory addresses that are very close to each other. Since this could lead to a problem with aliasing if they run together in the same group, the G5 has to "undo the group" and split them apart, leading to a pipeline spill as this won't be detected until the group is formed and its addresses are calculated. It improves performance to insert a couple of nop instructions between them, which essentially act as empty space in the group, and forces the G5 to split the group ahead of time. This is tunable; our code worked best with two, and gained a small but meaningful number of points on V8. In fact, this is particularly a problem for any code that has to interconvert between floating point and integer registers, because such conversion must store in memory as an intermediate step (there are no direct FPR to GPR moves in any Power Mac processor). Shark checks for this specific situation and Apple calls it a "LSU [Load Store Unit] Reject" in ADC.
Certain instructions do better at certain positions in the group, too. For example,
mtctr (move to counter register) should be first in a dispatch group if at all possible, and I'm sure there are others (please mention them in the comments; this was the one I found in most references). We can't really leverage this because we only use CTR as an indirect branching register in constructs like mtctr bctr, so we're always branching soon after anyway. However, if you use CTR as, you know, a counter maybe 8), then you might want to group your instructions to force it first in line with a group.
Avoid microcoded and cracked instructions. I mentioned that instructions and operations can mostly, but not always, be treated synonymously. The case where they are not is if the instruction in question is cracked or microcoded.
If the instruction is cracked, it is actually two operations in one, and takes up two operations in the dispatch group; if the instruction is microcoded, it takes up all the operation slots (i.e., it can only travel alone in the dispatch group). Here is Apple's list of G5 cracked and microcoded instructions. Although these instructions do exist and do execute in hardware, they are obviously slower than other instructions. Use them only if you have to, which brings us to our final and most important point.
Never, ever, ever use mcrxr on the G5. And this might well apply to some other POWER CPUs, by the way, even though it appears in IBM's documentation. Apple doesn't mention anywhere in their documentation I could find that mcrxr is software-emulated on the G5. Yikes! In fact, simply eliminating the use of this instruction was what restored the vast majority of our speed.
The reason this hit us so badly is an oddiment of the code the nanojit generates. Some background: overflows and other math exceptions are, if the instruction requests it, annotated in a special purpose register ("SPR") called the XER. The XER, amongst other things, tracks both overflow "OV" of the last instruction, and a summary overflow "SO" which is sticky (i.e., any instruction that sets it, it stays set, unlike OV where the next instruction where overflow tracking is requested clears it). The result of XER cannot itself be used in branching: only the condition registers "CR" can be used to conditionally branch, so the CPU mirrors SO to one of the condition registers if the instruction requests it also (CR0, in case you're interested, for integer math).
The nanojit spits out lots of guard code for every arithmetic operation -- and I do mean every operation, from simple increments in a loop to computing the national debt -- so that an overflow state can be correctly trapped as an exception and appropriately handled. Most of the time, overflow does not occur, and only one instruction is being tested for overflow, so we want to use OV (or we waste cycles clearing SO constantly for instructions that are unlikely to set it). However, we can't branch on OV directly, so here comes mcrxr. mcrxr stands for move exception register to condition register, which puts the relevant bits into the CR we specify and clears those bits from the XER. In our case, this puts OV in the greater-than field of the condition register, and now we can branch on that. Problem solved! ... at least on the G3 and G4, where mcrxr is in hardware.
On the G5, mcrxr is trapped and emulated by the operating system. This means every time the instruction is encountered, there is a fault, the pipelines probably have to empty, the OS examines the instruction, sees what it is, runs equivalent code (to be discussed momentarily), and returns to regular execution. If you look at the code the nanojit generates for a simple loop, you will see that each and every time the loop runs, the increment is tested for overflow, and the mcrxr instruction is trapped and emulated. No wonder it sucked so badly! Apple doesn't mention this anywhere except if you dig through their code or find the few obscure bug reports about why a previously snappy executable on the G4 performs so badly on a G5.
So here is the one case of where a microcoded instruction is better than the alternative. We now use this equivalent code in the G5 nanojit:
mfxer r0 ; put XER into register 0, which is scratch
mtcrf 128, r0 ; put the relevant bits into condition register 0
rlwinm r0, r0, 0, 0, 28 ; clear out the relevant bits in register 0
mtxer r0 ; and put them back into XER, clearing it
You will notice that except for rlwinm every one of these instructions is microcoded and requires an entire dispatch group all to itself. There's no way around it, but it's better than triggering an illegal instruction fault and forcing software emulation. Way better. We still use mcrxr on the G3 and G4, which have it in hardware, but the G5 now uses this equivalent which is much faster on the 970. But phooey on both Apple and IBM: IBM for taking the instruction out, and Apple for not documenting it.
Well, that's enough nerdosity. It'll be in 4.0.1. It's great. More G5 docs, until Apple decides to 404 them: