It's time for celebration. My quad 2.5GHz G5 is now benching 1760ms on SunSpider using the internal 4.0.1pre I've whipped up, and a whopping 93 runs/sec on Dromaeo. That's a little over half the raw interpreter's SunSpider runtime (about 3370ms), and under half of the limping 3710ms it gets with the current hybrid interpreter-nanojit used by the G5 build of 4.0. I bet you lucky dogs with dual 2.7GHz G5 systems will see even better. Everything is faster. Birds are chirping. The river is high. Gold bricks are falling from the sky, and not on people. This is so tasty it will be in 4.0.1, as so far it's just as stable as the G3 and G4 versions and runs even faster.
Since this blog is designed around some technical nerdosity, let's engage in a little, and also collect what I've learned over the difficult process of getting the nanojit tuned for the G5.
First, a little history for people who don't know what the nanojit is. Mozilla has three layers of JavaScript: the base interpreter (SpiderMonkey), the nanojit (TraceMonkey) and the methodjit (JaegerMonkey). The base interpreter is the base interpreter: it's pretty good as interpreters go, but it's just interpreting code and does not compile it. Until Firefox 3.5, this was the only way JavaScript was run in Mozilla-based browsers. On Power Macs, 3.5 and 3.6 still use it, as do Camino and SeaMonkey, and so do all the community PowerPC builds of Firefox 4 other than ours.
Mozilla had been looking at a means of compiling JavaScript, and collaborated with Adobe on what was then called Tamarin, a new JavaScript engine. Tamarin overall actually turned out to be slower for the purpose, but a lasting advance from the Tamarin project was the concept of the nanojit: a "tracing just-in-time compiler" that watched for frequently executed portions of code, then recorded the atomic operations being generated in a special intermediate language called LIR, and finally compiled the LIR into machine code. This nanojit-based accelerator was christened TraceMonkey (no, I don't know why they have such a simian fetish either). Versions were immediately made for x86, and later other architectures followed, including ARM, SH4, SPARC and even MIPS. Adobe wrote up one for the PowerPC too, but it was not complete enough to be used in Firefox, and this was not rectified in Firefox 3.6. That's why all official Mac PowerPC builds of Firefox have such comparatively slow JavaScript performance.
(In Firefox 4, Mozilla added a more conventional method-oriented compiler, based in part on Apple's Nitro JIT used in Safari. Since Firefox 4 is Intel and ARM only officially, the methodjit only runs on x86, x86_64 and ARM. This combination of the nanojit and methodjit is JaegerMonkey. There is no PowerPC methodjit.)
For TenFourFox beta 9, then, we took that partially written PowerPC nanojit and finished it up, thus becoming the first PowerPC-based Mozilla browser to implement TraceMonkey. My first tests, on my G5, were very disappointing. Although many operations were significantly faster, many others were significantly slower, and I concluded that the nanojit didn't seem to be a good fit for the PowerPC ... until alert users tested it themselves and told me it was ridiculously fast on the G3 and G4. To get around this, I compulsively benchmarked various low-level JavaScript operations on the G5 and cut out the ones that seemed to be slow. This was good enough to beat the baseline score for Dromaeo and V8, and not suck too badly at SunSpider, and it got released in beta 11. At the time, that seemed about as good as it was going to get for the PowerPC 970.
Earlier this week, I was busy enabling VMX (AltiVec)-based WebM decoding (which, by the way, is about halfway done -- I now have it integrated into the build system and have started converting the existing assembly language code to the dain-bramaged as assembler Apple stuck in Tiger). While investigating an oddiment in the assembler's syntax, I stumbled across a key note I hadn't seen mentioned anywhere else before, let alone in Apple's documentation. As I dug around a little more, the floodgates opened and more stuff came in, including -- what a win -- the critical piece that now lets the G5 nanojit fly. In fact, not only does it fly, it also no longer needs the shortcuts that dumped those slow low-level JavaScript operations (JSOPs), because those JSOPs are no longer slow.
These are useful things to know not only for those of us trying to wring more performance out of our old Macs, but also for people using later PowerPC and POWER designs, such as Xbox 360 or PlayStation 3 hackers, because the PowerPC 970 (being a modified POWER4) is more closely related to the modern-day IBM POWER systems and the Cell and Xenon CPUs than to the G3s and G4s that preceded it. (The Wii's Broadway, by contrast, is a G3-class 750 derivative, so the in-order notes below apply to it.) So, for posterity, here's how we juiced the G5 nanojit.
The G5's data cache acts differently. The G5, because of its deep pipeline, is constantly trying to keep that pipeline full and reduce latency, and tight code trying to prime the data cache needs to be aware of the difference. The AltiVec dst instruction (which hints to the processor where a data stream is coming from), for example, requires that pipeline to drain and can seriously impact performance. We don't manipulate the D-cache presently in the nanojit, but I mention it for completeness: the more basic dcbt is preferred to dst, as it does not need to be serialized, and since the G5 uses a 128-byte cacheline, not a 32-byte one, dcbtl (which uses the native cacheline size) would be better still. Analogously, for zeroing out a cacheline use dcbzl, not dcbz, which only operates on 32 bytes even on the G5 and is therefore inefficient, wasting valuable cache space.
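As a quick sketch of the difference (the register choice and alignment assumption are mine, not nanojit code), a G5-friendly way to touch and zero cache lines might look like this:

```asm
; Hypothetical G5 cache priming; r3 holds a 128-byte-aligned address.
        dcbt    0, r3        ; touch the line at r3 into the D-cache;
                             ; a plain hint, no pipeline drain like dst
        dcbzl   0, r3        ; zero the full native 128-byte cacheline
                             ; (dcbz would zero only 32 bytes here)
```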
dcba and dcbi should never be used; they are illegal on the G5. Mac OS X emulates dcba on the G5 by simply ignoring it, since it's just a hint, but even ignoring it requires a software interrupt -- more about that in our final point.
Remember dispatch groups. The nanojit we use has a construct called the "swaptimizer," which swaps independent instructions around so that more of the CPU's execution units can be running at the same time (i.e., it improves instruction-level parallelism), hoisting instructions up to overlap with other instructions that don't depend on an earlier step's result. In certain cases, this can be effective enough to get some instructions seemingly "for free," particularly comparisons, which can write out independent comparison results (the PPC has a series of "mini-registers" -- the condition register fields -- for this, which is quite convenient). For in-order CPUs like the G3 (remember that the G3 is essentially an evolved 603, with all the advantages and disadvantages that implies), this is very valuable, as such a chip retires instructions only in program order despite being superscalar, and it should also be useful for other in-order POWER chips like Xenon, the Cell PPE and POWER6. It is less valuable on the 604 and G4, which both have some limited out-of-order execution, though the G4s in most Macs don't have the reordering logic of later G4 designs.
On the other hand, the G5 is an aggressively out-of-order architecture that improves its instruction-level parallelism in hardware, and can have over 200 instructions in flight (compared to around 30 for the G4). To reduce the amount of silicon needed to track each and every one of these flying instructions, IBM designed the G5 to take dispatch groups of instructions instead, and these groups are what the CPU tracks and what travel through the pipeline. Before forming the groups, the G5 will attempt to reorder the instructions for maximum performance. As a result, the swaptimizer is less effective here, because it's cherry-picking the low-hanging optimization fruit that the G5 already schedules for, but it does help to pack groups better so that an earlier group is less likely to need the result of a later group.
Dispatch groups actually contain operations rather than individual instructions. In most cases this is an academic point, as an operation is usually the same as an instruction; the distinction will be covered in my next point. Ordinarily a dispatch group contains five slots: four for individual operations, plus an optional branch, so in branch-free code a group boundary falls roughly every fourth instruction. A branch always terminates its dispatch group, and a group may never contain more than one branch. The pieces of the group then enter the processor and are issued and executed, and the groups -- not the already reordered instructions -- are then retired in order.
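Schematically, group formation on the 970 looks something like this (an invented straight-line sequence, not nanojit output, and real boundaries depend on where the fetch stream starts):

```asm
; Invented example: how the 970 might slot five instructions
; into a single dispatch group.
        add     r3, r3, r4    ; slot 0
        lwz     r5, 0(r6)     ; slot 1
        ori     r7, r7, 1     ; slot 2
        cmpwi   cr0, r3, 0    ; slot 3
        beq     cr0, done     ; branch slot: a branch always ends the group
```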
Because the instructions in a group execute for most intents and purposes as a unit, certain interdependencies can really hurt, most notoriously loads and stores to nearby addresses in the same group. For example, in the LIR operation d2i, which converts a double to an integer, the stfd instruction should be in an earlier group than the lwz that follows it (remember that instructions in the nanojit are emitted working down from the top of memory), because they work on memory addresses that are very close to each other. Since this could lead to a problem with aliasing if they run together in the same group, the G5 has to "undo the group" and split the instructions apart, leading to a pipeline flush, since the conflict won't be detected until the group is formed and its addresses are calculated. It improves performance to insert a couple of nop instructions between them, which essentially act as empty space in the group and force the G5 to split the group ahead of time. This is tunable; our code worked best with two, and gained a small but meaningful number of points on V8. In fact, this is particularly a problem for any code that has to interconvert between floating point and integer registers, because such a conversion must use memory as an intermediate step (there are no direct FPR to GPR moves in any Power Mac processor). Shark checks for this specific situation, which Apple calls an "LSU [Load Store Unit] Reject" in ADC.
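As an illustrative sketch (the stack offset and registers are mine, not the nanojit's actual choices), the store-load pair with the group-splitting padding looks roughly like this:

```asm
; Sketch of a double-to-integer conversion with nop padding.
; f1 holds the double; 16(r1) is a hypothetical 8-byte scratch slot.
        fctiwz  f1, f1        ; convert to integer, result still in an FPR
        stfd    f1, 16(r1)    ; spill to memory: no direct FPR->GPR move
        nop                   ; padding: pushes the following load into a
        nop                   ;   later dispatch group, avoiding an LSU reject
        lwz     r3, 20(r1)    ; reload the low word (big-endian) as an integer
```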
Certain instructions do better at certain positions in the group, too. For example, mtctr (move to count register) should be first in a dispatch group if at all possible, and I'm sure there are others (please mention them in the comments; this was the one I found in most references). We can't really leverage this because we only use CTR as an indirect branching register in mtctr/bctr constructs, so we're always branching soon after anyway. However, if you use CTR as, you know, an actual counter 8), then you might want to arrange your instructions to force the mtctr to lead its group.
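If you do use CTR as a real counter, a schematic counted loop might look like this (the registers and the exact amount of nop padding are my own illustration; how many nops you need depends on where the preceding group boundary falls):

```asm
; Hypothetical counted loop; r3 = iteration count, r4 = accumulator.
        nop                   ; padding so that mtctr can begin
        nop                   ;   a fresh dispatch group
        nop
        mtctr   r3            ; load the counter, leading its group
loop:
        addi    r4, r4, 1     ; trivial loop body
        bdnz    loop          ; decrement CTR, branch while nonzero
```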
Avoid microcoded and cracked instructions. I mentioned that instructions and operations can mostly, but not always, be treated synonymously. The exception is when the instruction in question is cracked or microcoded.
If the instruction is cracked, it is actually two operations in one, and takes up two operations in the dispatch group; if the instruction is microcoded, it takes up all the operation slots (i.e., it can only travel alone in the dispatch group). Here is Apple's list of G5 cracked and microcoded instructions. Although these instructions do exist and do execute in hardware, they are obviously slower than other instructions. Use them only if you have to, which brings us to our final and most important point.
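For instance (my example, chosen from memory of the load-with-update forms rather than quoted from Apple's list), lwzu is cracked on the 970, so two simple instructions can sometimes schedule more flexibly:

```asm
; lwzu is cracked on the 970: it becomes a load plus an update of the
; base register, occupying two slots in its dispatch group.
        lwzu    r3, 4(r4)     ; one instruction, two operations
; The equivalent explicit pair:
        lwz     r3, 4(r4)     ; load from r4+4
        addi    r4, r4, 4     ; update the pointer separately
```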
Never, ever, ever use mcrxr on the G5. And this might well apply to some other POWER CPUs too, by the way, even though the instruction appears in IBM's documentation. Nowhere in any Apple documentation I could find is it mentioned that mcrxr is software-emulated on the G5. Yikes! In fact, simply eliminating the use of this instruction was what restored the vast majority of our speed.
The reason this hit us so badly is an oddiment of the code the nanojit generates. Some background: overflows and other math exceptions are, if the instruction requests it, annotated in a special purpose register ("SPR") called the XER. The XER, amongst other things, tracks both the overflow ("OV") of the last such instruction and a summary overflow ("SO"), which is sticky (i.e., once an instruction sets it, it stays set, unlike OV, which the next instruction requesting overflow tracking clears). The XER cannot itself be used for branching: only the condition registers ("CR") can be used to conditionally branch, so the CPU also mirrors SO to one of the condition registers if the instruction requests it (CR0, in case you're interested, for integer math).
The nanojit spits out lots of guard code for every arithmetic operation -- and I do mean every operation, from simple increments in a loop to computing the national debt -- so that an overflow state can be correctly trapped as an exception and appropriately handled. Most of the time overflow does not occur, and only one instruction is being tested for overflow, so we want to use OV (or we waste cycles constantly clearing SO for instructions that are unlikely to set it). However, we can't branch on OV directly, so here comes mcrxr (move to condition register from XER), which puts the relevant XER bits into the CR field we specify and clears them from the XER. In our case, this puts OV in the greater-than field of that condition register, and now we can branch on it. Problem solved! ... at least on the G3 and G4, where mcrxr is in hardware.
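On the G3 and G4, then, the guard around an overflow-checked add looks something like this sketch (register numbers and the exit label are mine, not the nanojit's literal output):

```asm
; Sketch of an overflow guard using hardware mcrxr (G3/G4 only).
; 'overflow' is a hypothetical guard-exit label.
        addo.   r3, r4, r5    ; add with OE=1: sets XER[OV] on overflow
        mcrxr   cr7           ; move XER SO/OV/CA into CR7 and clear them
        bgt     cr7, overflow ; OV landed in CR7's "greater than" bit
```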
On the G5, mcrxr is trapped and emulated by the operating system. This means every time the instruction is encountered, there is a fault, the pipelines probably have to empty, the OS examines the instruction, sees what it is, runs equivalent code (to be discussed momentarily), and returns to regular execution. If you look at the code the nanojit generates for a simple loop, you will see that each and every time the loop runs, the increment is tested for overflow, and the mcrxr instruction is trapped and emulated. No wonder it sucked so badly! Apple doesn't mention this anywhere except if you dig through their code or find the few obscure bug reports about why a previously snappy executable on the G4 performs so badly on a G5.
So here is the one case where a microcoded instruction is better than the alternative. We now use this equivalent code in the G5 nanojit:
mfxer r0 ; put XER into register 0, which is scratch
mtcrf 128, r0 ; put the relevant bits into condition register 0
rlwinm r0, r0, 0, 0, 28 ; clear out the relevant bits in register 0
mtxer r0 ; and put them back into XER, clearing it
You will notice that, except for rlwinm, every one of these instructions is microcoded and requires an entire dispatch group all to itself. There's no way around it, but it's better than triggering an illegal instruction fault and forcing software emulation. Way better. We still use mcrxr on the G3 and G4, which have it in hardware, but the G5 now uses this equivalent, which is much faster on the 970. But phooey on both Apple and IBM: IBM for taking the instruction out, and Apple for not documenting it.
Well, that's enough nerdosity. It'll be in 4.0.1. It's great. More G5 docs, until Apple decides to 404 them:
wow, does it mean that from 4.0.2 we'll have nanojit_ppc on linux/ppc too?
Your work is going to be included into the official tree?
Well, I'm sure it will eventually, but not yet. What this means is that the performance deficit the nanojit has on G5 is now erased. In fact, I should respin a new patch for that bug, since you mention it.
I have no idea what you guys are talking about, not being a programmer. It sounds like you are working miracles behind the scenes with this 'nanojit' thing. All I know is that Javascript seems to be behaving itself now in TenFour. Big Thank You from this G5 PPC user!
The requirement to trap on every overflow is crazy... whoever designed that didn't think about the performance impact it has.