Ben's branchwork is the centrepiece of this release, however, and while it improves SunSpider by a modest amount, it improves loopy benchmarks like V8 by a huge degree. My quad G5 improves by about 45%, for example, and gets consistently around 900-950ms in SunSpider (down from around 1050). Our 1GHz 7450 G4 doesn't improve as much on SunSpider (2750 down to around 2600, so not quite at our AWOAFY? target), but still improves by about 40% on V8. Part of achieving this is splitting the way branches are handled into "big POWER" (G5) and "little POWER" (G3/G4) versions. Ben's original work did much the same as my code in our dear nearly-departed tracejit: it emitted four-word branch stanzas padded with nops, so that if a branch target was too far away for a regular b[l] or bc[l] instruction (the normal relative branching instructions on PowerPC), we had enough room to turn the stanza into lis, ori, mtctr and b[c]ctr[l], which load the destination address into a register (usually r0), transfer it to the CTR, and then branch to the CTR (conditionally or always). This achieved 45% on G5 in V8 and about 35% on G4. SunSpider dropped down to less than 920 on the G5 as well.
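For the curious, the four-word stanza idea described above can be sketched in a few lines of Python. This is only an illustration, not the actual JIT code: the function name `emit_branch_stanza` is made up, it assumes 32-bit addresses and an unconditional branch, and it assumes the near branch sits in the first word of the stanza with the nops after it (the real layout may differ). The opcode encodings themselves are the standard PowerPC ones.

```python
NOP = 0x60000000          # ori r0,r0,0
B = 0x48000000            # b / bl (opcode 18), relative form
LIS_R0 = 0x3C000000       # lis r0,imm  (addis r0,0,imm)
ORI_R0 = 0x60000000       # ori r0,r0,imm
MTCTR_R0 = 0x7C0903A6     # mtctr r0
BCTR = 0x4E800420         # bctr / bctrl


def fits_relative(disp):
    """True if disp fits the signed 26-bit, word-aligned range of b[l]."""
    return disp % 4 == 0 and -(1 << 25) <= disp < (1 << 25)


def emit_branch_stanza(src, target, link=False):
    """Return four 32-bit words for a branch at address src to target.

    Hypothetical helper: near targets get a relative branch plus nop
    padding; far targets get lis/ori/mtctr/bctr[l] patched over the same
    four words.
    """
    lk = 1 if link else 0
    disp = target - src
    if fits_relative(disp):
        # Assumed layout: branch first, padding after.
        return [B | (disp & 0x03FFFFFC) | lk, NOP, NOP, NOP]
    hi = (target >> 16) & 0xFFFF
    lo = target & 0xFFFF
    return [LIS_R0 | hi, ORI_R0 | lo, MTCTR_R0, BCTR | lk]


near = emit_branch_stanza(0x1000, 0x2000)
far = emit_branch_stanza(0x1000, 0x12345678, link=True)
print([hex(w) for w in near])
print([hex(w) for w in far])
```

Either way the stanza occupies the same four words, which is the whole point: a near branch can be patched into a far one in place without moving any surrounding code.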
On the G5, however, the trampoline approach (a shorter inline stanza that branches to an out-of-line far call) actually hurt performance and SunSpider climbed to a poorer result, nearly 1100ms; even with the later tweak to reduce trampoline usage, it was still around 970ms. Our theory is that the G5, being (in Apple's words) "very hungry, very fast and very sequential," pays too big an aggregate penalty to branch to an out-of-line branch stanza when a far call is encountered, for two reasons. First, it appears to be a smaller penalty (possibly even near zero given the aggressive ordering of the G5 dispatch unit) to have empty nops inline that take up some small proportion of instruction cache, because when those empty instructions are patched to a far call in-place the G5 does not need to introduce bubbles in its pipeline doing a branch into the trampoline just to branch again. Second, the hypernerds amongst you will recall from our previous treatise on G5 optimization that there can only be one branch instruction in a dispatch group. The trampoline version must run in (at least) two dispatch groups, because there are two branch instructions, one to the far call in the trampoline and one in the trampoline itself, and each will introduce a pipeline bubble of variable length. The far call in-place will still introduce a bubble, but the entire branch can in the best case execute in a single dispatch group because there is only one branch (the branch-to-CTR instruction at the end), and there will be only one bubble.
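To make the dispatch group argument concrete, here is a deliberately crude Python model of it. Everything about it is a simplification I am assuming for illustration: it treats a group as up to four ordinary instructions plus one branch that always ends the group, and it ignores cracking, microcoding and any per-instruction slot restrictions on the real hardware. The function name and the exact instruction sequence used for the trampoline path are hypothetical too.

```python
BRANCHES = {"b", "bl", "bc", "bcl", "bctr", "bctrl", "blr"}


def dispatch_groups(instrs, max_non_branch=4):
    """Split a sequence of mnemonics into simplified dispatch groups.

    Assumed rules: at most max_non_branch ordinary instructions per
    group, and a branch always closes the group it lands in.
    """
    groups, cur = [], []
    for op in instrs:
        if op not in BRANCHES and len(cur) == max_non_branch:
            groups.append(cur)
            cur = []
        cur.append(op)
        if op in BRANCHES:
            groups.append(cur)
            cur = []
    if cur:
        groups.append(cur)
    return groups


# Far call patched in place: one branch, at the very end.
in_place = ["lis", "ori", "mtctr", "bctr"]
# Trampoline path (hypothetical sequence): branch out, then the far call.
via_trampoline = ["b", "lis", "ori", "mtctr", "bctr"]

print(len(dispatch_groups(in_place)))
print(len(dispatch_groups(via_trampoline)))
```

Under those assumptions the in-place far call fits in one group and the trampoline path needs two, which matches the reasoning above; it says nothing about how long the resulting bubbles actually are.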
Because the G5 is really just a POWER4 with a deeper pipeline and AltiVec, this property is likely shared by later "big POWER" CPUs like the POWER5, POWER6 and POWER7, as well as "big POWER-like" CPUs such as the G5, Cell PPE and Xenon. We will likely have consumers that will want this branch optimization strategy, but we don't want to lose the gains we get on "little POWER" (such as G3, G4, e500, QorIQ, Gekko/Broadway and PowerPC 4xx) with the cache-saving trampoline approach, so we do both. On the G5, the original four-word stanza branching is compiled in; everything else (G3, 7400 and 7450) uses the two-word branch stanza with the constant pool trampoline. The best of both worlds is thus achieved.
One final note on G5 optimization: I tested compiling the browser with 32-byte-aligned blocks and labels in the JIT allocator, and that slowed things down too (it is not obvious whether this can be more fine-grained). For that matter, when I tried building the browser with 32-byte-aligned loops, jumps, functions and branch targets, that too slowed the browser over the 16-byte-alignment it uses now. It appears to be all a balancing act.
10.0.2-final will come out at the same time as the ESR release. I also plan to write the debug-only 11 fairly soon. Please note that I will be transferring service from the Apple Network Server to the POWER6 this coming (USA) holiday weekend, so there may be some intermittent weirdness the weekend of the 18th/19th/20th. In the meantime, please grab a beta build and give it a spin on your architecture: