10.0.3pre is available. This fixes the issue with black images (in a better way than previously planned), adds Ben's patch for another 8% speed improvement on JavaScript SunSpider regex performance, and properly launches the right number of threads for WebM decoding on multiprocessor Macs (four on the quad, two on dualies and one on everything else -- note that dual CPU G4s under 1.25GHz are still likely not to do well, just less badly). The pull I did was before Mozilla dropped a big load of patches on the ESR tree, all of which are small individually, but the sum total suggests we should have an RC, and we will, a few days before this is scheduled to come out (mid-March). I'm also going to start working on the 11 changesets over the weekend between doing my taxes (TurboTax still runs on PPC Tiger, which is awesome).
Friday, February 24, 2012
Sunday, February 19, 2012
Server migration should be complete
As mentioned, www.floodgap.com finally is moved off its old, dear Apple Network Server 500 (as someone on 68KMLA calls it, the "end table server") and onto the POWER6. This mostly went okay except for a glitch in the webserver which caused some downtime this arvo, but I think all that is fixed. Please report if anything is still not working: you should only notice that it's faster (we went from a 200MHz 604e to a two-way SMT 4.2GHz POWER6, which is quite a leap ;).
There is already some work on 10.0.3pre to report, including a fix for black images that don't show up, enabling additional WebM decoder threads on multi-core systems, and an encore from Ben to squeak another 8% or so out of JavaScript regular expression performance. More on that soon. Now I'm going to go put the cat to bed and veg out with the rest of this pint of pralines and cream.
Friday, February 17, 2012
Dangeresque 4: This time, it's Chemspill 2 (*)
In a bid to release a chemspill just about every week, here we are at 10.0.2. This one wasn't Mozilla's fault; this was a serious flaw in a major library and everyone took this up the orifice, including Google Chrome. It also affects Firefox 3.6 and products based on it such as Camino 2.1, so you holdouts should update as well. Yes, we are vulnerable too. Because it would be more work to pull the new JS hotness out, and because I've gotten uniformly positive results from you, the beta testing audience, we're gonna have to jump to shipping it in this release also. For that reason, the changesets are exactly the same.
This was supposed to include the "black image" fix (Mozilla bugs 720035, 689962 et al., tracked for us as issue 132) but the patch Mozilla currently has for review in M689962 dragged the browser down enormously and I backed it out before release. I'm going to try a smaller scoped custom patch that simply disables image optimization if the image is all black and this should not have the performance impact of the "bigger" fix. That will be in 10.0.3pre or final, depending. If this seriously affects you, you can disable image optimization entirely for now and this is explained in M720035.
In the "I told you so" department, since we're talking about security, the latest flaw discovered in Adobe Flash can be used in a platform-independent manner to steal account information from the browser using malicious scripting. The Power Mac version of Flash 10.1 is vulnerable, and can be used to exploit the browser without special considerations for Power Macs. There will only be more of these in the future.
Release notes and versions:
UPDATE: Just read that the Frieden bros. got an updated version of Timberwolf running on AmigaOS 4. This is a fork of Firefox (much as we are also a fork of Firefox) for PowerPC Amigas, mostly targeted at the (object of my lust) PA6T-based AmigaOne X1000, but works on SAM boards and the like. While they still have some work to do to get it up to current, it's good to see PowerPC-based Firefox alive and well on other platforms besides the usual suspects.
(*) Yes, I did watch a whole bunch of Homestar Runner and Strong Bad DVDs while working on this. Why do you ask?
Sunday, February 12, 2012
10.0.2pre available, now with awesome, plus: more G5 optimization notes
After the unexpected 10.0.1 release comes now the beta that was supposed to be 10.0.1pre and is now 10.0.2pre. This is the next big leap in JM+TI that reworks branches from the simplistic always-far calls in 10.0.0 and 10.0.1 to a set of intelligent algorithms that try to favour the branch prediction of either the G3/G4 or the G5, both written by Ben. It also includes Dave's square root routine, slightly altered for efficiency. The G3/G4 square root version improves square-root heavy code significantly over the old JavaScript stub function version; V8 Raytrace improves by almost 12% (the G5 has square root in hardware, but this is a very good implementation). If I may say, it's a nice example of why POWER ISA is, well, powerful. Besides using the fast reciprocal square root estimate instruction in the 603 and up, it also makes heavy use of the built-in FPU fused multiply-add to do Newton's method in fewer instructions. I might also add that x86 only just added FMA; AMD finally implements FMA (as FMA4) in Bulldozer, but Intel won't have what PowerPC had in 1992 until Haswell in 2013 (and even then only as FMA3). So there.
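For the curious, here is a rough sketch of the general technique described above: start from a cheap estimate of 1/sqrt(x), then refine it with Newton's method written as multiply-adds. This is not Dave's routine. The names `fmadd`, `rough_rsqrt` and `newton_sqrt` are made up for illustration, the power-of-two guess merely stands in for the hardware estimate instruction (`frsqrte`), and Python's `a * b + c` rounds twice where a real fused multiply-add rounds once.

```python
import math

def fmadd(a, b, c):
    # Stand-in for the PowerPC fmadd instruction (a*b + c).
    # Plain Python rounds twice here; the hardware rounds once.
    return a * b + c

def rough_rsqrt(x):
    # Hypothetical stand-in for the hardware estimate instruction:
    # a crude power-of-two guess at 1/sqrt(x), for finite x > 0.
    _, exponent = math.frexp(x)
    return math.ldexp(1.0, -(exponent // 2))

def newton_sqrt(x, steps=8):
    # sqrt(x) computed as x * (1/sqrt(x)), refining the reciprocal
    # square root with Newton's method: y <- y * (1.5 - (x/2) * y * y)
    if x == 0.0:
        return 0.0
    y = rough_rsqrt(x)
    half_x = 0.5 * x
    for _ in range(steps):
        r = fmadd(-half_x * y, y, 0.5)   # r = 0.5 - (x/2)*y*y
        y = fmadd(y, r, y)               # y = y + y*r
    return x * y
```

Each refinement step is one multiply and two multiply-adds, which is why having FMA in the FPU keeps the instruction count down. A better starting estimate, such as the one the hardware provides, would need far fewer than the eight steps this crude guess gets by default.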
Ben's branchwork is the centrepiece of this release, however, and while it improves SunSpider by a modest amount, it improves loopy benchmarks like V8 by a huge degree. My quad G5 improves by about 45%, for example, and gets consistently around 900-950ms in SunSpider (down from around 1050). Our 1GHz 7450 G4 doesn't improve as much on SunSpider (2750 down to around 2600, so not quite at our AWOAFY? target), but still improves by about 40% on V8. Part of achieving this is splitting the way branches are handled into "big POWER" (G5) and "little POWER" (G3/G4) versions. Ben's original work did much what my code in our dear nearly-departed tracejit did, which was to have four-word branch stanzas padded with nops so that if a branch target was too big for a regular b[l] or bc[l] instruction (the normal relative branching instructions on PowerPC), we had enough room to turn it into lis ori mtctr b[c]ctr[l] which load the destination address into a register (usually r0), transfer it to the CTR, and then branch to the CTR (conditionally or always). This achieved 45% on G5 in V8 and about 35% on G4. SunSpider dropped down to less than 920 on the G5 as well.
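To make the four-word stanza concrete, here is a minimal sketch of how such a stanza could be assembled for an unconditional branch. The instruction encodings are the standard 32-bit PowerPC ones; everything else (the function names, putting the short branch in the first word with the nops after it, and ignoring the conditional forms) is an assumption for illustration and not the actual JIT code.

```python
NOP = 0x60000000  # ori r0,r0,0

def lis(rd, imm16):
    # lis rD,imm is addis rD,0,imm (primary opcode 15)
    return 0x3C000000 | (rd << 21) | (imm16 & 0xFFFF)

def ori(ra, rs, imm16):
    # ori rA,rS,imm (primary opcode 24); the immediate is zero-extended
    return 0x60000000 | (rs << 21) | (ra << 16) | (imm16 & 0xFFFF)

def mtctr(rs):
    # mtctr rS is mtspr 9,rS
    return 0x7C0903A6 | (rs << 21)

def bctr(link=False):
    # branch to CTR unconditionally; the low bit selects bctrl
    return 0x4E800420 | int(link)

def b(disp, link=False):
    # relative branch (primary opcode 18) with a 26-bit signed displacement
    return 0x48000000 | (disp & 0x03FFFFFC) | int(link)

def fits_rel26(disp):
    # True if a word-aligned displacement is reachable by a plain b/bl
    return disp % 4 == 0 and -(1 << 25) <= disp < (1 << 25)

def four_word_stanza(at, target, link=False):
    # Near target: one relative branch padded out with nops.
    # Far target: load the address into r0, move it to CTR, branch to CTR.
    disp = target - at
    if fits_rel26(disp):
        return [b(disp, link), NOP, NOP, NOP]
    return [lis(0, target >> 16), ori(0, 0, target), mtctr(0), bctr(link)]
```

Either way the stanza is the same size, so a near branch can later be patched into a far one in place without moving any surrounding code.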
So this was already a great start, but Ben's next brainwave was to free up the G3/G4's limited cache by reducing the branch stanzas further to two words, either branching to the target as usual, or if not possible, branching to a "trampoline" (not to be confused with the trampoline the JavaScript interpreter uses to enter JITted code, which is common to all and I handwrote in assembly language) in a construct called the constant pool, which has the actual far call. The constant pool is part of the JS runtime, provided by Mozilla normally for the ARM JIT where they dump constants to be referenced by the JIT code, but it doesn't have to be used for that. By doing it this way, Ben keeps more running code in cache, and as predicted, on the G4 this improved performance by another 3-4% in aggregate. (Ben later added another piece that only uses the trampoline when absolutely necessary, which fractionally improved this number further on G4.)
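A back-of-the-envelope sketch of why this saves cache: with two-word stanzas every branch site costs half as much inline, and only the sites that really need a far call pay for a trampoline out in the constant pool. The function names and the four-word trampoline size (assumed to be the same lis/ori/mtctr/branch-to-CTR sequence as before) are assumptions for illustration, and the site counts are invented, not measurements.

```python
WORD_BYTES = 4  # every PowerPC instruction is one 32-bit word

def four_word_footprint(sites):
    # Inline bytes when every branch site reserves a four-word stanza
    return sites * 4 * WORD_BYTES

def two_word_footprint(sites, far_sites, trampoline_words=4):
    # Returns (inline bytes, constant pool bytes) when every site reserves
    # two words and only far sites get an out-of-line trampoline
    inline = sites * 2 * WORD_BYTES
    pool = far_sites * trampoline_words * WORD_BYTES
    return inline, pool
```

With, say, 1000 branch sites of which 10 are far, the four-word scheme spends 16000 bytes inline, while the two-word scheme spends 8000 bytes inline plus 160 bytes of trampolines that stay out of the way of the running code.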
On the G5, however, this actually hurt performance and SunSpider climbed to a poorer result, nearly 1100ms; even with the later tweak to reduce trampoline usage, it was still around 970ms. Our theory is that the G5, being (in Apple's words) "very hungry, very fast and very sequential," pays too big an aggregate penalty to branch to an out-of-line branch stanza when a far call is encountered, for two reasons. First, it appears to be a smaller penalty (possibly even near zero given the aggressive ordering of the G5 dispatch unit) to have empty nops inline that take up some small proportion of instruction cache, because when those empty instructions are patched to a far call in-place the G5 does not need to introduce bubbles in its pipeline doing a branch into the trampoline just to branch again. In addition, the hypernerds amongst you will recall from our previous treatise on G5 optimization that there can only be one branch instruction in a dispatch group. The trampoline version must run in (at least) two dispatch groups, because there are two branch instructions, one to the far call in the trampoline and one in the trampoline itself, and both will each introduce a pipeline bubble of variable length. The far call in-place will still introduce a bubble, but the entire branch can in the best case execute in a single dispatch group because there is only one branch (the branch-to-CTR instruction at the end), and there will be only one bubble.
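The dispatch group argument can be illustrated with a toy model. This is a deliberate simplification, assuming a group holds up to four non-branch instructions plus at most one branch, and that a branch always closes its group; the real dispatch rules have more restrictions than this, and the function and list names are invented for the sketch.

```python
BRANCH_OPS = {"b", "bl", "bc", "bcl", "bctr", "bctrl", "blr"}
NON_BRANCH_SLOTS = 4  # assumed: four ordinary slots plus one branch slot

def dispatch_groups(instructions):
    # Split a straight-line list of mnemonics into toy dispatch groups
    groups, current = [], []
    for op in instructions:
        if op in BRANCH_OPS:
            # A branch takes the branch slot and ends the group
            current.append(op)
            groups.append(current)
            current = []
        else:
            if len(current) == NON_BRANCH_SLOTS:
                groups.append(current)
                current = []
            current.append(op)
    if current:
        groups.append(current)
    return groups

# The far call patched in place, versus a branch into the trampoline
# followed by the far call that lives there
IN_PLACE = ["lis", "ori", "mtctr", "bctr"]
VIA_TRAMPOLINE = ["b"] + IN_PLACE
```

In this model `dispatch_groups(IN_PLACE)` comes out as a single group, while `dispatch_groups(VIA_TRAMPOLINE)` needs two, one per branch, which matches the reasoning above about one bubble versus two.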
Because the G5 is really just a POWER4 with a deeper pipeline and AltiVec, this property is likely shared by later "big POWER" CPUs like the POWER5, POWER6 and POWER7, as well as "big POWER-like" CPUs such as the G5, Cell PPE and Xenon. We will likely have consumers that will want this branch optimization strategy, but we don't want to lose the gains we get on "little POWER" (such as G3, G4, e500, QorIQ, Gekko/Broadway and PowerPC 4xx) with the cache-saving trampoline approach, so we do both. On the G5, the original four-word stanza branching is compiled in; everything else (G3, 7400 and 7450) uses the two-word branch stanza with the constant pool trampoline. The best of both worlds is thus achieved.
One final note on G5 optimization: I tested compiling the browser with 32-byte-aligned blocks and labels in the JIT allocator, and that slowed things down too (it is not obvious whether this can be more fine-grained). For that matter, when I tried building the browser with 32-byte-aligned loops, jumps, functions and branch targets, that too slowed the browser over the 16-byte-alignment it uses now. It appears to be all a balancing act.
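One cost of wider alignment that is easy to show is padding: rounding every label up to a 32-byte boundary wastes more bytes than rounding to 16. The sketch below only illustrates that arithmetic. It is not a measured explanation of the slowdown, the helper names and addresses are invented, and it treats each label independently rather than shifting later code as a real allocator would.

```python
def align_up(addr, alignment):
    # Round addr up to the next multiple of alignment (a power of two)
    return (addr + alignment - 1) & ~(alignment - 1)

def padding_bytes(addrs, alignment):
    # Total filler needed to move each address up to its boundary
    return sum(align_up(a, alignment) - a for a in addrs)
```

For three hypothetical labels at 0x1004, 0x1014 and 0x1028, 16-byte alignment costs 32 bytes of padding and 32-byte alignment costs 64, which is instruction cache spent on filler rather than code.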
10.0.2-final will come out at the same time as the ESR release. I also plan to write the debug only 11 fairly soon. Please note that I will be transferring service from the Apple Network Server to the POWER6 this coming (USA) holiday weekend, so there may be some intermittent weirdness the weekend of the 18th/19th/20th. In the meantime, please grab a beta build and give it a spin on your architecture:
Thursday, February 9, 2012
10.0.1 chemspill
Instead of the 10.0.1pre test release I was planning to make, Mozilla has identified a high-priority security and stability issue in 10 that we are affected by, so we now have a 10.0.1 available. All users are advised to upgrade. I also snuck in a fix for issue 129, since it was a low-risk repair. The faster JavaScript engine Ben was working on will be in 10.0.2pre once I have finished merging the G3/G4 and G5 versions together; they are different code paths in this version. We will talk more about that when it is available.
Saturday, February 4, 2012
Waving from the fast lane
I am typing this in the G5 build of 10.0.1pre, which has Ben's branch rework in it and Dave's square root (slightly modified). The G5 doesn't benefit from the square root, of course, and on SunSpider it only dropped a deceptive 100ms or so (from 1050ish to around 900 -- but this meets our target), but one test is, as usual, deceptive: Dromaeo rockets to a ridiculous 180+ runs/sec, and V8 is 45% faster.
I'm going to run 7450 and 7400 builds off today for conformance testing on the lab systems here at Floodgap Orbiting HQ. If they pass, then the last thing to do is switch the build snippets to run from the ESR and update the build documentation. If all goes well, you'll get to play with this new hotness this weekend or earlier. More strong work from our contributors!