Saturday, March 10, 2012

11.0, musings and gripes (starting the unstable branch off with a bang)

Mozilla is considering their options for the release of Firefox 11 given some recent events (more presently), but I think it is important to establish our unstable branch in a timely manner to reassure you and our studio audience that TenFourFox isn't throwing in the towel with 10. (How alliterative.) Remember, for those new to this blog, that 11 through 16 beta are unstable builds. Do not use them if you're not prepared to deal with bugs; use 10.x.

Firefox 11 is not the big leap that 10 was and there is little new that is user-facing, but there are some important changes in the machinery that are a big deal to us. The two most important are SPDY support (preffed off by default) and improved animated GIF performance. Many people have noticed, and I commented way back in Fx4, that a big pack of animated GIFs on a page can bring the browser to a crawl. This doesn't completely get it back to Fx3.6, but it's a lot better, and at least now I can look at the smilies in 68KMLA without watching the core temperatures rise on the G5.

SPDY is another big deal, mostly because Google is pushing it hard and we are, near as I can tell, the only browser for PPC OS X that supports it in any form. SPDY is a modified HTTP with nearly ubiquitous TLS encryption and DEFLATE compression that furthermore multiplexes data transfer over a single connection rather than using traditional simultaneous sockets or sequential request pipelining. Personally, I'm not wild about it; it makes a moderately heavy protocol into a nightmare and my suspicion of Google knows no bounds. However, suitably written, it is faster, and it is definitely faster than regular HTTP over SSL. Google Chrome supports it, natch, and uses it to talk to Google properties, and Twitter has recently deployed it, so the ball is rolling and the IETF is evaluating it for HTTP/2.0. It will become enabled by default in the Fx13 timeframe and like it or not, it's here to stay.

In the local changes dept., I rewrote the G3/G4 square root routine to completely avoid red zone stores and this seems to have fixed issue 134 (and made the square root routine shorter and faster to boot). Because we do not have automated test coverage and did not detect this problem with our routine testing, I have decided to leave our inline square root disabled on 10-stable unless there is a huge hue and cry over performance regressions. So, if you are using a math-heavy application, you should probably be using unstable.

Let's also have a little episode of "Optimizing for the G5, part III" (see part II and part I) in which, yet again, we discover another nasty little secret about the PowerPC 970 that Apple never told anyone about. In this episode, we focus on the mtctr and b[c]ctr[l] instructions, which act sort of like a computed GOTO. You can load any arbitrary address into a general-purpose register and use mtctr to transfer that register into the counter "CTR" special-purpose register, used both as a (surprise) loop counter and as a branch target. Thusly loaded, you can use bctr to branch there, bctrl to call a subroutine there (computed GOSUB?), or bcctr and bcctrl to do those based on a condition register status.

We already know from our previous treatise that the G5, and in fact all POWER CPUs from the POWER4 (on which the 970 is based) through today's POWER7, divvies up the instruction stream into dispatch groups of approximately 4 instructions, give or take, with an optional branch in slot 5. There are certain restrictions about the dispatch groups. While we knew that mtctr liked to be first, in fact, you can only manipulate one SPR per dispatch group, and any SPR-manipulation instruction must be in the first slot, not just mtctr. So, if you have something like mtctr r5:mflr r0 (load CTR from GPR 5; load GPR 0 from link register), this gets executed in two groups.

But wait, it gets worse! Recall we mentioned that there is an optional slot 5 where a branch instruction can be carried along for the ride. So, slicko: we can say mtctr r24:bctr and simply branch to register 24 or whatever in one group, right? Yes, you can, but you pay a specific and severe penalty for mtctr and any CTR branch in the same dispatch group. The G3 and G4 don't have this problem, only the G5 and other "big POWER" chips.

While auditing Fx11 in Shark to make sure gcc wasn't putting bad instructions like mcrxr in despite our CPU tuning parameters, I noticed that a particular routine called JaegerStubVeneer was getting a disproportionate number of hits. All architectures except x86 use something called a "veneer" in JavaScript JaegerMonkey, which is used to change the return address when a native C or C++ routine has to throw an exception. It is generally a performance robber (the ARM guys estimate its penalty at around 4%), but there is no good way around it on RISC systems because the return address is generally in a register, not on the stack, so it can't be adjusted without having a veneer routine to go through and manipulate it. There are a lot of natives available even to a JIT routine, so it gets called frequently. The PowerPC veneer is very short and looked like this:

; Stash LR in the reserved spot in the VMFrame.
mflr r0
stw r0, 124(r1)

; Call r12.
mtctr r12
bctrl

; Get LR back.
lwz r0, 124(r1)
mtlr r0
blr


In Shark, that bctrl was amazingly hot because of this limit on the G5. Now it looks like this (and in the next release, we will align it to 16 bytes to favour the G5 and G4 even more):

; Prepare to call r12.
mtctr r12

; Stash LR in the reserved spot in the VMFrame. (second group)
mflr r0
stw r0, 124(r1)
#if defined(_PPC970_)
; Keep bctrl away from mtctr! This appears to be the optimal scheduling.
; If they are together, G5 pays a huge penalty, more than other SPRs.
; It actually got worse with two nops, and putting the stw with bctrl.
nop
nop
nop
#endif

; Branch. (third group)
bctrl

; Get LR back.
lwz r0, 124(r1)
mtlr r0
blr


As you can see from the comments, this required quite a bit of empirical testing. Optimal scheduling executes this in three dispatch groups: the mtctr all by itself, then the mflr and stw (saving the return address so that it can be adjusted if the stub throws), and then the bctrl. We put in three nops to force the bctrl to be off in its own dispatch group and not in the branch slot of the second one. Despite being longer, this actually cuts the execution time of the veneer in half on the G5, and this small change improves V8 by over two percent!
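
To make the grouping argument easier to follow, here is a toy Python model of dispatch-group formation. It is only a sketch: it encodes just the rules described in this post (four non-branch slots, an optional branch in slot 5, and an SPR-manipulating instruction forced into the first slot), it assumes a nop occupies an ordinary slot, and the function name and opcode sets are made up for illustration. The real 970 has many more restrictions, so treat the output as an aid to reasoning, not as a cycle-accurate simulation.

```python
# Toy model of PowerPC 970 dispatch-group formation (illustrative only).
# Assumptions: only the rules stated in the post are modelled, nop takes a
# normal slot, and the opcode sets below are a small hypothetical subset.

SPR_OPS = {"mtctr", "mfctr", "mtlr", "mflr"}
BRANCH_OPS = {"b", "bl", "blr", "bctr", "bctrl", "bcctr", "bcctrl"}


def dispatch_groups(instructions):
    """Split a list of mnemonics into dispatch groups under the toy rules."""
    groups, current = [], []
    for op in instructions:
        if op in BRANCH_OPS:
            # A branch rides along in slot 5 and always ends the group.
            current.append(op)
            groups.append(current)
            current = []
            continue
        # An SPR op must be first in its group; at most 4 non-branch slots.
        if current and (op in SPR_OPS or len(current) == 4):
            groups.append(current)
            current = []
        current.append(op)
    if current:
        groups.append(current)
    return groups


old_veneer = ["mflr", "stw", "mtctr", "bctrl", "lwz", "mtlr", "blr"]
new_veneer = ["mtctr", "mflr", "stw", "nop", "nop", "nop",
              "bctrl", "lwz", "mtlr", "blr"]

# Old veneer: mtctr and bctrl share a group, which is the penalty case.
print(dispatch_groups(old_veneer)[1])    # ['mtctr', 'bctrl']

# New veneer: mtctr alone, then mflr/stw plus padding, then the bctrl.
print(dispatch_groups(new_veneer)[:3])
```

In this model, dropping to two nops lets the bctrl slide into the branch slot of the second group alongside the stw, which matches the case the code comments report as slower.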

Interestingly, changing our entire branching system to split them into separate dispatch groups actually made performance worse, presumably because it made the code longer and bulkier and caused fewer branches to fit into their standard displacement (which is always faster). Admittedly, it's hard to do instruction-level scheduling based on the current design of the JIT. Instead, we just do this in certain specific places where we know they will occur together and always occur. The net improvement is nearly 3% for what is ultimately some extra no-ops and just a few lines of code.

I found in the LLVM sources an interesting little source file on G5 hazards and designing optimal dispatch groups which we will use in future optimizations. I attached it to issue 135 for the interested.

Now for the musings and gripes. Pwn2Own has come to its typical explosive end, and the schadenfreude is thick since Google Chrome's much ballyhooed sandbox took it on the chin (but props to Google, who are paying their promised $60,000 bounty to both successful attackers, and already have fixes on the way). Naturally, Firefox fell too, and the suspicion is that this is a cross-platform flaw which I am not allowed to talk about in detail (you'll find out soon enough). If the attack is as suspected, then we are vulnerable to it, although it would require special effort to attack Power Macs.

It is not clear if this will delay Firefox 11, but details on the exact flaw are not available, and launch day is Tuesday, so Mozilla may choose to fudge on the release date until more information surfaces. There are also some issues with video drivers that do not pose a concern to us. There will definitely be a followup release for 10-stable to address the security issue (I will wait to see if Mozilla retracts the 10.0.3 RC and issues a new one; we will follow suit), and if there is a security issue on 11 (this is not yet confirmed either), I will chemspill on this branch too.

We are presently pushing upstream our JaegerMonkey-with-type-inference backend to Mozilla as bug 731110, pending a couple of higher-priority fixes getting in first that clash with our work. That should be a nice benefit to 10.5 PPC builders building from the tree, will work with little change on AIX, and gives our Linux, Amiga and BSD brethren a starting point to convert it to SysV ABI. But it might not be there very long because of this interesting post by David Anderson in which he gives an ETA for IonMonkey, the next-generation JavaScript JIT, of about 2-3 months. And, well, that really sucks. Ostensibly IonMonkey builds on the work already done with JaegerMonkey, but looking at the in-progress Mercurial tree for the ARM version of IonMonkey (which we would be based on), I say the hell it is: it's an almost completely different set of macro-ops and requires significantly longer and more complex logic for code generation. So it's kind of Sisyphean to finally get our JIT boulder up to the top after tracejit foundered and then have it roll back to the bottom in a few short months with IonMonkey. This thing had better wash windows and do dishes after the amount of effort that we invested in JM+TI. I just hope it lands after Firefox 17 so that we have some cycles to work on it.

Getting back to less gripe-y things, I was made aware of a TenFourBird project that is building a Thunderbird for PowerPC based on our changesets, and probably a few others of their own to comm-central. There are no builds available, but there is a build wiki, and I am delighted to see the project appear because I know that people have requested such a thing in the past. Please note that I know nothing about the person(s) working on it, and am not personally involved with it myself, so the usual caveats apply. Also, if you hate our icon, you'll really have a conniption with theirs. ;) Jokes aside, please let me know if you make contact with the developer(s) or have tried to build it with their instructions.

This is also a good time to point out a couple of other community builds. Tobias is maintaining up-to-date WebKit frameworks for 10.5 and has incorporated some of the JIT work for regular expressions. You should be alert for bugs, and it does not support 10.4, but Tobias has been a valued contributor to this project and I'm sure his builds will serve well those of you who need WebKit (but also make sure you support OmniWeb, which is still 10.4-compatible).

hikerxbiker is also issuing SeaMonkey builds for 10.5 PPC. These are built more or less off the tree and don't include any of our special features right now, but will eventually include the JIT when that gets through the pipeline. This might be a good option for those of you who need SeaMonkey's additional features, such as mail-news, Chatzilla, etc.

Anyway, release notes and builds (please read comments):
  • G3 (the original build was removed due to an architecture tag failure; build now corrected)
  • G4/7400
  • G4/7450
  • G5

14 comments:

  1. G3 crashes on launch with:
    Library not loaded: @executable_path/XUL
    Referenced from: /Applications/TenFourFoxG3.app/Contents/MacOS/libxpcom.dylib
    Reason: no suitable image found. Did find:
    /Applications/TenFourFoxG3.app/Contents/MacOS/XUL: incompatible cpu-subtype
    /Applications/TenFourFoxG3.app/Contents/MacOS/XUL: incompatible cpu-subtype

  2. Damn it. It's linking as ppc7450. I'm going to have to retract the build and figure out why it won't relink libxul as ppc750.

  3. They all need to be scrubbed, it's a problem with libffi. I'll leave the other three up while I figure this out since they will at least run (ppc7400 will run ppc7450 fine).

  4. Stuck config.* setting, looks like a change in the build system that broke the buildbot. Leaving it to rebuild overnight. Sorry, G3 folks, you're stuck with 10.0.3 for right now.

  5. I did some performance testing. Maybe this helps to evaluate the impact of the square root issue.

    10.0.3pre
    Peacekeeper: 331, 327, 317
    Dromaeo V8: 51.27 runs/s (best of two test runs)

    10.0.3RC
    Peacekeeper: 346, 326, 332
    Dromaeo V8: 50.80 runs/s (best of two test runs)

    11.0b
    Peacekeeper: 321, 309, 351
    Dromaeo V8: 54.24 runs/s (best of two test runs)

    Each test run was done after a computer restart with a fresh profile and no other apps running (incl. Finder, Bluetooth, AirPort etc.). This is how I'll test from now on because I think it's the only way to get objective results with no interference from caching, RAM swapping etc. I'd like to have tested Sunspider, but they're on strike right now. Peacekeeper results are impressive, but unlike other tests still vary a lot in the same browser version, and I'm not sure what causes this.

  6. @Phil, G3 version is restored. Please advise ASAP if this still has the same problem, but lipo -info says XUL, firefox and js are all ppc750 now. I ran this build off by hand, not the buildbot, so everything should be squeaky clean.

    @Chris, I don't know why Peacekeeper has such terrible inter-test variability, and it makes substantive comparisons hard (309 to 351 is a 14% difference: that's a heck of an error bar). For SunSpider and V8 I now run everything from the command line. The SunSpider bench pack from the WebKit tree has excellent reproducibility and minimal variability, on the order of <0.5%, and Dromaeo makes so many passes that it smooths out these test-to-test spikes also. Peacekeeper really needs to find some way of ironing it out, which is a shame since it's the only widely accepted benchmark that tries to exercise all levels of the stack.

  7. Oh, and the Dromaeo difference from pre to RC is expected. We have to balance the improvement against another big bug we can't test for, and Collusion is a major extension. So a change in Dromaeo of around 0.5 r/s is unfortunate, but acceptable for a branch where stability is the major goal. There must be some other sauce in 11 that brought up the rest of the score.

  8. "TenFourBird"... I love the name. Then the PPC version of SeaMonkey can be called TenFourMonkey.

  9. >Oh, and the Dromaeo difference from pre to RC is expected.
    Yes, I thought so. I've been reading along in issue 134 (I have no skills to contribute there, but it was interesting anyway); stability is definitely more important.

  10. Excellent post! Currently typing from SeaMonkey 2.6!

    Thank you for the unstable! And realizing how useless 99% of us are with changesets xD

    Another caveat for 11 is VDH is currently broken, so if you are a flash-snatcher 10 may still be your best bet.

  11. Is TFF 11 compiled with some kind of debug/trace? I'm seeing lots of "MOZ_EVENT_TRACE" entries in my Console log.

  12. It shouldn't be, but yes, I'm seeing that too. Damn it!

  13. ... but only on the G4; the G5 does it infrequently. I think it's because the event tracer is set too low. Frankly it's a lot of false alarms so I'm just going to turn it off. If you want a workaround, these env vars control it:

    * Set MOZ_INSTRUMENT_EVENT_LOOP=1 in the environment to enable
    * this instrumentation. Currently only the UI process is instrumented.
    *
    * Set MOZ_INSTRUMENT_EVENT_LOOP_OUTPUT in the environment to a
    * file path to contain the log output, the default is to log to stdout.
    *
    * Set MOZ_INSTRUMENT_EVENT_LOOP_THRESHOLD in the environment to an
    * integer number of milliseconds to change the threshold for reporting.
    * The default is 20 milliseconds. Unresponsive periods shorter than this
    * threshold will not be reported.
    *
    * Set MOZ_INSTRUMENT_EVENT_LOOP_INTERVAL in the environment to an
    * integer number of milliseconds to change the maximum sampling frequency.
    * This variable controls how often events will be sent to the main
    * thread's event loop to sample responsiveness. The sampler will not
    * send events twice within LOOP_INTERVAL milliseconds.
    * The default is 10 milliseconds.

    I'm not going to do a rebuild until 11's disposition is determined, however.

  14. I see it on G3 as well. But it's not a big deal, it doesn't seem to affect performance much, and the cron scripts will do their work.

