First of all, it looks like there have been no more major issues with Fx11, so the plan continues with Fx12, the port of which will begin after beta 3. We are looking at getting the CoreGraphics backend working with 10.4. We can still use a Cairo fallback if this is no good, but it would accelerate our canvas performance by a decent amount, and it is possible that Mozilla will use this for the entire browser chrome, so this is a potential win if we get it operational. The other thing on tap is to reschedule XPCOM calls better for the G5, in the same vein that we did for JavaScript, which should further smooth out performance.
Anxiety is building in the Power Mac community now that Mozilla has officially announced 3.6 is dead, and no one is really sure what will happen to Camino (they still make updates, and historically they have maintained old branches after the Firefox versions based upon them have ceased, such as 3.0 (Camino 2.0/Mozilla 1.9.0), but there is no clear future for the project and Mozilla 1.9.2 is quite long in the tooth by now). Some people are bailing out for Linux, and there are some nice options there. I tried Lubuntu 12.04 on my 867MHz TiBook which normally runs OS 9, and the LiveCD was very impressive. I missed having the JIT in Firefox, though ...
Nevertheless many of us, including yours truly, will remain 10.4 (and 10.5) users simply because we need the software options and/or Classic, and the browser options will continue to dwindle. Since Mozilla has indicated us as their official recommendation to Power Mac users going forward, we want to make sure we are covering all bases, which is why SUMO entries like this and this Mozillazine thread (possibly the same user) really cheese me off. Thanks, sport, for slamming the project in public (in so many words, "I think it's nice but it sucks") and then admitting in that same SUMO thread that you haven't even bothered to report your issue(s) to us, whatever they actually are. I can't fix what I don't see, but I don't even look unless you tell me about it. Chris and Theo are on the job in the support area, plus whatever crack users are around, plus myself. There's no excuse for not reporting issues as pervasive as this user claims to be experiencing anymore; we have the infrastructure and we have the volunteers to at least triage it. If we can't fix it, well, I suppose I can't blame someone for having a bad opinion of the software in that case (justified or not), but not even giving us a chance is just plain mean.
Also, yes, we know Mozilla has thrown in the towel and is working on an H.264 solution, so please stop E-mailing me, posting support threads about it, etc. A couple of notes: 1) it's going to have to involve QuickTime; I am very leery of the legal risk I or others would assume if we embed ffmpeg because there is no exception for decoders, even for not-for-profit use; and 2) while it will be faster than WebM, it will still be slower than the QTE because the QTE doesn't have to composite the video and can use OpenGL hardware to accelerate display, whereas we still have to composite, scale, etc., the video in TenFourFox. Please keep these issues in mind (and accept that I won't enable the option if it turns out Mozilla won't support QT or uses a QT X-only API). I am really hoping Mozilla's legal counsel issues some opinion on this that those of us downstream can take to the bank.
As soon as Mozilla finishes with 10.0.4, which is usually week 3 or 4 in the cycle, we will push out a go-to-build, and hopefully 12 will be out by then too.
Friday, March 23, 2012
Tuesday, March 13, 2012
11.0 release 2
This fixes the MOZ_TRACE_* crap in Console.app (it just disables it entirely except for debug builds). Also, we did try aligning JaegerStubVeneer to 16 bytes (for the interested, it's .align 4), and it makes things worse on G4 and G5; I don't understand these processors sometimes. 12 will come out sometime after Firefox 12b3. Once again, release notes and versions (this version is still 11.0, but has a later build ID):
Monday, March 12, 2012
10.0.3 released (11.0 tomorrow)
10.0.3 has been released, after confirmation that we are not vulnerable to the Pwn2Own attack.
11.0 will be out tomorrow evening after I have had a chance to rebuild and test everything. The spurious MOZ_EVENT_TRACE entries in the Console have also been turned off.
Saturday, March 10, 2012
11.0, musings and gripes (starting the unstable branch off with a bang)
Mozilla is considering their options for the release of Firefox 11 given some recent events (more presently), but I think it is important to establish our unstable branch in a timely manner to reassure you and our studio audience that TenFourFox isn't throwing in the towel with 10. (How alliterative.) Remember, for those new to this blog, that 11 through 16 beta are unstable builds. Do not use them if you're not prepared to deal with bugs; use 10.x.
Firefox 11 is not the big leap that 10 was and there is little new that is user-facing, but there are some important changes in the machinery that are a big deal to us. The two most important are SPDY support (preffed off by default) and improved animated GIF performance. Many people have noticed, and I commented way back in Fx4, that a big pack of animated GIFs on a page can bring the browser to a crawl. This doesn't completely get it back to Fx3.6, but it's a lot better, and at least now I can look at the smilies in 68KMLA without watching the core temperatures rise on the G5.
SPDY is another big deal, mostly because Google is pushing it hard and we are, near as I can tell, the only browser for PPC OS X that supports it in any form. SPDY is a modified HTTP with nearly ubiquitous TLS encryption and DEFLATE compression that furthermore multiplexes data transfer rather than traditional simultaneous sockets or sequential request pipelining. Personally, I'm not wild about it; it makes a moderately heavy protocol into a nightmare and my suspicion of Google knows no bounds. However, suitably written, it is faster, and it is definitely faster than SSL. Google Chrome supports it, natch, and uses it to talk to Google properties, and Twitter has recently deployed it, so the ball is rolling and the IETF is evaluating it for HTTP/2.0. It will become enabled by default in the Fx13 timeframe and like it or not, it's here to stay.
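To make the multiplexing idea concrete, here is a toy sketch in Python. It only illustrates interleaving several responses over one connection; it is not the SPDY wire format, and the function names, chunk contents and round-robin ordering are all made up for illustration.

```python
from itertools import zip_longest


def multiplex(streams):
    """Interleave chunks from several streams onto one connection.

    `streams` maps a stream ID to its list of chunks. The result is a
    flat list of (stream_id, chunk) frames in round-robin order.
    (Hypothetical helper; real SPDY framing is more involved.)
    """
    ids = list(streams)
    frames = []
    for row in zip_longest(*(streams[i] for i in ids)):
        for stream_id, chunk in zip(ids, row):
            if chunk is not None:  # this stream has already finished
                frames.append((stream_id, chunk))
    return frames


def demultiplex(frames):
    """Reassemble per-stream chunk lists on the receiving side."""
    out = {}
    for stream_id, chunk in frames:
        out.setdefault(stream_id, []).append(chunk)
    return out


# Three made-up responses sharing one connection.
streams = {
    1: ["html-a", "html-b", "html-c"],
    3: ["css-a"],
    5: ["js-a", "js-b"],
}
frames = multiplex(streams)
for frame in frames:
    print(frame)
```

The short stylesheet response finishes after the first round instead of waiting behind the whole HTML response, which is the point of multiplexing over sequential pipelining.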
In the local changes dept., I rewrote the G3/G4 square root routine to completely avoid red zone stores and this seems to have fixed issue 134 (and made the square root routine shorter and faster to boot). Because we do not have automated test coverage and did not detect this problem with our routine testing, I have decided to leave our inline square root disabled on 10-stable unless there is a huge hue and cry over performance regressions. So, if you are using a math-heavy application, you should probably be using unstable.
Let's also have a little episode of "Optimizing for the G5, part III" (see part II and part I) in which, yet again, we discover another nasty little secret about the PowerPC 970 that Apple never told anyone about. In this episode, we focus on the mtctr and b[c]ctr[l] instructions, which act sort of like a computed GOTO. You can load any arbitrary address into a general-purpose register and use mtctr to transfer that register into the counter "CTR" special-purpose register, used both as a (surprise) loop counter and as a branch target. Thusly loaded, you can use bctr to branch there, bctrl to call a subroutine there (computed GOSUB?), or bcctr and bcctrl to do those based on a condition register status.
We already know from our previous treatise that the G5, and in fact all POWER CPUs from the POWER4 (on which the 970 is based) through today's POWER7, divvy up the instruction stream into dispatch groups of approximately 4 instructions, give or take, with an optional branch in slot 5. There are certain restrictions about the dispatch groups. While we knew that mtctr liked to be first, in fact, you can only manipulate one SPR per dispatch group, and any SPR-manipulation instruction must be in the first slot, not just mtctr. So, if you have something like mtctr r5:mflr r0 (load CTR from GPR 5; load GPR 0 from link register), this gets executed in two groups.
But wait, it gets worse! Recall we mentioned that there is an optional slot 5 where a branch instruction can be carried along for the ride. So, slicko: we can say mtctr r24:bctr and simply branch to register 24 or whatever in one group, right? Yes, you can, but you pay a specific and severe penalty for mtctr and any CTR branch in the same dispatch group. The G3 and G4 don't have this problem, only the G5 and other "big POWER" chips.
While auditing Fx11 in Shark to make sure gcc wasn't putting bad instructions like mcrxr in despite our CPU tuning parameters, I noticed that a particular routine called JaegerStubVeneer had a disproportionate amount of access. All architectures except x86 use something called a "veneer" in JavaScript JaegerMonkey, which is used to change the return address when a native C or C++ routine has to throw an exception. It is generally a performance robber -- the ARM guys estimate its penalty at around 4% -- but there is no good way around it on RISC systems because the return address is generally in a register, not on the stack, so it can't be adjusted without having a veneer routine to go through and manipulate it. There are a lot of natives available even to a JIT routine, so it gets called frequently. The PowerPC veneer is very short and looked like this:
; Stash LR in the reserved spot in the VMFrame.
mflr r0
stw r0, 124(r1)
; Call r12.
mtctr r12
bctrl
; Get LR back.
lwz r0, 124(r1)
mtlr r0
blr
In Shark, that bctrl was amazingly hot because of this limit on the G5. Now it looks like this (and in the next release, we will align it to 16 bytes to favour the G5 and G4 even more):
; Prepare to call r12.
mtctr r12
; Stash LR in the reserved spot in the VMFrame. (second group)
mflr r0
stw r0, 124(r1)
#if defined(_PPC970_)
; Keep bctrl away from mtctr! This appears to be the optimal scheduling.
; If they are together, G5 pays a huge penalty, more than other SPRs.
; It actually got worse with two nops, and putting the stw with bctrl.
nop
nop
nop
#endif
; Branch. (third group)
bctrl
; Get LR back.
lwz r0, 124(r1)
mtlr r0
blr
As you can see from the comments, this required quite a bit of empirical testing. Optimal scheduling executes this in three dispatch groups: the mtctr all by itself, and then the mflr and stw (saving the return address so that it can be adjusted if the stub throws), and then the bctrl. We put in three nops to force the bctrl to be off in its own dispatch group and not in the branch slot of the second one. Despite being longer, this actually cuts the execution time of the veneer in half on the G5, and this small change improves V8 by over two percent!
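For those who want to play with the grouping rules, here is a toy Python model. It is a sketch, assuming only the rules described in this post: at most four non-branch slots per group, any SPR instruction must be in the first slot, and a branch rides in slot 5 and closes the group. The real 970 has more constraints than this, and the helper names are invented for illustration.

```python
# Toy model of G5 dispatch grouping; NOT a cycle-accurate simulator.
SPR_OPS = {"mtctr", "mfctr", "mtlr", "mflr"}
BRANCH_OPS = {"b", "bl", "blr", "bctr", "bctrl", "bcctr", "bcctrl"}
CTR_BRANCHES = {"bctr", "bctrl", "bcctr", "bcctrl"}


def dispatch_groups(ops):
    """Split a list of mnemonics into dispatch groups under the assumed rules."""
    groups, current = [], []
    for op in ops:
        if op in BRANCH_OPS:
            # A branch takes the branch slot and ends the group.
            current.append(op)
            groups.append(current)
            current = []
            continue
        # SPR instructions must start a group; four slots is the limit.
        if current and (op in SPR_OPS or len(current) == 4):
            groups.append(current)
            current = []
        current.append(op)
    if current:
        groups.append(current)
    return groups


def has_ctr_hazard(ops):
    """True if mtctr shares a dispatch group with a CTR branch."""
    return any(
        "mtctr" in group and bool(CTR_BRANCHES.intersection(group))
        for group in dispatch_groups(ops)
    )


OLD_VENEER = ["mflr", "stw", "mtctr", "bctrl", "lwz", "mtlr", "blr"]
NEW_VENEER = ["mtctr", "mflr", "stw", "nop", "nop", "nop", "bctrl",
              "lwz", "mtlr", "blr"]
TWO_NOPS = ["mtctr", "mflr", "stw", "nop", "nop", "bctrl",
            "lwz", "mtlr", "blr"]

for name, ops in [("old", OLD_VENEER), ("new", NEW_VENEER), ("two nops", TWO_NOPS)]:
    print(name, dispatch_groups(ops), "hazard:", has_ctr_hazard(ops))
```

In this model the old veneer puts mtctr and bctrl in the same group, the three-nop version pushes bctrl into a third group, and the two-nop version leaves bctrl in the branch slot of the second group. The model says nothing about why the two-nop version measured worse; that came from testing on real hardware.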
Interestingly, changing our entire branching system to split them in dispatch groups actually made performance worse, presumably because it made the code longer and bulkier and caused fewer branches to fit into their standard displacement (which is always faster). Admittedly, it's hard to do instruction-level scheduling based on the current design of the JIT. Instead, we just do this in certain specific places where we know they will occur together and always occur. The net improvement is nearly 3% for what is ultimately some extra no-ops and just a few lines of code.
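The displacement point can be sketched with a little arithmetic. In the PowerPC encoding, a conditional branch (bc) carries a 14-bit field plus two implied zero bits, so it reaches a signed 16-bit byte offset; an unconditional b carries a 24-bit field, so it reaches a signed 26-bit offset. The helper names and the sample distances below are made up for illustration, and what the JIT emits when a branch does not fit is not covered here.

```python
def fits_signed(disp, bits):
    """True if a byte displacement is word-aligned and fits in `bits` signed bits."""
    limit = 1 << (bits - 1)
    return disp % 4 == 0 and -limit <= disp < limit


def fits_bc(disp):
    """Conditional branch: 14-bit BD field, shifted left by two."""
    return fits_signed(disp, 16)


def fits_b(disp):
    """Unconditional branch: 24-bit LI field, shifted left by two."""
    return fits_signed(disp, 26)


# A hypothetical branch over 8190 instructions just fits a conditional branch.
before = 8190 * 4
# Padding the code in between with three nops (12 bytes) pushes it out of range.
after = before + 3 * 4

print(before, fits_bc(before), fits_b(before))
print(after, fits_bc(after), fits_b(after))
```

So padding every branch site with nops costs more than the nops themselves: code that grows can push a branch that used to fit past the short form's reach.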
I found in the LLVM sources an interesting little source file on G5 hazards and designing optimal dispatch groups which we will use in future optimizations. I attached it to issue 135 for the interested.
Now for the musings and gripes. Pwn2Own has come to its typical explosive end, and the schadenfreude is thick since Google Chrome's much ballyhooed sandbox took it on the chin (but props to Google, who are paying their promised $60,000 bounty to both successful attackers, and already have fixes on the way). Naturally, Firefox fell too, and the suspicion is that this is a cross-platform flaw which I am not allowed to talk about in detail (you'll find out soon enough). If the attack is as suspected, then we are vulnerable to it, although it would require special effort to attack Power Macs.
It is not clear if this will delay Firefox 11, but details on the exact flaw are not available, and launch day is Tuesday, so Mozilla may choose to fudge on the release date until more information surfaces. There are also some issues with video drivers that do not pose a concern to us. There will definitely be a followup release for 10-stable to address the security issue (I will wait to see if Mozilla retracts the 10.0.3 RC and issues a new one; we will follow suit), and if there is a security issue on 11 (this is not yet confirmed either), I will chemspill on this branch too.
We are presently pushing upstream our JaegerMonkey-with-type-inference backend to Mozilla as bug 731110, pending a couple of higher-priority fixes getting in first that clash with our work. That should be a nice benefit to 10.5 PPC builders building from the tree, will work with little change on AIX, and gives our Linux, Amiga and BSD brethren a starting point to convert it to SysV ABI. But it might not be there very long because of this interesting post by David Anderson in which he gives an ETA for IonMonkey, the next generation JavaScript JIT, of about 2-3 months. And, well, that really sucks. Ostensibly IonMonkey builds on the work already done with JaegerMonkey, but looking at the in-progress Mercurial tree for the ARM version of IonMonkey (which we would be based on), I say the hell it is: it's an almost completely different set of macro-ops and requires significantly longer and more complex logic for code generation. So it's kind of Sisyphean to finally get our JIT boulder up to the top after tracejit foundered and then have it roll back to the bottom in a few short months with IonMonkey. This thing had better wash windows and do dishes after the amount of effort that we invested in JM+TI. I just hope it lands after Firefox 17 so that we have some cycles to work on it.
Getting back to less gripe-y things, I was made aware of a TenFourBird project that is building a Thunderbird for PowerPC based on our changesets, and probably a few others of their own to comm-central. There are no builds available, but there is a build wiki, and I am delighted to see the project appear because I know that people have requested such a thing in the past. Please note that I know nothing about the person(s) working on it, and am not personally involved with it myself, so the usual caveats apply. Also, if you hate our icon, you'll really have a conniption with theirs. ;) Jokes aside, please let me know if you make contact with the developer(s) or have tried to build it with their instructions.
This is also a good time to point out a couple of other community builds. Tobias is maintaining up-to-date WebKit frameworks for 10.5 and has incorporated some of the JIT work for regular expressions. You should be alert for bugs, and it does not support 10.4, but Tobias has been a valued contributor to this project and I'm sure his builds will serve those of you well who need WebKit (but also make sure you support OmniWeb, which is still 10.4-compatible).
hikerxbiker is also issuing SeaMonkey builds for 10.5 PPC. These are built more or less off the tree and don't include any of our special features right now, but will eventually include the JIT when that gets through the pipeline. This might be a good option for those of you who need SeaMonkey's additional features, such as mail-news, Chatzilla, etc.
Anyway, release notes and builds (please read comments):
Wednesday, March 7, 2012
10.0.3 RC (aka: Kaiser lies again)
Okay, I lied again. 10.0.3 RC is out (read the release notes), with issue 133 for better WebM on multicore Macs, and fixing issues 132 (black images cause graphic weirdness) and 134 (Collusion extension and other square-root-dependent code on G3/G4). Please verify all is good for you, and then stay tuned for 11 in a couple days pending Mozilla signing off on the final release. 11 will not have issue 134 fixed in it, on purpose for testing, so expect that.
Monday, March 5, 2012
10.0.3 and 11 beta
Mozilla has the ESR branch marked with DONTBUILD for some reason, so the ESR RC for 10.0.3 will not be out until this weekend along with a temporary patch for issue 134, which appears to be a bug in our software square root. In the interest of stability and not duplicating effort, we'll leave it off on the stable builds and try to get it fixed in unstable.
Speaking of unstable, this post is coming from TenFourFox 11 ("but ours goes to eleven"). I lied, I decided to do a beta for it after all. You'll get that soon.