Monday, April 25, 2011

Towards TenFourFox 5 and 4.0.2pre

I'm putting this post out a little early to talk about what we're going to do with Firefox 5 because my hand's been forced a little bit; I've been seeing some well-intentioned blog posters saying, among other things, that TenFourFox is nice and holds the fort but isn't going to get any updates. Besides the fact we just had one, this is unequivocally, unbelievably, unimaginably false. I personally use TenFourFox; I wouldn't shoot myself in the foot by failing even to maintain security updates, let alone additional features. When (not if) we reach a state where we cannot hack "Firefox.next" to build on 10.4 PPC, then we simply carry on maintenance on the branch we know is still working. If nothing else, the browser will remain safe to use even if the technology reaches a state of arrested development. Don't forget that Camino currently uses a three-year-old renderer, but they maintain and land security fixes from later branches to keep the browser safe to use. We'll be doing the exact same thing. I'll discuss that in more detail in a moment.

On that note, let's talk about Firefox 5. Firefox 5 really should be called 4.1 because it is, indeed, the minor feature update to Firefox 4 (as opposed to 4.0.1, which is the minor maintenance update). That said, it does have a number of important features we will want. Very few of these are user-facing, mind you. I've been playing with Aurora (essentially the "stable alpha") on my lonely Core 2 Duo mini, and at a UI level you're not going to see much difference. Like Snow Leopard to Mozilla's "Leopard," Firefox 5 really improves very little of the interface relative to Fx4.

Of the features that are there, however, the most important one is CSS Animation. Remember, Flash or any other plugin will not be enabled in "TenFourFox 5," and in this HTML5 world, we want to encourage animation and graphics methods that do not rely on Flash. CSS Animation is part of that transition away from proprietary, buggy and (in our case) unmaintained ways of doing graphics and animation in the browser. There are also quite a few performance improvements in Firefox 5 that will surely benefit us, because sadly our machines aren't going to get any faster by themselves.

For TenFourFox 5, in addition to getting all that working, I would also like to expand some of our marquee features to add even better performance. For example, the JavaScript nanojit is now stable and has achieved enough cross-processor parity that we can flip it on for browser chrome as well. In fact, I've been running my 4.0.1 in that manner for over a month (go into about:config, set javascript.options.tracejit.chrome to true), which has been nice and stable, and gives a noticeable speed boost at the cost of additional memory needed for the traces. This is kind of a big change to throw at 4.x, but we definitely want it in 5.x. I'm also hoping to finally fix once and for all the problem with our generated prologue and certain kinds of native code calls, which are currently handled with a safe but slower abort to the interpreter. If these calls can be embedded in the trace, we can get even faster, but this has an obvious stability impact and I want extensive beta coverage. Finally, I want to support JavaScript typed arrays in the nanojit; these currently fall back on the interpreter, and the apps that are likely to use typed arrays for increased performance are ironically going to perform worse on TenFourFox because of the fallback. Preliminary work on this has already been done. I think this work list is fully achievable, especially because a lot of it is already done or in progress.

In the would-be-nice department, I want to continue our progress towards full AltiVec acceleration of the content chain. We've started (I think very successfully) by enabling AltiVec in libpixman and libvpx. There are some trivial additional code conversions that I can do in libvpx, and I'd like to try implementing them. In addition, Fx5 adds libjpeg-turbo, which is already faster in and of itself, but also gives us the chance to convert the MMX/SSE code it includes into AltiVec/VMX. This should greatly speed up our image display and processing. Finally, I want to track down other places in the code where Mozilla has already implemented SSE or SIMD equivalents for C code and write VMX versions. This work is unlikely to be fully completed for TenFourFox 5, but some parts should make it.
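
To give a flavor of what one of these conversions looks like, here is a hand-rolled sketch (not actual code from libjpeg-turbo or libvpx, and the register assignments are made up): an MMX/SSE byte average such as pavgb maps almost one-to-one onto a single VMX instruction that chews through sixteen pixels at a time.

lvx    v0, 0, r3    ; load 16 source pixels from the address in r3
lvx    v1, 0, r4    ; load 16 pixels from the second row
vavgub v2, v0, v1   ; byte-wise unsigned average, the VMX counterpart of pavgb
stvx   v2, 0, r5    ; store all 16 results in one go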

As I alluded to previously, Firefox 5 is not a slam dunk: we need to get Chromium IPC working on Tiger. I know it's possible to get it to compile, but linking was historically an issue because with IPC everything is clotted into a huge superlibrary called libxul (and in debug mode, this was too large for Tiger's linker to pull together), so it has never been tested. Fx4 does not require it (just strongly encourages it), but Fx5 demands it. Again, I remain cautiously confident the port is possible, but even if it is not, let me reiterate that even if we remain stuck on Mozilla 2.0 forever, we will still get security updates landed on TenFourFox and the browser will remain functional and safe. You have my word on it: I eat my own dogfood. I hope that Classilla's longevity and update history demonstrates to people that TenFourFox is here for the long haul too.

That aside, Fx5 sounds pretty good, right? Well, don't get too raring to go just yet, because we have to consider the ramifications of the migration. Since our last discussion on this topic, more information has become available, and most of that information from our perspective is bad news. Fx4 will be the last of the "big" releases and the last to retain the old release EOL timeframe, which is usually six months of updates after release of the next version (and you can bet that under the new rapid release framework it won't be a second longer). Fx5 and later will be unsupported the second the next version comes out.

What does that mean for support? Well, barring something extraordinary, as soon as Fx6 emerges there will be no more Fx5 updates. Period. Similarly, when Fx7 comes out (!), no more Fx6. This means the safety net that would ordinarily be under us with a stable branch like Fx4 will no longer exist once we jump to Fx5. If there is a serious bug discovered in Fx5 and Fx6 has already come out, then there will be no fix for it in Fx5; Mozilla wants you to simply update. This is not ill-conceived on their part, but it sucks for people like us trying to actually build around the updates because the ground keeps shifting under our feet, and may require us to do a significant amount of backporting. I'm also not sure what this will mean for extension compatibility. Mozilla has plans in place for carrying forward addons semi-automatically, but if this process cuts off backward compatibility for old versions at the same time (as, possibly, a carrot for updating), that could really bite us.

To balance this and to spare much unnecessary work, this is how the rollout will happen, assuming the port of Fx5 is successful.

- Work on TenFourFox 5 will not commence until Firefox 5 is released. Other than chemspill releases, no more work will occur on the release branch once it is final, because the next six-week release will clobber it. Therefore, we know nothing is going to change for Mozilla 5, and this simplifies our porting.

- Until work begins, there will be 4.x updates, and at least one more 4.x release (4.0.2pre; we'll get to that) will have feature work.

- Once work begins, we will have our own separate formal betas, just as before. I envision no more than three, most likely. 4.x updates will continue simultaneously as pure security and stability updates only.

- Once we are final, we will publish a limited set of 5.x security and stability releases while we look at Firefox 6, which should be out by then or soon after. I am working on getting security access at Mozilla so that I can review and backport security fixes as they happen, though in the short term we can still get this information from publicly-available Mercurial commits, and they will land on 5.x too.

The cycle will then repeat as long as we are able to compile Firefox.next, whatever it happens to be. If we jump to Firefox 6, there will be 6.x betas (with 5.x releases), followed by 6.x releases while we look at Fx7, and so forth. In essence we will be emulating the old Mozilla release mechanism, except that we will actually be maintaining our own old branches instead of relying on Mozilla to do it for us.

At the point where we can no longer move forward, we set up our own Mercurial repo and maintain the last good branch "forever," backporting all security fixes that apply, any bug fixes that apply, and (very carefully) new features that we might need -- limited largely to browser core, layout, content, DOM and JS. JavaScript is pretty straightforward to keep current, fortunately, because it is largely self-contained. The rest will be examined on a case-by-case basis. This is what I've referred to as security parity and feature parity in the Wiki.

I really hope this puts people's concerns to rest; TenFourFox is here to stay. I need a browser that works on my G5. You can rely on this because I have my own skin in the game, as do, presumably, others who have contributed and who will contribute in the future.

Now, about 4.0.2pre. This is a feature release, but with only one feature, which is to try new compiler settings. Since this can lead to very subtle bugs, I really want to keep everything else constant to avoid nasty little variables I can't control for. Part of this will be to enable -ftree-vectorize for G4/G5 and attempt a proper 32-bit compile for G5 without using the hackish hybrid compile string we use now (64-bit Firefox on G5 still has issues, and we don't really benefit from 64-bit anyway in the same way that x86 does). This won't affect the G3 releases, obviously. It will also include whatever other fixes Mozilla has landed for Firefox 4.0.2. By the way, they are holding their official release of Firefox 4.0.1 until later this week due to a last-minute issue, but this is apparently an oddiment of their update infrastructure and does not concern us. 4.0.2pre will be available when Mozilla releases 4.0.2 to the beta channel, and I'll give you a heads up when it is about to arrive.

Well, that's enough blather; if this does not put your questions to rest, ask them in the comments.

Friday, April 22, 2011

4.0.1 final is pushed

4.0.1 final is pushed to Google Code and update notifications should be going out. Those of you on the beta edge shouldn't notice anything different; other than a quick internal branding change, it is the same as 4.0.1pre. I'll have more to say about 4.0.2pre and 5 in an upcoming post. For now, grab it and carry on.

Friday, April 15, 2011

4.0.1pre is now available

So, here's 4.0.1pre, and here's what's in it:

- The revised G5 JavaScript nanojit. Which is, you know, faster. And therefore doesn't suck. (G5 only)

- The AltiVec accelerated WebM video decoder (VP8, to be more precise). (7400/7450/G5)

- The scroll-slower-upper-thinger to make Flash applets not artifact quite as badly. They will still smear with scrolling, but less noticeably. (All)

- All the Firefox 4.0.1 fixes to date. There are quite a few; some don't affect us, but there are some security-related issues in here (none public) and some crash fixes. (All)

Plus this bonus hotness I snuck in:

- Tuned movie playback for all platforms, which affects not just WebM but also Theora. This causes slower Macs to buffer more decoded frames and smooth out the audio even if the video is choppy. The settings are different for the G5 than for the G4/G3 because the G5 has much greater bandwidth thanks to its much faster front-side bus. (All)

- Also enabled AltiVec compositing in libpixman. This means that pixel compositing is now done with AltiVec operations, which affects just about everything in the browser, really. The effect is subtle, but it smooths out animations nicely and improves repainting speed. (7400/7450/G5)

Before you download this, consider grabbing a video and watching it in 4.0s just to see your starting point. I recommend going to www.youtube.com/html5 and signing up for the WebM trial (you don't need to have an account for that, just join with the link at the bottom), then viewing this video about Google's fiber experiment. If that video comes up in Flash, you did it wrong. This is a good test video because it has scenes with high data rates (the product manager on camera) alternating with lower ones (the animation sequences in the middle, and the title card and end card), so it's a good general overview.

On my G4/450 Sawtooth (7400) -- and by the way, I still don't support systems slower than 1.25GHz for video -- this video is pretty much unplayable with 4.0s. The audio is immensely fractured and you can forget about any video frames.

On my iBook G4/1.33 (7450) with 4.0s, the video and audio are extremely choppy with Energy Saver set to Automatic. Some frames appear and the audio is at least intelligible, but still fractured. It improves marginally at Highest.

On my quad G5 in Reduced with 4.0s, it plays in fits and starts. The snippets it does play are normal, but its data pipeline poops out and it has to buffer again repeatedly between them. On my quad G5 in Highest, it plays with only rare audio artifacts. I don't have a G3 handy right now (my PDQ is in pieces on my workbench waiting for a new hard disk).

So, now install 4.0.1pre and play it again. On the Sawtooth G4, the audio is artifacted, but now intelligible. The video is a glorified slideshow, but frames do appear at least occasionally. Hey, WebM video is pretty hefty to decode, and this computer is eleven years old, so whaddya want? :P

On my iBook G4 in Automatic with 4.0.1pre, it skips frames frequently, but the audio is nearly intact. In Highest, it still skips, but less often, so I'd consider this playable.

On my quad G5 in Reduced with 4.0.1pre, it now plays perfectly.

Please note that Mozilla's streaming code is not terrifically robust. For example, if you enlarge or contract the Google video gadget using the arrows icon, then you start getting video artifacts when something overlays the video, and the buffering starts to seize up. The code doesn't seem to be able to handle a sudden drop in throughput well (a similar effect occurs when you change processor speed midstream in System Preferences). You can fix this by rewinding back to the beginning of the video, and then the streamer will be able to properly buffer. It looks like Mozilla's code just doesn't know what to do with the CPU changing performance characteristics abruptly or the video being dynamically resized as the video plays, but in fairness this is probably a lot to ask of it.

By the way, only my quad G5 in Highest could handle enlarged video fully (in Reduced, it got choppy), and don't even think about playing it full screen, because we can't hardware blit (this poor G5 got absolutely crushed trying to play Big Buck Bunny at 1920x1080 in software). This will only get better and still has some getting better to do, but it's a start, and the combination of buffer adjustments and VP8 VMX decoding does help.

Note that the WebM container decoding is still in C, and the actual portion of Mozilla's code that does the conversion to pixels (before they are blitted, which is VMX-accelerated) is also "just" in C. Firefox 5 is adding SIMD-based decoding of JPEG images; in at least older versions of libogg there was some AltiVec support; and I'd like to at some point have an entire AltiVec-accelerated content chain, all the way from raw data to rendering. There is some code that I think may be trivially converted to AltiVec (either in C or assembly) to gain us even more speed on the non-optimized sections, but that's for a later time.

Here is what I need from you wonderful beta hounds:

- G3 owners: Expect scrolling to be slower when a Flash applet is on-screen, but is it too slow? WebM video will probably hardly play at all on your system, but it should not crash (which would mean AltiVec code snuck in). Is there any playback degradation with Ogg and VP3/Theora video using the new buffer settings?

- G4 owners: Improvement noticeable? Above 1.25GHz, is video at least acceptable, even if it's a bit choppy or imperfect visually?

- G5 owners: Improvement noticeable? How is JavaScript performance now?

Mozilla is planning an April 26th release for 4.0.1, and so will we. Anyway, go get it, have at it, and post your observations in the comments. You should get an upgrade notification when 4.0.1 final is available.

Tuesday, April 12, 2011

I am the world's biggest liar

Dear readers, I must confess to you all what an amazingly brazen and horrible liar I have been. For days, nay, weeks, you have laboured under the completely fraudulent impression that AltiVec-accelerated WebM video was coming to 4.0.2pre, and my conscience, ravaged by guilt and dismay, cannot abide to persist in my duplicity any longer. I must therefore reveal to you the awful, shameful truth ...

... that AltiVec WebM is going to be in 4.0.1pre! YAY!!!

Yes, it actually works! With a little preprocessing of the included AltiVec sources from Google libvpx, some adjustments by hand and a lot of glue code in the build system, we are now building a mostly (not fully, I'll explain in a second) VMX/AltiVec-accelerated VP8 codec, just like the SSE2 and NEON-juiced VP8 codecs for x86 and ARM! It'll be in the 7400, 7450 and G5 releases.

This version now makes all but the highest data rate videos at least playable (and some completely playable) on G5 and high-end G4 machines, and makes it possible for video to play at all on low-end G4s (but please note that the recommended 1.25GHz clock speed remains). As my standard, I used a clip from Big Buck Bunny which would only play fully on my quad G5 if I turned Energy Saver to Highest (I usually run in Reduced to save power and increase the life of the machine), otherwise the data pipeline would run dry and it would seize up repeatedly. With the new AltiVec VP8, it runs all the way through. No stutters, no hiccups. YouTube HTML5 performed splendidly. It's a beautiful thing.

Oh, but that's not all that'll be in 4.0.1pre. Besides the G5-enabled JavaScript acceleration (down to 1760ms in SunSpider!) I talked about in our last post, I've also found a kludge to reduce Flash screen artifacts when scrolling. I'm sure the careful ones out there have noticed that scrollllling verrrrry sloooooowllly will keep the artifacting down to a minimum, and there's a way you can do this already: enable Smooth Scrolling under Preferences, Advanced, General. However, smooth scrolling is definitely slower and it stinks to do it all the time if you're not used to it. So, why not enable smooth scrolling when a plugin is onscreen, and then revert to the user's preference otherwise? Why not indeed! And, while there is still some artifacting, it is much, much less. Of course, if you use HTML5 video like our new AltiVec WebM, you won't need Flash. I'm just saying.

This is not an unqualified success, however. I said the WebM code is mostly AltiVec accelerated, and it is. However, we do not have assembly source for most of the inverse discrete cosine transform algorithms, just for one of them. I looked at the C version for the other inverse DCTs and it looks pretty obvious to vectorize, but this is going off into completely new work territory and I think I'd rather not do that for a stable branch even though this is getting beta coverage (read on). Fortunately, the especially computationally intensive parts such as the filtering are fully written in assembly and we do have those. Also, this means we have points to improve on in the future, so performance should only get better.

Also, just because the decoder is AltiVec-enabled doesn't mean the compositor is, and I've mentioned before that the graphics stack in Firefox 4 is slower than Firefox 3.6 on non-accelerated systems (and all TenFourFox builds are non-accelerated because PPC Tiger lacks OpenGL 2). If you try to play a WebM video expanded or full screen, then you're also testing how well we blit to the screen and scale the image, and we already know that's a bottleneck. At least for now, full screen video will still be the domain of Flash Player. (I was hoping Cairo 1.10 would land in Firefox 5, but it looks like it won't make the cutoff after all.)

Do note that the reason I want to put all this hotness into a quickie beta release is because this is a lot of new and relatively untested code. G3 owners are particularly important because 1) I don't want AltiVec code leaking into your builds and crashing you and 2) I want to make sure that the plugin scrolling hack doesn't make your machines in particular too slow. (Flash itself might, ha ha, but we shouldn't.) Similarly, I want to make sure that the AltiVec acceleration on G4/G5 is as good on as wide a range of systems as I think it is, and ditto for JavaScript on the G5.

This is all very convenient because Mozilla is planning to release 4.0.1, which they have named "Macaw" (presumably after watching Rio trailers and playing a lot of Angry Birds), on April 26th. With luck, there should be nothing landing on the 2.0 release branch tomorrow, so I can pull down the changes, spin off a few builds, and hopefully have betas out to you lucky devils by this Thursday or Friday. There are a lot of fixes in 4.0.1, mostly for crashes and a couple for some possible security issues. The plan is to release our own betas and, assuming they pass muster, release the same day as the regular 4.0.1 to the general audience. In the meantime, I'll have more to say about Firefox 5 in a future post.

Anyway, will you forgive me for my lies and heartbreak? I knew you would.

Friday, April 8, 2011

Attention G5 owners: your JavaScript no longer sucks*

(* or, what Apple never told developers about the PowerPC 970)

It's time for celebration. My quad 2.5GHz G5 is now benching 1760ms on SunSpider using the internal 4.0.1pre I've whipped up, and a whopping 93 runs/sec on Dromaeo. That's a little over half the raw interpreter's SunSpider runtime (about 3370ms), and less than half of the limping 3710ms it gets with the current hybrid interpreter-nanojit used by the G5 build of 4.0s. I bet you lucky dogs with dual 2.7GHz G5 systems will see even better. Everything is faster. Birds are chirping. The river is high. Gold bricks are falling from the sky, and not on people. This is so tasty it will be in 4.0.1, as so far it's just as stable as the G3 and G4 versions and runs even faster.

Since this blog is designed around some technical nerdosity, let's engage in a little, and also collect what I've learned over the difficult process of getting the nanojit tuned for the G5.

First, a little history for people who don't know what the nanojit is. Mozilla has three layers of JavaScript: the base interpreter (SpiderMonkey), the nanojit (TraceMonkey) and the methodjit (JaegerMonkey). The base interpreter is the base interpreter: it's pretty good as interpreters go, but it's just interpreting code and does not compile it. Until Firefox 3.5, this was the only way JavaScript was run in Mozilla-based browsers. On Power Macs, 3.5 and 3.6 still use it, as do Camino and SeaMonkey, and so do all the community PowerPC builds of Firefox 4 other than us.

Mozilla had been looking at a means of compiling JavaScript, and collaborated with Adobe on what was then called Tamarin, a new JavaScript engine. Tamarin overall actually turned out to be slower for the purpose, but a lasting advance from the Tamarin project was the concept of the nanojit: a "tracing just-in-time compiler" that watched for frequently executed portions of code, then recorded the atomic operations being generated in a special intermediate language called LIR, and finally compiled the LIR into machine code. This nanojit-based accelerator was christened TraceMonkey (no, I don't know why they have such a simian fetish either). Versions were immediately made for x86, and later other architectures followed, including ARM, SH4, SPARC and even MIPS. Adobe wrote up one for the PowerPC too, but it was not complete enough to be used in Firefox, and this was not rectified in Firefox 3.6. That's why all official Mac PowerPC builds of Firefox have such comparatively slow JavaScript performance.

(In Firefox 4, Mozilla added a more conventional method-oriented compiler, based in part on Apple's Nitro JIT used in Safari. Since Firefox 4 is Intel and ARM only officially, the methodjit only runs on x86, x86_64 and ARM. This combination of the nanojit and methodjit is JaegerMonkey. There is no PowerPC methodjit.)

For TenFourFox beta 9, then, we took that partially written PowerPC nanojit and finished it up, making us the first PowerPC-based Mozilla browser to implement TraceMonkey. My first tests were on my G5, and they were very disappointing. Although many operations were significantly faster, many operations were significantly slower, and I concluded that the nanojit didn't seem to be a good fit for the PowerPC ... until alert users tested it themselves and told me it was ridiculously fast on the G3 and G4. To get around this, I compulsively benchmarked various JavaScript low-level operations on the G5 and cut out the ones that seemed to be slow. This was good enough to beat the baseline score for Dromaeo and V8, and not suck too badly at SunSpider, and this got released in beta 11. At least at that time, that seemed about as good as it was going to get for the PowerPC 970.

Earlier this week, I was busy working on enabling VMX (AltiVec)-based WebM decoding (which by the way is about halfway done -- I now have it integrated into the build system and have started converting the existing assembly language code to the dain-bramaged as assembler Apple stuck in Tiger) and while investigating an oddiment in the assembler's syntax I stumbled across a key note I hadn't seen mentioned anywhere else before, let alone in Apple's documentation. As I dug around a little more, the floodgates opened and more stuff came in, including -- what a win -- the critical piece that now enables the G5 nanojit to fly. In fact, not only does it fly, it also no longer needs the shortcuts to dump those slow low-level JavaScript operations (JSOPs) because those JSOPs are no longer slow.

These are useful things to know not only for those of us trying to wring more performance out of our old Macs, but also for people using later PowerPC and POWER designs, such as Wii, Xbox 360 or PlayStation 3 hackers, because the PowerPC 970 (being a modified POWER4) is more closely related to the modern day IBM POWER systems and the Cell, Broadway and Xenon CPUs than the G3s and G4s that preceded them. So, for posterity, here's how we juiced the G5 nanojit.

The G5's data cache acts differently. The G5, because of its deep pipeline, is constantly trying to keep that pipeline full and reduce latency, and tight code trying to prime the data cache needs to be aware of the difference. The AltiVec dst (to hint the processor about where the data stream is coming from) instruction, for example, requires that pipeline to drain and can seriously impact performance. We don't manipulate the D-cache presently in the nanojit, but I mention for completeness that the more basic dcbt is preferred to dst as it does not need to be serialized; the G5 uses a 128-byte cacheline, not a 32-byte one, so dcbtl (which uses the native cacheline size) would be better still. Analogously, for zeroing out a cacheline use dcbzl, not dcbz, which only operates on 32 bytes even on the G5 and is therefore inefficient by wasting valuable cache space.
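
Purely for illustration (the nanojit doesn't emit any of these today, the registers are invented, and I'm assuming an assembler that knows the dcbzl mnemonic), the preferred forms look like this:

dcbt  0, r3         ; touch the cache line at (r3): just a hint, no pipeline drain the way dst causes
dcbzl 0, r4         ; zero the full 128-byte line at (r4) before overwriting it
                    ; (plain dcbz only clears 32 bytes on the G5 and wastes the rest of the line)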

dcba and dcbi should never be used; they are illegal on the G5. Mac OS X emulates dcba on the G5 by simply ignoring it, since it's just a hint, but this causes a software interrupt to do so -- more about that in our final point.

Remember dispatch groups. The nanojit we use has a construct called the "swaptimizer," which swaps independent instructions around in such a way that more of the CPU's execution units can be running at the same time (i.e., improve instruction-level parallelism) by hoisting up instructions to run overlapped with other instructions that don't depend on that earlier step's result. In certain cases, this can be effective enough to get some instructions seemingly "for free," particularly comparisons that can write out independent comparison results (the PPC has a series of "mini-registers" for this which is quite convenient). For in-order CPUs like the G3 (remember that the G3 is essentially an evolved 603 with all the advantages and disadvantages), this is very valuable, as it only retires instructions in program order despite being superscalar, and it should also be useful for other in-order POWER chips like Xenon, the Cell PPE and POWER6. It is less valuable on the 604 and G4, which both have some limited out-of-order execution, but the G4s in most Macs don't have the reordering logic of later G4 designs.

On the other hand, the G5 is an aggressively out-of-order architecture to improve its instruction-level parallelism in hardware, and can have over 200 instructions in-flight (compared to around 30 for the G4). To reduce the amount of silicon needed to track each and every one of these flying instructions, IBM designed the G5 to take dispatch groups of instructions instead and these groups are what the CPU tracks and what travel through the pipeline. Prior to grabbing the instructions, the G5 will attempt to reorder them for maximum performance. As a result, the swaptimizer is less effective here because it's cherry-picking the low-hanging optimization fruit that the G5 already schedules for, but it does help to pack groups better so that an earlier group is less likely to need the result of a later group.

Dispatch groups actually contain operations rather than individual instructions. In most cases this is an academic point, as the operation is usually the same as the instruction. Ordinarily dispatch groups contain five operation slots: four for individual operations, and one for an optional branch instruction, so a group boundary typically falls after every fourth instruction. The branch instruction always terminates a dispatch group, and a dispatch group may never have more than one branch. The distinction between operations and instructions will be covered in my next point. The pieces of the group then enter the processor and are issued and executed, and the groups, not the already reordered instructions, are then retired in-order.

Because the instructions execute for most intents and purposes as a unit, certain interdependencies can really hurt, most notoriously loads and stores to nearby addresses in the same group. For example, in the LIR operation d2i that converts a double to an integer, the stfd instruction should be in an earlier group than the lwz that follows it (remember that instructions in the nanojit are emitted working down from the top of memory) because they work on memory addresses that are very close to each other. Since this could lead to a problem with aliasing if they run together in the same group, the G5 has to "undo the group" and split them apart, leading to a pipeline stall because this won't be detected until the group is formed and its addresses are calculated. It improves performance to insert a couple of nop instructions between them, which essentially act as empty space in the group and force the G5 to split the group ahead of time. This is tunable; our code worked best with two, and gained a small but meaningful number of points on V8. In fact, this is particularly a problem for any code that has to interconvert between floating point and integer registers, because such conversion must store to memory as an intermediate step (there are no direct FPR to GPR moves in any Power Mac processor). Shark checks for this specific situation and Apple calls it an "LSU [Load Store Unit] Reject" in ADC.
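
As a sketch of what that looks like in the emitted code (the offsets and registers here are illustrative, not the actual nanojit output), the padding simply pushes the dependent load into a later dispatch group:

stfd f1, 8(r1)      ; spill the double to the stack so its bits can be picked up as integers
nop                 ; filler: closes out this dispatch group early ...
nop                 ; ... so the group boundary falls between the store and the load
lwz  r5, 12(r1)     ; reload the low word into a GPR from the nearby address without tripping an LSU reject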

Certain instructions do better at certain positions in the group, too. For example, mtctr (move to counter register) should be first in a dispatch group if at all possible, and I'm sure there are others (please mention them in the comments; this was the one I found in most references). We can't really leverage this because we only use CTR as an indirect branching register in constructs like mtctr followed by bctr, so we're always branching soon after anyway. However, if you use CTR as, you know, a counter maybe 8), then you might want to arrange your instructions to force it to lead a group.
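
For instance, a plain counted loop (purely illustrative; the nanojit doesn't generate anything like this) would want the mtctr hoisted so it leads its group:

mtctr r5            ; load the trip count; schedule this at the head of a dispatch group
loop:
lwzu  r6, 4(r4)     ; fetch the next word
add   r7, r7, r6    ; accumulate it
bdnz  loop          ; decrement CTR and branch; the branch also closes each group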

Avoid microcoded and cracked instructions. I mentioned that instructions and operations can mostly, but not always, be treated synonymously. The case where they are not is if the instruction in question is cracked or microcoded.

If the instruction is cracked, it is actually two operations in one, and takes up two operations in the dispatch group; if the instruction is microcoded, it takes up all the operation slots (i.e., it can only travel alone in the dispatch group). Here is Apple's list of G5 cracked and microcoded instructions. Although these instructions do exist and do execute in hardware, they are obviously slower than other instructions. Use them only if you have to, which brings us to our final and most important point.

Never, ever, ever use mcrxr on the G5. And this might well apply to some other POWER CPUs, by the way, even though it appears in IBM's documentation. Apple doesn't mention anywhere in their documentation I could find that mcrxr is software-emulated on the G5. Yikes! In fact, simply eliminating the use of this instruction was what restored the vast majority of our speed.

The reason this hit us so badly is an oddiment of the code the nanojit generates. Some background: overflows and other math exceptions are, if the instruction requests it, annotated in a special purpose register ("SPR") called the XER. The XER, amongst other things, tracks both the overflow "OV" of the last instruction, and a summary overflow "SO" which is sticky (i.e., once any instruction sets it, it stays set, unlike OV, which is cleared by the next instruction that requests overflow tracking). The XER cannot itself be used for branching: only the condition registers "CR" can be used to conditionally branch, so the CPU mirrors SO to one of the condition registers if the instruction requests it also (CR0, in case you're interested, for integer math).

The nanojit spits out lots of guard code for every arithmetic operation -- and I do mean every operation, from simple increments in a loop to computing the national debt -- so that an overflow state can be correctly trapped as an exception and appropriately handled. Most of the time, overflow does not occur, and only one instruction is being tested for overflow, so we want to use OV (or we waste cycles clearing SO constantly for instructions that are unlikely to set it). However, we can't branch on OV directly, so here comes mcrxr, which moves the exception bits of the XER into the condition register field we specify and clears those bits from the XER. In our case, this puts OV in the greater-than field of the condition register, and now we can branch on that. Problem solved! ... at least on the G3 and G4, where mcrxr is in hardware.
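
On those chips the guard boils down to something like this (the registers and condition field are illustrative, and "overflow" is a stand-in label rather than the actual generated target):

addo r5, r5, r6     ; the add the trace wanted, with overflow recorded in XER[OV]
mcrxr cr7           ; copy SO/OV/CA into cr7 and clear them in the XER -- a single hardware instruction on G3/G4
bgt  cr7, overflow  ; OV landed in the greater-than bit, so take the guard if it is set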

On the G5, mcrxr is trapped and emulated by the operating system. This means every time the instruction is encountered, there is a fault, the pipelines probably have to empty, the OS examines the instruction, sees what it is, runs equivalent code (to be discussed momentarily), and returns to regular execution. If you look at the code the nanojit generates for a simple loop, you will see that each and every time the loop runs, the increment is tested for overflow, and the mcrxr instruction is trapped and emulated. No wonder it sucked so badly! Apple doesn't mention this anywhere except if you dig through their code or find the few obscure bug reports about why a previously snappy executable on the G4 performs so badly on a G5.

So here is the one case where a microcoded instruction is better than the alternative. We now use this equivalent code in the G5 nanojit:
mfxer r0                 ; put XER into register 0, which is scratch
mtcrf 128, r0            ; put the relevant bits into condition register 0
rlwinm r0, r0, 0, 0, 28  ; clear out the relevant bits in register 0
mtxer r0                 ; and put them back into XER, clearing it

You will notice that except for rlwinm every one of these instructions is microcoded and requires an entire dispatch group all to itself. There's no way around it, but it's better than triggering an illegal instruction fault and forcing software emulation. Way better. We still use mcrxr on the G3 and G4, which have it in hardware, but the G5 now uses this equivalent which is much faster on the 970. But phooey on both Apple and IBM: IBM for taking the instruction out, and Apple for not documenting it.

Well, that's enough nerdosity. It'll be in 4.0.1. It's great. More G5 docs, until Apple decides to 404 them:

Sunday, April 3, 2011

Ruminations on Mozilla rapid release, the end of embedding and Firefox 4.0.1

First, old business. Firefox 4.0.1 (this appears to indeed be the version number) is slowly shaping up. Assuming Mozilla is in any way typical, it will probably emerge in late April or early May. There are so far some minor stability updates, nothing especially major. There may or may not be some TenFourFox-specific fixes in there, depending on when I get them complete, but there haven't been many major issues specific to us (good!). The AltiVec WebM accelerator will not make this release; it will be in a separate beta (probably 4.0.2pre). Mozilla has stated they will continue with their "old" maintenance schedule for Mozilla 2.0, which means there will still be stability and security updates for a period of time, which is also good because of what I'll talk about next.

Before we move on, however, there is some unhappy news: Camino will drop Gecko after they release 2.1 (equivalent to Gecko 1.9.2, i.e., Firefox 3.6). This is sad to hear -- I myself was a Camino user until I started working on TenFourFox, mostly because I like and trust Gecko more than WebKit, but Fx 2.x and 3.x didn't feel like first-class Mac applications by comparison. I'd probably still use Camino if they were planning to support PowerPC, but they too will undoubtedly drop it after they end support for Camino 2.1, which in fairness to them will probably not be for many, many months.

The reason is that Mozilla is dropping embedding, at least in-process. There is a promise in the mozilla.dev.embedding thread that out-of-process embedding will be reconsidered, but I doubt it, and for a big reason: Gecko really, really sucks to embed. It was always hard for Camino and many bugs were filed on getting it to work right (it still doesn't always), and WebKit's entire existence is owed to the fact that Apple also thought Gecko would be a PITA to embed, and instead took KHTML and messed around with that. Part of this is Gecko's scope, because Gecko requires XPCOM, and XPCOM isn't just some internal engine -- it's an entire object framework. But part of it is simply because Mozilla never made it a priority, and now the people who might have kept the embedding aspect alive will probably give up and work on something else. When the time comes for out-of-process embedding, there won't be any browser project around that's interested in it, Mozilla themselves will have moved on to something else, and it will never happen. This is essentially the end of Mozilla as an embedded rendering engine.

For the record, neither Classilla nor TenFourFox embeds Gecko -- we are the browser, not a shell around it, and we are built ultimately on XULRunner (or in Classilla's case, XPFE Apprunner), so this doesn't affect us. It also doesn't affect Songbird, SeaMonkey or anything else that uses a XUL-based front end, but this really sucks for people who are trying to break through XUL's interface limitations. Camino probably will survive the jump when "3.0" emerges with WebKit, but it's a real shame (especially because, w/r/t custom WebKits on Mac, OmniWeb is really my personal choice and it will be hard for Camino to compete against Safari and Chrome as well), and it shows that Mozilla's priorities are nowhere near as aspirational as they used to be. Gecko was what distinguished Camino, but no longer. This is part of why WebKit will eventually eat the world, and we will damn it in the same tones we damn Internet Explorer. But I digress.

So, new business. Mozilla, freed from useful things like embedding, is busily working on Firefox 5, which right now they call Firefox 4.2 alpha. The difference in version numbers isn't too odd, as Fx4 was Fx3.7 originally. However, the major difference is more one of process than content: this will be the first release in which Mozilla will use their new "rapid release" framework. And frankly, I have no firm idea what that means, and from idly perusing the Mozilla newsgroups, there are certainly disagreements about this even among Mozilla higher-ups.

Let's review, then, what Mozilla has placed on their wiki and what one of their key developers has written. The release schedule is pretty well known: raw work lands on mozilla-central, then stuff that is finished makes the cut to (these are provisional names) fx-experimental, then stuff that is shippable makes the cut to fx-beta, and then the final browser pops out into a release branch. Each stage is intended to last about six weeks, with approximately one week of overlap betwixt. Firefox 5, to jump-start the process (and presumably because there is a lot of pent-up work that didn't make Fx4), will have only three weeks in mozilla-central before moving up the release ladder. That, in the words of Larry King, is what we "know-know."

What we don't know-know are several important considerations. The big one will be support intervals. Chemspills will be supported (necessarily), but it is highly likely that releases will become unsupported much more rapidly. This makes it a lot harder for us to have a stable footing, and may mean that security problems discovered "late" in a release's short lifespan go unrepaired directly by Mozilla if they arrive sufficiently late in the next release's cycle (we may have to backport significantly more fixes and forgo others when we lose source parity).

Similarly, we don't really have a handle yet on how updates will be delivered. Google Chrome has a background self-update mechanism, which has its plaudits and its pitfalls: it keeps their audience current, but it can also more widely distribute serious bugs that escape into a release. Mozilla, continuing their Chrome crush, wants to do the same. From a developer perspective, this requires significant back-end infrastructure to distribute partial fixes -- we don't have that kind of back-end even for one architecture, let alone the four builds we release, even just for keeping users in sync, and it would greatly complicate testing. Other Mozilla distributors probably have a similar problem. The solution is to simply build snapshots like we do now as the same sort of "big package," but this may require us to maintain a completely different release schedule. Version numbering is related to this problem. Sayre's document alludes to a way to turn automatic beta updates off if this became a reality, and we would probably ship with it hard-wired that way (just notifications, same as now).

In the meantime, I've been watching what's landing in mozilla-central for "Firefox 5" and the major thing concerning us is that IPC is now required. At one stage I did have Chromium IPC building on Tiger, but when built in debug mode, the resulting libxul (which is also now required) was too large to link. To get around this problem (and to avoid maintaining IPC), TenFourFox 4.x is built with both libxul and IPC off, but that will no longer be possible. Fortunately, I have a solution to the linker issue, and I think I can patch enough places in the current Chromium IPC to still get it to build (I'm not sure it will work, but I'm pretty confident I can at least get it to compile). Plus, the three-week period makes it unlikely that stuff we depend on will be "frivolously" removed, let alone 10.5 compatibility. Therefore, the chance is not excellent, but it is reasonably good, that we will make the jump to Firefox 5 and still maintain source parity at least through Mozilla 2.2.

Assuming that we do, however, "Firefox 6" will be a major concern. 10.7 will have emerged by then, and it is quite possible that legacy code may be purged during the process of getting Fx6 onto Lion. There is even an outside chance that 10.5 might be dropped, and that would almost certainly doom source parity, as there would be no good reason not to adopt 10.6-specific optimizations like GCD and the like. It's still too early to forecast this, but even if the worst happens early and we lose source parity later this year, like Classilla I still plan to backport security and stability fixes and make whatever improvements I am able to. After all, I still need a browser for this quad.