Thursday, November 29, 2012

17.0.1 RC available, and jawing about the WiiU CPU

17.0.1 release candidate is available. Near as I can tell from releng, Mozilla plans to release this tomorrow. Please test it today. Changesets are available.

I am still stymied by the locking problems we have on some sites, where the browser spins its wheels in semaphore wait. It's only a minority of sites, thank goodness, but it even occurs on some pure HTML/CSS sites with no JavaScript at all, implying that the hangup is within layout. Once I get a debug build of Fx19 mounted, I want to review this problem in detail; it seems there should be a simple solution.

The fanboi boards and the usual blogosphere suspects are aflutter with "WiiU CPU sux" based on a couple of tweets made by self-proclaimed Nintendo WiiU hacker Hector Martin. The triple-core PowerPC CPU in the WiiU, codenamed "Espresso," is, according to Martin's measurements, clocked at only around 1.24GHz per core, compared with around 1.6GHz (when multithreaded; 3.2GHz max) for the Xbox 360's tri-core PowerPC-based Xenon. Given that apparently dismal clock speed, is this another case of the megahertz myth reborn?

Both the preceding Nintendo GameCube's Gekko and the Wii's Broadway were evolved forms of the G3, specifically modifications of the PowerPC 750CXe and 750CL. IBM customized Gekko's FPU with SIMD instructions (sort of an "AltiVec Lite") to facilitate media processing and built a fast path from Gekko to the GameCube's Flipper GPU, and then took that same basic design and essentially cranked up the clock to generate Broadway. The systems run at 485MHz and 729MHz respectively. Even to this day IBM continues to make custom versions of the venerable G3; it's cheap to produce, simple to modify and can be ground out in quantity from their secret underground base in Fishkill.

Despite this, IBM can make, and has made, other kinds of application-specific PowerPC designs. The Cell in the Playstation 3 is the most obvious example with its PowerPC "PPE" core and satellite SPEs, and the Xenon in the Xbox 360 is three PPE cores stuck together; Xenon even has modified AltiVec instructions ("VMX128"). As if to confirm IBM was trying something new with Nintendo, during the WiiU's hype-y-moon Nintendo marketing billed the processor as "the same processor technology found in Watson." We assumed this to be a POWER7 derivative, given that Watson was IBM's POWER7-based know-it-all computer cluster that pretended to win Jeopardy.

However, Martin claims that the WiiU CPU is also, once again, another morph of our old friend the PowerPC 750 with some additional cache. Well, I'm dubious, and the reason is that the 750 and its ancestor, the 603e, were never designed for multicore environments (and Espresso has three). The only multi-CPU 603 I ever met was the BeBox, and the BeBox's design hobbled its two CPUs because it had to overcome the 603's incomplete cache coherency with glue logic (the 603 and the G3 do not support all five MERSI states necessary for multiprocessing; they are only MEI). I've never met a multiprocessor G3, and Apple certainly never made one. In fact, early 7400 G4s have the same limitation.

Furthermore, Martin also admits that the CPU is out-of-order, which the 750 never was either. POWER7, interestingly enough, is. On the other hand, "big" POWER like the POWER7 has significantly different execution characteristics than "little" POWER. We know this personally from this project, because the G5 acts like a "big POWER" CPU and requires "big POWER" optimizations that are different from what a G3 or G4 would require. The G5 can run G3 and G4 code, but it certainly doesn't do so as well as its own, which was why early G5s weren't really all that much faster than the MDD G4. The fact that the WiiU cores implement the same instruction set as Broadway, including the modified SIMD FPU instructions -- because, other than emulation, how else could it still run Wii games acceptably? -- makes it pretty unlikely it really is a POWER7 derivative, and Espresso also lacks POWER7's hardware threads. However, to work in a multicore environment, whatever is in there cannot be a garden-variety 750 (unless the system board situation is as tricky as it was with the BeBox). The pipeline certainly can't be a regular 750's if it executes out of order.

In the final analysis what we're looking at is a new design, inspired by the 750, but not a 750 (just like the 7400 G4 was more than a G3 with an AltiVec unit bolted on); there are too many fundamental changes in its operation to call Espresso merely an evolutionary variant. Martin should know this and I'm a bit disappointed in his simplistic analysis, even though I salute his technical skills. It also means that, like the megahertz myth of the PowerPC vs x86 days, the clock of these cores is probably not at all comparable to that of the more deeply-pipelined PPE in Xenon and Cell. And that's why fanbois suck.

Wednesday, November 28, 2012

17.0.1 imminent

Mozilla is planning to ship a 17.0.1 to fix several regressions in 17.0, including bug 815359's problem with Bing, restoring the old user agent format, and a font regression on Windows that doesn't affect us. The plan is to go to build as early as today (I'm not sure when it will be released, exactly). This is a little inconvenient because Tobias wrote some additional performance patches in issue 191, our atomics/mutex performance bug, which look good and appear to work even on the quad G5 (which you would expect to have the most problems if we did something wrong with barriers). So I'm going to cross my fingers and make them part of 17.0.1, since it's a little complex to back them out now. I'm rebuilding everything as soon as build tags land on esr17, and hopefully candidate builds will be available Thursday evening or early Friday.

Saturday, November 24, 2012

"17.0.1pre" available

I put the "pre" in quotation marks because it still appears as 17.0, but this contains the new atomics as discussed in our last blog post. However, because JavaScript does not require (at least currently) the same memory guarantees as the rest of the browser does (might), I implemented an atomic set routine just for JavaScript that has no read or write barriers, which is a lot faster than the other approach (which is faster still than the old mutex method). Get it from the Downloads tab. The only difference you should note is that the browser is faster; there are no other changes.

Based on Tobias' initial analysis and work in 19, there appears to be enough that Mozilla broke that we should just skip 18 and go straight to 19 to get started on undoing the bustage right away. Thus, Aurora 19 will be the inaugural release of the next unstable series. I'll probably start work on that this week once I've sufficiently recovered from tryptophan-induced Thanksgiving turkey coma (though actually this year we had roast duck).

Monday, November 19, 2012

17.0 goes final; a little more discussion of locking and atomics

17.0 is now official, along with our five translations. A great job by everyone to get us to this point.

I am nearly done with the "atomic" version of 17, which will appear as a "17.0.1pre" and then land on 17ESR so our stable users get it in 17.0.1. There are some interesting things I discovered along the way, including yet another episode of "Optimizing for the Mysterious G5" (see our other exciting episodes of PowerPC 970 black magic). Let's savour the tasty tidbits:

  • To recap: atomic operations help to ensure that multiple threads remain synchronized and don't step on each other, especially when multiple cores or CPUs are involved. The idea was to replace Mozilla NSPR's built-in atomic operations, which go through pthread_mutex_* and are therefore slow on OS X, with the OS X built-in atomics where possible, plus a custom "atomic set" routine I wrote myself because 10.4 doesn't have one, and then to look at using the built-in gcc intrinsics for 18+ when we move to gcc 4.6.3.

    This only affects atomic operations -- mutexes, where the browser actually locks a value for synchronization, are still going through the OS X implementation of pthread_mutex_*(), which even Apple admits is slow.

  • Overall performance did indeed improve, in some cases markedly. Of our two pathological test cases, as mentioned, iFixit got a lot better just by fixing the issue for JavaScript, and got even faster when the change was extended over the entire browser. However, local AM station KFI AM 640's page is still spending a massive amount of time in semaphore lock waiting for threads to release resources, so it still seems to be bound up in pthreads, and we probably need to do something about reducing locking overhead (or the use of locking) in general to solve that problem.

  • OS X atomics are actually faster than gcc atomic intrinsics, so even after we switch compilers we probably still want to use this patch. The reason is that the OS X atomics self-optimize, which you can see better illustrated in this earlier version of osfmk's atomic.s. For the atomics to work correctly on multiprocessor Power Macs, there have to be memory and instruction barriers set up so that the processor doesn't inadvertently reorder instructions or memory access in such a way that we get the wrong value out at the end. A write barrier ensures that the atomic's read instruction executes only after all previous writes are complete, and a read barrier ensures that everything executing after the atomic's store instruction sees only the value it stored and not stale data. These barriers are expensive because they interfere with processor optimization and may flush pipelines and caches, and they are unnecessary on a uniprocessor Mac, so the operating system actually patches the unnecessary sync instructions to no-ops on single-processor Power Macs (see also Amit Singh's holy word in OS X Internals, page 400). The gcc atomic intrinsics don't do this (arguably, they can't), so they are slower than they should be on uniprocessor Power Macs, which make up the majority of our user base.

  • Furthermore, the 10.4 atomics lack an "atomic set" function to just set something to a value (we have swaps, bitwise ops and basic arithmetic, but no actual "set" operation), so we need to roll our own atomic for that. However, we can hand-optimize our own "atomic set" better than the gcc version, because we know what our target processor will be, and we know how Gecko is using it. There are several major instructions for setting up a barrier, and some others that have one as a side effect, but we'll only talk about a few: eieio (which you can even sing along with), sync, lwsync, isync, and absolutely nothing, which suffices for JavaScript but I don't think is sufficiently safe for the browser as a whole.

    Unfortunately, our own routine right now can't easily patch itself for the uniprocessor case, so we need to pick barrier instructions that are as low-impact as possible:

    • The prototypical barrier is sync, which forces the CPU to wait until all previous instructions have executed before running another one and locks all processors together so that everything is in a predictable state. This necessarily covers all reads and writes. However, in doing so, it will grind everything on every core to a halt until execution is finished. This is a pipeline killer and we want to avoid it.
    • eieio is mostly for ordering memory-mapped I/O, but our main interest is that it prevents load-store combining and ensures memory access is coherent, including across processors. In practice, though, eieio really only orders stores, so it is not sufficient as a read or read/write barrier on its own; it does suffice as a write barrier.
    • Alternatively, we could execute an isync, which is like sync in that it forces the CPU to wait until all instructions are completed. However, it also drops any prefetched instructions. Unlike sync, it affects only the current processor, so it is only sufficient as a read barrier.
    • Finally, there is lwsync, the "lightweight sync," which forces the core to synchronize to cache and orders most memory access so that reads after the barrier are coherent and all writes complete before it. The one case it does not order is a store followed by a read from a different address. This is invariably the case we don't care about, but it means that lwsync is really only a write barrier in practice.

    So which is faster?

    Since we know whether we're on a G5, the solution then is to emit lwsync and isync as the write and read barriers respectively, and eieio and isync on the G3/G4 (since we have to account for multi-CPU G4s); a rough sketch follows below. Plus, an even better solution if we implement G3-specific code paths one day would be to ditch the sync instructions entirely on G3, since no G3 was ever a multiprocessor unit (and the BeBox was the only MP 603 I've ever encountered, for that matter). There is no particularly optimized path for any PowerPC processor with gcc for the equivalent atomic intrinsic, so we should use ours for the additional benefit it brings.

    I should note that the browser works just fine with no or incomplete barriers, but this might be pushing our luck on G5, so I think we'll just stay safe. We're already a lot faster.
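Here is the sketch promised above of what a hand-rolled, barriered "atomic set" could look like. This is purely my illustration: TENFOURFOX_G5 is an invented build flag standing in for however the real code selects its G5 path, and the actual routine in the tree may differ.

    // Hypothetical sketch of a barriered atomic set for 32-bit values.
    #include <stdint.h>

    static inline void AtomicSetBarriered(volatile int32_t *ptr, int32_t value)
    {
    #if defined(TENFOURFOX_G5)
        // G5: lwsync as the write barrier (all prior stores complete first).
        __asm__ __volatile__("lwsync" ::: "memory");
    #else
        // G3/G4 (including multi-CPU G4s): eieio suffices as the write barrier.
        __asm__ __volatile__("eieio" ::: "memory");
    #endif
        int32_t old;
        __asm__ __volatile__(
            "1: lwarx  %0,0,%2 \n"   // load and reserve the word
            "   stwcx. %1,0,%2 \n"   // store conditionally on the reservation
            "   bne-   1b      \n"   // reservation lost? try again
            : "=&r"(old)
            : "r"(value), "r"(ptr)
            : "cr0", "memory");
        // isync as the read barrier: discard anything prefetched so nothing
        // after this point runs against stale data.
        __asm__ __volatile__("isync" ::: "memory");
    }

On a hypothetical G3-only code path, per the point above, both barrier instructions could simply be omitted.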

I hope to have this ready after Thanksgiving. Then the next question will be 18 beta or 19 aurora; I haven't decided yet.

Friday, November 16, 2012

17.0 RC available and "locking" up

... from the usual place and with the usual notes. Also try the translations! This version will be released to the public on Monday and become the new stable branch. Excellent!

As stated in the prior blog entry, this version restores the kernel-based atomic locking used in TenFourFox 10 (issue 191). And, well, there's an interesting story associated with that. At baseline, Mozilla NSPR implements locking, including atomic increment/decrement/set/add operations, in terms of pthread_mutex_*, which is just fine on Linux and *BSD but is apparently abysmally bad on OS X. These thread-safe operations are being called more and more by Gecko as it attempts to leverage a greater number of processor threads to make operations more asynchronous, improve browser responsiveness and take better advantage of multicore systems, and they are essential to keep threads from walking all over each other. But the overhead, apparently worsened by an inefficient implementation in the OS X kernel, is rapidly becoming unacceptable. I patched the mutex-based calls out of JavaScript for 17 final, and you're using that now.

In gcc 4.1, the compiler implements faster intrinsics for atomic operations (the __sync_*() series of functions), and NSPR can use these for a fast path, but we don't have those in gcc 4.0.1. So the solution will be to extend the quick locking patch I wrote up for JavaScript over the entire browser, which should significantly reduce locking contention because fewer things will be trying to get mutexes for operations that don't need them. We would just use our own atomic operations and the operating system's atomic operations in place of NSPR's use of pthread_mutex_*, everywhere we would do any atomic operation.
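In concrete terms, the substitution looks something like this. The wrapper name below is invented and this is only a sketch of the approach; OSAtomicAdd32Barrier and __sync_add_and_fetch are the actual OS X and gcc primitives being talked about.

    // Hypothetical wrapper: route an atomic add through the OS X libkern
    // atomics (gcc 4.0.1 / 17ESR) or the gcc 4.1+ intrinsic (18+/AuroraFox)
    // instead of taking a pthread mutex the way stock NSPR does.
    #include <stdint.h>
    #include <libkern/OSAtomic.h>

    static inline int32_t TFFAtomicAdd(volatile int32_t *ptr, int32_t amount)
    {
    #if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
        // gcc 4.1 and later: the compiler intrinsic fast path.
        return __sync_add_and_fetch(ptr, amount);
    #else
        // gcc 4.0.1: the libkern atomic, whose barriers the kernel patches to
        // no-ops on uniprocessor machines.
        return OSAtomicAdd32Barrier(amount, ptr);
    #endif
    }

(As the November 19 post above explains, the libkern path actually remains attractive even after the compiler switch, because of that self-patching behaviour.)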

Interestingly, AuroraFox won't have to do this, and neither will TenFourFox 18, because both of them are built with gcc 4.6 and can use the compiler's intrinsics. This will only be an issue for 17ESR because it is the last branch to be built with 4.0.1. I'm hoping to get this working maybe next week sometime for a test release prior to beginning work on 18, because I'd like to fix it for 17.0.1.

The larger problem of such locks being slow in general, however, means we'll probably have to come up with a better way of handling locking on PPC OS X, or see if some of the areas that use PR_Lock and friends could be better written with atomic intrinsics. Mozilla might even want some of these patches, since it doesn't look like Apple improved this a lot (Grand Central Dispatch is just a lightweight way around the problem).
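As a hypothetical example of the kind of rewrite I mean (not code from the tree; the counter and its lock are invented for illustration), a counter that currently takes a full NSPR lock just to be bumped could use a single atomic operation instead:

    // Illustrative only. PR_Lock/PR_Unlock are the real NSPR calls.
    #include <stdint.h>
    #include "prlock.h"
    #include <libkern/OSAtomic.h>

    static int32_t sPendingEvents = 0;
    static PRLock *sPendingEventsLock;  // assume created elsewhere with PR_NewLock()

    // Today's pattern: a whole mutex (and potentially a semaphore wait) for one add.
    void NotePendingEventLocked()
    {
        PR_Lock(sPendingEventsLock);
        ++sPendingEvents;
        PR_Unlock(sPendingEventsLock);
    }

    // The cheaper pattern: one atomic increment, no lock traffic at all.
    void NotePendingEventAtomic()
    {
        OSAtomicIncrement32Barrier(&sPendingEvents);
    }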

Thursday, November 15, 2012

10.0.11 released, and rides off into the sunset

Please give 10.0.11 a proper sendoff by downloading it and trying it out; this is finally the last release of 10.x, as plainly stated in its Release Notes. The changesets include a fix for issue 130. Assuming no showstoppers during its life, it will be unsupported after January 2013. It served us well and now it is time to let it rest.

17.0 is getting some last-minute fixes. Issue 188 is fixed, but I am discovering what appears to be a moderate performance regression on multiprocessor Power Macs on certain sites, and I suspect changes in JavaScript locking (10.x uses native OS X locks; 17 uses NSPR locks), because Shark shows the browser spending a massive amount of time in semaphore wait. This issue does not affect single-processor Macs, so we can ship this (actually, the larger issue affects them too; see the subsequent post), but I'd like to fix it if possible prior to release. If NSPR locks are at fault, then we might want to purge them from the browser entirely (or cause them to stub out into OS X libkern), but I'm just going to focus on JavaScript first since that is the difference between 10 and 17. If I can't find a way to easily remedy it, we will simply ship without it for the moment since the sites do work; they're just slower than they ought to be (iFixit is a particularly egregious example). Remember, this only makes a difference if your Mac has more than one CPU. ETA for 17 RC is Saturday.

Also, thanks to Chris and all our localizers for having installers ready for our release next week! Great work!

Wednesday, November 7, 2012

Falling to baseline? also: wither [sic] ESR 24?

Now that IonMonkey is part of Firefox 18 as the compiler for "long running" JavaScript, Mozilla is looking to replace JaegerMonkey, which is still the baseline compiler (we have a full implementation of JM with type inference in TenFourFox, which long-time readers of the blog will know as "JM+TI"). This is not good news.

To recap history for newer readers, TenFourFox has implemented two JavaScript compiler backends for PowerPC. JavaScript is not an easy thing to compile. Our dearly departed tracejit (TraceMonkey, in 4-9) was a fast compiler because it wasn't really compiling; in simple terms it was just noting what operations got done by the interpreter and playing them back. JaegerMonkey actually does compile JavaScript methods, but it does so by compiling each of the elemental opcodes called JSOPs to assembly language; JSOPs act as a sort of intermediate bytecode representation of JavaScript (and in interpreter mode, JSOPs are what the interpreter actually runs). The current implementation pairs JaegerMonkey with type inference, where the compiler tries to guess what actual data types are in play in a script (integers, floating point, etc.) and generate code more specialized to those "inferred" types. This has more latency than TraceMonkey, which low-end users complained about in our early implementations of PowerPC JM+TI, but the outcome is dramatically faster runtime overall, and 17 has made JM+TI even faster than it used to be.

Still, this doesn't facilitate certain kinds of optimizations that, for example, your typical C compiler can perform; all JM+TI is essentially doing is computing stack depths on the first pass, and then plopping out little packages of assembler code for each operation on the second. There is no attempt, and indeed no easy way, to do much analysis of the generated code or the internalized source representation in this scheme because each JSOP is treated as an atom. So IonMonkey was written as a more traditional compiler to implement these optimizations (using more advanced intermediate representations), but optimizations take time, and the added overhead of IonMonkey does not pay off until you run that highly optimized code a lot. Thus, JaegerMonkey remains the baseline compiler because it is less expensive, and, at least for the next few iterations of the unstable branch, it will still be our only compiler. Only for major, more intensive applications is IM employed in Firefox 18 and 19. (IM is implemented for ARM and x86 only; there is no SPARC or MIPS version yet, or indeed any big-endian backend.)

(As an aside, IonMonkey is also more like a traditional C compiler in that it uses the processor stack directly instead of the internal JavaScript stack, which is independent. Running out of stack space in a 32-bit environment has been a big issue for us in the past, and we're still not entirely ABI-compliant with our stack frames even though I revised this significantly for 17. Part of the reason for TenFourFox's higher memory demands, besides cached fragments of code, is that our stack is compiled to be very large to insulate us from crashes, and swapping the stack in and out of memory is a performance killer on RAM-impaired systems. Besides getting JM+TI and IM to play nice together, I am also concerned that a complex and/or recursive IM code fragment could easily run off the end of the 1GB stack that already exists, and we might not be able to squeeze much more out of the addressing range we are limited to.)

To once again move from a hybrid compiler to a "grand unified approach," Mozilla needs to make IM less expensive if they want to use the same code as part of a "cheaper" compiler. Already, combined JM-IM took about a 3-5% haircut in SunSpider, for example, and it fares worse on slower machines where IM's latency becomes a bigger proportion of runtime. This is the idea behind the Baseline Compiler: a profile, if you will, of IonMonkey that cuts back some of the more advanced and computationally complex portions of IM to generate "good enough" code. Google V8's Crankshaft already implements an analogous idea (here we are trying to "out-Chrome Chrome" again, as usual) for small scripts, and TraceMonkey served this purpose in a limited and not directly intentional sense when it existed. The Baseline Compiler, meta-tracked in bug 805241, really does aim to be as minimal as possible; besides implementing none of the more advanced optimizations of IonMonkey, it doesn't even implement type inference, to avoid any risk of recompiling the script if type inference turns out to make incorrect assumptions. However, it is almost guaranteed it will use the IonMonkey backend to generate its code instead of the JaegerMonkey one.

This is bad for us because IonMonkey is much more complex to implement than JaegerMonkey was, and JaegerMonkey was already a huge effort between Ben and myself. We put a lot of time into optimizing it well, including Ben's work on special separated G5 and G3/G4 code paths, tightened pieces of code that self-optimize, and decreased compiler overhead. Assuming that the Baseline Compiler replaces JM+TI completely, we will basically be starting from scratch for the second time in as many years, and I don't see a SPARC or MIPS implementation yet that we can crib from. (At least we have ARM as a basis for our own RISCy implementation, but the ARM implementation is little-endian.) This would seriously drag down progress on this port and would count as a trigger to drop source parity if we couldn't get it working; losing tracejit was originally a show-stopper too until we got methodjit off the ground. I'm loath to put significant time into it while the Baseline Compiler is off in the future because IM in its current implementation is likely to be a lot of work for little gain (and potentially some regression on low-end G3 and G4 systems), but when the Baseline Compiler does emerge we can expect JaegerMonkey to be completely excised within a couple of releases just as TraceMonkey was after type inference landed, so I don't know what I'm going to do with this yet. We have the advantage of having learned from our experience developing PowerPC methodjit, but this is a much bigger task.

The situation is made a little more acute because Mozilla has said very little about whether there will be an ESR 24 after ESR 17. Ars Technica's latest browser survey shows that ESR 10 has not been as widely embraced as the howling over version numbers would suggest; it represents just 0.47% of all web users, and that undoubtedly includes our stable branch users. Mozilla committed to ESR 17 originally and they are obviously keeping their promise, but they have said nothing about ESR 24, and numbers like these will undoubtedly be weapons within Mountain View for the ESR's opponents to kill it off after the promised support period for 17 expires. If that turns out to be the case, there may be no stable branch for us to upgrade to, if we live that long.

So, with that cheery thought, Chris found an explanation for the minority of users who complained they could not download .pdfs or certain other files they had formerly viewed with plugins; it looks like our code to disable them is incomplete, and there is already a fix in issue 188. This changes some internal semantics, so I will not implement it in 10.x. I also took the liberty of tweaking our internal locale a bit for the QTE, and implemented bug 752376 for a little more snappy in the tab bar. Mozilla is currently trying to determine what they will do with Click-to-play (for plugins) in Firefox 17, which right now is buggy on certain sites, but this is irrelevant to us since we don't ship with plugins enabled anyway. Their plan is to have a release candidate ready somewhere around the 14th, and so will we; I will also build our last 10.x version around the same time, which will have a fix for issue 130, and finally terminate our support for ESR 10. I see our anonymous Tenfourbird builder(s) in the land of the Rising Sun are now issuing 17 betas themselves, so it looks like they will make the jump with us. It will be interesting to see what happens to Thunderbird now that Mozilla has said its development will be coming to a close with this release. Perhaps it's time for a TenFourMonkey after all.