Monday, November 19, 2012

17.0 goes final; a little more discussion of locking and atomics

17.0 is now official, along with our five translations. A great job by everyone to get us to this point.

I am nearly done with the "atomic" version of 17, which will go out as a "17.0.1pre" and then land on 17ESR for our users as the stable 17.0.1. I discovered some interesting things along the way, including yet another episode of "Optimizing for the Mysterious G5" (see our other exciting episodes of PowerPC 970 black magic). Let's savour the tasty tidbits:

  • To recap: Atomic operations help to ensure that multiple threads stay synchronized and don't step on each other, especially when multiple cores or CPUs are involved. The idea was to replace Mozilla NSPR's built-in atomic operations, which go through GNU pthread_mutex_* and are therefore slow on OS X, with the OS X built-in atomics where possible, plus a custom "atomic set" I wrote myself because 10.4 doesn't have one, and then to look at using the built-in gcc intrinsics for 18+ when we move to gcc 4.6.3.

    This only affects atomic operations -- mutexes, where the browser actually locks a value for synchronization, still go through the OS X implementation of pthread_mutex_*(), which even Apple admits is slow.
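
    To give a flavour of the swap, here is a minimal sketch of the mapping (the PR_ATOMIC_* macros are NSPR's and the OSAtomic*Barrier() calls are Apple's, but the glue is hypothetical and not necessarily how the actual patch is structured):

      #include <libkern/OSAtomic.h>

      /* Hypothetical glue: route NSPR's atomic macros to the OS X atomics.
         Both the NSPR macros and OSAtomic*Barrier() return the new value. */
      #define PR_ATOMIC_INCREMENT(val) \
              OSAtomicIncrement32Barrier((volatile int32_t *)(val))
      #define PR_ATOMIC_DECREMENT(val) \
              OSAtomicDecrement32Barrier((volatile int32_t *)(val))
      #define PR_ATOMIC_ADD(ptr, v) \
              OSAtomicAdd32Barrier((int32_t)(v), (volatile int32_t *)(ptr))
      /* 10.4 has no OSAtomicSet, so PR_ATOMIC_SET needs the hand-rolled
         "atomic set" discussed below. */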

  • Overall performance did indeed improve, in some cases markedly. Of our two pathological test cases, as mentioned, iFixit got a lot better just from repairing the issue for JavaScript alone, and got even faster when the change was extended over the entire browser. However, local AM station KFI AM 640's page is still spending a massive amount of time waiting on semaphore locks for threads to release resources, so it still seems to be bound up in pthreads, and we probably still need to reduce locking overhead (or the use of locking in general) to solve that problem.

  • OS X atomics are actually faster than the gcc atomic intrinsics, so even after we switch compilers we probably still want to use this patch. The reason is that the OS X atomics self-optimize, which you can see better illustrated in this earlier version of osfmk's atomic.s. For the atomics to work correctly on multiprocessor Power Macs, there have to be memory and instruction barriers set up so that the processor doesn't inadvertently reorder instructions or memory accesses in such a way that we get the wrong value out at the end. A write barrier ensures that the atomic's read instruction executes only after all previous writes are complete, and a read barrier ensures that everything executing after the atomic's store sees the value it stored and not stale data. These barriers are expensive because they interfere with processor optimization and may flush pipelines and caches, and they are unnecessary on a uniprocessor Mac, so the operating system actually patches the unneeded sync instructions to no-ops on single-processor Power Macs (see also Amit Singh's holy word in OS X Internals, page 400). The gcc atomic intrinsics don't do this (arguably, they can't), so they are slower than they should be on uniprocessor Power Macs, which are the majority of our user base.
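
    To make the difference concrete, here is a minimal sketch (the variable and function are purely illustrative). Both calls boil down to a lwarx/stwcx. loop with barriers, but the OS call runs system code the OS has already adjusted for the machine, while the intrinsic's barriers are hard-coded at compile time:

      #include <libkern/OSAtomic.h>

      static volatile int32_t refcnt = 0;

      void bump(void) {
          /* OS X atomic: its barriers become no-ops on a uniprocessor Mac */
          OSAtomicIncrement32Barrier(&refcnt);
          /* gcc intrinsic: emits its barrier instructions unconditionally */
          __sync_add_and_fetch(&refcnt, 1);
      }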

  • Furthermore, the 10.4 atomics lack an "atomic set" function to just set something to a value (we have swaps, bitwise ops and basic arithmetic, but no actual "set" operation), so we have to roll our own. Fortunately, we can hand-optimize our own "atomic set" better than the gcc version, because we know what our target processors will be, and we know how Gecko will use it. There are several major instructions for setting up a barrier, and some others that have it as a side effect, but we'll only talk about a few: eieio (which you can even sing along with), sync, lwsync, isync, and absolutely nothing, which suffices for JavaScript but I don't think is sufficiently safe for the browser as a whole.

    Unfortunately, our own routine right now can't easily patch itself for the uniprocessor case, so we need to pick barrier instructions that are as low-impact as possible:

    • The prototype instruction is sync, which forces the CPU to wait until all previous instructions have executed before running another one, and locks all processors together so that everything is in a predictable state. This necessarily covers all reads and writes. However, in doing so, it grinds everything on every core to a halt until execution finishes. This is a pipeline killer and we want to avoid it.
    • eieio is mostly for ordering memory-mapped I/O, but our main interest is that it prevents load-store combining and ensures memory access is coherent, including across processors. In practice, though, eieio only orders stores, so it is not sufficient as a read or read/write barrier on its own; it does suffice as a write barrier.
    • Alternatively, we could execute an isync, which is like sync in that it forces the CPU to wait until all instructions are completed, but it also drops all prefetched instructions. However, it only affects the current processor, so it is only sufficient as a read barrier.
    • Finally, there is lwsync, the "lightweight sync," which forces the core to synchronize to cache and orders most memory accesses, so that reads after the barrier are coherent and all writes complete before it. The one case it does not order is a store followed by a read of a different address. That is invariably the case we don't care about, but it means lwsync is really only a write barrier in practice.

    So which is faster?

    Since we know whether we're on a G5, the solution is to emit lwsync and isync as the write and read barriers respectively there, and eieio and isync on the G3/G4 (since we have to account for multiprocessor G4s); a sketch follows below. An even better solution, if we implement G3-specific code paths one day, would be to ditch the barrier instructions entirely on G3, since no G3 was ever a multiprocessor unit (and the BeBox was the only MP 603 I've ever encountered, for that matter). gcc has no particularly optimized path for any PowerPC processor in the equivalent atomic intrinsic, so we should use ours for the additional benefit it brings.

    I should note that the browser works just fine with no or incomplete barriers, but this might be pushing our luck on G5, so I think we'll just stay safe. We're already a lot faster.
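
    Putting that together, here is a minimal sketch of what the hand-rolled "atomic set" could look like (atomic_set_32 and is_g5 are hypothetical names, and this illustrates the barrier choices above rather than reproducing the actual patch):

      #include <stdint.h>

      static int is_g5 = 0;  /* hypothetical: detected once at startup,
                                e.g. via sysctlbyname() */

      /* Atomically set *ptr to val, returning the previous value. */
      static inline int32_t atomic_set_32(volatile int32_t *ptr, int32_t val)
      {
          int32_t old;
          if (is_g5) {
              __asm__ __volatile__(
                  "lwsync            \n"  /* write barrier: prior stores drain */
                  "1: lwarx  %0,0,%2 \n"  /* load and reserve the old value */
                  "   stwcx. %1,0,%2 \n"  /* store val if reservation held */
                  "   bne-   1b      \n"  /* reservation lost; try again */
                  "   isync          \n"  /* read barrier: drop stale prefetch */
                  : "=&r"(old) : "r"(val), "r"(ptr) : "cr0", "memory");
          } else {
              __asm__ __volatile__(
                  "eieio             \n"  /* write barrier (covers MP G4s) */
                  "1: lwarx  %0,0,%2 \n"
                  "   stwcx. %1,0,%2 \n"
                  "   bne-   1b      \n"
                  "   isync          \n"
                  : "=&r"(old) : "r"(val), "r"(ptr) : "cr0", "memory");
          }
          return old;
      }

    A G3-only code path could compile the same loop with both barrier instructions omitted, since no G3 was ever in a multiprocessor machine.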

I hope to have this ready after Thanksgiving. Then the next question will be 18 beta or 19 aurora; I haven't decided yet.

6 comments:

  1. Aurora 19 doesn't seem to break many things here on 10.5, see http://code.google.com/p/aurorafox/issues/detail?id=30 .
    But YARR was updated, and some work has to be done on the JIT part to get it working with the PPC macro assembler - I didn't investigate, but just disabled it.

  2. By that, may I assume it's already basically working? If so, then it seems worth it to start on 19, since the browser can basically stand up (and this gets around a lot of the churn with display-list-based invalidation).

    Replies
    1. It does stand up and I'm typing this using it. The switch from ATS to CGFont causes some bad font rendering, so I'll undo that. I didn't undo the Cocoa widget changes but replaced them with Cocoa API calls that are available on 10.5 - although that might cause the browser to act more like a Cocoa application and less like a Gecko application; I don't like the approach of always capturing all input events regardless of whether it is the active application or not.

    2. What widget changes were those? I knew about the CGFont changes (I'm surprised it took until 19 to get those out).

    3. Those mentioned in Issue 182, comment 3. I'll upload my patchset soon (aurorafox project).

  3. I really like the new home page, but I can't find the link to the development blog anymore. Also, the "Lion" link at the bottom of the page leads to the old home page.

Due to an increased frequency of spam, comments are now subject to moderation.