I am nearly done with the "atomic" version of 17, which will go out as a "17.0.1pre" and then land on 17ESR for our consumers with stable 17.0.1. There are some interesting things I discovered along the way, including yet another episode of "Optimizing for the Mysterious G5" (see our other exciting episodes of PowerPC 970 black magic). Let's savour the tasty tidbits:
- To recap: Atomic operations help to ensure that multiple threads remain synchronized and don't step on each other, especially when multiple cores or CPUs are involved. The idea was to replace Mozilla NSPR's built-in atomic operations, which use GNU pthread_mutex_* and are therefore slow on OS X, with the OS X built-in atomics where possible, plus a custom "atomic set" I wrote myself because 10.4 doesn't possess one. After that, the plan is to look at using the built-in gcc intrinsics for 18+ when we move to gcc 4.6.3.
This only affects atomic operations -- mutexes, where the browser actually locks a value for synchronization, are still going through the OS X implementation of pthread_mutex_*(), which even Apple admits is slow.
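To make the recap concrete, here is a minimal sketch of the two approaches for a simple increment. It is not NSPR's actual code: the function names and the single global lock are assumptions made up for illustration. The first version emulates an atomic with a pthread mutex, and the second uses the gcc `__sync_add_and_fetch` intrinsic.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical names throughout; this is a sketch, not NSPR's code. */
static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;

/* "Atomic" increment emulated with a mutex: every call pays for a
 * lock and an unlock, even when nothing else is contending. */
static int32_t sketch_mutex_increment(int32_t *val)
{
    int32_t result;
    int rv = pthread_mutex_lock(&sketch_lock);
    assert(rv == 0);
    result = ++(*val);
    rv = pthread_mutex_unlock(&sketch_lock);
    assert(rv == 0);
    (void)rv; /* quiet unused-variable warnings when NDEBUG is set */
    return result;
}

/* The same operation through the gcc intrinsic, which the compiler
 * expands into the target's native atomic sequence plus barriers. */
static int32_t sketch_intrinsic_increment(volatile int32_t *val)
{
    return __sync_add_and_fetch(val, 1);
}
```

Both functions return the incremented value, so a caller can swap one for the other. On OS X the equivalent system call would be something like `OSAtomicIncrement32Barrier()` from `<libkern/OSAtomic.h>`, which is left out of the sketch because it only builds on Apple platforms.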
- OS X atomics are actually faster than gcc atomic intrinsics, so even after we switch compilers we probably still want to use this patch. The reason is that the OS X atomics self-optimize, which you can see better illustrated in this earlier version of osfmk's atomic.s. For the atomics to work correctly on multiprocessor Power Macs, there have to be memory and instruction barriers set up so that the processor doesn't inadvertently reorder instructions or memory accesses in such a way that we get the wrong value out at the end. A write barrier ensures that the atomic's read instruction executes only after all previous writes are complete, and a read barrier ensures that everything executing after the atomic's store instruction sees only the value it stored and not stale data. These barriers are expensive because they interfere with processor optimization and may flush pipelines and caches. They are also unnecessary on a uniprocessor Mac, so the operating system actually patches the unnecessary sync instructions to no-ops on single-processor Power Macs (see also Amit Singh's holy word in OS X Internals, page 400). The gcc atomic intrinsics don't do this (arguably, they can't), so they are slower than they should be on uniprocessor Power Macs, which are the majority of our user base.
Unfortunately, our own routine right now can't easily patch itself for the uniprocessor case, so we need to pick barrier instructions that are as low-impact as possible:
- The prototypical barrier instruction is sync, which forces the CPU to wait until all the previous instructions have executed before running another one and locks all processors together so that everything is in a predictable state. This necessarily covers all reads and writes. However, in doing so, it will grind everything on every core to a halt until execution is finished. This is a pipeline killer and we want to avoid it.
- eieio is mostly for ordering memory mapped I/O, but our main interest is that it prevents load-store combining and ensures memory access is coherent, including across processors. However, eieio really is only ordering stores in practice, so it is not sufficient as a read or read/write barrier on its own. It does, however, suffice as a write barrier.
- Alternatively, we could execute an isync, which is like sync in that it forces the CPU to wait until all instructions are completed. However, it also drops all prefetched instructions. It only affects the current processor, too, so it is only sufficient as a read barrier.
- Finally, there is lwsync, the "lightweight sync," which forces the core to synchronize to cache and orders most memory access so that reads after the barrier are coherent and all writes complete before it. The one case it does not order is a read that follows a write to a different address. This is invariably the case we don't care about, but it means that it is really only a write barrier in practice.
- sync is by far the worst, and even more painful on G5.
- eieio is unsuitable by itself. We'd need to pair it with another sync instruction, and it is also slow on G5 (see below).
- lwsync is faster than isync on G5 and most "big POWER" CPUs, and is faster than eieio also (see comment at bottom of source).
- lwsync becomes a sync on G3, G4, E500, PA6T and others. Ouch! And some earlier PowerPC CPUs don't even implement it! (For example, it's not in the 601 or 603.)
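To show where the barriers actually sit, here is a rough sketch of an "atomic set" (swap in a new value, return the old one) with a write barrier ahead of the load-reserve and a read barrier after the store-conditional. This is not the routine from the patch: the function name, the `SKETCH_PPC_ASM` macro and the particular pairing of lwsync with isync are assumptions chosen only to illustrate the layout. On non-PowerPC targets it falls back to a gcc intrinsic so the example still builds.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ppc__) || defined(__powerpc__) || defined(__POWERPC__)
#define SKETCH_PPC_ASM 1 /* hypothetical macro, for this sketch only */
#endif

/* Hypothetical helper: store newval into *ptr and return the old value. */
static int32_t sketch_atomic_set(volatile int32_t *ptr, int32_t newval)
{
    int32_t old;
    assert(ptr != 0);
#ifdef SKETCH_PPC_ASM
    __asm__ __volatile__(
        "lwsync\n\t"            /* write barrier: earlier stores complete first */
        "1: lwarx %0,0,%2\n\t"  /* load the old value and take a reservation */
        "stwcx. %3,0,%2\n\t"    /* store only if the reservation still holds */
        "bne- 1b\n\t"           /* reservation lost: go back and retry */
        "isync\n\t"             /* read barrier: discard anything prefetched */
        : "=&r"(old), "+m"(*ptr)
        : "r"(ptr), "r"(newval)
        : "cr0", "memory");
#else
    /* Portable fallback: compare-and-swap loop using a gcc intrinsic,
     * which implies a full barrier. */
    do {
        old = *ptr;
    } while (__sync_val_compare_and_swap(ptr, old, newval) != old);
#endif
    return old;
}
```

Swapping a different barrier into either slot is a one-line change, which is what makes the per-CPU cost comparison above matter: the loop in the middle stays the same, and only the instructions around it decide how much the pipeline suffers.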
I should note that the browser works just fine with no or incomplete barriers, but this might be pushing our luck on G5, so I think we'll just stay safe. We're already a lot faster.