Saturday, February 28, 2015

The oldest computer running TenFourFox

In the "this makes me happy" department, Miles Raymond posted his Power Macintosh 9500 (with a 700MHz G4 and 1.5GB of RAM) running TenFourFox in Tiger. I'm pretty sure there is no older system that can boot and run 10.4, but I'd be delighted to see if anyone can beat this. FWIW, the 9500 was released May 1995, making it 20 years old this year and our very own "Twentieth Anniversary" Macintosh.

And Mozilla says you need an Intel Mac to run Firefox! Oh, those kidders! They're a laugh a minute!

Wednesday, February 25, 2015

IonPower passes V8!

At least in Baseline-only mode, but check it out!

Starting program: /Volumes/BruceDeuce/src/mozilla-36t/obj-ff-dbg/dist/bin/js --no-ion --baseline-eager -f run.js
warning: Could not find malloc init callback function.
Make sure malloc is initialized before calling functions.
Reading symbols for shared libraries ....................................................................+++......... done
Richards: 144
DeltaBlue: 137
Crypto: 215
RayTrace: 230
EarleyBoyer: 193
RegExp: 157
Splay: 140
NavierStokes: 268
----
Score (version 7): 180

Program exited normally.

Please keep in mind this is a debugging version and performance is impaired relative to PPCBC (and if I had to ship a Baseline-only compiler in TenFourFox 38, it would still be PPCBC because it has the best track record). However, all of the code cleanup for IonPower and its enhanced debugging capabilities paid off: with one exception, all of the bugs I had to fix to get it passing V8 were immediately flagged by sanity checks during code generation, saving much laborious single-stepping through generated assembly to find problems.

I have a Master's program final to study for, so I'll be putting this aside for a few days, but after I thoroughly bomb it, the next step is to mount phase 4, where IonPower can pass the test suite in Baseline mode. Then the real fun will begin -- true Ion-level compilation on big-endian PowerPC. We are definitely on target for 38, assuming all goes well.

I forgot to mention one other advance in IonPower, which Ben will particularly appreciate if he still follows this blog: full support for all eight bitfields of the condition register. Unfortunately, it's mostly irrelevant to generated code because Ion assumes, much to my disappointment, that the processor possesses only a single set of flags. However, some sections of code that we fully control can now do multiple comparisons in parallel over several condition registers, reducing our heavy dependence upon (and usually hopeless serialization of) cr0, and certain FPU operations that emit to cr1 (or require the FPSCR to dump to it) can now branch directly upon that bitfield instead of having to copy it. Also, emulation of mcrxr on G5/POWER4+ no longer has a hard-coded dependency upon cr7, simplifying much conditional branching code. It's a seemingly minor change that nevertheless greatly helps to further unlock the untapped Power in PowerPC.
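To make the condition register trick concrete, here's an illustrative fragment (my sketch, not actual IonPower output) doing two comparisons in parallel:

cmpwi cr0,r3,0     ; compare r3 against zero, result to cr0
cmpwi cr7,r4,10    ; compare r4 against 10 in cr7 -- no need to wait on cr0
beq cr0,was_zero   ; branch on the first comparison ...
blt cr7,was_less   ; ... and on the second, still intact in cr7

Both results stay live at once, so neither branch forces the other comparison to be redone.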

Tuesday, February 24, 2015

Two victories

Busted through my bug with stubs late last night (now that I've found the bug, I'm chagrined at how I could have been so dense) and today IonPower's Baseline implementation successfully computed π to an arbitrary number of iterations using the nice algorithm by Daniel Pepin:

% ../../../obj-ff-dbg/dist/bin/js --no-ion --baseline-eager -e 'var pi=4,top=4,bot=3,minus = true;next(pi,top,bot,minus,30);function next(pi,top,bot,minus,num){for(var i=0;i<num;i++){pi += (minus == true)?-(top/bot):(top/bot);minus = \!minus;bot+=2;}print(pi);}'
3.1738423371907505
% ../../../obj-ff-dbg/dist/bin/js --no-ion --baseline-eager -e 'var pi=4,top=4,bot=3,minus = true;next(pi,top,bot,minus,30000);function next(pi,top,bot,minus,num){for(var i=0;i<num;i++){pi += (minus == true)?-(top/bot):(top/bot);minus = \!minus;bot+=2;}print(pi);}'
3.141625985812036
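For the curious, here's the same computation unrolled into something readable (the backslash before each ! above is just shell escaping for the history character):

// Daniel Pepin's series, reformatted from the one-liner above
function next(pi, top, bot, minus, num) {
    for (var i = 0; i < num; i++) {
        pi += minus ? -(top / bot) : (top / bot); // alternately subtract and add top/bot
        minus = !minus;
        bot += 2;                                 // denominators run over the odd numbers
    }
    print(pi);
}
next(4, 4, 3, true, 30000);                       // prints 3.141625985812036, as above

In other words, 4 - 4/3 + 4/5 - 4/7 + ..., the familiar slowly-converging Leibniz-style series for π.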

Still work to be done on the rudiments before attacking the test suite, but code of this complexity running correctly so far is a victory. And, in a metaphysical sense, speaking from my perspective as a Christian (and a physician aware of the nature of his illness), here is another victory: a Mozillian's last post from the end stages of his affliction. Even for those who do not share that religious perspective, it is a truly brave final statement and one I have not seen promoted enough.

Saturday, February 21, 2015

Biggus diskus (plus: 31.5.0 and how to superphish in your copious spare time)

One of the many great advances that Mac OS 9 had over the later operating system was the extremely flexible (and persistent!) RAM disk feature, which I use on almost all of my OS 9 systems to this day as a cache store for Classilla and a temporary work area. It's not just for laptops!

While OS X can configure and use RAM disks, of course, it's not as nicely integrated as the RAM Disk in Classic is, and it isn't natively persistent, though the very nice Esperance DV prefpane comes pretty close to duplicating the earlier functionality. Esperance will let you create a RAM disk up to 2GB in size, which for most typical uses of a transient RAM disk (cache, scratch volume) would seem to be more than enough, and can back it up to disk when you exit. But there are some heavy-duty tasks that 2GB just isn't enough for -- what if you, say, wanted to compile a PowerPC fork of Firefox in one, he asked nonchalantly, picking a purpose at random not at all intended to further this blog post?

The 2GB cap actually originates from two specific technical limitations. The first applies to G3 and G4 systems: they can't have more than 2GB total physical RAM anyway. Although OS X RAM disks are "sparse" and only actually occupy the amount of RAM needed to store their contents, if you filled up a RAM disk with 2GB of data even on a 2GB-equipped MDD G4, you'd start spilling memory pages to the real hard disk and thrashing so badly you'd be worse off than if you had just used the hard disk in the first place. The second limit applies to G5 systems too, even in Leopard -- the RAM disk is served by /System/Library/PrivateFrameworks/DiskImages.framework/Resources/diskimages-helper, a 32-bit process limited to a 4GB address space minus executable code and mapped-in libraries (it didn't become 64-bit until Snow Leopard). In practice this leaves exactly 4629672 512-byte disk blocks, or approximately 2.2GB, as the largest possible standalone RAM disk image on PowerPC. A full single-architecture build of TenFourFox takes about 6.5GB. Poop.
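(If you want to see that ceiling for yourself, a single maximal RAM disk is a one-liner -- MaxRAM here is just an arbitrary volume name:

diskutil erasevolume HFS+ MaxRAM `hdiutil attach -nomount ram://4629672`

hdiutil attaches a device backed by that many 512-byte blocks, and diskutil then formats and mounts it.)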

It dawned on me during one of my careful toilet thinking sessions that the way awound, er, around this pwoblem, er, problem was a speech pathology wefewwal to RAID volumes together. I am chagrined that others had independently come up with this idea before, but let's press on anyway. At this point I'm assuming you're going to do this on a G5, because doing this on a G4 (or, egad, G3) would be absolutely nuts, and that your G5 has at least 8GB of RAM. The performance improvement we can expect depends on how the RAM disk is constructed (10.4 gives me the choice of concatenated, i.e., you move from component volume process to component volume process as they fill up, or striped, i.e., the component volume processes are interleaved [RAID 0]), and how much the tasks being performed on it are limited by disk access time. Building TenFourFox is admittedly a rather CPU-bound task, but there is a non-trivial amount of disk access, so let's see how we go.

Since I need at least 6.5GB, I decided the easiest way to handle this was four 2+GB images (roughly 8.8GB all told). Obviously, the 8GB of RAM I had in my Quad G5 wasn't going to be enough, so (an order to MemoryX and) a couple of days later I had a 16GB memory kit (8 x 2GB) at my doorstep for installation. (As an aside, this means my quad is now pretty much maxed out: between the 16GB of RAM and the Quadro FX 4500, it's now the most powerful configuration of the most powerful Power Mac Apple ever made. That's the same kind of sheer bloodymindedness that puts 256MB of RAM into a Quadra 950.)

Now to configure the RAM disk array. I ripped off a script from someone on Mac OS X Hints and modified it to be somewhat more performant. Here it is (it's a shell script you run in the Terminal, or you could use Platypus or something to make it an app; works on 10.4 and 10.5):

% cat ~/bin/ramdisk
#!/bin/sh

# bail out if the array is already up
/bin/test -e /Volumes/BigRAM && exit

# create four ~2.2GB RAM disks in parallel
diskutil erasevolume HFS+ r1 \
        `hdiutil attach -nomount ram://4629672` &
diskutil erasevolume HFS+ r2 \
        `hdiutil attach -nomount ram://4629672` &
diskutil erasevolume HFS+ r3 \
        `hdiutil attach -nomount ram://4629672` &
diskutil erasevolume HFS+ r4 \
        `hdiutil attach -nomount ram://4629672` &
wait

# aggregate the four volumes into one striped (RAID 0) array
diskutil createRAID stripe BigRAM HFS+ \
        /Volumes/r1 /Volumes/r2 /Volumes/r3 /Volumes/r4

Notice that I'm using stripe here -- you would substitute concat for stripe above if you wanted that mode, but read on first before you do that. Open Disk Utility prior to starting the script and watch the side pane as it runs if you want to understand what it's doing. You'll see the component volume processes start, reconfigure themselves, get aggregated, and then the main array come up. It's sort of a nerdily beautiful disk image ballet.

One complication, however, is you can't simply unmount the array and expect the component RAM volumes to go away by themselves; instead, you have to go seek and kill the component volumes first and then the array will go away by itself. If you fail to do that, you'll run out of memory verrrrry quickly because the RAM will not be reclaimed! Here's a script for that too. I haven't tested it on 10.5, but I don't see why it wouldn't work there either.

% cat ~/bin/noramdisk
#!/bin/sh

# nothing to do if the array isn't mounted
/bin/test -e /Volumes/BigRAM || exit

# unmount the array so its components can be modified
diskutil unmountDisk /Volumes/BigRAM

# pull the component disk identifiers out of checkRAID's status
# table, strip the s3 slice suffix and eject each backing disk
diskutil checkRAID BigRAM | tail -5 | head -4 | \
        cut -c 3-10 | grep -v 'Unknown' | \
        sed 's/s3//' | xargs -n 1 diskutil eject

This script needs a little explanation. What it does is unmount the RAM disk array so it can be modified, then goes through the list of its component processes, isolates the diskn that backs each of them and ejects those disks. When all the disk array's components are gone, OS X removes the array, and that's it. Naturally, shutting down or restarting will wipe the array away too.

(If you want to use these scripts for a different sized array, adjust the number of diskutil erasevolume lines in the mounter script, and make sure the last line has the right number of images [like /Volumes/r1 /Volumes/r2 by themselves for a 2-image array]. In the unmounter script, change the tail and head parameters to 1+images and images respectively [e.g., tail -3 | head -2 for a 2-image array].)
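For instance, the two-image unmounter's pipeline would come out like this (the same script with just those numbers changed):

diskutil checkRAID BigRAM | tail -3 | head -2 | \
        cut -c 3-10 | grep -v 'Unknown' | \
        sed 's/s3//' | xargs -n 1 diskutil eject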

Since downloading the source code from Mozilla is network-bound (especially on my network), I just dumped it to the hard disk, and patched it on the disk as well so a problem with the array would not require downloading and patching everything again. Once that was done, I made a copy on the RAM disk with hg clone esr31g /Volumes/BigRAM/esr31g and started the build. My hard disk, for comparison, is a 7200rpm 64MB buffer Western Digital SATA drive; remember that all PowerPC OS X-compatible controllers only support SATA I. Here are the timings, with the Quad G5 in Highest performance mode:

hard disk: 2 hours 46 minutes
concatenated: 2 hours 15 minutes (18.7% improvement)
striped: 2 hours 8 minutes (22.9% improvement)

Considering how much of this is limited by the speed of the processors, this is a rather nice boost, and I bet it will be even faster with unified builds in 38ESR (these are somewhat more disk-bound, particularly during linking). Since I've just saved about two and a half hours of build time over all four CPU builds, this is the way I intend to build TenFourFox in the future.

The 5.2% delta observed here between striping and concatenation doesn't look very large, but it is statistically significant, and actually the difference is larger than this test would indicate -- if our task were primarily disk-bound, the gulf would be quite wide. The reason striping is faster here is that each 2GB slice of the RAM disk array is an independent instance of diskimages-helper, and since we have four slices, each slice can run on one of the Quad's cores. By spreading disk access equally among all the processes, we share it equally over all the processors and achieve lower latency and higher efficiencies. This would probably not be true if we had fewer cores, and indeed for dual G5s two slices (or concatenating four) may be better; the earliest single-processor G5s should almost certainly use concatenation only.

Some of you will ask how this compares to an SSD, and frankly I don't know. Although I've done some test builds on an SSD, I've been using a Patriot Blaze SATA III drive connected to my FW800 drive toaster to avoid problems with interfacing, so I doubt any numbers I'd get off that setup would be particularly generalizable, and I'd rather use the RAM disk anyhow because I don't have to worry about TRIM, write cycles or cleaning up. However, I would be very surprised if an SSD in a G5 achieved speeds faster than RAM, especially given the (comparatively, mind you) lower SATA bandwidth.

And, with that, 31.5.0 is released for testing (release notes, hashes, downloads). This only contains ESR security/stability fixes; you'll notice the changesets hash the same as 31.4.0 because they are, in fact, the same. The build finalizes Monday PM Pacific as usual.

31.5.0 would have been out earlier (experiments with RAM disks notwithstanding) except that I was waiting to see what Mozilla would do about the Superfish/Komodia debacle: the fact that Lenovo was loading adware that MITM-ed HTTPS connections on their PCs ("Superfish") was bad enough, but the secret root certificate it possessed had an easily crackable private key password, allowing a bad actor to create phony certificates; and now it looks like Komodia, the company that developed the technology behind Superfish, has by its willful bad-faith actions caused the same problem to exist hidden in other kinds of adware it powers.

Assuming you were not tricked into accepting their root certificate in some other fashion (their nastyware doesn't run on OS X and near as I can tell never has), your Power Mac is not at risk, but these kinds of malicious, malfeasant and incredibly ill-constructed root certificates need to be nuked from orbit (as well as the companies that try to sneak them onto users' machines; I suggest napalm, castration and feathers), and they will be marked as untrusted in future versions of TenFourFox and Classilla so that false certificates signed with them will not be honoured under any circumstances, even by mistake. Unfortunately, it's also yet another example of how the roots are the most vulnerable part of secure connections (previously, previously).

Development on IonPower continues. Right now I'm trying to work out a serious bug with Baseline stubs and not having a lot of luck; if I can't get this working by 38.0, we'll ship 38 with PPCBC (targeting a general release by 38.0.2 in that case). But I'm trying as hard as I can!

Saturday, February 14, 2015

IonPower now entering low Earth orbit

Happy Valentine's Day. I'm spending mine in front of the computer with a temporary filling on my back top left molar. But I'm happy, because IonPower, the new PowerPC JIT I plan to release with TenFourFox 38, is now to the point where it can compile and execute simple scripts in Baseline mode (i.e., phase 3, where 1 was authoring, 2 was compiling, 3 is basic operations, 4 is pass tests in Baseline, 5 is basic operations in full Ion and 6 is pass tests in Ion). For example, tonight's test was debugging and fixing this fun script:

function ok(){print("ok");}function ok3(){ok(ok(ok()));}var i=0;for(i=0;i<12;i++){ok3();}

That's an awful lot of affirmation right there. Once I'm confident I can scale it up a bit more, we'll try to get the test suite to pass.

What does IonPower bring to TenFourFox, besides of course full Ion JIT support for the first time? Well, the next main thing it does is to pay back our substantial accrued technical debt. PPCBC, the current Baseline-only implementation, is still at its core using the same code generator and assumptions about macroassembler structure from JaegerMonkey and, ultimately, TenFourFox 10.x; I got PPCBC working by gluing the new instruction generator to the old one and writing just enough of the new macroassembler to enable it to build and generate code.

JaegerMonkey was much friendlier to us because our implementation was always generating (from the browser's view) ABI-compliant code; JavaScript frames existed on a separate interpreter stack, so we could be assured that the C stack was sane. Both major PowerPC ABIs (PowerOpen, which OS X and AIX use, and SysV, which Linux, *BSD, et al. use) have very specific requirements for how stack frames are formatted, but by TenFourFox 17 I had all the edge cases worked out and we knew that the stack would always be in a compliant state at any crossing from generated code back to the browser or OS. Ben Stuhl had done some hard work on branch and call stanza optimization and this mostly "just worked" because we had full control to that level.
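For reference, both ABIs demand a fixed linkage area at the top of every compliant frame; the 32-bit PowerOpen one (OS X, AIX) looks like this, give or take my memory:

;  0(SP)  back chain -- pointer to the caller's frame
;  4(SP)  saved CR
;  8(SP)  saved LR
; 12(SP)  reserved
; 16(SP)  reserved
; 20(SP)  saved TOC/r2
; 24(SP)  parameter area follows

Generated JaegerMonkey code could always assume a sane structure like that at every crossing back into the browser.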

Ion (and Baseline, its simpleton sibling) destroyed all that. Now, script frames are built on the C stack instead, and Ion has its own stack frame format which it always expects to be on top, meaning when you enter Ion generated code you can't directly return to ABI-compliant code without some sort of thunk, and vice versa. This frame has a descriptor and a return address, which is tricky on PowerPC: we don't store return addresses on the stack like x86 does; the link register we do have (the register storing where a subroutine call should return to) is not a general purpose register and has specialized means for controlling it -- it's part of the general class of PowerPC SPRs, or special purpose registers -- and the program counter (the internal register storing the location of the current instruction) is not directly accessible in any fashion. Compare this with MIPS, where the link register is a GPR ($ra), and ARM, where both the LR and PC are. Those CPUs can just sling them to and from the stack at will.
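Since the PC can't be read directly, the canonical PowerPC workaround is to fake a subroutine call and then harvest the link register -- which, as you'll see below, is exactly the trick PPCBC pulls:

bl .+4      ; "call" the very next instruction; the CPU deposits
            ; that instruction's address in LR
mflr r0     ; copy LR into a GPR we can actually compute with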

That return address is problematic in another respect. We use constructs called branch stanzas in the PowerPC port, because the PPC branch instructions have a limited signed displacement of (at most, for b) 26 bits, and branches may exceed that with the size of scripts these days; these stanzas are padded with nop instructions so that we have something to patch with long calls (usually lis ori mtctr bctr) later. In JaegerMonkey (and PPCBC), if we were making a subroutine call and the call was "short," the bl instruction that performed it was at the top of the branch stanza with some trailing nops; JaegerMonkey didn't care how we managed the return as long as we did. Because we had multiple branch variants for G3/G4 and G5, the LR which bl set could be any number of instructions after the branch within the stanza, but it didn't matter because the code would continue regardless. To Ion, though, it matters: it expects that return address on the stack to always correlate to where a translated JSOP (JavaScript interpreter opcode) begins -- no falling between the cracks.

IonPower deals with the problem in two ways. First, we now use a unified stanza for all architectures, so it always looks the same and is predictable, and short branches are now at the end of the stanza so that we always return to the next JSOP. Second, we don't actually use the link register for calling generated code in most cases. In both PowerPC ABIs, because LR is only set when control gets to the callee (the routine being called), the callee is expected to save it in its function prologue, but we don't have a prologue for Baseline code. PPCBC gets around this by using the link register within Baseline code only; for calls out to other generated code it would clobber the link register with a bl .+4 to get the PC, compute an offset and store that as a return address. However, that offset was sometimes wrong (see above), requiring hacks in the Ion core to do secondary verification. Instead, since we know the address of the instruction for the trailing JSOP at link time, we just generate code to push a 32-bit literal, and we patch those push instructions to the right location when we link the code to memory. This executes faster, too, because we don't have all that mucking about with SPR operations. We only use the link register for toggled calls (because we control the code it calls, so we can capture LR there) and for bailout tables, and of course for the thunk code that calls ABI-compliant library routines.
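Diagrammatically, a unified IonPower call stanza comes out something like this (my sketch, not literal generated code):

nop         ; patch space: becomes lis r0,hi16(target) for a long call
nop         ;   ... ori r0,r0,lo16(target)
nop         ;   ... mtctr r0
bl target   ; short branch last (bctrl when patched long), so the return
            ; address always lands on the instruction for the next JSOP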

This means that a whole class of bugs -- branching, repatching and stack frame glitches -- just disappears in IonPower, whereas with PPCBC there seemed to be no good way to get it to work with the much more sophisticated Ion JIT and its expectations on branching and patching. For that reason, I simply concentrated on optimizing PPCBC as much as possible with high-performance PowerPC-specific parallel type guards and arithmetic routines in Baseline inline caches, and that's what you're using now.

IonPower's other big step forward is improved debugging abilities, and chief amongst them is no compiler warnings -- at all. Yep. No warnings, at least not from our headers, so that subtle bugs that the compiler might warn about can now be detected. We also have a better code generation display that's easier to visually parse, and annoying pitfalls in PowerPC instructions, where our typical temporary register r0 does something completely different from other registers, now assert at code generation time, making it easier to find these bugs during codegen instead of laboriously going instruction by instruction through the debugger at runtime.
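The classic r0 trap, for those who haven't been bitten by it: in many PowerPC instructions an r0 operand means the literal value zero, not the register. For example:

addi r3,r4,8   ; r3 = r4 + 8, as you'd expect
addi r3,r0,8   ; r3 = 0 + 8! r0 here is literal zero, not the register
               ; (this is, in fact, how li r3,8 is encoded)

Catching that at code generation time beats discovering it one instruction at a time in the debugger.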

Incidentally, this brought up an interesting discussion -- how does, say, MIPS deal with the return address problem, since it has an LR but no GPR PC? The MIPS Ion backend, for lack of a better word, cheats. MIPS has a single instruction branch delay slot, i.e., the instruction following any branch is (typically) executed before the branch actually takes place. If this sounds screwy, you're right; ARM and PowerPC don't have them, but some other RISC designs like MIPS, SuperH and SPARC still do as a historical holdover (in older RISC implementations, the delay slot was present to allow the processor enough time to fetch the branch target, and it remains for compatibility). With that in mind, and the knowledge that $ra is the MIPS link register, what do you think this does?

addiu $sp,$sp,-8 ; subtract 8 from the current stack pointer,
                 ; reserving two words on the stack
jalr $a0         ; jump to address in register $a0 and set $ra
sw $ra, 0($sp)   ; (DELAY SLOT) store $ra to the top of the stack

What's the value of $ra in the delay slot? This isn't well-documented apparently, but it's actually already set to the correct return address before the branch actually takes place, so MIPS can save the right address to the stack right here. (Thanks to Kyle Isom, who confirmed this on his own MIPS system.) It must be relatively standard between implementations, but I'd never seen a trick like that before. I guess there has to be at least some advantage to still having branch delay slots in this day and age.

One final note: even though I necessarily had to write some of it for purposes of compilation, please note that there will be no asm.js for PowerPC. Almost all the existing asm.js code out there assumes little-endian byte ordering, meaning it will either not run or, worse, run incorrectly on our big-endian machines. It's just not worth the headache, so such scripts will simply run in the usual JIT.

On the stable branch side, 31.5 will be going to build next week. Dan DeVoto commented that TenFourFox does not colour manage untagged images (unless you force it to). I'd argue this is correct behaviour, but I can see where it would be unexpected, and while I don't mind changing the default I really need to do more testing on it. This might be the standard setting for 31.6, however. Other than the usual ESR fixes there are no major changes in 31.5 because I'm spending most of my time (that my Master's degree isn't soaking up) on IonPower -- I'm still hoping we can get it in for 38 because I really do believe it will be a major leap forward for our vintage Power Macs.