Friday, September 23, 2016

TenFourFox 45.5.0b1 available: now with little-endian (integer) typed arrays, AltiVec VP9, improved MP3 support and a petulant rant

The TenFourFox 45.5.0 beta (yes, it says it's 45.4.0, I didn't want to rev the version number yet) is now available for testing (downloads, hashes). This blog post will serve as the current "release notes" since we have until November 8 for the next release and I haven't decided everything I'll put in it, so while I continue to do more work I figured I'd give you something to play with. Here's what's new so far, roughly in order of importance.

First, minimp3 has been converted to a platform decoder. Simply by doing that fixed a number of other bugs which were probably related to how we chunked frames, such as Google Translate voice clips getting truncated and problems with some types of MP3 live streams; now we use Mozilla's built-in frame parser instead and in this capacity minimp3 acts mostly as a disembodied codec. The new implementation works well with Google Translate, Soundcloud, Shoutcast and most of the other things I tried. (See, now there's a good use for that Mac mini G4 gathering dust on your shelf: install TenFourFox and set it up for remote screensharing access, and use it as a headless Internet radio -- I'm sitting here listening to National Public Radio over Shoutcast in a foxbox as I write this. Space-saving, environmentally responsible computer recycling! Yes, I know I'm full of great ideas. Yes. You're welcome.)

Interestingly, or perhaps frustratingly, although it somewhat improved Amazon Music (by making duration and startup more reliable) the issue with tracks not advancing still persisted for tracks under a certain critical length, which is dependent on machine speed. (The test case here was all the little five or six second Fingertips tracks from They Might Be Giants' Apollo 18, which also happens to be one of my favourite albums, and is kind of wrecked by this problem.) My best guess is that Amazon Music's JavaScript player interface ends up on a different, possibly asynchronous code path in 45 than 38 due to a different browser feature profile, and if the track runs out somehow it doesn't get the end-of-stream event in time. Since machine speed was a factor, I just amped up JavaScript to enter the Baseline JIT very quickly. That still doesn't fix it completely and Apollo 18 is still messed up, but it gets the critical track length down to around 10 or 15 seconds on this Quad G5 in Reduced mode and now most non-pathological playlists will work fine. I'll keep messing with it.

In addition, this release carries the first pass at AltiVec decoding for VP9. It has some of the inverse discrete cosine and one of the inverse Hadamard transforms vectorized, and I also wrote vector code for two of the convolutions but they malfunction on the iMac G4 and it seems faster without them because a lot of these routines work on unaligned data. Overall, our code really outshines the SSE2 versions I based them on if I do say so myself. We can collapse a number of shuffles and merges into a single vector permute, and the AltiVec multiply-sum instruction can take an additional constant for use as a bias, allowing us to skip an add step (the SSE2 version must do the multiply-sum and then add the bias rounding constant in separate operations; this code occurs quite a bit). Only some of the smaller transforms are converted so far because the big ones are really intimidating. I'm able to model most of these operations on my old Core 2 Duo Mac mini, so I can do a step-by-step conversion in a relatively straightforward fashion, but it's agonizingly slow going with these bigger ones. I'm also not going to attempt any of the encoding-specific routines, so if Google wants this code they'll have to import it themselves.

G3 owners, even though I don't support video on your systems, you get a little boost too because I've also cut out the loopfilter entirely. This improves everybody's performance and the mostly minor degradation in quality just isn't bad enough to be worth the CPU time required to clean it up. With this initial work the Quad is able to play many 360p streams at decent frame rates in Reduced mode and in Highest Performance mode even some 480p ones. The 1GHz iMac G4, which I don't technically support for video as it is below the 1.25GHz cutoff, reliably plays 144p and even some easy-to-decode (pillarboxed 4:3, mostly, since it has lots of "nothing" areas) 240p. This is at least as good as our AltiVec VP8 performance and as I grind through some of the really heavyweight transforms it should get even better.

To turn this on, go to our new TenFourFox preference pane (TenFourFox > Preferences... and click TenFourFox) and make sure MediaSource is enabled, then visit YouTube. You should have more quality settings now and I recommend turning annotations off as well. Pausing the video while the rest of the page loads is always a good idea as well as before changing your quality setting; just click once anywhere on the video itself and wait for it to stop. You can evaluate it on my scientifically validated set of abuses of grammar (and spelling), 1970s carousel tape decks, gestures we make at Gmail other than the middle finger and really weird MTV interstitials. However, because without further configuration Google will "auto-"control the stream bitrate and it makes that decision based on network speed rather than dropped frames, I'm leaving the "slower" appellation because frankly it will be, at least by default. Nevertheless, please advise if you think MSE should be the default in the next version or if you think more baking is necessary, though the pref will be user-exposed regardless.

But the biggest and most far-reaching change is, as promised, little-endian typed arrays (the "LE" portion of the IonPower-NVLE project). The rationale for this change is that, largely due to the proliferation of asm.js code and the little-endian Emscripten systems that generate it, there will be more and more code our big-endian machines can't run properly being casually imported into sites. We saw this with images on Facebook, and later with WhatsApp Web, and also with MEGA.nz, and others, and so on, and so forth. asm.js isn't merely the domain of tech demos and high-end ported game engines anymore.

The change is intentionally very focused and very specific. Only typed array access is converted to little-endian, and only integer typed array access at that: DataView objects, the underlying ArrayBuffers and regular untyped arrays in particular remain native. When a multibyte integer (16-bit halfword or 32-bit word) is written out to a typed array in IonPower-LE, it is transparently byteswapped from big-endian to little-endian and stored in that format. When it is read back in, it is byteswapped back to big-endian. Thus, the intrinsic big-endianness of the engine hasn't changed -- jsvals and doubles are still tag followed by payload, and integers and single-precision floats are still MSB at the lowest address -- only the way it deals with an integer typed array. Since asm.js uses a big typed array buffer essentially as a heap, this is sufficient to present at least a notional illusion of little-endianness as the asm.js script accesses that buffer as long as those accesses are integer.

I mentioned that floats (neither single-precision nor doubles) are not byteswapped, and there's an important reason for that. At the interpreter level, the virtual machine's typed array load and store methods are passed through the GNU gcc built-in to swap the byte order back and forth (which, at least for 32 bits, generates pretty efficient code). At the Baseline JIT level, the IonMonkey MacroAssembler is modified to call special methods that generate the swapped loads and stores in IonPower, but it wasn't nearly that simple for the full Ion JIT itself because both unboxed scalar values (which need to stay big-endian because they're native) and typed array elements (which need to be byte-swapped) go through the same code path. After I spent a couple days struggling with this, Jan de Mooij suggested I modify the MIR for loading and storing scalar values to mark it if the operation actually accesses a typed array. I added that to the IonBuilder and now Ion compiled code works too.

All of these integer accesses have almost no penalty: there's a little bit of additional overhead on the interpreter, but Baseline and Ion simply substitute the already-built-in PowerPC byteswapped load and store instructions (lwbrx, stwbrx, lhbrx, sthbrx, etc.) that we already employ for irregexp for these accesses, and as a result we incur virtually no extra runtime overhead at all. Although the PowerPC specification warns that byte-swapped instructions may have additional latency on some implementations, no PPC chip ever used in a Power Mac falls in that category, and they aren't "cracked" on G5 either. The pseudo-little endian mode that exists on G3/G4 systems but not on G5 is separate from these assembly language instructions, which work on all PowerPCs including the G5 going all the way back to the original 601.

Floating point values, on the other hand, are a different story. There are no instructions to directly store a single or double precision value in a byteswapped fashion, and since there are also no direct general purpose register-floating point register moves, the float has to be spilled to memory and picked up by a GPR (or two, if it's a double) and then swapped at that point to complete the operation. To get it back requires reversing the process, along with the GPR (or two) getting spilled this time to repopulate the double or float after the swap is done. All that would have significantly penalized float arrays and we have enough performance problems without that, so single and double precision floating point values remain big-endian.

Fortunately, most of the little snippets of asm.js floating around (that aren't entire Emscriptenized blobs: more about that in a moment) seem perfectly happy with this hybrid approach, presumably because they're oriented towards performance and thus integer operations. MEGA.nz seems to load now, at least what I can test of it, and WhatsApp Web now correctly generates the QR code to allow your phone to sync (just in time for you to stop using WhatsApp and switch to Signal because Mark Zuckerbrat has sold you to his pimps here too).

But what about bigger things? Well ...

Yup. That's DOSBOX emulating MECC's classic Oregon Trail (from the Internet Archive's MS-DOS Game Library), converted to asm.js with Emscripten and running inside TenFourFox. Go on and try that in 45.4. It doesn't work; it just throws an exception and screeches to a halt.

To be sure, it doesn't fully work in this release of 45.5 either. But some of the games do: try playing Oregon Trail yourself, or Where in the World is Carmen Sandiego or even the original, old school in its MODE 40 splendour, Те́трис (that's Tetris, comrade). Even Commander Keen Goodbye Galaxy! runs, though not even the Quad can make it reasonably playable. In particular the first two probably will run on nearly any Power Mac since they're not particularly dependent on timing (I was playing Oregon Trail on my iMac G4 last night), though you should expect it may take anywhere from 20 seconds to a minute to actually boot the game (depending on your CPU) and I'd just mute the tab since not even the Quad G5 at full tilt can generate convincing audio. But IonPower-LE will now run them, and they run pretty well, considering.

Does that seem impractical? Okay then: how about something vaguely useful ... like ... Linux?

This is, of course, Fabrice Belliard's famous jslinux emulator, and yes, IonPower now runs this too. Please don't expect much out of it if you're not on a high-end G5; even the Quad at full tilt took about 80 seconds elapsed time to get to a root prompt. But it really works and it's useable.

Getting into ridiculous territory was running Linux on OpenRISC:

This is the jor1k emulator and it's only for the highest end G5 systems, folks. Set it to 5fps to have any chance of booting it in less than five minutes. But again -- it's not that the dog walked well.

vi freaks like me will also get a kick out of vim.js. Or, if you miss Classic apps, now TenFourFox can be your System 7 (mouse sync is a little too slow here but it boots):

Now for the bad news: notice that I said things don't fully work. With em-dosbox, the Emscriptenoberated DOSBOX, notice that I only said some games run in TenFourFox, not most, not even many. Wolfenstein 3D, for example, gets as far as the main menu and starting a new game, and then bugs out with a "Reboot requested" message which seems to originate from the emulated BIOS. (It works fine on my MacBook Air, and I did get it to run under PCE.js, albeit glacially.) Catacombs 3D just sits there, trying to load a level and never finishing. Most of the other games don't even get that far and a few don't start at all.

I also tried a Windows 95 emulator (also DOSBOX, apparently), which got part way into the boot sequence and then threw a JavaScript exception "SimulateInfiniteLoop"; the Internet Archive's arcade games under MAME which starts up and then exhausts recursion and aborts (this seems like it should be fixable or tunable, but I haven't explored this further so far); and of course programs requiring WebGL will never, ever run on TenFourFox.

Debugging Emscripten goo output is quite difficult and usually causes tumours in lab rats, but several possible explanations come to mind (none of them mutually exclusive). One could be that the code actually does depend on the byte ordering of floats and doubles as well as integers, as do some of the Mozilla JIT conformance tests. However, that's not ever going to change because it requires making everything else suck for that kind of edge case to work. Another potential explanation is that the intrinsic big-endianness of the engine is causing things to fail somewhere else, such as they managed to get things inadvertently written in such a way that the resulting data was byteswapped an asymmetric number of times or some other such violation of assumptions. Another one is that the execution time is just too damn long and the code doesn't account for that possibility. Finally, there might simply be a bug in what I wrote, but I'm not aware of any similar hybrid endian engine like this one and thus I've really got nothing to compare it to.

In any case, the little-endian typed array conversion definitely fixes the stuff that needed to get fixed and opens up some future possibilities for web applications we can also run like an Intel Mac can. The real question is whether asm.js compilation (OdinMonkey, as opposed to IonPower) pays off on PowerPC now that the memory model is apparently good enough at least for most things. It would definitely run faster than IonPower, possibly several times faster, but the performance delta would not be as massive as IonPower versus the interpreter (about a factor of 40 difference), the compilation step might bring lesser systems to their knees, and it would require some significant additional engineering to get it off the ground (read: a lot more work for me to do). Given that most of our systems are not going to run these big massive applications well even with execution time cut in half or even 2/3rds (and some of them don't work correctly as it is), it might seem a real case of diminishing returns to make that investment of effort. I'll just have to see how many free cycles I have and how involved the effort is likely to be. For right now, IonPower can run them and that's the important thing.

Finally, the petulant rant. I am a fairly avid reader of Thom Holwerda's OSNews because it reports on a lot of marginal and unusual platforms and computing news that most of the regular outlets eschew. The articles are in general very interesting, including this heads-up on booting the last official GameCube game (and since the CPU in the Nintendo GameCube is a G3 derivative, that's even relevant on this blog). However, I'm going to take issue with one part of his otherwise thought-provoking discussion on the new Apple A10 processor and the alleged impending death of Mac OS macOS, where he says, "I didn't refer to Apple's PowerPC days for nothing. Back then, Apple knew it was using processors with terrible performance and energy requirements, but still had to somehow convince the masses that PowerPC was better faster stronger than x86; claims which Apple itself exposed — overnight — as flat-out lies when the company switched to Intel."

Besides my issue with what he links in that last sentence as proof, which actually doesn't establish Apple had been lying (it's actually a Low End Mac piece contemporary with the Intelcalypse asking if they were), this is an incredibly facile oversimplification. Before the usual suspects hop on the comments with their usual suspecty things, let's just go ahead for the sake of argument and say everything its detractors said about the G5 and the late generation G4 systems are true, i.e., they're hot, underpowered and overhungry. (I contest the overhungry part in particular for the late laptop G4 systems, by the way. My 2005 iBook G4 to this day still gets around five hours on a charge if I'm aggressive and careful about my usage. For a 2005 system that's damn good, especially since Apple said six for the same model I own but only 4.5 for the 2008 MacBooks. At least here you're comparing Reality Distortion Field to Reality Distortion Field, and besides, all the performance/watt in the world doesn't do you a whole hell of a lot of good if your machine's out of puff.)

So let's go ahead and just take all that as given for discussion purposes. My beef with that comment is it conveniently ignores every other PowerPC chip before the Intel transition just to make the point. For example, PC Magazine back in the day noted that a 400MHz Yosemite G3 outperformed a contemporary 450MHz Pentium II on most of their tests (read it for yourself, April 20, 1999, page 53). The G3, which doesn't have SIMD of any kind, even beat the P2 running MMX code. For that matter, a 350MHz 604e was over twice as fast at integer performance than a 300MHz P2. I point all of this out not (necessarily) to go opening old wounds but to remind those ignorant of computing history that there was a time in "Apple's PowerPC days" when even the architecture's detractors will admit it was at least competitive. That time clearly wasn't when the rot later set in, but he certainly doesn't make that distinction.

To be sure, was this the point of his article? Not really, since he was more addressing ARM rather than PowerPC, but it is sort of. Thom asserts in his exchange with Grüber Alles that Apple and those within the RDF cherrypick benchmarks to favour what suits them, which is absolutely true and I just did it myself, but Apple isn't any different than anyone else in that regard (put away the "tu quoque" please) and Apple did this as much in the Power Mac days to sell widgets as they do now in the iOS ones. For that matter, Thom himself backtracks near the end and says, "there is one reason why benchmarks of Apple's latest mobile processors are quite interesting: Apple's inevitable upcoming laptop and desktop switchover to its own processors." For the record I see this as highly unlikely due to the Intel Mac's frequent use as a client virtual machine host, though it's interesting to speculate. But the rise of the A-series is hardly comparable with Apple's PowerPC days at all, at least not as a monolithic unit. If he had compared the benchmark situation with when the PowerPC roadmap was running out of gas in the 2004-5 timeframe, by which time even boosters like yours truly would have conceded the gap was widening but Apple relentlessly ginned up evidence otherwise, I think I'd have grudgingly concurred. And maybe that's actually what he meant. However, what he wrote lumps everything from the 601 to the 970MP into a single throwaway comment, is baffling from someone who also uses and admires Mac OS 9 (as I do), and dilutes his core argument. Something like that I'd expect from the breezy mainstream computer media types. Thom, however, should know better.

(On a related note, Ars Technica was a lot better when they were more tech and less politics.)

Next up: updates to our custom gdb debugger and a maintenance update for TenFourFoxBox. Stay tuned and in the meantime try it and see if you like it. Post your comments, and, once you've played a few videos or six, what you think the default should be for 45.5 (regular VP8 video or MSE/VP9).

9 comments:

  1. All this looks very good, except a minor thing that's probably just a bug: The new PDF tick in the preferences vanishes at every restart even though PDF display is still enabled in about:config. Toggling true/false in about:config isn't reflected in the GUI preferences.

    ReplyDelete
  2. (On a related note, Ars Technica was a lot better when they were more tech and less politics.)

    As an aside, I definitely agree. But, hey, Ars is owned by Conde Nast now, so whaddya expect, eh?

    ReplyDelete
  3. Non of the Media Streams worked on my MDD DP 1.42, but they did now show up under Download Helper - for the first time, QT will play them great. So very happy with this arrangement. Is strange, the PDF view is working great, but will not allow the box to enable/disable it to be checked. Not a big deal, but curious.

    ReplyDelete
  4. PS the normal 'Captcha-Hell' of posting here is so much better now!!!!

    ReplyDelete
  5. About the PPC/Intel switch - there is a very definite Mandela effect at work around PPC in mainstream computing.

    Early on, x86 SUCKED. A Pentium 2 or 3 were GARBAGE. My first computer had a Pentium 3 500MHz with 128MB or RAM and it was all I could do to handle 11"x14", 200dpi digital images (3 or more layers - caused it to crawl). Now those are low specs, but as soon as I started using Blue&White G3s at my college, could handle images 4X as big like nothing. Even Pentium 4s are honestly - just not that great.

    Now x86 did get a lot better (Core Solos notwithstanding), and the Macbook made a lot of sense. BUT they run at the boiling-point of water, and use a LOT of energy. RISC got smarter (let's not forget because it often bit-swaps and runs x86 code as RISC code), but RISC also got a lot better too. ARM is really energy efficient ultra-scalable and CHEAP.

    A major game site was recently lamenting the Playstation 3's Cell-Processors as being inferior from a development standpoint to the XBox 360's Xenon (PC ;0) processors, without realizing that they are nearly the same architecture (M.S. basically simplified and stole SONY's idea). Just cause Xenon and Xeon sound similar, doesn't mean they were.

    As for Apple staying x86 for the sake of VirtualBox, I wonder if it will even matter. Graphics cards do half the work these days and pretty soon a CPU will seem like a 'Quaint' concept.

    BTW does anyone else remember that interview with Jobs right after the switch when he admitted the main reason for the switch was that he was tired of having to explain why his chips with lower clock-speeds were better? That really range true at the time. Apple is a total whore to appearances.

    ReplyDelete
    Replies
    1. I would argue the Pentium 4s based on the Netburst architecture were the real low point for the x86 line. The P6 architecture by comparison saw service not only in the Pentium 2 and 3, but was then used in the Pentium M and was tapped for Netburst's successor in the Core series, right up until Nehalem came along and Intel started adding i's to all their product names. It can't be too bad, if iterations of the P6 architecture managed to prove superior to alternatives for 13 years, and even found its way into nearly 6 years of Apple computers!

      Delete
  6. I'm gonna guess that Dr. Kaiser didn't agree with this video about the Power Mac G5 either... https://www.youtube.com/watch?v=6SqYMU81l8Y

    ReplyDelete

Due to an increased frequency of spam, comments are now subject to moderation.