Saturday, February 4, 2017

An experiment: introducing the PopOutPlayer (plus: microbenchmarking PowerPC SPRs)

TenFourFox 45.8 is coming along -- Operation Short Change (my initiative to rework IonMonkey to properly emit single branch instructions instead of entire branch stanzas when the target is guaranteed to be in range) is smoothing out the browser quite a bit but there are hundreds of places that need to be changed. I've finished the macroassembler and the regular expression engine (and improved both of them in a couple other minor ways), and now I'm working on the Ion code generator. This mostly benefits systems that cannot fit generated code into cache, so while the Quad G5 with its (comparatively large) 1MB per core has only improved somewhere on the order of about 5%, I expect more on low-end systems because hot code is now more likely to stay in the i-cache. I am also planning to put user agent support in 45.8, and there are a few other custodial improvements. The idea is to release that to beta testing around Valentine's Day so I don't have to buy you lot any flowers.

A long while back I mentioned a couple secret projects; one of them was SandboxSafari, and the other was intended as a followup to the now obsolete MacTubes Enabler. That second project got backburnered for awhile, but since I'm noticing changes that may obsolete the work I've done on it as well, I'm going to push it out the door in its current state. Without further ado, let's introduce the PopOutPlayer.

This is not Photoshop. There is, just as you see, a floating video window playing the Vimeo movie that TenFourFox cannot (no built-in H.264 codec). In fact, I think this is the only way you can play Vimeo movies on Tiger now -- neither Safari nor OmniWeb work either.

Yes, it works fine with YouTube also:

Drop the application in your apps folder, drop the PopOutPlayer Enabler addon on TenFourFox, and then on any Vimeo (10.4 only) or YouTube (10.4 or 10.5) page or embedded video, right-click or SHIFT-right-click and select Pop Out Video. The video floats in its own window on top of other windows. You can close the browser tab and go look at something else while the video plays; the playback is a completely independent process. When you open another video, the window pops up in the same location where you left it. On multiprocessor Power Macs it can even get scheduled on another core for even smoother playback.

Sounds great, right? Well, there's a few reasons why I hadn't released this earlier.

First, the app itself is, and I'm actually feeling ill writing this, based on these sites' Flash players. Flash was the only reliable way to do playback and was even more performant than native H.264 in WebKit, and it also had substantially fewer user interface problems at the risk of sometimes being crashy. Flash is no safer now than it was before I condemned it from the browser, but I've learned a few lessons from SandboxSafari, of which the PopOutPlayer is actually a descendant. To remember its window location means we have to run the app with your uid (which eliminates SandboxSafari's primary means of protection), but we also have the advantage of only needing to support at most two specific video player applets, so we can design a very restrictive environment that protects the app from being subverted and rejects running video or Flash applets from other sources.

This different type of sandbox implements other restrictions, most notably preventing the applet from going full-screen. This is necessity reborn as virtue because most of our systems would not do well with full-screen playback, let alone HD (which is also blocked/unsupported), and prevents a subverted player from monkeying with the rest of your screen. The sandbox doesn't let URLs open from the applet either, and has its own Mach exception handler and CGEventTap to filter other possible avenues of exploit. However, that also means that you can't do many of the things you would expect to do if the Flash player were embedded in the browser. That won't change.

The window is fixed-size and floats. Allowing the window to resize caused lots of problems because the applets didn't expect to deal with that possibility. You get one size no matter how big your screen is, and you can only close it or move it.

Although the PopOutPlayer can play more YouTube videos than the QuickTime Enabler, there are still many it cannot play, though at least the Flash player will give you some explanation (the Vevo ones are the most notorious). The PopOutPlayer also isn't intended for generic HTML5 video playback; it doesn't replace the QTE in that sense, and the QTE is still the official video solution in general. The application will reject passing it URLs from other video sites.

The really big limitation, however, is that I could not get the Vimeo applet to run on 10.5 using the hacks I devised for 10.4. Leopard WebKit can play some Vimeo videos, so all is not lost, but no matter what I tried the PopOutPlayer simply wouldn't display any video itself. For the time being, Vimeo URLs on Leopard will generate an "Unsupported video URL" message. It is quite possible this might never be able to be fixed with the current method the PopOutPlayer uses for display, so don't expect it will necessarily be repaired in the future. For that matter, Vimeo on-demand doesn't even work with 10.4.

I consider the PopOutPlayer to be highly experimental, and when (not if) Vimeo and YouTube decommission their Flash players, it will abruptly cease to work without warning. But because I expect that time is coming sooner and not later you are welcome to use the PopOutPlayer for as long as it benefits you, and if I can solve some of these issues I might even make it a supported option in the future -- just don't hold your breath.

Download it here. It is unsupported. Source code is not currently available.

So back to TenFourFox. If you would, permit me now to indulge in some gratuitous nerderosity. Part of Operation Short Change was also to explore whether our branch stanza far calls could be made more efficient. Currently, if the target of a branch instruction exceeds the displacement the branch instruction (i.e., b(l) or bc(l)) can encode, we load the target into a general purpose register (GPR), transfer it to the counter register (a special purpose register, or SPR), and branch to that (i.e., lis/ori/mtctr/b(c)ctr(l)). The PowerPC ISA does not allow directly branching to a GPR or FPR, only to the counter register (CTR) and the link register (LR), which are both SPRs.

This would be all well and good except that the G5 groups instructions together, and IBM warns that there is a substantial execution penalty if mtctr and b(c)ctr(l) are in the same dispatch group. Since mtspr instructions like mtctr must always lead dispatch groups, the above stanza is guaranteed to put them both together (recall instruction dispatch groups are no more than four, or five with a branch, with branches being the last slot). Is it faster to insert nops and accept the code bloat? What about using the link register instead?

It's time for ... assembly language microbenchmarking!

#define REPS 0x4000
_main:
        .globl _main

        mflr r0
        stwu r0, -4(r1)

        li r3,0
        lis r5,REPS
        bl .+4
        ; the location of the following mflr is now in r4
        mflr r4
        addi r4, r4, 8
        ; now r4 points to the addi below

        addi r3,r3,1
        cmp cr0,r3,r5
        beq done
#if USE_LR
        mtlr r4
#if USE_NOPS
        nop
        nop
        nop
        nop
#endif
        blr
#else
        mtctr r4
#if USE_NOPS
        nop
        nop
        nop
        nop
#endif
        bctr
#endif

done:
        lwz r0, 0(r1)
        mtlr r0
        li r3,0
        addi r1, r1, 4
        blr
This just runs a tight loop 1,073,741,824 times, branching to the loop header with either LR or CTR, and with the mtspr instruction separated from the blr/bctr with sufficient nops to put them in separate dispatch groups or not (there must be four to prevent the branch from getting in the terminal branch slot). That gives us four variations to test with a loop so tight the cost of the branch should substantially weigh on total runtime. Let's see what we get. If you're following along on your own Power Mac, compile these like so:
gcc -o lrctr_ctr lrctr.s
gcc -DUSE_NOPS -o lrctr_ctrn lrctr.s
gcc -DUSE_LR -o lrctr_lr lrctr.s
gcc -DUSE_LR -DUSE_NOPS -o lrctr_lrn lrctr.s
If you want to confirm what was actually assembled, you can look at the result with otool -tV.

Our control will be our trusty 1.0GHz iMac G4 (256K L2 cache). There should be no difference between the SPRs, and it all fits into cache and there are no dispatch groups, so if we did this right the runtimes should be nearly identical. In this case we are only interested in the user CPU time (the first field).

luxojr% time ./lrctr_ctr
7.550u 0.077s 0:08.88 85.8%     0+0k 0+2io 0pf+0w
luxojr% time ./lrctr_ctrn
7.550u 0.071s 0:08.55 89.1%     0+0k 0+0io 0pf+0w
luxojr% time ./lrctr_lr
7.550u 0.070s 0:10.08 75.5%     0+0k 0+2io 0pf+0w
luxojr% time ./lrctr_lrn
7.550u 0.069s 0:08.43 90.2%     0+0k 0+0io 0pf+0w
Excellent. Let's run it on the Quad G5 (numbers in reduced performance mode).
bruce% time ./lrctr_ctr
4.298u 0.028s 0:04.33 99.5%     0+0k 0+2io 0pf+0w
bruce% time ./lrctr_ctrn
4.986u 0.035s 0:05.03 99.6%     0+0k 0+2io 0pf+0w
bruce% time ./lrctr_lr
13.755u 0.050s 0:13.82 99.8%    0+0k 0+1io 0pf+0w
bruce% time ./lrctr_lrn
13.752u 0.048s 0:13.82 99.7%    0+0k 0+2io 0pf+0w
Wait, what? Putting mtctr and bctr in the same dispatch group was actually the fastest of these four variations. Not only was using LR slower, it was over three times slower. Even spacing the two CTR instructions apart was marginally worse. Just to see if it was an artifact of throttling, I ran them again in highest performance. Same thing:
bruce% time ./lrctr_ctr
2.149u 0.012s 0:02.16 99.5%     0+0k 0+1io 0pf+0w
bruce% time ./lrctr_ctrn
2.492u 0.010s 0:02.50 100.0%    0+0k 0+0io 0pf+0w
bruce% time ./lrctr_lr
6.876u 0.013s 0:06.89 99.8%     0+0k 0+0io 0pf+0w
bruce% time ./lrctr_lrn
6.874u 0.017s 0:06.89 99.8%     0+0k 0+1io 0pf+0w

I found this so surprising I rewrote it for AIX and put it on my POWER6, which is also dispatch-group based and uses an evolved version of the same instruction pipeline as the POWER4 (from which the G5 is derived). And, well ...

uppsala% time ./lrctr_ctr
3.752u 0.001s 0:04.41 85.0%     0+1k 0+0io 0pf+0w
uppsala% time ./lrctr_ctrn
4.064u 0.001s 0:04.94 82.1%     0+1k 0+0io 0pf+0w
uppsala% time ./lrctr_lr
13.499u 0.001s 0:16.19 83.3%    0+1k 0+0io 0pf+0w
uppsala% time ./lrctr_lrn
13.215u 0.001s 0:15.66 84.3%    0+1k 0+0io 0pf+0w
Execution times should consider that I run this POWER6 throttled in ASMI to reduce power consumption and its microarchitectural differences, but the same relative run times hold. It's actually not faster to space the CTR and branch (that is, if there's nothing better you could be doing -- see below), and the CTR is the best SPR to use for branching on the G5 regardless of any penalty paid. It may well be that using LR fouls the CPU link cache and thus tanks runtime, but whatever the explanation, using it for far calls is clearly the worst option.

Now, as you may have guessed, I've deliberately presented a false choice here because all four of these options are patently pathological. The optimal instruction sequence would be to schedule some work between the mtctr and the bctr. We don't have much work to, uh, work with here, but here's one way.

#define REPS 0x4000
_main:
        .globl _main

        mflr r0
        stwu r0, -4(r1)

        li r3,0
        lis r5,REPS
        bl .+4
        ; the location of the following mflr is now in r4
        mflr r4
        addi r4, r4, 8
        ; now r4 points to the mtctr below

        mtctr r4
        addi r3,r3,1
        cmp cr0,r3,r5
        beq done
        bctr

done:
        lwz r0, 0(r1)
        mtlr r0
        li r3,0
        addi r1, r1, 4
        blr

bruce% gcc -o ctrop ctrop.s
bruce% time ./ctrop
4.299u 0.028s 0:04.34 99.3%     0+0k 0+1io 0pf+0w
Almost identical runtimes (in reduced mode), but the beq takes the branch slot away from the bctr, guaranteeing the SPR operations will be split into two dispatch groups without trying to space them with nops. But inexplicably, if you coalesce the beq/bctr simply into bnectr (which does occupy the same branch slot), you get even faster:
bruce% time ./ctrop2
3.824u 0.039s 0:04.02 95.7%     0+0k 0+0io 0pf+0w
Is this optimal on the G4 and POWER6 as well?
luxojr% time ./ctrop
6.472u 0.064s 0:12.05 54.1%     0+0k 0+2io 0pf+0w
luxojr% time ./ctrop2
6.471u 0.063s 0:09.33 69.9%     0+0k 0+0io 0pf+0w

uppsala% time ./ctrop
3.918u 0.001s 0:04.61 84.8%     0+1k 0+0io 0pf+0w
uppsala% time ./ctrop2
2.612u 0.001s 0:03.09 84.4%     0+1k 0+0io 0pf+0w
Yup, it is, though it's worth noting the G4 did not improve with the bnectr. (This is still pathological, mind you: the best instruction sequence would be simply addi/cmp/bne, which the G5 in reduced mode runs in 2.580u 0.029s and the POWER6 in 1.261u 0.001s, reclaiming its speed crown. But when you have a far call, you don't have a choice.)

The moral of the story? Don't fix what ain't broke.

12 comments:

  1. Hmmm... actually I've never seen flash player perform better at h.264 than QTKit/CoreAnimation (which is what Leopard WebKit uses). That might be because I've always had Perian installed.
    Flash player might also perform better when accelerated compositing is not available (because of the graphics driver not supporting CoreAnimation) or disabled for WebKit; then QTKit wouldn't be able to render to a CALayer (OpenGL) and would have to resort to using an NSView instead.
    Or did you mean Tiger's H.264 codec used via my upgraded Tiger WebKit? That codec is quite limited in comparison with Leopard's.

    Would you like to pass me some Vimeo video URLs that don't play in Leopard WebKit?
    I didn't find any during brief testing - and it did always play using QTKit.

    ReplyDelete
    Replies
    1. This is 10.4, so I couldn't say about performance on 10.5.

      The Vimeo URL above plays for a few seconds on the latest Leopard WebKit on my DLSD G4, and then resets. Others seem to work, though. It could be a temporary glitch.

      Delete
  2. Cool! Flash does still have its place (like for playing "Bionic Chainsaw Pogostick Gorilla"!!

    BUT Vimeo (end many others like Dailymotion, Comedy Central, and of course Youtube - even plays entire playlists now!) have been playable all this time in PPC Media Center (Excuse the shameless plug ;0).

    http://ppcluddite.blogspot.com/2016/06/new-ppc-media-center-version-6.html

    ReplyDelete
  3. PS, or if in Dash Format - DLable then playable (kinda a mixed bag with Vimeo).

    ReplyDelete
  4. BTW C.K. Just realized you can Symlink all of the cookies from FoxBoxes together, so if say a Facebook FoxBox links to a Youtube FoxBox, you can click though and be already logged in! I have 8 or so of them and they all share the same login-cookies now.

    ReplyDelete
  5. Would it be possible to use QuickTime to implement embedded html5 video? I know my system, a quad, is fully capable hardware-wise of playing BluRays(used one of the sata ports), so most html5 videos would be easy for it. Problem I see is the software side.

    ReplyDelete
    Replies
    1. Embedded? No, not without substantial hacks. Mozilla does use the AV Toolkit but only for decoding, and that won't work for 10.4.

      Outside of the browser, though, you've essentially described the QuickTime Enabler.

      Delete
    2. It used to be possible to use QuickTime codecs in Firefox by means of GStreamer - but that doesn't accelerate rendering and would be limitted to the codecs available for QuickTime.

      Delete
    3. Is GStreamer still possible and if so, would it support a hardware overlay? Or is what you describe as rendering? GStreamer supports ffmpeg as well. Wouldn't help everyone, but ffmpeg may be better optimized already than what mozilla shipped with.

      I know hardware decoding is not possible, but VLC, which uses ffmpeg decoders, can decode 1080p BluRay h264 video off disc if you have the keys and right libraries installed on my QUAD. Mine has a 7800GT btw. Shame the last supported VLC is 2.0.

      Delete
  6. Exciting stuff happening at macrumors.com PowerPC users forum, with Dronecatcher using a tenfourfoxbox of tonvid.com and youtube-dl to pass on video to mplayer via Applescript, with this method youtube playback on 10.4 and a 500mhz G3 is possible, with an incredible 30% CPU being used. A real programmer could probably make a more elegant mashup. Thread:

    https://forums.macrumors.com/threads/new-youtube-player-downloader-even-for-g3.2031523/

    Another user, Lastic, has compiled SMTube for OS X PPC, using xcode and macports on Leopard. The Tiger build is giving him fits, but he is not giving up.

    https://forums.macrumors.com/threads/new-youtube-player-downloader-even-for-g3.2031523/

    ReplyDelete
  7. I have sad news. My PowerMac G5 has finally died (or at least it's pretty much at that state). I keep getting these cryptic firmware messages on randomly on startup as well as random crashes. I think the motherboard may be going bad. I'm currently using a cheap pc laptop I got not too long ago to type this.

    ReplyDelete
  8. I'm also quite confident the abysmal performance of blr is due to messing up the link cache. That said, for PIC you need to calculate the current address of the program counter at function start, and the only way to do that on PPC is via a call (bl, or generic: bcl) instruction, except if you're using the ELFv2 abi.

    Since that would also mess up the link cache normally (a call without a return), they've reserved a special variant of the bcl instruction that does not update the link cache: set the BO field to 20 and the BI field to 31. So maybe something similar holds for the bclr instruction, in that it may not attempt to use, nor invalidate/pop, the link cache if you use that BO/BI combination. I've never tried this though.

    ReplyDelete

Due to an increased frequency of spam, comments are now subject to moderation.