TenFourFox 45.8 is coming along -- Operation Short Change (my initiative to rework IonMonkey to properly emit single branch instructions instead of entire branch stanzas when the target is guaranteed to be in range) is smoothing out the browser quite a bit but there are hundreds of places that need to be changed. I've finished the macroassembler and the regular expression engine (and improved both of them in a couple other minor ways), and now I'm working on the Ion code generator. This mostly benefits systems that cannot fit generated code into cache, so while the Quad G5 with its (comparatively large) 1MB per core has only improved somewhere on the order of about 5%, I expect more on low-end systems because hot code is now more likely to stay in the i-cache. I am also planning to put user agent support in 45.8, and there are a few other custodial improvements. The idea is to release that to beta testing around Valentine's Day so I don't have to buy you lot any flowers.
A long while back I mentioned a couple secret projects; one of them was SandboxSafari, and the other was intended as a followup to the now obsolete MacTubes Enabler. That second project got backburnered for awhile, but since I'm noticing changes that may obsolete the work I've done on it as well, I'm going to push it out the door in its current state. Without further ado, let's introduce the PopOutPlayer.
This is not Photoshop. There is, just as you see, a floating video window playing the Vimeo movie that TenFourFox cannot (no built-in H.264 codec). In fact, I think this is the only way you can play Vimeo movies on Tiger now -- neither Safari nor OmniWeb work either.
Yes, it works fine with YouTube also:
Drop the application in your apps folder, drop the PopOutPlayer Enabler addon on TenFourFox, and then on any Vimeo (10.4 only) or YouTube (10.4 or 10.5) page or embedded video, right-click or SHIFT-right-click and select Pop Out Video. The video floats in its own window on top of other windows. You can close the browser tab and go look at something else while the video plays; the playback is a completely independent process. When you open another video, the window pops up in the same location where you left it. On multiprocessor Power Macs it can even get scheduled on another core for even smoother playback.
Sounds great, right? Well, there's a few reasons why I hadn't released this earlier.
First, the app itself is, and I'm actually feeling ill writing this, based on these sites' Flash players. Flash was the only reliable way to do playback and was even more performant than native H.264 in WebKit, and it also had substantially fewer user interface problems at the risk of sometimes being crashy. Flash is no safer now than it was before I condemned it from the browser, but I've learned a few lessons from SandboxSafari, of which the PopOutPlayer is actually a descendant. To remember its window location means we have to run the app with your uid (which eliminates SandboxSafari's primary means of protection), but we also have the advantage of only needing to support at most two specific video player applets, so we can design a very restrictive environment that protects the app from being subverted and rejects running video or Flash applets from other sources.
This different type of sandbox implements other restrictions, most notably preventing the applet from going full-screen. This is necessity reborn as virtue because most of our systems would not do well with full-screen playback, let alone HD (which is also blocked/unsupported), and prevents a subverted player from monkeying with the rest of your screen. The sandbox doesn't let URLs open from the applet either, and has its own Mach exception handler and CGEventTap to filter other possible avenues of exploit. However, that also means that you can't do many of the things you would expect to do if the Flash player were embedded in the browser. That won't change.
The window is fixed-size and floats. Allowing the window to resize caused lots of problems because the applets didn't expect to deal with that possibility. You get one size no matter how big your screen is, and you can only close it or move it.
Although the PopOutPlayer can play more YouTube videos than the QuickTime Enabler, there are still many it cannot play, though at least the Flash player will give you some explanation (the Vevo ones are the most notorious). The PopOutPlayer also isn't intended for generic HTML5 video playback; it doesn't replace the QTE in that sense, and the QTE is still the official video solution in general. The application will reject passing it URLs from other video sites.
The really big limitation, however, is that I could not get the Vimeo applet to run on 10.5 using the hacks I devised for 10.4. Leopard WebKit can play some Vimeo videos, so all is not lost, but no matter what I tried the PopOutPlayer simply wouldn't display any video itself. For the time being, Vimeo URLs on Leopard will generate an "Unsupported video URL" message. It is quite possible this might never be able to be fixed with the current method the PopOutPlayer uses for display, so don't expect it will necessarily be repaired in the future. For that matter, Vimeo on-demand doesn't even work with 10.4.
I consider the PopOutPlayer to be highly experimental, and when (not if) Vimeo and YouTube decommission their Flash players, it will abruptly cease to work without warning. But because I expect that time is coming sooner and not later you are welcome to use the PopOutPlayer for as long as it benefits you, and if I can solve some of these issues I might even make it a supported option in the future -- just don't hold your breath.
Download it here. It is unsupported. Source code is not currently available.
So back to TenFourFox. If you would, permit me now to indulge in some gratuitous nerderosity. Part of Operation Short Change was also to explore whether our branch stanza far calls could be made more efficient. Currently, if the target of a branch instruction exceeds the displacement the branch instruction (i.e., b(l) or bc(l)) can encode, we load the target into a general purpose register (GPR), transfer it to the counter register (a special purpose register, or SPR), and branch to that (i.e., lis/ori/mtctr/b(c)ctr(l)). The PowerPC ISA does not allow directly branching to a GPR or FPR, only to the counter register (CTR) and the link register (LR), which are both SPRs.
This would be all well and good except that the G5 groups instructions together, and IBM warns that there is a substantial execution penalty if mtctr and b(c)ctr(l) are in the same dispatch group. Since mtspr instructions like mtctr must always lead dispatch groups, the above stanza is guaranteed to put them both together (recall instruction dispatch groups are no more than four, or five with a branch, with branches being the last slot). Is it faster to insert nops and accept the code bloat? What about using the link register instead?
It's time for ... assembly language microbenchmarking!
#define REPS 0x4000
_main:
.globl _main
mflr r0
stwu r0, -4(r1)
li r3,0
lis r5,REPS
bl .+4
; the location of the following mflr is now in r4
mflr r4
addi r4, r4, 8
; now r4 points to the addi below
addi r3,r3,1
cmp cr0,r3,r5
beq done
#if USE_LR
mtlr r4
#if USE_NOPS
nop
nop
nop
nop
#endif
blr
#else
mtctr r4
#if USE_NOPS
nop
nop
nop
nop
#endif
bctr
#endif
done:
lwz r0, 0(r1)
mtlr r0
li r3,0
addi r1, r1, 4
blr
This just runs a tight loop 1,073,741,824 times, branching to the loop header with either LR or CTR, and with the
mtspr instruction separated from the
blr/bctr with sufficient
nops to put them in separate dispatch groups or not (there must be four to prevent the branch from getting in the terminal branch slot). That gives us four variations to test with a loop so tight the cost of the branch should substantially weigh on total runtime. Let's see what we get. If you're following along on your own Power Mac, compile these like so:
gcc -o lrctr_ctr lrctr.s
gcc -DUSE_NOPS -o lrctr_ctrn lrctr.s
gcc -DUSE_LR -o lrctr_lr lrctr.s
gcc -DUSE_LR -DUSE_NOPS -o lrctr_lrn lrctr.s
If you want to confirm what was actually assembled, you can look at the result with
otool -tV.
Our control will be our trusty 1.0GHz iMac G4 (256K L2 cache). There should be no difference between the SPRs, and it all fits into cache and there are no dispatch groups, so if we did this right the runtimes should be nearly identical. In this case we are only interested in the user CPU time (the first field).
luxojr% time ./lrctr_ctr
7.550u 0.077s 0:08.88 85.8% 0+0k 0+2io 0pf+0w
luxojr% time ./lrctr_ctrn
7.550u 0.071s 0:08.55 89.1% 0+0k 0+0io 0pf+0w
luxojr% time ./lrctr_lr
7.550u 0.070s 0:10.08 75.5% 0+0k 0+2io 0pf+0w
luxojr% time ./lrctr_lrn
7.550u 0.069s 0:08.43 90.2% 0+0k 0+0io 0pf+0w
Excellent. Let's run it on the Quad G5 (numbers in reduced performance mode).
bruce% time ./lrctr_ctr
4.298u 0.028s 0:04.33 99.5% 0+0k 0+2io 0pf+0w
bruce% time ./lrctr_ctrn
4.986u 0.035s 0:05.03 99.6% 0+0k 0+2io 0pf+0w
bruce% time ./lrctr_lr
13.755u 0.050s 0:13.82 99.8% 0+0k 0+1io 0pf+0w
bruce% time ./lrctr_lrn
13.752u 0.048s 0:13.82 99.7% 0+0k 0+2io 0pf+0w
Wait, what? Putting
mtctr and
bctr in the same dispatch group was actually the
fastest of these four variations. Not only was using LR slower, it was over
three times slower. Even spacing the two CTR instructions apart was marginally worse. Just to see if it was an artifact of throttling, I ran them again in highest performance. Same thing:
bruce% time ./lrctr_ctr
2.149u 0.012s 0:02.16 99.5% 0+0k 0+1io 0pf+0w
bruce% time ./lrctr_ctrn
2.492u 0.010s 0:02.50 100.0% 0+0k 0+0io 0pf+0w
bruce% time ./lrctr_lr
6.876u 0.013s 0:06.89 99.8% 0+0k 0+0io 0pf+0w
bruce% time ./lrctr_lrn
6.874u 0.017s 0:06.89 99.8% 0+0k 0+1io 0pf+0w
I found this so surprising I rewrote it for AIX and put it on my POWER6, which is also dispatch-group based and uses an evolved version of the same instruction pipeline as the POWER4 (from which the G5 is derived). And, well ...
uppsala% time ./lrctr_ctr
3.752u 0.001s 0:04.41 85.0% 0+1k 0+0io 0pf+0w
uppsala% time ./lrctr_ctrn
4.064u 0.001s 0:04.94 82.1% 0+1k 0+0io 0pf+0w
uppsala% time ./lrctr_lr
13.499u 0.001s 0:16.19 83.3% 0+1k 0+0io 0pf+0w
uppsala% time ./lrctr_lrn
13.215u 0.001s 0:15.66 84.3% 0+1k 0+0io 0pf+0w
Execution times should consider that I run this POWER6 throttled in ASMI to reduce power consumption and its microarchitectural differences, but the same relative run times hold. It's actually
not faster to space the CTR and branch (that is, if there's nothing better you could be doing -- see below), and the CTR is the
best SPR to use for branching on the G5 regardless of any penalty paid. It may well be that using LR fouls the CPU link cache and thus tanks runtime, but whatever the explanation, using it for far calls is clearly the worst option.
Now, as you may have guessed, I've deliberately presented a false choice here because all four of these options are patently pathological. The optimal instruction sequence would be to schedule some work between the mtctr and the bctr. We don't have much work to, uh, work with here, but here's one way.
#define REPS 0x4000
_main:
.globl _main
mflr r0
stwu r0, -4(r1)
li r3,0
lis r5,REPS
bl .+4
; the location of the following mflr is now in r4
mflr r4
addi r4, r4, 8
; now r4 points to the mtctr below
mtctr r4
addi r3,r3,1
cmp cr0,r3,r5
beq done
bctr
done:
lwz r0, 0(r1)
mtlr r0
li r3,0
addi r1, r1, 4
blr
bruce% gcc -o ctrop ctrop.s
bruce% time ./ctrop
4.299u 0.028s 0:04.34 99.3% 0+0k 0+1io 0pf+0w
Almost identical runtimes (in reduced mode), but the
beq takes the branch slot away from the
bctr, guaranteeing the SPR operations will be split into two dispatch groups without trying to space them with
nops.
But inexplicably, if you coalesce the
beq/bctr simply into
bnectr (which
does occupy the same branch slot), you get even faster:
bruce% time ./ctrop2
3.824u 0.039s 0:04.02 95.7% 0+0k 0+0io 0pf+0w
Is this optimal on the G4 and POWER6 as well?
luxojr% time ./ctrop
6.472u 0.064s 0:12.05 54.1% 0+0k 0+2io 0pf+0w
luxojr% time ./ctrop2
6.471u 0.063s 0:09.33 69.9% 0+0k 0+0io 0pf+0w
uppsala% time ./ctrop
3.918u 0.001s 0:04.61 84.8% 0+1k 0+0io 0pf+0w
uppsala% time ./ctrop2
2.612u 0.001s 0:03.09 84.4% 0+1k 0+0io 0pf+0w
Yup, it is, though it's worth noting the G4 did not improve with the
bnectr. (This is still pathological, mind you: the best instruction sequence would be simply
addi/cmp/bne, which the G5 in reduced mode runs in
2.580u 0.029s and the POWER6 in
1.261u 0.001s, reclaiming its speed crown. But when you have a far call, you don't have a choice.)
The moral of the story? Don't fix what ain't broke.