Saturday, December 3, 2016

45.6.0b1 available, plus sampling processes for fun and profit

Test builds for TenFourFox 45.6.0 are available (downloads, hashes, release notes). The release notes say the definitive crash fix in Mozilla bug 1321357 (i.e., the definitive fix for the issue mitigated in 45.5.1) is in this build; it is not, but it will be in the final release candidate. 45.6.0 includes the removal of HiDPI support, which also allowed some graphical optimizations that particularly benefit the iMac G4; the expansion of the JavaScript JIT non-volatile general purpose register file; an image-heavy scrolling optimization I pulled down because it landed too late for the 45ESR cut; the removal of telemetry from user-facing chrome JS; and various minor fixes to the file requester code. Mozilla will also land an additional performance improvement in 45ESR as a needed prerequisite for another fix; that will appear in the final release as well. Look for the release candidate sometime next week, with release to the public late December 12 as usual, but for now, please test the improvements so far.

There is now apparently a potential workaround for those of you still having trouble getting the default search engine to stick. I still don't have a good theory for what's going on, however, so if you want to try the workaround please read my information request and post the requested information about your profile before and after to see if the suggested workaround affects that.

I will be in Australia for Christmas and New Year's visiting my wife's family, so additional development is likely to slow over the holidays. Higher priority items coming up will be implementing user agent support in the TenFourFox prefpane, adding some additional HTML5 features and possibly excising telemetry from garbage and cycle collection, but probably for 45.8 instead of 45.7. I'm also looking at adding some PowerPC-specialized code sections to the platform-independent Ion code generator to see if I can crank up JavaScript performance some more, and possibly some additional work to the AltiVec VP9 codec for VMX-accelerated intraframe prediction. I'm also considering adding AltiVec support to the Theora (VP3) decoder; even though its much lighter processing requirements yield adequate performance on most supported systems, it could be a way to get higher resolution video workable on lower-spec G4s.

One of the problems with our use of a substantially later toolchain is that (in particular) debugging symbols from later compilers are often gibberish to older profiling and analysis tools. This is why, for example, we have a customized gdb, or debugging at even a basic level wouldn't be possible. If you're really a masochist, go ahead and compile TenFourFox with the debug profile and then try to use a tool like sample or vmmap, or even Shark, to analyze it. If you're lucky, the tool will just freeze. If you're unlucky, your entire computer will freeze or go haywire. I can do performance analysis on a stripped release build, but this yields sample backtraces which are too general to be of any use. We need some way of getting samples off a debug build but not converting the addresses in the backtrace to function names until we can transfer the samples to our own tools that do understand these later debugging symbols.

Apple's open source policy is problematic -- they'll open source the stuff they have to, and you can get at some components like the kernel this way, but many deep dark corners are not documented and one of those is how tools like /usr/bin/sample and Shark get backtraces from other processes. I suspect this is so that they can keep the interfaces unstable and avoid abetting the development of applications that depend on any one particular implementation. But no one said I couldn't disassemble the damn thing. So let's go.

(NB: the below analysis is based on Tiger 10.4.11. It is possible, and even likely, the interface changed in Leopard 10.5.)

With Depeche Mode blaring on the G5, because Dave Gahan is good for debugging, let's look at /usr/bin/sample since it's a much smaller nut to crack than Shark.

% otool -L /usr/bin/sample
         /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation (compatibility version 300.0.0, current version 567.29.0)
         /System/Library/PrivateFrameworks/vmutils.framework/Versions/A/vmutils (compatibility version 1.0.0, current version 93.1.0)
         /usr/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
         /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 88.3.4)

Interesting! A private framework! Let's see what Objective-C calls we might get (which are conveniently text strings).

% strings /usr/bin/sample |& more
Not currently sampling -- exiting immediately.
Waiting for '%s' to appear...
%s appeared.
%s cannot find a process you have access to which has a name like '%s'
Sampling process %d each %u msecs %u times
syntax: sample <pid/partial name> <duration (secs)> { <msecs between samples> } <options>
options: {-mayDie} {-wait} {-subst <old> <new>}*
-file filename specifies where results should be written
-mayDie reads symbol information right away
-wait wait until the process named (usually by partial name) exists, then start sampling
-subst can be used to replace a stripped executable by another
Note that the program must have been started using a full path, rather than a relative path, for analysis to work, or that the -subst option must be specified
%s cannot examine process %d for unknown reasons, even though it appears to exist.
%s cannot examine process %d because the process does not exist.
%s cannot examine process %d (with name like %s) because it no longer appears to be running.
%s cannot examine process %d because you do not have appropriate privileges to examine it.
%s cannot examine process %d for unknown reasons.

Most of that looks fairly straightforward Objective-C stuff, but what's NSSampler? That's not documented anywhere. Magic Hat can't find it either with the default libraries, but it does if we add those private frameworks. If I use class-dump (3.1.2 works with 10.4), I can get a header file with its methods and object layout. (The header file it generates is usually better than Magic Hat's since Magic Hat sorts things in alphabetical rather than memory order, which will be problematic shortly.) Edited down, it looks like this. (I added the byte offsets, which are only valid for the 32-bit PowerPC OS X ABI.)

@interface NSSampler : NSObject

00 BOOL _stop;
04 BOOL _stopped;
08 unsigned int _task;
12 int _pid;
16 double _duration;
24 double _interval;
32 NSMutableArray *_sampleData;
36 NSMutableArray *_sampleTimes;
40 double _previousTime;
48 unsigned int _numberOfDataPoints;
52 double _sigma;
60 double _max;
68 unsigned int _sampleNumberForMax;
72 ImageSymbols *_imageSymbols;
76 NSDictionary *_symbolRichBinaryMappings;
80 BOOL _writeBadAddresses;
84 TaskMemoryCache *_tmc;
88 BOOL _stacksFixed;
92 BOOL _sampleSelf;
96 struct backtraceMagicNumbers _magicNumbers;

- (void) _cleanupStacks;
- (void) _initStatistics;
- (void) _makeHighPriority;
- (void) _makeTimeshare;
- (void) _runSampleThread: (id) parameter1;
- (void) dealloc;
- (void) finalize;
- (void) forceStop;
- (void) getStatistics: (void*) parameter1;
- (id) imageSymbols;
- (id) initWithPid: (int) parameter1;
- (id) initWithPid: (int) parameter1 symbolRichBinaries: (id) parameter2;
- (id) initWithSelf;
- (void) preloadSymbols;
- (void) printStatistics;
- (id) rawBacktraces;
- (void) sampleForDuration2: (double) parameter1 interval: (double) parameter2;
- (void) sampleForDuration: (unsigned int) parameter1 interval: (unsigned int) parameter2;
- (int) sampleTask;
- (void) setImageSymbols: (id) parameter1;
- (void) startSamplingWithInterval: (unsigned int) parameter1;
- (void) stopSampling;
- (id) stopSamplingAndReturnCallNode;
- (void) writeBozo;
- (void) writeOutput: (id) parameter1 append: (char) parameter2;


Okay, so now we know what methods are there. How does one call this thing? Let's move to the disassembler. I'll save you my initial trudging through the machine code and get right to the good stuff. I've annotated critical parts below from stepping through the code in the debugger.

% otool -tV /usr/bin/sample
(__TEXT,__text) section
00002aa4        or      r26,r1,r1        << enter
00002aa8        addi    r1,r1,0xfffc
00002aac        rlwinm  r1,r1,0,0,26
00002ab0        li      r0,0x0
00002ab4        stw     r0,0x0(r1)
00003260        b       0x3310
00003264        bl      0x3840  ; symbol stub for: _getgid
00003268        bl      0x37d0  ; symbol stub for: _setgid

This looks like something that's trying to get at a process. Let's see what's here.

0000326c        lis     r3,0x0
00003270        or      r4,r30,r30
00003274        addi    r3,r3,0x3b9c
00003278        or      r5,r29,r29
0000327c        or      r6,r26,r26
00003280        bl      0x37c0  ; symbol stub for: _printf$LDBL128 // "Sampling process ..."
00003284        lbz     r0,0x39(r1)
00003288        cmpwi   cr7,r0,0x1
0000328c        bne+    cr7,0x32a0 // jumps to 32a0
000032a0        lis     r4,0x0
000032a4        lwz     r3,0x0(r31)
000032a8        or      r5,r25,r25
000032ac        lwz     r4,0x5010(r4) // 0x399c "sampleForDuration:..."
000032b0        or      r6,r23,r23
000032b4        bl      0x3800  ; symbol stub for: _objc_msgSend
000032b8        lis     r4,0x0
000032bc        lwz     r3,0x0(r31)
000032c0        lwz     r4,0x500c(r4) // 0x946ba288 "stopSampling"
000032c4        bl      0x3800  ; symbol stub for: _objc_msgSend
000032c8        lis     r4,0x0
000032cc        lwz     r3,0x0(r31)
000032d0        lwz     r4,0x5008(r4) // 0x3978 "writeOutput:..."
000032d4        or      r5,r22,r22
000032d8        li      r6,0x0
000032dc        bl      0x3800  ; symbol stub for: _objc_msgSend

That seems simple enough. It seems to allocate and initialize an NSSampler object, (we assume) set it up with [sampler initWithPid], call [sampler sampleForDuration], call [sampler stopSampling] and then call [sampler writeOutput] to write out the result.

This is not what we want to do, however. What I didn't see in either the disassembly or the class description was an explicit step to convert addresses to symbols, which is what we want to avoid. We might well suspect -(void) writeOutput is doing that, and if we put together a simple-minded program to make these calls as sample does, we indeed get a freeze when we try to write the output. We want to get to the raw addresses instead, but Apple doesn't provide any getter for those tantalizing NSMutableArrays containing the sample data.

Unfortunately for Apple, class-dump gave us the structure of the NSSampler object (recall that Objective-C objects are really just structs with delusions of grandeur), and conveniently those object pointers are right there, so we can pull them out directly! Since they're just NSArrays, hopefully they're smart enough to display themselves. Let's see. (In the below, replace XXX with the process you wish to spy on.)

/* gcc -g -o samplemini samplemini.m \
    -F/System/Library/PrivateFrameworks \
    -framework Cocoa -framework CHUD \
    -framework vmutils -lobjc */

#include <Cocoa/Cocoa.h>
#include "NSSampler.h"

int main(int argc, char **argv) {
    NSSampler *sampler;
    NSMutableArray *sampleData;
    NSMutableArray *sampleTimes;
    uint32_t count, sampleAddr;
    NSAutoreleasePool *shutup = [[NSAutoreleasePool alloc] init];

    // allocate and initialize in one step rather than calling two initializers
    sampler = [[NSSampler alloc] initWithPid:XXX]; // you provide
    [sampler sampleForDuration:10 interval:10]; // 10 seconds, 10 msec
    [sampler stopSampling];

    // break into the NSSampler struct
    sampleAddr = (uint32_t)sampler;
    count = *(uint32_t *)(sampleAddr + 48);
    fprintf(stdout, "count = %i\n", count);
    sampleData = (NSMutableArray *)*(uint32_t *)(sampleAddr + 32);
    sampleTimes = (NSMutableArray *)*(uint32_t *)(sampleAddr + 36);
    fprintf(stdout, "%s", [[sampleData description] cString]);
    fprintf(stdout, "%s", [[sampleTimes description] cString]);

    [sampler release]; // release instead of calling -dealloc directly
    [shutup release];
    return 0;
}

Drumroll please.

count = 519
    <NSStackBacktrace: Thread 1503: 0x9000af48 0xefffdfd0 0x907de9ac 0x907de2b0 0x932bcb20 0x932bc1b4 0x932bc020 0x937a1734 0x937a13f8 0x06d53d3c 0x9379d93c 0x06d57bc8 0x07800f48 0x0785f004 0x0785f9cc 0x0785fd20 0x00004ed4 0x00001d5c 0x00001a60 0x9000ae9c 0xffffffe1 > ,
    <NSStackBacktrace: Thread 1603: 0x9002ec8c 0x00424b10 0x05069cb4 0x0504638c 0x050490e0 0x05056600 0x050532cc 0x9002b908 0x0506717c 0x0000016b > ,
    <NSStackBacktrace: Thread 1703: 0x9002bfc8 0x90030a7c 0x015a0b84 0x04d4d40c 0x015a1f18 0x9002b908 0x90030aac 0xffffffdb > ,
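
To move these dumps to another tool, each line has to be turned back into numbers. The following is a minimal sketch of a parser for the line format shown above; parse_backtrace is a made-up helper, not part of vmutils, and it makes no attempt to interpret what each value means.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse one line of the form
     <NSStackBacktrace: Thread 1703: 0x9002bfc8 0x90030a7c ... >
   Stores the thread number in *thread and up to max values in addrs.
   Returns the number of values stored, or -1 if the line does not match. */
static int parse_backtrace(const char *line, unsigned long *thread,
                           uint32_t *addrs, int max)
{
    static const char tag[] = "<NSStackBacktrace: Thread ";
    const char *p;
    char *end;
    int n = 0;

    assert(line != NULL && thread != NULL && addrs != NULL && max >= 0);

    p = strstr(line, tag);
    if (p == NULL)
        return -1;
    p += sizeof tag - 1;

    /* thread number, followed by a colon */
    *thread = strtoul(p, &end, 10);
    if (end == p || *end != ':')
        return -1;
    p = end + 1;

    /* hexadecimal values until the closing '>' or end of line */
    while (n < max) {
        unsigned long value;

        while (*p == ' ' || *p == '\t')
            p++;
        if (*p == '\0' || *p == '>')
            break;
        value = strtoul(p, &end, 16);   /* base 16 accepts the 0x prefix */
        if (end == p)
            break;
        addrs[n++] = (uint32_t)value;
        p = end;
    }
    return n;
}
```

Fed the Thread 1703 line above, this should yield eight values starting with 0x9002bfc8. Lines that are not backtraces, such as the count line, return -1 and can be skipped.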

We now have the raw backtraces and the timings, in fractions of a second. There is obviously much more we can do with this, and subsequent to my first experiment I improved the process further, but this suffices for explaining the basic notion. In a future post we'll look at how we can turn those addresses into actual useful function names, mostly because I have a very hacky setup to do so right now and I want to refine it a bit more. :) The basic notion is to get the map of where dyld loaded each library in memory and then compute which function is running based on that offset from the sampled address. /usr/bin/vmmap would normally be the tool we'd employ to do this, but it barfs on TenFourFox too. Fortunately our custom gdb7 can get such a map, at least on a running process. More on that later.
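
The lookup itself can be sketched in a few lines. This is not the hacky setup mentioned above, just an illustration of the idea under assumptions: the load map is taken to be a list of address ranges, each image's symbols are taken to be offsets from its load address sorted in ascending order, and the types image_range and symbol and both functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One loaded image occupying [start, end) in the sampled process. */
struct image_range {
    const char *name;
    uint32_t start;
    uint32_t end;      /* exclusive */
};

/* One function symbol, as an offset from its image's load address. */
struct symbol {
    const char *name;
    uint32_t offset;
};

/* Return the image whose range contains addr, or NULL if none does. */
static const struct image_range *
find_image(const struct image_range *images, size_t count, uint32_t addr)
{
    size_t i;

    assert(images != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (addr >= images[i].start && addr < images[i].end)
            return &images[i];
    }
    return NULL;
}

/* Return the symbol with the greatest offset not exceeding the target,
   or NULL if every symbol starts after it. syms must be sorted ascending. */
static const struct symbol *
find_symbol(const struct symbol *syms, size_t count, uint32_t offset)
{
    const struct symbol *best = NULL;
    size_t lo = 0, hi = count;

    assert(syms != NULL || count == 0);
    while (lo < hi) {                      /* binary search */
        size_t mid = lo + (hi - lo) / 2;

        if (syms[mid].offset <= offset) {
            best = &syms[mid];
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return best;
}
```

For a sampled address, find_image picks the library, the address minus that image's start gives the offset, and find_symbol names the function that offset falls in. A real symbol table would also need each function's length to reject offsets that land in padding or stripped code.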

One limitation is that NSSampler doesn't seem able to get samples more frequently than every 15ms or so from a running TenFourFox process even if you ask. I'm not sure yet why this is because other processes have substantially less overhead, though it could be thread-related. Also, even though NSSampler accepts an interval argument, it will grab samples as fast as it can no matter what that interval is. When run against Magic Hat as a test it grabbed them as fast as 0.1ms, so stand by for lots of data!
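
The effective sampling rate can be measured from the recorded times rather than guessed. A minimal sketch, assuming the entries of the times array have already been copied out as ascending timestamps in seconds (whether NSSampler stores absolute times or something else is not confirmed here); gap_stats and sample_gaps are made-up names.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest, largest and mean gap between consecutive samples, in seconds. */
struct gap_stats {
    double min;
    double max;
    double mean;
};

/* Compute gap statistics over ascending timestamps.
   Returns 0 on success, or -1 if there are fewer than two samples. */
static int sample_gaps(const double *times, size_t count,
                       struct gap_stats *out)
{
    double total = 0.0;
    size_t i;

    assert(times != NULL && out != NULL);
    if (count < 2)
        return -1;

    out->min = out->max = times[1] - times[0];
    for (i = 1; i < count; i++) {
        double gap = times[i] - times[i - 1];

        if (gap < out->min)
            out->min = gap;
        if (gap > out->max)
            out->max = gap;
        total += gap;
    }
    out->mean = total / (double)(count - 1);
    return 0;
}
```

A mean gap near 0.015 would match the 15ms floor seen with TenFourFox, and a minimum near 0.0001 would match the Magic Hat test.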

Incidentally, this is apparently not what Shark does; Shark uses the later PerfTool framework and an object called PTSampler to do its work instead of vmutils. Although it has analogous methods, the structure of PTSampler is rather more complex than NSSampler and I haven't fully explored its depths. Nevertheless, when it works, Shark can get much more granular samples of processor activity than NSSampler, so it might be worth looking into for a future iteration of this tool. For now, I can finally get backtraces I can work with, and as a result, hopefully some very tricky problems in TenFourFox might actually get solved in the near future.

Thursday, December 1, 2016

45.5.1 available, and 32-bit Intel Macs go Tier-3

Test builds for 45.5.1, with the single change being the safety fix for the Firefox 0-day in bug 1321066 (CVE-2016-9079), are now available. Release notes and hashes to follow when I'm back from my business trip late tonight. I will probably go live on this around the same time, so please test as soon as you can.

In other news, the announcement below was inevitable after Mozilla dropped support for 10.6 through 10.8, but for the record (from BDS):

As of Firefox 53, we are intending to switch Firefox on mac from a universal x86/x86-64 build to a single-architecture x86-64 build.

To simplify the build system and enable other optimizations, we are planning on removing support for universal mac build from the Mozilla build system.

The Mozilla build and test infrastructure will only be testing the x86-64 codepaths on mac. However, we are willing to keep the x86 build configuration supported as a community-supported (tier 3) build configuration, if there is somebody willing to step forward and volunteer as the maintainer for the port. The maintainer's responsibility is to periodically build the tree and make sure it continues to run.

Please contact me directly (not on the list) if you are interested in volunteering. If I do not hear from a volunteer by 23-December, the Mozilla project will consider the Mac-x86 build officially unmaintained.

The precipitating event for this is the end of NPAPI plugin support (see? TenFourFox was ahead of the curve!), except, annoyingly, Flash, with Firefox 52. The only major reason 32-bit Mac Firefox builds weren't ended with the removal of 10.6 support (10.6 being the last version of Mac OS X that could run on a 32-bit Intel Mac) was for those 64-bit Macs that had to run a 32-bit plugin. Since no plugins but Flash are supported anymore, and Flash has been 64-bit for some time, that's the end of that.

Currently we, as OS X/ppc, are a Tier-3 configuration also, at least for as long as we maintain source parity with 45ESR. Mozilla has generally been careful not to intentionally break TenFourFox, and the situation with 32-bit x86 would probably be easier than ours. That said, candidly I can only think of two non-exclusive circumstances where maintaining the 32-bit Intel Mac build would be advantageous, and they're both bigger tasks than simply building the browser for 32 bits:

  • You still have to run a 32-bit plugin like Silverlight. In that case, you'd also need to undo the NPAPI plugin block (see bug 1269807) and everything underlying it.
  • You have to run Firefox on a 32-bit Mac. As a practical matter this would essentially mean maintaining support for 10.6 as well, roughly option 4 when we discussed this in a prior blog post with the added complexity of having to pull the legacy Snow Leopard support forward over a complete ESR cycle. This is non-trivial, but hey, we've done just that over six ESR cycles, although we had the advantage of being able to do so incrementally.

I'm happy to advise anyone who wants to take this on but it's not something you'll see coming from me. If you decide you'd like to try, contact Benjamin directly (his first name, smedbergs, us).

Tuesday, November 29, 2016

45.5.1 chemspill imminent

The plan was to get you a test build of TenFourFox 45.6.0 this weekend, but instead you're going to get a chemspill for 45.5.1 to fix an urgent 0-day exploit in Firefox which is already in the wild. Interestingly, the attack method is very similar to the one the FBI infamously used to deanonymise Tor users in 2013, which is a reminder that any backdoor the "good guys" can sneak through, the "bad guys" can too.

TenFourFox is technically vulnerable to the flaw, but the current implementation is x86-based and tries to attack a Windows DLL, so as written it will merely crash our PowerPC systems. In fact, without giving anything away about the underlying problem, our hybrid-endian JavaScript engine actually reduces our exposure surface further because even a PowerPC-specific exploit would require substantial modification to compromise TenFourFox in the same way. That said, we will still implement the temporary safety fix as well. The bug is a very old one, going back to at least Firefox 4.

Meanwhile, 45.6 is going to be scaled back a little. I was able to remove telemetry from the entire browser (along with its dependencies), and it certainly was snappier in some sections, but required wholesale changes to just about everything to dig it out and this is going to hurt keeping up with the ESR repository. Changes this extensive are also very likely to introduce subtle bugs. (A reminder that telemetry is disabled in TenFourFox, so your data is never transmitted, but it does accumulate internal counters and while it is rarely on a hot codepath there is still non-zero overhead having it around.) I still want to do this but probably after feature parity, so 45.6 has a smaller change where telemetry is instead only removed from user-facing chrome JavaScript. This doesn't help as much but it's a much less invasive change while we're still on source parity with 45ESR.

Also, tests with the "non-volatile" part of IonPower-NVLE showed that switching to all, or mostly, non-volatile registers in the JavaScript JIT compiler had no obvious impact on most benchmarks and occasionally was a small negative. Even changing the register allocator to simply favour non-volatile registers, without removing volatiles, had some small regressions. As it turns out, Ion actually looks pretty efficient with saving volatile registers prior to calls after all and the overhead of having to save non-volatile registers upon entry apparently overwhelms any tiny benefit of using them. However, as a holdover from my plans for NVLE, we've been saving three more non-volatile general purpose registers than we allow the allocator to use; since we're paying the overhead to use them already, I added those unused registers to the allocator and this got us around 1-2% benefit with no regression. That will ship with 45.6 and that's going to be the extent of the NVLE project.

On the plus side, however, 45.6 does have HiDPI support completely removed (because no 10.6-compatible system has a retina display, let alone any Power Mac), which makes the widget code substantially simpler in some sections, and has a couple other minor performance improvements, mostly to scrolling on image-heavy pages, and interface fixes. I also have primitive performance sampling working, which is useful because of a JavaScript interpreter infinite loop I discovered on a couple sites in the wild (and may be the cause of the over-recursion problems I've seen other places). Although it's likely Mozilla's bug and our JIT is not currently implicated, it's probably an endian issue since it doesn't occur on any Tier-1 platform; fortunately, the rough sampler I threw together was successfully able to get a sensible callstack that pointed to the actual problem, proving its functionality. We've been shipping this bug since at least TenFourFox 38, so if I don't have a fix in time it won't hold the release, but I want to resolve it as soon as possible to see if it fixes anything else. I'll talk about my adventures with the mysterious NSSampler in a future post soonish.

Watch for 45.5.1 over the weekend, and 45.6 beta probably next week.

Saturday, November 12, 2016

45.5.0 final available

The final release of TenFourFox 45.5.0 (downloads, hashish, er, hashes, release notes) is available. Pretty much everything made it, including the hybrid-endian JavaScript engine (the LE portion of IonPower-NVLE), the AltiVec VP9 IDCT/IADST/IHT transformations, the MP3 refactoring and the new custom in-browser prefpane. There is also a fix for PostScript-based font blocking which apparently glitched in 45. Assuming all goes well and there are no major regressions, this will go live either late Sunday or early Monday due to a planned power outage which will affect Floodgap on Tuesday.

Meanwhile, I still don't have a good understanding of what's wrong with Amazon Music (still works great in 38.10), nor the issue with some users being unable to make changes to their default search engine stick. This is the problem with a single developer, folks: what I can't replicate I can't repair. I have a couple other theories in that thread for people to respond to.

Next up will be actually ripping some code out for a change. I'm planning to completely eviscerate telemetry support since we have no infrastructure to manage it and it's wasted code, as well as retina Mac support, since no retina Mac can run 10.6. I don't anticipate these being major speed boosts but they'll help and they'll make the browser smaller. Since we don't have to maintain good compatibility with Mozilla source code anymore I have some additional freedom to do bigger surgeries like these. I'll also make a first cut at the non-volatile portion of IonPower-NVLE by making floating point registers in play non-volatile (except for the volatiles like f1 that the ABI requires to be live also); again, not a big boost, but it will definitely reduce stack pressure and should improve the performance of ABI-compliant calls. User agent switching and possibly some more AltiVec VP9 work are also on the table, but may not make 45.6.

The other thing that needs to be done is restoring our ability to do performance analysis because Shark and Sample on 10.4 freak out trying to resolve symbols from these much more recent gcc builds. The solution would seem to be a way to get program counter samples without resolving them, and then give that to a tool like addr2line or even gdb7 itself to do the symbol resolution instead, but I can't find a way to make either Shark or Sample not resolve symbols. Right now I'm disassembling /usr/bin/sample (since Apple apparently doesn't offer the source code for it) to see how it gets those samples and it seems to reference a mysterious NSSampler in the CHUD VM tools private framework. Magic Hat can dump the class but the trick is how to work with it and which selectors it will allow. More on that later.

Tuesday, November 8, 2016

Happy 6th birthday, TenFourFox

Today's the American election and no matter what, some of you are going to be delighted, some of you are going to be disappointed, and a few of you are going to be really steamed. But no matter what your perspective, we can all agree it's a good thing today is the sixth anniversary of TenFourFox's first beta release, 4.0b7, on the 8th of November 2010. Yes, we're six years old today! And by golly, we act like it!

Hail to the Chief!

Tuesday, November 1, 2016

Debian drops powerpc

This blog isn't generally or at least currently concerned with Linux/ppc happenings, primarily because that isn't my personal area of expertise and I use OS X/MacOS on almost all of my Power Macs, but this is rather major news that needs to be distributed a little more widely: Debian is dropping big-endian ppc and ppc64 in Stretch/Debian 9. The decision seems to be based on insufficient port maintainers. Because Ubuntu is based on Debian, the same should be expected in their next release, as well as any of the other Debian derivatives.

Power Architecture is not going away from Debian; they will still support little-endian 64-bit PowerPC, better known as ppc64el, and coincidentally the same architecture as the Raptor Talos which will run a wide choice of Linux distributions and probably some *BSDs. (It's up to $228K of $3.7M. I'm already in for a board and CPU. Back now!) And do note that Debian has stopped support for current architectures before like SPARC, so such a move is hardly unprecedented. But this is really bad news for our PowerPC friends in Amiga-land, and may seriously disturb the viability of boutique systems such as the AmigaOne X5000 -- AmigaOS is a lovely OS but it lacks the range of Linux on that hardware, and Debian was one of the lowest barriers to entry.

This doesn't mean some insane energetic freak like me couldn't take Debian/ppc and keep it rolling, but they'll have to step up now and a lot of work is in store. If you want a supported Linux on Power, however, you're gonna have to go POWER8. Otherwise, it might be a good time to check out NetBSD.

Thursday, October 27, 2016

Apple desktop users screwed again

Geez, Tim. You could have at least phoned in a refresh for the mini. Instead, we get a TV app and software function keys. Apple must be using the Mac Pro cases as actual trash cans by now.

Siri, is Phil Schiller insane?

That's a comfort.

(*Also, since my wife and I both own 11" MacBook Airs and like them as much as I can realistically like an Intel Mac, we'll mourn their passing.)