Sunday, July 21, 2013

22.1: the backup strategy (hello, jemalloc)

First, a word on 24. BaselineCompiler is now to the point where it can generate code without asserting for simple scripts, which is no mean feat. The trampoline is verifiably wrong in at least two places where I can't understand what I was thinking when I wrote it, and there are several bugs in the code generator for branches that I'm working out, but we are at stage 2 (linking and building) and getting very close to stage 3 (compiling and running simple scripts). After that, the fourth and final stage is to get it to pass the test suite, and then to build the rest of the browser. That is still at least four to six weeks of work. Don't expect a beta anytime soon.

For some time on this blog I've been mentioning my attempts in issue 218 to get jemalloc, the higher-performance memory allocator "normal" Firefox uses, working on 10.4. Right now, we use the OS X system allocator, which when a lot of things are trying to compete for it, is like a whole bunch of people trying to suck a milkshake through a single straw: it's messy, it involves other people's saliva and it's slow. At the lowest level there's a lot of memory allocation and deallocation going on as bits of data move from one browser subdomain to another, so things that speed that process should speed the whole browser. And, mostly (see below), it does. But changing the way that a single application handles memory management is fraught with peril when the rest of the system is using a different one, as we found out in TenFourFox 21: it was noticeably faster, but it had several unconscionable showstoppers like, uh, cut and paste didn't work. That was a bummer. We can't ship a browser like that.

Because of the delays with 24, however, we need a backup strategy in case 17 runs off support while I'm trying to finish 24. Since 22 is closer to 24, it would be much easier to backport security updates to that than 17, and much less risky, and heaven forbid 24 doesn't work we'll need to use 22 to drop source parity. (23 is not an option; there's no methodjit in it, and BaselineCompiler is too new.)

Whatever bug was in 21 that prevented cut and paste and certain other widget operations from working was fixed somehow in 22 with the widget changes afoot for Australis, and now the browser basically works. But I mentioned there's a problem with one part of the system using one way to manage memory and us using another: when the two meet, they may not be compatible. Fortunately because Mozilla reinvents the wheel with XUL and handles much of its own interface and does a lot of its work independent of the operating system, this occurs in very few places, but the one that's there is a doozy: anywhere that we hand a pointer to a block of memory to the operating system that jemalloc allocated, but that the operating system is expected to ask jemalloc to free when it's no longer in use, we trip a serious OS X bug that hands the wrong pointer back to jemalloc. Mozilla analyzed this in bug 702250, where dragging and dropping large images brought the entire browser down. The operating system bug in question seems only partially fixed in 10.5, and as a result the level of crashes Mozilla observed caused them to turn off jemalloc entirely for 10.5 users in Firefox 10. We don't have that luxury if we want the level of performance we're going to need, so we either have to fix the problem or wallpaper it.
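
To make the failure mode concrete, here is a toy model in Python, not anything from the actual browser or from OS X. The class `ToyAllocator`, the function `release_via_os` and the 16-byte offset are all made up for illustration; the only thing taken from the description above is the shape of the bug: the OS is given a block by one allocator and hands a different pointer back when it is done.

```python
class ToyAllocator:
    """Toy stand-in for a memory allocator that tracks the blocks it owns."""

    def __init__(self, name, base):
        self.name = name
        self._next = base   # next fake address to hand out
        self._live = {}     # address -> size, for blocks currently allocated

    def malloc(self, size):
        addr = self._next
        self._next += size
        self._live[addr] = size
        return addr

    def free(self, addr):
        # A real allocator would corrupt its heap or crash here;
        # the toy raises so the failure is visible.
        if addr not in self._live:
            raise RuntimeError(f"{self.name}: free of unknown pointer {addr:#x}")
        del self._live[addr]


def release_via_os(allocator, addr, buggy):
    """Model the OS asking the owning allocator to free a block it was handed.

    With buggy=True the OS passes back a different address than the one it
    received (the +16 offset is hypothetical, chosen only to show the effect).
    """
    returned = addr + 16 if buggy else addr
    allocator.free(returned)
```

With `buggy=False` the block is released normally; with `buggy=True` the allocator is asked to free an address it never handed out, which in the toy is an exception and in a real process is a crash.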

Testing "22.1" (as I dub it) showed that it was really easy, mercifully in this case, to replicate the problem in 10.4. Bug 702250 not only affects 10.4, it's markedly worse: any image, not just big ones, that is dragged to the desktop or to other applications will crash the browser. (Interestingly, dragging and dropping URLs and text is fine.) However, because the Clipboard is already allocated when the browser starts, Firefox is able to transmit data to it without issue, so you can copy an image and then paste it into another application, or right click and save it to disk. Since those work, the simplest solution here was to disable image dragging since fully functional workarounds are easily available.

The second part is a little trickier. Certain components ask the operating system for icon images, such as the Applications pane in Preferences. When those icons are no longer needed, the OS frees them, and the browser dies, to such a point where in the original build of 22.1 simply opening that pane would crash the browser. It also manifests in the Downloads window -- as icons are scrolled off screen, the OS will poisonously free those too, and bang. In this test version, icons are simply turned off. However, I don't think that's very nice or polished, but we can reinvent the wheel ourselves: create a new icon service that uses internal copies of icon images rather than asking the OS for them. It won't cover every file type, but it will work, and because the image lifetime is internal to TenFourFox the browser will not crash. (Other icons like favicons and browser chrome are similarly unaffected.)
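
As a rough sketch of what such an internal icon service could look like (in Python for brevity; the real thing would live in the browser's own code, and every name here, including `InternalIconService`, `icon_for` and `GENERIC_ICON`, is hypothetical): icons are copied into a table the browser owns, lookups go by file extension, and anything unknown gets a generic icon instead of a call out to the OS.

```python
import os

# Placeholder bytes standing in for a bundled generic document icon (assumption).
GENERIC_ICON = b"<generic document icon>"


class InternalIconService:
    """Serves icons from copies owned by the application, never from the OS."""

    def __init__(self, bundled_icons, fallback=GENERIC_ICON):
        # Copy every image so its lifetime is controlled entirely by this object.
        self._icons = {ext.lower(): bytes(data) for ext, data in bundled_icons.items()}
        self._fallback = bytes(fallback)

    def icon_for(self, filename):
        # splitext returns the extension with its leading dot, or "" if none.
        ext = os.path.splitext(filename)[1][1:].lower()
        # Unknown or missing extensions get the generic icon.
        return self._icons.get(ext, self._fallback)
```

For example, `InternalIconService({"pdf": pdf_bytes}).icon_for("report.PDF")` would return the bundled PDF icon, while a file with no extension or an unrecognized one falls back to the generic icon. That is the "won't cover every file type, but it will work" trade-off.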

jemalloc is not a panacea. It specifically speeds up operations where a lot of little bitty allocations need to be done repeatedly, such as very complex layouts or pages with a lot of widgets. However, it has a strange interaction I have not yet figured out with CoreGraphics: painting is slower, and painting big areas (particularly areas where the browser has to transform in software) is really slow. For most sites, the effect of the former greatly outweighs the effect of the latter, thus a net benefit, and on some pages very much so, but on those few that have relatively simplified layout but require lots of graphics painting jemalloc is slower. The best example I can find is my rat bastard electric company, Southern California Edison, where their new site design drags every time that damn entire window-full of smiling happy people who love to pay their confiscatory electricity rates advances to the next slide. It was bad on the old version, but it's worse now. It's also sluggish on my 10.6 Core 2 Duo Mac mini running real Firefox, by the way, but it's a much faster CPU so it does not lag nearly as much (I suspect hardware acceleration also improves it). For reasons I can't explain, the effect also seems to be worse on the G5; my iMac G4 and iBook G4 really light on fire with this, but the quad has some weird slowdowns.

A weird thing I noticed, but have not been able to resolve or consistently reproduce, is a questionable problem with large uploads (downloads are fine). Ironically, it occurred while trying to upload this test release, but it does not always happen and a debug build showed no problems. Even when it does happen, it works fine on the second try, and doesn't do it on all sites or even all uploads. I really need a site where you are unable to upload at all to deal with this, so I hesitate to call it a "bug" until I have good STRs on my part. Please don't report intermittents because I have enough of those already.

Also, while jemalloc may improve repeated allocations in a straight line, the problems with multithreading and multiprocessing persist even with jemalloc enabled, so it does not solve issue 231. That brings us to our test builds -- you'll note there are only two, one for G5 and one for G4/7450. That's because this test version has been pretty banged on with 10.4, but not at all with 10.5. I particularly need multi-CPU testers, even better if you can demonstrate the performance is significantly better with Tiger than Leopard. If that's the case, we need to extend issue 231 to disable multiple CPUs on Tiger and Leopard, not just Tiger. Don't worry, G3 and G4/7400 owners: you are still loved and you will get the next version too (G4/7400 users can try the 7450 one, but it is not optimized for your platform).

Note thee well that I haven't decided that jemalloc is the way we're going, and this is not releasable in this state to a general audience, although it is to you, you crazy wild unstable branch users, you. I need to know about other crashes you can find and reliable differences between it and the original 22.0 -- other than jemalloc, issue 231 and the fix for leaking font refs, this is the same browser. Reproducible, confirmable, consistent differences will be investigated thoroughly; intermittent, unconfirmable ones won't. With some experimentation, we'll see if we want to move the stable users to this if I have to put them on 22 for a little bit, or if this exercise is even worth it at all. I think it is, but let's see what you think.

Again, G5 and G4/7450 only, with 10.5 testers as a priority -- G3 and G4/7400 will come with the next scheduled update. 10.4 users are of course welcome, but I really want to know how it performs on 10.5 and a comparison of 10.4 to 10.5 performance if anyone can do that.

Back to debugging the trampoline.

17 comments:

  1. It's a pity I can't test this. My PowerMac G5 seems to have developed the dreaded "logic board failure" problem that I had heard so much about but blissfully ignored. I wonder if there's a website somewhere showing a bunch of smiling happy people who own dead PowerMacs.

    1. Dreaded "logic board failure" problem?

    2. Yes, indeed. Apparently from what I've read, G5 PowerMacs have a common problem with the motherboard (Apple seems to prefer "logic board"... maybe they're anti-mother or something...), where you can get fractures in the solder joints - causing boot failures, random kernel panics and other fun stuff. The only solutions are either replacing the whole motherboard or getting bionic vision to locate the microscopic fractures and re-solder. (my vision is sadly non-bionic)

    3. Hmm. I know some of the middle models were prone to it. So far this quad is doing fine, though, and I have a spare. Sad to hear about yours though. :(

  2. Thank you for all the effort you guys have put into this! I've been using 17, and this version definitely seems quicker than that on 10.4, and roughly the same on 10.5. I have a g4 mini and a dual g4 that I'd be happy to test things on; how would I go about this? Besides just guessing which configuration seems fastest, is there a way to get actual numbers? Do you just hold a stopwatch to it?

    1. On single CPU Power Macs I don't expect much difference between 10.4 and 10.5 -- I'm most interested in this on multi-CPU Power Macs. If 10.5 is markedly slower, then the patch in issue 231 needs to be enabled on Leopard also.

      As far as timing, yeah, it's going to be crude wallclock and/or stopwatch times. The allocator does not benefit JavaScript much, though it does greatly improve DOM, so a benchmark like Peacekeeper might show some movement.

      Note that you're comparing 17 to 22.1, and 22.0 is already significantly faster than 17.

  3. Youtube runs html on the first launch but quits after that. Any info or experience with this?

    1. What do you mean, quits? The browser crashes, or it just doesn't work? It works fine here (WebM).

    2. Doesn't work...attempts to run but doesn't, right click in the video window brings up the html5 multiple choice menu. While loading the page you get the TFF plugin warning, when page loads the video load wheel spins for 1/2 sec then the Adobe Flash Player is required msg is displayed. Happens in safe mode.

    3. I get that on some videos, notably those that YT wants to run ads on, but other videos do play. If this happens on some, I expect that. If this happens on *every* video, I can't reproduce that. I would be surprised if that were a 10.5-specific problem.

  4. Same thing happens on 17 when I disable the flash plugin.

    1. Then it's not jemalloc, it's something else. Let's deal with that separately.

  5. I tested 22.1 on a dual 1.8GHz g4 and a single 1.75GHz g4 mini using both 10.4 and 10.5. The dual was slower to load pages than the single, and with 10.5 the dual was usually a couple seconds slower than with 10.4. For example, racingsouthwest.com took ~25s to load on the dual 10.5, ~20s on the dual 10.4, and about 15s on the single with both OS versions. On both computers 10.5 was a little slower than 10.4.

    I could not find a site that really bogs down. If anyone knows a good test case I can try it.

    1. That is very helpful. Thanks. If I get any other similar reports, then I think we should disable multiCPU detection on 10.5 as well, since a 1.25x slowdown is definitely significant.

  6. Used 22.1 since yesterday. There is no perceivable speed difference with general surfing between 22.0 and 22.1 on my PowerBook G4 1.33 GHz (10.5.8). I encountered no bugs or slowdowns. I wouldn't notice I'm on a new version without the missing image dragging and icons in the preferences.

    Peacekeeper is slightly worse in 22.1 (as always, take with a grain of salt), but Sunspider 1.0 is clearly faster, curiously. I did several test runs to confirm. 22.0: ~1700 22.1: ~1510

    1. This seems broadly similar to what Tobias reported, i.e., that 10.5.8's default memory allocator is pretty close in performance to jemalloc. Still, it definitely seems worth it for 10.4 users, as long as it doesn't hurt 10.5.

      Interesting about the Sunspider performance increase. On 10.4, SS didn't budge much, but Peacekeeper was about 10% faster.
