Intel processors since at least the Pentium use a relatively simple BTB to aid these computations when finding the target of a branch instruction. The buffer is essentially a dictionary with virtual addresses of recent branch instructions mapping to their predicted target: if the branch is taken, the chip has the new actual address right away, and time is saved. To save space and complexity, most processors that implement a BTB only do so for part of the address (or they hash the address), which reduces the overhead of maintaining the BTB but also means some addresses will map to the same index into the BTB and cause a collision. If the addresses collide, the processor will recover, but it will take more cycles to do so. This is the key to the side-channel attack.
(For the record, the G3 and the G4 use a BTIC instead, or a branch target instruction cache, where the table actually keeps two of the target instructions so it can be executing them while the rest of the branch target loads. The G4/7450 ("G4e") extends the BTIC to four instructions. This scheme is highly beneficial because these cached instructions essentially extend the processor's general purpose caches with needed instructions that are less likely to be evicted, but is more complex to manage. It is probably for this reason the BTIC was dropped in the G5 since the idea doesn't work well with the G5's instruction dispatch groups; the G5 uses a three-level hybrid predictor which is unlike either of these schemes. Most PowerPC implementations also have a return address stack for optimizing the blr instruction. With all of these unusual features Power ISA processors may be vulnerable to a similar timing attack but certainly not in the same way and probably not as predictably, especially on the G5 and later designs.)
To get around ASLR, an attacker needs to find out where the code block of interest actually got moved to in memory. Certain attributes make kernel ASLR (KASLR) an easier nut to crack. For performance reasons usually only part of the kernel address is randomized, in open-source operating systems this randomization scheme is often known, and the kernel is always loaded fully into physical memory and doesn't get swapped out. While the location it is loaded to is also randomized, the kernel is mapped into the address space of all processes, so if you can find its address in any process you've also found it in every process. Haswell makes this even easier because all of the bits the Linux kernel randomizes are covered by the low 30 bits of the virtual address Haswell uses in the BTB index, which covers the entire kernel address range and means any kernel branch address can be determined exactly. The attacker finds branch instructions in the kernel code such as by disassembling it that service a particular system call and computes (this is feasible due to the smaller search space) all the possible locations that branch could be at, creates a "spy" function with a branch instruction positioned to try to force a BTB collision by computing to the same BTB index, executes the system call, and then executes the spy function. If the spy process (which times itself) determines its branch took longer than an average branch, it logs a hit, and the delta between ordinary execution and a BTB collision is unambiguously high (see Figure 7 in the paper). Now that you have the address of that code block branch, you can deduce the address of the entire kernel code block (because it's generally in the same page of memory due to the typical granularity of the randomization scheme), and try to get at it or abuse it. The entire process can take just milliseconds on a current CPU.
The kernel is often specifically hardened against such attacks, however, and there are more tempting targets though they need more work. If you want to attack a user process (particularly one running as root, since that will have privileges you can subvert), you have to get your "spy" on the same virtual core as the victim process or otherwise they won't share a BTB -- in the case of the kernel, the system call always executes on the same virtual core via context switch, but that's not the case here. This requires manipulating the OS' process scheduler or running lots of spy processes, which slows the attack but is still feasible. Also, since you won't have a kernel system call to execute, you have to get the victim to do a particular task with a branch instruction, and that task needs to be something repeatable. Once this is done, however, the basic notion is the same. Even though only a limited number of ASLR bits can be recovered this way (remember that in Haswell's case, bit 30 and above are not used in the BTB, and full Linux ASLR uses bits 12 to 40, unlike the kernel), you can dramatically narrow the search space to the point where brute-force guessing may be possible. The whole process is certainly much more streamlined than earlier ASLR attacks which relied on fragile things like cache timing.
As it happens, software mitigations can blunt or possibly even completely eradicate this exploit. Brute-force guessing addresses in the kernel usually leads to a crash, so anything that forces the attacker to guess the address of a victim routine in the kernel will likely cause the exploit to fail catastrophically. Get a couple of those random address bits outside the 30 bits Haswell uses in the BTB table index and bingo, a relatively simple fix. One could also make ASLR more granular to occur at the function, basic block or even single instruction level rather than merely randomizing the starting address of segments within the address space, though this is much more complicated. However, hardware is needed to close the gap completely. A proper hardware solution would be to either use most or all of the virtual address in the BTB to reduce the possibility of a collision, and/or to add a random salt to whatever indexing or hashing function is used for BTB entries that varies from process to process so a collision becomes less predictable. Either needs a change from Intel.
This little fable should serve to remind us that monocultures are bad. This exploit in question is viable and potentially ugly but can be mitigated. That's not the point: the point is that the attack, particularly upon the kernel, is made more feasible by particular details of how Haswell chips handle branching. When everything gets funneled through the same design and engineering optics and ends up with the same implementation, if someone comes up with a simple, weapons-grade exploit for a flaw in that implementation that software can't mask, we're all hosed. This is another reason why we need an auditable, powerful alternative to x86/x86_64 on the desktop. And there's only one system in that class right now.
Okay, okay, I'll stop banging you over the head with this stuff. I've got a couple more bugs under investigation that will be fixed in 45.5.0, and if you're having the issue where TenFourFox is not remembering your search engine of choice, please post your country and operating system here.