A few months ago (January 2018), the media reported breaking news of alleged security failures for Intel and ARM processors. In order to understand these “weaknesses”, and before briefly describing the nature of these “security holes”, we need to list some of the common characteristics shared by modern processors.
One of the features of present-day computers that provides security is the isolation between the different user spaces and between these and the operating system’s (OS) kernel. This isolation is achieved by two mechanisms; the most basic of which is the user/supervisor mode bit in the processor (MSR[PR] in PowerPC). This bit is activated when the OS code is executed and is deactivated when switching to a user process. The other mechanism is to use separate virtual memory spaces for the user and OS processes. In this mechanism, a virtual space is divided into a set of pages (usually 4 KB), which are mapped into physical memory pages through translation tables residing in memory (and on disk) and managed by the OS, called Page Tables. Each table entry maintains both the physical to virtual address translation and the access permissions: read/write, execute/data, user/supervisor.
Finally, to avoid the speed difference between the main memory (DDR3/DDR4) and the core processor, a cache hierarchy is used. The caches divide the physical memory space into fixed-size (32 KB in PowerPC) chunks called lines. When the core processor needs a piece of data, it first checks whether the fastest and closest cache, cache L1, has a copy. If it does (cache hit), the data is obtained from there (the PowerPC e5500 core takes 3 cycles to do this). If not (cache miss), the same process is repeated at the next cache level, L2 (11 cycles for the e5500), and successive levels. If the requested data can’t be found in the cache hierarchy, the entire line from main memory (+40 cycles) that the piece of data belongs to is copied and at the same time a copy is made in L1. This last action, the copy in L1, greatly speeds up the next access to that same piece of data and is key to explaining the attack.
Other common features among modern processors are out-of-order execution, branch prediction and speculative execution. Thus, for example, if a branch instruction depends on the result of an operation in progress, the predictor can guess the destination and direct the execution along the most probable path. When the result of the operation that the branch destination depends on finally becomes available, the processor will either discard the intermediate results – if the guess turned out to be incorrect – and restore the machine to the state it was in before the branch instruction; or – if the guess was correct – it will commit the results and continue. Instructions that are executed before knowing whether they correspond to the desired program flow, are said to be executed speculatively.
Now suppose that the following C code fragment is part of an attacker user process run with normal user privileges within the attacked system
struct array {
unsigned long length;
unsigned char key[…];
} *ptr = {…};
unsigned long x = …;
unsigned int y;
unsigned int probe_array[] = {…}; /* 8K integer matrix /*
if (x < ptr->length)
y = probe_array[(unsigned int)ptr->key[x] << 5];
Let’s also suppose:
1. That previous operations have prepared the branch predictor to estimate as more likely that the condition that checks the if is true, for example, by passing x values that were valid.
2. That x is chosen maliciously outside the area of data accessible to the user whose content is to be obtained fraudulently, with that object: x > ptr->length
3. That the code has ensured that ptr->length and the matrix probe_array[] are not present in the cache and that the rest of the data to be read is.
When the previous piece of code is executed, the processor starts comparing the malicious value x with ptr->length. When reading ptr->length, since it is not found in the cache and is fetched from the main memory, the decision’s latency will be many cycles in duration. While it is being resolved, the branch predictor assumes, based on recent history, that the if will be true. Therefore, it will proceed to calculate the offset in probe_array[] for which it will send a ptr->key[x] read request – that is outside the process’s permitted access range – to the memory subsystem. While this last reading continues its course, the ptr->length value finally arrives from the DRAM. The processor realizes that the speculative execution went the wrong way and reverts the machine state back to what it was before the bifurcation. As a result, no intermediate result is preserved and no instruction causes an exception (as would be the case when accessing out of range during the normal flow of the program). However, although the probe_array[offset] value cannot be consulted by the code because it was not stored in the y variable, nor in any registry of the architecture, there is still a subtle trace in the machine’s microarchitecture since there is a copy in the cache. And this is exactly what is exploited by the attack.
So how is the copy in the cache exploited? Well, simply by accurately measuring the access time to each of the probe_array[] elements. Remember that before executing the piece of code, the attack process had ensured that none of the matrix elements were in the cache. So, now only one of the measured times – the one corresponding to the element in the cache – will be significantly less than the others. And by detecting the element – after undoing the offset calculation – you can obtain the byte ptr->key[x] value from a memory area that the attack process should never have access to.
As conclusions, we can say that this type of attack exploits the most advanced features of modern processors, is independent of the operating system and doesn’t depend on software vulnerabilities.
1 All textual references to clock cycles refer to processor cycles, thus: 3 cycles @ 2.4 GHz (P5040) = 1.25 ns
2 Initiating the transfer at the address where the data is located to minimize waiting.
3 This can be achieved by means of cache control instructions executed in user mode. On the PowerPC: dcbf, rB (data cache block flush) where rA = 0, for example and rB contains the physical address of the entry that is evicted from the cache with rA and rB any of the architecture’s 32 GPRs.
4 Although the measure seems simple, it requires the routine being carefully serialized using synchronization barriers (mbar instruction in PowerPC), so that the results are reliable and not affected by the ability to execute outside the kernel.
5 In one of the consulted articles for the blog, the times on an i5 Intel Core obtained by the measuring routine were about 40 clock cycles for cache and about 280 for main memory.