1 Introduction

Side-channel attacks leak sensitive cryptographic information through physical channels such as power consumption and timing, and are typically specific to the implementation of the cryptographic algorithm [1]. An important class of these attacks is based on obtaining access time measurements from cache memory systems. The Advanced Encryption Standard (AES) [2] is a widely adopted algorithm for secret key cryptography. Most software implementations of AES, including those in the cryptographic library OpenSSL [3], make extensive use of cache-resident table look-ups in lieu of time-consuming field operations.

Many cache-based side-channel attacks aim to retrieve the key of a victim performing AES by exploiting the fact that access times to different levels of the cache–main memory hierarchy vary by 1–2 orders of magnitude [4]. The attacker (or spy) and the victim share a single copy of the same cryptographic library, whose binary is mapped into the virtual address spaces of both. In a Flush and Reload attack [5], for example, the spy process flushes the AES tables out of the cache. When the victim is scheduled, it brings in some of the evicted lines. When control returns to the spy, it determines which of the evicted lines were fetched by the victim by measuring the time to access them. It then flushes the AES tables out of the cache again before relinquishing control of the CPU. The inputs obtained by the spy are then analysed to compute the AES key.

Our algorithms to deduce the AES key use (i) several blocks of ciphertext encrypted with the same key and (ii) the set of line (or block) numbers of AES table entries accessed by the victim during the decryption of each of those blocks of ciphertext. Our use of ciphertext (rather than plaintext as in some previous works) is far more realistic since the former is readily available. Several entries in the AES table are placed on a single cache line and the spy provides a set (not list) of lines accessed. The absence of spatial information (the specific table entry on a line required by the victim) and temporal information (the order of accesses) makes it challenging to deduce the key, especially for sets with larger cardinalities.

The goal of this work is the retrieval of the AES key with the fewest possible blocks of ciphertext. The attack in [6] was implemented with a similar goal. However, the presence of false negatives in the spy input defeats their attack. Moreover, cache prefetching (described later) was disabled in those experiments. As in [6, 7], we derive 16 equations that relate the output of a round to blocks of ciphertext and bits of the AES round key. Our principal contribution is the development of a theoretical framework upon which our AES key retrieval algorithms are based. The latter are not only more efficient in terms of execution time but also require up to 75% fewer blocks of ciphertext compared with previous work. The dramatic increase in efficiency of our approach and its error tolerance are attributed, in part, to our exploitation of the peculiar structure of these equations and extraction of the maximum possible information from them. The key retrieval algorithms are agnostic with respect to cache attack strategy (“Flush and Reload” [5], “Prime and Probe” [8], etc.), to specific software implementations (for example, single/multiple look-up tables [3]), to single-core versus multi-core attack scenarios and to the noise volume on the cache side channel.

The complexity of cache attacks is exacerbated by hardware prefetchers [9] (implemented internally by processors to reduce memory latency). When a cache miss occurs, the processor not only fetches the missed line but also the next or previous line in anticipation of its use. Processors targeted in earlier attacks [8, 10] employed stride prefetchers while modern processors sport far more sophisticated prefetchers. The latter use history tables to record patterns of cache miss addresses, make predictions of future misses and perform aggressive prefetching based on their predictions [11]. The substantial increase in false positive rate (due to lines prefetched but not subsequently used during a run of the victim executing AES) has a serious adverse effect on the efficiency of access-driven attacks.

Our efforts with prefetching turned off enabled us to retrieve the AES key with a mere 2–3 blocks of ciphertext – two orders of magnitude better than the best result obtained so far. The natural question, then, is: with prefetching turned on, how many decryptions are required? Our main contribution is to answer this question by implementing (i) prefetching-aware multi-threaded spy code, (ii) heuristics to sanitize the input received from the spy threads and (iii) key retrieval algorithms that operate on the sanitized input. The spy code was ported to the Intel Core 2 Duo, Core i3, Core i5 and Core i7 (the latter three with very aggressive prefetchers [9]). Our heuristics and key retrieval algorithms succeeded in retrieving the entire AES key with only about 25 blocks of ciphertext.

Our second contribution is a systematic investigation into the effect of errors on the number of decryptions (ciphertext blocks) required. False positives and false negatives in the input provided by the spy threads are caused, in part, by measurement errors. Moreover, because of the way the spy threads are designed to operate, there is a substantial rise in false negatives besides the expected increase in false positives. All of our experiments were conducted in a lab setting with few other processes sharing the core in addition to the victim and spy. In more realistic scenarios, we would expect even greater noise and hence higher error rates. We develop analytical models to study the effect of varying false positive and false negative rates on the number of decryptions required and compare these to actual results from our experimental set-up. This work is the first to attempt a quantitative investigation of the effect of noise on the success probability of key retrieval.

This paper is organized as follows. Section 2 summarizes work related to the theme of this paper. Section 3 contains an introduction to AES and, in particular, its software implementation. Our analytical models were corroborated through experiments – the experimental set-up, spy code and key retrieval strategy are also described in Section 3. Sections 4 and 5 present the algorithms, model and results related to the First Round and Second Round Attacks, respectively. Section 6 discusses other issues of relevance, including limitations and countermeasures, and Section 7 contains the conclusion.

2 Related work

Hu [12] first identified cache memory as a potential covert channel through which sensitive information may be extracted. Later, Kocher [13] demonstrated data-dependent timing attacks against various public-key systems. Building on his work, [14] noted the prospect of using cache memory to perform attacks based on cache hits in S-box ciphers such as Blowfish. A formal study of such attacks using cache misses was conducted in [15].

Access-driven cache attacks were reported in [16] on RSA for multithreaded processors. Tromer et al [8] proposed an approach and analysis for the access-driven cache attacks on AES for the first two rounds. They introduced the Prime+Probe technique for cache attacks. In the Prime phase, the attacker fills the cache with its own data before the encryption. In the Probe phase, it accesses its data and determines whether each access results in a hit or miss. In the synchronous version of their attack, 300 encryptions were required to infer a 128-bit AES key on the Athlon64 platform.

The ability to detect whether a cache line has been evicted or not was further exploited by Neve et al [17]. They designed an improved access-driven cache attack on the last round of AES on a single-threaded processor. Aciiçmez et al [18] presented a realistic access-driven cache attack targeting I-cache based on vector quantization and hidden Markov models on OpenSSL’s DSA implementation.

Gullasch et al [10] proposed an efficient access-driven cache attack when attacker and victim use a shared crypto library. Their experimental measurement method is similar to ours but their key retrieval is very different; their attack does not require synchronization or knowledge of the plaintexts or ciphertexts. They retrieved the AES key using about 100 blocks of ciphertext. Extending the work of Gullasch et al [10], Yarom and Falkner [5] conducted a cross-core attack on the LLC (Last Level Cache, L3 on processors with three levels of cache) with the spy and the victim executing concurrently on two different cores. They introduced the Flush+Reload technique, which is effective across multiple processor cores and virtual machine boundaries. The first access-driven cache attack on smartphones was proposed in [19] on misaligned AES T-tables.

Acıiçmez and Koc [20] provided an analytical treatment of trace-driven cache attacks and analysed their efficiency against symmetric ciphers. Trace-driven cache attacks were first theoretically introduced in [15]. Gallais et al [21] proposed an improved adaptive known plaintext attack on AES implemented for embedded devices. Their attacks recover a 128-bit AES key with an exhaustive search of at most \(2^{30}\) key hypotheses. Trace-driven cache attacks were further investigated by Zhao and Wang [22] on AES and CLEFIA by considering cache misses and S-box misalignment.

Tsunoo et al [23] pioneered the work on time-driven cache attacks. They demonstrated that DES could be broken using \(2^{23}\) known plaintexts and \(2^{24}\) calculations. A similar approach was used by Bonneau and Mironov [24], where they emphasized individual cache collisions during encryption instead of overall hit ratio. Although this attack was a considerable improvement over previous work, it still required \(2^{13}\) timing samples. Bernstein [25] presented a practical known plaintext attack on a remote server running AES encryption. [26] is a recent work applying Bernstein’s attack on ARM Cortex-A platform used on Android-based systems.

Neve et al [27] revisited Bernstein’s attack technique and explained why his attack works. Concurrently with but independently of the work of Bernstein [25], Osvik et al [28] introduced the Evict+Time technique, in which an attacker evicts cache lines from all levels of the cache and then identifies those that are accessed during the encryption. Other time-driven attacks were investigated in [23, 29,30,31,32,33].

Irazoqui et al [34] introduced a new shared LLC attack that works across cores and VM boundaries. Their attack is implemented on the basis of Prime+Probe technique enabled by huge pages without de-duplication. A similar LLC attack was proposed in [35] on various versions of GnuPG. Another attack on LLC was implemented in [36] that does not require the use of large pages.

Zhang et al [37], targeting virtualized environments, extract the private ElGamal key of a GnuPG decryption by exploiting the scheduler of the Xen hypervisor [38]. Weiß et al [39] used Bernstein’s timing attack on AES running inside an ARM Cortex-A8 single-core system in a virtualized environment to extract the AES encryption key. Irazoqui et al [40] performed Bernstein’s cache-based timing attack in a virtualized environment to recover the AES secret key from a co-resident VM with \(2^{29}\) encryptions. Later, Irazoqui et al [41] used a Flush+Reload technique and recovered the AES secret key with \(2^{19}\) encryptions.

Hardware prefetchers, first discussed in [42], aim to provide temporal locality by bringing data into the cache, complicating the various cache-based attacks as also observed in [10, 28, 43]. Tromer et al [8] discussed a pointer chasing technique in which the attacker’s memory is organized into a linked list. They mounted Prime+Probe-based attack by traversing the list to avoid the read-ahead of memory by the hardware prefetcher.

Liu et al [44] discuss an LLC attack on ElGamal decryption in which the authors randomly organize the memory lines in each eviction set as a linked list. This random permutation prevents the hardware prefetcher from fetching memory lines in the eviction set. Recently, [45] described a novel prefetch side-channel attack that exploits weaknesses of prefetch instructions to obtain address information, defeating SMAP, SMEP and kernel ASLR. The attack is based on the primitive that prefetch instructions can fetch inaccessible privileged memory into caches without performing any privilege checks.

Fuchs and Lee [46] presented a randomized set-balanced policy able to disrupt the cache footprint construction and thwart Prime+Probe cache attacks. They suggested two extensions to the conventional prefetching scheme – randomized prefetching policy to operate at different levels of aggressiveness and a set balancer to balance the load across all cache lines. Hardware prefetching has been discussed further in [16, 47].

Table 1 Notations.

3 Background and attack preliminaries

We first summarize the key points in the software implementation of AES and then describe the espionage infrastructure designed by us. Finally, the strategy used for key retrieval is explained. The notations used are described in table 1.

3.1 AES preliminaries

AES is a symmetric key block cipher whose construction is based on a substitution–permutation network. It supports a key size of 128, 192 or 256 bits and a block size of 128 bits. During decryption, a round function is repeated a fixed number of times (10 for a key size of 128 bits) to convert 128 bits of ciphertext into 128 bits of plaintext. The 16-byte ciphertext \(C=(c_{0}, c_{1}, \ldots , c_{15} )\) may be thought of as being arranged column-wise in a \(4\times 4\) array of bytes. Each byte (represented as two hex characters) is an element of the binary field \(GF\left( 2^{8} \right) \). This “state array” is transformed after each step in a round. At the end of the last round, the state array contains the plaintext.

In decryption, all rounds except the last involve four steps − InvSubBytes (inverse byte substitution), InvShiftRows (inverse row shift), InvMixColumns (inverse column mixing) and a round key addition (the last round skips the inverse column mixing step). The substitution and column mixing steps are defined using algebraic operations over the field \(GF\left( 2^{8} \right) \) with irreducible polynomial \(x^8+ x^4+ x^3+x+1\). For example, in the inverse column mixing step, the state array is pre-multiplied by the following matrix \(B^{-1}\), where \(B^{-1}\) is the inverse of the matrix B used in the column mixing step of encryption:

$$\begin{aligned} B^{-1} = \left( \begin{array}{llll} 0E & 0B & 0D & 09\\ 09 & 0E & 0B & 0D\\ 0D & 09 & 0E & 0B\\ 0B & 0D & 09 & 0E \end{array} \right) ,\quad B = \left( \begin{array}{llll} 02 & 03 & 01 & 01\\ 01 & 02 & 03 & 01\\ 01 & 01 & 02 & 03\\ 03 & 01 & 01 & 02 \end{array} \right) . \end{aligned}$$
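The relationship between \(B\) and \(B^{-1}\) can be checked directly. The following is a minimal Python sketch (our own, not part of any AES implementation) that multiplies the two matrices over \(GF(2^8)\) with the reduction polynomial \(x^8+x^4+x^3+x+1\) and confirms the product is the identity:

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B  # low byte of the reduction polynomial
        b >>= 1
    return p

B     = [[0x02, 0x03, 0x01, 0x01],
         [0x01, 0x02, 0x03, 0x01],
         [0x01, 0x01, 0x02, 0x03],
         [0x03, 0x01, 0x01, 0x02]]
B_inv = [[0x0E, 0x0B, 0x0D, 0x09],
         [0x09, 0x0E, 0x0B, 0x0D],
         [0x0D, 0x09, 0x0E, 0x0B],
         [0x0B, 0x0D, 0x09, 0x0E]]

def gf_matmul(X, Y):
    """4x4 matrix product over GF(2^8); addition in the field is XOR."""
    Z = [[0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = 0
            for k in range(4):
                acc ^= gf_mul(X[i][k], Y[k][j])
            Z[i][j] = acc
    return Z

identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
assert gf_matmul(B, B_inv) == identity
```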

Like a block of plaintext or ciphertext and the state array, the original 16-byte (128-bit) AES secret key K is arranged column-wise in a \(4\times 4\) array of bytes. It is used to derive 10 round keys, one for the round key addition step of each round. The round keys are denoted \(K^{(r)}, r = 1, 2, \ldots , 10\). In a software implementation, field operations are replaced by relatively inexpensive table look-ups, thereby speeding up encryption and decryption.

In the versions of OpenSSL targeted in this paper, five tables are employed (each of size 1 KB). A table \(T_{t}\, ,\,\,0\le t\le 4\), is accessed using an 8-bit index, resulting in a 32-bit output. Let \(x^{\left( r\right) } =\left( x_{0}^{\left( r\right) },\ldots ,x_{15}^{\left( r\right) } \right) \) denote the input to Round r. The output of Round r (\(r = 1, \ldots , 9\)) is obtained from the input using 16 table look-ups (4 per table \(T_{t}\, ,\,\,0\le t\le 3\)) and 16 XOR operations as shown in (1)–(4):

$$\begin{aligned}&\left( x_{0}^{\left( r+1\right) },x_{1}^{\left( r+1\right) },x_{2}^{\left( r+1\right) },x_{3}^{\left( r+1\right) } \right) \nonumber \\&\quad \leftarrow T_{0}\left[ x_{0}^{\left( r\right) }\right] \oplus T_{1}\left[ x_{13}^{\left( r\right) }\right] \oplus T_{2}\left[ x_{10}^{\left( r\right) }\right] \oplus T_{3}\left[ x_{7}^{\left( r\right) }\right] \oplus K_{0}^{\left( r\right) }, \end{aligned}$$
(1)
$$\begin{aligned}&\left( x_{4}^{\left( r+1\right) },x_{5}^{\left( r+1\right) },x_{6}^{\left( r+1\right) },x_{7}^{\left( r+1\right) } \right) \nonumber \\&\quad \leftarrow T_{0} \left[ x_{4}^{\left( r\right) } \right] \oplus T_{1} \left[ x_{1}^{\left( r\right) } \right] \oplus T_{2} \left[ x_{14}^{\left( r\right) } \right] \oplus T_{3} \left[ x_{11}^{\left( r\right) } \right] \oplus K_{1}^{\left( r\right) }, \end{aligned}$$
(2)
$$\begin{aligned}&\left( x_{8}^{\left( r+1\right) },x_{9}^{\left( r+1\right) },x_{10}^{\left( r+1\right) },x_{11}^{\left( r+1\right) } \right) \nonumber \\&\quad \leftarrow T_{0} \left[ x_{8}^{\left( r\right) } \right] \oplus T_{1} \left[ x_{5}^{\left( r\right) } \right] \oplus T_{2} \left[ x_{2}^{\left( r\right) } \right] \oplus T_{3} \left[ x_{15}^{\left( r\right) } \right] \oplus K_{2}^{\left( r\right) }, \end{aligned}$$
(3)
$$\begin{aligned}&\left( x_{12}^{\left( r+1\right) },x_{13}^{\left( r+1\right) },x_{14}^{\left( r+1\right) },x_{15}^{\left( r+1\right) } \right) \nonumber \\&\quad \leftarrow T_{0} \left[ x_{12}^{\left( r\right) } \right] \oplus T_{1} \left[ x_{9}^{\left( r\right) } \right] \oplus T_{2} \left[ x_{6}^{\left( r\right) } \right] \oplus T_{3} \left[ x_{3}^{\left( r\right) } \right] \oplus K_{3}^{\left( r\right) }. \end{aligned}$$
(4)

Here \(K_{j}^{\left( r\right) }\) refers to the \(j^{\mathrm{th}}\) column of round key \(K^{\left( r\right) }\).

In the last round, table \(T_4\) is used instead of \(T_{0},\ldots ,T_{3}\) owing to the absence of the column mixing step. The value returned by each table look-up is XORed with the corresponding byte of the round key. Since each round involves 16 table accesses, a complete encryption/decryption using a 128-bit key involves a total of 160 table accesses.

For performance reasons, the tables reside in cache. The granularity of transfer between the cache and main memory is a line that is 64 bytes on all machines we experimented with. Each of \(T_{0}-T_{3}\) contains 256 4-byte elements. Hence, 16 elements reside on a single cache line and each of \(T_{0}-T_{3}\) occupies 16 lines. An index into a table is 8 bits – the high-order nibble specifies the line within the table and the low-order nibble specifies the element within a line.
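The nibble arithmetic above can be sketched in a few lines of Python (the helper name `split_index` is ours, introduced purely for illustration):

```python
# With 64-byte lines and 4-byte table entries, 16 entries share a
# cache line, so the high-order nibble of an 8-bit index selects the
# line within the table and the low-order nibble the entry within it.
LINE_SIZE, ENTRY_SIZE = 64, 4
ENTRIES_PER_LINE = LINE_SIZE // ENTRY_SIZE   # 16

def split_index(idx):
    line, element = idx >> 4, idx & 0x0F
    # The byte-offset arithmetic gives the same line number:
    assert line == (idx * ENTRY_SIZE) // LINE_SIZE
    return line, element

assert split_index(0x00) == (0, 0)
assert split_index(0xA7) == (0x0A, 0x07)
assert split_index(0xFF) == (15, 15)   # last entry of a 16-line table
```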

3.2 Experimental set-up

Executions of the spy process and the victim performing decryptions are interleaved on the same core. Ideally, the victim (V) makes a few AES table accesses during its run before being preempted. The spy, scheduled next, then attempts to infer the line numbers of the table elements accessed in the preceding run of V. In most operating systems (OSs), however, the time slice provided to an executing process is about a millisecond. This is sufficient for V to decrypt tens of blocks of ciphertext, making it impossible for the spy to deduce the decryption key. What is needed is a way to restrict the duration of V’s time slice to about 1100 ns.

Similar to [10], we employ a multi-threaded spy process. The spy program creates a high-resolution POSIX timer (used by all the spy threads) and an array of binary semaphores – S[i] is the semaphore associated with Thread i. All but one of the semaphores are initialized to 0. Hence, all threads are blocked on their respective semaphores except for the one that is initialized to 1. The unblocked thread accesses a pre-determined list of cache lines to determine which of these have just been accessed by V. The exact list depends on whether prefetching is enabled and, if so, the type of prefetcher. The spy thread then flushes all the lines containing the AES tables from cache.

figure a

The last task of the spy thread is to arm the timer by initializing its value to \(t \sim 1100\,\hbox {ns}\). It then returns to the first statement of the loop, wherein it blocks on its semaphore (Line 6). Now all the spy threads are blocked on their respective semaphores. Hence, V is scheduled next and resumes performing decryptions. On expiration of the timer, the kernel sends a signal to the signal handler, which unblocks Thread \(i + 1\) (Line 3). V is preempted and Thread \(i + 1\) is scheduled for the reason explained next.

Beginning with version 2.6.23 of the Linux kernel, the default process scheduler is the CFS (Completely Fair Scheduler). For each process or thread, the OS maintains a virtual run time that is a crude measure of CPU run time granted to that process. Scheduler fairness implies that V and the n spy threads each get roughly \(1/(n+1)^{\mathrm{th}}\) of the CPU time (ignoring other processes executing on the core). When a thread is blocked, its virtual run time remains fixed. The unblocked threads and processes get scheduled in turn and their virtual run times increase. Hence, when the blocked thread is unblocked its virtual run time is sufficiently low that V is preempted and the former gets rescheduled.
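The hand-off discipline among the spy threads can be sketched as follows. This is a minimal Python model of the semaphore ring only: the probe-and-flush work and the POSIX timer/signal machinery of the actual C spy code are stubbed out, with each thread releasing its successor directly instead of via a timer signal.

```python
import threading

N_THREADS, ROUNDS = 4, 3
sems = [threading.Semaphore(0) for _ in range(N_THREADS)]
schedule = []                      # order in which the threads ran

def spy(i):
    for _ in range(ROUNDS):
        sems[i].acquire()          # block on our own semaphore
        schedule.append(i)         # placeholder for probe + flush of AES tables
        sems[(i + 1) % N_THREADS].release()  # stand-in for the timer/signal hand-off

threads = [threading.Thread(target=spy, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
sems[0].release()                  # the one semaphore initialized to 1
for t in threads:
    t.join()

# Exactly one token circulates, so the threads run strictly in turn.
assert schedule == list(range(N_THREADS)) * ROUNDS
```

Because only one semaphore token is in flight at any instant, the threads execute strictly round-robin, mirroring how each real spy thread runs once per victim time slice.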

When a spy thread executes, it accesses lines of the AES tables in order. A cache miss results in the next (or previous) line being prefetched, causing the spy to wrongly infer that the prefetched line was accessed in the previous run of the victim. To defeat the effect of lines prefetched during the execution of the victim process, the number of accesses made by the victim during each of its runs should be minimized. This was accomplished by setting the time interval of the high-resolution POSIX timer to \(\sim 1100\,\hbox {ns}\) on Intel Core i3/i5/i7 processors and \(\sim 1700\,\hbox {ns}\) on Core 2 Duo.

One possible way to defeat prefetching during the execution of the spy is to access only the even-indexed lines (a stride of 2). This strategy results in a high rate of false negatives, since all odd-indexed lines are missed. Worse still, the prefetcher is able to detect fixed strides in the access sequence and prefetch lines, thus defeating our strategy. We therefore generate all even numbers between 0 and 72 in pseudo-random order using the expression \((17^i {\mathrm{mod}} 37)*2\). This generates the sequence of line numbers 34, 60, 58, 24, 38, 54, ..., 2, 0, which is loaded into the array cacheLines in the spy program. With this modification, we obtained the correct set of even indices accessed by the victim on the Core 2 Duo, since its prefetcher can no longer detect a fixed stride.
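The sequence quoted above can be reproduced directly: since 17 is a primitive root modulo the prime 37, the powers \(17^i \bmod 37\) for \(i = 1,\ldots,36\) visit every residue 1–36 exactly once, so doubling them yields every even number from 2 to 72 in a stride-free order (with 0 appended at the end). A short Python sketch:

```python
def even_line_sequence():
    # 17 is a primitive root mod 37, so pow(17, i, 37) for i = 1..36
    # is a permutation of 1..36; doubling gives the evens 2..72.
    seq = [pow(17, i, 37) * 2 for i in range(1, 37)]
    seq.append(0)
    return seq

lines = even_line_sequence()
assert lines[:6] == [34, 60, 58, 24, 38, 54]   # as listed in the text
assert lines[-2:] == [2, 0]
assert sorted(lines) == list(range(0, 74, 2))  # every even number 0..72, once
```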

This strategy fails on the Intel Core i3/i5/i7 since it incorporates a more aggressive prefetcher that tracks and remembers the forward and backward strides of the 16 most recently accessed 4-KB pages [9]. Hence, we modified the spy code to access 32 randomly selected pages between two consecutive accesses to the AES tables. Our spy program is versatile enough to handle the prefetchers in both Intel Core 2 Duo and Intel Core i3/i5/i7.

3.3 Key retrieval strategy

Key retrieval involves two steps. In the first, the high-order nibble of each byte of the AES key is obtained and in the second, the low-order nibbles are deduced. The two steps use inputs to the First and the Second Rounds; hence, they are referred to as First and Second Round Attacks, respectively.

Each round uses 16 equations. The LHS of an equation is the input to a round (except for the first round, this is also the output of the previous round). As mentioned in Section 3.1, the LHS can also be thought of as an index into an AES table. The RHS of an equation involves bytes of the ciphertext and the key. From the RHS of each equation, we identify a group of bits in the AES key (a sub-key) whose true value we attempt to determine. The high-order nibble of the table index is the cache line number. The information provided by the spy threads is sets of line numbers. A preprocessing step converts this input into guesstimates of each cache line number accessed by the victim. Let \(x_{i,t,r,d}\) be the table index and \(G_{i,t,r,d}\) be the set of guesstimates of the line number corresponding to the \(i^{\mathrm{th}}\) access to the \(t^{\mathrm{th}}\) table in Round r of Decryption d.

Histograms are composed, one per equation; m sets of Bernoulli trials are conducted per histogram. Within a set, each sub-key value is assigned a score, 0 or 1 (success), which is determined as follows. The RHS of the equation is computed by substituting the sub-key value and ciphertext. If the high-order nibble of the computed byte matches a line number in the corresponding set \(G_{i,t,r,d}\), the outcome is success. The sub-key value with the highest cumulative score is declared the winner and is assumed to be the actual sub-key.
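The scoring within one set of Bernoulli trials can be sketched as follows for a first-round equation, where the table index is simply ciphertext XOR key. The helper name and the toy data below are ours, introduced only for illustration:

```python
def score_candidates(cipher_bytes, guesstimate_sets):
    """cipher_bytes[d] and guesstimate_sets[d]: one entry per decryption d.
    A candidate high-order key nibble k_hi scores a success when the line
    number it implies, k_hi XOR c_hi, lies in the guesstimate set."""
    hist = [0] * 16                      # one bin per candidate nibble value
    for c, G in zip(cipher_bytes, guesstimate_sets):
        c_hi = c >> 4
        for k_hi in range(16):
            if (k_hi ^ c_hi) in G:       # implied line number seen by the spy
                hist[k_hi] += 1
    return hist

# Toy run: true key nibble 0xA, three noise-free decryptions whose
# guesstimate sets contain only the truly accessed line number.
true_k = 0xA
cs = [0x3F, 0xC2, 0x71]
Gs = [{true_k ^ (c >> 4)} for c in cs]
hist = score_candidates(cs, Gs)
assert hist[true_k] == 3          # the correct value wins every trial
assert hist.count(3) == 1         # and it is the unrivalled top scorer
```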

Let \(X_{c}\) and \(X_{i}, 1 \le i \le 2^s, i \ne c\), respectively, denote the random variables associated with the scores of correct and incorrect values of the sub-key and let \(p_c\) and \(p_{in}\) denote their success probabilities. Here \(2^s\) is the number of possible values of the s-bit sub-key. These variables are binomially distributed, i.e., \(X_{c} \sim B(p_c, m)\) and \(X_{i} \sim B(p_{in}, m)\).

Lemma 1

The probability \(P_h(p_c, p_{in}, 2^s, m)\) that the score of the correct s-bit sub-key value is greater than those of all others after m Bernoulli trials is

$$\begin{aligned} \sum _{n=0}^{m-1} \left\{ \left[ P(X_{c} = n+1) \right] \left[ P(X_{i} \le n)\right] ^{2^{s}-1} \right\} . \end{aligned}$$

The next two sections describe the algorithms for the First and Second Round Attacks and include results from both experiments and a model.

4 First Round Attack

We first show how guesstimates are computed and then used in the algorithms for the First Round Attack.

4.1 Obtaining the guesstimates

Let \(c_{0,j}, c_{1,j}, \ldots , c_{15,j}\) denote the 16 bytes of block j of the ciphertext and let \(k_0, k_1, \ldots , k_{15}\) be the bytes of the 128-bit AES key. Before the start of the First Round, the ciphertext is XORed with the key, so the table indices are computed as

$$\begin{aligned} x_{i,t,1,d} = c_{t+4i,d} \oplus k_{t+4i}, \quad 0 \le i, t \le 3, \quad 1 \le d \le \delta . \end{aligned}$$
(5)

In the First Round Attack, we obtain the high-order nibbles of the AES key from the ciphertext and table line numbers as in

$$\begin{aligned} k'_{t+4i} = x'_{i,t,1,d} \oplus c'_{t+4i,d}, \quad 0 \le i, t \le 3, 1 \le d \le \delta \end{aligned}$$
(6)

where \(x'\) and \(x''\) denote, respectively, the high-order and low-order nibble of byte x. The first step in the attack is to obtain the 16 sets of guesstimates.
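Equations (5) and (6) admit a one-line worked instance: the first-round table index is ciphertext XOR key, so once the spy pins down the line number (the high-order nibble of the index), XORing it with the ciphertext's high-order nibble recovers the key's high-order nibble. The byte values below are illustrative:

```python
k, c = 0x5B, 0xE3                  # example key byte and ciphertext byte
x = c ^ k                          # Eq. (5): table index for this access
x_hi = x >> 4                      # line number observable by the spy
k_hi_recovered = x_hi ^ (c >> 4)   # Eq. (6)
assert k_hi_recovered == k >> 4    # the key's high-order nibble
```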

Figure 1 Sequence of table line numbers accessed.

Table 2 Cache line numbers reported by consecutive spy threads.

Figure 1 shows the exact sequence of line numbers accessed during the first two rounds of a decryption for a given key and ciphertext block. Table 2 shows the corresponding line numbers accessed by V in the first 10 runs as reported by the spy threads. For clarity of viewing, we represent line numbers in the range 0–63. Lines 0–15 are in \(T_{0}\), 16–31 are in \(T_{1}\), 32–47 are in \(T_{2}\) and 48–63 are in \(T_{3}\). Line 32 is the \(4^{\mathrm{th}}\) even line number access in Round 2 but it appears early on in Run 3, potentially leading to the erroneous conclusion that it occurs in Round 1. Also, the same line number may appear in multiple consecutive runs even though it has been accessed just once.

A typical scenario encountered by V and the spy threads is as follows. During a run, V suffers one or more cache misses since all AES tables are flushed out by the previously executing spy thread. Modern processors employ out-of-order execution so that, in the event of a memory stall, later instructions are executed assuming no data dependences. Table access instructions further upstream cause more cache misses and requests to main memory. The first few missed lines are fetched into cache and processed by the CPU while a few others are placed in cache but not processed. Just then, V is preempted. When the next spy thread is scheduled, it reports the presence of several lines in cache. Of these, only the first few have actually been processed, so the remainder of those fetched will have to be re-fetched during the next run of V and so on, thus appearing superfluously in multiple runs of the spy input.

We attempt to eliminate redundancies in the spy input by deriving two auxiliary tables – an Addition Table and a Deletion Table (table 2). In the former, we include, for Run i, only the line numbers appearing in Run i but not Run \(i-1\) of the spy input. In the Deletion Table, we include only line numbers missing in Run i but present in Run \(i-1\) of the spy input.
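The derivation of the two auxiliary tables can be sketched as a set difference between consecutive runs. The function name and the run data below are hypothetical, not taken from our spy input:

```python
def addition_deletion(runs):
    """runs[i]: set of line numbers reported by the spy after run i.
    Addition Table: lines newly seen relative to the previous run.
    Deletion Table: lines that disappeared since the previous run."""
    add, delete = [set(runs[0])], [set()]
    for prev, cur in zip(runs, runs[1:]):
        add.append(set(cur) - set(prev))
        delete.append(set(prev) - set(cur))
    return add, delete

# Hypothetical spy input for four consecutive runs:
runs = [{8, 14, 16}, {8, 12, 14, 18}, {12, 28, 32}, {28, 44}]
add, delete = addition_deletion(runs)
assert add[1] == {12, 18}          # lines new in run 2
assert delete[2] == {8, 14, 18}    # lines dropped between runs 2 and 3
```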

The challenge is that the spies provide us with only the even line numbers and, even then, we are not sure of their exact positions. Moreover, a round is not aligned on a run boundary, so we cannot precisely identify the point at which the First Round accesses end and the Second Round begins. One simple strategy is to include the fewest rows from the Addition Table at the start of a decryption such that the union of line numbers in those rows has cardinality 8 or more. The choice of 8 is because a round involves 16 table accesses; of these, 8 accesses, on average, would be to even line numbers.

Returning to table 2, we include the first two rows of the Addition Table to give us a total of 10 line numbers. From this, we compute the guesstimates as follows:

$$\begin{aligned} \begin{aligned} G_{i,0,1,d}&\,=\,\{ 8, 12, 14 \} \\ G_{i,1,1,d}&\,=\,\{ 16, 18 \} \\ G_{i,2,1,d}&\,=\,\{ 36, 38, 42 \} \\ G_{i,3,1,d}&\,=\,\{ 50,60\} \end{aligned} \quad 0 \le i \le 3. \end{aligned}$$

Note that line numbers 36 and 50 in \(G_{i,2,1,d}\) and \(G_{i,3,1,d}\), respectively, were incorrectly reported by the spy threads (they do not occur in the trace of Round 1 accesses in figure 1). We refer to such spurious accesses as false positives. Also, line number 28 was accessed but does not appear in \(G_{i,1,1,d}\). Each missing element in a set of guesstimates is treated as a false negative with respect to the line number being guessed. Line 28 appears in the third row of the Addition Table. If we included the elements in this run to our guesstimates, then lines 32 and 44 (which are false positives) would also have to be added to \(G_{i,2,1,d}\).

False positives and false negatives could be due to errors in the measurements made by the spy threads – for example, lines 36 and 50 are two such errors. On the other hand, they could be due to our preprocessing strategy being either too conservative or too liberal. With only two runs of the Addition Table, we excluded line 28. However, with three runs, we eliminated the false negative at the expense of two more false positives. Thus, there is, in general, a delicate tradeoff between reducing the number of false positives and reducing the number of false negatives.

4.2 Key recovery algorithm

Having computed the guesstimates, we next introduce an algorithm to recover the high-order nibbles of each byte of the AES key given \(\delta \) blocks of ciphertext.

Algorithm 1 builds 16 histograms, one per high-order nibble of the AES key. We start with histograms 0, 4, 8 and 12 constructed from the Guesstimates \(G_{i,0,1,d}, 0\le i\le 3\), containing line numbers of \(T_{0}\). After \(\delta \) decryptions (a decryption is associated with a Bernoulli trial for each of the 64 nibble values, 16 per histogram), we identify the nibble value across all four histograms with the highest score. In the event of a tie, the value with the most convincing lead over the others in that histogram is selected. We declare it to be the true value of that nibble and the histogram bearing the winner is set aside. Let the winner be value y in \(hist_{4m}\), so \(k'_{4m}=y\). We then compute the corresponding line numbers accessed in \(T_0\) using Eq. (6). Line number \(y \oplus c'_{4m,1}\) is deleted from Guesstimates \(G_{n,0,1,1}\) and \(y \oplus c'_{4m,2}\) is deleted from \(G_{n,0,1,2}\) and so on, where \(0\le n\le 3\) and \(n \ne m\).

This procedure is repeated for the remaining three histograms (Line 4 of Algorithm 1), then for the remaining two and finally for the last histogram. At this point, key nibbles \(k'_{4m}, 0\le m\le 3\), are obtained. This is repeated for the histograms corresponding to each of the other tables (Line 1 of Algorithm 1) to retrieve all high-order nibbles of the key.
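The scoring step underlying this procedure is compact enough to sketch in code. The Python fragment below is illustrative only (the function name and data layout are ours, not the paper's implementation); it scores the 16 candidate values of one high-order key nibble, with the predicted line number computed as in Eq. (6).

```python
from collections import defaultdict

def round1_scores(G, c_hi, delta):
    """Score the 16 candidate high-order nibbles of one key byte.

    G[d]    -- guesstimate set (line numbers of T_0) for decryption d
    c_hi[d] -- high-order nibble of the relevant ciphertext byte
    """
    hist = defaultdict(int)
    for y in range(16):              # candidate values of the key nibble
        for d in range(delta):       # one Bernoulli trial per decryption
            if y ^ c_hi[d] in G[d]:  # predicted T_0 line number (Eq. (6))
                hist[y] += 1
    return hist

# Toy data: the correct nibble is 5 and the spy reports perfect
# one-element guesstimate sets, so candidate 5 scores on every trial.
c_hi = [3, 7, 1, 15, 0]
G = [{5 ^ c} for c in c_hi]
hist = round1_scores(G, c_hi, delta=5)
```

After \(\delta \) decryptions the top scorer is declared the winner; ties are broken by the lead over the runner-up, as described above.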

4.3 Analysis and results

We next analyse the performance of Algorithm 1 as a function of the false positive and false negative rates.

Consider key nibbles \(k'_{4m}, \,0\le m\le 3\). We refine the set of guesstimates after the retrieval of each of the four key nibbles considered. After every refinement, the average cardinality of the set of guesstimates decreases and the average false negative rate increases, since some line numbers may be accessed more than once in a round.

Let \(f_q\) be the average false negative rate after q refinements to the set of guesstimates. It can be thought of as the probability that the line number accessed is not included in the set of guesstimates. The average cardinality of the set of guesstimates after q refinements, denoted as \(\mid G \mid _{q}\), is a measure of the false positive rate per element accessed. Let \(p_c\) and \(p_{in}\), respectively, denote the probabilities of the correct and incorrect nibbles in a histogram receiving a boost. Hence

$$\begin{aligned} p_c = 1-f_q, \quad 0 \le q \le 3 \end{aligned}$$
(7)

and

$$\begin{aligned} p_{in} = f_q\left( \frac{\mid G \mid _{q}}{15}\right) +(1-f_q)\left( \frac{\mid G \mid _{q}-1}{15}\right) . \end{aligned}$$
(8)

Equation (8) follows from the fact that the probability that an incorrect nibble receives a boost is \(\frac{\mid G \mid _{q}}{15}\) in case of a false negative, and \(\frac{\mid G \mid _{q}-1}{15}\) otherwise.

The probability that the correct value of a key nibble is the unrivalled top scorer in a histogram is computed from \(P_{h}(p_c, p_{in}, 2^s, m )\) in Lemma 1 with \(m = \delta \) (the number of decryptions), \(2^{s} = 16\) (the number of possible values of a key nibble) and success probabilities \(p_{c}\) and \(p_{in}\) obtained from Eqs. (7) and (8), respectively. Lines 12 and 13 of Algorithm 1 first attempt to find an unrivalled top scorer in at least one of the four histograms, \(hist_{4m}, 0 \le m \le 3\). The probability that this succeeds is \(1-(1 - P_{h}(p_c, p_{in}, 2^4, \delta ))^{4}\).
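Since Lemma 1 is not reproduced here, \(P_h\) can be approximated by a small Monte Carlo simulation of the histogram model. The sketch below is our own; the rates \(p_c = 0.45\) and \(p_{in} = 0.11\) are merely illustrative values consistent with the error rates reported later in this section.

```python
import random

def p_top_scorer(p_c, p_in, n_vals, delta, trials=2000, seed=1):
    """Monte Carlo estimate of P_h: the probability that the correct
    candidate is the unrivalled top scorer among n_vals candidates
    after delta Bernoulli trials."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct = sum(rng.random() < p_c for _ in range(delta))
        best_wrong = max(sum(rng.random() < p_in for _ in range(delta))
                         for _ in range(n_vals - 1))
        wins += correct > best_wrong
    return wins / trials

# Probability that at least one of the four histograms yields an
# unrivalled winner: 1 - (1 - P_h)^4.
p = p_top_scorer(p_c=0.45, p_in=0.11, n_vals=16, delta=20)
print(1 - (1 - p) ** 4)
```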

The success probability of Algorithm 1 is given in the following theorem.

Theorem 1

The probability of successfully recovering all 16 high-order nibbles of the AES key in \(\delta \) decryptions is

$$\begin{aligned} \left\{ \ \prod _{q=0}^{3} \left[ 1-[1- {\mathcal {P}}_{h}(f_q, \mid G \mid _{q}, 2^4,\delta )]^{4-q} \right] \right\} ^4 \end{aligned}$$

where \({\mathcal {P}}_{h}(f_q, \mid G \mid _{q}, 2^4, \delta ) = P_{h}(p_c, p_{in}, 2^4,\delta )\); \(f_q\) and \(\mid G \mid _{q}\) are derived in Appendix I.

We performed experiments on an Intel Core i3 running Debian 8, Linux kernel 3.18 with OpenSSL version 1.0.2a. We generated 1000 samples – each sample comprises a randomly selected decryption key together with 50 random blocks of ciphertext. We ran each sample with 200 spy threads and created the set of guesstimates from the spy input. Using Algorithm 1, we attempted to deduce the AES key with a varying number of ciphertext blocks.

Figure 2. Number of successes per 1000 samples (Round 1).

Figure 2 shows the number of successes in 1000 samples. By success, we mean that all 16 high-order nibbles are recovered. In some cases, we recovered 15 nibbles successfully but a tie occurred in determining the last nibble. All of these cases were, however, easily resolved by closer inspection of the spy input, so we counted them as successes. Unsuccessful cases typically involved one or a few incorrectly guessed nibble values. With 16 decryptions, the success rate is 74%; it increases to 86% and 96% with 20 and 26 decryptions, respectively. For comparison, we included the unrefined version of Algorithm 1, which works on each histogram independently without updating the guesstimates. As figure 2 shows, the performance of the unrefined algorithm is considerably inferior to that of the refined one.


We also ran the key retrieval algorithm on the same samples but assuming ideal guesstimates. \(G_{i,t,1,d}\) is a set of ideal guesstimates if it contains line numbers only from \(T_t\) accessed in Round 1 of decryption d, and its only false negatives are the odd-numbered cache lines accessed in Round 1. This is an ideal experiment – one in which no measurement error is committed by the spy threads and in which the end of the First Round is unambiguously identified. \(\mid G \mid _{0}\) and \(f_{0}\) in the ideal case are, respectively, 1.82 and 0.5. With experimentally obtained spy inputs and the preprocessing strategy employed in deriving guesstimate sets, \(\mid G \mid _{0}\) and \(f_{0}\) increase to 2.15 and 0.55, respectively. As figure 2 shows, the success rate in the ideal case is nearly 100% with 14 decryptions. Thus the errors, whether induced by lapses in spy measurements or inherent in the preprocessing strategy, are seriously detrimental to performance.

5 Second Round Attack

The goal of the Second Round Attack is to obtain the low-order nibble of each byte of the key.

5.1 Theoretical underpinnings

From the algorithm used to generate round keys and by tracking how the input to Round 1 gets transformed to its output, we derived 16 equations shown here. They relate the second round inputs to the ciphertext and to various bytes of the key.

In the LHS of Eqs. (9)–(24), \(x_j\) represents an element in table \(T_t\). For simplicity, we use a single subscript instead of the \(x_{i,t,r,d}\) used earlier; here, the subscript j equals \(t+4i\). We use Round 2 accesses, so \(r=2\). These equations hold for every decryption. In addition, we write \(c_{t+4i}\) in lieu of \(c_{t+4i,j}\) (the second subscript, the ciphertext block number, is suppressed for brevity). Finally, \(s^{-1}\) is the inverse substitution function of AES.

$$\begin{aligned} x_{0}&=0e\, \bullet \, s^{-1} (c_{0} \, \oplus \, k_{0} )\oplus 0b\bullet s^{-1} (c_{13} \oplus k_{13} ) \nonumber \\&\quad \oplus \,0d\bullet s^{-1} (c_{10} \oplus k_{10} )\oplus 09\bullet s^{-1} (c_{7} \oplus k_{7} )\nonumber \\&\quad \oplus \, 0e\bullet (k_{0}\oplus s(k_{9}\oplus k_{13}) \oplus 36)\oplus 0b\bullet (k_{1}\oplus s(k_{10}\oplus k_{14}))\nonumber \\&\quad \oplus \,0d\bullet (k_{2} \oplus s(k_{11} \oplus k_{15} ))\oplus 09\bullet (k_{3} \oplus s(k_{8} \oplus k_{12} )), \end{aligned}$$
(9)
$$\begin{aligned} x_{1}&=09\bullet s^{-1} (c_{0} \oplus k_{0} )\oplus 0e\bullet s^{-1} (c_{13} \oplus k_{13} )\nonumber \\&\quad \oplus \,0b\bullet s^{-1} (c_{10} \oplus k_{10} )\oplus 0d\bullet s^{-1} (c_{7} \oplus k_{7} )\nonumber \\&\quad \oplus \,09\bullet (k_{0}\oplus s(k_{9}\oplus k_{13}) \oplus 36)\oplus 0e\bullet (k_{1}\oplus s(k_{10}\oplus k_{14}))\nonumber \\&\quad \oplus \, 0b\bullet (k_{2} \oplus s(k_{11} \oplus k_{15}))\oplus 0d\bullet (k_{3} \oplus s(k_{8}\oplus k_{12})), \end{aligned}$$
(10)
$$\begin{aligned} x_{2}&=0d\bullet s^{-1} (c_{0} \oplus k_{0})\oplus 09\bullet s^{-1} (c_{13} \oplus k_{13} )\nonumber \\&\quad \oplus \, 0e\bullet s^{-1} (c_{10} \oplus k_{10} ) \oplus 0b\bullet s^{-1} (c_{7} \oplus k_{7} )\, \nonumber \\&\quad \oplus \,0d\bullet (k_{0}\oplus s(k_{9}\oplus k_{13}) \oplus 36)\oplus 09\bullet (k_{1}\oplus s(k_{10}\oplus k_{14}))\nonumber \\&\quad \oplus \, 0e\bullet (k_{2} \oplus s(k_{11} \oplus k_{15} )) \oplus 0b\bullet (k_{3} \oplus s(k_{8}\oplus k_{12})), \end{aligned}$$
(11)
$$\begin{aligned} x_{3}&=0b\bullet s^{-1} (c_{0} \oplus k_{0} )\oplus 0d\bullet s^{-1} (c_{13} \oplus k_{13} )\nonumber \\&\quad \oplus \, 09\bullet s^{-1} (c_{10} \oplus k_{10} ) \oplus 0e\bullet s^{-1} (c_{7} \oplus k_{7} )\, \nonumber \\&\quad \oplus \,0b\bullet (k_{0}\oplus s(k_{9}\oplus k_{13}) \oplus 36)\oplus 0d\bullet (k_{1}\oplus s(k_{10}\oplus k_{14}))\nonumber \\&\quad \oplus \, 09\bullet (k_{2} \oplus s(k_{11} \oplus k_{15} )) \oplus 0e\bullet (k_{3} \oplus s(k_{8} \oplus k_{12})), \end{aligned}$$
(12)
$$\begin{aligned} x_{4}&=0e\bullet s^{-1} (c_{4} \oplus k_{4} )\oplus 0b\bullet s^{-1} (c_{1} \oplus k_{1} )\nonumber \\&\quad \oplus \,0d\bullet s^{-1} (c_{14} \oplus k_{14} ) \oplus 09\bullet s^{-1} (c_{11} \oplus k_{11} )\,\nonumber \\&\quad \oplus \, 0e\bullet (k_{0} \oplus k_{4} ) \oplus 0b\bullet (k_{1} \oplus k_{5} ) \oplus 0d\bullet (k_{2} \oplus k_{6} )\nonumber \\&\quad \oplus \, 09\bullet (k_{3} \oplus k_{7} ), \end{aligned}$$
(13)
$$\begin{aligned} x_{5}&=09\bullet s^{-1} (c_{4} \oplus k_{4} )\oplus 0e\bullet s^{-1} (c_{1} \oplus k_{1} )\nonumber \\&\quad \oplus \, 0b\bullet s^{-1} (c_{14} \oplus k_{14} ) \oplus 0d\bullet s^{-1} (c_{11} \oplus k_{11} )\, \nonumber \\&\quad \oplus \, 09\bullet (k_{0} \oplus k_{4} ) \oplus 0e\bullet (k_{1} \oplus k_{5} ) \oplus 0b\bullet (k_{2} \oplus k_{6} )\nonumber \\&\quad \oplus \, 0d\bullet (k_{3} \oplus k_{7} ), \end{aligned}$$
(14)
$$\begin{aligned} x_{6}&=0d\bullet s^{-1} (c_{4} \oplus k_{4} )\oplus 09\bullet s^{-1} (c_{1} \oplus k_{1} )\nonumber \\&\quad \oplus \, 0e\bullet s^{-1} (c_{14} \oplus k_{14} )\oplus 0b\bullet s^{-1} (c_{11} \oplus k_{11} )\, \nonumber \\&\quad \oplus \, 0d\bullet (k_{0} \oplus k_{4} )\oplus 09\bullet (k_{1} \oplus k_{5} ) \oplus 0e\bullet (k_{2} \oplus k_{6} )\nonumber \\&\quad \oplus \, 0b\bullet (k_{3} \oplus k_{7} ), \end{aligned}$$
(15)
$$\begin{aligned} x_{7}&= 0b\bullet s^{-1} (c_{4} \oplus k_{4} )\oplus 0d\bullet s^{-1} (c_{1} \oplus k_{1} )\nonumber \\&\quad \oplus \, 09\bullet s^{-1} (c_{14} \oplus k_{14} ) \oplus 0e\bullet s^{-1} (c_{11} \oplus k_{11} )\, \nonumber \\&\quad \oplus \, 0b\bullet (k_{0} \oplus k_{4} )\oplus 0d\bullet (k_{1} \oplus k_{5} ) \oplus 09\bullet (k_{2} \oplus k_{6} )\nonumber \\&\quad \oplus \, 0e\bullet (k_{3} \oplus k_{7} ), \end{aligned}$$
(16)
$$\begin{aligned} x_{8}&=0e\bullet s^{-1} (c_{8} \oplus k_{8} )\oplus 0b\bullet s^{-1} (c_{5} \oplus k_{5} )\nonumber \\&\quad \oplus \, 0d\bullet s^{-1} (c_{2} \oplus k_{2} ) \oplus 09\bullet s^{-1} (c_{15} \oplus k_{15} )\, \nonumber \\&\quad \oplus \, 0e\bullet (k_{4} \oplus k_{8} )\oplus 0b\bullet (k_{5} \oplus k_{9} )\oplus 0d\bullet (k_{6} \oplus k_{10} )\nonumber \\&\quad \oplus \, 09\bullet (k_{7} \oplus k_{11} ), \end{aligned}$$
(17)
$$\begin{aligned} x_{9}&=\,09\bullet s^{-1} (c_{8} \oplus k_{8} )\oplus 0e\bullet s^{-1} (c_{5} \oplus k_{5} )\nonumber \\&\quad \oplus \, 0b\bullet s^{-1} (c_{2} \oplus k_{2} ) \oplus 0d\bullet s^{-1} (c_{15} \oplus k_{15} )\,\nonumber \\&\quad \oplus \, 09\bullet (k_{4} \oplus k_{8} )\oplus 0e\bullet (k_{5} \oplus k_{9} ) \oplus 0b\bullet (k_{6} \oplus k_{10} )\nonumber \\&\quad \oplus \, 0d\bullet (k_{7} \oplus k_{11} ), \end{aligned}$$
(18)
$$\begin{aligned} x_{10}&=\,0d\bullet s^{-1} (c_{8} \oplus k_{8} )\oplus 09\bullet s^{-1} (c_{5} \oplus k_{5} )\nonumber \\&\quad \oplus \, 0e\bullet s^{-1} (c_{2} \oplus k_{2} ) \oplus 0b\bullet s^{-1} (c_{15} \oplus k_{15} )\, \nonumber \\&\quad \oplus \, 0d\bullet (k_{4} \oplus k_{8} )\oplus 09\bullet (k_{5} \oplus k_{9} ) \oplus 0e\bullet (k_{6} \oplus k_{10} )\nonumber \\&\quad \oplus \, 0b\bullet (k_{7} \oplus k_{11} ), \end{aligned}$$
(19)
$$\begin{aligned} x_{11}&=\,0b\bullet s^{-1} (c_{8} \oplus k_{8} )\oplus 0d\bullet s^{-1} (c_{5} \oplus k_{5} )\nonumber \\&\quad \oplus \, 09\bullet s^{-1} (c_{2} \oplus k_{2} ) \oplus 0e\bullet s^{-1} (c_{15} \oplus k_{15} )\, \nonumber \\&\quad \oplus \, 0b\bullet (k_{4} \oplus k_{8} )\oplus 0d\bullet (k_{5} \oplus k_{9} ) \oplus 09\bullet (k_{6} \oplus k_{10} )\nonumber \\&\quad \oplus \, 0e\bullet (k_{7} \oplus k_{11} ), \end{aligned}$$
(20)
$$\begin{aligned} x_{12}&=\,0e\bullet s^{-1} (c_{12} \oplus k_{12} )\oplus 0b\bullet s^{-1} (c_{9} \oplus k_{9} )\nonumber \\&\quad \oplus \, 0d\bullet s^{-1} (c_{6} \oplus k_{6} ) \oplus 09\bullet s^{-1} (c_{3} \oplus k_{3} )\,\nonumber \\&\quad \oplus \, 0e\bullet (k_{8} \oplus k_{12} )\oplus 0b\bullet (k_{9} \oplus k_{13} ) \oplus 0d\bullet (k_{10} \oplus k_{14} )\nonumber \\&\quad \oplus \, 09\bullet (k_{11} \oplus k_{15} ), \end{aligned}$$
(21)
$$\begin{aligned} x_{13}&=\,09\bullet s^{-1} (c_{12} \oplus k_{12} )\oplus 0e\bullet s^{-1} (c_{9} \oplus k_{9} )\nonumber \\&\quad \oplus \, 0b\bullet s^{-1} (c_{6} \oplus k_{6} ) \oplus 0d\bullet s^{-1} (c_{3} \oplus k_{3} )\,\nonumber \\&\quad \oplus \, 09\bullet (k_{8} \oplus k_{12} )\oplus 0e\bullet (k_{9} \oplus k_{13} ) \oplus 0b\bullet (k_{10} \oplus k_{14} )\nonumber \\&\quad \oplus \, 0d\bullet (k_{11} \oplus k_{15} ), \end{aligned}$$
(22)
$$\begin{aligned} x_{14}&=\,0d\bullet s^{-1} (c_{12} \oplus k_{12} )\oplus 09\bullet s^{-1} (c_{9} \oplus k_{9} )\nonumber \\&\quad \oplus \, 0e\bullet s^{-1} (c_{6} \oplus k_{6} ) \oplus 0b\bullet s^{-1} (c_{3} \oplus k_{3} )\,\nonumber \\&\quad \oplus \, 0d\bullet (k_{8} \oplus k_{12} )\oplus 09\bullet (k_{9} \oplus k_{13} ) \oplus 0e\bullet (k_{10} \oplus k_{14} )\nonumber \\&\quad \oplus \, 0b\bullet (k_{11} \oplus k_{15} ), \end{aligned}$$
(23)
$$\begin{aligned} x_{15}&=\,0b\bullet s^{-1} (c_{12} \oplus k_{12} )\oplus 0d\bullet s^{-1} (c_{9} \oplus k_{9} )\nonumber \\&\quad \oplus \, 09\bullet s^{-1} (c_{6} \oplus k_{6} ) \oplus 0e\bullet s^{-1} (c_{3} \oplus k_{3} )\,\nonumber \\&\quad \oplus \, 0b\bullet (k_{8} \oplus k_{12} )\oplus 0d\bullet (k_{9} \oplus k_{13} ) \oplus 09\bullet (k_{10} \oplus k_{14} )\nonumber \\&\quad \oplus \, 0e\bullet (k_{11} \oplus k_{15} ). \end{aligned}$$
(24)

For ease of explanation, we refer to (9)–(12) as Set-1 equations, (13)–(16) as Set-2, (17)–(20) as Set-3 and (21)–(24) as Set-4 equations. Consider the equations in Set-2, Set-3 and Set-4. Because field multiplication is distributive over field addition, it is possible to split each of the last four terms in the RHS of those equations. Upon rearranging terms, (14), for example, can be re-written as

$$\begin{aligned} \begin{aligned}&x_{5} \oplus \, 09\bullet s^{-1} (c_{4} \oplus k_{4} )\oplus 0e\bullet s^{-1} (c_{1} \oplus k_{1} ) \\&\qquad \oplus \, 0b\bullet s^{-1} (c_{14} \oplus k_{14} ) \oplus 0d\bullet s^{-1} (c_{11} \oplus k_{11} ) \\&\qquad \oplus 09 \bullet ((k_{0} \oplus k_{4} )^{'} 0000) \oplus 0e\bullet ((k_{1} \oplus k_{5} )^{'} 0000) \\&\qquad \oplus 0b\bullet ((k_{2} \oplus k_{6} )^{'} 0000) \oplus \, 0d\bullet ((k_{3} \oplus k_{7} )^{'} 0000)\\&\quad = 09\bullet (0000 \,(k_{0} \oplus k_{4} )^{''}) \oplus 0e\bullet (0000\,(k_{1} \oplus k_{5} )^{''}) \\&\qquad \oplus 0b\bullet (0000\,(k_{2} \oplus k_{6} )^{''}) \oplus \, 0d\bullet (0000 \, (k_{3} \oplus k_{7} )^{''}). \end{aligned} \end{aligned}$$
(25)

Let the RHS of (25) be equal to the byte denoted as (\(u_0\, u_1\,u_2\, u_3 \,\, u_4\, u_5 \,u_6\, u_7\)). Also, let \(a_{0}a_{1}a_{2}z_{0},\, a_{3}a_{4}a_{5}z_{1},\,a_{6}a_{7}a_{8}z_{2}\) and \(a_{9}a_{10}a_{11}z_{3}\) denote the bits of the nibbles \((k_{0}\oplus k_{4})'', (k_{1}\oplus k_{5})'', (k_{2}\oplus k_{6})''\) and \((k_{3}\oplus k_{7})''\), respectively.
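The nibble split used in deriving (25) rests on field multiplication distributing over XOR. A quick exhaustive check with a textbook GF(2^8) multiply (our own helper, not the paper's code):

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial
    x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

# Splitting a byte into high and low nibbles before multiplying gives
# the same product, because field multiplication distributes over XOR.
for c in (0x09, 0x0E, 0x0B, 0x0D):
    for b in range(256):
        hi, lo = b & 0xF0, b & 0x0F
        assert gmul(c, b) == gmul(c, hi) ^ gmul(c, lo)
```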

Table 3 Field multiplications involved in RHS of (25).

Multiplication in a binary field involves shifting and XORing of the multiplicand and, if necessary, reduction of the result by an irreducible polynomial. Table 3 shows the shift operation on RHS terms of (25). The high-order nibble (\(u_0, u_1, u_2, u_3\)) of the sum of the terms on the RHS is

$$\begin{aligned}&u_0 = 0,\\&u_{1}= a_{0} \oplus a_{3} \oplus a_{6} \oplus a_{9}, \\&u_{2}= a_{1} \oplus a_{3} \oplus a_{4} \oplus a_{7} \oplus a_{9} \oplus a_{10}, \\&u_{3}= a_{2} \oplus a_{3} \oplus a_{4} \oplus a_{5} \oplus a_{6} \oplus a_{8} \oplus a_{10} \oplus a_{11}. \end{aligned}$$

From table 3, it is evident that the bits denoted \(z_{0}, z_{1}, z_{2}\) and \(z_{3}\) do not affect the high-order nibble of the resultant byte. It follows that, of the 16 unknown bits corresponding to the four terms on the RHS of (25), only 12 bits determine the high-order nibble on the RHS.
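Both observations are easy to confirm exhaustively. In the sketch below (helper names are our own), toggling a z bit XORs the corresponding product with the coefficient itself, which is below 0x10, so the high-order nibble survives unchanged; the formulas for \(u_0,\ldots,u_3\) are then checked over all \(2^{12}\) values of A.

```python
def gmul(a, b):
    """GF(2^8) multiplication modulo the AES polynomial 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

COEFFS = (0x09, 0x0E, 0x0B, 0x0D)  # coefficient order of Eq. (25)

# (i) Toggling the last bit z of a nibble XORs the product with the
# coefficient itself; every coefficient is below 0x10, so the
# high-order nibble of the product is unaffected.
for c in COEFFS:
    for n in range(16):
        assert gmul(c, n) >> 4 == gmul(c, n ^ 1) >> 4

# (ii) Exhaustive check of u0..u3 against the XOR formulas above.
for bits in range(1 << 12):
    a = [(bits >> (11 - i)) & 1 for i in range(12)]      # a0 .. a11
    nibs = [(a[3*j] << 3) | (a[3*j+1] << 2) | (a[3*j+2] << 1)
            for j in range(4)]                            # z bits = 0
    u = 0
    for c, n in zip(COEFFS, nibs):
        u ^= gmul(c, n)
    assert u >> 7 == 0                                    # u0 = 0
    assert (u >> 6) & 1 == a[0] ^ a[3] ^ a[6] ^ a[9]      # u1
    assert (u >> 5) & 1 == a[1] ^ a[3] ^ a[4] ^ a[7] ^ a[9] ^ a[10]
    assert (u >> 4) & 1 == (a[2] ^ a[3] ^ a[4] ^ a[5] ^ a[6]
                            ^ a[8] ^ a[10] ^ a[11])       # u3
```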

The operations involved in calculating \(u_1,\,u_2\) and \(u_3\) can be represented using the matrix equation

$$\begin{aligned}&\begin{pmatrix} u_1\\ u_2\\ u_3 \end{pmatrix} = M_{9ebd} \bullet A \nonumber \\ \end{aligned}$$
(26)

where

$$\begin{aligned} M_{9ebd}=\begin{pmatrix} 1&0&0&1&0&0&1&0&0&1&0&0\\ 0&1&0&1&1&0&0&1&0&1&1&0\\ 0&0&1&1&1&1&1&0&1&0&1&1 \end{pmatrix} \end{aligned}$$
$$\begin{aligned} \text{and}\quad A =(a_{0}\,\,a_{1}\,\,a_{2}\,\,a_{3}\,\,a_{4}\,\,a_{5}\,\,a_{6}\,\,a_{7}\,\,a_{8}\,\,a_{9}\,\,a_{10}\,\,a_{11})^{T}. \end{aligned}$$

The subscript “9ebd” reflects the order of coefficients in the RHS of (25).

Let \({\mathbb {F}}_{2}^{12}\) represent a 12-dimensional binary vector space. We define 8 equivalence classes as follows:

$$\begin{aligned} C_{9ebd}\left( (u_{1}, u_{2}, u_{3})^{T}\right) = \left\{ A \in {\mathbb {F}}_{2}^{12} : M_{9ebd} \bullet A = \begin{pmatrix}u_1\\ u_2\\ u_3\end{pmatrix} \right\} . \end{aligned}$$
(27)

\(M_{9ebd}\) is in row canonical form with pivots in the first 3 columns. Hence, \(a_{0}, a_{1}, a_{2}\) are pivot variables and the remaining nine are free variables in A. Thus, each equivalence class contains \(2^{9} = 512\) sub-keys, with class representative \((u_{1}u_{2}u_{3} \, 000 \, 000 \, 000) \). This leads to the following theorem.

Theorem 2

\({\mathbb {F}}_{2}^{12}\) can be partitioned into 8 equivalence classes based on (27), each containing 512 sub-keys. The class representatives are \((\left\{ 0,1 \right\} ^{3} \, 000 \, 000 \, 000)\).

This theorem implies that instead of validating all the \(2^{12}\) sub-keys, it is sufficient to validate only the class representatives.
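Theorem 2 can be verified mechanically by enumerating \({\mathbb {F}}_{2}^{12}\); the short sketch below (helper names are our own) counts the classes induced by \(M_{9ebd}\).

```python
from collections import Counter

M_9EBD = [
    [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1],
]

def mul_mod2(M, A):
    """Matrix-vector product over GF(2)."""
    return tuple(sum(m * a for m, a in zip(row, A)) % 2 for row in M)

# Group all 2^12 vectors by their image (u1, u2, u3).
classes = Counter()
for bits in range(1 << 12):
    A = [(bits >> (11 - i)) & 1 for i in range(12)]
    classes[mul_mod2(M_9EBD, A)] += 1
```

Since the matrix has rank 3, the enumeration yields exactly 8 classes of 512 sub-keys each.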

The coefficients of the ciphertext-independent terms in the RHS of (13), (15) and (16) are shifted versions of those in (14). Analogous to \(M_{9ebd}\), we define \(M_{ebd9}, M_{bd9e}\) and \(M_{d9eb}\). These matrices are obtained in a manner similar to \(M_{9ebd}\) (table 3) and are column-shifted versions of \(M_{9ebd}\). Moreover, they are linearly related as follows:

$$\begin{aligned}&M_{ebd9}\, = \, M_1 \bullet M_{9ebd} \end{aligned}$$
(28)
$$\begin{aligned}&M_{d9eb} = M_2 \bullet M_{9ebd} \end{aligned}$$
(29)
$$\begin{aligned}&M_{bd9e} = M_3 \bullet M_{9ebd} \end{aligned}$$
(30)
$$\begin{aligned} \text {where}\;\; M_1=\begin{pmatrix} 1&{}0&{}0\\ 1&{}1&{}0\\ 1&{}1&{}1 \end{pmatrix},\quad M_2 = \begin{pmatrix} 1&{}0&{}0\\ 1&{}1&{}0\\ 0&{}1&{}1 \end{pmatrix}\quad {\mathrm{and}}\;\; M_3=\begin{pmatrix} 1&{}0&{}0\\ 0&{}1&{}0\\ 1&{}0&{}1 \end{pmatrix}. \end{aligned}$$
(31)

Let \(A_{1} \in C_{9ebd}\left( \left( u_{1}, u_{2}, u_{3} \right) ^{T} \right) \). From (27)

$$\begin{aligned} M_{9ebd} \bullet A_{1} = \left( u_{1}, u_{2}, u_{3} \right) ^{T}. \end{aligned}$$

Pre-multiplying by \(M_{1}\) on both sides gives

$$\begin{aligned} M_{ebd9} \bullet A_{1} =M_{1} \bullet \left( u_{1}, u_{2}, u_{3} \right) ^{T}. \end{aligned}$$

Hence, \(A_{1} \in C_{ebd9}\left( M_{1} \bullet \left( u_{1}, u_{2}, u_{3} \right) ^{T}\right) \). This leads to the following theorem.

Theorem 3

If \(A_{1} \in C_{9ebd}\left( \left( u_{1}, u_{2}, u_{3} \right) ^{T} \right) \), then

$$\begin{aligned}&A_{1} \in C_{ebd9}\left( M_{1} \bullet \left( u_{1}, u_{2}, u_{3} \right) ^{T}\right) ,\\&A_{1} \in C_{bd9e} \left( M_{2} \bullet \left( u_{1}, u_{2}, u_{3} \right) ^{T}\right) ,\\&A_{1} \in C_{d9eb} \left( M_{3} \bullet \left( u_{1}, u_{2}, u_{3} \right) ^{T}\right) . \end{aligned}$$

The following is of crucial importance in the design of Algorithm 2.

Corollary 1

If \(A_{1}, A_{2}\) belong to an equivalence class w.r.t. \(M_{9ebd}\), then they also belong to a single equivalence class w.r.t. \(M_{ebd9}\) or \(M_{bd9e}\) or \(M_{d9eb}\).

Thus, \(M_{9ebd}, M_{ebd9}, M_{bd9e}\) and \(M_{d9eb}\), each induce an identical partitioning over \({\mathbb {F}}_{2}^{12}\).
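Relations (28)–(30) can be confirmed by rebuilding each matrix directly from the field multiplication, as in table 3. The sketch below (with our own helper functions) derives all four matrices and checks the three identities over GF(2).

```python
def gmul(a, b):
    """GF(2^8) multiplication modulo the AES polynomial 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def build_M(coeffs):
    """Derive the 3x12 matrix giving (u1, u2, u3): column 3j+k holds
    the high-nibble contribution of the nibble bit of weight 8, 4 or 2
    multiplied by coeffs[j]."""
    cols = []
    for c in coeffs:
        for w in (8, 4, 2):
            prod = gmul(c, w)
            cols.append([(prod >> 6) & 1, (prod >> 5) & 1, (prod >> 4) & 1])
    return [[col[r] for col in cols] for r in range(3)]

def matmul2(X, Y):
    """Matrix product over GF(2)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) % 2
             for j in range(len(Y[0]))] for i in range(len(X))]

M_9ebd = build_M((0x09, 0x0E, 0x0B, 0x0D))
M_ebd9 = build_M((0x0E, 0x0B, 0x0D, 0x09))
M_bd9e = build_M((0x0B, 0x0D, 0x09, 0x0E))
M_d9eb = build_M((0x0D, 0x09, 0x0E, 0x0B))

M1 = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
M2 = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]
M3 = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]

assert M_ebd9 == matmul2(M1, M_9ebd)   # Eq. (28)
assert M_d9eb == matmul2(M2, M_9ebd)   # Eq. (29)
assert M_bd9e == matmul2(M3, M_9ebd)   # Eq. (30)
```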

5.2 Algorithm 2: description

The guesstimate sets that are inputs to Algorithm 2 are derived as follows. Let r be the maximum number of runs, counted from the starting run of decryption d, that together contain no more than 8 distinct cache line numbers. Let n be the minimum integer such that the number of distinct line numbers in runs \(r+1, r+2\), ..., \(r+n\) is 8 or more, and let \(\gamma \) be the set containing these line numbers. We then populate \(G_{i,t,2,d},\,\, 0\le i, t\le 3\), with the line numbers from \(\gamma \) in the range 16t to \(16(t+1)-1\). From the guesstimate sets and the ciphertext, we derive the four histograms using Algorithm 2.
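This preprocessing rule can be sketched as follows; the function name and the representation of runs (an ordered list of sets of line numbers) are our own choices, not the paper's implementation.

```python
def round2_guesstimates(runs):
    """Build the Round-2 guesstimate sets for one decryption.

    runs -- ordered list of runs, each a set of cache line numbers (0-63)
    Returns a dict mapping table index t to the guesstimate set for T_t
    (the same set serves G[i][t] for every byte position i).
    """
    # r: longest prefix of runs containing no more than 8 distinct lines.
    seen = set()
    r = 0
    while r < len(runs) and len(seen | runs[r]) <= 8:
        seen |= runs[r]
        r += 1
    # n: shortest stretch of runs after r accumulating 8 or more lines.
    gamma = set()
    j = r
    while j < len(runs) and len(gamma) < 8:
        gamma |= runs[j]
        j += 1
    # Partition gamma by table: T_t occupies lines 16t .. 16(t+1)-1.
    return {t: {x for x in gamma if 16 * t <= x < 16 * (t + 1)}
            for t in range(4)}

# Toy trace: the first two runs stay within 8 distinct lines (r = 2),
# the third run supplies the 8 lines of gamma, all belonging to T_1.
runs = [{0, 2, 4}, {6, 8, 10, 12, 14},
        {16, 18, 20, 22, 24, 26, 28, 30}, {32}]
G = round2_guesstimates(runs)
```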

Algorithm 2 builds three histograms corresponding to three of the sets of equations (Set-2, Set-3 and Set-4). A histogram contains \(2^{19}\) bins, each representing a possible sub-key. A sub-key for Set-2, for example, is formed by concatenating \(k_{4}'', k_{1}'', k_{14}'', k_{11}''\) with a three-bit attribute (lines 14–16 of Algorithm 2). The first four nibbles are the unknowns in the LHS of Eq. (25).


From Theorem 2, the space of unknown key bits in the RHS of Eq. (25) that impacts the high-order nibble of the RHS can be reduced to just eight equivalence classes, each represented by three bits. These three bits comprise the last attribute of the sub-key. Further, from Corollary 1, the equivalence classes and their representatives are identical across the four equations of a set. Hence, the same “sub-key” value participates in four Bernoulli trials per decryption. As in the Round 1 Attack, a trial involves computing the high-order nibble of the byte value of the RHS and checking to see if it matches an element in the corresponding set of guesstimates. If so, the score of that sub-key value is incremented by 1 in the histogram. After \(\delta \) decryptions, we pick the sub-key with the highest score and declare it to be the winner.

The aforementioned procedure is completed for the three histograms and (hopefully) retrieves 12 low-order nibbles of the key. To recover the remaining nibbles, we create a histogram to score each of the \(2^{16}\) values of the sub-key made up of \(k_{0}'', k_{13}'', k_{10}'', k_{7}''\). We use the first set of equations (Set-1) to do so. Once again, we have \(4\delta \) Bernoulli trials but now only \(2^{16}\) possible sub-key values (against \(2^{19}\) in the earlier three histograms). The winner in this histogram is presumed to yield the true values of \(k_{0}'', k_{13}'', k_{10}'', k_{7}''\).

5.3 Analysis and results

As in section 4, \(f_{0}\) and \(\mid G \mid _{0}\) are, respectively, the false negative rate and the average cardinality of the set of guesstimates. Also \(p_c\) and \(p_{in}\) are, respectively, the probabilities of the correct and incorrect sub-keys receiving a boost. Hence

$$\begin{aligned} p_c = 1-f_{0} \end{aligned}$$
(32)

and

$$\begin{aligned} p_{in} = \frac{\mid G \mid _{0}}{16}. \end{aligned}$$
(33)

From Lemma 1, we obtain probabilities of successfully recovering the sub-keys from the four histograms. Combining the four probabilities, we obtain the following theorem.

Theorem 4

The probability of successfully recovering all 16 low-order nibbles of the AES key in \(\delta \) decryptions is

$$\begin{aligned} \left[ P_{h}\left( 1-f_{0},\frac{\mid G \mid _{0}}{16}, 2^{19}, 4\delta \right) \right] ^{3} P_{h}\left( 1-f_{0},\frac{\mid G \mid _{0}}{16}, 2^{16}, 4\delta \right) \end{aligned}$$
Figure 3. Number of successes per 1000 samples (Round 2).

For the Second Round Attack, we used the same 1000 samples and spy input as in the First Round Attack. We refer to an experiment as an unqualified success if the four histograms throw up unique winners, each with at least a four-point lead over its nearest rival, and the four winners collectively yield the correct key. Figure 3 shows the number of unqualified successes per 1000 samples for a varying number of decryptions.

Figure 4. Distribution of the correct sub-key score (yellowish green) and distribution of the maximum score among the incorrect sub-keys (red).

Figure 4 shows the theoretical distribution of the score of the correct sub-key and the distribution of the maximum score among the \(2^{19}\) incorrect sub-keys' scores after 15 and 25 decryptions. To calculate these distributions, we assumed \(f_0 = 0.53\) and \(\mid G \mid _0 = 2.7\) based on our experiments. In our model, the scores of the correct and incorrect sub-keys are binomially distributed with success probabilities \(p_c\) and \(p_{in}\), respectively, calculated using Eqs. (32) and (33). It is also interesting to note that while the distribution of the correct sub-key's score has no skew or excess kurtosis, the distribution of the maximum score among the incorrect sub-keys' scores exhibits positive skew and positive excess kurtosis. Using our model, we estimated the probability of the correct sub-key's score exceeding the highest score among the incorrect sub-keys to be 0.66, 0.88 and 0.97 for 15, 20 and 25 decryptions, respectively.
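Under the stated binomial model, and assuming (as a simplification that Lemma 1 may treat more carefully) that the \(2^{19}\) scores are mutually independent, these probabilities can be reproduced approximately in a few lines of Python; the computed values need not match the quoted ones exactly.

```python
from math import comb

def binom_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(n, p, k):
    return sum(binom_pmf(n, p, i) for i in range(k + 1))

def p_correct_wins(delta, f0=0.53, G0=2.7, n_keys=2**19):
    """P(correct sub-key strictly outscores every incorrect one) after
    4*delta Bernoulli trials, with p_c and p_in from Eqs. (32)-(33),
    assuming independent binomial scores."""
    n = 4 * delta
    p_c, p_in = 1 - f0, G0 / 16
    return sum(binom_pmf(n, p_c, s) *
               binom_cdf(n, p_in, s - 1) ** (n_keys - 1)
               for s in range(n + 1))

for delta in (15, 20, 25):
    print(delta, round(p_correct_wins(delta), 2))
```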

Figure 5 shows the scores of the 40 top-ranked sub-keys in a histogram. The sub-keys are arranged in order of their scores obtained after 25 decryptions. Their scores after 10, 15 and 20 decryptions are also shown. The correct sub-key emerges as the clear winner after only about 22 decryptions but has 6, 20 and 26 sub-keys at the same or higher score after 20, 15 and 10 decryptions, respectively. Hence, to guess the true sub-key value more accurately, we make the following slight modification to Algorithm 2.

Figure 5. Scores of top 40 sub-keys after varying number of decryptions.

If the top scorer in a histogram does not have a lead of at least 4 points over the rest, we identify all sub-key values with scores greater than or equal to the minimum score of the top eight sub-key values. We then construct a histogram for each combination of these top scorers from the first three histograms. Scores for the same \(2^{16}\) sub-key values in each histogram are obtained as before from the Set-1 equations. The final winner for the complete AES key is obtained by selecting the global top scorer over all histograms constructed in the final step. With this modification, the success rate for 25 decryptions, for example, increased from 92% to 96%.

The experimentally obtained average values of \(f_{0}\) and \(\mid G \mid _{0}\) for the 1000 samples were, respectively, 0.53 and 2.71. \(\mid G \mid _{0}\) in the First Round Attack was much lower at 2.15. This is because the first run of a decryption can be unambiguously identified and contains only accesses made in Round 1. This is not true for the first run of Round 2. From Eqs. (32) and (33), the success probabilities \(p_c\) and \(p_{in}\) are, respectively, 0.47 and 0.17. The means of the distributions of the correct and incorrect sub-key scores (equal to \(4\delta p_{c}\) and \(4\delta p_{in}\)) are much further apart than in the First Round Attack (equal to \(\delta p_{c}\) and \(\delta p_{in}\)). This might suggest that the winner in the Second Round Attack can be identified with fewer decryptions.

The number of sub-key values in the histogram for Round 1 is only 16, while it is \(2^{16}\) or \(2^{19}\) for Round 2. Thus, despite the smaller intersection of the tails of the two distributions in Round 2, the sheer number of sub-key values increases the probability that one of the incorrect sub-key values scores higher than the actual sub-key. These two effects seem to balance out, resulting in roughly the same number of decryptions required for the First and Second Round Attacks.

Figure 3 also shows the successes in the case of ideal guesstimates. The success rate is nearly 100% with just 15 decryptions. This is attributed to the fact that \(\mid G \mid _{0}\) in the ideal case is only 1.82 (average over 1000 samples).

6 Discussion

In this section, we discuss further optimizations along with limitations of our approach and countermeasures.

6.1 Further optimizations and enhancements

With prefetching disabled, we were able to successfully retrieve all 16 bytes of the AES key in 2–3 decryptions. From the deletion table, we obtained the near-exact order in which the line numbers were accessed, resulting in about one line number per guesstimate set. With prefetching enabled, our spy code skips the odd-numbered lines and we get only a partial order of the even-numbered lines accessed. However, using the temporal information provided by the deletion table, we could use different per-element guesstimate sets instead of using the same guesstimate set for all elements accessing a given table. Moreover, weights could be assigned to different elements in the guesstimate set, resulting in fractional scores assigned to sub-keys rather than 0/1 scores in the current histograms. We expect that this change will speed up the convergence of our algorithms. Finally, if Algorithm 2 throws up multiple candidates for the final key, we could use equations for the third round to resolve the ties.

Most Intel caches have a block size of 64 bytes; the IBM PowerPC processor, however, has a block size of 128 bytes. Our attack can be adapted to work on the latter. The First Round Attack will obtain the first three bits of each byte of the AES key. Obtaining the remaining five bits (Second Round Attack) will take considerably more time since the space of sub-keys will be much larger.

6.2 Limitations

We assume that the core running the victim and spy does not simultaneously host another active process running AES decryption. Otherwise, the table accesses made by the latter may be mistaken for accesses by the victim, possibly leading to flawed conclusions.

As with some access-driven attacks, we assume that the victim and spy processes are on the same processor core. This is not always possible, though a determined attacker may be able to co-locate itself with the victim. For example, in a multiuser environment, an attacker together with her accomplices could simply request inordinate CPU resources and obtain access to multiple cores, including the one the victim is running on.

Hardware support for AES is available via the AES-NI instructions on many modern processors beginning with Intel's Westmere family. Since the hardware implementation does not use the processor cache to store the look-up tables, the attack described here will not work. Use of the AES-NI instructions (rather than the software implementation) is the default option in newer OpenSSL versions. However, some processors with a sizeable installed base, such as the Core 2 Duo and the Pentium and Celeron models within the Westmere family, lack hardware support for AES. The default option in more recent versions of OpenSSL is the assembly language implementation, which uses only a single table and may not be vulnerable to the attacks presented in this paper. It can, however, easily be overridden by setting the no-asm flag during compilation.

Most recent Linux kernel versions support control groups (cgroups). When enabled, the CPU will be allocated equally among the processes of different cgroups. If the victim and spy are in different cgroups, then the victim will get roughly the same amount of CPU time as the spy, so it will perform many decryptions during its run, thus rendering this attack infeasible.

6.3 Countermeasures

An extreme countermeasure is physical isolation between a sensitive application and all others. The more practical countermeasures are either processor-based or OS-based. In [48], the OS makes spurious requests to obfuscate cache usage. Most cache-based side-channel attacks are critically dependent on the ability of the adversary to measure the state of the cache frequently. This is achieved by pre-emption of the victim after a few microseconds. By modifying the OS to enforce a lower limit of say \(100\,\upmu \hbox {s}\) on the minimum run time of a CPU-bound process [49], attacks such as these may be thwarted.

A processor-based defense is to drop the flush instruction from the instruction repertoire as in some versions of ARM. However, [50] shows that it may still be possible to evict cache lines by accesses to the same cache set. In this paper, we have retrieved the AES key despite the presence of the hardware prefetcher. Existing prefetchers could be enhanced by randomized and set-balanced stride prefetchers [46] to severely disrupt the cache footprint of a side-channel attack, leading to negligible leakage of useful information.

Removing support for the high-resolution timer [10, 16] or reducing its accuracy [51, 52] has been suggested as a countermeasure to cache attacks. However, it has been demonstrated [50] that removing the timer is not a solution, as very high resolution can be achieved by a counting thread that increments a global variable in an endless loop. The boundaries (maximum stride) of the prefetcher can also be extended to defend against this attack.

7 Conclusion

We implemented cache access attacks on the Core 2 Duo and Core i3/i5/i7 processors to recover the AES key using a multi-threaded spy process. Hardware prefetchers on modern machines complicate cache access attacks as the prefetched cache lines are wrongly reported as being accessed by the victim. Our spy code decreased the number of false positives but greatly increased the number of false negatives. Yet, our key retrieval algorithms required about 25 blocks of ciphertext to retrieve the key with prefetching enabled and only 2–3 blocks with prefetching disabled. We also presented analytical models, which provide deeper insight into the effect of prefetching and of errors (false positives and false negatives) on performance.