Abstract
Side-channel attacks allow the extraction of secret keys from embedded systems like smartcards or smartphones. In practice, the side-channel signal is measured as a trace consisting of several samples. Moreover, several sensitive bits are manipulated in parallel, each leaking differently. The informed attacker therefore needs to devise side-channel distinguishers that can handle both multivariate leakages and multiple models. In the state of the art, these two issues have two independent solutions: on the one hand, dimensionality reduction can cope with multivariate leakage; on the other hand, the online stochastic approach can cope with multiple models. In this paper, we combine both solutions to derive closed-form expressions of the resulting optimal distinguisher in terms of matrix operations, in all situations where the model can be either profiled offline or regressed online. Optimality here means that the success rate is maximized for a given number of traces. We recover known results for uni- and bivariate models (including correlation power analysis) and investigate novel distinguishers for multiple models with more than two parameters. In addition, following ideas from the AsiaCrypt’2013 paper “Behind the Scene of Side-Channel Attacks,” we provide fast computation algorithms in which the traces are accumulated prior to computing the distinguisher values.
1 Introduction
Side-channel attacks allow the extraction of secret keys from cryptographic devices. Template attacks [4] have been introduced as the strongest analysis method. They consist of two phases: (1) an offline profiling phase where the leakage model of the device under attack is characterized and (2) an online attack phase in which the secret key is extracted using fresh measurements along with the precharacterized model. Such attacks are known to use a maximum likelihood principle to ensure the highest possible success probability (see, e.g., [10]).
In this paper, we study optimal attacks with the best possible success probability when extracting the secret key.Footnote 1 We leverage such optimal distinguishers to answer the following question: How to attack with the best probability of success when the leakage is multivariate and the model is multiple? An initial empiricalFootnote 2 work has already been carried out in [15], which confirmed that this type of approach can be very fruitful.Footnote 3
1.1 Contributions
We derive closed-form expressions for the optimal distinguishers in all situations where the model is known (e.g., using profiling) or regressed online. In the case of a known univariate model, we recover the results in [3]. However, our “fully matrix” formalism makes equations simpler and proofs shorter. Moreover, compared to [3] we extend the leakage model to the case where the traces are not necessarily centered, thereby allowing a more natural application on real traces. In the realistic “(online) stochastic attack” situation where the model is parametric, i.e., where the coefficients of the model are unknown, we express the optimal distinguisher by maximizing success over the whole set of possible coefficients. Finally, we provide fast computation algorithms for our novel distinguishers, which happen to be remarkably simple and efficient.
1.2 Outline
The remainder of this paper is organized as follows. Section 2 provides a modelization of a side-channel attack that is genericFootnote 4 enough to capture many different multivariate scenarios. The main results of this paper are outlined in Sect. 3. Section 4 presents experimental results on simulated traces and real-world acquisition campaigns. Conclusions and perspectives are given in Sect. 5.
2 Notations and leakage model
2.1 Notations
We let X denote the leakage measurements, Y the model, N the measurement noise, and \(\alpha \) the link between the model and the measurements.Footnote 5 The model Y depends on a key guess k, an n-bit vector, and on some known text T (usually also an n-bit vector), e.g., through a function \(\phi \) such that \(Y=\phi (T,k)\). A well-known example is \(Y=w_H(T\oplus k)\), where \(w_H\) is the Hamming weight function. However, in general, some parameters of the model are unknown. To remain generic, we do not detail further the link between Y and the pair (T, k). As is customary in side-channel analysis, the correct key is denoted by \(k^\star \). The corresponding model using the correct key \(Y(k^\star )\) is denoted by \(Y^\star \).
Let Q be the number of queries (number of measurements), D be the data dimensionality (number of time samples per measurement trace), and S be the model dimensionality (\(\phi : \mathbb {F}_2^n\times \mathbb {F}_2^n\rightarrow \mathbb {R}^S\) is a vectorial function, with S components). Roman letters in bold indicate vectors or matrices that have a dimension in Q, i.e., which are different for each trace \(q=1,2,\ldots ,Q\). More precisely, \(\mathbf X \) represents the full attack campaign, a matrix of \(D\times Q\) measurement samples. The q-th trace is denoted \(X_q\), which is a \(D\times 1\) column vector. Similarly, for the q-th trace, the \(S\times 1\) column vector \(Y_q\) represents the deterministic part of the model while the \(D\times 1\) column vector \(N_q\) is the corresponding measurement noise with \(D\times D\) covariance matrix \(\Sigma \).
We denote by \(\mathrm{tr}\left( \cdot \right) \) the trace of a square matrix, that is, the sum of its diagonal terms. Note that \(\mathrm{tr}\left( AB\right) = \mathrm{tr}\left( BA\right) \) for compatible matrix dimensions. Let \(\Vert \cdot \Vert _2\) denote the Euclidean norm of a \(1\times Q\) row vector. Thus, \(\Vert \mathbf {X}\Vert _2^2 ={ \mathbf {X} {\mathbf {X}}^\mathsf {T}} = \mathrm{tr}\left( {\mathbf {X}}^\mathsf {T} \mathbf {X}\right) \), where \({(\,\cdot \,)}^\mathsf {T}\) is the transposition operator. Finally, let \(\Vert \;\!\cdot \;\!\Vert _F\) denote the Frobenius norm of a matrix (square root of the sum of its squared elements), such that \(\Vert M \Vert _F^2 = \mathrm{tr}\left( M{M}^\mathsf {T}\right) \).
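These identities are easy to check numerically. The following sketch (ours, using numpy purely as an illustration) verifies the cyclic property of the trace and the two norm identities:

```python
import numpy as np

rng = np.random.default_rng(0)

# tr(AB) = tr(BA) for compatible matrix dimensions
A = rng.normal(size=(3, 5))
B = rng.normal(size=(5, 3))
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))  # True

# For a 1 x Q row vector X: ||X||_2^2 = X X^T = tr(X^T X)
X = rng.normal(size=(1, 8))
print(np.isclose((X @ X.T).item(), np.trace(X.T @ X)))  # True

# Frobenius norm: ||M||_F^2 = tr(M M^T)
M = rng.normal(size=(4, 6))
print(np.isclose(np.linalg.norm(M, "fro") ** 2, np.trace(M @ M.T)))  # True
```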
2.2 General model
We make the following simplifying assumptions. First, the (environmental) noise is steady, e.g., chip temperature and supply voltage do not vary during the side-channel measurements. Thus, \(N_1, N_2, \ldots , N_Q\) are independent and identically distributed (i.i.d.) (denoted by N with index q dropped). Second, the attacker does not inject partial information gathered from the leakage analysis into a possible choice of plaintexts/ciphertexts (non-adaptive attack).Footnote 6 Thus, \(Y_1, Y_2, \ldots , Y_Q\) are assumed i.i.d. (denoted by Y). Under the adopted leakage model, it follows that the leakage measurements \(X_1, X_2, \ldots , X_Q\) are also i.i.d. (denoted by X).
A distinguisher \(\mathcal {D}\) maps a collection of leakages \(\mathbf {x}\) and texts \(\mathbf {t}\) to an estimation of the secret key \(k^\star \). Let us recall that \(\mathbf {x}\) and \(\mathbf {t}\) are realizations of random variables \(\mathbf {X}\) and \(\mathbf {T}\): \(\mathbf {x}\) is a \(D\times Q\) matrix of real numbers (the acquisition campaign) and \(\mathbf {t}\) is a \(1\times Q\) vector of n-bit words (bytes when \(n=8\)) which are the publicly known plaintext or ciphertext bytes. An optimal distinguisher maximizes the probability of success \(\mathcal {D}(\mathbf {x}, \mathbf {t})=k^\star \).
The simplest situation occurs when X consists in a modulation of Y plus noise, in which case we let \(\alpha \) be the signal envelope. In real traces, however, we face the more general situation where the model can be offset by some quantity, the general case being an S-dimensional parametric model with \(S\ge 2\) components. For this reason, we consider \(\alpha \) as a \(D\times S\) matrix and we set in matrix notation
$$\begin{aligned} \mathbf {X} = \alpha \mathbf {Y}^\star + \mathbf {N}, \qquad (1) \end{aligned}$$
where \(\mathbf {X}\) is \(D\times Q\), \(\alpha \) is \(D\times S\), \(\mathbf {Y}^\star \) is \(S\times Q\), and \(\mathbf {N}\) is \(D\times Q\). Notice that our convention of considering dimensions as rows and traces as columns allows us to write the deterministic part of the leakage as \(\alpha \mathbf {Y}^\star \), which is more natural than the opposite convention where the traces would be stacked vertically as rows.Footnote 7
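As an illustration of the dimensions in (1), here is a minimal numpy sketch (ours; the Hamming-weight-plus-offset model, key value, and dimensions are arbitrary illustrative assumptions) generating a synthetic campaign \(\mathbf {X} = \alpha \mathbf {Y}^\star + \mathbf {N}\):

```python
import numpy as np

rng = np.random.default_rng(0)
D, S, Q, n = 6, 2, 100, 4          # samples per trace, model size, traces, bits

k_star = 5                          # illustrative correct key
t = rng.integers(0, 2 ** n, size=Q) # public texts

# S x Q model for the correct key: Hamming-weight row plus a constant offset row
hw = np.array([bin(v).count("1") for v in range(2 ** n)])
y_star = np.vstack([hw[t ^ k_star], np.ones(Q)]).astype(float)

alpha = rng.normal(size=(D, S))     # D x S link matrix (envelope and offset waveforms)
noise = rng.normal(scale=0.5, size=(D, Q))
x = alpha @ y_star + noise          # the D x Q acquisition campaign of Eq. (1)
print(x.shape)  # (6, 100)
```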
We notice that in the state of the art, monovariate leakage models already have vectorial coefficients when they are considered unknown pseudo-Boolean functions of the pair (T, k). In this paper, we highlight that these coefficients extend to waveforms in the context of multivariate leakage, and thus take on a physical interpretation: the leakage is a sum of waveforms, each resulting from the leakage of an individual resource. Consequently, seeing \(\alpha \) as D rows, each holding a series of S weights, is awkward, since it does not relate to the way the multivariate leakage is built from the processed data. Instead, it is natural to see \(\alpha \) as S columns, each representing the waveform generated by one coordinate of the model Y.
For each trace \(q=1,2,\ldots ,Q\), we assume that the vector \(N=N_q\) follows the same multivariate normal distribution \(\mathcal {N}(0,\Sigma )\), where the \(D\times D\) covariance matrix \(\Sigma =\mathbb {E}(N {N}^\mathsf {T})\) is assumed known to the attacker.Footnote 8 Since \(\Sigma \) is assumed symmetric positive definite, there exists a matrix \(\Sigma ^{1/2}\) such that \(\Sigma ^{1/2} \Sigma ^{1/2} = \Sigma \). We refer to \(\Sigma ^{1/2}\) as the noise standard deviation matrix.
The model (1) used throughout the paper is quite generic and has multiple facets depending on the choice of S and the respective values given to \(\alpha \) and Y. This is discussed next.
2.3 Examples with \(S=2\) and \(S=9\)
For \(S=1\), the traces consist only in a modulation of the model plus noise as in [2, 3]. When considering traces that are not only modulated, but also have an offset term, we have \(S=2\). We then write the 2-dimensional model as \(\left( \begin{array}{l}{\mathbf {Y}}\\ {\mathbf {1}}\end{array}\right) \), where \(\mathbf {Y}\) and \(\mathbf {1}\) are \(1\times Q\) matrices \((Y_1, Y_2, \ldots , Y_Q)\) and \((1, 1, \ldots , 1)\). The \(D\times 2\) matrix \(\alpha \) in (1) actually takes the special form \(( \alpha \;\beta )\) where \(\beta \) is the offset value.
An illustration is provided in Fig. 1 where the parameter \(\beta \in \mathbb {R}^D\) is the waveform when there is no signal, whereas \(\alpha \in \mathbb {R}^D\) is the signal envelope. The complete model is the sum \(\alpha Y+\beta \), where Y is the Hamming weight of some intermediate variable (such as the XOR operation \(T\oplus k\)) on \(n=4\) bits. While the leakage signal may be represented as a continuous curve as illustrated in Fig. 1, the practical acquisition consists in a temporal series of D “discrete samples,” typically within one clock period. For \(S=2\), we thus write (1) as
$$\begin{aligned} \mathbf {X} = \alpha \mathbf {Y}^\star + \beta \mathbf {1} + \mathbf {N}, \qquad (2) \end{aligned}$$
where \(\mathbf {X}\) is \(D\times Q\), \(\alpha \) and \(\beta \) are \(D\times 1\), \(\mathbf {Y}^\star \) and \(\mathbf {1}=(1, \ldots , 1)\) are \(1\times Q\), and \(\mathbf {N}\) is \(D\times Q\). Here \(\mathbf {Y}\) is assumed centered: \(\mathbb {E}(\mathbf {Y})=\mathbf {0}=(0,\ldots ,0)\) (since the non-centered part is captured by the \(\beta \mathbf {1}\) term) and of unit variance for every q: \(\text {Var}(Y_q)=\mathbb {E}(Y_q^2)=1\).
For \(S\ge 2\), the actual value of S reflects the complexity of the model. For example, in the weighted sum of bits model, the model for each trace can be written as \(\sum _{s=1}^n \alpha _s Y_s + \beta \), where \(Y_s\) is the sth bit of the n-bit sensitive variable Y. Accordingly, we have
$$\begin{aligned} \mathbf {X} = \sum _{s=1}^{n} \alpha _s \mathbf {Y}_s^\star + \beta \mathbf {1} + \mathbf {N}. \qquad (3) \end{aligned}$$
This leakage model is more complex than before, but may arise in practice. For example, Fig. 2 plots the coefficients \(\alpha _1,\ldots ,\alpha _8\) estimated from the traces of an ATMega smartcard (the datasets are available from the DPA contest V4 team [16]). In particular, one can observe that samples around [50, 80] are ordered by Hamming weight: this part of the trace resembles the upper left part of Fig. 1 for \(S=2\). By analyzing the \((n+1)\)-variate model of (3), one can indeed see that around [50, 80], the vectors \(\alpha _1, \ldots , \alpha _8\) are almost identical. However, samples in the intervals [170, 250] or [330, 400] have a more complex model. There, the eight vectors \(\alpha _1, \ldots , \alpha _8\) are clearly different, so the leakage is 9-variate.
In the sequel, we consider both types of attacks: those with offline profiling where \(\alpha \) for each component of the model is precharacterized like in Fig. 2 and also those where the model is learned online like in a linear regression attack [7].
3 Theoretical results and implementation
3.1 General mathematical expressions
In this section, we derive the mathematical expression of the optimal distinguisher \(\mathcal {D}\) in the general case of multivariate leakage (\(D\ge 1\)) and multiple models (\(S\ge 1\)). An illustration of our results is given in Fig. 3 for the case when the leakage is completely known (or profiled as in the template attack) and when the leakage is unknown and estimated online.
Definition 1
(Optimal distinguisher knowing or ignoring \(\alpha \)) When \(\alpha \) is known, the optimal distinguisher is \(\mathcal {D}_{\mathrm {ML}}(\mathbf {x},\mathbf {t}) = \mathrm{argmax}_k \ p(\mathbf {x}|\mathbf {y})\); when \(\alpha \) is unknown, it is \(\mathcal {D}_{\mathrm {ML,sto}}(\mathbf {x},\mathbf {t}) = \mathrm{argmax}_k \max _{\alpha } \ p(\mathbf {x}|\mathbf {y})\).
In both cases (Theorems 1 and 2 below), the result is a distinguisher which is computed using simple matrix operations. While \(\mathcal {D}_{\mathrm {ML}}\) resembles a template attack with Gaussian templates [4], \(\mathcal {D}_{\mathrm {ML,sto}}\) is a novel expression that results from a non-trivial maximization over the matrix \(\alpha \) and may be interpreted as a generalization of multivariate correlation power analysis [1].
Theorem 1
The optimal maximum likelihood (ML) distinguisher [10] for Gaussian noise writes
$$\begin{aligned} \mathcal {D}_\mathrm {ML}(\mathbf {x}, \mathbf {t}) = \mathrm{argmin}_k \ \mathrm{tr}\left( {(\mathbf {x} - \alpha \mathbf {y})}^\mathsf {T} \Sigma ^{-1} (\mathbf {x} - \alpha \mathbf {y})\right) . \qquad (4) \end{aligned}$$
Proof
From [10], we have \(\mathcal {D}_\mathrm {ML}(\mathbf {x}, \mathbf {t}) = \mathrm{argmax}_k \ p(\mathbf {x}|\mathbf {y})\), while from (1) we see that \(p(\mathbf {x}|\mathbf {y})= p_\mathbf {N}(\mathbf {x}-\alpha \mathbf {y})\). From the i.i.d. assumption, the noise density \(p_\mathbf {N}(\mathbf {n})\) is given by
$$\begin{aligned} p_\mathbf {N}(\mathbf {n}) = \left( (2\pi )^{D} \det \Sigma \right) ^{-Q/2} \exp \left( -\tfrac{1}{2}\, \mathrm{tr}\left( {\mathbf {n}}^\mathsf {T} \Sigma ^{-1} \mathbf {n}\right) \right) . \end{aligned}$$
Thus, \(p_\mathbf {N}(\mathbf {x}-\alpha \mathbf {y})\) is maximum when the expression \(\mathrm{tr}\left( {\mathbf {n}}^\mathsf {T} \Sigma ^{-1} \mathbf {n}\right) \) for \(\mathbf {n}=\mathbf {x}-\alpha \mathbf {y}\) is minimum. \(\square \)
In Eqn. (4) of Theorem 1, the trace \(\mathrm{tr}\left( {(\mathbf {x} - \alpha \mathbf {y})}^\mathsf {T} \Sigma ^{-1} (\mathbf {x} - \alpha \mathbf {y})\right) \) consists in:
- the sum of Q Mahalanobis [12] distances (see also Eqn. (22) of [5]),
- the sum of D elements (which is useful when \(D\ll Q\)), as attested by rewriting
$$\begin{aligned}&\text {tr}\Bigg (\underbrace{{(\mathbf {x} - \alpha \mathbf {y})}^\mathsf {T} \Sigma ^{-1} (\mathbf {x} - \alpha \mathbf {y})}_{Q\times Q \text { matrix}}\Bigg )\\&\quad = \text {tr}\Bigg (\underbrace{\Sigma ^{-1} (\mathbf {x} - \alpha \mathbf {y}) {(\mathbf {x} - \alpha \mathbf {y})}^\mathsf {T}}_{D\times D \text { matrix}}\Bigg ) . \end{aligned}$$
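To make the computation concrete, here is a toy sketch (ours) of the ML distinguisher in the \(D\times D\) form above. The 4-bit S-box (PRESENT's), key value, dimensions, and the noise-free sanity-check setting are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

# Hypothetical 4-bit S-box (PRESENT's), used here only to build a toy model.
SBOX = [0xC, 5, 6, 0xB, 9, 0, 0xA, 0xD, 3, 0xE, 0xF, 8, 4, 7, 1, 2]
HW = [bin(v).count("1") for v in range(16)]

def model(t, k):
    """S x Q model matrix for key guess k: centered HW row plus constant offset row."""
    y = np.array([HW[SBOX[ti ^ k]] for ti in t], dtype=float)
    return np.vstack([y - 2.0, np.ones(len(t))])  # mean HW of a 4-bit value is 2

def d_ml(x, t, alpha, sigma_inv):
    """Eq. (4): argmin_k tr(Sigma^{-1} (x - alpha y)(x - alpha y)^T)."""
    scores = []
    for k in range(16):
        r = x - alpha @ model(t, k)                    # residual, D x Q
        scores.append(np.trace(sigma_inv @ r @ r.T))   # D x D form, cheap when D << Q
    return int(np.argmin(scores))

rng = np.random.default_rng(0)
D, k_star = 5, 11
t = np.tile(np.arange(16), 10)          # Q = 160 texts, all 16 values covered
alpha = rng.normal(size=(D, 2))         # known link matrix (S = 2)
x = alpha @ model(t, k_star)            # noise-free leakage, per Eq. (1)
print(d_ml(x, t, alpha, np.eye(D)))     # 11
```

With zero noise the residual vanishes exactly for the correct key, so the argmin recovers \(k^\star \).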
Theorem 2
The optimal stochastic multivariate attack is given by
$$\begin{aligned} \mathcal {D}_\mathrm {ML,sto}(\mathbf {x},\mathbf {t}) = \mathrm{argmax}_k \ \mathrm{tr}\left( {\mathbf {y}}^\mathsf {T} (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1} \mathbf {y} \, {\mathbf {x}}^\mathsf {T} \Sigma ^{-1} \mathbf {x}\right) , \qquad (5) \end{aligned}$$
for which the optimal value of \(\alpha \) is given by
$$\begin{aligned} \alpha = (\mathbf {x} {\mathbf {y}}^\mathsf {T}) (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1}. \qquad (6) \end{aligned}$$
For the proof, we need some known results of linear algebra (Lemma 1) and linear regression (Lemma 2).
Lemma 1
Let \(\mathbf {b}\) be an \(S\times Q\) matrix, with \(S<Q\). The \(S\times S\) matrix \(\mathbf {b} {\mathbf {b}}^\mathsf {T}\) is invertible if and only if \(\mathbf {b}\) has full rank S, i.e., if and only if the S lines of \(\mathbf {b}\) are independent.
Proof
Let x be an \(S\times 1\) column vector. We have \({x}^\mathsf {T} \mathbf {b} {\mathbf {b}}^\mathsf {T} x = \Vert {\mathbf {b}}^\mathsf {T} x \Vert ^2\), which vanishes only if \({\mathbf {b}}^\mathsf {T} x = 0\), i.e., only if \(x=0\) when the S lines of \(\mathbf {b}\) are independent. Hence, the matrix \(\mathbf {b} {\mathbf {b}}^\mathsf {T}\) is positive definite, hence invertible. Conversely, if the lines of \(\mathbf {b}\) are dependent, there exists \(x\ne 0\) such that \({\mathbf {b}}^\mathsf {T} x = 0\), so \(\mathbf {b} {\mathbf {b}}^\mathsf {T}\) is singular. \(\square \)
Lemma 2
Let \(\mathbf {a}\), \(\mathbf {b}\), and \(\alpha \) be, respectively, \(1\times Q\), \(S\times Q\), and \(1\times S\) with \(S<Q\), where \(\mathbf {b}\) has full rank S. Then, \(\Vert \mathbf {a}-\alpha \mathbf {b}\Vert _2\) reaches its minimum for \(\alpha =\mathbf {a}{\mathbf {b}}^\mathsf {T}(\mathbf {b}{\mathbf {b}}^\mathsf {T})^{-1}\).
Proof
Expanding the squared norm gives \(\Vert \mathbf {a}-\alpha \mathbf {b}\Vert _2^2 = (\mathbf {a}-\alpha \mathbf {b}){(\mathbf {a}-\alpha \mathbf {b})}^\mathsf {T} = \mathbf {a}{\mathbf {a}}^\mathsf {T} -2\alpha \mathbf {b}{\mathbf {a}}^\mathsf {T} + \alpha \mathbf {b}{\mathbf {b}}^\mathsf {T}{\alpha }^\mathsf {T}\). Therefore, the gradient \(\frac{\partial }{\partial \alpha } \Vert \mathbf {a}-\alpha \mathbf {b}\Vert _2^2 = -2\mathbf {b}{\mathbf {a}}^\mathsf {T} + 2\mathbf {b}{\mathbf {b}}^\mathsf {T}{\alpha }^\mathsf {T}\) vanishes if and only if \({\alpha }^\mathsf {T}=(\mathbf {b}{\mathbf {b}}^\mathsf {T})^{-1} \mathbf {b} {\mathbf {a}}^\mathsf {T}\), i.e., \(\alpha =\mathbf {a}{\mathbf {b}}^\mathsf {T}(\mathbf {b}{\mathbf {b}}^\mathsf {T})^{-1}\) where we have used the fact that \(\mathbf {b}{\mathbf {b}}^\mathsf {T}\) is invertible by Lemma 1. \(\square \)
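Lemma 2 is the ordinary least-squares solution; a quick numerical cross-check (ours) against numpy's solver:

```python
import numpy as np

rng = np.random.default_rng(1)
S, Q = 3, 50
a = rng.normal(size=(1, Q))
b = rng.normal(size=(S, Q))        # full rank S almost surely

# Closed form from Lemma 2: alpha = a b^T (b b^T)^{-1}
alpha = a @ b.T @ np.linalg.inv(b @ b.T)

# Cross-check with numpy's least-squares solver on the transposed system b^T alpha^T = a^T
alpha_ls, *_ = np.linalg.lstsq(b.T, a.T, rcond=None)
print(np.allclose(alpha, alpha_ls.T))  # True
```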
Proof
(Proof of Theorem 2) Let \(\mathbf {x}' = \Sigma ^{-1/2} \ \mathbf {x}\) and \(\mathbf {y}' = (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2} \ \mathbf {y}\). The optimal distinguisher minimizes the following expression over \(\alpha \in \mathbb {R}^{D\times S}\):
$$\begin{aligned} \mathrm{tr}\left( {(\mathbf {x}-\alpha \mathbf {y})}^\mathsf {T} \Sigma ^{-1} (\mathbf {x}-\alpha \mathbf {y})\right) = \Vert \mathbf {x}' - \alpha ' \mathbf {y} \Vert _F^2 = \sum _{d=1}^{D} \Vert \mathbf {x}'_d - \alpha '_d \mathbf {y} \Vert _2^2 , \end{aligned}$$
where \(\alpha ' = \Sigma ^{-1/2} \alpha \) and \(\mathbf {x}'_d\), \(\alpha '_d\) denote the d-th lines of \(\mathbf {x}'\), \(\alpha '\).
By Lemma 2, the minimization over \(\alpha '_d\) yields \(\alpha '_d = (\mathbf {x}'_d {\mathbf {y}}^\mathsf {T}) (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1}\) for all \(d=1,\ldots , D\). This gives \(\alpha '=(\mathbf {x}' {\mathbf {y}}^\mathsf {T}) (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1}\) hence \(\alpha =(\mathbf {x} {\mathbf {y}}^\mathsf {T}) (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1}\), which remarkably does not depend on \(\Sigma \).
The minimized value of the distinguisher is thus
$$\begin{aligned} \mathrm{tr}\left( \left( \mathsf {Id} - {\mathbf {y}}^\mathsf {T} (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1} \mathbf {y}\right) {\mathbf {x}}^\mathsf {T} \Sigma ^{-1} \mathbf {x}\right) = \mathrm{tr}\left( {\mathbf {x}}^\mathsf {T} \Sigma ^{-1} \mathbf {x}\right) - \mathrm{tr}\left( {\mathbf {y}}^\mathsf {T} (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1} \mathbf {y} \, {\mathbf {x}}^\mathsf {T} \Sigma ^{-1} \mathbf {x}\right) , \end{aligned}$$
where \(\mathsf {Id}\) denotes the \(Q\times Q\) identity matrix and where \(\mathrm{tr}\left( {\mathbf {x}}^\mathsf {T} \Sigma ^{-1} \mathbf {x}\right) \) is a constant independent of k. Minimizing this value thus amounts to maximizing the second trace, which proves Theorem 2. \(\square \)
The expression of \(\mathcal {D}_\mathrm {ML,sto}(\mathbf {x},\mathbf {t})\) given in Theorem 2 consists in the trace of a \(Q\times Q\) matrix, which can admittedly be very large. It can, however, be rewritten in a way that is easier to compute when Q is much greater than S and D:
Corollary 1
(Alternative expression of \(\mathcal {D}_\mathrm {ML,sto}\)) Letting \(\mathbf {x}' = \Sigma ^{-1/2} \ \mathbf {x}\) and \(\mathbf {y}' = (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2} \ \mathbf {y}\) as in the proof of Theorem 2, we have
$$\begin{aligned} \mathcal {D}_\mathrm {ML,sto}(\mathbf {x},\mathbf {t}) = \mathrm{argmax}_k \ \Vert \mathbf {x}' {\mathbf {y}'}^\mathsf {T} \Vert _F^2 . \qquad (7) \end{aligned}$$
Here the Frobenius norm is of a \(D\times S\) matrix.
Proof
Let us write \((\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1}=(\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2} (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2}\) in (5). By the properties of the trace,
$$\begin{aligned} \mathrm{tr}\left( {\mathbf {y}}^\mathsf {T} (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1} \mathbf {y} \, {\mathbf {x}}^\mathsf {T} \Sigma ^{-1} \mathbf {x}\right) = \mathrm{tr}\left( \mathbf {y}' {\mathbf {x}'}^\mathsf {T} \mathbf {x}' {\mathbf {y}'}^\mathsf {T}\right) = \Vert \mathbf {x}' {\mathbf {y}'}^\mathsf {T} \Vert _F^2 . \end{aligned}$$
\(\square \)
Remark 1
Notice that in Corollary 1, \(\mathbf {y}'\) has an empirical covariance equal to the identity matrix. Indeed, \(\mathbf {y}' {\mathbf {y}'}^\mathsf {T} = (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2} \mathbf {y} {\mathbf {y}}^\mathsf {T} (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2} = \mathsf {Id}\).
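The Frobenius-norm form of Corollary 1 translates directly into code. In the sketch below (ours; the 4-bit S-box, key value, dimensions, and noise-free setting are illustrative assumptions), no \(\alpha \) is regressed explicitly: the normalization of \(\mathbf {y}\) into \(\mathbf {y}'\) performs the implicit per-key least-squares fit:

```python
import numpy as np

SBOX = [0xC, 5, 6, 0xB, 9, 0, 0xA, 0xD, 3, 0xE, 0xF, 8, 4, 7, 1, 2]
HW = [bin(v).count("1") for v in range(16)]

def inv_sqrt(m):
    """Inverse symmetric square root of a symmetric positive-definite matrix."""
    w, v = np.linalg.eigh(m)
    return v @ np.diag(1.0 / np.sqrt(w)) @ v.T

def model(t, k):
    """S x Q model matrix for key guess k: centered HW row plus offset row (S = 2)."""
    y = np.array([HW[SBOX[ti ^ k]] for ti in t], dtype=float)
    return np.vstack([y - 2.0, np.ones(len(t))])

def d_ml_sto(x, t, sigma):
    """Corollary 1: argmax_k || x' y'^T ||_F^2, with alpha implicitly regressed."""
    xp = inv_sqrt(sigma) @ x                       # whitened traces x' = Sigma^{-1/2} x
    scores = []
    for k in range(16):
        y = model(t, k)
        yp = inv_sqrt(y @ y.T) @ y                 # normalized model y'
        scores.append(np.linalg.norm(xp @ yp.T, "fro") ** 2)
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
D, k_star = 5, 11
t = np.tile(np.arange(16), 10)                     # Q = 160, all 16 texts covered
x = rng.normal(size=(D, 2)) @ model(t, k_star)     # unknown alpha, noise-free
print(d_ml_sto(x, t, np.eye(D)))  # 11
```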
3.2 Mathematical expressions for \(S=2\)
In order to provide interpretations for the optimal distinguisher expressions, we detail how an optimal attack unfolds when the leakage consists in a sum of a modulated scalar model and an offset (\(S=2\)). The cases for profiled attacks (denoted \(\mathcal {D}_{\mathrm {ML}}^{S=2}\)) and non-profiled attacks (denoted \(\mathcal {D}_{\mathrm {ML,sto}}^{S=2}\)) are presented in Fig. 4.
Interestingly, when \(S=2\), the template attack decomposes into two steps (an affine projection followed by a Euclidean distance to the model). Remarkably, the projection vector is the same for all key guesses. This extends similar results [3] where only the linear relationship between leakage and model is explored. As for the online attack, \(\mathcal {D}_{\mathrm {ML,sto}}^{S=2}\) consists in a sum of squared CPA-like terms computed on transformed data, the transformation aiming at orthogonalizing the noise.
3.3 Efficient implementation
Both \(\mathcal {D}_{\mathrm {ML}}\) and \(\mathcal {D}_{\mathrm {ML,sto}}\) can be optimized using the idea presented in [11]. That article applies a precomputation step when the number of traces is larger than the number of possible plaintexts (\(Q> \#\mathcal {T}=2^n\)). In this case, all summations \(\sum _q\) can advantageously be replaced by \(\sum _t \sum _{t_q=t}\). In most cases, the inner sum \(\sum _{t_q=t}\) can be computed on the fly and does not involve a hypothesis on the key. Therefore, a speed gain of \(2^n\) (the cardinality of the key space) is expected.
This optimization strategy can be applied to \(\mathcal {D}_{\mathrm {ML}}\). Indeed, let us define \(\mathbf {x}'=\Sigma ^{-1/2}\mathbf {x}\) and \(\alpha '=\Sigma ^{-1/2}\alpha \). Then,
Notice that at line (8), the term \(\sum _{q/t_q=t} {x'_{d,q}}^2\), which does not depend on the key, can be dropped. The fast version of this computation is given in Algorithm 1.
The same optimization applies to \(\mathcal {D}_{\mathrm {ML,sto}}\). Indeed, in expression (7) of \(\mathcal {D}_{\mathrm {ML,sto}}(\mathbf x ,\mathbf t ) = \mathrm{argmax}_k \ \Vert \mathbf {x}' {\mathbf {y}'}^\mathsf {T} \Vert _F^2\), one can write
This means that \(\mathbf {x}'\) can be obtained by simple accumulation, exactly as in line 2 of Algorithm 1. The term \(y'_{s}(t,k)\) requires the computation of \(\mathbf {y}{\mathbf {y}}^\mathsf {T}\). In the case \(Q\gg 1\), it can be assumed that the texts \(\mathbf {t}\) are uniformly distributed. Hence, when \(Q\rightarrow +\infty \), by the law of large numbers,
Therefore, in (10), \(y'_s(t)\) can also be precomputed. To the best of our knowledge, this optimization has never been discussed previously. The resulting distinguishing procedure is given in Algorithm 2. At line 3, the argument of the Frobenius norm can be computed by a fast matrix multiplication. Also, we notice that the matrix inversion at line 0 is actually a precomputation which involves only the model. Besides, if the EIS (Equal Images under all Subkeys) assumption holds [14, Def. 2], e.g., if y(t, k) only depends on \(t\oplus k\), then \(\sum _t y(t,k) {y(t,k)}^\mathsf {T}\) does not depend on k; hence, only a single matrix inversion is required. Finally, the computation of the optimal stochastic attack simply consists in trace accumulation per class, followed by as many matrix products and Frobenius norms as there are keys to be guessed.
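The accumulation trick can be sketched as follows (our illustration, not the paper's Algorithms 1 and 2 verbatim): the per-class sums are computed once, key-independently, and any key-dependent correlation sum then runs over the \(2^n\) classes instead of the Q traces:

```python
import numpy as np

def accumulate(xp, t, n=4):
    """Return the D x 2^n matrix of per-class sums sum_{q: t_q = t} x'_q."""
    acc = np.zeros((xp.shape[0], 2 ** n))
    np.add.at(acc.T, t, xp.T)          # one pass over the Q traces, key-independent
    return acc

rng = np.random.default_rng(0)
D, Q, n = 5, 1000, 4
xp = rng.normal(size=(D, Q))           # stand-in for whitened traces x'
t = rng.integers(0, 2 ** n, size=Q)

acc = accumulate(xp, t, n)
# Any key-dependent sum over q collapses to a sum over the 2^n classes, e.g.
# sum_q x'_q y(t_q) == sum_t acc[:, t] y(t) for per-class model values y:
y = rng.normal(size=2 ** n)            # arbitrary per-class model values
slow = xp @ y[t]                       # O(Q) per key guess
fast = acc @ y                         # O(2^n) per key guess
print(np.allclose(slow, fast))  # True
```

This is where the expected speed gain of \(2^n\) comes from: the O(Q) pass is paid once, not once per key guess.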
4 Practical results
4.1 Characterization of \(\Sigma \)
In this article, we assume that the attacker knows the noise covariance matrix. We give a straightforward procedure for its estimation:
1. Collect Q traces (i.e., a matrix \(\mathbf {x}\in \mathbb {R}^{D\times Q}\)) where the plaintext is fixed to a given value;
2. estimate \(\Sigma \) as \(\hat{\Sigma }= \frac{1}{Q-1} \bigl ( \mathbf {x}-\frac{1}{Q}\mathbf {x}{\mathbf {1}}^\mathsf {T}\mathbf {1} \bigr ) \! {\bigl ( \mathbf {x}-\frac{1}{Q}\mathbf {x}{\mathbf {1}}^\mathsf {T}\mathbf {1} \bigr )}^\mathsf {T}\!\!,\) where \(\mathbf {1}=(1, \ldots , 1)\in \mathbb {R}^{1\times Q}\). This estimator is the sample covariance matrix, which is unbiased.
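A sketch of this estimation procedure (ours, on simulated fixed-plaintext traces; the AR noise shape and all parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D, Q = 4, 5000

# Simulated fixed-plaintext campaign: the deterministic part is constant, so
# the variability of x is entirely due to the noise (here AR with rho = 0.5).
sigma_true = 0.5 ** np.abs(np.subtract.outer(np.arange(D), np.arange(D)))
L = np.linalg.cholesky(sigma_true)
x = 3.0 + L @ rng.normal(size=(D, Q))          # D x Q matrix of traces

centered = x - x.mean(axis=1, keepdims=True)   # x - (1/Q) x 1^T 1
sigma_hat = centered @ centered.T / (Q - 1)    # unbiased sample covariance

print(np.abs(sigma_hat - sigma_true).max() < 0.2)  # True
```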
Remark 2
Notice that \(\Sigma \) cannot be obtained by a direct profiling on the same traces as those used for the attack. Indeed, in those traces, the plaintext is varying; hence, the attacker would use for \(\hat{\Sigma }\) the covariance matrix of \(\mathbf {x}-\alpha ^\text {opt} \mathbf {y}\), where \(\alpha ^\text {opt} = (\mathbf {x} {\mathbf {y}}^\mathsf {T}) (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1}\) [recall (6)]. Hence, \(\hat{\Sigma }= \frac{1}{Q-1} (\mathbf {x}-\alpha ^\text {opt} \mathbf {y}) {(\mathbf {x}-\alpha ^\text {opt} \mathbf {y})}^\mathsf {T}\). But the distinguisher \(\mathcal {D}_\mathrm {ML,sto}\) is
$$\begin{aligned} \mathcal {D}_\mathrm {ML,sto}(\mathbf {x},\mathbf {t})&= \mathrm{argmin}_k \ \mathrm{tr}\left( \hat{\Sigma }^{-1} (\mathbf {x}-\alpha ^\text {opt} \mathbf {y}) {(\mathbf {x}-\alpha ^\text {opt} \mathbf {y})}^\mathsf {T}\right) \qquad (11) \\&= \mathrm{argmin}_k \ (Q-1)\,\mathrm{tr}\left( \mathsf {Id}\right) = \mathrm{argmin}_k \ \text {cst} . \qquad (12) \end{aligned}$$
Indeed, at line (11), we use the fact, demonstrated in the proof of Theorem 2, that the minimizing value (6) of \(\alpha \) does not depend on \(\Sigma \). Finally, it can be seen at line (12) that the distinguisher with \(\hat{\Sigma }\) instead of \(\Sigma \) does not depend on the key.Footnote 9
4.2 Attacks on synthetic (i.e., simulated) traces
In this subsection, we present simulations where \(\alpha \) is known exactly or regressed online. We consider an attack on PRESENT, whose SBox maps \(n=4\) bits to \(n=4\) bits. For the sake of the simulations, we choose two kinds of \(\alpha \):
- “identical”: all the \(n=4\) bits leak the same waveform, as in the Hamming weight model;
- “proportional”: the waveform has weight 1 for SBox bit 0 and is multiplied by 2 (resp. 3 and 4) for SBox bit 1 (resp. 2 and 3).
The waveform for each bit is that represented in Fig. 1 (upper left graph). Specifically, for all \(1\le d\le D\) and \(1\le s\le S\), the envelope consists in damped oscillations:
The noise is chosen normal, using two distributions:
- “isotropic”: the covariance matrix is \(\sigma ^2\) times the \(D\times D\) identity;
- “auto-regressive” (or “AR” for short): the covariance matrix element at position \((d,d')\), for \(1\le d,d'\le D\), is \(\sigma ^2 \rho ^{|d-d'|}\). This noise is correlated between neighboring samples, with a correlation that decreases exponentially as samples get further apart.
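Both covariance structures are one-liners to construct; the sketch below (ours, with illustrative parameter values) builds them and draws a campaign of AR noise vectors:

```python
import numpy as np

D, sigma, rho = 6, 4.0, 0.7
cov_iso = sigma ** 2 * np.eye(D)                                  # isotropic
idx = np.arange(D)
cov_ar = sigma ** 2 * rho ** np.abs(np.subtract.outer(idx, idx))  # sigma^2 rho^|d-d'|

# Drawing a campaign of Q multivariate normal noise vectors N_q ~ N(0, cov_ar)
rng = np.random.default_rng(0)
Q = 1000
n_ar = rng.multivariate_normal(np.zeros(D), cov_ar, size=Q).T     # D x Q
print(n_ar.shape)  # (6, 1000)
```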
Proposition 1
The success probability of \(\mathcal {D}_\mathrm {ML}\) is greater than or equal to that of \(\mathcal {D}_\mathrm {ML,sto}\).
Proof
Indeed, according to [10], \(\mathcal {D}_\mathrm {ML}\) maximizes the success probability. Thus, the distinguisher \(\mathcal {D}_\mathrm {ML,sto}\) cannot have a higher success probability. The success probabilities coincide when the minimization over \(\alpha \) in the proof of Theorem 2 yields the exact matrix \(\alpha \) used in the model (1). \(\square \)
Simulations allow us to estimate the efficiency loss incurred by not knowing the model (Proposition 1), by comparing the distinguishers \(\mathcal {D}_\mathrm {ML}\) [(4)] and \(\mathcal {D}_\mathrm {ML,sto}\) [(5)]. The success rate of the optimal distinguisher \(\mathcal {D}_\mathrm {ML}\) is drawn in order to materialize the boundary between feasible (below) and unfeasible (above) attacks.
Results for low noise (\(\sigma =1\)) are represented in Fig. 5. We can see that the Hamming weight model is clearly harder to attack, because the leakage of one bit cannot be distinguished from that of the other bits. Besides, we notice that the stochastic attack performs much worse than the optimal attack: about 10 times more traces are required for an equivalent success probability in key extraction.
Results for high noise (\(\sigma =4\)) are represented in Fig. 6. Again, the “proportional” model is easier to attack than the “identical” model (for each bit). Now, we also see that the gap between the optimal ML attack and the stochastic attack narrows: only about 5 times more traces are needed for the stochastic attack to perform as well as the optimal attack in terms of success probability. Besides, we notice that the AR noise is favorable to the attacker. It is therefore important in practice for the attacker to characterize precisely the noise distribution (recall the methodology presented in Sect. 4.1).
Clearly, these conclusions are in line with the template versus stochastic (offline) study carried out in [8]: for high noise, the attack requires more traces anyway; hence, the (online) learning of the model is more accurate. Therefore, the performance of \(\mathcal {D}_\mathrm {ML,sto}\) gets closer to that of \(\mathcal {D}_\mathrm {ML}\) than for low noise.
4.3 Attacks on real-world traces
We now compare CPA with \(\mathcal {D}_\mathrm {ML}\) and \(\mathcal {D}_\mathrm {ML,sto}\) on measurements provided by the DPA contest V4. These traces have been acquired from an 8-bit processor and hence have a signal-to-noise ratio greater than one, reaching 7 at some points in time. The interval for our case study is [170, 250] from Fig. 2; hence, \(D=80\). Regarding ML, two learning strategies have been implemented:
1. the model is learned from a disjoint set of 5k traces, which is the operational scenario for a profiled attack;
2. the model is learned from the traces being attacked (denoted self in Fig. 7). This case does not represent a realistic attack, but is interesting in that it highlights the best possible attacker.
The attack success rates are plotted in Fig. 7.
One can see that both variants of \(\mathcal {D}_\mathrm {ML}\) and \(\mathcal {D}_\mathrm {ML,sto}\) perform better with \(S=9\) than with \(S=2\). This is consistent with the analysis carried out in Sect. 2.3. The CPA has very poor performance because the model is very far from a Hamming weight: as shown in Fig. 2, some parameters \(\alpha _i\) (e.g., for \(i=2\) and 6) are positive in the region [180, 200], whereas others \(\alpha _j\) (e.g., for \(j=1\), 3, 4 and 5) are negative. These compensating signs explain why the Hamming weight model is inappropriate. The ML attack with model precharacterization on the traces under attack shows that very strong attacks are possible (using only a few traces). Interestingly, when the model used by ML is characterized on 5k traces distinct from the traces being attacked, the performance is almost similar. Finally, the online stochastic attack derived in this paper (\(\mathcal {D}_\mathrm {ML,sto}\)) performs better than CPA (the distinguisher being the maximum value of the Pearson correlation over the \(D=80\) samples).
5 Conclusions and perspectives
Distinguishing a key from both multivariate leakage samples and multiple models can be done in one step, as shown in this paper. A compact expression of the distinguisher is provided, using matrix operations. The strategy is applied to real-world traces in profiled and non-profiled scenarios. The resulting attack is more efficient than the traditional approach “dimensionality reduction, then stochastic (linear regression) attack.” The new multivariate distinguisher outperforms the other state-of-the-art attacks. The presented methodology allows for leakage-agnostic attacks on vectorial leakage measurements and complex models. In addition, the matrix-based expression of the distinguisher benefits from matrix-oriented software that implements computational optimizations for large dimensions.
A companion future work would consist in determining the optimal model dimensionality and basis from any acquisition campaign. Another perspective is to adapt the methodology to masked implementations, as already done for monovariate leakage in [6], yet for this case the distinguishers will certainly not exhibit simple closed-form expressions. However, we believe that the approach could be fruitful in practice backed with suitable optimization software.
Notes
The success probability in key recovery is chosen as a figure of merit for optimization. Such an objective is typical of “pure” side-channel attacks. Other approaches [9, 13, 17] relax the condition that the key found by the side-channel analysis be ranked first and complements it with a key enumeration stage. This yields a data versus complexity trade-off that is not explored in this paper.
The work in [15] neither details the result of the regression nor plugs it into the distinguisher, which is incidentally not chosen to be the optimal one.
Multi-target attacks [13, 18] have a somewhat different goal, namely the best aggregation of information about several subparts of a key, possibly leaking at different times with different models, in order to recover the full key efficiently. Here we consider only one multivariate leakage model and focus on recovering one subpart of the key. However, our derivation is capable of handling multivariate leakages and models and may still be combined with the multi-target approaches.
By generic, we mean a leakage model more complex than the classical Hamming weight or distance, in which each bit of the sensitive variable leaks with a different strength (a situation we will show happens in practice).
Notations X, Y are consistent with the usual convention in machine learning, where X is for the collected data and Y for the classification labels.
In fact, our results tolerate chosen texts, but consider them as observed inputs to the attack. We do not optimize the attack according to chosen inputs.
We underline that \(\mathbf {Y}^\star \) denotes the model for the correct key; we write \(\mathbf {Y}(k)\) for a model assuming a guessed key k. Sometimes, to make notations more legible, the dependence on k is left tacit and \(\mathbf {Y}(k)\) is simply written \(\mathbf {Y}\).
We may simplify (2) by incorporating \(\beta \mathbf {1}\) into the noise expectation, but the noise is intrinsically zero-mean and it is clearer to exhibit a specific offset term.
Indeed, \(\mathrm{argmin}_k \text {cst} = \mathbb {F}_2^n\), meaning that all keys are equiprobable. Intuitively, when both the noise and the model parameters are regressed at the same time, any key manages to achieve the same match between parametric model and side-channel observations.
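This degenerate case can be illustrated numerically. The sketch below (an illustrative assumption, using NumPy and an intentionally over-parameterized, full-rank basis per key guess) shows that when the regression has enough free parameters to absorb both the model and the noise, every key guess achieves the same perfect fit, so the distinguisher is constant over all keys.

```python
import numpy as np

rng = np.random.default_rng(1)
q = 8
X = rng.normal(size=(q, 3))  # stand-in for q traces of 3 samples
residuals = []
for k in range(4):  # four hypothetical key guesses
    # A full-rank q x q basis: the regression can fit anything exactly.
    Yk = rng.normal(size=(q, q))
    B, *_ = np.linalg.lstsq(Yk, X, rcond=None)
    residuals.append(np.sum((X - Yk @ B) ** 2))
# Every key guess achieves the same (perfect) fit up to rounding error,
# so the distinguisher carries no information about the key.
assert np.allclose(residuals, 0.0)
```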
References
Brier, É., Clavier, C., Olivier, F.: Correlation power analysis with a leakage model. In: Joye, M., Quisquater, J. (eds.) Cryptographic Hardware and Embedded Systems—CHES 2004: 6th International Workshop Cambridge, MA, USA, August 11–13, 2004. Proceedings. Lecture Notes in Computer Science, vol. 3156, pp. 16–29. Springer (2004). doi:10.1007/978-3-540-28632-5_2
Bruneau, N., Danger, J., Guilley, S., Heuser, A., Teglia, Y.: Boosting higher-order correlation attacks by dimensionality reduction. In: Chakraborty, R.S., Matyas, V., Schaumont, P. (eds.) Security, Privacy, and Applied Cryptography Engineering—4th International Conference, SPACE 2014, Pune, India, October 18–22, 2014. Proceedings. Lecture Notes in Computer Science, vol. 8804, pp. 183–200. Springer (2014). doi:10.1007/978-3-319-12060-7_13
Bruneau, N., Guilley, S., Heuser, A., Marion, D., Rioul, O.: Less is more—dimensionality reduction from a theoretical perspective. In: Güneysu, T., Handschuh, H. (eds.) Cryptographic Hardware and Embedded Systems—CHES 2015—17th International Workshop, Saint-Malo, France, September 13–16, 2015, Proceedings. Lecture Notes in Computer Science, vol. 9293, pp. 22–41. Springer (2015). doi:10.1007/978-3-662-48324-4_2
Chari, S., Rao, J.R., Rohatgi, P.: Template attacks. In: CHES. LNCS, vol. 2523, pp. 13–28. Springer (2002). San Francisco Bay (Redwood City), USA
Choudary, O., Kuhn, M.G.: Efficient template attacks. In: Francillon, A., Rohatgi, P. (eds.) Smart Card Research and Advanced Applications—12th International Conference, CARDIS 2013, Berlin, Germany, November 27–29, 2013. Revised Selected Papers. LNCS, vol. 8419, pp. 253–270. Springer (2013). doi:10.1007/978-3-319-08302-5_17
Dabosville, G., Doget, J., Prouff, E.: A new second-order side channel attack based on linear regression. IEEE Trans. Comput. 62(8), 1629–1640 (2013)
Doget, J., Prouff, E., Rivain, M., Standaert, F.X.: Univariate side channel attacks and leakage modeling. J. Cryptogr. Eng. 1(2), 123–144 (2011)
Gierlichs, B., Lemke-Rust, K., Paar, C.: Templates vs. stochastic methods. In: CHES. LNCS, vol. 4249, pp. 15–29. Springer (2006). Yokohama, Japan
Glowacz, C., Grosso, V., Poussier, R., Schüth, J., Standaert, F.: Simpler and more efficient rank estimation for side-channel security assessment. In: Leander, G. (ed.) Fast Software Encryption—22nd International Workshop, FSE 2015, Istanbul, Turkey, March 8–11, 2015, Revised Selected Papers. Lecture Notes in Computer Science, vol. 9054, pp. 117–129. Springer (2015). doi:10.1007/978-3-662-48116-5_6
Heuser, A., Rioul, O., Guilley, S.: Good is not good enough—deriving optimal distinguishers from communication theory. In: Batina, L., Robshaw, M. (eds.) Cryptographic Hardware and Embedded Systems—CHES 2014—16th International Workshop, Busan, South Korea, September 23–26, 2014. Proceedings. Lecture Notes in Computer Science, vol. 8731, pp. 55–74. Springer (2014). doi:10.1007/978-3-662-44709-3_4
Lomné, V., Prouff, E., Roche, T.: Behind the scene of side channel attacks. In: Sako, K., Sarkar, P. (eds.) ASIACRYPT (1). Lecture Notes in Computer Science, vol. 8269, pp. 506–525. Springer (2013)
Mahalanobis, P.C.: On the generalised distance in statistics. In: Proceedings of the National Institute of Sciences of India, vol. 2, no. 1, pp. 49–55 (1936). http://ir.isical.ac.in/dspace/handle/1/1268
Mather, L., Oswald, E., Whitnall, C.: Multi-target DPA attacks: pushing DPA beyond the limits of a desktop computer. In: Sarkar and Iwata [14], pp. 243–261. doi:10.1007/978-3-662-45611-8_13
Schindler, W., Lemke, K., Paar, C.: A stochastic model for differential side channel cryptanalysis. In: CHES. LNCS, vol. 3659, pp. 30–46. Springer (2005). Edinburgh, Scotland, UK
Sugawara, T., Homma, N., Aoki, T., Satoh, A.: Profiling attack using multivariate regression analysis. IEICE Electron. Express 7(15), 1139–1144 (2010). doi:10.1587/elex.7.1139
TELECOM ParisTech SEN Research Group: DPA Contest (4th edition) (2013–2014). http://www.DPAcontest.org/v4/
Veyrat-Charvillon, N., Gérard, B., Renauld, M., Standaert, F.X.: An optimal key enumeration algorithm and its application to side-channel attacks. In: Knudsen, L.R., Wu, H. (eds.) Selected Areas in Cryptography. Lecture Notes in Computer Science, vol. 7707, pp. 390–406. Springer (2012)
Veyrat-Charvillon, N., Gérard, B., Standaert, F.: Soft analytical side-channel attacks. In: Sarkar and Iwata [14], pp. 282–296. doi:10.1007/978-3-662-45611-8_15
Acknowledgements
The authors wish to thank Liran Lerman for interesting discussions about stochastic attacks. Part of this work has been funded by the ANR CHIST-ERA project SECODE (Secure Codes to thwart Cyber-physical Attacks). This work was supported in part by National Natural Science Foundation of China (No. 61632020).
Bruneau, N., Guilley, S., Heuser, A. et al. Optimal side-channel attacks for multivariate leakages and multiple models. J Cryptogr Eng 7, 331–341 (2017). https://doi.org/10.1007/s13389-017-0170-9