1 Introduction

Side-channel attacks make it possible to extract secret keys from cryptographic devices. Template attacks [4] have been introduced as the strongest analysis method. They consist of two phases: (1) an offline profiling phase in which the leakage model of the device under attack is characterized, and (2) an online attack phase in which the secret key is extracted using fresh measurements together with the precharacterized model. Such attacks rely on a maximum likelihood principle to ensure the highest possible success probability (see, e.g., [10]).

In this paper, we study optimal attacks, i.e., attacks with the best possible success probability when extracting the secret key. We leverage such optimal distinguishers to answer the following question: how to attack with the best probability of success when the leakage is multivariate and the model is multiple? An initial empirical work carried out in [15] already confirmed that this type of approach can be very fruitful.

1.1 Contributions

We derive closed-form expressions for the optimal distinguishers in all situations where the model is known (e.g., obtained by profiling) or regressed online. In the case of a known univariate model, we recover the results in [3]. However, our “fully matrix” formalism makes the equations simpler and the proofs shorter. Moreover, compared to [3], we extend the leakage model to the case where the traces are not necessarily centered, thereby allowing a more natural application to real traces. In the realistic “(online) stochastic attack” situation where the model is parametric, i.e., where the coefficients of the model are unknown, we express the optimal distinguisher by maximizing the likelihood over the whole set of possible coefficients. Finally, we provide fast computation algorithms for our novel distinguishers, which happen to be remarkably simple and efficient.

1.2 Outline

The remainder of this paper is organized as follows. Section 2 provides a model of a side-channel attack that is generic enough to capture many different multivariate scenarios. The main results of this paper are outlined in Sect. 3. Section 4 presents experimental results on simulated traces and real-world acquisition campaigns. Conclusions and perspectives are given in Sect. 5.

2 Notations and leakage model

2.1 Notations

We let X denote the leakage measurements, Y the model, N the measurement noise, and \(\alpha \) the link between the model and the measurements. The model Y depends on a key guess k, an n-bit vector, and on some known text T (usually also an n-bit vector), e.g., through a function \(\phi \) such that \(Y=\phi (T,k)\). A well-known example is \(Y=w_H(T\oplus k)\), where \(w_H\) is the Hamming weight function. However, in general, some parameters of the model are unknown. To remain generic, we do not detail further the link between Y and the pair (T, k). As is customary in side-channel analysis, the correct key is denoted by \(k^\star \). The corresponding model using the correct key \(Y(k^\star )\) is denoted by \(Y^\star \).

Let Q be the number of queries (number of measurements), D be the data dimensionality (number of time samples per measurement trace), and S be the model dimensionality (\(\phi : \mathbb {F}_2^n\times \mathbb {F}_2^n\rightarrow \mathbb {R}^S\) is a vectorial function, with S components). Roman letters in bold indicate vectors or matrices that have a dimension in Q, i.e., which are different for each trace \(q=1,2,\ldots ,Q\). More precisely, \(\mathbf X \) represents the full attack campaign, a matrix of \(D\times Q\) measurement samples. The q-th trace is denoted \(X_q\), which is a \(D\times 1\) column vector. Similarly, for the q-th trace, the \(S\times 1\) column vector \(Y_q\) represents the deterministic part of the model while the \(D\times 1\) column vector \(N_q\) is the corresponding measurement noise with \(D\times D\) covariance matrix \(\Sigma \).

We denote by \(\mathrm{tr}\left( \cdot \right) \) the trace of a square matrix, that is, the sum of its diagonal terms. Note that \(\mathrm{tr}\left( AB\right) = \mathrm{tr}\left( BA\right) \) for compatible matrix dimensions. Let \(\Vert \cdot \Vert _2\) denote the Euclidean norm of a \(1\times Q\) row vector. Thus, \(\Vert \mathbf {X}\Vert _2^2 ={ \mathbf {X} {\mathbf {X}}^\mathsf {T}} = \mathrm{tr}\left( {\mathbf {X}}^\mathsf {T} \mathbf {X}\right) \), where \({(\,\cdot \,)}^\mathsf {T}\) is the transposition operator. Finally, let \(\Vert \;\!\cdot \;\!\Vert _F\) denote the Frobenius norm of a matrix (square root of the sum of its squared elements), such that \(\Vert M \Vert _F^2 = \mathrm{tr}\left( M{M}^\mathsf {T}\right) \).
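
As a quick sanity check (ours, not part of the original text), these identities can be verified numerically; the following NumPy snippet checks the cyclic property of the trace and the two norm identities on random matrices.

```python
# Numerical check of tr(AB) = tr(BA), ||X||_2^2 = X X^T = tr(X^T X) for a
# 1 x Q row vector X, and ||M||_F^2 = tr(M M^T) for a generic matrix M.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))
B = rng.normal(size=(5, 3))
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

X = rng.normal(size=(1, 8))                       # a 1 x Q row vector
assert np.isclose((X @ X.T).item(), np.trace(X.T @ X))
assert np.isclose(np.linalg.norm(X), np.sqrt((X @ X.T).item()))

M = rng.normal(size=(4, 6))
assert np.isclose(np.linalg.norm(M, 'fro') ** 2, np.trace(M @ M.T))
```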

2.2 General model

We make the following simplifying assumptions. First, the (environmental) noise is steady, e.g., chip temperature and supply voltage do not vary during the side-channel measurements. Thus, \(N_1, N_2, \ldots , N_Q\) are independent and identically distributed (i.i.d.) (denoted by N with index q dropped). Second, the attacker does not use partial information gathered from the leakage analysis to adaptively choose plaintexts/ciphertexts (non-adaptive attack). Thus, \(Y_1, Y_2, \ldots , Y_Q\) are assumed i.i.d. (denoted by Y). Under the adopted leakage model, it follows that the leakage measurements \(X_1, X_2, \ldots , X_Q\) are also i.i.d. (denoted by X).

Fig. 1 Example of leakage model with \(S=2\) and a model in Hamming weight, with \(n=4\) values (no noise is added)

A distinguisher \(\mathcal {D}\) maps a collection of leakages \(\mathbf {x}\) and texts \(\mathbf {t}\) to an estimation of the secret key \(k^\star \). Let us recall that \(\mathbf {x}\) and \(\mathbf {t}\) are realizations of random variables \(\mathbf {X}\) and \(\mathbf {T}\): \(\mathbf {x}\) is a \(D\times Q\) matrix of real numbers (the acquisition campaign) and \(\mathbf {t}\) is a \(1\times Q\) vector of n-bit words (bytes when \(n=8\)) which are the publicly known plaintext or ciphertext bytes. An optimal distinguisher maximizes the probability of success, i.e., the probability that \(\mathcal {D}(\mathbf {x}, \mathbf {t})=k^\star \).

The simplest situation occurs when X consists of a modulation of Y plus noise, in which case we let \(\alpha \) be the signal envelope. In real traces, however, we face the more general situation where the model can be offset by some quantity, the general case being an S-dimensional parametric model with \(S\ge 2\) components. For this reason, we consider \(\alpha \) as a \(D\times S\) matrix and we set, in matrix notation,

$$\begin{aligned} \mathbf {X} = \alpha \mathbf {Y}^\star + \mathbf {N} \end{aligned}$$
(1)

where \(\mathbf {X}\) is \(D\times Q\), \(\alpha \) is \(D\times S\), \(\mathbf {Y}^\star \) is \(S\times Q\), and \(\mathbf {N}\) is \(D\times Q\). Notice that our convention of arranging traces as columns and time samples (dimensions) as rows allows us to write the deterministic part of the leakage as \(\alpha \mathbf {Y}^\star \), which reads more naturally than the transposed convention in which each trace would be a row of the acquisition matrix.

We notice that, in the state of the art, monovariate leakage models are already vectorial in the sense that they are considered as unknown pseudo-Boolean functions of the pair (T, k). In this paper, we highlight that the corresponding coefficients extend to waveforms in the context of multivariate leakage, which gives them a physical interpretation: the leakage is a sum of waveforms, each resulting from the leakage of an individual resource. Seeing \(\alpha \) as D rows, each representing a series of S weights, is therefore awkward, since it is not related to the way the multivariate leakage is built from the processed data. Instead, it is natural to see \(\alpha \) as S columns, each representing the waveform generated by one coordinate of the model Y.

For each trace \(q=1,2,\ldots ,Q\), we assume that the vector \(N=N_q\) follows the same multivariate normal distribution \(\mathcal {N}(0,\Sigma )\), where the \(D\times D\) covariance matrix \(\Sigma =\mathbb {E}(N {N}^\mathsf {T})\) is assumed known to the attacker. Since \(\Sigma \) is assumed symmetric positive definite, there exists a matrix \(\Sigma ^{1/2}\) such that \(\Sigma ^{1/2} \Sigma ^{1/2} = \Sigma \). We refer to \(\Sigma ^{1/2}\) as the standard deviation noise matrix.

The model (1) used throughout the paper is quite generic and has multiple facets depending on the choice of S and the respective values given to \(\alpha \) and Y. This is discussed next.
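
Before turning to the examples, the following NumPy sketch (ours, not from the paper) shows how synthetic data following model (1) can be generated; the Hamming-weight model, the key value, and the covariance matrix are arbitrary illustrative choices.

```python
# Minimal sketch of model (1): X = alpha Y* + N, with X (D x Q), alpha (D x S),
# Y* (S x Q), and N (D x Q) whose columns N_q are i.i.d. ~ N(0, Sigma).
import numpy as np

rng = np.random.default_rng(1)
D, S, Q, n = 10, 2, 1000, 4
k_star = 5                                   # secret key (an n-bit value)
t = rng.integers(0, 2 ** n, size=Q)          # publicly known texts

hw = np.array([bin(v).count("1") for v in range(2 ** n)])
y_star = np.vstack([hw[t ^ k_star] - n / 2,  # centered scalar model (unit variance for n = 4)
                    np.ones(Q)])             # constant component -> S = 2

alpha = rng.normal(size=(D, S))              # envelope and offset waveforms
A_ = rng.normal(size=(D, D))
Sigma = A_ @ A_.T + np.eye(D)                # any symmetric positive-definite matrix
N = rng.multivariate_normal(np.zeros(D), Sigma, size=Q).T   # D x Q noise
x = alpha @ y_star + N                       # D x Q acquisition campaign, Eq. (1)
```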

2.3 Examples with \(S=2\) and \(S=9\)

For \(S=1\), the traces consist only of a modulation of the model plus noise, as in [2, 3]. When considering traces that are not only modulated but also have an offset term, we have \(S=2\). We then write the 2-dimensional model as \(\left( \begin{array}{l}{\mathbf {Y}}\\ {\mathbf {1}}\end{array}\right) \), where \(\mathbf {Y}\) and \(\mathbf {1}\) are the \(1\times Q\) matrices \((Y_1, Y_2, \ldots , Y_Q)\) and \((1, 1, \ldots , 1)\). The \(D\times 2\) matrix \(\alpha \) in (1) then takes the special form \(( \alpha \;\beta )\), where \(\beta \) is the offset value.

An illustration is provided in Fig. 1, where the parameter \(\beta \in \mathbb {R}^D\) is the waveform when there is no signal, whereas \(\alpha \in \mathbb {R}^D\) is the signal envelope. The complete model is the sum \(\alpha Y+\beta \), where Y is the Hamming weight of some intermediate variable (such as the XOR operation \(T\oplus k\)) on \(n=4\) bits. While the leakage signal may be represented as a continuous curve as illustrated in Fig. 1, the practical acquisition consists of a temporal series of D “discrete samples,” typically within one clock period. For \(S=2\), we thus write (1) as

$$\begin{aligned} \mathbf {X} = \alpha \mathbf {Y}^\star + \beta \mathbf {1} + \mathbf {N} \end{aligned}$$
(2)
Fig. 2 Leakage evaluation of traces from DPA contest V4 (knowing the mask)

Fig. 3 Mathematical expression for multivariate (\(D\ge 1\)) optimal attacks with a linear combination of models (\(S\ge 1\))

where \(\mathbf {X}\) is \(D\times Q\), \(\alpha \) and \(\beta \) are \(D\times 1\), \(\mathbf {Y}^\star \) and \(\mathbf {1}=(1, \ldots , 1)\) are \(1\times Q\), and \(\mathbf {N}\) is \(D\times Q\). Here \(\mathbf {Y}\) is assumed centered: \(\mathbb {E}(\mathbf {Y})=\mathbf {0}=(0,\ldots ,0)\) (since the non-centered part is captured by the \(\beta \mathbf {1}\) term) and of unit variance for every q: \(\text {Var}(Y_q)=\mathbb {E}(Y_q^2)=1\).

For \(S\ge 2\), the actual value of S reflects the complexity of the model. For example, in the weighted sum of bits model, the model for each trace can be written as \(\sum _{s=1}^n \alpha _s Y_s + \beta \), where \(Y_s\) is the sth bit of the n-bit sensitive variable Y. Accordingly, we have

$$\begin{aligned}&S=n+1, \qquad \text { and thus:} \nonumber \\&\alpha = \left( \begin{array}{cccc} \alpha _1&\ldots&\alpha _n&\beta \end{array} \right) , \qquad \mathbf {Y} = {(\mathbf {Y}_1 \ldots \mathbf {Y}_n \; \mathbf {1})}^\mathsf {T}. \end{aligned}$$
(3)

This leakage model is more complex than before, but may arise in practice. For example, Fig. 2 plots the coefficients \(\alpha _1,\ldots ,\alpha _8\) estimated from traces taken from an ATMega smartcard—the datasets are available from the DPA contest V4 team [16]. In particular, one can observe that samples around [50, 80] are ordered by Hamming weight: this part of the trace resembles the upper left part of Fig. 1 for \(S=2\). By analyzing the \((n+1)\)-variate model of (3), one can indeed see that around [50, 80], the vectors \(\alpha _1, \ldots , \alpha _8\) are almost identical. However, samples in the intervals [170, 250] or [330, 400] have a more complex model. At these times, the eight vectors \(\alpha _1, \ldots , \alpha _8\) are clearly different, so the model is 9-variate.

In the sequel, we consider both types of attacks: those with offline profiling, where the \(\alpha \) for each component of the model is precharacterized as in Fig. 2, and those where the model is learned online, as in a linear regression attack [7].

3 Theoretical results and implementation

3.1 General mathematical expressions

In this section, we derive the mathematical expression of the optimal distinguisher \(\mathcal {D}\) in the general case of multivariate leakage (\(D\ge 1\)) and multiple models (\(S\ge 1\)). An illustration of our results is given in Fig. 3, both for the case where the leakage model is completely known (profiled, as in a template attack) and for the case where it is unknown and estimated online.

Definition 1

(Optimal distinguisher knowing or ignoring \(\alpha \))

$$\begin{aligned}&\mathcal {D}_{\mathrm {ML}}(\mathbf {x}, \mathbf {t}) = \mathrm{argmax}_{k\in \mathbb {F}_2^n} p(\mathbf {x} | \mathbf {y}(k)) \qquad \text { and }\\&\mathcal {D}_{\mathrm {ML,sto}}(\mathbf {x}, \mathbf {t}) = \mathrm{argmax}_{k\in \mathbb {F}_2^n} \max _{\alpha \in \mathbb {R}^{D\times S}} p(\mathbf {x} | \mathbf {y}(k), \alpha ) . \end{aligned}$$

In both cases (Theorems 1 and 2 below), the result is a distinguisher which is computed using simple matrix operations. While \(\mathcal {D}_{\mathrm {ML}}\) resembles a template attack with Gaussian templates [4], \(\mathcal {D}_{\mathrm {ML,sto}}\) is a novel expression that results from a non-trivial maximization over the matrix \(\alpha \) and may be interpreted as a generalization of a multivariate correlation power attack [1].

Theorem 1

The optimal maximum likelihood (ML) distinguisher [10] for Gaussian noise reads

$$\begin{aligned} \mathcal {D}_\mathrm {ML}(\mathbf {x}, \mathbf {t}) = \mathrm{argmin}_k \ \mathrm{tr}\left( {(\mathbf {x} - \alpha \mathbf {y})}^\mathsf {T} \Sigma ^{-1} (\mathbf {x} - \alpha \mathbf {y})\right) . \end{aligned}$$
(4)

Proof

From [10], we have \(\mathcal {D}_\mathrm {ML}(\mathbf {x}, \mathbf {t}) = \mathrm{argmax}_k \ p(\mathbf {x}|\mathbf {y})\) while from (1) we see that \(p(\mathbf {x}|\mathbf {y})= p_\mathbf {N}(\mathbf {x}-\alpha \mathbf {y})\). From the i.i.d. assumption, the noise density \(p_\mathbf {N}(\mathbf {n})\) is given by

$$\begin{aligned} p_{\mathbf {N}}(\mathbf {n})&= \prod _{q=1}^Q \frac{1}{\sqrt{(2\pi )^D |\det \Sigma |}} \exp -\frac{1}{2} {n_q}^\mathsf {T} \Sigma ^{-1} n_q \\&= \frac{1}{(2\pi )^{D Q/2}} \frac{1}{(\det \Sigma )^{Q/2}} \exp -\frac{1}{2} \biggl ( \sum _{q=1}^Q {n_q}^\mathsf {T} \Sigma ^{-1} n_q \biggr )\\&= \frac{1}{(2\pi )^{DQ/2} (\det \Sigma )^{Q/2}} \exp -\frac{1}{2} \mathrm{tr}\left( {\mathbf {n}}^\mathsf {T} \Sigma ^{-1} \mathbf {n}\right) . \end{aligned}$$

Thus, \(p_\mathbf {N}(\mathbf {x}-\alpha \mathbf {y})\) is maximum when the expression \(\mathrm{tr}\left( {\mathbf {n}}^\mathsf {T} \Sigma ^{-1} \mathbf {n}\right) \) for \(\mathbf {n}=\mathbf {x}-\alpha \mathbf {y}\) is minimum. \(\square \)

In Eqn. (4) of Theorem 1, the trace

$$\begin{aligned} \mathrm{tr}\left( {(\mathbf {x} - \alpha \mathbf {y})}^\mathsf {T} \Sigma ^{-1} (\mathbf {x} - \alpha \mathbf {y})\right) \end{aligned}$$

can equivalently be seen as:

  • the sum of Q Mahalanobis [12] distances (see also Eqn. (22) of [5]),

  • the sum of D elements (which is useful when \(D\ll Q\)), as attested by rewriting (a short computational sketch is given after this list)

    $$\begin{aligned}&\text {tr}\Bigg (\underbrace{{(\mathbf {x} - \alpha \mathbf {y})}^\mathsf {T} \Sigma ^{-1} (\mathbf {x} - \alpha \mathbf {y})}_{Q\times Q \text { matrix}}\Bigg )\\&\quad = \text {tr}\Bigg (\underbrace{\Sigma ^{-1} (\mathbf {x} - \alpha \mathbf {y}) {(\mathbf {x} - \alpha \mathbf {y})}^\mathsf {T}}_{D\times D \text { matrix}}\Bigg ) . \end{aligned}$$
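
The \(D\times D\) form of this trace translates directly into a few lines of linear algebra. The sketch below is our illustration rather than the authors' code; the arguments alpha, Sigma, and model (a function returning the \(S\times Q\) matrix y(k) for a key guess k) are assumed inputs of the sketch.

```python
# Sketch of the ML distinguisher (4): for each key guess, form the residual
# x - alpha y(k) and score it with tr(r^T Sigma^{-1} r), computed via the
# D x D form (cheap when D << Q); the key minimizing the score is returned.
import numpy as np

def d_ml(x, t, alpha, Sigma, model, n=4):
    """x: D x Q traces, t: length-Q texts, alpha: profiled D x S matrix,
    Sigma: D x D noise covariance, model(t, k): S x Q model matrix y(k)."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = []
    for k in range(2 ** n):
        r = x - alpha @ model(t, k)                 # D x Q residual
        scores.append(np.trace(Sigma_inv @ r @ r.T))
    return int(np.argmin(scores))
```

For an \(S=2\) Hamming-weight model, for instance, model(t, k) would simply stack the centered Hamming weights \(w_H(t\oplus k)-n/2\) and a row of ones.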

Theorem 2

The optimal stochastic multivariate attack is given by

$$\begin{aligned} \mathcal {D}_\mathrm {ML,sto}(\mathbf {x},\mathbf {t}) = \mathrm{argmax}_{k\in \mathbb {F}_2^n} \ \mathrm{tr}\left( {\mathbf {y}}^\mathsf {T} (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1} \mathbf {y} \ \ {\mathbf {x}}^\mathsf {T} \Sigma ^{-1} \mathbf {x}\right) \end{aligned}$$
(5)

for which the optimal value of \(\alpha \) is given by

$$\begin{aligned} \alpha ^\text {opt} = (\mathbf {x} {\mathbf {y}}^\mathsf {T}) (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1} . \end{aligned}$$
(6)

For the proof, we need some known results from linear algebra (Lemma 1) and linear regression (Lemma 2).

Lemma 1

Let \(\mathbf {b}\) be an \(S\times Q\) matrix, with \(S<Q\). The \(S\times S\) matrix \(\mathbf {b} {\mathbf {b}}^\mathsf {T}\) is invertible if and only if \(\mathbf {b}\) has full rank S, i.e., if and only if the S rows of \(\mathbf {b}\) are linearly independent.

Proof

Let x be an \(S\times 1\) column vector. We have that \({x}^\mathsf {T} \mathbf {b} {\mathbf {b}}^\mathsf {T} x = \Vert {\mathbf {b}}^\mathsf {T} x \Vert ^2 = 0\) implies \({\mathbf {b}}^\mathsf {T} x = 0\); since \(\mathbf {b}\) has full rank S, this in turn implies \(x=0\). Hence, when \(\mathbf {b}\) has full rank, the matrix \(\mathbf {b} {\mathbf {b}}^\mathsf {T}\) is positive definite, and in particular invertible. Conversely, if \(\mathbf {b}\) does not have full rank, there exists \(x\ne 0\) such that \({\mathbf {b}}^\mathsf {T} x = 0\), so that \(\mathbf {b} {\mathbf {b}}^\mathsf {T} x = 0\) and \(\mathbf {b} {\mathbf {b}}^\mathsf {T}\) is singular. \(\square \)

Lemma 2

Let \(\mathbf {a}\), \(\mathbf {b}\), and \(\alpha \) be, respectively, \(1\times Q\), \(S\times Q\), and \(1\times S\) with \(S<Q\), where \(\mathbf {b}\) has full rank S. Then, \(\Vert \mathbf {a}-\alpha \mathbf {b}\Vert _2\) reaches its minimum for \(\alpha =\mathbf {a}{\mathbf {b}}^\mathsf {T}(\mathbf {b}{\mathbf {b}}^\mathsf {T})^{-1}\).

Proof

Expanding the squared norm gives \(\Vert \mathbf {a}-\alpha \mathbf {b}\Vert _2^2 = (\mathbf {a}-\alpha \mathbf {b}){(\mathbf {a}-\alpha \mathbf {b})}^\mathsf {T} = \mathbf {a}{\mathbf {a}}^\mathsf {T} -2\alpha \mathbf {b}{\mathbf {a}}^\mathsf {T} + \alpha \mathbf {b}{\mathbf {b}}^\mathsf {T}{\alpha }^\mathsf {T}\). Therefore, the gradient \(\frac{\partial }{\partial \alpha } \Vert \mathbf {a}-\alpha \mathbf {b}\Vert _2^2 = -2\mathbf {b}{\mathbf {a}}^\mathsf {T} + 2\mathbf {b}{\mathbf {b}}^\mathsf {T}{\alpha }^\mathsf {T}\) vanishes if and only if \({\alpha }^\mathsf {T}=(\mathbf {b}{\mathbf {b}}^\mathsf {T})^{-1} \mathbf {b} {\mathbf {a}}^\mathsf {T}\), i.e., \(\alpha =\mathbf {a}{\mathbf {b}}^\mathsf {T}(\mathbf {b}{\mathbf {b}}^\mathsf {T})^{-1}\) where we have used the fact that \(\mathbf {b}{\mathbf {b}}^\mathsf {T}\) is invertible by Lemma 1. \(\square \)
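
As a quick numerical check of Lemma 2 (ours, not part of the original text), the closed-form solution coincides with an ordinary least-squares fit:

```python
# The closed form a b^T (b b^T)^{-1} of Lemma 2 equals the least-squares
# solution of min_alpha ||a - alpha b||_2.
import numpy as np

rng = np.random.default_rng(2)
S, Q = 3, 50
b = rng.normal(size=(S, Q))     # full rank S with probability 1
a = rng.normal(size=(1, Q))

alpha_closed = a @ b.T @ np.linalg.inv(b @ b.T)                 # Lemma 2
alpha_lstsq = np.linalg.lstsq(b.T, a.ravel(), rcond=None)[0]    # solves min ||b^T z - a^T||
assert np.allclose(alpha_closed.ravel(), alpha_lstsq)
```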

Proof

(Proof of Theorem 2) Let \(\mathbf {x}' = \Sigma ^{-1/2} \ \mathbf {x}\), \(\alpha ' = \Sigma ^{-1/2} \ \alpha \), and \(\mathbf {y}' = (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2} \ \mathbf {y}\). The optimal distinguisher minimizes the following expression over \(\alpha \in \mathbb {R}^{D\times S}\):

$$\begin{aligned}&\mathrm{tr}\left( {(\mathbf {x}-\alpha \mathbf {y})}^\mathsf {T} \Sigma ^{-1} (\mathbf {x}-\alpha \mathbf {y})\right) \\&\quad = \mathrm{tr}\left( (\mathbf {x}'-\alpha '\mathbf {y}){(\mathbf {x}'-\alpha '\mathbf {y})}^\mathsf {T}\right) = \sum _{d=1}^D \Vert \mathbf {x}'_d-\alpha '_d \mathbf {y} \Vert ^2 . \end{aligned}$$

By Lemma 2, the minimization over \(\alpha '_d\) yields \(\alpha '_d = (\mathbf {x}'_d {\mathbf {y}}^\mathsf {T}) (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1}\) for all \(d=1,\ldots , D\). This gives \(\alpha '=(\mathbf {x}' {\mathbf {y}}^\mathsf {T}) (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1}\) hence \(\alpha =(\mathbf {x} {\mathbf {y}}^\mathsf {T}) (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1}\), which remarkably does not depend on \(\Sigma \).

The minimized value of the distinguisher is thus

$$\begin{aligned}&\min _{\alpha } \ \mathrm{tr}\left( {(\mathbf {x}-\alpha \mathbf {y})}^\mathsf {T} \Sigma ^{-1} (\mathbf {x}-\alpha \mathbf {y})\right) \\&\quad = \mathrm{tr}\left( {(\mathbf {x}-\alpha ^\text {opt}\mathbf {y})}^\mathsf {T} \Sigma ^{-1} (\mathbf {x}-\alpha ^\text {opt}\mathbf {y})\right) \\&\quad = \mathrm{tr}\left( (\mathsf {Id} - {\mathbf {y}}^\mathsf {T} (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1} \mathbf {y})^2 \ {\mathbf {x}}^\mathsf {T} \Sigma ^{-1} \mathbf {x}\right) \\&\quad = \mathrm{tr}\left( {\mathbf {x}}^\mathsf {T} \Sigma ^{-1} \mathbf {x}\right) - \mathrm{tr}\left( {\mathbf {y}}^\mathsf {T} (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1} \mathbf {y} \ \ {\mathbf {x}}^\mathsf {T} \Sigma ^{-1} \mathbf {x}\right) \end{aligned}$$

where \(\mathsf {Id}\) denotes the \(Q\times Q\) identity matrix and where \(\mathrm{tr}\left( {\mathbf {x}}^\mathsf {T} \Sigma ^{-1} \mathbf {x}\right) \) is a constant independent of k. This proves Theorem 2. \(\square \)

The expression of \(\mathcal {D}_\mathrm {ML,sto}(\mathbf {x},\mathbf {t})\) given in Theorem 2 is the trace of a \(Q\times Q\) matrix, which can admittedly be very large. It can, however, be rewritten in a way that is easier to compute when Q is much greater than S and D:

Fig. 4 Modus operandi for multivariate (\(D\ge 1\)) optimal attacks with one model \(\mathbf {Y}\) associated with envelope \(\alpha \in \mathbb {R}^{D\times 1}\) and a constant offset \(\beta \in \mathbb {R}^{D\times 1}\) (\(S=2\))

Corollary 1

(Alternative expression of \(\mathcal {D}_\mathrm {ML,sto}\)) Letting \(\mathbf {x}' = \Sigma ^{-1/2} \ \mathbf {x}\) and \(\mathbf {y}' = (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2} \ \mathbf {y}\) as in the proof of Theorem 2, we have

$$\begin{aligned} \mathcal {D}_\mathrm {ML,sto}(\mathbf {x},\mathbf {t}) = \mathrm{argmax}_{k\in \mathbb {F}_2^n} \ \Vert \mathbf {x}' {\mathbf {y}'}^\mathsf {T} \Vert _F . \end{aligned}$$
(7)

Here the Frobenius norm is of a \(D\times S\) matrix.

Proof

Let us write \((\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1}=(\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2} (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2}\) in (5). By the properties of the trace,

$$\begin{aligned}&\mathrm{tr}\left( {\mathbf {y}}^\mathsf {T} (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1} \mathbf {y} \ \ {\mathbf {x}}^\mathsf {T} \Sigma ^{-1} \mathbf {x}\right) \\&\quad =\mathrm {tr}\left( {\underbrace{\Bigl ((\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-\frac{1}{2}} \mathbf {y} {(\Sigma ^{-\frac{1}{2}} \mathbf {x})}^\mathsf {T}\Bigr )}_{S\times D} \underbrace{{\Bigl ((\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-\frac{1}{2}} \mathbf {y} {(\Sigma ^{-\frac{1}{2}} \mathbf {x})}^\mathsf {T}\Bigr )}^\mathsf {T}}_{D \times S}}\right) \\&\quad =\mathrm{tr}\left( (\mathbf {y}' {\mathbf {x}'}^\mathsf {T}) \smash [t]{{(\mathbf {y}' {\mathbf {x}'}^\mathsf {T})}^\mathsf {T}}\right) = \Vert \mathbf {x}' {\mathbf {y}'}^\mathsf {T} \Vert _F^2. \end{aligned}$$

\(\square \)

Remark 1

Notice that in Corollary 1, \(\mathbf {y}'\) is normalized so that its empirical covariance matrix is the identity. Indeed, \(\mathbf {y}' {\mathbf {y}'}^\mathsf {T} = (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2} \mathbf {y} {\mathbf {y}}^\mathsf {T} (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2} = \mathsf {Id}\).
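
Corollary 1 suggests a direct implementation: whiten the traces by \(\Sigma ^{-1/2}\), normalize the model by \((\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1/2}\), and select the key guess maximizing the Frobenius norm of \(\mathbf {x}' {\mathbf {y}'}^\mathsf {T}\). The following NumPy sketch is our illustration under these assumptions; the argument model plays the same role as in the earlier sketch.

```python
# Sketch of the optimal stochastic distinguisher via Eq. (7).
import numpy as np

def inv_sqrt(M):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def d_ml_sto(x, t, Sigma, model, n=4):
    """x: D x Q traces, t: length-Q texts, model(t, k): S x Q model matrix y(k)."""
    x_w = inv_sqrt(Sigma) @ x                      # x' = Sigma^{-1/2} x
    scores = []
    for k in range(2 ** n):
        y = model(t, k)                            # S x Q
        y_w = inv_sqrt(y @ y.T) @ y                # y' = (y y^T)^{-1/2} y
        scores.append(np.linalg.norm(x_w @ y_w.T, 'fro'))   # D x S matrix, Eq. (7)
    return int(np.argmax(scores))

# If the regressed envelope itself is of interest, Eq. (6) gives it directly:
# alpha_opt = (x @ y.T) @ np.linalg.inv(y @ y.T)
```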

3.2 Mathematical expressions for \(S=2\)

In order to provide interpretations for the optimal distinguisher expressions, we detail how an optimal attack unfolds when the leakage consists of the sum of a modulated scalar model and an offset (\(S=2\)). The cases of profiled attacks (denoted \(\mathcal {D}_{\mathrm {ML}}^{S=2}\)) and non-profiled attacks (denoted \(\mathcal {D}_{\mathrm {ML,sto}}^{S=2}\)) are presented in Fig. 4.

Interestingly, when \(S=2\), the template attack can be decomposed into two steps (an affine projection followed by a Euclidean distance to the model). Remarkably, the projection vector is the same for all key guesses. This extends similar results [3] where only the linear relationship between leakage and model is explored. As for the online attack, \(\mathcal {D}_{\mathrm {ML,sto}}^{S=2}\) consists of a sum of squared CPA attacks on transformed data, the transformation aiming at orthogonalizing the noise.

3.3 Efficient implementation

Both \(\mathcal {D}_{\mathrm {ML}}\) and \(\mathcal {D}_{\mathrm {ML,sto}}\) can be optimized using the idea presented in [11]. This article applies a precomputation step in the case where the number of traces is larger than the number of possible plaintexts (\(Q> \#\mathcal {T}=2^n\)). In this case, all summations \(\sum _q\) can advantageously be replaced by \(\sum _t \sum _{q/t_q=t}\). In most cases, the inner sum \(\sum _{q/t_q=t}\) can be computed on the fly and does not involve a hypothesis on the key. Therefore, a speed gain of \(2^n\) (the cardinality of the key space) is expected.

Such an optimization strategy can be applied to \(\mathcal {D}_{\mathrm {ML}}\). Indeed, let us define \(\mathbf {x}'=\Sigma ^{-1/2}\mathbf {x}\) and \(\alpha '=\Sigma ^{-1/2}\alpha \). Then,

$$\begin{aligned}&\mathcal {D}_{\mathrm {ML}}(\mathbf {x},\mathbf {t}) =\mathrm{argmin}_{k} \sum _{d=1}^D \Vert \mathbf {x}'_d-\alpha '_d\mathbf {y} \Vert _2^2 \qquad \text {(see Corollary 1)} \nonumber \\&\quad =\mathrm{argmin}_{k} \sum _{d=1}^D \sum _{t\in \mathbb {F}_2^n} \left( \sum _{q/t_q=t} {x'_{d,q}}^2 - 2 \sum _{q/t_q=t} x'_{d,q} \alpha '_d y(t,k) \right. \nonumber \\&\qquad \left. + \left( \sum _{q/t_q=t} 1\right) (\alpha '_d y(t,k))^2 \right) \nonumber \\&\quad =\mathrm{argmin}_{k} \sum _{d=1}^D \sum _{t\in \mathbb {F}_2^n} - 2 \underbrace{\Bigl (\sum _{q/t_q=t} x'_{d,q}\Bigr )}_{\text {denoted as } x'_{d,t}} \alpha '_d y(t,k)\nonumber \\&\qquad + \underbrace{\Bigl ( \sum _{q/t_q=t} 1 \Bigr )}_{\text {denoted as } n_t} (\alpha '_d y(t,k))^2 \end{aligned}$$
(8)
$$\begin{aligned}&\quad =\mathrm{argmax}_{k} \ \mathrm{tr}\left( \mathbf {x}' {\left( \alpha ' \mathbf {y}(k) \right) }^\mathsf {T}\right) -\frac{1}{2} \sum _{t\in \mathbb {F}_2^n} n_t \; \left\| \alpha ' y(t,k) \right\| _2^2.\nonumber \\ \end{aligned}$$
(9)

Notice that at line (8), the term \(\sum _{q/t_q=t} {x'_{d,q}}^2\), which does not depend on the key, has been discarded. The fast version of this computation is given in Algorithm 1.

Algorithm 1
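
A minimal NumPy sketch in the spirit of Algorithm 1 and Eq. (9) follows (our code and identifiers, not the original listing): whiten once, accumulate traces and counts per text class without any key hypothesis, and only then loop over the \(2^n\) key guesses on the accumulated data. The helper model_per_text(k), returning the \(S\times 2^n\) matrix whose column t is y(t, k), is an assumption of the sketch.

```python
# Fast profiled attack per Eq. (9): per-class accumulation is key-independent,
# so the key loop only works on 2^n columns instead of Q traces.
import numpy as np

def d_ml_fast(x, t, alpha, Sigma, model_per_text, n=4):
    """x: D x Q traces, t: length-Q texts, alpha: D x S profiled matrix,
    model_per_text(k): S x 2^n matrix whose column t is y(t, k)."""
    w, V = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    x_w = Sigma_inv_sqrt @ x                   # x' = Sigma^{-1/2} x
    alpha_w = Sigma_inv_sqrt @ alpha           # alpha' = Sigma^{-1/2} alpha

    # Key-independent precomputation: accumulate traces and counts per text class
    D = x.shape[0]
    x_acc = np.zeros((D, 2 ** n))              # x'_{d,t} = sum_{q: t_q = t} x'_{d,q}
    n_t = np.zeros(2 ** n)
    for q, tq in enumerate(t):
        x_acc[:, tq] += x_w[:, q]
        n_t[tq] += 1

    scores = []
    for k in range(2 ** n):
        m = alpha_w @ model_per_text(k)        # D x 2^n, column t is alpha' y(t, k)
        scores.append(np.trace(x_acc @ m.T)
                      - 0.5 * np.sum(n_t * np.sum(m ** 2, axis=0)))
    return int(np.argmax(scores))
```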

The same optimization applies to \(\mathcal {D}_{\mathrm {ML,sto}}\). Indeed, in expression (7) of \(\mathcal {D}_{\mathrm {ML,sto}}(\mathbf x ,\mathbf t ) = \mathrm{argmax}_k \ \Vert \mathbf {x}' {\mathbf {y}'}^\mathsf {T} \Vert _F^2\), one can write

$$\begin{aligned}&\Vert \mathbf {x}' {\mathbf {y}'}^\mathsf {T} \Vert _F^2 = \sum _{s,d} \Bigl ( \sum _{q=1}^Q x'_{d,q} y'_{s,q} \Bigr )^2 \nonumber \\&\quad = \sum _{s,d} \Bigl ( \sum _{t\in \mathbb {F}_2^n} \underbrace{\Bigl ({\textstyle \sum _{q/t_q=t} x'_{d,q}}\Bigr )}_{\text {denoted as } x'_{d,t}} \underbrace{\Bigl ( y'_{s}(t,k) \Bigr )}_{\text {denoted as } y'_{s,t}} \Bigr )^2 . \end{aligned}$$
(10)

This means that \(\mathbf {x}'\) can be obtained by simple accumulation, exactly as in line 2 of Algorithm 1. The term \(y'_{s}(t,k)\) requires the computation of \(\mathbf {y}{\mathbf {y}}^\mathsf {T}\). In the case \(Q\gg 1\), it can be assumed that the texts \(\mathbf {t}\) are uniformly distributed. Hence, when \(Q\rightarrow +\infty \), by the law of large numbers,

$$\begin{aligned}&\frac{1}{Q} \mathbf {y}{\mathbf {y}}^\mathsf {T} = \frac{1}{Q} \sum _{q=1}^Q y_q {y_q}^\mathsf {T} = \sum _{t\in \mathbb {F}_2^n} \frac{\sum _{q/t_q=t} 1}{Q} y(t,k) {y(t,k)}^\mathsf {T} \\&\quad \xrightarrow [Q\rightarrow +\infty ]{} \frac{1}{2^n} \sum _{t\in \mathbb {F}_2^n} y(t,k) {y(t,k)}^\mathsf {T} . \end{aligned}$$

Therefore, in (10), \(y'_s(t,k)\) can also be precomputed. To the best of our knowledge, this optimization has never been discussed previously. The resulting distinguishing procedure is given in Algorithm 2. At line 3, the argument of the Frobenius norm can be computed by a fast matrix multiplication. Also, we notice that the matrix inversion at line 0 is actually a precomputation which involves only the model. Besides, if the EIS (Equal Images under all Subkeys) assumption holds [14, Def. 2], e.g., if y(t, k) only depends on \(t\oplus k\), then \(\sum _t y(t,k) {y(t,k)}^\mathsf {T}\) does not depend on k, so only a single matrix inversion needs to be computed. In the end, the computation of the optimal stochastic attack simply consists of a trace accumulation per class, followed by as many matrix products and Frobenius norms as there are key guesses.

Algorithm 2
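
Likewise, the sketch below (ours, in the spirit of Algorithm 2) implements the per-class accumulation of Eq. (10) together with the law-of-large-numbers precomputation of \(\mathbf {y}{\mathbf {y}}^\mathsf {T}\); under the EIS assumption, the inverse square root could be moved out of the key loop and computed once.

```python
# Fast online (stochastic) attack per Eq. (10).
import numpy as np

def inv_sqrt(M):
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def d_ml_sto_fast(x, t, Sigma, model_per_text, n=4):
    """x: D x Q traces, t: length-Q texts,
    model_per_text(k): S x 2^n matrix whose column t is y(t, k)."""
    Q = x.shape[1]
    x_w = inv_sqrt(Sigma) @ x                 # x' = Sigma^{-1/2} x
    x_acc = np.zeros((x.shape[0], 2 ** n))    # per-class accumulation (key-independent)
    for q, tq in enumerate(t):
        x_acc[:, tq] += x_w[:, q]

    scores = []
    for k in range(2 ** n):
        y = model_per_text(k)                 # S x 2^n
        yyT = (Q / 2 ** n) * (y @ y.T)        # law-of-large-numbers estimate of y y^T
        y_w = inv_sqrt(yyT) @ y               # precomputable y'(t, k); under EIS,
                                              # yyT does not depend on k
        scores.append(np.linalg.norm(x_acc @ y_w.T, 'fro'))   # Eq. (10)
    return int(np.argmax(scores))
```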

4 Practical results

4.1 Characterization of \(\Sigma \)

In this article, we assume that the attacker knows the noise covariance matrix. We give a straightforward procedure for its estimation:

  1. collect Q traces (i.e., a matrix \(\mathbf {x}\in \mathbb {R}^{D\times Q}\)) where the plaintext is fixed to a given value,

  2. estimate \(\Sigma \) as \(\hat{\Sigma }= \frac{1}{Q-1} \bigl ( \mathbf {x}-\frac{1}{Q}\mathbf {x}{\mathbf {1}}^\mathsf {T}\mathbf {1} \bigr ) \! {\bigl ( \mathbf {x}-\frac{1}{Q}\mathbf {x}{\mathbf {1}}^\mathsf {T}\mathbf {1} \bigr )}^\mathsf {T}\!\!,\) where \(\mathbf {1}=(1, \ldots , 1)\in \mathbb {R}^{1\times Q}\). This estimator is the unbiased sample covariance matrix. A minimal sketch of this procedure is given below.
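
A minimal sketch of this two-step procedure (our code, not the authors'):

```python
# Unbiased sample covariance of traces acquired with a fixed plaintext.
import numpy as np

def estimate_sigma(x_fixed):
    """x_fixed: D x Q traces acquired with a constant plaintext."""
    Q = x_fixed.shape[1]
    centered = x_fixed - x_fixed.mean(axis=1, keepdims=True)   # x - (1/Q) x 1^T 1
    return centered @ centered.T / (Q - 1)
```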

Fig. 5 Simulations for \(D=3\), \(S=5\), \(n=4\), \(\sigma =1\) (AR noise with \(\rho =0.5\))

Remark 2

Notice that \(\Sigma \) cannot be obtained by direct profiling on the same traces as those used for the attack. Indeed, in those traces, the plaintext is varying; hence, the attacker would use for \(\hat{\Sigma }\) the covariance matrix of \(\mathbf {x}-\alpha ^\text {opt} \mathbf {y}\), where \(\alpha ^\text {opt} = (\mathbf {x} {\mathbf {y}}^\mathsf {T}) (\mathbf {y} {\mathbf {y}}^\mathsf {T})^{-1}\) [recall (6)]. Hence, \(\hat{\Sigma }= \frac{1}{Q-1} (\mathbf {x}-\alpha ^\text {opt} \mathbf {y}) {(\mathbf {x}-\alpha ^\text {opt} \mathbf {y})}^\mathsf {T}\). But the distinguisher \(\mathcal {D}_\mathrm {ML,sto}\) is

$$\begin{aligned}&\mathcal {D}_\mathrm {ML,sto}(\mathbf {x},\mathbf {t}) \nonumber \\&\quad = \mathrm{argmin}_{k\in \mathbb {F}_2^n} \min _{\alpha \in \mathbb {R}^{D\times S}} \, \mathrm{tr}\left( {(\mathbf {x}-\alpha \mathbf {y})}^\mathsf {T} \hat{\Sigma }^{-1} (\mathbf {x}-\alpha \mathbf {y})\right) \nonumber \\&\quad = \mathrm{argmin}_{k\in \mathbb {F}_2^n} \min _{\alpha \in \mathbb {R}^{D\times S}} \, \mathrm{tr}\left( \hat{\Sigma }^{-1} (\mathbf {x}-\alpha \mathbf {y}) {(\mathbf {x}-\alpha \mathbf {y})}^\mathsf {T}\right) \nonumber \\&\quad = \mathrm{argmin}_{k\in \mathbb {F}_2^n} \, \mathrm{tr}\left( \hat{\Sigma }^{-1} (\mathbf {x}-\alpha ^\text {opt}\mathbf {y}) {(\mathbf {x}-\alpha ^\text {opt}\mathbf {y})}^\mathsf {T}\right) \end{aligned}$$
(11)
$$\begin{aligned}&\quad = \mathrm{argmin}_{k\in \mathbb {F}_2^n} \, \mathrm{tr}\left( (Q-1) \hat{\Sigma }^{-1} \hat{\Sigma }\right) = \mathrm{argmin}_{k\in \mathbb {F}_2^n} \, D(Q-1) . \end{aligned}$$
(12)

Indeed, at line (11), we use the fact, demonstrated in the proof of Theorem 2, that the minimizing value (6) of \(\alpha \) is independent of \(\Sigma \). Finally, it can be seen at line (12) that the distinguisher computed with \(\hat{\Sigma }\) instead of \(\Sigma \) does not depend on the key.

4.2 Attacks on synthetic (i.e., simulated) traces

In this subsection, we present simulations where \(\alpha \) is either known exactly or regressed online. We consider an attack on PRESENT, whose SBox maps \(n=4\) bits to \(n=4\) bits. For the sake of the simulations, we choose two kinds of \(\alpha \):

  • “identical”: all the \(n=4\) bits leak the same waveform, like in the Hamming weight model,

  • “proportional”: the waveform has weight 1 for SBox bit 0 and is multiplied by 2 (resp. 3 and 4) for SBox bit 1 (resp. 2 and 3).

The waveform for each bit is the one represented in Fig. 1 (upper left graph). Specifically, for all \(1\le d\le D\) and \(1\le s\le S\), the envelope consists of damped oscillations:

$$\begin{aligned} \alpha _{d,s}&= e^{-\frac{2 d}{D}} \cos \left( 2\pi \frac{d}{D} \right) \ \;\text {for the ``identical'' case,} \end{aligned}$$
(13)
$$\begin{aligned} \alpha _{d,s}&= s \cdot e^{-\frac{2 d}{D}} \cos \left( 2\pi \frac{d}{D} \right) \ \;\text {for the ``proportional'' case.} \end{aligned}$$
(14)

The noise is chosen to be Gaussian, with two possible distributions:

  • “isotropic”: the covariance matrix is \(\sigma ^2\) times the \(D\times D\) identity,

  • “auto-regressive” (or “AR” for short): the covariance matrix element at position \((d,d')\), for \(1\le d,d'\le D\), is \(\sigma ^2 \rho ^{|d-d'|}\). Each sample is thus not independent of its neighbors, but the correlation decreases exponentially as samples get further apart (a sketch of this simulation setup is given after the list).
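
The following sketch (ours) generates the envelopes of Eqs. (13)-(14) and the two covariance structures used in the simulations:

```python
# Envelopes and noise covariances used in the synthetic-trace experiments.
import numpy as np

def envelopes(D, S, proportional=False):
    """Damped-oscillation envelopes of Eqs. (13)-(14): identical or proportional."""
    d = np.arange(1, D + 1)
    wave = np.exp(-2 * d / D) * np.cos(2 * np.pi * d / D)
    weights = np.arange(1, S + 1) if proportional else np.ones(S)
    return np.outer(wave, weights)            # D x S matrix alpha

def covariance(D, sigma, rho=None):
    """Isotropic covariance if rho is None, auto-regressive (AR) otherwise."""
    if rho is None:
        return sigma ** 2 * np.eye(D)
    d = np.arange(D)
    return sigma ** 2 * rho ** np.abs(d[:, None] - d[None, :])
```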

Proposition 1

The success probability of \(\mathcal {D}_\mathrm {ML}\) is at least as large as that of \(\mathcal {D}_\mathrm {ML,sto}\).

Proof

Indeed, according to [10], \(\mathcal {D}_\mathrm {ML}\) maximizes the success probability. Thus, the distinguisher \(\mathcal {D}_\mathrm {ML,sto}\) cannot have a larger success probability. The success probabilities coincide when the minimization over \(\alpha \) in the proof of Theorem 2 yields the exact matrix \(\alpha \) used in the model (1). \(\square \)

Fig. 6 Simulations for \(D=3\), \(S=5\), \(n=4\), \(\sigma =4\) (AR noise with \(\rho =0.5\))

Simulations allow us to estimate the loss in efficiency incurred by not knowing the model (Proposition 1), by comparing the distinguishers \(\mathcal {D}_\mathrm {ML}\) [(4)] and \(\mathcal {D}_\mathrm {ML,sto}\) [(5)]. The success rate of the optimal distinguisher \(\mathcal {D}_\mathrm {ML}\) is drawn in order to materialize the boundary between feasible (below) and unfeasible (above) attacks.

Results for low noise (\(\sigma =1\)) are represented in Fig. 5. We can see that the Hamming weight model is clearly harder to attack, because the leakage of one bit cannot be distinguished from that of the other bits. Besides, we notice that the stochastic attack performs much worse than the optimal attack: about 10 times more traces are required for an equivalent success probability in key extraction.

Results for high noise (\(\sigma =4\)) are represented in Fig. 6. Again, the “proportional” model is easier to attack than the “identical” model (for each bit). Now, we also see that the gap between the optimal ML attack and the stochastic attack narrows: only about 5 times more traces are needed for the stochastic attack to perform as well as the optimal attack in terms of success probability. Besides, we notice that the AR noise is favorable to the attacker. It is therefore important in practice for the attacker to characterize precisely the noise distribution (recall the methodology presented in Sect. 4.1).

Clearly, these conclusions are in line with the template versus stochastic (offline) study carried out in [8]: for high noise, the attack requires more traces, so the (online) learning of the model is based on more data and hence is more accurate. Therefore, the performance of \(\mathcal {D}_\mathrm {ML,sto}\) gets closer to that of \(\mathcal {D}_\mathrm {ML}\) than in the low-noise case.

4.3 Attacks on real-world traces

We now compare CPA with \(\mathcal {D}_\mathrm {ML}\) and \(\mathcal {D}_\mathrm {ML,sto}\) on measurements provided by the DPA contest V4. These traces have been acquired from an 8-bit processor and hence have a signal-to-noise ratio greater than one, reaching 7 at some points in time. The interval for our case study is [170, 250] from Fig. 2; hence, \(D=80\). Regarding ML, two learning strategies have been implemented:

  1. the model is learned from a disjoint set of 5k traces, which is the operational scenario for a profiled attack;

  2. the model is learned from the traces being attacked (denoted self in Fig. 7). This case does not represent a realistic attack, but is interesting in that it highlights the best possible attacker.

The attack success rates are plotted in Fig. 7.

Fig. 7 Comparison of success rate of CPA, \(\mathcal {D}_\mathrm {ML,sto}\) for \(S\in \{9,2\}\), and \(\mathcal {D}_\mathrm {ML}\) for \(S\in \{9,2\}\) (with two distinct learning methods)

One can see that both \(\mathcal {D}_\mathrm {ML}\) and \(\mathcal {D}_\mathrm {ML,sto}\) perform better with \(S=9\) than with \(S=2\). This is consistent with the analysis carried out in Sect. 2.3. The CPA actually performs very poorly because the true model is very far from a Hamming weight: as shown in Fig. 2, some parameters \(\alpha _i\) (e.g., for \(i=2\) and 6) are positive in region [180, 200], whereas others \(\alpha _j\) (e.g., for \(j=1\), 3, 4 and 5) are negative. These compensating signs explain why the Hamming weight model is inappropriate. The ML attack with the model precharacterized on the very traces under attack shows that very strong attacks are possible (using only a few traces). Interestingly, when the model used by ML is characterized on 5k traces distinct from the traces being attacked, the performance is almost the same. Finally, the online stochastic attack derived in this paper (\(\mathcal {D}_\mathrm {ML,sto}\)) performs better than CPA (whose distinguisher is the maximum value of the Pearson correlation over the \(D=80\) samples).

5 Conclusions and perspectives

Distinguishing a key from both multivariate leakage samples and multiple models can be done in one step, as shown in this paper. A compact expression of the distinguisher is provided, using matrix operations. The strategy is applied to real-world traces in profiled and non-profiled scenarios. The resulting attack is more efficient than the traditional approach “dimensionality reduction, then stochastic (linear regression) attack.” The new multivariate distinguisher outperforms the other state-of-the-art attacks. The presented methodology allows for leakage-agnostic attacks on vectorial leakage measurements and complex models. In addition, the matrix-based expression of the distinguisher benefits from matrix-oriented software that implements computational optimizations for large dimensions.

A companion future work would consist of determining the optimal model dimensionality and basis from any acquisition campaign. Another perspective is to adapt the methodology to masked implementations, as already done for monovariate leakage in [6]; in this case, however, the distinguishers will certainly not exhibit simple closed-form expressions. Nevertheless, we believe that the approach could be fruitful in practice when backed by suitable optimization software.