1 Introduction

1.1 Context: the side channel threat

Side-channel traces collected from software implementations are extremely rich, since the same variable can leak at several places. Typically, leakage can spread over several samples within one clock cycle; in addition, software implementations typically move variables through several registers or memory locations, causing leakage at many clock cycles.

1.2 Problem: making the most of high dimensionality

Modern oscilloscopes sample their input at a very high frequency; hence, it is possible to get more than one leakage sample per leaking sensitive variable. How can such an abundance of leakage measurements be exploited? Few non-supervised side-channel distinguishers can handle this situation. Indeed, the samples usually leak differently; therefore, without prior knowledge, it is complex to know how to best combine them constructively.

1.3 State-of-the-art

Several strategies have been put forward in the past. For instance, Clavier et al. [11, §3] suggested (albeit in another context, namely that of so-called shuffling countermeasures) to take as monovariate signal the average of the signal over the D samples. As another example, Hajra and Mukhopadhyay [24] investigate non-profiled side-channel analysis. For their analysis to work, however, some a priori structure is injected into the model, which is learned online on the attacking traces. In reality, the information content of the whole trace is larger than that extracted by the projection on one single support vector, and it depends on the multivariate trace distribution. Notice that template attacks (which are profiled attacks) achieve this goal. Template attacks on multivariate data have been described accurately as a “process” in state-of-the-art papers [30, §5.3.3]. However, those attacks are still perceived as a recipe (see Sec. 2.1 of [9]), so there is no way (apart from experimental case studies) to study which parameters impact the success of a template attack. Therefore, some folklore surrounds them. In particular, because the recipe is not formalized, some papers have tried to clarify the different steps and assumptions; notably, Choudary and Kuhn [10] made many details about template attacks more explicit, such as the notion of “pooled covariance matrix.”

The inversion of the trace covariance matrix \(\varSigma \) is feared, and for this reason, a first pass of dimensionality reduction was suggested right away in the seminal paper [9]: the authors selected a few tens of so-called points of interest (PoIs), resulting in a smaller \(\varSigma \) which is easy to invert. Still, we notice a theoretical contradiction, because the data processing inequality states that applying any function to the data reduces its information content. Subsequently, much research has been carried out to improve on the heuristic method for identifying PoIs in the traces. Archambeau et al. motivated the usefulness of principal component analysis (PCA) in [1] in this respect. Bär et al. improved on this PCA in [2] by hand-picking PoIs within PCA vectors. Elaabid et al. [18] describe a method to choose PoIs without manual selection: a PCA is computed on all the traces, resulting in one “PCA trace”, and the PoIs are those samples where the PCA value exceeds a user-defined threshold. The authors observed that this selection retains the most informative samples while discarding noisy ones. Fan et al. suggest selecting PoIs as those where the noise is the most Gaussian [20]. Still, even deciding which selection of PoIs is optimal remains an open question [45]. Zhang et al. recently noticed that there is still margin for improvement in PoI selection algorithms [43]. To summarize, many papers have focused on reducing the trace dimensionality while retaining most of the information in the constructed subset, with a view to optimizing the success rate. A secondary objective is to keep an acceptable computational load for the template attack. Indeed, template attacks, as presented originally, include in the attack phase the evaluation of computationally challenging functions, such as exponentials, matrix inversion, and square roots (see Eqn. (1) in Sec. 2.1 of [9]).

Eventually, we note that many papers study trace denoising, using typically wavelets [16], independent component analysis (ICA) [29], etc. In our work, we exploit traces “raw,” so as to highlight the sole impact of multivariate analysis on attack efficiency. It is noteworthy, however, that both PoI-based and PCA-based template attacks [20] can also straightforwardly benefit from our approach, both in terms of processing time and in terms of memory requirements.

1.4 Contributions

In this paper, we analyze template attacks from a theoretical perspective and derive the mathematical expressions for the template building and attack phases. In the context of multivariate normal noise, the two phases simplify (Algorithms 1 and 2) to mere linear operations. In particular, there is no need for exponentials or square roots, and only one matrix inversion is required, at the end of the template building phase. These formal expressions drastically clarify the actual computations carried out by template attacks and, as a result, also improve their computational cost.

Our noteworthy contributions are detailed below:

  • we factor code between training and matching algorithms,

  • we optimize further the computation and needed memory space by grouping traces by classes, resulting in a computation on a so-called coalesced data,

  • we have the training phase deliver only one matrix, thereby saving repeated computations in the attack phase (see Algorithms 3 and 4),

  • As a consequence, we manage to perform template attacks on highly multivariate traces. Contrary to common belief, we show that the more samples are taken (i.e., no dimensionality reduction is needed), the more successful the template attack is. Our approach nonetheless remains compatible with reduced-dimensionality data.

  • Said differently, we are able to analyze full-width traces without any preprocessing (which destroys information) and show that this is the best attack strategy, in terms of number of traces, to recover the key.

  • in the sequel, we extend our approach to masking-based protected implementations,

  • Finally, we highlight a spectral approach which allows for a near-exponential computational improvement in the attack phase (a factor of \(2^n\) is reduced to n, where n is the bitwidth of the sensitive variable).

Furthermore, those findings allow us to observe in practice the effect of the dimensionality (the number of samples in a trace) on the success rate. Namely, we show that the success rate increases when traces of higher dimensionality are used. We also show that template attacks (without initial dimensionality reduction using PCA, for instance) are more efficient than monovariate attacks (such as the correlation power analysis, also known as CPA [5]) applied on the first direction of the PCA leakage reduction. Such practical derivations are, to the best of our knowledge, new, and we provide numerical values for the actual gain conveyed by attacks of increased dimensionality.

1.5 Outline

In order to present our contributions, we follow the scientific approach described hereafter. In Sect. 2, we present a mathematical model of the problem. In Sect. 3, we provide our first contribution, namely the formalization of template attacks. In Sect. 4, efficient algorithms to compute template attacks are given; in particular, we exhibit a spectral approach to speed up the computation of template attacks. In Sect. 5, we validate our contributions on real traces taken from an AES running on an ATMega 163 smart card. Finally, in Sect. 6, we conclude our study.

2 Mathematical model of the problem and notations

2.1 Side-channel problem

We model the side-channel problem as follows:

$$\begin{aligned} T,k \rightarrow Z \rightarrow Y(Z) = Y \rightarrow X = Y+N . \end{aligned}$$
(1)

In this equation, we have that:

  • T is the digital part known to the attacker, typically some text (either plaintext or ciphertext),

  • k is the digital part unknown to the attacker, typically some part of the key, which is fixed and will be guessed,

  • Z is an intermediate value within the targeted cryptographic algorithm, e.g., \(Z=\texttt {SubBytes}(T\oplus k)\) in AES; formally, Z is the sensitive variable, a vectorial Boolean function of T and k,

  • Y is the leakage corresponding to Z — the link between Y and Z is deterministic,

  • X is the side-channel leakage measured by the attacker, which consists of Y plus some independent noise N,

  • N is the noise.

We assume all those variables are measured many times (Q times). Therefore, the random variables have dimensionality Q, with the particularity that k is the same for all Q measurements and that the noise random variables \(N_q\) are i.i.d.

In addition, we assume that the measurements are multidimensional of dimensionality D. This can represent the fact that:

  • oscilloscopes capture windows of D samples, typically at high sampling rates, resulting in traces of many samples per clock period,

  • simultaneous captures by various probes (e.g., power and electromagnetic, or also multiple electromagnetic probes placed at various locations).

2.2 Additional notations

We adopt the following notations:

  • S: the cardinality of the space to which Z belongs; that is, if \(Z\in \{0,1\}^n\), then \(S=2^n\). To ease notations, we also assume (without loss of generality) that the number of T values is S. For instance, Z typically arises from a bijective substitution box applied to \(T\oplus k\).

  • \(\varSigma \): the \(D\times D\) covariance matrix of the noise, such that \(N_q \sim \mathcal {N}(0, \varSigma )\) (\(0\le q<Q\)).

Moreover, we will use the reordering trick put forward in [28], which consists in trading \(\sum _q\) for \(\sum _t \sum _{q: t_q=t}\), where \(\sum _{q: t_q=t}\) is a shortcut notation for \(\sum _{q\in \{1,\ldots ,Q\} \,:\, t_q=t}\).

Therefore, we have the following “types” for variables defined in Sect. 2.1:

  • T and k are bitvectors (of second dimension Q), not necessarily of the same length,

  • Z is a matrix of Booleans \(\{0,1\}^{n} \times Q\),

  • Y and X are matrices of \(D\times Q\) real numbers.

The peculiarity of Y is that we have \(Y(Z_q)=Y(Z_{q'})\) if \(T_q=T_{q'}\) (under the same key k).

We introduce the notion of coalesced matrices. This notion was introduced in [28, §3.3], but was not named there; we qualify it (averaging traces corresponding to the same sensitive variable) as “coalescence.” It arises from the fact that Y and X, the random variables in (1), depend only on the values of Z (except that X also depends on the noise, which is independent of the other random variables); therefore, their matrices can be coalesced from size \(D\times Q\) to \(D\times S\).

Definition 1

(Coalesced matrices) Coalesced model \(\tilde{Y}(k)\) is the \(D\times S\) matrix

$$\begin{aligned} \left( \tilde{Y}_{t=0,k}, \ldots , \tilde{Y}_{t=S-1,k}\right) , \end{aligned}$$

i.e., where t is enumerated in lexicographical order. This matrix is a property of the leakage from the device under test, i.e., it is not a random variable, but a constant value. For any measurement q, we denote \(Y_q = \tilde{Y}_{t_q,k}\).

Coalesced measurement \(\tilde{X}\) is the matrix

$$\begin{aligned} \left( \tilde{X}_0, \ldots \tilde{X}_{S-1}\right) = \left( \frac{\sum _{q: t_q=0} X_q}{\sum _{q: t_q=0} 1}, \ldots , \frac{\sum _{q: t_q=S-1} X_q}{\sum _{q: t_q=S-1} 1} \right) . \end{aligned}$$

In this equation, we denote by \(n_t\) the number of plaintexts such that \(t_q\) equals t. If for some t the number \(n_t\) is equal to zero, then by convention \(\tilde{X}_t=0\). Alternatively, the empty classes can be pruned. Both choices are equivalent.

The advantage of Definition 1 is twofold. First, as long as \(Q\ge \#T=2^n\), the matrix X (of size \(D\times Q\)) is reduced to a matrix \(\tilde{X}\) (of size \(D\times 2^n\)). Second, when \(n_t\) is not equal to zero, \(\tilde{X}_{t}\) as defined here is the empirical average of \(\mathbb {E}(X|T=t)\) over \(n_t\) values. Thus, for any t and q, \(X_q\) (for \(t_q=t\)) and \(\tilde{X}_t\) have the same expectation, but \(\tilde{X}_t\) has an empirical standard deviation divided by \(\sqrt{n_t}\) relative to that of \({X}_{q}\).
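As an illustration, the coalescence of Definition 1 takes only a few lines of code. The following is a minimal numpy sketch (the names coalesce, X and t are ours, not the paper's): X is the \(D\times Q\) matrix of traces and t the length-Q vector of plaintext values.

```python
import numpy as np

def coalesce(X, t, S):
    """Average the columns of the D x Q trace matrix X that share the same
    plaintext value, yielding the D x S coalesced matrix and the class
    counts n_t (empty classes stay at zero, per Definition 1)."""
    D, Q = X.shape
    X_tilde = np.zeros((D, S))
    n = np.zeros(S, dtype=int)
    for q in range(Q):
        X_tilde[:, t[q]] += X[:, q]
        n[t[q]] += 1
    nonempty = n > 0
    X_tilde[:, nonempty] /= n[nonempty]
    return X_tilde, n
```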

It shall also be noted that in our model, the \(D\times Q\) matrix Y (and the \(D\times S\) coalesced matrix \(\tilde{Y}\)) are general. In particular, we do not assume any structure, such as \(\tilde{Y}\) being the product of a \(D\times 1\) column by a \(1\times S\) row, as done in [6, 24].

Examples of Z functions are given hereafter:

  • in the software case, \(T,k\in \{0,1\}^n\), and \(Z=T\oplus k\);

  • in the hardware case, for the case of AES \((n=8)\) attacked on the last round:

    • \(T,k\in \{0,1\}^n\), and \(Z=(\text {InvSubBytes}(T\oplus k), T)\) for the bytes of the first row, and

    • \(T=(T_1,T_2)\in \{0,1\}^n\times \{0,1\}^n\), and \(Z=(\text {InvSubBytes}(T_1\oplus k), T_2)\) for the other three rows.

Those two cases refer to the two models expressed in [14]:

  • Only manipulated Data Leak (ODL): only the manipulated value influences the leakage.

  • Memory Transitions Leak (MTL): two values (the previous one and the new one) of a memory unit (e.g., a register) influence the power consumption, i.e., the device leaks some combination of the two consecutively manipulated values.

We also introduce other useful notations for matrices:

  • Let A be a square matrix, then its trace is the sum of elements along its diagonal \({{\,\mathrm{tr}\,}}(A) = \sum _{i} A_{i,i}\); let A and B two rectangular matrices such that AB and BA are square. Then, the trace has the property that \({{\,\mathrm{tr}\,}}(A B) = {{\,\mathrm{tr}\,}}(B A)\).

  • Covariance matrices are symmetric, and all their eigenvalues are nonnegative. When some eigenvalues are zero, it means that rows are redundant, and those are implicitly removed until all eigenvalues are strictly positive. Alternatively, more measurements shall be captured. Therefore, we will consider covariance matrices to be invertible. For example, the noise covariance matrix \(\varSigma \) has an inverse, denoted \(\varSigma ^{-1}\). We will also make use of the notation \(\varSigma ^{-1/2}\) for the unique symmetric positive definite matrix such that \(\varSigma ^{-1/2} \varSigma ^{-1/2} = \varSigma ^{-1}\).

Eventually, let us define the following operator:

Definition 2

(The kth trace operator \(\mathsf {tr}_k\)) Let \(k\in \mathbb {F}_2^n\). The kth trace operator \(\mathsf {tr}_k\) of a \(2^n\times 2^n\) square matrix M is

$$\begin{aligned} \mathsf {tr}_k(M) = \sum _t M_{t,t\oplus k} . \end{aligned}$$

The regular trace of a square matrix M is simply \(\mathsf {tr} = \mathsf {tr}_0\).
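The kth trace operator is immediate to implement; here is a minimal sketch (in Python, names ours), for a \(2^n\times 2^n\) array M:

```python
def tr_k(M, k):
    """k-th trace of a square matrix M: sum of the entries M[t, t XOR k].
    tr_k(M, 0) coincides with the regular trace."""
    return sum(M[t, t ^ k] for t in range(M.shape[0]))
```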

2.3 Characterization of traces

The traces can be characterized according to their signal-to-noise ratio (SNR), as defined in [30, §4.3.2, page 73]. Referring to Eq. (1), the monovariate SNR is the SNR at each sample of the trace:

Definition 3

(Monovariate SNR trace)

$$\begin{aligned} \text {SNR} = \frac{\textsc {Var}(\mathbb {E}(X|T))}{\mathbb {E}(\textsc {Var}(X|T))} , \end{aligned}$$
(2)

where \(\mathbb {E}\) denotes the expectation and \(\textsc {Var}\) the variance operators.

We notice that this notion of SNR is useful to predict the approximate number of traces \(Q_{80\%}\) needed to recover the key with \(80\%\) confidence. Indeed, \(Q_{80\%}\) is proportional to the inverse of the SNR, as demonstrated in [4].

In some cases, such as software implementations, the monovariate leakage can feature a strong SNR. However, this gives no intuition regarding the proportion of signal which is informative within the trace. Typically, could the SNR be improved further by reducing the noise? The normalized inter-class variance (NICV [3]) allows one to answer this question. The NICV is the proportion of the trace variance which can be explained by the public variable T:

Definition 4

(Monovariate NICV trace)

$$\begin{aligned} \text {NICV} = \frac{\textsc {Var}(\mathbb {E}(X|T))}{\textsc {Var}(X)} . \end{aligned}$$
(3)

Owing to the law of total variance, that is

$$\begin{aligned} \textsc {Var}(X) = \textsc {Var}(\mathbb {E}(X|T)) + \mathbb {E}(\textsc {Var}(X|T)) , \end{aligned}$$

it can be seen that \(\text {NICV}\) is bounded between 0 and 1, and that \(\text {NICV} = 1/(1+1/\text {SNR})\). This implies that

$$\begin{aligned} \frac{\partial \;\text {NICV}}{\partial \;\text {SNR}} = \frac{1}{(1+\text {SNR})^2} > 0, \end{aligned}$$

i.e., \(\text {NICV}\) is an increasing function of SNR. The attack success rate is thus improved by increasing either the SNR or the \(\text {NICV}\).
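To make Definitions 3 and 4 concrete, the per-sample SNR and NICV can be estimated empirically as follows (a sketch under our notations: X is the \(D\times Q\) matrix of traces, t the length-Q vector of plaintext values; classes are weighted uniformly, which assumes uniform plaintexts):

```python
import numpy as np

def snr_nicv(X, t):
    """Empirical monovariate SNR and NICV, one value per trace sample:
    Var(E(X|T)) is the variance of the class means, E(Var(X|T)) the mean
    of the within-class variances (Eqs. (2) and (3))."""
    classes = np.unique(t)
    means = np.stack([X[:, t == c].mean(axis=1) for c in classes])
    varis = np.stack([X[:, t == c].var(axis=1) for c in classes])
    var_of_means = means.var(axis=0)      # Var(E(X|T)) per sample
    mean_of_vars = varis.mean(axis=0)     # E(Var(X|T)) per sample
    return var_of_means / mean_of_vars, var_of_means / X.var(axis=1)
```

On traces obeying the law of total variance, the two returned curves satisfy \(\text {NICV} \approx 1/(1+1/\text {SNR})\) up to estimation error.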

3 Formalization of template attacks

In this section, we present our main result. We formalize the two phases of the template attack, namely learning and attacking. Their algorithms reveal very simple mathematical expressions for building the template model and for matching never-seen traces (expressions which happen to be very similar). In particular, this leads us to simplify the computation of the two phases: the mathematical expressions involve only linear algebra (neither logarithms nor exponential functions need to be evaluated), and only one matrix inversion is required (namely once, at the end of the profiling stage). Moreover, we show that the Q traces used either for profiling or matching can be regrouped into \(S=2^n\) classes. In the sequel, we use \(2^n\) (e.g., \(2^n=256\) for the AES) in lieu of S to keep the paper concrete. The two attack phases can in turn be computed using \(2^n\)-dimensional vectors and matrices of dimension \(2^n\times 2^n\). We will say that the Q traces are coalesced into \(2^n\) classes, which further simplifies the computations.

We start by analyzing the classical template attacks (without coalescence) in Sect. 3.1. We then see the gain of considering template attacks with coalesced traces and models in Sect. 3.2. Eventually, we compare results with attacks making use of dimensionality reduction in Sect. 3.3.

3.1 Template attack (without coalescence)

3.1.1 Template building = profiling stage

In the profiling stage, the attacker computes \(\tilde{Y}\), a \(D\times 2^n\) matrix, and \(\varSigma \). An experimental method to profile is the following:

  • In the estimation of \(\tilde{Y}\), the attacker fixes Z to each possible value (the column index of \(\tilde{Y}\)) and averages many measurements, so as to cancel out the noise; in order to explore all the possible values of Z, the attacker can enumerate all values of T and k. However, due to symmetries, some values of Z may be reached by many different pairs (T, k); hence, there is an opportunity to save some profiling effort.

  • Regarding the estimation of \(\varSigma \), the attacker fixes Z to an arbitrary value, say 0, and then estimates \(\varSigma \) as the covariance matrix of the traces.

Therefore, in the sequel, we consider that the averages of the traces (\(\tilde{Y}\)) and the noise covariance matrix (\(\varSigma \)) are known. Notice that \(N\sim \mathcal {N}(0, \varSigma )\) has zero mean by construction, because the data-dependent part of the traces (where data = text and key) constitutes the expectation of the traces, which is captured by the model \(\tilde{Y}\).

Also notice that we implicitly opted for a leakage decomposition basis known as the canonical basis [21]. However, any other choice of basis suits equally well, insofar as the change to another basis is a multiplication by an invertible \(2^n \times 2^n\) matrix, which can be applied to coalesced matrices (the noise is not impacted by the basis change, which concerns only the data). Such a basis change could be interesting if the leakage happens to be concentrated on a smaller basis than the canonical one: the uninteresting dimensions can then safely be dropped, thereby simplifying the equations.

3.1.2 Template attack

After the identification stage (profiling, as described in Sect. 3.1.1), the attacker can perform the exploitation stage. We have the following result:

Theorem 1

(Theorem 1 of [7]) Template attacks guess the key as:

$$\begin{aligned} \hat{k} = {{\,\mathrm{argmin}\,}}_k {{\,\mathrm{tr}\,}}({(X-Y_k)}^\mathsf {T} \varSigma ^{-1} (X-Y_k)) . \end{aligned}$$
(4)

3.1.3 Standardization of traces

Theorem 2

(Template attack with standardized traces) It is possible to standardize the templates (and the traces), by trading:

  • \(Y_k\) for \(Y_k' = \varSigma ^{-1/2} Y_k\), and

  • X for \(X' = \varSigma ^{-1/2} X\).

Accordingly, the template attack simplifies as:

$$\begin{aligned} \hat{k} = {{\,\mathrm{argmin}\,}}_k {{\,\mathrm{tr}\,}}({(X'-Y'_k)}^\mathsf {T} (X'-Y'_k)) = {{\,\mathrm{argmin}\,}}_k ||X'-Y'_k||_F^2 , \end{aligned}$$
(5)

where \(||\;\cdot \;||^2_F\) is the squared Frobenius norm of a matrix, that is, the sum of the squares of all its elements.

Proof

Notice that owing to the symmetry of \(\varSigma \), one also has the following property \({\bigl (\varSigma ^{-1/2}\bigr )}^\mathsf {T} = \varSigma ^{-1/2}\). Hence,

$$\begin{aligned} {(X-Y_k)}^\mathsf {T} \varSigma ^{-1} (X-Y_k)&= {(X-Y_k)}^\mathsf {T} \varSigma ^{-1/2} \varSigma ^{-1/2} (X-Y_k) \\&= {(\varSigma ^{-1/2} (X-Y_k))}^\mathsf {T} \varSigma ^{-1/2} (X-Y_k) \\&= {(X'-Y'_k)}^\mathsf {T} (X'-Y'_k) . \end{aligned}$$

Besides,

$$\begin{aligned}&{{\,\mathrm{tr}\,}}({(X'-Y'_k)}^\mathsf {T} (X'-Y'_k)) \\&\quad = \sum _q {(X'_q-Y'_{q,k})}^\mathsf {T} (X'_q-Y'_{q,k}) \\&\quad = \sum _q ||X'_q-Y'_{q,k}||^2_2 \qquad \text {(norm-2 of a }D\text {-dimensional vector)} \\&\quad = ||X'-Y'_{k}||^2_F \qquad \text {(Frobenius norm of a }D\times Q\text { matrix)}. \end{aligned}$$

\(\square \)

Notice that the standardized noise \(N' = \varSigma ^{-1/2} N\) has distribution \(\mathcal {N}(0, I)\), where I is the \(D\times D\) identity matrix.
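Theorem 2 translates into a handful of linear-algebra operations. Below is a minimal sketch (numpy, names ours): \(\varSigma ^{-1/2}\) is obtained by eigendecomposition, and each key hypothesis is scored by the squared Frobenius norm of Eq. (5); Y is assumed to be a list of the \(D\times Q\) model matrices \(Y_k\).

```python
import numpy as np

def inv_sqrt(Sigma):
    """Unique symmetric positive definite square root of Sigma^{-1},
    computed by eigendecomposition (Sigma symmetric positive definite)."""
    w, U = np.linalg.eigh(Sigma)
    return U @ np.diag(w ** -0.5) @ U.T

def template_attack_standardized(X, Y, Sigma):
    """Guess the key per Eq. (5): minimize ||X' - Y'_k||_F^2 over k."""
    W = inv_sqrt(Sigma)
    Xp = W @ X                              # standardized traces
    scores = [np.linalg.norm(Xp - W @ Yk, 'fro') ** 2 for Yk in Y]
    return int(np.argmin(scores))
```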

Remark 1

In the expression of the template attack of Eq. (5), the covariance matrix \(\varSigma \) disappears: it is hidden half in the model (\(Y'\) is \(\varSigma ^{-1/2} Y\)) and half in the matching traces (\(X'\) is \(\varSigma ^{-1/2} X\)). Alternatively, we will show in Algorithm 3 that \(\varSigma \) can be completely hidden in the templates and that there is no need to use \(\varSigma \) in the corresponding matching phase (Algorithm 4). This way of using \(\varSigma \) minimizes the computation time, in particular because \(\varSigma \) needs to be inverted only once, namely when building the model.

3.2 Template attack (with coalescence)

The trace operator in Theorem 1 is applied to a \(Q\times Q\) matrix (resulting from the product of the raw matrices, without coalescence), which poses a scaling problem as the number of traces grows. Therefore, we show that template attacks can be rewritten in a coalesced form:

Proposition 1

(Coalesced template attack) Template attacks (as per (4)) guess the key as:

$$\begin{aligned} \hat{k} = {{\,\mathrm{argmin}\,}}_k \sum _{t} n_t {(\tilde{X}_t-\tilde{Y}_{t,k})}^\mathsf {T} \varSigma ^{-1} (\tilde{X}_t-\tilde{Y}_{t,k}) , \end{aligned}$$
(6)

where \(n_t = \sum _{q: t_q=t} 1\) is the number of traces corresponding to plaintext value t.

Proof

By developing the argument to minimize in Theorem 1, there remain only two terms which depend on the key k:

$$\begin{aligned} \sum _q {Y}^\mathsf {T}_{t_q,k} \varSigma ^{-1} X_q = \sum _q {X}^\mathsf {T}_q \varSigma ^{-1} Y_{t_q,k} \end{aligned}$$
(7)

and

$$\begin{aligned} \sum _q {Y}^\mathsf {T}_{t_q,k} \varSigma ^{-1} Y_{t_q,k} , \end{aligned}$$
(8)

because \(\sum _q {X}^\mathsf {T}_q \varSigma ^{-1} X_q\) is independent of the key k.

Now,

$$\begin{aligned} \sum _q {Y}^\mathsf {T}_{t_q,k} \varSigma ^{-1} X_q&= \sum _t \left( \sum _{q:t_q=t} {Y}^\mathsf {T}_{t_q,k} \varSigma ^{-1} X_q \right) \\&= \sum _t \left( \sum _{q:t_q=t} {\tilde{Y}}^\mathsf {T}_{t,k} \varSigma ^{-1} X_q \right) \\&= \sum _t {\tilde{Y}}^\mathsf {T}_{t,k} \varSigma ^{-1} \left( \sum _{q:t_q=t} X_q \right) \\&= \sum _t n_t {\tilde{Y}}^\mathsf {T}_{t,k} \varSigma ^{-1} \left( \frac{\sum _{q:t_q=t} X_q}{n_t} \right) \\&= \sum _t n_t {\tilde{Y}}^\mathsf {T}_{t,k} \varSigma ^{-1} \tilde{X}_t \end{aligned}$$

and

$$\begin{aligned} \sum _q {Y}^\mathsf {T}_{t_q,k} \varSigma ^{-1} Y_{t_q,k}&= \sum _t \sum _{q:t_q=t} {Y}^\mathsf {T}_{t_q,k} \varSigma ^{-1} Y_{t_q,k} \\&= \sum _t \sum _{q:t_q=t} {\tilde{Y}}^\mathsf {T}_{t,k} \varSigma ^{-1} \tilde{Y}_{t,k} \\&= \sum _t {\tilde{Y}}^\mathsf {T}_{t,k} \varSigma ^{-1} \tilde{Y}_{t,k} \left( \sum _{q:t_q=t} 1 \right) \\&= \sum _t n_t {\tilde{Y}}^\mathsf {T}_{t,k} \varSigma ^{-1} \tilde{Y}_{t,k} . \end{aligned}$$

All in all, up to the key-independent term \(\sum _t n_t {\tilde{X}}^\mathsf {T}_t \varSigma ^{-1} \tilde{X}_t\), we can rewrite the argument to minimize in Theorem 1 as:

$$\begin{aligned} \sum _{t} n_t {(\tilde{X}_t-\tilde{Y}_{t,k})}^\mathsf {T} \varSigma ^{-1} (\tilde{X}_t-\tilde{Y}_{t,k}) . \end{aligned}$$

\(\square \)

Remark 2

Equation (6) reads as a trace over the plaintext space, weighted by the values \(n_t\). The \(n_t\) are the empirical estimators of \(Q\, \mathbb {P}(T=t)\).

Remark 3

The attack presented in Proposition 1 is exactly the same as that of Theorem 1: it will succeed with the same number of traces. However, the attack in Proposition 1 is computationally more efficient than that of Theorem 1 as soon as the number of traces Q is larger than the number of classes in the leakage model, \(2^n\) (e.g., \(2^n=256\) for AES). The gain holds both for the training and matching phases. However, we underline that training requires many more traces than matching; hence, most of the gain of using coalesced data arises in the template building phase.
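A direct transcription of Proposition 1 reads as follows (a sketch; X_tilde and the counts n come from the coalescence step of Definition 1, Y_tilde is assumed to be indexed by the key hypothesis k, each entry a \(D\times S\) coalesced model, and Sigma_inv is \(\varSigma ^{-1}\); names are ours):

```python
import numpy as np

def coalesced_template_attack(X_tilde, Y_tilde, n, Sigma_inv):
    """Guess the key per Eq. (6): minimize the n_t-weighted sum of
    Mahalanobis distances between coalesced traces and models."""
    best_k, best_score = None, np.inf
    for k, Yk in enumerate(Y_tilde):
        diff = X_tilde - Yk                # D x S residuals
        # diff[:, t]^T Sigma^{-1} diff[:, t], for every class t at once:
        per_class = np.einsum('ds,de,es->s', diff, Sigma_inv, diff)
        score = np.sum(n * per_class)
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```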

3.3 State-of-the-art dimensionality reduction

It has been suggested in [1] that template attacks can be made more practical by:

  1. first start with a dimensionality reduction,

  2. then perform a template attack in the reduced space.

This approach deserves to be confronted with the innate dimensionality reduction power of template attacks, as expressed in Sect. 3.2 (where the complexity is related to the cryptographic parameter \(S=2^n\) and not to the trace dimensionality D).

To be more accurate, two dimensionality reduction techniques have been put forward: principal components analysis (PCA [1]) and linear discriminant analysis (LDA [39]).

In this section, we contrast template attacks with and without such pre-processing (especially the PCA).

3.3.1 PCA

PCA is a technique aiming at identifying linear projections which concentrate information in a few directions, namely the principal directions. In this respect, it was already pointed out in [1] that any set of training traces of high dimensionality D can always be reduced to the lower dimensionality \(2^n\), as we also noted in Sect. 3.2.

PCA appears in two variants: classical and class-based, the latter tailored by Archambeau et al. in [1]. Both techniques require the definition of a covariance matrix. We analyze both techniques in Lemma 2 below, which resorts to the concept of coalesced matrices (Definition 1):

Definition 5

(Variance of a vectorial random variable) The variance of \(X\in \mathbb {R}^D\) is: \(\textsc {Var}(X) = \mathbb {E}((X-\mathbb {E}X){(X-\mathbb {E}X)}^\mathsf {T}) .\)

Lemma 1

(Law of total variance)

$$\begin{aligned} \textsc {Var}(X) = \mathbb {E}(\textsc {Var}(X|T)) + \textsc {Var}(\mathbb {E}(X|T)) . \end{aligned}$$

Proof

Adaptation of the law of total variance to the multivariate case. \(\square \)

Applied to \(X=Y+N\), where Y depends on T (but N does not), we have:

$$\begin{aligned} \textsc {Var}(X) = \textsc {Var}(Y) + \varSigma . \end{aligned}$$

Besides, as Y|T is deterministic, \(\mathbb {E}(\textsc {Var}(Y|T))=0\), so Lemma 1 reduces to:

$$\begin{aligned} \textsc {Var}(Y) = \textsc {Var}(\mathbb {E}(Y|T)) . \end{aligned}$$

Let us assume Y is centered, namely \(\mathbb {E}(Y)=0\). Hence,

$$\begin{aligned} \textsc {Var}(Y) = \sum _t \mathbb {P}(T=t)\, \mathbb {E}(Y|T=t)\, {\mathbb {E}(Y|T=t)}^\mathsf {T} . \end{aligned}$$

Lemma 2

(PCA) We resort to the law of large numbers (LLN). The classical PCA is based on the estimation of the following \(D\times D\) covariance matrix:

$$\begin{aligned}&\frac{1}{Q} \sum _{q=1}^Q \left( X_q-\frac{1}{Q} \sum _{q'} X_{q'}\right) {\left( X_q-\frac{1}{Q} \sum _{q'} X_{q'}\right) }^\mathsf {T} \nonumber \\&\quad \xrightarrow [Q\rightarrow +\infty ]{\text {LLN}} \textsc {Var}(Y) + \varSigma . \end{aligned}$$
(9)

The PCA of Archambeau et al. is based on the estimation of the following \(D\times D\) covariance matrix:

$$\begin{aligned}&\frac{1}{2^n} \sum _{t=0}^{2^n-1} \left( \tilde{X}_t - \frac{1}{2^n} \sum _{t'} \tilde{X}_{t'}\right) {\left( \tilde{X}_t - \frac{1}{2^n} \sum _{t'} \tilde{X}_{t'}\right) }^\mathsf {T} \nonumber \\&\quad \xrightarrow [Q\rightarrow +\infty ]{\text {LLN}}\textsc {Var}(Y). \end{aligned}$$
(10)

Regarding the PCA, we only consider the second (class-based) case. As the traces are first averaged, the covariance arising from the noise disappears; therefore, the only contribution is the inter-class variability. As the matrix is a covariance matrix, it has only nonnegative eigenvalues. We consider the set of corresponding eigenvectors as a matrix V (matrix of eigenvectors), which is such that:

$$\begin{aligned} \tilde{Y}_k {(\tilde{Y}_k)}^\mathsf {T} V = V \varDelta , \end{aligned}$$

where the matrix \(\varDelta = {{\,\mathrm{diag}\,}}(\lambda _1, \lambda _2, \ldots , \lambda _{2^n})\) is the diagonal matrix of eigenvalues. Besides, it is known [25] that \(V{V}^\mathsf {T} = I\), the \(D\times D\) identity matrix.

Lemma 3

A template attack preceded by PCA is never more efficient (in terms of success probability) than the plain template attack.

Proof

The reason is simply that maximum likelihood is the best attack in terms of success probability. As PCA is a preprocessing step, by the data processing inequality it cannot increase the available information. \(\square \)

Notice that in practice, it is hard to estimate models and noise accurately; therefore, reducing the dimensionality before attacking can be a winning strategy. However, assuming that perfect profiling is possible, any preprocessing reduces the information (recall the data processing inequality); therefore, one may wonder how different the efficiencies of template attacks with and without PCA are.

In the case \(D<2^n\), \(\tilde{Y}_k {(\tilde{Y}_k)}^\mathsf {T}\) is almost surely of full rank. However, as traces are centered, there is one linear relationship amongst the columns of \(\tilde{Y}_k\), namely their sum is null. Hence, V is not invertible. There is actually no dimensionality reduction: the projection on any eigenvector carries information.

The PCA therefore consists in transforming the traces the following way:

  • \(\tilde{Y}\) becomes \({V}^\mathsf {T}\tilde{Y}\) and X becomes \({V}^\mathsf {T}X\) (projection),

  • the covariance matrix \(\varSigma \) becomes \({V}^\mathsf {T} \varSigma V\). Indeed, the covariance matrix of the projected (centered) noise is \(\mathbb {E}(({V}^\mathsf {T} N){({V}^\mathsf {T} N)}^\mathsf {T}) = \mathbb {E}({V}^\mathsf {T} N {N}^\mathsf {T} V) = {V}^\mathsf {T} \mathbb {E}(N{N}^\mathsf {T}) V = {V}^\mathsf {T}\varSigma V.\)

The argument to maximize over k in the optimal distinguisher (4) becomes:

$$\begin{aligned}&{{\,\mathrm{tr}\,}}\left( \left( {\left( {V}^\mathsf {T} X - {V}^\mathsf {T} Y_k \right) }^\mathsf {T} \left( {V}^\mathsf {T} \varSigma V \right) ^{-1} \left( {V}^\mathsf {T} X - {V}^\mathsf {T} Y_k \right) \right) \right) = \nonumber \\&{{\,\mathrm{tr}\,}}\left( \left( {\left( X - Y_k \right) }^\mathsf {T} V \left( {V}^\mathsf {T} \varSigma V \right) ^{-1} {V}^\mathsf {T} \left( X -Y_k \right) \right) \right) = \nonumber \\&{{\,\mathrm{tr}\,}}\left( \left( {\left( X - Y_k \right) }^\mathsf {T} \left( V \left( {V}^\mathsf {T} \varSigma V \right) ^{-1} {V}^\mathsf {T} \right) \left( X -Y_k \right) \right) \right) . \end{aligned}$$
(11)

Assume that V is invertible. Using the property that \((A B)^{-1} = B^{-1} A^{-1}\), one has:

$$\begin{aligned} V \left( {V}^\mathsf {T} \varSigma V \right) ^{-1} {V}^\mathsf {T} = V V^{-1} \varSigma ^{-1} ({V}^\mathsf {T})^{-1} {V}^\mathsf {T} = \varSigma ^{-1} , \end{aligned}$$

which is the same covariance matrix as for the optimal distinguisher (matrix V disappears).

But V is not invertible, and therefore, Eq. (11) cannot be simplified.

Finally, it is worth noting that all the improvements introduced in this paper are compatible with any dimensionality reduction technique.
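For completeness, the class-based PCA of Eq. (10) boils down to an eigendecomposition of the covariance of the coalesced traces; a sketch (numpy, names ours) is:

```python
import numpy as np

def class_based_pca(X_tilde, m=None):
    """Class-based PCA on the D x S coalesced traces (Eq. (10)): return
    the matrix V of eigenvectors sorted by decreasing eigenvalue; keep
    only the first m directions if m is given."""
    C = X_tilde - X_tilde.mean(axis=1, keepdims=True)  # center over classes
    cov = C @ C.T / X_tilde.shape[1]                   # D x D class covariance
    w, V = np.linalg.eigh(cov)
    V = V[:, np.argsort(w)[::-1]]
    return V if m is None else V[:, :m]
```

After projection, the traces become \({V}^\mathsf {T}X\), the model \({V}^\mathsf {T}\tilde{Y}\) and the noise covariance \({V}^\mathsf {T}\varSigma V\), as derived above.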

4 Efficiently computing templates with coalescence

4.1 Simplification by the LLN

As will be shown, when a high number of traces is needed (i.e., when the SNR is low), an efficient computation of an LLN-based approximation of the optimal template attack can be carried out. When using the LLN, the number of traces needed to recover the key is generally larger than with the exact template attack. But (i) this is less and less of a concern as the SNR decreases (that is, in the “real-world” challenging scenarios), and (ii) we do not need an efficient computation when the number of traces is small (i.e., when the SNR is high).

In this section, we come back to genuine template attacks, as described in Sect. 3.

Definition 6

(Equal Images under different Subkeys (EIS) [37, Def. 2]) For each pair \((k,k')\), \(k\ne k'\), there exists a pair \((t,t')\), \(t\ne t'\), such that \(Y_{t,k}=Y_{t',k'}\).

In the case of EIS, and in the specific case when \(Y_{t,k}\) depends only on \(t\oplus k\), we can drop the \({\tilde{Y}}^\mathsf {T}_{t,k} \varSigma ^{-1} \tilde{Y}_{t,k}\) term. The effect is represented in Fig. 1: although the value of the distinguisher (Eq. (7)) is always the largest for the correct key \(k=k^*\), one can see “oscillations” occurring every \(2^n=16\) traces (here \(n=4\)), since the dropped term (Eq. (8)) is the same for all key hypotheses. We recall the dimensions of the matrices in Table 1.

Fig. 1: Values of the distinguisher (Eq. (7)), where the \({\tilde{Y}}^\mathsf {T}_{t,k} \varSigma ^{-1} \tilde{Y}_{t,k}\) term (Eq. (8)) is dropped

Table 1 Dimensions of traces and models, seen as matrices in this paper

The max-likelihood attack minimizes over k:

  • either: \({{\,\mathrm{tr}\,}}\left( {(X-Y_k)}^\mathsf {T} \varSigma ^{-1} (X-Y_k)\right) \)

  • or equivalently \(\sum _{t\in \mathbb {F}_2^n} n_t {(\tilde{X}_t-\tilde{Y}_{t,k})}^\mathsf {T} \varSigma ^{-1} (\tilde{X}_t-\tilde{Y}_{t,k})\).

Using the law of large numbers, for large Q, we have:

$$\begin{aligned} \frac{1}{Q} \sum _{t\in \mathbb {F}_2^n} n_t {(\tilde{Y}_{t,k})}^\mathsf {T} \varSigma ^{-1} \tilde{Y}_{t,k} \xrightarrow [Q\rightarrow +\infty ]{} \sum _t \mathbb {P}(t) {(\tilde{Y}_{t\oplus k})}^\mathsf {T} \varSigma ^{-1} \tilde{Y}_{t\oplus k} \end{aligned}$$

which does not depend on k if T is uniformly distributed, that is \(\mathbb {P}(t)=2^{-n}\). This is a mathematical justification for the observation made in Fig. 1. As already noted in Sect. 3.2, there is therefore only one key-dependent term which remains:

$$\begin{aligned} \sum _t n_t {\tilde{X}_t}^\mathsf {T} \varSigma ^{-1} \tilde{Y}_{t\oplus k} = \mathsf {tr}_k ({\tilde{X}}^\mathsf {T} \varSigma ^{-1} \tilde{Y}) , \end{aligned}$$
(12)

where we recall that \(\mathsf {tr}_k\) is the k-th trace operator introduced in Definition 2.

The templates are estimated as represented in Algorithm 1. In this formulation, it clearly appears that there is a unique noise covariance matrix \(\varSigma \), which justifies the technique of “pooled covariance matrix” presented in [10]. Indeed, by the law of large numbers, we have that for all \(t\in \mathbb {F}_2^n\):

$$\begin{aligned} \lim _{Q\rightarrow +\infty } \frac{1}{Q} \sum _{q: t_q=t} X_q&= \lim _{Q\rightarrow +\infty } \frac{1}{Q}\sum _{q: t_q=t} \tilde{Y}_{t_q} \\&= \lim _{Q\rightarrow +\infty } \frac{1}{Q}\sum _{q: t_q=t} \tilde{Y}_{t} \\&= \lim _{Q\rightarrow +\infty } \frac{\tilde{Y}_{t}}{Q}\sum _{q: t_q=t} 1\\&= \lim _{Q\rightarrow +\infty } \frac{n_t}{Q} \tilde{Y}_{t} \approx \frac{1}{2^n} \tilde{Y}_{t} , \end{aligned}$$

since \(n_t \approx Q/2^n\) when Q is large. Besides, as \(X_q = \tilde{Y}_{t_q} + N\), we have that:

$$\begin{aligned} \varSigma&= \frac{1}{Q} \sum _{q=1}^{Q} (X_q - \tilde{Y}_{t_q}) {(X_q - \tilde{Y}_{t_q})}^\mathsf {T} \\&= \frac{1}{Q} \sum _{q=1}^{Q} X_q {X_q}^\mathsf {T} \\&\quad - \frac{1}{Q} \sum _{q=1}^{Q} X_q {\tilde{Y}_{t_q}}^\mathsf {T} - \frac{1}{Q} \sum _{q=1}^{Q} \tilde{Y}_{t_q} {X_q}^\mathsf {T} + \frac{1}{Q} \sum _{q=1}^{Q} \tilde{Y}_{t_q} {\tilde{Y}_{t_q}}^\mathsf {T} \\&\quad \xrightarrow [Q\rightarrow +\infty ]{} \frac{1}{Q} X {X}^\mathsf {T} - \frac{1}{2^n}\tilde{Y}{\tilde{Y}}^\mathsf {T} , \end{aligned}$$

assuming \(n_t/Q \rightarrow 1/2^n\) when \(Q\rightarrow +\infty \) (plaintext uniformity). This justifies line 9 of Algorithm 1.

[Algorithm 1]

4.2 Profiling and attack algorithms

The correct key is recovered using Algorithm 2. In this algorithm, the matrices \(\tilde{X}\) and \(\tilde{Y}\) are, respectively, \(\tilde{X} = (\tilde{x}_{0}, \cdots , \tilde{x}_{2^n-1})\) and \(\tilde{Y} = (\tilde{y}_{0}, \cdots , \tilde{y}_{2^n-1})\). A premise of the formal presentation of template attacks, including the notion of the Mahalanobis distance, can be found in the paper by Zhang et al. [44].

When the templates \((\tilde{Y},\varSigma )\) are obtained from Algorithm 1, the attack is no longer optimal [15], but is termed a template attack. However, template attacks tend to the optimal attack when the profiling is carried out on a number of traces which tends to infinity.

[Algorithm 2]

4.3 Improved profiling and attack algorithms

In practice, the profiling (Algorithm 1) is carried out only once, whereas the matching (Algorithm 2) is carried out every time the attack outcome is needed. So the adversary can compute \(\varSigma ^{-1} \tilde{Y}\) once at the end of the first stage (Algorithm 3 instead of Algorithm 1) and reuse this result every time the attack needs to be evaluated (Algorithm 4 instead of Algorithm 2). Finally, it is possible to regroup the accumulations (lines 1–8) of Algorithms 1, 2, 3 and 4 in the same function.

[Algorithm 3]
[Algorithm 4]
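Since the algorithm listings are rendered as floats, the following sketch (ours, not the authors' verbatim pseudocode) captures the gist of Algorithms 3 and 4, reusing the coalesce and tr_k helpers sketched in Sect. 2.2; it assumes uniform plaintexts and \(Y_{t,k}=\tilde{Y}_{t\oplus k}\) (EIS), and drops the key-independent terms as justified in Sect. 4.1.

```python
import numpy as np

def profile(X, t, n_bits):
    """Algorithm 3 (sketch): deliver the single matrix Sigma^{-1} Ytilde.
    Ytilde is estimated by coalescence, Sigma as (1/Q) X X^T -
    (1/2^n) Ytilde Ytilde^T, as justified for line 9 of Algorithm 1."""
    S = 1 << n_bits
    Y_tilde, _ = coalesce(X, t, S)
    Sigma = X @ X.T / X.shape[1] - Y_tilde @ Y_tilde.T / S
    return np.linalg.inv(Sigma) @ Y_tilde   # the only matrix inversion

def match(X, t, SinvY, n_bits):
    """Algorithm 4 (sketch): coalesce the attack traces, then maximize
    the key-dependent term tr_k(Xtilde^T (Sigma^{-1} Ytilde)) of Eq. (12)."""
    S = 1 << n_bits
    X_tilde, n = coalesce(X, t, S)
    M = (n * X_tilde).T @ SinvY             # 2^n x 2^n, n_t-weighted
    return int(np.argmax([tr_k(M, k) for k in range(S)]))
```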

4.4 Extension of our approach to masked implementations

Let d be a strictly positive integer. Protection by masking of order d consists in dividing each sensitive variable Z into \((d+1)\) shares \(Z^{(0)}\), \(\ldots \), \(Z^{(d)}\). In order to reveal the secret value, an adversary must carry out a so-called high-order SCA, which consists in combining the leakage corresponding to all \((d+1)\) shares. For such a high-order attack, we keep the same notations as in Sect. 2, with the following extensions:

  • \((X^{(i)})_{0\le i \le d}\) denote the measured leakage corresponding, respectively, to the \((d+1)\) shares;

  • \((Y^{(i)})_{0\le i \le d}\) denote the leakage model corresponding, respectively, to the \((d+1)\) shares.

4.4.1 State-of-the-art of high-order attacks

Let us first review the state of the art of high-order attacks in the case where the adversary uses only one time sample per share during the attack. To conduct higher-order attacks, many combination functions have been proposed in the literature, such as the product combination [38], the absolute difference combination (possibly raised to some power) [26] and the sine-based combination [35]. According to [40], even if the combination functions lose information, this loss vanishes in practice for high noise. In that case, the second-order CPA with the normalized product function becomes (nearly) equivalent to the maximum likelihood distinguisher applied to the joint distribution. According to [36], the optimal combination technique against first-order Boolean masking is the normalized product \((X^{(0)}-\mathbb {E}({X^{(0)}}))(X^{(1)}-\mathbb {E}({X^{(1)}}))\). The authors of [36] show that this combination function should be accompanied by \(\mathbb {E}[(Y^{(0)}-\mathbb {E}({Y^{(0)}}))(Y^{(1)}-\mathbb {E}({Y^{(1)}}))|Z]\) as an optimal model. A high-order optimal distinguisher is introduced in [8]; this paper proves that the CPA with normalized product combination is optimal for high noise, independently of the masking technique and its order (whether \(d=1,2,\ldots \)). Let us now show how we can use these results in order to consider several time samples per share.

4.4.2 Extension of our approach to higher-order masking

To extend our template approach to masking-based protected implementations, we take as a starting point the normalized product combination [36]. One can see the measured trace as a set of \((d+1)\) sub-traces, such that each sub-trace corresponds to one of the \((d+1)\) shares. According to [36], an adversary who combines \((d+1)\) relevant PoIs (one from each sub-trace) should be successful by carrying out a normalized product combination and by accompanying it with \({\mathbb {E}}[\varPi _{i=0}^d(Y^{(i)}-\mathbb {E}({Y^{(i)}}))\ |Z]\) as an optimal model. For the combination, the adversary can choose any relevant PoI of the first share, with any relevant PoI of the second share, ..., with any relevant PoI of the \((d+1)\)th share.

Fig. 2: Different methods to combine samples from \((d+1)\) sub-traces, in the case of a dth-order masking countermeasure. The attack is carried out on the multivariate combination shown in the artificial trace. Each sample amongst the D making up the artificial trace is a centered product of \((d+1)\) samples, each taken from a different sub-trace

Following this idea, to extend the template attack to a masked implementation, one can construct an artificial trace, such that any sample in the artificial trace is a normalized product combination of \((d+1)\) PoIs (one from each sub-trace). This is illustrated in Fig. 2a. In this figure, only one PoI is selected from the first sub-trace, and it is combined with samples of the other d sub-traces, using the evaluator's expertise to get the best choice.

A simple way of combining is to take the same number of PoIs from each sub-trace and combine them in the same order (i.e., the ith sample, for \(1\le i\le D\), of the artificial trace is the combination of the PoIs of the sub-traces that are in the ith position). Figure 2b describes the combination creating the artificial trace by such a concatenation of D normalized products between the \((d+1)\) sub-traces (a sketch is given below). Finally, the adversary can profile the leakage model of this artificial leakage and carry out the template attack by matching the combination of the current leakage (also by the normalized product) to the profiled model.
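A sketch of this straightforward position-wise combination (Fig. 2b) follows; the shares are given as a list of \((d+1)\) sub-trace matrices, each of size \(D\times Q\) (names ours):

```python
import numpy as np

def artificial_trace(subtraces):
    """Combine (d+1) sub-traces into one artificial trace by the
    normalized (centered) product, sample position by sample position."""
    out = np.ones_like(subtraces[0], dtype=float)
    for Xi in subtraces:                    # Xi: D x Q sub-trace of one share
        out *= Xi - Xi.mean(axis=1, keepdims=True)
    return out                              # D x Q artificial traces
```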

Thanks to this combination idea, one can carry out the template attack against high-order masking schemes, either in the general way or following our particular approach (coalescence followed by the spectral approach, in order to be significantly more efficient both in terms of processing time and of memory space). Experimental results of this technique, against a first-order Boolean masking [31], are presented in Sect. 5.5 (scenarios 3 and 4).

4.4.3 Extension of our approach to disaligned traces

In practice, the traces may not be aligned to begin with, due to random delays (jittering) or any other reason [13]. The impact of this countermeasure on performance has already been quantified in a formal framework for the evaluation of waveform resynchronization algorithms [22]. To carry out a template attack using such disaligned traces, one can use several techniques to first realign them [17, 19, 34, 41]. All these techniques shall be applied before coalescence.

4.5 Computational performance analysis

4.5.1 Straightforward complexity analysis

Let us comment on the complexity of the algorithms. The body of Algorithm 1, that is, lines 1–8, operates on vectors of size D. The same remark applies to the body (lines 1–8) of Algorithm 2. Hence, the overall complexity of these parts is \(D\times Q\) additions.

The complex part of Algorithm 1, namely line 9, is computed only once. The overall complexity of this part equals that of the \(X X^\mathsf {T}\) computation, that is, \(D^2\times Q\) multiplications.

The complex part of Algorithm 2, namely line 9, is also computed only once. Firstly, since the profiling stage is done only once, we can compute \( \varSigma ^{-1} \tilde{Y} \) once, as shown in Algorithm 3. The overall complexity of this multiplication equals \(D^2\times 2^n\). For inverting \(\varSigma \) efficiently, one can use the optimized Coppersmith–Winograd-like algorithm, which has a complexity of \(\mathcal {O}(D^{2.373})\) [42].

Another advantage of our template analysis is that the overall complexity of the attack phase, namely Algorithm 4, does not depend on Q, no matter how large Q is. Indeed, the overall complexity of the \(\tilde{X}^\mathsf {T} \big (\varSigma ^{-1} \tilde{Y} \big )\) computation is equal to \(2^{2n}\times D\) multiplications. Table 2 shows the computing time of Algorithm 3 according to D. It can be seen that computing the matrix products needed to derive \(\varSigma \) takes more time than inverting \(\varSigma \). Since the computation time of Algorithm 3 depends (asymptotically) only on lines 10 and 11, we provide their durations in Table 2; of course, the overall time of Algorithm 3 is about the sum of those of lines 10 and 11.

In order to save memory space, we compute \(\varSigma \) in line 10 of Algorithm 3 from X and \(\tilde{Y}\) without using any temporary matrix. Finally, we can factor the code of lines 1–8 of Algorithms 1, 2, 3 and 4 into the same function.

Table 2 Computing time (in seconds) of Algorithm 3 according to D

4.5.2 Complexity improvement with spectral analysis

We improve the complexity of the template attack using results from [23]. To compute \({{\,\mathrm{tr}\,}}_k \Big (\tilde{X}^\mathsf {T} \big (\varSigma ^{-1} \tilde{Y} \big )\Big )\) more efficiently, let us first denote \(\varSigma ^{-1} \tilde{Y}\) by \(\tilde{\mathbb {Y}}\), so that:

$$\begin{aligned} {{\,\mathrm{tr}\,}}_k \Big (\tilde{X}^\mathsf {T} \big (\varSigma ^{-1} \tilde{Y} \big )\Big )= {{\,\mathrm{tr}\,}}_k \Big (\tilde{X}^\mathsf {T} \tilde{\mathbb {Y}}\Big ). \end{aligned}$$

Recall that the dimension of \(\tilde{X}^\mathsf {T}\) is \(2^n\times D\) and that of \(\tilde{\mathbb {Y}}\) is \(D\times 2^n\). The (i, j) element of the matrix \(\tilde{X}^\mathsf {T} \tilde{\mathbb {Y}}\) is thus:

$$\begin{aligned} \tilde{X}^\mathsf {T} \tilde{\mathbb {Y}}[i][j] =\sum _{l=0}^{D-1}\tilde{X}^\mathsf {T}[i][l] \tilde{\mathbb {Y}}[l][j] . \end{aligned}$$

Consequently,

$$\begin{aligned} {{\,\mathrm{tr}\,}}_k&\left( \tilde{X}^\mathsf {T} \tilde{\mathbb {Y}}\right) \\&=\sum _{i=0}^{2^n-1}\sum _{l=0}^{D-1}\tilde{X}^\mathsf {T}[i][l] \tilde{\mathbb {Y}}[l][i\oplus k]\\&=\sum _{l=0}^{D-1}\sum _{i=0}^{2^n-1}\tilde{X}^\mathsf {T}[i][l] \tilde{\mathbb {Y}}[l][i\oplus k]\\&=\sum _{l=0}^{D-1}\tilde{X}^\mathsf {T}[.][l] \otimes \tilde{\mathbb {Y}}[l][.](k)\\&=\sum _{l=0}^{D-1}WHT^{-1}\Big ( WHT\big (\tilde{X}^\mathsf {T}[.][l]\big ) \bullet WHT\big (\tilde{\mathbb {Y}}[l][.]\big )\Big )(k) \end{aligned}$$

So,

$$\begin{aligned} {{\,\mathrm{tr}\,}}_k \left( \tilde{X}^\mathsf {T} \tilde{\mathbb {Y}}\right) = WHT^{-1}\Big (\sum _{l=0}^{D-1} WHT\big (\tilde{X}^\mathsf {T}[.][l]\big ) \bullet WHT\big (\tilde{\mathbb {Y}}[l][.]\big )\Big )(k) , \end{aligned}$$
(13)
where:

  • \(\tilde{X}^\mathsf {T}[.][l]\) denotes the \(l\)th column of the matrix \(\tilde{X}^\mathsf {T}\),

  • \(\tilde{\mathbb {Y}}[l][.]\) denotes the \(l\)th row of the matrix \(\tilde{\mathbb {Y}}\),

  • \(\otimes \) denotes the convolution product between them, and

  • \(\bullet \) denotes the coordinate-wise product between them.

The (normalized) Walsh–Hadamard transform (WHT) allows us to compute, for all \(l=0,\dots ,D-1\), the convolution product \(\tilde{X}^\mathsf {T}[.][l] \otimes \tilde{\mathbb {Y}}[l][.]\) with a complexity of \(n2^n\) instead of \(2^{2n}\). Thereby, the overall complexity of the \({{\,\mathrm{tr}\,}}_k \Big (\tilde{X}^\mathsf {T} \big (\varSigma ^{-1} \tilde{Y} \big )\Big )\) computation becomes \(n2^n\times D\) instead of \(2^{2n}\times D\). In applications such as key extraction from the AES (\(n=8\)), the computation time of the attack phase is thus divided by \(2^n/n=32\), which is a significant gain. Table 3 shows the computing time of line 9 of Algorithm 4 according to D, with and without the spectral analysis (Eqs. 13 and 12, respectively).
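A minimal sketch of this spectral computation is given below (names ours); the unnormalized fast WHT used here satisfies \(WHT(WHT(v)) = 2^n v\), so the inverse transform is \(WHT/2^n\), and the dyadic convolution of Eq. (13) becomes a coordinate-wise product in the Walsh spectrum.

```python
import numpy as np

def wht(v):
    """Fast Walsh-Hadamard transform of a length-2^n vector, in
    O(n 2^n) (unnormalized: applying it twice multiplies by 2^n)."""
    v, h = v.copy(), 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return v

def tr_k_all(X_tilde, Y_bb):
    """All 2^n values of tr_k(X_tilde^T Y_bb) at once, per Eq. (13):
    accumulate WHT(column l of X_tilde^T) * WHT(row l of Y_bb) over
    the D samples, then apply the inverse WHT."""
    S = X_tilde.shape[1]
    acc = np.zeros(S)
    for l in range(X_tilde.shape[0]):   # l ranges over the D samples
        acc += wht(X_tilde[l, :]) * wht(Y_bb[l, :])
    return wht(acc) / S                 # one score per key hypothesis k
```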

Table 3 Computing time (in seconds) of the line 9 of Algorithm 4 according to D, with and without the spectral analysis (WHT)

It is noteworthy that this result, which assumes the group operation \(\oplus \) over the set \(\{0,1\}^n\) (i.e., \(Z=F(t\oplus k)\)), holds true for any other group operation \(\odot \), as long as the Walsh–Hadamard transform is replaced by the Fourier transform on the group \((\{0,1\}^n, \odot )\). For example, since in TEA (the Tiny Encryption Algorithm) and several AES candidates [12] the operation is \(+ \pmod {2^n}\), the cyclic Fourier transform must replace the Walsh–Hadamard transform (WHT).
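For illustration, under the assumption of the additive group \((\mathbb {Z}/2^n\mathbb {Z}, +)\), i.e., \(Z=F(t + k \bmod 2^n)\), the kth “trace” sums \(M_{t,\, t+k \bmod 2^n}\), which is a cyclic cross-correlation; a sketch with np.fft (names ours):

```python
import numpy as np

def tr_k_all_modular(X_tilde, Y_bb):
    """Analogue of tr_k_all for the group (Z/2^n, +): the cyclic
    cross-correlation is computed in the Fourier domain, where it
    becomes conj(FFT(a)) * FFT(b)."""
    acc = np.zeros(X_tilde.shape[1], dtype=complex)
    for l in range(X_tilde.shape[0]):
        acc += np.conj(np.fft.fft(X_tilde[l, :])) * np.fft.fft(Y_bb[l, :])
    return np.fft.ifft(acc).real        # one score per key hypothesis k
```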

5 Experiments

In this section, we assess the efficiency of our template attack. We first analyze its success rate with variable numbers of samples per trace and compare it with the success rate of the traditional CPA. We then compare this template attack with the monovariate CPA applied after PCA.

5.1 Traces used for the case study

An ATMega 163 smart card, embedding an 8-bit AVR processor, has been programmed to run a software AES. The analysis consists in measuring the power consumption of the smart card while the AES is running. The power measurements are taken with a PicoScope 3204A on the first round. The sampling rate equals 256 MSamples/second.

We first characterize the traces used in the following experiments. The average trace, the SNR (Definition 3), the NICV (Definition 4) and the principal direction of the PCA (recall Sect. 3.3.1) are computed at every sample of these traces. Figure 3 shows those characterizations over \(D=700\) samples.

Fig. 3: Average trace, SNR, NICV and first principal direction of PCA, according to the index of samples

In the following template attacks, one set of traces is collected for template building, whilst another set is collected for the attack phase. As the two sets of traces are measured from the same smart card, the results are optimistic in terms of attacker potential. We analyze the success rate of the template attack according to the size of a chosen sample window.

5.2 Template attacks with windows of increasing size

We analyzed the success rate with different numbers of samples (from \(D=1\) to \(D=700\)). The samples are selected around a chosen central point \(I_0\), the sample time where the traditional CPA yields the highest peak. The window of growing size D gathers the samples belonging to the interval \(\left[ I_0-D/2, I_0+D/2\right) \).

Figure 4 shows the success rate according to the number of measurements for different window sizes (from \(D=200\) to \(D=700\)). From these results, one can deduce that the larger the dimensionality, the fewer traces are needed to recover the secret key. Asymptotically, for very large dimensionalities \(D\rightarrow +\infty \), only a few traces suffice to extract the key.

For reference, Fig. 4 also shows the success rate of the univariate CPA (which corresponds to the maximum value of the correlation coefficient over all \(D=700\) samples). The CPA uses the Hamming weight model. It can be seen that the (model-agnostic) template attack becomes more efficient than the CPA starting from about \(D\in [250,300]\), and remains better for larger dimensionalities.

Indeed, as shown in Fig. 5, this monotonicity of the success rate is approximately respected above \(D=300\) samples/trace. But, as shown in Fig. 6, it is not respected below \(D=300\). One can deduce that in this interval the success rate depends on the signal waveform; in the interval \(D\in [1,300]\), it can happen that increasing the window size injects more noise than signal into the template attack, thereby making it less efficient. To lead this attack more efficiently, one can choose only the samples where the curve in Fig. 6 decreases, because these samples really depend on \(T\oplus k\), contrary to the samples where the curve increases.

Fig. 4: Values of success rate, according to the number of samples

Fig. 5: Number of measurements at 80% of success rate, according to the number of samples (zoom for \(D>300\))

As shown in Fig. 4, from around 280 samples/trace, this multivariate analysis is more efficient than the traditional CPA. So one can conclude that there is an opportunity to capture more information by using part or all of the samples rather than conducting a monovariate attack.

Fig. 6: Number of measurements at 80% of success rate, according to the number of samples (full scale, that is \(D\in [1,700]\))

In order to validate our study independently of the targeted device, the same experiment is carried out on a more recent smartcard, namely the TI MSP430G2553. Figure 7 shows similar results on this device, compared to the right part of Fig. 6 (\(D\ge 300\)). One can see that the more samples are taken into account, the faster the attack. This figure also shows that it is very important to choose the relevant PoIs, and that template attacks benefit from the great multiplicity of PoIs. The gain is huge: more than \(100\times \) fewer traces are needed to recover the key when using \(D=700\) instead of \(D=1\).

It is noticeable that the sampling frequency used during the MSP430G2553 (resp. ATMega 163) attack is 512 MHz (resp. 256 MHz), while the operating frequency is 16 MHz (resp. 8 MHz). So the two campaigns are comparable in terms of number of samples per clock cycle. The difference lies in the noise, since the two circuits have not been designed in the same technology and the side-channel acquisition systems differ (MSP430 LaunchPad and smartcard reader, respectively).

It is also noteworthy that both PoI-based and PCA-based template attacks [20] can straightforwardly benefit from our approach.

Fig. 7: Number of measurements at 80% of success rate, according to the number of samples (full scale, that is \(D\in [1,700]\)) for the TI MSP430G2553 card

5.3 Comparison with PCA

In order to compare the efficiency of this method with other multivariate analyses, a PCA is carried out on the same device with the same number of samples (700 samples/trace). Figure 8 shows that the template attack is more efficient in practice than both the PCA-based attack and the traditional CPA. Recall that Fig. 3 shows the average trace, SNR, NICV and first principal direction of PCA, according to the index of samples.

Fig. 8: Values of success rate, according to the number of samples

5.4 Template attack after dimensionality reduction (over first eigen-components)

At this stage, we assess the effect of PCA on the success rate of our template attack. Instead of assessing the efficiency of our attack according to the number of samples per trace, it is assessed according to the number of PCA directions, ordered by decreasing eigenvalue. Figure 9 shows the number of measurements required to reach 80% of success rate, according to the number of eigen-components considered per measurement. This figure also shows the ordered eigenvalues according to their corresponding eigen-components. It is clear that a quasi-linear relation exists between them: the success rate increases about as fast as the cumulative eigenvalues.

Fig. 9: PCA with multiple directions (from 1 to \(2^n-1\) directions). Top: scree plot of eigenvalues (log scale). Bottom: number of measurements at 80% of success rate (log scale)

5.5 Study of our approach with simulated traces

In order to make a fair comparison under different aspects (PoIs, noise levels and masking), one can resort to simulated traces. In this paper, we present four scenarios with different noise levels. In each scenario, we increase a window size around a central PoI and report the number of traces required to reach \(80\%\) of success rate. We thereby carry out, at once, a study with different numbers of PoIs, different noise levels, with and without the masking technique. The four scenarios are:

  1. In the first scenario, we simulate a leakage that follows a half cycle of a sine function, during 700 samples (or equivalently any 700 PoIs that follow the same law); a simulation sketch is given after this list. More precisely, if the sensitive variable is Z, then the leakage at the sth time sample is:

    $$\begin{aligned} X_z(s)=w_H(z)\sin \big (2\pi \frac{s}{2\times 700}\big )+N , \end{aligned}$$

    such that N denotes Gaussian noise, and the leakage model is \(Y_z(s)=w_H(z)\sin \big (2\pi \frac{s}{2\times 700}\big )\). In these equations, \(w_H(z)\) is the Hamming weight of \(z\in \mathbb {F}_2^n\), that is, \(w_H(z)=\sum _{i=1}^n z_i\). We used such a leakage simulation in order to be close to real leakage (for example, see the real leakage of a smart card as shown in [33, Fig. 4]). During this experiment, we vary the window size around the central PoI (\(s=350\)). Figure 10 shows the number of traces required to reach \(80\%\) of success rate according to the number of samples, for different standard deviations (\(\sigma \)) of the noise, in the first scenario.

  2. In the second scenario, we simulate a leakage that follows the same leakage function as in the first scenario, but only for the even values of s. For the odd values of s, we simulate non-relevant points (non-informative points [27]) by random values. Figure 11 shows the number of traces required to reach \(80\%\) of success rate according to the number of samples, for different noise standard deviations (\(\sigma \)), in the second scenario. This illustrates the drawback of considering non-relevant points as PoIs.

  3. In the third scenario, we simulate the leakage of a device protected by first-order Boolean masking [32, §4]. For each share (masked sensitive value and mask), the leakage is simulated by a sine function as in the first scenario. Before carrying out our approach, we combine the sub-traces into one artificial trace as described in Sect. 4.4. Figure 12 shows the number of traces required to reach \(80\%\) of success rate according to the number of samples, for two different standard deviations (\(\sigma \)) of the noise, in the third scenario. One can see that it is possible to reveal the secret by applying our approach even against masking.

  4. In the fourth scenario, we simulate a leakage that follows the same leakage function as in the third scenario, but only for the even values of s. For the odd values of s, we simulate non-relevant points (non-informative points) by random values. This alternation between relevant and dummy samples holds for both shares. Figure 13 shows the number of traces required to reach \(80\%\) of success rate according to the number of samples, for two different standard deviations (\(\sigma \)) of the noise, in the last scenario. This figure shows the same trend as Fig. 12: the convergence rate with respect to dimensionality (number of samples) is similar, but the asymptotic value (number of traces to recover the key with \(80\%\) success) is of course higher, especially for high noise.

Fig. 10: Number of measurements at 80% of success rate (log scale), according to the number of samples, in scenario 1

Fig. 11: Number of measurements at 80% of success rate (log scale), according to the number of samples, in scenario 2

Fig. 12: Number of measurements at 80% of success rate (log scale), according to the number of samples, in scenario 3

Fig. 13: Number of measurements at 80% of success rate (log scale), according to the number of samples, in scenario 4

6 Conclusion and perspectives

In this paper, we have provided an analytical formula for the template attack in a multivariate Gaussian setting. We have applied it to highly multivariate traces and have shown that template attacks outperform state-of-the-art heuristics, such as trace dimensionality reduction followed by monovariate distinguishers. Template attacks without prior dimensionality reduction can be applied to traces of dimensionality D of several hundreds without any loss of effectiveness: the success rate increases as D increases. Therefore, this study reveals that the high sampling rate of oscilloscopes can help increase the success rate of attacks.

Furthermore, we extended the approach to masking-based protected implementations. We also exhibited a spectral approach for template attacks which allows an exponential computational improvement in the attack phase (with respect to the data bitwidth), which turns out to be a speed-up of \(32\times \) in the case of AES. Recall that both PoI-based and PCA-based template attacks can straightforwardly benefit from our approach, both in terms of processing time and in terms of memory requirements. Our approach is validated both on simulated and real-world traces.