1 Introduction

Embedded devices such as smart cards, mobile phones, and RFID tags are widely used in our everyday lives. These devices implement cryptographic operations allowing to secure, for example, bank transfers, buildings and cars. Several cryptographic primitives exist such as hash functions and encryption functions. During the execution of an encryption algorithm, the device processes secret information. Such secret information could be retrieved with physical attacks against the physical device by analyzing unintentional leakages that appear in power consumption [26], processing time [25], electromagnetic emanation [14] or in a combination of them [47].

Two main families of attacks against cryptographic devices exist: non-profiled and profiled attacks. Non-profiled attacks are one-phase strategy that perform the key recovery step directly on the target device. Profiled attacks are two-phase strategies that (1) characterize the leakage value of the target device on the basis of a key-related information using a clone device (similar to the target device) and then (2) attack the target device.

In 1999, Kocher et al. [26] proposed a non-profiled differential power analysis (DPA) on power consumption data (called traces). This method models the theoretic leakage for each secret information. Then the real and the predicted leakage are compared using metrics, also known as distinguishers, like the correlation coefficient (CPA) [6], the difference of means (DoM) [26], the mutual information (MIA) [15], or the Kolmogorov–Smirnov test (KS) [52]. The rationale is that the likelihood of a secret information is related to the degree of similarity between the predicted and the real leakage information.

Profiled attacks, like template attack (TA) [5] and stochastic attack (SA) [46], make a step forward in the use of statistical modeling of leakages; they estimate the conditional density function of the trace for each key-related information by creating a Gaussian parametric model during a profiling step. Thereafter, during the attacking phase, the traces are classified by a maximum likelihood approach. We refer to [51] for a recent comparison between template attack and stochastic attack. If the assumption of Gaussianity holds and sufficient data exist for accurate estimation of the parameters, Template Attack can be considered as the strongest leakage analysis in an information theoretic sense [5]. Profiled attack is particularly suitable (1) to analyze the security of a cryptographic device in the worst case scenario and (2) when the adversary is only able to observe a single use of the key (e.g. in stream ciphers) or a secret one time value (e.g. a mask in a masking scheme).

In recent years, the cryptographic community explored new approaches based on machine learning. The results show that template attacks overestimate the security of embedded devices in several scenarios. Lerman et al. [27, 28] compared a template attack with a binary machine learning approach, based on non-parametric methods, against cryptographic hardware devices implementing a symmetric and an asymmetric cryptographic algorithm. Hospodar et al. [22, 23] analyzed a software implementation of a portion of a block-cipher. Their experiments support the idea that non-parametric techniques can be competitive and sometimes better (i.e. less traces in the attacking phase) than template attack. Heuser et al. [21] generalized this idea by analyzing multi-class classification models in several contexts. In the same year, Bartkewitz [2] applied a multi-class machine learning model allowing to improve the attack success with respect to the binary approach. Recently, Lerman et al. [29] proposed a machine learning approach that takes into account the temporal dependencies between power values. This approach improves the success rate of an attack in a low signal-to-noise ratio with respect to classification methods. At the same time, Martinasek et al. [35] applied a neural network to extract one byte of the key of AES. Their method retrieves the secret value with probability around 0.9 using a single measured power leakage.

Together with attacks, the embedded systems industry needs countermeasures. Side-channel attacks may be counteracted by inducing a leakage independent of the secret target value. It is worth to mention that all the previously presented attacks based on machine learning were applied on unprotected cryptographic devices. Recently, Lerman et al. [30] investigated whether the results of the previous works would be still the same in a protected environment. During the attacking phase, for a specific countermeasure and for a specific device, their investigations concerned: (1) How many traces are required against a protected device with a machine learning model compared to a strategy based on template attack or on stochastic attack (2) How many traces are required by a machine learning model attacking a protected device compared to an unprotected device (3) What is the impact of the number of traces used in the profiling step by a machine learning model when attacking a protected device.

The results presented in this paper extend the previous analyses in two main directions. We first focus on how to improve the efficiency of the attack by investigating several profiled and non-profiled attacks in protected contexts compared to an unprotected context. On one hand, we compare the previous results with multivariate regression attack and nonlinear stochastic attacks to bypass the protection. On the other hand, we add results on multivariate non-profiled attacks and machine learning models during the key recovery step. This provides a clear idea on how to improve the results of Lerman et al. [30] by proposing an original efficient strategy based solely on machine learning models. The main purpose of our paper is to see the limit of a machine learning approach (i.e. based solely on machine learning models) against a protected device and, thereafter, to compare this strategy to a broader set of attacks. In the second part, we quantify the information that the profiled models retrieve with a large learning set (i.e. in asymptotic contexts). Such a study is of particular importance when the adversary has no constraint on the number of measured traces during the profiling step.

As in Lerman et al. [30], our requirements are fast-execution, low-memory usage and high success rate of the attack. The purpose is to put ourselves in a realistic attack scenario before deployment or certification process (that are expensive and time consuming).

We made a detailed assessment of the proposed strategy by considering several public datasets with different number of traces during the profiling phase and the attacking phase. These traces were collected on a smart card that implements the block-cipher AES protected by a lightweight masking scheme. All our datasets were extracted from the public dataset of the DPAContest V4 [11, “an initiative towards an international benchmarking reference” [11].

The rest of the paper is organized as follows. Section 2 discusses side-channel attacks, non-profiled attacks, profiled attacks, masking countermeasures and strategies against countermeasures. Section 3 introduces our original attack based on a machine learning approach against a masking scheme. Section 4 illustrates the power of our proposal with a large number of experiments. Section 5 concludes this paper with several perspectives of future work.

2 Preliminaries

In this section we introduce the basic definitions of side-channel attacks, non-profiled attacks, profiled attacks, masking countermeasures and strategies against countermeasures.

2.1 Side-channel attacks

During the execution of an encryption algorithm, the cryptographic device processes a function \(f\) (targeted by the adversary)

$$\begin{aligned}&f:{\mathcal P}\times {\mathcal O}\rightarrow {\mathcal F}\nonumber \\&s=f_{O}(p) \end{aligned}$$
(1)

called a sensitive variable [44] where

  • \(O\in {\mathcal O}\) (and \({\mathcal O}= \{O_0,O_1,\ldots ,O_{K-1}\}\) is a key-related information where \({\mathcal O}=\{0,1\}^{l_1}\), \(l_1\) is the size of the secret value used in \(f\) (e.g. one byte of the secret key) and \(K\) is the cardinality of \({\mathcal O}\)

  • \(p\in {\mathcal P}\) (and \({\mathcal P}=\{p_0,p_1,\ldots ,p_{P-1}\}\) represents a public information where \({\mathcal P}=\{0,1\}^{l_2}\), \(l_2\) is the size of the public value used in \(f\) (e.g. one byte of the plaintext) and \(P\) is the cardinality of \({\mathcal P}\)

  • \({\mathcal F}=\{0,1\}^{l_3}\) is the codomain of \(f\) where \(l_3\) is the size of the output of \(f\).

Note that \(l_1\), \(l_2\) and \(l_3\) depend on the cryptographic algorithm and the device architecture. We assume that the adversary wants to retrieve the secret value used when the cryptographic device (executing a known encryption algorithm) encrypts known plaintexts.

Prouff [43] showed that nonlinear functions are less robust against side-channel attacks than linear functions. As a result, usually, the target function \(f\) represents a nonlinear function such as a substitution box (SBox) of the block-cipher, e.g.

$$\begin{aligned} f_{O}(p)=\mathrm{SBox}(p\oplus O) \end{aligned}$$
(2)

where \(\oplus \) is the bitwise exclusive-or.

Let

$$\begin{aligned} ^{j}T{}_{i} =\left\{ _{t}^{j}T{}_{i} \in \mathbb {R}\mid t\in \left[ 1;n\right] \right\} \end{aligned}$$
(3)

be the \(j\)-th leakage information (called trace) associated to the \(i\)-th target value. We consider the leakage information \(_{t}^{j}T{}_{i}\) of the device at time \(t\) depending on the output of \(f_{O_i}(p)\) such that

$$\begin{aligned} _{t}^{j}T{}_{i}&= y_i+\epsilon \end{aligned}$$
(4)
$$\begin{aligned}&= L\left( f_{O}\left( p\right) \right) +\epsilon \end{aligned}$$
(5)

where \(y_i=L\left( f_{O}\left( p\right) \right) \), \(\epsilon \in \mathbb {R}\) is the noise following a Gaussian distribution with zero mean and \(L\) is the leakage model

$$\begin{aligned}&L:{\mathcal F} \rightarrow {\mathcal Y}\\&y = L\left( f_{O}\left( p\right) \right) \nonumber \end{aligned}$$
(6)

where \({\mathcal Y}=\{y_1,y_2,\ldots \}\subset \mathbb {R}\) (also known as the set of classes). Examples of models \(L\) are the identity, the Hamming weight (HW) and the Hamming distance [33].

2.1.1 Non-profiled attacks

Non-profiled attacks are commonly used to target a cryptographic device. These attacks estimate the output value of \(f_{O}(p)\) for each possible target value \(O\). Then, the estimated leakage model \(\hat{L}\) transforms this output value to allow, in fine, to compare the real and the predicted leakage information with a distinguisher \(D\) (e.g. the Pearson correlation). Mathematically, a univariate non-profiled attack returns the target value \(\hat{O}\) that maximizes

$$\begin{aligned} \hat{O} = \underset{O\in {\mathcal O}}{\mathrm{arg\,max }}\ | D(\hat{{\varvec{T}}}(O),{\varvec{T}})| \end{aligned}$$
(7)

where

  • \(|x|\) designates the absolute value of \(x\)

  • \({\varvec{T}}=\left[ {_{t}^{1}T{}},\ldots , {_{t}^{N}T{}}\right] \) represents a list of traces measured at time \(t\) (i.e. \({_{t}^{i}T{}}\in \mathbb {R}\ \ \forall i\in \left\{ 1;N\right\} \))

  • \(\hat{{\varvec{T}}}\left( O\right) =\left[ \hat{L}(f_{O}(p_{[0]})),\ldots , \hat{L}(f_{O}(p_{[N]})) \right] \) refers to a list of estimated leakages parameterized with an estimated key \(O\) and known plaintexts \(p_{[i]}\) associated to \({_{t}^{i}T{}}\) (i.e. \(\hat{L}(f_{O}(p_{[0]})) \in {\mathcal Y}\)).

This paper focuses on correlation power analysis where the distinguisher represents the Pearson correlation estimator.

The multivariate non-profiled attack generalizes the univariate attack by considering several time instants related to the target information. According to [19], there are two multivariate approaches: (a) apply an attack to a combination of power leakages, or (b) apply an attack to multiple sample points independently and then combine their results. In our experiments, we considered both approaches.

2.1.2 Profiled attacks

Let \({\mathrm {Pr}}\left[ A\right] \) be the probability of A and let \({\mathrm {Pr}}\left[ A \mid B\right] \) be the probability of A given B. The profiled attack strategy represents a more efficient attack by deeper leakage estimations. It estimates (with a set of traces called learning set) a template \({\mathrm {Pr}}\left[ ^{j}T{}_{i} \mid y_i;\theta _i\right] \) (where \(\theta _i\) is the parameter of the probability density function) for each target value during the profiling step (also known as learning step). The learning set is measured on a controlled device similar to the target chip. In our experiments, we used the same cryptographic device but we refer to [37] that studies practical issues when the controlled and the target devices differ.

Once a template is estimated for each target value, during the attacking step the adversary classifies a new trace \(T\) (measured on the target device) using the a posteriori probability returned by a model \(A(T)\)

$$\begin{aligned} \hat{y}=A(T)&= \underset{y_i\in {\mathcal Y}}{\mathrm{arg\,max }} {\mathrm {Pr}}\left[ y_i \mid T\right] \end{aligned}$$
(8)
$$\begin{aligned}&= \underset{y_i\in {\mathcal Y}}{\mathrm{arg\,max }}\ \frac{{\mathrm {Pr}}\left[ T \mid y_i\right] \times {\mathrm {Pr}}\left[ y_i\right] }{{\mathrm {Pr}}\left[ T\right] } \end{aligned}$$
(9)
$$\begin{aligned}&= \underset{y_i\in {\mathcal Y}}{\mathrm{arg\,max }}\ \hat{{\mathrm {Pr}}}\left[ T \mid y_i;\hat{\theta _i}\right] \times \hat{{\mathrm {Pr}}}\left[ y_i\right] \end{aligned}$$
(10)

where the a priori probabilities \(\hat{{\mathrm {Pr}}}\left[ y_i\right] \) are estimated by the user accordingly.

If a set \({\varvec{T}}\) of traces (where \({\varvec{T}}=\left[ {_{}^{1}T{}},\ldots , {_{}^{N}T{}}\right] \) and \( {_{}^{i}T{}}\in \mathbb {R}^n\ \ \forall i\in \left\{ 1;N\right\} \)) for a constant secret key are available, the adversary uses the equation (or the log-likelihood rule)

$$\begin{aligned} \hat{O}=\underset{y_i\in {\mathcal Y}}{\mathrm{arg\,max }}{\prod _{j=1}^{N}{{\mathrm {Pr}}}\left[ {_{}^{j}T{}_{}} \mid y_i\right] \times {{\mathrm {Pr}}}\left[ y_i\right] } \end{aligned}$$
(11)

Several approaches exist to estimate the probability \({\mathrm {Pr}}\left[ T_i \mid y_i\right] \) such as the parametric template attack [5], the stochastic attack [46], the multivariate regression model [49] and the non-parametric machine learning models [22, 27].

Template attacks

Template attacks [5] assume that \({\mathrm {Pr}}\left[ T_i \mid y_i\right] \) follows a Gaussian distribution for each target value, i.e.

$$\begin{aligned} \hat{{\mathrm {Pr}}}\left[ T_i \mid y_i;\hat{\theta _i}\right]&= \hat{{\mathrm {Pr}}}\left[ T_i \mid y_i;\hat{\mu }_i , \hat{\Sigma }_i\right] \end{aligned}$$
(12)
$$\begin{aligned}&= \frac{{e^{-\frac{1}{2}(T_{i}-\hat{\mu }_{i})\hat{\Sigma }_{i}^{-1}(T_{i}-\hat{\mu }_{i})^\top }}}{\sqrt{(2\pi )^{n} \det (\hat{\Sigma }_{i})}} \end{aligned}$$
(13)

where \(\det (\Sigma )\) denotes the determinant of the matrix \(\Sigma \) while \(\hat{\mu }_{i}\in \mathbb {R}^{n}\) and \(\hat{\Sigma }_{i}\in \mathbb {R}^{n\times n}\) are respectively the expected value and the covariance of the \(n\) variate traces associated to the \(i\)-th target value.

2.1.2.1 Stochastic attacks

Stochastic attacks [46] (also known as linear regression attack) model the leakage information that depends on the secret value \(y_i=f_{O}\left( p\right) \) at time \(t\) with a regression model \({_{t}h}\), i.e.

$$\begin{aligned} _{t}^{j}T{}_{i}&= L\left( y_i\right) +\epsilon \end{aligned}$$
(14)
$$\begin{aligned}&= {_{t}h\left( y_i\right) } +{_{t}R}\end{aligned}$$
(15)
$$\begin{aligned}&= {{_{t}c} + \sum ^{U}_{u=1} {_{t}\alpha _u} \, g_u\left( y_i\right) } + {_{t}R} \end{aligned}$$
(16)

where \({_{t}R}\in \mathbb {R}\) is a residual Gaussian noise at time \(t\), \(\left\{ {_{t}c}, {_{t}\alpha _1},\ldots , {_{t}\alpha _U} \right\} \in \mathbb {R}^{U+1}\) is the parameter of the regression model \({_{t}h}\) and \(\left\{ g_1,\ldots , g_{U} \right\} \) is the basis used in the regression. Each \(g_u\) is a monomial of the form \(\prod _{j\in \mathcal {J}}{\mathrm {Bit}}_j\left( y_i\right) \) where \({\mathrm {Bit}}_j\left( y_i\right) \) returns the j-th bit of \(y_i\) and \(\mathcal {J}\subset \{1,2,\ldots ,l_3\}\). In other words, we can vary the degree of a stochastic model from 1 through to \(l_3\) to vary its complexity. For example, in a linear basis, each function \(g_u\) equals to

$$\begin{aligned} g_u\left( y_i\right) ={\mathrm {Bit}}_u\left( y_i\right) \end{aligned}$$
(17)

Then, the attacker assumes that \({\mathrm {Pr}}\left[ T \mid y_i\right] \) follows the Gaussian distribution \({\mathcal N}\left( {h\left( y_i\right) },\Sigma \right) \) where \(h(y_i)\) equals to \(\left\{ {{_{1}h}}(y_i), {{_{2}h}}(y_i),\ldots , {{_{n}h}}(y_i)\right\} \) and \(\Sigma \in \mathbb {R}^{n\times n}\) is the covariance matrix of the residual term. An extended version of stochastic attack removes the profiling step [10]. However, this approach is out of the scope of this work.

2.1.2.2 Multivariate regression attack

The multivariate regression model [49] describes the relationship between a set of \(n\) features representing a trace \({T_{i}} =\left\{ {_{1}T_{i}},\ldots , {_{n}T_{i}}\right\} \in \mathbb {R}^{n}\) and the target value such that

$$\begin{aligned} y_i = c + \sum ^{n}_{j=1} {\alpha }_{j} \ {_{j}T_{i}}+\epsilon \end{aligned}$$
(18)

where \(\alpha _{j}\in \mathbb {R}\ \ \forall j\in \left\{ 1;n\right\} .\)

2.1.2.3 Non-parametric machine learning attack

Non-parametric (supervised) machine learning models make no assumptions about the density distribution functions.

The machine learning field regroups several learning algorithms. For example, random forest model (RF) [3] builds a set of decision trees that classifies a trace based on a voting system. Support vector machine (SVM) [7] discriminates traces associated to different target values with hyperplanes in optionally a higher dimensional space than the original dimension. Unfortunately, no statistical modeling algorithms can be entitled to be the universally best one as formalized by the no-free-lunch theorem.

We refer to [2, 2123, 2729] for a detailed introduction to the non-parametric machine learning models.

2.2 Masking countermeasure

Based on secret sharing, the masking countermeasure aims to reduce the unintentional leakage information of a cryptographic device [4]. For this, the method masks a public information \(p\) with \(d\) uniformly distributed random values \(v=\{v_0,v_1,\ldots ,v_{d-1}\} \in {\mathcal V}^d\) changing at each execution where \({\mathcal V}=\{0,1\}^{l_4}\) and \(l_4\) is the size of each random value. This approach is called a masking scheme of order \(d\). From a theoretical point of view, the security level of a masked implementation against side-channel attacks increases exponentially with \(d\) [4] when the amount of noise in the traces is sufficiently high [48].

To deal with the mask values, the encryption scheme (\(E\)) is modified and satisfies the relation

$$\begin{aligned} E'(\mathrm{plaintext},\mathrm{key}, \mathrm{masks})=E(\mathrm{plaintext},\mathrm{key}) \end{aligned}$$
(19)

where \(E'\) is the modified encryption algorithm. In other words, \(E\) and \(E'\) output the same ciphertext when the same pair \(\{\)key;plaintext\(\}\) is used. However \(E\) and \(E'\) manipulate different internal values. If correctly implemented, the leakage of the masked implementation becomes statistically independent of the key.

The public cryptographic literature provides plenty of masking schemes based on the one time pad. The Boolean masking computes the combination of \(p\) and \(v\) with the bitwise exclusive-or operation [4]. The multiplicative masking performs the combination with a multiplication in the field \(GF(2^n)\) [1, 18]. The affine masking combines the advantages of the previous schemes in a security point of view with the cost of a heavy operation during the encryption step [50]. Our paper deals with the most common method—the Boolean masking scheme—as the DPAContest V4 uses it.

The main issue in designing masking scheme lies in propagating the mask values throughout nonlinear functions (i.e. \(\mathrm{SBox}(p\oplus v)\ne \mathrm{SBox}(p)\oplus \mathrm{SBox}(v)\)). Several approaches exist to modify the SBoxes part. One of the easiest approach computes a table look-up which associates to each masked input \(p\oplus v_{\mathrm{input}}\) the output value \(\mathrm{SBox}(p)\oplus v_{\mathrm{output}}\) where \(v_{\mathrm{input}}\) and \(v_{\mathrm{output}}\) are mask values [36]. However, such fast approach may require a lot of memory to store the tables. This is the proposed target masking scheme of the DPAContest V4.

2.3 Strategies against countermeasures

Recently, Moradi et al. [38] show that an adversary can successfully target a masked scheme when the implementation of the countermeasure contains mistakes. More precisely, a non-profiled CPA attack with an appropriate model retrieves the secret key with less than 200 traces.

When the masked implementation is correctly implemented, potentially, an adversary can retrieve the secret information using an attack of order \(d+1\) (where the attacker considers \(d+1\) targets: the set of \(d\) random mask values and a key-related information). More precisely, the \((d+1)\)-order non-profiled attack combines \(d+1\) points in each trace associated to the mask values (e.g. in a masking scheme of order 1, the adversary combines instants associated to the target value \(\mathrm{HW}(f_{O}(p)\oplus v_0)\) and to \(\mathrm{HW}\left( v_0\right) \) where \(v_0\) represents the mask value). Then, after this combination, a classical non-profiled attack is performed. It turns out that in a \(d\)-order masking scheme and with a correlation power analysis, the combination of \(d+1\) different instants (related to the \(d\) mask values and to the target value) correlates to the targeted sensitive variable but the masking scheme can still affect the success of the attack as the combination does not remove completely the dependence with the mask values [40].

In a secure implementation context, it is necessary that the mask values remain secret. Indeed, once the mask value is revealed or removed, the attacker is able to execute an efficient non-profiled or profiled attack.

In 2008, Schindler [45] extended the stochastic attack to a masking context by taking into account the mask value \(v\) in the deterministic part when targeting the secret information \(y_i=f_{O}(p\oplus v)\), e.g.

$$\begin{aligned} {_{t}^{j}T{}_{i}}&= L(y_i) +\epsilon \end{aligned}$$
(20)
$$\begin{aligned}&= {_{t}h(y_i)} +{_{t}R}\end{aligned}$$
(21)
$$\begin{aligned}&= {\left\{ \begin{array}{ll} {{_{t}c} + \sum _{u} {_{t}\alpha _u} \, \, g_u(f_{O}(p\oplus v))} +{_{t}R}&{} t\in T_1\\ {{_{t}c} + \sum _{u} {_{t}\alpha _u} \, \, g_u\left( v\right) } +{_{t}R} &{} t\in T_2 \end{array}\right. } \end{aligned}$$
(22)

where \(T_1\) and \(T_2\) represent sets of time instants correlated respectively to \(f_{O_i}(p\oplus v)\) and to \(v\). During the attacking step, the adversary replaces the probability \(\hat{{\mathrm {Pr}}}[T \mid y_i;\hat{\theta _i}]\) in (10) with

$$\begin{aligned} {\mathrm {Pr}}\left[ T \mid y_i=f_{O}\left( p\right) \right] = \sum _{v\in {\mathcal V}^d}{\mathrm {Pr}}\left[ T \mid y_i=f_{O}\left( p\oplus v\right) \right] \end{aligned}$$
(23)

where \({\mathrm {Pr}}[T \mid y_i =f_{O}(p\oplus v) ]\) follows the Gaussian distribution \({\mathcal N}({h(y_i)},\Sigma )\). The main advantage of this approach is that we need a smaller set of measurements during the profiling step compared to template attack applied to masking [45].

Oswald et al. [40] evaluated several approaches to attack a masked implementation with a combination between template attack and correlation power analysis. In the same year, Gierlichs et al. [16] extended these practical proposals with a theoretical analysis. The first approach (called Templates Before Preprocessing) uses template attack to extract the values of the estimated leakage information of the \(d+1\) masked information (e.g. \(\mathrm{HW}((f_{O}(p\oplus v))\) and \(\mathrm{HW}\left( v_i\right) \)) before combining them and to apply a correlation power analysis. The second approach (called templates during preprocessing) forces a bias into the mask values by removing traces associated to certain mask values. For this, the template attack extracts mask-related information and keeps a subset of traces associated to a subset of mask values. Then a correlation power analysis on the selected traces reveals the key. The third approach (called templates after preprocessing) uses template attack to extract the unmasked sensitive value (e.g. \(\mathrm{HW}(f_{O}(p))\)) and performs a correlation power analysis on the extracted unmasked sensitive value. The last approach (called template-based DPA) performs a template attack against the masking implementation by replacing \({{\mathrm {Pr}}}[T \mid y_i]\) in Eq. 10 with

$$\begin{aligned} {\hat{{\mathrm {Pr}}}}\left[ T \mid y_i\right] = \sum _{v\in {\mathcal V}^d} {\hat{{\mathrm {Pr}}}}\left[ T \mid y_i \wedge v \right] \times {\hat{{\mathrm {Pr}}}}\left[ v\right] \end{aligned}$$
(24)

As a result, we need \({\mathrm {card}}({\mathcal Y}) \times {\mathrm {card}}({\mathcal V}^d)\) templates (where \({\mathrm {card}}(x)\) represents the cardinality of the set \(x\)), one for each possible combination of \(y_i\) and \(v\).

3 Machine learning approach against masking countermeasure

We propose a new approach that uses a machine learning approach to: (1) bypass the problem of combining masks-related information that still keeps a dependence to mask values (unlike the \(d\)-order non-profiled attack, the templates before preprocessing and the templates after preprocessing); (2) keep all traces in the attacking step (unlike the templates during preprocessing); (3) reduce the number of templates from \(|\mathcal {Q}| \times |{\mathcal V}^d|\) to \(| {\mathcal V}^d|\) (compared to the template-based DPA) leading to several advantages.

From a theoretical point of view, the main issues are that: (i) the number of required data increases with the number of templates (cf. we need one learning set per template that leads to a gigantic workload in the profiling step [45]) and (ii) the imbalanced class problem [24] arises in the Template-based DPA according to the density distribution of \({L}(f_{O}(p))\) (unlike our proposal).

From a practical perspective, in the case of the DPAContest V4, the adversary has no control on the attacked device. As a result, we (empirically) estimated that template-based DPA needs a large number of measurements in the profiling step—at least 40,000 traces each of 435,002 samples, representing more than \(2^{34}\) bytes of information—to have at least one trace per template with probability 0.99 when the Hamming weight leakage model is chosen. For the same problem, our proposal needs at least 200 traces (i.e. a realistic attack scenario). In practice, we need at least 48,698 traces for template-based DPA and at least 35 for our proposal when considering the dataset of the DPAContest V4.

Our approach applies a profiled attack to extract the mask values (during the mask recovery step) before a non-profiled attack that retrieves the secret key (during the key recovery step). Note that this approach is generalizable to the case where a profiled attack is used to extract the secret key. Furthermore, we assume to be in the worst case scenario where the adversary knows the mask values used during the profiling phase. Our requirements are fast-execution, low-memory usage and high success rate (i.e. realistic attack scenarios). Efficient methods to perform profiled attacks have been proposed recently [2, 2123, 27, 28]. These methods use a machine learning model that returns the target value after a learning (profiling) step. Concerning the non-profiled attack, several approaches exist. One of the most efficient methods represents the correlation power analysis that does not require any estimation probability density function. Note that our method can be extended to other (nonlinear) distinguishers.

During the profiling phase, we make three main steps on a controlled device: (1) we collect a set of traces \({\varvec{T}}\) with an oscilloscope; (2) we select \(p\) points \(\{t_1,t_2,\ldots ,t_p\}\) that are significantly correlated to known mask values; (3) we build a profiled model \(\mathrm{A}(T)\) that returns the (combination of) \(d\) estimated mask values based on a trace \(T\) at time \(\{t_1,t_2,\ldots ,t_p\}\). We also build a learning model against the target function \(y_i\) if the adversary uses it during the mask recovery step. During the attacking phase, we then use \(\mathrm{A}(T)\) for each collected traces (on a target device) to take into account the estimated mask values during the mask recovery step. Example 1 illustrates our proposal with a specific case.

Example 1

Suppose that we target a masking scheme of order 1 and that we build one support vector machine [3]—a machine learning model—\(\mathrm{A}(T)\) during the profiling step. Our target value represents the output of a masked nonlinear function \(\mathrm{SBox}(p\oplus v\oplus O)\). The correlation power analysis step returns the key that maximizes

$$\begin{aligned} \hat{O} = \underset{O\in {\mathcal O}}{\mathrm{arg\,max }}\ | \rho (\hat{{\varvec{T}}}\left( O\right) , {\varvec{T}})| \end{aligned}$$
(25)

where

  • \(\rho \) represents a Pearson correlation estimator.

  • \({\varvec{T}}=\left[ {_{t}^{1}T{}},\ldots , {_{t}^{N}T{}}\right] \) represents a list of traces measured at time \(t\)

  • \(\hat{{\varvec{T}}}\left( O\right) = \left[ \mathrm{HW}(f_{O}(p_{[1]}\oplus \mathrm{A}({_{t}^{1}T{}}))),\ldots , \mathrm{HW}(f_{O}(p_{[N]}\right. \left. \oplus \mathrm{A}({_{t}^{N}T{}}))) \right] \) refers to a list of estimated leakages parameterized with an estimated key \(O\) and known plaintexts \(p_{[i]}\) associated to \({\varvec{T}}\).

Figure 1 summarizes our approach. First, during a preliminary phase, we implement the cryptographic device with its countermeasure before collecting a set of traces. Secondly, during the profiling phase, we reduce the number of points per trace with a feature selection algorithm [8] before selecting the best profiled model. In the third phase, we use the selected profiled model with a non-profiled attack or a profiled attack to extract the secret key value. This phase allows to estimate the resistance of cryptographic devices during the post-attacking phase (called the Security Level Estimation in Fig. 1) based for example on the number of traces required to extract the key.

Fig. 1
figure 1

Strategy against a masked implementation

The success of this approach is strictly related to the quality of our profiled model: the lower the error between the correct and the estimated mask values by the profiled model, the higher is the correlation between the real and the predicted traces for the correct key during the attacking phase. As a result, reducing the error of the profiled model leads to reduce the number of traces required to distinguish the correct key from the others. In the ideal case, our model extracts the mask values with probability one and, as a result, correlation power analysis returns the key with the same number of traces as an unprotected implementation during the attacking phase.

Several previous works showed that a machine learning approach improves the success of attacks with respect to template attack [2, 2123, 27, 28]. Therefore, we expected that the machine learning approach induces a reduction of the number of traces during the attacking phase compared to a strategy based on template attack.

In parallel, the quantity of traces used during the profiling step should influence the number of traces required during the attacking phase. Indeed, the variance of models decreases as the training set size increases [20]. Therefore, the higher the number of traces in the learning set, the lower is the error between the correct and the estimated mask values and the lower the number of traces required during the attacking phase. Several experiments have been performed to verify these intuitions.

4 Experiments and discussion

This section regroups all our experiments. In the following, the notation A/B is used in a masking context to denote the configuration with the learner A used in the Mask Recovery step and a profiled attack or a non-profiling attack B used in the Key Recovery step. In an unprotected context, the notation/A indicates a profiled attack or a non-profiling attack used in the Key Recovery step.

4.1 Target implementation

The experiments were carried out on electromagnetic emission leakages that are freely available on the DPAContest V4 website [11] to easily reproduce the results. The target cryptographic device (an Atmel ATMega-163 smart card) implements in software the masked block-cipher AES-256 in encryption mode without any mode of operation. Each trace has \(435{,}002\) samples associated to the same secret key and measured during the first round. The masking scheme is a variant of the “Rotating Sbox Masking” [39]; an additive Boolean masked scheme with masked SBox. According to its authors, it has a low-cost design and keeps performances and complexity close to the unprotected scheme (in a hardware context) while being resistant against several side-channel attacks. The purpose of the DPAContest is to retrieve the first 128 key bits. As we target the first 128 key bits and since the first round of AES-128 and AES-256 is the same, in the following we focus on AES-128.

The “Rotating Sbox Masking”, modified for the DPAContest, uses sixteen public masks \(v=\{v_{0},v_{1},\ldots ,v_{15}\}\). This strategy is based on a low-entropy masking scheme that aims to limit the amount of entropy. At the beginning of each encryption a random secret value (called offset) is drawn randomly between 0 and 15. Then, the \(i\)-th byte of the plaintext is xored with the mask \(v_{i+\mathrm{offset}}\), i.e.

$$\begin{aligned} p_{i} \oplus v_{i+\mathrm{offset}} = s_{i} = \hbox {one byte of the STATE} \end{aligned}$$
(26)

where \(i+\mathrm{offset}\) is computed modulo \(16\). Sixteen customized SBox (called \(\mathrm{MaskedSubBytes}\)) are precomputed such that \(\mathrm{MaskedSubBytes}_i(s_j)=\mathrm{SBox}(s_j\oplus v_i)\oplus v_{i+1},\ i\in \left\{ 1,2,\ldots , 16\right\} \). As a result, the hardware and software designers precompute once the sixteen SBoxes and store them in memory.

After the linear part of AES, an additional operation is processed to ensure that the State input for the next round equals

$$\begin{aligned}&\left( \mathrm{MC}\circ \mathrm{SR}\circ \mathrm{SBox}\circ \left\{ \begin{array}{ll} s_{0}\oplus v_{0+\mathrm{offset}+r}\oplus O_{0}^{r} &{} \ldots \\ s_{1}\oplus v_{1+\mathrm{offset}+r}\oplus O_{1}^{r} &{} \ldots \\ s_{2}\oplus v_{2+\mathrm{offset}+r}\oplus O_{2}^{r} &{} \ldots \\ s_{3}\oplus v_{3+\mathrm{offset}+r}\oplus O_{3}^{r} &{} \ldots \end{array}\right\} \right) \nonumber \\&\quad \oplus \left\{ \begin{array}{lll} v_{0+\mathrm{offset}+r+1} &{} \ldots &{} v_{12+\mathrm{offset}+r+1}\\ v_{1+\mathrm{offset}+r+1} &{} \ldots &{} v_{13+\mathrm{offset+r}+1}\\ v_{2+\mathrm{offset}+r+1} &{} \ldots &{} v_{14+\mathrm{offset+r}+1}\\ v_{3+\mathrm{offset+r}+1} &{} \ldots &{} v_{15+\mathrm{offset}+r+1} \end{array} \right\} \end{aligned}$$
(27)

where \(r\in \{ 0,\ldots ,8\}\) (for AES-128) is the round number, \(O^r_i\) is the \((i+1)\)-th byte of the \(r\)-th subkey while \(\mathrm{SR}\) and \(\mathrm{MC}\) represent respectively the ShiftRow and the MixColumns operations of AES. For the last round (where \(r=9\) for AES-128), the masking scheme insures that its output value equals to

$$\begin{aligned}&\left( \mathrm{SR}\circ \mathrm{SBox}\circ \left\{ \begin{array}{ll} s_{0}\oplus v_{0+\mathrm{offset}+r}\oplus O_{0}^{r} &{} \ldots \\ s_{1}\oplus v_{1+\mathrm{offset}+r}\oplus O_{1}^{r} &{} \ldots \\ s_{2}\oplus v_{2+\mathrm{offset}+r}\oplus O_{2}^{r} &{} \ldots \\ s_{3}\oplus v_{3+\mathrm{offset}+r}\oplus O_{3}^{r} &{} \ldots \end{array}\right\} \right) \nonumber \\&\quad \oplus \left\{ \begin{array}{lll} v_{0+\mathrm{offset}+r+1} &{} \ldots &{} v_{12+\mathrm{offset}+r+1}\\ v_{1+\mathrm{offset}+r+1} &{} \ldots &{} v_{13+\mathrm{offset+r}+1}\\ v_{2+\mathrm{offset}+r+1} &{} \ldots &{} v_{14+\mathrm{offset+r}+1}\\ v_{3+\mathrm{offset+r}+1} &{} \ldots &{} v_{15+\mathrm{offset}+r+1}\end{array}\right\} \end{aligned}$$
(28)

Finally, the last mask values are removed with a xor operation.

The DPAContest V4 uses a SASEBO-W platform that contains an ATMega163 8-bit smart card. The smart card is powered at 2.5V and clocked a 3.57 MHz. A LeCroy wave-runner 6100A oscilloscope was used. The bandwidth is 200 MHz and the sampling rate equals 500 MS/s. We refer to [11, 39] for additional information on this masking scheme and on the acquisitions setup.

4.2 Experimental results

For the sake of fairness, we compared different attacks based on the same target value and the same dataset: each attack extracts first the offset value before applying a key recovery step. Note that an adversary targeting the offset or the mask value leads to the same result in our case: the (Pearson) correlation between them equals one. We suggest to target the mask value when the setting differs.

All our experiments were executed on a MacBook Pro with 2.66 GHz Intel Core 2 Duo, 8 GB 1,067 MHz DDR3 with the tenth major release of OS X (i.e. OS X Mavericks version 10.9). The attack process lasted roughly 3 weeks without considering the step of collecting the traces.

4.2.1 Finding the offset value on traces

Before proceeding with the quantitative analysis, we report here a preliminary visualization phase that allowed us to find the points that are the highest correlated with the secret offset. For the sake of time and memory (and due to the big data context), we computed the efficient Pearson correlation between each instant of 1,500 traces and the offset values (see Fig. 2). However, we suggest to test several other methods in a non-big data scenario (or when the adversary has enough time, memory and money) such as strategies based on mutual information (e.g. minimum redundancy maximum relevance [42]), the sum of squared pairwise T-differences filter [17] and the principal component analysis [41]. It is worth emphasizing that several instants are (significantly) correlated with the target value. Visualization suggests that apart from the central part of traces there is a high amount of information about the offset value available in each trace. As a consequence, we should expect that the profiled model would output the right offset value with a high probability.

Fig. 2
figure 2

Correlation between offset and power values at each time in the first round of the masked AES

4.2.2 Model selection

This section assesses and compares several classifiers that extract the secret offset value. We considered five different types of multi-class classification models: support vector machine (SVM), random forest (RF), template attack (TA), stochastic attack (SA) and multivariate regression analysis (MRA). We used two disjoint sets: a learning set of 1,500 traces to estimate the parameters of each model and a validation set of 1,500 traces to measure their success rate in predicting the right offset value. During the feature selection step, in each trace, we selected \(50\) instants that are the highest linearly correlated with the offset value.Footnote 1 We did not considered other feature selection methods (such as “Principal Component Analysis” [41] or “minimum Redundancy Maximum Relevance” [42]) due to their massive memory requirements or time consuming while our dataset contained \(1{,}500\times 435{,}002 \mathrm{bytes}>2^{29}\) bytes.Footnote 2 We refer to [32] for an analysis of practical issues that a security evaluator faces when performing side-channel attacks. In spite of the low feature selection complexity, we observed a high success rate of the models.

Figures 3, 4, 5, 6 and 7 report the success rate to predict the right offset value as a function of the number of points (that were selected from the sorted \(50\) instants) used in each trace for respectively support vector machine, random forest, template attack, stochastic attack (with degree 1) and multivariate regression analysis. We can deduce the following observations. First, as expected, the higher the number of traces in the learning set (from 25 to \(100~\%\) of 1,500 traces), the higher is the accuracy, with an exception for multivariate regression analysis. This can be explained by the fact that the low complexity of the multivariate regression analysis (low variance) requires a small learning set to reach its best result. Secondly, the number of selected points in each trace influences the success rate: the higher the number of features, the higher is the success rates for support vector machine, random forest, stochastic attack and multivariate regression analysis. It is interesting to remark that in a small learning set setting (i.e. less than 75 % of the entire learning set) the template attack reduces its success rate when the number of features goes beyond a certain size. This is presumably due to the ill-conditioning of the covariance matrix when the number of features is too large.

Fig. 3
figure 3

Support vector machine

Fig. 4
figure 4

Random forest

Fig. 5
figure 5

Template attack

Fig. 6
figure 6

Stochastic attack with degree 1

Fig. 7
figure 7

Multivariate regression analysis

Figure 8 shows what happens when we build a stochastic attack with different degrees (from 1 through to 4). In our setting, the linear stochastic attack reaches similar results as the nonlinear stochastic attack. This observation is not surprising since, for computational reasons, we applied a feature selection that searches for linear dependencies. As a result, we consider the linear model in the following. However, the results may change if the size of the profiling set increases as the nonlinear stochastic attack (that has more parameters to estimate than the linear approach) requires more traces to estimate its parameters.

Fig. 8
figure 8

Stochastic attack with different degrees (from 1 to 4) using 25 % (a) and 100 % (b) of 1,500 traces in the learning set

Figure 9 combines the success rate of a random model (i.e. \(\frac{1}{16}\)) and the five previous models (i.e. support vector machine, random forest, template attack, linear stochastic attack and multivariate regression analysis) by choosing the best size for the learning set that is \(100\) % of 1,500 traces. In our setting, multivariate regression analysis has the worst success rate. Therefore, we do not consider this model in the following of our experiments. The success rates of support vector machine, random forest and stochastic attack are similar and greater than the success rate of template attack. We could argue that template attack requires a larger learning set. Figures 10 and 11 highlight the impact on the success rate of respectively template attack and support vector machine by increasing the size of the learning set until 8,500 traces.Footnote 3 The success rate of template attack remains lower than support vector machine.

Fig. 9
figure 9

SVM vs RF vs TA vs SA vs MRA vs random

Fig. 10
figure 10

Template attack

Fig. 11
figure 11

Support vector machine

Note that we did not select the best meta-parameter values for support vector machine and random forest (such as the number of trees in the random forest) but only the best number of features (from \(2\) to \(50\)) to predict the target value. The default values of the implementation of support vector machine [9] and random forest [31] were used.Footnote 4 As a consequence, we do not claim that the Support Vector Machine configurations and the random forest configurations are necessarily the best one for profiled attack for our experiments. However, our experiments show that a profiled attack based on a machine learning model extracts more information on the offset value than a strategy based on template attack for the presented task.

Based on the above considerations, and to choose the best learning model, we looked at the learning time and the prediction time of the offset, based on one trace, as a function of the number of selected points (see Figs. 12, 13). Template attack has the lowest learning time while its prediction time increases exponentially in the number of selected features. Support vector machine has a lower learning time and a reasonable prediction time compared to random forest. As a result, in the attacking step, we use only a support vector machine as the machine learning model. We do not report the results for stochastic attack or for multivariate regression analysis as we used unoptimized and nonpublic implementations. According to the previous results, we selected 50 features for support vector machine, template attack and stochastic attack (of degree 1) leading to a success rate of respectively \(0.88\), \(0.66\) and \(0.90\).

Fig. 12
figure 12

Time to process a trace in the learning step (in ms) per number of feature selected

Fig. 13
figure 13

Time to process a trace in the attacking step (in ms) per number of feature selected

4.2.3 Key recovery step

During the attacking step we considered four settings targeting the Hamming weight of the MaskedSubBytes. In the first setting, the correlation power analysis extracts the secret key on an unmasked implementation (i.e. the non-profiled attack always receives the correct offset value). The second setting targets the masked implementation where a support vector machine extracts the mask value and where a correlation power analysis searches the secret key. In the third and fourth experiments, we exchanged the support vector machine by respectively the template attack and the stochastic attack. We repeated ten times each setting with a different set of traces during the attacking phase while the learning set remains the same.

Figure 15 summarizes the number of key bytes found as a function of the number of traces used (in average) during the non-profiling phase for each setting. We found the key with \(16.3\) traces (with less than 5 s of execution time) for the unmasked implementation. For the masked implementation, we extracted the key with \(26\) traces (with less than 20 s of execution time) using the support vector machine, with \(27.8\) traces (with less than 80 s of execution time) using the stochastic attack and with \(56.4\) traces (with less than 45 s of execution time) using the template attack. Figure 16 shows the minimum, the maximum and the average number of traces used to find the key. Compared to each strategy applied on the protected device, the support vector machine (combined with the correlation power analysis) leads to the closest results to an unprotected configuration.

For the sake of completeness, we also implemented the state-of-the-art stochastic attack on the masking scheme without a non-profiling step as proposed by Schindler [45] (see Eq. 20). Figure 14 shows the number of traces needed on a validation set in function of the number of features used. The stochastic attack needs more than 40 traces to find the key when the model considers more than 5 features and it reaches the minimum with 3 features. According to Fig. 15, stochastic attack needs 107 traces in average to extract the 16 key bytes on a testing set (with less than 180 s of execution time) (Fig. 16).

Fig. 14
figure 14

Number of traces during the attacking phase in function of the number of features

Fig. 15
figure 15

Comparison of attacks against unprotected and protected AES

Fig. 16
figure 16

Minimum, maximum and average number of traces used by each attack to find the key

4.2.4 Attacking step with a multivariate key recovery step

To improve our best attack (i.e. the combination of support vector machine and correlation power analysis), we have studied how to minimize the number of traces used during the key recovery step by mixing \(n\) sample points. As discussed in Sect. 2.1.1, we can either (a) apply a strategy (based on a non-profiling attack or a profiling attack) to a combination of power leakages, or (b) apply a non-profiling attack to multiple sample points independently and then combine their results. This section considers both methods.

We tested three combination functions which we note as \(\mathrm{COMB}\): the first combines the result of each correlation power analysis with the mean function, the second uses the weighted mean while the last uses the max function. Mathematically, the attack returns the target value \(\hat{O}\) that maximizes

$$\begin{aligned} \hat{O} = \underset{O\in {\mathcal O}}{\mathrm{arg\,max }}\ \mathrm{COMB}\left( | D_0 |,\ldots , | D_n | \right) \end{aligned}$$
(29)

where \(D_i\) denotes a distinguisher applied on the set of traces measured at time \(i\). We define the mean function as

$$\begin{aligned} \mathrm{COMB}\left( | D_0 |,\ldots , | D_n | \right) = \frac{1}{n}\sum _i^n | D_i | \end{aligned}$$
(30)

The weighted mean function equals to

$$\begin{aligned} \mathrm{COMB}\left( | D_0 |,\ldots , | D_n | \right) = \frac{1}{n}\sum _i^n \rho _i | D_i | \end{aligned}$$
(31)

where \(\rho _i\) represents the Pearson correlation between the instant \(i\) on the traces and the \(i\)-th target value. Finally, the max function represents

$$\begin{aligned} \mathrm{COMB}\left( | D_0 |,\ldots , | D_n | \right) = \max _{i\in \{1,2,\ldots ,n\}} | D_i | \end{aligned}$$
(32)

We varied the number of sample points between \(1\) and \(40\) to select the best value. Figures 17, 18 and 19 report the minimum, the average and the maximum number of traces used to extract the secret key in function of the number of samples combined respectively with the max, the mean and the weighted mean function. We reach the minimum of \(23\) traces in average with \(10\) samples when considering the max function. We improve this result with the mean and the weighted mean function: \(20.9\) traces in average with \(40\) samples.

Fig. 17
figure 17

SVM/multivariate CPA with the max function

Fig. 18
figure 18

SVM/multivariate CPA with the mean function

Fig. 19
figure 19

SVM/multivariate CPA with the weighted mean function

The second approach combines first the points before to apply a key recovery attack. In practice, we used a support vector machine that learns the dependence between the power leakage (of different instants) and the target value that is the Hamming weight of the output of a MaskedSubBytes. Each support vector machine was build with 1,500 traces and tested with a validation set of 1,500 traces. We selected the best number of features between \(2\) and \(40\). Figure 20 shows the success rate of each support vector machine in function of the number of features used. In the best setting, the worst Support Vector Machine has a success rate of 0.85 while the best reaches a success rate of 0.95. After the selection of the best configuration, we used 3,000 traces (that were used previously) in the learning set to attack each MaskedSubBytes. Figure 16 shows the minimum, the average and the maximum number of traces when we use an \(/\)SVM (when there are no protection) and an SVM/SVM (when there is a protection). The \(/\)SVM requires on average 6.1 traces while the SVM/SVM needs only slightly more, on average 7.8 traces. However, the SVM/SVM requires a longer execution time than SVM/CPA: less than 130 s. Table 1 resumes the results of strategies.

Fig. 20
figure 20

SVM against the HW of the output of the MaskedSubBytes

Table 1 Summary results of strategies

4.3 Discussions

The experimental results of the previous sections suggest some considerations. First, we have demonstrated that the masking scheme proposed by the DPAContest V4 can be practically attacked with a combination between profiled and non-profiled attacks. Our strategy represents a combination between a Support Vector Machine and a correlation power analysis or another set of machine learning models. The SVM/CPA requires \(26\) traces during the attacking step to extract the key of the implementation of a masked AES-128. In comparison, a correlation power analysis against an unmasked implementation required in average \(16.3\) traces. We improved our results using a support vector machine during the key recovery step (7.8 traces in average in order to find the key) but with the cost of a longer execution time. This result equals roughly to an unprotected context targeted by a support vector machine.

The support vector machine succeeds to extract information on the offset because the cryptographic device chooses different operations in function of this value (e.g. the choice of the masked SBox). Furthermore, the success of the attack is related to the implementation: the device manipulates the 16 state bytes sequentially while they can be manipulated in parallel on a FPGA. Moreover, the cryptographic device selects randomly only one offset during whole of the encryption. As a result, many points in a trace relate to the chosen offset.

The attack should be improved by increasing the number of points selected in each trace. Indeed, Fig. 3 shows that the maximum value of the success rate is still not reached by the support vector machine targeting the offset value. However, Fig. 12 shows that the learning step time increases linearly with the number of points selected in each trace. As a result, there is a trade off to be made between the accuracy of the model and its learning speed.

The major consideration concerns accuracy since the experimental results show that in several settings, when targeting the offset value, machine learning improves the success of attacks with respect to a strategy based on template attack or to the state-of-the-art stochastic attack. More precisely, a machine learning model needs four times less traces than the state-of-the-art stochastic attack on masking scheme and two times less traces than a strategy based on template attack. A new strategy based on stochastic attack becomes very competitive (in term of data complexity during the attacking phase) as the machine learning model but with a longer execution time than the support vector machine.

A possible justification of the superiority of machine learning models derives from the results of the Shapiro–Wilk multivariate Gaussianity test (SW test) [13] that we carried out on the 1,500 traces used during the learning stepFootnote 5 by template attack. The SW test was applied on several dimensions and against each offset value. Figure 21 displays several box plots that summarize the \(p\)-values of SW tests by taking into account from 2 to 50 dimensions (the same as for the previous experiments) against each offset value. In agreement with our results, SW test rejected the hypothesis of Gaussianity in \(90.56\) % of multivariate configurations. The Mardia’s test [34], another multivariate Gaussianity test, confirms the previous results: \(96.17~\%\) of configurations are rejected by the test. This test explains why template attack has a lower success rate than the other non-parametric models: a large majority of traces follow another unknown density probability function. As a result, strategies based on template attack (e.g. templates during preprocessing) should be less efficient than a non-parametric machine learning model.

Fig. 21
figure 21

Boxplots of Shapiro–Wilk multivariate Gaussianity test. In each box plot, the central bar corresponds to the median, the hinges to the first and third quartiles, and the whisker represents the greatest/lowest value excluding outliers. A \(p\)-value is considered as an outlier when its value is more than \(\frac{3}{2}\) times of upper/lower quartile

We also deem that another possible justification of the strength of machine learning models relates to the number of parameters to estimate (since a higher number of parameters require a larger learning set). An interesting future work will focus on the number of parameters of support vector machine compared to template attack (that depends on the number of features) and to stochastic attack (that depends on the number of features as well as the degree of the regression model). It is worth to note that “how good is my profile” is still an open question [12].

5 Conclusion and perspectives

In this paper, we introduced an efficient machine learning approach to evaluate the security level of a masked implementation of AES. Specifically, we extended the results of previous related works to protected devices [2, 2123, 2729]. The machine learning approach against a masked cryptographic algorithm consists in attacking first the mask with a machine learning model (i.e. a profiled attack) before targeting the secret key with a non-profiled attack or profiled attack.

We showed that the multivariate regression attack, the stochastic attack or a strategy based on template attack overestimates the security level of protected device while the machine learning approach improves significantly this estimation. The main reason of the superiority of machine learning with respect to template attack relates to the rejection of the multivariate Gaussianity tests that reject the hypothesis that the traces follow a Gaussian distribution in a high number of configurations. Therefore, a machine learning model extracts more information on the secret information (than template attack) by analyzing the same leakage information.

The complexity of the key recovery step mainly depends on the quality of the profiled model. The higher the success to retrieve the mask, the lower is the number of traces during the attacking phase. As a result, compared to a template attack, a learning model improves the probability to find the true mask value from \(0.66\) to \(0.88\) with a consequent reduction of the average number of traces during the attacking phase from \(56.4\) to \(26\) when considering a correlation power analysis during the key recovery step. Regarding the state-of-the-art stochastic attack, the learning model divides the number of traces during the attacking phase by four. However, a new strategy based on stochastic attack reduces this number to \(27.8\) traces (in average) when considering a correlation power analysis during the key recovery step. In our context, the main advantage of a machine learning approach represents its speed: 80 s of execution time for a strategy based on stochastic attack while the machine learning model requires four times less. In comparison, a non-profiled attack against an unmasked implementation needs \(17\) traces with 5 s of execution time on the same cryptographic device. Therefore, the masked implementation increases the data complexity of the attack by two and the time complexity by four. We reduced significantly the data complexity when considering a machine learning model during the key recovery step: 7.8 traces in average to retrieve the secret key. This small number is roughly the same when the cryptographic device has no protection and targeted by a profiled attack.

The quality of the profiled attacks mainly depends on the number of points selected on the traces. A robust feature selection method allowed to reach a high success rate to find the mask value and the output of each sensitive information by the profiled model.

Interesting and as expected, the number of traces in the learning set of the machine learning model influences the result of the learning model targeting the mask values (the higher the number of traces in the learning set, the better). This is due to a reduction of the variance of the model.

The results of the DPAContest v4 validate the performance of our attacks. According to the DPAContest, the SVM/CPA approach requires \(22\) traces to find the correct key while the SVM/SVM approach needs \(7\) traces. As a result, in both cases, the hold-out validation method underestimates the strength of the attacks in the (private) dataset of the DPAContest.

From an applied perspective, the machine learning approach represents an effective, efficiency and automatic black-box testing methodology capable of assessing the vulnerabilities of cryptographic hardware in the worst case scenario. The rationale is that data acquisition from a cryptographic device is expensive in terms of time and storage while the penetration tests have to be performed in a very short time (e.g. few weeks). As a result, strategies that reduce these constraints are preferable.

About the interpretability issue, machine learning provides a methodology to obtain black-box tools to evaluate the security level of a cryptographic device. Given its black-box nature it is not easy to deduce why an attack works better than another with a specific dataset. However, if the understandability of the attack is considered as more valuable than the accuracy of the results then other machine learning models and more white-box approaches should be pursued (such as decision trees). Furthermore, it would be interesting to compare the resulting accuracy with black-box approaches.

In light of our analysis, we believe that our work opens up new avenues for interesting further research works. Among them, we will consider other multivariate (profiled and non-profiled) attacks. Furthermore, experiments must be performed on different public datasets of masking or hiding implementations that will be available in the DPAContest V4. More precisely, the DPAContest V4.2 will provide traces collected on a software implementation of AES protected with an improved masking scheme to thwart collision attack as well as second-order CPA. On the other hand, the DPAContest V4.3 will provide traces collected on a hardware implementation of a masked AES where all the 16 Sboxes are processed in parallel. Another interesting possible future work concerns the scenario: “What would have been the results in a not worst-case scenario where, for example, the machine learning approach targets an information related partially to the mask values?” If such experiments confirm the above results, then there are important implications. Strategies based on template attack or stochastic attack against countermeasures scheme may be shown to be less suitable for security level estimation in the worst case scenario compared to a machine learning approach.