1 Introduction

Side-channel analysis (SCA) is a class of cryptanalytic attacks that exploit the dependency between the execution of a cryptosystem implementation and the manipulated data (e.g. the power consumption or the timing) to recover some leakage about its secrets. It is often more efficient than a cryptanalysis mounted in the so-called black-box model where no leakage occurs. In particular, continuous side-channel attacks in which the adversary gets information at each invocation of the cryptosystem are especially threatening. Common attacks as those exploiting the running-time, the power consumption or the electromagnetic radiations of a cryptographic computation fall into this class. Many implementations of block ciphers have been practically broken by continuous side-channel analysis (see for instance [11, 34, 45, 50]) and securing them has been a long-standing issue for the embedded systems industry.

Side-channel attacks exploit information which leak from the physical implementations of cryptographic algorithms. Since this leakage (e.g. the power consumption or the electromagnetic emanations) depends on some small part of the internally used secret key, the adversary may perform an efficient key-recovery attack to reveal this sensitive data. Among SCA, the family of so-called profiling attacks is actually the most powerful one since the underlying adversary is assumed to priorly use an open copy of the final target to precisely tune all the parameters of the attack. It includes templates attacks [14] and Stochastic modelling (a.k.a. linear regression analysis) [17, 58, 59]. This attack strategy where the adversary precedes the attack by a supervised training phase may be viewed as a classical Machine Learning problem and a recent line of works has started to build connections between the world of side-channel analysis and the world of machine learning (with a particular focus on deep learning).

1.1 Related works

Several works have investigated the application of Machine Learning (ML) to defeat both unprotected [5, 29, 30, 41, 43] and protected cryptographic implementations [21, 42]. These contributions focus mainly on two techniques: the Support Vector Machine (SVM) [16, 64] and Random Forest (RF) [56]. Practical results on several datasets have demonstrated the ability of these attacks to perform successful key recoveries. Besides, authors in [29] have shown that the SVM-based attack outperforms the template attack when applied on highly noisy traces while [43] has experimentally argued that ML (and RF in particular) become(s) interesting if the amount of observations available for profiling is small while the dimension of the latter observations is high. Following the current trend in the Machine Learning area, more recent works have started to pay attention to deep learning (DL) algorithms like multi-layer perceptron networks (MLP) [46,47,48] or convolutional neural networks (CNN) [13, 44]. Essentially, the conclusion of theses works is that DL offers a promising alternative to the state-of-the-art attacks, especially when the measurement dimension is high and/or the measured signal suffers from deformation like jittering (for CNN). However, the application context of these previous works is too weak or the amount of information on the involved deep learning techniques is too limited to draw solid conclusions for challenging contexts (e.g. where the target implementations include state-of-the-art countermeasures) and/or to improve these first attempts. Indeed, on one side, the series of papers [46,47,48] give partial information about the training and architecture of the deep learning models but it is limited to a comparison of MLP with other more classical approaches in the context of unprotected implementations of the AES-128 algorithm running on an old PIC 8-bit micro-controller. On the other side, papers [44] and [13] show that the Deep Learning approach may be applied to defeat classical countermeasures such as masking [44] or clock jittering [13] but they say nothing for the application of DL when both countermeasures are combined and, more importantly, they give very limited information about the training and the models architectures. This last point is an important limitation which does not allow for the reproducibility of the analyses and hence hampers the development of deep learning in the embedded security community. More generally, a common framework to study and compare the effectiveness of Machine Learning methods against secure embedded implementations of cryptographic algorithms is today missing.

1.2 Contributions

In this paper, our main objective is to conduct an in-depth study of the application of deep learning theory in the context of side-channel attacks. In particular, for the first time, we discuss several parametrization options and we present a large variety of benchmarks which have been used to either experimentally validate our choices or to help us to take the adequate decision.Footnote 1 The methodologies followed for the hyper-parameters’ selection may be viewed as a proposal to help researchers to make their own choice for the design of new deep learning models. For most of the final choices made for the configuration of our models, we were not able to get a formal explanation and hence we of course do not claim that they are optimal. Since the current state of machine learning theory does not yet provide clear foundations to conduct such analyses, we think that having methodologies (even ad hoc) is a first necessary step which opens the way for further research in this domain. Our study also shows that convolutional neural networks are almost as efficient as multi-layer perceptron networks in the context of perfectly synchronized observations, and outperform them in presence of desynchronization/jittering. This suggests that CNN models should be preferred in the context of SCA (even if they are more difficult to train). When it comes to choose a base architecture for the latter models, our study shows that, surprisingly, the 16-layer network VGG-16 used by the VGG team in the ILSVRC-2014 competition [60] is a sound starting point (other public models like ResNet-50 [28] or Inception-v3 [63] are shown to be inefficient against masked implementations). It allows us to design architectures which, after training, are better than classical templates attacks even when combined with dimension reduction techniques like principal component analysis (PCA) [53]. By training (with 75 epochs) our \(\hbox {CNN}_{\mathrm{best}}\) architecture on a subset of 50,000 700-dimensional traces of ASCAD database, we outperformed the other tested models on highly desynchronized traces while we achieved one of the best performances on small desynchronized traces. Clearly, our analysis does not enable to draw strong conclusions on the optimization and selection of optimal networks in the context of side-channel analysis. It however shows the impact of each hyper-parameter on the model soundness.

All the benchmarkings have been done with the same target (and database) which corresponds to an AES implementation secured against first-order side-channel attacks and developed in assembly for an ATMega8515 component.Footnote 2 A signal-to-noise characterization has been done to validate that there is no first-order leakage. This project has been published in [4]. To enable perfect reproducing of our experiments and benchmarks, we also chose to publish the electromagnetic measurements acquired during the processing of our target AES implementation (available in [3]) together with example Python scripts to launch some initial training and attacks based on these traces. We think that this database may serve as a common basis for researchers willing to compare their new architectures or their improvements in existing models.

2 Preliminaries and theoretical foundations

2.1 Notations

Throughout this paper, we use calligraphic letters as \(\mathcal {X}\) to denote sets, the corresponding upper-case letter X to denote random variables (random vectors \(\mathbf {X}\) if with an arrow) over \(\mathcal {X}\), and the corresponding lower-case letter x (resp. \(\mathbf {x}\) for vectors) to denote realizations of X (resp. \(\mathbf {X}\)). Matrices will be denoted with bold capital letters. The i-th entry of a vector \(\mathbf {x}\) is denoted by \(\mathbf {x}[i]\), while the i-th observation of a random variable X is denoted by \(x_i\). The probability mass function (aka the probability distribution function, pdf for short) of a discrete random variable \(\mathbf {X}\) will be denoted by \(f_{\mathbf {X}}\). It is defined for any possible realization \(\mathbf {x}\) of \(\mathbf {X}\) by \(f_{\mathbf {X}}(\mathbf {x}) = \mathsf {P}[\mathbf {X} = \mathbf {x} ]\). The symbol \(\mathsf {E}_{}[\) ] denotes the expected value, and might be subscripted by a random variable \(\mathsf {E}_{X}[\) ], or by a probability distribution \(\mathsf {E}_{f_{X}}[\) ], to specify under which probability distribution it is computed. Side-channel traces will be viewed as discrete realizations of a random column vector \(\mathbf {L}\) with values in \([0,2^\omega -1]^D\) where \(D\) denotes the trace size (or dimension) and where \(\omega \) depends on the vertical resolution of the oscilloscope used for the acquisitions (usually, we have \(\omega \in \{8,10,12\}\)). During their acquisition, a target sensitive variable \(Z= \varphi (P,K)\) is handled, where \(P\) denotes some public variable, e.g. a plaintext chunk, and \(K\) the part of a secret key the attacker aims to retrieve. The value assumed by such a variable is viewed as a realization \(z\) of a discrete finite random variable \(Z\) defined over \(\mathcal {Z}\) (e.g. \(\mathcal {Z}= \{0,\ldots ,255\}\)).

2.2 Side-channel analysis

Side-channel analysis (SCA) aims at exploiting noisy observations \(\mathbf {L}\) of the processing of an algorithm to recover its secret parameter. When the SCA adversary has the ability to use an open device (i.e. a device on which he can control, at least partially, all the inputs of the algorithm, including the secret parameters), a particular class of attacks named profiling may be executed.

2.2.1 Profiling SCA

A profiling SCA is composed of two phases: a profiling (or characterization, or training) phase, and an attack (or matching) phase.

During the profiling step, the attacker computes for every possible \(k\in \mathcal {K}\) an estimation \(\hat{\text {g}}_{k}\) of the following conditional probability distribution function:

$$\begin{aligned} g_{k} : (\varvec{\ell }_{},p) \mapsto \mathsf {P}[\mathbf {L}= \varvec{\ell }_{}| (P,K)=(p,k) ]. \end{aligned}$$
(1)

The estimation \(\hat{\text {g}}_{k}\) is deduced from on a so-called profiling set which is denoted by \(\mathcal {D}_{\text {profiling}}\) and satisfies \(\mathcal {D}_{\text {profiling}}\doteq \{\varvec{\ell }_{i}, p_i,k_i\}_{i=1,\dots ,N_p}\) where \(N_p\) denotes the number of traces \(\varvec{\ell }_{i}\) acquired under known guessable chunks \(p_i\) and \(k_i\) of the cryptographic algorithm inputs. In the rest of this paper, the profiling set is viewed as the Cartesian product between the set a traces \(\mathbf{\mathcal {L}}\) and the set of corresponding inputs \({\mathcal {Y}}\).

Remark 1

In the context of side-channel analysis against block cipher implementations, it is common to label the observations/traces by an appropriate function \(\varphi (p,k)\) instead of \((p,k)\) (and both labellings are equivalent when the observations exactly correspond to the processing of \(\varphi (\cdot )\) since \(p\) is assumed to be known). It will be the case for the database used in the rest of the paper where \(\varphi (\cdot )\) corresponds to the AES sbox. Here the probability in the right-hand side of (1) may w.l.g. be rewritten \(\mathsf {P}[\mathbf {L}= \varvec{\ell }_{}|\varphi (P,K)=z ]\) and the profiling set may be rewritten \(\mathcal {D}_{\text {profiling}}\doteq \{\varvec{\ell }_{i}, z_i\}_{i=1,\dots ,N_p}\) with \(\mathbf{\mathcal {L}}= \{\varvec{\ell }_{i}; i\leqslant N_p\}\) and \({\mathcal {Y}}= \{z_i;i\leqslant N_p\}\).

During the attack step, the adversary gets a new attack set\(\mathcal {D}_{\text {attack}}\doteq \{\varvec{\ell }_{i}, p_i\}_{i=1,\dots ,N_a}\) for which the secret parameter, say \(k^{\star }\), is fixed but unknown. His goal is to recover the latter key. For such a purpose, it must be decided which of the pdf estimations \(\hat{\text {g}}_{k}\), \(k\in \mathcal {K}\), is the most likely knowing the attack set. It is well known that, under realistic assumptions on the distributions’ nature, the most efficient way to make such a decision is to follow a Maximum Likelihood strategy which amounts to estimate the following likelihood\(\mathbf {\mathsf {d}}_{N_a}[k]\) for every key candidate \(k\in \mathcal {K}\), and then to select the key candidate that maximizes it:

$$\begin{aligned} \mathbf {\mathsf {d}}_{N_a}[k] = \prod _{i=1}^{N_a} \frac{\mathsf {P}[\mathbf {L}=\varvec{\ell }_{i}\mid (P,K)= (p_i,k) ] }{f_{\mathbf {L}}(\varvec{\ell }_{i})}\times f_{P,K}(p_i,k),\nonumber \\ \end{aligned}$$
(2)

which is obtained under the hypothesis that acquisitions are independent.Footnote 3 The estimation of the right-hand term of (2) is simply done by replacing the probabilities \(\mathsf {P}[\mathbf {L}=\varvec{\ell }_{i}| (P,K)= (p_i,k) ]\) by their estimations \(\hat{\text {g}}_{k}(\varvec{\ell }_{i},p_i)\). The vector \(\mathbf {\mathsf {d}}_{N_a}\) is called scores vector and its kth coordinate is the score corresponding to key candidate \(k\).

2.2.2 Leakage dimensionality issue

The potentially huge dimensionality of \(\mathbf {L}\) may make the estimation of (1) a very complex problem. To circumvent it, the adversary usually priorly exploits some statistical tests (e.g. SNR or T-Test) and/or dimensionality reduction techniques (e.g. principal component analysis [53], linear discriminant analysis [19], kernel discriminant analysis [12]) to select points of interest or an opportune combination of them. Then, denoting \(\varepsilon (\mathbf {L})\) the result of such a dimensionality reduction, the attack is performed as described previously with the simple difference that \(\mathbf {L}\) and \(\varvec{\ell }_{i}\) are respectively replaced by \(\varepsilon (\mathbf {L})\) and \(\varepsilon (\varvec{\ell }_{i})\) in (1) and (2).

2.2.3 (Gaussian) Template attacks

Until now the most popular way adopted to estimate the conditional probability (1) is the one that led to the well-established Gaussian template attack [14]. It assumes that \(\mathbf {L}\mid (P,K)\) (or equivalently \(\varepsilon (\mathbf {L})\mid (P,K)\) if a dimensionality reduction has been priorly applied) has a multivariate Gaussian distribution, and estimates the mean vector \(\mathbf {\mu }_{p,k}\) and the covariance matrix \(\mathbf{\Sigma }_{p,k}\) for each possible \((p,k)\in \mathcal {P}\times \mathcal {K}\) (i.e. the so-called templates). In this way, for every \((p,k)\), the pdf \(\varvec{\ell }_{} \mapsto g_{k}(\varvec{\ell }_{},p)\) defined in (1) is approximated by the Gaussian pdf \(f_{\mathbf {\mu }_{p,k},\mathbf{\Sigma }_{p,k}}\). So, the Gaussian template attack is a strategy that makes use of a generative model. The same multivariate Gaussian assumption is the one that is made in quadratic discriminant analysis (QDA), which is a well-known generative strategy in the Machine Learning literature [19] to perform classification.

2.3 Machine learning and deep learning

In machine learning theory, the problem of estimating \(\mathsf {P}[\mathbf {L}\mid (P,K)=(p,k) ]\), for some \((p,k)\in {\mathcal {Y}}\) with \({\mathcal {Y}}\doteq \mathcal {P}\times \mathcal {K}\), is known as a generation problemFootnote 4 (a.k.a. prediction problem), while the estimation of \(\mathsf {P}[ (P,K)=(p,k) \mid \mathbf {L}]\) is referred to as a classification problem (see e.g. [9]).

Nowadays, deep neural networks are the privileged tool to address the classification problem, and they can be exploited in a discriminative way. In such a case, which corresponds to our choice, they aim at directly constructing an approximation \(\hat{\mathbf {\text {g}}}_{}\) of the function \(\mathbf {\text {g}}_{}: \varvec{\ell }_{},p\mapsto (\mathsf {P}[(P,K)=(p,k) \mid \mathbf {L}= \varvec{\ell }_{} ])_{k\in \mathcal {K}}\). It may be observed that \(\mathbf {\text {g}}_{}\) is directly linked to the pdfs in (1) through the following relation which comes as a direct consequence of Bayes Theorem:

$$\begin{aligned} \mathbf {\text {g}}_{}(\varvec{\ell }_{},p) = \big (g_{k}(\varvec{\ell }_{},p)\times f_{P,K}(p,k)/f_{\mathbf {L}}(\varvec{\ell }_{})\big )_{k\in \mathcal {K}}. \end{aligned}$$
(3)

The classification of a new leakage \(\varvec{\ell }_{}\) observed for an input \(p\) is done afterwards by processing \(\mathbf {y}=\hat{\mathbf {\text {g}}}_{}(\varvec{\ell }_{},p)\) and by choosing the key candidate \({\hat{k}}\) (or equivalently the label in the formalism of Machine Learning) such that \({\hat{k}}= \mathrm {argmax}_{k\in \mathcal {K}}~\mathbf {y}[k]\). If the key-discrimination is done from several, say \(N_a\), pairs \((\varvec{\ell }_{i},p_i)\) then the maximum likelihood approach is followed as in (2):

$$\begin{aligned} \mathbf {\mathsf {d}}_{N_a}[k] = \prod _{i=1}^{N_a}\mathbf {y}_i[k], \end{aligned}$$
(4)

where \(\mathbf {y}_i\) denotes the output of (the model function) \(\hat{\mathbf {\text {g}}}_{}\) input with the pair \((\varvec{\ell }_{i},p_i)\). It may be checked that (4) is a simple rewriting of (2) obtained by using (3).

Remark 2

In cases where the SCA targets a particular processing in the form \(\varphi (P,K)\), this leads to search for an approximation \({\hat{\mathbf {\text {g}}'}}_{}\) of the function \(\varvec{\ell }_{},p\mapsto (\mathsf {P}[\varphi (P,K)=z\mid \mathbf {L}= \varvec{\ell }_{}, P= p ])_{z\in \mathcal {Z}}\). The output of the model \({\hat{\mathbf {\text {g}}'}}_{}\) when input with \((\varvec{\ell }_{},p)\) is then a vector \(\mathbf {y}'\) indexed by the values \(z=\varphi (p,k)\) for \(k\) ranges over \(\mathcal {K}\). To build a second vector \(\mathbf {y}\) indexed by the key candidates \(k\) it suffices to process \(\mathbf {y}[k] = \mathbf {y}'[\varphi (p,k)]\) for the input \(p\) and for every \(k\in \mathcal {K}\).

Remark 3

Another formulation of the classification problem could consist in directly looking for an estimation of the pdf \(\mathsf {P}[K\mid \mathbf {L},P ]\): the deep neural networks will here be trained to classify the key candidates knowing the public input of the cryptographic primitive and the information leakage. However, even if this formulation may seem more natural (it perfectly matches the problem we want to resolve), it implies that the deep neural networks must not only recover the statistical dependency between the values of the manipulated data and the leakage, but also the function that links it to the key (e.g.  the function \(\varphi ^{-1}(\cdot ,P)\)). Since the latter function may be complex (e.g. can be affinely equivalent to the inverse of an sbox), this can made the task of the deep neural networks harder, whereas the function \(\varphi \) is often already known by the adversary.

Deep Learning is a branch of machine learning whose characteristic is to avoid any manual feature extraction step from the model construction work-flow. For example, in deep learning the dimensionality issue discussed in Sect. 2.2.2 is not necessarily tackled out by preprocessing a dimensionality reduction function \(\varepsilon \). As described below, the cascade of multiple layers that characterize DL models is indeed in charge of directly and implicitly extracting interesting features and of estimating the classifying model \(\hat{\mathbf {\text {g}}}_{}\). This approximation is searched in a family of functions (aka models in the machine learning terminology) specified a priori by the data analyst according to the specificities of the problem which is tackled out.

We conclude this subsection by recalling some basic definitions and notions about neural networks and their training.

Neural networks A neural network has an input layer (the identity over the input datum \(\varvec{\ell }_{}\)), an output layer (the last function, whose output \(\mathbf {y}\) is an estimation of the vector of conditional probabilities) and all other layers are called hidden layers. The so-called neurons, that give the name to the architecture, are the computational units (also named nodes) of the network and essentially process a scalar product between the coordinates of its input and a vector of trainable weights (or simply weights) that have to be trained. Each layer processes some neurons and the outputs of the neuron evaluations will form new input vectors for the subsequent layer. The numbers of layers in the neural networks, the dimension of the elementary units or the algebraic nature of the nonlinear layers form the architecture of the network and define the family of functions/models. The identification of the best approximating function in this family is made by solving a minimization problem with respect to a metric which is specific to the application.

Training of neural networks In a privileged setting, the training phase (i.e. the automatic tuning of the weights of the neurons) is done via an iterative approach which locally applies the (Stochastic) Gradient Descent algorithm [24] to minimize a loss function quantifying the classification error of the function \(\hat{\mathbf {\text {g}}}_{}\) over a training set which is a part of the profiling set. The cross-entropy [25, 40] metric is a classical (and often by default) choice in classification problems. It is smooth and decomposable, and therefore amenable to optimization with standard gradient-based methods. However, other metrics may be investigated and can potentially lead to better results [49, 61]. A training is said to be full batch learning if the full training database is processed before one update. At the opposite, if a single training input is processed at a time then the approach is named stochastic. In practice, one often prefers to follow an approach in between, called mini-batch learning, and to use small batch (aka group) of training inputs at a time during the learning. The size of the mini-batch is generally driven by several efficiency/accuracy factors which are e.g. discussed in [25] (e.g.  optimal use of the multi-core architectures, parallelization with GPUs, trade-off between regularization effect and stability, etc.).

An iteration over all the training datasets during the Stochastic Gradient Descent is called an epoch. The number of epochs is an important parameter to tune because small values may lead to under-fitting (the number of steps of the Gradient Descent is not sufficient and the model is too poor to capture a trend in the training dataset) and high values may lead to over-fitting (the model is too complex, it perfectly fit the training dataset but is not able to generalized its predictions to other datasets).

Several extensions and variants of the Stochastic Gradient Descent have been proposed in the context of deep learning. These variants, called optimizers, aim to adapt the learning rate (the step size) of the Gradient Descent during the training process. More details about the specification of neural networks will be given in the dedicated Sects. 3 and C, but we will not go further on the optimization approaches and the interested reader may refer to [24].

Hyper-parameters All the parameters that define an architecture (called architecture hyper-parameters or simply architecture parameters), together with some other parameters that govern the training phase (called training hyper-parameters or simply training parameters), have to be carefully set by the attacker.Footnote 5 This point will be discussed in Sect. 3.3.

2.4 Model assessment and selection

2.4.1 Evaluation methodology

In the machine learning community, several evaluation frameworks are commonly applied to assess the performances of a model or to select the best parameters that suit to a parametrized family of models. These methods aim to provide an estimator of the performance of a metric (e.g. the accuracy) which does not depend on the choice of the training set \(\mathcal {D}_{\text {train}}\) (on which the model is trained) and of the test set \(\mathcal {D}_{\text {test}}\) (on which the model is tested) but only on their size.

The so-called t-fold cross-validation [20] is currently the preferred evaluation method. Let \(\mathsf {c}\) be a metric, \(\hat{\mathbf {\text {g}}}_{}\) a model to evaluate, and \(\mathcal {D}_{\text {profiling}}=(\mathbf{\mathcal {L}},{\mathcal {Y}})\) a dataset with labels, the outline of the method is the following:

  1. 1.

    [optional] randomize the order of the labelled traces in \(\mathcal {D}_{\text {profiling}}\),

  2. 2.

    split the samples and their corresponding labels into t disjoint parts of equal size \((\mathbf{\mathcal {L}}_1,{\mathcal {Y}}_1),\ldots ,(\mathbf{\mathcal {L}}_t,{\mathcal {Y}}_t)\). For each \(i\in [1..t]\), do:

    1. (a)

      set \(\mathcal {D}_{\text {test}}\doteq (\mathbf{\mathcal {L}}_i, {\mathcal {Y}}_i)\) and \(\mathcal {D}_{\text {train}}\doteq (\bigcup _{j\ne i} \mathbf{\mathcal {L}}_j, \bigcup _{j\ne i} {\mathcal {Y}}_j)\),

    2. (b)

      (re-)trainFootnote 6 the model \(\hat{\mathbf {\text {g}}}_{}\) on \(\mathcal {D}_{\text {train}}\),

    3. (c)

      compute the performance metric by evaluating the model on \(\mathcal {D}_{\text {test}}\):

      $$\begin{aligned} \mathsf {c}_i = \mathsf {c}(\hat{\mathbf {\text {g}}}_{}, \mathcal {D}_{\text {test}}), \end{aligned}$$
      (5)
  3. 3.

     return the mean \(\frac{1}{t}\sum _{i=1}^t \mathsf {c}_i\).

It is known that the t-fold cross-validation estimator is an unbiased estimator of the generalization performance. Its main drawback is its variance which may be large and difficult to estimate [6, 10]. In this paper (Sects. 3 and C), we perform a 10-fold cross-validation for each selection of the model parameters. The choice of \(t=10\) results in a trade-off between evaluation complexity and accuracy, since for each choice of parameters the model is trained 10 times with a substantial computing time, and the generalization performance estimator is computed among 10 values on different training sets, reducing the uncertainty on the evaluation metrics. The dataset \(\mathcal {D}_{\text {profiling}}\) on which is performed the cross-validation is a fixed subset comprised of 50,000 labelled traces, split at each iteration into \(N_{\text {train}}= 45,000\) labelled traces for \(\mathcal {D}_{\text {train}}\) and \(N_{\text {test}}= 5,000\) labelled traces for \(\mathcal {D}_{\text {test}}\).

2.4.2 Evaluation metrics

We evaluate the performance of our models with three different metrics, which are: the rank function, the accuracy and the computational time.

The rank function is a commonly used metric in SCA for assessing the performance of an attack. Let us denote by \(k^{\star }\in \mathcal {K}\) the key that has been used during the acquisition of the dataset \(\mathcal {D}_{\text {profiling}}\). The rank function corresponding to a model \(\hat{\mathbf {\text {g}}}_{}\) trained with the dataset \(\mathcal {D}_{\text {train}}\) and tested with the dataset \(\mathcal {D}_{\text {test}}\) is defined by:

$$\begin{aligned} \mathsf {rank}_{}(\hat{\mathbf {\text {g}}}_{}, \mathcal {D}_{\text {train}},\mathcal {D}_{\text {test}},n) = |\{ k\in \mathcal {K}\mid \mathbf {\mathsf {d}}_{n}[k] > \mathbf {\mathsf {d}}_{n}[k^{\star }] \}|, \end{aligned}$$
(6)

where \( \mathbf {\mathsf {d}}_{n}[k]\) is the score for the candidate \(k\) as defined in (4) (replacing \(\mathcal {D}_{\text {attack}}\) by \(\mathcal {D}_{\text {test}}\) and \(N_a\) by \(n\)) and estimated from a modelling of the conditional probability done with \(\mathcal {D}_{\text {train}}\) and a test done with the \(n\) first traces in \(\mathcal {D}_{\text {test}}\). For example, if \(k^{\star }\) has the highest score (resp. the lowest score), then its rank is 0 (resp. \(|\mathcal {K}|-1\)). Note that in this definition, the rank function depends on the choices of the training and test datasets. To get a better measure of the rank for given cardinalities \(N_{\text {train}}\) and \(N_{\text {test}}\) of \(\mathcal {D}_{\text {train}}\) and \(\mathcal {D}_{\text {test}}\) respectively, it is therefore more suitable to estimate its mean over several pairs of datasets:Footnote 7

$$\begin{aligned} \mathsf {RANK}_{N_{\text {train}},n}(\hat{\mathbf {\text {g}}}_{}) = \mathsf {E}_{}[\mathsf {rank}_{}(\hat{\mathbf {\text {g}}}_{}, \mathcal {D}_{\text {train}}, \mathcal {D}_{\text {test}},n) ] , \end{aligned}$$
(7)

where the mean is defined over all the datasets \(\mathcal {D}_{\text {train}}\) and \(\mathcal {D}_{\text {test}}\), respectively, of cardinality \(N_{\text {train}}\) and \(N_{\text {test}}\), and where \(n\) is assumed to be bounded above by \(N_{\text {test}}\). For any pair of cardinalities, an approximation of the mean can be obtained by cross-validation as detailed previously just by replacing \(\mathsf {c}\) in (5) by the rank function \(\mathsf {rank}_{}(\cdot )\). This is exactly what has been done for the benchmarks discussed in Sect(s). 3 and C. More precisely, the following attack has been repeated \(t=10\) times and the average rank of the correct key is plotted: (1) select a training set of fixed size \(N_{\text {train}}\) and (2) compute the evolution of the rank of the correct key when the model is tested with an increasing number \(n\) of traces in \(\mathcal {D}_{\text {test}}\) (of size \(N_{\text {test}}\)).

A second metric which is commonly used in machine learning is the accuracy. With the same previous notations, we can define it as:

$$\begin{aligned}&\mathsf {acc}_{}(\hat{\mathbf {\text {g}}}_{}, \mathcal {D}_{\text {train}},\mathcal {D}_{\text {test}}) \nonumber \\&\quad = \frac{|\{(\varvec{\ell }_{i},p_i,k^{\star }) \in \mathcal {D}_{\text {test}}\mid k^{\star }=\mathrm {argmax}_{k\in \mathcal {K}}~\mathbf {y}_i[k]\}|}{|\mathcal {D}_{\text {test}}|}, \end{aligned}$$
(8)

where we recall that \(\mathbf {y}_i\) denotes the \(|\mathcal {K}|\)-dimensional output \(\hat{\mathbf {\text {g}}}_{}(\varvec{\ell }_{i},p_i)\). Then, similarly as for the rank function but for possibly unbounded size of \(\mathcal {D}_{\text {test}}\), we can define from (8) an Expected Accuracy of the model (ACC) by:

$$\begin{aligned} \mathsf {ACC}_{N_{\text {train}}}(\hat{\mathbf {\text {g}}}_{}) = \mathsf {E}_{}[\mathsf {acc}_{}(\hat{\mathbf {\text {g}}}_{}, \mathcal {D}_{\text {train}},\mathcal {D}_{\text {test}}) ] , \end{aligned}$$

where the mean is defined over all the datasets \(\mathcal {D}_{\text {train}}\) of size \(N_{\text {train}}\) and all the datasets \(\mathcal {D}_{\text {test}}\) (with unbounded size).Footnote 8

Finally, our selection of parameters is also guided by the computational time of the model training. The mean of the training time is computed in the same manner as the other evaluation metrics during the 10-fold cross-validation.

Fig. 1
figure 1

SNR characterization for the third sbox output manipulation

2.4.3 About the profiling set-up

Our implementations of machine learning algorithms have been developed with Keras library [15] (version 2.1.1) or directly with Tensorflow library [1] (version 1.4.0). We run the trainings over ordinary computers equipped with 16 GB of RAM and gamer market GPUs Nvidia GTX 1080 Ti. The computation of all the benchmarks took approximately 12 days by using 3 GPU cards.

2.5 Target of the attacks experiments and leakage characterization

For our attack experiments, we targeted a Software protected AES implementation running over an 8-bit AVR architecture. More precisely, the device is an ATMega8515.

2.5.1 About the implementation

To maximize our control on the code executed by the device, we choose to implement the AES in assembly language. We developed two versions which merely aim at defeating first-order SCA attacks (i.e. attacks exploiting a single temporal leakage without (re-)combining of temporal points). The first one makes use of the classical table re-computation method (see e.g. [2, 34, 55] for a detailed description). The AES state is secured with 16 different masks for the linear parts and, for the SubBytes processing, the same pair of input and output masks is used for each state element. The outlines of the implementation are summed up in Algorithm 1 in “Appendix B”. A Signal-to-Noise Ratio (SNR)Footnote 9 has been done (1) to validate that there is no first-order leakage (snr1 in Fig. 1b) and (2) to identify where the target values and the masks are leaking (\(snr2-snr5\) in Fig. 1b). To help evaluators to perform elementary attacks against the first two bytes of the AES state during the first round, the corresponding masks of the linear parts (\(r[1]\) and \(r[2]\) in Algorithm 1) have been fixed to 0.

Attack experiments reported in the rest of the paper only target the output of the third sbox processing during the first round (namely \(\text {sbox}^{\star }[\text {state0}[3]] \doteq \text {sbox}(p[3]\oplus k[3])\oplus r_{\text {out}}\) with \(i=3\) in Algorithm 1).Footnote 10

figure a

2.5.2 About the acquisition phase

The side-channel observations were obtained by measuring the electromagnetic (EM) radiations emitted by the device. To this aim, a sensor made of several coils of copper was plugged into a low-noise amplifier. To sample measurements, a digital oscilloscope was used with a sampling rate of 2G samples per second. We insist on the fact that the temporal acquisition window was set to record the first round of the AES only. As the MCU clock was quite stable, the resynchronization of the measurements was not difficult and resulted in a campaign of 100,000 traces composed of 100,000 time samples. Among them, we chose to finally extract only 60,000 traces after validating that it was sufficient to accurately realize all our benchmarks (see e.g. Sect. C.2). To identify the leakage samples related to the secure processing of \(\text {sbox}(p[3]\oplus k[3])\), several SNRs have been processed. Their definition is given in Table 1a.

It may be observed in Fig. 1b that snr1 (in grey) is very low, which essentially shows that there is no first-order leakage on the unmasked sbox output \(\text {sbox}(p[3]\oplus k[3])\). The leakages on the sbox output masked with \(r[3]\) and on the mask \(r[3]\) itself are relatively high (snr4 and snr5 respectively). The SNR snr4 shows three peaks because the sbox output with mask \(r[3]\) is not only manipulated during the SubBytes step but also during the ShiftRows and the MixColumns. Eventually, one can also observe a leakage on the sbox output masked with \(r_{\text {out}}\) and on the mask \(r_{\text {out}}\) itself (snr2 and snr3). Since this leakage is smaller than that for the sbox with mask \(r[3]\), we found it more challenging and preferred to focus on it in our attack experiments.

For the reasons explained in previous paragraph, we chose to enshorten the initial traces (composed of 100,000 samples) and to only keep, for each trace, the 700 samples in the interval [45400..46100] which contains information on the two pairs of \((\text {sbox}(p[3]\oplus k[3])\oplus r[3],r[3])\) and \((\text {sbox}(p[3]\oplus k[3])\oplus r_{\text {out}},r_{\text {out}}))\) (see Fig. 15 in “Appendix A” for a zoom on the SNRs in the target intervalsFootnote 11).

To enable perfect reproducing of our experiments and benchmarks, we published the electromagnetic measurements acquired during the processing of our target AES implementation (available in [3]) together with example Python scripts to launch some initial training and attacks based on these traces. See Section B for details about this open database.

3 Convolutional neural networks

3.1 Core principles and constructions

Convolutional Neural Networks (CNNs) involved in our studies are based on 5 types of layers that are briefly recalled hereafter:

  • The Fully-Connected layers (FC for short), denoted by \(\lambda \), are expressible as affine functions: denoting \(\mathbf {x}\) the \(D\)-dimensional input of an FC, its output is given by \(\mathbf{A}\mathbf {x} + \mathbf {B}\), being \(\mathbf{A}\in \mathbb {R}^{C\times D}\) a matrix of weights and \(\mathbf {B}\in \mathbb {R}^C\) a vector of biases. These weights and biases are the trainable weights of the FC layer.Footnote 12

  • Convolutional layers (CONV for short) are linear layers, denoted by \(\gamma \), that share weights across space. To apply it to an input, \(n_{\text {filter}}\) small column vectors, called convolutional filters, of size W (aka kernel size) are slid over the input by some amount of units, called stride.Footnote 13 The column vectors form a windowFootnote 14 which defines a linear transformation of the W consecutive points of the data into a new vector \(\mathbf {x}\). When the window slides over the last points, the input trace can be either padded with 0 resulting in a vector \(\mathbf {x}\) which has the same number of points than the input data (same padding) or the data is not padded and the window only slides over the valid part of the data, resulting in a vector \(\mathbf {x}\) smaller than the input trace (valid padding). The coordinates of the window (viewed as a matrix) are among the trainable weights which are constrained to be unchanged for every input. This constraint aims to allow the CONV layer to learn shift-invariant features. The reason why several filters are applied is that we expect each filter to extract a different kind of characteristic from the input. As one goes along convolutional layers, higher-level abstraction features are expected to be extracted. These high-level features are arranged side-by-side over an additional data dimension, the so-called depth.Footnote 15 This is this geometric characteristic that makes CNNs robust to temporal deformations [36].

  • Pooling layers (POOL for short) are nonlinear layers, denoted by \(\delta \), that reduce the spatial size by making some filters slide across its input. Filters are 1-dimensional, characterized by a length W and a stride, that is usually chosen equal to W. They do not contain trainable weights and only slide across the input to select a segment, then a pooling function is applied: the most common pooling functions are the max pooling which outputs the maximum values within the segment and the average pooling which outputs the average of the coordinates of the segment.

  • Activation layers (ACT for short) are composed of a single nonlinear real function that is applied independently to each coordinate of its input. Several activation functions have been used in deep learning and currently the so-called ReLU is preferred. It processes \(\max (0,x)\) to each coordinate x. The ACT layer preceded by the Batch Normalization layer below is denoted \(\alpha \).

  • Batch Normalization layers (BN for short) have been introduced in [31] by Ioffe Szegedy to reduce the so-called internal covariate shift in neural networks and to eventually allow for the usage of higher learning rates. The reasoning behind the soundness of this layer is well argued in [25].

  • The softmaxFootnote 16 layer (SOFT for short), denoted by \(\mathsf {s}\), applies the following processing to each coordinate of its input \(\mathbf {x}\): \(\mathsf {s}(\mathbf {x})[i] = \frac{e^{\mathbf {x}[i]}}{\sum _{j}e^{\mathbf {x}[j]}}\).

Eventually, the main block of a CNN is a CONV layer \(\gamma \) directly followed by a BN layer and an ACT layer \(\alpha \). The former locally extracts information from the input thanks to filters and the latter increases the complexity of the learned classification function thanks to its nonlinearity. After some \( ( \alpha \circ \gamma )\) blocks, a POOL layer \(\delta \) is usually added to reduce the number of neurons: \(\delta \circ [ \alpha \circ \gamma ]^{n_{\text {conv}}} \). This new block is repeated \(n_{\text {blocks}}\) times in the neural network until obtaining an output of reasonable size. Then, \(n_{\text {dense}}\) FC layes \(\lambda \) are introduced in order to obtain a global result which depends on the entire input. To sum-up, a common convolutional network can be characterized by the following formula:Footnote 17

$$\begin{aligned} \mathsf {s}\circ [\lambda ]^{n_{\text {dense}}} \circ [\delta \circ [\alpha \circ \gamma ]^{n_{\text {conv}} } ]^{n_{\text {blocks}}}. \end{aligned}$$
(9)

3.2 A brief overview of current CNN architectures

The first successful CNN network, best known as LeNet, was developed in the nineties and was mostly applied to handwritten digit recognition [37, 38]. The last version of the network, LeNet-5 [38], is a small architecture which operates on images of \(32 \times 32\) pixels split into 10 classes. The architecture is comprised of 2 convolutional layers with, respectively, 6 and 16 filters of size \(5 \times 5\), and 3 final dense layers of, respectively, 120, 84 and 10 units. Each convolutional layer is followed by an average pooling layer and the activation function is the hyperbolic tangent. The network achieved an accuracy of nearly 99% on the test dataset.

CNN networks gained popularity with their breakthrough as a contender in the Imagenet Large Scale Visual Recognition Challenge (ILSVRC, [57]) and since 2012, deep CNN networks have constantly established new records in computer vision [28, 35, 60, 62]. ILSVRC is an image classification challenge which provides each year a labelled dataset of roughly 1,000,000 images of \(200 \times 200\) pixels split into 1000 classes. Candidates train and validate their algorithm on the provided dataset and submit it to the competition. Then algorithms are evaluated with an (unknown) test dataset and they are ranked according to two metrics, the top-1 accuracy and the top-5 accuracy, where the top-5 accuracy is the fraction of the test dataset for which the correct label is among the best 5 predictions returned by the algorithm.

The first CNN architecture presented at ILSVRC challenge in 2012 [35] obtained a great success in the competition by outperforming all the challengers with a top-5 accuracy rate of 84.7% (against 73.8% for the second-best entry). This CNN network, well-known as AlexNet, has 8 layers, with 5 convolutional layers dispatched in 3 blocks and 3 dense layers of 4096 units each. The convolutional layers have 3,96,256 and 384 filters of size \(11 \times 11\), \(5 \times 5\) and \(3 \times 3\). Each block has a final max pooling layer and ReLU activation functions are used instead of hyperbolic tangents (as in LeNet). The subsequent winner of ILSVRC challenge, ZFNet [65], improved the previous architecture by reducing the size of the first convolutional layer to \(7 \times 7\) and by increasing the number of filters to 1024 for the last convolutional layers; it achieved a top-5 accuracy of 85.2%.

The trend of reducing the size of the filters by increasing the depth of the network was later confirmed to be a successful strategy. The runner-up architecture of the ILSVRC 2014 challenge, VGGNet [60], obtained a 92.7% top-5 accuracy with an architecture comprised of (up to) 16 convolutional layers of 512 filters of size \(3 \times 3\) distributed in 5 blocks (for the VGG-19 version).

The winner of ILSVRC 2014, GoogLeNet [62], also used a deep network architecture with a total of 27 layers, but managed to decrease the number of parameters to train by using a new element in the architecture, the Inception module, which is a stack of small size convolutional layers (\(1 \times 1\), \(3 \times 3\), \(5 \times 5\)) with few parameters. Furthermore, the last dense layers are replaced by an average pooling layer, which also decreases the number of parameters to train. GoogLeNet obtained a top-5 accuracy rate of 93.3% with 12 times fewer parameters than AlexNet.

Finally, deep residual networks recently grow in popularity with the success of ResNet at ILSVRC 2015 [28]. These architectures manage to overcome the degradation problem that occurs when the depth of the network increases. They rely on residual units that learn for each layer the residual function \(\mathcal {F}(x) = \mathcal {H}(x)-x\) where x is the input of the layer and \(\mathcal {H}(x)\) is the desired function of the layer. The top of the networks also have an average pooling layer like GoogLeNet. ResNet has up to 152 layers and achieves a 96.6% top-5 accuracy at ILSVRC 2015, winning the challenge.

3.3 Determining an architecture

Essentially, the strategy we have followed to finalize a choice of hyper-parameters is composed of three consecutive steps. First, we perform some ad hoc preliminary tests to select a base model with fixed architecture parameters, namely a \(\hbox {CNN}_{\mathrm{base}}\) model in the form (9), and then we analyse the performances we can obtain with such a model while making the training hyper-parameters vary. In a third time, and after fixing the best solutions for the training hyper-parameters, we made the network hyper-parameters vary in order to optimize the architecture, called \(\hbox {CNN}_\mathrm{best}\).

This strategy has also been conducted in the context of multi-layer perceptron networks. The setting process leads to the \(\hbox {MLP}_\mathrm{best}\) architecture given in C.

3.3.1 Choice of the \(\hbox {CNN}_\mathrm{base}\)

To get a first idea about the kind of architectures relevant in our context, we chose to test some of state-of-the-art CNN architectures listed in previous section:Footnote 18 VGG-16 [60], ResNet-50 [28] and Inception-v3 [63]. Our purpose was not to compare their efficiency after some specific tuning but was to check whether one of them seems straightforwardly more adapted to our context than the others. Results are summed up in Fig. 2 where we have plotted the evolution of the mean rank of the correct key-hypothesis according to the number of epochs. Results are obtained with a 10-fold cross-validation.

Fig. 2
figure 2

Mean ranks (y-axis) w.r.t. the number of test traces (x-axis) obtained for VGG-16, ResNet-50 and Inception-v3 and for different epochs

Clearly, ResNet-50 and Inception-v3 do not seem to succeed in extracting key-dependent information for the observations whereas VGG-16 does very well. Based on these preliminary results, we chose to apply the same design principles as in VGG-16 architecture and we investigated the impact on several parameters’ configuration on the side-channel attack efficiency.

We moreover added the following rules, which are today classical in literature and enable us to limit the number of different configurations to test.

Rule 1

CONV layers in the same block have exactly the same configuration (to keep the global volume constant).

Rule 2

Each pooling has dimension 2 (and hence divides the size of the input by 2).

Rule 3

The number of filters \(n_{\text {filters},i}\) in a CONV layer of the ith block (starting from \(i=1\)) satisfies for \(i\ge 2\):

$$\begin{aligned} n_{\text {filters},i}= \min (n_{\text {filters},1}\times 2^{i-1},512) . \end{aligned}$$

Remark 4

The core idea behind Rule 3 is to keep the global amount of information treated by the different layers as constant as possible. Since each pool layer divides the input dimension by 2, the number of filters is itself multiplied by 2 to compensate it. This idea is inspired by VGG-16.

Rule 4

All the CONV layers have the same kernel size.

An example of the tested CNN architecture \(\hbox {CNN}_\mathrm{base}\) is given in Fig. 30 in “Appendix D”. For the initial configuration of our \(\hbox {CNN}_\mathrm{base}\) model used to test the impact of the different hyper-parameters, we select the following values: the number of blocks of CONV layers \(n_{\text {blocks}}\) is equal to 4, there is only one CONV layer by block (\(n_{\text {conv}} = 1\)), the number of filters for the first block \(n_{filter,1}\) is equal to 64, each filter has kernel size 3 (same padding) with ReLU activation functions and a max pooling layer for each block (\(W=2\)). \(\hbox {CNN}_\mathrm{base}\) has \(n_{\text {dense}} =2\) final dense layers of 4, 096 units.

Table 1 Benchmarks summary

3.3.2 Choice of the hyper-parameters in SCA context

The goal of the following benchmarks is twofold. First, it is to study, for the architecture \(\hbox {CNN}_\mathrm{base}\) presented in previous section and for our dataset, the impact of each (hyper)-parameter on the accuracy and SCA-efficiency. Secondly, it is to choose the final hyper-parameters for a new CNN model (hopefully) better than the original one and hence called \(\hbox {CNN}_\mathrm{best}\). The number of the latter parameters being too large for an exhaustive test of all the possible configurations, we chose to arbitrarily follow a predetermined sequence of tests. Note that the same \(\hbox {CNN}_\mathrm{base}\) architecture is used for each benchmarking (which each focuses on a single hyper-parameter). For completeness, we also validated the soundness of our choices when the observations are desynchronized.

The desynchronizations of the traces are simulated by generating for each trace a random number \(\delta \) in \([0..N_{\text {max}}]\) and by shifting the original trace of \(\delta \) points to the left. The samples added to a trace by this processing directly come from the corresponding full trace in ATMega8515_raw_traces.h5. For a chosen value \(N_{\text {max}}\) we then generate a new dataset \(\mathcal {D}_{N_{\text {max}}}\) from the original one.

Our goal was not to find the optimal configuration/training strategies but was to identify one of them making sense in our context and leading to accurate results. Other choices are certainly possible and we let the question of determining the most pertinent strategy as an open problem for further studies on this subject. The roadmap followed for our benchmarks is summarized in Table 1. Since our goal is to improve the SCA efficiency, i.e. to have the correct hypothesis ranked first with the minimum of traces during the matching/test phase, we always privileged rank-flavoured criteria for parameters selection.

3.3.3 Training parameters

The tuning of some training parameters is inherited by the analogous study in MLP context (see “Appendix C”). In particular we fixed the training set size to 50,000,Footnote 19 and chose to use the RMSProp optimizer with a learning rate of \(10^{-5}\).

Number of epochs and batch size For our campaigns, we did not observe any over-fitting (relatively to our rank function) when the number of epochs is increasing. As a direct consequence, the quality of the trained model in terms of our rank function never decreases when the number of epochs increases. Based on this observation, the following benchmarks aim to get the best trade-off between the SCA-efficiency and the training duration/time as a function of the number of epochs and the training batch size.

Fig. 3
figure 3

Mean ranks and training time of \(\hbox {CNN}_\mathrm{base}\) with different values of epochs and batch sizes

The first benchmark series are obtained by training \(\hbox {CNN}_\mathrm{base}\) with different values of epoch and batch size. The results in Fig. 3 show that the SCA-efficiency is enhanced not only when the number of epochs increases but also when the batch decreases. We consider that as a natural behaviour since the number of gradient descents (and thus the number of steps in the solving of the minimization problem) increases linearly with the number of epochs and linearly with (\(|\mathcal {D}_{\text {train}}|/batch\_size\)). However, as a counterpart, the training duration linearly increases with the number of gradient descent steps, as well. We selected one of the best trade-off, namely 100 epochs and a batch size equal to 200.Footnote 20 This choice will allow us to test the impact of the other parameters in our CNN experiments while keeping the training time acceptable.

The following benchmark validates, for a batch size equal to 200, that the previous observation (namely the fact that the SCA-efficiency increases with the number of epochs) stays true when traces are desynchronized.

Fig. 4
figure 4

Mean ranks of \(\hbox {CNN}_\mathrm{base}\) for a desynchronization in \(\{0,100\}\) and different numbers of epochs

With \(\hbox {CNN}_\mathrm{base}\) and with 5000 traces, we manage to obtain a mean rank close to 20 for a maximal desynchronization value \(N_{\text {max}}=50\), and a mean rank close to 40 for a maximal desynchronization value \(N_{\text {max}}=100\). These results highlight the success of CNN architectures in the context of desynchronized traces, as studied in [13]. Results of Fig. 4 also show the impact of the epoch parameter on the SCA-efficiency with the desynchronization amount: an epoch value of 100 is enough without desynchronization, but it does not yield to the best performance when traces are desynchronized.

3.3.4 Architecture parameters

In each case, we study the parameters in the context of 256 classes.

Number of layers The architecture of our \(\hbox {CNN}_\mathrm{base}\) suggests to divide the study of the number of layers in two phases: in a first time we make the number \(n_{\text {blocks}}\) of blocks vary where a block is composed of convolutional layers followed by a single pooling layer. In this phase we consider, for each block, only \(n_{\text {conv}}=1\) convolutional layer per block (which is the configuration of our \(\hbox {CNN}_\mathrm{base}\) model). In a second time, we look for the optimal number \(n_{\text {conv}}\) of convolutional layers per block assuming that this number is fixed to for all the blocks and that the number of blocks equals 4.

Fig. 5
figure 5

Mean ranks (left-hand side) and mean accuracy (right-hand side) of \(\hbox {CNN}_\mathrm{base}\) for a desynchronization in \(\{0,100\}\) and several numbers of blocks

Results of the first phase are plotted in Fig. 5. As expected, we notice that the SCA-efficiency increases with the number of blocks. A difference can be observed in the mean rank between 4 blocks and 5 blocks in presence of desynchronization. This fact can be explained in term of dimension of the input layer before applying the dense layers. The input trace has dimension 700, i.e. contains 700 temporal features, corresponding to its time samples. When 5 blocks are used, 5 max pooling layers of stride 2 are applied, and the temporal input dimension is divided by \(2^5=32\), resulting in an input for the first dense layer composed of 21 temporal features times 512 abstract features (43 temporal features times 512 abstract ones if only 4 blocks). The temporal features are those which are directly impacted by the temporal noisy effect of desynchronization. So the less they are the more the model is robust to desynchronization. This explains why adding blocks increases the SCA-efficiency in presence of desynchronization. Thus, we choose \(n_{\text {blocks}}=5\) as best parameter. However, in our further benchmarks we keep the value \(n_{\text {blocks}} = 4\); choosing this mid range value allows us to have a better understanding of the impact of the other parameters on the SCA-efficiency.

Results of the second phase are plotted in Fig. 6. We observe that optimal performances are obtained with only one CONV layer per block, even if the SCA-efficiency seems to be dimly impacted by \(n_{\text {conv}}\). This is probably due to the fact that increasing number of layers, the number of trainable weights increases as well. To observe a benefit we should let models with more weights train longer, but for our benchmark we fixed the number of epochs (100 in our experiment). So not only we do not observe any benefit in augmenting the \(n_{\text {conv}}\), but actually we observe performances slightly decreasing for an under-fitting phenomenon due to the lack of epochs. Observing results obtained with \(n_{\text {conv}}=0\), we verify the fact that the performance of the network is impacted by desynchronization when no CONV layer at all is exploited. The claim of CONV layers overcoming desynchronization issue by extracting patterns in the trace is then verified. Those conclusions are perfectly in line with [54] which observes that CONV layers play a minor role when the observations traces are perfectly synchronized.

Fig. 6
figure 6

Mean ranks (left-hand side) and mean accuracy of \(\hbox {CNN}_\mathrm{base}\) for a desynchronization in \(\{0,100\}\) and different numbers of CONV layers per block

Number of filters Following Rule 3, the aim of the next benchmark is to test several values for the number of filters in the CONV layers of the first block (denoted \(n_{filters,1}\) in Rule 3). Figure 7 shows that increasing the number of filters also increases the SCA-efficiency. However, it also obviously increases the time of the training which leads us to look at a good trade-off between efficiency and computational time.

Fig. 7
figure 7

Mean ranks (left-hand side) and mean training time (right-hand side) of \(\hbox {CNN}_\mathrm{base}\) for a desynchronization in \(\{0,100\}\) and different values of initial number of filters

Kernel size We here study the impact of the kernel size (aka the dimension of the convolutional filters) on the model efficiency. In parallel, we compare two different approaches which either consist in combining several convolutional layers with small dimension or in selecting one unique convolutional layer with high dimension.

Fig. 8
figure 8

Mean ranks with different kernel sizes (left-hand side) and mean ranks with 3 layers of dimension 3 in each block for different epochs (right-hand side) of \(\hbox {CNN}_\mathrm{base}\), for a desynchronization amount of 100

Unexpectedly, as shown in Fig. 8, the kernel size significantly impacts the SCA-efficiency. We obtain a mean rank below 10 with a desynchronization amount of 100 by increasing the size of the filters to 11, whereas we only obtain a mean rank of 40 by increasing the number of convolutional layers to 3 for each block with filters of size 3. For a complete comparison we also increased the number of epochs to 200 in our second experiment but it clearly does not improve the efficiency in the same scale than the kernel size. This behaviour is very different than the one expected if we refer to recent results in computer vision where the trend is to increase the number of layers with filters of small size.

Number of fully-connected final layers

Fig. 9
figure 9

Mean ranks (left-hand side) and mean accuracy (right-hand side) of \(\hbox {CNN}_\mathrm{base}\) for a desynchronization in \(\{0,100\}\) and various numbers of final fully-connected layers

For this benchmark, we train four versions of \(\hbox {CNN}_\mathrm{base}\) with different numbers of fully-connected final layers. Each of these dense layers has 4096 units. We observe from Fig. 9 that the network requires at least one dense layer when the traces are synchronized. Roughly speaking, this suggest that, for SCA attacks, the QDA part of CNN networks is simulated by dense layers, while convolutional layers essentially extract information (e.g. by combining leakage points and/or by dealing with the desynchronization). The results also confirm that the number of dense layers increases the SCA-efficiency. Hence, fully connected layers are a critical part of the CNN network in the context of SCA and shall be not removed. This differs from the recent trend in computer vision where dense layers are replaced by an average pooling layer and it explains the poor results of Inception-v3 and ResNet-50 in our experiments (Fig. 2).

Activation functions For the three tested activation functions, Fig. 10 shows that only ReLU can afford a good efficiency for SCA when desynchronization is below 50. Therefore, we recommend the usage of ReLU activation function for CNN architecture.

Fig. 10
figure 10

Mean ranks of \(\hbox {CNN}_\mathrm{base}\) for a desynchronization in \(\{0,100\}\) and activation function in \(\{ReLU, tanh,sigmoid\}\)

Pooling layer and padding We observed that the choice of the pooling layer and the padding played a minor role in the attack success rate. Among the tested options, we eventually selected a the Average pooling layer and a padding option equal to same. More information about the related benchmarking may be found in “Appendix E”.

\({CNN}_\mathrm{best}\) At the end of these benchmarks, we are able to define the best CNN architecture based on our selection of the parameters. Therefore, we define \(\hbox {CNN}_\mathrm{best}\) as the CNN architecture with 5 blocks and 1 convolutional layer by block, a number of filters equal to (64, 128, 256, 512, 512) with kernel size 11 (same padding), ReLU activation functions and an average pooling layer for each block. The CNN has 2 final dense layers of 4096 units. \(\hbox {CNN}_\mathrm{best}\) is trained with a batch size equal to 200 by using RMSprop optimizer with an initial learning rate of \(10^{-5}\). According to previous benchmarks and results of Fig. 4, a number of epochs above 130 is necessary to obtain the best results with \(\hbox {CNN}_\mathrm{base}\) on desynchronized traces. In our experiments on \(\hbox {CNN}_\mathrm{best}\), we noticed that a training with 75 epochs is sufficient. For robustness and because it had an acceptable computing time impact, we eventually chose to benchmark until a number of epochs equal to 100.

4 Attack comparisons on desynchronized traces

In this section, we perform on desynchronized traces a last comparison of the efficiency of the four models studied throughout this article, namely: VGG-16, PCA-QDA (aka template attacks) and \(\hbox {CNN}_\mathrm{best}\). For completeness, we also report on comparisons with a MLP model trained by following an approach similar to that described in Sect. 3 for CNN models (see “Appendix C” for more details). As in the previous sections, models are evaluated with a 10-fold cross-validation on 50, 000 traces.

VGG-16 We train VGG-16 with different numbers of epochs. The size of the batch is equal to 200 and we use RMSprop optimizer with a initial learning rate of \(10^{-5}\). The results for different desynchronization values are plotted in Fig. 11. As expected, the SCA-efficient increases with the number of epochs and we select 150 epochs for this model.

Fig. 11
figure 11

VGG-16 mean ranks for a desynchronization amount in \(\{0,100\}\) and different numbers of epochs

Template attacks As described in Sect. 14, we first perform an unsupervised PCA on the 700 samples of the traces in the dataset \(\mathcal {D}_{\text {profiling}}\) and we apply a QDA on the resulting components. Fig. 12 shows the impact of the desynchronization on the efficiency of template attacks. We note that the attack fails for a desynchronization amount equal to 100. When no desynchronization is added, best results are obtained when only 10 components are kept after the PCA.

Fig. 12
figure 12

Template Attacks mean ranks for a desynchronization amount in \(\{0,100\}\) and different values of PCA reduction

\(CNN_{best}\) Finally, we evaluate \(\hbox {CNN}_\mathrm{best}\) with the parameters described in the previous subsection. Results are displayed in Fig. 13.

Fig. 13
figure 13

Mean ranks of \(\hbox {CNN}_\mathrm{best}\) for a desynchronization amount in \(\{0,100\}\)

Summarize of the results In Fig. 14, we compare the best results obtained from the models. \(\hbox {CNN}_\mathrm{best}\) outperforms all the other models on desynchronized traces with only 75 epochs. VGG-16 has decent results too, but with an higher number of epochs. \(\hbox {CNN}_\mathrm{best}\) and template attacks combined to a PCA have similar results with synchronized traces, however this second model performs poorly with desynchronization. \(\hbox {MLP}_\mathrm{best}\) has good performances on synchronized traces, but is very sensitive to desynchronization.Footnote 21

Fig. 14
figure 14

Mean ranks of the best models for a desynchronization amount in \(\{0,100\}\)

5 Conclusions and perspectives

In this paper, we have conducted a thorough study of the application of deep learning theory in the context of side-channel attacks. In particular, we have discussed several parametrization options and we have presented a large variety of benchmarks which have been used to either experimentally validate our choices or to help us to take the adequate decision. The methodologies followed for the hyper-parameters selection may be viewed as a proposal to help researchers to make their own choice for the design of new deep learning models. They also open the way for further research in this domain. Since convolutional neural networks are shown almost similarly efficient as multi-layer perceptron networks in the context of perfectly synchronized observations but outperform them in presence of desynchronization/jittering, our study suggests that CNN models should be preferred in the context of SCA (even if they are more difficult to train). When it comes to choose a base architecture for the latter models, our study shows that the 16-layer network VGG-16 used by the VGG team in the ILSVRC-2014 competition [60] is a sound starting point (other public models like ResNet-50 [28] or Inception-v3 [63] are shown to be inefficient). Our results show that VGG-16 allows to design architectures which, after training, are better than classical Templates Attacks even when combined with dimension reduction techniques like principal component analysis (PCA) [53]. Clearly, our analysis does not enable to draw strong conclusions on the optimization and selection of optimal networks in the context of side-channel analysis. It however shows the impact of each hyper-parameter on the model soundness. All the benchmarkings have been done with the same target (and database) which corresponds to an AES implementation secured against first-order side-channel attacks and developed in assembly for an ATMega8515 component. This project has been published in [4]. To enable perfect reproducing of our experiments and benchmarks, we also chose to publish the electromagnetic measurements acquired during the processing of our target AES implementation (available in [3]) together with example Python scripts to launch some initial training and attacks based on these traces. We think that this ASCAD database may serve as a common basis for researchers willing to compare their new architectures or their improvements in existing models. In this paper, we did not address the question of model optimality in the context of side-channel attacks. We think that this subject, together with the extension of our study to masking schemes which are not a simple Boolean sharing, is a promising avenue for further research.