
1 Introduction

Recently, deep learning techniques have proved to be a powerful option for profiled side-channel analysis (SCA) as they 1) do not need a pre-processing phase to select points of interest, and 2) perform well even in the presence of noise and countermeasures [3, 8]. Although the application of deep learning for profiled SCA has become popular, many open questions remain, such as the selection of hyperparameters for successful side-channel attacks. For hyperparameters like architectural details (e.g., the number of layers and neurons), recent works provided directions to follow [16, 29, 30]. At the same time, the selection of the right moment to stop the training phase is left to instinct (and sometimes luck), as there does not seem to be a direct connection between machine learning metrics and the performance of a side-channel attack, as discussed later [14]. While the lack of a clear connection may not be deemed crucial, the end of the training phase is typically decided based on metrics like accuracy, loss, recall, and performance on a validation set. If the training phase finishes too late, the machine learning model overfits. As a consequence of overfitting, the model will not generalize to unseen data, and the attack phase will fail.

One way to analyze the generalization of a deep neural network is through the lens of information theory. In [19], the authors proposed a new methodology to interpret the training of a multilayer perceptron (MLP) through a theory called the Information Bottleneck (IB) [25]. They demonstrated that the training of an MLP exhibits two distinct phases: fitting and compression. These phases are determined by computing the mutual information between the intermediate representations (activations from hidden layers) and the input (raw data) or output (labels). The output of a hidden layer can be seen as a summary statistic containing information about the input and output. The fitting phase is usually very fast, requiring only a few epochs, while the compression phase lasts longer. The compression phase is also responsible for the neural network’s generalization, i.e., its ability to perform on unseen data. We consider the mutual information between the output layer activations and the data labels (as given by the leakage model) as a metric to identify the epoch when the neural network achieves its optimal generalization capacity. Our results show that training a network for too many epochs harms generalization, and early stopping based on the mutual information metric is a reliable technique to avoid this scenario. We test our metric against three masked AES implementations and show that it provides superior results compared to typical metrics like accuracy, recall, or loss.

To the best of our knowledge, this is the first result providing a reliable attack performance metric different from conducting an actual attack (key ranking). While key ranking is a reliable validation metric to optimize the generalization of a deep neural network for side-channel attacks [3], it brings significant computational overhead when using large validation sets. On the other hand, mutual information offers remarkable performance at a fraction of the computational cost as it does not have to be computed for all key hypotheses. To facilitate reproducible research, we will make the source code publicly available.

1.1 Related Works

Since the first paper considering convolutional neural networks for SCA [10], deep learning has gained significant recognition in the SCA community as an important direction for profiled SCA. Despite good results, even when considering protected targets [3, 8, 13, 30], open questions remain. For instance, progress on topics like interpretability and explainability of neural networks is difficult in general, and as such, there are not many results for SCA [11, 26, 27].

A second important research direction explored by the SCA community is how to find optimal architectures, i.e., tune hyperparameters [28,29,30]. Common options for hyperparameter exploration are the number and types of layers and neurons, activation functions, and the number of epochs. Determining the number of training epochs is relevant for any application domain but is a particularly challenging problem in SCA since machine learning metrics are known to be poor indicators of SCA performance [14]. Indeed, during the machine learning training process, we aim to minimize the loss, which often appears to be inversely proportional to accuracy, but there is no mathematical relationship between those two metrics. What is more, in side-channel analysis, the goal is to break the target with as few attack traces as possible, which again does not have a direct mathematical relationship with accuracy (or other common machine learning metrics like precision and recall). This gap is a consequence of the context of the problem more than the profiling method. Indeed, it is common in machine learning applications to discuss the classification or prediction of single observations, for which accuracy, precision, and recall are natural metrics. By contrast, the most standard (and powerful) SCAs are continuous DPAs, for which information-theoretic metrics are more reflective since they are correlated with the complexity of these attacks (in the number of traces). Common options are the use of early stopping or a predetermined number of epochs for training [7, 8, 15, 30]. In both cases, as the performance is observed through machine learning metrics, it is difficult to know if the training stopped at the right moment.

To evaluate the resilience against side-channel attacks, two different types of metrics are required. Mutual information has long been established in the context of side-channel evaluations as the metric of choice for evaluating the quality of an implementation [21]. Mutual information has two interesting properties: first, it is independent of the adversary, and second, it has the same meaning for any implementation or countermeasure. It is, however, advisable to complement an information-theoretic metric with a security metric (success rate or guessing entropy) that captures the success of an adversary in exploiting such information [23]. Conditional entropy has also been used to compare the strength of profiled attacks [22]. One limitation when applying information-theoretic metrics to real measurements is statistical sampling, which makes estimating the true statistical distribution impossible. As a solution, [2] proposed easy-to-compute bounds: Perceived Information (lower bound) and Hypothetical Information (upper bound). These metrics have been shown to work well under the assumption that the target variable has a uniform probability distribution. Perceived Information has also been shown to be asymptotically equivalent to the Negative Log-Likelihood when used as a loss function during the training of a neural network [12]. Although the maximization of perceived information from [12] is a consistent objective in deep learning-based profiled side-channel analysis, using the PI value to estimate, e.g., the best epoch to stop the training directly implies that the loss function value is an adequate metric for this task, which unfortunately is not a general observation in deep learning-based profiled SCA. As observed in [14], inconsistency between the loss function and SCA metrics occurs often for several deep learning models, especially in the presence of the class imbalance problem. Still, as long as a metric embeds a sum of log probabilities, the intuition that it represents a good predictor of continuous attacks (and, therefore, is good for model comparisons) should hold to a good extent. To circumvent this problem, Robissout et al. [17] conducted a success rate evaluation for every epoch of training and validation. This solution leads to a significant time overhead for a large number of attack traces. More importantly, there is no consensus on calculating the key rank for training sets with random keys, as best practice recommends.

1.2 Contributions

This work provides two main contributions as follows:

  1. A new application of mutual information as a metric to select the epoch where a deep network achieves its best performance for profiled side-channel analysis. We show that the information about the labels transferred to the output layer can be measured and used as a reliable metric to determine when to stop the training phase. Our approach offers four distinct improvements compared to existing results. First, the mutual information metric is precise and can accurately predict the epoch at which the network achieves its best performance, while the alternative, validation key ranking, will give a range of values for the best epoch. Second, the computational overhead for computing the information transferred to the output layer is much smaller than performing a full attack to obtain the key ranking at the end of each epoch. Third, mutual information is suitable for training sets with randomized keys, as recommended by best practices, while, as far as the authors are aware, there is no consensus on calculating the key ranking for this type of training set. Fourth, we extend the approach of [19], where the authors show how to calculate the information path for MLP architectures, to CNN architectures.

  2. The use of mutual information consistently offers good performance. We test it on three publicly available datasets, using two leakage models and two different architectures. In the experimental section, we thoroughly compare its performance to four conventional metrics: validation loss, validation accuracy, validation recall, and validation key rank. We conclude that in all cases, the use of mutual information leads to better generalization.

2 Background

2.1 Deep Learning in the Context of Side-Channel Analysis

In the profiled scenario, we assume the adversary has full control of a device identical to the targeted one. Thus, we consider the supervised learning task, i.e., learning a function f mapping an input to the output (\(f: \mathcal X \rightarrow Y\)) based on examples of input-output pairs. The examples come from a dataset divided into three parts: a profiling set consisting of N traces, a validation set consisting of V traces, and an attack set consisting of Q traces.

Once we learn the function f, the goal of the attack phase is to make predictions about the classes

$$\begin{aligned} y(x_{1},k^*), \ldots , y(x_{Q},k^*), \end{aligned}$$

where \(k^*\) represents the secret (unknown) key of the device under the attack.

In profiled SCA, the trained neural network is tested against Q side-channel traces for which the secret key is unknown, and the key recovery methodology assumes that the correct key is the one that maximizes the sum of log probabilities \(S_{k}\) for each key byte candidate:

$$\begin{aligned} S_{k} = \sum _{i=1}^{Q} \log (p_{i,j}). \end{aligned}$$
(1)

The value (i.e., the probability) \(p_{i,j}\) is an element of a matrix P of size number of traces \(\times \) number of classes. This matrix gives the output class probabilities obtained by applying the model f to a test or validation set. Thus, \(p_{i,j}\) is the probability of class j obtained as a function of the attack trace \(x_i\), the leakage model l, and the input data \(p_{k_i}\) for every possible key guess k: \(f_k(x_i, p_{k_i}, l)\). In the context of neural networks, \(p_{i,j}\) is the neuron’s activation value for trace i from the Softmax output layer.
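
To make Eq. (1) concrete, the following sketch computes \(S_k\) for all key-byte guesses from the probability matrix P. The helper label_for_guess, the identity leakage model on the S-box output, and the small constant added before the logarithm are illustrative assumptions of this sketch, not the exact evaluation code used in this work.

```python
import numpy as np

# Hypothetical helper: map a plaintext byte and a key guess to the class label,
# here assuming the identity leakage model on the S-box output.
def label_for_guess(plaintext_byte, key_guess, sbox):
    return sbox[plaintext_byte ^ key_guess]

def key_guess_scores(P, plaintexts, sbox):
    """Compute S_k (Eq. 1) for all 256 key-byte guesses.

    P          : (Q, n_classes) matrix of Softmax output probabilities.
    plaintexts : (Q,) array with the plaintext byte of each attack trace.
    """
    Q = P.shape[0]
    eps = 1e-36                         # avoid log(0)
    scores = np.zeros(256)
    for k in range(256):
        labels = [label_for_guess(plaintexts[i], k, sbox) for i in range(Q)]
        scores[k] = np.sum(np.log(P[np.arange(Q), labels] + eps))
    return scores                       # the best key candidate maximizes the score
```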

A usual approach to assess the attacker’s performance is to use metrics that denote the number of measurements required to obtain the secret key \(k^*\). Common examples of such metrics are guessing entropy (GE) and success rate (SR) [23]. Guessing entropy represents the average number of key candidates an adversary needs to test to determine the correct key \(k^*\) after conducting a side-channel analysis. More specifically, given Q traces in the attack phase, an attack outputs a key guessing vector \(g= [g_1,g_2,...,g_{|K|}]\) in decreasing order of probability. The guessing entropy is the average position of \(k^*\) in g over several experiments (commonly 50 or 100). The success rate is defined as the average empirical probability that \(g_1\) equals the secret key \(k^*\).
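
The sketch below shows how GE and SR could then be estimated from repeated attacks on random subsets of traces, as described above; it assumes the hypothetical key_guess_scores helper from the previous sketch, and the default numbers of experiments and attack traces are placeholders.

```python
import numpy as np

def ge_and_sr(P, plaintexts, sbox, k_star, n_experiments=100, q_attack=500, seed=0):
    """Estimate guessing entropy and success rate over repeated attacks.

    Each experiment draws q_attack traces at random, scores all key guesses with
    key_guess_scores (sketched above), and records the rank of the true key k_star.
    """
    rng = np.random.default_rng(seed)
    ranks = []
    for _ in range(n_experiments):
        idx = rng.choice(P.shape[0], size=q_attack, replace=False)
        scores = key_guess_scores(P[idx], plaintexts[idx], sbox)
        g = np.argsort(scores)[::-1]                # guesses in decreasing order of score
        ranks.append(int(np.where(g == k_star)[0][0]) + 1)
    ranks = np.asarray(ranks)
    guessing_entropy = ranks.mean()                 # average position of k* in g
    success_rate = np.mean(ranks == 1)              # fraction of experiments with g_1 = k*
    return guessing_entropy, success_rate
```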

For a deep learning-based SCA to be successful, the trained model must generalize well, which means as small an error as possible on the test set. Next, we define the generalization interval in SCA.

Definition 1

Generalization Interval in SCA. Given \((X_{train}, Y_{train})\) where \(X_{train}\) represents the input training data and \(Y_{train}\) represents a set of training labels, the generalization interval defines the epochs where a successful key recovery can be obtained using Eq. (1).

Techniques used to reduce the test error are commonly known as regularization techniques, among which early stopping is one of the best known [6]. Early stopping works under the assumption that, once the neural network has achieved its best generalization, it will start to overfit and its generalization will deteriorate, which is undesired behavior. Consequently, determining the pre-specified number of iterations (epochs) is a hyperparameter selection process.

2.2 Datasets

We consider three publicly available datasets, instances of software AES implementations protected with the first-order Boolean masking. The first one is the ASCAD database [15]. We use the trace set where the plaintext and key are randomly defined for each separate encryption. This trace set is used as a training set. A second fixed-key trace set is split into validation and test sets. We attack the third key byte. In this dataset, each trace contains 1 400 features. This dataset is available at https://github.com/ANSSI-FR/ASCAD/tree/master/ATMEGA_AES_v1/ATM_AES_v1_variable_key.

The second dataset is the DPA Contest v4 (DPAv4) [24]. The DPAv4 dataset provides traces collected from an AES-256 RSM (Rotating S-box Masking) implementation. This dataset has a fixed key, and we attack the first key byte. Each trace consists of 2 000 features. This dataset is available at http://www.dpacontest.org/v4/.

The third dataset refers to the CHES Capture-the-flag (CTF) masked AES-128 encryption trace set, released in 2018 for the Conference on Cryptographic Hardware and Embedded Systems (CHES). In our experiments, the training set has a fixed key that is different from the key configured for the validation and test sets. Each trace consists of 2 200 features. We attack the first key byte. This dataset is available at https://chesctf.riscure.com/2018/news.

2.3 Information Theory

In information theory, the (marginal) entropy H(X) of a random variable X is defined as the average information obtained by observing X, and it can be quantitatively defined as:

$$\begin{aligned} H(X) = -\sum _{x \in X}p(x)\log _2 p(x), \end{aligned}$$
(2)

where p(x) represents the probability of variable X taking value x. The conditional entropy of X given Y, which represents the entropy of X when Y is known, is defined as:

$$\begin{aligned} H(X|Y) = -\sum _{y \in Y} p(y)\sum _{x \in X}p(x|y)\log _2 p(x|y). \end{aligned}$$
(3)

Finally, mutual information defines the dependence between variables X and Y, and it can be defined using entropy and conditional entropy values as:

$$\begin{aligned} I(X;Y) = H(X) - H(X|Y). \end{aligned}$$
(4)
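
As a small numerical illustration of Eqs. (2)–(4), the sketch below computes H(X), H(X|Y), and I(X;Y) directly from a hypothetical joint distribution of two binary variables; the distribution is chosen only for illustration.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows indexed by x, columns by y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)                                    # marginal p(x)
p_y = p_xy.sum(axis=0)                                    # marginal p(y)

H_x = -np.sum(p_x * np.log2(p_x))                         # Eq. (2)
p_x_given_y = p_xy / p_y                                  # columns are p(x|y)
H_x_given_y = -np.sum(p_y * np.sum(p_x_given_y * np.log2(p_x_given_y), axis=0))  # Eq. (3)
I_xy = H_x - H_x_given_y                                  # Eq. (4)
print(H_x, H_x_given_y, I_xy)                             # 1.0, ~0.72, ~0.28 bits
```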

An important property of mutual information in this context is the Data Processing Inequality (DPI), which states that for any three variables X, Y, Z forming a Markov chain \(X\rightarrow Y\rightarrow Z\), the mutual information can only decrease along the chain, i.e., \(I(X;Y)\ge I(X;Z)\).

3 Information Theory of Deep Neural Networks

Shwartz-Ziv and Tishby [19] showed that information theory could be used to visualize the training phase of a deep network and to compare the performance of different network architectures. Intuitively, when training a network, each layer gets its information from the layer before and transforms it through a matrix multiplication followed by a nonlinear function. Their insight was to treat each layer (the hidden activation functions) in the deep network as a random variable fully described by the information it captures about the input data and the labels. Modeling each layer in the deep network as a random variable gives an alternative view of a deep network as a Markov chain. Each variable represents the nonlinear activation function, which successively transforms the input data into the label space. Using the mutual information between the layers, the input data, and the labels, we can visualize the input data’s transformation into the label space.

Definition 2

Information Path. Given a tuple \((X, Y, T_i)\) where X represents the input data, Y represents a set of labels, and \(T_i\) is a hidden layer in an n-layered network, described as \(X,Y\rightarrow T_1 \rightarrow ...\rightarrow T_n\), the information path is defined as the set of points \(\{[I(X;T_i), I(T_i;Y)] \,|\, i\in \{1,\ldots,n\}\}\).

The information path is a record of the information each hidden layer preserves about the input data X and the output variables Y. It is typically computed for each epoch during the training phase. The information is plotted in a two-dimensional coordinate system referred to as the information plane. The coordinates of the information plane quantify the bits of information layer \(T_i\) has about the input data X as \(I(X; T_i)\), and the bits of information layer \(T_i\) has about the labels Y as \(I(T_i;Y)\). We can view the variable \(T_i\) as a compressed representation of the input X, and \(I(X; T_i)\), computed from the joint distribution \(p(x)p(t_i|x)\), measures how compact the representation of X is. The maximum value for \(I(X;T_i)\) is H(X), the Shannon entropy of X, corresponding to the case where \(T_i\) copies X and there is no compression. The minimal value for \(I(X;T_i)\) is 0 and corresponds to the case where \(T_i\) takes a single value.

Lemma 1

Information Path Uniqueness. For each tuple \((X, Y, T_i)\), where X represents the input data, Y represents a set of labels, and \(T_i\) is a layer in an n-layered network described as \(X,Y\rightarrow T_1 \rightarrow ...\rightarrow T_n\), there exists a unique information path that satisfies the following two inequalities:

$$\begin{aligned} H(X)\ge I(X;T_1)\ge ...\ge I(X;T_n)\ge I(X; \hat{Y}), \end{aligned}$$
(5)

where \(\hat{Y}\) are the labels predicted by the network, and

$$\begin{aligned} I(X;Y) \ge I(T_1;Y) \ge \ldots \ge I(T_n;Y)\ge I(\hat{Y};Y). \end{aligned}$$
(6)

Proof

The proof for this lemma follows immediately by applying the DPI principle. \(\square \)

3.1 Information Bottleneck Principle

Shwartz-Ziv and Tishby observed that stochastic gradient descent (SGD) optimization defines two distinct phases during training [19]. The first one is the fitting phase, where both \(I(X;T_i)\) and \(I(T_i;Y)\) increase fast as the training progresses. During the fitting phase, the deep network layers increase information about the input data and the labels. The second phase is the compression phase, where the network starts to compress or forget information about the input data and slowly increases its generalization capacity by retaining more information about the labels.

The network’s behavior during the compression phase has been linked to the form of the activation functions [4]. The compression happens due to a random diffusion-like behavior of the SGD algorithm when a double-sided saturating nonlinear activation function such as Tanh is employed. More precisely, Shwartz-Ziv and Tishby provided results for Tanh and showed how the information about the labels increases in the compression phase [19]. On the other hand, Saxe et al. demonstrated that non-saturating activation functions like ReLU exhibit a different behavior in the compression phase, as there is no causal connection between generalization and compression [18].

Figure 2 gives an overview of the information path of the deep network architecture described in Fig. 1 at epochs 1, 20, 100, and 200 during the training process. Each plot contains the coordinates \([I(X;T_{i}),I(T_{i};Y)]\), where i represents the i-th layer. The information is captured from the five hidden layers (\(T_{2:6}\)) plus the input layer (\(T_{1}\)) and the output layer (\(T_{7}\)).

Fig. 1. MLP with five hidden layers (\(T_{2:6}\)). The letter “D” denotes a dense (fully-connected) layer, while \(T_{1}\) and \(T_{7}\) correspond to the input and output layers, respectively. The dense layers use the Tanh activation function; the output layer \(T_{7}\) uses the Softmax activation function. The numbers under the layers indicate the number of neurons.

For the above example, we see that the information changes only for the last two hidden layers, \(T_5\) and \(T_6\), and the output layer \(T_7\). The plot contains mutual information results for twenty training experiments (i.e., twenty dots for each layer). These results demonstrate that at the beginning of the training phase, the mutual information quantities \([I(X;T_{i}),I(T_{i};Y)]\) are at a minimum level for the hidden layers \(T_5\) and \(T_6\) and for the output layer \(T_7\). As the training progresses (epochs 20 and 100), the mutual information values increase until \([I(X;T_{i}),I(T_{i};Y)]\) reach their maximum for all layers, including the output layer. If we continue the training process, the compression phase starts, as \(I(X;T_{i})\) starts to decrease while \(I(T_{i};Y)\) stays at a maximum level. In Fig. 2, this information path is clearly observed for the hidden layers \(T_5\) and \(T_6\) and the output layer \(T_7\). The values obtained for layers \(T_1\) to \(T_4\) change very little during training, which is difficult to visualize due to the scale of Fig. 2.
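
For readers who want to reproduce such plots, the sketch below illustrates one way the coordinates \([I(X;T_i), I(T_i;Y)]\) could be estimated from the activations of a layer at the end of an epoch. It follows the binning approach of [19]: Tanh activations are discretized into equal-width bins, each distinct binned pattern is treated as one value of the discrete variable \(T_i\) (so that \(I(X;T_i)=H(T_i)\) for the deterministic mapping), and \(I(T_i;Y)\) is obtained by conditioning on the labels. The number of bins and the NumPy-based implementation are assumptions of this sketch; the estimator used for our metric is discussed in Sect. 4.1.

```python
import numpy as np

def discretize(activations, n_bins=30, low=-1.0, high=1.0):
    """Bin Tanh activations into equal-width bins and hash each row to one id."""
    bins = np.linspace(low, high, n_bins + 1)
    digitized = np.digitize(activations, bins)
    # Each distinct binned activation pattern becomes one value of the variable T_i.
    _, t_ids = np.unique(digitized, axis=0, return_inverse=True)
    return t_ids

def entropy(ids):
    _, counts = np.unique(ids, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_plane_point(activations, labels, n_bins=30):
    """Estimate [I(X;T_i), I(T_i;Y)] for one layer at one epoch.

    Since T_i is a deterministic function of the input X, I(X;T_i) = H(T_i);
    I(T_i;Y) = H(T_i) - H(T_i|Y) is estimated by conditioning on the labels.
    """
    t_ids = discretize(activations, n_bins)
    h_t = entropy(t_ids)
    h_t_given_y = sum((labels == y).mean() * entropy(t_ids[labels == y])
                      for y in np.unique(labels))
    return h_t, h_t - h_t_given_y
```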

Fig. 2. The information flow captured at epochs 1, 20, 100, and 200 for the network architecture depicted in Fig. 1 when using the DPAv4 dataset (training set).

Fig. 3. Information plane for the DPAv4 dataset (training set).

Figure 3 shows the evolution of the information path over all training epochs for the DPAv4 dataset and the same architecture given in Fig. 1. The training evolution exhibits two distinct phases (fitting and compression), as discussed in [19]. In the first phase, the layers (most visible for hidden layers \(T_{4:6}\) and the output layer \(T_{7}\)) are fitting the training data: the information an inner state (layer) \(T_i\) carries about the input X and the output Y increases. In the second phase, the output information stays high, but the input information starts to decrease. From the figure, it is clear that the second phase starts before epoch 100.

3.2 Information Path for Side-Channel Analysis Data

To assess whether a machine learning model generalizes well, we commonly check its performance on previously unseen data, i.e., a validation set. Similarly, to investigate the generalization from the information path, we must assess its behavior on the validation set. More precisely, we aim to find the generalization interval (Definition 1) in terms of the information path, as formalized in Definition 3.

Definition 3

SCA Generalization Interval via Information Path.

Given a tuple \((X_{train}, Y_{train}, T_i)\) where \(X_{train}\) represents the input training data, \(Y_{train}\) represents a set of training labels, and \(T_i\) is a hidden layer in an n-layered neural network, the generalization interval for SCA defines the interval of training epochs where the quantities \([I(X;T_{i}),I(T_{i};Y)]\) reach the maximal values and we can obtain successful key recovery with Eq. (1) by predicting on the dataset \(X_{test}\) with the trained neural network.

The results in Fig. 4 show that it is possible to observe a different “movement” of the points \([I(X;T_i),I(T_i;Y)]\) in the information path when using the validation set. The fitting phase is clearly seen, as \(I(X;T_i)\) and \(I(T_i;Y)\) increase during the first epochs (for the validation set, this movement is observable for the hidden layer \(T_{6}\) and the output layer \(T_{7}\)).

The compression phase differs from the one observed in the information plane of Fig. 3. Now, the points \([I(X;T_i),I(T_i;Y)]\) reach their maximum value for each hidden layer and, later, both quantities decrease as more epochs are processed. This indicates overfitting of the trained machine learning model. More specifically, this happens because the generalization in difficult SCA settings (i.e., masked or protected AES) is minimal when expressed in terms of deep learning metrics (accuracy, loss, recall). At the same time, we aim to capture the machine learning model at the epoch when the best possible generalization occurs. From our observations, the epoch at which the generalization is optimal is given by the moment when \(I(T_n;Y)\) reaches its maximum value.

Fig. 4. Information plane for the DPAv4 dataset (validation set).

3.3 Improving the Generalization in Deep Learning-Based SCA

Recall that for a deep learning-based side-channel analysis to be successful, the trained model must generalize well to previously unseen data (validation/test set). Given a deep neural network defined by a set of hyperparameters \(\theta \), the internal representations \(T_{i}\), \(i \in \{1,\ldots,n\}\) (where \(T_{1}\) and \(T_{n}\) are the input and output layers, respectively), should inform about the labels Y and the input X [1].

As observed from Figs. 3 and 4, we stop the training when we reach the maximum value for \(I(T_n;Y)\). As such, we assume:

  1. During the training, the intermediate representation \(T_{i}\) will be compressed to estimate Y correctly.

  2. The intermediate representation \(T_{i}\) should be robust, such that a small addition of noise does not affect this compressed internal representation.

  3. Only the information transferred to the output layer of the network is important for measuring the generalization [4].

Our investigation suggests that the maximum value of \(I(T_n;Y)\) for the output layer occurs when the fitting phase is finished and the compression has started. This means that the training does not need to go through the full compression phase to achieve the best generalization. This is in line with the findings of Shwartz-Ziv and Tishby, who observed that the beginning of the compression phase coincides with the best generalization [19], and of Saxe et al., who showed empirical results demonstrating that the compression phase does not necessarily improve generalization [18].

In essence, the value of \(I(T_i;Y)\) should increase for all hidden layers as the training progresses and reach a maximum value at the end of the fitting phase. This directly means that the information path helps to indicate the optimal number of training epochs for all hidden layers, especially the outer layers. As the experiments in Sect. 4 suggest, taking the maximum value of \(I(T_n;Y)\) from the output layer (where \(T_n\) are the output class probabilities after Softmax) provides an efficient early stopping metric for profiled side-channel analysis.

The calculation of \(I(T_n;Y)\) introduces minimal overhead during the training process since we only need to compute it on a small fraction of the validation set. We estimate that the time overhead to compute \(I(T_n;Y)\) at the end of each epoch is less than 2%.
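
A minimal sketch of how this early stopping criterion could be implemented as a Keras callback is given below. The mi_estimator argument is assumed to be a histogram-based estimator of \(I(T_n;Y)\), such as the one sketched in Sect. 4.1; the callback keeps the weights of the epoch with the highest validation \(I(T_n;Y)\) and restores them at the end of training.

```python
import numpy as np
from tensorflow.keras.callbacks import Callback

class MutualInformationEarlyStopping(Callback):
    """Track I(T_n;Y) on a validation set and keep the weights of the best epoch."""

    def __init__(self, x_val, y_val, mi_estimator):
        super().__init__()
        self.x_val, self.y_val = x_val, y_val
        self.mi_estimator = mi_estimator          # assumed histogram-based MI estimator
        self.best_mi, self.best_epoch, self.best_weights = -np.inf, -1, None

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.x_val, verbose=0)   # Softmax output T_n
        mi = self.mi_estimator(probs, self.y_val)
        if mi > self.best_mi:
            self.best_mi, self.best_epoch = mi, epoch
            self.best_weights = self.model.get_weights()

    def on_train_end(self, logs=None):
        if self.best_weights is not None:                    # restore best-epoch model
            self.model.set_weights(self.best_weights)
```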

4 Experimental Validation

4.1 Estimating Mutual Information

The first step in calculating mutual information, Eq. (4), is density estimation [9], which aims to construct an approximation of the density function (denoted with p in Eqs. (2) and (3)) using observed data. There are two main approaches to density estimation. The first is parametric, where we assume the observed data are drawn from a known family of distributions (e.g., the normal distribution), while the second, non-parametric, does not assume a distribution of the observed data. We consider the non-parametric approach more suitable for our setting as we have little information about the underlying data distribution.

Common approaches for non-parametric estimation are simple discretization methods such as equal interval binning (or histogram estimator), which divides the observed data into equal-sized bins, or equal frequency binning, which divides the observed data into bins with an equal number of samples [5]. The price of this generic approach is a user-supplied parameter such as the bin width in the case of equal-sized bins or the number of samples in each bin for frequency-based binning. A variation of the discretization techniques described above is the kernel density estimator, where a kernel function replaces the “box” of the bin estimators. The user-supplied parameter is the kernel bandwidth, and its choice significantly impacts the estimator’s performance.
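
As an illustration of equal interval binning, the sketch below estimates \(I(T_n;Y)\) from the Softmax output of the network on a validation set; the 100-bin default reflects the choice discussed below, and treating every distinct binned probability vector as one value of \(T_n\) is an assumption of this sketch. It could serve as the mi_estimator assumed by the callback sketched in Sect. 3.3.

```python
import numpy as np

def histogram_mutual_information(probabilities, labels, n_bins=100):
    """Histogram (equal interval binning) estimate of I(T_n;Y).

    probabilities : (n_traces, n_classes) Softmax output on the validation set.
    labels        : (n_traces,) integer class labels given by the leakage model.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    digitized = np.digitize(probabilities, bins)
    # Every distinct binned probability vector is one value of the variable T_n.
    _, t_ids = np.unique(digitized, axis=0, return_inverse=True)

    def entropy(ids):
        _, counts = np.unique(ids, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    h_t = entropy(t_ids)
    h_t_given_y = sum((labels == y).mean() * entropy(t_ids[labels == y])
                      for y in np.unique(labels))
    return h_t - h_t_given_y                      # I(T_n;Y) = H(T_n) - H(T_n|Y)
```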

Finding the optimal value for the user-supplied parameter is nontrivial but important, as its value directly affects the estimator error. The quality of an estimator is evaluated by its bias and variance, and generic expressions for all estimators mentioned above are known [20]. However, such expressions require as input the distribution from which the data are observed, which in our case is not known.

Adaptive estimators use a recursive algorithm to determine the optimal bin width [4]. Such methods use entropy to measure the disorder in the observed data and determine the bin width that minimizes the entropy function over all possible thresholds. Although they offer good performance, we found this method unsuitable for our purpose, as it is very time-consuming. Our investigation showed that it would be faster to calculate the key rank for every epoch than to use entropy-based binning.

As there is no optimal solution for non-parametric estimation and low computational overhead is an important requirement in our case, we decided to use the histogram estimator. We thoroughly tested its reliability by running two types of experiments. In the first, we simply considered all possible values for the bin width. In the second, we considered different known rules for determining the bin width. For both experiments, we consider all three datasets described in Sect. 2.2 and provide detailed results in Appendix A.

We conclude from the first experiment that although the histogram estimator’s bin width impacts the achieved performance, the mutual information metric is stable, and we observe the same behavior for several values of this parameter. More precisely, any value between 25 and 170 gives similar results. Therefore, we selected 100 bins, as it lies in the middle of this interval. This observation holds for both the Hamming weight and the identity leakage models.

In the second experiment, we use plug-in estimates to determine the bin width. These estimates make assumptions about the distribution of the observed data and are known as empirical rules for determining the bin width. The results are depicted in Appendix A, where we can observe that the bin widths obtained with different rules result in very similar attack performance. Our experiments confirm that the mutual information metric is not very sensitive to the histogram bin size, making it a robust procedure for practical applications where one needs to consider different datasets, leakage models, and neural network architectures. Based on this observation, our strategy is to choose the bin value as the average value with good performance over all tested scenarios.
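
For illustration, the following sketch compares a few empirical bin-width rules as implemented in NumPy; the synthetic data here are only a stand-in for the flattened Softmax outputs of a validation set, and the rules evaluated in Appendix A may differ from this selection.

```python
import numpy as np

# Synthetic stand-in: 500 "traces" with 9 classes (HW leakage model), flattened.
rng = np.random.default_rng(0)
values = rng.dirichlet(np.ones(9), size=500).ravel()

for rule in ("sturges", "scott", "fd", "doane"):
    edges = np.histogram_bin_edges(values, bins=rule)   # plug-in rule for bin edges
    print(f"{rule}: {len(edges) - 1} bins")
```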

Additionally, we consider how long the generalization interval lasts, as longer intervals make it easier to stop the training while within the generalization interval. The analysis of the length of the generalization interval is given in Appendix C, and it shows that using a regularization technique prolongs the interval.

4.2 Results for the Publicly Available Datasets

In this section, we compare the performance of five metrics (validation loss, validation accuracy, validation recall, key rank for the validation set, and \(I(T_{n};Y)\)) for selecting the epoch t at which the machine learning model achieves the best performance. Besides these five metrics, we also depict the results when we do not use an early stopping regime but rather allow the full number of training epochs. Those results are denoted as “GE all epochs” and “SR all epochs” for guessing entropy and success rate, respectively. Note that when giving results for guessing entropy and success rate, we conduct 100 key rank executions by randomly selecting attack traces from a larger set. The best success rate value equals 100%, and the best guessing entropy value is 1. When giving results with distributions (e.g., Fig. 6), we repeat the experiments 100 times, i.e., there are 100 training phases from which to build the distributions.

For the three tested datasets (ASCAD, DPAv4, and CHES CTF), the results are obtained by attacking one key byte in the first AES encryption round. For ASCAD, we give results for the HW and identity leakage models. For DPAv4, we give results in Appendix D, and we consider only the HW leakage model as the identity leakage model allows an easy attack where there are no significant differences among neural network architectures or validation metrics. We note that the results for DPAv4 are in line with the results for the other datasets. For CHES CTF, we consider only the Hamming weight leakage model as the available dataset contains only 43 000 traces, which is not enough to break the target in the identity leakage model.

We first conduct a tuning phase where we experiment with varying CNN and MLP architectures. We emphasize that we do not claim that the obtained architectures are optimal, as finding optimal architectures was not the goal of this work. The final neural network configurations use the Adam optimizer with a learning rate of 0.001. The initial weights are initialized using the random uniform method. The selected loss function is the categorical cross-entropy provided by the Keras library. All experiments were performed on a computer equipped with an Nvidia RTX 2060 GPU. The hyperparameters of the selected convolutional neural networks are listed in Table 1. For MLP, we use an architecture with five dense layers containing 600 neurons each. We verified that the selected MLP provides good results for the ASCAD dataset and that the selected CNNs provide strong results for all three considered datasets.
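
A Keras sketch of the MLP described above is given below; the hidden-layer activation function (ReLU here) and the input/output dimensions in the usage note are assumptions of this sketch, as the text specifies only the number of layers and neurons, the initializer, the optimizer, and the loss.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_mlp(n_features, n_classes):
    """Five dense layers of 600 neurons, random-uniform initialization,
    Adam (learning rate 0.001), and categorical cross-entropy, as described above."""
    model = Sequential()
    model.add(Dense(600, activation="relu", kernel_initializer="random_uniform",
                    input_shape=(n_features,)))
    for _ in range(4):
        model.add(Dense(600, activation="relu", kernel_initializer="random_uniform"))
    model.add(Dense(n_classes, activation="softmax"))
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Example dimensions for ASCAD with the HW leakage model: 1400 features, 9 classes.
# model = build_mlp(1400, 9)
```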

Table 1. Hyperparameters for CNNs.

ASCAD Random Keys Results. The empirical validation on the ASCAD dataset (key byte 3) considers 200 000 traces for training, 500 traces for validation, and 500 traces for testing. Both the validation and test sets have a fixed key. The selected CNN architecture is trained for 50 epochs. After identifying the best epoch for each of the five metrics, the corresponding machine learning models are applied to the test set. Note that we provide additional results about the best epoch based on the information path in Appendix B.

Figure 5 shows GE and SR on the test set obtained for each validation metric for the HW leakage model. From Fig. 5b, the best success rate is achieved when the machine learning model is selected from the epoch with the maximum value of \(I(T_{n};Y)\). More precisely, after processing around 460 traces, the success rate reaches 100% if the model is selected from the epoch determined by the maximum \(I(T_{n};Y)\) value. The lines “GE all epochs” and “SR all epochs” correspond to the results when evaluating GE and SR after processing all 50 epochs. We can see that those lines also depict the worst attack performance, as in those cases, due to too many training epochs, the machine learning models overfit and do not generalize to the test set. Figure 5a shows no significant differences among most of the metrics (except the scenario where we do not use early stopping), and for both SCA metrics, we see that mutual information works well and gives consistently strong attack performance.

Fig. 5. Results on ASCAD for the HW leakage model, CNN architecture.

Figure 6 shows the results for 100 experiments with the same CNN architecture. Figure 6a gives the \(I(T_n;Y)\) evolution over the processed epochs. On average, the highest \(I(T_{n};Y)\) values are achieved between epochs 20 and 30 (the highest at epoch 24), as indicated by the distribution in Fig. 6b. Figure 6c shows the test and validation GE results over the number of epochs (training phase) and the \(I(T_n;Y)\) metric. We see an interval (epochs between 8 and 26) where the key rank is low and, consequently, generalization is satisfactory. This indicates that GE reaches good values even before \(I(T_n;Y)\) becomes maximal. Still, allowing more epochs does increase \(I(T_n;Y)\) while keeping GE minimal. Figure 6d shows the distribution of the best epochs based on the validation key rank metric. This histogram contains the results of 100 experiments (with unchanged hyperparameters) and indicates that the best validation key rank may happen at different epochs. More precisely, we see the highest value already around epoch 12, which explains why the validation key rank is among the worst-performing metrics for both SR and GE. While this could sound counter-intuitive, there is a simple explanation for such behavior. As we use the validation set often (whenever evaluating whether to stop the training), the validation set indirectly influences the trained model. Consequently, it is possible to observe some differences when applying the trained models to the test set, which was never evaluated before. As we can observe, the \(I(T_n;Y)\) metric seems to be less sensitive to this issue, and as such, it represents a more suitable choice for an early stopping metric.

Fig. 6. Results on ASCAD for the HW leakage model, CNN architecture.

Next, we repeat the ASCAD dataset experiments in the HW leakage model, but with an MLP architecture. From Figs. 7a and 7b, \(I(T_n;Y)\) is the most successful metric, followed closely by loss. Again, if there is no early stopping, the neural network overfits, resulting in poor attack performance. Figure 8a indicates that \(I(T_n;Y)\) reaches its best performance between epochs 25 and 35. This is confirmed in Fig. 8b, where we observe the highest frequency at epoch 31. Considering the validation and test set behavior when using \(I(T_n;Y)\) to indicate stopping, several epochs give good behavior (from 10 to 35). This is in line with the CNN’s behavior, as GE can indicate a successful attack even before \(I(T_n;Y)\) reaches its maximal value. Finally, Fig. 8b gives insight into the performance of the validation key rank, where several epochs have high frequency, but the highest value happens around epoch 5, which is too early, as confirmed when evaluating the attack performance (Fig. 7, where the validation key rank performs significantly worse than \(I(T_n;Y)\)).

Fig. 7. Results on ASCAD for the HW leakage model, MLP architecture.

Fig. 8. Results on ASCAD for the HW leakage model, MLP architecture.

Next, we consider the identity leakage model for the ASCAD dataset. First, in Fig. 9, we depict the results for guessing entropy and success rate. The differences among attack performances are very small, but the mutual information metric gives good results for both guessing entropy and success rate. Here, not having the early stopping mechanism does not affect the attack performance. This behavior is expected, as due to more classes, the neural network needs more epochs to fit the data (and, naturally, to overfit).

Fig. 9. Results on ASCAD for the identity leakage model, CNN architecture.

Figure 10a displays the \(I(T_n;Y)\) evolution over 50 epochs. The mutual information increases with the number of epochs and reaches a steady level around epoch 47. This is confirmed in Fig. 10b, where we can indeed observe that epochs 47 to 49 give the best results. Considering GE, both the validation and test set values indicate strong performance after more than 15 epochs. Using the validation key rank as the early stopping metric shows several epochs as suitable for stopping the training process (Fig. 10d). Still, the two highest peaks are observed around epochs 32 and 48. As the validation key rank and \(I(T_n;Y)\) point to similar epochs to stop the training, the results in Fig. 9 are as expected: no significant difference in the attack performance.

Fig. 10. Results on ASCAD for the identity leakage model, CNN architecture.

CHES CTF Results. We consider 43 000 traces for the training set and 1 000 traces for the validation set. An additional 1 000 traces are used as a test set. These results were obtained from 100 training runs of the CNN configured with unchanged hyperparameters. Figure 11 shows the guessing entropy and success rate for the five considered metrics. We can observe that using the trained model at the epoch with the maximum \(I(T_{n};Y)\) provides the best success rate and guessing entropy (followed closely by the validation key rank). Retrieving the model at the epochs indicated by the best validation accuracy, loss, or recall leads to significantly worse SR and GE results. Similarly, if there is no early stopping, the attack performance is also poor.

Fig. 11. Results on CHES CTF for the HW leakage model, CNN architecture.

Figure 12a provides the mutual information value \(I(T_{n};Y)\) for every epoch of the training phase. The maximum \(I(T_{n};Y)\) is reached between epochs 10 and 18. Figure 12b gives a similar indication, with epoch 14 having the highest frequency. These results are confirmed in Fig. 12c, where epochs 10 to 15 have the lowest guessing entropy. After epoch 18, the neural network starts to lose its generalization capacity as it starts to overfit on the training set. On the other hand, before epoch 7, the generalization capacity is also, on average, poor, since the network is still in the fitting phase, where satisfactory generalization is not yet achieved (i.e., the network underfits). Finally, in Fig. 12d, the validation key rank gives similar results (and there are similar SCA metric results in Fig. 11). Still, the validation key rank again indicates stopping the training a little earlier than \(I(T_{n};Y)\).

Fig. 12. Results on CHES CTF for the HW leakage model, CNN architecture.

4.3 Discussion

When attacking a protected target, like the public databases consisting of first-order masked AES implementations, model generalization is very limited, and validation or test metrics are close to random guessing. For side-channel analysis, sufficient generalization is indicated by a low guessing entropy or a high success rate. As we can observe from the results given in Sect. 4, the trained model at each epoch provides different key rank results, and over-training easily leads to a deterioration of the model’s generalization. This problem can be addressed by using an appropriate metric to save the trained model at the epoch that provides the best SR or GE. Our experimental analysis shows that using the maximum value of \(I(T_n;Y)\) as a metric to select the model at the best epoch provides better success rate and guessing entropy results when compared to machine learning metrics like loss, recall, or accuracy. Our results show that \(I(T_n;Y)\) works especially well in settings where other metrics can have problems, such as the Hamming weight leakage model, which suffers from data imbalance. The \(I(T_n;Y)\) metric works even better than the validation key rank, where we notice that the validation key rank indicates stopping the training somewhat earlier. Based on the obtained results, we give several observations for deep learning-based SCA:

  1. It is necessary to implement early stopping regularization.

  2. Early stopping based on mutual information consistently gives the best results.

  3. The validation key rank seems to be somewhat more conservative in its estimate than the mutual information.

  4. GE reaches good values even before \(I(T_n;Y)\) reaches its maximum value. However, when \(I(T_n;Y)\) has reached its maximum value, we notice that the model produces the most stable attack behavior.

  5. The mutual information metric, although computationally intensive, is “lighter” than the computational effort required for calculating GE or SR at each epoch.

  6. For simple datasets, various metrics provide “good enough” results. However, for complex datasets, the mutual information metric gives superior results.

5 Conclusions and Future Work

This paper demonstrates that using the mutual information \(I(T_{n};Y)\) between the output layer activations (i.e., the output Softmax probabilities) and the true labels of a validation set leads to better generalization on separate test sets. We compared the \(I(T_{n};Y)\) metric against conventional machine learning metrics (accuracy, recall, and loss), and we verified that mutual information can be a more reliable metric to detect an epoch at which the trained neural network is inside the generalization interval.

In future work, we plan to investigate the mutual information metric as a reference for selecting other hyperparameters. Additionally, we would like to investigate this metric’s behavior when the traces contain misalignment and, consequently, generalization is more difficult. Such an analysis is also essential for improving the portability of trained deep neural networks for side-channel attacks.