
1 Introduction

Recently, deep learning techniques have proved to be a powerful option for profiled side-channel analysis (SCA) as they 1) do not need a pre-processing phase to select points of interest, and 2) perform well even in the presence of noise and countermeasures [3, 8]. Although the application of deep learning for profiled SCA has become popular, many open questions remain, such as the selection of hyperparameters for successful side-channel attacks. For hyperparameters like architectural details (e.g., the number of layers and neurons), recent works provided directions to follow [16, 29, 30]. At the same time, the selection of the right moment to stop the training phase is left to instinct (and sometimes luck), as there does not seem to be a direct connection between machine learning metrics and the performance of a side-channel attack, as discussed later [14]. While the lack of a clear connection may not be deemed crucial, the end of the training phase is typically decided based on metrics like accuracy, loss, recall, and performance on a validation set. If the training phase finishes too late, the machine learning model overfits. As a consequence of overfitting, the model will not generalize to unseen data, and the attack phase will fail.

One way to analyze the generalization of a deep neural network is through the lens of information theory. In [19], the authors proposed a new methodology to interpret the training of a multilayer perceptron (MLP) through a theory called the Information Bottleneck (IB) [25]. They demonstrated that the training of an MLP exhibits two distinct phases: fitting and compression. These phases are determined by computing the mutual information between the intermediate representations (activations from hidden layers) and the input (raw data) or output (labels). The output of a hidden layer can be seen as a summary statistic containing information about the input and output. The fitting phase is usually very fast, requiring only a few epochs, while the compression phase lasts longer. The compression phase is also responsible for the neural network’s generalization, i.e., its ability to perform on unseen data. We consider the mutual information between the output layer activations and the data labels (as given by the leakage model) as a metric to identify the epoch when the neural network achieves its optimal generalization capacity. Our results show that training a network for too many epochs harms generalization, and early stopping based on the mutual information metric is a reliable technique to avoid this scenario. We test our metric against three masked AES implementations and show that it provides superior results compared to typical metrics like accuracy, recall, or loss.

To the best of our knowledge, this is the first result providing a reliable attack performance metric different from conducting an actual attack (key ranking). While key ranking is a reliable validation metric to optimize the generalization of a deep neural network for side-channel attacks [3], it brings significant computational overhead when using large validation sets. On the other hand, mutual information offers remarkable performance at a fraction of the computational cost as it does not have to be computed for all key hypotheses. To facilitate reproducible research, we will make the source code publicly available.

1.1 Related Works

Since the first paper considering convolutional neural networks for SCA [10], deep learning has gained significant recognition in the SCA community as an important direction for profiled SCA. Despite good results, even when considering protected targets [3, 8, 13, 30], open questions remain. For instance, progress on topics like interpretability and explainability of neural networks is difficult in general, and as such, there are not many results for SCA [11, 26, 27].

A second important research direction explored by the SCA community is how to find optimal architectures, i.e., tune hyperparameters [28,29,30]. Common options for hyperparameter exploration are the number and types of layers and neurons, activation functions, and the number of epochs. Determining the number of training epochs is relevant for any application domain but is a particularly challenging problem in SCA since machine learning metrics are known to be poor indicators of SCA performance [14]. Indeed, during the machine learning training process, we aim to minimize the loss, which often appears to be inversely proportional to accuracy, but there is no mathematical relationship between those two metrics. What is more, in side-channel analysis, the goal is to break the target with as few attack traces as possible, which again does not have a direct mathematical relationship with accuracy (or other common machine learning metrics like precision and recall). This gap is a consequence of the context of the problem more than the profiling method. Indeed, it is common in machine learning applications to discuss the classification or prediction of single observations, for which accuracy, precision, and recall are natural metrics. By contrast, the most standard (and powerful) SCAs are continuous DPAs, for which information-theoretic metrics are more reflective since they are correlated with the complexity of these attacks (in the number of traces). Common options are the use of early stopping or a predetermined number of epochs for training [7, 8, 15, 30]. In both cases, as the performance is observed through machine learning metrics, it is difficult to know if the training stopped at the right moment.

To evaluate the resilience against side-channel attacks, two different types of metrics are required. Mutual information has long been established in the context of side-channel evaluations as the metric of choice for evaluating the quality of an implementation [21]. Mutual information has two interesting properties: first, it is independent of the adversary, and second, it has the same meaning for any implementation or countermeasure. It is, however, advisable to complement an information-theoretic metric with a security metric (success rate or guessing entropy) that captures the success of an adversary in exploiting such information [23]. Conditional entropy has also been used to compare the strength of profiled attacks [22]. One limitation when applying information-theoretic metrics to real measurements is statistical sampling, which makes estimating the true statistical distribution impossible. As a solution, [2] proposed easy-to-compute bounds: Perceived Information (lower bound) and Hypothetical Information (upper bound). These metrics have been shown to work well under the assumption that the target variable has a uniform probability distribution. Perceived Information has also been shown to be asymptotically equivalent to the Negative Log-Likelihood when used as a loss function during the training of a neural network [12]. Although the maximization of perceived information from [12] is a consistent objective in deep learning-based profiled side-channel analysis, using the PI value to estimate, e.g., the best epoch to stop the training directly implies that the loss function value is an adequate metric for this task, which unfortunately is not a general observation in deep learning-based profiled SCA. As observed in [14], inconsistency between the loss function and SCA metrics occurs often for several deep learning models, especially in the presence of the class imbalance problem. Still, as long as a metric embeds a sum of log probabilities, the intuition that it represents a good predictor of continuous attacks (and, therefore, is good for model comparisons) should hold to a good extent. To circumvent this problem, Robissout et al. [17] conducted a success rate evaluation for every epoch of training and validation. This solution leads to a significant time overhead for a large number of attack traces. More importantly, there is no consensus on calculating the key rank for training sets with random keys, as best practice recommends.

1.2 Contributions

This work provides two main contributions as follows:

  1. A new application of mutual information as a metric to select the epoch where a deep network achieves its best performance for profiled side-channel analysis. We show that the information about the labels transferred to the output layer can be measured and used as a reliable metric to determine when to stop the training phase. Our approach offers four distinct improvements compared to existing results. First, the mutual information metric is precise and can accurately predict the epoch at which the network achieves its best performance, while the alternative, validation key ranking, will give a range of values for the best epoch. Second, the computational overhead for computing the information transferred to the output layer is much smaller than performing a full attack to obtain the key ranking at the end of each epoch. Third, mutual information is suitable for training sets with randomized keys, as recommended by best practices, while, as far as the authors are aware, there is no consensus on calculating the key ranking for this type of training set. Fourth, we extend the approach of [19], where the authors show how to calculate the information path for MLP architectures, to CNN architectures.

  2. The use of mutual information consistently offers good performance. We test it on three publicly available datasets, using two leakage models and two different architectures. In the experimental section, we thoroughly compare its performance to four conventional metrics: validation loss, validation accuracy, validation recall, and validation key rank. We conclude that in all cases, the use of mutual information leads to better generalization.

2 Background

2.1 Deep Learning in the Context of Side-Channel Analysis

In the profiled scenario, we assume the adversary has full control of a device identical to the targeted one. Thus, we consider the supervised learning task, i.e., learning a function f mapping an input to the output (\(f: \mathcal X \rightarrow Y\)) based on examples of input-output pairs. The examples come from a dataset divided into three parts: a profiling set consisting of N traces, a validation set consisting of V traces, and an attack set consisting of Q traces.

Once we learn the function f, the goal of the attack phase is to make predictions about the classes

$$\begin{aligned} y(x_{1},k^*), \ldots , y(x_{Q},k^*), \end{aligned}$$

where \(k^*\) represents the secret (unknown) key of the device under the attack.

In profiled SCA, the trained neural network is tested against Q side-channel traces for which the secret key is unknown, and the key recovery methodology assumes that the correct key is the one that maximizes the sum of log probabilities \(S_{k}\) for each key byte candidate:

$$\begin{aligned} S_{k} = \sum _{i=1}^{Q} \log (p_{i,j}). \end{aligned}$$
(1)

The value (i.e., the probability) \(p_{i,j}\) is an element of a matrix P of size number of traces \(\times \) number of classes. This matrix gives the output class probabilities obtained by applying the model f to a test or validation set. Thus, \(p_{i,j}\) is the probability of class j obtained as a function of the attack trace \(x_i\), the leakage model l, and the input data \(p_{k_i}\) for every possible key guess k: \(f_k(x_i, p_{k_i}, l)\). In the context of neural networks, \(p_{i,j}\) is the neuron’s activation value for trace i from the Softmax output layer.
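
To make Eq. (1) concrete, the following sketch computes \(S_k\) for all key-byte guesses from the probability matrix P. The helper label_for_guess, the identity leakage model on the S-box output, and the small constant added before the logarithm are illustrative assumptions of this sketch, not the exact evaluation code used in this work.

```python
import numpy as np

# Hypothetical helper: map a plaintext byte and a key guess to the class label,
# here assuming the identity leakage model on the S-box output.
def label_for_guess(plaintext_byte, key_guess, sbox):
    return sbox[plaintext_byte ^ key_guess]

def key_guess_scores(P, plaintexts, sbox):
    """Compute S_k (Eq. 1) for all 256 key-byte guesses.

    P          : (Q, n_classes) matrix of Softmax output probabilities.
    plaintexts : (Q,) array with the plaintext byte of each attack trace.
    """
    Q = P.shape[0]
    eps = 1e-36                         # avoid log(0)
    scores = np.zeros(256)
    for k in range(256):
        labels = [label_for_guess(plaintexts[i], k, sbox) for i in range(Q)]
        scores[k] = np.sum(np.log(P[np.arange(Q), labels] + eps))
    return scores                       # the best key candidate maximizes the score
```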

A usual approach to assess the attacker’s performance is to use metrics that denote the number of measurements required to obtain the secret key \(k^*\). Common examples of such metrics are guessing entropy (GE) and success rate (SR) [23]. Guessing entropy represents the average number of key candidates an adversary needs to test to determine the correct key \(k^*\) after conducting a side-channel analysis. More specifically, given Q traces in the attack phase, an attack outputs a key guessing vector \(g= [g_1,g_2,...,g_{|K|}]\) in decreasing order of probability. The guessing entropy is the average position of \(k^*\) in g over several experiments (commonly 50 or 100). The success rate is defined as the average empirical probability that \(g_1\) equals the secret key \(k^*\).
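
The sketch below shows how GE and SR could then be estimated from repeated attacks on random subsets of traces, as described above; it assumes the hypothetical key_guess_scores helper from the previous sketch, and the default numbers of experiments and attack traces are placeholders.

```python
import numpy as np

def ge_and_sr(P, plaintexts, sbox, k_star, n_experiments=100, q_attack=500, seed=0):
    """Estimate guessing entropy and success rate over repeated attacks.

    Each experiment draws q_attack traces at random, scores all key guesses with
    key_guess_scores (sketched above), and records the rank of the true key k_star.
    """
    rng = np.random.default_rng(seed)
    ranks = []
    for _ in range(n_experiments):
        idx = rng.choice(P.shape[0], size=q_attack, replace=False)
        scores = key_guess_scores(P[idx], plaintexts[idx], sbox)
        g = np.argsort(scores)[::-1]                # guesses in decreasing order of score
        ranks.append(int(np.where(g == k_star)[0][0]) + 1)
    ranks = np.asarray(ranks)
    guessing_entropy = ranks.mean()                 # average position of k* in g
    success_rate = np.mean(ranks == 1)              # fraction of experiments with g_1 = k*
    return guessing_entropy, success_rate
```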

For a deep learning-based SCA to be successful, the trained model must generalize well, which means as small an error as possible on the test set. Next, we define the generalization interval in SCA.

Definition 1

Generalization Interval in SCA. Given \((X_{train}, Y_{train})\) where \(X_{train}\) represents the input training data and \(Y_{train}\) represents a set of training labels, the generalization interval defines the epochs where a successful key recovery can be obtained using Eq. (1).

Techniques used to reduce the test error are commonly known as regularization techniques, among which early stopping is one of the best known [6]. Early stopping works under the assumption that, once the neural network has achieved its best generalization, it will start to overfit and its generalization will deteriorate, which is undesired behavior. Consequently, determining the pre-specified number of iterations (epochs) is a hyperparameter selection process.

2.2 Datasets

We consider three publicly available datasets, instances of software AES implementations protected with the first-order Boolean masking. The first one is the ASCAD database [15]. We use the trace set where the plaintext and key are randomly defined for each separate encryption. This trace set is used as a training set. A second fixed-key trace set is split into validation and test sets. We attack the third key byte. In this dataset, each trace contains 1 400 features. This dataset is available at https://github.com/ANSSI-FR/ASCAD/tree/master/ATMEGA_AES_v1/ATM_AES_v1_variable_key.

The second dataset is the DPA Contest v4 (DPAv4) [24]. The DPAv4 dataset provides traces collected from an AES-256 RSM (Rotating S-box Masking) implementation. This dataset has a fixed key, and we attack the first key byte. Each trace consists of 2 000 features. This dataset is available at http://www.dpacontest.org/v4/.

The third dataset refers to the CHES Capture-the-flag (CTF) masked AES-128 encryption trace set, released in 2018 for the Conference on Cryptographic Hardware and Embedded Systems (CHES). In our experiments, the training set has a fixed key that is different from the key configured for the validation and test sets. Each trace consists of 2 200 features. We attack the first key byte. This dataset is available at https://chesctf.riscure.com/2018/news.

2.3 Information Theory

In information theory, the (marginal) entropy H(X) of a random variable X is defined as the average information obtained by observing X, and it can be quantitatively defined as:

$$\begin{aligned} H(X) = -\sum _{x \in X}p(x)\log _2 p(x), \end{aligned}$$
(2)

where p(x) represents the probability of variable X taking value x. The conditional entropy of X given Y, which represents the entropy of X when Y is known, is defined as:

$$\begin{aligned} H(X|Y) = -\sum _{y \in Y} p(y)\sum _{x \in X}p(x|y)\log _2 p(x|y). \end{aligned}$$
(3)

Finally, mutual information defines the dependence between variables X and Y, and it can be defined using entropy and conditional entropy values as:

$$\begin{aligned} I(X;Y) = H(X) - H(X|Y). \end{aligned}$$
(4)
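
As a small numerical illustration of Eqs. (2)–(4), the sketch below computes H(X), H(X|Y), and I(X;Y) directly from a hypothetical joint distribution of two binary variables; the distribution is chosen only for illustration.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows indexed by x, columns by y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)                                    # marginal p(x)
p_y = p_xy.sum(axis=0)                                    # marginal p(y)

H_x = -np.sum(p_x * np.log2(p_x))                         # Eq. (2)
p_x_given_y = p_xy / p_y                                  # columns are p(x|y)
H_x_given_y = -np.sum(p_y * np.sum(p_x_given_y * np.log2(p_x_given_y), axis=0))  # Eq. (3)
I_xy = H_x - H_x_given_y                                  # Eq. (4)
print(H_x, H_x_given_y, I_xy)                             # 1.0, ~0.72, ~0.28 bits
```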

An important property of mutual information in this context is the Data Processing Inequality (DPI), which states that for any three variables X, Y, Z forming a Markov chain \(X\rightarrow Y\rightarrow Z\), the mutual information can only decrease along the chain, i.e., \(I(X;Y)\ge I(X;Z)\).

3 Information Theory of Deep Neural Networks

Shwartz-Ziv and Tishby [19] showed that information theory could be used to visualize the training phase of a deep network and to compare the performance of different network architectures. Intuitively, when training a network, each layer gets its information from the layer before and transforms it through a matrix multiplication followed by a nonlinear function. Their insight was to treat each layer (the hidden activation functions) in the deep network as a random variable fully described by the information it captures about the input data and the labels. Modeling each layer in the deep network as a random variable gives an alternative view of a deep network as a Markov chain. Each variable represents the nonlinear activation function, which successively transforms the input data into the label space. Using the mutual information between the layers, the input data, and the labels, we can visualize the input data’s transformation into the label space.

Definition 2

Information Path. Given a tuple \((X, Y, T_i)\) where X represents the input data, Y represents a set of labels, and \(T_i\) is a hidden layer in an n-layered network, described as \(X,Y\rightarrow T_1 \rightarrow ...\rightarrow T_n\), the information path is defined as the set of points \(\{[I(X;T_i), I(T_i;Y)] \,|\, i\in \{1,\ldots,n\}\}\).

The information path is a record of the information each hidden layer preserves about the input data X and the output variables Y. It is typically computed for each epoch during the training phase. The information is plotted in a two-dimensional coordinate system referred to as the information plane. The coordinates of the information plane quantify the bits of information layer \(T_i\) has about the input data X as \(I(X; T_i)\), and the bits of information layer \(T_i\) has about the labels Y as \(I(T_i;Y)\). We can view the variable \(T_i\) as a compressed representation of the input X, and \(I(X; T_i)\), computed from the joint distribution \(p(x)p(t_i|x)\), measures how compact the representation of X is. The maximum value for \(I(X;T_i)\) is H(X), the Shannon entropy of X, corresponding to the case where \(T_i\) copies X and there is no compression. The minimal value for \(I(X;T_i)\) is 0 and corresponds to the case where \(T_i\) takes a single value.

Lemma 1

Information Path Uniqueness. For each tuple \((X, Y, T_i)\), where X represents the input data, Y represents a set of labels, and \(T_i\) is a layer in an n-layered network described as \(X,Y\rightarrow T_1 \rightarrow ...\rightarrow T_n\), there exists a unique information path that satisfies the following two inequalities:

$$\begin{aligned} H(X)\ge I(X;T_1)\ge ...\ge I(X;T_n)\ge I(X; \hat{Y}), \end{aligned}$$
(5)

where \(\hat{Y}\) are the labels predicted by the network, and

$$\begin{aligned} I(X;Y) \ge I(T_1;Y) \ge \ldots \ge I(T_n;Y)\ge I(\hat{Y};Y). \end{aligned}$$
(6)

Proof

The proof for this lemma follows immediately by applying the DPI principle. \(\square \)

3.1 Information Bottleneck Principle

Shwartz-Ziv and Tishby observed that stochastic gradient descent (SGD) optimization defines two distinct phases during training [19]. The first one is the fitting phase, where both \(I(X;T_i)\) and \(I(T_i;Y)\) increase fast as the training progresses. During the fitting phase, the deep network layers increase information about the input data and the labels. The second phase is the compression phase, where the network starts to compress or forget information about the input data and slowly increases its generalization capacity by retaining more information about the labels.

The network’s behavior during the compression phase has been linked to the form of the activation functions [4]. The compression happens due to a random diffusion-like behavior of the SGD algorithm when a double-sided saturating nonlinear activation function such as Tanh is employed. More precisely, Shwartz-Ziv and Tishby provided results for Tanh and showed how the information about the labels increases in the compression phase [19]. On the other hand, Saxe et al. demonstrated that non-saturating activation functions like ReLU exhibit a different behavior in the compression phase, as there is no causal connection between generalization and compression [18].

Figure 2 gives an overview of the information path of the deep network architecture described in Fig. 1 at epochs 1, 20, 100, and 200 during the training process. Each plot contains the coordinates \([I(X;T_{i}),I(T_{i};Y)]\), where i represents the i-th layer. The information is captured from the five hidden layers (\(T_{2:6}\)) plus the input layer (\(T_{1}\)) and the output layer (\(T_{7}\)).

Fig. 1. MLP with five hidden layers (\(T_{2:6}\)). The letter “D” denotes a dense (fully-connected) layer, while \(T_{1}\) and \(T_{7}\) correspond to the input and output layers, respectively. The dense layers use the Tanh activation function; the output layer \(T_{7}\) uses the Softmax activation function. The numbers under the layers indicate the number of neurons.

For the above example, we see that the information changes only for the last two hidden layers, \(T_5\) and \(T_6\), and the output layer \(T_7\). The plot contains mutual information results for twenty training experiments (i.e., twenty dots for each layer). These results demonstrate that at the beginning of the training phase, the mutual information quantities \([I(X;T_{i}),I(T_{i};Y)]\) are at a minimum level for the hidden layers \(T_5\) and \(T_6\) and for the output layer \(T_7\). As the training progresses (epochs 20 and 100), the mutual information values increase until \([I(X;T_{i}),I(T_{i};Y)]\) reach their maximum for all layers, including the output layer. If we continue the training process, the compression phase starts, as \(I(X;T_{i})\) starts to decrease while \(I(T_{i};Y)\) stays at a maximum level. In Fig. 2, this information path is clearly observed for the hidden layers \(T_5\) and \(T_6\) and the output layer \(T_7\). The values obtained for layers \(T_1\) to \(T_4\) change very little during training, which is difficult to visualize due to the scale of Fig. 2.
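
For readers who want to reproduce such plots, the sketch below illustrates one way the coordinates \([I(X;T_i), I(T_i;Y)]\) could be estimated from the activations of a layer at the end of an epoch. It follows the binning approach of [19]: Tanh activations are discretized into equal-width bins, each distinct binned pattern is treated as one value of the discrete variable \(T_i\) (so that \(I(X;T_i)=H(T_i)\) for the deterministic mapping), and \(I(T_i;Y)\) is obtained by conditioning on the labels. The number of bins and the NumPy-based implementation are assumptions of this sketch; the estimator used for our metric is discussed in Sect. 4.1.

```python
import numpy as np

def discretize(activations, n_bins=30, low=-1.0, high=1.0):
    """Bin Tanh activations into equal-width bins and hash each row to one id."""
    bins = np.linspace(low, high, n_bins + 1)
    digitized = np.digitize(activations, bins)
    # Each distinct binned activation pattern becomes one value of the variable T_i.
    _, t_ids = np.unique(digitized, axis=0, return_inverse=True)
    return t_ids

def entropy(ids):
    _, counts = np.unique(ids, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_plane_point(activations, labels, n_bins=30):
    """Estimate [I(X;T_i), I(T_i;Y)] for one layer at one epoch.

    Since T_i is a deterministic function of the input X, I(X;T_i) = H(T_i);
    I(T_i;Y) = H(T_i) - H(T_i|Y) is estimated by conditioning on the labels.
    """
    t_ids = discretize(activations, n_bins)
    h_t = entropy(t_ids)
    h_t_given_y = sum((labels == y).mean() * entropy(t_ids[labels == y])
                      for y in np.unique(labels))
    return h_t, h_t - h_t_given_y
```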

Fig. 2. The information flow captured at epochs 1, 20, 100, and 200 for the network architecture depicted in Fig. 1 when using the DPAv4 dataset (training set).

Fig. 3. Information plane for the DPAv4 dataset (training set).

Figure 3 shows the evolution of the information path over all training epochs for the DPAv4 dataset and the same architecture given in Fig. 1. The training evolution exhibits two distinct phases (fitting and compression), as discussed in [19]. In the first phase, the layers (most visible for hidden layers \(T_{4:6}\) and the output layer \(T_{7}\)) are fitting the training data: the information an inner state (layer) \(T_i\) carries about the input X and the output Y increases. In the second phase, the output information stays high, but the input information starts to decrease. From the figure, it is clear that the second phase starts before epoch 100.

3.2 Information Path for Side-Channel Analysis Data

To assess whether a machine learning model generalizes well, we commonly check its performance on previously unseen data, i.e., a validation set. Similarly, to investigate the generalization from the information path, we must assess its behavior on the validation set. More precisely, we aim to find the generalization interval (Definition 1) in terms of the information path, as formalized in Definition 3.

Definition 3

SCA Generalization Interval via Information Path.

Given a tuple \((X_{train}, Y_{train}, T_i)\) where \(X_{train}\) represents the input training data, \(Y_{train}\) represents a set of training labels, and \(T_i\) is a hidden layer in an n-layered neural network, the generalization interval for SCA defines the interval of training epochs where the quantities \([I(X;T_{i}),I(T_{i};Y)]\) reach the maximal values and we can obtain successful key recovery with Eq. (1) by predicting on the dataset \(X_{test}\) with the trained neural network.

The results in Fig. 4 show that it is possible to observe a different “movement” of the points \([I(X;T_i),I(T_i;Y)]\) in the information path when using the validation set. The fitting phase is clearly seen, as \(I(X;T_i)\) and \(I(T_i;Y)\) increase during the first epochs (for the validation set, this movement is observable for the hidden layer \(T_{6}\) and the output layer \(T_{7}\)).

The compression phase differs from the one observed in the information plane of Fig. 3. Now, the points \([I(X;T_i),I(T_i;Y)]\) reach their maximum value for each hidden layer and, later, both quantities decrease as more epochs are processed. This indicates overfitting of the trained machine learning model. More specifically, this happens because the generalization in difficult SCA settings (i.e., masked or protected AES) is minimal when expressed in terms of deep learning metrics (accuracy, loss, recall). At the same time, we aim to capture the machine learning model at the epoch when the best possible generalization occurs. From our observations, the epoch at which the generalization is optimal is given by the moment when \(I(T_n;Y)\) reaches its maximum value.

Fig. 4. Information plane for the DPAv4 dataset (validation set).

3.3 Improving the Generalization in Deep Learning-Based SCA

Recall that for a deep learning-based side-channel analysis to be successful, the trained model must generalize well to previously unseen data (validation/test set). Given a deep neural network defined by a set of hyperparameters \(\theta \), the internal representations \(T_{i}\), \(i \in \{1,\ldots,n\}\) (where \(T_{1}\) and \(T_{n}\) are the input and output layers, respectively), should inform about the labels Y and the input X [1].

As observed from Figs. 3 and 4, we stop the training when we reach the maximum value for \(I(T_n;Y)\). As such, we assume:

  1. During the training, the intermediate representation \(T_{i}\) will be compressed to estimate Y correctly.

  2. The intermediate representation \(T_{i}\) should be robust, such that a small addition of noise does not affect this compressed internal representation.

  3. Only the information transferred to the output layer of the network is important for measuring the generalization [4].

Our investigation suggests that the maximum value of \(I(T_n;Y)\) for the output layer occurs when the fitting phase is finished and the compression has started. This means that the training does not need to go through the full compression phase to achieve the best generalization. This is in line with the findings of Shwartz-Ziv and Tishby, who observed that the beginning of the compression phase coincides with the best generalization [19], and of Saxe et al., who showed empirical results demonstrating that the compression phase does not necessarily improve generalization [18].

In essence, the value of \(I(T_i;Y)\) should increase for all hidden layers as the training progresses and reach a maximum value at the end of the fitting phase. This directly means that the information path helps to indicate the optimal number of training epochs for all hidden layers, especially the outer layers. As the experiments in Sect. 4 suggest, taking the maximum value of \(I(T_n;Y)\) from the output layer (where \(T_n\) are the output class probabilities after Softmax) provides an efficient early stopping metric for profiled side-channel analysis.

The calculation of \(I(T_n;Y)\) introduces minimal overhead during the training process since we only need to compute it on a small fraction of the validation set. We estimate that the time overhead to compute \(I(T_n;Y)\) at the end of each epoch is less than 2%.
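
A minimal sketch of how this early stopping criterion could be implemented as a Keras callback is given below. The mi_estimator argument is assumed to be a histogram-based estimator of \(I(T_n;Y)\), such as the one sketched in Sect. 4.1; the callback keeps the weights of the epoch with the highest validation \(I(T_n;Y)\) and restores them at the end of training.

```python
import numpy as np
from tensorflow.keras.callbacks import Callback

class MutualInformationEarlyStopping(Callback):
    """Track I(T_n;Y) on a validation set and keep the weights of the best epoch."""

    def __init__(self, x_val, y_val, mi_estimator):
        super().__init__()
        self.x_val, self.y_val = x_val, y_val
        self.mi_estimator = mi_estimator          # assumed histogram-based MI estimator
        self.best_mi, self.best_epoch, self.best_weights = -np.inf, -1, None

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.x_val, verbose=0)   # Softmax output T_n
        mi = self.mi_estimator(probs, self.y_val)
        if mi > self.best_mi:
            self.best_mi, self.best_epoch = mi, epoch
            self.best_weights = self.model.get_weights()

    def on_train_end(self, logs=None):
        if self.best_weights is not None:                    # restore best-epoch model
            self.model.set_weights(self.best_weights)
```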

4 Experimental Validation

4.1 Estimating Mutual Information

The first step in calculating mutual information, Eq. (4), is density estimation [9], which aims to construct an approximation of the density function (denoted with p in Eqs. (2) and (3)) using observed data. There are two main approaches to density estimation. The first is parametric, where we assume the observed data are drawn from a known family of distributions (e.g., the normal distribution), while the second, non-parametric, does not assume a distribution of the observed data. We consider the non-parametric approach more suitable for our setting as we have little information about the underlying data distribution.

Common approaches for non-parametric estimation are simple discretization methods such as equal interval binning (or histogram estimator), which divides the observed data into equal-sized bins, or equal frequency binning, which divides the observed data into bins with an equal number of samples [5]. The price of this generic approach is a user-supplied parameter such as the bin width in the case of equal-sized bins or the number of samples in each bin for frequency-based binning. A variation of the discretization techniques described above is the kernel density estimator, where a kernel function replaces the “box” of the bin estimators. The user-supplied parameter is the kernel bandwidth, and its choice significantly impacts the estimator’s performance.
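
As an illustration of equal interval binning, the sketch below estimates \(I(T_n;Y)\) from the Softmax output of the network on a validation set; the 100-bin default reflects the choice discussed below, and treating every distinct binned probability vector as one value of \(T_n\) is an assumption of this sketch. It could serve as the mi_estimator assumed by the callback sketched in Sect. 3.3.

```python
import numpy as np

def histogram_mutual_information(probabilities, labels, n_bins=100):
    """Histogram (equal interval binning) estimate of I(T_n;Y).

    probabilities : (n_traces, n_classes) Softmax output on the validation set.
    labels        : (n_traces,) integer class labels given by the leakage model.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    digitized = np.digitize(probabilities, bins)
    # Every distinct binned probability vector is one value of the variable T_n.
    _, t_ids = np.unique(digitized, axis=0, return_inverse=True)

    def entropy(ids):
        _, counts = np.unique(ids, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    h_t = entropy(t_ids)
    h_t_given_y = sum((labels == y).mean() * entropy(t_ids[labels == y])
                      for y in np.unique(labels))
    return h_t - h_t_given_y                      # I(T_n;Y) = H(T_n) - H(T_n|Y)
```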

Finding the optimal value for the user-supplied parameter is nontrivial but important, as its value directly affects the estimator error. The quality of an estimator is evaluated by its bias and variance, and generic expressions for all estimators mentioned above are known [20]. However, such expressions require as input the distribution from which the data are observed, which in our case is not known.

Adaptive estimators use a recursive algorithm to determine the optimal bin width [4]. Such methods use entropy to measure the disorder in the observed data and determine the bin width that minimizes the entropy function over all possible thresholds. Although they offer good performance, we found this method unsuitable for our purpose, as it is very time-consuming. Our investigation showed that it would be faster to calculate the key rank for every epoch than to use entropy-based binning.

As there is no optimal solution for non-parametric estimation and low computational overhead is an important requirement in our case, we decided to use the histogram estimator. We thoroughly tested its reliability by running two types of experiments. In the first, we simply considered all possible values for the bin width. In the second, we considered different known rules for determining the bin width. For both experiments, we consider all three datasets described in Sect. 2.2 and provide detailed results in Appendix A.

We conclude from the first experiment that although the histogram estimator’s bin width impacts the achieved performance, the mutual information metric is stable, and we observe the same behavior for several values of this parameter. More precisely, any value between 25 and 170 gives similar results. Therefore, we selected 100 bins, as it lies in the middle of this interval. This observation holds for both the Hamming weight and the identity leakage models.

In the second experiment, we use plug-in estimates to determine the bin width. These estimates make assumptions about the distribution of the observed data and are known as empirical rules for determining the bin width. The results are depicted in Appendix A, where we can observe that the bin widths obtained with different rules result in very similar attack performance. Our experiments confirm that the mutual information metric is not very sensitive to the histogram bin size, making it a robust procedure for practical applications where one needs to consider different datasets, leakage models, and neural network architectures. Based on this observation, our strategy is to choose the bin value as the average value with good performance over all tested scenarios.
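
For illustration, the following sketch compares a few empirical bin-width rules as implemented in NumPy; the synthetic data here are only a stand-in for the flattened Softmax outputs of a validation set, and the rules evaluated in Appendix A may differ from this selection.

```python
import numpy as np

# Synthetic stand-in: 500 "traces" with 9 classes (HW leakage model), flattened.
rng = np.random.default_rng(0)
values = rng.dirichlet(np.ones(9), size=500).ravel()

for rule in ("sturges", "scott", "fd", "doane"):
    edges = np.histogram_bin_edges(values, bins=rule)   # plug-in rule for bin edges
    print(f"{rule}: {len(edges) - 1} bins")
```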

Additionally, we consider how long the generalization interval lasts, as longer intervals make it easier to stop the training while within the generalization interval. The analysis of the length of the generalization interval is given in Appendix C, and it shows that using a regularization technique prolongs the interval.

4.2 Results for the Publicly Available Datasets

In this section, we compare the performance of five metrics (validation loss, validation accuracy, validation recall, key rank for the validation set, and \(I(T_{n};Y)\)) for selecting the epoch t at which the machine learning model achieves the best performance. Besides these five metrics, we also depict the results when we do not use an early stopping regime but rather allow the full number of training epochs. Those results are denoted as “GE all epochs” and “SR all epochs” for guessing entropy and success rate, respectively. Note that when giving results for guessing entropy and success rate, we conduct 100 key rank executions by randomly selecting attack traces from a larger set. The best success rate value equals 100%, and the best guessing entropy value is 1. When giving results with distributions (e.g., Fig. 6), we repeat the experiments 100 times, i.e., there are 100 training phases from which to build the distributions.

For the three tested datasets (ASCAD, DPAv4, and CHES CTF), the results are obtained by attacking one key byte in the first AES encryption round. For ASCAD, we give results for the HW and identity leakage models. For DPAv4, we give results in Appendix D, and we consider only the HW leakage model as the identity leakage model allows an easy attack where there are no significant differences among neural network architectures or validation metrics. We note that the results for DPAv4 are in line with the results for the other datasets. For CHES CTF, we consider only the Hamming weight leakage model as the available dataset contains only 43 000 traces, which is not enough to break the target in the identity leakage model.

We first conduct a tuning phase where we experiment with varying CNN and MLP architectures. We emphasize that we do not claim that the obtained architectures are optimal, as finding optimal architectures was not the goal of this work. The final neural network configurations use the Adam optimizer with a learning rate of 0.001. The initial weights are initialized using the random uniform method. The selected loss function is the categorical cross-entropy provided by the Keras library. All experiments were performed on a computer equipped with an Nvidia RTX 2060 GPU. The hyperparameters of the selected convolutional neural networks are listed in Table 1. For MLP, we use an architecture with five dense layers containing 600 neurons each. We verified that the selected MLP provides good results for the ASCAD dataset and that the selected CNNs provide strong results for all three considered datasets.
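
A Keras sketch of the MLP described above is given below; the hidden-layer activation function (ReLU here) and the input/output dimensions in the usage note are assumptions of this sketch, as the text specifies only the number of layers and neurons, the initializer, the optimizer, and the loss.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_mlp(n_features, n_classes):
    """Five dense layers of 600 neurons, random-uniform initialization,
    Adam (learning rate 0.001), and categorical cross-entropy, as described above."""
    model = Sequential()
    model.add(Dense(600, activation="relu", kernel_initializer="random_uniform",
                    input_shape=(n_features,)))
    for _ in range(4):
        model.add(Dense(600, activation="relu", kernel_initializer="random_uniform"))
    model.add(Dense(n_classes, activation="softmax"))
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Example dimensions for ASCAD with the HW leakage model: 1400 features, 9 classes.
# model = build_mlp(1400, 9)
```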

Table 1. Hyperparameters for CNNs.

ASCAD Random Keys Results. The empirical validation on the ASCAD dataset (key byte 3) considers 200 000 traces for training, 500 traces for validation, and 500 traces for testing. Both the validation and test sets have a fixed key. The selected CNN architecture is trained for 50 epochs. After identifying the best epoch for each of the five metrics, the corresponding machine learning models are applied to the test set. Note that we provide additional results about the best epoch based on the information path in Appendix B.

Figure 5 shows GE and SR on the test set obtained for each validation metric for the HW leakage model. From Fig. 5b, the best success rate is achieved when the machine learning model is selected from the epoch with the maximum value of \(I(T_{n};Y)\). More precisely, after processing around 460 traces, the success rate reaches 100% if the model is selected from the epoch determined by the maximum \(I(T_{n};Y)\) value. The lines “GE all epochs” and “SR all epochs” correspond to the results when evaluating GE and SR after processing all 50 epochs. We can see that those lines also depict the worst attack performance, as in those cases, due to too many training epochs, the machine learning models overfit and do not generalize to the test set. Figure 5a shows no significant differences among most of the metrics (except the scenario where we do not use early stopping), and for both SCA metrics, we see that mutual information works well and gives consistently strong attack performance.

Fig. 5. Results on ASCAD for the HW leakage model, CNN architecture.

Figure 6 shows the results for 100 experiments with the same CNN architecture. Figure 6a gives the \(I(T_n;Y)\) evolution over the processed epochs. On average, the highest \(I(T_{n};Y)\) values are achieved between epochs 20 and 30 (the highest at epoch 24), as indicated by the distribution in Fig. 6b. Figure 6c shows the test and validation GE results over the number of epochs (training phase) and the \(I(T_n;Y)\) metric. We see an interval (epochs between 8 and 26) where the key rank is low and, consequently, generalization is satisfactory. This indicates that GE reaches good values even before \(I(T_n;Y)\) becomes maximal. Still, allowing more epochs does increase \(I(T_n;Y)\) while keeping GE minimal. Figure 6d shows the distribution of the best epochs based on the validation key rank metric. This histogram contains the results of 100 experiments (with unchanged hyperparameters) and indicates that the best validation key rank may happen at different epochs. More precisely, we see the highest value already around epoch 12, which explains why the validation key rank is among the worst-performing metrics for both SR and GE. While this could sound counter-intuitive, there is a simple explanation for such behavior. As we use the validation set often (whenever evaluating whether to stop the training), the validation set indirectly influences the trained model. Consequently, it is possible to observe some differences when applying the trained models to the test set, which was never evaluated before. As we can observe, the \(I(T_n;Y)\) metric seems to be less sensitive to this issue, and as such, it represents a more suitable choice for an early stopping metric.

Fig. 6. Results on ASCAD for the HW leakage model, CNN architecture.

Next, we repeat the ASCAD dataset experiments in the HW leakage model, but with an MLP architecture. From Figs. 7a and 7b, \(I(T_n;Y)\) is the most successful metric, followed closely by loss. Again, if there is no early stopping, the neural network overfits, resulting in poor attack performance. Figure 8a indicates that \(I(T_n;Y)\) reaches its best performance between epochs 25 and 35. This is confirmed in Fig. 8b, where we observe the highest frequency at epoch 31. Considering the validation and test set behavior when using \(I(T_n;Y)\) to indicate stopping, several epochs give good behavior (from 10 to 35). This is in line with the CNN’s behavior, as GE can indicate a successful attack even before \(I(T_n;Y)\) reaches its maximal value. Finally, Fig. 8b gives insight into the performance of the validation key rank, where several epochs have high frequency, but the highest value happens around epoch 5, which is too early, as confirmed when evaluating the attack performance (Fig. 7, where the validation key rank performs significantly worse than \(I(T_n;Y)\)).

Fig. 7. Results on ASCAD for the HW leakage model, MLP architecture.

Fig. 8. Results on ASCAD for the HW leakage model, MLP architecture.

Next, we consider the identity leakage model for the ASCAD dataset. First, in Fig. 9, we depict the results for guessing entropy and success rate. The differences among attack performances are very small, but the mutual information metric gives good results for both guessing entropy and success rate. Here, not having the early stopping mechanism does not affect the attack performance. This behavior is expected, as due to more classes, the neural network needs more epochs to fit the data (and, naturally, to overfit).

Fig. 9. Results on ASCAD for the identity leakage model, CNN architecture.

Figure 10a displays the \(I(T_n;Y)\) evolution over 50 epochs. The mutual information increases with the number of epochs and reaches a steady level around epoch 47. This is confirmed in Fig. 10b, where we can indeed observe that epochs 47 to 49 give the best results. Considering GE, both the validation and test set values indicate strong performance after more than 15 epochs. Using the validation key rank as the early stopping metric shows several epochs as suitable for stopping the training process (Fig. 10d). Still, the two highest peaks are observed around epochs 32 and 48. As the validation key rank and \(I(T_n;Y)\) point to similar epochs to stop the training, the results in Fig. 9 are as expected: no significant difference in the attack performance.

Fig. 10. Results on ASCAD for the identity leakage model, CNN architecture.

CHES CTF Results. We consider 43 000 traces for the training set and 1 000 traces for the validation set. An additional 1 000 traces are used as a test set. These results were obtained from 100 training runs of the CNN configured with unchanged hyperparameters. Figure 11 shows the guessing entropy and success rate for the five considered metrics. We can observe that using the trained model at the epoch with the maximum \(I(T_{n};Y)\) provides the best success rate and guessing entropy (followed closely by the validation key rank). Retrieving the model at the epochs indicated by the best validation accuracy, loss, or recall leads to significantly worse SR and GE results. Similarly, if there is no early stopping, the attack performance is also poor.

Fig. 11. Results on CHES CTF for the HW leakage model, CNN architecture.

Figure 12a provides the mutual information value \(I(T_{n};Y)\) for every epoch of the training phase. The maximum \(I(T_{n};Y)\) is reached between epochs 10 and 18. Figure 12b gives a similar indication, with epoch 14 having the highest frequency. These results are confirmed in Fig. 12c, where epochs 10 to 15 have the lowest guessing entropy. After epoch 18, the neural network starts to lose its generalization capacity as it starts to overfit on the training set. On the other hand, before epoch 7, the generalization capacity is also, on average, poor, since the network is still in the fitting phase, where satisfactory generalization is not yet achieved (i.e., the network underfits). Finally, in Fig. 12d, the validation key rank gives similar results (and there are similar SCA metric results in Fig. 11). Still, the validation key rank again indicates stopping the training a little earlier than \(I(T_{n};Y)\).

Fig. 12. Results on CHES CTF for the HW leakage model, CNN architecture.

4.3 Discussion

When attacking a protected target, like the public databases consisting of first-order masked AES implementations, model generalization is very limited, and validation or test metrics are close to random guessing. For side-channel analysis, sufficient generalization is indicated by a low guessing entropy or a high success rate. As we can observe from the results given in Sect. 4, the trained model at each epoch provides different key rank results, and over-training easily leads to a deterioration of the model’s generalization. This problem can be addressed by using an appropriate metric to save the trained model at the epoch that provides the best SR or GE. Our experimental analysis shows that using the maximum value of \(I(T_n;Y)\) as a metric to select the model at the best epoch provides better success rate and guessing entropy results when compared to machine learning metrics like loss, recall, or accuracy. Our results show that \(I(T_n;Y)\) works especially well in settings where other metrics can have problems, such as the Hamming weight leakage model, which suffers from data imbalance. The \(I(T_n;Y)\) metric works even better than the validation key rank, where we notice that the validation key rank indicates stopping the training somewhat earlier. Based on the obtained results, we give several observations for deep learning-based SCA:

  1. It is necessary to implement early stopping regularization.

  2. Early stopping based on mutual information consistently gives the best results.

  3. The validation key rank seems to be somewhat more conservative in its estimate than the mutual information.

  4. GE reaches good values even before \(I(T_n;Y)\) reaches its maximum value. However, when \(I(T_n;Y)\) has reached its maximum value, we notice that the model produces the most stable attack behavior.

  5. The mutual information metric, although computationally intensive, is “lighter” than the computational effort required for calculating GE or SR at each epoch.

  6. For simple datasets, various metrics provide “good enough” results. However, for complex datasets, the mutual information metric gives superior results.

5 Conclusions and Future Work

This paper demonstrates that using the mutual information \(I(T_{n};Y)\) between the output layer activations (i.e., the output Softmax probabilities) and the true labels of a validation set leads to better generalization on separate test sets. We compared the \(I(T_{n};Y)\) metric against conventional machine learning metrics (accuracy, recall, and loss), and we verified that mutual information can be a more reliable metric to detect an epoch at which the trained neural network is inside the generalization interval.

In future work, we plan to investigate the mutual information metric as a reference for selecting other hyperparameters. Additionally, we would like to investigate this metric’s behavior when the traces contain misalignment and, consequently, generalization is more difficult. Such an analysis is also essential for improving the portability of trained deep neural networks for side-channel attacks.