
1 Introduction

The design of a measurement setup is the first step in the evaluation of a cryptographic implementation against side-channel analysis. Due to its physical nature, this step inherently carries hard-to-quantify risks of security overstatements. Noisy setups may indeed lead evaluators to conclude that the measurements are less informative than they actually are, and this gap is then amplified when a countermeasure aiming at noise amplification, like masking [7, 13] or shuffling [15, 29], is implemented. Surprisingly, and although papers focused on practical side-channel attacks usually describe how they optimized their setups, especially when targeting challenging real-world devices [3, 21], very few works are dedicated to the systematic evaluation of measurement setups and the impact of their optimization on security evaluations. Besides, and to the best of our knowledge, the most advanced (published) investigations of this topic were performed in specific settings, such as the exploitation of static leakages, as recently investigated by Moos et al. [19], or the evaluation of physical effects such as couplings that reduce a masked implementation’s security order [10, 11, 16]. But when it comes to the impact of measurement setups on the noise level in the context of (standard) attacks exploiting the dynamic part of the leakage, the only works we are aware of are the one of Guilley et al., which puts forward the Signal-to-Noise Ratio (SNR) as a meaningful metric to quantify the quality of side-channel acquisitions [14], and the one of Merino del Pozo and Standaert, which discusses the impact of different setups in the context of leakage detection [22]. In this respect, and although these references are important first steps in specifying relevant comparison metrics and highlighting the existence of an interesting design space, they still have a limited scope: [14] estimates its proposed (univariate) metric for a single measurement setup, while [22] compares different analog amplifiers and filters for a single probing method.

Recognizing that the design space of measurement setups is broader than investigated in these previous works, this paper aims at analyzing four important parameters of actual measurement setups. Namely, our goal is to discuss and evaluate the impact of the probing method used in the setup, the clock frequency of the Device Under Test (DUT), its supply voltage and the sampling rate of the oscilloscope used to collect the measurements. We study these parameters systematically for two DUTs: a software (ARM Cortex) target and a hardware (Xilinx FPGA) one. We additionally evaluate the effect of these parameters with both univariate evaluation metrics, like the SNR, and multivariate evaluation metrics, like the Perceived Information (PI).

We then use our investigations to extract useful observations regarding how to select the parameters of our design space. While most of these observations are admittedly present (implicitly or explicitly) in former experimental works, we hope that their compilation for two different devices, together with the quantitative analysis of the losses a poor measurement setup may imply for security evaluators (which may reach orders of magnitude), makes a useful consolidating effort.

2 Background

We will use Mangard’s SNR [17] to evaluate the quality of first-order and univariate leakages, as suggested by Guilley et al., and the PI metric analyzed in [6] to evaluate the quality of higher-order or multivariate leakages. For the latter we profile Gaussian templates in a linear subspace. We next recall these different evaluation metrics and detail the profiling tools used in our analyses.

2.1 Mangard’s SNR

Introduced in the context of side-channel analysis by Mangard, the SNR intuitively captures the data-dependent signal as the variance of the mean traces and the noise as the mean of the variance traces, for each time sample [17]. As a result, for a target intermediate variable y, it is defined as the ratio:

$$\begin{aligned} \hat{\text {SNR}} = \frac{\hat{\mathsf {Var}}_{y}\left( \hat{\mathsf {E}}_{i} \left( l_{i}^{y}\right) \right) }{ \hat{\mathsf {E}}_{y} \left( \hat{\mathsf {Var}}_{i} \left( l_{i}^{y} \right) \right) } , \end{aligned}$$
(1)

where \(\hat{\mathsf {Var}}\) and \(\hat{\mathsf {E}}\) are the sample variance and the sample mean estimated on \(l_{i}^{y} \in \mathcal {L}\), which represents the i-th side-channel observation generated by a target variable y. It must be pointed out that the noise in Mangard’s definition is the result of two contributions. First, physical noise is due to physical phenomena (e.g., thermal noise, flicker noise) and electrical conditions (e.g., impedance mismatch, unwanted coupling with unrelated equipment). Second, algorithmic noise is due to the presence of operations that are independent of the target ones and are processed in parallel to them (i.e., at the same time). As argued by Guilley et al., it is a good metric for assessing the quality of side-channel measurements to be exploited by first-order univariate attacks [14], since it can be related to the complexity of popular attacks such as the Correlation Power Analysis (CPA) and (univariate Gaussian) Template Attacks (TA) [5, 8, 18].
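
As an illustration, the following minimal sketch shows how this estimator of Eq. (1) can be computed with numpy, assuming hypothetical arrays `traces` of shape (n, samples) and `y` holding the corresponding values of the target intermediate variable:

```python
import numpy as np

def snr(traces, y):
    """Mangard's SNR (Eq. 1), estimated independently for each time sample."""
    classes = np.unique(y)
    means = np.array([traces[y == c].mean(axis=0) for c in classes])
    varis = np.array([traces[y == c].var(axis=0) for c in classes])
    # Signal: variance (over y) of the mean traces.
    # Noise: mean (over y) of the variance traces.
    return means.var(axis=0) / varis.mean(axis=0)
```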

2.2 Subspace Based Gaussian Templates

Gaussian template attacks are a standard method to exploit multivariate leakages [8]. We combine them with a dimensionality reduction step in order to reduce the possibly high number of informative dimensions d of the leakage traces to a lower value \(d'<d\). The profiling consists of an estimation, using n leakage traces \(\boldsymbol{l}\), of the parameters \(\boldsymbol{\mu _x}\), \(\boldsymbol{\varSigma _x}\) and \(\boldsymbol{W}\) of a Probability Density Function (PDF) of the form:

$$\begin{aligned} \mathrm {\tilde{m}_n}(\boldsymbol{l} | x) = \frac{1}{\sqrt{(2\pi )^{d'} \cdot |\boldsymbol{\varSigma _x} |}} \cdot \exp \left( -\frac{1}{2} (\boldsymbol{W}\boldsymbol{l} - \boldsymbol{\mu }_x)\, \boldsymbol{\varSigma _x}^{-1} (\boldsymbol{W}\boldsymbol{l}-\boldsymbol{\mu }_x)' \right) , \end{aligned}$$
(2)

where x is the value of the profiled variable, \(\boldsymbol{\mu _x}\) the mean vector of length \(d'\), \(\boldsymbol{\varSigma _x}\) the covariance matrix of size \(d'\times d'\) and \(\boldsymbol{W}\) the projection matrix of size \(d'\times d\). This projection matrix is determined thanks to Linear Discriminant Analysis (LDA) [25]. LDA aims to find the subspace that maximizes the inter-class variance (i.e., the signal of Mangard’s SNR) and minimizes the intra-class variance (i.e., the noise of Mangard’s SNR). In practice, we applied this dimensionality reduction to all the samples with sufficient SNR (with d ranging from 30 to 500 depending on the case) and usually kept a dozen dimensions for \(d'\). Next, in the online attack phase, the likelihood of x is obtained by applying Bayes’ law to the leakage models estimated beforehand such that:

$$\begin{aligned} \mathrm {\tilde{m}_n}(x|\boldsymbol{l}) = \frac{\mathrm {\tilde{m}_n}(\boldsymbol{l}|x)}{ \sum _{x^* \in \mathcal {X}} \mathrm {\tilde{m}_n}(\boldsymbol{l}|x^*) }. \end{aligned}$$
(3)

The estimated PDF and the likelihood of the profiled variable can then be used to calculate the amount of information contained in the leakages.
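
As a minimal sketch (assuming hypothetical profiling arrays `traces` and `x_prof`, and attack traces `atk_traces`), the profiling and likelihood computations of Eqs. (2) and (3) can be implemented as follows:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def profile_templates(traces, x_prof, d_prime=10):
    """Fit the LDA projection W and the per-class Gaussian templates."""
    lda = LinearDiscriminantAnalysis(n_components=d_prime)
    proj = lda.fit_transform(traces, x_prof)            # W * l for all traces
    classes = np.unique(x_prof)
    mus = {c: proj[x_prof == c].mean(axis=0) for c in classes}
    sigmas = {c: np.cov(proj[x_prof == c], rowvar=False) for c in classes}
    return lda, mus, sigmas

def likelihoods(lda, mus, sigmas, atk_traces):
    """Evaluate m_n(l|x) for each candidate x (Eq. 2) and normalize (Eq. 3)."""
    proj = lda.transform(atk_traces)
    classes = sorted(mus)
    pdf = np.array([multivariate_normal.pdf(proj, mus[c], sigmas[c])
                    for c in classes]).T
    return pdf / pdf.sum(axis=1, keepdims=True)
```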

2.3 Information Theoretic Metrics and Bounds

For higher-order or multivariate attacks, the SNR metric is not directly applicable and a more general information theoretic metric has to be used. In the context of side-channel attacks, the Mutual Information (MI) is the most frequently considered candidate [26]. It generalizes the SNR in the sense that it can be related to the complexity of worst-case higher-order & multivariate attacks [9, 12] (and it is essentially equivalent to the SNR in the first-order univariate case [18]). However, as recently discussed in [6], estimating the MI is in general a hard problem: known estimators are biased and distribution-dependent, so that perfect estimations would require the exact knowledge of the leakage distribution. As a workaround, the authors of [6] proposed the use of the previously introduced PI metric, which represents the amount of information that can be extracted from a device thanks to the adversary’s model, possibly biased due to estimation and assumption errors. For a target secret variable X with leakage variable \(\boldsymbol{L}\), and denoting the leakage model \(\tilde{\text {m}}_n(\boldsymbol{l}|x)\) as described in the previous section, the PI is expressed as:

$$\begin{aligned} \hat{\text {PI}}_n (X;\boldsymbol{L}) = \text {H}(X) + \sum _{x \in \mathcal {X}} \mathsf {p}(x) \sum _{\boldsymbol{l} \in \mathcal {L}} \mathsf {p}(\boldsymbol{l}|x) \cdot \text {log}_{2} (\mathrm {\tilde{m}_n}(x|\boldsymbol{l})), \end{aligned}$$
(4)

with \(\text {H}(X)\) the Shannon entropy of the target variable \(X \in \mathcal {X}\). The PI is a lower bound on the worst-case MI, with equality in case the adversary’s model is perfect. It can be viewed as the amount of information extracted by the best practical attack tried by an evaluator. Concretely, the PI is usually estimated with k-fold cross-validation, and we used \(k=10\) in our following experiments.
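
For concreteness, the following sketch outlines such a cross-validated estimation of Eq. (4), reusing the hypothetical profile_templates and likelihoods helpers from the previous sketch and assuming the target values in `x` are integers in {0, ..., |X|-1}:

```python
import numpy as np
from sklearn.model_selection import KFold

def perceived_information(traces, x, d_prime=10, k=10):
    n_bits = np.log2(len(np.unique(x)))          # H(X), assuming X is uniform
    logs = []
    for train, test in KFold(n_splits=k, shuffle=True).split(traces):
        lda, mus, sigmas = profile_templates(traces[train], x[train], d_prime)
        post = likelihoods(lda, mus, sigmas, traces[test])
        # log2 probability assigned to the correct value of each test trace.
        p_correct = post[np.arange(len(test)), x[test]]
        logs.append(np.log2(np.maximum(p_correct, 1e-300)))   # avoid -inf
    return n_bits + np.mean(np.concatenate(logs))
```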

3 Setup Model and Design Space

We now introduce our model and design space for measurement setups, together with the two devices we adopted to conduct our investigations.

3.1 Setup Model

The setup model is illustrated in Fig. 1. Its goal is to highlight important parameters for the informativeness of the leakages, such as the probing method, the DUT’s parameters and the Digital Storage Oscilloscope (DSO)’s parameters. As reported in [27], the choice of those components and how they interact with each other has a sizeable impact on the final outcome of the practical side-channel security evaluation of a leaking implementation. A bit more precisely, the current absorbed by the DUT is first monitored by a probe, whose role is to convert the current signal into a voltage signal. This signal can then be amplified using a preamplifier stage in order to increase its magnitude, to mitigate noise in the measurements and to improve the electrical characteristics for the following blocks. At the end of the so-called measurement chain, a DSO samples and quantizes the analog voltage signal, converting it into a digital representation. Usually, the sampling operation follows a specific timing that exploits a trigger signal in order to synchronize the different measurements. The precise design space that we will consider for each block of the model is detailed later.

We note that our investigations do not consider the question of filtering, which we view as an orthogonal one, since it can be performed after the measurements have taken place in order to compensate for a too noisy setup.

Fig. 1. Measurement setup model for power analysis evaluation.

3.2 Platforms

In our investigations, we have used two devices in order to cover both hardware and software implementations of cryptographic algorithms. This choice is motivated by the expected differences between the two types of targets. For example, hardware implementations generally allow better control of the design aspects (from the level of parallelism to low-level implementation choices), while software implementations are usually more general-purpose and serial.

Hardware DUT. Our target hardware DUT is a Xilinx Spartan-6 LX75 FPGA, mounted on a Sakura-G board, implementing an AES-128 processor with a 32-bit architecture. It is illustrated in Fig. 2. In order to provide synchronization between measurements, we generate a trigger signal on one of the IO pins of the FPGA, raised to logical ‘1’ one cycle before the start of the encryption and set back to ‘0’ one cycle after the end of the AES encryption. We used the board’s integrated measurement point for our measurements.

Fig. 2. Architecture of the 32-bit AES encryption co-processor.

Software DUT. Our target software implementation runs on a Cortex-M0 MCU from the STM32F0308 Discovery board. Small modifications were made to the board: we added a crystal oscillator to provide a stable clock source for the measurements and desoldered the decoupling capacitors. The MCU runs tiny-AES [2], an open source AES-128 implementation. We used the same trigger methodology as for the hardware DUT. Our measurements were performed on the dedicated current measurement point for the MCU.

3.3 Design Space

We explored our design space and DUTs by testing the following parameters:

Regarding the probing methodology, we used both a \(2\,\Omega \) resistor in series with the power supply and an inductive probe (the Tektronix CT-1, which gives a transresistance of 5 mV/mA in the 25 kHz–1 GHz frequency range [1]). When using the CT-1 current probe, the shunt resistor was short-circuited. We optionally used a preamplifier, namely a R&S HZ16 [24], providing a gain of 20 dB with a noise figure of 4.5 dB in the 100 kHz–3 GHz frequency range.

Next, the DUT’s clock frequency is an important macroscopic feature of a side-channel trace, since it usually reflects the frequency spectrum where leakage can be found. We chose three clock frequency values (1 MHz, 6 MHz and 24 MHz) for the hardware DUT and three clock frequencies (4 MHz, 24 MHz and 48 MHz) for the software DUT. Note that 48 MHz is the maximal clock frequency of the device. Those sets of values were chosen to observe the impact of the clock frequency on the shape and distinguishability of the leakage cycles.

Similarly, we chose three power supply voltages (0.8 V, 1.2 V and 1.4 V) for the hardware DUT and three (2.6 V, 3.0 V and 3.6 V) for the software DUT. Those sets of values were chosen in order to observe the impact of working at the nominal supply voltage vs. in the minimum and maximum corner cases.

Finally, we used a Picoscope 6424E, providing a vertical resolution of 12 bits and running at three different sampling rates, as DSO. We chose the sampling rate values according to the clock frequency of the given DUT, to analyze the impact of the collected number of samples per clock cycle (which impacts the acquisition bandwidth and memory requirements). Precisely, we set the sampling rate of the DSO at approximately \(\times 1\), \(\times 5\) and \(\times 25\) the chosen DUT’s clock frequency. We note that the sampling rate of our DSO is not an exact integer multiple of the DUT’s clock frequency, as this may induce correlated noise in the measurements.

In total, we performed \(4 \times 3 \times 3 \times 3 =108\) experiments on each DUT. In each experiment, we targeted the first key byte of the first AES round and collected \(4 \times 10^6\) traces for the hardware DUT and \(10^5\) for the software one. Both the input plaintexts and keys were picked uniformly at random, in order to stimulate the combinational and sequential logic of both platforms.
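
To make the procedure concrete, the sketch below walks through the hardware DUT’s design space in the way just described; acquire() is a hypothetical helper that configures the bench and returns the traces and target values for one parameter set, and snr() is the estimator sketched in Sect. 2.1:

```python
from itertools import product

PROBES = ["shunt", "shunt+amp", "ct1", "ct1+amp"]   # 4 probing options
CLOCKS_MHZ = [1, 6, 24]      # hardware DUT (4, 24, 48 for the software one)
VDD_V = [0.8, 1.2, 1.4]      # hardware DUT (2.6, 3.0, 3.6 for the software one)
RATE_FACTORS = [1, 5, 25]    # sampling rate ~ factor x clock frequency

peak_snr = {}
for probe, clk, vdd, factor in product(PROBES, CLOCKS_MHZ, VDD_V, RATE_FACTORS):
    # The small offset illustrates avoiding exact integer multiples of the clock.
    traces, y = acquire(probe=probe, clock_mhz=clk, vdd=vdd,
                        rate_mhz=factor * clk * 1.02)
    peak_snr[(probe, clk, vdd, factor)] = snr(traces, y).max()
```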

4 Experimental Results and Discussion

In this section, we present the results of our analyses for the proposed metrics throughout our design space. We first introduce the set of plots (e.g., for the SNR and PI) that summarize our experiments and will be the basis of our discussions. We then extract the best configurations for the measurement setup of both platforms. We finally propose general guidelines for the design of good measurement setups. Given the granularity of the explored design space, we organize this discussion according to the setup model in Sect. 3. We also evaluate the relevance of univariate evaluation metrics as predictors of multivariate ones.

Figure 3 shows the highest SNR value we found for each set of parameters for both platforms (in logarithmic scale). We present the results in the form of a matrix where the X-axis contains the different power supply values and sampling speeds, and the Y-axis contains the DUT clock frequencies and the probing method used in each experiment. The thick orange lines delimit the probing methods on the X-axis and the sampling speeds on the Y-axis. Darker blue blocks represent setup parameters where the SNR is higher. Figure 4 shows a similar matrix for the PI values obtained after LDA, in order to evaluate the impact of setup choices from a multivariate attack perspective.
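
A sketch of how such a summary matrix could be rendered from the hypothetical peak_snr dictionary of the previous section (rows spanning probe and clock frequency, columns spanning supply voltage and sampling rate, on a logarithmic colour scale) is given below:

```python
import numpy as np
import matplotlib.pyplot as plt

rows = [(p, c) for p in PROBES for c in CLOCKS_MHZ]     # probe x clock
cols = [(v, f) for v in VDD_V for f in RATE_FACTORS]    # vdd x sampling rate
mat = np.array([[peak_snr[(p, c, v, f)] for (v, f) in cols] for (p, c) in rows])

plt.imshow(np.log10(mat), cmap="Blues")                 # darker = higher SNR
plt.xticks(range(len(cols)), [f"{v}V\nx{f}" for v, f in cols])
plt.yticks(range(len(rows)), [f"{p}\n{c}MHz" for p, c in rows])
plt.colorbar(label="log10(peak SNR)")
plt.show()
```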

Fig. 3. Peak SNR values observed for the hardware (a) and software (b) DUTs.

Fig. 4. Peak PI values observed for the hardware (a) and software (b) DUTs.

A bit more in detail, the SNRs in Fig. 3 were calculated on the whole leakage trace and the maximum value was then taken. In the hardware case, as shown in Fig. 3a, the best SNR is obtained using the CT-1 current probe combined with the amplifier, setting the DUT to the slowest clock speed and lowest supply voltage, and sampling at the highest rate. In the software case, as shown in Fig. 3b, the differences are more subtle and many sets of parameters give a peak SNR value close to the best one. The latter is obtained using the resistor combined with the amplifier, setting the DUT to the highest clock speed and sampling at the lowest rate (contrary to the hardware case), while still using the lowest supply voltage (like in the hardware case).

Regarding our multivariate analysis, we calculated the PI for each set of parameters. Concretely, for each experiment independently, we first pre-selected samples based on the SNR traces, keeping the ones above the noise floor for profiling. We then built Gaussian templates combined with LDA as presented in Sect. 2.2. We next analyzed the impact of the \(d'\) parameter, trying \(d'=1\) up to 25 for the hardware platform and 50 for the software one. We finally kept the \(d'\) leading to the highest PI, which is reported in Fig. 4.
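
A minimal sketch of this selection, reusing the hypothetical snr() and perceived_information() helpers of Sect. 2 and an assumed noise_floor threshold, could look as follows:

```python
import numpy as np

# Keep only the samples whose SNR stands above the (assumed) noise floor.
poi = np.where(snr(traces, x) > noise_floor)[0]

# Sweep d' and keep the subspace dimension maximizing the cross-validated PI.
best_d = max(range(1, 26),   # up to 50 for the software platform
             key=lambda d: perceived_information(traces[:, poi], x, d_prime=d))
```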

For the hardware platform’s results reported in Fig. 4a, we observe that the experiment leading to the highest PI is obtained with the same set of parameters that leads to the highest SNR in Fig. 3a. Generally speaking, comparing the two metrics, the PI follows the same trend as the SNR in this case. For the software platform’s results reported in Fig. 4b, we see that the effect of the probing method and the power supply voltage is negligible, which differs from the univariate SNR analysis of Fig. 3b. By contrast, the observations regarding the clock frequency and sampling rate remain similar to those of the univariate case. We also note that our highest PI value is 7.96 bits for an 8-bit bus.

As a complement, Fig. 5 depicts exemplary SNR and leakage traces for both platforms, corresponding to the best cases in Fig. 3. The SNR traces are in the upper subplots and mean leakage traces in the lower subplots.

Fig. 5. Exemplary SNR and leakage traces: hardware (left) and software (right).

These experimental investigations lead to the following observations:

Probe. The choice of a probe was more critical for the hardware platform than for the software one in our experiments. In the hardware case, the inductive probe gave better results than the resistor. A plausible explanation is that the CT-1 interferes less with the side-channel signal and is intrinsically less noisy than a shunt resistor. So as long as the target leakage is covered by the probe’s bandwidth, it seems to be a good choice. In the software case, both the inductive probe and the resistor gave good results, presumably due to the easier-to-exploit measurements (reflected by the higher SNR and PI values). As for the use of the amplifier, it does not show a significant impact since, in most of our design space, the signal that we sample is within the vertical range of our DSO.

We posit that the observation regarding the inductive (CT-1) probe could change if targeting higher clock frequencies, and the observation regarding the amplifier could change if targeting more advanced technologies or a side-channel signal with lower amplitude (e.g., an electromagnetic one).

Clock Frequency. This parameter is in general important for side-channel analysis. Whenever it can be controlled by the adversary, both our hardware and software results suggest the same rule-of-thumb: “use the highest available clock frequency such that independent clock cycles are easy to distinguish”.

We first illustrate this rule-of-thumb with Fig. 6. It shows the traces we recorded with the best parameter set and varying clock frequencies for the FPGA platform. At 1 MHz, the independent peaks of each clock cycle are clearly distinguishable. At higher clock frequencies, the leakage traces are smoother and the overlap between clock cycles in the measurements increases.

Fig. 6. Clock frequency effect on the hardware setup.

We next turn to the software case study to explain the first part of the rule-of-thumb (i.e., why it is not advisable to reduce the clock frequency unconditionally). In this respect, we first note that for this software DUT, the clock cycles were clearly distinguishable even at the maximum clock frequency (so the second part of the rule-of-thumb was fulfilled). In this case, the best SNR and PI values are observed for the higher clock frequencies. We explain this effect by observing that not all the samples in a clock cycle are equally informative. During an MCU clock cycle, most of the dynamic power is drawn right after the rising edge of the clock, as the effect of the registers changing state. The leakage from the remainder of the clock cycle is mostly due to static power and is usually less informative [20, 23]. Therefore, the interest of decreasing the clock frequency can turn detrimental once conditioned on the sampling rate.

More precisely, and as illustrated in Fig. 7, decreasing the clock frequency can lead the collected samples (represented by red diamonds in the figure) to correspond mostly to the static part of the leakage, and to miss the information of the dynamic part (represented by the green rectangles of the figure). Overall, this can lead to a collection of samples that is less informative: the univariate SNR can be lower by missing the most informative sample and the multivariate PI can be lower by cumulatively covering less relevant samples.

Fig. 7. Sampling with different DUT clock frequencies.

\(\mathbf{V} _\mathbf{DD} \). Although less decisive than the clock frequency, the supply voltage also affects the shape of the leakage traces: lowering it increases the critical path delay and therefore spreads the information over more samples. This naturally improves the multivariate PI when lowering the supply voltage below the nominal one. Interestingly, we also observed that for both targets and most sets of other parameters, decreasing VDD is also beneficial to the univariate SNR.

A plausible reason for this observation is that both devices are based on CMOS technology (even though from different technology nodes and manufacturers) which generally exhibits smoother transient current when VDD is lower than nominal, due to reduced transconductance of digital cells. This can reduce both the signal and, here more dominantly, the noise of the leakage.

We note that this observation is admittedly technology-dependent: see [28] for a report on several technology nodes. It is also not unconditional: as reported in the same paper, the output noise of a digital cell in the subthreshold regime (which corresponds to extremely low VDD values) is not minimal, as transistors exhibit a higher resistance and thus contribute more to the noise level. So overall, our conclusion regarding the VDD parameter is that reducing it below the nominal value can have a marginal interest, especially for multivariate attacks, but is not expected to lead to significant gain/loss factors.

Sampling Rate. This parameter is especially critical for the cost of the attacks as it affects the memory requirements to store the leakage measurements.

On the one hand, its selection is related to the clock frequency: as is generally the case when quantizing signals, the sampling rate should at least exceed the Nyquist rate. This requirement was confirmed in our experiments, and proved to be more critical (resp., less critical) in the hardware case (resp., software case). This is presumably due to the smaller (resp., larger) number of less (resp., more) informative samples in the hardware (resp., software) case.

On the other hand, in the context of side-channel analysis, a natural question is whether increasing the sampling rate significantly beyond the Nyquist rate can be useful. Namely, can it lead to more powerful multivariate attacks? By testing sampling rates of \(\times 1\), \(\times 5\) and \(\times 25\) the clock frequency, we observed that collecting more samples helps only to a limited extent. In particular, both for the software and the hardware platforms, the gains when moving from \(\times 1\) to \(\times 5\) the clock frequency are more significant than when moving from \(\times 5\) to \(\times 25\), again with more incentive to increase the sampling rate in the hardware case than in the software case. A plausible reason for this difference is once more the more condensed and noisy nature of the hardware leakage (i.e., the fact that it is concentrated in fewer cycles with more algorithmic noise, rather than spread over more cycles as in software).

Univariate vs Multivariate Evaluations. Eventually, our results indicate that whether the SNR is a good predictor of the PI is quite case-dependent.

If the SNR traces present a single peak or a set of peaks that are close to each other (e.g., within one cycle), they usually indicate correlated leakage coming from a single operation. In this case, which typically corresponds to our hardware experiments (see the left part of Fig. 5), a good univariate SNR will generally be a good indicator of a good multivariate PI. Multivariate attacks will always be more powerful but the SNR can serve as a first-order comparison metric.

By contrast, if the SNR traces contain multiple peaks separated by several clock cycles, they rather indicate independent leakage coming from different operations. In this case, which typically corresponds to our software experiments (see the right part of Fig. 5), multivariate attacks are expected to be significantly more powerful than univariate ones. So the direct estimation of the multivariate PI is in general a better (i.e., more conclusive) evaluation strategy.

5 Conclusions

This study aims at evaluating the risk of over-estimating the physical security of an implementation due to inadequate parameter choices when configuring a measurement setup. We focus on four main parameters: the probing method, the clock frequency, the power supply voltage and the sampling rate. We apply our methodology to an embedded software implementation and a hardware FPGA implementation of the AES-128 block cipher. This leads to 108 experiments for each DUT, which we analyze by means of univariate and multivariate evaluation metrics, namely the SNR and the PI. Our findings show that the losses due to a bad selection of parameters can be significant and lead to a strong over-estimation of an implementation’s security level. We also use our experiments in order to consolidate general intuitions and recommendations regarding the good choice of parameters, and to discuss their device and architecture dependencies.