1 Introduction

Energy generation, consumption and saving are not minor issues. Although sources of clean energy are being sought, energy generation still causes large CO2 emissions. Approximately 72% of global energy consumption goes to transportation, lighting and electric heating, while the rest is mainly due to electrical and electronic equipment, including computing devices [1,2,3,4,5,6].

Energy consumption in computing devices is a rapidly growing area of research and development. This growth is driven by several critical problems that must be thoroughly addressed, including the energy gap problem, energy consumption in autonomous systems and the problem of exaflop scaling, among others [7,8,9,10,11,12,13]. Firstly, regarding the energy gap problem, Nokia (2010) reported that the power consumed by new applications increases annually faster than what mobile device batteries can supply, so new smartphones need to be recharged more often, limiting user mobility [7, 14]. Secondly, autonomous vehicles, such as robots and drones, require efficient energy management. In fact, if the algorithms that control them or use them for specific tasks are not energetically optimal, the battery is put at risk, which can compromise the tasks of the vehicle and even its integrity [9,10,11]. Thirdly, regarding the exaflop problem, there is a steady growth in supercomputer performance, as can be seen from the Top500.org reports [12]. With today's computer technology, the increase in the number of floating point operations per second (flops) is directly proportional to power consumption; hence, exaflop computers would be energetically unviable. Nevertheless, teams of engineers and scientists around the world are aiming to manufacture, by 2020, exaflop computers with a target power consumption of 20 MW [12, 13].

Various strategies have been developed and implemented to save energy. For mobile devices: use of low-power processors, mobile computation offloading, favoring sequential rather than parallel executions of applications, and scheduler optimization for both the processor and the antennas, to name a few [15]. For autonomous vehicles: strategies such as approximate computing have been developed so that these vehicles can save energy in the expensive computing operations they perform [16, 17]. For exaflop scaling: use of processing units with a good flop-to-watt ratio, including low-power processors, use of accelerators such as GPUs or Xeon Phi, use of multicore processors with variable frequency, cooling technology and energy capture/recovery [12, 13].

Most energy models based on extensions to Amdahl’s law do not consider the use of variable frequency as provided by Turbo Boost [18]. Meneses [19] presents an extension to Amdahl’s law for Intel processors with Turbo Boost, but only to study the performance behavior, not the energy behavior, of parallel programs. Verner [20] presents a model that considers frequency variation, but only in the parallel regions of a program. Hence, a model is needed to analyze the energy behavior of parallel programs running on multicore processors with Turbo Boost technology. The main contribution of this work is such a model: an extension to Amdahl’s law to analyze the energy consumption of parallel programs executed on Intel multicore processors with Turbo Boost.

2 Related work

Power consumption is an important factor during the design phase of new processors. Recently, the number of cores in multicore processors has increased from 2 to 48. In addition, it is common to manufacture processors with more cache memory and with frequencies above 3.6 GHz. This frequency has a limit because energy consumption generates heat: while an instruction stream is being executed, the energy consumed by the processor may cause overheating, which in turn requires additional energy to cool it down [21,22,23,24,25].

Fuller and Miller [26] addressed the power consumption problem in 2010, exploring new technologies for the design of modern processors. Furthermore, different strategies have been used to reduce the power consumption of these new processor architectures, heterogeneous or asymmetric, including specialized cores, vectorization, dynamic voltage and frequency scaling (DVFS) and systematic techniques for power activation, among others [27, 28].

The study of multicore energy behavior has followed two different paths: mathematical models and experimentation. Table 1 summarizes the models related to the power consumption analysis of multicore processors.

Table 1 Models related to the study of power consumption

Traditionally, Amdahl’s law has been used to analyze the speedup of parallel programs [47], and it has been taken as the basis for the study of the power consumption of parallel programming on multicore architectures. Research on energy models based on extensions to Amdahl’s law can be found in [29,30,31,32,33,34,35]. Most of these studies have focused on the power consumption of multicore architectures without considering frequency variations. Kim et al. [48] focused their research on frequency variations in the sequential regions of the parallel program. Pei et al. [35] present a model for heterogeneous multicore processors in which the data preparation overhead (ODP) is the focus of their extension to Amdahl’s law.

Energy proportionality refers to a model that establishes that a server should consume power in proportion to its utilization [36,37,38,39,40]. Since its introduction in 2007, the model has been embraced by the server industry as a design goal for further optimizations of energy efficiency. In [36], the authors found that modern servers have taken the proportionality of server power consumption from linear to quadratic. Hsu et al. [37] reviewed different energy proportionality metrics and proposed a single unified indicator to represent them all, called EP (energy proportionality). Furthermore, the authors noted that although EP is considered better than other existing metrics, it does not capture an important aspect of server power proportionality, namely linearity. Energy proportionality is still under research for data center environments [38,39,40].

Some models can be categorized as energy complexity models. Some authors [41,42,43,44] have formalized the problem of energy consumption by extending the ideas of computational complexity into a new area called energy complexity. In this framework, energy complexities have been obtained for traditional algorithms, such as sorting algorithms. The main idea is to express the energy used as a function of the problem size.

There are other models that use different strategies to analyze energy behavior, based on processors with independently powered resources [33, 45]. These resources can be cores, cache memory and the interconnection network, among others. Basmadjian and de Meer [45] worked on the design of a software-based power consumption model for multicore processors. They pointed out that the presence of more than one processing unit directly affects the power behavior of the processor, because computational resources are shared. They also mentioned that measuring power consumption at the hardware level is not trivial for multiprocessors with a high number of internal circuits. Cho and Melhem [33] studied the effects of parallelism on program performance and energy consumption. Their proposed model was tested on a machine that supports core off-lining. They predicted that adding more cores, combined with a high percentage of parallel code in a process, helps to reduce energy consumption. Their main conclusion is that the more processing units share common elements powered by the same source, the greater the energy savings.

Finally, another energy consumption model is presented in [46]. The authors take the energy delay product (EDP) as the metric for energy efficiency: the lower the EDP, the better the energy efficiency. For a shared-memory multicore processor system, they used processor overclocking and memory frequency scaling to achieve better performance and lower power consumption, providing better energy efficiency. The authors also stated that processor overclocking alone cannot guarantee better energy efficiency, and, similarly, that reducing the memory frequency combined with processor overclocking cannot guarantee it either. Hence, they propose a processor overclocking and memory frequency scaling algorithm, the Holistic Energy-Efficient (HEE) algorithm, which determines when to use processor overclocking alone to improve energy efficiency and when to combine it with memory frequency scaling.

Most research on the energy consumption behavior of Intel multicore processors is based on experimentation. Several benchmarks have been used to study the impact on energy consumption of specific system tasks, such as memory access, use of DVFS (dynamic voltage and frequency scaling), use of Turbo Boost, use of vectorization, use of thermal and power counters, and operating system configurations, to name a few [18, 28, 49,50,51,52,53,54,55,56,57].

James et al. [25] present a performance analysis of Intel Core i7 (Nehalem) processors. Their analysis combines different workload scenarios. The authors determine that Turbo Boost activation depends on inherent features of the application, and they note that Turbo Boost mitigates Amdahl’s law. The article reports a 6% reduction in execution time with a 16% increase in energy consumption, and that on the Core i7 the frequency step size is 133.33 MHz. Turbo Boost is made possible by a processor feature called power gating. Traditionally, an idle processor core consumes zero active power while dissipating static power due to leakage current. Power gating also reduces the leakage current, further reducing the power consumption of the idle core. The available power margin can then be diverted to the active cores to increase their voltage and frequency without violating the power, voltage and thermal envelopes [25]. Turbo Boost makes Nehalem a dynamically asymmetric processor; this means that different cores can execute the same set of instructions while their frequencies vary independently. Their experiments are CPU intensive, both floating point and integer, with large memory accesses. The article also presents a frequency-based processor power consumption model.

Acun et al. [49] present a series of experiments on four supercomputers to observe the behavior of HPC applications on nodes with Turbo Boost. The test program they use is MKL double-precision matrix multiplication, as it represents a benchmark for heavy hardware use. The authors report a degradation in application performance as the number of cores increases on processors with dynamic frequency change. The problem observed is that not all nodes increase their frequency at the same time. They tried setting the frequency constant but did not obtain good results. The solution they found was an algorithm that migrates processes to processors that have Turbo Boost active and are observed to increase their frequency in the same way.

Aniruddha et al. [50] noted in 2017 that the models used to predict energy and temperature performance in clusters assume homogeneous execution systems, that is, processors and cores running at the same frequencies, so processor frequency variations must be studied carefully. Their paper presents an empirical study of the performance of processors for HPC; in particular, they review the Intel processor families with Turbo Boost. The performance problem due to energy variation grows as processors ship with larger dynamic frequency increases, and the variation is further amplified by the restrictions imposed by the hardware (number of cores, power management functionality, among others). The authors show that, under the hardware power constraint of the processor, the variation in processor performance and energy efficiency has increased up to fourfold in recent Intel processors (Sandy Bridge, Ivy Bridge and Broadwell). This work suggests that there is a risk of greater variation in performance and energy efficiency as the number of cores per processor increases.

Current energy models based on extensions to Amdahl’s law do not consider the use of variable frequency as provided by Turbo Boost [18]. Verner [20] and Meneses [19] present extensions to Amdahl’s law for Intel processors with Turbo Boost, but only to study the performance behavior, not the energy behavior, of parallel programs. Hence, a model is needed to analyze the energy behavior of parallel programs running on multicore processors with Turbo Boost technology. We therefore present a model, an extension to Amdahl’s law, to analyze the energy consumption of parallel programs executed on multicore processors with variable frequencies.

3 Energy consumption model

We give an energy consumption model for multicore processors with variable frequency, based on Amdahl’s law. This model extends the model proposed in [34] to cover processors with variable frequency.

The relation between energy, power and time is given by [14]:

$$\begin{aligned} energy = power \times time. \end{aligned}$$
(1)

Based on this relation, our energy consumption model consists of two main components: power and time. On one hand, we obtain the power model for multicore processors with variable frequency by taking into account two different power measurements: the power associated with the processor’s base frequency, and the power associated with the processor once variable frequency has been enabled. Since frequency is directly related to power, higher frequency implies higher power consumption; hence, for multicore processors with variable frequency, an increase in power consumption is expected as the frequency increases. On the other hand, we use Amdahl’s law to model the time of a parallel program running on a multicore processor, and we then extend the model to multicore processors with variable frequency. This extension predicts not only the speedup in terms of time, but also the energetic speedup of the multicore processor. In Sects. 3.1 and 3.2, we describe the two components of the energy consumption model, the power and the time models, respectively. Section 3.3 develops the energy model and Sect. 3.4 develops an extension of Amdahl’s law for the energetic speedup.

3.1 Power

The power model is developed for two cases, which we call \(W_{off}\) and \(W_{on}\). Case \(W_{off}\) refers to the processor running without frequency variation, and case \(W_{on}\) to the processor running with frequency variation active.

In general, according to [34], it is possible to represent the entire power usage as the sum of three terms: the base power, the power of the active cores and the idle power of all cores. These terms are denoted by \(W_{base}\), \(pW_{active}\) and \(mW_{idle}\), where p is the number of active cores and m the total number of cores in the processor. The parameter \(pW_{active}\) can be \(pW_{aoff}\) or \(pW_{aon}\), depending on the case.

The power of a processor with m cores of which p are active is given by:

$$\begin{aligned} W_{off}(p) & = W_{base} + pW_{aoff} + mW_{idle}, \end{aligned}$$
(2)
$$\begin{aligned} W_{on}(p)& = W_{base} + pW_{aon} + mW_{idle}. \end{aligned}$$
(3)

Since \(W_c = W_{base} + mW_{idle}\) is constant, we will use \(W_{c}\) to refer to this term from now on. As stated in [45], this constant depends on the layout of the resources shared by the multicore processor, including cache memory and the interconnection network, among others. Equations (4) and (5) rewrite the power in terms of \(W_c\).

$$\begin{aligned} W_{off}(p)& = W_{c} + pW_{aoff}, \end{aligned}$$
(4)
$$\begin{aligned} W_{on}(p)& = W_{c} + pW_{aon}. \end{aligned}$$
(5)

When only one core is used, Eqs. (6) and (7) represent the power required by a sequential program in a multicore processor for both cases.

$$\begin{aligned} W_{off}(1)& = W_{c}+W_{aoff}, \end{aligned}$$
(6)
$$\begin{aligned} W_{on}(1)& = W_{c}+W_{aon}. \end{aligned}$$
(7)
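
To make the power model concrete, the following C sketch evaluates Eqs. (4) and (5); the function names are ours, and the wattage values are the illustrative ones used later in the example of Sect. 3.4, not measurements.

```c
#include <stdio.h>

/* Illustrative values (watts) taken from the example in Sect. 3.4. */
#define W_C    2.5   /* constant term W_c = W_base + m*W_idle          */
#define W_AOFF 6.5   /* active power per core, frequency variation off */
#define W_AON  8.5   /* active power per core, frequency variation on  */

/* Eq. (4): processor power with p active cores, Turbo Boost off. */
static double w_off(int p) { return W_C + p * W_AOFF; }

/* Eq. (5): processor power with p active cores, Turbo Boost on.  */
static double w_on(int p)  { return W_C + p * W_AON; }

int main(void) {
    for (int p = 1; p <= 4; p++)
        printf("p=%d  W_off=%5.1f W  W_on=%5.1f W\n", p, w_off(p), w_on(p));
    return 0;
}
```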

3.2 Time

According to Amdahl [47], the execution time of a sequential program to solve a problem of size n, running on a single processor is given by Eq. (8), where \(\sigma (n)\) is the serial portion of the computation and \(\varphi (n)\) is the portion of the computation that can be executed in parallel.

$$\begin{aligned} T(n,1) = \sigma (n) + \varphi (n). \end{aligned}$$
(8)

Following the same principle, the execution time for a program running on a multicore processor is given by Eq. (9), where p denotes the number of cores in the multicore processor and \(k(n, p)\) is the overhead related to the intercommunication between processors.

$$\begin{aligned} T(n,p) = \sigma (n) + \frac{\varphi (n)}{p} + k(n, p). \end{aligned}$$
(9)

Hence, the speedup of a parallel program is the ratio between the sequential execution time and the parallel execution time, as given by Eq. (10). Assuming \(k(n, p) > 0\) and \(\sigma (n)>0\), we obtain inequality (11).

$$\begin{aligned} \Psi (n,p)& = \frac{\sigma (n) + \varphi (n) }{\sigma (n) + \frac{\varphi (n)}{p} + k(n, p)}, \end{aligned}$$
(10)
$$\begin{aligned} \Psi (n,p)\le & \frac{1}{f + (1-f)/p}, \;\text{ where }\; f= \frac{\sigma (n)}{\sigma (n) + \varphi (n)}. \end{aligned}$$
(11)
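
For clarity, the step from (10) to (11) only drops the positive overhead term \(k(n,p)\) and divides the numerator and denominator by \(\sigma (n) + \varphi (n)\):

$$\begin{aligned} \Psi (n,p) = \frac{\sigma (n) + \varphi (n)}{\sigma (n) + \frac{\varphi (n)}{p} + k(n,p)} \le \frac{\sigma (n) + \varphi (n)}{\sigma (n) + \frac{\varphi (n)}{p}} = \frac{1}{\frac{\sigma (n)}{\sigma (n)+\varphi (n)} + \frac{1}{p}\,\frac{\varphi (n)}{\sigma (n)+\varphi (n)}} = \frac{1}{f + (1-f)/p}. \end{aligned}$$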

Note that \(\sigma (n) > 0\) because, in practice, every program has a sequential code region, even if it is very small, and the inherently sequential proportion is given by f. From (11), it follows that the maximum speedup as \(f \rightarrow 0\) is given by:

$$\begin{aligned} \lim _{f\rightarrow 0}\Psi (n,p) = p. \end{aligned}$$
(12)

3.2.1 Amdahl’s law for processors with variable frequency

We give an extension of the speedup (10) to multicore processors with variable frequency, based on Amdahl’s law. We consider the execution time of a program when the frequency change is active and when it is not. Let \(T_{off}(n, p)\) be the time a program needs to solve a problem of size n with p cores when the frequency change is not active. We take \(T_{off}(n,1)\) and \(T_{off}(n,p)\) to be given by Eqs. (8) and (9), respectively.

To calculate the execution time of a program that solves a problem of size n on a processor with the frequency change active, we must consider the frequency ratio H given by:

$$\begin{aligned} H = \frac{\text{ increased } \text{ frequency }}{\text{ base } \text{ frequency }}. \end{aligned}$$

Then, \(T_{on}(n,1)\) and \(T_{on}(n,p)\) are defined as

$$\begin{aligned} T_{on}(n,1)& = \frac{\sigma (n)}{H} + \frac{\varphi (n)}{H}, \end{aligned}$$
(13)
$$\begin{aligned} T_{on}(n,p)& = \frac{\sigma (n)}{H} + \frac{\varphi (n)}{Hp} + \frac{\kappa (n,p)}{H}. \end{aligned}$$
(14)

The speedup of a parallel program that solves a problem of size n with p cores on a multicore processor with variable frequency is given by:

$$\begin{aligned} \Psi (n,p)& = \frac{T_{off}(n,1)}{T_{on}(n,p)}\nonumber \\&=\frac{ \sigma (n) + \varphi (n)}{\frac{\sigma (n)}{H}+\frac{\varphi (n)}{Hp} + \frac{k(n, p)}{H}}. \end{aligned}$$
(15)

We obtain (16) from (15) by dropping the overhead term, as in the derivation of (11):

$$\begin{aligned} \Psi (n,p) & \le \frac{ \sigma (n) + \varphi (n)}{\frac{\sigma (n)}{H}+\frac{\varphi (n)}{Hp}}\nonumber \\ & = H\left( \frac{ \sigma (n) + \varphi (n)}{\sigma (n)+\frac{\varphi (n)}{p}}\right) . \end{aligned}$$
(16)

Taking \(f= \frac{\sigma (n)}{ \sigma (n)+ \varphi (n)} \), for \(\sigma (n)>0\), turns (16) into (17).

$$\begin{aligned} \Psi (n,p)\le \frac{H}{f+\frac{(1-f)}{p}}. \end{aligned}$$
(17)

Note that the maximum speedup for a parallel program running on a processor with frequency variation is given by \(\lim _{f\rightarrow 0}\Psi (n, p) \); hence

$$\begin{aligned} \lim _{f\rightarrow 0}\Psi (n,p) \le pH. \end{aligned}$$
(18)

Figure 1a, b shows the speedup for three different values of f: 0.2, 0.1 and 0.02. Figure 1a shows the speedup of a parallel program without frequency variation (traditional Amdahl’s law), and Fig. 1b shows the speedup of a parallel program with \(H=1.14\). The value of H was obtained assuming that the processor increases its frequency from 2.7 to 3.1 GHz (\(H \approx 3.1\,GHz/2.7\,GHz\)). The same values are used in the experiments of Sect. 4.

In Fig. 1b, it can be observed that for the frequency ratio \(H=1.14\) there are regions where the speedup is higher than the upper limit of Amdahl’s law [Eq. (12)]; for example, when \(f=0.02\) and \(p \in [1, 8]\), where p is the number of cores, the speedup is greater than p. Speedup describes the behavior of a parallel program with respect to a sequential program: if the speedup is greater than one, the parallel program performs better than the sequential program; if there is a slowdown, the sequential program performs better. In the same way, for processors with frequency change, the speedup describes the behavior of a parallel program on a processor with variable frequency with respect to the sequential execution on the same multicore processor without frequency change. For \(f=0.02\), the speedup exceeds the traditional limit of Amdahl’s law, given in Eq. (12), because the parallel program executes on the multicore processor at frequencies higher than those of the sequential execution, so each stream executes H times more instructions per unit of time. It can also be seen that the speedup stays below pH, as indicated in Eq. (18).

Fig. 1 Amdahl’s law for multicore processors without and with frequency variation
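
The curves in Fig. 1 can be reproduced by evaluating the bounds (11) and (17) directly; the following C sketch (our code, not part of the original experiments) uses the values of f and H quoted above.

```c
#include <stdio.h>

/* Upper bound on speedup, Eq. (17); H = 1 recovers the classic bound, Eq. (11). */
static double speedup_bound(double f, int p, double H) {
    return H / (f + (1.0 - f) / p);
}

int main(void) {
    const double f_values[] = {0.2, 0.1, 0.02};   /* serial fractions used in Fig. 1 */
    const double H = 3.1 / 2.7;                   /* frequency ratio, approx. 1.14   */
    for (int i = 0; i < 3; i++)
        for (int p = 1; p <= 8; p++)
            printf("f=%.2f p=%d  Amdahl: %.3f  with frequency variation: %.3f\n",
                   f_values[i], p,
                   speedup_bound(f_values[i], p, 1.0),  /* Fig. 1a */
                   speedup_bound(f_values[i], p, H));   /* Fig. 1b */
    return 0;
}
```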

3.3 Energy consumption model

In Sects. 3.1 and 3.2, we introduced the power and time models for multicore processors that can change their frequency. In this section, we combine both models to obtain an energy consumption model. Analogously to the modeling of time, we begin by introducing the general formula for the energy of a sequential program. The energy required to solve a problem of size n on a single core is denoted by J(n, 1), where \(\sigma (n)\) refers to the inherently sequential portion of the computation and \(\varphi (n)\) to the potentially parallel portion. Then, J(n, 1) is given by Eq. (19).

$$\begin{aligned} J(n,1)& = W(1)\sigma (n) + W(1)\varphi (n) \nonumber \\& = W(1)(\sigma (n) + \varphi (n)), \end{aligned}$$
(19)

where W(1) depends on whether the processor frequency is variable or not [Eqs. 4 and 5].

We define \(J(n,p)\) as the energy to solve a problem of size n on p cores. We have:

$$\begin{aligned} J(n,p) = W(1)\sigma (n) + W(p)\frac{\varphi (n)}{p} + W_{\kappa }\kappa (n,p), \end{aligned}$$
(20)

where W(1) and W(p) can be taken from Eqs. 4 and 5, depending on whether the processor has variable frequency active, and \(W_{\kappa }\kappa (n,p)>0\) refers to the energy overhead inherent in a parallel program.

Now, to model the energy for a multicore processor of variable frequency, we consider two cases. The first one is when the frequency of the processor does not have any variation, and the second case is when the frequency has variation. We denote these two cases by \(J_{off}\) and \(J_{on}\), respectively.

Equations (21) and (22) correspond to the sequential and parallel executions when there is no frequency variation in the multicore processor, where \(W_{off}(1)\) is given by Eq. (6) and \(W_{off}(p)\) by Eq. (4).

$$\begin{aligned} J_{off}(n,1)& = W_{off}(1)\sigma (n) + W_{off}(1)\varphi (n) \nonumber \\& = W_{off}(1)(\sigma (n) + \varphi (n)). \end{aligned}$$
(21)
$$\begin{aligned} J_{off}(n,p)& = W_{off}(1)\sigma (n) + W_{off}(p)\frac{\varphi (n)}{p} + W_{\kappa }\kappa (n,p). \end{aligned}$$
(22)

Similarly, for the case when frequency change is enabled, Eqs. (23) and (24) give the energy consumption for solving a problem of size n in one and p cores, respectively.

$$\begin{aligned} J_{on}(n,1)& = W_{on}(1)\frac{\sigma (n)}{H} + W_{on}(1)\frac{\varphi (n)}{H} \nonumber \\& = \frac{W_{on}(1)}{H}(\sigma (n) + \varphi (n)), \end{aligned}$$
(23)
$$\begin{aligned} J_{on}(n,p)& = W_{on}(1)\frac{\sigma (n)}{H} + W_{on}(p)\frac{\varphi (n)}{Hp} + W_{\kappa }\kappa (n,p). \end{aligned}$$
(24)
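
A direct transcription of Eqs. (22) and (24) into C could look as follows. Here sigma, phi and kappa are the time components of Eqs. (8) and (9), the power arguments come from the model of Sect. 3.1, and all names are ours; this is a sketch of the formulas, not of a measurement procedure.

```c
/* Eq. (22): energy (joules) to solve a problem with p cores, frequency variation off.
 * sigma, phi, kappa are times in seconds; woff1 = W_off(1), woffp = W_off(p) and
 * wk = W_kappa are powers in watts. */
double j_off(double sigma, double phi, double kappa,
             double woff1, double woffp, double wk, int p) {
    return woff1 * sigma + woffp * phi / p + wk * kappa;
}

/* Eq. (24): energy with p cores and frequency variation on; H is the frequency ratio,
 * won1 = W_on(1) and wonp = W_on(p). */
double j_on(double sigma, double phi, double kappa,
            double won1, double wonp, double wk, int p, double H) {
    return won1 * sigma / H + wonp * phi / (H * p) + wk * kappa;
}
```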

3.4 Amdahl’s law extension for energetic speedup

Based on the energy consumption model, it is possible to calculate the energetic speedup using Amdahl’s law. By knowing the energetic speedup, the energy saving of a multicore processor with variable frequency can also be known. Again, we consider two cases: \(\Psi _{off}(n,p)\) for multicore processors without variable frequency and \(\Psi _{on}(n,p)\) for multicore processors with variable frequency. This representation allows us to study the energy consumption behavior of a parallel program on multicore processors. When the frequency change is inactive, as in Amdahl’s law, the energy consumption of a sequential program \(J_{off}(n,1)\) is divided by the energy consumption of a parallel program \(J_{off}(n,p)\), as given in Eq. (25).

$$\begin{aligned} \Psi _{off}(n,p) & = \frac{J_{off}(n,1)}{J_{off}(n,p)} \nonumber \\ & = \frac{W_{off}(1)(\sigma (n) + \varphi (n))} {W_{off}(1)\sigma (n) + W_{off}(p)\frac{\varphi (n)}{p} + W_{\kappa }\kappa (n,p)} \nonumber \\ & \le \frac{W_{off}(1)(\sigma (n) + \varphi (n))}{W_{off}(1)\sigma (n) + W_{off}(p)\frac{\varphi (n)}{p}}\nonumber \\ & \le \frac{1}{f + (1-f)\frac{W_{off}(p)}{pW_{off}(1)}}, \end{aligned}$$
(25)

where \(f = \frac{\sigma (n)}{(\sigma (n) + \varphi (n))}\).

The maximum energy speedup for a parallel program running on a processor without frequency variation is given by \(\lim _{f\rightarrow 0} \Psi _{off}(n,p)\); hence

$$\begin{aligned} \lim _{f\rightarrow 0} \Psi _{off}(n,p) = \frac{pW_{off}(1)}{W_{off}(p)}. \end{aligned}$$
(26)

If this limit is greater than 1, then we are in a scenario where we can have energy savings. If this value is less than or equal to 1, then we are in a scenario where sequential programs consume less energy than parallel programs.

On the other hand, when the frequency change is active, the energy consumption of a sequential execution with the frequency change inactive, \(J_{off}(n,1)\), is divided by the energy consumption of the parallel program on a multicore processor with the frequency change active, \(J_{on}(n,p)\), as given by Eq. (27).

$$\begin{aligned} \Psi _{on}(n,p)& = \frac{J_{off}(n,1)}{J_{on}(n,p)} \nonumber \\ & = \frac{W_{off}(1)(\sigma (n) + \varphi (n))} {W_{on}(1)\frac{\sigma (n)}{H} + W_{on}(p)\frac{\varphi (n)}{Hp} + W_{\kappa}\kappa (n,p)} \nonumber \\ & \le \frac{W_{off}(1)(\sigma (n) + \varphi (n))}{W_{on}(1)\frac{\sigma (n)}{H} + W_{on}(p)\frac{\varphi (n)}{Hp}} \nonumber \\ & \le \left( \frac{W_{off}(1)}{W_{on}(1)}\right) \frac{H}{f + (1 - f)\frac{W_{on}(p)}{W_{on}(1)p}}. \end{aligned}$$
(27)

The maximum energy speedup for a parallel program running on a processor with frequency variation is given by \(\lim _{f\rightarrow 0} \Psi _{on}(n,p)\); hence

$$\begin{aligned} \lim _{f\rightarrow 0} \Psi _{on}(n,p) = \frac{pHW_{off}(1)}{W_{on}(p)}. \end{aligned}$$
(28)

As with the limit (26), if this limit is greater than 1, we can have energy savings; if it is less than or equal to 1, the sequential program consumes less energy than the parallel program. The relevant point is that we can know the maximum energy speedup when p equals the number of cores of the processor, provided we know \(H, W_{off}(1)\) and \(W_{on}(p)\).

For example, Fig. 2 shows an instance of the model that describes how the energy speedup behaves in a processor that can change its frequency. In this example, we consider a 4-core processor with a base frequency of 2.75 GHz that can raise its frequency to 3.1 GHz, so \(H=1.127\). The power values associated with the processor are \(W_c=2.5\) watts for the constant power, \(W_{aoff}=6.5\) watts for the active power per core when Turbo Boost is off, and \(W_{aon} = 8.5\) watts for the active power per core when Turbo Boost is on. With these values it is possible to compute \(W_{off}(p) = W_c + pW_{aoff}\) and \(W_{on}(p)=W_c+pW_{aon}\). Figure 2a shows the energy speedup when the frequency is constant, and Fig. 2b shows it when the frequency changes. These figures show that the processor without frequency changes can reach a greater energy speedup than the processor with frequency changes; however, the gain is not significant and stays close to 1. This means that parallel programs running on processors with or without frequency change consume roughly the same energy when they use all available cores. So, from the point of view of the model, a processor with frequency change offers greater performance for parallel programs while using about the same amount of energy as the sequential program.
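
The limits (26) and (28) for this example can be checked with a few lines of C (our sketch); with the values above it yields an energy speedup limit of roughly 1.26 without frequency change and roughly 1.11 with it, consistent with the behavior described for Fig. 2.

```c
#include <stdio.h>

/* Example of Sect. 3.4: 4-core processor, base 2.75 GHz, boosted 3.1 GHz. */
#define W_C    2.5    /* constant power (W)                   */
#define W_AOFF 6.5    /* active power per core, Turbo off (W) */
#define W_AON  8.5    /* active power per core, Turbo on  (W) */
#define H      1.127  /* frequency ratio 3.1/2.75             */

static double w_off(int p) { return W_C + p * W_AOFF; }  /* Eq. (4) */
static double w_on(int p)  { return W_C + p * W_AON;  }  /* Eq. (5) */

int main(void) {
    const int p = 4;
    double lim_off = p * w_off(1) / w_off(p);     /* Eq. (26) */
    double lim_on  = p * H * w_off(1) / w_on(p);  /* Eq. (28) */
    printf("energy speedup limit, Turbo off: %.3f\n", lim_off);
    printf("energy speedup limit, Turbo on : %.3f\n", lim_on);
    return 0;
}
```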

Fig. 2 Energy speedup for parallel programs on a multicore processor for \(J_{off}(n,p)\) and \(J_{on}(n,p)\)

4 Experiments and discussion

Several experiments have been conducted to validate the energy model introduced in Sect. 3.3. These experiments were executed on the three different Intel processors with Turbo Boost technology listed below. This technology allows the processor to switch the frequency variation on and off during the processing of instructions.

  1. Intel Core i5 dual core, 2.7 GHz (PowerBook).

  2. Intel Core i5 quad core, 2.3 GHz (iMac).

  3. Intel Xeon eight-core, 3.4 GHz (server).

To measure the energy consumption of the processor, we used Power Gadget, an Intel software-based power usage monitoring tool available for the processors used in these experiments.

To test the proposed model, part of the Linpack benchmark was used, specifically double-precision multiplication of square matrices. It is well known that the complexity of this task is \(O(n^3)\) and that it stresses the processor. Moreover, the parallel part exceeds 98% of the execution time for matrices larger than \(1500\times 1500\). The matrix sizes tested were \(1500\times 1500\), \(2000\times 2000\), \(3000\times 3000\) and \(3500\times 3500\).

OpenMP was used for the tests. The number of threads used in the OpenMP programs ranged from one up to the number of cores of the test platform. That is, on the Intel Core i5 dual core processor, the experiments used one and two threads; on the Intel Core i5 quad core processor, one to four threads; and on the Intel Xeon eight-core, one to eight threads. Special care was taken so that increasing the number of threads increased the number of cores used in the processor: if k threads were used, then k cores of the processor were used. For this reason, we use the terms threads and cores interchangeably in the following sections.
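
The exact benchmark code is not listed in the paper; a minimal OpenMP kernel of the kind described (double-precision multiplication of square matrices, one thread per core) might look like the sketch below, where the routine name and signature are ours.

```c
#include <omp.h>

/* Double-precision multiplication of n x n matrices, C = A * B.
 * One OpenMP thread per core: calling with threads = k occupies k cores,
 * as in the experiments of Sect. 4. Illustrative sketch only. */
void dpMM(const double *A, const double *B, double *C, int n, int threads) {
    #pragma omp parallel for num_threads(threads)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}
```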

4.1 Dual core experiments

Table 2 shows the power behavior of the Intel Core i5 dual core processor, as measured with the Intel Power Gadget tool. Columns \(W_c\) and \(W_a\) show the constant power and the active power per core [Eqs. (4) and (5)]. Column W(1) is the power of the processor with one active core [Eqs. (6) and (7)], and column W(2) is the power with two active cores [Eqs. (4) and (5)].

Table 2 Frequency and electric power for Intel Core i5 dual core with Turbo Boost off and on

Figure 3 and Table 3 show the execution time and speedup of double-precision matrix multiplication for different sizes of square matrices. In Fig. 3, the solid lines show the execution time of the program running on one and two cores with Turbo Boost off, while the dashed lines show the behavior on one and two cores with Turbo Boost on. Notice that when Turbo Boost is activated, execution times are shorter than when it is off. The process that takes the most time is the sequential one with Turbo Boost off, and the process that takes the least time is the one that uses two cores with Turbo Boost on. This behavior can be deduced from Eqs. (12) and (18): since \(H=3.1/2.7\approx 1.148 > 1\), the speedup obtained with Turbo Boost on is greater than with Turbo Boost off.

Fig. 3 Double-precision matrix multiplication (dpMM) time performance

Table 3 shows the experimental speedup and the model speedup associated with Eq. (17) for double-precision matrix multiplication using one and two cores, with Turbo Boost on and off. The speedup is calculated with Eq. (15), since we are interested in the execution behavior of parallel programs on architectures in which the frequency of the processor cores changes. In Table 3, we can observe that the speedup with Turbo Boost on exceeds the traditional linear limit of Amdahl’s law [Eq. (12)]. Moreover, according to the behavior predicted by the model [Eq. (18)], the upper limit is pH; since \(H=1.148\), the upper limit of the speedup is 2.29x. This result is consistent with the Model Two Threads column for Turbo Boost on in Table 3, where the maximum speedup reported is 2.2953 for a matrix of size 3500.

Table 3 Speedup on Intel Core i5 dual core processor
Fig. 4 Double-precision matrix multiplication (dpMM) energy behavior on Intel Core i5 dual core

Figure 4 shows the energy consumption of double-precision matrix multiplication for different matrix sizes on one and two cores. Figure 4a shows the energy consumption measured in the experiments. Figure 4b shows the energy consumption predicted by the model, according to Eqs. (22) and (24), using the processor data in Table 2 and the sequential time with Turbo Boost off. The solid lines correspond to the energy consumption of matrix multiplication when Turbo Boost is off and the dashed lines to the same experiment when Turbo Boost is on. Comparing both graphs in Fig. 4, the behavior predicted by the model corresponds to that of the experiments. Notice that the process consumes more energy when Turbo Boost is on than when it is off, and that the execution on two cores consumes less energy than the sequential execution. Another behavior that can be observed is that the program that consumes the most energy is the sequential one with Turbo Boost on, while the one that consumes the least energy uses two cores with Turbo Boost off. Finally, in Fig. 4, we can see that the process running on one core with Turbo Boost off and the process running on two cores with Turbo Boost on consume almost the same amount of energy.

Table 4 Experimental and model energetic speedup for double-precision matrix multiplication on Intel Core i5 dual core processor

Table 4 shows the energetic speedup of double-precision matrix multiplication with Turbo Boost on and off. The values in this table are for experiments with several matrix sizes and for the model given by Eqs. (25) and (27), using the power values in Table 2. Table 4 shows that, for both the experiments and the model, a greater energy speedup is obtained when Turbo Boost is off than when it is on. An interesting observation is that, for both the experiments and the model, when the two cores of the processor are used, the speedup is very close to one. This means that a process using two cores with Turbo Boost on spends the same energy as the sequential process with Turbo Boost off. Finally, we observe that the speedup obtained using one core with Turbo Boost on is less than one; this is because a sequential process with Turbo Boost on consumes more energy than a sequential process with Turbo Boost off, as shown in Fig. 4.

4.2 Quad core experiments

Table 5 shows the main power and frequency attributes of the Intel Core i5 quad core processor. The values of the constant power \(W_c\), the active power \(W_a\) and the power of the processor with 1, 2, 3 and 4 active cores were obtained experimentally with the Intel Power Gadget tool. Column W(1) is the power of the processor with one active core [Eqs. (6) and (7)], and columns W(p), with \(p\in \{2,3,4\}\), give the power of the processor with two, three and four active cores [Eqs. (4) and (5)].

Table 5 Frequency and electric power for Intel Core i5 quad core with Turbo Boost off and on

Figure 5 and Table 6 show the execution time behavior of double-precision matrix multiplication for different sizes of square matrices. In Fig. 5, the solid lines show the behavior of the experiment running on one, two, three and four cores with Turbo Boost off, while the dashed lines show the behavior on one, two, three and four cores with Turbo Boost on. In the figure and in the table, it can be seen that the execution on one, two, three and four cores takes less time when the frequency is increased. The process that takes the most time is the sequential one with Turbo Boost off, and the process that takes the least time is the one that uses four cores with Turbo Boost on. As in the case of the dual core processor, this behavior is predicted by Eqs. (8), (9), (13) and (14), where \(H = 2.8/2.7 \approx 1.037 > 1\).

Fig. 5 Double-precision matrix multiplication (dpMM) time performance

Table 6 Speedup on Intel Core i5 quad core processor

Table 6 shows the experimental speedup and the model speedup corresponding to Eq. (15). The columns show the speedup with one, two, three and four cores with Turbo Boost on and off, for both the experiments and the model. As can be seen in the speedup columns for two, three and four cores, the values obtained from the model are very similar to those obtained in the experiments. As in the case of the dual core processor, the traditional Amdahl’s law (10) is satisfied when Turbo Boost is off, but when Turbo Boost is on, the upper limit of Amdahl’s law (12) is exceeded, as indicated by inequality (18). This upper limit is pH; since \(H\approx 1.037\), the upper limit of the speedup is 2.074x for two cores, 3.11x for three cores and 4.148x for four cores. This behavior can be appreciated from the experimental speedup values, which remain below this upper limit.

Fig. 6 Double-precision matrix multiplication (dpMM) energy behavior on Intel Core i5 quad core

Figure 6 and Table 7 show the energy consumption and energy speedup behavior of matrix multiplication. Figure 6 shows the energy consumption of double-precision matrix multiplication for different matrix sizes on one, two, three and four cores. Figure 6a shows the energy consumption measured in the experiments. Figure 6b shows the energy consumption given by the model, as obtained from Eqs. (22) and (24), with the processor data in Table 5 and the sequential time with Turbo Boost off. The solid lines correspond to the energy consumption of matrix multiplication when Turbo Boost is off and the dashed lines to the same experiment when Turbo Boost is on. From Fig. 6a, b it is observed that the behavior predicted by the model corresponds to that of the experiments. As in the case of the dual core processor, Fig. 6 shows that the process consumes more energy when Turbo Boost is on. However, the execution that uses all cores consumes less energy than the sequential execution (one thread). Moreover, the program that consumes the most energy is the sequential one with Turbo Boost on, while the process that consumes the least energy is the one that uses all the available cores with Turbo Boost off.

Table 7 Energetic speedup model and experimental for Intel Core i5 quad core processor

Table 7 shows the energetic speedup of double-precision matrix multiplication, for both the experiments and the model, with Turbo Boost on and off. The values in this table are for experiments with several matrix sizes and for the model given by Eqs. (25) and (27), using the power values in Table 5. It should be noted that, although there is an error between the model values and the experimental values, the model does capture the qualitative behavior of the energy for the cases in which Turbo Boost is off and on. Unlike the dual core case, where the maximum energy speedup is approximately one, we find that when all the cores of the quad core processor are used, the speedup is close to 2x, both with Turbo Boost on and off. Thus, when we use the multicore processor at its maximum computational capacity with Turbo Boost on, we save time (more than 4x speedup) and spend half of the energy consumed by the sequential program with Turbo Boost off.

4.3 Xeon eight-core experiments

Table 8 shows the main power and frequency attributes of the Intel Xeon eight-core processor. The values of the constant power \(W_c\) and of the power of the processor with 1 to 8 active cores were obtained experimentally with the Intel Power Gadget tool. Column W(1) is the power of the processor with one active core [Eqs. (6) and (7)], and columns W(p), with \(p\in \{2,\ldots ,8\}\), give the power of the processor with two to eight active cores.

Table 8 Frequency and electric power for Intel Xeon eight-core with Turbo Boost off and on

For this case, the H element of Eq. (17) is \(H=3.6\,GHz/3.4\,GHz=1.058823\). For visualization purposes, the tables and figures that describe the experimental and model behavior use 1, 2, 4, 6 and 8 threads.

Figure 7 shows the experimental and the model time behavior, corresponding to Eq. (15), for double-precision matrix multiplication. The model reproduces the time behavior observed in the experiments.

Fig. 7 Double-precision matrix multiplication (dpMM) performance on Intel Xeon eight-core

Table 9 compares the speedup model from inequalities (10) and (17) with the speedup obtained in the experiments. The case with Turbo Boost off corresponds to the traditional Amdahl’s law.

Table 9 Speedup experimental and model for Intel Xeon eight-core processor

Figure 8 shows the experimental and the model energy behavior, corresponding to Eqs. (22) and (24), for double-precision matrix multiplication. The model reproduces the energy consumption behavior observed in the experiments.

Fig. 8 Double-precision matrix multiplication (dpMM) energy behavior on Intel Xeon eight-core

Table 10 compares the energy speedup model from inequalities (25) and (27) with the speedup obtained in the experiments. It can be appreciated that as more cores are used, more energy is saved. The worst energy consumption occurs when a single core is used with Turbo Boost on. The maximum energy speedup is 4.8x and is obtained when eight cores are used with Turbo Boost off. The multicore processor saves the most energy when all cores are used.

Table 10 Energetic speedup experimental and model for Intel Xeon eight-core processor

Table 11 shows the values obtained with the Verner model [20]. We note that this model coincides with ours in the case of Turbo Boost off, and that, when comparing the two models for Turbo Boost on, our model is closer to the experiments (Table 10).

Table 11 Energy speedup on Intel Xeon eight-core processor with Verner Model

5 Conclusions

It is possible to obtain extensions of Amdahl’s law to study the performance and energy behavior of parallel programs running on Intel processors with Turbo Boost technology. The model presented in this work can help to understand the behavior of parallel programs on processors with Turbo Boost. When frequency changes in the processing units are considered, the rate of change of the frequency must be taken into account, as given by Eq. (17). To extend Amdahl’s law to energy consumption behavior, it is necessary to establish the power model of the processor when the frequency change is active and when it is not [Eqs. (4) and (5)]. By combining the model extensions for time and power, it is possible to extend Amdahl’s law for energy, considering constant or variable frequency [Eqs. (25) and (27)].

Several experiments were carried out on platforms with Intel Core i5 processors of two and four cores and an Intel Xeon eight-core processor. These processors allow frequency change to be enabled or disabled through the Turbo Boost technology. The operation considered was double-precision matrix multiplication for different matrix sizes, in order to stress the processor. The advantage of this test is that it is highly parallelizable and the values of f given in Eq. (11) are very close to zero, which allows studying the behavior of a highly parallelizable task.

The experiments made it possible to verify that the model reproduces the qualitative behavior in time and energy of parallel programs running on processors that allow frequency change. In addition, the experiments show that sequential programs consume the most energy when the processor has frequency change enabled, and that the parallel programs that consume the least energy are those that use the maximum number of cores available in the processor, in accordance with the model. Parallel programs consume less energy when the processor frequency is not changed than when the processor increases its frequency. From Figs. 4 and 6, it is possible to observe upper and lower limits of the energy consumption: the programs that consume the most energy are the sequential ones with Turbo Boost on, and the programs that consume the least energy are those with Turbo Boost off that use the maximum number of cores available in the processor. However, when all available cores are used, the energy difference between the Turbo Boost off and Turbo Boost on cases is very small.

As expected, it was validated that at higher frequencies the processor cores operate at higher power, as noted in Tables 2, 5 and 8. However, the speedup analyses show that using more cores decreases both the execution time and the energy consumption. In fact, when all the processor cores are used, the energy consumption is almost the same with Turbo Boost on and off.

The power model presented in this work assumes linear behavior; nevertheless, it is sufficient to derive the formulas for the energy speedup, and the experiments show that it agrees well with the experimental results. In the future, a nonlinear correction could be applied to the power model to better predict the energy consumption behavior, taking into account energy proportionality models.

The experimental results validate the model for architectures with Intel Turbo Boost technology. In future work, tests will be done for AMD’s Turbo Core technology and for ARM processors.