1 Introduction

Heterogeneous systems combine different types of processor: computing nodes may pair traditional multicore architectures (CPUs) with accelerators (mostly Nvidia GPUs [31] or Intel Xeon Phi cards [35]). Although such systems are becoming more common [42], they present a new set of specific challenges, such as scalability, energy efficiency, data management, programmability and reliability [2].

The role of the software developer will become increasingly important as such systems grow in popularity. Developers will be expected to manage the inherent tension between performance and power consumption, exploit the most useful features of each component type, and handle the complexity implied by combinations of hardware, instruction sets and programming models. So far, the efficient mapping of system components to computations within heterogeneous systems is largely the responsibility of the programmer (that is, the ability of the run-time system to achieve this is relatively immature).

The hardware/software co-design methodology has emerged since the 1990s as an approach to providing both analysis methods (which allow developers to assess whether or not a system meets its goals in terms of performance, power usage, etc.), and synthesis methods (which allow developers and researchers to rapidly explore the space of design methodologies) [8, 44].

This approach has facilitated significant advances in high-performance computing, which has, in turn, allowed for developments in computational modelling, image analysis, and many other areas [25, 38].

A particular application domain of interest to us is metaheuristics; specifically, algorithms inspired by natural processes or phenomena [37]. Many of these methods (such as the genetic algorithm [18], or particle swarm optimization [23]) are population-based: they maintain a collection of individual solutions which “evolves” in some way as the computation proceeds. These algorithms are generally stochastic, as they tend to rely on randomized search techniques. Additionally, they are inherently parallel, and many such variants have been described [1].

One nature-based method of particular interest is Ant Colony Optimization (ACO) [10, 14, 16]. This algorithm is based on foraging behaviour observed in colonies of ants, and has been applied to a wide variety of problems, including vehicle routing [45], feature selection [7] and autonomous robot navigation [17]. The method relies on “ants” (i.e., mobile agents) constructing paths on a graph representing a particular problem, where each path represents a candidate solution. Paths are assessed according to the quality of the solution that they represent, and ants then deposit “pheromone” (i.e., signalling chemicals) accordingly (the better the solution, the higher the pheromone concentration). The algorithm takes advantage of the positive feedback behaviour that emerges from the multi-agent system, where distributed selection quickly drives the population to high-quality solutions.

The original ACO method (called the Ant System [12]) was developed by Dorigo in the 1990s, and this version (or slight variants thereof, such as the MAX-MIN Ant System (MMAS) [41]) is still in regular use [6, 22, 24]. Parallel versions of the Ant System have been developed [9, 27, 40, 46] (see also [33] for a survey), and, in recent work, we have presented a GPU-based version of ACO that, for the first time, parallelizes both main phases of the algorithm (that is, tour construction and pheromone deposition) [3, 4].

The initial version of our ACO algorithm [3, 4] was implemented in CUDA (Compute Unified Device Architecture) and written in C, which gave access to the parallel processing capabilities of the GPU. This paper extends our framework to large-scale heterogeneous clusters, adding MPI and OpenMP to the existing CUDA implementation and incorporating different generations of Nvidia GPUs.

Since the advent of CUDA in 2006, at least four different generations of GPUs have been released: Tesla, Fermi, Kepler and Maxwell. Our algorithmic design investigates the potential to deploy a load-balancing strategy across several generations of Nvidia GPUs, for maximum performance and minimum power consumption. In what follows, we use our well-established ACO-based metaheuristic as both a benchmarking application and an illustration of the long-term potential of this method. Our experimental study covers a wide range of computing systems, from consumer-market devices to high-end servers.

This paper is organized as follows. Section 2 reviews the ACO method, the CUDA programming model and our ACO-based algorithm. Section 3 describes our parallelization techniques to enhance ACO simulation on GPU-based heterogeneous clusters, which form the main contribution of this work. Section 4 describes the experimental setup, Sect. 5 presents and analyses the experimental results, and we conclude in Sect. 6 with an overall assessment and suggestions for future work.

2 Background

2.1 Ant colony optimisation for the traveling salesman problem

In what follows, we reprise our description of the algorithm, which was first given in [5]. The Traveling Salesman Problem (TSP) [26] involves finding the shortest (or “cheapest”) round-trip route that visits each of a number of “cities” exactly once. The symmetric TSP on n cities may be represented as a complete weighted graph, G, with n nodes, with each weighted edge, \(e_{i,j}\), representing the inter-city distance \(d_{i,j}=d_{j,i}\) between cities i and j. The TSP is a well-known NP-hard optimisation problem, and is used as a standard benchmark for many heuristic algorithms [21].

Table 1 CUDA summary by hardware generation since its inception (four generations up to 2015)

The TSP was the first problem solved by Ant Colony Optimisation (ACO) [11, 13]. This method uses a number of simulated “ants” (or agents), which perform a distributed search on a graph. Each ant moves through the graph until it completes a tour, and then offers this tour as its suggested solution. Once a tour is complete, each ant may drop “pheromone” on the edges contained in its proposed solution. The amount of pheromone dropped, if any, is determined by the quality of the ant’s solution relative to those obtained by the other ants. The ants probabilistically choose the next city to visit, based on heuristic information obtained from inter-city distances and the net pheromone trail. Although such heuristic information drives the ants towards an optimal solution, a process of “evaporation” is also applied in order to prevent the search stalling in a local minimum.

The Ant System (AS) is an early variant of ACO, first proposed by Dorigo [11]. The AS algorithm is divided into two main stages: tour construction and pheromone update. Tour construction is based on m ants building tours in parallel. Initially, ants are randomly placed. At each construction step, each ant applies a probabilistic action choice rule, called the random proportional rule, in order to decide which city to visit next. The probability for ant k, placed at city i, of visiting city j is given by Eq. 1

$$\begin{aligned} p_{i,j}^{k}= \frac{\left[ \tau _{i,j}\right] ^{\alpha } \left[ \eta _{i,j}\right] ^{\beta }}{\sum _{l \in N_{i}^{k}}\left[ \tau _{i,l}\right] ^{\alpha } \left[ \eta _{i,l}\right] ^{\beta }}, \qquad \text {if } j \in N_{i}^{k}, \end{aligned}$$
(1)

where \(\eta _{i,j} = 1/d_{i,j}\) is a heuristic value that is available a priori, \(\alpha \) and \(\beta \) are two parameters which determine the relative influence of the pheromone trail and the heuristic information respectively, and \(N_{i}^{k}\) is the feasible neighbourhood of ant k when at city i. This latter set contains the cities that ant k has not yet visited; the probability of choosing a city outside \(N_{i}^{k}\) is zero (this prevents an ant returning to a city, which is not allowed in the TSP). Under this probabilistic rule, the probability of choosing a particular edge (i, j) increases with the value of the associated pheromone trail \(\tau _{i,j}\) and of the heuristic information value \(\eta _{i,j}\). The numerator of Eq. 1 is the same for every ant within a single iteration; computation time can therefore be saved by storing this information in an additional matrix, called the choice_info matrix, as shown in [15].

The random proportional rule ends with a selection procedure, performed analogously to the roulette-wheel selection of evolutionary computation (for more detail see [15, 19]). Each value \(choice\_info [current\_city][j]\) of a city j that ant k has not yet visited determines a slice on a circular roulette wheel, the size of the slice being proportional to the weight of the associated choice. The wheel is then spun, and the city to which the marker points is chosen as the next city for ant k. Furthermore, each ant k maintains a memory, \(M^{k}\), called the tabu list, which contains the cities already visited, in the order they were visited. This memory is used to define the feasible neighbourhood, and also allows an ant both to compute the length of the tour \(T^{k}\) it generated and to retrace the path when depositing pheromone.
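
To make the selection procedure concrete, the following C sketch implements roulette-wheel selection over the choice_info entries of the unvisited cities. Function and variable names (select_next_city, visited, n_cities) are illustrative and do not correspond to the exact code of [15].

```c
/* Roulette-wheel (random proportional) selection for one construction step.
 * choice_info[i][j] holds [tau_ij]^alpha * [eta_ij]^beta, precomputed as in [15].
 * visited[j] is non-zero if the ant has already visited city j.
 * Illustrative sketch, not the exact code of the reference AS package. */
#include <stdlib.h>

int select_next_city(int current_city, int n_cities,
                     double **choice_info, const char *visited)
{
    double sum = 0.0;
    for (int j = 0; j < n_cities; j++)              /* total weight of feasible cities */
        if (!visited[j])
            sum += choice_info[current_city][j];

    double r = ((double)rand() / RAND_MAX) * sum;   /* spin the wheel */
    double acc = 0.0;
    for (int j = 0; j < n_cities; j++) {
        if (visited[j]) continue;
        acc += choice_info[current_city][j];
        if (acc >= r)
            return j;                               /* marker landed on slice j */
    }
    for (int j = n_cities - 1; j >= 0; j--)         /* guard against rounding errors */
        if (!visited[j]) return j;
    return -1;                                      /* no feasible city left */
}
```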

After all ants have constructed their tours, the pheromone trails are updated. This is achieved by first lowering the pheromone value on all edges by a constant factor, and then adding pheromone on edges that ants have crossed in their tours. Pheromone evaporation is implemented by

$$\begin{aligned} \tau _{i,j} \leftarrow (1-\rho )\tau _{i,j}, \qquad \forall (i,j) \in L, \end{aligned}$$
(2)

where \(0 < \rho \le 1\) is the pheromone evaporation rate. After evaporation, all ants deposit pheromone on their visited edges:

$$\begin{aligned} \tau _{i,j} \leftarrow \tau _{i,j} + \sum _{k=1}^{m} \Delta \tau _{i,j}^{k}, \qquad \forall (i,j) \in L, \end{aligned}$$
(3)

where \(\Delta \tau _{i,j}^{k}\) is the amount of pheromone that ant k deposits on edge (i, j). This is defined as follows:

$$\begin{aligned} \Delta \tau _{i,j}^{k} = \left\{ \begin{array}{ll} 1/C^{k} &{} \text {if edge } (i,j) \text { belongs to } T^{k}, \\ 0 &{} \text {otherwise,} \end{array} \right. \end{aligned}$$
(4)

where \(C^{k}\), the length of the tour \(T^{k}\) built by the k-th ant, is computed as the sum of the lengths of the edges belonging to \(T^{k}\). According to Eq. 4, the better an ant’s tour, the more pheromone the edges belonging to this tour receive. In general, edges that are used by many ants (and which are part of short tours), receive more pheromone, and are therefore more likely to be chosen by ants in future iterations of the algorithm.
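
As a compact reference, the pheromone update of Eqs. 2–4 can be written in a few lines of sequential C. The sketch below assumes a dense n × n pheromone matrix tau, tours stored as sequences of city indices and tour lengths C[k]; all names are illustrative rather than taken from the reference implementation.

```c
/* Pheromone evaporation (Eq. 2) followed by deposition (Eqs. 3-4).
 * tau: n x n pheromone matrix; rho: evaporation rate;
 * tour[k]: the n cities visited by ant k, in order; C[k]: its tour length.
 * Illustrative sequential sketch. */
void update_pheromone(double **tau, int n, double rho,
                      int **tour, const double *C, int m)
{
    for (int i = 0; i < n; i++)                 /* evaporation on every edge */
        for (int j = 0; j < n; j++)
            tau[i][j] *= (1.0 - rho);

    for (int k = 0; k < m; k++) {               /* each ant deposits 1/C^k (Eq. 4) */
        double delta = 1.0 / C[k];
        for (int s = 0; s < n; s++) {
            int i = tour[k][s];
            int j = tour[k][(s + 1) % n];       /* wrap around to close the tour */
            tau[i][j] += delta;
            tau[j][i] += delta;                 /* symmetric TSP */
        }
    }
}
```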

2.2 The CUDA programming model

Compute Unified Device Architecture (CUDA) [29] is a platform for Graphics Processing Units (GPUs), covering both hardware and software. On the hardware side, the GPU consists of N multiprocessors replicated across the silicon area, each endowed with M cores that share a control unit, together with a shared memory (a small cache explicitly managed by the programmer). Each GPU generation has increased its CUDA Compute Capability (CCC), as well as its number of cores and shared memory size (see Table 1). In conjunction with these developments, power consumption has been roughly halved with each new generation.

The CUDA software paradigm is based on a hierarchy of abstraction layers: the thread is the basic execution unit; threads are grouped into blocks, and blocks are mapped to multiprocessors. C language procedures to be ported to GPUs are transformed into CUDA kernels, mapped to many-cores in a SIMD (Single Instruction Multiple Data) fashion (that is, with all threads running the same code but having different IDs). The programmer deploys parallelism by declaring a grid composed of blocks equally distributed among all multiprocessors. A kernel is therefore executed by a grid of thread blocks, where threads run simultaneously grouped in batches called warps, which are the main scheduling units.
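
A minimal sketch of this execution hierarchy is given below: a kernel is launched over a grid of equally sized thread blocks, and every thread derives a unique global index from its block and thread IDs. The kernel body is a placeholder, not part of our ACO code.

```c
// Minimal sketch of the CUDA execution hierarchy: a grid of blocks, each block a
// group of threads, every thread running the same kernel code (SIMD-style).
#include <cuda_runtime.h>

__global__ void example_kernel(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (tid < n)
        data[tid] *= 2.0f;                            // placeholder per-element work
}

void launch_example(float *d_data, int n)
{
    int threadsPerBlock = 256;                        // threads are scheduled in warps of 32
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    example_kernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();                          // wait for the grid to finish
}
```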

2.3 Our initial CUDA implementation

In previous work, we developed a CUDA-based ACO implementation, with an emphasis on data parallelism [4]. We now summarize this algorithm, as it provides the foundation of the current work.

Recall that our ACO implementation involves ants moving on a graph, deciding where to move next based on simulated pheromone concentrations. When an ant makes a decision on which city/node to visit next, it must calculate heuristic values which are the same for all ants at any one time step (that is, the heuristic information constitutes information on nodes, which must be consistent and accessible to all ants). It makes sense, therefore, to split the computation of heuristic values into a separate heuristic info kernel, which is then executed prior to tour construction. Transition probabilities are stored in a two-dimensional choice matrix, which is used to inform “roulette wheel” (Monte Carlo) selection by each ant.
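
A possible sketch of such a heuristic-information kernel is shown below; it fills the choice matrix with \([\tau _{i,j}]^{\alpha }[\eta _{i,j}]^{\beta }\) using one thread per matrix entry. The flat row-major layout and the kernel name are our assumptions for illustration, not necessarily those of the implementation in [4].

```c
// Hypothetical heuristic-information kernel: one thread per (i, j) entry of the
// n x n choice matrix. choice[i*n+j] = tau_ij^alpha * (1/d_ij)^beta, as in Eq. 1.
__global__ void heuristic_info_kernel(const float *tau, const float *dist,
                                      float *choice, int n,
                                      float alpha, float beta)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n * n) return;
    int i = idx / n, j = idx % n;
    if (i == j) { choice[idx] = 0.0f; return; }        // no self-loops in a tour
    float eta = 1.0f / dist[idx];                      // heuristic value 1/d_ij
    choice[idx] = __powf(tau[idx], alpha) * __powf(eta, beta);
}
```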

In the tour construction kernel, each ant is associated with a thread block, such that each thread represents a city (or cities) that the ant may visit. This avoids the problem of warp divergence and enhances data parallelism, as all threads within a block may cooperate. The degree of parallelism thus improves by a factor of w, where w is the number of CUDA threads per block.

Finally, the pheromone kernel performs evaporation and deposition. Evaporation is straightforward, as a single thread can independently lower each entry in the pheromone matrix by a constant factor. Deposition is more challenging, since each ant generates its own private tour in parallel, and will eventually visit the same edge as another ant. In order to prevent race conditions, we require the use of CUDA atomic operations when accessing the pheromone matrix in this stage.
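
The deposition stage can be sketched as follows, with one thread per edge of each ant's tour and atomicAdd serializing conflicting updates to the pheromone matrix. The data layout and names are illustrative.

```c
// Hypothetical pheromone-deposition kernel: thread idx handles one edge of one tour.
// tours: m x n array of city indices (row k = tour of ant k); tour_len[k] = C^k.
// atomicAdd avoids race conditions when two ants share an edge.
__global__ void pheromone_deposit_kernel(float *tau, const int *tours,
                                         const float *tour_len, int n, int m)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= m * n) return;
    int k = idx / n;                                   // ant index
    int s = idx % n;                                   // position within the tour
    int i = tours[k * n + s];
    int j = tours[k * n + (s + 1) % n];                // wrap around to close the tour
    float delta = 1.0f / tour_len[k];                  // Eq. 4
    atomicAdd(&tau[i * n + j], delta);
    atomicAdd(&tau[j * n + i], delta);                 // symmetric TSP
}
```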

3 Scaling to heterogeneous clusters

Fig. 1 Heterogeneous system based on different Nvidia GPU generations

Traditional parallel implementations are not always efficient when ported to heterogeneous systems. They are often inherited from scalable supercomputers, where all nodes in the cluster have the same compute capabilities, and they therefore lack the ability to distinguish computational devices with asymmetric computational power and energy consumption. Differences are not limited to fundamental hardware design (CPUs vs. GPUs), but also occur within the same family of processors. For example, the Kepler family (see Table 1) includes the Tesla K20, K20X and K40 models, endowed with 13, 14 and 15 multiprocessors, respectively (the K80 model even reaches 30 multiprocessors split across two chips). Figure 1 shows a heterogeneous cluster which, nowadays, may include different Nvidia GPU generations, even within the same node.

With this scenario in mind, we introduce a heterogeneity-aware parallelization of ACO applied to the Travelling Salesman Problem as introduced in Sect. 2.1. Our departure point is (1) the CUDA-based implementation of ACO described in Sect. 2.3, and (2) the parallelization strategy proposed by Stützle [39], where independent instances of the ACO algorithm are run on different processors (GPUs in our case, having assorted CUDA Compute Capabilities).

Parallel runs do not incur any communication overhead, and the final solution is chosen across all independent executions, taking advantage of the stochastic nature of ACO algorithms. The execution time of each independent execution may differ, as it depends on (1) the underlying GPU each ACO instance runs on, which is actually unknown at compile-time, and (2) the TSP instance size (the same in principle for all processors, but affected by GPU heterogeneity). Given that the slowest GPU will determine the overall execution time, our mission is to make use of the idle time offered by the most powerful GPUs. Performance and energy differences shown in the last two rows of Table 1 lead us to believe that there is ample room for improvement here.

We have designed an implementation with three main focuses: (1) Resources accounting through MPI processes, (2) performance monitoring via OpenMP threads, and (3) power consumption balance using GPU Boost. We now expand on each of these in the following subsections.

3.1 Resources accounting

First, our algorithm creates an MPI process for each node in the cluster on which the simulation runs. Heuristic information about inter-city distances is sent to each node, where supporting data structures are also created to avoid communication overhead. Each MPI process then creates as many OpenMP threads as there are GPUs available on the node, which is easily attained by querying GPU properties at run-time (using cudaGetDeviceCount from the CUDA API) and NVML (the Nvidia Management Library).
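
The following simplified sketch outlines this resource-accounting step: one MPI process per node, which queries the number of local GPUs and spawns one OpenMP thread per device. It is an outline under our own naming assumptions, not the exact production code.

```c
// Simplified resource-accounting sketch: one MPI process per node,
// one OpenMP thread per GPU found on that node.
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);             // GPUs visible on this node

    /* ... distribute inter-city distances and build local data structures here ... */

    #pragma omp parallel num_threads(num_gpus)
    {
        int gpu = omp_get_thread_num();
        cudaSetDevice(gpu);                    // bind this OpenMP thread to one GPU
        printf("node %d: thread %d drives GPU %d\n", rank, gpu, gpu);
        /* ... run the warm-up phase and the ACO instance on this GPU ... */
    }

    MPI_Finalize();
    return 0;
}
```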

3.2 Performance monitoring

Secondly, a warm-up phase is performed to establish performance differences among all targeted GPUs running the particular TSP instance to be solved. This phase measures, at run-time, the execution time of a small number of iterations of the ACO algorithm (five to ten) in order to detect these differences. Importantly, at this stage the algorithm is not trying to solve the TSP instance in any meaningful sense (five to ten iterations is not enough to do so); these runs only allow us to calculate the performance differences between GPUs. The execution times spent in this warm-up phase on all GPUs are reduced with MPI_Allreduce to obtain the maximum value. The Percent parameter is then determined according to Eq. 5: the slowest GPU has \(Percent=1\), a GPU twice as fast as the slowest GPU has \(Percent=0.5\), and so on.

$$\begin{aligned} Percent = \frac{Ex.time_{currentGPU}}{Ex.time_{slowestGPU}} \end{aligned}$$
(5)

We then establish the time budget, a threshold that determines the maximum completion time of the ACO algorithm on every GPU. It corresponds to the execution time required to perform a number of iterations of ACO on the slowest GPU available. This number of iterations (referred to as \(\delta \) from now on) is a configuration parameter of our algorithm, and is known by all nodes in the simulation. It is empirically determined to be large enough to find a good solution to the TSP with our CUDA implementation of ACO; for instance, in our experimental section \(\delta \) is set to 1000 iterations.

Each OpenMP thread then calculates the slot that it can use for the simulation (\(\gamma \) iterations, with \(\gamma \ge \delta \)). This slot can be used for a deeper search (computing additional iterations of ACO), or for reducing power consumption (by relaxing the clock rate of the GPU cores). In addition, when \(\gamma \ge 2\delta \), the algorithm can even perform a restart to avoid becoming “trapped” in a local minimum.

The total number of iterations (\(\gamma \)) that fit within the time budget is obtained from Eq. 6,

$$\begin{aligned} \gamma = \delta \cdot (1/Percent) \end{aligned}$$
(6)

where Percent is the performance difference identified among GPUs during the warm-up stage, as explained above.

The number of restarts (i.e., complete search phases of \(\delta \) iterations) that each GPU may perform is then given by Eq. 7,

$$\begin{aligned} \gamma / \delta = 1/Percent \end{aligned}$$
(7)

where the numerator on the right-hand side represents the Percent value of the slowest GPU, which is always 1.
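
Putting Eqs. 5–7 together, the per-GPU bookkeeping reduces to a few lines: each process times its warm-up run, the maximum over all processes is obtained with MPI_Allreduce, and the Percent, iteration and restart values follow directly. Variable names in the sketch below are illustrative.

```c
// Sketch of the warm-up bookkeeping behind Eqs. 5-7 (illustrative names).
#include <mpi.h>

void compute_budget(double warmup_seconds, int delta,
                    double *percent, int *iterations, int *restarts)
{
    double slowest_seconds;
    /* every process contributes its warm-up time; the maximum defines the budget */
    MPI_Allreduce(&warmup_seconds, &slowest_seconds, 1,
                  MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    *percent    = warmup_seconds / slowest_seconds;  /* Eq. 5: 1.0 on the slowest GPU    */
    *iterations = (int)(delta / *percent);           /* Eq. 6: iterations within budget  */
    *restarts   = (int)(1.0 / *percent);             /* Eq. 7: phases of delta iterations */
}
```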

Table 2 Hardware resources and experimental setup used during our executions

Finally, if we wish to reduce the overall power consumption of our simulation, we may use GPU Boost™, a hardware feature introduced by Nvidia with the K40 Kepler GPU. GPU Boost manipulates the clock rate of the GPU cores to trade performance for energy; the idea is to sacrifice time in favour of power consumption when the latter is more critical. Developers can use the nvidia-smi shell command to set the GPU frequency, usually raising or lowering the nominal value by around 20 %. To prevent excessive thermal stress, Nvidia does not allow this parameter to be changed at run-time from within an application, unlike Intel SpeedStep™. Moreover, the GPU is required to work in Persistence Mode, which keeps the driver loaded even when the GPU has no work to run. The range of supported clocks can be queried with the nvidia-smi -q -d SUPPORTED_CLOCKS command and changed with the -ac option (see [32] for more details and a full list of commands). Clock changes require superuser privileges; alternatively, developers can use the NVIDIA Management Library (NVML) [30], a C-based API for monitoring and managing diverse states of NVIDIA GPU devices (including clock settings), without having to run nvidia-smi before launching the application on the GPU.

Real-time measurement of the power consumption of individual GPU components through a software approach is only supported on the Nvidia Kepler architecture. This is also done with NVML, which reports the GPU power usage in real time; we use the nvmlDeviceGetPowerUsage function to obtain these readings.
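
For illustration, a minimal NVML sequence for fixing the application clocks and reading the instantaneous power draw could look as follows. The clock values (3004 MHz memory, 875 MHz graphics) are example settings for the Tesla K40, not a recommendation, and the snippet assumes the caller has the required permissions; error checking is omitted for brevity.

```c
// Minimal NVML sketch: set application clocks and read instantaneous power draw.
#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int power_mw;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);                // first GPU in the node

    // Requires appropriate privileges, as with nvidia-smi -ac
    nvmlDeviceSetApplicationsClocks(dev, 3004, 875);    // (memory MHz, graphics MHz)

    nvmlDeviceGetPowerUsage(dev, &power_mw);            // instantaneous draw in milliwatts
    printf("GPU power usage: %u mW\n", power_mw);

    nvmlShutdown();
    return 0;
}
```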

4 Experimental setup

4.1 Hardware environment

For this experimental study, we used the following platforms:

  • On the CPU side: Four Intel Xeon X7550 processors running at 2 GHz and plugged into a quad-channel motherboard endowed with 128 Gigabytes of DDR3 memory.

  • On the GPU side: Four GPUs, starting with a Tesla C2050 (Fermi generation, approximately 4 years old) and ending with a brand new GeForce GTX 980 (Maxwell generation), with two Kepler models in between (K20 and K40), all plugged into PCI-e 3.0 slots of the same motherboard to communicate with the CPUs.

Table 2 gives a detailed description of all these platforms. We use gcc 4.8.2 with the -O3 flag to compile on the CPU, and the CUDA compiler/driver/runtime version 6.5 to compile and run on the GPU.

4.2 Benchmarking

We test our designs using a set of benchmark instances from the well-known TSPLIB library [36, 43]. All benchmark instances are defined on a complete graph, and all distances are defined as integers. Table 3 lists all targeted benchmark instances with information on the number of cities, the type of distance and the length of the optimal tour.

ACO parameters such as the number of ants (m), and the values controlling their behaviour, such as \(\alpha \), \(\beta \) and \(\rho \), are set according to the values recommended in [15]. In particular, \(m=n\) (where n is the number of cities), \(\alpha =1\), \(\beta =2\) and \(\rho =0.5\).

Table 3 Description of benchmark instances from TSPLIB library (EUC_2D stands for 2D euclidean distance)

5 Experimental results

Because our techniques establish the experimental setup dynamically at run-time, the results shown below are platform dependent.

Fig. 2 Execution times in seconds on different Nvidia GPU generations for several TSP instances. Although we have used a Tesla S2050 in our experiments, the figure only shows the performance of a single GPU of the S2050 server (i.e. Tesla C2050)

Fig. 3 Quality of the results obtained for different TSPLIB instances, normalized to the optimal solution

5.1 Performance and workload balance

Figure 2 shows performance differences across GPU generations when running several TSP instances. Results are recorded for 1000 iterations and averaged over 10 different runs. The fastest GPU belongs to the latest generation (the Maxwell-based GeForce GTX 980), outperforming the slowest GPU by a factor of up to 4.2\(\times \). The slowest GPU is the Tesla C2050, which determines the time budget for the entire execution. The Tesla K20c, the Kepler model, obtains intermediate results, with a gain of up to 1.6\(\times \) over the Tesla C2050.

Fig. 4 Execution times in seconds on a Tesla K40 GPU for several TSP instances using different clock frequencies

Fig. 5 Power consumption (in milliwatts) measured for the Tesla K40 GPU on different clock frequencies and TSP instances

Results are measured statically here for the sake of showing performance differences in a real scenario; as described above, however, our methodology includes a warm-up stage to calculate these differences at run-time. In previous work [4] we gave a more detailed performance analysis; in particular, we reported a speed-up factor of up to 20\(\times \) on average for a Tesla C2050 versus a single-threaded CPU.

We now enhance our parallelization strategy to take advantage of the time during which the Kepler and Maxwell GPUs would otherwise be idle, in order to improve the quality of the results. One idea, which we call Deep Search, is to increase the number of iterations in order to perform a deeper search within the same time budget. For instance, the GeForce GTX 980 carries out 4102 iterations, the Tesla K40 1946 iterations, the Tesla K20c 1654 iterations, and the Tesla C2050 just 1000 iterations (the time budget established for this simulation).

Another possibility is to include a restart to avoid being trapped in a local minimum. This is possible if and only if the performance gap is at least twice the slowest GPU's performance. These two goals can be merged into a hybrid approach which we call Deep Search + Restart. Driven by this combination, the GeForce GTX 980 may perform up to four restarts of 1000 iterations each (as its Percent value is 0.24 on the pr1002 TSP instance), whereas the Tesla K40 and Tesla K20c only perform a single phase with a deeper search involving 1946 and 1657 iterations, respectively (Percent values of 0.51 and 0.60 are not enough to complete two restarts).

Figure 3 shows a tour-quality comparison between the sequential run and all parallel strategies for a variety of benchmarks, normalized to the optimal solution. The first bar represents the sequential code, written in ANSI C, provided by Stützle in [15]. This code runs for 1000 ACO iterations on a single-threaded CPU. The second bar shows the result quality of our GPU version over 1000 ACO iterations. The figure shows that the quality of the solutions obtained by these two versions is very similar.

The third bar shows our GPU Deep Search strategy, and the fourth bar represents Deep Search + Restart. These last two versions improve results by a significant margin within the same time budget, with a small advantage for Deep Search on average. Note that Deep Search performs restarts implicitly, as different searches are executed on different GPUs, whereas Deep Search + Restart includes restarts explicitly on the same GPU.

5.2 Power consumption

Figure 4 shows the execution times of our simulation under different clock settings. Performance gains reach a speed-up factor of up to 1.3\(\times \), in line with the 31 % increase in clock rate (the frequency rises from 666 to 875 MHz).

Figure 5 outlines the power consumption in milliwatts for different clock rates. As expected, power consumption rises with higher clock frequencies.

The overall energy budget is correlated with the total execution time of the application (see Fig. 6a). However, the 745 MHz clock setting (the default applied by Nvidia's driver for the Tesla K40) is the most energy efficient.

5.3 Power-aware performance metrics

Researchers have proposed metrics that combine performance and power measures into a single index. The most popular in low-power circuit design are of the form ED\(^n\) [34], where E is the energy, D is the circuit delay, and n is a nonnegative integer. The power-delay product (PDP), the energy-delay product (EDP) [20] and the energy-delay-squared product (ED\(^2\)P) [28] are special cases of ED\(^n\) with \(n = 0, 1, 2\), respectively.

Intuitively, ED\(^n\) captures the energy usage per operation, with a lower value reflecting the fact that power is more efficiently translated into the speed of operation. The parameter n implies that a 1 % reduction in circuit delay is worth paying an n % increase in energy usage; thus, different n values represent varying degrees of emphasis on deliverable performance over power consumption.
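
As a small worked example with hypothetical numbers (not taken from our measurements), consider two clock settings A and B:

$$\begin{aligned} \mathrm{ED}^{n}&= E \cdot D^{n}:\\ \text {A: } E=5000\,\text {J},\ D=100\,\text {s}&\Rightarrow \mathrm{PDP}=5000,\ \mathrm{EDP}=5.0\times 10^{5},\ \mathrm{ED^{2}P}=5.0\times 10^{7},\\ \text {B: } E=5500\,\text {J},\ D=80\,\text {s}&\Rightarrow \mathrm{PDP}=5500,\ \mathrm{EDP}=4.4\times 10^{5},\ \mathrm{ED^{2}P}=3.52\times 10^{7}. \end{aligned}$$

Setting A is preferable under PDP (pure energy), whereas setting B wins under EDP and ED\(^2\)P, which increasingly reward its shorter delay.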

Fig. 6 Energy consumption in Joules / 1000 (mJ) measured on different clock frequencies for the Tesla K40 GPU. Measurements are taken for the execution on all targeted TSP instances, and averaged over 10 launches. a Total energy, b energy delay product (EDP), and c energy delay squared product (ED\(^2\)P)

Figure 6b shows the Energy Delay Product (EDP) for our ACO simulation, and Fig. 6c the Energy Delay Squared Product (triple weight on performance). Both metrics prioritize performance over energy. Figure 4 shows that the performance differences among clock frequencies are significant, which benefits the fastest clock settings.

6 Conclusions and future work

We have presented a parallelization strategy tailored to heterogeneous and massively parallel systems. Heterogeneity may limit acceleration and waste energy unless programmers develop smarter applications that wisely control these features on the road towards an optimal performance/watt ratio. Our proposal treats accuracy, energy and time as equally important goals, balancing them at run-time through a cooperative scheduling of jobs. This makes our strategy particularly useful for non-deterministic algorithms and stochastic behaviours where real-time and/or energy constraints must be fulfilled. By setting up those constraints appropriately, the user may even grant priority to any of the goals of the metaheuristic.

At this preliminary stage of development, we have illustrated our ideas using Ant Colony Optimization as a case study. Given the scalability demonstrated throughout our experimental study, we foresee great potential to extend and refine our methods on future heterogeneous systems. In particular, facilities for measuring energy and temperature within the GPU are weak or almost non-existent on low-power devices such as Tegra heterogeneous platforms. Given the long road ahead for improvement, and the enthusiasm with which vendors are endorsing low-power devices, we believe the ideas presented here will benefit greatly from forthcoming sensors, hardware counters, middleware, libraries and tools, providing the research community with solid foundations to face the expected growth of heterogeneous systems in a much more power-aware manner.