1 Introduction

Nowadays, creating Multiprocessor System-on-Chip (MPSoC) architectures is a well-established solution for the design of efficient embedded systems [1]. On one hand, these architectures can deliver significant computational power thanks to a variety of processing elements, like general purpose processors, digital signal processors and specialized hardware accelerators. On the other hand, designing the applications for these systems is challenging due to several complex and interdependent steps to be performed [25]. First, the application has to be decomposed into multiple tasks that can be potentially executed in parallel or accelerated by dedicated components (partitioning). Then, these tasks need to be assigned to the available processing elements (mapping) and, finally, it is necessary to determine the execution order of the tasks assigned to the same resources (scheduling). When exploring this large design space (either by hand or by automatic methodologies), these combined solutions demand an accurate performance estimation before taking the final decisions [4].

Different approaches have been proposed for estimating the performance of parallel applications running on top of MPSoCs. Accurate evaluations can be obtained by running the design solutions directly on the target platforms, but in most cases these are not available in the early stages of the design. Alternatively, it is possible to use cycle-accurate simulators [6, 7], but they can be too slow to be adopted during the design exploration phase, when multiple solutions have to be evaluated and compared. Fast estimation techniques, based on mathematical models [8, 9], are thus usually preferred in this phase. They are indeed less accurate but much faster, making it possible to explore more solutions in less time.

Additionally, depending on the nature of the application, different representations can be used to describe the solutions and to estimate the performance. Applications running on embedded systems can be dominated either by data (e.g. audio/video/image processing, digital communications) or by control (e.g. device control, packet processing). Data-oriented applications are often represented through data-flow models and their analysis methods usually focus more on architectural aspects rather than on the application behavior, which is assumed to be highly predictable [10]. However, there is a wide range of applications that cannot be well represented by these models, mainly due to the presence of coarse-grained parallelism and conditional constructs (e.g. data-dependent loops, branches, function calls), which can significantly change the execution time of the individual parts of the application. In this scenario, task graph models [11] are widely adopted to represent partitioned solutions and derive mapping decisions [2, 5].

Multiple techniques have been proposed for estimating the performance of a task graph and most of them model the task execution time as a constant value [12] or as a stochastic variable [9]. These task estimations are then combined to estimate the execution time of the entire application, but without considering code correlations that may exist [13]. This can easily lead to wrong estimations that can, in turn, lead to the adoption of sub-optimal solutions. Conversely, the application behavior can be collected dynamically through code profiling [14], but this information is usually exploited only at task level, reducing the accuracy of the task graph estimation.

In this paper we present a methodology to accurately estimate the performance of control-dominated applications for heterogeneous embedded systems. To collect precise information about the control flow of the application (e.g. how many times the different sequences of branch transitions are executed), we extend the well-known Efficient Path Profiling [15] with a novel technique, called Hierarchical Path Profiling, which allows us to better correlate the profiling information with the structure of the partitioned application. Since the behavior and the correlations of the control constructs depend only on the input data, the profiling is independent of any parallel implementation or target architecture. For this reason, the profiling can be performed only once, on the sequential specification, and on a generic host machine, which is usually much faster than the target architecture. Our approach then combines these profiling data with task graph information to accurately estimate the speed-up of multiple parallel solutions with respect to the sequential version. We also integrate performance models of the different processing elements [16] and predictions of the synchronization costs [17] to obtain more accurate estimations for the specific target architecture. We applied our methodology to multiple embedded applications in several scenarios, which have been obtained by varying the number of processors in the target architecture and the compiler optimization levels. We then validated our estimations by comparing them with the benchmark execution on an MPSoC simulation platform. This shows that our methodology is effectively able to accurately predict the speed-up with an absolute error that is smaller than 5 % on average.

The rest of this paper is organized as follows. Section 2 presents a motivating example, which clearly shows why classical techniques are inadequate for estimating the performance of control-dominated applications running on MPSoCs. Section 3 discusses previous work, while Sect. 4 provides preliminary definitions and discusses the applicability of the approach to different architectures. Our methodology is then detailed in Sect. 5 and evaluated in Sect. 6. Finally, Sect. 7 concludes the paper.

2 Motivation

Estimating the speed-up introduced by a parallel implementation of a control-dominated application is challenging since the execution times of the tasks can vary significantly and, additionally, control constructs in distinct portions of the code can be correlated. To exemplify this problem, let us consider the function fun_0 shown in Fig. 1a. One of its possible parallelizations is described through some annotations borrowed from the OpenMP formalism [18] and shown in Fig. 1b, while the corresponding task graph is shown in Fig. 2. Let us also assume that the target architecture is composed of two processors (i.e. \(CPU_\alpha \) and \(CPU_\beta \)), and the following information is known:

  • the estimated execution time of each statement \(o_i\) (including the calls to functions fun_1, fun_2 and fun_3) is fixed and known, as reported in Table 1 (the identifier i of \(o_i\) is reported on the left-hand side of Fig. 1a);

  • the probability of condition c1 being true is 0.5, the probability of condition c2 being true is 0.5 and the condition c3 is always true;

  • the architecture requires 50 cycles to create the tasks and 10 cycles for either synchronizing or destroying the created tasks.

Finally, let us also assume that there exists a correlation between c1 and c2, which control the execution of fun_1 and fun_3, respectively. The following two situations are considered:

  • Situation 1: c1 and c2 always have the same value (either true or false) during an execution of fun_0, so fun_1 and fun_3 are either both invoked (true) or neither is invoked (false).

  • Situation 2: c1 and c2 always have opposite values during the same execution of fun_0, so fun_1 and fun_3 are called in mutual exclusion.

Table 2 reports the maximum and the average execution time for all the tasks. It is important to note how the execution of fun_1 and fun_3 heavily impacts the execution time of Task1 and Task3.

Fig. 1 Implementation of the example function fun_0. On each line, the number on the left-hand side is the identifier associated with the statement, while the number on the right-hand side is the identifier of the basic block to which the statement belongs. a Sequential implementation of function fun_0. b Parallel implementation of function fun_0

Fig. 2 Task Graph extracted from function fun_0

Table 1 Estimation of clock cycles delay of each statement
Table 2 Task execution times with different conditions

Now we consider the two following mapping and scheduling solutions to be evaluated:

  • SolA: Task1 and Task2 are mapped onto \(CPU_\alpha \) (with Task1 scheduled before Task2) and Task3 is mapped onto \(CPU_\beta \);

  • SolB: Task1 and Task3 are mapped onto \(CPU_\alpha \) (with Task1 scheduled before Task3) and Task2 is mapped onto \(CPU_\beta \).

The right part of Table 3 (Real) reports the different real speed-ups of fun_0 in each of the possible cases obtained by combining the two condition situations (Situation 1 and Situation 2) with the two mapping solutions (SolA and SolB). Results show that SolA has a larger speed-up in Situation 1, while SolB is the best solution in Situation 2. Table 3 also reports the estimations that can be obtained by using traditional techniques [19] based on average (AT) and maximum (MT) execution times. The former technique averages the different execution times of each task over all the situations, while the latter adopts the maximum execution time for each of them. These techniques produce the same results for the two situations, and these results may also lead to choosing inefficient mapping and scheduling solutions. Specifically, the MT technique always suggests choosing SolA, which is not correct in Situation 2. On the contrary, the AT technique always leads to slightly prefer SolB, which is not the best solution in Situation 1.

Table 3 Estimated and real speed-ups obtained with different mapping and scheduling solutions and condition correlations

These results show that, especially in case of control-dominated applications, the best mapping and scheduling solution can depend on the correlations that may exist among control constructs in the source code. For this reason, a methodology for a correct performance estimation of such applications has to necessarily take this aspect into account.

3 Related Work

Performance estimation is a crucial step in the design of efficient MPSoCs, where multiple design solutions have to be properly compared to determine the best decisions. Several methodologies have been proposed for evaluating the performance of parallel applications on MPSoCs. These methods can be roughly divided into three categories: direct measurements, estimation by simulation, and estimation based on mathematical models.

Most methodologies based on direct measurements (e.g. [17]) are not practical since integrating direct measurements in a design exploration framework is a long, difficult and error-prone task. Additionally, it cannot be completely automated and most of the work is thus performed manually by the designers, limiting the number of solutions that can be effectively evaluated. Techniques based on estimations are thus usually preferred.

In simulation-based methods, the single components or the entire system are estimated with simulations at different levels of accuracy (e.g. with ARMn [20], MPARM [6], ReSP [7], gem5 [21]). For example, in [22], a complete simulation is required to evaluate each design solution. However, accurate simulators are usually quite slow, especially in case of MPSoCs where they need to simulate multiple architectural aspects. For this reason the estimation problem is usually decomposed into sub-problems, where the simulation is performed only at a higher level of abstraction. For example, in [23], the performance of the single tasks is estimated by accurately annotating the source code, while the entire application is estimated through TLM simulations.

Estimations can also be obtained by exploiting mathematical models that correlate some numerical features of a design solution, which are collected through static or dynamic analyses, with its performance. In general, they are less accurate than the ones based on simulators, but they are much faster, so they allow the designer to compare many more solutions. Also in this case, these techniques adopt a two-stage approach that first estimates the single tasks and then the entire application. For example, [12] exploits the intermediate representation of the SUIF compiler [24] to estimate the execution time of each task and, then, interval analysis to predict the execution time of the whole application. In a similar way, in [25], GCC is modified to automatically generate the workload models of the tasks, while [26] combines performance estimation of single processors to estimate the performance of JPEG encoder and decoder applications on a pipelined MPSoC. [27] considers an ILP formulation for automatically parallelizing a hierarchical task graph representation, but the cost estimation is performed by simply associating a weight with each instruction, without analyzing the correlations between the control constructs.

While all these approaches model the execution time of a task as a constant, there are performance models where the execution times of tasks and task graphs are variable. In [28], the execution time of single tasks is modeled as a function of the variations in memory access count and request rate, but ignoring any other details of its internal behavior, such as correlations among conditional constructs. Finally, also stochastic variables have been used in the performance models of both tasks and task graphs. For example, [8] estimates the performance of a task graph as a stochastic variable, which is based on the stochastic variables associated with the execution times of the single tasks. Similarly, in [29], stochastic variables are used to model the access time of different tasks to resources in contention, while in [19] they are used to model the execution time of the single tasks, based on multiple profiling runs executed with different data sets. The authors also suggest two possible deterministic techniques: the worst-case estimation, which considers the 99.9th percentile of the execution time of each task, and the average-case estimation, which considers the average execution time. In [30] Distributionally Robust Monte Carlo Simulation (DRMCS) is combined with a task-accurate performance estimation method to guarantee a robust task graph estimation. It requires annotating each task with an interval estimation of its execution time. DRMCS is then applied to compute the worst-case execution time of the entire task graph, along with a confidence level for this estimation.

However, all these approaches are based on the assumption that the execution times of the tasks are independent and this can lead to a wrong evaluation of the design solutions, as shown in Sect. 2. The correlation effects among the workload of parallel tasks have been actually examined in [13], but only to correctly model the energy consumption of the analyzed solution and not for an accurate performance analysis.

To correctly model the task correlations induced by control constructs, we rely on path profiling [15]. Path profiling is a well-known technique that adopts an instrumentation of the branch constructs, followed by a series of executions of the resulting code with different data sets. This allows the designer to collect information about the dynamic behavior of the application. For this reason, several estimation methods have been based on this technique, but they are usually applied only to sequential applications. Ernst and Ye [31] discuss the performance estimation of sequential applications for real-time embedded systems. This work exploits the concept of path-based analysis to determine best and worst execution times. Similarly, Malik et al. [32] describe several static timing analysis techniques targeting embedded systems composed of a single processor. However, all the discussed techniques have high computational complexity, since they aim at verifying hard or soft constraints of real-time systems. Moreover, these techniques are limited by the number of generated paths, since they do not exploit any technique for path decomposition, as proposed in [15]. For this reason, they cannot be applied to large applications. In [33], the path information is used instead to analyze the synchronizations among the threads: the synchronization operations are speculatively anticipated if they are on the most executed paths. The path profiling information has thus been used to optimize the communication between the threads, rather than to estimate the performance of parallelized specifications.

To the best of our knowledge, none of the existing approaches is able to estimate the performance of entire task graphs by analyzing the correlations among the task executions due to conditional constructs. In [34], we proposed a preliminary approach that is able to consider such correlations by leveraging path profiling information. However, it does not consider heterogeneous architectures or information about mapping and scheduling. It is thus not possible to take into account the effects of executing the code on different processing elements, as well as the overhead introduced by resource contention. This paper extends this approach with the following main contributions:

  • we provide support for heterogeneous embedded systems by integrating performance models of different processing elements, along with information about mapping and scheduling decisions;

  • we present a comprehensive validation of our approach by comparing the estimated speed-up with the one obtained with an open-source simulation platform for MPSoCs, and by considering different architectures and compiler optimization levels for the applications.

4 Preliminaries

This section introduces the concepts we leverage for estimating the performance of partitioned control-oriented applications. Specifically, Sect. 4.1 presents some basic definitions to better understand our approach, while Sect. 4.2 discusses its applicability to different architectural templates.

4.1 Definitions

Our methodology works on top of the following intermediate representations, which are built for each function of the input application (a sketch of possible data structures is given after the list):

  • Control Flow Graph (CFG) [35], a directed graph \(G_{CFG} = (V,E_{CFG})\), which is an abstract representation of the paths (i.e. the sequences of branches) that might be traversed during the execution of the function; each vertex \(v_i \in V\) represents a basic block \(BB_i\); two additional vertices Entry and Exit are introduced to represent entry and exit points of the function execution, respectively; edges that close a loop of a path starting from the Entry node are named feedback edges [36];

  • Control Dependence Graph (CDG) [37], a directed graph \(G_{CDG}=(V,\) \(E_{CDG})\), which represents the control dependences of the basic blocks;

  • Control Dependence Region (CDR) [37], a partitioning of the basic blocks such that two basic blocks are in the same region if and only if they have the same set of control dependences in the CDG; the function \(\gamma : C_c=\gamma (BB_i)\) returns the Control Dependence Region \(C_c\) to which the basic block \(BB_i\) belongs;

  • Loop Forest [36], a representation of the loop hierarchy inside the CFG;

  • Hierarchical Task Graph [38], a representation of the application decomposition induced by the partitioning specified by the designer.
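
Concretely, these representations can be held in simple graph data structures. The following is a minimal sketch of how they might be encoded; all type and field names are illustrative assumptions and do not reproduce the internal representation of the actual framework.

    /* Hypothetical C structures for the intermediate representations listed
     * above; names and fields are illustrative, not those of the framework. */
    typedef enum { TASK_SIMPLE, TASK_COMPOUND, TASK_LOOP } TaskKind;

    typedef struct {
        int id;              /* basic block identifier BB_i                 */
        int cdr;             /* gamma(BB_i): its Control Dependence Region  */
    } BasicBlock;

    typedef struct {
        int src, dst;        /* CFG (or CDG) edge e_{src,dst}               */
        int feedback;        /* 1 if the edge closes a loop (feedback edge) */
    } Edge;

    struct Htg;

    typedef struct {
        TaskKind kind;       /* simple, compound or loop task               */
        struct Htg *body;    /* nested HTG for compound/loop tasks          */
    } Task;

    typedef struct Htg {
        Task *tasks;         /* vertices induced by the partitioning        */
        int   n_tasks;
        Edge *precedences;   /* precedence edges among tasks                */
        int   n_precedences;
    } Htg;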

Given the example of Fig. 1a, its CFG is represented in Fig. 3; the only feedback edge is the dashed edge \(e_{9,5}\). The CDG and the CDRs of the same example are shown in Fig. 4, where, for example, \(e_{1,2}\) represents that \(BB_2\) is executed if and only if \(BB_1\) has completed its execution and the value of its final condition is true. Conversely, \(BB_4\) has no control dependences with \(BB_1, BB_2\) and \(BB_3\), so they can be executed in parallel provided that data dependences are satisfied. The example contains one loop, which has \(BB_5\) as header and includes basic blocks \(BB_5, BB_6, BB_7\) and \(BB_8\). In the rest of the paper, we identify a loop with the number of its basic block header (e.g. \(L_5\)). The entire function fun_0 is considered as a main loop, called \(L_0\).

Fig. 3 The control flow graph of fun_0

Fig. 4 The CDG of function fun_0; dashed boxes identify CDRs named with capital letters

Partitioned applications are usually represented through a task graph, which is a directed graph whose vertices are the tasks induced by the partitioning and whose edges represent precedences among them. Similarly to [38] and [39], we adopt the Hierarchical Task Graph (HTG) as the intermediate representation of a partitioned application. Specifically, the HTG is an acyclic directed graph whose vertices can be: simple (i.e. a task with no sub-tasks), compound (i.e. a task that consists of other tasks in a HTG, for example higher-level structures such as subroutines) or loop (i.e. a task that represents a loop whose body is a HTG itself). In this work, to describe the parallelism, we adopt a sub-set of the OpenMP formalism [18]. OpenMP is a C/C++/Fortran extension widely adopted to describe the application partitioning directly inside the source code by means of pragmas [4]. For this reason, it is possible to activate sequential or parallel execution with simple compiler flags. It is however important to note that complete support of OpenMP is out of the scope of this work. On the contrary, we only selected a few annotations (parallel sections and section) that allow the designer to statically specify which parts of the code are meant to be executed in parallel, that is the structure of the HTG. Indeed, other OpenMP pragmas (e.g., task) prevent building the task graphs at design time. We create the HTGs by analyzing the intermediate representation produced by the compiler after the optimization phase. In this way, we are able to take into account the effects of compiler optimizations on the code associated with each task. For example, given the annotated code shown in Fig. 1b, we create the corresponding HTG, which is shown in Fig. 5, as follows. The HTGs associated with each function are created starting from the innermost ones. For this reason, the HTG associated with fun_0 is created after the HTGs of all called functions. Then, a simple task is created for Task0 since it contains no function calls or loops. It also represents the fork of the OpenMP parallel sections, which is composed of three sections. The first section corresponds to a compound task (i.e. Task1) since it contains the call to function fun_1. The corresponding HTG is associated with the same task. The second section contains a loop, followed by a function call. For this reason, two distinct tasks are generated: Task2a, which is associated with \(HTG_5\) (i.e. the HTG associated with the loop), and Task2b, which contains the remaining code of the section. Note that both Task6 (i.e. the task representing the loop body) and Task2b are compound tasks since they contain function calls. Similarly to the first section, the third one corresponds to a compound task (i.e. Task3) due to the presence of a function call. An additional task is created to represent the join of the OpenMP parallel sections (i.e. Task4) and it contains the remaining code. Note that, in Fig. 5, dotted vertices identify compound tasks (i.e. Task1, Task2b, Task3, Task6), dashed vertices identify loop tasks (i.e. Task2a) and solid vertices identify simple tasks (i.e. Task0 and Task4).
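
As an illustration of the supported subset, the following sketch shows how the parallel sections and section annotations can express a partitioning with the structure described above. The statement bodies, the helper function signatures and the variable names are invented for illustration and do not reproduce Fig. 1b.

    #include <stddef.h>

    /* Hypothetical helper functions (their bodies are not shown). */
    extern int fun_1(int *d, size_t n);
    extern int fun_2(int x);
    extern int fun_3(int *d, size_t n);
    extern int fun_4(int x);

    void fun_0(int *data, size_t n)
    {
      int a = 0, b = 0, c = 0;            /* Task0: fork of the sections       */
      #pragma omp parallel sections
      {
        #pragma omp section
        {                                 /* first section  -> Task1           */
          if (data[0] > 0)                /* condition c1                      */
            a = fun_1(data, n);
        }
        #pragma omp section
        {                                 /* second section -> Task2a + Task2b */
          for (size_t i = 0; i < n; ++i)  /* loop -> Task2a (nested HTG)       */
            b += fun_4(data[i]);          /* loop body -> Task6                */
          b = fun_2(b);                   /* remaining code -> Task2b          */
        }
        #pragma omp section
        {                                 /* third section  -> Task3           */
          if (data[1] > 0)                /* condition c2                      */
            c = fun_3(data, n);
        }
      }
      data[0] = a + b + c;                /* Task4: join of the sections       */
    }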

Fig. 5 Hierarchical Task Graph extracted from function fun_0. a Task Graph of \(L_0\). b Task Graph of \(L_5\)

Finally, mapping decisions are specified through custom code annotations, as in [39], and the information is associated with each task of the graph.

4.2 Supported Target Architectures

This work targets embedded systems composed of a set of processors, which feature local memories for instructions and data [5, 39], and no operating system, but with bare-metal synchronization, as in [39, 40]. We currently support ARM processors (with or without support for out-of-order execution) and DSPs. Supporting additional processors only requires generating the proper model (see [16]), which can estimate the performance of the assigned tasks based on their source code. Our methodology then leverages any of these models, as described in the following sections, to estimate the performance of task graph solutions.

We support different communication infrastructures. The processing elements can be indeed interconnected through a shared bus, a network-on-chip or point-to-point links [40]. From the point of view of estimations, this simply corresponds to a different communication overhead for each data transfer based on the infrastructure adopted for its implementation.

Moreover, there is no direct communication among parallel tasks: communication between parallel tasks and the other tasks (e.g. the fork and join tasks) can be explicit (performed at the beginning and at the end of their execution through direct data transfers) or implicit (performed throughout the execution by exploiting the shared memory). The delay of the first type of communication is well modeled by the proposed methodology since it is incorporated in the task overhead cost. On the contrary, the second type of transfers can introduce approximations in the estimations since the proposed methodology does not explicitly take cache memories into account. We also assume that there is no synchronization during the execution of parallel tasks (e.g. shared variables protected by mutexes). These situations are managed only at task boundaries [17], when tasks are created or destroyed. In fact, this is a common practice to effectively allow the parallel execution of the tasks.

With these assumptions, we are able to target both commercial platforms (e.g. Atmel Diopsis 940HF [41], TI OMAP 4 [42]) and prototype architectures obtained with commercial system-level design tools (e.g. Xilinx Vivado IP Integrator [43]). These architectures are also supported by multiple MPSoC simulators, which can be adopted for virtual prototyping (e.g. [6, 7, 21, 44, 45]). These solutions can be thus easily combined to analyze the mutual effects of partitioned applications and architectural decisions (e.g. size of caches, number of processors, communication infrastructures). In this work, we adopt ReSP [7], an in-house simulation platform, to demonstrate this potentiality.

5 Proposed Methodology

Our methodology is composed of two consecutive steps, as shown in Fig. 6. First, we profile the sequential version of the application, which is obtained by ignoring any partitioning or mapping pragma annotations, in order to collect information about the behavior of the application, which is then associated with its internal representation. In this step, we adopt the Hierarchical Path Profiling (HPP) (i.e. our extension to the Efficient Path Profiling [15]) to collect path information in a way that is suitable to be combined with the HTG representation adopted in the subsequent task graph estimations. This part is detailed in Sect. 5.1. Then we estimate the speed-up introduced by any of its parallel implementations. Specifically, considering the partitioning solution to be analyzed, the methodology estimates the execution time of each path by computing the contribution of all the tasks and, then, by combining these contributions following the structure of the HTG. The final estimation is obtained by a weighted average of these estimations where the weights are the frequency of the corresponding paths. The process is repeated at each level of the hierarchy, starting from the innermost loops to the outermost ones, as detailed in Sect. 5.2.

Fig. 6 Overview of the proposed methodology

5.1 Hierarchical Path Profiling

Before describing the HPP technique, we need to introduce the definition of path. Let \(G_{CFG}=(V,E_{CFG})\) be the CFG of a function. Note that our methodology does not impose any requirement on the structure of the CFG or of its loops. The path \(P_p\) is defined as the sequence of basic blocks \(BB_i\) \(\in V\):

$$\begin{aligned} P_p = BB_1{-}BB_2{-}\ldots {-}BB_n \end{aligned}$$

where each pair of consecutive basic blocks \(BB_i\)–\(BB_j\) is connected by an edge \(e_{i,j} \in E_{CFG}\). Since the CFG represents all the paths that might be traversed during a program execution, it is possible to count their occurrences and thus how many times the corresponding basic blocks are executed. This technique is usually called path profiling. According to this definition, the basic blocks that belong to a path are executed in sequence, without any interleaving.

However, it is worth noting that each cycle inside a cyclic CFG (i.e. a CFG with a feedback edge) is still a path. Then, any sequence composed of n repetitions of this path is again a path and so the number of paths may be infinite. For this reason, it is not possible to collect information about every admissible path. We thus need to select a subset of these paths, which we call valid paths, and collect information only about them. Our HPP considers as valid only the paths that correspond to an entire loop iteration (or function execution when considering the loop \(L_0\)). In particular, given the CFG \(G_{CFG}=(V,E_{CFG})\) and the set F of its feedback edges, a path \(P_{p} = \lbrace BB_i\) – \(BB_{i+1}\) – \(\dots \) – \(BB_j\rbrace \) is considered valid when it satisfies one of the following conditions:

  • \((BB_j,BB_i) \in F\): i.e. the last basic block \(BB_j\) is reconnected to the first basic block \(BB_i\) through a feedback edge;

  • \(BB_i = BB_{Entry} \wedge BB_j = BB_{Exit}\): i.e. the path starts from the initial basic block (\(BB_{Entry}\)) and terminates in the final basic block (\(BB_{Exit}\)).

Based on these conditions, the paths can be clustered in sets, called Hierarchical Paths (\(HP_i\)), according to the innermost loop \(L_i\) where they are completely contained. Specifically, the path \(P_{p}=BB_i\)\(BB_{i+1}\)\(\ldots \)\(BB_j\) is contained into \(HP_{i}\) since it refers to loop \(L_i\), which has \(BB_i\) as header. In our example, the path \(BB_5\)\(BB_6\)\(BB_7\)\(BB_9\) is contained into \(HP_{5}\) while the path \(BB_{Entry}\)\(BB_1\)\(BB_3\)\(BB_4\)\(BB_5\)\(BB_{10}\)\(BB_{11}\)\(BB_{13}\)\(BB_{Exit}\) is contained into \(HP_0\) since it refers to the loop \(L_0\) (i.e. the path starts from the function entry).

However, according to this definition of valid paths, cyclic paths are still admitted and, for this reason, the number of paths that can be identified in a cyclic CFG is still potentially infinite. To avoid this problem, given a path \(P_{p} \in HP_{i}\) that contains the execution of a nested loop \(L_j\), we replace the sequence of basic blocks belonging to \(L_j\) with the symbol \(L_j^*\). This represents that, during the execution of the path \(P_p\), a certain number of iterations of \(L_j\) may be executed. Following this definition, both the paths \(BB_{Entry}\)\(BB_1\)\(BB_3\)\(BB_4\)\(BB_5\)\(BB_{10}\)\(BB_{11}\)\(BB_{13}\)\(BB_{Exit}\) and \(BB_{Entry}\)\(BB_1\)\(BB_3\)\(BB_4\)\(BB_5\)\(BB_6\)\(BB_8\)\(BB_9\)\(BB_5\)\(BB_{10}\)\(BB_{11}\)\(BB_{13}\)\(BB_{Exit}\) can be represented with the same path \(BB_{Entry}\)\(BB_1\)\(BB_3\)\(BB_4\)\(L_5^*\)\(BB_5\)\(BB_{10}\)\(BB_{11}\)\(BB_{13}\)\(BB_{Exit}\). Indeed, they provide the same information for computing the execution time of \(HTG_0\). Then, details about the basic blocks executed during the nested loop \(L_5\) are used to compute the execution time of \(HTG_5\) (i.e. the one associated with the loop).

It is worth noting that, in the EPP technique proposed in [15], the paths extracted from the execution trace are a complete partition of the trace itself; as a result, each execution of a basic block is counted as part of one and only one path. On the contrary, in the HPP, the execution of a basic block can be considered as part of multiple paths and, thus, overlapping paths are admitted. For example, the execution trace \(BB_{Entry}\)\(BB_1\)\(BB_3\)\(BB_4\)\(BB_5\)\(BB_6\)\(BB_8\)\(BB_9\)\(BB_5\)\(BB_{10}\)\(BB_{11}\)\(BB_{13}\)\(BB_{Exit}\) contains the execution of two different and valid paths: \(BB_{Entry}\)\(BB_1\)\(BB_3\)\(BB_4\)\(L_5^*\)\(BB_5\)\(BB_{10}\)\(BB_{11}\)\(BB_{13}\)\(BB_{Exit}\) and \(BB_5\)\(BB_6\)\(BB_8\)\(BB_9\). Then, for example, the execution of \(BB_6\) is included in both the paths. Indeed, while the latter naturally contains the basic block in the loop iteration, the former implicitly contains the contribution of \(BB_6\) through the contribution of \(L_5^*\). As a result, including the contribution of \(L_5^*\) in the outermost path automatically includes the performance estimation of the corresponding loop. We will use this observation to hierarchically build the performance estimation of the entire application.

The HPP keeps track of the current path in the same way as the EPP. Specifically, a variable is used to store the encoded representation of the path, which is updated every time an edge of the CFG is traversed. When a valid path terminates (i.e. the execution reaches its final basic block), the corresponding counter is incremented and a new path starts. However, while in the EPP only one path is alive at a time, multiple paths can be simultaneously alive in the HPP, due to the path overlapping described before. In this case, when a new loop starts (i.e. the execution reaches its header), the current path becomes “idle” and a new path starts to keep track of the loop execution. The idle path then becomes active again only after the termination of the nested loop. More details about this aspect can be found in [34].
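
As an illustration of this bookkeeping, the following hedged C sketch shows one possible instrumentation scheme: a path register updated on CFG edges, one counter per valid path, and a small stack that suspends the enclosing path when a nested loop is entered. The edge encodings, the trace and all identifiers are assumptions and do not reproduce the instrumentation actually emitted by the framework.

    #include <stdio.h>

    #define MAX_DEPTH 16                /* maximum loop nesting handled here    */
    #define MAX_PATHS 256               /* number of distinct encoded paths     */

    static unsigned long path_count[MAX_PATHS]; /* one counter per valid path   */
    static unsigned      cur_path;              /* encoding of the active path  */
    static unsigned      idle[MAX_DEPTH];       /* paths suspended by loops     */
    static int           top = -1;

    /* Instrumented CFG edge: the increment encodes the edge, as in the EPP.    */
    static void edge(unsigned delta)  { cur_path += delta; }

    /* Header of a nested loop reached: the current path becomes idle and a     */
    /* fresh path starts to track the loop iterations.                          */
    static void loop_enter(void)      { idle[++top] = cur_path; cur_path = 0; }

    /* A valid path terminates (feedback edge taken or Exit reached): its       */
    /* counter is incremented and a new path starts.                            */
    static void path_end(void)        { path_count[cur_path]++; cur_path = 0; }

    /* The nested loop terminates: the idle path becomes active again.          */
    static void loop_exit(void)       { cur_path = idle[top--]; }

    int main(void)
    {
      /* Hypothetical trace of fun_0: Entry-BB1-BB3-BB4, one iteration of L5,
       * then the exit path through BB10-BB11-BB13-Exit.                        */
      edge(0);        /* Entry -> BB1 -> BB3 -> BB4 (encoded increments)        */
      loop_enter();   /* header BB5 reached: the outer path becomes idle        */
      edge(1);        /* BB5 -> BB6 -> BB8 -> BB9                               */
      path_end();     /* feedback edge e_9,5: one iteration path is recorded    */
      loop_exit();    /* loop terminated: the outer path is active again        */
      edge(2);        /* BB5 -> BB10 -> BB11 -> BB13 -> Exit (with L5*)         */
      path_end();     /* valid outer path recorded                              */
      printf("outer path executed %lu times\n", path_count[2]);
      return 0;
    }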

Once HPP has been applied and all paths have been hierarchically clustered, they are projected onto the CDRs defined in Sect. 4: each path can be represented as the set of executed CDRs. We call this projection Control Region Path (CRP). In particular, let \(P_p \in HP_l\) be a path belonging to loop \(L_l\), the \(CRP_{p}\) associated with path \(P_p\) is defined as:

$$\begin{aligned} CRP_{p} = \lbrace CDR_i | \exists BB_j \in P_p : CDR_i = \gamma (BB_j) \rbrace \end{aligned}$$
(1)

where \(\gamma \) is the function that associates a basic block with its CDR. Since the function \(\gamma \) is surjective (i.e. more basic blocks can belong to the same CDR), the size of a control region path \(CRP_{p}\) is equal to or smaller than the size of the corresponding path \(P_p\), without losing any information since the CDR represents all the basic blocks that have to be executed under the same control conditions.
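
For instance, the projection of Eq. 1 can be computed as a simple set union over the basic blocks of a path; in the sketch below the \(\gamma \) mapping and the sizes are purely hypothetical values used only for illustration.

    #include <stdbool.h>
    #include <stdio.h>

    #define N_CDR 8                      /* hypothetical number of CDRs         */

    /* Hypothetical gamma(BB_i): CDR index of basic blocks 0..13 (0 = Entry).   */
    static const int gamma_of[14] = { 0, 0, 1, 2, 0, 3, 4, 5, 4, 3, 0, 6, 7, 0 };

    /* Compute CRP_p as the set of CDRs touched by a path of basic blocks.      */
    static void crp(const int *path, int len, bool in_crp[N_CDR])
    {
      for (int c = 0; c < N_CDR; ++c) in_crp[c] = false;
      for (int i = 0; i < len; ++i)   in_crp[gamma_of[path[i]]] = true;
    }

    int main(void)
    {
      const int p[] = { 5, 6, 7, 9 };    /* path BB5-BB6-BB7-BB9                */
      bool in_crp[N_CDR];
      crp(p, 4, in_crp);
      for (int c = 0; c < N_CDR; ++c)    /* the CRP is never larger than the path */
        if (in_crp[c]) printf("CDR %d is in CRP_p\n", c);
      return 0;
    }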

By elaborating path profiling information, it is then possible to derive information about the average number of iterations for each loop. The average number \(N_l\) of iterations of a loop \(L_l\), which is nested in \(L_j\), can be computed as:

$$\begin{aligned} N_l = \frac{\sum _{CRP_p \in HP_l} f_p}{\sum _{CRP_q \in HP_j : \gamma (BB_l) \in CRP_q} f_q} \end{aligned}$$
(2)

where \(f_p\) corresponds to the number of times that path \(P_p\) is executed. The numerator is the total number of iterations of \(L_l\), which is computed as the sum of the number of executions of all paths contained in \(HP_l\). The denominator corresponds to how many times the loop \(L_l\) is entered, which is computed as the sum of the number of executions of the paths of \(L_j\) that enter \(L_l\).
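
As a purely illustrative example, assume that the paths of \(HP_5\) are executed 40 times in total and that the paths of \(HP_0\) that enter \(L_5\) are executed 10 times (these frequencies are hypothetical and are not taken from Appendix 1); Eq. 2 then gives:

$$\begin{aligned} N_5 = \frac{40}{10} = 4 \end{aligned}$$

i.e. loop \(L_5\) performs, on average, four iterations each time it is entered.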

Note that we compute a single speed-up for each partitioned solution. However, for many control-dominated applications, the behavior of the application and, in turn, the results of the path profiling depend on the input data. So, in case of multiple input data sets, the path profiling information is obtained by averaging the results obtained on the single runs. Computing a separate speed-up for each input data set is possible, but this approach has some critical issues. In fact, if the best solution turns out to be different for each data set, multiple solutions have to be implemented at the same time in the final system, which can introduce resource problems (e.g. memory to be reserved for object code). Additionally, it would be necessary to implement a runtime mechanism to automatically determine the solution to be adopted based on the input data set, which is a challenging task.

Appendix 1 shows the results of applying the HPP to the example shown in Sect. 2.

5.2 Task Graph Estimation

This section shows how we combine the path profiling information obtained with the HPP with the HTG representation and the mapping and scheduling decisions in order to produce a performance estimation.

To do this, the HTG to be estimated is transformed into \(\overline{HTG}\) to take into account the mapping and scheduling information of each task, as extracted from the design solution. Specifically, an edge is added from \(Task_i\) to \(Task_j\) when: \(Task_i\) and \(Task_j\) (or the tasks contained in them) share a processing element (mapping) and \(Task_i\) is scheduled before \(Task_j\) (scheduling).
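
The following minimal sketch shows how such edges could be added; the mapping and scheduling arrays, as well as all names, are illustrative assumptions.

    #define N_TASKS 5

    /* Hypothetical design solution: processing element and scheduling order of
     * each task on its processing element.                                     */
    static const int pe_of[N_TASKS] = { 0, 0, 1, 0, 0 };
    static const int order[N_TASKS] = { 0, 1, 0, 2, 3 };

    static int edge[N_TASKS][N_TASKS];   /* adjacency matrix of HTG-bar         */

    /* Add an edge Task_i -> Task_j when the two tasks share a processing       */
    /* element and Task_i is scheduled before Task_j.                           */
    static void add_mapping_scheduling_edges(void)
    {
      for (int i = 0; i < N_TASKS; ++i)
        for (int j = 0; j < N_TASKS; ++j)
          if (i != j && pe_of[i] == pe_of[j] && order[i] < order[j])
            edge[i][j] = 1;
    }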

Our methodology analyzes all the tasks of the application, starting from the task graphs at the innermost levels of the hierarchy. First, we estimate the execution time of each path by combining the contribution of its statements. Since each path may traverse multiple tasks during its execution and these tasks may be assigned to different processing elements, the contribution of each statement is computed according to the performance model of the processing element where the corresponding task has been mapped. Then, if the path contains one or more loops, their contributions are also taken into account. In this case, the average execution time of a loop iteration is multiplied by its average number of iterations, which is equal to one in case of \(L_0\) (i.e. the HTG associated with the entire function). Since different execution paths can be traversed during a loop iteration, the average execution time of the loop iteration is estimated by considering independently each execution path and then by performing a weighted average of their contributions according to their frequency. Finally, to compute the performance estimation \(HTC_0\) of \(\overline{HTG_0}\) (i.e. a function), \(\overline{HTG_0}\) is hierarchically analyzed with the procedure described by Algorithm 1. Three main steps can be identified:

  1. Task Analysis (lines 1–17): we compute the task contributions to the different paths;

  2. Task Graph Analysis (lines 18–23): we analyze the vertices of \(\overline{HTG_l}\) in topological order to compute start and end times of each task;

  3. Task Graph Performance Estimation (lines 24–27): the end time of task Exit is used to estimate the performance of the entire \(\overline{HTG_l}\).

Algorithm 1

Before analyzing a HTG, all nested HTGs must have already been analyzed, since the contributions of the innermost loops or of the called functions have to be taken into account. For example, the performance estimation of \(HTG_0\) can be computed only after estimating the performance of \(HTG_5\). Similarly, the HTGs associated with fun_1, fun_2, fun_3 and fun_4 must be estimated before estimating the HTG associated with fun_0. The contribution of a function call is estimated as fixed and independent of the particular call site, potentially introducing an approximation in the estimation of the parallel solution. An alternative solution is to create a clone of the complete function HTG for each call site in order to produce better estimation results. However, this can significantly increase the complexity of the proposed methodology. Note that recursive functions are not supported by the proposed methodology.

Before estimating the performance \(HTC_l\) of the \(HTG_l\), several intermediate estimations need to be performed to compute the contribution of each task to each path and then of each path to the entire task graph. These contributions are computed starting from the contributions of the single statements which compose the path, aggregated according to the structure of the HTG and the CFG of the specification. To estimate the contribution of the statements, different methods can be adopted, such as analytical models [16, 46] or cycle-accurate simulators [20]. In this work, we adopt estimations based on analytical models. In particular, given a statement to be characterized, we adopt as features the sequence of low-level instructions (i.e. RTL instructions produced by the compiler for the target processing element) that correspond to the specific statement and to the preceding ones in the execution flow. The performance model, which is built by means of linear regression on a set of characteristic applications, takes as input the sequence of low-level instructions associated with the statement and produces as output the estimation of the corresponding execution time. Additional details can be found in [16]; a purely illustrative sketch of this kind of model is shown at the end of Sect. 5.3. After the estimation of the execution time of the single instructions, we perform the following intermediate estimations:

  1. \(BC_{i,t}\) (line 2) is the contribution of a basic block to the execution time of a task. It is computed as the estimated execution time of the statements of \(BB_{i}\) which belong to the task \(v_t\):

    $$\begin{aligned} BC_{i,t}=f(o_{s1}, o_{s2}, \ldots , o_{sn}) \end{aligned}$$
    (3)

    where \(o_{si}\) is a statement of \(BB_{i}\) which belongs to the task \(v_t\) and \(f(\dots )\) is the estimation of the execution time of the statements, which takes into account also the processing element where the task \(v_t\) has been mapped as described above.

  2. \(\overline{BC}_{i,t}\) (line 4 and line 6) is the contribution of a basic block to the execution of a task and includes also the contributions of nested loops. If a task is a loop and \(HTG_i\) is the nested HTG, the estimated loop performance \(HTC_i\) is added to the contribution of the header \(BB_i\):

     $$\begin{aligned} \overline{BC}_{i,t}= \left\{ \begin{array}{ll} BC_{i,t} + \textit{HTC}_i &{} \quad \text {if } v_t \text { is a loop task containing } HTG_i\\ BC_{i,t} &{} \quad \text {otherwise} \end{array}\right. \end{aligned}$$
     (4)
  3. \(\textit{CC}_{c,t}\) (line 9) is the contribution of a CDR to the execution time of a task. It is computed as the sum of the contributions of the basic blocks belonging to the CDR:

    $$\begin{aligned} \textit{CC}_{c,t}=\sum _{\forall BB_i : c=\gamma (BB_i)} \overline{BC}_{i,t} \end{aligned}$$
    (5)
  4. \(\textit{TPC}_{p,t}\) (line 13) is the execution time of the task t when the path \(P_p\) is executed. It is computed as the sum of the contributions of all the CDRs belonging to \(P_p\):

    $$\begin{aligned} \textit{TPC}_{p,t} = \sum _{\forall c : CDR_c \in CRP_p} CC_{c,t} \end{aligned}$$
    (6)
  5. \(\overline{\textit{TPC}}_{p,t}\) (line 15) is the overall execution time of the task t (including the task management overhead, if any) when the path \(P_p\) is executed. It is computed as the sum of the execution time and the overhead cost:

    $$\begin{aligned} \overline{\textit{TPC}}_{p,t} = \textit{TPC}_{p,t} + OC_t \end{aligned}$$
    (7)
  6. \(START_{p,t}\) (line 20) is the time, from the beginning of the execution of an iteration of \(L_l\), at which the task t starts the execution of the path \(P_p\), while \(STOP_{p,t}\) (line 21) is the time at which the task t ends the execution of the path \(P_p\). \(PC_p\) (line 25) is the contribution of each path \(P_p\) to the average performance of the task graph. The start time \(START_{p,t}\) is computed as:

    $$\begin{aligned} \textit{START}_{p,t} = max_{v_u \in pred(v_t)} \textit{STOP}_{p,u} \end{aligned}$$
    (8)

    where \(pred(v_t)\) is the set of the predecessors of \(v_t\) in \(\overline{HTG_l}\). Equation 8 states that the start time of a task is the maximum between end times of the tasks that precede \(v_t\) in \(\overline{HTG_l}\). The end time \(STOP_{p,t}\) is computed as:

    $$\begin{aligned} \textit{STOP}_{p,t} = \textit{START}_{p,t} + \overline{TPC}_{p,t} \end{aligned}$$
    (9)

    Equation 9 states that the end time of a task \(v_t\) during the execution of path \(P_p\) is the start time of the task plus the time required for its execution (\(\overline{TPC}_{p,t}\)). Finally, \(PC_p\) (i.e., the contribution of path \(P_p\) to \(\textit{HTC}_l\)) is computed as:

    $$\begin{aligned} \textit{PC}_p = \textit{STOP}_{p,Exit} \end{aligned}$$
    (10)

    Equation 10 states that the contribution of each path is the end time of the task Exit.

  7. \(HTC_l\) (line 27) is the overall task graph execution time. It is computed as a weighted average of the contributions given by all paths:

    $$\begin{aligned} \textit{HTC}_{l} = N_l\cdot \frac{\sum _{P_p \in HP_l} (PC_p\cdot f_p)}{\sum _{P_p \in HP_l} f_{p}} \end{aligned}$$
    (11)

    where \(N_l\) is the average number of iterations of \(L_l\) (\(N_0=1\)) and \(f_p\) represents how many times the path \(P_p\) is executed.
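
To make the traversal concrete, the following hedged C sketch reproduces the structure of steps 6 and 7 on a toy transformed task graph: tasks are visited in topological order to compute start and end times per path (Eqs. 8–9), the path contributions (Eq. 10) and their weighted average (Eq. 11). The graph, the per-path task times and the path frequencies are invented, and the sketch does not reproduce Algorithm 1 itself.

    #include <stdio.h>

    #define N_TASKS 4                /* Entry(0), Task1(1), Task2(2), Exit(3)   */
    #define N_PATHS 2                /* valid paths of HP_l collected by HPP    */

    /* pred[t][u] != 0 iff task u precedes task t in the transformed HTG (data  */
    /* dependences plus mapping/scheduling edges); tasks are assumed to be      */
    /* indexed in topological order.                                            */
    static const int pred[N_TASKS][N_TASKS] = {
      { 0, 0, 0, 0 },                /* Entry has no predecessors               */
      { 1, 0, 0, 0 },                /* Task1 depends on Entry                  */
      { 1, 1, 0, 0 },                /* Task2 depends on Entry and, through a   */
                                     /* mapping edge, on Task1                  */
      { 0, 1, 1, 0 },                /* Exit joins Task1 and Task2              */
    };

    /* Overall per-path task times TPC-bar (Eq. 7), overhead OC_t included.     */
    static const double tpc[N_PATHS][N_TASKS] = {
      { 50.0, 400.0, 100.0, 10.0 },  /* path 0 (e.g. a long branch is taken)    */
      { 50.0, 100.0, 400.0, 10.0 },  /* path 1 (e.g. the other branch is taken) */
    };

    static const double freq[N_PATHS] = { 30.0, 10.0 };   /* frequencies f_p    */

    int main(void)
    {
      double num = 0.0, den = 0.0;
      const double n_l = 1.0;        /* average iterations of L_l (N_0 = 1)     */

      for (int p = 0; p < N_PATHS; ++p) {
        double start[N_TASKS], stop[N_TASKS];
        for (int t = 0; t < N_TASKS; ++t) {        /* topological order         */
          start[t] = 0.0;
          for (int u = 0; u < t; ++u)              /* Eq. 8: max over preds     */
            if (pred[t][u] && stop[u] > start[t])
              start[t] = stop[u];
          stop[t] = start[t] + tpc[p][t];          /* Eq. 9                     */
        }
        num += stop[N_TASKS - 1] * freq[p];        /* Eq. 10: PC_p = STOP(Exit) */
        den += freq[p];
      }
      printf("HTC_l = %.1f\n", n_l * num / den);   /* Eq. 11                    */
      return 0;
    }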

Let \(S_{0}\) be the performance of the sequential specification; the estimated speed-up \(\mu \) introduced by the parallelization is then computed as:

$$\begin{aligned} \mu = \frac{S_{0}}{\textit{HTC}_{0}} \end{aligned}$$
(12)

5.3 Analysis of the Proposed Methodology

It is worth noting that the estimation presents some approximations because of the simplifications that have necessarily been introduced. First, the execution time of each called function is estimated to be constant and equal to its average execution time: calling contexts and inter-function correlations are not analyzed, as discussed above. In the same way, the correlations between statements belonging to different loops are not taken into account and the execution time of the nested loops is estimated to be constant (i.e. the average execution time of an iteration multiplied by the average number of iterations). However, applying the proposed methodology to the two cases presented in Sect. 2, we obtain 2648.5 and 2122 cycles respectively (details are shown in Appendix 1), and these results have been confirmed by the execution times obtained with simulation. This shows that, by exploiting profiling information, the proposed methodology is able to take into account the contribution of each statement when estimating the overall performance of the application.

The algorithm complexity is \(O(|C| \cdot |HP_l| \cdot |V_l|)\), where C is the number of CDRs, \(HP_l\) is the set of paths for \(L_l\) and \(V_l\) is the set of tasks for \(HTG_l\), as results from line 13 of Algorithm 1. In Eqs. 4, 5 and 6, a linear additive model is adopted to combine the contributions of the different path components. \(TPC_{p,t}\) is the estimation of the execution time of the task statements sequentially executed: it is possible to easily integrate more complex models for estimating the overall execution time of these sequences of statements, but this requires computing all the \(TPC_{p,t}\) independently starting from the single statements, which may increase the complexity of the approach.
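
As an example of the statement-level analytical model underlying \(f(\ldots )\) and of the additive combination discussed above, the sketch below estimates the cycles of a statement as a linear combination of per-instruction-class counts, with coefficients obtained offline by regression. The instruction classes, the coefficients and all names are invented and do not reproduce the model of [16].

    /* Hypothetical linear performance model: the cycles of a statement are
     * estimated from the counts of the low-level instruction classes it expands
     * to, with per-processing-element coefficients fitted offline by linear
     * regression on a set of characteristic applications (values invented).    */
    enum { ALU, MUL, LOAD, STORE, BRANCH, N_CLASSES };

    static const double coeff[2][N_CLASSES] = {
      { 1.0, 3.0, 2.5, 2.0, 1.5 },     /* e.g. an in-order ARM core             */
      { 0.8, 1.2, 2.0, 1.8, 1.0 },     /* e.g. a DSP                            */
    };

    /* Estimated contribution of one statement on processing element 'pe'.      */
    static double stmt_cycles(int pe, const int count[N_CLASSES])
    {
      double cycles = 0.0;
      for (int c = 0; c < N_CLASSES; ++c)
        cycles += coeff[pe][c] * count[c];
      return cycles;
    }

Summing these per-statement contributions yields \(BC_{i,t}\), and the further sums of Eqs. 5 and 6 then yield \(CC_{c,t}\) and \(TPC_{p,t}\).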

6 Experimental Evaluation

We tested our methodology on several C-based benchmarks mapped on different architectures. Section 6.1 describes the experimental setup, while Sect. 6.2 shows the results that have been obtained.

6.1 Experimental Setup

Our methodology has been integrated in PandA [47], a hardware/software co-design framework based on GCC [48]. We tested this methodology on several benchmarks, which have been extracted from different benchmark suites for embedded systems: MiBench [49], OpenMP Source Code Repository (OmpSCR) [50] and Splash 2 [51]. Their characteristics are reported in Tables 4 and 5. The parallelism has been described with OpenMP: some of these benchmarks already contain such annotations, while the remaining ones have been manually partitioned. We then applied our framework to the resulting code and exploited the intermediate representation of GCC to build the corresponding HTG representation, as described in Sect. 4.1.

Table 4 Characteristics of analyzed benchmarks
Table 5 Execution times of the benchmarks when executed on the uniprocessor architecture

To implement the HPP, we added the proper instrumentation and executed the resulting code on the host machine to collect information about the executed paths. Additional details can be found in [34]. Note that this instrumentation usually introduces an execution overhead that ranges from 20 to 200 % with respect to the non-instrumented execution on the same machine. However, since the profiling is performed directly on the host system, which is usually much faster than the target architecture or its cycle-accurate simulator, the actual overhead of the instrumented application with respect to the original application executed on the target architecture is significantly lower. For some architectures, the instrumented execution on the host system can even be faster than the non-instrumented execution on the target. Additionally, the path profiling is performed only once on the sequential application and the evaluation of multiple parallel solutions does not require multiple path profilings. For these reasons, the instrumentation overhead is acceptable. Applying the optimizations proposed in [15] (e.g. the use of registers to store intermediate results) would further decrease this overhead, but this is out of the scope of this paper.

In our experiments, the target architectures are composed of ARM processors (from 1 to 4), with a shared 32 Mbyte memory connected through a shared bus. We adopted ARM922T processors [52] with a 333 MHz clock frequency, based on the ARM9TDMI core (ARMv4T architecture) with an 8 KB instruction cache and an 8 KB data cache. Different performance models have been created with the methodology proposed in [16] to consider the effects of different sets of compiler optimizations on the application performance. In particular, we built performance models for applications compiled with no optimizations (-O0) and with a standard set of active optimizations (-O2). The task management costs have been obtained by applying the profiling technique proposed in [17]. Mapping decisions for these architectures have been obtained by applying the methodology proposed in [4] and then specified as source code annotations [39]. This approach automatically produces a partitioning of the resources among parallel tasks at each level of the hierarchy. The scheduling decisions are instead automatically computed by applying a topological sorting on the task graphs. Thanks to these assumptions, given a hierarchical task graph \(HTG_l\), there is no interference between tasks at different levels of the hierarchy and the dependences added to create \(\overline{HTG_l}\) are sufficient to effectively compute the performance estimation.

To validate the speed-up estimations produced by our methodology, we adopt ReSP (Reflective Simulation Platform) [7], which is freely downloadable from [53]. ReSP is a highly configurable virtual platform targeted at the modeling and analysis of MPSoC systems and built on top of the SystemC and TLM libraries at different levels of abstraction. Note that, in our experiments, the cache coherence is guaranteed by a directory-based mechanism, whose overhead is directly managed by the simulation platform itself.

6.2 Experimental Results

We evaluated the benefits of considering profiling information by comparing our methodology with the following traditional techniques [19]:

  • Maximal Time (MT): the weight of each task is the estimation of its worst-case execution time and the profiling information is used to compute the maximum number of iterations for unbounded loops;

  • Average Time (AT): the weight of each task is the estimation of its average execution time and the profiling information is used to compute the average number of loop iterations, along with the branch probabilities.

Note that, in both the cases, the execution time of the task graph HTG is estimated as the longest path in the transformed task graph \(\overline{HTG}\).

These techniques have been applied to the benchmarks listed in Table 4, compiled with different levels of GCC optimizations (-O0 and -O2). The results have been compared with the results obtained with our path-based methodology (called PB) under the same conditions. For each application, we created eight situations to be analyzed: the two code optimization levels combined with the four considered architectures, i.e., from 1 to 4 processors. Each of these eight situations is analyzed with the three estimation techniques (i.e. MT, AT and PB) and then simulated with ReSP for validation.

Table 6 shows the average error produced by the three techniques when estimating the speed-up for the multiprocessor architectures with respect to the uniprocessor one. The error is computed as \(\frac{SU_{Est} - SU_{Real}}{SU_{Real}}\) where \(SU_{Est}\) is the estimated speed-up and \(SU_{Real}\) is the measured speed-up.

Table 6 Average absolute estimation error of analyzed techniques

First, there are no significant differences in the accuracy of the estimations with different optimization levels for all the techniques. Indeed, applying code optimizations increases the error in estimating the performance of the single tasks, but the overall effects on the speed-up estimation are mitigated since the error is introduced in the estimations of both the sequential and the parallel versions of the applications. The results also show that our technique (i.e. PB), by properly adopting the complete path profiling information, is able to achieve better results (\(3.83\,\%\)) than state-of-the-art techniques (i.e. MT and AT). Additionally, the AT technique produces better estimations than the MT technique (22.60 vs. \(35.57\,\%\)) since it exploits more profiling information (e.g. the branch probabilities). The error introduced when estimating architectures with 4 processors with AT and MT techniques grows significantly, as explained in the following.

The results for each benchmark are reported in Tables 7, 8 and 9: the estimation error is reported for each combination of estimation technique, compiler optimization level and target architecture. Note that the error is positive when the technique overestimates the real speed-up, negative otherwise. For the PB technique, we also report the results obtained without taking into account the mapping information during the estimation: it is worth noting that this is equivalent to considering a target architecture composed of a number of processors equal to or larger than the maximum degree of parallelism of the benchmark. In fact, in this case, there is no contention on the computational resources and the estimation computed considering mapping information corresponds to the one obtained by ignoring the mapping decisions. Results show that ignoring mapping and scheduling information introduces a large error in estimating the speed-up on the architectures with fewer processors (i.e. two or three) since, in these cases, the contention on the resources is much more relevant and ignoring this information leads to wrong estimations.

Table 7 Estimated speed-up for the architecture with two processors
Table 8 Estimated speed-up for the architecture with three processors
Table 9 Estimated speed-up for the architecture with four processors

Analyzing the results, we can identify different classes of benchmarks. In particular, benchmarks like basicmath, grad and string search are characterized by a substantial data parallelism (e.g. parallel execution of different iterations of the same loop), which covers most of the application execution. These applications contain few conditional constructs, without any specific correlation among the execution times of their tasks. All techniques are thus able to estimate their speed-up with good accuracy.

Profiling information can be useful to obtain good speed-up estimations also in case of data parallelism and tasks with similar execution times that are executed in parallel. For example, benchmarks like array delay and blowfish are characterized by the presence of parallel sections consisting of parallelized loop iterations. In these benchmarks, the speed-up obtained in the single parallel sections can be easily estimated, as their tasks have the same execution time. However, profiling information has to be considered also in this situation due to the proportion between the sequential and parallel parts of the application, as stated by the well-known Amdahl’s Law. Since the MT technique is not able to correctly estimate this proportion, its speed-up estimation can lead to a significant error also in this case. In particular, since it adopts the maximum time, the MT technique systematically overestimates the execution time of the single tasks. Then, if the tasks composing the same parallel section are quite similar, as in the case of data parallel applications, all the tasks are overestimated in the same way. The MT technique thus overestimates the weight of the parallel part much more than the sequential one, overestimating the speed-up introduced by the parallelization. On the contrary, simple profiling information, such as the branch probabilities and the average loop iterations adopted by the AT technique, provides sufficient information to correctly estimate this proportion and, in turn, the overall speed-up. In these cases, the PB technique obtains almost the same results: the profiling of executed paths does not introduce any additional information to improve the estimation since no correlations are contained in the code.

Conversely, when different tasks are correlated, adopting the path profiling information becomes critical. For example, in the susan benchmarks (corner detection and edge detection), there are parts of the code executed in parallel that are actually in mutual exclusion. Thus, the profiling information adopted by the AT technique is not sufficient and leads to optimistic estimations, as shown also in Sect. 2. Finally, consider the results for the dijkstra benchmark: in this case we introduced a false parallelism in the application since the code contained in parallel tasks is always in mutual exclusion. This situation has been artificially created to show how the proposed methodology is able to properly analyze also these situations. Indeed, our methodology correctly predicts a slow-down of the application due to the synchronization overhead of the tasks. The other techniques, instead, are not able to detect the mutual exclusion and, thus, they predict an incorrect positive speed-up.

Finally, Table 10 highlights how the estimation error changes when increasing the number of processors. In the benchmarks with substantial data parallelism (e.g. grad), there is no significant difference in the estimation error for all the techniques when considering more processors. Moreover, if the benchmark is characterized by parallel sections with four tasks that are equivalent from the performance point of view, there is no benefit in increasing the number of processors from two to three. In fact, on the architecture with two processors, each processor has to execute two of the parallel tasks in sequence, while on the architecture with three processors, one of them still has to execute two tasks. For this reason, there is no difference in the speed-up. However, the additional cost required for creating more tasks induces a slow-down in the application execution, as correctly modeled by all techniques. On the contrary, if there is a correlation between the execution times of the parallel tasks, the errors in estimating the parallel version of the application and, in turn, the speed-up increase when increasing the number of processors, as shown, for example, in the jpeg benchmark. When the tasks are completely correlated (e.g. they are in mutual exclusion as in dijkstra), these effects become very significant and can lead to large errors. On the contrary, the PB technique is able to take into account all these task correlations and, thus, the error is not significantly affected when increasing the number of processors.

Table 10 Relationship between the number of processors and the estimation error

7 Conclusions

In this paper, we proposed a methodology to better estimate the speed-up of a parallel code that takes into account the assignments of the tasks to the processing elements of the architecture and the correlation that may exist among their execution times. In particular, such estimation is computed by combining the HTG representation with a single profiling of the sequential version of the application, which is collected on a generic host machine. We applied our methodology to estimate the speed-up of a set of parallel benchmarks on different MPSoC architectures, which have been obtained by varying the number of processors, and we validated the results on a simulation platform. The results show that the proposed methodology is effectively able to produce much more accurate estimations with respect to classical approaches based on constant execution time for the tasks.