1 Introduction

Multicore systems integrated with hardware accelerators, such as field-programmable gate arrays (FPGAs), provide faster data processing and lower power consumption [1,2,3]. Efficient schedules that take into account both hardware and software components are critical to the performance of such hardware-accelerated multicore systems. Typical scheduling approaches consider optimizations within the software stack, such as EDF (earliest deadline first), RMS (rate-monotonic scheduling), and LLF (least laxity first) [4,5,6]. However, many of these are not suitable for or applicable to the hardware components. In this paper, we propose a new hardware–software scheduling algorithm for scheduling applications on hardware-accelerated multicore systems. Throughout this paper, the term software refers to the execution units of traditional systems without any hardware accelerator (e.g., CPUs in multicore systems); similarly, the term hardware refers to the execution units of the hardware accelerator itself (e.g., FPGAs).

The hardware–software co-scheduling problem is challenging because the hardware and software components follow different computational logic when executing tasks. Consider the following example to illustrate the difference between software execution and hardware execution. As shown in Fig. 1a, after tasks T1 and T2 have completed, the three tasks T3, T4, and T5 are ready for execution. In a two-core system, as in Fig. 1b, T5 can only be scheduled after T3 or T4 completes, because each software core can execute only one task at a time. Contrast this with the way the same tasks can be executed on the hardware accelerator, as shown in Fig. 1c: here, each task has its own resources (i.e., execution units), shown as boxes, so task T5 can be scheduled at the same time as tasks T3 and T4 because they do not use the same resources. From this we can see that the number of software cores limits the number of tasks that can be executed in parallel in software. With hardware acceleration, there is no fixed upper limit on the number of parallel tasks; instead, the number of tasks is bounded by the amount of hardware resources. Unlike typical scheduling problems that only consider the scheduling of tasks on software cores, in hardware-accelerated multicore (i.e., both hardware and software) systems we must incorporate hardware execution into the scheduling problem.

Fig. 1 When tasks T3, T4, and T5 are released: b in a two-core system, T3 and T4 can execute at the same time while T5 waits in the queue; c using hardware acceleration, all of T3, T4, and T5 can execute at the same time, but they consume non-reusable FPGA resources

In the general multicore scheduling problem, a real-time application is typically represented by a directed acyclic graph (DAG) consisting of a series of tasks, where nodes represent tasks and edges represent data dependencies. Each real-time task has its own timing requirements, such as execution time, release time, and deadline. The multicore platform consists of multiple software cores, and the scheduling problem is to find the order and assignment of tasks on the platform that achieves a given objective. Such objectives include, for example, time optimization [7], energy usage reduction [8,9,10,11,12,13], and minimizing the number of computing units [14, 15]. Various strategies have been explored to minimize the overall completion time (i.e., the makespan). The most renowned strategies can be divided into three categories: list scheduling, task duplication, and other heuristic strategies.

List scheduling maps tasks onto a set of predefined processors by calculating a priority list of the tasks [16,17,18,19]. Task duplication is another popular scheduling strategy that targets large-scale multicore systems: it reduces the makespan by executing copies of tasks on different processors so as to reduce the inter-processor communication cost. These algorithms perform better in scenarios with higher communication costs (i.e., a higher communication-to-computation ratio) [20,21,22,23,24,25]. There are also other heuristic strategies, for example, weight-based mechanisms [26] and genetic algorithms [27, 28].

In this paper, we focus on finding a task scheduling solution for a hardware-accelerated multicore system. Our goal is to obtain the shortest makespan under the hardware resource constraints. To this end, we propose an efficient software–hardware co-scheduling strategy (ESHCS).

Our strategy consists of two phases: allocation and scheduling. In the allocation phase, our goal is to maximize the benefit of hardware parallelism by improving the utilization of the limited hardware resources. We cluster tasks into different groups and use linear programming to obtain allocation results under the hardware resource constraints. In the scheduling phase, we propose a list-based heuristic strategy that assigns the pre-allocated tasks to suitable software cores or to the hardware accelerator by considering the computation capabilities of the cores as well as the real-time requirements of the tasks.

Finally, we evaluate our algorithm experimentally. Our results confirm that ESHCS obtains efficient solutions within a reasonable runtime. Compared to prior work [29], ESHCS is more efficient: it greatly improves the running time at the expense of very little speedup performance.

In summary, the contributions of this work are as follows:

  • We construct a model for multicore platforms integrated with hardware accelerators based on hardware and software features and give a description of the task features required for hardware–software co-scheduling.

  • We propose an effective allocation strategy derived from the idea of linear programming. Compared to linear programming, our method achieves good allocation results in less time.

  • We propose an allocation-based scheduling algorithm. Compared to list scheduling and genetic scheduling, our method obtains a reasonable makespan in the shortest time.

  • We present the results of our experiments using both simulated applications and real-world applications. Our algorithm performs the best on both.

The rest of this paper is organized as follows. In Sect. 2, we discuss the related work. In Sect. 3, we define the hardware–software co-scheduling problem and the related terminology. In Sect. 4, the proposed scheduling approach is presented. Section 5 shows the experiments and the analysis of our algorithm. Finally, Sect. 6 presents the conclusions of this research and our future work.

2 Related work

Although multicore scheduling algorithms cannot be directly applied to hardware-accelerated multicore architectures, we can incorporate aspects of these approaches. For instance, heterogeneous earliest finish time (HEFT) is a well-known list scheduling algorithm that calculates the upward rank of each task to set the task priorities [16]. Critical-path-on-a-processor (CPOP) uses the sum of the upward and downward rank values for prioritizing. The objective of these two algorithms is to minimize the makespan at low cost [16]. Heterogeneous edge and task scheduling (HETS) minimizes the communication overhead on edges to obtain a reduced schedule length [18]. Massinissa et al. [28] proposed a genetic approach to scheduling applications under energy constraints. Du et al. [19] designed a feature-aware list scheduling to assign tasks to the appropriate processors; the features they consider include frequency, data size, and resource utilization.

In addition to the above works, allocation and scheduling for FPGA-based systems can help inform our work. Several approaches have been presented in previous works, including optimizing frameworks [30, 31], heuristic algorithms [32], priority-driven algorithms [19], genetic algorithms [33,34,35], feature extraction [36], etc. Lam et al. proposed two hardware/software partitioning algorithms: a customized Tabu search algorithm and a dynamic programming algorithm; the first produces good approximate solutions quickly, while the second yields more accurate solutions on smaller-scale instances [37]. These two algorithms are designed for tree-like task graphs. Chen et al. [38] proposed a multi-model multi-task learning approach for heterogeneous hardware–software co-design that reduces overall power consumption and critical-path latency. Reza presented a heuristic-based dynamic scheduling technique to schedule task graphs on multi-FPGA systems [39]. Matteo focuses on efficient scheduling onto cloud FPGAs to minimize the makespan [40]. These two works focus only on scheduling across multiple hardware devices, without considering the collaboration of hardware and software. Furthermore, Rodriguez et al. [31] designed the HAP scheduler to minimize energy consumption in CPU + FPGA systems. Du et al. [19] proposed a speedup estimation model based on the speedup of tasks on the FPGA and used it, together with a runtime resource load-balancing strategy, to assign tasks to the appropriate computing cores. Dai et al. [36] proposed a benefit-based scheduling metric to evaluate task assignments and used it to accelerate task execution; this work targets independent tasks without communication. For applications with high communication load, Hao proposed an efficient scheduling algorithm using task duplication [41]. However, all these works are designed for, and proven to perform well on, non-real-time applications, and they assume no limitation on hardware resources. For real-time applications, Zhu et al. [30] focused on improving the scheduling efficiency of the CPU + FPGA architecture and proposed a scheduling framework for independent tasks. In our previous work [29], we proposed a scheduling approach, called ReTPA, for FPGA-based multicore systems; that work was our first exploration of the real-time hardware–software co-scheduling problem. In this paper, we go further and propose an efficient hardware–software co-scheduling algorithm that better balances performance and time overhead.

We now introduce four representative algorithms in more detail: a classic priority-driven approach, HEFT [16]; two advanced genetic approaches, GAA and MGAA [33]; and our 2022 algorithm, ReTPA [29]. The comparison between our algorithm and these algorithms is presented in Sect. 5.

2.1 Heterogeneous earliest-finish-time (HEFT)

The HEFT algorithm proposed by Topcuoglu et al. [16] is a classic multicore scheduling algorithm whose objective is to minimize the makespan while keeping the scheduling cost low. The multicore structure used in that paper consists of different software cores and involves communication costs. HEFT assigns the task with the highest upward rank value according to an insertion-based policy. The experiments demonstrate that HEFT shows impressive performance in terms of both the quality and the cost of the schedule. To compare it with our algorithm, we first use linear programming or a greedy strategy to allocate tasks to software and hardware and then use HEFT for task scheduling; we use LHEFT and GHEFT to denote HEFT with the linear programming and the greedy strategy, respectively.
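For reference, the upward rank used by HEFT is defined recursively from the exit tasks (this is the standard definition from [16], restated here for convenience):

$$\begin{aligned} rank_u(\tau _i)=\overline{w_i}+\max _{\tau _j\in Suc(\tau _i)}\left( \overline{c_{i,j}}+rank_u(\tau _j)\right) \end{aligned}$$

where \(\overline{w_i}\) and \(\overline{c_{i,j}}\) are the computation and communication costs averaged over the cores, and the upward rank of an exit task is simply \(\overline{w_i}\).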

2.2 Genetic algorithm approach (GAA) and modified genetic algorithm approach (MGAA)

GAA and MGAA are two CPU + FPGA scheduling algorithms proposed by Abdallah et al. [33]. They are novel GA-based approaches that take the communication cost into account. GAA fixes the execution sequence of tasks using the proposed strategy A or B and then uses the genetic algorithm to find the best core for each task. MGAA explores the search space further to find better solutions by creating different feasible sequences in a pre-treatment step. Both are proven capable of finding good solutions while improving the running time compared to other existing works.

2.3 Real-time priority-driven algorithm (ReTPA)

The ReTPA algorithm for real-time applications in FPGA-based multicore systems was proposed in [29]. That work introduced a two-step scheduling strategy: allocation and scheduling. First, a linear programming approach is used to obtain the hardware–software allocation; then, a priority-driven scheduling assigns the tasks based on the allocation results. The performance comparisons illustrate that ReTPA performs well for computation-intensive applications. But there are shortcomings: linear programming finds the optimal solution to this NP-hard problem but at a significant time cost, and in the scheduling phase the priority calculation can be improved.

3 A novel model

The notation \(\alpha \mid \beta \mid \gamma \) is used to describe scheduling problems, where \(\alpha \) denotes the machine environment, \(\beta \) denotes the properties of the tasks, and \(\gamma \) denotes the goal. Our hardware–software co-scheduling problem is described as \(P\mid prec,c_{i,j},time_\mathrm{real}\mid C_\mathrm{max},RSC_\mathrm{con}\). P is the set of computing resources (i.e., CPUs, FPGA). \(prec, c_{i,j}, time_\mathrm{real}\) mean that the tasks are dependent and real-time, and that the communication costs between tasks are considered. \(C_\mathrm{max}, RSC_\mathrm{con}\) mean that the goal is to minimize the makespan under the resource constraints.

In this work, the application is aperiodic and is represented by a directed acyclic graph, \(G=(V, E)\), where V is the set of v tasks and E is the set of e edges. Each edge e(ij) represents the precedence constraint such that task \(\tau _i\) should complete its execution before task \(\tau _j\) starts. A \(v \times v\) matrix DATA is used to represent the communication data, where \(data_{i,j}\) is the amount of data that needs to be transferred from task \(\tau _i\) to task \(\tau _j\). Task execution of a given application is assumed to be non-preemptive and unsuspendable.
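To make the model concrete, the following minimal Python sketch (ours, not an implementation from the paper; all names are illustrative) captures the task set, the precedence edges, and a sparse stand-in for the DATA matrix:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One node tau_i of the application DAG (fields follow Sect. 3)."""
    idx: int                   # task index i
    w_sw: float                # w_{i,1}: execution time on a software core
    w_hw: float                # w_{i,2}: execution time on the FPGA
    rls: float = 0.0           # RLS_i: release time (0 if none)
    ddl: float = float("inf")  # DDL_i: deadline (inf if none)
    lut: int = 0               # LUT_i: LUTs needed to run tau_i in hardware

# e(i, j) in E: tau_i must finish (and transfer its data) before tau_j starts.
edges = {(0, 2), (1, 2)}
# Sparse stand-in for the v x v DATA matrix: data_{i,j} per edge.
DATA = {(0, 2): 4.0, (1, 2): 8.0}
```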

The target computing environment consists of a series of software cores and hardware resources, where the software cores are represented by \(p_1, p_2, \dots , p_{q}\) and the hardware accelerator (FPGA) is considered a special computing core represented by \(p_{0}\). The computing cores are connected in a fully connected topology. In our model, all inter-core communications are assumed to proceed without contention, and communication only occurs at the end of a task.

In addition, there are three very important features that we should characterize in detail:

  • Time features of tasks

    The real-time application consists of a series of tasks with their time features: release time, computation cost, and deadline. W is a \(v \times 2\) computation cost matrix in which each \(w_{i,1}\) gives the estimated execution time to complete task \(\tau _i\) on software and \(w_{i,2}\) gives the estimated execution time to complete task \(\tau _i\) on hardware. Two v-dimensional vectors RLS and DDL represent release time and deadline, respectively. The release time of a task is the instant of time when the task becomes available for execution. The task can be scheduled and executed at any time at or after its release time, whenever its data dependency conditions are met. If task \(\tau _i\) can be executed when the system begins execution, it has no release time, hence \(RLS_i=0\). The deadline of a task is the instant of time when its execution is required to be completed. If task \(\tau _i\) has no deadline, it can be completed at any time during the system execution.

  • Computing resources

    The hardware-accelerated multicore platform includes software and hardware resources. The software resource is the number of software cores, represented by q. The hardware resource is the FPGA, which is composed of look-up tables (LUTs), registers, and flip-flops (FFs). A LUT cannot be shared between two tasks, so the number of LUTs limits the total number of tasks in hardware. We treat it as a resource constraint when scheduling the hardware-accelerated multicore system, represented by \(LUT_\mathrm{con}\). We use a v-dimensional vector LUT to represent the LUT demands of the tasks, where \(LUT_i\) is the number of hardware resources required to execute task \(\tau _i\) in hardware.

  • Communication cost

    The data transfer rate depends on the architecture of the communication links, such as a shared bus, hierarchical bus, or network-on-chip. The data transfer rates between cores are stored in a matrix BAND of size \((q+1) \times (q+1)\), where \(BAND_{k,l}=BAND_{l,k}\) for computing cores \(p_k\) and \(p_l\). We define the communication cost \(c_{i,j}\) when task \(\tau _i\) executes on computing unit \(p_k\) and task \(\tau _j\) executes on computing unit \(p_l\) as

    $$\begin{aligned} c_{i,\,j}=\frac{data_{i,\,j}}{BAND_{k,\,l}} \end{aligned}$$
    (1)

    We assume that the cost of intra-core communication is negligible compared to the cost of inter-core communication. When \(k=l\), \(c_{i,\,j}\) is zero.
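A direct transcription of Eq. (1), where core_of (a hypothetical helper) maps each task to the computing core it runs on and band holds the \((q+1) \times (q+1)\) BAND matrix:

```python
def comm_cost(data, band, core_of, i, j):
    """c_{i,j} from Eq. (1): transferred data over link bandwidth."""
    k, l = core_of[i], core_of[j]   # cores executing tau_i and tau_j
    if k == l:                      # intra-core communication is negligible
        return 0.0
    return data.get((i, j), 0.0) / band[k][l]
```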

Our goal is to find a solution for the real-time application to (I) minimize the makespan \(C_\mathrm{max}\); (II) ensure each task \(\tau _i\) meets its release time \(RLS_i\) and deadline \(DDL_i\); (III) ensure the actual hardware resources consumption \(LUT_\mathrm{sum}\) is less than or equal to the hardware resource constraints \(LUT_\mathrm{con}\).

Makespan \(C_\mathrm{max}\) is defined as

$$\begin{aligned} C_\mathrm{max}=\max _{\tau _i\in V}\{AFT_i\} \end{aligned}$$
(2)

where \(AFT_i\) is the actual finish time of task \(\tau _i\). The makespan is thus the latest actual finish time among the exit tasks.

The overall hardware resources consumption \(LUT_\mathrm{sum}\) is defined as

$$\begin{aligned} LUT_\mathrm{sum}=\sum _{\tau _i\in \mathcal {H}} LUT_i \end{aligned}$$
(3)

where \(\mathcal {H}\) is the set of tasks allocated to hardware; correspondingly, \(\mathcal {S}\) denotes the set of tasks allocated to software cores.

4 Co-scheduling strategy

Before presenting the details of our co-scheduling strategy, we first review the linear programming method from [29] and some notions used in our strategy.

4.1 Related knowledge

Linear Programming The goal of task allocation is to minimize makespan within the constraints of hardware resources. It can easily be described as a mathematical model for linear programming.

We define a v-dimensional vector X as the allocation solution, where \(X_i=0\) indicates that task \(\tau _i\) is allocated to the software cores and \(X_i=1\) means task \(\tau _i\) is allocated to FPGA. The target function is then defined as:

$$\begin{aligned} \left\{ \begin{aligned}&\min \quad {TimeCost}_\mathrm{sum}=\sum _{i=1}^{v}(1-X_i)\times w_{i,1}+X_i\times w_{i,2}&\\&\mathrm{s.t.}\quad \sum _{i=1}^{v}X_i\times LUT_i\le LUT_\mathrm{con}&\end{aligned} \right. \end{aligned}$$
(4)

As a solution, we obtain a hardware/software allocation scheme X.
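Because the objective in Eq. (4) equals \(\sum _{i=1}^{v} w_{i,1}-\sum _{i=1}^{v}X_i\,(w_{i,1}-w_{i,2})\), minimizing it under the LUT budget is exactly a 0/1 knapsack: maximize the total time saved by moving tasks to hardware. The sketch below (ours; the paper does not specify which solver it uses) solves it exactly with dynamic programming:

```python
def lp_allocate(w_sw, w_hw, lut, lut_con):
    """Solve Eq. (4): choose X in {0,1}^v maximizing the saved time
    sum X_i*(w_{i,1}-w_{i,2}) subject to sum X_i*LUT_i <= LUT_con."""
    v = len(w_sw)
    gain = [w_sw[i] - w_hw[i] for i in range(v)]
    best = [0.0] * (lut_con + 1)          # best[c]: max time saved with c LUTs
    take = [[False] * (lut_con + 1) for _ in range(v)]
    for i in range(v):
        if gain[i] <= 0:                  # hardware brings no benefit for tau_i
            continue
        for c in range(lut_con, lut[i] - 1, -1):
            if best[c - lut[i]] + gain[i] > best[c]:
                best[c] = best[c - lut[i]] + gain[i]
                take[i][c] = True
    X, c = [0] * v, lut_con               # backtrack to recover X
    for i in reversed(range(v)):
        if take[i][c]:
            X[i] = 1
            c -= lut[i]
    return X
```

An off-the-shelf MILP solver would serve equally well; the DP form simply makes the knapsack equivalence explicit.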

Notions We introduce some notions derived from [29].

Definition 1

Given a real-time task \(\tau _i\), the Successor Parallel Rank of task \(\tau _i\), denoted by \(SPR(\tau _i)\), is defined as

$$\begin{aligned} SPR(\tau _i)=\frac{\vert Suc(\tau _i)\vert }{\max _{\tau _j\in V}\vert Suc(\tau _j)\vert } \end{aligned}$$
(5)

where \(Suc(\tau _i)\) is the set of direct successors of task \(\tau _i\) and \(\vert Suc(\tau _i)\vert \) is the number of elements in \(Suc(\tau _i)\).

Definition 2

Given a real-time task \(\tau _i\), the Execution Time of task \(\tau _i\), denoted by \(ET(\tau _i)\), is defined as

$$\begin{aligned} ET(\tau _i)= \left\{ \begin{aligned} w_{i,\,1},\quad&\tau _i \in \mathcal {S},\\ w_{i,\,2},\quad&\tau _i \in \mathcal {H}. \end{aligned} \right. \end{aligned}$$
(6)

Definition 3

Given a real-time task \(\tau _i\), the Earliest Start Time of task \(\tau _i\), denoted by \(EST(\tau _i)\), is defined as

$$\begin{aligned} EST(\tau _i)= {\left\{ \begin{array}{ll} RLS_i&{} \mathrm{Pre}(\tau _i)=\emptyset \\ \max \{RLS_i,\max _{\tau _j\in \mathrm{Pre}(\tau _i)}\{EST(\tau _j)+ET(\tau _j)+c_{j,i}\}\}&{} \mathrm{Pre}(\tau _i)\ne \emptyset \end{array}\right. } \end{aligned}$$
(7)

where \(\mathrm{Pre}(\tau _i)\) is the set of the direct predecessors of task \(\tau _i\).

Definition 4

Given a real-time task \(\tau _i\), the Last Finish Time of task \(\tau _i\), denoted by \(LFT(\tau _i)\), is defined as

$$\begin{aligned} LFT(\tau _i)= {\left\{ \begin{array}{ll} DDL_i&{} {Suc}(\tau _i)=\emptyset \\ \min \{DDL_i,\min _{\tau _j\in {Suc}(\tau _i)}\{LFT(\tau _j)-ET(\tau _j)-c_{i,j}\}\} &{} {Suc}(\tau _i)\ne \emptyset \end{array}\right. } \end{aligned}$$
(8)

Definition 5

During scheduling, given a real-time task \(\tau _i\), the Scheduling Urgency of task \(\tau _i\) at the current time CT, denoted by \(SU(\tau _i, CT)\), is defined as

$$\begin{aligned} SU(\tau _i, CT)=\frac{ET(\tau _i)}{LFT(\tau _i)-\max \{EST(\tau _i), CT\}} \end{aligned}$$
(9)

\(SU(\tau _i, CT)=1\) means that \(\tau _i\) must start execution at the current time; otherwise, it or at least one of its successors will miss its deadline.
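Definitions 1–5 translate almost line-for-line into code. A minimal sketch (ours), assuming the Task objects from Sect. 3, predecessor/successor maps pre and suc, a topological order of the tasks, and a communication-cost callable c(i, j):

```python
def spr(i, suc):                     # Definition 1: Successor Parallel Rank
    m = max(len(s) for s in suc.values())
    return len(suc[i]) / m if m else 0.0

def et(i, tasks, in_hw):             # Definition 2: Execution Time
    return tasks[i].w_hw if in_hw[i] else tasks[i].w_sw

def est_all(order, tasks, pre, c, ET):   # Definition 3, topological order
    EST = {}
    for i in order:
        EST[i] = tasks[i].rls if not pre[i] else max(
            tasks[i].rls,
            max(EST[j] + ET[j] + c(j, i) for j in pre[i]))
    return EST

def lft_all(order, tasks, suc, c, ET):   # Definition 4, reverse topological
    LFT = {}
    for i in reversed(order):
        LFT[i] = tasks[i].ddl if not suc[i] else min(
            tasks[i].ddl,
            min(LFT[j] - ET[j] - c(i, j) for j in suc[i]))
    return LFT

def su(i, ct, EST, LFT, ET):         # Definition 5: Scheduling Urgency at CT
    return ET[i] / (LFT[i] - max(EST[i], ct))
```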

4.2 ESHCS

In this section, we introduce our efficient software–hardware co-scheduling strategy (ESHCS), which includes two phases: allocation phase and scheduling phase.

In the allocation phase, we obtain the software and hardware allocation efficiently. The goal is to minimize the makespan within the hardware resource constraints. Two factors affect the makespan: the task parallelism rate and the software–hardware speedup. The software–hardware speedup is the ratio of the software computation cost to the hardware computation cost; it is a good indicator for maximizing the time benefit of hardware acceleration. Meanwhile, to improve the efficiency of the algorithm, we design a heuristic strategy that combines linear programming with a greedy strategy. The algorithm minimizes time by increasing the task parallelism rate and the software–hardware speedup.

In the scheduling phase, we obtain the scheduling sequence based on the software–hardware allocation result. The goals are to minimize the makespan and to meet the deadlines of the tasks. In this phase, the strategies for software and hardware tasks differ: we dynamically measure the priorities of software tasks, and the software task with the highest real-time priority is assigned to a suitable software core, while hardware tasks are executed in parallel on the FPGA.

4.2.1 Task allocation phase

The linear programming model guarantees an optimal solution to this NP-hard problem, but it yields the shortest makespan at an exponential computation cost. In contrast, heuristic methods usually generate sub-optimal solutions in polynomial time. We therefore make a tradeoff between computation cost and quality of results and propose a heuristic allocation method that combines linear programming with a greedy strategy.

In our method, we analyze the topology of the task graph and cluster the tasks into groups \(\mathcal {P}\) based on their dependencies. Within each group, the tasks are independent of each other and can be executed simultaneously on the FPGA. We then calculate the total software–hardware speedup of each group. The software–hardware speedup of task \(\tau _i\) is defined as follows.

$$\begin{aligned} \mathrm{speedup}_i=\frac{w_{i,1}}{w_{i,2}} \end{aligned}$$
(10)

We use \(\mathcal {P}_\mathrm{max}\) to represent the task group with the largest total speedup and tentatively treat all its tasks as hardware tasks. Two cases arise: the total hardware consumption of this group is within the hardware constraint, or it exceeds the constraint. In the first case, we assign the tasks of \(\mathcal {P}_\mathrm{max}\) to the FPGA and move on to the group with the next-largest total speedup, repeating this step until no groups remain or case two occurs. In the second case, we use the linear programming strategy to choose the hardware tasks from \(\mathcal {P}_\mathrm{max}\). Finally, if some hardware resources remain, a greedy strategy is applied to the remaining tasks: the tasks with the maximum speedup are assigned to the FPGA as long as the hardware resource constraint is respected.

The main allocation procedure is described as pseudo-code in Algorithm 1, and the hardware task choosing strategy in Algorithm 2.

Algorithm 1 The main allocation procedure
Algorithm 2 The hardware task choosing strategy
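Since Algorithms 1 and 2 appear here only as pseudocode, the condensed Python sketch below gives our reading of the control flow, reusing lp_allocate from Sect. 4.1 as the exact fallback for case two:

```python
def eshcs_allocate(groups, tasks, lut_con):
    """Allocation phase: groups is the clustering P of mutually independent
    tasks; returns the set H of task indices placed on the FPGA."""
    speedup = lambda i: tasks[i].w_sw / tasks[i].w_hw       # Eq. (10)
    H, budget = set(), lut_con
    for g in sorted(groups, key=lambda g: sum(map(speedup, g)), reverse=True):
        need = sum(tasks[i].lut for i in g)
        if need <= budget:              # case 1: the whole group fits
            H |= set(g)
            budget -= need
        else:                           # case 2: exact selection within group
            X = lp_allocate([tasks[i].w_sw for i in g],
                            [tasks[i].w_hw for i in g],
                            [tasks[i].lut for i in g], budget)
            chosen = {i for i, x in zip(g, X) if x}
            H |= chosen
            budget -= sum(tasks[i].lut for i in chosen)
            break                       # stop the group sweep after case two
    # Greedy pass: highest-speedup leftover tasks, while the budget allows
    # (we assume only tasks with speedup > 1 are worth moving to hardware).
    for i in sorted(set(range(len(tasks))) - H, key=speedup, reverse=True):
        if tasks[i].lut <= budget and speedup(i) > 1.0:
            H.add(i)
            budget -= tasks[i].lut
    return H
```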
Fig. 2 An example of task grouping

Fig. 3 An example of allocation when \(LUT_\mathrm{con}=15\)

We give an example of the allocation procedure in Figs. 2 and 3. Figure 2 shows the division of the task groups. In step 1, the entry tasks \(\tau _1\) and \(\tau _2\) are initialized as the first group. We then find all the direct successors of the first group; of these, task \(\tau _5\) is also a direct successor of \(\tau _3\), so we remove \(\tau _5\) and obtain the second group. Similarly, we obtain the third group. The procedure stops when all the tasks in the group are exit tasks. Figure 3 shows the allocation when \(LUT_\mathrm{con}=15\). In step 1, \(\tau _8\) is assigned to the FPGA because \(\mathcal {P}_4\) has the largest \(\mathrm{Speedup}_\mathrm{sum}\). The group \(\mathcal {P}_3\) has the second-largest \(\mathrm{Speedup}_\mathrm{sum}\), but its \(LUT_\mathrm{sum}\) is larger than the remaining hardware resource, so we use linear programming to choose \(\tau _5\) and \(\tau _6\) for hardware. At this point, one hardware resource remains, and we use the greedy strategy to assign \(\tau _4\) to the FPGA.

4.2.2 Task scheduling phase

In the scheduling phase, task \(\tau _i\) becomes a ready task once all its predecessors have completed and all communications to \(\tau _i\) have finished. We make a scheduling decision whenever a new ready task arrives or a software core becomes idle, applying different strategies to software and hardware tasks.

For the software tasks, we design a priority-driven approach based on the Successor Parallel Rank and the Scheduling Urgency. The ready software task with the highest priority is assigned to an idle CPU first. When \(SU(\tau _i, CT)=1\), we assign task \(\tau _i\) to a software core immediately, without priority comparison, to maximize the chance that its successors meet their deadlines.

The priority of software task \(\tau _i\) is calculated by:

$$\begin{aligned} PRI(\tau _i)= {\left\{ \begin{array}{ll} SU(\tau _i, CT) &{} {Suc}(\tau _i)=\emptyset \\ SU(\tau _i, CT)+SPR(\tau _i)\times \max _{\tau _j \in {Suc}(\tau _i)} PRI(\tau _j) &{} {Suc}(\tau _i)\ne \emptyset \end{array}\right. } \end{aligned}$$
(11)

In the priority calculation, tasks with higher SU at the current time are given higher priority, to maximize the chance that all tasks finish before their deadlines; tasks with more successors are given higher priority, so that more successors can be released early to improve parallelism; and tasks with higher-priority successors are given higher priority.

For the ready hardware task \(\tau _i\), we directly assign it to the FPGA for execution. The main scheduling procedure is described as pseudo-code in Algorithm 3.

Algorithm 3 The main scheduling procedure
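Algorithm 3 likewise appears here only as pseudocode; the following simplified event-driven sketch (ours) is consistent with the description, computing Eq. (11) bottom-up from the exit tasks and re-evaluating it at each scheduling point because SU depends on the current time. start_on_fpga and start_on_cpu are hypothetical runtime hooks:

```python
def pri_all(order, suc, SPR, SU_now):
    """Eq. (11): priorities propagate from exit tasks back to entry tasks."""
    PRI = {}
    for i in reversed(order):           # reverse topological order
        PRI[i] = SU_now[i] if not suc[i] else (
            SU_now[i] + SPR[i] * max(PRI[j] for j in suc[i]))
    return PRI

def dispatch(ready, H, idle_cpus, PRI, SU_now):
    """One scheduling decision: hardware tasks start at once on the FPGA;
    urgent software tasks (SU = 1) skip the priority comparison, and the
    rest are ranked by PRI and handed to idle software cores."""
    for t in [t for t in ready if t in H]:
        ready.remove(t)
        start_on_fpga(t)                # FPGA runs its tasks in parallel
    urgent = [t for t in ready if SU_now[t] >= 1.0]
    queue = urgent + sorted((t for t in ready if t not in urgent),
                            key=lambda t: PRI[t], reverse=True)
    while queue and idle_cpus:
        t, cpu = queue.pop(0), idle_cpus.pop()
        ready.remove(t)
        start_on_cpu(t, cpu)
    return ready, idle_cpus
```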
Fig. 4 Data information for scheduling

Fig. 6 Map of task execution at CT = 11, scheduling \(\tau _3\) or \(\tau _7\) to CPU2

Fig. 5 Final scheduling solution using ESHCS

We give the scheduling solution obtained by ESHCS in Fig. 5; its makespan is equal to 24. The EST and LFT of the tasks in Fig. 4 are calculated using Eqs. 7 and 8. We also give an example of priority calculation and comparison during the system running time in Fig. 6. When \(CT=11\), the tasks \(\tau _3\) and \(\tau _7\) are released. Since \(PRI(\tau _3,11)>PRI(\tau _7,11)\), the task \(\tau _3\) is assigned to CPU2 for execution first, and the task \(\tau _7\) will be executed when CPU1 is idle.

5 Experiments

In the comparative evaluation, we consider two types of applications as the workload for the performance experiments: simulated applications produced by the sample generator [29] and graphs representing numerical real-world problems [16]. All experiments were run on a computer with an 8-core Intel i9-9900K @ 4.8 GHz processor and 16 GB DDR4 memory. The operating system is 64-bit Windows 10 Pro. We used MATLAB 2019b to implement and test the algorithms.

The metrics used for performance evaluation are Speedup and Running Time. Speedup evaluates a solution from the makespan-optimization perspective and reflects the effectiveness of the scheduling solution. For a given application, Speedup is the ratio of the sequential execution time to the makespan, where the sequential execution time is computed by assigning all tasks to a single processor. For the same application, a smaller makespan yields a larger Speedup.
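In equation form (our transcription of the definition above, assuming the single processor used for the baseline is a software core, so the sequential time is the sum of the software execution times):

$$\begin{aligned} \mathrm{Speedup}=\frac{T_\mathrm{seq}}{C_\mathrm{max}}, \qquad T_\mathrm{seq}=\sum _{\tau _i\in V} w_{i,1} \end{aligned}$$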

In addition to evaluating the performance of the solutions, we also pay attention to the efficiency of the algorithm itself. Running Time is the computation cost of obtaining the scheduling solution. Among the algorithms that give equal Speedup values, the one with the smallest Running Time is the most practical implementation.

In the remainder of this section, we first introduce the sample generator used in the experiments in Sect. 5.1. Then, the comparison results between our scheduling algorithm ESHCS and the related work are analyzed in Sects. 5.2 and 5.3.

5.1 Sample generator

We use the sample generator proposed in [29] to simulate real-time applications. The generator produces samples along two dimensions: the DAG structure and the task attributes.



  • DAG generator

The DAG generator has two input parameters: the number of tasks n and the type of DAG. We generate DAGs with three different structures for the experiments: tree, fork-join, and random. Examples of these three DAG structures are shown in Fig. 7.

Fig. 7 Examples of different simulation structures



  • Task attribute generator

    We generate five attributes for each task: release time, deadline, software execution time, hardware execution time, and hardware resource demand. Referring to [16], the attribute values are generated according to the following rules (a code sketch of these rules follows the list):

    • The task must be theoretically schedulable. The difference between the release time and deadline is not less than the hardware execution time, i.e., \(DDL_i-RLS_i \ge w_{i,2}\).

    • The hardware execution time \(w_{i,2}\) is drawn from a uniform distribution, and the software execution time \(w_{i,1}\) is obtained by multiplying \(w_{i,2}\) by a speedup factor drawn uniformly from 2 to 8.

    • The hardware resource demand \(LUT_i\) is generated from a random parameter \(\beta \) and the ratio of the software execution time \(w_{i,1}\) to the hardware execution time \(w_{i,2}\) as \(LUT_i=\beta \times \frac{w_{i,1}}{w_{i,2}}\), where \(\beta \) follows the Gaussian distribution \(N(\mu , \sigma ^2)\) with \(\mu =3\) and \(\sigma =1\).
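A minimal Python sketch of these rules (ours; the range of \(w_{i,2}\) and the release-time window are assumptions, since the paper fixes only the distributions of the speedup factor and \(\beta \)):

```python
import random

def gen_task_attrs(v, w2_lo=10.0, w2_hi=20.0):
    """Generate the five per-task attributes following the rules above."""
    tasks = []
    for _ in range(v):
        w2 = random.uniform(w2_lo, w2_hi)       # hardware execution time
        w1 = random.uniform(2.0, 8.0) * w2      # software time, 2x to 8x slower
        beta = random.gauss(3.0, 1.0)           # beta ~ N(3, 1)
        lut = max(1, round(beta * w1 / w2))     # LUT_i = beta * w1 / w2
        rls = random.uniform(0.0, 50.0)         # assumed release-time window
        ddl = rls + w2 + random.uniform(0.0, 50.0)  # ensures DDL - RLS >= w2
        tasks.append(dict(w_sw=w1, w_hw=w2, lut=lut, rls=rls, ddl=ddl))
    return tasks
```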

For the experiments, we generate five datasets. The DAG parameter values are drawn from the corresponding sets \(SET_{n}=\{8, 16, 24, 32, 40\}\) and \(SET_{DAGtype}=\{\)tree, fork-join, random\(\}\). Each dataset includes the three structure types, with 50 random samples per type. We list the statistical information of the five datasets in Table 1, where v is the number of tasks and \(\overline{\sum _{i=1}^{n}LUT_i}\) is the average total hardware resource demand. Note that we do not vary the communication-to-computation ratio (CCR) in this work; its value is fixed at 0.1.

Table 1 Information of five datasets

5.2 Comparative experiment on simulation application

In this subsection, we design three different experiments using the simulated applications. First, we compare ESHCS with our previous algorithm ReTPA. Then, we test our algorithm against the other algorithms. Finally, we explore the impact of resource size on the performance of the scheduling algorithms.

5.2.1 Comparative experiment with ReTPA

We compare ESHCS with our previously proposed algorithm ReTPA [29]; ESHCS is a further performance-balanced scheduling algorithm based on ReTPA. The results are shown in Table 2. The Speedup of ReTPA and ESHCS differs minimally across all five datasets, but ESHCS shows a significant improvement in Running Time: on average, Speedup decreases by only 1.76% while Running Time improves by 36.59%. ESHCS thus achieves our goal of significantly reducing the computation cost of the algorithm while sacrificing only a very small amount of makespan speedup.

Table 2 Comparison of scheduling performance between ESHCS and ReTPA

5.2.2 Comparative experiment with other algorithms

In this subsection, we analyze the performance of ESHCS and five other algorithms. The structures of these algorithms are compared in Table 3, where 'Allocation' gives the allocation strategy used by each algorithm and 'Schedule' gives the scheduling strategy. The allocation strategy of GAA is sequential, i.e., task \(T_i\) is assigned to processor \(p=((i-1) \bmod (ns+1))+1\), where \(p\le ns+1\) and \(p=ns+1\) represents the FPGA. Moreover, the parameters of GAA-A and MGAA in our experiments are: \(Popsize=10\), number of generations \(NG=10\), and number of threads \(Nb=6\).

Table 3 Five considered algorithms

This experiment compares the performance of the algorithms with respect to various application scales. The performance ranking by Speedup in Fig. 8 is {LHEFT, ESHCS, GHEFT, GAA-A, MGAA}; note that every ranking in this paper starts with the best algorithm and ends with the worst. ESHCS and LHEFT provide the highest-speedup solutions, far better than HEFT with the greedy strategy and the two genetic algorithms. The average Speedup of LHEFT across all datasets is the best: better than ESHCS by 8.21%, GHEFT by 20.01%, GAA-A by 41.37%, and MGAA by 51.44%. The reason the genetic algorithms do not find the best solutions as expected is that they treat the hardware as a special software core, so the number of tasks allocated to the hardware is about \(n/(ns+1)\). Their iterations mostly try better scheduling sequences or better groups of tasks for the hardware; there is no tendency to move more tasks into hardware to improve parallel efficiency, which causes the algorithms to fall easily into a local optimum. This shortcoming can also be seen in the parameter experiments in Sect. 5.2.3: the Speedup of the genetic algorithms improves significantly with increasing software scale, but the size of the hardware resources has almost no effect on their performance.

The performance ranking by Running Time in Fig. 9 is {GHEFT, ESHCS, LHEFT, GAA-A, MGAA}. In general, the computation cost of all algorithms increases with the application scale. The Running Time of LHEFT increases fastest, due to the computational complexity of linear programming. The computational complexity of the greedy strategy is \(\mathcal {O}(n)\), so GHEFT provides the best performance in terms of computation cost. The time performance of ESHCS is second only to GHEFT and remains at the millisecond level; the average Running Times of ESHCS and GHEFT are 0.0229 s and 0.0213 s, respectively.

In these experiments, LHEFT and ESHCS outperform the other algorithms in Speedup at every application scale, but the average Running Time of LHEFT is 14.67 times larger than that of ESHCS. The Running Time of GHEFT outperforms the other algorithms, but its Speedup is much worse than that of LHEFT and ESHCS. The two genetic algorithms, GAA-A and MGAA, have neither the best nor the worst average Speedup, and they spend too much time searching for a solution. In conclusion, ESHCS obtains the best scheduling solution at a reasonable computation cost.

Fig. 8 Speedup of algorithms with respect to application scale

Fig. 9 Running Time of algorithms with respect to application scale

5.2.3 Parameter experiment

This experiment investigates the impact of resource size on scheduling performance. The resources of a hardware-accelerated multicore system include software cores and hardware resources. The parameter values are \(SET_\mathrm{softcore~number}=\{2,4,6,8,10\}\) and \(SET_\mathrm{hardware~percentage}=\{0,0.2,0.4,0.6,0.8,1\}\), where the number of software cores represents the scale of the software resources on the platform and the hardware percentage is the ratio of the hardware resource constraint to the hardware resource overhead required by all tasks. As described in Table 1, \(\overline{\sum _{i=1}^{n}LUT_i}\) is the average hardware resource overhead of each dataset, which we use as a baseline; \(\overline{\sum _{i=1}^{n}LUT_i} \times SET_\mathrm{hardware~percentage}\) gives the hardware constraints set in the experiment. For these experiments, we select the fifth dataset with \(n=40\), where \(\overline{\sum _{i=1}^{n}LUT_i}=316.6667\). Figures 10 and 11 show the results of the two parameter experiments.

The first experiment, in Fig. 10, compares the Speedup of the algorithms at different hardware scales. The increase in hardware resources leads to a dramatic increase in the Speedup of LHEFT, ESHCS, and GHEFT; in contrast, the trends of MGAA and GAA-A remain flat. This indicates that LHEFT, ESHCS, and GHEFT are better adapted to hardware-accelerated multicore systems. The best performance is achieved by LHEFT and ESHCS with only a slight difference: their Speedup reaches 8.8712 and 8.3227, respectively, when the hardware percentage reaches 1.

The second experiment, in Fig. 11, demonstrates the impact of the software scale on scheduling performance. Overall, an increase in software scale also contributes to an increase in Speedup, but the effect is smaller than that of the hardware percentage. The Speedup of all algorithms {LHEFT, ESHCS, GHEFT, GAA-A, MGAA} increases with the number of software cores and stabilizes after the number of software cores reaches 6. This indicates that software cores have a limited effect on application acceleration and reflects the view that hardware acceleration is the future of high-performance computing. In addition, how to minimize the number of active software cores while achieving efficient scheduling is another interesting future research topic [14, 15].

Fig. 10 Speedup of algorithms with respect to hardware resources

Fig. 11 Speedup of algorithms with respect to software cores

5.3 Comparative experiment on real-world applications

In addition to the simulated task graphs, we select two real parallel applications for our experiments: fast Fourier transform (FFT) and Gaussian elimination (GE). The number of tasks in an FFT of size \(\rho \) is \(v=(2\times \rho -1)+\rho \times \lg \rho \), where \(\rho =2^n\). The number of tasks in a GE of size \(\rho \) is \(v=\frac{\rho ^2+\rho -2}{2}\). Figure 12 shows examples of the FFT and GE applications, generated with the parameters \(\rho =4\) and \(\rho =5\), respectively.

Fig. 12 Example of real parallel structures


  • Fast Fourier Transform

In this experiment, we compare the performance of the algorithms using FFT. The application scale varies \(\rho \) from 2 to 32, with the number of tasks \(v=\{5, 13, 39, 95, 223\}\). Table 4 shows the scheduling performance in terms of Speedup \(\overline{S_\mathrm{up}}\) and Running Time \(\overline{R_\mathrm{time}}\). Bold values indicate the best performance among the five algorithms.

ESHCS has an advantage in Speedup for three of the datasets, while LHEFT provides similar performance. ESHCS performs better than LHEFT in Running Time and only slightly worse than GHEFT at small scales; the allocation algorithm integrating linear programming and the greedy strategy effectively reduces the computational cost of ESHCS. The genetic-based algorithms, GAA-A and MGAA, schedule FFT better than they schedule GE, but a gap to the other three methods remains. In summary, ESHCS has the best Speedup in most cases, while GHEFT has the best Running Time in most cases; ESHCS shows the best performance in large-scale applications.

Table 4 Scheduling performance in fast Fourier transform

  • Gaussian Elimination

In this experiment, we compare the performance of the algorithms using GE. The application scale varies \(\rho \) from 5 to 13, with the number of tasks \(v=\{14, 27, 44, 65, 90\}\). Table 5 shows the scheduling performance in terms of Speedup \(\overline{S_\mathrm{up}}\) and Running Time \(\overline{R_\mathrm{time}}\).

The Running Times of LHEFT, ESHCS, and GHEFT increase with the GE scale but remain at the millisecond level. In contrast, the running times of GAA-A and MGAA are several times larger than those of the other three algorithms: the expansion of the search space in the genetic algorithms brings a large scheduling cost, especially for large-scale applications. Although ESHCS is at a slight disadvantage in Running Time, its scheduling efficiency far outperforms that of GHEFT, with an average improvement of 43.22%.

Among all five algorithms, ESHCS and GHEFT achieve the best Running Time performance, and the gap between them is very small. In contrast, the Speedup of ESHCS is better than that of GHEFT by 44.82%. These results suggest that, for GE, ESHCS obtains the best scheduling solution in a reasonable computation time.

Table 5 Scheduling performance in Gaussian elimination

6 Conclusion

In this paper, we propose ESHCS for scheduling real-time applications in multicore systems integrated with hardware accelerators. The goal of the algorithm is to minimize the makespan subject to hardware resource constraints while meeting the real-time requirements of the tasks. Our strategy consists of two phases: heuristic allocation and list-based scheduling. In the experiments, we test our algorithm on both simulated and real applications. The experiments confirm that, compared to ReTPA, ESHCS greatly improves the running time while sacrificing only a slight amount of speedup, and that ESHCS also shows a significant enhancement over the other advanced algorithms LHEFT, GHEFT, GAA-A, and MGAA. In the future, we plan to extend this work to periodic real-time tasks. Moreover, how to reduce the number of active processors while still achieving the best solutions is another interesting research topic.