1 Introduction

In the past few years, grid computing systems and applications have become popular [8] due to the rapid development of many-core systems. A grid computing environment is a parallel and distributed system that brings together various computing capacities to solve large computational problems. In grid environments, task scheduling plays an important role: it divides a large job into smaller tasks and maps those tasks onto the parallel and distributed system [5, 11]. The goal of task scheduling is typically to schedule all of the tasks on a given number of available processors so as to minimize the overall time required to execute the whole program.

There are deterministic and non-deterministic approaches to solving the task-scheduling problem. An algorithm is called deterministic if, given a specific input, it always produces the same output and the underlying machine always passes through the same sequence of states. Most deterministic algorithms are based on a greedy strategy, which merely attempts to minimize the finishing time of the tasks by allocating tasks to the parallel processors without backtracking [20]. Several popular algorithms in this class have been proposed, such as list scheduling algorithms [25], clustering algorithms [16], and task duplication algorithms [23]. However, these deterministic algorithms solve only certain cases efficiently and, owing to their greedy nature, cannot maintain consistent performance over a broad range of problems.

A non-deterministic algorithm differs from a deterministic algorithm in its capability to produce various outcomes depending on the choices it makes during execution. Non-deterministic algorithms have been used to solve a wide range of combinatorial problems; they spend more time exploring the solution space in exchange for higher-quality solutions. For the scheduling problem, several studies have been based on non-deterministic algorithms, such as genetic algorithms [20, 21, 26], ant colony optimization algorithms [13, 19], and particle swarm optimization algorithms [17, 24], which apply randomized search techniques in the search process.

A parallel and distributed computing system can be homogeneous [4, 20] or heterogeneous [6, 9]. In a homogeneous system, all processors have the same processing capability, whereas in a heterogeneous system the processors differ in processing capability. In general, the processors are connected by a network, which is either fully connected [25, 26] or partially connected [7]. For more related topologies and applications, please see [2, 3, 15]. In a fully connected network, every processor can communicate with every other processor, whereas in a partially connected network data can be transferred only between certain processors. To reduce communication time, task duplication [20] has been studied, in which some tasks are duplicated on more than one processor. However, to avoid increasing energy consumption, the target system considered here is a fully connected heterogeneous system without task duplication.

The genetic algorithm (GA), first proposed by Holland [10], is a popular approach for solving application problems. GAs have been shown to outperform several algorithms on the task-scheduling problem [8, 12, 14, 27]. A GA defines the search space to be the solution space, in which each point is denoted by a number string called a chromosome. Based on these solutions, three operators, selection, crossover, and mutation, are employed to iteratively transform a population of chromosomes into better solutions. To retain good features from the previous generation, the crossover operator exchanges the information of two randomly chosen chromosomes, and the mutation operator alters one bit of a chromosome at random.

In this study, we propose a genetic algorithm for task scheduling on a grid computing system, called TSGA. In general, GA approaches initialize the first population directly by a uniform random process. TSGA instead develops a new initialization policy, which divides the search space into specific patterns to accelerate the convergence of the solutions. To solve the task-scheduling problem, a chromosome usually contains a mapping part and an order part to indicate the assigned processor and the execution order. In the standard GA, when the crossover and mutation operators are applied, the parents' good characteristics cannot be preserved in the next generation. Based on the task–processor assignments of the chromosome encoding, TSGA exploits new crossover and mutation operators to preserve good features from the previous generation.

The remainder of this study is organized as follows. In the next section, we provide the problem definition. Section 3 introduces previous studies for the task-scheduling problem. The proposed genetic-scheduling algorithm is presented in Sect. 4. We describe our experimental results in Sect. 5. Lastly, conclusions are drawn in Sect. 6.

2 Problem definition

Task scheduling is the process of mapping tasks to system processors and arranging the execution order for each processor. Tasks with data precedence are modeled by a directed acyclic graph (DAG) [26]. The main idea of DAG scheduling is to minimize the makespan, which is the overall execution time for all of the tasks.

2.1 DAG modeling

A DAG \(G=(V,E)\) is depicted in Fig. 1a, where \(V\) is a set of \(N\) nodes and \(E\) is a set of \(M\) directed edges. For the problem of task scheduling, \(V\) represents the set of tasks, and each task contains a sequence of instructions that should be completed in a specific order. Let \(w_{i, j}\) be the computation time to finish a specific task \(t_i\) \(\in \) \(V\) on a processor \(P_j\), which is detailed in Fig. 1b. Each edge \(e_{i,j}\) \(\in \) \(E\) in the DAG indicates the precedence constraint that task \(t_i\) should be completed before task \(t_j\) starts. Let \(c_{i, j}\) denote the communication cost that is required to transport the data between task \(t_i\) and task \(t_j\), which is the weight on an edge \(e_{i,j}\). If \(t_i\) and \(t_j\) are assigned to the same processor, the communication cost \(c_{i,j}\) is zero.

Fig. 1 Illustration example: a the DAG, b the computation cost matrix, and c a feasible schedule

Consider an edge \(e_{i,j}\) that starts from \(t_i\) and ends at \(t_j\) . The source node \(t_i\) is called a predecessor of \(t_j\), and the destination node \(t_j\) is called a successor of \(t_i\). In Fig. 1a, \(t_1\) is the predecessor of \(t_2, t_3, t_4\), and \(t_5\); \(t_2, t_3, t_4\), and \(t_5\) are the successors of \(t_1\). In a graph, a node with no parent is called an entry node, and a node with no child is called an exit node. If a node \(t_i\) is scheduled to a processor \(P_j\), then the start time and the finishing time of \(t_i\) are denoted by \({\mathrm {ST}}(t_i, P_j)\) and \({\mathrm {FT}}(t_i, P_j)\), respectively.

2.2 Makespan

After all of the tasks are scheduled onto parallel processors, considering a specific task \(t_i\) on a processor \(P_j\), the start time \({\mathrm {ST}}(t_i, P_j)\) can be defined as

$$\begin{aligned} {\mathrm {ST}}(t_i, P_j)=\max \{{\mathrm {RT}}_j,{\mathrm {DAT}}(t_i, P_j)\}, \end{aligned}$$

where \({\mathrm {RT}}_j\) is the ready time of \(P_j\) and \({\mathrm {DAT}}(t_i, P_j)\) is the data arrival time of \(t_i\) at \(P_j\). Because \({\mathrm {DAT}}(t_i, P_j)\) is the time for all of the required data to be received, it is computed by

$$\begin{aligned} {\mathrm {DAT}}(t_i, P_j)=\max _{t_k \in {\mathrm {pred}}(t_i)}\{({\mathrm {FT}}(t_k, P_j)+c_{k,i})\}, \end{aligned}$$

where \({\mathrm {pred}}(t_i)\) denotes the set of immediate predecessors of \(t_i\). Because

$$\begin{aligned} {\mathrm {FT}}(t_k, P_j)=w_{k,j}+{\mathrm {ST}}(t_k, P_j) \end{aligned}$$

and

$$\begin{aligned} {\mathrm {RT}}_j=\max _{t_k \in {\mathrm {exe}}(P_j)}\{{\mathrm {FT}}(t_k, P_j)\}, \end{aligned}$$

where \({\mathrm {exe}}(P_j)\) is the set that contains the tasks that are assigned to be executed on \(P_j\), the makespan, i.e., the overall schedule length of the entire program, is the latest finishing time of all of the tasks and can be obtained by

$$\begin{aligned} {\mathrm {makespan}}=\max _{t_i \in V}\{{\mathrm {FT}}(t_i, P_j)\}. \end{aligned}$$

Figure 1c demonstrates a schedule for the graph described in Fig. 1a. The makespan of this schedule is 23.
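To make these definitions concrete, the following minimal Python sketch (ours, not the authors' code) evaluates the makespan of a given schedule; the names `order`, `proc_of`, `w`, `c`, and `pred` are hypothetical stand-ins for a precedence-respecting execution order, the task-to-processor mapping, the computation-cost matrix, the edge weights, and the immediate-predecessor lists.

```python
# A minimal sketch of the makespan evaluation in Sect. 2.2 (assumed data layout).
def makespan(order, proc_of, w, c, pred):
    ft = {}                      # finishing time FT(t_i, P_j) of each scheduled task
    rt = {}                      # ready time RT_j of each processor
    for t in order:
        p = proc_of[t]
        # DAT(t, p): all required data must have arrived; the communication cost
        # is zero when a predecessor ran on the same processor.
        dat = max((ft[k] + (0 if proc_of[k] == p else c[(k, t)]) for k in pred[t]),
                  default=0)
        st = max(rt.get(p, 0), dat)          # ST(t, p) = max{RT_p, DAT(t, p)}
        ft[t] = st + w[t][p]                 # FT(t, p) = ST(t, p) + w_{t,p}
        rt[p] = ft[t]                        # the processor is busy until FT
    return max(ft.values())                  # latest finishing time of all tasks

if __name__ == "__main__":
    # Tiny hypothetical 3-task example: t1 -> t2 and t1 -> t3 on two processors.
    w = {1: {1: 2, 2: 3}, 2: {1: 3, 2: 2}, 3: {1: 4, 2: 4}}
    c = {(1, 2): 1, (1, 3): 2}
    pred = {1: [], 2: [1], 3: [1]}
    print(makespan([1, 2, 3], {1: 1, 2: 1, 3: 2}, w, c, pred))   # -> 8
```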

3 Related work

Task scheduling has been proven to be an NP-complete problem. Several algorithms, including deterministic approaches [25] and non-deterministic approaches [20, 26], have been proposed to solve the problem of scheduling dependent tasks for real applications.

Topcuoglu [25] classified deterministic algorithms into three main types: list-scheduling algorithms, clustering algorithms, and task duplication algorithms. Among these deterministic approaches, list-scheduling algorithms are practical and provide better performance results with less scheduling time than the others. The heterogeneous earliest-finish-time (HEFT) algorithm [25] is one of the most efficient list-based algorithms; it uses an upward-ranking method and an insertion-based policy to minimize the makespan. However, HEFT does not perform very well for irregular DAGs and for higher heterogeneities [1].

On the other hand, because non-deterministic algorithms are popular for a wide range of combinatorial problems, this class of algorithms, including GAs, ant colony optimization, particle swarm optimization, and hybrid heuristic techniques [6, 26], has been successfully applied to the task-scheduling problem. Compared to deterministic algorithms, these algorithms take more time to explore the solution space for a high-quality solution. For more heuristic task-scheduling algorithms, please see [6, 22, 29].

GAs have been shown to outperform several algorithms on task-scheduling problems [8, 27], in which each solution is represented as a chromosome. Three genetic operators, selection, crossover, and mutation, are employed to transform a population of chromosomes into a better population. To retain good features from the previous generation, the crossover operator exchanges the information of two randomly chosen chromosomes. The mutation operator alters one bit of a chromosome to increase the diversity of the chromosomes and prevent premature convergence. This genetic process is repeated until the stopping criterion is satisfied, and the best chromosome over all of the generations is reported as the best solution.

Omara [20] proposed the standard genetic algorithm (SGA), which serves as a standard template for this process. Additionally, Omara proposed the critical path genetic algorithm (CPGA), which assigns the tasks on the critical path to the same processor to reduce the communication time. However, CPGA is not very efficient for graphs whose task sets contain more than one critical path.

The genetic variable neighborhood search (GVNS) algorithm [26] combines a GA with a variable neighborhood search (VNS) to obtain a better balance between global exploration and local exploitation of the search space. Two neighborhood structures, the load balancing neighborhood structure and the communication reduction neighborhood structure, are used by VNS. After the crossover and mutation operators, the current population and the new population are merged and sorted in increasing order of makespan.

4 Proposed method

In this section, we introduce the TSGA algorithm in detail, including the solution representation, its decoding, and the genetic operators.

4.1 The representation of solutions

The representation of a chromosome is given in Fig. 2. A chromosome \(S\) is composed of two parts: an order part (\(S_O\)) and a mapping part (\(S_M\)). We use integer arrays to store \(S_O\) and \(S_M\), and the size of the arrays equals the number of tasks. If \(S_O[i]\) is \(j\) and \(S_M[i]\) is \(k\), then a task \(t_j\) is executed on processor \(P_k\).

Fig. 2 A representation of a chromosome \(S\)

The chromosome in Fig. 2 encodes the schedule in Fig. 1c for the DAG in Fig. 1a. We first assign tasks to processors according to \(S_M\). Tasks \(t_4\), \(t_7\), and \(t_8\) are scheduled on processor \(P_1\). Tasks \(t_3\) and \(t_5\) are executed on processor \(P_2\). Tasks \(t_1\), \(t_2\), \(t_6\), and \(t_9\) are assigned to processor \(P_3\). Following the order in \(S_O\), we schedule \(t_4\), \(t_8\), and \(t_7\) in that order on \(P_1\). On \(P_2\), \(t_3\) is executed before \(t_5\). Tasks \(t_1\), \(t_2\), \(t_6\), and \(t_9\) are taken in the order \(t_1\), \(t_2\), \(t_6\), \(t_9\) on \(P_3\). We lastly compute the makespan by including the communication time for each task.
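As an illustration, below is a minimal Python sketch (ours, not the authors' code) of this decoding step; the chromosome shown is one encoding that is consistent with the schedule described above, although the exact values in Fig. 2 may differ.

```python
# A minimal decoding sketch of the two-part chromosome of Sect. 4.1:
# S_O lists the tasks in execution order and S_M[i] is the processor that
# runs task S_O[i]; decoding groups tasks by processor while keeping the order.
from collections import defaultdict

def decode(S_O, S_M):
    queues = defaultdict(list)               # processor -> ordered task list
    for task, proc in zip(S_O, S_M):
        queues[proc].append(task)
    return dict(queues)

# One chromosome consistent with the schedule of Fig. 1c (hypothetical values):
S_O = [1, 2, 3, 5, 4, 6, 8, 9, 7]
S_M = [3, 3, 2, 2, 1, 3, 1, 3, 1]
print(decode(S_O, S_M))
# {3: [1, 2, 6, 9], 2: [3, 5], 1: [4, 8, 7]}
```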

4.2 Outline

The algorithmic outline of TSGA is given in Algorithm 1, and its flowchart in Fig. 3. In Line 2, a variable \(g\) counts the number of generations processed so far. If \(g\) is not less than \(G\), TSGA outputs the best solution. Lines 4 to 7 are the main processes that produce the next generation. Line 8 selects the best \(C\) chromosomes from the old and the new generations according to the fitness function, where \(C\) is the number of chromosomes in the population.

Algorithm 1 The outline of TSGA

Fig. 3 TSGA flowchart

The crossover operator exchanges the information of two chromosomes to retain some good features from the previous generation. Because the resulting chromosomes can be better or worse than their parents, the crossover operator is applied with a predefined probability called the crossover rate, which is denoted by \(p_c\). On the other hand, the mutation operator alters one bit of a chromosome to increase the diversity of the chromosomes and prevent premature convergence. The mutation rate is denoted by \(p_m\).
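The following Python sketch summarizes our reading of this outline; the helpers `initialize`, `fitness`, `crossover`, and `mutate` are hypothetical callables (the fitness would typically be the makespan of Sect. 2.2), and details such as how parents are paired are assumptions rather than the authors' exact procedure.

```python
# A sketch of the generation loop of Algorithm 1 (assumed details).
import random

def tsga(G, C, p_c, p_m, initialize, fitness, crossover, mutate):
    population = initialize(C)                     # first generation (Sect. 4.3)
    for g in range(G):                             # g counts the past generations
        offspring = []
        # Lines 4-7: produce the next generation.
        for _ in range(C // 2):
            a, b = random.sample(population, 2)
            if random.random() < p_c:              # crossover rate p_c
                a, b = crossover(a, b)
            offspring.extend(mutate(x) if random.random() < p_m else x
                             for x in (a, b))      # mutation rate p_m
        # Line 8: keep the best C chromosomes of the old and new generations
        # (here, lower fitness, i.e., a smaller makespan, is better).
        population = sorted(population + offspring, key=fitness)[:C]
    return min(population, key=fitness)            # best solution found
```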

4.3 Initialization

TSGA initializes the first population, which consists of encoded chromosomes, as shown in Algorithm 2. Instead of using a random strategy for the processor assignment, we devise a new method that divides the search space into specific patterns, together with a new rank value that is used to build the execution order. Because the best scheduling solution can occupy a different number of processors in different cases, we provide patterns with different numbers of processors to explore the solution space from different aspects. In Line 1, the search space is divided into \(\log _2 n\) subspaces, where \(n\) is the number of processors. For each group \(d\), the processor assignment \(S_M\) is given by an integer random function in the range \(\{1\dots 2^d\}\), and the execution order \(S_O\) follows Rank_order(DAG), which is depicted in Algorithm 3. Note that the notation \(\oplus \) in Line 7 denotes the concatenation of two strings.
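A minimal sketch of this processor-assignment pattern is given below (our reading of Algorithm 2, not the authors' code); how the chromosomes are distributed over the \(\log_2 n\) groups is an assumption, and `S_O` stands for an order part such as the one produced by Rank_order.

```python
# A sketch of the proposed initialization: the population is split into log2(n)
# groups, and a chromosome in group d draws its processors only from {1..2^d},
# so some chromosomes explore schedules that occupy few processors while others
# may use all of them.  (Assignment of chromosomes to groups is assumed here.)
import math
import random

def initialize(num_chromosomes, num_procs, S_O):
    groups = int(math.log2(num_procs))             # number of subspaces
    population = []
    for i in range(num_chromosomes):
        d = i % groups + 1                         # group index 1 .. log2(n)
        S_M = [random.randint(1, 2 ** d) for _ in S_O]
        population.append((S_O[:], S_M))           # chromosome = (order, mapping)
    return population
```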

Because a task should be executed earlier if it has more successors, we use the Rank_order(DAG) procedure to implement this idea. Lines 2 to 8 compute the \(r\) value by traversing the task graph upward, starting from the exit task. If a task is the exit task, its \(r\) value is 1. Otherwise, the \(r\) value of a task is 1 plus the sum of its successors' \(r\) values. In Line 6, the notation \({\mathrm {succ}}(t_i)\) is the set of immediate successors of \(t_i\). An example of the calculation of \(r\) is given in Table 1.

Algorithm 2 The initialization of TSGA

Algorithm 3 Rank_order(DAG)

Table 1 An example of calculation for \(r\)
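Below is a Python sketch of the \(r\)-value recursion (exit tasks receive \(r=1\); any other task receives 1 plus the sum of its successors' \(r\) values). How the execution order is derived from \(r\) is not fully specified above, so the precedence-respecting, largest-\(r\)-first ordering used here is our assumption.

```python
# A sketch of the r-value computation of Algorithm 3 and an assumed ordering step.
from functools import lru_cache

def rank_order(succ):
    """succ: dict mapping each task to the list of its immediate successors."""
    @lru_cache(maxsize=None)
    def r(t):
        return 1 + sum(r(s) for s in succ[t])      # exit tasks have r = 1

    pred_count = {t: 0 for t in succ}
    for t in succ:
        for s in succ[t]:
            pred_count[s] += 1

    order, ready = [], [t for t in succ if pred_count[t] == 0]
    while ready:
        t = max(ready, key=r)                      # largest r among ready tasks
        ready.remove(t)
        order.append(t)
        for s in succ[t]:                          # release the successors of t
            pred_count[s] -= 1
            if pred_count[s] == 0:
                ready.append(s)
    return order

succ = {1: [2, 3], 2: [4], 3: [4], 4: []}          # tiny hypothetical DAG
print(rank_order(succ))                            # -> [1, 2, 3, 4]
```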

4.4 Crossover

The crossover operator exchanges the information in two chromosomes to retain the good features of the parent generation. A specific crossover operator is applied to each part of a chromosome: the crossover map operator or the crossover order operator. The two operators are chosen with equal probability. The pseudo code of the crossover operator is given in Algorithm 4.

The relationship between the tasks and processors is the most important information encoded in the chromosomes for the scheduling problem. However, crossover operators in most previous studies use some simple cut-and-paste processes to produce offspring. These are just random steps and cannot keep the valuable features of the task–processor assignments from the parents. Our crossover operator is based on the task–processor relationship in such a way that parents can pass good features to their children.

Algorithm 4 The crossover operator

4.4.1 Crossover map operator

The crossover map operator is used to interchange the mapping assignments of two chromosomes, as shown in Fig. 4a, and is described in Algorithm 5. The crossover map operator chooses two chromosomes, \(S\) and \(T\), from the population and picks a random number \(I\) from the set \(\{1, 2,\dots , N\}\). TSGA keeps the processor assignments located to the left of \(I\) (including \(I\)). For the positions to the right of \(I\), our crossover map swaps the processor numbers of \(S_M\) and \(T_M\) that are assigned to the same task, according to the task positions in \(S_O\) and \(T_O\). For example, because \(t_5\) is assigned to \(P_2\) in \(S_M\) and to \(P_3\) in \(T_M\), the crossover map swaps the processor assignments; in other words, it maps \(t_5\) to \(P_3\) in \(S'_M\) and to \(P_2\) in \(T'_M\). Note that the order parts of the two chromosomes do not need to be changed.

Fig. 4 An example of crossover: a the crossover map operator and b the crossover order operator

The chromosomes \(S''_M\) and \(T''_M\) are generated by the crossover map operator in SGA [20] by a cut-and-paste process, which exchanges \(S_M[5..9]\) and \(T_M[5..9]\) directly. Task \(t_7\) is executed on \(P_1\) in both parents. After the SGA crossover map, \(t_7\) is executed on processors \(P_1\) and \(P_2\) in \(S''_M\) and \(T''_M\), respectively, whereas our crossover map assigns \(t_7\) to \(P_1\) in both \(S'_M\) and \(T'_M\). This shows that our operator successfully inherits good features from the parent chromosomes.
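A Python sketch of our reading of this operator follows (hypothetical, not the authors' code); indices are 0-based here, positions up to the randomly chosen cut point keep their assignment, and each task placed after the cut point in \(S\) exchanges its two processors between the chromosomes.

```python
# A sketch of the crossover map operator (our reading of Algorithm 5 / Fig. 4a).
import random

def crossover_map(S_O, S_M, T_O, T_M):
    N = len(S_O)
    I = random.randrange(N)                        # cut point (0-based)
    S_new, T_new = S_M[:], T_M[:]
    pos_in_T = {task: j for j, task in enumerate(T_O)}
    for i in range(I + 1, N):
        t = S_O[i]                                 # task to the right of I in S
        j = pos_in_T[t]                            # where the same task sits in T
        S_new[i], T_new[j] = T_M[j], S_M[i]        # exchange its two processors
    # The order parts are left untouched.
    return (S_O[:], S_new), (T_O[:], T_new)
```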

Algorithm 5 The crossover map operator

Algorithm 6 The crossover order operator

4.4.2 Crossover order operator

The crossover order operator is described in Algorithm 6. In Fig. 4b, after selecting two chromosomes \(S\) and \(T\), we choose a crossover point \(I\) from the set \(\{1, 2,\dots , N\}\) and copy the segments to the left of \(I\) (including \(I\)) in \(S_O\) and \(T_O\) to the new order parts \(S'_O\) and \(T'_O\), respectively. For the segment to the right of \(I\) in \(S'_O\), our crossover order operator follows the task order in \(T_O\), excluding the tasks already in the left segment of \(S'_O\). We then use the same process to generate \(T'_O\). Because the new chromosomes are produced only by changing the order part, the task–processor assignments should remain the same as in the parents. Thus, we adjust the map part according to the original processor assignments.

The crossover order operator in SGA does not adjust the processor assignments after completing the crossover. As shown in Fig. 4b, the chromosome \(S''\), created by the crossover order operator in SGA, loses the task–processor assignments of its parents, because the order part is changed while the map part is not adjusted correspondingly.
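A Python sketch of our reading of this operator (hypothetical, 0-based indices): each child keeps its own left segment, fills the right segment with the remaining tasks in the order they appear in the other parent, and rebuilds its map part so every task keeps the processor it had in its own parent.

```python
# A sketch of the crossover order operator (our reading of Algorithm 6 / Fig. 4b).
import random

def crossover_order(S_O, S_M, T_O, T_M):
    N = len(S_O)
    I = random.randrange(N)                        # cut point (0-based)

    def child(keep_O, keep_M, other_O):
        proc_of = dict(zip(keep_O, keep_M))        # task -> its parent's processor
        left = keep_O[:I + 1]                      # kept left segment
        right = [t for t in other_O if t not in left]
        new_O = left + right
        new_M = [proc_of[t] for t in new_O]        # task-processor pairs preserved
        return new_O, new_M

    return child(S_O, S_M, T_O), child(T_O, T_M, S_O)
```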

4.5 Mutation

The mutation operator is typically implemented by altering one element of a chromosome at random. Because a chromosome has an order part and a map part, each part has a specific mutation operator: the mutation map operator or the mutation order operator. The two operators are chosen with equal probability, as in our crossover operator.

4.5.1 Mutation map operator

The mutation map operator is used for the map part of the chromosome, which is shown in Fig. 5a. Similar to the crossover operator, we choose a random number \(I\) from the set \(\{1, 2,\dots , N\}\). Then, the mutation map operator changes the processor in which the chosen task \(S_O[I]\) is executed to another processor that is chosen randomly.

Fig. 5 An example of mutation: a the mutation map operator and b the mutation order operator
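The mutation map operator above can be sketched in a few lines of Python (our illustration, assuming processors are numbered \(1 \dots n\)):

```python
# A sketch of the mutation map operator (Fig. 5a): pick a random gene and move
# the task at that position to a different, randomly chosen processor.
import random

def mutate_map(S_O, S_M, num_procs):
    I = random.randrange(len(S_O))                 # random position (0-based)
    new_M = S_M[:]
    new_M[I] = random.choice([p for p in range(1, num_procs + 1) if p != S_M[I]])
    return S_O[:], new_M
```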

4.5.2 Mutation order operator

The mutation order operator is applied to the order part of the chromosome and is given in Algorithm 7. We randomly select an index \(I\) from 1 to \(N\), which determines the task \(S_O[I]\) to be mutated. For the task \(S_O[I]\), let \(I_1\) and \(I_2\) be the indices of its closest predecessor in the left segment of \(I\) and its closest successor in the right segment of \(I\), respectively. For an index \(i\) between \(I_1+1\) and \(I\), if the \({\mathrm {DAT}}\) of \(S_O[i]\) is larger than the \({\mathrm {DAT}}\) of \(S_O[I]\), the task \(S_O[I]\) should be executed before the task \(S_O[i]\), because the task \(S_O[I]\) can be executed earlier. For an index \(i\) between \(I\) and \(I_2-1\), if the \({\mathrm {DAT}}\) of \(S_O[i]\) is smaller than the \({\mathrm {DAT}}\) of \(S_O[I]\), the task \(S_O[i]\) should be executed before the task \(S_O[I]\), because the task \(S_O[i]\) can be executed earlier. We collect the indices of the tasks that can be executed earlier into a set called \(A\). If \(A\) is empty, we put all of the indices between \(I_1+1\) and \(I_2-1\) into \(A\). We randomly select an index \(I_3\) from \(A\) and insert the task \(S_O[I]\) before the index \(I_3\). Lastly, we adjust the processor assignment to match the original assignment, for the same reason mentioned for the crossover order operator.

Algorithm 7 The mutation order operator

Figure 5b demonstrates the mutation order operator for \(I=4\). Since \(S_O[4]\) is \(t_5\), the mutation order operator changes the execution order of \(t_5\). According to the graph shown in Fig. 1a, we obtain the \({\mathrm {DAT}}\) values and find \(I_1=1\) and \(I_2=8\). The elements of the set \(A\) are then the indices 2 and 5, because \(S_O[2]\) is \(t_2\) with \(6={\mathrm {DAT}}(t_2)>{\mathrm {DAT}}(t_5)=3\), and \(S_O[5]\) is \(t_4\) with \(2={\mathrm {DAT}}(t_4)<{\mathrm {DAT}}(t_5)=3\). If we randomly choose the index \(I_3=2\) from \(A\), the task \(t_5\) is inserted in front of \(t_2\) (i.e., \(S_O[2]\)). Additionally, the processor \(P_1\) (i.e., \(S_M[4]\)) should be moved correspondingly.
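A Python sketch of our reading of Algorithm 7 follows (hypothetical, 0-based indices); `dat`, `pred`, and `succ` are assumed precomputed maps of each task's data arrival time, immediate predecessors, and immediate successors, and the placement of the moved gene relative to \(I_3\) follows the worked example above.

```python
# A sketch of the mutation order operator (our reading of Algorithm 7 / Fig. 5b).
import random

def mutate_order(S_O, S_M, pred, succ, dat):
    N = len(S_O)
    I = random.randrange(N)                        # gene to mutate (0-based)
    t = S_O[I]
    # Closest predecessor to the left and closest successor to the right of I.
    I1 = max((i for i in range(I) if S_O[i] in pred[t]), default=-1)
    I2 = min((i for i in range(I + 1, N) if S_O[i] in succ[t]), default=N)
    # Positions whose task could profitably run on the other side of t.
    A = [i for i in range(I1 + 1, I) if dat[S_O[i]] > dat[t]]
    A += [i for i in range(I + 1, I2) if dat[S_O[i]] < dat[t]]
    if not A:
        A = [i for i in range(I1 + 1, I2) if i != I]
    if not A:
        return S_O[:], S_M[:]                      # nothing can be moved
    I3 = random.choice(A)
    # Move the task and its processor assignment together; inserting at I3 puts
    # the gene before the task originally at I3 when I3 < I, and after it when
    # I3 > I (matching the Fig. 5b example in our reading).
    new_O, new_M = S_O[:], S_M[:]
    task, proc = new_O.pop(I), new_M.pop(I)
    new_O.insert(I3, task)
    new_M.insert(I3, proc)
    return new_O, new_M
```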

5 Experimental results

At the beginning of this section, we describe the simulation environment for realistic problems. We next demonstrate five DAGs of well-known parallel applications and explain the related parameters. We lastly show the performance of TSGA compared with SGA, CPGA, GVNS, and HEFT.

5.1 Parameter settings

This study considers five well-known applications: Gauss–Jordan elimination (GJ) [28], the fast Fourier transform (FFT) [25], Robot control, Sparse matrix solver, and SPEC fpppp. Robot control, Sparse matrix solver, and SPEC fpppp are taken from the Standard Task Graph (STG) archive [30]. Their graph structures are drawn in Fig. 6, and detailed information is given in Table 2.

Fig. 6 The graph structures: a 21-task GJ, b 15-task FFT, c 88-task Robot control, d 96-task Sparse matrix solver, and e 334-task SPEC fpppp

Table 2 Detailed information of the five applications

Each application has various features that depend on different settings, as described below:

1. Number of processors (\(P\)): \(P\) is the number of processors in the system. The target heterogeneous system is a fully connected computing network in which various processors are connected to each other. Note that the number of processors and the performance speedup are not linearly proportional due to the communication delay.

2. Heterogeneity factor (\(\beta \)): This value indicates the range of the computation costs on the processors. The variable \(\beta \) is defined as the ratio of the least efficient processing capability to the most efficient processing capability in a system. We set the most efficient processing capability to 1; thus, \(0<\beta \le 1\). In practical computing environments, heterogeneous systems can contain processors with different processing capabilities. Here, we set four types of processing capability, with the same number of processors of each type.

3. Communication-to-computation ratio (\({\mathrm {CCR}}\)): For a graph, the \({\mathrm {CCR}}\) is defined as the average communication time of the edges divided by the average computation time of the nodes in the system. A graph represents a computation-intensive application if its \({\mathrm {CCR}}\) is low; on the other hand, if the \({\mathrm {CCR}}\) value is high, it represents a communication-intensive application. The communication times of the DAGs are generated from a uniform distribution.

The experimental parameters are set to the values in the following sets:

  • \({\mathrm {SET}}_P =\{4, 8, 16\}\)

  • \({\mathrm {SET}}_{\beta } =\{0.1, 0.25, 0.5, 0.75, 1.0\}\)

  • \({\mathrm {SET}}_{\mathrm {DAG}} =\) {GJ, FFT, Robot control, Sparse matrix solver, SPEC fpppp}

  • \({\mathrm {SET}}_{\mathrm {CCR}} =\{0.25, 0.5, 1.0, 2.0, 4.0\}\)

Fifteen distinct system models were created by combining \({\mathrm {SET}}_P\) and \({\mathrm {SET}}_\beta \). The different settings of \({\mathrm {SET}}_{\mathrm {DAG}}\) and \({\mathrm {SET}}_{\mathrm {CCR}}\) generate 25 different types of DAGs, and 10 graphs are produced for each type; thus, the total number of DAGs is 250. The combinations of all of these parameters give 3,750 different scenarios, as shown below. Because these DAGs have different characteristics, the experimental results are not biased toward any specific algorithm.
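The scenario count follows directly from the sizes of these sets:

$$\begin{aligned} |{\mathrm {SET}}_P| \times |{\mathrm {SET}}_{\beta }| \times |{\mathrm {SET}}_{\mathrm {DAG}}| \times |{\mathrm {SET}}_{\mathrm {CCR}}| \times 10 = 3 \times 5 \times 5 \times 5 \times 10 = 3{,}750. \end{aligned}$$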

5.2 Performance results

To evaluate our proposed algorithms, we implemented them in the \(\mathrm{C}^{++}\) language and ran them on an AMD FX(tm)-8120 eight-core processor (3.10 GHz). Usually, excellent solutions to different problems would require different parameter values for a specific algorithm; however, we use the same parameter values, listed in Table 3, to compare the performance in terms of makespan in this study.

Table 3 Parameter settings used in the experiment

The performance of TSGA is compared with that of four algorithms: CPGA, SGA, GVNS, and HEFT. For each data configuration of the five DAGs, we record the average makespan over ten runs for CPGA, SGA, GVNS, and TSGA. On the other hand, HEFT, a deterministic algorithm, is run only once. Please note that the time complexity of HEFT is \(O(n^2p)\), where \(n\) is the number of tasks and \(p\) is the number of processors, and the time complexity of these GAs is also \(O(n^2p)\) for a fixed population size and number of generations. Nevertheless, HEFT is the fastest of the five algorithms, since the GAs evaluate many candidate schedules over the generations.

The experimental results for the five DAGs with \(P\) fixed at 16 and varying \(\beta \) are given in Fig. 7, which presents histograms of the makespan. Each application is shown in its own row and was tested with different CCR values and varying \(\beta \) values; each column contains the five applications for a fixed CCR value. Note that the vertical axis shows the makespan. To demonstrate the stability of TSGA, we also report the standard deviation of each TSGA result in Fig. 7.

Fig. 7 Makespan comparison with varying parameters for \(P=16\). Each number represents the standard deviation of the corresponding TSGA result

The experiments reveal that TSGA outperforms the other algorithms in terms of makespan in our test cases with different characteristics. Because CPGA schedules the nodes of the critical path on the same processor, it performs well on Robot control; however, CPGA produces unfavorable scheduling results for the other applications. GVNS and SGA show good scheduling ability only in certain cases, and the performance of SGA decreases as the CCR value increases. On the other hand, although HEFT produces efficient scheduling results, TSGA is better than HEFT in terms of makespan in most cases.

For all of the algorithms, the makespan in the tests with a larger \(\beta \) value is always smaller than that in the tests with a smaller \(\beta \) value. A system with a higher \(\beta \) value has more processors with good processing capability, and a task assigned to a processor with high processing capability is completed earlier. In general, the makespan therefore decreases as \(\beta \) increases.

When we consider a smaller \(\beta \) value (\(\beta =0.1\)), there are more possible scheduling solutions. For a small \({\mathrm {CCR}}\) value, good schedules are easy to produce because the application has less communication delay and less idle time. We therefore focus on a large \({\mathrm {CCR}}\) value (\({\mathrm {CCR}}=8\)). The results of testing the five applications are given in Fig. 8.

Fig. 8 Makespan comparison with varying \(P\): a GJ, b FFT, c Robot control, d Sparse matrix solver, and e SPEC fpppp for \({\mathrm {CCR}}=8\) and \(\beta =0.1\). Each number represents the standard deviation of the corresponding TSGA result

In principle, the makespan should decrease as the number of processors increases. However, in certain cases, most of the algorithms produce an increased makespan when the number of processors increases, especially when the CCR value is large, as shown in Fig. 8. A larger CCR value implies that the communication time is larger than the computation time; in these scenarios, a good schedule should not occupy most of the processors. Most genetic algorithms initialize with random strategies and cannot capture this factor. In contrast, TSGA, with its initialization operator, reduces the probability of this phenomenon and produces better scheduling results in all of the cases.

To demonstrate that the proposed processor assignment strategy is effective at the initialization step, we compare genetic algorithms that employ the original initialization operator with those that employ our proposed processor assignment technique. Figure 9 shows the results of testing with \(\beta =0.1\) and \({\mathrm {CCR}}=8\), where the three new algorithms CPGA\(^{\prime }\), GVNS\(^{\prime }\), and SGA\(^{\prime }\) are CPGA, GVNS, and SGA equipped with our proposed initialization operator, respectively. The new algorithms always outperform the original algorithms, which use the random assignment technique. Although the quality of those algorithms is improved, our proposed method TSGA still has the best scheduling results in all of the test cases.

Fig. 9 Makespan comparison of algorithms with our processor assignment strategy for a GJ, b FFT, c Robot control, d Sparse matrix solver, and e SPEC fpppp

6 Conclusions

In this study, we presented a genetic algorithm for task scheduling, referred to as TSGA, for parallel and distributed computing systems. TSGA improves the standard genetic algorithm by increasing the convergence speed and preserving the good features of the previous generation. To achieve this, we introduced new genetic operators. The proposed initialization operator divides the search space into specific patterns to allow TSGA to explore the search space extensively. The crossover map and mutation map operators help to find more suitable processor assignments, and the crossover order and mutation order operators provide a more efficient execution order. The experimental results show that our proposed method TSGA outperforms the four related algorithms, SGA, CPGA, GVNS, and HEFT, in a variety of scenarios in terms of makespan. We also presented experimental results that show the relationship between TSGA and the various parameters. For cases with high CCR values, TSGA schedules tasks onto fewer processors to reduce the communication time. For cases with small \(\beta \) values, TSGA assigns tasks to the processors with higher processing capability to minimize the execution time. Furthermore, we applied our initialization operator to other genetic algorithms to demonstrate its advantage, with satisfactory results.