1 Introduction

Nowadays, processors that comprise multiple independent cores are extensively leveraged. These processors support parallel processing, where a problem can be partitioned and distributed to run on the cores in parallel. Moreover, multithreading is used to increase the extent of parallelism, and simultaneous multithreading can help overcome the underutilization of a superscalar processor, whereby a processor can issue multiple instructions from multiple threads within the same cycle, as illustrated in Šilc et al. (2001). Additionally, more recent programming languages, including Java, C++, and C#, support parallel programming or virtual parallelism using multithreading. However, old sequential codes do not support this multithreaded technique, and rewriting old sequential software into a new version that supports parallel computing is a rather onerous task; regenerating sequential programs into a code version that can be parallelized is not a straightforward process. Several researchers have attempted to convert sequential code into counterparts that are capable of being parallelized. For example, Andión et al. (2013) proposed a method that converts a sequential application into a parallel counterpart that can be executed on multicore processors.

Analogously, Jabri (1990) utilized the tree-walk attribute evaluator with parsing to produce an intermediate representation (IR) of a sequential source code program in the form of a three-address code. Notably, this program was segmented into basic nodes of a flow-graph, thereby enabling local and global optimization and supporting parallelization as well as optimized scheduling. The importance of scheduling is evidenced by the fact that it has been the focus of many researchers (Shao et al. 2006; Mahafzah and Jaradat 2010; Jiang et al. 2014; Althebyan et al. 2017; Srikanth and Geetha 2018).

For optimization purposes, many heuristic and metaheuristic approaches, such as heuristic local search, the nearest-neighbor algorithm, and chemical reaction optimization, have been used to solve various optimization problems (Al-Adwan et al. 2018, 2019; Lam and Li 2012). Lam and Li (2012) proposed the chemical reaction optimizer (CRO), a metaheuristic approach based on the natural process of chemical reaction, which commences with unstable substances that react with each other through different reactions, such as decomposition, on-wall ineffective collision, inter-molecular ineffective collision, and synthesis. Under this process, unstable substances are converted into stable counterparts, which are then considered new substances exhibiting properties different from those of the initially unstable substances.

Islam et al. (2019c) mentioned that one of the CRO features is the flexibility of changing its basic framework, which makes it widely used to solve many optimization problems, such as the 0–1 knapsack problem (Truong et al. 2013), the printed circuit board drilling problem (Eldos et al. 2015), the multi-objective optimization problem (Bechikh et al. 2015), the capacitated arc routing problem (Bensedira et al. 2016), the static bike repositioning problem (Szeto et al. 2016), the shortest common supersequence problem (Saifullah and Islam 2016), the longest common subsequence problem for multiple strings (Islam et al. 2019b), and selecting an optimal subset of software requirements using a fuzzy artificial chemical reaction optimization algorithm (Alrezaamiri et al. 2019).

Furthermore, CRO has been used in other interesting areas. For example, Zhou et al. (2017) employed deep reinforcement learning with CRO; Nayak et al. (2017) presented a hybrid chemical reaction based metaheuristic with the fuzzy c-means algorithm for optimal cluster analysis; and Hazra and Roy (2019) presented a chemical reaction optimizer to solve WCEED (the wind-based combined economic emission dispatch problem), concurrently reducing the emissions produced by fossil-fueled power plants and the thermal-wind electrical energy cost. Additionally, the chemical reaction optimizer has been combined with other heuristic and metaheuristic algorithms in various application areas. For example, Nguyen et al. (2014) presented a new algorithm based on CRO and particle swarm optimization (PSO) to solve function optimization problems. Also, Dam et al. (2017) solved the vehicle routing problem using a chemical reaction optimizer with a unified tabu search. Moreover, Zhang et al. (2018) combined the advantages of adaptive CRO and PSO to solve continuous optimization problems.

The main motivation of this work is to enhance compiler scheduling by providing multiple code segments (i.e., multiple execution paths) over multicores in a multithreaded architecture using the optimization capability of CRO. Another motivation is to satisfy the need for a parallelizing approach that is computation dependent and language independent, in the sense that it can be applied to compilers of various programming languages. Therefore, the main contribution of this paper is to present, analyze, and evaluate a new technique called chemical reaction optimizer for multithreading scheduling (CROTS), which utilizes the flow-graph of a program, or its respective IR, to schedule multiple code segments (i.e., multiple execution paths) over multicores in a multithreaded architecture. The analysis and evaluation of CROTS are accomplished in terms of various performance metrics, including execution time, speedup, efficiency, cost, and percentage of error.

In general, CROTS consists of two phases. The first phase entails scanning the flow-graph and categorizing its nodes as regular nodes, loop nodes, and branch nodes; this phase is considered a pre-phase of CROTS. Notably, the node categorization in this phase is used in the subsequent steps to generate the execution paths. Subsequently, the execution paths are generated from the nodes and the results of the branch-node conditions, and the values of the variables used in each execution path are saved separately. After generating the execution paths, CROTS determines, in the second phase, the number of requisite threads and their adequate sizes, based on the number of execution paths and the best/worst execution paths, where the best execution path has the minimum number of nodes and the worst execution path has the maximum number of nodes. CROTS treats the basic nodes as black boxes, disregarding their contents; it primarily traverses the various execution paths. CROTS utilizes these paths to allocate their nodes to the available threads incrementally via CRO. Finally, CROTS provides an optimized incremental schedule for the available threads using CRO reactions.

Furthermore, bi-level optimization problems (BLOPs) are considered a class of problems that have two levels of optimization, as mentioned in Chaabani et al. (2018). Chaabani et al. (2018) noted that BLOPs are among the most difficult optimization problems. Thus, the authors derived from CRO a variant of the co-evolutionary decomposition-based bi-level algorithm (CODBA), called energy-based CODBA and denoted E-CODBA, which is used to solve combinatorial bi-level problems. In the experimental results, E-CODBA shows better performance than four other approaches: CODBA (Chaabani et al. 2015a); CODBA-II, an improved CODBA (Chaabani et al. 2015b); the co-evolutionary bi-level method using repeated algorithms (CoBRA) (Legillon et al. 2012); and a simple nested strategy that uses an evolutionary algorithm at both levels to handle bi-level optimization problems (N-BEA), which is mentioned in Chaabani et al. (2018) and presented in Marinakis et al. (2007), where N-BEA uses a genetic algorithm to solve the vehicle routing problem.

Accordingly, the scheduling problem in this paper can be considered a BLOP (Chaabani et al. 2018). The optimization occurs in both phases of CROTS: in the first phase, solution paths are generated from the sequential code to be converted into parallel counterparts, and the execution paths are encoded into nodes suitable for parallelization in the next phase; in the second phase, the main feature of CRO is used to optimize the multithreading scheduling.

CROTS is applied to the dense matrix–vector product benchmark, denoted the DenseAMUX benchmark and presented in Andión et al. (2013), over matrices of different sizes, including M1000, M2000, M3000, M5000, and M7000. CROTS is compared with the intermediate representation based on diKernel, where diKernel is a domain-independent kernel, denoted KIR and presented in a study conducted by Andión et al. (2013). We re-implemented KIR as presented and described by Andión et al. (2013) to obtain its experimental results and compare them with CROTS. The comparison is done according to the speedup performance metric. The experimental results show that CROTS displayed better speedup values. More specifically, CROTS exhibited a maximum speedup value of 11.98 when 32 threads are used for M5000, as opposed to KIR, which could only attain a speedup value of 4.18. This paper presents different experimental and analytical results to show the performance of this new technique; these include the number of steps, execution time, speedup, efficiency, cost, and percentage of error metrics.

The rest of the paper is organized in the following manner. Section 2 covers the related work, which is followed by the background in Sect. 3. Meanwhile, Sect. 4 elucidates our new technique. This is followed by Sect. 5, which presents and discusses the experimental results. Finally, Sect. 6 presents the main conclusions of this research work.

2 Related work

One of the most important problems for a parallel processing system is scheduling. The scheduling problem arises in various application areas, and different approaches have been used to improve its solutions. For example, Shao et al. (2006) mentioned two of the most important factors that affect power dissipation, which are schedule length and switching activity. Shao et al. (2006) studied the scheduling problem of minimizing these two factors for applications with loops on multiple-functional-unit architectures.

Mahafzah and Jaradat (2008) put forward a job scheduling with load balancing methodology on the optical transpose interconnection system (OTIS) based on the hypercube, denoted OTIS-hypercube, terming it the clusters dimension exchange method (CDEM). Similarly, Mahafzah and Jaradat (2010) put forward another job scheduling algorithm with load balancing, namely the hybrid dynamic parallel scheduling algorithm (HD-PSA), on a chained-cubic tree (CCT) interconnection network, which combines dynamic load balancing and parallel scheduling strategies; they proposed HD-PSA as a load balancing solution on CCT networks. HD-PSA is premised on the dimension exchange method (DEM) for dynamic load balancing on hypercubes (Ranka et al. 1988) as well as the tree walking algorithm (TWA) for load balancing on tree interconnection networks (Shu and Wu 1995).

Virtual parallelism can be achieved using multithreading, one of the parallel programming approaches, which yields faster execution and easy communication. Notably, multithreading has been used in different approaches to improve the performance of sequential algorithms. Mahafzah (2014) presented a performance evaluation of a parallel multithreaded A* heuristic search algorithm for solving the 15-puzzle problem. Also, Mahafzah (2011) postulated a parallel multithreaded approach to the iterative deepening A* (IDA*) heuristic search algorithm, using POSIX threads (Pthreads) as well as message-passing interface (MPI) libraries operating on 16 dual-core processors. The author presented a parallel multithreaded IDA* (PMIDA*) algorithm, assessed it experimentally and analytically, and compared it with other parallel heuristic search algorithms; specifically, the message-passing IDA* version-2 and version-3 (MPIDA*-2 and MPIDA*-3) (Hafidi et al. 1995) and the unidirectional distributed heuristic search (Uni_DHS) (Al-Ayyoub 2005).

Task scheduling is one of the major problems facing the MapReduce framework, even though many schedulers have been presented. Schedulers should maintain data locality by avoiding numerous data transmissions in order to achieve acceptable performance. Althebyan et al. (2017) presented a multithreading locality (MTL) scheduler that utilizes multithreading principles.

Xu et al. (2017) presented an optimization of partial collaborative transportation scheduling in supply chain management with a third-party logistics enterprise using a modified ant colony optimization with a negative selection operation (ACO-NSO) algorithm. Also, Islam et al. (2019a) presented a chemical reaction optimization for supply chain management (CRO-SCM) with a third-party logistics enterprise, where it is used to find the best transportation routes within the shortest computing time. CRO-SCM has been evaluated and compared with the IBM CPLEX Optimizer, genetic algorithm (GA), the traditional ACO algorithm, and the modified ACO-NSO algorithm. The proposed CRO-SCM achieved better results than the other mentioned algorithms. Additionally, Li et al. (2020) presented a workflow job scheduling algorithm based on load balancing that utilizes cloud resources efficiently.

Jiang et al. (2014) put forward a tuple molecular structure-based chemical reaction optimization (TMSCRO) method for directed acyclic graph (DAG) scheduling on heterogeneous computing systems. Also, Jabri (1990) introduced a formal specification, implementation, and optimization of the front-end processors of high-level languages. The approach divides processing into two primary phases: the syntax/semantic analysis phase and the global optimization phase. During the first phase, an intermediate representation of the source program is produced concurrently with parsing, in the form of three-address code, using a tree-walk attribute evaluator, and is segmented into basic blocks of a flow-graph. In the subsequent phase, the flow-graph representation serves as the input to a set of global optimization algorithms, which are then utilized to collect the data-flow information required by the back-end processors.

Jabri (1990) presented an intermediate representation of the source code that includes nested and non-nested loop statements, nested and non-nested if statements, and assignment statements. Accordingly, the intermediate representation of the source code presented in Jabri (1990) is used in the first phase of CROTS, specifically in the solution's generator algorithm (Algorithm 1).

Additionally, Jabri (2009) introduced a generic code generator (GCC) based on a formal and global optimizing approach. GCC was specified at varying levels in terms of a generic description of different machine architectures. Each level specified some aspects of the machine, represented by a respective machine language and an appropriate optimization. It was implemented using a generic optimizing translation scheme that performs successive transformations of each predecessor level's machine language into the languages of its successor levels. It may be important to note that this transformation process is accompanied by an imposed optimization effect. The features distinguishing GCC from similar generators include ease of retargeting, effective optimization, accuracy, and a formalized machine description.

Parallelizing sequential code is an active research area in which many researchers have worked on improving the process of converting sequential codes into parallel counterparts, which should preserve the sequential code's specifications, such as syntax, semantics, and dependency constraints.

Correspondingly, Andión et al. (2013) proposed a method to convert a sequential application into a parallel counterpart, which facilitates the execution of the sequential application on multicore processors. The proposed method is predicated on an intermediate representation that, in turn, is based on domain-independent kernels (e.g., assignment, reduction, recurrence, etc.). This representation abstracts away the syntactic variations of computations (e.g., pointers, arrays, complex control flows, etc.) used in the source code of the sequential application, enabling the construction of the parallel version.

Andión et al. (2013) presented an intermediate representation based on diKernel, where diKernel is a domain-independent kernel, denoted KIR. Andión et al. (2013) used a multithreading technique to show the performance of their work using the speedup performance metric. The parallelizing process in KIR on the DenseAMUX benchmark proceeds as follows. First, the regular assignments are converted into a forall parallel loop, as presented by Albert et al. (1991), because the computations in regular assignments are conflict-free. Second, the program variables within the loop statements (i.e., variables defined within loop statements) are privatized within the parallel region. Thus, the handling of compile-time variables in the benchmark under KIR is mainly driven by locality exploitation. The authors also presented a partitioning technique that converts sequential programs into parallel counterparts. Therefore, the coarse-grain threads used in KIR may have different sizes, that is, execution paths of different sizes, because every thread works on a subset of the data. Threads are allocated a specific subset of the data at the beginning of the parallelizing process, and they work separately. Hence, in Andión et al. (2013), KIR distributes the data, not the execution paths, over the threads; since the execution paths of the subsets may have different sizes, some threads may need more execution time than others. Also, KIR did not present the parallelization process of a loop statement that has un-privatized variables (i.e., variables that are common to more than one parallel region).

On the other hand, Almghawish et al. (2017) presented an automatic parallelizing model for sequential code, written in Python and called PyParallelize, to detect recursive functions. Analogously, Arabnejad et al. (2018) provided an elucidation of AutoPar-Clava, which serves as the library for the Clava weaver. It performs symbolic and automatic parallelization of C code into OpenMP programs.

In more detail, AutoPar-Clava detects two types of statements: loop statements in program segments that have no true data dependencies between their iterations, and statements that are not guided by the runtime or by user information. AutoPar-Clava considers such statements as independent blocks that can run separately from the program segments. Consequently, if a variable in a program segment is guided by the runtime, AutoPar-Clava cannot parallelize that case; likewise, if there is a true data dependency between loop iterations, AutoPar-Clava cannot parallelize the loop statement. Moreover, AutoPar-Clava is a library only for the C programming language and contains two main parts, namely parallel loop detection and source-to-source code parallelization. Unfortunately, AutoPar-Clava is therefore dependent on the C programming language and cannot parallelize loop statements with true data dependencies between iterations.

Murad et al. (2019) presented static scheduling using a chemical reaction optimizer, denoted as SS-CRO, which is used to schedule different types of instructions. SS-CRO is evaluated analytically and experimentally in terms of various performance metrics, including the number of steps, execution time, and the number of obtained solutions. The analytical and experimental results show that SS-CRO achieved better results than static scheduling using a genetic algorithm (SS-GA).

3 Background

In this section, the definition and calculation of the worst-case execution time are presented. The worst-case execution time (WCET) is used in the process of determining the upper bound of different problems. The WCET has been studied using different approaches and used as an indicator of the solution bound for various problems.

Stappert et al. (2001) posited a fast and effective WCET calculation method that considers both high-level program flow, such as loops and infeasible paths, and low-level machine aspects like pipelining and caches.

Meanwhile, Puschner (2002) proposed a single-path programming paradigm. Under this approach, a program entails a single possible execution path whenever it follows this paradigm. The author opined that a single possible execution path renders path and WCET analysis almost trivial.

Similarly, static WCET analysis is defined as a technique to derive an upper bound for the execution time of programs. The derivation of flow information, such as loop bounds and infeasible paths, can be viewed as a key component of static WCET analysis. In this context, Gustafsson et al. (2006) proposed three algorithms to calculate infeasible paths. The first algorithm, detecting infeasible nodes, finds infeasible nodes, where these nodes are basic blocks that are never visited in any execution of a certain scope. The second algorithm, detecting infeasible pairs of nodes, finds infeasible pairs of basic blocks. The third algorithm, detecting infeasible paths, finds sequences of nodes within the same iteration scope that are never executed together.

4 CROTS technique

This section describes the CROTS technique, which is composed of two phases, whose input is a flow-graph of a program segment's code and whose output is an incremental multithreaded schedule of the execution paths assigned to concurrent threads. In the first phase, the CROTS technique scans the flow-graph and categorizes its nodes based on Definition 1, and then generates the maximum number of execution paths based on Definition 2, where these two definitions are presented in the next subsection. During the process of generating the execution paths, node dependencies are captured by the order of each node in its corresponding execution path, which is saved in the solution matrix SM[][], while the values of the variables corresponding to each execution path are saved in the variable matrix Var[][]. In the second phase, the CROTS technique schedules the execution paths over the available threads incrementally, taking the threads' size into account, using CRO reactions. Finally, the technique produces an optimized incremental thread schedule using CRO. These phases are shown in the CROTS structure, which contains Algorithms 1–8; these algorithms are explained in detail in Sects. 4.1.2–4.1.9, and their numbers of steps are derived in Sect. 4.2. In CROTS, we follow the generalized form of the CRO algorithm, as mentioned in Islam et al. (2019c), which is shown in the following subsections.

4.1 CROTS structure

This subsection presents the structure of the CROTS technique, which contains Algorithms 1–8, where they are explained in the following subsections. The following definitions are used by these algorithms.

Definition 1

A flow-graph G is defined by G = (N, E), where N = {n1, n2, n3, …} is a set of nodes and E = {e1, e2, e3, …} is a set of edges. The flow-graph contains different kinds of nodes, namely a starting node ns, an ending node nf, regular nodes, loop nodes, and branch nodes. Only the starting node has no predecessors, and only the ending node has no successors. Meanwhile, regular nodes have at least one successor and one predecessor. Loop nodes have two outgoing edges, one of which is a back edge. Branch nodes have two outgoing edges.

Definition 2

An execution path is a sequence of connected nodes in graph G from the starting node ns to the ending node nf. The set of possible execution paths is encoded by a solution matrix SM[][], where SM[i][j] is the jth node in the ith execution path. For example, if the execution path n1, n2, n3, n4, n2, n5 is the ith solution, where the starting node ns equals n1 and the ending node nf equals n5, then it is encoded by SM[i][0], SM[i][1], SM[i][2], SM[i][3], SM[i][4], SM[i][5], respectively.
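To make the encoding concrete, the following minimal Python sketch stores the example path above in SM[][]; the matrix bounds and the helper name encode_path are hypothetical illustrations, not part of the original technique.

```python
# Minimal sketch of the encoding in Definition 2. Node names follow the
# example above; EP_MAX, ML_MAX, and encode_path are hypothetical.
EP_MAX, ML_MAX = 8, 16                          # assumed bounds

# SM[i][j] holds the jth node of the ith execution path (None = unused slot).
SM = [[None] * ML_MAX for _ in range(EP_MAX)]

def encode_path(i, path):
    """Store execution path i node by node as SM[i][0..len(path)-1]."""
    for j, node in enumerate(path):
        SM[i][j] = node

# The ith solution from Definition 2 (here i = 0): n1, n2, n3, n4, n2, n5.
encode_path(0, ["n1", "n2", "n3", "n4", "n2", "n5"])
print(SM[0][:6])    # ['n1', 'n2', 'n3', 'n4', 'n2', 'n5']
```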

4.1.1 CROTS attributes

In this paper, the main concept of CRO is used in the CRO-Scheduler algorithm, which is presented in Algorithm 2 (see Sect. 4.1.3). This subsection presents the main attributes used in the CROTS technique, which are based on the original CRO attributes, as mentioned in Lam and Li (2012) and Islam et al. (2019c). We modified and adapted CRO to make it suitable for optimizing multithreading scheduling. Accordingly, we did not use all of its well-known attributes, but only those that help in optimizing multithreading scheduling. The main attributes used in the CROTS technique are as follows:

  • PopMatrix, denoted MTH[T][L], which holds parts of the execution paths to be executed in the available threads, where T is the number of threads and L is the thread's size.

  • Molecule, denoted MTH[i], which holds parts of the execution paths to be allocated to thread i after they are scheduled by the selected reaction.

  • Kinetic energy, denoted KE, which is the maximum tolerated number of dependencies between parts of the execution paths under which a worse solution is still accepted, where each execution part is independent of its predecessor or a slot in MTH[i] is empty; it is determined by the fitness function presented in Algorithm 7 (see Sect. 4.1.8).

  • Potential energy, denoted PE, which is the number of dependencies between the parts of the execution paths, or the number of full slots in MTH[i]; it is determined by the fitness function presented in Algorithm 7 (see Sect. 4.1.8).

  • tth and MoleT, two random values between zero and one that are used to determine whether the reaction to be applied is a uni-molecular reaction, such as decomposition (case A) or on-wall ineffective collision (case B), or an inter-molecular reaction, such as synthesis (case C) or inter-molecular ineffective collision (case D), where these cases are presented in Sect. 4.1.3.
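As a compact illustration of these attributes, the sketch below holds MTH, PE, and KE in plain Python structures and implements the tth/MoleT test described above; the sizes and the helper name pick_reaction_kind are hypothetical.

```python
import random

# Hypothetical sketch of the CROTS attributes (Sect. 4.1.1).
T, L = 4, 8                            # assumed: number of threads, thread size
MTH = [[None] * L for _ in range(T)]   # PopMatrix: one row (molecule) per thread
PE = [0.0] * T                         # potential energy per row (dependencies)
KE = [0.0] * T                         # kinetic energy per row (tolerance)

def pick_reaction_kind():
    """tth and MoleT are random values in [0, 1]; tth > MoleT selects a
    uni-molecular reaction (cases A/B), otherwise an inter-molecular one
    (cases C/D), as described above."""
    tth, MoleT = random.random(), random.random()
    return "uni-molecular" if tth > MoleT else "inter-molecular"

print(pick_reaction_kind())
```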

4.1.2 Solution’s generator

In Algorithm 1, the input of the solution's generator (SG) is a program segment's flow-graph G and the output is the maximum number of execution paths, encoded in a solution matrix SM[][]. The flow-graph of the input program code is constructed according to the method introduced by Jabri (1990).

In lines 1–8 of the SG algorithm (Algorithm 1), SG scans all the nodes in the program's flow-graph to determine the starting node and the ending node, and to distinguish between the regular nodes, which proceed serially, and the branch/loop nodes. Branch/loop nodes have one input arrow and two distinct output arrows (see Definition 1). The two distinct output arrows of a branch/loop node are determined according to the loop or if condition: one output is for the true value of the condition and the other for the false value. Each node in the graph is given a unique name.

In lines 9–27, the SG algorithm acquires the solution set by generating the maximum number of execution paths (see Definition 2) that may occur in the graph. During the process of finding a new execution path in iteration i, new solution nodes are saved in the current vector CS[] until SG finds the full solution path (i.e., the SG algorithm reaches the ending node for the current solution). Then, if the new solution in CS[] does not already exist in the solution matrix from a previous iteration, the contents of CS[] are placed in the solution matrix SM[][] and all its nodes are marked as unscheduled (line 22), where matrix SM[][] holds all the solution paths to be used by the CRO-Scheduler. Execution paths contain regular nodes and branch nodes, where a branch node leads the path through different routes depending on its condition's result.

Notably, every solution path has its own variable values, which are initialized in the starting node and used/updated in the subsequent nodes of the same execution path. Assume that R is the maximum number of variables used in graph G, so every solution path has at most R variables, such as {v1, v2, v3, …, vR}. During the generation of the current execution path, the values of the variables corresponding to it are saved in the current variable vector CV[]; then, once the execution path has been generated and saved in the solution matrix, the values of its variables are placed in the variable matrix Var[][]. For example, Var[i][vij] is the value of the jth variable in the ith execution path.

Every path starts from the starting node, ends at the ending node, and goes through a sequence of nodes. In the outer loop, the algorithm generates a new execution path and saves it temporarily in CS[]; it starts from the starting node and follows successors to obtain a sequence of nodes that ends at the ending node, so the worst case of the outer loop is EP, the maximum number of execution paths. In the inner loop, the algorithm continues to visit successor nodes until it reaches the ending node, and the worst case of the inner loop is ml, the maximum execution path size.

Algorithm 1 Solution's generator (SG)

Finally, the SG algorithm labels every node of each execution path, where each node has two identifiers: the first identifier is the solution's number and the second relates to the node's position within the path.
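The following Python sketch illustrates the core idea of SG under simplifying assumptions: the flow-graph is given as adjacency lists, and a node may be revisited at most once per path so that enumeration terminates (the paper instead bounds paths through the branch/loop condition results and the variable values in Var[][]). The graph below loosely mirrors Fig. 2b and is hypothetical.

```python
# Hedged sketch of the solution's generator (SG, Algorithm 1): enumerate
# execution paths from the starting node ns to the ending node nf.
def generate_paths(succ, ns, nf, max_len=32):
    paths = []
    def walk(node, current):
        if len(current) >= max_len:          # safety bound (hypothetical)
            return
        current = current + [node]
        if node == nf:
            if current not in paths:         # skip duplicate solutions
                paths.append(current)
            return
        for nxt in succ.get(node, []):
            if current.count(nxt) <= 1:      # allow one loop-node revisit
                walk(nxt, current)
    walk(ns, [])
    return paths

# Hypothetical flow-graph: n2 is a loop node, n3 a branch node, n5 the
# ending node; n4 and n6 close the loop back to n2.
succ = {"n1": ["n2"], "n2": ["n3", "n5"], "n3": ["n4", "n6"],
        "n4": ["n2"], "n6": ["n2"]}
for i, p in enumerate(generate_paths(succ, "n1", "n5")):
    print(i, p)    # includes ['n1', 'n2', 'n5'] and longer looped paths
```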

4.1.3 Chemical reaction optimizer scheduler algorithm

The chemical reaction optimizer scheduler, denoted the CRO-Scheduler algorithm, is shown in Algorithm 2. The CRO-Scheduler algorithm schedules the execution paths into the threads; that is, the nodes of the execution paths are allocated incrementally to the available threads according to their size. Usually, the size of a thread is not equal to the execution path size, and execution paths have different sizes; thus, the process of allocating the nodes to the available threads incrementally gives rise to a number of cases that must be treated by the CRO-Scheduler. These cases are as follows:

  • Case A If a thread holds fewer nodes than its size L, then the decomposition algorithm (Algorithm 4) is called to try to allocate more nodes to the thread, so that a thread holds at most L nodes.

  • Case B If a thread holds nodes from distinct execution paths, then the on-wall ineffective collision algorithm (Algorithm 3) is called to reorder its nodes properly (i.e., to order the nodes according to their original execution paths).

  • Case C If there are two threads with empty spaces, then the synthesis algorithm (Algorithm 6) is called to try to merge them.

  • Case D If two distinct threads have nodes from several different solution paths, then the inter-molecular ineffective collision algorithm (Algorithm 5) is called to properly reorder their nodes; that is, it gathers the nodes that belong to each solution path and puts them in the same thread.

In this paper, we modified and adapted CRO to optimize multithreading scheduling, where the main features of the operators are used in Algorithm 2. The operators are called according to the status (case) of the unscheduled nodes, where in every iteration the proper CRO operator is called. In each iteration, the unscheduled nodes are scheduled in a local-optimization manner, in which the chosen CRO operator reduces the dependency issues and the execution time of the scheduled nodes. Finally, global optimization is achieved: when the main algorithm ends, the optimized multithreading schedule for all scheduled nodes (i.e., parts of execution paths) is delivered.

The number of iterations in Algorithm 2, which uses the CRO operators, is defined by the number of available threads, because in this algorithm the CRO operators allocate parts of the execution paths to the available threads.

Algorithm 2 CRO-Scheduler

The CRO-Scheduler algorithm (Algorithm 2) has two distinct phases: the initialization phase and the processing phase. In the initialization phase, the CRO-Scheduler algorithm determines the number of required threads and the thread size, if they are not given as parameters. The size and the number of threads used in the algorithm are determined statically.

In the processing phase, notice that at the beginning, all the paths' nodes are set to unscheduled. In line 11, T × L unscheduled nodes from different execution paths in SM[][] are allocated randomly to the threads matrix, denoted MTH[T][L], according to the T threads of size L, and the fitness function is called to compute PE and KE for each row in MTH[T][L]. In line 12, the algorithm calculates the PE and the KE for all rows in MTH[T][L]. In line 13, a while loop statement is used to apply the CRO operators iteratively to allocate nodes to the available threads. The while loop continues until the threads are full, that is, until the value of the variable Th-no in the while loop equals the number of available threads T. The process of allocating unscheduled nodes to MTH[][] in these steps has two main aspects. First, the unscheduled nodes are allocated to the threads incrementally; that is, in each iteration, at most L (thread's size) unscheduled nodes, which may belong to different execution paths, are allocated to each thread randomly. Second, each execution path has its own variable values, which were initialized in the starting node and saved in the variable matrix Var[][] by the SG algorithm (Algorithm 1); they are then used, and possibly updated, during the execution of the other nodes in the same execution path. During the execution of the threads, these values are imported from Var[][] to be used in line 41 to execute the scheduled nodes in each thread properly.

In line 16, the status of the path nodes selected from the execution paths for processing is updated to scheduled. In line 18, two random variables, namely tth and MoleT, are used to determine which reaction is selected, where both generate values between zero and one. Thus, if the tth value is greater than the MoleT value, a uni-molecular reaction is used; otherwise, an inter-molecular one is used. Later on, the algorithm checks the case of the selected rows of MTH[][], that is, MTH[i] and MTH[j], in lines 19–47, where it calls the proper chemical reaction according to the cases of the selected MTH[][] rows. In each case, one reaction is used per iteration, and the contents of the selected rows of MTH[][] are allocated to one or two threads according to the sequential order of the available threads, where Th-no denotes the sequential order of the available threads.

In Algorithm 2, the null value is used as the initialization value for the Case variable; subsequently, the Case variable takes one of the values (i.e., cases) A, B, C, or D. In line 20, when the reaction is uni-molecular, the Case status is set to A if MTH[i] is not full; otherwise, it is set to B. In line 35, when the reaction is inter-molecular, if the sum of the items in MTH[i] and MTH[j] is less than 2L, then one of them is not full and the Case status is set to C; otherwise, the Case status is set to D. However, the null status is changed at least to A or B, because the only condition for setting the case to A or B is the number of items inside MTH[i]. Accordingly, if case A is chosen in every iteration, which is the worst-case scenario of the while loop in line 13, the MTH[][] items are allocated in sequence, incrementally, to the available threads using the decomposition operator.

In case A, if the selected row MTH[i] has an empty location, that is, it holds fewer path nodes than the allowed size L, it is sent to the decomposition reaction algorithm to add more unscheduled nodes from matrix SM[][]. In case B, if a row MTH[i] has path nodes from distinct paths, it is sent to the on-wall reaction algorithm for reordering, that is, for putting the nodes of each path in order according to their original execution path order. In case C, if there are two selected rows MTH[i] and MTH[j] with empty spaces or locations, they are sent to the synthesis reaction algorithm to be merged, if possible. In case D, if two distinct rows MTH[i] and MTH[j] have several nodes from different execution paths, they are sent to the inter-molecular reaction algorithm to be properly rearranged; that is, it gathers the nodes that belong to each solution path and puts them in the same row.
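The case selection in lines 18–35 can be summarized by the following hedged Python sketch; the row contents and the helper names are hypothetical, while the fullness tests follow the thresholds stated above.

```python
import random

# Hedged sketch of the case selection in lines 18-35 of Algorithm 2.
L = 4                                       # assumed thread size

def count_items(row):
    return sum(1 for x in row if x is not None)

def select_case(MTH, i, j):
    """tth > MoleT selects a uni-molecular reaction: case A if MTH[i] is not
    full, else case B. Otherwise an inter-molecular reaction: case C if the
    combined items of MTH[i] and MTH[j] are fewer than 2L, else case D."""
    tth, MoleT = random.random(), random.random()
    if tth > MoleT:
        return "A" if count_items(MTH[i]) < L else "B"
    total = count_items(MTH[i]) + count_items(MTH[j])
    return "C" if total < 2 * L else "D"

MTH = [[(0, 0), (0, 1), None, None], [(1, 0), (1, 1), (1, 2), (1, 3)]]
print(select_case(MTH, 0, 1))   # for these rows, prints either A or C
```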

In line 48, the execution nodes are marked as scheduled. In line 49, the threads are run concurrently. Then, the threads are synchronized (line 50) and the common data is exchanged if needed. When they are done, the status of every completed path's nodes is changed from scheduled to processed (line 51). Subsequently, in line 52, the values of the variables related to the processed nodes of each execution path are updated in the variable matrix Var[][], as they may be used in subsequent iterations in another thread. After this, more unscheduled nodes from SM[][] are allocated to the available threads, and the process continues until all unscheduled nodes have been processed.

On the other hand, if MTH[i] or MTH[j] items remain unscheduled, their execution path parts are copied back to MTH[][] as new candidates for selection in the next iterations, as shown in lines 53 and 54 of Algorithm 2. Notice that a thread remains idle until an operator has scheduled the execution path parts of its input parameters (i.e., MTH[i] and MTH[j]). Thus, a thread is allocated if the operator schedules the execution path parts of its input parameters, as shown in lines 26, 30, 41, and 45 of Algorithm 2; otherwise, it stays idle.

4.1.4 On-wall ineffective collision algorithm

The on-wall ineffective collision algorithm, shown in Algorithm 3, takes row MTH[i] as its input and ascertains whether it has unordered nodes belonging to the ith execution paths; if so, it reorders the nodes to derive a more optimized schedule, as shown in lines 1–3. In lines 4 and 5, Algorithm 3 ascertains whether the solution has nodes from distinct execution paths. Subsequently, it calls the fitness function (Algorithm 7) to verify the dependency between the solution's nodes in the new order and to obtain the values of PE′ and KE′, where PE′ is the potential energy after applying the required operator (or reaction) and KE′ is the kinetic energy after applying it. Finally, if the sum of PE′ and KE′ is greater than or equal to the sum of PE and KE, it returns the newly scheduled row MTH[i] to the CRO-Scheduler algorithm (Algorithm 2); otherwise, it returns null. Note that PE is the potential energy before applying the required operator (or reaction) and KE is the kinetic energy before applying it.

Algorithm 3 On-wall ineffective collision
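A minimal sketch of the on-wall step follows; nodes are represented as (path, position) pairs so that sorting restores the original execution-path order, and the acceptance test PE′ + KE′ ≥ PE + KE follows the text. The toy fitness used here is only a hypothetical stand-in for Algorithm 7 (a fuller sketch is given in Sect. 4.1.8).

```python
# Hedged sketch of Algorithm 3. Nodes are (path, position) pairs; toy_fitness
# is a hypothetical stand-in for Algorithm 7.
def toy_fitness(row):
    pe = float(sum(n is not None for n in row))          # filled slots
    ke = sum(1.0 / len(row) for n in row if n is None)   # empty slots
    return pe, ke

def on_wall(row):
    """Reorder the row so that nodes of the same execution path regain their
    original order, then accept the new row only if PE' + KE' >= PE + KE;
    otherwise return None (null), as in the text."""
    pe, ke = toy_fitness(row)
    scheduled = sorted(n for n in row if n is not None)  # by (path, position)
    new_row = scheduled + [None] * (len(row) - len(scheduled))
    pe2, ke2 = toy_fitness(new_row)
    return new_row if pe2 + ke2 >= pe + ke else None

print(on_wall([(0, 2), (1, 0), (0, 1), None]))
# -> [(0, 1), (0, 2), (1, 0), None]
```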

4.1.5 Decomposition algorithm

The decomposition algorithm is shown in Algorithm 4, which receives one selected row from MTH[][], such as MTH[i]. In line 1, the algorithm checks the number of scheduled nodes in MTH[i] to determine whether it is full. In lines 2–4, the algorithm checks whether MTH[i] is full; if it is, the algorithm exits. In lines 4 and 5, when MTH[i] is semi-full with W scheduled nodes, where W is less than L, the algorithm allocates L − W unscheduled nodes to MTH[i]. Thereafter, in line 6, it marks the new nodes allocated to MTH[i] as scheduled. Note that the decomposition algorithm does not produce two new rows, because it fills the empty spaces in MTH[i] from MTH[][]. Subsequently, in line 7, it calls the fitness function to obtain PE′ and KE′; then, in line 8, if the sum of PE′ and KE′ is greater than or equal to the sum of PE and KE, it sends the new row to the CRO-Scheduler algorithm (Algorithm 2); otherwise, it returns null.

Algorithm 4 Decomposition
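A hedged sketch of the fill step follows; the pool list stands in for the unscheduled nodes of SM[][], the acceptance test mirrors the text, and the toy fitness is again a hypothetical stand-in for Algorithm 7.

```python
# Hedged sketch of the fill step in Algorithm 4.
def toy_fitness(row):
    pe = float(sum(n is not None for n in row))          # filled slots
    ke = sum(1.0 / len(row) for n in row if n is None)   # empty slots
    return pe, ke

def decomposition(row, pool):
    """If the row holds W < L scheduled nodes, pull L - W unscheduled nodes
    from the pool (standing in for SM[][]), then accept the new row only if
    PE' + KE' >= PE + KE; otherwise return None (null)."""
    L = len(row)
    if sum(n is not None for n in row) == L:
        return None                        # row already full: nothing to add
    pe, ke = toy_fitness(row)
    new_row = list(row)
    for j in range(L):
        if new_row[j] is None and pool:
            new_row[j] = pool.pop(0)       # newly allocated node -> scheduled
    pe2, ke2 = toy_fitness(new_row)
    return new_row if pe2 + ke2 >= pe + ke else None

pool = [(2, 0), (2, 1)]                    # hypothetical unscheduled nodes
print(decomposition([(0, 0), None, (0, 1), None], pool))
# -> [(0, 0), (2, 0), (0, 1), (2, 1)]
```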

4.1.6 Inter-molecular ineffective collision algorithm

The inter-molecular ineffective collision algorithm receives two full rows from MTH[][], namely MTH[i] and MTH[j], with the nodes of common solutions, as shown in Algorithm 5. Under this algorithm, the rows exchange their input execution path parts if they contain common nodes (i.e., nodes belonging to the same execution path); each row obtains the maximum number of nodes of a path, as opposed to running the nodes of one path in different rows. Doing so reduces the amount of data sharing between the threads during thread execution. Thus, the algorithm ascertains whether the two solutions have nodes from the same input solution in order to regroup them in the same row. In lines 5 and 6, the algorithm calls the fitness function (Algorithm 7) for both rows MTH[i] and MTH[j]. In line 7, it checks whether the sum of PE′1, KE′1, PE′2, and KE′2 is greater than or equal to the sum of PE1, KE1, PE2, and KE2; if so, it sends the new rows to the CRO-Scheduler algorithm (Algorithm 2); otherwise, it returns null. Note that PE′1 and PE′2 are the potential energies after applying the required operator (or reaction) for MTH[i] and MTH[j], respectively. Also, KE′1 and KE′2 are the kinetic energies after applying the required operator (or reaction) for MTH[i] and MTH[j], respectively. Moreover, PE1 and PE2 are the potential energies before applying the required operator (or reaction) for MTH[i] and MTH[j], respectively. Additionally, KE1 and KE2 are the kinetic energies before applying the required operator (or reaction) for MTH[i] and MTH[j], respectively.

Algorithm 5 Inter-molecular ineffective collision
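The regrouping idea can be sketched as follows; this assumes both rows are full, that nodes are (path, position) pairs, and it omits the PE/KE acceptance test described above for brevity.

```python
def inter_molecular(row_i, row_j):
    """Hedged sketch of the exchange in Algorithm 5: regroup the nodes of two
    full rows so that nodes of the same execution path land in the same row
    as far as possible (the PE/KE acceptance test is omitted here)."""
    nodes = sorted(row_i + row_j)               # sort by (path, position)
    half = len(row_i)
    return nodes[:half], nodes[half:]

a, b = inter_molecular([(0, 0), (1, 1)], [(1, 0), (0, 1)])
print(a, b)   # ([(0, 0), (0, 1)], [(1, 0), (1, 1)]) -- each path contiguous
```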

4.1.7 Synthesis algorithm

The synthesis algorithm, shown in Algorithm 6, refers to the process of merging two rows in which the number of nodes inside each thread is less than L (the thread's size). This algorithm merges them to yield one row rather than two distinct rows. In line 5, the synthesis algorithm calls the on-wall ineffective collision (Algorithm 3). In line 6, the synthesis algorithm calls the fitness function (Algorithm 7) to obtain PE′ + KE′ of the new row. Finally, it checks the sum of PE′ and KE′: if it is greater than or equal to the sum of PE1, KE1, PE2, and KE2, it sends the new row to the CRO-Scheduler algorithm (Algorithm 2); otherwise, it returns null.

Algorithm 6 Synthesis
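A minimal sketch of the merge step, assuming the combined rows fit within one thread of size L; the subsequent on-wall reordering and the PE/KE acceptance test described above are omitted.

```python
def synthesis(row_i, row_j, L):
    """Hedged sketch of the merge in Algorithm 6: combine the scheduled nodes
    of two under-filled rows into a single row of size L (the on-wall
    reordering call and the PE/KE acceptance test are omitted here)."""
    nodes = [n for n in row_i + row_j if n is not None]
    if len(nodes) > L:
        return None                     # combined rows would overflow size L
    return nodes + [None] * (L - len(nodes))

print(synthesis([(0, 0), None, None, None],
                [(0, 1), (1, 0), None, None], 4))
# -> [(0, 0), (0, 1), (1, 0), None]
```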

4.1.8 Fitness function

In the fitness function, the computation is concentrated on the dependency issues represented by PE and KE, where PE is the number of dependencies between the parts of the execution paths, or the number of full slots in a row MTH[i]. Meanwhile, KE is the maximum tolerated number of dependencies between parts of the execution paths in a row MTH[i], where a slot is not in use (empty) or a slot is filled but independent of its predecessor.

Accordingly, the fitness function, shown in Algorithm 7, ascertains the dependencies between the parts of the execution path. If the dependencies hold under the new arrangement of the path, it calculates the PE and KE. PE starts at zero and is incremented by one if the first case holds and by two if the second case holds. The first case is when there is an execution path part in position MTH[i][j]; the second case is when the execution path part in position MTH[i][j] depends on its predecessor. On the other hand, KE also starts at zero and is incremented by 1/L if the position MTH[i][j] is empty, that is, it has no execution path part in it. KE is likewise incremented by 1/L if an execution path part in position MTH[i][j] is independent of its predecessor. The fitness function goes through all the nodes in MTH[i] to calculate the PE and KE.

Algorithm 7 Fitness function
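Based on the description above, a direct Python sketch of the PE/KE computation follows; the dependence test (a node depends on its immediate predecessor when both belong to the same execution path and occupy consecutive positions) is our reading of the text, and nodes are again (path, position) pairs.

```python
def fitness(row):
    """Hedged sketch of Algorithm 7 for one row MTH[i] of size L.
    PE: +1 for a filled slot, +2 instead when the filled slot also depends
    on its predecessor (our reading of the two cases in the text).
    KE: +1/L for an empty slot, and +1/L for a filled slot that is
    independent of its predecessor."""
    L = len(row)
    pe = ke = 0.0
    for j, node in enumerate(row):
        if node is None:
            ke += 1.0 / L                    # empty slot
            continue
        prev = row[j - 1] if j > 0 else None
        depends = (prev is not None and node[0] == prev[0]
                   and node[1] == prev[1] + 1)
        pe += 2.0 if depends else 1.0
        if not depends:
            ke += 1.0 / L                    # independent of its predecessor
    return pe, ke

print(fitness([(0, 0), (0, 1), (1, 3), None]))   # -> (4.0, 0.75)
```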

4.1.9 CROTS main algorithm

The scheduling process in the CROTS main algorithm (Algorithm 8) uses a binary tree topology to depict the execution paths before choosing the size and the number of threads dynamically.

First, the execution paths are scheduled, after which the best case and worst case are ascertained from the paths. The best case of the paths refers to the shortest path, which holds the minimum number of nodes, whereas the worst case refers to the longest path, which has the maximum number of nodes. The size of the threads is determined by the best and worst cases of the available execution paths, denoted BP and WP, respectively. The thread's size L should be greater than or equal to BP and less than or equal to WP.

Algorithm 8 CROTS main algorithm

Second, the worst case in resolving the problem assigns a thread to each path, so that the number of threads does not exceed the number of binary tree leaves at level Z. Put succinctly, the number of threads needs to be less than or equal to the number of siblings at this level Z.

Finally, when the maximum number of available paths EP is greater than the number of threads in the tree, the paths are divided into two groups and CROTS is recursively called for each group. Lines 6–11 describe the process of visiting a binary tree, where each time the paths are divided by two. CROTS calls the CRO-Scheduler algorithm recursively for each child until it visits all the nodes.
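The recursion can be pictured with the short sketch below; cro_scheduler is a placeholder for Algorithm 2, and the split threshold follows the rule that the number of paths handed to the scheduler must not exceed the number of threads.

```python
def crots(paths, T, cro_scheduler):
    """Hedged sketch of the recursion in Algorithm 8: while the number of
    execution paths EP exceeds the number of threads T, split the paths into
    two halves (binary tree style) and recurse; otherwise schedule them."""
    if len(paths) <= T:
        cro_scheduler(paths)           # placeholder for Algorithm 2
        return
    mid = len(paths) // 2
    crots(paths[:mid], T, cro_scheduler)
    crots(paths[mid:], T, cro_scheduler)

# Hypothetical usage: 5 paths over 2 threads; the scheduler stub just prints.
crots([f"path{i}" for i in range(5)], 2, lambda g: print("schedule", g))
```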

4.1.10 Block diagram of CROTS

In this subsection, the block diagram of CROTS is presented, as shown in Fig. 1, where CROTS has two phases. In the first phase, the input is the flow-graph of a program segment and the output is the maximum number of execution paths, to be used as the input of the second phase. That is, in the first phase, the SG algorithm (Algorithm 1) scans and categorizes the flow-graph nodes, then generates the execution paths, and finally encodes the execution paths and saves them in the solution matrix, denoted SM[][].

Fig. 1 Block diagram of CROTS

In the second phase, as mentioned earlier, the input of CROTS is the set of execution paths generated in the previous phase by the SG algorithm, and the output is an optimized schedule of execution paths. Accordingly, in the second phase, the CROTS algorithm divides the execution paths using a binary tree topology by calling itself recursively, and then it calls CRO-Scheduler. Meanwhile, CRO-Scheduler allocates unscheduled nodes (i.e., parts of execution paths) to the available threads randomly and runs them concurrently. CRO-Scheduler checks the threads' status (case) and sends them to the appropriate function according to their current status and the corresponding cases. The process proceeds as follows: if the status of the threads matches case A, the decomposition function is called; otherwise, case B is checked. If the status matches case B, the on-wall ineffective collision is called; otherwise, case C is checked. If the status matches case C, synthesis is called; otherwise, case D is checked. Finally, if the status matches case D, the inter-molecular ineffective collision is called; if it does not match, CRO-Scheduler returns the unscheduled execution path parts to SM[][].

Thus, in the second phase, CROTS incrementally sends execution path parts from SM[][] to CRO-Scheduler until all execution path parts are scheduled. The stopping condition occurs when SM[][] becomes empty; accordingly, CROTS no longer calls CRO-Scheduler and subsequently terminates. This indicates that all execution parts have been scheduled and an optimized multithreading schedule has been delivered.

4.2 Analytical analysis of CROTS

This subsection presents the analytical analysis of CROTS in terms of the number of steps of all algorithms used in the CROTS technique, that is, Algorithms 1–8. Table 1 lists the symbols used in calculating the number of steps in Eqs. 1–10, with their descriptions.

Table 1 Symbols used in calculating the number of steps in Eqs. 1–10 and their description

In the SG algorithm (Algorithm 1), the time complexity of lines 1–8 is O(N), since every node is visited, named, and marked as a regular node, a loop node, or a branch node. The time complexity of lines 9–27 is O(EP × ml), because there are two nested loop statements. In the outer loop, the while loop iterates EP times, so the worst case of the outer loop is EP, the maximum number of execution paths. In the inner loop, the algorithm continues to visit successor nodes until it reaches the ending node, so the worst case of the inner loop is ml, the maximum execution path size. Thus, Eq. 1 presents the number of steps of Algorithm 1. Notably, Algorithm 1 is considered a pre-phase of the CROTS technique, which generates the execution paths and then encodes them in SM[][], the solution matrix used by CROTS.

$$ Steps_{Algo\text{-}1} = N + EP \times ml. $$
(1)

In the CRO-Scheduler algorithm (Algorithm 2), lines 1–9 need three steps, a constant value c that can be ignored later on, unless the CROTS main algorithm (Algorithm 8) calls CRO-Scheduler and passes the thread size and the number of threads in the optional parameters L and T, respectively. Notably, the letter c refers to a constant value in the following equations for calculating the number of steps. In lines 11–12, the algorithm needs L × T steps to allocate the nodes and compute the PE and KE for each row in MTH[][]. The time complexity of the second phase is O(EP × (CH × (ml/(L × T)))): the maximum number of execution paths EP in SM[][] is multiplied by CH, the time complexity of the chosen chemical reaction presented in Algorithms 3–6 (depending on the selection made by the if statements in lines 20–47), and then multiplied by the result of dividing the maximum execution path size ml by the product of the thread's size L and the number of threads T. The statement in line 20 needs at most L steps to check the number of items in MTH[i]. In line 35, the statement needs 2L steps to check the number of items in MTH[i] and MTH[j]. Equation 2 presents the number of steps of Algorithm 2.

$$ Steps_{Algo\text{-}2} = c + L \times T + EP \times \left( (2L + CH) \times (ml/(L \times T)) \right). $$
(2)

The time complexity of the four reactions (Algorithms 3–6) is impacted by the time complexity of the fitness function (Algorithm 7), which is O(L), because whenever they perform any action on their input solutions, they need to call the fitness function to check whether the new solutions preserve the dependencies between all the row nodes. Specifically, the fitness algorithm (Algorithm 7) contains one for loop, which iterates L times (lines 3–15). In line 2, scanning the nodes in one row requires L steps. Also, the first and last lines of the fitness function algorithm require constant c steps. So, the total number of steps required for Algorithm 7 is c + 2L, which is O(L) time complexity.

In the on-wall ineffective collision algorithm (Algorithm 3), lines 1–3 require O(L) time complexity, since all scheduled nodes in the selected row MTH[i] are scanned, where the size of row MTH[i] is L. Also, lines 4 and 5 require O(L) time complexity. In line 6, the fitness function is called, which requires L steps. The if statement in lines 7 and 8 requires constant c steps. Equation 3 presents the number of steps of Algorithm 3.

$$ Steps_{Algo\text{-}3} = c + 3L. $$
(3)

In the decomposition algorithm, the number of steps in lines 1–4 of Algorithm 4 is a constant c. The number of steps needed in lines 5 and 6 is 2(L − W). In line 7, the decomposition algorithm calls the fitness function (Algorithm 7) to ascertain the dependencies between the nodes, which requires L steps. The if statement in lines 8 and 9 requires constant c steps. Equation 4 presents the number of steps of Algorithm 4.

$$ Steps_{Algo\text{-}4} = c + 2(L - W) + L. $$
(4)

In the inter-molecular ineffective collision algorithm (Algorithm 5), scanning the nodes in two rows in lines 1–3 requires 2L steps. In lines 5 and 6, the inter-molecular ineffective collision algorithm calls the fitness function two times for both selected rows, MTH[i] and MTH[j], which requires 2L steps. Finally, the if statement in lines 7–9 requires constant c steps. Thus, the number of steps of Algorithm 5 is shown in Eq. 5.

$$ Steps_{Algo\text{-}5} = c + 4L. $$
(5)

In the synthesis algorithm (Algorithm 6), combining the contents of two rows into one row, as shown in lines 1–3, requires L/2 steps. In line 5, it calls the on-wall ineffective collision (Algorithm 3), which requires c + 3L steps, as shown in Eq. 3. In line 6, the synthesis algorithm calls the fitness function (Algorithm 7), whose time complexity is O(L). Finally, the if statement in lines 7–9 requires constant c steps. Equation 6 presents the number of steps of Algorithm 6.

$$ Steps_{Algo\text{-}6} = c + 4.5L. $$
(6)

In the CROTS main algorithm (Algorithm 8), lines 1–5 require constant c steps. Lines 6–11 require O(log2(EP)) time complexity, mainly for the while loop in line 6. CROTS calls the CRO-Scheduler algorithm recursively for each child until it visits all the nodes; the number of steps of CRO-Scheduler is shown in Eq. 2. The worst-case scenario is when CH refers to the synthesis algorithm (Algorithm 6), which has the highest number of steps. Equation 7 shows the detailed number of steps of CROTS. Notably, CH holds the number of steps of the chosen chemical reaction presented in Algorithms 3–6, whose time complexity is O(L), and c refers to a constant value. However, since L × T is smaller than EP, it can be dropped along with the constant c; also, CH equals L. Therefore, Eq. 8 shows the approximated number of steps of CROTS.

$$ Steps_{CROTS} = \log_{2}(EP) \times \left( c + L \times T + EP \times \left( (2L + CH) \times (ml/(L \times T)) \right) \right),\;{\text{where}}\;CH = {\text{O}}(L) $$
(7)
$$ Steps_{CROTS} \approx \log_{2}(EP) \times EP \times (ml/T). $$
(8)
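To make Eq. 8 concrete, consider the purely hypothetical values EP = 32 execution paths, a longest path of ml = 1024 nodes, and T = 4 threads:

$$ Steps_{CROTS} \approx \log_{2}(32) \times 32 \times (1024/4) = 5 \times 32 \times 256 = 40{,}960\;{\text{steps}}. $$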

Table 2 lists the number of steps for Algorithms 1–7. The minimum number of steps relates to the fitness function (Algorithm 7). On the other hand, the CRO-Scheduler algorithm has the maximum number of steps.

Table 2 Number of steps for Algorithms 1–7

Meanwhile, the experimental cost of a multithreaded algorithm, denoted CostE, is typically defined as the product of the execution time observed in the experimental results, denoted ExeTime, and the number of threads (or processing units) used in the experiment, as mentioned in Grama et al. (2003) and shown in Eq. 9. Also, in Cormen et al. (2009), the execution time of an algorithm is defined as its number of steps. It may be worthwhile to note that a step is considered machine independent. Thus, the analytical cost (CostA) of CROTS equals the number of steps of the CROTS algorithms multiplied by the number of threads used, as shown in Eq. 10.

$$ Cost_{E} = ExeTime \times T $$
(9)
$$ Cost_{A} = Steps_{CROTS} \times T. $$
(10)
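As a hypothetical illustration of Eq. 9, an experiment with a measured execution time of ExeTime = 2 s on T = 8 threads has an experimental cost of:

$$ Cost_{E} = 2 \times 8 = 16\;{\text{thread-seconds}}. $$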

4.3 Example for CROTS

The following example illustrates the steps that CROTS entails. Put succinctly, this example is given to illustrate incremental thread scheduling using CRO reactions. Figure 2 shows a program segment from ML (Modula-like language), as presented by Jabri (1990). The program is divided into a group of nodes based on the flow of the program, as shown in Fig. 2a. More specifically, when there is an if statement or a loop statement, it redirects the program flow to the requisite node based on the result of the condition in the statement. Meanwhile, the program's statements are allocated to the nodes according to the criteria presented by Jabri (1990).

Fig. 2

a Program segment from ML (Modula-like language) and its node names. b Program segment from ML as flow-graph

The input of the CROTS technique is a flow-graph of a program segment, which is the output of the approach mentioned by Jabri (1990), such as the flow-graph presented in Fig. 2b. Note that the program segment in Fig. 2a is presented as nodes in the flow-graph of Fig. 2b.

During the first phase of CROTS, the flow-graph is scanned, after which the starting node ns and the ending node nf are determined, which are denoted in Fig. 2b by n1 and n5, respectively. Meanwhile, n2 and n7 are considered as loop nodes, n3 is considered as a branch node, and the remaining nodes are considered as regular nodes.

In the second phase, this technique generates the execution paths using the flow-graph nodes; for example, one execution path is (n1, n2, n3, n4, n2, n5), which commences with n1 as the starting node and ends with n5 as the ending node. The technique subsequently encodes the nodes based on the solution number. For example, let (n1, n2, n3, n6, n7, n8, n7, n2, n5) represent the first solution found; at this stage, the CROTS technique encodes it as SM[0][0], SM[0][1], SM[0][2], SM[0][3], SM[0][4], SM[0][5], SM[0][6], SM[0][7], SM[0][8], respectively, as shown in Fig. 3. Meanwhile, CROTS schedules the execution paths onto the threads by making use of the CRO reactions.
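
A minimal sketch of this encoding step, under the assumption that each execution path is held as a list of node names, is given below; the jagged array sm and the identifier strings merely mirror the naming used in Fig. 3.

// Minimal sketch (assumed representation): encoding execution paths into the
// solution matrix SM, where the identifier SM[s][p] records solution s and
// logical position p of the node within that solution.
string[][] executionPaths =
{
    new[] { "n1", "n2", "n3", "n6", "n7", "n8", "n7", "n2", "n5" }  // first solution found
};

var sm = new (string id, string node)[executionPaths.Length][];
for (int s = 0; s < executionPaths.Length; s++)
{
    sm[s] = new (string id, string node)[executionPaths[s].Length];
    for (int p = 0; p < executionPaths[s].Length; p++)
        sm[s][p] = ($"SM[{s}][{p}]", executionPaths[s][p]);  // identifier survives later reordering
}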

Fig. 3

Execution paths before and after encoding with their variables’ initial values

Figure 3 shows five execution paths before and after encoding. Importantly, the new naming is consistent and preserves the correlation between the nodes of the execution paths, where every node has a unique identifier relating to its solution as well as its logical position within that solution. In the event that nodes are reordered, their identifiers are used as indicators of dependency issues.

Two threads are used in this example, which means T equals two, and the thread size is four, that is, L equals four, for the sake of simplicity. Figure 4 illustrates the steps of scheduling the execution paths of the aforementioned example.

Fig. 4

Example of CROTS schedule using two threads of size 4

In Fig. 3, the execution paths are generated using the program segment of Fig. 2. First, the variables {a, b, y, d, e}, which exist in node n1 of Fig. 2a, are randomly initialized for each execution path, and their values are then used in the statements of the following nodes of that path. Notably, each execution path has a different sequence of nodes according to the initial values given to the variables {a, b, y, d, e} (see Fig. 3), which are initialized in node n1. Second, the process of finding an execution path proceeds through the nodes according to the results of their statements, using the variable values corresponding to that path; branch and loop nodes lead execution paths through different node sequences. For example, generating the execution path SM[3] (before encoding), as shown in Fig. 3, starts by initializing the variables {a, b, y, d, e} in the starting node n1 and then moving to its successor, the loop node n2. In Fig. 2a, assume that the values of the variables {a, b, y, d, e} in execution path SM[3] are {10, 12, 8, 20, 7}, respectively. The value of variable d in this execution path is 20, which is smaller than 30, so the while statement’s condition is false and the loop body is not entered. Subsequently, node n2 proceeds to its false-condition successor in execution path SM[3], which is the ending node n5. On the other hand, in execution path SM[0] the initial variable values are different, with d equal to 35; thus, the successor of n2 in this execution path is n3, because the loop node’s condition is true. The execution paths of all other solutions are generated in the same way.
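
The walk just described can be sketched as the following C# fragment; the successor table and all conditions except the while test on d at n2 are illustrative assumptions, since the full flow-graph of Fig. 2b is not reproduced here.

// Minimal sketch (assumed flow-graph encoding): each loop/branch node maps to
// a true/false successor pair, and the walk follows the condition results,
// evaluated over the path's variable values, until the ending node n5.
using System;
using System.Collections.Generic;

class PathGeneration
{
    // Successor table: (onTrue, onFalse); regular nodes use the same successor twice.
    static readonly Dictionary<string, (string onTrue, string onFalse)> Succ = new()
    {
        ["n1"] = ("n2", "n2"),
        ["n2"] = ("n3", "n5"),   // while node: enter the loop body or exit to the ending node
        ["n3"] = ("n4", "n6"),   // branch node (successors assumed for illustration)
        ["n4"] = ("n2", "n2"),
        ["n6"] = ("n7", "n7"),
        ["n7"] = ("n8", "n2"),   // inner loop node (successors assumed for illustration)
        ["n8"] = ("n7", "n7"),
    };

    static List<string> GeneratePath(IDictionary<string, int> vars,
                                     Func<string, IDictionary<string, int>, bool> condition)
    {
        var path = new List<string>();
        string current = "n1";                 // starting node: variables are initialized here
        while (current != "n5")                // n5 is the ending node
        {
            path.Add(current);
            var (onTrue, onFalse) = Succ[current];
            current = condition(current, vars) ? onTrue : onFalse;
        }
        path.Add("n5");
        return path;
    }

    static void Main()
    {
        // SM[3] example: d = 20 < 30, so the while condition at n2 is false
        // and the generated path is (n1, n2, n5).
        var vars = new Dictionary<string, int>
        {
            ["a"] = 10, ["b"] = 12, ["y"] = 8, ["d"] = 20, ["e"] = 7
        };
        var path = GeneratePath(vars, (node, v) => node == "n2" && v["d"] > 30);
        Console.WriteLine(string.Join(", ", path));   // prints: n1, n2, n5
    }
}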

In this example, the scheduling of the five execution paths took place over five iterations, as shown in Fig. 4. In every iteration, unscheduled nodes from SM[][] are allocated to the threads matrix (in this example, MTH[2][4]), and then the appropriate reaction is called according to the case of the selected row(s) (see Algorithm 2). Note that in each iteration, at most four unscheduled nodes are allocated to each available row MTH[i], since the size of each row is four, and the unscheduled nodes are selected randomly; thus, to illustrate the scheduling, the unscheduled nodes are chosen randomly from two different execution paths and allocated to the available rows.

The PE and KE of each row obtain their initial values using the fitness function (see line 12 in Algorithm 2). The values of PE, KE, PE′, and KE′ are shown in Fig. 4.

In the first iteration, in the allocation step, the unscheduled nodes chosen to be allocated to the available rows in MTH[][] come from two distinct execution paths, namely SM[0] and SM[1]. The threads thereby hold different nodes from both execution paths, so case D applies, and the inter-molecular ineffective collision is used to reorder the nodes in the rows and gather the nodes of the same solution within the same row, so that dependency issues can be handled. Note that the sum of PE′1, KE′1, PE′2, and KE′2 is 14, which is greater than the sum of PE1, KE1, PE2, and KE2, which is 12.5, so the threads are assigned the scheduled nodes from MTH[][]. In the second iteration, the second row (i.e., MTH[2]) has empty spaces, because the remaining nodes of execution path SM[1], which are chosen to be allocated to it, are fewer than the thread size; hence case A applies, and decomposition is used to allocate more unscheduled nodes to that row. The sum of PE′ and KE′ is 13.25, which is greater than the sum of PE and KE, which is 10.75, so the threads are assigned the scheduled nodes from MTH[][]. In the third iteration, in the allocation step, the selected unscheduled nodes come from the first and the fifth execution paths, namely SM[0] and SM[4] (see Fig. 3), but nodes from the same execution paths are separated across both rows, so case D applies, and the inter-molecular ineffective collision is used to gather nodes from the same execution path into the same thread. The sum of PE′1, KE′1, PE′2, and KE′2 is 14, which is greater than the sum of PE1, KE1, PE2, and KE2, which is 12.5, so the threads are assigned the scheduled nodes from MTH[][]. In the fourth iteration, both threads have empty spaces, so case C applies, and synthesis is used to gather the nodes into one row rather than two. Synthesis calls the on-wall ineffective collision to reorder the nodes properly; here, the on-wall ineffective collision moves SM[3][2] to the first place of the first row (MTH[1]) because all of its predecessors were processed. The sum of PE′1, KE′1, PE′2, and KE′2 is 7.25, which is greater than the sum of PE1, KE1, PE2, and KE2, which is 7, so the threads are assigned the scheduled nodes from MTH[][]. Finally, in the last iteration, decomposition (case A) is used to allocate the last unscheduled node of execution path SM[0] to the second row (MTH[2]). The sum of PE′1, KE′1, PE′2, and KE′2 is 9.75, which is greater than the sum of PE1, KE1, PE2, and KE2, which is 8.75, so the threads are assigned the scheduled nodes from MTH[][].
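
The acceptance rule applied in every iteration above can be sketched as the following C# helper, assuming the PE and KE values of the involved rows are held in arrays; per the example (and the observation below), a reaction’s new arrangement is kept when its total energy is greater than or equal to the old total.

// Minimal sketch (assumed structure): a reaction's reordering is kept only
// when the total energy of the new state (the primed PE' and KE' values) is
// at least the total energy of the old state (the initial PE and KE values).
static bool AcceptReaction(double[] peOld, double[] keOld,
                           double[] peNew, double[] keNew)
{
    double before = 0, after = 0;
    for (int i = 0; i < peOld.Length; i++)
    {
        before += peOld[i] + keOld[i];   // e.g., PE1 + KE1 + PE2 + KE2 = 12.5
        after  += peNew[i] + keNew[i];   // e.g., PE'1 + KE'1 + PE'2 + KE'2 = 14
    }
    return after >= before;              // 14 >= 12.5: keep the new arrangement
}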

Clearly, in each iteration, the execution paths are allocated to the available threads such that the sum of the PE′ and KE′ values is greater than or equal to the sum of the initial PE and KE values.

For comparison, Andión et al. (2013) presented KIR, an intermediate representation for converting sequential codes into parallel counterparts. KIR is suitable for various syntactic forms of source code (e.g., arrays, pointers, and/or complex control flow), and it exposes multiple levels of parallelism. Additionally, Andión et al. (2013) presented an automatic partitioning strategy, whereby the parallelism exposed by KIR can be mapped to hardware based on multicore CPUs. All of these advantages of KIR are also advantages of CROTS.

In our approach, the process of allocating threads incrementally using the CRO operators enabled CROTS to achieve a higher speedup in the experimental results than KIR. Also, CROTS can distribute parts of the execution paths to fixed-size threads regardless of the sizes of the execution paths, whereas KIR cannot predict the sizes of the execution paths; therefore, threads in KIR have different sizes. Additionally, CROTS parallelized loops with unprivatized variables, as seen in the above example, which KIR could not.

Moreover, CROTS is language independent and can be applied to any programming language, and it can parallelize whole program segments, including nested and non-nested loop statements, nested and non-nested if statements, and assignment statements. CROTS can also parallelize loop statements that have true data dependencies, unlike AutoPar-Clava, which is a library only for the C programming language and can parallelize only loop statements with no true data dependencies. Accordingly, the only approach close to ours is KIR, presented by Andión et al. (2013); AutoPar-Clava is not directly comparable. To the best of our knowledge, no other existing approaches are available for comparison with our approach at this time.

5 Experimental results

This section presents and discusses the experimental environment and results for the CROTS technique. Every experiment was run thirty times, and the average of the thirty results is reported.

Section 5.1 presents the experimental environment, and Sect. 5.2 presents the CROTS speedup. This is followed by Sect. 5.3, which shows the CROTS efficiency, while Sect. 5.4 presents the speedup comparison between CROTS and KIR. Finally, experimental versus analytical results are presented in Sect. 5.5.

5.1 Experimental environment

This subsection presents the experimental environment, including the hardware specifications, the software and tools used to implement the CROTS and KIR techniques, and the benchmark datasets used in the experiments.

For experimental purposes, we used the following platform: a multicore system comprising an Intel(R) Core(TM) i7-8550U 1.80 GHz processor with 8 logical CPUs, 8 MB of cache memory, and 16 GB of RAM. The operating system was 64-bit Ubuntu 17.10. The CROTS and KIR techniques were implemented in the C# programming language and executed in Microsoft’s Visual Studio 2017 integrated development environment. Moreover, we used the dense matrix–vector product benchmark DenseAMUX, mentioned by Andión et al. (2013), as an application, as well as the program segment of Fig. 2, presented by Jabri (1990), to show the speedup achieved by the CROTS technique.

Multiplying a dense Q × Q matrix MQ by a Q × 1 vector XQ to yield the Q × 1 result vector YQ is extensively used to solve several problems. Many such problems have a huge input size and take a long time to solve with a sequential program; notably, the sequential execution time of a matrix–vector product is O(Q2). Grama et al. (2003) show the need to solve the problem using a multithreading approach, particularly when Q is very large, by dividing the problem into several partitions and distributing them over threads so that the execution time falls below the sequential O(Q2) time.
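
As a rough sketch of this row-partitioned scheme (the partitioning idea from Grama et al. (2003), not the CROTS scheduler itself), the following C# fragment distributes blocks of rows across T worker threads:

// Minimal sketch: row-partitioned dense matrix-vector product y = M x,
// distributing blocks of rows across T worker threads.
using System;
using System.Threading;

class DenseMatVec
{
    static void Multiply(double[][] m, double[] x, double[] y, int threads)
    {
        int q = m.Length;
        int block = (q + threads - 1) / threads;       // rows per thread (ceiling)
        var workers = new Thread[threads];
        for (int t = 0; t < threads; t++)
        {
            int lo = t * block;
            int hi = Math.Min(lo + block, q);
            workers[t] = new Thread(() =>
            {
                for (int i = lo; i < hi; i++)          // each thread owns rows [lo, hi)
                {
                    double sum = 0;
                    for (int j = 0; j < q; j++)
                        sum += m[i][j] * x[j];
                    y[i] = sum;                        // disjoint rows: no synchronization needed
                }
            });
            workers[t].Start();
        }
        foreach (var w in workers) w.Join();           // wait for all partitions
    }

    static void Main()
    {
        int q = 4, threads = 2;
        var m = new double[q][];
        for (int i = 0; i < q; i++)
        {
            m[i] = new double[q];
            for (int j = 0; j < q; j++) m[i][j] = i + j;   // small sample matrix
        }
        var x = new double[q];
        for (int j = 0; j < q; j++) x[j] = 1.0;
        var y = new double[q];
        Multiply(m, x, y, threads);
        Console.WriteLine(string.Join(", ", y));           // prints: 6, 10, 14, 18
    }
}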

In the following experiments, several factors are known to affect the speedup and efficiency results, such as the size of the matrix, the size of the threads, and the number of threads. For this reason, five different matrix sizes are used, namely 1000 × 1000, 2000 × 2000, 3000 × 3000, 5000 × 5000, and 7000 × 7000, denoted as M1000, M2000, M3000, M5000, and M7000, respectively; these datasets were generated from the DenseAMUX benchmark. Also, seven distinct datasets of different sizes were generated using the SG algorithm, where each dataset has a different number of execution paths; these datasets are generated from the program segment in Fig. 2, presented by Jabri (1990). Figures 5, 6, and 7 illustrate the speedup and efficiency performance of CROTS. Meanwhile, the thread size is known to impact the amount of concurrent thread work and the speedup: the experimental results revealed that smaller thread sizes lead to low speedup values, whereas larger sizes lead to high speedup values.

Fig. 5

Speedup of the CROTS technique for different matrix sizes using various numbers of threads

Fig. 6

Speedup of the CROTS technique for various numbers of execution paths of the program segment in Fig. 2 using various numbers of threads

Fig. 7

Efficiency of the CROTS technique for different matrix sizes using various numbers of threads

The experiments are conducted using 1, 2, 4, 8, 16, and 32 threads, where the thread size L is defined in Algorithm 2 (the CRO-Scheduler algorithm) and equals 80,000. This implies that in each iteration, a sequence of at most 80,000 unscheduled nodes from SM[][] is allocated to each thread.

5.2 Speedup of CROTS technique

In this subsection, we present the speedup of the CROTS technique and explain the differences among its values for the various datasets.

Figures 5 and 6 illustrate two distinct comparisons of the speedup achieved by the CROTS technique for various numbers of threads, using the DenseAMUX benchmark and the program segment of Fig. 2, respectively. The speedup SP in these experiments is the ratio of the sequential execution time TS to the multithreaded execution time TM, as shown in Eq. 11, which is mentioned by Mahafzah (2013). In this comparison, the execution time of the algorithm on a single thread is taken as the algorithm’s sequential execution time. The multithreaded execution time, in turn, is the time required by the multithreaded algorithm to solve the problem using multiple threads, inclusive of the time required for creating and terminating concurrent threads, serializing the algorithm, and synchronizing shared data (Grama et al. 2003).

$$ S_{P} = T_{S} /T_{M}. $$
(11)
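
For instance (using purely hypothetical timings rather than measured values), if the sequential run took TS = 24 s and the multithreaded run with eight threads took TM = 3 s, Eq. 11 gives:

$$ S_{P} = T_{S} / T_{M} = 24 / 3 = 8. $$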

In the case of a single thread, the speedup is always one for all datasets because TS and TM are equal. In the two-thread case, the workload is distributed over two threads; as a result, the execution time becomes smaller than the sequential execution time, which is reflected in a higher speedup than with one thread. When more than four threads are used in the cases of M5000 and M7000, the speedup increases drastically, as shown in Fig. 5, because the workload of a large dataset is distributed over more threads, so the multithreaded execution time drops well below the sequential execution time. Moreover, memory latency tends to penalize the use of a single thread on large input sizes. Consequently, for large datasets the sequential execution time is drastically higher than the multithreaded execution time, which drives the speedup to higher values.

Meanwhile, Fig. 6 illustrates the speedup achieved by the CROTS technique for the program segment of Fig. 2. In this experiment, the seven datasets generated using the SG algorithm from the program segment in Fig. 2 are used; for example, NE1000 has 1000 execution paths, NE2000 has 2000 execution paths, NE3000 has 3000 execution paths, and so on. The speedup values in this figure differ from those in Fig. 5 because of the differences between the DenseAMUX benchmark and the program segment of Fig. 2 in terms of the execution path size and the contents of the flow-graph nodes.

5.3 Efficiency of CROTS technique

In this subsection, we present the results of CROTS under the well-known efficiency metric, using various datasets. Efficiency is defined as the fraction of time for which a processing unit (thread) is usefully employed and is calculated as the ratio of the speedup to the number of processing units (threads) (Grama et al. 2003); in practice, efficiency generally lies between zero and one. In the case of using 32 threads on the small datasets M1000, M2000, and M3000, the efficiency obtains low values, as shown in Fig. 7, which implies that the threads are used inefficiently or that their idle time is fairly large. On the other hand, the efficiency improves when 32 threads are used on larger datasets such as M5000 and M7000, as shown in Fig. 7, since their speedups are higher than those of the small datasets. This is because the larger dataset increases the workload of each thread, and consequently the idle time of a thread with more workload is reduced.
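
As a worked instance of this metric (denoting the efficiency by E), the maximum speedup reported in these experiments, 11.98 on M5000 with 32 threads (see Sect. 6), corresponds to:

$$ E = S_{P} / T = 11.98 / 32 \approx 0.37. $$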

Meanwhile, the efficiency achieved higher values with more than four threads as the dataset size increased, such as in the cases of M5000 and M7000, as shown in Fig. 7, since the speedup values achieved by the CROTS technique are higher in these cases than those achieved with small dataset sizes. Put succinctly, the threads were used effectively with large dataset sizes.

5.4 Comparison between CROTS and KIR

In this subsection, we present the experimental comparison between two distinct techniques: our CROTS technique and KIR, which was proposed by Andión et al. (2013). In this comparison, we used two different datasets, namely M5000 and M7000, to compare the speedup achieved by both techniques. For experimental purposes, we re-implemented KIR according to the description presented by Andión et al. (2013). The input parameters used for KIR are the number of threads (1, 2, 4, 8, 16, and 32), the matrix dimension (5000 for M5000 and 7000 for M7000), the matrix to be multiplied (i.e., M5000 or M7000), and the vector. Note that the DenseAMUX benchmark is a dense matrix–vector multiplication; the matrices and vectors are generated separately in a pre-stage, before executing the codes, for both techniques.

Figures 8 and 9 illustrate two comparisons between CROTS and KIR using the two matrix sizes M5000 and M7000; they demonstrate how the scheduling phase of the CROTS technique impacts the speedup values. In both figures, CROTS attains higher speedup values than KIR; in particular, with 32 threads, CROTS achieved a speedup three times that achieved by KIR. CROTS improves the thread scheduling, which in turn augments the efficiency and elevates the speedup as much as possible. Moreover, the CROTS technique allocates nodes to the available threads during each iteration, which keeps the threads utilized most of the time; this use of the chemical reaction optimizer to improve thread utilization and optimize thread scheduling is reflected in higher speedup values than KIR.

Fig. 8

Speedup comparison between CROTS and KIR using M5000 matrix size

Fig. 9

Speedup comparison between CROTS and KIR using M7000 matrix size

Furthermore, generalizing hypotheses is difficult when comparing two groups of subjects with few participants, that is, fewer than eight participants in each group, as presented in Nachar (2008). Thus, statistical tests can be used to accept or reject hypotheses, that is, to show the difference between two groups of subjects according to the hypotheses. In general, two types of statistical tests are used to compare two groups of subjects: parametric and nonparametric tests. A parametric test, such as Student’s t test (Student 1908), is used when the samples are normally distributed, whereas a nonparametric test, such as the Mann–Whitney U test (Mann and Whitney 1947), is used when the distribution is asymmetrical, as explained in Nachar (2008). Accordingly, Student’s t test is inappropriate for comparing speedup observations, because the t test assumes normally distributed data, whereas the actual distributions observed in speedup learning systems have no straightforward mathematical characterization (i.e., they are not normal distributions), as mentioned in Etzioni and Etzioni (1994). In contrast, the Mann–Whitney U test is a nonparametric alternative to the t test that does not assume a normal distribution of the data. In Nachar (2008), the author has shown that the Mann–Whitney U test can be used to test whether speedup results are statistically significant.

Therefore, we used the Mann–Whitney U test to determine whether the obtained speedup values are statistically significant. The comparisons are made over the speedup values achieved by CROTS and KIR (Figs. 8, 9). Table 3 lists the speedup values achieved by CROTS and KIR, where the datasets M5000 and M7000 are used with various numbers of threads.

Table 3 Speedup values achieved by CROTS and KIR

The Mann–Whitney U test is applied to test the difference between two independent groups, where the two groups should come from the same population, be homogeneous, and have the same distribution; also, the groups should be represented by two continuous cumulative distributions, as shown in Nachar (2008). In our case, the two groups are the speedup results achieved by CROTS and KIR using the datasets M5000 and M7000. The Mann–Whitney U test evaluates the null hypothesis H0, which states that there is no difference between the groups; otherwise, the alternative hypothesis H1 is satisfied, which states that there is a statistically significant difference between the groups. Moreover, if H1 is satisfied and H0 is rejected, then H1 suggests that the variable of one group is stochastically larger than that of the other group. Mathematically, the hypotheses H0 and H1 are expressed in Eqs. 12 and 13, as presented in Nachar (2008), where CROTSspeedup and KIRspeedup correspond to the speedup values achieved by CROTS and KIR, respectively, and Pro is the probability that CROTSspeedup is larger than KIRspeedup.

$$ H_{0} : Pro\left( CROTS_{speedup} > KIR_{speedup} \right) = \tfrac{1}{2} $$
(12)
$$ H_{1} : Pro\left( CROTS_{speedup} > KIR_{speedup} \right) > \tfrac{1}{2}. $$
(13)

Accordingly, the Mann–Whitney U test requires the calculation of the U statistic of each group, expressed mathematically in Eqs. 14 and 15 (Nachar 2008). In Eqs. 14 and 15, βCROTS and βKIR are the numbers of participants in each group, which in our case are the numbers of experiments done to test CROTS and KIR, respectively. In these experiments, βCROTS and βKIR are both equal to six, since the six experiments were done using 1, 2, 4, 8, 16, and 32 threads. The calculation of the U statistic needs the variables RankCROTS and RankKIR, which are the sums of the ranks assigned to the speedup values achieved by CROTS and KIR, respectively; in other words, in Eqs. 14 and 15, the speedup results are ranked according to Nachar (2008). After calculating UCROTS and UKIR, the smaller of the two (i.e., min(UCROTS, UKIR)) is compared with the critical value, which is extracted from the table of critical values of the Mann–Whitney U test according to the corresponding α threshold, as presented in Zar (1984). The null hypothesis H0 is rejected, and the alternative hypothesis H1 satisfied, if min(UCROTS, UKIR) is less than the critical value. Table 4 lists the values of UCROTS, UKIR, RankCROTS, and RankKIR, along with the critical values for the Mann–Whitney U test when α is 0.05. Accordingly, the results show that min(UCROTS, UKIR) in both experiments, using the datasets M5000 and M7000, equals nine, which is less than the critical value of 29; this implies that the null hypothesis H0 is rejected and the alternative hypothesis H1 is satisfied, meaning that the CROTS speedup values are significantly larger than the KIR speedup values. In Table 4, notice that the values of RankCROTS and RankKIR are the same for both datasets M5000 and M7000; this is because the speedup values achieved by CROTS are larger than those achieved by KIR and the number of experiments is the same for both.

$$ U_{CROTS} = \beta_{CROTS} \times \beta_{KIR} + \frac{\beta_{CROTS} (\beta_{CROTS} + 1)}{2} - Rank_{CROTS} $$
(14)
$$ U_{KIR} = \beta_{CROTS} \times \beta_{KIR} + \frac{\beta_{KIR} (\beta_{KIR} + 1)}{2} - Rank_{KIR}. $$
(15)
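
To make the ranking and the U computation concrete, the following C# sketch implements Eqs. 14 and 15 in their standard product form (βCROTS × βKIR), averaging the ranks of tied values; the speedup arrays in Main are hypothetical illustrations, not the measured values of Table 3.

// Minimal sketch: Mann-Whitney U statistics for two speedup groups.
// Ranks are assigned over the pooled samples; ties share the average rank.
using System;
using System.Linq;

class MannWhitney
{
    static (double uA, double uB) UStatistics(double[] groupA, double[] groupB)
    {
        // Pool both groups, remembering group membership, then sort by value.
        var pooled = groupA.Select(v => (v, isA: true))
                           .Concat(groupB.Select(v => (v, isA: false)))
                           .OrderBy(p => p.v)
                           .ToArray();

        // Assign ranks 1..N; tied values share the average rank of their run.
        var ranks = new double[pooled.Length];
        for (int i = 0; i < pooled.Length; )
        {
            int j = i;
            while (j < pooled.Length && pooled[j].v == pooled[i].v) j++;
            double avgRank = (i + 1 + j) / 2.0;          // average of ranks i+1..j
            for (int k = i; k < j; k++) ranks[k] = avgRank;
            i = j;
        }

        double rankA = 0, rankB = 0;                      // Rank sums of each group
        for (int i = 0; i < pooled.Length; i++)
            if (pooled[i].isA) rankA += ranks[i]; else rankB += ranks[i];

        double nA = groupA.Length, nB = groupB.Length;
        double uA = nA * nB + nA * (nA + 1) / 2 - rankA;  // Eq. 14
        double uB = nA * nB + nB * (nB + 1) / 2 - rankB;  // Eq. 15
        return (uA, uB);
    }

    static void Main()
    {
        // Hypothetical speedups for 1, 2, 4, 8, 16, and 32 threads (not the paper's data).
        double[] crots = { 1.0, 1.8, 3.2, 5.5, 8.9, 12.0 };
        double[] kir   = { 1.0, 1.4, 2.1, 2.9, 3.4, 4.0 };
        var (uCrots, uKir) = UStatistics(crots, kir);
        Console.WriteLine($"U_CROTS = {uCrots}, U_KIR = {uKir}, min = {Math.Min(uCrots, uKir)}");
    }
}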
Table 4 Mann–Whitney U test results for speedup values of CROTS and KIR

5.5 Experimental and analytical results of CROTS

This subsection presents the experimental and analytical cost results of CROTS. It also shows that the percentage of error between the experimental and analytical cost results of the CROTS technique is at an acceptable level.

During the analysis of the CROTS technique, Eqs. 9 and 10 are used to find its experimental and analytical costs, respectively. Figure 10 shows a comparison between the experimental and analytical results in terms of cost, made for the different matrix sizes M1000, M2000, M3000, M5000, and M7000, using four threads. Correspondingly, the experimental results exhibit the same growth trend as the analytical results, with the experimental results typically attaining slightly larger values. Also, the slopes of the two curves are similar, which supports the correctness of our analysis of the CROTS technique.

Fig. 10

Cost comparison between experimental and analytical results

Phillips (2018) pointed out that an acceptable percentage of error lies between 5 and 35%, whereas a percentage of error between 0 and 5% is considered exceptionally good; however, if the error exceeds 35%, the data can be inferred to be unreliable or chaotic. The percentage of error is calculated using Eq. 16, wherein Err denotes the percentage of error, as mentioned in Kemerer (1987). According to Table 5, the average percentage of error between the experimental and analytical cost results is at an acceptable level of 14.56%.

$$ Err = \left[ \left( Cost_{E} - Cost_{A} \right) / Cost_{A} \right] \times 100\%. $$
(16)
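
For illustration only (the following values are hypothetical and not taken from Table 5), if CostE were 1.15 × CostA, Eq. 16 would give:

$$ Err = \left[ (1.15 \times Cost_{A} - Cost_{A}) / Cost_{A} \right] \times 100\% = 15\%, $$

which falls within the acceptable 5–35% range.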
Table 5 Experimental cost versus analytical cost and their percentage of error

Furthermore, let M(di) denote a dataset and M(di+1) the subsequent dataset, where the size of M(di+1) is larger than that of M(di); to illustrate, the sizes of M3000 and M5000 are 3000 × 3000 and 5000 × 5000, respectively. The increase in both the experimental and analytical costs is evident: the experimental cost grows by a factor fE, where 1.43 ≤ fE ≤ 3.53, as depicted in Eq. 17 and Table 5, and the analytical cost grows by a factor fA, where 1.45 ≤ fA ≤ 3.47, as depicted in Eq. 18 and Table 5. Consequently, the increase in cost correlates positively with the growth of the dataset.

$$ Cost_{E}(M(d_{i+1})) = f_{E} \times Cost_{E}(M(d_{i})) $$
(17)
$$ Cost_{A}(M(d_{i+1})) = f_{A} \times Cost_{A}(M(d_{i})). $$
(18)

6 Conclusions

This paper presented a new multithreaded scheduling technique for program segments, called the CROTS technique, which schedules the counterparts of sequential codes on multiple threads using a chemical reaction optimizer. The CROTS technique entails two distinct phases. In the first phase, CROTS scans the flow-graph of the sequential code using the approach proposed by Jabri (1990), categorizes the nodes, determines their counterparts, and generates the execution paths that will be used in the second phase. In the second phase, it schedules the execution paths onto the threads incrementally using the chemical reaction optimizer. Finally, CROTS provides an optimized incremental thread schedule.

In the experimental and analytical results, several performance metrics were used over different datasets to evaluate the performance of the CROTS technique, including the number of steps, execution time, speedup, efficiency, cost, and percentage of error. The experimental results revealed that when CROTS used 32 threads on matrix M5000, a maximum speedup value of 11.98 was attained. Moreover, the experimental cost of the CROTS technique grows by a factor fE, where 1.43 ≤ fE ≤ 3.53, and the analytical cost grows by a factor fA, where 1.45 ≤ fA ≤ 3.47. Finally, the findings revealed an average percentage of error of 14.56% between the experimental and analytical cost results.