
1 Introduction

The modern GPU is a many-core processor that supports the concurrent execution of thousands of threads. A GPU comprises a series of streaming multiprocessors, each containing hundreds of cores arranged to facilitate the single instruction, multiple threads (SIMT) programming model.

A graphics processing unit (GPU) is highly efficient at computer graphics manipulation and image processing. Its highly parallel structure makes it well suited to general-purpose computation and to accelerating traditionally CPU-based computational tasks. Recently, general-purpose computation on graphics processing units (GPGPU) has been adopted to accelerate many algorithms, such as sorting [11], graph algorithms [12], data encryption algorithms [5], and other generic algorithms [13]. Several libraries and programming environments are available that allow programmers to perform GPGPU and exploit the computational power of the GPU. The Compute Unified Device Architecture (CUDA) [8] framework, provided by NVIDIA, is one such programming environment; it allows programmers to exploit the data parallelism of the GPU using a C-style programming language.

Scheduling is the method by which specified work is assigned to the resources that complete it. Job/process scheduling is the process of arranging, controlling, and optimizing the allocation of system resources to threads, processes, and data flows for maximum utilization. The operating system schedules processes for execution using several scheduling algorithms based on various criteria such as CPU utilization, throughput, turnaround time, waiting time, and priority. By properly scheduling processes, system resources can be utilized equitably and effectively, achieving a target quality of service. The major CPU/job scheduling algorithms are First Come First Serve (FCFS), Shortest Job First (SJF), Round Robin (RR), and Priority-Based Scheduling (PBS). Scheduling is required for multitasking and multiplexing. It is a complex job requiring extensive processing, which is better performed on a parallel platform.

In this paper, a novel parallel approach to process scheduling using CUDA is proposed to enhance the performance of scheduling algorithms. A combination of well-established data parallel algorithms and parallel sorting techniques is adopted to achieve a substantial reduction in the execution time of scheduling algorithms (Fig. 1).

Fig. 1 Example of bitonic parallel sort algorithm

1.1 Data Parallel Algorithms

In data parallel algorithms, parallelism comes from simultaneous operations across large sets of data, rather than from multiple threads of control [4]. Prefix sum, or scan, is one of the most common data parallel algorithms. The prefix sum (also called cumulative sum, inclusive scan, or simply scan) of a sequence of numbers \( x_{0} , x_{1} , x_{2} , \ldots , x_{n - 1} \) is a second sequence of numbers \( y_{0} , y_{1} , y_{2} , \ldots , y_{n - 1} \), the sums of prefixes (running totals) of the input sequence, where \( y_{0} = x_{0} ;\; y_{1} = x_{0} \oplus x_{1} ;\; y_{2} = x_{0} \oplus x_{1} \oplus x_{2} ; \ldots \), as shown in Fig. 2. Mathematically, \( y_{i} = x_{0} \oplus x_{1} \oplus x_{2} \oplus \cdots \oplus x_{i} \;\; \forall i \in \{ 0, \ldots , n - 1 \} \). For example, with \( \oplus = + \), the input \( (3, 1, 7, 0) \) yields \( (3, 4, 11, 11) \). Figure 2 depicts the pictorial representation of the scan data parallel algorithm [6].

Fig. 2 Prefix sum data parallel algorithm

1.2 Bitonic Sort

There are multiple sorting algorithms available, such as Merge sort, Quick sort, Radix sort, and Heap sort. Bitonic sort is one of the sorting algorithms that can exploit GPU computing power and is efficient in terms of both space and time complexity [7].

Bitonic Sequence: A sequence is called bitonic if it first increases and then decreases. In other words, an array arr[0..n − 1] is bitonic if there exists an index i, \( 0 \le i \le n - 1 \), such that \( a_{0} \le a_{1} \le \cdots \le a_{i} \) and \( a_{i} \ge a_{i + 1} \ge \cdots \ge a_{n - 1} \). The bitonic sort algorithm below uses the balanced case \( i = n/2 - 1 \):

$$ a_{0} \le a_{1} \le \cdots \le a_{n/2 - 1} \quad \text{and} \quad a_{n/2} \ge a_{n/2 + 1} \ge \cdots \ge a_{n - 1} $$
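A direct check of this definition can make it concrete. The following is a minimal C-style sketch (the function name is an assumption for illustration, and rotated bitonic sequences are ignored, matching the definition above):

    // Returns true if a[0..n-1] first rises (non-strictly) and then falls,
    // i.e., there exists i with a[0] <= ... <= a[i] >= ... >= a[n-1].
    bool isBitonic(const int *a, int n)
    {
        int i = 0;
        while (i + 1 < n && a[i] <= a[i + 1]) ++i;   // climb the increasing run
        while (i + 1 < n && a[i] >= a[i + 1]) ++i;   // descend the decreasing run
        return i == n - 1;                           // bitonic iff all consumed
    }

For example, isBitonic returns true on ⟨1, 3, 5, 7, 6, 4, 2, 0⟩ but false on ⟨1, 3, 2, 4⟩.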

The Bitonic Sort Algorithm (as illustrated in Fig. 1):

1. Let \( s = \left\langle {a_{0} , a_{1} , \ldots , a_{n - 1} } \right\rangle \) be a bitonic sequence such that

   (a) \( a_{0} \le a_{1} \le \cdots \le a_{n/2 - 1} \), and

   (b) \( a_{n/2} \ge a_{n/2 + 1} \ge \cdots \ge a_{n - 1} \).

2. Consider the following subsequences of s:

   (a) \( s_{1} = \left\langle {\min \left( {a_{0} ,a_{n/2} } \right), \min \left( {a_{1} ,a_{n/2 + 1} } \right), \ldots , \min \left( {a_{n/2 - 1} ,a_{n - 1} } \right)} \right\rangle \)

   (b) \( s_{2} = \left\langle {\max \left( {a_{0} ,a_{n/2} } \right), \max \left( {a_{1} ,a_{n/2 + 1} } \right), \ldots , \max \left( {a_{n/2 - 1} ,a_{n - 1} } \right)} \right\rangle \)

3. Sequence properties:

   (a) \( s_{1} \) and \( s_{2} \) are both bitonic, and

   (b) \( \forall x \in s_{1} , \forall y \in s_{2} : x \le y \).

4. Apply steps (1)–(3) recursively on \( s_{1} \) and \( s_{2} \) to produce a sorted sequence.

5. The procedure works for any bitonic sequence, even if \( \left| {s_{1} } \right| \ne \left| {s_{2} } \right| \).

Given an unordered sequence of size \( 2n \), exactly \( \log_{2} 2n \) stages of merging are required to produce a completely ordered list (Fig. 3).

Fig. 3 Representation of bitonic sort
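The compare-exchange step at the heart of bitonic sort maps directly onto CUDA. The sketch below is an assumed implementation, not the authors' exact code: the kernel and variable names are illustrative, and \( n \) is taken to be a power of two. The kernel performs one compare-exchange step; the host loop runs the \( \log_{2} n \) stages, each with its sequence of steps, matching the structure of Fig. 3.

    #include <cuda_runtime.h>

    // One compare-exchange step of bitonic sort (illustrative sketch).
    // k is the current stage (subsequence) size, j the compare distance.
    __global__ void bitonicStep(int *d_a, unsigned int j, unsigned int k,
                                unsigned int n)
    {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned int ixj = i ^ j;            // index of the partner element
        if (ixj > i) {
            // Bit (i & k) selects the direction of the current subsequence:
            // ascending when the bit is 0, descending when it is 1.
            bool ascending = ((i & k) == 0);
            if ((ascending  && d_a[i] > d_a[ixj]) ||
                (!ascending && d_a[i] < d_a[ixj])) {
                int tmp  = d_a[i];           // swap the out-of-order pair
                d_a[i]   = d_a[ixj];
                d_a[ixj] = tmp;
            }
        }
    }

    // Host-side driver: n must be a power of two.
    void bitonicSort(int *d_a, unsigned int n)
    {
        unsigned int threads = 256;
        unsigned int blocks  = (n + threads - 1) / threads;
        for (unsigned int k = 2; k <= n; k <<= 1)         // stage: sequence size
            for (unsigned int j = k >> 1; j > 0; j >>= 1) // step: compare distance
                bitonicStep<<<blocks, threads>>>(d_a, j, k, n);
        cudaDeviceSynchronize();
    }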

1.3 Prefix Sum (Scan)

The prefix sum of a sequence of numbers \( x_{0} , x_{1} , x_{2} , \ldots , x_{n} \) is another sequence of numbers \( y_{0} , y_{1} , y_{2} , \ldots , y_{n} \) given by:

$$ y_{0} = x_{0} ; \quad y_{1} = x_{0} \oplus x_{1} ; \quad y_{2} = x_{0} \oplus x_{1} \oplus x_{2} ; \; \ldots $$

The general mathematical representation of the inclusive prefix sum, computing the output values in sequential order, is:

$$ \begin{aligned} & y_{i} = x_{i} ; \quad i = 0 \\ & y_{i} = y_{i - 1} \oplus x_{i} ; \quad i > 0 \\ \end{aligned} $$
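As a point of reference, this recurrence translates directly into a sequential loop. A minimal sketch, with \( \oplus \) taken as addition and illustrative names (assumes \( n \ge 1 \)):

    // Sequential inclusive scan: y[i] = x[0] + x[1] + ... + x[i].
    void inclusiveScan(const int *x, int *y, int n)
    {
        y[0] = x[0];                 // y_i = x_i for i = 0
        for (int i = 1; i < n; ++i)
            y[i] = y[i - 1] + x[i];  // y_i = y_{i-1} + x_i for i > 0
    }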

The given data set is divided into blocks of fixed size, and the scan is performed independently on each block. The result obtained is not yet the final result: it is a temporary array, because the second block does not account for the elements of the first block, and similarly for the blocks that follow. For an inclusive scan, the sum of each block is the last element of that block. The concept is to first extract the last element of each block, which gives the individual block sum, and store these sums in a separate array, say Y, at the corresponding positions, i.e., the sum of block 0 at index 0 and the sum of block 1 at index 1. The final result is obtained by adding the elements of array Y to the corresponding temporary array elements, using the block and thread indices. For instance, block 0 already contains its final result, so no computation is needed there. For block 1, Y[0] (which is just the block sum of block 0) is added to every element of block 1 in the temporary array. For block 2, Y[0] + Y[1], the combined sums of blocks 0 and 1, is added to every element, as shown in Fig. 4, and similarly for further blocks. In other words, each block adds the exclusive prefix sum of Y at its block index. Finally, the resultant array is obtained.

Fig. 4 Block-wise data parallel scan operation
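A CUDA sketch of this block-wise scan is given below. It is an assumed implementation, not the authors' exact code: the kernel names and the fixed block size are illustrative, and for brevity the per-block scan uses the simple Hillis–Steele formulation rather than the work-efficient variant referred to in Sect. 3. The first kernel scans each block in shared memory and writes each block sum into the array Y (d_sums here); after Y has itself been scanned exclusively (for example by applying the first kernel again, or on the host when the number of blocks is small), the second kernel adds each block's offset to all of its elements.

    #include <cuda_runtime.h>

    #define BLOCK 256   // fixed block size; launch scanBlocks with BLOCK threads

    // Kernel 1: inclusive scan of each block in shared memory; the last
    // element of each block (the block sum) is also written to d_sums (Y).
    __global__ void scanBlocks(const int *d_in, int *d_out, int *d_sums, int n)
    {
        __shared__ int temp[BLOCK];
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int tid = threadIdx.x;
        temp[tid] = (gid < n) ? d_in[gid] : 0;   // pad the last block with zeros
        __syncthreads();

        // Hillis-Steele scan: log2(BLOCK) doubling steps.
        for (int offset = 1; offset < blockDim.x; offset <<= 1) {
            int val = (tid >= offset) ? temp[tid - offset] : 0;
            __syncthreads();                     // all reads before any writes
            temp[tid] += val;
            __syncthreads();
        }
        if (gid < n) d_out[gid] = temp[tid];
        if (tid == blockDim.x - 1)
            d_sums[blockIdx.x] = temp[tid];      // block sum -> Y[blockIdx.x]
    }

    // Kernel 2: d_offsets[b] holds Y[0] + ... + Y[b-1] (exclusive scan of Y);
    // add that offset to every element of block b to finalize the result.
    __global__ void addOffsets(int *d_out, const int *d_offsets, int n)
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < n) d_out[gid] += d_offsets[blockIdx.x];
    }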

2 Related Work

Job/process scheduling is one of the important tasks performed by the operating system (OS), and the performance of the OS depends on its CPU scheduling algorithms. Recently, several improved CPU scheduling algorithms [2, 9, 10] have been introduced to improve system performance. Scheduling is carried out very frequently by the operating system and is a complex job that may require repetitive computation. State-of-the-art performance has been achieved by implementing many generic algorithms [1, 3] on the GPU. This motivates implementing scheduling algorithms on the GPU and analyzing their performance.

3 Implementation

Algorithm 1 provides the steps carried out to implement the parallel non-preemptive scheduling algorithm.

Algorithm 1 (code figures a–c)

To form a bitonic sequence from a random input, we start by building four-element bitonic sequences from consecutive two-element sequences. Consider four elements in sequence, \( x_{0} , x_{1} , x_{2} , x_{3} \). We sort \( x_{0} \) and \( x_{1} \) in ascending order and \( x_{2} \) and \( x_{3} \) in descending order; concatenating the two pairs gives a four-element bitonic sequence. Next, we take two four-element bitonic sequences, sorting one in ascending order and the other in descending order (using the bitonic merge discussed in Sect. 1.2), and so on, until we obtain a full-length bitonic sequence, as shown in Fig. 3. The nested stage and step loops of the driver sketched after Fig. 3 implement exactly this construction.


The experiment analyzes the performance of the Priority-Based Scheduling (PBS) algorithm. A large number of processes is taken, each with parameters such as burst time, arrival time, and priority. The processes are sorted by priority first, and ties between processes with the same priority are then broken by burst time. Scheduling places the process with the highest priority and the lowest burst time first (the highest priority corresponds to the lowest number). The input elements are sorted using bitonic sort. The block-wise work-efficient scan operation is then performed to obtain the waiting times and turnaround times, as shown in Fig. 4. Finally, a data parallel reduction sums the waiting times and turnaround times to compute the average waiting time and average turnaround time. The scan algorithm iterates \( \log \left( n \right) \) times.
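The pipeline above can be summarized in code. The sketch below is an assumed illustration, not the authors' implementation: it uses Thrust primitives in place of the hand-written bitonic sort and block-wise scan kernels, all names and the toy input are hypothetical, and every arrival time is taken to be zero, so that each waiting time is the exclusive prefix sum of the burst times of the processes scheduled earlier, and each turnaround time is the corresponding inclusive prefix sum.

    #include <cstdio>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/scan.h>
    #include <thrust/reduce.h>
    #include <thrust/transform.h>

    struct Proc { int priority, burst; };        // hypothetical process record

    // Lowest priority number first; ties broken by shorter burst time.
    struct ByPriorityThenBurst {
        __host__ __device__ bool operator()(const Proc &a, const Proc &b) const {
            return (a.priority != b.priority) ? a.priority < b.priority
                                              : a.burst < b.burst;
        }
    };

    struct GetBurst {
        __host__ __device__ int operator()(const Proc &p) const { return p.burst; }
    };

    int main()
    {
        Proc h[] = {{2, 5}, {1, 3}, {3, 8}, {1, 2}, {2, 1}};  // toy input
        int n = sizeof(h) / sizeof(h[0]);
        thrust::device_vector<Proc> procs(h, h + n);

        // Step 1: sort by (priority, burst) -- stands in for the bitonic sort.
        thrust::sort(procs.begin(), procs.end(), ByPriorityThenBurst());

        // Step 2: scans -- stand in for the block-wise scan kernels.
        thrust::device_vector<int> burst(n), waiting(n), turnaround(n);
        thrust::transform(procs.begin(), procs.end(), burst.begin(), GetBurst());
        thrust::exclusive_scan(burst.begin(), burst.end(), waiting.begin());
        thrust::inclusive_scan(burst.begin(), burst.end(), turnaround.begin());

        // Step 3: parallel reductions for the averages.
        long long w = thrust::reduce(waiting.begin(), waiting.end(), 0LL);
        long long t = thrust::reduce(turnaround.begin(), turnaround.end(), 0LL);
        printf("avg waiting = %.2f, avg turnaround = %.2f\n",
               (double)w / n, (double)t / n);
        return 0;
    }

With this toy input, the sorted order is (1, 2), (1, 3), (2, 1), (2, 5), (3, 8), giving waiting times 0, 2, 5, 6, 11 and turnaround times 2, 5, 6, 11, 19, so the program prints avg waiting = 4.80 and avg turnaround = 8.60.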

4 Results

The proposed method uses parallel bitonic sort, which runs in \( O\left( {\left( {\log N} \right)^{2} } \right) \) time, making it roughly \( N \) times faster than the \( O\left( {N\left( {\log N} \right)^{2} } \right) \) serial variant. A Param Shavak system with a Kepler GTX GPU card is used to evaluate the proposed method. Screenshots of execution are shown in Figs. 5 and 6, which depict the time taken to execute the serial and parallel scheduling code, respectively. Figure 7 shows a graphical representation of the same results. The parallel code executes roughly 10–15 times faster than the serial code.

Fig. 5 Screenshot of serial execution of the PBS

Fig. 6 Screenshot of parallel execution of the proposed method

Fig. 7 Graphical representation of the comparison of serial and parallel scheduling algorithms

5 Conclusion

This paper focuses on accelerating a priority-based scheduling algorithm using bitonic sort and the prefix scan, with all stages implemented in parallel and executed on the GPU to enable faster execution.

The entire task is broadly divided into two sub-tasks:

1. Sorting the processes by priority first and then by burst time (execution time), using the parallel bitonic sort code and launching multiple kernels.

2. Implementing Priority-Based Scheduling using the parallel prefix scan code, which in turn has two stages:

   (a) The data, divided into blocks, are first put through an initial prefix scan, which produces block-wise scanned results.

   (b) In the second stage, each block accounts for all the elements in the blocks preceding it.

The prefix scan is likewise performed by launching multiple kernels.

It is observed that as the number of processes to be scheduled increases, the time taken by the CPU to schedule them increases drastically. When the number of processes scheduled on the GPU increases, however, the scheduling time grows by a comparatively small amount.

In other words, the proposed parallel implementation is 10–15 times faster than the serial implementation. Future work involves analyzing the proposed method for preemptive scheduling algorithms.