
1 Introduction

The modern GPU is a many-core processor that supports the concurrent execution of thousands of threads. A GPU comprises a series of streaming multiprocessors, each containing hundreds of cores arranged to facilitate the single instruction, multiple threads (SIMT) programming model.

A graphics processing unit (GPU) is highly efficient at computer graphics manipulation and image processing. Its highly parallel structure makes it well suited to general-purpose computation and to accelerating traditionally CPU-based computational tasks. Recently, general-purpose computation on graphics processing units (GPGPU) has been adopted to accelerate many algorithms, such as sorting [11], graph algorithms [12], data encryption algorithms [5], and other generic algorithms [13]. Several libraries and programming environments are available that allow programmers to perform GPGPU and exploit the computational power of the GPU. The Compute Unified Device Architecture (CUDA) [8] framework, provided by NVIDIA, is one such programming environment; it allows programmers to exploit the data parallelism of the GPU using a C-style programming language.

Scheduling is the method by which specified work is assigned to the resources that complete it. Job/process scheduling is the process of arranging, controlling, and optimizing the allocation of system resources to threads, processes, and data flows for maximum utilization. The operating system schedules processes for execution using several scheduling algorithms based on various criteria such as CPU utilization, throughput, turnaround time, waiting time, and priority. By properly scheduling processes, system resources can be utilized equitably and effectively, achieving a target quality of service. The major CPU/job scheduling algorithms are First Come First Serve (FCFS), Shortest Job First (SJF), Round Robin (RR), and Priority-Based Scheduling (PBS). Scheduling is required for multitasking and multiplexing. It is a complex job requiring extensive processing, which is better performed on a parallel platform.

In this paper, a novel parallel approach to process scheduling using CUDA is proposed to enhance the performance of scheduling algorithms. A combination of well-established data parallel algorithms and parallel sorting techniques is adopted to achieve a substantial reduction in the execution time of scheduling algorithms (Fig. 1).

Fig. 1 Example of bitonic parallel sort algorithm

1.1 Data Parallel Algorithms

In data parallel algorithms, parallelism comes from simultaneous operations across large sets of data, rather than from multiple threads of control [4]. Prefix sum, or scan, is one of the most common data parallel algorithms. The prefix sum (also called cumulative sum, inclusive scan, or simply scan) of a sequence of numbers \( x_{0} , x_{1} , x_{2} , \ldots , x_{n - 1} \) is a second sequence of numbers \( y_{0} , y_{1} , y_{2} , \ldots , y_{n - 1} \), the sums of prefixes (running totals) of the input sequence, where \( y_{0} = x_{0} ;\; y_{1} = x_{0} \oplus x_{1} ;\; y_{2} = x_{0} \oplus x_{1} \oplus x_{2} ; \ldots \), as shown in Fig. 2. Mathematically, \( y_{i} = x_{0} \oplus x_{1} \oplus x_{2} \oplus \cdots \oplus x_{i} \;\; \forall i \in \{ 0, \ldots , n - 1 \} \). For example, with \( \oplus = + \), the input \( (3, 1, 7, 0) \) yields \( (3, 4, 11, 11) \). Figure 2 depicts the pictorial representation of the scan data parallel algorithm [6].

Fig. 2 Prefix sum data parallel algorithm

1.2 Bitonic Sort

There are multiple sorting algorithms available, such as Merge sort, Quick sort, Radix sort, and Heap sort. Bitonic sort is one of the sorting algorithms that can exploit GPU computing power and is efficient in terms of both space and time complexity [7].

Bitonic Sequence: A sequence is called bitonic if it first increases and then decreases. In other words, an array arr[0..n − 1] is bitonic if there exists an index i, \( 0 \le i \le n - 1 \), such that \( a_{0} \le a_{1} \le \cdots \le a_{i} \) and \( a_{i} \ge a_{i + 1} \ge \cdots \ge a_{n - 1} \). The bitonic sort algorithm below uses the balanced case \( i = n/2 - 1 \):

$$ a_{0} \le a_{1} \le \cdots \le a_{n/2 - 1} \quad \text{and} \quad a_{n/2} \ge a_{n/2 + 1} \ge \cdots \ge a_{n - 1} $$
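A direct check of this definition can make it concrete. The following is a minimal C-style sketch (the function name is an assumption for illustration, and rotated bitonic sequences are ignored, matching the definition above):

    // Returns true if a[0..n-1] first rises (non-strictly) and then falls,
    // i.e., there exists i with a[0] <= ... <= a[i] >= ... >= a[n-1].
    bool isBitonic(const int *a, int n)
    {
        int i = 0;
        while (i + 1 < n && a[i] <= a[i + 1]) ++i;   // climb the increasing run
        while (i + 1 < n && a[i] >= a[i + 1]) ++i;   // descend the decreasing run
        return i == n - 1;                           // bitonic iff all consumed
    }

For example, isBitonic returns true on ⟨1, 3, 5, 7, 6, 4, 2, 0⟩ but false on ⟨1, 3, 2, 4⟩.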

The Bitonic Sort Algorithm (as illustrated in Fig. 1):

1. Let \( s = \left\langle {a_{0} , a_{1} , \ldots , a_{n - 1} } \right\rangle \) be a bitonic sequence such that

   (a) \( a_{0} \le a_{1} \le \cdots \le a_{n/2 - 1} \), and

   (b) \( a_{n/2} \ge a_{n/2 + 1} \ge \cdots \ge a_{n - 1} \).

2. Consider the following subsequences of s:

   (a) \( s_{1} = \left\langle {\min \left( {a_{0} ,a_{n/2} } \right), \min \left( {a_{1} ,a_{n/2 + 1} } \right), \ldots , \min \left( {a_{n/2 - 1} ,a_{n - 1} } \right)} \right\rangle \)

   (b) \( s_{2} = \left\langle {\max \left( {a_{0} ,a_{n/2} } \right), \max \left( {a_{1} ,a_{n/2 + 1} } \right), \ldots , \max \left( {a_{n/2 - 1} ,a_{n - 1} } \right)} \right\rangle \)

3. Sequence properties:

   (a) \( s_{1} \) and \( s_{2} \) are both bitonic, and

   (b) \( \forall x \in s_{1} , \forall y \in s_{2} : x \le y \).

4. Apply steps (1)–(3) recursively on \( s_{1} \) and \( s_{2} \) to produce a sorted sequence.

5. The procedure works for any bitonic sequence, even if \( \left| {s_{1} } \right| \ne \left| {s_{2} } \right| \).

Given an unordered sequence of size \( 2n \), exactly \( \log_{2} 2n \) stages of merging are required to produce a completely ordered list (Fig. 3).

Fig. 3 Representation of bitonic sort
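The compare-exchange step at the heart of bitonic sort maps directly onto CUDA. The sketch below is an assumed implementation, not the authors' exact code: the kernel and variable names are illustrative, and \( n \) is taken to be a power of two. The kernel performs one compare-exchange step; the host loop runs the \( \log_{2} n \) stages, each with its sequence of steps, matching the structure of Fig. 3.

    #include <cuda_runtime.h>

    // One compare-exchange step of bitonic sort (illustrative sketch).
    // k is the current stage (subsequence) size, j the compare distance.
    __global__ void bitonicStep(int *d_a, unsigned int j, unsigned int k,
                                unsigned int n)
    {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned int ixj = i ^ j;            // index of the partner element
        if (ixj > i) {
            // Bit (i & k) selects the direction of the current subsequence:
            // ascending when the bit is 0, descending when it is 1.
            bool ascending = ((i & k) == 0);
            if ((ascending  && d_a[i] > d_a[ixj]) ||
                (!ascending && d_a[i] < d_a[ixj])) {
                int tmp  = d_a[i];           // swap the out-of-order pair
                d_a[i]   = d_a[ixj];
                d_a[ixj] = tmp;
            }
        }
    }

    // Host-side driver: n must be a power of two.
    void bitonicSort(int *d_a, unsigned int n)
    {
        unsigned int threads = 256;
        unsigned int blocks  = (n + threads - 1) / threads;
        for (unsigned int k = 2; k <= n; k <<= 1)         // stage: sequence size
            for (unsigned int j = k >> 1; j > 0; j >>= 1) // step: compare distance
                bitonicStep<<<blocks, threads>>>(d_a, j, k, n);
        cudaDeviceSynchronize();
    }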

1.3 Prefix Sum (Scan)

The prefix sum of a sequence of numbers \( x_{0} , x_{1} , x_{2} , \ldots , x_{n} \) is another sequence of numbers \( y_{0} , y_{1} , y_{2} , \ldots , y_{n} \) given by:

$$ y_{0} = x_{0} ; \quad y_{1} = x_{0} \oplus x_{1} ; \quad y_{2} = x_{0} \oplus x_{1} \oplus x_{2} ; \; \ldots $$

The general mathematical representation of the inclusive prefix sum, computing the output values in sequential order, is:

$$ \begin{aligned} & y_{i} = x_{i} ; \quad i = 0 \\ & y_{i} = y_{i - 1} \oplus x_{i} ; \quad i > 0 \\ \end{aligned} $$
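As a point of reference, this recurrence translates directly into a sequential loop. A minimal sketch, with \( \oplus \) taken as addition and illustrative names (assumes \( n \ge 1 \)):

    // Sequential inclusive scan: y[i] = x[0] + x[1] + ... + x[i].
    void inclusiveScan(const int *x, int *y, int n)
    {
        y[0] = x[0];                 // y_i = x_i for i = 0
        for (int i = 1; i < n; ++i)
            y[i] = y[i - 1] + x[i];  // y_i = y_{i-1} + x_i for i > 0
    }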

The given data set is divided into blocks of fixed size, and the scan is performed independently on each block. The result obtained is not yet the final result: it is a temporary array, because the second block does not account for the elements of the first block, and similarly for the blocks that follow. For an inclusive scan, the sum of each block is the last element of that block. The concept is to first extract the last element of each block, which gives the individual block sum, and store these sums in a separate array, say Y, at the corresponding positions, i.e., the sum of block 0 at index 0 and the sum of block 1 at index 1. The final result is obtained by adding the elements of array Y to the corresponding temporary array elements, using the block and thread indices. For instance, block 0 already contains its final result, so no computation is needed there. For block 1, Y[0] (which is just the block sum of block 0) is added to every element of block 1 in the temporary array. For block 2, Y[0] + Y[1], the combined sums of blocks 0 and 1, is added to every element, as shown in Fig. 4, and similarly for further blocks. In other words, each block adds the exclusive prefix sum of Y at its block index. Finally, the resultant array is obtained.

Fig. 4 Block-wise data parallel scan operation
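A CUDA sketch of this block-wise scan is given below. It is an assumed implementation, not the authors' exact code: the kernel names and the fixed block size are illustrative, and for brevity the per-block scan uses the simple Hillis–Steele formulation rather than the work-efficient variant referred to in Sect. 3. The first kernel scans each block in shared memory and writes each block sum into the array Y (d_sums here); after Y has itself been scanned exclusively (for example by applying the first kernel again, or on the host when the number of blocks is small), the second kernel adds each block's offset to all of its elements.

    #include <cuda_runtime.h>

    #define BLOCK 256   // fixed block size; launch scanBlocks with BLOCK threads

    // Kernel 1: inclusive scan of each block in shared memory; the last
    // element of each block (the block sum) is also written to d_sums (Y).
    __global__ void scanBlocks(const int *d_in, int *d_out, int *d_sums, int n)
    {
        __shared__ int temp[BLOCK];
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int tid = threadIdx.x;
        temp[tid] = (gid < n) ? d_in[gid] : 0;   // pad the last block with zeros
        __syncthreads();

        // Hillis-Steele scan: log2(BLOCK) doubling steps.
        for (int offset = 1; offset < blockDim.x; offset <<= 1) {
            int val = (tid >= offset) ? temp[tid - offset] : 0;
            __syncthreads();                     // all reads before any writes
            temp[tid] += val;
            __syncthreads();
        }
        if (gid < n) d_out[gid] = temp[tid];
        if (tid == blockDim.x - 1)
            d_sums[blockIdx.x] = temp[tid];      // block sum -> Y[blockIdx.x]
    }

    // Kernel 2: d_offsets[b] holds Y[0] + ... + Y[b-1] (exclusive scan of Y);
    // add that offset to every element of block b to finalize the result.
    __global__ void addOffsets(int *d_out, const int *d_offsets, int n)
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < n) d_out[gid] += d_offsets[blockIdx.x];
    }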

2 Related Work

Job/process scheduling is one of the important tasks performed by the operating system (OS), and the performance of the OS depends on its CPU scheduling algorithms. Recently, several improved CPU scheduling algorithms [2, 9, 10] have been introduced to improve system performance. Scheduling is carried out very frequently by the operating system and is a complex job that may require repetitive computation. State-of-the-art performance has been achieved by implementing many generic algorithms [1, 3] on the GPU. This motivates implementing scheduling algorithms on the GPU and analyzing their performance.

3 Implementation

Algorithm 1 provides the steps carried out to implement the parallel non-preemptive scheduling algorithm.

Algorithm 1 (code figures a–c)

To form a bitonic sequence from a random input, we start by building four-element bitonic sequences from consecutive two-element sequences. Consider four elements in sequence, \( x_{0} , x_{1} , x_{2} , x_{3} \). We sort \( x_{0} \) and \( x_{1} \) in ascending order and \( x_{2} \) and \( x_{3} \) in descending order; concatenating the two pairs gives a four-element bitonic sequence. Next, we take two four-element bitonic sequences, sorting one in ascending order and the other in descending order (using the bitonic merge discussed in Sect. 1.2), and so on, until we obtain a full-length bitonic sequence, as shown in Fig. 3. The nested stage and step loops of the driver sketched after Fig. 3 implement exactly this construction.


The experiment analyzes the performance of the Priority-Based Scheduling (PBS) algorithm. A large number of processes is taken, each with parameters such as burst time, arrival time, and priority. The processes are sorted by priority first, and ties between processes with the same priority are then broken by burst time. Scheduling places the process with the highest priority and the lowest burst time first (the highest priority corresponds to the lowest number). The input elements are sorted using bitonic sort. The block-wise work-efficient scan operation is then performed to obtain the waiting times and turnaround times, as shown in Fig. 4. Finally, a data parallel reduction sums the waiting times and turnaround times to compute the average waiting time and average turnaround time. The scan algorithm iterates \( \log \left( n \right) \) times.
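The pipeline above can be summarized in code. The sketch below is an assumed illustration, not the authors' implementation: it uses Thrust primitives in place of the hand-written bitonic sort and block-wise scan kernels, all names and the toy input are hypothetical, and every arrival time is taken to be zero, so that each waiting time is the exclusive prefix sum of the burst times of the processes scheduled earlier, and each turnaround time is the corresponding inclusive prefix sum.

    #include <cstdio>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/scan.h>
    #include <thrust/reduce.h>
    #include <thrust/transform.h>

    struct Proc { int priority, burst; };        // hypothetical process record

    // Lowest priority number first; ties broken by shorter burst time.
    struct ByPriorityThenBurst {
        __host__ __device__ bool operator()(const Proc &a, const Proc &b) const {
            return (a.priority != b.priority) ? a.priority < b.priority
                                              : a.burst < b.burst;
        }
    };

    struct GetBurst {
        __host__ __device__ int operator()(const Proc &p) const { return p.burst; }
    };

    int main()
    {
        Proc h[] = {{2, 5}, {1, 3}, {3, 8}, {1, 2}, {2, 1}};  // toy input
        int n = sizeof(h) / sizeof(h[0]);
        thrust::device_vector<Proc> procs(h, h + n);

        // Step 1: sort by (priority, burst) -- stands in for the bitonic sort.
        thrust::sort(procs.begin(), procs.end(), ByPriorityThenBurst());

        // Step 2: scans -- stand in for the block-wise scan kernels.
        thrust::device_vector<int> burst(n), waiting(n), turnaround(n);
        thrust::transform(procs.begin(), procs.end(), burst.begin(), GetBurst());
        thrust::exclusive_scan(burst.begin(), burst.end(), waiting.begin());
        thrust::inclusive_scan(burst.begin(), burst.end(), turnaround.begin());

        // Step 3: parallel reductions for the averages.
        long long w = thrust::reduce(waiting.begin(), waiting.end(), 0LL);
        long long t = thrust::reduce(turnaround.begin(), turnaround.end(), 0LL);
        printf("avg waiting = %.2f, avg turnaround = %.2f\n",
               (double)w / n, (double)t / n);
        return 0;
    }

With this toy input, the sorted order is (1, 2), (1, 3), (2, 1), (2, 5), (3, 8), giving waiting times 0, 2, 5, 6, 11 and turnaround times 2, 5, 6, 11, 19, so the program prints avg waiting = 4.80 and avg turnaround = 8.60.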

4 Results

The proposed method uses parallel bitonic sort, which runs in \( O\left( {\left( {\log N} \right)^{2} } \right) \) time, making it roughly \( N \) times faster than the \( O\left( {N\left( {\log N} \right)^{2} } \right) \) serial variant. A Param Shavak system with a Kepler GTX GPU card is used to evaluate the proposed method. Screenshots of execution are shown in Figs. 5 and 6, which depict the time taken to execute the serial and parallel scheduling code, respectively. Figure 7 shows a graphical representation of the same results. The parallel code executes roughly 10–15 times faster than the serial code.

Fig. 5 Screenshot of serial execution of the PBS

Fig. 6 Screenshot of parallel execution of the proposed method

Fig. 7 Graphical representation of the comparison of serial and parallel scheduling algorithms

5 Conclusion

This paper focuses on accelerating a priority-based scheduling algorithm using bitonic sort and the prefix scan, with all stages implemented in parallel and executed on the GPU to enable faster execution.

The entire task is broadly divided into two sub-tasks:

1. Sorting the processes by priority first and then by burst time (execution time), using the parallel bitonic sort code and launching multiple kernels.

2. Implementing Priority-Based Scheduling using the parallel prefix scan code, which in turn has two stages:

   (a) The data, divided into blocks, are first put through an initial prefix scan, which produces block-wise scanned results.

   (b) In the second stage, each block accounts for all the elements in the blocks preceding it.

The prefix scan is likewise performed by launching multiple kernels.

It is observed that as the number of processes to be scheduled increases, the time taken by the CPU to schedule them increases drastically. When the number of processes scheduled on the GPU increases, however, the scheduling time grows by a comparatively small amount.

In other words, the proposed parallel implementation is 10–15 times faster than the serial implementation. Future work involves analyzing the proposed method for preemptive scheduling algorithms.