
Relevant core courses:

CS1, Operating Systems, Computer Architecture

Relevant PDC topics:

speedup (C), efficiency (C), Amdahl's Law (A), space vs. time (C), power vs. time (C), synchronization and communication (C), task granularity (A), scheduling and mapping on multicore (A), load balancing (A), trade-offs in performance and power (C), Analysis and Evaluation: linear and super-linear speedup (C), latency and bandwidth trade-offs, data locality, SMP (C), NUMA (C), strong and weak scaling (C) (Bloom classification in parentheses)

Context for use:

CS1 fundamentals, operating system thread scheduling, parallel architecture performance evaluation

Learning outcomes:
  • list and define parallel performance metrics: speedup, efficiency, linear speedup, super linear speedup, latency and bandwidth

  • describe the implications of Amdahl’s law on parallel performance

  • recognize the use of parallelism to achieve strong scaling and weak scaling

  • analyze the effects of load imbalances on performance and power

  • apply techniques to balance load across threads or processes

  • explain the need for inter-thread synchronization and communication

  • apply techniques to pin and schedule threads on multicore systems for improved performance

  • describe how cores share memory resources, such as DRAM and cache

  • recognize the importance of exploiting data locality in parallel applications

Introduction

This chapter introduces three teaching modules centered on parallel performance concepts. Performance-related topics embody many fundamental ideas in parallel computing. In the ACM/IEEE 2013 curricular guidelines (ACM2013), an entire knowledge unit is devoted to parallel performance [1, 2]. In addition, performance topics pervade every knowledge unit within PDC and can be found across other knowledge areas, including Algorithms, Architecture, and Systems Fundamentals.

The three modules presented in this chapter cover a range of parallel performance topics. Since power savings have become an important consideration on everything from hand-held devices to supercomputers, energy efficiency is also emphasized in each module. The topics provide at least 3.5 h of Core-Tier 1, Core-Tier 2, and Elective coverage from ACM2013. The modules are designed to be introduced in CS1 and in two upper-level electives, namely Operating Systems and Computer Architecture. They are, however, designed with enough flexibility to enable adoption in a number of undergraduate courses at various levels.

The modules focus on architectural and algorithmic issues rather than on programming aspects. They are constructed to illustrate parallel performance issues primarily through code examples and experimental studies. This approach makes the modules accessible to students who do not yet have a strong background in parallel programming. Thus, the target audience for this chapter is instructors who are teaching CS1, with or without parallel programming, as well as instructors who are teaching upper-level electives where students may already have taken a semester of parallel programming.

Elementary Concepts

This module is designed to introduce fundamental concepts in parallel computing in a CS1 course. The concepts are illustrated with no particular binding to any programming language and therefore can be introduced in different flavors of CS1 courses.

  • Recommended Length 1 lecture (1 h 15 min)

  • Recommended Course CS1, CS2

Organization and Content

The major topics in this module include (i) an overview of parallel computation on a multicore processor, (ii) data dependence and the need for synchronization in parallel programs, (iii) parallel performance and Amdahl's law and (iv) energy-efficient computing. The topics are introduced through lecture slides, an in-class activity, code examples and a program demo. The following subsections describe how these topics are explained and the order in which they are introduced.

Parallelism in Real Life

The module begins with an in-class activity that engages the students and demonstrates the benefits of parallelism. An activity that works quite well with CS freshmen is a live simulation of the word search problem, where students act as processing threads. In this activity, the class is split into k groups. Each group is assigned the task of finding a collection of words in a book and reporting the page numbers where the words occur. Each group gets a copy of the book, but the copies are sectioned into different-sized segments. Thus, one group might get the entire book in one chunk while another may be assigned one page per group member. The students are then asked to find an efficient method of solving the problem with the resources they are given. Naturally, the teams with fewer pages per student (thread) are likely to get to the results first. However, care must be taken in selecting the words and their positions and in segmenting the text.

Parallel Computing and Its Importance Today

Following the in-class example, a set of lecture slides defines parallel computing and discusses its importance in today's world. A high-level definition of a parallel computer is presented. Student familiarity with the basic von Neumann architecture is assumed (not an unrealistic expectation for CS1 students). The discussion of the definition of a parallel computer is followed by some history of parallel computing. The point is made that parallel computing has been around for a long time, ever since the beginning of computing; nevertheless, it has only become mainstream in the last decade. Brief descriptions of mainframes, vector computers and clusters are presented. This is followed by a discussion of the multicore computers of today. The importance of energy efficiency, the role it has played in the evolution of computer chips, and how it has given rise to multicore systems are discussed. The lecture slides emphasize the need to achieve higher performance at lower power consumption or within specified power budgets. The ubiquity of parallel computers is also discussed. Students are asked to guess/comment on the number of processing cores in their smartphones and tablets. Their guesses are then validated against actual numbers. A discussion follows on the need for more parallel processing cores.

Sequential vs. Parallel Program Execution

A major portion of the module is spent introducing the student to the fundamental difference between sequential and parallel program execution. A walk-through example is used for this purpose. Figure 1 shows a subset of the slides that are used to explain this topic. The slides are accompanied by a set of examples written in SimPar [3]. Two such examples are shown in Figs. 2 and 3. SimPar is a simple macro language that uses an intuitive pragma-based syntax. Since students are generally not expected to be familiar with any parallel programming language in CS1, SimPar is an effective tool for discussing parallelism with real examples without getting bogged down in syntax minutiae. SimPar contains only one kind of parallel statement, a directive in the form of #PARALLEL { …}, which specifies that all high-level statements enclosed in the subsequent block will be executed concurrently. SimPar processes such directives by taking each statement in the block and converting it into a Pthread function. The supplementary materials for this chapter include a SimPar parser that can be used to create other simple examples. The instructor should be aware that SimPar is not a realistic parallel language and is very limited in ability; thus, it should not be used for creating extended examples beyond CS1. During the walk-through of the example, students are asked to list the order in which the statements will execute on the processor. A parallel directive is then inserted around the two assignment statements and its meaning is explained to the students. The program is then extended to include array assignments instead of just simple assignments. This program is compiled and executed and the results examined in class. Students are then asked to comment on what other statements could be parallelized. The instructor leads them to an example where the result statement is put in the PARALLEL block along with the two assignment statements. This program is run, potentially several times, and the error demonstrated to the students. The students are then asked to describe the problem in the code. This is followed by a discussion of data dependence and the challenges of parallel programming.

Fig. 1
figure 1

Lecture slides illustrating the differences in serial and parallel program execution. Animation is used for the different blocks in the slideshow

Fig. 2
figure 2

A simple parallel code written in SimPar

Fig. 3
figure 3

Incorrectly parallelized code
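Figures 2 and 3 are reproduced as images; for reference, the following is a plain-text sketch of what the pair of examples might look like, based on the #PARALLEL { … } syntax described above (the variables and values are illustrative, and the code requires the chapter's SimPar parser to run):

    int a, b, result;

    /* Fig. 2 style: the two assignments are independent, so
       executing them concurrently is safe. */
    #PARALLEL {
        a = 10 * 4;
        b = 25 + 5;
    }
    result = a + b;      /* correct: executes after the parallel block */

    /* Fig. 3 style: moving the result statement into the block
       creates a race -- it may read a or b before they are written. */
    #PARALLEL {
        a = 10 * 4;
        b = 25 + 5;
        result = a + b;  /* incorrect: data dependence violated */
    }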

Parallel Programming Tools

Students are told that SimPar is not a real language. The syntax of real parallel languages is more complex, and so is the programming model. Some of the currently available parallel languages and tools, including OpenMP, Pthreads, and MPI, are presented, and the suitability of each is briefly discussed. The slides include example codes for each of these parallel languages. However, students are told they are not expected to learn the syntax at this stage.

Performance Metrics

In this segment of the module, performance issues in parallel computing are reiterated. This is followed by definitions and examples of sequential and parallel performance metrics. A simple parallel search code written in SimPar is used for an in-class demo that shows the differences in the performance metrics. Sequential and parallel (OpenMP) versions of the code are also shown in class. The code is compiled and executed with different data sets, and execution time and energy are measured for each run. A convenient tool for measuring power consumption on Intel processors is Likwid [4], freely available for download. The specific performance metrics and definitions that are discussed include the following (reference formulas are sketched after the list):

  • Execution time

  • Energy

  • Speedup and Greenup

  • Amdahl’s Law

  • Linear speedup

  • Scalability
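For reference, the standard formulas for these metrics, with T_1 the sequential execution time, T_p the execution time on p cores, f the parallelizable fraction of the program, and E_1, E_p the corresponding energy consumptions (the greenup formula is stated by analogy with speedup and should be checked against the lecture slides):

    S(p)   = T_1 / T_p                   (speedup)
    Eff(p) = S(p) / p                    (efficiency)
    S(p)   = p                           (linear speedup; Eff(p) = 1)
    S(p)  <= 1 / ((1 - f) + f / p)       (Amdahl's law upper bound)
    G(p)   = E_1 / E_p                   (greenup)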

Pedagogical Notes

The author has used this module in CS1 courses over three semesters at Texas State University. In all three cases, it was helpful to introduce the module towards the end of the semester, when students are somewhat more confident with the syntax of the sequential language being used in the class.

For the in-class activity, we found that a group size of four, with a section size of two pages per member for the most parallel group, is ideal. Making groups larger leaves the sequential group not as engaged, and more than two pages of dense text makes the example run too long. We also found it helpful to assign some form of reward to the team finishing first; this motivates the teams to be more engaged in the activity. Our experience also showed that it is better to place the stronger and more vocal students in the sequential group: since the activity is framed as a competition and the sequential group is almost certain not to win, putting under-performing students in that group is not advisable.

It is advisable that instructors practice the live coding examples ahead of lecture time. Students often raise questions and suggest alternate approaches, and the instructor should be comfortable enough with the examples to incorporate these suggestions on the fly. The instructor should also take care to use the same system for the demo as the one used for practice, since variations in system configuration can cause some examples to behave unexpectedly.

Sample Exercises

  1.

    Computer A has 4 processors and Computer B has 8 processors. A parallel program P takes 16 s to run on A and 12 s to run on B. Is this the type of performance you would expect from P? Give one explanation as to why P does not achieve more/less performance.

  2.

    Execute simple programs written in SimPar. Compare their performance with performance of sequential versions.

  3.

    Download the C++ implementations of (i) knapsack and (ii) quicksort from http://tues.cs.txstate.edu. Consider the opportunities for parallelism in these two codes. Insert SimPar directives to parallelize the two applications. Execute the parallel applications and compare their performance with the sequential version of the code.

Task Orchestration

This module focuses on performance issues related to the communication and synchronization of parallel applications. It is intended to be introduced in the Operating Systems course, which provides the most context for the material covered.

  • Recommended Length 1.5 lectures (2 h)

  • Recommended Course Operating Systems

Organization and Content

This module begins by introducing students to some fundamental concepts in parallel programming. Notions of data dependence, synchronization, race conditions, load balance, and task granularity are explained. Architecture-specific performance issues, such as those that occur on shared- and distributed-memory parallel computers, are also covered. A producer-consumer application is used as a running example to illustrate various performance issues. Power-performance trade-offs are highlighted in each context.

Data Dependence

After a quick review of parallel computing (two slides, as used in the CS1 module), the module introduces the students to the notion of data dependence in parallel programs. Sequential and parallel versions of a simple function are presented. The example in Fig. 4 computes the area of a circle, but many other examples are possible. The parallel version of the example code is written in SimPar [3].

Fig. 4
figure 4

Code example illustrating data dependence

Students are asked to predict the outcome of the code when executed with certain input. The code is run several times in sequential and parallel mode. The results are discussed and students are asked to comment on the discrepancy. Following this discussion, the annotated code is presented as a slide, highlighting the dependencies in the code. The formal definition of data dependence is then presented, and the various forms of dependence are discussed briefly. The point is made that both sequential and parallel programs must preserve all dependencies in the code for semantically correct execution. For sequential programs this is trivial, since instructions are executed in program order. If students have already taken the Architecture course, the notion of instruction-level parallelism (ILP) can be brought into this discussion; an example can be used to convey that the degree to which ILP can be exploited is determined by the dependencies between the statements in question. Re-ordering transformations performed by compilers can also be discussed to further illustrate the importance of data dependence in semantically correct program execution.
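To make the forms of dependence concrete, here is a minimal C sketch in the spirit of the Fig. 4 example (the actual figure code is not reproduced here; the variable names are illustrative):

    #include <stdio.h>
    #define PI 3.14159f

    int main(void)
    {
        float r, area;
        r = 5.0f;            /* S1 */
        area = PI * r * r;   /* S2: flow (true) dependence on S1 -- reads r */
        r = 2.0f * area;     /* S3: anti-dependence on S2 (S2 must read the
                                old r first) and output dependence on S1
                                (both write r) */
        printf("r=%f area=%f\n", r, area);
        return 0;
    }

Executing S1 and S2 concurrently (for instance, inside a SimPar #PARALLEL block) can compute area from an uninitialized r, which is exactly the kind of discrepancy the in-class runs expose.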

Following this, the running example, a producer-consumer application, is presented. The one shown in Fig. 5 uses the bounded-buffer problem, but many other parallel producer-consumer examples can be created with slight modifications; the supplementary material for this module includes an example based on the knapsack problem. The sequential code is then explained to the class. (Figure 5 omits the actual producer-consumer functions.) The parallel version of the code is then presented. Figure 6 shows the example in OpenMP. The instructor may continue the parallel example in SimPar, but then it cannot be used later in the module for performance experiments, as the results would prove non-intuitive. If an OpenMP example is used, a brief review of OpenMP syntax may be required at this point. Alternatively, this can be handled off-line with the aid of tutorials or handouts, as discussed in section “Pedagogical Notes”. The parallel version of the code is executed several times to produce incorrect results. Again, students are asked to identify the cause of the problem. Class discussion ensues until the dependencies in the code have been identified and clearly articulated.

Fig. 5
figure 5

Sequential version of producer-consumer code

Fig. 6
figure 6

Incorrectly parallelized producer-consumer code

Synchronization

After it has been established that the code in Fig. 6 is producing incorrect results due to a data dependence violation, students are asked whether it is possible to correctly parallelize the code and, if so, what conditions must hold. This discussion leads to the notion of synchronization in parallel programs. The example in Fig. 7 is then constructed in class by editing the example from Fig. 6. This code is compiled and executed several times to show that the code still has not been correctly parallelized. The students are then asked to identify the dependence that caused this problem. This brings up the need for atomic operations, the idea of a critical section and the notion of a race condition. The code is then fixed in class by placing guards around the operations on the flag. This version of the code is shown in Fig. 8 (a sketch of such a guarded version follows the figure). Finally, the code is executed a few times to show that it indeed now produces correct results.

Fig. 7
figure 7

Another incorrectly parallelized producer-consumer code

Fig. 8
figure 8

Correctly parallelized producer-consumer code
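Since Fig. 8 is reproduced as an image, the following is a minimal sketch of what a correctly guarded, one-slot producer-consumer looks like in OpenMP (names such as full and N are illustrative, not taken from the figure; the sections require at least two threads to avoid deadlock):

    #include <stdio.h>
    #include <omp.h>

    #define N 64                   /* items to transfer */

    int main(void)
    {
        int buffer = 0;            /* one-slot bounded buffer */
        int full = 0;              /* flag: 1 when the slot holds an item */

        #pragma omp parallel sections num_threads(2) shared(buffer, full)
        {
            #pragma omp section    /* producer */
            for (int i = 0; i < N; i++) {
                int stored = 0;
                while (!stored) {  /* spin until the slot is empty */
                    #pragma omp critical
                    if (!full) { buffer = i; full = 1; stored = 1; }
                }
            }

            #pragma omp section    /* consumer */
            for (int i = 0; i < N; i++) {
                int item = -1, taken = 0;
                while (!taken) {   /* spin until the slot is full */
                    #pragma omp critical
                    if (full) { item = buffer; full = 0; taken = 1; }
                }
                printf("consumed %d\n", item);
            }
        }
        return 0;
    }

The critical sections make the read-check-update of the flag atomic, which is precisely the guard the in-class edit adds around the flag operations.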

The pragmas are then modified to parallelize the example code in a pipelined fashion. Figure 9 shows a subset of the animated slides that explains pipelined-parallelism, the synchronization interval and its effect on performance.

Fig. 9
figure 9

Lecture slides illustrating pipelined parallelism and the role of synchronization interval on performance

Task Granularity

Task granularity, and how it is controlled by the synchronization interval, is introduced using a set of lecture slides. The impact of task granularity on performance is also explained. Following this, the pipelined-parallel producer-consumer example is revisited. Students are asked to identify the amount of work performed per thread (i.e., the task granularity), expressed as the number of items read from or written to the buffer. The code is then executed with different task granularities by varying the BLOCK parameter in the OpenMP pragma. The results of these executions demonstrate to the students the significance of task granularity and the cost of synchronization for parallel performance.
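One plausible way such a BLOCK parameter controls granularity, assuming it maps to the chunk size in an OpenMP schedule clause (a sketch, not the chapter's actual code):

    #include <stdio.h>
    #include <omp.h>

    #define N     (1 << 22)
    #define BLOCK 64          /* chunk size: vary to change task granularity */

    static double a[N];

    int main(void)
    {
        double t0 = omp_get_wtime();
        /* Each thread grabs BLOCK iterations at a time.  A smaller BLOCK
           means finer-grained tasks and better load balance, but more
           scheduling/synchronization traffic; a larger BLOCK is coarser. */
        #pragma omp parallel for schedule(dynamic, BLOCK)
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i + 1.0;
        printf("BLOCK=%d: %.3f s\n", BLOCK, omp_get_wtime() - t0);
        return 0;
    }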

Load Balancing

OS scheduling is revisited to introduce the concept of load balancing. The basic scheduling algorithm is reviewed and, once again, the running example is used for an in-class demo. In this demo, the program is launched with multiple producers and consumers, and the work is broken unevenly between them. At launch time, Linux's thread-affinity interface is used to pin certain threads to specific cores to illustrate load imbalance. The script to perform this demo is available with the supplementary materials.
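One way such pinning can be done on Linux (a sketch; the chapter's actual script ships with the supplementary materials, and pthread_setaffinity_np() is a GNU extension):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to the given core. */
    static void pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            fprintf(stderr, "failed to pin to core %d\n", core);
    }

    int main(void)
    {
        /* e.g., crowd the demo's heavy threads onto one core to create
           a visible load imbalance */
        pin_to_core(0);
        printf("pinned to core 0\n");
        return 0;
    }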

Pedagogical Notes

Although this module can be introduced in other upper-level courses (e.g., Unix Systems Programming), in our experience it works best in the OS course. A seamless integration is possible if the module is introduced in the OS class during the week when thread scheduling is discussed.

To provide background for OpenMP, a handout can be distributed ahead of time. A sample handout is included with the lecture material. Furthermore, there are several excellent online tutorials. Students can be asked to review one of these before the lecture. The supplementary material contains urls for online tutorials.

To increase student engagement, lecture slides related to load balancing for energy efficiency can be presented interactively as problem sets. The problems can be drawn out on the board or the slides can be animated and students can be asked to come up with a thread mapping solution as a group.

It is advisable that instructors practice the live coding examples ahead of lecture time. Students often raise questions and suggest alternate approaches. The instructor should be fairly comfortable with the examples in order to incorporate their suggestions into the demo.

Sample Exercises

  1.

    Consider the high-level block diagram of a multicore system as shown in the figure above. A multi-threaded producer-consumer application is executing on this system. The application has 4 threads with 2 producers (p0 and p1) and 2 consumers (c0 and c1). Data produced by p0 is consumed by c1 and data produced by p1 is consumed by c0.

    • Describe a suitable schedule to improve the overall performance of the application. Explain why your schedule is likely to deliver improved performance.

    • Would your schedule change if the primary objective is to reduce power? Why or why not?

  2.

    Implement a feedback queue scheduler using the OS framework used in the class. The scheduler should aim to minimize power consumption on a multicore system.

  3.

    Parallelize the provided n-body simulation code using OpenMP and then derive an optimal affinity-based schedule. The scheduler can be implemented using affinity support in either Pthreads or GNU OpenMP.

Analysis and Evaluation

This module concentrates on performance estimation and measurement of parallel systems, including efficiency, linear and super-linear speedup, throughput, data locality, weak and strong scaling, and load balance. Performance estimation of sequential architectures and the implications of Amdahl’s law are typically part of current computer architecture courses. This module extends these concepts and investigates parallel performance in light of Amdahl’s law. It explores modern parallel benchmark suites such as PARSEC (task, data, and pipelined parallelism) and Lonestar (amorphous parallelism) and demonstrates how to write benchmark programs to measure the performance of parallel hardware. It discusses how to identify potential for speedup as well as upper speedup bounds and performance obstacles.

  • Recommended Length 1 lecture (1 h 15 min)

  • Recommended Course Compilers, Computer Architecture, Upper-level CS elective

Organization and Content

This module starts with a review of elementary performance concepts and OpenMP syntax, followed by a discussion of several advanced performance concepts. The lecture slides for this module are complemented by a series of micro-benchmarks written in OpenMP; alternate implementations in Pthreads are provided in the supplementary materials. Each benchmark highlights a particular performance issue and also includes several student versions. The student versions expose parameters in the code that students can alter in various ways to affect the performance of the code; they also include omitted code blocks that the students are expected to fill in as an exercise. A set of scripts measures various performance metrics, including execution time, cache misses and processor power consumption.

Review of Elementary Performance Concepts

This section is similar to the module section described in section “Performance Metrics”. The main difference is that the examples used are more involved and written in OpenMP.

Review of OpenMP Syntax

This segment of the module provides a quick review of basic OpenMP syntax and semantics. It is assumed the students are familiar with parallel program execution but not necessarily with any parallel programming language; therefore, this introduction is very basic. Only the parallel region and parallel for constructs are covered. The goal is to give students enough knowledge to modify existing code, not necessarily to write efficient parallel programs on their own. If a student comes in with OpenMP programming experience, this module is still very useful, as it will train her to tune OpenMP pragmas to extract better performance from her code. As was done with the task orchestration module, the OpenMP tutorial can also be done offline to save some lecture time.
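A minimal example covering just these two constructs (illustrative, not taken from the module's slides):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* parallel region: every thread executes the enclosed block */
        #pragma omp parallel
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());

        /* parallel for: loop iterations are divided among the threads */
        int sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= 100; i++)
            sum += i;
        printf("sum = %d\n", sum);   /* always 5050 */
        return 0;
    }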

Strong and Weak Scaling

The notion of scalability of parallel programs is introduced in this segment, along with the distinction between strong scaling and weak scaling. The code shown in Fig. 10 is used as a running example. The code is explained and then executed with 1, 2, 4, and 8 threads on an 8-core machine; other configurations are feasible depending on computer availability. Before each run of the code, students are asked to guess the execution time. As written, the program will achieve strong scaling on up to 16 cores on current-generation processors. To observe scaling effects beyond 16 cores, the data set needs to be > 4 GB, which introduces NUMA effects and page faults that prevent the application from achieving linear speedup.

Fig. 10
figure 10

Example parallel code to demonstrate scaling

The code in Fig. 10 is then used to conduct a weak scaling experiment. The data set size is increased progressively until performance stops scaling. How much the data set needs to be increased depends on the particular platform where the code is being run. On some machines, runs with larger data sets can take several minutes, which must be factored in when planning the demo. However, the code is designed such that, on most machines, memory-bound behavior will show up in runs that take no more than 30 s. As in the strong scaling demo, students are polled for the expected execution time before each run. Following these demos, the notions of strong scaling and weak scaling are formalized, using a set of lecture slides and charts illustrating scaling trends.
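Since Fig. 10 is reproduced as an image, here is a sketch of what such a scaling micro-benchmark might look like. The parameter names DIMSIZE, THREADS, and BLOCK mirror those referenced in the exercises below; the actual figure code may differ. Holding DIMSIZE fixed while varying THREADS gives a strong-scaling run; growing DIMSIZE in proportion to THREADS gives a weak-scaling run:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #ifndef DIMSIZE
    #define DIMSIZE (1 << 26)      /* problem size */
    #endif
    #ifndef THREADS
    #define THREADS 4              /* thread count */
    #endif
    #ifndef BLOCK
    #define BLOCK 1024             /* chunk size per thread */
    #endif

    int main(void)
    {
        float *a = malloc(sizeof(float) * (size_t)DIMSIZE);
        float *b = malloc(sizeof(float) * (size_t)DIMSIZE);
        for (long i = 0; i < DIMSIZE; i++)
            b[i] = (float)i;

        double t0 = omp_get_wtime();
        #pragma omp parallel for num_threads(THREADS) schedule(static, BLOCK)
        for (long i = 0; i < DIMSIZE; i++)
            a[i] = 2.0f * b[i] + 1.0f;
        printf("%d threads: %.3f s\n", THREADS, omp_get_wtime() - t0);

        free(a);
        free(b);
        return 0;
    }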

Linear and Super Linear Speedup

The code from Fig. 10 is re-used to explain the concepts of linear and super-linear speedup. The single-threaded version is labeled as the baseline, and speedup is then calculated for the 2-, 4-, and 8-thread versions. The obtained speedup is correlated with the number of threads/cores and shown to match the definition of linear speedup. The image processing example code is then transformed using tiling to improve data locality, as shown in Fig. 11 (the tiled loop structure is sketched after the figure). If time permits, this can be done live in class as the technique is explained; otherwise the example can be created ahead of time. The tiled version of the code is re-run with 2, 4, and 8 threads to demonstrate super-linear speedup. The working set size is chosen to exceed most L2 caches on current-generation processors; a tile size of 16–24 would keep the working set in cache. Some trial and error may be necessary prior to the demo to determine the exact size.

Fig. 11
figure 11

Tiled version of image processing parallel code used to demonstrate data locality effects
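A sketch of the tiled traversal (the image dimensions, tile size, and stencil body are illustrative, not the chapter's actual Fig. 11 code):

    #include <stdio.h>
    #include <omp.h>

    #define H    2048
    #define W    2048
    #define TILE 16        /* within the 16-24 range suggested above */

    static float img[H + 1][W + 1], out[H][W];

    int main(void)
    {
        for (int i = 0; i <= H; i++)
            for (int j = 0; j <= W; j++)
                img[i][j] = (float)(i + j);

        double t0 = omp_get_wtime();
        /* Tiled traversal: each TILE x TILE block is finished before
           moving on, so its working set stays resident in cache. */
        #pragma omp parallel for collapse(2)
        for (int ii = 0; ii < H; ii += TILE)
            for (int jj = 0; jj < W; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        out[i][j] = 0.25f * (img[i][j]     + img[i][j + 1] +
                                             img[i + 1][j] + img[i + 1][j + 1]);
        printf("tiled: %.3f s\n", omp_get_wtime() - t0);
        return 0;
    }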

Latency vs. Bandwidth

The concepts of memory bandwidth and latency, and their effects on parallel performance, are discussed next. Sequential versions of the code in Fig. 12 are first used to demonstrate the importance of locality for performance: the code on the left exploits spatial locality while the code on the right does not. The parallelization of the two codes is then explained and the parallel versions are executed (the two traversal orders are sketched after the figure). A second example with a tiled computation is also introduced briefly to illustrate the notion of temporal locality and its impact on performance. This demo establishes that parallelism alone cannot overcome limitations in memory locality. The code in Fig. 10 is then run with a data set large enough to exceed the available memory bandwidth per socket. After the execution of the program, the point is reiterated that scalable performance can be limited by memory factors.

Fig. 12
figure 12

Parallel code with and without exploiting spatial locality
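A sketch of the contrast in Fig. 12 (array size and loop bodies are illustrative): the first loop nest streams through memory in row-major order with stride-1 accesses, while the second walks column by column with a large stride and poor spatial locality:

    #include <stdio.h>
    #include <omp.h>

    #define N 4096
    static float a[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0f;

        double s = 0.0, t;

        t = omp_get_wtime();
        #pragma omp parallel for reduction(+:s)
        for (int i = 0; i < N; i++)      /* row-major: cache friendly */
            for (int j = 0; j < N; j++)
                s += a[i][j];
        printf("row-wise:    %.3f s (s=%.0f)\n", omp_get_wtime() - t, s);

        s = 0.0;
        t = omp_get_wtime();
        #pragma omp parallel for reduction(+:s)
        for (int j = 0; j < N; j++)      /* column-wise: strided, misses often */
            for (int i = 0; i < N; i++)
                s += a[i][j];
        printf("column-wise: %.3f s (s=%.0f)\n", omp_get_wtime() - t, s);
        return 0;
    }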

SMP vs. NUMA

The discussion of latency and bandwidth leads to a discussion of parallel architectures and the main considerations for programming such systems. This discussion is kept at a very high level and uses slides to illustrate the differences between the architectures. Programming models and tools for the different systems are also discussed, and GPUs and heterogeneous system architectures with CPUs and GPUs are touched on.

Power vs. Performance

This module ends with a discussion on energy efficiency of parallel applications. The importance of saving power and attaining high-performance at specified power budgets is explained.

Pedagogical Notes

It is advisable to run the experiments a few times before the actual in-class demo; this lets the instructor adapt the codes to the execution environment and make any necessary changes. Details on how to tune the parameters of the codes so that they exhibit the expected behavior are provided with the sample codes and scripts.

In the default configuration, the slowest code in the examples runs for a few seconds, so as not to take up too much class time. Nonetheless, if time permits, the longer versions of the codes should be used, as the performance differences make more of an impression on the students. During these long runs the instructor may further elaborate on the topics.

Sample Exercises

  1.

    Set the DIMSIZE, THREADS and BLOCK variables in the above code to different values (select values based on class discussion) and execute the code on a server X with 8 cores and server Y with 16 cores. Record performance statistics using perf. Prepare a report and explain the performance variations you observe on the two machines.

  2.

    Download the PARSEC benchmark suite (http://parsec.cs.princeton.edu). Select one application from the group canneal, dedup, and streamcluster, and another application from the group swaptions, bodytrack, and facesim. Conduct a performance study of the two selected applications on a compute server with at least 16 cores. Use the parsecmgmt package to execute the applications with the input data sets small, medium, large, and native, and with different thread counts: 2, 4, 8, 16, 32 and 64. Record performance statistics using perf.

    What are the main performance trends you observe? What does that say about the characteristics of the two selected programs? Relate the performance trends to scalability concepts discussed in this module and prepare a report.