
1 Introduction

Heterogeneous computing platforms have become widespread in recent years. Such platforms are not limited to supercomputing systems; they are also being deployed in personal computing environments. NVIDIA introduced CUDA for programming CPU-GPU heterogeneous computing platforms in 2007 [1]. Since then, the CUDA ecosystem has grown rapidly [10] and has spurred language and tool development for effectively exploiting the performance potential of such systems. OpenACC [11], OpenCL [6], and Microsoft AMP [14] are a few other programming systems for heterogeneous computing.

With the widespread increase in the size and complexity of programs, it is important to provide a programming environment that is intuitive to developers accustomed to creating software on non-heterogeneous systems. CUDA's language-integrated heterogeneous parallel programming approach is key to this goal. CUDA modules contain code that executes on the CPU, the GPU, or both processors. Functions and variables are annotated with execution spaces (CPU/GPU), and CPU code can create work on the GPU using an extended call syntax.

Our language-integrated approach is distinct from other frameworks such as OpenCL, where the GPU program is embedded in a character string that is explicitly passed to compiler API functions invoked by the CPU program.

Early versions of the CUDA tools required the whole device program to be available at compile time. This represented a significant hurdle from a software development and porting perspective. In this paper, we describe our work to enable separate compilation for CUDA in release 5.0, which removes this limitation. We believe this to be the first language-integrated heterogeneous programming environment that provides separate compilation and linking support.

In the rest of the paper, we use the terms device and host to refer to the GPU and the CPU, respectively. By extension, device code and host code refer to code entities that execute on the GPU and CPU, respectively. There are several motivations for enabling separate compilation for device code:

  • Incremental Compilation Speedup: Requiring all device code to be in a single translation unit increases the compile time and memory requirements for the compiler toolchain. Incremental compilation is also not possible; if any part of the device code is changed, the entire device code must be compiled again. This slows down the edit-build-debug cycle during software development. Separate compilation solves this problem by allowing code refactoring into multiple translation units. A change to one translation unit requires only that translation unit to be re-compiled, and the application to be re-linked. This reduces the incremental compilation overhead, allowing faster code development.

  • Ease of Porting: Most large applications have code factored into tens or even hundreds of translation units. Separate compilation support eases the process of porting such applications to the GPU.

  • Library Support: Separate compilation support enables linking against third party libraries, allowing modular program development. User code may now link against one or more libraries with device code, where the library and user code are independently compiled, including the possibility of libraries calling user-defined device callback functions.

In this paper, we make the following contributions:

  • We present our design for separate compilation in the CUDA programming environment. We introduce the device linker for linking separately compiled device code objects.

  • We describe how host-visible device entities are supported under separate compilation.

  • We present a novel device sub-link mechanism that allows groups of objects to be linked separately, and co-exist in the generated program.

  • We describe a device link time resource allocation algorithm for allocating device shared memory.

  • We evaluate the optimization tradeoffs in separate compilation mode compared to the older whole-device-program compilation mode.

  • We present performance metrics with real applications, comparing programs built under separate compilation versus the whole-device-program mode.

2 CUDA

CUDA consists of a programming language, a compiler and a runtime for heterogeneous parallel computing [1, 10]. A typical target platform has two different kinds of processors - the CPU and the GPU. The GPU can execute many threads in parallel using multiple processors that contain local memories.

Figure 1 shows a simple CUDA program that multiplies two vectors element-wise, in parallel.

Fig. 1. An example CUDA program fragment

Functions annotated with the __device__ keyword execute on the GPU. Functions annotated with the __global__ keyword are the entry points for GPU code execution (kernel functions). A function with no explicit annotation, or marked with the __host__ keyword, executes on the CPU. Thus, multiply is a device function, hostFunc is a host function, and vectorMultiply is an entry function for GPU code execution.

Kernel functions can be launched from host code using the triple angle bracket syntax (<<<...>>>). Namespace scope variables can be allocated in specific GPU memory regions with annotations (e.g., __constant__).
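
Figure 1 itself is not reproduced here. The fragment below is a sketch consistent with the description above; the names multiply, hostFunc, and vectorMultiply come from the text, while the signatures and bodies are illustrative assumptions.

    // Sketch of a CUDA program fragment like the one in Fig. 1 (illustrative only).
    __device__ float multiply(float a, float b) {          // device function
      return a * b;
    }

    __global__ void vectorMultiply(const float *a, const float *b,
                                   float *c, int n) {       // kernel (GPU entry point)
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
        c[i] = multiply(a[i], b[i]);
    }

    __host__ void hostFunc(const float *a, const float *b, float *c, int n) {
      // Host code launches the kernel using the <<<grid, block>>> syntax.
      vectorMultiply<<<(n + 255) / 256, 256>>>(a, b, c, n);
    }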

3 Heterogeneous Separate Compilation

Figure 2 shows the CUDA separate compilation and linking framework. For each translation unit, the CUDA frontend splits the host and device code into separate parts. The device code is passed to the device compiler, generating a “fatbinary”. A fatbinary contains the device machine code for one or more GPU architectures in ELF format [5]. The fatbinary is transformed into a data array and embedded in the host part of the CUDA source file generated by the frontend. This combined program is then processed by the host compiler to produce an object file. The object file can be linked against other objects on the host system to produce a host executable or library.

Fig. 2. Separate compilation and device linking

Previous versions of CUDA did not support separate compilation of device code, i.e., all device code entities accessed during a kernel launch had to be in the same translation unit. The separate compilation framework described in this paper removes this restriction, by using a new device linker that links the device code entities from multiple object files.

The device linker extracts the device code embedded in the object files, carries out the linking process, and generates another object file embedding the linked device code image. The generated object also contains synthesized definitions for host functions that are invoked during host startup (see Sect. 5). The synthesized object file is presented to the host linker along with the original object files to produce a host executable or library. The device linker may be invoked explicitly; it is also invoked implicitly by the compiler driver when the target is a host executable.

4 Host-Visible Device Entities

In a heterogeneous computing environment like CUDA, certain device entities can be directly referenced in host code. For example, kernel functions can be launched from the host, but these functions contain code that executes on the GPU. In this section, we detail how such “host-visible” device entities are supported under separate compilation.

The host-visible entities are __global__ (kernel) functions, namespace-scope variables allocated in the __device__ or __constant__ address space, textures, and surfaces. With separate compilation, the definition of and references to these entities may be in different translation units. For example, consider the program below, where the host-visible entities are defined in first.cu and referenced in second.cu.

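The original listing is not reproduced here. The sketch below illustrates the pattern just described; only the name foo is taken from the text, and the remaining names and types are illustrative assumptions.

    // first.cu -- defines host-visible device entities
    __device__ int foo;                        // namespace-scope __device__ variable
    __global__ void kern(int v) { foo = v; }   // kernel function

    // second.cu -- references those entities from host code
    extern __device__ int foo;
    __global__ void kern(int v);

    void hostCaller() {
      kern<<<1, 1>>>(42);                          // launch the kernel defined in first.cu
      int h = 0;
      cudaMemcpyFromSymbol(&h, foo, sizeof(int));  // read the __device__ variable foo
    }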

As described in Fig. 2, the host code and device code are processed by different backend compilers. As a result, host-visible device entities need special handling in the code passed to the host backend compiler. For each host-visible entity defined in the current translation unit, the compiler frontend will create a shadow entity (function, variable, etc.) with the same linkage, in the code passed to the host compiler. References in the host code to the original device entity will be updated to refer to the shadow entity. In addition, the compiler frontend will insert registration code to be run during host startup. The registration code creates a mapping from the shadow entity to the name of the original device entity. This mapping enables the CUDA driver to retrieve the device entity being referenced from the host code. For example, the compiler frontend would insert the following code for “foo” when generating the host-side code for first.cu above:

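The generated registration code calls internal CUDA runtime entry points and is not reproduced here. The sketch below conveys only the idea; the registration function name is a hypothetical stand-in for the real interface.

    // Host-side shadow of the __device__ variable foo defined in first.cu.
    int foo;

    // Registration code run during host startup (hypothetical API name).
    static void registerFirstCu(void *deviceImageHandle) {
      // Map the address of the shadow "foo" to the device symbol named "foo",
      // so the CUDA driver can resolve host references to the device entity.
      registerDeviceVariable(deviceImageHandle, &foo, "foo", sizeof(int));
    }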

5 Multiple Device Links

5.1 Motivation

Existing separate compilation environments typically have a single link step that combines objects into a single executable. In the CUDA environment, linking device code in a single link step may negatively impact the performance of the final linked code. The entry point for device execution is a kernel function. Before a kernel is launched, the runtime ensures that all resources required by the kernel are available. The runtime requires that the compiler toolchain provide upper bounds on the resources used by the kernel and any functions transitively invoked from the kernel. Examples of the resources that need to be tracked are the maximum amount of __shared__ memory needed by the kernel and the maximum number of physical registers needed during the kernel’s execution. To compute this information, the CUDA linker builds a callgraph for each kernel function. However, in the presence of indirect function calls, the callgraph information may be conservative. As a result, the computed resource usage values may be overly pessimistic. This may cause the kernel launch to fail at runtime, or artificially restrict the number of parallel units of work that run simultaneously on the processor, degrading the throughput of the executing program.

For example, consider the following two source files:

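The original listing is not reproduced here. The sketch below is consistent with the discussion that follows; the names first, second, arr, and local come from the text, while the sizes and the helper function are assumptions.

    // first.cu
    __shared__ float arr[1000];
    __device__ void first() { arr[threadIdx.x] = 0.0f; }
    __global__ void kernelA() {
      void (*fp)() = first;   // the address of first is taken here
      fp();
    }

    // second.cu
    __shared__ float local[500];
    __device__ void helper() { local[threadIdx.x] = 1.0f; }
    __device__ void second() {
      void (*fp)() = helper;
      fp();                   // indirect call site: the linker cannot rule out
                              // first() as a possible target
    }
    __global__ void kernelB() { second(); }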

The functions first and second are never invoked during the execution of the same kernel. However, since the address of first is taken, the linker may assume that first is invokable from the indirect call site in second. As a result, the __shared__ memory requirement for a kernel invoking second includes allocations for both variables arr and local. If the entities from first.cu and second.cu never interact, one solution is for the user to put the objects created from first.cu and second.cu in different sub-linked object groups. The device linker is invoked multiple times, with disjoint sets of objects participating in each link step. Since the linker call-graph only considers objects participating in the current link step, the call graph will be more precise and the __shared__ memory requirement calculations will therefore be more accurate.

This multiple device link facility is very useful for library writers, because it enables performance isolation of library code from the user’s code. Objects in the library that contain device code, and that are not supposed to directly interact with the user’s device code, can be device-linked before the library is shipped. This insulates the user’s device code and the library’s device code from each other, with respect to the per-kernel resource requirement computation in the linker. It may also lead to shorter device link times for the user’s code, since objects from the pre-linked library will not participate in the user’s device link step.

5.2 Constraints

The design for the multiple device link mechanism must support the following constraints:

  • As described in Sect. 4, registration code for host-visible device entities needs a handle to the linked device image. The device linker must define a function that provides this handle.

  • The same object cannot be allowed to participate in multiple device links. The object file may contain definitions of host-visible entities such as kernel functions; if such an entity is referenced in host code, the CUDA runtime would not be able to uniquely determine the device entity being referenced if the object participates in multiple device links. This restriction is consistent with the “one-definition-rule” (ODR) semantics of the CUDA C language, derived from C++ [8].

  • The device linker design places a restriction that host objects created by the compiler cannot be modified by the device linker. This eliminates the possibility of patching the object file, e.g., to insert a function call to the linker-generated function that returns a handle to the linked image.

Fig. 3. Multiple device link example

5.3 Design

Each object is associated with a unique identifier, called the module-id. The module-id uniquely identifies the object among all objects that participate in any device link step. The module-id is computed by the device compiler and embedded in the generated object file. The device compiler also synthesizes a call to an externally defined function (“init-function”) in the generated object. The init-function’s name is derived from the module-id, and is unique per object. The init-function returns a handle to the linked device image, and the handle is used to register host-visible device entities (Sect. 4). During device linking, the linker synthesizes the definition for the init-function for every object, using the embedded module-id.

How does the device compiler generate the module-id? It leverages the ODR semantics of the CUDA C source language. ODR semantics require that a function or variable with extern linkage be defined exactly once in the complete program. If such a function or variable definition is present in the current translation unit, its name is used to derive the module-id, along with the file name and path.

Figure 3 shows an example with multiple device links: a.o and b.o participate in the first device link, while c.o and d.o participate in another. The object files generated from the source (.cu) files contain the module-id and the call to the init-function (e.g., init_XXX). The object file generated by each device link step contains the module-ids and the init-function definitions. The host linker binds the init-function calls to the definitions present in the objects created by the device link steps.
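
The sketch below illustrates the scheme for the first device link in Figure 3; the exact names and the form of the image handle are hypothetical, since the real toolchain emits internal runtime calls.

    // Emitted by the device compiler into a.o (module-id "mid_a"):
    extern void *init_mid_a(void);            // declared here, defined by the device linker
    static void registerEntities_a(void) {
      void *image = init_mid_a();             // handle to the linked device image
      // ... register a.o's host-visible entities against image (Sect. 4) ...
    }

    // Synthesized by the device linker into the object produced from linking
    // a.o and b.o; both init-functions return the same image handle:
    static char firstLinkImage[1];            // stands in for the embedded linked image
    void *init_mid_a(void) { return firstLinkImage; }
    void *init_mid_b(void) { return firstLinkImage; }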

5.4 Detecting Error Scenarios

The above design ensures that the program will fail to build for some error cases:

  • Object in Multiple Sub-links: If an object is involved in multiple device links, the device linker will create the init-function definition for that object multiple times. When the host linker is invoked, the program will fail to link because of ODR semantics (a function with extern linkage can be defined exactly once).

  • Device Link Step Missing: As described above, the device compiler synthesizes a call to the uniquely named init-function in every object. If this object does not participate in any device link step, and is presented to the host linker, the program will fail to link because of the missing definition of the init-function.

6 No-Cloning Under Separate Compilation

Typical host programs have a single entry point (usually called “main”) which is the starting point for execution. Under CUDA, each kernel represents an entry point in the device code; there can be multiple kernels in a complete device program. The kernel functions may call common device functions or reference common data. Under the legacy compilation mode (i.e., without separate compilation), the entire device code was compiled as a single translation unit. In this case, a compiler optimization would clone the common device functions and __shared__ memory variables, per kernel. Thus, each kernel could be separately optimized, for example:

  • Since each kernel gets its own cloned version of the __shared__ memory variable, the __shared__ memory layout can be computed independently of other kernels. This has two benefits. First, the __shared__ memory addresses can be fixed before code generation, allowing optimizations such as constant folding. Second, the total size of the __shared__ memory addressed by the kernel may be reduced, since there is more freedom in assigning offsets to __shared__ memory variables.

  • Global inter-procedural register allocation is possible for each kernel and the versions of the device functions cloned for it; this allows better code generation, and can reduce the maximum number of registers used by the kernel.

Decreasing the resource requirements of the kernel (total registers and __shared__ memory used) can increase GPU performance by increasing GPU occupancy [7], i.e., by reducing the number of cycles the GPU is stalled without work.
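
As an illustration (not taken from the paper), consider two kernels in the same translation unit that share a device helper and a __shared__ variable; under whole-device-program compilation the compiler can clone both per kernel.

    __shared__ float buf[2048];

    __device__ float helper(int i) { return buf[i] * 2.0f; }

    __global__ void kernelA(float *out) { out[threadIdx.x] = helper(threadIdx.x); }
    __global__ void kernelB(float *out) { out[threadIdx.x] = helper(threadIdx.x) + 1.0f; }

    // With all device code in one translation unit, helper() and buf can be
    // cloned per kernel, so each clone can be inlined, register-allocated, and
    // have its __shared__ offset fixed independently. Under separate compilation,
    // helper() and buf may live in a different object file, so this per-kernel
    // specialization cannot be performed.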

Unfortunately, with separate compilation, this optimization is no longer possible, since device function definitions may now be in a different translation unit. We measure the impact of disabling this optimization in Sect. 8.

7 Optimizing Shared Memory Layout at Link Time

GPUs have a limited amount of shared memory, so it is important to allocate shared memory objects in an optimal manner. This is complicated by the fact that device code can have multiple kernels (entry points into device code), and different kernels may access different shared memory objects. If a user declares the shared memory variables local to a kernel, then it is obvious where they are used, but typically users allocate the shared memory at global scope so that it can be shared by multiple functions. This provides an opportunity to reduce space requirements by overlapping shared memory objects that are used in different kernels.

Consider the following example:

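The original listing is not reproduced here. The sketch below matches the sizes discussed next (X and Y at 4000 bytes each, Z at 400 bytes); the element types and the way the objects are used are assumptions.

    __shared__ float X[1000];   // 4000 bytes
    __shared__ float Y[1000];   // 4000 bytes
    __shared__ float Z[100];    //  400 bytes

    __global__ void E() { X[threadIdx.x] = Z[threadIdx.x % 100]; }   // E uses X and Z
    __global__ void F() { Y[threadIdx.x] = Z[threadIdx.x % 100]; }   // F uses Y and Z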

There are two kernels E and F, and three shared memory objects X, Y and Z; if E uses X and Z, while F uses Y and Z, then we can overlap the allocations of X and Y, using only 4400 bytes rather than 8400 bytes. However, this requires doing data allocation at link time rather than having each object allocate its own memory. It also requires a link-time callgraph with multiple roots (the kernel functions). Determining the optimal allocation of such shared memory can be treated as a graph coloring problem. Graph coloring is commonly used in the context of register allocation; here we apply it to data allocation, and derive the interference graph from the sets of kernels that reach each object. The algorithm works as follows:

  1. Search the relocations [5] to find all uses of shared objects.

  2. Use the callgraph to find which shared objects are used by each kernel (so if a non-kernel function F references a shared object, find all kernels that reach F).

  3. If no kernel uses a shared object, remove it. If only one kernel uses a shared object, make it local to that kernel.

  4. Build an interference graph in which each node represents a shared memory object, with an edge between two nodes when the sets of kernels that reach them intersect.

  5. Sort the list of nodes so that the largest objects are allocated first.

  6. Walk the interference graph and assign each node to an allocation group such that nodes connected by an edge are always in different groups (color the graph).

  7. Assign offsets to each allocation group (a group is a set of overlapping objects).
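
A simplified sketch of steps 1-7 is shown below. It operates on an in-memory summary of each shared object (name, size, and the set of kernels that reach it) rather than on the linker's actual ELF relocations and callgraph, and it uses a greedy first-fit coloring; all names are illustrative.

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    struct SharedObj {
      std::string name;
      size_t size;
      std::set<std::string> kernels;  // kernels that reach this object (steps 1-2)
    };

    // Returns the byte offset assigned to each shared object; objects placed in
    // the same allocation group receive the same offset and overlap in memory.
    std::map<std::string, size_t> layoutShared(std::vector<SharedObj> objs) {
      // Step 3: drop objects that no kernel uses (making single-kernel objects
      // kernel-local is omitted for brevity).
      objs.erase(std::remove_if(objs.begin(), objs.end(),
                                [](const SharedObj &o) { return o.kernels.empty(); }),
                 objs.end());

      // Step 5: allocate the largest objects first.
      std::sort(objs.begin(), objs.end(),
                [](const SharedObj &a, const SharedObj &b) { return a.size > b.size; });

      // Steps 4 and 6: two objects interfere if some kernel reaches both.
      // Greedily place each object in the first group it does not conflict with.
      std::vector<std::vector<const SharedObj *>> groups;
      std::vector<size_t> groupSize;
      for (const SharedObj &o : objs) {
        size_t g = 0;
        for (; g < groups.size(); ++g) {
          bool conflict = false;
          for (const SharedObj *p : groups[g])
            for (const std::string &k : o.kernels)
              if (p->kernels.count(k)) conflict = true;
          if (!conflict) break;
        }
        if (g == groups.size()) {
          groups.emplace_back();
          groupSize.push_back(0);
        }
        groups[g].push_back(&o);
        groupSize[g] = std::max(groupSize[g], o.size);
      }

      // Step 7: assign one offset per allocation group.
      std::map<std::string, size_t> offset;
      size_t base = 0;
      for (size_t g = 0; g < groups.size(); ++g) {
        for (const SharedObj *p : groups[g]) offset[p->name] = base;
        base += groupSize[g];
      }
      return offset;
    }

For the X, Y, Z example above, this produces two allocation groups, {X, Y} at offset 0 and {Z} at offset 4000, for a total of 4400 bytes.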

The above scheme is not always optimal. Graph coloring assumes equal-size registers; in our case the objects are not all of equal size. We allocate the largest objects first to minimize wasted space, but there can be situations where several smaller objects that conflict with each other, but not with a larger object, could all fit within the space of the larger object, in essence by doing a nested allocation of the smaller objects. This situation requires completely independent sub-graphs. We may modify our algorithm to account for this case, but so far we have not seen it become an issue in the code we have processed, and thus have refrained from the additional complexity. The algorithm is very effective for the benchmarks we evaluated, as shown in Sect. 8. Typically all the shared objects fit into just two allocation groups; the most complicated case we have seen so far in real code required only four allocation groups.

Table 1. Benchmarks

8 Results

Table 1 describes the benchmarks used for performance measurements. The Lawa sources may be configured to be compiled as a single file containing all the device source code, or as separate files. We contrast the two build modes to illustrate the potential advantages of separate compilation. The three other benchmarks (Cublas, Cufft, Thrust) contain legacy CUDA code sources that put all device code in a single file. We use these benchmarks to measure the impact of toggling the cloning optimization (Sect. 6) and for measuring the effectiveness of the link time shared memory allocation scheme (Sect. 7).

All measurements are done with the CUDA 5.0 release, running on a 64-bit Linux system with default full optimization. The separate compilation support is also implemented on Windows and Mac systems, but the results are independent of the host platform, so we show results for one platform only.

8.1 Lawa

Figure 4 plots the compilation time for each source file in seconds, with Lawa built in separate compilation mode. All except one file take less than 6 s to compile. The combined host and device link times were less than a second. The dotted line in the same graph shows the build time when the Lawa device code is compiled as a single translation unit (36 s). Doing a full build from scratch is significantly slower with separate compilation, due to invoking both the host and device compilers multiple times, but incremental compilation where only part of the program is rebuilt (which is typical of the edit-build-debug cycle and one of the goals of separate compilation) is significantly faster (6 s vs. 36 s).

Fig. 4. Lawa compile time

Table 2. Lawa run time for kernels K1–K4 (microseconds)

Table 2 compares the program runtime when built under separate compilation versus with the entire device code in a single file (“whole program compilation”). The second kernel slows down significantly due to the lack of inlining when the file is split. However, the other kernels have comparable runtimes, so the overall run time degrades by only 4.5 % in separate compilation mode. To get better performance with separate compilation, a user should look for hot spots and potentially inline code in those areas.

Fig. 5. Performance metrics for no-cloning relative to cloning

Fig. 6. Per-kernel shared memory size with link-time layout optimization, relative to the size with the optimization disabled

8.2 Impact of No-Cloning Versus Cloning

Figure 5 shows performance metrics when the benchmarks were built in no-cloning mode, relative to the values for cloning mode. Compile times for Cufft and Thrust show little impact from no-cloning. The effect for Cublas is less clear: compilation time decreased significantly for some files and increased for others, though it remained almost unchanged for the vast majority.

The runtime impact of no-cloning was minimal for most tests, but did show some negative impact on about 25 % of the Cublas tests. As described in Sect. 6, disabling cloning inhibits certain optimizations that seem to impact these tests.

For all three benchmarks, the per-kernel shared memory does not change significantly with no-cloning for the vast majority of the kernels. This indicates that the link time shared memory layout optimization (Sect. 7) effectively places the shared memory variables such that the per-kernel sizes approach the “best case” sizes possible with cloning. Alternatively, the benchmarks may predominantly contain shared memory variables accessed only by a single kernel, in which case the layout optimization is not applicable. This is further explored in Sect. 8.3 below.

As legacy CUDA code is recompiled under separate compilation mode, the cloning optimization will be disabled. The fact that compile time, run time and shared memory usage did not change significantly for the vast majority of the cases should smooth the initial transition to separate compilation mode for legacy CUDA programs.

8.3 Impact of Shared Memory Layout Optimization

Figure 6 shows the per-kernel shared memory sizes with the link-time layout optimization discussed in Sect. 7. The values are relative to the sizes with the optimization disabled, compiled under no-cloning mode: 1.0 means there was no change, while 0.2 means that the optimized layout reduced the space usage to 20 % of the original. The figure shows that the link-time optimization is extremely effective for the Cublas and Thrust kernels, while there was no impact for Cufft. For Cublas, the optimized layout was less than a quarter of the size of the non-optimized layout for almost half of the kernels. Similarly dramatic results can be seen for the Thrust kernels.

9 Related Work

Several heterogeneous parallel programming systems have been developed in recent years. OpenACC [11], OpenCL [6], Microsoft AMP [14], Cell [16], PGI Accelerator [13] and PGI CUDA Fortran [12] are some of the prominent examples. OpenACC and PGI Accelerator provide pragma directives to control code generation and variable placement for the associated heterogeneous co-processor. AMP and CUDA Fortran provide a language-integrated environment similar to CUDA. However, none of these environments currently allow separate compilation and linking of device code.

OpenCL 1.2 adds support for separate compilation and linking of device code fragments. This is quite different from our approach:

  • Language integration: In OpenCL, the device code is represented as a string embedded in host code. In CUDA, host and device code entities are part of the same source program where kernel functions and device variables may be accessed in host code. This makes programming simpler, but is challenging to implement.

  • Explicit (OpenCL) versus Implicit (CUDA) linking: OpenCL provides APIs to explicitly link device code. In CUDA, the linking is implicit - it is orchestrated by the compiler driver. This model is easier to use because it is similar to existing host environments.

  • Device Sub-linking (CUDA): The CUDA linker allows linking sub-groups of device code. This feature is important for shipping pre-linked library code.

Cell supports linking of device code in a manner similar to ours, but the host and device sources are compiled and linked separately.

Levine gives a short history of linking and suggests that the idea has been around since 1947 [2].

10 Conclusion

In this paper, we presented our design for separate compilation and linking of embedded device code in a language-integrated heterogeneous environment. Separate compilation provides many advantages: the ability to support third-party libraries, compile-time speedups with incremental builds, and ease of porting legacy programs to the heterogeneous environment. We described the linker extensions for sub-linking subsets of device code objects. We also described a link-time scheme for optimizing shared memory layout. Finally, we evaluated the cost of disabling cloning in separate compilation mode, and the effectiveness of the shared memory layout optimization.