1 Introduction

The \(l_1\)-minimization (\(l_1\)-min) problem can be written as:

$$\begin{aligned} \min ||x||_1\ \ \ \ s.t. \ \ Ax = b, \end{aligned}$$
(1)

where \(A\in \mathbb {R}^{m\times n}\) (\(m\ll n\)) is a full-rank dense matrix, \(b\in \mathbb {R}^m\) is a pre-specified vector, and \(x\in \mathbb {R}^n\) is the unknown solution. The \(l_1\)-min solution, called the sparse representation, has been proven to be sparse. Owing to this sparsity, \(l_1\)-min optimization has been successfully applied in various fields such as signal processing [1–8], machine learning [9–11] and statistical inference [12, 13].

To solve the \(l_1\)-min problem (1), researchers have developed many efficient algorithms, e.g., the gradient projection method [14], the truncated Newton interior-point method [15], homotopy methods [16], the class of iterative shrinkage-thresholding methods [17, 18], the augmented Lagrange multiplier method (ALM) [19, 20], and the alternating direction method of multipliers [21]. A survey by Yang et al. [22] has compared and benchmarked these representative algorithms.

As the problem scale grows, the execution efficiency of existing \(l_1\)-min algorithms decreases considerably. An effective remedy is to port these algorithms to distributed or many-core architectures such as graphics processing units (GPUs). Big data processing on GPUs has drawn much attention in recent years. Since NVIDIA introduced the compute unified device architecture (CUDA), a programming model designed to support joint CPU/GPU execution of applications, in 2007 [23], GPUs have become increasingly strong competitors among general-purpose parallel programming systems.

For most \(l_1\)-min algorithms, the main components are dense matrix-vector multiplications, such as Ax and \(A^Tx\), and vector operations. Highly efficient GPU implementations of Ax, \(A^Tx\), and the vector operations are available in the CUBLAS library [24]. Therefore, the existing GPU-accelerated \(l_1\)-min algorithms [25, 26] are mostly based on CUBLAS. On the NVIDIA GTX980 GPU, for test matrices with m varying from 50 to 5000 and n fixed at 100,000, the performance curves of the Ax and \(A^Tx\) implementations in CUBLAS are shown in Figs. 1a and 2a, respectively. For test matrices with n varying from 4000 to 520,000 and m fixed at 1000, the corresponding performance curves are shown in Figs. 1b and 2b. Clearly, for both the Ax and \(A^Tx\) implementations in CUBLAS, the performance fluctuates as m or n increases, and the gap between the maximum and minimum performance is substantial. In addition, we observe that when parallelizing the \(l_1\)-min algorithms on the GPU, the number of kernels can be reduced by merging some operations into a single kernel, which saves time between kernel calls and avoids loading vectors twice. However, CUBLAS does not allow several operations to be merged into a single kernel. Moreover, some features of NVIDIA GPUs with compute capability 3.2 or higher, such as the shuffle instruction and the read-only data cache, can be exploited to improve the performance of GPU-accelerated methods, but are not yet exploited in CUBLAS.

Fig. 1 The implementation of Ax in CUBLAS. a The performance curve with m (\(n=100,000\)), b the performance curve with n (\(m=1000\))

Fig. 2 The implementation of \(A^Tx\) in CUBLAS. a The performance curve with m (\(n=100,000\)), b the performance curve with n (\(m=1000\))

Therefore, these observations motivate us to further investigate the design of robust and highly parallel \(l_1\)-min solvers on the GPU. In this study, we propose a novel warp-based implementation of Ax on the GPU, called the GEMV kernel, and a novel thread-based implementation of \(A^Tx\) on the GPU, called the GEMV-T kernel. For the GEMV kernel, a self-adaptive warp allocation strategy is used to assign the optimal number of warps to each matrix row. Similarly, for the GEMV-T kernel we design a self-adaptive thread allocation strategy to assign the optimal number of threads to each matrix column. Experimental results show that the two proposed kernels are more robust than CUBLAS and consistently deliver high performance. In addition, two popular \(l_1\)-min algorithms, the fast iterative shrinkage-thresholding algorithm (FISTA) and the augmented Lagrangian multiplier method (ALM), are taken as examples. Utilizing the technique of merging kernels and the sparsity of the solution of the \(l_1\)-min algorithms, we propose two highly parallel \(l_1\)-min solvers on the GPU. Furthermore, we design a concurrent multiple \(l_1\)-min solver and optimize its performance by using the shuffle instruction and the read-only data cache.

In summary, our work makes the following contributions:

  • Two novel adaptively optimized GPU implementations of the matrix-vector multiplication are proposed. Both methods are more robust than CUBLAS and consistently deliver high performance.

  • Based on these GPU implementations of the matrix-vector multiplication, we present two highly parallel \(l_1\)-min solvers on the GPU by utilizing the technique of merging kernels and the sparsity of the solution of \(l_1\)-min algorithms.

  • Utilizing new features of the GPU, we design an optimized concurrent multiple \(l_1\)-min solver on the GPU.

The remainder of this paper is organized as follows. Section 2 describes two \(l_1\)-min algorithms, FISTA and ALM. Section 3 introduces the CUDA architecture. Section 4 presents the two adaptively optimized implementations of the matrix-vector multiplication on the GPU, the GPU-accelerated FISTA and ALM solvers, a concurrent multiple \(l_1\)-min solver on the GPU, and several optimization strategies. Experimental results are presented in Sect. 5. Section 6 concludes the paper and outlines future research directions.

2 Two \(l_1\)-min Algorithms

2.1 Fast Iterative Shrinkage-Thresholding Algorithm

The problem in Eq. (1) is known as the basis pursuit (BP) problem [7]. In practice, the measurement vector b often contains noise (e.g., a measurement error \(\varepsilon \)); the noisy formulation is known as the basis pursuit denoising (BPDN) problem. A variant of this problem is also well known as the unconstrained BPDN problem with a scalar weight \(\lambda \), or, from the statistics perspective, as the Lasso problem [27]:

$$\begin{aligned} \min \frac{1}{2}||Ax-b||^2_2 + \lambda ||x||_1. \end{aligned}$$
(2)

The fast iterative shrinkage-thresholding algorithm (FISTA) is an accelerated variant of iterative shrinkage-thresholding that achieves a non-asymptotic convergence rate of \(O(1/k^2)\) by incorporating Nesterov's optimal gradient method [17, 18]. FISTA introduces an auxiliary sequence \(\{y_k, k=1,2,\ldots \}\) and iterates as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} x_{k+1}=soft\left( y_k-\frac{1}{L_f}\nabla f(y_k),\frac{\lambda }{L_f}\right) \\ t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}\\ y_{k+1}=x_{k+1}+\frac{t_k-1}{t_{k+1}}(x_{k+1}-x_{k}), \end{array}\right. } \end{aligned}$$
(3)

where \(soft(u,a)=\text {sign}(u)\max \{|u|-a,0\}\) is the soft-thresholding operator, \(y_1=x_0\), \(t_1=1\), and the associated Lipschitz constant \(L_f\) of \(\nabla f(\cdot )\) is given by the spectral norm of \(A^TA\), denoted by \(||A^TA||_2\). Algorithm 1 summarizes the generic FISTA algorithm.

[Algorithm 1: The generic FISTA algorithm]
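As a point of reference for the GPU solvers in Sect. 4, the following CUDA C++ host-side sketch performs one FISTA sweep of Eq. (3). It is a minimal illustration, not the paper's implementation: the dense loops stand in for the Ax and \(A^Tx\) products that are later accelerated on the GPU, and the names (fista_step, soft, bvec) are our own.

```cuda
// Host-side reference of one FISTA iteration (Eq. (3)); illustrative only.
#include <vector>
#include <cmath>
#include <algorithm>

static float soft(float u, float a) {                 // soft-thresholding operator
    float s = std::fabs(u) - a;
    return (u > 0.0f ? 1.0f : -1.0f) * std::max(s, 0.0f);
}

// x_{k+1} = soft(y_k - (1/Lf) * A^T(A y_k - b), lambda/Lf), then t- and y-updates.
void fista_step(const std::vector<std::vector<float>>& A,
                const std::vector<float>& bvec,
                std::vector<float>& x, std::vector<float>& xPrev,
                std::vector<float>& y, float& t, float lambda, float Lf)
{
    const size_t m = A.size(), n = A[0].size();
    std::vector<float> r(m, 0.0f), g(n, 0.0f);
    for (size_t i = 0; i < m; ++i) {                  // r = A y - b
        for (size_t j = 0; j < n; ++j) r[i] += A[i][j] * y[j];
        r[i] -= bvec[i];
    }
    for (size_t j = 0; j < n; ++j)                    // g = A^T r (gradient of f at y)
        for (size_t i = 0; i < m; ++i) g[j] += A[i][j] * r[i];

    xPrev = x;
    for (size_t j = 0; j < n; ++j)
        x[j] = soft(y[j] - g[j] / Lf, lambda / Lf);

    const float tNext = (1.0f + std::sqrt(1.0f + 4.0f * t * t)) / 2.0f;
    for (size_t j = 0; j < n; ++j)
        y[j] = x[j] + ((t - 1.0f) / tNext) * (x[j] - xPrev[j]);
    t = tNext;
}
```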

2.2 Augmented Lagrangian Multiplier Method

The augmented Lagrangian multiplier method (ALM) [19, 20] combines penalty methods with the Lagrange multiplier method. The augmented Lagrangian function corresponding to problem (1) is

$$\begin{aligned} L_{\rho }(x,\lambda ) = ||x||_1+\lambda ^T(Ax-b)+\frac{\rho }{2}||Ax-b||_2^2, \end{aligned}$$
(4)

where \(\lambda \) is a vector of Lagrange multipliers, \(\frac{\rho }{2}||Ax-b||_2^2\) is the quadratic penalty term, and \(\rho >0\) is a constant that determines the penalty for infeasibility.

The augmented Lagrangian in Eq. (4) can be minimized by the following iterative procedure, known as the method of multipliers [28]:

$$\begin{aligned} {\left\{ \begin{array}{ll} x_{k+1}=\mathop {\arg \min }\limits _{x} L_{\rho }(x,\lambda _k),\\ \lambda _{k+1} = \lambda _k + \rho _k(Ax_{k+1}-b), \end{array}\right. } \end{aligned}$$
(5)

where \(\{\rho _k\}\) is a monotonically increasing positive sequence that becomes sufficiently large after a certain index. The procedure in Eq. (5) simultaneously yields the optimal solutions \(x^{*}\) and \(\lambda ^{*}\).

The first step of the procedure in Eq. (5), called x-min, is itself a convex optimization problem and can be solved by FISTA. The generic ALM algorithm is summarized in Algorithm 2.

[Algorithm 2: The generic ALM algorithm]
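To make the structure of Algorithm 2 explicit, here is a host-side CUDA C++ sketch of the outer iteration in Eq. (5). The inner x-minimization is passed in as a callable (the paper solves it with FISTA); alm_outer, xmin and rhoGrowth are illustrative names, not the authors' code.

```cuda
// Host-side sketch of the ALM outer loop (Eq. (5)); illustrative only.
#include <vector>
#include <functional>

void alm_outer(const std::vector<std::vector<float>>& A,
               const std::vector<float>& bvec,
               std::vector<float>& x, std::vector<float>& lambda,
               float rho, float rhoGrowth, int outerIters,
               const std::function<void(std::vector<float>&,        // x (in/out)
                                        const std::vector<float>&,  // lambda_k
                                        float)>& xmin)              // rho_k
{
    const size_t m = A.size(), n = A[0].size();
    for (int k = 0; k < outerIters; ++k) {
        xmin(x, lambda, rho);                 // x_{k+1} = argmin_x L_rho(x, lambda_k), e.g. via FISTA
        for (size_t i = 0; i < m; ++i) {      // lambda_{k+1} = lambda_k + rho_k (A x_{k+1} - b)
            float ri = -bvec[i];
            for (size_t j = 0; j < n; ++j) ri += A[i][j] * x[j];
            lambda[i] += rho * ri;
        }
        rho *= rhoGrowth;                     // {rho_k} is monotonically increasing
    }
}
```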

3 CUDA Architecture

The compute unified device architecture (CUDA) is a general-purpose parallel computing platform and programming model. Developers use CUDA C/C++ to define functions, called kernels, that are executed in parallel by many CUDA threads. Threads are organized into thread blocks, and thread blocks are organized into a grid. Both thread blocks and the grid can have up to three dimensions.

A CUDA-enabled GPU has many computing cores, called CUDA cores, which can collectively run thousands of threads. These CUDA cores are organized into a scalable array of streaming multiprocessors (SMs). An SM is designed to execute hundreds of threads concurrently by employing a unique architecture called SIMT (Single Instruction, Multiple Thread). When a thread block is assigned to an SM, it is split into warps of 32 threads each. In the best case, all 32 threads of a warp follow the same execution path and each instruction is executed by them concurrently.

Fig. 3 The ways in which threads access data

The GPU memory includes on-chip memory, e.g., the shared memory and the L1 cache, and on-board memory, e.g., the global memory. GPUs with different compute capabilities access memory in slightly different ways; in this study, we only consider GPUs with compute capability 3.2 or higher, in order to utilize new features such as the read-only data cache. Figure 3 shows the possible ways in which threads access data from memory. The global memory is large and shared by all SMs. However, since it resides in off-chip DRAM (Dynamic Random Access Memory), its lower bandwidth and higher latency make accesses slow. The shared memory is shared by all threads in a thread block and provides high bandwidth and low latency, but its capacity is small. Both the L1 cache and the read-only data cache are shared by all threads within an SM and can be accessed as fast as the shared memory. Unlike the L1 cache, which the programmer cannot control, the read-only data cache can be explicitly used by programmers.

Therefore, the major challenges in optimizing an application on GPUs are global memory access latency, on-chip memory access efficiency, divergent execution paths within a warp, communication and synchronization between threads in different blocks, and resource utilization.

4 GPU Implementation

4.1 Data Layout

We use a row-major, 0-based indexing array a to store the matrix A, and utilize a padding scheme to optimize the global memory access performance, as shown in Fig. 4.

Fig. 4 Padding scheme for the array a

For devices with compute capability 2.0 or higher, global memory accesses by the threads within a warp can be coalesced into the minimum number of L2-cache-line-sized (i.e., 32-byte) aligned transactions. Assume that each matrix row is assigned to one warp and that 32-byte memory transactions are used. From Fig. 4, we observe that each row of the matrix A contains 15 single-precision elements and is misaligned before padding. This misalignment requires 11 memory transactions for the first four rows of A; padding the array a decreases the number of memory transactions to 8. Therefore, the padding scheme yields about a 30 % improvement of the global memory-access performance for the case in Fig. 4.
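The following CUDA C++ sketch shows one possible way to realize the padding of Fig. 4: each row is padded so that the next row starts on a 32-byte (L2 transaction) boundary. The helper names and the explicit copy loop are our assumptions; cudaMallocPitch provides similar functionality.

```cuda
// Sketch of padding the row-major array a so each row starts on a 32-byte boundary.
#include <cstring>
#include <cstddef>

int padded_lda(int n)                                     // padded row length in elements
{
    const int elemsPerTx = 32 / sizeof(float);            // 8 floats per 32-byte transaction
    return ((n + elemsPerTx - 1) / elemsPerTx) * elemsPerTx;
}

void pack_padded(const float* A, float* aPadded, int m, int n)
{
    const int lda = padded_lda(n);
    for (int i = 0; i < m; ++i) {
        std::memcpy(aPadded + (size_t)i * lda, A + (size_t)i * n, n * sizeof(float));
        for (int j = n; j < lda; ++j)
            aPadded[(size_t)i * lda + j] = 0.0f;           // zero fill the padding
    }
}
```

For the example in Fig. 4 (n = 15), padded_lda returns 16, which matches the 8-transaction case described above.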

4.2 Robust Matrix-Vector Multiplications

The matrix-vector multiplications include Ax (GEMV) and \(A^Tx\) (GEMV-T), where \(A\in R^{m\times n}\). In the following subsections, we propose two robust matrix-vector multiplication kernels. Both kernels take full advantage of the multi-level cache hierarchy to cache the vector x, thereby improving access efficiency, and they are robust and extensible across different GPU devices. The grid of our kernels is organized as a 1D array of thread blocks, and each thread block is organized as a 1D array of threads.

4.2.1 GEMV Kernel

The GEMV, Ax, is composed of m dot products of x with the rows of A, and these dot products can be computed independently. Thus, in our proposed GEMV kernel, we assign one or multiple warps to each dot product. To optimize the GEMV kernel performance, we use the following self-adaptive warp allocation strategy to select the number of warps k assigned to each dot product:

$$\begin{aligned} \min w = sm\times 2048 / k /32, \quad \ s.t. \ \ m\leqslant w, \end{aligned}$$
(6)

where sm is the number of streaming multiprocessors and m is the number of matrix rows.
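A possible host-side reading of Eq. (6): since w decreases as k grows, minimizing w subject to \(m\leqslant w\) amounts to picking the largest k that still keeps enough resident warp groups to cover all rows. The sketch below assumes power-of-two candidates \(k\in \{1,2,4,\ldots ,32\}\) and the figure of 2048 resident threads per SM used in Eq. (6); both the candidate set and the function name are our assumptions rather than the paper's code.

```cuda
// Host-side sketch of the self-adaptive warp allocation of Eq. (6).
int select_warps_per_row(int sm, int m)
{
    int k = 1;                                   // default: one warp per row
    for (int cand = 2; cand <= 32; cand *= 2) {
        int w = sm * 2048 / (cand * 32);         // warp groups the GPU can keep resident
        if (m <= w) k = cand;                    // constraint m <= w still satisfied
        else break;                              // a larger k would violate it
    }
    return k;
}
```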

Fig. 5 The GEMV kernel

Figure 5 shows the main procedure of the GEMV kernel. It is composed of two stages; the first stage includes three steps: the x-load step, the partial-reduction step and the warp-reduction step.

x-load step In this step, the threads of each block read elements of x into the shared-memory buffer xP in parallel. Because x is large, it is read into shared memory in segments of blockDim.x elements (the thread block size). In this way, the accesses to x are coalesced, and the number of accesses is reduced because all warps in a thread block share the same segment of x.

partial-reduction step Each time a segment of x has been read into shared memory, the threads in each warp group (k warps form a warp group) perform a partial reduction in parallel (see lines 12–5 in Fig. 5). Each thread in a warp group performs at most \(\lceil n/wgSize\rceil \) reduction steps, and the accesses to the global memory array A are coalesced.

warp-reduction step After the threads in each warp group have completed their partial reductions, the fast shuffle instructions are utilized to perform a warp-level reduction within each warp of these warp groups. The per-warp reduction results are stored in shared memory.

In the second stage, the per-warp reduction results in shared memory are reduced in parallel to a single output value for each warp group.

For \(k = 1\), a warp group contains only one warp. As Fig. 5 shows, only the first stage is then needed to obtain the output vector b, so we design a specialized GEMV kernel for \(k = 1\) to optimize performance (omitted here). Similarly, a specialized GEMV kernel is designed for \(k = 32\) (omitted here), because in this case it is unnecessary to stage x through shared memory; x can be loaded directly into registers.
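To make the description above concrete, the following CUDA sketch implements the \(k = 1\) special case: one warp per row, x streamed through shared memory, and a shuffle-based warp reduction. It is a simplified illustration of the scheme in Fig. 5, not the authors' code; the block size, the padded leading dimension lda and the use of __shfl_down_sync (the current form of the shuffle instruction; CUDA 6.5 used __shfl_down without the mask) are our assumptions.

```cuda
// Warp-per-row GEMV sketch (k = 1 case): b = A x, A is m x n, row-major, padded to lda.
#include <cuda_runtime.h>

#define BLOCK_SIZE 128                 // threads per block (multiple of the warp size)
#define WARP_SIZE  32

__global__ void gemv_warp_per_row(const float* __restrict__ A,
                                  const float* __restrict__ x,
                                  float* __restrict__ b,
                                  int m, int n, int lda)
{
    __shared__ float xP[BLOCK_SIZE];                            // shared tile of x

    const int warpsPerBlock = BLOCK_SIZE / WARP_SIZE;
    const int lane        = threadIdx.x % WARP_SIZE;
    const int warpInBlock = threadIdx.x / WARP_SIZE;
    const int row = blockIdx.x * warpsPerBlock + warpInBlock;   // one warp per row

    float sum = 0.0f;

    // x-load + partial-reduction: stream x through shared memory in tiles
    for (int tile = 0; tile < n; tile += BLOCK_SIZE) {
        const int col = tile + threadIdx.x;
        xP[threadIdx.x] = (col < n) ? x[col] : 0.0f;            // coalesced load of the tile
        __syncthreads();

        if (row < m) {
            const int tileLen = min(BLOCK_SIZE, n - tile);
            for (int j = lane; j < tileLen; j += WARP_SIZE)     // coalesced accesses to A
                sum += A[(size_t)row * lda + tile + j] * xP[j];
        }
        __syncthreads();
    }

    // warp-reduction with shuffle instructions
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (row < m && lane == 0)
        b[row] = sum;
}
```

A launch of the form gemv_warp_per_row<<<(m + 3) / 4, BLOCK_SIZE>>>(...) covers all rows with four warps per block; for \(k > 1\), the per-warp results of each warp group would additionally be combined through shared memory in the second stage.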

4.2.2 GEMV-T Kernel

The GEMV-T, \(A^Tx\), is composed of n dot products of x with the columns of A, and these dot products can be computed independently. Compared with the GEMV, the vector x in the GEMV-T is small (its length is m rather than n). Thus we assign one or multiple threads to each dot product in our proposed GEMV-T kernel. To optimize the GEMV-T kernel performance, we use the following self-adaptive thread allocation strategy to select the number of threads k assigned to each dot product:

$$\begin{aligned} \min t = sm\times 2048/k, \quad \ s.t. \ \ n\leqslant t, \end{aligned}$$
(7)

where sm is the number of streaming multiprocessors and n is the number of matrix columns.

Fig. 6 The GEMV-T kernel

Figure 6 shows the main procedure of the GEMV-T kernel. Like the GEMV kernel, it consists of two stages, but the first stage comprises only the x-load step and the partial-reduction step. The x-load step has the same function as in the GEMV kernel. In the partial-reduction step, since a row-major, 0-based array is used to store the matrix A, the accesses to A will not be coalesced if the thread groups (k threads form a thread group) are constructed inappropriately. For example, assume that A is a \(4\times 8\) matrix as shown in Fig. 7, that 16 threads in a thread block are launched, and that two threads are assigned to each dot product in the GEMV-T. If we use the thread groups \(\{0,1\}\), \(\{2,3\}\), \(\{4,5\}\), \(\ldots \), \(\{14,15\}\), the accesses to A are not coalesced (see Fig. 7a). However, when the thread groups \(\{0,8\}\), \(\{1,9\}\), \(\{2,10\}\), \(\ldots \), \(\{7,15\}\) are utilized, the accesses to A are coalesced, as shown in Fig. 7b. Therefore, in the partial-reduction step, the thread groups are created according to Definition 1 below in order to ensure that the accesses to A are coalesced.

Fig. 7 Accesses to A

Definition 1

Assume that the size of the thread block is s, that h threads are assigned to each dot product in \(A^Tx\), and that \(z = s/h\). The thread groups are created as follows: \(\{0,z,2z,\ldots ,(h-1)z\}\), \(\{1,z+1,2z+1,\ldots ,(h-1)z+1\}\), \(\ldots \), \(\{z-1,2z-1,3z-1,\ldots ,(h-1)z+z-1\}\).

For each segment of x read into shared memory, the threads in each thread group perform a partial reduction in parallel, similar to that in the GEMV kernel.

Since the threads of a thread group are usually not in the same warp, the shuffle instruction cannot be used to reduce their partial results. Therefore, in the second stage, the partial results of the threads in each thread group are stored in shared memory and then reduced in parallel to a single output value.
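The sketch below illustrates the GEMV-T scheme with the thread-group layout of Definition 1: groupId selects the column and memberId the position inside the group, so that neighbouring threads read neighbouring columns of A and the loads stay coalesced. H, the block size and lda are our assumptions, and the second-stage shared-memory reduction is written as a simple tree; it is an illustration of the scheme in Fig. 6, not the authors' code.

```cuda
// Thread-group GEMV-T sketch: b = A^T x, A is m x n, row-major, padded to lda.
#include <cuda_runtime.h>

#define TBLOCK 128                     // thread block size s
#define H      4                       // threads per column (must divide TBLOCK)

__global__ void gemvt_thread_group(const float* __restrict__ A,
                                    const float* __restrict__ x,   // length m
                                    float* __restrict__ b,         // length n
                                    int m, int n, int lda)
{
    __shared__ float xP[TBLOCK];               // shared tile of x
    __shared__ float sdata[TBLOCK];            // per-thread partial sums

    const int z        = TBLOCK / H;           // columns handled per block
    const int groupId  = threadIdx.x % z;      // which column inside the block
    const int memberId = threadIdx.x / z;      // position inside the thread group
    const int col = blockIdx.x * z + groupId;

    float sum = 0.0f;

    // x-load + partial-reduction over tiles of x
    for (int tile = 0; tile < m; tile += TBLOCK) {
        const int r = tile + threadIdx.x;
        xP[threadIdx.x] = (r < m) ? x[r] : 0.0f;
        __syncthreads();

        if (col < n) {
            const int tileLen = min(TBLOCK, m - tile);
            // group members stride over rows; for a fixed row, neighbouring
            // threads touch neighbouring columns of A, so the loads are coalesced
            for (int i = memberId; i < tileLen; i += H)
                sum += A[(size_t)(tile + i) * lda + col] * xP[i];
        }
        __syncthreads();
    }

    // second stage: reduce the H partial sums of each thread group in shared memory
    sdata[threadIdx.x] = sum;
    __syncthreads();
    for (int stride = H / 2; stride > 0; stride >>= 1) {
        if (memberId < stride)
            sdata[groupId + memberId * z] += sdata[groupId + (memberId + stride) * z];
        __syncthreads();
    }
    if (col < n && memberId == 0)
        b[col] = sdata[groupId];
}
```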

4.3 Parallel \(l_1\)-min Solvers

4.3.1 FISTA and ALM Solvers

When parallelizing FISTA in Algorithm 1 on the GPU, six kernels are needed, as shown in Fig. 8a. To minimize the number of kernels, save time between kernel calls, and avoid loading vectors twice, we merge these kernels in two steps. The first step merges kernel 1 and kernel 2 into a single kernel; the second step merges the three vector-operation kernels into a single kernel. Thus, the total number of kernels is reduced from 6 to 3 (see Fig. 8b).

Fig. 8 Kernels in FISTA

Fig. 9 Concurrent multiple \(l_1\)-min solver

Among the kernels in Fig. 8b, the first two are easy to implement based on our proposed GEMV and GEMV-T methods. Although CUBLAS delivers high performance for vector operations, it does not allow several operations to be merged into a single kernel. Therefore, for the third kernel, we adopt the implementation method in [29], which supports merging several vector operations into one kernel. The resulting parallel FISTA on the GPU is called the FISTA solver.
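As an illustration of the benefit of merging, the following sketch fuses the soft-thresholding and the y-update of Eq. (3) into one pass over the vectors, so that x and y are read from global memory only once per iteration. It is a hedged stand-in for the merged Kernel 3 of Fig. 8b (the paper adopts the method of [29]); the scalar momentum \((t_k-1)/t_{k+1}\) is assumed to be computed on the host, and grad is assumed to already hold \(\frac{1}{L_f}A^T(Ay_k-b)\).

```cuda
// Fused FISTA vector update: soft-threshold, x swap and y-update in one kernel.
__global__ void fista_update_fused(const float* __restrict__ grad,   // (1/Lf) A^T(A y - b)
                                   float* __restrict__ x,            // x_k in, x_{k+1} out
                                   float* __restrict__ xPrev,        // receives x_k
                                   float* __restrict__ y,            // y_k in, y_{k+1} out
                                   float thresh,                     // lambda / Lf
                                   float momentum,                   // (t_k - 1) / t_{k+1}
                                   int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float xOld = x[i];
    const float u    = y[i] - grad[i];
    const float xNew = copysignf(fmaxf(fabsf(u) - thresh, 0.0f), u);  // soft(u, thresh)

    x[i]     = xNew;
    xPrev[i] = xOld;
    y[i]     = xNew + momentum * (xNew - xOld);
}
```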

For ALM in Algorithm 2, the x-min subproblem is solved by FISTA. Therefore, to parallelize ALM on the GPU, in addition to the three kernels of FISTA we need one more kernel for the dual update, \(\lambda _{k+1}=\lambda _k+\rho _k(Ax_{k+1}-b)\). Based on our proposed GEMV implementation, this dual-update kernel is easy to design. The parallel ALM on the GPU is called the ALM solver.
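A possible form of the dual-update kernel is sketched below; it assumes the product \(Ax_{k+1}\) has already been produced by the GEMV kernel and stored in the array Ax, and the names are illustrative.

```cuda
// ALM dual update: lambda_{k+1} = lambda_k + rho_k (A x_{k+1} - b).
__global__ void alm_dual_update(const float* __restrict__ Ax,
                                const float* __restrict__ b,
                                float* __restrict__ lambda,
                                float rho, int m)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m)
        lambda[i] += rho * (Ax[i] - b[i]);
}
```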

4.3.2 Concurrent Multiple \(l_1\)-min Solvers

In some real applications, multiple \(l_1\)-min problems must be solved concurrently. To meet this requirement, we take FISTA as an example and design a GPU-accelerated implementation, shown in Fig. 9, that solves multiple \(l_1\)-min problems concurrently. In this method, one thread block is assigned to each \(l_1\)-min problem, and within each thread block the parallel FISTA is constructed in the same way as the FISTA solver in Sect. 4.3.1. We call this concurrent multiple \(l_1\)-min solver MFISTASOL.

In MFISTASOL, every thread block needs to access the array a in global memory, so we let a be cached in the read-only data cache to reduce the number of global memory accesses. Through the read-only data cache, a is shared by the thread blocks resident on each SM and can be accessed quickly.
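In CUDA, data can be routed through the read-only data cache on devices that support it either by qualifying the pointer as const __restrict__ or by loading it explicitly with __ldg, as in the hedged fragment below; the kernel and its names are purely illustrative, not MFISTASOL itself.

```cuda
// Illustrative use of the read-only data cache for a read-only array a.
__global__ void axpy_readonly(const float* __restrict__ a,   // eligible for the read-only cache
                              float* __restrict__ out,
                              float alpha, int len)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len)
        out[i] += alpha * __ldg(&a[i]);                       // explicit read-only load
}
```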

4.4 Optimization

When FISTA or ALM is used to solve the \(l_1\)-min problem, the vector x becomes increasingly sparse as the iterations proceed, owing to the soft-thresholding operator in Algorithms 1 and 2. We can therefore exploit this sparsity to reduce the accesses to the global memory array a in the GEMV kernel of the FISTA solver, the ALM solver and MFISTASOL.

When the ith element of x is zero, the ith column of A need not be accessed, because it contributes nothing to the output vector. As the iterations of FISTA and ALM proceed, x becomes sparser and more columns of A are skipped. In this way, the performance of the FISTA solver, the ALM solver and MFISTASOL is improved by reducing accesses to the global memory array a. However, since we store A in the row-major, 0-based array a, the accesses to a in this case are no longer coalesced. In this study, we alleviate the resulting overhead through the L2 cache with 32-byte memory transactions.
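The sketch below shows the sparsity test in isolation, using a deliberately simplified one-thread-per-row GEMV so that the skip logic stands out; in the actual solvers the same test would sit inside the warp-group loop of the GEMV kernel. The kernel name and the layout assumptions (row-major, padded to lda) are ours.

```cuda
// Sparsity-aware GEMV sketch: columns with x[j] == 0 are never read from A.
__global__ void gemv_skip_zero_cols(const float* __restrict__ A,
                                    const float* __restrict__ x,
                                    float* __restrict__ b,
                                    int m, int n, int lda)
{
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;

    float sum = 0.0f;
    for (int j = 0; j < n; ++j) {
        const float xj = x[j];
        if (xj != 0.0f)                               // zero entries of x contribute nothing
            sum += A[(size_t)row * lda + j] * xj;
    }
    b[row] = sum;
}
```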

5 Experimental Results

In this section, we evaluate our proposed \(l_1\)-min solvers on the GPU from the following four aspects: (1) analyzing the validity of using the vector sparsity to optimize the GEMV kernel and the read-only data cache to improve MFISTASOL, (2) comparing the GEMV and GEMV-T kernels with the implementations in the CUBLAS library, (3) testing the performance of our proposed parallel \(l_1\)-min solvers, and (4) testing the performance of our proposed concurrent multiple \(l_1\)-min solver.

The experimental environments include one machine which is equipped with an Intel Xeon Quad-Core CPU and an NVIDIA GTX980 GPU and another machine with an Intel Xeon Quad-Core CPU and an NVIDIA GTX760 GPU. Our source codes are compiled and executed using the CUDA toolkit 6.5.

The measured GPU performance for all experiments does not include the data transfer (from the GPU to the CPU or from the CPU to the GPU). The test matrices are shown in Table 1. The element values of each test matrix are randomly generated according to the normal distribution.

Table 1 Test matrices

5.1 Experimental Analysis

First, we take Mat05 in Table 1 as an example to test the influence of exploiting the sparsity of the vector x on the performance of the GEMV (Ax) kernel. The GTX980 is used in this experiment. The ratio of the number of zero elements to the total number of elements in x, denoted \(\delta \), is set to 0.05, 0.10, 0.15, \(\ldots \), 0.90 and 0.95. Figure 10 shows the execution time ratios of the GEMV kernel without the sparsity optimization to the GEMV kernel with it for all \(\delta \) values. We observe that the execution time ratio increases as \(\delta \) increases. Therefore, exploiting the vector sparsity indeed improves the performance of the GEMV kernel.

Fig. 10 The execution time ratios for all \(\delta \) values

Second, we take the GTX980 as an example to verify the validity of using the read-only data cache to improve the performance of the concurrent multiple \(l_1\)-min solver (MFISTASOL). The test matrices in Table 1 are used, and 128 \(l_1\)-min problems are calculated concurrently. In MFISTASOL, the total number of iterations is set to 10, the initial \(x_0\) is randomly generated according to the normal distribution, and \(b=Ax_0\). The execution time ratios of MFISTASOL without the read-only data cache to MFISTASOL with the read-only data cache are shown in Fig. 11 for all test cases. For all test cases, the execution time ratio stays at around 1.2; thus the MFISTASOL performance is improved by using the read-only data cache.

Fig. 11 The execution time ratios for all test cases

Fig. 12 The performance comparison of the GEMV kernel and CUBLAS

5.2 Performance Comparison of Matrix-Vector Multiplications

We compare the GEMV and GEMV-T kernels with the corresponding implementations in CUBLAS. The performance is measured in GFlop/s, computed as \(2\times m\times n\) divided by the matrix-vector multiplication kernel execution time in seconds [30]. Figures 12 and 13 respectively compare the GEMV and GEMV-T kernels with CUBLAS. From Fig. 12, we see that on the GTX760 and GTX980, the GEMV kernel outperforms CUBLAS, obtaining average speedups of \(4.42\times \) and \(2.74\times \), respectively. On the GTX760 and GTX980, the average performance of the GEMV kernel is 57.69 GFlop/s and 83.81 GFlop/s, with standard deviations of 9.27 and 5.48, respectively, whereas CUBLAS only reaches average performance of 21.80 GFlop/s and 39.88 GFlop/s with standard deviations of 18.73 and 20.47. This verifies that for matrices of different scales, our proposed GEMV kernel consistently achieves high performance, whereas CUBLAS does not. From Fig. 13, the same conclusion holds for the GEMV-T kernel.

Fig. 13 The performance comparison of the GEMV-T kernel and CUBLAS

Fig. 14 GEMV. a The performance curves with m (\(n=100,000\)), b the performance curves with n (\(m=1000\))

Fig. 15 GEMV-T. a The performance curves with m (\(n=100,000\)), b the performance curves with n (\(m=1000\))

Next, we take the GTX980 as an example to further verify the above observations. The test setup is the same as in the introduction. Figure 14 shows the performance curves of the GEMV kernel and CUBLAS. For all cases, our proposed GEMV kernel consistently achieves around 80 GFlop/s, whereas CUBLAS has lower performance, and the gap between its maximum and minimum performance is substantial. For the test matrices with n fixed at 100,000, the performance of our proposed GEMV-T kernel is maintained at around 80 GFlop/s as m increases, as shown in Fig. 15a. The CUBLAS performance generally increases with m and is maintained at around 80 GFlop/s only after m exceeds 200. From Fig. 15b, for the test matrices with m fixed at 1000 and increasing n, we reach the same conclusion as in Fig. 15a for the GEMV-T kernel and CUBLAS.

Therefore, we conclude that our proposed matrix-vector multiplication implementations on the GPU are more robust than CUBLAS and consistently deliver high performance.

5.3 Performance of Parallel \(l_1\)-min Solvers

We test the parallel performance of our proposed FISTA and ALM solvers by comparing them with the corresponding implementations using the CUBLAS library. The FISTA solver and its CUBLAS-based counterpart are denoted GFISTA and BFISTA, respectively; the ALM solver and its CUBLAS-based counterpart are denoted GALM and BALM, respectively. CFISTA and CALM are the sequential CPU implementations corresponding to GFISTA and GALM. All experiments are conducted on the GTX980. For each \(l_1\)-min problem, the matrix A comes from Table 1, the initial \(x_0\) with 1024 non-zero elements is randomly generated according to the normal distribution, and \(b=Ax_0\). CFISTA, BFISTA, and GFISTA stop after 50 iterations for all test cases. In CALM, BALM, and GALM, the numbers of inner and outer iterations are set to 50 and 10, respectively. Table 2 lists the execution time of all algorithms for all test cases, and the speedups of BFISTA, GFISTA, BALM, and GALM are shown in Fig. 16. The time unit is second (denoted by s). From Table 2 and Fig. 16, we observe that compared to CFISTA, GFISTA obtains speedups ranging from 37.68 to 53.66 for all test cases, with an average speedup of 48.22, whereas BFISTA only achieves speedups ranging from 11.15 to 38.82, with an average speedup of 24.05. For GALM, compared with CALM, the maximum, minimum and average speedups are 51.21, 35.98 and 44.0, respectively, which are 2.04, 4.24 and 2.36 times those obtained by BALM. All these results show that our proposed FISTA and ALM solvers are highly parallel and outperform the corresponding implementations using the CUBLAS library.

Table 2 Execution time of all algorithms
Fig. 16 Speedups of all algorithms

Fig. 17 Execution time of Kernel 1, Kernel 2, and Kernel 3 in the selected iteration steps for BFISTA and GFISTA. a Mat07, b Mat12

In addition, we take Mat07 and Mat12 as examples to show the execution time of Kernel 1, Kernel 2, and Kernel 3 in the selected iteration steps for BFISTA and GFISTA in Fig. 17. The time unit is millisecond (denoted by ms). For each of the two test matrices, the execution time of Kernel 2 remains nearly constant over all selected iteration steps for both GFISTA and BFISTA, and Kernel 2 of GFISTA has almost the same execution time as that of BFISTA; this is in accordance with Fig. 13. The execution time of Kernel 3 is also almost invariant over the selected iteration steps for both solvers, but Kernel 3 of GFISTA is nearly 2.08 times faster than that of BFISTA owing to the merging of kernels. The execution time of Kernel 1 of BFISTA is nearly invariant over the selected iteration steps, whereas the execution time of Kernel 1 of GFISTA decreases as the iterations proceed because of the sparsity of the solution, and is much lower than that of Kernel 1 of BFISTA in every iteration step. These results further verify the efficiency of our proposed FISTA and ALM solvers.

5.4 Performance of the Concurrent Multiple \(l_1\)-min Solver

We test the performance of our proposed concurrent multiple \(l_1\)-min solver, MFISTASOL. The test setup is the same as in Sect. 5.3. For each test case, 128 \(l_1\)-min problems are calculated concurrently. All experimental results on the GTX980 are shown in Table 3. The time unit is second (denoted by s). We observe that compared to the sequential execution of the FISTA solver, MFISTASOL achieves an average speedup of around 3.0.

Table 3 Execution time of MFISTASOL

6 Conclusion

This paper proposes two robust implementations of the matrix-vector multiplication on the GPU. Based on these two implementations, we present two highly parallel \(l_1\)-min solvers, the FISTA solver and the ALM solver, which utilize the technique of merging kernels and the sparsity of the solution of \(l_1\)-min problems. To accommodate the practical need to solve multiple \(l_1\)-min problems concurrently, we also design a concurrent multiple \(l_1\)-min solver. Experimental results show that our proposed methods are efficient and achieve high performance.

In future work, we will continue this line of research and apply the proposed solvers to real-world problems.