1 Introduction

Vector quantization (VQ) [1, 2, 3] is an effective technique applicable in a wide range of areas, such as image compression [4], video compression [5, 6, 7], and bioinformatics [8]. The popularity of VQ comes from its low decoding complexity. However, the computational complexity of finding a useful codebook is very high. There are some suboptimal searching methods [3, 9, 10] that improve the encoding speed. In [3] and [9], data structures are imposed on the codebook after it is trained, e.g., tree-structured VQ [3] and K-D tree VQ [9]. In contrast, [10] achieves the same purpose by exploiting a property of the codebook, e.g., the ordering property of the self-organizing map (SOM) [11]. Although these methods can improve the encoding speed, they are not suitable for codebook training, which is the more time-consuming procedure.

The current generation of consumer-level GPUs has evolved to a stage that supports parallel processing and high-precision computation [12]. Moreover, GPUs have become increasingly suitable for general-purpose computations, such as sparse matrix solving [13], image-based relighting [14], wavelet transform [15, 16], and neural network simulation [17].

Even when only non-programmable GPUs were available, there were attempts to use GPUs to accelerate general-purpose computations. Bohn [18] proposed a GPU-based implementation of the online-mode training of SOM. However, GPUs cannot handle sequential operations efficiently, and such operations are essential for online SOM training.

Recently, Takizawa [19] proposed a CPU–GPU co-processing solution for the LBG algorithm, where the GPU is mainly responsible for the Euclidean distance calculation and the CPU for updating the codebook. In their implementation, a training iteration consists of a number of rendering passes, where each rendering pass is associated with one codevector. Within a rendering pass, the temporary nearest codevector index, as well as its corresponding distance, is updated. After all rendering passes in a training iteration, the nearest codevector indices, which reside in the GPU memory, are transferred to the main memory. Then, the codebook is updated sequentially by the CPU. With their GPU approach, the LBG algorithm can be accelerated by about 35 times compared with a decent CPU implementation. However, there are two bottlenecks in Takizawa’s approach. The first is the large amount of data transferred between the CPU and the GPU, which equals the number of training vectors (one index per vector). The second is the large number of rendering passes within a training iteration.

As GPUs evolve, many of the hardware limitations present at the time of Takizawa’s method [19] have been relieved. For example, at that time GPUs had more fragment processing units than vertex processing units, but the two are now unified, so we can use either the vertex shader or the fragment shader for the computationally intensive operations. Also, the maximum number of shader instructions has greatly increased. As a result, handling only one codevector per rendering pass to limit the instruction count is no longer necessary. Furthermore, the vertex shader has an advantage over the fragment shader in that it allows random writes.

This paper presents a unified GPU implementation of the LBG and SOM training algorithms based on the random-write ability of the vertex shader. The proposed scheme uses the GPU as the main computing processor instead of as a co-processor. In our approach, all the computations are carried out within the GPU, so the data transfer between the CPU and the GPU is minimized. Moreover, each iteration uses only one rendering pass, so the overhead between rendering passes is minimized. Our experimental results show that, for the LBG algorithm, our approach runs four times faster than Takizawa’s approach. In addition, by controlling the point size in the vertex shader, our approach is able to handle the SOM algorithm.

The rest of this paper is organized as follows. In Sect. 2, we briefly describe the background of the GPU and of the LBG and SOM algorithms. In Sect. 3, the implementation of both algorithms on the GPU is presented. The performance of our scheme and comparisons with other approaches are given in Sect. 4. Section 5 presents our conclusion.

2 Background

2.1 LBG

A VQ model contains a codebook \({\mathbb{Y}}=\{ \vec{c}_1,\ldots, \vec{c}_M \}\) in \({\mathbb{R}}^k\). Given an input vector \(\vec{x}\), the output is the index \(i^*\) whose corresponding codevector \(\vec{c}_{i^*}\) is the closest codevector to \(\vec{x}\). Given a dataset \({\mathbb{D}}=\{\vec{x}_1, \ldots, \vec{x}_N\}\) in \({\mathbb{R}}^k\) and the initial codebook \({\mathbb{Y}}(0)=\{ \vec{c}_1(0),\ldots, \vec{c}_M(0) \}\), the LBG training is summarized as follows:

  1. Set t = 0.

  2. Set \(\Upomega_i=\varnothing\) for all i = 1, …, M.

  3. For each training vector \(\vec{x}_j\), find the nearest codevector \(\vec{c}_{i^*}(t)\) and put \(\vec{x}_j\) into the subset \(\Upomega_{i^*}\). Therefore, \(\Upomega_{i^*}\) can be expressed as

     $$ \Upomega_{i^*}=\left\{ \vec{x}_j \left| \; \| \vec{x}_j - \vec{c}_{i^*} \|^2 \leq \| \vec{x}_j - \vec{c}_{i} \|^2, \ \forall i \neq i^* \right. \right\}. $$
     (1)

     Note that the subsets \(\Upomega_{i}\) are non-overlapping.

  4. Update the codebook:

     $$ \vec{c}_i (t+1) = \frac{1}{|\Upomega_i|} \sum_{\vec{x}_j \in \Upomega_i} \vec{x}_j, \quad \forall i. $$
     (2)

  5. Set t = t + 1 and go to Step 2.

The above iterative procedure repeats until the quantization error is less than a threshold or the number of iterations has reached a pre-defined number.
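
For concreteness, the following C++ sketch implements one iteration of the above procedure (Steps 2–4) on the CPU. It is only a minimal illustration; the names (`Vec`, `sq_dist`, `lbg_iteration`) are ours and do not come from the paper's implementation.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Squared Euclidean distance between two k-dimensional vectors.
static float sq_dist(const Vec& a, const Vec& b) {
    float d = 0.0f;
    for (std::size_t c = 0; c < a.size(); ++c) {
        const float diff = a[c] - b[c];
        d += diff * diff;
    }
    return d;
}

// One LBG iteration (Steps 2-4): partition the data into the subsets
// Omega_i of Eq. (1), then replace each codevector by the centroid of its
// subset as in Eq. (2).
void lbg_iteration(const std::vector<Vec>& data, std::vector<Vec>& codebook) {
    const std::size_t M = codebook.size();
    const std::size_t k = codebook[0].size();

    // Step 2: Omega_i is empty for all i.
    std::vector<std::vector<std::size_t>> omega(M);

    // Step 3: put each training vector into the subset of its nearest codevector.
    for (std::size_t j = 0; j < data.size(); ++j) {
        std::size_t best = 0;
        float best_d = sq_dist(data[j], codebook[0]);
        for (std::size_t i = 1; i < M; ++i) {
            const float d = sq_dist(data[j], codebook[i]);
            if (d < best_d) { best_d = d; best = i; }
        }
        omega[best].push_back(j);
    }

    // Step 4: centroid update.
    for (std::size_t i = 0; i < M; ++i) {
        if (omega[i].empty()) continue;   // leave an unused codevector unchanged
        Vec centroid(k, 0.0f);
        for (std::size_t j : omega[i])
            for (std::size_t c = 0; c < k; ++c) centroid[c] += data[j][c];
        for (std::size_t c = 0; c < k; ++c)
            codebook[i][c] = centroid[c] / omega[i].size();
    }
}
```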

2.2 SOM

Prior to the SOM training [11, 20, 21, 22], a neighborhood structure is imposed on the codebook. The neighborhood structure is represented by a graph \({\mathbb{G}}=\{{\mathbb{V}}, {\mathbb{E}}\}\), where \({\mathbb{V}}=\{ v_1, \ldots, v_M \}\) is the set of vertices and \({\mathbb{E}}\) is the set of edges of this graph. In this representation, each vertex \(v_i\) is associated with a codevector \(\vec{c}_i\).

If codevectors \(\vec{c}_i\) and \(\vec{c}_{i'}\) are joined by an edge, \(\vec{c}_i\) and \(\vec{c}_{i'}\) are neighbors, and the topological distance between two neighbors is defined as 1. In general, the topological distance between \(\vec{c}_i\) and \(\vec{c}_j\) is the minimum number of edges connecting \(v_i\) and \(v_j\) in \({\mathbb{G}}\). If vertices \(v_i\) and \(v_j\) are separated by a topological distance of no more than u, their associated codevectors \(\vec{c}_i\) and \(\vec{c}_j\) are called level-u neighbors. For a given codevector \(\vec{c}_i\), the index set of its level-\(u_t\) neighbors is denoted by \({\mathbb{I}}_i(u_t)\), where \(u_t\) is the neighborhood level at time t.
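
The index sets \({\mathbb{I}}_i(u)\) can be pre-computed from \({\mathbb{G}}\) with a breadth-first search. The following C++ sketch illustrates the idea under the assumption that the graph is stored as adjacency lists; the function name `level_u_neighbors` is ours, for illustration only.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Indices of all vertices whose topological distance from vertex i is at
// most u, i.e. the index set I_i(u), found by breadth-first search over the
// neighborhood graph G (given as adjacency lists).
std::vector<std::size_t> level_u_neighbors(
        const std::vector<std::vector<std::size_t>>& adjacency,
        std::size_t i, unsigned u) {
    std::vector<int> dist(adjacency.size(), -1);   // -1 means "not reached yet"
    std::queue<std::size_t> frontier;
    dist[i] = 0;
    frontier.push(i);

    std::vector<std::size_t> result{i};            // distance 0: the vertex itself
    while (!frontier.empty()) {
        const std::size_t v = frontier.front();
        frontier.pop();
        if (dist[v] == static_cast<int>(u)) continue;  // do not expand beyond level u
        for (std::size_t w : adjacency[v]) {
            if (dist[w] < 0) {
                dist[w] = dist[v] + 1;
                result.push_back(w);
                frontier.push(w);
            }
        }
    }
    return result;
}
```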

Some frequently used structures of SOM are shown in Fig. 1, where each dot represents a vertex and each line represents an edge. Note that the figure is only a graphical representation and it does not reflect any actual geometric information of the codebook.

Fig. 1

Three neighborhood structures of SOM. a a linear structure; b regular grid; c diamond-regular grid

Since the sequential SOM training [11] is not suitable for parallelization, we consider the batch-mode SOM [11] instead. The idea of SOM training is similar to that of LBG: based on the current codebook \({\mathbb{Y}}(t)\), we partition the dataset into M overlapping subsets \(\Upomega_i\). The batch SOM training is summarized as follows:

  1. Set t = 0.

  2. Set \(\Upomega_i=\varnothing\) for all i = 1, …, M.

  3. For each training vector \(\vec{x}_j\), find the nearest codevector \(\vec{c}_{i^*}(t)\) and put \(\vec{x}_j\) into the subsets \(\Upomega_{i'}\), where \(i' \in {\mathbb{I}}_{i^*}(u_t)\); the level \(u_t\) decreases to zero during the training process.

  4. Update the codebook:

     $$ \vec{c}_i (t+1) = \frac{1}{|\Upomega_i|} \sum_{\vec{x}_j \in \Upomega_i} \vec{x}_j. $$
     (3)

  5. Set t = t + 1 and go to Step 2.

The above iterative procedure repeats until the quantization error is less than a threshold or the number of iterations has reached a pre-defined number.
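
A minimal C++ sketch of one batch SOM iteration is given below. It differs from the LBG sketch in Sect. 2.1 only in Step 3: the training vector contributes to the subsets of all level-\(u_t\) neighbors of the winning codevector, so the subsets overlap. The sketch reuses the hypothetical helpers `Vec`, `sq_dist`, and `level_u_neighbors` defined in the earlier sketches.

```cpp
// One batch SOM iteration; reuses Vec, sq_dist and level_u_neighbors from
// the earlier sketches.  The subsets are kept as running sums and counters.
void som_iteration(const std::vector<Vec>& data,
                   std::vector<Vec>& codebook,
                   const std::vector<std::vector<std::size_t>>& adjacency,
                   unsigned u_t) {
    const std::size_t M = codebook.size();
    const std::size_t k = codebook[0].size();

    std::vector<Vec> sum(M, Vec(k, 0.0f));
    std::vector<std::size_t> count(M, 0);

    for (const Vec& x : data) {
        // Step 3a: find the nearest codevector (the winner).
        std::size_t best = 0;
        float best_d = sq_dist(x, codebook[0]);
        for (std::size_t i = 1; i < M; ++i) {
            const float d = sq_dist(x, codebook[i]);
            if (d < best_d) { best_d = d; best = i; }
        }
        // Step 3b: add x to the subset of every level-u_t neighbor of the winner.
        for (std::size_t i : level_u_neighbors(adjacency, best, u_t)) {
            for (std::size_t c = 0; c < k; ++c) sum[i][c] += x[c];
            ++count[i];
        }
    }

    // Step 4: centroid update, Eq. (3).
    for (std::size_t i = 0; i < M; ++i) {
        if (count[i] == 0) continue;
        for (std::size_t c = 0; c < k; ++c) codebook[i][c] = sum[i][c] / count[i];
    }
}
```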

If the ordering property of SOM is preserved, the Euclidean distance in the data space between two neighboring codevectors is usually small. Figure 2 shows a typical result of a SOM. As shown in Fig. 2a, the initial codevectors are distributed in a random manner and do not form a good topological ordering. In contrast, the codevectors form a good topological ordering after training as shown in Fig. 2b.

Fig. 2

Codevectors in the data space. a Initial codevectors before training. b A typical result of a 10 × 10 diamond-regular SOM after training. To indicate the neighborhood structure, an edge is drawn between two codevectors if they are neighbors in the graph. There are 1024 2D samples in the training set, uniformly distributed over [−1, 1]

2.3 Background on the GPU

Recently, the demand from the gaming and film-making industries for accelerated 3D rendering has driven the development of the graphics processing unit (GPU). GPU operations are controlled by user-defined programs called shaders. Since a GPU is a chip consisting of multiple processing units, it can be considered a virtual SIMD computer.

Figure 3 shows the graphics pipeline of the GPU. Logically, it consists of two programmable units, namely the vertex processing unit and the fragment processing unit. Programs controlling the vertex processing unit are called vertex shaders, whereas programs controlling the fragment processing unit are called fragment shaders. Primarily, vertex shaders perform the transformations of vertices, and fragment shaders compute the colors of pixels. In particular, a fragment shader can read and modify a texture. Note that a texture can be considered a 2D array of texels, where each texel has four color channels, namely R, G, B, and A.

Fig. 3

The graphics pipeline of GPU

3 Implementations

3.1 GPU-based LBG implementation

Intuitively, we can view our implementation as a multiclient–multiserver model, where training vectors are clients and codevectors are servers. Each client (a training vector) determines its nearest codevector and sends itself to the corresponding server. When training vectors arrive at a server, the server accumulates them and updates the count of arrived training vectors. After all clients are served, each server holds the accumulation of its arrived training vectors and their count. Finally, each server performs a simple division to update its codevector.
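
The server side of this analogy is summarized by the small C++ sketch below (the struct name `Server` is ours): a running sum, a counter, and a final division. As described in the rest of this section, the GPU realizes exactly this with additive blending and an extra count channel.

```cpp
#include <cstddef>
#include <vector>

// A "server" in the multiclient-multiserver analogy: it accumulates the
// training vectors sent to it and counts how many have arrived.  After all
// clients are served, a single division yields the updated codevector.
struct Server {
    std::vector<float> sum;   // accumulated training vectors
    float count = 0.0f;       // number of arrived training vectors

    explicit Server(std::size_t k) : sum(k, 0.0f) {}

    void receive(const std::vector<float>& x) {
        for (std::size_t c = 0; c < sum.size(); ++c) sum[c] += x[c];
        count += 1.0f;        // the GPU blends a constant "1" into an extra channel
    }

    std::vector<float> updated_codevector() const {
        std::vector<float> c(sum);
        if (count > 0.0f)
            for (float& v : c) v /= count;
        return c;
    }
};
```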

In our setting, there are N training vectors and M codevectors, and the dimension of the vectors is k. From the shader-programming point of view, each training vector is a vertex, and a codebook is organized as a number of textures. The logical structure of a codebook in the GPU is shown in Fig. 4. The resolution of the textures is 1 × M, and each texture position corresponds to a codevector. We use \(\lceil{\frac{k+1}{4}}\rceil\) textures to hold a codebook (note that a texel has four channels, R, G, B, and A, and hence can hold four values). An extra channel is used to store the number of arrived training vectors for each codevector.
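
As a concrete illustration of this layout, the snippet below computes how many 1 × M textures a k-dimensional codebook requires and which texture and channel hold each component; the count occupies the channel directly after the last vector component. The helper names are ours, for illustration only.

```cpp
#include <cstddef>

// Number of 1 x M RGBA textures needed to hold a k-dimensional codebook
// plus one extra channel for the per-codevector count: ceil((k + 1) / 4).
constexpr std::size_t textures_needed(std::size_t k) {
    return (k + 1 + 3) / 4;
}

// Texture index and channel (0 = R, 1 = G, 2 = B, 3 = A) that store
// component d of a codevector; d == k addresses the extra count channel.
struct TexelLocation { std::size_t texture; std::size_t channel; };

constexpr TexelLocation component_location(std::size_t d) {
    return { d / 4, d % 4 };
}

// Example: for k = 16, textures_needed(16) == 5, and the count is stored in
// texture 4 (the fifth texture), channel R, since component_location(16)
// yields {4, 0}.
```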

Fig. 4

The logical structure of a codebook in GPU

Before the GPU training, all training vectors are uploaded to the GPU memory as a display list of points, and two sets of textures are initialized. One set of textures holds the old codebook for “reading”; it is initialized with the codevectors prepared by the CPU. The other set holds the updated codebook for “writing”. Notice that the read/write ability of textures is realized by using frame buffer objects (FBOs).

The flow diagram of our GPU-based LBG implementation is shown in Fig. 5. For each training vector, the vertex shader performs the nearest-codevector search and finds the corresponding texel position. Then, a point of size one is rasterized at that texel position. In the fragment shader, the received training vectors are blended into the writing textures. Besides the received training vectors, a constant “1” is also blended into the extra color channel of the writing textures. Afterward, the writing textures hold, for each codevector, the accumulated training vectors and their count. See the appendix for a full example of the vertex shader and the fragment shader. Finally, an additional fragment shader updates the codevectors by dividing the accumulations by the counts.

Fig. 5

The flow diagram of our GPU-based LBG implementation

Our implementation of LBG algorithm is summarized in the following steps:

  1. Pass the training vectors to the vertex shader by drawing the display list of points.

  2. Use the vertex shader to perform the nearest-codevector search and find the corresponding texel position, such that a point of size one is rasterized at that texel position. The training vector is also passed to the fragment shader for further processing.

  3. Use a fragment shader to update the codebook. The received training vectors are blended into the writing textures. Besides the received training vectors, a constant “1” is also blended into the extra color channel of the writing textures.

  4. Swap the reading texture set and the writing texture set (Footnote 1).

  5. Use an additional fragment shader to update the codevectors. First, read the accumulated values from the reading texture set. Second, divide the accumulated values by the number of training vectors of each codevector. Output the new codevectors to the writing texture set.

  6. Swap the reading textures and the writing textures.

  7. Check the termination conditions on the CPU. If the conditions are satisfied, stop. Otherwise, go to Step 1.
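
On the host side, one iteration of these steps might be structured as in the C++ sketch below. It is only an outline under several assumptions: a valid OpenGL context, already compiled shader programs (`searchAndScatterProg`, `divideProg`), a display list `trainingPoints` holding the training vectors, and two FBOs (`readFbo`, `writeFbo`) with the codebook textures attached. These names are ours, and the multiple-render-target setup, texture bindings, and error handling are omitted.

```cpp
#include <GL/glew.h>
#include <utility>

// One training iteration of the GPU-based LBG scheme (Steps 1-6 above),
// sketched on the host side.  All handles are assumed to be created
// elsewhere; MRT and texture-binding details are omitted.
void lbg_gpu_iteration(GLuint searchAndScatterProg, GLuint divideProg,
                       GLuint trainingPoints,       // display list of N points
                       GLuint& readFbo, GLuint& writeFbo,
                       GLsizei M) {
    // Steps 1-3: scatter pass.  The vertex shader finds the nearest
    // codevector and moves the point to its texel; additive blending
    // accumulates the vector components and the constant "1" count.
    glBindFramebuffer(GL_FRAMEBUFFER, writeFbo);
    glViewport(0, 0, M, 1);                       // one texel per codevector
    glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
    glClear(GL_COLOR_BUFFER_BIT);
    glEnable(GL_BLEND);
    glBlendEquation(GL_FUNC_ADD);
    glBlendFunc(GL_ONE, GL_ONE);
    glUseProgram(searchAndScatterProg);
    glCallList(trainingPoints);                   // draw all training vectors as points
    glDisable(GL_BLEND);

    // Step 4: swap the reading and writing texture sets.
    std::swap(readFbo, writeFbo);

    // Step 5: division pass.  A fragment shader reads the accumulated sums
    // and counts and writes the new codevectors; drawing a quad covering
    // the 1 x M target generates one fragment per codevector.
    glBindFramebuffer(GL_FRAMEBUFFER, writeFbo);
    glUseProgram(divideProg);
    glBegin(GL_QUADS);
    glVertex2f(-1.0f, -1.0f); glVertex2f(1.0f, -1.0f);
    glVertex2f(1.0f, 1.0f);   glVertex2f(-1.0f, 1.0f);
    glEnd();

    // Step 6: swap again so the new codebook is on the reading side.
    std::swap(readFbo, writeFbo);
}
```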

3.2 GPU-based SOM implementation

In the SOM implementation, a codebook is also organized as a number of 2D textures, as shown in Fig. 6. For each training vector, the vertex shader finds its nearest codevector as well as the corresponding texel position. Afterward, instead of a point of size one, a point of size \(2u_t + 1\) is rasterized at that texel position. The remaining two fragment shaders are the same as those of the LBG implementation. The main difference is that, in the SOM case, each training vector updates not only the nearest codevector but also its level-\(u_t\) neighborhood. Hence, the texel positions of codevectors with topological distance less than or equal to \(u_t\) are updated, as shown in Fig. 7. The neighborhood size \(u_t\) decreases with t and is given by

$$ u_t = \begin{cases} 2, & \text{if } 0 \le t < T/8 \\ 1, & \text{if } T/8 \le t < T/2 \\ 0, & \text{otherwise} \end{cases} $$
(4)

where T is the maximum number of iterations.
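
As a small worked example of Eq. (4) and the point-size rule, the snippet below (function names ours) returns the neighborhood level for iteration t and the corresponding point size; a point of size \(2u_t + 1\) rasterizes \((2u_t + 1)^2\) fragments on the 2D codebook texture, one per updated codevector.

```cpp
// Neighborhood level u_t from Eq. (4); T is the maximum number of iterations.
unsigned neighborhood_level(unsigned t, unsigned T) {
    if (t < T / 8) return 2;
    if (t < T / 2) return 1;
    return 0;
}

// Point size used by the vertex shader: a point of size 2*u_t + 1 covers
// (2*u_t + 1)^2 texels, i.e. the winner and its level-u_t grid neighbors.
unsigned point_size(unsigned u_t) { return 2 * u_t + 1; }

// Example with T = 80: iterations 0-9 use u_t = 2 (point size 5, 25 texels),
// iterations 10-39 use u_t = 1 (point size 3, 9 texels), and the remaining
// iterations use u_t = 0 (point size 1, a single texel).
```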

Fig. 6

The logical structure of a 10 × 10 2D SOM codebook in GPU

Fig. 7

In our GPU SOM implementation, we update a number of texels by controlling the point size. In a, \(u_t = 1\) (the point size is 3), so a training vector generates 9 fragments and the corresponding 9 codevectors are updated. Similarly, in b, \(u_t = 2\) (the point size is 5), so a training vector generates 25 fragments and the corresponding 25 codevectors are updated

4 Results and discussion

To evaluate the speed of our algorithm, we compared our approach with Takizawa’s approach and a CPU implementation. All of the experimental results were obtained on a PC with the configuration listed in Table 1.

Table 1 The configuration of the computer used in our experiments

4.1 LBG algorithm

We implemented three approaches for the LBG training: the CPU approach, Takizawa’s approach, and our GPU approach. The dimension k of the data vectors is 16. By varying the number of training vectors N and the number of codevectors M, the number of iterations per second is measured for various settings. The results are summarized in Table 2.

Table 2 Speed of the LBG scheme

As shown in Table 2, our GPU-based LBG approach runs around 17–50 times faster than the CPU-based LBG approach, and the speed improvement grows as the number of codevectors increases. Compared with Takizawa’s scheme, our GPU-based LBG approach runs 1.5–4 times faster. The improvement is due to the reduction in the number of rendering passes and in the amount of data transferred between the CPU and GPU memories.

4.2 SOM

We apply our scheme to two vector-quantization scenarios: 3-dimensional RGB color quantization and 16-dimensional vector quantization. For the sake of speed comparison, we keep \(u_t\) constant during the training. Tables 3 and 4 show the training speeds of the CPU-based SOM and our GPU-based SOM for different point sizes. As shown in the tables, our GPU-based SOM approach runs around 25 to 100 times faster than the CPU-based SOM approach. In particular, when the image size is 1,024 × 1,024, the SOM resolution is 8 × 8 (64 codevectors), and the point size is 5, our GPU-based SOM approach runs 25 times faster. When we increase the resolution to 16 × 16 (256 codevectors), our GPU-based SOM approach runs 70 times faster. In general, the speed improvement is larger when the SOM resolution, i.e., the number of codevectors, increases.

Table 3 Speed of CPU SOM Scheme for color quantization
Table 4 Speed of our GPU SOM scheme for color quantization

We also apply our scheme to the clustering of high-dimensional data. The data dimension is set to 16. In the experiment, we vary the number N of training vectors and the number M of codevectors and measure the number of iterations per second for various settings. Tables 5 and 6 show the training speeds of the CPU-based SOM and our GPU-based SOM for different point sizes. From the tables, our GPU-based SOM approach runs around 15 to 40 times faster than the CPU-based SOM approach. The general trend of the improvement is similar to that of the color-quantization case: the speed improvement is larger when we increase the number of codevectors, i.e., the SOM resolution.

Table 5 Speed of our CPU SOM Scheme
Table 6 Speed of our GPU SOM Scheme

5 Conclusion

In this paper, we proposed a novel GPU implementation of the LBG and SOM training algorithms. By using the vertex shader to perform the nearest-codevector search and random writes, the functionality of GPUs is better utilized. Our experimental results show that the proposed scheme is remarkably faster than the conventional GPU approach. Moreover, our approach unifies the implementations of LBG and SOM training.