1 Introduction

Vector quantization (VQ) [1, 2, 3] is an effective technique applicable in a wide range of areas, such as image compression [4], video compression [5, 6, 7], and bioinformatics [8]. The popularity of VQ comes from its low decoding complexity. However, the computational complexity of finding a useful codebook is very high. There are some suboptimal searching methods [3, 9, 10] that improve the encoding speed. In [3] and [9], data structures are imposed on the codebook after it is trained, e.g., tree-structured VQ [3] and K-D tree VQ [9]. In contrast, [10] achieves the same purpose by exploiting a property of the codebook, e.g., the ordering property of the self-organizing map (SOM) [11]. Although these methods can improve the encoding speed, they are not suitable for codebook training, which is the more time-consuming procedure.

The current generation of consumer-level GPUs has evolved to a stage that supports parallel processing and high-precision computation [12]. Moreover, GPUs have become increasingly suitable for general-purpose computations, such as sparse matrix solving [13], image-based relighting [14], wavelet transform [15, 16], and neural network simulation [17].

Even when only non-programmable GPUs were available, there were attempts to use GPUs to accelerate general-purpose computations. Bohn [18] proposed a GPU-based implementation of the online-mode training of SOM. However, GPUs cannot handle sequential operations efficiently, and such operations are essential for online SOM training.

Recently, Takizawa [19] proposed a CPU–GPU co-processing solution for the LBG algorithm, where the GPU is mainly responsible for the Euclidean distance calculation and the CPU for updating the codebook. In their implementation, a training iteration consists of a number of rendering passes, where each rendering pass is associated with one codevector. Within a rendering pass, the temporary nearest codevector index, as well as its corresponding distance, is updated. After all rendering passes in a training iteration, the nearest codevector indices, which reside in the GPU memory, are transferred to the main memory. Then, the codebook is updated sequentially by the CPU. With their GPU approach, the LBG algorithm can be accelerated by about 35 times compared with a decent CPU implementation. However, there are two bottlenecks in Takizawa’s approach. The first is the large amount of data transferred between the CPU and the GPU, which equals the number of training vectors (one index per vector). The second is the large number of rendering passes within a training iteration.

As GPUs evolve, many of the hardware limitations present at the time of Takizawa’s method [19] have been relieved. For example, at that time GPUs had more fragment processing units than vertex processing units, but the two are now unified, so we can use either the vertex shader or the fragment shader for the computationally intensive operations. Also, the maximum number of shader instructions has greatly increased. As a result, handling only one codevector per rendering pass to limit the instruction count is no longer necessary. Furthermore, the vertex shader has an advantage over the fragment shader in that it allows random writes.

This paper presents a unified GPU implementation of the LBG and SOM training algorithms based on the random-write ability of the vertex shader. The proposed scheme uses the GPU as the main computing processor instead of as a co-processor. In our approach, all the computations are carried out within the GPU, so the data transfer between the CPU and the GPU is minimized. Moreover, each iteration uses only one rendering pass, so the overhead between rendering passes is minimized. Our experimental results show that, for the LBG algorithm, our approach runs four times faster than Takizawa’s approach. In addition, by controlling the point size in the vertex shader, our approach is able to handle the SOM algorithm.

The rest of this paper is organized as follows. In Sect. 2, we briefly describe the background of the GPU and of the LBG and SOM algorithms. In Sect. 3, the implementation of both algorithms on the GPU is presented. The performance of our scheme and comparisons with other approaches are given in Sect. 4. Section 5 presents our conclusion.

2 Background

2.1 LBG

A VQ model contains a codebook \({\mathbb{Y}}=\{ \vec{c}_1,\ldots, \vec{c}_M \}\) in \({\mathbb{R}}^k\). Given an input vector \(\vec{x}\), the output is the index \(i^*\) whose corresponding codevector \(\vec{c}_{i^*}\) is the closest codevector to \(\vec{x}\). Given a dataset \({\mathbb{D}}=\{\vec{x}_1, \ldots, \vec{x}_N\}\) in \({\mathbb{R}}^k\) and the initial codebook \({\mathbb{Y}}(0)=\{ \vec{c}_1(0),\ldots, \vec{c}_M(0) \}\), the LBG training is summarized as follows:

  1. Set t = 0.

  2. Set \(\Upomega_i=\varnothing\) for all i = 1, …, M.

  3. For each training vector \(\vec{x}_j\), find the nearest codevector \(\vec{c}_{i^*}(t)\) and put \(\vec{x}_j\) into the subset \(\Upomega_{i^*}\). Therefore, \(\Upomega_{i^*}\) can be expressed as

     $$ \Upomega_{i^*}=\left\{ \vec{x}_j \left| \; \| \vec{x}_j - \vec{c}_{i^*} \|^2 \leq \| \vec{x}_j - \vec{c}_{i} \|^2, \ \forall i \neq i^* \right. \right\}. $$
     (1)

     Note that the subsets \(\Upomega_{i}\) are non-overlapping.

  4. Update the codebook:

     $$ \vec{c}_i (t+1) = \frac{1}{|\Upomega_i|} \sum_{\vec{x}_j \in \Upomega_i} \vec{x}_j, \quad \forall i. $$
     (2)

  5. Set t = t + 1 and go to Step 2.

The above iterative procedure repeats until the quantization error is less than a threshold or the number of iterations has reached a pre-defined number.
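
For concreteness, the following C++ sketch implements one iteration of the above procedure (Steps 2–4) on the CPU. It is only a minimal illustration; the names (`Vec`, `sq_dist`, `lbg_iteration`) are ours and do not come from the paper's implementation.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Squared Euclidean distance between two k-dimensional vectors.
static float sq_dist(const Vec& a, const Vec& b) {
    float d = 0.0f;
    for (std::size_t c = 0; c < a.size(); ++c) {
        const float diff = a[c] - b[c];
        d += diff * diff;
    }
    return d;
}

// One LBG iteration (Steps 2-4): partition the data into the subsets
// Omega_i of Eq. (1), then replace each codevector by the centroid of its
// subset as in Eq. (2).
void lbg_iteration(const std::vector<Vec>& data, std::vector<Vec>& codebook) {
    const std::size_t M = codebook.size();
    const std::size_t k = codebook[0].size();

    // Step 2: Omega_i is empty for all i.
    std::vector<std::vector<std::size_t>> omega(M);

    // Step 3: put each training vector into the subset of its nearest codevector.
    for (std::size_t j = 0; j < data.size(); ++j) {
        std::size_t best = 0;
        float best_d = sq_dist(data[j], codebook[0]);
        for (std::size_t i = 1; i < M; ++i) {
            const float d = sq_dist(data[j], codebook[i]);
            if (d < best_d) { best_d = d; best = i; }
        }
        omega[best].push_back(j);
    }

    // Step 4: centroid update.
    for (std::size_t i = 0; i < M; ++i) {
        if (omega[i].empty()) continue;   // leave an unused codevector unchanged
        Vec centroid(k, 0.0f);
        for (std::size_t j : omega[i])
            for (std::size_t c = 0; c < k; ++c) centroid[c] += data[j][c];
        for (std::size_t c = 0; c < k; ++c)
            codebook[i][c] = centroid[c] / omega[i].size();
    }
}
```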

2.2 SOM

Prior to the SOM training [11, 20, 21, 22], a neighborhood structure is imposed on the codebook. The neighborhood structure is represented by a graph \({\mathbb{G}}=\{{\mathbb{V}}, {\mathbb{E}}\}\), where \({\mathbb{V}}=\{ v_1, \ldots, v_M \}\) is the set of vertices and \({\mathbb{E}}\) is the set of edges of this graph. In this representation, each vertex \(v_i\) is associated with a codevector \(\vec{c}_i\).

If codevectors \(\vec{c}_i\) and \(\vec{c}_{i'}\) are joined by an edge, \(\vec{c}_i\) and \(\vec{c}_{i'}\) are neighbors, and the topological distance between two neighbors is defined as 1. In general, the topological distance between \(\vec{c}_i\) and \(\vec{c}_j\) is the minimum number of edges connecting \(v_i\) and \(v_j\) in \({\mathbb{G}}\). If vertices \(v_i\) and \(v_j\) are separated by a topological distance of no more than u, their associated codevectors \(\vec{c}_i\) and \(\vec{c}_j\) are called level-u neighbors. For a given codevector \(\vec{c}_i\), the index set of its level-\(u_t\) neighbors is denoted by \({\mathbb{I}}_i(u_t)\), where \(u_t\) is the neighborhood level at time t.
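
The index sets \({\mathbb{I}}_i(u)\) can be pre-computed from \({\mathbb{G}}\) with a breadth-first search. The following C++ sketch illustrates the idea under the assumption that the graph is stored as adjacency lists; the function name `level_u_neighbors` is ours, for illustration only.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Indices of all vertices whose topological distance from vertex i is at
// most u, i.e. the index set I_i(u), found by breadth-first search over the
// neighborhood graph G (given as adjacency lists).
std::vector<std::size_t> level_u_neighbors(
        const std::vector<std::vector<std::size_t>>& adjacency,
        std::size_t i, unsigned u) {
    std::vector<int> dist(adjacency.size(), -1);   // -1 means "not reached yet"
    std::queue<std::size_t> frontier;
    dist[i] = 0;
    frontier.push(i);

    std::vector<std::size_t> result{i};            // distance 0: the vertex itself
    while (!frontier.empty()) {
        const std::size_t v = frontier.front();
        frontier.pop();
        if (dist[v] == static_cast<int>(u)) continue;  // do not expand beyond level u
        for (std::size_t w : adjacency[v]) {
            if (dist[w] < 0) {
                dist[w] = dist[v] + 1;
                result.push_back(w);
                frontier.push(w);
            }
        }
    }
    return result;
}
```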

Some frequently used structures of SOM are shown in Fig. 1, where each dot represents a vertex and each line represents an edge. Note that the figure is only a graphical representation and it does not reflect any actual geometric information of the codebook.

Fig. 1

Three neighborhood structures of SOM. a a linear structure; b regular grid; c diamond-regular grid

Since the sequential SOM training [11] is not suitable for parallelization, we consider the batch-mode SOM [11] instead. The idea of SOM training is similar to that of LBG: based on the current codebook \({\mathbb{Y}}(t)\), we partition the dataset into M overlapping subsets \(\Upomega_i\). The batch SOM training is summarized as follows:

  1. Set t = 0.

  2. Set \(\Upomega_i=\varnothing\) for all i = 1, …, M.

  3. For each training vector \(\vec{x}_j\), find the nearest codevector \(\vec{c}_{i^*}(t)\) and put \(\vec{x}_j\) into the subsets \(\Upomega_{i'}\), where \(i' \in {\mathbb{I}}_{i^*}(u_t)\); the level \(u_t\) decreases to zero during the training process.

  4. Update the codebook:

     $$ \vec{c}_i (t+1) = \frac{1}{|\Upomega_i|} \sum_{\vec{x}_j \in \Upomega_i} \vec{x}_j. $$
     (3)

  5. Set t = t + 1 and go to Step 2.

The above iterative procedure repeats until the quantization error is less than a threshold or the number of iterations has reached a pre-defined number.
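
A minimal C++ sketch of one batch SOM iteration is given below. It differs from the LBG sketch in Sect. 2.1 only in Step 3: the training vector contributes to the subsets of all level-\(u_t\) neighbors of the winning codevector, so the subsets overlap. The sketch reuses the hypothetical helpers `Vec`, `sq_dist`, and `level_u_neighbors` defined in the earlier sketches.

```cpp
// One batch SOM iteration; reuses Vec, sq_dist and level_u_neighbors from
// the earlier sketches.  The subsets are kept as running sums and counters.
void som_iteration(const std::vector<Vec>& data,
                   std::vector<Vec>& codebook,
                   const std::vector<std::vector<std::size_t>>& adjacency,
                   unsigned u_t) {
    const std::size_t M = codebook.size();
    const std::size_t k = codebook[0].size();

    std::vector<Vec> sum(M, Vec(k, 0.0f));
    std::vector<std::size_t> count(M, 0);

    for (const Vec& x : data) {
        // Step 3a: find the nearest codevector (the winner).
        std::size_t best = 0;
        float best_d = sq_dist(x, codebook[0]);
        for (std::size_t i = 1; i < M; ++i) {
            const float d = sq_dist(x, codebook[i]);
            if (d < best_d) { best_d = d; best = i; }
        }
        // Step 3b: add x to the subset of every level-u_t neighbor of the winner.
        for (std::size_t i : level_u_neighbors(adjacency, best, u_t)) {
            for (std::size_t c = 0; c < k; ++c) sum[i][c] += x[c];
            ++count[i];
        }
    }

    // Step 4: centroid update, Eq. (3).
    for (std::size_t i = 0; i < M; ++i) {
        if (count[i] == 0) continue;
        for (std::size_t c = 0; c < k; ++c) codebook[i][c] = sum[i][c] / count[i];
    }
}
```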

If the ordering property of SOM is preserved, the Euclidean distance in the data space between two neighboring codevectors is usually small. Figure 2 shows a typical result of a SOM. As shown in Fig. 2a, the initial codevectors are distributed in a random manner and do not form a good topological ordering. In contrast, the codevectors form a good topological ordering after training as shown in Fig. 2b.

Fig. 2

Codevectors in the data space. a Initial codevectors before training. b A typical result of a 10 × 10 diamond-regular SOM after training. To indicate the neighborhood structure, an edge is drawn between two codevectors if they are neighbors in the graph. There are 1024 2D samples in the training set, uniformly distributed over [−1, 1]

2.3 Background on the GPU

Recently, the demand from the gaming and film-making industries for accelerated 3D rendering has driven the development of the graphics processing unit (GPU). GPU operations are controlled by user-defined programs called shaders. Since a GPU is a chip consisting of multiple processing units, it can be considered a virtual SIMD computer.

Figure 3 shows the graphics pipeline of the GPU. Logically, it consists of two programmable units, namely the vertex processing unit and the fragment processing unit. Programs controlling the vertex processing unit are called vertex shaders, whereas programs controlling the fragment processing unit are called fragment shaders. Primarily, vertex shaders perform the transformations of vertices, and fragment shaders compute the colors of pixels. In particular, a fragment shader can read and modify a texture. Note that a texture can be considered a 2D array of texels, where each texel has four color channels, namely R, G, B, and A.

Fig. 3

The graphics pipeline of GPU

3 Implementations

3.1 GPU-based LBG implementation

Intuitively, we can view our implementation as a multiclient–multiserver model, where training vectors are clients and codevectors are servers. Each client (a training vector) determines its nearest codevector and sends itself to the corresponding server. When training vectors arrive at a server, the server accumulates them and updates the count of arrived training vectors. After all clients are served, each server holds the accumulation of its arrived training vectors and their count. Finally, each server performs a simple division to update its codevector.
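
The server side of this analogy is summarized by the small C++ sketch below (the struct name `Server` is ours): a running sum, a counter, and a final division. As described in the rest of this section, the GPU realizes exactly this with additive blending and an extra count channel.

```cpp
#include <cstddef>
#include <vector>

// A "server" in the multiclient-multiserver analogy: it accumulates the
// training vectors sent to it and counts how many have arrived.  After all
// clients are served, a single division yields the updated codevector.
struct Server {
    std::vector<float> sum;   // accumulated training vectors
    float count = 0.0f;       // number of arrived training vectors

    explicit Server(std::size_t k) : sum(k, 0.0f) {}

    void receive(const std::vector<float>& x) {
        for (std::size_t c = 0; c < sum.size(); ++c) sum[c] += x[c];
        count += 1.0f;        // the GPU blends a constant "1" into an extra channel
    }

    std::vector<float> updated_codevector() const {
        std::vector<float> c(sum);
        if (count > 0.0f)
            for (float& v : c) v /= count;
        return c;
    }
};
```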

In our setting, there are N training vectors and M codevectors, and the dimension of the vectors is k. From the shader-programming point of view, each training vector is a vertex, and a codebook is organized as a number of textures. The logical structure of a codebook in the GPU is shown in Fig. 4. The resolution of the textures is 1 × M, and each texture position corresponds to a codevector. We use \(\lceil{\frac{k+1}{4}}\rceil\) textures to hold a codebook (note that a texel has four channels, R, G, B, and A, and hence can hold four values). An extra channel is used to store the number of arrived training vectors for each codevector.
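
As a concrete illustration of this layout, the snippet below computes how many 1 × M textures a k-dimensional codebook requires and which texture and channel hold each component; the count occupies the channel directly after the last vector component. The helper names are ours, for illustration only.

```cpp
#include <cstddef>

// Number of 1 x M RGBA textures needed to hold a k-dimensional codebook
// plus one extra channel for the per-codevector count: ceil((k + 1) / 4).
constexpr std::size_t textures_needed(std::size_t k) {
    return (k + 1 + 3) / 4;
}

// Texture index and channel (0 = R, 1 = G, 2 = B, 3 = A) that store
// component d of a codevector; d == k addresses the extra count channel.
struct TexelLocation { std::size_t texture; std::size_t channel; };

constexpr TexelLocation component_location(std::size_t d) {
    return { d / 4, d % 4 };
}

// Example: for k = 16, textures_needed(16) == 5, and the count is stored in
// texture 4 (the fifth texture), channel R, since component_location(16)
// yields {4, 0}.
```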

Fig. 4

The logical structure of a codebook in GPU

Before the GPU training, all training vectors are uploaded to the GPU memory as a display list of points, and two sets of textures are initialized. One set of textures holds the old codebook for “reading”; it is initialized with the codevectors prepared by the CPU. The other set holds the updated codebook for “writing”. Notice that the read/write ability of textures is realized by using frame buffer objects (FBOs).

The flow diagram of our GPU-based LBG implementation is shown in Fig. 5. For each training vector, the vertex shader performs the nearest-codevector search and finds the corresponding texel position. Then, a point of size one is rasterized at that texel position. In the fragment shader, the received training vectors are blended into the writing textures. Besides the received training vectors, a constant “1” is also blended into the extra color channel of the writing textures. Afterward, the writing textures hold, for each codevector, the accumulated training vectors and their count. See the appendix for a full example of the vertex shader and the fragment shader. Finally, an additional fragment shader updates the codevectors by dividing the accumulations by the counts.

Fig. 5

The flow diagram of our GPU-based LBG implementation

Our implementation of LBG algorithm is summarized in the following steps:

  1. Pass the training vectors to the vertex shader by drawing the display list of points.

  2. Use the vertex shader to perform the nearest-codevector search and find the corresponding texel position, such that a point of size one is rasterized at that texel position. The training vector is also passed to the fragment shader for further processing.

  3. Use a fragment shader to update the codebook. The received training vectors are blended into the writing textures. Besides the received training vectors, a constant “1” is also blended into the extra color channel of the writing textures.

  4. Swap the reading texture set and the writing texture set (Footnote 1).

  5. Use an additional fragment shader to update the codevectors. First, read the accumulated values from the reading texture set. Second, divide the accumulated values by the number of training vectors of each codevector. Output the new codevectors to the writing texture set.

  6. Swap the reading textures and the writing textures.

  7. Check the termination conditions on the CPU. If the conditions are satisfied, stop. Otherwise, go to Step 1.
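
On the host side, one iteration of these steps might be structured as in the C++ sketch below. It is only an outline under several assumptions: a valid OpenGL context, already compiled shader programs (`searchAndScatterProg`, `divideProg`), a display list `trainingPoints` holding the training vectors, and two FBOs (`readFbo`, `writeFbo`) with the codebook textures attached. These names are ours, and the multiple-render-target setup, texture bindings, and error handling are omitted.

```cpp
#include <GL/glew.h>
#include <utility>

// One training iteration of the GPU-based LBG scheme (Steps 1-6 above),
// sketched on the host side.  All handles are assumed to be created
// elsewhere; MRT and texture-binding details are omitted.
void lbg_gpu_iteration(GLuint searchAndScatterProg, GLuint divideProg,
                       GLuint trainingPoints,       // display list of N points
                       GLuint& readFbo, GLuint& writeFbo,
                       GLsizei M) {
    // Steps 1-3: scatter pass.  The vertex shader finds the nearest
    // codevector and moves the point to its texel; additive blending
    // accumulates the vector components and the constant "1" count.
    glBindFramebuffer(GL_FRAMEBUFFER, writeFbo);
    glViewport(0, 0, M, 1);                       // one texel per codevector
    glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
    glClear(GL_COLOR_BUFFER_BIT);
    glEnable(GL_BLEND);
    glBlendEquation(GL_FUNC_ADD);
    glBlendFunc(GL_ONE, GL_ONE);
    glUseProgram(searchAndScatterProg);
    glCallList(trainingPoints);                   // draw all training vectors as points
    glDisable(GL_BLEND);

    // Step 4: swap the reading and writing texture sets.
    std::swap(readFbo, writeFbo);

    // Step 5: division pass.  A fragment shader reads the accumulated sums
    // and counts and writes the new codevectors; drawing a quad covering
    // the 1 x M target generates one fragment per codevector.
    glBindFramebuffer(GL_FRAMEBUFFER, writeFbo);
    glUseProgram(divideProg);
    glBegin(GL_QUADS);
    glVertex2f(-1.0f, -1.0f); glVertex2f(1.0f, -1.0f);
    glVertex2f(1.0f, 1.0f);   glVertex2f(-1.0f, 1.0f);
    glEnd();

    // Step 6: swap again so the new codebook is on the reading side.
    std::swap(readFbo, writeFbo);
}
```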

3.2 GPU-based SOM implementation

In the SOM implementation, a codebook is also organized as a number of 2D textures, as shown in Fig. 6. For each training vector, the vertex shader finds its nearest codevector as well as the corresponding texel position. Afterward, instead of a point of size one, a point of size \(2u_t + 1\) is rasterized at that texel position. The remaining two fragment shaders are the same as those of the LBG implementation. The main difference is that, in the SOM case, each training vector updates not only the nearest codevector but also its level-\(u_t\) neighborhood. Hence, the texel positions of codevectors with topological distance less than or equal to \(u_t\) are updated, as shown in Fig. 7. The neighborhood size \(u_t\) decreases with t and is given by

$$ u_t = \begin{cases} 2, & \text{if } 0 \le t < T/8 \\ 1, & \text{if } T/8 \le t < T/2 \\ 0, & \text{otherwise} \end{cases} $$
(4)

where T is the maximum number of iterations.
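
As a small worked example of Eq. (4) and the point-size rule, the snippet below (function names ours) returns the neighborhood level for iteration t and the corresponding point size; a point of size \(2u_t + 1\) rasterizes \((2u_t + 1)^2\) fragments on the 2D codebook texture, one per updated codevector.

```cpp
// Neighborhood level u_t from Eq. (4); T is the maximum number of iterations.
unsigned neighborhood_level(unsigned t, unsigned T) {
    if (t < T / 8) return 2;
    if (t < T / 2) return 1;
    return 0;
}

// Point size used by the vertex shader: a point of size 2*u_t + 1 covers
// (2*u_t + 1)^2 texels, i.e. the winner and its level-u_t grid neighbors.
unsigned point_size(unsigned u_t) { return 2 * u_t + 1; }

// Example with T = 80: iterations 0-9 use u_t = 2 (point size 5, 25 texels),
// iterations 10-39 use u_t = 1 (point size 3, 9 texels), and the remaining
// iterations use u_t = 0 (point size 1, a single texel).
```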

Fig. 6

The logical structure of a 10 × 10 2D SOM codebook in GPU

Fig. 7

In our GPU SOM implementation, we update a number of texels by controlling the point size. In a, \(u_t = 1\) (the point size is 3), so a training vector generates 9 fragments and the corresponding 9 codevectors are updated. Similarly, in b, \(u_t = 2\) (the point size is 5), so a training vector generates 25 fragments and the corresponding 25 codevectors are updated

4 Results and discussion

To evaluate the speed of our algorithm, we compared our approach with Takizawa’s approach and a CPU implementation. All of the experimental results were obtained on a PC with the configuration listed in Table 1.

Table 1 The configuration of the computer used in our experiments

4.1 LBG algorithm

We implemented three approaches for the LBG training: the CPU approach, Takizawa’s approach, and our GPU approach. The dimension k of the data vectors is 16. By varying the number of training vectors N and the number of codevectors M, the number of iterations per second is measured for various settings. The results are summarized in Table 2.

Table 2 Speed of the LBG scheme

As shown in Table 2, our GPU-based LBG approach runs around 17–50 times faster than the CPU-based LBG approach, and the speed improvement grows as the number of codevectors increases. Compared with Takizawa’s scheme, our GPU-based LBG approach runs 1.5–4 times faster. The improvement is due to the reduction in the number of rendering passes and in the amount of data transferred between the CPU and GPU memories.

4.2 SOM

We apply our scheme to two vector-quantization scenarios: 3-dimensional RGB color quantization and 16-dimensional vector quantization. For the sake of speed comparison, we keep \(u_t\) constant during the training. Tables 3 and 4 show the training speeds of the CPU-based SOM and our GPU-based SOM for different point sizes. As shown in the tables, our GPU-based SOM approach runs around 25 to 100 times faster than the CPU-based SOM approach. In particular, when the image size is 1,024 × 1,024, the SOM resolution is 8 × 8 (64 codevectors), and the point size is 5, our GPU-based SOM approach runs 25 times faster. When we increase the resolution to 16 × 16 (256 codevectors), our GPU-based SOM approach runs 70 times faster. In general, the speed improvement is larger when the SOM resolution, i.e., the number of codevectors, increases.

Table 3 Speed of CPU SOM Scheme for color quantization
Table 4 Speed of our GPU SOM scheme for color quantization

We also apply our scheme to the clustering of high-dimensional data. The data dimension is set to 16. In the experiment, we vary the number N of training vectors and the number M of codevectors and measure the number of iterations per second for various settings. Tables 5 and 6 show the training speeds of the CPU-based SOM and our GPU-based SOM for different point sizes. From the tables, our GPU-based SOM approach runs around 15 to 40 times faster than the CPU-based SOM approach. The general trend of the improvement is similar to that of the color-quantization case: the speed improvement is larger when we increase the number of codevectors, i.e., the SOM resolution.

Table 5 Speed of our CPU SOM Scheme
Table 6 Speed of our GPU SOM Scheme

5 Conclusion

In this paper, we proposed a novel GPU implementation of the LBG and SOM training algorithms. By using the vertex shader to perform the nearest-codevector search and random writes, the functionality of GPUs is better utilized. Our experimental results show that the proposed scheme is remarkably faster than the conventional GPU approach. Moreover, our approach unifies the implementations of LBG and SOM training.