1 Introduction

As machine learning technologies (e.g., Support Vector Machines [35], Neural Networks [6, 26], and Random Forests [21]), especially deep learning [2, 3, 27], are applied in an ever wider range of scenarios, increasing attention is being paid to the performance of these algorithms. However, the learning speed of traditional machine learning algorithms is widely criticized [14]. The main reason for the slow speed is that their parameters usually need to be updated iteratively by gradient-based methods. Hence, it is difficult for traditional machine learning methods to meet the real-time learning needs of large-scale data applications.

Extreme learning machine (ELM) [9,10,11,12, 14,15,16] is a feedforward neural network designed to achieve high accuracy, least user intervention, and real-time learning [14]. In practical applications, when fed with adequate training samples, ELM is usually superior to traditional machine learning algorithms such as SVM and back propagation (BP) in both classification accuracy and learning speed [11]. Therefore, ELM has found applications in medical healthcare [23, 28, 29, 31,32,33], query processing [4, 19], location-based social networks (LBSNs) [22, 42], Geographic Information Systems (GIS) [18], etc.

As a matter of fact, ELM also struggles to escape the problem of performance scalability. As the dataset scale increases, learning efficiency gradually declines. Our preliminary investigation shows that, for datasets of the scale of the classical UCI Forest CoverType dataset, the time cost of each step of the standard single-core CPU implementation of ELM for supervised classification becomes too significant to be ignored. Thus, ELM needs an efficient acceleration solution to adapt to AI applications for big data. Although hardware acceleration is an obvious solution, there is no existing research on the applicability of the various hardware acceleration platforms. To sum up, the motivation and contributions of this work are as follows.

1.1 Motivation

  1. Can ELM keep running well in large-scale application scenarios? As we know, real-time learning is one of the essential considerations of ELM. The learning speed on large-scale datasets is a severe test of ELM's practicality.

  2. Is the ELM training phase suitable for hardware acceleration? What is the key factor affecting the scalability of ELM's learning speed? And how can the ELM algorithm be rewritten around this key factor to adapt it to various hardware acceleration platforms?

  3. If ELM training can be accelerated, how should appropriate hardware acceleration platforms be selected for datasets of different scales? Several kinds of acceleration hardware are available, including multi-core CPUs, GPUs, FPGAs, etc. The answer to this question will guide the application of ELM in large-scale AI scenarios.

1.2 Contributions

In this work, we made the following contributions.

  1. We validated that the standard implementation of ELM has poor learning speed scalability on large-scale datasets. We fully implemented a single-core CPU version of the ELM algorithm as a baseline and evaluated the training time of each step in detail on a classical large-scale dataset, as shown in Table 1. The evaluation shows that the training time of ELM grows linearly with the dataset size.

  2. The study reveals that ELM is a hardware-friendly algorithm. Both the algorithm and the evaluation of the baseline implementation show that the ELM training process is dominated by matrix multiplication operations, which are known to parallelize well. For the CoverType dataset, matrix multiplication accounts for roughly 97% of the training time (see Sect. 2). Hence, we were motivated to carry out substantial research on adapting the standard ELM algorithm to several types of acceleration hardware that support parallelization, such as Field-Programmable Gate Arrays (FPGAs) [5, 25, 40], Graphics Processing Units (GPUs) [1, 17, 36], and multi-core CPUs whose architecture is optimized for parallel computation (see Sect. 3).

  3. GPU and FPGA are suitable devices for ELM. We suggest (1) using GPUs to accelerate ELM for large datasets, and (2) using FPGAs for small datasets, given their lower power consumption, especially in embedded systems. To support these conclusions, the source code of the ELM implementations on the proposed hardware acceleration platforms was implemented and released as open source. Experiments were performed on three datasets, namely CoverType, Ionosphere, and Sonar. The experimental results further confirm that the ELM training algorithm is suitable for hardware acceleration. Specifically, in our experiments, a multi-core CPU with SIMD optimization achieves about a \(10\times\) speedup, the GPU speedup exceeds \(800\times\), and the FPGA speedup is about \(40\times\).

The rest of this paper is organized as follows: in Sect. 2, the basic ELM algorithm is discussed and its performance is evaluated to show the bottleneck. In Sect. 3, the optimization methods of ELM algorithm on different hardware devices are elaborated. In Sect. 4, the performance of those different ELM implementations is evaluated and compared. Section 5 elaborates related work, and the conclusion is shown in Sect. 6.

2 Preliminary

2.1 Extreme learning machine

Given N arbitrary distinct samples \(({\mathbf {x}}_i, {\mathbf {t}}_i) \in {\mathbf {R}}^{n}\times {\mathbf {R}}^{m}\), where \({\mathbf {x}}_i =[x_{i1}, x_{i2},\ldots ,x_{in}]^T \in {\mathbf {R}}^n\) and \({\mathbf {t}}_i = [t_{i1},t_{i2},\ldots ,t_{im}]^T \in {\mathbf {R}}^m\), \({\mathbf {x}}_i\) is the sample vector and \({\mathbf {t}}_i\) is the target label. A standard Single-hidden Layer Feedforward neural Network (SLFN), shown in Fig. 1, can be modeled mathematically as Eq. 1:

$$\begin{aligned} \sum _{j = 1}^{L}{\mathbf {\beta }}_j g_j({\mathbf {x}}_i) = \sum _{j = 1}^{L}{\mathbf {\beta }}_j g({\mathbf {w}}_j \cdot {\mathbf {x}}_i + b_j) = {\mathbf {o}}_i, i = 1,\ldots ,N \end{aligned}$$
(1)

where L is the number of hidden layer nodes, \({\mathbf {w}}_j = [w_{j1}, w_{j2},\ldots , w_{jn}]\) is the input weight vector, \({\mathbf {\beta }}_j = [\beta _{j1}, \beta _{j2},\ldots , \beta _{jm}]^T\) is the output weight vector, \(b_j\) is the bias of the \(j\)-th hidden node, and \({\mathbf {o}}_i\) is the network output for the \(i\)-th sample. Generally,

$$\begin{aligned} g(x) = \frac{1}{1+e^{-x}} \end{aligned}$$
(2)
Fig. 1: An example of a standard Single-hidden Layer Feedforward neural Network (SLFN)

Table 1 Baseline evaluation (single-core implementation): execution time distribution for different N and L

Approximating these samples with zero error means that \(\sum _{i=1}^N \Vert {\mathbf {t}}_i - {\mathbf {o}}_i\Vert = 0\), i.e., there exist \({\mathbf {\beta }}_j\), \({\mathbf {w}}_j\), and \(b_j\) such that

$$\begin{aligned} \sum _{j = 1}^{L}{\mathbf {\beta }}_j g({\mathbf {w}}_j {\mathbf {x}}_i + b_j) = {\mathbf {t}}_i, i = 1,\ldots ,N \end{aligned}$$
(3)

which can be written compactly as

$$\begin{aligned} {\mathbf{H }}\pmb {\beta } = {\mathbf{T} } \end{aligned}$$
(4)

where

$$\begin{aligned} \begin{aligned} &H({\mathbf {w}}_1,\ldots ,{\mathbf {w}}_L,{\mathbf {b}}_1,\ldots ,{\mathbf {b}}_L,{\mathbf {x}}_1,\ldots ,{\mathbf {x}}_N) \\ &= \begin{bmatrix} g({\mathbf{w} }_1 {\mathbf{x} }_1 + b_1)&\ldots&g({\mathbf{w} }_L {\mathbf{x} }_1 + b_L)\\ \vdots&\ddots&\vdots \\ g({\mathbf{w} }_1 {\mathbf{x} }_N + b_1)&\ldots&g({\mathbf{w} }_L {\mathbf{x} }_N + b_L) \\ \end{bmatrix} _{N \times L} \end{aligned} \end{aligned}$$
(5)
$$\begin{aligned} \pmb {\beta } = \begin{bmatrix}{\mathbf {\beta }}_1^T\\ {\mathbf {\beta }}_2^T\\ \vdots \\ {\mathbf {\beta }}_L^T\end{bmatrix}_{L \times m} \quad \text {and}\quad \pmb {T} =\begin{bmatrix}\pmb {t}_1^T\\ \pmb {t}_2^T \\ \vdots \\ \pmb {t}_N^T\end{bmatrix}_{N\times m} \end{aligned}$$
(6)

\({\mathbf{H} }\) is the hidden-layer output matrix of ELM, mapping the n-dimensional input data space into the L-dimensional hidden-node feature space. Unlike most common algorithms, in which all parameters need to be adjusted iteratively, once the input weights \({\mathbf {w}}_j\) and hidden layer biases \(b_j\) are randomly determined, the output matrix of the hidden layer is fixed. Training SLFNs can thus be transformed into solving a linear system.

In most cases, H is a non-square matrix (\(L\ll N\)); the output weights \({\mathbf {\beta }}\) are computed as \({\mathbf {\beta }} = H^\dag T\), where \(H^\dag\) is the Moore–Penrose generalized inverse of matrix H. Equation 7 shows how \({\mathbf {\beta }}\) is obtained via the normal equations, i.e., with \(H^\dag = (H^T H)^{-1} H^T\).

$$\begin{aligned} {\mathbf {H}}\pmb {\beta }&= {\mathbf {T}} \\ {\mathbf {H}}^{T} {\mathbf {H}}\pmb {\beta }&= {\mathbf {H}}^{T} {\mathbf {T}} \\ \pmb {\beta }&= ({\mathbf {H}}^{T} {\mathbf {H}})^{-1} {\mathbf {H}}^{T} {\mathbf {T}} \end{aligned}$$
(7)

The training phase of the ELM algorithm is described in Algorithm 1, which comprises 7 steps in all. Algorithm 2 classifies unknown data D based on the trained W, B, and \({\mathbf {\beta }}\). The matrix \(H_2\) is computed in the same way as in the training phase, and the output o represents the predicted values.

Algorithm 1: ELM training phase
Algorithm 2: ELM prediction phase
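
To make the training flow concrete, the following is a minimal single-threaded C++ sketch in the spirit of Algorithm 1, written with the Eigen library (an assumption for illustration; the function and variable names are ours, not the authors' released code). It follows Eq. 7: random input weights and biases, the hidden-layer output H, and the output weights \({\mathbf {\beta }}\) via the normal equations.

```cpp
// Minimal single-threaded sketch of ELM training (the spirit of Algorithm 1),
// assuming the Eigen library. Names (elm_train, W, B) are illustrative.
#include <Eigen/Dense>
#include <cmath>
#include <random>

using Mat = Eigen::MatrixXd;

// X: N x n samples, T: N x m targets, L: number of hidden nodes.
// Returns the output weights beta (L x m); W (n x L) and B (1 x L) receive
// the randomly generated input weights and hidden-node biases.
Mat elm_train(const Mat& X, const Mat& T, int L, Mat& W, Eigen::RowVectorXd& B) {
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    W = Mat::NullaryExpr(X.cols(), L, [&]() { return dist(gen); });        // random input weights
    B = Eigen::RowVectorXd::NullaryExpr(L, [&]() { return dist(gen); });   // random biases

    // Step 3 of Algorithm 1: hidden-layer output H = g(X*W + B), g = sigmoid (Eq. 2)
    Mat H = X * W;                        // N x L pre-activation
    H.rowwise() += B;                     // add each hidden node's bias
    H = H.unaryExpr([](double v) { return 1.0 / (1.0 + std::exp(-v)); });

    // Remaining steps: beta = (H^T H)^{-1} H^T T via the normal equations (Eq. 7);
    // H^T H (step 5) is the dominant matrix multiplication identified in Sect. 2.2.
    Mat HtH = H.transpose() * H;
    return HtH.ldlt().solve(H.transpose() * T);
}
```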

2.2 Bottleneck analysis

In general, large datasets are provided in the training phase in order to compute the ELM parameters \(\mathbf {\beta }\), namely the weight vectors connecting the hidden nodes and the output nodes. Although the learning speed of ELM is much faster than that of most classic learning algorithms, it can still be slow when the training dataset is large.

To further clarify the time cost of the ELM algorithm, we focus on the training phase and measure the execution time of each step in Algorithm 1 in detail. The algorithm is implemented on a single core of an Intel(R) Xeon(R) E7-4820 v4 CPU @ 2.00 GHz under Ubuntu 16.04 LTS. The dataset used in our experiment is CoverType, which includes 581,012 samples, each containing 54 data attributes. All samples are divided into 7 classes. That is to say, the number of input nodes of the SLFN in our ELM algorithm is 54, and the number of output nodes is 7. We vary the number of hidden layer nodes L or the number of training samples N to measure the time cost.

Table 1 illustrates the execution time distribution for different N and L. Columns 3 to 9 record the running time of each step in Algorithm 1. Note that every measurement was repeated 10 times and the mean value is reported. We obtain the following two findings.

  1. For each instance, step 3 and step 5 in Algorithm 1 consume a significant amount of execution time, and both correspond to matrix multiplication operations. When N or L is large, the matrix multiplication operations take up about 97.5% of the total training time, which is the critical bottleneck of the ELM algorithm.

  2. The time cost of matrix multiplication is linearly related to N and L. As N or L increases, the execution time grows accordingly.

To sum up, matrix multiplication occupies most of the execution time of the ELM training phase, especially when the training dataset is large. In fact, the main cost of the ELM prediction phase is also the matrix multiplication on the verification dataset described in Algorithm 2. Hence, the efficiency of the ELM algorithm is strongly determined by the efficiency of the matrix multiplication implementation. How to calculate large-scale matrix multiplication efficiently is therefore the key to ELM acceleration. Matrix multiplication is well suited to implementation on several hardware devices; we discuss the details in the next section.

3 Implementations on hardware devices

The above baseline experiment shows that the critical point of ELM acceleration is how to calculate large-scale matrix multiplication quickly. With the rapid improvement in modern hardware performance, parallel execution has become prevalent and can greatly reduce execution time. Therefore, employing new hardware devices to accelerate the ELM algorithm is a feasible solution.

In this section, we discuss the performance of ELM algorithms on three kinds of modern hardware devices (multi-core/SIMD CPU, GPU, and FPGA) for accelerating the matrix multiplication operation, which is the key bottleneck of ELM. The advantages and disadvantages of each computing device are shown in Table 2.

Table 2 Advantages and disadvantages of each computing device
Fig. 2: Architecture of the ELM training phase. H (step 3) is calculated with multiple cores, and A (step 5) is calculated with several devices, including CPU, GPU, and FPGA

First of all, we demonstrate the framework of our approach in this section. As shown in Algorithm 1, the training phase of the ELM algorithm has 7 steps in all. Figure 2 shows the framework of the training phase, in which we implement step 3 with multiple cores (green box) and offload step 5 to several devices (yellow box); in short, the time-consuming and hardware-friendly tasks are offloaded onto the new hardware devices.

3.1 CPU implementation

3.1.1 Baseline

Formally, given an \(m \times k\) matrix A and a \(k \times n\) matrix B, the matrix product \(C = A\times B\) is an \(m \times n\) matrix whose element \(C_{ij}\) can be calculated by

$$\begin{aligned} C_{ij} = \sum _{l=1}^{k}A_{il}B_{lj} = A_{i1}B_{1j} +A_{i2}B_{2j}+\cdots +A_{ik}B_{kj} \end{aligned}$$
(8)

A straightforward single-core implementation of matrix multiplication is shown in Algorithm 3. Obviously, its computational complexity is \(O(m\times k\times n)\). We regard it as the baseline for the other modern hardware implementations.

Algorithm 3: Single-core baseline matrix multiplication
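
For reference, a minimal C++ sketch of this single-core baseline (the naive triple loop of Eq. 8) might look as follows; the row-major array layout and names are our own choices.

```cpp
// Naive O(m*k*n) single-core matrix multiplication (the idea behind Algorithm 3).
// Row-major arrays; names are illustrative.
void matmul_baseline(const double* A, const double* B, double* C,
                     int m, int k, int n) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int l = 0; l < k; ++l)
                sum += A[i * k + l] * B[l * n + j];   // Eq. (8)
            C[i * n + j] = sum;
        }
}
```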
Fig. 3: Comparison of SISD and SIMD. Single Instruction Multiple Data (SIMD) is an instruction unit that controls multiple duplicated processing units simultaneously to perform the same operation on multiple data

Definition 1

(Speedup Ratio) Suppose the running time of matrix multiplication in the baseline version is \(T_{baseline}\) and the time on a device is \(T_{device}\); then the speedup ratio is

$$\begin{aligned} speedup=\frac{T_{baseline}}{T_{device}} \end{aligned}$$

3.1.2 Multi-core, thread-level parallelism

A multi-core processor, which consists of a single computing component with two or more independent processing units, called cores, can run multiple instructions on separate cores. Therefore, it can increase overall speed for programs amenable to parallel computing.

$$\begin{aligned} A_{m\times k} \times B_{k\times n} = \begin{bmatrix} A_{11} & \ldots & A_{1k} \\ \vdots & \ddots & \vdots \\ A_{a1} & \ldots & A_{ak} \end{bmatrix} \times \begin{bmatrix} B_{11} & \ldots & B_{1b} \\ \vdots & \ddots & \vdots \\ B_{k1} & \ldots & B_{kb} \end{bmatrix} = \begin{bmatrix} C_{11} & \ldots & C_{1b} \\ \vdots & \ddots & \vdots \\ C_{a1} & \ldots & C_{ab} \end{bmatrix} = C_{m\times n} \end{aligned}$$
(9)

In order to accelerate the ELM algorithm, we take full advantage of multiple cores to accelerate its matrix multiplication, the most time-consuming operation. As shown in Eq. 9, matrix C is divided into \(a \times b\) subblocks, each of which is an (\(m/a \times n/b\)) matrix, and C is obtained by computing these subblocks. Because the calculations of these sub-matrices are independent of each other, each one can be computed by a single core. By computing all of the \(a \times b\) subblocks in parallel, the ELM algorithm can be greatly accelerated. The calculation of each sub-matrix is shown in Algorithm 4, where id represents the thread ID, \(id\in [0,a\times b)\); a C++ sketch of this scheme is given after the algorithm listings below. Obviously, the computational complexity of the multi-core matrix multiplication is \(O(\frac{m\times n\times k}{a\times b})\).

Algorithm 4: Multi-core block matrix multiplication
Algorithm 5: SIMD (AVX) matrix multiplication
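
The following is a minimal C++ sketch of the thread-level scheme behind Algorithm 4: C is split into \(a \times b\) blocks and each block is computed by one thread. The block-size handling and names are our own assumptions.

```cpp
// Thread-level parallel matrix multiplication: one std::thread per C subblock.
#include <thread>
#include <vector>

void matmul_block(const double* A, const double* B, double* C,
                  int m, int k, int n, int a, int b, int id) {
    int bi = id / b, bj = id % b;                          // block row/column for thread id
    int r0 = bi * (m / a), r1 = (bi == a - 1) ? m : r0 + m / a;
    int c0 = bj * (n / b), c1 = (bj == b - 1) ? n : c0 + n / b;
    for (int i = r0; i < r1; ++i)
        for (int j = c0; j < c1; ++j) {
            double sum = 0.0;
            for (int l = 0; l < k; ++l) sum += A[i * k + l] * B[l * n + j];
            C[i * n + j] = sum;
        }
}

void matmul_multicore(const double* A, const double* B, double* C,
                      int m, int k, int n, int a, int b) {
    std::vector<std::thread> pool;
    for (int id = 0; id < a * b; ++id)                     // a*b independent subblocks
        pool.emplace_back(matmul_block, A, B, C, m, k, n, a, b, id);
    for (auto& t : pool) t.join();
}
```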

3.1.3 SIMD, instruction-level parallelism

Single Instruction Multiple Data (SIMD) is an instruction unit that controls multiple duplicated processing units simultaneously to perform the same instruction on multiple data. Figure 3 illustrates the data processing of Single Instruction Single Data (SISD) and SIMD and reveals the excellent data parallelism of SIMD. At present, the prevalent SIMD instruction sets include MultiMedia eXtensions (MMX), Streaming SIMD Extensions (SSE), Intel Advanced Vector eXtensions (AVX), Fused Multiply-Add (FMA), and so on. SIMD is particularly suitable for processing contiguous dense data. Most modern CPUs provide SIMD instructions, which improve the performance of games and multimedia applications. Hence, the excellent parallelism of SIMD can also be applied to accelerate the matrix multiplication operation of the ELM algorithm, which involves mutually independent multiplications and additions.

The GNU Compiler Collection (GCC) provides native intrinsics for the above instruction sets. Algorithm 5 describes the implementation of matrix multiplication using AVX, which provides eight duplicated processing lanes for single-precision data. The instruction unit drives these lanes to perform the multiplications simultaneously. We take the calculation of \(c_{ij}\) as an example to illustrate the workflow of this algorithm. \(c_{ij}\) is obtained by multiplying the i-th row of matrix A, denoted \(A_i\), with the j-th column of matrix B, denoted \(B_j\); both \(A_i\) and \(B_j\) have length k. In the calculation, every eight elements of \(A_i\) and \(B_j\) are sent to the eight processing lanes and multiplied in parallel. After each group of eight multiplications, the results are accumulated into \(c_{ij}\). In other words, the data of matrix A and matrix B are split into (k / 8) blocks, each of length eight, as shown in Fig. 3b. The complexity of the SIMD implementation is \(O(\frac{m\times n\times k}{8})\). A hedged intrinsic-level sketch of this inner product is given below.
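
The sketch assumes single-precision data, k divisible by 8, and that B has been transposed so that its j-th column is stored contiguously; these are our simplifying assumptions, not details taken from Algorithm 5.

```cpp
// AVX sketch of the dot product for one element c_ij: 8 float lanes per instruction.
#include <immintrin.h>

float dot_avx(const float* Ai, const float* Bj, int k) {
    __m256 acc = _mm256_setzero_ps();
    for (int l = 0; l < k; l += 8) {
        __m256 a = _mm256_loadu_ps(Ai + l);                 // 8 elements of row A_i
        __m256 b = _mm256_loadu_ps(Bj + l);                 // 8 elements of column B_j
        acc = _mm256_add_ps(acc, _mm256_mul_ps(a, b));      // 8 multiplications + additions
    }
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);                             // horizontal reduction of the lanes
    float sum = 0.0f;
    for (int i = 0; i < 8; ++i) sum += tmp[i];
    return sum;
}
```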

3.2 GPU implementation

3.2.1 GPU

Compared with multi-core CPUs, GPUs, which consist of thousands of cores handling multiple tasks at the same time, provide more efficient computation, especially for array-based workloads. An architectural-level comparison of CPU and GPU is given in Fig. 4. GPUs are suitable for computation-intensive procedures whose running time is mainly spent on register operations rather than data access. In addition, procedures that are easy to parallelize are also applicable to GPUs, so that the cores can be fully loaded simultaneously.

Fig. 4: CPU versus GPU. Compared with multi-core CPUs, GPUs, which consist of thousands of cores handling multiple tasks at the same time, provide more efficient computation, especially for array-based workloads

3.2.2 Programming model

Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model based on NVIDIA GPUs. The CUDA execution model is organized into grids, blocks, and threads, as shown in Fig. 5. In this three-level hierarchical architecture, execution is independent among entities of the same level. Threads are the smallest execution unit; they are controlled by kernel functions and perform operations independently in parallel. Each thread has a unique thread ID (threadIdx). Blocks are organized as up to a 3D array of threads, and the total size of a block is limited to 1024 threads. Like threads, each block also has a unique block ID (blockIdx). Blocks that execute the same operations independently form a grid. All necessary data in host memory must be transferred to allocated device memory before a kernel is executed on the device. Similarly, the resulting data are transferred back to the host, and the device memory is released after device execution.
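
A minimal host-side sketch of this workflow (allocate, copy in, launch, copy out, free) is given below; the kernel name and sizes are placeholders, and error handling is omitted.

```cpp
// Host-side CUDA workflow sketch: device allocation, transfers, kernel launch, cleanup.
#include <cuda_runtime.h>

void run_on_gpu(const float* hA, float* hC, size_t bytesA, size_t bytesC) {
    float *dA = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytesA);                                // allocate device memory
    cudaMalloc(&dC, bytesC);
    cudaMemcpy(dA, hA, bytesA, cudaMemcpyHostToDevice);     // host -> device transfer
    // some_kernel<<<grid, block>>>(dA, dC, ...);           // device execution (placeholder)
    cudaMemcpy(hC, dC, bytesC, cudaMemcpyDeviceToHost);     // device -> host transfer
    cudaFree(dA);                                           // release device memory
    cudaFree(dC);
}
```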

Fig. 5: CUDA model. Threads are the smallest execution unit, controlled by kernel functions to perform operations independently in parallel

3.2.3 ELM implementation on GPU

In this part, we employ CUDA to design a matrix multiplication algorithm on the GPU to accelerate the ELM algorithm. Given an \(m \times k\) matrix A and a \(k \times n\) matrix B, the matrix product \(C = A\times B\) is an \(m \times n\) matrix. Each element \(C_{ij}\) is calculated by one GPU thread; the per-thread procedure is shown in Algorithm 6, and \(m\times n\) CUDA threads are used in total. Obviously, the complexity per thread is O(k).

Algorithm 6: Per-element CUDA matrix multiplication kernel
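
A minimal CUDA C++ sketch of such a per-element kernel is shown below; the 2D grid/block shapes are illustrative choices, not the configuration used in Algorithm 6.

```cpp
// One thread computes one element C_ij; each thread does O(k) work.
__global__ void matmul_kernel(const float* A, const float* B, float* C,
                              int m, int k, int n) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;          // row index of C
    int j = blockIdx.x * blockDim.x + threadIdx.x;          // column index of C
    if (i < m && j < n) {
        float sum = 0.0f;
        for (int l = 0; l < k; ++l)
            sum += A[i * k + l] * B[l * n + j];
        C[i * n + j] = sum;
    }
}

// Launch sketch (device pointers dA, dB, dC assumed already populated):
// dim3 block(16, 16);
// dim3 grid((n + block.x - 1) / block.x, (m + block.y - 1) / block.y);
// matmul_kernel<<<grid, block>>>(dA, dB, dC, m, k, n);
```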

3.3 FPGA implementation

3.3.1 FPGA

Field-Programmable Gate Array (FPGA) is an integrated circuit that can be configured by a customer or a designer after manufacturing to perform specific functions. An FPGA includes Configurable Logic Blocks (CLBs), Input Output Blocks (IOBs), and Interconnect. The CLB module can be used not only to implement combinational and sequential logic, but also to configure distributed RAM and distributed ROM. The interconnect allows CLBs to be “wired together” or connected to IOBs. The architecture of an FPGA is shown in Fig. 6.

Fig. 6: FPGA chip. An FPGA includes Configurable Logic Blocks (CLBs), Input Output Blocks (IOBs), and Interconnect. The CLB module can be used not only to implement combinational and sequential logic, but also to configure distributed RAM and distributed ROM

FPGAs have the following advantages. First, an FPGA offers high flexibility: compared with the ASIC manufacturing process, an FPGA does not need wiring and tape-out, so the development process is greatly simplified and the development cycle is shortened. Second, an FPGA offers high parallel computing efficiency: it can perform many operations per clock cycle, whereas a DSP or even a CPU can only issue a limited number of instructions per cycle. Third, an FPGA has lower power consumption, owing to its lower clock frequency.

3.3.2 Programming model

High-Level Synthesis (HLS) not only allows engineers to develop at a high level of abstraction, but also makes it easy to generate multiple design solutions that can be deployed on FPGAs. Figure 7 shows the synthesis process. An algorithm for a specific function is designed in C/C++/SystemC. A C/C++/SystemC testbench then verifies the correctness of the designed algorithm and is also used for the co-simulation of the generated RTL against the C design. The clock period constraint specifies the target clock period at which the design should run. Finally, the design is mapped to the target FPGA device.

Fig. 7: HLS. The algorithm implemented with SystemC can be compiled and bitstreamed to the FPGA device

There are three methods, namely loop pipelining, loop unrolling, and array partitioning, to improve the performance of the hardware function. They all exploit the parallelism between loop iterations.

Fig. 8: Loop pipelining. The 3 operations in the loop can run in a pipelined manner

Fig. 9: FPGA array partitioning. HLS provides three array partitioning methods

Fig. 10: FPGA logical implementation of ELM. Get_H is the IP kernel that calculates H in Algorithm 1, line 3. Matrix_multiply_itself is the IP kernel that calculates \(H^T H\) in Algorithm 1, line 5

Table 3 Benchmark results on several datasets
Table 4 Multi-thread performance details on the CoverType dataset: multi-core performance under different thread counts, with N = 500,000 and \(L = 200\)
Fig. 11: CPU performance analysis. The performance with SIMD optimization is better than that of the baseline algorithm under the same experimental parameters. Moreover, the performance scales with the thread count

Table 5 CPU versus GPU performance on several datasets. ELM is much more suitable for GPU
Table 6 Influence of the change of N or L on GPU performance

The example of loop pipelining is shown in Fig. 8. In serial execution, there are three clock cycles between two RD operations, and it takes six clock cycles to complete the entire loop. With pipelining, however, there is only one clock cycle between two RD operations and only four clock cycles are needed to finish the entire loop. In other words, the next iteration of the loop can start before the current iteration finishes; in this way, the performance of the loop is improved.

Loop unrolling is another method to improve hardware performance. It creates multiple copies of the loop body and adjusts the loop iteration counter accordingly. For instance, when a loop is unrolled by creating two copies of the loop body, the loop variable referenced by each copy is updated accordingly, and the loop iteration counter is adjusted as well. Obviously, there are more operations in each loop iteration, so more parallelism among these operations can be exploited. More parallelism means more throughput and higher system performance. As shown in the listing below (Algorithm 7), the main body of the unrolled loop operates on 2 elements of the array arr; a conceptual sketch is also given after the listing.

Algorithm 7: Loop unrolling example
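
A conceptual C++ sketch of unrolling by a factor of two (what the HLS unroll directive generates) is shown below; it assumes an even trip count, and the names are ours.

```cpp
// Two copies of the loop body per iteration; the counter advances by 2.
void scale_unrolled(float arr[], int n, float s) {
    for (int i = 0; i < n; i += 2) {
        arr[i]     *= s;   // first copy of the loop body
        arr[i + 1] *= s;   // second copy
    }
}
```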

Array partitioning By default, an array mapped to block RAM has only two data ports for reading/writing; HLS carries out array partitioning to improve the bandwidth by splitting the array into several smaller arrays, which increases the number of data ports, as shown in Fig. 9. The three partitioning styles are:

  • block The original array is divided into equally sized, consecutive sub-arrays.

  • cyclic The original array is divided into equally sized, interleaved sub-arrays.

  • complete The original array is divided into its individual elements.

3.3.3 FPGA for ELM

The details are shown in Algorithm 8. The main architecture is similar to the baseline version; the main difference is that we take full advantage of the parallelism of the FPGA by using loop unrolling, pipelining, and array partitioning. Lines 5 and 6 are the loop pipelining and loop unrolling preprocessor directives. The internal circuit logic of the ELM training algorithm is shown in Fig. 10.

Algorithm 8: HLS implementation of ELM on FPGA
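
The following is a hedged HLS-style C++ sketch in the spirit of Algorithm 8: the inner loop that accumulates one sample row's contribution to \(H^T H\) is pipelined, and a local buffer is partitioned to provide enough read ports. The sizes, names, and pragma factors are illustrative assumptions, not the authors' settings.

```cpp
// HLS-style sketch: accumulate one row h of H into acc, so that after all N rows
// acc holds H^T H (acc must be zero-initialized by the caller).
#define L_NODES 64   // hypothetical number of hidden nodes

void accumulate_hth_row(const float h_row[L_NODES], float acc[L_NODES][L_NODES]) {
    float buf[L_NODES];
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=8
    for (int i = 0; i < L_NODES; ++i)
        buf[i] = h_row[i];                     // copy into a partitioned local buffer

    for (int i = 0; i < L_NODES; ++i) {
        for (int j = 0; j < L_NODES; ++j) {
#pragma HLS PIPELINE II=1
            acc[i][j] += buf[i] * buf[j];      // rank-1 update: row i, column j of H^T H
        }
    }
}
```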

4 Evaluation

4.1 Benchmark with CPU

Setup The ELM algorithms are evaluated on our high-end server, which is equipped with 4 Intel Xeon E7-4820 v4 CPUs and 1 TB of main memory, running Ubuntu 16.04 LTS with GCC 5.4.

Several datasets We use the following 3 datasets to evaluate the acceleration performance: CoverType, Ionosphere, and Sonar. The source code can be found on GitHub. The performance comparison across the three datasets is illustrated in Table 3. For the small training sets, the performance is already excellent, while for the large dataset, steps 3 and 5 are the critical bottleneck, as discussed in Sect. 2.2.

Multi-thread For the first group of evaluations, the CoverType dataset was used. The variable parameters here are N and L; we set N = 500,000 and \(L=200\). In the experiment, we adjusted the number of threads to investigate the scaling behavior, measuring in the same way as in Sect. 2.2. Table 4 shows the multi-core performance under different thread counts with the same N and L. The results reveal the following findings:

  1. The performance scales with the thread count, i.e., the computing time in columns 3 and 5 decreases as the thread number increases.

  2. As the number of threads increases, this approach reaches the upper bound of its performance, i.e., 32 threads and 64 threads yield similar training times.

SIMD For the second group of evaluations, we integrate the SIMD techniques into the ELM algorithm. Figure 11a, b shows the training performance (single thread) under different N and L, with and without SIMD optimizations; the vertical axis represents the training time of ELM. Figure 11c shows the comparison of performance in the multi-thread environment, with and without SIMD optimization. The following finding is obtained: the performance with SIMD optimization is better than that of the baseline algorithm under the same experimental parameters. For example, the speedup ratio of SIMD is 1.67 with N = 500,000 and \(L=200\).

4.2 Benchmark with GPU

Setup In this part, the experiments were run on a workstation equipped with an Intel(R) Core(TM) i7-6500U CPU, an NVIDIA GeForce 940MX, and 8 GB of memory. Our development environment is Visual Studio 2010 and CUDA 9.1. To support GPU computing, we used the cuBLAS and CULA libraries. We also used the same datasets (CoverType, Ionosphere, and Sonar) as in the previous section.

Implementation First, we re-implemented the baseline version of ELM on the Windows platform; then we offload steps 5 to 7 of Algorithm 1 onto the GPU computing card. In contrast to Sect. 4.1, we also need to record the memory transfer time from host memory to device memory because of the limited speed of PCI-E. The source code of this part can be found on GitHub. A hedged sketch of the cuBLAS-based offloading of step 5 is shown below.
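
The sketch computes \(H^T H\) (step 5) with a single cuBLAS call. cuBLAS is column-major, so the row-major \(N \times L\) matrix H is treated as its column-major \(L \times N\) transpose; the function name is ours.

```cpp
// Compute H^T * H (L x L) on the GPU with cuBLAS; dH and dHtH are device pointers.
#include <cublas_v2.h>

void hth_cublas(cublasHandle_t handle, const float* dH, float* dHtH, int N, int L) {
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major view of row-major H (N x L) is H^T (L x N), so
    // (H^T) * (H^T)^T = H^T * H, an L x L result.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                L, L, N, &alpha, dH, L, dH, L, &beta, dHtH, L);
}
```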

Results Table 5 shows the performance comparison based on 3 different datasets, and Table 6 shows the performance comparison of running ELM under different parameters. We can find that:

  1. The running time of matrix multiplication on the GPU is still proportional to the amount of data. However, the performance of matrix multiplication on the GPU is much better than that on the CPU (Table 5).

  2. The GPU version is approximately \(800\times\) faster than the CPU-only version, so the ELM training process can achieve a near real-time effect (Table 6).

  3. The GPU has a relatively high data transfer cost, especially for large datasets (Table 6, column 3).

4.3 Benchmark with FPGA

Setup Following the same methodology as the previous experiments, the benchmark was run on a VCU118 FPGA board, whose parameters are shown in Table 7.

Table 7 Parameters of the VCU118 XCVU9P-L2FLGA2104E FPGA

Results Table 8 shows the performance of the implementation of step 5 in Algorithm 1 on different hardware devices, including single-core CPU, GPU, and FPGA. Since the BRAM on the FPGA is small, this part of the evaluation is based on the small-scale Ionosphere dataset. We find that the FPGA speedup is roughly \(40\times\).

Table 8 Performance of step 5 of Algorithm 1 on different devices: \(Speedup_{GPU}> Speedup_{FPGA} > Speedup_{multi-core}\)

4.4 Summary

In this paper, we used several kinds of popular modern hardware devices to accelerate the original ELM algorithm. We obtain the following findings:

  1. Matrix multiplication is the most time-consuming operation in the ELM algorithm; thus, all of the devices evaluated in this paper can accelerate the ELM training phase.

  2. The speedup order is \(Speedup_{GPU}> Speedup_{FPGA} > Speedup_{multi-core}\). A multi-core CPU with SIMD optimization is friendly to ELM and achieves about a \(10\times\) speedup. The GPU speedup is larger than \(800\times\), making it the most suitable for ELM. The FPGA speedup is about \(40\times\), and it is also suitable for ELM owing to its small energy consumption.

  3. Drawbacks The CPU reaches a performance limit when the number of threads is large, which was 32 threads in our benchmark. The GPU may incur a relatively high data transfer cost, since data must cross the PCI-E bus. The FPGA is only suitable for relatively small datasets, since its block RAM is small compared with host memory.

5 Related works

The extreme learning machine (ELM) [9,10,11,12, 14,15,16] was introduced by Huang as a classification algorithm with a relatively fast learning speed and good generalization performance [14]. Because of these advantages, ELM has been applied in many fields and has shown significant application performance [13, 20, 24, 30, 34, 37, 38, 41, 43, 44].

ELM with parallelism Several methods have been proposed to speed up the ELM algorithm from a parallel perspective. A parallel incremental extreme SVM classifier was proposed in [7]. ELM algorithms for large-scale regression on GPUs were proposed in [1, 17, 36]. Besides, paper [8] described an algorithm designed and implemented on the MapReduce framework. Research has shown that Field-Programmable Gate Arrays (FPGAs) outperform general-purpose processors on machine learning algorithms [39]. As the hardware implementation of ELM for classification is still at an early stage, and FPGA is a suitable device for its implementation, we are confronted with the research question of what performance can be achieved with existing ELM computational methods on FPGA devices. At present, there are only a few studies touching upon this topic [5, 25, 40].

6 Conclusion

In this work, we first evaluated the performance of the ELM algorithm based on a single-core implementation and concluded that the main cost of ELM lies in matrix multiplication. Then, we designed and implemented optimized algorithms for different hardware devices (multi-core, SIMD, GPU, and FPGA). All of the above hardware achieves a performance improvement, and the best performance is obtained by the GPU, especially on large datasets. We strongly recommend (1) using GPUs to accelerate ELM algorithms for large datasets, and (2) using FPGAs for small datasets because of their lower power consumption, especially for embedded applications.