1 Introduction

Machine learning has been extensively applied to image classification and recognition. In some methods, feature extraction and selection play an important role before neural networks perform classification, and whether the features can represent the characteristics of the images largely determines the classification accuracy. Local binary patterns (LBPs) (Pietikäinen 2010) and the gray level co-occurrence matrix (GLCM) (Haralick et al. 1973) are features widely used for texture classification. On the other hand, researchers have utilized neural networks with local receptive fields, such as the convolutional neural network (CNN) (LeCun et al. 1995), to process images directly without an extra feature extraction step. In CNN, the structure formed by the convolutional layer and the pooling layer acts as a feature extraction procedure.

Extreme learning machine (ELM) was proposed by Huang et al. (2006, 2012) and performs well in both regression and classification. ELM is derived from single-hidden layer feed-forward neural networks (SLFNs), which consist of one hidden layer and one output layer. Its advantages include that the hidden layer parameters can be generated randomly and that it learns faster while keeping superior performance. ELM has been used in image classification and recognition combined with other algorithms or various kinds of image features. Li et al. (2015) employed ELM as the classifier with image features extracted by LBPs, which presented better performance than the state-of-the-art methods. Meanwhile, ELM and graph-based optimization methods were fused to boost remote sensing image classification (Bencherif et al. 2015). Zeng et al. (2017) presented a traffic sign recognition method in which CNN was used to extract features from images and ELM was again applied as the classifier.

To utilize ELM to process images directly, Huang et al. (2015) proposed the local receptive field based extreme learning machine (ELM-LRF). ELM-LRF outperforms CNN on the NORB dataset (LeCun et al. 2004a) in both classification accuracy and time consumption. Huang et al. (2017) proposed a modified ELM-LRF for texture image classification that employs multi-scale convolution kernels (ELM-MSLRF) and can therefore learn texture information at different scales. According to the experimental results, ELM-MSLRF is superior to ELM-LRF.

Gabor filters (Fogel and Sagi 1989) have been successfully employed in object detection (Jain et al. 1997), image segmentation (Jain and Farrokhnia 1991), classification (Rajadell et al. 2013) and edge detection (Mehrotra et al. 1992) for more than two decades. By providing information at different scales and orientations, Gabor filters mimic aspects of human visual perception and are often used for texture representation and description. In practical applications, Gabor filters can extract the relevant features at different scales and orientations in the frequency domain. Recently, Gabor filters have also been used as the convolution kernels of CNN (GCNN) to improve speech recognition (Chang and Morgan 2014).

On the other hand, data augmentation has been widely used in neural networks to prevent overfitting on small datasets (Cui et al. 2015; Krizhevsky et al. 2012). In these studies, label-preserving transformations were used to augment the training data. Cui et al. (2015) proposed two data augmentation methods to deal with data sparsity for both deep neural networks (DNNs) and CNN; the proposed methods increase the variations of the training data. Krizhevsky et al. (2012) introduced two forms of data augmentation to reduce overfitting: the first applied image translations and horizontal reflections, and the second altered the intensities of the RGB channels of the training images.

Motivated by the aforementioned research, two main improvements are introduced in this paper. First, to improve the performance of ELM-LRF in image classification, we use Gabor functions as one kind of convolution kernel. The Gabor functions provide more image information through filters of different scales and orientations. The proposed method, extreme learning machine with hybrid local receptive fields (ELM-HLRF), produces feature maps in the convolutional layer using both Gabor filters and randomly generated convolution filters. Second, we propose a data augmentation method using label-preserving transformations to improve classification performance. This method preprocesses the training images with Gaussian blur; the blurred images and the original training images are then combined as augmented data to train the classifiers. We evaluate the proposed methods on the following datasets: the Outex dataset (Ojala et al. 2002), the Yale face database (Georghiades et al. 1997), the ORL face database (Samaria and Harter 1994) and the NORB dataset (LeCun et al. 2004b). The experimental results demonstrate that: first, ELM-HLRF performs better than ELM-LRF, ELM and the support vector machine (SVM) (Cortes and Vapnik 1995) in classification accuracy; second, the proposed data augmentation method improves classification performance.

The rest of this paper is organized as follows. Section 2 reviews related works, Sect. 3 gives a detailed description of the proposed methods, and Sect. 4 reports and discusses the experimental results. Finally, Sect. 5 concludes the paper.

2 Related works

2.1 Gabor filters in image processing

The two-dimensional discrete Gabor function can be written as follows

$$\begin{aligned} \phi (x,y) = \frac{1}{{2\pi {\delta _x}{\delta _y}}}\exp \left[ { -\, \frac{1}{2}\left( {\frac{{{x^2}}}{{\delta _x^2}} + \frac{{{y^2}}}{{\delta _y^2}}} \right) } \right] \exp (2\pi jWx), \end{aligned}$$
(1)

where W denotes the radial frequency of the Gabor wavelet, and \({\delta _x}\) and \({\delta _y}\) are the parameters of the Gaussian envelope along the x-axis and y-axis respectively.

The Gabor filter with frequency W and orientation \(\theta \) by coordinate rotation can be given by

$$\begin{aligned} \phi '(x,y) = \frac{1}{{2\pi {\delta _x}{\delta _y}}}\exp \left[ { -\, \frac{1}{2}\left( {\frac{{x{'^2}}}{{\delta _x^2}} + \frac{{y{'^2}}}{{\delta _y^2}}} \right) } \right] \exp (2\pi jWx'), \end{aligned}$$
(2)

where \(\phi '(x,y)\) is the two-dimensional discrete Gabor function with \(x' = {\alpha ^{ - s}}(x \cdot \cos {\theta _l} + y \cdot \sin {\theta _l})\) and \(y' = {\alpha ^{ - s}}( -\, x \cdot \sin {\theta _l} + y \cdot \cos {\theta _l})\), where s is the scale index (\(s = 1,2, \ldots ,p\)), l is the orientation index (\(l = 1,2, \ldots ,q\)), the superscript \(*\) denotes the complex conjugate, and \(\alpha > 1\) is the scale factor. The pair (x, y) denotes the initial coordinate, while (\(x', y'\)) denotes the transformed coordinate. The symbols p and q are the numbers of scales and orientations of the Gabor filters respectively. The functions \(\phi (x,y)\) and \(\phi '(x,y)\) have the following relationship:

$$\begin{aligned} \phi (x,y) = {\alpha ^{ - s}}\phi '(x,y). \end{aligned}$$
(3)
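As an illustrative sketch (not the authors' implementation), the rotated and scaled Gabor function of Eqs. (2)–(3) can be sampled on a discrete grid in NumPy; the kernel size and parameter values below are assumptions chosen for illustration only:

```python
import numpy as np

def gabor_kernel(size, W, theta, s, delta_x=1.0, delta_y=1.0, alpha=2.0):
    """Sample the rotated, scaled 2-D Gabor function of Eq. (2) on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    scale = alpha ** (-s)                       # the alpha^{-s} factor
    xr = scale * (x * np.cos(theta) + y * np.sin(theta))
    yr = scale * (-x * np.sin(theta) + y * np.cos(theta))
    envelope = np.exp(-0.5 * (xr**2 / delta_x**2 + yr**2 / delta_y**2))
    carrier = np.exp(2j * np.pi * W * xr)       # complex sinusoid along x'
    return envelope * carrier / (2 * np.pi * delta_x * delta_y)

# A bank with p scales and q orientations, as used later for feature extraction
p, q = 3, 4
bank = [gabor_kernel(9, W=0.4, theta=l * np.pi / q, s=s)
        for s in range(1, p + 1) for l in range(q)]
```

Each element of `bank` is a complex \(9 \times 9\) kernel; a bank of \(p \cdot q\) such kernels covers all scale/orientation combinations.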

The Fourier transform of \(\phi (x,y)\) is

$$\begin{aligned} \varPhi (u,v) = \exp \left\{ { -\, \frac{1}{2}\left[ {\frac{{{{(u - W)}^2}}}{{\delta _u^2}} + \frac{{{v^2}}}{{\delta _v^2}}} \right] } \right\} , \end{aligned}$$
(4)

where \({\delta _u} = 1/(2\pi {\delta _x})\) and \({\delta _v} = 1/(2\pi {\delta _y})\). Let I(x, y) be the input image; I(x, y) filtered by \(\phi '(x,y)\) can be written as

$$\begin{aligned} F(x,y) = \sum \limits _{{x_1}} {\sum \limits _{{y_1}} {I({x_1},{y_1})\phi '^*_{}(x - {x_1},y - {y_1})} }, \end{aligned}$$
(5)

where F(x, y) is the filter response. The mean and standard deviation of the magnitude of F(x, y) can be used as features (Manjunath and Ma 1996) of image tiles for texture classification; they are calculated as

$$\begin{aligned} {\mu _{s,l}}= & {} \sum \limits _{{x}} \sum \limits _{{y}} {{|{F}(x,y)} } |, \end{aligned}$$
(6)
$$\begin{aligned} {\delta _{s,l}}= & {} \sqrt{\sum \limits _{{x}} \sum \limits _{{y}} {{(|{F}(x,y)} } | - {\mu _{s,l}}{)^2}} . \end{aligned}$$
(7)

The feature vector for I(xy) is represented as \([{\mu _{1,1}},{\delta _{1,1}},{\mu _{1,2}},{\delta _{1,2}}, \ldots ,{\mu _{p,q}},{\delta _{p,q}}]\).
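A minimal sketch of this feature extraction, following Eqs. (5)–(7) literally (note that the paper's \(\mu \) and \(\delta \) are unnormalised sums; practical implementations often divide by the number of pixels); `scipy.signal.convolve2d` performs the convolution, and the function name is ours:

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_features(image, kernels):
    """Mean and standard-deviation features of the Gabor responses, Eqs. (5)-(7)."""
    feats = []
    for k in kernels:
        # Eq. (5): convolution of the image with the conjugated kernel
        response = convolve2d(image, np.conj(k), mode='same')
        mag = np.abs(response)
        mu = mag.sum()                            # Eq. (6), as written in the paper
        delta = np.sqrt(((mag - mu) ** 2).sum())  # Eq. (7)
        feats.extend([mu, delta])
    # Feature vector [mu_11, delta_11, ..., mu_pq, delta_pq]
    return np.array(feats)
```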

Figure 1 presents an instance of Gabor filtered face image. In this instance, we set \(p = 5\) and \(q = 8\), so there are 40 Gabor filtered results.

Fig. 1
figure 1

b Gabor filtered results of the image in a. In b, each row shares the same scale and each column shares the same orientation

2.2 Brief review of ELM

ELM was proposed by Huang et al. (2006, 2012). Compared with traditional neural networks, ELM has a faster learning speed and higher accuracy, and the weights and biases of its hidden layer can be assigned randomly. The flowchart of ELM is shown in Fig. 2. Let \(({X_j},{t_j})\) \((j = 1,2, \ldots ,N)\) be the N input samples of the SLFNs, where \({X_j} = {[{x_{j1}},{x_{j2}}, \ldots ,{x_{jn}}]^T} \in {R^n}\) denotes the input feature vector and \({t_j} \in \{1,2, \ldots ,m\}\) denotes the target value of \({X_j}\). An SLFN with L hidden nodes can be written as follows

$$\begin{aligned} \sum \limits _{i = 1}^L {{\beta _i}g({W_i} \cdot {X_j} + {b_i})} = {o_j},\quad j = 1, \ldots ,N, \end{aligned}$$
(8)

where \(g( \cdot )\) denotes the activation function, \({W_i} = [{w_{i,1}},{w_{i,2}}, \ldots ,{w_{i,n}}]\) denotes the input weights, \({\beta _i}\) denotes the output weight and \({b_i}\) denotes the bias of the ith hidden layer unit. Function \({W_i} \cdot {X_j}\) denotes the inner product of \({W_i}\) and \({X_j}\).

The goal of the SLFNs is to minimize the output error; ideally, the outputs \({o_j}\) and the target outputs \({t_j}\) satisfy

$$\begin{aligned} \sum \limits _{j = 1}^N {\left\| {{o_j} - {t_j}} \right\| } = 0. \end{aligned}$$
(9)

For the training dataset, we have the following assumption:

$$\begin{aligned}&\sum \limits _{i = 1}^L {{\beta _i}g({W_i} \cdot {X_j} + {b_i})} = {t_j},\quad j = 1, \ldots ,N, \end{aligned}$$
(10)
$$\begin{aligned}&H\beta = T, \end{aligned}$$
(11)

where H is the hidden layer output matrix, \(\beta \) is the matrix of output weights and \(T\) collects the target outputs of all inputs. Explicitly,

$$\begin{aligned} \begin{array}{l} H({W_1}, \ldots ,{W_L},{b_1}, \ldots ,{b_L},{X_1}, \ldots ,{X_N})\\ \quad = {\left[ {\begin{array}{ccc} {g({W_1} \cdot {X_1} + {b_1})}&{} \cdots &{}{g({W_L} \cdot {X_1} + {b_L})}\\ \vdots &{} \ddots &{} \vdots \\ {g({W_1} \cdot {X_N} + {b_1})}&{} \cdots &{}{g({W_L} \cdot {X_N} + {b_L})} \end{array}} \right] _{N \times L}}, \end{array} \end{aligned}$$
(12)

where

$$\begin{aligned} \beta = {\left[ {\begin{array}{c} {{\beta _1}}\\ \vdots \\ {{\beta _L}} \end{array}} \right] _{L \times 1}},T = {\left[ {\begin{array}{c} {t_1^{}}\\ \vdots \\ {t_N^{}} \end{array}} \right] _{N \times 1}}. \end{aligned}$$
(13)

After training, we obtain \({\hat{W}_i}\), \({\hat{b}_i}\) and \({\hat{\beta }_i}\) which satisfy the following equation

$$\begin{aligned} \left\| {H({{\hat{W}}_i},{{\hat{b}}_i}){{\hat{\beta }}_i} - T} \right\| = \mathop {\min }\limits _{W,b,\beta } \left\| {H({W_i},{b_i}){\beta _i} - T} \right\| , \end{aligned}$$
(14)

where \(i = 1, \ldots ,L\). Minimizing (14) is equivalent to minimizing the loss function

$$\begin{aligned} E = {\sum \limits _{j = 1}^N {\left( {\sum \limits _{i = 1}^L {{\beta _i}g({W_i} \cdot {X_j} + {b_i}) - {t_j}} } \right) } ^2}. \end{aligned}$$
(15)

\(\beta \) can be calculated as follows

$$\begin{aligned} \hat{\beta }= {H^\dag }T, \end{aligned}$$
(16)

where \({H^\dag }\) is the Moore-Penrose generalized inverse of H.
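The training procedure above amounts to a few lines of linear algebra. A minimal NumPy sketch of ELM (the sigmoid activation, hidden size and function names are our illustrative choices, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, T, L, seed=0):
    """Train a basic ELM: random hidden parameters, closed-form output weights (Eq. 16)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((L, X.shape[1]))   # random input weights W_i
    b = rng.standard_normal(L)                 # random biases b_i
    H = sigmoid(X @ W.T + b)                   # hidden layer output matrix, Eq. (12)
    beta = np.linalg.pinv(H) @ T               # Moore-Penrose solution, Eq. (16)
    return W, b, beta

def predict_elm(X, W, b, beta):
    return sigmoid(X @ W.T + b) @ beta
```

Note that only `beta` is learned; the random hidden layer is never updated, which is what makes ELM training fast.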

Fig. 2
figure 2

The flowchart of ELM

2.3 Brief review of ELM-LRF

In ELM-LRF, the connections between the input layer and each node of the hidden layer are generated according to a continuous probability distribution. These random connections constitute local receptive fields. ELM-LRF consists of four layers: the convolutional (hidden) layer, the pooling layer, the fully connected layer and the output layer.

In the hidden layer, the convolution kernels \({a_{{i}}}\) (\({i} = 1,2, \ldots ,{k'}\)) are randomly generated. Assume that the initial input weights are \({\hat{A}^{init}}\), the size of each input weight is \(r \times r\) and the size of each input image is \(d \times d\); the size of each feature map is then \((d - r + 1) \times (d - r + 1)\). Then

$$\begin{aligned} \begin{array}{l} {{\hat{A}}^{init}} \in {R^{{r^2} \times {k'}}},{{\hat{A}}^{init}} = [\hat{a}_1^{init},\hat{a}_2^{init}, \cdots ,\hat{a}_{{k'}}^{init}],\\ \mathrm{{ }}\hat{a}_{{i}}^{init} \in {R^{{r^2}}},{i} = 1, \ldots ,{k'}, \end{array} \end{aligned}$$
(17)

where \({\hat{A}^{init}}\) is orthogonalised using singular value decomposition (SVD). The orthogonalised input weights are \(\hat{A} = [{\hat{a}_1},{\hat{a}_2}, \ldots ,{\hat{a}_{{k'}}}]\), and each column of \(\hat{A}\) is an orthogonal basis vector of \({\hat{A}^{\mathrm{{init}}}}\). If \(r^{2}<{k'}\), \({\hat{A}^{init}}\) is transposed first, then orthogonalised, and transposed back. The convolution weight of the ith feature map, \({a_{{i}}} \in {\mathrm{{R}}^{r \times r}}\), is formed by reshaping \({\hat{a}_{{i}}}\) column by column. The convolution result of node \(({x_1},{x_2})\) at the ith feature map is \({c_{{x_1},{x_2},{i}}}\):

$$\begin{aligned} \begin{array}{l} {c_{{x_1},{x_2},{i}}} = \sum \limits _{{m_1} = 1}^r {\sum \limits _{{m_2} = 1}^r {({I_{{x_1} + {m_1} - 1,{x_2} + {m_2} - 1}} \cdot {a_{{m_1},{m_2},{i}}})} } \\ \mathop {}\nolimits _{} \mathop {}\nolimits _{} {x_1},{x_2} = 1, \ldots ,(d - r + 1), \end{array} \end{aligned}$$
(18)

where \({I_{{x_1} + {m_1} - 1,{x_2} + {m_2} - 1}}\) is the pixel value of input image I at location \(({x_1} + {m_1} - 1,{x_2} + {m_2} - 1)\).
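A sketch of how such orthogonalised random kernels can be generated; the SVD-based orthogonalisation follows the description above, while the function name and reshape convention (row-major here, column-wise in the paper) are our assumptions:

```python
import numpy as np

def random_orthogonal_kernels(r, k, seed=0):
    """Random convolution kernels orthogonalised with SVD, as in Eq. (17)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((r * r, k))          # \hat{A}^{init}, one kernel per column
    if r * r >= k:
        U, _, Vt = np.linalg.svd(A, full_matrices=False)
        A_orth = U @ Vt                          # columns are orthonormal
    else:
        # r^2 < k': transpose first, orthogonalise, transpose back
        U, _, Vt = np.linalg.svd(A.T, full_matrices=False)
        A_orth = (U @ Vt).T
    return A_orth.T.reshape(k, r, r)             # each row becomes an r x r kernel
```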

In the pooling layer, the pooling size e denotes the distance between the center and the edge of the pooling area, and the pooled maps have the same size as the feature maps, \((d - r + 1) \times (d - r + 1)\). The symbol \({c_{{x_1},{x_2},i}}\) is node (\({x_1},{x_2}\)) of the \(i\mathrm{{th}}\) feature map and \({h_{{p_1},{p_2},i}}\) is node \(({p_1},{p_2})\) of the \(i\mathrm{{th}}\) pooled map, \(i = 1,2, \ldots , k'\). The value \({h_{{p_1},{p_2},i}}\) is obtained using

$$\begin{aligned} \begin{array}{l} {h_{{p_1},{p_2},i}} = \sqrt{\sum \limits _{{x_1} = {p_1} - e}^{{p_1} + e} {\sum \limits _{{x_2} = {p_2} - e}^{{p_2} + e} {c_{{x_1},{x_2},i}^2} } } \\ {p_1},{p_2} = 1, \ldots ,(d - r + 1). \end{array} \end{aligned}$$
(19)

If \(({x_1},{x_2})\) is out of bound, \({c_{{x_1},{x_2},i}} = 0\).
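The square-root pooling of Eq. (19) can be sketched as follows; zero padding implements the out-of-bound rule, and the loop-based form is for clarity rather than speed:

```python
import numpy as np

def sqrt_pool(c, e):
    """Square-root pooling of Eq. (19): for each node, the root of the sum of
    squared convolution values in a (2e+1) x (2e+1) window, zero-padded at borders."""
    n = c.shape[0]                       # feature map is n x n, n = d - r + 1
    padded = np.zeros((n + 2 * e, n + 2 * e))
    padded[e:e + n, e:e + n] = c         # out-of-bound nodes count as 0
    h = np.empty_like(c, dtype=float)
    for p1 in range(n):
        for p2 in range(n):
            window = padded[p1:p1 + 2 * e + 1, p2:p2 + 2 * e + 1]
            h[p1, p2] = np.sqrt(np.sum(window ** 2))
    return h
```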

In the fully connected layer, each pooled map is reshaped into a row vector. The size of each pooled map is \((d - r + 1) \times (d - r + 1)\). If there are N input images, the matrix \(H' \in {\mathrm{{R}}^{N \times [k' \cdot {{(d - r + 1)}^2}]}}\) is calculated as

$$\begin{aligned} H' = {\left[ {\begin{array}{cccc} {{{\hat{h}}_{1,1}}}&{}{{{\hat{h}}_{1,2}}}&{} \cdots &{}{{{\hat{h}}_{1,k' \cdot {{(d - r + 1)}^2}}}}\\ {{{\hat{h}}_{2,1}}}&{}{{{\hat{h}}_{2,2}}}&{} \cdots &{}{{{\hat{h}}_{2,k' \cdot {{(d - r + 1)}^2}}}}\\ \vdots &{} \vdots &{} \cdots &{} \vdots \\ {{{\hat{h}}_{N,1}}}&{}{{{\hat{h}}_{N,2}}}&{} \cdots &{}{{{\hat{h}}_{N,(k') \cdot {{(d - r + 1)}^2}}}} \end{array}} \right] _{N \times [k' \cdot {{(d - r + 1)}^2}]}}, \end{aligned}$$
(20)

where \({\hat{h}_{i,j}} = g({W_j} \cdot {I_i} + {b_j})\), \({I_i}\) is the ith (\(1 \le i \le N\)) input image and \(1 \le j \le k' \cdot {(d - r + 1)^2}\). The output weight \(\beta \) of ELM-LRF is calculated as (Huang et al. 2012, 2015)

$$\begin{aligned}&\begin{array}{l} \beta = {H'}_{}^{^T}{\left( \frac{I}{C} + {H'}{H'}_{}^T\right) ^{ - 1}}T\\ \mathrm{{if}}\mathop {}\nolimits _{} N \le k' \cdot {(d - r + 1)^2}, \end{array} \end{aligned}$$
(21)
$$\begin{aligned}&\begin{array}{l} \beta = {\left( \frac{I}{C} + H'^{T}H'\right) ^{ - 1}}H'^{T}T\\ \mathrm{{if}}\mathop {}\nolimits _{} N > k' \cdot {(d - r + 1)^2}. \end{array} \end{aligned}$$
(22)

3 Methods

Section 3.1 introduces the proposed method ELM-HLRF, and Sect. 3.2 presents the data augmentation method.

3.1 Local receptive field based extreme learning machine with hybrid filter kernels

In this paper, we propose a neural network that uses both Gabor filters and randomly generated convolution kernels in the convolutional layer. The proposed method is called hybrid local receptive field based extreme learning machine (ELM-HLRF). In this modified topology, the randomly generated convolution kernels used in ELM-LRF and Gabor filters of different scales and orientations are combined to process the input images.

In image processing, with different combinations of the scale and orientation parameters, Gabor filters are employed to detect contours of various scales and orientations. Therefore the scales and orientations of Gabor filters are important parameters, and these parameters need to be chosen to get optimal training results in ELM-HLRF.

Fig. 3
figure 3

The flowchart of ELM-HLRF

The flowchart of ELM-HLRF is shown in Fig. 3. In Fig. 3, \({G_{{i_1}}} \in {\mathrm{{R}}^{r \times r}}\)\(({i_1} = 1, \ldots ,{k_1})\) is the convolution kernel provided by Gabor filters, \({k_1}\mathrm{{ = }}p \cdot q\) and

$$\begin{aligned} {G_{{i_1}}} =\,\phi _{{i_1}}'(x,y),\quad \mathop {}\limits _{} {i_1} = 1, \ldots ,{k_1}. \end{aligned}$$
(23)

The convolution result when \(\phi _{{i_1}}'(x,y)\) is applied to input image I is

$$\begin{aligned} {F_{{i_1}}}(x,y) = \sum \limits _{{x_1}} {\sum \limits _{{y_1}} {I({x_1},{y_1})\phi '^*_{{i_1}}(x - {x_1},y - {y_1})} }. \end{aligned}$$
(24)

The other kind of convolution kernel \({{a'}_{{i_2}}}\) (\({i_2} = 1,2, \ldots ,{k_2}\)) is randomly generated. According to (17), the initial input weights are \({\hat{A'}^{init}}\), and

$$\begin{aligned} \begin{array}{l} {{\hat{A'}}^{init}} \in {R^{{r^2} \times {k_2}}},{{\hat{A'}}^{init}} = \left[ \hat{a'}_1^{init},\hat{a'}_2^{init}, \ldots ,\hat{a'}_{{k_2}}^{init}\right] ,\\ \hat{a'}_{{i_2}}^{init} \in {R^{{r^2}}},\quad {i_2} = 1, \ldots ,{k_2}. \end{array} \end{aligned}$$
(25)

The orthogonalised result of \({\hat{A'}^{init}}\) is \(\hat{A'} = [{\hat{a'}_1},{\hat{a'}_2}, \ldots ,{\hat{a'}_{{k_2}}}]\). The convolution result of node \(({x_1},{x_2})\) at the \({i_2}\mathrm{{th}}\) feature map is \({{c'}_{{x_1},{x_2},{i_2}}}\), which is obtained using (18).

The convolution weights of the convolutional layer of ELM-HLRF consist of \({G_{{i_1}}} \in {\mathrm{{R}}^{r \times r}}\)\(({i_1} = 1, \ldots ,{k_1})\) and \({\mathrm{{{a'}}}_{{i_2}}} \in {\mathrm{{R}}^{r \times r}}\)\(({i_2} = 1, \ldots ,{k_2})\). The pooling map \({{h'}_{{p_1},{p_2},i}}\) (\(i = 1,2, \ldots , ({k_1} + {k_2})\)) can be calculated using (19).

As presented in Sect. 2.3, in the fully connected layer, the matrix \({\hat{H}} \in {\mathrm{{R}}^{N \times [({k_1} + {k_2}) \cdot {{(d - r + 1)}^2}]}}\) can be calculated as

$$\begin{aligned} {\hat{H}} = {\left[ {\begin{array}{cccc} {{{\hat{h}}_{1,1}}}&{}{{{\hat{h}}_{1,2}}}&{} \cdots &{}{{{\hat{h}}_{1,({k_1} + {k_2}) \cdot {{(d - r + 1)}^2}}}}\\ {{{\hat{h}}_{2,1}}}&{}{{{\hat{h}}_{2,2}}}&{} \cdots &{}{{{\hat{h}}_{2,({k_1} + {k_2}) \cdot {{(d - r + 1)}^2}}}}\\ \vdots &{} \vdots &{} \cdots &{} \vdots \\ {{{\hat{h}}_{N,1}}}&{}{{{\hat{h}}_{N,2}}}&{} \cdots &{}{{{\hat{h}}_{N,({k_1} + {k_2}) \cdot {{(d - r + 1)}^2}}}} \end{array}} \right] _{N \times [({k_1} + {k_2}) \cdot {{(d - r + 1)}^2}]}}. \end{aligned}$$
(26)

The output weight \(\xi \) of ELM-HLRF is calculated as

$$\begin{aligned}&\begin{array}{l} \xi = {\hat{H}}_{}^{^T}{\left( \frac{I}{C} + {\hat{H}}{\hat{H}}_{}^T\right) ^{ - 1}}T\\ \mathrm{{if}}\mathop {}\nolimits _{} N \le ({k_1} + {k_2}) \cdot {(d - r + 1)^2}, \end{array} \end{aligned}$$
(27)
$$\begin{aligned}&\begin{array}{l} \xi = {\left( \frac{I}{C} + {\hat{H}}^{T}{\hat{H}}\right) ^{ - 1}}{\hat{H}}^{T}T\\ \mathrm{{if}}\mathop {}\nolimits _{} N > ({k_1} + {k_2}) \cdot {(d - r + 1)^2}. \end{array} \end{aligned}$$
(28)

3.2 Data augmentation

In supervised machine learning, sufficient data are needed to train neural networks so that the model is robust and overfitting is avoided. Data augmentation is a commonly used method that enlarges datasets by label-preserving transformations. Related augmentation methods include color jittering, PCA jittering, random scale transformation, random cropping, horizontal and vertical flipping, translation, rotation, reflection, affine transformation, Gaussian noise and blurring.

In this paper, we carry out data augmentation by altering the pixel intensities of the training images. A \(5 \times 5\) Gaussian blur kernel with standard deviation 1 is used to process each training image, and each blurred image shares the label of its source image. We then train the network on the blurred images together with the original training images. Figure 4 gives an example of a Gaussian blurred face image. Trained with the augmented datasets, the learning machine becomes more robust to blurred details.
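A sketch of this augmentation step; `scipy.ndimage.convolve` applies the blur, and the border mode is our assumption since the paper does not specify how image borders are handled:

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_kernel_5x5(sigma=1.0):
    """5 x 5 Gaussian kernel with standard deviation sigma, normalised to sum to 1."""
    ax = np.arange(-2, 3, dtype=float)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def augment_with_blur(images, labels, sigma=1.0):
    """Label-preserving augmentation: append a Gaussian-blurred copy of each image."""
    k = gaussian_kernel_5x5(sigma)
    blurred = [convolve(img, k, mode='nearest') for img in images]
    return list(images) + blurred, list(labels) * 2
```

The augmented list is twice the size of the original training set, with each blurred copy carrying its source image's label.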

Fig. 4
figure 4

b Gaussian blurred result of the original face image in a

4 Experimental results and analysis

In this paper, we use five datasets to evaluate the performance of ELM-HLRF. Details about the datasets, the parameters setup, the experimental results and the discussions are introduced in this section.

4.1 Datasets for performance evaluation

In this paper, we use five datasets to evaluate the performance of ELM-HLRF: the Outex\(\_\)TC\(\_\)00000, the Outex\(\_\)TC\(\_\)00012, the Yale face database, the ORL face database and the NORB dataset.

We use two different Outex test suites: Outex\(\_\)TC\(\_\)00000 and Outex\(\_\)TC\(\_\)00012. Outex\(\_\)TC\(\_\)00000 consists of 8832 images of 24 different textures; each texture has 368 images, 184 for training and 184 for testing. We use the 099 folder of Outex\(\_\)TC\(\_\)00000 for training and testing in our experiment. Outex\(\_\)TC\(\_\)00012 contains 1440 images of 24 different textures; each texture has 60 samples, 20 for training and 40 for testing, and the images are recorded under different illuminations. Images of Outex\(\_\)TC\(\_\)00012 have a resolution of 128 by 128 pixels, and we resize them to 32 by 32. Figures 5 and 6 show texture image samples of Outex\(\_\)TC\(\_\)00000 and Outex\(\_\)TC\(\_\)00012 respectively.

Fig. 5
figure 5

24 different textures of the Outex\(\_\)TC\(\_\)00000 dataset

Fig. 6
figure 6

24 different textures of the Outex\(\_\)TC\(\_\)00012 dataset

The Yale face database contains 165 grayscale images of 15 individuals. There are 11 images per subject with different expressions or configurations: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised and winking. Each image is 100 by 100 pixels; images of the Yale face database are also resized to 32 by 32 pixels to speed up the experiments in this paper. Figure 7 shows image samples from the Yale face database.

Fig. 7
figure 7

Images of the Yale face database

The ORL face database (Samaria and Harter 1994) consists of 400 images, 10 for each of 40 distinct subjects. For some subjects, the images were taken at different times, with varying lighting, facial expressions (open or closed eyes, smiling or not smiling) and facial details (glasses or no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position. The size of each image is 92 by 112 pixels; images of the ORL face database are also resized to 32 by 32 in this paper. Figure 8 shows image samples from the ORL face database.

Fig. 8
figure 8

Image samples of the ORL face database

The NORB dataset is a benchmark for object recognition (LeCun et al. 2004c). It contains images of 50 toys belonging to 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The objects were imaged by two cameras under 6 lighting conditions, 9 elevations (30\(^{\circ }\)–70\(^{\circ }\) every 5\(^{\circ }\)), and 18 azimuths (0\(^{\circ }\)–340\(^{\circ }\) every 20\(^{\circ }\)). There are 48,600 images (\(50 \times 6 \times 9 \times 18\)) in the NORB dataset, half of which (i.e., 24,300 images) are used for training and the other half for testing. We downsize them to \(32\times 32\) in the experiments.

4.2 Parameters setup

The experiments are carried out on the five datasets using Matlab on a Windows 7 64-bit system with an Intel(R) Core(TM) i5-4210U CPU and 64 GB RAM.

In this paper, the number of hidden nodes of ELM ranges from 100 to 2000. For each experiment, we search for the parameters that produce the optimal classification results on each dataset. Seven parameters have a direct effect on classification accuracy: the number of Gabor filter scales p, the number of Gabor filter orientations q, the number of convolution filters of ELM-HLRF \({k_2}\), the number of convolution filters of ELM-LRF \(k'\), the convolution size r, the pooling size e and the regularization parameter C. The value of C is chosen from \(\{0.01, 0.1, 1, 10, 100\}\); p ranges from 1 to 5 with stride 1; q is 4 or 8; \(k'\) and \({k_2}\) range from 4 to 80 with stride 4. The convolution size r ranges from 4 to 9 and the pooling size e ranges from 3 to 8.

4.3 Results and analysis

4.3.1 Performance evaluation on dataset Outex\(\_\)TC\(\_\)00012

Figure 9 shows the relationship among the classification accuracy, the pooling size e and the convolution size r of ELM-LRF when the number of convolution filters is set to 48. We can see from Fig. 9 that the highest classification accuracy is obtained when \(e=8\) and \(r=9\). For the Outex datasets Outex\(\_\)TC\(\_\)00000 and Outex\(\_\)TC\(\_\)00012, the parameters of ELM-HLRF are listed in Table 1.

Fig. 9
figure 9

Classification accuracy with different pooling size e, convolution size r of ELM-LRF when the number of the convolution filters \({k_2}\) is set to 48

Table 1 The parameters of ELM-HLRF on dataset Outex\(\_\)TC\(\_\)00000 and Outex\(\_\)TC\(\_\)00012
Table 2 The classification accuracy (\(\%\)) when ELM-HLRF and ELM-LRF are applied to dataset Outex\(\_\)TC\(\_\)00012 and \(k'\) is a constant

Table 2 presents the classification accuracy when ELM-HLRF and ELM-LRF are applied to dataset Outex\(\_\)TC\(\_\)00012. It can be seen from Table 2 that the proposed ELM-HLRF improves classification accuracy by providing Gabor filtered maps in the convolutional layer. To evaluate the quality of Gabor filters as convolution kernels, we compare the classification accuracy of ELM-HLRF with that of ELM-LRF, ELM and SVM. Table 3 shows that ELM-LRF achieves its best result (82.50\(\%\)) when \(k'= 60\), while ELM-HLRF achieves the highest accuracy (96.77\(\%\)) when \({k_2}=48\), \(p = 3\) and \(q = 4\); ELM-HLRF thus outperforms the other methods in classification accuracy.

Table 3 The classification accuracy (\(\%\)) when ELM-HLRF (\({k_2}=48\)) and ELM-LRF are applied to dataset Outex\(\_\)TC\(\_\)00012 and \(k' = ({k_2}+p \cdot q)\)

Figure 10 shows the classification results of ELM-LRF (\(k'= 60\)) and ELM-HLRF (\({k_2}=48\), \(p = 5\) and \(q = 4\)) with different numbers of training samples; the classification accuracy of ELM-HLRF is always higher than that of ELM-LRF. Figure 11 gives the time consumption of the two methods with varying numbers of training samples. It should be noted that the time here includes the time to produce the convolution weights and, for ELM-HLRF, the time of the Gabor filtering procedure in the training step. ELM-HLRF consumes more time than ELM-LRF, but it has more convolution nodes and higher classification accuracy.

ELM-HLRF is also compared with the method presented by Yang et al. (2018). The average classification accuracy of that method on Outex\(\_\)TC\(\_\)00012 is 96.54\(\%\), while ELM-HLRF achieves a higher accuracy (96.88\(\%\)).

Fig. 10
figure 10

Classification accuracy (\(\%\)) with varying number of training samples when ELM-HLRF and ELM-LRF are applied to Outex\(\_\)TC\(\_\)00012

Fig. 11
figure 11

Time-consumption (s) with varying number of training samples when ELM-HLRF and ELM-LRF are applied to Outex\(\_\)TC\(\_\)00012

4.3.2 Performance evaluation on dataset Outex\(\_\)TC\(\_\)00000

The classification results when ELM-LRF and ELM-HLRF are applied to dataset Outex\(\_\)TC\(\_\)00000 are shown in Tables 4 and 5. Several conclusions can be drawn from the two tables: first, Gabor filters are efficient convolution kernels for image feature extraction; second, Table 5 shows that when \(k' = ({k_2}+p \cdot q)\) the classification accuracy of ELM-HLRF is higher than that of ELM-LRF, so the hybrid filter kernels are superior to purely random convolution kernels. ELM-HLRF achieves its optimal performance at \({k_2}=48\), \(p=3\), \(q = 8\), and the accuracy does not increase further as p and q grow.

Table 4 The classification accuracy (\(\%\)) when ELM-HLRF and ELM-LRF are applied to dataset Outex\(\_\)TC\(\_\)00000 and \(k'=48\)
Table 5 The classification accuracy (\(\%\)) when ELM-HLRF (\({k_2} =48\)) and ELM-LRF are applied to dataset Outex\(\_\)TC\(\_\)00000 and \(k' = ({k_2}+p \cdot q)\)

Figure 12 shows the classification results of ELM-LRF (\(k'= 88\)) and ELM-HLRF (\({k_2}=48\), \(p = 5\) and \(q = 8\)) with different numbers of training samples; the classification accuracy of ELM-HLRF is always higher than that of ELM-LRF. Figure 13 gives the time consumption of the two methods with varying numbers of training samples. The numbers of convolution nodes of ELM-LRF and ELM-HLRF are the same in this experiment, and Fig. 13 shows that ELM-LRF consumes more time than ELM-HLRF.

The results of ELM-HLRF on Outex\(\_\)TC\(\_\)00000 are also compared with those of other methods. With the input images of the same size (\(32\times 32\)), ELM-HLRF has higher classification accuracy (83.54\(\%\)) than the method ((\(76.7\pm 1.8)\%\)) presented by Reininghaus et al. (2015).

Fig. 12
figure 12

Classification accuracy (\(\%\)) with varying number of training samples when the proposed method and ELM-LRF are applied to Outex\(\_\)TC\(\_\)00000

Fig. 13
figure 13

Time-consumption (seconds) with varying number of training samples when ELM-HLRF and ELM-LRF are applied to Outex\(\_\)TC\(\_\)00000

4.3.3 Performance evaluation on the Yale face database

Figure 14 shows the classification results when ELM-LRF is applied to the Yale face database, with 5 training samples per class. We can see from Fig. 14 that increasing the number of convolution maps contributes little to the classification accuracy, so we set \(k'=4\) and \(k'=8\) in this paper to limit time consumption.

Table 6 shows the classification accuracy of ELM, SVM, ELM-LRF, the method presented by Zhang et al. (2014) and ELM-HLRF. When the number of convolution filters is set to 8 (\({k_2}=4\), \(p=1\) and \(q = 4\)), the classification accuracy of ELM-HLRF is higher than that of other methods. Specifically, when the number of convolution nodes of ELM-LRF and ELM-HLRF is the same, ELM-HLRF outperforms ELM-LRF.

Fig. 14
figure 14

The classification accuracy when ELM-LRF is applied to the Yale face database and the number of training maps for each class is 5

Table 6 The comparison of classification accuracy (\(\%\)) on the Yale face database

4.3.4 Performance evaluation on the ORL face database

The classification results of ELM-HLRF on the ORL face database are shown in Table 7. In this experiment, 5 images per class are used for training and the remaining 5 for testing. We compare ELM-HLRF with ELM-LRF and the image classification method presented by Xu et al. (2015); the optimal parameters of ELM-LRF and ELM-HLRF are shown in Table 7. ELM-HLRF and ELM-LRF achieve the same classification accuracy (98.50\(\%\)), but the training time of ELM-HLRF is longer because it has more convolution kernels.

Table 7 The comparison of classification accuracy (\(\%\)) and training time (s) on the ORL face database

4.3.5 Performance evaluation on the NORB dataset

Table 9 shows the comparison of classification accuracy on the NORB dataset. The parameters of ELM-LRF and ELM-HLRF are shown in Table 8, and the optimal parameters of ELM-HLRF are shown in Table 9. We can see from Table 9 that ELM-HLRF achieves good performance (97.45\(\%\)), which is comparable to ELM-MSLRF (97.50\(\%\)) and better than the other methods. Besides, the training time of ELM-HLRF is longer than that of ELM-LRF and ELM-MSLRF because the former has more convolution kernels.

Table 8 The parameters of ELM-LRF and ELM-HLRF on NORB
Table 9 The comparison of classification accuracy (\(\%\)) and training time (s) on NORB

4.3.6 Data augmentation

In this paper, data augmentation is considered to improve the overall classification accuracy. We first use Gaussian blur to preprocess the training images. The Gaussian-blurred images, combined with the original training images, are then provided as the new training inputs of ELM-HLRF or ELM-LRF. Specifically, we use a \(5 \times 5\) Gaussian blur function with a standard deviation of 1 to preprocess the original training images. Table 10 shows the classification results when data augmentation is applied. It can be concluded that the results obtained with the proposed data augmentation method are better than those without data augmentation.
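The augmentation step above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it builds a normalized \(5 \times 5\) Gaussian kernel with \(\sigma = 1\), blurs each training image, and appends the blurred copies to the original training set. The function names and the edge-padding choice are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized size x size Gaussian kernel."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()  # normalize so the kernel sums to 1

def blur(image, kernel):
    """2-D convolution with edge padding, preserving image size."""
    pad = kernel.shape[0] // 2
    padded = np.pad(image, pad, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            window = padded[i:i + kernel.shape[0], j:j + kernel.shape[1]]
            out[i, j] = np.sum(window * kernel)
    return out

def augment(train_images):
    """Append a Gaussian-blurred copy of every training image."""
    k = gaussian_kernel(5, 1.0)
    blurred = [blur(img, k) for img in train_images]
    return list(train_images) + blurred  # doubled training set
```

The doubled training set is then fed to the classifier exactly as the original one would be; labels for the blurred copies simply repeat the original labels.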

Table 10 Classification accuracy (\(\%\)) of the three datasets before and after data augmentation

4.4 Discussions

Fig. 15 The relationship between the number of Gabor filters and the number of random convolution kernels in ELM-HLRF on Outex\(\_\)TC\(\_\)00012

Fig. 16 The relationship between the number of Gabor filters and the number of random convolution kernels in ELM-HLRF on the ORL face database

Figures 15 and 16 show the relationship between the number of Gabor filters and the number of random convolution kernels in ELM-HLRF on the Outex\(\_\)TC\(\_\)00012 dataset and the ORL face database, respectively. We can see from Figs. 15 and 16 that the accuracy increases with the number of random convolution kernels at first, and then levels off: further increases in the number of random convolution kernels have little effect on the classification accuracy when \({k_2}\ge 24\) for Outex\(\_\)TC\(\_\)00012 and \({k_2}\ge 20\) for the ORL face database.

Furthermore, the results on Outex\(\_\)TC\(\_\)00012, the ORL face database and the NORB dataset show that ELM-HLRF needs more convolution kernels than ELM-LRF to achieve its highest classification accuracy. Therefore, ELM-HLRF spends more training time to reach its optimal performance. On the other hand, ELM-HLRF has higher accuracy than ELM-LRF, which indicates that the Gabor filters provide features that the random convolution kernels cannot extract from images.
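To make the hybrid kernel bank concrete, the following sketch assembles Gabor filter kernels alongside ELM-LRF-style random kernels. This is an illustration under assumptions, not the paper's exact construction: it uses the real part of a standard Gabor filter with evenly spaced orientations, illustrative wavelength and bandwidth parameters, and zero-meaned Gaussian random kernels (the orthogonalization step of ELM-LRF is omitted for brevity).

```python
import numpy as np

def gabor_kernel(size, theta, sigma=2.0, lam=4.0, gamma=0.5, psi=0.0):
    """Real part of a Gabor filter at orientation theta (radians).
    sigma: Gaussian envelope width; lam: sinusoid wavelength;
    gamma: spatial aspect ratio; psi: phase offset."""
    ax = np.arange(size) - size // 2
    x, y = np.meshgrid(ax, ax)
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + gamma ** 2 * yr ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * xr / lam + psi)

def hybrid_kernel_bank(size, k1, k2, seed=None):
    """Stack k1 Gabor kernels at evenly spaced orientations
    with k2 random convolution kernels (zero-meaned Gaussian)."""
    rng = np.random.default_rng(seed)
    gabors = [gabor_kernel(size, np.pi * i / k1) for i in range(k1)]
    randoms = []
    for _ in range(k2):
        w = rng.standard_normal((size, size))
        randoms.append(w - w.mean())  # remove the DC component
    return np.stack(gabors + randoms)  # shape: (k1 + k2, size, size)
```

Each kernel in the resulting bank is convolved with the input image to produce one feature map, so the total number of convolution maps is \(k_1 + k_2\), matching the two-part kernel structure described above.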

5 Conclusions

In this paper, we propose an innovative method, local receptive field based extreme learning machine with hybrid filter kernels (ELM-HLRF), for image classification. Two kinds of convolution kernels are included in ELM-HLRF: the random kernels of ELM-LRF and Gabor filter kernels. In addition, a data augmentation method based on Gaussian blur is used to improve classification performance. We evaluate the performance of ELM-HLRF and the data augmentation method on five datasets: Outex\(\_\)TC\(\_\)00000, Outex\(\_\)TC\(\_\)00012, the Yale face database, the ORL face database and the NORB dataset. It can be concluded that ELM-HLRF achieves higher classification accuracy than ELM-LRF, SVM and ELM, which shows that Gabor filter kernels are effective convolution kernels. The experimental results also indicate that training data augmented with the proposed method are more effective than the original training data.