1 Introduction

Efficient machine learning algorithms [1, 2] are currently the focus of many researchers. To complete network training quickly and efficiently without iteration, Huang et al. [3, 4] proposed the simple and effective extreme learning machine (ELM). In ELM, the connection weights and biases between the input layer and the hidden layer are randomly initialized, and the regularized least squares method is used to analytically determine the output-layer weights. Huang et al. [5] extended ELM by introducing kernel learning and proposed the extreme learning machine with kernel (KELM), which achieves better results by introducing a kernel function into ELM. In KELM, the step of selecting the number of hidden nodes is omitted by using kernel feature mapping.

In some classification methods, feature extraction and selection play an important role before classification is performed by the classifier [6, 7]. Both ELM and KELM are single-hidden-layer network structures, which cannot handle image classification problems well [8]. Some improved algorithms based on ELM [8,9,10] deepen the network or change its structure. These algorithms only improve the general approximation ability and do not improve the ability to extract important features from images. To address these problems, Huang et al. [11] proposed the local receptive field based extreme learning machine (ELM-LRF), which can process images directly by combining the concept of local receptive fields with ELM. The input layer and the hidden convolutional layer in ELM-LRF are locally connected, which allows the network to process the local structure of an image like convolutional neural networks (CNNs) [12,13,14,15]. Since ELM-LRF does not require the back-propagation (BP) algorithm [16] during training, the computation and time required for training are greatly reduced.

ELM-LRF has been improved in two ways. On the one hand, ELM-LRF has only one convolutional layer and one pooling layer, and its performance is limited by this shallow architecture. Researchers have therefore proposed to improve it by using multiple sets of convolutional and pooling layers to obtain more abstract feature representations, such as CKELM [17] and DC-ELM [18], and they show that multiple sets of convolutional and pooling layers can effectively improve classification accuracy. However, DC-ELM simply increases the number of convolutional layers and does not consider the identifiable category details contained in the shallow features, resulting in wasted features. On the other hand, ELM-LRF uses single-scale local receptive fields, which are not suitable for complex texture image classification. Liu et al. [19] proposed a multi-mode ELM-LRF (MM-LRF-ELM) framework, in which the features of each RGB channel and the depth features are extracted separately and then used to construct nonlinear features. Huang et al. [20] proposed an extreme learning machine with multi-scale local receptive fields (ELM-MSLRF) by combining several local receptive fields of different sizes. Compared with ELM-LRF, ELM-MSLRF obtains better results in texture image classification. He et al. [21] proposed an extreme learning machine with hybrid local receptive fields (ELM-HLRF), which uses the Gabor function as a convolutional kernel filter. In ELM-HLRF, Gabor filters with different scales and orientations are used to extract more features, which are combined with the local receptive field features to improve classification accuracy. Owing to the limited size of the local receptive fields and the way the convolution is performed, features with global attributes cannot be extracted by the above methods.

To address the above two issues, this paper proposes a new network structure, the extreme learning machine with autoencoding receptive fields (ELM-ARF). ELM-ARF introduces two main improvements. First, the theory of the ELM autoencoder (ELM-AE) [22] and local receptive fields is used to train global receptive fields, which extract the object contours in the image. After being pooled separately, the global receptive field features are concatenated with the local receptive field features and input into the next layer. Second, by using the identity mapping idea of the residual network (ResNet) [23], the shallow layers are directly connected to the deep layers, so that the category details contained in the shallow-layer features can flow to any deep layer in the network. By performing a series of combined encodings on the features extracted from each layer, ELM-ARF makes full use of the features contained in the image and effectively improves image classification accuracy.

The main content of this paper is organized as follows. The theory of ELM-AE is introduced in Sect. 2. The structure and training processes of ELM-ARF are introduced in Sect. 3. Section 4 presents our experimental results on USPS, MNIST, NORB and CIFAR10 databases. The paper concludes in Sect. 5.

2 Related works

The main objective of ELM-AE is to produce three different equivalent representations of the original input features: (1) compressed representation, where the input features are mapped from the original feature space to a lower-dimensional feature space; (2) sparse representation, where the input features are mapped from the original feature space to a higher-dimensional feature space; and (3) equal-dimension representation, where the input features are mapped from the original feature space to a feature space of the same dimension.

The equivalent representation of ELM-AE can be implemented in two steps.

In the first step, the mapping matrix \({\varvec \beta}\) is obtained using the same principle as training the ELM output weight matrix. \({\varvec{X}} \in {\mathbb{R}}^{{N \times n_{i} }}\) is the input sample matrix, \(N\) is the number of samples, and \(n_{i}\) is the dimension of the samples. The input weight \({\varvec{a}} \in {\mathbb{R}}^{{n_{i} \times n_{h} }}\) between the input layer and the hidden layer is randomly initialized, and the hidden-layer bias is \({\varvec{b}} \in {\mathbb{R}}^{{N \times n_{h} }}\). The output feature matrix of the hidden layer is then \({\varvec{H}} = \text{g} ({\varvec{Xa}} + {\varvec{b}}) \in {\mathbb{R}}^{{N \times n_{h} }}\), where \(n_{h}\) is the dimension of the hidden-layer features and \(\text{g} \left( \cdot \right)\) is the activation function. \({\varvec{X}}\) is used as the target output to train the network output weight \({\varvec \beta}\), so that \({\varvec{X}} = {\varvec{H}}{\varvec \beta}\). \({\varvec \beta}\) can be calculated by:

$${\varvec \beta} { = }\left\{ {\begin{array}{*{20}c} {\left( {\frac{{{\varvec{I}}_{{n_{h} }} }}{C} + {\varvec{H}}^{\text{T}} {\varvec{H}}} \right)^{ - 1} {\varvec{H}}^{\text{T}} {\varvec{X}}} & {N > n_{h} } \\ {{\varvec{H}}^{\text{T}} \left( {\frac{{{\varvec{I}}_{N} }}{C} + {\varvec{HH}}^{\text{T}} } \right)^{ - 1} {\varvec{X}}} & {N < n_{h} } \\ \end{array} } \right.$$
(1)

In the second step, the input sample features are equivalently mapped using the trained mapping matrix \({\varvec \beta}\) to obtain the equivalent features \({\varvec{X}}_{\text{out}} = {\varvec{X}}{\varvec \beta}^{\text{T}}\), \({\varvec{X}}_{\text{out}} \in {\mathbb{R}}^{{N \times n_{h} }}\). When \(n_{i} > n_{h}\), the compressed representation is obtained; when \(n_{i} < n_{h}\), the sparse representation is obtained; and when \(n_{i} = n_{h}\), the equal-dimension representation is obtained.
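To make the two steps concrete, the following NumPy sketch implements formula (1) and the equivalent mapping under our own assumptions (a sigmoid activation, Gaussian random input weights and a bias row broadcast over the samples, and the illustrative function name elm_ae); it is not the authors' implementation.

```python
import numpy as np

def elm_ae(X, n_h, C=1.0, seed=0):
    """Return the ELM-AE mapping matrix beta and the equivalent features X @ beta.T."""
    rng = np.random.default_rng(seed)
    N, n_i = X.shape
    a = rng.standard_normal((n_i, n_h))        # random input weights
    b = rng.standard_normal((1, n_h))          # random bias, broadcast over the N rows
    H = 1.0 / (1.0 + np.exp(-(X @ a + b)))     # hidden-layer output g(Xa + b)
    if N > n_h:                                # formula (1), case N > n_h
        beta = np.linalg.solve(np.eye(n_h) / C + H.T @ H, H.T @ X)
    else:                                      # formula (1), case N < n_h
        beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, X)
    X_out = X @ beta.T                         # second step: equivalent representation
    return beta, X_out

# Example: sparse representation, mapping 64-dimensional features to 128 dimensions.
beta, X_out = elm_ae(np.random.rand(100, 64), n_h=128)
```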

3 Network structure of ELM-ARF

In order to extract features with global attributes and to utilize the category details contained in the shallow-layer features, ELM-ARF consists of a convolution feature extraction part and a feature coding part, as shown in Fig. 1.

Fig. 1
figure 1

Architecture of ELM-ARF

3.1 Convolution feature extraction

In the convolution feature extraction part, the global receptive fields are encoded by using the theory of ELM-AE. The convolution features are extracted from the local receptive fields and the global receptive fields. After being pooled separately, the above two features are concatenated to fuse different receptive field features, and these fused features are input into the next convolutional layer.

3.1.1 Autoencoding of global receptive field

The global receptive fields are trained by utilizing the theory of ELM-AE and the local receptive fields. ELM-AE training requires a randomly initialized input weight matrix. Since local receptive fields have obvious advantages in extracting image features [11], we use the local receptive fields, which are also randomly initialized, as the input weights to train the global receptive fields. Because the local receptive fields are not in the form of a matrix, they cannot be used directly. Therefore, the local receptive fields need to be equivalently transformed into a weight matrix for training. The equivalent transformation is shown in Figs. 2 and 3.

Fig. 2
figure 2

Convolution steps. A shows the convolution steps of a \(2 \times 2\) local receptive field, with convolution order a, b, c and d. B shows the convolution steps of the \(3 \times 3\) receptive fields, with convolution order e, f, g and h

Fig. 3
figure 3

Equivalent of the convolution step

As shown in branch A of Fig. 2, a \(3 \times 3\) matrix is convolved with a \(2 \times 2\) local receptive field. The convolution steps are shown in the middle part of branch A, and the local receptive field generates a feature map of size \(2 \times 2\) in the sliding order a, b, c and d. Branch B shows the \(3 \times 3\) matrix convolved with four \(3 \times 3\) receptive fields. In each receptive field, the matrix values at the positions with coefficient 0 have no effect on the generated convolution value. Therefore, each step of the convolution in branch A is equivalent to the convolution with the corresponding \(3 \times 3\) receptive field in branch B. The convolution values generated in branch B are arranged in a matrix according to the convolution order of branch A, and this matrix is the same as the one generated by branch A. Therefore, the convolution operation of branch A can be equivalently represented by branch B.

In Fig. 3, the columns of the \(3 \times 3\) matrix are concatenated to generate a \(9 \times 1\) column vector, and the four receptive fields (e, f, g and h) are each concatenated in the same way to generate column vectors. These column vectors are transposed into row vectors and merged into a matrix in the convolution order e, g, f and h. The convolution operation of branch B in Fig. 2 is then equivalent to the product of the two matrices in Fig. 3. Therefore, in this paper, the local receptive fields are extended into a weight matrix, and the ELM-AE method is utilized to train the global receptive field matrix.

Suppose the size of the input image is \(d \times d\) and the size of the local receptive field is \(r \times r\); then the size of the output feature map is \(\left( {d - r + 1} \right) \times \left( {d - r + 1} \right)\). The input weight matrix \({\hat{\varvec{A}}}^{\text{init}} \in {\mathbb{R}}^{{r^{2} \times g}}\) is randomly initialized, where \(g\) is the number of local receptive fields. \({\hat{\varvec{A}}}^{\text{init}}\) is orthogonalized using the singular value decomposition (SVD) method to generate the matrix \({\hat{\varvec{A}}} \in {\mathbb{R}}^{{r^{2} \times g}}\). Let \(F = \left( {d - r + 1} \right)^{2} \cdot g\); \({\hat{\varvec{A}}}\) is extended to the weight matrix \({\varvec{W}} \in {\mathbb{R}}^{{d^{2} \times F}}\) according to the concept shown in Figs. 2 and 3. Then, the hidden-layer output matrix is \({\varvec{H}} = {\varvec{XW}}\), and the global receptive field matrix \({\hat{\varvec{B}}} \in {\mathbb{R}}^{{d^{2} \times F}}\) is trained using formula (1). The global receptive field matrix for the \(k\)-th feature map is \({\hat{\varvec{B}}}_{k} \in {\mathbb{R}}^{{d^{2} \times \left( {d - r + 1} \right)^{2} }} ,k = 1,2, \ldots ,g\). Each column of \({\hat{\varvec{B}}}_{k}\) is reshaped into a receptive field. These receptive fields, arranged in the convolution order of branch A in Fig. 2, form the \(k\)-th global receptive field \({\varvec{B}}_{k} \in {\mathbb{R}}^{{d \times d \times \left( {d - r + 1} \right)^{2} }} ,k = 1,2, \ldots ,g\). Therefore, the number of global receptive fields is the same as the number of local receptive fields, and the two types of feature maps have the same size.
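As a concrete illustration of this construction, the sketch below expands the orthogonalized local receptive fields into the weight matrix \({\varvec{W}}\) and trains \({\hat{\varvec{B}}}\) with formula (1). It assumes row-major flattening applied consistently to images and weights (the paper stacks columns as in Fig. 3, which is equivalent up to a permutation) and a linear hidden layer \({\varvec{H}} = {\varvec{XW}}\); the function names are ours, not the authors'.

```python
import numpy as np

def expand_lrf_to_weight_matrix(A_hat, d):
    """A_hat: (r*r, g) orthogonalized local receptive fields -> W: (d*d, (d-r+1)^2 * g)."""
    r2, g = A_hat.shape
    r = int(np.sqrt(r2))
    m = d - r + 1                                  # feature-map side length
    W = np.zeros((d * d, m * m * g))
    for k in range(g):
        lrf = A_hat[:, k].reshape(r, r)
        for i in range(m):                         # one column per convolution position (Fig. 2B)
            for j in range(m):
                padded = np.zeros((d, d))          # zero coefficients outside the window
                padded[i:i + r, j:j + r] = lrf
                W[:, k * m * m + i * m + j] = padded.reshape(-1)
    return W

def train_global_receptive_fields(X, r, g, C=1.0, seed=0):
    """X: (N, d*d) flattened images. Returns B_hat with shape (d*d, (d-r+1)^2 * g)."""
    rng = np.random.default_rng(seed)
    N, d2 = X.shape
    d = int(np.sqrt(d2))
    A_init = rng.standard_normal((r * r, g))
    U, _, Vt = np.linalg.svd(A_init, full_matrices=False)
    A_hat = U @ Vt                                 # SVD-based orthogonalization
    W = expand_lrf_to_weight_matrix(A_hat, d)
    H = X @ W                                      # hidden-layer output (no activation)
    F = W.shape[1]
    if N > F:                                      # formula (1)
        beta = np.linalg.solve(np.eye(F) / C + H.T @ H, H.T @ X)
    else:
        beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, X)
    return beta.T                                  # B_hat in the d^2 x F layout of Sect. 3.1.1
```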

3.1.2 Convolution and pooling operation

In Fig. 1, the first layer and the second layer each consist of a convolutional layer and a pooling layer. Local receptive fields and global receptive fields are used to extract features in the two convolutional layers. Similar to ELM-LRF [11], \({\varvec{A}}^{L} \in {\mathbb{R}}^{r \times r \times g} ,L = 1,2\) is used to represent the local receptive fields, where \(L\) is the index of the convolutional layer, \(r\) is the local receptive field size, \(g\) is the number of local receptive fields, and \({\varvec{a}}^{L}_{k} \in {\mathbb{R}}^{r \times r} ,k = 1,2, \ldots ,g\) is the \(k\)-th local receptive field of the \(L\)-th layer. The size of the input image \({\varvec{X}}^{L - 1}\) is \(d \times d\), and the size of the output feature map is \(\left( {d - r + 1} \right) \times \left( {d - r + 1} \right)\). The convolutional node \(\left( {i,j} \right)\) in the feature map of the \(k\)-th local receptive field is calculated as:

$${\varvec{l}}^{L}_{i,j,k} \left( {\varvec{X}} \right) = \sum\limits_{u = 1}^{r} {\sum\limits_{v = 1}^{r} {x_{i + u - 1,j + v - 1} \cdot a_{u,v,k} } } ,i,j = 1, \ldots ,\left( {d - r + 1} \right)$$
(2)

The global receptive fields \({\varvec{B}}^{L}_{k} \in {\mathbb{R}}^{{d \times d \times \left( {d - r + 1} \right)^{2} }} ,k = 1,2, \ldots g,L = 1,2\) are trained according to Sect. 3.1.1. \({\varvec{b}}^{L}_{i,j,k} \in {\mathbb{R}}^{d \times d}\) is the receptive field corresponding to the convolutional node \(\left( {i,j} \right)\) in the \(k\)-th feature map of the \(L\)-th layer. The convolutional node \(\left( {i,j} \right)\) in the feature map of the \(k\)-th global receptive field is calculated as:

$${\varvec{g}}^{L}_{i,j,k} \left( {\varvec{X}} \right) = \sum\limits_{u = 1}^{d} {\sum\limits_{v = 1}^{d} {x_{u,v} \cdot b_{u,v,i,j,k} } } ,i,j = 1, \ldots ,\left( {d - r + 1} \right)$$
(3)

Then, the generated \({\varvec{l}}^{L}\) and \({\varvec{g}}^{L}\) are each input into a pooling layer of size \(e\), and the combinatorial node \(\left( {p,q} \right)\) in the \(k\)-th pooling map of the local and global receptive fields is calculated, respectively, as:

$${\varvec{lh}}^{L}_{p,q,k} = \sqrt {\sum\limits_{i = p - e}^{p + e} {\sum\limits_{j = q - e}^{q + e} {\left( {{\varvec{l}}^{L}_{i,j,k} } \right)} }^{2} } ,p,q = 1, \ldots ,\left( {d - r + 1} \right)$$
(4)
$${\varvec{gh}}^{L}_{p,q,k} = \sqrt {\sum\limits_{i = p - e}^{p + e} {\sum\limits_{j = q - e}^{q + e} {\left( {{\varvec{g}}^{L}_{i,j,k} } \right)} }^{2} } ,p,q = 1, \ldots ,\left( {d - r + 1} \right)$$
(5)

The generated \({\varvec{lh}}^{L}\) and \({\varvec{gh}}^{L}\) are concatenated into \({\varvec{X}}^{L} = \left[ {{\varvec{lh}}^{L} ,{\varvec{gh}}^{L} } \right]\). \({\varvec{X}}^{L}\) is input into the next layer, and the above operations are repeated to fully fuse the local features with the global features.
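The following sketch illustrates formulas (2)-(5) for a single image: valid convolution with a local receptive field, node-wise convolution with a global receptive field, square/square-root pooling, and concatenation of the pooled local and global features. Zero padding outside the feature map during pooling and all helper names are our assumptions.

```python
import numpy as np

def local_conv(x_img, a_k):
    """Formula (2): valid convolution of a d x d image with an r x r local receptive field."""
    d, r = x_img.shape[0], a_k.shape[0]
    m = d - r + 1
    out = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            out[i, j] = np.sum(x_img[i:i + r, j:j + r] * a_k)
    return out

def global_conv(x_img, B_k):
    """Formula (3): node (i, j) uses its own d x d receptive field B_k[:, :, i*m + j]."""
    m = int(np.sqrt(B_k.shape[2]))
    out = np.array([np.sum(x_img * B_k[:, :, n]) for n in range(m * m)])
    return out.reshape(m, m)

def sqrt_pool(fmap, e):
    """Formulas (4)-(5): square/square-root pooling of size e, zeros outside the map."""
    m = fmap.shape[0]
    padded = np.zeros((m + 2 * e, m + 2 * e))
    padded[e:e + m, e:e + m] = fmap
    out = np.empty((m, m))
    for p in range(m):
        for q in range(m):
            out[p, q] = np.sqrt(np.sum(padded[p:p + 2 * e + 1, q:q + 2 * e + 1] ** 2))
    return out

# One layer for one image: pool each feature map, then concatenate local and global parts.
# lh = [sqrt_pool(local_conv(x_img, a[:, :, k]), e) for k in range(g)]
# gh = [sqrt_pool(global_conv(x_img, B[k]), e) for k in range(g)]
# x_next = np.concatenate([np.ravel(f) for f in lh + gh])
```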


3.2 Feature coding

In the feature coding part, the dimensions of the convolution features of each layer are reduced, and then, the low-dimensional features are input into the final two-hidden-layer extreme learning machine for encoding and classification.

3.2.1 Feature dimension reduction

In order to make full use of the identifiable category details in the shallow-layer features, the input image, the output feature maps of the first layer and the feature maps of the second layer are each concatenated to generate a feature vector. An image or a set of feature maps is concatenated by columns to generate a high-dimensional feature vector. For example, the image size of the NORB database is \(32 \times 32 \times 2\), and the generated feature vector is 2048-dimensional. In order to reduce the dimension while encoding, these three vectors are multiplied by their corresponding matrices of the feature dimension reduction layer, and the sample labels are used to train the weight matrices \(\eta_{i} (i = 1,2,3)\) according to the theory of ELM. Let \({\varvec{X}}_{1} = {\varvec{X}}^{1}\), \({\varvec{X}}_{2} = {\varvec{X}}^{2}\) and \({\varvec{X}}_{3} = {\varvec{X}}^{0}\); the three weight matrices can be trained by formula (6), where \(i = 1,2,3\), \({\varvec{T}} \in {\mathbb{R}}^{{N \times {\text{m}}}}\) is the label matrix corresponding to the input images, and \(m\) is the dimension of \({\varvec{T}}\). \(N\) is the number of samples in \({\varvec{X}}_{i}\), and \(P_{i}\) is the feature dimension of \({\varvec{X}}_{i}\).

$${\varvec \eta}_{i} = \left\{ {\begin{array}{*{20}c} {\left(\frac{{{\varvec{I}}_{{P_{i} }}}}{C} + {\varvec{X}}_{i}^{\text{T} } {\varvec{X}}_{i} \right)^{ - 1} {\varvec{X}}_{i}^{\text{T} } {\varvec{T}}} & {N > P_{i} } \\ {{\varvec{X}}_{i}^{\text{T} } \left(\frac{{{\varvec{I}}_{N}}}{C} + {\varvec{X}}_{i} {\varvec{X}}_{i}^{\text{T} } \right)^{ - 1} {\varvec{T}}} & {N < P_{i} } \\ \end{array} } \right.$$
(6)

Then, the output of the feature dimension reduction layer is \({\varvec{Y}}_{i} = {\varvec{X}}_{i} {\varvec \eta}_{i} (i = 1,2,3)\), with \({\varvec{Y}}_{i} \in {\mathbb{R}}^{{N \times {\text{m}}}}\). The label vector of the NORB database is 5-dimensional. After encoding with \({\varvec \eta}_{i} (i = 1,2,3)\), the dimension of the output features \({\varvec{Y}}_{i}\) is reduced to 5, which effectively reduces the computation of the subsequent equivalent encoding.
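A minimal sketch of this layer is given below, assuming one-hot label vectors and the illustrative helper name ridge_weights; the branching follows formula (6).

```python
import numpy as np

def ridge_weights(X_i, T, C=1.0):
    """Formula (6): regularized least-squares weight eta_i, choosing the cheaper inverse."""
    N, P_i = X_i.shape
    if N > P_i:
        return np.linalg.solve(np.eye(P_i) / C + X_i.T @ X_i, X_i.T @ T)
    return X_i.T @ np.linalg.solve(np.eye(N) / C + X_i @ X_i.T, T)

# X1, X2: flattened pooled features of layers 1 and 2; X3: flattened input images;
# T: N x m one-hot label matrix.
# Y1, Y2, Y3 = (X_i @ ridge_weights(X_i, T) for X_i in (X1, X2, X3))   # each Y_i is N x m
```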

3.2.2 Two-hidden-layer extreme learning machine

After the dimension reduction step, the features \({\varvec{Y}}_{1}\), \({\varvec{Y}}_{2}\) and \({\varvec{Y}}_{3}\) are concatenated, and \({\varvec{Q}} = [{\varvec{Y}}_{1} ,{\varvec{Y}}_{2} ,{\varvec{Y}}_{3} ]\) is input into the two-hidden-layer ELM, which is used to combine and encode the features of each layer. Here, \({\varvec \beta}_{1}\) and \({\varvec \beta}_{2}\) are used to encode features, and \({\varvec \beta}_{3}\) is used to classify the final features. The dotted line in the two-hidden-layer ELM indicates that the input features are concatenated with the output features. Specifically, \({\varvec{R}} = {\varvec{Q}}{\varvec \beta}_{1}\) and the input of the \({\varvec \beta}_{2}\) layer is \(\left[ {{\varvec{R}},{\varvec{Q}}} \right]\); \({\varvec{S}} = \left[ {{\varvec{R}},{\varvec{Q}}} \right]{\varvec \beta}_{2}\), and the input of the \({\varvec \beta}_{3}\) layer is \(\left[ {{\varvec{S}},{\varvec{R}},{\varvec{Q}}} \right]\), where \({\varvec{S}} \in {\mathbb{R}}^{{N \times {\text{m}}}}\), \({\varvec{R}} \in {\mathbb{R}}^{{N \times {\text{m}}}}\) and \({\varvec{Q}} \in {\mathbb{R}}^{{N \times 3{\text{m}}}}\). This connection structure ensures that the features of each layer can flow to the deeper layers of the structure, so that the features of the first layer are still utilized when the \({\varvec \beta}_{3}\) layer uses the features \(\left[ {{\varvec{S}},{\varvec{R}},{\varvec{Q}}} \right]\).

In order to enable \({\varvec \beta}_{1}\) and \({\varvec \beta}_{2}\) to perform dimensionality reduction while encoding features, \({\varvec \beta}_{1}\) and \({\varvec \beta}_{2}\) are trained using the sample labels \({\varvec{T}}\). Let \({\varvec{H}}_{1} = {\varvec{Q}}\), \({\varvec{H}}_{2} = \left[ {{\varvec{R}},{\varvec{Q}}} \right]\) and \({\varvec{H}}_{3} = \left[ {{\varvec{S}},{\varvec{R}},{\varvec{Q}}} \right]\). \(N\) is the number of samples in \({\varvec{H}}_{i}\), and \(M_{i}\) is the feature dimension of \({\varvec{H}}_{i}\).

$${\varvec \beta}_{i} { = }\left\{ {\begin{array}{*{20}c} {\left(\frac{{{\varvec{I}}_{{M_{i} }}}}{C} + {\varvec{H}}_{i}^{\text{T} } {\varvec{H}}_{i} \right)^{ - 1} {\varvec{H}}_{i}^{\text{T} } {\varvec{T}}} & {N > M_{i} } \\ {{\varvec{H}}_{i}^{\text{T} } \left(\frac{{{\varvec{I}}_{N}}}{C} + {\varvec{H}}_{i} {\varvec{H}}_{i}^{\text{T} } \right)^{ - 1} {\varvec{T}}} & {N < M_{i} } \\ \end{array} } \right.$$
(7)
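The sketch below strings the encoding layers together under the same ridge form as formulas (6) and (7); the function names are ours, and the \({\varvec \beta}_{3}\) classification layer is treated separately below.

```python
import numpy as np

def ridge(H, T, C=1.0):
    """Formula (7): trains beta_i against the labels T, choosing the cheaper inverse."""
    N, M = H.shape
    if N > M:
        return np.linalg.solve(np.eye(M) / C + H.T @ H, H.T @ T)
    return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)

def encode_two_hidden_layers(Q, T, C=1.0):
    """Q: N x 3m concatenated reduced features, T: N x m labels.
    Returns H3 = [S, R, Q], the input of the beta_3 classification layer."""
    beta1 = ridge(Q, T, C)          # H1 = Q
    R = Q @ beta1                   # first encoding, N x m
    H2 = np.hstack([R, Q])          # shallow features flow forward (dotted line in Fig. 1)
    beta2 = ridge(H2, T, C)         # H2 = [R, Q]
    S = H2 @ beta2                  # second encoding, N x m
    return np.hstack([S, R, Q])     # H3, input of beta_3
```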

Better accuracy is obtained by using kernel mapping in the \({\varvec \beta}_{3}\) classification layer. By replacing the matrix product in the formula with a kernel function, the features are mapped from the low-dimensional space to a high-dimensional space. The traditional KELM has only the form of the formula used when \(N < M_{3}\). However, when \(N \gg M_{3}\), that formula produces a high-dimensional square matrix. For example, MNIST has 60,000 training samples, while the sample dimension is only 784; the formula used when \(N < M_{3}\) would produce a square matrix of size \(60000 \times 60000\), and inverting such a high-dimensional square matrix significantly increases the amount of calculation. In order to improve the classification accuracy and avoid generating a high-dimensional matrix, when \(N > M_{3}\), we replace \({\varvec{H}}_{3}^{T} {\varvec{H}}_{3}\) in \({\varvec \beta}_{3}\) with the Gaussian radial basis kernel function (8) to realize the partial kernel mapping of KELM, as shown in formula (9).

$$\text{K} (x_{1} ,x_{2} ) = \exp \left( { - \frac{{\left\| {x_{1} - x_{2} } \right\|^{2} }}{{\sigma^{2} }}} \right),\sigma > 0$$
(8)
$${\varvec \beta}_{3} = \bigg(\frac{{{\varvec{I}}_{{M_{3} }}}}{C} + \text{K} \left({\varvec{H}}_{3}^{\text{T} } ,{\varvec{H}}_{3}^{\text{T} } \right)\bigg)^{{{ - }1}} {\varvec{H}}_{3}^{\text{T} } {\varvec{T}}$$
(9)

The input test feature of the \({\varvec \beta}_{3}\) layer is \({\varvec{h}}_{3}\), and the output prediction value of the two-hidden-layer ELM is \({\varvec{t}} = {\varvec{h}}_{3} {\varvec \beta}_{3}\). When \(N < M_{3}\), we use the traditional form of KELM, and the predicted value is calculated as:

$${\varvec{t}} = \text{K} ({\varvec{h}}_{3} ,{\varvec{H}}_{3} )\bigg(\frac{{{\varvec{I}}_{N}}}{C} + \text{K} ({\varvec{H}}_{3} ,{\varvec{H}}_{3} )\bigg)^{ - 1} {\varvec{T}}$$
(10)
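A sketch of the classification layer covering both cases is given below; the Gaussian kernel and the branching on \(N\) versus \(M_{3}\) follow formulas (8)-(10), while the function names and the choice of \(\sigma\) are our own.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Formula (8): K(a, b) = exp(-||a - b||^2 / sigma^2) between rows of A and rows of B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma ** 2)

def kernel_classify(H3, T, h3, C=1.0, sigma=1.0):
    """Train beta_3 on training features H3 (N x M3) and predict for test features h3."""
    N, M3 = H3.shape
    if N > M3:                                   # formula (9): kernel over the M3 feature columns
        K = rbf_kernel(H3.T, H3.T, sigma)        # M3 x M3, avoids an N x N matrix
        beta3 = np.linalg.solve(np.eye(M3) / C + K, H3.T @ T)
        return h3 @ beta3                        # prediction t = h3 beta_3
    K_train = rbf_kernel(H3, H3, sigma)          # formula (10): traditional KELM form
    K_test = rbf_kernel(h3, H3, sigma)
    return K_test @ np.linalg.solve(np.eye(N) / C + K_train, T)
```

In the \(N > M_{3}\) branch, the kernel is computed between the \(M_{3}\) feature columns of \({\varvec{H}}_{3}\), so the largest matrix to be inverted is only \(M_{3} \times M_{3}\), which is exactly what formula (9) is designed to achieve.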

3.3 Time and space complexities

The structure in Fig. 1 is used as an example to analyze the time and space complexities of ELM-ARF. The training sample matrix is \({\varvec{X}} \in {\mathbb{R}}^{{N \times d^{2} }}\), and the test sample matrix is \({\varvec{X}}_{\text{t}} \in {\mathbb{R}}^{{M \times d^{2} }}\). The local receptive field size is \(r\). The number of both local receptive fields and global receptive fields per layer is \(g\). The pooling size is \(e\). The output feature matrix of the first layer is \({\varvec{X}}^{1} \in {\mathbb{R}}^{{N \times l_{1} g}}\), where \(l_{1} = \left( {d - r + 1} \right)^{2}\). The output feature matrix of the second layer is \({\varvec{X}}^{2} \in {\mathbb{R}}^{{N \times l_{2} g}}\), where \(l_{2} = \left( {d - 2r + 2} \right)^{2}\). The label matrix of the training samples is \({\varvec{T}} \in {\mathbb{R}}^{N \times m}\), and the label matrix of the test samples is \({\varvec{T}}_{\text{t}} \in {\mathbb{R}}^{M \times m}\). The time and space complexities of the single-layer ELM-LRF are compared with those of ELM-ARF. In the single-layer ELM-LRF, the number of local receptive fields is \(4g\), and the other parameters are the same as those of ELM-ARF.

In the training stage, the time complexity of ELM-ARF is \(\text{O} \left( {N\left( {6l_{1}^{2} g^{2} + 6l_{2}^{2} g^{2} + 8l_{1} l_{2} g^{2} + 4d^{2} l_{1} g + 2d^{4} } \right) + 9\left( {l_{1}^{3} + l_{2}^{3} } \right)g^{3} + d^{6} } \right)\). \(l_{1}\) and \(l_{2}\) are upper-bounded by \(d^{2}\). Assume \(d^{2} = l_{1} = l_{2} = pN\), where \(p \ll 1\). The training time complexity of ELM-ARF can then be approximated as \(\text{O} \left( {\left( {\left( {4g + 20g^{2} } \right)p^{2} + 18g^{3} p^{3} } \right)N^{3} } \right)\), while that of ELM-LRF can be approximated as \(\text{O} \left( {\left( {\left( {4g + 32g^{2} } \right)p^{2} + 64g^{3} p^{3} } \right)N^{3} } \right)\). Because the added convolution feature extraction layer and feature dimension reduction layer reduce the amount of computation, ELM-ARF has a lower training time complexity.

In the testing stage, the time complexity of ELM-ARF is \(\text{O} \left( {M\left( {2d^{2} l_{1} g + 4l_{1} l_{2} g^{2} } \right)} \right)\), which can be approximated as \(\text{O} \left( {\left( {2 + 4g} \right)gp^{2} M^{3} } \right)\). The testing time complexity of ELM-LRF can be approximated as \(\text{O} \left( {4gp^{2} M^{3} } \right)\). Compared with ELM-LRF, ELM-ARF has a more complex structure, which involves more calculations during testing. Therefore, ELM-ARF has a higher testing time complexity.

The space complexity of ELM-ARF is \(\text{O} \left( {N\left( {3l_{1} g \,+ 3l_{2} g + d^{2} } \right) + 2d^{2} l_{1} g + 4l_{1} l_{2} g^{2} { + }d^{2} l_{2} } \right)\), which can be approximated as \(\text{O} \left( {\left( {8 + 4g} \right)gp^{2} N^{2} } \right)\). The space complexity of the ELM-LRF can be approximated as \(\text{O} \left( {8gp^{2} N^{2} } \right)\). Compared to ELM-LRF, ELM-ARF needs to store more weight matrices. Therefore, it has a higher space complexity.

4 Experiments

In order to verify the validity of ELM-ARF, we carry out experiments on the USPS [24], MNIST [25], NORB [26] and CIFAR10 [27] databases, and the experimental results are compared with those of several convolutional networks trained with ELM-based methods. The experimental environment is the supercomputing system in the High Performance Computing Center of Yanshan University; each node has one Intel E5-2683v3 CPU (28 cores, 2.0 GHz) and 64 GB of memory. We use resource scheduling instructions to occupy one node (28 cores, 64 GB). The operating system and software environment are CentOS 7.2 and MATLAB R2018a.

4.1 USPS database

USPS is a handwritten digit recognition database containing a total of 9298 images of the ten digits 0 to 9. Example images from the database are shown in Fig. 4. There are 7291 training images and 2007 test images. The digit in each image is centered, and all images are normalized to \(16 \times 16\) pixels. The database has a small number of samples and is relatively simple, so it is used first to verify the validity of ELM-ARF. We use all training samples and test samples for the experiments.

Fig. 4
figure 4

Example images in the USPS database

For the USPS database, we need to select the optimal network parameters for ELM-ARF. These parameters include the size of the local receptive fields, the number of layers, the number of receptive fields per convolutional layer and the penalty coefficient. The size of the local receptive fields is set to \(3 \times 3\), \(4 \times 4\) or \(5 \times 5\). The number of layers \(i\) is set to \(\{ 1,2,3\}\). The number of local receptive fields is equal to the number of global receptive fields, and the number of each per convolutional layer is denoted by \(g\). The parameter \(g\) is set to \(\{ 1,2, \ldots ,8\}\), so the number of receptive fields per convolutional layer is \(2 \times g\) and the total number of receptive fields is \(2 \times g \times i\). The penalty coefficient \(C\) is set to \(\{ 10^{ - 3} ,10^{ - 2} , \ldots ,10^{3} \}\). The pooling layer size is \(3 \times 3\), which is consistent with the literature [11].
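The parameter selection described above amounts to a grid search; the sketch below only illustrates its structure, and train_and_evaluate is a hypothetical placeholder for training ELM-ARF with the given parameters and returning its test accuracy.

```python
import itertools

def train_and_evaluate(r, layers, g, C):
    """Hypothetical stand-in: train ELM-ARF with these parameters and return test accuracy."""
    return 0.0  # replace with the actual training and testing pipeline

lrf_sizes = [3, 4, 5]                            # local receptive field sizes (r x r)
layer_counts = [1, 2, 3]                         # number of layers i
g_values = range(1, 9)                           # g local and g global receptive fields per layer
C_values = [10.0 ** p for p in range(-3, 4)]     # penalty coefficient C

best_acc, best_params = 0.0, None
for r, layers, g, C in itertools.product(lrf_sizes, layer_counts, g_values, C_values):
    acc = train_and_evaluate(r, layers, g, C)
    if acc > best_acc:
        best_acc, best_params = acc, (r, layers, g, C)
```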

The accuracy of ELM-ARF changes as the parameters change. Figure 5 shows the accuracy meshes of ELM-ARF for different numbers of layers and different sizes of local receptive fields. Mesh diagrams with the same number of layers are placed in the same row, and the number of layers increases from top to bottom. Mesh diagrams with the same local receptive field size are placed in the same column; from left to right, the sizes are \(3 \times 3\), \(4 \times 4\) and \(5 \times 5\). All mesh diagrams show how the accuracy changes with \(g\) and \(C\). It can be observed that the classification accuracy increases as the number of layers increases, while the highest classification accuracy of the network decreases as the size of the local receptive fields increases. When the number of layers is 2 and the local receptive field size is \(3 \times 3\), ELM-ARF achieves the highest accuracy in Fig. 5. When the number of layers is 2 and the local receptive field size is \(4 \times 4\), the average accuracy is the highest and the mesh is smoother. The USPS database is relatively simple: with 2 layers, the accuracy already reaches 99.5%. Adding a third layer does not noticeably increase the accuracy but significantly increases the training time. Therefore, the number of layers of ELM-ARF is set to 2.

Fig. 5
figure 5

ELM-ARF parameter selection on the USPS database. a 1 layer and \(3 \times 3\) LRF, b 1 layer and \(4 \times 4\) LRF, c 1 layer and \(5 \times 5\) LRF, d 2 layers and \(3 \times 3\) LRF, e 2 layers and \(4 \times 4\) LRF, f 2 layers and \(5 \times 5\) LRF, g 3 layers and \(3 \times 3\) LRF, h 3 layers and \(4 \times 4\) LRF, i 3 layers and \(5 \times 5\) LRF

ELM-ARF obtains better experimental results when the local receptive field size is \(3 \times 3\) or \(4 \times 4\). Therefore, these two sizes are used in combination: the first layer uses local receptive fields of size \(4 \times 4\), and the second layer uses local receptive fields of size \(3 \times 3\). In order to test the effect of this combination, the parameter \(g\) is set to \(\{ 1,2, \ldots ,20\}\) and the penalty coefficient \(C\) is set to \(\{ 10^{ - 3} ,10^{ - 2} , \ldots ,10^{3} \}\). The experimental results show that the accuracy increases with the number of receptive fields and decreases slightly as the penalty coefficient increases. The accuracy of the combined structure reaches 95.2% when the number of receptive fields per layer is 2. When the number of receptive fields is increased to 12, all accuracies exceed 99%. When the number of receptive fields is 38 and the penalty coefficient is 0.001, the accuracy of ELM-ARF reaches 99.74%, which is the highest in Fig. 6. This shows the effectiveness of the combination of \(4 \times 4\) and \(3 \times 3\) receptive fields.

Fig. 6
figure 6

Accuracy and time of ELM-ARF (2 layers, \(4 \times 4\) and \(3 \times 3\)) on USPS database

In Table 1, the accuracy of ELM-ARF is compared with that of other algorithms. For a fair comparison, the penalty coefficient for all algorithms is selected from \(\{ 10^{ - 3} ,10^{ - 2} , \ldots ,10^{3} \}\). ELM-ARF is set to 2 layers with a total of 24 receptive fields; each convolutional layer contains 6 local receptive fields and 6 global receptive fields, and the penalty coefficient is 0.1. ELM-LRF is set to a single layer with 24 local receptive fields of size \(4 \times 4\) and a pooling size of \(3 \times 3\). CKELM is set to two layers with 24 local receptive fields of size \(8 \times 8\) and a pooling size of \(3 \times 3\). The settings of DC-ELM are consistent with the literature [18]. ELM-MSLRF is set to a single layer with 24 local receptive fields, and the sizes of the local receptive fields and pooling are set according to the literature [20]. Other parameters are consistent with those of ELM-ARF. Among the algorithms in Table 1, ELM-ARF achieves the highest classification accuracy with a low training time, which proves the effectiveness of ELM-ARF on small databases. In this paper, the highest testing accuracy in each table is shown in bold.

Table 1 Comparison of accuracies and time on USPS database

4.2 MNIST database

In order to test the classification ability of ELM-ARF on a database with simple image content and a large number of images, MNIST is selected for experiments. Example images from the database are shown in Fig. 7. The MNIST database contains 70,000 handwritten digit grayscale images of the digits 0 to 9, of which 60,000 are used as training samples and 10,000 as test samples. Each image is size-normalized to \(28 \times 28\) pixels, and the content is centered. We use 60,000 images for training and 10,000 for testing.

Fig. 7
figure 7

Example images in the MNIST database

Figure 8 shows the accuracy meshes of ELM-ARF with different parameter combinations on MNIST, arranged in the same way as Fig. 5. When the size of the local receptive fields is \(4 \times 4\), the accuracy with 3 layers is lower than that with 2 layers. Considering that the combination of \(4 \times 4\) and \(3 \times 3\) achieved good results on the USPS database, we continue to experiment with the same combination as in Sect. 4.1. The parameter \(g\) is set to \(\{ 1,2, \ldots ,15\}\), and the penalty coefficient is set to \(\{ 10^{ - 3} ,10^{ - 2} , \ldots ,10^{3} \}\). The experimental results are shown in Fig. 9. Figure 9a, b shows the mesh diagrams of accuracy and training time for the combination of \(4 \times 4\) and \(3 \times 3\). When 30 (\(g = 15\)) receptive fields per layer are used, the highest accuracy of 98.95% is achieved in Fig. 9a. Therefore, the combination of \(4 \times 4\) and \(3 \times 3\) is more effective on the MNIST database.

Fig. 8
figure 8

ELM-ARF parameter selection on the MNIST database. a 1 layer and \(3 \times 3\) LRF, b 1 layer and \(4 \times 4\) LRF, c 1 layer and \(5 \times 5\) LRF, d 2 layers and \(3 \times 3\) LRF, e 2 layers and \(4 \times 4\) LRF, f 2 layers and \(5 \times 5\) LRF, g 3 layers and \(3 \times 3\) LRF, h 3 layers and \(4 \times 4\) LRF, i 3 layers and \(5 \times 5\) LRF

Fig. 9
figure 9

Accuracy and time of ELM-ARF (2 layers, \(4 \times 4\) and \(3 \times 3\)) on MNIST database

Figures 5, 6, 8 and 9 show that accuracy increases as \(g\) increases, while changes in \(C\) have little effect on accuracy. Comparing the mesh diagrams in the same column of Figs. 5 and 8, it can be observed that the accuracy of ELM-ARF with 2 layers is generally higher than that with 1 layer, but adding a third layer has no significant effect on accuracy. Comparing the mesh diagrams in the same row, it can be found that increasing the size of the local receptive fields has little effect on accuracy. Therefore, when classifying digit images, \(g\) and the number of layers have an important influence on the classification performance of ELM-ARF. When the number of layers is 2 and \(g < 7\), the accuracy increases significantly as \(g\) increases, and higher accuracy can be obtained in less time. When \(g > 7\), the rate of improvement decreases, but the accuracy continues to increase.

Experimental comparison with some algorithms using the same number of samples is shown in Table 2, in which experimental results published in other studies are listed. The network parameters used by each algorithm in Table 2 are different. For example, ELM-LRF is set to a single layer and 48 local receptive fields, and ELM-MSLRF even uses 200 local receptive fields. ELM-ARF achieves the highest accuracy in Table 2 with only 30 receptive fields per layer.

Table 2 Comparison of accuracies and time on MNIST database

In order to reduce the amount of calculation, some algorithms, such as CKELM [17] and DC-ELM [18], randomly select 10,000 or 15,000 of the 60,000 samples for training. For a fair comparison, we use 60,000 samples for training and 10,000 for testing. The network parameters of these algorithms are set to be the same as those of ELM-ARF, and the experimental results are shown in Table 3. The penalty coefficient for all algorithms is selected from \(\{ 10^{ - 3} ,10^{ - 2} , \ldots ,10^{3} \}\). The number of layers of ELM-ARF is set to 2, the total number of receptive fields is 20, and the penalty coefficient is 1000. The parameters of DC-ELM are set to be consistent with the literature [18]. ELM-LRF is set to a single layer with 20 local receptive fields of size \(4 \times 4\) and a pooling size of \(3 \times 3\). ELM-MSLRF is set to a single layer with 20 local receptive fields, and the sizes of the local receptive fields and pooling are set according to the literature [20]. CKELM is set to 2 layers with 20 local receptive fields of size \(5 \times 5\) and a pooling size of \(2 \times 2\). Under the same computing conditions, the training time of ELM-ARF is relatively long, which is caused by computing the global receptive fields and the feature coding, but its accuracy is the highest in Table 3. From the experimental results in Sects. 4.1 and 4.2, it can be concluded that ELM-ARF classifies handwritten digit images well when the local receptive field size of the first layer is set to \(4 \times 4\) and that of the second layer is set to \(3 \times 3\).

Table 3 Comparison of unified experimental environment and parameters on MNIST database

4.3 NORB database

The images in USPS and MNIST are digit images with simple content. To test the ability of ELM-ARF to process images with more complex content, we use the NORB database. The database contains five categories of toy objects: people, animals, airplanes, trucks and cars. Each category contains 10 instances, so the database has a total of 50 instances. Each instance is imaged under different viewpoints and various lighting conditions, giving 972 stereo images per instance, and each stereo image consists of 2 images (left and right sides). All images are resized to \(32 \times 32\) pixels. Example images from the database are shown in Fig. 10. In the experiments, we select 5 instances from each category, a total of 24,300 image pairs, for training, and the 5 remaining instances from each category, also a total of 24,300 image pairs, for testing.

Fig. 10
figure 10

Example images (left and right sides) in the NORB database

On the USPS and MNIST databases, the classification accuracy of ELM-ARF with 2 layers is better than that with 1 layer, so on the NORB database we carry out experiments with the number of layers set to 2 and 3. The other parameter settings are the same as in Sects. 4.1 and 4.2. In the mesh diagrams of the first row of Fig. 11, the accuracy changes relatively smoothly as the number of receptive fields and the penalty coefficient change. In the mesh diagrams of the second row of Fig. 11, the accuracy with 3 layers oscillates noticeably as the penalty coefficient increases, which lowers the average accuracy. In Fig. 11b, the \(4 \times 4\) receptive fields reach the highest accuracy of 96.7% among the three mesh diagrams when there are only 12 receptive fields per layer and the penalty coefficient is 0.1. We also test the combination of \(4 \times 4\) and \(3 \times 3\), and its accuracy is lower than that of the \(4 \times 4\) local receptive fields in Fig. 11b. Therefore, in the NORB experiments, the number of layers is set to 2 and the size of the local receptive fields is set to \(4 \times 4\).

Fig. 11
figure 11

ELM-ARF parameters selection on NORB database. a 2 layers and \(3 \times 3\) LRF, b 2 layers and \(4 \times 4\) LRF, c 2 layers and \(5 \times 5\) LRF, d 3 layers and \(3 \times 3\) LRF, e 3 layers and \(4 \times 4\) LRF, f 3 layers and \(5 \times 5\) LRF

ELM-ARF is set to 2 layers, each of which has 6 local receptive fields of size \(4 \times 4\) and 6 global receptive fields, and the penalty coefficient is 0.1. The images in Fig. 12 are the output feature maps of each convolutional layer and pooling layer when ELM-ARF processes an input image. In the first row, the six images on the left are the feature maps generated by the local receptive fields of the first convolutional layer, and the images on the right are the feature maps generated by the global receptive fields. The images in the second row are the pooling maps; the left and right sides are generated by pooling the corresponding receptive field features. The images in the third row are the feature maps generated by the second convolutional layer, and the images in the fourth row are the corresponding pooling maps. Comparing the left and right sides of Fig. 12, the feature maps of the local receptive fields contain more texture details, while the feature maps of the global receptive fields show smoother object contours.

Fig. 12
figure 12

Output feature maps of ELM-ARF on NORB database

To prove that adding the global receptive fields and the feature coding structure effectively improves the classification performance, experiments are carried out in two settings: (1) the global receptive fields are replaced by local receptive fields. The experimental results are shown in Fig. 13a. When \(C < 0.1\), the accuracy increases as \(C\) increases; when \(C > 0.1\), the accuracy decreases as \(C\) increases. The highest accuracy is only 93.8% when \(g = 7\), and the training takes 47.3 s. (2) The feature coding structure is removed, and the output features of the second layer are directly input into a classification layer trained with ELM. The experimental results are shown in Fig. 13b. The accuracy increases as \(C\) increases. The highest accuracy is only 95.3% when \(g = 6\), and the training takes 42.4 s. Compared with Fig. 11, the accuracy in both experiments is lower. In Fig. 11b, the highest accuracy is 96.7% when \(g = 6\), and the training takes 49.3 s. This proves that the combination of global receptive fields and the feature coding structure effectively improves the accuracy at the cost of a slight increase in training time.

Fig. 13
figure 13

Effectiveness of ELM-ARF. a Removing the global receptive fields and b removing the feature coding structure

The trends of accuracy with respect to \(C\) in Fig. 13a, b are different. Both structures in Fig. 13a, b contain local receptive fields, so the difference in accuracy between them is caused by the global receptive fields and the feature coding structure. It can be found that the accuracy of the structure with the global receptive fields does not decrease as \(C\) increases, and the accuracy of the structure with feature coding reaches its highest value at \(C = 0.1\). Therefore, we combine the penalty coefficient settings of the two structures in a further experiment: the penalty coefficient \(C\) of the feature coding structure is set to 0.1, and the penalty coefficient of the global receptive fields is set to \(\{ 10^{ - 3} ,10^{ - 2} , \ldots ,10^{3} \}\). The experimental results are shown in Fig. 14a. We also set the two penalty coefficients uniformly to \(\{ 10^{ - 3} ,10^{ - 2} , \ldots ,10^{3} \}\); the experimental result is shown in Fig. 14b. Compared with Fig. 14b, Fig. 14a shows a more stable growth trend and higher accuracy. The coefficient \(C\) of the feature coding structure is fixed at 0.1 in Fig. 14a. ELM-ARF achieves its highest accuracy when the \(C\) of the global receptive fields is 100 and \(g = 14\). The highest accuracy in Fig. 14a is 98%, which proves the effectiveness of the global receptive fields.

Fig. 14
figure 14

Penalty coefficient \(C\) selection. a \(C\) of the feature coding structure is set to 0.1; b \(C\) of the global receptive fields and the feature coding structure are changed uniformly

In Figs. 11, 13 and 14, increases in \(g\) and \(C\) have an important influence on accuracy. Comparing the mesh diagrams in the same row of Fig. 11, it can be found that increasing the size of the local receptive fields has little effect on accuracy. Comparing the mesh diagrams in the same column, it can be found that increasing the number of layers causes large fluctuations in accuracy. Therefore, when classifying object images, \(g\) and \(C\) have an important influence on the classification performance of ELM-ARF. When the \(C\) of the feature coding structure is less than 1, the accuracy improves as \(g\) and the \(C\) of the global receptive fields increase within the given range.

Experimental results published in other studies are listed in Table 4. ELM-ARF is set to 2 layers with a total of 56 receptive fields; the \(C\) of the feature coding structure is 0.1, and the \(C\) of the global receptive fields is 100. ELM-ARF achieves the highest accuracy of 98% in Table 4. For a fair comparison, the penalty coefficient for all algorithms is selected from \(\{ 10^{ - 3} ,10^{ - 2} , \ldots ,10^{3} \}\). ELM-LRF is set to a single layer with 56 local receptive fields of size \(4 \times 4\) and a pooling size of \(3 \times 3\). ELM-MSLRF is set to a single layer with 56 local receptive fields, and the sizes of the local receptive fields and pooling are set according to the literature [20]. The two-layer CKELM is set to 56 local receptive fields of size \(4 \times 4\) and a pooling size of \(3 \times 3\). The settings of DC-ELM are consistent with those in [18]. The experimental results are shown in Table 5. When the total number of receptive fields is the same, ELM-ARF obtains the highest accuracy with the third fastest training speed, which shows the effectiveness of the framework in dealing with complex image classification problems.

Table 4 Comparison of accuracies and time on NORB database
Table 5 Comparison of unified experimental environment and parameters on NORB database

4.4 CIFAR10 database

Finally, ELM-ARF is applied to CIFAR10, an object image database commonly used in deep learning. The CIFAR10 data set [27] contains 10 categories of color images, each of size \(32 \times 32 \times 3\). Example images from the database are shown in Fig. 15. The training set contains 50,000 images, and the testing set contains 10,000 images. All images are used for training and testing.

Fig. 15
figure 15

Example images in the CIFAR10 database

None of the algorithms in Table 6 were tested on this database in their original studies, so they are trained under the same conditions as ELM-ARF. ELM-ARF is set to 2 layers with 44 receptive fields; the coefficient \(C\) of the feature coding structure is set to 0.001, and the coefficient \(C\) of the global receptive fields is set to 1000. The penalty coefficient for the other algorithms is selected from \(\{ 10^{ - 3} ,10^{ - 2} , \ldots ,10^{3} \}\). The total number of local receptive fields of ELM-LRF, CKELM and ELM-MSLRF is set to be the same as that of ELM-ARF. The settings of DC-ELM are consistent with those in [18]. ELM-ARF obtains the highest accuracy with the third fastest training speed, and its accuracy is 5% higher than that of the second-best algorithm, CKELM.

Table 6 Comparison of accuracies and time on CIFAR10 database

5 Conclusion

In this paper, the extreme learning machine with autoencoding receptive fields (ELM-ARF) is proposed to effectively utilize features with global attributes and the features extracted by each layer. By using the theory of ELM-AE to train the global receptive fields, ELM-ARF extracts global features while avoiding the instability caused by randomly initializing the receptive field matrix. At the same time, the structure of ELM-ARF allows the shallow-layer features to be input to and combined with any deep layer, so that the features of every layer, from shallow to deep, are effectively utilized. The experimental results show that ELM-ARF achieves higher accuracy on the above four databases with only a small reduction in speed. This demonstrates that ELM-ARF can effectively deal with object classification problems.

Although the classification results of ELM-ARF on CIFAR10 cannot match those of CNNs, ELM-ARF does not use time-consuming iterative back-propagation to train the network. By utilizing the theory of ELM-LRF and ELM-AE, 63% of the CNN classification accuracy is achieved in only 247 s. In future research, the local receptive fields will be studied further to improve the performance of ELM-ARF on classification problems.