1 Introduction

With the rapid development of mobile Internet technology, users upload and share massive numbers of images every day. How to let users accurately find the information they need within these massive image resources is an important current research topic. Content-based image retrieval (CBIR) has emerged to meet this need: it searches an image database for images [6, 10, 14, 18] that satisfy the query conditions according to objective visual features contained in the image itself, such as pixel information, color, texture, and shape, without requiring manual annotation.

One of the most fundamental problems in CBIR is how to represent images effectively, so the extraction and expression of features has received wide attention. Traditional content-based image retrieval relies on the low-level visual features of the image, which suffer from a large "semantic gap" with respect to human perception of images. Semantic-based image retrieval has therefore become one of the key issues in the field of image retrieval [1, 16].

A common approach to constructing image semantic features is to classify the low-level features of the image directly to obtain semantic features. For example, [7] proposes to learn image semantics through a joint probability model of images and annotations. The main idea is to model the image feature vector and the semantically annotated text with a non-parametric Gaussian kernel model and to bridge the "semantic gap" through this model. The disadvantage of this method is that the choice of visual features plays a decisive role, so its robustness is poor.

In recent years, deep learning has developed rapidly. It draws on the principles of the human visual mechanism and works by iterating and abstracting layer by layer. Its greatest advantage is that it can learn image features autonomously, from low-level edge features to object structure features and even more abstract features [3, 11]. The Convolutional Neural Network (CNN), the Deep Boltzmann Machine (DBM), and the Autoencoder (AE) are classical deep learning models and have achieved good results on image classification tasks. For example, [9] proposes a binary image retrieval method based on a Deep Belief Network (DBN) and a Softmax classifier. However, when back propagation is used to correct the network's connection weights and biases, the DBN algorithm is prone to vanishing gradients, low learning rates, and slow error convergence [12]. In [15], a Stacked Autoencoder (SAE) is used to extract geometric features for image classification. However, an SAE is easily affected by imbalanced training data, and a single SAE model has many parameters, so its classification performance varies with the parameter settings and its robustness is poor [8, 20]. Some scholars have therefore proposed the Stacked Denoising Autoencoder (SDAE), which adds appropriate noise to the input of the SAE and treats the noisy input as if it were the original complete feature, improving the generalization ability of the SAE model. For example, [2] proposed a gesture image recognition method based on SDAE, which improved on SAE to some extent. However, SDAE is a deep neural network that must be trained layer by layer, so the computation is heavy, training is slow, and tuning to the optimal parameters takes a long time.

To address the semantic gap, this paper introduces the structure and training method of a Convolutional Deep Boltzmann Machine (C-DBM), which combines CNN and DBM to build a layer-by-layer iterative, layer-by-layer abstracting deep network that maps low-level visual features to high-level semantic features. The model aims to reduce the semantic gap and obtain semantic features of images. Finally, based on the extracted semantic features, image classification and retrieval are performed by a Dropout-regularized Softmax classifier. The experimental results demonstrate the effectiveness of the proposed method.

2 Convolutional neural network (CNN)

CNN is a kind of artificial neural network. It was influenced by the time-delay neural network (TDNN) and uses weight sharing to reduce the number of network parameters. It consists of convolutional layers and pooling layers. A convolutional layer performs convolution: a linear filter is slid over each local receptive field of the input signal and an inner product is computed at each position; passing the results through a nonlinear activation function yields an activation value at every position, and the collection of these values forms a feature map [5, 17]. In addition, a CNN is usually combined with a Softmax classifier to solve multi-class problems. The CNN architecture for image retrieval is shown in Fig. 1.

Fig. 1 CNN architecture for image retrieval

The input of the previous layer and the learned kernels (weights) determine a single neuron in the convolutional layer. Within the same layer, neurons of the same feature map share the same kernel, while neurons of different feature maps use different kernels. The input-output expression of a neuron is:

$$ {u}_j^l=f\left(\sum \limits_{i=1}^N{u}_i^{l-1}\ast {w}_{ji}^l+{b}_j^l\right) $$
(1)

In the above formula, \( {u}_i^{l-1} \) is the i-th input feature map of layer l − 1, N is the number of input maps, \( {w}_{ji}^l \) is the convolution kernel connecting input i to output map j, \( {b}_j^l \) is the bias term, and f is the activation function.
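
As a concrete illustration of Eq. (1), the following NumPy sketch computes one convolutional layer's feature maps; the "valid" convolution, the sigmoid activation, and all variable names are our assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_valid(x, k):
    """2-D 'valid' convolution of map x with kernel k (kernel flipped)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    kf = k[::-1, ::-1]  # flip for true convolution (vs. cross-correlation)
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kf)
    return out

def conv_layer_forward(inputs, kernels, biases):
    """inputs: list of N maps u_i^{l-1}; kernels[j][i]: kernel w_ji^l;
    biases[j]: scalar b_j^l. Returns the feature maps u_j^l of Eq. (1)."""
    out_maps = []
    for j in range(len(kernels)):
        acc = sum(conv2d_valid(u, kernels[j][i]) for i, u in enumerate(inputs))
        out_maps.append(sigmoid(acc + biases[j]))
    return out_maps
```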

For a given set of input feature maps, the pooling layer downsamples each map. Pooling does not change the number of feature maps, but each map becomes smaller:

$$ {u}_j^l=f\left({\beta}_j^l\,\mathrm{down}\left({u}_j^{l-1}\right)+{b}_j^l\right) $$
(2)
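
A matching sketch of the pooling operation in Eq. (2), assuming mean pooling over non-overlapping windows as the down(·) function and a sigmoid activation; the paper does not fix either choice, so both are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pool_layer_forward(u_prev, beta, b, size=2):
    """Eq. (2): downsample a feature map (mean pooling here), scale by the
    multiplicative bias beta_j^l, add b_j^l, and apply the activation."""
    h, w = u_prev.shape
    u = u_prev[:h - h % size, :w - w % size]  # crop so windows divide evenly
    pooled = u.reshape(h // size, size, w // size, size).mean(axis=(1, 3))
    return sigmoid(beta * pooled + b)
```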

The CNN training process aims to minimize an error function. For a classification problem with N training samples, the reconstruction error is:

$$ E=\frac{1}{2}\sum \limits_{n=1}^N\sum \limits_{k=1}^K{\left({s}_k^n-{y}_k^n\right)}^2 $$
(3)

In the above formula, \( {s}_k^n \) is the k-th dimension of the label of the n-th sample, and \( {y}_k^n \) is the k-th output of the network for the n-th input sample. The error of the whole data set is the sum of the errors over all samples. The error for a single sample is:

$$ E=\frac{1}{2}\sum \limits_{k=1}^K{\left({s}_k^n-{y}_k^n\right)}^2 $$
(4)

The training phase of a CNN consists of two steps: forward propagation and back propagation. Forward propagation is the process by which a feature map passes from the current layer to the next through a predefined activation function with learnable parameters (weights and biases). For example, the output of layer l is defined as v^l = f(u^l), where u^l = W^l v^{l-1} + b^l. During back propagation, the weights W^l and biases b^l are updated by a stochastic gradient descent strategy. The data are also normalized so that they follow a normal distribution in the feature space, which speeds up convergence.
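
The following minimal sketch illustrates the two steps under these definitions; the layer type (fully connected), the sigmoid activation, and the learning rate are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(W, b, v_prev):
    """Forward propagation for one layer: v^l = f(u^l), u^l = W^l v^{l-1} + b^l."""
    return sigmoid(W @ v_prev + b)

def sgd_step(W, b, grad_W, grad_b, lr=0.01):
    """One stochastic gradient descent update; grad_W and grad_b stand for the
    gradients dE/dW^l and dE/db^l delivered by back propagation."""
    return W - lr * grad_W, b - lr * grad_b
```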

3 Deep Boltzmann machine (DBM)

DBM is an undirected graphical model [13], as shown in Fig. 2. A DBM can learn high-level representations from data through unsupervised learning and can then be fine-tuned with a small amount of labeled data in a supervised manner. Unlike the Deep Belief Network (DBN), the inference procedure of a DBM includes top-down feedback, which allows the uncertainty and ambiguity of the input data to be transmitted and processed better [4].

Fig. 2 Deep Boltzmann model

Considering a DBM with two hidden layers and ignoring the bias terms of the visible and hidden layers, the energy of the model is defined as:

$$ E\left(v,{h}^1,{h}^2;\theta \right)=-{v}^T{W}^1{h}^1-{h}^{1T}{W}^2{h}^2 $$
(5)

Where θ = {W1, W2} are the model parameters: W1 is the connection weight between the visible layer v and the first hidden layer h1, and W2 is the connection weight between the first hidden layer h1 and the second hidden layer h2. The probability the model assigns to a visible layer state v is then:

$$ p\left(v;\theta \right)=\frac{1}{Z\left(\theta \right)}\sum \limits_{h^1,{h}^2}\exp \left(-E\left(v,{h}^1,{h}^2;\theta \right)\right) $$
(6)

The conditional probabilities of the visible layer unit, the first hidden layer unit, and the second hidden layer unit are respectively:

$$ p\left({v}_i=1|{h}^1\right)=\sigma \left(\sum \limits_j{W}_{ji}^1{h}_j^1\right) $$
(7)
$$ p\left({h}_j^1=1|v,{h}^2\right)=\sigma \left(\sum \limits_i{W}_{ji}^1{v}_i+\sum \limits_m{W}_{mj}^2{h}_m^2\right) $$
(8)
$$ p\left({h}_m^2=1|{h}^1\right)=\sigma \left(\sum \limits_j{W}_{jm}^2{h}_j^1\right) $$
(9)
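
The conditionals in Eqs. (7)~(9) can be sampled with one block-Gibbs sweep, as in the hedged sketch below; the weight shapes, the bias-free form, and the update order are our assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h2, W1, W2, rng):
    """One block-Gibbs sweep for a bias-free two-hidden-layer DBM.
    W1 has shape (n_visible, n_h1); W2 has shape (n_h1, n_h2)."""
    p_h1 = sigmoid(v @ W1 + h2 @ W2.T)        # Eq. (8): h1 sees v and h2
    h1 = (rng.random(p_h1.shape) < p_h1).astype(float)
    p_h2 = sigmoid(h1 @ W2)                   # Eq. (9): h2 sees only h1
    h2 = (rng.random(p_h2.shape) < p_h2).astype(float)
    p_v = sigmoid(h1 @ W1.T)                  # Eq. (7): v sees only h1
    v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, h1, h2

# usage: v, h1, h2 = gibbs_sweep(v, h2, W1, W2, np.random.default_rng())
```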

In a DBN, only the top two layers form a Restricted Boltzmann Machine (RBM); in a DBM, by contrast, every pair of adjacent layers forms an RBM. After an RBM has been trained, the whole model can be expressed as:

$$ p\left(v;\theta \right)=\sum \limits_{h^1}p\left({h}^1;{W}^1\right)p\left(v|{h}^1;{W}^1\right) $$
(10)

Where \( p\left({h}^1;{W}^1\right)=\sum \limits_vp\left({h}^1,v;{W}^1\right) \) is the prior over h1 implicitly defined by the parameters W1.

The second-layer RBM is obtained by replacing p(h1; W1) with \( p\left({h}^1;{W}^2\right)=\sum \limits_{h^2}p\left({h}^1,{h}^2;{W}^2\right) \). Note that the newly added hidden layer does not change the probability distribution of the model. In this way, by replacing only the topmost hidden layer with a new RBM at each step, the depth of the model can be increased without changing the underlying probability. Replacing p(h1; W1) with the second RBM amounts to improving the model of p(h1; W1). It is therefore reasonable to use both the lower and upper RBMs and average them to obtain p(h1; W1, W2). Propagating bottom-up with W1 and top-down with W2 is equivalent to counting the input data v twice, because, from the perspective of the graphical model, h2 is a quantity that depends on v.

The structure of the DBM makes its training different from that of the DBN. The DBM training algorithm mainly adjusts the weights to compensate for the fact that bottom-up pre-training has no top-down signal flow. Training a DBM is more complicated than training a DBN: since the DBM is an undirected graphical model in which each middle layer is directly connected to the layers above and below it, the mean-field algorithm is usually used for approximate inference.
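
A minimal sketch of mean-field inference for this two-hidden-layer DBM: the sampled states in Eqs. (7)~(9) are replaced by deterministic activation probabilities that are iterated to a fixed point. The iteration count and the bottom-up initialization are our choices, not the paper's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, n_iters=10):
    """Deterministic mean-field versions of Eqs. (8)-(9): the hidden activation
    probabilities mu1, mu2 are iterated to a fixed point instead of sampled."""
    mu1 = sigmoid(v @ W1)                   # bottom-up initialization for h1
    mu2 = sigmoid(mu1 @ W2)                 # initialization for h2
    for _ in range(n_iters):
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)  # h1 sees both neighbors
        mu2 = sigmoid(mu1 @ W2)             # h2 sees only h1
    return mu1, mu2
```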

4 Image semantic feature extraction combined with CNN and DBM

The semantic features of an image are hierarchical: low-level features have low abstraction and correlate strongly with the image content itself, while high-level features are highly abstract and correlate less directly with the raw content. It is therefore natural to build a hierarchical learning model that learns from image data without supervision, so that the learned representations reflect the characteristics of the data itself; as the level rises, the learned features become higher-order features of the input. In this paper, a deep learning model is used to learn the semantic features of images, where increasing depth raises the degree of abstraction and establishes a semantic hierarchy; a classification model then performs semantic classification of the images.

A CNN adapts well to stretching, affine transformations, and other variations of an image, but it ignores high-order statistical features. A DBM captures high-order dependencies in images well, but it is slow on larger images, more sensitive to external changes, and lacks the ability to capture local invariance. This paper therefore combines the two models to obtain a deep learning model that has local invariance and can learn high-order statistical features.

A Convolutional Deep Boltzmann Machine (C-DBM) is obtained by replacing each RBM in a DBM with a convolutional RBM (CRBM). Using the C-DBM as the semantic feature extraction module of an image semantic classification model yields the C-DBM-based classification model shown in Fig. 3.

Fig. 3 Image semantic classification model based on C-DBM

The training process of the C-DBM involves only the first three layers; the classification layer does not participate. Applying multiple convolution kernels to the input layer and performing the convolutional mapping yields a hidden layer containing low-level semantics; the extraction operation is then performed on it, producing the low-level semantic units. The doubled convolutional mapping here follows the same principle as weight doubling in DBM training: both compensate for the missing top-down input. The high-level semantics are then extracted on this basis, following essentially the same procedure as in a DBM. The whole process is fully unsupervised, and the extracted features can be seen as a more essential description of the data content itself.

The training algorithm of the C-DBM is the same as that of the DBM: during bottom-up pre-training, the weights are doubled to compensate for the probability mass lost by the missing top-down signal. During this pre-training of the modified C-DBM, the inputs to the hidden sublayers of the first and second layers are:

$$ I\left({h}_{i,j}^{1,k}\right)={\left(2{W}_k^1\cdot V+{b}_k^1\right)}_{i,j} $$
(11)
$$ I\left({h}_{i,j}^{2,k}\right)={\left(\sum \limits_{l=1}^{n_1}{W}_{kl}^2\cdot {p}^{1,l}+{b}_k^2\right)}_{i,j} $$
(12)

Here \( {h}_{i,j}^{1,k} \) denotes a unit in the k-th sublayer of the first hidden layer, which is directly connected to the visible layer; \( {W}_k^1 \) denotes the k-th convolution kernel of the first CRBM; and \( {b}_k^1 \) denotes the bias of the k-th sublayer of the first hidden layer. \( {h}_{i,j}^{2,k} \) denotes a unit in the k-th sublayer of the second hidden layer, n1 denotes the number of convolution kernels of the first hidden layer, h^{1,l} denotes the l-th sublayer of the first hidden layer, \( {W}_{kl}^2 \) denotes the convolution kernel connecting the k-th sublayer of the second hidden layer to the l-th sublayer of the first hidden layer, and \( {b}_k^2 \) denotes the bias of the k-th hidden sublayer.

All hidden-layer and extraction-layer units are binary. To obtain the binary state of each unit, the posterior probabilities of the hidden layer and the extraction layer must be computed during training. During C-DBM training, the posterior probabilities of a hidden-layer unit and an extraction-layer unit are:

$$ p\left({h}_{i,j}^k=1|v\right)=\frac{\exp \left(I\left({h}_{i,j}^k\right)\right)}{1+\sum \limits_{\left(i',j'\right)\in {B}_{\alpha }}\exp \left(I\left({h}_{i',j'}^k\right)\right)} $$
(13)
$$ p\left({p}_{\alpha}^k=1|v\right)=\frac{1}{1+\sum \limits_{\left(i',j'\right)\in {B}_{\alpha }}\exp \left(I\left({h}_{i',j'}^k\right)\right)} $$
(14)
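
The following sketch evaluates Eqs. (13) and (14) for the units of a single extraction block B_α; the function name and input layout are illustrative.

```python
import numpy as np

def block_posteriors(I_block):
    """Posteriors for one extraction block B_alpha of the k-th sublayer.
    I_block holds the pre-activations I(h_{i',j'}^k) of all detector units
    in the block (e.g., a flattened 2x2 array)."""
    e = np.exp(I_block)   # for large inputs, rescale e and the constant 1 by exp(-max)
    denom = 1.0 + e.sum()
    p_hidden = e / denom  # Eq. (13), for every unit in the block
    p_pool = 1.0 / denom  # Eq. (14), the block-level term
    return p_hidden, p_pool
```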

After this initial training, the model parameters are initialized to a good region of the parameter space. On this basis, the C-DBM is trained with the Mean Field (MF) method to train the model fully. Note that this stage is also unsupervised.

Since the C-DBM used in this paper has two hidden layers, the first hidden layer (the low-level semantic layer) receives input simultaneously from the visible layer and from the high-level semantic layer. The posterior probability formulas of the hidden layer during mean-field training are therefore modified to:

$$ I\left({h}_{i,j}^{1,k}\right)={\left({W}_k^1\cdot v+{b}_k^1\right)}_{i,j} $$
(15)
$$ I\left({p}_{i,j}^{1,k}\right)={\left(\sum \limits_{l=1}^{n_2}{\tilde{W}}_{lk}^1\cdot {h}^{2,l}+{b}_k^1\right)}_{i,j} $$
(16)
$$ p\left({h}_{i,j}^1=1|v,{h}^2\right)=\frac{\exp \left(I\left({h}_{i,j}^1\right)+I\left({p}_{\alpha}^1\right)\right)}{1+\sum \limits_{\left(i',j'\right)\in {B}_{\alpha }}\exp \left(I\left({h}_{i',j'}^1\right)+I\left({p}_{\alpha}^1\right)\right)} $$
(17)
$$ p\left({p}_{\alpha}^1=1|v,{h}^2\right)=\frac{1}{1+\sum \limits_{\left(i',j'\right)\in {B}_{\alpha }}\exp \left(I\left({h}_{i',j'}^1\right)+I\left({p}_{\alpha}^1\right)\right)} $$
(18)

In the above formulas, \( {\tilde{W}}_{lk}^1 \) denotes the convolution kernel between the k-th sublayer of the first hidden layer and the l-th sublayer of the second hidden layer after being flipped horizontally and vertically.

Equations (17) and (18) adopt a method called "probabilistic max extraction", which samples according to the combined context information from the visible layer and the second hidden layer, thereby normalizing and jointly inferring over the context information flow. (i', j') ∈ Bα ranges over all convolutional-layer units within the extraction region whose output is pooled into unit α. Since the second hidden layer and its pooling layer receive no top-down information flow, their formulas are the same as those used in bottom-up propagation during training and are not repeated here.

The image semantic classification algorithm based on C-DBM consists of three parts, outlined in the sketch below. (1) First, layer-by-layer training and mean-field training are performed on the first three layers of the network. (2) The output of the C-DBM module is then taken as the extracted semantic features. (3) Finally, the Softmax classifier is trained in a supervised manner on the extracted features, completing the training of the whole network. Both layer-by-layer training and mean-field training are unsupervised and learn the semantic features of the image content without any category information.
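
In outline, the three parts can be written as the following Python skeleton; every function is a placeholder standing in for the corresponding step, not a real API.

```python
def pretrain_layerwise(images):            # (1a) greedy CRBM-by-CRBM pre-training
    ...

def mean_field_finetune(model, images):    # (1b) unsupervised mean-field training
    ...

def extract_features(model, images):       # (2) C-DBM output = semantic features
    ...

def train_softmax_dropout(feats, labels):  # (3) supervised classifier training
    ...

def train_semantic_retrieval_model(images, labels):
    model = pretrain_layerwise(images)
    model = mean_field_finetune(model, images)
    feats = extract_features(model, images)
    classifier = train_softmax_dropout(feats, labels)
    return model, classifier
```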

5 Softmax classifier with dropout regularization

After the image semantic features are obtained, a classifier is needed for image classification and recognition. The final layer of a traditional CNN is typically a fully connected Softmax classifier. Because a neural network has a large number of parameters, over-fitting easily occurs in practice. This paper introduces the Dropout algorithm [19] on the classifier side, which effectively prevents over-fitting and gives the model better generalization ability.

The Dropout algorithm randomly sets a proportion ρ of the input elements to 0; only the elements that are not zeroed participate in the computation and connections.

Suppose there is a feed-forward neural network with L hidden layers whose input-output relationship is given by Eqs. (19) and (20):

$$ {z}_i^{l+1}={w}_i^{l+1}{y}^l+{b}_i^{l+1} $$
(19)
$$ {y}_i^{l+1}=f\left({z}_i^{l+1}\right) $$
(20)

Where z is the input vector, y is the output vector, w is the weight, b is the bias vector, and f is the activation function, which bounds the amplitude.

With Dropout added, the input-output relationship of the feed-forward network is as shown in Eqs. (21)~(23), where each element \( {r}_i^l \) of the mask vector r^l is an independent Bernoulli random variable that takes the value 0 with probability ρ (and 1 otherwise):

$$ {\tilde{y}}^l={r}^l{y}^l $$
(21)
$$ {z}_i^{l+1}={w}_i^{l+1}{\tilde{y}}^l+{b}_i^{l+1} $$
(22)
$$ {y}_i^{l+1}=f\left({z}_i^{l+1}\right) $$
(23)

For simplicity, assume that each parameter update uses only one sample. The process is as follows. First, elements of the input vector are set to 0 at rate ρ, and the remaining non-zero elements participate in the computation and optimization of the classifier. Then the input vector of the second sample is accepted; again, the elements that participate in training are chosen by random zeroing, and this continues until all samples have been learned once. Because the zeroing pattern is random for every input sample, the network weight parameters differ at every update. In the final prediction stage, the parameters of the whole network are multiplied by 1 - ρ to obtain the final classifier parameters. Because the parameters differ at every update, the Dropout algorithm can be regarded as combining the neural network into multiple models, which effectively prevents over-fitting and improves the model's prediction accuracy [17].
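
A minimal sketch of this train/predict asymmetry under Eqs. (21)~(23), assuming a single fully connected layer with sigmoid activation; all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_layer(y_prev, W, b, rho=0.5, train=True, rng=None):
    """Eqs. (21)-(23) at training time; (1 - rho) weight scaling at test time."""
    if train:
        rng = rng or np.random.default_rng()
        r = (rng.random(y_prev.shape) > rho).astype(float)  # Eq. (21): mask r^l
        return sigmoid(W @ (r * y_prev) + b)                # Eqs. (22)-(23)
    return sigmoid((1.0 - rho) * (W @ y_prev) + b)          # test-time rescaling
```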

For a fully connected network, applying Dropout to all hidden layers works better than applying it to only one, and the probability must be chosen appropriately: values that are too extreme give poor results. Many experiments suggest that the best probability is 0.5. As shown in Fig. 4, when ρ = 0.5 half of the neurons in each hidden layer are set to 0 and do not participate in that training pass; in the next pass a new random mask is drawn, so previously dropped neurons may be restored.

Fig. 4 Schematic diagram of a neural network with Dropout, ρ = 0.5

6 Experimental results and analysis

6.1 Experimental setup

The STL-10 dataset is a public image collection from Stanford University for studying unsupervised feature learning and deep learning algorithms. STL-10 contains ten classes of color images at a resolution of 96 × 96: airplanes, birds, cars, cats, deer, dogs, horses, monkeys, boats, and trucks. Each class includes 500 training images and 800 test images. Besides the labeled classes above, the dataset also contains unlabeled images of other kinds, such as other animals (bears, rabbits, etc.) and other vehicles (trains, buses, etc.). Some images from the STL-10 dataset are shown in Fig. 5, from which it can be seen that the variation within a single class is very large.

Fig. 5 Sample images from the STL-10 dataset

The input layer of the C-DBM model is set to a size of 32 × 32 × 3 (i.e., the input can be regarded as three 32 × 32 maps). The convolutional layer of the first hidden layer contains 6 feature maps with 5 × 5 convolution kernels, and its extraction layer uses 2 × 2 extraction regions. The convolutional layer of the second hidden layer contains 8 feature maps with 7 × 7 convolution kernels, and its extraction layer also uses 2 × 2 extraction regions. Finally, the output units of the model are flattened into a one-dimensional vector. The model uses the sigmoid activation function. Note that, as mentioned above, the receptive field of visual cortical cells grows with the level; the second layer therefore uses a larger convolution kernel, which better matches the principles of biological vision and distinguishes this model from a conventional convolutional neural network.
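
For reference, the stated configuration can be summarized as a plain dictionary; this mirrors the sizes above only and is not the authors' code.

```python
# Experimental C-DBM configuration as reported in the text (illustrative only).
cdbm_config = {
    "input":  {"size": (32, 32), "channels": 3},
    "layer1": {"feature_maps": 6, "kernel": (5, 5), "extraction_region": (2, 2)},
    "layer2": {"feature_maps": 8, "kernel": (7, 7), "extraction_region": (2, 2)},
    "activation": "sigmoid",
    "output": "flattened 1-D feature vector",
}
```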

All modules of the system were implemented on the MATLAB 2014a GUI platform. The experimental platform is a 64-bit Windows 7 PC with an Intel Core i5-6400 CPU at 2.7 GHz and 8 GB of memory.

6.2 Performance indicators

The evaluation indicators of a retrieval system are precision, recall, and the F value. Recall reflects how comprehensively the relevant images are found within the limited number of returned images; precision reflects the retrieval accuracy of the system. The two indicators are in tension: in many cases it is difficult to satisfy both at once, and a retrieval system generally only needs to reach an optimal balance point. The F value is an indicator that jointly assesses precision and recall.

$$ \mathrm{Precision}=\frac{\text{number of retrieved images relevant to the query}}{\text{total number of retrieved images}} $$
$$ \mathrm{Recall}=\frac{\text{number of retrieved images relevant to the query}}{\text{total number of relevant images in the image library}} $$
$$ F=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} $$
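
These three measures are straightforward to compute; the sketch below assumes the retrieved results and the relevant ground truth are given as sets of image ids.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved images that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of all relevant images in the library that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def f_value(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0
```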

6.3 Performance analysis

First, the feasibility of the image retrieval system is verified. In the retrieval module, the user supplies an image and the system returns semantically similar images. We input one image each of an airplane and a horse; the first four retrieved images are shown in Fig. 6. As can be seen, the retrieved images are very similar to the input image and belong to the same category. This indicates that deep learning can extract robust semantic features with good linear separability from complex images, and that the features are robust to factors such as color, illumination, and background.

Fig. 6 The first 4 most similar images returned by retrieval

The proposed method is compared with two existing classical methods: the retrieval method based on a deep belief network (DBN) and Softmax classifier proposed in [9], and the image classification method based on a stacked denoising autoencoder (SDAE) proposed in [2]. After all methods were trained multiple times on the STL-10 image dataset, the precision and recall obtained on each image class are shown in Table 1. Compared with the SDAE and DBN algorithms, the C-DBM method proposed in this paper is superior in both average classification precision and average recall. In addition, the average F value of the DBN method is 0.5653 and that of the SDAE method is 0.6036, while the average F value of our method is the highest, reaching 0.6315. This also illustrates the effectiveness of the proposed method.

Table 1 Retrieval performance of three methods on each type of image

SDAE has stronger learning ability than SAE and can be used to classify image semantic features. By convolutionally mapping local regions of the data and then learning the statistical features within them, higher performance is obtained. The C-DBM, however, learns the high-order semantic features of images better and performs better on image semantic classification. This shows that, compared with classical deep learning models, the semantic model used in this paper extracts the semantic features of images better, and the learned features are more suitable for semantic classification tasks than traditional ones.

On the STL-10 image dataset, the recall interval is set to 0.1 and the precision is computed at each recall level. The resulting precision-recall (PR) curve is shown in Fig. 7 (horizontal axis: recall; vertical axis: precision). The figure shows that, at the same recall, the precision of the proposed method is better than that of the other methods, indicating that the algorithm improves retrieval precision.
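
A hedged sketch of how such a PR curve can be computed from a ranked result list, assuming interpolated precision (the best precision at or beyond each recall level); the paper does not specify the interpolation rule.

```python
import numpy as np

def pr_at_recall_levels(ranked, n_relevant, levels=np.arange(0.1, 1.01, 0.1)):
    """ranked: relevance-ordered list of booleans (True = relevant result).
    Returns the interpolated precision at each 0.1 recall level."""
    hits = np.cumsum(ranked)                      # relevant hits so far
    rec = hits / n_relevant                       # recall after each result
    prec = hits / np.arange(1, len(ranked) + 1)   # precision after each result
    return [prec[rec >= lv].max() if np.any(rec >= lv) else 0.0 for lv in levels]
```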

Fig. 7 Precision-recall curve

7 Conclusion

To bridge the "semantic gap" of image features in traditional image retrieval, an image retrieval method combining deep-learning semantic feature extraction with a regularized Softmax classifier is proposed. A DBM is combined with a CNN to construct a Convolutional Deep Boltzmann Machine (C-DBM) that extracts effective semantic features, and a Dropout-regularized Softmax classifier then classifies and recognizes the image features. Experimental results on the STL-10 image dataset show that the proposed method extracts semantic features effectively and achieves high retrieval accuracy.

In future work, a simplified DBM model will be combined with CNN to maintain retrieval accuracy while reducing computational complexity.