1 Introduction

In the image classification task, applying multiple representations to an object can effectively improve classification accuracy. In particular, when multiple training samples of each object are available, it is common practice to exploit all of them for classifying a test sample. In the field of face recognition, face images of the same subject can differ greatly because of variations in facial expression, illumination, and pose [1,2,3], which makes face recognition difficult. Many methods have been proposed to improve face recognition accuracy. For example, the expression-invariant face recognition algorithm can effectively improve recognition accuracy for face images with different expressions [4]. Jian et al. proposed a face recognition method based on illumination compensation [5]. Sharma et al. proposed a face recognition method based on pose-invariant virtual classifiers [6]. Exploiting the symmetry of faces, Xu et al. proposed a method that generates "symmetrical" face images from the original images; combining the original and symmetrical face images reduces the impact of appearance changes and improves classification accuracy [7]. In addition, using the original image to generate a virtual image, thereby providing multiple representations of the same face image, can markedly reduce the face recognition error rate [8]. Similarly, a noise-corrupted version of an original face image can also serve as a virtual image [9, 10]. In recent years, dictionary learning has been widely used in image classification and face recognition. To improve the robustness of face recognition, Wang et al. proposed a method of Discriminative and Common hybrid Dictionary Learning (DCDL) [11]. Xu et al. proposed a new dictionary learning framework that can effectively represent face images and enhance their diversity [12]. The Robust, Discriminant and Comprehensive Dictionary Learning method (RDCDL), recently proposed by Lin et al. [13], can also effectively improve image classification. The multi-resolution dictionary learning method enhances the robustness of the dictionary to noise by exploiting different resolutions [14].

Since the naive sparse representation classification (SRC) algorithm was proposed, it has been widely used in image processing and face recognition. For example, Xu et al. proposed a sparse representation method based on l2 regularization [15] that achieves a noticeable improvement in precision. Sparse representation is increasingly applied to image classification [15,16,17,18,19], image super-resolution [20, 21], image denoising [22], and image alignment [23], and various sparse representation algorithms have been proposed one after another. Moreover, dictionary learning, an important class of methods directly related to sparse representation, is attracting more and more attention from researchers. Sparse representation algorithms can generally be divided into two categories: the first is based on the original training samples, and the second is based on a dictionary. The former uses the training samples to linearly represent the test sample, whereas the latter represents the test sample with a dictionary generated from the set of original training samples [17, 24,25,26,27,28,29,30]. The first category includes many examples, such as orthogonal matching pursuit (OMP) [31], L1-regularized least squares (L1LS) [32], the primal augmented Lagrangian method (PALM) [33], the dual augmented Lagrangian method (DALM) [33], the fast iterative shrinkage and thresholding algorithm (FISTA) [34], two-phase test sample sparse representation (TPTSR) [35], and collaborative representation (CRC) [36]. These algorithms achieve good results in face recognition.

In this paper, we propose a novel image classification algorithm based on the idea of sparse representation. The algorithm better preserves the large-scale information and holistic features of the original image and reduces the differences between images of the same object, which greatly benefits image classification. The algorithm has the following important property: if the original image is a gray image whose pixel values vary between 0 and 255, the pixels of the generated virtual image take symmetrical values; in particular, original intensities S and 255 − S are mapped to the same virtual value, whose maximum is \(\sqrt{127\times 128}\). The algorithm combines the generated new image representations with the original representations to perform classification. We conducted experiments on several face image databases, and the experimental results verify that the proposed image classification algorithm achieves significant performance in image classification.

The rest of the paper is organized as follows. Section 2 describes the principle and steps of the proposed algorithm and presents two methods for generating virtual images. Section 3 intuitively explains the characteristics and advantages of the proposed algorithm. Section 4 presents extensive experimental results and analyzes them. Section 5 concludes the paper.

2 Algorithm principle and steps

2.1 Algorithm steps

In this section, we will explain the steps of the algorithm in detail. The main steps of the algorithm are as follows.

Step 1:

Select training samples and test samples. All original images are divided into two parts: training samples and test samples.

Step 2:

Obtain the virtual training samples using (15) or (16).

The original training samples are expressed as follows:

$$ A=(A_{1},A_{2},...,A_{C}) $$
(1)

The virtual images generated from the training samples are denoted by V:

$$ V=(V_{1},V_{2},...,V_{C}) $$
(2)

C is the number of classes, and each class has n training samples. The original and virtual training samples of the i-th class are denoted by \(A_{i}\) and \(V_{i}\) respectively, where \(A_{i} = (a_{i1}, a_{i2},\ldots, a_{in})\) and \(V_{i} = (v_{i1}, v_{i2},\ldots, v_{in})\). Each column vector \(a_{ij}\) in \(A_{i}\) represents the j-th image of the i-th class, and each column vector \(v_{ij}\) in \(V_{i}\) represents the virtual image generated from the j-th training sample of the i-th class.

Step 3:

Let y denote the test sample and \(y_{v}\) the virtual test sample generated from y. Following CRC, we express y as a linear combination of all training samples:

$$ y=\sum\limits_{k=1}^{Cn}a_{k}x_{k} $$
(3)

\(a_{k}\) denotes the k-th training sample, i.e., the k-th column of matrix A, and \(x_{k}\) is the coefficient of \(a_{k}\). We rewrite (3) in matrix form:

$$ y=Ax $$
(4)

We solve (4) with the regularized least squares method, which yields the coefficient vector x:

$$ x=(A^{\mathrm{T}}A+\lambda I)^{-1}A^{\mathrm{T}}y $$
(5)

Similarly, the coefficient vector β of the virtual training samples is

$$ \beta=(V^{\mathrm{T}}V+\lambda I)^{-1}V^{\mathrm{T}}y_{v} $$
(6)

λ is a small positive constant and I is the identity matrix.
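As a hedged illustration, the coefficients in (5) and (6) can be computed with a few lines of NumPy; the function name, the matrix shapes, and the default λ below are our own illustrative assumptions rather than values prescribed by the paper:

```python
import numpy as np

def ridge_coefficients(D, y, lam=0.01):
    """Regularized least squares, Eqs. (5)/(6): (D^T D + lam*I)^(-1) D^T y.

    D   : (m, Cn) matrix whose columns are (original or virtual) training samples.
    y   : (m,) test sample (or virtual test sample).
    lam : regularization constant; an assumed default, tuned in practice.
    """
    G = D.T @ D + lam * np.eye(D.shape[1])   # regularized Gram matrix
    return np.linalg.solve(G, D.T @ y)       # solve instead of forming an explicit inverse

# x = ridge_coefficients(A, y); beta = ridge_coefficients(V, y_v)
```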

Step 4:

Let \({d_{o}^{i}}\) denote the distance (i.e., the score) between the test sample and the original training samples of the i-th class, and \({d_{v}^{i}}\) the distance between the virtual test sample and the virtual training samples of the i-th class:

$$ {d_{o}^{i}}=\parallel y-A_{i} x_{i} \parallel_{2}, \quad i=1,2,\ldots,C $$
(7)
$$ {d_{v}^{i}}=\parallel y_{v}-V_{i} \beta_{i} \parallel_{2}, \quad i=1,2,\ldots,C $$
(8)

\(x_{i}\) and \(\beta_{i}\) are the coefficient sub-vectors associated with the training samples and the virtual training samples of the i-th class, respectively, where \(x^{\mathrm{T}} = ((x_{1})^{\mathrm{T}},(x_{2})^{\mathrm{T}},\ldots,(x_{C})^{\mathrm{T}})\) and \(\beta^{\mathrm{T}} = ((\beta_{1})^{\mathrm{T}},(\beta_{2})^{\mathrm{T}},\ldots,(\beta_{C})^{\mathrm{T}})\). \(\parallel\cdot\parallel_{2}\) denotes the L2 norm.

Step 5:

In this step we merge the scores obtained from the original training samples with those obtained from the virtual training samples, adopting the simple and efficient fusion method proposed in [37]. Let \({S_{o}^{1}}, {S_{o}^{2}},\ldots , {S_{o}^{C}}\) and \({S_{v}^{1}}, {S_{v}^{2}},\ldots , {S_{v}^{C}}\) denote the values of \({d_{o}^{i}}\) and \({d_{v}^{i}}\) sorted in ascending order, respectively. The fusion weights \(W_{1}\) and \(W_{2}\) are then calculated by the following equations:

$$ w_{10}={S_{o}^{2}}-{S_{o}^{1}} $$
(9)
$$ w_{20}={S_{v}^{2}}-{S_{v}^{1}} $$
(10)
$$ W_{1} = \frac{w_{10}}{w_{10} + w_{20}} $$
(11)
$$ W_{2} = 1 - W_{1} $$
(12)

Finally, the ultimate fusion result is obtained as follows:

$$ d_{i}=W_{1} {d_{o}^{i}}+W_{2} {d_{v}^{i}} \quad i=1,2,{\ldots} C $$
(13)

We classify the test sample by the following formula:

$$ j= \mathop{\arg\min}\limits_{i} d_{i} $$
(14)

That is, the test sample is assigned to the j-th class, the class with the smallest fused distance \(d_{j}\).
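Steps 3-5 can be condensed into a short sketch. This is a minimal NumPy illustration built on the ridge_coefficients helper above; the function signature and the label encoding (class indices 0..C−1) are our own assumptions:

```python
import numpy as np

def classify(A, V, y, y_v, labels, lam=0.01):
    """Classify a test sample by fusing original and virtual representations.

    A, V   : (m, Cn) original / virtual training matrices (samples as columns).
    y, y_v : (m,) original / virtual test sample.
    labels : (Cn,) class index of each training column, values in 0..C-1.
    """
    x = ridge_coefficients(A, y, lam)          # Eq. (5)
    beta = ridge_coefficients(V, y_v, lam)     # Eq. (6)
    C = int(labels.max()) + 1
    d_o, d_v = np.empty(C), np.empty(C)
    for i in range(C):
        idx = labels == i
        d_o[i] = np.linalg.norm(y - A[:, idx] @ x[idx])       # Eq. (7)
        d_v[i] = np.linalg.norm(y_v - V[:, idx] @ beta[idx])  # Eq. (8)
    s_o, s_v = np.sort(d_o), np.sort(d_v)
    w10 = s_o[1] - s_o[0]                      # Eq. (9): gap between two smallest scores
    w20 = s_v[1] - s_v[0]                      # Eq. (10)
    W1 = w10 / (w10 + w20)                     # Eq. (11); assumes w10 + w20 > 0
    W2 = 1.0 - W1                              # Eq. (12)
    d = W1 * d_o + W2 * d_v                    # Eq. (13)
    return int(np.argmin(d))                   # Eq. (14)
```

Intuitively, a larger gap between the two smallest distances of one branch indicates that this branch discriminates the classes more confidently, so it receives a larger fusion weight.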

2.2 Generation of virtual images

In this paper, we propose an improved image representation method to represent the original image. Applying it to an original image yields a transform of that image, which we call the virtual image. The generation of virtual images is introduced as follows.

We take the gray image as an example to illustrate how an original image generates its corresponding virtual image. The maximum pixel value of a gray image is 255, denoted by \(P_{max}\). The pixel value at the r-th row and c-th column of the original image is denoted by \(S_{rc}\); the generated virtual image is denoted by V, and its pixel value at the r-th row and c-th column by \(V_{rc}\). The virtual image is generated from the original image as follows:

$$ V_{rc} = \sqrt{S_{rc} \cdot (P_{max} - S_{rc})} $$
(15)

By analyzing the above formula, we can draw the following conclusions:

  1. If \(S_{rc}\) is equal to 0 or the maximum pixel value of the image, the virtual image has a pixel value of 0 at the corresponding position.

  2. The closer \(S_{rc}\) is to \(\frac{P_{max}}{2}\), the larger the pixel value at the corresponding position of the virtual image; the maximum value is \(\sqrt{127\times 128}\).

  3. When the pixel value of the original image is \(S_{rc}\) or \((P_{max}-S_{rc})\), the pixel values at the corresponding positions in the virtual image are the same. In other words, the pixel values in the virtual image are symmetric about the peak value \(\sqrt{127\times 128}\).

It has been proved in [37] that medium-intensity pixels have strong stability and are more conducive to image classification. Compared with other similar methods, the pixel values of the virtual image generated by our method are significantly smaller and better concentrated around the medium intensity. Moreover, two pixels with similar values in the original training samples also differ little in the virtual image, which can greatly improve the performance of image classification.

If the original image is a gray image, the maximum value of the virtual image generated by our method is \(\sqrt{127\times 128}\), which not only ensures that the pixel values of the virtual image are distributed near the medium intensity but also helps preserve the large-scale information of the original image. To some extent, the virtual image properly highlights the global features of the image, which is very beneficial for image classification.

Based on the principle of the image representation method proposed above, we also propose another scheme to generate virtual images. Extensive experiments show that this method can also significantly improve image classification accuracy. The equation for generating the virtual image is as follows:

$$ V_{rc} = \sqrt[3]{S_{rc} \cdot (P_{max} - S_{rc})} $$
(16)

Compared with (15), the virtual image generated by (16) has smaller pixel values and smaller differences between pixels, which appears to make it easier to capture the global information of the image.
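Both transforms can be sketched in a few lines of NumPy. The function below is a minimal illustration under the assumption of 8-bit gray input; the function name and the root parameter are our own conventions (root=2 gives Eq. (15), root=3 gives Eq. (16)):

```python
import numpy as np

def virtual_image(S, p_max=255.0, root=2):
    """Generate a virtual image from a gray image S.

    root=2 implements Eq. (15); root=3 implements Eq. (16).
    Assumes the pixel values of S lie in [0, p_max].
    """
    S = np.asarray(S, dtype=np.float64)
    return (S * (p_max - S)) ** (1.0 / root)
```

For root=2 this mapping sends intensities 0 and 255 to 0 and, over integer intensities, peaks at \(\sqrt{127\times 128}\) for mid-range inputs, matching conclusions (1)-(3) above.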

3 Algorithm analysis

In this section, we mainly analyze the characteristics and advantages of the proposed algorithm. By comparison with the algorithm proposed in [37], we take the face recognition experiment as an example to intuitively explain the principle of the algorithm on the basis of (15). We select the first face image of the first subject in the ORL face database, a gray image, as an example for analysis. The face image is shown in Fig. 1, and Fig. 2 shows the distribution of the original pixels of the sample.

The pixel value distribution of the virtual image generated by the algorithm in [37] is shown in Fig. 3. When the pixel value of the original training sample changes from 0 to 255, the symmetry in pixel values of the virtual image is shown in Fig. 4.

Fig. 1 The first face image of the first subject in the ORL face database

Fig. 2 Original pixel values of the first sample of the first subject in the ORL face database

Fig. 3 Pixel values of the first virtual image of the first subject in the ORL face database attained using the algorithm in [37]

Fig. 4 Symmetry in pixel values of the virtual images attained using the algorithm in [37]

Figure 5 shows the pixel value distribution of the virtual image generated by the proposed algorithm, and Fig. 6 shows the symmetry in pixel values of the virtual image as the pixel value of the original image changes from 0 to 255.

Fig. 5 Pixel values of the first virtual image of the first subject in the ORL face database attained using the proposed algorithm

Fig. 6 Symmetry in pixel values of the virtual images attained using the proposed algorithm

Figure 7 shows the normalized data distributions of the same sample for the original image, the algorithm in [37], and the proposed algorithm. Here, normalization means scaling an image vector into a unit vector with L2 norm 1.

Fig. 7 Normalized data distributions of the first sample of the first subject in the ORL face database for the original image, the algorithm in [37], and the proposed algorithm
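The normalization used for Fig. 7 is simply unit L2 scaling; a small hedged sketch (the function name is our own):

```python
import numpy as np

def normalize(v):
    """Scale an image vector to unit L2 norm, as used for Fig. 7."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v)
```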

According to Figs. 3 and 4, the virtual image generated by the algorithm in [37] has very large pixel values, far exceeding the pixel range of a conventional gray image. Moreover, two pixels with similar values in the original image can differ greatly in the virtual image, which causes large-scale information of the original image to be lost. These problems are not conducive to image classification. Comparing Figs. 5 and 6, we see that the pixel values of the virtual image generated by our algorithm are significantly smaller, with maximum \(\sqrt{127\times 128}\). In addition, the difference between two pixels in the virtual image is greatly reduced, so the large-scale information of the original image is well preserved in the virtual image. Meanwhile, in the proposed algorithm, for gray images, pixels of intensity i and (255 − i) have the same intensity in the virtual image, and pixels whose values are closer to the medium intensity play a more important role.

Figure 7 shows that the virtual image generated by our proposed algorithm has a low correlation with the original image, which indicates that the original image and the obtained virtual image are complementary.

In summary, the proposed algorithm obtains more abundant large-scale information and, to some extent, more information corresponding to the global features of the image. As we know, large-scale and global information is more important for recognizing object appearances. Therefore, our proposed algorithm achieves greater precision in image classification.

Similarly, the algorithm for generating virtual images using (16) also has the above characteristics and advantages. Figure 8 shows the pixel value of the virtual image generated by the algorithm, and Fig. 9 shows the symmetry in pixel values of the virtual image when the pixel of the original image changes between 0 and 255.

Fig. 8 Pixel values of the first sample of the first subject in the ORL face database attained using (16)

Fig. 9 Symmetry in pixel values of the virtual image when using (16)

Figure 10 shows eight original face images (row 1) of a subject in the Georgia Tech face database, together with the virtual images generated by the algorithm in [37] (row 2) and by the proposed algorithm (row 3).

Fig. 10 The first row shows the original face images, the second row the virtual images generated by the algorithm in [37], and the third row the virtual images generated by the proposed algorithm

We find that the virtual images generated by the algorithm in [37] and by the proposed algorithm are relatively natural face images. Although the virtual image differs considerably from the original image in appearance, fusing the virtual and original images provides multiple representations of the same face image, which helps improve the accuracy of face classification.

4 Experiments and results

In this section, we verify the feasibility and rationality of the proposed algorithm through a large number of experiments on three face databases: the ORL face database, the Georgia Tech face database, and the FERET face database. The experimental results show that the proposed algorithm achieves a greater precision improvement than other similar algorithms in face image classification.

In order to better demonstrate the advantages of the proposed algorithm, we compared it with typical sparse representation algorithms, such as L1-regularized least squares (L1LS), the primal augmented Lagrangian method (PALM), and the fast iterative shrinkage and thresholding algorithm (FISTA), as well as with algorithms proposed in recent years, namely the multi-resolution dictionary learning method [14], Robust Sparse Linear Discriminant Analysis (RSLDA) [38], block-diagonal low-rank representation (BDLRR) [39], and the improved collaborative representation [37]. In addition, applying collaborative representation, PALM, L1LS, and FISTA directly to the original images is referred to as naive collaborative representation, naive PALM, naive L1LS, and naive FISTA, respectively.

In all experiments, the face images of each subject were divided into two mutually exclusive parts, a training set and a test set, whose union is the full set of face images of the subject.

The specific implementation process is as follows. First, we use the improved image representation method to generate the virtual image of each original image. Then we apply the sparse representation algorithm to the original images and the virtual images separately and obtain the classification scores of the test image through collaborative representation [36]. Finally, the score fusion scheme fuses the scores obtained from the original images and the virtual images into the ultimate classification score of the test sample. The following is an experimental analysis of the different algorithms on the various databases.
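Putting the pieces together, the evaluation protocol can be sketched as below. This is a hedged illustration reusing the virtual_image and classify sketches above; the first-n-per-subject split and all names are our own assumptions (the paper does not prescribe this exact harness):

```python
import numpy as np

def error_rate(images, labels, n_train, lam=0.01):
    """Per-subject split: first n_train images of each subject train, the rest test.

    images : (N, m) array of vectorized face images.
    labels : (N,) subject labels.
    Returns the classification error rate in percent.
    """
    labels = np.unique(labels, return_inverse=True)[1]  # remap labels to 0..C-1
    train_idx, test_idx = [], []
    for s in range(labels.max() + 1):
        idx = np.flatnonzero(labels == s)
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    A = images[train_idx].T            # columns are original training samples
    V = virtual_image(A)               # virtual training samples, Eq. (15)
    train_labels = labels[train_idx]
    errors = 0
    for t in test_idx:
        y, y_v = images[t], virtual_image(images[t])
        errors += classify(A, V, y, y_v, train_labels, lam) != labels[t]
    return 100.0 * errors / len(test_idx)
```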

4.1 Experiments on the ORL database

In this section, we evaluate the proposed algorithm on the ORL face database [40]. The ORL face database contains 40 subjects, each with 10 images, for a total of 400 gray face images. These images vary in angle, lighting, and facial expression; the expressions include details such as smiling or not smiling, eyes open or closed, and glasses or no glasses. In our algorithm, all face images in the ORL database are first resized to 56×46 pixels. Figure 11 shows example images of two subjects in the database.

Fig. 11 Image examples of two subjects in the ORL database

On the ORL face database, we conducted an experimental comparison of different algorithms, including sparse representation algorithms and recently proposed algorithms. The experimental results are shown in Table 1, which lists the classification error rates of the various algorithms on this database.

Table 1 Rate of classification errors (%) on the ORL dataset

From Table 1, we can clearly see that our algorithm achieves the best performance when the number of training samples per subject is 2, 3, 4, or 5. For example, when the number of training samples per subject is 5, the error rate of our proposed algorithm (15) is 7.50%, whereas the classification error rates of the original collaborative representation [37], multi-resolution dictionary learning [14], RSLDA [38], and BDLRR [39] are 8.5%, 9.55%, 8.00%, and 8.00%, respectively. When the number of training samples per subject is 4, the error rate of our proposed algorithm (16) is 7.08%, whereas the classification error rates of the original collaborative representation [37], multi-resolution dictionary learning [14], RSLDA [38], and BDLRR [39] are 8.75%, 16.62%, 10.83%, and 10.42%, respectively. The experiments on the ORL face database verify that our proposed algorithm can significantly improve the accuracy of image classification.

4.2 Experiments on the Georgia Tech database

In this section, we evaluate the proposed algorithm on the Georgia Tech face database [41, 42]. The Georgia Tech face database contains 50 subjects, each with 15 color images in JPEG format, for a total of 750 color face images. The images have cluttered backgrounds and a resolution of 640×480 pixels. The images of each subject include frontal and tilted face images with different expressions, lighting, and scales. Each image is manually labeled to determine the position of the face. In our improved algorithm, the images are first processed to remove the background, and each face image is resized to 40×30 pixels. Figure 12 shows the face images of three subjects in the Georgia Tech face database.

Fig. 12 Image examples of three subjects in the GT dataset

The comparison results of image classification error rates of different algorithms on Georgia Tech face database are shown in Table 2.

Table 2 Rate of classification errors (%) on the GT dataset

From Table 2, we can see that when the number of training samples per subject is 1, 2, or 3, our algorithm performs better than the other algorithms. For example, when the number of training samples per subject is 2, the error rate of our proposed algorithm (15) is 51.69%, whereas the classification error rates of the original collaborative representation [37], multi-resolution dictionary learning [14], RSLDA [38], and BDLRR [39] are 52.15%, 65.86%, 51.70%, and 56.77%, respectively. Similarly, when the number of training samples per subject is 3, the classification error rate of our algorithm (16) is 48.17%, which is 4.00%, 13.08%, 2.49%, and 1.00% lower than that of the original collaborative representation [37], multi-resolution dictionary learning [14], RSLDA [38], and BDLRR [39], respectively. The results show that our proposed algorithm greatly improves image classification accuracy.

4.3 Experiments on the FERET database

We also tested the proposed algorithm on the FERET face database [43, 44], one of the most widely used face databases in the field of face recognition. It collects images of subjects under different lighting conditions, and the face images of each subject exhibit different poses and facial expressions. In this section, experiments were performed using the "ba", "bj", "bk", "be", "bf", "bd", and "bg" subsets of the FERET face database, which contain 1400 gray face images of 200 subjects, seven per subject. Figure 13 shows the face images of three subjects in the FERET face database.

Fig. 13 Image examples of three subjects in the FERET dataset

In the experiments, we resize all face images to 40×40 pixels. The classification error rates of the different algorithms on the FERET face database are compared in Table 3.

Table 3 Rate of classification errors (%) on the FERET dataset

From Table 3, we can see that when the number of training samples per subject is 1, 2, or 5, our proposed algorithm has a lower classification error rate than the other algorithms. For example, when the number of training samples per subject is 5, the error rate of our proposed algorithm (15) is 28.75%, whereas the classification error rates of the original collaborative representation [37], multi-resolution dictionary learning [14], RSLDA [38], and BDLRR [39] are 30.25%, 49.02%, 30.25%, and 29.75%, respectively. Similarly, when the number of training samples per subject is 2, the proposed algorithm (16) has a classification error rate of 39.8%, yielding higher classification accuracy than other algorithms such as RSLDA [38] and BDLRR [39]. The above experiments show that the proposed algorithm can effectively improve the accuracy of image classification.

5 Conclusions

In order to improve the accuracy of image classification, especially for deformable objects such as human faces, this paper proposes an image classification algorithm. The experimental results show that our algorithm has better classification performance than other similar algorithms, such as multi-resolution dictionary learning, RSLDA, BDLRR, and sparse representation algorithms including L1LS, FISTA, and PALM. At the same time, the proposed algorithm has the advantages of high computational efficiency, simple implementation, and full automation. In addition, the two new image representation procedures that we propose are complementary to the original image when representing an object. Using the original image and the virtual image to represent the same object multiple times makes our algorithm very versatile. The above experiments also prove the feasibility and effectiveness of the algorithm.