1 Introduction

In the field of computer vision and pattern recognition, face recognition has always received extensive attention [4, 9, 17, 33]. However, face recognition still faces challenges due to variations in illumination, viewing angle, and facial expression, as well as the small number of available samples. Sparse representation based classification (SRC) [24] has excellent feature extraction and classification performance, and its application to face recognition has produced many important research results [2, 3, 7, 12, 30, 31]. Inspired by SRC, Zhang et al. [36] proposed a more general model, collaborative representation based classification (CRC), which not only has lower complexity but also achieves very competitive results. Keinert et al. [10] proposed a group-sparse representation based method that is robust to noise, occlusions, and disguises. To handle pose, expression, and misalignment changes in face images, Ma et al. [16] used a sparse representation method with an adaptive weighted spatial pyramid structure. To address the small number of training samples, the fast kernel sparse representation algorithm [5] classifies the original data by nonlinearly mapping it to a high-dimensional feature space and has high operating efficiency. Considering that real face recognition systems require high execution efficiency, Ye et al. [34] improved an extended sparse representation classification algorithm suitable for scenarios with a single training sample per subject. The JLSRC algorithm [25] reveals the latent relationships in the data by learning two low-rank reconstruction dictionaries and embeds a non-negative constraint and Elastic Net regularization in the coefficient vector of the dictionary, which gives the dictionary better discriminative ability and makes it more robust to noise. To better deal with structural noise in face images, such as occlusion and illumination, Tan et al. [22] used a novel geometric sparse representation model with a single image, which better retains the original geometric structure information of the image.

For sparse representation, the dictionary learned from training samples has an important influence on the result of image classification. Dictionary learning is an important branch of sparse representation, and its excellent performance has made it an important method in face recognition research [13, 18]. Classical dictionary learning methods include K-SVD [1] and LC-KSVD [8]. Considering the importance of label constraints to the discriminative ability of the dictionary, Zhang [35] proposed a dictionary learning method that uses the labels of training samples, while Shrivastava et al. [21] used only part of the labeled data to obtain the dictionary. Li et al. [11] used both label constraints and locality constraints to construct a discriminative dictionary and obtained very competitive results in face recognition experiments. To cope with the small number of samples, Lin et al. [14] used training samples with different facial expressions to generate a dictionary. Zhang et al. [38] proposed a sample-expanded multi-resolution dictionary learning method, which not only increases the diversity of image representations but is also better suited to classifying images of the different resolutions encountered in practice. Xu et al. [32] observed objects from two different viewpoints by treating the rows and columns of an image as complementary perspectives.

To reduce the influence of external conditions such as illumination and viewing angle, and to improve the accuracy of image classification, this paper proposes a novel image classification algorithm. First, the algorithm uses a new image representation method to generate virtual images. This not only ensures that the large-scale information of the original training samples is retained in the virtual samples but also reduces the differences between different face images of the same person. Based on sparse representation, the algorithm linearly represents each test sample with the original training samples and with the virtual samples, obtaining a classification score for each of the two sample types. Then, this paper designs a simple and efficient weight fusion scheme that converts the classification scores of the original image and the virtual image into a final classification distance. Experimental results on multiple face databases show that the proposed algorithm has clear advantages in classification accuracy.

The rest of the paper is organized as follows. Section 2 introduces sparse representation and related works. Section 3 describes the algorithm steps and related implementation details. Section 4 gives the rationale and analysis of the proposed algorithm. Section 5 presents the experimental results and analysis. Section 6 provides the conclusions of the paper.

2 Related works

This section introduces the basic principles of sparse representation and related works, and defines some notation used throughout the paper.

Assume that the data set contains C classes of subjects and that each class has n training samples. The original training samples are then recorded as X = (X1,X2,…,XC), where Xi = (xi1,xi2,…,xin); Xi represents the training samples of the i-th class, and the column vector xij represents the j-th training sample of the i-th class.
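For concreteness, the following minimal NumPy sketch (our own illustration, with hypothetical sizes) shows this data layout: every image is vectorized into a column, and the columns are grouped class by class.

```python
import numpy as np

# Hypothetical sizes: C classes, n training images per class,
# each image vectorized into a d-dimensional column.
C, n, d = 40, 5, 56 * 46

# X holds one column per training sample, grouped by class:
# columns [i*n : (i+1)*n] form X_i, the samples of class i.
X = np.random.rand(d, C * n)  # stand-in for real image data

def class_block(X, i, n):
    """Return X_i, the d x n block of training samples of the i-th class."""
    return X[:, i * n:(i + 1) * n]
```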

The basic principle of sparse representation theory is to linearly represent the input signal with a set of over-complete bases; the resulting linear combination coefficients approximately represent the input signal with a certain degree of sparsity. The most classic algorithm is sparse representation based classification (SRC) [24], whose model can be expressed as:

$$ \min\limits_{\alpha} \| y-X \alpha \|_{2}^{2}+\lambda \| \alpha \|_{1} $$
(1)

where y represents the test sample, which can also be regarded as the input signal; X represents the set of training samples; α is the coefficient vector; λ is a regularization parameter; and \(\|\alpha\|_{1}\) denotes the L1 norm of α.

Solving the sparse representation model has high computational complexity, which makes it difficult to meet the requirements of practical applications. Therefore, the L1 norm constraint is replaced by an L2 norm constraint as follows:

$$ \min\limits_{\alpha} \| y-X \alpha \|_{2}^{2}+\lambda \| \alpha \|_{2}^{2} $$
(2)

This model is also called collaborative representation based classification (CRC) and has higher computational efficiency. Classification is then performed by computing the difference between the test sample y and its class-wise estimate Xiαi, where Xi contains the training samples of the i-th class and αi is the corresponding subvector of α:

$$ \text{Identity}(y)=\underset{i}{\arg\min} \| y-X_{i} \alpha_{i} \|_{2}^{2} \quad i=1,2,\ldots,C $$
(3)
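As an illustration, the following minimal NumPy sketch implements this CRC pipeline under the data layout defined above (the closed-form ridge solution of (2) followed by rule (3)); the function and variable names are our own, not from the cited works.

```python
import numpy as np

def crc_classify(X, y, C, n, lam=0.01):
    """Classify test sample y with CRC.

    X   : d x (C*n) training matrix, samples as columns, grouped by class
    y   : d-dimensional test sample
    lam : regularization parameter lambda
    """
    N = X.shape[1]
    # Closed-form solution of (2): alpha = (X^T X + lam*E)^{-1} X^T y
    alpha = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y)
    # Rule (3): per-class reconstruction residuals
    residuals = [
        np.linalg.norm(y - X[:, i*n:(i+1)*n] @ alpha[i*n:(i+1)*n]) ** 2
        for i in range(C)
    ]
    return int(np.argmin(residuals))  # index of the predicted class
```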

At present, many algorithms based on sparse representation have been developed and have achieved excellent performance in face recognition. For example, the two-phase test sample sparse representation method [26] can accurately classify test samples. In the first phase, the method represents the test sample as a linear combination of all training samples and determines the M nearest neighbors of the test sample. In the second phase, these neighbors are used to linearly represent the test sample, and the result is used for classification. Furthermore, using the symmetrical image [27] or mirror image [28] of a face image can not only increase the number of samples but also effectively improve the performance of image classification.

Xu et al. [30] used a novel image representation method to obtain virtual images and fused the original and virtual images with a weighted fusion method to obtain the final classification results for the test samples. On this basis, Zheng et al. [39] developed two further image transformation methods and achieved good performance.

3 Proposed algorithm

3.1 Algorithm steps

This section introduces the implementation process of the proposed algorithm. The specific steps of the algorithm are as follows.

Step 1. All samples of the data set are divided into two parts: a training set and a test set.

Step 2. Use formula (11) to convert all images into virtual images.

Step 3. Solve for the coefficients that linearly represent the test sample with the training samples.

According to the idea of sparse representation, the training samples X can be used to linearly represent the test sample. Assuming that y is a test sample, y can be expressed as a linear combination of the columns of X as follows.

$$ y=X\alpha=\sum\limits_{i=1}^{Cn}x_{i}\alpha_{i} $$
(4)

xi is the i-th column of X, representing the i-th training sample, and αi is the coefficient corresponding to xi. Solving this equation by the regularized least squares method gives the linear combination coefficients α of y as follows:

$$ \alpha=(X^{\mathrm{T}}X+\lambda E)^{-1}X^{\mathrm{T}}y $$
(5)

λ is a constant and E is the identity matrix.

In the same way, the virtual training samples V are used to linearly represent the virtual test sample yv as follows:

$$ y_{v}=V\beta=\sum\limits_{j=1}^{Cn}v_{j}\beta_{j} $$
(6)

yv is the virtual sample generated from the test sample y using equation (11). vj is the j-th column of V, representing the j-th virtual sample, and βj is the coefficient corresponding to vj. The linear combination coefficients β of yv are then:

$$ \beta=(V^{\mathrm{T}}V+\lambda E)^{-1}V^{\mathrm{T}}y_{v} $$
(7)
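Both (5) and (7) are instances of the same closed-form ridge solution; the following short sketch (reusing the hypothetical layout from Section 2) makes this explicit.

```python
import numpy as np

def ridge_coefficients(D, t, lam=0.01):
    """Closed-form solution of min_c ||t - D c||_2^2 + lam ||c||_2^2,
    i.e. c = (D^T D + lam*E)^{-1} D^T t, as in (5) and (7)."""
    N = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(N), D.T @ t)

# alpha = ridge_coefficients(X, y)    # coefficients over original samples, (5)
# beta  = ridge_coefficients(V, y_v)  # coefficients over virtual samples, (7)
```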

Step 4. Calculate the classification distance.

The distance between the test sample y and the original training samples of the i-th subject is recorded as \({d_{1}^{i}}\).

$$ {d_{1}^{i}}=\| y-X_{i}\alpha_{i}\|_{2} \quad i=1,2,\ldots,C $$
(8)

Similarly, the distance between the virtual test sample yv and the virtual training samples of the i-th subject is recorded as \({d_{2}^{i}}\).

$$ {d_{2}^{i}}=\|y_{v}-V_{i}\beta_{i}\|_{2} \quad i=1,2,\ldots,C $$
(9)
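The per-class distances (8) and (9) can be computed as in the following sketch (continuing the hypothetical names used above):

```python
import numpy as np

def class_distances(D, t, coef, C, n):
    """Distances ||t - D_i c_i||_2 for every class i, as in (8) and (9)."""
    return np.array([
        np.linalg.norm(t - D[:, i*n:(i+1)*n] @ coef[i*n:(i+1)*n])
        for i in range(C)
    ])

# d1 = class_distances(X, y, alpha, C, n)   # distances to original samples, (8)
# d2 = class_distances(V, y_v, beta, C, n)  # distances to virtual samples, (9)
```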

Step 5. Weight fusion scheme.

Use formula (14) to fuse the classification distances of the original image and the virtual image into the final classification distance di of the test sample.

Step 6. Classify the test sample.

di reflects the distance between the test sample and each subject, and the class with the minimum distance is taken as the class of the test sample. The test sample is classified by the following rule:

$$ j= \underset{i}{\arg\min}\ d_{i} $$
(10)

That is, the test sample is assigned to the class that attains the minimum of di, namely the j-th class.

3.2 Generation of virtual images

The proposed algorithm uses a novel image representation method to convert the original training samples into virtual images. The conversion process is as follows.

$$ V=\sqrt[4]{{X \cdot (255 - X)}} $$
(11)

V represents the virtual image and X the original training sample; the product and the fourth root in (11) are applied element-wise to the pixel values. It is easy to see that V and X have the same structure, that is, V = (V1,V2,…,VC), where Vi = (vi1,vi2,…,vin). Vi represents the virtual images generated from the original training samples of the i-th subject, and the column vector vij is the virtual image generated from the j-th training sample of the i-th subject.

This image representation method is based on a conversion of the pixel values of gray-level images. From the above equation, it can be seen that pixels with value p and pixels with value (255 − p) in the original image are mapped to the same value in the virtual image, which explains the symmetry of the virtual image. In addition, Xu et al. [30] argue that medium-intensity pixels are stable and more suitable for classifying deformable objects such as faces. Equation (11) concentrates the pixel values of the virtual image around the values produced by medium-intensity pixels and significantly reduces the differences between pixels. The transformed image thus better preserves the large-scale information and global features of the original image, which are often beneficial for image classification [39].
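A minimal sketch of this conversion for an 8-bit grayscale image (our own illustration of (11), not the authors' code):

```python
import numpy as np

def virtual_image(img_uint8):
    """Apply (11) pixel-wise: v = (p * (255 - p)) ** 0.25.

    For 8-bit input, the output values lie in [0, 127.5**0.5] ~ [0, 11.29],
    so differences between pixels shrink sharply.
    """
    p = img_uint8.astype(np.float64)
    return (p * (255.0 - p)) ** 0.25

# Symmetry check: p and 255 - p map to the same virtual value.
assert np.isclose(virtual_image(np.array([50]))[0],
                  virtual_image(np.array([205]))[0])
```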

3.3 Weight fusion scheme

The image classification procedure is applied to the original images and to the virtual images, respectively, yielding the distances between each subject and the test sample, denoted \({d_{1}^{i}}\) and \({d_{2}^{i}}\) \((i=1,2,\ldots,C)\).

The distance reflects the contribution of a training sample set when it linearly represents the test sample: the smaller the distance, the greater the contribution. Therefore, the algorithm sorts the distances between the test sample and the original face images, and between the test sample and the virtual face images, in ascending order. Assume that the sorted result of \({d_{1}^{i}}\) is \({q_{1}^{1}},{q_{1}^{2}},\ldots,{q_{1}^{C}}\), and the sorted result of \({d_{2}^{i}}\) is \({q_{2}^{1}},{q_{2}^{2}},\ldots,{q_{2}^{C}}\). Since the smallest distances, such as \({q_{1}^{1}},{q_{1}^{2}},{q_{2}^{1}},{q_{2}^{2}}\), have the greatest impact on the classification result, the weight fusion scheme pays particular attention to them.

The fusion weights of the original face image and the virtual face image are denoted as w1 and w2, respectively.

$$ w_{1} = \frac{{q_{1}^{1}}+{q_{1}^{2}}}{({q_{1}^{1}}+{q_{1}^{2}})+({q_{2}^{1}}+{q_{2}^{2}})} $$
(12)
$$ w_{2} = 1 - w_{1} $$
(13)

The distance between the test sample and the i-th subject is then calculated as follows:

$$ d_{i}=w_{1} {d_{1}^{i}}+w_{2} {d_{2}^{i}} \quad i=1,2,\ldots,C $$
(14)
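The fusion scheme (12)-(14) and the final decision rule (10) can be sketched as follows (continuing the hypothetical names used above):

```python
import numpy as np

def fuse_and_classify(d1, d2):
    """Fuse per-class distances via (12)-(14), then classify via (10).

    d1, d2 : length-C arrays of distances to the original and
             virtual training samples, respectively.
    """
    q1 = np.sort(d1)   # ascending distances, original images
    q2 = np.sort(d2)   # ascending distances, virtual images
    w1 = (q1[0] + q1[1]) / ((q1[0] + q1[1]) + (q2[0] + q2[1]))  # (12)
    w2 = 1.0 - w1                                               # (13)
    d = w1 * d1 + w2 * d2                                       # (14)
    return int(np.argmin(d))                                    # (10)
```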

4 Rationale of the proposed algorithm

Image representation is a very promising technique in computer vision and pattern recognition, and an appropriate image representation plays an important role in the final result of image classification. For deformable objects such as human faces, using image representation to generate potential samples can reflect, to a certain extent, the variations of the original samples. Especially when samples are insufficient, generating virtual samples is very helpful for improving the accuracy of image classification.

This paper proposes a novel image representation method, implemented as (11), to generate virtual images. In a virtual image, the differences between pixels are significantly reduced; for the same subject, the differences between its images are thereby reduced to a certain extent, while the virtual image retains the rich large-scale information of the original image. Figure 1 shows the first face image of the first subject in the ORL face database, Figure 2 shows the distribution of its pixel values, Figure 3 shows the distribution of the pixel values of the virtual image generated from it, and Figure 4 shows that the pixel values of the virtual image are symmetric.

Fig. 1 The first face image of the first subject in the ORL database

Fig. 2 Pixel values of the first face image of the first subject in the ORL database

Fig. 3 Pixel values of the virtual image generated from the first face image of the first subject in the ORL database

Fig. 4 Symmetry of the pixel values of the virtual image

From Figs. 2, 3, and 4, it is obvious that the pixel values of the virtual image are significantly reduced and that their distribution is more concentrated. When two pixels with a large value difference in the original image are mapped to the virtual image, the difference between them shrinks significantly. As the pixel value of the original image varies from 0 to 255, the pixel values of the virtual image exhibit symmetry.
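A short numeric check of these properties, computed directly from (11) (the sample points are our own illustration):

```python
import numpy as np

p = np.array([0.0, 50.0, 127.5, 205.0, 255.0])   # original pixel values
v = (p * (255.0 - p)) ** 0.25                    # virtual pixel values, (11)
print(np.round(v, 2))   # [ 0.   10.06 11.29 10.06  0.  ]
# The range shrinks from [0, 255] to about [0, 11.29], and p = 50 and
# p = 205 map to the same virtual value, confirming the symmetry.
```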

Figure 5 shows the normalized data distributions of an original face image and its corresponding virtual sample. It can be seen from the figure that the proposed image representation has a low correlation with the original face image, which indicates that the virtual image can serve as a supplement to the original image when representing a face. In other words, the virtual sample is another way of representing the subject.

Fig. 5 The standardized data of the first face image of the first subject in the ORL database under the original and the novel image representation

Figure 6 shows the original gray face images of a subject in the Georgia Tech face database and the virtual face images generated from them. It can be seen that the virtual face images still retain complete facial features. Therefore, the original training samples and the virtual samples can be combined to represent a face object. Moreover, weight fusion is a simple and efficient technique that is often applied in image classification. This paper designs a new weight fusion scheme to fuse the classification scores of the original face image and the virtual image; the scheme determines the fusion weights automatically and has high computational efficiency.

Fig. 6 The original gray face images of a subject in the Georgia Tech face database (first row) and their corresponding virtual face images (second row)

5 Experiments and results

To demonstrate the advantages of the proposed algorithm, we conducted experiments on the ORL face database [20], the Georgia Tech face database [6], and the FERET face database [29]. The experimental results are compared with many related face recognition algorithms, including both classic and recent ones: Naive collaborative representation, Naive PALM, Naive L1LS, Naive FISTA, PALM, Improved collaborative representation [30], BDLRR [37], RSLDA [23], Multi-resolution dictionary learning (MRDL) [15], and Improved image representation and sparse representation (IIRSR) [39]. An algorithm applied directly to the original training samples is labeled "Naive". This section compares the classification error rates of the different face recognition algorithms. The experimental results show that the proposed algorithm achieves satisfactory performance; the novel virtual image representation and the weight fusion scheme both play an important role in reducing its classification error rate.

In all experiments, the face images of each subject are divided into a training set and a test set. For each fixed number of training samples per subject, the program is executed ten times, and the final result is the average over all runs. To verify the effect of the expanded data on the algorithm, we also use exponential expansion [19] instead of (11) to generate virtual images; the corresponding results are denoted Qin [19]. Furthermore, while all algorithms use the Euclidean distance by default to measure the difference between images, we also examined other distance measures, such as the Manhattan distance.

5.1 Experiments on the ORL database

The ORL face database contains face images captured from different angles under different lighting conditions. Each subject includes samples with different facial expressions, such as eyes open or closed and smiling or not smiling, and some images contain partial occlusions, such as glasses. The database contains a total of 400 face images of 40 subjects, 10 samples per subject, and each face image is 56 × 46 pixels. Figure 7 shows the face images of three subjects from the database. Table 1 reports the classification error rates of different face recognition algorithms on the ORL face database. In the experiments, the number of training samples per subject is set to 2, 3, 4, and 5, and the remaining samples are used as the test set.

Fig. 7 Face images of the ORL face database

Table 1 Classification error rates (%) on the ORL dataset

From Table 1, we can clearly see that the proposed algorithm is very competitive in terms of classification accuracy. For example, when the number of training samples is 5, the classification error rate of the proposed algorithm is 2.5 percentage points lower than that of "Improved collaborative representation" and 1.5 percentage points lower than that of IIRSR2. The proposed algorithm likewise achieves higher recognition accuracy than the other face recognition algorithms for the other numbers of training samples.

5.2 Experiments on the Georgia Tech database

The Georgia Tech face database contains 750 color face images of 50 subjects, with 15 face images per subject. The images of each subject are collected under different viewing angles and illuminations, and some contain slight occlusions, such as glasses. We use the cropped face images for the experiments; each cropped image is 40 × 30 pixels, with the cluttered background of the original image removed. Figure 8 shows the face images of three subjects from the database.

Fig. 8 Face images of the Georgia Tech face database

In the experiments, the number of training samples per subject is set to 1, 2, and 3, and the remaining face images are used as the test set. Table 2 compares the classification error rates of different face recognition algorithms on the Georgia Tech face database.

Table 2 Classification error rates (%) on the Georgia Tech dataset

As Table 2 shows, the proposed algorithm obtains the lowest classification error rate among the compared face recognition algorithms. When the number of training samples per subject is 3, the proposed algorithm reduces the classification error rate by 4.84 and 1.5 percentage points relative to "Improved collaborative representation" and IIRSR2, respectively. Compared with the other face recognition algorithms, the proposed algorithm thus has a clear advantage in classification accuracy.

5.3 Experiments on the FERET database

The face images in the FERET face database are collected under different lighting conditions and reflect different poses and facial expressions of the subjects. The experiments use several subsets of the database, namely the "ba", "bj", "bk", "be", "bf", "bd", and "bg" subsets, for a total of 1400 face images of 200 subjects, with 7 face images per subject. Each face image is 40 × 40 pixels. Figure 9 shows example face images of three subjects from the database. Table 3 compares the classification error rates of different face recognition algorithms on the FERET database when the number of training samples per subject is set to 1, 2, and 3.

Fig. 9 Face images of the FERET face database

Table 3 Classification error rates (%) on the FERET dataset

As Table 3 shows, the proposed algorithm obtains the lowest classification error rate for 1, 2, and 3 training samples per subject. With 3 training samples, it reduces the classification error rate by 3.25 and 1.75 percentage points relative to "Improved collaborative representation" and IIRSR2, respectively. The experimental results show that the proposed algorithm produces better image classification results.

Across the experiments, the algorithm achieves higher recognition accuracy on the ORL database than on the other two databases. The main reason is that the face images in the Georgia Tech and FERET databases vary more strongly in expression, lighting, and pose, so the differences between face images of the same subject are larger. In addition, the Manhattan distance and the Euclidean distance produced similar results when measuring the distance between images.

6 Conclusions

In this paper, we propose an improved image representation method, design a weight fusion scheme, and, based on both, implement a novel image classification algorithm that we apply to face recognition. The image representation method converts the original image into a virtual image that effectively preserves the large-scale information and global features of the original image. In addition, the virtual image serves as a complementary representation of the object; combined with the original image, it represents objects from different perspectives, which plays an important role in correct classification, especially for deformable objects such as faces. Moreover, the proposed weight fusion scheme is simple and effective: it fuses the original image and the virtual image to obtain the final classification distance of the test image. The experimental results demonstrate the rationality and feasibility of the proposed algorithm. Compared with related face recognition algorithms, it achieves the best classification accuracy, and it can also be applied to other image classification tasks.