1 Introduction

It is remarkable that researchers in the field of pattern recognition have recently proposed a promising image recognition method, the so-called sparse representation method [2–4]. This method does not classify images in the conventional way. It first expresses the test sample as a linear combination of a subset of the training samples and then classifies the test sample on the basis of the expression result. The method has achieved very good performance and has been described as a breakthrough in face recognition [5]. However, the sparse representation method has a very high computational cost, mainly because it depends on an iterative algorithm to obtain its solution. The sparse representation method has also been used for breast cancer biomarker identification and classification [6], signal processing [7] and image decomposition [8].

The main difference between the sparse representation method and previous face recognition methods is as follows. Previous face recognition methods usually consist of three stages: feature extraction, classifier selection and classification. The feature extraction stage is usually implemented by a transform method, such as the methods based on independent component analysis [9–11], principal component analysis [12–17] and discriminant analysis [18–25]. In contrast, the sparse representation method adopts a notably novel way to address the face recognition problem: it contains neither the feature extraction nor the classifier selection stage. Instead, it first attempts to express the test sample as a sparse linear combination of the training samples. Hereafter, the term ‘sparse linear combination of the training samples’ means that, when the test sample is expressed as a linear combination of all the training samples, the majority of the coefficients of the combination are zero. The sparse representation method thus seeks the potential effect of each training sample on ‘constructing’ the test sample and classifies the test sample into the class whose training samples have the greatest effect. In other words, this method assumes that the test sample can be approximated by the sum of the effects of the training samples from the different classes, and that the class with the greatest effect is the most similar to the test sample.

In this paper, we propose a very simple and fast face recognition method. This method is partially similar to the sparse representation method, but it is computationally much more efficient. Moreover, our method has the distinctive characteristic that its solution can be computed with ease. The proposed method first selects, for the test sample, only one neighbor sample from each class and then expresses the test sample as a linear combination of all the selected neighbor samples. Finally, the method builds a classification procedure on the basis of the expression result. The proposed method performs well in face recognition applications.

As the proposed method uses only L training samples to express and classify the test sample (L being the number of classes), it is also a sparse representation method and inherits the advantages of this kind of method. Our analysis also shows that the training samples selected and used by our method are the most suitable ones, i.e. the ones that produce the minimum classification error; if other L training samples are used to express and classify the test sample, a higher classification error is obtained. The experimental results show that the proposed method outperforms the nearest neighbor classification method (NNCM), which is a special form of the k nearest neighbor classifier [26]. NNCM first determines the training sample that is nearest to the test sample and then assigns the test sample to the class of that training sample. The proposed method appears to achieve its improvement by modifying the neighbor relationships between the test sample and the training samples determined by the Euclidean metric. Moreover, although our method depends on far fewer training samples to perform classification than the NFS method proposed in [1], it might obtain a better performance.

The rest of the paper is organized as follows: Sect. 2 describes our method. Section 3 provides our analysis of the proposed method. Section 4 presents our experiments and results. Section 5 offers our conclusion.

2 The proposed method

In this section, we formally describe the proposed method. It consists of two processes. The first process selects, for the test sample, the nearest training sample from every class. Supposing there are L classes, we obtain L nearest training samples (NTSs) for the test sample, each from one class. The second process expresses the test sample as a linear combination of all the selected L NTSs and exploits the determined linear combination to classify the test sample. Hereafter, we assume that all training and test samples are column vectors.

The first process of our proposed method works as follows. Let \( A_{i}^{k} \) (\( i = 1, 2, \ldots, n_{k} \), \( k = 1, 2, \ldots, L \)) denote the ith training sample of the kth class, where \( n_{k} \) is the number of training samples of the kth class. This process calculates the distance between test sample y and \( A_{i}^{k} \) using

$$ d_{i}^{k} = ||A_{i}^{k} - y||^{2} . $$
(1)

If \( j = \mathop {\arg \min }\limits_{i} d_{i}^{k} \), then \( A_{j}^{k} \) is identified as the nearest training sample from the kth class, and we denote it by \( {\text{NTS}}_{k} \). Once the first process has identified all the \( {\text{NTS}}_{k} \), \( k = 1, 2, \ldots, L \), we define the matrix \( S = [{\text{NTS}}_{1} \cdots {\text{NTS}}_{L}] \).
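As an illustration, the following NumPy sketch implements the first process under the assumption that the training samples of each class are stored as the columns of one array per class; the function name and data layout are our own and are not part of the paper.

```python
import numpy as np

def select_nts(class_samples, y):
    """First process: return S = [NTS_1 ... NTS_L], one nearest training sample per class.

    class_samples : list of L arrays, each of shape (M, n_k), whose columns are
                    the training samples of one class.
    y             : test sample, array of shape (M,).
    """
    nts = []
    for A_k in class_samples:
        # squared Euclidean distances d_i^k = ||A_i^k - y||^2 for every column i
        d = np.sum((A_k - y[:, None]) ** 2, axis=0)
        nts.append(A_k[:, np.argmin(d)])   # nearest training sample of this class
    return np.column_stack(nts)            # S has shape (M, L)
```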

The second process of our proposed method works as follows. It assumes that test sample y can be approximately represented by a linear combination of all the \( {\text{NTS}}_{k} \), \( k = 1, 2, \ldots, L \). In other words, it assumes that the following equation is approximately satisfied:

$$ y = \sum\limits_{i = 1}^{L} {\beta_{i} {\text{NTS}}_{i} } . $$
(2)

Eq. (2) can be rewritten as

$$ y = S\beta , $$
(3)

where \( \beta = (\beta_{1} \cdots \beta_{L})^{T} \). If \( S^{T} S \) is nonsingular, we can obtain the least squares solution of (3) using \( \beta = (S^{T} S)^{ - 1} S^{T} y \). If \( S^{T} S \) is nearly singular, we can solve for β using \( \beta = (S^{T} S + \mu I)^{ - 1} S^{T} y \), where μ is a positive constant and I is the identity matrix. We refer to this solution scheme as the regularized solution scheme of our method. After β is obtained, we use \( \hat{y} \) to denote Sβ, i.e. \( \hat{y} = S\beta \), and refer to \( \hat{y} \) as the expression result of our method. In Sect. 4, we convert the one-dimensional expression result into a two-dimensional image, which allows us to see how close the expression result is to the original test sample.
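A minimal sketch of this solution step is given below; it assumes the matrix S produced by the selection sketch above and uses μ = 0.01, the value adopted in Sect. 4. Solving the L × L system directly avoids forming an explicit inverse.

```python
import numpy as np

def solve_beta(S, y, mu=0.01):
    """Regularized least-squares solution beta = (S^T S + mu I)^(-1) S^T y."""
    L = S.shape[1]
    G = S.T @ S + mu * np.eye(L)        # L x L Gram matrix plus ridge term
    return np.linalg.solve(G, S.T @ y)  # solve the small linear system directly
```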

Eq. (2) shows that each NTS makes its own contribution to representing the test sample, and the contribution made by the ith NTS is \( \beta_{i} {\text{NTS}}_{i} \). The ability of the ith NTS to represent the test sample can therefore be evaluated by the deviation between \( \beta_{i} {\text{NTS}}_{i} \) and y, which we define as \( e_{i} = ||y - \beta_{i} {\text{NTS}}_{i} ||^{2} \). Our method regards the ith NTS as having a greater ability to represent the test sample the smaller \( e_{i} \) is. The second process identifies the NTS that has the minimum deviation from the test sample and classifies the test sample into the class of that NTS. It should be pointed out that each class has only one NTS and the ith NTS corresponds to the ith class. If \( e_{t} = \mathop {\min }\limits_{i} e_{i} \), then the test sample is classified into the tth class. We refer to the nearest neighbor from the tth class as the nearest neighbor determined by our method.
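The classification rule can be sketched as follows, assuming the S and β produced by the sketches above; the returned label is a 0-based class index rather than the paper's 1-based indexing.

```python
import numpy as np

def classify(S, beta, y):
    """Return the index t of the class whose NTS has the minimum deviation e_t."""
    contributions = S * beta                              # column i holds beta_i * NTS_i
    e = np.sum((y[:, None] - contributions) ** 2, axis=0)  # e_i = ||y - beta_i NTS_i||^2
    return int(np.argmin(e))
```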

3 Analysis of our method

In this section, we analyze our method to explore its characteristics. Our method differs from the sparse representation method in [2] as follows: our method has a very simple solution scheme, whereas the iterative solution scheme of the sparse representation method in [2] is computationally much less efficient. On the other hand, our method can be viewed as a special sparse representation method. Indeed, if the linear combination in our method is rewritten as a linear combination of all the training samples, the coefficients of all the training samples except the NTSs are zero.
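This viewpoint can be illustrated by scattering the L coefficients into a coefficient vector over all N training samples; the helper below is purely illustrative and assumes the global positions of the selected NTSs are known.

```python
import numpy as np

def expand_to_full_coefficients(beta, nts_indices, n_total):
    """Place beta[k] at position nts_indices[k]; every other coefficient stays zero."""
    full = np.zeros(n_total)
    full[nts_indices] = beta
    return full
```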

Although both our method and the method in [2] work as sparse representation methods, they achieve sparseness in two different ways. Our method uses the first process to produce the sparseness, so it is known exactly how sparse the coefficients are (i.e. how many coefficients are zero) and which coefficients are zero. In contrast, the sparse representation method proposed in [2] achieves sparseness through its iterative solution scheme, and it is not known in advance which coefficients of the linear combination are equal or close to zero. We therefore refer to our method as a “hard” sparse representation method, whereas the method in [2] can be referred to as a “soft” sparse representation method.

Our method has two advantages. The first is that it exploits only a small number of training samples to express and classify the test sample. As a result, our method can solve (3) very efficiently: when solving the linear system in (3), it needs a time complexity of \( O(L^{2} M + L^{3} + LM) \), where M is the dimension of the sample vector. The NFS method in [1] needs a time complexity of \( O\left( {L(n^{2} M + n^{3} + nM)} \right) = O(nNM + Nn^{2} + NM) \) to solve its L linear systems, where n and N are, respectively, the number of training samples per class and the total number of training samples.

The second advantage of our method is that, although it is simple and partially similar to NNCM, it performs better than NNCM, as shown in Sect. 4. NNCM can be described as a method that exploits the nearest neighbor from each class to classify the test sample: NNCM first selects, for the test sample, the nearest neighbor from each class; it then identifies the sample that is closest to the test sample among the L selected nearest neighbors (NTSs), where L is the number of classes, and assumes that the test sample is from the same class. The main reason why our method outperforms NNCM appears to be that it modifies the neighbor relationships between the test sample and the training samples determined by the Euclidean metric. As shown in Sect. 2, in our method the neighbor relationships between the test sample and the training samples are ultimately determined by the deviation between the test sample and the contribution of each NTS to representing the test sample. In other words, our method consists of two procedures: the first procedure uses the Euclidean metric to select, for the test sample, a neighbor sample from each class; the second procedure exploits the deviation defined in Sect. 2 to reorder the neighbor relationships between the test sample and the neighbor samples determined by the first procedure. That is, the second procedure uses as its metric the deviation between the test sample and the contribution of each NTS to expressing the test sample. Using this metric, the second procedure determines the ‘final nearest neighbor’ and classifies the test sample into the class that the ‘final nearest neighbor’ belongs to. Figure 1 shows the flowchart of our method: the method identifies the NTS having the minimum deviation and assumes that the test sample is from the class of that NTS. We can also say that the classification in our method is equivalent to nearest neighbor classification under the metric of ‘deviation’.
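Chaining the earlier sketches gives one pass of the proposed method, together with the corresponding NNCM decision for comparison; all function names come from our illustrative sketches, not from the paper.

```python
import numpy as np

def proposed_method(class_samples, y, mu=0.01):
    """One pass of the proposed method: select NTSs, solve for beta, classify by deviation."""
    S = select_nts(class_samples, y)
    beta = solve_beta(S, y, mu)
    return classify(S, beta, y)

def nncm(class_samples, y):
    """NNCM: assign y to the class whose nearest training sample is closest to y."""
    S = select_nts(class_samples, y)
    return int(np.argmin(np.sum((S - y[:, None]) ** 2, axis=0)))
```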

Fig. 1

Flowchart of our method. It is clear that the classifier in our method is equivalent to the nearest neighbor classifier based on the metric of ‘deviation’

When our method exploits L training samples to express the test sample, it indeed uses the most suitable and significant L samples. In other words, among all the training samples, the L nearest neighbors (NTSs) are the L most important training samples in terms of their ability to express the test sample. Indeed, as shown in Sect. 4, if other L training samples are used to express and classify the test sample, the classification performance can be very poor.

4 Experimental results

We used the ORL [27], Yale [28] and AR [29] face databases to test our method. The face images in the ORL database include variations in facial expression (smiling/not smiling, open/closed eyes) and facial detail. The subjects are in an upright, frontal position, with tolerance for some tilting and rotation of up to 20°. Each face image contains 112 × 92 pixels. The Yale database contains face images with a variety of expressions such as normal, sad, happy, sleepy, surprised and winking, all obtained under different lighting conditions, and some subjects wear glasses. The AR face database is a large-scale database; we used 3,120 gray-level face images of 120 subjects from this database, each subject providing 26 images. These images were taken in two sessions [30] and show faces with different facial expressions, under varying lighting conditions and occluded in several ways. For the ORL and Yale databases, when s of the n samples per class were used for training, we conducted experiments on all the \( C_{n}^{s} \) training sets and the corresponding \( C_{n}^{s} \) testing sets. Before applying any of the methods, we converted each sample vector into a unit vector, i.e. a vector of length 1. We used the regularized solution scheme to obtain the solution of our method and set μ to 0.01. We also tested NNCM and the NFS method proposed in [1]. Moreover, to show that our way of selecting neighbors for the test sample is reasonable, we modified our method so that it selects the first or the last training sample from each class and uses these samples to express and classify the test sample; we refer to these variants as the method using the first sample and the method using the last sample, respectively. We also modified our method so that it selects the furthest neighbor from each class for the test sample and exploits these samples to express and classify the test sample; we refer to this variant as the method using the furthest neighbor.
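A sketch of this evaluation protocol is given below, assuming the images of each class are already loaded as the columns of one array per class and reusing the proposed_method sketch from Sect. 3; the helper names are ours, not from the paper.

```python
import numpy as np
from itertools import combinations

def unit_normalize(X):
    """Scale each column of X to unit Euclidean length."""
    return X / np.linalg.norm(X, axis=0, keepdims=True)

def mean_error_rate(images_per_class, s, mu=0.01):
    """Average classification error rate over all C(n, s) training/testing splits."""
    n = images_per_class[0].shape[1]
    rates = []
    for train_idx in combinations(range(n), s):
        test_idx = [i for i in range(n) if i not in train_idx]
        train = [unit_normalize(X[:, list(train_idx)]) for X in images_per_class]
        errors, total = 0, 0
        for true_label, X in enumerate(images_per_class):
            for i in test_idx:
                y = unit_normalize(X[:, [i]])[:, 0]
                errors += int(proposed_method(train, y, mu) != true_label)
                total += 1
        rates.append(errors / total)
    return float(np.mean(rates))
```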

Figure 2 shows the first six nearest neighbors of one test sample determined by NNCM and by our method on the AR database. This figure shows that the ‘final nearest neighbor’ determined by our method [shown in Fig. 2(b’)] is from the same subject as the test sample, whereas the nearest neighbor determined by NNCM [shown in Fig. 2(b)] is not. As a result, our method correctly classifies the test sample, whereas NNCM fails to do so. Figure 3 shows some original test samples and the two-dimensional images corresponding to the expression results of our method. When the face is not occluded, the two-dimensional image corresponding to the expression result is very similar to the original test sample; when the face is occluded, it is not very close to the original test sample. Figure 4 shows the distances between the test sample shown in Fig. 2 and all 120 NTSs, the components of the solution vector of our method, and the deviations of the test sample from the 120 NTSs. Figure 4a clearly shows that the NTS from the 30th class is closest to the test sample, so NNCM classifies this test sample into the 30th class. However, Fig. 4c shows that the NTS from the first class has the minimum deviation from the test sample, so our method correctly classifies the test sample. Figure 4 visually shows that our method can ‘reorder’ the neighbor relationships between the test sample and the NTSs by using the deviations between the test sample and the NTSs.

Fig. 2

The first six nearest neighbors, of one test sample of the first subject in the AR database, determined by NNCM and our method. Both (a) and (a’) denote the same test sample from the AR database. (b)–(g) denote the first six ‘nearest neighbors’ determined by NNCM. (b’)–(g’) denote the first six nearest neighbors determined by our method (these samples have the six smallest deviations from the test sample). It is clear that our method can correctly classify the test sample, whereas NNCM fails to do so. In this case, the first four images of each subject were used as training samples and the remaining samples were used as testing samples

Fig. 3

Some original test samples and the two-dimensional images corresponding to the expression results of our method. The first and second rows show some original test samples and the corresponding two-dimensional expression results, respectively. The third and fourth rows show some other original test samples and the corresponding two-dimensional expression results, respectively. In this case, we also used the first four images of each subject as training samples and took the remaining samples as testing samples

Fig. 4

The distances (a) between the test sample shown in Fig. 2 and all the 120 nearest neighbor samples, the components (b) of the solution vector of our method and the deviation (c) of the test sample from its nearest neighbors. This figure shows that the proposed method can ‘reorder’ the 120 nearest neighbor training samples. a shows that the nearest neighbor from the 30th class is the closest to the test sample (as a result, NNCM will classify this test sample into the 30th class). b shows that the first component of the solution vector of our method has the maximum absolute value. c shows that the nearest neighbor from the first class has the minimum deviation from the test sample. As a result, our method will correctly classify the test sample. The deviation is defined in Sect. 2

Tables 1, 2 and 3 show the experimental results. They show that our method always classifies more accurately than NNCM and than the method using the furthest neighbor. For example, when the first four images per class from the AR database were used as training samples and the others were used as test samples, the classification error rates of our method, NNCM, the NFS method proposed in [1] and the method using the furthest neighbor were 31.67, 42.69, 41.86 and 45.64%, respectively. The error rate of our method is thus 11.02% lower than that of NNCM and 10.19% lower than that of the NFS method. Moreover, the fact that our method always obtains a much lower error rate than the method using the furthest neighbor, the method using the first sample and the method using the last sample demonstrates that, although these methods use the same number of training samples to express and classify the test sample, our method uses the most suitable training samples. Compared with the NFS method, our method classified much more accurately on the AR database; on the ORL and Yale databases, its classification performance is close to that of the NFS method.

Table 1 Means of the rates of the classification errors of the nearest neighbor method and the proposed method on the AR database
Table 2 Means of the rates of the classification errors of the nearest neighbor method and the proposed method on the Yale database
Table 3 Means of the rates of the classification errors of the nearest neighbor method and the proposed method on the ORL database

5 Conclusion

NNCM can be described as a method that exploits the nearest neighbor from each class to classify the test sample as follows: if there are L classes, NNCM first selects, for the test sample, L nearest neighbors, each from one class; it then identifies the sample that is closest to the test sample among the L selected nearest neighbors and assumes that the test sample is from the same class. The method proposed in this paper also uses these nearest neighbors from each class to express and classify the test sample. It is remarkable that this method achieves a much lower error rate than NNCM, and the maximum difference between the accuracies of the two methods can be greater than 10%. Our method achieves this by exploiting the ability of the training samples to represent the test sample, rather than only a simple distance, to classify the test sample, which proves to be a good way to obtain a high classification accuracy. The analysis also shows that, for our method, the L nearest neighbors used are the most suitable training samples to express and classify the test sample. Moreover, although our method is ‘sparser’ than the NFS method proposed in [1], it might obtain a better performance. Besides the method design and experimental analysis, this paper also visually presents the rationale and characteristics of the proposed method.