1 Introduction

Face recognition is one of the most important branches of biometrics [1]. In recent years, driven by governments' growing concern about public security, face recognition has attracted increasing attention from researchers. Linear subspace learning-based methods have been successfully applied to face recognition. These methods extract low-dimensional features that are more discriminative for facial image classification. Typical subspace learning-based methods include principal component analysis (PCA) [2, 3], Fisher linear discriminant analysis (LDA) [2, 3], locality preserving projections (LPP) [4] and unsupervised discriminant projection (UDP) [5]. Sparse representation and manifold learning methods are also widely exploited in face recognition [6–10].

Subspace learning-based methods can be divided into two kinds: unsupervised methods and supervised methods. LDA is a representative supervised method for learning a discriminant subspace. Unfortunately, it cannot be directly applied to small sample size (SSS) problems [11] because the within-class scatter matrix is singular. Face recognition is a typical SSS problem, and many works have adapted LDA to it. The most popular method is Fisherface [2]: first, PCA is used to reduce the dimensionality of the original features; second, LDA is applied in the PCA subspace. To overcome the singularity, one can add a singular value perturbation to the within-class scatter matrix [12]. Regularized discriminant analysis (RDA) is a more systematic approach [13]. Another regularized version of LDA is penalized discriminant analysis (PDA) [14, 15], which aims not only to address the SSS problem but also to smooth the coefficients of the discriminant vectors. In face recognition, the feature dimension often exceeds ten thousand, so it is not practical for RDA or PDA to process such high-dimensional covariance matrices. Well-known null-space methods include the LDA + PCA method [16] and direct LDA [17]. Loog [18] proposed the approximate pairwise accuracy criterion (aPAC), which weights class pairs so that close pairs are emphasized and less likely to be merged. Tao et al. [19] proposed maximizing the geometric mean of the Kullback–Leibler (KL) divergences between all class pairs for subspace selection.
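
To make the Fisherface pipeline concrete, the following is a minimal sketch of the PCA-then-LDA feature extraction described above, assuming scikit-learn and a matrix of flattened face images with class labels; the function and parameter names are illustrative, not taken from the cited papers.

```python
# Minimal Fisherface-style sketch: PCA first, so that the within-class scatter
# used by LDA is no longer singular, then LDA in the PCA subspace.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fisherface_features(X_faces, y, n_pca=100):
    # X_faces: (n_samples, n_pixels) flattened face images, y: class labels
    pca = PCA(n_components=n_pca).fit(X_faces)
    X_pca = pca.transform(X_faces)
    lda = LinearDiscriminantAnalysis().fit(X_pca, y)
    return lda.transform(X_pca), pca, lda
```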

Due to camera hardware, the surroundings and lighting variation, real-world images are often of very low resolution and difficult to recognize. In criminal investigations, if we still use traditional feature extraction strategies, the acquired images and the database images must be normalized to the same low resolution, which severely degrades recognition performance.

In general, low-resolution face recognition raises two problems. First, the low resolution of the face images leads to a low recognition rate; second, because of the resolution mismatch between training and test samples, traditional face recognition algorithms cannot be applied directly. It is therefore necessary to build a reliable algorithm for low-resolution face recognition.

Canonical correlation analysis (CCA) [20], presented by Hotelling in 1936, is a classical method of multivariate statistical analysis. The basic idea of CCA is to summarize the correlation between two groups of variables with a small number of canonical correlation variables, which fully and simply characterize the relationship between the two groups. CCA is therefore widely used for correlation analysis and predictive analysis. Huang et al. [21] used CCA for face image super-resolution reconstruction. In this work, we apply CCA directly to face recognition.

If we regard the low-resolution and high-resolution images as two different groups of variables, we can use CCA to find a pair of transforms between them. We can then project the low- and high-resolution images into the same linear space and match images of different resolutions. To overcome the aforementioned problems, we propose a method for low-resolution degradation face recognition over long distance based on CCA. The method connects the low- and high-resolution images by extracting their correlation and also avoids the dimension mismatch that arises when high-resolution images are normalized to low resolution. Experiments show that the proposed method achieves promising results on several face recognition problems.

2 Low-resolution face recognition

Subspace learning methods have been used successfully in face recognition. By seeking a linear projection matrix, we project the training and test samples into the same space, in which the different classes are easier to separate. These projection methods require the training samples and test samples to have the same resolution, as shown in Fig. 1.

Fig. 1 Illustration of the subspace learning-based methods

In low-resolution degradation face recognition over long distance, the requirement that the low-resolution images and the high-resolution images have the same resolution cannot be met, so subspace learning methods cannot be applied directly. One simple workaround is to down-sample the high-resolution images to the same size as the low-resolution images and then employ traditional feature extraction methods; however, this operation loses much useful information and decreases the recognition rate. We instead seek a method that can directly classify the high-resolution images and the low-resolution degraded images.
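
For reference, a minimal sketch of this down-sampling baseline is given below, assuming OpenCV for resizing and grayscale images stored as NumPy arrays; the target size and function name are illustrative placeholders.

```python
# Down-sampling baseline: shrink every high-resolution image to the low
# resolution of the probes, then flatten it for a traditional feature extractor.
import cv2
import numpy as np

def downsample_gallery(high_res_images, low_res_size=(16, 14)):
    rows, cols = low_res_size                      # target (rows, cols), e.g. 16 x 14
    resized = [cv2.resize(img, (cols, rows), interpolation=cv2.INTER_AREA)
               for img in high_res_images]
    return np.stack([r.ravel() for r in resized])  # one flattened image per row
```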

Different from the traditional approach, we seek two linear transformation matrices w^h and w^l that project the high- and low-resolution images, respectively, into the same low-dimensional space. We can then match faces across the different resolutions, as shown in Fig. 2.

Fig. 2 Illustration of the subspace learning-based methods at different resolutions

Since there are correlations between the different-resolution face images of the same person, we only need to find a way to exploit the correlation between the low- and high-resolution images and transform both into a common space. This idea not only preserves the details of the high-resolution face images but also avoids the dimension mismatch.

3 CCA

CCA [20] is a multivariate statistical analysis method that is widely applied to correlation analysis. As mentioned above, we apply CCA to low-resolution degradation face recognition.

3.1 The objective function

Assume that the vector dimensionalities of the high- and low-resolution images are \( n_{h} \) and \( n_{l} \), respectively. Denote the m high-resolution training samples by \( x^{i}, \; i = 1, 2, \ldots, m \), which compose a training set X:

$$ X = \left( \begin{array}{c} x^{1} \\ x^{2} \\ \vdots \\ x^{m} \end{array} \right) = \left( \begin{array}{cccc} x_{1}^{1} & x_{2}^{1} & \cdots & x_{n_{h}}^{1} \\ x_{1}^{2} & x_{2}^{2} & \cdots & x_{n_{h}}^{2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1}^{m} & x_{2}^{m} & \cdots & x_{n_{h}}^{m} \end{array} \right). $$
(1)

Similarly, the m low-resolution training samples \( y^{i}, \; i = 1, 2, \ldots, m \) compose the sample set Y:

$$ Y = \left( \begin{array}{c} y^{1} \\ y^{2} \\ \vdots \\ y^{m} \end{array} \right) = \left( \begin{array}{cccc} y_{1}^{1} & y_{2}^{1} & \cdots & y_{n_{l}}^{1} \\ y_{1}^{2} & y_{2}^{2} & \cdots & y_{n_{l}}^{2} \\ \vdots & \vdots & \ddots & \vdots \\ y_{1}^{m} & y_{2}^{m} & \cdots & y_{n_{l}}^{m} \end{array} \right). $$
(2)

Different from traditional feature extraction methods, CCA seeks a pair of projection matrices \( w^{\text{h}} \) and \( w^{\text{l}} \) that project the high-resolution samples and the low-resolution samples into the same space \( R^{1 \times n_{d}} \), \( n_{d} \le n_{l} \), via the linear transformations

$$ \hat{x}^{i} = x^{i} w^{\text{h}}, \quad \hat{y}^{i} = y^{i} w^{\text{l}} $$
(3)

where \( \hat{x}^{i}, \hat{y}^{i} \in R^{n_{d}}, \; i = 1, \ldots, m \). Whereas PCA captures the directions of maximum variation of the original data, CCA instead exploits the correlation between the high-resolution and the low-resolution samples, which guarantees the maximum correlation between the projected \( \hat{x}^{i} \) and \( \hat{y}^{i} \). The CCA criterion function is as follows:

$$ \begin{aligned} & \mathop{\arg\max}\limits_{w^{\text{h}}, w^{\text{l}}} J\left( w^{\text{h}}, w^{\text{l}} \right) = \frac{w^{{\text{h}}^{\text{T}}} S_{xy} w^{\text{l}}}{\sqrt{w^{{\text{h}}^{\text{T}}} S_{xx} w^{\text{h}} \cdot w^{{\text{l}}^{\text{T}}} S_{yy} w^{\text{l}}}} \\ & \text{s.t.}\quad w^{{\text{h}}^{\text{T}}} w^{\text{h}} = I, \quad w^{{\text{l}}^{\text{T}}} w^{\text{l}} = I \end{aligned} $$
(4)

In problem (4), \( S_{xx} \) and \( S_{yy} \) are the covariance matrices of X and Y, respectively, and \( S_{xy} \) is the cross-covariance matrix between the sample sets X and Y, as defined below:

$$ S_{xx} = \frac{1}{m}\sum\limits_{i = 1}^{m} \left( x^{i} - \bar{x} \right)^{\text{T}} \left( x^{i} - \bar{x} \right) $$
(5)
$$ S_{yy} = \frac{1}{m}\sum\limits_{i = 1}^{m} \left( y^{i} - \bar{y} \right)^{\text{T}} \left( y^{i} - \bar{y} \right) $$
(6)
$$ S_{xy} = \frac{1}{m}\sum\limits_{i = 1}^{m} \left( x^{i} - \bar{x} \right)^{\text{T}} \left( y^{i} - \bar{y} \right) $$
(7)
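
A minimal NumPy sketch of Eqs. (5)–(7) is given below, assuming X and Y are the row-wise sample matrices of Eqs. (1) and (2); the variable names are illustrative.

```python
# Compute the (cross-)covariance matrices of Eqs. (5)-(7) from the
# high-resolution matrix X (m x n_h) and the low-resolution matrix Y (m x n_l).
import numpy as np

def cca_scatter_matrices(X, Y):
    m = X.shape[0]
    Xc = X - X.mean(axis=0)      # centre the high-resolution samples
    Yc = Y - Y.mean(axis=0)      # centre the low-resolution samples
    S_xx = Xc.T @ Xc / m         # Eq. (5)
    S_yy = Yc.T @ Yc / m         # Eq. (6)
    S_xy = Xc.T @ Yc / m         # Eq. (7)
    return S_xx, S_yy, S_xy
```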

3.2 Solution of the target problem

According to the objective function (4), we build the following Lagrange multiplier function

$$ \begin{aligned} L\left( {w^{\text{h}} ,w^{\text{l}} } \right) & = w^{{{\text{h}}^{\text{T}} }} S_{xy} w^{\text{l}} - \frac{{\lambda_{1} }}{2}\left( {w^{{{\text{h}}^{\text{T}} }} S_{xx} w^{\text{h}} - 1} \right) \\ & \quad - \frac{{\lambda_{2} }}{2}\left( {w^{{{\text{l}}^{\text{T}} }} S_{yy} w^{l} - 1} \right) \\ \end{aligned} $$
(8)

where \( \lambda_{1} \) and \( \lambda_{2} \) are Lagrange multipliers. Setting the partial derivatives with respect to the projection axes \( w^{\text{h}} \) and \( w^{\text{l}} \) to zero, we have:

$$ \frac{\partial L}{{\partial w^{\text{h}} }} = S_{xy} w^{\text{l}} - \lambda_{1} S_{xx} w^{\text{h}} = 0 \Leftrightarrow w^{{{\text{h}}^{\text{T}} }} S_{xy} w^{\text{l}} = \lambda_{1} w^{{{\text{h}}^{\text{T}} }} S_{xx} w^{\text{h}} $$
(9)
$$ \frac{\partial L}{{\partial w^{\text{l}} }} = S_{yx} w^{\text{h}} - \lambda_{2} S_{yy} w^{\text{l}} = 0 \Leftrightarrow w^{{{\text{l}}^{\text{T}} }} S_{yx} w^{\text{h}} = \lambda_{2} w^{{{\text{l}}^{\text{T}} }} S_{yy} w^{\text{l}} $$
(10)

Since \( S_{yx} = S_{xy}^{\text{T}} \) and a scalar equals its own transpose, we have

$$ w^{{{\text{h}}^{\text{T}} }} S_{xy} w^{\text{l}} = w^{{{\text{l}}^{\text{T}} }} S_{yx} w^{\text{h}} $$
(11)

According to Eqs. (9)–(11), we get

$$ \begin{aligned} & \lambda_{1} w^{{{\text{h}}^{\text{T}} }} S_{xx} w^{\text{h}} = \lambda_{2} w^{{{\text{l}}^{\text{T}} }} S_{yy} w^{\text{l}} \, \Leftrightarrow w^{{{\text{h}}^{\text{T}} }} S_{xx} w^{\text{h}} = \frac{{\lambda_{2} }}{{\lambda_{1} }}w^{{{\text{l}}^{\text{T}} }} S_{yy} w^{\text{l}} \\ & \quad \Leftrightarrow w^{{{\text{h}}^{\text{T}} }} S_{xx} w^{\text{h}} = aw^{{{\text{l}}^{\text{T}} }} S_{yy} w^{\text{l}} , \\ \end{aligned} $$
(12)

where \( a = \lambda_{2} /\lambda_{1} \). By substituting Eq. (12) into Eqs. (9) and (10), we get two characteristic equations

$$ S_{xy} S_{yy}^{ - 1} S_{yx} w^{\text{h}} = a\lambda_{1}^{2} S_{xx} w^{\text{h}} $$
(13)
$$ S_{yx} S_{xx}^{ - 1} S_{xy} w^{\text{l}} = \lambda_{2}^{2} S_{yy} w^{\text{l}} $$
(14)

We set \( M_{xy} = S_{xx}^{ - 1} S_{xy} S_{yy}^{ - 1} S_{yx} \) and \( M_{yx} = S_{yy}^{ - 1} S_{yx} S_{xx}^{ - 1} S_{xy} \); the projection vectors \( w^{\text{h}} \) and \( w^{\text{l}} \) are then eigenvectors of \( M_{xy} \) and \( M_{yx} \), respectively. By solving for the leading eigenvectors of \( M_{xy} \) and \( M_{yx} \), we obtain a group of optimal projection vector pairs \( W^{\text{h}} = \left( w_{1}^{\text{h}}, w_{2}^{\text{h}}, \ldots, w_{d}^{\text{h}} \right) \) and \( W^{\text{l}} = \left( w_{1}^{\text{l}}, w_{2}^{\text{l}}, \ldots, w_{d}^{\text{l}} \right) \) with \( d \le \min\left( n_{\text{h}}, n_{\text{l}} \right) \). The projected sample sets \( \hat{X} = XW^{\text{h}} \) and \( \hat{Y} = YW^{\text{l}} \) then attain the best correlation. The projected vectors \( \hat{x} \) and \( \hat{y} \) have the same dimensionality, so we can compute their Euclidean distance directly.
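
The eigenvector computation can be sketched as follows, reusing the scatter matrices from the previous sketch. The small ridge term added to \( S_{xx} \) and \( S_{yy} \) before inversion is our own assumption to cope with (near-)singular scatter matrices on face data; it is not part of the derivation above.

```python
# Solve the characteristic equations (13)-(14): the columns of W_h and W_l are
# the leading eigenvectors of M_xy and M_yx, paired by decreasing eigenvalue.
import numpy as np

def cca_projections(S_xx, S_yy, S_xy, d, eps=1e-6):
    S_xx_inv = np.linalg.inv(S_xx + eps * np.eye(S_xx.shape[0]))  # ridge-regularized inverse
    S_yy_inv = np.linalg.inv(S_yy + eps * np.eye(S_yy.shape[0]))
    M_xy = S_xx_inv @ S_xy @ S_yy_inv @ S_xy.T   # eigenvectors give w^h
    M_yx = S_yy_inv @ S_xy.T @ S_xx_inv @ S_xy   # eigenvectors give w^l
    vals_h, vecs_h = np.linalg.eig(M_xy)
    vals_l, vecs_l = np.linalg.eig(M_yx)
    W_h = np.real(vecs_h[:, np.argsort(-np.real(vals_h))[:d]])
    W_l = np.real(vecs_l[:, np.argsort(-np.real(vals_l))[:d]])
    return W_h, W_l
```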

3.3 Description of the algorithm flow

As mentioned above, the low-resolution degradation face recognition over long distance in this paper is based on CCA. The framework is shown in Fig. 3.

Fig. 3 Framework of the proposed method

First, we centre the sample sets; then we train on the high-resolution set X and the low-resolution set Y according to the correlation criterion of Eq. (4). The projections \( \hat{x}^{i} \) and \( \hat{y}^{i} \) of \( x^{i} \) and \( y^{i} \) on the projection vectors \( w^{\text{h}} \) and \( w^{\text{l}} \) lie in the same space R, where their correlation reaches its peak. Finally, in the space R, we use the nearest neighbor classifier based on Euclidean distance: each projected low-resolution test sample is assigned the class of the high-resolution training sample with the minimum Euclidean distance.
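
A compact sketch of the whole flow in Fig. 3 is shown below, combining the two previous sketches. The matrix names, the choice of d and the reuse of the training means for centring the probes are our illustrative assumptions.

```python
# Recognition flow: learn W_h and W_l from the paired training sets, project
# the high-resolution gallery and the low-resolution probes into the common
# space, and classify each probe by the nearest gallery sample.
import numpy as np

def recognize(X_train_hi, Y_train_lo, Y_test_lo, labels_train, d=30):
    # labels_train: NumPy array with the class label of each gallery sample
    S_xx, S_yy, S_xy = cca_scatter_matrices(X_train_hi, Y_train_lo)
    W_h, W_l = cca_projections(S_xx, S_yy, S_xy, d)
    gallery = (X_train_hi - X_train_hi.mean(axis=0)) @ W_h   # \hat{X} = X W^h
    probes = (Y_test_lo - Y_train_lo.mean(axis=0)) @ W_l     # \hat{Y} = Y W^l
    dists = np.linalg.norm(probes[:, None, :] - gallery[None, :, :], axis=2)
    return labels_train[np.argmin(dists, axis=1)]            # nearest-neighbor labels
```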

4 Experiments

In this section, we conduct low-resolution face recognition experiments on the Extended Yale B, ORL and AR face databases. We compare the proposed method with PCA, LDA, multidimensional scaling (MDS) and LPP. The nearest neighbor classifier with Euclidean distance is used throughout. For PCA, LDA and LPP, we first normalize the high-resolution images to low resolution and then extract the features.

4.1 Data sets

4.1.1 Extended Yale B face database

The Extended Yale B face database includes 38 people, each with about 64 images. The image resolution is 192 × 168. In this paper, we choose 36 images per person for the experiments; 15 of them are used as high-resolution images and the other 21 as low-resolution images, where the resolution of the high-resolution images is three times that of the low-resolution images. We build two experimental sample sets with different resolutions: the first uses 48 × 42 images as the high-resolution images and 16 × 14 images as the low-resolution images, as shown in Fig. 4; the second uses 24 × 21 images as the high-resolution images and 8 × 7 images as the low-resolution ones, as shown in Fig. 5.

Fig. 4 Images in sample set 1

Fig. 5 Images in sample set 2

4.1.2 ORL face database

The ORL face database has 40 persons, each with ten face images taken at different angles; the image resolution is 112 × 92. In sample set 3, we resize the first four images of each person to 56 × 46 as high-resolution images and the remaining six to 28 × 23 as low-resolution images, as shown in Fig. 6. In sample set 4, we resize the first four images to 28 × 23 as high-resolution images and the remaining six to 14 × 12 as low-resolution images, as shown in Fig. 7.

Fig. 6 Images in sample set 3

Fig. 7 Images in sample set 4

4.1.3 AR face database

The AR face database [22, 23] consists of 120 persons, each with 26 images taken under different illumination, expression, occlusion and aging conditions. In this paper, we select 50 persons, each with seven images under different expressions and illumination. In sample set 5, we resize the first three images of each person to 80 × 60 as high-resolution images and the remaining four to 40 × 30 as low-resolution images, as shown in Fig. 8. In sample set 6, the first three images of each person are resized to 40 × 30 as high-resolution images and the remaining four to 20 × 15 as low-resolution images, as shown in Fig. 9.

Fig. 8 Images in sample set 5

Fig. 9 Images in sample set 6

4.2 Experimental results

In the first test, we choose 15 low-resolution samples and 15 high-resolution samples of each person in sample set 1 as the training set, and the remaining six low-resolution images of each person as the test set. Similarly, we pick 15 low-resolution samples and 15 high-resolution samples of each person in sample set 2 as the training set, and the remaining six low-resolution images of each person as the test set. For PCA and LDA, we use the 15 low-resolution images for training and the six remaining ones for testing. The numbers in parentheses in the PCA results indicate the retained dimension; the LDA test uses PCA as preprocessing and retains 98 % of the energy. For LPP, a Gaussian heat kernel with variance 1e2 and a ten-nearest-neighbor graph are used on set 1, and a Gaussian heat kernel with variance 1e4 and a 25-nearest-neighbor graph on set 2. For MDS, the number of iterations is 60. The results are shown in Table 1.

Table 1 Top recognition rates of the first test

Similarly, the second test picks ten high-resolution images and ten low-resolution images of each person as the training set; the 11 remaining low-resolution samples of each person are regarded as the test set. The other parameters are the same as in the first test, and the results are shown in Table 2.

Table 2 Top recognition rates of the second test

From Tables 1 and 2, the recognition rates are higher at higher resolutions; for example, the recognition rate on sample set 1 is higher than that on sample set 2, which indicates that the image resolution influences the recognition rate. PCA clearly performs poorly in the low-resolution setting. As shown in the tables, the proposed method outperforms the other methods in both tests. Both results also imply that the size of the training set influences the recognition rate, so we carry out a third test, in which MDS is chosen as the baseline because of its stable rate variation. The kernel projection simply uses the unit projection; we run 60 iterations with kernel parameter 0.5. By steadily increasing the number of training samples per person (from 2 to 15), we obtain histograms of recognition rate versus sample number for the two algorithms. The results on the two sample sets are shown in Figs. 10 and 11, respectively.

According to Figs. 10 and 11, the third test shows that the recognition performance of the CCA algorithm is closely related to the number of training samples. When the number of training samples per person is fewer than 7, the recognition performance of CCA declines drastically; with more training samples, CCA maintains a relatively good recognition performance. On sample set 1, when the number of training samples is 13, the recognition rate reaches 95.11 %; on sample set 2, with 8 × 7 low-resolution images that are hard to recognize even by the naked eye, CCA still reaches a recognition rate of 70 %. These results demonstrate that the CCA algorithm is applicable to low-resolution images.

Fig. 10 Recognition rates versus the number of the training set in sample set 1

Fig. 11 Recognition rates versus the number of the training set in sample set 2

In the fourth test, for each person in sample set 3, we pick four high-resolution samples and four low-resolution samples to compose the training set, and the two remaining low-resolution images are regarded as the test set. Similarly, we pick four high-resolution images and four low-resolution images per person in sample set 4, and the remaining two face images are regarded as the test set. For the PCA and LDA tests, we use the four low-resolution images for training and the two remaining low-resolution images for testing; the numbers in parentheses in the PCA results indicate the retained dimension, and the LDA test uses PCA as preprocessing and retains 98 % of the energy. Likewise, LPP uses PCA as preprocessing and retains 98 % of the energy; a Gaussian heat kernel with variance 90 and a two-nearest-neighbor graph are used in the LPP test on set 3, and a Gaussian heat kernel with variance 100 and a one-nearest-neighbor graph on set 4. For the MDS test, the number of iterations is 60. The results are shown in Table 3.

Table 3 Top recognition rates of the fourth test

In the fifth test, for each person in sample set 5, we pick three high-resolution samples and three low-resolution samples to form the training set, and the remaining one low-resolution face image forms the test set. We follow a similar procedure on sample set 6. For the PCA and LDA tests, we use the three low-resolution images for training and one low-resolution image for testing. The numbers in parentheses in the PCA results indicate the retained dimension, and both the LDA and LPP tests use PCA as preprocessing and retain 98 % of the energy. A Gaussian heat kernel with the same variance 1e4 is used on both set 5 and set 6, but different nearest-neighbor graphs are adopted in the LPP test: 16 neighbors for set 5 and 5 for set 6. For the MDS test, the number of iterations is 60, and the results are shown in Table 4.

Table 4 Top recognition rates of the fifth test

5 Conclusions

In this paper, CCA is applied for the first time to low-resolution degradation face recognition over long distance. We extract the most correlated components between the low-resolution and high-resolution images using CCA. CCA removes the requirement that the images have the same dimension and thus avoids the mismatch between low-resolution and high-resolution images. CCA achieves very high recognition rates in our experiments on three face databases, and the experimental results show that it can be successfully used for low-resolution degradation face recognition over long distance. In the future, applying supervised CCA, locality preserving CCA and sparse CCA to low-resolution degradation face recognition over long distance deserves further study.