1 Introduction

Great progress has been made in automatic face recognition over the last decades, especially under well-controlled conditions. However, the performance of face recognition systems in the real world degrades dramatically when the quality of the input face images is poor, e.g., at low resolution. This is a particular concern in surveillance environments, where the target is far from the sensor, resulting in low-resolution face images.

To solve the low-resolution (LR) problem, a two-step framework has been proposed following the intuition of first recovering the lost detail of LR face images and then applying traditional face recognition algorithms to the recovered images. In fact, most two-step LR face recognition algorithms apply a super-resolution (SR) technique as the first step [15]. The super-resolved face images are then passed to a standard face recognition pipeline. Over the last decade, many SR algorithms have been developed to reconstruct high-resolution (HR) images from a single LR image [1] or multiple LR images [2]. In many real-world face recognition systems, the intuitive solution is interpolation, which is simple and fast, e.g., bilinear or cubic. Learning-based super-resolution (LSR) algorithms [1, 3–5] have recently drawn much attention owing to their promising performance. Freeman et al. [1] proposed a patch-wise Markov Random Field as the SR prediction model and recovered HR images by MAP estimation. Baker and Kanade [3] proposed to recover the HR face image from an input LR one via a “face hallucination” model based on face priors. Liu et al. [5] proposed to combine a holistic and a local model for SR reconstruction. Inspired by locally linear embedding (LLE) [7], Chang et al. [4] recovered the HR face image from the spatial neighbors of its LR counterpart. Yang et al. [8] proposed to incorporate sparse representation into the SR framework, which achieves outstanding performance. However, these algorithms aim more at visual enhancement than at the performance of the specific face recognition task.

Recently, some algorithms that avoid an explicit SR stage have been introduced into the face recognition flow. Gunturk et al. [9] proposed transferring the SR reconstruction from the pixel domain to the eigenface domain. Hennings-Yeomans et al. [10, 11] integrated the aims of SR and face recognition simultaneously through a joint objective function. Although these methods improve the recognition rate, they are slow, even in their sped-up versions, because an optimization procedure must be run for each test image. To avoid the super-resolution step, Coupled Mapping (CM) based methods have been proposed for LR face recognition. Li et al. [12] proposed Coupled Locality Preserving Mapping (CLPM) based on CM for LR face recognition. Inspired by locality preserving methods [13, 14] for dimensionality reduction, CLPM introduced a penalty weighting matrix into the objective function to preserve the local relationships of the original space. CLPM placed more emphasis on the objective of recognition than on mere reconstruction and thus yielded better performance. However, it ignored the label information of the training set, which is vital for face recognition. To take advantage of label information, some LDA-like algorithms were introduced into coupled mapping, such as Simultaneous Discriminant Analysis (SDA) [19] and Coupled Marginal Fisher Analysis (CMFA) [18]. In [17], Shi et al. first constructed a local optimization for each training sample according to the relationships of neighboring data points and then combined the local optimizations to build the global structure. However, these algorithms fail to consider the recognition and geometric information of the training set simultaneously; thus some valuable information is lost and performance is limited on challenging problems [17].

In this paper, we propose a novel algorithm called Large Margin Coupled Mapping (LMCM) for LR face recognition, which takes both the recognition information of the training data and the local geometric relationships of face image pairs into account to maximize the distance of between-class pairs and minimize the distance of within-class pairs in the common subspace. With appropriate constraints, the newly defined optimization problem can be solved in closed form, making it fast enough for real-time applications.

The remainder of this paper is organized as follows. Section 2 formulates the LR face recognition problem and the CM framework. Section 3 describes the details of our proposed LMCM algorithm. Section 4 presents experimental results on the FERET and SCface databases. Section 5 concludes the paper.

2 Low Resolution Face Recognition

In the scenario of LR face recognition, the task can be reduced to finding an appropriate distance measure between an LR face image \( l_{i} \) and an HR one \( h_{j} \), i.e., \( d_{ij} = dist\left( {l_{i} ,h_{j} } \right) \). Here, \( l_{i} \in {\mathbb{R}}^{m} ,\;i = 1,\,2,\, \ldots \,,\,N_{p} \) and \( h_{j} \in {\mathbb{R}}^{M} ,\;j = 1,\,2,\, \ldots \,,\,N_{g} \) (m < M) represent the m-dimensional feature vectors of the LR query images and the M-dimensional feature vectors of the HR images registered in the gallery set, respectively. Due to this dimension mismatch, common distances (e.g., the Euclidean distance) obviously cannot be applied directly. To deal with this problem, traditional two-step algorithms based on explicit SR attempt to find a mapping \( f_{SR} :{\mathbb{R}}^{m} \mapsto {\mathbb{R}}^{M} \) to project the LR image into the target HR space, and then directly compute the distance in the HR space:

$$ d_{ij} = dist\left( {f_{SR} \left( {l_{i} } \right),h_{j} } \right) $$
(1)

Different from the two-step algorithms, CM based methods establish two coupled mappings, \( f_{L} :{\mathbb{R}}^{m} \mapsto {\mathbb{R}}^{n} \) for LR face images and \( f_{H} :{\mathbb{R}}^{M} \mapsto {\mathbb{R}}^{n} \) for HR face images, to project both the LR and HR feature vectors into a common feature space. Here, n denotes the dimensionality of the new common feature space. The distance can then be measured by:

$$ d_{ij} = dist\left( {f_{L} \left( {l_{i} } \right),f_{H} \left( {h_{j} } \right)} \right) $$
(2)

Now the critical problem is to find an ideal common feature space. For low-resolution face recognition, the objective of the CM algorithm is that the projections of the LR and HR face images of the same subject should be as close as possible in the new common feature space. Let \( f_{L} \left( l \right) = P_{L}^{T} l \) and \( f_{H} \left( h \right) = P_{H}^{T} h \) be linear mappings, where \( P_{L} \) and \( P_{H} \) are projection matrices of size \( m \times n \) and \( M \times n \), respectively. This principle is formulated as the following objective function:

$$ J_{CM} \left( {P_{L} ,P_{H} } \right) = \sum\nolimits_{i = 1}^{{N_{t} }} {\left\| {P_{L}^{T} l_{i} - P_{H}^{T} h_{i} } \right\|^{2} } $$
(3)

where \( N_{t} \) is the number of training images.

We use \( L = \left[ {l_{1} ,\,l_{2} ,\, \ldots \,,\,l_{{N_{t} }} } \right] \) and \( H = \left[ {h_{1} ,\,h_{2} ,\, \ldots \,,\,h_{{N_{t} }} } \right] \) to denote the original LR and HR feature vectors in the training set, respectively. Equation (3) can be reformulated as

$$ J_{CM} \left( {P_{L} ,P_{H} } \right) = \left\| {P_{L}^{T} L - P_{H}^{T} H} \right\|_{F}^{2} = tr\left( {\left( {P_{L}^{T} L - P_{H}^{T} H} \right)\left( {P_{L}^{T} L - P_{H}^{T} H} \right)^{T} } \right) $$
(4)

where \( \left\| \cdot \right\|_{F} \) is the Frobenius norm and \( tr( \cdot ) \) is the matrix trace operator. Furthermore, using some linear algebra, Eq. (4) can be rewritten as

$$ J_{CM} \left( {P_{L} ,P_{H} } \right) = tr\left( {\left[ {\begin{array}{*{20}c} {P_{L} } \\ {P_{H} } \\ \end{array} } \right]^{T} \left[ {\begin{array}{*{20}c} L & 0 \\ 0 & H \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} I & { - I} \\ { - I} & I \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} L & 0 \\ 0 & H \\ \end{array} } \right]^{T} \left[ {\begin{array}{*{20}c} {P_{L} } \\ {P_{H} } \\ \end{array} } \right]} \right) $$
(5)

We can further let \( P = \left[ {\begin{array}{*{20}c} {P_{L} } \\ {P_{H} } \\ \end{array} } \right] \), \( Z = \left[ {\begin{array}{*{20}c} L & 0 \\ 0 & H \\ \end{array} } \right] \) and \( A = \left[ {\begin{array}{*{20}c} I & { - I} \\ { - I} & I \\ \end{array} } \right] \), where \( I \) is the identity matrix. Finally, we can get a compact form as

$$ J_{CM} \left( {P_{L} ,P_{H} } \right) = tr\left( {P^{T} ZAZ^{T} P} \right) $$
(6)

\( P_{L} \) and \( P_{H} \) can be obtained by minimizing Eq. (6) under appropriate constraints; the details of the optimization procedure can be found in [12].
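For illustration, the following NumPy sketch solves the CM objective of Eq. (6) under a simple orthonormality constraint on P (our assumption for the sketch; [12] derives its constraint differently): minimizing \( tr(P^{T} ZAZ^{T} P) \) subject to \( P^{T} P = I \) amounts to keeping the eigenvectors of \( ZAZ^{T} \) with the smallest eigenvalues. All names are illustrative.

```python
import numpy as np

def train_cm(L, H, n_dims):
    """Sketch of Coupled Mapping training (Eq. 6), assuming P^T P = I."""
    m, Nt = L.shape          # LR features: m x Nt
    M, _ = H.shape           # HR features: M x Nt
    # Block-diagonal data matrix Z and coupling matrix A from Eq. (5)
    Z = np.block([[L, np.zeros((m, Nt))],
                  [np.zeros((M, Nt)), H]])
    I = np.eye(Nt)
    A = np.block([[I, -I], [-I, I]])
    # Eigenvectors of Z A Z^T with the smallest eigenvalues minimize the trace
    S = Z @ A @ Z.T
    eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalues
    P = eigvecs[:, :n_dims]                # (m + M) x n
    P_L, P_H = P[:m, :], P[m:, :]          # split into the two coupled mappings
    return P_L, P_H
```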

3 Proposed LMCM

The CM algorithm described above obtains the projection matrices under the criterion that each LR face image and its corresponding HR one should be as close as possible. However, this exploits only part of the verification information of the training data, namely the face image pairs belonging to the same subject. In this paper, we draw inspiration from Maximum Margin Projection (MMP) [16] and propose the LMCM algorithm for LR face recognition, which seeks linear coupled mappings that force a margin between the between-class distances and the within-class distances in the common feature space, as shown in Fig. 1. To achieve this, we utilize the verification information along with the local geometry and the identification information of the training data.

Fig. 1. Overview of the proposed LMCM algorithm. Different shapes represent different subjects.

Verification Information with Local Geometry:

Under this scenario, the verification information lies in the distances between face image pairs: pairs from the same subject should have a small distance, while pairs from different subjects should have a large distance.

In order to capture both the discriminant and the geometrical structure of the face images, we construct two graphs: a within-class graph \( G_{w} \) and a between-class graph \( G_{b} \). In graph \( G_{w} \), face images sharing the same identity are connected, while in graph \( G_{b} \), face images belonging to different subjects are connected. Let \( W_{w} \) and \( W_{b} \) denote the weight matrices of \( G_{w} \) and \( G_{b} \), respectively. As the HR features are considered to carry more discriminant information, we build these weight matrices in the original HR image space. They are defined as follows:

$$ W_{w,ij} = \left\{ {\begin{array}{*{20}l} {e^{{ - \frac{{\left\| {h_{j} - h_{i} } \right\|_{2} }}{\sigma }}} ,} & {if\; h_{i} ,h_{j} \; connected\; in \;G_{w} } \\ {0,} & {otherwise} \\ \end{array} } \right. $$
(7)
$$ W_{b,ij} = \left\{ {\begin{array}{*{20}l} {e^{{ - \frac{{\left\| {h_{j} - h_{i} } \right\|_{2} }}{\sigma }}} ,} & {if\; h_{i} ,h_{j} \; connected\; in\; G_{b} } \\ {0,} & {otherwise} \\ \end{array} } \right. $$
(8)

where \( \sigma \) is the mean pairwise distance between face images in the training data.
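A minimal NumPy sketch of Eqs. (7)–(8), assuming the graphs are fully connected within each group (all same-label pairs in \( G_{w} \), all different-label pairs in \( G_{b} \)), which is one plain reading of the text; names are illustrative.

```python
import numpy as np

def build_weight_matrices(H, labels):
    """Heat-kernel weights on the within-/between-class graphs (Eqs. 7-8),
    built in the original HR space. H: M x Nt, labels: length-Nt int array."""
    Nt = H.shape[1]
    # Pairwise Euclidean distances between HR feature vectors
    diff = H[:, :, None] - H[:, None, :]
    dist = np.linalg.norm(diff, axis=0)        # Nt x Nt
    sigma = dist.sum() / (Nt * (Nt - 1))       # mean pairwise distance
    same = labels[:, None] == labels[None, :]  # within-class mask
    kernel = np.exp(-dist / sigma)
    W_w = np.where(same, kernel, 0.0)
    W_b = np.where(~same, kernel, 0.0)
    return W_w, W_b
```

Note that the diagonal of \( W_{w} \) is nonzero, which is harmless here: in Eq. (9) it contributes the term pulling each LR image toward its own HR counterpart.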

Now, consider the problem of mapping LR and HR face images into a common subspace so that the face images connected in \( G_{w} \) stay as close as possible, while the face images connected in \( G_{b} \) stay as far apart as possible. Let \( P_{L} \) and \( P_{H} \) denote the projection matrices. A reasonable criterion for learning them is to optimize the following objective functions:

$$ \mathop { \hbox{min} }\limits_{{_{{P_{L} ,P_{H} }} }} \sum\nolimits_{i,j} {\left\| {P_{L}^{T} l_{i} - P_{H}^{T} h_{j} } \right\|_{2}^{2} W_{w,ij} + \left\| {P_{L}^{T} l_{i} - P_{L}^{T} l_{j} } \right\|_{2}^{2} W_{w,ij} + \left\| {P_{H}^{T} h_{i} - P_{H}^{T} h_{j} } \right\|_{2}^{2} W_{w,ij} } $$
(9)
$$ \mathop {\hbox{max} }\limits_{{P_{L} ,P_{H} }} \sum\nolimits_{i,j} {\left\| {P_{L}^{T} l_{i} - P_{H}^{T} h_{j} } \right\|_{2}^{2} W_{b,ij} + \left\| {P_{L}^{T} l_{i} - P_{L}^{T} l_{j} } \right\|_{2}^{2} W_{b,ij} + \left\| {P_{H}^{T} h_{i} - P_{H}^{T} h_{j} } \right\|_{2}^{2} W_{b,ij} } $$
(10)

where \( W_{w} \) and \( W_{b} \) represent the weight matrices of \( G_{w} \) and \( G_{b} \), respectively. The objective function (9), constructed on the within-class graph \( G_{w} \), imposes a large penalty if neighboring face images of the same subject in the original space are mapped far apart. Similarly, the objective function (10), constructed on the between-class graph \( G_{b} \), imposes a large penalty if neighboring face images belonging to different subjects are mapped close together. Together, these objectives force a margin between the face feature vectors of different subjects.

Following some simple algebraic steps, the objective function (9) can be reduced to the following matrix form

$$ \begin{aligned} & \mathop { \hbox{min} }\limits_{{_{{P_{L} ,P_{H} }} }} Tr\left( {P_{L}^{T} L\left( {2D_{w}^{L} + D_{w}^{H} - W_{w} - W_{w}^{T} } \right)L^{T} P_{L} + P_{H}^{T} H\left( {D_{w}^{L} + 2D_{w}^{H} - W_{w} - W_{w}^{T} } \right)H^{T} P_{H} } \right) \\ & \quad - \,Tr\left( {P_{L}^{T} LW_{w} H^{T} P_{H} + P_{H}^{T} HW_{w}^{T} L^{T} P_{L} } \right) \\ \end{aligned} $$
(11)

where \( D_{w}^{L} \) and \( D_{w}^{H} \) are diagonal matrices with diagonal entries \( D_{w,ii}^{L} = \sum\nolimits_{j} {W_{w,ij} } \) and \( D_{w,jj}^{H} = \sum\nolimits_{i} {W_{w,ij} } \).

Similarly, the objective function (10) can be reduced to a similar matrix form

$$ \begin{aligned} & \mathop {\hbox{max} }\limits_{{P_{L} ,P_{H} }} Tr\left( {P_{L}^{T} L\left( {2D_{b}^{L} + D_{b}^{H} - W_{b} - W_{b}^{T} } \right)L^{T} P_{L} + P_{H}^{T} H\left( {D_{b}^{L} + 2D_{b}^{H} - W_{b} - W_{b}^{T} } \right)H^{T} P_{H} } \right) \\ & \quad - \,Tr(P_{L}^{T} LW_{b} H^{T} P_{H} + P_{H}^{T} HW_{b}^{T} L^{T} P_{L} ) \\ \end{aligned} $$
(12)

where \( D_{b}^{L} \) and \( D_{b}^{H} \) are diagonal matrices with diagonal entries \( D_{b,ii}^{L} = \sum\nolimits_{j} {W_{b,ij} } \) and \( D_{b,jj}^{H} = \sum\nolimits_{i} {W_{b,ij} } \).

By a deduction similar to that from (5) to (6), we can rewrite Eqs. (11) and (12) as follows

$$ \mathop { \hbox{min} }\limits_{{_{{P_{L} ,P_{H} }} }} Tr(P^{T} ZA_{w} Z^{T} P) $$
(13)
$$ \mathop {\hbox{max} }\limits_{{P_{L} ,P_{H} }} Tr(P^{T} ZA_{b} Z^{T} P) $$
(14)

where \( {\text{P}} = \left[ {\begin{array}{*{20}c} {P_{L} } \\ {P_{H} } \\ \end{array} } \right] \), \( {\text{Z}} = \left[ {\begin{array}{*{20}c} L & 0 \\ 0 & H \\ \end{array} } \right] \), \( {\text{A}}_{w} = \left[ {\begin{array}{*{20}c} {2D_{w}^{L} + D_{w}^{H} - W_{w} - W_{w}^{T} } & { - W_{w} } \\ { - W_{w}^{T} } & {D_{w}^{L} + 2D_{w}^{H} - W_{w} - W_{w}^{T} } \\ \end{array} } \right] \), \( {\text{A}}_{b} = \left[ {\begin{array}{*{20}c} {2D_{b}^{L} + D_{b}^{H} - W_{b} - W_{b}^{T} } & { - W_{b} } \\ { - W_{b}^{T} } & {D_{b}^{L} + 2D_{b}^{H} - W_{b} - W_{b}^{T} } \\ \end{array} } \right] \).
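The block matrices \( A_{w} \) and \( A_{b} \) can be assembled mechanically from the corresponding weight matrices; a short sketch:

```python
import numpy as np

def assemble_graph_matrix(W):
    """Build the block matrix A_w (or A_b) of Eqs. (13)-(14) from the
    weight matrix W of the corresponding graph."""
    D_L = np.diag(W.sum(axis=1))   # row-sum degree matrix, D^L
    D_H = np.diag(W.sum(axis=0))   # column-sum degree matrix, D^H
    S = W + W.T
    return np.block([[2 * D_L + D_H - S, -W],
                     [-W.T, D_L + 2 * D_H - S]])

# Usage: A_w = assemble_graph_matrix(W_w); A_b = assemble_graph_matrix(W_b)
```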

Identification Information as Regularization Term:

The identification information assigns each face image to one of the subjects, which encourages the algorithm to learn projection matrices that map each face image into its own cluster. In this paper, we exploit the identification information by minimizing the within-class scatter. In learning the projection matrices \( P_{L} \) and \( P_{H} \), we aim to solve the following optimization problem:

$$ \mathop {\hbox{min} }\limits_{{P_{L} ,P_{H} }} S_{W} $$
(15)

where \( S_{W} \) is the within-class scatter matrix. Assuming the overall mean of the training data is zero, the scatter matrix is defined as:

$$ S_{W} = \sum\nolimits_{i} {(x_{i} - \mu_{i,c} )(x_{i} - \mu_{i,c} )^{T} } $$
(16)

where \( x_{i} \) is the n-dimensional feature vector obtained by projecting an HR or LR face image into the new common space, and \( \mu_{i,c} \) is the mean of the projected features of class c, to which \( x_{i} \) belongs. With some linear algebra, Eq. (16) can be rewritten in the following matrix form:

$$ S_{W} = \left( {X - U} \right)\left( {X - U} \right)^{T} $$
(17)

where U is the \( n \times 2N_{t} \) mean matrix whose i-th column is \( \mu_{i,c} \), and X is the \( n \times 2N_{t} \) data matrix whose i-th column is \( x_{i} \). Let \( \varLambda \) be a \( C \times C \) diagonal matrix whose i-th diagonal element \( \varLambda_{i} \) is the number of training samples in class i. These matrices can be expressed in terms of \( P_{L} \) and \( P_{H} \) as:

$$ U = P^{T} ZD\Lambda ^{ - 1} D^{T} $$
(18)
$$ X = P^{T} Z $$
(19)

where \( P = \left[ {\begin{array}{*{20}c} {P_{L} } \\ {P_{H} } \\ \end{array} } \right] \), \( Z = \left[ {\begin{array}{*{20}c} L & 0 \\ 0 & H \\ \end{array} } \right] \) and \( D = \left\{ {d_{ij} } \right\}_{{2N_{t} \times C}} \) with

$$ d_{ij} = \left\{ \begin{aligned} & 1,\; if\; x_{i} \, \in \,class\;j \\ & 0,\;if \;x_{i} \, \notin \,class \;j \\ \end{aligned} \right. $$
(20)

With (18) and (19), Eq. (17) can be rewritten as:

$$ S_{W} = P^{T} Z(I - D\Lambda ^{ - 1} D^{T} )(I - D\Lambda ^{ - 1} D^{T} )^{T} Z^{T} P $$
(21)
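Since Eq. (21) depends on P only through the outer factors, the data-space matrix \( Z(I - D\Lambda ^{ - 1} D^{T} )(I - D\Lambda ^{ - 1} D^{T} )^{T} Z^{T} \) can be precomputed once. A sketch (names illustrative):

```python
import numpy as np

def within_class_regularizer(Z, labels):
    """Data-space form of the S_W regularizer (Eq. 21), ready to plug into
    the generalized eigenproblem of Eq. (23). `labels` has length 2*Nt,
    repeating each subject label for the LR and HR copies of the data."""
    classes = np.unique(labels)
    D = (labels[:, None] == classes[None, :]).astype(float)  # 2Nt x C, Eq. (20)
    Lam_inv = np.diag(1.0 / D.sum(axis=0))   # Λ^{-1}; Λ_i = samples in class i
    J = np.eye(len(labels)) - D @ Lam_inv @ D.T  # per-class centering matrix
    return Z @ J @ J.T @ Z.T
```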

In this paper, the identification information is used as a regularization term. This is the main difference between our proposed algorithm and CMFA [18], where the identity matrix is used as the regularization term in the denominator; the identification term is a key factor in the performance improvement. Finally, the optimization problem with objective functions (13) and (14) reduces to

$$ \mathop {\hbox{max} }\limits_{{P_{L} ,P_{H} }} \frac{{Tr(P^{T} ZA_{b} Z^{T} P)}}{{Tr(P^{T} ZA_{w} Z^{T} P + \xi S_{W} )}} $$
(22)

where \( \xi \) is a balance factor between the verification and identification information. In the experiments below, this factor is set to 0.05.

The coupled projection matrices \( P_{L} \) and \( P_{H} \) that maximize the objective function (22) can be obtained by solving the generalized eigenvalue problem

$$ \left( {ZA_{b} Z^{T} } \right)P = \lambda \left( {ZA_{w} Z^{T} + \xi Z(I - D\Lambda ^{ - 1} D^{T} )(I - D\Lambda ^{ - 1} D^{T} )^{T} Z^{T} } \right)P $$
(23)
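Eq. (23) is a symmetric generalized eigenvalue problem that standard solvers handle directly. A sketch using scipy.linalg.eigh; the small ridge term eps is our addition for numerical stability, not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh

def train_lmcm(Z, A_w, A_b, R_w, n_dims, xi=0.05, eps=1e-6):
    """Solve Eq. (23); the top eigenvectors maximize the trace ratio of
    Eq. (22). R_w is the data-space regularizer from Eq. (21)."""
    Sb = Z @ A_b @ Z.T
    Sw = Z @ A_w @ Z.T + xi * R_w
    Sw += eps * np.eye(Sw.shape[0])      # guard against singularity
    eigvals, eigvecs = eigh(Sb, Sw)      # generalized symmetric solver
    P = eigvecs[:, ::-1][:, :n_dims]     # largest eigenvalues first
    return P                             # split into P_L (top m rows), P_H (rest)
```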

After obtaining the projection matrices \( P_{L} \) and \( P_{H} \), we map both the LR and HR images into the common space and use the Euclidean distance to compare each probe-gallery pair, as described in (24).

$$ d_{ij} = \left\| {P_{L}^{T} l_{i} - P_{H}^{T} h_{j} } \right\|_{2}^{2} $$
(24)

Each probe image is assigned the identity of the gallery subject with the smallest distance. We use the True Positive Identification Rate (TPIR), also referred to as the Rank-1 Identification Rate in this setting, to measure the performance of our method, defined as follows

$$ TPIR = \frac{{\# \left( {correctly\;identified\;probe\;images} \right)}}{{\# \left( {probe\;images} \right)}} $$
(25)
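For reference, a sketch of the matching and evaluation steps of Eqs. (24)–(25); all names are illustrative.

```python
import numpy as np

def rank1_identify(P_L, P_H, probes_lr, gallery_hr, gallery_ids):
    """Project both sets into the common space (Eq. 24) and assign each
    probe the identity of its nearest gallery image."""
    Q = P_L.T @ probes_lr          # n x N_p projected LR probes
    G = P_H.T @ gallery_hr         # n x N_g projected HR gallery
    # Pairwise squared Euclidean distances between probes and gallery
    d = ((Q[:, :, None] - G[:, None, :]) ** 2).sum(axis=0)   # N_p x N_g
    return gallery_ids[np.argmin(d, axis=1)]

def tpir(predicted_ids, true_ids):
    """True Positive Identification Rate at rank 1 (Eq. 25)."""
    return np.mean(predicted_ids == true_ids)
```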

4 Experimental Results

To evaluate the effectiveness of the proposed method, we apply it to two public databases: FERET [6] and SCface [15]. Performance is measured by the rank-1 identification rate. Before projection, the gray-level pixel values of each image are normalized to zero mean, unit standard deviation, and unit norm.
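A sketch of this preprocessing under one reading of the description (standardize, then scale to unit norm; after the final scaling the standard deviation is of course no longer exactly 1):

```python
import numpy as np

def normalize_image(x):
    """Zero-mean, unit-variance standardization followed by unit-norm scaling."""
    x = (x - x.mean()) / x.std()
    return x / np.linalg.norm(x)
```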

4.1 Experimental Result on FERET Database

We follow the same test protocol as [17] in our experiments on a subset of the FERET database. The subset (ba, bd, be, bf, bg, bj, bk) contains 200 subjects with variations in illumination (bk), expression (bj), and pose (bd, be, bf, bg). We choose 50 subjects for training, and the remaining 150 subjects are used for testing. In the test phase, 4 images of each subject are selected as the gallery and the rest as probes. In the experiment, the HR face images and the corresponding LR ones are scaled to resolutions of 32 × 32 and 8 × 8, respectively. Figure 2 shows some of the HR (top row) and LR (bottom row) face images from the FERET database. To evaluate our proposed LMCM algorithm, we compare it with CLPM [12], SDA [19], CMFA [18], and the algorithm proposed in [17].

Fig. 2. HR (top row) and LR (bottom row) face images from the FERET database

Table 1 presents the experimental results of the LMCM algorithm on the FERET database. Our method with 53-D features achieves a recognition rate of 90.00 %, compared with 55.22 % for CLPM, 72.09 % for SDA, 75.98 % for CMFA, and 80.90 % for the coupled mapping method of [17]. The main reason is that our method makes better use of the supervised information of the training set. There are two main differences between CMFA and our proposed algorithm. First, we construct the weight matrices \( W_{w} \) and \( W_{b} \) in a different way, which captures more discriminant information than the method applied in CMFA. Second, we use the within-class scatter as the regularization term instead of the identity matrix, which exploits the identification information in the training data. Our proposed LMCM algorithm also shows a high capability to handle variations such as pose and expression in addition to low resolution. Table 2 reports the test time for each image pair.

Table 1. Rank 1 performance on FERET database. The values are rank-1 identification rate (%)
Table 2. Test time for each LR and HR image pair

4.2 Experimental Result on SCface Database

To assess recognition performance under surveillance conditions, we also evaluate LMCM on the SCface database, a database of static images of human faces [15] captured by surveillance cameras. Images were taken in an uncontrolled indoor environment using five video surveillance cameras at three different distances. The database contains 4,160 face images (in the visible and infrared spectrum) of 130 subjects, as shown in Fig. 3. Face images from different cameras and distances mimic real-world conditions. The subset used contains images from surveillance cameras cam1–cam5 at (I) a distance of 2.6 m (i.e., LR) and (II) a distance of 1.0 m (i.e., HR). The resolutions of the processed images are 48 × 48 and 16 × 16 for the HR and LR images, respectively.

Fig. 3. Examples of face images of one subject captured by one camera at three different distances

For this experiment, the protocol of [17] is implemented, and LMCM is compared with CLPM, SDA, CMFA, and the coupled mapping method of [17]. For the SCface database, 80 subjects are randomly selected to form the training set, and the remaining 50 subjects are used as the test set. This procedure is repeated 10 times, and the average results are presented in Table 3. Overall, the rank-1 recognition rates are much lower than on the FERET database due to the real-world challenges posed by the SCface database. The results show that our proposed LMCM algorithm improves LR face recognition significantly on SCface. The main reason is that LMCM learns the discriminant information between HR and LR face images, forcing a margin between the projections of identical and different subjects according to the recognition information. Compared to the other algorithms in Table 3, our proposed algorithm can clearly capture more such discriminant features for LR face recognition (Table 4).

Table 3. Experiment on SCface. The values are rank-1 identification rate (%)
Table 4. Test time for each LR and HR image pair

5 Conclusion

In this paper, we propose a novel algorithm to solve the low-resolution face recognition problem without an explicit SR procedure. Our method projects both the HR and LR face images into a new common feature subspace by maximizing the distance between features with different labels and minimizing the distance between features with identical labels. The objective function forces a margin between different subjects using both the identification and the verification information. Experimental results on the FERET and SCface databases show that our proposed method achieves promising performance. In future work, we will study applying nonlinear mappings via kernel methods and using more discriminative features instead of raw intensities.