
1 Introduction

When a person appears before a camera, an automated face recognition system can accurately tell who that person is, which moves our society further towards intelligent automation. The rapid growth of requirements in different application scenarios, such as public security, finance, and personal safety, brings great challenges to face recognition. With the development of imaging techniques, image data has become easier to acquire and face images can be collected from many kinds of sources, which has led to research on cross-domain face recognition. In this paper we mainly discuss heterogeneous face recognition tasks, since they are the more difficult cross-domain problems: sketches are often used to match a suspect in forensic applications; near infrared face images help eliminate the impact of ambient light; 3D images model the real three-dimensional face structure and capture the surface shape; thermal infrared face recognition is not yet mature but has great potential and can also be used in medical treatment; low resolution recognition is useful for identifying a person in surveillance video.

Face images from different domains follow different statistical distributions, so traditional methods that simply compute the distance between the features of two face images no longer work. The distance between images of the same person in different domains (intra-class, cross-domain) is often larger than the distance between images of different persons in the same domain (inter-class). If features of different modalities are forcibly pulled closer to each other, the data space becomes distorted, which causes over-fitting. Three ideas for reducing the cross-modality differences recur throughout the related literature: exploring invariant face representations by designing universal face features; synthesizing one modality from the corresponding other modality; and projecting images from multiple domains into a common subspace with subspace learning methods. Because the imaging techniques differ, the recognition methods differ in detail across heterogeneous face recognition tasks. The rest of this paper is organized as follows: Sect. 2 introduces the existing heterogeneous face databases, which contain both visible light (VIS) images and face images from other modalities; Sect. 3 lists and summarizes established heterogeneous face recognition methods; Sect. 4 compares and analyzes the experimental results; and Sect. 5 discusses some challenging problems and future research directions.

Fig. 1. Heterogeneous face image pairs. The images in the first row are a sketch, a near infrared image, a thermal infrared image, a 3D image and a low resolution image, respectively; the second row shows the corresponding visible light face images.

2 Database

Heterogeneous face image databases provide intuitive, first-hand data for heterogeneous face recognition systems. They are used both for general research and as standard test sets. In general, these databases are small in scale, because collecting them requires specialized professional imaging equipment or manual work.

2.1 Sketch-VIS Face Database

Sketch databases were originally built for research on criminal investigation. However, real forensic sketches are hard to collect and their number is limited. To obtain more images for research, four kinds of sketch databases are commonly used: forensic sketch databases, in which sketches are drawn by artists from eyewitnesses' descriptions, which are in turn based on their recollection of the suspects; viewed sketch databases, in which sketches are drawn by artists while viewing the corresponding digital photos; semi-forensic sketch databases, in which sketches are drawn from the sketch artist's own memory rather than from an eyewitness description; and composite sketch databases, in which sketches are composed from candidate facial components using software kits. Composite sketches are cheaper and easier to acquire than forensic sketches.

  • CUHK Face Sketch Database (CUFS) [viewed]

    CUFS includes a total of 606 frontal-view faces and the corresponding sketches drawn by artists. The face images are collected under normal lighting conditions and with a neutral expression.

  • CUHK Face Sketch FERET Database (CUFSF) [viewed]

    CUFSF includes 1,194 photo-sketch pairs in total and is the largest available sketch database. The photos are of 1,194 persons from the FERET database and contain lighting variations; the sketches are drawn by artists while viewing the photos and contain shape exaggeration.

  • IIIT-Delhi Semi-forensic Sketch Database [semi-forensic]

    In total, 140 photos from the IIIT-D Viewed Sketch Database are used to build the semi-forensic sketch database.

  • Forensic Sketch Database [forensic]

    This database is collected by the Image Analysis and Biometrics Lab. It contains 190 pairs of forensic sketches and digital images.

  • PRIP-VSGC Database [composite]

    The PRIP-VSGC Database is a software-generated sketch database. There are 123 subjects from the AR database, each with one digital image and 3 composite images using two different kits.

2.2 NIR-VIS Face Database

Because the near infrared (NIR) illumination is stronger than the visible light, the gray values of face images captured under NIR lighting depend only on the distance between the object and the NIR light source, which is an excellent property for face recognition.

  • CASIA HFB Database

    This database collects face images from 3 modalities: visual, near infrared and three-dimensional (3D) face images. It includes 57 female and 43 male subjects, each with 4 VIS and 4 NIR face images.

  • CASIA NIR-VIS 2.0 Database

    This is the largest existing NIR-VIS database. It consists of NIR and VIS images from 725 subjects in total, each with 1–22 VIS and 5–50 NIR face images. Two views are defined so that unbiased performance can be reported. The database also has standard protocols for performance evaluation and reporting.

2.3 Thermal-VIS Face Database

Thermal infrared images are captured by thermal infrared imaging systems and are formed by the specific heat pattern of each individual. Because they do not rely on ambient light, they lower the intra-class variability and help improve the recognition rate. Thermal infrared images can be captured at night and are very useful for crime investigation.

  • IRIS Thermal/Visible Face Database (OTCBVS Benchmark Dataset Collection)

    This database contains both thermal and visible face images and includes pose, expression and illumination variations. There are 30 identities in the database and 4228 pairs of thermal/visible face images are generated.

  • Natural Visible and Infrared Facial Expression Database (NVIE Database)

    The NVIE database collects images recorded simultaneously by a visible and a thermal infrared camera from more than 100 subjects. Variations in expression, illumination and wearing glasses are included.

  • PUCV-Visible Thermal-Face (VTF) Database

    A total of 12160 images in both visible and thermal spectra from 76 individuals are collected in the PUCV-VTF database. Images from two modalities are acquired simultaneously. There are variations in wearing glasses and facial expressions such as frown, smile, and vowel.

  • Other Thermal Face Recognition Database

    The database collected in [4] has 385 participants and multiple collection sessions. Wilder et al. [5] acquire a thermal database of 101 subjects. A set of 3244 co-registered visual and LWIR face images from 90 individuals is collected by the Equinox Company [6]. A database acquired simultaneously in the visible, near-infrared and thermal spectra is published in [7]; it supports research on multiple modalities and gives a more general perspective on face recognition.

2.4 3D-2D Face Database

Besides being captured in different spectral bands, faces can also be recorded as measurements of 3D facial shape. 3D face data copes well with illumination and pose variations.

  • UHDB11 Database

    The UHDB11 database is collected for 3D-2D face recognition and consists of samples from 23 individuals. It is captured under different illumination conditions and head poses, which, together with the 3D-2D gap, make it very challenging.

  • Texas 3D Face Recognition Database (Texas 3DFRD)

    This database is a collection of 1149 pairs of high-resolution facial color and range images of 105 human subjects. The image pairs are captured simultaneously. In addition, facial points are manually located in both the 2D and 3D images, which can help face recognition.

  • FRGCv2.0 database

    To the best of our knowledge, the FRGCv2.0 database is the largest 3D-2D face database. It consists of 4007 3D-2D face image pairs from 466 individuals and includes pose and expression variations.

3 Methods

3.1 Sketch-VIS Face Recognition

To address the sketch recognition problem, researchers first proposed photo/sketch synthesis. It is the earliest family of methods: once images from different modalities are synthesized into the same modality, traditional face recognition methods can be applied. Face synthesis methods have followed two overall trends: the assumed relation between the two modalities has moved from linear [8], to locally linear [9], to nonlinear [1]; and the object being modeled has moved from raw pixels [10] to their statistical distribution in a subspace [11]. Linear methods apply one transformation globally and may not work well if the hair region is included. Patch-based methods were therefore proposed to approximate the nonlinear mapping, but they ignore the spatial relationship between patches. To reduce blurring and aliasing effects, embedded hidden Markov models have been applied to the problem; they approximate the nonlinear relationship well and achieve better performance. Since synthesis is a more complicated task than recognition, some researchers study heterogeneous face recognition methods that omit the synthesis step. Such methods project the two image modalities onto an intermediate modality and conduct recognition in the common subspace [12]. Another breakthrough in heterogeneous face recognition lies in the feature extraction stage, where the key is to find modality-invariant features [2]. Near infrared face images and sketches intuitively share many similarities with normal visible face images even though they differ at the pixel level, and both are two-dimensional data. Some methods are therefore applicable to both sketch and near infrared face recognition, and some articles indeed report experiments with their proposed methods on both modalities.
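The locally linear, patch-based idea mentioned above can be illustrated with a minimal sketch. It assumes that photo and sketch patches are flattened into vectors and stored in aligned training arrays; the function names `lle_weights` and `synthesize_patch`, the neighbor count `k` and the regularization constant `reg` are illustrative assumptions and are not taken from [9].

```python
import numpy as np

def lle_weights(query, neighbors, reg=1e-3):
    """Weights that reconstruct `query` from `neighbors` and sum to one."""
    diff = neighbors - query                      # (K, D) offsets from the query patch
    gram = diff @ diff.T                          # (K, K) local Gram matrix
    # Regularize so the system stays solvable when neighbors are nearly collinear.
    gram = gram + reg * (np.trace(gram) + 1e-12) * np.eye(len(neighbors))
    w = np.linalg.solve(gram, np.ones(len(neighbors)))
    return w / w.sum()

def synthesize_patch(photo_patch, train_photo, train_sketch, k=5):
    """Estimate the sketch patch that corresponds to one photo patch.

    train_photo[i] and train_sketch[i] are assumed to be aligned patch pairs.
    """
    dist = np.linalg.norm(train_photo - photo_patch, axis=1)
    idx = np.argsort(dist)[:k]                    # K nearest training photo patches
    w = lle_weights(photo_patch, train_photo[idx])
    return w @ train_sketch[idx]                  # reuse the weights in the sketch domain
```

The weights are computed in the photo domain and reused in the sketch domain, which is the sense in which the mapping is only locally linear. A full synthesis pipeline would also need patch extraction and blending of overlapping patches, which this sketch omits.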

3.2 NIR-VIS Face Recognition

Near infrared face images are mainly used to deal with illumination variations [38]. Because the imaging technologies differ, the distributions of NIR and VIS face images in the high-dimensional space are inconsistent, so general face recognition methods are not suitable. The three common approaches to heterogeneous face recognition are applied in the NIR-VIS domain as follows. Synthesis-based methods are employed in [9, 33] to transform one face modality into the other: Tang et al. [33] propose an Eigen-transformation method, while Liu et al. [9] reconstruct image patches using LLE. A recent work by Felix et al. [39] proposes an \(\ell _{0}\)-dictionary based approach to reconstruct the corresponding image modalities and achieves good performance. Common subspace methods are used in [31, 34], which employ LDA and TCA (transfer component analysis), respectively. Modality-invariant features are often based on SIFT or LBP. The development of deep learning has driven heterogeneous face recognition algorithms further. Several unsupervised deep learning methods have been applied to this topic: J. Ngiam et al. [30] propose a Bimodal Deep AE method based on the denoising autoencoder; a multi-modal DBM approach is suggested in [32] to exploit the potential of all layers; and an RBM method combined with removed PCA features is proposed in [35]. With these methods, the matching accuracy for heterogeneous face images has improved gradually, but it is still far below state-of-the-art VIS face recognition rates. Recently, a unified CNN framework was proposed in [29], which integrates deep representation transfer and the triplet loss to obtain consolidated feature representations for face images in the two modalities. It alleviates the over-fitting problem of CNNs on small-scale datasets and achieves strong performance on the largest publicly available NIR-VIS database, the CASIA NIR-VIS 2.0 Database.
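For readers unfamiliar with the triplet loss, the following minimal sketch shows its standard form on precomputed embeddings. It is not the exact formulation of [29], which is not reproduced here; the function name, the squared Euclidean distance and the `margin` value are assumptions made for illustration. In the NIR-VIS setting one would typically take the anchor from one modality and the positive (same identity) and negative (different identity) from the other.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge loss over a batch of (anchor, positive, negative) embeddings.

    Each argument is an (N, D) array; row i of each array forms one triplet.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)   # same identity, other modality
    d_neg = np.sum((anchor - negative) ** 2, axis=1)   # different identity
    # A triplet contributes only when the negative is not at least `margin` farther away.
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```

The loss is zero once every same-identity pair is closer than the corresponding different-identity pair by at least the margin, which pulls the two modalities together per identity rather than globally.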

3.3 Thermal-VIS Face Recognition

Unlike visible imaging, which receives reflected light, thermal imaging receives emitted radiation, which causes a large cross-modal gap [16]. The emitted radiation is affected by many factors, typically including time lapse, physical exercise and mental tasks, which makes the recognition task even more challenging. There are already some studies on within-domain thermal infrared face recognition, but few address cross-modal thermal-visible face recognition. The first work on thermal-to-visible face recognition is [13], which relies on partial least squares-discriminant analysis (PLS-DA). An MWIR-to-visible face recognition system consisting of preprocessing, feature extraction (HOG, SIFT, LBP) and similarity matching is proposed in [14]. Klare and Jain [15] propose a nonlinear kernel prototype representation for both thermal and visible light images and use LDA to improve the discriminative capability of the prototype representations. The authors of [13] improve their method by incorporating multi-modal information into the PLS model building stage and by designing different preprocessing and feature extraction stages to reduce the modality gap [16]. Recently, a graphical representation method that employs Markov networks and considers the spatial compatibility between neighboring image patches was proposed in [17]; it achieves excellent performance on multiple heterogeneous face modalities, including the thermal-visible scenario.
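As an illustration of the feature extraction and similarity matching stages mentioned above, the sketch below computes a basic 3x3 LBP histogram and compares two histograms with cosine similarity. It is a simplified stand-in, not the pipeline of [14]: the function names are hypothetical, the image is assumed to be a 2D grayscale array, and practical systems usually compute histograms per block after preprocessing.

```python
import numpy as np

def lbp_histogram(img):
    """Normalized 256-bin histogram of basic 3x3 LBP codes of a 2D grayscale image."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.int64)
    # Eight neighbors, visited clockwise starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbor >= center).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because each code depends only on the ordering of a pixel and its neighbors, the histogram is unchanged by any increasing intensity transform, which is one reason LBP-style descriptors are popular when the two modalities differ in brightness and contrast.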

3.4 3D-2D Face Recognition

Compared with 2D images, 3D images are robust to illumination and pose variations. Face recognition that uses 3D data is more accurate than recognition based on visible light images alone, and 3D-2D matching is more practical than 3D-3D matching. Early work focuses more on 3D-aided face recognition or 3D image reconstruction, and little work explores cross-modal 3D-2D face recognition. A Partial Principal Component Analysis based method is used in [18] to extract features in the two modalities and reduce the feature dimension. Kernel Canonical Correlation Analysis (CCA) is employed in [20] to maximize the feature correlation between patches in 2D texture images and 3D depth images. A fusion scheme based on Partial Least Squares (PLS) and CCA is suggested in [21] to further improve performance by learning the correlation mapping between 2D and 3D. Riccio et al. [19] propose to calculate geometrical invariants from several control points; here, accurately locating the fiducial points in both the 2D and 3D modalities is itself a challenging problem. Di Huang et al. [9] present a biological vision-based facial description method, Oriented Gradient Maps (OGMs), which simulates the response of complex neurons to gradient information in a pre-defined neighborhood and hence describes local texture changes of 2D faces and local shape variations of 3D face models. Recently, a 3D-2D face recognition framework was proposed in [23]. It uses a bidirectional relighting method for non-linear, local illumination normalization and a global orientation-based correlation metric for pairwise similarity scoring, and generalizes well to diverse illumination conditions.
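The correlation-maximizing projection behind CCA-based methods can be sketched as follows. Note that this is plain linear CCA, not the kernel CCA of [20], and that the function name `fit_cca`, the regularization constant `reg` and the return layout are assumptions made for illustration. Rows of `X` and `Y` are assumed to be paired feature vectors of the same faces in the two modalities (for example 2D texture and 3D depth patches).

```python
import numpy as np

def fit_cca(X, Y, n_components=2, reg=1e-6):
    """Linear CCA on paired samples X (N, Dx) and Y (N, Dy).

    Returns the two means, the two projection matrices and the
    leading canonical correlations.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularized covariance and cross-covariance matrices.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # The SVD of the whitened cross-covariance yields the canonical directions.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    Wx = Kx @ U[:, :n_components]
    Wy = Ky @ Vt.T[:, :n_components]
    return mx, my, Wx, Wy, s[:n_components]
```

After fitting, a probe from one modality is projected with `(x - mx) @ Wx` and a gallery sample from the other with `(y - my) @ Wy`, and matching is carried out between the projected vectors in the common subspace.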

3.5 High-Low Resolution Face Recognition

HR-LR face recognition is mainly used in video surveillance, where the face images captured in surveillance footage are usually of very low resolution. To match the detected faces with enrolled high-resolution face images, a large cross-modal gap must be bridged. Existing studies try to improve the recognition rate of an LR image by reconstructing the HR face, but direct matching between faces in the two modalities is less studied. Since reconstruction can help asymmetric face recognition, we introduce some super-resolution (SR) reconstruction algorithms here. Two classes of methods are usually used [24]: multi-frame SR, which retrieves high-frequency details from a set of images, and single-frame SR, which infers the HR counterpart of a single image using extra information from training samples. For heterogeneous HR-LR face recognition specifically, subspace projection methods, which measure similarity across resolutions efficiently, play an important role, for example [25, 26]. A recent study [27] proposes a local optimization based coupled mappings algorithm that enforces LR/HR consistency, intra-class compactness and inter-class separability. The most recent work to date [28] employs a coupled kernel-based enhanced discriminant analysis (CKEDA) method to maximize the discriminability of the projected common space. Its effectiveness is demonstrated on a public face database, where the LR face images are obtained by downsampling, as in other HR/LR face experiments.
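Since LR probes in such experiments are typically simulated by downsampling HR images, a minimal sketch of that step is given below. Block averaging is only one possible choice; the source does not state which downsampling kernel the cited works use, and the function name and the cropping behavior are assumptions.

```python
import numpy as np

def downsample(img, factor):
    """Simulate an LR image by averaging non-overlapping factor x factor blocks."""
    h, w = img.shape
    h, w = h - h % factor, w - w % factor          # crop so both sides divide evenly
    blocks = img[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))
```

Averaging acts as a simple anti-aliasing filter before decimation. LR images simulated this way are cleaner than real surveillance footage, which is worth keeping in mind when interpreting results obtained on downsampled data.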

Table 1. Rank-1 recognition rates and VR@FAR=0.1 % of various methods on CUFS [19].
Table 2. Rank-1 recognition rates and VR@FAR=0.1 % of various methods on the CASIA NIR-VIS 2.0 Face Database [29].

4 Results

To evaluate heterogeneous face recognition performance, the rank-1 recognition rate and the verification rate (VR) at a given false accept rate (FAR) are usually computed. Because many factors affect the final results, methods are not comparable if their experiments are run on different datasets or with different experimental protocols. Sketch and near infrared face images were studied early and extensively, so widely used datasets and standard protocols exist for them; the corresponding recognition results are listed in Tables 1 and 2. The recognition rate for sketch-VIS faces reaches nearly 100 percent on the CUFS database, but there is still room for improvement in real-world forensic sketch recognition. The highest rank-1 recognition rate and verification rate on the largest publicly available NIR-VIS database, the CASIA NIR-VIS 2.0 Database, are achieved by [29], as shown in Table 2. For the other domains mentioned above, comparison across methods is difficult because the experimental settings are not unified. There is a trend that patch-based methods outperform global methods and non-linear methods outperform linear methods, and combining related methods usually yields higher accuracy.
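The two evaluation measures can be made concrete with a short sketch. It assumes a probe-by-gallery similarity matrix in which higher scores mean more similar faces; the function names and the conservative tie handling at the threshold are assumptions, and individual benchmark protocols may define the threshold slightly differently.

```python
import numpy as np

def rank1_rate(scores, probe_ids, gallery_ids):
    """Fraction of probes whose best-scoring gallery entry has the same identity.

    scores[i, j] is the similarity between probe i and gallery entry j.
    """
    best = np.argmax(scores, axis=1)
    return float(np.mean(np.asarray(gallery_ids)[best] == np.asarray(probe_ids)))

def vr_at_far(scores, probe_ids, gallery_ids, far=0.001):
    """Verification rate at a threshold whose false accept rate is at most `far`."""
    same = np.asarray(probe_ids)[:, None] == np.asarray(gallery_ids)[None, :]
    genuine, impostor = scores[same], scores[~same]
    imp_sorted = np.sort(impostor)[::-1]           # impostor scores, highest first
    k = int(np.floor(far * impostor.size))         # impostor pairs we may accept
    threshold = imp_sorted[k]                      # accept only scores strictly above it
    return float(np.mean(genuine > threshold))
```

With `far=0.001` this corresponds to the VR@FAR=0.1 % reported in Tables 1 and 2. Rank-1 is an identification measure over a closed gallery, whereas VR@FAR is a verification measure over all probe-gallery pairs, so the two numbers need not move together.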

5 Discussion

A good face recognition algorithm should be able to handle face images coming from different image sources or even different modalities, and should adapt to different scenes. Because heterogeneous modalities and scenes have different characteristics, most existing algorithms handle each kind of data separately, although a few methods try to handle two or three modalities. Humans, however, can recognize an identity quickly whether it appears in a sketch, a near infrared image or a long-distance surveillance scene. Neurologists and psychologists have been studying this phenomenon and seek the recognition mechanism for face images that have a low correlation with the phenomenological information. For the problem of limited data, lifelong learning, which can learn from one or two inputs even without labels and retain previously learned knowledge, might be a future solution. The challenges in liveness detection could also be addressed if two or more modalities are combined and well fused for recognition, which deserves further study.