1 Introduction

As a newly emerged biometric application, cross-modality face matching [17], also called heterogeneous face recognition (HFR), has attracted much attention over the last decade for its wide range of uses in surveillance systems. Cross-modality face matching involves matching face images across image modalities, such as infrared images to visible light images, sketches to photos, and 3D range images to 2D photographs. However, the performance of conventional face recognition algorithms degrades considerably due to the appearance differences between cross-modality images. To address this issue, a number of HFR methods [4, 6, 13, 17, 26, 27] have been developed to solve the cross-modality matching problem. These methods generally fall into three categories: 1) homogeneous image synthesis [6, 39, 40], 2) common subspace learning [13, 19, 20, 27, 34], and 3) modality-invariant feature extraction [12, 16, 18]. Homogeneous image synthesis based methods generate pseudo-homogeneous images, so that the cross-modality matching problem can be solved with existing FR algorithms. Common subspace learning based methods try to learn a coupled common subspace in which cross-modality data points are more comparable than in their original representations. Modality-invariant feature extraction based methods address the cross-modality FR problem by designing an effective invariant descriptor, reducing the appearance differences at the feature representation stage.

Most of these existing methods try to solve the problem by reducing the cross-modality feature gap, without considering the similarity measure between heterogeneous features. Recently, extreme learning machines (ELMs) [10, 11], with their high learning efficiency in feature classification, have attracted increasing attention from researchers worldwide. ELM algorithms exhibit good generalization performance in many real applications. However, very little work that considers both the feature representation and the similarity measure has been reported in the HFR research community. In this paper, we propose a new ELM ensemble based approach for cross-modality face matching. Our framework consists of two stages. In the first stage, we handle the cross-modality feature representation in a data-driven way, namely, the feature descriptor is optimally learned from the two modalities at the image pixel level. In the second stage, a voting based ELM is employed as the classifier for cross-modality face recognition.

The remainder of this paper is organized as follows. Section 2 reviews related work and Section 3 describes the new ELM ensemble based approach. Experimental results and discussions on two different heterogeneous face databases are presented in Section 4. Section 5 concludes the paper.

2 Related work

2.1 Cross-modality face matching

Previous work on cross-modality face matching can be grouped into three categories: 1) homogeneous image synthesis, 2) common subspace learning, and 3) invariant feature extraction. Most of these approaches can be organized into two steps, namely, cross-modality feature representation and the follow-up classification.

Typical synthesis methods represent the data in one of the two modalities; the synthesized data can then be compared directly within that modality. For instance, Tang and Wang proposed an eigen-transformation method that synthesized pseudo sketch images from the training photo sets [39] and a photo-sketch transformation method using a multiscale Markov Random Fields (MRF) model [40]. Liu et al. [29] proposed to generate sketches from photographs using a local linear embedding method. Gao et al. [6] utilized the embedded hidden Markov model (E-HMM) to learn the nonlinear relationship between a sketch and its corresponding photo. However, most synthesis methods are “task specific”: they are usually designed for two fixed modalities and do not generalize well when the task changes.

The second category, common subspace learning methods, represents the feature points by projecting them into a common discriminant subspace [19, 20, 27, 34]. Subspace learning approaches, such as Canonical Correlation Analysis (CCA) [7, 8] and Partial Least Squares (PLS) [41], have proven to be effective tools in cross-modality tasks [23, 34, 36]. In Ref. [27], Lin and Tang proposed to solve the inter-modality problem using Common Discriminant Feature Extraction (CDFE), which formulated the learning objective by incorporating both discriminative ability and local consistency. Lei and Li proposed the coupled spectral regression (CSR) [19], which models the properties of heterogeneous data separately by learning two associated projections. They later proposed Coupled Discriminant Analysis (CDA) [20], which incorporates a locality constraint in kernel space to improve generalization ability. Even though these approaches have shown good performance in HFR, they ignore the intuitive appearance differences at the feature level; if the cross-modality difference at the feature level is large, the discriminant power of subspace learning methods is greatly reduced.

Methods in the third category try to reduce the cross-modality gap at the feature extraction stage. Many local appearance descriptors, e.g. variants of Local Binary Patterns (LBP) [1], SIFT [31] and the Difference of Gaussians (DoG) filter [26], are utilized to represent cross-modality features. Klare et al. [18] proposed to extract SIFT and multiscale LBP features for forensic sketch and mug shot photo matching. Huang et al. [12] proposed to learn modality-invariant features (MIF) for HFR. Klare et al. [16] proposed a generic framework based on kernel prototype similarities, which introduces two filters and three different feature descriptors for feature extraction. Zhu et al. [48] proposed a feature representation method with three steps, namely, Log-DoG filtering, local encoding and uniform feature normalization. Li et al. [25] proposed to extract common features from cross-modality face images and applied the method to matching optical and infrared face images. Yi et al. [43] proposed to use a series of local RBMs to learn a shared representation of two different modalities. However, most of these local descriptors are pre-defined in a hand-crafted way and may not be optimal for capturing inter-modality variations.

2.2 Extreme learning machines (ELM)

In this subsection, we briefly review ELM and its applications in pattern classification [9, 11]. ELM was recently proposed for efficiently training single-hidden-layer feedforward neural networks (SLFNs), and it performs consistently with a much faster training speed [9]. The essence of ELM is that it performs classification by projecting the original data into a high-dimensional vector, turning the classification task into a multi-output functional regression problem [2].

With its high learning efficiency, ELM [10, 11] has attracted increasing attention in a wide range of applications, e.g. pattern classification, object recognition, and data analysis. Huang et al. [11] extended ELM to least squares SVM (LS-SVM) [37] and proximal SVM (PSVM) [5], and provided a unified solution for multi-class classification. Kasun et al. [15] proposed an ELM based Auto Encoder (ELM-AE) for Big Data applications. Cao et al. [3] proposed an improved ELM based method using the basic ELM and the OP-ELM and applied the algorithm to protein sequence classification. Later, researchers proposed the ensemble based ELM, or ELM ensemble [28], which connects multiple ELM networks in parallel and takes the average of the ELM outputs as the final prediction [9]. For example, Yang et al. [42] proposed a modified ensemble of extreme learning machines based on attractive and repulsive particle swarm optimization (ARPSO) to improve the convergence of the ensemble system. Zhang et al. [45] proposed a robust AdaBoost.RT based ensemble ELM (RAE-ELM), which combines ELM with a self-adaptive AdaBoost.RT algorithm to achieve better performance on regression problems.

Many ELM based approaches have been proposed for FR tasks. Zong and Huang [46] proposed an ELM based method for multi-label FR applications. Zong et al. [47] later proposed a kernelized ELM method for FR. Mohammed et al. [32] proposed a bidirectional 2DPCA and ELM framework using curvelet features. Long et al. [30] proposed a graph regularized discriminative non-negative matrix factorization (GDNMF), in which the projection matrix is learned jointly from the graph Laplacian and supervised label information.

However, these methods cannot be applied directly to cross-modality face recognition due to the appearance differences between modalities. Moreover, a single ELM can be improved upon to achieve better generalization performance [2, 14, 44]. In this paper, we propose a new ensemble ELM based approach, which is also a feature learning based ensemble ELM, for cross-modality face matching. Complete discriminative feature learning (CDFL) is used to extract the cross-modality facial features, and the voting based extreme learning machine (V-ELM) [2] is utilized to perform the final image classification. Compared to other neural network based HFR methods, the proposed method requires less computational time and obtains better accuracy.

3 The proposed ensemble ELM based approach

In this section, we first introduce the basic formulation and optimization of our Complete Discriminative Feature Learning (CDFL). Then, we explain how V-ELM is used for feature classification. Finally, the whole ensemble ELM based approach is summarized in Fig. 1, which illustrates the complete process of our approach for cross-modality face matching.

Fig. 1 The whole process of the new ensemble ELM based approach for cross-modality face matching. Part I is the feature learning phase and Part II is the face matching phase

3.1 Complete discriminative feature learning for feature representation

Given a p×q image M, let \( \mathcal {L}(M) \) denote the filtered image of M. Suppose the discriminative image filter vector is \({\ell } = \left [\ell _{1}, \ell _{2}, {\cdots } ,\ell _{k}\right ]^{T}\); the value of the filtered image at position n is

$$ \mathcal{L} (M)^{n} = \ell^{T}M^{n} $$
(1)

where \(M^{n}\) is the image patch centered at position n. Considering the LBP feature extraction process, the pixel difference vector (PDV) [21, 22] can be defined as \( d\mathcal {L}(M)^{n} = \left [\mathcal {L} (M)^{n_{1}} - \mathcal {L} (M)^{n},\mathcal {L} (M)^{n_{2}} - \mathcal {L}(M)^{n}, {\cdots } , \mathcal {L} (M)^{n_{L}} - \mathcal {L}(M)^{n}\right ]^{T} \), where \(\mathcal {L} (M)^{n}\) and \(\mathcal {L} (M)^{n_{d}}\) are the pixel values of the filtered image at the center n and at the neighbouring position \(n_{d}\), \(n_{d} \in \{n_{1},n_{2},\cdots,n_{L}\}\), and L is the number of neighbours. Under the linear assumption, it is natural to deduce that the PDV can be represented as:

$$ d\mathcal{L} (M)_{ij}^{n} = \ell^{T}d(M)_{ij}^{n} $$
(2)

where \(d(M)_{ij}^{n} \) is the n-th PDV of the j-th sample from the i-th class, \( d(M)_{ij}^{n} = \left [\left ((M)_{ij}^{n_{1}} - (M)_{ij}^{n}\right ),\left ((M)_{ij}^{n_{2}} - (M)_{ij}^{n}\right ), {\cdots } ,\left ((M)_{ij}^{n_{L}} - (M)_{ij}^{n}\right )\right ]^{T}\).
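To make the PDV construction concrete, the following minimal sketch (in Python with NumPy; the function name, window handling and data layout are our own illustrative choices, not part of the paper) extracts the PDV at every interior pixel of an image and stacks them into a pixel difference matrix:

```python
import numpy as np

def pixel_difference_vectors(image, s=3):
    """Minimal sketch of the PDV d(M)^n of Eq. (2) under an s x s window.

    For each interior pixel n, the PDV stacks the differences between the
    L = s*s - 1 neighbours and the centre pixel. Returns an (L, P) pixel
    difference matrix (PDM), one column per patch position.
    """
    h, w = image.shape
    r = s // 2
    pdvs = []
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = image[y - r:y + r + 1, x - r:x + r + 1].ravel()
            centre = image[y, x]
            # drop the centre element; keep neighbour-minus-centre differences
            neighbours = np.delete(patch, (s * s) // 2)
            pdvs.append(neighbours - centre)
    return np.asarray(pdvs, dtype=np.float64).T  # shape (L, P)
```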

The goal of complete discriminative feature learning (CDFL) is to find the optimal combined discriminative filter that makes the image PDVs of the same person similar across modalities, so that discriminant pixels are strengthened and indistinguishable ones are suppressed, which simplifies the mapping. The CDFL is defined as follows,

$$ \ell = \sum\limits_{t = 1}^{k} {U}_{g_{t}}V_{t} $$
(3)

where k is the number of discriminative filters and \(g_{t}\) is the t-th row vector of the filter graph. \(g = \left \{g_{1}, {\cdots } ,g_{t}, {\cdots } ,g_{k}\right \}\) contains M discriminative filters. Thus, U can be considered as the matrix composed of the discriminative image filters, and V contains the projection coefficients that can be treated as the weights of U.

3.1.1 Discriminative filters learning

The samples of the two sets are defined as \(M^{x} = \left [M_{1}^{x},M_{2}^{x}, {\cdots } ,M_{N_{x}}^{x}\right ]\) and \(M^{y} = \left [M_{1}^{y}, M_{2}^{y}, {\cdots } ,M_{N_{y}}^{y}\right ]\), where \(M^{x}\) and \(M^{y}\) indicate two different image modalities and \(N_{x}\) and \(N_{y}\) are the numbers of samples. According to (1) and (2), discriminative filter learning aims to find an optimal image filter vector U, which naturally splits into a pair of image filter vectors \(U_{x}\) and \(U_{y}\) as \(U= \left [ {U}_{x};{U}_{y}\right ]\).

Given C classes of heterogeneous faces, let \(N_{i}\) be the number of samples from the i-th class. The intra-modality and cross-modality within-class and between-class scatters of the filtered images are defined as,

$$\begin{array}{@{}rcl@{}} {G_{w}}^{xx} &=& \sum\limits_{i = 1}^{C} \sum\limits_{j = 1}^{N_{i}} \left(d\mathcal{L}(M^{x})_{ij} - d\mathcal{L}(\bar{M}^{x})_{i}\right)\left(d\mathcal{L}(M^{x})_{ij} - d\mathcal{L}(\bar{M}^{x})_{i}\right)^{T}\\ {G_{w}}^{xy} &=& \sum\limits_{i = 1}^{C} \sum\limits_{j = 1}^{N_{i}} \left(d\mathcal{L}(M^{x})_{ij} - d\mathcal{L}(\bar{M}^{y})_{i}\right)\left(d\mathcal{L}(M^{x})_{ij} - d\mathcal{L}(\bar{M}^{y})_{i}\right)^{T} \end{array} $$
(4)
$$\begin{array}{@{}rcl@{}} {G_{b}}^{xx} &=& \sum\limits_{i = 1}^{C} C_{i}\left(d\mathcal{L}\left(\bar{M}^{x}\right)_{i} - d\mathcal{L}\left(\bar{M}^{x}\right)\right)\left(d\mathcal{L}\left(\bar{M}^{x}\right)_{i} - d\mathcal{L}\left(\bar{M}^{x}\right)\right)^{T}\\ {G_{b}}^{xy} &=& \sum\limits_{i = 1}^{C} C_{i}\left(d\mathcal{L}\left(\bar{M}^{x}\right)_{i} - d\mathcal{L}\left(\bar{M}^{y}\right)\right)\left(d\mathcal{L} \left(\bar{M}^{x}\right)_{i} - d\mathcal{L}\left(\bar{M}^{y}\right)\right)^{T} \end{array} $$
(5)

\({G_{w}}^{yy}\) and \({G_{w}}^{yx}\) are defined similarly to \({G_{w}}^{xx}\) and \({G_{w}}^{xy}\) according to (4), and \({G_{b}}^{yy}\) and \({G_{b}}^{yx}\) similarly to (5). \( d\mathcal {L}(M^{x})_{ij}\) and \( d\mathcal {L}(M^{y})_{ij}\) are the pixel difference matrices (PDMs) of the j-th sample pair from the i-th class, \(d\mathcal {L}(\bar {M}^{x})_{i}\) and \(d\mathcal {L}(\bar {M}^{y})_{i}\) are the mean matrices of the PDVs of the filtered images from the i-th class, and \(d\mathcal {L}(\bar {M}^{x})\) and \(d\mathcal {L}(\bar {M}^{y})\) are the total mean vectors of the PDVs of the corresponding sample sets.

The within-class scatter and between-class scatter of the filtered images can be defined as,

$$ \boldsymbol{G}_{\boldsymbol{w}} = \left[\begin{array}{ll} G_{w}^{xx}&G_{w}^{xy}\\ G_{w}^{yx}&G_{w}^{yy} \end{array} \right], \boldsymbol{G}_{\boldsymbol{b}} = \left[\begin{array}{ll} G_{b}^{xx}&G_{b}^{xy} \\ G_{b}^{yx}&G_{b}^{yy} \end{array} \right] $$
(6)

According to (1), we get:

$$ \boldsymbol{G}_{\boldsymbol{w}} = U^{T}\tilde{\boldsymbol{G}}_{\boldsymbol{w}}U ,\boldsymbol{G}_{\boldsymbol{b}} = U^{T}\tilde{\boldsymbol{G}}_{\boldsymbol{b}}U $$
(7)

where \(\tilde {\boldsymbol {G}}_{\boldsymbol {w}}\) and \(\tilde {\boldsymbol {G}}_{\boldsymbol {b}}\) are defined as \(\tilde {\boldsymbol {G}}_{\boldsymbol {w}} = \left [\begin {array}{ll} \tilde {G}_{w}^{xx}&\tilde {G}_{w}^{xy}\\ \tilde {G}_{w}^{yx}&\tilde {G}_{w}^{yy} \end {array} \right ]\) , \( \tilde {\boldsymbol {G}}_{\boldsymbol {b}} = \left [\begin {array}{ll} \tilde {G}_{b}^{xx}&\tilde {G}_{b}^{xy}\\ \tilde {G}_{b}^{yx}&\tilde {G}_{b}^{yy} \end {array}\right ]\). Equations (4) and (5) then become:

$$\begin{array}{@{}rcl@{}} {G_{w}}^{xx} &=& U_{x}^{T}\sum\limits_{i = 1}^{C} \sum\limits_{j = 1}^{N_{i}} \left(d(M^{x})_{ij} - d(\bar{M}^{x})_{i}\right)\left(d(M^{x})_{ij} - d(\bar{M}^{x})_{i}\right)^{T} U_{x} = U_{x}^{T}\tilde{G}_{w}^{xx}U_{x}\\ {G_{w}}^{xy} &=& U_{x}^{T}\sum\limits_{i = 1}^{C} \sum\limits_{j = 1}^{N_{i}} \left(d(M^{x})_{ij} - d(\bar{M}^{y})_{i}\right)\left(d(M^{x})_{ij} - d(\bar{M}^{y})_{i}\right)^{T} U_{y} = U_{x}^{T}\tilde{G}_{w}^{xy}U_{y} \end{array} $$
(8)

and

$$\begin{array}{@{}rcl@{}} {G_{b}}^{xx} &=& U_{x}^{T}\sum\limits_{i = 1}^{C} C_{i}\left(d(\bar{M}^{x})_{i} - d(\bar{M}^{x})\right)\left(d(\bar{M}^{x})_{i} - d(\bar{M}^{x})\right)^{T}U_{x} = U_{x}^{T}\tilde{G}_{b}^{xx}U_{x}\\ {G_{b}}^{xy} &=& U_{x}^{T}\sum\limits_{i = 1}^{C} C_{i}\left(d(\bar{M}^{x})_{i} - d(\bar{M}^{y})\right)\left(d(\bar{M}^{x})_{i} - d(\bar{M}^{y})\right)^{T}U_{y} = U_{x}^{T}\tilde{G}_{b}^{xy}U_{y} \end{array} $$
(9)

Finally, the discriminative filters can be obtained by solving the generalized eigenvalue problem \( \tilde {\boldsymbol {G}}_{\boldsymbol {b}}U_{t} = \lambda _{1} \tilde {\boldsymbol {G}}_{\boldsymbol {w}}U_{t}\).
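A minimal sketch of this eigenvalue step is given below, assuming the stacked scatter matrices \(\tilde{\boldsymbol{G}}_{\boldsymbol{b}}\) and \(\tilde{\boldsymbol{G}}_{\boldsymbol{w}}\) have already been formed as in (6)–(9); the small ridge term added to \(\tilde{\boldsymbol{G}}_{\boldsymbol{w}}\) is our own regularization choice to keep the problem well conditioned:

```python
import numpy as np
from scipy.linalg import eigh

def learn_discriminative_filters(G_b_tilde, G_w_tilde, k, eps=1e-6):
    """Solve G_b u = lambda G_w u and keep the k leading eigenvectors as
    the discriminative filters U = [U_x; U_y] (a sketch, not the paper's
    exact solver). Both inputs are symmetric scatter matrices."""
    # ridge regularization keeps G_w symmetric positive definite
    G_w_reg = G_w_tilde + eps * np.eye(G_w_tilde.shape[0])
    # scipy's eigh handles the symmetric-definite generalized problem and
    # returns eigenvalues in ascending order, so take the last k columns
    eigvals, eigvecs = eigh(G_b_tilde, G_w_reg)
    U = eigvecs[:, -k:][:, ::-1]  # k filters, largest eigenvalue first
    return U
```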

3.1.2 Enhanced weight learning

As shown in (3), the CDFL can be treated as a weighted sum of all the discriminative filters U, using a linear projection from an M-dimensional subspace to a 1-dimensional vector, and the weight vector V can be considered as the projection coefficients of this linear projection. While discriminative image filter learning aims at learning the discriminant pixels in each small image patch, it fails to capture the discrimination between intra-personal and inter-personal PDV pairs. Therefore, the aim of weighting the discriminative filters is to make the similarities of PDVs from the two classes (intra-personal and inter-personal PDV pairs) more discriminable, and the weight learning problem of the CDFL can then be transformed into a two-class linear projection.

Given M groups of Z, where Z denotes the similarities of PDV pairs, \(\mathrm {Z} = \left [ \mathrm {Z}_{1};\mathrm {Z}_{2}; {\cdots } ;\mathrm {Z}_{M} \right ]^{T}\) and \(\mathrm {Z}_{\gamma } = [{\mathrm {Z}^{\text {intra}}}_{\gamma },{\mathrm {Z}^{\text {inter}}}_{\gamma }], \gamma \in \left \{ 1,2, {\cdots } ,M \right \}\), the samples of the two classes (the intra-personal and inter-personal PDV pairs) are defined as,

$$ \left\{\begin{array}{llll} {\mathrm{Z}^{\text{intra}}}_{\gamma} = [\theta_{1}^{\text{intra}},\theta_{2}^{\text{intra}}, {\cdots} ,\theta_{N_{\text{intra}}}^{\text{intra}}]\\ {\mathrm{Z}^{\text{inter}}}_{\gamma} = [\theta_{1}^{\text{inter}},\theta_{2}^{\text{inter}}, {\cdots} ,\theta_{N_{\text{intra}}}^{\text{inter}}] \end{array} \right. $$
(10)

where \(\theta^{\text{intra}}\) and \(\theta^{\text{inter}}\) indicate the similarities of the intra-personal and inter-personal PDV pairs, defined as,

$$ \left\{ \begin{array}{llll} \theta^{\text{intra}} = \left\| d(M)_{ij}^{n} - d(M)_{\mu \upsilon}^{n} \right\|, \text{where} \ i = \mu \ \text{and}\ j \neq \upsilon\\ \theta^{\text{inter}} = \left\| d(M)_{ij}^{n} - d(M)_{\mu \upsilon}^{n} \right\|, \text{where}\ i \ne \mu \end{array} \right. $$
(11)

where \(N_{\text{intra}}\) and \(N_{\text{inter}}\) are the numbers of PDV pairs in the two classes. To address this problem, we utilize Fisher Linear Discriminant Analysis (FLDA) to obtain the optimal linear projection. The between-class scatter matrix and within-class scatter matrix of Z are defined as:

$$\begin{array}{@{}rcl@{}} {G_{b}}^{\prime} &=& \left(\bar{\mathrm{Z}}^{\text{intra}} - \bar{\mathrm{Z}}^{\text{inter}}\right)\left(\bar{\mathrm{Z}}^{\text{intra}} - \bar{\mathrm{Z}}^{\text{inter}}\right)^{T}\\ {G_{w}}^{\prime} &=& \sum\limits_{n_{i} = 1}^{N_{\text{intra}}} \left(\mathrm{Z}_{n_{i}} - \bar{\mathrm{Z}}^{\text{intra}}\right)\left(\mathrm{Z}_{n_{i}} - \bar{\mathrm{Z}}^{\text{intra}}\right)^{T} + \sum\limits_{n_{j} = 1}^{N_{\text{inter}}} \left(\mathrm{Z}_{n_{j}} - \bar{\mathrm{Z}}^{\text{inter}}\right)\left(\mathrm{Z}_{n_{j}} - \bar{\mathrm{Z}}^{\text{inter}}\right)^{T} \end{array} $$
(12)

where \(\bar {\mathrm {Z}}^{\text {intra}}\) and \(\bar {\mathrm {Z}}^{\text {inter}}\) are the means of the two classes. Thus, the projection coefficients V, which are also the optimal weights for discriminating the PDV pairs, can be obtained by solving the generalized eigenvalue problem \({G_{b}}^{\prime } V = \lambda_{2}{G_{w}}^{\prime } V \).
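Because \({G_{b}}^{\prime}\) is a rank-one matrix, the leading generalized eigenvector has the familiar two-class FLDA closed form \(V \propto ({G_{w}}^{\prime})^{-1}(\bar{\mathrm{Z}}^{\text{intra}} - \bar{\mathrm{Z}}^{\text{inter}})\), which the following sketch uses directly (the function name and the ridge term are our own illustrative choices):

```python
import numpy as np

def learn_filter_weights(Z_intra, Z_inter, eps=1e-6):
    """FLDA weights for the two PDV-similarity classes, a sketch of
    Eq. (12). Columns of Z_intra / Z_inter are M-dimensional similarity
    vectors of intra-personal and inter-personal PDV pairs."""
    mu_a = Z_intra.mean(axis=1, keepdims=True)
    mu_e = Z_inter.mean(axis=1, keepdims=True)
    # within-class scatter G_w' of Eq. (12)
    Sw = (Z_intra - mu_a) @ (Z_intra - mu_a).T \
       + (Z_inter - mu_e) @ (Z_inter - mu_e).T
    # closed-form leading eigenvector of the rank-one problem
    V = np.linalg.solve(Sw + eps * np.eye(Sw.shape[0]), mu_a - mu_e)
    return V / np.linalg.norm(V)
```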

3.2 V-ELM for classification

To tackle the issue of cross-modality face matching and improve the overall performance of ELM, a voting based ELM [2] is utilized in our model for feature classification. In V-ELM, multiple independent ELMs are first trained with a fixed number of hidden nodes and the same activation function, with the learning parameters of each ELM randomly initialized independently. The predicted label is then determined by majority voting. The V-ELM classifier utilized in the proposed recognition approach can be described as follows.

Assuming that the available training feature dataset is \(\left \{ \left (\boldsymbol {x}_{i} ,t_{i}\right ) \right \}_{i=1}^{\mathbf {N}} \), where \(\boldsymbol{x}_{i}\), \(t_{i}\), and N represent the feature vector of the i-th face image, its corresponding category index and the number of images, respectively, the SLFN with κ nodes in the hidden layer can be expressed as

$$ \boldsymbol{o}_{i} = \sum\limits_{j = 1}^{\kappa} \theta_{j} g(\boldsymbol{a}_{j}, \boldsymbol{b}_{j}, \boldsymbol{x}_{i}), i=1,2,\cdots,\mathbf{N} $$
(13)

where \(\boldsymbol{o}_{i}\) is the output of the SLFN for the i-th input face image feature, and \(\boldsymbol {a}_{j} \in \mathbb {R}^{d}\) and \(b_{j} \in \mathbb {R}\) \(\left (j = 1,2,\cdots ,\kappa \right )\) are the parameters of the j-th hidden node. The variable \(\boldsymbol {\theta }_{j} \in \mathbb {R}^{m}\) is the link connecting the j-th hidden node to the output layer and g(⋅) is the hidden node activation function. With all training samples, (13) can be expressed in the compact form

$$ \mathbf{O} = \boldsymbol{H} \boldsymbol{\theta} $$
(14)

where \(\boldsymbol{\theta} = (\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2},\cdots,\boldsymbol{\theta}_{\kappa})\) and O are the output weight matrix and the network outputs, respectively. The variable H denotes the hidden layer output matrix with entries \(H_{ij} = g(\boldsymbol{a}_{j}, b_{j}, \boldsymbol{x}_{i})\).

To perform multi-class classification, the ELM classifier generally utilizes the One-Against-All (OAA) method to transform the classification application into a multi-output model regression problem. That is, for a \(\mathcal {C}\)-category classification application, the output label \(t_{i}\) of the face image feature \(\boldsymbol{x}_{i}\) is encoded as a \(\mathcal {C}\)-dimensional vector \(\boldsymbol {t}_{i} = \left (t_{i1},t_{i2},\cdots ,t_{i\mathcal {C}}\right )^{T}\) with \(t_{i\mathbf {c}} \in \left \{1,-1 \right \}\) \(\left (\mathbf {c} = 1,2,\cdots ,\mathcal {C}\right )\). If the category index of the face image \(\boldsymbol{x}_{i}\) is c, then \(t_{i\mathbf{c}}\) is set to 1 while the remaining entries of \(\boldsymbol{t}_{i}\) are set to −1. Hence, the objective of the training phase for the SLFN in (13) becomes finding the best network parameter set \(\varDelta = \left \{\left (\boldsymbol {a}_{j}, b_{j}, \boldsymbol {\theta }_{j} \right ) \right \}_{j = 1, {\ldots } ,\kappa }\) such that the following error cost function is minimized

$$ \min\limits_{\varDelta} E=\min\limits_{\varDelta} \left\|\mathbf{O} - \boldsymbol{T} \right\| $$
(15)

where \(\boldsymbol {T} = \left (\boldsymbol {t}_{1}, \boldsymbol {t}_{2},\cdots ,\boldsymbol {t}_{\mathbf {N}} \right )\) is the target output matrix.

ELM theory claims that random hidden node parameters can be utilized for SLFNs and that the hidden node parameters need not be tuned. In this case, the system (14) becomes a linear model and the network parameter matrix can be solved analytically by the least-squares method. That is,

$$ \boldsymbol{\theta} = \boldsymbol{H}^{\dag} \boldsymbol{T} $$
(16)

where \(\boldsymbol{H}^{\dag}\) is the Moore-Penrose generalized inverse of the hidden layer output matrix H [35]. The universal approximation property of the ELM algorithm is presented in [11].
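For concreteness, a minimal single-ELM sketch of (13)–(16) is given below; the sigmoid activation, Gaussian random initialization and class interface are our own assumptions rather than prescriptions from [11]:

```python
import numpy as np

class ELM:
    """Minimal single-hidden-layer ELM sketch: random hidden-node
    parameters (a_j, b_j), sigmoid activation g, and output weights
    theta solved by the Moore-Penrose pseudo-inverse, as in Eq. (16)."""

    def __init__(self, n_hidden=1000, rng=None):
        self.kappa = n_hidden
        self.rng = rng or np.random.default_rng()

    def _hidden(self, X):
        # H_ij = g(a_j . x_i + b_j) with a sigmoid activation g
        return 1.0 / (1.0 + np.exp(-(X @ self.A + self.b)))

    def fit(self, X, y, n_classes):
        # X: (N, d) feature rows; y: integer labels in {0, ..., C-1}
        d = X.shape[1]
        self.A = self.rng.standard_normal((d, self.kappa))
        self.b = self.rng.standard_normal(self.kappa)
        # OAA target encoding: +1 for the true class, -1 elsewhere
        T = -np.ones((X.shape[0], n_classes))
        T[np.arange(X.shape[0]), y] = 1.0
        self.theta = np.linalg.pinv(self._hidden(X)) @ T  # Eq. (16)
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.theta, axis=1)
```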

Suppose that M independent networks trained with the ELM algorithm are used in V-ELM. Then, for each testing sample \(x_{test}\), M prediction results can be obtained from these independent ELMs. A corresponding vector \(\pi _{\boldsymbol {M},x_{test}} \in \mathbb{R}^{C}\), with dimension equal to the number of class labels, is used to store all M results for \(x_{test}\): if the class label predicted by the m-th (\(m \in \{1,2,\cdots,M\}\)) ELM is τ, the value of the corresponding entry τ of \(\pi_{\boldsymbol {M},x_{test}}\) is increased by one, as follows:

$$\pi_{\boldsymbol{M},x_{test}}(\tau ) = \pi_{\boldsymbol{M},x_{test}}(\tau ) + 1 $$

After all M results have been collected in \(\pi _{\boldsymbol {M},x_{test}}\), the final class label of \(x_{test}\) is obtained by majority voting:

$$ C_{test}= \arg \max\limits_{\tau \in [1, {\cdots} ,C]} \{ \pi_{\boldsymbol{M},x_{test}}(\tau )\} $$
(17)
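The voting step of (17) then reduces to a few lines; this sketch assumes the illustrative ELM class above and integer-encoded class labels:

```python
import numpy as np

def velm_predict(elms, X_test, n_classes):
    """Majority voting over M independently trained ELMs, Eq. (17)."""
    votes = np.zeros((X_test.shape[0], n_classes), dtype=int)
    for elm in elms:
        preds = elm.predict(X_test)
        # increment the pi_{M, x_test}(tau) entry for each sample's vote
        votes[np.arange(X_test.shape[0]), preds] += 1
    return np.argmax(votes, axis=1)
```

For example, training `elms = [ELM().fit(X, y, C) for _ in range(M)]` and calling `velm_predict(elms, X_test, C)` reproduces the train-then-vote procedure described above.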

3.3 The ensemble ELM based approach for cross-modality face matching

The whole procedure of the ensemble ELM approach can be divided into two parts: 1) feature learning by CDFL and 2) V-ELM based cross-modality face matching. Figure 1 illustrates the whole process of the ensemble ELM based approach for cross-modality face matching.

In the feature learning phase, face images are first divided into small patches and an LBP-like feature extraction is used to obtain the pixel difference vectors (PDVs). The CDFL image filters are then learned from the pixel difference matrix (PDM). In the matching phase, we first obtain the new image pattern with the learned filter matrix. Then, histogram-based local features are extracted to aggregate the local information. Thirdly, CCA [7] is utilized to map the data onto a common subspace and reduce the feature dimensionality. Finally, V-ELM is applied to obtain the final matching results, as sketched below.
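As a schematic summary of the matching phase only, the driver below ties the illustrative pieces above together (it reuses the `ELM` class and `velm_predict` sketches); the gallery and probe feature matrices are assumed to already be the CDFL-filtered, histogram-encoded and CCA-projected vectors described in the text, and the number of ELMs is a free parameter:

```python
def match_faces(feat_gallery, labels, feat_probe, n_classes,
                n_elms=7, n_hidden=1000):
    """Matching-phase sketch: train n_elms independent ELMs on the gallery
    features (one row per image) and classify the probe features by
    majority voting. Inputs are assumed to be CDFL + histogram + CCA
    features, i.e. the output of the feature learning phase."""
    elms = [ELM(n_hidden=n_hidden).fit(feat_gallery, labels, n_classes)
            for _ in range(n_elms)]
    return velm_predict(elms, feat_probe, n_classes)
```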

4 Experiments

In this section, we compare our ensemble ELM based approach with several state-of-the-art HFR methods, such as PLS [34], CDFE [27], CCA [7] and Discriminant Image Filter Learning (DIFL) [22], as well as some hand-crafted feature extraction methods, e.g. LBP and LTP. Two different heterogeneous face recognition applications, namely VIS vs. NIR and 2D vs. 3D, are utilized to evaluate the performance of the proposed method. The following describes the details of the experimental setups and the results.

4.1 Heterogeneous face biometrics (HFB): VIS vs. NIR

The HFB database [24] is used to evaluate the visual (VIS) image vs. near infrared (NIR) image heterogeneous scenario. In this experiment, the released Ver.1 database, which contains images from 100 subjects with 4 NIR and 4 VIS images per subject, is utilized. All the images are scaled, transformed and cropped to 128 × 128 pixels according to the eye positions. Some of the cropped HFB example images are shown in Fig. 2. In this experiment, the VIS images of each person are utilized as the gallery set and the corresponding NIR images as the probe set. The filtering window size is set to s = 3, so that L = 8 neighbours participate in the image filtering.

Fig. 2 Some of the cropped examples from the HFB database

In the first experiment, the first 80 persons are utilized for training, and the remaining 20 persons for testing. We compare the recognition performance of the proposed Ensemble ELM with several popular HFR methods, and we also compare it with invariant feature descriptors such as LBP, LTP and DIFL [22]. The number of hidden nodes of ELM and our ensemble ELM (with sigmoid activation) is chosen to be 1000. Figure 3 shows how the recognition rate varies over the first 100 dimensions for the several HFR methods.

Fig. 3 Performance comparison on the HFB database, where a shows the recognition comparison of Ensemble ELM with different HFR methods and b shows the comparison of different classifiers

As shown, Fig. 3a is the comparison with other HFR algorithms while Fig. 3b is the comparison with different classification methods. We can see in Fig. 3a that the proposed ensemble ELM based method significantly outperforms the other methods, and that most of the compared methods reach their highest recognition rates in the range of dimension 50 to 80, except CCA, which reaches its highest recognition rate around dimension 30.

In particular, Table 1 gives the detailed recognition performance of the different methods on the HFB database. The results in Table 1 indicate that the ensemble ELM based approach is effective in improving cross-modality face recognition performance in general. The rank-1 recognition rate obtained by the proposed approach is 83.8 %, which is 1.3 % and 2.5 % higher than the ELM and NN based approaches, respectively.

Table 1 Performance Comparison on the Two Heterogeneous Scenarios in Terms of Rank-1 Recognition Rate (%)

4.2 Face recognition grand challenge (FRGC): 2D photos vs. 3D range images

The second set of experiments is conducted on the FRGC [33] 2D vs. 3D face database. In this experiment, FRGC v2, which contains 4007 2D and 3D face image pairs from 466 persons, is utilized to evaluate the performance of the proposed method. The database consists of frontal views with expression variations, and none of the subjects wears glasses. All images are cropped in the same way to 100 × 100 pixels according to the eye positions. Cropped examples from the FRGC database are shown in Fig. 4.

Fig. 4 The cropped examples from the FRGC database. The first row contains examples of digital photos and the second row the corresponding 3D range images

To evaluate the robustness of our method against expression variations, 285 subjects with more than 6 samples are picked out, and 5 samples of each person are selected for training and the rest for testing. The number of hidden nodes of the ELM based methods is chosen to be 500, and we repeat the random selection 10 times to obtain an average rank-1 recognition rate. Figure 5 shows the experimental results of the ensemble ELM based approach compared with some popular HFR methods, such as PLS [34], CDFE [27] and CCA [7]. As we can see in Fig. 5a and b, our ensemble ELM based method consistently outperforms the other compared HFR methods. The rank-1 recognition rate obtained by the proposed approach is 95.7 %, which is 3.1 % and 3.9 % higher than the ELM and NN based approaches, respectively. Detailed recognition results are displayed in Table 1.

Fig. 5 Performance comparison on the FRGC database, where a shows the recognition comparison of Ensemble ELM with different HFR methods and b shows the comparison of different classification methods

4.3 Results and discussion

The recognition results of the different cross-modality FR algorithms on the two databases are given in Table 1, from which we can observe that the Ensemble ELM method consistently outperforms the other compared HFR methods. The rank-1 recognition rates on the HFB and FRGC databases are 83.8 % and 95.7 %, respectively.

We have the following two observations from the above comparisons:

1) The proposed method provides an Ensemble ELM based approach for cross-modality image matching problems. The approach solves the cross-modality FR problem in two ways. Firstly, the cross-modality appearance differences are reduced by learning a new feature descriptor, so that the cross-modality features are represented in a more discriminant way. Secondly, the recognition performance is boosted by the Ensemble ELM, which achieves better classification accuracy.

2) The proposed method consistently outperforms all the other methods on the two cross-modality applications. The reason is that the Ensemble ELM approach exploits the most discriminant features via complete discriminative feature learning; meanwhile, the ELMs are utilized in parallel and the final classification is obtained by a voting strategy.

5 Conclusion

Extreme learning machines (ELMs) have good generalization performance in many real classification applications. In this paper, we have proposed a new ensemble based ELM approach for cross-modality face matching. The proposed approach exploits the combination of feature learning based face descriptors and the voting based extreme learning machine (V-ELM). In this new approach, the cross-modality feature differences are first reduced at the image pixel level in a data-driven way; then the voting based ELM, which adopts multiple independently trained ELMs instead of a single ELM, is utilized to obtain the cross-modality face matching result. Experiments on two different HFR applications show the effectiveness and generality of the proposed method.