1 Introduction

Face recognition is a nonintrusive and natural technique compared with other biometric modalities such as iris and fingerprint recognition. It is a visual pattern classification problem. The automation of face recognition in numerous applications, including surveillance, biometric authentication, and human-computer interaction, has motivated many researchers to design proficient and accurate classifiers. The performance of a face recognition system deteriorates when the database contains a huge number of face images collected under uncontrolled environmental conditions [10]. A face recognition system performs one-to-many matching of a query face image against the set of facial images in the associated database. Its recognition rate depends on various factors such as pose, age span, facial expression, facial wear, and illumination conditions [61]. Recognition is especially challenging in non-cooperative user scenarios, particularly in surveillance video. The key challenges for a face recognition system are: i) immense variation among face images of the same subject due to viewing direction; ii) the non-convexity of the face, which requires a non-linear algorithm that can map the high-dimensional feature space to a low-dimensional one; and iii) training on only a small number of face images [39, 61]. To handle these challenges, a face recognition system consists of two main modules: feature extraction and classification. The feature extraction module selects discriminative features from the normalized facial image, while the objective of the classification module is to provide a powerful and robust classifier, since the non-linearity and non-convexity of face images cannot be completely resolved by image normalization and efficient feature extraction alone.

The performance of a face recognition system degrades under varying illumination conditions, since the change in the appearance of a face image due to illumination can be larger than the change due to face identity. Illumination normalization is therefore a central problem in face recognition, and various well-known techniques have been developed to tackle it. Standard techniques include gamma correction, logarithmic transform, Histogram Equalization (HE), and histogram matching. However, these algorithms fail to give satisfactory results under widely varying illumination conditions. Hafed et al. used the discrete cosine transform (DCT) to obtain the feature vector of a face image. Chen employed the DCT in the logarithmic domain, discarding low-frequency DCT coefficients to overcome the effect of illumination variations [13]. Xie et al. normalized both large-scale (low-frequency) and small-scale features for improved facial image recognition and restoration [56]. Vishwakarma proposed a fuzzy-filter-based illumination normalization algorithm in which low-frequency DCT coefficients are modified using a fuzzy polynomial membership function [53].

After preprocessing the facial image, features are extracted. A variety of feature descriptors have been developed and analyzed in the literature [2, 4, 5, 8, 12, 14, 47]. Deniz et al. introduced the fusion of Histograms of Oriented Gradients (HOG) at different scales to acquire important features for face recognition [17]. Zhou et al. proposed an approach in which features are extracted by integrating principal components and Linear Discriminant Analysis (LDA) [62]. Fernando et al. introduced multi-scale gray-level co-occurrence matrices as feature descriptors [14]. Furthermore, fuzzy approximation theory can be applied to pixel-wise feature extraction. Real pattern classification problems always involve uncertainty, and fuzzy logic can be employed to analyze them because it deals with approximate rather than precise reasoning [23]. Zadeh introduced fuzzy logic as an imprecise mode of reasoning that helps in decision making under uncertain conditions [59, 60]. Kim et al. designed an automatic facial expression recognition classifier using a neuro-fuzzy approach [37]. Kyoung et al. implemented fuzzified feature vectors for face recognition. Hai-jun et al. designed an online sequential ELM with fuzzy activation functions for classification problems and function approximation [49].

The extracted features are recognized using a classifier [1, 19, 21, 36, 41, 42]. Traditional gradient-based learning algorithms are not efficient for challenging real-time recognition applications: their structural parameters are calculated iteratively, which leads to slow learning, and they suffer from overfitting, underfitting, out-of-memory issues, and easy convergence to local minima [30, 33]. To work around the issues of classical learning approaches, non-iterative learning algorithms were introduced [20, 24, 25, 29]. Huang et al. proposed the ELM, a fast and non-iterative single-hidden-layer framework for (binary and multiclass) classification and regression problems [29, 31]. Based on Bartlett's theory of generalization, the objective of ELM is to minimize the norm of the output weights along with the training error [7]. Benefiting from its non-linear mapping capability, ELM has been applied in numerous applications such as object recognition, imbalanced learning, signal processing, and traffic sign recognition [16, 34, 35, 67].

Although ELM has achieved considerable success, there is still scope for enhancement, and in recent years researchers have done much work to optimize its structure. Zhu et al. used a differential evolutionary algorithm to select the input weights [63]. Fei et al. proposed an improved ELM that encodes a priori information for function approximation [26]. Huang et al. [32] designed incremental ELM, in which hidden nodes with complex activation functions are added randomly. Feng et al. introduced a variant named error-minimized ELM (EM-ELM) [20], in which hidden nodes can be added one by one or chunk by chunk. Deng et al. introduced regularized ELM, which uses a weighted least-squares approach for databases containing outliers [15]. To automate the architecture of ELM, Yoan et al. proposed optimally pruned ELM, in which statistical techniques measure the significance of hidden nodes [46]. Wang et al. introduced effective ELM, in which bias and input weights are selected properly to achieve a high performance rate [54]. Zong et al. successfully applied ELM to face recognition and compared it with the support vector machine (SVM) [65], and later introduced a kernel version of ELM for the same application [66]. Jose et al. introduced a regularized ELM in which different regularization techniques prune the architecture [45]. Huang et al. proved that ELM is better than SVM and least-squares SVM in terms of optimization constraints and efficiency [33]. Zong et al. designed weighted ELM to balance input data with imbalanced class distributions by assigning different weights according to the user's needs [67]. Hong et al. introduced hierarchical ELM for sequential learning [27]. Tang et al. introduced multi-layer ELM by adding more than one hidden layer to the original ELM design [52]. Zhiyong et al. utilized ELM for traffic sign recognition using HOG features [34]. To avoid manual tuning of network parameters in multi-layer ELM, Wong et al. introduced its kernel version, entitled ML-KELM [55]. Chengbo et al. introduced an improved weighted ELM for classifying imbalanced data based on majority voting [44]. Deng et al. used an ELM autoencoder together with online sequential ELM for feature extraction and classification, respectively, in a visual tracking system [16].

For computer vision applications, a single-kernel classifier is often not an appropriate choice, which motivates the use of multiple kernel learning (MKL) with classification techniques. MKL is a method for selecting and integrating data derived from different sources, such as strings, graphs, and trees, for a specific recognition problem [18, 57]. It strengthens machine learning by allowing the kernels to be chosen based on the data. Gonen et al. analyzed the combination quality and computational cost of different MKL methodologies on real data [22]. Zhuang et al. explored the integration of different kernels in a multi-layer architecture [64]. Yanfeng et al. utilized MKL for classifying hyperspectral images, with kernel weights calculated by statistical significance. Bucak et al. implemented different MKL approaches for object recognition and showed that MKL is more efficient than a single kernel function [11]. Fabio et al. proposed an MKL approach in which the kernel combination parameters are calculated by a min-max algorithm [6]. Xinwang et al. proposed sparse, non-sparse, and radius-incorporated methods to find the optimized kernel combination coefficients [43]. Xiao-dong et al. implemented a convex combination of different base kernels and applied it to classification problems [40].

The objective of the present investigation is to design a novel classification engine that can extract invariant features for efficient classification of non-linear face images captured under uncontrolled conditions. The essence of this work is:

  • An extension of the original ELM is proposed with a deterministic and multiple kernel learning approach. In the original (non-deterministic) ELM, the input parameters are randomly assigned and the number of hidden layer neurons is determined experimentally. In deterministic ELM, the input and hidden layer parameters are analytically evaluated, which results in a steady and invariant output.

  • To make the classifier powerful and robust on real-world heterogeneous databases, and to discriminate between intra- and inter-class variance, fuzzification is employed for invariant feature extraction. Further, the input space is mapped to an optimized kernel feature space that is an integration of multiple base kernels.

  • To handle real-world challenges of face recognition systems (lighting variations), some pre-processing techniques need to be employed. A face image is composed of low- and high-frequency components, and illumination variations mainly affect the low-frequency components. Therefore, to suppress the effect of lighting conditions, the low-frequency components need to be modified. In the proposed method we use a fuzzy-filter-based normalization algorithm that modifies the low-frequency components using a polynomial membership function.

An extensive set of experiments has been executed on face databases to compare the performance of the proposed approaches with contemporary techniques, including OMKELM, KELM, and ELM. The experimental results confirm the superiority of the proposed approaches over these techniques.

This paper is organized as follows. The concepts of fuzzy logic, extreme learning machine, and multiple kernel learning are reviewed in Section 2. The proposed work, followed by its analysis, is presented in Section 3. In Section 4 an empirical evaluation is performed on a comprehensive set of databases, and Section 5 summarizes our conclusions.

2 Preliminaries

2.1 Fuzzy logic

In a crisp set, an individual is either a member or a non-member of a given set Q. The characteristic function of the crisp set Q is given as:

$$ {\mu}_{\mathcal{Q}}(x)=\left\{\begin{array}{c}1,\ x\in \mathcal{Q}\\ {}0,\ x\notin \mathcal{Q}\end{array}\right\} $$
(1)

Fuzzy logic is based on the principle of imprecise and approximate reasoning, which helps in making decisions in an environment of imprecision and uncertainty [59]. Each element in a fuzzy set is assigned a value that denotes its membership grade in the set. This grade is analogous to the degree to which an individual is compatible or similar with the concept illustrated by the fuzzy set [38]. An individual with a larger membership grade belongs to the fuzzy set to a greater degree. These grades are real numbers in [0, 1]: a membership grade of 1 indicates full membership and 0 indicates full non-membership. A crisp set is thus a restricted case of a fuzzy set with only two membership grades, 0 and 1. The membership grade of an individual in a fuzzy set is determined by the membership function (MF), which maps individuals of a given universal set E into real values in [0, 1]. The membership grade μA for a fuzzy set A is represented by

$$ {\mu}_A:E\to \left[0,1\right] $$
(2)

2.2 Pre-processing technique: Illumination normalization

The objective of a normalization algorithm is to reduce or nullify the effect of illumination variation on images without changing the image details. In this investigation, the fuzzy filter of [53] is used for normalization. The key observations behind this approach are: i) illumination variations mainly affect the low-frequency (LF) components, and ii) the facial features that can discriminate two facial images appear in ascending order of effectiveness in the frequency domain. Consider a face image of size U × V, where U is the number of rows and V the number of columns. Contrast stretching is first performed by applying Adaptive Histogram Equalization (AHE) followed by a logarithmic transform (LOG), written as AHE + LOG. After contrast enhancement, the image is converted from the spatial to the frequency domain by applying the DCT. The DCT of an image of dimension U × V is given as follows:

$$ D\left(m,n\right)=\Omega (m)\Omega (n)\sum \limits_{xc=0}^{U-1}\sum \limits_{yc=0}^{V-1}f\left( xc, yc\right)\times \cos \left[\frac{\pi \left(2 xc+1\right)m}{2U}\right]\times \cos \left[\frac{\pi \left(2 yc+1\right)n}{2V}\right] $$
(3)

For m = 0, 1, 2, …, U − 1 and n = 0, 1, 2, …, V − 1.

The inverse DCT transform is given by

$$ f\left( xc, yc\right)=\sum \limits_{m=0}^{U-1}\sum \limits_{n=0}^{V-1}\Omega (m)\Omega (n)D\left(m,n\right)\times \cos \left[\frac{\pi \left(2 xc+1\right)m}{2U}\right]\times \cos \left[\frac{\pi \left(2 yc+1\right)n}{2V}\right] $$
(4)

For xc = 0, 1, 2, …, U − 1 and yc = 0, 1, 2, …, V − 1.

where

$$ {\displaystyle \begin{array}{c}\Omega (m)=\left\{\begin{array}{l}\frac{1}{\sqrt{U}},m=0\\ {}\sqrt{\frac{2}{U}},m=1,2,\cdots, U-1\end{array}\right\}\\ {}\Omega (n)=\left\{\begin{array}{l}\frac{1}{\sqrt{V}},n=0\\ {}\sqrt{\frac{2}{V}},n=1,2,\cdots, V-1\end{array}\right\}\end{array}} $$
(5)

Obtain the AC coefficients of the image (representing changes in pixel intensity values) in ascending order of frequency. The DCT coefficients of the complete face image, computed using Eq. (3), are arranged as:

$$ D{(c)}_{c=\mathrm{0...}\left( UV-1\right)}=D{\left(m,n\right)}_{n=\mathrm{0...}\left(V-1\right),m=\mathrm{0...}\left(U-1\right)} $$
(6)

The initial coefficients are the low-frequency DCT (LF-DCT) coefficients. Consider a universal set c denoting the indices of the LF-DCT coefficients. To attenuate the effect of illumination, the LF-DCT coefficients are modified by a fuzzy polynomial MF whose degree of membership increases with the LF-DCT index. The polynomial MF is written as follows:

$$ {\mu}_G=\frac{c^{\omega }}{C^{\omega }} $$
(7)

where C is the index of the largest LF-DCT coefficient and ω is a constant that is experimentally obtained. Let LFN denote the number of LF-DCT coefficients. The fuzzy membership grade for the LFN coefficients is obtained using Eq. (7). These fuzzy membership grades can be expressed as a vector:

$$ G=\left[{G}_0,{G}_1,\dots, {G}_q\dots, {G}_{LFN}\right] $$
(8)

The modified DCT coefficients of the image are obtained using:

$$ MD{(c)}_{c=\mathrm{0...}\left( UV-1\right)}=D{(c)}_{c=\mathrm{0...} LFN}\times G+D{(c)}_{c= LFN+\mathrm{1...}\left( UV-1\right)} $$
(9)

The fuzzy filter algorithm for illumination normalization is outlined as follows:

[Algorithm 1: fuzzy filter for illumination normalization — figure not reproduced]
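To make the procedure concrete, here is a minimal Python/NumPy sketch of the fuzzy filter (our own naming, not the authors' code; it assumes SciPy's orthonormal DCT, approximates the ascending-frequency ordering by the diagonal index m + n as a zig-zag surrogate, and takes an input that has already been contrast-stretched with AHE + LOG):

```python
import numpy as np
from scipy.fft import dct, idct

def fuzzy_filter_normalize(img, lfn=21, omega=0.5):
    """Sketch of the Sec. 2.2 fuzzy filter: attenuate the first LFN
    low-frequency DCT coefficients with the polynomial MF of Eq. (7)."""
    U, V = img.shape
    # 2-D orthonormal DCT-II, matching Eq. (3)
    D = dct(dct(img, norm='ortho', axis=0), norm='ortho', axis=1)

    # Ascending-frequency ordering of coefficient indices by m + n
    m, n = np.meshgrid(np.arange(U), np.arange(V), indexing='ij')
    order = np.argsort((m + n).ravel(), kind='stable')

    # Membership grades from Eq. (7) with C = LFN; the DC term is kept intact
    c = np.arange(1, lfn + 1, dtype=float)
    grades = (c / lfn) ** omega

    flat = D.reshape(-1)                 # view into D
    flat[order[1:lfn + 1]] *= grades     # Eq. (9): modify LF-DCT coefficients

    # Inverse 2-D DCT back to the spatial domain, Eq. (4)
    return idct(idct(D, norm='ortho', axis=1), norm='ortho', axis=0)
```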

2.3 Extreme learning machine

ELM is a non-deterministic learning approach with a feedforward architecture, initially designed for a single hidden layer [31, 33]. Its foundation is the non-iterative calculation of input and hidden layer parameters: the input parameters are randomly projected, and the hidden layer parameters are tuned experimentally to reach a higher accuracy rate. The output function of ELM is given as:

$$ {f}_o=\sum \limits_{j=1}^{\eta }{\theta}_j{b}_j\left({x}_i\right)=\sum \limits_{j=1}^{\eta }{\theta}_jb\left({iw}_j.{x}_i+{a}_j\right)=b(x)\theta =Y,i=1,\dots, N $$
(10)

where iwj is the input weight vector connecting the input neurons to the jth hidden node, aj is the learning parameter of the hidden layer known as the bias, θj is the output weight vector connecting the jth hidden neuron to the output nodes, and fo is the output of the ELM. η and N denote the number of hidden nodes and training samples, respectively. Equation (10) can be written compactly as:

$$ B\theta =Y $$
(11)

B is the hidden layer output matrix, defined in matrix form as

$$ B=\left(\begin{array}{l}b\left({x}_1\right)\\ {}\vdots \\ {}b\left({x}_N\right)\end{array}\right) $$
(12)
$$ ={\left(\begin{array}{ccc}b\left({iw}_1,{a}_1,{x}_1\right)&\cdots &b\left({iw}_{\eta },{a}_{\eta },{x}_1\right)\\ {}\vdots &\ddots &\vdots \\ {}b\left({iw}_1,{a}_1,{x}_N\right)&\cdots &b\left({iw}_{\eta },{a}_{\eta },{x}_N\right)\end{array}\right)}_{N\times \eta } $$
(13)

The qth column of B is the output vector of the qth hidden node over all inputs, and the qth row of B is the hidden layer output corresponding to input xq. x = [x1, …, xN] is the input data vector and Y = [y1, …, yN]T is the target output vector for the N input instances. θ = [θ1, θ2, …, θη]T denotes the output weight matrix for the η hidden neurons; each element of θ is a vector whose dimension equals the number of output nodes (classes).

The input weight matrix is defined as:

$$ IW={\left(\begin{array}{cccc}{iw}_{11}&{iw}_{12}&\cdots &{iw}_{1d}\\ {}\vdots & & &\vdots \\ {}{iw}_{\eta, 1}&{iw}_{\eta, 2}&\cdots &{iw}_{\eta, d}\end{array}\right)}_{\eta \times d} $$
(14)

For d-dimensional input data, the randomly assigned input weight matrix has size η × d. IW is dynamic in size, since the number of hidden nodes is not fixed but experimentally determined.

Equation (11) is a linear system whose least-squares solution is

$$ \theta ={B}^{\dagger }Y $$
(15)

where B† is the Moore-Penrose generalized inverse [48] of the hidden layer output matrix B. The orthogonal projection approach is exploited to determine the Moore-Penrose generalized inverse of B: B† = BT(BBT)−1.

To obtain stronger generalization performance and an invariant output solution, the diagonal of BBT is regularized by the term I/W, where W is a positive coefficient [28].

$$ {f}_o= b\theta =b(x){B}^T{\left(\frac{I}{W}+{BB}^T\right)}^{-1}Y $$
(16)

W is the regularization coefficient and b(x) is the ELM feature mapping, which is generally known to the user.

Unlike iterative learning algorithms, ELM aims to minimize both the norm of the output weights and the training error. The objective of ELM for classification problems is:

$$ \operatorname{Min}:\frac{1}{2}{\left\Vert \theta \right\Vert}^2+\frac{W}{2}\sum \limits_{k=1}^N{\left\Vert {\delta}_k\right\Vert}^2 $$
$$ \mathrm{Such}\ \mathrm{that}:b\left({x}_k\right)\theta ={y}_k^T-{\delta}_k^T\ \mathrm{for}\ k=1,\dots, N $$
(17)

where δk = [δk, 1, …, δk, h]T is the training error vector of the h output neurons corresponding to the input data xk.
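For concreteness, a minimal NumPy sketch of regularized ELM training and prediction (Eqs. (10)-(16), using the sigmoid of Eq. (42)) is given below; naming is ours and this is an illustration, not the authors' implementation:

```python
import numpy as np

def elm_train(X, Y, n_hidden=1000, W=10.0, seed=0):
    """X: (N, d) inputs; Y: (N, h) one-hot targets; returns (IW, a, theta)."""
    rng = np.random.default_rng(seed)
    IW = rng.uniform(-1.0, 1.0, size=(n_hidden, X.shape[1]))  # random input weights
    a = rng.uniform(-1.0, 1.0, size=n_hidden)                 # random biases
    B = 1.0 / (1.0 + np.exp(-(X @ IW.T + a)))                 # hidden output matrix, Eq. (12)
    N = X.shape[0]
    # theta = B^T (I/W + B B^T)^{-1} Y: orthogonal-projection form of Eq. (16)
    theta = B.T @ np.linalg.solve(np.eye(N) / W + B @ B.T, Y)
    return IW, a, theta

def elm_predict(X, IW, a, theta):
    B = 1.0 / (1.0 + np.exp(-(X @ IW.T + a)))
    return B @ theta          # class scores; predict with argmax per row
```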

2.4 Multiple kernel learning

In kernel-based classifiers, efficiency relies on finding an appropriate kernel to represent the data. Single-kernel classifiers do not give good empirical performance on heterogeneous data, where each instance is represented by multiple sources of data [3, 11]. To address this problem, multiple kernel learning is utilized to determine an optimal kernel from a set of base kernels. The objective of MKL is to combine different sets of features by determining the optimal combination of different kernels [22, 58]; features from multiple sources are concatenated and fed to a single learning algorithm. The optimal combination coefficients of the different kernels are data dependent.

The MKL formulation using a linear combination of a set of u pre-defined kernels is given by:

$$ K\left({x}_i,{x}_k\right)=\sum \limits_{v=1}^u{\gamma}_v{K}_v\left({x}_i,{x}_k\right) $$
(18)

with γv ≥ 0 and \( \sum \limits_v{\gamma}_v=1 \). K(xi,  xk) is the resultant optimal kernel, Kv(xi,  xk) represents the vth sub-kernel, and γv is the weight (combination coefficient) of the vth kernel. Each sub-kernel may use a different set of feature vectors for every instance. Weights are assigned to the different kernels according to the significance of their features, to achieve accurate classification.

Equation (18) can be equivalently written as

$$ K\left(\cdot, \cdot; \gamma \right)=\sum \limits_{v=1}^u{\gamma}_v{K}_v\left(\cdot, \cdot \right) $$
(19)

The choice of the kernels K(⋅,  ⋅) and their combination coefficients depends on the nature of the classification problem. In terms of feature mappings, MKL can be formulated as

$$ \theta \left(\cdot; \gamma \right)=\left[{\gamma}_1{\theta}_1\left(\cdot \right),{\gamma}_2{\theta}_2\left(\cdot \right),\cdots, {\gamma}_v{\theta}_v\left(\cdot \right),\cdots {\gamma}_u{\theta}_u\left(\cdot \right)\right] $$
(20)

where θ(⋅; γ) and θv(⋅) are the feature mappings corresponding to K(xi,  xk) and Kv(xi,  xk), respectively.
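As a small illustration of Eq. (18), the following sketch combines the RBF and polynomial base kernels that the experiments in Section 4.2 also use (helper names are ours):

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(A, B, sigma=100.0):
    """K(a, b) = exp(-sigma * ||a - b||^2), cf. Eq. (40)."""
    return np.exp(-sigma * cdist(A, B, 'sqeuclidean'))

def poly_kernel(A, B, zeta=1.0, phi=2):
    """K(a, b) = (a . b + zeta)^phi, cf. Eq. (41)."""
    return (A @ B.T + zeta) ** phi

def combined_kernel(A, B, gammas=(0.5, 0.5), sigma=100.0):
    """Convex combination of base kernels, Eq. (18):
    the gammas must be non-negative and sum to 1."""
    g1, g2 = gammas
    return g1 * rbf_kernel(A, B, sigma) + g2 * poly_kernel(A, B)
```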

3 Proposed work

In this section, we propose novel ELM-based algorithms for multi-class pattern classification applications. Their objective is to overcome the randomness of the original ELM and to enhance classification performance by utilizing multikernel and fuzzy logic concepts. The suggested algorithms, both with a single-hidden-layer feedforward neural network structure, are: a) a deterministic ELM in which feature vectors are obtained from pixel intensity values and feature mapping is performed with a multikernel approach, and b) a deterministic ELM that utilizes a fuzzy MF to determine the feature vectors, again with multikernel feature mapping. The proposed approaches are evaluated on face recognition. The block diagram of the system architecture is shown in Fig. 1. Since face images captured in real environments contain illumination variations, the training and test images are pre-processed using the illumination normalization technique of Section 2.2, i.e., a fuzzy filter applied in the DCT domain. The normalized face images are then classified using the proposed deterministic ELM-based classifiers.

3.1 Deterministic multikernel extreme learning machine (DMK-ELM)

Real-world pattern classification applications demand a fast and deterministic learning algorithm that can classify complex data. The random selection of structural parameters (input layer weights and hidden neuron biases) in ELM results in a varying, non-deterministic output solution [4, 65]: different classification results are obtained for the same database when the number of hidden nodes changes. To obtain an immutable, deterministic output solution, the input and hidden layer statistics should be calculated from invariant parameters. Moreover, for complex data in which each instance must be represented by multiple features, single-kernel feature mapping is not an optimal choice. In this paper, we therefore design a deterministic ELM integrated with the MKL approach, capable of classifying both simple and complex data through appropriate kernel selection and optimal combination coefficients. In DMK-ELM the input and hidden layer parameters are derived from the input data samples, so the obtained solution is stable: the structural parameters are analytically calculated. The feature vector used for mapping input data to output classes is the concatenation of the feature vectors obtained from the pre-defined sub-kernels.

The mathematical formulation of DMK-ELM is as follows.

The input weights IW = [iw1, …, iwη] are determined from the d-dimensional input data xi as follows:

$$ IW=\frac{1}{m}\left[\frac{x_1}{Z_1},\frac{x_2}{Z_2},\dots, \frac{x_{\eta }}{Z_{\eta }}\right] $$
(21)

where Zi is determined analytically as the norm of the ith input sample:

$$ {Z}_i={\left[\sum \limits_{j=1}^d{\left({x}_{ij}\right)}^2\right]}^{1/2} $$
(22)

and the constant m is determined by calculating the norm of all the Zi:

$$ m={\left[\sum \limits_{i=1}^{\eta}\left({Z}_i^2\right)\right]}^{1/2} $$
(23)

Since the number of hidden neurons equals the number of input samples, Eqs. (21) and (23) can be re-formulated as:

$$ IW=\frac{1}{m}\left[\frac{x_1}{Z_1},\frac{x_2}{Z_2},\dots, \frac{x_N}{Z_N}\right] $$
(24)
$$ m={\left[\sum \limits_{i=1}^N\left({Z}_i^2\right)\right]}^{1/2} $$
(25)

For d-dimensional input data, the bias of the kth hidden node is determined as the mean of the kth row of the input weight matrix IW of size N × d:

$$ {a}_k=\frac{1}{d}\sum \limits_{i=1}^d{iw}_{ki} $$
(26)

The objective function of DMK-ELM for multi-class classification problem is

$$ \operatorname{Min}:\frac{1}{2}{\left\Vert \theta \right\Vert}^2+\frac{W}{2}\sum \limits_{i=1}^N{\left\Vert {\delta}_i\right\Vert}^2 $$
$$ \mathrm{Such}\ \mathrm{that}:{\theta}^Tb\left({x}_i,{iw}_i;\gamma \right)={y}_i-{\delta}_i,\ \forall i;\quad \sum \limits_{v=1}^u{\gamma}_v=1,\ {\gamma}_v\ge 0 $$
(27)

B(⋅;  γ) = [γ1B1(⋅), …, γuBu(⋅)] is the feature vector with respect to all the kernels 1, 2, …, u.

Equation (27) can be re-formulated as

$$ \operatorname{Min}:\frac{1}{2}{\left\Vert \theta \right\Vert}^2+\frac{W}{2}\sum \limits_{i=1}^N{\left\Vert {\delta}_i\right\Vert}^2 $$
$$ \mathrm{such}\ \mathrm{that}:{\theta}^T\sum \limits_{v=1}^u{\gamma}_v{b}_v\left({x}_i,{iw}_i\right)={y}_i-{\delta}_i,\ \forall i;\quad \sum \limits_{v=1}^u{\gamma}_v=1,\ {\gamma}_v\ge 0 $$
(28)

Equation (28) is a jointly convex optimization problem, and its Lagrangian function is given by:

$$ L\left(\theta, \delta, \gamma \right)=\frac{1}{2}{\left\Vert \theta \right\Vert}^2+\frac{W}{2}\sum \limits_{\phi =1}^N{\left\Vert {\delta}_{\phi}\right\Vert}^2-\sum \limits_{\phi =1}^N\sum \limits_{k=1}^h{\alpha}_{\phi k}\left({\theta}_k\sum \limits_{v=1}^u{\gamma}_v{b}_v\left({x}_{\phi },{iw}_{\phi}\right)-{y}_{\phi k}+{\delta}_{\phi k}\right)+\tau \left(\sum \limits_{v=1}^u{\gamma}_v-1\right) $$
(29)

where τ and the N × h matrix α are Lagrange multipliers. The optimality conditions of Eq. (29) follow from the KKT theorem [9]:

$$ \frac{\partial L\left(\theta, \delta, \gamma \right)}{\partial {\theta}_k}=0\to {\theta}_k=\sum \limits_{\phi =1}^N\sum \limits_{v=1}^u{\gamma}_v{\alpha}_{\phi k}{b}_v\left({x}_{\phi },{iw}_{\phi}\right) $$
(30)
$$ \frac{\partial L\left(\theta, \delta, \gamma \right)}{\partial {\delta}_{\phi k}}=0\to {\delta}_{\phi k}=\frac{\alpha_{\phi k}}{W},\forall \phi, \forall k $$
(31)
$$ \frac{\partial L\left(\theta, \delta, \gamma \right)}{\partial {\alpha}_{\phi k}}=0\to {\theta}_k\sum \limits_{v=1}^u{\gamma}_v{b}_v\left({x}_{\phi },{iw}_{\phi}\right)-{y}_{\phi k}+{\delta}_{\phi k}=0,\forall \phi, k $$
(32)

Substituting Eqs. (30) and (31) into Eq. (32) yields:

$$ \left(B\left({iw}_i,{x}_i;\gamma \right)B\left({iw}_j,{x}_j;\gamma \right)+\frac{I}{W}\right)\alpha ={Y}^T $$
(33)

The output function of DMK-ELM is

$$ {f}_o=B\left({iw}_i,{x}_i;\gamma \right)B\left({iw}_j,{x}_j;\gamma \right){\left(\frac{I}{W}+B\left({iw}_i,{x}_i;\gamma \right)B\left({iw}_j,{x}_j;\gamma \right)\right)}^{-1}{Y}^T $$
(34)

The Deterministic multikernel ELM algorithm is outlined as follows:

[Algorithm 2: deterministic multikernel ELM (DMK-ELM) — figure not reproduced]
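A hedged NumPy sketch of DMK-ELM follows, reusing the kernel helpers from the Section 2.4 snippet. The paper writes the kernel over pairs (x_i, iw_i); since iw_i is a fixed rescaling of x_i (Eq. (24)), this sketch applies the combined kernel to the samples directly, which is our reading rather than a verbatim transcription of Algorithm 2:

```python
import numpy as np

def deterministic_parameters(X):
    """Eqs. (22)-(26): input weights and biases derived analytically from
    the N training samples, so repeated runs give identical models."""
    Z = np.linalg.norm(X, axis=1)       # Eq. (22): per-sample norms
    m = np.linalg.norm(Z)               # Eq. (25)
    IW = X / (m * Z[:, None])           # Eq. (24): row i is x_i / (m * Z_i)
    a = IW.mean(axis=1)                 # Eq. (26): bias = mean of each row
    return IW, a

def dmk_elm(Xtr, Ytr, Xte, kernels, gammas, W=10.0):
    """Kernel-form DMK-ELM (Eqs. (33)-(34)); Ytr is an (N, h) one-hot matrix."""
    Ktr = sum(g * k(Xtr, Xtr) for g, k in zip(gammas, kernels))  # Eq. (18)
    Kte = sum(g * k(Xte, Xtr) for g, k in zip(gammas, kernels))
    alpha = np.linalg.solve(np.eye(len(Xtr)) / W + Ktr, Ytr)     # Eq. (33)
    return Kte @ alpha   # class scores, Eq. (34); predict with argmax per row
```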

3.2 Deterministic extreme learning machine with fuzzy feature extraction (DMK-ELM-FFE)

In this section, DMK-ELM is further extended to DMK-ELM-FFE by incorporating fuzzy logic for feature extraction. DMK-ELM is a pixel-wise algorithm that works on the pixel intensities of the face image. In DMK-ELM-FFE, fuzzy theory is used to determine the pixel-wise association of a face image with distinct classes, handling the uncertainty introduced into face recognition by varying environmental conditions. The pixel-wise membership values extract otherwise unobserved information from the face images to a considerable extent. The method takes a facial image as input and uses the π MF to fuzzify the pixel intensities, obtaining the degree of membership of the pixels to different classes. Consider a face image of dimension U × V. It can be represented as a UV-dimensional vector T = [t1, t2, …, td], which is the universal set in this context. The size d of vector T is the number of pixels in the face image, i.e., UV. The MF treats each image as a vector and fuzzifies it. The π MF used for the fuzzification of the face image is given by:

$$ \pi \left(T;\alpha, \chi, \beta \right)=\left\{\begin{array}{ll}0, & T\le \alpha \\ {}2{\left(\frac{T-\alpha }{\chi -\alpha}\right)}^2, & \alpha <T\le {q}_1\\ {}1-2{\left(\frac{\chi -T}{\chi -\alpha}\right)}^2, & {q}_1<T\le \chi \\ {}1-2{\left(\frac{T-\chi }{\beta -\chi}\right)}^2, & \chi <T\le {q}_2\\ {}2{\left(\frac{\beta -T}{\beta -\chi}\right)}^2, & {q}_2<T\le \beta \\ {}0, & T\ge \beta \end{array}\right\} $$
(35)

where α, β, and χ represent the min, max, and mean of a specific pixel over all training images. The π MF attains the highest membership grade, 1, at the mean χ, and 0.5 at the two crossover points q1 and q2; q1 lies between the min and the mean, whereas q2 lies between the mean and the max. Beyond q1 and q2 the membership grade falls below 0.5. Pixels with membership grades below 0.5 enhance the capability of the face recognition system to assimilate variation not covered by the training data. The π MF parameters are calculated from the min, max, and mean of each pixel over all training images:

$$ {\displaystyle \begin{array}{l}\ \alpha =\mathit{\min}(TD)\\ {}\beta =\mathit{\max}\ (TD)\\ {}\chi = mean\ (TD)\end{array}} $$
(36)

where TD is a vector of the intensity values of a particular pixel across all training images. For example, if TX is the pixel matrix of all training images, α(1) is the minimum of the first row of this matrix; similarly, β and χ are the max and mean of the vector TD. After applying the π MF, the membership grades of a face image form the vector M = [m1, m2, …, md], the feature vector obtained by fuzzifying the input image vector. The pixel-wise vector T and the fuzzy feature vector M are then concatenated, and the resultant feature vector is

$$ {x}_f={x}_T\cup {x}_M $$
(37)
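A vectorized sketch of this fuzzification follows (our function name; the crossover points q1 and q2 are assumed to be the midpoints between min/mean and mean/max, which the text implies but does not state explicitly):

```python
import numpy as np

def pi_mf(T, alpha, chi, beta):
    """Pi MF of Eq. (35) over a pixel vector T; alpha/chi/beta are the
    per-pixel min/mean/max over the training images (Eq. (36))."""
    q1, q2 = (alpha + chi) / 2.0, (chi + beta) / 2.0   # assumed crossovers
    eps = 1e-12                                        # guard constant pixels
    left = np.maximum(chi - alpha, eps)
    right = np.maximum(beta - chi, eps)
    mu = np.zeros_like(T, dtype=float)
    m = (T > alpha) & (T <= q1)
    mu[m] = 2 * ((T[m] - alpha[m]) / left[m]) ** 2
    m = (T > q1) & (T <= chi)
    mu[m] = 1 - 2 * ((chi[m] - T[m]) / left[m]) ** 2
    m = (T > chi) & (T <= q2)
    mu[m] = 1 - 2 * ((T[m] - chi[m]) / right[m]) ** 2
    m = (T > q2) & (T < beta)
    mu[m] = 2 * ((beta[m] - T[m]) / right[m]) ** 2
    return mu
```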

The objective function of DMK-ELM-FFE is:

$$ \operatorname{Min}:\frac{1}{2}{\left\Vert \theta \right\Vert}^2+\frac{W}{2}\sum \limits_{i=1}^N{\left\Vert {\delta}_i\right\Vert}^2 $$
$$ \mathrm{such}\ \mathrm{that}:{\theta}^Tb\left({x}_{f,i},{iwf}_i;\gamma \right)={y}_i-{\delta}_i,\ \forall i;\quad \sum \limits_{v=1}^u{\gamma}_v=1,\ {\gamma}_v\ge 0 $$
(38)

The objective function of DMK-ELM-FFE in Eq. (38) is solved in the same way as that of DMK-ELM (Eq. (34)). The primary difference is that DMK-ELM is a pixel-wise algorithm whereas DMK-ELM-FFE is multi-modal (pixel-wise and fuzzified). The output function of DMK-ELM-FFE is:

$$ {f}_o=B\left({iwf}_i,{x}_{f,i};\gamma \right)B\left({iwf}_j,{x}_{f,j};\gamma \right){\left(\frac{I}{W}+B\left({iwf}_i,{x}_{f,i};\gamma \right)B\left({iwf}_j,{x}_{f,j};\gamma \right)\right)}^{-1}{Y}^T $$
(39)

The deterministic multikernel ELM with fuzzy feature extraction is outlined in Algorithm 3.

[Algorithm 3: DMK-ELM with fuzzy feature extraction (DMK-ELM-FFE) — figure not reproduced]
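Combining the earlier sketches, a hypothetical end-to-end use of DMK-ELM-FFE (all names come from our snippets above, and Xtr/Xte/Ytr are assumed given) could look like:

```python
import numpy as np
from functools import partial

# Xtr: (N, d) and Xte: (M, d) flattened face images; Ytr: (N, h) one-hot labels.
# Per-pixel statistics come from the training images only (Eq. (36)).
alpha, beta, chi = Xtr.min(0), Xtr.max(0), Xtr.mean(0)

# Eq. (37): concatenate raw pixels with their membership grades
Xf_tr = np.hstack([Xtr, np.vstack([pi_mf(t, alpha, chi, beta) for t in Xtr])])
Xf_te = np.hstack([Xte, np.vstack([pi_mf(t, alpha, chi, beta) for t in Xte])])

# Classify with the deterministic multikernel solver of Sec. 3.1
scores = dmk_elm(Xf_tr, Ytr, Xf_te,
                 kernels=[partial(rbf_kernel, sigma=100.0), poly_kernel],
                 gammas=(0.4, 0.6), W=10.0)
pred = scores.argmax(axis=1)
```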

3.3 Analysis of proposed approaches

We conclude this section by comparing the proposed algorithms with ELM, KELM (with RBF and polynomial kernel functions), and OMKELM. The objective of DMK-ELM is to achieve a deterministic and accurate solution for real-world classification problems. The empirical results obtained with ELM vary between runs, since ELM performance depends on the input weights, hidden layer biases, and number of hidden neurons: the input weights and biases are projected arbitrarily, and the hidden layer size is adjusted with respect to classification accuracy. In DMK-ELM these parameters are analytically calculated from the input data. DMK-ELM is computationally efficient, as the input weights and biases are obtained by norm and mean operations. It is also more accurate than KELM, which uses a single kernel to map the feature vectors and is therefore less able to discriminate inter-class variance.

To make DMK-ELM more accurate, we use the approximation theory of fuzzy logic to design DMK-ELM-FFE, in which the feature vectors are obtained by a fuzzification operation (π MF). It is computationally fast because the parameters of the π MF are analytically calculated through min, max, and mean operations on the input data. It performs better than OMKELM because the feature vectors are powerful enough to extract hidden information from the training data that reveals the inter-class variance.

4 Experimental evaluation

This section demonstrates the utility of the proposed classification approaches. Experiments are performed on the prominent face databases AT&T [50], Yale [8], Georgia Tech [12], and AR [68]; brief descriptions are given in the following subsection. The results of the introduced techniques are compared with OMKELM, KELM, and ELM.

4.1 Databases

1) AT&T database: consists of 400 gray-scale images depicting 40 subjects (male and female). The images are captured with distinct expressions and details, e.g., surprise, happiness, neutral, and open/closed eyes. Each image is 112 × 92. Sample face images are shown in Fig. 2.

Fig. 1 Block diagram of proposed system

Fig. 2 Sample face images with different expressions from AT&T database

Fig. 3 Face images of Yale database with lighting variations

Fig. 4 Face images depicting persons of Georgia Tech database

Fig. 5 Face images of AR database with illumination variations

Fig. 6 Comparison of classification accuracy of DMK-ELM, DMK-ELM-FFE, OMKELM, KELM and ELM based on percentage recognition rate

2) Yale database: comprises facial images of 15 persons. The images of each person span variations in facial details (surprised, winking, center-light, left-light) and illumination conditions, depicted in 11 facial images per person. Each image is 320 × 243. Sample images are shown in Fig. 3.

3) Georgia Tech database: comprises 750 color facial images depicting 50 individuals. The images vary in lighting conditions, appearance, and facial expressions. Each image is 120 × 90. Sample images are shown in Fig. 4.

4) AR database: consists of 4000 images representing 70 male and 56 female subjects. Each individual is depicted in 26 facial images with varying occlusions, face configurations, and lighting conditions. We employ a subset of this database depicting 100 individuals; for each subject, 14 images of dimension 165 × 120 with changing expressions and environmental conditions are considered. Sample images are shown in Fig. 5.

4.2 Experimental setup

In our simulations, we implement the variants of ELM on the AT&T, Yale, Georgia Tech (GT), and AR databases. For the performance results, the cardinality of the training set is increased one image at a time, up to 8 images per subject. For example, in the AT&T database the minimum training set contains 40 images and the maximum 320 images; correspondingly, the testing set contains at most 360 and at least 80 images. For the ELM techniques utilizing the kernel formulation, we use the RBF kernel:

$$ K\left(a,{a}_i\right)=\exp \left(-\sigma {\left|a-{a}_i\right|}^2\right) $$
(40)

and polynomial kernel:

$$ K\left(a,{a}_i\right)={\left(a\cdot {a}_i+\zeta \right)}^{\varphi } $$
(41)

Here σ, ζ, and φ are kernel parameters that are experimentally adjusted with respect to classification accuracy.

For ELM with random input weights iw and bias b, we use the sigmoidal activation function:

$$ \psi \left(x, iw,b\right)=\frac{1}{\left(1+\exp \left(-\left( iw\cdot x+b\right)\right)\right)} $$
(42)

x is the image vector; its dimension for the AT&T database is 10,304. We evaluate the performance of ELM with random weights for hidden node counts η = (100, 200, …, 1000) and regularization coefficients W = (1, 5, 10, …, 100, 120). After trying a range of values for W and the kernel parameters, the best classification results are obtained for W = 10, σ = 100, φ = 2, ζ = 1, and η = 1000. In OMKELM the kernel function used for the performance calculation is a convex combination of the RBF and polynomial kernels:

$$ K\left(a,{a}_i\right)={\gamma}_1{K}_1\left(a,{a}_i\right)+{\gamma}_2{K}_2\left(a,{a}_i\right),\sum \limits_{i=1}^2{\gamma}_i=1 $$
(43)

The combination coefficients of the different kernels are database dependent, with different optimal values for distinct databases.

4.3 Performance measures for classification

In our experiments, we use several metrics to characterize the performance of the proposed algorithms on the face databases. The main metrics are:

  • Testing accuracy: the ratio of correctly classified instances to the total number of testing instances (testing dataset):

$$ accuracy=\left(1-\frac{m\_ class}{Total}\right)\times 100 $$

where m_class is the number of instances misclassified into another class and Total is the cardinality of the testing dataset.

  • Precision: the proportion of predicted positive examples that are actually true positives. It measures the correctness of the predicted labels and is calculated as follows [51]:

\( precision=\frac{\sum \limits_{i=1}^h{TP}_i}{\sum \limits_{i=1}^h\left({TP}_i+{FP}_i\right)} \) where h is the number of classes, TPi denotes the true positives, and FPi the false positives of class i.

  • Recall: the proportion of true positive examples that are predicted to be positive. It measures the number of correct labels predicted by a classifier and is calculated as follows [51]:

$$ recall=\frac{\sum \limits_{i=1}^h{TP}_i}{\sum \limits_{i=1}^h\left({TP}_i+{FN}_i\right)} $$

where FNi denotes the false negatives of class i.

  • F-measure: an accuracy measure defined as the harmonic mean of precision and recall; it relates the positively labeled data to the classifier's actual predictions based on a per-class average [51].

$$ F- measure=\frac{2\times precision\times recall}{precision+ recall} $$
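A small NumPy sketch of these micro-averaged metrics (our helper, assuming integer class labels in {0, …, h−1}):

```python
import numpy as np

def micro_metrics(y_true, y_pred, h):
    """Micro-averaged precision, recall and F-measure of Sec. 4.3."""
    tp = sum(np.sum((y_pred == c) & (y_true == c)) for c in range(h))
    fp = sum(np.sum((y_pred == c) & (y_true != c)) for c in range(h))
    fn = sum(np.sum((y_pred != c) & (y_true == c)) for c in range(h))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```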

4.4 Empirical results

Empirical results obtained by implementing the competing and proposed techniques on the AT&T, Yale, GT, and AR databases, using testing accuracy as the performance metric, are reported in Tables 1, 2, 3, and 4, respectively. The experimental results on the face databases using precision, recall, and F-measure as performance metrics are reported in Tables 5, 6, 7, and 8. The results are analyzed for different training set cardinalities.

Table 1 Performance results based on testing accuracy for different classifiers on AT & T database
Table 2 Performance results based on testing accuracy for different classifiers on Yale database
Table 3 Performance results based on testing accuracy for different classifiers on GT database
Table 4 Performance results based on testing accuracy for different classifiers on AR database

4.4.1 Classification results on AT& T database

The AT&T database incorporates very little illumination variation, so no pre-processing is done on it. The kernel combination coefficients γ1 and γ2 are both 0.5. On this database, the accuracy of every classifier increases with the size of the training set. The recognition results show that the kernel formulations of ELM outperform ELM with random input parameters: for example, when the training set contains 240 images, the recognition rates of ELM, KELM (RBF), and KELM (polynomial) are 93.75%, 96.25%, and 96.80%, respectively. Further, the multikernel results (OMKELM) are better than single-kernel ELM, although the results of KELM with the polynomial kernel are comparable to OMKELM. Compared with OMKELM, there is a significant improvement in accuracy with DMK-ELM and DMK-ELM-FFE: when two images per subject are used for training, the recognition rate increases by 9% with DMK-ELM and 12.5% with DMK-ELM-FFE. Figure 6(a) shows the performance comparison of the different techniques on this database. To establish the efficacy of the proposed algorithms, all classifiers are also evaluated on the other metrics for the AT&T database, with values listed in Table 5. Comparing the precision, recall, and F-measure results shows that the proposed classifiers are significantly superior to the other existing ELM-based classifiers.

Table 5 Performance results based on precision, recall and F-measure for different classifiers on AT & T database

4.4.2 Classification results on Yale database

The face images in the Yale database are normalized using the fuzzy filter, as the database contains large variations in illumination and expression. To reduce the effect of lighting variations, 21 LF-DCT coefficients are modified with a fuzzy constant of 0.5. The kernel combination coefficients for the RBF and polynomial kernels are 0.4 and 0.6, respectively. The recognition rate on this database is higher than on the AT&T database in most cases; for example, with eight training images per subject, 100% recognition is achieved with KELM (polynomial), OMKELM, DMK-ELM, and DMK-ELM-FFE. The accuracy results again show that the multikernel technique is more accurate than a single kernel: comparing OMKELM with KELM (polynomial), accuracy increases by 2% when four images per subject are used for training. The proposed algorithms give better results than OMKELM; for example, with seven training images per face identity, the proposed algorithms reach 100% accuracy while OMKELM achieves 98.67%. On average, the proposed approaches improve the accuracy rate by more than two percentage points. Figure 6(b) compares the percentage recognition rates of the different approaches on the Yale database. The precision, recall, and F-measure results of the proposed approaches are more promising than those of the comparative ELMs for all training and test set sizes. In Table 6, when seven or eight images per identity are used for training, precision, recall, and F-measure all equal 1, meaning every test image is classified into its correct class.

Table 6 Performance results based on precision, recall and F-measure for different classifiers on Yale database
Table 7 Performance results based on precision, recall and F-measure for different classifiers on GT database

4.4.3 Classification results on GT database

The GT database contains color images, so before evaluation these are converted to gray scale without any further pre-processing. The recognition rate on this database is much lower than on the other databases, irrespective of training size. The maximum accuracy is achieved when the training set contains 400 images, i.e., eight images per subject (there are 50 subjects with 15 images each). The results confirm that OMKELM performs better than ELM and single-kernel KELM. The recognition results of OMKELM are comparable with DMK-ELM, but DMK-ELM-FFE improves the accuracy rate by 10% irrespective of training set size. Comparing the proposed approaches, DMK-ELM-FFE is more accurate than DMK-ELM for every training set cardinality; for example, with two training images per subject, the accuracy increases by 22% with the DMK-ELM-FFE classifier. Figure 6(c) shows the graphical comparison, based on accuracy rate, of the proposed approaches with existing techniques on the GT database. The precision, recall, and F-measure for the GT database are lower than for the other databases, although the values for the proposed algorithms remain higher than those of the existing ELM variants. Analyzing the results in Tables 3 and 7 for the GT database, we conclude that, of the proposed algorithms, DMK-ELM-FFE is superior to DMK-ELM.

4.4.4 Classification results on AR database

Since the AR database contains color images with non-uniform lighting variations, the images are normalized and converted to gray scale. The fuzzy filter modifies 91 LF-DCT coefficients with a fuzzy constant of 0.5, and the kernel coefficients γ1 and γ2 are 0.4 and 0.6, respectively. The accuracy results show that OMKELM performs better than the other existing classification techniques: with a training set of 200 images, its accuracy rate is 11%, 20%, and 4% higher than ELM, KELM (RBF), and KELM (polynomial), respectively. The recognition rates of the proposed schemes exceed OMKELM; for example, with four images per identity, the accuracy increases by more than 4.5 percentage points. Among the proposed approaches, DMK-ELM-FFE performs better than DMK-ELM; for example, with a training set of 500 images, the accuracy rate increases by 2%. Figure 6(d) displays the graphical comparison of the different classifiers on the AR database based on recognition rate. Table 8 shows the precision, recall, and F-measure of the different classifiers on the AR database. The empirical results on all performance measures show that the proposed non-iterative DMK classifiers are more efficient than the other state-of-the-art classifiers.

Table 8 Performance results based on precision, recall and F-measure for different classifiers on AR database

5 Conclusions

This paper addresses the face recognition problem efficiently and deterministically by developing a multikernel-learning-based deterministic ELM with fuzzy feature extraction. The foundation of the proposed approach is the ELM classifier, whose structural parameters are analyzed in detail. Although ELM is an efficient classifier, it cannot be employed in classification applications that demand a deterministic, immutable output solution; to make it deterministic, the structural parameters of ELM are analytically devised. Furthermore, to make it more efficient, we employ a data-dependent multikernel approach, in which the kernels and their combination coefficients are determined experimentally by investigating the nature of the problem. To make it more accurate, we use the fuzzy π MF, which works on pixel intensities to find their association with different classes. The results show that the kernel model of ELM is more accurate than ELM in most cases, and that with a single kernel the polynomial kernel gives more promising results than the RBF kernel. For all the databases, OMKELM is more accurate than the single-kernel approach, although the improvement is database dependent. Overall, the presented approaches outperform ELM, KELM, and OMKELM, and DMK-ELM-FFE performs better than DMK-ELM on all databases. The differences in performance between the proposed approaches are larger when the databases are not pre-processed.

The objective of the proposed methods is to overcome the randomness of ELM and to make it more efficient and accurate through an optimized kernel and the fuzzification of pixels. Although the proposed methods outshine other competitive algorithms, there remains research worth exploring in the future, such as developing multilayer variants of the proposed algorithms and designing a potent algorithm that can handle extremely large training sets arising from the combination of large-scale kernel functions.