1 Introduction

3D face recognition is widely used throughout the world due to the availability of easily collectable 3D data and the computational capabilities offered by highly economical graphical processing units (GPUs). However, acquiring 3D images is harder than acquiring 2D scans. Therefore, the number of images in public databases is limited [25, 86, 90]. In [90], a high-resolution spontaneous 3D dynamic facial expression database is presented. This work supports the exploration of 3D spatiotemporal features in subtle facial expressions. In [86], high-resolution data acquisition is performed using a 3D dynamic imaging system. The database contains 101 subjects, six unique expressions, 606 2D texture videos, 606 3D model sequences, and approximately 60,600 3D models. In [25], 3D face recognition is improved using a multi-instance enrollment representation. The experiments were performed on the ND-2006 3D face dataset [57], which consists of 13,450 3D images. Various techniques are available in the literature for handling 3D mesh data, RGB-D images, or point cloud data [23, 64, 99]. ElSayed et al. [23] presented a robust method for skin detection using 3D colored point clouds. The method is extended to the 3D face detection problem by building a weighted graph over the initial 3D colored point clouds. A linear programming algorithm is used to learn a data-mining-based prediction model and classify the graph regions into skin versus non-skin. Zhou et al. [99] presented a dense 3D face decoding method based on a non-linear 3D morphable model (3DMM), trained as joint texture and shape autoencoders using direct mesh convolutions. It is shown in [99] how these autoencoders can train very lightweight models that perform Coloured Mesh Decoding (CMD) at over 2,500 FPS. Pham et al. [64] presented a novel, robust hybrid 3D face tracking framework for RGB-D videos. It tracks head pose and face actions without any user intervention or recalibration.

Some well-known methods for handling 3D features are slower than deep learning techniques [48, 50, 76]. Spreeuwers [76] presented a 3D face registration technique based on an intrinsic coordinate system of the face. Principal component analysis and linear discriminant analysis (PCA-LDA) are used for feature extraction, with a likelihood-ratio matching score. The overall method takes 2.5 s per image, which is slow compared to the technique proposed in this paper. Li et al. [50] presented a 3D face recognition technique that extends a SIFT-like matching framework to mesh data. Lei et al. [48] represented the facial scan using keypoint-based multiple triangle statistics (KMTS), which are robust to partial facial data, pose variations, and facial expressions. A two-phase weighted collaborative representation classification (TPWCRC) approach is used for classification. Experiments were performed on the Bosphorus [71], UMBDB [17], GavabDB [55], SHREC 2008 [82], BU-3DFE [85], and FRGC v2.0 [66] datasets. 3D face recognition faces various challenges, viz. pose, occlusion, expression, and lighting. These variations affect the intra-class recognition capability of a 3D face recognition system [41].

To address the above-mentioned issues, a voxel-based 3D face recognition system is proposed that utilizes the basic concepts of locality preserving projections (LPP), triplet loss, simulated annealing, and game theory. LPP is used to remove the sparseness of meshes with a non-uniform number of voxels. LPP is chosen over PCA for representing high-dimensional data in a low dimension because LPP is computed by an optimal linear approximation to the eigenfunctions of the Laplace-Beltrami operator on the manifold, whereas PCA merely projects the data along the directions of maximal variance [33]. Triplet loss training reduces the distance between intra-class faces and maximizes the distance between faces of different classes, which increases the reliability of the system in face identification. Simulated annealing minimizes the error rate using a probability-based random threshold value. The generator and discriminator, concepts borrowed from game theory, help in the correct selection of the triplets generated under simulated annealing. The combined effect of these techniques makes the proposed method robust towards occlusion in 3D face recognition.

The main contributions of this paper are as follows.

  1. The proposed approach utilizes a generator and discriminator for voxel-based face recognition.

  2. A deep learning and simulated annealing based framework is proposed for voxel-based 3D occlusion invariant face recognition (V3DOFR).

  3. The proposed approach is validated using three standard datasets with a significant amount of pose and occlusion variation.

  4. The proposed technique is compared with other state-of-the-art methods for voxel-based face recognition, occlusion invariant face recognition, 3D landmark-based face recognition, and 3D mesh-based face recognition.

The remainder of this paper is organized as follows. Section 2 presents background work in the field of face recognition. The proposed research framework is presented in Section 3. Section 4 discusses the experimental results in detail. Section 5 outlines future work. Concluding remarks are drawn in Section 6.

2 Background

In this section, the basic concepts of deep learning-based face recognition, voxelization, locality preserving projections (LPP), triplet loss, game theory, and simulated annealing are discussed, followed by a review of related work.

2.1 Preliminaries

2.1.1 Deep learning-based 3D face recognition

Training and testing are the two main phases of deep learning-based 3D face recognition (see Fig. 1). The training phase has two sub-phases, namely pre-processing and deep learning. During pre-processing, the 3D face is acquired in one of three forms: an RGB-D depth image, a 3D face mesh, or a 3D point cloud [61]. Once the 3D face is acquired, face alignment and registration are performed for maximum utilization of the available information. There are three ways of proceeding with the aligned 3D face. First, coarse-detail facial landmark detection is one of the fastest methods, but facial landmarks lack the finer details of the face. To overcome this, voxelization is used, which captures the fine details of a 3D face; however, voxelization is slower than landmark detection. The third method is to use the 3D object as a whole, in the form of an RGB-D image, a mesh, or a point cloud. After pre-processing, a deep learning model is trained. Different types of models are available, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), autoencoders (AEs), generative adversarial networks (GANs), and reinforcement learning (RL) [2, 3]. CNNs and RNNs are used in supervised learning for images and text, respectively. AEs and GANs are used in semi-supervised learning, and RL is used in unsupervised learning.

Fig. 1

A general framework of deep learning-based 3D face recognition

There are two sub-phases in the testing phase, namely validation and verification. In validation, the testing dataset is processed through alignment and face registration, and the trained deep learning model is used to predict the array of classes corresponding to all images in the testing dataset. The accuracy of the face recognition model is computed as follows [37].

$$ \mathrm{Accuracy}=\frac{\text{Number of images correctly predicted}}{\text{Total number of images}}\times 100 $$
(1)

In verification, the query image is processed through face alignment and registration. The trained deep learning model is used to predict the class of the image. Finally, a similarity score is calculated against the images of the predicted class [14].

$$ \mathrm{Similarity\ Score}=\sum \limits_{i=1}^{N}\frac{corr\left(\text{Query Image},\ I_i\right)}{N} $$
(2)
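A minimal NumPy sketch of Eqs. (1) and (2) follows. Since [37] and [14] do not fix the correlation measure, Pearson correlation over flattened images is assumed here for corr(·,·); the function names are illustrative.

```python
import numpy as np

def accuracy(predicted, actual):
    """Eq. (1): percentage of images whose class is predicted correctly."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.mean(predicted == actual) * 100.0

def similarity_score(query_image, class_images):
    """Eq. (2): mean correlation between the query image and the N images
    of the predicted class (Pearson correlation is an assumption here)."""
    q = query_image.ravel()
    corrs = [np.corrcoef(q, im.ravel())[0, 1] for im in class_images]
    return float(np.mean(corrs))
```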

2.1.2 Voxelization

Voxel representation is widely used in multiple fields, viz. computer graphics, computational science, real-time computer vision, and 3D shape matching. Dynamic modeling requires voxelization, or real-time scan conversion. In this process, the triangular mesh of the input surface is used to create the voxel representation [60].

Let point O be an arbitrary origin, and let G be a polyhedron with triangular faces t1, t2, t3, …, tn. Then H = {H1, H2, H3, …, Hn} is a covering of G with 3D tetrahedra, where Hi is the tetrahedron defined by O and the triangular face ti [58]. A point A is considered to be inside the polyhedron G iff

$$ \sum_i \operatorname{sign}\left(H_i\right)\,\mathrm{incl}\left(H_i,A\right)>0 $$
(3)

where sign(Hi) is the orientation sign of the tetrahedron Hi (positive when its signed volume is positive), and incl(Hi, A) is true when A ∈ Hi.
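The inclusion test of Eq. (3) can be sketched directly with signed tetrahedron volumes. The following NumPy code is illustrative rather than the implementation of [58]; `triangles` is assumed to be an (n, 3, 3) array holding the vertices of the faces ti.

```python
import numpy as np

def signed_volume(p0, p1, p2, p3):
    """Signed volume of the tetrahedron (p0, p1, p2, p3)."""
    return np.dot(np.cross(p1 - p0, p2 - p0), p3 - p0) / 6.0

def tetra_contains(a, v0, v1, v2, v3):
    """incl(H_i, A): A lies in the tetrahedron iff the four sub-tetrahedra
    formed by replacing one vertex with A all share the same orientation."""
    s = [signed_volume(a, v1, v2, v3),
         signed_volume(v0, a, v2, v3),
         signed_volume(v0, v1, a, v3),
         signed_volume(v0, v1, v2, a)]
    return all(x >= 0 for x in s) or all(x <= 0 for x in s)

def point_in_polyhedron(a, origin, triangles):
    """Eq. (3): sum of sign(H_i) * incl(H_i, A) over all face tetrahedra."""
    total = 0.0
    for t in triangles:
        total += np.sign(signed_volume(origin, *t)) * tetra_contains(a, origin, *t)
    return total > 0
```

In practice, mesh libraries wrap tests of this kind; for example, trimesh's `mesh.voxelized(pitch)` returns a filled voxel grid for a triangular mesh directly.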

2.1.3 Locality preserving projections

Suppose the data points are high-dimensional (n-dimensional) vectors. The intrinsic geometry of the data is used to reduce the dimensionality of these large vectors. Locality preserving projection (LPP) builds a graph that encodes the neighborhood information of the dataset [33] and solves a linear dimensionality reduction problem. LPP is a linear approximation of the non-linear Laplacian Eigenmap and proceeds as follows [9, 33].

In the locality preserving projections algorithm, a graph G with P nodes is built by connecting nodes according to the Euclidean norm and the k-nearest neighbors of each node (Step 1). In Step 2, weights are assigned to the edges. In the final step, the l-dimensional embedding is computed by solving the generalized eigenvector problem.
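These three steps can be condensed into a short sketch. It assumes heat-kernel edge weights and the generalized eigenproblem formulation of [33]; the function name and the small regularizer are illustrative choices, not from a released implementation.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def lpp(X, n_components=3, k=16, t=1.0):
    """LPP for an (n, d) data matrix X: kNN graph, heat-kernel weights,
    then the generalized eigenproblem X^T L X a = lambda X^T D X a."""
    # Step 1: symmetric k-nearest-neighbour graph on Euclidean distances.
    A = kneighbors_graph(X, k, mode="distance", include_self=False)
    A = A.maximum(A.T).toarray()
    # Step 2: heat-kernel weights on connected pairs.
    W = np.where(A > 0, np.exp(-A ** 2 / t), 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Step 3: the smallest generalized eigenvectors give the projection.
    M1 = X.T @ L @ X
    M2 = X.T @ D @ X + 1e-8 * np.eye(X.shape[1])  # regularized for stability
    vals, vecs = eigh(M1, M2)
    P = vecs[:, :n_components]      # (d, l) projection matrix
    return X @ P                    # (n, l) embedding
```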

2.1.4 Triplet loss

Triplet loss uses face embeddings as vectors. It chooses three embeddings, namely Anchor (A), Positive (P), and Negative (N), from the dataset such that A and P belong to the same class and N belongs to a different class. A, P, and N are selected randomly, and the resulting triplets fall into three categories, viz. easy, hard, and semi-hard triplets. Easy triplets (see Eq. 4) have a loss of 0. In hard triplets (see Eq. 5), the negative embedding is closer to the anchor embedding than the positive embedding. In semi-hard triplets (see Eq. 6), the negative embedding is not closer than the positive embedding but the loss is still positive [73].

$$ d\left(A,P\right)+\mathrm{margin}<d\left(A,N\right) $$
(4)
$$ d\left(A,N\right)<d\left(A,P\right) $$
(5)
$$ d\left(A,P\right)<d\left(A,N\right)<d\left(A,P\right)+\mathrm{margin} $$
(6)

The loss of a triplet (A, P, N) is defined as [73].

$$ L=\max \left(d\left(A,P\right)-d\left(A,N\right)+\mathrm{margin},0\right) $$
(7)

The main objective of triplet loss training is to push d(A, P) → 0 and d(A, N) > d(A, P) + margin, thereby minimizing the loss. Figure 2 illustrates the concept: three images are given as input in the form of A, P, and N to a deep learning model that produces the triplet embeddings used in triplet loss training.
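A plain NumPy sketch of Eqs. (4)-(7), assuming squared Euclidean distance between L2-normalized embeddings:

```python
import numpy as np

def _d2(x, y):
    """Squared Euclidean distance between L2-normalized embeddings."""
    x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)
    return float(np.sum((x - y) ** 2))

def triplet_loss(a, p, n, margin=0.2):
    """Eq. (7): L = max(d(A, P) - d(A, N) + margin, 0)."""
    return max(_d2(a, p) - _d2(a, n) + margin, 0.0)

def triplet_category(a, p, n, margin=0.2):
    """Classify a triplet as easy, hard, or semi-hard (Eqs. 4-6)."""
    d_ap, d_an = _d2(a, p), _d2(a, n)
    if d_ap + margin < d_an:
        return "easy"        # Eq. (4): loss is already zero
    if d_an < d_ap:
        return "hard"        # Eq. (5): negative closer than positive
    return "semi-hard"       # Eq. (6): loss positive but within the margin
```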

Fig. 2

Concept of triplet loss technique

2.1.5 Game theory

Game theory underpins generative adversarial networks (GANs). GANs [29] are a type of generative model. Let Pdata(I) be the distribution of real images I and PZ(z) be the distribution of the input noise. The generator G(z) captures the Pdata distribution through an adversarial process, while the discriminator D distinguishes between real images and generated images. The adversarial process is formulated as a minimax game (see Eq. 8).

$$ \min_G \max_D\; E_{I\sim P_{data}}\left[\log D(I)\right]+E_{z\sim P_Z}\left[\log \left(1-D\left(G(z)\right)\right)\right] $$
(8)
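A minimal PyTorch sketch of one step of the Eq. (8) game follows. It assumes D outputs probabilities in (0, 1); the non-saturating generator objective shown is the common practical substitute for descending the log(1 − D(G(z))) term.

```python
import torch

def gan_losses(D, G, real_images, z):
    """One step of the Eq. (8) minimax game for given networks D and G."""
    fake_images = G(z)
    # Discriminator ascends E[log D(I)] + E[log(1 - D(G(z)))].
    d_loss = -(torch.log(D(real_images)).mean()
               + torch.log(1 - D(fake_images.detach())).mean())
    # Non-saturating generator objective, the usual practical stand-in.
    g_loss = -torch.log(D(fake_images)).mean()
    return d_loss, g_loss
```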

Theoretically, the global optimum PG = Pdata [30] is reached when the minimax game attains a Nash equilibrium [69] through the adversarial process. Recently, AttGAN [34] achieved facial attribute editing as well as gender and age transformation (see Fig. 3(a) and (b)), viz. reconstruction of blond hair, eyeglasses, changing expression, makeup, etc. Face aging with conditional GANs [4] achieved remarkable results in generating faces of different ages from a single image (see Fig. 3(c)).

Fig. 3

a Comparison of transforming attributes using VAE-GAN [46], IcGAN [63], and AttGAN [34], b Gender and age transformation using AttGAN, and c Transformation of face to any age using Age Conditional GAN [4]

2.1.6 Simulated annealing

The simulated annealing (SA) optimization algorithm is based on the metallurgical practice in which a material is heated to a high temperature and then cooled gradually. At high temperature, the shifting of atoms becomes unpredictable; this helps eliminate impurities, as the material settles into a pure crystal form after cooling. In terms of optimization, SA introduces a degree of randomness that may take the solution from better to worse in an attempt to escape local minima, thereby increasing the probability of reaching the global optimum [44]. The applications of SA are diverse [7, 13, 52], typically through single-criterion optimization [8].

Figure 4 shows four states A, B, C, and D with different energies. The target is to find the maximum-energy path that traverses every state exactly once, using the simulated annealing algorithm. For illustration, the four states are connected in two ways, clockwise and anticlockwise. The total sum of energies for the clockwise traversal of all states is 35, whereas the total for the anticlockwise traversal is 70. Hence, the anticlockwise traversal is selected as the maximum-energy path.

Fig. 4

State energy maximization using simulated annealing

In a simulated annealing algorithm, the initial state is chosen randomly (Step 1), and Es denotes the current state energy. The new state becomes the current state if the energy change is positive; otherwise, the new state is accepted with probability eΔE/T (Step 2). During the process, the temperature T is decreased gradually so that the solution converges towards the global optimum.
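A compact sketch of this procedure for the energy-maximization setting of Fig. 4; the `neighbour` and `energy` functions are problem-specific placeholders, and the cooling schedule is an illustrative choice.

```python
import math
import random

def simulated_annealing(state, neighbour, energy,
                        t_start=1.0, t_end=1e-3, cooling=0.95):
    """Steps 1-2 above: maximize energy, accepting worse states with
    probability exp(dE / T) so the search can escape local optima."""
    e_s = energy(state)                      # E_s: current state energy
    t = t_start
    while t > t_end:
        candidate = neighbour(state)         # propose a new state
        e_new = energy(candidate)
        d_e = e_new - e_s                    # positive when the move improves
        if d_e > 0 or random.random() < math.exp(d_e / t):
            state, e_s = candidate, e_new
        t *= cooling                         # gradual cooling of T
    return state, e_s
```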

2.2 Related work

Kim et al. [42] proposed a novel 3D face recognition algorithm using a facial expression augmentation technique along with a deep convolutional neural network. They used 2.5D depth images as 3D face images and applied transfer learning on the FRGC [66], CASIA 3D [12], BU-3DFE [85], Bosphorus [71], and 3D-TEC [78] datasets. They presented a technique for augmenting facial expressions from a single 3D face image, with a VGG-16 model used to implement the transfer learning. They reported a rank-1 face recognition accuracy of 99.2% on Bosphorus. However, the method would fail on voxel-based, mesh-based, or landmark-based 3D face recognition when the input data is sequential in nature.

Gilani et al. [101] proposed a technique to generate millions of 3D facial images of unique identities by simultaneously interpolating between the facial identity and facial expression spaces, closing the gap between the sizes of 2D and 3D datasets. There may be a loss of depth information due to the conversion of 3D into 2D images, but the augmentation makes up for this loss. In both closed- and open-world recognition scenarios, the proposed FR3DNet outperforms existing face recognition algorithms. The main advantage of [101] is that it helps in building bigger 3D datasets than the standard ones. This work is done on 3D images generated from 2D images; similar to [42], the method fails on voxel-based face recognition.

Korshunov and Marcel [45] proposed a public dataset, namely Deepfake, generated from the VidTIMIT database [70]. The main objective of this dataset is to help in generating swapped faces of two people from videos using generative adversarial networks (GANs). This work is done on 2D face videos. They emphasized that video quality is significantly impacted by the training and blending parameters. It is observed that VGG-Net and FaceNet are put in jeopardy by Deepfake videos; the error rate obtained from FaceNet was 8.97%. The Deepfakes generated by GANs are challenging for both face detection and face recognition systems, and GAN-based face swap technology poses a greater challenge to 2D face recognition. If extended to voxel face videos, this work would be equally challenging for 3D face recognition.

Gecer et al. [28] came up with a novel 3D morphable model (3DMM) fitting strategy based on generative adversarial networks (GANs) and a differentiable renderer. The novel cost function integrates various content losses on deep identity features from a face recognition network. High-fidelity 3D face reconstruction was achieved using non-linear energy-based cost optimization, a GAN texture model, a differentiable renderer, the cost function, and model fitting techniques; the Adam solver was used for optimization during fitting. Abrevaya et al. [1] present a novel GAN-based 3D face modeling architecture that combines a 3D generator with a 2D discriminator, leveraging conventional CNNs. The feature loss, identity loss, and expression loss are calculated by the discriminator to give the real-or-fake output. Four publicly available 3D face datasets were used, namely BU-3DFE [85], Bosphorus [71], BP4D-Spontaneous [90], and BP-4DFE [89].

Patil et al. [62] present a survey on 3D face recognition. It provides an extensive review of 3D face recognition in terms of feature detection, classifiers, 3D face databases, and 3D facial data acquisition techniques, viz. stereo acquisition, laser beam scanning, and fringe pattern acquisition using structured light. Different 3D face representations, namely point cloud, 3D mesh, and depth image representations, are discussed. Different registration techniques for 3D faces, such as the iterative closest point (ICP) algorithm, spin images, simulated annealing, and the intrinsic coordinate system, are discussed with their pros and cons. Wu et al. [81] extracted features from the whole 3D model. A volumetric representation is used for 3D object shapes: based on the mesh surface, each voxel location contains a binary value of 0 or 1 on a grid of size 30³. Voxel grids hold vast information in terms of facial density and texture, and this method is better than the depth image technique. Moreover, voxels can be directly used in training 3D convolutional neural networks (3D CNNs) and 3D generative adversarial networks (3D GANs).

Rathgeb et al. [68] presented an overview of the impact and detection of facial beautification in face recognition. Plastic surgery, facial retouching, and facial cosmetics are common these days, and such beautification poses an enormous challenge to face-recognition-based biometrics, which are used in mobile phone unlocking, payment applications, automated border control, etc. However, all the work discussed in that paper is 2D in nature and lacks discussion of 3D face recognition and its challenges. Hassaballah and Aly [31] discussed the significant challenges faced while building a face recognition system for the real world. 3D face recognition and video-based face recognition are discussed in that work; however, deep learning-based techniques are not mentioned at all.

Scherhag et al. [72] presented a survey of face recognition systems under morphing attacks. The improved generalizability of deep face recognition systems has increased their vulnerability to such attacks. Morphing of 2D faces using correspondence, warping, and blending is discussed, but morphing of 3D face images is not. Ding and Tao [19] discussed 2D image-based pose-invariant face recognition (PIFR). PIFR methods are grouped into four categories, viz. pose-robust feature extraction approaches, face synthesis approaches, multi-view subspace learning approaches, and hybrid approaches. The main challenge of face recognition under different poses is self-occlusion caused by non-frontal poses; other challenges are image resolution, illumination, and expression.

Bowyer et al. [11] presented a survey of 3D and multi-modal 3D + 2D face recognition. Most of the techniques mentioned in the survey work on feature vectors or range images, and all the surveyed studies used small 3D face datasets. In modern times, datasets of enormous size can be handled using deep learning techniques and the advent of GPUs. Cho et al. [16] proposed a graph-structured module called the Relational Graph Module (RGM), which focuses on high-level relational information between facial components. It addresses the heterogeneous face recognition (HFR) problem, a type of face recognition in which faces are matched across two domains, viz. near-infrared (NIR), visible light (VIS), or the sketch domain. The RGM embeds spatially correlated feature vectors into graph node vectors and performs relation modeling between the different nodes of the graph. In addition to the RGM, a Node Attention Unit (NAU) was used to perform node-wise recalibration. This model is able to handle HFR databases.

Huang et al. [38] developed an adaptive curriculum learning loss (CurricularFace) for deep face recognition. CurricularFace embeds the idea of curriculum learning into a loss function to achieve a novel training strategy that addresses easy samples in the early training stage and hard samples in the later stage, assigning different importance to different samples based on their difficulty. The datasets used were CASIA-WebFace [84], refined MS1MS2 [18], LFW [47], CFP-FP [74], CPLFW [97], AgeDB [56], CALFW [98], IJB-B [80], IJB-C [53], and MegaFace [40]. Bi et al. [10] investigated the conditional GAN (cGAN) for face-to-sketch translation. Along with learning the mapping relationships between the face and the sketch, these networks generate a loss function for automatically training the mapping. The work assumes that a multi-scale image representation can capture structure, image texture, and other features accurately: a three-layer pyramid model was constructed to obtain multi-scale information, and the multi-scale cGAN model was used to train the mapping relationships. The datasets used were the CUFS database [79], the CUFSF dataset [88], and the FERET database [65]. Fan et al. [27] presented a perceptual metric for facial sketches, namely Structure Co-Occurrence Texture (Scoot), which simultaneously considers co-occurrence texture statistics and block-level spatial structure.

Sharma and Kumar [75] present a voxel-based 3D face reconstruction technique using sequential deep learning. The datasets used are Bosphorus, UMBDB, and KinectFaceDB. The voxelization process is followed by variational autoencoders, bidirectional long short-term memory, and triplet loss training, with support vector machine-based prediction. A mirroring technique is used for reconstruction of the 3D voxelized face. Using the reconstructed face, a sequential deep learning framework is applied to gender recognition, emotion type recognition, occlusion type recognition, and person identification.

Multiple deep metric learning algorithms [5, 6, 15, 59] have designed loss functions so that they can learn more distinguishing features. Evolutionary algorithms are often used for feature optimization because their search capability is better than that of other algorithms [83, 91]. In [21, 35, 36, 87, 92], the latest developments in machine learning, mathematical modeling, and optimization techniques are presented. The main shortcoming of the above-mentioned techniques is that it is difficult to recognize a face from 3D occluded face datasets. To resolve this problem, a 3D occlusion invariant face recognition framework is proposed.

3 Proposed research framework

This section discusses the motivation followed by voxel-based 3D occlusion invariant face recognition framework.

3.1 Motivation

The proposed framework is motivated by the recent success of generative adversarial networks (GANs). The use of voxels makes it possible to include the finer details of a 3D face. To the best of the authors' knowledge, little work has been done at the intersection of voxels and deep learning for 3D face recognition. The proposed framework utilizes the concepts of voxelization, locality preserving projections, triplet loss, simulated annealing, and game theory. In the traditional approach, 3D mesh images are converted into depth images (2.5D) or epipolar geometry-based multiple 2D images, which are used to train convolutional neural networks (CNNs) or autoencoders. In contrast to 2D and 2.5D images, the presented work operates on voxels in 3D form. In Sharma and Kumar [75], mirroring-based face reconstruction was performed after voxelization, along with BiLSTM-based sequential deep learning. Figure 5 compares the traditional approach of [22, 67] with the proposed 3D face recognition framework; the proposed approach uses voxels in contrast to depth images or epipolar geometry images.

Fig. 5

Comparison between the traditional approach [22, 67] and proposed approach of 3D face recognition framework

3.2 Proposed 3D face recognition framework

The proposed framework for 3D face recognition consists of two phases, namely, training and testing. Figure 6 presents the proposed 3D face recognition framework.

Fig. 6

Proposed framework for 3D face recognition (a) Training phase (b) Testing phase

3.2.1 Training phase

There are two sub-phases in the training phase, namely pre-processing and simulated annealing based deep learning. Detailed descriptions of these phases are given in the following subsections.

3.2.2 Pre-processing

During the training phase, voxelization and locality preserving projections are the two pre-processing techniques used for generating embeddings. Figure 7 shows mesh images and their corresponding voxel images. The voxelization process converts a 3D mesh into voxel form in such a way that 3D coordinates are generated for each triangular mesh element, represented using cubes at different grid sizes. A single mesh is converted into three different voxel grid sizes, viz. 4³, 8³, and 16³. The number of voxels generated is sparse and varies across faces, even for the same grid size. Locality preserving projections are used to handle this sparseness: 4³ voxel grids are converted into 64 × 3 embeddings, 8³ grids into 128 × 3 embeddings, and 16³ grids into 256 × 3 embeddings. Ensembling is a well-known technique for making a prediction model more robust to new test images; hence, three different grid sizes are used, which boosts the quality of the training data during the pre-processing step.

Fig. 7

Mesh images and its corresponding voxel images

3.2.3 Adversarial voxel triplet generator and simulated annealing based prediction

The pre-processing sub-phase produces normalized voxel embeddings for further processing. The generator produces triplets of Anchor (A), Positive (P), and Negative (N) embeddings for triplet loss training. Motivated by [95], the normalized voxel embedding of a voxelized mesh image x is represented as V(x) ∈ ℝ^L. Given a triplet <A, P, N>, <A, P> is the relevant (positive) pair and <A, N> is the irrelevant (negative) pair. The objective is to train V(x) so as to minimize the following loss:

$$ {L}_{V, tri}={\left[d\left(V(a),V(p)\right)-d\left(V(a),V(n)\right)+m\right]}_{+} $$
(9)

where \( d\left(x,y\right)={\left\Vert \frac{x}{\left\Vert x\right\Vert }-\frac{y}{\left\Vert y\right\Vert}\right\Vert}^2 \) is the squared Euclidean distance between two L2-normalized vectors, m is the least margin required between d(a, p) and d(a, n) during training, and [.]+ ≜ max(., 0) denotes the positive part of its argument. Let the adversarial voxel triplet generator (G) generate an adversarial sample G(V(x)) ∈ ℝ^L by modifying the feature representation V(x) of an image x. While being trained to minimize the triplet loss, G produces hard triplet examples by pushing same-category vectors apart and pulling different-category vectors close.

The following adversarial voxel triplet loss is minimized during the training of G:

$$ {L}_{G, tri}={\left[d\left(G\left(V(a)\right),G\left(V(n)\right)\right)-d\left(G\left(V(a)\right),G\left(V(p)\right)\right)+m\right]}_{+} $$
(10)

Finally, with G fixed, the objective function for training V becomes

$$ {L}_{V, tri}={\left[d\left(G\left(V(a)\right),G\left(V(p)\right)\right)-d\left(G\left(V(a)\right),G\left(V(n)\right)\right)+m\right]}_{+} $$
(11)

Here, LG,tri and LV,tri make up an adversarial loss pair. Comparing Eq. (9) and Eq. (11), V is trained through the triplets generated by G, pushing <A, P> closer together and <A, N> further apart to meet the margin m.

The adversarial mechanism using a generator (G) is insufficient without a discriminator alongside it. The role of the discriminator (D) is to monitor and constrain the triplet generator G from producing random triplet vectors merely to attain a low value of LG,tri. Using the discriminator D, a feature vector is categorized into (C + 1) categories, where the first C categories represent the real classes and the final one denotes the fake class. The triplet <A, P, N> has labels <lA, lP, lN>; the positive pair has lA = lP and the negative pair has lA ≠ lN. The following loss function is minimized for training D.

$$ {L}_D={L}_{D, real}+\beta {L}_{D, fake} $$
(12)

Here, the first term (LD,real) forces D to classify the feature vectors of the real triplet correctly.

$$ {L}_{D, real}=\left[{L}_{ll}\left(D\left(V(A)\right),{l}_A\right)+{L}_{ll}\left(D\left(V(P)\right),{l}_P\right)+{L}_{ll}\left(D\left(V(N)\right),{l}_N\right)\right]\ast 0.33 $$
(13)

where Lll denotes the log loss. The second term, LD,fake, enables D to differentiate between real features and generated features.

$$ {L}_{D, fake}=\left[{L}_{ll}\left(D\left(G\left(V(A)\right)\right),{l}_{fake}\right)+{L}_{ll}\left(D\left(G\left(V(P)\right)\right),{l}_{fake}\right)+{L}_{ll}\left(D\left(G\left(V(N)\right)\right),{l}_{fake}\right)\right]\ast 0.33 $$
(14)

Here, the fake class is denoted by lfake.

D plays a crucial role in helping G preserve the class of the input features. Hence, the following loss enforces the class preservation assumption and is represented as

$$ {L}_{G, class}=\left[{L}_{ll}\left(D\left(G\left(V(A)\right)\right),{l}_A\right)+{L}_{ll}\left(\mathrm{D}\left(G\left(V(P)\right)\right),{l}_P\right)+{L}_{ll}\left(D\left(G\left(V(N)\right)\right),{l}_N\right)\right]\ast 0.33 $$
(15)

The final loss, minimized while training the voxel triplet generator G, is defined as

$$ {L}_G={L}_{G, tri}+\gamma {L}_{G, class} $$
(16)

Based on the mean triplet loss over the multiple grid sizes, a simulated annealing threshold is applied for accepting the predicted similarity score. Simulated annealing is introduced here to ensure that the minimization of the mean loss value output by adversarial triplet loss training under the discriminator is handled effectively by keeping a check on the threshold value. If the mean loss value does not satisfy the simulated annealing threshold, the embedding is dropped and sent back to the generator for new triplet generation. The similarity score and final class are generated by the discriminator classifying the selected embeddings. Figure 8 depicts prediction and score matching using adversarial triplet loss and simulated annealing; in this figure, M is the number of embeddings after voxel normalization, and n is the number of triplets formed via the generator.
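A PyTorch sketch of the adversarial loss pair in Eqs. (10)-(16) follows. It assumes D returns (C + 1)-way logits with the fake class at index c_fake, and realizes the log loss Lll as cross-entropy; the function names and batching are illustrative, not from a released implementation.

```python
import torch
import torch.nn.functional as F

def d2(x, y):
    """Squared Euclidean distance between L2-normalized vectors."""
    return (F.normalize(x, dim=-1) - F.normalize(y, dim=-1)).pow(2).sum(-1)

def generator_loss(G, D, va, vp, vn, la, lp, ln, m=0.2, gamma=1.0):
    """Eqs. (10), (15), (16): G forms hard triplets while D's class
    term keeps the generated features in their original classes."""
    ga, gp, gn = G(va), G(vp), G(vn)
    l_tri = F.relu(d2(ga, gn) - d2(ga, gp) + m).mean()           # Eq. (10)
    l_cls = (F.cross_entropy(D(ga), la) + F.cross_entropy(D(gp), lp)
             + F.cross_entropy(D(gn), ln)) * 0.33                # Eq. (15)
    return l_tri + gamma * l_cls                                 # Eq. (16)

def discriminator_loss(G, D, va, vp, vn, la, lp, ln, c_fake, beta=1.0):
    """Eqs. (12)-(14): classify real features into the C true classes
    and generated features into the extra (C + 1)-th fake class."""
    l_real = (F.cross_entropy(D(va), la) + F.cross_entropy(D(vp), lp)
              + F.cross_entropy(D(vn), ln)) * 0.33               # Eq. (13)
    fake = torch.full_like(la, c_fake)                           # l_fake labels
    l_fake = sum(F.cross_entropy(D(G(v).detach()), fake)
                 for v in (va, vp, vn)) * 0.33                   # Eq. (14)
    return l_real + beta * l_fake                                # Eq. (12)
```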

Fig. 8

Prediction of face using adversarial voxel triplet generator and simulated annealing

3.2.4 Testing phase

There are two sub-phases in the testing phase, namely pre-processing and prediction (covering both validation and verification). Figure 9 shows the pre-processing and prediction phases for the different voxel grid sizes.

Fig. 9

Pre-processing and prediction processes in the testing phase

3.2.5 Pre-processing

The pre-processing during the testing phase serves either validation or verification at a given time. For validation, the entire testing dataset is considered and voxelization is carried out on each image in it; for verification, voxelization is carried out on a single query image. Locality preserving projection then normalizes the voxels, removing their sparseness, before they are passed to the deep learning model.

3.2.6 Prediction

In the case of validation, an array of class predictions is output by the trained deep learning model, and the output values are compared with the ground truth to calculate the accuracy of the model. In the case of verification, the prediction is a single class, and the final similarity score is calculated using the correlation value [49].

3.3 V3DOFR and computational complexity

The proposed voxel-based 3D occlusion invariant face recognition (V3DOFR) algorithm consists of five steps. First, a raw 3D mesh image is taken as input, and the number of triangular units in the mesh is counted; if no triangular units are found, an error message is generated (Step 1). In Step 2, voxelization is performed for the different grid sizes; the number of voxels grows in proportion to the grid size. During voxelization, the voxel count is inconsistent across different poses within the same class. To overcome this inconsistency, locality preserving projection (LPP) is used in Step 3, which removes the sparseness while maintaining the neighboring voxel properties. Thus, LPP is more effective than principal component analysis (PCA) for dimensionality reduction in maintaining voxel properties at the facial-feature level. The different grid sizes are converted into LPP feature sets of different sizes. Once the LPP embeddings are generated, triplets are generated via the generator in Step 4: the generator randomly selects Anchor (A), Positive (P), and Negative (N) embeddings, deep learning-based triplet loss training computes the loss values, and the loss values for the corresponding grid sizes are normalized. In the final step, the average of the normalized loss values is used for simulated annealing-based triplet selection. After a triplet is selected, the discriminator assigns the class identification number; if a triplet is not selected, a new triplet is generated through the generator. A dependency-injected sketch of this control flow is given below.
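The sketch keeps only the five-step control flow; the heavy components (voxelization, LPP, triplet generation, loss, SA gate, discriminator) are injected as callables because the paper defines them separately, and all names here are placeholders.

```python
from typing import Callable, Sequence

def v3dofr(mesh,
           voxelize: Callable, lpp_embed: Callable, make_triplet: Callable,
           triplet_loss: Callable, sa_accept: Callable, classify: Callable,
           grid_sizes: Sequence[int] = (4, 8, 16), max_attempts: int = 100):
    """Control flow of the five V3DOFR steps (placeholder components)."""
    if len(mesh.faces) == 0:                          # Step 1 (mesh.faces assumed)
        raise ValueError("no triangular units found in the input mesh")
    embeddings = [lpp_embed(voxelize(mesh, g), g)     # Steps 2-3
                  for g in grid_sizes]
    for _ in range(max_attempts):                     # Steps 4-5
        triplets = make_triplet(embeddings)           # generator picks A, P, N
        losses = [triplet_loss(t) for t in triplets]  # one loss per grid size
        mean_loss = sum(losses) / len(losses)         # normalized mean loss
        if sa_accept(mean_loss):                      # SA-based selection
            return classify(triplets)                 # discriminator class ID
    raise RuntimeError("triplet regeneration budget exhausted")
```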

3.3.1 Computational complexity

The time complexity of the proposed algorithm is as follows. Pre-processing of the mesh requires O(n) time. In Step 2, voxelization (2(a)-2(h)) requires O(n) time and the sub-steps 2(i)-2(m) require O(1) time. In Step 3, the calculation of LPP embeddings (3(a)-3(c)) requires O(n³) time [32], and the other sub-steps (3(d)-3(g)) require O(1) time. In Step 4, triplet generation with the generator and discriminator takes O(1) time for steps 4(a)-4(b) and O(n³) time [20] for the remaining sub-steps (4(c)-4(g)). The simulated annealing-based prediction takes O(1) time. Hence, the total complexity of the proposed technique is O(n³).

4 Experimental results and discussion

In this section, the performance of the proposed technique is compared with existing techniques, along with visual verification. The section also presents the datasets used, the parameter settings, and a computational time analysis.

4.1 Datasets used

The datasets used for evaluating the proposed technique are the Bosphorus face database [71], the UMBDB face database [17], and the KinectFaceDB face database [54]. The Bosphorus dataset consists of 105 subjects in different poses and occlusions; it contains 381 occluded images out of 4,666 in total, and all images are annotated with subject ID and a pose, occlusion, or emotion description. The UMBDB dataset contains 1,473 images of 143 subjects, of which 590 are occluded with different types of occlusion; its modalities are 2D and 3D. The KinectFaceDB dataset contains data for 52 subjects and covers three modalities, viz. 2D, 2.5D, and 3D; the total number of images is 936, of which 312 are occluded. Table 1 presents a detailed description of these datasets, and Table 2 presents their occlusion descriptions.

Table 1 Description of datasets used
Table 2 Occlusion description for datasets

4.2 Parameter setting

The parameters of the proposed approach are listed in Table 3. In the voxelization process, the grid sizes are 4×4×4, 8×8×8, and 16×16×16, respectively. The corresponding numbers of neighbours for locality preserving projection are 16, 64, and 128, using the voxels of the corresponding grid size; a k-nearest-neighbour adjacency weight matrix is used for effective LPP embeddings. The numbers of epochs are 2700, 1200, and 800 for the respective grid sizes in triplet loss training. The adaptive moment (Adam) optimizer [43] is employed in triplet loss training, the alpha (margin) value is 0.2, and the mean absolute error is used as the loss parameter. The loss function used in the discriminator is the logarithmic loss, which directly gives values in the range 0 to 1. The batch size is 30, the dropout rate is 40%, the learning rate is 0.005, and the activation function is the rectified linear unit (ReLU).

Table 3 Parameter setting
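For reference, the Table 3 settings can be gathered into a single configuration object; the key names below are illustrative, not from a released implementation.

```python
# Hypothetical consolidation of the Table 3 settings; values are from the
# text above, the key names are illustrative.
CONFIG = {
    "grid_sizes": [(4, 4, 4), (8, 8, 8), (16, 16, 16)],
    "lpp_neighbours": [16, 64, 128],     # k per grid size
    "epochs": [2700, 1200, 800],         # per grid size
    "optimizer": "adam",
    "triplet_alpha": 0.2,                # triplet loss margin
    "loss": "mean_absolute_error",
    "discriminator_loss": "log_loss",
    "batch_size": 30,
    "dropout": 0.4,
    "learning_rate": 0.005,
    "activation": "relu",
}
```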

ElSayed et al. [24] used a Siamese neural network with a (2, 500, 1) model, where 2 is the number of inputs, 500 is the number of nodes in the hidden layer, and a single output is produced. Tan et al. [77] used a ResNet-18 model with a 256 × 256 depth map image, with the Adam optimizer, an initial learning rate of 0.01, and a weight decay of 5 × 10⁻⁵. Liu et al. [51] built a face reconstruction model based on pose and expression normalization using 128 SIFT descriptors and the tanh activation function for yaw poses of 0°, ±10°, ±20°, …, ±90°.

4.3 Performance evaluation metrics

Seven well-known performance evaluation measures, namely accuracy, sensitivity, specificity, precision, FPR, FNR, and F1 score, are used to compare the quality of the proposed technique with that of other techniques. These measures are computed from the confusion matrix shown in Fig. 10.

Fig. 10

Confusion matrix

With reference to the confusion matrix in Fig. 10, it is important to understand the concepts of true positive (TP), true negative (TN), false positive (FP), and false negative (FN). When both the actual and predicted values are ‘YES’, the outcome is a TP; when both are ‘NO’, it is a TN. In an FN, the actual value is ‘YES’ but the prediction is ‘NO’; in an FP, the actual value is ‘NO’ but the prediction is ‘YES’.

Accuracy is the measure of overall correctness in predicting the classes and is defined as follows [100].

$$ Accuracy=\frac{TP+ TN}{TP+ TN+ FP+ FN} $$
(17)

Sensitivity is the measure of correct classification of all true positives and is defined as follows [100].

$$ Sensitivity=\frac{TP}{TP+ FN} $$
(18)

Specificity is the measure of correct classification of all true negatives and is defined as follows [100].

$$ Specificity=\frac{TN}{TN+ FP} $$
(19)

Precision is defined as the ratio of true positive values to all predicted positive values. It is mathematically represented as

$$ Precision=\frac{TP}{TP+ FP} $$
(20)

The false positive rate (FPR) is the ratio of actual negatives wrongly predicted as positive to the total number of actual negatives. FPR is defined as

$$ FPR=\frac{FP}{FP+ TN} $$
(21)

The false negative rate (FNR) is the ratio of actual positives wrongly predicted as negative to the total number of actual positives. The mathematical representation of FNR is as follows

$$ FNR=\frac{FN}{FN+ TP} $$
(22)

The F1 score is the harmonic mean of precision and sensitivity. It is defined as

$$ F1\ Score=\frac{2\ast TP}{2\ast TP+ FP+ FN} $$
(23)
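All seven measures in Eqs. (17)-(23) follow directly from the four confusion-matrix counts; a small sketch (assuming non-degenerate denominators):

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (17)-(23) computed from TP, TN, FP, and FN counts."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),   # Eq. (17)
        "sensitivity": tp / (tp + fn),                    # Eq. (18)
        "specificity": tn / (tn + fp),                    # Eq. (19)
        "precision":   tp / (tp + fp),                    # Eq. (20)
        "fpr":         fp / (fp + tn),                    # Eq. (21)
        "fnr":         fn / (fn + tp),                    # Eq. (22)
        "f1_score":    2 * tp / (2 * tp + fp + fn),       # Eq. (23)
    }
```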

4.4 Non-adversarial versus adversarial voxel triplet generator face recognition technique

This section compares 3D face recognition with and without the adversarial voxel triplet generator. Table 4 shows the performance comparison between the non-adversarial and adversarial voxel triplet generators.

Table 4 Comparison between proposed adversarial and non-adversarial based voxel triplet generator face recognition

The accuracy obtained over the three datasets is 8-10% better for adversarial voxel-triplet-based face recognition than for the non-adversarial voxel triplet generator. Hence, the use of the adversarial technique in combination with simulated annealing has proven beneficial for face recognition accuracy.

4.5 Performance evaluation

In this sub-section, the performance of the proposed technique and four well-known techniques, namely ElSayed [24], Tan [77], Liu [51], and Sharma [75], is evaluated in four different experiments: voxel-based, occlusion invariant, landmark-based, and 3D mesh-based face recognition. In each experiment, the evaluation uses seven performance measures, viz. accuracy, sensitivity, specificity, precision, false positive rate (FPR), false negative rate (FNR), and F1 score. The proposed technique and the compared algorithms are validated on the datasets described in Section 4.1.

Table 5 shows the performance comparison of various face recognition techniques using voxels. The training dataset was generated by randomly selecting 80% of the images, with the remaining 20% used for testing. On the Bosphorus dataset, the proposed technique provides better results than the existing techniques on all performance measures except specificity and FPR, where Sharma [75] provides better values; the accuracy obtained by the proposed technique is 90.8%. Similarly, on the UMBDB dataset, the proposed technique outperforms the other techniques on all performance measures except sensitivity and specificity, achieving an accuracy of 81.9%. The main reason for the drop in accuracy on UMBDB is the more dynamic occlusion present there compared to the Bosphorus dataset. On the KinectFaceDB dataset, the best accuracy is achieved by Sharma's method [75]; however, the proposed technique trails by only 0.1%. Sharma's method [75] also outperforms the others in terms of FNR and specificity, whereas the precision, FPR, and F1 score obtained by the proposed technique are better than those of the existing techniques. The proposed technique and ElSayed [24] achieve sensitivity at par, with 92.7% and 92.9%, respectively.

Table 5 Performance measures obtained from various face recognition techniques using voxels

Table 6 presents the results obtained from various face recognition techniques in an occlusion environment. The proposed model and the other four techniques were trained on the non-occluded images in each dataset and tested on the occluded images. On the Bosphorus dataset, the best accuracy, 81.5%, is obtained by the proposed technique, which is 2.1% better than the second-best technique. In terms of sensitivity, the proposed technique is second best; for specificity, FPR, FNR, and F1 score, it outperforms the other face recognition methods. For precision, the proposed technique and ElSayed [24] provide 84.1% and 85.3%, respectively. On the UMBDB and KinectFaceDB datasets, the proposed technique attains the best value for all performance measures except sensitivity, with best accuracies of 67% and 77.9%, respectively. The sensitivity and specificity achieved by the proposed technique are 79.2% and 38.0% on UMBDB, and 88.1% and 40.4% on KinectFaceDB, respectively.

Table 6 Performance measures obtained from the various face recognition techniques under occlusion condition

Table 7 shows the results obtained from 3D face recognition techniques using landmarks. The training and testing datasets are generated randomly in an 80-20 ratio, using 26 landmarks in each case. On the Bosphorus dataset, the proposed technique has the best results for most of the performance measures: the highest accuracy is 84.9%, with 93.8% sensitivity, 49.7% specificity, 90.8% precision, 36.6% FPR, 8.3% FNR, and a 91.3% F1 score. Sharma's method [75] provides the second-best accuracy, sensitivity, FPR, FNR, and F1 score. On the UMBDB dataset, the recognition accuracy obtained by the proposed approach is 77.4%, and the proposed method outperforms the other methods on all performance measures except specificity, where Sharma's method [75] is better. On the KinectFaceDB dataset, the proposed technique outperforms all the other methods on all evaluation metrics except specificity, where Sharma's method [75] again leads; the accuracy achieved by the proposed technique is 81.6%.

Table 7 Performance comparison between different 3D face recognition techniques using landmarks

Table 8 shows the results obtained from different face recognition techniques using meshes. The training and testing datasets are partitioned randomly in an 80-20 ratio. On the Bosphorus dataset, the accuracy achieved by the proposed technique is 88.7%, while Sharma's method [75] provides 87.4%. The sensitivity achieved by the proposed technique is 92.8%, with Tan [77] attaining the best sensitivity of 94.1%; on all the other evaluation metrics, the proposed technique achieves the best results. On the UMBDB dataset, the best accuracy, 79.2%, is achieved by the proposed technique. The best sensitivity is achieved by ElSayed [24], 5.2% above the proposed technique. The precision, FPR, FNR, and F1 score of the proposed technique are 86.6%, 48.7%, 10.4%, and 87.9%, respectively, while Liu [51] performs best on specificity with 45.8% for the UMBDB dataset. On the KinectFaceDB dataset, Sharma's method [75] outperforms the other methods, including the proposed one, on all evaluation metrics.

Table 8 Performance comparison between different face recognition techniques using 3D mesh

4.6 Visual verification

Visual verification of random 3D mesh images with the proposed occlusion invariant framework is presented in Fig. 11. All the 3D meshes contain an occlusion, viz. a hand over the eyes, hair, glasses, a hand over the mouth, cloth, a cap, or a finger. Ten 3D meshes are selected randomly from the occluded images for verification, and the predicted subject IDs are given along with the actual subject IDs. Nine out of the ten meshes are predicted correctly during verification, which validates that the proposed method is occlusion invariant.

Fig. 11

Visual verification of 3D meshes with their actual and predicted subject IDs in Batch 1 and Batch 2

4.7 Computational time analysis

Table 9 reports the GPU-based computational time of the proposed approach and the other techniques. Four well-known 3D face recognition techniques are compared with the proposed method using voxels, landmarks, and meshes for pre-processing, recognition, verification, and the corresponding learning model. The computation time reported is the average time across all phases. The experiments were run on a GeForce GTX 1080 Ti GPU with 3584 CUDA (Compute Unified Device Architecture) cores and a memory speed of 11 Gbps.

Table 9 Computation time (in ms) on GPU for proposed technique versus other techniques in voxel-based face recognition

Relative to the voxel-based variant, the proposed technique is fastest overall when using landmarks and slowest when using meshes.

4.8 Convergence analysis

The convergence of the accuracy obtained by the proposed technique on all datasets is shown in Fig. 12. The Bosphorus dataset converges to 90% accuracy in 2700 epochs, the UMBDB dataset to 81% in 1200 epochs, and the KinectFaceDB dataset to 85% in 800 epochs. The convergence plot is based on the accuracy obtained from the combined approach of triplet loss training, simulated annealing, and game theory.

Fig. 12

Accuracy based convergence plot for proposed approach

Figure 13 shows the accuracies obtained by the voxel-based face recognition model on the three datasets, viz. Bosphorus, UMBDB, and KinectFaceDB: 90.8%, 81.9%, and 85.6%, respectively. These values match the voxel-based results reported in Table 5.

Fig. 13

Final accuracy comparison of three datasets

5 Future work

Attention-based models are being used to improve the accuracy of facial expression recognition [39]. Other attention-based models include image-based attention [26], edge-based attention [96], weakly supervised attention [93], and uncertainty-based attention [94]. In [26], a high-quality dataset, namely SOC (Salient Objects in Clutter), is used to update the previous saliency benchmark for salient object detection: the attention of the deep learning model is drawn to the objects in the image, and the target is to detect a salient object in clutter and bring it to the foreground. This technique can be extremely useful for detecting facial landmarks such as eyes behind eyeglasses; facial features can be effectively reconstructed by bringing the salient facial features in front of the occluding object. In [96], EGNet, based on an edge guidance network, is presented for salient object detection. It exploits the complementarity of salient edge information and salient object information to generate fine boundaries. This technique can be used in face detection similarly to the shape-from-shadow technique; both shape-from-shadow and fully convolutional neural network (FCN) approaches suffer from coarse object boundaries, and the rich edge information allows salient objects, and hence facial features, to be detected more precisely with fine edges.

In [93], labeling-based salient object detection is proposed using a weak-supervision technique. A challenge there is poor boundary localization, for which an auxiliary edge detection task is suggested to localize object edges explicitly. This technique could be extended to 3D face detection for localizing facial features such as eyes, nose, and mouth. In [94], uncertainty-inspired RGB-D saliency detection via conditional variational autoencoders (UC-Net) is presented: a probabilistic RGB-D saliency detection network models human annotation using conditional variational autoencoders and builds multiple saliency maps for each input image by latent-space sampling. This technique could be applied to RGB-D images for facial landmark detection and facial expression prediction.

The above-mentioned techniques may be utilized in the proposed approach for better performance in the near future: EGNet could be integrated for better facial feature extraction, the RGB-D saliency detection method could be used for landmark identification and detection, and attention-based models could be utilized for better recognition.

The simulated annealing based deep learning technique can be implemented in any CNN model in which backward propagation is used to calculate the loss between layers. Simulated annealing can also be used in other deep learning models, viz. autoencoders, variational autoencoders, GANs, etc., since in all of them a simulated annealing based threshold value can be maintained for loss acceptance.

6 Conclusions

In this paper, a voxel-based 3D occlusion invariant face recognition framework is proposed. The proposed framework utilizes generator-discriminator-based deep learning. The Bosphorus, UMBDB, and KinectFaceDB datasets have been used to implement the face recognition techniques. The best average accuracy obtained by the proposed technique using voxels is 86.1%; for occlusion invariant face recognition, 75.5%; using 3D landmarks, 81.3%; and using 3D meshes, 83.9%. Adding the adversarial training strategy for triplet generation ensures low bias. This technique, coupled with simulated annealing, makes the proposed method robust across different voxel-based applications.