
1 Introduction

Dependability of biometric systems is a key aspect in their worldwide diffusion and everyday usage, regardless of the specific application they are intended to improve, and it is tightly related to the overall reliability of the process of accessing a given resource or place. Nevertheless, it is worth noting that even if a given biometric system performs well in terms of accuracy and robustness (i.e. featuring a low False Acceptance Rate and a low False Rejection Rate), this does not automatically mean it is dependable [24]. Dependability of a biometric system, indeed, implies much more than high performance of the processing pipeline (though the latter is a fundamental requirement), since it involves other aspects such as the reliability of the capture process, the capability to cope with uncontrolled conditions and, not least, the resistance to attacks from malicious users.

In this work we focus on this last aspect of the dependability of a biometric system, with particular regard to face biometrics, which represents one of the most widespread ways to perform person authentication and identification in a contactless and natural way. The idea inspiring the proposed approach is to increase the resistance of face biometrics to presentation attacks [4] by exploiting the facial dynamics related to the utterance of a given sentence. These dynamic facial features are subject dependent and represent a sort of motion signature that is much harder to counterfeit than static face representations. To this aim, video captures of face changes during sentence utterance (see Fig. 1) are used to extract dynamic local features from the lower half of the face, by means of the LBP-TOP variant of the Volume Local Binary Pattern (VLBP) method. These spatio-temporal features are then used to train a deep feed-forward neural network and subsequently to find correspondences between the probe descriptor and the available gallery. The experiments conducted on audiovisual samples of the public XM2VTS dataset show state-of-the-art recognition accuracy exceeding 99%, along with high robustness to intra-class variations (the way a sentence is pronounced by the same subject) and good independence from the choice of the sentence, confirming the advantages of using inherently dependable dynamic facial features.

The rest of this paper is organized as follows: Sect. 2 summarizes a selection of works related to the present study; Sect. 3 presents in detail the proposed approach to inherently safer face biometrics; Sect. 4 describes the results of the experiments carried out. Finally, Sect. 5 draws conclusions, along with directions for future research.

Fig. 1. A sequence of frames showing the effects of uttering a sentence on the lower face region. Regions comprised between upper and lower lip, nose-base and upper-lip, chin contour and mouth are all affected to a variable degree depending on both anatomical characteristics and specific speaking habits.

2 Related Works

Our proposal is aimed at extracting, representing and matching dynamic facial features related to the way a sentence is pronounced. Consequently, besides face recognition, related works comprise studies dealing with several interconnected topics, typically based on lip-motion representation and analysis for audio-visual speech or speaker recognition [8, 10, 25, 30]. Lip feature extraction from face images is useful in several applications. Early systems exploited only audio information; later, visual features have also been used, either combined with audio or on their own. Mouth regions in the lip-reading domain can typically be represented in two ways: grayscale pixel-level information or high-level visual information (geometry, such as width, height, surface and mouth opening). In [18] a lip feature extraction algorithm based on Local Binary Patterns (LBP) and Stacked Sparse Autoencoders (SSAE) is presented. According to this method, LBP texture features are extracted from lip images; high-level features are then extracted using an SSAE, which adopts unsupervised learning to discriminate high-level features. As a final step, the method uses fine-tuning in order to improve overall performance. This method features wide applicability along with high classification accuracy. In [20] the authors propose a spatio-temporal approach to track lip movements, learning from visemes of the French language. It implements three modules. First, a lip tracking system in which lips are segmented using both color and geometric information, since the mouth has a different color from the face skin. Then, a second processing stage implements lip motion tracking by using a particle filter. Finally, visual information is extracted and classified, to allow the recognition of the pronounced viseme.

On a parallel line of research, the work described in [3] can continuously classify whether a person is speaking in a video sequence based on lip movement. Firstly, the head area is segmented; then, a skin detection technique is applied in order to segment the face area. Next, based on both geometry and color as in [20], the mouth area in each frame is further segmented. A first rough mouth opening detection is based on the fact that the opening area has a darker gray level than its average. Subsequently, only frequency components between 1 Hz and 10 Hz of the detected feature signal are considered to classify the speaking activity by comparison with a threshold. Another method to detect silence sections is proposed in [27]. In this case, the author analyzes geometric parameters such as the time trajectory of the lip contour, namely the interlabial width and height. This method achieved 80% correct silence detection with a 5% false detection rate. One of the first methods for automated feature extraction from lip motion was proposed in [7] as a potentially valuable resource to improve the resistance of audiovisual authentication systems to replay attacks by means of a liveness-verification test [22]. Subsequent works [11] and [12] have more formally described the dynamic characteristics of lip motion which account for its advantage as a secure biometric descriptor. In these approaches the motion component of the captured image sequence is extracted from orientation maps and is then combined with simultaneously extracted speech features to achieve higher user verification precision. Speaker recognition by lip motion and speech reading is also proposed in [5], where a specific experiment based on Hidden Markov Models (HMMs) is performed to assess the saliency of lip dynamics. In [26] a statistical approach for lip activity detection and speaker detection in videos is proposed. The main idea is to apply signal detection techniques to a feature extracted from mouth region intensities.

Neural Networks (NNs) have been extensively used in speech recognition as feature extractors in HMM-based speech recognizers [15]. Long Short-Term Memory (LSTM) networks then started to replace ever larger parts of HMM-based speech processing systems. An end-to-end neural network system [13] outperformed HMM-based systems, achieving the best error rate (16%) on the large Switchboard Hub5'00 speech recognition benchmark [14]. An HMM integrated with a multi-boosted learning approach is also exploited by the authors of [17] to devise a comprehensive lip-password enabled speaker verification system. Local spatio-temporal descriptors based on LBP-TOP and a support-vector-machine (SVM) classifier are proposed in [32], while in [19] user authentication through silent utterance of a pass-phrase is approached as a high-dimensional time series matching problem. The prospective anti-spoofing advantage of facial dynamics was first explored in [28] through a combination of the Dynamic Mode Decomposition algorithm, LBP and an SVM. More recently, a mobile-phone based approach to lip-motion enabled user verification has been proposed in [31], based on a specialized active shape model algorithm and a Gaussian mixture model of lip motion.

With regard to the various descriptors and methods cited above, the proposed approach exploits dynamic facial features that are not restricted to the lips region alone, but include all the surrounding regions of the lower portion of the face at both the appearance and the motion level, thus resulting in a more discriminant and robust physical/behavioral biometric and providing more dependable and accurate recognition performance.

Fig. 2. Schematic view of the overall processing pipeline of the proposed method.

3 Method Description

The rationale for analyzing dynamic local features is motivated by the assumption that the motion patterns of points belonging to the face regions most directly affected during the utterance of a given sentence can effectively characterize an individual. In this analysis, we decided to focus only on the visual aspect of speech without considering the audio component, since we are interested in a unimodal system. According to this approach, the succession of frames captured during utterance contains highly discriminant spatial and temporal information [21]. We are interested not only in the information related to face texture, but also in its changes over time within the frame sequence, producing dynamic textures. As shown by numerous works in the literature [1, 2, 6, 9], dynamic texture analysis provides the following advantages:

  • local texture analysis capturing spatial and temporal information;

  • features robust to image transformations;

  • computational simplicity;

  • good robustness to lighting variation;

  • multi-level resolution analysis.

The overall processing pipeline of the proposed method consists of several stages, from subject acquisition to face detection and normalization, and then to dynamic feature extraction and recognition, as depicted in Fig. 2. Subject acquisition involves the capture of a video sequence that has to be normalized with regard to the number of frames by means of a re-sampling process aimed at obtaining a clip whose length is consistent with the length of the gallery samples. Each frame of the sequence is then analyzed by a face detector [29] that identifies the image region in which the subject's face is present. Subsequently, up to 59 facial landmarks are located on the previously detected face crop by means of an efficient landmark predictor based on [16]. By exploiting these numbered landmarks, the frame is cropped again, retaining only the lower face region below the ideal line connecting landmark #2 to landmark #12 (refer to Fig. 3). Finally, the video segments thus obtained are converted into gray-scale and spatially resampled to a resolution of 200 \(\times \) 200 pixels. At the lowest level, the proposed approach is based on the Local Binary Pattern [23], one of the most widely used and reliable texture descriptors.
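As an illustrative aid, the following is a minimal sketch of this per-frame preprocessing step, assuming OpenCV for image handling and dlib's standard detector and 68-point shape predictor (the landmark model used in the paper, based on [16], exposes up to 59 points, so the jaw indices below are stand-ins for landmarks #2 and #12):

```python
import cv2
import dlib
import numpy as np

# Hypothetical paths and indices; the actual models and landmark numbering
# used in the paper may differ (its predictor exposes up to 59 landmarks).
PREDICTOR_PATH = "shape_predictor_68_face_landmarks.dat"
CROP_SIZE = (200, 200)          # spatial resolution used in the paper
JAW_LEFT, JAW_RIGHT = 2, 14     # stand-ins for landmarks #2 and #12

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(PREDICTOR_PATH)

def preprocess_frame(frame_bgr):
    """Detect the face, keep only the lower face region, return a 200x200 gray crop."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    face = faces[0]
    shape = predictor(gray, face)
    pts = np.array([[p.x, p.y] for p in shape.parts()])
    # Ideal horizontal line connecting the two reference jaw landmarks.
    y_line = int((pts[JAW_LEFT, 1] + pts[JAW_RIGHT, 1]) / 2)
    lower = gray[y_line:face.bottom(), face.left():face.right()]
    return cv2.resize(lower, CROP_SIZE)
```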

Fig. 3. Facial landmarks considered for facial ROI cropping.

The LBP descriptor replaces the value of each pixel of the image with a decimal value, called the LBP code, which encodes the local structure of the pixel's neighborhood. This is achieved by starting from a kernel (central) pixel and considering a series of neighboring points; each of them is thresholded with respect to the central pixel value. Concatenating the 0s and 1s produced by the thresholding operation yields a binary value (see Fig. 4), which corresponds to the LBP code of that neighborhood. For each block on which LBP is applied, the LBP histogram (i.e. the occurrences of the LBP codes in that specific area) is then computed.
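To make the thresholding step concrete, here is a small didactic NumPy implementation of the basic 3 \(\times \) 3 LBP operator (8 neighbors, radius 1); it illustrates the idea rather than the exact operator configuration used in the paper:

```python
import numpy as np

def lbp_code(patch):
    """Basic LBP code of the central pixel of a 3x3 grayscale patch."""
    center = patch[1, 1]
    # Neighbors taken clockwise starting from the top-left corner.
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if n >= center else 0 for n in neighbors]
    # Concatenate the bits into a decimal LBP code in [0, 255].
    return sum(b << i for i, b in enumerate(reversed(bits)))

def lbp_histogram(block):
    """256-bin histogram of LBP codes over a grayscale image block."""
    h, w = block.shape
    codes = [lbp_code(block[i - 1:i + 2, j - 1:j + 2])
             for i in range(1, h - 1) for j in range(1, w - 1)]
    return np.bincount(codes, minlength=256)
```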

The extension of this simple local descriptor to the temporal domain is known as the Volume Local Binary Pattern (VLBP), and is particularly suited to describing dynamic textures such as those resulting from the aforementioned acquisition process. The VLBP descriptor computes the LBP value for each pixel belonging to an area of the space-time volume defined by the dynamic texture, and for each area calculates the histogram, i.e. the occurrences of the LBP codes. To this aim each frame has to be broken down into blocks through a grid. In the present work, we found an adequate partitioning by applying a 4 \(\times \) 4 grid to the lower face crops resulting from the previously described normalization process, yielding 16 different areas of 50 \(\times \) 50 pixels each. The rationale behind this breakdown was to preserve the dynamic characteristics of each block. The number of neighboring points considered for each kernel pixel determines the number of bits used to represent the LBP code, and therefore the width of the histogram; the latter corresponds to the feature vector of the block to which it belongs. The dimensions on which the operator is applied are X and Y (referring to the spatial domain of each frame) and Z (referring to the frame index in the time sequence), allowing the analysis of the local structure not only in space but also in its temporal evolution. For each pixel the VLBP code is calculated considering not only the spatial neighborhood, but also the temporal one (see Fig. 5). Consequently, the histogram computed for each block is considerably larger than the histogram resulting from classic LBP, since it takes into consideration more pixels around the central one, leading to a considerable increase in feature vector dimension. To reduce the computational complexity of the VLBP technique we used a simpler variant of it, referred to as Local Binary Patterns on Three Orthogonal Planes (LBP-TOP), which considers only 3 orthogonal planes for analyzing the local features and is extensively described in [33].
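A minimal sketch of the 4 \(\times \) 4 block partitioning described above, assuming the 200 \(\times \) 200 lower-face crops are stacked into a frames \(\times \) height \(\times \) width volume:

```python
import numpy as np

GRID = 4             # 4 x 4 spatial grid over each 200 x 200 crop
BLOCK = 200 // GRID  # 50-pixel blocks, as in the paper

def split_into_blocks(volume):
    """Split a (frames, 200, 200) gray-level volume into 16 spatio-temporal blocks.

    Each returned block has shape (frames, 50, 50) and keeps the full temporal
    extent, so the dynamic characteristics of each block are preserved.
    """
    blocks = []
    for r in range(GRID):
        for c in range(GRID):
            blocks.append(volume[:, r * BLOCK:(r + 1) * BLOCK,
                                    c * BLOCK:(c + 1) * BLOCK])
    return blocks
```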

Fig. 4. LBP pattern generation process.

The LBP-TOP technique reduces the number of possible patterns from \(2^{3P+2}\) (for a VLBP operator spanning three frames along the Z dimension) to \(3 \cdot 2^{P}\), where P represents the number of neighboring points. In this work 36 evenly spaced points were used on a circumference of radius 6 centered on the pixel of interest, and the patterns thus obtained were then scaled into integers representable on 8 bits. The binary patterns are extracted from the XY, XZ and YZ planes, and the histograms obtained from the three planes are concatenated into a single feature vector (Fig. 6). An extension of the original operator is the so-called "uniform pattern", which can be used to further reduce the length of the feature vector without losing relevant information. Some binary patterns, indeed, occur more often than others in image textures. An LBP code is said to be uniform when its binary pattern contains at most two transitions 0–1 or 1–0. The histogram of a uniform-pattern LBP has a distinct bin for each uniform pattern and a single bin for all non-uniform patterns. In our case, with LBP codes expressed on 8 bits (possible values between 0 and 255), there are 58 different uniform patterns and therefore the final histogram consists of 59 bins, where the 59th collects all the other patterns.
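The following is an illustrative sketch of per-block LBP-TOP feature extraction built on scikit-image's local_binary_pattern; for simplicity it uses 8 neighbors of radius 1 with the non-rotation-invariant uniform mapping (which yields exactly 59 bins), whereas the paper samples 36 points on a circle of radius 6 and rescales the resulting codes to 8 bits:

```python
import numpy as np
from skimage.feature import local_binary_pattern

P, R, N_BINS = 8, 1, 59   # 'nri_uniform' with P=8 gives 58 uniform bins + 1 "other" bin

def plane_histogram(plane):
    """59-bin uniform LBP histogram of a single 2D plane."""
    codes = local_binary_pattern(plane, P, R, method="nri_uniform")
    hist, _ = np.histogram(codes, bins=N_BINS, range=(0, N_BINS))
    return hist

def lbp_top_histogram(block):
    """Concatenated XY, XZ and YZ histograms of a (frames, rows, cols) block."""
    t, h, w = block.shape
    hist_xy = sum(plane_histogram(block[k])       for k in range(t))  # spatial planes
    hist_xz = sum(plane_histogram(block[:, i, :]) for i in range(h))  # row-time planes
    hist_yz = sum(plane_histogram(block[:, :, j]) for j in range(w))  # column-time planes
    return np.concatenate([hist_xy, hist_xz, hist_yz]).astype(float)
```

Concatenating the histograms of the 16 blocks obtained from the 4 \(\times \) 4 grid then yields the global descriptor used for recognition.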

The resulting feature vector was used to train a fully-connected deep feed-forward neural network (DFFN), schematically depicted in Fig. 7. This architecture has been preferred over the popular Convolutional Neural Network (CNN), which is typically much more computationally expensive, requiring more powerful hardware and more training time. The number of hidden layers was determined experimentally, and the final configuration featuring three hidden layers was the most effective and efficient found. For each sample presented in the testing phase, the network outputs the probability of belonging to each class; the class with the highest probability is then selected, without the use of particular thresholds. The choice of parameters, activation functions and architecture was determined on an experimental basis; a number of tests was therefore performed, modifying the combinations of these variables. It is worth noting that overfitting parameters to a dataset can lead to poor performance in real applications or in challenging tests that differ from the training ones. The choice of a good feature representation helped us to reduce this risk to a minimum. However, it is practically impossible to build an absolutely generic model, since it would require an infinite dataset. In this regard our approach used processing and feature representations that are as general as possible, with the ReLU activation function chosen for the input layer, the sigmoid activation function for the three hidden layers and the softmax for the output layer. The number of input nodes was set equal to the size of the feature vector (19824), while the number of output nodes is determined by the number of possible subjects (295).
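A sketch of this architecture in Keras is given below; the hidden layer widths are illustrative placeholders, since the paper reports the number of hidden layers and the activation functions but not the exact layer sizes:

```python
from tensorflow import keras
from tensorflow.keras import layers

INPUT_DIM = 19824   # LBP-TOP feature vector length reported in the paper
N_SUBJECTS = 295    # number of XM2VTS subjects

def build_dffn(hidden_units=(1024, 512, 256)):
    """Fully connected deep feed-forward network: ReLU input layer, three sigmoid
    hidden layers, softmax output. Hidden sizes are assumptions, not from the paper."""
    model = keras.Sequential([
        keras.Input(shape=(INPUT_DIM,)),
        layers.Dense(hidden_units[0], activation="relu"),     # input layer (ReLU)
        layers.Dense(hidden_units[0], activation="sigmoid"),  # hidden layer 1
        layers.Dense(hidden_units[1], activation="sigmoid"),  # hidden layer 2
        layers.Dense(hidden_units[2], activation="sigmoid"),  # hidden layer 3
        layers.Dense(N_SUBJECTS, activation="softmax"),       # one node per subject
    ])
    return model
```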

Fig. 5. VLBP descriptor generation.

Fig. 6. LBP-TOP descriptor generation.

The network was implemented with the Keras framework on a TensorFlow backend; the optimizer and the evaluation metric used are Stochastic Gradient Descent (SGD) and accuracy, respectively. All the other parameters of the network, such as the number of epochs, batch size, learning rate, momentum, decay and dropout, were optimized experimentally. The best performing configuration was the following: epochs = 20, batch-size = 32, learning-rate = 0.1, decay = 0.000001, momentum = 0.
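Assuming the model sketched above and one-hot encoded subject labels, the reported configuration maps onto a Keras training call along these lines; the categorical cross-entropy loss is an assumption (the paper does not state the loss), and in older Keras versions the decay of 1e-6 was passed directly to the SGD constructor:

```python
from tensorflow import keras

model = build_dffn()

# SGD with the hyper-parameters reported in the paper (decay handled separately
# or via a learning-rate schedule depending on the Keras version).
optimizer = keras.optimizers.SGD(learning_rate=0.1, momentum=0.0)
model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",  # assumed loss for one-hot labels
              metrics=["accuracy"])

# X_train: (n_samples, 19824) LBP-TOP descriptors; y_train: one-hot subject labels.
model.fit(X_train, y_train, epochs=20, batch_size=32)
```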

4 Experiments

The experiments described below were conducted on the XM2VTS public database, which is a reference dataset for audiovisual speaker recognition and lip-based speech/speaker recognition. The test-bed was a Fujitsu Celsius machine, featuring an Intel Xeon Octa-Core 2.10 GHz processor and 128 GB of RAM. XM2VTS comprises recordings of 295 subjects, characterized by great inter-class variability from both a demographic and an ethnic point of view. Furthermore, the variability of the same subject across different sessions, such as beard growth, the presence or absence of glasses and changes of hairstyle, also provides wide intra-class variations (see Fig. 8). The dataset, acquired in a controlled environment, is composed of video clips in which each subject pronounces different sentences.

Fig. 7. Network layout of the fully connected Deep Feed-Forward Network architecture used in the proposed method.

More in detail, the dataset is composed of 3 sections: the first contains 4 sessions in which the user pronounces the phrase "Joe took father's green shoe bench out"; the second contains the rotation of the face from left to right (not used in this work); finally, the third contains 4 sessions, in each of which the subjects repeat twice the sequences "zero one two three four five six seven eight nine" and "five zero six nine two eight one three seven four", interspersed with a small pause. The acquisition of the 295 subjects has been carried out at a resolution of 720 \(\times \) 576 pixels at 25 fps.

As mentioned in Sect. 3, several pre-processing operations were required to use the video clips. First, the two sequences contained in the same video file of each session were separated. The separation was achieved by splitting the video in half; although not very refined, this solution led to optimal results. For each segment thus obtained, the number of frames was computed, determining the minimum (84 frames) and the maximum (344 frames). Utterance speed is understandably different for each subject, according to several factors. One of them depends on the type of sentence to be pronounced: enumerating the digits from zero to nine is simple and natural for everyone, as it is mnemonic, whereas for the other sequences the individual needs to learn the order of the digits, and of the words, in order to pronounce them quickly. In the latter case, we notice a strong temporal difference among the videos of the different sessions. To this variability is added the characteristic speech speed of each subject. Therefore a resampling operation was performed in order to obtain feature vectors of uniform size.
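A minimal sketch of this temporal normalization step, assuming the clips are resampled to a fixed target length by uniform frame-index selection (the exact resampling strategy and target length are not detailed in the paper, so both are assumptions here):

```python
import numpy as np

def resample_clip(frames, target_len):
    """Uniformly resample a sequence of frames to a fixed number of frames.

    frames: sequence of grayscale frames (each 200x200 after cropping).
    target_len: desired clip length, chosen so that probe and gallery clips match.
    """
    frames = np.asarray(frames)
    idx = np.linspace(0, len(frames) - 1, num=target_len)
    return frames[np.round(idx).astype(int)]
```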

The network was trained on the ordered sequence, and three experiments were then conducted to evaluate the robustness of the LBP-TOP descriptor in identifying the subjects. In the first experiment the sequence "zero one two three four five six seven eight nine" was used, split 80%–20% between the training and test sets, respectively.

The ROC (Fig. 9) and FAR/FRR (Fig. 10) curves graphically describe the behavior of the system. The robustness of the proposed approach is confirmed by the EER value of 0.03 and by the CMC (Fig. 11), which is already at 99.8% at rank 0 and reaches 100% at rank 2. This implies a correct classification for almost all the samples, with a very high probability of assignment, as shown by the FAR/FRR curve. For the CMC curve only the first 6 of 295 ranks have been reported, in order to better appreciate the step between ranks 0 and 2.
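For reference, the EER reported above can be obtained from the genuine and impostor score distributions with a simple threshold sweep; the sketch below is a generic illustration of this computation, not the exact evaluation code used for the experiments:

```python
import numpy as np

def far_frr_eer(genuine_scores, impostor_scores, n_thresholds=1000):
    """Compute FAR/FRR curves over a threshold sweep and the Equal Error Rate."""
    scores = np.concatenate([genuine_scores, impostor_scores])
    thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false acceptances
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])    # false rejections
    i = np.argmin(np.abs(far - frr))  # threshold where FAR and FRR cross
    eer = (far[i] + frr[i]) / 2.0
    return far, frr, eer
```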

Fig. 8. Inter-class and intra-class variability in the XM2VTS dataset.

Afterwards, the network was tested on the unordered sequence ("five zero six nine two eight one three seven four"), on which it had not been trained. In this test the accuracy dropped to 98.9%. Finally, to verify the robustness of the model, the non-numerical sentence "Joe took father's green shoe bench out" was used in the test phase of the third experiment. The accuracy rate was 98.4%. Table 1 reports a summary of the results obtained for the three types of sentences pronounced by the subjects in the database. It is worth noting that if a sequence of separate still images of the same person were used instead of a video recording the actual face motion due to speech, the recognition performance would be very low even for a genuine subject. This is directly due to the kind of motion information the method is able to represent, which consists of organized changes of the lower face region rather than generic differences between a sequence of frames.

Fig. 9. ROC curve for the proposed method.

Fig. 10. FAR/FRR curve and EER for the proposed method.

Fig. 11. CMC curve for the proposed method.

The behavior of the proposed method, as depicted by the results of the experiments, confirms its intrinsic reliability in application contexts where the risk of counterfeiting is potentially high. The dynamic face signature it provides is indeed much more difficult to forge than any conventional static face descriptor. At the same time, the low Equal Error Rate makes a biometric system based on the proposed descriptor suited to medium- to high-security applications.

Table 1. Summary of the experiments.

5 Conclusions

We presented a method for person recognition that exploits an LBP-TOP based representation of dynamic facial features to provide increased dependability in face biometrics, thanks to the intrinsic difficulty of forging such a time-dependent descriptor. The proposed deep feed-forward network, trained and tested on the audiovisual speech samples of the XM2VTS database, delivered a 99.8% recognition rate, dropping to 98.4% in challenging testing conditions, achieving a state-of-the-art performance level in both cases. Future research will concern more challenging experiments, including other public datasets and a direct comparison with the best methods available in the literature. An extension of this work could also include the audio component of the speech samples to implement a bi-modal biometric system, further improving both the accuracy and the reliability of the proposed method.