1 Introduction

In general, facial expression plays a major role in signifying the emotional and mental state of a human being. Psychologists report that only 7% of the total information in communication is conveyed through language, 38% through paralinguistic cues such as speech rhythm and tone, and 55% through facial expression. Hence, Facial Expression Recognition (FER) is considered an important research topic for real-time applications such as computer vision, robotics, medical treatment, driver fatigue surveillance, and many other Human-Computer Interaction (HCI) applications. Ekman and Friesen [16] proposed six basic emotions, namely anger, happiness, surprise, sadness, disgust, and fear, that are frequently expressed through the face and are triggered by events experienced during communication. Traditional approaches may not handle the requirements of real-time applications effectively. In addition, changes in illumination conditions, noise, occlusion, and variations in head posture are important factors that affect expression recognition systems, resulting in uncertainty and ambiguity. Many techniques use handcrafted features for facial expression recognition and thus require increased effort in terms of programming and computational cost.

It is well known that expression and human emotions are natural and intuitive processes. According to psychologists, various parts of the human face contribute to different categories of emotions. For instance, the upper part of the face, such as the eyes and eyebrows, contributes to anger and fear. On the other hand, disgust and happiness are mainly expressed through the mouth [10], while surprise is influenced by both the upper and lower parts of the face. In complex applications such as Human-Computer Interaction (HCI), assistive robotics, and gaming, it is important to measure the expression with accurate emotional information and minimal error [9].

It is observed that automatic FER is a complex task due to variations in facial features in terms of a person's identity, illumination conditions, head pose, etc. Presently available human-computer interfaces are unable to provide natural interaction to users, as they do not take complete advantage of the communicative media. HCI can be improved significantly if computer systems recognize the user's emotions through facial expressions and body gestures and react to the user's wishes. Thus, researchers have shown interest in automatic FER systems, as the classical systems have performed poorly.

It is known that emotion recognition involves two primary steps: feature extraction and classification. Feature extraction refers to the selection of a set of features or attributes that influence emotional expressions. The emotional expression, in turn, is influenced by the motions and positions of the facial features and the muscles beneath the skin, and these movements convey the emotional state to an observer. The classification mechanism identifies the class of emotion from the extracted features. As a result, FER systems are evaluated based on recognition rate, which depends heavily on the feature extraction strategy. At times, a good classifier may not provide high classification accuracy if the extracted features are poor. Conversely, even with a large set of features that describes an emotion, the recognition system fails if the classifier is poor. Thus, many techniques are used for feature selection, such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), Rough Sets, Gabor Filters, and Fourier Descriptors. Similarly, various machine learning techniques such as Neural Networks, fuzzy approaches, Support Vector Machines (SVM), and Hidden Markov Models (HMM) are used for facial expression classification. In addition, various segmentation algorithms have been proposed recently; for instance, the authors of [37, 38] proposed automating the segmentation of brain images.

Thus, this paper addresses the above-discussed issues by considering the nose region of the human face and attempting to understand its proportional contribution to facial expressions. To the best of our knowledge, no literature is available that uses a pyramid structure to extract features from the nose. The sample images are shown with hand-drawn annotations for better clarity; otherwise, the method extracts the nose features in a fully automated way. We extract the micro-level feature of the nose through segmentation, nose localization, etc. The advantage of the proposed feature over deep learning features is that we use a geometrical model based on the pyramid and tetrahedron, which is robust enough to be used on any type of nose. In contrast, with deep learning, a larger number of filters would be required to model the structure of the nose, which is complex and time-consuming.

The proposed approach builds a geometrical model to map the human nose and superimposes a pyramid structure on it. The points of interest and their deformations are extracted from the input image, specifically from the Region of Interest (RoI), and used as a feature vector for further analysis and interpretation. The rest of the article is organized as follows:

  • In Section 2, the recent and relevant literature is reviewed.

  • In Section 3, the pyramid structure of the nose is proposed and a feature extraction scheme is presented.

  • In Section 4, the experimental results are presented with discussion and analysis.

  • The last section concludes the paper.

2 Literature survey

The applications of facial expression recognition from the human face are manifold, including computerized psychological counseling and therapy, identifying criminal and antisocial motives, etc. This section presents some of the recent state-of-the-art literature. As early as 1975, the authors of [15] measured facial expressions with the help of the movement of the cheek and chin regions and the wrinkles of the human face. Various authors have proposed approaches such as the Fuzzy Integral [26], Fourier descriptors [42], Template Matching [8], and Neural Network models [18] to measure human expressions. Cohen et al. [12] recognize the temporal variations of emotions in facial expressions from live video. The frames of the videos are continuously monitored and the features are extracted frame-wise. However, it is noticed that measuring facial expression from video is hard when its resolution and clarity are low. Gao et al. [20] presented a line-based caricature method that uses a single facial image to recognize facial expression. The performance of this approach was found to be low, as it models an imitation of the person from the picture, and the time complexity of the algorithm was poor. Li and Ji [28] proposed an approach based on a probabilistic framework to dynamically model and recognize the user's expressions, which assists in the form of a response. However, one of the drawbacks of this approach is that segmenting the Region of Interest (RoI) for localizing the area is difficult, due to the significant variation in the attribute values of different regions of the face. The authors of [54] stated that facial expressions have a direct correlation with many basic movements of the human eyes, eyebrows, and mouth. Fuzzy kernel clustering has been used to identify the correlation between the basic movements of the facial action units. It is reported that it was hard to measure the exact facial expression from blurred input faces, since this method was unable to handle blurred images. Chakraborty et al. [11] studied the psychological states of the subjects, which affect the emotional features. For instance, wide variations in the facial expressions of the human face influence the individual features of emotion; there are differences in the facial features of different subjects for the same emotion, and there is a small random variation in the facial features around a specific feature point. This has been observed in multiple human face instances of similar emotions through repeated experiments. However, it is noticed that different human subjects exhibit many complexities in terms of the features of the respective facial components. Zhang and Liu [51] proposed a fuzzy logic approach for recognizing the emotional expression from the human face. The emotion is excited by an audiovisual stimulus and the variations in the facial regions are recorded as video clips. It is observed that the automated emotion recognition system failed to map human emotions to human perception in many of the video clips.

In recent times, automated segmentation techniques have been used for segmenting objects in images. The authors of SivaSai et al. [36] proposed an automated fuzzy-based segmentation technique with a recurrent neural network. The image is preprocessed using an Adaptive Bilateral Filter (ABF) for removing noise, and binary thresholding with a Fuzzy Recurrent Neural Network (FR-Net) is used for segmenting the Region of Interest (ROI). The authors of [39] presented different optimization techniques that are used in automated segmentation procedures; the performance of the algorithms is evaluated quantitatively using a genetic algorithm-based approach. A two-level classification approach was presented by Drume and Jalal [14], in which both Principal Component Analysis (PCA) and Support Vector Machine (SVM) are used for classification. However, the accuracy was not encouraging, since the improvement was marginal. Benitez-Quiroz et al. [4] proposed an approach that annotates both the Action Units (AUs) and their corresponding intensities to label the motion categories of human facial expression. Various statistical parameters of facial points, such as distances and angles between facial landmarks, are defined as geometric features. The change in the shading feature of the skin regions is extracted using Gabor filters to represent both the shape and geometric features of every AU. The activation level of every AU is calculated using Kernel Subclass Discriminant Analysis (KSDA) [48] to determine emotion categories. A method has been proposed using a feature selection algorithm to classify emotions based on appearance [24]; however, this approach failed to handle noise and to generalize the appearance model. Demir [13] used the Local Curvelet Transform (LCT) as a feature descriptor to extract geometric features based on a wrapping mechanism; the extracted features are the mean, entropy, and standard deviation. In addition, geometrical features such as energy and kurtosis are extracted using a three-stage steerable pyramid representation by Mahersia and Hamrouni [31]. Biswas [7] extracted features using the Discrete Contourlet Transform (DCT), which is performed in two key stages: the Laplacian Pyramid (LP) and the Directional Filter Bank (DFB). In the LP stage, the image is partitioned into low-pass and bandpass components and the positions of discontinuities are confined. In the DFB stage, the bandpass portion is processed to form a linear composition by associating the positions of the discontinuities. Siddiqi et al. [35] used Stepwise Linear Discriminant Analysis (SWLDA) to extract localized features with backward and forward regression models. However, this method faces challenges in recognizing emotion when there is variation in terms of age, culture, and gender. It is well known that variations in image size, facial orientation, glasses or masks on the face, and lighting conditions are factors that increase the complexity of FER. Vivek and Reddy [44] combined three well-known machine learning algorithms, Cat Swarm Optimization (CSO), Genetic Algorithm (GA), and Particle Swarm Optimization (PSO), for determining the human facial expression. However, the classification is poor when there is a fusion of emotions. Similarly, Alhussein [1] used a set of well-known algorithms such as the Radon Transform (RT) and Higher-Order Spectra (HOS) and found that the misclassification rate was on the higher side.
A large number of methods such as Biorthogonal Wavelet Entropy, Fuzzy SVM, and Stratified Cross-Validation have been combined to recognize human emotion in Zhang et al. [53]. However, the performance degraded in the presence of outliers and noise.

Liu et al. [30] constructed features that are intended to be informative and non-redundant to facilitate learning and generalization. The approach uses PCA and LBP for extracting features from active facial patches. However, the classification accuracy is reduced in the presence of variation in illumination. Kumar et al. [27] proposed a Weighted Projection-based LBP (WPLBP) approach to extract features based on informative regions; the LBP features are assigned weights based on the significance of these regions. The recognition rate of this approach is reduced by occlusions and head pose variations. Salmam et al. [33] proposed a Supervised Descent Method (SDM) for extracting features in three stages. In the first and second stages, the main facial positions are extracted for selecting the related positions. At the final stage, the distances between the various components of the face are estimated using a Classification and Regression Tree (CART), and metrics such as the decision tree's Gini impurity are estimated to classify the expression. However, this method fails to perform well if the input image is of low resolution.

Li et al. [29] created the RAF-DB database, which contains 29,672 image instances based on crowdsourcing annotation. Each image was labeled independently by about forty annotators.

A deep locality-preserving learning algorithm has been proposed for recognizing facial expressions and achieved only 44.55% accuracy. Ghimire et al. [22] used fifty-two facial key points, which are extended into lines and triangles from which geometrical features are extracted. The triangular geometric features achieved good accuracy; however, the method failed for low-resolution images. Zavarez et al. [49] employed Gabor motion energy filters to identify the dynamic facial expressions of individuals. These filters, along with Genetic Algorithms (GA) and SVM, are used as features in experiments to classify facial expressions from video sequences; one of the issues with the approach is its time complexity. Zhong et al. [55] proposed an FER framework based on subspace learning on local structure. Bilkhu et al. [6] proposed an FER system to recognize six emotional features, in which a cascaded Regression Tree is applied for extracting the features and Logistic Regression, Support Vector Machine, and Neural Network classifiers are used for classifying the six emotions. However, one of the issues of this approach is that it fails to distinguish certain expressions such as disgust from anger and sadness from fear, due to the high similarity between these classes of facial expressions, which leads to a higher misclassification rate. Florea et al. [19] proposed an approach that combines semi-supervised learning and inductive transfer learning into an Annealed Label Transfer (ALT) framework to tackle the label scarcity issue in FER. ALT transfers a learner's knowledge from a labeled in-the-wild FER dataset to an unlabeled face dataset to generate pseudo labels, and the confidence of these labels is increased to improve the primary learner's recognition performance. Zeng et al. [50] proposed Inconsistent Pseudo Annotations to Latent Truth (IPA2LT) to address label noise and alleviate prediction bias toward a specific in-the-wild FER dataset. IPA2LT trains a Latent Truth Network (LTNet) to extract the true latent label of a sample using the inconsistency between the labels generated by a prediction model and the manual labels. Ghazouani [21] proposed the GP-FER framework, in which genetic programming is used for selecting features and a fusion technique is applied for facial expression recognition. A tree-based genetic program with three functional layers (feature selection, feature fusion, and classification) is formed. A binary classifier performs discriminative feature selection and fusion for each pair of expression classes, and the final emotion is obtained by performing a tournament elimination between all the classes using the binary programs. Yao et al. [47] used active learning and Support Vector Machine (SVM) to classify facial Action Units (AUs) for recognizing facial expressions. Active learning is used to detect the targeted facial expression AUs, while SVM is used to classify the different AUs and ultimately map them to their corresponding facial expressions. Shahid et al. [34] proposed a two-stage approach to distinguish seven expressions on the basis of eleven different facial areas. The combination of contour and region harmonics is used to model the interrelationship of sub-local areas in the human face, and a multi-class support vector machine (SVM) with subject-dependent k-fold cross-validation is applied to classify human emotions into expressions.
The authors Swaminathan and Vadivel [41] proposed 37 combined emotions, of which 16 are newly derived and validated using the Facial Action Coding System (FACS) and statistical analysis of the combined emotions. Avani et al. [2] proposed a parabola theory to map the mouth features. The points on the lips are considered feature points to construct feature vectors; the latus rectum, focal point, directrix, vertex, etc. are used to identify the feature points of the lower and upper lips in order to understand the facial expressions. Avani et al. [3] proposed an interval graph of facial regions for FER, in which the common intersecting salient points of facial regions are assumed for estimating the emotions. The facial region is decomposed into four sub-regions and the interval graph is extracted for each region; the common salient points and the degree and direction of deformation are measured along the vertical, horizontal, and diagonal directions. Wang et al. [45] addressed label uncertainty by proposing a Self-Cure Network (SCN) to re-label mislabeled samples. A self-attention mechanism estimates a weight for each sample in a batch based on the network's prediction and identifies label uncertainty using a margin-based loss function. One of the drawbacks of this approach is that it fails to recognize expressions when the face in the image is non-frontal or under varying lighting conditions and occlusions.

The same authors further designed a Region Attention Network (RAN) to address pose and occlusion in in-the-wild FER datasets by passing regions around facial landmarks to a CNN [46]. The final feature vector is obtained by combining the weighted feature vectors of the cropped regions using a self-attention module. Though the model learned to discriminate occluded and non-occluded faces, it failed to distinguish disgust from anger and sadness from fear. Minaee et al. [32] proposed a deep learning method based on an attentional convolutional network that focuses on critical areas of the face, evaluated on various datasets including FER-2013, CK+, and JAFFE. It encodes shape, appearance, and extensive dynamic information; however, one of the issues with this approach is that it requires more time for training. A significant challenge that deep FER systems face is the lack of sufficient high-quality training data in large volumes. This issue has been handled by proposing feature extraction techniques by Fan et al. [17]. This approach hybridizes several machine learning methods, such as support vector regression (SVR), grey catastrophe (GC), and random forest (RF); however, it is found that this method requires voluminous training data for the training algorithm to converge. In recent times, hypotheses have been proposed and validated using statistical tests on time series data. Though these methods are effective for time series data, they are found to be unsuitable for the features of the proposed work.

Based on the above review, it is clear that a method is required to handle various issues such as convergence of the training and learning algorithm and variation in illumination, pose, and orientation, while exploiting the importance of the nose component to improve performance. As a result, we propose a novel scheme to extract a nose-based feature using the properties of the pyramid and tetrahedron. Since this feature is based on geometrical properties, it handles the variation in illumination, pose, and orientation. The performance is measured in terms of classification accuracy, and it is found that the misclassification rate is very low.

3 Proposed approach

3.1 Segmentation of the nose region

The input image is converted to an HSV image for effective processing. The HSV color space is more robust to changes in external lighting: in cases where there is a minor change in external lighting (such as pale shadows), hue values vary relatively less than RGB values. The colors used in HSV can be related to human perception [43], which is not always the case with RGB or CMYK. The image consists of three channels, namely Hue, Saturation, and Value. This color model does not use the primary colors directly; it uses color in the way humans perceive it. As per Stockman and Shapiro [40], the HSV color space can be represented pictorially as a hexacone, as shown in Fig. 1. The Intensity (I) or Value (V) is the vertical axis, H is the color represented as an angular value, and S denotes the saturation, measuring the depth of the color.

Fig. 1

The HSV color space – hexacone
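As a minimal sketch of this conversion step, the snippet below assumes OpenCV is available and uses a hypothetical file name; it is an illustration, not the authors' implementation.

```python
import cv2

# Illustrative sketch: convert an input face image from BGR (OpenCV's default
# channel order) to the HSV color space and separate the three channels.
bgr = cv2.imread("face_sample.jpg")            # hypothetical file name
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)

# Hue is stored in the range 0-179 in OpenCV; Saturation and Value in 0-255.
h, s, v = cv2.split(hsv)
```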

Below, we present the proposed approach, which considers and analyses human expressions such as happiness, sadness, fear, anger, disgust, and surprise. We consider the human nose to analyze its contribution to the facial expression and map the tetrahedron/pyramid geometric structure onto it to extract the important feature vectors. The overall flow diagram of the proposed approach is depicted in Fig. 2 below.

Fig. 2

The framework of the proposed approach

It is observed from Fig. 2 that, given an input image, the nose component of the face is localized and segmented using the boundary information in the preprocessing stage. We have used Fuzzy C-Means clustering for segmentation, and the corresponding mathematics is given in Eqs. 1-3. A sample output of the clustering is given below in Fig. 3.

Fig. 3

Sample output of Fuzzy C-Means Clustering (a) Original Image (b) Segmented Image of Sample 1 from JAFFE dataset (c) Nose Localization of Sample 1 (d) Another Sample Image (e) Segmented Image of Sample 2 from JAFFE dataset (f) Nose Localization of Sample 2

The proposed geometrical structure based on the properties of the pyramid is superimposed over the nose component of the facial image after localizing the nose, as depicted in Fig. 3. In Fig. 3(a), a sample image from the JAFFE dataset is depicted, and Fig. 3(b) and (c) show the segmented image and the pyramid structure superimposed on it. Similarly, Fig. 3(d), (e), and (f) are the original, segmented, and pyramid-superimposed images of another sample image from the JAFFE dataset. Various components of the face, such as the eyes and mouth, are used as references for exactly superimposing the pyramid structure, and an example is given in Fig. 4. The variables of the pyramid structure are explained later in the paper.

Fig. 4

Nose component area superimposed with the pyramid structure. (a) Sample nose localized image (b) Feature Points (Variables)

The training and testing phases of the proposed work apply n-fold cross-validation, as depicted in Fig. 5. During the training phase, the feature is extracted for all the training images and stored in a repository. Similarly, the feature is extracted for a given input image and compared with all the features in the repository. We use the Manhattan distance to calculate the similarity between the feature of the query facial image and the features stored in the repository. Since n-fold cross-validation is used over all the features stored in the repository, every feature contributes to both testing and training, which improves the classification accuracy.

Fig. 5

Overall proposed approach Flow diagram
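A minimal sketch of this matching step is given below, assuming the stored features and the query feature are numeric vectors of equal length; the names are illustrative and not part of the original implementation.

```python
import numpy as np

def manhattan_distance(f1: np.ndarray, f2: np.ndarray) -> float:
    """L1 (Manhattan) distance between two nose feature vectors."""
    return float(np.sum(np.abs(f1 - f2)))

def nearest_expression(query_feature, repository):
    """repository: iterable of (feature_vector, expression_label) pairs built
    during training; returns the label of the closest stored feature."""
    query = np.asarray(query_feature, dtype=float)
    best_label, best_dist = None, float("inf")
    for feature, label in repository:
        d = manhattan_distance(query, np.asarray(feature, dtype=float))
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label
```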

Below, we present the mathematical basis of the proposed approach.

The Fuzzy C-Means (FCM) algorithm [5] is used for segmenting the nose region of the face. It is a clustering scheme that divides the data into two or more clusters, with each data point allowed to belong to more than one cluster. The degree of fuzziness is controlled by a suitable objective function, which the algorithm seeks to minimize. The objective function is given in Eq. 1.

$$ {J}_{FCM}\left(u,v\right)={\sum}_{i=1}^c{\sum}_{k=1}^N{\mu}_{ik}^p\ {\left\Vert {x}_k-{v}_i\right\Vert}^2 $$
(1)

In Eq. 1, v = {vi}, i = 1, …, c, are the cluster centroids, U = {μik} is the partition (membership) matrix, c is the number of clusters, N is the total number of pixels, xk is the kth pixel, and vi is the centroid of the ith cluster. Further, ‖xk − vi‖² is the squared distance between the cluster centre vi and the kth pixel xk, and μik is the fuzzy membership value of the kth pixel in the ith cluster. This membership value μik has to satisfy the following constraint.

$$ {\mu}_{ik}\in \left[0,1\right],\quad 0<{\sum}_{k=1}^N{\mu}_{ik}<N\ \left(1\le i\le c\right),\quad \mathrm{and}\ {\sum}_{i=1}^c{\mu}_{ik}=1\ \left(1\le k\le N\right). $$

It is known from Eq. 1 that the objective function minimizes the distance between each pixel xk and the centroid vi of the ith cluster. Correspondingly, μik is the fuzzy membership value of the kth pixel associated with the ith cluster; its value lies in the range [0, 1], and its association with all the clusters is interpreted from the value of the membership function. Based on the membership values, the partition matrix and the cluster centroids are updated using Eqs. (2) and (3) given below.

$$ {\mu}_{ik}=\frac{1}{\sum_{j=1}^c{\left({d}_{ik}^2/{d}_{jk}^2\right)}^{1/p}} $$
(2)
$$ {v}_i=\frac{\sum_{k=1}^N\ {\mu}_{ik}^p\ {x}_k\ }{\sum_{k=1}^N{\mu}_{ik}^p} $$
(3)

The FCM is applied to segment the nose from the face, and sample segmentation results are depicted in Figs. 3 and 4.
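The following is a compact sketch of the FCM iteration corresponding to Eqs. (1)-(3), assuming the nose-region pixels have been flattened into a one-dimensional intensity array; the standard 1/(p − 1) exponent is used for the membership update, and all names and values are illustrative.

```python
import numpy as np

def fuzzy_c_means(pixels, c=2, p=2.0, iters=100, eps=1e-5):
    """pixels: (N,) array of intensity values; returns (centroids, memberships)."""
    rng = np.random.default_rng(0)
    N = pixels.shape[0]
    # Random initial partition matrix; memberships of each pixel sum to one.
    u = rng.random((c, N))
    u /= u.sum(axis=0, keepdims=True)
    for _ in range(iters):
        um = u ** p
        v = (um @ pixels) / um.sum(axis=1)                # Eq. (3): cluster centroids
        d2 = (pixels[None, :] - v[:, None]) ** 2 + 1e-12  # squared distances d_ik^2
        ratio = d2[:, None, :] / d2[None, :, :]
        u_new = 1.0 / np.sum(ratio ** (1.0 / (p - 1)), axis=1)  # Eq. (2): memberships
        if np.abs(u_new - u).max() < eps:
            u = u_new
            break
        u = u_new
    return v, u

# Example: two clusters roughly separate nose-region pixels from the background.
centroids, memberships = fuzzy_c_means(np.random.default_rng(1).random(500))
```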

We have considered the Yale dataset (details are given in the experimental section) for conducting the experiments. The segmented nose region along with the original nose image is shown in Fig. 3. It is observed from the figure that FCM in the HSV color space can segment the nose region effectively for further processing and analysis.

3.2 Analysis of nose region using pyramid structure

The procedure for extracting features, along with the required variables for the nose, is depicted in Fig. 2. Most pyramids are characterized by their base. All pyramid structures share a common property: their sides are triangular. When all the triangles are equilateral, the pyramid is termed a regular tetrahedron. In contrast, if the edges of the triangles have different lengths, the pyramid is termed an irregular tetrahedron. In the proposed approach, the properties of the triangular-base pyramid are exploited to estimate the expression/emotion on the human face. The nose on the face is represented as a triangular-base pyramid and is characterized as a regular or irregular tetrahedron based on various properties. The nose of the human face may not always form a regular tetrahedron and is often irregular. A sample tetrahedron is shown in Fig. 6.

Fig. 6

Pyramid and Tetrahedron model (a) Sample tetrahedron (b) Tetrahedron overlapped over a Nose

In Fig. 6(a) and (b), the length of an edge is denoted as a and the area is denoted as A. The lateral area of the tetrahedron is a function of the areas of three congruent equilateral triangles, which is given below.

$$ A=\left(x\ast \frac{\sqrt{x}}{y}\right)\ast {a}^2 $$
(4)

In the above equation, the values of x and y are 3 and 4, respectively, and A is expressed in square units. The Area of the Whole surface (AW) of the tetrahedron is a function of the areas of four congruent equilateral triangles and is given below.

$$ AW=\left({x}_1\ast \frac{\sqrt{x}}{y}\right)\ast {a}^2=\sqrt{x}\ast {a}^2\ square\ units $$
(5)

In the above equation, the values of x1, x, and y are 4, 3, and 4, respectively. The Volume (V) of the regular tetrahedron is given as

$$ V=\left(\frac{1}{3}\times area\ of\ the\ base\times height\right) $$
(6)
$$ V=\left(\frac{1}{3}\right)\times \left(\frac{\sqrt{3}}{4}\right)\times {a}^2\times \frac{\left(\sqrt{2}\right)}{\sqrt{3}}\times a $$
(7)
$$ V=\left(\frac{\sqrt{2}}{12}\right)\times {a}^3\ cubic\ units $$
(8)

The area (A) and the area of the whole surface (AW) given in Eqs. (4) and (5) are functions of the edge length of the tetrahedron. Further, in the plane ∆JLK we have JN ⊥ LK; thus

$$ {JN}^2=\left({JL}^2-{LN}^2\right)={a}^2-{\left(\frac{a}{2}\right)}^2=\frac{3{a}^2}{4} $$
$$ Now,\kern0.5em JG=\left(\frac{2}{3}\right) JN\ or\ {JG}^2=\left(\frac{4}{9}\right)\times {JN}^2 $$
$$ or\ {JG}^2=\left(\left(\frac{4}{9}\right)\times \left(\frac{3}{4}\right)\times {a}^2\right) $$
$$ or\ {JG}^2=\left(\frac{a^2}{3}\right) $$

In addition, MG ⊥ JG and JM = a; thus, from ∆JGM we get,

$$ {MG}^2={JM}^2-{JG}^2 $$
$$ or\kern0.5em {MG}^2=\left({a}^2-\left(\frac{a^2}{3}\right)\right) $$
$$ or\kern0.5em {MG}^2=\left(\frac{2{a}^2}{3}\right) $$

As a result,

$$ MG=\frac{\sqrt{2}}{\sqrt{3}}\ a= height\ of\ the\ regular\ tetrahedron $$
(9)

The height of the regular tetrahedron given in Eq. 9 above is a function of a. As a result, the tetrahedron-based model of the nose of the human face can be effectively used for estimating emotion. For an emotion, say happiness, the deformation of the nose area is measured and compared with the neutral nose. A similarity measure, the Manhattan distance, is used for calculating the deformation, from which the emotion type is estimated.
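For clarity, the quantities of Eqs. (4)-(9) can be computed directly from the edge length a; the short sketch below also illustrates the Manhattan-distance comparison against a neutral nose (the numeric values are purely illustrative, not measurements from the datasets).

```python
import math

def tetrahedron_features(a: float) -> dict:
    """Geometric features of a regular tetrahedron with edge length a."""
    face_area = (math.sqrt(3) / 4.0) * a ** 2     # area of one equilateral face
    lateral_area = 3 * face_area                  # Eq. (4): three congruent faces
    whole_area = 4 * face_area                    # Eq. (5): sqrt(3) * a^2
    volume = (math.sqrt(2) / 12.0) * a ** 3       # Eq. (8)
    height = math.sqrt(2.0 / 3.0) * a             # Eq. (9)
    return {"A": lateral_area, "AW": whole_area, "V": volume, "H": height}

def deformation(expressive: dict, neutral: dict) -> float:
    """Manhattan distance between the features of an expressive and a neutral nose."""
    return sum(abs(expressive[k] - neutral[k]) for k in expressive)

# Illustrative comparison of a slightly deformed pyramid against the neutral one.
print(deformation(tetrahedron_features(1.08), tetrahedron_features(1.00)))
```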

3.3 Classification of facial expressions

In this subsection, the procedure for modelling the feature vector, consisting of the area, the area of the whole surface, the volume, the height, and the points on the nose from a to m, is presented. These features are later given as input to classifiers. Support Vector Machine, Multi-Layer Perceptron, and Random Forest classifiers are used for classifying the facial expressions. The sections below present the classification procedure for these classifiers.

3.3.1 Support vector machine (SVM)

SVM is used in the proposed approach for classification, as it is one of the well-known machine learning algorithms. SVM constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space and addresses the binary pattern classification problem; accordingly, it can be used in linear and nonlinear forms. In SVM, the margin between the classes is maximized while maintaining a minimum training set error. It is possible to find a global minimum easily, as SVM training is a convex quadratic programming problem. However, it is important to set and tune several important parameters of the SVM classifier, such as the penalty parameter and the smoothness parameter of the radial basis function. The feature database is trained by the SVM as given below. The training dataset TS is represented as follows.

$$ TS=\left\{\left({fv}_1,{c}_1\right),\left({fv}_2,{c}_2\right),\left({fv}_3,{c}_3\right),\dots, \left({fv}_n,{c}_n\right)\right\},\quad {c}_j\in \left\{ Emotional\ Expressions\right\},\quad {fv}_i\in {FV}^d $$
(10)

In the above Eq. (10), fvi represents the feature vector extracted from the nose, cj represents the corresponding expression category, and n is the size of the dataset. The feature vectors are n-fold cross-validated for training the SVM. To train the multi-class SVM classifier for all six well-known emotional expressions, the images from the CK++ and JAFFE datasets are grouped based on their emotional category. The training and testing sets are presented to the SVM for training, as it is a supervised classifier. As a result, the classifier effectively finds the decision boundaries between the various emotional categories, as shown in Fig. 7. During training, a linear kernel is used such that each input fvi has d attributes denoting its class cj ∈ {Emotional Expressions}. The training dataset TS is shown in Eq. (10).

Fig. 7

Sample multiclass output space of SVM
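A sketch of this training step is given below, assuming the nose feature vectors and their expression labels have already been assembled into arrays X and y; scikit-learn is used here for illustration, and the file names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# X: (n_samples, d) nose feature vectors, y: expression category labels.
X = np.load("nose_features.npy")               # hypothetical feature repository
y = np.load("expression_labels.npy")

clf = SVC(kernel="linear", C=1.0)              # multi-class handled internally (one-vs-one)
scores = cross_val_score(clf, X, y, cv=10)     # n-fold cross-validation over the dataset
print("mean accuracy: %.2f%%" % (100 * scores.mean()))
```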

3.3.2 Multi-layer perceptron (MLP)

MLP is a feedforward neural network. Generally, it consists of input, hidden, and output layers; the number of hidden layers can be one or more, depending on the problem. The data is presented to the input layer, processed by the input and hidden layers, and fed forward to the output layer, as represented in Fig. 8. The MLP uses a supervised learning technique called backpropagation for training, and it can solve problems that are not linearly separable. For instance, let us consider the feature vector from all regions of the face. If we have m input data {(fv1), (fv2), (fv3), …, (fvm)}, they are called m features. A feature, say one row of the feature database, is considered one variable, which influences a specific emotional category as the outcome. Each of the m features is multiplied by a weight (w1, w2, …, wm) and the products are summed to obtain a dot product as given below.

$$ W(FV)={w}_1\left({fv}_1\right)+{w}_2\left({fv}_2\right)+\dots +{w}_m\left({fv}_m\right) $$
(11)
$$ W(FV)={\sum}_{i=1}^m{w}_i\left({fv}_i\right) $$
(12)

Here, the m features are given as input in the form of (FV) to the input layer of the MLP, as shown in Fig. 8. Suitable weights are determined during the training process such that the dot product between the weights and the inputs is calculated as shown in Eqs. (11) and (12). Similarly, the weights for each neuron in the hidden layers are also calculated. The dot product between the weights and the input values of a hidden layer is computed in the same way as for the input layer; as a result, each hidden layer can be viewed as yet another single-layer perceptron.

Fig. 8

Typical Structure of Multi-Layer Perceptron
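The weighted sum of Eqs. (11)-(12) is simply a dot product evaluated at each neuron; the sketch below shows this computation together with a one-hidden-layer MLP using the layer size and training parameters reported in Section 4 (scikit-learn is used as a stand-in, and X and y are the same hypothetical arrays as in the SVM sketch).

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def weighted_sum(w: np.ndarray, fv: np.ndarray) -> float:
    """Eq. (12): dot product between a neuron's weights and the feature vector."""
    return float(np.dot(w, fv))

# One hidden layer with twenty neurons, trained with backpropagation (SGD).
mlp = MLPClassifier(hidden_layer_sizes=(20,), solver="sgd",
                    learning_rate_init=0.3, momentum=0.2, max_iter=500)
# mlp.fit(X, y)               # X, y as in the SVM sketch
# predictions = mlp.predict(X)
```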

3.3.3 Random Forest

A random forest classifier consists of a number of individual decision trees that operate as an ensemble. Each decision tree predicts a class, and the class that receives the most votes is taken as the predicted class. A typical structure of a Random Forest is shown in Fig. 9.

Fig. 9

Random Forest

The procedure for training the Random Forest in the proposed approach is as follows. We consider the feature vectors (P), extracted from each dataset, for training the classifier.

$$ P={p}_1,{p}_2,\dots ..{p}_n $$
(13)
$$ C={c}_1,{c}_2,\dots ..{c}_m $$
(14)

In the above Eqs. (13)-(14), n is the number of data samples for training and m is the number of emotional expression classes. Now, given the training set P with responses C, random samples are selected with replacement from the training set by repeatedly performing bagging B times. Once the training is over, the prediction for an unseen sample p′ can be estimated by averaging the predictions of all the individual regression trees on p′, as given in Eq. (15) below.

$$ \hat{f}=\frac{1}{B}{\sum}_{b=1}^B{f}_b\left({p}^{\prime}\right) $$
(15)

The proposed approach minimizes the variance of the above model so as to obtain strongly de-correlated trees. In addition, the uncertainty of the prediction for a sample p′ can be written as follows.

$$ \sigma =\sqrt{\frac{\sum_{b=1}^B{\left({f}_b\left({p}^{\prime}\right)-\hat{f}\right)}^2}{B-1}} $$
(16)

In the above Eq. (16), B is the number of bagging iterations (i.e., the number of trees).
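A runnable sketch of Eqs. (15)-(16) is given below, assuming the expression classes are coded as integers so that averaging over the individual trees is well defined; the data here is a synthetic stand-in, not the actual nose features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: 120 feature vectors of length 17 with integer-coded labels 0..6.
rng = np.random.default_rng(0)
X = rng.random((120, 17))
y = rng.integers(0, 7, size=120)

forest = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X, y)

def prediction_with_uncertainty(forest, p_prime):
    """Bagged prediction f_hat (Eq. 15) and its spread sigma (Eq. 16) over the B trees."""
    p_prime = np.asarray(p_prime).reshape(1, -1)
    votes = np.array([tree.predict(p_prime)[0] for tree in forest.estimators_])
    f_hat = votes.mean()                                              # Eq. (15)
    sigma = np.sqrt(((votes - f_hat) ** 2).sum() / (len(votes) - 1))  # Eq. (16)
    return f_hat, sigma

print(prediction_with_uncertainty(forest, X[0]))
```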

For all the classifiers, the training uses n-fold cross-validation to ensure the participation of each sample in both training and testing, which improves the classification accuracy. The performance of each classifier in the experiments is presented in Section 4.

4 Experimental results

4.1 Details of dataset

In this section, the performance of the proposed geometrical feature of the nose, based on the properties of the pyramid, is presented. We have considered the well-known benchmark datasets Japanese Female Facial Expression (JAFFE), Cohn-Kanade Facial Expression (CK++), and the Indian Spontaneous Expression Database (ISED) for the experiments. The JAFFE database contains 213 images in seven expression classes, namely Sadness, Neutral, Happiness, Disgust, Anger, Fear, and Surprise, for 10 different Japanese models. All the face images are resized to 256 × 256. The Cohn-Kanade Facial Expression database (CK++) was developed at Carnegie Mellon University and the University of Pittsburgh and has 593 images belonging to 123 different subjects. Happy et al. [25] created the ISED dataset for analyzing facial expressions; it consists of images with spontaneous expressions of both male and female subjects from India, covering 50 persons with 428 video frames. Fan et al. [17] have used Kolmogorov-Smirnov Predictive Accuracy (KSPA) and the Augmented Dickey-Fuller (ADF) test for testing the data. Their data concerns fluctuations in electricity load, which is time series data, and thus KSPA- and ADF-type methods can be used to test the result. However, in this work we have static data, and KSPA and ADF may not be used for testing. The effect of illumination is handled using histogram equalization, which normalizes the illumination across the images. The experiments are conducted on a system with an 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40 GHz (2401 MHz, 4 cores, 8 logical processors) running Microsoft Windows 11 Home Single Language with 8 GB RAM. We have used the OpenCV library to segment the face portion from the whole image using the frontal face detector, and the Simulink interface to OpenCV in the MATLAB 2013 environment is used for the experiments. The experiments are conducted to estimate the well-known expressions neutral, surprise, disgust, happy, fear, sad, and anger. We use the recognition rate / classification accuracy as the performance measure for each emotional category and evaluate each dataset. The evaluation of the JAFFE dataset for the various expression categories is presented in Fig. 10, and the overall accuracy is 85.95%. Figure 10 is presented in the form of a confusion matrix.

Fig. 10

Recognition rate on JAFFE Dataset
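A sketch of the preprocessing described above is given below, assuming OpenCV's bundled Haar frontal-face detector is the detector in question and that histogram equalization is applied to the grayscale image; file names are illustrative.

```python
import cv2

# Haar cascade shipped with OpenCV for frontal face detection.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("jaffe_sample.png")                 # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.equalizeHist(gray)                        # normalize the illumination effect

faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    face = cv2.resize(gray[y:y + h, x:x + w], (256, 256))   # resize as in the JAFFE setup
    cv2.imwrite("face_crop.png", face)
```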

4.2 Performance of the proposed approach using machine learning algorithms

It is seen from Fig. 10 that the recognition rate/classification accuracy on the JAFFE dataset is 85.95%. This is found to be on the lower side and has to be improved. Thus, in this paper, we have used Random Forest (RF), Multi-Layer Perceptron (MLP), and Support Vector Machine (SVM) classifiers for improving the performance. Each algorithm has a set of base parameters, and all these parameters are fixed based on experiments. We have used the LibSVM library for implementing the SVM with Linear, Polynomial, and Radial Basis Function (RBF) kernels. The recognition rate of the SVM with the RBF kernel is shown in Fig. 11; during training, C and Gamma of the RBF kernel are set to 20 and 0.1, respectively. The recognition rate on the JAFFE dataset using the Multi-Layer Perceptron is shown in Fig. 12. The configuration of the MLP is as follows: the input layer has four neurons, the hidden layer has twenty neurons, and the output layer has seven neurons; the learning rate is 0.3, the momentum is 0.2, and the training time is 500. Similarly, ensemble learning is used with the AdaBoost meta-algorithm along with the random forest. The training model built for this experiment uses bagging with a Hidden Markov Model, and random subsets of the dataset are generated by re-sampling for the random forest algorithm. The testing and training phases of all the classifiers use 10-fold cross-validation to avoid overfitting and underfitting. The recognition rates of the various classifiers on the JAFFE dataset are presented in Figs. 11, 12, 13 and 14. The overall accuracy from Fig. 11 is 92.22%.
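For illustration, the RBF-kernel configuration reported above (C = 20, Gamma = 0.1) with 10-fold cross-validation could be reproduced along the following lines; scikit-learn, which wraps LibSVM, is used here, and the feature and label files are the same hypothetical ones as before.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X = np.load("nose_features.npy")                 # hypothetical feature repository
y = np.load("expression_labels.npy")

svm_rbf = SVC(kernel="rbf", C=20, gamma=0.1)     # parameters as reported above
scores = cross_val_score(svm_rbf, X, y, cv=10)   # 10-fold cross-validation
print("per-fold accuracy:", np.round(scores, 3))
print("overall accuracy: %.2f%%" % (100 * scores.mean()))
```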

Fig. 11

Recognition rate on JAFFE dataset with SVM and RBF

Fig. 12

Recognition rate on JAFFE dataset with MLP

Fig. 13

Recognition rate on JAFFE dataset with Bagging +HMM

Fig. 14

Recognition rate on JAFFE dataset using AdaBoost + Random Forest

It is found that the recognition rate of most of the classification algorithms is in the range of 95% to 96%. The performance of the proposed approach is good for anger and sadness, moderate for happiness and disgust, and low for surprise. Also, in the JAFFE dataset, surprise images are sometimes classified as neutral. This is due to the fact that neutral images with an open (or closed) mouth have a feature set similar to that of surprise; surprise images in JAFFE resemble neutral faces with a closed or slightly open mouth. Based on the results, the proposed approach performs well on the surprise expression with the random forest ensemble classifier configuration, as shown in Fig. 14.

4.3 Performance comparison

In Table 1, we present a comparative analysis of the results on the JAFFE dataset. It is observed from the results that the proposed approach performs well for the expression categories of happiness, anger, and surprise.

Table 1 Performance Comparison - Recognition Rate for various Emotions

The recognition rates are presented in Table 1, and the values are highlighted to emphasize the performance enhancement. The reason for the enhancement is that the proposed approach captures the features of the nose effectively: the deformation of the nose due to various expressions is reflected at various points of the pyramid, which leads to good results. In the results, the lowest recognition rate, 80.5%, is achieved by Alhussein [1]. The methods proposed by Shahid et al. [34] and Ghazouani [21] achieve a 93% recognition rate, and the methods proposed by Yao et al. [47] and Li et al. [29] achieve 92% accuracy. However, overall, the performance of the proposed nose feature using the pyramid structure is good. The performance of Srinivasu et al. [39] is 81.1%, the reason being that their segmentation algorithm is designed to work well on medical images.

The recognition rate/classification accuracy of the proposed approach on the CK++ dataset is shown in Table 2. It is observed that the recognition rates of anger, surprise, and happiness are notable compared to the other expressions, while the recognition rates of fear, disgust, and sadness conflict with one another. Yao et al. [47] achieve a 93.8% recognition rate and Shahid et al. [34] achieve 90.2%. In Ghazouani [21], anger is confused with surprise. The recognition rate of the nose feature is good for happiness, surprise, and anger, and not encouraging for the other expressions such as sadness, disgust, and fear. It is observed from Table 2 that the recognition rate of the proposed approach using the nose feature is 97.3%, which is encouraging compared to the other methods in the literature.

Table 2 Comparisons of recognition rates of various approaches for different Emotions on CK+ Database

5 Conclusion

In this section, we conclude the paper with our contributions, limitations, and the future perspective of the study. We have proposed a geometrical model using the concepts of the pyramid and tetrahedron, whose properties are effectively used for modelling the nose component of the human face to estimate the expression and derive the emotion. The feature vector from the nose is extracted, superimposed on the neutral face, and compared with the images in the database, and the degree of deformation is used for estimating the expression and deriving the emotion. Classification accuracy is used as the performance metric, and the JAFFE and CK++ datasets are used for the experiments. The performance of the proposed approach is compared with contemporary methods and is found to be encouraging. However, one of the limitations of the proposed approach is that it captures only the micro-level deformation of the nose. Thus, this method may not be suitable for real-time applications that use macro-level deformation as a feature. This issue can be handled in the future by capturing features from all the components of the face, so that a macro-level feature can be constructed for better performance.