1 Introduction

A considerable amount of research has been carried out to measure the impact of the Lockdown on society, since it has thrown the whole world into turmoil and is therefore at the center of everyone’s mind. However, the mentality and outlook of each generation are not the same, which is why the Lockdown has affected each generation differently, changing it in different ways [6, 16, 38, 54]. According to some experts, this Lockdown crisis will have far-reaching and profound effects on Generation-Z; they are apprehensive that it will leave a deep and long-lasting impression on Generation-Z minds [9, 65]. Human beings are emotional, and our conversations are full of different kinds of emotions, which is why emotion has been given considerable importance in various studies [12]. Philosophical studies advocate that facial expression (non-verbal) plays a major role in determining human emotions, though it has some limitations [2, 5, 8, 13,14,15, 21, 24, 26, 28, 31,32,33, 36, 40, 42,43,44, 51,52,53, 55, 57, 58, 60, 69, 70]. Those studies also show that an in-depth picture of human emotions can be obtained from spontaneous voice, i.e. Acoustic Information (verbal) [4, 10, 20, 22, 29, 35, 45, 47, 64, 67].

Several earlier studies have been conducted to determine the impact of the Lockdown. The primary focus of those studies was to find clinical solutions, measure the impact on the global economy, address issues related to migrant workers, etc. [6, 9, 16, 17, 38, 54, 65]. However, very little research has been done to analyze the emotions of Generation-Z, so there is a clear research opportunity in the literature. This inspires us to determine Generation-Z emotions using a multimodal approach that combines facial expressions with acoustic information at the decision level using Fuzzy rule-based techniques.

The advent of multimodal methods for determining human emotions has given a big thrust to this research area. In a multimodal approach, emotions are obtained from multiple sources such as facial expression, acoustic information, body gesture, etc. [25, 30]. In this approach, decision-level fusion is a well-known practice that integrates the emotions obtained from the different sources [59]. The difficulty in decision-level fusion is that both the number and the types of emotion labels defined in widely used datasets differ. For example, FER2013 [19] is a well-known dataset used to study facial expressions and defines seven labels. Another frequently used dataset for facial expression analysis is CK+ [36], which defines eight labels whose types are not identical to those of FER2013. RAVDESS [34], a popular dataset for studying AcI-based emotions, defines eight emotion labels, but again their types are not identical to those defined in FER2013 or CK+.

To overcome this problem, the current study proposes the Organize-Split-Fuse (OSF) model, which is largely influenced by the valence-arousal model. In general, human emotions are bi-polar, i.e. positive or negative, Watson [62]. The work of Russell [50] suggests that every emotion can be represented by combining two components, one related to valence (negative/positive) and the other to arousal (the intensity of valence). Similarly, Whissell [63] represented emotions in terms of evaluation and activation. The objective of this work is to analyze the emotions of a section of Generation-Z during the Lockdown period, based on the proposed OSF model.

  • Problem Formulation: In the current study, the following steps have been adopted to satisfy the stated objective:

  • At the outset, the FE and AcI of each student have been separated from the generated dataset and stored separately.

  • Two separate pre-trained CNN-based models have been employed to classify FE and AcI based emotions for each student from the generated dataset.

  • Classified emotions have been organized and stored separately based on their valence (negative or positive), using the first phase of the proposed OSF model i.e. Organize.

  • Afterward, the stored emotions are further sub-divided into six sub-classes of emotion, namely High Negative (HN), Moderately Negative (MN), Low Negative (LN), Low Positive (LP), Moderately Positive (MP), and High Positive (HP), using the second phase of the proposed OSF model, i.e. Split.

  • Finally, these six classes of emotion from the two modes (FE and AcI) have been combined at the decision level, i.e. Fuse, the last phase of the proposed OSF model. To achieve this fusion, a Fuzzy rule-based classification system has been engaged [7].

  • Novelty: The proposed OSF model is able to analyze students’ emotions effectively using the multimodal approach. The proposed model successfully fuses the obtained emotions at the decision level using a Fuzzy rule-based classification system. The samples in the generated dataset are spontaneous and natural. Facial expressions have been extracted from colour and B/W videos. The rest of the sections are organized as follows:

Works relevant to this study are discussed in the literature review. The next section describes the baseline methodologies adopted, followed by the proposed methodology. The outcome of the study is given in the results and discussion. Finally, the observations of the current study are mentioned in the conclusion section.

2 Literature review

In this section, some contributions of researchers related to FER, SER, Fuzzy systems, multimodal fusion, and human emotions are discussed. People use facial expressions as a non-verbal medium to express their emotions and respond to stimuli [12, 57]. With the advancement of human-computer interaction, automatic facial expression analysis has gained a lot of popularity, and facial expression recognition (FER) is a very active area of research. In a FER-based system, facial signs are converted into facial expressions. Ekman and Friesen [13] showed through their work that human beings express six basic emotions, anger, disgust, fear, happiness, sadness, and surprise, irrespective of their culture; this is known as the categorical model of FER. There are two major types of FER systems: static image [33, 44] and dynamic sequence [26, 69]. In a static-image FER system, features are extracted from a single image, whereas in dynamic-sequence FER the temporal relations among adjoining frames of the facial expression sequence are considered. Traditional FER methods mostly used shallow learning and handcrafted features [5, 31]. All traditional dynamic-sequence-based FER systems follow three basic steps. In the first step, facial parts are detected, cropped, and otherwise pre-processed [58, 60] to make the system robust. In the second step, the required texture [32, 40] or appearance [36, 51] features are extracted. Finally, using these extracted features, conventional machine learning classifiers such as SVM, kNN, etc. [8, 28, 70] are used to classify the emotions. However, achieving the desired performance in real-life situations remains elusive. From 2013 onwards, due to the availability of cheap hardware with improved processing capabilities and the introduction of deep neural networks, the accuracy of FER systems has improved significantly [21, 53]. In recent years, a large number of research papers have been published on automatic facial expression analysis [14]. For the stated purpose, researchers have employed various deep learning mechanisms [2, 24, 52, 55]. The advantage of a DNN is that, unlike other traditional systems, it can extract useful features more accurately [15]. Lately, Mohan et al. [42] applied local gravitational force descriptors to identify human emotions in various challenging situations using an improved deep neural network model; in their work, geometric features were fused with holistic features using a score-level fusion mechanism. In the following year, Mohan et al. introduced the ‘FER-net’ DNN to identify facial expressions [43]. In both of these works, the authors employed five widely used databases to evaluate the performance of their proposed work, which achieved significant accuracy.

Research suggests that, apart from FER-based systems, people’s emotions can also be obtained from acoustic information (SER). It is currently one of the hot topics in research and its presence is being felt in various sectors of our life. Researchers have applied acoustic information to solve various real-life problems: psychological assessment [35], human-computer interaction [10], call centers [20], etc. Iqbal et al. show that acoustic information can be generated from speech in real time; in their study, the authors implemented the gradient boosting method to achieve their objective [22]. Pinto et al. employed a deep CNN to develop an emotion model that can understand people’s emotions based on spoken language [47]. In their work, the authors engaged RAVDESS as the dataset and used the F1 score as the evaluation metric; the weighted score achieved on the test data is 0.91. Zhang et al. [67] developed three shared models using RAVDESS as the dataset, with the objective of finding the relationship between speech and songs. Their work suggests that although the processes of recognizing speech and song are dissimilar, they are related and can be treated the same. Mel Frequency Cepstral Coefficients (MFCCs) are a well-known non-parametric method used to extract features from acoustic information [4, 29, 45, 64]. Muda et al. [45] successfully built a model to recognize acoustic information using MFCC techniques; in their study, the authors used Dynamic Time Warping (DTW) to measure the testing patterns. Kuang et al. [29] used MFCC, STFT, and SIFT features to classify human emotions; they employed RAVDESS as the dataset and AlexNet as the classifier, which achieved significant accuracy (95.88%). Earlier, in 1995, LeCun and Bengio [30] suggested that CNNs can be used to extract information from different sources such as images, speech, etc. Of late, researchers have combined facial expressions with acoustic information to get a complete spectrum of people’s minds. In 2018, Jannat et al. successfully fused audio data with video data to obtain people’s emotions [25]. In the very next year, Tzirakis et al. proposed an automatic affect recognition system based on audiovisual signals that works in a real-world environment [59]. In the multimodal way of emotion recognition, the basic challenge is how to merge the emotions obtained from the different modes. D. Zhang et al. [68] proposed that this can be achieved at the decision level, where the emotions obtained from the different modes are combined at the last stage. Mohan et al. [42] applied score-level fusion at the final stage of the classification process to merge geometric and holistic facial features in a FER-based system using a DCNN.

The Fuzzy rule-based classification system (FRBCs) is a well-known classification mechanism that can be applied at the decision level to combine decisions obtained from different modes [7]. The FRBCs is based on Zadeh’s Fuzzy principles [66]. Mohammadpour et al. [41] successfully classified coronary artery disease using the FRBCs approach; in their work, the authors developed 144 fuzzy rules to classify the disease into four classes and achieved 92.8% accuracy. Fuzzy principles and their implementations have also been applied to problems arising during the Lockdown [17], and hybridized Fuzzy techniques are felt in the image processing domain as well [61]. For the design of a multimodal recognition system involving facial, acoustic, gestural, and other information, a sufficient number of labeled training data covering all possible variations of the populations and environments are required. A few of the publicly available and widely used datasets are FER2013 [19], CK+ [36], RAVDESS [34], JAFFE [37], EmotioNet [3], SAVEE [56], and TESS [46].

Human emotions are complex; therefore, classifying human emotion is a challenging task. Watson et al. suggested that human emotions can be broadly classified into two categories, i.e. positive or negative [62]. Russell [50] proposed a 2D model to represent the emotional state in terms of valence and arousal, where valence represents a positive or negative state and the intensity of valence is represented by arousal. Whissell [63] developed a dimensional model, ‘The Dictionary of Affect in Language’, to represent emotion in terms of evaluation and activation. The work of Gasper [18] suggests that there is little consensus among researchers regarding ‘neutral affect’ and therefore leaves the decision to the experiment’s needs and its type. The research work of Damasio [11] and Izard [23] strongly questioned the presence of ‘neutral affect’. According to them, we live in a world that is full of emotions, and there is nothing called a ‘neutral world’ because we are always feeling something and our expressions are full of emotions. They further suggest that there can be nothing called an ‘affectless mind’ and that all our expressions are tinged with emotion, be it positive or negative. A summary of a few works related to FER and acoustics is shown in Table 1.

Table 1 Summary of work (FER and Acoustics)

3 Methodology

In this section, a few baseline techniques relevant to this study are discussed. The Convolutional Neural Network (CNN) model has been applied by several researchers to solve real-life problems [1]. It is a type of DNN algorithm that takes an image as input and distinguishes it from other images. The recognition is made on the basis of the weight and bias values obtained during the learning process. The main principle behind the CNN architecture is the ‘convolutional layer’: in a CNN model, each input image goes through a sequence of convolutional layers. The connecting layers are filters (kernels), pooling layers, and fully connected (FC) layers, and at the end of the network the “softmax” function is used to classify an image with probability values between 0 and 1. In CNN-based models, ReLU acts as the activation function. It accumulates the weighted inputs; if the value is greater than the threshold value (0), the signal is passed to the next convolutional layer, otherwise the input is rejected:

$$ y=\max\left(\sum_{i} w_i x_i + a,\ 0\right) $$
(1)

In the CNN model, batch normalization is used to prevent exploding and vanishing gradient problems. Let $x_1, x_2, \ldots, x_m$ be a mini-batch; then the mean $\mu$ and the deviation $\sigma$ can be obtained, respectively, using

$$ \mu=\frac{1}{m}\sum_{i=1}^{m} x_i $$
(2)
$$ \sigma^2=\frac{1}{m}\sum_{i=1}^{m}\left(x_i-\mu\right)^2 $$
(3)

Then $x_1, x_2, \ldots, x_m$ are normalized, using a small number $\epsilon$ to cover the case $\sigma = 0$, by

$$ \hat{x}_i=\frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}} $$
(4)

The purpose of training is to reduce the loss as much as possible. Cross-entropy is a popularly used loss function, expressed as

$$ \mathrm{Loss}=-\frac{1}{n}\sum_{i=1}^{n} P_i \ln y_i $$
(5)
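To make these building blocks concrete, the following minimal sketch (in Keras; not the authors’ exact network) stacks a convolutional layer with ReLU activation (Eq. 1), batch normalization (Eqs. 2-4), max pooling, and a softmax classifier trained with the cross-entropy loss (Eq. 5); the layer sizes are illustrative assumptions.

```python
# A minimal CNN block tying Eqs. (1)-(5) together; sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 7  # e.g. the seven FER2013 emotion labels

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),                  # grayscale facial image
    layers.Conv2D(32, (3, 3), padding="same"),        # convolutional layer
    layers.BatchNormalization(),                      # Eqs. (2)-(4)
    layers.Activation("relu"),                        # ReLU, Eq. (1)
    layers.MaxPooling2D((2, 2)),                      # pooling layer
    layers.Flatten(),
    layers.Dense(128, activation="relu"),             # fully connected layer
    layers.Dense(NUM_CLASSES, activation="softmax"),  # class probabilities in [0, 1]
])

# Cross-entropy loss, Eq. (5)
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```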

VGG16 is a kind of CNN model proposed by Simonyan and Zisserman [53]. The network gained its popularity due to its simplicity: 3 × 3 convolutional layers are stacked on top of each other in increasing depth, max pooling is used to reduce the volume size, two fully connected layers of 4096 nodes each follow, and at the end there is a ‘softmax’ classifier. In the current study, VGG16 has been used to build the FER-based system and to classify facial expression-based emotions. Fuzzy logic is the brainchild of Zadeh [66]. In a Fuzzy-based system, the linguistic variables are first fed to the system in the form of crisp values. These crisp values are then fuzzified (converted to Fuzzy sets), after which the degree of set membership is determined using some membership function such as triangular, Gaussian, etc. At the next level, a set of if/then rules, also known as a Fuzzy rule-based classifier, is applied. Finally, the fuzzy values are converted back into crisp values. In a Fuzzy system, the membership function is used to represent/assign the degree of membership:

$$ A=\left\{\left(x,\ \mu_A(x)\right)\ \middle|\ x\in X\right\} $$
(6)

Here, $\mu_A(x)$ represents the membership function, or degree of membership, of $x$ in $A$, and $X$ is the universal set. The triangular Fuzzy membership function can be expressed using the following equation:

$$ \mathrm{trimf}\left(x; a, b, m\right)=\begin{cases} 0 & x\le a\\[4pt] \dfrac{x-a}{m-a} & a\le x\le m\\[4pt] \dfrac{b-x}{b-m} & m\le x\le b\\[4pt] 0 & b\le x \end{cases} $$
(7)
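For illustration, Eq. (7) can be evaluated with a few lines of NumPy; the function name and the example parameters below are assumptions chosen only to mirror the equation.

```python
import numpy as np

def trimf(x, a, m, b):
    """Triangular membership of Eq. (7): rises from a to the peak m, falls back to b."""
    x = np.asarray(x, dtype=float)
    left = (x - a) / (m - a)       # rising edge, a <= x <= m
    right = (b - x) / (b - m)      # falling edge, m <= x <= b
    return np.clip(np.minimum(left, right), 0.0, 1.0)

# Example: membership of x = 2 in a triangle spanning [-10, 10] peaked at 0
print(trimf(2.0, -10, 0, 10))      # 0.8
```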

The Fuzzy rule-based classification mechanism is a well-known method in the machine learning domain because of its simplicity. It is extensively used to solve various real-life problems such as image processing, sentiment analysis, etc. [17, 41, 61]. An if/then fuzzy rule consists of two parts, IF (antecedent) and THEN (consequent). The first part specifies the membership functions of the antecedent Fuzzy sets, and the second determines the consequent class Cj and the certainty grade CFj of the fuzzy if/then rule Rj. Given an n-dimensional, c-class problem, the Fuzzy if-then rule is applied in the following form:

$$ \mathrm{Rule}\ R_j:\ \mathrm{If}\ x_1\ \mathrm{is}\ A_{j1}\ \mathrm{and}\ \dots\ \mathrm{and}\ x_n\ \mathrm{is}\ A_{jn}\ \mathrm{then\ class}\ C_j\ \mathrm{with}\ CF_j,\quad j=1,\dots,N. $$
(8)

where $R_j$ is the $j$th Fuzzy if-then rule, $N$ is the total number of Fuzzy if-then rules, $X=[x_1, \ldots, x_n]$ is an n-dimensional pattern vector, $A_{ji}$ is the antecedent Fuzzy set for the $i$th attribute, $C_j$ represents the consequent class, i.e. one of the $c$ classes, and $CF_j$ is the certainty grade of the fuzzy if-then rule $R_j$. The de-fuzzification of the Fuzzy value into a crisp value can be done using Eq. (9):

$$ Y=\frac{\int_{\min}^{\max}\mu(y)\, y\, dy}{\int_{\min}^{\max}\mu(y)\, dy} $$
(9)

where $Y$ is the result of defuzzification, $\mu(y)$ is the membership function, $y$ is the output variable, and min and max are the lower and upper limits for defuzzification.
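The centroid de-fuzzification of Eq. (9) can be approximated numerically, for example with the trapezoidal rule; the universe of discourse and the aggregated membership curve below are illustrative assumptions.

```python
import numpy as np

def centroid_defuzzify(y, mu):
    """Centroid (centre-of-gravity) defuzzification, Eq. (9)."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    return np.trapz(mu * y, y) / np.trapz(mu, y)

# Example: an aggregated output membership over an assumed [-10, +10] universe
y = np.linspace(-10, 10, 201)
mu = np.maximum(0, 1 - np.abs(y - 3) / 5)   # a triangle peaked at +3
print(centroid_defuzzify(y, mu))            # ~3.0, i.e. a mildly positive crisp decision
```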

4 Proposed method

This section describes the methods adopted for the current study. Both FE and AcI-based emotions have been considered together to analyze the minds of a section of Generation-Z; 74 students pursuing engineering were involved in this process. The overall process is depicted using a block diagram, see Fig. 1(a-c). The proposed work has three major phases: building two separate conventional CNN models [47, 53] using benchmark datasets, classifying emotions (FE and AcI) using these two models, and employing the ‘OSF’ model to understand Gen-Z emotions.

Fig. 1
figure 1

(a-c) The block diagram of the proposed work. Here ‘FE’ =Facial Expression, ‘AcI’ = Acoustic Information, ‘CFEBE’ =Classified Facial expression Based Emotions, ‘CAIBE’= Classified Acoustic Information Based Emotions, ‘OSF’ =Organize-Split-Fuse, Negative--, Neutral-----, Positive…

About the dataset

In the proposed work, the ‘FER-2013’ dataset has been employed for training, validation, and testing of the facial expression-based CNN. For the acoustic information, the RAVDESS dataset has been engaged. Finally, the student dataset has been employed to classify emotions based on the two modes. A brief description of the datasets used in this study is given below.

FER2013

This is an image dataset comprising 35,889 grayscale facial expression images of 48 × 48 pixels. The images are labeled with the seven universal emotions (Table 2).

Table 2 Emotions labels and numbers of images in the FER2013 dataset

RAVDESS

The dataset is a collection of audio and video clips of 24 actors, each expressing the same two statements with eight different emotions. In this study, only the audio clips of the RAVDESS dataset have been considered. Table 3 shows the distribution of the eight emotions.

Table 3 Emotions labels and number of audio samples in RAVDESS dataset

Students dataset

The dataset contains 74 videos. The students who appear in these 74 videos are from an engineering college in West Bengal, India. Students of all years, i.e. first to fourth year, participated, and most of them belong to the Computer Science and Engineering discipline. Out of those 74 students, 29 are girls and the remaining 45 are boys. The average age of these students is 20 years. The average length of each video is 10.98 seconds, and the total duration of all the videos is 813 seconds, of which the girls contribute 386 seconds and the boys 427 seconds. The average contribution of the girls in these videos is 13.31 seconds, while the average contribution of the boys is 9.48 seconds. Each facial and audio file has a unique name in the dataset. The videos were recorded during the Lockdown period, i.e. between the 1st week and the 3rd week of April 2020. The messages communicated in these videos are based on the following dos and don’ts to be followed during the Lockdown:

(a) Maintain social distance in public spaces (b) Wash your hands with soap/sanitizer regularly (c) Wear a mask and use hand gloves whenever going out (d) Do not spread fake news (e) Respect and extend help to the corona warriors (f) Do not show any apathy towards the infected person or persons (g) Stay home to stay safe (h) Practice yoga (i) Read books (j) Watch movies with the family.

About the role of CNN models

In the current study, CNNs serve two purposes. First, they are employed to build the FER and SER systems following the traditional process [43]. For the FER-based system, a VGG16 model [53] has been adopted (Table 4) and trained, validated, and tested using the FER2013 dataset. For SER, the CNN model proposed in [47] has been engaged, and the RAVDESS dataset has been used to train, validate, and test it. Subsequently, using these two models, the emotions of the 74 students have been classified. Figure 3(a-b) shows the emotion classification process for both modes and the subsequent computation of classified emotions using these pre-built models.

Table 4 VGG16 Model architecture
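The exact layer configuration used in this study is listed in Table 4; the sketch below is only an assumed VGG16-style classifier for 48 × 48 FER2013 inputs, with the grayscale frames presumed to be replicated to three channels so that the stock VGG16 convolutional base can be reused.

```python
# An assumed VGG16-style FER classifier (not necessarily the configuration of Table 4).
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# VGG16 convolutional base trained from scratch on FER2013-sized inputs
base = VGG16(weights=None, include_top=False, input_shape=(48, 48, 3))

fer_model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),   # two FC layers of 4096 nodes, as in VGG16
    layers.Dense(4096, activation="relu"),
    layers.Dense(7, activation="softmax"),   # seven FER2013 emotion labels
])
fer_model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
```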

Separation of facial and audio part

At first, the facial and the acoustic information of each student have been separated from their respective videos and stored separately. The process is depicted in Fig. 2.

Fig. 2
figure 2

The separations of FE and AcI from the captured Video
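A possible implementation of this separation step is sketched below; the paper does not name a tool, so moviepy is used here purely for illustration, and the file name passed at the end is hypothetical.

```python
# Hedged sketch of Fig. 2: split one student video into an audio track (AcI)
# and a set of facial frames (FE).
import os
from moviepy.editor import VideoFileClip

def split_video(path, out_dir="separated"):
    os.makedirs(out_dir, exist_ok=True)
    name = os.path.splitext(os.path.basename(path))[0]
    clip = VideoFileClip(path)
    # store the acoustic information as a wav file
    clip.audio.write_audiofile(os.path.join(out_dir, f"{name}.wav"))
    # store the facial-expression stream as individual frames, one per second
    for t in range(int(clip.duration)):
        clip.save_frame(os.path.join(out_dir, f"{name}_frame{t:03d}.png"), t=t)
    clip.close()

split_video("student_001.mp4")  # hypothetical file name
```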

Classification of gen-Z emotions using the pre-built CNN models

Applying the pre-built VGG16 model (Section 4.2), the emotions of Gen-Z have been classified and the number of occurrences of the classified emotion labels in these videos has been computed for further processing. To achieve this, the traditional FER-based process has been adopted [43], see Fig. 3(a). In the case of audio, the acoustic information has been divided into ‘n’ chunks, with the necessary padding applied to the last chunk to make all chunks of equal length. Then, employing the CNN model [47], the emotions are classified and the number of occurrences of the classified emotion labels in the audio is also computed, see Fig. 3(b).

Fig. 3
figure 3

a Classification of Facial expression and computation of classified emotion labels using pre-build CNN model b Classification of Acoustic information and computation of classified emotion labels using pre-build CNN model
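The classification-and-counting step of Fig. 3 might be implemented along the following lines; the helper names, chunk length, sampling rate, and MFCC settings are illustrative assumptions rather than values reported in the paper.

```python
# Per-student classification sketch, assuming the two pre-built models described above.
import numpy as np
import librosa                      # assumed here for audio loading and MFCC extraction
from collections import Counter

FER_LABELS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def count_fe_emotions(frames, fer_model):
    """Classify each pre-processed facial frame and count the predicted labels."""
    preds = fer_model.predict(np.stack(frames))                # shape (n_frames, 7)
    return Counter(FER_LABELS[i] for i in preds.argmax(axis=1))

def count_aci_emotions(wav_path, ser_model, ser_labels, chunk_sec=3.0, sr=22050):
    """Split the audio into fixed-length chunks, zero-pad the last one, classify each."""
    y, _ = librosa.load(wav_path, sr=sr)
    chunk = int(chunk_sec * sr)
    counts = Counter()
    for start in range(0, len(y), chunk):
        seg = y[start:start + chunk]
        seg = np.pad(seg, (0, chunk - len(seg)))               # pad the last chunk
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=40)   # MFCC features of the chunk
        pred = ser_model.predict(mfcc[np.newaxis, ..., np.newaxis])
        counts[ser_labels[int(pred.argmax())]] += 1
    return counts
```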

Valence based Organization of Emotions

Surprise is the thematic term describing a startle response: it starts with abrupt attention, progresses into astonishment, and finally turns into bewildered amazement. According to Whissell [63] and Robinson [49], surprise is a positive emotion; moreover, some popular public datasets consider surprise a pleasant state [46]. Therefore, in the current study, ‘surprise’ has been considered a positive state of emotion. ‘Neutral’ does not figure in this valence-based organization of emotions, since some literature states that ‘neutral’ does not represent any valence state [62] and does not influence cognition or behaviour [18]. Moreover, some research suggests that “it is not possible to feel neutral because people are always feeling something” [11, 23]. Table 5 shows the emotion labels and their corresponding valence states.

Table 5 Emotion labels and their corresponding valence

Based on Table 5, the emotions of each student have been categorized and subsequently organized using Eq. (10). The process of valence-inspired emotion organization for both modes is presented in Algorithm 1 and depicted in Fig. 4(a-b).

Fig. 4
figure 4

(a) Valence based organization of classified emotions, FE-based. (b) Valence based organization of classified emotions AcI-based

$$ T_v=\begin{cases} P_v & D_v>0\\ Nu_v & D_v=0\\ N_v & D_v<0 \end{cases} $$
(10)

where $T_v$ is the type of valence, $D_v$ is the difference between the cumulative values of all positive and all negative valences, $Nu_v$ denotes neutral (absence of valence), $P_v$ is the positive valence, and $N_v$ is the negative valence.

The following algorithm has been applied separately to organize the classified emotions of each mode (FE and AcI).

Algorithm 1:
figure d

Valence based organization of classified emotions
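A minimal sketch of the ‘Organize’ phase (Eq. 10 / Algorithm 1) is given below; the label-to-valence mapping follows Table 5 as described in the text, while the function name and the sample counts are assumptions.

```python
# Hedged sketch of the 'Organize' phase: map counted labels to valence and apply Eq. (10).
POSITIVE = {"happy", "surprise"}                 # surprise treated as positive in this study
NEGATIVE = {"angry", "disgust", "fear", "sad"}   # 'neutral' carries no valence (Table 5)

def organize(label_counts):
    pos = sum(c for lbl, c in label_counts.items() if lbl in POSITIVE)
    neg = sum(c for lbl, c in label_counts.items() if lbl in NEGATIVE)
    dv = pos - neg                               # difference of cumulative valences, Dv
    if dv > 0:
        tv = "Positive"
    elif dv < 0:
        tv = "Negative"
    else:
        tv = "Neutral"
    return tv, dv

print(organize({"happy": 5, "sad": 2, "angry": 1, "neutral": 3}))  # ('Positive', 2)
```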

Splitting of Valenced emotions

Once the emotions of both modes have been obtained and organized based on their valence, see Fig. 4(a-b), they are further divided into six sub-classes of emotion using the degree of valence (Dv), i.e. the arousal. Algorithms 2 and 3 present the splitting process, while Fig. 5(a-b) depicts its outcome.

Algorithm 2:
figure e

Sub-classification of facial expression-based emotions according to their degree of valence.

Algorithm 3:
figure f

Sub-classification of acoustic information-based emotions according to their degree of valence.

Fig. 5
figure 5

(a) Sub-classifications of valenced emotions based on Algorithm-2 (b) Sub-classifications of valence emotions based on Algorithm-3
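The ‘Split’ phase of Algorithms 2 and 3 can be sketched as below, where the degree of valence is scaled onto the [−10, +10] range mentioned later in the paper; the cut-off values separating the six sub-classes are illustrative assumptions, not the thresholds of the actual algorithms.

```python
# Hedged sketch of the 'Split' phase: map the degree of valence onto six sub-classes.
def split(dv, total):
    """Scale Dv onto [-10, +10] and assign one of the six sub-classes (assumed cut-offs)."""
    score = 10.0 * dv / max(total, 1)            # scaled degree of valence
    if score <= -7:
        sub = "HN"                               # High Negative
    elif score <= -4:
        sub = "MN"                               # Moderately Negative
    elif score < 0:
        sub = "LN"                               # Low Negative
    elif score <= 3:
        sub = "LP"                               # Low Positive
    elif score <= 6:
        sub = "MP"                               # Moderately Positive
    else:
        sub = "HP"                               # High Positive
    return sub, score

print(split(dv=2, total=10))                     # ('LP', 2.0)
```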

Decision Level Fusion using Fuzzy rule-based classification system

In this section, the six sub-classified facial expression-based emotions are amalgamated with the six sub-classified acoustic information-based emotions. To achieve this objective, a Fuzzy rule-based classification system has been employed. The process of amalgamation is explained in Algorithm 4 and depicted in Fig. 6. The inputs and outputs of the proposed fuzzy system are shown in Table 6, and Table 7 shows the proposed Fuzzy rules applied in the current study to fuse the emotions obtained from both modes.

Algorithm 4:
figure g

Fusion of split emotions using Fuzzy rule-based classification system.

Fig. 6
figure 6

The decision level fusion of facial and acoustic information based emotions using Fuzzy rule-based classification system

Table 6 The input of the Linguistic variables in the Fuzzy system and decision based on De-Fuzzification
Table 7 Fuzzy rules to fuse both modes of emotions based on the degree of emotion i.e. Dv
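A hedged sketch of this decision-level fusion using a Mamdani-style fuzzy inference system is shown below (scikit-fuzzy is used for illustration). The triangular membership parameters and the few rules listed are assumptions; the complete rule base and the actual linguistic variables are those of Tables 6 and 7.

```python
# Mamdani-style fusion sketch of the 'Fuse' phase; membership parameters are assumptions.
import numpy as np
import skfuzzy as fuzz
from skfuzzy import control as ctrl

universe = np.arange(-10, 10.01, 0.1)
fe = ctrl.Antecedent(universe, "FE")         # facial-expression-based degree of valence
aci = ctrl.Antecedent(universe, "AcI")       # acoustic-information-based degree of valence
out = ctrl.Consequent(universe, "Emotion")   # fused emotion

for var in (fe, aci):                        # six sub-classes per mode, on [-10, +10]
    var["HN"] = fuzz.trimf(var.universe, [-10, -10, -6])
    var["MN"] = fuzz.trimf(var.universe, [-8, -5, -2])
    var["LN"] = fuzz.trimf(var.universe, [-4, -2, 0])
    var["LP"] = fuzz.trimf(var.universe, [0, 2, 4])
    var["MP"] = fuzz.trimf(var.universe, [2, 5, 8])
    var["HP"] = fuzz.trimf(var.universe, [6, 10, 10])

out["Negative"] = fuzz.trimf(universe, [-10, -6, 0])
out["Neutral"] = fuzz.trimf(universe, [-2, 0, 2])
out["Positive"] = fuzz.trimf(universe, [0, 6, 10])

rules = [
    ctrl.Rule(fe["HP"] & aci["HP"], out["Positive"]),
    ctrl.Rule(fe["HN"] & aci["HN"], out["Negative"]),
    ctrl.Rule(fe["LP"] & aci["LN"], out["Neutral"]),
    # ... the remaining input combinations follow Table 7
]

fusion = ctrl.ControlSystemSimulation(ctrl.ControlSystem(rules))
fusion.input["FE"], fusion.input["AcI"] = 7.0, 8.0
fusion.compute()                              # centroid de-fuzzification, Eq. (9)
print(fusion.output["Emotion"])               # crisp fused value on [-10, +10]
```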

5 Results and discussion

During the initial observation, it was found that in a few recorded videos the quality of the facial expressions is very poor while the audio quality is good, and the reverse is observed for some of the acoustic data. The possible reasons are mentioned in the challenges and limitations of this study. In such cases, the decision has been made on the basis of whichever of the two modes is usable. The number of samples considered for facial expression-based emotions is 68, and for acoustics the number of samples considered is 71.

The performance of the employed CNN model (VGG16) is depicted in Tables 8 and 9, while Table 10 compares this performance with some previous work. Figures 7 and 8 show the emotions obtained from facial expression and acoustic information, respectively. Based on the valence state, the identified emotions have been organized into two classes (Eq. 10 and Algorithm 1); the results obtained are shown in Figs. 9 and 10.

Table 8 Confusion matrix for the employed CNN model
Table 9 Performance of the employed CNN model
Table 10 Comparative analysis of various FER based works
Fig. 7
figure 7

Emotions of the students based on facial expression using VGG16

Fig. 8
figure 8

Emotions of students based on acoustic information using CNN [47]

Fig. 9
figure 9

Classifications of facial expression based emotions using Algorithm 1

Fig. 10
figure 10

Classifications of acoustic information-based emotions using Algorithm 1

Once the emotions of both modes have been organized successfully, the organized facial and acoustic information-based emotions are further sub-classified into six sub-classes based on the degree of intensity of their valence (Dv). To achieve this objective, Algorithms 2 and 3 have been applied. The output of this sub-classification, i.e. the second part of the proposed OSF model, is depicted in Figs. 11 and 12.

Fig. 11
figure 11

Sub-classifications of facial emotions based on the intensity of valence

Fig. 12
figure 12

Sub-classifications of acoustic information based emotions based on the intensity of valence

Finally, the six sub-classified emotions obtained from both modes have been combined. To achieve this goal, a Fuzzy rule-based classification mechanism has been engaged. Figure 13(a) shows the proposed Fuzzy system, while Fig. 13(b-c) shows the degree of membership for each mode; the value of each mode lies between −10 and +10. Figures 14, 15 and 16 display the findings of the fusion of the two modes based on the Fuzzy rule-based classification system. The year-based emotions of the students are illustrated in Fig. 17, and the comparative emotional states of boys and girls during the Lockdown period are shown in Figs. 18, 19, 20 and 21.

Fig. 13
figure 13

(a) Proposed Fuzzy Interface System (b) Degree of membership for Facial emotions (c) Degree of membership for acoustic information based emotions

Fig. 14
figure 14

Fuzzy rule-based combined emotions in terms of their valence

Fig. 15
figure 15

Fuzzy rule-based classification of combined emotions in terms of their degree of valence

Fig. 16
figure 16

Fuzzy rule-based polarized emotions of students

Fig. 17
figure 17

Year-wise students emotions using a fuzzy rule-based classification system

Fig. 18
figure 18

Boys vs. Girls, Fuzzy rule-based combined emotion of all the years

Fig. 19
figure 19

Boys vs. Girls: year-wise classes of emotions

Fig. 20
figure 20

Fuzzy rule-based combined emotions of Boys for all the years

Fig. 21
figure 21

Fuzzy rule-based combined emotions of Girls for all the years

6 Discussion

History reminds us that there is some acceleration of negative emotions in society after every great catastrophe, such as World War I and World War II, and the impact of the global Lockdown is one such catastrophe. In the present study, the results/emotions obtained from the students’ facial and acoustic information also support this claim. The emotions obtained from their facial expressions show the presence of more positive emotions in comparison to negative emotions, Fig. 7. On the other hand, the results obtained from the acoustic information show the dominance of negative emotions over positive emotions, Fig. 8. To consolidate these findings, the OSF model has been employed. At the outset, emotion labels have been identified based on their valence (positive/negative), and then the numbers of such emotions have been added and compared. The result shown in Fig. 9 indicates, like Fig. 7, the supremacy of positive emotions (37) over their negative counterpart (31). The results obtained from the acoustic information show the dominance of negative emotions (43) over positive emotions (28), Fig. 10. At the end of the first step of the proposed OSF model, we obtain two class labels, namely positive and negative, for both modes.

Fig. 11 shows the sub-classified emotions obtained from the facial expressions based on the degree of valence. The results show that the values of HP (14) and MP (9) are higher than those of HN (9) and MN (4), while LN (19) dominates LP (14). The sub-classified emotions obtained from the acoustic information, based on the degree of valence, show that HN (10) and MN (19) prevail over HP (5) and MP (9), Fig. 12; LP (14) and LN (14) are on the same scale. After the successful completion of splitting, i.e. the second part of the OSF model, six sub-classes of emotion have been obtained based on the degree of valence for both modes (facial and acoustic).

Fig. 13(a) shows the process employed to fuse the emotions of both modes using a ‘Mamdani’-based Fuzzy Inference System (FIS). Fig. 13(b) and (c) show the degrees of membership of emotions for both modes. In this study, the membership values of the proposed Fuzzy system have been determined based on Algorithms 2 and 3. The value lies between −10 and +10, where −10 represents the highest negative (HN) value and +10 represents the highest positive (HP) value. The rest of the memberships have been defined between these two values based on Algorithms 2 and 3.

Fig. 14 shows the students’ emotions in terms of three classes. It shows that the negative emotions (34) are comparatively more numerous than the positive ones (31), while there are 9 neutral emotions. Figure 15 shows the students’ emotions in terms of the degree of valence. It shows that HP (15) prevails over HN (15), indicating that students are more positively oriented when the degree of emotion is at its highest level. It also exhibits the supremacy of MN (13) over MP (7) and the marginal dominance of LN (6) over LP (5). Figure 16 shows that the presence of positive, negative, and neutral emotions is 42%, 46%, and 12%, respectively. Figure 17 unveils a very important scenario: as students progress toward senior classes, negative emotions increase (0.42, 0.44, 0.50, and 0.56, respectively) while positive emotions gradually decrease (0.50, 0.46, 0.33, and 0.22, respectively). It further shows the presence of more neutral emotions in the senior students compared to their juniors (0.08, 0.10, 0.17, and 0.22, respectively).

To get the entire spectrum of students’ emotions, further analysis shows that boys are more apprehensive than girls, while girls show more optimism. Figure 18 shows that boys (26) have a greater presence of negative emotions compared to girls (9). It also shows that the presence of positivity in girls (19) is greater than in boys (11), although the presence of neutral emotions in boys (8) is greater than in girls (1). The year-wise comparison of emotions also shows very interesting statistics, Fig. 19: girls show less negativity and more positivity than boys, except for the final-year students, where the positivity of boys exceeds that of girls. It also indicates that, apart from the 1st year, the emotions of girls are completely bi-polar, while for the other years there is a small presence of neutral emotions. The year-wise analysis shows the chaotic state of students’ emotions, Figs. 20 and 21; it does not follow any pattern or defined direction.

7 Conclusion

There is no denying that the effects of the Lockdown will change the whole world, as this kind of catastrophe is extremely unusual. Experts believe that it will have a major impact on Generation-Z and will completely change their view of the world. In the submitted work, the emotions of a section of Generation-Z have been classified by combining FE-based emotions with AcI-based emotions. A Fuzzy rule-based classification system has been employed at the decision level to fuse the six sub-classified emotions obtained from the two modes (video and audio) into three class labels (positive, neutral, and negative), based on the degree of their valence (Dv). The test dataset comprises 74 short videos. The overall study reveals that:

  • The junior students have fewer negative emotions compared to their seniors. The probable reason is that they are highly optimistic, more energetic, and confident about handling any odds that may come their way. Alternatively, it could be that, owing to their lack of maturity, they fail to comprehend the threats that lie in the future.

  • The result analysis also shows that first-year students are happier compared to the other years, while the fourth-year students are more apprehensive about the fallout of the Lockdown and corona crisis. They are also emotionally more neutral compared to their juniors, suggesting that they are more confused about the fallout.

  • It also shows that signs of fear or surprise are minimal in all cases, though signs of sadness have been observed at times.

  • The study also reveals that, compared to boys, girls are more optimistic about their future. The outcomes of Figs. 18 and 19 also support this claim to some extent.

  • The year-wise comparison between boys and girls in Figs. 20 and 21 shows students’ chaotic state of mind.

Challenges: Since we have considerable control over a large part of our facial muscles, we can, to some extent, mask our facial emotions from the rest of the world. It has been recorded that sometimes, instead of revealing negative feelings like annoyance or aggravation, emotions such as happiness, joy, etc. are expressed. In such circumstances, the face makes a mockery of the mind and acts as a mask, making the facial expression-based emotion classification process more challenging or sometimes introducing imprecision. In the case of acoustic signals, the addition of background or ambient noise makes the signal quality poor, sometimes suppressing the necessary acoustic features and making the detection process hard or introducing imprecision.

Limitations: During this study, we faced some problems. First, the system can detect only the frontal part of the face, with some restrictions. Second, in addition to the different ambiences, the resolution and quality of the capture devices used to record student expressions were different. Third, the study does not represent the entire Generation-Z. In the future, different CNN models and datasets can be engaged to extract facial and acoustic-based emotions, and features other than MFCC can also be included for extracting emotions from acoustic information.