
1 Introduction

An enormous amount of data is produced every second in today's world, and making sense of it is a demanding task. Sentiment detection from text is still a work in progress, and while product reviews have received considerable attention, we concentrate on dual sentiment detection in videos through text analysis.

The basic task in sentiment analysis is classifying an input text as positive, negative, or neutral in polarity. The analysis can be performed at the document, sentence, or feature level. Methodologies from this field can capture consumer perceptions of goods, commodities, brands, political views, and social activities. Analysis of Twitter users' activity, for example, can help predict the reputation of political groups or alliances. Sentiment analysis studies on microblogging have shown that Twitter messages accurately reflect the political situation.

Mental health is one of the most delicate academic topics because it is heavily influenced by people's mindsets and feelings. The use of social media platforms such as Facebook, Instagram, and Flickr grows daily, with photographs and videos playing an increasingly important role. Our emotions can be deduced from our facial expressions, and we can learn about one another's moods by observing them. Sentiment analysis plays a critical role in making this recognition easier and more efficient: "sentiment," meaning emotion, is what the sentiment analysis system assesses. Our objective is to predict sentiment from video, since the majority of prior research has focused on text-based sentiment analysis.

Although researchers in NLP and pattern extraction have presented numerous techniques for sentiment analysis, the social networking setting presents several unique obstacles. Aside from the massive volumes of data involved, most exchanges on virtual communities are short and informal. Moreover, beyond verbal communication, users increasingly use photographs and videos to represent themselves on the most popular social media sites. The information carried in such images and videos relates not only to semantic content, such as the objects or activities depicted, but also to the affect and sentiment signals the picture communicates. Such data is therefore important in determining emotional impact beyond semantics. Consequently, photographs and videos are among the most common ways for individuals to express their feelings and share their views on social networks, which have become increasingly important for gathering information about people's thoughts and emotions.

2 Literature Survey

In Paper [1], a method for automatically recognizing human emotion from facial expressions is developed using a CNN. The authors applied BPNN, CNN, and SURF feature extraction methods, with data collected from the CASIA-WebFace dataset. The system achieved an 88% accuracy rate, although prediction was constrained by the small dataset (200 samples).

In Paper [2], the authors developed a voice-to-text conversion and management application using the Google Cloud Speech API and a collection of user-generated audio and text files. The system lacked contextual textual data that might have let the user correct misrecognized content. Nevertheless, their organization and study were beneficial and could serve as a reference.

In Paper [3], deep learning-based multimodal emotion recognition from speech and facial expressions is used to identify emotions. The study relies on a dataset of speech and facial expressions and uses CNN and LSTM algorithms. While investigating more efficient feature extraction techniques and multimodal fusion, the authors did not incorporate modalities such as text and gesture into their multimodal models. Combining speech and facial expression data substantially improved the results, and a comparison with existing multimodal systems showed a significant improvement.

In Paper [4], the authors used a collection of YouTube video comments to predict the sentiment of YouTube videos, applying NLP to analyze the sentiment of viewer comments. The reported accuracy was 75.435%. It can therefore be concluded that their technique can correctly predict the overall sentiment of a YouTube video when it is examined on the basis of its comment language.

In Paper [5], ML integrated with IoT was used to introduce sentiment analysis and mood detection on the Android platform. The North Face, Google Now, Alexa, Akinator, and chatbots were among the tools referenced by the authors, and data collection involved social media. The work, however, offered no scope for emotion analysis or mood prediction. The study seeks to explain the problem from its source, the main factor underlying all of the issues, in order to address what appears to be a challenging and intriguing problem.

In Paper [6], a machine learning-based classification method using SVM classifiers is proposed to predict the sentiment of an image; CNN + SVM algorithms are used. The authors utilized Twitter and Tumblr datasets and reported an accuracy of 99.2%. However, the work lacked a deep learning method for multimodal sentiment analysis.

In Paper [7], the authors predicted sentiment from user-generated video, audio, and text using a dataset of the same. Python, SVM, decision trees, and OpenCV were among the approaches used. During testing, the task reached a 70% accuracy rate; however, no definite precision was attained.

In Paper [8], sentiment analysis and topic recognition in video transcriptions is presented. Two of the authors' key methods were SVM and LSTM, with data supplied via the MuSe-Topic sub-challenge. Accuracy on the test set was 66.16%, whereas development accuracy was 56.18%. The classes were indicated rather than clearly determined from continuous signals, necessitating further investigation.

In Paper [9], datasets from Twitter, Flickr, and Instagram were used. The authors introduced ME2M, a simple but effective model for image sentiment analysis, and the observed results demonstrated its usefulness and applicability. However, the work lacked any popularity forecasting based on image sentiment analysis.

In Paper [10], the study shows how deep learning can be used for image sentiment analysis. Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), Region-based Convolutional Neural Networks (R-CNN), and Fast R-CNN were among the techniques used, with the FERET dataset. The authors highlighted important work done over the years on image sentiment analysis combined with deep learning approaches. However, the work left no scope for emotion analysis or mood recognition.

In Paper [11], sequence-to-sequence voice conversion is improved by adding text supervision, with the authors using a text-based phonetic information dataset. Machine learning methods such as the Hidden Markov Model (HMM) and a seq2seq VC model were used. Although the proposed methods considerably improve the seq2seq VC model, performance is still hindered when training data is scarce; the approach performs significantly worse when only a few training sets are available.

In Paper [12], the C3D network, VGG16 network, and a ConvLSTM model were used for sentiment recognition in brief annotated GIFs, after which the verbal emotion score is derived using the SentiWordNet 3.0 model. The data consisted of GIF videos. Extensive testing, covering both theoretical and practical assessments, demonstrated the efficacy of the proposed GIF video sentiment analysis program. However, the authors lacked an effective strategy for dealing with the complex parameters that appear in brief annotated GIFs.

In Paper [13], the study focuses on sentiment analysis and emotion identification in static images, using the UMDFaces dataset with VGGNet16 and CNN models. The work lacked an efficient system for dynamic visuals. Nonetheless, the suggested approach beats prior models and yields more accurate results in estimation tests.

In Paper [14], the study proposes a new facial expression feature for sentiment analysis of videos. A machine learning framework is employed to validate the proposed feature, and the experimental results show that the feature is beneficial.

In Paper [15], a survey provides an overview of current developments in multimodal sentiment analysis. The most prominent feature extraction techniques and datasets in the area are categorized and discussed, and the efficacy and efficiency of thirty-five models are examined on CMU-MOSI and CMU-MOSEI, two frequently used multimodal sentiment analysis datasets.

3 Related Work

A CNN is a deep learning system that takes an input image, assigns significance to various aspects within the image, and differentiates between them. OpenCV is an excellent library for image processing and computer vision; it is an open-source library providing operations such as object tracking, landmark detection, and face detection, among others. The support vector machine (SVM) is a machine learning method that can solve both regression and classification problems, although it is generally employed for classification. PCA is a method for reducing the dimensionality of such datasets, improving accuracy while minimizing data redundancy. To extract the sentiment of each word, each utterance, and eventually each video, the CNN converts a textual utterance into a logical form: a machine-understandable representation of its meaning. A block diagram of the sentiment approach is shown in Fig. 1.

Fig. 1
A block diagram linking data collection, pre-processing, data sorting, feature extraction, classification algorithm, and sentiment analysis in sequence.

Block diagram of sentiment approach
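
As a rough illustration of how the PCA and SVM stages described above can be combined, the following sketch builds a scikit-learn pipeline on synthetic feature vectors; the feature dimensionality, number of classes, and component count are illustrative assumptions, not the configuration used in this work.

```python
# Hedged sketch: PCA dimensionality reduction followed by an SVM classifier
# in a scikit-learn pipeline. The feature vectors and labels are synthetic
# placeholders standing in for CNN-extracted features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 512))      # hypothetical feature vectors (500 samples, 512 dims)
y = rng.integers(0, 4, size=500)     # hypothetical labels: happy/sad/anger/surprise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# PCA lowers the feature dimension (reducing redundancy) before the SVM.
clf = make_pipeline(PCA(n_components=50), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```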

Our aim is to predict sentiment from video, so a single input video supplies the data for both the visual and audio streams. For the visual stream, we capture selected frames from the provided video and apply a CNN for image-to-text conversion; for the audio stream, we collect the audio data and apply an ML approach, again a CNN, for speech-to-text conversion. Features are extracted from the video and audio through an OpenCV-based model, such as face detection and eye and lip movement in the case of video, and keywords or phrases in the case of audio. The CNN employs a feature extractor during the training phase, whose weights are determined by training the specialized neural network layers that make up the feature extractor. One neural network extracts the features of the input images, while another network classifies those characteristics. The feature extraction network takes the input image as its starting point, and the classification network uses the extracted feature signals to generate the result based on the image characteristics. The feature extraction network consists of stacked convolution layers and sets of pooling layers; the convolution layer, as its name suggests, applies the convolution operation to the image. For classification, SVM and PCA techniques are applied to both video and audio, classifying the extracted features into categories such as happy, sad, anger, and surprise.
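
As an illustrative sketch of the frame-capture step just described, the snippet below samples frames from an input video with OpenCV so they can be passed on for feature extraction; the file name and sampling interval are placeholders, not values prescribed by this work.

```python
# Hedged sketch: sample every n-th frame from an input video with OpenCV.
# "review_video.mp4" and the sampling interval are illustrative placeholders.
import cv2

def sample_frames(video_path: str, every_n: int = 30):
    """Return a list of BGR frames, keeping one frame out of every `every_n`."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of the video (or unreadable file)
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

frames = sample_frames("review_video.mp4")
print(f"captured {len(frames)} frames for feature extraction")
```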

Expression Intensity: The recognition of an expression is significantly influenced by its intensity; the less subtle the expression, the easier it is to recognize, and this has a significant impact on the model's accuracy.

  • Step I: Get the image frame from the data.

  • Step II: Image preprocessing (cropping, resizing, rotating, color correction).

  • Step III: Extract the key features using a CNN model.

  • Step IV: Categorize the emotion.

I. Image and Video Frame Face Detection

In the first stage, the human face is detected and located using video from a camera, with frame coordinates determining the position of the face in real time. Face detection is still a challenging procedure; furthermore, it is not assured that all faces in a given input picture will be retrieved, particularly in uncontrolled conditions with inadequate illumination, varying head positions, long distances, or occlusion.
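
One common way to realize this detection stage is OpenCV's bundled Haar cascade frontal-face detector; since the text does not fix a specific detector, the sketch below is an assumed implementation rather than the exact setup used here.

```python
# Hedged sketch: face detection on a single frame with OpenCV's Haar cascade.
# Detection can miss faces under poor illumination, extreme head poses,
# long distances, or occlusion, as noted above.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    """Return (x, y, w, h) bounding boxes of faces found in a BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```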

II. Image Preparation

After the faces are detected, the pictures are optimized before being sent to the sentiment classifiers, which greatly improves classification accuracy. Important substeps in image preprocessing include compensating for varying illumination, thresholding, reducing the picture, correcting picture rotation, resizing the picture, and cropping it.
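
The following sketch shows one plausible version of these preprocessing substeps (grayscale conversion, cropping to the face box, resizing, illumination equalization, and intensity scaling); the 48 × 48 target size anticipates the FER-2013 image size used later and is an assumption.

```python
# Hedged sketch of the preprocessing substeps on a detected face crop:
# grayscale, crop, resize, illumination equalization, and intensity scaling.
import cv2
import numpy as np

def preprocess_face(frame, box, size=(48, 48)):
    """Return a normalized grayscale face image ready for the classifier."""
    x, y, w, h = box
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    face = gray[y:y + h, x:x + w]            # cut the picture to the face box
    face = cv2.resize(face, size)            # size the picture (assumed 48 x 48)
    face = cv2.equalizeHist(face)            # compensate for varying illumination
    return face.astype(np.float32) / 255.0   # scale pixel values to [0, 1]
```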

III. AI Model for Emotion Classification

After preprocessing, the required features are extracted from the preprocessed data containing the detected faces. There are numerous approaches for describing various aspects of the face, for example Action Units (AU), facial landmark motions, landmark distances, gradient features, and face texture. The most commonly used classifiers in AI emotion identification are SVM and CNN. Finally, the detected human face is assigned a pre-defined class (label) based on its facial expression, such as "joyful" or "neutral."
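
A minimal Keras CNN of the kind commonly used for this classification step is sketched below; the layer sizes and training settings are illustrative assumptions, not the exact model described in this work.

```python
# Hedged sketch: a small convolution + pooling CNN with dense layers on top,
# of the kind commonly used for 48 x 48 grayscale facial expression images.
# Layer sizes are illustrative assumptions.
from tensorflow.keras import layers, models

def build_emotion_cnn(num_classes: int = 7, input_shape=(48, 48, 1)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # one output per expression label
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```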

3.1 Facial Expression Recognition of FER-2013

The FER-2013 dataset for facial emotion detection is provided by Kaggle and was introduced at the International Conference on Machine Learning (ICML). A few images from the dataset are shown in Figs. 2, 3, 4, 5, 6, 7, and 8.

Fig. 2
A close-up photo of a man wearing spectacles with his mouth open.

Angry

Fig. 3
A close-up photo of the woman looking to her right.

Disgust

Fig. 4
A close-up photo of a man with his eyes wide open.

Fear

Fig. 5
A close-up photo of a woman wearing glasses.

Happy

Fig. 6
A close-up photo of a girl.

Neutral

Fig. 7
A close-up photo of a woman.

Sad

Fig. 8
A close-up photo of a boy with his mouth open.

Surprise

Each face in this dataset is categorized by emotion, and every image is a 48 × 48 pixel grayscale image. The FER-2013 dataset contains 35,887 images in which seven distinct expression types are identified by seven distinct class labels. The number of images per class in FER-2013 is given in Table 1.

Table 1 Number of data in the FER-2013
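
The Kaggle release of FER-2013 is commonly distributed as a single fer2013.csv file with an emotion label column and a space-separated 48 × 48 pixel string per row; assuming that format, a loading sketch looks like this.

```python
# Hedged sketch: load FER-2013 assuming the common fer2013.csv layout with
# "emotion" (0-6) and "pixels" (2304 space-separated grayscale values) columns.
import numpy as np
import pandas as pd

df = pd.read_csv("fer2013.csv")                # 35,887 rows expected
labels = df["emotion"].to_numpy()              # one of seven expression classes
images = np.stack([
    np.array(p.split(), dtype=np.uint8).reshape(48, 48)
    for p in df["pixels"]
])
print(images.shape, labels.shape)              # expected: (35887, 48, 48) (35887,)
```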

3.2 Micro-Classification of Facial Expression

In social psychology, a micro-expression is a facial expression that is easy to see and recognize as a form of communication. Facial expressions transmit information about emotions, objectives, and goals, and are fundamental to interpersonal communication; understanding and being able to read facial emotions naturally makes conversation easier. The classification of human facial expressions involves three steps: face detection, feature extraction, and facial expression classification. The authors of this study used a method that can categorize facial expressions at scale across seven fundamental human expressions (Figs. 9, 10, 11, 12, 13, 14, and 15).

Fig. 9
A close-up photo of a man laughing with his teeth visible.

Features of joyful expressions

Fig. 10
A close-up photo of an angry man with his eyebrows raised and teeth visible.

Anger expression characteristics

Fig. 11
A close-up photo of a man with his eyes closed and both his hands on the forehead.

A sad expression’s defining features

Fig. 12
A close-up photo of a girl with her fingers in her mouth.

Typical fear expression

Fig. 13
A close-up photo of a man looking to his left.

Disgust expression

Fig. 14
A close-up photo of a child with his mouth open and teeth visible.

Typical surprise expression

Fig. 15
A close-up photo of a man with no expression on his face.

Neutral face

The human facial expression is detected from the following cues:

  1. Eyebrows pulled down (shows anger)
  2. Eyebrows pulled up and together (shows fear)
  3. Upper lip pulled up (shows disgust)
  4. Eyes neutral (shows neutral)
  5. Cheeks raised (shows happy)
  6. Lip corners pulled down (shows sad)
  7. Mouth hangs open (shows surprise)

(1) Happy

A smile is a facial expression that can convey enjoyment or liking for something. The happy expression is characterized by an upward movement of the cheek muscles and of the corners of the lips to form a smile.

(2) Anger

When expectations and reality diverge, angry facial expressions result. The expression is visible in the way the eyes are focused when staring, the way the lips are contracting, and the way the inner eyebrows on both sides are merging and bending down.

(3) Sadness

A sad face arises when there is disappointment or a sense of missing something; its defining traits include a loss of focus in the eyes, a downward pull of the lips, and a drooping of the upper eyelids.

(4) Fear

Fear is a type of expression that manifests when a person finds themselves unable to handle a situation or in a frightening environment. The two eyebrows that raise simultaneously, the tightened eyelids, and the horizontally wide lips all indicate anxiety on a person’s face.

(5) Disgust

A person displays facial disgust after witnessing something unusual or hearing information they consider unimportant. The face shows signs of distaste when the upper lip rises and wrinkles appear at the bridge of the nose.

(6) Surprise

When someone encounters a sudden, unexpected, or significant event or message they were previously unaware of, they express surprise. A surprised expression is shown by lifted brows, wide-open eyes, and a reflexive widening of the mouth.

(7) Neutral

A neutral face shows no pronounced emotion; such a flat expression is sometimes read as snobbish or lacking in regard for others.

4 Results Analysis

In this work, the system was tested at various stages of the facial micro-expression recognition design. The outcomes demonstrated that the facial expression detection system can use the CNN architectural model effectively and in a timely manner. According to the evidence in Table 2, data training is carried out most effectively when a separate convolution layer is used, and the trained model predicts facial expressions with accuracies of 0.40 (anger), 0.24 (disgust), 0.35 (fear), 0.66 (happy), 0.40 (neutral), 0.37 (sad), and 0.68 (surprise). Analysis of the system's results after implementation is absolutely necessary.

Table 2 Result of facial expression testing

4.1 Prediction Test of Facial Expression

For each of the seven expressions, the experiment is carried out ten times, and the system successfully recognizes the expression.

Anger and fear were each recognized incorrectly about once, whereas disgust was recognized incorrectly about twice. The findings are displayed in Table 3, which shows which expressions are straightforward to predict and which are more challenging.

Table 3 Confusion matrix
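
A confusion matrix such as Table 3 can be produced from the recorded trial outcomes; the sketch below shows the computation with scikit-learn on invented placeholder labels, not the authors' actual results.

```python
# Hedged sketch: building a confusion matrix over the seven expressions from
# prediction trials. The y_true / y_pred lists are invented placeholders.
from sklearn.metrics import confusion_matrix

emotions = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

y_true = ["anger", "anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
y_pred = ["anger", "fear",  "disgust", "fear", "happy", "neutral", "sad", "surprise"]

# Rows are the performed expression, columns the predicted expression.
print(confusion_matrix(y_true, y_pred, labels=emotions))
```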

5 Future Scope

Companies can learn through sentiment research how consumers feel about a brand, whether favorably, negatively, or neutrally. Brand monitoring, which includes sentiment research, is one of the most crucial methods for retaining clients' attention and engagement. Anyone can use sentiment analysis to assemble and evaluate massive volumes of text data, such as news, social media posts, views, and suggestions, to predict the outcome of an election; it considers how the candidates are perceived by the general population. The availability of large and stable datasets makes a significant contribution in this regard; indeed, we point out some difficulties with the available datasets in this research. Modern social media platforms allow the collection of large volumes of photographs along with a range of linked data, which can be used to specify both input and "ground truth" properties. To avoid associating noisy data with the photos, this textual data must be adequately filtered and processed, as previously described. Systems with broader purposes could be designed to address new difficulties or to focus on newly emerging tasks; for example, such programs can help people bridge the gap between real and virtual communication. Emojis have been growing in popularity for years, owing mainly to the proliferation of social media platforms, and they are now an essential element of how people communicate online; they are commonly used to convey user reactions to messages, photos, or breaking news. As a result, investigating novel communication channels may help to improve present state-of-the-art performance. This work can also be utilized in cybercrime investigation to study criminals' expressions in order to determine the true motive behind the malpractice committed.

6 Conclusion

The study's purpose was to create a system that is flexible, cost-effective, adaptable, and, most importantly, portable, and that offers a trustworthy method for verifying the accuracy of social product reviews. Our proposed sentiment analysis system falls within machine learning, and our main goal was high-accuracy sentiment detection in video. This capability can also help us analyze video reviews. Many social media platforms, including Facebook, Twitter, and YouTube, now call for audio and video monitoring. Using our technology, we can analyze consumption and detect opinions about a given product. Because of the rapid expansion of social media, multimedia data has become a crucial carrier of human thoughts and opinions, and the study of social networks has risen to prominence as a research area. We briefly reviewed the most common methodologies for textual sentiment analysis on social media, and examined the most common multimodal sentiment analysis approaches as well as visual sentiment analysis.

The goal of this work was to provide a thorough examination of visual sentiment analysis, its related challenges, and the principal techniques in the area. Real enterprise applications that would benefit from sentiment analysis on images and videos have also been explored.