1 Introduction

Effective communication is essential for the proper reception of information and the generation of an appropriate response (Das 2017). Communication breakdowns may occur when the recipient does not comprehend the sounds, letters, words, phrases, sentences, and other components of the message. Nonverbal cues such as gestures and facial expressions can help enhance communication. Emotions play a significant role in human decision-making and are expressed as physical responses that vary over time and across environmental contexts.

Crying is an infant's means of communicating its needs, and parents and caregivers must be able to interpret these cries; however, inexperienced parents may struggle to do so. As per Le et al. (2019), crying conveys an infant's emotions, physical needs, and any internal or external problems they are experiencing. The classification and recognition of baby cries from audio data is a significant task, and machine learning has been shown to perform it better than humans (Mukhopadhyay et al. 2013). However, it remains challenging due to the limited availability of datasets (Ji et al. 2021). Previous studies, discussed in Sect. 2, have explored various techniques for baby cry classification, but few have investigated semi-supervised solutions, particularly the integration of Support Vector Machine (SVM) and Long Short-Term Memory (LSTM). Hence, the aim of this study is to examine the effectiveness of the SVM-LSTM semi-supervised technique, which shows promise in addressing the issue of insufficient labeled training data.

In light of the scarcity of baby cry datasets due to data sensitivity, this investigation utilizes adult audio datasets as a more flexible experimental alternative to explore the SVM-LSTM semi-supervised model and to determine the optimal amounts of labeled and unlabeled training data required. The CREMA-D dataset (Cao et al. 2014), which consists of recordings of adults speaking with varying emotional expressions, is employed for its high reliability and large volume. The findings yield valuable insights into the performance of the SVM-LSTM semi-supervised technique for audio-based emotion classification, and shed light on the amounts of labeled and unlabeled training data necessary to obtain reasonable classification outcomes. These results will benefit future researchers seeking to apply semi-supervised techniques to the challenge of insufficient labeled data in the classification of baby cries.

2 Related works

This section reviews previous work on baby cry classification (Sect. 2.1) and adult emotion classification (Sect. 2.2).

2.1 Audio-based baby cry classification

Prior to the popularization of machine learning techniques, Mima and Arakawa (2006) employed handcrafted rules to characterize the shapes of power spectra obtained from their self-collected baby cry recordings. Despite achieving an accuracy of 85%, the authors noted the risk of misjudgment arising from the ambiguity and mixed emotions present in the data, which their rule-based approach could not resolve. To address such data variability, machine learning methods that learn patterns from real data offer promising solutions.

In a study by Rosen et al. (2021), Mel-Frequency Cepstral Coefficients (MFCC) were extracted from the Donate-A-Cry dataset for baby mood prediction and emotion classification. Several classifiers, including Decision Tree, Random Forest, SVM, and Logistic Regression, were evaluated on the extracted features. The authors reported that Random Forest and SVM were the most accurate, with the highest accuracy of 91%. They also concluded that Random Forest outperformed SVM, possibly because the ensemble algorithm reduces overfitting on the dataset, suggesting that SVMs may be less suited to experiments with large sample sizes.

Given that audio signals can be transformed and visualized as 2D imagery, Convolutional Neural Networks (CNNs), which excel at image inputs, have become a popular deep learning choice for baby cry classification. While it is more convenient to train both the feature extractor and the classifier end-to-end, as demonstrated by Chang and Tsai (2019) for baby cry classification, Ashwini et al. (2021) split the network into two parts, using a deep CNN for feature extraction and an SVM for classification. Despite the challenge of tuning two separate components, Ashwini et al. reported a high classification accuracy of 88.89%. To further improve performance, Le et al. (2019) used spectrogram images of baby cries in an ensemble that combined a pre-trained CNN (ResNet50) with an SVM, reporting the highest accuracy of 91.1%. Notably, the use of a pre-trained CNN enabled the authors to achieve this accuracy with a relatively small amount of training data (i.e., 2,268 samples).

Kristian et al. (2022) proposed a generative approach to baby cry classification, using an AutoEncoder to learn the underlying latent vectors from the data. Two spectrograms, namely amplitude and dB-scaled spectrograms, were utilized alongside facial images, resulting in a multimodal visual-audio system. The authors reported that the multimodal combination of face images and dB-scaled spectrograms gave the best cry detection performance, with an accuracy of 87.5% and an F1-score of 87.1%, outperforming unimodal features. However, the best unimodal results, obtained from a dB-scaled Convolutional AutoEncoder (CAE) with an accuracy of 87.27% and an F1-score of 86.89%, were not significantly different from the best multimodal model. These results suggest that audio alone may be sufficient to achieve good performance in baby cry classification.

Although CNNs are commonly used for image data, they have a limited analysis range due to their fixed-size kernel windows, which cover only nearby pixels. This short-range analysis is less suitable for audio signals, as it fails to capture the long-range relationships between audio components. To address this issue, Maghfira et al. (2020) proposed a combined CNN and Recurrent Neural Network (RNN) model to extract spatiotemporal features; the joint model demonstrated the highest accuracies of 94.97% and 86.03% for binary cross-entropy and categorical cross-entropy, respectively. In contrast, Ji et al. (2021) introduced a Graph Convolutional Network (GCN) approach to cover both short-range and long-range analysis on spectrograms of baby cries. Despite CNN's limitations, Liang et al. (2022) compared three models, namely Multi-Layer Perceptron (MLP), CNN, and LSTM, for baby cry classification; the reported results showed that both CNN and LSTM were effective, but CNN performed better in recognizing a baby's specific needs.

The task of baby cry classification using deep learning poses a challenge due to the scarcity of annotated data, which limits the ability to train large networks with proper labels. One solution is to use pre-trained models, as Le et al. (2019) did in their study. Another trending approach is to exploit additional unlabeled data through semi-supervised or self-supervised learning. Ji et al. (2021) studied both supervised and semi-supervised training for baby cry classification and found that the semi-supervised model, trained with only 20% of the labeled training data, outperformed a CNN trained with 80% of the labeled data; their results also indicated that accuracy increased with the amount of labeled data. In another work (Mahmoud 2020), K-Nearest Neighbors (KNN) was employed as the classifier for infant cry classification. Since KNN usually performs poorly with insufficient data, the authors applied semi-supervised learning, leveraging unlabeled data from Google AudioSet in addition to labeled data from the Dunstan Baby Language and Baby Chillanto datasets. Their work highlights the potential of additional unlabeled data to improve the performance of even simple algorithms like KNN.

Table 1 presents an overview of previous studies in baby cry classification. While various techniques, emotions, and datasets have been explored, most studies, with the exception of the rule-based approach in (Mima and Arakawa 2006), adopt either the spectrogram or MFCC as a means of converting audio signals to 2D imagery. Regarding classification methods, a few works (Mahmoud 2020; Ji et al. 2021) have applied semi-supervised learning to baby cry classification; nevertheless, the number of such investigations remains limited, indicating room for future research in this area. Additionally, as of the time of writing, no prior study on baby cry classification has employed a combined Support Vector Machine (SVM) and Long Short-Term Memory (LSTM) approach to address this task in a semi-supervised manner.

Table 1 Previous works in audio-based unimodal baby cry classification

2.2 Audio-based adult emotion classification

As stated in Sect. 1, while the ultimate objective of this study is to classify baby cries, emotional voice datasets of adults are employed to facilitate flexible experiments and feasibility assessments, and to evaluate the efficacy of the proposed audio-based emotion classification method under varying amounts of labeled and unlabeled training data. Despite the differences between adult voices and infant cries in pitch, duration, and the variability of frequency and intensity, prior research on audio-based emotion classification that is not specific to baby voices, such as the survey by Fahad et al. (2021) and the recent work of Mohan et al. (2023), utilizes techniques similar to those summarized in Table 1. Therefore, despite the use of distinct audio datasets, the outcomes of this study have the potential to generalize to other audio datasets, including those of baby cries, in the future.

3 Methodology

The aim of this study is to investigate the applicability of machine learning and deep learning techniques for classifying speech into six emotional states: Neutral, Angry, Happy, Sad, Disgust, and Fear. Figure 1 presents the step-by-step procedure used to accomplish this task. The top row of the figure illustrates the process of generating more labeled data: a small set of labeled data is used to train an SVM (Cortes and Vapnik 1995), the trained SVM then predicts pseudo labels for unlabeled data, and only pseudo-labeled samples with high probabilities are retained for further processing. The resulting larger dataset, consisting of both originally labeled data and pseudo-labeled data, is employed to train the target LSTM model (Hochreiter and Schmidhuber 1997).

The subsequent sub-sections present the fundamental components of the investigation. Section 3.1 describes the dataset and the techniques employed to extract features for input to the machine learning models. Sections 3.2-3.4 outline the particulars of model development, starting with the SVM and LSTM models in Sects. 3.2 and 3.3, respectively, and culminating in the final semi-supervised solution in Sect. 3.4. Beyond serving as baselines for performance comparison, the two models play distinct roles in expanding the sample size: the SVM generates pseudo-labeled data, owing to its relatively short training time and ease of review, while the LSTM is selected as the final classification model because it is one of the fundamental techniques for processing sequence data.

Fig. 1 An overview of the proposed semi-supervised process

3.1 Data preparation

The objective of this investigation was to discern the emotional state of speech via the analysis of a dataset comprising six basic emotions: Neutral, Angry, Happy, Sad, Disgust, and Fear. The data, drawn from the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D) (Cao et al. 2014), were collected from a cohort of 48 male and 43 female actors in a noiseless recording environment. A total of 7,442 samples were gathered from the scripted session, and the number of samples per emotion is presented in Fig. 2.

Fig. 2 Number of samples classified by six emotions
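For illustration, emotion labels can be read directly from the CREMA-D file names, whose third underscore-separated field encodes the emotion (e.g., 1001_DFA_ANG_XX.wav). The sketch below assumes this naming convention and a local AudioWAV/ directory; both are properties of the public release rather than details stated in this paper.

```python
import os

# Map CREMA-D three-letter codes to the six emotion classes used in this study.
EMOTIONS = {"NEU": "Neutral", "ANG": "Angry", "HAP": "Happy",
            "SAD": "Sad", "DIS": "Disgust", "FEA": "Fear"}

def load_samples(audio_dir="AudioWAV"):
    """Return (file path, emotion label) pairs parsed from file names,
    e.g. '1001_DFA_ANG_XX.wav' -> 'Angry'."""
    samples = []
    for name in sorted(os.listdir(audio_dir)):
        if not name.endswith(".wav"):
            continue
        code = name.split("_")[2]  # third field holds the emotion code
        if code in EMOTIONS:
            samples.append((os.path.join(audio_dir, name), EMOTIONS[code]))
    return samples
```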

Consistent with prior research as summarized in Table 1, this study performs feature extraction before developing the models. Audio features are extracted in Python using the Librosa library, version 0.9.1 (https://librosa.org/doc/latest/index.html). The methodology commences with the labeled datasets and the extraction of Mel-Frequency Cepstral Coefficients (MFCC) (Sato and Obuchi 2007). Each audio clip thus yields a matrix whose size is the number of MFCC coefficients by the number of time frames (i.e., the clip's duration), which is normalized using min-max scaling before being fed into the model. Typically, only average values are used as input features; however, this study also includes the maximum and minimum values to increase the level of detail of the inputs fed into the model. Overall, 39 features are extracted from each audio clip for use in model development, as shown in Fig. 3.

Fig. 3 An example of MFCC resulting from feature extraction
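A minimal sketch of this feature-extraction step, assuming that the 13 MFCC coefficients are summarized by their per-coefficient minimum, maximum, and mean to form the 39-dimensional vector; function names are illustrative:

```python
import librosa
import numpy as np

def extract_features(path, n_mfcc=13):
    """Return a 39-dimensional feature vector: the per-coefficient
    min, max, and mean of the MFCC matrix (n_mfcc x frames)."""
    y, sr = librosa.load(path, sr=None)  # keep the native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.min(axis=1), mfcc.max(axis=1), mfcc.mean(axis=1)])

def min_max_normalize(X):
    """Column-wise min-max normalization to [0, 1]; the small epsilon
    guards against division by zero for constant columns."""
    X = np.asarray(X, dtype=float)
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
```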

3.2 Support Vector Machine (SVM)

To perform SVM training, the scikit-learn library is employed with the following input parameters: a linear kernel and a one-versus-rest (ovr) decision function. To provide a baseline against which the semi-supervised model can be evaluated, the CREMA-D dataset is partitioned into training and testing subsets of 50% each, and a single SVM model is trained on the resulting 3,721 training samples. The trained SVM model is then applied to the test subset to calculate the validation accuracy reported in Sect. 4.1.
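A sketch of this baseline with scikit-learn, assuming the feature matrix X and label vector y from Sect. 3.1; probability=True is an added assumption so that class probabilities are available for the pseudo-labeling stage in Sect. 3.4, and the stratified split is an illustrative choice:

```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# 50:50 split of the 7,442 CREMA-D samples, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

svm = SVC(kernel="linear", decision_function_shape="ovr", probability=True)
svm.fit(X_train, y_train)
print("Validation accuracy:", svm.score(X_test, y_test))
```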

3.3 Long short-term memory (LSTM)

This paper employs the LSTM implementation available in the Keras library as its deep learning model. The model consists of one LSTM layer and three Dense layers, as illustrated in Fig. 4. The LSTM layer is configured with 256 nodes and a dropout ratio of 0.2. The subsequent Dense layers have 128 and 64 nodes, respectively, each also with a dropout ratio of 0.2 and a hyperbolic tangent (tanh) activation function. The output Dense layer comprises six nodes and employs the softmax activation function. The model is trained using the Adam optimizer with a batch size of 512 and a 0.2 split ratio for the training and validation datasets. The trained LSTM model is then applied to the test dataset to calculate the validation accuracy reported in Sect. 4.2.

Fig. 4 Model architecture for LSTM
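A sketch of this architecture in Keras. The input shape (treating the 39 MFCC statistics as a length-39 sequence), the one-hot label encoding, and the epoch count are assumptions; only the layer sizes, dropout ratios, activations, optimizer, batch size, and validation split are specified above.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(39, 1)),            # assumed: 39 features as a sequence
    layers.LSTM(256, dropout=0.2),          # one LSTM layer, 256 nodes
    layers.Dense(128, activation="tanh"),
    layers.Dropout(0.2),
    layers.Dense(64, activation="tanh"),
    layers.Dropout(0.2),
    layers.Dense(6, activation="softmax"),  # six emotion classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Batch size 512 and a 0.2 train/validation split, as described above;
# y_train_onehot is assumed to be the one-hot encoded labels, and the
# epoch count is illustrative.
history = model.fit(X_train, y_train_onehot,
                    batch_size=512, validation_split=0.2, epochs=300)
```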

3.4 Semi-supervised LSTM

This section explains how the SVM and LSTM are integrated into a semi-supervised model that employs pseudo labels. The semi-supervised LSTM is trained on two types of data: labeled training data and pseudo-labeled training data. The number of labeled training samples is fixed at 3,721, the same as used for training the SVM, while the number of pseudo-labeled samples is determined by varying the probabilistic threshold specified in Table 2. For instance, at a probability threshold of 50%, the semi-supervised LSTM is trained and tested on a total of 4,953 samples, comprising 3,721 labeled and 1,232 pseudo-labeled samples. This total is then partitioned into training and testing subsets at an 80:20 ratio, yielding 3,962 training samples and 991 testing samples. The trained semi-supervised LSTM model is then applied to the test subset to calculate the validation accuracy reported in Sect. 4.3.
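A minimal sketch of this two-stage procedure, assuming the SVM from Sect. 3.2 was trained with probability=True and that X_unlabeled, X_labeled, and y_labeled are NumPy arrays with illustrative names:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def pseudo_label(svm, X_unlabeled, threshold=0.5):
    """Keep unlabeled samples whose highest class probability meets
    the threshold, paired with the SVM's predicted class."""
    proba = svm.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= threshold
    return X_unlabeled[confident], svm.classes_[proba[confident].argmax(axis=1)]

# At a 50% threshold this yields 1,232 pseudo-labeled samples (Table 2).
X_pseudo, y_pseudo = pseudo_label(svm, X_unlabeled, threshold=0.5)

# Merge the 3,721 labeled samples with the pseudo-labeled ones, then
# split the combined set 80:20 into training and testing subsets.
X_all = np.concatenate([X_labeled, X_pseudo])
y_all = np.concatenate([y_labeled, y_pseudo])
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.2)
```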

Table 2 The number of samples retained at each probability threshold and the accuracy achieved

4 Results and discussion

This section presents the experimental results for the SVM (Sect. 4.1), the LSTM (Sect. 4.2), and the semi-supervised LSTM (Sect. 4.3). Furthermore, Sect. 4.4 discusses the outcomes and limitations of this research. The evaluation metrics utilized in this study are accuracy, precision, recall, and F1-score, as calculated in Eqs. 1-4, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
(1)
$$Precision = \frac{TP}{TP + FP}$$
(2)
$$Recall = \frac{TP}{TP + FN}$$
(3)
$$F1\ Score = \frac{2 \times Recall \times Precision}{Recall + Precision}$$
(4)
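In practice these metrics, along with their per-class values, can be computed with scikit-learn; a short example, assuming predicted labels y_pred and ground-truth labels y_test from any of the three models:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
# Per-class precision, recall, and F1-score (as in Tables 3 and 4).
print(classification_report(y_test, y_pred))
# Rows are true classes, columns are predicted classes (as in Figs. 5 and 7).
print(confusion_matrix(y_test, y_pred))
```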

4.1 Results of SVM

This section describes the evaluation of the SVM model, displaying the validation results as a confusion matrix in Fig. 5 and a classification report in Table 3. As indicated in Fig. 5, the SVM model produces high frequencies of accurate predictions, as evidenced by the main diagonal. In the classification report in Table 3, the F1-score is used to evaluate the overall performance of the model, since the number of samples in each class is similar. Examining the per-class precision, recall, and F1-score reveals that the model predicts the anger and sadness categories better than the other categories.

Fig. 5 Confusion matrix for SVM regarding six emotions of angry (ANG), disgust (DIS), fear (FEA), happy (HAP), neutral (NEU), and sadness (SAD)

Table 3 Report on the SVM classification

4.2 Results of LSTM

As discussed in Sect. 3.3, the LSTM model was trained using 5,954 training samples (80% of the data) and evaluated on the remaining 1,488 testing samples (20%). Figure 6 shows the training and validation curves under the parameter configuration described in Sect. 3.3. Training took 8 s per epoch, totaling about 40 min (approximately 300 epochs). The model reached an accuracy of 99.92% with a loss of 0.0032 on the training data, but only 48.23% accuracy with a loss of 4.5079 on the validation set; this large gap suggests that the model overfits the training data.

The validation accuracy of the LSTM model was visualized in a confusion matrix, as depicted in Fig. 7. In addition, Table 4 presents the classification report for the LSTM model. The F1-score in Table 4 suggests that the performance of the LSTM model is comparable to that of the SVM model (LSTM = 0.42; SVM = 0.46). Notably, like the SVM model, the LSTM model demonstrated better prediction of the emotions of anger and sadness than other emotions.

Fig. 6 Comparison between training and validation results of the LSTM

Fig. 7 Confusion matrix for LSTM regarding six emotions of angry (ANG), disgust (DIS), fear (FEA), happy (HAP), neutral (NEU), and sadness (SAD)

Table 4 Report on the LSTM classification

4.3 Results of the semi-supervised LSTM

As depicted in Fig. 8, the validation accuracy of the model was observed to be higher than the training accuracy, which is theoretically counterintuitive, and the reason behind this anomaly was difficult to discern (one possible contributor, though unverified here, is that dropout is active during training but disabled during validation). The correlation between accuracy and the threshold percentage is shown in Fig. 9. The graph displays a bell-like curve: accuracy tends to rise as the threshold increases, peaking at 89.72% at a threshold of 50%, and then diminishes once the threshold surpasses 90%, falling to the lowest accuracy of 40%.

Fig. 8 Validation accuracy of semi-supervised model

Fig. 9 The relationship between accuracy and threshold percentage

4.4 Discussion

Sections 4.1-4.3 indicate that the supervised SVM and supervised LSTM models show no significant difference in accuracy, while the semi-supervised LSTM exhibits the highest accuracy. An interesting observation is that the SVM primarily selects easy-to-predict emotions, such as anger and sadness, for inclusion as pseudo-labeled training samples. Figure 10 shows that, at a threshold of 50%, the original dataset is well balanced across all emotions; after incorporating the pseudo-labeled samples recruited by the SVM, however, the new dataset exhibits a slight imbalance, with more samples in the anger and sadness groups. Addressing this imbalance in future research is critical, whether by training the semi-supervised LSTM with a special technique or by carefully monitoring the distribution of pseudo-labeled samples obtained from the first-stage SVM, as sketched after Fig. 10.

Fig. 10 Left: the number of samples in the original dataset. Right: the number of samples in the new semi-supervised dataset. The six emotions are angry (ANG), disgust (DIS), fear (FEA), happy (HAP), neutral (NEU), and sadness (SAD)
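One lightweight way to implement such monitoring is to compare the class distribution before and after pseudo-labeling; a sketch using the illustrative names from Sect. 3.4:

```python
from collections import Counter

# Class counts before and after adding the SVM's pseudo-labeled samples.
before = Counter(y_labeled)
after = Counter(list(y_labeled) + list(y_pseudo))
for emotion in sorted(before):
    print(f"{emotion}: {before[emotion]} -> {after[emotion]}")
```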

Another potential concern is that the model could miss important information due to the reduction of features. In this study, only 39 features were selected from the MFCC data, specifically 13 each for the minimum, maximum, and average values, which may underutilize the information in the time domain. Two limitations follow. First, eliminating granularity in the data by collapsing each coefficient into a few summary values may prevent the model from identifying patterns that predict the mood of certain classes, such as fear and disgust; these classes may exhibit unique wave characteristics hidden within the finer ranges of the originally collected signals. Second, future studies may consider incorporating additional features such as spectrograms, or employing deep learning for automatic feature extraction, to mitigate these feature-selection limitations.

The last concern regards the CREMA-D dataset itself, which may be limited in its effectiveness at conveying emotions. The dataset contains audio recordings of actors speaking predetermined sentences; this method of eliciting emotions may be less effective than others, as the resulting audio may not accurately convey the intended emotion, posing a challenge for emotion classification. To address this concern, future research could collect more natural audio data from actual emotional voices or conversations, explore alternative datasets and performance metrics to increase diversity, and conduct more comparative experiments with recent techniques.

5 Conclusion

The present study investigates the feasibility of audio-based emotion classification, with the ultimate goal of classifying baby cry emotions. Given the sensitivity of baby cry data and the limited amount available, the study explores semi-supervised learning as a means of reducing the need for labeled data and allowing scalable model training. To this end, the CREMA-D adult dataset, encompassing six distinct emotions, is utilized to test the hypothesis and determine the optimal number of labeled samples required. Three models, namely a supervised SVM, a supervised LSTM, and a semi-supervised LSTM, are trained and evaluated. The results suggest that the first two models exhibit similar accuracies, whereas the semi-supervised LSTM yields significantly higher accuracy.