1 Introduction

About 130 million babies are born globally each year. Taking good care of newborns is a big challenge, especially for first-time parents. Following suggestions from other parents and from books is not enough to solve the problems that arise in practice, mainly because it is difficult to understand the meaning of infant cries. Infants communicate with the world through crying. Experienced parents, caregivers, doctors, and nurses interpret cries based on their experience, whereas young parents get frustrated and have trouble calming their babies because all cry signals sound the same to them. Accurately interpreting infants’ cry sounds can help parents take better care of their babies. Research on infant cry started as early as the 1960s, when the Wasz–Hockert research group reported that four types of cries (pain, hunger, birth, and pleasure) could be identified auditorily by trained nurses [1]. Early research thus established that different types of cries can be differentiated auditorily by trained adult listeners. However, training human listeners to recognize infant cries is much harder than training machine learning models. In Mukhopadhyay’s study, the highest classification accuracy achieved by a group of people trained to recognize a set of cry sounds was 33.09%, while a machine learning algorithm based on spectral and prosodic features reached 80.56% accuracy on the same data [2]. Building smart machines that understand infant cry paves the way for intelligent robot caregivers in the future. Besides understanding infants’ daily needs, disease prediction is another critical task in infant cry research. Since infants’ vocal tract and breathing system are affected by some diseases, the cry signals of unhealthy infants contain unique characteristics that differ from healthy cry signals. Known examples of such conditions include deafness, autism, and asphyxia. Analyzing pathological cry signals to identify diseases is a non-invasive and fast method that can save infants’ lives, especially in areas that lack medical equipment and expertise. In the early years of infant cry research, many works focused on classifying normal and pathological cry signals. Saraswathy’s review [3] lists 34 papers on the classification of normal and pathological cry signals published from 2003 to 2011. These works identify conditions such as hypo-acoustic, asphyxia, hypothyroidism, hyperbilirubinemia, and cleft palate.

Infant cry research involves data collection, cry signal processing, feature extraction and selection, and classification. Due to the sensitivity of cry data, it has been difficult for researchers to acquire the data they need. Researchers either record cry clips themselves or ask other authors for permission to use their datasets. Most databases are recorded in hospitals, Neonatal Intensive Care Units (NICUs), homes, and clinics, either by recording in real time or by setting up electronic recording devices close to the infants’ cribs for long periods of time. Signal processing is needed to remove background noise and to segment the cries when building cry databases. Once the database is available, feature extraction derives features from different domains of the cry signals. Features extracted from the time domain, cepstral domain, or prosodic domain represent different aspects of the cry signal. Selecting the most appropriate features and reducing the feature dimensions is another task in building effective classification models. Applying appropriate machine learning models to specific cry features is vital for classification or detection accuracy. As the second Artificial Intelligence (AI) winter ended in the 1990s [4], neural networks emerged as a popular method in infant cry research. Neural networks are computing systems of interconnected neurons, inspired by biological brains. Input vectors, neurons, weights, activation functions, and outputs are the main elements of a neural network. Each neuron has a value computed in the forward propagation process based on the weights of each connection and the bias of each layer. Activation functions are used to achieve nonlinearity in the network. Back propagation is the key algorithm for training the model and minimizing the loss function, which evaluates how well the model fits the dataset. During the 2000s, most methods adopted in infant cry research were related to neural networks, including the scaled conjugate gradient neural network, multi-layer perceptron, general regression neural network, evolutionary neural network, probabilistic neural network, neuro-fuzzy network, and Time Delay Neural Network. The Hidden Markov model and the Support Vector Machine (SVM) were also adopted in the 2000s. In the recent decade, many traditional machine learning methods, such as SVM, K-Nearest Neighbor (KNN), Gaussian Mixture Model (GMM), fuzzy classifier, logistic regression, K-means clustering, and Random Forest, have been applied to pathological cry classification, cry reason classification, and cry sound detection. In the same period, novel neural network architectures have been used pervasively in industry and research. Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), CNN-RNN, Capsule Net, Reservoir Network, and neuro-fuzzy networks have opened a new chapter in infant cry research.

This survey reviews infant cry research, focusing mainly on the signal processing techniques and machine learning methods developed in the past decade. We first review typical databases used in the research, then introduce pre-processing approaches for infant cry signals, and describe a variety of features in the time domain and in the frequency domain as well as suprasegmental features of infant cry signals. We focus on reviewing the state-of-the-art methods using KNN, SVM, GMM, and CNN-based algorithms for classification and detection. We provide a list of resources for researchers who are interested in working in this domain, and finally we discuss future work in this research area.

2 Data acquisition

As shown in Fig. 1, automatic infant cry research generally involves five stages: data acquisition, pre-processing, feature extraction, feature selection, and classification. Novel methods in any of these stages can help improve the final classification accuracy.

Fig. 1 Five stages of infant cry research

The data acquisition stage includes recording the infant cry sounds and labeling them. Most databases are recorded in hospitals or homes and labeled by doctors, nurses, or parents. Digital recorders are placed close to infants and are either operated on the spot to capture the cry signals one by one or left on to record the sound events around the infants for a long period of time. Infant sound is a short-term stationary signal, and it is assumed to be more stationary because of infants’ lack of full control of the vocal tract. Due to limited resources and the sensitivity of the infant cry data collection process, the number of available infant cry databases is very limited. From the previous review papers [3, 5, 6], we can see that the most commonly used database in infant cry research is the Baby Chillanto database [7]. The Baby Chillanto database was collected by the National Institute of Astrophysics and Optical Electronics, CONACYT Mexico [8]. It contains five types of cry signals: deaf, asphyxia, normal, hungry, and pain. Each cry is segmented into 1-s clips and the total number of cries is 2268. Another database used in multiple studies is the Dunstan Baby Language database [9], which is extracted from the Dunstan baby video tutorial presented by Priscilla Dunstan, who invented the Dunstan Baby Language theory. There are several versions of the Dunstan Baby Language database since authors extracted the audio clips in their own ways. The version described in [9] consists of 315 wave files, sampled at 16 kHz, with a variable length between 0.3 and 1.6 s. Each utterance is a word of infant speech corresponding to one of the five “Dunstan words,” which were translated as “Neh” = hungry, “Eh” = need to burp, “Oah” = tired, “Eairh” = low belly pain, and “Heh” = physical discomfort.

Many databases are self-recorded for research. Researchers need to contact other authors to check the availability of the desired databases. One database named Donate A Cry [10] is available online, but it is not well labeled and only one study was found using it. Table 1 shows the commonly used databases in recent research. Some databases are recorded in the Neonatal Intensive Care Unit (NICU), pediatric clinics, or baby-sitting environments [11–24]. Some cry audio signals available online are also collected in [25]. Some synthetic databases are created by the authors in order to compare the performance of the proposed methods on real and synthetic databases [11, 18, 26, 27]. In Ferretti’s work [18], the CNN detects the cry signal better on the synthetic database than on the real database. This shows that the automatic detection and classification of infant cry in real time is still challenging, because a real-world environment may contain many complications that affect the quality of the cry signals. Synthetic databases can be generated by adding noise to clean cry recordings or by combining different cries. Training models on synthetic databases can avoid the need to acquire a large amount of data in sensitive environments such as NICUs [18].

Table 1 Main databases used in the literature

From Table 1, we can see that most datasets contain a limited number of samples. The average size is 2983 and only one database is close to 20,000 samples. Due to the sensitivity of collecting cry data, especially pathological cry signals, small dataset size is one of the challenges in infant cry research. Data augmentation techniques are used to artificially increase the data size. Zhang et al. created new waveform images from the training datasets by transforming the waveform images into slightly faster or slightly slower waveforms, enlarging the training set to mitigate overfitting [12]. In [43], several data augmentation techniques, such as noise variation, signal intensity variation, tonality variation, and spectrogram size alteration, were used to artificially increase either the number of audio signals or the number of spectrograms. The experimental results showed that these data augmentation methods did not improve accuracy. The reason is that the limited data cannot capture the diversity of variations within infant cry signals.
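As a rough illustration of the "slightly faster or slightly slower" idea, the sketch below resamples a sample sequence by linear interpolation. It is a minimal sketch under stated assumptions, not the procedure of [12], which operated on waveform images: the function name `speed_perturb`, the factors 0.9 and 1.1, and the synthetic 400 Hz tone standing in for a cry segment are all illustrative choices.

```python
import math

def speed_perturb(signal, factor):
    """Resample `signal` by linear interpolation so it plays `factor` times faster."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = max(1, int(round(len(signal) / factor)))
    out = []
    for i in range(n_out):
        pos = i * factor                          # fractional index into the input
        lo = min(int(pos), len(signal) - 1)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out

# Hypothetical 1-second, 8 kHz test tone standing in for a cry segment.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 400 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

# One slower copy, the original, and one faster copy (factors are assumptions).
augmented = {f: speed_perturb(tone, f) for f in (0.9, 1.0, 1.1)}
```

Note that this simple resampling changes pitch together with duration, so the perturbation factors would need to stay small if the cry's F0 range is to remain plausible.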

3 Signal processing and feature generation

3.1 Pre-processing

The main tasks in the pre-processing stage are denoising and audio segmentation. The complexity of the recording environment leads to unclean infant cry signals. In a neonatal care unit, besides infant cry signals, there can be many other sounds such as footsteps, adult speech, air-conditioner noise, and alarms. To detect or classify cry signals accurately, cleaning up the recorded data at the pre-processing stage is a crucial step. The first task is denoising, which removes background sounds such as speech, fans, and footsteps. Turan and Erzin applied a high-pass FIR filter to remove speech and low-frequency noise from the recording [41]. Ferretti et al. reduced coherent noise sources with a filter-and-sum beamformer and used an OMLSA post-filter to reduce the residual diffuse noise [18]. In [16], Gu et al. applied an optimized Blackman window to each signal frame obtained after endpoint detection. The signal noise is significantly reduced after filtering.
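To make the high-pass FIR idea concrete, here is a minimal sketch of a windowed-sinc high-pass filter built by spectral inversion of a low-pass design. The 300 Hz cutoff, 101 taps, Hamming window, and the synthetic test tones are assumptions chosen for illustration; they are not the settings reported in [41]. The code uses only the standard library so it runs as is; in practice a DSP library would normally be used.

```python
import math

def highpass_fir(cutoff_hz, sample_rate, num_taps=101):
    """Windowed-sinc high-pass FIR taps via spectral inversion of a low-pass."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd for spectral inversion")
    fc = cutoff_hz / sample_rate                  # normalized cutoff (cycles/sample)
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))  # Hamming
        taps.append(ideal * window)
    gain = sum(taps)
    taps = [-t / gain for t in taps]              # unit-DC-gain low-pass, negated
    taps[mid] += 1.0                              # delta minus low-pass = high-pass
    return taps

def fir_filter(signal, taps):
    """Direct-form convolution, output truncated to the input length."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            if n - k >= 0:
                acc += t * signal[n - k]
        out.append(acc)
    return out

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

SAMPLE_RATE = 8000
N = 2000
# Synthetic stand-ins: 50 Hz hum as low-frequency noise, 1 kHz tone as cry energy.
hum = [math.sin(2 * math.pi * 50 * n / SAMPLE_RATE) for n in range(N)]
cry_like = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(N)]

taps = highpass_fir(cutoff_hz=300, sample_rate=SAMPLE_RATE)
hum_out = fir_filter(hum, taps)[200:]             # skip the start-up transient
cry_out = fir_filter(cry_like, taps)[200:]
```

With these assumed settings the hum is strongly attenuated while the 1 kHz component passes almost unchanged, which is the behavior a high-pass stage is meant to provide before segmentation.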

Audio segmentation is commonly performed using Voice Activity Detection (VAD). VAD is widely used in speech recognition to detect human speech in audio signals. Researchers also use it to detect infant cry and remove silent intervals from a recording. VAD itself faces the challenge of separating cry from noise. Pan et al. used it to detect the presence or absence of baby cry in a noisy environment to improve the overall baby cry recognition rate [56], and it has been used to detect the sections of the audio with sufficient audio activity [57]. In [41], the authors implemented a basic VAD algorithm, which uses short-time features of audio frames and a decision strategy to separate sound frames from silence frames. Sometimes researchers also manually cut the samples to remove the silent parts and the parts with voice interference, retaining only the continuous crying part of the sound [51].
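A basic VAD of the kind described above (short-time features plus a decision rule) might look like the following sketch. The frame length, hop, and the rule "active if energy exceeds 10% of the peak frame energy" are assumptions made for illustration, not the algorithm of [41]; the function names are hypothetical.

```python
import math

def frame_signal(signal, frame_len, hop):
    """Split a sample list into overlapping frames (trailing partial frame dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_energy(frame):
    return sum(s * s for s in frame) / len(frame)

def energy_vad(signal, frame_len=200, hop=100, ratio=0.1):
    """Flag a frame as active when its energy exceeds `ratio` times the peak frame energy."""
    energies = [short_time_energy(f) for f in frame_signal(signal, frame_len, hop)]
    threshold = ratio * max(energies)
    return [e > threshold for e in energies]

def active_segments(flags, hop):
    """Convert per-frame flags into (start_sample, end_sample) runs of active frames."""
    segments, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * hop, i * hop))
            start = None
    if start is not None:
        segments.append((start * hop, len(flags) * hop))
    return segments

# Synthetic recording at 8 kHz: near-silence, a 0.5 s tone, then near-silence.
SR = 8000
quiet = [0.001 * math.sin(2 * math.pi * 100 * n / SR) for n in range(2000)]
tone = [math.sin(2 * math.pi * 400 * n / SR) for n in range(4000)]
recording = quiet + tone + quiet

flags = energy_vad(recording)
segments = active_segments(flags, hop=100)
```

A relative threshold keeps the sketch independent of recording gain, but it assumes every recording contains some cry; a fixed or noise-adaptive threshold would be needed for recordings that may be entirely silent.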

3.2 Feature extraction

Infant cry signal differs from adult speech. Figure 2 gives a comparison of spectrograms between infant sound and adult speech. We can see that the variations within waveform and spectrogram are quite different, especially in the areas of energy, intensity, and formants. In general, infant cry is a combination of vocalization, silence, coughing, choking, and interruptions, which includes a diversity of acoustic and prosodic information at different levels. It is the only way for babies to communicate with the world.

Fig. 2 Adult speech vs. infant cry signal in time and frequency domain

Feature extraction is the stage in which discriminative features are extracted from the audio signals and later fed into the machine learning algorithms. It is one of the most vital parts of a machine learning process [58]. Feature extraction in either the time or the frequency domain is fundamental to baby cry analysis and processing. Time domain features, such as zero-crossing rate, amplitude, and energy-based features, are simple and straightforward to compute. However, time domain features are not robust enough to cover the variations within infant cry signals and are sensitive to background noise, whereas frequency domain features have a strong ability to model the characteristics of infant cry signals. The commonly used MFCCs, LPCCs, and LFCCs have been shown to perform better than time domain features. In addition, infant cry signals have been shown to be rhythmic, with cyclic changes due to natural interruptions and breathing. High-level information, such as prosodic features, is important for improving the discriminative ability. Therefore, combining prosodic domain features with time or frequency domain features can capture both physical and physiological information. In addition, a spectrogram is an image that gives a time-frequency representation of an audio clip. It is known that the spectrogram has a strong ability to represent the signal and includes both acoustic and prosodic information.

Figure 3 depicts the main categories of audio features that are applied in research related to speech, music, and environmental sounds. Acoustic and prosodic features are commonly used for infant cry detection and classification. Cepstral domain features, prosodic features, and image-based features are widely used in speech processing and infant cry processing, appearing in over 70% of research articles. In this section, we review feature extraction approaches in the latest research work. Detailed explanations and algorithms of audio features can be found in [58] and [59].

Fig. 3 Main audio feature categories

3.2.1 Cepstral domain features

Mel-frequency cepstral coefficient (MFCC) is widely used in speech recognition. It is a cepstral representation of the audio signal. Researchers use it to test proposed approaches [17, 29, 49, 52, 57, 60–62] and often use it for baseline experiments [13, 15, 22, 31, 37, 63]. Liu et al. used MFCC along with two other cepstral features, Linear Prediction Cepstral Coefficients (LPCC) and Bark Frequency Cepstral Coefficients (BFCC), for infant cry reason classification. The results showed that BFCC with a neural network model produces the best recognition rate of 76.47% [13]. The main idea of LPCC is to remove the redundancy from a signal by predicting the next values as a linear combination of the previous known coefficients. It is used in [16] for cry detection. The Linear Frequency Cepstral Coefficients (LFCC) extraction process is similar to MFCC extraction. The difference is that it uses a linear filter-bank instead of the Mel filter-bank [37, 64]. In [22] and [65], the authors showed that LFCC performs better than MFCC in discriminating high-frequency audio signals such as female voices and baby cry signals. In [24], Singh et al. explored the residual MFCC and implicit LP residual features that represent excitation source information. Researchers have also tried other frequency-domain features such as Fast Fourier Transform (FFT) [23, 66], Log-Mel feature [11, 18], Mel Scale [43], Constant-Q Chromagram [43], Log-mel spectrum [12], and delta spectrum [12].
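The difference between the Mel filter-bank of MFCC and the linear filter-bank of LFCC can be seen by comparing where the filter centers fall. The sketch below is illustrative only: it assumes the widely used mel mapping mel = 2595 log10(1 + f/700), 20 filters, and a 0 to 8000 Hz band, none of which are taken from the cited works.

```python
import math

def hz_to_mel(hz):
    # A commonly used mel-scale formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_centers(num_filters, f_min, f_max):
    """Center frequencies equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(num_filters)]

def linear_centers(num_filters, f_min, f_max):
    """Center frequencies equally spaced in Hz, as in an LFCC filter-bank."""
    step = (f_max - f_min) / (num_filters + 1)
    return [f_min + step * (i + 1) for i in range(num_filters)]

def count_above(centers, hz):
    return sum(1 for c in centers if c > hz)

mel = mel_centers(20, 0.0, 8000.0)
lin = linear_centers(20, 0.0, 8000.0)

# Number of filters centered in the upper half of the band for each layout.
mel_high = count_above(mel, 4000.0)
lin_high = count_above(lin, 4000.0)
```

Under these assumptions the linear layout places more filters in the upper half of the band than the mel layout, which is consistent with the observation that LFCC discriminates high-frequency signals better.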

According to auditory perception models, MFCC coefficients are more robust than other coefficients such as LPC coefficients. In our previous work [15], MFCC features of normal and abnormal infant cry signals within a certain frame combined with 12 orders were plotted in a space. It is observed that the acoustic features of normal infant cry signals are quite different from the asphyxiated ones as shown in Fig. 4. It indicates that the value range and tendency of acoustic features of normal and asphyxiated infant cry are different.

Fig. 4 Multiple order MFCC features of normal and asphyxiated infant cry

3.2.2 Prosodic domain features

It has been shown that infant cry is made of four types of sound: a sound coming from the expiration phase, a brief pause, a sound coming from the inspiration phase, and another pause. Variations in intensity, fundamental frequency (F0), formants, and duration are typical acoustic cues that carry prosodic information about infant cry and speech [13, 67]. These prosodic features have been shown to be effective in identifying the types of infant cry. Adult F0 ranges between 85 and 200 Hz, while infant crying is characterized by a high F0 of 250–700 Hz. F0 is commonly computed using an autocorrelation-based method provided by Praat [68].
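The principle behind autocorrelation-based F0 estimation can be sketched in a few lines. This is a deliberately simplified illustration, not Praat's algorithm, which adds windowing corrections, interpolation, and voicing decisions. The search range defaults to the 250–700 Hz infant range quoted above; the function name and the synthetic 400 Hz test frame are assumptions.

```python
import math

def estimate_f0(frame, sample_rate, f_min=250.0, f_max=700.0):
    """Pick the lag with the largest normalized autocorrelation in the F0 range."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_value = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        value /= (len(frame) - lag)       # normalize so long lags are not penalized
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Synthetic 40 ms frame at 16 kHz: 400 Hz fundamental plus a weaker second harmonic.
SR = 16000
frame = [math.sin(2 * math.pi * 400 * n / SR) + 0.5 * math.sin(2 * math.pi * 800 * n / SR)
         for n in range(640)]
f0 = estimate_f0(frame, SR)
```

Restricting the lag search to the expected F0 range is what keeps the estimator from locking onto a multiple of the true period; the same code applied to adult speech would need the 85–200 Hz range instead.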

Our previous work [15] has shown that combining weighted prosodic features with MFCC features helps improve the classification accuracy of a deep learning model. Other researchers have also found that F0 is critical in identifying infant cry signals [40]. Chittora and Patil used F0 to calculate the ratio of unvoiced segments and found that the unvoiced percentage in a cry is an important parameter for the analysis of infant cry [19]. Orlandi et al. used the mean, median, standard deviation, minimum, and maximum of F0 and F1–F3 to exploit differences between full-term and preterm infant cry [21]. In 2017, Torres et al. used three hand-crafted features (voiced/unvoiced counter, consecutive F0, and harmonic ratio accumulation) and showed detection performance comparable to standard MFCCs at 20 times lower computational cost and with no additional memory cost [27].

3.2.3 Image domain features

A spectrogram is an image that gives a time-frequency representation of an audio clip. It is known that the spectrogram has a strong ability to represent the signal and includes both acoustic and prosodic information. A spectrogram can be extracted through framing, FFT, and calculating the log of the filtered spectrum, as illustrated in Fig. 5. Feeding spectrograms into classifiers can solve the problem of different cry signals having different durations. Instead of using zero padding to obtain feature vectors of the same length, normalization is applied in the process of spectrogram generation, which produces images of the same size without changing the original signal. Besides feeding the spectrogram into a CNN [9, 35, 48, 50] or a capsule neural network [41], researchers take the extra step of using the spectrogram image to derive additional features such as Local Binary Pattern (LBP), Local Phase Quantization (LPQ), and Robust Local Binary Pattern (RLBP) [43] to help improve the classification performance.
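The framing, transform, and log steps can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a Hann window and a naive DFT in place of an FFT so that it stays dependency-free, and it omits the filter-bank and image-size normalization steps mentioned above. Frame length, hop, and the 1 kHz test tone are arbitrary choices.

```python
import cmath
import math

def log_spectrogram(signal, frame_len=64, hop=32, eps=1e-10):
    """Return one row per frame holding log-power values for bins 0..frame_len/2."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]                      # Hann window
    spec = []
    for start in range(0, len(signal) - frame_len + 1, hop):  # framing
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):                   # naive DFT, bin by bin
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            row.append(math.log(abs(coeff) ** 2 + eps))       # log power
        spec.append(row)
    return spec

# 1 kHz tone at 8 kHz: with 64-point frames the energy should sit in bin 8.
SR = 8000
tone = [math.sin(2 * math.pi * 1000 * n / SR) for n in range(512)]
spec = log_spectrogram(tone)
peak_bins = [row.index(max(row)) for row in spec]
```

Each row of `spec` is one time slice of the image; stacking the rows side by side gives the time-frequency picture that image-based classifiers consume.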

Fig. 5 The flowchart of spectrogram generation

A waveform image represents the pattern of sound pressure amplitude in the time domain. It has also been used with deep learning models such as AlexNet to achieve above 90% accuracy in identifying asphyxia cries [28, 30]. In our previous work, we used Praat to generate images containing the prosodic feature lines, including F0, intensity, and formants. The CNN model trained on prosodic feature images is good at identifying certain types of cry signals. Combining it with the spectrogram CNN and the waveform CNN produces 5% better accuracy on the Baby Chillanto database and 4% on the Dunstan Baby Language database [69].

3.2.4 Other relevant domain features

Other domain features used in infant cry research include time domain features such as zero-crossing rate, short-time energy, and voiced-unvoiced regions, etc. Zero-crossing rate is the rate at which the signal passes zeros and changes signs. It can be used in conjunction with short-time energy to detect endpoints of speech utterances, hence to detect the existence of the cry sound from other sounds happening in the environment [17, 67]. Since the amplitude of an audio signal varies with time, the short-time energy can serve to differentiate voiced and unvoiced segments. It is used in [20, 57, 70] for infant cry detection and classification. Torres et al. used voiced-unvoiced counter, which counts all frames having a significant periodic content, as one of the features for cry detection [27]. Linear Predictive Coding (LPC) serves as a time domain measure of how close two different waveforms are and it is used for infant cry classification in [13, 49, 71].
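Both time domain features are inexpensive to compute, as the following sketch shows. The sign convention (zero counted as non-negative), the frame length, and the two synthetic tones are assumptions made for the example.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)

# Two synthetic 50 ms frames at 8 kHz; a small phase offset avoids exact zeros.
SR = 8000
low = [math.sin(2 * math.pi * 200 * n / SR + 0.1) for n in range(400)]
high = [0.1 * math.sin(2 * math.pi * 2000 * n / SR + 0.1) for n in range(400)]

zcr_low, zcr_high = zero_crossing_rate(low), zero_crossing_rate(high)
energy_low, energy_high = short_time_energy(low), short_time_energy(high)
```

The loud low-frequency frame has high energy and a low zero-crossing rate, while the quiet high-frequency frame shows the opposite pattern, which is why the two features are often used together for endpoint detection.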

Wavelet Transform is a method that converts the audio signal into the time-frequency domain. The wavelet packet transform was used in asphyxia classification research and reached a high accuracy of 99% with neural network models [72]. It also performs well in infant cry reason classification. The Discrete Wavelet Transform MFCC (DWT-MFCC) features work well with SVM and neural network architectures [31, 33, 51, 73].

Researchers also calculate the statistical natural parameters of the data such as mean frequency, standard deviation, and third quartile range, etc. to help infant cry detection and classification [39]. Feature extraction is a critical step in audio processing. Besides aforementioned Praat software, feature extraction tools such as LibROSA library [74] and OpenSMILE toolkit [75] have made audio feature extraction easier.

3.3 Feature selection

Feature selection is the process of selecting a subset of the original features extracted from the audio signals by the feature extraction techniques. The objective is to reduce the dimensionality of the features without reducing classification accuracy. Fewer features require fewer computational resources, and hence make building smart infant cry detection and classification devices possible and affordable in the future. The original features may also contain redundant information that hinders effective differentiation of the different types of cry signals. Selecting the right features to fit the specific needs of the task may also improve the classification accuracy. This section reviews some feature selection methods applied to infant cry research. The F-ratio method was used to select the top 20 MFCC features; coefficients of greater importance have higher F-ratio scores [63]. In 2013, Yamamoto et al. used Principal Component Analysis (PCA) to reduce the dimensionality of FFT features [23]. The Forward variable Selection Method (FSM) was applied to infant cry classification by Wang in 2010 [55], and Okada et al. proposed Iterative FSM (IFSM) based on the cross-validation concept in 2011 [54]. Later, Binary Particle Swarm Optimization (BPSO) was used to remove the redundant features and keep the significant features from MFCC coefficients in [61, 62]. Orlandi et al. used software called Biovoice to extract 22 features from the cry signal and then used a genetic algorithm-based search method to select the best features to feed to the classifiers [21]. In 2016, Wahid et al. compared five feature selection methods: OneR, ReliefF, Fast Correlation-Based Filter (FCBF), Consistency-Based Subset Evaluation (CNS), and Correlation-Based Feature Selection (CFS). The results showed that the feature selection techniques were able to greatly reduce the feature space and hence the computational time. Most selection techniques can also improve the performance of the neural network classifier [76].
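As an illustration of F-ratio ranking, the sketch below scores each feature by the variance of the class means divided by the mean within-class variance and keeps the top k. Definitions of the F-ratio vary in the literature, so this form is an assumption rather than the exact criterion of [63]; the toy feature table and labels are invented for the example.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def f_ratio(samples, labels, feature):
    """Variance of the class means divided by the mean within-class variance."""
    by_class = {}
    for row, label in zip(samples, labels):
        by_class.setdefault(label, []).append(row[feature])
    class_means = [mean(v) for v in by_class.values()]
    within = mean([variance(v) for v in by_class.values()])
    return variance(class_means) / (within + 1e-12)   # guard against zero variance

def select_top_k(samples, labels, k):
    """Return the indices of the k highest-scoring features and all scores."""
    scores = [f_ratio(samples, labels, j) for j in range(len(samples[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores

# Invented toy data: feature 0 separates the classes well, feature 1 not at all.
samples = [
    [1.0, 5.0, 2.0], [1.2, 3.0, 2.5], [0.8, 4.0, 1.5], [1.0, 6.0, 2.0],
    [3.0, 4.0, 2.5], [3.2, 5.0, 3.0], [2.8, 3.0, 2.0], [3.0, 6.0, 2.5],
]
labels = ["normal"] * 4 + ["asphyxia"] * 4

selected, scores = select_top_k(samples, labels, k=2)
```

Because each feature is scored independently, this kind of filter method cannot detect redundancy between features, which is one motivation for the search-based approaches such as BPSO and genetic algorithms mentioned above.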

In 2019, Tuduce et al. utilized three Best Feature Selection (BFS) approaches to exclude irrelevant features and redundant features and tested them with 35 classifiers. The feature set is reduced from over 6000 features to 500 and the result shows that BFS can improve the classification accuracy for some classifiers [45]. Feature selection techniques remove the features irrelevant to the specific task, so it can reduce the feature space, save computational time, and improve classification accuracy.

4 Infant cry classification

With data cleaned and segmented and features extracted, selected, and normalized, finding the appropriate classifier is the most important stage in the machine learning process. In this section, we review some popular machine learning methods and applications used in infant cry classification in the past decade.

4.1 Infant cry classification models

4.1.1 Traditional machine learning classifiers

  A. Support Vector Machine. The most popular traditional classifier used in infant cry classification is the Support Vector Machine (SVM) [26, 40, 43]. The variants used include multi-class SVMs [25] and binary SVMs with linear and RBF kernels [31]. The features fed into the SVM include temporal, prosodic, and cepstral features. In 2017, Onu et al. compared SVM with other non-linear classifiers such as neural networks on asphyxia classification and concluded that SVMs are designed to work effectively with limited examples and high-dimensional data [29]. In 2015, Chang et al. used an incremental SVM learning model, which keeps adding new data to the dataset at each training step, and obtained more than 18% better accuracy than the original SVM model on infant cry classification based on FFT features [77].

  B. K-Nearest Neighbor. KNN is a well-known pattern recognition method used in classification. A test sample is assigned to the class that is most common among its k nearest neighbors in the feature space; when k = 1, the sample simply takes the class of its single nearest neighbor. In infant cry classification, researchers have used Euclidean distance, Minkowski distance, and other measures to compute the distance between two feature vectors. The feature vectors are usually MFCC and LFCC [20, 22, 37, 64]. Cohen and Lavner used a KNN algorithm in which each frame is classified as either cry or not cry, and the sample is then classified as a cry signal if more of its frames are identified as cry [57].

  C. Gaussian mixture model. GMM is a probabilistic model that assumes the data points are generated from a mixture of Gaussian distributions, each with its own mean and variance. The idea is to learn the parameters that model the provided training data as a mixture of several Gaussian distributions. The test data can then be classified by the trained model. The Expectation Maximization (EM) algorithm is used to find the maximum likelihood estimates of the parameters of GMM-based structures [24, 39]. In 2016, Banica et al. used the GMM-UBM method to classify Dunstan baby cries. The universal background model (UBM) is a GMM that is trained on a large amount of general cry signals with no specific labels. The classification accuracy of GMM-UBM with MFCC reached 70% on Dunstan baby cries [38] and 50.6% on the SPLANN database [47]. GMM-UBM was also used by Alaie et al. to classify healthy and pathological cries. The proposed Boosting Mixture Learning adaptation method outperforms the MAP algorithm [78]. In 2019, Sharma et al. compared GMM clustering with hierarchical clustering and K-means clustering on cry features and showed that, on a certain database, the GMM model produces the best result with the fewest overlapping data points [39]. It has been shown that GMM-based classifiers are sensitive to the environment and do not lead to satisfactory results, especially with limited training data.

  D. Fuzzy classifier. Fuzzy logic systems have been used in many applications such as transmission systems, power systems, and wireless network routing [79]. They are also used in infant cry classification. Selected features are converted into fuzzy values in the fuzzification step, using chosen fuzzy membership functions, and fuzzy rules are defined. In [66], Kia et al. used fuzzy classification to distinguish infant cry signals from laughter signals. In [71], Rosales-Pérez et al. used a fuzzy decision tree, fuzzy decision forest, fuzzy KNN, and fuzzy relational neural network classifier for pathological cry classification. A type-2 fuzzy pattern matching algorithm is used in [80] to classify asphyxia, normal, and hyperbilirubinemia cries. It also outperforms SVM and logistic regression classifiers in classifying hunger and pain [81].

  E. Logistic regression classifier. The logistic regression classifier is a low-complexity supervised algorithm, and it is usually used as a baseline in infant cry research. Lavner et al. used it to show that CNN performs better on cry detection [17], and Orlandi et al. compared it with many other classifiers, among which random forest performed best at classifying full-term and preterm infant cries [21].

  F. K-means clustering. K-means clustering is an unsupervised algorithm mainly used for clustering. Unlabeled data points are gradually separated into groups by repeatedly assigning each point to the nearest centroid and moving each centroid to the mean of its group. Sharma et al. used K-means clustering to show that the GMM model performs better at differentiating different types of cry [39]. In [22], K-means clustering was used to build a speaker database for speaker recognition.

  G. Bagging, boosted trees, and random forest. Bagging, boosted trees, and random forest are ensemble techniques built on decision trees. They all combine multiple decision trees to produce better performance. Experiments have shown that they are powerful for infant cry classification. Osmani et al. showed that bagging and boosted trees outperform SVM [67]. Milano et al. compared random forest with MLP, SVM, Reservoir Network, GMM, and HMM models and showed that the random forest classifier is second only to the Reservoir Network [82]. In [21, 45, 83, 84], an open source data mining software package named Waikato Environment for Knowledge Analysis (WEKA) is used. Among over 100 classification algorithms implemented in WEKA, random forest outperforms SVM, MLP, logistic regression, BayesNet, and others. Tuduce et al. tested 40 classifiers in WEKA, and the tree classifiers showed the best overall performance compared with Bayes classifiers, lazy classifiers, function classifiers, and rule classifiers [83].
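To make the K-Nearest Neighbor item above more concrete, the following sketch combines a k-nearest-neighbor vote per frame with a clip-level vote over frames. Reading the clip-level rule of [57] as a simple majority vote is an assumption, as are the Euclidean distance, k = 3, and the invented two-dimensional feature vectors.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    """Label `query` with the majority class among its k nearest training vectors."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def classify_clip(train_x, train_y, frames, k=3):
    """Frame-level KNN decisions followed by a clip-level majority vote."""
    votes = Counter(knn_predict(train_x, train_y, f, k) for f in frames)
    return votes.most_common(1)[0][0]

# Invented 2-D feature vectors standing in for per-frame cepstral features.
train_x = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
           [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
train_y = ["cry", "cry", "cry", "other", "other", "other"]

clip_frames = [[1.1, 1.0], [0.8, 1.2], [4.8, 5.0]]   # two cry-like frames, one not
frame_labels = [knn_predict(train_x, train_y, f) for f in clip_frames]
clip_label = classify_clip(train_x, train_y, clip_frames)
```

Voting over frames makes the clip decision tolerant of a few misclassified frames, at the cost of needing a rule for ties and for very short clips.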

4.1.2 Neural network-based models

Artificial Neural Network (ANN) is a machine learning method. In 1995, Petroni et al. made the first attempt to apply ANNs to infant cry classification [85].

  A. Feed Forward Neural Network (FFNN) is the simplest neural network, and Multi-Layer Perceptron (MLP) is a type of FFNN that contains at least three layers. The experiments in [37] and [13] both showed that the FFNN did not perform as well as the nearest neighbor classifier on MFCC features. MLP was used in [52, 61–63] with MFCC for identifying pathological cries. To classify asphyxia, Hariharan et al. used Probabilistic Neural Network (PNN), General Regression Neural Network (GRNN), and Time-Delay Neural Network (TDNN) and achieved above 97% accuracy [33, 86].

  2. B

    Convolutional Neural Network (CNN) is a deep learning algorithm that has been used successfully in computer vision, language processing, and other domains, achieving unprecedented accuracy. Multi-channel CNN, which accepts multi-channel input, was applied in [11]. Manikanta et al. used a 1D CNN on MFCC features for cry detection, and it outperformed a feed forward neural network and an SVM classifier [25]. In 2019, Le et al. applied transfer learning with CNN to spectrograms from the Baby Chillanto database and achieved promising results [35].

  3. C

    Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) whose internal states allow it to accept sequences of data; it is widely regarded as well suited to sequential tasks such as language translation and speech recognition. Mark Huckvale fed low-level temporal signal features into a bidirectional LSTM model and later combined it with two other dense-layer neural networks [40]. Both the LSTM alone and the combined network outperformed the baseline SVM model.

  4. D

    CNN-RNN is a deep learning architecture that combines CNN with RNN. It has shown its power in sound detection and classification. In the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 competition, Lim et al. used it to win first place in detecting target sounds (baby crying, glass breaking, gunshot) against mixed noisy backgrounds [87]. In 2019, Maghfira et al. used CNN-RNN to classify the five types of cries and reached a highest accuracy of 94.97% on the Dunstan Baby Language database [36].

  5. E

    Neuro-fuzzy Network combines fuzzy logic with neural networks, and researchers have used it successfully in infant cry classification. In 2009, Santiago-Sánchez et al. used type-2 fuzzy sets to classify asphyxia and hyperbilirubinemia [80]. In 2012, Molaeezadeh et al. proposed a type-2 fuzzy pattern matching classifier that outperformed SVM and logistic regression classifiers in classifying hunger and pain [81]. In recent years, it has been noted that combining fuzzy systems with neural networks can unite the advantages and avoid the disadvantages of both methods: fuzzy systems require rules, while neural networks learn directly from data. A neuro-fuzzy approach was used to classify the cry types of the Dunstan baby language. Neural networks were trained, and Mamdani fuzzy logic was applied after data normalization to create a new “transformed dataset,” which was used in the final KNN classification step [88]. The classification accuracy reached 86.25%, better than the plain neural network model, SVM, and GMM methods.

  6. F

    Capsule Network (CapsNet) [89] is a deep learning topology that adds structures called capsules to the CNN model. Because max pooling in a CNN keeps only the maximum value within a region and discards positional information, CapsNet instead lets higher-level capsules cover larger regions of the image and performs routing by agreement. CapsNet was applied to classify infants’ emotional cries in domestic environments, and accuracy improved by more than 10% over the CNN model with spectrograms [41].

  7. G

    Reservoir Network (RN) is a neural network model derived from RNN. Its input nodes connect to a non-trainable reservoir, which contains connected non-linear units with randomly generated fixed weights. Ntalampiras used RN for multi-class infant cry classification [82] with fused feature sets and showed that the RN model outperformed MLP, SVM, random forest, GMM clustering, and others.
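The remark in the Capsule Network item that max pooling discards positional information can be made concrete with a small worked example. This is only a sketch of the pooling step, not of capsule routing; the function name `max_pool_1d` and the two toy input rows are hypothetical.

```python
def max_pool_1d(values, window):
    """Non-overlapping max pooling: keep only the largest value per window."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


# Two different activation rows: the peaks sit at different positions.
row_a = [0, 9, 0, 0, 0, 0, 0, 5]
row_b = [0, 0, 0, 9, 5, 0, 0, 0]

pooled_a = max_pool_1d(row_a, window=4)
pooled_b = max_pool_1d(row_b, window=4)

# Both rows pool to the same output, so where each peak occurred
# inside its window can no longer be recovered.
print(pooled_a, pooled_b, pooled_a == pooled_b)
```

After pooling, the two inputs are indistinguishable, which is the loss that capsule-based architectures aim to avoid.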

Many machine learning methods have been tried in infant cry research. Each has advantages and disadvantages, and no algorithm is perfect for every dataset and task. Selecting a suitable model to achieve high performance is challenging. To determine the classification ability of different models, Fuhr et al. differentiated healthy infant cries from the cries of infants suffering from several diseases using 12 classifiers, including SVM, decision tree, KNN, and MLP. The results showed that only the C5 decision tree and KNN achieved greater than 90% accuracy [90]. Applying many algorithms to a task before selecting one is impractical. We therefore compare the machine learning algorithms used in infant cry research from the following aspects, so that readers can choose an appropriate algorithm for their datasets and tasks.

  • Time complexity. This covers training time and classification time, which depend on the data size, the search space, and the complexity of the coefficients. In general, traditional methods such as SVM, K-means clustering, and GMM-based approaches are relatively simple and straightforward, and, unlike neural network methods, they can work with smaller sample sizes. Hence, their training, search, and classification times are much shorter than those of neural network methods. Fine-tuning neural network models also requires more development time.

  • Sample complexity. This indicates whether the model requires a large amount of data to learn. It depends on the complexity of the data and of the algorithm. To reach better performance, neural network methods generally require larger sample sizes than traditional algorithms because their search space is more complex. Larger infant cry databases are needed for deep neural networks.

  • Parametricity. This indicates whether the number of parameters in the model is fixed or grows as new data is added. Linear regression, GMM, and neural networks are parametric methods, while KNN and SVM are nonparametric models.

  • Feature complexity. Features extracted from the time domain or the frequency domain are equally able to represent the characteristics of cry signals across different models, so there is no difference in feature complexity between traditional and neural network-based models. However, using too many features to represent one sample may cause overfitting; selecting the most appropriate features for a specific model is therefore critical.

  • Parallelizability. Parallelizability is pivotal for reducing the training time of machine learning methods. The large amount of data used by neural networks carries a high computational cost in both time and space. Parallel computing on Graphics Processing Units (GPUs) greatly reduces training time and has made deep learning possible. Other methods such as KNN are easy to parallelize, but parallelism is difficult when each step depends on the result of the previous step, as in decision trees.
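The parametricity distinction in the list above can be sketched with two toy classifiers. Both classes are hypothetical illustrations written for this example: `NearestMeanClassifier` is a simplified stand-in for a parametric model (one stored mean per class), and `OneNearestNeighbour` stands in for KNN. The scalar feature values are made up; the labels "hunger" and "pain" are used only as class names.

```python
class NearestMeanClassifier:
    """Parametric stand-in: stores one mean per class, whatever the data size."""

    def fit(self, xs, ys):
        sums, counts = {}, {}
        for x, y in zip(xs, ys):
            sums[y] = sums.get(y, 0.0) + x
            counts[y] = counts.get(y, 0) + 1
        self.means = {y: sums[y] / counts[y] for y in sums}
        return self

    def predict(self, x):
        return min(self.means, key=lambda y: abs(x - self.means[y]))

    def stored_values(self):
        return len(self.means)


class OneNearestNeighbour:
    """Nonparametric: keeps every training sample, so it grows with the data."""

    def fit(self, xs, ys):
        self.samples = list(zip(xs, ys))
        return self

    def predict(self, x):
        return min(self.samples, key=lambda s: abs(x - s[0]))[1]

    def stored_values(self):
        return len(self.samples)


xs = [1.0, 2.0, 8.0, 9.0]
ys = ["hunger", "hunger", "pain", "pain"]

for model in (NearestMeanClassifier(), OneNearestNeighbour()):
    small = model.fit(xs, ys).stored_values()
    large = model.fit(xs * 2, ys * 2).stored_values()  # twice as much data
    print(type(model).__name__, small, large, model.predict(7.0))
```

Doubling the training set leaves the nearest-mean model at two stored values, while the nearest-neighbour model doubles in size, which is also why its classification time grows with the dataset.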

With current powerful computing environments, the methods used in infant cry research can achieve real-time prediction. Because current infant cry databases contain limited samples, training and testing time have not been highlighted as an issue. No very deep models trained on big data have been used in this research yet. At present, the largest dataset has fewer than 20,000 samples. Small and imbalanced datasets lead to high classification accuracy but low confidence for some tasks. Achieving high performance with high confidence will require exploring genuinely large datasets with genuinely deep models.
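One way an imbalanced dataset can produce a high accuracy that deserves little confidence is shown in the worked example below. The class counts (95 normal, 5 pathological) and the trivial always-"normal" predictor are invented for illustration and do not come from any reviewed dataset; `balanced_accuracy` here is simply the mean of the per-class recalls.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in set(y_true):
        indices = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in indices)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced test set: 95 normal cries, 5 pathological cries.
y_true = ["normal"] * 95 + ["pathological"] * 5
# A useless classifier that never predicts the minority class.
y_pred = ["normal"] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The plain accuracy looks strong even though no pathological cry is ever detected, so reporting per-class measures alongside accuracy gives a more trustworthy picture on small, skewed datasets.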

4.2 Infant cry applications

Researchers use different classifiers to perform infant cry processing tasks. In the past decade, most research has continued to focus on improving the classification accuracy of infant cry signals, including differentiating pathological cries from normal cries and understanding the meaning behind cry signals. In this section, we review the significant works on infant cry classification and detection.

4.2.1 Infant cry reason classification

In the early years of infant cry research, more work was devoted to automatically differentiating the cries of healthy infants from pathological cries. In recent years, exploring the meaning of cries has attracted more research interest. As Table 2 shows, some significant work has been done on this topic. Notably, researchers use different datasets, most of which are self-recorded. Because the datasets differ, it is unfair to compare the performance of the proposed methods directly, even when the classification types are the same. Infant cry reason classification remains challenging due to the lack of standard public datasets, and classification accuracy is still relatively low.

Table 2 Significant works on infant cry reason classification

4.2.2 Infant pathological cry classification

Infant cry signals have been used to identify many diseases such as asphyxia, hypo-acoustic (hearing disorder), hypothyroidism, hyperbilirubinemia, cleft palate, respiratory distress syndrome, and ankyloglossia with deviation of the epiglottis and larynx. Readers can find the related works on pathological cry classification before 2011 in [3]. In the past decade, researchers have continued to apply novel methods to classify normal and pathological cries. Asphyxia is the most frequently studied disease. Table 3 shows the latest works on classifying normal cry from asphyxia cry. Researchers have been using the Baby Chillanto database to perform the binary classification. In 2012, Probabilistic Neural Network (PNN) and General Regression Neural Network (GRNN) reached 99% accuracy [34, 86], the latest SVM model reached 97.7% accuracy [31], and the deep learning FFNN model reached 96.74% accuracy [15].

Table 3 Classification of asphyxia cry from other cries

Besides asphyxia, other types of diseases have also been studied. Esposito’s review [94] shows that infants’ cry signals are useful for early diagnosis of autism spectrum disorder (ASD). In 2012, Orlandi et al. analyzed the cry signals of high-risk infants whose siblings had already been diagnosed with ASD. They observed that high-risk infants produce fewer cry episodes, have lower F0, and reach higher formant values than healthy infants [95]. Although some babies are born with ASD, it is usually diagnosed when they are 2 to 3 years old, since the diagnosis involves observing the behavior of children. This makes it difficult to acquire cry signals from infants with autism. In 2019, Wu et al. recorded twenty audio samples of autistic children between 2 and 3 years old. They reached 96% accuracy by using an SVM classifier with MFCC features [51]. Identifying hypo-acoustic cry signals was already successful in the early years. In 2011, Hariharan’s General Regression Neural Network reached 99% on the Baby Chillanto database [96], and in 2009, O.F. Reyes-Galaviz et al. used an evolutionary neural network system to reach almost 100% on the Mexican-Cuba database [8]. Then in 2014, Rosales-Pérez et al. used a fuzzy model and a genetic algorithm to reach 99.42% on the Baby Chillanto database [97]. Other types of diseases such as hypothyroidism, respiratory distress syndrome, cleft palate, and ankyloglossia with deviation of the epiglottis and larynx (ADEL) were studied in the early years and were reviewed in [3]. In 2014, Feier et al. studied newborns’ cries within minutes after birth. Random tree and random forest methods were able to distinguish the cries of healthy newborns from those of premature newborns, newborns with umbilical cord strangulation during birth, and newborns with other pathologies with accuracy above 95% [98].

4.2.3 Infant cry detection

Infant cry detection is treated as a binary classification with cry and not-cry categories. It is another attractive research topic of the past decade. The goal is to detect the infant cry signal efficiently and accurately in various environments, such as cars, homes, and hospitals, while other sounds occur at the same time. Data is recorded over a long period in a given environment such as a home or hospital. The detection algorithm needs to detect the cry sound despite the background sounds in the environment. Researchers have also proposed different methods to build smart cradles, which can detect infant cries and alert the parents while they are away [99–102]. The proposed methods not only target higher detection accuracy but also consider the price of the baby monitoring system, to make it affordable for low-income families.
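The framing of detection as a per-segment cry versus not-cry decision over a long recording can be sketched as follows. This is a deliberately naive baseline and not a method from any reviewed work: the energy threshold, the frame length, the synthetic sine-wave "recording", and the function names `frame_energies` and `detect_cry_frames` are all assumptions made for illustration.

```python
import math


def frame_energies(signal, frame_len):
    """Split a signal into non-overlapping frames; return mean energy per frame."""
    return [
        sum(s * s for s in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def detect_cry_frames(signal, frame_len, threshold):
    """Label each frame True (cry) or False (not-cry) by thresholding its energy."""
    return [energy > threshold for energy in frame_energies(signal, frame_len)]


# Synthetic stand-in for a recording: one second of quiet background
# followed by one second of a louder tone playing the role of a cry.
rate = 8000
quiet = [0.01 * math.sin(2 * math.pi * 100 * n / rate) for n in range(rate)]
loud = [0.5 * math.sin(2 * math.pi * 450 * n / rate) for n in range(rate)]

labels = detect_cry_frames(quiet + loud, frame_len=800, threshold=0.01)
print(labels)
```

A threshold on energy alone would also fire on any other loud sound, which is why the detectors discussed in this section rely on trained classifiers over richer features rather than on loudness.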

Table 4 shows some recent significant works on infant cry detection. Neural network-based approaches reach good performance under clean and constrained conditions. In noisy environments with limited training data, however, classifiers are sensitive near the decision boundary and easily confuse cry signals with overlapping noise.

Table 4 Significant works on infant cry detection

5 Challenges and future directions

Despite improvements in computational ability and the use of deep learning approaches, the following challenges remain in infant cry research.

  • Lack of existing data and scalability of research. Studies are based on different datasets recorded by their authors, so it is difficult to compare the performance of methods evaluated on different datasets. The only database shared by some researchers is the Baby Chillanto database, which has been around for two decades. The Baby Chillanto database contains 2287 samples in total, and the largest private database has fewer than 20,000 samples, which is insufficient for deep learning NN models. Data is the key element of machine learning, especially deep learning. We notice that although some deep learning methods such as CNN and CNN-RNN are used in infant cry research, the model architectures are not deep. The main reason is that deep models overfit the small training datasets, which leads to poor performance. To take advantage of deep learning, large-scale databases with sufficient samples covering the diverse variations in acoustic and prosodic features across babies are needed.

  • Collecting and labeling data is a time-consuming process that requires skilled labor. Most databases used so far are self-recorded by authors and private to certain people or organizations. Although some online resources are available, such as the YouTube videos that Google AudioSet links to, most cry clips have no relevant labels and many recordings are full of background noise. To accelerate progress in building automatic infant cry classifiers, smart cradle systems, and eventually robotic babysitters and caregivers, efforts to publish comprehensive, well-structured, and labeled databases are urgently needed. In addition, databases that follow specific babies and track their cries at different ages are needed. This type of database is essential for studying how the characteristics of infant cry change with body development. Setting up recording devices on infants’ cradles and having caregivers record real-time cry signals with cell phones are the main methods used by data collectors. Baby cry translator mobile applications such as ChatterBaby [44] help predict infant cry reasons and make data collection easier. The development of infant cry research would benefit if some newly collected datasets were made public.

  • Poor connection between medical professionals and researchers limits interdisciplinary progress. Research has shown that classifying infant cry signals is non-invasive and can be very helpful in the early diagnosis of diseases such as asphyxia, autism, cleft palate, and hypothyroidism. However, most research on pathological infant cries was performed before 2010, and the datasets were very small. The difficulty of data acquisition may be the biggest obstacle in this research area. The ethical and legal issues involved in data collection also hinder the development of infant cry research. Cooperation between medical professionals and computer scientists may open up opportunities in this life-saving research topic.

We are currently building a large infant cry database consisting of cries of infants from 0 to 9 months old. The cry clips are recorded and labeled by parents at home and by doctors and nurses in hospitals using cell phones. The database is currently in the data acquisition stage and is expected to contain over 30,000 samples totaling 50 h of recording, which fits the needs of deep neural networks. We are also applying Graph Neural Network (GNN) to infant cry classification. GNN has been used across various domains, and a graph can represent non-Euclidean data with complex relationships between objects. Combined with deep learning, which has proved successful on Euclidean data, a GNN deep learning model should be able to take advantage of more features and have more discriminating ability for infant cry classification tasks. In addition, new deep learning architectures embedded with prior knowledge can also be explored. With more databases available in the future, we believe that more machine learning methods can be explored in this area. Combining new audio signal processing methods with novel machine learning methods will lead this research to a remarkable future, one that will change people’s lives by providing affordable automatic infant care-giving.

6 Conclusion

In this paper, we describe the significant research work in infant cry analysis and classification, providing details and resources that are helpful for both researchers and medical professionals who work in this area. We show that limited database resources hinder the development of infant cry research. Large databases with diverse samples that fit the needs of deep neural networks are urgently needed. The current trend in feature extraction is to generate a mixed feature set that takes advantage of different domains to achieve better discriminating ability. The relevant research results show promising improvement with combined features. In addition, new neural network-based architectures are becoming the mainstream methods and show better robustness and performance than traditional machine learning approaches. In the future, we are interested in creating a large database, extracting more robust features, combining features in good proportions, and establishing novel neural network architectures that use prior knowledge as well as other spatial information from interdisciplinary areas.