1 Introduction

Speech production can be described as the result of a coordinated chain process, in which the larynx, the place of phonation, acts as the source of the aerodynamic energy of speech, while the articulatory system determines the properties that shape the resulting sounds [1].

In children, these properties usually exhibit radical variability, broadly categorized as acoustic and articulatory variability. As a result, ASR accuracy for children is significantly poorer than for adults. With these challenges in mind, Mel frequency cepstral coefficients (MFCC), their derivatives, the zero crossing rate (ZCR), and spectrogram formants are identified as interesting features for constructing the input vector of the auto-encoder network. In this work, the network is trained and tested on utterances of 10 Malayalam monophthong vowels by children in the age group of five to ten.

Automatic speech processing technology automates the natural speech chain, which consists of the speech production mechanism, the transmission process, and the speech perception process that occurs in the ear and brain of the listener. Networks in ASR are widely used to study the speech signal through its relevant acoustic information and statistical representation. In the 1980s, speech recognition approaches moved from a straightforward template-based paradigm to a more rigorous statistical framework. In the mid-1980s, the hidden Markov model (HMM) became the popular and leading framework for ASR. Since 1988, both theoretical and experimental work has been conducted to analyze the feasibility of artificial neural networks (ANNs) in statistical speech recognition. Hinton et al. proposed an advanced scheme with a fast, greedy, unsupervised, layer-wise pre-training algorithm for the deep belief network (DBN), in which each layer is modeled by a restricted Boltzmann machine (RBM) [2]. Later experiments revealed that auto-encoders [3, 4] or conventional neural networks [5] organized in a similar scheme are suitable for modeling efficient deep neural network (DNN) frameworks.

The algorithm employed for training and testing plays a major role in network performance. A machine learning algorithm can be described as an optimization algorithm that minimizes a global error function. The conjugate gradient method, a line search method, was introduced by Hestenes and Stiefel in the early 1950s as one of the most prominent iterative methods for solving linear systems [6]. In the 1960s, Fletcher and Reeves extended this linear method to the nonlinear conjugate gradient method. Even though the extended algorithm is much faster than steepest descent, its computational cost per iteration is relatively high, since a line search is required to determine the appropriate step size in each iteration. Møller replaced the line search with a trust region approach in the style of the Levenberg–Marquardt algorithm and introduced a sub-class of the conjugate gradient method called scaled conjugate gradient (SCG) [7, 8]. SCG reduces the computational complexity of the conjugate gradient method. In this work, the SCG method is used to train the network.

Considering the complexity of children's speech recognition, this work proposes a novel approach that couples customized unsupervised pre-training with subsequent supervised training. Two different auto-encoders are implemented for unsupervised pre-training, each reproducing its input at the output layer. A trial-and-error approach is employed to finalize the regularization parameters. As identifying interesting features of children's speech is much more complex than for adults, the second auto-encoder is used to extract bottleneck features. In this work, MFCC features combined with their derivatives, ZCR, and spectrogram formants showed the best performance. The bottleneck (interesting) features identified by the auto-encoders are used for further network training. After pre-training, a softmax layer is trained in a supervised manner with a sparsely labeled dataset. To improve performance, supervised fine-tuning is applied to the trained network. These training layers are stacked together to form the complete network architecture. The designed network achieved an average training accuracy of 97% and a test accuracy of 89.5%.

1.1 Malayalam vowel classification

Vowel sounds are voiced phonemes with the greatest intensity, and the length of a Malayalam vowel usually lies between 40 and 450 ms. Different vowel qualities are produced primarily by altering the position of the tongue (front-to-back and up-and-down) and the configuration of the lips (neutral, spread, or rounded). Malayalam vowels are classified primarily by tongue backness and height. Ten Malayalam monophthong vowels are considered in this study; their classification is listed in Table 1.

Table 1 Malayalam monophthongs classification

2 Literature review

Speech recognition systems for children are much more complicated than those for adults. Russell and D'Arcy conducted a study on the parameters that differentiate children's speech from adults' [9]. Vocal tract length variability and developing articulation make ASR complicated for children. The study concluded that adult speech recognition systems are not adequate for children. Orozco et al. designed an automatic speech recognition system with scaled conjugate gradient (SCG) training to classify infant cries into two classes, normal and pathological [10]. The linear predictive coding (LPC) method was used to extract features from the infant cry, and a neural network with one hidden layer of 15 nodes was used for pattern classification. The system showed an average accuracy of 85%.

Nidhyananthan et al. conducted a study on various feature extraction techniques and modeling methods that might be suitable for speech or speaker recognition with a developing vocal apparatus (children) [11]. The feature extraction techniques considered were Mel frequency cepstral coefficients (MFCC), zero-crossing peak amplitude (ZCPA), and linear predictive coding (LPC), and the modeling methods were the Gaussian mixture model (GMM), hidden Markov model (HMM), generalized fuzzy model (GFM), and artificial neural network (ANN). In this experimental study, speech recognition with the MFCC feature vector acquired the highest accuracy of 85%, the LPC feature vector achieved 82%, and the ZCPA feature vector showed the lowest accuracy of 38%.

Sabu and Rao applied ASR to evaluate the reading skills of children in the age group of ten to fourteen with interactive feedback [12]. MFCC features are used to construct the feature vector, and a DNN is used for bottleneck feature extraction. A GMM-HMM model is used to train the network. The performance showed an average word error rate of 3.44%.

Most researchers have applied ASR technology to children for acoustic speech evaluation. In 1990, the DRA Speech Research Unit recommended the emerging speech recognition technology for the speech and language development of children [13]. Vachani et al. proposed a deep auto-encoder framework to enhance the feature set extracted from Mel frequency cepstral coefficients (MFCC) and thereby improve ASR performance for individuals affected by dysarthria [14]. A classification model combining a deep neural network (DNN) and a hidden Markov model (HMM) was used for automatic speech recognition of dysarthric speakers. The work achieved an absolute improvement of 16% with an auto-encoder operating on a tempo-adaptation-based representation.

Anand et al. developed a speech recognition application for visually impaired people, with MFCC features and a hidden Markov model (HMM) applied to construct the acoustic model [15]. The system achieved an average accuracy of 75%, which improved to 80% after applying a speaker adaptation technique.

Much research has been conducted, and continues to be conducted, on regional languages; however, very limited work is available on children's speech recognition. To date, no Malayalam dataset for automatic speech recognition of children's speech is available. Therefore, the feature extraction methods and classification network architecture for this work were determined experimentally. The speech recognition works discussed in this literature review serve as references for this work.

3 Methods and implementation

3.1 Data preparation

Data collection and processing is the primary task of any machine learning approach. The training dataset consists of 10 isolated Malayalam monophthong vowels recorded from 150 children (65 boys and 85 girls) in the age group of five to ten. Children are prone to articulation errors and voice opacity; therefore, the raw dataset went through an audible quality test and a preliminary spectrogram study. The raw dataset contained 1500 recordings (150 speakers × 10 vowels), of which only 1350 samples were retained for network training after pruning. A Zoom H4n handy portable recorder was used to record the sounds at 16 bit/44.1 kHz. Since the quality of the training dataset influences the classification performance, the audio editor Audacity was used for speech enhancement, and its spectral noise gate with a 12 dB range was applied to reduce noise to an acceptable level.
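
For illustration, this clean-up stage can be approximated in Python as sketched below. This is not the pipeline used in the paper, which relied on Audacity: the file names are hypothetical, and the `noisereduce` package (which implements spectral gating, the same family of technique as Audacity's noise reduction) merely stands in for the Audacity effect.

```python
# Illustrative approximation of the clean-up step; the original work
# used Audacity. File names are hypothetical placeholders.
import librosa
import noisereduce as nr
import soundfile as sf

# Load one recorded vowel at the original 44.1 kHz sampling rate.
signal, sr = librosa.load("vowel_a_child_001.wav", sr=44100)

# Spectral gating, comparable in spirit to Audacity's noise reduction;
# here the first 0.5 s is assumed to contain background noise only.
noise_clip = signal[: int(0.5 * sr)]
cleaned = nr.reduce_noise(y=signal, sr=sr, y_noise=noise_clip)

sf.write("vowel_a_child_001_clean.wav", cleaned, sr)
```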

3.2 Feature extraction methods

A precise feature vector can strengthen pattern recognition accuracy. Thirteen MFCC features with their corresponding 13 derivatives, one ZCR feature, and three spectrogram formants (F1, F2, and F3) are used to create the feature vector, 30 features in total (1350 samples × 30 features = 40,500 values). Most ASR applications attain reasonable accuracy with MFCC. MFCC captures the acoustic and perceptual parameters of the speech signal and replicates the perception process of the human cochlea [16,17,18]. The MFCC filter bank is very sensitive and filters speech sounds in a manner similar to the cochlea; in this work, 25 filters are applied. To cope with the fluctuating characteristics of speech signals, the amplified signals are enclosed in 30 ms frames with an overlap of 20 ms.
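
A minimal sketch of this MFCC front end, using librosa, is shown below. The frame length (30 ms), overlap (20 ms, i.e. a 10 ms hop), filter count (25), and coefficient count (13) follow the text; how the frame-level coefficients are reduced to one vector per utterance is not stated in the paper, so averaging over frames is shown here as one plausible choice.

```python
# Sketch of the MFCC front end described above; the per-utterance
# averaging at the end is an assumption, not a detail from the paper.
import librosa
import numpy as np

signal, sr = librosa.load("vowel_a_child_001_clean.wav", sr=44100)

frame_len = int(0.030 * sr)   # 30 ms analysis window
hop_len = int(0.010 * sr)     # 20 ms overlap between 30 ms frames

mfcc = librosa.feature.mfcc(
    y=signal, sr=sr, n_mfcc=13, n_mels=25,
    n_fft=frame_len, hop_length=hop_len,
)
mfcc_delta = librosa.feature.delta(mfcc)  # first-order derivatives

# One 26-dimensional vector per utterance: 13 MFCCs + 13 deltas.
mfcc_features = np.concatenate([mfcc.mean(axis=1), mfcc_delta.mean(axis=1)])
```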

The ZCR considers each window and counts the number of times the amplitude of the speech signal crosses zero. The ZCR supports the classification of voiced and unvoiced signals [19,20,21]. All Malayalam vowels are voiced sounds, and voiced signals show a lower crossing rate than unvoiced ones. Vowels are well distinguishable by their spectrogram formants [22,23,24,25]. Based on the vocal tract parameters used for vowel constriction, three formants, F1, F2, and F3, are considered here. Figure 1, a scatter plot, depicts the relationship of each feature variable against the others. The feature selection policy is described in Table 4.
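
The remaining four features can be sketched in the same style. The ZCR computation below follows the standard definition; the formant values F1 to F3 are estimated from LPC roots, a common method, but an assumption here, since the paper does not state how its spectrogram formants were measured.

```python
# ZCR plus a simplified LPC-root formant estimate; the formant method
# is an assumption, not taken from the paper.
import librosa
import numpy as np

signal, sr = librosa.load("vowel_a_child_001_clean.wav", sr=44100)

# Mean zero crossing rate over 30 ms frames with a 10 ms hop.
zcr = librosa.feature.zero_crossing_rate(
    signal, frame_length=int(0.030 * sr), hop_length=int(0.010 * sr)
).mean()

# Downsample to 10 kHz and fit a 12th-order LPC model; polynomial roots
# in the upper half plane correspond to resonance frequencies.
sig10k = librosa.resample(signal, orig_sr=sr, target_sr=10000)
a = librosa.lpc(sig10k, order=12)
roots = [r for r in np.roots(a) if np.imag(r) > 0]
freqs = sorted(np.angle(r) * 10000 / (2 * np.pi) for r in roots)
f1, f2, f3 = freqs[:3]        # crude; real pipelines also filter by bandwidth

# Full 30-dimensional vector: 13 MFCCs + 13 deltas + ZCR + F1, F2, F3.
mfcc_features = np.zeros(26)  # stand-in for the vector from the previous sketch
feature_vector = np.concatenate([mfcc_features, [zcr, f1, f2, f3]])
```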

Fig. 1 Relationship between feature variables

3.3 Neural network design and implementation

The auto-encoder is a pattern learning approach that follows an unsupervised training model. It is best suited to nonlinear and complex problems (audio, image, etc.). Each auto-encoder consists of an encoder and decoder pair: the encoder encodes the input into a hidden layer representation, and the decoder reproduces the input at the output layer [26,27,28]. The sparse auto-encoder allows the framework to learn interesting patterns even when the number of hidden neurons is greater than or equal to the number of inputs (hn >= xn), by imposing additional constraints called sparsity constraints. The proposed architecture is described in Fig. 2.

Fig. 2 Proposed architecture

In this work, the training data consist of unlabeled vowel features {f1, f2, f3, …, fn} as input, and the back-propagation algorithm is applied to regenerate the input as the output (o(i) = f(i)). Training optimizes a cost function that measures the error between the input 'f' and the output 'o'. Appropriate regularizers are added to the cost function to control the activation of the neurons at each layer. The auto-encoder tries to learn a hypothesis function h_{w,b}(f) ≈ f on the input 'f', using the weights 'w' and biases 'b'. The regularizers act as the central controllers of the sparse auto-encoder: the L2 weight regularization parameter controls the effect of the L2 regularizer in each encoder; the sparsity regularizer coefficient controls the impact of the sparsity regularizer applied in the cost function; and the sparsity proportion sets the desired sparsity of the hidden layer output. The linear transfer function 'purelin' is used in the decoder to regenerate the input at the output layer (Tables 2, 3).
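
The parameter names above ('purelin', L2 weight regularization, sparsity regularizer, sparsity proportion) suggest a MATLAB-style auto-encoder. Under that reading, the PyTorch sketch below reproduces the described cost function: mean squared reconstruction error plus an L2 weight penalty plus a KL-divergence sparsity penalty. All coefficient values are placeholders rather than the settings in Tables 2 and 3.

```python
# Sketch of the sparse auto-encoder cost described above; coefficient
# values are placeholders, not the paper's settings.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_in=30, n_hidden=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.decoder = nn.Linear(n_hidden, n_in)  # linear ('purelin') output

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), h

def sparse_ae_loss(x, x_hat, h, model, l2_coef=1e-3, sparsity_coef=1.0, rho=0.05):
    mse = nn.functional.mse_loss(x_hat, x)                # reconstruction error
    # L2 weight regularizer (applied to all parameters here for brevity).
    l2 = sum((p ** 2).sum() for p in model.parameters())
    rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)         # mean hidden activation
    # KL divergence between the desired sparsity proportion rho and rho_hat.
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    return mse + l2_coef * l2 + sparsity_coef * kl
```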

Table 2 Auto-encoder 1 parameters
Table 3 Auto-encoder 2 parameters

Scaled conjugate gradient (SCG): SCG is a learning algorithm for neural networks whose main task is optimization. The algorithm begins with x0 and generates a sequence of iterates x1, x2, x3, …, xk that terminates when either (1) no further progress can be made, or (2) the solution has been approximated with sufficient accuracy. SCG was designed by fusing the conjugate gradient method with a trust region approach (as in the Levenberg–Marquardt algorithm), which significantly reduces the computational complexity per iteration.
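
SciPy does not provide Møller's SCG, but its nonlinear conjugate gradient optimizer illustrates the iterate sequence x0, x1, …, xk described above on a small least-squares problem. The sketch below is therefore an illustration of the conjugate gradient family, not the exact SCG algorithm used in the paper.

```python
# Nonlinear conjugate gradient on a toy least-squares problem, to show
# the iterate sequence; SciPy ships no SCG implementation.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))
b = rng.normal(size=30)

def loss(x):                       # global error function E(x)
    r = A @ x - b
    return 0.5 * r @ r

def grad(x):                       # its gradient, used by CG
    return A.T @ (A @ x - b)

x0 = np.zeros(10)                  # the algorithm begins with x0 ...
result = minimize(loss, x0, jac=grad, method="CG")
print(result.nit, result.fun)      # ... and stops after k iterations
```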

The network design steps applied in this work are as follows (a compact code sketch is given after the list):

  • The first AE is designed with 100 neurons. The 30 feature coefficients, together with the sparsity constraints, are given to the first auto-encoder. The output of the first auto-encoder therefore has dimension 100 × 1 and is given as input to the second auto-encoder (Fig. 3).

  • The second AE is designed with 28 neurons. The 100 × 1 coefficients are thereby compressed to 28 × 1, and this compressed representation is taken as the bottleneck features for softmax layer training (Fig. 4).

  • The softmax layer is trained in a supervised manner with the bottleneck features extracted from the second auto-encoder. A sparse matrix is used for labelling the 10 classes (Fig. 5).

  • The encoders and the softmax layer are stacked together to form a deep network (Fig. 6).

  • The deep network is fine-tuned with the labelled dataset.
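
A compact PyTorch sketch of these five steps follows. It reuses the `SparseAutoencoder` and `sparse_ae_loss` definitions from the earlier sketch, uses random placeholder data in place of the real feature matrix, and substitutes Adam for SCG (PyTorch ships no SCG optimizer), so it shows the structure of the procedure rather than the paper's exact training run.

```python
# Structure of the five design steps; placeholder data, and Adam
# standing in for SCG, which PyTorch does not provide.
import torch
import torch.nn as nn

features = torch.randn(1350, 30)        # placeholder 1350 x 30 feature matrix
labels = torch.randint(0, 10, (1350,))  # placeholder vowel classes

def pretrain(ae, data, epochs=100):
    opt = torch.optim.Adam(ae.parameters())
    for _ in range(epochs):
        x_hat, h = ae(data)
        loss = sparse_ae_loss(data, x_hat, h, ae)  # from the earlier sketch
        opt.zero_grad(); loss.backward(); opt.step()
    return ae(data)[1].detach()          # hidden codes feed the next layer

ae1 = SparseAutoencoder(30, 100)         # step 1: first AE, 100 neurons
codes1 = pretrain(ae1, features)
ae2 = SparseAutoencoder(100, 28)         # step 2: bottleneck AE, 28 neurons
codes2 = pretrain(ae2, codes1)

softmax = nn.Linear(28, 10)              # step 3: softmax layer (10 classes)
opt = torch.optim.Adam(softmax.parameters())
for _ in range(100):
    loss = nn.functional.cross_entropy(softmax(codes2), labels)
    opt.zero_grad(); loss.backward(); opt.step()

deep_net = nn.Sequential(ae1.encoder, ae2.encoder, softmax)  # step 4: stack
opt = torch.optim.Adam(deep_net.parameters())                # step 5: fine-tune
for _ in range(50):
    loss = nn.functional.cross_entropy(deep_net(features), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```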

Fig. 3 Auto-encoder 1 architecture

Fig. 4 Auto-encoder 2 architecture

Fig. 5 Softmax-supervised training

Fig. 6 Stacked auto-encoder

4 Result and discussion

In this work, an auto-encoder network framework is designed to recognize interesting feature patterns and to appraise the network's classification performance. The work experimented with different feature extraction methods and their combinations. MFCC features are the most commonly used parameters in adult speech recognition systems. The study conducted here shows that a Malayalam speech recognition system designed exclusively for children achieves relatively low performance with MFCC features alone. The feature extraction techniques that contributed to the highest performance (MFCC and its derivatives, ZCR, and spectrogram formants) are therefore adopted as the features for this work (Table 4).

Table 4 Feature selection experiment result

The architecture consists of two layers of an unsupervised (self-supervised) learning model, the auto-encoder, employed for pre-training and feature identification. The sparsely labeled dataset is used for the subsequent supervised training and fine-tuning. Fine-tuning is an optional training stage; in this work, it improved the training accuracy from 87.7 to 97% and produced a marked improvement in testing as well (Table 5).

Table 5 Importance of fine-tuning

The designed network framework showed an average training accuracy of 97% (Fig. 7) and an average accuracy of 89.5% on the test dataset (Fig. 8).

Fig. 7 Training performance

Fig. 8 Performance on test data

The trained network classifies the test data into ten classes: class 1 (a), class 2 (a:), class 3 (i), class 4 (i:), class 5 (u), class 6 (u:), class 7 (e), class 8 (e:), class 9 (o), and class 10 (o:). Classes 1, 2, and 5 achieved 100% classification accuracy. Among the remaining classes, the worst false positive (FP) rate (19.0%) was shown by classes 9 and 10, while class 6 showed the lowest accuracy, with the highest false negative (FN) rate (25.0%). The false events (false positives and false negatives) can be classified into interclass and intraclass misclassifications. In terms of the articulation required, a short vowel and its corresponding long vowel [e.g., (a) and (a:)] can be considered one class (Table 1). The most identifiable parameter distinguishing the long and short vowel within a class is the duration of articulation: the long vowel [e.g., (a:)] is longer than the short vowel [e.g., (a)]. Many speakers fail to realize the duration difference required for uttering long and short vowels, so intra-class misclassification of isolated vowels is quite common even among adults. However, intra-class classification errors on isolated vowels can be considered negligible, as the articulation is corrected when the vowels are combined into words.
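
Per-class figures of this kind can be derived from a confusion matrix; the scikit-learn sketch below shows the computation on hypothetical label arrays, since the paper's raw predictions are not available.

```python
# Per-class FN and FP rates from a confusion matrix; the label arrays
# here are random placeholders, not the paper's results.
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, 200)   # hypothetical ground truth
y_pred = rng.integers(0, 10, 200)   # hypothetical network output

cm = confusion_matrix(y_true, y_pred, labels=range(10))
fn_rate = 1 - np.diag(cm) / cm.sum(axis=1)                    # missed per class
fp_rate = (cm.sum(axis=0) - np.diag(cm)) / (len(y_true) - cm.sum(axis=1))
for k in range(10):
    print(f"class {k + 1}: FN {fn_rate[k]:.1%}, FP {fp_rate[k]:.1%}")
```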

5 Conclusion

Children are highly influenced by technology, and several studies have shown that assistive technologies improve speech and language development in children. Automatic speech recognition (ASR) enhances the capabilities of such assistive technologies. In this work, a neural network framework, pre-trained in an unsupervised manner (auto-encoders) and fine-tuned with a sparsely labeled dataset, is used to recognize Malayalam vowels articulated by children in the age group of five to ten. Two auto-encoders, with 100 and 28 hidden neurons respectively, are used for pre-training, and the bottleneck features are extracted from the second auto-encoder. The classification performance is further improved by supervised fine-tuning. The softmax layer is trained with the scaled conjugate gradient (SCG) method, which combines the conjugate gradient method with a trust region approach and thereby keeps the computational cost of each iteration low. ASR for children is much more challenging than for adults. This work presents an initial ASR study on Malayalam vowels for children and shows an average accuracy of 89.5% on the test dataset.