
1 Introduction

Dealing with unplanned, spontaneous speech [1] is one of the many challenges that Automatic Speech Recognition (ASR) systems for the Punjabi language have to contend with. The primary phenomena characterizing spontaneous speech are hesitations such as filled pauses, repetitions, repairs, and false starts, and many studies have focused on the detection and handling of these hesitations [2]. Recognition of spontaneous speech therefore calls for a paradigm shift from speech transcription to speech understanding, in which the speaker's intended message is extracted instead of every spoken word being transcribed. Spontaneous speech, as compared to planned speech, is the more natural way in which people communicate with each other. However, the recognition of spontaneous speech is made difficult by strong pronunciation variation and by variable silence gaps and fillers between words. At present, a variety of novel applications of large vocabulary continuous speech recognition (LVCSR) systems, such as automatic closed captioning, producing minutes of meetings and conferences, and summarizing and indexing speech documents for information retrieval, are being actively explored.

2 Automatic Spontaneous Speech Recognition System for Punjabi

Speech recognition [3] is a complicated task, and state-of-the-art recognition systems are very complex. Automatic spontaneous speech recognition has many prospective applications, including command and control, transcription of recorded dialogue, live speech transcription, and interactive spoken conversations (Fig. 1).

Fig. 1

Automatic speech recognition system for Punjabi speech

The first stage [4] of speech recognition is to reduce the speech signal to a stream of acoustic feature vectors, referred to as observations. The key task [5] of the speech system is to take an audio signal as input and produce a sequence of words as output. The acoustic model establishes a mapping between phonemes and their possible acoustic realizations, i.e., the phones. The prior probability of a word sequence is computed using the language model; trigram or even 4-gram language models are usually employed in current speech systems. The decoding step [6] of a speech recognizer is to find the string of words whose corresponding acoustic and language model scores best match the input feature vector sequence. For this reason, decoding with trained acoustic and language models is often referred to as a search process.
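In other words, for an observed feature vector sequence O, the decoder searches for the word string that maximizes the product of the acoustic likelihood given by the acoustic model and the prior probability given by the language model, the latter factored here for the trigram case:

$$ \hat{W} = \mathop{\mathrm{argmax}}_{W} P(W \mid O) = \mathop{\mathrm{argmax}}_{W} P(O \mid W)\,P(W), \qquad P(W) \approx \prod_{i} P(w_{i} \mid w_{i-2}, w_{i-1}) $$

where the words w_i make up the candidate word string W.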

3 Building an Acoustic Model for Spontaneous Punjabi Speech

In order to build an acoustic model for spontaneous Punjabi speech, the system has to be trained at the word level. However, a single-word wav file is small in size and contains a relatively large silence gap, so even for training single words we need sentences. For this purpose, we trained the Punjabi spontaneous speech system on multiple words and sentences with variable silence gaps.

A. Steps for training the acoustic model for the Punjabi corpus

To train the system for the Punjabi language, we need the following configuration files:

1. Dic (the dictionary file, in which independent words are stored):

The main purpose of the dictionary file is to map every stored Punjabi word to the recorded Punjabi sound units associated with it. Two types of dictionaries are present: the first maps legitimate words of the language to sequences of sound units, and the second maps non-vocalization sounds to corresponding non-vocalization or speech-like sound units. The training data that we give as input to our system are shown in Fig. 2 [7, 8].

Fig. 2

Training data of Punjabi language

The dictionary file (Punjabi.dic) looks as shown in Fig. 3:

Fig. 3

Dictionary files of Punjabi corpus
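Each line of such a dictionary maps one Punjabi word to the sequence of sound units used to pronounce it. The entries below are a purely hypothetical illustration of this format; the words and phone symbols are illustrative and are not taken from the actual Punjabi.dic file:

ਮੈਂ     M AI
ਘਰ     GH A R
ਜਾ     J AA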

2. Filler and noise: This is also a type of dictionary, one in which rejected noise is stored [2]. For example:

<s>    SIL
</s>   SIL
<sil>  SIL
3. Phone: The phone file [9] is a list of the individual sound units needed to make up a word. The phone file entries for the Punjabi language are shown in Table 1.

    Table 1 Phone files of Punjabi language
4. Transcript (the transcription of each wav file) and Fileids (the paths of the wav files):

The transcription file lists the dictation for each audio file. For example, for our Punjabi corpus, Table 2 shows the transcription file for the test audio:

Table 2 Transcript file

It is essential that each line of Punjabi text begins with <s> and ends with </s>, followed by the file id in parentheses. Also note that the parentheses contain only the file id, excluding the speaker_n directory. It is vital to have an exact match between the fileids file and the transcription file.
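As a hypothetical illustration of this format (the Punjabi text, file id, and speaker directory below are invented for illustration and are not taken from the corpus), a transcription line and its matching fileids entry would look like:

Transcription file:  <s> ਮੈਂ ਘਰ ਜਾ ਰਿਹਾ ਹਾਂ </s> (punjabi_001)
Fileids file:        speaker_1/punjabi_001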

We have two kinds of transcript and fileids files:

  • For training purpose (Punjabi_parpare.trans and Punjabi_parpare.fileds)

  • For testing purpose (Punjabi_check.trans and Punjabi_check.fileds)

The training files are used to create the feature vectors that are later used for recognition, while the testing files are used by the decoder to check the recognition. Sphinx_train.test file: this is the configuration file in which the paths of all the required files (fileids, transcript, etc.) are configured.

4 Steps for Creating the Language Model for the Punjabi Corpus

The language model is used for decoding. It provides the context needed to distinguish between words and phrases that sound alike. There are two forms of language models [10] that describe a language: grammars and statistical language models [11, 12]. Grammars describe very simple languages for command and control, and they are usually written by hand or generated automatically with plain code [13, 14]. The steps for creating the language model are:

  • Step 1: During compilation, we first provide the given text file as input, as shown in Fig. 4.

    Fig. 4

    Input Punjabi text file

  • Step 2: Execute the CMU language model toolkit command and create the vocab file (Fig. 5).

    Fig. 5

    1-, 2-, and 3-grams after compiling the vocab file

  • Step 3: Finally, the language model is created with the extension lm.DMP, which is used for training. During training, the decoder is used to test the training and to generate log files of the decoding.

Figure 6 clearly shows that, while decoding with the Punjabi acoustic model for spontaneous speech, out of 128 sentences and 390 words only 2 sentences and 1 word failed. So the sentence error rate is 1.6% and the word error rate is 0.5%.

Fig. 6

Output of the decoder for Punjabi corpus
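These figures follow the usual definitions of the two error rates, in which a sentence counts as an error if any of its words is misrecognized:

$$ \text{SER} = \frac{\text{erroneous sentences}}{\text{total sentences}} \times 100\%, \qquad \text{WER} = \frac{S + D + I}{N} \times 100\% $$

where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the total number of words in the reference transcription. For the sentences above, 2/128 × 100 ≈ 1.6%.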

5 Graphical User Interface for Automatic Spontaneous Speech System for Punjabi Language

The language model and the trained acoustic model data are both compiled into a final jar file, which is used for recognition. For live speech testing, we have created a Java-based GUI for spontaneous Punjabi speech (Fig. 7).

Fig. 7

GUI for spontaneous Punjabi speech recognition

It has options for a live speech test and for speech recognition of already recorded wav files.
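As a minimal sketch of how such a recognizer might be driven behind the GUI, assuming the Sphinx4 Java API is used (the model and audio paths below are placeholders, not the actual project paths):

import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class PunjabiRecognizerDemo {

    public static void main(String[] args) throws Exception {
        // Point Sphinx4 at the trained Punjabi acoustic model, dictionary,
        // and language model (placeholder paths, for illustration only).
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("models/punjabi/acoustic");
        configuration.setDictionaryPath("models/punjabi/Punjabi.dic");
        configuration.setLanguageModelPath("models/punjabi/punjabi.lm.DMP");

        // Option 1: live speech test from the microphone.
        LiveSpeechRecognizer live = new LiveSpeechRecognizer(configuration);
        live.startRecognition(true);             // true = discard previously cached audio
        SpeechResult liveResult = live.getResult(); // blocks until an utterance ends
        System.out.println("Live hypothesis: " + liveResult.getHypothesis());
        live.stopRecognition();

        // Option 2: recognition of an already recorded wav file.
        StreamSpeechRecognizer stream = new StreamSpeechRecognizer(configuration);
        try (InputStream audio = new FileInputStream("test/punjabi_001.wav")) {
            stream.startRecognition(audio);
            SpeechResult fileResult;
            while ((fileResult = stream.getResult()) != null) {
                System.out.println("File hypothesis: " + fileResult.getHypothesis());
            }
            stream.stopRecognition();
        }
    }
}

The live recognizer captures audio from the microphone, while the stream recognizer decodes an existing wav file; both rely on the same trained acoustic model, dictionary, and lm.DMP language model described in the previous sections.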

Figure 8 shows the output of the live speech testing for spontaneous Punjabi speech.

Fig. 8

Output of the Punjabi spontaneous speech recognition model

6 Performance Evaluation

The performance of the present work is evaluated by comparing it with previous work done on a small vocabulary system [5]. In the previous research, a total of 7 sentences and 42 words of the Punjabi language were taken [15, 16, 17]. The present work has a total of 128 sentences and 390 words. Table 3 shows the comparison between the previous and the present work on the basis of sentence error rate and word error rate.

Table 3 Result comparison

The graphical analysis shown in Fig. 9 represents a drastic reduction in the word and sentence error rates with the increase in vocabulary size from the previous to the present work.

Fig. 9

Performance comparison

7 Conclusion and Future Work

In this paper, an effort has been made to develop an automatic spontaneous speech recognition system for a Punjabi corpus using the Sphinx toolkit. The performance of the spontaneous speech recognition system has improved considerably in terms of both sentence and word error rate. A GUI has been created to test live Punjabi speech using a Java framework. In the future, the system will be trained on a larger vocabulary so that the recognition rate can be improved for voice input taken from different speakers. The language model will also be improved in future work for faster decoding and recognition.