1 Introduction

Micro air vehicles (MAVs) can be deployed for numerous applications, spanning both civilian and military roles. An MAV is flown either through a portable ground control station (GCS) or by radio control (RC). Flying an MAV using RC requires skill in handling the controls to realize a perfect flight. Even though the GCS offers an auto mode, configuring this mode requires knowledge of flight operations. The GCS features multiple menu pages accessed by keyboard and/or mouse inputs, which is tedious and time-consuming [1]. However, for MAV applications to succeed, untrained personnel should also be able to fly the MAV and control its operations. Voice commanding can be a convenient mechanism in such a scenario [2].

The heart of any voice-commanding system is an automatic speech recognition (ASR) block, which converts the spoken commands into text. ASR is basically a pattern recognition problem that requires a large amount of voice/speech data for training. The database used for training is referred to as a corpus and contains speech audio files together with their text transcriptions. Even though plenty of corpora are available for general-purpose speech applications, no such standard corpus exists for MAVs, and building a good corpus is essential to achieve accurate speech recognition. Further, the existing speech corpora are unable to cater to Indian accents [3].

This paper proposes a scheme to effectively capture the relevant speech data in a corpus through a tool developed for this purpose, named NALVoiceCorpus.

2 Speech Corpus for ASR Application

In an ASR system, features are extracted from the uttered command, and these features are used to arrive at the text best matching the utterance based on three models, namely the acoustic model, the language model and the lexicon/pronunciation model (see Fig. 1).

Fig. 1 Block diagram of ASR
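The feature-extraction step mentioned above converts the waveform into a sequence of short-time feature vectors. As a self-contained illustration only (real ASR front ends such as CMUSphinx compute MFCCs, not these simplified features), the following Python sketch frames the signal into 25 ms windows with a 10 ms shift and computes two classic short-time features, log energy and zero-crossing rate:

```python
import math

def short_time_features(samples, rate=16000, frame_ms=25, shift_ms=10):
    """Split a waveform into overlapping frames and compute, per frame,
    the log energy and the zero-crossing rate (simplified stand-ins for
    the MFCC features used by real ASR front ends)."""
    frame = int(rate * frame_ms / 1000)
    shift = int(rate * shift_ms / 1000)
    feats = []
    for start in range(0, len(samples) - frame + 1, shift):
        w = samples[start:start + frame]
        energy = sum(s * s for s in w)
        log_e = math.log(energy + 1e-10)          # avoid log(0) on silence
        zcr = sum(1 for a, b in zip(w, w[1:])
                  if (a < 0) != (b < 0)) / (frame - 1)
        feats.append((log_e, zcr))
    return feats

# Example: 100 ms of a 440 Hz tone sampled at 16 kHz
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
feats = short_time_features(tone)
```

A pure tone yields a low, steady zero-crossing rate, whereas fricatives and noise produce much higher values, which is why such features historically served as crude speech/non-speech cues.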

An acoustic model is used in ASR to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech. It is learned from a set of audio recordings and their corresponding transcripts. The pronunciation model defines the probability of the phone states given the words. A statistical language model describes the probability distribution over sequences of words. All these models are developed from the speech corpus.
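To make the language-model component concrete, the sketch below trains a maximum-likelihood bigram model over a handful of hypothetical command sentences (illustrative stand-ins in the spirit of Table 1, not the actual commands or the model produced by CMUSphinx):

```python
from collections import defaultdict

def train_bigram(sentences):
    """Count bigrams over sentence-padded commands and return the
    maximum-likelihood estimate P(w2 | w1)."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        words = ["<s>"] + s.lower().split() + ["</s>"]
        for w1, w2 in zip(words, words[1:]):
            counts[w1][w2] += 1
    probs = {}
    for w1, nexts in counts.items():
        total = sum(nexts.values())
        probs[w1] = {w2: c / total for w2, c in nexts.items()}
    return probs

# Hypothetical MAV command sentences
commands = ["takeoff", "goto waypoint one", "goto waypoint two", "land"]
lm = train_bigram(commands)
# "goto" is always followed by "waypoint" in this data, so
# P(waypoint | goto) = 1.0
```

A domain-specific model like this assigns high probability to valid command continuations, which is precisely why an MAV-specific corpus outperforms a general-purpose one on this task.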

In general, speech corpora can be broadly classified into read speech, which includes book excerpts, broadcast news, lists of words and sequences of numbers, and spontaneous speech, which includes dialogs, narratives, map tasks and appointment tasks. Further, they are classified into native (English) and non-native speech databases containing foreign-accented speech. The Google [4] and Apple [5] clouds have very rich corpora; however, they are of little use for the application under consideration, as they depend heavily on the availability of a data network. Since MAVs operate in remote areas, network availability cannot be ensured.

There are special speech corpora for command and control applications, which fall under the category of keyword spotting (KWS) [6], i.e., identifying specific words or phrases from a list of items in the language being spoken. Reference [6] also emphasizes the need for ASR technology capable of accurately determining speech activity regions, detecting keywords, and identifying language and speaker over highly degraded, weak and/or noisy communication channels. Meeting such requirements calls for an application-specific speech corpus. Under this category, ATCOSIM, the air traffic control simulation speech corpus, is a database of air traffic control (ATC) operator speech. It consists of speech data recorded during ATC real-time simulations; the utterances are in English and pronounced by ten non-native speakers [7]. Some works in the field of voice-controlled UAVs are available [8], but information about the speech corpora they used is not. Hence, there is a need to develop an MAV-specific speech corpus.

Even though speech corpus collection is procedural in nature, it decides the quality and efficiency of the resulting ASR system. Therefore, the corpus production procedure should be standardized [9].

3 Corpus Design for Voice-Controlled MAV

The MAV speech corpus development is carried out around Mission Planner [10], an open-source ground control station for MAVs. However, the generated corpus can be used with any MAV GCS.

A brief explanation of the GCS functionality leading to command identification is given in this section. In an MAV operation, the first step is establishing the connection between the MAV and the GCS. Once the connection is established, the next task is to plan the course of the MAV flight, which is done using the FLIGHT PLAN window. Another important window is FLIGHT DATA, which gives all the information related to the current flight/mission (see Fig. 2). These screens are accessed continuously during MAV flight and hence are considered for voice activation. Further, the MAV mission from takeoff to landing can be activated through speech. Table 1 lists the command candidates to be replaced by speech commands. The list is not exhaustive, and commands can be added based on user requirements.

Fig. 2 GUI of flight data in Mission Planner

Table 1 List of voice commands

4 NALVoiceCorpus Tool

The “NALVoiceCorpus” tool is developed in C# to generate the MAV-specific corpus in a systematic way, such that it can be accessed directly by acoustic and language modeling tools such as the HTK toolkit and CMUSphinx. The front end of the application (see Fig. 3) collects the speaker information and creates the file path as per a predefined directory structure. The tool records speech at 16 bits, 16,000 Hz. The commands in Table 1 are saved to a text file and loaded into the workspace. These command sentences appear one by one in a box, and the user has to utter each one. To capture only the speech part precisely, the tool incorporates a voice activity detector (VAD). Once the VAD completes, a green tick mark appears and the next sentence to be uttered is displayed. The captured speech waveform is stored as a wave (.wav) file.
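The paper does not describe the VAD algorithm used inside the C# tool, but a common minimal approach is energy-based endpointing: discard leading and trailing frames whose energy falls below a threshold. The following Python sketch (an assumption-laden illustration, not the tool's implementation) shows the idea:

```python
import math

def trim_silence(samples, rate=16000, frame_ms=10, threshold=0.01):
    """Energy-based voice activity detection: keep only the span from the
    first to the last frame whose RMS exceeds the threshold.
    (Illustrative sketch; the actual VAD inside the C# tool is not
    documented in the paper.)"""
    frame = int(rate * frame_ms / 1000)
    active = []
    for i in range(0, len(samples) - frame + 1, frame):
        w = samples[i:i + frame]
        rms = math.sqrt(sum(s * s for s in w) / frame)
        if rms > threshold:
            active.append(i)
    if not active:
        return []
    return samples[active[0]:active[-1] + frame]

# 100 ms of silence, 100 ms of a 440 Hz tone, 100 ms of silence
sil = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
speech = trim_silence(sil + tone + sil)   # only the tone span survives
```

Trimming silence this way keeps the stored wave files tightly aligned with their transcriptions, which simplifies downstream acoustic-model training.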

Fig. 3 Front end of the GUI of the NALVoiceCorpus tool

In ASR, the accuracy of the models can be improved by training them with data collected from realistic environments [11]; however, collecting a corpus in realistic environments may not be feasible. To address this, the tool simulates realistic environments, such as traffic, wind, forest and battlefield, while recording the corpus speech. This feature is activated by selecting the field option (see Fig. 3).
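The tool plays the background sounds during recording; a related offline technique (a sketch under assumptions, not the tool's mechanism) is to mix a noise track into clean recordings at a chosen signal-to-noise ratio:

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale a noise track so that the speech-to-noise power ratio equals
    snr_db, then add it sample-by-sample to the speech. Assumes both
    sequences have the same length (the noise can be tiled beforehand)."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Required noise power for the target SNR: P_s / P_n' = 10^(snr_db/10)
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

# Toy signals standing in for a command and an environment recording
speech = [math.sin(2 * math.pi * 300 * n / 16000) for n in range(1600)]
noise = [math.sin(2 * math.pi * 1234 * n / 16000) for n in range(1600)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Mixing at several SNR levels (e.g. 20, 10 and 5 dB) is a common way to make acoustic models robust to field conditions without repeating the recording sessions.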

Further, ASR performance is affected by the user's mood, such as anxiety, haste and fatigue [12]. While flying an MAV, the user is bound to get anxious and may issue some commands in haste to take control of the situation. The tool therefore has a facility to simulate an MAV flight-like scenario while the voice commands are uttered. This is accomplished by integrating Mission Planner [10] with the FlightGear [13] simulator. This option can be invoked by clicking the mission option of the tool.

5 NALVoiceCorpus Evaluation

The NALVoiceCorpus tool was evaluated by collecting speech from 25 users with varying accents for the commands given in Table 1. The acoustic model and language model were created using the CMUSphinx toolbox.

The ASR capability of these models was compared with that of the Indian English model available in CMUSphinx for some of the commands. The models developed with NALVoiceCorpus give superior performance compared to the existing model (Table 2), as the training data is MAV-specific. From the table, it is observed that “Takeoff” and “Goto Waypoint One” are not recognized, because a language model based on standard English treats “takeoff” as two separate words, and the word “waypoint” is not present in the standard dictionary.
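Comparisons like the one in Table 2 are conventionally scored with the word error rate (WER), computed via word-level Levenshtein distance. A standard sketch (the paper does not state its exact scoring method, so this is the conventional formula):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference words,
    via word-level Levenshtein distance with dynamic programming."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# A general-purpose model splitting "takeoff" into two words is scored as
# one substitution plus one insertion against the single reference word.
wer = word_error_rate("takeoff", "take off")   # → 2.0
```

This also illustrates why out-of-vocabulary command words such as “waypoint” are so damaging: every occurrence is guaranteed to be an error regardless of acoustic quality.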

Table 2 ASR performance comparison

6 Conclusion

This paper proposed the development of a dedicated speech corpus for voice control of MAVs. The specific command list was formulated based on the functionality of the MAV ground control station. Further, a tool named NALVoiceCorpus, specifically developed for corpus collection, was used for corpus generation, and models were created from this corpus. The ASR performance was found to be superior to that obtained with general-purpose models. Even though the tool was developed for this application, it can easily be extended to any other application.