Abstract
This paper brings out the unique requirements of the speech corpus needed to develop command-and-control applications such as a voice-controlled MAV. Since MAVs find application in adverse environments, noise degrades ASR performance. Moreover, because the English words a user utters are greatly influenced by the user's mother tongue, a customized speech corpus is necessary. Corpus creation is accomplished with the NALVoiceCorpus tool, which is designed to capture the specific requirements of the corpus. The tool is generic in nature and can be applied in the development of any ASR system.
1 Introduction
Micro air vehicles (MAVs) can be deployed for numerous applications, spanning both civilian and military roles. MAVs are remotely controlled using a portable ground control station (GCS) or a radio controller (RC). Flying an MAV with an RC requires skill in handling its controls to realize a perfect flight. Even though the GCS offers an auto mode, configuring this mode requires knowledge of flight operations. The GCS features multiple menu pages accessed through keyboard and/or mouse button presses, which is tedious and time-consuming [1]. However, for MAV applications to succeed, untrained personnel should also be able to fly the MAV and control its operations. Voice commanding can be a convenient mechanism in such a scenario [2].
The heart of any voice-commanding system is an automatic speech recognition (ASR) block, which converts the spoken commands to text. ASR is basically a pattern recognition problem that requires a large amount of voice/speech data for training. The database used for training is referred to as a corpus and contains speech audio files along with their text transcriptions. Even though plenty of corpora are available for general-purpose speech applications, no such standard corpus exists for MAVs, and building a good corpus is essential to achieve accurate speech recognition. Further, the existing speech corpora are unable to cater to Indian accents [3].
This paper proposes a scheme to effectively capture the relevant speech data in a corpus through a tool developed for the purpose, named NALVoiceCorpus.
2 Speech Corpus for ASR Application
In an ASR system, features are extracted from the uttered command, and these features are used to arrive at the text best matching the utterance, based on three models: the acoustic model, the language model and the lexicon/pronunciation model (see Fig. 1).
An acoustic model is used in ASR to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech. The model is learned from a set of audio recordings and their corresponding transcripts. The pronunciation model defines the probability of the phone states given the words. A statistical language model describes the probability distribution over sequences of words. All these models are developed from the speech corpus.
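As an illustration of how a language model built from a command corpus constrains recognition, consider a minimal bigram model sketched in Python. The transcripts below are hypothetical stand-ins for the Table 1 commands, not the actual corpus contents:

```python
from collections import Counter

# Hypothetical command transcripts standing in for the Table 1 command list.
transcripts = [
    "takeoff", "land", "goto waypoint one",
    "goto waypoint two", "return to launch",
]

# Count unigrams and bigrams over the transcripts (with a sentence-start marker).
unigrams, bigrams = Counter(), Counter()
for sentence in transcripts:
    words = ["<s>"] + sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def sentence_probability(sentence):
    """P(w1..wn) approximated as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(sentence_probability("goto waypoint one"))  # in-corpus command: 0.2
print(sentence_probability("goto runway"))        # out-of-corpus: 0.0
```

An in-vocabulary command sequence receives nonzero probability, while a sentence containing words or word pairs absent from the corpus scores zero, which is why a general-purpose language model performs poorly on MAV-specific commands.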
In general, speech corpora can be broadly classified into read speech, which includes book excerpts, broadcast news, lists of words and sequences of numbers, and spontaneous speech, which includes dialogs, narratives, map tasks and appointment tasks. They are further classified into native (English) speech databases and non-native ones containing speech with a foreign accent. The Google [4] and Apple [5] clouds have very rich corpora; however, they are of little use for the application under consideration, as they depend heavily on the availability of a data network. Since MAVs operate in remote areas, network availability cannot be ensured.
There are special speech corpora for command-and-control applications, which fall under the category [6] of keyword spotting (KWS): identifying specific words or phrases from a list of items in the language being spoken. Reference [6] also emphasizes the need for ASR technology capable of accurately determining speech activity regions, detecting keywords, and identifying language and speaker over highly degraded, weak and/or noisy communication channels. Meeting such requirements calls for an application-specific speech corpus. Under this category, ATCOSIM, the air traffic control simulation speech corpus, is a database of air traffic control (ATC) operator speech. It consists of speech data recorded during ATC real-time simulations, with utterances in English pronounced by ten non-native speakers [7]. Some work on voice-controlled UAVs is available [8], but information about the speech corpora used is not. Hence, there is a need to develop an MAV-specific speech corpus.
Even though speech corpus collection may seem a mere procedure, it decides the quality and efficiency of the ASR system. Therefore, the production procedure for a speech corpus should be standardized [9].
3 Corpus Design for Voice-Controlled MAV
The MAV speech corpus development is done around Mission Planner [10], an open-source ground control station for MAVs. However, the corpus generated can be used with any MAV GCS.
A brief explanation of the GCS functionality leading to command identification is given in this section. In an MAV operation, the first step is establishing the connection between the MAV and the GCS. Once the connection is established, the next task is to plan the course of the MAV flight, which is done using the FLIGHT PLAN window. Another important window is FLIGHT DATA, which gives all the information related to the current flight/mission (see Fig. 2). These screens are accessed continuously during an MAV flight and hence are considered for voice activation. Further, the MAV mission from takeoff to landing can be activated through speech. Table 1 lists the candidate commands to be replaced by speech commands. This list is not exhaustive, and commands can be added based on user requirements.
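The mapping from recognized command text to GCS actions can be sketched as a simple lookup table. The command strings and action descriptions below are hypothetical illustrations, not the actual Table 1 entries:

```python
# Hypothetical command-to-action mapping; the real command list follows Table 1
# and would dispatch into the GCS (e.g. Mission Planner) rather than return text.
COMMAND_ACTIONS = {
    "connect": "establish MAV link",
    "flight plan": "open FLIGHT PLAN window",
    "flight data": "open FLIGHT DATA window",
    "takeoff": "arm and take off",
    "land": "initiate landing",
}

def dispatch(recognized_text):
    """Normalize the recognized text and look up the corresponding GCS action."""
    action = COMMAND_ACTIONS.get(recognized_text.strip().lower())
    if action is None:
        return "unknown command"
    return action

print(dispatch("Takeoff"))  # arm and take off
print(dispatch("hover"))    # unknown command
```

A closed command set like this is what makes a keyword-spotting-style corpus feasible: the recognizer only needs to discriminate among a small, fixed vocabulary.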
4 NALVoiceCorpus Tool
The “NALVoiceCorpus” tool is developed in C# to generate the MAV-specific corpus in a systematic way, such that it can be accessed directly by acoustic and language modeling tools such as the HTK toolkit and CMUSphinx. The front end of the application (see Fig. 3) collects the speaker information and creates the file path as per a predefined directory structure. The tool records speech at 16 bits, 16000 Hz. The commands in Table 1 are saved to a text file and loaded into the workspace. The command sentences appear one by one in a box, and the user has to utter each of them. To precisely capture only the speech part, the tool incorporates a voice activity detector (VAD). Once voice activity detection completes, a green tick mark appears and the next sentence to be uttered is displayed. The captured speech waveform is stored as a WAV file.
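The paper does not describe the tool's VAD internals (the tool itself is written in C#). A minimal energy-based sketch in Python of how such a detector might trim leading and trailing silence, assuming 20 ms frames and an illustrative RMS threshold of 0.01:

```python
import numpy as np

def trim_silence(signal, rate=16000, frame_ms=20, threshold=0.01):
    """Keep only the span between the first and last frame whose RMS energy
    exceeds a threshold — a simple energy-based voice activity detector."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    voiced = rms > threshold
    if not voiced.any():
        return signal[:0]  # no speech detected
    first, last = np.flatnonzero(voiced)[[0, -1]]
    return signal[first * frame_len:(last + 1) * frame_len]

# Synthetic check: 0.2 s silence, 0.2 s of a 440 Hz tone, 0.2 s silence.
t = np.arange(3200) / 16000
sig = np.concatenate([np.zeros(3200), 0.5 * np.sin(2 * np.pi * 440 * t), np.zeros(3200)])
print(len(trim_silence(sig)))  # 3200 — only the tone region is kept
```

A production VAD would typically add hysteresis and a noise-adaptive threshold, but the frame-energy idea above is the common starting point.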
In ASR, model accuracy can be improved by training with data collected in realistic environments [11]; however, collecting a corpus in realistic environments may not be feasible. To address this, the tool simulates realistic environments, such as traffic, wind, forest and battlefield, while recording the corpus speech. This feature is activated by selecting the Field option (see Fig. 3).
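The paper does not specify how the environmental audio is combined with the speech. A common approach when simulating noisy training data is to scale a recorded noise sample to a target signal-to-noise ratio before adding it to the clean speech. A minimal sketch, assuming additive noise and a user-chosen SNR in dB:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the mixture has the requested speech-to-noise ratio."""
    noise = np.resize(noise, speech.shape)  # loop/trim noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose a gain so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # placeholder for a recorded command
noise = rng.standard_normal(8000)     # placeholder for wind/traffic noise
mixed = mix_at_snr(speech, noise, 10.0)
added = mixed - speech
print(10 * np.log10(np.mean(speech ** 2) / np.mean(added ** 2)))  # ≈ 10.0 dB
```

Sweeping the SNR (e.g. 0 to 20 dB) over each environment type yields training data spanning the degradation levels the MAV is expected to face.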
Further, ASR performance is affected by user mood, such as anxiety, haste and fatigue [12]. While flying an MAV, the user is bound to get anxious and may issue some commands in haste to take control of the situation. The tool therefore has a facility to simulate an MAV-flight-like scenario while the user utters voice commands, accomplished by integrating Mission Planner [10] with the FlightGear [13] simulator. This option is invoked by clicking the Mission option of the tool.
5 NALVoiceCorpus Evaluation
The NALVoiceCorpus tool was evaluated by collecting speech from 25 users with varying accents for the commands given in Table 1. The acoustic model and language model were created using the CMUSphinx toolbox.
The ASR capability of these models was compared with the Indian English model available in CMUSphinx for some of the commands. The models developed with NALVoiceCorpus give superior performance compared to the existing model (Table 2), as the training data is MAV-specific. From the table it is observed that “Takeoff” and “Goto Waypoint One” are not recognized by the standard-English model, because its language model treats “takeoff” as two separate words and “waypoint” is not present in the standard dictionary.
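Recognition comparisons of this kind are commonly quantified with word error rate (WER); the paper does not state its exact metric, so the following is illustrative. WER is the word-level edit distance between reference and hypothesis, normalized by the reference length:

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words, normalized by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# A model lacking "waypoint" in its dictionary might split it, raising the WER:
print(word_error_rate("goto waypoint one", "goto way point one"))  # 2/3
print(word_error_rate("takeoff", "takeoff"))                       # 0.0
```

The hypothesized split of “waypoint” into two words mirrors the failure mode observed for the standard-English model in Table 2.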
6 Conclusion
This paper proposed the development of a dedicated speech corpus for voice control of an MAV. The specific command list was formulated based on the functionality of the MAV ground control station. Further, a tool named NALVoiceCorpus, specifically developed for corpus collection, was used for corpus generation, and models were created using this corpus. The ASR performance was found to be superior to that obtained with general-purpose models. Even though the tool was developed for this application, it can easily be extended to any other application.
References
Kang Y, Yuan M (2009) Software design for mini-type ground control station of UAV. In: ICEMI’09: 9th international conference on electronic measurement and instruments. IEEE
Draper M et al (2003) Manual versus speech input for unmanned aerial vehicle control station operations. In: Proceedings of the human factors and ergonomics society annual meeting, vol 47(1). SAGE Publications, Los Angeles, CA
Shrishrimal PP, Deshmukh RR, Waghmare VB (2012) Indian language speech database: a review. Int J Comput Appl 47(5):17–21
Google cloud speech-to-text documentation page. https://cloud.google.com/speech/docs/
Speech recognition in IOS. https://developer.apple.com/documentation/speech
Robust Automatic Transcription of Speech (RATS), for Information Processing Techniques Office (IPTO), Defense Advanced Research Projects Agency (DARPA), DARPA-BAA-10-34, (2012)
Hofbauer K, Petrik S, Hering H (2008) The ATCOSIM corpus of non-prompted clean air traffic control speech. In: LREC
Ayres T, Nolan B (2006) Voice activated command and control with speech recognition over WiFi. Sci Comput Program 59(1–2):109–126
Li A-J, Yin Z-G (2007) Standardization of speech corpus. Data Sci J 6:806–812
Mission planner overview, ardupilot.org/planner/docs/mission-planner-overview.html
Paliwal KK, Yao K (2010) Robust speech recognition under noisy ambient conditions. In: Human-centric interfaces for ambient intelligence, 135–162
Benzeghiba M et al (2007) Automatic speech recognition and speech variability: a review. Speech Commun 49(10–11):763–786
Flight gear simulator homepage, home.flightgear.org/
Acknowledgements
The authors would like to thank SIGMA panel, AR & DB for funding this activity.
© 2020 Springer Nature Singapore Pte Ltd.
Rahul, D.K., Veena, S., Lokesha, H., Lakshmi, P. (2020). Speech Corpus Development for Voice-Controlled MAV. In: Kadambi, G., Kumar, P., Palade, V. (eds) Emerging Trends in Photonics, Signal Processing and Communication Engineering. Lecture Notes in Electrical Engineering, vol 649. Springer, Singapore. https://doi.org/10.1007/978-981-15-3477-5_11
Print ISBN: 978-981-15-3476-8
Online ISBN: 978-981-15-3477-5