
31.1 Introduction

As someone speaks, a hearing-impaired person can try to infer the oral message by lip-reading. This is a difficult task because different phonemes can correspond to identical mouth shapes. Dr. Cornett therefore developed cued speech (CS) in order to improve lip-reading efficiency [1]. Manual gestures are added to the lip shapes to ensure that each sound has a distinct visual appearance. Thus, “hand and lip-reading” becomes as meaningful as the oral message.

A significant difference between these two communication systems is that sign language (SL) is a complete language, while CS is not a language at all. For example, cued speech in Bahasa Malaysia (CSBM) is a visual representation of the spoken Malay language itself. Because of this, CS may have an advantage over SL in environments where translation takes place [2]. Deaf learners will also learn to actually speak with their mouths while learning to cue with their hands, although it takes a great deal of time and practice for them to speak proficiently and fluently. To improve communication for the deaf community, an automatic system is needed that recognizes CS and translates it into its text equivalent.

CSBM is based on a syllabic decomposition: the message is formatted into a list of “consonant–vowel” syllables (a CV list). Each CV is coded with a specific gesture combined with its lip shape, so that the whole is unique and understandable. A gesture carries two pieces of information: the handshape (which codes the consonant, as shown in Fig. 31.1a) and a location around the face (which codes the vowel, as shown in Fig. 31.1b). Hand coding carries the same amount of information as the lip movement; it is as difficult to lip-read without gestures as it is to understand the hand coding without lip movements. This symmetry explains why a single gesture codes numerous phonemes, which correspond to different lip shapes. Thus, only eight handshapes and four positions are needed for a combination of 32 CV gestures.
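As a rough illustration of this coding space, the sketch below enumerates the 8 × 4 = 32 possible CV gestures as (handshape, position) pairs; the handshape and position labels are hypothetical placeholders, not the actual CSBM chart.

```python
# Minimal sketch of the CV gesture space: 8 handshapes x 4 face positions.
# Labels are placeholders, not the actual CSBM consonant/vowel assignments.
HANDSHAPES = [f"handshape_{i}" for i in range(1, 9)]   # each codes a consonant group
POSITIONS = ["side", "mouth", "chin", "throat"]        # assumed names for the vowel positions

cv_gestures = [(h, p) for h in HANDSHAPES for p in POSITIONS]
assert len(cv_gestures) == 32  # 8 handshapes x 4 positions = 32 CV gestures
```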

Fig. 31.1

a Eight handshapes for the 24 Malay consonants. b Four positions with respect to the face for the six Malay vowels

Fig. 31.2

System diagram of Malay cued speech recognition

The contribution of this research is to review methods for automatically recognizing a succession of cued speech hand gestures in real time. In the future, a complete translator for the hearing-impaired could become feasible by coupling such a device with an automatic lip-reading module and other automated components.

31.2 Literature Review on Cued Speech

A cued speech recognition system automatically recognizes the hand shape, movement, and position and then displays the text equivalent on the computer screen. For the system to perform the recognition, a camera is needed that can capture the necessary parameters of CS, such as handshape, hand position, and hand movement. These parameters are then processed by suitable methods so that the gesture can be accurately determined. Several methods have been developed for recognizing hand gestures, including computer vision, depth cameras, hidden Markov models, adaptive boosting (AdaBoost), random forest regression progress (RFRProgress), finger tracking algorithms, and glove-based systems. These methods are discussed in the following subsections.

31.2.1 Existing Cued Speech Transliteration Systems

There is only a limited body of research on automated cued speech recognition. Some of the earliest explorations into CS recognition were conducted using computer vision techniques. Aboutabit et al. [3] focused on the identification of vowels by merging CS hand positions and lip information. Hand position classification was performed with a Gaussian classifier that took the 2D hand positions as input. Vowel recognition used the merged lip and hand position features and obtained 77.6% correct identification.
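A per-class Gaussian classifier over 2D hand positions, as described in [3], is essentially quadratic discriminant analysis. The scikit-learn sketch below illustrates the idea on synthetic coordinates; the position names, means, and spreads are invented for illustration only.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Synthetic 2D hand positions (pixels) for four vowel position classes.
# Real data would come from hand coordinates tracked in the video frames.
rng = np.random.default_rng(0)
centers = {"side": (80, 200), "mouth": (160, 260), "chin": (170, 330), "throat": (180, 400)}
X = np.vstack([rng.normal(c, 15, size=(50, 2)) for c in centers.values()])
y = np.repeat(list(centers.keys()), 50)

# QDA fits one Gaussian (mean + covariance) per class and picks the most likely one.
clf = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(clf.predict([[165, 265]]))  # expected to fall near the "mouth" position
```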

Given the success that hidden Markov models (HMMs) have had in automatic speech recognition, Heracleous et al. [4] used context-independent HMM-GMMs to decode a set of isolated phonemes extracted from CS sentences; i.e., the temporal boundaries of each phoneme to be recognized in the video were given at the test stage. In fact, the audio-based temporal segmentation was used for the temporal alignment of the lips, hand position, and handshape. The corpus was derived from a video recording of a CS speaker, with blue markings on the lips and hand, pronouncing and coding a set of 262 French sentences. Experiments on vowel, consonant, and phoneme recognition were presented and obtained an accuracy of 78.9%.
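To make the HMM-GMM setup concrete, the following sketch (using the hmmlearn library, with random feature vectors standing in for the lip and hand features) trains one small GMM-HMM per phoneme class and classifies a pre-segmented test segment by log-likelihood. It illustrates the decoding principle only and is not the authors' implementation.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

rng = np.random.default_rng(1)

def make_segments(offset, n_seg=20, seg_len=12, dim=6):
    """Random feature segments standing in for lip/hand feature vectors."""
    return [rng.normal(offset, 1.0, size=(seg_len, dim)) for _ in range(n_seg)]

# Two toy phoneme classes; real classes would be the CS vowels and consonants.
train = {"pa": make_segments(0.0), "ba": make_segments(2.0)}

# One context-independent GMM-HMM per phoneme class.
models = {}
for phone, segs in train.items():
    X = np.vstack(segs)
    lengths = [len(s) for s in segs]
    m = GMMHMM(n_components=3, n_mix=2, covariance_type="diag", n_iter=25, random_state=0)
    m.fit(X, lengths)
    models[phone] = m

# Classify a test segment (with known temporal boundaries, as in [4])
# by picking the model with the highest log-likelihood.
test = rng.normal(2.0, 1.0, size=(12, 6))
print(max(models, key=lambda p: models[p].score(test)))  # expected: "ba"
```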

In previous work on CS recognition, the video images were recorded with artificial markers applied to the CS speaker before recording (blue marks on the lips, hand, and forehead) in order to highlight the pertinent information and make its subsequent extraction easier. However, recent research has struggled to provide a robust, real-time solution that can adequately track handshapes against varying backgrounds and occlusions.

31.2.2 Depth Sensor Camera

Over the past decade, there has been significant exploration into using depth cameras such as the Microsoft Kinect for tracking hand gestures. The device features an RGB camera, a depth sensor, and a multi-array microphone and is capable of tracking the user's body movement. Since the Kinect is able to track the user's full body, it is natural to build a sign language recognition framework on it. Much of the work has focused on generalized hand tracking with a priority on real-time processing and arbitrary camera angles. Ahmed et al. [5] developed Deaftalk, a sign language interpreter using Microsoft's Kinect depth camera, which achieves 84% detection accuracy.

The gesture recognition technologies implemented in the Kinect Visual Gesture Builder (VGB) are AdaBoost and RFRProgress. AdaBoost acts as a trigger that returns a Boolean value, true while the person is performing a particular gesture; it uses the adaptive boosting machine learning algorithm. RFRProgress, on the other hand, produces continuous results: as the user performs the gesture, it reports how much of the gesture has been completed and the hit rate at each frame. This approach uses the random forest regression machine learning algorithm. Although the Kinect facilitates body and hand tracking and produces the depth image directly, it does not support handshape recognition. Since sign language and CS generally rely on distinct handshapes, similar signs cannot be distinguished.
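VGB itself is configured through Microsoft's authoring tool rather than code, but the two underlying ideas can be sketched generically with scikit-learn: an AdaBoost classifier acting as a per-frame Boolean trigger and a random forest regressor estimating gesture progress. The per-frame features and labels below are synthetic placeholders, not Kinect skeleton data.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestRegressor

rng = np.random.default_rng(2)

# Synthetic per-frame features standing in for skeleton joint measurements.
n_frames, dim = 400, 10
X = rng.normal(size=(n_frames, dim))

# Discrete labels: 1 while the gesture is being performed, else 0 (trigger idea).
y_trigger = (X[:, 0] > 0.5).astype(int)
trigger = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y_trigger)

# Continuous labels: fraction of the gesture completed in each frame (progress idea).
y_progress = np.clip((X[:, 0] + 2) / 4, 0, 1)
progress = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y_progress)

frame = X[:1]
print("gesture active:", bool(trigger.predict(frame)[0]))
print("estimated progress:", float(progress.predict(frame)[0]))
```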

31.2.3 Finger Tracking

A number of approaches have been explored to address the problem of handshape detection, but no clear consensus has formed on which holds the most promise. Some approaches, such as finger counting using the convex hull algorithm, have been shown to work in particular cases. Gurav et al. [6] developed a method that uses background subtraction and HSV segmentation together to create a mask. After the hand is segmented, the number of raised fingers can be detected. The largest contour in the image, which is assumed to be the hand, is found first. Then the convex hull and the convexity defects, which most probably correspond to the spaces between fingers, are identified. All of these algorithms, such as convex hull, background subtraction, and HSV segmentation, are provided in the open computer vision (OpenCV) library. OpenCV is used in HCI, robotics, biometrics, image processing, and other areas where visualization is important, and it includes an implementation of Haar classifier detection and training [7]. The finger counting algorithm achieved 92% accuracy with the convex hull technique. This is a manual way of finding the number of fingers and a necessary step toward identifying each individual finger.
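A minimal Python/OpenCV sketch of this finger counting idea is given below. The HSV skin range and the defect thresholds are illustrative values that must be tuned per camera and lighting, and a full pipeline such as [6] would also apply background subtraction before segmentation.

```python
import cv2
import numpy as np

def count_fingers(frame_bgr):
    """Rough finger count from one BGR frame using HSV skin segmentation,
    the largest contour, and convexity defects (valleys between fingers)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Illustrative skin-color range; real systems calibrate this per user/lighting.
    mask = cv2.inRange(hsv, (0, 30, 60), (20, 150, 255))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0
    hand = max(contours, key=cv2.contourArea)      # assume the hand is the largest blob

    hull = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull)
    if defects is None:
        return 0

    valleys = 0
    for start_i, end_i, far_i, depth in defects[:, 0]:
        start, end, far = hand[start_i][0], hand[end_i][0], hand[far_i][0]
        a = np.linalg.norm(end - start)
        b = np.linalg.norm(far - start)
        c = np.linalg.norm(far - end)
        cos_angle = np.clip((b ** 2 + c ** 2 - a ** 2) / (2 * b * c), -1.0, 1.0)
        # A deep, narrow defect is most likely the valley between two raised fingers.
        if np.degrees(np.arccos(cos_angle)) < 90 and depth > 10000:
            valleys += 1
    return valleys + 1 if valleys else 0
```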

31.2.4 Glove-Based

Glove-based systems have achieved the most impressive SL results in terms of vocabulary size, with over 90% accuracy obtained in continuous sign detection across more than 5000 Chinese signs [8]. However, such systems are expensive and require the user to wear unnatural devices.

31.3 Requirement of Cued Speech Recognition

31.3.1 System Overview

The ultimate goal of this research is to develop a Malay cued speech hand gesture recognition system. The system diagram in Fig. 31.2 shows the basic flow of the proposed system. The proposed system comprises a number of subsystems, which are explained in more detail in the following subsection.

In the proposed system, the user performs the desired hand gesture, and the Kinect camera captures the handshape and hand position. The handshape image is processed by the finger-tracking algorithm to determine how many fingers are engaged, after which the handshape can be classified into its respective consonant. Similarly, the hand position and movement video frames are captured by the parameter extraction module, where skin, depth, contour points, and movement are detected. The images or video streams are then sent to the machine learning functions, where they are matched against stored templates to identify the correct vowel. Lastly, the consonant is fused with the vowel to produce the text of a Malay syllable on the computer screen.
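The final fusion step can be sketched as a simple lookup that pairs the recognized consonant with the recognized vowel; the consonant and vowel assignments below are illustrative placeholders and do not reproduce the actual CSBM chart.

```python
# Illustrative fusion of recognizer outputs into a Malay CV syllable.
# Mapping tables are placeholders, not the actual CSBM handshape/position chart.
CONSONANT_BY_HANDSHAPE = {1: "b", 2: "k", 3: "s", 4: "t", 5: "m", 6: "l", 7: "g", 8: "n"}
VOWEL_BY_POSITION = {"side": "a", "mouth": "i", "chin": "u", "throat": "e"}

def fuse(handshape_id: int, position: str) -> str:
    """Combine the finger-tracking result (consonant) with the
    position/movement result (vowel) into one CV syllable."""
    return CONSONANT_BY_HANDSHAPE[handshape_id] + VOWEL_BY_POSITION[position]

print(fuse(1, "side"))   # -> "ba"
print(fuse(5, "chin"))   # -> "mu"
```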

31.3.2 Proposed Method

Handshape recognition using finger gesture tracking is proposed for this research. The handshape will be captured with the Kinect camera, which provides both RGB and depth data. The hand data will be processed with convex hull methods to count the number of engaged fingers, and each finger will then be identified using fingertip tracking methods. This will be made possible by using the OpenCV library in the system. As a result, the algorithm will be able to translate the eight different handshapes into specific consonants.

Microsoft Kinect provides the Visual Gesture Builder (VGB) software to detect hand position and movement using machine learning methods. VGB has built-in detection technologies for both discrete and continuous gestures. Hand position and movement recognition can be achieved using the AdaBoostTrigger and RFRProgress algorithms. As a result, the system will be able to translate the four different hand positions and movements into specific vowels.

The cued speech recognition software is built using C# in MS Visual Studio. This software integrates the finger tracking algorithm and the machine learning methods that translate the consonant and vowel. The recognized consonant and vowel are combined into a syllable, which in turn forms Malay words.

31.3.3 Dataset, Test, and Evaluation

In order to measure the accuracy of the system, the average hit rates of particular gestures from test runs by different users will be compared. In addition, validation will count false positives and false negatives against the correct results. A confusion matrix will summarize the results of the testing for further inspection.
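As an illustration of this evaluation step, the scikit-learn sketch below computes a confusion matrix and per-class precision and recall from hypothetical true and predicted gesture labels.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical ground-truth and predicted gesture labels from test runs.
labels = ["handshape_1", "handshape_2", "handshape_3"]
y_true = ["handshape_1", "handshape_1", "handshape_2", "handshape_3", "handshape_3", "handshape_2"]
y_pred = ["handshape_1", "handshape_2", "handshape_2", "handshape_3", "handshape_1", "handshape_2"]

# Rows = true class, columns = predicted class; off-diagonal cells are the confusions.
print(confusion_matrix(y_true, y_pred, labels=labels))
print(classification_report(y_true, y_pred, labels=labels))
```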

31.4 Conclusion

From the survey of the approaches used in various CS and SL recognition systems, the methodologies and algorithms involved in this research can be justified. In most cases, a combination of different methods and algorithms has to be used to achieve a moderate to acceptable recognition rate. This will allow the system to offer a complete phonetic representation through cued speech hand gesture recognition. In the future, a complete translator for the hearing-impaired could become feasible by coupling such a device with an automatic lip-reading module and other automated components. From a technical point of view, there is vast scope for future research and implementation in this field, and the potential gain of the proposed study is enormous.