1 Introduction

Hearing is an important function of the human body and an essential sense for efficient communication with others. However, according to the World Health Organization, there are an estimated 360 million Deaf and Hard of Hearing (DHH) individuals globally (328 million adults and 32 million children). Additionally, the DHH are disproportionately represented among people aged 65 years and above [38]. The DHH rely on various aids, such as cochlear implants, sign language, captioning, lip-reading (speech-reading), and other forms of support, to improve awareness of both speech and non-speech sounds. Nevertheless, the available hearing aids cover less than 10% of the global need; hence, more supporting aids are needed [38].

The DHH face many challenges in keeping up with the demands of daily life. One difficulty they confront is communicating with and understanding others without a mediator, which may cause isolation and frustration. The situation becomes more problematic if those around them are not skilled in sign language. Furthermore, they may be unaware of urgent surrounding sounds, such as fire alarms, bells, and car horns, which can delay their reactions and put their lives at risk.

Additionally, relying on a mediator between a DHH person and others may cause misunderstandings and miscommunication, depending on the skill of the mediator. Such dialogs sometimes include confidential information that requires a trusted mediator, who may not be readily available. Moreover, certain critical situations may require a DHH person to communicate with a hearing person, such as during an emergency or a doctor's visit, when neither an interpreter nor a layperson who knows sign language is available [13].

In particular, hearing-impaired students often suffer from low education levels and inadequate preparation for the outside world, which tends to further frustrate them and push them toward isolation and loneliness [36]. These students typically depend on lip-reading or hearing aids to communicate with hearing people. However, missing any part of the lip movement, or not hearing the whole conversation through their aid, may cause them to miss information that is essential to their learning. They usually do not feel comfortable informing the hearing person that they cannot hear and asking them to raise their voice or talk more slowly. Missing part of a conversation makes them feel less capable, less intelligent, and/or stressed, which might lead to antisocial behavior. Additionally, DHH persons are usually on constant alert in anticipation of conversation and subsequent lip-reading, which inhibits their memory and leaves them unable to remember much of the conversation, because their attention is focused on catching up rather than retaining what has been spoken [6].

These facts, as well as the need for language translation, have led many people to turn to technology in order to find suitable, accurate, and low-cost solutions such as the applications proposed in [10, 12, 21, 35, 39]. In these situations, wearable technology may become an exceedingly useful tool for all users, and especially the DHH [16, 17].

Computer-aided communication has been used to help the DHH in their daily lives. The development of smartphones has been particularly useful because they combine an integrated camera, sensors, and a display. Various assistive communication applications have been developed to help the DHH communicate effectively with others [21, 35, 39], and smartphones have thus become important devices for DHH people. However, external devices such as smartphones or desktop applications require DHH individuals to look at them, turning their attention away from conversational partners when sound alerts or notifications occur [15]. A better solution is needed, especially given the advances in wearable technology.

Assistive technology for the DHH is a key area in current disability research. This field includes wearable devices such as optical head-mounted displays, e.g., Google Glass [7]. Google Glass, shown in Fig. 1, has a built-in camera, a display prism, and several motion sensors, providing hands-free, efficient interaction. In addition to its affordable price and light weight, it has demonstrated great potential for assistive applications [12]. Google Glass can assist the DHH without interfering with any hearing aid they may be using, as demonstrated in [16, 17].

Fig. 1 Google Glass has a built-in camera, microphone, speakers, touch pad, and numerous sensors [7]

The article “Is Google Glass Useful for the Deaf?” [1], a recent survey on the social acceptance of wearable devices for disabled people [29], and the recent decision to admit 120 DHH female students annually to our university inspired us to start our research project, Enssat, in which we develop a mobile application that provides sound awareness through several functions. The Enssat application utilizes a head-mounted display (HMD) device and targets young college students who need sound awareness on their university campuses. Young people are also more willing to accept and try new technologies to improve their lives and communicate with their fellow students. Recent advances in ubiquitous computing research that utilized HMDs to make disabled people more independent further motivated the design and implementation of the Enssat project [16, 17, 29]. Although the idea is appealing, the application must address the following needs:

  1. Location-independent: it should work at home, at school, and outdoors, detect many sounds accurately, and respond to the user immediately.

  2. Portable and lightweight: to allow students to effortlessly carry it anywhere.

  3. Multiple functionality: it should support multiple functions, such as real-time transcription, real-time translation, and surrounding sound alerts, all in one application.

  4. Easy to use and user-friendly: the application should be easy to understand and follow. It should also support more than one language.

After a careful literature review, we found that none of the available applications on the market provide all of the above functions.

In this paper, we propose Enssat, a mobile application that provides real-time transcription, real-time translation, and surrounding sound alerts in one package using Google Glass, tested in our Arabic-speaking university community. Enssat's interface is bilingual, serving Arabic- and English-reading users. If the user cannot obtain Google Glass, the Enssat application remains fully operable on the mobile device alone. Although a wide range of users, regardless of age and education level, could use Enssat, its main target users are DHH university students. The real-time translation function can be used by any user, DHH or not.

HMDs, with their built-in cameras and hands-free operation, offer many benefits to disabled people, such as notification of surrounding sounds and reducing the need to hold a mobile phone or computing device. Several applications have been built to support people with special needs [3, 8, 12, 16, 17]; however, no existing application provides real-time transcription, real-time translation, and alert management through a wearable device. Instead of requiring the user to juggle several applications, Enssat puts all of these functions into one package, which can be used on a mobile phone or, more interestingly, with an HMD. Furthermore, it has been reported that HMDs are socially acceptable [29]. As a result, we hope DHH students will install Enssat on their HMDs.

We built a prototype of Enssat on the Android platform using a smartphone and Google Glass and evaluated its performance among university students. Our findings indicated a high acceptance of the application and ease of use among the DHH after a short period of training.

The main contributions of our project are:

  • Helping DHH people understand the conversations around them immediately and easily.

  • Helping increase the independence of DHH people.

  • Providing some protection for deaf people by alerting them to surrounding sounds.

  • Saving time and effort by using technology, instead of people, for real-time translation.

  • Contributing to wearable assistive technology for the Arabic-speaking community.

We hope to achieve these contributions through the following main functions provided by Enssat:

  1. Real-time transcription of spoken words in 14 languages.

  2. Real-time translation of spoken words between any of 14 languages, as well as translation of camera-captured text between those languages.

  3. Alerting the user to surrounding sounds.

To allow the user a seamless and smooth experience, the output of all these functions is displayed on Google Glass’s prism display.

The remainder of this manuscript is organized as follows: Section 2 discusses the recent literature in the field of assistive technology. Section 3 describes the proposed system and its building blocks. Section 4 presents the experimental results. Section 5 draws the conclusion, and Section 6 discusses future work.

2 Related work

Recent trends in disability research have resulted in an increasing number of studies focused on wearable devices. Wearable sensors increase the mobility and well-being of disabled people. In this section, we describe some solutions that have been developed for the DHH. The section is organized in three parts: (1) studies related to the DHH; (2) applications related to Google Glass; and (3) smartphone applications for the DHH.

2.1 Studies related to the DHH

The authors in [16] were the first to investigate group conversation visualization for the DHH using HMD technology. Because both authors are deaf, they investigated this issue in depth. Their aim was to increase glance accuracy and privacy for the DHH person. They designed several visualizations for group conversations, such as localizing the sound source, indicating the direction of the speaker, and responding to visual cues in real time. The visualizations were designed to augment, rather than replace, the wearer's senses. In the first set of experiments, the authors presented 24 deaf people with static images of their designs on an iPad to elicit feedback on the overall approach, whereas the designs on Google Glass were “pre-rendered animations of two common but difficult group conversation scenarios” used to collect their views. The test subjects liked the idea of displaying arrows on Google Glass indicating the direction of the speakers. They also liked displaying pulses of varying size to convey loudness. Furthermore, when asked about extra features, the deaf participants suggested identifying speakers to help during group conversation, identifying nonspeech sounds such as a phone ringing, and adding captions to read during group discussions. In the second set of experiments, the authors implemented these designs in a proof-of-concept, nonwearable prototype, tested it on four deaf participants, and collected their feedback. One issue indicated by the participants was the difficulty of looking at Google Glass and at people's faces simultaneously. Overall, however, the participants appreciated the extra features enabled by Google Glass and wished to use it in a public setting. These experiments suggested many future directions for improving the lives of DHH people.

In another study that addressed HMD devices [17], the authors aimed to help young children who were studying sign language as their first language of communication by delivering sign language instruction via HMD devices. One of the difficulties facing young deaf students is their inability to look directly at the signer or interpreter while also watching what is being taught. To help in this regard, the authors displayed sign language interpretation to the deaf students on an HMD while they watched a video on a big screen. The authors investigated the comfort and utility of HMDs for young deaf children in two phases. In the first phase, eight children were shown a video on a big screen while interpreted sign language was displayed on the HMD screen. In the second phase, students were shown a short planetarium show with the narration provided in sign language on the HMD. The deaf students had difficulty adjusting to the placement of the signer on the HMD screen and in maintaining focus between the interpreter on the HMD and the outside world. The authors suggested the need to design a smaller and lighter HMD device suited to a child's head, and called for further research on deaf children in educational settings.

Along the same lines, the authors in [18] developed a front-end mobile application of their SmartSign Dictionary for DHH children. Their aim was to help deaf children translate words in images taken by a mobile device camera into American Sign Language (ASL). The system consists of two parts. The first part is SmartSign, an online dictionary containing a large number of ASL videos, which is available on the Internet and connected to the mobile app. The second part is the mobile application, which works as follows: when a deaf child comes across an unknown word while reading, she/he can take a picture of the word and send it to the dictionary, and an ASL definition along with a video of the word is then displayed. The application requires an Internet connection to access the dictionary.

Furthermore, the authors in [19] developed a real-time captioning system called Scribe, which is based on a group of transcribers performing the same task. Anybody who can hear and type can participate in helping DHH people. The main contribution of the system is that it depends on multiple nonexperts working together, with the aim of lowering the cost of captioning while helping DHH students in their classes. Scribe allows collaboration among multiple users, each of whom transcribes the spoken words; their captions are then input to the system, processed, and presented to the DHH user. The system performs this processing on the fly and returns the result to the DHH user in less than five seconds. To merge the multiple captions, the authors adapted multiple sequence alignment (MSA) algorithms from computational biology, originally used to align DNA sequences. They use MSA to place the multiple captions in the correct order and create the optimal caption.

The aim of the authors in [33] was to speed up the transcription of spoken words into text for the DHH. They developed the C-Print system, which is used by a trained transcriptionist familiar with computerized abbreviations to transform spoken information into text in less than two seconds. C-Print relies on a regular keyboard and abbreviation strategies to speed up typing; the resulting text appears to the DHH user on a computer or mobile device. The system was tested during a conference and produced good results. However, it requires a transcriptionist trained for about 50 h to become familiar with the computerized abbreviations. The DHH can also use this application in a remote setting to read the text of a spoken lecture. Similarly, SWift (Sign Writing improved fast transcriber) [2] is the first web-based tool developed with the aid of specialists and deaf researchers. It aims to simplify composing single signs electronically. SWift was developed for Sign Writing users, both deaf and hearing, to make Sign Writing an effective communication tool and a learning support for deaf people.

In [31], the author designed UbiEar, a smartphone application that detects sounds in various environments around a DHH user and then alerts the user to these sounds. The app is designed to work on both a cloud server and the smartphone. When UbiEar is activated, it detects sounds via the mobile sensors, transmits them to the cloud server to be matched, and then notifies the user via a selected method, such as light, vibration, or text on the smartphone. The application achieved very good results in detecting sounds without using significant storage resources on the smartphone [31].

The above discussion shows that a great deal of attention has been devoted to assisting the DHH. For example, [16] works well with group conversations, [17] was meant to teach young children sign language, [18] focused on providing a dictionary to help the DHH, [19] developed a real-time captioning system but required a group of volunteers, [33] was designed to speed up transcription but needed a trained individual, and [3, 31] used mobile phone sensors to detect sounds. Nevertheless, a DHH individual still needs several apps to detect sounds or transcribe spoken words in his or her daily life, which can be quite inconvenient. Providing all of the above functions and more through one application that caters to both Arabic and English users would greatly help a DHH person in his/her daily life; that is one of the goals of Enssat.

2.2 Wearable device applications

Several studies have focused on the use of wearable devices, such as Google Glass and the Apple Watch, to assist in daily life. An overview of Google Glass from the perspective of mobile researchers was presented in [25]. The authors showed that Google Glass is well suited for studying many challenges in fields such as Human Computer Interaction, Augmented Reality, and Positioning Systems, among others. Google Glass provides an interface that can be operated hands-free and sensors that are useful for various forms of head and gaze tracking. The authors found the transition from Android smartphone to Glass software development straightforward, and mentioned some limitations of Google Glass, namely its more limited battery and computational resources. Even with these limitations, many applications have been developed to make use of Google Glass, as illustrated in the following.

The aim of [30] was to review published work on wearable technologies designed specifically for enhancing education. The author reviewed articles published from 2013 to 2015 about the usage of Google Glass in American education systems and presented the most important applications in libraries, medical fields, and universities. Several applications have been designed to help librarians via Google Glass: one supports shelf reading and inventory management; another, called “first person scanner,” helps librarians directly scan books carried by patrons. There are also several applications in higher education; one of these is the Glassist app, designed to help faculty better manage their tasks by creating portfolios for each student and displaying information about them. Finally, the paper discussed the usage of wearable devices in the medical field; for example, in an emergency, a wearable device can assist junior staff in communicating effectively with senior specialists who are not available on site. Furthermore, Google Glass has started gaining popularity among disabled people as an aid in their daily activities. For example, the U.S. Patent Office published a Project Glass patent, “Creating visual notifications of sound for the hard of hearing” [8]. The idea of the patent is to use the microphones located on the frame of Google Glass to support the DHH: once a microphone detects a sound, it is displayed as a pop-up notification on the Glass, including the direction and intensity of the noise. This is a great idea; however, to the best of our knowledge it has not yet been implemented. Many other applications are in the testing stage and will hopefully be utilized soon.

Furthermore, the authors in [23] conducted surveys to evaluate the utility and accessibility of smartwatches for the DHH. They surveyed six people via questionnaires and long-answer questions. The participants indicated that smartwatches are helpful and that they would like to have them in their daily lives; a smartwatch would help them leave safe places such as home and go out on the street. Some participants mentioned the small screen size and would prefer a bigger screen, particularly for older users.

An interesting study [29] investigated the social aspect of wearable devices among regular people. The authors surveyed a large number of people about their perception of others wearing HMDs in public places. Their findings indicated that the use of HMDs by disabled people to help in their daily life is more acceptable than the use of HMDs by nondisabled users.

Hence, with the advancement of wearable technology demonstrated by the abovementioned studies and social perceptions, Enssat assists both Arabic and English DHH users with three main functionalities: transcription of speech, translation of speech and of text in captured images, and surrounding sound alerts.

2.3 Hearing aid applications on mobile devices

A large number of mobile applications have been developed to empower DHH users. Below, we group some of the available applications and systems that help the DHH communicate easily with the outside world, or assist them in their daily activities, into three categories.

Speech-to-text recognition applications

  • Captioning on Glass “COG”: designed for DHH people to make their communication with others easier. With this system, a person speaks into the phone and the speech is displayed directly on Google Glass for the DHH individual [12].

  • Deaf Application: an Android application that helps deaf people in their communication. It converts spoken English or Arabic into written text to make a DHH person aware of the communication around her/him. Furthermore, the application transforms speech or text into sign language shown on the smartphone's screen [39].

  • Live Caption: designed for the DHH with little knowledge of sign language. The app transcribes words spoken into the mobile device into live text on the smartphone screen. It offers long recording times without interruption, allows the user to speak, type, and edit text on one screen, and lets the user share the transcribed text with others [21].

  • ClearCaptions: transcribes the user's phone calls directly, allowing the text of the call to be read mid-call. The user should be connected to a wireless network that supports simultaneous voice and data transfer. It uses real-time speech recognition techniques and is available free of charge on iOS and Android platforms [4].

  • Dragon Dictation: a free speech recognition application available on Android and iOS. It allows the user to speak and converts the speech into displayed text or email messages. It can also update the user's status on social networking sites such as Facebook and Twitter using his/her voice, and offers editing features by suggesting a list of words for faster typing. It works only in online mode [5].

Alert detection and translation applications

  • The Deaf and Hearing Impaired: a free Android application developed especially for deaf people to assist them and protect them from possible dangers. It vibrates and flashes when there is a loud sound in the vicinity, without needing an Internet connection, and the alert mode is user-selectable. Speech-to-text and text-to-speech conversion are supported to facilitate communication between the DHH and others [35].

  • Tap Tap: a paid iOS application that aims to help DHH people by directing their attention to people or things they cannot see. It notifies the user when it detects an alert or warning, such as a shout, a crash, a smoke alarm, a knock on the door, or the user being called. The notifications are communicated through vibration and flashing. It works completely in online mode [37].

Capture and translate applications

  • Google Translate: a mobile application available on both Android and iOS. It translates between 90 languages from around the world. The user can type, handwrite, speak, or take a picture of a word and have it translated, and can also hear the correct pronunciation of the word in the two languages. It is available offline and supports saving user translations for later retrieval [10].

  • Arab Deaf Sign Interpreter: enables the user to enter any word and have it translated into the corresponding sign language, easing and increasing communication between hearing and non-hearing individuals. It has many features, including support for both Arabic and English, saving the resulting sign image on the device for later use, and sharing the resulting translation on social networks such as Twitter and Facebook [22].

  • OpticText: a paid iOS application that recognizes text in a picture taken by the iPhone's camera and translates it. It supports multiple offline language translator packages; the user needs only to download the desired language package. It is based on processing the captured images and extracting the text [26].

  • Lingo Cam: a paid iOS application that offers a real-time translator and dictionary. The user points the mobile camera at any word and instantly receives the translation without taking a photo. It supports translation of 16 languages, also allows the user to type a word for immediate translation, and enables sharing of translations on social media [20].

  • Google Goggles: based on the idea of searching by taking a picture with the mobile phone camera. Google Goggles searches Google's database and provides any useful information about the captured image. It can read text and translate it into other languages [9].

The studies described in this section attempt to support people with disabilities, specifically DHH people. The majority of these applications are designed for smartphones, most of them address specific needs of DHH individuals, and only a few support transcription, alerts, and translation at the same time. To address the needs of DHH people, we developed Enssat, which provides these important functions using Google Glass.

3 Design and implementation of Enssat

In this section, we describe how Enssat was built. As shown in Fig. 2, we used a layered architectural design that organizes the system by grouping related functionalities into layers [32]. The principle of this architecture is the separation of layers, so that any change made in one layer does not affect the other layers. The mobile interface of the application is available in both Arabic and English, as shown in Fig. 3. To further illustrate the design of Enssat, we include the use case diagram (Fig. 4), showing the different functions and actors involved in the system, and the class diagram (Fig. 5), which shows the different classes used in the implementation of Enssat and how they communicate with each other.

Fig. 2 Enssat layered architectural design

Fig. 3 Enssat’s main interface in English and Arabic

Fig. 4 Enssat’s use case diagram

Fig. 5 Enssat’s class diagram

3.1 How does Enssat work?

The Enssat system consists of a Google Glass and a mobile device with the Enssat application installed. As illustrated in the use case diagram (Fig. 4), Enssat has five functions, four of which we consider major: translating from a picture, translating from speech, alerting the user (notifications), and real-time speech-to-text transcription. Of these, two assume a conversation with another speech-producing entity: verbal translation and transcription. We briefly describe their setup here and then explain how they work in more detail in the following subsections.

  1. Translation: this function can be used by anyone. The scenario involves a person wearing Google Glass who wishes to translate the speech going on around him/her. He/she launches Enssat on his/her smartphone and selects the translation function and the languages to translate from and to. The Google Glass wearer is then instructed by the app to place the microphone of the mobile phone close to the speech source. As the other party speaks, the microphone of the mobile phone connected to the Glass captures the speech; the app translates it into the language requested by the user and displays the translated text on the Google Glass prism display. This proceeds in a real-time, sentence-by-sentence fashion, similar to subtitles displayed in movies and other visual media. Unfortunately, translation currently goes in only one direction, from the speaker to the Google Glass wearer. The reverse translation, in which the Google Glass wearer produces speech that is translated on the mobile screen for the other party to read, has not yet been implemented and will be added as a feature in the future.

  2. Transcription: this function can also be used by anyone, but is designed specifically for the DHH. The setup is similar to the translation setup described above, except that, instead of being translated, the text is simply transcribed onto the prism of Google Glass so that the DHH person can read what the other person is saying. Again, the person producing the speech talks into the microphone of the mobile device connected to the Glass. An additional feature of the transcription function is that the top-scoring transcriptions are displayed to the speaking person for a few seconds, so that in the case of an incorrect transcription, the speaker can quickly choose an alternative to be displayed to the person wearing the Google Glass. If none is chosen, the top-scoring transcription is displayed to the Google Glass wearer.

The Enssat system resides on two devices: the Android mobile phone and Google Glass. These two parts communicate extensively in order to perform the functions of the system. The part on the mobile device is composed of four main components, each responsible for performing at least one function of the system. A fifth component runs on Google Glass and makes sure the input and output of the system are handled correctly. Figure 6 shows the main components and how they interact with each other. Each of these components is described in the following subsections.

Fig. 6 Enssat’s main components

3.2 Transcription component

The function of transcribing spoken text is central to the Enssat system. The interface for this function is shown in Fig. 7. In addition to transcribing spoken words in real time and displaying them to the DHH user, it converts spoken words to text that is passed to the translation service when the user requests the translation function.

Fig. 7 Screenshot showing the transcription function interface

The main challenge in the transcription function is ensuring continuous real-time transcription. For the transcription itself, Google’s Speech-to-Text service [11] is used. When the DHH user invokes the transcription function, the system starts recording through the mobile phone’s microphone. To ensure real-time, smooth transcription, the system continues to record until it detects a short silence or a pause in the speaker’s stream of speech. The system then immediately forks a thread that takes care of transcribing the recorded sound. The original thread continues to record the speaker’s words until it detects another pause. It then forks a thread to transcribe, and so on.

To transcribe the recorded words, the sound file is sent to Google's Speech-to-Text service [11] and the returned text is displayed to the user on Google Glass and the smartphone screen. To provide more accurate transcriptions, the system presents the speaker with a sorted list of suggested transcriptions of what she/he has just said, from which the speaker can quickly pick one to be displayed to the user. If none is chosen, the top matching transcription is displayed to the DHH user on Google Glass. Figure 8 shows a suggested list of transcription choices as it appears in Enssat. Picking a transcription from the suggested list is completely optional; the list is displayed for a few seconds only and does not hamper the real-time quality of the transcription process. It is used to improve the quality of the transcription in the rare cases when Google's Speech-to-Text service [11] gets it wrong.
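
To illustrate the segment-and-fork pattern described above, the following Java sketch records audio, treats a sufficiently long drop in amplitude as a pause, and hands the finished segment to a worker thread. This is a minimal sketch, not the Enssat source: the sample rate, silence threshold, pause length, and the transcribeAndDisplay() helper (which would wrap the segment in a WAV container, send it to Google's Speech-to-Text service [11], and push the result to the Glass) are all illustrative assumptions.

```java
import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SegmentingRecorder {
    private static final int SAMPLE_RATE = 16000;        // assumed
    private static final int SILENCE_AMPLITUDE = 1200;   // assumed empirical threshold
    private static final int PAUSE_MS = 600;              // pause length treated as a break

    private final ExecutorService workers = Executors.newFixedThreadPool(2);

    // Requires the RECORD_AUDIO permission.
    public void recordLoop() {
        int bufSize = AudioRecord.getMinBufferSize(SAMPLE_RATE,
                AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
        AudioRecord rec = new AudioRecord(MediaRecorder.AudioSource.MIC, SAMPLE_RATE,
                AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, bufSize);
        rec.startRecording();

        short[] buf = new short[bufSize];
        ByteArrayOutputStream segment = new ByteArrayOutputStream();
        long silentSince = -1;

        while (!Thread.currentThread().isInterrupted()) {
            int n = rec.read(buf, 0, buf.length);
            if (n <= 0) continue;
            segment.write(toBytes(buf, n), 0, n * 2);

            if (maxAmplitude(buf, n) < SILENCE_AMPLITUDE) {
                if (silentSince < 0) silentSince = System.currentTimeMillis();
                // Pause detected: fork a worker to transcribe the segment recorded so far,
                // then keep recording the next segment on this thread.
                if (System.currentTimeMillis() - silentSince > PAUSE_MS && segment.size() > 0) {
                    final byte[] audio = segment.toByteArray();
                    workers.submit(() -> transcribeAndDisplay(audio));
                    segment.reset();
                }
            } else {
                silentSince = -1; // speech resumed
            }
        }
        rec.stop();
        rec.release();
    }

    private static int maxAmplitude(short[] buf, int n) {
        int max = 0;
        for (int i = 0; i < n; i++) max = Math.max(max, Math.abs(buf[i]));
        return max;
    }

    private static byte[] toBytes(short[] buf, int n) {
        ByteBuffer bb = ByteBuffer.allocate(n * 2).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < n; i++) bb.putShort(buf[i]);
        return bb.array();
    }

    private void transcribeAndDisplay(byte[] pcm) {
        // Hypothetical helper: wrap pcm in a WAV container, send it to the speech-to-text
        // service, optionally show the candidate list to the speaker, and forward the
        // chosen text to the Glass connection component for display.
    }
}
```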

Fig. 8 Enssat screenshot showing a list of candidate transcriptions

This whole process can be performed in any one of 14 different languages, shown in Fig. 9. The user selects which language to use through a drop-down list.

Fig. 9 The 14 languages used in translation and transcription

3.3 Translation component

The second function of the system serves both hearing and DHH users. It provides real-time translation between 14 languages: the speaker speaks into the mobile device microphone and the translation appears as subtitles for the user to view on Google Glass. The other translation service provided by the system is translation from an image containing text. The following subsections describe these two functions in greater detail.

3.3.1 Translation from voice

The system performs the translation function between any two of the 14 languages listed in Fig. 9. The translation, like the transcription, is performed continuously in real time. Just as in transcription, spoken words are recorded via the mobile device microphone until a pause is reached. A thread is forked to handle the translation while the original thread continues recording spoken words in fragments separated by pauses in the stream of utterances.

The recorded segment is first passed to the transcription component to be transformed into text. This text is then sent to the Microsoft translation service [14], and the translation into the desired language is returned to the system and displayed on Google Glass.
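
As a rough illustration of this pipeline (not the actual Enssat code), the sketch below chains the two steps: the segment is first transcribed, and the resulting text is then translated before being displayed on the Glass. The transcribe(), translateText(), and display() helpers are placeholders standing in for the calls to Google's Speech-to-Text service [11], the Microsoft translation service [14], and the Glass connection component, respectively.

```java
public class VoiceTranslator {
    private final GlassConnection glassConnection;   // hypothetical component interface

    public VoiceTranslator(GlassConnection glassConnection) {
        this.glassConnection = glassConnection;
    }

    // Runs on the worker thread forked for one recorded speech segment.
    public void translateSegment(byte[] pcmSegment, String fromLang, String toLang) {
        String transcript = transcribe(pcmSegment, fromLang);            // speech-to-text step
        String translated = translateText(transcript, fromLang, toLang); // translation step
        glassConnection.display(translated);                             // subtitle on the prism
    }

    private String transcribe(byte[] pcm, String lang) { return ""; }                    // placeholder
    private String translateText(String text, String from, String to) { return text; }   // placeholder

    public interface GlassConnection { void display(String text); }
}
```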

Although the process may seem long and may raise some concerns as to the “real-time” claim of the system, in reality, if the device is connected to the Internet, the translation process has an excellent response time. A great deal of care was taken to ensure the efficiency of the code and the full utilization of the network connection to the translation service. The response time is evaluated through experiments described in the evaluation section of this manuscript.

3.3.2 Translation from image

In addition to real-time translation of speech, the system provides translation from images containing text. The user needs only to take a picture through the camera of Google Glass or the mobile device and this picture will be processed using Optical Character Recognition (OCR) techniques provided by the Tesseract library [34]. Through the library functions, the system extracts the text in the image. This text is then sent to the Microsoft translation service [14] and the returned translated results are displayed on Google Glass for the user.
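
A minimal sketch of this step is shown below, assuming the Android wrapper of the Tesseract library (tess-two's TessBaseAPI); the data directory, language codes, and translateText() helper are illustrative assumptions rather than Enssat's actual code.

```java
import android.graphics.Bitmap;
import com.googlecode.tesseract.android.TessBaseAPI;

public class ImageTranslator {
    // Extracts text from a captured picture and passes it on for translation.
    public String translateImage(Bitmap photo, String tessDataDir, String ocrLang,
                                 String from, String to) {
        TessBaseAPI ocr = new TessBaseAPI();
        ocr.init(tessDataDir, ocrLang);   // ocrLang, e.g., "eng", must match a .traineddata file
        ocr.setImage(photo);              // picture taken by the Glass or phone camera
        String extracted = ocr.getUTF8Text();
        ocr.end();
        // The extracted text is then translated exactly as in Section 3.3.1.
        return translateText(extracted, from, to);
    }

    private String translateText(String text, String from, String to) {
        return text; // placeholder for the call to the Microsoft translation service [14]
    }
}
```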

3.4 Alerts manager component

The third function provided by the system alerts the DHH user to specific surrounding sounds, such as a crying baby, a car horn, an ambulance siren, a ringing phone, or a fire alarm. This component provides a function similar to that of two apps discussed in the related work section: The Deaf and Hearing Impaired [35] and Tap Tap [37]. Although Enssat does not provide anything new in terms of function compared to these two apps, we believe that delivering this function through Google Glass and combining it with the other functions is unique to Enssat. First, the user decides which sounds she/he wishes to be alerted about; Fig. 10 shows the settings page where the user enables or disables the different sound alerts. The system then keeps listening to surrounding sounds. Once a sound is detected, the system tries to determine what kind of sound it is. If Enssat identifies the sound as one the user has asked to be alerted about, the phone vibrates and/or flashes and a message alerting the user to the sound appears on both Google Glass and the mobile phone screen. In the following subsections, we describe the sound detection and sound identification steps in detail.

Fig. 10 Enssat screenshot showing the settings page

3.4.1 Sound detection

To detect surrounding sounds, the system runs a listening thread active at all times. This thread detects surrounding sounds either from the microphone of the mobile phone or Google Glass. Once a sound is detected, a snippet of length 3 s is recorded and quickly passed to another thread, which tries to identify the sound (described in the following subsection). The listening thread then goes back to listening.

3.4.2 Sound identification

In this step, the system tries to identify the recorded sound passed from the listener thread. Enssat does this through a thread that cleans the sound snippet and then compares the detected sound wave with a group of stored sounds to determine which sound, if any, to report to the user. The sound-comparison functions are performed through the MusicG library [24]. The comparison function provided by the library returns a number between zero and one indicating the level of similarity between the wave files of the detected and stored sounds, with one meaning exactly the same and zero meaning completely different.

When we started working on the similarity function, we faced many obstacles and uncertainties regarding its accuracy. The number returned by the function varied depending on the mobile phone model and Android version. Another issue was determining when to alert the user: what is the appropriate threshold between zero and one at which to alert the user to a surrounding sound? Through trial and error and extensive experimentation on three different mobile phones, we found that a reasonable categorization of similarity is as summarized in Table 2. A sample of the experimentation performed to reach these conclusions is summarized in Table 1.

Each trial was conducted with the same sound applied to the three devices. Some of the trials were performed with a sound similar to the stored sounds and some with completely different sounds. The numbers shown indicate the similarity level between the stored and captured sounds as returned by the similarity function, and reveal variance in similarity levels depending on the device. Many more trials were conducted, incorporating heavy human judgment of the actual resemblance of the sounds. We finally deduced the numbers in Table 2.

Table 1 Testing similarity in alerts
Table 2 Result categorization of similarity levels

Table 2 shows the similarity thresholds determined based on the discussion above. If the detected sound is compared to a saved sound and the similarity function returns a value between 0.01 and 0.54, the user is not notified. If the comparison returns a value higher than 0.54, the user is notified, provided that this is the only sound enabled on the Enssat settings page. However, another complicating factor might be present: what happens when the user enables more than one alert? In this case, the captured sound must be compared with more than one group of stored sounds, and the sound with the highest similarity value is reported to the user. For example, suppose the user has enabled alerts for all sounds, and a detected sound matches two different stored sounds: a baby crying and an ambulance siren.

If the similarity value returned for the baby crying is 0.85 and that for the ambulance siren is 0.66, Enssat immediately notifies the user of a baby crying by displaying an appropriate message on Google Glass or by vibrating and flashing the mobile device, depending on the user's preferences.
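
The decision logic can be summarized in the sketch below, which assumes the MusicG library's fingerprint-similarity API [24]; the alert map, file paths, and exact threshold handling are simplified illustrations of the behavior described above, not the literal Enssat implementation.

```java
import com.musicg.fingerprint.FingerprintSimilarity;
import com.musicg.wave.Wave;
import java.util.Map;

public class SoundIdentifier {
    private static final double ALERT_THRESHOLD = 0.54;   // from Table 2

    // enabledAlerts maps an alert name ("baby crying", "ambulance siren", ...) to the
    // path of its stored reference recording.
    public String identify(String snippetWavPath, Map<String, String> enabledAlerts) {
        Wave detected = new Wave(snippetWavPath);          // the 3-s recorded snippet
        String bestAlert = null;
        double bestScore = ALERT_THRESHOLD;
        for (Map.Entry<String, String> alert : enabledAlerts.entrySet()) {
            Wave stored = new Wave(alert.getValue());
            FingerprintSimilarity sim = detected.getFingerprintSimilarity(stored);
            double score = sim.getSimilarity();            // 0 = different, 1 = identical
            if (score > bestScore) {                       // keep the best match above threshold
                bestScore = score;
                bestAlert = alert.getKey();
            }
        }
        return bestAlert;   // null means: do not notify the user
    }
}
```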

3.5 Google Glass connection component

This component is responsible for communicating with Google Glass. It establishes the connection with the fifth component of the system through Bluetooth technology [27]. At first, Wi-Fi was considered for communication between the mobile device and Google Glass in a client–server mode. This approach was later discarded because it required a Wi-Fi network, both devices had to be on this network, and if the network changed, new IP addresses had to be configured on both the mobile device and Google Glass. Another approach was using a WiFi-Direct connection [28], which does not require an established Wi-Fi network to be in place. This was quickly discarded because Google Glass did not support it. In the end, Bluetooth was used to connect Google Glass to the mobile device in a client–server mode.

The second important task of this component is acting as an interface between the rest of the components and the Google Glass component.

To ensure a smooth user experience, and because Enssat can work with or without Google Glass, upon launching the app for the first time the user is directed to the settings screen to indicate whether or not he/she owns a Google Glass. Depending on the user's input, either the Glass is discarded entirely and the mobile device's screen and camera are used to communicate with the user, or, if the user indicated the presence of the Glass, step-by-step instructions are provided to ensure that the Glass is within range and turned on and that Bluetooth is enabled on both the mobile device and the Glass. In addition, Enssat instructs the user to ensure that the Glass is Bluetooth discoverable. Afterward, the mobile device searches for discoverable devices in the vicinity and, upon finding Google Glass, the connection is established between the two devices, with the mobile device acting as a server and Google Glass acting as a client. Currently, no action is taken if more than one Google Glass is discovered; the assumption is that Google Glasses are not very common and hence the likelihood of finding more than one is very low.
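
The phone-side server socket can be set up roughly as in the sketch below, using Android's standard RFCOMM Bluetooth API; the service name, UUID, and handleGlassConnection() helper are illustrative assumptions (the Glass side would open a client socket with the same UUID and call connect()).

```java
import android.bluetooth.BluetoothAdapter;
import android.bluetooth.BluetoothServerSocket;
import android.bluetooth.BluetoothSocket;
import java.io.IOException;
import java.util.UUID;

public class GlassConnectionServer extends Thread {
    // Any fixed UUID shared by the phone and Glass apps works; this one is made up.
    private static final UUID ENSSAT_UUID =
            UUID.fromString("8ce255c0-200a-11e0-ac64-0800200c9a66");

    @Override
    public void run() {
        BluetoothAdapter adapter = BluetoothAdapter.getDefaultAdapter();
        BluetoothServerSocket server = null;
        try {
            server = adapter.listenUsingRfcommWithServiceRecord("Enssat", ENSSAT_UUID);
            BluetoothSocket glass = server.accept();   // blocks until the Glass connects
            handleGlassConnection(glass);              // exchange captions, alerts, and images
        } catch (IOException e) {
            // Connection failed or was lost (e.g., the Glass overheated and shut down);
            // the app falls back to phone-only mode or retries the connection.
        } finally {
            try { if (server != null) server.close(); } catch (IOException ignored) { }
        }
    }

    private void handleGlassConnection(BluetoothSocket socket) {
        // Hypothetical helper: read input from the Glass and write display text to it
        // via socket.getInputStream() / socket.getOutputStream().
    }
}
```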

3.6 Google Glass component

The final component of the system resides on Google Glass itself. Its main function is collecting input from the camera and possibly the microphone and displaying text to the user on the Google Glass prism display.

Because we are mainly dealing with DHH users, we do not expect to receive much input from the Google Glass microphone. Still, if abrupt sounds are detected, the alerts manager component is notified and passed the sound for analysis. The display on Google Glass, however, is utilized heavily for showing the user, in real time, the text corresponding to whatever is being spoken in front of him/her, translated or as spoken.

In working with Google Glass, we faced many obstacles that commonly occur with new technologies, such as the limited number of useful resources available on the Web. Older technologies tend to accumulate help videos and numerous other resources that help developers deal with obstacles, whereas at the time of programming Enssat, the Web resources dealing with Google Glass deployment were quite limited. In addition, when using Google Glass with Enssat, we faced two main issues:

  a. Glass overheating: the temperature of the Google Glass rose after continuous use. When it reached a certain temperature, the Glass turned off in order to cool down. This shutdown disconnects the mobile device from the Glass and requires reestablishing the Bluetooth connection. We tried to remedy this problem by entering sleep mode after the user finished a conversation; nevertheless, a prolonged conversation would still cause the Glass to overheat and consequently shut down.

  b. Sleep mode: a problem that puzzled us at the beginning of deploying Google Glass was an abrupt shutdown of the Glass display in the middle of a function, such as translating speech. In addition to halting the function running for the user, this caused the connection between the Glass and the mobile device to be lost. It took us a while to determine that the Android system attribute android:keepScreenOn defaults to false, which caused the Google Glass display to shut down after a predetermined period. Setting it to true resolved the issue (a minimal illustration follows this list).
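
For reference, a minimal way to apply the fix described in (b) programmatically on the Glass-side activity (an illustration, not Enssat's exact code) is:

```java
import android.app.Activity;
import android.os.Bundle;
import android.view.WindowManager;

public class GlassDisplayActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // Keep the prism display on while an Enssat function is running; this is the
        // programmatic equivalent of android:keepScreenOn="true" in the layout XML.
        getWindow().addFlags(WindowManager.LayoutParams.FLAG_KEEP_SCREEN_ON);
    }
}
```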

3.7 Putting it all together

Although Enssat utilizes many services and ready-made components, such as Google's Speech-to-Text service [11], putting it all together and making these components communicate properly and seamlessly is no easy feat. In addition to utilizing some of these services as they are, we had to build extra features on top of the existing ones to ensure the proper execution of Enssat's different functions. For example, although Google's Speech-to-Text service converts an audio file into text, processing continuous speech and splitting it into snippets was quite challenging; the process is described in greater detail in subsection 3.2. Another example is the alerts function, in which a great deal of work went into differentiating between the different sounds. It was only through trial and error that the current level of accuracy was achieved in detecting the different sounds and alerting the user to their presence.

4 System evaluation

To evaluate Enssat, we conducted two sets of experiments, one testing the responsiveness of the system in terms of time and the other testing user acceptance of Enssat. The mobile device used for experimentation was a Samsung Galaxy Note 3 with a 1.9-GHz octa-core processor and 3 GB of RAM running Android 4.4.2. The Google Glass used has an ARMv7 Processor rev 3 with 2 GB of usable memory. The tests were conducted in a quiet room in which only the speakers were allowed to talk, with no surrounding noise. We consider this a limitation of our testing of Enssat, owing to time constraints, and we hope to evaluate the system under more varied scenarios in the future.

4.1 System responsiveness

Because our system relies heavily on web services, we needed to make sure the response time is reasonable and acceptable to the user. We ran two sets of experiments to measure responsiveness. One measures the response time of each of the functions. The other measures the scalability of each function by increasing the number of words input to it and observing the performance deterioration. We performed each experiment five times; the highest and lowest readings were discarded, and the remaining three readings were averaged. Table 3 shows a sample of the experiments.

Table 3 Responsiveness of the system’s functions in seconds

Each of the functions was tested with multiple input variants. For each input, the response time was measured when only the mobile device was used to display the output, and again with the mobile device and Google Glass connected and the latter used to display the output. Overall, the response time of Enssat was reasonable: it took a maximum of approximately 2 s between receiving the input (spoken text or an image with text) and producing the output. This supports our claim of real-time transcription and translation.

Obviously, the longer the sentences, the higher the expected delay. We measured the scalability of the different functions by increasing the number of words gradually and noting the performance changes. Figure 11 shows the three functions tested with the number of words increasing from 1 to 5. The worst performance slightly exceeds 2 s, which we believe is an acceptable delay for real-time transcription or translation. The performance of transcription is better than that of the functions requiring translation, because speech translation invokes the transcription function before translating, compounding the time needed for both functions.

Fig. 11 Enssat’s scalability

4.2 User acceptance

The user acceptance test was conducted on a group of 25 participants. Of these, ten had some hearing difficulties and the remaining fifteen had no such issues but could benefit from the translation and text-capturing functions of the system. All of the DHH participants were female students in the College of Education, King Saud University, Riyadh, Saudi Arabia, with ages ranging between twenty and twenty-five years. For each student, one of the system developers took about five to ten minutes to train her on how to use the Google Glass and the mobile application. After the test, each student was given a form with nine questions to gather and investigate her feedback about Enssat. Table 4 summarizes the feedback on nine aspects of the system. The verbal feedback was highly positive and encouraging.

Table 4 The user acceptance results on each of Enssat’s features

5 Conclusion

The goal of this project is to design a system that provides basic communication functionalities for the DHH using Google Glass. Inspired by the recent decision to admit 120 DHH students annually to King Saud University in Riyadh, the main target users of Enssat are DHH university students. Enssat provides several important functions that aid users in their daily lives. The first function is the real-time transcription of spoken words into captions displayed on Google Glass in one of 14 different languages. The second function Enssat provides is real-time translation of spoken words between 14 languages, with the translation appearing on the Glass as captions. It also provides the capability of translating text from images taken by the Google Glass camera. The final function of the system alerts the user to surrounding sounds, such as car horns and alarms. The user selects which of the alerts are enabled/disabled. In the absence of Google Glass, the system uses the camera, microphone, and display of the mobile device. The evaluation of the Enssat application demonstrated satisfaction among DHH and non-DHH users. The evaluation of the responsiveness and scalability of the system returned reasonable numbers, justifying the real-time claim of the functions of Enssat.

6 Future work

A further study could assess the usability of Enssat among DHH users and tailor it to their specific needs. We would also like to add automatic language detection to Enssat, similar to that in the Google Chrome browser, which automatically detects the language of a given website and offers to translate it into the user's own language. We hope to make this automatic within Enssat, such that the application would offer real-time translation of the various languages being spoken in the same room into the DHH user's first language without requiring her/his intervention.

Furthermore, we would like to implement text-to-speech conversion, especially for the Arabic language, which is not yet supported by Google's packages. With this feature, the DHH user could type what she/he wishes to say to the person she/he is conversing with, and this would be converted to speech, allowing the DHH user an improved level of interaction with the people around her/him.

In addition, we would like to further improve on the accuracy of the alert feature and allow more sounds to be detected, such as the user’s name or any recorded sound the user wishes to be alerted to.

Additionally, there are a couple of limitations of Enssat that we would like to overcome in the future. The first is that Enssat handles only one person talking at a time; it would be beneficial to handle conversations in which a group of people participate, such as in a classroom setting. The other limitation is the Google Glass battery: because Enssat runs a listening thread for alerts, the battery depletes quickly. In the future, we would like to improve how Enssat uses Google Glass in terms of power consumption.

Finally, we would like to develop applications for the iOS and Windows Phone platforms in order to satisfy all types of users.