1 Introduction

English is arguably the most successful language in the history of civilization. English has been evolving since the fifth century and is now spoken by more than a billion people spread over all the six inhabited continents. Interestingly, a majority of the English speakers are non-native speakers and live in countries where English is not used as the primary mode of communication. English is used internationally, typically as a lingua franca for business, science and education.

English is used as a second language in countries like India, Malaysia, the Philippines and Singapore where there is either de jure or de facto recognition of the language. English is used as a foreign language in most other countries. Non-native speakers learn English because of various reasons including better access to higher education, better employability and prestige. Facilities are now available for learning English as a second language or a foreign language in most countries. However, non-native English speakers often face problems related to grammar, vocabulary and pronunciation. Non-native speakers tend to mispronounce English words because of the differences in the phonetics of their native language and English, and incorrect letter to sound conversion (Li et al. 2017). Moreover, such mispronunciations are often repetitive (Wang and Lee 2015).

Computers have been used for teaching since the 1960s and educational software is being developed ever since. Specialized software tools have been developed to teach different disciplines at different levels following different pedagogical approaches. Software tools have been developed to teach languages as well. Some such tools focus on specific aspects of language training. The first software tool to teach English pronunciation was developed in the early-1970s (Kalikow and Swets 1972). However, actual progress in Computer Aided Pronunciation Training (CAPT) for English happened only since the beginning of the new millennium. Today, sophisticated software tools are available to teach and test English pronunciation. These tools are typically used by non-native speakers. These tools detect and diagnose mispronunciations in the speech of learners, and then help the learners to correct them. Such tools are available for learners belonging to different age groups and may be used for both formal and informal learning.

This paper reviews the CAPT systems developed so far for teaching English pronunciation and classifies them according to the technology used in them. The objective of this paper is to provide English language teachers an overview of different types of CAPT systems and inform them about their utility.

2 Classification strategy

Over the years, CAPT systems have evolved from programs for just detecting mismatch in pronunciation to sophisticated pedagogical systems. There are two parallel design goals for the developers of CAPT systems as follows.

  • The first goal is to maximize the accuracy of mispronunciation detection and diagnosis so that the learners can improve their pronunciation as much as possible. Various probabilistic and artificial neural network based models have been used over the years to improve accuracy of CAPT systems.

  • The second goal is to make the CAPT systems more interactive and personalized. A CAPT system should ideally prescribe ways to improve pronunciation according to the profile of a learner.

On the basis of the technology used in them, we have classified the CAPT systems for English into visual simulation based systems, game based systems, comparative phonetics based systems and artificial neural network based systems (Fig. 1). It may be noted that these categories overlap at times. So, we assigned a CAPT system to a category according to the technology that plays the most important part in it.

Fig. 1
figure 1

Classification of CAPT systems for English

3 Visual simulation based systems

A visual simulation based CAPT system records the speech of a learner, analyzes it and provides feedback through explanatory images, animated pictorial characters and comparative videos. The approach is simple, attractive and effective especially for young learners and learners with hearing problems.

The first CAPT system for English developed by Kalikow and Swets (1972) followed this approach. They developed a system that used visual feedback for teaching English pronunciation to Spanish students. The system tried to mimic professional language trainers by providing personalized attention to the learners. The system focused on the pronunciation of vowels. It recorded videos of the learners when they spoke and later played those videos showing them their tongue location and trajectory as they voiced vowels. The system analyzed the speech of the learners isolating the vowels in multisyllabic words, and measured the amount of aspiration and time lapse before voicing of two vowels. Images were used to inform the learners about the corrections they required to make in their pronunciation. Since the system used visual feedback, it could be customized for learners with hearing difficulties. Giuliani et al. (2003) developed a system that used animated pictorial characters for teaching English pronunciation to 8–12 year old Italian children. The children had limited linguistic and technical skills. So, the system presented its feedback as animated pictorial characters, like birds and clowns, showing the steps that the children needed to undertake for pronouncing the words correctly.

4 Game based systems

Suitably designed mobile apps and computer games can be used to teach English pronunciation to learners in both formal and informal settings. A game based system can simulate real-world conversations and teach learners to speak according to the context. A few such systems have been developed recently.

Jing and Yong (2014) developed a mobile app to teach English pronunciation. The app records the speech of a learner and compares it with a pre-recorded speech of a human language expert. The app extracts salient features from a speech and uses them in the comparison. The app also uses those features to assess the learners and grades them automatically. The app has been used in the classroom to teach undergraduate students in China. The app was able to improve the pronunciation of the learners at both phoneme level and word level. Recently, Satria et al. (2017) developed a game to augment the conventional mode of teaching English in school. Learners can play the game through voice commands. The learners can progress in the game only if they pronounce correctly. The game internally maintains a database of keywords and matches the pronunciation of the learners with that database. The game provides an interactive and effective way of teaching pronunciation within the setup of a school. The game has been used for improving the vocabulary and pronunciation of 8–13 year old children in Indonesia.

5 Comparative phonetics based systems

The comparative phonetics based approach is sometimes used to teach English pronunciation to adult learners who are fluent in their native languages. In this approach, the phonemes in the native language of a learner are compared to those in English using a stochastic (probabilistic) analysis. The CAPT system then records the speech of the learners, detects mispronunciations and suggests ways for improvement considering their native languages.

Tianli et al. (2003) developed a CAPT system based on a comparison of English and Chinese phonemes for teaching English pronunciation to Chinese students. The system first demonstrates to the learners how to voice individual phonemes, multisyllabic words and complete sentences. After a demonstration session, the learners are asked to take a test. The system records the speech of the learners and analyzes the same. The analysis focuses on the utterance of the phonemes and the stress made on the syllables by the learners. The system provides feedback to the learners that they can use to improve their pronunciation. The system tries to mimic human experts while grading the learners. Abe et al. (2003) developed a system for teaching pronunciation in several Asian and European languages including English. The system uses phonological theories and comparative phonetics to improve the pronunciation and dialog of the learners. Wang et al. (2008) observed that mispronunciations are often caused by cross-language phonological comparisons made by non-native English speakers. They compared English and Chinese (Cantonese) phonemes and identified the common mispronunciations that are made by fluent Chinese speakers while speaking English. They compiled a pronunciation dictionary with correct and probable incorrect pronunciations, and used it to detect mispronunciations in the speech of the learners. The system was used to teach English pronunciation to Chinese adult learners. Qian et al. (2010) developed a probabilistic technique for detecting and diagnosing mispronunciations. The system compared Chinese (Cantonese) and English phonemes on the basis of phonological rules defined by expert linguists. The system was used by Chinese undergraduate students learning English. Qian et al. (2016) extended the work by separating the process of mispronunciation detection and mispronunciation diagnosis into two passes. The system was able to detect and diagnosis mispronunciations with a high accuracy.

6 Artificial neural network based systems

An artificial neural network is a model of computation in which a network of artificial neurons, also called nodes, is used to solve computational tasks for which no efficient algorithmic solution exists. The nodes can process information in parallel and are arranged in layers. The first layer is known as the input layer and the nodes in it receive the input signals. The last layer is known as the output layer and the nodes in it emit the output signals. There are one or more layers of nodes between the input layer and the output layer known as the hidden layers. The nodes are connected by connection links which carry signals from one node to another. The connection links have numeric weights associated with them which are multiplied to the amplitude of the signals they carry. Artificial neural networks can be used to detect and diagnose mispronunciations in a speech. An artificial neural network has to be trained first. In the training phase, the system records the speech of human language experts and the weights of the connection links in the artificial neural network are adjusted accordingly following an algorithm known as the backpropagation algorithm. The language experts typically need to voice a few hundred sentences to train the artificial neural network. Once trained, the artificial neural network can be used to detect mispronunciations in the speech of the learners. Artificial neural networks are a powerful model of computation and can be used to diagnose mispronunciations accurately.

In an early study, Akima et al. (1992) developed a system that used an artificial neural network to detect mispronunciation in the speech of Japanese schoolchildren learning English. The system emphasized on the correct voicing of the phonetic vowels. Qian et al. (2012) developed a system that uses stochastic reasoning and a deep neural network, i.e. an artificial neural network with multiple hidden layers, for diagnosing mispronunciation in a speech. The system has been used for teaching undergraduate students in China. Chen and Jang (2015) developed a system which was trained by human language experts on recognizing correct phones. The system used an artificial neural network and a support vector machine, i.e. another model of non-algorithmic computation, for recognizing words based on those phones. The system also automatically grades the learners. Wang and Lee (2015) hypothesized that a learner repeats similar mispronunciations because of the differences in the phonemes present in his/her native language and in English. They developed a system that detects mispronunciations in the speech of a learner with an emphasis on identifying frequently repeated mistakes. The system uses stochastic reasoning and a deep neural network. In a recent study, Li et al. (2017) hypothesized that the three main reasons for speakers to mispronounce are the differences in the phonetics of their native language and English, incorrect letter to sound conversion, and misreading of the text prompts. Li et al. (2017) used a deep neural network to detect mispronunciations arising from these reasons and to suggest the correct pronunciation.

7 Performance analysis

7.1 Technological and human factors

Researchers have tried to assess the CAPT systems in terms of both technological factors and human factors (Table 1).

Table 1 Comparison of performance of the CAPT systems

Researchers use various technological factors to access how correctly a CAPT system can detect and diagnose mispronunciation in speech. Wang et al. (2008) reported an accuracy of 86% in mispronunciation detection using their CAPT system. For most other systems, the diagnostic error rate varied between 14% (Li et al. 2017) and 28% (Wang and Lee 2015). Some researchers compared the accuracy of their systems with existing techniques for mispronunciation detections (Tianli et al. 2003; Qian et al. 2010, 2012).

Human factors like how the CAPT systems are received by English teachers and learners, and how much learners are benefitted by CAPT systems are matter of interest. Several researchers tried to determine the effectiveness of training using the CAPT systems. Akima et al. (1992) reported 7% decrease in mispronunciation after learners have used their CAPT system for 2 months. Recently, Jing and Yong (2014) reported 23% improvement in pronunciation performance among learners. Language experts agreed with the evaluation performed by CAPT systems (Chen and Jang 2015), while learners appreciated the CAPT system for their effectiveness and usability (Nyugen et al. 2010; Satria et al. 2017).

7.2 State-of-the-art practices

We have identified four practices that have been used by several researchers for improving the effectiveness and usability of CAPT systems. The first two practices improve the technological performance of CAPT systems, while the latter two improve human factors associated with CAPT systems.

  • Pronunciation dictionaries. A pronunciation dictionary is a database containing the acceptable pronunciations of different words. Several CAPT systems compare the utterances of learners with the entries in a pronunciation dictionary to detect mispronunciation (Wang et al. 2008; Qian et al. 2010; Wang and Lee 2015; Li et al. 2017). Pronunciation dictionaries may also contain frequently occurring mispronunciations to optimize mispronunciation detection.

  • Deep learning. Artificial neural networks have been used in CAPT systems since the early-1990s (Akima et al. 1992). However, the artificial neural networks used in recent CAPT systems have more neurons and those neurons are typically arranged in multiple hidden layers (Qian et al. 2012; Wang and Lee 2015; Li et al. 2017). CAPT systems often use deep neural networks to achieve higher accuracy.

  • Multimedia support. Several researchers have made use of the multimedia support provided by computers to improve the usability of CAPT systems (Kalikow and Swets 1972; Abe et al. 2003; Giuliani et al. 2003; Satria et al. 2017). CAPT systems use images, animation and audio and video clips to teach learners how to pronounce correctly.

  • Automated grading. To relieve language teachers from the task of grading pronunciation assignments of a large number of learners, several CAPT systems have modules to grade learners without any intervention of the teachers (Tianli et al. 2003; Jing and Yong 2014). Chen and Jang (2015) found a strong correlation between the grades given by a CAPT system and a language teacher to learners. Researchers have used both algorithmic and probabilistic techniques to grade pronunciation of learners.

8 Discussion

8.1 An analysis of CAPT systems for English

We observed that a number of software tools to teach English pronunciation has been developed in the last two decades. However, we believe that there is a need for more such tools given the large number of non-native speakers interested in learning English. Moreover, there is not much awareness about CAPT tools among learners and many English language teachers doubt their efficacy.

Table 2 presents a summary of the CAPT systems developed to date for teaching English pronunciation. We observed that those tools have been developed and used predominantly in Asian and European countries where English is used as a foreign language. No use of such tools in countries where English is used as a second language has been reported in the literature. Over the years, different researchers focused on different aspects of CAPT systems. For example, some researchers emphasized on correct voicing of vowels (Kalikow and Swets 1972; Akima et al. 1992), some researchers tried to identify frequently occurring mispronunciations (Wang and Lee 2015; Li et al. 2017) and some researchers focused on providing learners with personalized feedback (Kalikow and Swets 1972). Speech analysis and mispronunciation detection are computationally intensive tasks. Consequently, the artificial neural networks that have been used by the researchers typically have sophisticated architectures (Qian et al. 2012; Wang and Lee 2015; Li et al. 2017). Training artificial neural networks is a time consuming activity and, in the case of CAPT systems, too much effort is required from the language experts. Qian et al. (2012) experimented with techniques for minimizing the workload of the language experts.

Table 2 A summary of the CAPT systems for English

The different techniques used in the CAPT systems have their own limitations. For example, although visual simulation based systems are appealing to naive users, they do not provide detailed analysis and feedback. Game based CAPT systems are preferred by a small section of learners but require a lot of effort to develop. Requirements for a suitable native language corpus and availability of language experts are the limiting factors in developing comparative phonetics based CAPT systems. Alternatively, developing artificial neural network based CAPT systems requires large datasets of phonemes and long training periods.

Some of the CAPT tools target specific sections of learners, while the other tools are generic. Figure 2 arranges the different categories of CAPT systems according to the expected level of maturity of their target users, with the visual simulation based systems being appropriate for the most naive learners and the artificial neural network based systems being suitable for senior students and professionals. As expected, the tools targeting naive learners typically have more attractive user interfaces.

Fig. 2
figure 2

Salient features of different categories of CAPT systems for English. The categories are arranged according to their attractiveness and the expected level of maturity of the target learners

8.2 General computational linguistic tools

There are some general computational linguistic tools that we believe can be extended to teach English pronunciation. For example, Juang and Furui (2000) used a stochastic model to analyze speech signals with an objective to understand speech. In a later study with a similar objective, Shum et al. (2016) used of a combination of spectral analysis, stochastic reasoning and a deep neural network to analyze speech of learners and build a language recognition system. These systems can be adapted to accurately diagnose mispronunciations. Nyugen et al. (2010) developed a mobile app for teaching English as a foreign language to Vietnamese learners. The app personalizes the content presented to the learners based on their location and level of expertise, and the time of the day. Liaw (2014) developed a speech recognition based online reading system to supplement teaching English as a foreign language to elementary school students. In a recent experiment, Lee et al. (2017) used a recurrent neural network, i.e. an artificial neural network in which the nodes that use signals produced by one another form a cycle, to analyze the messages posted by the learners on social networking websites. Such techniques that present material to learners based on the context can be used in CAPT as well. Jain et al. (2018) used an augmented reality mobile app to teach novel words to kindergarteners. Such augmented reality mobile apps may be used to teach pronunciation as well. Nakai et al. (2018) discussed tools to display videos of the tongue and other internal organs as people voice sounds found in different languages. The tools can help in a systematic study of pronunciation.

8.3 CAPT systems for other languages

In the last two decades, some sophisticated CAPT systems have been developed for languages like Arabic, Chinese, French and Japanese. We believe that the technologies used in these systems can also be used to develop CAPT systems for English. For example, Abdou et al. (2006) used stochastic reasoning to develop a system for teaching Arabic pronunciation with an emphasis on uttering verses. A few sophisticated game based CAPT systems have been developed for Chinese and French. Chiu et al. (2007) developed a computer game to teach Chinese pronunciation. The system shows short video clips and the learners have to speak and voice words as per the scene. Later, Su et al. (2015) developed a more advanced game based system to teach Chinese pronunciation. The system presents a scenario to the learners and starts a conversation. The system continuously analyzes the speech of the learners and makes them speak specific sentences that will improve their pronunciation. Athanasopoulos et al. (2017) developed a three-dimensional game to teach French pronunciation. The learners take part in a conversation as a part of the game, and the system analyzes their speech and provides feedback for improving their pronunciation. In a recent study, Samsudin and Mano (2017) developed a visual simulation based system to teach Japanese pronunciation using animated texts. The system can be used efficiently by both naive and expert learners.

9 Conclusion

Several CAPT systems have been developed and used successfully for teaching English pronunciation to learners belonging to different age groups in different countries. However, we believe that more effort is needed on the development of CAPT systems and they should be used more proactively. Ongoing developments in digital signal processing, artificial intelligence, augmented and virtual reality and other branches of computer science will be helpful in developing more sophisticated CAPT systems in the near future. We have some recommendations regarding the development, dissemination and use of CAPT systems for English as follows.

  • Language teachers and computer engineers must collaborate to develop high quality CAPT tools. CAPT software should be disseminated through online platforms like GitHub and Google Play so that they can reach a wider audience.

  • CAPT tools should be integrated with the curriculum in schools and universities. Language teachers should be made aware of the benefits of using CAPT tools. CAPT tools may be used by the learners under the supervision of a teacher during the class hours as well as for self study later.

  • CAPT tools should be developed and used in countries where English is used as a second language. Many of these countries have a large number of students interested in learning English but inadequate English language teachers. CAPT tools for English will be immensely helpful for the students in these countries.

  • Smartphones have computation power comparable to that of a personal computer and built in support for multimedia. One mobile app for English CAPT has been already developed and used successfully. Mobile based CAPT systems can be used for both formal and informal learning. Implementing CAPT software as mobile apps will also allow more people to use them.

  • More research and innovation related to CAPT is necessary. English language teachers should offer massive open online courses (MOOCs) on English pronunciation and use CAPT tools in them. Scope of social media based and collaborative CAPT should also be investigated.