1 Introduction

Automatic quality assessment of spoken language translation (SLT), also known as confidence estimation (CE), is an important topic because it allows us to know whether a system produces user-acceptable outputs or not. In interactive speech-to-speech translation, CE helps to judge whether a translated term is uncertain (in which case we can ask the speaker to rephrase or repeat it). For speech-to-text applications, CE may tell us whether output translations are worth correcting, or whether they require retranslation from scratch. Moreover, accurate CE can also help to improve SLT itself through a second-pass N-best list re-ranking or search-graph re-decoding, as has already been done for text translation in Bach et al. (2011) and Luong et al. (2014b), or for speech translation in Besacier et al. (2015). Consequently, building a method capable of pointing out the correct parts as well as detecting the errors in a speech-translated output is crucial for tackling the above issues.

Given a signal \(x_{f}\) in the source language, spoken language translation (SLT) consists of finding the most probable target language sequence \(\hat{e} = (e_{1}, e_{2},\ldots ,e_{N})\), as in (1):

$$\begin{aligned} \hat{e} = \,\mathop {\arg \max }\limits _{e} \left\{ {p(e|x_{f} ,f)} \right\} \end{aligned}$$
(1)

where \(f = (f_{1}, f_{2},\ldots ,f_{M})\) is the transcription of \(x_{f}\). Now, if we perform confidence estimation at the “word” level, the problem is called word-level confidence estimation (WCE) and we can represent this information as a sequence q (of the same length N as \(\hat{e}\)) where \(q = (q_{1}, q_{2},\ldots ,q_{N})\) and \(q_{i}\in \{good,bad\}\).Footnote 1

Then, integrating automatic quality assessment into our SLT process (q is defined above) can be done as in (2)–(4):

$$\begin{aligned}&\displaystyle \hat{e} = \,\mathop {\arg \max }\limits _{e} \sum \limits _{q} {\left\{ {p(e,q|x_{f} ,f)} \right\} }\end{aligned}$$
(2)
$$\begin{aligned}&\displaystyle \hat{e} = \,\mathop {\arg \max }\limits _{e} \sum \limits _{q} {\left\{ {p(q|x_{f},f,e)*p(e|x_{f},f)} \right\} }\end{aligned}$$
(3)
$$\begin{aligned}&\displaystyle \hat{e} \approx \,\mathop {\arg \max }\limits _{e} \{ \mathop {\max }\limits _{q} \{ p(q|x_{f} ,f,e) * p(e|x_{f} ,f)\} \} \end{aligned}$$
(4)

In the product of (4), the SLT component \(p(e|x_{f},f)\) and the WCE component \(p(q|x_{f},f,e)\) contribute together to find the best translation output \(\hat{e}\). In the past, WCE has been treated separately in ASR or MT contexts and we propose here a joint estimation of word confidence for an SLT task involving both ASR and MT.

This journal paper is an extended version of a paper published at ASRU 2015 (Besacier et al. 2015), but here we focus more on the WCE component and on the best approaches to accurately estimate \(p(q|x_{f},f,e)\).

Contributions A corpus (distributed to the research community)Footnote 2 dedicated to WCE for SLT was initially published in Besacier et al. (2014). In this paper, we present its extension from 2643 to 6693 speech utterances. In addition, while our previous work on quality assessment was based on two separate WCE classifiers (one for quality assessment in ASR and one in MT), we propose here a single joint model based on different feature types (ASR and MT features). This joint model allows us to perform feature selection and analyze which features (from ASR or MT) are the most efficient for quality assessment in speech translation. We also experiment with two ASR systems of different performance levels in order to analyze the behaviour of our SLT quality-assessment algorithms at different levels of word error rate (WER) (Levenshtein 1966). The last part of this paper proposes to disentangle ASR and MT errors in speech translation by automatically detecting whether each SLT error originates from ASR or from MT.

Outline The outline of this paper is as follows: Sect. 2 reviews the state-of-the-art on confidence estimation for ASR and MT. Our word-confidence estimation (WCE) system using multiple features is then described in Sect. 3. The experimental setup (namely our specific WCE corpus) is presented in Sect. 4 while Sect. 5 evaluates our joint WCE system. Feature selection for quality assessment in speech translation is analyzed in Sect. 6. Section 7 proposes to disentangle ASR and MT errors in SLT output and finally, Sect. 8 concludes this work and gives some future perspectives.

2 Related work on confidence estimation for ASR and MT

Several previous works have proposed effective confidence measures for detecting errors in ASR outputs. Confidence measures were introduced for Out-Of-Vocabulary (OOV) detection by Asadi et al. (1990). Young (1994) extended this work and introduced the use of word posterior probability (WPP) as a confidence measure for speech recognition. The posterior probability of a word is most often computed from the hypothesis word graph (Kemp and Schaaf 1997). More recent approaches to confidence-measure estimation (Lecouteux et al. 2009) use side-information extracted from the recognizer: normalized likelihoods (WPP), the number of competitors at the end of a word (hypothesis density), decoding-process behaviour, linguistic features, acoustic features (acoustic stability, duration features), and semantic features. In addition, ASR quality estimation has been addressed in several recent studies (de Souza et al. 2015; Zamani et al. 2015; Jalalvand et al. 2016). Re-scoring the ASR N-best list using ASR and MT features was also proposed by Ng et al. (2014, 2015, 2016).

In parallel, the Workshop on Machine Translation (WMT)Footnote 3 introduced a WCE task for machine translation in 2013. Han et al. (2013) and Luong et al. (2013b) employed the Conditional Random Fields (CRF) model (Lafferty et al. 2001) as their machine-learning method to address the problem as a sequence-labelling task. Meanwhile, Biçici (2013) extended his earlier approach with dynamic training and adaptive weight updates in a neural network classifier. As far as prediction indicators are concerned, Biçici (2013) proposed seven word feature types and found the “common cover links” (the links that point from the leaf node containing the word to other leaf nodes in the same subtree of the syntactic tree) to be the most effective among them. Han et al. (2013) focused only on various N-gram combinations of target words. Inheriting most of the previously recognized features, Luong et al. (2013b) integrated a number of new indicators relying on graph topology, pseudo references, syntactic behaviour (constituent label, distance to the semantic tree root) and polysemy characteristics. The estimation of the confidence score mainly relies on classifiers such as CRF (Han et al. 2013; Luong et al. 2014a), Support Vector Machines (Langlois et al. 2012), or neural networks (Biçici 2013). Some investigations were also conducted to determine which features are the most relevant. Langlois et al. (2012) proposed to filter features using a forward-backward algorithm to discard linearly correlated features. Using boosting as the learning algorithm, Luong et al. (2015) were able to take advantage of the most significant features.

Finally, several toolkits for WCE have recently been proposed: TranscRater for ASR (Jalalvand et al. 2016),Footnote 4 QuEst++ for MT (Specia et al. 2015),Footnote 5 MARMOT for MT (Logacheva et al. 2016),Footnote 6 as well as the WCE toolkit (Servan et al. 2015)Footnote 7 that is used to extract MT features in the experiments of this paper.

To the best of our knowledge, the first attempt to design WCE for speech translation, using both ASR and MT features in a single classifier, was our own work (Besacier et al. 2014, 2015) which is further extended in this paper.

3 Building an efficient quality assessment (WCE) system

The WCE component solves equation (5):

$$\begin{aligned} \hat{q}=\,\mathop {\arg \max }\limits _{q} \{p_{SLT}(q|x_{f},f,e) \} \end{aligned}$$
(5)

where \(q = (q_{1}, q_{2},\ldots ,q_{N})\) is the sequence of quality labels on the target language. This is a sequence-labelling task that can be solved with several machine-learning techniques such as CRF. However, this requires a large amount of training data for which the quadruplet \((x_{f},f,e,q)\) is available. In this work, we use a corpus extended from Besacier et al. (2014) which contains 6.7k utterances, and we investigate whether this amount of data is sufficient to train and evaluate a joint model \(p_{SLT}(q|x_{f},f,e)\).

As it is much easier to obtain data containing either the triplet \((x_{f},f,q)\) (automatically transcribed speech with manual references and quality labels inferred from WER estimation), or the triplet \((f,e,q)\) (automatically translated text with manual post-edits and quality labels inferred using tools such as TERp-A (Snover et al. 2009)), we can also recast the WCE problem as in (6):

$$\begin{aligned} \hat{q}=\,\mathop {\arg \max }\limits _{q} \{p_{ASR}(q|x_{f},f)^\alpha *p_{MT}(q|e,f)^{1-\alpha }\} \end{aligned}$$
(6)

where \(\alpha \) is a weight giving more or less importance to \(WCE_{ASR}\) (quality assessment on transcription) compared to \(WCE_{MT}\) (quality assessment on translation). It is important to note that \(p_{ASR}(q|x_{f},f)\) corresponds to the quality estimation of words in the target language based on features computed on the source language (ASR). To achieve this, we project source quality scores onto the target side using word-alignment information between the e and f sequences. This alternative approach (Eq. (6)) will also be evaluated in this work even though it corresponds to a different optimization problem than Eq. (5). In particular, \(\alpha \) is only set a priori (to 0.5) in our experiments, which is probably not the optimal choice.
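To make the combination in Eq. (6) concrete, the following minimal sketch applies it word by word, assuming per-word probabilities of the good label are already available from the two classifiers together with a source-to-target word alignment. All names, the averaging over multiple aligned source words and the unaligned-word fallback are illustrative assumptions, not the exact implementation used in our experiments.

```python
def combine_confidences(p_asr_good, p_mt_good, alignment, alpha=0.5, threshold=0.5):
    """Word-level combination of Eq. (6): p_asr^alpha * p_mt^(1-alpha)."""
    labels = []
    for tgt_idx, p_mt in enumerate(p_mt_good):
        src_positions = alignment.get(tgt_idx, [])
        if src_positions:
            # project the ASR-based confidence to the target side via word alignment
            p_asr = sum(p_asr_good[s] for s in src_positions) / len(src_positions)
        else:
            p_asr = 1.0  # unaligned target word: no ASR evidence available
        combined = (p_asr ** alpha) * (p_mt ** (1.0 - alpha))
        labels.append("good" if combined >= threshold else "bad")
    return labels

# Toy example: 3 source words, 4 target words
p_asr_good = [0.9, 0.4, 0.8]                    # P(good) per source word (WCE_ASR)
p_mt_good = [0.95, 0.6, 0.3, 0.7]               # P(good) per target word (WCE_MT)
alignment = {0: [0], 1: [1], 2: [1, 2], 3: []}  # target position -> source positions
print(combine_confidences(p_asr_good, p_mt_good, alignment))
# -> ['good', 'bad', 'bad', 'good']
```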

In both approaches—joint (\(p_{SLT}(q|x_{f},f,e)\)) and combined (\(p_{ASR}(q|x_{f},f)\) + \(p_{MT}(q|e,f)\))—some features need to be extracted from the ASR and MT modules. They are detailed in the next subsections.

3.1 WCE features for speech transcription (ASR)

In this work, we extract several types of features, which come from the ASR graph, from language model scores and from a morphosyntactic analysis. These features are listed below [more details can be found in Besacier et al. (2014)]:

  • Acoustic features: word duration (F-dur).

  • Graph features (extracted from the ASR word confusion networks): number of alternative paths between two nodes (F-alt); word posterior probability (F-post).

  • Linguistic features (based on language-model probabilities): the word itself (F-word), 3-gram probability (F-3g), log probability (F-log), back-off level of the word (F-back), as proposed in Fayolle et al. (2010).

  • Lexical features: Part-Of-Speech (POS) tag of the word (F-POS).

  • Context features: Part-Of-Speech tags in the neighborhood of a given word (F-context).

For each word in the ASR hypothesis, we estimate these 9 features: F-Word; F-3g; F-back; F-log; F-alt; F-post; F-dur; F-POS; and F-context.
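To make this feature set concrete, the snippet below shows how the 9 ASR features of a single hypothesis word could be represented before being handed to the classifier; the values are invented for illustration and the exact encoding used by our toolkit may differ.

```python
# Illustrative ASR feature vector for one hypothesis word (values are invented).
asr_features = {
    "F-word": "budget",           # the word itself
    "F-3g": 0.0123,               # 3-gram language-model probability
    "F-log": -1.91,               # log probability
    "F-back": 1,                  # back-off level used by the LM for this word
    "F-alt": 4,                   # alternative paths between the two confusion-network nodes
    "F-post": 0.87,               # word posterior probability
    "F-dur": 0.42,                # word duration (seconds)
    "F-POS": "NOUN",              # part-of-speech tag
    "F-context": ("DET", "ADJ"),  # POS tags of the neighbouring words
}
```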

In a preliminary experiment, we evaluate these features for quality assessment in ASR only (the \(WCE_{ASR}\) task). Two different classifiers are used: a variant of the boosting classification algorithm called bonzaiboost (Laurent et al. 2014), which implements the Adaboost.MH boosting algorithm over deeper trees, and CRF.

3.2 WCE features for machine translation (MT)

A number of knowledge sources are employed to extract a total of 24 major feature types, as listed in Table 1.

Table 1 List of MT features extracted

It is important to note that features are extracted for the tokens of the translated hypothesis (MT or SLT); in other words, each feature is extracted for each token of the MT output. In Table 1, target therefore refers to a feature computed from the translated hypothesis, and source refers to a feature extracted from the source word aligned to the considered target word. More details on some of these features are given in the next subsections.

3.2.1 Internal features

These features are given by the MT system, which outputs additional data like an N-best list.

Word Posterior Probability (WPP) and Nodes features are extracted from a confusion network built from the MT N-best list. WPP Exact is the WPP value of the word at the exact same position in the graph, while WPP Any considers the word at any position in the graph. WPP Min gives the smallest WPP value among the transitions concerned and WPP Max the largest one.
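A rough sketch of how such WPP-style features could be computed is given below, with the confusion network represented as a list of slots mapping candidate words to posterior probabilities. The precise definitions of WPP Min/Max and Nodes in our toolkit may differ; this only illustrates the "exact position" versus "any position" distinction.

```python
def wpp_features(confusion_net, word, position):
    """confusion_net: list of slots, each slot a dict {candidate word: posterior}."""
    slot = confusion_net[position]
    return {
        "WPP Exact": slot.get(word, 0.0),                                        # word at its own position
        "WPP Any": max((s.get(word, 0.0) for s in confusion_net), default=0.0),  # word anywhere in the graph
        "WPP Min": min(slot.values()),                                           # smallest posterior in the slot
        "WPP Max": max(slot.values()),                                           # largest posterior in the slot
        "Nodes": len(slot),                                                      # number of alternatives
    }

cn = [{"the": 0.9, "a": 0.1}, {"budget": 0.7, "budgets": 0.3}]
print(wpp_features(cn, "budget", 1))
```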

3.2.2 External features

Below is the list of the external features used:

  • Proper name: indicates whether a word is a proper name; the same binary features are extracted to indicate whether a token is Numerical, Punctuation, or a Stop Word.

  • Unknown stem: indicates whether the stem of the considered word is known or not.

  • Number of word/stem occurrences: counts the occurrences of a word/stem in the sentence.

  • Alignment context features: these features (#11–13 in Table 1) are based on collocations and were proposed by Bach et al. (2011). Collocations can be an indicator for judging whether a target word is generated by a particular source word. We also apply the reverse, i.e. collocations on the source side (#7 in Table 1, simply called Alignment Features):

    • \(\diamondsuit \)   Source alignment context features: the combinations of the target word, the source word (with which it is aligned), and one source word before and one source word after (left and right contexts, respectively).

    • \(\diamondsuit \)   Target alignment context features: the combinations of the source word, the target word (with which it is aligned), and one target word before and one target word after.

  • Longest target (or source) N-gram length: the length (\(n+1\)) of the longest left-context sequence \(w_{i-n},\ldots ,w_{i}\) ending at the current word (\(w_i\)) that is known by the corresponding language model (LM) (source or target side). For example, if the longest such sequence \(w_{i-2},w_{i-1},w_i\) appears in the target LM, the longest target N-gram value for \(w_i\) will be 3. This value ranges from 0 to the maximum order of the LM concerned. We also extract a redundant feature called Backoff Behaviour Target (a sketch of this feature is given after this list).

  • The target word’s constituent label (Constituent Label) and its depth in the constituent tree (Distance to Root) are extracted using a syntactic parser.

  • Target polysemy count: the polysemy count of the target word, i.e. its number of meanings in the target language.

  • Occurrence in Google Translate and Occurrence in Bing Translator: we (optionally) test whether the target word of the translation hypothesis also appears in the on-line translations given by Google Translate and Bing Translator, respectively.Footnote 8
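As announced above, here is a small sketch of the longest-known-N-gram feature; the `ngram_is_known` membership test is a stand-in for querying an actual language model.

```python
def longest_ngram_length(words, i, ngram_is_known, max_order=4):
    """Length of the longest left-context n-gram ending at words[i] that is known by the LM."""
    length = 0
    for n in range(1, min(max_order, i + 1) + 1):
        if ngram_is_known(tuple(words[i - n + 1:i + 1])):
            length = n
        else:
            break
    return length  # ranges from 0 (word unknown) to max_order

# Toy "LM" containing a few n-grams
known = {("the",), ("budget",), ("deficit",), ("the", "budget")}
words = ["the", "budget", "deficit"]
print([longest_ngram_length(words, i, lambda g: g in known) for i in range(len(words))])
# -> [1, 2, 1]
```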

A very similar feature set was used for a pure \(WCE_{MT}\) task (English–Spanish MT, WMT 2013 and 2014 quality-estimation shared tasks) and obtained very good performance (Luong et al. 2013a). This preliminary experience participating in the WCE shared tasks of 2013 and 2014 led us to the following observation: while feature processing is very important to achieve good performance, it requires a set of heterogeneous NLP tools (for lexical, syntactic, and semantic analysis). Thus, we recently proposed to unify the feature processing, together with the call of machine-learning algorithms, in order to facilitate the design of confidence-estimation systems. The resulting open-source toolkit (written in Python and made available on github)Footnote 9 integrates standard as well as in-house features that have proven useful for WCE (based on our experience at WMT 2013 and 2014).

In this paper, we use only CRF as our machine-learning method, with the WAPITI toolkit (Lavergne et al. 2010), to train our WCE estimator based on both MT and ASR features.

4 Experimental setup

4.1 Dataset

4.1.1 Starting point: an existing MT post-edition corpus

For a French–English translation task, we used our SMT system to obtain translation hypotheses for 10,881 source sentences taken from news corpora of the WMT evaluation campaigns from 2006 to 2010. Post-editions were obtained from non-professional translators using a crowdsourcing platform. More details on the baseline SMT system can be found in Potet et al. (2010), and more details on the post-edited corpus in Potet et al. (2012). It is worth mentioning, however, that a subset (311 sentences) of these collected post-editions was assessed by a professional translator, who judged 87.1% of the post-edits to improve the hypothesis.

Table 2 Example of training label obtained using TERp-A

Then, the word-label setting for WCE was done using the TERp-A toolkit (Snover et al. 2009). Table 2 illustrates the labels generated by TERp-A for one hypothesis and post-edition pair. Each word or phrase in the hypothesis is aligned to a word or phrase in the post-edition with different types of edit operations: “I” (insertions), “S” (substitutions), “T” (stem matches), “Y” (synonym matches), and “P” (phrasal substitutions). The lack of a symbol indicates an exact match and will be replaced by “E” thereafter. We do not consider words marked with “D” (deletions) since they appear only in the reference. However, later on, we will have to train binary classifiers (good/bad) so we re-categorize the obtained 6-label set into a binary set: E, T and Y belong to the good (G) class, whereas S, P and I belong to the bad (B) category.
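The re-categorization described above amounts to a simple mapping from TERp-A edit types to binary labels, sketched below ("D" deletions are dropped beforehand since they only concern the reference side).

```python
TERPA_TO_BINARY = {
    "E": "G",  # exact match
    "T": "G",  # stem match
    "Y": "G",  # synonym match
    "S": "B",  # substitution
    "P": "B",  # phrasal substitution
    "I": "B",  # insertion
}

def to_binary_labels(terpa_labels):
    return [TERPA_TO_BINARY[label] for label in terpa_labels if label != "D"]

print(to_binary_labels(["E", "S", "Y", "I", "D", "T"]))  # ['G', 'B', 'G', 'B', 'G']
```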

4.1.2 Extending the corpus with speech recordings and transcripts

The dev and tst sets of this corpus were recorded by native French speakers. Each sentence was uttered by 3 speakers, leading to 2643 and 4050 speech recordings for the dev set and tst set, respectively. For each speech utterance, a quintuplet containing the ASR output (\(f_{hyp}\)), the verbatim transcript (\(f_{ref}\)), the English text-translation output (\(e_{hyp_{mt}}\)), the speech-translation output (\(e_{hyp_{slt}}\)) and the post-edition of the translation (\(e_{ref}\)) was made available. This corpus is available on a github repository.Footnote 10 More details are given in Table 3. The total length of the dev and tst speech corpora is 16h52, since some utterances are quite long.

Table 3 Details on our dev and tst corpora for SLT

4.2 ASR systems

To obtain the speech transcripts (\(f_{hyp}\)), we built a French ASR system based on the KALDI toolkit (Povey et al. 2011). Acoustic models are trained on several corpora (ESTER, REPERE, ETAPE and BREF120) representing more than 600 h of transcribed French speech.

The baseline GMM system is based on mel-frequency cepstral coefficient (MFCC) acoustic features (13 coefficients expanded with delta and double-delta features and energy: 40 features) with various feature transformations including linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT), and feature-space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT). The GMM acoustic model provides the initial phoneme alignments of the training data for the subsequent DNN acoustic model training.

The speech transcription process is carried out in two passes: an automatic transcript is generated with a GMM-HMM model of 43,182 states and 250,000 Gaussians. Word-graph outputs obtained during this first pass are then used to compute an fMLLR-SAT transform for each speaker. The second pass is performed using a DNN acoustic model trained on acoustic features normalized with the fMLLR matrix.

The CD-DNN-HMM acoustic models (43,182 context-dependent states) are trained using the GMM-HMM topology.

We use two 3-gram language models, trained on the French ESTER corpus (Galliano et al. 2006) and on French Gigaword (vocabulary sizes of 62k and 95k, respectively). The LM weight parameters of the ASR systems are tuned on WER over the dev corpus. Details on these two language models can be found in Table 4.

In our experiments, we propose two ASR systems based on the previously described language models. The first system (ASR1) uses the small language model, allowing a fast ASR system (about 2\(\times \) real time), while the second system (ASR2) rescores the lattices with the big language model during a third pass (about 10\(\times \) real time).

Table 4 Details on language models (LMs) used in our two ASR systems

Table 5 presents the performance obtained by the two ASR systems described above.

Table 5 ASR performance (WER) on our dev and tst set for the two different ASR systems

These WER scores may appear rather high for the task of transcribing read news. A deeper analysis shows that these news items contain a lot of foreign named entities, especially in our dev set. This part of the data is extracted from French media dealing with the European economy. This could also explain why the scores are significantly different between the dev and tst sets. In addition, automatic post-processing is applied to the ASR output in order to match the requirements of standard input for MT.

4.3 SMT system

We used the Moses phrase-based translation toolkit (Koehn et al. 2007) to translate the French ASR output into English (\(e_{hyp}\)). This medium-sized system was trained on a subset of the data provided for the IWSLT 2012 evaluation (Federico et al. 2012): the Europarl, Ted and News-Commentary corpora, for a total of about 60M words. We used an adapted target language model trained on specific data (News Crawled corpora) similar to our evaluation corpus (see Potet et al. (2010)). This standard SMT system is used in all experiments reported in this paper (Tables 6, 7).

4.4 Obtaining quality assessment labels for SLT

After building an ASR system, we have a new element of our desired quintuplet: the ASR output \(f_{hyp}\), which is the noisy version of our already available verbatim transcripts called \(f_{ref}\). This ASR output (\(f_{hyp}\)) is then translated by the exact same SMT system (Potet et al. 2010) mentioned in Subsect. 4.3. This new output translation is called \(e_{hyp_{slt}}\) and is a degraded version of \(e_{hyp_{mt}}\) (translation of \(f_{ref}\)).

At this point, a strong assumption we made has to be made explicit: we re-use the post-editions obtained from the text-translation task (called \(e_{ref}\)) to infer the quality (G, B) labels of our speech-translation output \(e_{hyp_{slt}}\). The word-label setting for WCE is done with the TERp-A toolkit (Snover et al. 2009) between \(e_{hyp_{slt}}\) and \(e_{ref}\). This assumption, namely that the initial MT post-editions can also be used to infer labels for an SLT task, is reasonable in the light of the results presented later in Tables 8 and 9, which show that there is not a huge difference between MT and SLT performance (evaluated with BLEU) (Table 10).

The above remark is important and is what makes this corpus valuable. For instance, other corpora such as the TED corpus could be used to obtain a quintuplet with ASR output, verbatim transcript, MT output, SLT output and target translation, but there are two main differences: first, in TED the target translation is a manual translation of the subtitles rather than a post-edition of an automatic translation (so we have no guarantee that good/bad labels extracted from it would be reliable for WCE training and testing); secondly, in our corpus each sentence is uttered by 3 different speakers, which introduces speaker variability into the database and allows us to deal with different ASR outputs for a single source sentence.Footnote 11

4.5 Final corpus statistics

The final corpus obtained is summarized in Table 6, where we also clarify how the WCE labels were obtained. For the test set, we now have all the data needed to evaluate WCE for 3 tasks:

  • ASR: good/bad labels are extracted by computing the WER between \(f_{hyp}\) and \(f_{ref}\) (a sketch of this label extraction is given after this list),

  • MT: good/bad labels are extracted by computing TERp-A between \(e_{hyp_{mt}}\) and \(e_{ref}\),

  • SLT: good/bad labels are extracted by computing TERp-A between \(e_{hyp_{slt}}\) and \(e_{ref}\).
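As an illustration of the first item, the sketch below derives good/bad labels for the ASR task from a word-level Levenshtein alignment between \(f_{hyp}\) and \(f_{ref}\): hypothesis words involved in a substitution or insertion receive B, exact matches receive G (reference-only deletions produce no hypothesis label). This is only a minimal stand-in for the actual label-extraction tooling.

```python
def asr_wce_labels(hyp, ref):
    """Label each hypothesis word G/B from the Levenshtein alignment with the reference."""
    n, m = len(hyp), len(ref)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # word inserted in the hypothesis
                          d[i][j - 1] + 1,         # word deleted from the reference
                          d[i - 1][j - 1] + cost)  # match or substitution
    labels, i, j = [None] * n, n, m
    while i > 0:  # backtrace to recover one minimal alignment
        if j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if hyp[i - 1] == ref[j - 1] else 1):
            labels[i - 1] = "G" if hyp[i - 1] == ref[j - 1] else "B"
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            labels[i - 1] = "B"  # inserted word
            i -= 1
        else:
            j -= 1               # deletion: no hypothesis word to label
    return labels

print(asr_wce_labels("the budget deficits".split(), "the budget deficit".split()))
# -> ['G', 'G', 'B']
```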

Table 6 Overview of our post-edition corpus for SLT

Table 7 gives an example of the quintuplet available in our corpus. One transcript (\(f_{hyp1}\)) has 1 error while the other (\(f_{hyp2}\)) has 4. This leads to 2 B labels (\(e_{hyp_{slt1}}\)) and 4 B labels (\(e_{hyp_{slt2}}\)) in the respective speech-translation outputs, while \(e_{hyp_{mt}}\) has only one B label.

Table 7 Example of quintuplet with associated labels

Tables 8 and 9 summarize the baseline ASR, MT and SLT performances obtained on our corpora, as well as the distribution of good (G) and bad (B) labels inferred for both tasks. Logically, the percentage of (B) labels increases from the MT to the SLT task under the same conditions.

Table 8 MT and SLT performances on our dev set
Table 9 MT and SLT performances on our tst set

5 Experiments on WCE for SLT

5.1 SLT quality assessment using only MT or ASR features

We first report in Table 11 the baseline WCE results obtained using MT or ASR features separately. In short, we evaluate the performance of 4 WCE systems for different tasks:

  • The first and second systems (WCE for ASR/ASR feat.) use ASR features described in Sect. 3.1 with two different classifiers (CRF or Boosting).

  • The third system (WCE for SLT/MT feat.) uses only the MT features described in Sect. 3.2 with the CRF classifier.

  • The fourth system (WCE for SLT/ASR feat.) uses only the ASR features described in Sect. 3.1 with the CRF classifier, i.e. predicting SLT output confidence using only ASR confidence features! Word-alignment information between \(f_{hyp}\) and \(e_{hyp}\) is used to project the WCE scores coming from ASR to the SLT output.

In all experiments reported in this paper, we evaluate the performance of our classifiers using the average of the F-measure for good labels and the F-measure for bad labels, each computed from the standard precision and recall of the corresponding label. Since two ASR systems are available, F-mes1 is obtained for SLT based on ASR1, whereas F-mes2 is obtained for SLT based on ASR2. For the results in Table 11, the classifier is trained on the dev part of our corpus and evaluated on the tst part.
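For clarity, a minimal sketch of this metric is given below; it assumes the label sequences have already been flattened over all words of the test set.

```python
def f_measure(reference, prediction, label):
    tp = sum(1 for r, p in zip(reference, prediction) if r == p == label)
    predicted = sum(1 for p in prediction if p == label)
    actual = sum(1 for r in reference if r == label)
    if tp == 0:
        return 0.0
    precision, recall = tp / predicted, tp / actual
    return 2 * precision * recall / (precision + recall)

def average_f_measure(reference, prediction):
    """Average of the F-measures for the good (G) and bad (B) classes."""
    return (f_measure(reference, prediction, "G") +
            f_measure(reference, prediction, "B")) / 2

ref = ["G", "G", "B", "B", "G"]
hyp = ["G", "B", "B", "B", "G"]
print(round(average_f_measure(ref, hyp), 3))  # -> 0.8
```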

Table 10 WCE performance baseline (%\(F_G\), %\(F_B\), %F-mes) on ASR1 and on ASR2 for tst set (random classifier generating G or B)
Table 11 WCE performance with different feature sets for tst set (training is made on dev set)

Concerning WCE for ASR, we observe that the F-measure decreases when the ASR WER is lower (F-mes2 < F-mes1 while \(WER_{ASR2}<WER_{ASR1}\)), i.e. quality assessment in ASR seems to become harder as the ASR system improves. This could be because the ASR1 errors recovered by the bigger LM of the ASR2 system were the easier ones to detect. Nevertheless, this conclusion should be considered with caution since the two results (F-mes1 and F-mes2) are not directly comparable: they are evaluated on different references (the proportions of good/bad labels differ because the ASR systems themselves differ). The effect of the classifier (CRF or Boosting) is not conclusive since CRF is better for F-mes1 and worse for F-mes2. In any case, we use CRF for all subsequent experiments since this is the classifier integrated in the WCE-LIG toolkit (Servan et al. 2015).

For WCE in SLT, the observed F-measure is better with MT features than with ASR features, i.e. quality assessment for SLT depends more on MT features than on ASR features. Again, the F-measure decreases when the ASR WER is lower (F-mes2 < F-mes1 while \(WER_{ASR2}<WER_{ASR1}\)). For MT features, removing the OccurInGoogleTranslate and OccurInBingTranslate features leads to 63.09% and 62.33% for F-mes1 and F-mes2, respectively. Finally, it is worth mentioning that the performance obtained by the classifiers in Table 11 is above the random baselines in Table 10.

In the next subsection, we investigate whether the use of both MT and ASR features improves quality assessment for SLT.

5.2 SLT quality assessment using both MT and ASR features

We report in Table 13 the WCE results for SLT obtained using both MT and ASR features. More precisely, we evaluate two different approaches (combination and joint):

  • The first system (WCE for SLT/MT+ASR feat.) combines the output of two separate classifiers based on ASR and MT features. In this approach, the ASR-based confidence score of the source is projected to the target SLT output and combined with the MT-based confidence score as shown in Eq. (6) (we did not tune the \(\alpha \) coefficient and set it a priori to 0.5).

  • The second system (joint feat.) trains a single WCE system for SLT (evaluating \(p(q|x_{f},f,e)\) as in Eq. (5)) using joint ASR and MT features. All ASR features are projected onto the target words using automatic word alignments. However, a problem occurs when a target word does not have any source word aligned to it; in this case, we duplicate the ASR features of the previous target word. Another problem occurs when a target word is aligned to more than one source word. In that case, several strategies can be used to infer the 9 ASR features: average or max over numerical values, selection or concatenation over symbolic values (for F-word and F-POS), etc. Three different variants of these strategies (shown in Table 12) are evaluated here; a sketch of this projection is given after this list.
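The sketch below illustrates one variant of this projection: averaging numerical ASR features and concatenating symbolic ones when a target word is aligned to several source words, and duplicating the previous target word's features when it is aligned to none. The other variants of Table 12 (e.g. max instead of average, selection instead of concatenation) follow the same pattern; the function and variable names are illustrative.

```python
NUMERIC = {"F-3g", "F-log", "F-back", "F-alt", "F-post", "F-dur"}
SYMBOLIC = {"F-word", "F-POS", "F-context"}

def project_asr_features(src_features, alignment, n_target):
    """src_features: list of per-source-word feature dicts; alignment: target index -> source indices."""
    projected = []
    for t in range(n_target):
        src_idx = alignment.get(t, [])
        if not src_idx:
            # unaligned target word: duplicate the previous target word's features
            projected.append(dict(projected[-1]) if projected else {})
            continue
        feats = {}
        for name in NUMERIC:    # average numerical features
            feats[name] = sum(src_features[s][name] for s in src_idx) / len(src_idx)
        for name in SYMBOLIC:   # concatenate symbolic features
            feats[name] = "_".join(str(src_features[s][name]) for s in src_idx)
        projected.append(feats)
    return projected
```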

Table 12 Different strategies to project ASR features to a target word when it is aligned to more than one source word
Table 13 WCE performance with combined (MT + ASR) or joint (MT, ASR) feature sets for tst set (training is made on dev set)

The results in Table 13 show that joint ASR and MT features only slightly improve WCE performance: F-mes1 is slightly better than the corresponding set-up in Table 11 (WCE for SLT/MT features only). We also observe that the simple combination (MT + ASR) degrades WCE performance. This may be due to the different behaviour of the \(WCE_{MT}\) and \(WCE_{ASR}\) classifiers, which makes the weighted combination ineffective. The relatively disappointing performance of our joint classifier may be due to an insufficient training set (only 2643 utterances in dev). Finally, removing the OccurInGoogleTranslate and OccurInBingTranslate features for Joint lowered the F-measure by between 1 and 2%.

These observations lead us to investigate the behaviour of our WCE approaches for a large range of good/bad decision thresholds.

Fig. 1 Evolution of system performance (y-axis: F-mes1, ASR1) for the tst corpus (4050 utt) along the decision-threshold variation (x-axis). Training is made on the dev corpus (2643 utt)

Fig. 2 Evolution of system performance (y-axis: F-mes2, ASR2) for the tst corpus (4050 utt) along the decision-threshold variation (x-axis). Training is made on the dev corpus (2643 utt)

While the previous tables provided WCE performance for a single point of interest (good/bad decision threshold set to 0.5), the curves in Figures 1 and 2 show the full picture of our WCE systems (for SLT) using speech-transcription systems ASR1 and ASR2, respectively. We observe that the classifier based on ASR features behaves very differently from the classifier based on MT features, which explains why their simple combination (MT + ASR) does not work very well at the default decision threshold (0.5). However, for thresholds above 0.75, the use of joint ASR and MT features is slightly beneficial compared to MT features only. This is interesting because higher thresholds improve the F-measure on bad labels and thus improve error detection. The curves are similar whichever ASR system is used. These results suggest that with enough development data for appropriate threshold tuning (which we do not have for this very new task), the use of both ASR and MT features should improve error detection in speech translation (blue and red curves are above the green curve for higher decision thresholds).Footnote 12 Although not reported here, we also analyzed the F-measure curves for bad and good labels separately: if we consider, for instance, the ASR1 system at a decision threshold of 0.75, the F-measure on bad labels is equivalent (52%) for the 3 systems (Joint, MT + ASR and MT), while the F-measure on good labels is 76% when using MT features only, 78% when using Joint features and 77% when using MT + ASR features. In other words, for a fixed performance on bad labels, the F-measure on good labels is improved by using all the information available (ASR and MT features). Finally, if we compare Joint with MT + ASR, we notice that the range of thresholds over which performance remains stable is larger for Joint than for MT + ASR.

6 Feature selection

In this section, we try to better understand the contribution of each (ASR or MT) feature by applying feature selection on our joint WCE classifier. In these experiments, we decide to keep OccurInGoogleTranslate and OccurInBingTranslate features.

We choose the Sequential Backward Selection (SBS) algorithm (Aha and Bankert 1996), a top-down algorithm that starts from the set of all features, noted \(Y_k\), and sequentially removes the most irrelevant feature x, i.e. the one whose removal maximizes the Mean F-Measure \(MF(Y_k-x)\). In our work, we continue until the set \(Y_k\) contains only one remaining feature. Algorithm 1 summarizes the whole process.

Algorithm 1: Sequential Backward Selection (SBS)
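A minimal sketch of the SBS loop described above is shown below; the `evaluate` callback is assumed to train a WCE classifier restricted to the given feature subset and return its mean F-measure on the dev data, standing in for the actual CRF training and evaluation pipeline.

```python
def sequential_backward_selection(all_features, evaluate):
    """Return the features ranked from most to least important according to SBS."""
    current = list(all_features)
    removed = []  # features in removal order (removed first = least useful)
    while len(current) > 1:
        # remove the feature whose removal maximizes the mean F-measure MF(Y_k - x)
        scores = {x: evaluate([f for f in current if f != x]) for x in current}
        worst = max(scores, key=scores.get)
        current.remove(worst)
        removed.append(worst)
    removed.append(current[0])        # the last remaining feature is the most important
    return list(reversed(removed))    # most important first
```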

The results of the SBS algorithm can be found in Table 14, which ranks all the joint features used in WCE for SLT by order of importance after applying the algorithm on dev. We can see that the SBS algorithm is not very stable and is clearly influenced by the ASR system (ASR1 or ASR2) used in SLT. In any case, if we focus on the 10 best features in both cases, we find that the most relevant ones are:

  • Alignment Features (source and target collocation features),

  • Occur in Google Translate and Occur in Bing Translate (diagnostic from other MT systems),

  • Longest Source N-gram Length, Target Backoff Behaviour (source or target N-gram features),

  • Word Posterior Probability Max (WPP Max) (graph-topology feature).

We also observe that the most relevant ASR features (see Table 14) are F-back, F-3g and F-context (linguistic and context features), whereas the ASR lexical, acoustic and graph-based features are among the worst (F-POS, F-dur and F-post). Accordingly, in our experimental setting, MT features appear to be more influential than ASR features. Interestingly, “source and target collocation features” (Alignment Features) and “Occur in Bing Translate” are the most prominent features (rank 1 and rank 2, respectively) when the algorithm is applied to the dev corpus for both ASR1 and ASR2. Besides, among the graph-topology features extracted from the confusion network, WPP Max outperforms the others, such as Nodes and WPP Min. Nevertheless, two other features, WPP Exact and WPP Any, turn out to be weak, as shown by their bottom-most positions for the two systems, whereas we expected to see them among the top features (Luong et al. (2015) report WPP Any as one of the best features for WCE in MT).

Table 14 Rank of each feature according to the sequential backward selection algorithm on the WCE for SLT task using Joint (ASR,MT) features
Fig. 3 Evolution of WCE performance for the dev (features selected) and tst corpora when feature selection using the SBS algorithm is made on dev (ASR1 system only; the same shape is observed for ASR2)

Figure 3 presents the evolution of WCE performance for the dev and tst corpora when feature selection with the SBS algorithm is made on dev, for the ASR1 system (the same shape is observed for the ASR2 system). In other words, for this figure we apply our SBS algorithm on dev, which means that feature selection is done on dev with classifiers trained on tst. After that, the best feature subsets (with 33, 32, 31, ..., down to 1 feature) are applied to the tst corpus (with classifiers trained on dev).Footnote 13

In the figure, we observe that only about half of the features contribute to the WCE process, since the best performance is observed with only 15 to 25 features. We also see that optimal WCE performance is not necessarily obtained with the full feature set; it can be reached with a subset of it.

7 Disentangling ASR and MT errors

In the previous sections, we only extracted good/bad labels from the SLT output, while it might be interesting to move from a 2-class to a 3-class problem in order to label our SLT hypotheses with one of the 3 following labels: good (G), asr-error (B_ASR) and mt-error (B_MT). Before training automatic systems for error detection, we need to set such 3-class labels for our dev and test corpora. For that, the next subsections propose two slightly different methods to extract them: the first is based on the word alignments from SLT to MT, and the second on a subtraction between SLT and MT errors.

7.1 Method 1: using word alignments between MT and SLT

In MT, the fertility of a source word denotes how many output words it translates as. If we transpose this definition to our disentangling problem, then fertility of an MT error denotes how many erroneous words—in the SLT output—it is aligned to. From this simple definition, we derive our first way (Method 1) of generating 3-class annotations.

Let \(\hat{e}_{slt} = (e_1, e_2,\ldots , e_n)\) be the set of SLT hypotheses (\(e_{hyp_{slt}}\)); \(e_{k_{j}}\) denotes the jth word in the sentence \(e_k\), where \(1\le k \le n\).

Let \(\hat{e}_{mt} = (e'_1, e'_2, \ldots , e'_n)\) be the set of MT hypotheses (\(e_{hyp_{mt}}\)); \(e'_{k_{i}}\) denotes the ith word in the sentence \(e'_k\), where \(1\le k \le n\).

Let \(L = (l_1, l_2, \ldots , l_n)\) be the set of word alignments from sentences in \(e_{hyp_{slt}}\) to related sentences in \(e_{hyp_{mt}}\), where \(l_{k}\) contains the word alignments from sentence \(e_{k}\) to the relevant sentence \(e'_{k}\), \(1\le k \le n\); \((e_{k_{j}}, e'_{k_{i}})\) = True, if there is one word alignment between \(e_{k_{j}}\) and \(e'_{k_{i}}\); \((e_{k_{j}}, e'_{k_{i}})\) = False, otherwise.

Our algorithm for Method 1 is defined as Algorithm 2. This method relies on word alignments and uses MT labels. We also propose a simpler method in the next section.

Algorithm 2: 3-class label extraction using word alignments between SLT and MT (Method 1)
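The sketch below gives our reading of Method 1: an SLT word labelled B is attributed to MT if it is aligned to at least one MT word that is itself labelled B, and to ASR otherwise. It is a simplified stand-in for Algorithm 2, not a verbatim reimplementation.

```python
def method1_labels(slt_labels, mt_labels, alignment):
    """alignment: SLT word position -> list of aligned MT word positions."""
    labels3 = []
    for j, label in enumerate(slt_labels):
        if label == "G":
            labels3.append("G")
        elif any(mt_labels[i] == "B" for i in alignment.get(j, [])):
            labels3.append("B_MT")   # the error already exists in the MT output
        else:
            labels3.append("B_ASR")  # the error appears only in the SLT output
    return labels3
```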

7.2 Method 2: subtraction between SLT and MT errors

Our second way to extract 3-class labels (Method 2) focuses on the differences between the SLT hypothesis (\(e_{hyp_{slt}}\)) and the MT hypothesis (\(e_{hyp_{mt}}\)). We call it subtraction between SLT and MT errors because we simply consider that errors present in the SLT output but not in the MT output are due to ASR. The main difference from the previous method is that it does not rely on the extracted labels for MT.

Our intuition is that the number of mt-errors estimated will be slightly lower than for Method 1 since we first estimate the number of asr-errors and the rest is considered—by default—as mt-errors.

We use the same notation as for Method 1, except that \(L = (l_1, l_2, \ldots , l_n)\) is now the set of alignments obtained through edit distance between \(e_{hyp_{slt}}\) and \(e_{hyp_{mt}}\), where \(l_{k_i}\) corresponds to “Insertion”, “Substitution”, “Deletion” or “Exact”. Our algorithm for Method 2 is defined in Algorithm 3.

Algorithm 3: 3-class label extraction by subtraction between SLT and MT errors (Method 2)
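A simplified sketch of Method 2 is given below, using Python's difflib as a stand-in for the edit-distance alignment between the SLT and MT hypotheses: SLT words whose edit operation with respect to the MT output is a substitution or an insertion are attributed to ASR, and the remaining SLT errors to MT.

```python
import difflib

def method2_labels(slt_words, mt_words, slt_labels):
    edit_op = ["Exact"] * len(slt_words)
    matcher = difflib.SequenceMatcher(a=slt_words, b=mt_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":      # SLT words substituted with respect to MT
            for i in range(i1, i2):
                edit_op[i] = "Substitution"
        elif tag == "delete":     # SLT words with no MT counterpart (insertions in SLT)
            for i in range(i1, i2):
                edit_op[i] = "Insertion"
    return ["G" if label == "G" else ("B_ASR" if edit_op[i] != "Exact" else "B_MT")
            for i, label in enumerate(slt_labels)]
```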

7.3 Example with 3-label setting

Table 15 gives the edit distance between an SLT and an MT hypothesis, while Table 16 shows how Method 1 and Method 2 assign 3-class labels to the SLT hypothesis. One transcript (\(f_{hyp}\)) has 1 error. This leads to 3 B labels in the SLT output (\(e_{hyp_{slt}}\)), while \(e_{hyp_{mt}}\) has only 2 B labels. As can be seen, Method 1 and Method 2 respectively yield (1 B_ASR, 2 B_MT) and (2 B_ASR, 1 B_MT).

Table 15 Example of edit distance between SLT and MT
Table 16 Example of quintuplet with 2 and 3-label

These differences can be explained by the word alignments from the SLT hypothesis to the corresponding MT hypothesis. As Table 16 shows, “is” (SLT hypothesis) is aligned to “have” (MT hypothesis) and “have” is labeled “B”. Therefore, “is” is labeled B_MT by Method 1. However, with Method 2, “is” is labeled B_ASR because the edit operation between “is” (SLT hypothesis) and “have” (MT hypothesis) is a substitution (S), as shown in Table 15.

7.4 Statistics with 3-label setting on the whole corpus

Tables 17 and 18 present the summary statistics for the distribution of good (G), asr-error (B_ASR) and mt-error (B_MT) labels obtained with both label-extraction methods. We see that both methods give similar statistics but slightly different rates of B_ASR and B_MT.

Table 17 Statistics with 3-label setting for ASR1
Table 18 Statistics with 3-label setting for ASR2

Comparing Tables 17 and 18, it is interesting to note that as the ASR system improves from ASR1 to ASR2, the rate of B_ASR labels logically decreases by more than 2 points, while the rate of B_MT remains almost stable (less than 1 point difference), which makes sense since the MT system is the same in both tables. These statistics suggest that the intersection of the two methods probably provides a good way to disentangle ASR and MT errors in SLT.

7.5 Qualitative analysis of SLT errors

Our new 3-label setting procedure allows us to analyze the behaviour of our SLT system. We omit examples here, but they are made available as supplementary material to this paper on a Web link.Footnote 14 Nonetheless, we can observe sentences with few ASR and MT errors leading to many SLT errors. Indeed, this is a good way of detecting flaws in the SLT pipeline, such as bad post-processing of the SLT output (numerical or text dates, for instance). In contrast, there are cases where many ASR errors lead to few SLT errors (ASR errors with few consequences, such as morphological substitutions, for instance in French: de/des, déficit/déficits, budgétaire/budgétaires). Finally, some ASR errors have different consequences on SLT quality depending on the system (on a sample sentence, 2 ASR errors in Systems 1 and 2 lead to 14 and 9 SLT errors, respectively).

7.6 Experiments on 3-class error detection

We report in Table 19 our first attempt to build an error-detection system for SLT as a 3-class problem (joint approach only). We conducted our experiment by training and evaluating the model on Intersection(m1, m2), which corresponds to high confidence in the labels.Footnote 15 In addition to giving better-informed error detection (\(B_{ASR}\) and \(B_{MT}\) instead of B), we note that 3-class error detection leads to overall similar results if we back off to a good/bad decision (\(F_{avg}\) becomes 62.5 on ASR1 and 61.00 on ASR2 in that case).

Table 19 Error-detection performance (2 vs. 3-labels) on SLT output for the tst set

8 Conclusion

8.1 Main contributions

In this paper, we introduced a new quality-assessment task: word confidence estimation (WCE) for spoken language translation (SLT). A specific corpus, distributed to the research community, was built for this purpose. We formalized WCE for SLT and proposed several approaches based on different types of features: MT-based features, ASR-based features, as well as combined or joint features using both ASR and MT information. The proposal of a single joint classifier based on different feature types (ASR and MT features) allowed us to perform feature selection and analyze which features (from ASR or MT) are the most efficient for quality assessment in speech translation. Our experiments have shown that MT features remain the most influential, while ASR features can bring interesting complementary information. For the purpose of reproducible research, our toolkit has been made available on a GitHub repository under the GPL v3 licence. We hope that the availability of our corpus and toolkit can lead, in the near future, to a new shared task dedicated to quality estimation for speech translation; such a shared task could be proposed in venues such as IWSLT or WMT, for instance. Towards the end of the paper, we proposed to disentangle ASR and MT errors and recast WCE as a 3-label setting problem.

8.2 Perspectives

A direct application of this work is the use of WCE labels to re-decode speech-translation graphs and (hopefully) improve speech-translation performance. Preliminary results have already been obtained and published by the authors of this paper (Besacier et al. 2015). The main idea is to carry out a second speech-translation pass by considering every word and its quality-assessment label, as shown in Eq. (4).

In addition to re-decoding SLT graphs, our quality-assessment system can be used in interactive speech-translation scenarios, such as news or lecture subtitling, to improve human translators' productivity by giving them feedback on automatic transcription and translation quality. Another application would be the adaptation of our WCE system to interactive speech-to-speech translation scenarios where feedback on the transcription and translation modules is needed to improve communication. Finally, in this paper engineered features were used for WCE; a natural perspective is to learn the WCE features automatically, as is now possible with deep neural networks, for instance.