1 Introduction

Automatic quality assessment of spoken language translation (SLT), also known as confidence estimation (CE), is an important topic because it allows us to know whether a system produces user-acceptable outputs or not. In interactive speech-to-speech translation, CE helps to judge whether a translated term is uncertain (in which case we can ask the speaker to rephrase or repeat it). For speech-to-text applications, CE may tell us whether output translations are worth correcting, or whether they require retranslation from scratch. Moreover, accurate CE can also help to improve SLT itself through a second-pass N-best list re-ranking or search-graph re-decoding, as has already been done for text translation in Bach et al. (2011) and Luong et al. (2014b), or for speech translation in Besacier et al. (2015). Consequently, building a method capable of pointing out the correct parts as well as detecting the errors in a speech-translated output is crucial for tackling the above issues.

Given a signal \(x_{f}\) in the source language, spoken language translation (SLT) consists of finding the most probable target language sequence \(\hat{e} = (e_{1}, e_{2},\ldots ,e_{N})\), as in (1):

$$\begin{aligned} \hat{e} = \,\mathop {\arg \max }\limits _{e} \left\{ {p(e|x_{f} ,f)} \right\} \end{aligned}$$
(1)

where \(f = (f_{1}, f_{2},\ldots ,f_{M})\) is the transcription of \(x_{f}\). Now, if we perform confidence estimation at the “word” level, the problem is called word-level confidence estimation (WCE) and we can represent this information as a sequence q (of the same length N as \(\hat{e}\)) where \(q = (q_{1}, q_{2},\ldots ,q_{N})\) and \(q_{i}\in \{good,bad\}\).Footnote 1

Then, integrating automatic quality assessment into our SLT process (q is defined above) can be done as in (2)–(4):

$$\begin{aligned}&\displaystyle \hat{e} = \,\mathop {\arg \max }\limits _{e} \sum \limits _{q} {\left\{ {p(e,q|x_{f} ,f)} \right\} }\end{aligned}$$
(2)
$$\begin{aligned}&\displaystyle \hat{e} = \,\mathop {\arg \max }\limits _{e} \sum \limits _{q} {\left\{ {p(q|x_{f},f,e)*p(e|x_{f},f)} \right\} }\end{aligned}$$
(3)
$$\begin{aligned}&\displaystyle \hat{e} \approx \,\mathop {\arg \max }\limits _{e} \{ \mathop {\max }\limits _{q} \{ p(q|x_{f} ,f,e) * p(e|x_{f} ,f)\} \} \end{aligned}$$
(4)

In the product of (4), the SLT component \(p(e|x_{f},f)\) and the WCE component \(p(q|x_{f},f,e)\) contribute together to find the best translation output \(\hat{e}\). In the past, WCE has been treated separately in ASR or MT contexts and we propose here a joint estimation of word confidence for an SLT task involving both ASR and MT.

This journal paper is an extended version of a paper published at ASRU 2015 (Besacier et al. 2015), but here we focus more on the WCE component and on the best approaches to accurately estimate \(p(q|x_{f},f,e)\).

Contributions A corpus (distributed to the research community)Footnote 2 dedicated to WCE for SLT was initially published in Besacier et al. (2014). In this paper, we present its extension from 2643 to 6693 speech utterances. In addition, while our previous work on quality assessment was based on two separate WCE classifiers (one for quality assessment in ASR and one in MT), we propose here a single joint model based on different feature types (ASR and MT features). This joint model allows us to perform feature selection and analyze which features (from ASR or MT) are the most efficient for quality assessment in speech translation. We also experiment with two ASR systems of different performance levels in order to analyze the behaviour of our SLT quality-assessment algorithms at different levels of word error rate (WER) (Levenshtein 1966). The last part of this paper proposes to disentangle ASR and MT errors in speech translation by automatically detecting whether each SLT error originates from ASR or from MT.

Outline The outline of this paper is as follows: Sect. 2 reviews the state-of-the-art on confidence estimation for ASR and MT. Our word-confidence estimation (WCE) system using multiple features is then described in Sect. 3. The experimental setup (namely our specific WCE corpus) is presented in Sect. 4 while Sect. 5 evaluates our joint WCE system. Feature selection for quality assessment in speech translation is analyzed in Sect. 6. Section 7 proposes to disentangle ASR and MT errors in SLT output and finally, Sect. 8 concludes this work and gives some future perspectives.

2 Related work on confidence estimation for ASR and MT

Several previous works have proposed effective confidence measures for detecting errors in ASR outputs. Confidence measures were introduced for Out-Of-Vocabulary (OOV) detection by Asadi et al. (1990). Young (1994) extended this work and introduced the use of word posterior probability (WPP) as a confidence measure for speech recognition. The posterior probability of a word is most often computed from the hypothesis word graph (Kemp and Schaaf 1997). More recent approaches to confidence-measure estimation (Lecouteux et al. 2009) use side-information extracted from the recognizer: normalized likelihoods (WPP), the number of competitors at the end of a word (hypothesis density), decoding-process behaviour, linguistic features, acoustic features (acoustic stability, duration features), and semantic features. In addition, ASR quality estimation has been addressed in several recent studies (de Souza et al. 2015; Zamani et al. 2015; Jalalvand et al. 2016). Re-scoring the ASR N-best list using ASR and MT features was also proposed by Ng et al. (2014, 2015, 2016).

In parallel, the Workshop on Machine Translation (WMT)Footnote 3 introduced a WCE task for machine translation in 2013. Han et al. (2013) and Luong et al. (2013b) employed the Conditional Random Fields (CRF) model (Lafferty et al. 2001) as their machine-learning method to address the problem as a sequence-labelling task. Meanwhile, Biçici (2013) extended his earlier approach with dynamic training and adaptive weight updates in a neural network classifier. As far as prediction indicators are concerned, Biçici (2013) proposed seven word feature types and found the “common cover links” (the links that point from the leaf node containing the word to other leaf nodes in the same subtree of the syntactic tree) to be the most effective among them. Han et al. (2013) focused only on various N-gram combinations of target words. Inheriting most of the previously recognized features, Luong et al. (2013b) integrated a number of new indicators relying on graph topology, pseudo references, syntactic behaviour (constituent label, distance to the semantic tree root) and polysemy characteristics. The estimation of the confidence score mainly relies on classifiers such as CRF (Han et al. 2013; Luong et al. 2014a), Support Vector Machines (Langlois et al. 2012), or neural networks (Biçici 2013). Some investigations were also conducted to determine which features are the most relevant. Langlois et al. (2012) proposed to filter features using a forward-backward algorithm to discard linearly correlated features. Using boosting as the learning algorithm, Luong et al. (2015) were able to take advantage of the most significant features.

Finally, several toolkits for WCE have recently been proposed: TranscRater for ASR (Jalalvand et al. 2016),Footnote 4 QuEst++ for MT (Specia et al. 2015),Footnote 5 MARMOT for MT (Logacheva et al. 2016),Footnote 6 as well as the WCE toolkit (Servan et al. 2015)Footnote 7 that is used to extract MT features in the experiments of this paper.

To the best of our knowledge, the first attempt to design WCE for speech translation, using both ASR and MT features in a single classifier, was our own work (Besacier et al. 2014, 2015) which is further extended in this paper.

3 Building an efficient quality assessment (WCE) system

The WCE component solves equation (5):

$$\begin{aligned} \hat{q}=\,\mathop {\arg \max }\limits _{q} \{p_{SLT}(q|x_{f},f,e) \} \end{aligned}$$
(5)

where \(q = (q_{1}, q_{2},\ldots ,q_{N})\) is the sequence of quality labels on the target language. This is a sequence-labelling task that can be solved with several machine-learning techniques such as CRF. However, this requires a large amount of training data for which the quadruplet \((x_{f},f,e,q)\) is available. In this work, we use a corpus extended from Besacier et al. (2014) which contains 6.7k utterances, and we investigate whether this amount of data is sufficient to train and evaluate a joint model \(p_{SLT}(q|x_{f},f,e)\).

As it is much easier to obtain data containing either the triplet \((x_{f},f,q)\) (automatically transcribed speech with manual references and quality labels inferred from WER estimation), or the triplet \((f,e,q)\) (automatically translated text with manual post-edits and quality labels inferred using tools such as TERp-A (Snover et al. 2009)), we can also recast the WCE problem as in (6):

$$\begin{aligned} \hat{q}=\,\mathop {\arg \max }\limits _{q} \{p_{ASR}(q|x_{f},f)^\alpha *p_{MT}(q|e,f)^{1-\alpha }\} \end{aligned}$$
(6)

where \(\alpha \) is a weight giving more or less importance to \(WCE_{ASR}\) (quality assessment on transcription) compared to \(WCE_{MT}\) (quality assessment on translation). It is important to note that \(p_{ASR}(q|x_{f},f)\) corresponds to the quality estimation of words in the target language based on features computed on the source language (ASR). To achieve this, we project source quality scores onto the target side using word-alignment information between the e and f sequences. This alternative approach (Eq. (6)) will also be evaluated in this work even though it corresponds to a different optimization problem than Eq. (5). In particular, \(\alpha \) is only set a priori (to 0.5) in our experiments, which is probably not the optimal choice.
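To make the combination in Eq. (6) concrete, the following minimal sketch applies it word by word, assuming per-word probabilities of the good label are already available from the two classifiers together with a source-to-target word alignment. All names, the averaging over multiple aligned source words and the unaligned-word fallback are illustrative assumptions, not the exact implementation used in our experiments.

```python
def combine_confidences(p_asr_good, p_mt_good, alignment, alpha=0.5, threshold=0.5):
    """Word-level combination of Eq. (6): p_asr^alpha * p_mt^(1-alpha)."""
    labels = []
    for tgt_idx, p_mt in enumerate(p_mt_good):
        src_positions = alignment.get(tgt_idx, [])
        if src_positions:
            # project the ASR-based confidence to the target side via word alignment
            p_asr = sum(p_asr_good[s] for s in src_positions) / len(src_positions)
        else:
            p_asr = 1.0  # unaligned target word: no ASR evidence available
        combined = (p_asr ** alpha) * (p_mt ** (1.0 - alpha))
        labels.append("good" if combined >= threshold else "bad")
    return labels

# Toy example: 3 source words, 4 target words
p_asr_good = [0.9, 0.4, 0.8]                    # P(good) per source word (WCE_ASR)
p_mt_good = [0.95, 0.6, 0.3, 0.7]               # P(good) per target word (WCE_MT)
alignment = {0: [0], 1: [1], 2: [1, 2], 3: []}  # target position -> source positions
print(combine_confidences(p_asr_good, p_mt_good, alignment))
# -> ['good', 'bad', 'bad', 'good']
```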

In both approaches—joint (\(p_{SLT}(q|x_{f},f,e)\)) and combined (\(p_{ASR}(q|x_{f},f)\) + \(p_{MT}(q|e,f)\))—some features need to be extracted from the ASR and MT modules. They are detailed in the next subsections.

3.1 WCE features for speech transcription (ASR)

In this work, we extract several types of features, which come from the ASR graph, from language model scores and from a morphosyntactic analysis. These features are listed below [more details can be found in Besacier et al. (2014)]:

  • Acoustic features: word duration (F-dur).

  • Graph features (extracted from the ASR word confusion networks): number of alternative paths between two nodes (F-alt); word posterior probability (F-post).

  • Linguistic features (based on language-model probabilities): the word itself (F-word), 3-gram probability (F-3g), log probability (F-log), back-off level of the word (F-back), as proposed in Fayolle et al. (2010).

  • Lexical features: Part-Of-Speech (POS) tag of the word (F-POS).

  • Context features: Part-Of-Speech tags in the neighborhood of a given word (F-context).

For each word in the ASR hypothesis, we estimate these 9 features: F-Word; F-3g; F-back; F-log; F-alt; F-post; F-dur; F-POS; and F-context.
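To make this feature set concrete, the snippet below shows how the 9 ASR features of a single hypothesis word could be represented before being handed to the classifier; the values are invented for illustration and the exact encoding used by our toolkit may differ.

```python
# Illustrative ASR feature vector for one hypothesis word (values are invented).
asr_features = {
    "F-word": "budget",           # the word itself
    "F-3g": 0.0123,               # 3-gram language-model probability
    "F-log": -1.91,               # log probability
    "F-back": 1,                  # back-off level used by the LM for this word
    "F-alt": 4,                   # alternative paths between the two confusion-network nodes
    "F-post": 0.87,               # word posterior probability
    "F-dur": 0.42,                # word duration (seconds)
    "F-POS": "NOUN",              # part-of-speech tag
    "F-context": ("DET", "ADJ"),  # POS tags of the neighbouring words
}
```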

In a preliminary experiment, we evaluate these features for quality assessment in ASR only (the \(WCE_{ASR}\) task). Two different classifiers are used: a variant of the boosting classification algorithm called bonzaiboost (Laurent et al. 2014), which implements the Adaboost.MH boosting algorithm over deeper trees, and CRF.

3.2 WCE features for machine translation (MT)

A number of knowledge sources are employed to extract a total of 24 major feature types, as listed in Table 1.

Table 1 List of MT features extracted

It is important to note that features are extracted for the tokens of the translated hypothesis (MT or SLT); in other words, each feature is extracted for each token of the MT output. In Table 1, target therefore refers to a feature computed from the translated hypothesis, and source refers to a feature extracted from the source word aligned to the considered target word. More details on some of these features are given in the next subsections.

3.2.1 Internal features

These features are given by the MT system, which outputs additional data like an N-best list.

Word Posterior Probability (WPP) and Nodes features are extracted from a confusion network built from the MT N-best list. WPP Exact is the WPP value of the word at the exact same position in the graph, while WPP Any considers the word at any position in the graph. WPP Min gives the smallest WPP value among the transitions concerned and WPP Max the largest one.
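A rough sketch of how such WPP-style features could be computed is given below, with the confusion network represented as a list of slots mapping candidate words to posterior probabilities. The precise definitions of WPP Min/Max and Nodes in our toolkit may differ; this only illustrates the "exact position" versus "any position" distinction.

```python
def wpp_features(confusion_net, word, position):
    """confusion_net: list of slots, each slot a dict {candidate word: posterior}."""
    slot = confusion_net[position]
    return {
        "WPP Exact": slot.get(word, 0.0),                                        # word at its own position
        "WPP Any": max((s.get(word, 0.0) for s in confusion_net), default=0.0),  # word anywhere in the graph
        "WPP Min": min(slot.values()),                                           # smallest posterior in the slot
        "WPP Max": max(slot.values()),                                           # largest posterior in the slot
        "Nodes": len(slot),                                                      # number of alternatives
    }

cn = [{"the": 0.9, "a": 0.1}, {"budget": 0.7, "budgets": 0.3}]
print(wpp_features(cn, "budget", 1))
```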

3.2.2 External features

Below is the list of the external features used:

  • Proper name: indicates whether a word is a proper name; the same binary features are extracted to indicate whether a token is Numerical, Punctuation, or a Stop Word.

  • Unknown stem: indicates whether the stem of the considered word is known or not.

  • Number of word/stem occurrences: counts the occurrences of a word/stem in the sentence.

  • Alignment context features: these features (#11–13 in Table 1) are based on collocations and were proposed by Bach et al. (2011). Collocations can be an indicator for judging whether a target word is generated by a particular source word. We also apply the reverse, i.e. collocations on the source side (#7 in Table 1, simply called Alignment Features):

    • \(\diamondsuit \)   Source alignment context features: the combinations of the target word, the source word (with which it is aligned), and one source word before and one source word after (left and right contexts, respectively).

    • \(\diamondsuit \)   Target alignment context features: the combinations of the source word, the target word (with which it is aligned), and one target word before and one target word after.

  • Longest target (or source) N-gram length: the length (\(n+1\)) of the longest left-context sequence \(w_{i-n},\ldots ,w_{i}\) ending at the current word (\(w_i\)) that is known by the corresponding language model (LM) (source or target side). For example, if the longest such sequence \(w_{i-2},w_{i-1},w_i\) appears in the target LM, the longest target N-gram value for \(w_i\) will be 3. This value ranges from 0 to the maximum order of the LM concerned. We also extract a redundant feature called Backoff Behaviour Target (a sketch of this feature is given after this list).

  • The target word’s constituent label (Constituent Label) and its depth in the constituent tree (Distance to Root) are extracted using a syntactic parser.

  • Target polysemy count: the polysemy count of the target word, i.e. its number of meanings in the target language.

  • Occurrence in Google Translate and Occurrence in Bing Translator: we (optionally) test whether the target word of the translation hypothesis also appears in the on-line translations given by Google Translate and Bing Translator, respectively.Footnote 8
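As announced above, here is a small sketch of the longest-known-N-gram feature; the `ngram_is_known` membership test is a stand-in for querying an actual language model.

```python
def longest_ngram_length(words, i, ngram_is_known, max_order=4):
    """Length of the longest left-context n-gram ending at words[i] that is known by the LM."""
    length = 0
    for n in range(1, min(max_order, i + 1) + 1):
        if ngram_is_known(tuple(words[i - n + 1:i + 1])):
            length = n
        else:
            break
    return length  # ranges from 0 (word unknown) to max_order

# Toy "LM" containing a few n-grams
known = {("the",), ("budget",), ("deficit",), ("the", "budget")}
words = ["the", "budget", "deficit"]
print([longest_ngram_length(words, i, lambda g: g in known) for i in range(len(words))])
# -> [1, 2, 1]
```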

A very similar feature set was used for a pure \(WCE_{MT}\) task (English–Spanish MT, WMT 2013 and 2014 quality-estimation shared tasks) and obtained very good performance (Luong et al. 2013a). This preliminary experience participating in the WCE shared tasks of 2013 and 2014 led us to the following observation: while feature processing is very important to achieve good performance, it requires a set of heterogeneous NLP tools (for lexical, syntactic, and semantic analysis). Thus, we recently proposed to unify the feature processing, together with the call of machine-learning algorithms, in order to facilitate the design of confidence-estimation systems. The resulting open-source toolkit (written in Python and made available on github)Footnote 9 integrates standard as well as in-house features that have proven useful for WCE (based on our experience at WMT 2013 and 2014).

In this paper, we use only CRF as our machine-learning method, with the WAPITI toolkit (Lavergne et al. 2010), to train our WCE estimator based on both MT and ASR features.

4 Experimental setup

4.1 Dataset

4.1.1 Starting point: an existing MT post-edition corpus

For a French–English translation task, we used our SMT system to obtain translation hypotheses for 10,881 source sentences taken from news corpora of the WMT evaluation campaigns from 2006 to 2010. Post-editions were obtained from non-professional translators using a crowdsourcing platform. More details on the baseline SMT system can be found in Potet et al. (2010), and more details on the post-edited corpus in Potet et al. (2012). It is worth mentioning, however, that a subset (311 sentences) of these collected post-editions was assessed by a professional translator, who judged 87.1% of the post-edits to improve the hypothesis.

Table 2 Example of training label obtained using TERp-A

Then, the word-label setting for WCE was done using the TERp-A toolkit (Snover et al. 2009). Table 2 illustrates the labels generated by TERp-A for one hypothesis and post-edition pair. Each word or phrase in the hypothesis is aligned to a word or phrase in the post-edition with different types of edit operations: “I” (insertions), “S” (substitutions), “T” (stem matches), “Y” (synonym matches), and “P” (phrasal substitutions). The lack of a symbol indicates an exact match and will be replaced by “E” thereafter. We do not consider words marked with “D” (deletions) since they appear only in the reference. However, later on, we will have to train binary classifiers (good/bad) so we re-categorize the obtained 6-label set into a binary set: E, T and Y belong to the good (G) class, whereas S, P and I belong to the bad (B) category.
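The re-categorization described above amounts to a simple mapping from TERp-A edit types to binary labels, sketched below ("D" deletions are dropped beforehand since they only concern the reference side).

```python
TERPA_TO_BINARY = {
    "E": "G",  # exact match
    "T": "G",  # stem match
    "Y": "G",  # synonym match
    "S": "B",  # substitution
    "P": "B",  # phrasal substitution
    "I": "B",  # insertion
}

def to_binary_labels(terpa_labels):
    return [TERPA_TO_BINARY[label] for label in terpa_labels if label != "D"]

print(to_binary_labels(["E", "S", "Y", "I", "D", "T"]))  # ['G', 'B', 'G', 'B', 'G']
```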

4.1.2 Extending the corpus with speech recordings and transcripts

The dev and tst sets of this corpus were recorded by native French speakers. Each sentence was uttered by 3 speakers, leading to 2643 and 4050 speech recordings for the dev set and tst set, respectively. For each speech utterance, a quintuplet containing the ASR output (\(f_{hyp}\)), the verbatim transcript (\(f_{ref}\)), the English text-translation output (\(e_{hyp_{mt}}\)), the speech-translation output (\(e_{hyp_{slt}}\)) and the post-edition of the translation (\(e_{ref}\)) was made available. This corpus is available on a github repository.Footnote 10 More details are given in Table 3. The total length of the dev and tst speech corpora is 16h52, since some utterances are quite long.

Table 3 Details on our dev and tst corpora for SLT

4.2 ASR systems

To obtain the speech transcripts (\(f_{hyp}\)), we built a French ASR system based on the KALDI toolkit (Povey et al. 2011). Acoustic models are trained on several corpora (ESTER, REPERE, ETAPE and BREF120) representing more than 600 h of transcribed French speech.

The baseline GMM system is based on mel-frequency cepstral coefficient (MFCC) acoustic features (13 coefficients expanded with delta and double-delta features and energy: 40 features) with various feature transformations including linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT), and feature-space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT). The GMM acoustic model provides the initial phoneme alignments of the training data for the subsequent DNN acoustic model training.

The speech transcription process is carried out in two passes: an automatic transcript is generated with a GMM-HMM model of 43,182 states and 250,000 Gaussians. Word-graph outputs obtained during this first pass are then used to compute an fMLLR-SAT transform for each speaker. The second pass is performed using a DNN acoustic model trained on acoustic features normalized with the fMLLR matrix.

The CD-DNN-HMM acoustic models (43,182 context-dependent states) are trained using the GMM-HMM topology.

We use two 3-gram language models, trained on the French ESTER corpus (Galliano et al. 2006) and on French Gigaword (vocabulary sizes of 62k and 95k, respectively). The LM weight parameters of the ASR systems are tuned on WER over the dev corpus. Details on these two language models can be found in Table 4.

In our experiments, we propose two ASR systems based on the previously described language models. The first system (ASR1) uses the small language model, allowing a fast ASR system (about 2\(\times \) real time), while the second system (ASR2) rescores the lattices with the big language model during a third pass (about 10\(\times \) real time).

Table 4 Details on language models (LMs) used in our two ASR systems

Table 5 presents the performance obtained by the two ASR systems described above.

Table 5 ASR performance (WER) on our dev and tst set for the two different ASR systems

These WER scores may appear rather high for the task of transcribing read news. A deeper analysis shows that these news items contain a lot of foreign named entities, especially in our dev set. This part of the data is extracted from French media dealing with the European economy. This could also explain why the scores are significantly different between the dev and tst sets. In addition, automatic post-processing is applied to the ASR output in order to match the requirements of standard input for MT.

4.3 SMT system

We used the Moses phrase-based translation toolkit (Koehn et al. 2007) to translate the French ASR output into English (\(e_{hyp}\)). This medium-sized system was trained on a subset of the data provided for the IWSLT 2012 evaluation (Federico et al. 2012): the Europarl, Ted and News-Commentary corpora, for a total of about 60M words. We used an adapted target language model trained on specific data (News Crawled corpora) similar to our evaluation corpus (see Potet et al. (2010)). This standard SMT system is used in all experiments reported in this paper (Tables 6, 7).

4.4 Obtaining quality assessment labels for SLT

After building an ASR system, we have a new element of our desired quintuplet: the ASR output \(f_{hyp}\), which is the noisy version of our already available verbatim transcripts called \(f_{ref}\). This ASR output (\(f_{hyp}\)) is then translated by the exact same SMT system (Potet et al. 2010) mentioned in Subsect. 4.3. This new output translation is called \(e_{hyp_{slt}}\) and is a degraded version of \(e_{hyp_{mt}}\) (translation of \(f_{ref}\)).

At this point, a strong assumption we made has to be made explicit: we re-use the post-editions obtained from the text-translation task (called \(e_{ref}\)) to infer the quality (G, B) labels of our speech-translation output \(e_{hyp_{slt}}\). The word-label setting for WCE is done with the TERp-A toolkit (Snover et al. 2009) between \(e_{hyp_{slt}}\) and \(e_{ref}\). This assumption, namely that the initial MT post-editions can also be used to infer labels for an SLT task, is reasonable in the light of the results presented later in Tables 8 and 9, which show that there is not a huge difference between MT and SLT performance (evaluated with BLEU) (Table 10).

The above remark is important and is what makes this corpus valuable. For instance, other corpora such as the TED corpus could be used to obtain a quintuplet with ASR output, verbatim transcript, MT output, SLT output and target translation, but there are two main differences: first, in TED the target translation is a manual translation of the subtitles rather than a post-edition of an automatic translation (so we have no guarantee that good/bad labels extracted from it would be reliable for WCE training and testing); secondly, in our corpus each sentence is uttered by 3 different speakers, which introduces speaker variability into the database and allows us to deal with different ASR outputs for a single source sentence.Footnote 11

4.5 Final corpus statistics

The final corpus obtained is summarized in Table 6, where we also clarify how the WCE labels were obtained. For the test set, we now have all the data needed to evaluate WCE for 3 tasks:

  • ASR: good/bad labels are extracted by computing the WER between \(f_{hyp}\) and \(f_{ref}\) (a sketch of this label extraction is given after this list),

  • MT: good/bad labels are extracted by computing TERp-A between \(e_{hyp_{mt}}\) and \(e_{ref}\),

  • SLT: good/bad labels are extracted by computing TERp-A between \(e_{hyp_{slt}}\) and \(e_{ref}\).
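As an illustration of the first item, the sketch below derives good/bad labels for the ASR task from a word-level Levenshtein alignment between \(f_{hyp}\) and \(f_{ref}\): hypothesis words involved in a substitution or insertion receive B, exact matches receive G (reference-only deletions produce no hypothesis label). This is only a minimal stand-in for the actual label-extraction tooling.

```python
def asr_wce_labels(hyp, ref):
    """Label each hypothesis word G/B from the Levenshtein alignment with the reference."""
    n, m = len(hyp), len(ref)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # word inserted in the hypothesis
                          d[i][j - 1] + 1,         # word deleted from the reference
                          d[i - 1][j - 1] + cost)  # match or substitution
    labels, i, j = [None] * n, n, m
    while i > 0:  # backtrace to recover one minimal alignment
        if j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if hyp[i - 1] == ref[j - 1] else 1):
            labels[i - 1] = "G" if hyp[i - 1] == ref[j - 1] else "B"
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            labels[i - 1] = "B"  # inserted word
            i -= 1
        else:
            j -= 1               # deletion: no hypothesis word to label
    return labels

print(asr_wce_labels("the budget deficits".split(), "the budget deficit".split()))
# -> ['G', 'G', 'B']
```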

Table 6 Overview of our post-edition corpus for SLT

Table 7 gives an example of the quintuplet available in our corpus. One transcript (\(f_{hyp1}\)) has 1 error while the other (\(f_{hyp2}\)) has 4. This leads to 2 B labels (\(e_{hyp_{slt1}}\)) and 4 B labels (\(e_{hyp_{slt2}}\)) in the respective speech-translation outputs, while \(e_{hyp_{mt}}\) has only one B label.

Table 7 Example of quintuplet with associated labels

Tables 8 and 9 summarize the baseline ASR, MT and SLT performances obtained on our corpora, as well as the distribution of good (G) and bad (B) labels inferred for both tasks. Logically, the percentage of (B) labels increases from the MT to the SLT task under the same conditions.

Table 8 MT and SLT performances on our dev set
Table 9 MT and SLT performances on our tst set

5 Experiments on WCE for SLT

5.1 SLT quality assessment using only MT or ASR features

We first report in Table 11 the baseline WCE results obtained using MT or ASR features separately. In short, we evaluate the performance of 4 WCE systems for different tasks:

  • The first and second systems (WCE for ASR/ASR feat.) use ASR features described in Sect. 3.1 with two different classifiers (CRF or Boosting).

  • The third system (WCE for SLT/MT feat.) uses only the MT features described in Sect. 3.2 with the CRF classifier.

  • The fourth system (WCE for SLT/ASR feat.) uses only the ASR features described in Sect. 3.1 with the CRF classifier, i.e. predicting SLT output confidence using only ASR confidence features! Word-alignment information between \(f_{hyp}\) and \(e_{hyp}\) is used to project the WCE scores coming from ASR to the SLT output.

In all experiments reported in this paper, we evaluate the performance of our classifiers using the average of the F-measure for good labels and the F-measure for bad labels, each computed from the standard precision and recall of the corresponding label. Since two ASR systems are available, F-mes1 is obtained for SLT based on ASR1, whereas F-mes2 is obtained for SLT based on ASR2. For the results in Table 11, the classifier is trained on the dev part of our corpus and evaluated on the tst part.
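For clarity, a minimal sketch of this metric is given below; it assumes the label sequences have already been flattened over all words of the test set.

```python
def f_measure(reference, prediction, label):
    tp = sum(1 for r, p in zip(reference, prediction) if r == p == label)
    predicted = sum(1 for p in prediction if p == label)
    actual = sum(1 for r in reference if r == label)
    if tp == 0:
        return 0.0
    precision, recall = tp / predicted, tp / actual
    return 2 * precision * recall / (precision + recall)

def average_f_measure(reference, prediction):
    """Average of the F-measures for the good (G) and bad (B) classes."""
    return (f_measure(reference, prediction, "G") +
            f_measure(reference, prediction, "B")) / 2

ref = ["G", "G", "B", "B", "G"]
hyp = ["G", "B", "B", "B", "G"]
print(round(average_f_measure(ref, hyp), 3))  # -> 0.8
```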

Table 10 WCE performance baseline (%\(F_G\), %\(F_B\), %F-mes) on ASR1 and on ASR2 for tst set (random classifier generating G or B)
Table 11 WCE performance with different feature sets for tst set (training is made on dev set)

Concerning WCE for ASR, we observe that the F-measure decreases when the ASR WER is lower (F-mes2 < F-mes1 while \(WER_{ASR2}<WER_{ASR1}\)), i.e. quality assessment in ASR seems to become harder as the ASR system improves. This could be because the ASR1 errors recovered by the bigger LM of the ASR2 system were the easier ones to detect. Nevertheless, this conclusion should be considered with caution since the two results (F-mes1 and F-mes2) are not directly comparable: they are evaluated on different references (the proportions of good/bad labels differ because the ASR systems themselves differ). The effect of the classifier (CRF or Boosting) is not conclusive since CRF is better for F-mes1 and worse for F-mes2. In any case, we use CRF for all subsequent experiments since this is the classifier integrated in the WCE-LIG toolkit (Servan et al. 2015).

For WCE in SLT, the observed F-measure is better with MT features than with ASR features, i.e. quality assessment for SLT depends more on MT features than on ASR features. Again, the F-measure decreases when the ASR WER is lower (F-mes2 < F-mes1 while \(WER_{ASR2}<WER_{ASR1}\)). For MT features, removing the OccurInGoogleTranslate and OccurInBingTranslate features leads to 63.09% and 62.33% for F-mes1 and F-mes2, respectively. Finally, it is worth mentioning that the performance obtained by the classifiers in Table 11 is above the random baselines in Table 10.

In the next subsection, we investigate whether the use of both MT and ASR features improves quality assessment for SLT.

5.2 SLT quality assessment using both MT and ASR features

We report in Table 13 the WCE results for SLT obtained using both MT and ASR features. More precisely, we evaluate two different approaches (combination and joint):

  • The first system (WCE for SLT/MT+ASR feat.) combines the output of two separate classifiers based on ASR and MT features. In this approach, the ASR-based confidence score of the source is projected to the target SLT output and combined with the MT-based confidence score as shown in Eq. (6) (we did not tune the \(\alpha \) coefficient and set it a priori to 0.5).

  • The second system (joint feat.) trains a single WCE system for SLT (evaluating \(p(q|x_{f},f,e)\) as in Eq. (5)) using joint ASR and MT features. All ASR features are projected onto the target words using automatic word alignments. However, a problem occurs when a target word does not have any source word aligned to it; in this case, we duplicate the ASR features of the previous target word. Another problem occurs when a target word is aligned to more than one source word. In that case, several strategies can be used to infer the 9 ASR features: average or max over numerical values, selection or concatenation over symbolic values (for F-word and F-POS), etc. Three different variants of these strategies (shown in Table 12) are evaluated here; a sketch of this projection is given after this list.
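The sketch below illustrates one variant of this projection: averaging numerical ASR features and concatenating symbolic ones when a target word is aligned to several source words, and duplicating the previous target word's features when it is aligned to none. The other variants of Table 12 (e.g. max instead of average, selection instead of concatenation) follow the same pattern; the function and variable names are illustrative.

```python
NUMERIC = {"F-3g", "F-log", "F-back", "F-alt", "F-post", "F-dur"}
SYMBOLIC = {"F-word", "F-POS", "F-context"}

def project_asr_features(src_features, alignment, n_target):
    """src_features: list of per-source-word feature dicts; alignment: target index -> source indices."""
    projected = []
    for t in range(n_target):
        src_idx = alignment.get(t, [])
        if not src_idx:
            # unaligned target word: duplicate the previous target word's features
            projected.append(dict(projected[-1]) if projected else {})
            continue
        feats = {}
        for name in NUMERIC:    # average numerical features
            feats[name] = sum(src_features[s][name] for s in src_idx) / len(src_idx)
        for name in SYMBOLIC:   # concatenate symbolic features
            feats[name] = "_".join(str(src_features[s][name]) for s in src_idx)
        projected.append(feats)
    return projected
```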

Table 12 Different strategies to project ASR features to a target word when it is aligned to more than one source word
Table 13 WCE performance with combined (MT + ASR) or joint (MT, ASR) feature sets for tst set (training is made on dev set)

The results in Table 13 show that joint ASR and MT features only slightly improve WCE performance: F-mes1 is slightly better than the corresponding set-up in Table 11 (WCE for SLT/MT features only). We also observe that the simple combination (MT + ASR) degrades WCE performance. This may be due to the different behaviour of the \(WCE_{MT}\) and \(WCE_{ASR}\) classifiers, which makes the weighted combination ineffective. The relatively disappointing performance of our joint classifier may be due to an insufficient training set (only 2643 utterances in dev). Finally, removing the OccurInGoogleTranslate and OccurInBingTranslate features for Joint lowered the F-measure by between 1 and 2%.

These observations lead us to investigate the behaviour of our WCE approaches for a large range of good/bad decision thresholds.

Fig. 1 Evolution of system performance (y-axis: F-mes1, ASR1) for the tst corpus (4050 utt) along the decision-threshold variation (x-axis). Training is made on the dev corpus (2643 utt)

Fig. 2 Evolution of system performance (y-axis: F-mes2, ASR2) for the tst corpus (4050 utt) along the decision-threshold variation (x-axis). Training is made on the dev corpus (2643 utt)

While the previous tables provided WCE performance for a single point of interest (good/bad decision threshold set to 0.5), the curves in Figures 1 and 2 show the full picture of our WCE systems (for SLT) using speech-transcription systems ASR1 and ASR2, respectively. We observe that the classifier based on ASR features behaves very differently from the classifier based on MT features, which explains why their simple combination (MT + ASR) does not work very well at the default decision threshold (0.5). However, for thresholds above 0.75, the use of joint ASR and MT features is slightly beneficial compared to MT features only. This is interesting because higher thresholds improve the F-measure on bad labels and thus improve error detection. The curves are similar whichever ASR system is used. These results suggest that with enough development data for appropriate threshold tuning (which we do not have for this very new task), the use of both ASR and MT features should improve error detection in speech translation (blue and red curves are above the green curve for higher decision thresholds).Footnote 12 Although not reported here, we also analyzed the F-measure curves for bad and good labels separately: if we consider, for instance, the ASR1 system at a decision threshold of 0.75, the F-measure on bad labels is equivalent (52%) for the 3 systems (Joint, MT + ASR and MT), while the F-measure on good labels is 76% when using MT features only, 78% when using Joint features and 77% when using MT + ASR features. In other words, for a fixed performance on bad labels, the F-measure on good labels is improved by using all the information available (ASR and MT features). Finally, if we compare Joint with MT + ASR, we notice that the range of thresholds over which performance remains stable is larger for Joint than for MT + ASR.

6 Feature selection

In this section, we try to better understand the contribution of each (ASR or MT) feature by applying feature selection on our joint WCE classifier. In these experiments, we decide to keep OccurInGoogleTranslate and OccurInBingTranslate features.

We choose the Sequential Backward Selection (SBS) algorithm (Aha and Bankert 1996), a top-down algorithm that starts from the set of all features, noted \(Y_k\), and sequentially removes the most irrelevant feature x, i.e. the one whose removal maximizes the Mean F-Measure \(MF(Y_k-x)\). In our work, we continue until the set \(Y_k\) contains only one remaining feature. Algorithm 1 summarizes the whole process.

Algorithm 1: Sequential Backward Selection (SBS)
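A minimal sketch of the SBS loop described above is shown below; the `evaluate` callback is assumed to train a WCE classifier restricted to the given feature subset and return its mean F-measure on the dev data, standing in for the actual CRF training and evaluation pipeline.

```python
def sequential_backward_selection(all_features, evaluate):
    """Return the features ranked from most to least important according to SBS."""
    current = list(all_features)
    removed = []  # features in removal order (removed first = least useful)
    while len(current) > 1:
        # remove the feature whose removal maximizes the mean F-measure MF(Y_k - x)
        scores = {x: evaluate([f for f in current if f != x]) for x in current}
        worst = max(scores, key=scores.get)
        current.remove(worst)
        removed.append(worst)
    removed.append(current[0])        # the last remaining feature is the most important
    return list(reversed(removed))    # most important first
```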

The results of the SBS algorithm can be found in Table 14, which ranks all the joint features used in WCE for SLT by order of importance after applying the algorithm on dev. We can see that the SBS algorithm is not very stable and is clearly influenced by the ASR system (ASR1 or ASR2) used in SLT. In any case, if we focus on the 10 best features in both cases, we find that the most relevant ones are:

  • Alignment Features (source and target collocation features),

  • Occur in Google Translate and Occur in Bing Translate (diagnostic from other MT systems),

  • Longest Source N-gram Length, Target Backoff Behaviour (source or target N-gram features),

  • Word Posterior Probability Max (WPP Max) (graph-topology feature).

We also observe that the most relevant ASR features (see Table 14) are F-back, F-3g and F-context (linguistic and context features), whereas the ASR lexical, acoustic and graph-based features are among the worst (F-POS, F-dur and F-post). Accordingly, in our experimental setting, MT features appear to be more influential than ASR features. Interestingly, “source and target collocation features” (Alignment Features) and “Occur in Bing Translate” are the most prominent features (rank 1 and rank 2, respectively) when the algorithm is applied to the dev corpus for both ASR1 and ASR2. Besides, among the graph-topology features extracted from the confusion network, WPP Max outperforms the others, such as Nodes and WPP Min. Nevertheless, two other features, WPP Exact and WPP Any, turn out to be weak, as shown by their bottom-most positions for the two systems, whereas we expected to see them among the top features (Luong et al. (2015) report WPP Any as one of the best features for WCE in MT).

Table 14 Rank of each feature according to the sequential backward selection algorithm on the WCE for SLT task using Joint (ASR,MT) features
Fig. 3 Evolution of WCE performance for the dev (features selected) and tst corpora when feature selection using the SBS algorithm is made on dev (ASR1 system only; the same shape is observed for ASR2)

Figure 3 presents the evolution of WCE performance for the dev and tst corpora when feature selection with the SBS algorithm is made on dev, for the ASR1 system (the same shape is observed for the ASR2 system). In other words, for this figure we apply our SBS algorithm on dev, which means that feature selection is done on dev with classifiers trained on tst. After that, the best feature subsets (with 33, 32, 31, ..., down to 1 feature) are applied to the tst corpus (with classifiers trained on dev).Footnote 13

In the figure, we observe that only about half of the features contribute to the WCE process, since the best performance is observed with only 15 to 25 features. We also see that optimal WCE performance is not necessarily obtained with the full feature set; it can be reached with a subset of it.

7 Disentangling ASR and MT errors

In the previous sections, we only extracted good/bad labels from the SLT output, while it might be interesting to move from a 2-class to a 3-class problem in order to label our SLT hypotheses with one of the 3 following labels: good (G), asr-error (B_ASR) and mt-error (B_MT). Before training automatic systems for error detection, we need to set such 3-class labels for our dev and test corpora. For that, the next subsections propose two slightly different methods to extract them: the first is based on the word alignments from SLT to MT, and the second on a subtraction between SLT and MT errors.

7.1 Method 1: using word alignments between MT and SLT

In MT, the fertility of a source word denotes how many output words it translates as. If we transpose this definition to our disentangling problem, then fertility of an MT error denotes how many erroneous words—in the SLT output—it is aligned to. From this simple definition, we derive our first way (Method 1) of generating 3-class annotations.

Let \(\hat{e}_{slt} = (e_1, e_2,\ldots , e_n)\) be the set of SLT hypotheses (\(e_{hyp_{slt}}\)); \(e_{k_{j}}\) denotes the jth word in the sentence \(e_k\), where \(1\le k \le n\).

Let \(\hat{e}_{mt} = (e'_1, e'_2, \ldots , e'_n)\) be the set of MT hypotheses (\(e_{hyp_{mt}}\)); \(e'_{k_{i}}\) denotes the ith word in the sentence \(e'_k\), where \(1\le k \le n\).

Let \(L = (l_1, l_2, \ldots , l_n)\) be the set of word alignments from sentences in \(e_{hyp_{slt}}\) to related sentences in \(e_{hyp_{mt}}\), where \(l_{k}\) contains the word alignments from sentence \(e_{k}\) to the relevant sentence \(e'_{k}\), \(1\le k \le n\); \((e_{k_{j}}, e'_{k_{i}})\) = True, if there is one word alignment between \(e_{k_{j}}\) and \(e'_{k_{i}}\); \((e_{k_{j}}, e'_{k_{i}})\) = False, otherwise.

Our algorithm for Method 1 is defined as Algorithm 2. This method relies on word alignments and uses MT labels. We also propose a simpler method in the next section.

Algorithm 2: 3-class label extraction using word alignments between SLT and MT (Method 1)
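The sketch below gives our reading of Method 1: an SLT word labelled B is attributed to MT if it is aligned to at least one MT word that is itself labelled B, and to ASR otherwise. It is a simplified stand-in for Algorithm 2, not a verbatim reimplementation.

```python
def method1_labels(slt_labels, mt_labels, alignment):
    """alignment: SLT word position -> list of aligned MT word positions."""
    labels3 = []
    for j, label in enumerate(slt_labels):
        if label == "G":
            labels3.append("G")
        elif any(mt_labels[i] == "B" for i in alignment.get(j, [])):
            labels3.append("B_MT")   # the error already exists in the MT output
        else:
            labels3.append("B_ASR")  # the error appears only in the SLT output
    return labels3
```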

7.2 Method 2: subtraction between SLT and MT errors

Our second way to extract 3-class labels (Method 2) focuses on the differences between the SLT hypothesis (\(e_{hyp_{slt}}\)) and the MT hypothesis (\(e_{hyp_{mt}}\)). We call it subtraction between SLT and MT errors because we simply consider that errors present in the SLT output but not in the MT output are due to ASR. The main difference from the previous method is that it does not rely on the extracted labels for MT.

Our intuition is that the number of mt-errors estimated will be slightly lower than for Method 1 since we first estimate the number of asr-errors and the rest is considered—by default—as mt-errors.

We use the same notation as for Method 1, except that \(L = (l_1, l_2, \ldots , l_n)\) is now the set of alignments obtained through edit distance between \(e_{hyp_{slt}}\) and \(e_{hyp_{mt}}\), where \(l_{k_i}\) corresponds to “Insertion”, “Substitution”, “Deletion” or “Exact”. Our algorithm for Method 2 is defined in Algorithm 3.

Algorithm 3: 3-class label extraction by subtraction between SLT and MT errors (Method 2)
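A simplified sketch of Method 2 is given below, using Python's difflib as a stand-in for the edit-distance alignment between the SLT and MT hypotheses: SLT words whose edit operation with respect to the MT output is a substitution or an insertion are attributed to ASR, and the remaining SLT errors to MT.

```python
import difflib

def method2_labels(slt_words, mt_words, slt_labels):
    edit_op = ["Exact"] * len(slt_words)
    matcher = difflib.SequenceMatcher(a=slt_words, b=mt_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":      # SLT words substituted with respect to MT
            for i in range(i1, i2):
                edit_op[i] = "Substitution"
        elif tag == "delete":     # SLT words with no MT counterpart (insertions in SLT)
            for i in range(i1, i2):
                edit_op[i] = "Insertion"
    return ["G" if label == "G" else ("B_ASR" if edit_op[i] != "Exact" else "B_MT")
            for i, label in enumerate(slt_labels)]
```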

7.3 Example with 3-label setting

Table 15 gives the edit distance between an SLT and an MT hypothesis, while Table 16 shows how Method 1 and Method 2 assign 3-class labels to the SLT hypothesis. One transcript (\(f_{hyp}\)) has 1 error. This leads to 3 B labels in the SLT output (\(e_{hyp_{slt}}\)), while \(e_{hyp_{mt}}\) has only 2 B labels. As can be seen, Method 1 and Method 2 respectively yield (1 B_ASR, 2 B_MT) and (2 B_ASR, 1 B_MT).

Table 15 Example of edit distance between SLT and MT
Table 16 Example of quintuplet with 2 and 3-label

These differences can be explained by the word alignments from the SLT hypothesis to the corresponding MT hypothesis. As Table 16 shows, “is” (SLT hypothesis) is aligned to “have” (MT hypothesis) and “have” is labeled “B”. Therefore, “is” is labeled B_MT by Method 1. However, with Method 2, “is” is labeled B_ASR because the edit operation between “is” (SLT hypothesis) and “have” (MT hypothesis) is a substitution (S), as shown in Table 15.

7.4 Statistics with 3-label setting on the whole corpus

Tables 17 and 18 present the summary statistics for the distribution of good (G), asr-error (B_ASR) and mt-error (B_MT) labels obtained with both label-extraction methods. We see that both methods give similar statistics but slightly different rates of B_ASR and B_MT.

Table 17 Statistics with 3-label setting for ASR1
Table 18 Statistics with 3-label setting for ASR2

Comparing Tables 17 and 18, it is interesting to note that as the ASR system improves from ASR1 to ASR2, the rate of B_ASR labels logically decreases by more than 2 points, while the rate of B_MT remains almost stable (less than 1 point difference), which makes sense since the MT system is the same in both tables. These statistics suggest that the intersection of the two methods probably provides a good way to disentangle ASR and MT errors in SLT.

7.5 Qualitative analysis of SLT errors

Our new 3-label setting procedure allows us to analyze the behaviour of our SLT system. We omit examples here, but they are made available as supplementary material to this paper on a Web link.Footnote 14 Nonetheless, we can observe sentences with few ASR and MT errors leading to many SLT errors. Indeed, this is a good way of detecting flaws in the SLT pipeline, such as bad post-processing of the SLT output (numerical or text dates, for instance). In contrast, there are cases where many ASR errors lead to few SLT errors (ASR errors with few consequences, such as morphological substitutions, for instance in French: de/des, déficit/déficits, budgétaire/budgétaires). Finally, some ASR errors have different consequences on SLT quality depending on the system (on a sample sentence, 2 ASR errors in Systems 1 and 2 lead to 14 and 9 SLT errors, respectively).

7.6 Experiments on 3-class error detection

We report in Table 19 our first attempt to build an error-detection system for SLT as a 3-class problem (joint approach only). We conducted our experiment by training and evaluating the model on Intersection(m1, m2), which corresponds to high confidence in the labels.Footnote 15 In addition to giving better-informed error detection (\(B_{ASR}\) and \(B_{MT}\) instead of B), we note that 3-class error detection leads to overall similar results if we back off to a good/bad decision (\(F_{avg}\) becomes 62.5 on ASR1 and 61.00 on ASR2 in that case).

Table 19 Error-detection performance (2 vs. 3-labels) on SLT output for the tst set

8 Conclusion

8.1 Main contributions

In this paper, we introduced a new quality-assessment task: word confidence estimation (WCE) for spoken language translation (SLT). A specific corpus, distributed to the research community, was built for this purpose. We formalized WCE for SLT and proposed several approaches based on different types of features: MT-based features, ASR-based features, as well as combined or joint features using both ASR and MT information. The proposal of a single joint classifier based on different feature types (ASR and MT features) allowed us to perform feature selection and analyze which features (from ASR or MT) are the most efficient for quality assessment in speech translation. Our experiments have shown that MT features remain the most influential, while ASR features can bring interesting complementary information. For the purpose of reproducible research, our toolkit has been made available on a GitHub repository under the GPL v3 licence. We hope that the availability of our corpus and toolkit can lead, in the near future, to a new shared task dedicated to quality estimation for speech translation; such a shared task could be proposed in venues such as IWSLT or WMT, for instance. Towards the end of the paper, we proposed to disentangle ASR and MT errors and recast WCE as a 3-label setting problem.

8.2 Perspectives

A direct application of this work is the use of WCE labels to re-decode speech-translation graphs and (hopefully) improve speech-translation performance. Preliminary results have already been obtained and published by the authors of this paper (Besacier et al. 2015). The main idea is to carry out a second speech-translation pass by considering every word and its quality-assessment label, as shown in Eq. (4).

In addition to re-decoding SLT graphs, our quality-assessment system can be used in interactive speech-translation scenarios, such as news or lecture subtitling, to improve human translators' productivity by giving them feedback on automatic transcription and translation quality. Another application would be the adaptation of our WCE system to interactive speech-to-speech translation scenarios where feedback on the transcription and translation modules is needed to improve communication. Finally, in this paper engineered features were used for WCE; a natural perspective is to learn the WCE features automatically, as is now possible with deep neural networks, for instance.