1 Introduction

Recently, learning word representations (also known as word embeddings) of natural languages has attracted much attention (Mikolov et al. 2013a; Pennington et al. 2014; Zheng et al. 2017; Feng and Zheng 2018). These distributed representations have proven useful for many natural language processing tasks, such as language modelling (Bengio et al. 2003), sentiment analysis (Socher et al. 2013b) and syntactic parsing (Socher et al. 2013a). It is also possible to learn such word vector representations across different languages (Klementiev et al. 2012; Mikolov et al. 2013b; Hermann and Blunsom 2014; Gouws et al. 2015; Wei and Deng 2017; Søgaard et al. 2018), where similar words from multiple languages are clustered in the shared vector space. Multilingual word embeddings have been considered as an important building block for many cross-lingual tasks, including machine translation (Zou et al. 2013), parsing (Guo et al. 2015), and information retrieval (Vulić and Moens 2015).

Word alignment is often considered as a necessary pre-processing step for learning bilingual word embeddings, in which the words from two languages are first aligned automatically or semi-automatically and then the bilingual word embeddings are learned from the dataset with their words aligned explicitly. However, automatic word alignment is still a challenging task and results are unreliable for the subsequent learning of bilingual word embeddings. Some researchers have chosen to use the results of word alignment produced by an external tool like GIZA++ (Och and Ney 2003) from parallel data, but such word alignments are usually not good enough, and the alignment errors will propagate to the following word-embedding learning process. Others drop the step of automatic word alignment for this reason, but it is then impossible to fully leverage the implicit word-level information contained in a parallel corpus. In this study, we try to make better use of parallel data at both (explicit) sentence level and (implicit) word level, but the word alignment is not considered as a separate and predetermined step.

We propose a novel method to learn bilingual word alignments and word embeddings jointly, in which both tasks are reinforced mutually and gradually and can benefit from each other. The learned word alignment can be viewed as a distribution learned for each word in a sentence from a (source) language over all the words of its aligned translational equivalent from another (target) language. Under- or over-alignment problems might occur if no constraints are imposed, because some words may not be aligned at all or may be aligned to too many words. Therefore, two criteria are proposed and enforced during the word alignment to deal with these problems: coverage and sparsity. That is, each word in a sentence should have at least one semantic equivalent in the parallel translation, but the number of such corresponding words is limited (note that a word may be aligned to a phrase in the other language). We carried out two sets of experiments to evaluate our BWESA (Bilingual Word Embeddings with Soft Alignment) method. The first is to evaluate the quality of the learned bilingual word embeddings on two tasks: cross-lingual document classification (CLDC) and word translation. The second is to assess the results of the word alignment using alignment error rate (AER). Our proposed BWESA approach achieved state-of-the-art or comparable results on all these tasks. The effect of word alignment information was also confirmed by the experiments.

The main contributions of this paper are: (i) we propose a novel method to learn bilingual word embeddings and alignments from a parallel corpus in a joint fashion; (ii) we recommend applying the two criteria of “coverage” and “sparsity” during word alignment to deal with under- and over-alignment problems; and (iii) our model achieved state-of-the-art or comparable results on cross-lingual document classification and word translation tasks.

2 Related work

Bilingual word embedding learning methods aim to embed words from different languages into a shared continuous vector space, where the learned embeddings have the useful property that similar words from multiple languages are close to each other in the space. These methods can be roughly divided into three categories with respect to their training objectives: mapping offline, mapping online, and joint training.

2.1 Mapping offline

Mapping Offline methods first learn two sets of monolingual word embeddings separately, and then compute a mapping between the two vector spaces using extra resources (such as a dictionary). Learning a mapping matrix is arguably the most common way to obtain bilingual word embeddings, whether by constructing a dictionary from Google Translate (Mikolov et al. 2013b), leveraging a seed dictionary (Artetxe et al. 2017) or employing singular value decomposition (Smith et al. 2017). Although word embeddings can be learned at lower computational cost by these methods, they might be incapable of capturing the phenomena of homonymy and polysemy that widely exist within and across languages because they usually consider only one translation per word. Although not requiring a parallel corpus is an advantage, offline mapping methods rely on the assumption that the underlying embedding spaces have a similar structure, which is known as the isometry assumption. However, this assumption cannot be taken for granted: several researchers have shown that it does not hold in general (Søgaard et al. 2018; Nakashole and Flauger 2018), and its violation can severely degrade the performance of these methods.

2.2 Mapping online

Mapping Online methods try to learn sentence-level representations (often derived from their word embeddings) for different languages by making the learned representations of each pair of parallel sentences stay close to each other in a shared vector space. The word embeddings are thus learned indirectly, and word-level alignment is usually not enforced explicitly. Hermann and Blunsom (2014) proposed to learn bilingual word embeddings by aligning the representations of parallel sentences while keeping sufficient distances between those of dissimilar ones. Chandar et al. (2014) used an autoencoder-based framework to produce the representation of a sentence, which can both reconstruct the bag-of-words for that sentence and those for the aligned translation. Kočiský et al. (2014) proposed to learn both bilingual word embeddings and alignments based on FASTALIGN (Dyer et al. 2013), a variation of IBM model 2 (Brown et al. 1993), but they do not leverage the same or similar semantics conveyed in the parallel sentences directly when learning word alignment. Wei and Deng (2017) presented a variational autoencoder-based method, where a continuous latent variable is used to model the underlying semantics of each pair of parallel sentences and guide the reconstruction of these sentence pairs. Bilingual word embeddings are obtained indirectly in these methods by making the sentence pairs well-aligned, and such methods might fail to fully capture the intrinsic semantic and syntactic characteristics at the word level.

2.3 Joint training

Joint Training methods learn bilingual word representations by taking both monolingual and bilingual objectives into account. The word embeddings for each language are first separately trained from the monolingual corpus, and the obtained embeddings are then further tuned to satisfy the bilingual constraints defined either from pre-computed word alignments (Zou et al. 2013), or via coarse alignments under a uniform distribution assumption (Gouws et al. 2015). Luong et al. (2015) proposed a variant of skip-gram to learn BWEs by improving on the prediction of contextual words from both the monolingual and cross-lingual sentences. These methods make it possible to leverage both relatively small but valuable amounts of parallel data and large unlabelled monolingual texts. However, the performance of the learned BWEs is strongly sensitive to the quality of the predetermined word alignments, and good word alignments have generally been hard to achieve up to now.

In this study, we follow the line of the joint training strategy, but our model is different from others in that it is capable of learning bilingual word embeddings and word alignments jointly. We show that these two tasks can benefit each other in such a joint learning manner. Word alignments do not need to be predetermined before the training starts and are given opportunities to be improved gradually as the learning progresses, which leads to better bilingual word embeddings.

3 Models

We here describe our BWESA (Bilingual Word Embeddings with Soft Alignment) method, which can learn bilingual word embeddings and alignments automatically and simultaneously. Our objective function can be factorized into three parts: the first is designed for the monolingual objective, the second for the bilingual objective, and the last for word alignment, denoted as \(loss_{mono}\), \(loss_{bi}\), and \(loss_{align}\), respectively. The loss function can be formalized as in (1):

$$\begin{aligned} L_{bwe} = \alpha \cdot loss_{mono} + \beta \cdot loss_{bi} + \gamma \cdot loss_{align} \end{aligned}$$
(1)

where \(\alpha\), \(\beta\), and \(\gamma\) govern the relative importance of the three parts. The first term can be further decomposed into two components, one for each language. Because under- and over-alignment are harmful to the word alignment results, we advocate enforcing the two criteria of “coverage” and “sparsity” on word alignments during the learning process. Specifically, each word of a sentence should be aligned to at least one equivalent in the parallel sentence (i.e. fulfilling coverage), and at the same time the number of such semantic equivalents should be limited to a small value (i.e. fulfilling sparsity).

3.1 Monolingual objective

We chose to apply the skip-gram with negative sampling strategy (Mikolov et al. 2013a) to train the word embeddings from monolingual data since it has been widely used and can be performed at low computational cost. The philosophy behind skip-gram is that a word's meaning is reflected by its neighbouring words, and thus its feature representation can be trained by using the current word to predict its context (or neighbouring) words.

Specifically, for each word w in a vocabulary, Con(w) consists of all contexts in which the word w occurs in a corpus. The loss function for the word w can be formalized as in (2):

$$\begin{aligned} loss(w) = - \sum _{c \in Con(w)}\Big[\log \sigma (r_{w}\cdot r_c) - \sum _{n \in Neg(w)}\log \sigma (r_{w}\cdot r_{n})\Big] \end{aligned}$$
(2)

where \(r_w\) is the distributed vector representation of w, and \(\sigma\) denotes the sigmoid function. The negative sampling method has been widely used to learn word embeddings (Mikolov et al. 2013a; Zheng et al. 2017; Feng and Zheng 2018): the embeddings are trained by maximizing, via gradient ascent, the conditional likelihood of the context words given the current word, which can be factorized with respect to the current (positive) word and its negative samples using logistic regression as in Eq. (2). For each positive word, a set of k negative words, denoted as Neg(w), is randomly sampled from the vocabulary according to their frequencies. We need to train monolingual word embeddings twice, once for each language, and thus the monolingual loss, denoted as \(loss_{mono}\), is defined as the sum of two parts, as in (3):

$$\begin{aligned} loss_{mono}= \sum _{w^e\in V^e}loss(w^e) + \sum _{w^f\in V^f}loss(w^f), \end{aligned}$$
(3)

where \(w^e\) denotes a word in the vocabulary \(V^e\), extracted from the corpus of a source language, and \(w^f\) a word in the vocabulary \(V^f\) from a target language. The “source” and “target” are just used to name different languages, and can be used interchangeably without affecting the results.
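As an illustration (not our actual implementation), the per-word loss of Eq. (2) can be sketched in NumPy as follows, assuming the embeddings are stored as dense arrays and the negative samples Neg(w) have already been drawn by frequency outside this function; the array shapes are assumptions made purely for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_word_loss(r_w, context_vecs, negative_vecs):
    """Per-word skip-gram loss with negative sampling, following Eq. (2).

    r_w           -- embedding of the current word w, shape (d,)
    context_vecs  -- embeddings of the words in Con(w), shape (|Con(w)|, d)
    negative_vecs -- embeddings of the k sampled words in Neg(w), shape (k, d)
    """
    pos = np.log(sigmoid(context_vecs @ r_w))         # log sigma(r_w . r_c) for each context c
    neg = np.log(sigmoid(negative_vecs @ r_w)).sum()  # sum over n of log sigma(r_w . r_n)
    return float(-np.sum(pos - neg))                  # Eq. (2), summed over the contexts of w
```

Summing this quantity over the two vocabularies, as in Eq. (3), gives \(loss_{mono}\).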

3.2 Bilingual objective

In a parallel corpus, two sentences in a pair convey the same meaning, and each word (as a smaller semantic unit) in a sentence should have its own correspondence in the parallel sentence. We assume that there exists a soft alignment distribution for each word in a source sentence over the words from the target equivalent. Figure 1 illustrates a simple example of the soft alignment distribution for the German word “Traum”, where most of the weight of the distribution is given to the English word “dream” which carries a similar meaning to the German word “Traum.”

Fig. 1 The upper half shows an example distribution of soft alignment for the word “Traum” in a German sentence over all the words in the parallel English sentence, where most of the weight is given to the word “dream”, which carries a similar meaning to the German word “Traum.” The lower half illustrates a similarity matrix for a pair of sentences, in which the colour of each element shows the degree of similarity between the two corresponding words. The darker the colour, the more semantically similar the two words are

We take the result of the ReLU nonlinear transformation over the dot product of two words’ embeddings as their similarity score, and such scores are further normalized to approximate the aligned distribution. The similarity score, \(a_{ij}\), can be computed and then normalized to \(\hat{a}_{ij}\), as in (4):

$$\begin{aligned} a_{ij} =ReLU(r_{w_i^e}\cdot r_{w_j^f}), \quad \hat{a}_{ij} = \frac{a_{ij}}{\sum _k a_{ik}} \end{aligned}$$
(4)

where \(w_i^e\) denotes the i-th word in a sentence from the source language e, and \(w_j^f\) denotes the j-th word in the parallel one from the target language f. The similarity score defined by the dot product gradually forces the two word vectors closer to one another during the training process, and finally causes all the words from the two different languages to be embedded in the same vector space. The ReLU function is used to produce the scores so that two dissimilar words can have a zero score, and more importantly, the sparsity property can be guaranteed to some extent since many word pairs will receive a score of zero or close to zero.

$$\begin{aligned} dist(w_i^e)=||r_{w_i^e}-\sum _j \hat{a}_{ij}r_{w_j^f}||^2 \end{aligned}$$
(5)

In Eq. (5) we define a distance, to be reduced in the vector space, that reflects how well the meaning of a source word is represented in its parallel equivalent. Specifically, this distance is the Euclidean distance between the embedding of the i-th word in a sentence and the weighted average of the embeddings of all the words in the parallel sentence. The estimated distribution \(\hat{a}_{ij}\) for the i-th word from the source language is used as the weights.

Likewise, \(dist(w_j^f)\) can also be calculated for the j-th target word in the same way. Therefore, the loss function for the bilingual purpose can be defined as in (6):

$$\begin{aligned} loss_{bi} = \frac{1}{N}\sum _{(s_e,s_f)\in \mathcal {D}}\left( \sum _{i=1}^{|s_e|} dist(w_i^e) + \sum _{j=1}^{|s_f|} dist(w_j^f)\right) \end{aligned}$$
(6)

where \(\mathcal {D}=\{(s_e,s_f)_n\}_{n=1}^{N}\) is a dataset consisting of parallel sentence pairs \((s_e,s_f)\), and N is the number of those pairs.
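For illustration only, the soft alignment of Eq. (4) and the bilingual loss of Eqs. (5)-(6) can be sketched as follows; the small epsilon guarding against all-zero rows and the array layout are assumptions, not details of our implementation.

```python
import numpy as np

def soft_alignment(E_src, E_tgt, eps=1e-12):
    """Soft-alignment weights a_hat_ij of Eq. (4) for one sentence pair.

    E_src -- embeddings of the source-sentence words, shape (|s_e|, d)
    E_tgt -- embeddings of the target-sentence words, shape (|s_f|, d)
    """
    a = np.maximum(E_src @ E_tgt.T, 0.0)                       # ReLU over pairwise dot products
    return a / np.maximum(a.sum(axis=1, keepdims=True), eps)   # normalize each source row

def sentence_pair_loss(E_src, E_tgt):
    """Distances of Eq. (5), summed in both directions for one sentence pair."""
    a_st = soft_alignment(E_src, E_tgt)              # source -> target distribution
    a_ts = soft_alignment(E_tgt, E_src)              # target -> source distribution
    dist_src = ((E_src - a_st @ E_tgt) ** 2).sum()   # sum_i ||r_i^e - sum_j a_hat_ij r_j^f||^2
    dist_tgt = ((E_tgt - a_ts @ E_src) ** 2).sum()
    return dist_src + dist_tgt

def loss_bi(pairs):
    """Eq. (6): average of the sentence-pair distances over the dataset D."""
    return sum(sentence_pair_loss(E_s, E_t) for E_s, E_t in pairs) / len(pairs)
```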

In our model, the ‘soft’ word alignment is derived from the similarity scores estimated between the word embeddings by taking the semantic equivalences at their sentence level as a guide. The derived alignments are used to learn the bilingual word embeddings, and in turn, the learned embeddings are further used to improve the quality of the word alignments. In this way, the word alignments and embeddings can be learned jointly and reinforced mutually.

3.3 Coverage and sparsity

We introduce the two criteria of “coverage” and “sparsity” for the word alignment process. Coverage is enforced explicitly through the alignment loss \(loss_{align}\), while sparsity is encouraged implicitly by the ReLU transformation in Eq. (4), as discussed below. The coverage criterion means that each word of a sentence should be aligned to at least one equivalent in the parallel sentence, and it is proposed to treat the under-alignment problem. The loss to enforce the coverage criterion for a source language is defined as in (7):

$$\begin{aligned} loss_{cov_e}=\frac{1}{N}\sum _{(s_e,s_f)\in \mathcal {D}}\sum _{j=1}^{|s_f|} \left (1-\sum _{i=1}^{|s_e|} \hat{a}_{ij}\right )^2 \end{aligned}$$
(7)

For a target language, the \(loss_{cov_f}\) can be defined in a similar way, and we take the sum of the two losses as the training objective for the coverage criterion. To meet the “sparsity” criterion, the cardinality of semantic equivalences of each word in any sentence should be limited to a reasonably small number. As discussed above, this criterion is implicitly guaranteed via ReLU nonlinear transformation defined in Eq. (4). The loss function for the word alignment can be written as in (8):

$$\begin{aligned} loss_{align} =loss_{cov_e}+loss_{cov_f} \end{aligned}$$
(8)
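A minimal sketch of the coverage penalty in Eqs. (7)-(8) is given below; it takes precomputed soft-alignment matrices (one per direction) as input and is illustrative rather than our actual implementation.

```python
import numpy as np

def coverage_loss(A):
    """Coverage penalty of Eq. (7) for one sentence pair and one direction.

    A -- soft-alignment matrix a_hat_ij with rows normalized as in Eq. (4),
         shape (|s_e|, |s_f|) for the source-to-target direction.
    """
    column_mass = A.sum(axis=0)                     # sum_i a_hat_ij for every target word j
    return float(((1.0 - column_mass) ** 2).sum())

def loss_align(alignment_pairs):
    """Eq. (8): coverage in both directions, averaged over the N sentence pairs.

    alignment_pairs -- list of (A_src2tgt, A_tgt2src) matrices, one tuple per pair.
    """
    total = sum(coverage_loss(a_st) + coverage_loss(a_ts)
                for a_st, a_ts in alignment_pairs)
    return total / len(alignment_pairs)
```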

In summary, to jointly obtain better vector representations of words for the source and target languages, the word embeddings are first trained with the monolingual loss of Eq. (3) and then trained using the bilingual loss of Eq. (6) together with the word alignment loss of Eq. (8). A negative sampling strategy similar to that of skip-gram (Mikolov et al. 2013a) is used to fulfil the monolingual training objective, and a good (soft) word alignment distribution is learned by leveraging a sentence-level parallel corpus to meet the proposed criteria of “coverage” and “sparsity” for the word-level alignment.

4 Experiments

We conducted three sets of experiments to evaluate our BWESA method by comparing it to other representative methods: word translation and cross-lingual document classification for the learned bilingual word embeddings, and AER for the obtained word alignments.

4.1 Training details

4.1.1 Training datasets

The English-German (en-de) and English-French (en-fr) portions of the Europarl v7 corpus (Koehn 2005) were used to train the models under comparison; they contain 1.9M en-de parallel sentences with 49.7M English and 52.0M German words, and 2.0M en-fr parallel sentences with 55.7M English and 61.9M French words. The models were also evaluated on an English-Chinese (en-zh) dataset. These two languages belong to different language families and differ from each other much more than the en-de or en-fr pairs. The en-zh dataset was extracted from LDC and consists of 2.5M parallel sentences with 80.8M English and 72.0M Chinese words.

4.1.2 Hyperparameters and initialization

We tuned the hyperparameters by trying only a few different settings on the validation set. In our experiments, we set the number of negative samples to 64, the window size to 5, the subsampling rate to 0.0001, and the initial learning rate to 0.1. The dimensionality of the word embeddings was set to 40 for both the en-de and en-fr pairs, and to 100 for en-zh. We first let \(\alpha + \beta = 1\), and tuned \(\gamma\) within {0.5, 1.0, 2.0, 4.0, 8.0}. In this way, we can observe the model’s behaviour without the constraint on word alignment, and see how much performance improves when this constraint is introduced later in an incremental manner. The experimental results show that the performance is relatively insensitive to the value of \(\gamma\), but we chose \(\gamma = 0.5\) as it yielded slightly better performance than the other values on the validation set.

Luong et al. (2015) and Gouws et al. (2015) found that for both en-de and en-fr, setting the dimensionality of word embeddings to 40 sufficed, as these two languages are quite similar to each other. For a fair comparison, we followed their setting in our experiments. However, English and Chinese (en-zh) belong to different language families and differ from each other much more than the en-de or en-fr pairs. Accordingly, we chose to enlarge the model capacity by increasing the size of the word embeddings to 100. Our preliminary experiments showed that the embedding size generally has a limited impact on performance as long as it is large enough. We tuned the value of k within {8, 16, 32, 64, 128} and set the number of negative samples to 64, which yielded the best performance on the validation set.

A difficulty in learning bilingual word embeddings is the very large search space, which makes it extremely hard to learn good word embeddings starting from a random initialization. Accordingly, we “warm up” the model by using information from cross-lingual word co-occurrence statistics to speed up the training process. In the first several iterations, the similarity scores between words are calculated based on word co-occurrence, as in (9):

$$\begin{aligned} s_{ij} =\frac{count(w_i^e, w_j^f)}{count(w_i^e)}, \end{aligned}$$
(9)

where \(count(w_i^e, w_j^f)\) is the frequency of co-occurrence of the word pair \(w_i^e\) and \(w_j^f\) in the training corpus, and \(count(w_i^e)\) is the total number of occurrences of the i-th source word. The scores of stop words and other high-frequency words need to be scaled properly, so the Inverse Document Frequency (IDF), which reflects how important a word is to a sentence in a corpus, is incorporated into the score, as in (10):

$$\begin{aligned} s'_{ij}=s_{ij}*idf(w_j^f), \end{aligned}$$
(10)

where \(idf(w_j^f)\) is the calculated IDF of word \(w_j^f\). The score \(s'_{ij}\) is normalized to obtain \(\hat{s}_{ij}\) in the same way as in Eq. (4). In the first few iterations, \(\hat{s}_{ij}\) is used instead of the \(\hat{a}_{ij}\) defined in Eqs. (4) and (5).
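The warm-up scores of Eqs. (9)-(10) can be computed from the sentence-aligned data roughly as in the following sketch; the particular IDF variant and the token-level counting scheme shown here are assumptions, since only the general formulas are given above.

```python
import math
from collections import Counter

def warmup_scores(pairs):
    """Warm-up similarity scores of Eqs. (9)-(10) from sentence-aligned data.

    pairs -- list of (source_tokens, target_tokens), one tuple per parallel sentence pair.
    Returns a dict mapping (w_e, w_f) to the IDF-scaled score s'_ij.
    """
    cooc = Counter()          # count(w_e, w_f): co-occurrence within aligned sentence pairs
    src_count = Counter()     # count(w_e): total occurrences of each source word
    tgt_doc_freq = Counter()  # number of target sentences containing each target word
    for src, tgt in pairs:
        for w_e in src:
            src_count[w_e] += 1
            for w_f in tgt:
                cooc[(w_e, w_f)] += 1
        for w_f in set(tgt):
            tgt_doc_freq[w_f] += 1

    n_sents = len(pairs)
    scores = {}
    for (w_e, w_f), c in cooc.items():
        s = c / src_count[w_e]                              # Eq. (9)
        idf = math.log(n_sents / (1 + tgt_doc_freq[w_f]))   # a common smoothed IDF variant
        scores[(w_e, w_f)] = s * idf                        # Eq. (10)
    return scores
```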

4.2 Bilingual word embedding evaluation

To evaluate the learned bilingual word embeddings experimentally, BWESA was compared to the following representative models:

  1. DistribReps (Klementiev et al. 2012): They formulate the word embedding learning for a pair of languages as a multitask learning problem where each task corresponds to a single word, and task relatedness is derived from co-occurrence statistics in bilingual parallel data.

  2. BICVM (Hermann and Blunsom 2014): They leverage parallel data and learn to align the embeddings of semantically equivalent sentences, while maintaining sufficient distance between those of dissimilar sentences. The idea behind their method is that, given enough parallel data, a shared representation of two parallel sentences would be forced to capture the common elements and words between these two sentences.

  3. BAE (Chandar et al. 2014): They use autoencoder-based methods for cross-language learning of vector word representations that are coherent between two languages by learning to reconstruct the bag-of-words representations of aligned sentences, within and between languages.

  4. BilBOWA (Gouws et al. 2015): They train bilingual word embeddings on monolingual data and extract a bilingual signal from a set of sentence-aligned data with a sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for cross-lingual feature learning.

  5. BiSkip (Luong et al. 2015): They extended the skip-gram model to learn bilingual representations by using the co-occurrence context information within a language and meaning-equivalent signals across languages.

  6. CLSim (Shi et al. 2015): They proposed a matrix co-factorization framework for learning cross-lingual word embeddings, in which the monolingual training objective is defined in the form of matrix decomposition, and cross-lingual constraints are enforced with information derived from parallel corpora.

  7. BRAVE-S (Mogadala and Rettinger 2016): They proposed a model to learn bilingual word embeddings from sentence-aligned parallel corpora with the elastic net regularization proposed by Zou and Hastie (2005).

  8. BiVAE (Wei and Deng 2017): They presented a variational autoencoding approach for training bilingual word embeddings where a continuous latent variable is introduced to explicitly model the underlying semantics of the parallel sentence pairs and to guide the generation of the sentence pairs.

  9. Adv-Refine-CSLS (Conneau et al. 2017): They explored building a bilingual dictionary between two languages without using any parallel corpora by aligning monolingual word-embedding spaces in an unsupervised way with adversarial training and a refinement procedure.

  10. DP (Li et al. 2019): They proposed a method to induce a word alignment by estimating the relevance between a pair of words (x, y) from a source language and a target one. The relevance score is estimated by removing the word x from the source sentence and calculating the difference in the probability of generating the word y in the target (translated) sentence before and after x is removed, with the help of a machine translation model.

  11. E-SGNS (Ormazabal et al. 2020): The core idea of their method is to fix the target-language embeddings and learn from scratch a set of source-language embeddings aligned with them. They use an extension of skip-gram (Mikolov et al. 2013a) that leverages translated context words as anchor points and apply self-learning and iterative restarts to reduce the dependency on the initial dictionary. They proposed three methods to build an initial dictionary, and we compare against the version with unsupervised mapping initialization because it achieved the best word-alignment results on average.

  12. BWECLCO: a new strong baseline that we developed, which uses only the cross-lingual word co-occurrence to estimate the alignment distribution, without the subsequent word-alignment learning step of BWESA.

4.2.1 Word translation

Word translation aims to select the most similar word from a target language for a given word from a source language (Mikolov et al. 2013b; Gouws et al. 2015). This task is often used to evaluate how well similar words from different languages are aligned with each other in the learned vector space, using the cosine distance. Following Upadhyay et al. (2016), the gold word pairs were extracted from the Open Multilingual WordNet (OMW) dataset released by Bond and Foster (2013), consisting of 19,675 en-de, 20,449 en-fr and 42,300 en-zh word pairs.
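For reference, the following sketch computes P@k with cosine similarity over L2-normalized embeddings; the handling of out-of-vocabulary gold pairs is an assumption for the sketch, not a detail of our evaluation protocol.

```python
import numpy as np

def precision_at_k(src_emb, tgt_emb, src_vocab, tgt_vocab, gold_pairs, k=1):
    """Top-k word-translation accuracy via cosine similarity (illustrative sketch).

    src_emb, tgt_emb     -- embedding matrices whose rows are indexed by the vocab dicts
    src_vocab, tgt_vocab -- dicts mapping words to row indices
    gold_pairs           -- list of (source_word, target_word) gold translation pairs
    """
    S = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)  # normalize so that
    T = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)  # dot product = cosine
    hits, total = 0, 0
    for w_src, w_tgt in gold_pairs:
        if w_src not in src_vocab or w_tgt not in tgt_vocab:
            continue                                # skipping OOV pairs is an assumption
        sims = T @ S[src_vocab[w_src]]              # cosine with every target word
        top_k = np.argsort(-sims)[:k]
        hits += int(tgt_vocab[w_tgt] in top_k)
        total += 1
    return hits / max(total, 1)
```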

Table 1 The accuracy (\(\%\)) for the word translation task on the Open Multilingual WordNet dataset. P@1, P@5 and P@10 denote top-1, top-5 and top-10 accuracy, respectively
Table 2 The accuracy (\(\%\)) of word translation on the Open Multilingual WordNet dataset and the accuracy (\(\%\)) of cross-lingual document classification (CLDC) on the Reuters RCV1/RCV2 multilingual corpora for the English-Chinese pair

We report the top-1, top-5 and top-10 accuracy (denoted by P@1, P@5 and P@10) achieved by the different models in Tables 1 and 2. As we can see, BWESA produced state-of-the-art results on the OMW dataset for all three language pairs. Although E-SGNS achieved the best averaged top-1 accuracy of \(57.4\%\) among the previous models, it was surpassed by BWESA by a fairly significant margin (about 1% on average). The results reported in Table 2 also show that BiVAE outperformed the other competitors for the word translation task, even when the difference between the two languages is large. In addition, we note that the “fully fledged” BWESA model is superior to BWECLCO, in which word alignment learning is turned off, with an average improvement of \(9.63\%\), indicating that the joint solution for learning bilingual word embedding and alignment is preferable and that both tasks can mutually benefit and reinforce each other during joint learning. The experimental results show that BWESA is capable of learning finer-grained (word-level) semantic equivalences from (sentence-level) parallel corpora, because the alignment learning strategy causes similar words, properly chosen by the continuously improving word alignment, to become closer in the shared vector space as training progresses.

4.2.2 Cross-lingual document classification

Cross-lingual document classification (CLDC) can be used to assess the quality of the learned BWEs by training a classifier on one language and testing it on another. Following the settings of Klementiev et al. (2012), the English, German, French and Chinese subsections of the Reuters RCV1/RCV2 multilingual corpora were used for evaluation, and only documents labeled with one of the CCAT, ECAT, GCAT, or MCAT topics were considered for this task. In this experiment, 15,000 documents were extracted from RCV1/2, of which 5000 were randomly selected as the test set and the rest were used as the training set.
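The evaluation protocol can be sketched as follows, assuming (purely for illustration) that documents are represented by averaging their word embeddings and classified with an off-the-shelf logistic-regression classifier; these are not necessarily the document representation and classifier used in our experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_vector(tokens, emb, vocab):
    """Average the in-vocabulary word embeddings of a document (an assumed scheme)."""
    vecs = [emb[vocab[t]] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(emb.shape[1])

def cldc_accuracy(train_docs, train_labels, test_docs, test_labels,
                  src_emb, src_vocab, tgt_emb, tgt_vocab):
    """Train a topic classifier on source-language documents and test it directly
    on target-language documents embedded in the same shared space."""
    X_train = np.stack([doc_vector(d, src_emb, src_vocab) for d in train_docs])
    X_test = np.stack([doc_vector(d, tgt_emb, tgt_vocab) for d in test_docs])
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    return clf.score(X_test, test_labels)
```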

Table 3 The accuracy (\(\%\)) of cross-lingual document classification (CLDC) task on Reuters RCV1/RCV2 multilingual corpora

Following Klementiev et al. (2012), three additional baseline systems (Majority Class, Glossed, and MT) are also listed in Table 3 for comparison. The Majority Class system simply labels all the documents to be classified with the category having the most samples in the training set. The Glossed system works as follows: a classifier is first trained over the documents from a source language; for a document written in another language, every word in the document is replaced with its most frequently aligned word from the source language; finally, the document with its words replaced is labeled by the classifier trained in the first step. The MT system differs from the Glossed system in that the documents to be classified are translated into the source language not by word-level replacement, but by applying a phrase-based statistical MT tool.

The results reported in Tables 2 and 3 show that BWESA achieved consistently higher performance than the competitors on almost all the CLDC datasets considered. Although CLSim (Shi et al. 2015) achieved the best result on the en-de sub-task, BWESA outperformed CLSim on the other six language pairs by a significant margin (\(4.93\%\) on average), highlighting the potential of BWESA for practical CLDC, an important downstream task for BWEs. Another noteworthy result of these experiments is the success of the joint learning strategy, which boosts classification accuracy by about \(8.18\%\) on average.

4.3 Word alignment evaluation

Both the word translation and cross-lingual document classification tasks were used to evaluate the bilingual word embeddings learned by our model and by other approaches. In this experimental setting, we would like to see how well the words from different languages are aligned, and whether the word alignment has indeed been improved by BWESA.

Table 4 The results of word alignment reported in the inverse of alignment error rate (\(1 -\) AER)
Table 5 Examples of nearest neighbours

4.3.1 Alignment error rate

AER is often used to measure how well words are aligned by comparing the model’s proposed alignments with the gold ones annotated by humans. We chose to use the inverse of alignment error rate (i.e. \(1-\)AER) suggested by Koehn (2009) as the evaluation metric. The higher the inverse rate, the better the word alignment will be. Like Levy et al. (2017), we first leveraged the Edinburgh Bible Corpus and a subset of the Europarl corpus (180K sentences) to train cross-lingual word embeddings, and then sixteen manually annotated word alignment datasets were used to evaluate the word alignments produced by different models.
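For reference, 1 - AER for a single sentence pair can be computed as in the following sketch, which assumes the standard sure/possible style of gold annotation; how the soft alignment distributions are thresholded into discrete links is left unspecified here and is an assumption of the sketch.

```python
def one_minus_aer(predicted, sure, possible):
    """Inverse alignment error rate (1 - AER) for one sentence pair.

    predicted -- set of (i, j) links proposed by the model
    sure      -- set of sure gold links S
    possible  -- set of possible gold links P (S is a subset of P in the usual annotation)
    """
    denom = len(predicted) + len(sure)
    if denom == 0:
        return 1.0                      # treat the degenerate empty case as perfect
    aer = 1.0 - (len(predicted & sure) + len(predicted & possible)) / denom
    return 1.0 - aer
```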

We evaluated the word alignments produced by BWESA and BWECLCO in terms of the inverse of AER, comparing them to BilBOWA (Gouws et al. 2015) and BiSkip (Luong et al. 2015), which have also been tested on the word-alignment task. We also listed IBM-Model1 and IBM-Model3 (Brown et al. 1993) as two strong baselines for comparison because they were specifically designed for word alignment using bilingual word co-occurrence statistics. In addition, the Dice system (Och and Ney 2003) was selected for comparison, in which the Dice coefficient is introduced to measure the similarity between cross-lingual words based on the number of aligned (parallel) sentences in which they co-occur. Similar “coverage” and “sparsity” criteria were applied in IBM-Model3; in this study, we redefined these two criteria to fit the case where distributed representations are used.

As shown in Table 4, BWESA achieved the highest performance for ten different language pairs on the GRACA, MIHALCEA, and HOLMQVIST datasets. Although BWESA did not outperform IBM-Model3 on the other three pairs, it still performed competitively. Note that IBM-Model3 was tailored for word alignment using many features based on linguistic knowledge. The experimental results show that our BWESA model can effectively learn high-quality bilingual word embeddings and relatively reliable word alignments in a joint manner.

5 Qualitative analysis

In this section, qualitative analyses were performed to evaluate the effectiveness of BWESA in two respects: neighbouring word discovery and word embedding visualization. The English-German language pair was used.

5.1 Nearest neighbour words

We randomly sampled five words from English and German to retrieve their top-5 nearest neighbour words within and across languages based on cosine similarity. The results listed in Table 5 demonstrate that the nearest neighbour words discovered by our model are generally semantically coherent. For example, English and German words describing “time” concepts (such as “moment” and “Zeit”) are well clustered, indicating that the desired word clustering is well formed for these two languages.

Fig. 2 Two-dimensional projection of the mappings among German and English word embeddings by the t-SNE algorithm

5.2 Visualization

To illustrate how well the bilingual word representations were learned by BWESA, we plotted a two-dimensional projection of the word representations produced by BWESA in Fig. 2. We used the t-SNE algorithm (van der Maaten and Hinton 2008) to perform the projection. English and German word pairs were randomly extracted from the Open Multilingual WordNet data (Bond and Foster 2013). The distance between any two words is calculated using cosine similarity. All English words are shown in green, and for each English word, its associated German word is shown in blue if their similarity score is greater than a given threshold (say 0.8); otherwise, the corresponding German word is shown in yellow. We can see that there are many more blue words than yellow ones, which shows that BWESA can produce better word representations and helps to improve the accuracy of word translation, cross-lingual document classification, and word alignment.
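The visualization can be reproduced roughly as in the sketch below; the t-SNE settings and plotting details are illustrative assumptions rather than the exact configuration used for Fig. 2.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_word_pairs(en_vecs, de_vecs, threshold=0.8):
    """Project English/German translation pairs to 2-D with t-SNE and colour each
    German word by whether its cosine similarity to the paired English word
    exceeds the threshold (0.8, as in the text)."""
    en = en_vecs / np.linalg.norm(en_vecs, axis=1, keepdims=True)
    de = de_vecs / np.linalg.norm(de_vecs, axis=1, keepdims=True)
    cos = (en * de).sum(axis=1)                       # pairwise cosine similarity
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(
        np.vstack([en, de]))
    n = len(en)
    plt.scatter(points[:n, 0], points[:n, 1], c="green", label="English")
    plt.scatter(points[n:, 0], points[n:, 1],
                c=np.where(cos > threshold, "blue", "yellow"), label="German")
    plt.legend()
    plt.show()
```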

6 Conclusion

We have presented BWESA for learning bilingual word embeddings and alignments in a joint way, in which both tasks can mutually benefit and reinforce each other during the learning process. BWESA is able to learn bilingual word representations from parallel corpora without explicit word-level alignment information in a weakly supervised manner. Two criteria of “coverage” and “sparsity” were reintroduced for learning better word alignments in the case of distributed representations to deal with the under- and over-alignment problems. Extensive experimental results show that BWESA achieved state-of-the-art or comparable results on various cross-lingual tasks, including document classification, word translation, and word alignment.

For future work, it would be interesting to see whether bilingual word embeddings can be learned in a completely unsupervised way. Besides this avenue, we are aware that recently proposed mT5 (Xue et al. 2021), XLM (Conneau and Lample 2019), mBART (Liu et al. 2020) and Unicoder (Huang et al. 2019) could benefit from the idea of soft word alignment, which helps to learn bilingual word representations from the parallel corpora without requiring explicit word-level alignment. We leave this as future work because very large architectures are required to train such contextualized representations at the cost of great computational power, time, and resources.