Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages

Fernando, Aloka; Ranathunga, Surangika; Sachintha, Dilan; Piyarathna, Lakmali; Rajitha, Charith

doi:10.1007/s10115-022-01761-x

Regular Paper
Published: 17 October 2022

Volume 65, pages 571–612, (2023)

Knowledge and Information Systems

Abstract

Neural machine translation systems trained on low-resource languages produce sub-optimal results due to the scarcity of large parallel datasets. To alleviate this problem, parallel corpora can be mined from the web. Two key tasks in a parallel corpus mining pipeline are web document alignment and sentence alignment. Effective approaches for these tasks obtain vector representations of the documents (or sentences) belonging to the two languages and determine the alignment between the documents (or sentences) based on a semantic similarity scoring mechanism. Recently, document or sentence representations obtained from pre-trained multilingual language models (PMLMs) such as LASER, XLM-R and LaBSE have significantly improved the benchmark scores in diverse natural language processing tasks. In this study, we carry out an empirical analysis of the effectiveness of these PMLMs for the document and sentence alignment tasks in the context of the low-resource language pairs Sinhala–English, Tamil–English and Sinhala–Tamil. Further, we introduce a weighting mechanism based on small-scale bilingual lexicons to improve the semantic similarity measurement between sentences and documents. Our results show that both document and sentence alignment can be further improved using our weighting mechanism. We have also compiled a gold-standard evaluation benchmark dataset for document alignment and sentence alignment tasks for the considered language pairs. This dataset (https://github.com/kdissa/comparable-corpus) and the source code (https://github.com/nlpcuom/parallel_corpus_mining) are publicly released.

1 Introduction

Neural machine translation (NMT) systems trained on the transformer architecture [1] produce state-of-the-art results when large parallel datasets are available. However, in low-resource settings, the same architectures produce sub-optimal results [2].

Mining parallel corpora from the web is one commonly explored solution to alleviate the parallel data scarcity problem [3]. Wikipedia, news websites and official government/institution websites are likely to contain documents that are translations of each other, and are therefore candidate sources for parallel corpus mining. However, for low-resource languages, the data on the web are noisy and of poor quality, which results in noisy parallel data when mined automatically [4]. Therefore, it is essential to implement parallel corpus mining techniques that produce quality parallel data for low-resource languages.

Document alignment and sentence alignment are important tasks in the parallel corpus mining pipeline [5]. Document alignment refers to the process of identifying web documents that contain translations of each other, which are known as comparable corpora [6]. Early work on document alignment mainly relied on feature-based techniques that exploited URL metadata [7,8,9], HTML document structure [10] or machine translation-based techniques [11, 12]. However, these were outperformed by techniques that used vector representations of documents. Recent research in this line exploited document representations derived from Pre-trained Multilingual Language Models (PMLMs), which proved to be far superior to previous techniques [13].

The objective of sentence alignment is to find parallel sentences in the already identified comparable corpora or aligned documents. Existing techniques for sentence alignment were based on sentence-level features [14], information retrieval [12], supervised classifiers [15] and machine translation [16]. However, more successful techniques were based on multilingual sentence embeddings [17, 18].

Currently available PMLMs to derive multilingual sentence embeddings include LASER [19], XLM-R [20], mBERT [21] and LaBSE [18]. However, except for Rajitha et al. [22] who compared LASER and XLM-R embeddings for document alignment, and Feng et al. [18] who compared LaBSE and LASER for sentence alignment, to the best of our knowledge, there has been no comprehensive evaluation of the effectiveness of these embeddings for document or sentence alignment tasks for low-resource languages. Moreover, these models are known to provide sub-optimal results for languages that are under-represented in these PMLMs [23]. These languages turn out to be those that have already been classified as low-resource languages [24]. Thus, it is important to investigate and identify ways to improve the performance of these PMLMs for document and sentence alignment in the context of low-resource languages.

In this paper, we exploit bilingual lexicons to improve the semantic similarity measurement of the sentence embeddings derived from PMLMs for the tasks of document and sentence alignment. Note that bilingual lexicons can be considered as parallel data in the form of short phrases.

Our document alignment system is based on the work of El-Kishky and Guzmán [13]. They derived sentence embeddings using LASER and calculated the semantic distance between documents in source and target languages using the Cross-lingual Sentence Movers Distance algorithm. Our sentence alignment system is based on the work of Artetxe and Schwenk [25]. They first obtained sentence embeddings of all source and target side sentences using LASER and calculated margin-based cosine similarity over nearest neighbours.

For both these techniques, we introduce a new weighting mechanism to improve the semantic distance measurement, by utilizing existing bilingual lexicons. Our bilingual lexicons include a bilingual dictionary, glossary, designation list and person name lists. Additionally, we evaluate the effectiveness of our technique with XLM-R [20] and LaBSE [18] multilingual embeddings, in addition to LASER [19]. Thus, our work also serves as the first comparative study of the performance of these three multilingual models for these tasks.

We experiment with three language pairs: Sinhala–Tamil, Sinhala–English and Tamil–English. We have compiled a gold-standard human-annotated benchmark evaluation set for document alignment and sentence alignment tasks, in these three language pairs. The considered languages belong to three distinct language families (English (En)—Indo European, Tamil (Ta)—Dravidian and Sinhala (Si)—Indo Aryan), and Sinhala and Tamil are morphologically rich low-resource languages. Thus, this dataset is a much tougher benchmark compared to other multilingual datasets [26] that only focused on a pair of high-resource related languages. We publicly release this dataset ^{Footnote 1} in the hope that it would serve in further research in this domain. This is the first manually curated dataset for the considered three languages.

Our experiments show that the use of bilingual lexicons improves the performance of the selected document and sentence alignment techniques, with the largest gains in the context of the LASER sentence representations.

Thus, the contributions of this work are as follows:

1. We introduce a weighting scheme based on bilingual lexicons to improve the semantic similarity measurement of the document and sentence representations derived from pre-trained multilingual models, for the document and sentence alignment tasks, respectively.
2. We conduct an empirical evaluation of the performance of sentence representations derived from LASER, XLM-R and LaBSE^{Footnote 2}, for document and sentence alignment tasks in the context of low-resource languages.
3. We publicly release the gold-standard human-annotated benchmark evaluation datasets for the document and sentence alignment tasks in the context of three low-resource language pairs: English–Sinhala, English–Tamil and Sinhala–Tamil.

The rest of the paper is organized as follows. Related work is covered in Sect. 2. In Sect. 3, we explain our approach for creating the benchmark evaluation sets and describe the bilingual lexicons used. Our lexicon-based solution is presented in Sect. 4. Results are reported in Sect. 5, with a further analysis of the results in Sect. 6. Finally, the conclusion and future work are included in Sect. 7.

2 Related work

A typical parallel corpus mining pipeline follows a sequence of tasks, namely: crawling of website data, alignment of web documents, sentence alignment and parallel sentence filtration [5]. In our study, we focus on document alignment and sentence alignment tasks.

2.1 Document alignment

Automatic document alignment refers to determining the likelihood that two documents are translations of each other. Early work on document alignment was mostly based on metadata, such as URL-based properties [7, 8], publication date [9] and HTML document structure/tags [10, 27, 28]. These were further extended with topic modelling techniques [29]. Although metadata is a strong indication of document alignment, it is not effective on its own, as the alignment properties mainly lie in the textual content. Further, such properties cannot be generalized across different domains and web sources.

In translation-based document alignment methods, the objective is to identify a strong signal that a document in the source language is the translation of another in the target language. Some of these techniques incorporated a bilingual dictionary and checked the existence of bilingual lexical terms in the documents [30]. Some others considered the alignment information at word level [31,32,33] or phrase level [12, 34], or considered the existence of the n-best translated terms [11, 35] in the documents. A few techniques translated the non-English document to English and determined the alignment based on an MT evaluation metric [29, 36, 37]. Even though translation-based methods were able to score well in document alignment, their performance depends heavily on the accuracy of the alignment algorithm or the translation system used. Thus, in low-resource settings, these may produce sub-optimal results.

Vector representation-based techniques first derive a vector representation for the documents in the two languages and employ a semantic distance measurement metric to determine the semantic similarity between the documents. Document pairs that obtain a semantic similarity value above a pre-determined threshold are considered to be comparable. Bag-of-words, TF-IDF [38,39,40,41] and word n-grams [42] were among the early solutions to derive document representations.

Very recently, El-Kishky and Guzmán [13] used PMLMs to derive a vector representation of each of the documents in the considered languages. Then, the distance between these document vectors was calculated to determine the aligned document pair. They used the LASER pre-trained embeddings [19] to derive document embeddings. El-Kishky and Guzmán [13] experimented with high-resource, mid-resource and low-resource languages. As mentioned earlier, this is the baseline for our research, and more information on this technique can be found in Sect. 4.1.1.

2.2 Sentence alignment

Sentence alignment refers to the process of identifying parallel sentence pairs that are partial or complete translations of each other. Early work on parallel sentence alignment was based on sentence-length ratio [43, 44], which was purely statistical. However, when the correlation between source and target languages decreases, the performance of this approach drops rapidly [45]. Subsequent techniques considered bilingual dictionaries [14], word/phrase alignment probabilities between the sentences [12, 32] and phrase/sentence alignments coupled with bilingual suffix trees [46]. Stefanescu et al. [47] addressed sentence alignment as an information retrieval problem, while Munteanu and Marcu [15] trained a supervised classifier to determine the alignment. Some other techniques were based on machine translation [16, 48], where the non-English source sentences were translated to English and IR techniques were used to identify the aligned candidate sentence [49,50,51].

Recent work for sentence alignment was based on sentence representations by means of word embeddings or sentence embeddings. Here, first, the sentence embeddings were obtained for source and target sentences. Then, using a semantic similarity measurement, the aligned sentence pairs were identified. Initial work employed word-based embeddings trained on bi-directional Recurrent Neural Networks (RNNs) [52], Deep Averaging Networks (DANs) [53], bi-gram driven network architectures [54] and auto-encoders [55]. Hybrid techniques were also adopted, where a supervised classifier was used on top of the embedding-based semantic similarity calculation to determine the alignment [55, 56].

To optimize the results of the sentence alignment task, either the sentence representations should be enhanced or the semantic similarity distance scoring should be improved. Following the former path, Artetxe and Schwenk [19] used LASER supervised multilingual embeddings, while Kvapilíková et al. [17] experimented with XLM-based unsupervised multilingual embeddings. The choice of the sentence similarity measurement technique has been largely unsupervised (cosine similarity was the simplest one employed). However, this simple method is sub-optimal, and improved semantic similarity measurements have also been proposed [25, 54, 57]. We use Artetxe and Schwenk [25] as the baseline for the sentence alignment system, and this similarity measurement technique is further discussed in Sect. 4.2.1.

The final step is parallel sentence filtration, with the objective of removing any noisy parallel sentence pairs that had crept into the mined parallel corpus due to the noise in web data itself or due to the limitations in the preceding steps. However, this step is not explored in the scope of the study.

2.3 Pre-trained multilingual language models (PMLMs)

As discussed in the previous two sections, sentence representations derived from PMLMs have been vital in the success of recent document and sentence alignment techniques. Artetxe and Schwenk [19] used parallel data to train a shared encoder (available via the LASER toolkit), which performed well on massive-scale parallel corpus extraction projects such as ParaCrawl [5], WikiMatrix [58] and CCMatrix [59].

Current state-of-the-art multilingual models have been trained on the Transformer architecture [1]. The commonly used mBERT [21] (104 languages) and XLM-R [20] (100 languages) models were trained on monolingual data with the Masked Language Modelling (MLM) objective. Yang et al. [60] trained multilingual embeddings on parallel data using a bi-directional dual encoder with an additive margin softmax objective. The latter was used in the work of LaBSE [18] (109 languages), which was trained using both monolingual and parallel data with the MLM and Translation Language Modelling (TLM) objectives, producing state-of-the-art results for sentence alignment. However, LaBSE had not been evaluated in the context of low-resource languages for the tasks of document and sentence alignment.

2.4 Evaluating document alignment and sentence alignment

Datasets to evaluate document alignment and sentence alignment techniques have been introduced by several shared tasks. For example, Buck and Koehn [6] provided a hand-aligned dataset for evaluation of the document alignment task. However, this dataset was limited only to English and French. Rather than creating manually aligned datasets for the task of sentence alignment on comparable corpora, Zweigenbaum et al. [61] artificially injected parallel sentences into the comparable corpus. Their dataset also focused only on four language pairs, Chinese–English, French–English, German–English, and Russian–English. Some shared tasks did not present manually aligned datasets [26, 62]. Rather, the performance was evaluated by using the identified parallel sentences on a downstream NMT task.

3 Dataset

In this section, we describe the approach taken to create the gold standard evaluation set for the document alignment and sentence alignment tasks (Sect. 3.1). Afterward, we describe the human evaluation conducted to evaluate the quality of our gold-standard evaluation datasets (Sect. 3.2). In Sect. 3.3, we outline the bilingual lexicons used in this research.

3.1 Preparing document and sentence alignment evaluation datasets

Our research focuses on the Sinhala, Tamil and English languages, which are the official languages of Sri Lanka. We selected news websites that publish content in all three languages as comparable web sources. The selected websites were Hiru News^{Footnote 3}, NewsFirst^{Footnote 4}, Army News^{Footnote 5} and ITN^{Footnote 6}. We considered data from January 2013 to April 2021.

During pre-processing, the news content in the paragraphs of each web page was merged into a single string, and the text contained in the image and video tags was discarded. Further, we removed very short news documents that contained fewer than fifty tokens.

Army, Hiru and NewsFirst websites publish news in all three languages with the same content coverage, document structure, order of sentences and information flow. Hence, for most of the English documents, the exact translations were available in Sinhala and Tamil documents. However, for ITN News, this was different. We observed that the English article was not always available for the corresponding news in Sinhala and Tamil language articles, and for the ones with translations, there was a low correlation among the content as well.

Since our dataset was completely taken from the news domain, all the news documents had the published date as metadata. Moreover, in most cases, the same news document was published in all three languages on the same day. Therefore, before starting the aligning process, we filtered and grouped the documents using the published date and reduced the search space by a considerable amount.
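
The date-based filtering described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `candidate_pairs_by_date` and the `(doc_id, published_date)` record layout are assumptions made for the example.

```python
from collections import defaultdict


def candidate_pairs_by_date(source_docs, target_docs):
    """Yield (source_id, target_id) pairs that share a publication date.

    Each argument is an iterable of (doc_id, published_date) tuples.
    This record layout is hypothetical and chosen only for illustration.
    """
    # Index the target-side documents by their publication date.
    targets_by_date = defaultdict(list)
    for doc_id, date in target_docs:
        targets_by_date[date].append(doc_id)

    # Only documents published on the same day are compared later on.
    for doc_id, date in source_docs:
        for target_id in targets_by_date.get(date, []):
            yield doc_id, target_id
```

Under these assumptions, the expensive distance computation runs only within each date group rather than over the full cross product of source and target documents.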

We did the initial document alignment identification based on heuristics specific to the news website as described below.

The URL of each Hiru News document contains a unique ID, which is shared by the news articles published in the respective languages. We used this property to identify candidate-aligned document pairs. These were verified by a human annotator, and the alignments accepted by the annotator were considered for the gold standard evaluation subset for Hiru News.
Army News also had the publication date and time as shared attributes between the articles of the three languages. The same news was published in all three languages at the exact same date and time. Similar to Hiru News, we identified the candidate alignments for the Army News dataset using the publication date and time and later verified the alignments with the help of a human annotator.
Documents crawled from the NewsFirst and ITN websites did not have any such metadata that we could use to create the ground truth alignment. Therefore, the ground truth alignment was manually created by human annotators and was later verified by the same annotators by switching the datasets.

We consider the alignments verified by the annotators as the gold standard evaluation set. Altogether, eleven annotators were used to conduct the document alignment annotation.

The number of selected documents from each language pair along with the number of ground truth alignment pairs for each web source is shown in Table 1. Due to the low correlation between documents published by ITN, it has a lower number of aligned document pairs compared to other sources.

Table 1 Statistics of document alignment evaluation dataset


The aligned document pairs identified above were used as the input to the sentence alignment task. The number of input sentences on the source side and target side for each language pair is listed in Table 2. To conduct the sentence alignment annotation, we used five annotators altogether. Here also the annotations by one person were checked by another annotator for verification. Given a large number of sentences on each side, it would take a very long time for human annotators to find all sentence pairs that are translations of each other. Therefore, the gold standard sentence alignment evaluation set includes only 300 one-to-one sentence pairs from each website in all three language pairs.

Table 2 Statistics of the sentence alignment evaluation dataset


3.2 Human evaluation on the benchmark evaluation datasets

On top of the annotation verification in the preceding stage, we have conducted a more systematic qualitative evaluation on the gold-standard dataset, following the methodology proposed by Kreutzer et al. [4]. From our gold standard evaluation set, we have sub-sampled 100 document pairs and sentence pairs from each language pair. Thereafter, we allocated three annotators to conduct the alignment verification per language pair, independently.

The annotation criteria follow the work of Kreutzer et al. [4]. In their annotation scheme, the applicable labels for our dataset were CC (Correct translation, natural sentence) and CB (Correct translation, Boilerplate or low quality). When compiling our evaluation dataset, we had already filtered out short sentences during the pre-processing stage, so the label CS (Correct translation, Short) was not applicable. The rest of the annotation labels, X (Incorrect translation, but both correct languages), WL (Source OR target wrong language, but both still linguistic content) and NL (Not a language: at least one of source and target are not linguistic content), were also not applicable in our case since the evaluation set had already undergone human annotation.

We calculated the number of annotations for each label given by each annotator and obtained the average scores as done by Kreutzer et al. [4]. The same approach was followed for annotating the sentence alignment samples as well. The outcomes of the human evaluation for the document alignment and sentence alignment samples are shown in Table 3.

Table 3 Averages of the annotator scores for each label for document alignment and sentence alignment datasets


More than 75% of document pairs have been annotated as correct alignments. A random check showed that the pairs considered weak alignments had extra content on the target side that was not available on the source side, or vice versa. In some other document pairs, the two languages reported the same news incident from different perspectives, so the content differed. This was another reason why some document pairs were annotated as CB.

For the sentence-aligned dataset, more than 77% had been annotated as correct alignments. However, for the En–Si language pair, the alignments were almost perfect, achieving an average alignment score of 97.33%. For the sentences marked as CB, we found that the target side had additional information compared to the source side and vice versa.

Therefore, based on the average scores we believe the gold standard evaluation set is fit to be recommended as a quality evaluation dataset.

3.3 Bilingual lexicons

As parallel data, we considered the following bilingual lexicons: person names, designations, word dictionaries and glossaries. The English–Sinhala and English–Tamil Person Names and Designation lists were from the work of Priyadarshani et al. [63], while the Sinhala–Tamil bilingual lists were from Farhath et al. [64]. The bilingual dictionaries were extracted and used internally in an independent research project and are yet to be published. A part of the Tamil–English dictionary is available at the WMT 2020 shared task ^{Footnote 7}. We have obtained a Trilingual Glossary^{Footnote 8} from the Department of Official Languages, Sri Lanka. Statistics and samples of these bilingual lexicons are shown in Tables 4 and 5 respectively.

Table 4 Statistics of the bilingual lexicons


Table 5 Overview of the bilingual lexicons


4 Methodology

In Sect. 4.1, we describe El-Kishky and Guzmán [13]’s method for document alignment, which is used as the baseline in this study, and our improvement as a weighting scheme considering bilingual lexicons. In Sect. 4.2, we describe the baseline system by Artetxe and Schwenk [25], followed by our improvement considering the bilingual lexicons.

4.1 Document alignment

Our document alignment system makes use of multilingual sentence embeddings derived from PMLMs. In other words, we determine the alignment between two documents based on the semantic similarity between them, which is calculated by a distance scoring function. We use El-Kishky and Guzmán [13]’s technique as our baseline. We improve this distance scoring function by introducing a weighting scheme using bilingual dictionaries.

4.1.1 Baseline document alignment system

El-Kishky and Guzmán [13] defined (1) a distance scoring function to calculate the semantic distance between two documents and (2) a document matching algorithm to obtain the final aligned document pairs.

Distance scoring function Given a document pair, the objective of the distance scoring function is to calculate the semantic distance between the two documents. The smaller the semantic distance, the higher the degree of alignment. El-Kishky and Guzmán [13] introduced a novel distance metric named Cross-Lingual Sentence Mover’s Distance (XLSMD), which is based on Earth Mover’s Distance (EMD). XLSMD represents each document as a normalized bag-of-sentences (nBOS), in which every sentence carries a pre-calculated probability mass (weight). Equation (1) shows the semantic distance between documents A and B. Here, $\Delta (i, j)$ is the Euclidean distance between sentence i of document A and sentence j of document B, calculated from their sentence embeddings, and $T_{i,j}$ is how much of the probability mass of sentence i in document A is assigned to sentence j in document B. The constraints in Eq. (2) require the mass transported out of each sentence i to equal $d_{A,i}$ and the mass received by each sentence j to equal $d_{B,j}$.

$$\begin{aligned}&XLSMD(A, B) = \min _{T \geqslant 0} \sum _{i=1}^{V} \sum _{j=1}^{V} T_{i, j} \times \Delta (i, j) \end{aligned}$$

(1)

$$\begin{aligned}&Subject~to: \forall i \sum _{j=1}^{V} T_{i, j} = d_{A, i} \;\;,\;\; \forall j \sum _{i=1}^{V} T_{i, j} = d_{B, j} \end{aligned}$$

(2)

Equation (3) shows the first function used for the probability mass calculation. Here, they used the relative frequencies of sentences as the probability mass. $\sum _{s\in A}count(s)$ represents the sentence count in document A. After calculating XLSMD, the distance was used in the document matching algorithm discussed next.

$$\begin{aligned} d_{A,i} = \dfrac{count(i)}{\sum _{s\in A}count(s)} \end{aligned}$$

(3)

To make the XLSMD calculation more tractable, a greedy algorithm named Greedy Mover’s Distance (GMD) was introduced as an alternative to the relaxed EMD. The algorithm first calculates the Euclidean distance between each sentence pair and sorts the pairs in ascending order of distance. Then, it iteratively multiplies each distance by the smaller of the two sentences’ weights, which is called the flow value, and adds the product to the running total, as shown in Eq. (4).

$$\begin{aligned} distance = distance + \Vert s_{A} - s_{B}\Vert \times flow \end{aligned}$$

(4)
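
The greedy accumulation in Eq. (4), together with the relative-frequency mass of Eq. (3), can be sketched as below. This is an illustrative sketch only. The function names and the dictionary-based inputs are assumptions, and decrementing the remaining mass of both sentences after each transfer is an assumption borrowed from the usual greedy transport formulation; the description above states only the multiplication by the flow value.

```python
import math
from collections import Counter


def relative_frequency_mass(sentences):
    """Eq. (3): probability mass of each distinct sentence in a document."""
    counts = Counter(sentences)
    total = sum(counts.values())
    return {sentence: c / total for sentence, c in counts.items()}


def greedy_movers_distance(emb_a, mass_a, emb_b, mass_b):
    """Greedy approximation of the document distance (Eq. 4).

    emb_a / emb_b map each sentence to its embedding vector.
    mass_a / mass_b map each sentence to its probability mass.
    """
    remaining_a = dict(mass_a)
    remaining_b = dict(mass_b)

    # Euclidean distance for every sentence pair, sorted in ascending order.
    pairs = sorted(
        (math.dist(emb_a[i], emb_b[j]), i, j) for i in emb_a for j in emb_b
    )

    distance = 0.0
    for pair_distance, i, j in pairs:
        # The flow is the smaller of the two remaining weights.
        flow = min(remaining_a[i], remaining_b[j])
        if flow <= 0:
            continue
        distance += pair_distance * flow
        # Assumption: transferred mass is no longer available to later pairs.
        remaining_a[i] -= flow
        remaining_b[j] -= flow
    return distance
```

With two sentences per document and uniform mass of 0.5 each, a pair of identical embeddings contributes nothing, and the remaining mass is moved along the next-cheapest pair that still has mass on both sides.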

However, Eq. (3) assigns probability mass uniformly across the sentences. Therefore, El-Kishky and Guzmán [13] introduced the following advanced weighting schemes in place of relative frequency.

Sentence length (SL) weighting

The SL weighting scheme was used under the assumption that longer sentences should be given more probability mass than shorter sentences. Equation (5) defines how this weight is calculated.

$$\begin{aligned} d_{A,i} = \dfrac{count(i)\times \vert i\vert }{\sum _{s\in A}count(s)\times \vert s \vert } \end{aligned}$$

(5)

Here, $\vert {i}\vert $ and $\vert {s}\vert $ represent the number of tokens in the sentences i and s, respectively.

IDF weighting

IDF stands for inverse document frequency. Here, they have used the argument that the sentences that occur more frequently in the corpus should be given less importance than the infrequent sentences in the document. Equation (6) defines how it is calculated.

$$\begin{aligned} d_{A,i} = 1 + \log \dfrac{N + 1}{1 + |\{d\in D:s\in d\}|} \end{aligned}$$

(6)

Here, N is the total number of documents in domain D, and $|\{d\in D:s\in d\}|$ is the number of documents that contain sentence s.

SLIDF weighting

In this weighting scheme, both SL and IDF weights have been multiplied to obtain an aggregated weight as shown in Eq. (7). This was to give importance to both the number of tokens and the IDF of the sentence within the document collection.

$$\begin{aligned} d_{A,i} = SL(i) * IDF(i) \end{aligned}$$

(7)

Similarly, the same weighting calculations SL, IDF and SLIDF had been done in the reverse direction, i.e., target to source document to calculate the probability mass for $d_{B, j}$ (i.e., $j^{th}$ sentence in the target document B).
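
A minimal sketch of the three weighting schemes in Eqs. (5) to (7) is given below. It assumes that a document is a list of sentence strings, that a corpus is a list of such documents, and that tokens are obtained by whitespace splitting; the function names are hypothetical, and none of these choices is taken from the original implementation.

```python
import math
from collections import Counter


def sl_weights(doc):
    """Eq. (5): sentence-length weighting for one document."""
    counts = Counter(doc)
    total = sum(c * len(s.split()) for s, c in counts.items())
    return {s: c * len(s.split()) / total for s, c in counts.items()}


def idf_weights(doc, corpus):
    """Eq. (6): inverse document frequency of each sentence in the document."""
    n_docs = len(corpus)
    weights = {}
    for sentence in set(doc):
        # Number of documents in the corpus that contain this sentence.
        doc_freq = sum(1 for other in corpus if sentence in other)
        weights[sentence] = 1 + math.log((n_docs + 1) / (1 + doc_freq))
    return weights


def slidf_weights(doc, corpus):
    """Eq. (7): product of the SL and IDF weights."""
    sl = sl_weights(doc)
    idf = idf_weights(doc, corpus)
    return {s: sl[s] * idf[s] for s in sl}
```

Note that the values produced by Eqs. (6) and (7) as written do not sum to one, so a renormalization step may be needed before they are used as probability mass; the sketch does not apply one.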

Document matching algorithm In this algorithm, initially, the semantic distances between each source document and target document were calculated according to the above-mentioned scoring function. Then, starting from the document pair containing the minimum distance, subsequent pairs $d_A$ and $d_B$ were selected iteratively, such that the documents $d_A$ and $d_B$ had not been considered in a previous selection.
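
The matching step can be sketched as a greedy one-to-one selection over the pre-computed distances. The function name and the dictionary input keyed by `(source_id, target_id)` are assumptions made for this example.

```python
def greedy_document_matching(distances):
    """Select one-to-one document pairs in ascending order of distance.

    distances maps (source_id, target_id) to the semantic distance
    between the two documents.
    """
    matched_sources = set()
    matched_targets = set()
    aligned = []

    # Visit candidate pairs from the smallest distance to the largest.
    for (source_id, target_id), _ in sorted(distances.items(), key=lambda kv: kv[1]):
        # Skip pairs whose source or target was already selected.
        if source_id in matched_sources or target_id in matched_targets:
            continue
        aligned.append((source_id, target_id))
        matched_sources.add(source_id)
        matched_targets.add(target_id)
    return aligned
```

Because each document is used at most once, a pair with a small distance can block a later pair that shares one of its documents, even when that later pair would otherwise rank highly.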

4.1.2 New weighting scheme based on bilingual lexicons

As a novel contribution, we modify the distance scoring function of El-Kishky and Guzmán [13], by introducing a new weighting scheme considering bilingual lexicons. This weight calculation differs based on the nature of the term mapping in the bilingual lexicons (as word-to-word mappings or phrase-to-phrase mappings). This is described in the following section. With our improvement, the semantic distance calculation between a source side document $d_A$ and target side document $d_B$ is shown in Fig. 1.

We use the bilingual lexicons mentioned in Sect. 3.3 to introduce a weighting scheme on top of the SL, IDF and SLIDF schemes. Here, if a sentence $s_{A}$ from document A contains a word in the bilingual lexicon and its translation appears in sentence $s_{B}$ from document B, a variable count is incremented. The total number of such words in sentence $s_{A}$ is the final count value. The weighting between the two sentences $s_{A}$ and $s_{B}$ considering the variable count is shown in Eq. (8).

$$\begin{aligned} w_{A,B} = \dfrac{|s_{A}|- count}{|s_{A}|} \;\;\;\;\;\;\; |s_{A}|= Number~of~tokens~in~sentence~s_{A} \end{aligned}$$

(8)

The weighting $w_{A,B}$ is incorporated into the GMD algorithm by modifying the distance calculation as shown in Eq. (9). The distance is calculated in this way for each sentence pair in the two documents, and the distance accumulated over all sentence pairs is the semantic distance between the document pair.

$$\begin{aligned} distance = distance + \Vert s_{A} - s_{B}\Vert \times flow \times w_{A,B} \end{aligned}$$

(9)

This way, the more words in a sentence pair that match bilingual lexicon entries, the smaller the distance between the two sentences.
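A minimal sketch of Eqs. (8) and (9) is given below. It assumes that the flow values have already been produced by the mover's-distance step and that the lexicon match counts are known; the tuple layout and function names are hypothetical.

```python
import math


def lexicon_weight(n_tokens, count):
    """Eq. (8): w = (|s_A| - count) / |s_A|."""
    return (n_tokens - count) / n_tokens


def weighted_document_distance(sentence_pairs):
    """Eq. (9): accumulate ||s_A - s_B|| * flow * w over sentence pairs.

    Each item is (vec_a, vec_b, flow, n_tokens_a, count); flow is assumed given.
    """
    distance = 0.0
    for vec_a, vec_b, flow, n_tokens, count in sentence_pairs:
        euclid = math.dist(vec_a, vec_b)  # Euclidean distance between embeddings
        distance += euclid * flow * lexicon_weight(n_tokens, count)
    return distance


# Toy 2-d "embeddings": the first pair has 2 lexicon matches out of 10 tokens.
pairs = [
    ((0.0, 0.0), (3.0, 4.0), 0.5, 10, 2),
    ((1.0, 1.0), (1.0, 2.0), 0.5, 4, 0),
]
total = weighted_document_distance(pairs)
```

With two of ten tokens matched, the first pair's contribution is scaled by 0.8, while the second pair (no matches) keeps its full distance.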

Usage of bilingual lists with one-word entries

Our person names list falls into this category. We added the parallel words in the person names bilingual list into a dictionary data structure where keys are words from language A and the values are arrays of translations of the key in language B. (Sometimes, one person’s name has multiple translations due to multiple types of spelling formats.) When calculating the weights, for each sentence pair, we iterated through the words in the sentence to calculate the mapping counts. Here, we split the sentence $s_{A}$ into words and check if each word w exists in the dictionary. If it exists, we get the parallel words $v_{B}$, and check if each parallel word exists in the sentence $s_{B}$. If so, we increase the counter and remove the mapped word from the sentence $s_{B}$. This counter value is used as the input in Eq. (8). Algorithm 1 in “Appendix A” explains this process.
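The counting procedure for one-word entries can be sketched as follows. This is an illustration of the description above rather than a reproduction of Algorithm 1: whitespace tokenisation and stopping at the first matching spelling are assumptions, and the lexicon entries are invented placeholder tokens, not real lexicon data.

```python
def count_single_word_matches(sent_a, sent_b, lexicon):
    """Count words of sent_a whose lexicon translation occurs in sent_b.

    lexicon maps a language-A word to a list of language-B spellings.
    """
    remaining = sent_b.split()  # mutable copy of the target-side tokens
    count = 0
    for word in sent_a.split():
        for translation in lexicon.get(word, []):
            if translation in remaining:
                count += 1
                remaining.remove(translation)  # a target word is matched only once
                break  # assumption: stop at the first matching spelling
    return count


# Placeholder entries: one name with two spelling variants, one with a single form.
lexicon = {"Perera": ["perera_b", "pereraa_b"], "Silva": ["silva_b"]}
sent_a = "Minister Perera met Silva and Perera"
sent_b = "perera_b w1 silva_b w2"
count = count_single_word_matches(sent_a, sent_b, lexicon)
```

The second occurrence of "Perera" is not counted because its only translation present in `sent_b` was already consumed by the first match.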

Usage of bilingual lists with multi-word entries

Usage of designations bilingual list and word dictionary

Unlike the person names bilingual list, our designations list and word dictionary fall into this category, meaning that their entries can contain more than one word (i.e., phrases). Therefore, when calculating weights, we implement a separate algorithm to identify multi-word mappings. Here, for each sentence $s_{A}$, we get all the permutations of words from length one to length five (the maximum length of a record in the dictionary is five). Then, we follow the same process described above to get the mapping counts. Algorithm 2 depicts this process. When the person names list, designations list and word dictionary are used in combination, we sum up the count values returned from both Algorithm 1 and Algorithm 2 in “Appendix A”, and use that value as the input for Eq. (8).
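A possible shape for the multi-word matching is sketched below. It interprets the word sequences of length one to five as contiguous n-grams, which is an assumption and not something the text confirms; trying longer phrases first and matching each target phrase once are also assumptions. The lexicon entries are invented placeholders.

```python
def ngrams(tokens, max_len=5):
    """All contiguous word sequences of length 1..max_len, longest first."""
    for n in range(min(max_len, len(tokens)), 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_phrase_matches(sent_a, sent_b, lexicon, max_len=5):
    """Count lexicon phrases in sent_a whose translation occurs in sent_b."""
    # Pad with spaces so that phrases only match on whole-word boundaries.
    remaining = " " + " ".join(sent_b.split()) + " "
    count = 0
    for phrase in ngrams(sent_a.split(), max_len):
        for translation in lexicon.get(phrase, []):
            padded = " " + translation + " "
            if padded in remaining:
                count += 1
                remaining = remaining.replace(padded, " ", 1)  # consume the match
                break
    return count


# Placeholder phrase entries (not real lexicon data).
lexicon = {
    "prime minister": ["pm_b1 pm_b2"],
    "defence secretary": ["ds_b1 ds_b2"],
}
sent_a = "the prime minister met the defence secretary"
sent_b = "pm_b1 pm_b2 ds_b1 ds_b2 w1"
count = count_phrase_matches(sent_a, sent_b, lexicon)
```

When several lexicons are combined, the counts from the one-word and multi-word procedures would simply be added before being passed to Eq. (8).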

Improved dictionary

To improve the dictionary further, we add the terms from the glossary mentioned in Sect. 3.3. However, we could not see any improvement in terms of the scores. When investigated further, we observed that the glossary terms were mostly phrases and combined phrases as opposed to a single word. As a result, when the glossary was cross-checked with the sentence pairs, the number of overlapping terms was very low. Therefore, we utilized the parallel phrases in the dictionary to identify the distinct word pairs within the glossary terms. First, we cross-checked the phrases in the glossary with the words in the dictionary. We removed the parallel words that we found in the glossary phrases and extracted the remaining words from both languages as a parallel record. This way, the number of words in one record in the glossary got reduced by a considerable amount and we were able to improve the existing dictionary by adding the records we found from the glossary to the word dictionary. An overview of the improved dictionary is shown in Table 6.
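The reduction of a glossary phrase pair can be illustrated with the sketch below. It assumes whitespace tokenisation and a dictionary that maps a word to a list of translations; the function name and the placeholder entries are hypothetical, and the actual procedure may differ in detail.

```python
def reduce_glossary_entry(phrase_a, phrase_b, dictionary):
    """Strip word pairs already in the dictionary from a glossary phrase pair.

    Returns the leftover (phrase_a, phrase_b) record, or None if either side is empty.
    """
    words_a = phrase_a.split()
    words_b = phrase_b.split()
    for word in list(words_a):  # iterate over a copy while removing items
        for translation in dictionary.get(word, []):
            if translation in words_b:
                words_a.remove(word)
                words_b.remove(translation)
                break
    if not words_a or not words_b:
        return None
    return " ".join(words_a), " ".join(words_b)


# Placeholder dictionary entries (not real lexicon data).
dictionary = {"national": ["national_b"], "department": ["department_b"]}
record = reduce_glossary_entry(
    "national audit department",
    "national_b audit_b department_b",
    dictionary,
)
```

The three-word glossary phrase is reduced to a one-word record, which is far more likely to overlap with tokens in a sentence pair.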

Table 6 Overview of the improved dictionary


4.2 Sentence alignment

Our sentence alignment system makes use of multilingual sentence embeddings derived from PMLMs. In other words, we determine the sentence alignment, considering the semantic similarity between the sentence pairs, calculated using these sentence embeddings. The baseline system is implemented according to the work of Artetxe and Schwenk [25]. Their method for semantic distance was defined as the margin-based cosine similarity over its nearest neighbours. We have introduced a weighting considering the bilingual lexicons, on top of this semantic distance.

4.2.1 Baseline sentence alignment system

Artetxe and Schwenk [25] obtained the LASER multilingual sentence embeddings for all the source and target sentences and aligned these sentence embeddings using a margin-based cosine similarity function. This similarity measurement considered a margin between the cosine of a given sentence pair and that of its respective nearest neighbours.
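A minimal sketch of margin-based scoring is shown below, assuming the ratio variant of the margin, in which the cosine of a pair is divided by the average cosine of both sentences to their k nearest neighbours. The text above does not state which margin variant was used, so this choice, the pure-Python implementation and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_scores(src, tgt, k=4):
    """Ratio-margin score for every (source, target) embedding pair."""
    cos = [[cosine(x, y) for y in tgt] for x in src]
    k_src = min(k, len(tgt))  # neighbours of a source sentence are target sentences
    k_tgt = min(k, len(src))  # and vice versa
    # Average cosine of each sentence to its k nearest neighbours on the other side.
    src_avg = [sum(sorted(row, reverse=True)[:k_src]) / k_src for row in cos]
    tgt_avg = [sum(sorted(col, reverse=True)[:k_tgt]) / k_tgt for col in zip(*cos)]
    return [
        [cos[i][j] / ((src_avg[i] + tgt_avg[j]) / 2) for j in range(len(tgt))]
        for i in range(len(src))
    ]


# Toy 2-d embeddings; k=1 keeps the arithmetic easy to follow.
src = [(1.0, 0.0), (0.0, 1.0)]
tgt = [(0.9, 0.1), (0.1, 0.9)]
scores = margin_scores(src, tgt, k=1)
```

A pair scores highly only when its cosine stands out against the neighbourhood of both sentences, which penalises "hub" sentences that are close to everything.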

Artetxe and Schwenk [25] proposed the following three criteria for candidate generation, favouring higher recall at the cost of precision.

Forward: Each source sentence was aligned with exactly one best scoring target sentence. As a result, some target sentences may be aligned with multiple source sentences or with none.
Backward: Equivalent to the forward strategy, but followed the candidate selection in the opposite direction.
Intersection: Intersection of forward and backward criteria, with the objective of discarding inconsistent alignments.
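The three criteria above can be sketched over a score matrix as follows, where rows are source sentences and columns are target sentences. The function names and the toy scores are hypothetical.

```python
def forward(scores):
    """Best-scoring target index for every source index."""
    return {
        (i, max(range(len(row)), key=row.__getitem__))
        for i, row in enumerate(scores)
    }


def backward(scores):
    """Best-scoring source index for every target index."""
    columns = list(zip(*scores))
    return {
        (max(range(len(col)), key=col.__getitem__), j)
        for j, col in enumerate(columns)
    }


def intersection(scores):
    """Pairs selected by both directions."""
    return forward(scores) & backward(scores)


# Two source sentences (rows) and three target sentences (columns).
scores = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.7],
]
fwd, bwd, both = forward(scores), backward(scores), intersection(scores)
```

In this toy matrix the forward criterion aligns both source sentences with target 0, and the intersection keeps only the pair on which both directions agree.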

4.2.2 Our improvements for sentence alignment

As our contribution to the sentence alignment task, we improve the semantic distance measurement step of Artetxe and Schwenk [25]’s method by introducing a weighting scheme using the bilingual lexicons (Sect. 3.3). When sentences in the source language document $d_A$ and sentences in the target language document $d_B$ are given as inputs, the sentence alignment algorithm produces the aligned parallel sentence pairs, as shown in Fig. 2.

In the forward criterion, based on the cosine similarity, we select a neighbourhood of the k = 4 best-matching candidates (see footnote 9) for each source sentence, similar to Artetxe and Schwenk [25]. Then, the margin-based cosine similarity is used over its nearest neighbours to determine the aligned target sentence. Here, if the source sentence $s_{A}$ from document A contains a word w in the bilingual lexicon and the target sentence $s_{B}$ from the selected k candidates contains the translation of the word w, the variable count is incremented. This count value is used to calculate the weight using Eq. (10) (the multiplicative inverse of Eq. (8)), to give a higher weight to sentence pairs having more overlapping tokens and a lower weight to sentence pairs with fewer overlapping tokens.

$$\begin{aligned} w_{A,B} = \dfrac{|s_{A}|}{|s_{A}|- count} \;\;\;\;\;\;|s_{A}|= Number~of~tokens~in~source~sentence ~s_{A} \end{aligned}$$

(10)

The new similarity score between each source sentence $s_{A}$ and each target sentence $s_{B}$ is calculated using Eq. (11), according to the selected k candidates.

$$\begin{aligned} similarity\_score_{A,B} = cosine\_similarity_{A,B} \times w_{A,B} \end{aligned}$$

(11)

Then, each source sentence is aligned with the best scoring target sentence according to the above-calculated similarity scores.
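The forward-direction re-scoring of Eqs. (10) and (11) can be sketched as follows. The sketch assumes that the similarity of each of the k candidates and its lexicon match count are already known, and it raises an error when count equals $|s_{A}|$, since the text does not say how that case is handled. The names and toy values are hypothetical.

```python
def inverse_lexicon_weight(n_tokens, count):
    """Eq. (10): w = |s_A| / (|s_A| - count)."""
    if count >= n_tokens:
        # Assumption: this case is not covered by the text, so it is rejected here.
        raise ValueError("count must be smaller than the number of tokens")
    return n_tokens / (n_tokens - count)


def best_candidate(n_tokens_a, candidates):
    """Pick the best target among the k candidates using Eq. (11).

    candidates: list of (target_id, similarity, count).
    """
    scored = [
        (tgt, sim * inverse_lexicon_weight(n_tokens_a, count))
        for tgt, sim, count in candidates
    ]
    return max(scored, key=lambda item: item[1])


# A 10-token source sentence with three candidate target sentences.
candidates = [("b1", 0.80, 0), ("b2", 0.75, 2), ("b3", 0.60, 1)]
best_id, best_score = best_candidate(10, candidates)
```

Candidate `b2` has a lower raw similarity than `b1`, but its two lexicon matches raise its score to 0.75 × 1.25, so it becomes the selected alignment.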

In the backward criterion, for each sentence on the target side, an aligned sentence from the source side is identified. This is the reverse of the forward criterion method. Therefore, the weight calculation needs to be modified as shown in Eq. (12). Here, $s_{B}$ refers to the selected sentence from the target side, and $w_{B,A}$ refers to the weight between $s_{B}$ and each of its nearest neighbours identified from the source side. The count is incremented when a word in $s_{B}$ exists in the bilingual lexicon and its translation exists in the source sentence retrieved from the nearest neighbours. The nearest neighbour retrieval is based on the cosine similarity, similar to Artetxe and Schwenk [25].

$$\begin{aligned} w_{B,A} = \dfrac{|s_{B}|}{|s_{B}|- count} \;\;\;\;\;\;|s_{B}|= Number~of~tokens~in~target~sentence ~s_{B} \end{aligned}$$

(12)

The final similarity score between sentences $s_{B}$ and $s_{A}$ is shown in Eq. (13).

$$\begin{aligned} similarity\_score_{B,A} = cosine\_similarity_{B,A} \times w_{B,A} \end{aligned}$$

(13)

In the intersection criterion, the intersection of the sentence pairs identified by the forward and backward criteria is taken. Therefore, this step is identical to the work of Artetxe and Schwenk [25].

5 Evaluation

We evaluated our improvements separately for document alignment and sentence alignment tasks, using the golden alignment dataset we prepared (see Sect. 3). Further, an extrinsic evaluation was conducted on sentence alignment by training an NMT system.

5.1 Document alignment

El-Kishky and Guzmán [13] used LASER multilingual sentence embeddings in their experiments. Therefore, for the document alignment task, we report the results for the baseline system using LASER embeddings only. Subsequently, an ablation study is conducted by sequentially adding each bilingual lexicon on top of the previous experiment. Then, we repeat the above experiments for XLM-R and LaBSE. Thus, this becomes the first empirical study of these three models for the task of document alignment.

Similar to El-Kishky and Guzmán [13], our technique is aimed at high recall at the cost of low precision. Nevertheless, we report the recall (R), precision (P) and F1 scores over the gold-standard evaluation set. We experimented with English–Sinhala, English–Tamil and Sinhala–Tamil language pairs for each news web source. For each language pair, we report the averages of the individual scores obtained for the news sources in Table 7. Results per news source are reported in Appendix B1.
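For reference, recall, precision and F1 over a gold-standard set of aligned pairs can be computed as in the sketch below. It assumes the standard set-based definitions, with each alignment represented as a (source, target) identifier pair; the identifiers are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 of predicted pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
p, r, f1 = precision_recall_f1(predicted, gold)
```

Per-language-pair figures such as those in Table 7 would then be obtained by averaging these scores over the individual news sources.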

Table 7 Averaged recall (R), precision (P) and F1 scores obtained for each news source with respect to the language pairs


The document alignment results show that the baseline result of El-Kishky and Guzmán [13] has been outperformed by our improvement when incorporating all the bilingual lexicons (BL+N+Ds+MDc) for all three language pairs. Considering the averaged F1 scores, the improvement is significant (around a 44% increase compared to the baseline) for the En–Ta language pair. The improvements for Si–Ta and En–Si are 13% and 2%, respectively. This is an interesting observation. Sinhala and Tamil are considered to be under-represented in LASER, meaning that the cross-lingual alignment related to these languages is weak. Therefore, using bilingual dictionaries can enhance the cross-lingual alignment between language pairs. Additionally, the performance of document alignment depends on the correlation between source and target documents. One good example is the Army news source, on which even the baseline system performed well.

Further, it was noted that the results for En–Ta were very low compared to the other language pairs, En–Si and Si–Ta. Tamil belongs to the Dravidian family, and this language family is under-represented in many pre-trained models. Moreover, Dravidian languages have a higher linguistic distance from English, compared to the Indo-Aryan family, to which Sinhala belongs. We suspect these are the reasons for the lower results for the En–Ta language pair.

Both XLM-R and LaBSE outperformed the LASER scores. A further observation was that the LaBSE baseline was higher than that of XLM-R. XLM-R and LaBSE had been pre-trained using a massive collection of monolingual data using the transformer architecture, while LASER was built on the RNN architecture. Hence, we believe that they have captured the cross-lingual features better than LASER. Additionally, LaBSE had also used parallel data to improve the multilingual embeddings and to strengthen the cross-lingual transfer. As a result, LaBSE embeddings are more favourable for the document alignment task.

XLM-R and LaBSE baseline scores being better than the LASER scores for all three language pairs suggest that the multilingual embeddings obtained via self-supervised learning have a better language representation for low-resource languages, compared to LASER, which was trained in a supervised manner. This is very beneficial for non-English centric language pairs such as Si–Ta, which have been explored to a lesser extent.

When it comes to the XLM-R results, the absolute average F1 score gains produced by using lexicons are in the range of 0.7 points for En–Si, 1.7 points for En–Ta and 1.3 points for Si–Ta. When LaBSE is used, these gains are less than 0.5 F1 points. Therefore, we can conclude that XLM-R and LaBSE already have rich cross-lingual alignment information, and the amount of additional information provided by bilingual lexicons is relatively small.

Considering the experiment using LASER embeddings, we could observe that the gains were maximum when using all bilingual lexicons. Therefore, if we could find more lexicons we could increase the task performance. Even though the person names bilingual list of Sinhala–Tamil is about ten times larger than that for the other language pairs, we could not see a considerable improvement in Sinhala–Tamil compared to the other two. This may be due to the inflected nature of the two languages. The names could be in the inflected form in the parallel content, while the lexicons contain the names in the base form.

5.2 Sentence alignment

For the sentence alignment experiments, we used three baselines:

1. Artetxe and Schwenk [25]’s method. As mentioned in Sect. 4.2, they used LASER multilingual embeddings and considered the alignments based on Forward, Backward and Intersection criteria using margin-based cosine similarity as the distance calculation method.
2. Hunalign [14], for the purpose of comparing our work with a statistical method. Hunalign has been used as a baseline for other research that experimented with embedding-based techniques for sentence alignment [5].
3. Feng et al. [18]’s method. They conducted sentence alignment using raw cosine similarity over the sentence embeddings obtained from LaBSE. This baseline is useful to compare the effect of the margin-based cosine similarity [25] and raw cosine [18] distance measurement.

We applied our improvement to Artetxe and Schwenk [25]’s method, using LASER embeddings, as done by Artetxe and Schwenk [25]. Then, these experiments were repeated for XLM-R and LaBSE. We conducted the dictionary improvement on top of Feng et al. [18]’s baseline as well.

As our ground-truth alignment contains only a small fraction (approx. 300) of parallel sentences, there can be many more valid cross-lingual sentence pairs in these datasets. Therefore, we evaluated aligned sentence pairs using recall (i.e., what percentage of the sentence pairs in the golden alignment set are found by the algorithm), which is one of the commonly used measurements in other research as well [25, 65]. The results are shown in Table 8. Note that since the use of all the bilingual lexicons gave the best result for the document alignment task, here we have considered all the bilingual lexicons for the BL+Dic (baseline with dictionary improvement) experiments.

Table 8 Sentence alignment results in terms of recall (R)


First and foremost, we note that multilingual embedding-based methods significantly outperform Hunalign [14]. Even the baseline [25] outperforms Hunalign by a significant margin of 74% with respect to recall for En–Si languages.

Compared to the LaBSE baseline that uses raw cosine similarity [18], Artetxe and Schwenk [25]’s margin-based cosine similarity reports a recall value that is around 3% higher for the Sinhala–Tamil and Tamil–English pairs. Therefore, we can conclude that margin-based cosine similarity is favourable for the sentence alignment task.

Our sentence alignment system that incorporates bilingual lexicons outperforms Artetxe and Schwenk [25]’s method in all three language pairs for all the websites, with very few exceptions, as seen in Table 8. The Tamil–English language pair shows the highest improvement, outperforming the baseline system by 15% on average. For the Sinhala–Tamil and Sinhala–English pairs, average recall gains of 8% and 4% (respectively) were obtained for LASER embeddings.

Baseline sentence alignment results for both Tamil–English and Sinhala–Tamil language pairs are considerably low compared to Sinhala–English for LASER embeddings. The low amount of training data used for Sinhala and Tamil when training the LASER toolkit could be the reason for that [19]. Further, Tamil belongs to the Dravidian family, and this language family is under-represented in many pre-trained models. Moreover, Dravidian languages have a higher linguistic distance from English, compared to the Indo-Aryan family, to which Sinhala belongs. We suspect these are the reasons for the lower results for the En–Ta language pair. The same observation was made for the document alignment results.

The bilingual lexicon terms are in nominative form. However, Sinhala and Tamil are morphologically rich languages, which means words are inflected based on gender, plurality, or morphological case category. Although a word may exist in nominative form in the bilingual dictionary, it can appear in an inflected form in the sentences, so the dictionary improvement fails to identify such cases. We suspect this to be a main reason why our improvement is marginal for the Si–Ta language pair in particular.

We observe that the baseline sentence alignment scores considering XLM-R and LaBSE have outperformed the LASER baseline. Further, the LaBSE baselines produce the highest scores across all three language pairs. XLM-R had been trained on a massive collection of monolingual data, while LaBSE had been pre-trained using monolingual data and fine-tuned using parallel data. We believe the underlying reason for the improvement in scores is the improvement in the language representations. Although XLM-R had been trained purely in an unsupervised manner, it had still managed to capture cross-lingual features of the languages well enough to be favourable for the sentence alignment task. Since LaBSE had been fine-tuned using parallel data, we believe that this step helped to further improve the cross-lingual alignments in the embeddings produced by the model. As a result, the LaBSE scores are the highest.

According to the results, the gain from our dictionary improvement over the XLM-R and LaBSE baselines is not significant for the Si–En and Ta–En language pairs. For Si–Ta, our improvement produces a gain of +0.5 recall. This shows that the multilingual representations of XLM-R and LaBSE have room for improvement when it comes to non-English-centric pairs from diverse language families, such as Sinhala and Tamil.

5.3 Extrinsic evaluation with NMT

To analyse the effectiveness of incorporating bilingual lists and different multilingual embeddings into the sentence alignment task, we conducted an extrinsic evaluation by training NMT systems with the obtained parallel sentences. We merged the parallel sentences obtained from each news source and trained NMT systems specific for the language pair in the forward and in reverse directions.

We used the SiTa trilingual (Sinhala, Tamil and English) parallel machine translation (MT) evaluation sets [66] created by the National Languages Processing Center of the University of Moratuwa, Sri Lanka (see footnote 10) to evaluate the NMT performance. Additionally, we report the MT scores for the Flores v1 [67] evaluation set for Sinhala–English and Flores-101 [68] multilingual evaluation set for Tamil–English language pairs.

More recently, NMT systems fine-tuned on the mBART50 sequence-to-sequence pre-trained model [69] have been successful for Sinhala and Tamil [70, 71]. Therefore, in order to build an NMT model, we decided to fine-tune the mBART50 model with the parallel sentences obtained from the sentence alignment task. Experiments were done using the fairseq toolkit [72], and the performance was evaluated using the evaluation datasets mentioned above. BLEU scores were obtained with sacreBLEU [73].

Table 9 BLEU scores for NMT systems trained with parallel data obtained from sentence alignment step considering Forward (F), Backward (B) and Intersection (I) criterion


The NMT results shown in Table 9 are rather low, which we believe is due to the following reasons. (1) The SiTa evaluation dataset has been obtained from the official document domain, while the Flores evaluation datasets have been obtained from Wikipedia. In contrast, we mined the parallel corpus from the news domain. Therefore, the domain difference is identified as the primary reason for the NMT systems to produce low results. (2) The parallel corpus size produced by the sentence alignment task is in the range of 9,000–23,000, which marks an extremely low-resource setting [3]. Both these reasons lead to the NMT system producing a low result. However, we believe that this is not a bottleneck in conducting our study as we are only interested in analysing the impact of the bilingual lexicon integration on the sentence alignment task.

We observe that comparable results are obtained across all languages for Backward and Intersection criteria for NMT models for Si$\rightarrow $En, Ta$\rightarrow $En and Si$\rightarrow $Ta. In the backward criterion, for each target language sentence, an aligned sentence from the source language is obtained. Therefore, the selected source sentence might not always guarantee a proper translation for the target sentence. This can be identified as a weak parallel sentence pair with the noise at the source side. This is an interesting observation as it indicates that the NMT is robust to source side noise. However, when the noise is in the target side (as in the case of Forward criterion), it degrades the performance of the NMT. Since the Intersection is dependent on the Backward criterion, the improvement can also be seen in NMT systems trained with the Intersection criterion. In the NMT systems trained for En$\rightarrow $Si, En$\rightarrow $Ta and Ta$\rightarrow $Si, the same observation is true for Forward and Intersection criteria. Here, the target language for the NMT system is picked up from the forward criterion. That is, in the case of En$\rightarrow $Si NMT, with the Forward criterion, for each Si sentence, an En sentence is identified. So here the noisy sentence is found on the source-side (En). Therefore, for the NMT systems in the reverse direction, the Forward criterion is favourable.

We see that the NMT scores obtained by bilingual lists have improved over the baseline scores for most of the cases as per Table 9. This means that bilingual list integration has improved the quality of the parallel sentences. Considering the SiTa evaluation set, the maximum gain provided for LASER is +1.8 BLEU, XLM-R is +0.9 BLEU and for LaBSE it is +0.5 BLEU. Similarly, for Flores evaluation set, it is +1.4, +0.6 and +0.5 BLEU for LASER, XLM-R and LaBSE (respectively). Here we can see identical patterns with respect to both evaluation sets. The gain is the highest for LASER while for XLM-R and LaBSE it is in the same range. Although the Wikipedia data have been used during training these multilingual PMLMs, it is evident that the multilingual embeddings are not biased to the evaluation set on Wikipedia.

For Ta$\rightarrow $En and Ta$\rightarrow $Si directions, it shows a maximum improvement of +1.7 BLEU and +1.3 BLEU scores (respectively) for the LASER embeddings for the SiTa evaluation set. As Tamil is an under-represented language in the LASER training data, the lexicon integration has managed to improve the NMT scores.

In sentence alignment results, the scores were always in increasing order for LASER, XLM-R and LaBSE, respectively. However, for the downstream NMT task, we observed that the scores were mostly high for LASER and LaBSE compared to XLM-R. Although we expected the sentence alignment scores and NMT scores to follow the same pattern, it was not the case. LASER had been trained purely on parallel data while LaBSE had been pre-trained using monolingual and parallel data, followed by a fine-tuning phase with parallel data. Therefore, we observe that multilingual systems pre-trained with parallel data perform better in the NMT downstream task.

6 Further analysis

Table 10 Error analysis in the sentence alignment task. Here, the alignment[corr] refers to the alignment in the gold-standard evaluation set and alignment[incorr] refers to the alignment produced in the experiments


We conducted further analysis to identify the impact of lexicon integration on sentence alignment. Table 10 shows three scenarios where lexicon integration did not work. An example is given for the Sinhala–English pair. However, these findings are valid for other language pairs as well.

As shown in scenario A, the sentence pair that should be aligned does not contain any terms that overlap with the bilingual lexicons. Hence, it cannot benefit from our lexicon integration. Further, the En sentence and another Si sentence from the same context have overlaps in terms of parallel lexicons. As a result, the sentence alignment algorithm selects an incorrect Sinhala sentence as the alignment for the English source sentence.

In scenario B, when there are equal overlaps between the candidate aligned sentences, the lexicon improvement is not effective. In such instances, the alignment is purely determined by the margin-based cosine similarity. In this example, both Sinhala candidate sentences have two lexicon overlaps; therefore, the selection of the aligned sentence cannot be based on the integrated lexicon.
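A short worked step makes this explicit. Assuming the forward direction, where both candidates $s_{B_1}$ and $s_{B_2}$ are scored against the same source sentence $s_{A}$ (the candidate subscripts are introduced here for illustration) and each has a count of two with $|s_{A}| > 2$, Eqs. (10) and (11) give the same weight to both candidates, so their relative order is unchanged:

$$\begin{aligned} w_{A,B_1} = w_{A,B_2} = \dfrac{|s_{A}|}{|s_{A}|- 2} \;\;\Rightarrow \;\; \dfrac{similarity\_score_{A,B_1}}{similarity\_score_{A,B_2}} = \dfrac{cosine\_similarity_{A,B_1}}{cosine\_similarity_{A,B_2}} \end{aligned}$$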

According to scenario C, the sentences contain lexicon terms, but in an inflected form. Thus, our algorithms cannot identify those lexical terms appearing in sentences. In the example, the lexicon overlaps are missed for two word-pairs owing to inflections (in both En and Si). If the inflections were accounted for in the algorithm, the correct aligned sentence pair could be identified. We believe that if matching could be done at the lemma level, a further improvement could be obtained. However, for Sinhala and Tamil, there is no lemmatizer that guarantees the coverage of the full vocabulary. Therefore, at present, working at the lemma level is not feasible.

7 Conclusion

This research improved an existing multilingual embedding-based document alignment technique and a sentence alignment technique with the use of bilingual lexicons. The study was conducted focusing on the low-resource language pairs Sinhala–English, Sinhala–Tamil and Tamil–English. Since we experimented with LASER, XLM-R and LaBSE multilingual embeddings, our work serves as an empirical study on the effectiveness of these models for document and sentence alignment. Our results show that positive gains can be obtained even with bilingual lexicons having very small quantities of parallel phrases. We have also compiled and released gold-standard human annotated evaluation sets for document alignment and sentence alignment for the considered languages, which will enable future research in this context. As future work, we plan to further improve the multilingual representations and cross-lingual mappings of the PMLMs, for low-resource languages by exploring different fine-tuning objectives.

Notes

https://github.com/kdissa/comparable-corpus.
mBERT was not considered since it does not include Sinhala.
http://www.hirunews.lk.
https://www.newsfirst.lk/.
https://www.army.lk/.
https://www.itnnews.lk.
http://www.statmt.org/wmt20/translation-task.html.
https://www.languagesdept.gov.lk/.
We use k = 4 for all experiments in this work as it gave the best results in all our experiments.
https://uom.lk/nlp

References

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Koehn P, Knowles R (2017) Six challenges for neural machine translation. In: Proceedings of the first workshop on neural machine translation. Association for Computational Linguistics, Vancouver, pp 28–39
Ranathunga S, Lee ESA, Skenduli MP, Shekhar R, Alam M, Kaur R (2021) Neural machine translation for low-resource languages: a survey. arXiv preprint arXiv:2106.15115
Kreutzer J, Caswell I, Wang L, Wahab A, van Esch D, Ulzii-Orshikh N et al (2022) Quality at a glance: an audit of web-crawled multilingual datasets. Trans Assoc Comput Linguist 10:50–72
Bañón M, Chen P, Haddow B, Heafield K, Hoang H, Esplà-Gomis M et al (2020) ParaCrawl: web-scale acquisition of parallel corpora. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 4555–4567
Buck C, Koehn P (2016) Findings of the WMT 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation: volume 2, shared task papers. Association for Computational Linguistics, Berlin, pp 554–563
Resnik P (1998) Parallel strands: a preliminary investigation into mining the web for bilingual text. In: Conference of the association for machine translation in the Americas. Springer, pp 72–82
Resnik P (1999) Mining the web for bilingual text. In: Proceedings of the 37th annual meeting of the association for computational linguistics, pp 527–534
Papavassiliou V, Prokopidis P, Piperidis S (2016) The ilsp/arc submission to the wmt 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 733–739
Resnik P, Smith NA (2003) The web as a parallel corpus. Comput Linguist 29(3):349–380
Espla-Gomis M, Forcada ML, Ortiz-Rojas S, Ferrández-Tordera J (2016) Bitextor’s participation in WMT’16: shared task on document alignment. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 685–691
Etchegoyhen T, Gete H (2020) Handle with care: a case study in comparable corpora exploitation for neural machine translation. In: Proceedings of The 12th language resources and evaluation conference, pp 3799–3807
El-Kishky A, Guzmán F (2020) Massively multilingual document alignment with cross-lingual sentence-mover’s distance. In: Proceedings of the 1st conference of the asia-pacific chapter of the association for computational linguistics and the 10th international joint conference on natural language processing. Association for Computational Linguistics, Suzhou, pp 616–625
Varga D, Halácsy P, Kornai A, Nagy V, Németh L, Trón V (2007) Parallel corpora for medium density languages. Amsterdam Stud Theory Hist Linguist Sci Ser 4(292):247
Munteanu DS, Marcu D (2005) Improving machine translation performance by exploiting non-parallel corpora. Comput Linguist 31(4):477–504
Sarikaya R, Maskey S, Zhang R, Jan EE, Wang D, Ramabhadran B et al (2009) Iterative sentence-pair extraction from quasi-parallel corpora for machine translation. In: Tenth annual conference of the international speech communication association, pp 432–435
Kvapilíková I, Artetxe M, Labaka G, Agirre E, Bojar O (2020) Unsupervised multilingual sentence embeddings for parallel corpus mining. In: Proceedings of the 58th annual meeting of the association for computational linguistics: student research workshop, pp 255–262
Feng F, Yang Y, Cer D, Arivazhagan N, Wang W (2020) Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852
Artetxe M, Schwenk H (2019) Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans Assoc Comput Linguist 7:597–610
Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F et al (2020) Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 8440–8451
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). Association for Computational Linguistics, Minneapolis, pp 4171–4186
Rajitha C, Piyarathna L, Sachintha D, Ranathunga S (2021) Metric learning in multilingual sentence similarity measurement for document alignment. In: Proceedings of the international conference on recent advances in natural language processing (RANLP 2021). Held online: INCOMA Ltd., pp 1150–1157. https://aclanthology.org/2021.ranlp-1.129
Ni J, Ábrego GH, Constant N, Ma J, Hall KB, Cer D et al (2021) Sentence-T5: scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877
Joshi P, Santy S, Budhiraja A, Bali K, Choudhury M (2020) The state and fate of linguistic diversity and inclusion in the NLP world. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 6282–6293
Artetxe M, Schwenk H (2019) Margin-based parallel corpus mining with multilingual sentence embeddings. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp 3197–3203
Koehn P, Khayrallah H, Heafield K, Forcada ML (2018) Findings of the WMT 2018 shared task on parallel corpus filtering. In: Proceedings of the third conference on machine translation: shared task papers, pp 726–739
Chen J, Nie JY (2000) Parallel web text mining for cross-language IR. In: Content-based multimedia information access, vol 1. RIAO, pp 62–77
Shi L, Niu C, Zhou M, Gao J (2006) A DOM tree alignment model for mining parallel data from the web. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics, pp 489–496
Zafarian A, Sadeghi APA, Azadi F, Ghiasifard S, Panahloo ZA, Bakhshaei S et al (2015) AUT document alignment framework for BUCC workshop shared task. In: Proceedings of the eighth workshop on building and using comparable corpora, pp 79–87
Li B, Gaussier E (2013) Exploiting comparable corpora for lexicon extraction: Measuring and improving corpus quality. In: Building and using comparable corpora. Springer, pp 131–149
Ma X, Liberman M (1999) Bits: a method for bilingual text search over the web. In: Machine translation summit VII, pp 538–542
Fung P, Cheung P (2004) Mining very-non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and e. In: Proceedings of the 2004 conference on empirical methods in natural language processing, pp 57–63
Ion R, Ceauşu A, Irimia E (2011) An expectation maximization algorithm for textual unit alignment. In: Proceedings of the 4th workshop on building and using comparable corpora: comparable corpora and the web, pp 128–135
Gomes L, Lopes G (2016) First steps towards coverage-based document alignment. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 697–702
Morin E, Hazem A, Boudin F, Loginova-Clouet E (2015) LINA: identifying comparable documents from Wikipedia. In: Proceedings of the eighth workshop on building and using comparable corpora. Association for Computational Linguistics, Beijing, pp 88–91
Uszkoreit J, Ponte J, Popat A, Dubiner M (2010) Large scale parallel document mining for machine translation. In: Proceedings of the 23rd international conference on computational linguistics (Coling 2010), pp 1101–1109
Rajitha M, Piyarathna L, Nayanajith M, Surangika S (2020) Sinhala and English document alignment using statistical machine translation. In: 2020 20th international conference on advances in ICT for emerging regions (ICTer). IEEE, pp 29–34
Jakubina L, Langlais P (2016) BAD LUC@WMT 2016: a bilingual document alignment platform based on Lucene. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 703–709
Medveď M, Jakubíček M, Kovář V (2016) English-French document alignment based on keywords and statistical translation. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 728–732
Buck C, Koehn P (2016) Quick and reliable document alignment via tf/idf-weighted cosine distance. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 672–678
Germann U (2016) Bilingual document alignment with latent semantic indexing. In: Proceedings of the first conference on machine translation: volume 2, shared task papers. Association for Computational Linguistics, Berlin, pp 692–696
Dara AA, Lin YC (2016) YODA system for WMT16 shared task: bilingual document alignment. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 679–684
Brown PF, Lai JC, Mercer RL (1991) Aligning sentences in parallel corpora. In: 29th Annual meeting of the association for computational linguistics. Association for Computational Linguistics, Berkeley, pp 169–176
Gale WA, Church KW (1993) A program for aligning sentences in bilingual corpora. Comput Linguist 19(1):75–102
Ma X (2006) Champollion: a robust parallel text sentence aligner. In: Proceedings of the fifth international conference on language resources and evaluation (LREC’06). European Language Resources Association (ELRA), Genoa, pp 489–492
Munteanu DS, Marcu D (2002) Processing comparable corpora with bilingual suffix trees. In: Proceedings of the 2002 conference on empirical methods in natural language processing (EMNLP 2002), pp 289–295
Stefanescu D, Ion R, Hunsicker S (2012) Hybrid parallel sentence mining from comparable corpora. In: Proceedings of the 16th annual conference of the European association for machine translation, pp 137–144
Abdul-Rauf S, Schwenk H (2009) On the use of comparable corpora to improve SMT performance. In: Proceedings of the 12th conference of the european chapter of the ACL (EACL 2009), pp 16–23
Mahata S, Das D, Bandyopadhyay S (2017) BUCC2017: a hybrid approach for identifying parallel sentences in comparable corpora. In: Proceedings of the 10th workshop on building and using comparable corpora, pp 56–59
Azpeitia A, Etchegoyhen T, Garcia EM (2017) Weighted set-theoretic alignment of comparable sentences. In: Proceedings of the 10th workshop on building and using comparable corpora, pp 41–45
Azpeitia A, Etchegoyhen T, Garcia EM (2018) Extracting parallel sentences from comparable corpora with STACC variants. In: Proceedings of the 11th workshop on building and using comparable corpora, pp 48–52
Grégoire F, Langlais P (2017) BUCC 2017 shared task: a first attempt toward a deep learning framework for identifying parallel sentences in comparable corpora. In: Proceedings of the 10th workshop on building and using comparable corpora, pp 46–50
Iyyer M, Manjunatha V, Boyd-Graber J, Daumé III H (2015) Deep unordered composition rivals syntactic methods for text classification. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers), pp 1681–1691
Guo M, Shen Q, Yang Y, Ge H, Cer D, Ábrego GH et al (2018) Effective parallel corpus mining using bilingual sentence embeddings. WMT 2018:165
Leong C, Wong DF, Chao LS (2018) Um-paligner: neural network-based parallel sentence identification model. In: 11th Workshop on building and using comparable corpora, p 53
Bouamor H, Sajjad H (2018) H2@BUCC18: parallel sentence extraction from comparable corpora using multilingual sentence embeddings. In: Proceedings of workshop on building and using comparable corpora, pp 43–47
Hangya V, Fraser A (2019) Unsupervised parallel sentence extraction with parallel segment detection helps machine translation. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 1224–1234
Schwenk H, Chaudhary V, Sun S, Gong H, Guzmán F (2021) WikiMatrix: mining 135M parallel sentences in 1620 language pairs from Wikipedia. In: Proceedings of the 16th conference of the European chapter of the association for computational linguistics: main volume, pp 1351–1361
Schwenk H, Wenzek G, Edunov S, Grave E, Joulin A, Fan A (2021) CCMatrix: mining billions of high-quality parallel sentences on the web. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers). Association for Computational Linguistics, pp 6490–6500
Yang Y, Ábrego GH, Yuan S, Guo M, Shen Q, Cer D et al (2019) Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. In: Proceedings of the twenty-eighth international joint conference on artificial intelligence (IJCAI-19), pp 5370–5378
Zweigenbaum P, Sharoff S, Rapp R (2018) Overview of the third BUCC shared task: spotting parallel sentences in comparable corpora. In: Proceedings of 11th workshop on building and using comparable corpora, pp 39–42
Koehn P, Guzmán F, Chaudhary V, Pino J (2019) Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In: Proceedings of the fourth conference on machine translation (volume 3: shared task papers, day 2), pp 54–72
Priyadarshani H, Rajapaksha M, Ranasinghe M, Sarveswaran K, Dias G (2019) Statistical machine learning for transliteration: transliterating names between Sinhala, Tamil and English. In: 2019 International conference on asian language processing (IALP). IEEE, pp 244–249
Farhath F, Ranathunga S, Jayasena S, Dias G (2018) Integration of bilingual lists for domain-specific statistical machine translation for Sinhala-Tamil. In: Moratuwa engineering research conference (MERCon). IEEE, pp 538–543
Thompson B, Koehn P (2019) Vecalign: improved sentence alignment in linear time and space. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 1342–1348
Fernando A, Ranathunga S, Dias G (2020) Data augmentation and terminology integration for domain-specific Sinhala-English-Tamil statistical machine translation. arXiv preprint arXiv:2011.02821
Guzmán F, Chen PJ, Ott M, Pino J, Lample G, Koehn P et al (2019) The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, pp 6098–6111
Goyal N, Gao C, Chaudhary V, Chen PJ, Wenzek G, Ju D et al (2021) The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. arXiv preprint arXiv:2106.03193
Liu Y, Gu J, Goyal N, Li X, Edunov S, Ghazvininejad M et al (2020) Multilingual denoising pre-training for neural machine translation. Trans Assoc Comput Linguist 8:726–742
Thillainathan S, Ranathunga S, Jayasena S (2021) Fine-tuning self-supervised multilingual sequence-to-sequence models for extremely low-resource NMT. In: 2021 Moratuwa engineering research conference (MERCon). IEEE, pp 432–437
Lee ESA, Thillainathan S, Nayak S, Ranathunga S, Adelani DI, Su R et al (2022) Pre-trained multilingual sequence-to-sequence models: a hope for low-resource language translation? arXiv preprint https://doi.org/10.48550/arXiv.2203.08850
Ott M, Edunov S, Baevski A, Fan A, Gross S, Ng N et al (2019) fairseq: a fast, extensible toolkit for sequence modeling. In: Proceedings of NAACL-HLT 2019: demonstrations, pp 48–53
Post M (2018) A call for clarity in reporting BLEU scores. In: Proceedings of the third conference on machine translation: research papers. Association for Computational Linguistics, Belgium, pp 186–191


Acknowledgements

Aloka Fernando was initially funded by the Accelerating Higher Education Expansion and Development (AHEAD) Operation of the Ministry of Education, Sri Lanka, funded by the World Bank. Currently, she is funded by a Senate Research Committee (SRC) grant from the University of Moratuwa, Sri Lanka. Dataset creation was funded by an SRC grant from the University of Moratuwa, Sri Lanka.

Funding

Funding was provided by the Accelerating Higher Education Expansion and Development (AHEAD) Operation and by a Senate Research Committee (SRC) grant from the University of Moratuwa.

Author information

Aloka Fernando, Dilan Sachintha, Lakmali Piyarathna and Charith Rajitha have contributed equally to this work.

Authors and Affiliations

Department of Computer Science and Engineering, University of Moratuwa, Katubedda, Sri Lanka
Aloka Fernando, Surangika Ranathunga, Dilan Sachintha, Lakmali Piyarathna & Charith Rajitha


Corresponding author

Correspondence to Aloka Fernando.

Ethics declarations

Conflict of interest

The authors have no conflict of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Algorithms for using bilingual lexicons

Our improvements to the document alignment and sentence alignment algorithms use bilingual lexicons, as explained in Sects. 4.1.1 and 4.2.2, respectively. The supporting algorithms for term matching using person names (Algorithm 1) and the rest of the bilingual lexicons (Algorithm 2) are shown below.
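As a rough illustration of what lexicon-based term matching can look like, and not a reproduction of Algorithm 1 or Algorithm 2, the following Python sketch scores a sentence pair by the share of lexicon-covered source tokens whose translation occurs on the target side. The function name `lexicon_overlap`, the whitespace-level tokens, the exact-match rule and the romanized placeholder entries are all assumptions made for this example.

```python
from typing import Dict, Iterable, Set


def lexicon_overlap(
    source_tokens: Iterable[str],
    target_tokens: Iterable[str],
    lexicon: Dict[str, Set[str]],
) -> float:
    """Fraction of lexicon-covered source tokens with a translation in the target.

    Hypothetical helper: `lexicon` maps a lower-cased source term to the set of
    target-language translations accepted for it. Matching is exact and
    case-insensitive, which is a simplification.
    """
    target = {token.lower() for token in target_tokens}
    covered = 0  # source tokens that have a lexicon entry
    matched = 0  # covered tokens whose translation occurs in the target
    for token in source_tokens:
        translations = lexicon.get(token.lower())
        if not translations:
            continue  # tokens without a lexicon entry are ignored
        covered += 1
        if any(translation.lower() in target for translation in translations):
            matched += 1
    return matched / covered if covered else 0.0


# Placeholder entries (romanized, for illustration only).
toy_lexicon = {"colombo": {"kolamba"}, "minister": {"amathi"}}
score = lexicon_overlap(["Colombo", "minister", "said"], ["kolamba", "kiya"], toy_lexicon)
print(score)  # 0.5: two covered tokens, one of them matched
```

A score of this kind could serve as a weight on top of an embedding-based similarity; how the weight is actually combined with the similarity score is defined in the methodology sections, not here.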

Appendix B Document alignment results

Table 11 shows the document alignment results for each news source for the language pairs English–Sinhala, English–Tamil and Sinhala–Tamil. In Table 7, the individual scores obtained for the news sources are averaged. Scores in bold mark the best F1 score for each news source and language pair.

Table 11 Document alignment results in terms of recall (R), precision (P) and F1 with respect to each language pair. Here, BL refers to the recreated baseline [13] using LASER embeddings. Each bilingual lexicon was then added on top of this baseline and the experiments were repeated. The bilingual lexicons considered were Person Names (N), Designations (Ds), Dictionary (Dc) and Improved Dictionary (MDc). The same set of experiments was then conducted with the PMLMs XLM-R and LaBSE.
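For readers who want to recompute such scores, the sketch below shows one common way to obtain precision, recall and F1 from a set of predicted document pairs and a gold-standard set, and then to average the per-source scores. It is a minimal example under stated assumptions: the function names, the pair representation and the choice to average each metric independently across sources are illustrative and may differ from the released evaluation code.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]                # (source document id, target document id)
Scores = Tuple[float, float, float]   # (precision, recall, F1)


def precision_recall_f1(predicted: Set[Pair], gold: Set[Pair]) -> Scores:
    """Compute precision, recall and F1 for predicted alignments against gold pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def average_over_sources(per_source: Dict[str, Scores]) -> Scores:
    """Average each metric independently across news sources (assumed scheme)."""
    if not per_source:
        return 0.0, 0.0, 0.0
    count = len(per_source)
    precision = sum(s[0] for s in per_source.values()) / count
    recall = sum(s[1] for s in per_source.values()) / count
    f1 = sum(s[2] for s in per_source.values()) / count
    return precision, recall, f1


# Toy example with made-up document ids.
predicted = {("en1", "si1"), ("en2", "si2"), ("en3", "si9")}
gold = {("en1", "si1"), ("en2", "si2"), ("en4", "si4"), ("en5", "si5")}
print(precision_recall_f1(predicted, gold))
```

Averaging F1 directly across sources and recomputing F1 from the averaged precision and recall generally give different numbers, so the scheme used should be stated alongside any reported average.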


Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Fernando, A., Ranathunga, S., Sachintha, D. et al. Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages. Knowl Inf Syst 65, 571–612 (2023). https://doi.org/10.1007/s10115-022-01761-x


Received: 27 April 2022
Revised: 04 July 2022
Accepted: 12 September 2022
Published: 17 October 2022
Issue Date: February 2023
DOI: https://doi.org/10.1007/s10115-022-01761-x
