1 Introduction

Neural machine translation (NMT) systems trained on the transformer architecture [1] produce state-of-the-art results when large parallel datasets are available. However, in low-resource settings, the same architectures produce sub-optimal results [2].

Mining parallel corpora from the web is one commonly explored solution to alleviate the parallel data scarcity problem [3]. Wikipedia, news websites and official government/institution websites are likely to contain documents that are translations of each other, and are therefore candidate sources for parallel corpus mining. However, for low-resource languages, the data on the web are noisy and not of good quality, which results in noisy parallel data when automatically mined [4]. Therefore, it is essential to implement parallel corpus mining techniques that produce quality parallel data for low-resource languages.

Document alignment and sentence alignment are important tasks in the parallel corpus mining pipeline [5]. Document alignment refers to the process of identifying web documents that contain translations of each other, which are known as comparable corpora [6]. Early work on document alignment mainly relied on feature-based techniques that exploited URL meta data [7,8,9], HTML document structure [10] or machine translation-based techniques [11, 12]. However, these were outperformed by techniques that used vector representations of documents. Recent research in this line exploited document representations derived from Pre-trained Multilingual Language Models (PMLMs), which proved to be far superior to previous techniques [13].

The objective of sentence alignment is to find parallel sentences in the already identified comparable corpora or aligned documents. Existing techniques for sentence alignment were based on sentence-level features [14], information retrieval-based techniques [12], using supervised classifiers [15] and using machine translation [16]. However, more successful techniques were based on multilingual sentence embeddings [17, 18].

Currently available PMLMs to derive multilingual sentence embeddings include LASER [19], XLM-R [20], mBERT [21] and LaBSE [18]. However, except for Rajitha et al. [22] who compared LASER and XLM-R embeddings for document alignment, and Feng et al. [18] who compared LaBSE and LASER for sentence alignment, to the best of our knowledge, there has been no comprehensive evaluation of the effectiveness of these embeddings for document or sentence alignment tasks for low-resource languages. Moreover, these models are known to provide sub-optimal results for languages that are under-represented in these PMLMs [23]. These languages turn out to be those that have already been classified as low-resource languages [24]. Thus, it is important to investigate and identify ways to improve the performance of these PMLMs for document and sentence alignment in the context of low-resource languages.

In this paper, we exploit the use of bilingual lexicons to improve the semantic similarity measurement of the sentence embeddings derived from PMLMs for the tasks of document and sentence alignment. Note that bilingual lexicons can be considered as parallel data that are in the form of short phrases.

Our document alignment system is based on the work of El-Kishky and Guzmán [13]. They derived sentence embeddings using LASER and calculated the semantic distance between documents in the source and target languages using the Cross-Lingual Sentence Mover’s Distance (XLSMD) algorithm. Our sentence alignment system is based on the work of Artetxe and Schwenk [25]. They first obtained sentence embeddings of all source and target side sentences using LASER and then calculated margin-based cosine similarity over nearest neighbours.

For both these techniques, we introduce a new weighting mechanism to improve the semantic distance measurement, by utilizing existing bilingual lexicons. Our bilingual lexicons include a bilingual dictionary, a glossary, a designation list and person name lists. Additionally, we evaluate the effectiveness of our technique with XLM-R [20] and LaBSE [18] multilingual embeddings, in addition to LASER [19]. Thus, our work also serves as the first comparative study of the performance of these three multilingual models for these tasks.

We experiment with three language pairs: Sinhala–Tamil, Sinhala–English and Tamil–English. We have compiled a gold-standard human-annotated benchmark evaluation set for the document alignment and sentence alignment tasks in these three language pairs. The considered languages belong to three distinct language families (English (En): Indo-European, Tamil (Ta): Dravidian and Sinhala (Si): Indo-Aryan), and Sinhala and Tamil are morphologically rich low-resource languages. Thus, this dataset is a much tougher benchmark compared to other multilingual datasets [26] that only focused on a pair of high-resource related languages. We publicly release this dataset (Footnote 1) in the hope that it will support further research in this domain. This is the first manually curated dataset for the three considered languages.

Our experiments show that the use of bilingual lexicons improves the performance of the selected document and sentence alignment techniques, with the largest gains in the context of the LASER sentence representations.

Thus, the contributions of this work are as follows:

  1. We introduce a weighting scheme based on bilingual lexicons to improve the semantic similarity measurement of the document and sentence representations derived from pre-trained multilingual models, for the document and sentence alignment tasks, respectively.

  2. We conduct an empirical evaluation of the performance of sentence representations derived from LASER, XLM-R and LaBSE (Footnote 2), for document and sentence alignment tasks in the context of low-resource languages.

  3. We publicly release the gold-standard human-annotated benchmark evaluation datasets for the document and sentence alignment tasks in the context of three low-resource language pairs: English–Sinhala, English–Tamil and Sinhala–Tamil.

The rest of the paper is organized as follows. Related work is covered in Sect. 2. In Sect. 3, we explain our approach for creating the benchmark evaluation sets and describe the bilingual lexicons used. Our lexicon-based solution is presented in Sect. 4. Results are reported in Sect. 5, with a further analysis of the results in Sect. 6. Finally, the conclusion and future work are included in Sect. 7.

2 Related work

A typical parallel corpus mining pipeline follows a sequence of tasks, namely: crawling of website data, alignment of web documents, sentence alignment and parallel sentence filtration [5]. In our study, we focus on document alignment and sentence alignment tasks.

2.1 Document alignment

Automatic document alignment refers to determining the likelihood that two documents are translations of each other. Early work on document alignment was mostly based on metadata, such as URL-based properties [7, 8], publication date [9] and HTML document structure/tags [10, 27, 28]. These were further extended with topic modelling techniques [29]. Although metadata is a strong indicator of document alignment, it is not sufficient on its own, as the alignment properties mainly lie in the textual content. Further, such properties cannot be generalized across different domains and web sources.

In translation-based document alignment methods, the objective is to identify a strong signal that a document in the source language is the translation of another in the target language. Some of these techniques incorporated a bilingual dictionary and checked the existence of bilingual lexical terms in the documents [30]. Others considered the alignment information at word level [31,32,33] or phrase level [12, 34], or considered the existence of the n-best translated terms [11, 35] in the documents. A few techniques translated the non-English document to English and determined the alignment based on an MT evaluation metric [29, 36, 37]. Even though translation-based methods were able to score well in document alignment, their performance depends heavily on the accuracy of the alignment algorithm or the translation system used. Thus, in low-resource settings, these may produce sub-optimal results.

Vector representation-based techniques first derive a vector representation for the documents in the two languages and employ a semantic distance measurement metric to determine the semantic similarity between the documents. Document pairs that obtain a semantic similarity value above a pre-determined threshold are considered to be comparable. Bag-of-words, TF-IDF  [38,39,40,41] and word n-grams [42] were among the early solutions to derive document representations.

Very recently, El-Kishky and Guzmán [13] used PMLMs to derive a vector representation of each of the documents in the considered languages. Then, the distance between these document vectors was calculated to determine the aligned document pair. They used the LASER pre-trained embeddings [19] to derive document embeddings. El-Kishky and Guzmán [13] experimented for high-resource, mid-resource and also low-resource languages. As mentioned earlier, this is the baseline for our research, and more information on this technique can be found in Sect. 4.1.1.

2.2 Sentence alignment

Sentence alignment refers to the process of identifying parallel sentence pairs that are partial or complete translations of each other. Early work on parallel sentence alignment was based on sentence-length ratio [43, 44], which was purely statistical. However, when the correlation between source and target languages decreases, the performance of this approach drops rapidly [45]. Subsequent techniques considered bilingual dictionaries [14], word/phrase alignment probabilities between the sentences [12, 32] and phrase/sentence alignments coupled with bilingual suffix trees [46]. Stefanescu et al. [47] addressed sentence alignment as an information retrieval problem, while Munteanu and Marcu [15] trained a supervised classifier to determine the alignment. Some other techniques were based on machine translation [16, 48], where the non-English source sentences were translated to English and IR techniques were used to identify the aligned candidate sentence [49,50,51].

Recent work for sentence alignment was based on sentence representations by means of word embeddings or sentence embeddings. Here, first, the sentence embeddings were obtained for source and target sentences. Then, using a semantic similarity measurement, the aligned sentence pairs were identified. Initial work employed word-based embeddings trained on bi-directional Recurrent Neural Networks (RNNs) [52], Deep Averaging Networks (DANs) [53], bi-gram driven network architectures [54] and auto-encoders [55]. Hybrid techniques were also adopted, where a supervised classifier was used on top of the embedding-based semantic similarity calculation to determine the alignment [55, 56].

To optimize the results of the sentence alignment task, either the sentence representations should be enhanced or the semantic similarity distance scoring should be improved. Following the former path, Artetxe and Schwenk [19] used LASER supervised multilingual embeddings, while Kvapilíková et al. [17] experimented with XLM-based unsupervised multilingual embeddings. The choice of the sentence similarity measurement technique has been largely unsupervised (cosine similarity was the simplest one employed). However, this simple method is sub-optimal, and improved semantic similarity measurements had also been proposed [25, 54, 57]. We use Artetxe and Schwenk [25] as the baseline for the sentence alignment system, and this similarity measurement technique is further discussed in Sect. 4.2.1.

The final step is parallel sentence filtration, with the objective of removing any noisy parallel sentence pairs that had crept into the mined parallel corpus due to the noise in web data itself or due to the limitations in the preceding steps. However, this step is not explored in the scope of the study.

2.3 Pre-trained multilingual language models (PMLMs)

As discussed in the previous two sections, sentence representations derived from PMLMs have been vital to the success of recent document and sentence alignment techniques. Artetxe and Schwenk [19] used parallel data to train a shared encoder (available via the LASER toolkit), which has performed well in massive-scale parallel corpus extraction projects such as ParaCrawl [5], WikiMatrix [58] and CCMatrix [59].

Current state-of-the-art multilingual models are built on the Transformer architecture [1]. The commonly used mBERT [21] (104 languages) and XLM-R [20] (100 languages) models were trained on monolingual data with the Masked Language Modelling (MLM) objective. Yang et al. [60] trained multilingual embeddings on parallel data using a bi-directional dual encoder with an additive margin softmax objective. The latter was used in LaBSE [18] (109 languages), which was trained on both monolingual and parallel data with the MLM and Translation Language Modelling (TLM) objectives, producing state-of-the-art results for sentence alignment. However, LaBSE has not been evaluated in the context of low-resource languages for the tasks of document and sentence alignment.

2.4 Evaluating document alignment and sentence alignment

Datasets to evaluate document alignment and sentence alignment techniques have been introduced by several shared tasks. For example, Buck and Koehn [6] provided a hand-aligned dataset for evaluation of the document alignment task. However, this dataset was limited only to English and French. Rather than creating manually aligned datasets for the task of sentence alignment on comparable corpora, Zweigenbaum et al. [61] artificially injected parallel sentences into the comparable corpus. Their dataset also focused only on four language pairs, Chinese–English, French–English, German–English, and Russian–English. Some shared tasks did not present manually aligned datasets [26, 62]. Rather, the performance was evaluated by using the identified parallel sentences on a downstream NMT task.

3 Dataset

In this section, we describe the approach taken to create the gold standard evaluation set for the document alignment and sentence alignment tasks (Sect. 3.1). Afterward, we describe the human evaluation conducted to evaluate the quality of our gold-standard evaluation datasets (Sect. 3.2). In Sect. 3.3, we outline the bilingual lexicons used in this research.

3.1 Preparing document and sentence alignment evaluation datasets

Our research focuses on the Sinhala, Tamil and English languages, which are the official languages of Sri Lanka. We selected news websites that publish content in all three languages as comparable web sources. The selected websites were Hiru News (Footnote 3), NewsFirst (Footnote 4), Army News (Footnote 5) and ITN (Footnote 6). We considered data from January 2013 up to April 2021.

During pre-processing, the news content in the paragraphs of each web page was merged into a single string, and the text contained in the image and video tags was discarded. Further, we removed very short news documents that contained fewer than fifty tokens.

The Army, Hiru and NewsFirst websites publish news in all three languages with the same content coverage, document structure, order of sentences and information flow. Hence, for most of the English documents, exact translations were available in the Sinhala and Tamil documents. However, for ITN News, this was different. We observed that an English article was not always available for the corresponding Sinhala and Tamil articles, and for the ones with translations, the correlation among the content was low as well.

Since our dataset was completely taken from the news domain, all the news documents had the published date as metadata. Moreover, in most cases, the same news document was published in all three languages on the same day. Therefore, before starting the aligning process, we filtered and grouped the documents using the published date and reduced the search space by a considerable amount.
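The date-based filtering described above can be illustrated with a minimal sketch. It assumes that each document is available as a hypothetical `(doc_id, published_date)` tuple and that only documents with an identical date are paired; the text states only that same-day publication holds in most cases, so a real pipeline may need a looser rule.

```python
from collections import defaultdict


def candidate_pairs_by_date(src_docs, tgt_docs):
    """Return (src_id, tgt_id) pairs whose publication dates match.

    src_docs, tgt_docs: iterables of (doc_id, published_date) tuples.
    The tuple layout is an assumption made for this illustration.
    """
    # Index target documents by date so each source document is only
    # compared with targets published on the same day.
    tgt_by_date = defaultdict(list)
    for doc_id, date in tgt_docs:
        tgt_by_date[date].append(doc_id)

    pairs = []
    for doc_id, date in src_docs:
        for tgt_id in tgt_by_date.get(date, []):
            pairs.append((doc_id, tgt_id))
    return pairs


src = [("si-1", "2020-01-01"), ("si-2", "2020-01-02")]
tgt = [("en-1", "2020-01-01"), ("en-2", "2020-01-03")]
print(candidate_pairs_by_date(src, tgt))
```

Grouping by date means that the semantic distance only has to be computed within each date bucket rather than over the full cross product of source and target documents.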

We did the initial document alignment identification based on heuristics specific to the news website as described below.

  • The URL of each Hiru News document contains a unique id, which is shared by the versions of the article published in the respective languages. We used this property to identify candidate aligned document pairs. These were verified by a human annotator, and the alignments accepted by the annotator were considered for the gold standard evaluation subset for Hiru News.

  • Army News articles share the publication date and time across the three languages: the same news was published in all three languages at exactly the same date and time. As with Hiru News, we identified the candidate alignments for the Army News dataset using the publication date and time, and later verified the alignments with the help of a human annotator.

  • Documents crawled from the NewsFirst and ITN websites did not have any such metadata that we could use to create the ground truth alignment. Therefore, the ground truth alignment was created manually by human annotators and later verified by the same annotators, who switched datasets with one another.

We consider the alignments verified by the annotators as the gold standard evaluation set. Altogether, eleven annotators were used to conduct the document alignment annotation.

The number of selected documents from each language pair along with the number of ground truth alignment pairs for each web source is shown in Table 1. Due to the low correlation between documents published by ITN, it has a lower number of aligned document pairs compared to other sources.

Table 1 Statistics of document alignment evaluation dataset

The aligned document pairs identified above were used as the input to the sentence alignment task. The number of input sentences on the source side and target side for each language pair is listed in Table 2. To conduct the sentence alignment annotation, we used five annotators altogether. Here also the annotations by one person were checked by another annotator for verification. Given a large number of sentences on each side, it would take a very long time for human annotators to find all sentence pairs that are translations of each other. Therefore, the gold standard sentence alignment evaluation set includes only 300 one-to-one sentence pairs from each website in all three language pairs.

Table 2 Statistics of the sentence alignment evaluation dataset

3.2 Human evaluation on the benchmark evaluation datasets

On top of the annotation verification in the preceding stage, we have conducted a more systematic qualitative evaluation on the gold-standard dataset, following the methodology proposed by Kreutzer et al. [4]. From our gold standard evaluation set, we have sub-sampled 100 document pairs and sentence pairs from each language pair. Thereafter, we allocated three annotators to conduct the alignment verification per language pair, independently.

The annotation criteria follow the work of Kreutzer et al. [4]. In their annotation scheme, the applicable labels for our dataset were CC (Correct translation, natural sentence) and CB (Correct translation, Boilerplate or low quality). When compiling our evaluation dataset, we had already filtered out short sentences during the pre-processing stage, so the label CS (Correct translation, Short) was not applicable. The rest of the annotation labels, X (Incorrect translation, but both correct languages), WL (Source OR target wrong language, but both still linguistic content) and NL (Not a language: at least one of source and target are not linguistic content), were also not applicable in our case since the evaluation set had already undergone human annotation.

We calculated the number of annotations for each label given by each annotator and obtained the average scores, as done by Kreutzer et al. [4]. The same approach was followed for annotating the sentence alignment samples as well. The outcomes of the human evaluation for the document alignment and sentence alignment samples are shown in Table 3.

Table 3 Averages of the annotator scores for each label for document alignment and sentence alignment datasets

More than 75% of document pairs were annotated as correct alignments. When checked randomly, the pairs that were considered weak alignments had extra content on the target side that was not available on the source side, or vice versa. In some other document pairs, the two languages had reported the same news incident from different perspectives. As a result, there were content-wise differences, which was another reason why some document pairs were annotated as CB.

For the sentence-aligned dataset, more than 77% had been annotated as correct alignments. However, for the En–Si language pair, the alignments were almost perfect, achieving an average alignment score of 97.33%. For the sentences marked as CB, we found that the target side had additional information compared to the source side and vice versa.

Therefore, based on the average scores we believe the gold standard evaluation set is fit to be recommended as a quality evaluation dataset.

3.3 Bilingual lexicons

As parallel data, we considered the following bilingual lexicons: person names, designations, word dictionaries and glossaries. The English–Sinhala and English–Tamil person names and designation lists are from the work of Priyadarshani et al. [63], while the Sinhala–Tamil bilingual lists are from Farhath et al. [64]. The bilingual dictionaries were extracted and used internally in independent research that is yet to be published. A part of the Tamil–English dictionary is available at the WMT 2020 shared task (Footnote 7). We obtained a Trilingual Glossary (Footnote 8) from the Department of Official Languages, Sri Lanka. Statistics and samples of these bilingual lexicons are shown in Tables 4 and 5, respectively.

Table 4 Statistics of the bilingual lexicons
Table 5 Overview of the bilingual lexicons

4 Methodology

In Sect. 4.1, we describe El-Kishky and Guzmán [13]’s method for document alignment, which is used as the baseline in this study, and our improvement as a weighting scheme considering bilingual lexicons. In Sect. 4.2, we describe the baseline system by Artetxe and Schwenk [25], followed by our improvement considering the bilingual lexicons.

4.1 Document alignment

Our document alignment system makes use of multilingual sentence embeddings derived from PMLMs. In other words, we determine the alignment between two documents based on the semantic similarity between them, which is calculated by a distance scoring function. We use El-Kishky and Guzmán [13]’s technique as our baseline. We improve this distance scoring function by introducing a weighting scheme that uses bilingual lexicons.

4.1.1 Baseline document alignment system

El-Kishky and Guzmán [13] defined (1) a distance scoring function to calculate the semantic distance between two documents and (2) a document matching algorithm to obtain the final aligned document pairs.

Distance scoring function. Given a document pair, the objective of the distance scoring function is to calculate the semantic distance between the two documents. The smaller the semantic distance, the stronger the alignment. El-Kishky and Guzmán [13] introduced a distance metric named Cross-Lingual Sentence Mover’s Distance (XLSMD), which is based on Earth Mover’s Distance (EMD). XLSMD represents each document as a normalized bag-of-sentences (nBOS), in which every sentence carries a pre-calculated probability mass (weight). Equation (1) defines the semantic distance between documents A and B. Here, \(\Delta (i, j)\) is the Euclidean distance between sentence i of document A and sentence j of document B, calculated from their sentence embeddings. \(T_{i,j}\) is the amount of the probability mass of sentence i in document A that is assigned to sentence j in document B, subject to the constraints in Eq. (2).

$$\begin{aligned}&XLSMD(A, B) = \min _{T \geqslant 0} \sum _{i=1}^{V} \sum _{j=1}^{V} T_{i, j} \times \Delta (i, j) \end{aligned}$$
(1)
$$\begin{aligned}&Subject~to: \forall i \sum _{j=1}^{V} T_{i, j} = d_{A, i} \;\;,\;\; \forall j \sum _{i=1}^{V} T_{i, j} = d_{B, j} \end{aligned}$$
(2)

Equation (3) shows the first function used for the probability mass calculation. Here, they used the relative frequencies of sentences as the probability mass. \(\sum _{s\in A}count(s)\) represents the sentence count in document A. After calculating XLSMD, the distance was used in the document matching algorithm discussed next.

$$\begin{aligned} d_{A,i} = \dfrac{count(i)}{\sum _{s\in A}count(s)} \end{aligned}$$
(3)

To make the XLSMD calculation more tractable, a greedy algorithm named Greedy Mover’s Distance (GMD) was introduced as an alternative to the relaxed EMD. The algorithm first calculates the Euclidean distance between each sentence pair and sorts the pairs in ascending order of distance. Then, it iterates over the sorted pairs and multiplies each distance by the smaller of the two sentences’ weights, termed the flow value, accumulating the result as shown in Eq. (4).

$$\begin{aligned} distance = distance + \Vert s_{A} - s_{B}\Vert \times flow \end{aligned}$$
(4)
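The greedy procedure around Eq. (4) can be sketched as follows. This is an illustrative reading rather than the reference implementation of [13]: sentence embeddings are plain lists of floats, the probability masses follow the relative-frequency form of Eq. (3), and the step that subtracts the flow from the remaining mass of both sentences is an assumption taken from the usual greedy transport formulation. The function names are hypothetical.

```python
import math
from collections import Counter


def relative_frequency_mass(sentences):
    """Eq. (3): mass of each distinct sentence = count / total count."""
    counts = Counter(sentences)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def greedy_movers_distance(emb_a, emb_b, mass_a, mass_b):
    """Accumulate distance * flow over sentence pairs, closest pairs first.

    emb_a, emb_b: lists of sentence embeddings (lists of floats).
    mass_a, mass_b: probability mass per sentence, in the same order.
    """
    remaining_a = list(mass_a)
    remaining_b = list(mass_b)
    # All sentence pairs, sorted by ascending Euclidean distance.
    pairs = sorted(
        (euclidean(u, v), i, j)
        for i, u in enumerate(emb_a)
        for j, v in enumerate(emb_b)
    )
    distance = 0.0
    for pair_distance, i, j in pairs:
        # The flow is the smaller of the two remaining masses.
        flow = min(remaining_a[i], remaining_b[j])
        if flow <= 0.0:
            continue
        # Assumption: mass that has been moved cannot be reused.
        remaining_a[i] -= flow
        remaining_b[j] -= flow
        distance += pair_distance * flow  # Eq. (4)
    return distance


print(relative_frequency_mass(["a", "b", "a", "c"]))
print(greedy_movers_distance([[0.0, 0.0]], [[3.0, 4.0]], [1.0], [1.0]))
```

In the second call, the single sentence pair is 5.0 apart and carries all of the mass, so the accumulated distance equals the Euclidean distance itself.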

However, Eq. (3) assigns probability mass uniformly across the sentences. Therefore, El-Kishky and Guzmán [13] introduced the following advanced weighting schemes in place of relative frequency.

Sentence length (SL) weighting

The SL weighting scheme was used under the assumption that longer sentences should be given more probability mass than shorter sentences. Equation (5) defines how this weight is calculated.

$$\begin{aligned} d_{A,i} = \dfrac{count(i)\times \vert i\vert }{\sum _{s\in A}count(s)\times \vert s \vert } \end{aligned}$$
(5)

Here, \(\vert {i}\vert \) and \(\vert {s}\vert \) represent the number of tokens in the sentences i and s, respectively.

IDF weighting

IDF stands for inverse document frequency. Here, they have used the argument that the sentences that occur more frequently in the corpus should be given less importance than the infrequent sentences in the document. Equation (6) defines how it is calculated.

$$\begin{aligned} d_{A,i} = 1 + \log \dfrac{N + 1}{1 + |\{d\in D:s\in d\}|} \end{aligned}$$
(6)

Here, N is the total number of documents in domain D, and \(|\{d\in D:s\in d\}|\) is the number of documents that contain sentence s.

SLIDF weighting

In this weighting scheme, both SL and IDF weights have been multiplied to obtain an aggregated weight as shown in Eq. (7). This was to give importance to both the number of tokens and the IDF of the sentence within the document collection.

$$\begin{aligned} d_{A,i} = SL(i) * IDF(i) \end{aligned}$$
(7)

Similarly, the same weighting calculations SL, IDF and SLIDF had been done in the reverse direction, i.e., target to source document to calculate the probability mass for \(d_{B, j}\) (i.e., \(j^{th}\) sentence in the target document B).
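To make Eqs. (5) to (7) concrete, the three weighting schemes can be sketched as below. The sketch assumes that a document is a list of sentence strings, that tokens are obtained by whitespace splitting, and that the corpus is a list of such documents; none of these representation choices are specified in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def sl_weights(doc):
    """Eq. (5): count(i) * |i|, normalised over the document."""
    counts = Counter(doc)
    norm = sum(c * len(s.split()) for s, c in counts.items())
    return {s: c * len(s.split()) / norm for s, c in counts.items()}


def idf_weights(doc, corpus):
    """Eq. (6): 1 + log((N + 1) / (1 + number of documents containing s))."""
    n_docs = len(corpus)
    doc_sets = [set(d) for d in corpus]
    weights = {}
    for sentence in set(doc):
        doc_freq = sum(1 for d in doc_sets if sentence in d)
        weights[sentence] = 1.0 + math.log((n_docs + 1) / (1 + doc_freq))
    return weights


def slidf_weights(doc, corpus):
    """Eq. (7): product of the SL and IDF weights of each sentence."""
    sl = sl_weights(doc)
    idf = idf_weights(doc, corpus)
    return {s: sl[s] * idf[s] for s in sl}


corpus = [["a b c", "d"], ["d"]]
print(sl_weights(corpus[0]))
print(idf_weights(corpus[0], corpus))
print(slidf_weights(corpus[0], corpus))
```

Note that Eqs. (6) and (7) as written do not sum to one over a document. Whether the masses are renormalised before they enter the transport constraints of Eq. (2) is not stated here, so the sketch returns the raw values.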

Document matching algorithm. In this algorithm, the semantic distances between each source document and each target document are first calculated according to the above-mentioned scoring function. Then, starting from the document pair with the minimum distance, subsequent pairs \(d_A\) and \(d_B\) are selected iteratively, such that neither \(d_A\) nor \(d_B\) has been used in a previous selection.
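The matching step can be sketched as a greedy one-to-one selection. The input format, a dictionary from hypothetical `(source_id, target_id)` keys to pre-computed distances, is an assumption made for illustration, as is the absence of any distance threshold.

```python
def match_documents(distances):
    """Greedily pick the closest unused (source, target) document pairs.

    distances: dict mapping (src_id, tgt_id) -> semantic distance.
    Returns the aligned pairs in order of increasing distance.
    """
    used_src, used_tgt = set(), set()
    aligned = []
    for (src, tgt), _ in sorted(distances.items(), key=lambda kv: kv[1]):
        # Skip a pair if either document was already aligned.
        if src in used_src or tgt in used_tgt:
            continue
        aligned.append((src, tgt))
        used_src.add(src)
        used_tgt.add(tgt)
    return aligned


distances = {
    ("a", "x"): 0.10,
    ("a", "y"): 0.20,
    ("b", "x"): 0.15,
    ("b", "y"): 0.50,
}
print(match_documents(distances))
```

In this example, document b is closer to x than to y, but x has already been taken by a, so b is aligned with y.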

4.1.2 New weighting scheme based on bilingual lexicons

As a novel contribution, we modify the distance scoring function of El-Kishky and Guzmán [13], by introducing a new weighting scheme considering bilingual lexicons. This weight calculation differs based on the nature of the term mapping in the bilingual lexicons (as word-to-word mappings or phrase-to-phrase mappings). This is described in the following section. With our improvement, the semantic distance calculation between a source side document \(d_A\) and target side document \(d_B\) is shown in Fig. 1.

Fig. 1 Process for calculating the semantic distance between source language document \(d_A\) and target language document \(d_B\). Here \(w_{A,B}\) refers to the improved weight based on the bilingual lexicons. The accumulated distance scored from this process is the semantic distance between the document pair. Subsequently, the document matching algorithm (Sect. 4.1.1) produces the final aligned document pairs

We use the bilingual lexicons mentioned in Sect. 3.3 to introduce a weighting scheme on top of the SL, IDF and SLIDF schemes. Here, if a sentence \(s_{A}\) from document A contains a word in the bilingual lexicon and its translation in sentence \(s_{B}\) from document B, a variable count is incremented. The total of such words in sentence \(s_{A}\) is the final count value. The weighting between the two sentences \(s_{A}\) and \(s_{B}\) considering the variable count is shown in Eq. (8).

$$\begin{aligned} w_{A,B} = \dfrac{|s_{A}|- count}{|s_{A}|} \;\;\;\;\;\;\; |s_{A}|= Number~of~tokens~in~sentence~s_{A} \end{aligned}$$
(8)

The weighting \(w_{A,B}\) is incorporated into the GMD algorithm by modifying the distance calculation as shown in Eq. (9). The distance is calculated in this way for each sentence pair in the two documents. Iterating through the sentence pairs, the accumulated distance is the semantic distance between the document pair.

$$\begin{aligned} distance = distance + \Vert s_{A} - s_{B}\Vert \times flow \times w_{A,B} \end{aligned}$$
(9)

This way, the more words in a sentence pair that map to entries in the bilingual lexicons, the smaller the distance between the two sentences.
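As a worked illustration of Eqs. (8) and (9), the sketch below computes the lexicon weight for one sentence pair and applies it to a single accumulation step. The function names are hypothetical, and the sketch assumes that the match count never exceeds the number of tokens in \(s_{A}\).

```python
def lexicon_weight(num_tokens_a, count):
    """Eq. (8): w = (|s_A| - count) / |s_A|."""
    if num_tokens_a <= 0:
        raise ValueError("sentence s_A must contain at least one token")
    return (num_tokens_a - count) / num_tokens_a


def weighted_step(distance, pair_distance, flow, weight):
    """Eq. (9): one accumulation step of the lexicon-weighted GMD."""
    return distance + pair_distance * flow * weight


# A 10-token source sentence with 3 lexicon matches in the target sentence.
w = lexicon_weight(10, 3)
print(w)
print(weighted_step(0.0, 2.0, 0.5, w))
```

With no lexicon matches the weight is 1 and Eq. (9) reduces to Eq. (4); with 3 matches out of 10 tokens the contribution of the sentence pair is scaled by 0.7.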

Usage of bilingual lists with one-word entries

Our person names list falls into this category. We added the parallel words in the person names bilingual list into a dictionary data structure where keys are words from language A and the values are arrays of translations of the key in language B. (Sometimes, one person’s name has multiple translations due to multiple types of spelling formats.) When calculating the weights, for each sentence pair, we iterated through the words in the sentence to calculate the mapping counts. Here, we split the sentence \(s_{A}\) into words and check if each word w exists in the dictionary. If it exists, we get the parallel words \(v_{B}\), and check if each parallel word exists in the sentence \(s_{B}\). If so, we increase the counter and remove the mapped word from the sentence \(s_{B}\). This counter value is used as the input in Eq. (8). Algorithm 1 in “Appendix A” explains this process.
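The counting procedure for one-word entries can be sketched as follows. This is an illustration of the description above, not a reproduction of Algorithm 1: whitespace tokenisation, counting at most one match per source word, and the romanised placeholder entries in the example are all assumptions.

```python
def count_single_word_matches(sent_a, sent_b, lexicon):
    """Count words of sent_a whose lexicon translation occurs in sent_b.

    lexicon: dict mapping a source word to a list of target translations.
    """
    remaining_b = sent_b.split()
    count = 0
    for word in sent_a.split():
        for translation in lexicon.get(word, ()):
            if translation in remaining_b:
                # Remove the matched word so it cannot be matched twice.
                remaining_b.remove(translation)
                count += 1
                break  # assumption: one match per source word
    return count


# Hypothetical entries; one name may have several spellings.
names = {"john": ["jon", "joon"], "mary": ["meri"]}
print(count_single_word_matches("john met mary", "jon saw meri", names))
```

Removing each matched target word prevents a repeated source name from being credited against a single occurrence on the target side.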

Usage of bilingual lists with multi-word entries

Usage of designations bilingual list and word dictionary

Unlike the person names lists, our designations lists and word dictionary fall into this category, meaning that their entries can contain more than one word (i.e., phrases). Therefore, when calculating the weights, we use a separate algorithm to identify multi-word mappings. Here, for each sentence \(s_{A}\), we get all the permutations of words from length one to length five (the maximum length of a record in the dictionary is five). Then, we follow the same process described above to get the mapping counts. Algorithm 2 depicts this process. When person names, designations, and word dictionaries are used in combination, we sum the count values returned by Algorithms 1 and 2 in “Appendix A”, and use that sum as the input for Eq. (8).
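A possible reading of the multi-word procedure is sketched below. It is not a reproduction of Algorithm 2: the sketch generates contiguous word sequences (n-grams) of length one to five, which is an interpretation of the candidate generation described above, and it assumes whitespace tokenisation and a lexicon that maps a source phrase to a list of target phrases. The example entries are hypothetical placeholders.

```python
def ngrams(tokens, max_len=5):
    """Yield contiguous word sequences of length 1..max_len as strings."""
    for n in range(1, max_len + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def find_phrase(tokens, phrase_tokens):
    """Return the start index of phrase_tokens in tokens, or -1."""
    n = len(phrase_tokens)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase_tokens:
            return start
    return -1


def count_phrase_matches(sent_a, sent_b, lexicon, max_len=5):
    """Count phrases of sent_a whose lexicon translation occurs in sent_b."""
    remaining_b = sent_b.split()
    count = 0
    for phrase in ngrams(sent_a.split(), max_len):
        for translation in lexicon.get(phrase, ()):
            phrase_tokens = translation.split()
            if not phrase_tokens:
                continue  # ignore empty lexicon entries
            start = find_phrase(remaining_b, phrase_tokens)
            if start >= 0:
                # Remove the matched phrase from the target sentence.
                del remaining_b[start:start + len(phrase_tokens)]
                count += 1
                break
    return count


designations = {"prime minister": ["pm x"]}
print(count_phrase_matches("the prime minister spoke", "pm x said", designations))
```

Restricting candidates to at most five words keeps the number of lookups linear in the sentence length, in line with the stated maximum record length of the dictionary.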

Improved dictionary

To improve the dictionary further, we add the terms from the glossary mentioned in Sect. 3.3. However, we could not see any improvement in terms of the scores. When investigated further, we observed that the glossary terms were mostly phrases and combined phrases as opposed to a single word. As a result, when the glossary was cross-checked with the sentence pairs, the number of overlapping terms was very low. Therefore, we utilized the parallel phrases in the dictionary to identify the distinct word pairs within the glossary terms. First, we cross-checked the phrases in the glossary with the words in the dictionary. We removed the parallel words that we found in the glossary phrases and extracted the remaining words from both languages as a parallel record. This way, the number of words in one record in the glossary got reduced by a considerable amount and we were able to improve the existing dictionary by adding the records we found from the glossary to the word dictionary. An overview of the improved dictionary is shown in Table 6.

Table 6 Overview of the improved dictionary

4.2 Sentence alignment

Our sentence alignment system makes use of multilingual sentence embeddings derived from PMLMs. In other words, we determine the sentence alignment, considering the semantic similarity between the sentence pairs, calculated using these sentence embeddings. The baseline system is implemented according to the work of Artetxe and Schwenk [25]. Their method for semantic distance was defined as the margin-based cosine similarity over its nearest neighbours. We have introduced a weighting considering the bilingual lexicons, on top of this semantic distance.

4.2.1 Baseline sentence alignment system

Artetxe and Schwenk [25] obtained the LASER multilingual sentence embeddings for all the source and target sentences and aligned these sentence embeddings using a margin-based cosine similarity function. This similarity measurement considered a margin between the cosine of a given sentence pair and that of its respective nearest neighbours.
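To make the margin idea concrete, the sketch below scores a sentence pair by dividing its cosine similarity by the average cosine similarity of each sentence to its k nearest neighbours on the other side. The ratio form of the margin is assumed here, since the text does not state which margin function is used; the toy two-dimensional vectors and all function names are likewise illustrative and stand in for real sentence embeddings.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(x: Vector, y: Vector) -> float:
    """Plain cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def mean_knn_similarity(x: Vector, pool: List[Vector], k: int) -> float:
    """Average cosine similarity between x and its k nearest neighbours in pool."""
    top = sorted((cosine(x, z) for z in pool), reverse=True)[:k]
    return sum(top) / len(top)


def margin_score(
    x: Vector, y: Vector, sources: List[Vector], targets: List[Vector], k: int = 4
) -> float:
    """Ratio margin: cos(x, y) relative to the neighbourhoods of x and y."""
    # x is a source sentence, so its neighbours are searched among the targets,
    # and vice versa for y.
    neighbourhood = (
        mean_knn_similarity(x, targets, k) + mean_knn_similarity(y, sources, k)
    ) / 2
    return cosine(x, y) / neighbourhood


# Toy embeddings (assumed values, for illustration only).
sources = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 0.1], [0.1, 1.0]]
matched = margin_score(sources[0], targets[0], sources, targets, k=1)
mismatched = margin_score(sources[0], targets[1], sources, targets, k=1)
```

A pair scores highly only when it is clearly more similar than the alternatives in its neighbourhood, which penalises sentences that are close to many candidates at once.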

Artetxe and Schwenk [25] proposed the following three criteria for candidate generation, focusing on higher recall at the cost of precision.

  • Forward: Each source sentence was aligned with exactly one best scoring target sentence. As a result, some target sentences may be aligned with multiple source sentences or with none.

  • Backward: Equivalent to the forward strategy, but followed the candidate selection in the opposite direction.

  • Intersection: Intersection of forward and backward criteria, with the objective of discarding inconsistent alignments.
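The three criteria can be illustrated on a small score matrix. In this sketch, `scores[i][j]` is assumed to hold the similarity between source sentence i and target sentence j, and a single matrix is reused for both directions; the function names and the toy values are not taken from the cited work.

```python
from typing import List, Set, Tuple

Pair = Tuple[int, int]  # (source index, target index)


def forward_pairs(scores: List[List[float]]) -> Set[Pair]:
    """Align every source sentence with its best scoring target sentence."""
    return {
        (i, max(range(len(row)), key=lambda j: row[j]))
        for i, row in enumerate(scores)
    }


def backward_pairs(scores: List[List[float]]) -> Set[Pair]:
    """Align every target sentence with its best scoring source sentence."""
    n_targets = len(scores[0])
    return {
        (max(range(len(scores)), key=lambda i: scores[i][j]), j)
        for j in range(n_targets)
    }


def intersection_pairs(scores: List[List[float]]) -> Set[Pair]:
    """Keep only the alignments on which both directions agree."""
    return forward_pairs(scores) & backward_pairs(scores)


# Two source sentences, three target sentences (toy values).
scores = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.2],
]
fwd = forward_pairs(scores)
bwd = backward_pairs(scores)
both = intersection_pairs(scores)
```

Here both source sentences pick target 0 under the forward criterion, while the backward criterion assigns every target a source; only the pair (0, 0) survives the intersection.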

4.2.2 Our improvements for sentence alignment

As our contribution to the sentence alignment task, we improve the semantic distance measurement step of Artetxe and Schwenk [25]’s method by introducing a weighting scheme using the bilingual lexicons (Sect. 3.3). When sentences in the source language document \(d_A\) and sentences in the target language document \(d_B\) are given as inputs, the sentence alignment algorithm produces the aligned parallel sentence pairs, as shown in Fig. 2.

Fig. 2 This diagram outlines the sentence alignment algorithm for the forward criterion. Our baseline [25] uses margin-based cosine similarity as the semantic distance calculation function, while we improve this by introducing a weighting \(w_{A,B}\), which is calculated using the bilingual lexicons. In the backward criterion, for each \(s_B\) in \(d_B\), the aligned sentence is picked up from the source side. The semantic distance calculation is done in the reverse direction.

In the forward criterion, based on the cosine similarity we select the best matching neighbourhood (k) of 4 candidates for each source sentence, similar to Artetxe and Schwenk [25]. Then, the margin-based cosine similarity is used over its nearest neighbours to determine the aligned target sentence. Here, if the source sentence \(s_{A}\) from document A contains a word w in the bilingual lexicon and the target sentence \(s_{B}\) from the selected k candidates contains the translation of the word w, the variable count is incremented. This count value is used to calculate the weight using Eq. (10) (multiplicative inverse of Eq. (8)), to give a higher weight for sentence pairs having more overlapping tokens and a lower weight for sentence pairs with a lower number of overlapping tokens.

$$\begin{aligned} w_{A,B} = \dfrac{|s_{A}|}{|s_{A}|- count} \;\;\;\;\;\;|s_{A}|= Number~of~tokens~in~source~sentence ~s_{A} \end{aligned}$$
(10)

The new similarity score between each source sentence \(s_{A}\) and each target sentence \(s_{B}\) is calculated using Eq. (11), according to the selected k candidates.

$$\begin{aligned} similarity\_score_{A,B} = cosine\_similarity_{A,B} \times w_{A,B} \end{aligned}$$
(11)

Then, each source sentence is aligned with the best scoring target sentence according to the above-calculated similarity scores.
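As a small worked illustration of Eqs. (10) and (11), the sketch below re-ranks the candidates of one source sentence. The candidate tuples, the function names, and the handling of the case count = \(|s_{A}|\) (where Eq. (10) is undefined) are assumptions of this example and are not specified in the text.

```python
from typing import List, Tuple


def lexicon_weight(num_tokens: int, count: int) -> float:
    """Eq. (10): |s_A| / (|s_A| - count)."""
    if num_tokens <= 0:
        return 1.0  # assumption: an empty sentence gets a neutral weight
    # Eq. (10) is undefined when count == |s_A|; capping count is an assumption.
    count = min(count, num_tokens - 1)
    return num_tokens / (num_tokens - count)


def best_target(num_tokens: int, candidates: List[Tuple[str, float, int]]) -> str:
    """Return the candidate that maximises Eq. (11).

    Each candidate is (target_id, cosine_similarity, overlap_count).
    """
    return max(
        candidates,
        key=lambda c: c[1] * lexicon_weight(num_tokens, c[2]),
    )[0]


# A 10-token source sentence with two of its k candidates (toy values).
candidates = [("t1", 0.80, 0), ("t2", 0.78, 2)]
chosen = best_target(10, candidates)
```

With no overlaps the weight is 1 and the score is unchanged, so "t1" keeps 0.80. Two overlaps give "t2" a weight of 10/8 = 1.25 and a score of 0.975, which moves it ahead of "t1" even though its raw similarity is lower.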

In the backward criterion, for each sentence on the target side, an aligned sentence from the source side is identified. This is the reverse of the forward criterion method. Therefore, the weight calculation needs to be modified as shown in Eq. (12). Here, \(s_{B}\) refers to the selected sentence from the target side, and \(w_{B,A}\) refers to the weight between \(s_{B}\) and the nearest neighbours identified from the source side. The count is incremented when a word in \(s_{B}\) exists in the bilingual lexicon and its translation appears in the source sentence retrieved from the nearest neighbours. The nearest neighbour retrieval is based on the cosine similarity, similar to Artetxe and Schwenk [25].

$$\begin{aligned} w_{B,A} = \dfrac{|s_{B}|}{|s_{B}|- count} \;\;\;\;\;\;|s_{B}|= Number~of~tokens~in~target~sentence ~s_{B} \end{aligned}$$
(12)

The final similarity score between sentence \(s_{B}\) and \(s_{A}\) is shown in Eq. (13).

$$\begin{aligned} similarity\_score_{B,A} = cosine\_similarity_{B,A} \times w_{B,A} \end{aligned}$$
(13)

In the intersection criterion, the intersection of the sentence pairs identified by the forward and backward criteria is taken. Therefore, this step is identical to the work by Artetxe and Schwenk [25].

5 Evaluation

We evaluated our improvements separately for document alignment and sentence alignment tasks, using the golden alignment dataset we prepared (see Sect. 3). Further, an extrinsic evaluation was conducted on sentence alignment by training an NMT system.

5.1 Document alignment

El-Kishky and Guzmán [13] used LASER multilingual sentence embeddings in their experiments. Therefore, for the document alignment task, we report the results for the baseline system using only LASER embeddings. Subsequently, an ablation study is conducted by sequentially adding each bilingual lexicon on top of the previous experiment. Then, we repeat the above experiments for XLM-R and LaBSE. Thus, this becomes the first empirical study of these three models for the task of document alignment.

Similar to El-Kishky and Guzmán [13], our technique is aimed at high recall at the cost of low precision. However, we have reported the recall (R), precision (P) and F1 scores over the gold-standard evaluation set. We experimented with English–Sinhala, English–Tamil and Sinhala–Tamil language pairs for each news web source. For each language pair, we report the averages of the individual scores obtained for the news sources in Table 7. Results per news source are reported in Appendix B1.
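For reference, the three measures can be computed from a set of predicted alignments and a gold-standard set as in the sketch below. It uses the standard definitions of precision, recall and F1; that the reported scores were computed in exactly this way, as well as the document identifiers used, are assumptions of this example.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (source document id, target document id)


def precision_recall_f1(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Standard precision, recall and F1 over aligned pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical alignments: one of three predictions is in the gold set.
predicted = {("a1", "b1"), ("a2", "b3"), ("a3", "b3")}
gold = {("a1", "b1"), ("a2", "b2")}
p, r, f1 = precision_recall_f1(predicted, gold)
```

In this toy case precision is 1/3, recall is 1/2 and F1 is 0.4, which shows how a recall-oriented system can lose precision by proposing extra alignments.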

Table 7 Averaged recall (R), precision (P) and F1 scores obtained for each news source with respect to the language pairs

The document alignment results show that the baseline result of El-Kishky and Guzmán [13] has been outperformed by our improvement when incorporating all the bilingual lexicons (BL+N+Ds+MDc) for all three language pairs. Considering the averaged F1 scores, the improvement is significant (around 44% increase compared to the baseline) for the En–Ta language pair. The improvements for Si–Ta and En–Si are 13% and 2%, respectively. This is an interesting observation. Sinhala and Tamil are considered to be under-represented in LASER, meaning that the cross-lingual alignment related to these languages is weak. Therefore, using bilingual dictionaries can enhance the cross-lingual alignment between language pairs. Additionally, the performance of document alignment depends on the correlation between source and target documents. One good example is the Army news source, on which even the baseline system performed well.

Further, it was noted that the results for En–Ta were very low compared to the other language pairs, En–Si and Si–Ta. Tamil belongs to the Dravidian family, and this language family is under-represented in many pre-trained models. Moreover, Dravidian languages have a higher linguistic distance from English, compared to the Indo-Aryan family, to which Sinhala belongs. We suspect these are the reasons for the lower result for the En–Ta language pair.

Both XLM-R and LaBSE outperformed the LASER scores. A further observation was that the LaBSE baseline was higher than that of XLM-R. XLM-R and LaBSE had been pre-trained using a massive collection of monolingual data using the transformer architecture, while LASER was built on the RNN architecture. Hence, we believe that they have captured the cross-lingual features better than LASER. Additionally, LaBSE had also used parallel data to improve the multilingual embeddings and to strengthen the cross-lingual transfer. As a result, LaBSE embeddings are more favourable for the document alignment task.

XLM-R and LaBSE baseline scores being better than the LASER scores for all three language pairs suggest that the multilingual embeddings obtained via self-supervised learning have a better language representation for low-resource languages, compared to LASER, which was trained in a supervised manner. This is very beneficial for non-English centric language pairs such as Si–Ta, which have been explored to a lesser extent.

For XLM-R, the absolute average F1 score gains produced by using lexicons are around 0.7, 1.7 and 1.3 points for En–Si, En–Ta and Si–Ta, respectively. When LaBSE is used, these gains are less than 0.5 F1 points. Therefore, we can conclude that XLM-R and LaBSE already have rich cross-lingual alignment information, and the amount of additional information provided by bilingual lexicons is relatively small.

Considering the experiment using LASER embeddings, we could observe that the gains were maximum when using all bilingual lexicons. Therefore, if we could find more lexicons we could increase the task performance. Even though the person names bilingual list of Sinhala–Tamil is about ten times larger than that for the other language pairs, we could not see a considerable improvement in Sinhala–Tamil compared to the other two. This may be due to the inflected nature of the two languages. The names could be in the inflected form in the parallel content, while the lexicons contain the names in the base form.

5.2 Sentence alignment

For the sentence alignment experiments, we used three baselines:

  1. Artetxe and Schwenk [25]’s method. As mentioned in Sect. 4.2, they used LASER multilingual embeddings and considered the alignments based on Forward, Backward and Intersection criteria using margin-based cosine similarity as the distance calculation method.

  2. Hunalign [14], for the purpose of comparing our work with a statistical method. Hunalign has been used as a baseline for other research that experimented with embedding-based techniques for sentence alignment [5].

  3. Feng et al. [18]’s method. They conducted sentence alignment using raw cosine similarity over the sentence embeddings obtained from LaBSE. This baseline is useful to compare the effect of the margin-based cosine similarity [25] and raw cosine [18] distance measurement.

We applied our improvement to Artetxe and Schwenk [25]’s method, using LASER embeddings as in the original work. Then, these experiments were repeated for XLM-R and LaBSE. We conducted the dictionary improvement on top of Feng et al. [18]’s baseline as well.

As our ground-truth alignment contains only a small fraction (approx. 300) of parallel sentences, there can be many more valid cross-lingual sentence pairs in these datasets. Therefore, we evaluated aligned sentence pairs using recall (i.e., what percentage of the sentence pairs in the golden alignment set are found by the algorithm), which was one of the commonly used measurements in other research as well [25, 65]. The results are shown in Table 8. Note that since the use of all the bilingual lexicons gave the best result for the document alignment task, here we have considered all the bilingual lexicons for the BL+Dic (baseline with dictionary improvement) experiments.

Table 8 Sentence alignment results in terms of recall (R)

First and foremost, we note that multilingual embedding-based methods significantly outperform Hunalign [14]. Even the baseline [25] outperforms Hunalign by a significant margin of 74% with respect to recall for the En–Si language pair.

Compared to the LaBSE baseline that uses raw cosine similarity [18], Artetxe and Schwenk [25]’s margin-based cosine similarity reports a recall value that is around 3% higher for the Sinhala–Tamil and Tamil–English pairs. Therefore, we can conclude that margin-based cosine similarity is favourable for the sentence alignment task.

Our sentence alignment system that incorporates bilingual lexicons outperforms Artetxe and Schwenk [25]’s method in all three language pairs for all the websites, with very few exceptions, as seen in Table 8. The Tamil–English language pair shows the highest improvement, outperforming the baseline system by 15% on average. For the Sinhala–Tamil and Sinhala–English pairs, average recall gains of 8% and 4% (respectively) were obtained for LASER embeddings.

Baseline sentence alignment results for both Tamil–English and Sinhala–Tamil language pairs are considerably low compared to Sinhala–English for LASER embeddings. The low amount of training data used for Sinhala and Tamil when training the LASER toolkit could be the reason for that [19]. Further, Tamil belongs to the Dravidian family, and this language family is under-represented in many pre-trained models. Moreover, Dravidian languages have a higher linguistic distance from English, compared to the Indo-Aryan family, to which Sinhala belongs. We suspect these are the reasons for the lower result for the En–Ta language pair. The same observation was made for the document alignment results.

The bilingual lexicon terms are in nominative form. However, Sinhala and Tamil are morphologically rich languages, which means words are inflected based on gender, plurality, or morphological case category. Although a word may exist in nominative form in the bilingual dictionary, it can appear in an inflected form in the sentences, and the dictionary improvement fails to identify such cases. We suspect this is a main reason why our improvement is marginal for the Si–Ta language pair in particular.

We observe that the baseline sentence alignment scores considering XLM-R and LaBSE have outperformed the LASER baseline. Further, the LaBSE baselines produce the highest scores across all three language pairs. XLM-R had been trained on a massive collection of monolingual data, while LaBSE had been pre-trained using monolingual data and fine-tuned using parallel data. We believe the underlying reason for the improvement in scores is the improvement in the language representations. Although XLM-R had been trained in a purely unsupervised manner, it had still managed to capture cross-lingual features that are favourable for the sentence alignment task. Since LaBSE had been fine-tuned using parallel data, we observe that this step had helped to further improve the cross-lingual alignments in the embeddings produced by the model. As a result, the LaBSE scores are the highest.

According to the results, our dictionary improvement is not significant compared to the XLM-R and LaBSE baselines for the Si–En and Ta–En language pairs. For Si–Ta, our improvement produces a gain of +0.5 recall. This shows that the multilingual representations of XLM-R and LaBSE have room for improvement when it comes to non-English centric pairs from diverse language families such as Sinhala and Tamil.

5.3 Extrinsic evaluation with NMT

To analyse the effectiveness of incorporating bilingual lists and different multilingual embeddings into the sentence alignment task, we conducted an extrinsic evaluation by training NMT systems with the obtained parallel sentences. We merged the parallel sentences obtained from each news source and trained NMT systems specific for the language pair in the forward and in reverse directions.

We used the SiTa trilingual (Sinhala, Tamil and English) parallel machine translation (MT) evaluation sets [66] created by the National Languages Processing Center of the University of Moratuwa, Sri Lanka to evaluate the NMT performance. Additionally, we report the MT scores for the Flores v1 [67] evaluation set for Sinhala–English and Flores-101 [68] multilingual evaluation set for Tamil–English language pairs.

More recently, NMT systems fine-tuned on the mBART50 sequence-to-sequence pre-trained model [69] have been successful for Sinhala and Tamil [70, 71]. Therefore, in order to build an NMT model, we decided to fine-tune the mBART50 model with the parallel sentences obtained from the sentence alignment task. Experiments were done using the fairseq toolkit [72], and the performance was evaluated using the evaluation datasets mentioned above. BLEU scores were obtained with sacreBLEU [73].

Table 9 BLEU scores for NMT systems trained with parallel data obtained from the sentence alignment step considering Forward (F), Backward (B) and Intersection (I) criteria

The NMT results shown in Table 9 are rather low, which we believe is due to the following reasons: (1) The SiTa evaluation dataset has been obtained from the official document domain, while the Flores evaluation datasets have been obtained from Wikipedia. In contrast, we mined the parallel corpus from the news domain. Therefore, the domain difference is identified as the primary reason for the NMT systems to produce low results. (2) The parallel corpus size produced by the sentence alignment task is in the range of 9,000–23,000, which marks an extremely low-resource setting [3]. Both these reasons lead to the NMT system producing a low result. However, we believe that this is not a bottleneck in conducting our study as we are only interested in analysing the impact of the bilingual lexicon integration on the sentence alignment task.

We observe that comparable results are obtained across all languages for Backward and Intersection criteria for NMT models for Si\(\rightarrow \)En, Ta\(\rightarrow \)En and Si\(\rightarrow \)Ta. In the backward criterion, for each target language sentence, an aligned sentence from the source language is obtained. Therefore, the selected source sentence might not always guarantee a proper translation for the target sentence. This can be identified as a weak parallel sentence pair with the noise at the source side. This is an interesting observation as it indicates that the NMT is robust to source side noise. However, when the noise is in the target side (as in the case of Forward criterion), it degrades the performance of the NMT. Since the Intersection is dependent on the Backward criterion, the improvement can also be seen in NMT systems trained with the Intersection criterion. In the NMT systems trained for En\(\rightarrow \)Si, En\(\rightarrow \)Ta and Ta\(\rightarrow \)Si, the same observation is true for Forward and Intersection criteria. Here, the target language for the NMT system is picked up from the forward criterion. That is, in the case of En\(\rightarrow \)Si NMT, with the Forward criterion, for each Si sentence, an En sentence is identified. So here the noisy sentence is found on the source-side (En). Therefore, for the NMT systems in the reverse direction, the Forward criterion is favourable.

We see that the NMT scores obtained with bilingual lists have improved over the baseline scores in most cases, as per Table 9. This means that bilingual list integration has improved the quality of the parallel sentences. Considering the SiTa evaluation set, the maximum gain is +1.8 BLEU for LASER, +0.9 BLEU for XLM-R and +0.5 BLEU for LaBSE. Similarly, for the Flores evaluation set, it is +1.4, +0.6 and +0.5 BLEU for LASER, XLM-R and LaBSE (respectively). Here we can see identical patterns with respect to both evaluation sets. The gain is the highest for LASER, while for XLM-R and LaBSE it is in the same range. Although Wikipedia data have been used during the training of these multilingual PMLMs, it is evident that the multilingual embeddings are not biased towards the Wikipedia-based evaluation set.

For Ta\(\rightarrow \)En and Ta\(\rightarrow \)Si directions, it shows a maximum improvement of +1.7 BLEU and +1.3 BLEU scores (respectively) for the LASER embeddings for the SiTa evaluation set. As Tamil is an under-represented language in the LASER training data, the lexicon integration has managed to improve the NMT scores.

In sentence alignment results, the scores were always in increasing order for LASER, XLM-R and LaBSE, respectively. However, for the downstream NMT task, we observed that the scores were mostly high for LASER and LaBSE compared to XLM-R. Although we expected the sentence alignment scores and NMT scores to follow the same pattern, it was not the case. LASER had been trained purely on parallel data while LaBSE had been pre-trained using monolingual and parallel data, followed by a fine-tuning phase with parallel data. Therefore, we observe that multilingual systems pre-trained with parallel data perform better in the NMT downstream task.

6 Further analysis

Table 10 Error analysis in the sentence alignment task. Here, the alignment[corr] refers to the alignment in the gold-standard evaluation set and alignment[incorr] refers to the alignment produced in the experiments

We conducted further analysis to identify the impact of lexicon integration on sentence alignment. Table 10 shows three scenarios where lexicon integration did not work. An example is given for the Sinhala–English pair. However, these findings are valid for other language pairs as well.

As explained by scenario A, the sentence pair that should be aligned does not contain any overlapping terms with the bilingual lexicons. Hence, it cannot benefit from our lexicon integration. Further, the En sentence and another Si sentence from the same context have overlaps in terms of parallel lexicons. As a result, the sentence alignment algorithm selects an incorrect Sinhala sentence as the alignment for the English source sentence.

In scenario B, when there are equal overlaps between the candidate aligned sentences, the lexicon improvement is not effective. In such instances, the alignment is purely determined by the margin-based cosine similarity. In this example, both Sinhala candidate sentences have two lexicon overlaps; therefore, the selection of the aligned sentence cannot be based on the integrated lexicon.

According to scenario C, the sentences contain lexicon terms, but in an inflected form. Thus, our algorithms cannot identify those lexical terms appearing in sentences. In the example, the lexicon overlaps are missed for two word-pairs owing to inflections (in both En and Si). If the inflections were accounted for in the algorithm, the correct aligned sentence pair could be identified. We believe that if matching can be done at the lemma level, a further improvement can be obtained. However, for Sinhala and Tamil, there is no lemmatizer that guarantees the coverage of the full vocabulary. Therefore, at present, working at the lemma level is not feasible.

7 Conclusion

This research improved an existing multilingual embedding-based document alignment technique and a sentence alignment technique with the use of bilingual lexicons. The study was conducted focusing on the low-resource language pairs Sinhala–English, Sinhala–Tamil and Tamil–English. Since we experimented with LASER, XLM-R and LaBSE multilingual embeddings, our work serves as an empirical study on the effectiveness of these models for document and sentence alignment. Our results show that positive gains can be obtained even with bilingual lexicons having very small quantities of parallel phrases. We have also compiled and released gold-standard human annotated evaluation sets for document alignment and sentence alignment for the considered languages, which will enable future research in this context. As future work, we plan to further improve the multilingual representations and cross-lingual mappings of the PMLMs for low-resource languages by exploring different fine-tuning objectives.