1 Introduction

Automatic Term Extraction (ATE) is the task of identifying specialized terminology in domain-specific corpora. By reducing the time and effort needed to extract terms manually, ATE is not only widely used for terminographical tasks (e.g., glossary construction [26] and specialized dictionary creation [22]) but also contributes to several complex downstream tasks (e.g., machine translation [40], information retrieval [23], and sentiment analysis [28]).

With recent advances in natural language processing (NLP), a new family of deep neural approaches, namely Transformers [38], has been pushing the state-of-the-art (SOTA) in several sequence-labeling semantic tasks, e.g., named entity recognition (NER) [18, 37] and machine translation [41], among others. The TermEval 2020 Shared Task on Automatic Term Extraction, organized as part of the CompuTerm workshop [31], presented one of the first opportunities to systematically study and compare various ATE systems, made possible by the Annotated Corpora for Term Extraction Research (ACTER) dataset [31, 32], a novel collection of corpora covering four domains and three languages. For Slovenian, the RSDO5 corpus [13] was created with texts from four specialized domains. Inspired by the success of Transformers for ATE in TermEval 2020, we propose an extensive study of their performance in a cross-domain sequence-labeling setting and evaluate different factors that influence extraction effectiveness. The experiments are conducted on two datasets: the ACTER and RSDO5 corpora.

Our major contributions can be summarized as the three following points:

  • An empirical evaluation of several monolingual and multilingual Transformer-based language models, including both masked (e.g., BERT and its variants) and autoregressive (e.g., XLNet) models, on the cross-domain ATE tasks;

  • Filling the research gap in the ATE task for Slovenian by experimenting with different models to achieve a new SOTA on the RSDO5 corpus;

  • An ensembling Transformer-based model for ATE that further improves the SOTA in the field.

This paper is organised as follows: Sect. 2 presents the related work in term extraction. Next, we introduce our methodology in Sect. 3, including the dataset description, the workflow and experimental settings, as well as the evaluation metrics. The corresponding results are presented in Sect. 4. Finally, we conclude the paper and present future directions in Sect. 5.

2 Related Work

Research into monolingual ATE was first introduced during the 1990s [6, 15], and the methods at the time followed a two-step procedure: (1) extracting a list of candidate terms; and (2) determining which of these candidate terms are correct using either supervised or unsupervised techniques. Below, we briefly summarize different supervised ATE techniques according to their evolution.

2.1 Approaches Based on Term Characteristics and Statistics

The first ATE approaches leveraged linguistic knowledge and distinctive linguistic aspects of terms to extract a possible candidate list. Several NLP techniques are employed to obtain the term’s linguistic profile (e.g., tokenization, lemmatization, stemming, chunking, etc.). On the other hand, several studies proposed statistical approaches toward ATE, mostly relying on the assumption that a higher candidate term frequency in a domain-specific corpus (compared to the frequency in the general corpus) implies a higher likelihood that a candidate is an actual term. Some popular statistical measures include termhood [39], unithood [5] or C-value [10]. Many current systems still apply their variations or rely on a hybrid approach combining linguistic and statistical information [16, 30].

2.2 Approaches Based on Machine Learning and Deep Learning

The recent advances in word embeddings and deep neural networks have also influenced the field of term extraction. Several embeddings have been investigated for the task at hand, e.g., non-contextual [1, 43], contextual [17] word embeddings, and the combination of both [11]. The use of language models for ATE tasks is first documented in the TermEval 2020 [31] on the trilingual ACTER dataset. While the Dutch corpus winner used BiLSTM-based neural architecture with GloVe word embeddings, the English corpus winner [12] fed all possible extracted n-gram combinations into a BERT binary classifier. Several Transformer variations have also been investigated [12] (e.g., BERT, RoBERTa, CamemBERT, etc.) but no systematic comparison of their performance has been conducted. Later, the HAMLET approach [33] proposed a hybrid adaptable machine learning system that combines linguistic and statistical clues to detect terms. Recently, sequence-labeling approaches became the most popular modeling option. They were first introduced by [17] and then employed by [20] to compare several ATE methods (e.g., binary sequence classifier, sequence classifier, token classifier). Finally, cross-lingual sequence labeling proposed in [4, 20, 35] demonstrates the capability of multilingual models and the potential of cross-lingual learning.

2.3 Approaches for Slovenian Term Extraction

The ATE research for the less-resourced languages, especially Slovenian, is still hindered by the lack of gold standard corpora and the limited use of neural methods. Regarding the corpora, the recently compiled Slovenian KAS corpus [8] was quickly followed by the domain-specific RSDO5 corpus [14]. Regarding the methodologies, techniques evolved from purely statistical [39] to more machine learning based approaches. For example, [25] extracted the initial candidate terms using the CollTerm tool [29], a rule-based system employing a language-specific set of term patterns from the Slovenian SketchEngine module [9]. The derived candidate list was then filtered using a machine learning classifier with features representing statistical measures. Another recent approach [30] focused on the evolutionary algorithm for term extraction and alignment. Finally, [36] was one of the first to explore the deep neural approaches for Slovenian term extraction, employing XLMRoBERTa in cross- and multilingual settings.

3 Methods

We briefly describe our chosen datasets in Sect. 3.1, the general methodology in Sect. 3.2 and the chosen evaluation metrics in Sect. 3.3.

3.1 Datasets

The experiments have been conducted on two datasets: ACTER v1.5 [31] and RSDO5 v1.1 [13]. The ACTER dataset is a manually annotated collection of 12 corpora covering four domains, Corruption (corp), Dressage (equi), Wind energy (wind), and Heart failure (htfl), in three languages, English (en), French (fr), and Dutch (nl). It has two versions of gold standard annotations: one including both terms and named entities (NES), and the other containing only terms (ANN). Meanwhile, the RSDO5 corpus v1.1 [13] includes texts in Slovenian (sl), a less-resourced Slavic language with rich morphology. Compiled during the RSDO national project, the corpus contains 12 documents covering four domains, Biomechanics (bim), Chemistry (kem), Veterinary (vet), and Linguistics (ling).

3.2 Workflow

We consider ATE as a sequence-labeling task [35] with the IOB (inside, outside, beginning) labeling regime [20, 33]. The model is first trained to predict a label for each token in the input text sequence, and then applied to the unseen test data. The final candidate term list for the test data is composed from the token sequences labeled as terms.
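The step from token labels to a candidate term list can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `extract_terms`, the plain `B`/`I`/`O` label strings, the lowercasing of terms, and the example sentence are all assumptions made for illustration.

```python
from typing import List, Set


def extract_terms(tokens: List[str], labels: List[str]) -> Set[str]:
    """Collect candidate terms from one IOB-labeled token sequence.

    Assumptions (not from the paper): labels are the plain strings
    "B", "I", "O", and terms are lowercased before being stored.
    """
    terms: Set[str] = set()
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new term starts; close any term that was still open.
            if current:
                terms.add(" ".join(current))
            current = [token.lower()]
        elif label == "I" and current:
            # Continue the currently open term.
            current.append(token.lower())
        else:
            # "O", or a stray "I" with no open term: close any open term.
            if current:
                terms.add(" ".join(current))
            current = []
    if current:
        terms.add(" ".join(current))
    return terms


# Hypothetical example sentence with hand-written labels.
tokens = ["Heart", "failure", "is", "treated", "with", "ACE", "inhibitors", "."]
labels = ["B", "I", "O", "O", "O", "B", "I", "O"]
candidate_terms = extract_terms(tokens, labels)
```

Taking the union of the sets returned for every sentence of the test data would give the aggregated candidate list that is later compared against the gold standard.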

3.2.1 Empirical Evaluation of Pretrained Language Models

We conduct a systematic evaluation of mono- and multilingual Transformer-based models on the ATE task modeled as sequence labeling. The models were obtained from Hugging Face and selected according to their number of downloads and likes. The chosen models are presented in Fig. 1. Regarding the multilingual systems, we investigate the performance of mBERT [7] (bert-base-multilingual-uncased), mDistilBERT [34] (distilbert-base-multilingual-cased), InfoXLM [2] (microsoft/infoxlm-base), and XLMRoBERTa [3] (xlm-roberta-base). All the chosen multilingual models are fine-tuned in a monolingual fashion, since findings from the related work [20, 35] show that no (or only marginal) gains are obtained if the model is fine-tuned on multilingual training data.

Fig. 1. Empirical evaluation of pretrained language models on the ATE task.

Regarding the monolingual models, we evaluate several English autoencoding Transformer-based models, including ALBERT [19] (albert-base-v1 and albert-base-v2), BERT [7] (bert-base-uncased), DistilBERT [34] (distilbert-base-uncased), ELECTRA (electra-small-generator) and RoBERTa [24] (xlm-roberta-base), and one autoregressive model, XLNet [42] (xlnet-base-cased). For French, we use CamemBERT [27] (camembert-base) and FlauBERT [21] (flaubert_base_uncased); for Dutch, we employ the BERTje (bert-base-dutch-cased) and RobBERT (robBERT-base and robbert-v2-dutch-base) models; and for Slovenian, we choose SloBERTa (sloberta), a RoBERTa-based model trained on a large Slovenian corpus.

3.2.2 Ensemble of Transformer Models

Based on the results of the evaluation described in Sect. 3.2.1, we propose a novel ensembling approach based on Transformer models for the ATE task, as we observe a general tendency for Precision to be better than Recall for all but a few of the monolingual and multilingual models tested (see Tables 1 and 2). This leads us to believe that by combining the outputs of different models, we could achieve improvements in Recall and, by extension, in the overall F1-score. We consider two strategies for combining the outputs of the models in the ensemble, namely the union and the intersection of their candidate term lists. The entire procedure is shown in Fig. 2.

Fig. 2. The general ensembling workflow.

We hypothesize that by combining the outputs of two models, we might be able to significantly improve the Recall of the term extraction system. To validate this hypothesis, we test three combinations, combining the outputs of (1) the best monolingual and the best multilingual model; (2) the two best monolingual models; and (3) the two best multilingual models.
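The two combination strategies reduce to set operations on the candidate term lists. The following is a minimal sketch under that reading; the function name `ensemble`, the `strategy` argument, and the example term sets are hypothetical and not taken from the paper.

```python
from typing import Set


def ensemble(terms_a: Set[str], terms_b: Set[str], strategy: str = "union") -> Set[str]:
    """Combine the candidate term sets of two models.

    "union" keeps every term proposed by either model (favoring Recall);
    "intersection" keeps only terms proposed by both (favoring Precision).
    """
    if strategy == "union":
        return terms_a | terms_b
    if strategy == "intersection":
        return terms_a & terms_b
    raise ValueError(f"unknown strategy: {strategy!r}")


# Hypothetical outputs of a monolingual and a multilingual model.
mono_terms = {"heart failure", "ejection fraction", "beta blocker"}
multi_terms = {"heart failure", "beta blocker", "diuretic"}

union_terms = ensemble(mono_terms, multi_terms, "union")
intersection_terms = ensemble(mono_terms, multi_terms, "intersection")
```

The union can only add candidates, so it can raise Recall at a possible cost in Precision, which matches the motivation given above; the intersection behaves in the opposite way.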

3.3 Evaluation Metrics

We evaluate each term extraction system by comparing the aggregated list of candidate terms extracted on the level of the whole test set with the manually annotated gold standard term list using Precision, Recall, and F1-score. These evaluation metrics have also been used in the related work [12, 20, 31].
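As an illustration of the list-level evaluation, the sketch below computes the three metrics from a set of extracted terms and a gold standard set. The function name `evaluate`, the exact-match comparison of term strings, and the example sets are assumptions; the paper does not specify any normalization applied before matching.

```python
from typing import Dict, Set


def evaluate(predicted: Set[str], gold: Set[str]) -> Dict[str, float]:
    """Precision, Recall and F1 over unique terms, using exact string match."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical aggregated candidate list and gold standard list.
predicted = {"heart failure", "beta blocker", "patient"}
gold = {"heart failure", "beta blocker", "ejection fraction", "diuretic"}
scores = evaluate(predicted, gold)
```

In this example two of the three extracted terms are correct and two of the four gold terms are found, so Precision is 2/3, Recall is 1/2, and F1 is 4/7.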

4 Results

We first present the results of mono- and multilingual Transformer-based models obtained on ACTER and RSDO5 test sets compared with the SOTAs. Then, we demonstrate the impact of the ensemble post-processing step.

4.1 Monolingual Evaluation

4.1.1 ACTER Corpus

Not many approaches have been tested on the ACTER corpus v1.5 due to its novelty. Thus, we apply the approach proposed by [20] (i.e., employing XLMRoBERTa as a token classifier), which achieved SOTA on the previous corpus version, and consider it as a baseline. The Heart failure domain is used as the test set, as in TermEval 2020.

Table 1. Results of monolingual term extraction on the ACTER dataset.

In general, multilingual pretrained models outperform the monolingual ones in Recall and F1-score when applied for extraction of the ANN annotations in all three languages. If named entities are included (NES), monolingual models outperform multilingual models in two (English and French) out of three languages in the ACTER dataset. When it comes to individual models, InfoXLM outperforms other mono- and multilingual models in the F1-score on the Dutch corpus (for both ANN and NES) and on the English corpus (for ANN). If we compare the results of our study with the XLMRoBERTa baseline using the same monolingual settings from [20], our best-performing models surpass the baseline in all cases (e.g., the F1-score increases by 1.87% on ANN and 1.5% on NES in the English corpus; 4.01% on French NES; 0.3% on ANN and 1.92% on NES in the Dutch corpus) except for the French ANN annotations.

Table 2. Results of monolingual term extraction on the RSDO5 dataset.

4.1.2 RSDO5 Corpus

We also compare the performance of different mono- and multilingual models on the RSDO5 corpus. Here, we evaluate the models on all domains, as shown in Table 2. Using two domains from the RSDO5 corpus for training, the third one for validation, and the last one for testing, all the models show relatively consistent performance across different combinations. The monolingual SloBERTa model outperforms other approaches (including the XLMRoBERTa baseline from [36]) in all cases by a relatively large margin in F1-score. By employing this model and looking at the best-performing train/validation combination for each test domain, we improve the SOTA baseline in F1-score by 2.21% in the Linguistics domain, 2.35% in Veterinary, 5.26% in Chemistry, and 2.66% in Biomechanics. Our results thus set a new SOTA on the Slovenian corpus.
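The cross-domain rotation over the four RSDO5 domains can be enumerated as in the sketch below. It assumes that every ordered choice of validation and test domain is used, with the remaining two domains forming the training set; the function name `domain_splits` and the dictionary layout are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Union

# Domain codes as given in the dataset description.
DOMAINS = ["bim", "kem", "vet", "ling"]

Split = Dict[str, Union[str, List[str]]]


def domain_splits(domains: List[str]) -> List[Split]:
    """List every split with two training, one validation and one test domain."""
    splits: List[Split] = []
    for validation, test in permutations(domains, 2):
        train = [d for d in domains if d not in (validation, test)]
        splits.append({"train": train, "validation": validation, "test": test})
    return splits


splits = domain_splits(DOMAINS)
```

Under this assumption there are 12 splits in total, three train/validation combinations for each of the four test domains.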

4.2 Transformer Ensembling

We also evaluate the performance of the proposed ensembling approach described in Sect. 3.2.2. The improvement or decline in performance over the best single model for the different languages of the ACTER dataset is shown in Fig. 3. The results indicate that combining the term sets of the two best-performing classifiers (regardless of their type) using the union always results in the biggest gain.

Fig. 3. F1-score improvement by combining two best classifiers in ACTER.

5 Conclusion

We proposed an empirical evaluation of different mono- and multilingual Transformer-based models on monolingual, cross-domain term extraction framed as sequence labeling. The experiments were conducted on the trilingual ACTER dataset and the Slovenian RSDO5 dataset. Furthermore, we tested how ensembling different mono- or multilingual models affects the performance of the overall term extractor. The results demonstrate that multilingual models outperform the monolingual ones in Recall and F1-score when applied for ANN extraction. Meanwhile, monolingual models capture the information about terms better than multilingual ones when it comes to the extraction of NES annotations. We also showed that by ensembling different Transformer models we can obtain further boosts in performance for all languages. As a consequence, we established a new SOTA on the ACTER and RSDO5 datasets.

In the future, we would like to take advantage of prompt engineering by considering ATE as a language model ranking problem in a sequence-to-sequence framework, where original sentences and statement templates filled by candidate terms are regarded as the source sequence and the target.