1 Introduction

Named Entity Recognition (NER) has applications in several domains, such as user interest modeling [27] and dialog systems [9], among others. Approaches to NER include training statistical sequential models based on handcrafted features and, more recently, deep learning models [8, 14], which ease the burden of designing handcrafted rules. Models with robust performance are usually built from substantial amounts of labeled training data.

Because NER tasks require token-level labels, annotating many documents is time-consuming, and thus costly and prone to human error. In many real-life scenarios, the lack of labeled data has become the biggest bottleneck preventing NER from being used in some domains and tasks. To address label scarcity, [10] proposes a method based on BERT [3] and distant supervision, which avoids the traditional labeling procedure by matching tokens in the target corpus with concepts in knowledge bases such as Wikipedia and YAGO [15].

Our approach is also based on BERT. Instead of manually annotated domain data, we use as reference the output of an ensemble of three general-purpose NER systems developed previously [16], as well as HAREM [17, 20], a dataset annotated by linguists. Performance was measured using a fully annotated corpus, MiniHAREM, which is not included in the First HAREM. The results show that the proposed method is feasible and that the resulting system was able to annotate previously unseen data.

This document is organized as follows: Sect. 2 reviews the relevant Related Work. Sect. 3 explains the proposed Method, followed by the Results in Sect. 4. The paper ends with the Conclusion in Sect. 5.

2 Related Work

The approaches proposed for NER can be classified into two types: rule-based, covering systems based on patterns and on lists (the so-called gazetteers); and machine learning based.

Machine learning methods adapt more flexibly to distinct contexts, provided that enough data about the target context exists. Diverse machine learning methods have been applied to NER, such as Support Vector Machines (SVM), Conditional Random Fields (CRF) or Neural Networks (NN) (e.g., [6]).

In recent years, Deep Learning and the availability of larger datasets have led to relevant advances in NER, with systems based on Long Short-Term Memory (LSTM), Bidirectional LSTMs (Bi-LSTM) and Transformers (e.g., [10, 14]). LSTMs are recurrent neural networks in which the hidden layers act as memory cells. As a result, they deal better with long-range dependencies in data [7, 8]. Transformers [23], introduced in 2017, are deep learning models based on the attention mechanism and designed to handle sequential input data, such as natural language. Their potential for parallelization enabled training on huge datasets, which created the conditions for the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) [18]. Transformers have demonstrated superior performance in the recognition of named entities and in a variety of other classification tasks.

Following this general tendency, the main developments in NER for Portuguese in recent years were systems featuring machine learning. A representative selection of works is presented in the following paragraphs.

The LeNER-Br system [13], presented in 2018, was developed for Brazilian legal documents. It features LSTM-CRF models trained using the Paramopama data set and achieved F1 scores of 97.0% and 88.8% for legislation and judicial entities, respectively. The results show the feasibility of using NER for judicial applications.

Pirovani and coworkers [19] adopted a hybrid technique combining Conditional Random Fields with a Local Grammar (CRF+LG) for a NER task at IberLEF 2019.

Lopes and coworkers [12] addressed NER for clinical data in Portuguese, showing the superiority of Bi-LSTM with word embeddings over CRF. They obtained F1 scores above 80%.

The first use of BERT in NER for Portuguese appeared in 2020 [21], adopting a BERT-CRF architecture that combines the transfer capabilities of BERT with the structured predictions of CRF. BERT was pre-trained using the brWaC corpus (2.68 billion tokens), and the NER model was trained using the First HAREM and tested with MiniHAREM. This system surpassed the previous state of the art despite being trained with much less data.

To the best of our knowledge, all these machine learning systems were trained with data annotated by humans, at least in a revision stage. This is a major limitation in the development of NER for new domains, which are in high demand given the expansion of potential application areas.

3 Method

The process used to assess the potential of automatically annotated entities is summarized in Fig. 1. Two BERT-based NER variants were trained with automatically annotated data (BIO annotations) to detect words belonging to entities. The variants differ in whether the training data also included out-of-domain annotated entities, in this case taken from the First HAREM dataset. The entity detection performance of the two variants was assessed using a third annotated dataset (MiniHAREM) and by querying DBPedia. The main blocks of the processing are described in the following subsections.

Fig. 1. Overview of the process adopted to assess the potential of automatically annotated data for the development of Named Entity Detection for new domains. The development of Named Entity Detection for new domains implies the use of automatically annotated datasets for training machine learning models.

3.1 Datasets

Three datasets were used, two for training and one for testing, namely:

Automatically annotated dataset – we used the results of our recent work on automatic annotation [16] as data to train the models. Briefly, a corpus for the Tourism domain, a set of more than 300 documents obtained from Wikivoyage [25], was annotated using a set of three NER systems (Linguakit NER [4, 11], AllenNLP [1, 5] and a DBPedia-based NER developed by the authors [16]). The output tags of these three systems were combined to tag each word as part of an ENTITY or not, without classifying the entity. More details can be found in [16].

HAREM datasets – Two HAREM [17, 20] datasets were used, the First HAREM and the MiniHAREM, both with manually annotated labels. The entity detectors were evaluated against MiniHAREM. Both the text and the BIO-annotated files were derived from the processing and XML made available by the authors of [21] (Footnote 1).
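The combination of the three NER outputs into a binary ENTITY tag, as used for the automatically annotated dataset, can be illustrated with a minimal sketch. The exact combination rule is given in [16] and is not reproduced here; the majority-vote threshold (`min_votes`), the function name and the example sentence are assumptions made only for illustration.

```python
from typing import List


def combine_entity_votes(votes_per_tagger: List[List[bool]],
                         min_votes: int = 2) -> List[str]:
    """Merge per-token entity decisions from several taggers into BIO tags.

    votes_per_tagger[k][i] is True when tagger k marked token i as part of
    an entity. The majority threshold is an ASSUMPTION, not the rule of [16].
    """
    n_tokens = len(votes_per_tagger[0])
    if any(len(votes) != n_tokens for votes in votes_per_tagger):
        raise ValueError("all taggers must label the same tokens")

    tags: List[str] = []
    inside = False  # True while the previous token was tagged as entity
    for i in range(n_tokens):
        count = sum(1 for votes in votes_per_tagger if votes[i])
        if count >= min_votes:
            tags.append("I-ENTITY" if inside else "B-ENTITY")
            inside = True
        else:
            tags.append("O")
            inside = False
    return tags


# Hypothetical outputs of three taggers for one sentence.
tokens = ["Visite", "a", "Torre", "de", "Belém", "em", "Lisboa"]
tagger_a = [False, False, True, True, True, False, True]
tagger_b = [False, False, True, True, True, False, True]
tagger_c = [False, False, False, False, True, False, False]

bio_tags = combine_entity_votes([tagger_a, tagger_b, tagger_c])
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Note that a binary vote of this kind cannot separate two entities that are directly adjacent; they would be merged into a single span.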

The number of words annotated as part of an Entity is presented in Fig. 2 for each dataset. The automatically annotated dataset for Tourism is the smallest one, being less than half the size of its combination with HAREM.

Fig. 2. Number of words annotated as part of an Entity for the 3 datasets used and the combination Tourism+HAREM used for training.

3.2 BERT-Based Classifiers for Entity Detection

Motivated by the recent developments highlighted in Sect. 2, a NER based on BERT [3] was selected for our experiments. After tests with several implementations, we adopted a simple and documented implementation by Tobias Sterbak [22] that uses the Transformers package by Hugging Face [26], Keras and TensorFlow. It uses a case-sensitive BERT tokenizer, based on the WordPiece tokenizer, that splits tokens into subwords.
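Because the tokenizer splits words into subwords, the word-level BIO labels have to be aligned with the resulting pieces before training. The sketch below mimics the greedy longest-match-first strategy of WordPiece with a tiny hypothetical vocabulary (`TOY_VOCAB`); it is not the BERT tokenizer itself. Copying the word label to every piece is one common convention and is assumed here rather than taken from the paper.

```python
from typing import List, Set, Tuple


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def align_labels(words: List[str], labels: List[str],
                 vocab: Set[str]) -> Tuple[List[str], List[str]]:
    """Tokenize each word and repeat its label for every resulting piece."""
    pieces_out: List[str] = []
    labels_out: List[str] = []
    for word, label in zip(words, labels):
        pieces = wordpiece(word, vocab)
        pieces_out.extend(pieces)
        labels_out.extend([label] * len(pieces))
    return pieces_out, labels_out


# Toy vocabulary and sentence, invented for illustration only.
TOY_VOCAB = {"Al", "##fa", "##ma", "fica", "em", "Lisboa"}
words = ["Alfama", "fica", "em", "Lisboa"]
word_labels = ["B-ENTITY", "O", "O", "B-ENTITY"]

pieces, piece_labels = align_labels(words, word_labels, TOY_VOCAB)
for piece, label in zip(pieces, piece_labels):
    print(f"{piece}\t{label}")
```

At prediction time the inverse step is needed: piece-level outputs are merged back so that each original word receives a single tag.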

Two base systems were considered for our investigation: one trained using only the automatic annotations (called Tourism), and a second one trained with these automatic annotations plus the annotations of the First HAREM (Tourism+HAREM).

Training

The models were fine-tuned with the AdamW optimizer, adopting the default parameters of [22], lr = \(3\times 10^{-5}\) and eps = \(10^{-8}\), in a notebook with an NVIDIA GeForce RTX 2060 GPU.
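For reference, the role of the two reported parameters can be seen in the usual formulation of the AdamW update with decoupled weight decay. This is the textbook form, not a description taken from [22]; the moment coefficients \(\beta_1\), \(\beta_2\) and the weight decay \(\lambda\) are not reported in the text and are left symbolic. Here \(g_t\) is the gradient at step \(t\) and \(\theta\) the model parameters.

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \\
\theta_t &= \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right),
\end{aligned}
\]

with \(\eta\) = lr = \(3\times 10^{-5}\) and \(\epsilon\) = eps = \(10^{-8}\). The small \(\epsilon\) only guards against division by zero, while \(\eta\) sets the step size of every update.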

The training datasets were split into training and validation sets, with 90% for training and 10% for validation. The stopping criteria adopted were a maximum of 10 epochs or an increase of the loss on the validation set.
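The split and the stopping rule can be sketched as follows. This is a minimal illustration under stated assumptions: the random shuffle, the seed, the function names and the `run_epoch` callback are hypothetical, and the loss values in the example are invented, not taken from the experiments.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_train_validation(items: Sequence, validation_fraction: float = 0.1,
                           seed: int = 0) -> Tuple[list, list]:
    """Shuffle the items and hold out a fraction of them for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumption
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def train_with_early_stopping(run_epoch: Callable[[int], float],
                              max_epochs: int = 10) -> List[float]:
    """Call run_epoch until the validation loss rises or max_epochs is hit.

    run_epoch(epoch) is expected to train for one epoch and return the
    validation loss measured afterwards.
    """
    history: List[float] = []
    for epoch in range(1, max_epochs + 1):
        history.append(run_epoch(epoch))
        if len(history) > 1 and history[-1] > history[-2]:
            break  # validation loss increased: stop training
    return history


# 90% / 10% split of 300 placeholder documents.
train_docs, val_docs = split_train_validation(range(300))
print(len(train_docs), len(val_docs))

# Invented validation losses standing in for real training epochs.
fake_losses = [0.30, 0.22, 0.25, 0.20]
history = train_with_early_stopping(lambda epoch: fake_losses[epoch - 1])
print(history)
```

A practical variant would also keep the weights of the epoch with the lowest validation loss, since the last epoch run under this rule is by construction worse than the one before it.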

The variation of the loss with epochs is presented in Fig. 3 for the two trained variants. For both variants, training stopped after the third epoch because the loss increased. The number of epochs needed to fine-tune the model is aligned with the information in [22] that “a few epochs should be enough ... 3–4 epochs”.

Fig. 3. Loss variation during the training of both system variants.

Processing times are presented in Fig. 4. Training and testing durations are similar for both variants, and the complete process (training plus testing) takes under 8 min.

Fig. 4. Training and testing times (in minutes) for the two system variants.

4 Results

The output obtained by processing MiniHAREM is exemplified in Fig. 5. The performance was evaluated using the standard metrics (Precision, Recall and F1), considering both the individual words of the entity and the full entity as a single unit (an entity counts as correctly detected only when all the words composing it are correctly tagged).
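The difference between the two evaluation granularities can be made concrete with a short sketch. It assumes that a word counts as correct when both the reference and the prediction mark it as part of an entity (ignoring the B/I distinction), and that an entity counts as correct only when its predicted span matches the reference span exactly. The function names and the tag sequences are invented for illustration.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Return (start, end_exclusive) spans of entities in a BIO sequence."""
    spans: Set[Tuple[int, int]] = set()
    start = None
    for i, tag in enumerate(tags):
        if tag.startswith("B") or (tag.startswith("I") and start is None):
            if start is not None:
                spans.add((start, i))  # close the previous entity
            start = i
        elif tag == "O":
            if start is not None:
                spans.add((start, i))
            start = None
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def prf(n_correct: int, n_predicted: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def word_level_scores(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    gold_words = {i for i, tag in enumerate(gold) if tag != "O"}
    pred_words = {i for i, tag in enumerate(pred) if tag != "O"}
    return prf(len(gold_words & pred_words), len(pred_words), len(gold_words))


def entity_level_scores(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    return prf(len(gold_spans & pred_spans), len(pred_spans), len(gold_spans))


# Invented example: the last word of the first entity is missed.
gold = ["B-ENTITY", "I-ENTITY", "I-ENTITY", "O", "B-ENTITY", "O"]
pred = ["B-ENTITY", "I-ENTITY", "O", "O", "B-ENTITY", "O"]

word_scores = word_level_scores(gold, pred)
entity_scores = entity_level_scores(gold, pred)
print("word   P/R/F1:", word_scores)
print("entity P/R/F1:", entity_scores)
```

In this toy case a single missed word leaves word-level precision untouched but makes the whole first entity count as wrong, which is why entity-level scores are typically well below word-level ones.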

The results obtained for both variants of the system (trained with and without using First HAREM) are summarized in Table 1.

Table 1. Results of the evaluation of both variants with MiniHAREM, showing word and entity metrics. Acc.* refers to Accuracy calculated excluding all words with tag “O”.

Table 1 shows that, even when testing with an out-of-domain dataset:

  • With automatically annotated tourism-domain data, the system achieved a word-based precision of 90.0% and a full entity detection of 41.4%.

  • Results improved, as expected, when First HAREM data was used to complement the domain data in the training process.

  • Recall is much lower than precision. Over 50% of the words belonging to entities are not detected, even when HAREM data is included in the training. The best recall value when considering complete entities is 40.4%.

  • The best full entity precision is approximately 56%.

Fig. 5. Two fragments of the results obtained with the BERT-based NER using MiniHAREM, showing the HAREM tag and the output of the variant trained only with automatic annotations.

As the test dataset contains several entity classes not annotated in our tourism data, it is worth looking in detail at the performance by class. The split of entity-based Precision, Recall and F1 by entity class is presented graphically in Fig. 6.

Fig. 6. Entity-based Precision (top left), Recall (top right) and F1 by entity class.

The bar plots show that: 1) LOCAL presents the highest F1, with Precision and Recall above 50% when trained with just automatically annotated data. 2) PERSON obtains the second highest F1, but only when HAREM data is used to complement the domain data. 3) Classes that are not in the domain dataset, such as WORK or ABSTRACTION, present interesting Precision and Recall, showing some potential of the system to generalize.

5 Conclusion

This paper assesses the potential of automatic annotation to provide the datasets needed to train entity detectors by fine-tuning BERT models, with the aim of making it possible and simple to create new NER systems for domains without annotated data.

A first proof-of-concept of the proposed method was evaluated with an existing manually annotated dataset, MiniHAREM. This dataset was selected for testing because it is the only publicly available dataset for NER evaluation in Portuguese, but it is out-of-domain and includes several entity classes without examples in the domain dataset, so the results need to be considered with caution. With these limitations in mind, the results are promising regarding the potential to create Named Entity detectors for a new domain without manually annotated data, as demonstrated by the results obtained for the classes best represented in the domain dataset (e.g., LOCAL). We also find interesting the capability of the system to “generalize” from the training data and detect entities of classes that are not present in the domain data, such as ABSTRACTION.

The main contribution of this paper is the proposal and evaluation of fine-tuning pretrained BERT models with automatically annotated data to foster the development of Named Entity Extraction for new domains.

5.1 Future Work

The interesting results obtained with a small domain dataset and an (almost) off-the-shelf BERT-based model motivate the exploration of several improvements to the work presented, particularly:

  • Extension of the Tourism dataset, by processing additional texts with the automatic annotation process (ensemble of NER).

  • Addition of NER for classes such as TIME to the automatic annotation process.

  • Exploration of recent evolutions in BERT-based NER and similar models (e.g., GPT-3 or FLAN) [2, 24].

  • Manual revision of part of the results obtained for a subset of Tourism texts to allow computing performance metrics for domain data.

  • Try transfer-based learning to correct the decisions of the developed NER.

  • Add a NEC post-processing step to classify the detected entities, using, for example, queries to DBPedia and Wikipedia.

  • Apply and evaluate the process proposed to new domains.

  • Integrate the newly obtained NER in an Information Extraction pipeline.