
1 Introduction

Neural networks achieve good NER accuracy on high-resource domains such as modern news text or Twitter [2, 4]. On historical text, however, NER often performs poorly. This is due to several challenges: i) Domain shift: entities in historical texts can differ from contemporary entities, which makes it difficult for modern taggers to handle historical data. ii) OCR errors: historical texts, usually digitized by OCR, contain systematic errors not found in clean, non-OCR'd text [14]; these errors can also change the surface form of entities. iii) Lack of annotation: much historical text is now available in digitized form, but without labels, and methods are required to make beneficial use of such data [16].

In this paper, we address data-centric domain adaptation for NER on historical French and Dutch data. Following Ramponi and Plank [20], data-centric approaches adapt not the model but the training data in order to improve generalization across domains. We address both in-domain and cross-domain NER. In the cross-domain setup, we use supervised contemporary data and integrate unsupervised historical data via contextualized embeddings. We introduce artificial OCR errors into the supervised modern data, perturbing the corpora in a general and robust way that is independent of language or linguistic properties.

In both the cross-domain and the in-domain setup, our system outperforms neural and statistical state-of-the-art methods. Cross-domain, we achieve 69.3% \(F_1\) for French and 63.4% for Dutch; in-domain, 77.9% for French and 84.2% for Dutch. If we only consider named entities that contain OCR errors, our domain-adapted cross-domain tagger even outperforms (83.5% French / 46.2% Dutch) in-domain training (77.1% French / 43.8% Dutch). Our main contributions are:

  • Release of the preprocessed French and Dutch NER corpora;

  • Developing synOCR to mimic historical data while exploiting the annotation of modern data;

  • Training historical embeddings on a large amount of unlabeled historical data;

  • Ensembling an NER system that establishes SOTA results for both languages and both scenarios.

2 Methods

2.1 Architecture

We use the Flair NLP framework [1]. Flair taggers achieve SOTA results on various benchmarks and are well suited for NER. Moreover, Flair provides powerful contextual string embeddings: they are trained without an explicit notion of words and model words as character sequences in context. These two properties make atypical entities, even those with distorted surface forms, easier to recognize. In all our experiments, word embeddings are generated by a character-level RNN and passed to a word-level bidirectional LSTM with a CRF as the final layer. Depending on the experiment, we concatenate (\(\odot \)) additional embeddings and refer to this as the ensembling process.
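
For illustration, such a tagger can be assembled with the Flair API roughly as follows. This is a minimal sketch: the corpus path, embedding choices and hyperparameters are illustrative, not our exact configuration.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import (CharacterEmbeddings, FlairEmbeddings,
                              StackedEmbeddings, WordEmbeddings)
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style column files: token in column 0, NER tag in column 1.
corpus = ColumnCorpus("data/europeana_fr", {0: "text", 1: "ner"})
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

# "Ensembling": concatenating several embedding types into one vector.
embeddings = StackedEmbeddings([
    WordEmbeddings("fr"),            # pre-trained FastText embeddings
    CharacterEmbeddings(),           # task-trained character BiLSTM
    FlairEmbeddings("fr-forward"),   # contextual string embeddings
    FlairEmbeddings("fr-backward"),
])

# Word-level BiLSTM with a CRF as the final layer.
tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=tag_dictionary, tag_type="ner",
                        use_crf=True)

ModelTrainer(tagger, corpus).train("models/ner", max_epochs=100)
```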

2.2 Noise Methods

Since digitization by OCR introduces a lot of noise into the data, we recreate some of these phenomena in the modern corpora that we use for training. Our goal is to increase the similarity of historical (OCR'd) and modern (clean) data. An example drawn from the Dutch training corpus is shown in Fig. 1; words that differ from the original text are printed in bold.

Generation of Synthetic OCR (synOCR) Errors. This method processes every sentence by assigning a randomly selected font and a font size between 6 and 11 pt. Batches of 150 sentences are printed to PDF documents and then converted to PNG images. The images are perturbed using imgaug with the following steps: (i) rotation, (ii) Gaussian blur and (iii) white or black pixel dropout. The resulting image is recognized using tesseract version 0.2.6. We re-align the recognized sentences with the clean annotated corpus to transfer the NER tags. For the alignment between original and degraded text, we select a window of the bitext and calculate a character-based alignment cost. We then use the Wagner-Fischer algorithm [27] to obtain the best alignment path through the window at the lowest possible cost. If the cost is below a threshold, we shift the window to the mid-point of the discovered path. Otherwise, we iteratively increase the window size and re-align until the threshold criterion is met. This procedure lets us find an alignment with reasonable time and space resources, without risking the loss of the optimal path in low-quality areas. The result is an OCR-error-enhanced annotated corpus with a range of recognition quality, from perfectly recognized to fully illegible. We refer to OCR-corrupted data as synOCR'd data.
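
The windowed alignment can be sketched as follows. This is a strongly simplified illustration with our own function names: where the full method advances to the mid-point of the optimal Wagner-Fischer path, the sketch simply advances both texts by half the window.

```python
def edit_distance(a: str, b: str) -> int:
    """Wagner-Fischer dynamic program: character-level alignment cost
    with unit costs for insertion, deletion and substitution."""
    d = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(b) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                       # deletion
                       d[j - 1] + 1,                   # insertion
                       prev + (a[i - 1] != b[j - 1]))  # substitution/match
            prev = cur
    return d[len(b)]

def align(clean: str, ocr: str, window: int = 200,
          max_cost_per_char: float = 0.3):
    """Slide a window over the bitext; if the alignment cost is below
    the threshold, shift the window, otherwise enlarge it and re-align."""
    anchors, i, j = [], 0, 0
    while i < len(clean):
        size = window
        while True:
            cost = edit_distance(clean[i:i + size], ocr[j:j + size])
            if cost <= max_cost_per_char * size or size >= 8 * window:
                break
            size *= 2  # low-quality area: widen the window and retry
        anchors.append((i, j, cost))
        i += size // 2  # simplification: advance both texts in lockstep
        j += size // 2
    return anchors
```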

Generation of Synthetic Corruptions. This method is also applied to our modern corpora to introduce the kind of noise found in historical data. Similar to [21], we randomly corrupt 20% of all words by (i) inserting a character, (ii) removing a character or (iii) transposing two adjacent characters, drawing inserted characters from the standard French/Dutch alphabet. We re-align the corrupted tokens with the clean annotated tokens while maintaining the sentence boundaries to transfer the NER tags. Since the corruption method does not break word boundaries, we can simply map each corrupted word to the original one and retrieve the corresponding NER tag. We refer to synthetically corrupted data as corrupted data.
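
A minimal sketch of this corruption scheme follows; the reduced alphabet and helper names are ours, and the real implementation draws from the full French/Dutch alphabets.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed; extend per language

def corrupt_word(word: str, rng: random.Random) -> str:
    """Apply one random corruption: insert, delete or transpose."""
    op = rng.choice(("insert", "delete", "transpose"))
    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if op == "delete" and len(word) > 1:
        i = rng.randrange(len(word))
        return word[:i] + word[i + 1:]
    if op == "transpose" and len(word) > 1:
        i = rng.randrange(len(word) - 1)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word  # single-character words are left unchanged

def corrupt_tokens(tokens, rate=0.2, seed=42):
    """Corrupt `rate` of all tokens. Word boundaries stay intact, so
    each corrupted token maps 1:1 to its original NER tag."""
    rng = random.Random(seed)
    return [corrupt_word(t, rng) if rng.random() < rate else t
            for t in tokens]
```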

Fig. 1. Example from the Dutch train set: text in its original, synOCR'd and corrupted form.

2.3 Embeddings

We experiment with various common embeddings and integrate them into our neural system. Some are publicly available; others we trained ourselves on the data described in Sect. 3.1.

Flair Embeddings. [3] present contextual string embeddings which can be extracted from a neural language model. Flair embeddings use the internal states of a trained character language model at token boundaries. They are contextualized because a word can have different embeddings depending on its context. These embeddings are also less sensitive to misspellings and rare words and can be learned on unlabeled corpora. We also use multilingual Flair embeddings. They were trained on a mix of corpora from different domains (Web, Wikipedia, Subtitles, News) and languages.

Historical Embeddings. We train Flair embeddings on large unlabeled historical corpora from a comparable time period (see Sect. 3.1) and refer to them as historical embeddings.
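
Such embeddings can be pre-trained with Flair's language-model trainer roughly as follows; the corpus layout, paths and hyperparameters are illustrative, not our exact settings.

```python
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import (LanguageModelTrainer,
                                                   TextCorpus)

# Flair's default character dictionary.
dictionary = Dictionary.load("chars")

# Folder with train/ (split into parts), valid.txt and test.txt.
corpus = TextCorpus("data/le_temps", dictionary,
                    forward=True, character_level=True)

lm = LanguageModel(dictionary, is_forward_lm=True,
                   hidden_size=1024, nlayers=1)
LanguageModelTrainer(lm, corpus).train("models/historical-fr-forward",
                                       sequence_length=250,
                                       mini_batch_size=100,
                                       max_epochs=10)
```

A backward model is trained analogously with forward=False and is_forward_lm=False; the forward and backward embeddings are then stacked in the tagger.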

BERT Embeddings. Since BERT embeddings [9] produce state-of-the-art results for a wide range of NLP tasks, we also experiment with multilingual BERT embeddings. BERT embeddings are subword embeddings based on a bidirectional transformer architecture and model the context of a word. For NER on CoNLL-03 [25], BERT embeddings do not perform as well as on other tasks [9], and we examine whether this observation holds in a cross-domain scenario with different data.

FastText Embeddings. We also use FastText embeddings [6], which are widely used in NLP. They can be trained efficiently and address character-level phenomena: a target word is represented as the sum of its subword embeddings. We use pre-trained FastText embeddings for French/Dutch.

Character-Level Embeddings. OCR errors cause out-of-vocabulary problems. Lample et al. [15] create character embeddings by passing all characters in a sentence to a bidirectional LSTM. To obtain a word representation, the forward and backward representations of all characters of the word are concatenated. With character embeddings, a vector can be formed for every single word, even if it is out-of-vocabulary. We therefore also compute these embeddings in our experiments.

3 Experiments

In the cross-domain setup, we train on modern data (clean or synOCR'd) and test on historical data (OCR'd). In the in-domain setup, we train and test on historical data (OCR'd). We use different combinations of embeddings and apply our noise methods in the experiments.

3.1 Data

We use different data sources for our experiments; some are openly available, while some historical data comes from an in-house project. For an overview of their properties (domain, labeling, size, language) see Table 1.

Annotated Historical Data. Our annotated historical data comes from the Europeana Newspapers collection, which contains historical news articles in 12 languages published between 1618 and 1990. Parts of the German, Dutch and French data were manually annotated with NER tags in IO/IOB format for PER (person), LOC (location), ORG (organization) by Neudecker [17]. Each NER corpus contains 100 scanned pages (with OCR accuracy over 80%), amounting to 207K tokens for French and 182K tokens for Dutch.

We preprocess the data as follows: we perform sentence splitting, filter out metadata, re-tokenize punctuation and convert all annotations to IOB1 format. We split the data 80/10/10 into train/dev/test and will make this preprocessed version available in CoNLL format.
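
In IOB1, unlike IOB2, the B- prefix is used only where two same-type entities are adjacent; all other entity tokens carry I-. A minimal sketch of such a conversion, assuming IOB2 input (the helper name is ours):

```python
def iob2_to_iob1(tags):
    """Convert IOB2 tags (every entity starts with B-) to IOB1
    (B- only separates adjacent entities of the same type)."""
    out, prev = [], "O"
    for tag in tags:
        if (tag.startswith("B-")
                and prev not in ("I-" + tag[2:], "B-" + tag[2:])):
            tag = "I-" + tag[2:]  # no adjacency conflict: I- suffices
        out.append(tag)
        prev = tag
    return out
```

For example, iob2_to_iob1(["B-PER", "I-PER", "B-PER"]) yields ["I-PER", "I-PER", "B-PER"].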

Annotated Modern Data. For the French cross-domain experiments, we use the French WikiNER corpus [18]. WikiNER is tagged in IOB format with an additional MISC (miscellaneous) category; we convert the tags to our Europeana format. For better comparability we downsample the corpus (sentence-wise) from 3.5M to 525K tokens: entire sentences are sampled uniformly at random without replacement. For Dutch, we use the CoNLL-02 corpus [24], which consists of four editions of the Belgian Dutch newspaper “De Morgen” from the year 2000. The data comprises 309K tokens and is annotated for PER, ORG, LOC and MISC. We convert the tags to our Europeana format.
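
The downsampling amounts to shuffling the sentence list and keeping whole sentences until the token budget is reached; a sketch (names and parameters are illustrative):

```python
import random

def downsample(sentences, target_tokens=525_000, seed=42):
    """Sample whole sentences uniformly at random, without replacement,
    until roughly `target_tokens` tokens are kept."""
    rng = random.Random(seed)
    pool = list(sentences)   # each sentence is a list of tokens
    rng.shuffle(pool)
    kept, total = [], 0
    for sentence in pool:
        if total >= target_tokens:
            break
        kept.append(sentence)
        total += len(sentence)
    return kept
```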

Unlabeled Historical Data. For historical French, we use “Le Temps”, a journal published between 1861 and 1942 (initially under a different name), a time period similar to that of the Europeana Newspapers. The corpus contains 977M tokens and is available from the National Library of France. For historical Dutch, we use data from an in-house OCR project. The data is from the 19th century and comprises 444M tokens. We use the unlabeled historical data to pre-train historical embeddings (see Sect. 2.3).

Table 1. Number of tokens per dataset in our experiments.

3.2 Baselines

We experiment with three baselines: (i) the Java implementation of the Stanford NER tagger [12]; (ii) a version of Stanford NER published by Neudecker [17] that was trained on Europeana; in contrast to our system, it was trained on the entire labeled Europeana corpora with 4-fold cross-validation; (iii) NN base, the neural network (see Sect. 2.1) with FastText, character and multilingual Flair embeddings, as recommended by Akbik et al. [1]. For French, we also list the result reported by Çavdar [8]. Since we do not have access to their implementation and could not confirm that their data splits conform to ours, we could not compute the combined \(F_1\) score or test for significance.

Table 2. Results (\(F_1\) scores on the French/Dutch Europeana test set) of training on the French/Dutch Europeana training set. Hist. embs. = historical embeddings. Scores marked with * are significantly lower than NN base \(\odot \) hist. embs.

4 Results and Discussion

We evaluate our systems with the CoNLL-2000 evaluation script, reporting \(F_1\) scores. To check statistical significance we use randomized testing [28]; results are considered significant if \(p < 0.05\).
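
Conceptually, the test works as sketched below: a generic paired approximate-randomization test in which metric stands for any corpus-level scorer (e.g., a CoNLL \(F_1\) implementation); all names are illustrative.

```python
import random

def randomization_test(preds_a, preds_b, golds, metric,
                       trials=10_000, seed=0):
    """Randomly swap the two systems' per-sentence outputs and count
    how often the metric difference is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(metric(preds_a, golds) - metric(preds_b, golds))
    hits = 0
    for _ in range(trials):
        sa, sb = [], []
        for a, b in zip(preds_a, preds_b):
            if rng.random() < 0.5:
                a, b = b, a   # swap this sentence's predictions
            sa.append(a)
            sb.append(b)
        if abs(metric(sa, golds) - metric(sb, golds)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # smoothed p-value
```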

4.1 In-domain Setup

For both languages we achieve the best results with NN base \(\odot \) historical embeddings. With this setup we obtain \(F_1\) scores of around 80% for both languages, significantly outperforming all three baselines in overall performance. The results are presented in Table 2. For French, the overall \(F_1\) score as well as the \(F_1\) for LOC and PER is best with NN base \(\odot \) historical embeddings. For ORG the pre-trained tagger of Neudecker [17] works best, which could be due to the gazetteer information it includes and, of course, to the fact that it was trained on the entire Europeana data. We hypothesize that ORG is the category with the most structural changes over time; in the military or ecclesiastical context in particular, there are a number of names that no longer exist (in this form). For Dutch, we observe the best performance with NN base \(\odot \) historical embeddings both overall and across all entity types.

4.2 Cross-Domain Setup

As shown in Table 3, NN base performs better than the statistical Stanford NER baseline, in line with the observations for in-domain training. We experimented with concatenating BERT embeddings to NN base; for both languages this increases performance (Table 3, NN base \(\odot \) BERT). Using historical embeddings is also very beneficial for both languages. We achieve our best results with BERT for Dutch and with historical embeddings for French. We conclude that the use of pre-trained language models is crucial for the performance of NER taggers.

We generated synthetic corruptions for the WikiNER/CoNLL corpus. This does not outperform NN base for either language, and training on synOCR'd WikiNER/CoNLL gives slightly worse results than NN base as well. Corrupting the training data without using any embeddings seems to harm performance drastically, which is in line with the observation of Hamdi et al. [13]. It is striking that training on corrupted/synOCR'd Dutch gives especially bad results for PER compared to French. A look at the Dutch test set shows that many entities are abbreviated first names (e.g., in A J van Roozendal) that are often misrecognized, which leads to a performance decrease. For French, the combination of NN base and historical embeddings trained on corrupted or synOCR'd data (NN ensemble corrupted / NN ensemble synOCR) gives the best results and outperforms all other systems. For Dutch, NN ensemble corrupted and NN ensemble synOCR give slightly worse results than NN base \(\odot \) BERT and NN base \(\odot \) historical embeddings, but perform better than the taggers trained on synOCR'd or corrupted data only (Table 3, NN ensemble).

Ablation Study. We analyze our results and examine the composition of NN ensemble synOCR more closely (since the results for NN ensemble corrupted are very similar, we perform the analysis for NN ensemble synOCR as a representative of both NN ensembles).

The ablation study (see Table 4) shows that NN ensemble benefits from combining different sources of information. For French, NN ensemble gives the best results only for PER; the overall performance increases if we do not use character-level embeddings, and there is a big performance loss if we omit the historical embeddings. If we do not train on synOCR'd data, the performance decreases. For Dutch we observe these effects even more clearly for the embeddings: omitting the historical embeddings costs performance as well, whereas not training on synOCR'd data even increases the \(F_1\) score.

Table 3. Results of training on WikiNER/CoNLL corpus. Scores marked with * are significantly lower than NN ensemble.
Table 4. Ablation study. Results of training on the clean and the synOCR’d WikiNER/CoNLL corpus.

To find out why our implementation of the assumption that synOCR increases the similarity of the data and improves results does not have the expected effect, we analyze the test sets. Only 10% of the French and 6% of the Dutch entities contain OCR errors. Thus, the wrong predictions are mostly not due to OCR errors, but to the inherent difficulty of recognizing entities cross-domain. This also explains why synthetic noisyfication does not consistently improve the system. In addition, there are some illegible lines in the synOCR'd corpora consisting of dashes and metasymbols, which is not similar to real OCR errors.

Fig. 2. Example sentence from the French test set.

To verify our assumption, we also compare the different systems only on the entities with OCR errors. Here NN ensemble outperforms both cross-domain baselines (Table 5, Stanford NER tagger, NN base cross-domain). Compared to the French results, the Dutch results are much worse. A look at the entities shows that the Dutch test set contains many hyphenated words where both word parts are labeled; looking at the parts individually, a clear assignment to an entity type cannot be made, which leads to tagging difficulties. Still, it is plausible that NN ensemble captures specific phenomena in the historical data better, since the synthetic noisyfication and the historical embeddings reduce the difference between the domains. The example in Fig. 2, drawn from the test set, shows that NN ensemble can handle noisy entities well, in contrast to, e.g., the Stanford NER tagger. Thus, in a scenario with many OCR errors the NN ensemble performs well.

Table 5. Results on entities with OCR errors in the French/Dutch test set. Scores marked with * are significantly lower than NN ensemble.

5 Related Work

There is some research on using natural language processing to improve OCR for historical documents [5, 26] and on NER for historical documents [11]. In the latter, a shared task on named entity processing in historical documents, Ehrmann et al. find that OCR noise drastically harms system performance. Like us, several participants (e.g. [7, 23]) use language models trained on historical data to boost the performance of NER taggers. Schweter and Baiter [22] explore NER for historical German data in a cross-domain setting; like us, they train a language model on unannotated in-domain data and integrate it into an NER tagger. In addition to the above-mentioned work, we employ “OCR noisyfication” (Sect. 2.2) and systematically examine the influence of different pre-trained embeddings. Çavdar [8] addresses NER and relation extraction on the French Europeana Newspaper corpus. Ehrmann et al. [10] investigate the performance of NER systems on Swiss historical newspapers and show that historical texts are a great challenge compared to contemporary texts; they find that the LOC class causes the most difficulties in named entity recognition. The recent work of Hamdi et al. [13] investigates the impact of OCR errors on NER. To do so, they also perturb modern corpora synthetically with different error rates, experimenting with Spanish, Dutch and English. Like us, they perturb the Dutch CoNLL corpus and train NER taggers on that data; unlike us, they also test on a subset of the perturbed corpus, whereas we test on a subset of the Dutch Europeana corpus. Hamdi et al. [13] show that neural taggers perform better than other taggers such as the Stanford NER tagger, and that performance decreases drastically as the OCR error rate increases. Piktus et al. [19] learn misspelling-oblivious FastText embeddings from synthetic misspellings generated by an error model for part-of-speech tagging. We use a similar corruption method, but additionally use synOCR and historical embeddings for NER.

6 Conclusion

We proposed new methods for in-domain and cross-domain named entity recognition (NER) on historical data and addressed data-centric domain adaptation. For the cross-domain case, we handle domain shift by integrating unannotated historical data via contextualized string embeddings, and OCR errors by injecting synthetic OCR errors into the modern data. This allows us to obtain good results when labeled historical data is unavailable and the historical data is noisy. Training on contemporary corpora and testing on historical corpora, we achieve new state-of-the-art results of 69.3% for French and 63.4% for Dutch. For the in-domain case we obtain state-of-the-art results of 77.9% for French and 84.2% for Dutch. There is an increasing demand for advancing the digitization of the world's cultural heritage. High-quality digitized historical data with reliable meta-information will facilitate convenient access and search capabilities, and allow for extensive analysis, for example of historical linguistic or social phenomena. Since named entity recognition is one of the most fundamental labeling tasks, it is desirable that advances in this area translate to other labeling tasks in the processing of historical data as well.