
1 Introduction

Substantial amounts of printed documents are digitized and archived as images in digital libraries. This is notably the case for historical documents, which require an Optical Character Recognition (OCR) step to give access to their textual content. Unfortunately, while the performance of OCR systems has greatly improved, it remains imperfect. In addition, many documents were digitized at a time when storing high-quality images was difficult, and such documents cannot readily benefit from improvements in OCR quality. Several studies accordingly suggest that the performance of natural language processing tools is harmed by the use of OCRed text, i.e., text resulting from an OCR process [18]. This makes document access more difficult: a simple keyword search, for instance, will not match a query with the corresponding words if they suffer from OCR errors. The quality of the text generated by OCR engines depends on the recognition algorithms, on the parameter settings of the scanner used to digitize documents, on the quality of the original image and on the nature of the document. For instance, text generated from recent vs. historical newspapers, or from well-preserved vs. damaged manuscripts, is usually not of the same quality. Even though a reasonable amount of OCR errors is known to have little impact on the readability of documents, the errors will be indexed as they are by search engines and other NLP tools. Consequently, if some words are incorrectly recognized by the OCR process, they will be indexed with their errors, which causes a chain reaction for the tools developed to analyze the resulting content.

A study has shown that named entities (NEs) are the first point of entry for users in a search system  [10]. As an illustration, it has been observed that 4 out of 5 user queries on the Gallica digital libraryFootnote 1 contain at least one named entity  [2]. For this reason, their quality in OCRed documents is far more critical than that of most other words. In order to better satisfy users' information needs, it is thus necessary to ensure the quality of the recognized entities.

Named entity recognition (NER) is a task that emerged in the middle of the 1990s  [12]. It aims to locate and categorize important concepts of a given text into a set of predefined classes. Three main labels are commonly used: persons, locations and organizations  [22]. NER techniques can be divided into two groups: rule-based and machine learning methods. In rule-based methods, the rules are mainly defined manually. They rely on linguistic descriptions, trigger words and lexica of proper names, and use patterns and regular expressions in order to locate and classify named entities. Machine learning approaches, on the other hand, aim to extract rules automatically from learning systems trained on large corpora. Rule-based methods are clearly affected by OCR errors and are not able to deal with the degradation generated by the OCR, whereas machine learning methods are flexible enough to be automatically adapted to process noisy texts. More recently, neural networks have been shown to outperform other supervised algorithms for NER. The first deep neural network based learning system was developed in 2011  [4] and reached very competitive NER results in comparison to previous machine learning systems. Since then, many NER systems using neural networks have been proposed and have been shown to outperform all previous systems  [25]. We present in this paper a comparative study of well-performing NER methods. We have chosen, in this work, to use four major available systems: the well-known CoreNLP tool based on Conditional Random Fields  [8] and three neural network systems, BLSTM-CRF  [17], BLSTM-CNN  [3] and BLSTM-CNN-CRF  [20]. The reason is that processing degraded texts with rule-based systems requires substantial manual effort to handle all typical OCR degradations, unlike machine learning systems, which are able to automatically overcome them. Furthermore, most rule-based systems are domain-specific or language-dependent and cannot easily be extended to other domains or languages  [9]. Our goal is to evaluate the impact of OCR errors on NER accuracy when dealing with noisy text, a task strongly related to document indexing in digital libraries. To the best of our knowledge, no other research work has systematically studied the impact of OCR on named entity recognition over datasets in multiple languages.

In order to assess our work, we used three publicly available datasets which cover three languages (English, Dutch and Spanish). Given the lack of OCRed annotated data aligned with its ground truth, we simulated test data by adding the typical textual degradation produced by an OCR engine. These data were obtained by automatically adding several levels of degradation to those corpora. More specifically, we spread four types of common OCR degradation over the original clean text. As OCR errors depend on the quality and the parameters of the digitization process, we also simulated typical scanning noise at two different levels: rare and reasonably frequent. We finally aligned the clean and OCRed data in order to be able to reuse the same annotations. Running the NER systems on progressively noisier data allows us to plot NER results against OCR error rates. Results on our simulated OCRed resource are generally consistent with those on a real-life OCRed dataset extracted from Finnish historical newspapers provided by the National Library of Finland, which confirms the relevance of our analysis.

The rest of the paper is organized as follows: Sect. 2 presents related work studying the impact of OCR. Section 3 gives an overview of the datasets, followed by NER results on clean and OCRed texts in Sect. 4. Section 5 reports our experiments on real data and Sect. 6 concludes the paper.

2 Related Work

Despite decades of research, the output of OCR systems remains imperfect, especially when the original document is old, damaged or poorly digitized. OCR systems lie at the beginning of the digitization pipeline, and OCR errors tend to have a cumulative impact on the subsequent steps. For this reason, researchers have studied the impact of processing text from noisy sources in order to understand the effects of OCR on text analysis tools.

Much research on processing noisy data  [32] has stemmed from the field of natural language processing (NLP). Lopresti  [18], for instance, considered a text analysis pipeline consisting of sentence boundary detection, followed by tokenization and POS tagging. This study reported that, among the errors generated by the OCR process, insertion errors were worse than character deletion errors for the sentence boundary task, while OCR substitution errors had more impact on POS tagging. The effects of noisy texts have also been evaluated on other NLP tasks such as document summarization  [15] and machine translation  [36].

Many other works have focused on information retrieval from noisy data  [5]. Chiron et al.  [2] proposed a method to estimate the impact of OCR errors on the use of digital libraries. They built an OCR error model using a large corpus of OCRed documents aligned with their corresponding ground truth. Their model allows estimating the risk that a user's query might fail to match the targeted documents. Taghva et al.  [33] showed that moderate OCR error rates do not have a drastic impact on the effectiveness of classical information retrieval measures. Other studies focused on the impact of OCR errors on the classification of pathology reports for cancer notification  [37]; they concluded that OCR errors, even at modest rates, are not negligible when extracting cancer notification items.

For NER, several works have dealt with extracting NEs from diverse text types such as the outputs of Automatic Speech Recognition (ASR) systems [7], informal SMS messages and noisy social network posts  [29]. Palmer and Ostendorf  [23], for example, described an approach for improving named entity extraction from ASR outputs by explicitly modeling errors through the use of confidence scores. In a similar setting, Miller et al.  [21] studied the performance of named entity extraction on a variety of spoken and OCRed data. They trained the IdentiFinder system  [1] on both clean and noisy input material; performance degraded linearly as a function of the word error rate, and they concluded that results may lose about 8 points of F-score with only a 15% word error rate. Rodriquez et al.  [30] reported that manual correction of OCR output does not yield a clearly observable improvement in NER results. In [28], Riedl et al. presented a complete framework for named entity recognition on both contemporary clean and noisy historical German using a transfer learning technique; they achieved state-of-the-art performance on historical datasets that contain noise and have fewer samples. More recently, Hamdi et al. [13] and Pontes et al. [26] used synthetic OCRed English resources to study the impact of OCR errors on named entity recognition and named entity linking, respectively.

In this paper, similarly to [30] and [21], we study the evolution of the performance of named entity recognition systems on noisy OCRed data. Unlike these works, we use more sophisticated NER systems relying on recent neural network models. We also use larger corpora covering four languages, thanks to a technique that allows us to synthesize and test different types and levels of noise. These corpora contain different types of degradation that correspond to the effects of long storage periods and of the digitization process. We defined two levels of degradation for each type in order to obtain a clearer view of OCR errors and of their impact on the named entity recognition task.

3 Dataset Overview

To the best of our knowledge, no publicly available corpus provides named entity annotations on both clean and noisy versions of the same texts. There are corpora where text produced by an OCR process is aligned with the original text, but their NEs are not annotated. For this reason, we took advantage of three available NER corpora and simulated from them several OCRed versions with variable OCR error rates. We used the public CoNLL-02 and CoNLL-03 corpora, which deal with named entities and cover three languages: English  [34], Spanish and Dutch  [6]. The English data consist of Reuters news stories published between August 1996 and August 1997. The Spanish corpus is a collection of news wire articles made available by the Spanish EFE News Agency, while the Dutch corpus consists of four editions of the Belgian newspaper "De Morgen". These datasets are split into three subsets: a training set, a test set and a development set, the latter being used to tune the parameters of the learning methods. All data files contain a single word per line with its associated named entity tag. Table 1 gives details about each dataset used in this work.

Table 1. CoNLL-02 and CoNLL-03 datasets

The annotation of named entities follows the IOB scheme (Inside, Outside, Beginning), where every token is labeled B if it is the beginning of a named entity, I if it is inside a named entity but not its first token, and O otherwise  [27]. Four classes are used to label NEs: PER for persons, LOC for locations, ORG for organizations and MISC for other NEs.
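For illustration, the minimal sketch below shows how a short sentence looks under this scheme and how IOB-tagged tokens can be grouped back into entity spans. The sentence is a standard CoNLL-03 example; the extract_entities helper is our own illustration, not part of any released tooling.

```python
# A sentence annotated with the IOB scheme used in CoNLL-02/03.
# B-/I- prefixes mark entity boundaries, O marks tokens outside any entity.
iob_example = [
    ("United", "B-ORG"), ("Nations", "I-ORG"),
    ("official", "O"),
    ("Ekeus", "B-PER"),
    ("heads", "O"), ("for", "O"),
    ("Baghdad", "B-LOC"), (".", "O"),
]

def extract_entities(tagged_tokens):
    """Group IOB-tagged tokens back into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

print(extract_entities(iob_example))
# [('United Nations', 'ORG'), ('Ekeus', 'PER'), ('Baghdad', 'LOC')]
```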

From the test data, we simulated several OCRed versions. To do so, we first extracted raw text from the test sets and converted it into images. These images were then contaminated by adding typical synthesized noise. We next extracted OCRed data using the Tesseract open-source OCR engine v-3.04.01Footnote 2, which provides language packages covering many languages, among them English, Dutch and Spanish. The resulting noisy OCRed text and the original text were finally aligned, and the annotations of the original corpus were projected back onto the noisy version. Figure 1 describes the main steps used to simulate the noisy corpora, and a minimal code sketch of this pipeline is given after the figure. We assume that the resulting text is similar to the text indexed in digital libraries.

Fig. 1. Simulation of OCRed corpora
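The sketch below illustrates the clean text → image → noise → OCR pipeline. It assumes the Pillow and pytesseract libraries and an installed Tesseract language pack; render_text, degrade_image and simulate_ocr are hypothetical helper names, and degrade_image is only a placeholder for the DocCreator degradations described next.

```python
# Sketch of the corpus-degradation pipeline (clean text -> image -> noise -> OCR).
# Assumes Pillow and pytesseract are installed and a Tesseract language pack
# (eng, nld, spa) is available; degrade_image() is a placeholder for the
# DocCreator degradations used in this work.
from PIL import Image, ImageDraw, ImageFont
import pytesseract

def render_text(text, width=1200, height=1600):
    """Render raw text onto a white page image."""
    page = Image.new("L", (width, height), color=255)
    draw = ImageDraw.Draw(page)
    draw.text((40, 40), text, fill=0, font=ImageFont.load_default())
    return page

def degrade_image(page):
    """Placeholder: apply character degradation, phantom ink, bleed-through
    or blurring here (DocCreator was used for this step in our experiments)."""
    return page

def simulate_ocr(text, lang="eng"):
    """Render, degrade and re-OCR a piece of clean text."""
    page = degrade_image(render_text(text))
    return pytesseract.image_to_string(page, lang=lang)

noisy_text = simulate_ocr("EU rejects German call to boycott British lamb .")
```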

In order to contaminate the images, we used the DocCreator toolFootnote 3 developed by Journet et al.  [16]. The tool provides many options to degrade document images, such as blurring, ink degradation and the addition of phantom characters. In this work, we applied four types of degradation related to storage conditions or to the poor quality of printing materials, as may be found in digital library collections (an illustrative sketch follows the list):

  • character degradation simulates degradation due to the age of the document or to an incorrectly set scanner. It consists of adding small ink spots on characters and can partially obscure them.

  • phantom degradation simulates the degradation of worn documents: after successive uses, some characters are progressively eroded, and the digitization process generates phantom ink around them.

  • bleed-through simulates ink from the back side of a page seeping through to the front side. This degradation only appears with double-sided pages.

  • blurring simulates a blurring effect, as can be encountered during a typical digitization process with focus issues.
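DocCreator was used to produce these degradations; the sketch below is only a rough, hypothetical approximation of two of them (blurring and character ink spots) using Pillow and NumPy, intended to make the kind of noise involved concrete rather than to reproduce DocCreator's output.

```python
# Rough approximation of two degradation types (not the DocCreator implementation):
# - blurring: Gaussian blur over the whole page
# - character degradation: random small dark ink spots
import random
import numpy as np
from PIL import Image, ImageFilter

def blur(page, radius=1.5):
    """Simulate a digitization focus issue."""
    return page.filter(ImageFilter.GaussianBlur(radius=radius))

def ink_spots(page, n_spots=200, max_size=2):
    """Scatter small dark spots that can partially obscure characters."""
    pixels = np.array(page)
    h, w = pixels.shape
    for _ in range(n_spots):
        y, x = random.randrange(h), random.randrange(w)
        size = random.randint(1, max_size)
        pixels[y:y + size, x:x + size] = 0
    return Image.fromarray(pixels)

# LEV-1 vs LEV-2 can be approximated by raising the noise parameters.
page = Image.new("L", (1200, 1600), color=255)  # stand-in for a rendered page
lev1 = ink_spots(blur(page, radius=0.8), n_spots=100)
lev2 = ink_spots(blur(page, radius=1.8), n_spots=400)
```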

For each type of noise, we defined two levels of degradation: LEV-1, where the noise is applied rarely, and LEV-2, where the degradation is reasonably more frequent. These levels allow generating noisy texts with OCR error rates close to real cases  [14]. The degradation levels and types allowed us to build eight noisy versions of each test corpus. We additionally defined two versions called LEV-0 and LEV-MIX. The LEV-0 version is the re-OCRed version of the original images with no degradation added, while the LEV-MIX version is the result of combining all the LEV-1 degradation typesFootnote 4. The LEV-0 version aims to evaluate the OCR engine on sharp images, whereas the goal of the LEV-MIX version is to be closer to real-world documents, which typically contain several OCR degradations simultaneously.

After text extraction by the OCR engine, the noisy text was aligned with its original version using the RETAS tool  [35]. An example of the alignment between the ground truth and its OCRed version is shown in Fig. 2. This alignment reflects the various errors made by the OCR engine. The differences between the two texts are denoted by the character '@': each '@' in the ground truth indicates a character inserted by the OCR, while each '@' in the noisy text indicates that a character of the original text has been deleted.

Fig. 2. Original and noisy text alignment
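We rely on RETAS for this alignment; the sketch below merely illustrates the '@' padding convention, using Python's difflib as a stand-in character-level aligner (the example strings are toy values, not corpus extracts).

```python
# Illustration of the '@' padding convention, with difflib standing in
# for the RETAS aligner used in our experiments.
import difflib

def align(clean, noisy, pad="@"):
    """Return clean and noisy strings padded with '@' so that they have equal
    length: '@' on the clean side marks a character inserted by the OCR,
    '@' on the noisy side marks a deleted character."""
    matcher = difflib.SequenceMatcher(a=clean, b=noisy, autojunk=False)
    aligned_clean, aligned_noisy = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        span_a, span_b = clean[i1:i2], noisy[j1:j2]
        # Pad the shorter side so both spans keep the same length.
        width = max(len(span_a), len(span_b))
        aligned_clean.append(span_a.ljust(width, pad))
        aligned_noisy.append(span_b.ljust(width, pad))
    return "".join(aligned_clean), "".join(aligned_noisy)

gt, ocr = align("Fischler proposed", "F!schler pr0posed EU-wide")
print(gt)   # Fischler proposed@@@@@@@@
print(ocr)  # F!schler pr0posed EU-wide
```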

In order to evaluate the OCR quality, we used two measures: the Character Error Rate (CER)  [14], which corresponds to the proportion of erroneous characters with respect to the original text, and the Word Error Rate (WER)  [19], which is the proportion of erroneous words with respect to the total number of words in the original text. A word is considered erroneous if it contains at least one character error. Table 2 details the OCR error rates at the character and word levels for the different OCRed versions of the three datasets.
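Written compactly, with counts taken from the character-level alignment described above, these measures are:

\[
\mathrm{CER} = \frac{\#\,\text{erroneous characters}}{\#\,\text{characters in the ground truth}},
\qquad
\mathrm{WER} = \frac{\#\,\text{words containing at least one character error}}{\#\,\text{words in the ground truth}}
\]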

Table 2. Estimation of OCR errors rates

As can be seen from Table 2, the CER and WER increase considerably when noise is added, compared to the re-OCRed clean text (LEV-0). The table also shows that the noise distributed in the documents is homogeneous. The CER is quite low while the WER is relatively high: except for the blurring LEV-2 degradation, the CER varies between \(\sim \)1% and \(\sim \)7%, while the error rate at the word level always exceeds \(8\%\). The OCR error rates also show that blurring and character degradation are the most critical types of noise for digitized documents; they generate the highest error rates at both the character and word levels.

Despite applying the same degradation to all the data, the OCR is considerably more accurate on the Spanish data, for which the CER and WER remain below 20% and 30% respectively. On the other hand, the OCR error rates on the English and Dutch data are more variable and can reach up to 50%. Bleed-through and phantom characters have a slight impact on the effectiveness of the OCR, while ink degradation and blurring lead to the highest OCR error rates; among these types of degradation, blurring is the one that most strongly affects the OCR output.

Concerning the NEs, knowing their locations in the original text, we aligned them with the corresponding words generated by the OCR. We then identified the NEs contaminated by OCR errors and those correctly recognized. A total of 3,623 English named entity tokens were correctly recognized by the OCR, which represents \(63.33\%\); this rate reaches \(72.14\%\) for Spanish and \(59.87\%\) for Dutch. All the datasets used in this work are publicly availableFootnote 5. For each test corpus (English, Dutch and Spanish), we provide the degraded images and the noisy texts extracted by the OCR, as well as versions aligned with the clean data at the word and character levels.
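The following minimal sketch illustrates how such a check can be done once the '@'-aligned strings are available. The aligned strings, the span offsets and the project_span helper are toy values and hypothetical names used only for illustration, not extracts from our actual pipeline.

```python
# Checking whether an NE survived the OCR, given '@'-aligned strings
# (toy values hard-coded here; in practice they come from the RETAS alignment).
def project_span(aligned_clean, aligned_noisy, start, end, pad="@"):
    """Map a character span [start, end) of the original clean text onto the
    aligned strings and return the corresponding OCR output."""
    # Indices of the original (non-pad) characters inside the aligned clean string.
    positions = [i for i, c in enumerate(aligned_clean) if c != pad]
    lo, hi = positions[start], positions[end - 1] + 1
    return aligned_noisy[lo:hi].replace(pad, "")

gt  = "Fischler proposed@@@@@@@@"   # ground truth, padded by the aligner
ocr = "F!schler pr0posed EU-wide"   # OCR output

ne_surface = "Fischler"                      # NE span [0, 8) in the clean text
ocr_token = project_span(gt, ocr, 0, 8)      # -> 'F!schler'
contaminated = ocr_token != ne_surface       # True: this NE was altered by the OCR
```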

4 Evaluation and Results

Neural networks and the related training process require several hyper-parameters, such as the character embedding dimension, the character-based token embedding, the LSTM dimension and the token embedding dimension. The same training and testing parameters were applied to the different datasets, OCRed and clean. English word embeddings were obtained with GloVe  [24], while word2vec  [11] was used for the Dutch and Spanish word embeddings. Table 3 shows the results of NER on the clean datasets. We used the traditional metrics ([P]recision, [R]ecall and [F1]-score) to evaluate the NER systems.
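For reference, these metrics are computed at the entity level; under the usual CoNLL convention, an entity typically counts as correct only if both its boundaries and its class match the gold annotation:

\[
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2\,P\,R}{P + R}
\]

where \(TP\), \(FP\) and \(FN\) denote the numbers of true positives, false positives and false negatives at the entity level.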

Table 3. NER Results on clean data

This first test shows that the results obtained with the various methods are globally equivalent for the three languages. We can notice that the neural network based approaches give slightly better results than CoreNLP. The same experiments were run on the OCRed datasets. Unsurprisingly, NER accuracy drops in proportion to the OCR error rate, which is related to the degradation type and level. Table 4 gives the F-score of each NER system on noisy data. The results show that, compared to clean data, NER results may lose 3 to 5 points on the LEV-0 OCRed data. This shows that the OCR step alone has a negative impact on the NER task, since LEV-0 represents OCRed data with no added noise: in other words, even with perfect storage and digitization, NER accuracy may be affected by the OCR quality. For the other types of degradation, OCR error rates vary from 8% to 50% at the word level, and the NER F-score may drop from 90% to 50% for English. Compared to CoreNLP, the deep-learning systems showed a better ability to overcome OCR errors; they achieved satisfactory results when the word error rate was below 20%.

Table 4. NER F1-score of noisy data

The results in Table 4 also indicate that the best NER F1-score (in bold) can be obtained by different NER systems depending on the type and level of degradation. For this reason, we calculated the \(\delta \) measure, which gives the minimum relative decrease between the best F1-score obtained on clean data and the best F1-score obtained on noisy data, for each type and level of degradation. This measure corresponds to an ideal system that would give the best accuracy for all degradation levels. For the three languages, \(\delta \) exceeds \(40\%\) on noisy data whose WER and CER reach more than 0.4 and 0.5 respectively. The Dutch F-score, for example, decreases below \(50\%\) for all four systems on noisy texts extracted from blurred images with an OCR error rate of \(44\%\) at the word level.
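One way to formalize this measure (our reading of the description above), for each degradation configuration \(d\), is:

\[
\delta(d) = \frac{\max_{s} F_1^{\mathrm{clean}}(s) \;-\; \max_{s} F_1^{d}(s)}{\max_{s} F_1^{\mathrm{clean}}(s)}
\]

where \(s\) ranges over the four NER systems, so that \(\delta(d)\) is the smallest relative F1 drop achievable by always picking the best-performing system.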

Figure 3 shows the evolution of the \(\delta \) measure with respect to the degradation. The types of degradation are sorted according to their OCR error rates. The CER and WER curves are also given for comparison.

Fig. 3. NER F-score degradation according to OCR error rates

5 Experiments on Historical Dataset

An additional experiment was performed on real-life OCRed data, based on Finnish-language historical newspapers from the National Library of Finland (NLF)  [31]. The corpus contains around 450K tokens with more than 30K NEs. The NLF corpus distinguishes only two types of NEs: PER and LOC. The CER and WER of the OCRed corpus are 6.96% and 16.67% respectively, which is comparable to the error rates obtained with the simulated bleed_LEV-2 degradation in the CoNLL corpora. Ruokolainen et al. [31] evaluated the NER annotation of the NLF corpus using CoreNLP. The system yielded overall F1-scores of 71.92% for PER and 78.79% for LOC on the OCRed texts, which represents a loss of around 9–10 points compared to clean texts. This decrease is roughly equivalent to that obtained on the synthetic OCRed data using CoreNLP (see Table 4): with the same OCR error rates, the NER F1-score on the English corpus shows a loss of 9.83% compared to the results on the clean corpus.

Using BLSTM-CRF, the NER F1-score reaches 89.8% on clean data and 87.4% on OCRed data, which represents a decrease of 2.4 points. The corresponding decreases on the CoNLL corpora are between 4 and 8 points, as shown in Fig. 3. The Finnish results are thus slightly better than those obtained with synthetic data using BLSTM-CRF. This is not unexpected, since the Finnish training set is larger than the CoNLL datasets and the set of NE classes in the NLF corpus is less fine-grained than the one used in the CoNLL corpora. As shown in Table 4, neural network based systems outperform CoreNLP; we therefore ran the same experiment on the NLF corpus using BLSTM-CRF. The results are shown in Table 5.

Table 5. Results on the NLF corpus

The results obtained with the neural network system largely outperform the CoreNLP performance. On clean data, we obtained an overall F1-score of 89% (to be compared to 82%). More importantly, on OCRed data, the NER F1-score reaches 90.4% for PER and 83.4% for LOC, an improvement of around 11 points for both types of NEs. Despite the complexity of the NER task and the occurrence of several types of errors in the documents, the systems achieved promising results, which shows that they can be used to identify named entities in degraded documents. Word correction strategies, based for instance on auto-encoders or language models, could further decrease the impact of OCR degradation on NER.

6 Conclusion

This paper presents the most systematic evaluation to date of the impact of OCR errors on NER systems over multilingual datasets. We evaluated four machine-learning systems on three available datasets in English, Dutch and Spanish. We re-OCRed these collections and added four types of noise at two different levels in order to simulate various OCR outputs. All the noisy texts were aligned with their corresponding ground truth in order to test the NER systems on noisy data and to observe the evolution of their accuracy. This new dataset has been made publicly available to the community. Such resources, combining OCRed data aligned with their clean version, are very useful for two reasons. First, they can be used to train NLP algorithms on collections of documents that have been through an OCR process, as is notably the case for historical documents. Second, they can be used to estimate the impact of OCR on NLP applications and lead to recommendations, for instance on which applications can reasonably be run on a document collection given its OCR quality.

We studied the correlation between OCR error rates and NER accuracy using four effective systems. We showed that NER accuracy drops from \(90\%\) to \(50\%\) when the word error rate increases from \(8\%\) to \(50\%\). These experiments were validated on a real OCRed dataset in Finnish, where our systematic study allowed us to outperform the best previously reported results by \(\sim \)11 points.

This work shows that specific post-OCR correction should be developed in order to improve NER results, and thus improve information access for end users.