
1 Introduction

Recent years have seen the delivery of an increasing amount of textual corpora for the Humanities and Social Sciences. Representative examples are the digitization of the gigantic Gallica collection by the National Library of FranceFootnote 1 and the Trove online Australian libraryFootnote 2, a database aggregator and service providing full-text documents, digital images, and storage of digitized documents. Access to this massive data offers new perspectives to a growing number of disciplines, ranging from socio-political, cultural, and economic history to linguistics and philology. Billions of images from historical documents, including digitized manuscripts, medieval registers, and old press, are captured and their content is transcribed, either manually through dedicated interfaces or automatically using optical character recognition (OCR) or handwritten text recognition (HTR). The mass digitization process, initiated in the 1980s with small-scale internal projects, led to the “rise of digitization”, which reached a certain maturity in the early 2000s with large-scale digitization campaigns across the industry [12, 16]. As this process of mass digitization continues, increasingly advanced techniques from the field of natural language processing (NLP) are dedicated to historical documents, offering new ways to access full-text, semantically enriched archives [33], such as named entity recognition (NER) [4, 10, 19], entity linking (EL) [26], and event detection [5, 32].

However, historical collections present multiple challenges for developing such techniques. These challenges stem from the quality of digitization, the need to handle documents deteriorated by the effects of time, poor-quality printing materials, or inaccurate scanning processes, all of which are common issues in historical documents [20]. Moreover, historical collections pose another challenge because their documents are distributed over a period of time long enough to be affected by language change and evolution. This is especially true of Western European languages, which only acquired their modern spelling standards roughly around the 18th or 19th centuries [29]. Since existing collections [12, 15, 16] provide metadata such as the year of publication, we propose to take advantage of the temporal context of historical documents in order to increase the quality of their semantic enrichment. When this metadata is not available due to the age of the documents, the year has often been estimated, and a new NLP task recently emerged that aims to predict a document’s year of publication [36].

NER corresponds to the identification of entities of interest in texts, generally of the types person, organization, and location. Such entities act as referential anchors that underlie the semantics of texts and guide their interpretation. For example, in Europe, by the medieval period, most people were identified simply by a mononym, a single proper name. Family names or surnames began to be expected in the 13th century, but in some regions or social classes much later (the 17th century for the Welsh). Many people shared the same name, and spellings varied across vernacular languages and Latin, as well as within a single language (e.g., Guillelmus, Guillaume, Willelmus, William, Wilhelm). Locations may have disappeared or changed completely; those that survived from prehistory well into the 21st century (e.g., Scotland, Wales, Spain) are highly ambiguous and also exhibit very different spellings, making them very difficult to identify [6]. In this article, we focus on exploring temporality in entity detection from historical collections. We propose a novel technique for injecting additional temporal-aware knowledge by relying on Wikipedia and Wikidata to provide related contextual information. More precisely, we retrieve semantically relevant additional contexts by exploiting the time information provided by the historical data collections and include them as mean-pooled representations in our Transformer-based NER model. We consider that adding grammatically correct contexts could compensate for texts made error-prone by digitization, while adding temporality could further help handle changes in language or entity names.

The paper is structured as follows: we present the related work and datasets in Sect. 2 and 3, respectively. Our methodology for retrieving additional context through temporal knowledge graphs, and how the contexts are included within the proposed model, is described in Sect. 4. We then perform several experiments on the choice of time span when selecting additional context and present our findings in Sect. 5. Finally, conclusions and future work are drawn in Sect. 6Footnote 3.

2 Related Work

Named Entity Recognition in Historical Data. Due to the multiple challenges posed by the quality of digitization and the historical variations of a language, NER on historical and digitized documents still lags behind the performance reached on modern documents [47, 52]. Recent evaluation campaigns, such as the Identifying Historical People, Places, and other Entities (HIPE) labs at CLEF 2020Footnote 4 [16] and 2022Footnote 5 [17], proposed NER and EL tasks over ca. 200 years of historical newspapers written in multiple languages (English, French, German, Finnish, and Swedish). They successfully showed that these tasks benefit from the progress in neural-based NLP, specifically driven by the latest advances in Transformer-based pre-trained language models, as a considerable improvement in performance was observed on the historical collections, especially for NER [24, 42, 44].

The authors of [10] present an extensive survey on NER over historical datasets and highlight the challenges that state-of-the-art NER methods applied to historical and noisy inputs need to address. To overcome the impact of OCR errors, contextualized embeddings at the character level were utilized to find better representations of out-of-vocabulary words (OOVs) [2]. These contextualized embeddings are learned using language models and allow predicting the next character of a string given the previous characters. Moreover, further research showed that fine-tuning several Transformer encoders on historical collections can alleviate digitization errors [4]. To deal with the lack of historical resources, [40] proposed to use transfer learning in order to learn models on large contemporary resources and then adapt them to a few corpora of historical nature. Finally, to address spelling variations, some works developed transformation rules that model the diachronic evolution of words and generate a normalized version processable by existing NER systems [8, 23]. While most of these approaches rely on the local textual context for detecting entities in such documents, temporal information has generally been disregarded. To the best of our knowledge, temporal information has been exploited for named entity disambiguation, through temporal signatures that reflect the importance of different years for an entity [1], and for entity linking, through time-based filters [26], but not for historical NER.

Named Entity Recognition with Knowledge Bases. Considering the complementary behaviors of knowledge-based and neural-based approaches for NER, several studies have explored the integration of different types of symbolic representations (e.g., knowledge bases, static knowledge graphs, gazetteers) and observed significant improvements in token representations and in the detection of entities over modern datasets (e.g., CoNLL [43], OntoNotes 5.0 [35]) [27, 43]. Gazetteer knowledge has been integrated into NER models alongside word-level representations through gating mechanisms [31], and Wikipedia has mostly been utilized to enrich the semantic representations of possible entities by fine-tuning recent pre-trained language models on the fill-in-the-blank (cloze) task [39, 52].

Symbolic knowledge has also been utilized to increase the contextual information around possible entities when well-formed text is replaced with short texts containing long-tail entities [31]. Even under these complications, introducing external contexts into NER systems has been shown to have a positive impact on entity identification performance. [48] constructed a knowledge base system based on a local instance of Wikipedia to retrieve relevant documents given a query sentence; the retrieved documents and query sentences were concatenated and fed to the NER system. Our proposed methodology is inspired by their work; however, we include the additional contexts at the model level by generating a mean-pooled representation for each context instead of concatenating the contexts with the initial sentence. We consider that having pooled representations for each additional context can reduce the noise that could be introduced by other entities found in these texts.

Temporality in Knowledge Graphs. Recent advances have shown a growing interest in learning representations of entities and relations that include time information [7]. Other work [50] proposed a temporal knowledge graph (TKG) embedding model for representing facts involving time intervals by modeling the temporal evolution of entity embeddings as rotations in a complex vector space: entities and relations are represented as single or dual complex embeddings, and temporal changes correspond to rotations of the entity embeddings in that space. Since knowledge graphs change over time in evolving data (e.g., the fact The President of the United States is Barack Obama is valid only from 2009 to 2017), a temporal-aware knowledge graph embedding approach [49] was also proposed, moving beyond complex-valued representations and introducing multivector embeddings from geometric algebras to model entities, relations, and timestamps in TKGs. Further research [51] presented a graph neural network (GNN) model that treats timestamp information as an inherent property of the graph structure, with a self-attention mechanism that assigns appropriate weights to nodes according to their relevant relations and neighborhood timestamps; timestamps are therefore considered properties of the links between entities.

TKGs, however, show many inconsistencies and a lack of data quality across various dimensions, including factual accuracy, completeness, and timeliness. In consequence, other research [9] further explores TKGs by targeting the completion of knowledge with accurate but missing information. Moreover, since such TKGs often suffer from incompleteness, the authors of [53] introduced a temporal-aware representation learning model that helps infer missing temporal facts by focusing on facts that occur recurrently and leveraging a copy mechanism to identify facts with repetition. The aforementioned methods demonstrate that the usage of TKGs is an emerging domain under active exploration, in particular in the field of NLP. The availability of information about the temporal evolution of entities could not only be a promising solution for improving their semantic knowledge representations but could also provide additional contextual information for efficient NER. To the best of our knowledge, our work is the first attempt to leverage the time information provided by TKGs to improve NER.

Fig. 1. An example from the hipe-2020 dataset.

Fig. 2. An example from the ajmc dataset.

3 Datasets

In this study, we utilize two collections composed of historical newspapers and classical commentaries covering circa 200 years. We experiment with the hipe-2020 and the Ajax Multi-Commentary (ajmc) datasets, both recently proposed by the CLEF-HIPE-2022 evaluation campaign [14].

hipe-2020 includes newspaper articles from Swiss, Luxembourgish, and American newspapers in French, German, and English (19C-20C) and contains 19,848 linked entities as part of the training sets [12, 15, 16]. For each language, the corpus is divided into train, development, and test sets, with the exception of English, for which only development and test sets were produced [13]. We therefore utilized the French and German datasets for training the proposed models in our experimental setup. An example from the French dataset is presented in Fig. 1.

ajmc is composed of classical commentaries from the Ajax Multi-Commentary project that includes digitized 19C commentaries published in French, German, and English [41] annotated with both universal and domain-specific named entities (NEs). An example in English is presented in Fig. 2.

These two collections pose several important challenges: multilingualism (both contain three languages: English, French, and German), code-mixed documents (e.g., commentaries where Greek is mixed with the language of the commentator), the granularity of annotations, and the richness of texts characterized by a high density of NEs. The two datasets provide different document metadata at different granularities (e.g., language, document type, original source, date) and have different entity tag sets built according to different annotation guidelines. Table 1 presents statistics on the number and type of entities in each dataset, divided into training, development, and test sets.

Table 1. Overview of the hipe-2020 and ajmc datasets. LOC = Location, ORG = Organization, PERS = Person, PROD = Product, TIME = Time, WORK = human work, OBJECT = physical object, and SCOPE = specific portion of work.

4 Temporal Knowledge-based Contexts for Named Entity Recognition

The OCR output contains errors that produce noisy text and complications similar to those studied by [30]. It has long been observed that adapting NER systems to deal with OCR noise is more appropriate than adapting NER corpora [11]. Furthermore, [22] showed that applying post-OCR correction algorithms before running NER systems does not often have a positive impact on NER results, since post-OCR correction may degrade clean words while correcting noisy ones. To deal with OCR errors, we introduce external, grammatically correct contexts into the NER system, which has been shown to have a positive impact on entity identification performance despite these challenges [48]. Moreover, including such contexts while taking temporality into consideration could further improve the detection of time-sensitive entities. Thus, we propose several settings for including additional context based on Wikidata5mFootnote 6 [46], a knowledge graph with five million WikidataFootnote 7 entities covering the general domain (e.g., celebrities, events, concepts, things), each aligned to a description that corresponds to the first paragraph of the matching Wikipedia page.

4.1 Temporal Information Integration

A TKG contains time information and facts associated with an entity, providing information about spontaneous changes or smooth temporal transformations of the entity while also describing its relations with other entities. We add temporality to Wikidata5m by incorporating the TKG created by [25] and tuned by [18]Footnote 8. This TKG contains over 11 thousand entities, 150 thousand facts, and a temporal scope between the years 508 and 2017. For a given entity, it provides a set of time-related facts describing the interactions of the entity over time. It is thus necessary to combine these facts into a single element through an aggregation operator over their temporal components.

We transform the temporal information of every fact of an entity in order to combine it into a single piece of temporal information. Let e be an entity described by the facts:

$$\begin{aligned} \{F_e\}_{i=1}^n=\{(e, r_{1}, e_{1}, t_{1}), (e, r_{2}, e_{2}, t_{2}), \ldots , (e, r_{i}, e_{i}, t_{i}), \ldots , (e, r_{n}, e_{n}, t_{n})\}, \end{aligned}$$

where a fact \((e, r_{i}, e_{i}, t_{i})\) is composed of two entities e and \(e_{i}\) connected by the relation \(r_{i}\), together with the timestamp \(t_{i}\). A timestamp is a discrete point in time, which corresponds to a year in this work. The aggregation operator is the function \(AGG(F_e) \rightarrow t_e\) that takes as input the time information from \(F_e\) and outputs the time information associated with e. Several aggregation operators are possible; natural options are the mean, median, minimum, and maximum operations. The minimum of a set of facts is defined as the oldest fact, and the maximum as the most recent one. If an entity is associated with four facts spanning the years 1891, 1997, 2006, and 2011, the minimum aggregation operator keeps the oldest, resulting in the year 1891 as the time information of the entity. Given that our datasets correspond to documents between 19C and 20C, the minimum operation is more likely to create an appropriate temporal context for the entities; it is therefore a convenient choice to highlight entities matching the corresponding time period by emphasizing older facts. At the end of the aggregation operation, 8,176 entities of Wikidata5m are associated with a year between 508 and 2001, filtering out most of the facts occurring during 21C.
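For illustration, the following Python sketch applies such an aggregation operator to the temporal facts of an entity. The entity identifiers and facts are hypothetical placeholders; only the choice of the minimum operator reflects the setting used in this work.

```python
from statistics import mean, median

# A fact is a tuple (head_entity, relation, tail_entity, year); the
# aggregation operator AGG reduces all years attached to an entity's
# facts to a single timestamp t_e. The facts below are hypothetical
# placeholders, not actual Wikidata5m content.
facts_of_e = [
    ("Q_e", "member_of", "Q1", 1891),
    ("Q_e", "award_received", "Q2", 1997),
    ("Q_e", "position_held", "Q3", 2006),
    ("Q_e", "spouse", "Q4", 2011),
]

AGGREGATORS = {
    "min": min,        # oldest fact, favoured for 19C-20C collections
    "max": max,        # most recent fact
    "mean": lambda years: round(mean(years)),
    "median": lambda years: round(median(years)),
}

def aggregate(facts, op="min"):
    """Return the single year t_e associated with an entity, or None
    if the entity has no temporal facts in the TKG."""
    years = [t for (_, _, _, t) in facts if t is not None]
    if not years:
        return None
    return AGGREGATORS[op](years)

print(aggregate(facts_of_e, "min"))   # 1891
```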

4.2 Context Retrieval

Our knowledge base system relies on a local ElasticSearchFootnote 9 instance and performs multilingual semantic similarity matching, which facilitates multilingual querying and is achieved with dense vector field indexes. Thus, given a query vector, a k-nearest neighbor search API retrieves the k closest vectors and returns the corresponding documents as search hits. For each Wikidata5m entity, we create an ElasticSearch entry including an identifier field, a description field, and a description embedding field, which we obtain with a pre-trained multilingual Sentence-BERT model [37, 38]. We build one index on the entity identifier and a dense vector index on the description embedding. We propose two different settings for context retrieval (a sketch of the indexing and retrieval pipeline follows the list below):

  • non-temporal: This setting uses no temporal information. Given an input sentence, we first obtain its dense vector representation with the same Sentence-BERT model used during the indexing phase. Then, we query the knowledge base to retrieve the top-k semantically similar entities through a k-nearest neighbors (k-NN) cosine similarity search over the description embedding index. The context C is finally composed of k entity descriptions.

  • temporal-\(\delta \): This setting integrates the temporal information. For each semantically similar entity retrieved as in non-temporal, we apply a filtering operation to keep or discard the entity as part of the context. Given the year \(t_{input}\) provided by the input sentence’s metadata, the entity is kept if its associated year \(t_e\) lies inside the interval \( t_{input}-\delta \le t_e \le t_{input}+\delta \), where \(\delta \) is the year interval threshold; otherwise, it is rejected. As a result of AGG, \(t_e\) is the oldest year in the set of facts of entity e in the TKG. If no \(t_e\) exists, e is also kept. This operation is repeated until \(|C| = k\).
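The sketch below illustrates how such an indexing and retrieval pipeline could be assembled, assuming an Elasticsearch 8.x instance with a dense vector kNN index and a multilingual Sentence-BERT encoder. The index name, field names, model checkpoint, and the year_of helper are illustrative choices, not the exact implementation.

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

# Assumptions: a local Elasticsearch 8.x instance and a multilingual
# Sentence-BERT checkpoint; names below are illustrative.
es = Elasticsearch("http://localhost:9200")
sbert = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def build_index(index="wikidata5m"):
    """One entry per Wikidata5m entity: identifier, description, and
    the description embedding stored in a dense vector kNN index."""
    es.indices.create(index=index, mappings={"properties": {
        "entity_id": {"type": "keyword"},
        "description": {"type": "text"},
        "embedding": {"type": "dense_vector", "dims": 768,
                      "index": True, "similarity": "cosine"},
    }})

def retrieve_context(sentence, t_input=None, delta=None, k=10,
                     year_of=lambda eid: None, index="wikidata5m"):
    """Return k entity descriptions for an input sentence.
    non-temporal: t_input is None, every hit is kept.
    temporal-delta: keep a hit only if its aggregated year t_e lies in
    [t_input - delta, t_input + delta] or if it has no t_e at all;
    `year_of` maps an entity id to t_e (None if no temporal facts)."""
    query_vec = sbert.encode(sentence).tolist()
    # Over-fetch candidates once for simplicity; the actual pipeline
    # repeats the query until |C| = k.
    hits = es.search(index=index, knn={
        "field": "embedding", "query_vector": query_vec,
        "k": 10 * k, "num_candidates": 50 * k,
    })["hits"]["hits"]
    context = []
    for hit in hits:
        t_e = year_of(hit["_source"]["entity_id"])
        if t_input is None or t_e is None or abs(t_e - t_input) <= delta:
            context.append(hit["_source"]["description"])
        if len(context) == k:
            break
    return context
```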

4.3 Named Entity Recognition Architecture

Base Model. Our model consists of a hierarchical, multitask learning approach with a fine-tuned encoder based on BERT. The model includes an encoder with two Transformer [45] layers with adapter modules [21, 34] on top of the BERT pre-trained model. The adapters are added to each Transformer layer after the projection that follows multi-headed attention; they adapt not only to the task but also to the noisy input, which has been shown to increase NER performance under such special conditions [4]. Finally, the prediction layer consists of a conditional random field (CRF) layer.

In detail, let \(\{x_i\}^l_{i=1}\) be a token input sequence consisting of l words, denoted as \(\{x_i\}^l_{i=1} = \{x_1, x_2, \ldots , x_i, \ldots , x_l\}\), where \(x_i\) refers to the i-th token in the sequence of length l. We first apply a pre-trained language model as encoder for further fine-tuning. The output is \(\{h_i\}^l_{i=1}, h_{[CLS]} = encoder(\{x_i\}^l_{i=1})\), where \(\{h_i\}^l_{i=1} = [h_1, h_2, \ldots , h_i, \ldots , h_l]\) is the representation of each i-th position in the token sequence x and \(h_{[CLS]}\) is the final hidden state vector of [CLS], used as the representation of the whole sequence x. From now on, we refer to the token input sequence of l words as the Token Representation, \(TokRep = \{x_i\}^l_{i=1}\). The additional Transformer encoder contains a number of Transformer layers that take as input the matrix \(H=\{h_i\}^l_{i=1} \in R^{l\times d}\), where d is the input dimension (the encoder output dimension). A Transformer layer includes multi-head self-attention: \(Q^{(h)}, K^{(h)}, V^{(h)} = HW^{(h)}_q , HW^{(h)}_k , HW^{(h)}_v\) and \(MultiHead(H) = [Head^{(1)}, \ldots , Head^{(n)}]W_O\)Footnote 10, where n is the number of heads and the superscript h represents the head index. \(Q_t\) is the query vector of the t-th token and \(K_j\) is the key vector of the j-th token that the t-th token attends to; the attention softmax is taken along the last dimension. MultiHead(H) is the concatenation along the last dimension, of size \(R^{l \times d}\), where \(d_k\) is the scaling factor such that \(d_k\times n = d\). \(W_O\) is a learnable parameter of size \(R^{d \times d}\).

Fig. 3. NER model architecture with temporal-aware contexts (context jokers).

By combining the position-wise feed-forward sub-layer and multi-head attention, we obtain a feed-forward layer \(FFN(f(H)) = max(0, f(H) W_1) W_2\), where \(W_1 \in R^{d \times d_{FF}}\) and \(W_2 \in R^{d_{FF} \times d}\) are trained projection matrices, max is the ReLU activation, and \(d_{FF}\) is a hyperparameter. The task adapter is applied at this level on TokRep at each layer and consists of a down-projection \(D \in R^{h \times d}\), where h is the hidden size of the Transformer model and d is the dimension of the adapter, followed by a ReLU activation and an up-projection \(U \in R^{d \times h}\).
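A minimal PyTorch sketch of one such adapter-augmented Transformer layer is given below. The module names, default dimensions, and the use of layer normalization are our own illustrative choices under the description above, not the exact implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection D, ReLU, up-projection U,
    plus a residual connection (cf. the description above)."""
    def __init__(self, hidden_size=768, adapter_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, adapter_dim)  # D in R^{h x d}
        self.up = nn.Linear(adapter_dim, hidden_size)    # U in R^{d x h}

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

class AdapterTransformerLayer(nn.Module):
    """One additional Transformer layer with an adapter inserted after
    the attention projection and after the feed-forward sub-layer.
    Default dimensions are illustrative, not the paper's exact sizes."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, adapter_dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.adapter_attn = Adapter(d_model, adapter_dim)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.adapter_ffn = Adapter(d_model, adapter_dim)
        # Layer normalization is standard in Transformer layers and is
        # assumed here; the text does not detail its placement.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, H):                         # H: (batch, l, d)
        A, _ = self.attn(H, H, H)                 # MultiHead(H)
        H = self.norm1(H + self.adapter_attn(A))  # adapter after attention
        F = self.ffn(H)                           # FFN = max(0, H W1) W2
        return self.norm2(H + self.adapter_ffn(F))
```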

Context Jokers. To include the additional contexts generated as explained in Sect. 4, we introduce the context jokers. Each additional context is passed through the pre-trained encoderFootnote 11, generating a JokerTokRep which is then mean-pooled along the sequence axis. We call these representations context jokers. We see them as wild cards unobtrusively inserted into the representation of the current sentence in order to improve the recognition of fine-grained entities. However, we also acknowledge that these jokers can affect the results in ways that are not immediately apparent and can be detrimental to the performance of a NER system. Figure 3 illustrates the described NER architecture.
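The following sketch illustrates, assuming a HuggingFace BERT encoder, how context jokers could be computed by mean-pooling each encoded context and attaching the resulting vectors to the sentence representation. The checkpoint, function names, and concatenation point are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; the text only states that a pre-trained
# encoder is used to encode the additional contexts (Footnote 11).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
context_encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

@torch.no_grad()
def context_jokers(contexts):
    """Encode each retrieved context and mean-pool it along the
    sequence axis, yielding one joker vector per context."""
    batch = tokenizer(contexts, padding=True, truncation=True,
                      return_tensors="pt")
    hidden = context_encoder(**batch).last_hidden_state   # (k, seq, h)
    mask = batch["attention_mask"].unsqueeze(-1)           # ignore padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (k, h)

def insert_jokers(sentence_hidden, contexts):
    """Attach the joker vectors to the sentence token representations
    before the additional Transformer layers (illustrative placement)."""
    jokers = context_jokers(contexts)                       # (k, h)
    return torch.cat([sentence_hidden, jokers], dim=0)      # (l + k, h)
```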

5 Experimental Setup

Our experimental setup consists of a baseline model and four configurations with different levels of knowledge-based contexts:

  • no-context: our model as described in Sect. 4.3. In this baseline configuration, no context is added to the input sentence representations.

  • non-temporal: contexts are generated with the first setting of context retrieval with no temporal information and integrated into the model through context jokers.

  • temporal-(50|25|10): contexts are generated with the second setting of context retrieval with \(\delta \in \{50,25,10\}\) (where \(\delta \) is the time span or year interval threshold) and integrated into the model through context jokers.

Hyperparameters. In order to have a uniform experimental setting, we chose a cased multilingual BERT pre-trained modelFootnote 12. We denote the number of layers (i.e., adapter-based Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. BERT has L=12, H=768, and A=12. We added two layers with H=128 and A=12, and the adapters have size \(128\times 12\). The adapters are trained on the task during training. For all context-retrieval configurations, the context size |C| of an input sentence was set to \(k=10\). For indexing the documents in ElasticSearch, we utilized the multilingual pre-trained Sentence-BERT modelFootnote 13.
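For reference, these settings can be summarized as follows; the dictionary structure and the checkpoint name are our own illustrative choices, while the numeric values are those stated above.

```python
# Summary of the hyperparameter settings described above.
CONFIG = {
    "encoder": "bert-base-multilingual-cased",   # assumed checkpoint
    "encoder_layers_L": 12, "hidden_size_H": 768, "attention_heads_A": 12,
    "extra_layers": 2,                  # adapter-based Transformer blocks
    "extra_hidden_size_H": 128, "extra_attention_heads_A": 12,
    "adapter_size": (128, 12),          # trained on the NER task
    "context_size_k": 10,               # |C| for every context setting
    "retrieval_encoder": "multilingual Sentence-BERT (Footnote 13)",
}
```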

Evaluation. The evaluation is performed over coarse-grained NER in terms of precision (P), recall (R), and F-measure (F1) at the micro level [12, 28] (i.e., considering all true positives, false positives, true negatives, and false negatives over all samples) in a strict (exact boundary matching) and a fuzzy boundary matching settingFootnote 14. Coarse-grained NER refers to the identification and categorization of entity mentions according to the high-level entity types listed in Table 1. We refer to these metrics as coarse-strict (CS) and coarse-fuzzy (CF).
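To make the two matching regimes concrete, the simplified sketch below computes micro precision, recall, and F1 under strict and fuzzy boundary matching; it is an illustration only, not the official HIPE scorer.

```python
def micro_prf(gold, pred, fuzzy=False):
    """gold/pred: lists of (start, end, type) entity spans per document.
    Strict: exact boundaries and type must match. Fuzzy: overlapping
    boundaries with the same type count as a match."""
    def match(g, p):
        if g[2] != p[2]:
            return False
        if fuzzy:
            return g[0] < p[1] and p[0] < g[1]   # spans overlap
        return g[0] == p[0] and g[1] == p[1]     # exact boundaries

    tp = 0
    for g_doc, p_doc in zip(gold, pred):
        unused = list(p_doc)
        for g in g_doc:
            hit = next((p for p in unused if match(g, p)), None)
            if hit is not None:
                tp += 1
                unused.remove(hit)
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one document, one boundary error (counted only in fuzzy mode).
gold = [[(0, 2, "PERS"), (5, 7, "LOC")]]
pred = [[(0, 2, "PERS"), (5, 6, "LOC")]]
print(micro_prf(gold, pred, fuzzy=False))  # (0.5, 0.5, 0.5)
print(micro_prf(gold, pred, fuzzy=True))   # (1.0, 1.0, 1.0)
```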

5.1 Results

Table 2. Results on French, German and English, for the hipe-2020 and ajmc datasets.

Table 2 presents our results for all three languages and both datasets (best results in bold). Models with additional knowledge-based context jokers bring an improvement over the base model with no added contexts. Furthermore, including temporal information outperforms non-temporal contexts. ajmc scores are higher than hipe-2020 scores regardless of the language and contexts. We explain this behavior by the low diversity of some entity types in the ajmc dataset. For example, the ten most frequent entities of the “person” type represent \(55\%\), \(51.5\%\), and \(62.5\%\) of the train, development, and test sets respectively. There is also an \(80\%\) top-10 intersection between the train and test sets, meaning that eight of the ten most frequent entities are shared between them. English hipe-2020 presents the lowest scores compared to French and German, independently of the contexts. We attribute this drop in performance to the utilization of the French and German sets during training, given the absence of a specific English training set.

The last two rows of Table 2 show the results of our best system [3] during the HIPE-2022 evaluation campaign [15]. This system is similar to the one described in Sect. 4.3, but it stacks, for each language, a language-specific language model and does not include any temporal-aware knowledge. The additional language model explains its slightly higher resultsFootnote 15. For half of the datasets, this system outperforms the temporal-aware configurations (underlined values), but at the cost of being language dependent, a drawback that mainly impacts the English hipe-2020 dataset, where no training data is available.

5.2 Impact of Time Intervals

ajmc contains 19th-century commentaries on Greek texts [41] and was created in the context of the Ajax MultiCommentary projectFootnote 16; thus, the French, German, and English datasets all concern an Ancient Greek tragedy by Sophocles, the Ajax, from the early medieval periodFootnote 17. The German ajmc contains commentaries from two years (1853 and 1894), the English ajmc also from two years (1881 and 1896), and the French ajmc from just one year (1886). Due to the size of the collection, hipe-2020 covers a larger range of years: French articles were collected from 1798 to 2018, German articles from 1798 to 1948, and English articles from 1790 to 1960. We therefore looked at the difference between the contexts retrieved by the non-temporal and the temporal configurations. Table 3 summarizes these differences for the train and test sets and displays the number of contexts that were filtered out and replaced with respect to non-temporal for each time span, i.e., \(\delta \in \{50,25,10\}\). Overall, the smaller the interval of years, the greater the number of replaced contexts. The number of replaced contexts is smaller for ajmc than for hipe-2020, which is explained by the restrained year span and the lack of entity diversity during these periods. When comparing with the results from Table 2, we can infer that, in general, it is beneficial to use shorter time intervals such as \(\delta = 10\): temporal-10 presents higher F1 scores for ajmc in almost all cases. However, this varies with the language and the year distribution of the dataset.

Table 3. Number of replaced contexts per time span.

5.3 Impact of Digitization Errors

The ajmc commentaries on classical Greek literature present the typical difficulties of historical OCR. With complex layouts, often containing multiple columns and rows of text, the digitization quality of commentaries can severely impact NER and other downstream tasks such as entity linking. About 10% of NEs are affected by OCR errors in the English and German ajmc datasets, and 27.5% of NEs are contaminated in the French corpus. The models with additional context, especially the temporal approaches, help recognize both contaminated and clean NEs. The contribution is more significant for NEs with digitization errors: it manifests as a larger improvement in the recognition of contaminated NEs compared to clean ones, despite the dominance of the latter in the data. In the German corpus, for example, the gain is about 14 percentage points using temporal-50 compared to the baseline, versus only 2 percentage points on the clean NEs. Additionally, three-quarters of the NEs with a character error rate of 67% are correctly recognized, whereas the baseline recognized only one-quarter of them. Finally, all models fail on NEs whose error rates exceed 70%.

5.4 Limitations

The system ideally requires metadata about the year in which the documents were written, or at least a period interval. Otherwise, it is necessary to use other systems for predicting the year of publication [36]; however, the errors of such systems would be propagated and may impact the NER results.

6 Conclusions & Future Work

In this paper, we explore a strategy to inject temporal information into the named entity recognition task on historical collections. In particular, we rely on semantically relevant contexts retrieved by exploiting the time information provided in the collections’ metadata and temporal knowledge graphs. Our proposed models include these contexts as mean-pooled representations in a Transformer-based model. We observed several trends regarding the importance of temporality for historical newspapers and classical commentaries, depending on the time intervals and the digitization error rate. First, our results show that a short time span works better for collections with restrained entity diversity and narrow year intervals, while a longer time span benefits wide year intervals. Second, we show that our approach performs well in detecting entities affected by digitization errors, even up to a 67% character error rate. Finally, we remark that the quality of the retrieved contexts depends on the affinity between the historical collection and the knowledge base; in future work, it could thus be interesting to include temporal information by predicting the year spans of a large set of Wikipedia pages to be used as complementary contexts.