1 Introduction

The digitization of newspapers has greatly improved accessibility and changed the nature of historical research by enabling easier data access and analysis at scale through multilingual semantic data enrichment [6, 7, 10, 42]. Better document analysis and semantic enrichment, e.g., named entity recognition (NER), relation extraction (RE), and event extraction (EE), substantially improve the quality of the newspaper data that libraries offer to their users [12,13,14,15]. Preserving the historical memory of entities and events from historical documents and making them accessible to a larger audience, not limited to humanities scholars and experts, could lead to a better organization of our historical knowledge [2, 38, 44]. In this view, detecting events in historical documents can contribute to the construction of more nuanced knowledge bases that enable further data exploration and help shape the research of humanities scholars and historians [38]. Extracting event information from text documents into a structured knowledge base or ontology also enables several downstream technologies. For example, text summarization might benefit from the selection of one or more events to yield the best summary with the least extraneous information [18, 30]. Question answering systems can use the detected events to answer queries about types of events (wars, disease outbreaks, political movements, climate catastrophes, terrorist attacks, etc.) [29, 43].

Benchmarking therefore plays an important role in enabling the development and evaluation of event detection in historical documents. However, most current event detection datasets (e.g., MUC [19], ACE 2005 [48]) are not suitable for this purpose for several reasons, including the high cost of manual annotation of historical texts and the difficulty of defining an event [45]. Moreover, several studies have explored how natural language processing (NLP) tasks, such as named entity recognition (NER) [3, 20, 34, 41] and entity linking (EL) [35, 46], can be impacted by the digitization process. To the best of our knowledge, however, no previous work has carried out this type of analysis for the event detection task, mostly because no annotated data is available.

Thus, in this paper, we develop a method to automatically discover a set of distinct, salient events from historical newspapers. We do so by leveraging the semantic similarity of contextual representations to detect event triggers in an unsupervised manner that can be easily adapted to other languages. The detected events can then be used to speed up the manual annotation or validation of historical corpora.

2 Event Detection in Modern and Historical Datasets

Event Detection in Modern Datasets. Prior work in event detection can be divided into pattern-based systems [39, 40, 50], machine learning systems based on engineered features (i.e., feature-based) [8, 21, 24, 27], and neural-based approaches [9, 17, 36, 37]. There has also been considerable interest in approaching this task with external resource-based models, which are either feature-based [28, 31] or neural-based [32] and are combined with resources such as FrameNet [1], a linguistic corpus that defines complete semantic frames and frame-to-frame relations, or with event data generation, as in [22, 49, 51]. The approach proposed in [31] combined probabilistic soft logic (PSL) with a neural network and leveraged FrameNet to alleviate the data sparseness problem of event detection, based on the observation that frames in FrameNet are analogous to events. The authors of [33] consider that arguments provide significant clues for this task and adopt a supervised attention mechanism to exploit argument information explicitly for event detection, while also using events from FrameNet as extra training data. The model described in [28] also leverages FrameNet to tackle annotation cost and data scarcity: considering that the ACE 2005 dataset defines limited and specific event schemas, it expresses event information with frames and builds a hierarchy of event schemas that are more fine-grained and have much wider coverage than ACE.

Event Detection in Historical Datasets. When it comes to historical and digitized documents, models rely more on external resources, such as FrameNet and WordNet [16], than on the state-of-the-art event detection approaches used for modern datasets. FrameNet, for instance, has been widely investigated for event detection in historical and digitized documents, mostly due to the lack of annotated data. A project proposed in 2004 [25] enhanced materials drawn from the Franklin D. Roosevelt Library and Digital Archives to provide deeper search and access methods for historians of World War II. The documents were scanned, hand-validated, and enriched with various entities such as person names, dates, locations, and job titles. The work focused on identifying communicative events in the Memorandum of conversation by extracting verbs associated with any of the FrameNet “Communication” frames; each communicative event followed a scheme that assigned the role of communicator to a tagged person or pronoun preceding the verb. Another historical event detection module was proposed for museum collections [10], allowing users to search for exhibits related to particular historical events or actors within time periods and geographic areas, extracted from Dutch historical archives. The authors focused on event detection from manually tagged textual data about the Srebrenica Massacre (July 1995). They specified event triplets and WordNet concepts denoting event actions, participants, and locations or time markers, and identified historical events through the recognition of historical actions. A novel FrameNet-based method was also proposed for the computational analysis of Italian war bulletins from World War I and II [7] that had never been digitized before. The bulletins were annotated with different types of information, such as named entities, events, participants, time, and georeferenced locations. Instances of major event types (e.g., bombing, sinking, battles) were established before applying the FrameNet mapping [25].

3 Data Collection

For the data collection and our experiments, we used the NewsEye collection (footnote 1), which consists of a large selection of European newspapers (1850–1950) in several languages that have been digitized and made available online. The difficulty of detecting events in the NewsEye dataset stems not only from automatic text recognition (ATR) or digitization errors, but also from the lack of annotated data in a multilingual setting.

Thus, we decided to annotate two subsets of documents in two low-resource languages, German and French, and to experiment with a state-of-the-art event detection system in a domain and language adaptation scenario. The documents were collected using the NewsEye platform [26], and annotated by the Digital Humanities groups (native speakers) from the NewsEye consortium, University of Innsbruck (UIBK-ICH), Austria, and the Paul Valéry University Montpellier 3, France. The subjects of the datasets were selected by the annotators, depending on their line of research and interests.

4 Unsupervised Data Annotation

Following the recommendation of Sprugnoli [45], we defined an event consistently with ACE 2005 [48] and chose the event types and subtypes according to its annotation guidelines (footnote 2). We then automatically assigned a frame category to each event type by consulting the English FrameNet database. FrameNet, as indicated in Sect. 2, is a linguistic corpus containing considerable information about lexical and predicate-argument semantics in the form of frames. A frame in FrameNet is defined as a triplet composed of a name, like Execution, a set of Frame Elements (FEs), and a list of Lexical Units (LUs) (footnote 3). An LU is a word or phrase that evokes the corresponding frame, such as executioner or guillotine. FEs indicate a set of semantic roles associated with the frame, such as reason, instrument, or place. Most frames contain a set of exemplars with annotated LUs and FEs.

To link ACE 2005 event subtypes to FrameNet frames, we start by extracting all the verbs of the corpus and grouping them using WordNet [16] synsets. Then, the grouped verbs are matched to FrameNet lexical units (LUs). Finally, we associate the ACE 2005 event subtypes (footnote 4) with FrameNet frames by matching frame names to event subtype names, as in [28]. In summary, ACE 2005 event subtypes are linked indirectly to FrameNet lexical units, which in turn can be seen as event triggers.
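The linking procedure described above can be illustrated with a minimal sketch. It assumes toy dictionaries in place of the real resources: `SYNSETS`, `FRAME_LUS`, and `FRAME_TO_SUBTYPE` are hypothetical stand-ins for WordNet synsets, FrameNet lexical units, and the frame-to-subtype association, and their contents are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical toy data: in practice these would be read from WordNet
# (verb synsets) and FrameNet (lexical units per frame).
SYNSETS = {
    "attack": {"attack", "assail", "assault"},
    "assail": {"attack", "assail", "assault"},
    "die": {"die", "perish"},
    "walk": {"walk", "stroll"},
}
FRAME_LUS = {
    "Attack": {"attack", "assault", "raid"},
    "Death": {"die", "perish"},
}
# Hypothetical association between frame names and ACE 2005 subtypes.
FRAME_TO_SUBTYPE = {"Attack": "Attack", "Death": "Die"}


def link_verbs_to_subtypes(verbs):
    """Map each ACE subtype to the corpus verbs whose synset group
    shares at least one member with the lexical units of a linked frame."""
    triggers = defaultdict(set)
    for verb in verbs:
        group = SYNSETS.get(verb, {verb})  # fall back to the verb itself
        for frame, lexical_units in FRAME_LUS.items():
            if group & lexical_units:
                triggers[FRAME_TO_SUBTYPE[frame]].add(verb)
    return dict(triggers)


links = link_verbs_to_subtypes(["attack", "assail", "die", "walk"])
print(links)
```

In this sketch, "assail" is kept as a trigger for Attack even though it is not itself listed as a lexical unit, because its synset group overlaps with the frame's lexical units. This is the indirect link that the grouping step is meant to provide.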

To create candidate event mentions, we generate a dependency parse tree for each sentence in the dataset (footnote 5). Next, we extract noun phrases (NPs), which can be pronouns, proper nouns, or nouns, potentially bound with other tokens that act as modifiers, e.g., adjectives or other nouns, and which are generally subjects (nsubj) or objects (obj) (complements of prepositions). Finally, we obtain a triplet composed of the tree root, which is generally the verb of the sentence, and its dependents, the nsubj and the obj. A candidate event mention is thus represented by a triplet whose root is commonly a verb that can possibly be mapped to a lexical unit (LU), as we did for the event trigger candidates. In Fig. 1, we present the dependency parse tree of a sentence in French.

Fig. 1. Example of the correspondence between syntactic arguments of the verbs and participants of the event denoted by the verb (Translation: From the same place where I had seen Danton disappear, I saw Robespierre disappear.)

For example, in Fig. 1, there are two triplets, both with “disparaître” (to disappear) as root. The first triplet has “j’” as subject (nsubj) and “Danton” as object (obj). The second triplet has “j’” as subject and “Robespierre” as object. In both triplets, “Danton” and “Robespierre”, besides being objects, are entities of type person.
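The triplet construction can be sketched as follows. This is only an illustration under simplifying assumptions: the parse is given as a hand-written list of tokens for a toy English sentence rather than produced by a parser, the `Token` type and `extract_triplets` function are hypothetical names, and only the head word of each noun phrase is returned, without its modifiers.

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int     # 1-based position in the sentence
    text: str
    head: int    # index of the syntactic head, 0 for the root
    deprel: str  # dependency relation to the head


def extract_triplets(tokens):
    """Return (root, nsubj, obj) triplets for every root token that has
    at least one subject or object among its direct dependents."""
    triplets = []
    for root in (t for t in tokens if t.head == 0):
        children = [t for t in tokens if t.head == root.idx]
        subj = next((t.text for t in children if t.deprel == "nsubj"), None)
        obj = next((t.text for t in children if t.deprel == "obj"), None)
        if subj is not None or obj is not None:
            triplets.append((root.text, subj, obj))
    return triplets


# Hand-written toy parse of "The crowd attacked the prison".
sentence = [
    Token(1, "The", 2, "det"),
    Token(2, "crowd", 3, "nsubj"),
    Token(3, "attacked", 0, "root"),
    Token(4, "the", 5, "det"),
    Token(5, "prison", 3, "obj"),
]
triplets = extract_triplets(sentence)
print(triplets)
```

A sentence such as the one in Fig. 1, where the same verb heads two clauses, would require collecting dependents for every verbal head rather than only for the tree root; the sketch keeps the single-root case for brevity.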

To link candidate event mentions to ACE 2005 event subtypes, we make use of multilingual BERT [11]. To be precise, we use BERT to obtain the contextual representation x of each token in every sentence of the corpus that has at least one candidate event mention; \(X = [x_0, x_1 \ldots x_n]\) where n is the sentence length. Then, from X, we isolate the contextual representation \(x_i\) of the token i that corresponds to the root of the candidate event mention, for example, the first “disparaître” in Fig. 1.

At the same time, we also use BERT to obtain the contextual representation of each ACE 2005 event subtype by processing the FrameNet lexical units (LUs) that were associated with it as event triggers in the first step. Specifically, for a given event subtype, we concatenate all its event triggers, i.e., lexical units, to generate a pseudo-sentence that is processed by BERT. BERT then outputs a contextual embedding of the pseudo-sentence, which represents the event subtype (footnote 6).

Finally, to decide whether a candidate event mention is retained, we compare, through cosine similarity, the contextual embedding of an ACE 2005 subtype with that of the root of the candidate event mention. If the cosine similarity is greater than 0.7, the candidate mention is considered to belong to the analyzed ACE 2005 subtype.

For example, for the Attack event subtype in French, we compared the extracted roots with the following set of lexical units retrieved from FrameNet: attack, assault, strike, ambush, assail, raid, bomb, bombing, raid, infiltrate, hit, fire, small, take up arms, fire, airstrike, bombardment, counter-attack, counter-offensive. After analyzing the results, we observed that two separate sets of event triggers were extracted: (1) known events: foudroyer (strike down), armer (take up arms), attaquer (attack), frapper (strike); (2) unseen events: arracher (snatch), déchiqueter (tear off), étouffer (suffocate), empoigner (grab), trancher (shred).
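The similarity test with the 0.7 threshold can be sketched as below. The three-dimensional vectors are hypothetical placeholders for the BERT embeddings of a candidate root and of the subtype pseudo-sentences. Returning every subtype above the threshold, rather than only the best one, is an assumption of this sketch.

```python
import math

THRESHOLD = 0.7  # similarity threshold stated in the text


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if one is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_subtypes(root_vec, subtype_vecs, threshold=THRESHOLD):
    """Return the subtypes whose embedding has a cosine similarity
    strictly greater than the threshold with the root embedding."""
    return [
        name
        for name, vec in subtype_vecs.items()
        if cosine(root_vec, vec) > threshold
    ]


# Hypothetical toy vectors standing in for contextual embeddings.
subtype_vecs = {"Attack": [1.0, 0.0, 0.0], "Meet": [0.0, 1.0, 0.0]}
root_vec = [0.9, 0.1, 0.0]
matched = assign_subtypes(root_vec, subtype_vecs)
print(matched)
```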

5 Evaluation

As indicated in Sect. 2, there is no annotated data for historical event detection, and its creation can be expensive. Thus, to evaluate the unsupervised annotation (Sect. 4), we rely on an indirect assessment based on a fine-tuned language model.

Specifically, we train an event detection system by fine-tuning multilingual BERT [11] on English ACE 2005, following the work of Boros et al. [5]. The goal is that, in a zero-shot setting (footnote 7), the event detection system will detect a subset of events in the historical corpus (Sect. 3) that intersects with those found by the unsupervised method (Sect. 4). Ideally, the token spans found by the fine-tuned model should match the spans set by the unsupervised annotation, which allows us to compute precision, recall, and F-score.
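As an illustration of this span-level comparison, the following sketch computes precision, recall, and F-score from two sets of spans. It assumes exact span matching and treats the unsupervised annotations as the reference side. Both choices, as well as the `(sentence_id, start, end, subtype)` span format and the toy spans, are assumptions of this sketch rather than details given in the text.

```python
def span_prf(predicted, reference):
    """Precision, recall, and F-score under exact span matching."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical spans: (sentence_id, start, end, subtype).
unsupervised = {(0, 3, 4, "Attack"), (1, 7, 8, "Die"), (2, 1, 2, "Meet")}
fine_tuned = {(0, 3, 4, "Attack"), (1, 7, 8, "Attack")}

p, r, f = span_prf(fine_tuned, unsupervised)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

Here the second predicted span covers the right tokens but carries a different subtype, so it counts as an error under exact matching.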

Also following Boros et al. [5], we explore for our evaluation the fine-tuning of multilingual BERT on English ACE 2005 with entity markers. Entity markers augment the input data with special tokens that encode the entity type. For example, the sentence from Fig. 1 becomes From the same place where I had seen \([PER_{start}]\) Danton \([PER_{end}]\) disappear, I saw \([PER_{start}]\) Robespierre \([PER_{end}]\) disappear. To do this, we first train a NER system based on a hierarchical architecture that includes a stack of Transformer layers [47] on top of a BERT encoder (BERT-n\(\times \)Transformer-CRF). This architecture, described in [4], has proved to be robust against OCR errors. On this collection, the NER system reaches an F-score of 48.32 for German and 72.71 for French. Once the NER system has been trained, we use it to annotate the historical corpus and add the entity markers to the input of the event detection system.
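A minimal sketch of the entity marker augmentation is given below. It assumes that the NER output is available as non-overlapping `(start, end, type)` spans over tokens, with an exclusive end index, and it renders the markers as plain strings such as `[PER_start]`; the function name and this string format are illustrative assumptions.

```python
def add_entity_markers(tokens, entities):
    """Wrap each entity span in start and end marker tokens.

    `entities` holds non-overlapping (start, end, type) spans,
    where `end` is exclusive.
    """
    starts = {start: etype for start, _, etype in entities}
    ends = {end: etype for _, end, etype in entities}
    marked = []
    for i, token in enumerate(tokens):
        if i in starts:
            marked.append(f"[{starts[i]}_start]")
        marked.append(token)
        if i + 1 in ends:
            marked.append(f"[{ends[i + 1]}_end]")
    return marked


tokens = "I saw Robespierre disappear".split()
marked = add_entity_markers(tokens, [(2, 3, "PER")])
print(" ".join(marked))
```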

6 Results and Discussion

We present the results obtained for two case studies defined by researchers from the NewsEye project, and then discuss these results.

International Women’s Day. For this case study, we selected a subset of 207 German articles that mentioned the keywords “Women’s Day” (“Frauentag”) and “International Women’s Day” (“Internationaler Frauentag”), published between 1911 and 1933, in order to analyze the events organized on or around International Women’s Day. For this subset, we selected events regarding gatherings or movements. These are captured by the Conflict event type, with the Demonstrate and Attack subtypes, and by the Contact event type, with the Meet subtype.

To clarify the meaning of these event types, we detail some of them here. Demonstrate and Attack are subtypes of the Conflict event type. An Attack event is defined as a violent physical act causing harm or damage. For example, in Um diesen ersehnten Zustand herbeizuführen, entsenden wir unseren Schwestern in der ganzen Welt unsere Grüße und rufen sie auf, beim internationalen Frauentag mit uns gemeinsam gegen die Fortdauer des Krieges zu demonstrieren (footnote 8), the triggers are demonstrieren for Demonstrate and Krieges for Attack. Thus, this sentence contains two mentions of different event types.

Table 1. Evaluation of NewsEye German event detection.

We can observe from Table 1 that performance drops significantly for the model that does not use entity markers, while their presence increases the scores.

Death Penalty Abolition. In the 1900s, there were regular debates in France regarding the abolition of the death penalty. For this case study, we selected a subset of 207 French articles that mentioned “guillotine” (same in French) and “death penalty” (“peine de mort”), published between 1900 and 1944 in the following newspapers: Le Matin, L’Œuvre, and Le Gaulois. We selected events regarding life, through the Die event subtype, conflictual events (Conflict with the Attack subtype), and criminal justice events (Justice with the Execute subtype). Due to the digitization and article separation processes, some articles contained very few tokens; we therefore removed those with fewer than ten tokens (footnote 9).

The results, summarized in Table 2, reveal the capacity of our approach for extracting events while establishing a strong baseline. However, we notice that the scores are rather imbalanced, favoring precision, which could indicate a close similarity between the chosen event types.

Table 2. Evaluation of NewsEye French event detection.

Discussion. It must be stated that the low F-scores (Tables 1 and 2) for certain event subtypes are not unexpected. First, we compare two approaches indirectly: one unsupervised, and one supervised but trained on documents of a different type and language. Second, the results of the NER system are not perfect, especially for German, which could affect the performance of the models that rely on entity markers. Nonetheless, the results presented in Tables 1 and 2 show that there is an intersection between the unsupervised annotations and those predicted by the fine-tuned models. This suggests that the unsupervised approach presented here could be useful for pre-annotating historical documents before they are reviewed by a human. This could accelerate the creation of corpora annotated with events and, in the future, help automate the detection of events through machine learning. However, the unsupervised method clearly has some limitations. We still need to evaluate how semantics may have affected the matching of verbs to FrameNet lexical units, and whether such matching mistakes explain why certain events were not detected in French, as the low recall in Table 2 shows.

7 Conclusions

In this paper, we proposed an unsupervised method for detecting events in historical newspapers that relies on available resources. We obtained promising preliminary results in event detection on articles about International Women’s Day, in German, and the abolition of the death penalty, in French. We plan to make the dataset publicly available to enable further research, and we envision subsequent work on an extended list of event types and on the adaptability of the method to other languages.