1 Introduction

The recent abundance of textual content creates new challenges for those who want to gain insights quickly without reading entire documents. Much of this text is free-form, and extracting information from it requires computational resources capable of understanding natural language. Presenting text through temporal structures can reduce the reader's effort [4, 15]. For example, such structures can define the time period of events in news articles [18, 21], play an important role in communication platforms such as Twitter [1,2,3] or Wikipedia [13], and help contextualize historical texts [14] or legal documents [12]. Advances in these domains are partially due to the existence of temporal taggers, such as Heideltime [19] or SUTime [9]. Timelines appear in this context as a common approach that leverages the detected temporal signals to summarize, in temporal order, the information spread over multiple documents. However, little is known about their use in the scope of single documents [16, 20]. An optimal summary should cover all the important temporal aspects of a text while disregarding unimportant or irrelevant dates. Manually building these timelines, however, is laborious and time-consuming, and is impractical for average users or professionals trying to make sense of an increasing volume of textual data. This slows down the process of text analytics and data understanding. In this paper, we present Time-Matters, a novel system that gives users an automatic overview of the most important time periods and associated text stories in a short amount of time, without having to read text-heavy documents. This can be useful in several scenarios and domains and fits within the recent trend of automatically generating narratives from texts [8].
For instance, it may be of interest to media outlets [17] that want to tell stories and reach new audiences through alternative and appealing formats, and also to those interested in quickly extracting temporal information from long documents such as Wikipedia articles.

To accomplish this objective, we adapted a previously introduced version of Time-Matters [5], which worked over queries and multiple documents, to single texts. In particular, we aim to estimate the importance of the temporal expressions detected in a text and hence disregard the non-relevant ones. The goal is not only to provide a temporal annotation of the text with the corresponding scores given by the Time-Matters algorithm, but also to let users interact with the system through a temporal storyline component that shows the most important stories of a text. We do this in an interactive fashion that includes a timeline and graphical elements likely related to parts of the story. Further possibilities include exploring the most relevant stories of the text through temporal clustering. Another key aspect of our approach is that it is unsupervised and domain- and corpus-independent: it does not require any training stage and builds upon local statistical features extracted from single documents. Hence, it can readily be applied to any text. The core of Time-Matters is also mostly language-independent. While it relies on Heideltime [19] to detect temporal expressions, it can also use a simple rule-based approach (focused on year detection), which, while not as effective as Heideltime, may be a good solution when performance or language support is an issue. As a contribution to the research community, we make available an online demo [http://time-matters.inesctec.pt], an API [http://time-matters.inesctec.pt/api], a Python package [https://github.com/LIAAD/Time-Matters] and a Docker image [https://hub.docker.com/r/liaad/time-matters] of Time-Matters. In addition, we make public a Python package wrapper for Heideltime [https://github.com/JMendes1995/py_heideltime], which aims to facilitate the use of this well-known temporal tagger.
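To give a rough idea of what a rule-based approach focused on year detection might look like, the following is a minimal sketch. It is not the rule set used by Time-Matters: the pattern, the accepted year range (1000 to 2099) and the function name `find_years` are assumptions made purely for illustration.

```python
import re

# Hypothetical rule: a four-digit year between 1000 and 2099 that is not
# part of a longer run of digits. The actual Time-Matters rules may differ.
YEAR_PATTERN = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")


def find_years(text):
    """Return (year, start_offset) pairs for every year-like token in text."""
    return [(m.group(1), m.start()) for m in YEAR_PATTERN.finditer(text)]


sample = "The 2010 Haiti earthquake was compared with the one recorded in 1564."
print(find_years(sample))
```

A rule of this kind is cheap and needs no language-specific resources, but it cannot resolve richer expressions (for example, full dates or relative expressions), which is where a tagger such as Heideltime is more effective.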

2 Time-Matters Algorithm

Our assumption is that the relevance of a candidate date \({d}_{j}\) may be determined with regard to the relevant terms \({W}_{j}^{*}\) that it co-occurs with in a given context (defined as a window of n terms in a sentence or the sentence itself). That is, the more a given candidate date is correlated with the most relevant keywords of a text \({t}_{i}\), the more relevant the candidate date is for that text. To model this temporal relevance, we rely on the Generic Temporal Similarity measure (GTE) [5], which uses co-occurrences of keywords and temporal expressions to identify relevant dates within a text. In this work, relevant keyphrases and temporal expressions are detected by the YAKE! keyword extractor [6, 7] and the Heideltime temporal tagger [19], respectively. GTE is formalized in Eq. 1 and ranges between 0 (irrelevant) and 1 (relevant), where IS is the InfoSimba similarity measure [10].

$$\mathrm{GTE}\left({t}_{i},{d}_{j}\right)=\mathrm{median}\left(\mathrm{IS}\left({\mathrm{w}}_{\ell,j},{d}_{j}\right)\right),\quad {\mathrm{w}}_{\ell,j}\in {W}_{j}^{*} \tag{1}$$
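The aggregation step in Eq. 1 can be sketched as follows. This is a minimal illustration that assumes the IS scores between the candidate date and each of its relevant keywords have already been computed; InfoSimba itself is defined in [10] and is not reproduced here. The function name `gte`, the dictionary layout and the sample values are hypothetical.

```python
from statistics import median


def gte(is_scores):
    """Median of IS(w, d_j) over the relevant keywords w in W_j*.

    is_scores maps each relevant keyword that co-occurs with the candidate
    date to a precomputed IS similarity in [0, 1]. Returning 0.0 for a date
    with no co-occurring keywords is an assumption of this sketch.
    """
    if not is_scores:
        return 0.0
    return median(is_scores.values())


# Made-up similarity values, for illustration only.
scores_for_date = {"earthquake": 0.9, "haiti": 0.8, "anniversary": 0.4}
print(gte(scores_for_date))
```

Using the median rather than the mean makes the score less sensitive to a single keyword with an unusually high or low similarity to the date.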

A detailed description of the underlying scientific approach and of the evaluation methodology for queries and multiple documents can be found in Campos et al. [5]. We also recommend that readers consult our wiki documentation [https://github.com/LIAAD/Time-Matters/wiki] for an in-depth understanding of the single-document version explored in this demo.

3 Time-Matters Demonstration

We demonstrate our approach using an arbitrary text related to the first anniversary of the Haiti earthquake, held on January 12, 2011. Input can be given as text on the homepage or as a URL, in which case we use the well-known Newspaper3k library [https://newspaper.readthedocs.io] to extract the contents. The resulting interface is divided into five major components: “Annotated Text”, “Storyline”, “Temporal Clustering”, “Timeline” and “Scores”. Due to space constraints, in this paper we focus on the first two, “Annotated Text” and “Storyline”.

Annotated Text.

Figure 1 shows the “Annotated Text” component. At the top, we can observe the time spent to obtain the results, the number of relevant annotated temporal expression instances and the text language. Time performance is highly dependent on the Heideltime component, as computing GTE scores is a quick process. Each date is tagged using a 5-color Likert relevance scale, from the least relevant dates (bold red) to the most relevant ones (bold green). To get a sense of the relevance of the dates, users can also hover over a given temporal expression. By default, only relevant temporal expressions, that is, those with GTE scores equal to or above 0.35 (according to the experiments conducted in [5]), are shown to the user. Scores close to 1 are considered highly relevant in the particular part of the text being analyzed. Identical date instances in different sentences can also receive different scores (this option can be explored in the advanced options section on the homepage). In addition to relevant dates, users can also ask for the least relevant ones (scores < 0.35), as exemplified in Fig. 1 by the temporal expression “the afternoon of February 11, 1975” (marked in bold red), which is shown with a score of 0. This gives users the opportunity to understand the effectiveness of the Time-Matters algorithm in filtering out non-relevant dates initially marked by the temporal tagger. One can also observe, marked in bold, the relevant keyphrases that co-occur with the date and contribute most to the results of Time-Matters. By default, n-grams are set to 1, meaning that keywords consist of a single token, though other values can be defined in the advanced options.
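The default filtering and the 5-level relevance scale described above can be illustrated with a short sketch. The 0.35 threshold comes from the text; the equal-width binning of scores into five levels, as well as the function names, are assumptions of this example and may not match how the demo assigns colors.

```python
RELEVANCE_THRESHOLD = 0.35  # default cut-off reported in the text


def is_relevant(score, threshold=RELEVANCE_THRESHOLD):
    """A date is shown by default when its GTE score is >= threshold."""
    return score >= threshold


def relevance_level(score, levels=5):
    """Map a score in [0, 1] to a level from 1 (least) to `levels` (most).

    Equal-width bins are an assumption made for illustration.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    return min(int(score * levels) + 1, levels)


# Scores quoted in the text for two temporal expressions.
dates = {"1564": 0.799, "the afternoon of February 11, 1975": 0.0}
for expression, score in dates.items():
    print(expression, is_relevant(score), relevance_level(score))
```

Under these assumptions, a date scored 0.799 passes the default filter, while a date scored 0 is hidden unless the user explicitly asks for the least relevant dates.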

Fig. 1. “Annotated Text” interface. (Color figure online)

Storyline Visualization.

The storyline interface (see Fig. 2) explores the different stories of a text through a temporal lens. The component at the top highlights the relevant date (“1564”), its score (“0.799”), the sentence where the date occurs and a summary of that particular part of the story (“great earthquake mentioned”) given by YAKE! [6]. The story is also automatically illustrated with images, for which we rely on the image search API v1 [https://github.com/arquivo/pwa-technologies/wiki] of the Portuguese web archive Arquivo.pt [11]. While this API can return results for any language, it naturally works better for its native language, Portuguese. Users can then navigate between the different time periods either by clicking on the right row (labelled in this example as “Recorded in Haiti, 2010”) or by using the timeline component at the bottom, which in itself gives a temporal overview of the story.

Fig. 2. “Storyline” interface.

In this paper, we propose a simple yet effective approach for summarizing a text from a temporal perspective, highlighting its most important temporal aspects. As future research, we plan to investigate more elaborate solutions that study the correlation between the detected relevant dates and the relevant events found in the surroundings of each date. This could improve not only the story description but also the retrieval of images.