
1 Introduction

Text summarization is an active research area in NLP [2]. Event timeline generation can be considered a branch of text summarization that focuses on summarizing thematic events [1]. It is closely related to query-focused multi-document summarization [3] and topic detection and tracking [4]. The differences are two-fold: first, it works on event-centric documents; second, it considers not only textual but also temporal information when summarizing. It is also distinct from timeline generation for an individual entity [5].

By thematic event, we refer to a collection of events relevant to a common theme. For example, the thematic event “Vụ con ruồi trong chai Number 1 (The fly in Number 1)”Footnote 1 started with the event “A customer found a fly inside an unopened Number 1 bottle” and continued with further events such as “He then contacted Tan Hiep Phat company to blackmail 1 billion VND”, etc. Our task is to generate a timeline of the most salient events relevant to a thematic event, as in Fig. 1. Following previous work on event timeline generation, we treat an event as a pair of a timestamp and an event description. For instance, (2014/12/03, “A customer found a fly inside an unopened Number 1 bottle”) is an event. We do not use event details such as triggers and participants as in the Message Understanding Conference and Automatic Content Extraction (and its successor, Knowledge Base Population) programs [6,7,8].

When reporting on a thematic event, such as a political scandal or a disaster, different journalists tend to agree on the salient events and story developments. Such agreement is consequently reflected in redundancy across news sources. We aim to leverage this redundancy to detect salient events.

The task has been studied in other languages such as English and French [9]. However, to our knowledge, there has been no study on Vietnamese so far, although there is a large and rapidly increasing volume of event-related newswire content in Vietnamese.

In this paper, we present our approach to generating event timelines from Vietnamese newswire documents. Our contributions are three-fold: firstly, to our knowledge, we are the first to tackle this problem in Vietnamese. Secondly, as redundancy across sources is crucial for judging the importance of information related to an event, we have gathered a large-scale corpus of Vietnamese news from 145 popular online news agencies in Vietnam covering the period from 2007 to 2017. As illustrated by the experimental results, such a corpus is adequate for generating timelines of various events. Last but not least, we have manually created a dataset of 17 timelines for evaluation. The dataset covers events happening both in Vietnam and worldwide. Experimental results on the dataset show that the n-gram-based models are on par with the word-based model.

The rest of the paper is organized as follows: Sect. 2 briefly introduces related work; Sect. 3 describes our proposed method; Sect. 4 presents evaluation results; Sect. 5 concludes the paper.

Fig. 1. A timeline of “The fly in Number 1” written by journalists.

2 Related Work

Most work on event timeline generation relies on redundancy in the data. [10] proposed a framework for extensive temporal analysis and used redundant data and machine learning to detect salient dates of thematic events. Following this direction, [1] presented a Maximal Marginal Relevance-like re-ranking algorithm based on both temporal and thematic clustering [11]. In another approach, [12] proposed a joint graphical model for the problem. The problem has also been tackled in other languages such as French [9]. In an attempt to improve evaluation methodology, [13] proposed a metric based on so-called deep semantic units.

In this work, we aim to apply the approach described in [1] to a new language, namely Vietnamese. We therefore follow the essential parts of the method, including document acquisition, temporal analysis, and event ranking and selection.

Applying the approach to a resource-scarce language like Vietnamese is not trivial. The main issues are:

  1. Acquiring a large, redundant news corpus over a long period.

  2. Performing temporal analysis.

  3. Dealing with unsegmented texts, i.e. texts in which word delimiters are ambiguous. This is a problem specific to some Asian languages such as Chinese, Japanese, and Vietnamese.

  4. Creating reference timelines for evaluation.

We will discuss these issues and our proposed solutions in more detail in the subsequent sections.

3 Our Event Timeline Generation Method

In the first phase, temporal analysis is performed on the whole of a large, redundant news corpus: every temporal expression is detected and normalized. Sentences containing at least one temporal expression are then collected. We use the LuceneFootnote 2 search engine to index events as pairs of a date and a description sentence. Moreover, sentences containing the same normalized time are gathered into a cluster. We use the Lucene ranking algorithm to rank and select salient events for an input query and to generate the final timeline. In this paper, we follow previous works in using the date as the temporal interval; each date cluster hence consists of all sentences belonging to an individual date. Our method is illustrated in Fig. 2. Readers may refer to [1] for more details on the original method.

Fig. 2. Our event timeline generation method.

In Sect. 3.1, we describe our news corpus. Sections 3.2 and 3.3 present temporal analysis and indexing, respectively. Section 3.4 is dedicated to temporal clustering. Timeline generation is discussed in Sect. 3.5.

3.1 The News Corpus

A large, redundant corpus is crucial for this work. To obtain such a corpus in Vietnamese, we first surveyed popular online news agencies in Vietnam and selected the 145 most popular sites. Articles from these sites published between 2007 and 2017 were gathered, resulting in 3.8M articles in total. HTML documents were parsed, and the content and meta-data such as title, document creation time (DCT), tags, and URL were stored in XML format. Note that the DCT is required for temporal analysis. Most of the articles come from dominant agencies in Vietnam such as VnExpress, Tuoitre, Dantri, and Vietnamnet. The content ranges from socio-economics and politics to hi-tech, entertainment, and worldwide events.

3.2 Phase 1: Temporal Analysis

Temporal analysis consists of detecting and normalizing temporal expressions in texts. It is shown in [10] that only 7% of temporal expressions in texts are absolute dates (i.e. with day, month, and year); the remaining DCT-relative dates require normalization. For example, in the sentence “Ngày 18–12, TAND Tiền Giang tuyên phạt Minh 7 năm tù vì tội cưỡng đoạt tài sản” (On December 18, the Tien Giang People's Court sentenced Minh to 7 years in prison for extortion), we need to detect “Ngày 18–12” and normalize it to 2015/12/18. Other expressions such as “ngày hôm qua” (yesterday) or “thứ sáu tuần trước” (last Friday) are even more challenging.

In our experiments, we used HeidelTime, a multilingual temporal analyzer that supports Vietnamese [14]. To our knowledge, it is the only temporal analyzer for Vietnamese to date. It uses JVnTextProFootnote 3 for word segmentation and part-of-speech tagging as prerequisites for temporal analysis.
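To give an idea of what DCT-relative normalization involves, the following toy sketch (not HeidelTime, and far less general) resolves a Vietnamese day-month expression against the document creation time; the example sentence and DCT are illustrative assumptions:

```python
import re
from datetime import date
from typing import Optional

def normalize_day_month(text: str, dct: date) -> Optional[str]:
    """Toy normalizer for Vietnamese 'Ngày DD-MM' expressions (sketch only)."""
    m = re.search(r"[Nn]gày\s+(\d{1,2})\s*[-–/]\s*(\d{1,2})", text)
    if m is None:
        return None
    day, month = int(m.group(1)), int(m.group(2))
    # Assumption: the mentioned date shares the DCT's year; a real normalizer
    # would also consider the previous or following year.
    return f"{dct.year:04d}/{month:02d}/{day:02d}"

print(normalize_day_month("Ngày 18-12, TAND tuyên phạt Minh 7 năm tù", date(2015, 12, 20)))
# -> 2015/12/18
```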

After temporal analysis, we remove all sentences without temporal expressions, as well as sentences whose normalized temporal expressions are incomplete (e.g. 2015/10/xx, where the day is missing). The remaining sentences are indexed with the Lucene search engine, each one as a document, resulting in an index of 11.6M documents in total. Statistics of the corpus are shown in Table 1.
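For illustration, such a filter could look like the sketch below; it assumes TIMEX-style normalized values such as 2015-10-XX, though the exact format produced by HeidelTime may differ:

```python
import re

COMPLETE_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def keep_sentence(normalized_values):
    """Keep a sentence only if it carries at least one fully specified date."""
    return any(COMPLETE_DATE.match(v) for v in normalized_values)

print(keep_sentence(["2015-10-XX"]))          # False: day is missing
print(keep_sentence(["2015-12-18", "2015"]))  # True: one complete date
```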

Table 1. Corpus statistics.

3.3 Phase 2: Indexing

Vietnamese texts do not have explicit word boundaries. Spaces, which are natural word boundaries in languages such as English, only serve as boundaries between syllables in Vietnamese. Word segmentation is required to detect word boundaries, i.e. to decide whether a space is a word boundary or lies inside a multi-syllable word. Word segmentation is useful for downstream tasks such as syntactic and semantic parsing. It has been shown in [15] that using n-grams is comparable to using words for information retrieval on Chinese texts.

To investigate the impact of word segmentation, we built three indices in our experiments using unigrams, bigrams, and words. For example, “truy xuất (retrieval) thông tin (information)” is indexed as {‘truy’, ‘xuất’, ‘thông’, ‘tin’}, {‘truy xuất’, ‘xuất thông’, ‘thông tin’}, and {‘truy xuất’, ‘thông tin’} using unigrams, bigrams, and words, respectively. VnTokenizer [16] was used for word segmentation.
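The three indexing units can be reproduced with a small sketch; the word boundaries below are written by hand, whereas in our system they come from VnTokenizer:

```python
def unigrams(syllables):
    return syllables

def bigrams(syllables):
    return [" ".join(pair) for pair in zip(syllables, syllables[1:])]

# Syllables of the example phrase; the word segmentation below is manual.
syllables = ["truy", "xuất", "thông", "tin"]
words = ["truy xuất", "thông tin"]

print(unigrams(syllables))  # ['truy', 'xuất', 'thông', 'tin']
print(bigrams(syllables))   # ['truy xuất', 'xuất thông', 'thông tin']
print(words)                # ['truy xuất', 'thông tin']
```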

3.4 Phase 3: Temporal Clustering

When a query, such as “The fly in Number 1”, is fed to the Lucene index, it returns documents relevant to the query. In our experiments, we used the built-in tf-idf scoring function of Lucene and limited the result to 10K documents per query. All events having the same date value are gathered into a date cluster. The date cluster is a central notion in our method. If a thematic event lasts over a long period, the dates on which many events happen tend to be more salient than the others. Moreover, within a date cluster, not all events are equally important. Salient events form an essential part of the whole story, whereas marginal events may be reactions or minor details. The most important events are duplicated in several descriptions because they are reported by many news sources.
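A minimal sketch of this clustering step, assuming each retrieved hit is a (date, sentence, score) triple (the scores and sentences below are made up):

```python
from collections import defaultdict

def group_by_date(hits):
    """Group retrieved events into date clusters keyed by normalized date."""
    clusters = defaultdict(list)
    for date, sentence, score in hits:
        clusters[date].append((sentence, score))
    return clusters

hits = [
    ("2014/12/03", "A customer found a fly inside an unopened Number 1 bottle", 7.2),
    ("2014/12/03", "A fly was reported inside a soft drink bottle", 6.9),
    ("2015/12/18", "Minh was sentenced to 7 years in prison", 8.1),
]
clusters = group_by_date(hits)
print(len(clusters["2014/12/03"]))  # 2 events on the same date
```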

3.5 Phase 4: Timeline Generation

Timeline generation consists of selecting the most salient dates and then selecting the most salient event within each date. The salience score of a date d is the accumulation of the Lucene scoresFootnote 4 of all events e in d with respect to a query q:

$$\begin{aligned} salience(d) = \sum _{e \in d} score_{Lucene}(e,q) \end{aligned}$$
(1)

The Lucene score of an event reflects the relevance of its description to the query.

For event ranking inside a date d, we simply select the event with the highest Lucene score as the representative event of d:

$$\begin{aligned} \arg \max _{e \in d} score_{Lucene}(e,q) \end{aligned}$$
(2)

The resulting timeline has K events and can be ordered by salience or chronologically. The number of events K can be varied to show only the most salient events or to provide more detail.
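Equations (1) and (2) translate directly into code. The sketch below assumes the date clusters are stored as a mapping from dates to (sentence, Lucene score) pairs; it is an illustration rather than our actual implementation:

```python
def generate_timeline(clusters, k):
    """Select the k most salient dates and one representative event per date.
    clusters: {date: [(sentence, lucene_score), ...]} for a single query."""
    # Equation (1): salience of a date = sum of Lucene scores of its events.
    salience = {d: sum(s for _, s in events) for d, events in clusters.items()}
    top_dates = sorted(salience, key=salience.get, reverse=True)[:k]
    # Equation (2): the representative event is the highest-scoring sentence.
    timeline = [(d, max(clusters[d], key=lambda e: e[1])[0]) for d in top_dates]
    return sorted(timeline)  # chronological order

clusters = {
    "2014/12/03": [("A customer found a fly in a Number 1 bottle", 7.2),
                   ("A fly was reported inside a soft drink bottle", 6.9)],
    "2015/12/18": [("Minh was sentenced to 7 years in prison", 8.1)],
}
print(generate_timeline(clusters, k=2))
```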

4 Evaluations

In this section, we first describe the creation of a dataset of 17 reference timelines (Sect. 4.1). Our evaluation follows [1]: the temporal content of the timelines is evaluated via salient date detection (Sect. 4.2), and the textual content via text summarization (Sect. 4.3).

4.1 Creation of Reference Timelines

We first investigated timelines written in Vietnamese by journalists. There were not many such timelines at the time we conducted the experiments. Timelines describing events outside the period 2007–2017, which are not covered by our corpus, were discarded. Our final dataset contains 17 timelines, covering both events happening in Vietnam (e.g. “Airplane crashing in Hoa Lac”) and worldwide (e.g. “Missing plane MH370”), as shown in Table 2.

4.2 Evaluating Salient Date Detection

Table 2. Evaluating salient date detection.

We used Mean Average Precision (MAP) to evaluate salient date detection for three systems, each using one of the three indices: unigram, bigram, and word. For each query, the ranked list of all dates returned by a system according to Equation (1) is compared against the dates in the reference timeline. If a date in the reference timeline is not retrieved by the system, it contributes zero precision to the average.

$$\begin{aligned} MAP=\frac{1}{N}\sum _{j=1}^N(\frac{1}{Q_j}\sum _{i=1}^{Q_j}P(date_i)) \end{aligned}$$
(3)
  • \(Q_{j}\): number of relevant dates for query j

  • N: number of queries

  • \(P(date_{i})\): precision at the ith relevant date
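A minimal sketch of this MAP computation, in which reference dates that are never retrieved contribute zero precision:

```python
def average_precision(ranked_dates, reference_dates):
    """AP over dates: precision at each relevant date in the system ranking;
    reference dates that are never retrieved contribute zero."""
    hits, precisions = 0, []
    for rank, d in enumerate(ranked_dates, start=1):
        if d in reference_dates:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(reference_dates)

def mean_average_precision(runs):
    """runs: list of (ranked_dates, reference_dates) pairs, one per query."""
    return sum(average_precision(r, ref) for r, ref in runs) / len(runs)

run = (["2015/12/18", "2014/12/05", "2014/12/03"], {"2014/12/03", "2015/12/18"})
print(mean_average_precision([run]))  # (1/1 + 2/3) / 2 ≈ 0.83
```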

As shown in Table 2, UNIGRAM and BIGRAM perform slightly better than WORD. Performance varies across the timelines. The main reasons for poor performance are: the event lasts too long (“Cao Toan My - Truong Ho Phuong Nga” lasts for three years); the event lasts only a few days while many events of the same type happen at the same time in other locations (“Airplane crashing in Hoa Lac”); there is little news coverage of the event, such as the riot in Dong Tam, My Duc; or the event lasts a long time but the reference timeline only covers a specific period (the timeline about “Escalating tensions in Korea Peninsula” only covers events in 2017). In fact, one interesting aspect of our method is that when a user wants to focus on a particular period, he or she can limit the search range accordingly; for example, one could zoom in on events about the Korean Peninsula in 2017. Implementing this feature in Lucene is straightforward.
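For instance, assuming the normalized date is indexed as a string field (hypothetically named date, stored in yyyyMMdd form), the classic Lucene query syntax accepts a range clause that restricts results to a period; the helper below simply builds such a query string:

```python
def restrict_to_period(query: str, start: str, end: str) -> str:
    """Append a Lucene range clause restricting results to [start, end].
    The field name 'date' and the yyyyMMdd format are assumptions about the index."""
    return f"({query}) AND date:[{start} TO {end}]"

print(restrict_to_period("Korea Peninsula tensions", "20170101", "20171231"))
# -> (Korea Peninsula tensions) AND date:[20170101 TO 20171231]
```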

4.3 Evaluating Text Summarization

For each query, the top K dates returned by the system are selected, and the most relevant sentence for each date is extracted. The final timeline contains K sentences, where K is the number of dates in the reference timeline. We use the ROUGE metric [17] to evaluate the generated timelines against the reference timelines.
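To make the idea behind ROUGE-N concrete, the toy sketch below computes ROUGE-1 recall, i.e. the fraction of reference unigrams covered by the generated text; our experiments use a standard ROUGE implementation rather than this code:

```python
from collections import Counter

def rouge1_recall(generated: str, reference: str) -> float:
    """Toy ROUGE-1 recall: clipped unigram overlap divided by reference length."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, gen[word]) for word, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

gen = "a customer found a fly inside a number 1 bottle"
ref = "a customer found a fly inside an unopened number 1 bottle"
print(round(rouge1_recall(gen, ref), 2))  # ≈ 0.82
```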

Table 3. Evaluating text summarization.

Table 3 again demonstrates that UNIGRAM and BIGRAM perform on par with WORD while requiring no word segmentation. On the other hand, the results fluctuate. This is probably because we only have 17 queries and a single reference timeline per query, which may reflect subjectivity in timeline content.

5 Conclusions and Future Works

This paper presents a timeline generation system for Vietnamese. The experiments were conducted on a large newswire corpus of articles in various domains. We empirically show that using unigrams and bigrams produces timelines of comparable quality without word segmentation.

There is still room for improvement. We could apply event clustering to highlight important events and event re-ranking for diversity as in [1]. Unlike their work, we do not have keywords in the reference timelines; therefore, putting only event titles, which are sometimes too short or too ambiguous, into Lucene can return many irrelevant results. Query expansion could be a useful and feasible solution. Moreover, we plan to expand the reference dataset so that it has multiple timelines per query for more robust evaluation. One possibility is to use dates from existing timelines written in English about worldwide events to evaluate salient date detection, and to further manually translate those timelines into Vietnamese to evaluate text summarization. Another direction for enhancement is improving temporal analysis for Vietnamese. TIME named entities are not only crucial for timeline generation but also important in other tasks such as event extraction and knowledge base population.