
1 Introduction

Text summarization is an active research area in NLP [2]. Event timeline generation can be considered a branch of text summarization that focuses on summarizing thematic events [1]. It is closely related to query-focused multi-document summarization [3] and topic detection and tracking [4]. The differences are two-fold: first, it works on event-centric documents; second, it considers not only textual but also temporal information when summarizing. It is also distinct from timeline generation for an individual entity [5].

By thematic event, we refer to a collection of events relevant to a common theme. For example, the thematic event “Vụ con ruồi trong chai Number 1 (The fly in Number 1)”Footnote 1 started with the event “A customer found a fly inside an unopened Number 1 bottle” and continued with further events such as “He then contacted Tan Hiep Phat company to blackmail 1 billion VND”, etc. Our task is to generate a timeline of the most salient events relevant to a thematic event, as in Fig. 1. Following previous work on event timeline generation, we treat an event as a pair of a timestamp and an event description. For instance, (2014/12/03, “A customer found a fly inside an unopened Number 1 bottle”) is an event. We do not use event details such as triggers and participants as in the Message Understanding Conference and Automatic Content Extraction (and its successor, Knowledge Base Population) programs [6,7,8].

When reporting on a thematic event, such as a political scandal or a disaster, different journalists tend to agree on the salient events and story developments. Such agreement is consequently reflected in redundancy across news sources. We aim to leverage this redundancy to detect salient events.

The task has been studied in other languages such as English and French [9]. However, to our knowledge, there has been no study on Vietnamese so far, although there is a large and rapidly increasing volume of event-related newswire content in Vietnamese.

In this paper, we present our approach to generating event timelines from Vietnamese newswire documents. Our contributions are three-fold: firstly, to our knowledge, we are the first to tackle this problem in Vietnamese. Secondly, as redundancy across sources is crucial for judging the importance of information related to an event, we have gathered a large-scale corpus of Vietnamese news from 145 popular online news agencies in Vietnam covering the period from 2007 to 2017. As illustrated by the experimental results, such a corpus is adequate for generating timelines of various events. Last but not least, we have manually created a dataset of 17 timelines for evaluation. The dataset covers events happening both in Vietnam and worldwide. Experimental results on the dataset show that the n-gram-based models are on par with the word-based model.

The rest of the paper is organized as follows: Sect. 2 briefly introduces related work; Sect. 3 describes our proposed method; Sect. 4 presents evaluation results; Sect. 5 concludes the paper.

Fig. 1. A timeline of “The fly in Number 1” written by journalists.

2 Related Work

Most work on event timeline generation relies on redundancy in the data. [10] proposed a framework for extensive temporal analysis and used redundant data and machine learning to detect salient dates of thematic events. Following this direction, [1] presented a Maximal Marginal Relevance-like re-ranking algorithm based on both temporal and thematic clustering [11]. In another approach, [12] proposed a joint graphical model for the problem. The problem has also been tackled in other languages such as French [9]. In an attempt to improve evaluation methodology, [13] proposed a metric based on so-called deep semantic units.

In this work, we aim to apply the approach described in [1] to a new language, namely Vietnamese. We therefore follow the essential parts of the method, including document acquisition, temporal analysis, and event ranking and selection.

Applying the approach to a resource-scarce language like Vietnamese is not trivial. The main issues are:

  1. Acquiring a large, redundant news corpus over a long period.

  2. Performing temporal analysis.

  3. Dealing with unsegmented texts, i.e. texts in which word delimiters are ambiguous. This is a problem specific to some Asian languages such as Chinese, Japanese, and Vietnamese.

  4. Creating reference timelines for evaluation.

We will discuss these issues and our proposed solutions in more detail in the subsequent sections.

3 Our Event Timeline Generation Method

In the first phase, temporal analysis is performed on the whole of a large, redundant news corpus: every temporal expression is detected and normalized. Sentences containing at least one temporal expression are then collected. We use the LuceneFootnote 2 search engine to index events as pairs of a date and a description sentence. Moreover, sentences containing the same normalized time are gathered into a cluster. We use the Lucene ranking algorithm to rank and select salient events for an input query and to generate the final timeline. In this paper, we follow previous works in using the date as the temporal interval; each date cluster hence consists of all sentences belonging to an individual date. Our method is illustrated in Fig. 2. Readers may refer to [1] for more details on the original method.

Fig. 2. Our event timeline generation method.

In Sect. 3.1, we describe our news corpus. Sections 3.2 and 3.3 present temporal analysis and indexing, respectively. Section 3.4 is dedicated to temporal clustering. Timeline generation is discussed in Sect. 3.5.

3.1 The News Corpus

A large, redundant corpus is crucial for this work. To obtain such a corpus in Vietnamese, we first surveyed popular online news agencies in Vietnam and selected the 145 most popular sites. Articles from these sites published between 2007 and 2017 were gathered, resulting in 3.8M articles in total. HTML documents were parsed, and the content and meta-data such as title, document creation time (DCT), tags, and URL were stored in XML format. Note that the DCT is required for temporal analysis. Most of the articles come from dominant agencies in Vietnam such as VnExpress, Tuoitre, Dantri, and Vietnamnet. The content ranges from socio-economics and politics to hi-tech, entertainment, and worldwide events.

3.2 Phase 1: Temporal Analysis

Temporal analysis consists of detecting and normalizing temporal expressions in texts. It is shown in [10] that only 7% of temporal expressions in texts are absolute dates (i.e. with day, month, and year); the remaining DCT-relative dates require normalization. For example, in the sentence “Ngày 18–12, TAND Tiền Giang tuyên phạt Minh 7 năm tù vì tội cưỡng đoạt tài sản” (On December 18, the Tien Giang People's Court sentenced Minh to 7 years in prison for extortion), we need to detect “Ngày 18–12” and normalize it to 2015/12/18. Other expressions such as “ngày hôm qua” (yesterday) or “thứ sáu tuần trước” (last Friday) are even more challenging.

In our experiments, we used HeidelTime, a multilingual temporal analyzer that supports Vietnamese [14]. To our knowledge, it is the only temporal analyzer for Vietnamese to date. It uses JVnTextProFootnote 3 for word segmentation and part-of-speech tagging as prerequisites for temporal analysis.
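To give an idea of what DCT-relative normalization involves, the following toy sketch (not HeidelTime, and far less general) resolves a Vietnamese day-month expression against the document creation time; the example sentence and DCT are illustrative assumptions:

```python
import re
from datetime import date
from typing import Optional

def normalize_day_month(text: str, dct: date) -> Optional[str]:
    """Toy normalizer for Vietnamese 'Ngày DD-MM' expressions (sketch only)."""
    m = re.search(r"[Nn]gày\s+(\d{1,2})\s*[-–/]\s*(\d{1,2})", text)
    if m is None:
        return None
    day, month = int(m.group(1)), int(m.group(2))
    # Assumption: the mentioned date shares the DCT's year; a real normalizer
    # would also consider the previous or following year.
    return f"{dct.year:04d}/{month:02d}/{day:02d}"

print(normalize_day_month("Ngày 18-12, TAND tuyên phạt Minh 7 năm tù", date(2015, 12, 20)))
# -> 2015/12/18
```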

After temporal analysis, we remove all sentences without temporal expressions, as well as sentences whose normalized temporal expressions are incomplete (e.g. 2015/10/xx, where the day is missing). The remaining sentences are indexed with the Lucene search engine, each one as a document, resulting in an index of 11.6M documents in total. Statistics of the corpus are shown in Table 1.
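For illustration, such a filter could look like the sketch below; it assumes TIMEX-style normalized values such as 2015-10-XX, though the exact format produced by HeidelTime may differ:

```python
import re

COMPLETE_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def keep_sentence(normalized_values):
    """Keep a sentence only if it carries at least one fully specified date."""
    return any(COMPLETE_DATE.match(v) for v in normalized_values)

print(keep_sentence(["2015-10-XX"]))          # False: day is missing
print(keep_sentence(["2015-12-18", "2015"]))  # True: one complete date
```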

Table 1. Corpus statistics.

3.3 Phase 2: Indexing

Vietnamese texts do not have explicit word boundaries. Spaces, which are natural word boundaries in languages such as English, only serve as boundaries between syllables in Vietnamese. Word segmentation is required to detect word boundaries, i.e. to decide whether a space is a word boundary or lies inside a multi-syllable word. Word segmentation is useful for downstream tasks such as syntactic and semantic parsing. It has been shown in [15] that using n-grams is comparable to using words for information retrieval on Chinese texts.

To investigate the impact of word segmentation, we built three indices in our experiments using unigrams, bigrams, and words. For example, “truy xuất (retrieval) thông tin (information)” is indexed as {‘truy’, ‘xuất’, ‘thông’, ‘tin’}, {‘truy xuất’, ‘xuất thông’, ‘thông tin’}, and {‘truy xuất’, ‘thông tin’} using unigrams, bigrams, and words, respectively. VnTokenizer [16] was used for word segmentation.
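The three indexing units can be reproduced with a small sketch; the word boundaries below are written by hand, whereas in our system they come from VnTokenizer:

```python
def unigrams(syllables):
    return syllables

def bigrams(syllables):
    return [" ".join(pair) for pair in zip(syllables, syllables[1:])]

# Syllables of the example phrase; the word segmentation below is manual.
syllables = ["truy", "xuất", "thông", "tin"]
words = ["truy xuất", "thông tin"]

print(unigrams(syllables))  # ['truy', 'xuất', 'thông', 'tin']
print(bigrams(syllables))   # ['truy xuất', 'xuất thông', 'thông tin']
print(words)                # ['truy xuất', 'thông tin']
```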

3.4 Phase 3: Temporal Clustering

When a query, such as “The fly in Number 1”, is fed to the Lucene index, it returns documents relevant to the query. In our experiments, we used the built-in tf-idf scoring function of Lucene and limited the result to 10K documents per query. All events having the same date value are gathered into a date cluster. The date cluster is a central notion in our method. If a thematic event lasts over a long period, the dates on which many events happen tend to be more salient than the others. Moreover, within a date cluster, not all events are equally important. Salient events form an essential part of the whole story, whereas marginal events may be reactions or minor details. The most important events are duplicated in several descriptions because they are reported by many news sources.
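A minimal sketch of this clustering step, assuming each retrieved hit is a (date, sentence, score) triple (the scores and sentences below are made up):

```python
from collections import defaultdict

def group_by_date(hits):
    """Group retrieved events into date clusters keyed by normalized date."""
    clusters = defaultdict(list)
    for date, sentence, score in hits:
        clusters[date].append((sentence, score))
    return clusters

hits = [
    ("2014/12/03", "A customer found a fly inside an unopened Number 1 bottle", 7.2),
    ("2014/12/03", "A fly was reported inside a soft drink bottle", 6.9),
    ("2015/12/18", "Minh was sentenced to 7 years in prison", 8.1),
]
clusters = group_by_date(hits)
print(len(clusters["2014/12/03"]))  # 2 events on the same date
```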

3.5 Phase 4: Timeline Generation

Timeline generation consists of selecting the most salient dates and then selecting the most salient event within each date. The salience score of a date d is the accumulation of the Lucene scoresFootnote 4 of all events e in d with respect to a query q:

$$\begin{aligned} salience(d) = \sum _{e \in d} score_{Lucene}(e,q) \end{aligned}$$
(1)

The Lucene score of an event reflects the relevance of its description to the query.

For event ranking inside a date d, we simply select the event with the highest Lucene score as the representative event of d:

$$\begin{aligned} \arg \max _{e \in d} score_{Lucene}(e,q) \end{aligned}$$
(2)

The resulting timeline has K events and can be ordered by salience or chronologically. The number of events K can be varied to show only the most salient events or to provide more detail.
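Equations (1) and (2) translate directly into code. The sketch below assumes the date clusters are stored as a mapping from dates to (sentence, Lucene score) pairs; it is an illustration rather than our actual implementation:

```python
def generate_timeline(clusters, k):
    """Select the k most salient dates and one representative event per date.
    clusters: {date: [(sentence, lucene_score), ...]} for a single query."""
    # Equation (1): salience of a date = sum of Lucene scores of its events.
    salience = {d: sum(s for _, s in events) for d, events in clusters.items()}
    top_dates = sorted(salience, key=salience.get, reverse=True)[:k]
    # Equation (2): the representative event is the highest-scoring sentence.
    timeline = [(d, max(clusters[d], key=lambda e: e[1])[0]) for d in top_dates]
    return sorted(timeline)  # chronological order

clusters = {
    "2014/12/03": [("A customer found a fly in a Number 1 bottle", 7.2),
                   ("A fly was reported inside a soft drink bottle", 6.9)],
    "2015/12/18": [("Minh was sentenced to 7 years in prison", 8.1)],
}
print(generate_timeline(clusters, k=2))
```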

4 Evaluations

In this section, we first describe the creation of a dataset of 17 reference timelines (Sect. 4.1). Our evaluation follows [1]: the temporal content of the timelines is evaluated via salient date detection (Sect. 4.2), and the textual content via text summarization (Sect. 4.3).

4.1 Creation of Reference Timelines

We first investigated timelines written in Vietnamese by journalists. There were not many such timelines at the time we conducted the experiments. Timelines describing events outside the period 2007–2017, which are not covered by our corpus, were discarded. Our final dataset contains 17 timelines, covering both events happening in Vietnam (e.g. “Airplane crashing in Hoa Lac”) and worldwide (e.g. “Missing plane MH370”), as shown in Table 2.

4.2 Evaluating Salient Date Detection

Table 2. Evaluating salient date detection.

We used Mean Average Precision (MAP) to evaluate salient date detection for three systems, each using one of the three indices: unigram, bigram, and word. For each query, the ranked list of all dates returned by a system according to Equation (1) is compared against the dates in the reference timeline. If a date in the reference timeline is not retrieved by the system, it contributes zero precision to the average.

$$\begin{aligned} MAP=\frac{1}{N}\sum _{j=1}^N(\frac{1}{Q_j}\sum _{i=1}^{Q_j}P(date_i)) \end{aligned}$$
(3)
  • \(Q_{j}\): number of relevant dates for query j

  • N: number of queries

  • \(P(date_{i})\): precision at the ith relevant date
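A minimal sketch of this MAP computation, in which reference dates that are never retrieved contribute zero precision:

```python
def average_precision(ranked_dates, reference_dates):
    """AP over dates: precision at each relevant date in the system ranking;
    reference dates that are never retrieved contribute zero."""
    hits, precisions = 0, []
    for rank, d in enumerate(ranked_dates, start=1):
        if d in reference_dates:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(reference_dates)

def mean_average_precision(runs):
    """runs: list of (ranked_dates, reference_dates) pairs, one per query."""
    return sum(average_precision(r, ref) for r, ref in runs) / len(runs)

run = (["2015/12/18", "2014/12/05", "2014/12/03"], {"2014/12/03", "2015/12/18"})
print(mean_average_precision([run]))  # (1/1 + 2/3) / 2 ≈ 0.83
```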

As shown in Table 2, UNIGRAM and BIGRAM perform slightly better than WORD. Performance varies across the timelines. The main reasons for poor performance are: the event lasts too long (“Cao Toan My - Truong Ho Phuong Nga” lasts for three years); the event lasts only a few days while many events of the same type happen at the same time in other locations (“Airplane crashing in Hoa Lac”); there is little news coverage of the event, such as the riot in Dong Tam, My Duc; or the event lasts a long time but the reference timeline only covers a specific period (the timeline about “Escalating tensions in Korea Peninsula” only covers events in 2017). In fact, one interesting aspect of our method is that when a user wants to focus on a particular period, he or she can limit the search range accordingly; for example, one could zoom in on events about the Korean Peninsula in 2017. Implementing this feature in Lucene is straightforward.
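For instance, assuming the normalized date is indexed as a string field (hypothetically named date, stored in yyyyMMdd form), the classic Lucene query syntax accepts a range clause that restricts results to a period; the helper below simply builds such a query string:

```python
def restrict_to_period(query: str, start: str, end: str) -> str:
    """Append a Lucene range clause restricting results to [start, end].
    The field name 'date' and the yyyyMMdd format are assumptions about the index."""
    return f"({query}) AND date:[{start} TO {end}]"

print(restrict_to_period("Korea Peninsula tensions", "20170101", "20171231"))
# -> (Korea Peninsula tensions) AND date:[20170101 TO 20171231]
```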

4.3 Evaluating Text Summarization

For each query, the top K dates returned by the system are selected, and the most relevant sentence for each date is extracted. The final timeline contains K sentences, where K is the number of dates in the reference timeline. We use the ROUGE metric [17] to evaluate the generated timelines against the reference timelines.
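To make the idea behind ROUGE-N concrete, the toy sketch below computes ROUGE-1 recall, i.e. the fraction of reference unigrams covered by the generated text; our experiments use a standard ROUGE implementation rather than this code:

```python
from collections import Counter

def rouge1_recall(generated: str, reference: str) -> float:
    """Toy ROUGE-1 recall: clipped unigram overlap divided by reference length."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, gen[word]) for word, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

gen = "a customer found a fly inside a number 1 bottle"
ref = "a customer found a fly inside an unopened number 1 bottle"
print(round(rouge1_recall(gen, ref), 2))  # ≈ 0.82
```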

Table 3. Evaluating text summarization.

Table 3 again demonstrates that UNIGRAM and BIGRAM perform on par with WORD while requiring no word segmentation. On the other hand, the results fluctuate. This is probably because we only have 17 queries and a single reference timeline per query, which may reflect subjectivity in timeline content.

5 Conclusions and Future Works

This paper presents a timeline generation system for Vietnamese. The experiments were conducted on a large newswire corpus of articles in various domains. We empirically show that using unigrams and bigrams produces timelines of comparable quality without word segmentation.

There is still room for improvement. We could apply event clustering to highlight important events and event re-ranking for diversity as in [1]. Unlike their work, we do not have keywords in the reference timelines; therefore, putting only event titles, which are sometimes too short or too ambiguous, into Lucene can return many irrelevant results. Query expansion could be a useful and feasible solution. Moreover, we plan to expand the reference dataset so that it has multiple timelines per query for more robust evaluation. One possibility is to use dates from existing timelines written in English about worldwide events to evaluate salient date detection, and to further manually translate those timelines into Vietnamese to evaluate text summarization. Another direction for enhancement is improving temporal analysis for Vietnamese. TIME named entities are not only crucial for timeline generation but also important in other tasks such as event extraction and knowledge base population.