1 Introduction

The Web is growing with digitized documents, and search is an important activity for getting the required information from a large amount of data. The search engine is one of the most widely used tools in the world. A search engine is a kind of information retrieval system that takes a specific query from the user and returns a ranked list of URLs or documents with their titles and a summary of each web page or document. Some search engines allow searching within a specific time period: the user enters a start and end date into an input box, and the retrieved results are sorted in the chronological order the user specifies. But queries like “Elections in India before 2000” require proper treatment of the temporal expressions embedded in the user’s query. In this example, the user is interested in documents about elections before the year 2000, so all documents containing information related to elections before 2000 must be returned. As another example, someone who is new to India may want to learn about Indian politics and, in particular, about “Anna Hazare.” This user is interested in details of “Anna Hazare” in chronological order, such as Anna in 1990, Anna in 1991, etc. A simple query like “Anna Hazare” will not satisfy that requirement; the user has to give a query like “Anna Hazare from 1960 to 2015.” Existing search engines are not able to handle such queries, in which temporal expressions are used. Moreover, existing retrieval models do not take advantage of the temporal expressions contained in the documents.

This paper presents a framework that overcomes the above limitations by adding new functionality that uses the temporal expressions embedded in documents during retrieval. It also handles temporal expressions in the user’s query.

The paper is organized as follows: Sect. 2 presents a literature survey on temporal information processing and time-based retrieval models. Sect. 3 describes the research methodology, which includes the framework for temporal information retrieval with components such as our temporal tagger, the process to retrieve documents based on time, and an algorithm to represent the retrieved documents on a timeline. Sect. 4 presents the results and evaluation of the system. Sect. 5 concludes the paper and gives directions for future work.

2 Related Work

Developing a framework for temporal information retrieval involves two different areas: (1) temporal information extraction and processing, and (2) use of such expressions in the exploration of search results. Our literature survey covers research in both of these areas. First, we describe research on the development of temporal taggers in various languages. Second, we survey work on temporal information retrieval.

The Message Understanding Conferences (MUCs) in 1996 and 1998 played a significant role, but their evaluations covered only the recognition of temporal expressions (TEs), while a novel contribution towards the normalization of TEs was made in 2000 [1]. GUTime was a rule-based system developed as an extension of the TempEx tagger. It was based on the TimeML TIMEX3 format, which allows a functional style of encoding offsets in time expressions. It was evaluated on the TERN 2004 corpus and achieved an F-measure of 85 % [2]. Llorens developed a CRF-based temporal information extraction system for Spanish documents with an F-measure of 91 % [3]. KUL is a machine learning-based system for recognition and normalization of temporal expressions with a precision of 0.85 and a recall of 0.84 [4]. Negri and Marseglia developed a rule-based system that involves tokenization and part-of-speech tagging based on a list of 5000 entries retrieved from WordNet. The recognized text is then processed by a set of approximately 1000 basic rules. The recognized temporal expressions and the information around them are used for normalization. Composition rules are then used to resolve ambiguities wherever multiple tag placements are possible. The F-measures on ACE 2004 data are 92.6, 83.9 and 87.2 % for detection, recognition and determining the VAL attribute value, respectively [5]. HeidelTime is a high-quality rule-based tagger for temporal expression recognition and normalization with a precision of 0.90 and a recall of 0.82 [6]. Yamcha is a machine learning-based tagger that uses SVM and FOIL for chunking and classification of chunks. It achieved a precision of 80.05 %, a recall of 73.71 % and an F-measure of 76.75 %. The authors concluded that the use of SVM leads to overfitting [7]. Jelena developed a system for temporal information extraction and interpretation for the Serbian language with a precision of 0.93, a recall of 0.96 and an F-score of 0.94 [8]. SUTime is a library for recognizing and normalizing temporal expressions developed by Stanford University. It is a rule-based system developed in Java [9].

Research has been done on extracting and processing temporal information from documents in various languages such as English, Hindi, Spanish, and Chinese, but less effort has been made to use that processed data for retrieval and presentation of documents. Mani’s paper in the special issue on temporal information processing gives a road map for this domain and discusses its challenges and opportunities [10]. Google has also added a prototype, view:timeline(), to display search results on a timeline [11]. Xiaoyan Li and Croft proposed time-based language models, which incorporate time into both query likelihood language models and relevance-based language models [12]. Temporal mining of blogs is presented in [13]. J. Allan, R. Gupta, and Khandelwal proposed methods to construct temporal summaries of news stories [14]. Ricardo Baeza-Yates developed an algorithm to obtain possible future events and then search those events for future information needs [15]. SNAKET is a system developed by Paolo and Antonio that unifies hierarchical web snippet clustering with a web interface for the web search, books, news and blog domains [16]. Rosie Jones and Diaz focused on constructing query-specific temporal profiles based on the publication time of relevant documents [17].

Various temporal taggers have been developed to extract and normalize temporal expressions from documents. However, these taggers mainly focus on explicit temporal expressions and therefore extract very few implicit temporal expressions. Some documents may contain a large number of implicit temporal expressions such as “last diwali,” “next holi,” etc. Our objective is therefore to develop a temporal tagger that extracts Indian festivals as well as other temporal expressions from a document and normalizes each of them into a specific value. We have used this tagger in the development of our framework for temporal information retrieval and for presenting retrieved documents on a timeline.

3 Research Methodology

3.1 Time, Temporal Expression and Temporal Tagger

Time is a very important dimension in any information retrieval system. Temporal information is present in a document in the form of temporal expressions. Processing such temporal expressions from raw text is a fundamental requirement for applications like text summarization and question answering. A temporal expression, also known as a timex, is any natural language phrase that denotes a temporal entity such as an interval or an instant. Examples are “tomorrow” in “Prime Minister Narendra Modi will visit China tomorrow” and “last Friday” in “India won the test match on last Friday.”

Temporal expressions can be classified into the following categories according to Schilder and Habel [18].

Explicit: Date expressions such as “13/08/2013” or “15th August” refer explicitly to entries of a calendar system and can be mapped directly to temporal chronons in a timeline.

Implicit: All temporal expressions that can be evaluated via a given time ontology and the capability of the named entity extraction approach, such as names of holidays (“last Christmas,” “next Valentine’s Day”), etc.

Relative: Some temporal expressions express vague temporal information, and it is rather difficult to place the information they express precisely on a timeline. Such temporal expressions can only be anchored in a timeline in reference to another explicit or implicit temporal expression that is already anchored. For example, “on Monday,” “Before June and After March,” etc. If the document has a creation date, they can be anchored easily: this reference date can be mapped to a chronon and used during normalization.
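To make the anchoring of relative expressions concrete, the following is a minimal Python sketch, not the tagger described in this paper. It assumes the document creation date is available as the reference date; the function name `resolve_relative` and the small set of supported phrases are hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_relative(expression: str, reference: date) -> date:
    """Anchor a simple relative expression to a calendar date.

    `reference` plays the role of the document creation date.
    Only a handful of phrases are supported in this sketch.
    """
    text = expression.lower().strip()
    if text == "today":
        return reference
    if text == "tomorrow":
        return reference + timedelta(days=1)
    if text == "yesterday":
        return reference - timedelta(days=1)
    for index, name in enumerate(WEEKDAYS):
        if text == f"last {name}":
            # Most recent such weekday strictly before the reference date.
            days_back = (reference.weekday() - index - 1) % 7 + 1
            return reference - timedelta(days=days_back)
    raise ValueError(f"unsupported expression: {expression!r}")


if __name__ == "__main__":
    creation_date = date(2015, 5, 13)  # a Wednesday
    print(resolve_relative("tomorrow", creation_date))
    print(resolve_relative("last Friday", creation_date))
```

A rule-based tagger would cover many more patterns; the point here is only that every relative phrase becomes an offset applied to the reference date.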

We have developed our own rule-based temporal tagger to extract temporal expressions from a document and normalize them into a standard format. First, all temporal expressions are extracted from the document; then they are normalized into standard values based on an offset and a reference date. The temporal reference date is extracted first, and every temporal expression is then normalized with respect to it. Compared to other temporal taggers, our tagger has one important characteristic: it supports normalization of Indian festivals, which do not occur on fixed days. It can handle a temporal expression like “last diwali” and translate it into a specific date based on the selected reference time. Because Indian festivals do not all occur on fixed dates, we have stored 50 years of Indian festival dates in a dataset. Our tagger is generalized so that new festivals, and the dates of existing festivals in coming years, can be incorporated. It also allows incorporating special events like “tsunami,” “attack on taj,” etc.
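The festival normalization described above can be sketched as a lookup against a per-year table. This is an illustrative sketch under stated assumptions, not the authors’ implementation: the table `FESTIVAL_DATES`, the function `normalize_festival`, and the three Diwali dates are hypothetical stand-ins for the 50-year dataset and should be checked against an authoritative calendar before use.

```python
from datetime import date

# Hypothetical excerpt of a festival dataset: name -> {year: date}.
# The dates are illustrative and should be verified before real use.
FESTIVAL_DATES = {
    "diwali": {
        2013: date(2013, 11, 3),
        2014: date(2014, 10, 23),
        2015: date(2015, 11, 11),
    },
}


def normalize_festival(expression: str, reference: date) -> date:
    """Resolve 'last <festival>' or 'next <festival>' against a reference date."""
    modifier, _, festival = expression.lower().strip().partition(" ")
    if festival not in FESTIVAL_DATES:
        raise ValueError(f"unknown festival: {festival!r}")
    occurrences = sorted(FESTIVAL_DATES[festival].values())
    if modifier == "last":
        # Latest occurrence strictly before the reference date.
        candidates = [d for d in occurrences if d < reference]
        if candidates:
            return candidates[-1]
    elif modifier == "next":
        # Earliest occurrence strictly after the reference date.
        candidates = [d for d in occurrences if d > reference]
        if candidates:
            return candidates[0]
    raise ValueError(f"cannot normalize: {expression!r}")


if __name__ == "__main__":
    print(normalize_festival("last diwali", date(2015, 5, 13)))
```

Keeping the dates in a table rather than in rules is what lets new festivals, new years, or special events be added without changing the tagger logic.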

3.2 Temporal Outline of Document

Based on the extracted temporal expressions and their respective normalized values, a temporal outline of the document (TOD) is generated. The temporal outline can be defined as:

$$\text{TOD}:D \to \left[ {t \times n \times d \times m \times y \times p} \right]$$

where t is the set of temporal expressions extracted from the document, n is the respective normalized value of the temporal expression, d is the date chronon, m is the month chronon, y is the year chronon, and p is the position of the temporal expression in the document.

A document can contain many temporal expressions, so the temporal outline of D can be a collection of

$$\{ (t_{1} \times n_{1} \times d_{1} \times m_{1} \times y_{1} \times p_{1}),\ (t_{2} \times n_{2} \times d_{2} \times m_{2} \times y_{2} \times p_{2}),\ \ldots,\ (t_{r} \times n_{r} \times d_{r} \times m_{r} \times y_{r} \times p_{r}) \}$$

where r is the number of temporal expressions in the document.

The temporal outline of the document makes all temporal expressions in the document explicit for further processing.
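One possible in-memory representation of the temporal outline is a list of records, one per temporal expression. The sketch below is an assumption for illustration: the class `OutlineEntry`, the function `build_outline`, the use of ISO 8601 strings for the normalized value, and character offsets for the position are not specified in the paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class OutlineEntry:
    expression: str  # t: surface form found in the text
    normalized: str  # n: normalized value (ISO 8601 assumed here)
    day: int         # d: date chronon
    month: int       # m: month chronon
    year: int        # y: year chronon
    position: int    # p: offset of the expression in the document


def build_outline(tagged: List[Tuple[str, date, int]]) -> List[OutlineEntry]:
    """Build a temporal outline from (expression, resolved date, position) triples."""
    return [
        OutlineEntry(expr, value.isoformat(), value.day, value.month,
                     value.year, pos)
        for expr, value, pos in tagged
    ]


if __name__ == "__main__":
    outline = build_outline([
        ("13/08/2013", date(2013, 8, 13), 42),
        ("last diwali", date(2014, 10, 23), 118),
    ])
    for entry in outline:
        print(entry)
```

Storing the day, month, and year chronons separately makes it straightforward to group documents at different granularities later.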

3.3 Exploring Search Result on Timeline

In this section, we describe our algorithm for exploring search results on a timeline.

Let R be the collection of documents retrieved for a specific user query. We assume that each document has a unique id. The following algorithm is used to generate the timeline.

Once the upper and lower bounds of the timeline are fixed, each document needs to be classified based on the temporal values stored in its TOD. Each cluster in the timeline contains the documents belonging to that chronon. A document may contain more than one temporal expression, so its TOD may contain more than one value, and the document may therefore belong to more than one cluster. Some clusters may not receive any document belonging to their chronon; we revise the timeline by removing such clusters. Once all clusters are initialized with their corresponding links, the timeline is sent to the user interface. The user can refine each cluster into smaller chronons if the documents have finer granules available in their temporal outlines.
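The clustering step described above can be sketched as follows. This is a minimal illustration at year granularity, assuming the timeline bounds are already known and that each document id maps to the year chronons found in its temporal outline; the function `build_timeline` and its input format are hypothetical, not the paper’s algorithm listing.

```python
from typing import Dict, Iterable, List


def build_timeline(doc_years: Dict[str, Iterable[int]],
                   lower: int, upper: int) -> Dict[int, List[str]]:
    """Group document ids into one cluster per year chronon.

    doc_years maps a document id to the year chronons in its outline.
    """
    # One (initially empty) cluster for every year between the bounds.
    clusters = {year: set() for year in range(lower, upper + 1)}
    for doc_id, years in doc_years.items():
        for year in years:
            if year in clusters:
                # A document may be placed in several clusters.
                clusters[year].add(doc_id)
    # Revise the timeline by dropping clusters that received no document.
    return {year: sorted(ids) for year, ids in clusters.items() if ids}


if __name__ == "__main__":
    retrieved = {"d1": [1996, 1998], "d2": [1998], "d3": [2004]}
    print(build_timeline(retrieved, 1990, 2000))
```

Refining a cluster into months or days would repeat the same grouping on the finer chronons of the documents inside that cluster.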

4 Evaluation

The initial step was to annotate documents by time. From The Times of India archive, we extracted 100 news documents from different time periods based on the keyword “Elections.” All these documents were processed using our temporal tagger. The extracted temporal expressions were stored in a database with their normalized values and their positions in the document. Through a web interface, we issued queries like “Election from 1990 to 2000,” “Election before this diwali,” “election after this diwali,” “election,” “election since 2000,” etc. Each document in the respective cluster was checked manually and compared with the respective values. In each cluster, 90 % of the documents were relevant. The following snapshots show the output of the above queries (Figs. 1 and 2).

Fig. 1 Timeline for user query “election since 2010”

The results are satisfactory enough for the system to be used for temporal information retrieval that supports temporal expressions (Fig. 2).

Fig. 2 Search results based on query “election from 2010 to 2015”
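Queries of the kind used in the evaluation, such as “Election from 1990 to 2000” or “election since 2000,” can be split into keywords and a year interval before retrieval. The sketch below is one hedged way to do this with regular expressions; the function `parse_query`, the three supported patterns, and the reading of “before 2000” as ending in 1999 are assumptions for illustration. Festival-relative queries such as “before this diwali” would additionally need the tagger’s festival dataset and are not handled here.

```python
import re
from typing import Optional, Tuple


def parse_query(query: str) -> Tuple[str, Optional[int], Optional[int]]:
    """Split a query into keywords and an inclusive (start, end) year range.

    A bound of None means the interval is open on that side.
    """
    text = query.strip().rstrip(".")
    match = re.search(r"\s+from\s+(\d{4})\s+to\s+(\d{4})$", text, re.IGNORECASE)
    if match:
        return text[:match.start()], int(match.group(1)), int(match.group(2))
    match = re.search(r"\s+before\s+(\d{4})$", text, re.IGNORECASE)
    if match:
        # "before 2000" is read as "up to and including 1999".
        return text[:match.start()], None, int(match.group(1)) - 1
    match = re.search(r"\s+since\s+(\d{4})$", text, re.IGNORECASE)
    if match:
        return text[:match.start()], int(match.group(1)), None
    # No temporal expression recognized: plain keyword query.
    return text, None, None


if __name__ == "__main__":
    for q in ["Election from 1990 to 2000", "election since 2000", "election"]:
        print(parse_query(q))
```

The resulting interval can serve as the lower and upper bounds of the timeline, while the keyword part is passed to the ordinary retrieval step.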

5 Conclusion and Future Work

Temporal expressions are important structures in a document and can be used to improve traditional search techniques. We discussed our temporal tagger, which not only recognizes the temporal information in a document but also normalizes it into a standard form so that it becomes explicit for use in other applications. The framework can be used to exploit the temporal information embedded in documents for retrieval, to perform time-based search, and to explore search results on a timeline for more effective visualization. In future work, we aim to improve accuracy and to apply ranking algorithms to the documents within an individual cluster when many documents share the same granule.