1 Introduction and Motivation

Shortly after the fields of spatial [1, 13, 23] and temporal [8] data management emerged, the miniaturization of computing devices and advances in Global Positioning System (GPS) technology spurred a plethora of applications that demanded some type of location awareness, commonly known as Location-Based Services (LBS) [22]. This gave rise to the fields of Spatio-Temporal Databases (STDb) [9] and Moving Objects Databases (MOD) [10], which address various aspects of managing such data, from modelling, through indexing, to query processing [7, 9, 18, 20, 28]. The peculiar feature of the popular query categories in MOD settings (e.g., range and (k) Nearest-Neighbor ((k-)NN) queries [26, 29, 30]) is that they are typically: (1) continuous (i.e., their answers may have to be re-evaluated based on the changes in the motion of the entities); and/or (2) persistent (i.e., their answers may need to be re-evaluated based on both the changes of the motion and the history of such changes) [7, 18]. More recently, researchers have turned their attention to modelling, representing, and querying spatio-temporal trajectories that also have some kind of annotated information associated with the location and/or time, bringing about the concept of semantic (resp. symbolic, spatio-textual) trajectories [19].

Complementary to these developments, the need to reduce (or even eliminate) the labor-intensive process of retrieving textual documents that match particular criteria, along with the contemporary advances in information systems, spurred the field of Information Retrieval (IR) [16, 21, 24], starting in the middle of the twentieth century. To enable effective retrieval of relevant documents by various IR strategies, the documents are typically transformed into a suitable representation, often accompanied by the preparation of suitable indexes [24]. In the subsequent decades, a plethora of research works followed, enabling literature searches [6, 17], detection of various semantic correlations among topics in existing publications [14], etc. In addition, several prototype systems and publicly available search engines have been developed over the years (e.g., MEDLARS [6], Medline (https://www.nlm.nih.gov/bsd/pmresources.html)).

At the heart of the motivation for this work is the observation that, despite the rather long co-existence of the two fields (IR and STDb/MOD) and their respective rich histories, not much has been done in terms of exploiting their possible confluence, which, as we will discuss shortly, could enable novel categories of queries of relevance for various entities, from researchers to government agencies. For example, most of the works related to spatio-temporal aspects of medical phenomena pertain to modelling the temporal, spatial, and evolutionary nature of a subject’s conditions [4], or to exploring the spreading of different chemicals or the responses/reactions to particular stimuli (e.g., [2, 12]).

Our goal in this work is to provide a foundation for addressing the problem of modelling the spatio-temporal aspects of the evolution of scientific knowledge and take a step towards enabling novel queries. Specifically, we aim at answering queries such as:

Q1: Retrieve the authors and institutions located in Eastern Europe, who have published results related to the topic of heterocyclic compounds between 2005 and 2010.

Q2: Retrieve all the topics that were published by an institution in Pennsylvania between 2008 and 2012.

Our main contributions are two-fold:

  1. We introduce the concept of Knowledge-Evolution Trajectory (KET) as a model to formalize the fusion between the spatial, temporal, and content-based aspects of scientific publications.

  2. We provide a preliminary proof-of-concept implementation that demonstrates the feasibility of the proposed model. Specifically, we created a SQL Server database that: (a) contains the data pertaining to medical publications, fetched from PubMed; and (b) stores the geospatial information about each publication, which we generated by using the Google Maps API to convert the institution name into a (Lat, Lon) pair, and then using ArcGIS to generate the values of the Geometry type of SQL Server.

We also conducted some preliminary experiments which, in addition to demonstrating the feasibility of our objectives, also point out some interesting research challenges.

In the rest of this paper, Sect. 2 introduces the KET model, followed by Sect. 3 in which we describe the current case-study implementation and the experimental observations. We summarize and outline directions for future work in Sect. 4.

2 Modelling Spatio-Temporal Evolution of Scientific Literature

We now present the main aspects of the KET model, starting with the concept of semantic (symbolic) trajectories.

Semantic (synonymously, Symbolic, Enriched, or Spatio-textual) Trajectories [3, 5, 11, 15, 19] embed contextual and/or situational knowledge into location-in-time data. In a MOD [10] setting, a trajectory is modelled as a structure of the form \(Tr_i = [o_{ID}, (x_{i1}, y_{i1}, t_{i1}), \ldots , (x_{ik}, y_{ik}, t_{ik})]\), where \(x_{ij}\) and \(y_{ij}\) (\(1 \le j \le k\)) are the coordinates of the location (\( l_{ij} = (x_{ij}, y_{ij})\)) of the object with the unique identifier \(o_{ID}\), obtained at time instant \(t_{ij}\). In between two consecutive updates, the object’s motion is approximated by some kind of interpolation. Semantic trajectories (STs) additionally attempt to describe the kinds of activities associated with a particular location and time, e.g., “stop”, “move”, “walk”, “eat”, etc. Formally (cf. [5, 19]), a semantic trajectory \(ST_i\) is a sequence of so-called semantic episodes \(se_{ij}\) as follows:

\(ST_i = [se_{i1}, se_{i2}, se_{i3}, \ldots , se_{im}] \), and each semantic episode is a tuple of the form:

$$\begin{aligned} se_{ij}=(da_{ij},sp_{ij},x^{in}_{ij},y^{in}_{ij},t^{in}_{ij},x^{out}_{ij},y^{out}_{ij},t^{out}_{ij},tagList_{ij}) \end{aligned}$$

where:

  • \(da_{ij}\) = defining annotation; typically expressing an activity (verb) such as stop or move.

  • \(sp_{ij}\) = semantic location/position of the activity, like a POI (e.g., a museum, restaurant, zoo), home, work, etc.

  • \(t^{in}_{ij}\) and \(t^{out}_{ij}\) = entry/exit times of a semantic position.

  • \(x^{in}_{ij},y^{in}_{ij}, x^{out}_{ij},y^{out}_{ij}\) = entry/exit coordinates of a semantic position.

  • \(tagList_{ij}\) = any additional semantic information, like transportation mode, additional activity description (e.g., eat), etc.

Together, these components constitute the j-th semantic episode of the i-th semantic trajectory.

Fig. 1. Semantic trajectory and spatio-temporal range querying

The left portion of Fig. 1 illustrates the concept of a semantic trajectory (cf. [19]), and the right portion (cf. [15]) illustrates another way of visualizing a semantic trajectory (i.e., color-coded activity), along with the semantics of processing the query “Retrieve all the individuals who were running in the region R1, between 7:00 and 8:00 AM”.
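The episode tuple and the range query above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the class and function names, the rectangular representation of the region R1, and the toy data are all hypothetical, and an episode is considered to be "in" the region if its entry or exit point lies inside it (the path in between is not checked).

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass
class SemanticEpisode:
    """One semantic episode se_ij, mirroring the tuple defined above."""
    da: str                 # defining annotation, e.g. "stop" or "move"
    sp: str                 # semantic position, e.g. "park"
    x_in: float
    y_in: float
    t_in: datetime
    x_out: float
    y_out: float
    t_out: datetime
    tag_list: List[str] = field(default_factory=list)


# Hypothetical region type: axis-aligned rectangle (x_min, y_min, x_max, y_max).
Rect = Tuple[float, float, float, float]


def in_rect(x: float, y: float, r: Rect) -> bool:
    return r[0] <= x <= r[2] and r[1] <= y <= r[3]


def matches(ep: SemanticEpisode, activity: str, region: Rect,
            t_from: datetime, t_to: datetime) -> bool:
    """True if the episode carries the activity, overlaps the time window,
    and enters or exits inside the region (a simplifying assumption)."""
    has_activity = ep.da == activity or activity in ep.tag_list
    overlaps = ep.t_in <= t_to and ep.t_out >= t_from
    inside = (in_rect(ep.x_in, ep.y_in, region)
              or in_rect(ep.x_out, ep.y_out, region))
    return has_activity and overlaps and inside


def range_query(trajectories: Dict[str, List[SemanticEpisode]], activity: str,
                region: Rect, t_from: datetime, t_to: datetime) -> List[str]:
    """Ids of the objects that have at least one matching episode."""
    return sorted(oid for oid, eps in trajectories.items()
                  if any(matches(e, activity, region, t_from, t_to) for e in eps))


def at(hour: int, minute: int = 0) -> datetime:
    return datetime(2017, 1, 1, hour, minute)   # arbitrary toy date


# Toy data (invented): one semantic trajectory per object id.
trajectories = {
    "o1": [SemanticEpisode("move", "park", 1, 1, at(7, 10), 2, 2, at(7, 40), ["run"])],
    "o2": [SemanticEpisode("move", "park", 1, 1, at(9), 2, 2, at(9, 30), ["run"])],
    "o3": [SemanticEpisode("stop", "cafe", 1, 1, at(7, 15), 1, 1, at(7, 45), ["eat"])],
}
R1: Rect = (0.0, 0.0, 5.0, 5.0)
runners = range_query(trajectories, "run", R1, at(7), at(8))
print(runners)
```

With the toy data, only `o1` qualifies: `o2` runs outside the time window and `o3` is in the region at the right time but is not running.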

Two fundamental observations distinguish the existing model of semantic trajectories from the KET model that we are proposing:

  1. There is no concept of “motion”, as commonly perceived in MOD trajectories (even when augmented with annotation). The scientific publications are associated with spatial data, e.g., the locations of the institutions of the participating authors, and those are not mobile entities. However, there is an evolution over the temporal dimension that, in part, is associated with spatial values.

  2. The scientific publications have many more contextual attributes, and those attributes are composite/richer. Namely, considering typical meta-data, they contain:

  • Title;

  • Category (in accordance with an established nomenclature of the corresponding field; possibly a set of such categories, augmented with a set of keywords);

  • A set of authors’ names, each with the institution of the author’s affiliation.

It is precisely the institution element that carries an implicit spatial value, which we use to enable the novel categories of queries.

Based on (1) and (2) above, we define the concept of Scientific Publication Point (SPP) as follows:

$$\begin{aligned} SPP_{i} = (P_{ID}, Title, category, [(author_{1i}, inst_{1i}, x_{1i}, y_{1i}), \ldots , (author_{ki}, inst_{ki}, x_{ki}, y_{ki})], T_{pub}) \end{aligned}$$

We reiterate that, in practice, most bibliographical data sources contain the name of the institution (along with its postal address) for each author, but not the actual coordinates of the corresponding address. This, in turn, eliminates the possibility of asking any queries containing predicates ranging over spatial domains. However, such queries may provide insight into data/trends that could influence both government funding and institutional/individual collaboration plans, as exemplified by the query: “Retrieve all the institutions within 100 km from the coastline in China, which have received more than 10M renminbi of research funding in the last 4 years for medical research, but have published fewer than 20 articles on the topic of cytostatics”.

Fig. 2. KET for Geo-constrained Query

Assuming that the temporal value in each \(SPP_i\) (i.e., \(SPP_i.T_{pub}\)) is represented at a uniform level of granularity (e.g., (month, year)), we now formally define a Knowledge-Evolution Trajectory (KET):

Definition 1

A Knowledge-Evolution Trajectory is a sequence

\([\alpha _1(SPP^{(1)}), \alpha _2(SPP^{(2)}), \ldots , \alpha _n(SPP^{(n)})]\) where:

  • \(\alpha _i\) denotes an operator from relational algebra or a spatial predicate, applied to \(SPP^{(i)}\).

  • For any pair \(SPP^{(i)}\), \(SPP^{(j)}\), \((i< j) \Rightarrow SPP^{(i)}.T_{pub} < SPP^{(j)}.T_{pub}\)

Given the definition of \(SPP_i\), the role of \(\alpha _i\) in Definition 1 is to extract the content of interest, the evolution of which needs to be queried. Thus, for example, we can focus on a particular author by applying \({(SPP_i).author}\) = \({'Jones'}\). However, the main benefit of the proposed model is that we can also apply spatial predicates such as: \((SPP_i).\,(x_{mi},y_{mi})\) IN \({'Pennsylvania'}\).

Clearly, the KET model generalizes the traditional model of a trajectory used in MOD literature and, for that matter, also generalizes the semantic trajectories.
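A minimal Python sketch of the SPP structure and of Definition 1 follows. It rests on several assumptions that are not part of the model: a single selection predicate plays the role of every \(\alpha _i\), the spatial predicate IN 'Pennsylvania' is approximated by a rough bounding box, x and y are taken to be longitude and latitude, and all names and data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class AuthorEntry:
    author: str
    inst: str
    x: float   # longitude (assumption)
    y: float   # latitude (assumption)


@dataclass(frozen=True)
class SPP:
    """Scientific Publication Point, following the tuple defined above."""
    p_id: str
    title: str
    category: str
    authors: Tuple[AuthorEntry, ...]
    t_pub: date


Predicate = Callable[[SPP], bool]


def in_box(box: Tuple[float, float, float, float]) -> Predicate:
    """Spatial predicate: some author location lies inside the box
    (x_min, y_min, x_max, y_max). A crude stand-in for a real region test."""
    x_min, y_min, x_max, y_max = box
    return lambda s: any(x_min <= a.x <= x_max and y_min <= a.y <= y_max
                         for a in s.authors)


def has_author(name: str) -> Predicate:
    """Relational selection on the author attribute."""
    return lambda s: any(a.author == name for a in s.authors)


def build_ket(spps: Iterable[SPP], alpha: Predicate) -> List[SPP]:
    """Keep the SPPs that satisfy alpha, ordered by publication time.
    Ties in t_pub are not resolved here, although Definition 1 is strict."""
    return sorted((s for s in spps if alpha(s)), key=lambda s: s.t_pub)


# Rough, approximate bounding box for Pennsylvania (assumption).
PA_BOX = (-80.52, 39.72, -74.69, 42.27)

# Toy publications (invented).
spps = [
    SPP("p2", "B", "diagnosis",
        (AuthorEntry("Jones", "Inst A", -79.99, 40.44),), date(2010, 5, 1)),
    SPP("p1", "A", "diagnosis",
        (AuthorEntry("Lee", "Inst B", -75.16, 39.95),), date(2008, 3, 1)),
    SPP("p3", "C", "therapy",
        (AuthorEntry("Kim", "Inst C", -87.63, 41.88),), date(2009, 1, 1)),
]

ket_pa = [s.p_id for s in build_ket(spps, in_box(PA_BOX))]
ket_jones = [s.p_id for s in build_ket(spps, has_author("Jones"))]
print(ket_pa, ket_jones)
```

In a real system, the bounding box would be replaced by a proper polygon containment test, which is the kind of operation that spatial database types provide.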

Figure 2 shows an example of a KET corresponding to an answer to the query “Retrieve all the collaborative works between Midwest-based and California-based institutions, between January 2009 and December 2010”. As can be seen, instead of a traditional (x, y) point, we now have a collection of Geo-points that constitute each one of the trajectories of the answer. Moreover, a particular Geo-point can participate in more than one KET, as long as the constraints of the query are satisfied. For example, the answer contains a collaboration between institutions from California and Illinois on a publication related to T-cell studies, in December of 2010.

Fig. 3. KET for participation constraint query

Figure 3 illustrates another example of a KET, which visually (and type-wise) most closely resembles a traditional moving point-object trajectory. It shows an actual answer from our implementation, corresponding to the query “Retrieve all the institutions that have published an article on cytostatics in which all the authors were from the same institution, between January 2014 and July 2015”.
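The two kinds of constraints behind Figs. 2 and 3 (cross-region collaboration and single-institution authorship) can be illustrated with the following Python sketch. It assumes that each author entry already carries a region label derived from its coordinates, and that publication times are (year, month) pairs; the field names and the toy records are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

YearMonth = Tuple[int, int]   # (year, month), compared lexicographically


def is_cross_region(pub: Dict, region_a: str, region_b: str) -> bool:
    """True if the authors' regions include both region_a and region_b."""
    regions = {a["region"] for a in pub["authors"]}
    return region_a in regions and region_b in regions


def is_single_institution(pub: Dict) -> bool:
    """True if all the authors share one institution."""
    return len({a["inst"] for a in pub["authors"]}) == 1


def select(pubs: List[Dict], pred: Callable[[Dict], bool],
           t_from: YearMonth, t_to: YearMonth) -> List[str]:
    """Ids of the publications that satisfy pred within [t_from, t_to],
    ordered by publication time."""
    hits = [p for p in pubs if t_from <= p["t_pub"] <= t_to and pred(p)]
    return [p["id"] for p in sorted(hits, key=lambda p: p["t_pub"])]


# Toy records (invented); region labels are assumed to be precomputed.
CA = {"inst": "Inst CA", "region": "California"}
IL = {"inst": "Inst IL", "region": "Midwest"}
pubs = [
    {"id": "p1", "t_pub": (2010, 12), "authors": [CA, IL]},
    {"id": "p2", "t_pub": (2009, 6), "authors": [CA, CA]},
    {"id": "p3", "t_pub": (2011, 2), "authors": [CA, IL]},
]

collab = select(pubs, lambda p: is_cross_region(p, "Midwest", "California"),
                (2009, 1), (2010, 12))
single = select(pubs, is_single_institution, (2009, 1), (2010, 12))
print(collab, single)
```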

We close this section with another example query, the answer to which resembles spatio-temporal trajectories in some sense, yet has its own distinct semantics of temporal evolution.

Fig. 4. Following most popular topics per institution

Figure 4 shows the answer to the query “Retrieve which topics/categories had the most publications, for the researchers from the SouthWest University, as well as for the ones from Arizona State University, between 2008 and 2012”.
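One possible reading of this query, namely the most frequent topic per institution and per year, can be sketched in Python as follows. The per-year granularity, the (institution, year, topic) record layout, and the placeholder institution names are assumptions made only for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, int, str]   # (institution, year, topic)


def top_topics(records: Iterable[Record], institutions: Set[str],
               year_from: int, year_to: int) -> Dict[Tuple[str, int], str]:
    """Map each (institution, year) to its most frequent topic.
    If several topics tie, the one encountered first is returned."""
    counts = defaultdict(Counter)
    for inst, year, topic in records:
        if inst in institutions and year_from <= year <= year_to:
            counts[(inst, year)][topic] += 1
    return {key: c.most_common(1)[0][0] for key, c in sorted(counts.items())}


# Toy records (invented).
records = [
    ("Inst A", 2008, "diagnosis"),
    ("Inst A", 2008, "diagnosis"),
    ("Inst A", 2008, "therapy"),
    ("Inst A", 2009, "therapy"),
    ("Inst B", 2008, "genetics"),
    ("Inst C", 2008, "diagnosis"),   # institution not requested
    ("Inst A", 2013, "diagnosis"),   # outside the time range
]

top = top_topics(records, {"Inst A", "Inst B"}, 2008, 2012)
print(top)
```

The resulting sequence of (year, topic) pairs per institution is what gives the answer its trajectory-like temporal evolution, even though the spatial value of each institution stays fixed.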

3 Case-Study: Spatio-Temporal Evolution in PubMed

We now describe in detail the current implementation of our proof-of-concept system, for which the context is restricted to the PubMed data pertaining to the various medical publications available on MedLine. We used SQL Server 2014, and we wrote Python scripts to extract the data from PubMed and populate the tables. As mentioned earlier, the PubMed data does not contain geospatial information, whereas SQL Server provides two different geospatial data types: the Geometry type and the Geography type. Thus, in our prototype system, we selected [Publish ID], [Title], [Author] (including the authors’ names and institutions), and [Publish Date] from the XML records returned by PubMed. In addition, we populated the entries of the corresponding Geometry type, for which the coordinate system is the World Geodetic System (WGS) 1984. To populate the corresponding values for the geospatial information, we used a two-step procedure:

  1. The Google Maps API was used to convert the institution name into a (Lat, Lon) pair of values.

  2. Subsequently, ArcGIS was used to convert from (float Lat, float Lon) into the corresponding Geometry type.

This is what enabled us to specify queries that can capture the spatio-temporal evolution of the knowledge represented in scientific – in this case, medical – publications.
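The two-step procedure can be approximated without ArcGIS by going through Well-Known Text (WKT), which SQL Server accepts via `geometry::STGeomFromText`. The Python sketch below is such an alternative, not the pipeline used in our implementation: the geocoder is a hypothetical in-memory lookup standing in for a geocoding service, the institution names and coordinates are invented, and SRID 4326 is assumed for WGS 1984.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical stand-in for a geocoding service: institution -> (lat, lon).
# The names and coordinates are invented toy values.
TOY_GEOCODER: Dict[str, Tuple[float, float]] = {
    "Institution A": (39.95, -75.19),
    "Institution B": (41.88, -87.63),
}


def geocode(institution: str) -> Optional[Tuple[float, float]]:
    """Step 1: institution name -> (lat, lon), or None if unknown."""
    return TOY_GEOCODER.get(institution)


def to_wkt_point(lat: float, lon: float) -> str:
    """WKT uses (x y) order, i.e. longitude first, then latitude."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinates out of range: lat={lat}, lon={lon}")
    return f"POINT({lon} {lat})"


def to_tsql_geometry(lat: float, lon: float, srid: int = 4326) -> str:
    """Step 2: T-SQL expression that builds a geometry value from WKT."""
    return f"geometry::STGeomFromText('{to_wkt_point(lat, lon)}', {srid})"


def geometry_expressions(institutions: Iterable[str]) -> Dict[str, str]:
    """Run both steps, skipping the institutions that cannot be geocoded."""
    out = {}
    for name in institutions:
        hit = geocode(name)
        if hit is not None:
            out[name] = to_tsql_geometry(*hit)
    return out


exprs = geometry_expressions(["Institution A", "Unknown Institute"])
print(exprs["Institution A"])
```

Institutions that cannot be geocoded are skipped here; in practice they would need manual curation, since a missing location removes the publication from every spatial query.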

In the first iteration, we had a “naïve” representation of the data residing in a single table, with attributes:

figure a

The table had 409702 rows in total.

Fig. 5. Normalized database schema

Subsequently, to eliminate the redundancy, we normalized the naïve table. The schema that we used in the implementation is shown in Fig. 5, and it consists of the following tables:

  • Main – with three attributes: [Publish ID] varchar(50), [Title] varchar(500), [Publish Date] date. 32768 rows in total.

  • Author – with three attributes: [Publish ID] varchar(50), [Author Name] varchar(50), [Institution] varchar(150). 188786 rows in total.

  • Category – a table capturing the hierarchy of the categories, with three attributes: [Index] hierarchyid, [Information] varchar(150), [Node Level] varchar(50). 68652 rows in total. Hierarchyid is a special data type and serves as the key of the Category table: / represents the root, /*/ represents the first level, and so on.

  • Geospatial – with two attributes: [Institution] varchar(150), [Location] geometry. 1316 rows in total.
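To illustrate the path notation of the Category hierarchy, the following Python sketch derives the level of a node from its path string and tests the ancestor/descendant relationship. It only mimics the textual form of hierarchyid values (e.g., `/`, `/1/`, `/1/3/`); the helper names and the sample paths are hypothetical, and in SQL Server itself the corresponding operations are provided by the data type.

```python
def level(path: str) -> int:
    """Depth of a node given its path string: '/' -> 0, '/1/' -> 1, ..."""
    if not (path.startswith("/") and path.endswith("/")):
        raise ValueError(f"not a hierarchy path: {path!r}")
    return len([part for part in path.split("/") if part])


def is_descendant_of(path: str, ancestor: str) -> bool:
    """True if path lies in the subtree rooted at ancestor (a node counts
    as its own descendant). The trailing '/' keeps '/12/' out of '/1/'."""
    level(path), level(ancestor)   # validate both arguments
    return path.startswith(ancestor)


# Sample paths (invented): root, a first-level category, and a child of it.
root, first, child = "/", "/1/", "/1/3/"
print(level(root), level(first), level(child))
print(is_descendant_of(child, first), is_descendant_of("/12/", first))
```

A subtree test of this kind is what allows a query on a broad category to also match the publications filed under its more specific subcategories.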

In the sequel, we describe an example of the difference between evaluating queries in the naïve representation and the normalized one. Consider the following:

\(Q_{topic}\): Retrieve the KET for publications addressing the topic of diagnosis, in the period of October 1, 2012 – September 1, 2016.

The query returns a set of temporally annotated points (geo locations) for the topic, during that period.

The corresponding SQL statement for the naïve implementation is:

figure b

The SQL implementation for the same query in the normalized version is:

figure c

To assess the impact of the different database representations, we first show the benefit of the normalized representation in terms of eliminating redundancy:

Fig. 6. Space requirements

As shown in Fig. 6, the naïve implementation requires approximately five times more space than the normalized one.

Fig. 7. Efficiency of execution (in milliseconds)

Regarding the efficiency of execution, we report the measurements observed for each implementation when executing \(Q_{topic}\) (labeled Q1 in Fig. 7) and the query

Q2: Retrieve the works jointly co-authored by Masaki Mori and Yuichiro Doki, between 2009–2010 and 2012–2013

on a Windows 10 machine with an Intel(R) Core(TM) i5-5257U CPU (2.70 GHz, 2701 MHz), 2 cores (4 logical processors), and 8 GB of RAM.

As shown in Fig. 7, the normalized implementation also yields a much more efficient execution than the naïve one.
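For illustration, the co-authorship query Q2 can be expressed over the normalized Main and Author tables with a self-join on Author. The sketch below uses Python's built-in `sqlite3` as a stand-in for SQL Server, so the column types differ from those listed above; it is not the statement used in our experiments. The rows are invented toy data with placeholder author names, and reading the time constraint as two separate periods (2009–2010 and 2012–2013) is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Main   ("Publish ID" TEXT PRIMARY KEY, "Title" TEXT, "Publish Date" TEXT);
CREATE TABLE Author ("Publish ID" TEXT, "Author Name" TEXT, "Institution" TEXT);
""")

# Toy rows (invented); dates are ISO strings, so text comparison orders them.
conn.executemany("INSERT INTO Main VALUES (?, ?, ?)", [
    ("p1", "Toy paper 1", "2009-05-01"),
    ("p2", "Toy paper 2", "2011-03-01"),
    ("p3", "Toy paper 3", "2012-07-01"),
    ("p4", "Toy paper 4", "2013-02-01"),
])
conn.executemany("INSERT INTO Author VALUES (?, ?, ?)", [
    ("p1", "Author A", "Inst X"), ("p1", "Author B", "Inst X"),
    ("p2", "Author A", "Inst X"), ("p2", "Author B", "Inst X"),
    ("p3", "Author A", "Inst X"),
    ("p4", "Author A", "Inst X"), ("p4", "Author B", "Inst Y"),
])

# Self-join on Author: one alias per required co-author.
COAUTHOR_SQL = """
SELECT DISTINCT m."Publish ID", m."Title", m."Publish Date"
FROM Main AS m
JOIN Author AS a1 ON a1."Publish ID" = m."Publish ID"
JOIN Author AS a2 ON a2."Publish ID" = m."Publish ID"
WHERE a1."Author Name" = ? AND a2."Author Name" = ?
  AND (m."Publish Date" BETWEEN ? AND ?
       OR m."Publish Date" BETWEEN ? AND ?)
ORDER BY m."Publish Date"
"""

params = ("Author A", "Author B",
          "2009-01-01", "2010-12-31",
          "2012-01-01", "2013-12-31")
rows = conn.execute(COAUTHOR_SQL, params).fetchall()
coauthored_ids = [r[0] for r in rows]
print(coauthored_ids)
```

Joining the result with the Geospatial table on [Institution] would attach a location to each returned publication, yielding the temporally ordered Geo-points of the corresponding KET.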

4 Conclusions and Future Work

We addressed the problem of modelling and querying the spatio-temporal evolution of the knowledge recorded in scientific publications, and took the first steps towards providing a formalism to capture such evolution across the (geo)spatial domain, as well as other contextual attributes. We proposed the concept of KET (Knowledge-Evolution Trajectory) as a possible unification between two fields (Information Retrieval and Spatio-Temporal/Moving Objects Databases). This type of unification provides opportunities for novel query categories which, to our knowledge, have not been formally treated in the literature. In addition, we provided a proof-of-concept implementation of our approach for the data available in PubMed, and compared the impacts of a naïve design of the database vs. a normalized one, albeit for a limited set of queries. Our initial evaluations have demonstrated that, in addition to reducing the space requirements, the normalized approach also enables faster execution of the KET-based queries.

Related Works: There is a plethora of works in each of the fields of IR [6, 14, 17, 21] and STDb/MOD [7, 9, 10, 18, 20, 23, 28], to mention but a few. The novelty of our proposed approach is to provide a formalism for fusing these works, along with the integration of the respective existing datasets, enabling novel query categories. The closest formalism to our proposed approach is the one of semantic/symbolic trajectories [19]. However, as we argued, this approach: (1) has too simplistic a model of the spatio-temporal evolution; and (2) lacks the multi-dimensionality of the contexts typically associated with scientific publications.

We believe that we have only scratched the surface of a direction that may be of interest in many applications of societal relevance and that, moreover, poses interesting challenges. As part of our future work, we are planning to extend the KET model and to augment the current implementation so that it can incorporate different publication data sources. In addition, although we have provided some preliminary discussion related to efficiency, a challenging open problem is the efficient processing of different types of KET-based queries.

Last, but not least, it seems rather intuitive that investigations along the direction of warehousing the spatio-temporal evolution of scientific publications, along with further semantic similarity searches [25, 27], may yield novel categories of analytical queries.