1 Introduction

With the growing interest in spatio-temporal data for more than a decade, the indexing and retrieval of spatio-temporal data has been a challenging research area. Multimedia data play a key role in today’s world including but not limited to education, advertisement, entertainment, communication, and information retrieval. Especially, videos have been the most intriguing media since videos have multimodal features along with spatio-temporal properties. Everyday many videos are uploaded on websites such as You Tube [41] and Google Video [10]. Various strategies have focused on modeling of different aspects of videos such as modeling fuzzy information [4] and spatio-temporal features of the objects in a video [24, 30].

The goal of our research is to model, store, query, and index the semantic contents of the videos. We classify spatio-temporal querying (STQ) into two based on how STQ can be executed:

  • Split STQ This query usually targets objects that satisfy some spatial constraints within a period of time or vice versa. The data satisfy the spatial constraints independent of historical data as long as the data belong to the given time domain. A sample query is as follows: “Give a list of regions, where the ball appears between fifth and tenth minutes” [19].

  • Coupled STQ This query usually targets objects that satisfy some (sequential) spatial constraints over a period of time. Series of consecutive or nonconsecutive constraints are part of a query. A sample query is “return videos having a Delta plane take-off right before United plane take-off but after Continental plane landing in Huntsville, Alabama.” Complex spatio-temporal patterns (STP) [11] are an example of coupled STQ.

Without generalization, the spatio-temporal indexing methodologies can be divided into two: tree-based and sequence matching. Tree-based methods require that a hierarchical organization of data is possible. In most cases, this hierarchical organization is achieved with respect to spatial properties. Tree-based approaches are effective for split STQ. The coupled STQ with tree-based indexing can be achieved by linking subqueries of a STQ. However, tree-based indexing may lose its efficiency if sequence information is queried rather than time interval due to comparing the data for all time intervals. Sequence-matching approaches can deal with temporal sequence matching. However, a proper indexing methodology is required before starting sequence matching otherwise the number of sequence matchings may be significantly high.

A good indexing structure should address the requirements of the data model as well as querying. We define a recurrent database as a database where the content or objects of interest with corresponding actions are repeated frequently. In other words, objects are associated with spatial locations, and a possible event (or an action) causes the change of these locations. If the domains of spatial locations, events, and actions that cause the change of locations are finite, the objects are likely to appear at the same locations (due to some actions) in the database. The camera is usually mounted on an almost-static platform for capturing the environment. In this sense, almost all sports videos are a subset of recurrent databases. Other examples include news-anchor, distance-learning education, and surveillance videos. Our goal in this paper is to index such recurrent databases.

We believe that spatio-temporal content of video is important for querying. Rather than a semantic instantaneous event (e.g., “missed shot”) in a clip, we are interested in a sequence of actions and objects with their locations. When we compare various spatio-temporal techniques, a general assumption on the database is the presence of a single timeline. However, it is likely that there may be more than one timeline. For example, each timeline maybe set for a different day. In such a case, the ordering of timelines may not be an important factor for STQ. The tree-based and sequence-matching methods should adapt to the presence of multiple timelines.

Traditional indexing methods cannot achieve the goals of semantic databases that maintain high-level information since they are usually designed based on non-semantic properties. Most common indexing techniques use comparison operators to retrieve the requested data from the database. The semantic querying can only be achieved by having additional layer on top of traditional indexing methods.

1.1 Our approach

In this paper, we propose an indexing and search method that directly targets coupled STQ in a recurrent database. In order to provide an efficient storage and retrieval of video data, a semantic modeling and retrieval system, SMART [14, 15] was proposed. In this paper, the SMART system is improved by providing a novel indexing method, named as S3G: a semantic sequence state graph. An early version of S3G was introduced in [27]. The major difference between this indexing method and the traditional ones is that in S3G the links between states have semantics where states maintain the discrete information about the spatial properties of objects. The events correspond to the transitions in S3G graph. Since transitions correspond to semantic events, it is possible to perform queries based on semantic concepts following the transitions in S3G. We should note that we are not interested in the shapes of the objects. We are rather interested in where and when they appear. We are interested in spatio-temporal events that can be denoted at discrete times. We assume that a semantic event causes the difference between two states. The spatial queries are performed with respect to the object–location pairs. Temporal queries based on Linear Temporal Logic are performed by following the transitions in S3G graph. Spatio-temporal queries use spatio-temporal logic. The spatio-temporal queries, such as what will be the next state (i.e., the spatial properties of objects following a transition on a given state) and whether a particular state eventually occurs in future after a given state has occurred, are implemented using S3G. Graphical viewing of the query construction is implemented to facilitate easy building complex queries.

The contributions of our approach can be summarized as follows: (a) provides a compact representation of a video database; (b) supports recurrent databases; (c) indexes a database that has multiple sequences with different timelines onto a single data structure; (d) links the states with semantic events (or actions); and (e) provides an interactive querying interface where the intermediate results are provided and help the user refine his query.

This paper is organized as follows. The following section provides the background and discusses about related work. Section 3 describes the S3G. Section 4 explains building S3G. The interface and querying are explained in Sect. 5. The last section concludes the paper.

2 Background and related work

In this section we firstly describe the related work; secondly, provide background on our SMART system; and finally, explain the spatio-temporal logic used in this paper.

2.1 Related work

Significant research has been performed on indexing temporal and spatial databases [17, 26]. A good survey on indexing temporal data appears in [34]. A recent survey on spatio-temporal video retrieval is presented in [31]. STP [11] provides indexing based on spatial locations. For each spatial location, objects are sorted based on their identifiers. If an object visits the same location, then visits of an object are sorted based on the time they visit the location. Lagogiannis et al. [20] improve this approach if an object does not visit the same location again. Indexing spatio-temporal data is mostly based on traditional database indexing techniques such as R-trees [33] and B+-trees [25]. One of the major problems of indexing based on these indexing methods is the lack of necessary semantics to build queries that require semantic information. Therefore, semantic retrieval on top of these indexing methods can only be implemented by an upper layer of semantic operations. This puts an additional burden on the retrieval. If the indexing method could capture semantic properties, the retrieval efficiency could be improved.

2.1.1 Tree-based indexing

Tree-based methods cannot be used for semantic querying directly since tree-based methods assume that hierarchical organization of data is possible. However, no order can be imposed between the semantic values. The indexing strategy should be able to order semantic components of the database. The ordering between data at different time instants is not necessarily before–after relation. The ordering may also include the cause or event between different time instants. Traditional temporal querying includes storing starting and stopping time of clips and then comparison of these starting and ending points for querying based on Allen’s temporal logic based on intervals [1].

A survey of indexing methods for multimedia databases that includes spatial data indexing is provided in [7]. The most common way of indexing spatio-temporal data is based on tree indexing. The RS-tree [28], X-tree [5], weR-tree [6], NV-tree [23], SMILe-tree [22], Bx-tree [16], ST2B –tree [8], TPR-tree [32], TPR*-tree [35], and 2n index tree [39] are tree-based indexing methodologies.

2.1.1.1 R-tree-based indexing

RS-tree [28] uses R-tree for spatial properties. The non-spatial properties are indexed with respect to R-tree using an S-tree which has the same structure to prune the results of querying the R-tree given non-spatial properties. X-tree [5] improves R-trees by incorporating supernodes to deal with overlaps in R-trees. Supernodes avoid splits in the directory that would cause inefficient directory structure. Hence, X-tree looks like a combination of R-tree like and linear array like directory. Weighted R-trees (weR-trees) [6] propose weight-balanced R-trees for dynamic manipulation of large datasets. The weight is based on the number of values and fan out of a node. The goal is to produce a partial rebuilding by gradual construction of subtrees. Time-parameterized R-tree (TPR-tree) [32] uses spatial positions and velocities in each dimension to index data. An object that can move in 2D space will have four parameters. Moving points are bounded by time-parameterized rectangles and then indexed by R-trees. The queries are assumed to be applied for a window of time. TPR*-tree [35] improves TPR-tree by reducing the cost of insertions and deletions. These methods can aim either past or future but not both since outdated objects are deleted [35]. SMILe-tree [22] is based on K-d trees and R+ trees. At the first level, the data is organized according to the K-d tree with corresponding dimensions at each level. For hierarchical organization of overlapping components, R+ trees are used to avoid overlappings.

2.1.1.2 B-tree-based indexing

B x-tree [16] is based on B+-tree. Object locations are mapped to single-dimensional values using space-filling curves. The maximum time between two updates is partitioned into n phases. An object with its location is mapped to the corresponding partition of Bx-tree based on its timestamp. It allows queries after the current time. Old partitions of Bx-tree are removed by appending new partitions. Bx-tree returns false hits as it only uses location of the objects. B dual-tree [40] partitions data in dual space, i.e., both location and velocity, to reduce the number of false hits. So it uses location and velocity to obtain the one-dimensional key. Each internal entry of a Bdual-tree maintains a set of moving rectangles, which leads to use of R-tree like query algorithms. Maintaining moving rectangles leads to high computation overload by slowing down the update operations. A self-tunable spatio-temporal B+-tree (ST2B-tree) [8] partitions the space with respect to reference points (similar to Voronoi diagram). Each Voronoi cell is assigned a grid, and an object is assigned to the grid of the closest reference point. It has two phases for time; and different reference points are used to deal with data diversity. It supports range queries for space only. In multicurves [36], each curve is responsible for a subset of the dimensions to reduce the boundary effects inside each curve. The dimensions of data elements are split among a number of space-filling curves. The projections of data are mapped to curves and the one-dimensional coordinates of data are computed and stored in a sorted list.

2.1.1.3 Other tree-based indexing

STRIPES index [29] maps 2D moving objects to 4D points using Hough-X transform and then stores them in a PR bucket-quad tree. The objects are represented in a dual transformed space. The trajectories in (d + 1) dimensional space are mapped to 2D dual space. STRIPES has high query cost and storage size. Nearest-vector-tree (NV-tree) [23] projects data on a line and segments the data and then re-projects data recursively until each segment has enough number of data points. NV-tree allows overlapping of boundaries to overcome the problem of nearest-neighbor search when two close neighbors may belong to different partitions. 2 n index tree [39] has states that include 3D positions, orientation, and speed of objects. Objects are represented as states with timestamps. Objects with limited number of timestamps can be indexed by 2n index tree.

2.1.2 Sequence matching

Sequence-matching techniques try to find the closest sequence given a proper sequence without gaps. It is also possible that the query sequence can be partially described with some gaps in the sequence. For example, the beginning and ending of a query sequence might be provided. In addition, some sequences might have a cyclic nature where some parts of the sequence may be repeated several times. However, most sequence-matching methods use proper sequence matching without any gaps.

Embedded-based subsequence matching (EBSM) method [3] maps a query to a sequence of vectors and then closest vectors are searched in the database. Further exploration for close sequences is handled by dynamic time warping based on subsequence-matching algorithm without gaps. Spatiotemporal sequence matching is used for video copy detection in [18]. Two distances are obtained: spatial distance using ordinal measures of 2 × 2 partitions of the video frames and temporal distance based on the changes in the subsequent frames. This method targets exact match between sequences (i.e., copy detection).

2.1.3 Representation

In [12], an activity is considered to be composed of action threads. A single thread action is represented by a stochastic finite automaton of event states. Events are represented as Bayesian networks (which are acyclic graphs) and event constraints such as “event (or sub-event) A should occur before event (or sub-event) B; and B should occur before event (or sub-event) C” are used to identify an event (not for retrieval). While Hongeng et al. [12] deal with event recognition, our S3G is used to retrieve the clips with desired state (object location values). In [38], an extension to MPEG-7 is proposed to detect actions such as “A follows B”. Assfalg et al. [2] propose a system that semantically annotates the sports videos at different layers of semantic significance. It, however, uses semantic annotations for multimedia indexing and retrieval without using a graphical representation of the semantic data. Lay and Guan [21] propose the use of a grammar for video retrieval. They use an adjacency matrix to determine the behavior of player such as baseline player and inverted index to retrieve clips based on object locations. They mention about the use of operators such as ADJ (temporally adjacent), W (within as a spatial operator), and temporal ordering with < operator. However, there is no information about the interface for providing queries or the complexity of implementing queries using these operators based on the proposed indexing structures.

2.2 Background on SMART

SMART [14, 15] represents semantic information as a grammar-based string that enables spatio-temporal queries in SQL query language. The player names in sports video, the shots given by the players, and their score are the examples of the semantic content of a video. This semantic information is represented as a grammar-based string that enables spatio-temporal queries in SQL query language. The major semantic contents of video are considered as objects, events, locations, and cameras in SMART. SMART can be applied to most sports games. We use tennis game as an example since it has an important advantage over most sports videos: the complete view of the game can be captured with one almost-static camera. Three main objects, as ΣO = {U, V, b}, were identified. Here, U and V represent the players, and b is the ball. Two events, as ΣE = {F, B}, were identified. Here, F represents the forehand shot and B represents the backhand shot. The tennis court was divided into various regions as shown in Fig. 1a: ΣL = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, N} where N represents the net. Six different camera views are considered for tennis videos: A, gives a close view of the player at locations 7 and 8 in Fig. 1a; B, gives a close view of the player at locations 9 and 10; C, court view; D, action replay; and R, rest time; Com, commentators. Since we focus only actions in the game, rest time and commentators are not included in our strings.

Fig. 1
figure 1

Locations of a tennis court and sample sequences [13]. a Tennis court segmentation. b Close view of player. c Player1 serving. d Player2 hits a backhand shot. e Player2 hits a forehand shot

2.2.1 An example

Figure 1b shows initially the close view of the player1 is captured by camera A. The play is then captured by the court view. The player1 and player2 are at locations 8 and 9, respectively. This sequence can be represented as {A [U] C [U8 b8 V9 b3 Bv9 b5 BU8 b4 FV10 b5] D[]}. The player1 serves and the ball hits location 3 (Fig. 1c). The player2 hits a backhand shot and the ball hits location 5 (Fig. 1d). Again the player1 hits a backhand shot at location 8 returning the ball to location 4. The player2 hits a forehand shot at location 10 (Fig. 1e) and the play ends with the ball going to location 5. This sequence is then replayed by camera D.

2.2.2 Grammar for tennis game

SMART’s generic grammar was extended to incorporate the semantics of tennis game. Figure 2 provides a grammar for the tennis videos: video is a sequence of clips; a sequence of clips has many clips; a clip has a camera view and a sequence of spatiotemporal instances; a sequence of spatiotemporal instances has a spatiotemporal instance; and a spatiotemporal instance (spt) is represented with an object (obj), location (loc), and an optional event.

Fig. 2
figure 2

A grammar for tennis videos

2.3 Spatio-temporal logic (STL)

Temporal logic allows to analyze the properties of a system with respect to time. Linear temporal logic (LTL) [37] is a type of temporal logic, that is used to analyze a system that is considered to be composed of a vertex-labeled path \( S_{0} ,S_{1} , \ldots ,S_{n} , \) where each vertex S i corresponds to a point in time as shown by Fig. 3a. We plan to use the STL notation provided in [9] with some modifications. Temporal assertion, Θ, on a state diagram (or state graph) σ, for a set of scene sequences is expressed as Θ ≔ (σ, S) ≔ θ where S is a state in the state graph, σ, and θ is a temporal formula which is formed by combining spatial assertions Φ with Boolean connectives such as ∧, ∨, ¬, → and temporal operators. A state is a set of object, location pairs. The temporal operators are global (□), next (∘), eventually (◊), until (U), and releases (R) operators. The temporal formula can be expressed as:

$$ \theta\,{:}{=}\, \Upphi \left| {\square \Upphi } \right|\circ \Upphi \left| {\diamondsuit \Upphi } \right| \, \Upphi_{ 1} {\text{U}}\Upphi_{ 2} | \, \Upphi_{ 1} {\text{R}}\Upphi_{ 2} \left| {\neg \, \Upphi } \right| \, \Upphi_{ 1} \vee \Upphi_{ 2} |\Upphi_{ 1} \wedge \Upphi_{ 2} |\Upphi_{ 1} \to \Upphi_{ 2} $$
Fig. 3
figure 3

Temporal operators a a sample sequence, b globally operator, c eventually operator, d next operator, e until operator, and f releases operator

For a given state a graph, S, these temporal operators are briefly explained as follows:

  • Global/Always This operator is denoted as “G” or “□”. GΦ or □Φ, where Φ represents a propositional formula, implies that the condition specified by Φ should be satisfied through the path of vertices. (S, S1) ⊨ □Φ implies that Φ is true for all states starting from state S1 in graph S (Fig. 3b).

  • Eventually/Finally This operator is denoted as “F” or “◊”. FΦ or ◊Φ implies that the condition specified by Φ should be satisfied eventually at a point of time in the given path of vertices. (S, S1) ⊨ ◊Φ implies that Φ becomes eventually true after state S1 in graph S (Fig. 3c).

  • Next This operator is denoted as “X” or “∘”. XΦ or ∘Φ implies that the condition specified by Φ should be satisfied by the next vertex in the given path of vertices. (S, S1) ⊨ ∘Φ implies that Φ becomes true in the state following state S1 in graph S (Fig. 3d).

  • Until This operator is denoted as “U”. ψUΦ, where ψ and Φ represent propositional formulae, implies that the condition specified by ψ should be satisfied until a particular vertex in the given path of vertices has Φ satisfied. (S, S1) ⊨ ψUΦ implies that ψ is true until Φ becomes true starting from state S1 in graph S (Fig. 3e).

  • Releases This operator is denoted as “R”. ψRΦ implies that the condition specified by Φ should be satisfied through the path of vertices, until the first vertex in the given path of vertices, has ψ satisfied. (S, S1) ⊨ ψRΦ implies that Φ is true until ψ becomes true starting from state S1 in graph S (Fig. 3f). If there is no vertex that satisfies the condition specified by ψ, then the condition specified by Φ will be satisfied throughout the path of vertices.

In this research, the temporal operators ‘Next’ and ‘Eventually’ have been implemented for supporting the spatio-temporal querying on the S3G, while ‘Global’ concept is used for optimizing the indexing of S3G states built for a tennis video database as an application.

3 S3G: a semantic sequence state graph

S3G is a graph where the events, objects, and locations are maintained as states and transitions. S3G resembles to a non-deterministic finite state machine. An S3G can be defined as M = [S, Σ, δ, s0, F] where S is a set of internal states, Σ is a finite input alphabet, s0 is the initial state, F is a set of external states, and δ is a transition function mapping S × Σ to S ∪ F. An internal state s (s ⊆ S) is a set of object–location pairs. An internal state is a subset of cross product of objects spatial locations (S ⊆ ΣO × ΣL). An external state f ⊆ F corresponds to a decision in a sequence of events. The alphabet (Σ) is a subset of the cross product of the object–event pairs (Σ ⊆ ΣO × ΣE). It should be noted that not all objects are associated with an event. S3G is cyclic in nature. Each of its nodes represents a state giving the locations of the objects involved in the video. Each node has also a list of clips displaying the corresponding state.

3.1 Components

The components of SMART grammar have events, objects, locations and camera view as the alphabet. The most important part in building an S3G is the identification of transition function and the states.

3.1.1 States

A state is identified by objects and their corresponding locations. In addition to object–location pairs, states also maintain pointers for quick retrieval of objects. For each state, there is a list of clip pointers to retrieve the corresponding clips having that state. The maximum number of states for a video is determined by \( \prod\nolimits_{i = 1}^{{|\sum_{O}|}} {(|\sum_{L} |) = |\sum_{L} |^{{|\sum_{O}|}}} \).

3.1.2 Transitions

The transitions are determined by the semantic events. In our applications, the semantic events result in the displacement of objects. An action is generally continuous. Dividing an action into discrete steps is critical in building an S3G. Determining discrete steps is decided by semantic events/actions.

3.2 An example: tennis video

A video can be considered as a collection of a series of events and a series of states due to these events. Some videos such as a sports video can have finite types of states and events. Since the details of S3G can be expressed in a simple way using a tennis video, it is chosen as an S3G application in this paper. For example, in a tennis game the most important object is the ball. Each event causes the ball to move from one location to another location. Whenever a player hits the ball, the ball changes its position. To reduce the number of states, we are interested only in specific situations. For example, we are interested in whenever a player hits the ball or the ball hits the court or net.

3.2.1 States of tennis

In a tennis game, there are three objects: two players and the tennis ball. At a given instance, the objects with their locations define a state in the video. Since there are 3 objects and 13 locations identified for a tennis video, the total number of states is 133. However, it is unlikely to have these many states for a tennis game. Therefore, we create a state as long as that state exists in the database. The start states are generated based on the initial player positions. Each tennis video has a corresponding player 1 and another corresponding player 2.

Each SMART string corresponds to point-level. Tennis game can be composed of five hierarchical levels as shown in [21]: match-level, set-level, game-level, point-level, and stroke-level. Our strings correspond to the point-level but are composed of stroke-level information. Therefore, each state sequence for a string starts from a serve until one of the players makes a mistake. While indexing a clip, we are interested in the plays. Therefore, other parts of the game such as commentators and slow motion replays are not indexed by S3G.

3.2.2 Transitions of tennis (semantic events)

Now, principally, we can say that there can be two types of shots that the tennis players can make, i.e., the forehand shot and the backhand shot. Thus, we can define four types of events for a tennis video, namely a forehand shot by player1, a backhand shot by player1, a forehand shot by player2 and a backhand shot by player2. Hence, there are four types of transitions in a tennis video.

We can thus define a tennis video in terms of a series of states where each state consists of a value for the locations of player1, player2, and the tennis ball. Corresponding to the four types of possible events there can be four types of transitions.

Each transition changes a state with a set of location values for the players and the ball before that transition to another state with a new set of location values after the transition (Fig. 4). The next state is represented as S next  = t(A(O,S old )) where an event/action A by object O from state S old leads to state S new .

Fig. 4
figure 4

States and types of transitions of S3G for a tennis video

3.2.3 Example

Consider an initial state that has the values for spatial locations as 7, 10, and 7 for the player1, the player2, and the ball, respectively. This means that the player1 and the ball were in location 7, while the player2 was in location 10. Now, if the player1 hits a forehand shot and the ball goes to location 4 and so does the player2 to hit the ball back, then we will have a new state with values 7, 4, and 4 corresponding to the new location values of the player1, the player2, and the ball, respectively. In this case, the transition for the initial state to the new state should be defined as “player1 hits a forehand shot” [F(p1)] (Fig. 5).

Fig. 5
figure 5

A transition that causes a new state

3.2.4 Cyclic nature

Consider a state with the values for spatial locations as 5, 6, and 5 for the player1, the player2, and the ball, respectively. Now, there can be a transition where the player1 hits a forehand shot and the ball goes to location 6 reaching an S3G state with location values as 5, 6, and 6 for the player1, the player2, and the ball locations, respectively. In return to this player1’s shot the player2 may hit a backhand shot hitting the ball to location 5, thus going back to the initial state in the S3G as shown in Fig. 6. Thus, S3G is a cyclic graph.

Fig. 6
figure 6

Cyclic nature of S3G

3.2.5 List of states

A transition from a state can result in one of the multiple possible states. In the tennis video example, a transition from a state or a set of location values, such as a forehand shot by player1 can result in one of the multiple possible states. For example, the player1 may retain his position after hitting the shot or move to another position. Similarly the player2 can be in any of the locations in his court side. Also the player1 has multiple location options for hitting the ball. Hence for a given state, there should be a list of states for each of the four transitions as in Fig. 7.

Fig. 7
figure 7

S3G with a list of states for the different transition types

3.2.6 List of clips

Now there can be many video clips for a given set of locations values for the player1, the player2, and the ball. So each state can have many video clips corresponding to it. Therefore, an array of these clip IDs should be maintained for the corresponding clips (Fig. 8). In addition, the rank of the state in a clip is also stored along with the clip number. If multiple transitions are possible between two states for a single clip, each clip may also maintain the transition type besides the rank information. For simplicity, we do not show the ranks in the figures.

Fig. 8
figure 8

S3G with a list of clips

Finally each state in the S3G will of the form: state S i :[\( \langle A_{1} ,A_{2} , \ldots A_{n} \rangle, \langle C_{i} \rangle \)]. Here A i is an action pointer for next S3G states and C i is pointer pointing to a list of clips with ranks having the state S i . For a tennis video, we have 4 action pointers. P P1f and P P1b point to the list of states that are possible after the player1 hits a forehand shot and a backhand shot, respectively. Similarly, P P2f and P P2b point to the list of states that are possible after the player2 hits a forehand shot and a backhand shot, respectively.

3.2.7 Complexity

The number of states may look like it grows exponentially with respect to the number of objects and locations. However, the number of states cannot be more than the number of frames in a video in the worst case. A state indicates possible positioning of objects at various locations in many video clips.

4 Building S3G

We build an S3G from SMART [14, 15] strings. For example, consider a string as U7V10b4FV10b8 that states that initially player1 and the ball are in location 7 and player2 is in location 10. The player1 serves the ball that goes to location 4. The player2 in location 10 hits the ball that goes to location 8. Two states can be defined as follows. The initial state has location values 7, 10, and 7 for player1, player2, and ball locations, respectively. The ‘following’ state has location values as 7, 10, and 8 for player1, player2, and ball locations, respectively. A transition from the initial state to the following state is defined as “player1 hitting a forehand shot”.

The assumption used in string representations to reduce string sizes should be taken care of during converting them to state representation. For example, the ball location will be the same as the player location who is serving the ball. The players change their location 7 and 10 to 8 and 9 in alternating fashion, which will not be expressed explicitly in the string representations. For example, if the beginning of the video clip has player locations as 7 and 10, the ‘next’ play will have the player locations as 8, 9 and vice versa.

4.1 Insertion into S3G

The S3G is built in an incremental order as strings are being parsed. This is done by reusing the existing states in the S3G and creating new states and transitions if required.

Assume that an S3G as in Fig. 9 is already built and we need to process a new substring as “U7b10FV10b7” for a clip with ID C2134. This defines an initial state with location values as 10, 7, and 10 for the tennis ball, the player1 and the player2, respectively. Since a corresponding state in Fig. 9 does not exist, a new state (S4 in Fig. 10) has to be created. Next, player2 hits a forehand shot and the ball goes to location 7. This corresponds to an existing state (S1 in Fig. 9) with location values 7, 10, and 7 for the tennis ball, the player1, and the player2, respectively. The clip ID C2134 is added to the clip list of state S1 as in Fig. 10. In the second case, instead of creating a new state, initial state (S4) for clip C2134 is made to point to the existing state S1. The clip C2134 is added to the list of video clips of S1.

Fig. 9
figure 9

An S3G graph before inserting clip C2134

Fig. 10
figure 10

S3G graph after inserting clip C2134 into a new state and an existing state

Algorithm 1 provides the pseudo-code for converting from string to state. Here each alphabet from the SMART [14, 15] string is read. If the alphabet represents an object (players or ball) and the object location is not implied, then the successive alphabets are read to get its location value. Once all the three object locations are determined, a state with these location values is searched using the indexing structure in Sect. 4.2.1. If the state is found, then it is linked with the previous state with suitable transition type represented by the alphabets in the string denoting the event, or else a new state is created and added to the S3G by linking it with the previous state using the transition specified by the event denoting alphabet. When the strings are parsed, the initial event is ‘serve’ (not explicitly represented in the string) and represented as a forehand shot as a transition in the graph.

Algorithm 1.
figure a

Algorithm for converting strings to states

Algorithm 2 determines whether an existing state, OP, in S3G is reachable from an initial state S. First the system checks the presence of state OP. If OP is present, the set of common clip lists are found. If the intersection of clip lists is empty, state OP is not reachable from state S. If the intersection of clips is non-empty, the ranks of the same clips are compared. If the rank of a clip in state OP is more than the rank of the clip in S, the state OP is reachable from state S. In the algorithm, C(S) returns the rank of a clip C in state S. This reachability function is used in querying.

Algorithm 2.
figure b

Algorithm for searching states in S3G

An S3G can grow significantly depending on the number of states and actions involved in a given video. The recursive algorithm for finding states discussed in the previous subsection may no longer be efficient for an S3G with a large number of states. We use hashing for determining whether a state is present in the graph or not. States may not need to be created ahead of time. They may be created as those states are encountered in video clips. This will avoid the construction of an unnecessary large graph.

5 Querying

This section describes the interface to build S3G and perform spatio-temporal querying on an S3G. We generate one S3G for the complete tennis video database. Information other than spatio-temporal content about the tennis games is stored in other tables in the database. The irrelevant clips can be eliminated as the states are visited. The user interface snapshots are provided in the Appendix 1.

5.1 Input, backup, and storage for S3G

The string representation of the video according to SMART [14, 15] is used as an input for building the S3G states. The strings can be given as input by: (a) manually typing the strings or (b) the SMART [14, 15] strings stored in the GSMART database (Fig. 20 in "Appendix 1"). S3G can be mapped to a relational model and stored in the database. When the system restarts, it can easily be built without processing the strings in the database. If the S3G index is no longer needed, it can be removed from the database.

5.2 Searching a state

Once an S3G is formed, a particular state can be queried with respect to object positions. If a state is present in the S3G, the corresponding clips in the state can be viewed. The ‘Location Selector’ interface (Fig. 22 in "Appendix 3") helps to retrieve clips that are associated with a state. The location of an object is selected by first choosing the object (the player1, the player2, or the ball) and then clicking the location on the tennis court image corresponding to the required object’s location. If a state exists in the S3G, the details of that state (Fig. 23 in "Appendix 4") are displayed with a list of the clips associated with the state.

5.3 Querying

We can apply STL on the S3G and construct spatio-temporal queries. In this paper, we provide examples of ‘next’ and ‘eventually’ operators that are applied on the tennis video database. When searching for a state sequence, it is important that the states in the sequence have the common clips. Moreover, the rank of a state for the clip should increase by one in a consecutive state. For example, if a sequence has states S1 and S2 that are searched, S1 and S2 must have at least one common clip. For the common clips, the rank should increase. For example, consider a partial S3G in Fig. 11 where ranks of states are provided as subscripts of clip numbers. Clip C1 has ranks 1 and 3 for state S1, and rank 2 for state S2. This indicates that the states of C1 are in the order of S1–S2–S1. For C7, the order is S2–S1. For a sequence from S1 and S2 with a backhand shot by player 1, clip C5 will not be retrieved since clip C5 is not common in both clips. Clip C7 is common in both states, but the order of states is not correct. Clip C1 is retrieved because it has rank 1 for state S1 and rank 2 for state S2. This means that S2 follows S1 with a backhand shot by player 1 in clip C1.

Fig. 11
figure 11

A partial S3G

In the following subsections, we do not delve into ranks assuming that rank conditions are satisfied. We put more emphasis on how temporal operations are handled.

5.3.1 ‘Next’ state query

The ‘next’ query helps us retrieve all the possible next states from a given state in the S3G. Each state in the S3G can have multiple next states for each of the four kinds of transitions.

5.3.1.1 Example

In the given instance of S3G in Fig. 12, the state S2 occurs following S1 when player1 hits a backhand shot. The state S3 occurs following S1 when player1 hits a forehand shot. Therefore, a ‘next’ query for state S1, should fetch states S2 and S3:

$$ ({\mathbf{S}},{\text{S}}_{ 1} )\models \circ \left( {{\text{S}}_{ 2} } \right)\quad{\text{and}}\quad({\mathbf{S}},{\text{S}}_{ 1} )\models \circ \left( {{\text{S}}_{ 3} } \right). $$
Fig. 12
figure 12

An instance of S3G

For each of the four transition types (player1 hitting forehand shot, player2 hitting forehand shot, player1 hitting backhand shot and player2 hitting backhand shot), the ‘next’ possible states will be retrieved. The interface for querying ‘next’ states are provided in "Appendix 5". The ‘next’ state query can also be applied multiple number of times in succession to obtain the clips with all the selected states.

5.3.1.2 Example

In Fig. 13, if S4 is selected as the current state, its next states will be S1 and S5. The states S4 and S1 have clip C7 as a common clip, while the states S4 and S5 have clip C1 as common. If S5 is selected as the required ‘next’ state and the ‘next’ state is applied to S5, the state S2 will be obtained as they have clip 11 as common, but state S4 does not have this clip. Hence, no clips are retrieved.

Fig. 13
figure 13

An instance of S3G

5.3.2 ‘Eventually’ state query

The ‘eventually’ query helps users check if a state eventually occurs in future after a particular state has already occurred in the S3G. For given two S3G states, an initial state and a final state, it can be checked if the final state eventually occurs after the initial state has occurred by checking if it is possible to reach the final state from a given start state in the S3G .

5.3.2.1 Example

Figure 14 provides an instance of S3G where S1 occurs following the state S4 when player2 hits a forehand shot. The state S2 occurs following state S1 when player1 hits a backhand shot. The state S3 occurs when player1 hits a forehand shot after the state S2. Therefore, running an ‘eventually’ query with S4 as a start state and S3 as a final state succeeds while an eventually query with S2 as a start state and S4 as a final state does not:

$$ ({\mathbf{S}},{\text{S}}_{ 4} )\models \diamondsuit \left( {{\text{S}}_{ 3} } \right){\text{ but }}({\mathbf{S}},{\text{S}}_{ 2} )\models \diamondsuit \left( {{\text{S}}_{ 4} } \right){\text{ is false}}. $$
Fig. 14
figure 14

An instance of S3G

Consider the example shown in Fig. 15 where a state with locations 7, 10, and 4 are set for the player1, the player2, and the ball positions, respectively, for the initial state. When the user clicks “Set as initial state”, this initial state is shown as a thumbnail in Fig. 15. To check if a state ‘eventually’ occurs after the initial state chosen, the ‘Eventually Check’ in Fig. 16 is selected (see Appendix 6 for more information).

Fig. 15
figure 15

Initial state is set

Fig. 16
figure 16

‘Eventually’ checking

5.3.3 Combining ‘next’ and ‘eventually’ queries

The ‘next’ and the ‘eventually’ state queries can be called multiple times in any order. In Fig. 17, an instance of S3G is given where S1 is a state that is directly reachable from the state S4 after player2 hits a forehand shot. The state S2 is directly reachable from S1 after Player1 hits a backhand shot. The states S5 and S3 are directly reachable from the state S2 after the transitions resulting from “player1 hitting a forehand shot” and “player2 hitting a backhand shot”, respectively. Therefore, an ‘eventually’ query with S4 as the start state and S2 as the final state will succeed. A ‘next’ query on the state will retrieve states S3 and S5.

$$ ({\mathbf{S}},{\text{S}}_{ 4} )\models \diamondsuit \left( {{\text{S}}_{ 2} } \right) \wedge ({\mathbf{S}},{\text{S}}_{ 2} )\models \diamondsuit \left( {{\text{S}}_{ 3} } \right) \quad{\text{and }}\quad({\mathbf{S}},{\text{S}}_{ 4} )\models \circ \left( {{\text{S}}_{ 2} } \right) \wedge ({\mathbf{S}},{\text{S}}_{ 2} )\models \circ \left( {{\text{S}}_{ 5} } \right) $$
Fig. 17
figure 17

An instance of S3G

Similarly a ‘next’ query on the state S4 gives the state S1 and then an ‘eventually’ query with S1 as the start state and S5 as the final state will also succeed.

$$ ({\mathbf{S}},{\text{S}}_{ 4} )\models \circ \left( {{\text{S}}_{ 1} } \right) \wedge ({\mathbf{S}},{\text{S}}_{ 1} )\models \diamondsuit \left( {{\text{S}}_{ 5} } \right). $$

Our application has the feature of graphically displaying the states of a query to simplify the process of long queries that result from a series of eventually and next state queries as shown in Fig. 18. When a new query is submitted, the image of previous state will be retrieved from a location where all possible state images are stored. This image is then displayed. Thus, S3G application helps to develop queries interactively. The user can continue building queries on the results obtained from his previous subqueries, thus enabling creation of lengthy queries dynamically instead of running just a fixed query.

Fig. 18
figure 18

Query building history

Other query types using ‘releases’ or ‘until’ operator can also support queries such as “retrieve clips where both players are at the baseline until one player runs to the net”. In our application, these queries can be represented with a sequence of ‘next’ queries. So, this type of queries is not implemented separately.

6 Experiments

In this section, we first provide experimental results and then provide discussion on extendibility for other videos.

6.1 Performance

6.1.1 Comparison

We compare our method with spatio-temporal query patterns (STP) [11]. There are improvements to STP method with assumptions [20] which are not applicable to our case (e.g., an object should not visit the same location again). STP provides indexing based on positions usually divided into grids. STP index keeps the time when objects visit a specific location. There are several issues with STP: (a) STP supports querying the trajectory of a single object; (b) there is a single timeline for trajectories; (c) there are no semantic events between changing positions; and (d) the gap between any consecutive steps could be anything. To compare our method with STP, the following changes are made for STP. (1) Our queries include multiple objects; therefore, we first need to query STP for each object and check the results are synchronized (i.e., check whether objects appear at the corresponding positions at the same time). (2) After each tennis point (tennis match is composed of sets, which are composed of games, which are composed of points), the positions of players are reset. There is no need to check the positions of objects across points. We need to keep STP index structure for each point. A sequence of positions across multiple points is not related to each other and meaningless. (3) We consider all semantic transitions (all types of shots) for S3G since STP does not support transitions. (4) We consider eventual queries to compare S3G with STP since STP is not designed to support next-querying.

6.1.2 Building S3G

Since a small number of games are not enough to check the performance of indexing structures, we have simulated tennis games to measure the complexity of building S3G and querying using S3G. We have generated around 10,000 points (sequences or strings or clips). Each game has around 100 points. Table 1 shows the statistics for building S3G and STP index structures where |IS| is the number of input sequences (or strings) and |s| is the number of states (note |s| is only applicable to S3G). These timings only include the time for building the index structures and ignore the time for decoding the input strings since strings are decoded differently for each index structure.

Table 1 Time for building index structures and searching with 2 and 3 states for S3G and STP

6.1.3 Complexity

It takes 35 ms to build S3G for 10,000 points. Table 1 shows that the number of states reaches saturation when 4,096 input sequences are processed and barely increases after that. Our experimental results indicate that the time complexity is O(|IS|) + O(|s|). Since |IS| ≫|s| and the number of states reaches a saturation point, the complexity of building S3G is O(|IS|) for large databases. The time complexity of inserting a state into S3G is O(1). Similarly, the time complexity of inserting a clip into a state in S 3 G is O(1): finding the state using hashing and adding the clip to the end. Figure 19 shows the comparison of building index structures. Building S3G is faster than STP despite S3G also includes the time to build the index structure with respect to the player shots, whereas in STP those transitions are ignored. Although a separate STP index is built for each clip (this results in simple STP index structures), building a complex and single S3G index was faster than building STP index structures. If a single STP index structure had been built, the querying would not complete in a reasonable time for STP index.

Fig. 19
figure 19

Time for building index structures

6.1.4 Querying

We compare the performance of S3G and STP on ‘eventually’ queries. Table 1 provides the experimental results when 10,000 points are inserted into the database where |q| indicates the number of queries. A query returns the clips where a destination state is reachable from an initial state. We provide the results with 2 and 3 states (or predicates) for ‘eventually’ queries. S3G significantly outperforms STP that we cannot put them into a single graph. Using S3G, 8,196 queries for 3 states are completed in 600 ms, whereas it takes around 60 s for STP. Using S3G for queries where there is an initial state and a final state just took 153 ms. Figure 19 shows the results of querying for S3G. Note that the time for querying provides the total time for |q| queries. The cost of a ‘Next’ query is O(1) + O(1). The first part locates the current state using hashing and the second part locates the next state using the corresponding event.

6.2 Discussion on extendibility

The domain of applications should have the following properties: (a) the video data should be discretizable in both spatial domain and temporal domain; (b) the states should be repeated to get benefit from this indexing method, and (c) there should be preferably semantic events (or transitions) that indicate change from one state to another state. The transitions and discretization are closely related to each other. Now, we give examples of these in several different domains.

We may group sports based on the layout of the field, instruments involved in the game, the number of players, and so on. In some sports ranking is important: auto racing, swimming, and cycling. The events or transitions include completion of a lap, taking a pit stop, or passing of a player by another player. The states may involve the rank of players, the lap, and their locations on the field. A combination of lap and rank could be a better option for auto racing rather than the exact locations of cars. Another set of group of games include games where the field is divided into multiple sections. Examples include tennis, table tennis, volleyball, and badminton. Tennis example has already been covered with some detail. For volleyball the events could be passing, block, or spike, etc. Another set of games include games that are played on a board (e.g., chess). In chess, player moves are transitions, and pieces on board make the states. In a single game, it may be rare to repeat states but it is possible to have states repeated in multiple games. 8 moves of a chess game by Kasparov against World Team are: “1.e4 c5 2.Nf3 d6 3.Bb5 + Bd7 4.Bxd7 + Qxd7 5.c4 Nc6 6.Nc3 Nf6 7.0-0 g6 8.d4 cxd4”.

Another set of games include games where players have almost no restriction to move on the field. These games include soccer, football, basketball, ice hockey, etc. The research on these games includes video processing, game summary processing, and commentator speech processing. Not all players might be involved at a play during the game. The positions of players as well as their actions might be important. We investigated games from National Football League (NFL). The critical parts of the state are: the ball location, team having the ball, down-and-distance, and optional team formation [e.g., State (Team: GB; Ball: NO-37; Down; 1-10; Formation: shotgun)]. Events have more information about the event type such as rushing, passing, sacks. Events may also have information who is involved in the event (e.g., player A passes the ball to player B which is then stopped by player X of the opponent team). In baseball, the players at the bases, pitcher, batter, catchers, etc. form the states. Home-run, strike, ball, etc. correspond to transitions. In the golf, the player follows holes. The position of the ball after each hit and the hole number may correspond to a state. The hit may correspond to a transition.

If the number of states grows significantly, we propose two ways of reducing the number of states: (a) decrease the number of locations by increasing the sizes of locations and (b) use only players who are active in the play for that state.

7 Conclusion

In this paper, we introduced a new indexing method, semantic sequence state graph (S3G), for spatio-temporal data using the semantic contents of the video. S3G utilizes semantic events as transitions in the events while the spatial information is maintained in states. The semantic events result in temporal ordering among the states. The ‘next’ and ‘eventually’ operations of LTL have been implemented using the S3G platform on the tennis video database. Both these types of queries can be applied multiple times in combination to form a complex query. A sample application of S3G is shown for a tennis video database application. As future work, we plan to apply S3G in other domains. We also plan to provide more efficient ways of managing temporal queries of S3G. We also intend to create new scenarios of events, objects, and locations by reusing the video clip contents and traversing S3G. S3G can handle coupled STQ with combination of ‘next’ and ‘eventually’ queries. S3G suits well for applications where there is a finite domain of objects, events, and locations where repetitions are frequent. S3G also helps the user develop the temporal components of a query by a user interface that shows a history of the built query. The user can see whether his query will lead to any result or not while building his query.