Spatio-temporal querying recurrent multimedia databases using a semantic sequence state graph

Naik, M. M.; Sigdel, M.; Aygun, R. S.

doi:10.1007/s00530-011-0255-8

Spatio-temporal querying recurrent multimedia databases using a semantic sequence state graph

Regular Paper
Published: 06 December 2011

Volume 18, pages 263–281, (2012)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Multimedia Systems Aims and scope Submit manuscript

Spatio-temporal querying recurrent multimedia databases using a semantic sequence state graph

Download PDF

M. M. Naik¹,
M. Sigdel¹ &
R. S. Aygun¹

254 Accesses
10 Citations
Explore all metrics

Abstract

We present an indexing method for spatiotemporal data: semantic sequence state graph (S³G). S³G maintains objects with their locations as states and events as transitions. The spatial information is maintained in states while the semantic events result in temporal ordering between the states. If the objects visit the same locations repeatedly, we call such databases as recurrent databases. Our querying interface supports queries based on spatio-temporal logic that includes operators such as ‘next’ and ‘eventually’. The interactive querying interface enables the user to build the query interactively and see the intermediate results of the query.

Analyzing Temporal Keyword Queries for Interactive Search over Temporal Databases

Supporting Situation Awareness in Spatio-Temporal Databases

Article 10 October 2016

A Survey of Spatio-Temporal Database Research

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

With the growing interest in spatio-temporal data for more than a decade, the indexing and retrieval of spatio-temporal data has been a challenging research area. Multimedia data play a key role in today’s world including but not limited to education, advertisement, entertainment, communication, and information retrieval. Especially, videos have been the most intriguing media since videos have multimodal features along with spatio-temporal properties. Everyday many videos are uploaded on websites such as You Tube [41] and Google Video [10]. Various strategies have focused on modeling of different aspects of videos such as modeling fuzzy information [4] and spatio-temporal features of the objects in a video [24, 30].

The goal of our research is to model, store, query, and index the semantic contents of the videos. We classify spatio-temporal querying (STQ) into two based on how STQ can be executed:

Split STQ This query usually targets objects that satisfy some spatial constraints within a period of time or vice versa. The data satisfy the spatial constraints independent of historical data as long as the data belong to the given time domain. A sample query is as follows: “Give a list of regions, where the ball appears between fifth and tenth minutes” [19].
Coupled STQ This query usually targets objects that satisfy some (sequential) spatial constraints over a period of time. Series of consecutive or nonconsecutive constraints are part of a query. A sample query is “return videos having a Delta plane take-off right before United plane take-off but after Continental plane landing in Huntsville, Alabama.” Complex spatio-temporal patterns (STP) [11] are an example of coupled STQ.

Without generalization, the spatio-temporal indexing methodologies can be divided into two: tree-based and sequence matching. Tree-based methods require that a hierarchical organization of data is possible. In most cases, this hierarchical organization is achieved with respect to spatial properties. Tree-based approaches are effective for split STQ. The coupled STQ with tree-based indexing can be achieved by linking subqueries of a STQ. However, tree-based indexing may lose its efficiency if sequence information is queried rather than time interval due to comparing the data for all time intervals. Sequence-matching approaches can deal with temporal sequence matching. However, a proper indexing methodology is required before starting sequence matching otherwise the number of sequence matchings may be significantly high.

A good indexing structure should address the requirements of the data model as well as querying. We define a recurrent database as a database where the content or objects of interest with corresponding actions are repeated frequently. In other words, objects are associated with spatial locations, and a possible event (or an action) causes the change of these locations. If the domains of spatial locations, events, and actions that cause the change of locations are finite, the objects are likely to appear at the same locations (due to some actions) in the database. The camera is usually mounted on an almost-static platform for capturing the environment. In this sense, almost all sports videos are a subset of recurrent databases. Other examples include news-anchor, distance-learning education, and surveillance videos. Our goal in this paper is to index such recurrent databases.

We believe that spatio-temporal content of video is important for querying. Rather than a semantic instantaneous event (e.g., “missed shot”) in a clip, we are interested in a sequence of actions and objects with their locations. When we compare various spatio-temporal techniques, a general assumption on the database is the presence of a single timeline. However, it is likely that there may be more than one timeline. For example, each timeline maybe set for a different day. In such a case, the ordering of timelines may not be an important factor for STQ. The tree-based and sequence-matching methods should adapt to the presence of multiple timelines.

Traditional indexing methods cannot achieve the goals of semantic databases that maintain high-level information since they are usually designed based on non-semantic properties. Most common indexing techniques use comparison operators to retrieve the requested data from the database. The semantic querying can only be achieved by having additional layer on top of traditional indexing methods.

1.1 Our approach

In this paper, we propose an indexing and search method that directly targets coupled STQ in a recurrent database. In order to provide an efficient storage and retrieval of video data, a semantic modeling and retrieval system, SMART [14, 15] was proposed. In this paper, the SMART system is improved by providing a novel indexing method, named as S³G: a semantic sequence state graph. An early version of S³G was introduced in [27]. The major difference between this indexing method and the traditional ones is that in S³G the links between states have semantics where states maintain the discrete information about the spatial properties of objects. The events correspond to the transitions in S³G graph. Since transitions correspond to semantic events, it is possible to perform queries based on semantic concepts following the transitions in S³G. We should note that we are not interested in the shapes of the objects. We are rather interested in where and when they appear. We are interested in spatio-temporal events that can be denoted at discrete times. We assume that a semantic event causes the difference between two states. The spatial queries are performed with respect to the object–location pairs. Temporal queries based on Linear Temporal Logic are performed by following the transitions in S³G graph. Spatio-temporal queries use spatio-temporal logic. The spatio-temporal queries, such as what will be the next state (i.e., the spatial properties of objects following a transition on a given state) and whether a particular state eventually occurs in future after a given state has occurred, are implemented using S³G. Graphical viewing of the query construction is implemented to facilitate easy building complex queries.

The contributions of our approach can be summarized as follows: (a) provides a compact representation of a video database; (b) supports recurrent databases; (c) indexes a database that has multiple sequences with different timelines onto a single data structure; (d) links the states with semantic events (or actions); and (e) provides an interactive querying interface where the intermediate results are provided and help the user refine his query.

This paper is organized as follows. The following section provides the background and discusses about related work. Section 3 describes the S³G. Section 4 explains building S³G. The interface and querying are explained in Sect. 5. The last section concludes the paper.

2 Background and related work

In this section we firstly describe the related work; secondly, provide background on our SMART system; and finally, explain the spatio-temporal logic used in this paper.

2.1 Related work

Significant research has been performed on indexing temporal and spatial databases [17, 26]. A good survey on indexing temporal data appears in [34]. A recent survey on spatio-temporal video retrieval is presented in [31]. STP [11] provides indexing based on spatial locations. For each spatial location, objects are sorted based on their identifiers. If an object visits the same location, then visits of an object are sorted based on the time they visit the location. Lagogiannis et al. [20] improve this approach if an object does not visit the same location again. Indexing spatio-temporal data is mostly based on traditional database indexing techniques such as R-trees [33] and B+-trees [25]. One of the major problems of indexing based on these indexing methods is the lack of necessary semantics to build queries that require semantic information. Therefore, semantic retrieval on top of these indexing methods can only be implemented by an upper layer of semantic operations. This puts an additional burden on the retrieval. If the indexing method could capture semantic properties, the retrieval efficiency could be improved.

2.1.1 Tree-based indexing

Tree-based methods cannot be used for semantic querying directly since tree-based methods assume that hierarchical organization of data is possible. However, no order can be imposed between the semantic values. The indexing strategy should be able to order semantic components of the database. The ordering between data at different time instants is not necessarily before–after relation. The ordering may also include the cause or event between different time instants. Traditional temporal querying includes storing starting and stopping time of clips and then comparison of these starting and ending points for querying based on Allen’s temporal logic based on intervals [1].

A survey of indexing methods for multimedia databases that includes spatial data indexing is provided in [7]. The most common way of indexing spatio-temporal data is based on tree indexing. The RS-tree [28], X-tree [5], weR-tree [6], NV-tree [23], SMILe-tree [22], Bx-tree [16], ST²B –tree [8], TPR-tree [32], TPR*-tree [35], and 2ⁿ index tree [39] are tree-based indexing methodologies.

2.1.1.1 R-tree-based indexing

RS-tree [28] uses R-tree for spatial properties. The non-spatial properties are indexed with respect to R-tree using an S-tree which has the same structure to prune the results of querying the R-tree given non-spatial properties. X-tree [5] improves R-trees by incorporating supernodes to deal with overlaps in R-trees. Supernodes avoid splits in the directory that would cause inefficient directory structure. Hence, X-tree looks like a combination of R-tree like and linear array like directory. Weighted R-trees (weR-trees) [6] propose weight-balanced R-trees for dynamic manipulation of large datasets. The weight is based on the number of values and fan out of a node. The goal is to produce a partial rebuilding by gradual construction of subtrees. Time-parameterized R-tree (TPR-tree) [32] uses spatial positions and velocities in each dimension to index data. An object that can move in 2D space will have four parameters. Moving points are bounded by time-parameterized rectangles and then indexed by R-trees. The queries are assumed to be applied for a window of time. TPR*-tree [35] improves TPR-tree by reducing the cost of insertions and deletions. These methods can aim either past or future but not both since outdated objects are deleted [35]. SMILe-tree [22] is based on K-d trees and R+ trees. At the first level, the data is organized according to the K-d tree with corresponding dimensions at each level. For hierarchical organization of overlapping components, R+ trees are used to avoid overlappings.

2.1.1.2 B-tree-based indexing

B ^x-tree [16] is based on B+-tree. Object locations are mapped to single-dimensional values using space-filling curves. The maximum time between two updates is partitioned into n phases. An object with its location is mapped to the corresponding partition of B^x-tree based on its timestamp. It allows queries after the current time. Old partitions of B^x-tree are removed by appending new partitions. B^x-tree returns false hits as it only uses location of the objects. B ^dual-tree [40] partitions data in dual space, i.e., both location and velocity, to reduce the number of false hits. So it uses location and velocity to obtain the one-dimensional key. Each internal entry of a B^dual-tree maintains a set of moving rectangles, which leads to use of R-tree like query algorithms. Maintaining moving rectangles leads to high computation overload by slowing down the update operations. A self-tunable spatio-temporal B+-tree (ST²B-tree) [8] partitions the space with respect to reference points (similar to Voronoi diagram). Each Voronoi cell is assigned a grid, and an object is assigned to the grid of the closest reference point. It has two phases for time; and different reference points are used to deal with data diversity. It supports range queries for space only. In multicurves [36], each curve is responsible for a subset of the dimensions to reduce the boundary effects inside each curve. The dimensions of data elements are split among a number of space-filling curves. The projections of data are mapped to curves and the one-dimensional coordinates of data are computed and stored in a sorted list.

2.1.1.3 Other tree-based indexing

STRIPES index [29] maps 2D moving objects to 4D points using Hough-X transform and then stores them in a PR bucket-quad tree. The objects are represented in a dual transformed space. The trajectories in (d + 1) dimensional space are mapped to 2D dual space. STRIPES has high query cost and storage size. Nearest-vector-tree (NV-tree) [23] projects data on a line and segments the data and then re-projects data recursively until each segment has enough number of data points. NV-tree allows overlapping of boundaries to overcome the problem of nearest-neighbor search when two close neighbors may belong to different partitions. 2 ⁿ index tree [39] has states that include 3D positions, orientation, and speed of objects. Objects are represented as states with timestamps. Objects with limited number of timestamps can be indexed by 2ⁿ index tree.

2.1.2 Sequence matching

Sequence-matching techniques try to find the closest sequence given a proper sequence without gaps. It is also possible that the query sequence can be partially described with some gaps in the sequence. For example, the beginning and ending of a query sequence might be provided. In addition, some sequences might have a cyclic nature where some parts of the sequence may be repeated several times. However, most sequence-matching methods use proper sequence matching without any gaps.

Embedded-based subsequence matching (EBSM) method [3] maps a query to a sequence of vectors and then closest vectors are searched in the database. Further exploration for close sequences is handled by dynamic time warping based on subsequence-matching algorithm without gaps. Spatiotemporal sequence matching is used for video copy detection in [18]. Two distances are obtained: spatial distance using ordinal measures of 2 × 2 partitions of the video frames and temporal distance based on the changes in the subsequent frames. This method targets exact match between sequences (i.e., copy detection).

2.1.3 Representation

In [12], an activity is considered to be composed of action threads. A single thread action is represented by a stochastic finite automaton of event states. Events are represented as Bayesian networks (which are acyclic graphs) and event constraints such as “event (or sub-event) A should occur before event (or sub-event) B; and B should occur before event (or sub-event) C” are used to identify an event (not for retrieval). While Hongeng et al. [12] deal with event recognition, our S³G is used to retrieve the clips with desired state (object location values). In [38], an extension to MPEG-7 is proposed to detect actions such as “A follows B”. Assfalg et al. [2] propose a system that semantically annotates the sports videos at different layers of semantic significance. It, however, uses semantic annotations for multimedia indexing and retrieval without using a graphical representation of the semantic data. Lay and Guan [21] propose the use of a grammar for video retrieval. They use an adjacency matrix to determine the behavior of player such as baseline player and inverted index to retrieve clips based on object locations. They mention about the use of operators such as ADJ (temporally adjacent), W (within as a spatial operator), and temporal ordering with < operator. However, there is no information about the interface for providing queries or the complexity of implementing queries using these operators based on the proposed indexing structures.

2.2 Background on SMART

SMART [14, 15] represents semantic information as a grammar-based string that enables spatio-temporal queries in SQL query language. The player names in sports video, the shots given by the players, and their score are the examples of the semantic content of a video. This semantic information is represented as a grammar-based string that enables spatio-temporal queries in SQL query language. The major semantic contents of video are considered as objects, events, locations, and cameras in SMART. SMART can be applied to most sports games. We use tennis game as an example since it has an important advantage over most sports videos: the complete view of the game can be captured with one almost-static camera. Three main objects, as Σ_O = {U, V, b}, were identified. Here, U and V represent the players, and b is the ball. Two events, as Σ_E = {F, B}, were identified. Here, F represents the forehand shot and B represents the backhand shot. The tennis court was divided into various regions as shown in Fig. 1a: Σ_L = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, N} where N represents the net. Six different camera views are considered for tennis videos: A, gives a close view of the player at locations 7 and 8 in Fig. 1a; B, gives a close view of the player at locations 9 and 10; C, court view; D, action replay; and R, rest time; Com, commentators. Since we focus only actions in the game, rest time and commentators are not included in our strings.

2.2.1 An example

Figure 1b shows initially the close view of the player1 is captured by camera A. The play is then captured by the court view. The player1 and player2 are at locations 8 and 9, respectively. This sequence can be represented as {A [U] C [U8 b8 V9 b3 Bv9 b5 BU8 b4 FV10 b5] D[]}. The player1 serves and the ball hits location 3 (Fig. 1c). The player2 hits a backhand shot and the ball hits location 5 (Fig. 1d). Again the player1 hits a backhand shot at location 8 returning the ball to location 4. The player2 hits a forehand shot at location 10 (Fig. 1e) and the play ends with the ball going to location 5. This sequence is then replayed by camera D.

2.2.2 Grammar for tennis game

SMART’s generic grammar was extended to incorporate the semantics of tennis game. Figure 2 provides a grammar for the tennis videos: video is a sequence of clips; a sequence of clips has many clips; a clip has a camera view and a sequence of spatiotemporal instances; a sequence of spatiotemporal instances has a spatiotemporal instance; and a spatiotemporal instance (spt) is represented with an object (obj), location (loc), and an optional event.

2.3 Spatio-temporal logic (STL)

Temporal logic allows to analyze the properties of a system with respect to time. Linear temporal logic (LTL) [37] is a type of temporal logic, that is used to analyze a system that is considered to be composed of a vertex-labeled path $ S_{0} ,S_{1} , \ldots ,S_{n} , $ where each vertex S _i corresponds to a point in time as shown by Fig. 3a. We plan to use the STL notation provided in [9] with some modifications. Temporal assertion, Θ, on a state diagram (or state graph) σ, for a set of scene sequences is expressed as Θ ≔ (σ, S) ≔ θ where S is a state in the state graph, σ, and θ is a temporal formula which is formed by combining spatial assertions Φ with Boolean connectives such as ∧, ∨, ¬, → and temporal operators. A state is a set of object, location pairs. The temporal operators are global (□), next (∘), eventually (◊), until (U), and releases (R) operators. The temporal formula can be expressed as:

$$ \theta\,{:}{=}\, \Upphi \left| {\square \Upphi } \right|\circ \Upphi \left| {\diamondsuit \Upphi } \right| \, \Upphi_{ 1} {\text{U}}\Upphi_{ 2} | \, \Upphi_{ 1} {\text{R}}\Upphi_{ 2} \left| {\neg \, \Upphi } \right| \, \Upphi_{ 1} \vee \Upphi_{ 2} |\Upphi_{ 1} \wedge \Upphi_{ 2} |\Upphi_{ 1} \to \Upphi_{ 2} $$

For a given state a graph, S, these temporal operators are briefly explained as follows:

Global/Always This operator is denoted as “G” or “□”. GΦ or □Φ, where Φ represents a propositional formula, implies that the condition specified by Φ should be satisfied through the path of vertices. (S, S₁) ⊨ □Φ implies that Φ is true for all states starting from state S₁ in graph S (Fig. 3b).
Eventually/Finally This operator is denoted as “F” or “◊”. FΦ or ◊Φ implies that the condition specified by Φ should be satisfied eventually at a point of time in the given path of vertices. (S, S₁) ⊨ ◊Φ implies that Φ becomes eventually true after state S₁ in graph S (Fig. 3c).
Next This operator is denoted as “X” or “∘”. XΦ or ∘Φ implies that the condition specified by Φ should be satisfied by the next vertex in the given path of vertices. (S, S₁) ⊨ ∘Φ implies that Φ becomes true in the state following state S₁ in graph S (Fig. 3d).
Until This operator is denoted as “U”. ψUΦ, where ψ and Φ represent propositional formulae, implies that the condition specified by ψ should be satisfied until a particular vertex in the given path of vertices has Φ satisfied. (S, S₁) ⊨ ψUΦ implies that ψ is true until Φ becomes true starting from state S₁ in graph S (Fig. 3e).
Releases This operator is denoted as “R”. ψRΦ implies that the condition specified by Φ should be satisfied through the path of vertices, until the first vertex in the given path of vertices, has ψ satisfied. (S, S₁) ⊨ ψRΦ implies that Φ is true until ψ becomes true starting from state S₁ in graph S (Fig. 3f). If there is no vertex that satisfies the condition specified by ψ, then the condition specified by Φ will be satisfied throughout the path of vertices.

In this research, the temporal operators ‘Next’ and ‘Eventually’ have been implemented for supporting the spatio-temporal querying on the S³G, while ‘Global’ concept is used for optimizing the indexing of S³G states built for a tennis video database as an application.

3 S³G: a semantic sequence state graph

S³G is a graph where the events, objects, and locations are maintained as states and transitions. S³G resembles to a non-deterministic finite state machine. An S³G can be defined as M = [S, Σ, δ, s₀, F] where S is a set of internal states, Σ is a finite input alphabet, s₀ is the initial state, F is a set of external states, and δ is a transition function mapping S × Σ to S ∪ F. An internal state s (s ⊆ S) is a set of object–location pairs. An internal state is a subset of cross product of objects spatial locations (S ⊆ Σ_O × Σ_L). An external state f ⊆ F corresponds to a decision in a sequence of events. The alphabet (Σ) is a subset of the cross product of the object–event pairs (Σ ⊆ Σ_O × Σ_E). It should be noted that not all objects are associated with an event. S³G is cyclic in nature. Each of its nodes represents a state giving the locations of the objects involved in the video. Each node has also a list of clips displaying the corresponding state.

3.1 Components

The components of SMART grammar have events, objects, locations and camera view as the alphabet. The most important part in building an S³G is the identification of transition function and the states.

3.1.1 States

A state is identified by objects and their corresponding locations. In addition to object–location pairs, states also maintain pointers for quick retrieval of objects. For each state, there is a list of clip pointers to retrieve the corresponding clips having that state. The maximum number of states for a video is determined by $ \prod\nolimits_{i = 1}^{{|\sum_{O}|}} {(|\sum_{L} |) = |\sum_{L} |^{{|\sum_{O}|}}} $.

3.1.2 Transitions

The transitions are determined by the semantic events. In our applications, the semantic events result in the displacement of objects. An action is generally continuous. Dividing an action into discrete steps is critical in building an S³G. Determining discrete steps is decided by semantic events/actions.

3.2 An example: tennis video

A video can be considered as a collection of a series of events and a series of states due to these events. Some videos such as a sports video can have finite types of states and events. Since the details of S³G can be expressed in a simple way using a tennis video, it is chosen as an S³G application in this paper. For example, in a tennis game the most important object is the ball. Each event causes the ball to move from one location to another location. Whenever a player hits the ball, the ball changes its position. To reduce the number of states, we are interested only in specific situations. For example, we are interested in whenever a player hits the ball or the ball hits the court or net.

3.2.1 States of tennis

In a tennis game, there are three objects: two players and the tennis ball. At a given instance, the objects with their locations define a state in the video. Since there are 3 objects and 13 locations identified for a tennis video, the total number of states is 13³. However, it is unlikely to have these many states for a tennis game. Therefore, we create a state as long as that state exists in the database. The start states are generated based on the initial player positions. Each tennis video has a corresponding player 1 and another corresponding player 2.

Each SMART string corresponds to point-level. Tennis game can be composed of five hierarchical levels as shown in [21]: match-level, set-level, game-level, point-level, and stroke-level. Our strings correspond to the point-level but are composed of stroke-level information. Therefore, each state sequence for a string starts from a serve until one of the players makes a mistake. While indexing a clip, we are interested in the plays. Therefore, other parts of the game such as commentators and slow motion replays are not indexed by S³G.

3.2.2 Transitions of tennis (semantic events)

Now, principally, we can say that there can be two types of shots that the tennis players can make, i.e., the forehand shot and the backhand shot. Thus, we can define four types of events for a tennis video, namely a forehand shot by player1, a backhand shot by player1, a forehand shot by player2 and a backhand shot by player2. Hence, there are four types of transitions in a tennis video.

We can thus define a tennis video in terms of a series of states where each state consists of a value for the locations of player1, player2, and the tennis ball. Corresponding to the four types of possible events there can be four types of transitions.

Each transition changes a state with a set of location values for the players and the ball before that transition to another state with a new set of location values after the transition (Fig. 4). The next state is represented as S _next = t(A(O,S _old)) where an event/action A by object O from state S _old leads to state S _new.

3.2.3 Example

Consider an initial state that has the values for spatial locations as 7, 10, and 7 for the player1, the player2, and the ball, respectively. This means that the player1 and the ball were in location 7, while the player2 was in location 10. Now, if the player1 hits a forehand shot and the ball goes to location 4 and so does the player2 to hit the ball back, then we will have a new state with values 7, 4, and 4 corresponding to the new location values of the player1, the player2, and the ball, respectively. In this case, the transition for the initial state to the new state should be defined as “player1 hits a forehand shot” [F(p₁)] (Fig. 5).

3.2.4 Cyclic nature

Consider a state with the values for spatial locations as 5, 6, and 5 for the player1, the player2, and the ball, respectively. Now, there can be a transition where the player1 hits a forehand shot and the ball goes to location 6 reaching an S³G state with location values as 5, 6, and 6 for the player1, the player2, and the ball locations, respectively. In return to this player1’s shot the player2 may hit a backhand shot hitting the ball to location 5, thus going back to the initial state in the S³G as shown in Fig. 6. Thus, S³G is a cyclic graph.

3.2.5 List of states

A transition from a state can result in one of the multiple possible states. In the tennis video example, a transition from a state or a set of location values, such as a forehand shot by player1 can result in one of the multiple possible states. For example, the player1 may retain his position after hitting the shot or move to another position. Similarly the player2 can be in any of the locations in his court side. Also the player1 has multiple location options for hitting the ball. Hence for a given state, there should be a list of states for each of the four transitions as in Fig. 7.

3.2.6 List of clips

Now there can be many video clips for a given set of locations values for the player1, the player2, and the ball. So each state can have many video clips corresponding to it. Therefore, an array of these clip IDs should be maintained for the corresponding clips (Fig. 8). In addition, the rank of the state in a clip is also stored along with the clip number. If multiple transitions are possible between two states for a single clip, each clip may also maintain the transition type besides the rank information. For simplicity, we do not show the ranks in the figures.

Finally each state in the S³G will of the form: state S _i :[$ \langle A_{1} ,A_{2} , \ldots A_{n} \rangle, \langle C_{i} \rangle $]. Here A _i is an action pointer for next S³G states and C _i is pointer pointing to a list of clips with ranks having the state S _i. For a tennis video, we have 4 action pointers. P _P1f and P _P1b point to the list of states that are possible after the player1 hits a forehand shot and a backhand shot, respectively. Similarly, P _P2f and P _P2b point to the list of states that are possible after the player2 hits a forehand shot and a backhand shot, respectively.

3.2.7 Complexity

The number of states may look like it grows exponentially with respect to the number of objects and locations. However, the number of states cannot be more than the number of frames in a video in the worst case. A state indicates possible positioning of objects at various locations in many video clips.

4 Building S³G

We build an S³G from SMART [14, 15] strings. For example, consider a string as U7V10b4FV10b8 that states that initially player1 and the ball are in location 7 and player2 is in location 10. The player1 serves the ball that goes to location 4. The player2 in location 10 hits the ball that goes to location 8. Two states can be defined as follows. The initial state has location values 7, 10, and 7 for player1, player2, and ball locations, respectively. The ‘following’ state has location values as 7, 10, and 8 for player1, player2, and ball locations, respectively. A transition from the initial state to the following state is defined as “player1 hitting a forehand shot”.

The assumption used in string representations to reduce string sizes should be taken care of during converting them to state representation. For example, the ball location will be the same as the player location who is serving the ball. The players change their location 7 and 10 to 8 and 9 in alternating fashion, which will not be expressed explicitly in the string representations. For example, if the beginning of the video clip has player locations as 7 and 10, the ‘next’ play will have the player locations as 8, 9 and vice versa.

4.1 Insertion into S³G

The S³G is built in an incremental order as strings are being parsed. This is done by reusing the existing states in the S³G and creating new states and transitions if required.

Assume that an S³G as in Fig. 9 is already built and we need to process a new substring as “U7b10FV10b7” for a clip with ID C2134. This defines an initial state with location values as 10, 7, and 10 for the tennis ball, the player1 and the player2, respectively. Since a corresponding state in Fig. 9 does not exist, a new state (S₄ in Fig. 10) has to be created. Next, player2 hits a forehand shot and the ball goes to location 7. This corresponds to an existing state (S₁ in Fig. 9) with location values 7, 10, and 7 for the tennis ball, the player1, and the player2, respectively. The clip ID C2134 is added to the clip list of state S₁ as in Fig. 10. In the second case, instead of creating a new state, initial state (S4) for clip C2134 is made to point to the existing state S1. The clip C2134 is added to the list of video clips of S₁.

Algorithm 1 provides the pseudo-code for converting from string to state. Here each alphabet from the SMART [14, 15] string is read. If the alphabet represents an object (players or ball) and the object location is not implied, then the successive alphabets are read to get its location value. Once all the three object locations are determined, a state with these location values is searched using the indexing structure in Sect. 4.2.1. If the state is found, then it is linked with the previous state with suitable transition type represented by the alphabets in the string denoting the event, or else a new state is created and added to the S³G by linking it with the previous state using the transition specified by the event denoting alphabet. When the strings are parsed, the initial event is ‘serve’ (not explicitly represented in the string) and represented as a forehand shot as a transition in the graph.

Algorithm 2 determines whether an existing state, OP, in S³G is reachable from an initial state S. First the system checks the presence of state OP. If OP is present, the set of common clip lists are found. If the intersection of clip lists is empty, state OP is not reachable from state S. If the intersection of clips is non-empty, the ranks of the same clips are compared. If the rank of a clip in state OP is more than the rank of the clip in S, the state OP is reachable from state S. In the algorithm, C(S) returns the rank of a clip C in state S. This reachability function is used in querying.

An S³G can grow significantly depending on the number of states and actions involved in a given video. The recursive algorithm for finding states discussed in the previous subsection may no longer be efficient for an S³G with a large number of states. We use hashing for determining whether a state is present in the graph or not. States may not need to be created ahead of time. They may be created as those states are encountered in video clips. This will avoid the construction of an unnecessary large graph.

5 Querying

This section describes the interface to build S³G and perform spatio-temporal querying on an S³G. We generate one S³G for the complete tennis video database. Information other than spatio-temporal content about the tennis games is stored in other tables in the database. The irrelevant clips can be eliminated as the states are visited. The user interface snapshots are provided in the Appendix 1.

5.1 Input, backup, and storage for S³G

The string representation of the video according to SMART [14, 15] is used as an input for building the S³G states. The strings can be given as input by: (a) manually typing the strings or (b) the SMART [14, 15] strings stored in the GSMART database (Fig. 20 in "Appendix 1"). S³G can be mapped to a relational model and stored in the database. When the system restarts, it can easily be built without processing the strings in the database. If the S³G index is no longer needed, it can be removed from the database.

5.2 Searching a state

Once an S³G is formed, a particular state can be queried with respect to object positions. If a state is present in the S³G, the corresponding clips in the state can be viewed. The ‘Location Selector’ interface (Fig. 22 in "Appendix 3") helps to retrieve clips that are associated with a state. The location of an object is selected by first choosing the object (the player1, the player2, or the ball) and then clicking the location on the tennis court image corresponding to the required object’s location. If a state exists in the S³G, the details of that state (Fig. 23 in "Appendix 4") are displayed with a list of the clips associated with the state.

5.3 Querying

We can apply STL on the S³G and construct spatio-temporal queries. In this paper, we provide examples of ‘next’ and ‘eventually’ operators that are applied on the tennis video database. When searching for a state sequence, it is important that the states in the sequence have the common clips. Moreover, the rank of a state for the clip should increase by one in a consecutive state. For example, if a sequence has states S₁ and S₂ that are searched, S₁ and S₂ must have at least one common clip. For the common clips, the rank should increase. For example, consider a partial S³G in Fig. 11 where ranks of states are provided as subscripts of clip numbers. Clip C1 has ranks 1 and 3 for state S₁, and rank 2 for state S₂. This indicates that the states of C1 are in the order of S₁–S₂–S₁. For C7, the order is S₂–S₁. For a sequence from S₁ and S₂ with a backhand shot by player 1, clip C5 will not be retrieved since clip C5 is not common in both clips. Clip C7 is common in both states, but the order of states is not correct. Clip C1 is retrieved because it has rank 1 for state S₁ and rank 2 for state S₂. This means that S₂ follows S₁ with a backhand shot by player 1 in clip C1.

In the following subsections, we do not delve into ranks assuming that rank conditions are satisfied. We put more emphasis on how temporal operations are handled.

5.3.1 ‘Next’ state query

The ‘next’ query helps us retrieve all the possible next states from a given state in the S³G. Each state in the S³G can have multiple next states for each of the four kinds of transitions.

5.3.1.1 Example

In the given instance of S³G in Fig. 12, the state S₂ occurs following S₁ when player1 hits a backhand shot. The state S₃ occurs following S₁ when player1 hits a forehand shot. Therefore, a ‘next’ query for state S₁, should fetch states S₂ and S₃:

$$ ({\mathbf{S}},{\text{S}}_{ 1} )\models \circ \left( {{\text{S}}_{ 2} } \right)\quad{\text{and}}\quad({\mathbf{S}},{\text{S}}_{ 1} )\models \circ \left( {{\text{S}}_{ 3} } \right). $$

For each of the four transition types (player1 hitting forehand shot, player2 hitting forehand shot, player1 hitting backhand shot and player2 hitting backhand shot), the ‘next’ possible states will be retrieved. The interface for querying ‘next’ states are provided in "Appendix 5". The ‘next’ state query can also be applied multiple number of times in succession to obtain the clips with all the selected states.

5.3.1.2 Example

In Fig. 13, if S₄ is selected as the current state, its next states will be S₁ and S₅. The states S₄ and S₁ have clip C7 as a common clip, while the states S₄ and S₅ have clip C1 as common. If S₅ is selected as the required ‘next’ state and the ‘next’ state is applied to S₅, the state S₂ will be obtained as they have clip 11 as common, but state S₄ does not have this clip. Hence, no clips are retrieved.

5.3.2 ‘Eventually’ state query

The ‘eventually’ query helps users check if a state eventually occurs in future after a particular state has already occurred in the S³G. For given two S³G states, an initial state and a final state, it can be checked if the final state eventually occurs after the initial state has occurred by checking if it is possible to reach the final state from a given start state in the S³G .

5.3.2.1 Example

Figure 14 provides an instance of S³G where S₁ occurs following the state S₄ when player2 hits a forehand shot. The state S₂ occurs following state S₁ when player1 hits a backhand shot. The state S₃ occurs when player1 hits a forehand shot after the state S₂. Therefore, running an ‘eventually’ query with S₄ as a start state and S₃ as a final state succeeds while an eventually query with S₂ as a start state and S₄ as a final state does not:

$$ ({\mathbf{S}},{\text{S}}_{ 4} )\models \diamondsuit \left( {{\text{S}}_{ 3} } \right){\text{ but }}({\mathbf{S}},{\text{S}}_{ 2} )\models \diamondsuit \left( {{\text{S}}_{ 4} } \right){\text{ is false}}. $$

Consider the example shown in Fig. 15 where a state with locations 7, 10, and 4 are set for the player1, the player2, and the ball positions, respectively, for the initial state. When the user clicks “Set as initial state”, this initial state is shown as a thumbnail in Fig. 15. To check if a state ‘eventually’ occurs after the initial state chosen, the ‘Eventually Check’ in Fig. 16 is selected (see Appendix 6 for more information).

5.3.3 Combining ‘next’ and ‘eventually’ queries

The ‘next’ and the ‘eventually’ state queries can be called multiple times in any order. In Fig. 17, an instance of S³G is given where S₁ is a state that is directly reachable from the state S₄ after player2 hits a forehand shot. The state S₂ is directly reachable from S₁ after Player1 hits a backhand shot. The states S₅ and S₃ are directly reachable from the state S2 after the transitions resulting from “player1 hitting a forehand shot” and “player2 hitting a backhand shot”, respectively. Therefore, an ‘eventually’ query with S₄ as the start state and S₂ as the final state will succeed. A ‘next’ query on the state will retrieve states S₃ and S₅.

$$ ({\mathbf{S}},{\text{S}}_{ 4} )\models \diamondsuit \left( {{\text{S}}_{ 2} } \right) \wedge ({\mathbf{S}},{\text{S}}_{ 2} )\models \diamondsuit \left( {{\text{S}}_{ 3} } \right) \quad{\text{and }}\quad({\mathbf{S}},{\text{S}}_{ 4} )\models \circ \left( {{\text{S}}_{ 2} } \right) \wedge ({\mathbf{S}},{\text{S}}_{ 2} )\models \circ \left( {{\text{S}}_{ 5} } \right) $$

Similarly a ‘next’ query on the state S₄ gives the state S₁ and then an ‘eventually’ query with S₁ as the start state and S5 as the final state will also succeed.

$$ ({\mathbf{S}},{\text{S}}_{ 4} )\models \circ \left( {{\text{S}}_{ 1} } \right) \wedge ({\mathbf{S}},{\text{S}}_{ 1} )\models \diamondsuit \left( {{\text{S}}_{ 5} } \right). $$

Our application has the feature of graphically displaying the states of a query to simplify the process of long queries that result from a series of eventually and next state queries as shown in Fig. 18. When a new query is submitted, the image of previous state will be retrieved from a location where all possible state images are stored. This image is then displayed. Thus, S³G application helps to develop queries interactively. The user can continue building queries on the results obtained from his previous subqueries, thus enabling creation of lengthy queries dynamically instead of running just a fixed query.

Other query types using ‘releases’ or ‘until’ operator can also support queries such as “retrieve clips where both players are at the baseline until one player runs to the net”. In our application, these queries can be represented with a sequence of ‘next’ queries. So, this type of queries is not implemented separately.

6 Experiments

In this section, we first provide experimental results and then provide discussion on extendibility for other videos.

6.1 Performance

6.1.1 Comparison

We compare our method with spatio-temporal query patterns (STP) [11]. There are improvements to STP method with assumptions [20] which are not applicable to our case (e.g., an object should not visit the same location again). STP provides indexing based on positions usually divided into grids. STP index keeps the time when objects visit a specific location. There are several issues with STP: (a) STP supports querying the trajectory of a single object; (b) there is a single timeline for trajectories; (c) there are no semantic events between changing positions; and (d) the gap between any consecutive steps could be anything. To compare our method with STP, the following changes are made for STP. (1) Our queries include multiple objects; therefore, we first need to query STP for each object and check the results are synchronized (i.e., check whether objects appear at the corresponding positions at the same time). (2) After each tennis point (tennis match is composed of sets, which are composed of games, which are composed of points), the positions of players are reset. There is no need to check the positions of objects across points. We need to keep STP index structure for each point. A sequence of positions across multiple points is not related to each other and meaningless. (3) We consider all semantic transitions (all types of shots) for S³G since STP does not support transitions. (4) We consider eventual queries to compare S³G with STP since STP is not designed to support next-querying.

6.1.2 Building S³G

Since a small number of games are not enough to check the performance of indexing structures, we have simulated tennis games to measure the complexity of building S³G and querying using S³G. We have generated around 10,000 points (sequences or strings or clips). Each game has around 100 points. Table 1 shows the statistics for building S³G and STP index structures where |IS| is the number of input sequences (or strings) and |s| is the number of states (note |s| is only applicable to S³G). These timings only include the time for building the index structures and ignore the time for decoding the input strings since strings are decoded differently for each index structure.

Table 1 Time for building index structures and searching with 2 and 3 states for S³G and STP

Full size table

6.1.3 Complexity

It takes 35 ms to build S³G for 10,000 points. Table 1 shows that the number of states reaches saturation when 4,096 input sequences are processed and barely increases after that. Our experimental results indicate that the time complexity is O(|IS|) + O(|s|). Since |IS| ≫|s| and the number of states reaches a saturation point, the complexity of building S³G is O(|IS|) for large databases. The time complexity of inserting a state into S³G is O(1). Similarly, the time complexity of inserting a clip into a state in S ³ G is O(1): finding the state using hashing and adding the clip to the end. Figure 19 shows the comparison of building index structures. Building S³G is faster than STP despite S³G also includes the time to build the index structure with respect to the player shots, whereas in STP those transitions are ignored. Although a separate STP index is built for each clip (this results in simple STP index structures), building a complex and single S³G index was faster than building STP index structures. If a single STP index structure had been built, the querying would not complete in a reasonable time for STP index.

6.1.4 Querying

We compare the performance of S³G and STP on ‘eventually’ queries. Table 1 provides the experimental results when 10,000 points are inserted into the database where |q| indicates the number of queries. A query returns the clips where a destination state is reachable from an initial state. We provide the results with 2 and 3 states (or predicates) for ‘eventually’ queries. S³G significantly outperforms STP that we cannot put them into a single graph. Using S³G, 8,196 queries for 3 states are completed in 600 ms, whereas it takes around 60 s for STP. Using S³G for queries where there is an initial state and a final state just took 153 ms. Figure 19 shows the results of querying for S³G. Note that the time for querying provides the total time for |q| queries. The cost of a ‘Next’ query is O(1) + O(1). The first part locates the current state using hashing and the second part locates the next state using the corresponding event.

6.2 Discussion on extendibility

The domain of applications should have the following properties: (a) the video data should be discretizable in both spatial domain and temporal domain; (b) the states should be repeated to get benefit from this indexing method, and (c) there should be preferably semantic events (or transitions) that indicate change from one state to another state. The transitions and discretization are closely related to each other. Now, we give examples of these in several different domains.

We may group sports based on the layout of the field, instruments involved in the game, the number of players, and so on. In some sports ranking is important: auto racing, swimming, and cycling. The events or transitions include completion of a lap, taking a pit stop, or passing of a player by another player. The states may involve the rank of players, the lap, and their locations on the field. A combination of lap and rank could be a better option for auto racing rather than the exact locations of cars. Another set of group of games include games where the field is divided into multiple sections. Examples include tennis, table tennis, volleyball, and badminton. Tennis example has already been covered with some detail. For volleyball the events could be passing, block, or spike, etc. Another set of games include games that are played on a board (e.g., chess). In chess, player moves are transitions, and pieces on board make the states. In a single game, it may be rare to repeat states but it is possible to have states repeated in multiple games. 8 moves of a chess game by Kasparov against World Team are: “1.e4 c5 2.Nf3 d6 3.Bb5 + Bd7 4.Bxd7 + Qxd7 5.c4 Nc6 6.Nc3 Nf6 7.0-0 g6 8.d4 cxd4”.

Another set of games include games where players have almost no restriction to move on the field. These games include soccer, football, basketball, ice hockey, etc. The research on these games includes video processing, game summary processing, and commentator speech processing. Not all players might be involved at a play during the game. The positions of players as well as their actions might be important. We investigated games from National Football League (NFL). The critical parts of the state are: the ball location, team having the ball, down-and-distance, and optional team formation [e.g., State (Team: GB; Ball: NO-37; Down; 1-10; Formation: shotgun)]. Events have more information about the event type such as rushing, passing, sacks. Events may also have information who is involved in the event (e.g., player A passes the ball to player B which is then stopped by player X of the opponent team). In baseball, the players at the bases, pitcher, batter, catchers, etc. form the states. Home-run, strike, ball, etc. correspond to transitions. In the golf, the player follows holes. The position of the ball after each hit and the hole number may correspond to a state. The hit may correspond to a transition.

If the number of states grows significantly, we propose two ways of reducing the number of states: (a) decrease the number of locations by increasing the sizes of locations and (b) use only players who are active in the play for that state.

7 Conclusion

In this paper, we introduced a new indexing method, semantic sequence state graph (S³G), for spatio-temporal data using the semantic contents of the video. S³G utilizes semantic events as transitions in the events while the spatial information is maintained in states. The semantic events result in temporal ordering among the states. The ‘next’ and ‘eventually’ operations of LTL have been implemented using the S³G platform on the tennis video database. Both these types of queries can be applied multiple times in combination to form a complex query. A sample application of S³G is shown for a tennis video database application. As future work, we plan to apply S³G in other domains. We also plan to provide more efficient ways of managing temporal queries of S³G. We also intend to create new scenarios of events, objects, and locations by reusing the video clip contents and traversing S³G. S³G can handle coupled STQ with combination of ‘next’ and ‘eventually’ queries. S³G suits well for applications where there is a finite domain of objects, events, and locations where repetitions are frequent. S³G also helps the user develop the temporal components of a query by a user interface that shows a history of the built query. The user can see whether his query will lead to any result or not while building his query.

References

Allen, J.F.: Maintaining knowledge about temporal intervals. In: Communications of the ACM. 26/11/1983. ACM Press (1983)
Assfalg, J., Bertini, M., Colombo, C., Del Bimbo, A.: Semantic annotation of sports videos. IEEE Multimed. 9(2), 52–60 (2002)
Article Google Scholar
Athitsos, V., Papapetrou, P., Potamias, M., Kollios, G., Gunopulos, D.: Approximate embedding-based subsequence matching of time series. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. Vancouver, Canada, June 09–12, 2008, SIGMOD ’08 (2008)
Aygun, R.S., Yazici, A.: Modeling and management of fuzzy information in multimedia database application. Multimed. Tools Appl. 24(1), 29–56 (2004)
Article Google Scholar
Berchtold, S., Keim, D.A., Kriegel, H.-P.: The X-tree: An Index Structure for High-Dimensional Data, Readings in Multimedia Computing and Networking, pp. 451–462. Morgan Kaufmann, San Francisco (2002)
Google Scholar
Bozanis, P., Foteinos, P.: WeR-trees. Data Knowl. Eng. 63(2), 397–413 (2007)
Article Google Scholar
Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33(3), 322–373 (2001)
Article Google Scholar
Chen, S., Ooi, B., Tan, K., Nascimento, M.A. (2008) ST²B-tree: a self-tunable spatio-temporal b⁺-tree index for moving objects. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. Vancouver, Canada, June 09–12, 2008, SIGMOD ’08
Del Bimbo, A., Vicario, E., Zingoni, D.: Symbolic description and visual querying of image sequences using spatio-temporal logic. IEEE Trans. Knowl. Data Eng. 7(4), 609–622 (1995)
Article Google Scholar
Google Video (2011) Google Video. http://video.google.com
Hadjieleftheriou, M., Kollios, G., Bakalov, P., Tsotras, V.J.: Complex spatio-temporal pattern queries. In: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB ’05). VLDB Endowment, pp. 877–888 (2005)
Hongeng, S., Nevatia, R., Bremond, F.: Video-based event recognition: activity representation and probabilistic recognition methods. Comput. Vis. Image Underst. 96(2), 129–162 (2004)
Article Google Scholar
Internet Archive (2011) Internet archive. http://www.archive.org
Jain, V., Aygun, R.S.: SMART: A grammar-based semantic video modeling and representation. In: IEEE SouthEast Con 2008
Jain, V., Aygün, R.S.: Spatio-temporal querying of video content using SQL for quantizable video databases. J. Multimed. 4(4), 215–227 (2009)
Google Scholar
Jensen, C.S., Lin, D., Ooi, B.C.: Query and update efficient B+-tree based indexing of moving objects. In: VLDB. pp. 768–779 (2004)
Jensen, C.S., Snodgrass, R.T.: Temporal data management. IEEE Trans. Knowl. Data Eng. 11(1), 36–44 (1999)
Article Google Scholar
Kim, C., Vasudev, B.: Spatiotemporal sequence matching for efficient video copy detection. IEEE Trans. Circuits Syst. Video Technol. 15(1), 127–132 (2005)
Article Google Scholar
Koprulu, M., Cicekli, N.K., Yazici, A.: Spatio-temporal querying in video databases. Inf. Sci. Inf. Comput. Sci. 160(1–4), 131–152 (2004)
Google Scholar
Lagogiannis, G., Lorentzos, N., Sioutas, S., Theodoridis, E.: A time efficient indexing scheme for complex spatiotemporal retrieval. SIGMOD Rec. 38(3), 11–16 (2010)
Article Google Scholar
Lay, J.A., Guan, L.: Semantic retrieval of multimedia by concept languages: treating semantic concepts like words. Signal Process. Mag. IEEE 23(2), 115–123 (2006)
Article Google Scholar
Lee, M., Yoon, H., Kim, Y.J., Lee, Y.: SMILE tree: a stream data multi-query indexing technique with level-dimension nodes and extended-range nodes. In: Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication. Suwon, Korea, January 31–February 01, 2008, ICUIMC ‘08
Lejsek, H., Ásmundsson, F.H., Jónsson, B.Þ., Amsaleg, L.: NV-Tree: an efficient disk-based index for approximate search in very large high-dimensional collections. IEEE Trans. Pattern Anal. Mach. Intell. 31(5), 869–883 (2009)
Article Google Scholar
Li, J.Z., Ozsu, M.T., Szafron, D.: Modeling of moving objects in a video database. In: International Conference on Multimedia Computing and Systems, pp. 336 (1997)
Lin, D., Jensen, C.S., Ooi, B.C., Šaltenis, S.: Efficient indexing of the historical, present, and future positions of moving objects. In: Proceedings of the 6th International Conference on Mobile Data Management, Ayia Napa, Cyprus, May 09–13, 2005, MDM ‘05. ACM, New York, NY, pp. 59–66 (2005)
Min, J.S., Kim, D.H., Ryu, K.H.: A spatiotemporal data and indexing, electrical and electronic technology, 2001. TENCON. In: Proceedings of IEEE Region 10th International Conference on, vol. 1, pp. 110–113 (2001)
Naik, M., Jain, V., Aygun, R.S.: S³G: A semantic sequence state graph for indexing spatio-temporal data—a tennis video database application. In: ICSC, 2008 IEEE International Conference on Semantic Computing, pp. 66–73 (2008)
Park, D.-J., Heu, S., Kim, H.-J.: The RS-tree: an efficient data structure for distance browsing queries. Inf. Process. Lett. 80(4), 195–203 (2001)
Article MathSciNet MATH Google Scholar
Patel, J.M., Chen, Y., Chakka, V.P.: STRIPES: an efficient index for predicted trajectories. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, June 13–18, 2004. SIGMOD ’04. ACM, New York, NY, pp. 635–646 (2004)
Pissinou, N., Radev, I., Makki, K., Campbell, W.J.: Spatio-temporal composition of video objects: representation and querying in video database systems. IEEE Trans. Knowl. Data Eng. 13(16), 1033–1040 (2001)
Article Google Scholar
Ren, W., Singh, S., Singh, M., Zhu, Y.S.: State-of-the-art on spatio-temporal information-based video retrieval. Pattern Recognit. 42(2), 267–282 (2009)
Article MATH Google Scholar
Šaltenis, S., Jensen, C.S., Leutenegger, S.T., Lopez, M.A.: Indexing the positions of continuously moving objects. SIGMOD Rec. 29(2), 331–342 (2000)
Article Google Scholar
Saltenis, S., Jensen, C.S.: Indexing of moving objects for location-based services. ICDE 2002, 463–472 (2002)
Google Scholar
Salzberg, B., Tsotras, V.J.: Comparison of access methods for time-evolving data. ACM Comput. Surv. 31(2), 158–221 (1999)
Article Google Scholar
Tao, Y., Papadias, D., Sun, J.: The TPR*-tree: an optimized spatio-temporal access method for predictive queries. In: Proceedings of the 29th International Conference on Very Large Data Bases (VLDB), Berlin, Germany, 2003, pp. 790–801 (2003)
Valle, E., Cord, M., Philipp-Foliguet, S.: High-dimensional descriptor indexing for large multimedia databases. In: Proceeding of the 17th ACM Conference on Information and Knowledge Management. CIKM '08. ACM (2008)
Vardi, M.Y. Branching vs. linear time: final showdown. In: Proceedings of the 7th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS’01), 2031 (2001), pp. 1–22
Wattamwar, S.S., Ghosh, H.: Spatio-temporal query for multimedia databases. In: Proceeding of the 2nd ACM Workshop on Multimedia Semantics. Vancouver, British Columbia, Canada, October 31–31, 2008, MS ’08. pp. 48–55, ACM, New York, NY
Ye, H., Luo, H., Song, K., Xiang, H., Chen, J.: Indexing moving objects based on 2ⁿ index tree. In: Proceedings of the 6th Conference on 6th WSEAS International Conference on Artificial intelligence, Knowledge Engineering and Data Bases, vol. 6, Corfu Island, Greece, February 16–19, 2007
Yiu, M.L., Tao, Y., Mamoulis, N.: The Bdual-tree: indexing moving objects by space filling curves in the dual space. VLDB J. 17(3), 379–400 (2008)
Article Google Scholar
YouTube (2011) YouTube—Broadcast Yourself. http://www.youtube.com

Download references

Acknowledgments

We thank Vineetha Bettaiah for improving the code for searching states, incorporating rank of clips, and editing of the paper.

Author information

Authors and Affiliations

Computer Science Department, University of Alabama in Huntsville, Huntsville, AL, USA
M. M. Naik, M. Sigdel & R. S. Aygun

Authors

M. M. Naik
View author publications
You can also search for this author in PubMed Google Scholar
M. Sigdel
View author publications
You can also search for this author in PubMed Google Scholar
R. S. Aygun
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to R. S. Aygun.

Additional information

Communicated by B. Prabhakaran.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 9 kb)

Appendix

In Appendix, we provide information about our user interface for querying.

1.1 A user interface

In this section, we provide some snapshots from our user interface.

1.1.1 Appendix 1: Building S³G

Figure 20 displays the user interface where SMART strings are read from the database. When “Proceed to State Conversion” button is clicked, the S³G is built.

1.1.2 Appendix 2: Displaying states

The interface is prepared based on design in Fig. 21 where player1, player2, and the ball images are placed on all possible locations on the tennis court image. All these object images are hidden except the ones that represent the location values for the given state when displaying a state.

1.1.3 Appendix 3: Location selector interface

Figure 22 shows location selector interface filled with location values for a desired state. In this example, player1 location has a location as 7, player2 has a location as 10, and the ball has a location as 4. To search for a particular state with corresponding location values in the ‘Location Selector’ interface, the search option in Fig. 15 is selected. To retrieve states based on an event, the user needs to follow the links to retrieve the corresponding links. For example, if the user wants to retrieve the clips that result after an event such as forehand shot by player1, the transition for this event followed and the clips from the relevant states are retrieved.

1.1.4 Appendix 4: Viewing clips

A particular clip can be selected to view from the list of clips given for that state (Fig. 23). The clip will be displayed after pressing the view button. If the selected state does not exist in the S³G, a message will be displayed saying that the specified state does not exist. Figure 23 also shows player1, player2, and ball locations.

1.1.5 Appendix 5: Interface for querying next states

A state can be chosen by selecting the player1, the player2 and the ball locations in the tennis court and then corresponding ‘next’ states in the S³G can be obtained (Fig. 24). The aim is to obtain the clips having the current state and the selected next state.

Figure 25 shows an interface showing the possible ‘next’ states in the S³G for the transition, when player2 hits a forehand shot. A ‘next’ state can be selected by choosing the ‘Select State’ options button associated with that state. If the next state does not exist in the S³G for a transition then a message will be displayed stating that there is no next state for that transition. Once a current state and a next state are selected, the relevant clips can then be viewed (Fig. 26).

Only the next states that have common clips with the relevant clip set (clips common to all selected states) are displayed to select. There may be a case where there can be a next state existing for a state in the S³G but the next state has no common clip with all previously selected states. In such a case, a message stating that “the next state exists in the S³G, but no common clip exists” is displayed.

1.1.6 Appendix 6: Interface for querying eventually

The result, whether the ‘eventually’ link exists between the chosen initial and final states in the S³G or not, is obtained by selecting the “Result” option as in Fig. 27. Figure 27 shows that an ‘eventually’ link exists for the chosen initial and final states in the S³G.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Naik, M.M., Sigdel, M. & Aygun, R.S. Spatio-temporal querying recurrent multimedia databases using a semantic sequence state graph. Multimedia Systems 18, 263–281 (2012). https://doi.org/10.1007/s00530-011-0255-8

Download citation

Received: 27 August 2010
Accepted: 14 November 2011
Published: 06 December 2011
Issue Date: June 2012
DOI: https://doi.org/10.1007/s00530-011-0255-8

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Spatio-temporal querying recurrent multimedia databases using a semantic sequence state graph

Abstract

Similar content being viewed by others

Analyzing Temporal Keyword Queries for Interactive Search over Temporal Databases

Supporting Situation Awareness in Spatio-Temporal Databases

A Survey of Spatio-Temporal Database Research

Explore related subjects

1 Introduction

1.1 Our approach

2 Background and related work

2.1 Related work

2.1.1 Tree-based indexing

2.1.1.1 R-tree-based indexing

2.1.1.2 B-tree-based indexing

2.1.1.3 Other tree-based indexing

2.1.2 Sequence matching

2.1.3 Representation

2.2 Background on SMART

2.2.1 An example

2.2.2 Grammar for tennis game

2.3 Spatio-temporal logic (STL)

3 S3G: a semantic sequence state graph

3.1 Components

3.1.1 States

3.1.2 Transitions

3.2 An example: tennis video

3.2.1 States of tennis

3.2.2 Transitions of tennis (semantic events)

3.2.3 Example

3.2.4 Cyclic nature

3.2.5 List of states

3.2.6 List of clips

3.2.7 Complexity

4 Building S3G

4.1 Insertion into S3G

5 Querying

5.1 Input, backup, and storage for S3G

5.2 Searching a state

5.3 Querying

5.3.1 ‘Next’ state query

5.3.1.1 Example

5.3.1.2 Example

5.3.2 ‘Eventually’ state query

5.3.2.1 Example

5.3.3 Combining ‘next’ and ‘eventually’ queries

6 Experiments

6.1 Performance

6.1.1 Comparison

6.1.2 Building S3G

6.1.3 Complexity

6.1.4 Querying

6.2 Discussion on extendibility

7 Conclusion

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Electronic supplementary material

Supplementary material 1 (PDF 9 kb)

Appendix

Appendix

1.1 A user interface

1.1.1 Appendix 1: Building S3G

1.1.2 Appendix 2: Displaying states

1.1.3 Appendix 3: Location selector interface

1.1.4 Appendix 4: Viewing clips

1.1.5 Appendix 5: Interface for querying next states

1.1.6 Appendix 6: Interface for querying eventually

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation

3 S³G: a semantic sequence state graph

4 Building S³G

4.1 Insertion into S³G

5.1 Input, backup, and storage for S³G

6.1.2 Building S³G

1.1.1 Appendix 1: Building S³G