1 Introduction

Due to the huge progress of science and technology in the last few decades, through the Internet, people can use computers or other handheld devices to acquire multimedia resources anytime and anywhere. Among these various resources, videos provide richer information, e.g., image, audio, text. Although there are a lot of videos on the Internet, we are only interested in highlight parts of videos. Accordingly, summarization and retrieval of significant events in different kinds of videos have been hot research topics.

In this paper, we focus on sports videos. A sports video usually lasts more than an hour. There are plenty of sports games held per day in different countries. Even if we narrow the scope down to basketball games in the United States, there still have five to ten games per day in National Basketball Association (NBA). For basketball fans, it is worthy to summarize these games into highlights, so they can track daily results within just a few minutes. Many websites, such as NBA [4], ESPN [2], and Yahoo Sports [3], already make this kind of online service available. However, these online services, such as Game Recap [4], Daily Top 10 and Play of the Day [5], etc., are made by professional film editors and sports reporters by exhaustedly watching the sports videos personally. People or fans can only see the unified version through online services. Fans, who want to practice certain basketball skills or imitate specific sports stars, have to spend a lot of time to download the whole game and personally search for certain moves made by certain players. Based on these facts, a personalized sports video summarization and retrieval system is desired by fans to make their own video highlights according to their different tastes and interests.

In video summarization and retrieval, a source video is first clipped into smaller videos representing significant events through a preprocessing, called Semantic Event Detection, which detects events occurred in a video and annotates events with appropriate tags. With finer results of the preprocessing, video summarization and retrieval can be completed efficiently and correctly. Most of existing event detection schemes [13, 14, 18] use video content as their resource knowledge. Chen and Deng [7] analyze low level features (e.g. color, motion, shot) to extract and index events in a basketball video. Some researches [8, 9, 12, 17] use audio-visual (AV) features to do video summarization. Hassan et al. [8] extract AV features and apply Conditional Random Fields (CRFs) based probabilistic graphical model for sports event detection. Kim and Lee [9] build an indexing and retrieving system for a golf video by analyzing its AV content. However, the schemes relying on video content encounter a challenge called semantic gap, which represents the distance between low level video features and high level semantic events. In sports video, two kinds of external knowledge can be used to bridge the gap.

One of the external knowledge is Closed-Caption (CC) [11]. CC is the transcript of speech and sound, and it is helpful for semantic analysis of sports videos. It is mainly used in aid of listening and language learning, but only available in certain videos and certain countries. Because CC completely records the sound in video, it contains a lot of redundant information and usually lacks of structure. The other external knowledge is webcast text. Comparing to CC, webcast text is the online commentary posted by professional announcers and focuses more on sports games. It contains more detail information (e.g., event name, time, player involved, etc.), which is difficult to extract from video content itself automatically. Xu and Chua [15] first use webcast text as external knowledge to assist event detection in soccer video. They proposed a framework that combines internal AV features with external knowledge to do event detection and event boundary identification. But the proposed model is inapplicable to other team sports. Xu et al. [16] apply probabilistic latent semantic analysis (pLSA), a linear algebra–probability combined model, to analyze the webcast text for text event clustering and detection. Based on their observation, the descriptions of the same event in the webcast text have a similar sentence structure and word usage. They use pLSA to first cluster the descriptions into several categories and then extract keywords from each category for event detection. Although they extend pLSA for both basketball and soccer, there are two problems in the approach.

  1. 1)

    The optimal numbers of event categories are nine for basketball and eight for soccer in the results, which is determined by minimizing the ratio of within-class similarity and between-class similarity. In fact, there are more event categories for a basketball or soccer game. For example, in a basketball game, many events, such as timeout, assist, turnover, ejected, are mis-clustered into wrong categories or discarded as noises. This may cause side effects degrading and limiting the results of video retrieval.

  2. 2)

    After keywords extraction, events can be detected by keywords matching. In Xu et al.’s method, they use the top ranked word in pLSA model as single-keyword of each event category. But in some event categories, the single-keyword match will lead to horrible results. For example, in their method for a basketball game, “jumper” event represents those jumpers that players make. Without detecting “makes” as a previous word of “jumper” in description sentences, the precision of “jumper” event detection is decreased from 89.3 % to 51.7 % in their testing dataset. However, the “jumper” event actually is an event that consists of “makes jumper” event and “misses jumper” event. The former can be used in highlights, and the latter can be used in sports behavior analysis and injury prevention. Accordingly, using single-keyword match is insufficient and some important events will be discarded.

To treat the above-mentioned problems, we propose a method to analyze sports webcast text and extract significant text events. An unsupervised scheme is used to detect events from the webcast text and extract multiple keywords from each event. A data structure is used to store these multiple keywords and to support a hierarchical search system with auto-complete feature for event retrieval. The word “hierarchical” means that a user can get more specific results by querying more keywords and the word “auto-complete” means that the system can give suggested keywords during the query step. According to our experiments, the proposed method keeps more natural event categories and works finely with objective classification results. This provides more flexibility and extensibility while summarizing or retrieving sports videos in different purposes. Our contributions are as follows: 1) detecting semantic events from webcast text in an unsupervised manner; 2) requiring no additional context information analysis; 3) preserving more significant events in sports games; 4) extracting multiple keywords from event categories to support hierarchical searching; 5) providing auto-complete feature for finer retrieval. The rest of the paper is organized as follows. Details of the proposed method are described in Section 2. Experimental results and conclusions are given in Section 3 and Section 4.

2 Proposed method

Webcast text comprises knowledge which is closely related to the game and is easily retrieved from websites. As can be seen in Table 1, it contains time tags, team names, scores, and event descriptions. The format is so organized that we can follow the time flow and understand how the game goes on. Among this well-organized text, it is apparent that event descriptions relate to semantic events the most. Our goal is to analyze event descriptions and automatically extract significant events from them. The block diagram of the proposed method is presented in Fig. 1. It can be seen that we first filter out unrelated words of webcast text and then cluster them into significant events. We store the extracted semantic information with a pair of index tables and build a hierarchical retrieval system by manipulating the two tables. The detail of each block will be described in the following subsections.

Table 1 An example of basketball webcast text
Fig. 1
figure 1

Block diagram of the proposed method

2.1 Unrelated words filtering

In webcast text, each description can be considered as an event. It contains many words and may include player name, team name, movement name, and whether the player or the team makes the movement or not. An example is given in Fig. 2, a player named “Peja Stojakovic” failed to make a movement called “10-ft two point shot.”

Fig. 2
figure 2

An example to illustrate description and word

The number of descriptions in each basketball game is more than four hundred. The descriptions are readable and can be easily categorized into several events by human eyes. But the task is not effortless for computer machines. According to our observations, words in each description consist of three mutually disjoint word sets: 1) stop words, 2) event keywords, and 3) names. Stop words are unrelated to event and should be discarded. Event keywords are closely related to event and should be kept for event detection. Names including team names and player names should be preserved for event annotation. Our objective is to extract event keywords and use these keywords to do event clustering. To achieve the objective, based on a reference stop word list and an online name information, an interactive system is first provided to establish a sports stop word list and an event keyword list. The system will be explained in Sections 2.1.1 and 2.1.2. According to these two lists, for each webcast text, an unrelated word filtering procedure described in Section 2.1.3 is next provided to filter out stop words and to preserve name words. The remaining keywords are then used for event clustering, which will be described in Section 2.2.

2.1.1 Stop words

In information retrieval, there are some words that occur very frequently (e.g. some articles, prepositions, pronouns, be-verbs) and are useless in document matching. These words are called stop words [10]. Due to the uselessness of stop words, filtering out them during both index step and query step can reduce the index size and query processing time. This technique has been used in search engines and can be implemented through predefining a stop word list. For the variety of applications, there is no standard stop word list. Many reference stop word lists [1, 6] have been proposed by using techniques about statistics and probability.

From Table 1, it can be seen that descriptions contain articles (e.g. “the”), prepositions (e.g. “of”), range of shot (e.g. “10-ft”), and points of shot (e.g. “two point”). Some words are details of events which decrease the connections between similar events. With the aid of reference stop lists, articles and prepositions can be easily filtered out from descriptions. However, the range of shot and points of shot are exceptions in reference stop lists. Moreover, in soccer webcast text, due to the relatively larger ground, there are more unrelated words to describe locations where an event happens. For example, right wing, left wing, inside the box, outside the box, left corner, right corner, etc. Accordingly, it is hard to automatically generate a sports stop word list for all kinds of sports. So we will provide an interactive system to establish a sports stop word list.

2.1.2 The proposed interactive system for establishing sports stop word list and event keyword list

As mentioned previously, an interactive system is proposed to establish the sports stop word list and the event keyword list for sports webcast text. First, webcast text descriptions of several games are taken as training inputs, next some unrelated words are filtered out according to a reference stop word list [1] and a name word list (e.g., online box score in basketball and online player statistics in soccer). And then the system interacts with sports professionals, who will divide the remaining words into a black list and a white list. The black list contains stop words for sports, and the white list contains sports event keywords. Finally the black list is merged into the reference stop word list to get the sports stop word list. The block diagram of the interactive system is presented in Fig. 3.

Fig. 3
figure 3

The block diagram of the interactive pre-training system

Our training webcast text is conducted by 41 basketball games and 48 soccer games. After the reference stop words filtering and the name words filtering, the remaining words needed to interactively ask professionals are less than 100 in basketball and less than 200 in soccer. The responses from professionals may take just few minutes.

2.1.3 The proposed unrelated words filtering procedure

Figure 4 shows the block diagram of the proposed unrelated words filtering procedure. For a webcast text, the sports stop word list is first used to filter out unrelated words. Next the event keyword list is used to extract event keywords. Then the words with uppercase beginning in the remaining words are considered as reserved names for further indexing. According to our experiment results, the unrelated words filtering works well both in basketball and soccer.

Fig. 4
figure 4

Block diagram of unrelated words filtering procedure

2.2 Event clustering

After filtering, each description is reduced and almost exactly describes an event; for example, “misses shot” represents a missed shot. So a matching function is provided to cluster these filtered descriptions into event categories.

Filtered descriptions can be represented as FD = {fd 1 , fd 2 ,…, fd N }, and event categories can be represented as C = {C 1 , C 2 , …, C K }, where N denotes the number of descriptions in a game and K denotes the number of categories that the clustering step produces. Since a filtered description consists of some words, it can be considered as a set of words. Note that the number of keywords of an event category is not restricted to be single in our method. The matching function is defined as

$$ Text\_Match\left( {x,y} \right)=\left\{ {\begin{array}{*{20}c} {1,} & {\mathrm{if}\;\mathrm{x} = \mathrm{y}} \\ {0,} & {\mathrm{otherwise}\text{,}} \\ \end{array}} \right. $$
(1)

where x and y are two sets of words. Each filtered description, fd i , can be clustered into one category based on the following function

$$ Clustering\left( {f{d_i}} \right)=\arg \mathop{\max}\limits_m\left\{ {Text\_Match\left( {f{d_i},Keywords\left( {{C_m}} \right)} \right),\;m=1,\ldots,\;K} \right\},\;i=1,\ldots, N, $$
(2)

where Keywords(C m ) denotes the multiple-keywords set of category C m . Clustering(fd i ) = j means that description fd i is clustered into category C j . In order to avoid zero matching in (2), a flag function to examine whether the situation happens is defined as

$$ Flag\left( {f{d_i}} \right)=\mathop{\max}\limits_m\left\{ {Text\_Match\left( {f{d_i},Keywords\left( {{C_m}} \right)} \right),\ m=1,\ldots,\;K} \right\},\;i=1,\ldots,\;N. $$
(3)

The detail of the proposed clustering algorithm is given below.

2.3 Clustering algorithm

  1. Step0:

    Initialization: Given FD = {fd 1 , fd 2 ,…, fd N }.

    Set K = 1, Clustering(fd 1 ) = 1, Keywords(C 1 ) = fd 1 , i = 2.

  2. Step1:

    Cluster the description fd i according to Functions (1), (2), and (3). The procedure includes the following pseudo code:

    figure c
  3. Step2:

    If any of the descriptions in FD is not clustered yet, set i = i + 1 and go to Step1 for next iteration. Otherwise, end of iterations.

Once the clustering algorithm is completed, the filtered descriptions are clustered into event categories, and keyword extraction is done by using each keyword set as multiple keywords of the event. At the meantime, semantic event detection is accomplished. Then two data structures are built to recommend users for further queries and to support the hierarchical search.

2.4 Hierarchical search system

Figure 5 gives an example to show the concept of the proposed hierarchical search system. First, a user can query by one word to get rough results. Then he can continually query by more words to get into deeper levels for finer results. Here we implement the system by establishing a pair of index tables and manipulating them back and forth.

Fig. 5
figure 5

An example to illustrate the concept of the proposed hierarchical search system

Here we build a forward index table and an inverted index table. The former records mappings from descriptions to event keywords, and the latter stores mappings from keywords to descriptions. Note that the forward index table is established automatically after applying the unrelated words filtering procedure. Based on the forward index table, the inverted index table can be established by sequentially scanning event keyword set of each description. An example is given in Fig. 6 to do clearer explanation. Suppose we have five descriptions as shown in Fig. 6a. After applying unrelated words filtering procedure to each description, we can obtain Fig. 6b. By scanning each row in Fig. 6b, for each row, we can obtain a description index (DI) and the corresponding event keyword set (EKS). Then DI is linked to each keyword in EKS. After scanning all rows sequentially in Fig. 6b, c is established. Both inverted index table and forward index table are referred to achieve the hierarchical search system. The inverted index table is used for returning query results by intersecting those description sets mapped by query keywords. The forward index is originally just an intermediate, but reused in our method for providing suggested query keywords, i.e. auto-complete feature.

Fig. 6
figure 6

An example to illustrate the data structure for hierarchical search

In our system, a query is considered as a set of multiple words. The hierarchical feature means that a user can get more general results by querying fewer words or get more specific result by querying more words; for example, the results of querying “jumper” are those descriptions having the keyword “jumper”, and the results of querying “jumper makes” are those descriptions having both “jumper” and “makes.” The query result is the intersection of description sets obtained through the keywords of query in the inverted index list. For providing suggested query keywords, the resulting intersection set is then used as another query for the forward index list. The keyword set of each description in the resulting intersection set are extracted. Finally, the union of all extracted keyword sets is considered as the suggested query keywords. The detail algorithm of the proposed search system is given below.

2.5 Hierarchical search algorithm

  1. Step1:

    A user types several query words.

  2. Step2:

    Look up the inverted index and get description sets mapped by the query words. Intersect these description sets to obtain a query result.

  3. Step3:

    Look up the forward index and get keyword sets mapped by the query result.

  4. Step4:

    Output the union set of these word sets. The user selects some keywords from output as query words. Perform Step2 and output the query result.

Here, we use Fig. 6 as an example to do explanation. Assume that a user types a query {jumper}, the system will look up the inverted index list and get a temporary result set {D2, D4, D5}. Then, the system will look up the forward index list and recommend the user {assists, jumper, makes, misses}, i.e. the union set of {jumper, misses}, {jumper, makes}, and {assists, jumper, makes}. If the user changes his query to {jumper, makes}, the system will return {D4, D5}, i.e. the intersection set of {D3, D4, D5} and {D2, D4, D5}. Therefore, a powerful hierarchical search system with query recommendation function is built.

3 Experimental results

In most search systems, statistical analysis such as receiver operating characteristic (ROC) analysis or recall-precision is used to evaluate the performance. Through the analysis, the system degradation caused by misclassification can be estimated. However, as mentioned in Section 2.2, we cluster descriptions by an exactly matching function, so there is no misclassified event in our system. This means that both precision and recall rates of the proposed method are 100 %.

Researches aimed at detecting text events from webcast text are few. Xu and Chua [15] modeled webcast text as external knowledge in detecting events from football and soccer. The evaluation of the fusion video event detection was presented, but that of webcast text analysis alone was not. Xu et al. [16] proposed a framework to analyze webcast text and videos independently and align them through game time. According to the framework, the performance of video event detection mainly depends on webcast text analysis. Here we compare our method with Xu et al.’s work.

Our experiments are conducted by 25 NBA 2009–2010 games and 41 NBA 2008–2009 postseason games. The former are used as training database, and the latter are used as testing database to examine the reliability of the proposed method. We also collect 68 UEFA Champions League 2010–2011 soccer games, where 20 of them are used as training database and the other 48 are used as testing database. The webcast text from 134 games is acquired from ESPN website [2]. As can be seen in Table 2, hundreds of descriptions in a game are clustered into, in average, 44 semantic event categories for basketball and 20 semantic event categories for soccer.

Table 2 Average number of sports event categories in 25 basketball training data and 20 soccer training data

From Xu et al.’s previous work, the pLSA, the optimal number of event categories is nine for basketball and eight for soccer. The top three keywords of each category are selected by a conditional probability. They use the top ranked keyword as single keyword during event detection. We map the top three results of pLSA to our multiple keywords categories in Tables 3 and 4. In Table 4, because “attempt” is chosen as a member of black list in the interactive system, we use “shot” as the single-keyword match for mappings from soccer events in pLSA to those in the proposed method. The words “missed” and “misses” refer to the same verb (e.g., miss) and have the same meaning in descriptions. We consider these two words as the same and use “missed(misses)” as their common representative. In order to achieve fine performance in detecting semantic events, Xu et al. not only use keywords detection in description sentences, but also analyze context information in them. For example, in basketball, the top ranked keyword “jumper” is detected as “Jumper” event only if its previous word is “makes,” and other sentences containing word “jumper,” e.g., Kenyon Martin misses 22-ft jumper, are discarded. However, these discarded events are actually semantic events and can be valuable for further research, e.g., sports posture analysis, injury prevention, special highlight, etc. It can be seen from Tables 3 and 4 that every category of pLSA is mapped to several different semantic events of the proposed method. These several events are related but somehow different. For example, in basketball, “jumper misses” describes that a jumper is missed while “jumper makes” describes that a jumper is made successfully. In soccer, “blocked shot” describes that a shot attempt is blocked by an opponent while “missed(misses) shot” describes that a shot attempt is missed by the kicker himself. Hence, misclassifying or discarding these events decreases the precision and recall rates. However, in our method, the precision and recall rates are both 100 %. With the support of hierarchical search system, we can query multiple keywords for more specific events, which is even better than pLSA with context information. Tables 3 and 4 also show those semantic event categories which are unavailable in Xu et al.’s method, but can be detected in our method, e.g., steal, timeout, turnover for basketball and injury, blocked, penalty for soccer. These semantic events are important for special highlights or injury prevention, and should not be ignored or misclassified. So, the proposed method is superior to pLSA.

Table 3 Mappings of basketball event categories from pLSA to the proposed method
Table 4 Mappings of soccer event categories from pLSA to the proposed method

Here we want to examine the reliability of the proposed method. For basketball, 25 NBA 2009–2010 games are taken as training data. After processing all the training data and gathering the extracted semantic events, we collect the union of these semantic events as a sample set with cardinality 82. Then we process the testing data, which are collected from 41 NBA 2008–2009 postseason games, and examine whether all the semantic events extracted from testing data are listed in the sample set or not. For soccer, we use 20 UEFA Champions League soccer games as training data and 48 UEFA Champions League soccer games as testing data. According to our examination, with sparse exceptions, almost all the semantic events extracted from testing data can be found in the sample set. Tables 5 and 6 show all exception events which are quite rare. These exceptions may be caused by different writing styles or some rarely happened events, and can still be collected in an interactive way if necessary. Therefore, the proposed method is very stable.

Table 5 Occurrences of exception basketball events from 41 testing games
Table 6 Occurrences of exception soccer events from 48 testing games

4 Conclusions and future work

In this paper, we have proposed an unsupervised approach for semantic event extraction from sports webcast text and made some contributions: 1) detecting semantic events from webcast text in an unsupervised manner; 2) requiring no additional context information analysis; 3) preserving more significant events in sports games; 4) extracting multiple keywords from event categories to support hierarchical searching; 5) providing auto-complete feature for finer retrieval. According to experimental results, the proposed method extracts significant semantic events from basketball and soccer games and preserves those events that are ignored or misclassified by previous work. Furthermore, the proposed method is reliable.

Because we have proposed a great filtering step, it is believed that we may extend our approach to other free-styled basketball webcast text in the near future. We will try to apply our method to other sports, e.g., football, baseball, etc. In the further future, we will combine text information with video to build a search-and-query system for sports video retrieval.