1 Introduction

Analyzing complex and dynamic sport scenes for the purpose of team activity recognition is an important task in computer vision. Team activity recognition has a wide range of possible applications, such as analysis of team tactics and statistics (especially useful for coaches and trainers), video annotation and browsing, automatic highlight identification, and automatic camera control (useful for broadcasters). Although there is much research on vision-based activity analysis for individuals [1], group activity analysis remains a challenging problem. In a group activity, there are usually many people located at different positions and moving in different individual directions, which makes it difficult to find effective features for higher-level analysis.

There are mainly two possible sources of sport videos: TV broadcasts and multiple video feeds from fixed cameras around the playing field. We first review group activity analysis techniques using broadcast videos and then review methods which investigate sport videos captured by fixed multi-camera systems.

1.1 Using the TV broadcast

Kong et al. [2] use optical flow-based features and the latent-dynamic conditional random field model to recognize three different actions (i.e., left side attacking, stalemate and left side defending) in soccer videos. Later, Kong et al. [3] propose an alternative approach to recognize the same activities in soccer videos: they use scale-invariant feature transform (SIFT) key point matches on two successive frames and a linear SVM to classify activities. Wei et al. [4] aim to discriminate group activities in broadcast videos, targeting the identification of football, basketball, tennis or badminton. They extract space–time interest points and use the probability summation framework for classification. Li et al. [5] propose a discriminative temporal interaction manifold (DTIM) framework to characterize group motion patterns in American football games. For each class of group activity, they learn a multi-modal density function on the DTIM using the players' roles and their motion trajectories; a maximum a posteriori (MAP) classifier is then used to recognize five different activity types. Swears and Hoogs [6] also present a framework to recognize different offense types in American football. First, a broadcast video is stabilized and registered to another domain, normalizing the plays into a common coordinate system and orientation. Players' trajectories are then extracted for activity analysis, and the temporal interactions of the players are modeled using a non-stationary kernel hidden Markov model. Ibrahim et al. [7] propose a group activity recognition framework and experiment on a volleyball dataset. The team activity is predicted from the dynamics of the individual people performing it, which they capture with a deep learning model based on long short-term memory (LSTM) networks. Shu et al. [8] recognize group activities in volleyball games using an LSTM network that forms a feed-forward deep architecture. Instead of using the common softmax layer for prediction, they introduce an energy layer and estimate the energy of the predictions. There are also recent approaches [9, 10] that use convolutional neural networks for action recognition in ice hockey and football games, respectively.

Despite the existence of such approaches, a TV broadcast is not ideal for group activity analysis, since the camera usually captures only the region of interest (such as the ball location) and many players may not be in that region. Broadcast cameras also suffer from inaccurate player localization because of occlusions, camera motion, etc.

Fig. 1 a Sample frame from the fixed Camera 1. b Sample frame from the fixed Camera 2. c The top view of the handball court. d Sample image from the European handball game. e The top view of the field hockey playground. f Sample image from the field hockey game

1.2 Using fixed multiple cameras

Most team activity analysis methods [11,12,13,14,15,16,17] use a fixed multi-camera system around the playing field to overcome the limitations of broadcast data. The multi-camera system is usually configured to cover all locations on the playground and is therefore able to capture all players simultaneously. Player detection and tracking algorithms are employed on the videos to obtain the trajectories, and these trajectories are then transformed into the top view of the playing field for more accurate analysis. In the activity analysis stage, features (e.g., position and speed) are extracted from the explicitly defined trajectories, and a model (e.g., a Bayesian net, hidden Markov models or an SVM) is employed to recognize group activities such as different types of offense and defense. These models are summarized below.

Intille and Bobick [11] use Bayesian belief networks for probabilistically representing and recognizing multi-agent action from noisy trajectories in American football. Blunsden et al. [12] extract features from the trajectory data and classify different offense and defense types in European handball using an SVM. Perse et al. [13] segment the play into three different phases (offense, defense and time-out) in a basketball game using a mixture of Gaussians; a more detailed analysis is then performed to define a semantic description of the observed activity. Perse et al. [14] also present another approach, which uses Petri nets (PNs) for the recognition and evaluation of team activities in basketball. Hervieu et al. [15] use a hierarchical parallel semi-Markov model to represent and classify team activities in handball. More recently, Dao et al. [16] propose a sequence of symbols, derived from the distribution of players' positions over a period of time, to represent and recognize offense types (e.g., side attack and center attack) in soccer games. Li and Chellappa [17] also address the problem of recognizing offensive play strategies in American football using a probabilistic model. Varadarajan et al. [18] introduce a topic model approach to represent and classify American football plays: player trajectories are the inputs to maximum entropy discriminative latent Dirichlet allocation (MedLDA) for supervised activity learning and classification. Montoliu et al. [19] present a methodology for team activity recognition in handball games based on the author topic model (ATM). They use two synchronized and stationary bird's-eye view cameras and extract optical flow-based activity features from the video frames; the evolution of motions and the recognition of team activities are based on the ATM.

In this paper, we present a framework for temporal segmentation and recognition of team activities which is based on players’ trajectories on the top view of the playing field. Our motivation and contribution are explained below.

2 Our motivation and contribution

In team activities, a group of people (the team) performs activities on a constrained playground. All of the existing trajectory-based methods analyze the specific positions (sets of points) obtained by either vision-based tracking or GPS-based wearable sensors. There are two main drawbacks to these approaches. First, the position information is noisy. Second, and most important, they use only the specific positions and ignore the rest of the playground. By its very nature, team activity takes place over the whole playground as the entire team reconfigures itself to either attack or defend. Thus, we believe that a more holistic approach is required rather than simply considering a collection of specific player locations.

In this paper, we propose an approach that analyzes the entire playground for activity feature extraction. Given the team players' positions on a plan view of the playing field at any given time, we solve a particular Poisson equation to generate a smooth distribution that we term the position distribution of the team. The position distribution is computed at each frame to form a sequence of distributions. We then process this sequence to extract motion-information images for each frame, where the motion-information images are obtained using frame differencing and optical flow. Finally, we compute weighted moments (up to second order) of these images to represent motion features at each frame. The proposed motion features are evaluated with support vector machine (SVM) classification on two different datasets: the European handball and field hockey datasets. The European handball dataset [20] is publicly available, and the position information (trajectories) of the players is collected using a multi-camera capture setup similar to those reported previously; sample frames from these cameras are shown in Fig. 1a, b. The top view of the European handball court and a sample image from the handball game are shown in Fig. 1c, d, respectively. We also created a larger dataset for a different game with more activities, the field hockey dataset, to conduct extensive experiments on team activity segmentation and recognition. In the field hockey dataset, the position information is collected using GPS-based wearable sensors. The top view of the field hockey playground and a sample image from the game are shown in Fig. 1e, f. Results show that we can temporally segment and recognize six different team activities in handball and eight different team activities in field hockey. We also perform better than a method [12] that analyzes the explicitly defined trajectories for recognition, and better than a method based on a pretrained convolutional neural network (AlexNet) [21].

Fig. 2 The Poisson equation is applied to generate the position distribution. a The top view of the handball court with player locations. b The position distribution of the team. c The color-mapped position distribution with level sets

Our method is novel and differs from the other trajectory-based methods presented in Sect. 1.2. Those methods extract activity features from the explicitly defined trajectories, where the players have specific positions at any given time, and ignore the rest of the playground. In our work, on the other hand, given the specific positions of the team players at a frame, we construct a position distribution for the team over the whole playground and process the sequence of position distribution images to extract motion features for activity recognition. As no tracking and positioning algorithm (vision based or GPS based) can be 100% accurate, the position distribution accounts for the uncertainty of players' positions, and it is defined on the whole playground so that it can be considered an intensity image. Representing the positions of the team players as an intensity image instead of a set of points at any given time allows us to use frame differencing and optical flow, which are important techniques for image motion description. We extract motion features at each frame from the sequence of position distribution images, instead of from the explicitly defined trajectories, to represent activities.

Earlier versions of this work were presented in [22, 23]. In [22], we verified that a particular Poisson equation can be used to determine the region of highest population, corresponding to the area with the highest density of players, and to estimate the region of intent, corresponding to the region toward which the team is moving as it presses for territorial advancement. In [23], we significantly extended this early work to perform full classification of team activity; that work was not concerned with the region of intent or the region of highest population and was an independent piece of work. However, the continuous classification of team activities in [23] was only investigated on the European handball dataset, which is a publicly available small dataset. In this paper, we create and investigate a larger and more challenging dataset (i.e., field hockey). We conduct extensive evaluations of our method while comparing with the most related method [12], which is designed to recognize similar activities. In particular, we assess the accuracy in detail, conduct time evaluations and study the effect of window size.

3 Team position distribution generation

We illustrate the problem in the context of European handball, where the top view of the handball field of play with the team player positions is shown in Fig. 2a. (A European handball team has 7 players.) Given the positions of the team players at any time, we aim to generate a position distribution of the team defined on the whole playground. There are many possible probability distribution models (e.g., Gaussian, Laplace or Cauchy distributions) that could be centered on each player position and then summed to generate a position distribution of the team. Since the activity is performed on the bounded playground and players have to be on the playground to be involved in the team-based activity, the position distribution must be zero outside the playground. This can be achieved by using truncated versions (e.g., truncated Gaussians) of the probability distributions. However, most probability distribution models that can create a smooth distribution and account for position uncertainty are parameter dependent, and the parameters need to be adjusted to optimize the performance of team activity recognition. In our work, we choose to solve a particular Poisson equation to generate the position distribution since it has a unique and steady-state solution with respect to the given team player positions. The proposed Poisson equation is parameter-free, can model zero probability outside the playground without any truncation, and its solution depends only on the players' positions.

3.1 Background to the Poisson equation

In mathematics, the Poisson equation is an elliptic-type partial differential equation [24] that usually arises in electrostatics, heat conduction and gravitation. The general form of the Poisson equation, in two dimensions, is given by

$$\nabla^{2} I(\mathbf{x}) = -Q(\mathbf{x}), \tag{1}$$

where Q is a real-valued function of a space vector \(\mathbf{x}=(x,y)\), known as the source term, I is the solution, which is also a real-valued function, and \(\nabla^2\) is the spatial Laplacian operator. Given a source term \(Q(\mathbf{x})\), we find a solution \(I(\mathbf{x})\) that satisfies the Poisson equation and the boundary conditions over a bounded region of interest. There are three general types of boundary conditions: Dirichlet, Neumann and mixed. Here, we explain the Dirichlet condition, which is used in our algorithm. In the Dirichlet condition, the boundary values (solutions) are specified on the boundary; these values can be a function of space or can be constant. The Dirichlet condition is written as \(I(\mathbf{x}) = \varPhi(\mathbf{x})\), where \(\varPhi(\mathbf{x})\) is the function that defines the solution at the boundary.

3.2 The proposed Poisson equation and solution

The proposed Poisson equation and the resulting distribution (solution) are obtained based on the following considerations. The top-view image of the field of play is taken to be a binary image in which the player positions are one and all other positions are zero at any time during the game. Although players are expected to be in the play area during the game, they can sometimes move slightly outside it for a variety of reasons, such as to serve the ball, when the ball is out, or to talk to the coach. Thus, we expand the binary image of the field of play to allow for the possibility that the players may move a little outside the lines. The binary image is defined to be the source term in the Poisson equation. The boundary condition is Dirichlet with the specific solution \(I(\mathbf{x}) = 0\) at the boundaries of the expanded field of play, meaning that there is no possibility for a player to be outside the region of interest. The proposed Poisson equation problem is

$$\nabla^{2} I(x,y) = -\sum_{i=1}^{N}\delta(x-x_i,\, y-y_i), \qquad I(x,y)=0 \;\; \text{(boundary condition)}, \tag{2}$$

where N is the number of players in the team and \((x_i,y_i)\) is the position of player i. The source function is a linear combination of two-dimensional Dirac delta functions \(\delta(\cdot)\). It is important to note that the proposed Poisson equation has a unique and steady-state solution at each frame. The solution is parameter-free and depends only on the positions of the players; therefore, when players change their positions from one frame to the next, the solution changes accordingly.

Numerical solution methods for the Poisson equation can be categorized as direct and iterative. In [25], Simchony et al. point out that direct methods are more efficient than multi-grid-based iterative methods for solving the Poisson equation on a rectangular domain, since direct methods can be implemented using the fast Fourier transform (FFT). In our work, since the field of play is rectangular, we employ FFT-based direct methods to solve the proposed Poisson equation. The proposed equation has a Dirichlet boundary condition, which requires discrete sine transforms (computed via the FFT) to achieve an exact solution; a detailed description of the solution method is given in [25]. The solution to the proposed equation forms peaks at the player positions. To smooth these peaks, we apply Gauss–Seidel iterations (5 iterations) as a post-processing stage to relax the surface while maintaining the boundary condition \(I(\mathbf{x}) = 0\) outside the region of interest.
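
To make the solution procedure concrete, the sketch below solves Eq. (2) on a discrete grid with a type-I discrete sine transform (which enforces homogeneous Dirichlet boundaries) and then applies a few relaxation sweeps. It is a minimal illustration, not our original implementation: the function name, the SciPy-based solver and the vectorized (Jacobi-style) stand-in for the Gauss–Seidel sweeps are assumptions made here.

```python
import numpy as np
from scipy.fft import dstn, idstn

def team_position_distribution(players_xy, height, width, sweeps=5):
    """Solve nabla^2 I = -Q with I = 0 on the boundary, where Q is a
    sum of unit impulses at the player positions (Eq. (2))."""
    # Source term: impulses at the (rounded) player positions.
    Q = np.zeros((height, width))
    for x, y in players_xy:
        Q[int(round(y)), int(round(x))] = 1.0

    # Direct solve: the DST-I diagonalizes the 5-point Laplacian
    # under homogeneous Dirichlet boundary conditions.
    j = np.arange(1, height + 1)[:, None]
    k = np.arange(1, width + 1)[None, :]
    lam = (2 * np.cos(np.pi * j / (height + 1)) - 2) \
        + (2 * np.cos(np.pi * k / (width + 1)) - 2)   # eigenvalues (< 0)
    I = idstn(dstn(Q, type=1) / (-lam), type=1)

    # Post-processing: relaxation sweeps to smooth the peaks while
    # keeping I = 0 outside the region of interest (a vectorized
    # Jacobi-style substitute for the Gauss-Seidel iterations).
    for _ in range(sweeps):
        I[1:-1, 1:-1] = 0.25 * (I[:-2, 1:-1] + I[2:, 1:-1]
                                + I[1:-1, :-2] + I[1:-1, 2:]
                                + Q[1:-1, 1:-1])
    return I
```

For the handball court, this would be evaluated on the \(220\times 120\) grid used in our experiments, e.g., `team_position_distribution(players, height=120, width=220)`.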

The resulting distribution provides the likelihood of a position being occupied by players at any given time, and we call it the position distribution of the team. Figure 2b shows the position distribution for the given example, and Fig. 2c shows the same distribution with color mapping and level sets. In our experiments, the resolution of the position distribution image is \(220\times 120\) for handball and \(372\times 240\) for field hockey.

Fig. 3 Generating a motion-information image using frame differencing. a Team players' movements. b The motion-information image

Fig. 4 Computing the directional speed images to represent the motion-information images. a The position distribution and the estimated optical flow. b The zoomed-in image from the red box in (a). c Directional speed image in the direction of the positive x-axis, d negative x-axis, e positive y-axis and f negative y-axis (color figure online)

4 Motion-information images and feature extraction

Computing the position distribution for each frame provides a sequence of position distributions. We process the sequence of distribution images to generate motion-information images which can describe motion at each frame. The motion-information images are created using frame differencing and optical flow.

4.1 Frame differencing

The simplest way to detect motion is by image differencing. Figure 3a shows the direction of movement of the team players from the current frame to the next frame (50 frames later), where the starting point of each arrow represents the position of a player at the current frame and the end point represents the position at the next frame. We compute the position distribution for the team at the current and the next frames. Since the team players move from the current positions to the next positions, they create higher position distribution values in the direction of movement. To detect motion together with its direction, we apply change detection by simply subtracting the current distribution from the next distribution, keeping the positive values and setting the negative values to zero, i.e., we keep \(\left( I(x,y,n+m)- I(x,y,n)\right) > 0\), where \(I(x,y,n)\) represents the position distribution of the team at frame number n and m is the number of frames between the current and the next frames. Frame differencing is applied with a temporal extent of 50 frames (i.e., \(m=50\)) in our experiments. Figure 3b shows the resulting frame difference, with positive values kept and negative values set to zero, for the given example.
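
Assuming the position distributions are stored as a sequence of images, the rectified difference image can be computed as below; the function name and the sequence container are illustrative.

```python
import numpy as np

def frame_difference_image(I_seq, n, m=50):
    """Motion-information image from frame differencing: subtract the
    current position distribution from the one m frames later and
    keep only the increases (motion in the direction of movement)."""
    diff = I_seq[n + m] - I_seq[n]
    return np.maximum(diff, 0.0)   # negative values set to zero
```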

4.2 Optical flow

Although frame differencing provides some information about the movement, it does not show exactly how the distribution points move. To describe the position changes at each frame, we compute optical flow vectors that provide the displacement of the points together with directions. We employ the classical Horn and Schunck (HS) method [26] for optical flow estimation. This is a differential approach that combines a data term assuming constancy of some image property (e.g., brightness constancy or gradient magnitude constancy) with a spatial term that models how the flow is expected to vary across the image (e.g., a smoothness constraint); an objective function combining these two terms is then optimized. In our experiments, we observed that using the gradient magnitude constancy assumption (i.e., \(\vert \nabla I(x,y,n)\vert =\vert \nabla I(x+u,y+v,n+m)\vert \)) instead of the brightness constancy assumption (i.e., \(I(x,y,n)= I(x+u,y+v,n+m)\)) gives better optical flow estimates, where u is the horizontal optical flow and v is the vertical optical flow. Therefore, in our work, we use the gradient magnitude constancy assumption together with the smoothness constraint to compute the optical flow on the playing field. The gradient of the position distribution is computed using the Sobel operator, and the optical flow is computed from the current frame to the next frame with a temporal extent of 8 frames. Two parameters affect the solution of the HS method: the parameter that weights the smoothness term is set to 0.1, and the number of iterations is set to 200. Figure 4a shows the position distribution image and the estimated optical flow; for better illustration, Fig. 4b shows the zoomed-in image from the red box in Fig. 4a. Note that this is a novel algorithm to compute the motion field on the top view of the playground. Kim et al. [27] compute the motion field on the top view of the playground by interpolating the players' motion vectors, which are generated from the specific positions of the players. In our algorithm, on the other hand, we use the specific positions to generate the position distributions and then estimate the motion field using optical flow.
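
A minimal sketch of this flow computation is given below: HS iterations applied to the Sobel gradient-magnitude images of two position distributions, with the smoothness weight 0.1 and 200 iterations mentioned above. The derivative approximations and the averaging kernel are standard HS choices and are assumptions here, not a transcription of our implementation.

```python
import numpy as np
from scipy.ndimage import convolve, sobel

# Weighted-average kernel for the HS neighborhood mean.
AVG = np.array([[1/12, 1/6, 1/12],
                [1/6,  0.0, 1/6],
                [1/12, 1/6, 1/12]])

def hs_flow_gradient_constancy(I0, I1, alpha=0.1, n_iter=200):
    """Horn-Schunck flow on the gradient-magnitude images of two
    position distributions: gradient magnitude constancy plus the
    smoothness constraint."""
    # Gradient magnitude of each distribution (Sobel operator).
    G0 = np.hypot(sobel(I0, axis=1), sobel(I0, axis=0))
    G1 = np.hypot(sobel(I1, axis=1), sobel(I1, axis=0))

    # Spatial and temporal derivatives of the constancy term.
    Gx = np.gradient(G0, axis=1)
    Gy = np.gradient(G0, axis=0)
    Gt = G1 - G0

    u = np.zeros_like(G0)
    v = np.zeros_like(G0)
    for _ in range(n_iter):
        u_bar = convolve(u, AVG)
        v_bar = convolve(v, AVG)
        # Closed-form per-pixel HS update.
        common = (Gx * u_bar + Gy * v_bar + Gt) / \
                 (alpha ** 2 + Gx ** 2 + Gy ** 2)
        u = u_bar - Gx * common
        v = v_bar - Gy * common
    return u, v
```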

The motion-information images based on optical flow are generated as follows. The horizontal and vertical components of the flow (i.e., u and v) are two different scalar fields. Each component is half-wave-rectified to generate four nonnegative channels, \({u^+}\), \({u^-}\), \({v^+}\) and \({v^-}\), such that \(u={u^+}-{u^-}\) and \(v={v^+}-{v^-}\). These channels represent directional speed images in the direction of the positive x-axis, negative x-axis, positive y-axis and negative y-axis, respectively. Note that directional speed images have also been used in [28] for individual action recognition, but their usage for group activity recognition as proposed here is novel. The directional speed images are illustrated in Fig. 4c–f for the given example.
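
The rectification itself is a one-line operation per channel; a sketch (with an illustrative function name):

```python
import numpy as np

def directional_speed_images(u, v):
    """Half-wave-rectify the flow components into four nonnegative
    directional speed images: +x, -x, +y and -y."""
    u_pos, u_neg = np.maximum(u, 0.0), np.maximum(-u, 0.0)
    v_pos, v_neg = np.maximum(v, 0.0), np.maximum(-v, 0.0)
    # By construction, u = u_pos - u_neg and v = v_pos - v_neg.
    return u_pos, u_neg, v_pos, v_neg
```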

4.3 Feature extraction

We use five motion-information images to describe the motion at each frame: one obtained with frame differencing and four obtained with optical flow. Frame differencing is applied with a temporal extent of 50 frames, while the optical flow is computed with a temporal extent of 8 frames, so that frame differencing captures motion over a longer period of time and optical flow captures motion over a shorter period. Our experiments show that describing the motion in this way performs better than the other options.

Next, we compute weighted moments for each motion-information image to represent motion features at that frame. The discrete form of the equation is,

$$m_{pq} = \sum _{x}\sum _{y}w(x,y)\,x^{p}y^{q}\,\varDelta {x}\,\varDelta {y}. \tag{3}$$

Here, \(m_{pq}\) is the moment of order p and q, \(w(x,y)\) is the weight function, for which we substitute each motion-information image in turn, and \(\varDelta {x}=\varDelta {y}=1\) are the pixel spacings. We compute moments up to order \(p+q=2\), resulting in 6 moments per image and 30 moments in total to describe the motion at each frame.
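
A direct implementation of Eq. (3) over the five images might look as follows (function names are illustrative):

```python
import numpy as np

def weighted_moments(img):
    """Moments m_pq with p + q <= 2 of one motion-information image,
    used as the weight function w(x, y) in Eq. (3)."""
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w]
    orders = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]
    return np.array([(img * x**p * y**q).sum() for p, q in orders])

def frame_descriptor(motion_images):
    """Concatenate the six moments of each of the five
    motion-information images into the 30-D frame feature."""
    return np.concatenate([weighted_moments(m) for m in motion_images])
```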

5 Classification using the motion descriptors

We investigate the use of the proposed features with support vector machine (SVM) classification, using a Gaussian radial basis function kernel. The SVM is a powerful classification technique: it maps the training data to a higher-dimensional space and constructs a separating hyperplane such that the margin between the hyperplane and the nearest data points is maximized. Test data are then classified by the discriminant function.

In the handball dataset, a test frame is classified using its 141-by-141 neighborhood (141 frames from the past and 141 from the future), determined experimentally; this gives a window size of 283 (including the test frame). Each frame in the window is labeled with the SVM classifier using the one-against-all method, and the most frequent class is selected to represent the activity of the test frame. The scaling factor of the Gaussian kernel function is 2.4, and the upper bound on the Lagrange parameters (i.e., the soft margin cost parameter) is 10. In addition, we use the sequential minimal optimization method to find the separating hyperplane, since we have a large dataset and this method is computationally efficient.

In the field hockey dataset, a test frame is classified using its 112-by-112 neighborhood, giving a window size of 225 (including the test frame). Each frame in the window is labeled with the SVM classifier using the one-against-all method, and the most frequent class represents the activity of the test frame. The scale factor of the Gaussian kernel function is 2, and the upper bound on the Lagrange parameters is 10. The sequential minimal optimization method is used to find the separating hyperplane.
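
The temporal smoothing step in both datasets is a simple majority vote over per-frame SVM labels; a sketch, assuming the per-frame labels have already been produced by the one-against-all classifier:

```python
import numpy as np
from collections import Counter

def windowed_labels(frame_labels, half_window):
    """Temporal majority vote: the label of frame t is the most
    frequent per-frame SVM label inside [t - half_window,
    t + half_window] (window size 2*half_window + 1);
    half_window = 141 for handball and 112 for field hockey."""
    frame_labels = np.asarray(frame_labels)
    smoothed = np.empty_like(frame_labels)
    for t in range(len(frame_labels)):
        lo = max(0, t - half_window)
        hi = min(len(frame_labels), t + half_window + 1)
        smoothed[t] = Counter(frame_labels[lo:hi]).most_common(1)[0][0]
    return smoothed
```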

Table 1 Team activities in handball with their numbering

6 Evaluation and results

The proposed model is evaluated on European handball and field hockey games. European handball is usually an indoor game, whereas field hockey is an outdoor game with a larger field of play.

6.1 Evaluation on European handball dataset

In handball, each team has seven players and the game is played on a 40-by-20-meter court. The dataset for the handball game is from the publicly available CVBASE dataset [20] and consists of a 14978-frame sequence (approximately 10 min) of a handball game. The playground coordinates of the seven players of the same handball team are available throughout the sequence. These trajectories were extracted from two bird's-eye view cameras, one above each half of the court, with semiautomatic tracking; details on the trajectory extraction are given in [29]. The dataset providers estimated the error on players' positions in the playground to be between 0.3 and 0.5 meters. There are six different team activities in this dataset, with annotated start and end times. The definitions of the six team activities with their numbering are given in Table 1. The length of each activity sequence ranges from 125 frames to 1475 frames. It should be noted that some of these activities could be split into more complex activity classes; however, representing more complex activities requires additional information, such as the ball trajectory or the trajectories of the opposing team, which is not provided in this dataset.

We evaluate our approach against two other models: a model proposed by Blunsden et al. [12] that analyzes the explicitly defined trajectories for team activity recognition, and a model based on a pretrained convolutional neural network (CNN) [21] that extracts CNN features from the motion-information images defined in Sect. 4 for activity representation. The model proposed by Blunsden et al. [12] was designed to recognize the same activities in the same dataset, which we believe makes it one of the best comparisons we can make given the current status of work in this area. They extract 5 features (i.e., positions, speed and directions) from each player trajectory, and all the players' features are concatenated to form a 35-dimensional feature vector representing the activity at each frame. An SVM classifier is then trained on these data, using the one-against-all method for classification. A test frame is classified using its 99-by-99 neighborhood, which makes the window size 199 (including the test frame). Each frame in the window is labeled with the SVM classifier, and the most frequent label represents the class of the test frame. A Gaussian kernel function is used with scaling factor 2.4, the upper bound on the Lagrange parameters is 10, and the sequential minimal optimization method is used to find the separating hyperplane.

We also compare the proposed features with those of a pretrained convolutional neural network (i.e., AlexNet) [21] used as a feature extractor. In particular, we keep the proposed system architecture the same but use CNN features instead of the proposed features. We generate the five motion-information images as defined in Sect. 4 and resize each image to the AlexNet input size (i.e., \(227 \times 227\)). We then extract CNN features for each image from the last layer of the network (i.e., the fc8 layer of AlexNet, which is the last layer before classification). This yields a 1000-dimensional feature vector per image, and the features extracted from the five images are concatenated to form a 5000-dimensional feature vector representing the activity at each frame. Finally, we use a linear SVM to train and classify activities, with the one-against-all method for classification. A test frame is classified using its 100-by-100 neighborhood, which makes the window size 201 (including the test frame). The upper bound on the Lagrange parameters is 0.01, and the sequential minimal optimization method is used to find the separating hyperplane.
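
For reference, this baseline feature extraction could be sketched with torchvision's pretrained AlexNet, whose final 1000-way layer corresponds to fc8; treating the network output as the fc8 feature, the single-channel-to-RGB replication, and the preprocessing below are assumptions of this sketch rather than a description of the original experiment.

```python
import numpy as np
import torch
from torchvision import models, transforms

# Pretrained AlexNet; its final 1000-way layer plays the role of fc8.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()

to_input = transforms.Compose([
    transforms.ToTensor(),          # HWC float array -> CHW tensor
    transforms.Resize((227, 227)),  # AlexNet input size used above
])

def cnn_frame_descriptor(motion_images):
    """Five motion-information images -> five 1000-D fc8 features,
    concatenated into one 5000-D frame descriptor."""
    feats = []
    with torch.no_grad():
        for img in motion_images:
            # Replicate the single-channel image to three channels.
            rgb = np.repeat(img[..., None], 3, axis=2).astype(np.float32)
            x = to_input(rgb).unsqueeze(0)
            feats.append(alexnet(x).squeeze(0).numpy())
    return np.concatenate(feats)
```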

6.1.1 Temporal segmentation and recognition

In our evaluation, the second half of the video is used for training (7600 frames, 5 min and 4 s) and the first half is used for testing (7328 frames, 4 min and 53 s). Both halves include the six different team activities. In the first half, there are 1, 3, 3, 1, 2 and 3 instances, and in the second half 3, 3, 2, 2, 2 and 4 instances, of activity numbers 1, 2, 3, 4, 5 and 6, respectively. Since proper training is required for robust classification, we choose the second half for training: it includes more activity samples than the first half (e.g., activity number 1 is performed once in the first half and three times in the second half). In the training set, each activity is represented by at least two and at most four segments; in the test set, by at least one and at most three segments. In addition, since we test on the continuous sequence, there are also time-out segments, which occur when the ball is out or play is stopped. In handball, during a time-out the teams keep moving and start to perform the next activity; for example, if they are serving the ball, they move around to create space, whereas if the opposing team is serving, they move around to prevent the pass. Therefore, each time-out in the test sequence is assigned the label of the following activity in our experiments.

Fig. 5 Temporal segmentation and recognition of activities. a FET + SVM (proposed by Blunsden et al. [12]), b CNN features with SVM and c proposed features with SVM

In continuous classification, we classify all individual frames. We evaluate our features with SVM classification, with all classification details provided in Sect. 5. The same evaluations are also conducted for the method proposed by Blunsden et al. [12] and for the method based on CNN features [21] for comparison. In the evaluations, the method of Blunsden et al. [12] is denoted FET + SVM, meaning that features are obtained from the explicitly defined trajectories and classification is achieved with support vector machines; the method based on CNN features [21] is denoted CNN + SVM, meaning that CNN features are trained and classified with support vector machines. Figure 5a–c shows the temporal segmentation and recognition results obtained by FET + SVM, by CNN + SVM and by the proposed features with SVM, respectively. The blue graph represents the ground truth, and the red graph represents the prediction. The proposed features with SVM achieve better temporal segmentation and recognition than FET + SVM and CNN + SVM. FET + SVM and CNN + SVM cannot identify activity number 4 (fast returning into defense) and confuse it with activity number 5 (slowly returning into defense). FET + SVM also confuses activity numbers 2 and 5 (offense against set-up defense and slowly returning into defense, respectively). There are also some errors at activity switches for FET + SVM and CNN + SVM. The proposed features with SVM can recognize all six activities, and their errors occur at activity switches.

Table 2 Correct classification rates (CCR%) of the proposed features + SVM, CNN features + SVM and FET + SVM (total frames: 7328)
Table 3 Precision (P%) and recall (R%) of the proposed features + SVM, CNN features + SVM and FET + SVM for each activity in handball dataset

As stated by Ward et al. [30], there are two alternatives for scoring in the evaluation of activity recognition: frames and events. Our evaluation is based on scoring frames, which is an accepted validation method and which we believe puts us in line with best practice, especially when we both segment and classify continuous videos. We classify 7328 test frames in the evaluation, and Table 2 shows the correct classification rate (CCR%) for each method, computed as \(CCR\% = (C_{c} / T_{c})\times 100\), where \(C_{c}\) is the number of correct classifications and \(T_{c}\) is the total number of classifications. FET + SVM achieves 89.74%, CNN + SVM achieves 92.97%, and the proposed features with SVM achieve a 94.61% recognition rate. The proposed features with SVM thus perform around 4.9% better than FET + SVM and around 1.6% better than CNN + SVM, i.e., the proposed features perform significantly better than the other features with the same classifier.

Table 3 shows the precision and recall results for each activity class obtained using each method. Here, the precision for a class is defined as \(P\% = (P_{c} / P_{t})\times 100\), where \(P_{c}\) is the number of frames correctly predicted as belonging to that class and \(P_{t}\) is the total number of frames predicted as belonging to that class. The recall for a class is defined as \(R\% = (R_{c} / R_{t})\times 100\), where \(R_{c}\) is the number of frames correctly predicted and \(R_{t}\) is the total number of frames that actually belong to that class. Both the precision and the recall must be high for a method to show that it can handle activity switches and provide sufficient discrimination. There is only one activity in Table 3, activity 6, for which FET + SVM [12] has slightly better precision and better recall than the proposed features + SVM. In general, the proposed features + SVM performs better than FET + SVM [12]. The main problem of the FET + SVM method is that it cannot discriminate activity number 4 and is sensitive to activity switches; CNN + SVM has the same problems. The proposed features with SVM, on the other hand, can discriminate all activities and handle activity switches better than FET + SVM and CNN + SVM.

Table 4 Confusion matrix for FET + SVM [12]
Table 5 Confusion matrix for CNN features with SVM
Table 6 Confusion matrix for proposed features with SVM

Tables 4, 5 and 6 show the confusion matrices for the six activities obtained by FET + SVM [12], CNN + SVM and the proposed features + SVM, respectively. The entry at a given row and column is the proportion of frames of the row class that are classified as the column class. For example, in Table 6, 85% of the activity 3 frames are correctly classified as activity 3, 9% are misclassified as activity 2, and 6% are misclassified as activity 6.
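
These frame-level scores are straightforward to compute from the per-frame predictions; a sketch (the function name and the 1-based activity numbering follow the tables above):

```python
import numpy as np

def evaluation_metrics(y_true, y_pred, n_classes):
    """Frame-level CCR%, per-class precision/recall (%) and the
    row-normalized confusion matrix; activities are numbered from 1."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t - 1, p - 1] += 1
    ccr = 100.0 * np.trace(cm) / cm.sum()
    precision = 100.0 * np.diag(cm) / cm.sum(axis=0)  # per predicted class
    recall = 100.0 * np.diag(cm) / cm.sum(axis=1)     # per true class
    cm_rows = cm / cm.sum(axis=1, keepdims=True)      # row-normalized
    return ccr, precision, recall, cm_rows
```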

6.1.2 The effect of window size

We study the effect of differing window sizes on the classification performance (CCR%). Figure 6a shows the CCR% for the proposed features with SVM, for the CNN + SVM model [21] and for the FET + SVM model [12], with window sizes ranging from 51 to 351. The proposed features with SVM perform better than the other models at every window size. The optimal window size is 283 for the proposed features + SVM, 199 for the FET + SVM model and 201 for the CNN + SVM model.

6.1.3 The effect of motion-information images

We study the influence of the motion-information images and report the temporal segmentation and classification results when only frame differencing or only optical flow is used. Figure 6b shows the result obtained using only frame differencing (one motion-information image), Fig. 6c shows the result using only optical flow (four motion-information images), and Fig. 5c illustrates the result using the combination (five motion-information images). Frame differencing alone achieves 90.96%, optical flow alone achieves 92.82%, and the combination achieves 94.61%. The results indicate that using the combination improves both the CCR% and the discrimination.

Fig. 6 Effect of window size and effect of the motion-information images. a Classification performances (CCR%) with differing window size. b Temporal segmentation and recognition using only frame differencing and c only optical flow

6.1.4 The effect of SVM kernel function

We study the influence of different SVM kernel functions on the handball dataset. We experiment with three kernel functions: a linear kernel, a polynomial kernel and a Gaussian radial basis kernel. Table 7 shows the performance of the proposed method with respect to these kernels; the proposed method performs best with the Gaussian radial basis kernel. Parameters of the kernel functions are selected using fivefold cross-validation and grid search, applied to the training part of the video. The linear SVM has a single parameter, the soft margin cost (C); the optimal value is \(C=0.25\). The polynomial SVM has two parameters, the cost (C) and the polynomial order (P); the optimal values are \(C=0.01\) and \(P=3\). Finally, the Gaussian radial basis kernel has two parameters, the cost (C) and the Gaussian scale factor (S); the optimal values are \(C=10\) and \(S=2.4\).
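
Such a selection could be reproduced with a standard grid search; a sketch with scikit-learn, where the candidate grids and the placeholder data are assumptions, and sklearn's gamma corresponds to the scale factor S via \(\gamma = 1/(2S^2)\):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 30))     # placeholder 30-D frame features
y_train = rng.integers(1, 7, size=500)   # placeholder activity labels 1..6

# Fivefold cross-validated grid search over the three kernels.
param_grid = [
    {"kernel": ["linear"], "C": [0.01, 0.25, 1, 10]},
    {"kernel": ["poly"], "C": [0.01, 0.02, 0.25], "degree": [2, 3, 4]},
    {"kernel": ["rbf"], "C": [1, 10, 100],
     "gamma": [1 / (2 * s**2) for s in (1, 2, 2.4, 4)]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)
print(search.best_params_)
```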

Table 7 Evaluation of the SVM classifier with respect to different kernels in the handball dataset
Table 8 Computation time for each stage of the methods in the handball dataset

6.1.5 Computational efficiency

The computation time for each stage of the methods is given in Table 8. We also report the computation times when only frame differencing or only optical flow is used in our method. Results are obtained using MATLAB 7 on a Windows 7 operating system with an Intel Core i3-870 at 2.93 GHz and 8 MB RAM. The FET + SVM method is more efficient than the proposed method with SVM and CNN + SVM, especially in feature extraction. Although the proposed features (combination) with SVM are computationally less efficient than FET + SVM in feature extraction, they achieve significantly better classification accuracy than both FET + SVM and CNN + SVM.

6.2 Evaluation on field hockey dataset

Field hockey is an outdoor team sport with eleven players per team, played on a \(91.4 \times 55\) meter field. The position data were collected using SPIproX Global Positioning System units (GPSports Systems Limited, Australia) [31], supplied by the Statsports company [32] (the UK and Ireland distributor of GPSports). These GPS devices are among the most advanced GPS-based tracking technologies on the market and record the position coordinates of players at a frequency of 15 Hz (15 data points per second). The validity and reliability of GPS usage in field hockey were studied by Macleod et al. [33], who concluded that GPS is a reliable and valid measurement tool for assessing the movement patterns of field hockey players.

We collected a position dataset by recording a match between Irish national teams, U18 ladies versus U16 ladies (U18 means that all of the team players are under 18 years old). We collected the positions of nine players of the U18 team for approximately 35 min of the game. There are eleven players in a team, but unfortunately two of the sensors stopped working during the game, so the position data of two players (the goalkeeper and a forward) are missing from our dataset. Such sensor dropouts are a natural condition in real-world situations, and our experiments are conducted with the nine available player positions to segment and recognize team activities.

Table 9 Team activities in field hockey with their numbering

We also recorded the same field hockey match using two different broadcast cameras so that we could annotate team activities. The cameras capture 25 frames per second (fps), whereas the GPS sensors record at 15 Hz. To synchronize the GPS data with the broadcast video data, we linearly interpolate the GPS data so that the position data are also at 25 fps. In total, the sequence consists of 53,201 frames (approximately 35 min). For evaluation, the dataset is divided into two parts, approximately at the middle: the first half has 26,599 frames and the second half has 26,602 frames. Thus, we can train on the activities in the first half and test on the second half, or vice versa. There are eight different team activities in this dataset, with annotated start and end times. The definitions of the eight team activities with their numbering are given in Table 9. The length of each activity sequence ranges from 49 frames to 1975 frames.
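
The 15 Hz to 25 fps resampling is plain linear interpolation per player track; a minimal sketch (function name assumed):

```python
import numpy as np

def resample_positions(xy_15hz, src_fps=15, dst_fps=25):
    """Linearly interpolate a player's (x, y) track from the 15 Hz
    GPS rate to the 25 fps video rate for synchronization."""
    xy = np.asarray(xy_15hz, dtype=float)
    t_src = np.arange(len(xy)) / src_fps
    t_dst = np.arange(int(t_src[-1] * dst_fps) + 1) / dst_fps
    x = np.interp(t_dst, t_src, xy[:, 0])
    y = np.interp(t_dst, t_src, xy[:, 1])
    return np.stack([x, y], axis=1)
```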

Note that there are two more activities in this dataset in comparison with the handball dataset. In field hockey, the playground is larger and there are more possible movements for a team. The stalemate middle right activity (i.e., activity number 7) is a transient state between offense and defense. In this activity, the team moves to the middle right to intercept the ball or to pass the ball patiently before the next activity. The stalemate middle left activity (i.e., activity number 8) is also a transient state between offense and defense. The team moves to the middle left to intercept the ball or to pass the ball before the next activity.

6.2.1 Temporal segmentation and recognition

In the first experiment, we train the activities in the first half of the video (i.e., 26,599 frames, approximately 17 min and 44 s) and test the activities in the second half (i.e., 26,602 frames, approximately 17 min and 44 s). Then, in the second experiment, we train the activities in the second half of the video and test the activities in the first half. Both the first and second halves include the eight different team activities.

Table 10 Quantity of the training sets for each activity

In the first half, there are 18, 8, 6, 3, 11, 11, 13 and 15 instances, and in the second half 19, 4, 3, 8, 8, 13, 17 and 14 instances, of activity numbers 1 to 8, respectively. In continuous classification, we classify all individual frames; all classification details are provided in Sect. 5. The same evaluations are also conducted for FET + SVM [12] and CNN + SVM [21] for comparison purposes. In FET + SVM, the second half is classified using the 89-by-89 neighborhood of the test frame (window size 179, including the test frame), and the first half using the 81-by-81 neighborhood (window size 163). In the SVM, the scaling factor of the Gaussian kernel function is 3.2, the upper bound on the Lagrange parameters is 10, and the sequential minimal optimization method is used to find the separating hyperplane. In CNN + SVM, the second half is classified using the 96-by-96 neighborhood of the test frame (window size 193) and the first half using the 100-by-100 neighborhood (window size 201); the upper bound on the Lagrange parameters is 0.01, and the sequential minimal optimization method is used to find the separating hyperplane.

Fig. 7 Temporal segmentation and recognition of activities in the second half. a FET + SVM (proposed by Blunsden et al. [12]), b CNN features + SVM and c the proposed features with SVM

6.2.2 Training set selection

Training set selection is crucial for supervised classifiers. Large training sets usually increase the computation time and complexity of the training stage. In addition, training sets that include outlier features degrade the performance of the classifier during the testing phase. Outliers in pattern recognition can be defined as patterns whose characteristics differ from those of the majority of the patterns within the same class. In our domain, outliers can result from human annotation errors (especially at activity transition points) or GPS device noise. Therefore, to reduce the computation time of the training stage and enhance classifier performance, the training set for each activity needs to be cleaned and reduced in size.

In our work, we use an outlier detection method to reduce the size of the training sets. We use the local reconstruction weight (LRW) [34, 35] to detect and remove outlier features. The LRW algorithm is based on a dimensionality reduction technique called locally linear embedding [34, 35]. It starts by determining the k-nearest neighbors of every point in the training set and then reconstructs each point as a linear combination of its neighbors. The reconstruction is done with linear regression, and points with large reconstruction weights are taken to be outliers in the training set. Two parameters must be set in this method: the number of nearest neighbors of each data point is 12, and the regularization parameter is 0.001. To reduce the size of the training sets, we use the outlier detection MATLAB toolbox provided by Onderwater [36]. Table 10 shows the total number of samples (frames) in the first and second halves, and how many of these samples were included in the training set for each activity. For example, in the second half, there are 5131 samples from activity class 6, of which 3000 are used for training. The outlier detection method is used to reduce the number of samples, and the one-against-all method is used to train the SVM classifiers. So when we train activity class 6, we use 3000 samples to represent class 6 and a combination of samples from the other seven classes to represent non-class 6; we select about 500 samples from each of the other seven training sets to form the non-class-6 samples. Note that if the total number of samples for an activity is less than 1000 (for example, activity class 4 has 651 samples in the second half), all of its samples are included in the non-class set, since the training set is already very small and we do not want to miss important information. The same training sets are used for all methods: our method, FET + SVM and CNN + SVM.
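
A sketch of LRW-style outlier scoring in the spirit of the LLE reconstruction step is shown below; the exact scoring rule (we flag points whose reconstruction weights have large magnitude) and the function names are assumptions here, not a transcription of the toolbox in [36].

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lrw_scores(X, k=12, reg=1e-3):
    """Local reconstruction weights in the spirit of LLE: reconstruct
    each sample from its k nearest neighbors; samples that need large
    weights are likely outliers."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    scores = np.empty(len(X))
    for i, neigh in enumerate(idx[:, 1:]):     # skip the point itself
        Z = X[neigh] - X[i]                    # centered neighbors
        C = Z @ Z.T
        C += np.eye(k) * reg * np.trace(C)     # regularize the Gram matrix
        w = np.linalg.solve(C, np.ones(k))
        w /= w.sum()                           # weights sum to one
        scores[i] = np.abs(w).max()            # large weight -> outlier
    return scores

def trim_training_set(X, keep=3000, k=12):
    """Keep the `keep` samples with the smallest LRW scores."""
    order = np.argsort(lrw_scores(X, k=k))
    return X[order[:keep]]
```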

6.2.3 Second-half testing

We train on the first part of the annotated video and test on the second part. Figure 7a–c illustrates the temporal segmentation and recognition by FET + SVM, by CNN + SVM and by the proposed features with SVM, respectively. The blue graph represents the ground truth, and the red graph represents the prediction. The proposed features with SVM achieve better temporal segmentation and recognition than FET + SVM.

We classify 26,602 test frames in the evaluation, and Table 11 shows the correct classification rate (CCR%) for each method. FET + SVM achieves 71.24%, CNN + SVM achieves 75.61%, and the proposed features with SVM achieve an 88.06% recognition rate; the proposed features with SVM thus perform around 17% better than FET + SVM.

Table 11 CCR% of the proposed features + SVM, CNN features + SVM and FET + SVM in the second half (total frames: 26,602)
Table 12 Precision (P%) and recall (R%) of the proposed features with SVM, CNN features + SVM and FET + SVM for each activity in the second half

Table 12 shows the precision and recall results for each activity class obtained using each method. The proposed features with SVM have better precision and recall than the other methods for almost every activity class, which means that our method handles activity switches and provides sufficient discrimination better than the other methods.

Tables 13, 14 and 15 illustrate the confusion matrices for the eight activities obtained by FET + SVM, CNN features + SVM and the proposed features + SVM, respectively. As before, the entry at a given row and column is the proportion of frames of the row class classified as the column class. The FET + SVM method cannot classify activity classes 3, 4 and 5 well and confuses them with other activities. Our method, on the other hand, is significantly better than the other methods in discrimination and classification; its errors generally occur at activity switches.

6.2.4 First-half testing

We train on the second part of the video and test on the first part. Figure 8a–c shows the temporal segmentation and recognition by FET + SVM, by CNN + SVM and by the proposed features + SVM, respectively. The proposed method achieves better temporal segmentation and recognition than the other methods in the first half of the dataset as well.

Table 13 Confusion matrix for FET + SVM [12] in the second half
Table 14 Confusion matrix for CNN + SVM in the second half
Table 15 Confusion matrix for the proposed features with SVM in the second half

We classify 26,599 test frames in the first half of the video, and the CCR% for each method is given in Table 16. FET + SVM achieves 72.69%, CNN + SVM achieves 74.21%, and the proposed features + SVM achieve an 84.05% recognition rate. The proposed method thus performs around 11% better than FET + SVM and around 10% better than CNN + SVM.

Fig. 8 Temporal segmentation and recognition of activities in the first half. a FET + SVM (proposed by Blunsden et al. [12]), b CNN features + SVM and c the proposed features with SVM

The precision and recall results for each activity class are shown in Table 17. In general, the proposed features with SVM achieve better results than the other methods.

Table 16 CCR% of the proposed features + SVM, CNN features + SVM and FET + SVM in the first half (total frames: 26,599)
Table 17 Precision (P%) and recall (R%) of the proposed features with SVM, CNN features + SVM and FET + SVM for each activity in the first half

Tables 18, 19 and 20 show the confusion matrices obtained by FET + SVM, CNN features + SVM and the proposed features + SVM, respectively. The FET + SVM method cannot discriminate activity class 3 and usually confuses it with activity class 1; it also cannot classify activity classes 5 and 8 well and confuses them with other activities. The proposed features + SVM can discriminate all of the activities and achieve better classification than FET + SVM and CNN + SVM.

6.2.5 Overall performances

We also present the overall performances of the FET + SVM and the proposed features + SVM methods. Overall, we classify 53,201 test frames, i.e., 26,602 (second half) + 26,599 (first half) = 53,201. The overall CCR% for each method is shown in Table 21. Our method achieves 86.05%, while FET + SVM achieves 71.97%; the proposed method thus performs approximately 14% better than FET + SVM.

The overall precision and recall results for each activity class are shown in Table 22, and the overall confusion matrices for FET + SVM and our method are given in Tables 23 and 24, respectively. FET + SVM cannot discriminate activity class 3 (offense fast break) and confuses it with activity class 1 (slowly going into offense); it also performs poorly on activity classes 4, 5 and 8. The proposed features with SVM, on the other hand, can discriminate all of the activities and achieve better results than FET + SVM. These results show that the proposed method is effective for the temporal segmentation and recognition of activities. Figure 9a–d shows sample frames with activities automatically recognized by the proposed features with SVM.

Table 18 Confusion matrix for FET + SVM [12] in the first half
Table 19 Confusion matrix for CNN features with SVM in the first half
Table 20 Confusion matrix for the proposed features with SVM in the first half
Table 21 CCR% of the proposed features + SVM and FET + SVM overall (total frames: 53,201)
Table 22 Precision and recall of the proposed features with SVM and FET + SVM for each activity overall

6.2.6 The effect of window size

We study the effect of changing window size on the classification performance (CCR%). Figure 10a, b shows the performance of each method with changing window size for the second half and the first half, respectively. The window size ranges from 3 to 551 in our evaluation.

Table 23 Confusion matrix for FET + SVM [12] overall
Table 24 Confusion matrix for the proposed features with SVM overall

From these figures, it is observed that the proposed method performs better than the FET + SVM and CNN + SVM methods at every window size. Table 25 shows the optimal window sizes for each method on the field hockey dataset, for the second half and the first half.

Fig. 9 Activity of the black-uniform-wearing team is automatically recognized by the proposed features with SVM

Fig. 10 Classification performances (CCR%) with differing window size for a the second half and b the first half

Table 25 Optimal window sizes for each of the methods in the field hockey dataset
Fig. 11 Temporal segmentation and recognition in the second half a using only frame differencing and b using only optical flow

Fig. 12 Temporal segmentation and recognition in the first half a using only frame differencing and b using only optical flow

6.2.7 The effect of motion-information images

We also study the influence of the motion-information images on the field hockey dataset. For the second half of the dataset, Fig. 11a, b shows the results obtained using only frame differencing (one motion-information image) and only optical flow (four motion-information images), respectively; Fig. 7c illustrates the result using the combination (five motion-information images).

For the first half of the dataset, Fig. 12a, b shows the results using only frame differencing and only optical flow, respectively; Fig. 8c shows the result using the combination.

Table 26 also shows the CCR% obtained by only frame differencing, only optical flow and their combination for each testing case in the field hockey dataset. Overall, frame differencing alone achieves 53.77%, optical flow alone achieves 83.55%, and the combination achieves 86.05%. The results indicate that using the combination improves the performance.

6.2.8 The effect of SVM kernel function

We also study the effect of different SVM kernel functions on the field hockey dataset. We experiment with three kernel functions: a linear kernel, a polynomial kernel and a Gaussian radial basis kernel. Table 27 shows the performance of the proposed method with respect to these kernels; the best performance is achieved with the Gaussian radial basis kernel. Parameters of the kernel functions are selected using fivefold cross-validation and grid search applied to the training part of the video: the first half for second-half testing and the second half for first-half testing. For the linear SVM, the optimal soft margin cost is \(C=0.25\) for second-half testing and \(C=0.2\) for first-half testing. For the polynomial SVM, the optimal values are \(C=0.02\) and \(P=4\) for both testing cases. Finally, for the Gaussian radial basis kernel, the optimal values are \(C=10\) and \(S=2\) for both testing cases.

Table 26 Effect of motion-information images in the field hockey dataset
Table 27 Evaluation of the SVM classifier with respect to different kernels in the field hockey dataset

6.2.9 Computational efficiency

Table 28 shows the computation time required for each stage of the methods. Results are obtained using MATLAB 7 on a Windows 7 operating system with an Intel Core i3-870 at 2.93 GHz and 8 MB RAM. In feature extraction, the FET + SVM method is more efficient than the proposed method and CNN + SVM. On the other hand, the proposed method (combination) has significantly better classification accuracy than the other methods.

Table 28 Computation time for each stage of the methods in the field hockey dataset

7 Conclusion

We have presented an approach for the temporal segmentation and recognition of team activities in sports based on a new activity feature extraction strategy. Given the positions of team players on a plan view of the playing field at any given time, we solve a particular Poisson equation to generate a position distribution for the team. Computing the position distribution for each frame provides a sequence of distributions, which we process to extract motion features at each frame. These motion features are then used to classify team activities. Our method is evaluated on two different datasets, the European handball and field hockey datasets. Results show that the proposed approach is effective and performs better than the method (FET + SVM) that extracts features from the explicitly defined trajectories, and better than the method (CNN + SVM) that uses a pretrained convolutional neural network for feature extraction.