
1 Introduction

The extraction of the behavior characteristics of multiple players in soccer video, which determines for each pixel whether it represents a feature, is a primary operation in the process of behavior recognition [1]. The result of feature extraction is a division of the image points into subsets, which often correspond to isolated points, continuous curves, or contiguous regions. Many features can be used for behavior recognition in soccer video, such as the color features of the stadium, ball, athletes, and referees, the motion trajectory (spatial relationship) features of the ball and athletes, the contour features of players, and the shape features of the pitch lines.

2 Related Work

In recent years, many researchers have carried out extensive work on the feature extraction of athletes, extracting, among others, the color features of athletes' clothing, the contour features of the athletes, and the characteristics of the field lines. Over the past decade, motion analysis based on contour features has developed into an important feature extraction technique, and much recent research addresses this area. In [2, 3], novel approaches for indexing and evaluating sketch-based systems were proposed. A recent work in sports analytics [4] uses sketch-based search techniques to analyze player movements in rugby matches; it compares the user sketch against video scenes using multiple distance-based similarity measures. Besides traditional image and video search, sketch-based search techniques can also be applied to other complex data sets. In [5, 6], 2D sketches are used to search for 3D objects. Lee et al. [7] make use of synthesized background shadows to help users sketch objects. Further applications of sketch-based search include retrieval in graphs [8] and of bivariate data patterns in scatter plots [9, 10]. Regarding data generation, 2D shape sketches are used in [11] to create and manipulate high-dimensional data spaces. A novel model-based technique capable of generating personalized full-body 3D avatars from orthogonal photographs is presented in [12]; it utilizes a statistical model of human 3D shape and a multiview statistical 2D shape model of its corresponding silhouettes.

3 Characteristics Extraction of Behavior of Multiplayers

3.1 Color Feature Extraction of Players and Referees

Since soccer is a spectator sport, the playing field, the pitch lines, and the clothing of referees and players are designed to have distinctive visual appearances. Among visual features, these color differences are one of the most informative cues [13,14,15,16]. The field is green and the field lines are white, and referees and players must wear clothing of as high a contrast as possible. Color characteristics can be used not only to improve tracking, but also to distinguish players belonging to different teams. Therefore, the color features of video images can be extracted for behavior recognition [2]. In this paper, a color classification learning set is used to find regions of interest by mapping image pixels to their color classes, and morphological operators are then used to group the pixels. A mixed color space is used to find the space in which the pixels of the opposing teams and the referee are best distinguished [17, 18].

3.1.1 Color Classification, Segmentation, and Feature Extraction

The first step in processing video images of a soccer game is to apply color classification and segmentation to the image. Assuming that the previously learned color classification set includes the green-field class, the team clothing classes, and an "other" class, the color image pixels are first mapped to their respective classes in the visual perception module, and morphological operations are then used to find the regions of interest by denoising and grouping regions of the same color class [3, 4].
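As an illustrative sketch of this classify-then-group step (not the paper's exact implementation), the following Python/OpenCV fragment maps pixels to pre-learned color classes, expressed here as hypothetical HSV ranges, and groups them with morphological opening and closing; the class names and range values are placeholder assumptions.

```python
import cv2
import numpy as np

# Illustrative pre-learned color classes as HSV ranges (placeholder values).
COLOR_CLASSES = {
    "field":  ((35, 60, 40),   (85, 255, 255)),   # green pitch
    "team_1": ((0, 120, 80),   (10, 255, 255)),   # e.g. red shirts
    "team_2": ((100, 120, 80), (130, 255, 255)),  # e.g. blue shirts
}

def regions_of_interest(frame_bgr):
    """Map pixels to their color classes, then denoise and group the
    same-class regions with morphological operations."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    masks = {}
    for name, (lo, hi) in COLOR_CLASSES.items():
        mask = cv2.inRange(hsv, np.array(lo), np.array(hi))
        # Opening removes isolated misclassified pixels; closing fills holes.
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
        masks[name] = mask
    return masks
```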

In addition, objects of interest can be characterized by the properties of their image patches. In particular, it is assumed that a player stands upright and touches the pitch plane (at least when a sufficiently long image sequence is considered) [19, 20]. Under this assumption, the size of each patch can be estimated with reasonable accuracy from the image patch coordinates. Furthermore, objects of interest such as the players and the ball must satisfy certain compactness relations (the ratio between area and perimeter). These assumptions can be used to filter the data and extract the relevant objects more reliably.

The extracted color regions and color patches can be used to define complex regions of interest that are then handled by further image processing operations. For example, to find the field lines, the non-green pixels inside the detected playfield (excluding the regions occluded by players or the referee) are considered. This area can be expressed by Eq. (1).

$$ L = \left(\theta \cap C\right) - \alpha_1 - \alpha_2 - \alpha_3 $$
(1)

where L is the area of the field lines, θ is the non-green area, C is the detected playfield, α1 is the area occupied by team 1, α2 is the area occupied by team 2, and α3 is the area occupied by the referee.
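Since Eq. (1) is set subtraction, it maps directly onto boolean mask arithmetic. The short sketch below assumes the masks come from a segmentation step such as the one sketched above; the function and argument names are illustrative.

```python
def field_line_mask(non_green, playfield, team1, team2, referee):
    """Eq. (1): L = (theta ∩ C) − α1 − α2 − α3, on boolean masks."""
    L = non_green & playfield          # non-green pixels inside the playfield
    for occluder in (team1, team2, referee):
        L &= ~occluder                 # remove player / referee regions
    return L
```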

After the color regions of interest have been obtained, the color moments of each region can be extracted as color features. In the HSI color space, the first three color moments of each component can be calculated by Eq. (2).

$$ \left\{\begin{array}{l} M_1 = \frac{1}{N}\sum\limits_{i=1}^{N} X\left(p_i\right) \\ M_2 = \left[\frac{1}{N}\sum\limits_{i=1}^{N}\left(X\left(p_i\right) - M_1\right)^2\right]^{1/2} \\ M_3 = \left[\frac{1}{N}\sum\limits_{i=1}^{N}\left(X\left(p_i\right) - M_1\right)^3\right]^{1/3} \end{array}\right. $$
(2)

Here, X represents the H, S, or I component in the HSI color space, X(p_i) is the X value of the ith pixel p_i of the image, and N is the number of pixels in the image. Figure 1 shows the RGB distribution of an image, its H, S, and I components in the HSI color space, and the corresponding histograms. The calculated central moments of the image and seven characteristic values are 1.0e+004 * [1.5278 0.000 0.0001 0.0000 0.0004 0.0002 0.0003 0.0009 0.0005 0.0007].
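A minimal NumPy sketch of Eq. (2) for a single component follows; the signed cube root for M_3 is our implementation choice, since the third central moment can be negative.

```python
import numpy as np

def color_moments(channel):
    """First three color moments (Eq. 2) of one HSI component.

    channel: 2-D array of H, S, or I values; returns (M1, M2, M3).
    """
    x = channel.astype(np.float64).ravel()
    m1 = x.mean()                                  # first moment (mean)
    m2 = np.sqrt(np.mean((x - m1) ** 2))           # second moment (std dev)
    c3 = np.mean((x - m1) ** 3)
    m3 = np.sign(c3) * np.abs(c3) ** (1.0 / 3.0)   # signed cube root
    return m1, m2, m3
```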

Fig. 1. Color components and histograms of a video football match image. (a) Original image, (b) H component of image, (c) I component of image, (d) S component of image, (e) Color distribution of image, (f) H-S histogram of image, (g) H-I histogram of image

3.1.2 Robustness of Color Classification

Lighting conditions change as the camera sweeps from one side of the field to the other, as clouds move, as rain begins, and so on. For these reasons, reliable color segmentation cannot be achieved by learning the color classes beforehand and keeping them fixed during the game [21,22,23,24,25,26]. Instead, the color segmentation must adapt to the changing illumination, in particular the segmentation of the green field class. Therefore, the expectation-maximization (EM) method is used to re-estimate the green field class. Given the stadium model and the estimated camera parameters, the areas that must be green can be determined. The related area is expressed by Eq. (3).

$$ A_R = A_P - A_{T1} - A_{T2} - A_{Re} - A_L $$
(3)

where A_R is the related area, A_P is the detected playfield, A_T1 and A_T2 are the areas occupied by team 1 and team 2, respectively, A_Re is the region of the referee, and A_L is the area occupied by the pitch lines. Morphological operations are used to eliminate holes in this region. The pixels in the region are then extracted to estimate the color class of the green field, and finally this color model is used to estimate the camera parameters. In practice, the color classification model is re-estimated at a much lower rate than the camera parameters.
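A sketch of this re-estimation step follows, assuming scikit-learn's GaussianMixture as the EM implementation (the paper does not name a library); the number of mixture components is likewise an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def update_green_model(frame_hsv, related_area_mask, n_components=2):
    """Re-estimate the green field color class with EM from the pixels
    inside the related area A_R of Eq. (3)."""
    pixels = frame_hsv[related_area_mask.astype(bool)].reshape(-1, 3)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(pixels.astype(np.float64))   # EM iterations happen inside fit()
    return gmm  # gmm.score_samples() later gives per-pixel log-likelihood
```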

3.2 Contour Feature Extractions of Players

Given a video v = {I_1, I_2, ⋯, I_T} which contains T frames of soccer game behavior, the corresponding behavior contour sequence S_s = {s_1, s_2, ⋯, s_T} can be obtained from the original video. The size and position of the foreground region change with the distance between the player and the camera, the size of the target, and the behavior being performed. Keeping the aspect ratio of the player contour unchanged, the contour image of the player is centered and normalized, so that the resulting images RI = {R_1, R_2, ⋯, R_T} contain as much foreground as possible and all have the same dimension r_i × c_i, without deforming the motion. The normalized motion contour images of a player are shown in Fig. 2. If the original contour image R_i of the player is represented as a vector r_i in the \( {\Re}^{r_i\times {c}_i} \) space using line-scanning order, the outline of the player over the whole football game is represented as v_r = {r_1, r_2, ⋯, r_T}.

Fig. 2. Contour sequence and block feature representation of a player

To improve computational efficiency, the contour image of each player is divided into h × w non-overlapping sub-blocks of equal size [5]. The normalized value of each sub-block is then calculated using Eq. (4).

$$ {N}_i=b(i)/ mv,\kern1em i=1,2,\cdots, h\times w $$
(4)

Here, b(i) is the number of foreground pixels in the ith sub-block and mv is the maximum of all b(i). In the h × w space, the descriptor of the silhouette of a player in the tth frame is f_t = [N_1, N_2, ⋯, N_{h×w}]^T, and the outline of the player over the whole video is correspondingly represented as v_f = {f_1, f_2, ⋯, f_T}. In fact, the original player contour representation v_r can be considered a special case of the block-based features in which the size of each sub-block is 1 × 1 (one pixel).
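The block descriptor of Eq. (4) can be sketched as follows; np.array_split is used so that silhouette sizes not exactly divisible by h or w are still handled.

```python
import numpy as np

def block_descriptor(silhouette, h, w):
    """Eq. (4): split a binary silhouette into h x w non-overlapping blocks
    and normalize each block's foreground count by the maximum count."""
    rows = np.array_split(silhouette, h, axis=0)
    counts = np.array([[blk.sum() for blk in np.array_split(r, w, axis=1)]
                       for r in rows], dtype=np.float64)
    mv = counts.max()                        # mv in Eq. (4)
    return (counts / mv).ravel() if mv > 0 else counts.ravel()
```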

3.3 Feature Extractions for Stadium Line

The field lines contain important coordinate information about the stadium. The results of line extraction can be used directly for camera calibration, stadium reconstruction, and the calculation of player coordinates in the real scene. To extract the court line features, the video image of the game is first converted to a binary image; the coordinate parameters of the court lines are then extracted preliminarily using the Hough transform, and finally the accurate line coordinates are computed using gray-level fitting. When the Hough transform is used to extract line features, each line is represented by the distance d, which is calculated by Eq. (5).

$$ d=x\cos \theta +y\sin \theta $$
(5)

Here, d ranges over the diagonal length l of the video image, that is, d ∈ [−l, l]; θ ∈ [0, π] is the angle between the normal of the line and the horizontal axis x; and x and y are the two-dimensional coordinates of an image pixel. We define an integer accumulator array k over the parameter space indexed by d and θ, and set a threshold th. When the Hough transform statistics are accumulated, a parameter cell with k > th is declared a straight line with parameters d and θ. Because the calculated value of d may be negative, the index d is rewritten as d + l. The specific steps are as follows [6].

  • Step 1: construct and initialize lookup tables for the sin and cos functions.

  • Step 2: for each non-background point (white dot) of the binary image, compute d for each θ using Eq. (5), and increment the accumulator cell k indexed by d + l and θ.

  • Step 3: find all index pairs (d, θ) whose accumulator values k are greater than th, and recover the true distance as d − l.

  • Step 4: because the field lines in the binary image have not been thinned, they have a certain width. The Hough transform therefore yields several groups of similar (d, θ) values for the same straight line; of each such group, only one is retained.

  • Step 5: let (x_0, y_0) and (x_1, y_1) be the coordinates of the two ends of a field line obtained by the Hough transform. The mean gray value Mean_{i,j} of the pixels on the line segment between the points indexed by i and j is calculated in the gray image, with the endpoints (x_0 + sin θ(f − m × i), y_0 − cos θ(f − m × i)) and (x_1 + sin θ(f − m × j), y_1 − cos θ(f − m × j)), respectively, where f is the fitting range, m is the fitting step length, and \( i,j\in \left[0,\frac{2f}{m}\right] \). The line segment (i, j) that maximizes Mean_{i,j} is taken as the optimal course line. Figure 3 shows the field line features determined using the Hough transform; a sketch of Steps 2–4 is given below.
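The sketch below covers Steps 2–4, assuming OpenCV's cv2.HoughLines as the accumulator implementation; the tolerances d_tol and theta_tol for duplicate suppression are illustrative values, and the gray-fitting refinement of Step 5 is omitted.

```python
import cv2
import numpy as np

def field_lines(binary_image, th=150, d_tol=10, theta_tol=np.deg2rad(3)):
    """Hough-transform line extraction with duplicate suppression (Step 4).

    Returns a list of (d, theta) pairs; the gray-level fitting of Step 5
    would refine these further.
    """
    lines = cv2.HoughLines(binary_image, rho=1, theta=np.pi / 180,
                           threshold=th)
    kept = []
    if lines is not None:
        for d, theta in lines[:, 0]:
            # Keep only one representative of each group of similar lines.
            if not any(abs(d - d0) < d_tol and abs(theta - t0) < theta_tol
                       for d0, t0 in kept):
                kept.append((d, theta))
    return kept
```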

Fig. 3. Feature extraction of stadium lines. (a) Original image, (b) Extracted field lines

3.4 Feature Extractions for Motion Trajectory of Player and Ball

To extract the motion trajectories of the players and the ball, the game video is divided into short segments containing a fixed number of frames, and each segment is processed as the basic unit; that is, the length of a processed trajectory does not exceed the number of frames in the segment. After the candidate regions of moving targets have been obtained for each frame, the moving objects are first located in the spatial-temporal domain over three consecutive frames: centered on a candidate position in the second of the three frames, candidate regions near that position are sought in the preceding and following frames. Once such a three-frame sequence has been found, we determine whether its moving target already belongs to an existing trajectory. If not, a new trajectory is initialized with the moving target of the three consecutive frames, and the positions along each trajectory are recorded. After a new trajectory has been obtained, a Kalman filter is used to predict its future positions, using the prediction model of Eq. (6).

$$ \left\{\begin{array}{l} X_t = A X_{t-1} + \gamma_t \\ O_t = B X_t + \kappa_t \end{array}\right. $$
(6)

Here, X_t = AX_{t−1} + γ_t is the motion equation of the system and O_t = BX_t + κ_t is its observation equation; X_t and O_t are the system state vector and the state measurement vector at time t, respectively; γ_t and κ_t are the process noise and measurement noise vectors, both normally distributed and mutually independent; and A and B are the state transition matrix and the measurement matrix. The center position of the moving target is chosen as the measurement vector, and the center position together with its velocity is taken as the state vector, which yields Eq. (7) [7].

$$ X = \left[\begin{array}{c} x \\ v_x \\ y \\ v_y \end{array}\right], \quad O = \left[\begin{array}{c} x \\ y \end{array}\right], \quad A = \left[\begin{array}{cccc} 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \end{array}\right], \quad B = \left[\begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{array}\right] $$
(7)

Here, (x, y) is the center position of the moving target, and v_x and v_y are its velocities in the x and y directions. Prediction results of the Kalman filter are shown in Fig. 4, where '+' denotes the true value and '○' the predicted value.
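A minimal sketch of one predict/update cycle with the matrices of Eq. (7) follows; the noise covariances Q and R are assumed values, since the paper does not specify them.

```python
import numpy as np

# State transition A and measurement matrix B from Eq. (7):
# state X = [x, v_x, y, v_y]^T, measurement O = [x, y]^T.
A = np.array([[1., 1., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 0., 1.]])
B = np.array([[1., 0., 0., 0.],
              [0., 0., 1., 0.]])

def kalman_step(x, P, z, Q=np.eye(4) * 1e-2, R=np.eye(2)):
    """One predict/update cycle of Eq. (6); Q and R are assumed covariances
    for the process noise gamma_t and the measurement noise kappa_t."""
    # Predict the next state and its covariance.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update with the measured target center z = [x, y].
    S = B @ P_pred @ B.T + R
    K = P_pred @ B.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (z - B @ x_pred)
    P_new = (np.eye(4) - K @ B) @ P_pred
    return x_new, P_new
```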

Fig. 4. Prediction process of the Kalman filter

The position of each trajectory in a new frame is predicted with the Kalman filter, and candidate moving targets near that position are searched for in the frame. If such a target exists, the trajectory is extended into the frame, with the center of the candidate target taken as the trajectory position in that frame. If no corresponding candidate is found, the moving target of this trajectory is missing, occluded, or has disappeared. As long as the number of missed or occluded frames does not exceed a threshold, the trajectory is still extended into the frame, and the Kalman prediction is used in place of the missing position. When the number of missed or occluded frames exceeds the threshold, the trajectory is considered to have disappeared from the video and its growth is stopped. By growing trajectories in this way, we obtain multiple trajectories generated by the candidate moving objects (real targets as well as noise) in each segment of soccer game video, as shown in Fig. 5(a). The red and blue trajectories are the tracks of the two teams, and the green trajectory is the track of the ball; when the ball travels through the air, its trajectory is mapped onto the field. The reddish-brown trajectory is the referee's track, and the remaining short tracks are generated by noise. Because part of the generated trajectories is caused by noise, the trajectories corresponding to real moving targets must be selected from them.

Fig. 5. Trajectory generation for a video football game fragment. (a) Trajectories of moving objects with noise trajectories, (b) Trajectories of moving objects without noise trajectories

The set of trajectories generated by the real moving targets is denoted C_t. Initially, C_t contains all trajectories, that is, C_t = {T_i, i ∈ [1, N]}, where T_i is the ith track of the current video clip and N is the number of tracks in the clip. Consider two trajectories T_u and T_v with different starting frames, with starting and ending frames K_{s,u}, K_{e,u} and K_{s,v}, K_{e,v}, respectively, and K_{s,u} ≤ K_{s,v}. When the ending frame of trajectory T_u is not smaller than the starting frame of trajectory T_v, that is, K_{e,u} ≥ K_{s,v}, the trajectories T_u and T_v intersect in the space-time domain, T_u ∩ T_v ≠ ∅; otherwise, the two trajectories are considered separate. In the video segment of a football game, the trajectories of real moving targets are usually long, while tracks produced by noise are shorter. Therefore, when two trajectories cross, the longer one is taken as the trajectory of the moving target. The set of trajectories generated by the moving targets can be calculated with Eq. (8).

$$ C_t = \left\{\begin{array}{lll} C_t - \{T_u\} & \mathrm{if} & \left(\left(K_{e,u} - K_{s,u}\right) < \left(K_{e,v} - K_{s,v}\right)\right) \wedge \left(T_u \cap T_v \ne \emptyset\right) \\ C_t - \{T_v\} & \mathrm{if} & \left(\left(K_{e,u} - K_{s,u}\right) \ge \left(K_{e,v} - K_{s,v}\right)\right) \wedge \left(T_u \cap T_v \ne \emptyset\right) \end{array}\right. $$
(8)

By this trajectory selection, the set C_t of trajectories of the real moving targets is finally obtained, as shown in Fig. 5(b). The set C_t still contains separate trajectories, that is, missed frames exist between trajectories. The main causes of the missed frames are mutual occlusion between the moving targets on the field and sudden changes in the direction and speed of motion. To obtain the full trajectories of the video clip, these separate tracks must be connected.
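The selection rule of Eq. (8) can be sketched as follows, treating intersection as temporal overlap as in the text; the tuple-based track representation is an assumption for illustration. Processing tracks from longest to shortest drops the shorter member of every crossing pair.

```python
def prune_noise_tracks(tracks):
    """Eq. (8): when two trajectories overlap in time, keep the longer one.

    tracks: list of (start_frame, end_frame, points) tuples, as produced
    by the trajectory-growing step; returns the surviving set C_t.
    """
    by_len = sorted(tracks, key=lambda t: t[1] - t[0])
    kept = []
    for cand in reversed(by_len):                       # longest first
        s, e = cand[0], cand[1]
        overlaps = any(s <= k_e and k_s <= e for k_s, k_e, _ in kept)
        if not overlaps:                                # shorter crossers die
            kept.append(cand)
    return kept
```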

First, the predicted positions \( {\hat{p}}_{\mathrm{k},\mathrm{u}} \) and \( {\hat{p}}_{\mathrm{k},\mathrm{v}} \) of the trajectories T_u and T_v in the interval [K_{e,u}, K_{s,v}] are calculated. We then search this interval for the two mutually closest predicted points of the two trajectories, corresponding to frame a of trajectory T_u and frame b of trajectory T_v. Equation (9) is used as the constraint condition in this search.

$$ \left\{\begin{array}{l}\left(a,b\right)=\underset{a,b}{\arg \min\,\, } dist\left({\hat{p}}_{a,\mathrm{u}},{\hat{p}}_{b,\mathrm{v}}\right)\\ {}a\le b,\\ {}{K}_{\mathrm{s},\mathrm{u}}\le a\le {K}_{\mathrm{e},\mathrm{v}},\\ {}{K}_{\mathrm{s},\mathrm{u}}\le b\le {K}_{\mathrm{e},\mathrm{v}}.\end{array}\right. $$
(9)

When a ≥ b, the positions of the moving target that were missed before frame a and after frame a are represented by the predicted values of trajectory T_u and trajectory T_v in this interval, respectively, and their mean is used as the position of the moving target in frame a itself. This is expressed by Eq. (10).

$$ {p}_k=\left\{\begin{array}{ll}{\hat{p}}_{k,\mathrm{u}}&\quad {K}_{\mathrm{e},\mathrm{u}}\le k<a\\ {}\left({\hat{p}}_{k,\mathrm{u}}+{\hat{p}}_{k,\mathrm{v}}\right)/2&\quad k=a\\ {}{\hat{p}}_{k,\mathrm{v}}&\quad a<k\le {K}_{\mathrm{s},\mathrm{v}}\end{array}\right. $$
(10)

When a < b, the positions of the missed moving target outside the interval [a, b] are filled in the same way as for a ≥ b. Between frames a and b the motion of the target is small, so a more accurate position is obtained by linear interpolation (Eq. 11).

$$ {p}_k=\left\{\begin{array}{ll}{\hat{p}}_{k,\mathrm{u}}& \quad {K}_{\mathrm{e},\mathrm{u}}\le k\le a\\ {}{\hat{p}}_{a,\mathrm{u}}+\left(k-a\right)\left({\hat{p}}_{b,\mathrm{v}}-{\hat{p}}_{a,\mathrm{u}}\right)/\left(b-a\right)& \quad a<k<b\\ {}{\hat{p}}_{k,\mathrm{v}}& \quad b\le k\le {K}_{\mathrm{s},\mathrm{v}}\end{array}\right. $$
(11)

By connecting the trajectories in this way, the positions of the missed moving targets are filled in accurately, and the complete tracks of the moving targets are generated.
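The gap-filling rules of Eqs. (10) and (11) can be sketched as below; the dictionary-based representation of the predicted positions is an assumption for illustration.

```python
import numpy as np

def fill_gap(p_hat_u, p_hat_v, frames, a, b):
    """Fill missed target positions between two tracks (Eqs. 10 and 11).

    p_hat_u, p_hat_v: dicts mapping frame -> predicted (x, y) for T_u, T_v;
    frames: the frames in [K_eu, K_sv] to fill; a, b: frames from Eq. (9).
    """
    pu = {k: np.asarray(p, float) for k, p in p_hat_u.items()}
    pv = {k: np.asarray(p, float) for k, p in p_hat_v.items()}
    filled = {}
    for k in frames:
        if a >= b:                                   # Eq. (10)
            if k < a:
                filled[k] = pu[k]
            elif k == a:
                filled[k] = (pu[k] + pv[k]) / 2      # mean at the join frame
            else:
                filled[k] = pv[k]
        else:                                        # Eq. (11)
            if k <= a:
                filled[k] = pu[k]
            elif k < b:
                t = (k - a) / (b - a)                # linear interpolation
                filled[k] = pu[a] + t * (pv[b] - pu[a])
            else:
                filled[k] = pv[k]
    return filled
```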

3.5 Extraction of Temporal and Spatial Interest Points of Players and the Referee

Temporal and spatial interest points are points where the image intensity changes significantly in both time and space; they are a relatively new low-level representation of features for frame sequences. Let the football game video sequence be \( v:{\Re}^2\times \Re \mapsto \Re \). Its linear scale-space representation \( R:{\Re}^2\times \Re \times {\Re}_{+}^2\mapsto \Re \) is constructed by convolving v with anisotropic Gaussian kernels with distinct spatial and temporal variances \( {\sigma}_{\mathrm{r}}^2 \) and \( {\tau}_{\mathrm{r}}^2 \). The relationship between them is expressed by Eq. (12).

$$ R\left(\cdot; {\sigma}_{\mathrm{r}}^2,{\tau}_{\mathrm{r}}^2\right)=g\left(\cdot;{\sigma}_{\mathrm{r}}^2,{\tau}_{\mathrm{r}}^2\right)\ast v\left(\cdot \right) $$
(12)

The space-time separable Gaussian kernel is defined by Eq. (13).

$$ g\left(x,y,t;{\sigma}_{\mathrm{r}}^2,{\tau}_{\mathrm{r}}^2\right)=\frac{\exp \left(-\left({x}^2+{y}^2\right)/2{\sigma}_{\mathrm{r}}^2-{t}^2/2{\tau}_{\mathrm{r}}^2\right)}{\sqrt{{\left(2\pi \right)}^3{\sigma}_{\mathrm{r}}^4{\tau}_{\mathrm{r}}^2}} $$
(13)

Using separate scale parameters for the temporal and spatial domains is important, since the temporal and spatial extents of events are generally independent. Moreover, the events detected by temporal and spatial interest point operators depend on the time and space scales at which they are observed, so the corresponding scale parameters \( {\sigma}_{\mathrm{r}}^2 \) and \( {\tau}_{\mathrm{r}}^2 \) must be treated separately.

In this paper, we consider a 3 × 3 space-time second-moment matrix composed of products of first-order spatial and temporal derivatives, averaged with a Gaussian weighting function, as given by Eq. (14).

$$ \mu = g\left(\cdot;{\sigma}_i^2,{\tau}_i^2\right)\ast \left(\begin{array}{ccc} {r_x}^2 & {r}_x{r}_y & {r}_x{r}_t \\ {r}_x{r}_y & {r_y}^2 & {r}_y{r}_t \\ {r}_x{r}_t & {r}_y{r}_t & {r_t}^2 \end{array}\right) $$
(14)

The integration scale parameters \( {\sigma}_i^2 \) and \( {\tau}_i^2 \) in Eq. (14) are tied to the local scale parameters \( {\sigma}_{\mathrm{r}}^2 \) and \( {\tau}_{\mathrm{r}}^2 \) by \( {\sigma}_i^2=s{\sigma}_{\mathrm{r}}^2 \) and \( {\tau}_i^2=s{\tau}_{\mathrm{r}}^2 \), and the first-order derivatives are defined by Eq. (15).

$$ \left\{\begin{array}{c}{r}_x\left(x,y,t;{\sigma}_{\mathrm{r}}^2,{\tau}_{\mathrm{r}}^2\right)=\partial x\left(g\ast v\right)\\ {}{r}_y\left(x,y,t;{\sigma}_{\mathrm{r}}^2,{\tau}_{\mathrm{r}}^2\right)=\partial y\left(g\ast v\right)\\ {}{r}_t\left(x,y,t;{\sigma}_{\mathrm{r}}^2,{\tau}_{\mathrm{r}}^2\right)=\partial t\left(g\ast v\right)\end{array}\right. $$
(15)

To detect temporal and spatial interest points, we search the video for regions where all eigenvalues λ1, λ2, and λ3 of μ are significant. Analogously to the spatial-domain Harris corner function, a Harris corner function in the space-time domain is defined by combining the determinant and the trace of μ, as in Eq. (16).

$$ H={\lambda}_1{\lambda}_2{\lambda}_3-k{\left({\lambda}_1+{\lambda}_2+{\lambda}_3\right)}^3,\kern0.5em \left({\lambda}_1\le {\lambda}_2\le {\lambda}_3\right) $$
(16)

To ensure that positive local maxima of H correspond to points where λ1, λ2, and λ3 are all high, we define the ratios α = λ2/λ1 and β = λ3/λ1 and rewrite Eq. (16) as Eq. (17).

$$ H={\lambda}_1^3\left(\alpha \beta -k{\left(1+\alpha +\beta \right)}^3\right) $$
(17)

If H ≥ 0, then k ≤ αβ/(1 + α + β)³; assuming α = β = 1, the maximum possible value of k is 1/27. For sufficiently large values of k, the points corresponding to positive local maxima of H vary sharply along both the spatial and temporal directions. In particular, with the maximum value of α and β in the spatial domain being 23, k ≈ 0.005. Therefore, the temporal and spatial interest points of the football video sequence v can be detected as the positive local space-time maxima of H. Detection results are shown in Fig. 6.
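A sketch of the detector pipeline of Eqs. (12)–(17), using SciPy's Gaussian filtering, is given below; the default scale values are assumptions, and the final extraction of positive local maxima (e.g., by 3-D non-maximum suppression) is not shown.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spacetime_harris_response(video, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    """Space-time Harris response of Eqs. (12)-(17).

    video: float array of shape (T, H, W). Returns the response H of
    Eq. (16), whose positive local maxima are the interest points.
    """
    # Linear scale-space representation R (Eq. 12): Gaussian smoothing with
    # separate temporal (tau) and spatial (sigma) scales.
    R = gaussian_filter(video, sigma=(tau, sigma, sigma))
    # First-order derivatives along t, y, x (Eq. 15).
    r_t, r_y, r_x = np.gradient(R)
    # Second-moment matrix mu (Eq. 14), averaged at the integration scales
    # sigma_i^2 = s * sigma^2 and tau_i^2 = s * tau^2.
    w = (np.sqrt(s) * tau, np.sqrt(s) * sigma, np.sqrt(s) * sigma)
    m = {name: gaussian_filter(prod, sigma=w)
         for name, prod in [("xx", r_x * r_x), ("yy", r_y * r_y),
                            ("tt", r_t * r_t), ("xy", r_x * r_y),
                            ("xt", r_x * r_t), ("yt", r_y * r_t)]}
    # Eq. (16): H = det(mu) - k * trace(mu)^3
    #             = l1*l2*l3 - k*(l1 + l2 + l3)^3.
    det = (m["xx"] * (m["yy"] * m["tt"] - m["yt"] ** 2)
           - m["xy"] * (m["xy"] * m["tt"] - m["yt"] * m["xt"])
           + m["xt"] * (m["xy"] * m["yt"] - m["yy"] * m["xt"]))
    trace = m["xx"] + m["yy"] + m["tt"]
    return det - k * trace ** 3
```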

Fig. 6. Detection results of temporal and spatial interest points. (a) 52nd frame, (b) 88th frame, (c) 137th frame, (d) 223rd frame, (e) 76th frame (another video), (f) 361st frame (close-up frames)

4 Conclusion

Because a single feature can hardly describe the behavior characteristics of multiple athletes effectively, behavior recognition based on a single feature is unreliable. Since the various features of the athletes must be extracted for multiplayer behavior recognition in soccer game video, and feature extraction strongly influences the final recognition results, in this paper we extract the contour features, clothing color histograms, temporal and spatial interest points, and color moment features of the players and referees. We also convert the soccer video images to binary images, extract the coordinate parameters of the field lines using the Hough transform, and compute accurate line coordinates using gray-level fitting. We use a Kalman filter to track the moving objects and predict their motion trajectories, and, to reduce the burden on the high-level recognition algorithm, we propose a trajectory-growing method to extract low-level features such as the trajectories of the moving objects. The experimental results show that using these features to identify the behavior of the athletes greatly improves the accuracy of behavior recognition.