
1 Introduction

Ball detection and trajectory prediction are widely used for analysing outcomes in various sports events, and their commercial success has driven further research in the field. Hawkeye is a leading innovator in sports technology, providing systems for tracking and predicting ball movement in a variety of sports, including cricket, football, tennis, rugby union, volleyball and ice hockey. Motivated by the need for a rugby-ball detection system for mobile robots, this work presents a combined approach targeted at constrained hardware environments for the specific task of rugby ball detection.

Ball detection is harder than many other detection problems. When a ball is thrown and moves at high velocity, its image becomes blurred and elliptical. Shadows and illumination variations change the ball's perceived colour, and detection is especially difficult when the ball is partially occluded. Traditional ball detection algorithms, such as those based on variations of the circular Hough transform [1], work well when the ball appears as a single, isolated object; in complex environments, however, a deep neural network is required.

The underlying architecture of deep neural networks is highly optimised for execution on graphics processing units (GPUs). However, GPUs may be unavailable in particular domains, forcing these operations to run on CPUs. One such domain is mobile robotics, where size, weight and energy constraints may limit the robot to CPUs alone, restricting the performance of deep learning systems. To overcome this limitation we propose a method that combines the slower detection model with a faster tracking algorithm, optimising the overall process. Tracking algorithms such as KCF [2], Boosting [3] and MOSSE [4] are commonly used for general object tracking. They require an initial ROI as input, which is usually selected manually in the first frame. These trackers are fast but tend to lose the object under minor changes in shape, colour and shading. In such cases, DNN-based detection can help the tracker stay on course by re-detecting the object accurately. Alongside detection and tracking, it is also important to estimate the ball's position accurately and predict its future trajectory so that the robot can respond. The generated trajectory equations are ideal in nature, so a Kalman filter [5] is used for accurate estimation. The Kalman filter handles two different scenarios: when the ball is detected, it predicts the state for the current video frame and then corrects it with the newly detected position, yielding a filtered location; when the ball is absent from the frame, it predicts the ball's current position from its prior state alone.

This paper is divided into three broad sections. The first section focuses on creating an image dataset of rugby balls, preprocessing the collected data into the required format and training a custom object detection model on YOLOv5 [6]. The second section explains how the tracking algorithm is combined with the trained detection model using a multithreading approach so that frames are processed smoothly. The last section explains the trajectory-prediction algorithm based on the Kalman filter. The results obtained in each section are recorded, and a final conclusion is drawn.

2 Methodology

2.1 Section I

This section deals with training and deployment of the YOLOv5 [6] model for detecting the ball in a frame. The initial step is to create an image dataset with the ball in different orientations. As the system is intended for a mobile robot, practical scenarios had to be considered: videos of balls thrown by a person were recorded from the robot's viewpoint, so the ball appears in successive frames in varying orientations and lighting conditions.

Training a YOLOv5 model requires annotations in a specific format: each image needs a corresponding XML file containing the bounding-box coordinates and the label of the object to be detected. Software services such as LabelBox [7] and Roboflow [8] are frequently used to create these files by taking a video as input and loading each frame consecutively. Selecting the desired object in every individual image is tedious and time consuming given the size of the dataset. To simplify the process we developed a Python script that takes the video file as input: the desired object is selected manually in the initial frame, and its position in the consecutive frames is followed by a tracking algorithm (a sketch of this script is given at the end of this section). This generates a stream of images saved to a folder along with their corresponding XML files, and reduced the preprocessing time by approximately 87% compared to the traditional method.

The preprocessed dataset was trained with a YOLOv5 model, the most recent addition to the YOLO family. YOLO was the first object detection model to combine bounding-box prediction and object classification in a single end-to-end differentiable network, and YOLOv5 is the first YOLO model developed in the PyTorch [9] framework, making it significantly lighter and easier to use. YOLOv5s was chosen on the basis of the model benchmarks, and the default training parameters were used without modification. The dataset included blurred and rotated images for better generalisation. The training benchmarks are shown in Fig. 1: precision and mAP increase as training progresses. mAP measures the proportion of correct predictions and, together with precision, can be used to compare against other network architectures. After training, testing and validation were carried out on varied inputs; Fig. 2 shows the detection results on an input video, along with the associated label and bounding box.
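A minimal sketch of this annotation script is given below, assuming OpenCV's KCF tracker and Pascal VOC-style XML output; the file names, label string and tracker choice are illustrative rather than the exact script used in this work.

```python
import os
import xml.etree.ElementTree as ET

import cv2

def save_voc_xml(img_name, img_shape, box, label, out_dir):
    """Write a Pascal VOC-style XML annotation for a single frame."""
    h, w, c = img_shape
    x, y, bw, bh = [int(v) for v in box]
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = img_name
    size = ET.SubElement(ann, "size")
    ET.SubElement(size, "width").text = str(w)
    ET.SubElement(size, "height").text = str(h)
    ET.SubElement(size, "depth").text = str(c)
    obj = ET.SubElement(ann, "object")
    ET.SubElement(obj, "name").text = label
    box_el = ET.SubElement(obj, "bndbox")
    ET.SubElement(box_el, "xmin").text = str(x)
    ET.SubElement(box_el, "ymin").text = str(y)
    ET.SubElement(box_el, "xmax").text = str(x + bw)
    ET.SubElement(box_el, "ymax").text = str(y + bh)
    ET.ElementTree(ann).write(os.path.join(out_dir, img_name.replace(".jpg", ".xml")))

cap = cv2.VideoCapture("rugby_throw.mp4")    # illustrative input video
ok, frame = cap.read()
roi = cv2.selectROI("select ball", frame)     # manual selection, first frame only
tracker = cv2.legacy.TrackerKCF_create()      # needs opencv-contrib; older builds expose cv2.TrackerKCF_create()
tracker.init(frame, roi)

os.makedirs("dataset", exist_ok=True)
idx = 0
while ok:
    ok_track, box = tracker.update(frame)     # tracker supplies the bounding box automatically
    if ok_track:
        name = f"frame_{idx:05d}.jpg"
        cv2.imwrite(os.path.join("dataset", name), frame)
        save_voc_xml(name, frame.shape, box, "rugby_ball", "dataset")
        idx += 1
    ok, frame = cap.read()
cap.release()
```

After annotation, the dataset can be trained with YOLOv5's standard training entry point, e.g. `python train.py --img 640 --data rugby.yaml --weights yolov5s.pt`, where rugby.yaml is an assumed dataset configuration file.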

Fig. 1 Metrics

Fig. 2 Detection

2.2 Section II

This part focuses on integrating the tracking and detection components of the code. Tracking and detection of the ball are carried out on parallel threads, optimising the overall process. The tracking algorithm takes the bounding-box points as input in the initial step and then tracks the object inside that bounding box across consecutive frames. Running detection alone on the entire video gives better accuracy but increases the processing time; tracking algorithms, on the other hand, execute quickly but with reduced accuracy. Given this trade-off, the ideal solution was to combine detection and tracking to obtain a good balance of speed and accuracy. The results were tested on a GTX 1660 Ti, so the timings may vary on CPUs and other GPU cards. The tracking algorithm used was the Kernelized Correlation Filter (KCF), originally proposed in [10]. The OpenCV implementation of this tracker makes integration into the code straightforward. According to that paper, the tracker outperformed top-ranking trackers such as Struck [11] and TLD [12] on a benchmark of 50 videos.
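A minimal sketch of this two-thread arrangement is given below, assuming a YOLOv5 model loaded via torch.hub and OpenCV's KCF tracker; the weights file, video source and confidence handling are illustrative rather than the exact code used here.

```python
import threading
import time

import cv2
import torch

# Load the trained detector; "best.pt" is an assumed weights file from training.
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

lock = threading.Lock()
latest_frame = None      # newest frame handed to the detector
latest_box = None        # newest detection (x, y, w, h), consumed by the tracking loop

def detection_worker():
    """Slow thread: repeatedly re-detect the ball on the newest available frame."""
    global latest_box
    while True:
        with lock:
            frame = None if latest_frame is None else latest_frame.copy()
        if frame is None:
            time.sleep(0.001)
            continue
        det = model(frame[..., ::-1]).xyxy[0]       # BGR -> RGB; rows: (x1, y1, x2, y2, conf, cls)
        if len(det):
            x1, y1, x2, y2 = det[0, :4].tolist()    # highest-confidence detection
            with lock:
                latest_box = (int(x1), int(y1), int(x2 - x1), int(y2 - y1))

threading.Thread(target=detection_worker, daemon=True).start()

cap = cv2.VideoCapture("throw.mp4")                  # illustrative input
tracker = None
while True:                                          # main loop: fast tracking, never blocked by detection
    ok, frame = cap.read()
    if not ok:
        break
    with lock:
        latest_frame = frame
        box, latest_box = latest_box, None           # consume the detection, if any
    if box is not None:
        # Re-initialise the tracker whenever a fresh detection arrives.
        tracker = cv2.legacy.TrackerKCF_create()
        tracker.init(frame, box)
    elif tracker is not None:
        ok_track, box = tracker.update(frame)
        if ok_track:
            x, y, w, h = [int(v) for v in box]
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("ball", frame)
    if cv2.waitKey(1) == 27:                          # Esc to quit
        break
```

The detection thread may lag a frame or two behind the tracker; in practice this is acceptable because each fresh detection simply re-initialises the tracker on the current frame.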

2.3 Section III

After successful detection and tracking of the ball, the future trajectory is formulated. This trajectory can be calculated from ballistic trajectory equations, but because these equations are ideal they do not account for noise in the system, which introduces errors in the predictions. As our system is linear in nature, we use a Kalman filter [5], which estimates unknown variables in systems with statistical noise. The Kalman filter combines the system's dynamic model with measurements taken over time (the position of the ball) to form a better estimate of the state. The dynamic model in our case is the ballistic trajectory equation (Fig. 3).

The kinematic equations are as follows:

$$\begin{aligned} v_{x}' \leftarrow v_{x} + a_{x} \textrm{d}t\end{aligned}$$
(1)
$$\begin{aligned} v_{y}' \leftarrow v_{y} + a_{y} \textrm{d}t\end{aligned}$$
(2)
$$\begin{aligned} x' \leftarrow x + v_{x}\textrm{d}t + \frac{1}{2} a_{x} \textrm{d}t^2\end{aligned}$$
(3)
$$\begin{aligned} y' \leftarrow y + v_{y}\textrm{d}t + \frac{1}{2} a_{y}\textrm{d}t^2\end{aligned}$$
(4)

From the above equations, the state transition matrices can be stated as shown in Fig. 3.

Fig. 3 State transition matrices

Matrix A is the dynamic law and matrix B is the control matrix; B holds the control input, which in our case is the acceleration in the y direction. The required states of the system are the position and velocity of the object, but only the position is observable in our case (Fig. 4).

Fig. 4 Matrices required for Kalman computation

Matrix u holds the input values, i.e. the acceleration. Matrix P is the initial uncertainty of the state variables; since the state variables are initially unknown, the values of \(ix, iy, iv_{x}, iv_{y}\) are set very high, of the order of \(10^{6}\).
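One plausible instantiation of these matrices for a state vector \([x, y, v_{x}, v_{y}]\) is sketched below; the exact numerical values in Figs. 3 and 4 may differ, and the frame rate and gravity scaling here are assumptions.

```python
import numpy as np

dt = 1.0 / 30.0   # assumed frame interval (30 fps video)
a_y = 9.81        # gravity; sign and scale depend on the image coordinate convention (assumed)

# Dynamic law A: state [x, y, v_x, v_y] propagated by the kinematic Eqs. (1)-(4).
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]])

# Control matrix B: only the acceleration in y enters as an input.
B = np.array([[0.0],
              [0.5 * dt ** 2],
              [0.0],
              [dt]])
u = np.array([[a_y]])          # input vector: acceleration in y

# Measurement matrix H: only the position (x, y) is observable.
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]])

# Initial uncertainty P: the state is unknown at the start, so the diagonal is ~1e6.
P = np.diag([1e6, 1e6, 1e6, 1e6])
x_est = np.zeros((4, 1))       # initial state estimate (unknown, hence the large P)
```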

Given the transformations A and B, and noise approximated by the covariance matrices Q and R, the information on the state x, in the form of its mean and covariance matrix P, is updated with a new observation z as follows:

$$\begin{aligned} \textrm{Predicted}\ \textrm{state}\ \textrm{estimate} \rightarrow \hat{x}_{k}^{-}= A\hat{x}_{k-1}^{+}+Bu \end{aligned}$$
(5)
$$\begin{aligned} \textrm{Predicted}\ \textrm{error}\ \textrm{covariance} \rightarrow P_{k}^{-}=AP^{+}_{k-1}A^{T}+Q \end{aligned}$$
(6)
$$\begin{aligned} \textrm{Measurement}\ \textrm{residual} \rightarrow y_{k}=z_{k}-H\hat{x}_{k}^{-} \end{aligned}$$
(7)
$$\begin{aligned} \textrm{Kalman}\ \textrm{gain} \rightarrow K_{k}=P_{k}H^{T}(R+HP^{-}_{k}H^{T})^{-1} \end{aligned}$$
(8)
$$\begin{aligned} \textrm{Updated}\ \textrm{state}\ \textrm{estimate} \rightarrow \hat{x}_{k}^{+}=\hat{x}_{k}^{-} + K_{k}y_{k} \end{aligned}$$
(9)
$$\begin{aligned} \textrm{Updated}\ \textrm{error}\ \textrm{covariance}\ \rightarrow P_k^+ = (I- K_{k}H)P_{k}^{-} \end{aligned}$$
(10)

where:

Variables with a hat (\(\hat{}\)) are estimates of the corresponding variables.

Superscripts (\(^{-}\)) and (\(^{+}\)) denote the prior (predicted) and updated (posterior) estimates, respectively.

Superscript (\(^{T}\)) denotes the matrix transpose.
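Equations (5)-(10) translate directly into a predict/correct routine; the sketch below uses the matrices defined above, with assumed placeholder values for the noise covariances Q and R.

```python
Q = np.eye(4) * 1e-3    # process-noise covariance (assumed value)
R = np.eye(2) * 5.0     # measurement-noise covariance (assumed value)
I = np.eye(4)

def kalman_step(x_est, P, z=None):
    """One filter iteration: Eqs. (5)-(6) predict; Eqs. (7)-(10) correct when a detection z is available."""
    # Prediction (Eqs. 5 and 6)
    x_pred = A @ x_est + B @ u
    P_pred = A @ P @ A.T + Q
    if z is None:
        # Ball not detected in this frame: the prediction itself becomes the estimate.
        return x_pred, P_pred
    # Correction (Eqs. 7-10)
    y = z - H @ x_pred                                        # measurement residual
    K = P_pred @ H.T @ np.linalg.inv(R + H @ P_pred @ H.T)    # Kalman gain
    x_new = x_pred + K @ y                                    # updated state estimate
    P_new = (I - K @ H) @ P_pred                              # updated error covariance
    return x_new, P_new

# Per frame: pass the detected/tracked centre as z, or None when the ball is not visible.
# x_est, P = kalman_step(x_est, P, z=np.array([[cx], [cy]]))
```

This single routine covers both scenarios described in the introduction: when the detector or tracker reports the ball, z carries the measured position; when the ball is absent from the frame, z is None and only the prediction step runs.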

3 Results

The results (Figs. 5 and 6) were obtained on videos with various orientations and lighting conditions, on both a CPU and a GPU. The accuracy was comparable in both cases, but there was a considerable difference in processing time. The stand-alone detection code takes around 0.75 s per frame on the CPU (Intel i5 6th gen, 6 GB RAM) to detect the ball with around 85–95% accuracy. On the same device, the detection code combined with tracking in two separate threads takes approximately 0.052 s per frame. Our proposed method therefore reduces the time required to process one frame by about 93% compared with the detection-only approach. This improvement is particularly significant for low-computation hardware such as mobile robots, edge devices and other low-end systems.

In typical systems, detection and tracking are implemented in a serial configuration, i.e. detect in the first frame and then track for the following 'n' frames; because a detection frame takes much longer to process than a tracking frame, the video often appears to freeze. Because we run detection and tracking in two different threads, the detection result is shared with the tracking thread while the tracker runs continuously without pauses, resulting in seamless processing with no delays in between.

We continually compute the ball's location farther into the future, which raises the uncertainty of the predicted position, as illustrated by the growing radius of the red circles in Fig. 5. Figures 5 and 6 show the predicted and actual trajectory of the ball. The prediction was accurate, with a 25–30 cm difference between the predicted and actual trajectories.

Fig. 5 Ball detection and trajectory prediction

Fig. 6 Prediction and estimation

4 Conclusion

The results are quite robust for the normal orientation, i.e. a side view of the ball trajectory; however, they are not as desired for a front-view camera orientation. Although detection and tracking give comparable results across orientations, the trajectory-prediction work needs to become more flexible with respect to the viewpoint from which the video is shot. Overall, the results were sufficient for our application and can be reproduced from the code in our GitHub repository.

5 Future Scope

The future scope of the project is to test the developed system on different hardware platforms to improve efficiency in every case. Work on making the trajectory-prediction algorithm more robust is ongoing. The current algorithm has been tested on 2D videos, and we plan to extend it to 3D views.