
1 Introduction

It is common practice in live sports broadcasting to use many cameras and switch between them depending on how the game develops. Professional coverage of this kind requires expensive specialized equipment and highly skilled individuals, such as camera operators and a director who supervises the overall operation and decides which view to show.

Unfortunately, such an expense is prohibitive when it comes to broadcasting amateur community or school sports. In this case, despite the fact that more than one camera may be used, real-time coverage involves only the main view, without offering the option of watching the view that better covers crucial moments during the game.

As a result, this monotonous coverage of regional sports may reduce viewership and be detrimental to the growth of school and amateur games. Thus, there is a need for a cost-effective, fully automated camera view switching system that analyzes the importance of the scene covered by each camera and then switches the view in a manner that is pleasant to the viewer.

To the best of our knowledge, there is only one existing work on automatic camera switching for ice hockey, presented in [1]. This method performs automatic camera selection using a hidden Markov model to create personalized video programs for users who are more interested in the performance or positions of the players from different perspectives than in the game itself. In that respect, this method is player-centered rather than puck-centered or play-centered. In contrast, our task is to design an automatic play-centered camera switching approach for amateur ice hockey games.

To this end, in this chapter, we propose a play-centered solution based on deep learning, namely, the Faster-RCNN architecture [2], to optimize view switching in regional ice hockey games. Our deep learning-based object recognition network receives video feeds from the two primary camera views of ice hockey and detects the players, the net, and the puck in real time with very good precision. Then, based on the predicted confidence values for the different objects, our algorithm decides which camera view should be broadcast.

The rest of this chapter is organized as follows. Section 2 presents our approach and explains the selection and labeling of our dataset. Section 3 presents the performance evaluation of our method and discusses the results. Finally, Section 4 concludes the chapter.

2 Our Approach

To keep viewers of an amateur ice hockey event engaged, two types of views are of primary importance: first, the side view, which shows the arena (see the left image in Fig. 1) and gives a wide view of the rink, and second, the goalie views (see the right image in Fig. 1), which show a closer view of the nets. It is common practice to use these cameras in amateur community or school ice hockey games. Since most of the action takes place away from the net, the arena (side) view is the primary view, while the goalie views are secondary. The natural question that arises is when to show a goalie view, in other words, which criteria should trigger a switch from the side view to a goalie view. To address this challenge, we first propose the use of a deep learning-based object recognition approach that receives the video feeds from all the views and detects the players, the puck, and the nets. Then, we use the weighted sum of the confidence values of the detected objects to decide which view to broadcast. Figure 1 shows our proposed scheme. Details regarding the dataset that we used to train our network and the criteria we introduced for switching camera views are presented in the following subsections.

Fig. 1 Our proposed scheme for automatic camera view switching

2.1 Data Collection

To build a comprehensive dataset for our application, we downloaded several hockey videos with a resolution of 1920 × 1080 from YouTube [3]. We turned to YouTube rather than amateur content because COVID-19 restrictions did not allow community games to take place, and we could not find recorded content of previously played games. From those videos, 1000 representative frames were selected for the training-validation phase. To avoid overfitting, we skipped redundant frames and kept only frames with significantly different content. We preferred frames of high visual quality that included the puck, and avoided frames in which a fast-moving puck appeared blurry. As already mentioned, the objects of interest in this study were the players, the net, and the puck, while the referees and audience were excluded. The location of these three types of objects can be used to determine the best camera view for the current situation. An example of a labeled training frame from the side view is shown in Fig. 2a. Figure 2b shows a different example from the goalie view. For the test phase, we used four ice hockey video streams from YouTube, which were very different from the training videos [3]. These videos had the same resolution as the training videos.

Fig. 2 Examples of labeled frames from our dataset (a) from the side view and (b) from the goalie view

2.2 Our Deep Learning Network

We chose the Faster-RCNN architecture [2] as our deep learning-based classification and object recognition network. The main reason for this choice is that Faster-RCNN has been shown to be more accurate and much faster than its predecessors [4,5,6], making it well suited to real-time object detection in ice hockey footage [2]. It has also shown very promising results in detecting small objects. We trained this network to detect the players, the net, and the puck. Details about the network configuration and the training platform are given in the evaluation and discussion section.

2.3 Our Camera Switching Approach

The first task of our scheme for switching camera views is to receive the detection information for the objects of interest, i.e., the players, net, and puck, produced by our Faster-RCNN object recognition model. Then, our algorithm considers the position and detection confidence of all the objects, as each one plays a different role in determining the best camera view for the current moment of the game. It is important to note that biasing our algorithm toward the objects that matter most to the fans keeps our solution focused on the action. Driven by professional game coverage, we assume that the most important object/event in hockey broadcasting involves the puck, as the audience tries to follow its location when watching a hockey game. Following this observation and the outcome of many trials in which subjects assessed the validity of our switching scheme, we assigned a weight to the confidence values predicted for each object type according to its importance: 20 for the puck, 1 for the net, and 1 for the player. More precisely, the confidence of each detected object in the current camera view is weighted according to its object type, and the weighted values are summed to obtain the score for that camera view. Note that our method only considers objects with confidence values greater than 20%. In addition, we add 10 to the weighted score calculated for the goalie view if the puck is present in that camera view.
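The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the weights, the 20% threshold, and the goalie-view bonus come from the text, while the `Detection` type, the function names, and the rule of broadcasting the view with the highest score are assumptions made for the example.

```python
from dataclasses import dataclass

# Values stated in the text.
WEIGHTS = {"puck": 20.0, "net": 1.0, "player": 1.0}
MIN_CONFIDENCE = 0.20      # detections at or below this are ignored
GOALIE_PUCK_BONUS = 10.0   # added to a goalie view that contains the puck


@dataclass
class Detection:
    """Hypothetical container for one detector output."""
    label: str         # "puck", "net", or "player"
    confidence: float  # detector confidence in [0, 1]


def view_score(detections, is_goalie_view):
    """Weighted sum of confidences for one camera view."""
    kept = [d for d in detections if d.confidence > MIN_CONFIDENCE]
    score = sum(WEIGHTS[d.label] * d.confidence for d in kept)
    if is_goalie_view and any(d.label == "puck" for d in kept):
        score += GOALIE_PUCK_BONUS
    return score


def pick_view(scores):
    """Assumed decision rule: choose the view with the highest score."""
    return max(scores, key=scores.get)


# Illustrative detections (made-up values, not taken from the dataset).
side = [Detection("player", 0.9), Detection("player", 0.8),
        Detection("puck", 0.15)]   # puck falls below the threshold
goalie = [Detection("net", 0.95), Detection("player", 0.7),
          Detection("puck", 0.6)]

scores = {"side": view_score(side, False),
          "goalie": view_score(goalie, True)}
print(scores, pick_view(scores))
```

With these made-up detections the side view scores about 1.7 and the goalie view about 23.65, so the goalie view would be selected. The large puck weight means that a single confident puck detection usually outweighs several player detections.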

Figure 3 shows the block diagram of our proposed camera switching scheme. To prevent any high-frequency camera switching, we built in a small delay of ten-frame duration (one-third of a second) before switching again after the last camera view change.
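The ten-frame delay can be illustrated with a small state holder. This is a sketch under assumptions: the class name `ViewSwitcher`, the view labels, and the exact counting convention (a new switch is allowed on the tenth frame after the previous one) are hypothetical, since the text only specifies the duration of the delay.

```python
class ViewSwitcher:
    """Holds the broadcast view and blocks rapid back-and-forth switching."""

    def __init__(self, initial_view="side", hold_frames=10):
        self.current = initial_view
        self.hold_frames = hold_frames
        # Start "expired" so the very first switch is not delayed.
        self.frames_since_switch = hold_frames

    def update(self, best_view):
        """Feed the best-scoring view for one frame; return the view to show."""
        self.frames_since_switch += 1
        if (best_view != self.current
                and self.frames_since_switch >= self.hold_frames):
            self.current = best_view
            self.frames_since_switch = 0
        return self.current


# The goalie view wins on frame 0, then the side view wins on every later frame.
switcher = ViewSwitcher()
shown = [switcher.update(v) for v in ["goalie"] + ["side"] * 12]
print(shown)
```

In this example the goalie view stays on air for ten frames before the switcher returns to the side view, even though the side view scores higher from the second frame onward.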

Fig. 3 Our proposed camera switching scheme

3 Evaluation and Discussion

For training, we used a PyTorch implementation of Faster-RCNN [7]. The fully connected layer of the model was modified to detect the three classes required for our application. Ninety percent of the training-validation dataset was randomly selected as the training set, and the remaining 10% was used for validation. Horizontal data augmentation was used to augment the training set. To obtain the best performance, we tested different combinations of network configurations. Two pretrained models, namely, VGG-16 and ResNet-101, were used as backbones for the Faster-RCNN. Four batch sizes (1, 6, 12, and 24) and three learning rates (0.0001, 0.001, and 0.01) were tried. We trained our Faster-RCNN on an Nvidia V100 Volta GPU with 32 GB of HBM2 memory, available on a state-of-the-art advanced research computing network [8].

Tables 1 and 2 show the average precision (AP) of the player, puck, and net classes achieved on the validation frames for each training setting. The batch sizes (bs) and learning rates (lr) used for each setting are also reported in these tables. As can be seen in Table 1, the best AP values obtained by Faster-RCNN with the VGG-16 backbone were 0.616, 0.876, and 0.9 for the puck, player, and net, respectively. The mean average precision (mAP) was 0.789. This performance was achieved with a learning rate of lr = 0.01 and a batch size of bs = 12. According to Table 2, the best AP values obtained with the ResNet-101 pretrained model were 0.587, 0.878, and 0.778 for the puck, player, and net, respectively. The mAP for this case was 0.778. This shows that the VGG-16 pretrained model with batch size bs = 12 and learning rate lr = 0.01 achieved the best performance among all the training settings examined in our study. Therefore, we used this model for the test phase and refer to it as our model hereafter.
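The experimental setup above can be sketched without any deep learning library. The following is only an illustration under assumptions: it enumerates every combination of backbone, batch size, and learning rate (the tables may report only a subset), uses a fixed random seed for the 90/10 split, and interprets "horizontal data augmentation" as a horizontal flip of each frame and its bounding boxes. All function names are hypothetical.

```python
import itertools
import random

# Options listed in the text.
BACKBONES = ["VGG-16", "ResNet-101"]
BATCH_SIZES = [1, 6, 12, 24]
LEARNING_RATES = [0.0001, 0.001, 0.01]


def training_settings():
    """All backbone / batch size / learning rate combinations."""
    return [
        {"backbone": b, "bs": bs, "lr": lr}
        for b, bs, lr in itertools.product(BACKBONES, BATCH_SIZES,
                                           LEARNING_RATES)
    ]


def split_frames(frame_ids, train_fraction=0.9, seed=0):
    """Random train/validation split (the seed is an assumption)."""
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(len(ids) * train_fraction))
    return ids[:cut], ids[cut:]


def hflip_box(box, image_width=1920):
    """Mirror an (x1, y1, x2, y2) box for a horizontally flipped frame."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)


settings = training_settings()
train_ids, val_ids = split_frames(range(1000))
print(len(settings), len(train_ids), len(val_ids))
print(hflip_box((100, 200, 300, 400)))
```

For 1000 labeled frames this yields 900 training and 100 validation frames, and a full grid would contain 24 training settings.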

Table 1 Detection results of Faster-RCNN using different settings with VGG-16 pretrained model
Table 2 Detection results of Faster-RCNN using different settings with ResNet-101 pretrained model

To evaluate the trained model, we ran it on the test videos, which contain only unseen frames. For most frames, our model detected the players, the net, and the puck correctly. Figure 4 shows the predicted objects and the probability values assigned to the bounding boxes for four successful examples.

However, there were some false positives and false negatives for the puck during the test phase. Some examples are shown in Fig. 5. Our model detected the puck correctly when it was fully visible to the camera (see Fig. 5a), but could not detect it when it was blurry or not fully visible (Fig. 5b). Because the puck is very small compared to the players and the net, such cases contribute to its low AP. Figure 5a also shows an example of a false positive for the puck, a case we observed in only a few frames. In this rare case, the toe of the hockey stick was detected as a puck, since its color was the same as that of the puck while the color of the stick's blade was the same as that of the ice. As can be seen, two objects were detected as the puck in this frame. We therefore decided to consider only the puck candidate with the highest predicted confidence, which significantly improved our accuracy. The remaining false positives may be resolved by adding frames with similar scenarios to the training set.

Finally, we used the objects detected by our model in our automatic camera view switching approach to identify the instances in which a view switch was needed. Our camera switching method achieved an accuracy of 75% in real time. Considering that only 1000 frames were used for the training and validation phases, this is an encouraging result.
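The rule of keeping only the most confident puck candidate can be sketched as a small filtering step applied before scoring. This is an illustrative example only: the `(label, confidence)` tuple format and the function name `keep_best_puck` are assumptions, not the authors' code.

```python
def keep_best_puck(detections):
    """Keep all non-puck detections and only the most confident puck.

    `detections` is a list of (label, confidence) tuples.
    """
    pucks = [d for d in detections if d[0] == "puck"]
    others = [d for d in detections if d[0] != "puck"]
    if not pucks:
        return others
    best_puck = max(pucks, key=lambda d: d[1])
    return others + [best_puck]


# A frame in which a stick toe was also labeled as a puck (made-up values).
frame = [("player", 0.9), ("puck", 0.45), ("puck", 0.8), ("net", 0.7)]
print(keep_best_puck(frame))
```

Since there is only one puck in play, discarding the lower-confidence candidate prevents a spurious detection from adding a second puck-weighted term to a view's score.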

Fig. 4 Example frames from the test set: (a, b) players and the puck that were detected correctly by our model; (c, d) players, the puck, and one of the nets that were detected correctly by our model

Fig. 5 Example frames from the test set: (a) players, the net, and the puck that were detected correctly by our model as well as a false positive puck; (b) players and the net that were detected correctly by our model as well as a non-visible puck (false negative)