1 Introduction

Sports broadcasting has been revolutionized by advanced video and image processing techniques. Continuous efforts are being made to enhance captured and transmitted images so that viewers receive more detailed and accurate information about the game status and the players. Computer-generated visualizations, such as overlays of simple data and current scores, are increasingly used in sports broadcasting to enhance the viewers’ experience. Augmented reality (AR) is a technique for overlaying virtual objects onto the real world. AR of real sports games helps viewers engage with and immerse themselves in the action. In contrast, virtual reality replaces the real world with a simulated one. Present AR techniques for sports video can be divided into two major categories. The first generates virtual scenes from multiple synchronous video sequences of a given sports game; the second synthesizes virtual sports scenes from TV-broadcast video.

Until now, research on the development of AR applications has been limited. AR has been applied to some entertainment applications using vision-based tracking [1]. In [2], a technique is presented to generate visual enhancements for TV-broadcast court-net sports. Its major drawback is that the system generates redundant virtual scenes, which degrade system performance. In [3], virtual scenes are generated from multiple synchronous video sequences of a given sports game, but this technique fails on real-time transmitted videos and images because of synchronization problems. A graphics-based animation AR technique has been proposed in [4], but its results are unrealistic because the texture and motion of the players are completely lost. In [5], the authors insert and manipulate 3D virtual content in broadcast tennis video; this technique fails in situations such as when one net is not visible. In [6], the developed system uses a camera calibration algorithm to establish a mapping between the soccer field in the image and that of the virtual scene. Its major drawback is the restriction on player pose: the user has to choose from only three defined poses (walk, stop and run), which limits any improvement in system performance. None of these previous efforts has reached an acceptable level of realism. In a nutshell, the contributions of our work are as follows.

  • Our work extends the above: we have developed a real AR-based application that operates on real-time transmitted images of baseball games [5–7]. The developed system is free from the problems of the aforementioned techniques, such as synchronization, generation of redundant virtual scenes, and player pose restriction.

  • The developed technique does not suffer from loss of texture and motion of the players. All the viewers have to do is capture an image using a camera or smart phone; detection and recognition of the players and display of their personal information then occur automatically.

  • For player detection and subsequent face detection, our method places no restriction on the input images. Player face recognition requires frontal or near-frontal faces; however, it is robust to low resolution, and faces as small as 5 × 5 pixels can be successfully recognized.

In this paper, we present a complete framework for displaying player statistics in still images, which can easily be extended to real-time video. Figure 1 shows the block structure of our proposed system. In step 1, we detect players using AdaBoost [8] and Haar-like features. In step 2, we use the same algorithm to detect the face [9, 10] in each player image. Step 3 matches the face against a database of player faces, using AdaBoost with LDA [11] as the weak learner and a nearest neighbor classifier (NNC). Step 4 retrieves the statistics of the matched player, such as name, height, age, sports record, and country.

Fig. 1 Framework of our proposed application (Image taken at Jamshil Stadium, Seoul, South Korea)

The remainder of the paper is organized as follows. In Sect. 2, we discuss our proposed player detection scheme, which is based on the AdaBoost algorithm. Section 3 presents the face detection process, which is based on the algorithm described in Sect. 2. Section 4 explains the employed player face recognition approach. Section 5 explains the retrieval of player statistics from the player face database. Detailed experimental results are presented in Sect. 6. Finally, conclusions and future research directions are given in Sect. 7.

2 Player detection using AdaBoost

The AdaBoost algorithm constructs a strong classifier as a linear combination of weak classifiers. It improves accuracy through a series of weak classifiers employed at different stages. For player and subsequent face detection in our framework, we use the popular Viola and Jones approach [8]. A robust player classifier is obtained by supervised AdaBoost learning. Given a set of training samples \(\{x_i, y_i\}\), the AdaBoost algorithm selects a set of weak classifiers \(\{h_j(x)\}\) from a set of Haar-like rectangle features and combines them into a strong classifier. The strong classifier \(g(x)\) is defined as follows:

$$g\left( x \right) = \begin{cases} 1, & \sum\limits_{k = 1}^{k_{\max}} \alpha_{k} h_{k}\left( x \right) \ge \theta \\ 0, & \text{otherwise} \end{cases}$$
(1)

where x is the input image, k indexes the weak classifiers, \(k_{\max}\) is the number of weak classifiers, \(\alpha_{k}\) is the weight of the weak classifier \(h_{k}\), \(\theta\) is the decision threshold, and \(g\left( x \right)\) is the final strong classifier. Table 1 lists important notations used in our paper.
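As an illustration, the decision rule of Eq. 1 can be sketched in Python as follows. This is a minimal sketch: the three weak classifiers, their weights and the threshold are hypothetical values chosen for the example, not classifiers learned from Haar-like features.

```python
from typing import Callable, Sequence

def strong_classify(x: Sequence[float],
                    weak_classifiers: Sequence[Callable[[Sequence[float]], int]],
                    alphas: Sequence[float],
                    theta: float) -> int:
    """Eq. 1: output 1 if the weighted vote of the weak classifiers reaches theta, else 0."""
    score = sum(a * h(x) for a, h in zip(alphas, weak_classifiers))
    return 1 if score >= theta else 0

# Hypothetical threshold-type weak classifiers on a toy feature vector
# (illustrative stand-ins for classifiers built on Haar-like features).
weak_classifiers = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.2 else 0,
    lambda x: 1 if x[2] < 0.9 else 0,
]
alphas = [0.8, 0.5, 0.3]  # assumed weights alpha_k
theta = 0.7               # assumed decision threshold

print(strong_classify([0.9, 0.1, 0.95], weak_classifiers, alphas, theta))
print(strong_classify([0.1, 0.3, 0.95], weak_classifiers, alphas, theta))
```

In the first call only the first weak classifier fires, and its weight alone exceeds the threshold; in the second call only the second fires, and its weight does not.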

Table 1 Important notations used in this paper and their description

Table 2 gives the complete pseudo-code and full details of the AdaBoost algorithm for obtaining the strong classifier of Eq. 1. More details of this approach can be found in [9]. From Table 2, it can be seen that a correct classification of a sample makes its weight drop for the next iteration, whereas the weight of a misclassified sample is unchanged. The smaller the error rate of a weak classifier, the smaller its corresponding β and the greater its α. The main idea of the AdaBoost classifier is therefore to learn a series of weak classifiers and a threshold \(\theta\) that is adjusted to meet the detection rate goal. The Haar-like rectangle features are computed using the “Integral Image” representation. The cascade method quickly filters out non-player image areas.
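The weight update described above can be sketched for a single boosting round as follows. The sketch assumes the update of the Viola and Jones formulation [8], i.e. β = ε/(1 − ε) and α = log(1/β), where ε is the weighted error of the selected weak classifier; the function name and the toy inputs are hypothetical.

```python
import math

def adaboost_round(weights, predictions, labels):
    """One boosting round: return (updated weights, beta, alpha).

    Assumes beta = eps / (1 - eps) and alpha = log(1 / beta),
    with eps the weighted error of the chosen weak classifier.
    """
    total = sum(weights)
    w = [wi / total for wi in weights]  # normalize to a distribution
    eps = sum(wi for wi, p, y in zip(w, predictions, labels) if p != y)
    if not 0.0 < eps < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    beta = eps / (1.0 - eps)
    alpha = math.log(1.0 / beta)
    # Correctly classified samples are scaled down by beta;
    # misclassified samples keep their weight.
    new_w = [wi * beta if p == y else wi
             for wi, p, y in zip(w, predictions, labels)]
    return new_w, beta, alpha

# Toy example: four equally weighted samples, the last one misclassified.
new_w, beta, alpha = adaboost_round([1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 0, 1])
print(new_w, beta, alpha)
```

With one of four equally weighted samples misclassified, ε = 0.25, so β = 1/3 and α = log 3. After renormalization in the next round, the misclassified sample carries half of the total weight.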

Table 2 AdaBoost algorithm

A set of 630 positive samples and 1100 negative samples is used to train the classifier. The positive samples are selected from our own image database of baseball players, and the negative samples are obtained from different public databases. To reduce computation time, all player images are first converted to gray scale and size-normalized to 20 × 40 pixels. Some examples of positive and negative samples are shown in Fig. 2. A rough estimate of the size of the smallest and largest player can be obtained from the size of the input image, and the size of the detection window is therefore restricted to a certain range. The width of the detection window is set to half of its height. This improves both accuracy and search time. An image captured at the sports ground contains players as well as other objects such as the audience and advertising boards. The purpose of player detection is therefore to confine the processing area of the AR application by eliminating this redundant information. Detecting players also gives viewers the exact position of their favorite players and improves the visual appeal of the output.
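The restriction of the detection window to a size range with a fixed 1:2 aspect ratio can be sketched as below. The geometric scale step of 1.25 and the height bounds in the example are assumptions made for illustration; the values used in the actual system are not stated here.

```python
def detection_window_sizes(min_height, max_height, scale_step=1.25):
    """Enumerate (width, height) detection windows with width = height // 2.

    min_height and max_height bound the expected player size in pixels;
    scale_step is an assumed geometric step between successive scales.
    """
    sizes = []
    h = float(min_height)
    while h <= max_height:
        height = int(round(h))
        sizes.append((height // 2, height))
        h *= scale_step
    return sizes

# Hypothetical bounds: players between 40 and 100 pixels tall.
sizes = detection_window_sizes(40, 100)
print(sizes)
```

Limiting the scan to this range means that window sizes smaller than the smallest plausible player, or larger than the largest, are never evaluated.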

Fig. 2 Example of positive (left) and negative (right) training samples

3 Player face detection

Face detection employs the boosting methodology, which is one of the state-of-the-art approaches for object detection [10]. A face detection algorithm based on a feature subspace maps the face image to a certain feature space; the distinction between face and non-face is then made according to the distribution law in that feature subspace. In our work, the objective of player face detection is to obtain the spatial location of player faces. The simple features used are reminiscent of Haar basis functions. Figure 3 shows the two-, three- and four-rectangle features.

Fig. 3 Example of rectangle features shown relative to the enclosing detection window

The rectangle features are computed using the Integral Image, whose value at a point (x, y) is the sum of the pixels above and to the left of (x, y), inclusive. The Integral Image I(x,y) is defined in Eq. 2.

$$I(x,y) = \sum\limits_{x* \le x,y* \le y} {i(x*,y*)}$$
(2)

where I(x,y) is the Integral Image at point (x,y) and i(x,y) is the original image. For player face detection, using Haar-like features with the AdaBoost algorithm for both feature selection and classification reduces computation and execution time. The face detection approach is the same as that of player detection except that different training data are used to build the classifier. The detected player faces are then used as input to the face recognition algorithm.
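To make the role of Eq. 2 concrete, the following minimal Python sketch computes an integral image and uses it to evaluate a rectangle sum with four lookups, and from that a two-rectangle feature. The function names and the zero-padded storage layout are choices made for this illustration.

```python
def integral_image(img):
    """Integral image of a 2D list, padded with a zero first row and column.

    ii[y + 1][x + 1] equals I(x, y) of Eq. 2: the sum of all pixels
    i(x', y') with x' <= x and y' <= y.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle with top-left pixel (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def two_rect_feature(ii, x, y, w, h):
    """Horizontal two-rectangle feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))
print(two_rect_feature(ii, 0, 0, 2, 3))
```

Because every rectangle sum costs four array lookups regardless of the rectangle size, a feature can be evaluated at any scale in constant time once the integral image has been built.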

4 Player face recognition

Currently, efforts are ongoing toward efficient storage and retrieval of face images. Various techniques exist to handle the curse of dimensionality. They fall into two broad categories: linear methods and nonlinear methods. PCA, LDA and LPP are popular linear methods, while nonlinear methods include ISOMAP and Eigenmaps learning [12]. PCA and LDA are the two most widely used subspace learning techniques for face recognition. Jun Yu et al. presented an adaptive hypergraph scheme with application to image classification [13]. Their algorithm generates hyperedges by linking images with their nearest neighbors while varying the size of the neighborhood. The method simultaneously learns the labels of unlabeled images and the weights of hyperedges, and therefore automatically modulates the effects of different hyperedges. The scheme produced better results than its baseline schemes. Full details can be found in [13].

The authors of [14] presented a complex object correspondence construction technique in 2D animation for effective face recognition. A semisupervised patch alignment framework for complex object correspondence construction was introduced. It constructs local patches for each point on an object and aligns these patches in a new feature space, in which correspondences between objects can be detected by subsequent clustering. Experiments conducted on the MPEG-7 dataset showed performance comparable to that of a prominent algorithm published in T-PAMI 2009. A detailed description can be found in [14].

A semisupervised multiview distance metric learning method for cartoon synthesis is presented in [15], where the authors introduced three visual features (color histogram, shape context, and skeleton) to characterize the color, shape, and action, respectively, of a cartoon character. A system was developed that is decomposed into modules for multiview cartoon character classification, multiview graph-based cartoon synthesis, and multiview retrieval-based cartoon synthesis. The technique was found to be efficient in cartoon applications.

Combining multiple view features for cartoon character retrieval and clip synthesis was presented in [16]. The researchers conducted experimental evaluations based on cartoon character retrieval and clip synthesis, which demonstrated the effectiveness of the technique for cartoon applications. A semisupervised multiview subspace learning (semi-MSL) algorithm was also introduced to encode different features in a unified space. Details can be found in [16, 17]. In our work, we use the AdaBoost algorithm [8–10] with LDA [11] as a weak learner for feature selection in the LDA subspace, while classification is performed using a classic nearest neighbor classifier (NNC). In pattern classification, the task of learning from examples can be stated as follows.

Suppose we have a training set \(Z = \{Z_{i}\}_{i = 1}^{C}\) containing C classes, with each class \(Z_{i} = \{(z_{ij}, y_{ij})\}_{j = 1}^{L_{i}}\) consisting of a number of samples \(z_{ij}\) and their corresponding class labels \(y_{ij}\), so that a total of \(N = \sum\nolimits_{i = 1}^{C} L_{i}\) samples are available in the set. Let \(\mathcal{Z}\) be the sample space, \(z_{ij} \in \mathcal{Z}\), and \(Y = \{1, \ldots, C\}\) be the label set, \(y_{ij} (= i) \in Y\). Taking such a set Z as input, the objective of learning is to estimate a function or classifier \(h(z): \mathcal{Z} \to Y\) such that h will correctly classify unseen samples (z, y). The AdaBoost algorithm works by repeatedly applying a given weak learner to a weighted version of the training set in a series of rounds \(t = 1, \ldots, T\), and then linearly combining the weak classifiers \(\{h_{t}\}_{t = 1}^{T}\) constructed in each round into a single, strong and accurate classifier \(h_{f}\). Equation (3) shows the final strong classifier:

$$h_{f}(z) = \arg\max_{y \in Y} \sum\limits_{t = 1}^{T} \log\left(\frac{1}{\beta_{t}}\right) h_{t}(z, y)$$
(3)
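The combination rule of Eq. 3 can be sketched in Python as follows. The sketch assumes that each weak classifier \(h_{t}(z, y)\) returns a confidence in [0, 1] that sample z belongs to class y; the two weak classifiers and the β values are hypothetical lookup tables, not LDA-based learners.

```python
import math

def final_classify(z, weak_classifiers, betas, labels):
    """Eq. 3: pick the label maximizing sum_t log(1 / beta_t) * h_t(z, y)."""
    def vote(y):
        return sum(math.log(1.0 / b) * h(z, y)
                   for h, b in zip(weak_classifiers, betas))
    return max(labels, key=vote)

# Hypothetical weak classifiers: fixed per-class confidences, ignoring z.
h1 = lambda z, y: {1: 0.7, 2: 0.2, 3: 0.1}[y]
h2 = lambda z, y: {1: 0.1, 2: 0.8, 3: 0.1}[y]
labels = [1, 2, 3]

# A smaller beta_t means a more reliable round and therefore a larger vote.
print(final_classify(None, [h1, h2], [0.5, 0.1], labels))
print(final_classify(None, [h1, h2], [0.1, 0.5], labels))
```

Swapping the two β values swaps which weak classifier dominates the vote, and with it the predicted class.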

The performance of the traditional LDA-based approach is improved by incorporating it into the boosting framework. Each round of boosting generates a new LDA subspace that focuses particularly on the examples misclassified in the previous LDA subspaces. The final feature extractor is an ensemble of several specific but very accurate LDA solutions. This ensemble-based approach takes advantage of both LDA and boosting and outperforms purely LDA-based systems in complex face recognition tasks. This is particularly important when fewer training samples are available for each subject than the number of dimensions of the samples, i.e., the small-sample-size (SSS) problem [18], and when nonlinear variations are present in the facial images. Our face recognition system is also more robust when recognizing low-resolution face images, a result that is also verified by the authors of [11]. The pseudo-code of the employed face recognition system is given in Table 3.

Table 3 AdaBoost algorithm for face recognition

5 Player statistics retrieval

The final goal of this work is to display relevant statistics for the player(s) once an image is captured or obtained from a video. This depends on the accuracy of the detection and recognition results. Recently, various object recognition approaches have been presented for automatic retrieval of data [19–21]. In [22], a general classification algorithm for image-based object recognition is presented. This framework provided insights into two crucial issues in face recognition: feature extraction and robustness to occlusion. The algorithm was reported to perform well under varying illumination and expression, but no experiments were performed on low-resolution images. The researchers in [23] presented a multiview hypergraph-based method that effectively integrates click data with varied visual features. It considers pairwise discriminative constraints from click data that maximally distinguish images with high click counts, from which a semantic manifold is constructed. Experiments on real datasets showed that the scheme is more effective than state-of-the-art image re-ranking methods. We store players’ statistics in our system together with the corresponding players’ facial images. For a given input face image, the face recognition system calculates a matching score between the input face and every face image in the database. The pair with the maximum matching score is considered the true match; it is found by simple comparison of all the matching scores obtained while the test image is compared with all the players’ faces in the database. This type of AR application can be used to enhance viewers’ experience when a viewer uses a smart phone or a camera to capture photos or video of a live sport.
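The matching and retrieval step can be sketched as a nearest neighbor search followed by a table lookup. In this minimal sketch the matching score is assumed to be the negative Euclidean distance between feature vectors, so that the nearest neighbor has the maximum score; the player identifiers, feature vectors and statistics are hypothetical placeholders.

```python
import math

def match_player(query, gallery):
    """Return (player_id, matching_score) of the best match in the gallery.

    gallery is a list of (player_id, feature_vector) pairs. The matching
    score is assumed to be the negative Euclidean distance, so the nearest
    neighbor has the highest score.
    """
    def matching_score(features):
        return -math.dist(query, features)
    best_id, best_features = max(gallery,
                                 key=lambda item: matching_score(item[1]))
    return best_id, matching_score(best_features)

# Hypothetical gallery of face features and the associated statistics.
gallery = [("player_01", [0.0, 0.0]), ("player_02", [1.0, 1.0])]
statistics = {
    "player_01": {"name": "Player A", "age": 27},
    "player_02": {"name": "Player B", "age": 31},
}

player_id, best_score = match_player([0.9, 1.1], gallery)
print(player_id, statistics[player_id])
```

Keeping the statistics keyed by the same identifier as the face gallery makes retrieval a single dictionary lookup once the match is known.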

6 Experimental results

Our proposed system is composed of three modules: player detection, face detection and face recognition. It is necessary to analyze the results of each module separately in addition to considering the performance of the whole system as a ‘black box’. Figure 4 shows an example of the system output for each module.

Fig. 4 Example outputs of our developed system

We use a dataset of 412 diverse images with different numbers of players, various lighting conditions, and small variations of face pose from frontal. To better evaluate the feasibility of the system for real-world applications, we use two types of images: images taken with a smart phone and images taken with a good-quality digital camera. There are a total of 80 different subjects (players) in the dataset, shown in Fig. 7. Table 4 gives a quantitative summary of the performance of each module. The number of players per image essentially indicates the size of the detected players: fewer players per image imply that the image was taken from a short distance and that the player(s) in the image have high resolution, which helps improve the performance of subsequent steps. In the following subsections, we discuss and analyze each module developed in our work.

Table 4 Summary of our system accuracy

6.1 Player and face detection

Figure 5 shows some example images with players detected by the player detection module, illustrating the feasibility of our approach for real-world scenarios. Face detection and face recognition are performed on images containing several players with variation in lighting conditions and small deviation from frontal pose. Fully occluded players and extremely small players, such as 4 × 4 pixels, could not be detected by the player detection module.

Fig. 5 Sample images illustrating performance of player detection module (Images taken using smart phone at Jamshil Stadium, Seoul, South Korea)

The player face detection module takes detected players as its input and detects player faces down to a face size of 5 × 5 pixels. Figure 6 shows some examples of player face detection results. The face detection module performs well in most cases except when the detected players are very small, as in the lower right image in Fig. 6. The face detection module does not process extremely blurred or invisible player faces; in such cases, the human eye is generally also unable to locate the face.

Fig. 6 Player face detection results (Images taken at Jamshil Stadium, Seoul, South Korea)

6.2 Player face recognition

We partition the database randomly into two subsets: the training set Z and the test set Q. The face recognition system is first trained on the Z training images, and the resulting face recognizer is then applied to the Q test images to measure the classification error rate (CER). Figure 7 shows the database of 80 different player faces. Once the input face is matched to a face in the database, personal information of the matched player, such as age, score, and nationality, is retrieved from the database of players’ statistics. The developed application shows the feasibility of computer vision-based approaches for enhanced sports broadcasting.
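The evaluation protocol above can be sketched as a random per-subject partition followed by a CER computation. The helper names, the fixed random seed and the toy data below are assumptions made for this illustration.

```python
import random

def split_per_subject(images_by_subject, n_train, seed=0):
    """Randomly split each subject's images into training and test subsets."""
    rng = random.Random(seed)
    train, test = {}, {}
    for subject, images in images_by_subject.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        train[subject] = shuffled[:n_train]
        test[subject] = shuffled[n_train:]
    return train, test

def classification_error_rate(predicted, actual):
    """CER: fraction of test samples whose predicted label is wrong."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

# Toy data: two subjects with ten image identifiers each.
data = {"a": list(range(10)), "b": list(range(10, 20))}
train, test = split_per_subject(data, n_train=3)
print(len(train["a"]), len(test["a"]))
print(classification_error_rate([1, 2, 3, 4], [1, 2, 0, 4]))
```

Splitting within each subject keeps every player represented in both subsets, which matters when only a few images per player are available.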

Fig. 7 Face database of 80 different players

Player and face detection are implemented in C, while the face recognition module is implemented in MATLAB. Implementing the whole system in a low-level language would make it usable in real-time applications such as processing images from a video. Table 5 shows the execution time of each module.

Table 5 Execution time

Figure 8 shows the final output of the proposed system on some sample images from our database of 412 images. Most of the images contain several players in various poses and lighting conditions. In Fig. 8, players with statistics displayed indicate success of all the modules of the system. In some cases, player detection and face detection succeed but recognition fails. In other cases, only player detection succeeds and both face detection and recognition fail. Examples of such results can be seen in the lower left image in Fig. 8. Images in which the player’s face is not visible are not processed further, and no information about the player is displayed. For extremely low-resolution images and images in which players overlap, the face recognition system does not work, although player and face detection still exhibit considerable accuracy. To further assess the performance of our face recognizer, we also used the BioID database [18] for training and testing, in addition to the player database that we developed in our lab.

Fig. 8 Output of proposed system on sample images from database (Images taken at Jamshil Stadium, Seoul, South Korea)

Figure 9 shows the accuracy of the face recognition module as the number of training images per subject is varied. This kind of experiment is difficult to perform with the players’ database because it is difficult to obtain many images of a single player during play, particularly when the database contains many players, such as the 80 players considered in our experiments. As the number of training images increases, accuracy increases and reaches almost 100 %.

Fig. 9 Accuracy with number of training images used per subject

6.3 Further analysis of face recognition module

We perform several tests to ensure that the employed face recognition system is robust. One aspect of a learning-based algorithm is the number of training samples per subject required to train the classifier. We take a subset of the player database consisting of 10 different baseball players with 10 different images each. Different numbers of training and testing samples are used to measure the CER. Table 6 shows the sensitivity of the CER, measured for various SSS learning tasks arising from different database partitions ρ(L), to the various extents of weakness of the gClassifiers in each task, ρ(r). Each combination of training and test samples was executed five times, and the minimum CER of each run was recorded along with the iteration number, as shown in Table 6. Here \(e_{20}\) indicates that the iteration number was set to 20, and \(T^{*}\) is the iteration number at which the lowest CER was found. The CER improves as the number of training samples increases.

Table 6 Minimum CER with iteration number

Table 7 shows the generalization loss against the number of training (tr) examples per subject relative to the total number of samples, as well as estimates of the best r (r*, the combination of training (tr) and testing (tst) samples with the lowest CER) and the worst r (the combination of training and testing samples with the highest CER).

Table 7 Generalization loss and estimation

Detailed experiments are performed to test the performance of the face recognition module when the input face image is small. Table 8 summarizes our results for 30 players in the database. The major obstacle to higher accuracy in the face recognition module is variation in pose, not the size of the face or the illumination condition. As the number of training samples increases, the CER is reduced. The face recognition module works well for faces as small as 5 × 5 pixels, even with 2 training samples. Face images in the player database range in size from 231 × 251 to 10 × 10 pixels. Therefore, the system performs well in most recognition tasks.

Table 8 Recognition performance with different sizes of face images

7 Conclusions

In this paper, we demonstrate the feasibility of a computer vision-based approach for enhancing the end-user experience of sports. The primary goal of this work is to process images taken by audience members with a digital camera or smart phone during play and to display the relevant statistics of each player depicted in the images. Statistics of interest may include age, score, nationality, etc. This work shows the feasibility of developing a real-time augmented reality application for enhanced user experience that is based only on computer vision and image processing techniques [24]. One simple usage scenario arises when viewers are sitting in a public place or in a stadium and do not have access to a TV showing important information about the players and the game status. Using their smart phones or cameras, viewers can start recording a video, and the video or image is enhanced as detection and recognition of players and display of statistics occur automatically for each frame in real time.