1 Introduction

Real-time video surveillance extends the reach of the intelligent world, enabling sensory connections on a global scale, acting as the bridge between the digital and physical worlds, and serving as a powerful catalyst for the smart and digital transformation of various surveillance applications. These applications are widely deployed in public environments for the real-time monitoring of physical assets and locations, the analysis of captured video to identify security indicators, and security planning. The advent of machine learning, deep learning, and image processing techniques has opened new research possibilities in this field. Deep learning has enabled the automated extraction and analysis of information from images and video sequences, and its convergence with image processing is valuable in a variety of security applications. Detecting high-risk situations before they escalate is one of the essential motivations behind artificial intelligence-based security and surveillance applications. With these advancements, operators can extend surveillance solutions beyond mere monitoring, leveraging every video frame and piece of data available to identify threats and inform the emergency response.

Fig. 1

Frontal or asymmetric camera view: sample images taken from [1,2,3]. The first two sample images are recorded from the frontal view camera, while the third image is captured from the asymmetric view. The body features of individuals vary in terms of appearance (scale, shape, size, and pose). The sample images also highlight the problem of occlusion

Fig. 2

Overhead camera view: example images taken from [4,5,6]. All sample images are captured from an overhead view with different camera heights. It can be recognized that the visual features of the person’s body are different from a frontal view; the occlusion problem is reduced, and more coverage of the scene is obtained

In real-time video surveillance systems, detecting a person is essential for diverse applications, including person identification, person tracking, person counting, unusual event detection, and crowd monitoring [7]. Besides its wide range of applications, it is a challenging task because of the variable visual features of a person’s body, including appearance (scale and size) and deformable poses. Complex and cluttered backgrounds, lighting conditions, different kinds of occlusion, abrupt variations in motion, and camera perspectives also affect the efficiency and performance of tracking algorithms. Researchers have proposed different person tracking techniques, mostly employing conventional handcrafted features and machine learning-based approaches [8, 9], which are computationally expensive and require extra background training to learn person features. In contrast, advanced deep learning-based methods, e.g., [10,11,12,13], present effective solutions for person detection and tracking in terms of accuracy and computation speed [9] and attempt to overcome the aforementioned challenges.

Most advanced machine learning and deep learning-based tracking and detection algorithms use frontal or asymmetric camera viewpoints [10,11,12], where images are obtained from the frontal view as presented in Fig. 1. The visual features of the person’s body in such images vary in terms of body orientation, pose, movement, and body articulation. The example images also highlight the occlusion problem that occurs when persons and objects overlap each other. Some researchers, e.g., [6, 14, 15], considered an overhead camera perspective for person tracking and detection, as illustrated in Fig. 2. It is noticeable that the visual features of the person’s body in such an extreme view differ substantially from the frontal view, and usually depend on local rotations, the movements of the body, and its position with respect to the camera. In an overhead view, the problem of occlusion is considerably reduced compared to the frontal view, where cross-object occlusion can occur when the scene becomes crowded, as illustrated in Fig. 1.

With the above-mentioned motivations, researchers have preferred an overhead camera in different surveillance applications, including person detection [16,17,18], person counting [4, 5, 19, 20], person tracking [20,21,22,23,24], action recognition, crowd analysis [25], behaviour understanding [26], and human posture identification [27]. Besides mitigating the occlusion dilemma, this perspective also alleviates privacy issues [28] and reduces computation and installation expenses [29]. The contrast between the two camera perspectives is highlighted in Figs. 1 and 2. One can easily observe that the person’s body features depend on the camera perspective; each perspective produces distinct changes in the person’s visual characteristics (pose, size, shape, scale, and body orientation). The overhead perspective mitigates the challenges of occlusion and allows broader coverage of the scene.

A real-time intelligent surveillance system is presented for overhead view person tracking and segmentation. For person tracking and segmentation, a deep learning-based algorithm, i.e., SiamMask [30], is explored. The algorithm produces both target tracking and target segmentation in video in real time. It is simple, versatile, and fast, delivers good results compared to other real-time tracking systems [31, 32], and represents the state of the art in the target tracking field. It also obtained competitive performance and the fastest speed on the DAVIS-2016 and DAVIS-2017 video segmentation data sets [30]. The algorithm realizes segmentation of the target by adding a mask branch to the fully convolutional twin (Siamese) neural network used for person (target) tracking. The algorithm is first evaluated on an overhead view data set. Since the overhead view introduces significant variations in the visual features of the person, the network is additionally trained with the overhead view person data set, and the improved trained feature layer is combined with the existing network using transfer learning. The experimental results reveal that the accuracy of the person tracking and segmentation algorithm improves after this training. In general, the principal objectives of the paper are given as:

  • A real-time intelligent surveillance system is introduced for overhead view person tracking and segmentation.

  • The system utilizes a deep learning-based tracking algorithm, i.e., SiamMask, for overhead view person (target) tracking and segmentation in video sequences.

  • The performance of the network architecture is investigated by testing the tracking algorithm with both the pre-trained and the additionally trained networks.

  • The tracking performance is also compared with different bounding box representation strategies.

  • The tracking accuracy results are compared with other tracking algorithms.

The rest of this paper is arranged as follows: a review of related work is provided in Sect. 2. Section 3 explains the real-time smart surveillance system for overhead perspective person tracking and segmentation. The implementation and experimental assessment of the system are reported in Sect. 4. Finally, the conclusion and possible future directions of the work are presented in Sect. 5.

2 Related work

Person tracking from an overhead perspective is considered a challenging task in various surveillance applications. In this section, we present a review of some recent overhead view-based person tracking methods.

Migniot et al. [15] presented a hybrid 3D–2D tracking approach using particle filtering for human tracking. Authors in [20] presented a graph structuring technique for overhead view person tracking. Most of the techniques developed by researchers focus mainly on the head, the head–shoulder region, or sometimes the entire body of the person. A few researchers, e.g., [21, 31,32,33], also applied particle filtering for person tracking from the overhead perspective. A good review of different overhead view people detection and tracking techniques is provided in [34].

Vera et al. [4] adopted a Hungarian approach for people tracking in overhead view video sequences. Gao et al. [35] applied a median filter, and [36] adopted the mean shift algorithm for person tracking. Bagaa et al. [37] provided an effective tracking system for 5G networks. Nakatani et al. [28] considered the head region as the Region of Interest (RoI) and applied hair texture information for person tracking. Authors in [22, 36, 38,39,40] assumed head–shoulder information as the RoI for person detection. Researchers in [32, 36, 41, 42] examined the full human body as the RoI.

Some researchers, e.g., [28, 32, 35, 39], applied color-based information, while a few utilized edge information, such as a Canny edge detector with SIFT features [32] and Sobel filters [43], for person detection and tracking. In [44], the authors utilized a feature-based method, e.g., the Histogram of Oriented Gradients (HoG), and presented an efficient person detection system. In [24], the authors studied local ternary patterns and support vector machine classifiers for person detection and tracking. Ozturk et al. [32] modeled the shape of the individual body as an elliptical blob and introduced a tracking and detection system. Wu et al. [45] and Wetzel et al. [46] designed person tracking and detection methods utilizing depth images captured from an overhead camera.

Ahmed et al. [47] introduced a rotated HoG method to recognize people in a complex industrial setting using an overhead camera. In [14], the authors proposed a robust person detection algorithm that utilized variable-sized bounding boxes with different angles. Authors in [24] assumed a fixed-size detection bounding box and a feature-based method for person detection. Authors in [23, 48] offered another feature-based approach for person detection and tracking in indoor and industrial conditions. Ullah et al. [49] compared and investigated conventional person tracking algorithms using an overhead camera. Ullah et al. [50] further implemented a blob-based strategy and offered a rotation-invariant system for person tracking. A rotated feature and classification-based method is presented in [17] for person detection.

Fig. 3

Real-time person tracking and segmentation system for overhead view surveillance. The recorded overhead view video sequences are sent to the cloud storage and image processing unit. The image processing unit utilizes a deep learning-based algorithm for person tracking and segmentation. The final tracking and segmentation results are transferred to the monitoring and surveillance unit for further processing

Deep learning methods [5] have also been utilized for person tracking, although the majority of advanced studies practiced frontal perspective data sets. Many scholars [15, 51,52,53] offered target detection and tracking utilizing aerial and satellite data sets. Authors in [54, 55] studied pre-trained deep learning approaches for detecting and tracking persons from an overhead camera perspective. Authors in [18] studied Mask-RCNN and Faster-RCNN for multiple overhead view object segmentation and detection. Ahmed et al. [7] presented a multiple people tracking framework based on a deep learning-based tracking and detection model using 5G infrastructure.

Ahmad et al. [16] selected a deep learning model to detect and track multiple people in overhead view outdoor and indoor scenes. In another work, the authors [56] studied different deep learning-based segmentation methods for people in an overhead view. Ahmed et al. [57] implemented two separate deep learning models for multiple object detection coupled with various tracking algorithms. Authors in [60] presented a real-time IoT-based framework for overhead view person detection utilizing a deep learning model.

From the above review, it is concluded that significant work has been performed for overhead view person tracking. Researchers have employed color, texture, and shape-based information for person tracking. Most used different handcrafted feature-based approaches, while a few also practiced different deep learning-based models. In this work, we explore the deep learning-based SiamMask algorithm for overhead view person tracking and segmentation.

3 Real-time overhead view surveillance system for person tracking and segmentation

A real-time person tracking and segmentation system is presented for an intelligent surveillance application. The technical layout of the introduced system is presented in Fig. 3; it includes an image processing unit, mainly comprising a deep learning algorithm. The image processing unit is connected to a cloud server through an internet connection, which enhances the efficiency of the developed system by decreasing the computational expense and allowing high-resolution video sequences to be processed over the cloud in real time. The recorded video sequences are collected at the cloud storage and image processing unit via the internet connection. The image processing unit uses an artificial intelligence or deep learning-based algorithm to process and analyze the high-resolution video sequences at high processing and computation speed. The network architecture is also trained on person video sequences captured using an overhead camera. For person tracking, the deep learning algorithm SiamMask [30] is applied. As the visual features of a person’s body in an overhead view are different, additional training is performed to enhance the system’s accuracy for person tracking in this view. The improved, learned features are combined with the pre-trained weights using transfer learning, as depicted in Fig. 3. The results of the image processing unit are then transmitted to the monitoring and surveillance unit, where they can assist monitoring operators in different surveillance applications. The details of the developed system are given below.
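As a rough illustration of this transfer-learning step, the sketch below freezes the early stages of a ResNet-50 backbone and fine-tunes only the later layers on overhead data. The stage split, optimizer choice, and hyperparameters are our assumptions for illustration, not the exact recipe used by the system.

```python
import torch
from torchvision.models import resnet50

# Hypothetical fine-tuning sketch: keep generic low-level ImageNet features
# fixed and adapt the high-level layers to the overhead viewpoint, where the
# visual features of the body differ from the frontal view.
backbone = resnet50(weights="IMAGENET1K_V1")
for name, param in backbone.named_parameters():
    if not (name.startswith("layer3") or name.startswith("layer4")):
        param.requires_grad = False  # freeze the early stages

# Only the unfrozen (high-level) parameters are updated during training.
optimizer = torch.optim.SGD(
    (p for p in backbone.parameters() if p.requires_grad),
    lr=1e-3, momentum=0.9, weight_decay=5e-4,
)
```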

Fig. 4

Schematic illustration of SiamMask [30]. The input image is processed through the backbone convolutional layers (“Conv”, ResNet-50) for feature extraction (three-branch architecture). The architecture is based on twin networks. *d denotes the depthwise cross-correlation operation, indicating that the correlation is computed on a channel-by-channel basis. The middle output, the RoW (response of a candidate window), keeps the number of channels unchanged; three categories or branches are then derived from this RoW: segmentation, regression, and classification

The algorithm narrows the gap between object tracking and object segmentation. It is also known as a multi-task learning process that can be applied to resolve both object tracking and segmentation problems. The algorithm’s primary insight is that when an object rotates, an axis-aligned box typically incurs a significant representation error, which is a defect of the representation itself. SiamMask instead predicts the mask of an object directly, which allows the most accurate box to be obtained. The algorithm follows an offline training, online operation method and efficiently improves the representation of the object (target) while relying only on a simple axis-aligned bounding box for initialization.

The schematic representation of the architecture is presented in Fig. 4. The input is split into two parts: the top branch takes the target image z, and the lower branch takes the search image x (larger than z). To achieve fast speed and online operability, the algorithm adopts the fully convolutional Siamese network (SiamFC). It matches the target image z against the search image x to obtain a feature map (dense response map), as illustrated in Fig. 4. The flow chart of the SiamMask tracking algorithm is shown in Fig. 5.

As the size of z is smaller than x, the obtained feature map \(f_{\theta }(z)\) is also smaller than the feature map \(f_{\theta }(x)\). \(f_{\theta }(z)\) is then slid over \(f_{\theta }(x)\), and a similarity measure joins the two into a single score map. Finally, the largest value in the score map is the point with the highest confidence, and the corresponding region in image x is the predicted region in that frame. The two input images are processed by the corresponding CNNs to obtain image features, producing cross-correlated feature maps [58]:

$$\begin{aligned} g_{\theta }(z; x) = f_{\theta }(z) \star f_{\theta }(x). \end{aligned}$$
(1)

Each spatial component of the feature map \(g_{\theta }(z; x)\) is referred to as the response of a candidate window (RoW); the highest value of the feature map corresponds to the target area in image x. To enable every RoW to encode richer information about the target object, the cross-correlation in Eq. 1 is replaced by a depthwise cross-correlation [58], producing a response/feature map with multiple channels. The SiamFC network was trained on millions of frames using the logistic loss [59]. To enhance the efficiency of SiamFC, a region proposal network is applied, which enables a variable-sized bounding box to determine the target position. In particular, every RoW encodes a collection of k anchor-box proposals and corresponding object/background scores.
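A minimal sketch of this depthwise cross-correlation, implemented with PyTorch’s grouped convolution, is shown below; the tensor shapes are illustrative only and not taken from [30].

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(z_feat: torch.Tensor, x_feat: torch.Tensor) -> torch.Tensor:
    """Depthwise cross-correlation: correlate f(z) with f(x) channel by channel.

    z_feat: template features, shape (B, C, Hz, Wz)
    x_feat: search features,   shape (B, C, Hx, Wx), with Hx >= Hz, Wx >= Wz
    Returns a multi-channel response map of shape (B, C, Hx-Hz+1, Wx-Wz+1);
    each spatial position is one RoW (response of a candidate window).
    """
    b, c, hz, wz = z_feat.shape
    # Fold the batch into the channel axis and use grouped convolution so that
    # every template channel correlates only with its matching search channel.
    x = x_feat.reshape(1, b * c, *x_feat.shape[2:])
    kernel = z_feat.reshape(b * c, 1, hz, wz)
    out = F.conv2d(x, kernel, groups=b * c)
    return out.reshape(b, c, *out.shape[2:])

# Illustrative sizes only (e.g., 256-channel backbone features):
z = torch.randn(1, 256, 7, 7)    # f(z), template feature map
x = torch.randn(1, 256, 31, 31)  # f(x), search feature map
print(depthwise_xcorr(z, x).shape)  # torch.Size([1, 256, 25, 25])
```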

As viewed in the schematic diagram, the mask generated by each RoW is a vector, which means that the resulting mask image is very coarse and smaller than the initial image. Hence, it is refined with up-sampling and adjustment. Because the accuracy of the predicted mask alone is not high, a u-shaped refine module is used, which combines the feature maps of the backbone and performs up-sampling to get finer segmentation results. Along with bounding box coordinates and similarity scores, the RoW of the fully convolutional Siamese network is used to encode the information needed to generate a pixelwise binary mask. Thus, the Siamese network is extended with an additional branch and loss. The binary masks of size \(w \times h\) (one per RoW) are predicted by a simple two-layer neural network \(h_{\phi }\) with learnable parameters \(\phi\). During training, every RoW is labeled with a true binary label \(y_n \in \{\pm 1\}\) and associated with a pixelwise true mask \(c_n\) of size \(w \times h\). The mask prediction loss \(L_{\mathrm{mask}}\) is a binary logistic regression over all RoWs, given as [30]:

$$\begin{aligned} L_{\mathrm{mask}} (\theta , \phi ) = \sum _{n} \left( \frac{1+y_n}{2wh} \sum _{ij} \log \left( 1+ e^{-c_n^{ij} m_n^{ij}}\right) \right) . \end{aligned}$$
(2)
Fig. 5

Flow chart of the SiamMask algorithm for tracking and segmentation of a person in overhead videos

In Eq. 2, \(c^{ij}_n \in \{\pm 1\}\) denotes the label of pixel (i, j) of the object in the n-th RoW. Therefore, the classification layer of \(h_{\phi }\) comprises \(w \times h\) classifiers, each indicating whether a given pixel belongs to the target in the candidate window or not. In the original work, \(L_{\mathrm{mask}}\) is evaluated only for positive RoWs. Training is performed end-to-end, meaning all three branches are trained at the same time; hence, for all training samples, the labels of all three branches are provided. The output branches are trained using the smooth L1 and cross-entropy losses: \(L_{\mathrm{box}}\) for bounding box regression and \(L_{\mathrm{score}}\) for the classification score, respectively. The total loss is calculated as:

$$\begin{aligned} L_{3\mathrm{B}} = \lambda _1\cdot L_{\mathrm{mask}} + \lambda _2\cdot L_{\mathrm{score}} + \lambda _3\cdot L_{\mathrm{box}}. \end{aligned}$$
(3)

In Eq. 3, \(L_{3\mathrm{B}}\) describes the three-branch network (for more details on the equation and the two-branch variant, we refer the reader to the original works [30, 59, 60]). The mask branch only computes the loss of positive samples. A sample corresponds to a single RoW; when the intersection over union (IoU) between an anchor box in a RoW and the ground truth is greater than 0.6, the RoW is treated as a positive sample. For the score and bounding box branches, the SiamFC [60] and SiamRPN [59] methods are used, respectively. Figure 5 presents the general flow of the overall algorithm: the two images z and x are given to the CNN model, which performs feature extraction and outputs three branch results, including the classification score, bounding box, and segmentation of the target object in image x. The process is repeated for all frames of the video sequence. The bounding boxes of all feature proposals are obtained from the box and mask branches, and the final results are produced by applying non-maximal suppression (NMS).
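To make the training objective concrete, the sketch below shows how the mask loss of Eq. 2 and the total loss of Eq. 3 fit together. The function names, tensor layout, and lambda weights are our assumptions for illustration, not the exact values of [30].

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits: torch.Tensor, gt_masks: torch.Tensor,
              row_labels: torch.Tensor) -> torch.Tensor:
    """Binary logistic mask loss of Eq. 2, computed over positive RoWs only.

    mask_logits: (N, w*h) per-pixel mask logits m_n, one row per RoW
    gt_masks:    (N, w*h) ground-truth pixel labels c_n in {-1, +1}
    row_labels:  (N,) RoW labels y_n in {-1, +1}; a RoW is positive when its
                 anchor box overlaps the ground truth with IoU > 0.6
    """
    pos = row_labels > 0
    if not pos.any():
        return mask_logits.sum() * 0.0  # zero loss, graph kept alive
    # log(1 + exp(-c * m)) written as softplus for numerical stability
    per_pixel = F.softplus(-gt_masks[pos] * mask_logits[pos])
    return per_pixel.mean()  # average over the w*h pixels of positive RoWs

def total_loss(l_mask, l_score, l_box, lambdas=(1.0, 1.0, 1.0)):
    """Three-branch loss of Eq. 3; the lambda weights here are placeholders."""
    l1, l2, l3 = lambdas
    return l1 * l_mask + l2 * l_score + l3 * l_box
```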

4 Experimental results

Different experiments and performance evaluations are presented in this section, along with a comprehensive description of the data set used. The experiments are implemented in the Python programming language. For the test videos, we utilized SiamMask [30] without any modification. For both variants, as in [30], a ResNet-50 architecture is utilized. During tracking, SiamMask evaluates each frame only once, without any adaptation. The output mask is obtained for both variants using the location of the highest value in the classification branch. Furthermore, the mask branch’s output is binarised, after a per-pixel sigmoid, at a threshold of 0.5. The performance assessment is made using different quantitative tests. For the initial experiments, we report the mean intersection over union (mIoU) and average precision (AP) at thresholds \(\{0.5, 0.7\}\). The tracking accuracy of the algorithm is compared with different state-of-the-art tracking algorithms.
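For concreteness, the binarisation step can be sketched as follows, assuming the mask branch emits raw logits:

```python
import numpy as np

def binarise_mask(mask_logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Apply a per-pixel sigmoid to the mask branch output, then threshold."""
    probs = 1.0 / (1.0 + np.exp(-mask_logits))
    return (probs > threshold).astype(np.uint8)
```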

Fig. 6

Visualization results of SiamMask for person tracking and segmentation. Results are shown for a few frames in which the movement, visual characteristics, and location of the person vary within the scene. The green rectangular bounding boxes show the correctly detected target object with the person class label, while red is used for the predicted segmentation mask

4.1 Data set

A real-world data set, recorded with an overhead camera at Southampton University, United Kingdom [48], is utilized to obtain person video sequences. A Point Grey Flea camera with a Fujinon wide-angle lens was used for recording. The video sequences are converted into frames with a resolution of \(1024\times 768\) pixels in PNG format. The sequences were recorded with a single camera installed at a height of about 4 m above the ground.

The positions and locations of the person are manually annotated to determine the ground truth information.

4.2 Tracking and segmentation results

To improve the results, the architecture is additionally trained with an overhead data set. Although the task is the same in nature, the original pre-trained architecture fails on such views, because it was trained to unambiguously separate objects from the foreground in frontal imagery. The tracking and segmentation results of SiamMask for the overhead person data set are visualized for different test video frames in Fig. 6. It can be observed that the extra training enhances the overall accuracy of the algorithm. Persons with different visual features, as shown in Fig. 6, are now correctly classified, segmented, and tracked in subsequent frames. The red color in the sample frames shows the segmented mask, while the green box represents the automatically rotated bounding box used for tracking the target person in the video frame.

The experimental results show that the SiamMask algorithm achieves good moving target tracking results: from the first frame to the last (01–2000) of the overhead view video, the target always remains in the tracking state. In Fig. 6, we show the output for a few frames. For example, in the first-row sample frames, a person is kept tracked across different locations, from the center of the scene to the upper left corner; the algorithm segments the target region while adjusting the bounding box. Similarly, in the subsequent frames of row two, the bounding box and segmentation mask are obtained accurately, although the shape of the person varies significantly in the scene as the person moves away from the camera position.

In the third-row sample frames, the person suddenly changes direction and body angle under the overhead perspective, yet the algorithm keeps the segmentation mask and bounding box consistent with the detected shape. In the fourth row, the person at the center of the images is accurately tracked without any failure. The same effect can be seen in the last row, in which the position and location of the person vary, but the deep learning-based algorithm tracks the person accurately without any failure.

Overall, across the sample frames in Fig. 6, the person’s visual features change between subsequent frames in terms of size, scale, pose, and orientation, and the tracking results remain good compared to traditional tracking algorithms.

4.3 Performance evaluation

The surveillance system’s performance is evaluated using different performance parameters, i.e., mAP and mIoU. The tracking algorithm outputs a set of bounding boxes and segmentation masks for the person in the video frames. To evaluate classification, the predicted bounding box is matched with the ground truth bounding box, and the IoU is calculated as:

$$\begin{aligned} \text {IoU} = \frac{b_{\mathrm{pred}} \cap b_{\mathrm{groundtruth}}}{b_{\mathrm{pred}} \cup b_{\mathrm{groundtruth}}}. \end{aligned}$$
(4)

In Eq. 4, the predicted bounding box is denoted by \(b_{\mathrm{pred}}\) and the ground truth bounding box by \(b_{\mathrm{groundtruth}}\). For person classification, the IoU threshold is set to IoU \(\ge\) 0.5 and 0.7. In addition, if multiple detections occur for the same ground truth, the first one counts as positive and the rest as negatives. The precision p, recall r, and accuracy acc values used in determining the mAP are formulated as follows:

$$\begin{aligned} p= \frac{{\mathrm{tp}}}{{\mathrm{tp}}+{\mathrm{fp}}} \end{aligned}$$
(5)
$$\begin{aligned} r= \frac{{\mathrm{tp}}}{{\mathrm{tp}}+{\mathrm{fn}}} \end{aligned}$$
(6)
$$\begin{aligned} \text {acc} = \frac{{\mathrm{tp}}+{\mathrm{tn}}}{{\mathrm{tp}}+{\mathrm{tn}}+{\mathrm{fp}}+{\mathrm{fn}}}. \end{aligned}$$
(7)

In the above equations, tp (true positives) denotes the bounding boxes correctly classified as a person, fp (false positives) the bounding boxes inaccurately classified as a person, tn (true negatives) the bounding boxes correctly recognized as background, and fn (false negatives) the cases in which a person is incorrectly recognized as background or another object.
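A small sketch of these metrics, assuming axis-aligned boxes in (x1, y1, x2, y2) form, is given below:

```python
def bbox_iou(pred, gt):
    """IoU of Eq. 4 for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision(tp, fp):
    """Eq. 5."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. 6."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Eq. 7."""
    return (tp + tn) / (tp + tn + fp + fn)

# A detection counts as tp when IoU >= 0.5 (or 0.7 in the stricter setting).
print(bbox_iou((10, 10, 60, 110), (15, 12, 65, 108)))  # ~0.79
```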

The mean average precision (mAP) for N classes is given as:

$$\begin{aligned} \text {mAP}= \frac{1}{N} \sum _{i=1}^{N} \text {AP}_i. \end{aligned}$$
(8)

In Eq. 8, the interpolated average precision (AP) is given as:

$$\begin{aligned} \text {AP} = \frac{1}{11} \sum _{r \in \{0, 0.1, \ldots , 1.0\}} \max _{r^{\prime } \ge r} p(r^{\prime }). \end{aligned}$$
(9)
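A sketch of the 11-point interpolated AP of Eq. 9 and the class average of Eq. 8, applied to a toy precision-recall curve:

```python
import numpy as np

def interpolated_ap(recall: np.ndarray, precision: np.ndarray) -> float:
    """11-point interpolated AP of Eq. 9: average, over r in {0, 0.1, ..., 1.0},
    of the maximum precision achieved at recall >= r."""
    total = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        at_least_r = recall >= r
        total += precision[at_least_r].max() if at_least_r.any() else 0.0
    return total / 11.0

def mean_ap(ap_per_class):
    """mAP of Eq. 8: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Toy precision-recall points, sorted by decreasing confidence.
r = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
p = np.array([1.0, 0.9, 0.8, 0.7, 0.6])
print(interpolated_ap(r, p))  # ~0.82 on this toy curve
```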

The mAP and mIoU values of the algorithm are provided in Table 1. We present results for bounding boxes generated from the binary mask; as in the original work [30], three different approaches are used, namely Min–max (the axis-aligned rectangle containing the object), MBR (the minimum bounding rectangle), and Opt (the rectangle obtained via optimization). It is worth noting that the tracking results are almost consistent with the original work [30]. The results show that the algorithm produces the best mIoU regardless of which bounding box generation approach is utilized.
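To make these representations concrete, the sketch below derives the Min–max and MBR boxes from a predicted binary mask with OpenCV; the Opt rectangle requires the dedicated optimization of [30] and is omitted here.

```python
import cv2
import numpy as np

def boxes_from_mask(mask: np.ndarray):
    """Derive two box representations from a binary mask: Min-max (the tightest
    axis-aligned rectangle) and MBR (the minimum, possibly rotated, rectangle)."""
    ys, xs = np.nonzero(mask)
    # Min-max: axis-aligned bounds of the mask pixels.
    min_max = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    # MBR: minimum-area rotated rectangle via OpenCV.
    points = np.column_stack((xs, ys)).astype(np.float32)
    rect = cv2.minAreaRect(points)     # ((cx, cy), (w, h), angle)
    mbr_corners = cv2.boxPoints(rect)  # the four corner points of the MBR
    return min_max, mbr_corners
```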

Table 1 Performance of bounding box representation approaches on overhead view person data set
Fig. 7

Threshold–precision curve of the tracking algorithms

The threshold–precision curve is plotted in Fig. 7, and the results are further compared with some other tracking algorithms. We ran the algorithms on the overhead person data set and, for a fair comparison, used the same experimental parameters. It can be seen that SiamMask performs adequately and gives good results, with a precision rate of 0.95, compared to the other tracking algorithms.

The tracking accuracy results compared with other algorithms are shown in Fig. 8. We evaluated KCF, Median Flow, TLD, and SiamMask on the overhead view person data set. It can be observed from the experimental results that the tracking accuracy rate of SiamMask is 0.95, while most of the other tracking algorithms do not perform as well, giving tracking accuracy rates of around 0.80.
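A hedged sketch of how such a baseline comparison can be run, assuming opencv-contrib-python with the legacy tracker API; the video file name and initial box are placeholders:

```python
import cv2

factories = {
    "KCF": cv2.legacy.TrackerKCF_create,
    "MedianFlow": cv2.legacy.TrackerMedianFlow_create,
    "TLD": cv2.legacy.TrackerTLD_create,
}

cap = cv2.VideoCapture("overhead_sequence.mp4")  # hypothetical file name
ok, first_frame = cap.read()
init_box = (450, 300, 80, 80)  # illustrative (x, y, w, h) around the person

# Initialize every tracker on the same first frame and target box.
trackers = {name: create() for name, create in factories.items()}
for tracker in trackers.values():
    tracker.init(first_frame, init_box)

results = {name: [] for name in trackers}
while True:
    ok, frame = cap.read()
    if not ok:
        break
    for name, tracker in trackers.items():
        success, box = tracker.update(frame)
        results[name].append(box if success else None)
cap.release()
# Each predicted box would then be scored against the annotated ground truth
# (e.g., with an IoU-based criterion) to produce the accuracy rates in Fig. 8.
```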

Fig. 8

Tracking accuracy of different algorithms and SiamMask

5 Conclusion and future work

In this work, a real-time person tracking and segmentation system has been introduced for the overhead view. The introduced system consists of an image processing unit that utilizes a deep learning algorithm, SiamMask, for real-time tracking and segmentation. We presented an intelligent real-time surveillance system by integrating it with a cloud and internet server. The deep learning-based algorithm performs segmentation of the target person by adding a mask branch to the fully convolutional twin neural network. We performed additional training to enhance the performance of the system. Finally, the tracking results of the SiamMask algorithm were compared with other tracking algorithms, namely KCF, TLD, and Median Flow. Experimentation reveals that SiamMask delivers better results than the other algorithms, with a tracking accuracy rate of 0.95. Moreover, the threshold–precision curve was plotted and compared with the other tracking algorithms. In future work, the algorithm’s accuracy might be enhanced with new data sets recorded in various situations against multiple backgrounds and illumination conditions. In addition, the work might be extended to the detection, tracking, and segmentation of multiple objects.