Real-time multiple object tracking using deep learning methods

Meimetis, Dimitrios; Daramouskas, Ioannis; Perikos, Isidoros; Hatzilygeroudis, Ioannis

doi:10.1007/s00521-021-06391-y

Real-time multiple object tracking using deep learning methods

S.I.: information, intelligence, systems and applications
Published: 24 August 2021

Volume 35, pages 89–118, (2023)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Neural Computing and Applications Aims and scope Submit manuscript

Real-time multiple object tracking using deep learning methods

Download PDF

Dimitrios Meimetis¹,
Ioannis Daramouskas¹,
Isidoros Perikos ORCID: orcid.org/0000-0002-6581-4676¹ &
…
Ioannis Hatzilygeroudis¹

3898 Accesses
37 Citations
Explore all metrics

Abstract

Multiple-object tracking is a fundamental computer vision task which is gaining increasing attention due to its academic and commercial potential. Multiple-object detection, recognition and tracking are quite desired in many domains and applications. However, accurate object tracking is very challenging, and things are even more challenging when multiple objects are involved. The main challenges that multiple-object tracking is facing include the similarity and the high density of detected objects, while also occlusions and viewpoint changes can occur as the objects move. In this article, we introduce a real-time multiple-object tracking framework that is based on a modified version of the Deep SORT algorithm. The modification concerns the process of the initialization of the objects, and its rationale is to consider an object as tracked if it is detected in a set of previous frames. The modified Deep SORT is coupled with YOLO detection methods, and a concrete and multi-dimensional analysis of the performance of the framework is performed in the context of real-time multiple tracking of vehicles and pedestrians in various traffic videos from datasets and various real-world footage. The results are quite interesting and highlight that our framework has very good performance and that the improvements on Deep SORT algorithm are functional. Lastly, we show improved detection and execution performance by custom training YOLO on the UA-DETRAC dataset and provide a new vehicle dataset consisting of 7 scenes, 11.025 frames and 25.193 bounding boxes.

Simultaneous Detection and Tracking with Motion Modelling for Multiple Object Tracking

Deep Learning-Based Multi-object Tracking

Real-Time Object Detection and Tracking Design Using Deep Learning with Spatial–Temporal Mechanism for Video Surveillance Applications

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Computer vision is a fundamental domain that aims to allow computer systems to analyze images, extract knowledge and interpret them as humans do. Multi-object tracking (MOT) is also called multi-target tracking (MTT) and has a very important part in computer vision [27]. In general, the task of MOT is largely partitioned to locating multiple objects, maintaining their identities and yielding their individual trajectories given an input video [1]. MOT aims to process videos with the purpose to identify and track objects that belong to one or more categories, like cars, pedestrians, objects, animals without any prior knowledge concerning the appearance, the movement and the number of targets [2, 26]. Many computer vision problems depend to a large extent on multiple-object tracking systems [18, 28]. There are two important steps involved in designing such systems. The first step involves the detection of the objects [23]. So, the desired objects are detected in each frame of the video. The quality of the detection directly affects the performance of the overall monitoring procedure. The second step involves the matching of the identified objects to the previous ones to get their trajectories. High accuracy in the object detection system results in a smaller number of missing detections and ultimately produces smooth and accurate trajectories. Multiple-object tracking is also considered the process of locating multiple moving objects over time [27, 31, 41]. Multiple-object tracking systems have a variety of uses in a wide range of topics like security, video communication and compression, traffic control, medical imaging, self-driving cars and robotics [24, 29, 30].

Although object tracking is quite useful and desired, accurate, real-time object tracking is quite challenging and things are even more challenging when multiple objects are involved [37, 38]. One major issue that MOT is facing is the bounding box level tracking performance and saturation; therefore, most research is focused on handling these aspects [3, 4]. For tracking to perform sufficiently, object detection is necessary to work flawlessly across all frames and not having to use interpolation [39]. However, this is almost impossible due to occlusion, the variety in viewpoints and the noise that may be introduced in a video. Also, real-time tracking requires great computational resources and also needs to face challenges like the identity switches and various detection failures [33, 34, 40].

In this paper, first we explore the performance of various deep learning methods on the task of multiple-object tracking. We examine how widespread deep learning architectures are performing under various contexts in a wide range of scene scenarios. Also, the paper introduces a modification of the Deep SORT [5, 25] algorithm, which greatly improves the performance of object tracking methods, using different object detection models, such as YOLOv3-608 [6], YOLOv3-Tiny [6] and YOLOv4 [7]. The Deep SORT implementation is an extension to the simple online and real-time tracking (SORT) [8] algorithm and the SORT framework is utilized too as a mean to measure bounding boxes overlaps. While this is a high performance method, the number of identity switches due to occlusions from poor camera angles are too many [9,10,11,12]. By incorporating convolutional neural networks, to additionally include appearance information, Deep SORT can substantially reduce identity switches. Our modification on Deep SORT is based on the process of the initialization of the tracked object IDs and the way they are assigned and passed through, to be shown during the visualization process. The results indicate that our modified Deep SORT now properly displays the track IDs and it is closer to the ground truth in all the examined cases, a problem that exists on all YOLO & Deep SORT implementations we have found on the Internet. In addition, we present a way to improve the real-time operation of the deep learning methods by identifying and facing a bottleneck in the MOT framework. We tested and provide a way that can greatly improve the execution time of the tracking procedure. The results show that we have an increase of frames per second (FPS) in all examined deep learning networks which is up to 22%.

The novelty and the contribution of this paper can be summarized as follows. First, we explore the performance of various deep learning methods on the task of real-time multiple-object tracking. Our focus is road traffic, but we also include pedestrian tests and we compare a grand total of 7 YOLO derivatives ranging from YOLOv3-Tiny all the way to a fully fledged YOLOv4 implementation. For the tracking mechanism, we use the DEEP Sort framework, which we modified to properly display the correct track ID in the real-time video feedback, a problem that occurred in all YOLO and Deep Sort implementations we could find on the Internet. For each implementation, we provide the optimal detection and tracking parameters which could be useful for fellow researchers and hobbyists. Moreover, we perform custom transfer learning training of the YOLO detector using a slightly modified version of the UA-DETRAC dataset. The UA-DETRAC trained YOLOv4 provided state-of-the-art performance when compared to the publicly available MS-COCO trained YOLOv4 in our test scenes. In addition to that, we also performed performance characterization on this framework and found a bottleneck in the execution pipeline, which we resolved, and we saw an execution performance increase of up to 22%. For the evaluation process, we provide a wide variety of metrics and we test nine different scenes, six of which are our own. For all test scenes, we also provide the ground truth files, which we generated either from the ground up or from existing data. Finally, through the creation of the ground truth files, we also provide a new vehicle multiple-object dataset consisting of 7 scenes, 11.025 frames and 25.193 bounding boxes.

The rest of the paper is structured as follows. Section 2 examines the literature and presents related works and methods on multiple-object tracking. Section 3 presents our implementation that is based on modified Deep SORT algorithm and the YOLO detection networks. The modification made on Deep SORT algorithm is presented and the way it affects the visualization of tracking of moving objects is illustrated. Section 3 describes the way that some bottlenecks in the multiple objects tracking procedure is faced and presents the way that improves the real-time performance in terms of frames per second. Section 4 deals with the experimental study, explains the datasets used for the training phases and for the testing phase and presents the results collected. Finally, Sect. 5 concludes the article and draws directions for future work.

2 Related work

Multiple-object tracking is attracting the increasing attention of researchers in computer vision and artificial intelligence. Several works in the literature study the performance of methods and systems and a detailed description of approaches and techniques can be found in [1, 2, 12].

In the work presented in [3], authors propose an online multi-target tracker that exploits both high and low-confidence target detections in a probability hypothesis density particle filter framework. Authors formulate an early association strategy between trajectories and detections after the prediction stage, which allows performing target estimation and state labeling without any additional mechanisms. The authors’ solution has a peak multiple-object tracking accuracy (MOTA) score of 53 on MOT15 and 52.5 on MOT16.

Authors in [8] present an approach to multi-object tracking where the main focus is to associate objects efficiently for online and real-time applications. To this end, detection quality is identified as a key factor influencing tracking performance, where changing the detector can improve tracking by up to 18.9%. Despite only using a rudimentary combination of familiar techniques such as the Kalman filter and Hungarian algorithm for the tracking components, the approach achieves accuracy that is comparable to state-of-the-art online trackers. Additionally, emphasis is placed on efficiency for facilitating real-time tracking and to promote greater uptake in applications such as pedestrian tracking for autonomous vehicles. While being an overall good framework at the time, the identity switches are rather high with a value of 1001 in the MOT benchmark. Their solution has a peak MOTA score of 33.4 on MOT15.

In the work presented in [4], authors present an online method that encodes long-term temporal dependencies across multiple cues. One key challenge of tracking methods is to accurately track occluded targets or those which share similar appearance properties with surrounding objects. To address this challenge, authors present a structure of recurrent neural networks (RNN) that jointly reasons on multiple cues over a temporal window. Their motion and interaction models leverage two separate long short-term memory (LSTM) networks that track the motion and interactions of targets for a longer period—suitable for the presence of long-term occlusions. Their solution has a peak MOTA score of 37.6 on MOT15.

In the work presented in [2], authors present a comprehensive survey on works that employ deep learning models to solve the task of MOT on single-camera videos. Four main steps in MOT algorithms are identified, and an in-depth review of how deep learning was employed in each one of these stages is presented. A complete experimental comparison of the presented works on the three MOTChallenge datasets is also provided, identifying a number of similarities among the top-performing methods and presenting some possible future research directions.

In the work presented in [13], authors build on a neural class-agnostic single-object tracker named HART and introduce a multi-object tracking method MOHART capable of relational reasoning. Authors explore a number of relational reasoning architectures and show that multi-headed, self-attention outperforms the provided baselines and better accounts for complex physical interactions in a toy experiment. Authors find that it leads to consistent performance gains in tracking as well as future trajectory prediction on three real-world datasets (MOTChallenge, UA-DETRAC and Stanford Drone dataset), particularly in the presence of ego-motion, occlusions, crowded scenes and faulty sensor inputs. On the MOTChallenge dataset, HART achieves 66.6% IOU, which itself is impressive given the small amount of training data of only 5225 training frames and no pre-training.

In the work presented in [14], authors present an end-to-end model, named FAMNet, where feature extraction, affinity estimation and multi-dimensional assignment are refined in a single network. All layers in FAMNet are designed differentiable and thus can be optimized jointly to learn the discriminative features and higher-order affinity model. Authors also integrate single-object tracking technique and a dedicated target management scheme into the FAMNet-based tracking system to further recover false negatives and inhibit noisy target candidates generated by the external detector. The proposed method is evaluated on a diverse set of benchmarks including MOT2015, MOT2017, KITTI-Car and UA-DETRAC and achieves promising performance on all of them in comparison with state-of-the-art. The authors’ method has a MOTA score of 40.6 on MOT15.

In the work presented in [15], authors introduce a focal loss-based RetinaNet, which works as one-stage object detector, is utilized to be able to well match the speed of regular one-stage detectors and also defeats two-stage detectors in accuracy, for vehicle detection. State-of-the-art performance result has been shown on the DETRAC vehicle dataset. This is important because one-stage object detectors and two-stage object detector are regarded as the most important two groups of convolutional neural network-based object detection methods. One-stage object detector could usually outperform two-stage object detector in speed; however, it normally trails in detection accuracy, compared with two-stage object detectors.

In the work presented in [16], the authors introduce deep motion modeling network (DMM-Net) that can estimate multiple objects’ motion parameters to perform joint detection and association in an end-to-end manner. DMM-Net models object features over multiple frames and simultaneously infer object classes, visibility and their motion parameters. These outputs are readily used to update the tracklets for efficient MOT. DMM-Net achieves PR-MOTA score of 12.80 @ 120+ fps for the popular UA-DETRAC challenge. Authors also introduce a synthetic large-scale public dataset Omni-MOT for vehicle tracking that provides precise ground-truth annotations to eliminate the detector influence in MOT evaluation.

In the work presented in [17], authors present a CNN-based framework for online MOT. This framework utilizes the merits of single-object trackers in adapting appearance models and searching for target in the next frame. Simply applying a single-object tracker for MOT will encounter the problem in computational efficiency and drifted results caused by occlusion. Their framework achieves computational efficiency by sharing features and using ROI-Pooling to obtain individual features for each target. In the framework, they introduce spatial–temporal attention mechanism (STAM) to handle the drift caused by occlusion and interaction among targets. Besides, the occlusion status can be estimated from the visibility map, which controls the online updating process via weighted loss on training samples with different occlusion statuses in different frames. It can be considered as temporal attention mechanism. The proposed algorithm achieves 34.3% and 46.0% in MOTA on challenging MOT15 and MOT16 benchmark datasets, respectively.

3 Methodology

In this section, we present our framework for object tracking that relies on the modification of the Deep SORT algorithm. We describe the main methods for the object detection and tracking. architecture of this implementation. More specifically, for the object detection procedure, YOLO models are utilized to detect desired objects in a frame, and after that, a modified version of Deep SORT algorithm is introduced to perform object tracking in the sequences of the frames. Our modification on the Deep SORT algorithm concerns the process of the initialization of the object IDs, and its rationale is to consider an object as “tracked” if it is detected in a set of previous frames. We assess the performance of the YOLO object detection models trained on the MS COCO and UA-DETRAC datasets using transfer learning in order to assess their performance and identify suitable and optimal synergies for multi-object tracking. The modified Deep SORT algorithm is tested using the YOLO models in tracking cars and pedestrians, a variety of datasets and scenes, and its performance is compared to the original Deep SORT algorithm. The results indicate that our modified Deep SORT algorithm now properly displays the assigned track IDs, while also providing good tracking performance. In the following subsections, we present the modification made on the Deep SORT algorithm, the implementation of the YOLO models as well as the optimization of our framework.

3.1 Modified deep SORT tracking algorithm

One of the most widely used object tracking frameworks is Deep SORT, which is an extension to SORT (simple real-time tracker) [5]. Deep SORT achieves better tracking and less identity switches by including an appearance feature vector for the tracks which is derived, in this case, by a pre-trained CNN that runs on the YOLO detected bounding boxes. Since simple detection models are very likely to fail at detecting numerous objects consecutively as the frames go by, we need to add new methods to keep track of them and properly identify them. This is where Deep SORT comes in to make a proper MOT framework.

The Kalman filter is a crucial component in Deep SORT. Each state contains 8 variables (u, v, a, h, u′, v′, a′ and h′) where (u, v) are the coordinates of the bound box, a is the aspect ratio, and h is the height of it. The respective velocities are given by u′, v′, a′, h′. The state contains only absolute position and velocity factors, since we assume a simple linear velocity model. The Kalman filter helps us face the problems that may arise from non-perfect detection and uses prior states to predict a good fit for future bounding boxes. Now that we have the new bounding boxes tracked from the Kalman filter, the next problem lies in associating new detections with the predictions that have been created. Since they are processed independently, a method is needed to associate track_i with incoming detection_k. To solve this, Deep SORT implements 2 things: a distance metric to quantify the association and an efficient algorithm to associate the data. The authors decided to use the squared Mahalanobis distance (effective metric when dealing with distributions) to incorporate the uncertainties from the Kalman filter. Thresholding this distance can give us a very good idea on the actual associations. This metric is more accurate than, say, Euclidean distance, as we are effectively measuring the distance between 2 distributions. For the data association part, the Hungarian algorithm is used. Lastly, the feature vector becomes our “appearance descriptor” of the object. The authors have added this vector as part of the distance metric. Now, the updated distance metric is:

$$D = {\text{Lambda}}*D_{k} + \left( {1 - {\text{Lambda}}} \right)*D_{a}$$

where $D_{k}$ is the Mahalanobis distance and $D_{a}$ is the cosine distance between the appearance feature vectors and Lambda is the weighting factor. The importance of $D_{a}$ is so high that the authors make a claim saying, they were able to achieve state of the art even with ${\text{Lambda}} = 0$, meaning that they only used the appearance descriptor for the calculation. We provide the pseudocode for the Deep SORT-enabled framework below.

The variables that cause the biggest change in performance are score and IOU of each respective model, and n_init, max_cosine_distance and max_iou_distance from the Deep SORT framework. We will present the optimal values for each implementation in the videos used. We keep max_cosine_distance and max_iou distance at the same values, of 0.4 and 0.7, respectively, for all our tests. We also set our n_init at 7, unless stated otherwise. The variable n_init dictates how many successful detections a track must have before it goes from its initial tentative state to its confirmed state.

A main aspect we noticed on the functionality and the utilization of Deep SORT concerns the fact that shows the initiated track IDs on the bounding boxes and not the confirmed tracks. This operation may cause problems in tracking and numbering correctly the detected objects in the sequences of frames. To address this problem, a main modification that we implemented to the Deep SORT algorithm relates to the proper display and count of the confirmed detections. Specifically, each track has three states: the initial tentative state, the confirmed state and the deleted state. Every new track is classified as tentative for the first n_init frames. If the n_init frames pass and the track is still identified, it will become confirmed and feature similarity will also be employed. If a track fails to be identified properly using IOU similarity for every frame in the n_init phase, then it will be classified as deleted. We made sure that every bounding box shown on screen has the proper confirmed state ID on it.

Below we provide some example cases of the comparison results between the Deep SORT and the modified version. In Figs. 1 and 2, we present a case from the MOT15 dataset, where the ground truth for that part of the scene is 34 people. Our framework of the modified deep SORT measured 32 people, while the original Deep SORT resulted in measuring 79 people as it is illustrated in the bounding box IDs in Fig. 1. The difference is massive, and the main reason for this is the large number of identity switches either from occlusion or poor viewing angle as the detector struggles to maintain accurate detections across all frames. This can result in tracking numbers that, when the original Deep SORT algorithm is used, are quite higher than the ground truth. The modified version correctly displays the confirmed tracks on the video output.

An additional example case is illustrated in Figs. 3 and 4. The example case is in the frame of the “Roadtrack” test scene, where cars are detected and tracked. The grand truth of the example case is 19. In Fig. 3, the performance of the original Deep SORT is off by 6 tracking and giving IDs to 25 different cars. Also, it is worth noting that even in the same frame the non-modified code had already failed to properly ID this stack of cars, as illustrated by the fact that there is no car numbered with ID 24. The modified Deep SORT has a quite better performance which matches the ground truth.

Finally, another example case is presented in Figs. 5 and 6 in our own, real-world test scene named “Rural road dusk.” The ground truth at that part of the scenes is 18. In Fig. 5, the results of the original Deep SORT are off by 20. Although there were 18 different cars in the scenes, the original Deep SORT and YOLO framework resulted in tracking and giving IDs up to 38. In Fig. 6, the results of the modified Deep SORT are presented and we can see that the resulting IDs are very close to the ground and are off by just 1. The YOLO detector works the same way in both cases and the modified Deep SORT performs quite better compared to the original version reporting almost a perfect performance.

As illustrated in the above three example cases, the modified version of the Deep SORT has a quite good performance and resulted in better tracking and consistent annotations of IDs. This is crucial when we create real-time online MOT systems since we can even feed in real-time a live video from a camera and have it display proper IDs and tracking results.

3.2 Detection models

The Deep SORT tracking algorithm needs to be integrated with a multiple-object detection model that will perform the detection of the desired objects in a frame and after that, the Deep SORT will perform the tracking procedure. In the context of our study, we examine the performance of the Deep SORT in the pipeline with YOLO (You Only Look Once) [19]. YOLO has been proven to offer high performance and detection accuracy [35] and the Yolo models used and examined here are (i) the YOLOv3-Tiny, (ii) the YOLOv3-416 and 608 and (iii) the YOLOv4-608.

These models are trained on the MS-COCO and DETRAC datasets, and we use the weights that have been generated by the training on these datasets. The first implementation works with the YOLOv3-Tiny model and weights. In the second implementation, the YOLOv3-416 and 608 model and the corresponding weights are used and lastly, we use YOLOv4-608 with 608-by-608 tensor input to test our framework with state-of-the-art models. These YOLO detection models have been formulated into Keras along with their weights that were generated from their perspective Darknet projects. Darknet [32] is an open-source neural network framework written in C and CUDA, and it supports CPU and GPU computation.

The developers of YOLO reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection. First, YOLO is extremely fast. The frame detection process is looked at as a regression problem which enables YOLO to have a simplified pipeline. They simply run this neural network on a new image at test time to predict detections. This model achieves high throughput, which makes it suitable for process streaming video in real time. Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method, mistakes background patches in an image for objects, because it cannot see the larger context. Third, YOLO learns generalizable representations of objects. This network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means that YOLO reasons globally about the full image and all the objects in the image.

3.2.1 YOLOv3-Tiny integration

For the tiny YOLOv3 model, we are using these anchors: [10, 13], [23, 27], [37, 58], [81, 82], [135, 169], [344, 319], which correspond to the size of the bounding boxes and are fundamental for the correct training and detection of our CNN. Moreover, we set anchor mask values of [[3, 4, 5], [0, 1, 2]]. We configure them based on the design and the dimension of the objects we want to detect. To begin with, for this framework we set a score = 0.3 and IOU = 0.2 for our tests in section 7.1 and score = 0.6 and IOU = 0.3 for 7.2 and 7.3. Score is the confidence percentage for the detection coming out of our CNN. Do keep in mind that YOLOv3-Tiny has only 21 layers and it needs only 5.5 billion flops per frame. For the study, we use a tensor input of (416, 416). YOLOv3-Tiny has a mean average precision (mAP) of 23.7% on the MS COCO dataset.

3.2.2 YOLOv3 integration

For YOLOv3-416 and 608, we are using these anchors: [10, 13], [16, 30], [33, 23], [30, 61], [62, 45], [59, 119], [116, 90], [156, 198] and [373, 326], which correspond to the size of the bounding boxes and are fundamental for the correct training of our CNN. Moreover, we set anchor mask values of [[6, 7, 8], [3, 4, 5], [0, 1, 2]]. They were configured and fine-tune based on the design and the dimension of the objects we want to detect. For this framework, we set a score = 0.4 and IOU = 0.2 for our tests in Sect. 6.1 and score = 0.6 and IOU = 0.3 for 6.2 and 6.3. YOLOv3-608 has only 106 layers and it needs 140 billion flops per frame when having a tensor input of (608, 608) and 65.86 billion flops per frame when at (416, 416). The complexity is much higher, and the increased computation requirements cause a considerable drop in average frame. That being said it achieves significantly better results in the MS COCO dataset having a mAP of 55.3 for tensor input of 416 and 57.9 for tensor input of 608. In our study, we use it with tensor input of 416 and 608.

3.2.3 YOLOv4-608 integration

For YOLOv4-608, we are using the same anchors and mask that we used on YOLOv3-608. For this framework, we set a score = 0.6 and IOU = 0.3 in all of our tests, where the score is the confidence percentage for the detection coming out of our CNN. Notice that we gradually increase our detection thresholds as we continue testing more complex and higher performing models. With YOLOv4, we now set our tensor input at 608 instead of 416, which we previously did for YOLOv3, and that is because we want to see how the framework will behave when aiming at the best possible detection and feature extraction. YOLOv4 achieves an mAP of 65.7% with an input tensor of (608, 608), which is significantly higher than the previous models.

3.3 Framework optimization

An important part of the proper real-time operation of our framework concerns a set of optimization procedures that were performed. We monitor the functionality of the framework in our systems with help from Intel’s VTune software stack. We launch our application, and then we hook VTune to the corresponding process ID of our framework. We detected a bottleneck in the CPU section of our systems indicating that the CPU cannot feed fast enough our GPU, while also being able to perform the necessary calculations for the tracking algorithm provided by Deep SORT. In Fig. 7, we can see that the single-threaded nature of the software on crucial functions causes issues and, if we were to multithread and batch our functions for video pre-processing, it would not meet the criteria for a real-time tracking algorithm, since it does not process every frame as it is created. Looking closer at the graph shown in Fig. 7, we see our primary thread failing to hold steady at 100% CPU time, which is caused by our GPU having to run our CNN on a per-frame basis which causes the CPU to become idle.

The infrastructure used for our experiments is equipped with an NVIDIA GTX 1070 8 GB VRAM paired with 16 gigabytes of RAM and an Intel i7 6700 K. We used the NVIDIA CUDA toolkit version 10.0 and Tensorflow-gpu version 1.14. Keras version 2.2.4 and Python Anaconda distribution 2019.03 were the frameworks for the implementations. For the experiments, the processor was set at 4.5 GHz and the graphics card was also locked at 2.1 GHz, while the training and validation data were kept on an NVME drive to alleviate any potential storage bottlenecks.

Each row represents a thread spawned by python for this framework. The CPU first has to preprocess a frame from the video input and then send it to the GPU for the object detection part of the framework. When the GPU is doing calculations, the CPU is idling while waiting for the GPU to send back the result. When the results get sent back to the CPU, it is time for the Deep SORT algorithm to take over and match each bounding box with the correct IDs. This is also executed on the CPU. After this process is completed, the CPU writes back to the frame the output from the model along with the correct IDs. This process is getting repeated until there are no more frames to process.

The rest of the threads remain mostly idle and that is expected behavior, since we have not multithreaded the video pre-processing task or the CPU side tasks of our Deep SORT framework.

Knowing that, by default, python and tensor flow installations are not compiled to make use of more advanced SSE4.1/SSE4.2 and AVX instructions, we try to find ways to improve the performance by using optimized libraries for our system. Initially, Intel’s python packages for Numpy, Scipy and others were installed. These packages have improvements mainly from the use of SSE4.2, AVX, AVX2 instructions. However, the performance did not improve so much, because these libraries were not hotspots in our code. After this, we started timing every part of our code and found that the video processing tasks, which were powered by the Pillow library, were taking a big part of our execution time. With the use of VTune to perform HPC profiling of our framework, as shown in Fig. 8, we can see that our primary thread can grow in terms of vectorization. Since we are not bandwidth bound with just 1.2% of time spent waiting for data from DRAM, we know that our memory subsystem is ready to handle an increase in the data flow. We installed and used the Pillow 6.0.0 SIMD AVX2 package, and we got an improvement that ranged from 10 to 22%. The percentage of the improvement depends on the number of cars detected per frame, the detector used and video input resolution. This improved performance is presented in detail in the results of the experimental study.

In Table 1, we present some results from the tests, where we compare the generic SSE Pillow 6.0 version versus the AVX2 enabled, Pillow version 6.0.

Table 1 MOT performance comparison

Real-time multiple object tracking using deep learning methods

Abstract

Similar content being viewed by others

Simultaneous Detection and Tracking with Motion Modelling for Multiple Object Tracking

Deep Learning-Based Multi-object Tracking

Real-Time Object Detection and Tracking Design Using Deep Learning with Spatial–Temporal Mechanism for Video Surveillance Applications

Explore related subjects

1 Introduction

2 Related work

3 Methodology

3.1 Modified deep SORT tracking algorithm

3.2 Detection models

3.2.1 YOLOv3-Tiny integration

3.2.2 YOLOv3 integration

3.2.3 YOLOv4-608 integration

3.3 Framework optimization

4 Experimental study

4.1 Datasets used for the training procedure

4.1.1 MS-COCO

4.1.2 UA-DETRAC dataset

4.2 Datasets used for testing

4.3 Results

4.3.1 Results with optimized detection models trained on MS-COCO

4.3.1.1 YOLOv3-Tiny

4.3.1.2 YOLOv3-416

4.3.1.3 YOLOv4

4.3.2 Results with optimized detection models trained on UA-DETRAC

4.3.2.1 Results on YOLOv4

4.3.2.2 Results on YOLOv3

4.3.2.3 Results on YOLOv3-Tiny

4.3.3 Exploring the modified deep SORT on pedestrian videos

4.3.3.1 Results on the MOT16 scene

4.3.3.2 Results on the MOT20 scene

5 Conclusions

6 Supplementary materials

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation