1 Introduction

Object tracking is one of the most important tasks in computer vision. In video surveillance, it assists in understanding the movement patterns of people and in uncovering suspicious events. It is a key technology in traffic management for estimating flux and congestion statistics. Advanced vehicle control systems depend on tracking information to keep the vehicle in its lane and to prevent collisions. In physical therapy, analyzing the mobility of patients improves the accuracy of their diagnosis. Learning the shopping behavior of customers through tracking informs the design of instrumented retail spaces. In robotics, tracking bridges the gap between raw visual information and environmental awareness. In video summarization, it is applied to generate object-based representations and automatic content annotations. Tracking is also a fundamental technology for extracting regions of interest and video object layers as defined in the JPEG-2000 and MPEG-4 standards.

Even though it is essential to many applications, robust object tracking under uncontrolled conditions still poses a challenge. Real-life systems must track objects not only when the background scene is static but also when lighting changes suddenly, camera–object motion becomes large, color contrast becomes low, or image noise rises to unacceptable levels. In addition, the computational complexity must be kept to a minimum for real-time performance.

In the following sections, we describe object detection and tracking methods that are designed to resolve the above issues.

2 Object detection

Background subtraction is a common approach for discriminating moving regions in fixed camera setups. Basically, a pixel-wise reference model for the stationary part of the scene is estimated. Then, the observed image is compared with this reference to obtain a foreground mask.

A reference frame can easily be computed by averaging (or α-blending) the previous frames. However, averaging induces a ghosting effect, i.e. distortion of the colors behind moving objects. Illumination changes, moving shadows, and other motion, e.g. swaying trees, cause similar artifacts. Alternatively, predictive techniques, such as Kalman [1] and Wiener [2] filters, can be applied to learn the temporal color variation at each pixel. These techniques assume that the color variation can be modeled as a random process, and they depend heavily on predefined state transition parameters. As a result, they may fail when the color distribution does not fit a single model or has a different shape.
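
A minimal sketch of such an α-blended reference model, with a simple threshold test for the foreground mask, is given below; parameter values and function names are illustrative:

```python
import numpy as np

def update_background(background, frame, alpha=0.02, threshold=30):
    """One step of alpha-blended background subtraction.

    background, frame: float32 arrays of identical shape (H, W, 3).
    alpha: blending weight; small values adapt slowly but resist ghosting.
    Returns the updated background and a boolean foreground mask.
    """
    # Pixels that differ strongly from the reference are foreground.
    diff = np.abs(frame - background).max(axis=2)
    mask = diff > threshold
    # Blend the new frame into the reference. Ghosting appears when
    # moving objects are blended in; masking them out reduces it.
    background = np.where(mask[..., None], background,
                          (1 - alpha) * background + alpha * frame)
    return background, mask
```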

To handle multimodal backgrounds, mixture-of-models backgrounds have been proposed [3]. Often, the model components are Gaussian functions. Since the background models are updated at every frame, iterative update mechanisms, e.g. the online expectation-maximization (EM) algorithm, are applied to fit the models. However, online EM blends weak modes into stronger ones and distorts the model means, as shown in Fig. 1. To achieve accurate adaptation of the models, a Bayesian update mechanism [4], which can also estimate the number of required models, has been proposed. This approach is flexible enough to handle illumination variations and other arbitrary changes in the scene. There are also variants of the mixture-of-models background that use image gradient and optical flow information [5]. Mixture-of-models approaches can converge to any arbitrary distribution provided a sufficient number of observations. However, their computational cost grows rapidly as the number of models in the mixture increases. Another background modeling approach is non-parametric kernel density estimation [6]. This method stores the color values of multiple frames and estimates the contributions of a set of kernel functions using all of the data, instead of iteratively updating background models at each frame. Both its memory and computational cost are proportional to the number of stored frames. As a result, kernel-based methods may become infeasible for real-time applications.
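
OpenCV ships a mixture-of-Gaussians subtractor in the spirit of [3]; a minimal usage sketch follows (the file name and parameter values are illustrative):

```python
import cv2

cap = cv2.VideoCapture("traffic.avi")  # illustrative input video
# history and varThreshold control adaptation speed and sensitivity.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Each pixel is matched against its Gaussian modes; unmatched
    # pixels are labeled foreground (255), detected shadows gray (127).
    mask = subtractor.apply(frame)
    cv2.imshow("foreground", mask)
    if cv2.waitKey(30) & 0xFF == 27:  # Esc to quit
        break
cap.release()
```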

Fig. 1 EM update [3] versus Bayesian update [4]. Left: sample frames. Middle: first (top) and second (bottom) background layers of the EM update. Right: Bayesian update result. The EM update inaccurately blends distinct modes into identical layers, whereas the Bayesian update can identify separate layers, i.e. road and shadow modes (red means no layer)

A shortcoming of the above methods is that they neglect the temporal correlation of color values. This impedes distinguishing periodic background motion, such as swaying plants driven by wind, waves on a beach, or rotating objects, from foreground object motion. Since real-world physics often induces near-periodic phenomena in the environment, a frequency-decomposition-based representation of the background, wave-back, has been proposed [7]. This algorithm detects motion based on the form of the temporal color variation by comparing frequency transform responses; thus, the background models are fit in the frequency domain. Similar to kernel-based methods, a multitude of frames is stored. At every frame, a frequency transform is applied and the obtained coefficients are compared with the background models to generate a distance map, as shown in Fig. 2. Detection of time-varying phenomena has also been attempted using corner-based background models [8]. Instead of processing every pixel, this algorithm detects a certain number of feature points using a Harris corner detector and a scale-invariant feature point descriptor. It dynamically learns a single background model and classifies each extracted feature as either a background or a foreground feature. It also uses a Lucas–Kanade motion tracker to differentiate motion-consistent foreground points from background points with random or repetitive motion. Since feature point extraction and descriptor generation are computationally expensive, this algorithm may not be suitable for real-time applications if the number of feature points increases. Sample detection results are shown in Fig. 3.
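
The corner-based idea can be approximated with standard OpenCV building blocks; the sketch below extracts Harris corners and uses pyramidal Lucas–Kanade flow to separate moving points. It is a crude stand-in for the full model of [8]: thresholds are illustrative and the handling of repetitive motion is omitted.

```python
import cv2
import numpy as np

def classify_features(prev_gray, gray, max_corners=200, motion_thresh=1.0):
    """Label corners as foreground if they moved consistently.

    Points whose optical-flow displacement exceeds motion_thresh are
    treated as candidate foreground; the rest are assumed background.
    """
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=7,
                                  useHarrisDetector=True)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    # Pyramidal Lucas-Kanade tracker, as referenced in the text.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    pts, nxt = pts[status == 1], nxt[status == 1]
    disp = np.linalg.norm(nxt - pts, axis=1)
    return nxt[disp > motion_thresh], nxt[disp <= motion_thresh]
```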

Fig. 2 Object detection in thermal IR imagery depicting a sea shore. The wave-back method [7] distinguishes the periodic motion of the sea waves from the motion of a small boat (courtesy of PETS)

Fig. 3 Foreground points detected by the corner-based approach [8]

Instead of adapting models to the background, the scene can be represented using intrinsic images as the product of static and dynamic parts [9]. This algorithm keeps a set of previous images and applies gradient filters. By taking advantage of the sparseness of the filter responses, it estimates the static background intrinsic image using a median filter in the log domain. From the background and the current image, it constructs a foreground mask, as shown in Fig. 4. This method is robust to sudden and severe illumination changes and is computationally inexpensive. Comparison results against the mixture models are provided in Fig. 5.
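
A rough numpy sketch of the static/dynamic decomposition idea follows. For brevity it takes a temporal median of log intensities directly, whereas [9] operates on gradient-filtered images, so treat this as an approximation; buffer size and threshold are illustrative.

```python
import numpy as np

def intrinsic_foreground(frames, current, threshold=0.4):
    """Estimate a static intrinsic image and derive a foreground mask.

    frames: sequence of recent grayscale frames, values in [0, 255].
    The static part is approximated by a temporal median in the log
    domain, exploiting the sparseness of temporal changes; the dynamic
    part is the log-domain ratio of the current frame to it.
    """
    eps = 1.0
    logs = np.log(np.asarray(frames, dtype=np.float64) + eps)
    static_log = np.median(logs, axis=0)               # static intrinsic image
    dynamic_log = np.log(current + eps) - static_log   # multiplicative residual
    return np.abs(dynamic_log) > threshold
```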

Fig. 4 Foreground regions detected using the intrinsic background [9]

Fig. 5 Comparison of GMM [3] and intrinsic [9] background modeling. Top: a sudden illumination change occurs. Middle: the GMM method is confused and takes time to recover. Bottom: the intrinsic background is not disturbed. Both methods use the RGB color space

We summarize the performance of the above algorithms under real-life conditions, as well as their computational complexity, in Table 1. We denote the image size (total number of pixels) as M, the number of background models as n, the number of previous frames as T, and the local descriptor window size as D (other constants are disregarded). We also list the speed of each algorithm running on a P4 3 GHz machine for a 320 × 240 color video. Note that the computational load and memory requirements are heavily coupled, and they depend on whether a hardware implementation (or parallel processing) is adopted. The following techniques are commonly employed to reduce the computational complexity and scale down the memory footprint of practical systems:

  • Partially updating the background or subsampling,

  • Limiting temporal size of detectable variations,

  • Assuming color channels to be independent,

  • Using lower dimensional color representations.

Table 1 Performance of background-based object detection

3 Inter-frame tracking

Tracking, that is, finding the region corresponding to a given object in the image, faces similar challenges. Objects frequently change their appearance and pose. They occlude each other, become temporarily hidden, merge together, and split. Depending on the application, they exhibit erratic motion patterns and often make sudden turns.

Tracking can be considered as the estimation of the object state given all the measurements up to the current moment or, equivalently, as the construction of the probability density function of the object location. A simple tracking approach is predictive filtering, which uses object color and location statistics while updating an object model with constant weights [10]. When the measurement noise is assumed to be Gaussian, the optimal solution is provided by the Kalman filter [11]. When the state space is discrete and consists of a finite number of states, Markovian filters can be applied for tracking. The most general class of filters is represented by particle filters, which are based on Monte Carlo integration methods. The current density of the state (which can be location, size, speed, boundary, etc. [12]) is represented by a set of random samples with associated weights, and the new density is computed from these samples and weights. Particle filtering is a popular tracking method [13, 14]. However, it relies on random sampling, which suffers from sample degeneracy, especially in higher dimensional representations.
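
For concreteness, a minimal constant-velocity Kalman tracker for a 2-D object centroid is sketched below; all noise parameters and the class name are illustrative:

```python
import numpy as np

class KalmanTracker2D:
    """Constant-velocity Kalman filter over the state [x, y, vx, vy]."""

    def __init__(self, x0, y0, q=1e-2, r=1.0):
        self.x = np.array([x0, y0, 0.0, 0.0])       # state estimate
        self.P = np.eye(4) * 10.0                   # state covariance
        self.F = np.eye(4)                          # dynamics: x += vx, y += vy
        self.F[0, 2] = self.F[1, 3] = 1.0
        self.H = np.eye(2, 4)                       # we observe (x, y) only
        self.Q = np.eye(4) * q                      # process noise
        self.R = np.eye(2) * r                      # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, zx, zy):
        z = np.array([zx, zy])
        y = z - self.H @ self.x                     # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)    # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```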

In contrast, the mean-shift tracker is a non-parametric density gradient estimator that is iteratively executed within local search kernels [15]. It models the object probability density in terms of a color histogram and moves the object region towards the direction of the largest gradient; it is thus computationally simple. Nevertheless, if the object displacement between successive frames is larger than the kernel size, it fails to find the object. Since histograms are used to determine the likelihood, the gradient estimation and the convergence become inaccurate when the object and background color distributions are similar. To solve this issue, a multi-kernel mean-shift approach has been proposed [16]. The additional kernels are obtained by background subtraction, and to resolve the above convergence issue, another kernel that pushes the object away from background regions is adopted.
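
OpenCV provides the single-kernel baseline of [15]; a usage sketch follows (the video path, initial window, and bin count are illustrative). The multi-kernel variant of [16] additionally derives kernels from background subtraction.

```python
import cv2

cap = cv2.VideoCapture("pedestrians.avi")   # illustrative input
ok, frame = cap.read()
x, y, w, h = 300, 200, 40, 80               # illustrative initial object window
roi = frame[y:y+h, x:x+w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
# Hue histogram as the object model, as in kernel-based tracking.
hist = cv2.calcHist([hsv_roi], [0], None, [16], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
window = (x, y, w, h)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Back-projection: per-pixel likelihood under the object histogram.
    prob = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    # Mean-shift climbs the likelihood gradient within the kernel.
    _, window = cv2.meanShift(prob, window, term)
    x, y, w, h = window
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(30) & 0xFF == 27:
        break
```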

Tracking can also be considered as a classification problem, and a classifier can be trained to distinguish the object from the background [17]. This is done by constructing a feature vector for every pixel in the reference image and training a classifier to separate pixels that belong to the object from pixels that belong to the background. Integrating classifiers over time improves the stability of the tracker under illumination changes. As with the mean-shift tracker, an object can be tracked only if its motion is small, and this method can confuse objects in case of occlusion.
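
The classification view can be made concrete with a generic per-pixel classifier. The sketch below uses logistic regression over simple color and gradient features; it is an illustrative stand-in, not the ensemble construction of [17].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pixel_features(frame):
    """Per-pixel feature vectors: RGB plus horizontal/vertical gradients."""
    gray = frame.mean(axis=2)
    gy, gx = np.gradient(gray)
    return np.dstack([frame, gx[..., None], gy[..., None]]).reshape(-1, 5)

def train_tracker(frame, object_mask):
    """Train a classifier separating object pixels from background pixels."""
    X = pixel_features(frame)
    y = object_mask.reshape(-1).astype(int)
    return LogisticRegression(max_iter=200).fit(X, y)

def score_frame(clf, frame):
    """Per-pixel object likelihood; its peak localizes the object."""
    X = pixel_features(frame)
    return clf.predict_proba(X)[:, 1].reshape(frame.shape[:2])
```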

Object representation, that is, how to convert color, motion, shape, and other properties into a compact and identifiable form, plays a critical role in tracking. Conventional trackers depend either on color histograms alone, which disregard the structural arrangement of pixels, or on appearance models, which ignore statistical properties. Both representations have shortcomings: populating a higher dimensional histogram with a small number of pixels results in an incomplete representation, histograms are easily distorted by noise, and appearance models are sensitive to scale changes and localization errors.

The covariance matrix representation [18] embodies both the spatial and the statistical properties of objects and provides an elegant solution for fusing multiple features. Covariance is an essential measure of how strongly the deviations of two or more variables or processes vary together. In tracking, these variables correspond to point features such as coordinates, color, gradient, orientation, and filter responses. This representation has a much lower dimensionality than histograms, and it is robust against noise and lighting changes, as shown in Figs. 6 and 7. To track objects using the covariance descriptor, an eigenvector-based distance metric is adopted to compare the matrices of the object and candidate regions [19]. The covariance tracker makes no assumption on the motion, which means that it can keep track of objects even if their motion is erratic and fast, and it can compare regions of any size without being restricted to a constant window. Despite these advantages, computing the covariance matrix distance for all candidate regions is slow. An integral image-based algorithm that extracts any region's covariance in constant time has been proposed to improve the speed [20]. This technique significantly accelerates covariance matrix extraction by taking advantage of the spatial arrangement of the points, as illustrated in Fig. 8. The graphs show the computational savings for histogram extraction, which in turn accelerates tracking methods that use histograms, and for covariance matrix extraction.
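
As an illustration, a minimal numpy/scipy sketch of a region covariance descriptor and a generalized-eigenvalue dissimilarity in the spirit of [18, 19] is given below; the seven-feature choice [x, y, R, G, B, |Ix|, |Iy|], the ridge term, and all function names are illustrative rather than the exact formulation of the cited works.

```python
import numpy as np
from scipy.linalg import eigvalsh

def region_covariance(patch):
    """Covariance descriptor of an image patch of shape (H, W, 3).

    Per-pixel feature vector: [x, y, R, G, B, |Ix|, |Iy|]; other
    features (orientation, filter responses) can be appended.
    """
    h, w, _ = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gray = patch.mean(axis=2)
    gy, gx = np.gradient(gray)
    feats = np.stack([xs, ys,
                      patch[..., 0], patch[..., 1], patch[..., 2],
                      np.abs(gx), np.abs(gy)], axis=-1).reshape(-1, 7)
    return np.cov(feats.astype(np.float64), rowvar=False)

def covariance_distance(c1, c2, ridge=1e-6):
    """d(C1, C2) = sqrt(sum_i ln^2 lambda_i), where lambda_i are the
    generalized eigenvalues of (C1, C2); both matrices must be
    positive definite, hence the small ridge."""
    k = c1.shape[0]
    lam = eigvalsh(c1 + ridge * np.eye(k), c2 + ridge * np.eye(k))
    return np.sqrt(np.sum(np.log(lam) ** 2))
```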

Fig. 6 Sample covariance tracking [20] results under occlusion (top) and severe illumination change (bottom). Note that intensity is not discarded as in conventional approaches; instead, the full RGB color space is used

Fig. 7 Noise performance at frames 1 (left), 40 (middle), and 200 (right). Top: mean-shift tracker [16] using a color histogram. Bottom: covariance tracker [20] using 7 features. Almost 95% of the pixels are distorted by noise

Fig. 8 Left: extraction of histograms is accelerated by using integral histograms [20]; each graph corresponds to a different histogram size. Right: covariance extraction is likewise accelerated by using integral images; the covariance is computed for 17 × 7 image windows
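
The constant-time region statistics behind Fig. 8 rest on the integral image; a minimal numpy sketch is given below. An integral histogram keeps one such table per bin, so any rectangle's histogram costs four lookups per bin; the covariance case keeps tables of feature sums and pairwise feature products [20]. Function names are illustrative.

```python
import numpy as np

def integral_image(values):
    """Cumulative 2-D sum with a zero border row and column."""
    h, w = values.shape
    ii = np.zeros((h + 1, w + 1), dtype=np.float64)
    ii[1:, 1:] = values.cumsum(axis=0).cumsum(axis=1)
    return ii

def region_sum(ii, y0, x0, y1, x1):
    """Sum of values over rows [y0, y1) and columns [x0, x1) in O(1):
    four table lookups regardless of the region size."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```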

Table 2 shows the computational load and robustness of the above tracking methods under various conditions, including erratic motion, appearance changes, etc. We denote the tracked object size (number of pixels in the object region) as N, the image size (total number of pixels) as M, the histogram size as H, the number of candidate regions as R, the number of features in the covariance matrix as F, and the number of classifiers as C. We also present the approximate processing times for tracking a 20 × 40 object on a P4 3 GHz machine.

Table 2 Performance of inter-frame tracking

As with many vision tasks, object detection and tracking also benefit from dedicated hardware implementations. Such implementations contain various combinations of subsystems such as traditional digital signal processors (DSPs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and other reconfigurable cores.

DSPs offer software programmability, which is a cost-effective means of keeping hardware viable. With a programmable DSP architecture, it is possible to speed up fundamental low-level algorithms. ASICs, on the other hand, offer a high-performance, low-power, and low-cost option for implementing algorithms in volume, but supporting different tracking methods requires a growing number of ASICs, leading to larger devices, greater power consumption, and higher cost. GPUs also allow the construction of economical, parallel architectures; several processor-intensive algorithms, including contrast enhancement, color conversion, edge detection, and feature point tracking, can be offloaded to them.

FPGAs enable large-scale parallel processing and pipelining of the data flow. The latest FPGAs provide significant on-chip RAM and support high clock speeds. However, current on-chip RAMs are not sufficient to support a useful level of internal frame buffering for object detection and tracking, so additional external memory banks are required to provide storage while the image data are processed. The high I/O capability of FPGAs supports simultaneous access to multiple RAM banks, enabling effective and efficient pipelining. Using multiple memory banks, background subtraction for a 512 × 512 frame takes only 1.96 ms on a low-cost 100 MHz FPGA, whereas the same operation requires 77.4 ms on a 30 MHz TMS320C44 and 3.15 ms on a 1.1 GHz TMS320C64x DSP board [21].

Still, there remains a need for algorithmic improvements to achieve real-time tracking performance under uncontrolled conditions:

  • Likelihood score computation between the object and the candidate regions is a bottleneck. Tracking methods employing histograms become more demanding as the histogram size increases, and some histogram distance metrics (e.g., Kullback–Leibler, Bhattacharyya) are inherently expensive. For covariance tracking, the likelihood computation requires an eigen-decomposition, which is slow. Fast likelihood computation methods can significantly improve the computational speed.

  • Complexity is proportional to the number of candidate regions (or the size of the search region). Hierarchical search methods can be applied to accelerate the tracking process (see the sketch after this list).

  • Localized search methods, such as mean-shift and ensemble tracking, become slower as the object size grows. Adaptive scaling of the kernels and images, without destroying the salient information, can be adopted to achieve real-time performance.

  • Kernel-based tracking methods become more demanding as the number of objects increases. Global search methods can be applied for applications that require tracking of a multitude of objects.
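
The hierarchical search suggested above can be illustrated with a generic coarse-to-fine template search over an image pyramid. The matching score here is plain normalized cross-correlation rather than any particular tracker's likelihood, and boundary handling is simplified; names and window sizes are illustrative.

```python
import cv2

def coarse_to_fine_search(image, template, levels=3, pad=4):
    """Locate a template by a full search at the coarsest pyramid
    level, then refine within a small window at each finer level.
    Assumes the image comfortably contains the padded template."""
    imgs, tmps = [image], [template]
    for _ in range(levels - 1):
        imgs.append(cv2.pyrDown(imgs[-1]))
        tmps.append(cv2.pyrDown(tmps[-1]))
    # Exhaustive matching only at the coarsest, cheapest level.
    res = cv2.matchTemplate(imgs[-1], tmps[-1], cv2.TM_CCOEFF_NORMED)
    _, _, _, (x, y) = cv2.minMaxLoc(res)
    for lvl in range(levels - 2, -1, -1):
        x, y = 2 * x, 2 * y                      # project estimate up a level
        th, tw = tmps[lvl].shape[:2]
        y0, x0 = max(y - pad, 0), max(x - pad, 0)
        roi = imgs[lvl][y0:y0 + th + 2 * pad, x0:x0 + tw + 2 * pad]
        res = cv2.matchTemplate(roi, tmps[lvl], cv2.TM_CCOEFF_NORMED)
        _, _, _, (dx, dy) = cv2.minMaxLoc(res)
        x, y = x0 + dx, y0 + dy
    return x, y
```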

4 Conclusions

As discussed earlier, object detection and tracking are among the most computationally demanding image processing tasks. In addition to the algorithmic developments mentioned above, special hardware implementations would significantly improve the speed of these tasks. Pixel-wise operations in object detection and search operations in tracking can easily be processed in parallel. On the other hand, robustness to extreme conditions requires scene adaptation and the ability to select between different detectors and trackers. We expect hybrid solutions that can predict and apply the best combination of detectors and trackers to become more popular in the future. Nevertheless, such solutions would require even more computational power.