1 Introduction

Face detection and tracking is one of the most active research topics in computer vision and has made remarkable progress over decades of research. However, illumination changes and occlusions remain major challenges in face tracking. Fusing multiple visual cues is a promising approach to coping with illumination changes. With the emergence of high-performance hardware such as multi-core CPUs and GPU cards, face-tracking algorithms can run more efficiently via parallel schemes, which makes fast and robust face tracking usable in many applications, including video surveillance and human–computer interaction.

Multiple cues, such as color and contour [1, 2], color and motion [3, 4], and color and depth [5, 6], have been widely used to achieve robust object tracking. The corresponding experimental results show that multi-cue approaches improve tracking performance considerably. Compared with commonly used tracking methods such as the Kalman filter [7] and mean shift [8], the particle filter [9] can handle nonlinear and non-Gaussian problems, at the cost of a high computational load. The object-tracking survey by Yilmaz et al. [10] reviews methods of object representation, feature selection, object detection, and object tracking in detail. Color, edge, optical flow, and texture are commonly used features for object tracking. However, tracking with a single feature is not robust under illumination variation, camera movement, and occlusion. Combining these features is a promising way to improve tracking performance. This paper focuses on parallel particle-filter tracking methods with multiple cues.

Many researchers focus on establishing a multi-cue integration mechanism under a probabilistic framework, including the Dynamic Bayesian Network [11], the Monte Carlo method [1, 12], and particle filters [13–16]. In these methods, multiple cues are tightly coupled with the tracking model and the tracking algorithm under a Bayesian framework, which makes them difficult to use in deterministic tracking methods. Isard et al. [1] proposed the ICondensation algorithm, combining color and contour information with importance sampling. Wu et al. [12] presented an approach that combines visual cues by including them in the state while decoupling the prediction and observation of the different cues. Spengler and Schiele [13] proposed an integration scheme, based on the Condensation algorithm, to reliably detect and track multiple hypotheses even under challenging conditions. Maintaining multiple hypotheses over time explicitly avoids locking onto a particular target and therefore prevents incorrect adaptation caused by false-positive tracking. Brasnett et al. [14] developed a particle filter (PF) and a Gaussian sum particle filter (GSPF) based on multiple information cues, namely color and texture, which are described with highly nonlinear models; the algorithms rely on factorizing the likelihood as a product of the likelihoods of the cues. Serby et al. [15] presented a generic tracker combining complementary sources of information, such as interest points, edges, and homogeneous and textured regions, integrated into a particle filter framework. Zhao et al. [16] proposed an effective and robust facial feature tracking approach based on a multi-cue particle filter, integrating both color and edge distributions into the filter to ensure tracking accuracy, together with an efficient updating algorithm to avoid the accumulation of tracking errors.

Another multi-cue integration method is pixel-wise integration, in which tracking is treated as a pixel classification problem: whether a pixel belongs to the foreground or the background is determined jointly by all the cues. Every cue produces a saliency map, and these maps are combined according to certain principles. One representative method is the adaptive democratic integration proposed by Triesch and Malsburg [17], in which each cue votes for the final combined saliency map and the voting-like integration scheme adapts over time. Spengler and Schiele [13] used this adaptive integration method to integrate cues in human face tracking. The pixel-wise integration method is well suited to deterministic tracking methods.

Rasmussen et al. [18] defined a target as a conjunction of parts and introduced a constrained joint likelihood filter as a data-association method to generate the measurement for a Kalman filter. Chen et al. [19] used the Lucas–Kanade algorithm to detect and track six facial feature points using multiple cues, including facial feature intensity as well as the probability distribution, geometric characteristics, and motion information. Liu et al. [20] proposed an adaptive multi-cue integration for robust visual tracking based on the mean-shift framework.

The particle filter provides a statistical probabilistic framework for estimating object states based on sampling techniques. The observation step, which calculates the sample weights, is the most computationally intensive stage of particle filter algorithms. As a result, a trade-off between accuracy and speed must be considered in practical applications. Lozano and Otsuka [21] presented a GPU-based parallel scheme for a real-time visual tracker of the 3D position and pose of objects in video sequences. Several papers study parallel particle filters on GPUs. Hendeby, Hol and Karlsson [22] parallelized the particle resampling step. Montemayor et al. [23] implemented a real-time particle filter algorithm with the help of the shader model. Happe et al. [24] presented a video object-tracking application built on a framework for implementing sequential Monte Carlo (SMC) methods on CPU/FPGA-based systems, a heterogeneous multi-core architecture.

This paper proposes a multi-cue face-tracking algorithm accelerated by parallel multi-core and GPU processing, which effectively increases the robustness of face tracking. In Sect. 2, we briefly describe the multi-cue face-tracking algorithm within the particle filter framework, including the face model, the dynamic model, and the multi-cue observation model. In Sect. 3, two parallel schemes for particle-filter-based face tracking are proposed to speed up the algorithm. The experimental results in Sect. 4 demonstrate the effectiveness of the proposed algorithms. Finally, conclusions are presented in Sect. 5.

2 Face tracking with particle filter

2.1 Tracking methods

The particle filter [9] offers a probabilistic framework for dynamic state estimation based on Monte Carlo simulation, and provides robust tracking against cluttered backgrounds. The state of a face at time t is denoted \( s_t \) and its history is \( S = \{ s_{1} , \cdots , s_{t} \} \). Similarly, the set of image features at time t is \( z_t \) with history \( Z = \{ z_{1} , \cdots , z_{t} \} \). The basic idea of the particle filter is to compute the posterior state density \( p(s_t | z_t) \) at time t using the process density \( p(s_t | s_{t-1}) \) and the observation density \( p(z_t | s_t) \). To improve face-tracking performance, three cues (color, edge, and wavelet) are integrated under the particle filtering framework. Given the face model, the tracking algorithm consists of four main steps: (1) sample selection, generating new samples \( s_{t}^{\prime (n)} \) from the old sample set \( s_{t - 1}^{(n)} \) with weights \( \pi_{t - 1}^{(n)} \); (2) prediction, determining new samples with the dynamic model \( s_{t}^{(n)} = s^{\prime (n)}_{t} + w_{t}^{(n)} \), where \( w_{t}^{(n)} \) is Gaussian noise for the nth sample; (3) weight measurement, calculating the weight \( \pi_{t}^{(n)} \) for each newly generated sample via the observation step; and (4) state estimation, obtaining the final face state vector from the newly generated samples and their weights. The implementation details are described as follows.

2.2 Face model

In this paper, multiple cues from a rectangular region are used to describe face features for tracking. As shown in Fig. 1, the red rectangle indicates the face region, and three types of region features (color histogram, edge orientation histogram, and wavelet features) are integrated to achieve more stable tracking. To alleviate the problem of illumination change, we calculate the face color histogram \( H^{\text{color}} = \{ h_{i}^{\text{color}} \}_{i = 0}^{{B_{c} - 1}} \) in the HSV color space, where \( B_c \) denotes the number of bins (discrete intervals). Similarly, the edge orientation histogram \( H^{\text{edge}} = \{ h_{i}^{\text{edge}} \}_{i = 0}^{{B_{e} - 1}} \) is calculated from the edge image produced by a Canny filter, where \( B_e \) denotes the number of bins, as shown in Fig. 1c. For each face window, the horizontal, vertical, and diagonal coefficients are calculated by wavelet transformations at different scales. Figure 1d shows the final wavelet feature vector \( V^{\text{wavelet}} = \{ v_{i}^{\text{wavelet}} \}_{i = 0}^{d - 1} \), where d is the number of feature dimensions.
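As a concrete illustration of the color cue, the following minimal C++ sketch bins the hue channel of an HSV image over the face rectangle and normalizes the result. The function name and the restriction to the hue channel are our assumptions for illustration, not the paper's exact implementation.

```cpp
#include <vector>

// Build a normalized hue histogram over the face rectangle (x, y, w, h).
// 'hue' is the H channel of an HSV image (values 0..179), 'imgW' its width.
std::vector<float> hueHistogram(const unsigned char* hue, int imgW,
                                int x, int y, int w, int h,
                                int numBins /* B_c */) {
    std::vector<float> hist(numBins, 0.0f);
    for (int r = y; r < y + h; ++r)
        for (int c = x; c < x + w; ++c) {
            int bin = hue[r * imgW + c] * numBins / 180;  // map hue to a bin
            hist[bin] += 1.0f;
        }
    float total = static_cast<float>(w * h);
    for (float& b : hist) b /= total;                     // normalize to sum 1
    return hist;
}
```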

Fig. 1 Face model feature description

In our work, the face model is defined by the following parameter set:

$$ s = (H^{\text{color}} ,H^{\text{edge}} ,V^{\text{wavelet}} , R) $$
(1)

where R is a rectangle represented by \( R = (C_x, C_y, W, H) \), \( (C_x, C_y) \) is the centroid position, and W and H are the width and height of the rectangle.
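For illustration, the parameter set of Eq. (1) maps naturally onto a small C++ structure; the field names below are our own.

```cpp
#include <vector>

// Sketch of the face state in Eq. (1); field names are illustrative.
struct FaceModel {
    std::vector<float> colorHist;   // H^color, B_c bins (HSV color space)
    std::vector<float> edgeHist;    // H^edge,  B_e bins (edge orientations)
    std::vector<float> wavelet;     // V^wavelet, d coefficients
    float cx, cy;                   // rectangle centroid (C_x, C_y)
    float w, h;                     // rectangle width W and height H
};
```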

2.3 Dynamical and observation model

Face dynamics are modeled as a first-order process, as shown in the following equation:

$$ s_{t} = s_{t - 1} + w_{t - 1} $$
(2)

where \( w_{t-1} \) denotes Gaussian noise. The observation process measures and weighs all newly generated samples. The visual observation is a process of visual information fusion comprising three sub-processes: the computation of the sample weights \( p^{\text{color}} \), \( p^{\text{edge}} \), and \( p^{\text{wavelet}} \) based on the color histogram, edge orientation histogram, and wavelet features, respectively.

For the nth sample, we obtain the weight \( p_{n}^{\text{color}} \) through a Bhattacharyya similarity function, as shown in Eq. (3), which measures the similarity between the sample histogram \( H_{n}^{\text{color}} \) and the reference histogram template \( H_{\text{ref}}^{\text{color}} \):

$$ p_{n}^{\text{color}} = \,\exp \{ - D^{2} (H_{n}^{\text{color}} ,H_{\text{ref}}^{\text{color}} )\} \;{\text{where}}\;D(H_{n}^{\text{color}} ,H_{\text{ref}}^{\text{color}} ) = \left(1 - \sum\limits_{i = 0}^{{B_{c} - 1}} {\sqrt {h_{i,n}^{\text{color}} \times h_{{i,{\text{ref}}}}^{\text{color}} } } \right)^{1/2}. $$
(3)

Similarly, we obtain the sample weight \( p_{n}^{\text{edge}} \) based on the edge orientation histogram as follows:

$$ p_{n}^{\text{edge}} = \exp \{ - D^{2} (H_{n}^{\text{edge}} ,H_{\text{ref}}^{\text{edge}} )\} \;{\text{where}}\;D(H_{n}^{\text{edge}} ,H_{\text{ref}}^{\text{edge}} ) = \left(1 - \sum\limits_{i = 0}^{{B_{e} - 1}} {\sqrt {h_{i,n}^{\text{edge}} \times h_{{i,{\text{ref}}}}^{\text{edge}} } } \right)^{1/2}. $$
(4)

To compute the sample weight \( p_{n}^{\text{wavelet}} \) based on the wavelet feature, the Euclidean distance between the sample feature vector \( V_{n}^{\text{wavelet}} \) and the reference feature vector \( V_{\text{ref}}^{\text{wavelet}} \) is employed in our system. The expression for \( p_{n}^{\text{wavelet}} \) is as follows:

$$ p_{n}^{\text{wavelet}} = \exp \{ - Eu(V_{n}^{\text{wavelet}} ,V_{\text{ref}}^{\text{wavelet}} )\} \;{\text{where}}\;Eu(V_{n}^{\text{wavelet}} ,V_{\text{ref}}^{\text{wavelet}} ) = \left(\sum\limits_{i = 0}^{d - 1} {(v_{i,n}^{\text{wavelet}} - v_{{i,{\text{ref}}}}^{\text{wavelet}} )^{2} } \right)^{1/2}. $$
(5)

With three different visual cues, we obtain the final weight for the nth sample as:

$$ p(z_{t}^{n} |s_{t}^{n} ) = \alpha_{\text{color}} p_{n}^{\text{color}} + \alpha_{\text{edge}} p_{n}^{\text{edge}} + \alpha_{\text{wavelet}} p_{n}^{\text{wavelet}} $$
(6)

where \( \alpha_{\text{color}} \), \( \alpha_{\text{edge}} \), and \( \alpha_{\text{wavelet}} \) are the coefficients weighting the color histogram, edge orientation histogram, and wavelet feature cues, respectively. Their values can be determined empirically.
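The three cue weights and their fusion can be summarized in a few lines of C++. This is a minimal sketch of Eqs. (3)–(6) under the assumption that all histograms are normalized; it is not the authors' exact code.

```cpp
#include <cmath>
#include <vector>

// Bhattacharyya-based weight for a histogram cue, Eqs. (3) and (4).
float histWeight(const std::vector<float>& h, const std::vector<float>& href) {
    float bc = 0.0f;                       // Bhattacharyya coefficient
    for (size_t i = 0; i < h.size(); ++i) bc += std::sqrt(h[i] * href[i]);
    float d2 = 1.0f - bc;                  // D^2 = 1 - sum of sqrt(h * h_ref)
    return std::exp(-d2);
}

// Euclidean-based weight for the wavelet cue, Eq. (5).
float waveletWeight(const std::vector<float>& v, const std::vector<float>& vref) {
    float eu2 = 0.0f;
    for (size_t i = 0; i < v.size(); ++i) {
        float diff = v[i] - vref[i];
        eu2 += diff * diff;
    }
    return std::exp(-std::sqrt(eu2));
}

// Fused sample weight, Eq. (6); the alphas sum to 1 (1/3 each in Sect. 4).
float fusedWeight(float pColor, float pEdge, float pWavelet,
                  float aColor, float aEdge, float aWavelet) {
    return aColor * pColor + aEdge * pEdge + aWavelet * pWavelet;
}
```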

2.4 Multi-core particle filter algorithm

2.4.1 Initialization/re-initialization

In most object-tracking systems, the initialization algorithm is performed only at the beginning of tracking and cannot recover from tracking failures. In our tracking system, a boosted face detector [25] is introduced to achieve automatic initialization when the system starts or when a tracking failure occurs. The face detection results are used to update the reference face model; the model is updated when the tracking confidence stays below a threshold for M successive frames. This online updating strategy alleviates tracking drift caused by illumination changes, slight face rotations, or partial occlusions.

2.4.2 Algorithms

The face-tracking algorithm consists of two parts: automatic initialization and particle filter tracking. The complete procedure is summarized as Algorithm 1.

Algorithm 1 Multi-cue face tracking with automatic initialization and particle filter tracking
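A pseudocode-style C++ sketch of Algorithm 1 is given below. VideoSource, Particle, and the helper functions (detectFace, resample, measureWeight, and so on) are illustrative stand-ins for the system's modules, and M and THRESHOLD are the re-initialization parameters of Sect. 2.4.1.

```cpp
#include <vector>

// Pseudocode-style sketch of Algorithm 1; all types and helpers are
// hypothetical stand-ins, not the authors' actual interfaces.
void trackFace(VideoSource& video) {
    Frame frame = video.first();
    FaceModel ref = detectFace(frame);               // automatic initialization [25]
    std::vector<Particle> particles = initParticles(ref, NUM_PARTICLES);
    int lowConfidenceFrames = 0;
    while (video.next(frame)) {
        resample(particles);                          // (1) select by previous weights
        for (Particle& p : particles) {
            p.state = p.state + gaussianNoise();      // (2) predict, Eq. (2)
            p.weight = measureWeight(frame, p, ref);  // (3) multi-cue weight, Eq. (6)
        }
        Particle best = maxWeightParticle(particles); // (4) estimate the face state
        if (best.weight < THRESHOLD) {
            if (++lowConfidenceFrames >= M) {         // M successive low-confidence frames
                ref = detectFace(frame);              // re-initialize the reference model
                lowConfidenceFrames = 0;
            }
        } else {
            lowConfidenceFrames = 0;
        }
    }
}
```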

3 Parallel particle filter algorithm

3.1 Multi-core and GPGPU

There is a growing trend towards multi-core chips. Dual-core and quad-core chips are very common, and chips with higher core counts will follow. Multi-core processors can deliver significant performance benefits for multi-threaded software by adding processing power with minimal latency, given the proximity of the processors. New multi-core processors will eventually come in heterogeneous configurations, such as combinations of high- and low-power cores, graphics processors, cache blocks, and on-chip interconnects. By bundling many CPUs into one processor, creating what is known as a “manycore” processor, the computing industry can continue to provide dramatic increases in computing power. As shown in Fig. 2a, a dual-core CPU has two core-local level 1 caches and a shared, on-die level 2 cache (http://en.wikipedia.org/wiki/Multi-core).

Fig. 2 Multi-core and GPGPU thread structure. a Diagram of a generic dual-core processor (http://en.wikipedia.org/wiki/Multi-core). b NVIDIA G80 thread execution model (http://www.nvidia.com/cuda)

Recent graphics architectures provide tremendous memory bandwidth and computational horsepower. One emerging trend is general-purpose computation on GPUs (GPGPU). Packed with 340 Gflops of computing power, GPGPUs have been used to speed up many applications beyond graphics processing. GPU processing power has recently been growing much faster than that of (multi-core) CPUs, and the trend continues. Other examples in this family include the Cell architecture, used in the PlayStation 3 as well as in workstations and servers from Mercury Computer Systems, and AMD's System-on-a-Chip for laptop computers, in which a GPU is integrated with the CPU. Three major advances in the last 2 years have made GPUs much more accessible for general-purpose computing: a unified architecture design, an extended C programming interface, and libraries and demonstrations of several applications.

The G80 processor, NVIDIA's first implementation of the Compute Unified Device Architecture (CUDA), is an attempt to make GPGPU available for non-graphics programming. The G80 architecture enables the GPU to use its multiprocessors to process blocks of 64–512 threads. Blocks are divided into groups of 32 threads called warps, which the processor uses for scheduling. A kernel is a program executed as blocks of warps. The arrangement of threads into blocks and of blocks into grids is defined by the programmer, as shown in Fig. 2b.

3.2 Multi-core parallel particle filter algorithm

Traditional parallel/distributed programming techniques, such as message passing and shared-memory threads, are too complex for most researchers and developers, especially those without formal training in parallel and distributed computing.

MapReduce (http://en.wikipedia.org/wiki/MapReduce) provides a programming model for processing and generating large datasets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programmers need not parallelize the dataset processing themselves; programs written in this functional style are automatically parallelized and executed on multi-core machines and multi-core clusters. Because parallel processing at the multi-core level is easy and effective in this model, we apply it to the particle filter algorithm.

Google's MapReduce is designed with their typical application in mind, namely datasets too large for main memory, which is why a distributed file system is needed. Recent work at Stanford University has also shown that MapReduce interfaces and libraries can serve as a good parallel programming paradigm for Symmetric Multi-Processing (SMP).

Our parallelization of the particle filter for face tracking operates at two levels. One is a multi-core implementation with the MapReduce model (multi-threading); the other assigns the weight-measurement portion to the GPU. We first describe the multi-core MapReduce thread model. The parallel part is the particle filter tracking iteration. If the number of particles is particle_num and the maximum number of available cores is max_core, the number of particles per core is particleNum_on_everyCore = particle_num / max_core. The thread model for the particle filter algorithm is shown in Fig. 3, in which the particles are partitioned over the current image (frame). Each thread, bound to one core, implements one map. The map procedure comprises the color histogram calculation, edge histogram calculation, texture feature calculation, and the confidence computation; the maximum confidence within a thread is derived from the three individual probability computations. The reduce procedure aggregates the results of the different particles for the current frame.
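A minimal sketch of this thread model using standard C++ threads is shown below; Particle, Frame, and weighParticle() are hypothetical stand-ins, and the actual system uses a Phoenix-style MapReduce runtime rather than raw threads.

```cpp
#include <thread>
#include <vector>

// Sketch of the map step on a multi-core CPU: the particle set is split
// evenly across max_core threads, one per core; weighParticle() stands in
// for the color/edge/texture measurements and confidence computation.
void mapWeights(std::vector<Particle>& particles, const Frame& frame,
                unsigned maxCore) {
    size_t perCore = particles.size() / maxCore;   // particleNum_on_everyCore
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < maxCore; ++c) {
        size_t begin = c * perCore;
        size_t end = (c + 1 == maxCore) ? particles.size() : begin + perCore;
        workers.emplace_back([&particles, &frame, begin, end] {
            for (size_t i = begin; i < end; ++i)   // one map per thread/core
                particles[i].weight = weighParticle(particles[i], frame);
        });
    }
    for (std::thread& w : workers) w.join();
    // Reduce: aggregate the per-core results, e.g. pick the maximum weight.
}
```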

Fig. 3 Multi-core thread model for the particle filter algorithm

3.3 Parallel particle filter algorithm on GPU

The most computationally expensive step is calculating the weight measurements for all samples, especially with multi-cue features. Since the weight measurements of the samples are independent of each other, the tracking algorithm is easy to speed up with parallel computing hardware, and GPGPU techniques can be applied to this problem.

The kernel for the GPU is executed in M blocks, each with N threads. Each thread calculates the weight measurement for one sample, comprising the three measurements from the color, edge, and wavelet features. Consequently, M × N samples can be processed in parallel. The parallel tracking scheme is shown in Fig. 4.

Fig. 4 Parallel particle filter thread model on the GPU

As shown in Fig. 4, the particle information for all faces and the frame data are copied to GPU memory for each input frame. The kernel function is then invoked and executed over the blocks and threads. Each thread calculates the color, edge, and wavelet cue matches for one particle. Taking the color cue as an example, the color histogram is calculated, followed by the Bhattacharyya similarity coefficient between the reference histogram and the calculated histogram; a weight is then assigned to the particle according to this coefficient. Finally, the sample weights from all threads are collected, and the sample with the maximum confidence is selected as the final estimate of the target face.
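The per-particle weight kernel can be sketched in CUDA as follows; colorMatch(), edgeMatch(), and waveletMatch() are assumed __device__ helpers implementing Eqs. (3)–(5), and all names are illustrative rather than the authors' actual code.

```cuda
// Sketch of the weight-measurement kernel: one thread weighs one particle.
__global__ void weighParticles(const float4* states,      // (C_x, C_y, W, H)
                               const unsigned char* frame, int imgW, int imgH,
                               const float* refColor, const float* refEdge,
                               const float* refWavelet,
                               float* weights, int numParticles) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;         // global particle id
    if (n >= numParticles) return;
    float pColor   = colorMatch(states[n], frame, imgW, imgH, refColor);
    float pEdge    = edgeMatch(states[n], frame, imgW, imgH, refEdge);
    float pWavelet = waveletMatch(states[n], frame, imgW, imgH, refWavelet);
    // Eq. (6) with equal coefficients, as in the experiments (alpha = 1/3).
    weights[n] = (pColor + pEdge + pWavelet) / 3.0f;
}

// Host side: M blocks of N threads evaluate M x N samples in parallel.
// weighParticles<<<M, N>>>(dStates, dFrame, imgW, imgH,
//                          dRefColor, dRefEdge, dRefWavelet, dWeights, M * N);
```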

The parallel face-tracking algorithm with the particle filter is similar to Algorithm 1 described in Sect. 2.4.2; the only difference is in step 2C, where the weight measurement is computed in parallel. All samples are divided into M groups, and each sample is calculated by one computing thread.

3.4 Parallel particle filter algorithm on GPU and multi-core CPU

To improve the efficiency of particle filter tracking on the hardware platform, several issues must be addressed. The main functional modules of the proposed computation architecture are depicted in Fig. 5; there are three layers: the scheduler, the MapReduce implementation, and CPU/GPU co-processing.

Fig. 5 Co-processing architecture on a heterogeneous system

  • Scheduler The task scheduler is responsible for parallelizing the computation job. Inspired by the Stanford Phoenix project (http://csl.stanford.edu/~christos/sw/phoenix/), we have developed a similar scheduler with MapReduce operations. Phoenix is a shared-memory MapReduce runtime for multi-core CPUs. In the scheduler, the GPU is viewed as a data-parallel computing device that operates as a co-processor to the main CPU: the CPU takes the management and control role, and the GPU acts as a stream accelerator. The scheduler assigns particles to the CPU or the GPU according to the workload and the dispatch strategy.

  • MapReduce The MapReduce model is an established paradigm for supporting data-intensive computation. The MapReduce implementation provides two primitive operations: (1) a map function that processes input key/value pairs and generates intermediate key/value pairs, and (2) a reduce function that merges all intermediate pairs with the same key. With this MapReduce implementation, computational tasks are automatically distributed and executed on multiple machines, multiple processors, and multiple cores.

  • CPU/GPU co-processing Communication between the CPU and the GPU happens through main memory. Mars [26] is a good co-processing library using CUDA. However, it divides the data statically according to a pre-defined ratio, which cannot support complex co-processing schemes.

The co-processing between the GPU and the multi-core CPU is shown in Fig. 6, taking the color histogram calculation as an example. Suppose there are 2,000 particles for the weight computation; the color histogram of the template is computed by the multi-core CPU. With the task scheduler of Fig. 5 and a CPU-to-GPU particle dispatch ratio of 0.25, 400 particles are sent to the multi-core CPU and 1,600 particles to the GPU for weight computation. Once the two parts finish their weight computations, the confidences are sorted and the particle with the maximum confidence is found.
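The dispatch logic can be sketched as follows, reusing the mapWeights() sketch from Sect. 3.2; launchGpuWeights() is an assumed host wrapper around the CUDA kernel of Sect. 3.3, and all names are illustrative.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Sketch of the static dispatch in Fig. 6: with a CPU-to-GPU ratio of 0.25,
// 2,000 particles split into 400 for the CPU and 1,600 for the GPU.
void coProcessWeights(std::vector<Particle>& particles, const Frame& frame,
                      float cpuGpuRatio, unsigned cores) {
    size_t nCpu = static_cast<size_t>(
        particles.size() * cpuGpuRatio / (1.0f + cpuGpuRatio)); // 0.25 -> 400
    // CPU share: MapReduce-style threading over the first nCpu particles.
    std::thread cpuWorker([&] {
        std::vector<Particle> cpuPart(particles.begin(),
                                      particles.begin() + nCpu);
        mapWeights(cpuPart, frame, cores);
        std::copy(cpuPart.begin(), cpuPart.end(), particles.begin());
    });
    // GPU share: the remaining particles are weighed by the GPU kernel.
    launchGpuWeights(particles.data() + nCpu, particles.size() - nCpu, frame);
    cpuWorker.join();
    // Both shares done: sort confidences and pick the maximum-weight particle.
}
```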

Fig. 6 Co-processing between the GPU and multi-core CPU

4 Experimental implementation and comparison

4.1 Tracking accuracy evaluation

A face-tracking system in C++ has been developed based on the Phoenix code and CUDA (http://www.nvidia.com/cuda) on an HP xw8400 workstation. The workstation is configured with dual Intel Xeon 5345 CPUs (eight cores in total), 4 GB of RAM, and an NVIDIA FX4600 card (G80 GPU, 768 MB memory, 12 multiprocessors). A Logitech Pro 4000 web camera was used to capture 320 × 240 images for the tracking experiments. Several face-tracking experiments under different illumination conditions were designed to verify the proposed algorithms. As shown in Fig. 7, our tracking algorithm adapts to different illuminations and slight face rotations thanks to the online updating strategy; the facial region is indicated by red rectangles. Face detection is performed every ten frames to verify the tracking results, and re-initialization is performed if a tracking failure occurs. In the sequential mode of our experiments, the number of particles is set to 1,000 by default, there is one face per frame, and the weights of the color histogram, edge orientation histogram, and wavelet feature cues are all set to 1/3. If there are multiple faces in a frame, the number of particles should be increased proportionally.

Fig. 7 Tracking results with different illuminations and views

A group of 282 images with face movements was captured to evaluate tracking accuracy. Using the manually labeled center positions of the face rectangles as ground truth, the Euclidean distances between the tracking results and the ground truth demonstrate the accuracy of our face-tracking method, as shown in Fig. 8. The mean absolute error is 8.43 pixels, and the standard error is 4.06 pixels.
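For reference, the evaluation metric amounts to the following computation; this is a trivial sketch with variable names of our own choosing.

```cpp
#include <cmath>
#include <vector>

// Per-frame error: Euclidean distance between tracked and labeled centers.
float centerError(float tx, float ty, float gx, float gy) {
    float dx = tx - gx, dy = ty - gy;
    return std::sqrt(dx * dx + dy * dy);
}

// Mean absolute error over the sequence (8.43 pixels on the 282-frame set).
float meanAbsoluteError(const std::vector<float>& errors) {
    float sum = 0.0f;
    for (float e : errors) sum += e;
    return errors.empty() ? 0.0f : sum / errors.size();
}
```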

Fig. 8 Euclidean distance for tracking performance evaluation

4.2 Speedup issue discussions

Normally, the number of particles is set to 100–200 in real-time applications because of the low efficiency of sequential computation, especially when multiple features are used. The reported results are for one person (one face) per frame; if there are several persons in a frame, the computing time is roughly multiplied by the number of people. The three parallel approaches are listed in Table 1.

Table 1 Three parallel approaches

We conducted several experiments with sequential and multi-core parallel implementations. The number of particles was set to 1,000, 2,000, and up to 10,000 in steps of 1,000. Our workstation has eight CPU cores, so the number of cores used ranges from 1 to 8. To investigate performance with different numbers of cores, we set the particle number to 800 and varied the number of computing cores from 2 to 8. The performance is shown in Fig. 9: the maximum speedup (4.2×) is obtained with eight cores, and there is a significant improvement going from six cores to eight cores.

Fig. 9 Speedup with different computing cores

The performance curves for the multi-core implementation are shown in Fig. 10. The number of cores was set to four and eight, and the particle number ranged from 1,000 to 10,000. Based on the MapReduce programming model, the particle weight computation was assigned to each computing core. Internally, the MapReduce implementation uses an improved PThreads library as its thread library. The color histogram, edge histogram, and wavelet feature calculations were executed in every thread on every core.

Fig. 10 Computing time of the parallel particle filter on the multi-core CPU

As the number of particles increases, the computation time of each weight-computation period increases as well. As shown in Fig. 10, it took 0.1 s to finish the particle weight computation for 3,000 particles with four cores. For real-time tracking, a computation time of 0.2 s can be guaranteed even when the particle number is set to 10,000. The computation time is the average over a sequence of frames. Based on the results in Fig. 10, the particle number can be set to 5,000 or more with the help of the MapReduce thread model and the multi-core architecture.

To examine GPU performance more directly, the computation time for different particle numbers on the GPU was measured, again from 1,000 to 10,000 in steps of 1,000. The computing parameters for the color histogram, edge histogram, and wavelet feature were first copied from main memory to GPU memory, and the device memory for storing results was allocated in advance. Three GPU kernels were used: the color histogram kernel, the edge histogram kernel, and the wavelet feature kernel. For the parameters M and N in Fig. 4, we set M to 10 blocks and N to 100, meaning that there are 100 threads in every block. Once the three kernels finished, the resulting weights were copied from device memory to host memory. As shown in Fig. 11, the GPU computing time for the weight measurement kernel was 0.082 s when the particle number was set to 10,000, which is fast enough for most applications.
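One way to obtain such kernel timings is with CUDA events, as in the sketch below; the paper does not state its exact timing method, and the device pointers are assumed to be allocated and populated beforehand.

```cuda
// Timing sketch for the weight kernel with the M = 10, N = 100 configuration.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
weighParticles<<<10, 100>>>(dStates, dFrame, imgW, imgH,
                            dRefColor, dRefEdge, dRefWavelet,
                            dWeights, 10 * 100);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);  // elapsed kernel time in ms
cudaEventDestroy(start);
cudaEventDestroy(stop);
```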

Fig. 11 Computing time of parallel particle filters on a GPU

Finally, we compare the three parallel methods: the multi-core method, the GPGPU method, and the co-processing method. The experiment covers three parallel scenarios: computation on eight CPU cores, computation on NVIDIA GPU cores, and computation on a multi-core workstation with GPGPU. The comparison is shown in Fig. 12, which contains three speedup curves. The graph shows that the heterogeneous co-processing speedup is better than the GPU speedup, which in turn is better than the multi-core parallel implementation.

Fig. 12 Comparison of the speedup effects of different parallel computing methods

5 Conclusion

Through theoretical analysis and practical face-tracking experiments, we have verified the proposed particle filter tracking algorithm and processing framework. The fast face-tracking system achieves real-time performance using the multi-core CPU and the GPU, and delivers a relatively high speedup for robust face tracking. Experiments with fast face tracking based on multi-core and GPU processing show that the proposed algorithm and processing system obtain significant speedups, increasing the tracking speed by 8–12 times. Moreover, co-processing between the CPU and the GPU, supported by CUDA, further accelerates the tracking computation.