1 Introduction

Classifier ensembles have been shown to benefit from the diverse decision capabilities of their members, thus improving classification performance [13]. Boosting [6], bagging and other forms of classifier combination [26] aim to form an ensemble of weak classifiers that performs as a strong one [11, 12, 24, 28].

Usually, the weak classifiers are trained on the entire training set and then greedily added to the ensemble according to a certain criterion; the iterative process therefore builds the set of weak classifiers that performs best on the training set. This batch procedure is particularly demanding in terms of computational resources and is suited only to systems that can afford a time-consuming training phase [15, 32]. A new stimulus came from the intuition of Oza [21], who conceived online versions of the bagging and boosting algorithms. In particular, the online boosting (OB) algorithm has found many applications in pattern recognition. For example, Grabner and Bischof [10] showed how to apply this framework to computer vision tasks. Pham and Cham [25] proposed an asymmetric version of the original algorithm to cope with unbalanced classes. Another interesting approach exploits Wald’s sequential decision theory within the boosting algorithm [31]; in a later work, this idea was applied to online learning to achieve a better compromise between complexity and accuracy with respect to OB [9]. The online WaldBoost modifies the cardinality of the ensemble depending on the classification task; when the cardinality is reduced, a speed-up is achieved. As will be discussed later, the proposed method has variable accuracy but always maintains the real-time constraint.

Many other recent works indicate that OB is an effective framework for combining multiple classifiers in computer vision applications [17, 18, 35, 39].

The most prominent use of OB for vision is probably object tracking; a combination of classifiers is used to learn the appearance of a target in the current frame and the ensemble is then employed to detect it in the following frame, considering the detection step as a discrimination between the object and the background [2, 23]. The location of the target is therefore given by the region that triggers the strongest response of the ensemble.

Since object detection and tracking in video sequences are known to benefit from the use of multiple (heterogeneous) features (e.g. colour, orientation histograms) as shown in [5, 8, 36], the combination of different features within the boosting framework can be a winning strategy in terms of robustness and accuracy [42]. However, this can prohibitively increase the computational burden, to the point of preventing real-time performance even in the case of OB. In fact, the computational complexity increases with the number of features/classifiers employed.

To overcome a similar problem in off-line classification, Viola and Jones [33] proposed the idea of building a cascaded classifier to speed up the application of the ensemble to the many sub-regions of an image during the detection step. They exploited the fact, described in [1], that a few accurate weak classifiers are sufficient to narrow the focus of attention on the regions of the image where the object is likely to be present. These classifiers constitute the first level of the cascade, while the other levels comprise the remaining classifiers in descending order of accuracy. This approach has been successfully applied to many tasks, including face detection [37, 38], pedestrian detection [22, 40], neural network frameworks for classification [7], and handwritten digit recognition [14, 41]. From its first appearance in the literature until these recent works, including [3, 4, 38], the cascade was trained off-line through an extremely time-consuming process. An exception is the work of Wu and Nevatia [35], where a cascaded version of the OB algorithm is proposed, but starting from general seed detectors learned off-line.

To speed up the application of this ensemble, here we propose a cascaded OB algorithm that builds a cascade on-the-fly from a heterogeneous set of online boosted classifiers. Since different features can have different computational costs, the cascade construction takes into account both the error and the computational requirements of the available features and populates the levels accordingly, optimizing the detection rate while retaining real-time performance. We thus exploit the advantages given by the OB algorithm while guaranteeing the sought-after property of real-time performance, as required in most computer vision tasks.

The idea was initially conceived in [34] for a homogeneous set of features and then extended in [30] to the heterogeneous case. With respect to these previous papers, the present work introduces the following improvements:

  • A refined cascade algorithm that does not require pre-defined user thresholds.

  • A novel strategy for hypothesis selection that dynamically takes into account the cost of the features and the current frame rate.

  • Thorough experimentation on standard datasets.

We have compared the performance achieved by the proposed cascaded algorithm with that of the monolithic OB. In particular, we have focused on computational cost and thus measured the frame rate obtained in object tracking tasks. For completeness, we have also validated the approach by considering tracking scenarios in real-world video sequences.

The paper is organized as follows: Sect. 2 provides a brief review of the key concepts of boosting and classifier ensembles and puts them in the context of object tracking in video sequences; Sect. 3 provides a step-by-step description of the proposed algorithm; Sect. 4 presents the experimental validation on standard datasets of real-world video sequences; finally, Sect. 5 draws the conclusions.

2 Ensembles of classifiers and boosting

Ensemble methods are well-known techniques for combining multiple classifiers’ outputs in order to improve classification performance (see [26] for a recent survey). Ensemble-based (meta) algorithms [26], such as bagging and boosting, follow the intuition that fusing multiple weak hypotheses (classifiers) yields a strong ensemble, that is, one with increased classification performance.

Given a set of (weak) hypotheses \(\left\{h_1, h_2, \ldots, h_T\right\}\) so that \(h_t: X \rightarrow Y\) where X is the set of vector-valued samples, Y = {−1,  + 1} is the set of labels, and \(\left\{ ({\mathsf{x}}_i,y_i): i=1,\ldots,N \wedge {\mathsf{x}}_i\in X \wedge y_i\in Y \right\}\) is the training set, the boosting algorithm builds a (strong) classifier ensemble H as follows

$$H({\mathsf{x}}) = {\rm sign} \left[ \sum\limits_{t = 1}^T \frac{1}{2} \hbox{log} \left( \frac{ 1 - \epsilon_t}{ \epsilon_t} \right) h_t({\mathsf{x}}) \right]$$
(1)

where t is the current training epoch and T is the total number of epochs (and, at the same time, the total number of weak classifiers). For each t, a hypothesis \(h_t\) is greedily added to the ensemble with respect to the probability distribution \(D_t\) on all the learning samples in X. The probability distribution \(D_t\), updated at each epoch t, assigns a weight to each training example according to its “difficulty”: the boosting algorithm adjusts the weights in order to focus on the “hard” samples in the training set. The objective is to minimize the weak classifier error \(\epsilon_t\), defined as

$$\epsilon_t = P (h_t({\mathsf{x}}_i) \neq y_i ) = \sum_{i:h_t({\mathsf{x}}_i)\neq y_i} D_t(i)$$
(2)
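
As an illustration of (1) only (a minimal sketch, not code from the paper), the following Python snippet computes the weighted vote of a trained ensemble, assuming the per-hypothesis errors \(\epsilon_t\) are already available; all names are hypothetical.

```python
import math

def adaboost_vote(weak_outputs, errors):
    """Weighted vote of Eq. (1): weak_outputs[t] is h_t(x) in {-1, +1},
    errors[t] is the weighted error eps_t of the t-th weak learner."""
    score = sum(0.5 * math.log((1.0 - e) / e) * h
                for h, e in zip(weak_outputs, errors))
    return 1 if score >= 0 else -1

# Toy example: three weak learners voting on one sample.
print(adaboost_vote([+1, -1, +1], [0.1, 0.4, 0.3]))  # prints 1
```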

In 2001, Oza [21] proposed an online version of the boosting algorithm of Schapire and Freund [6]. Two main improvements were introduced. The first is that the ensemble can be used for classification and, at the same time, trained “on-the-fly” on every new sample. The second concerns the way the learning samples are processed. In the off-line version, all the samples are available at the same time, and a distribution over the training set is maintained. At each boosting round t the algorithm thus modifies the weight of each sample according to the error of the base learner \(h_t\).

In contrast, in the online version no distribution over the samples is available, and the information about the training set difficulty has to be embedded in the hypotheses. In particular, a value λ associated with each incoming sample reflects how difficult it is for the current hypothesis set to classify it [20]. The error of every hypothesis \(h_m\) is determined by the accumulated weight of correctly recognized samples, \(\lambda_m^{sc} \leftarrow \lambda_m^{sc} + \lambda\), and the accumulated weight of misclassified samples, \(\lambda_m^{sw} \leftarrow \lambda_m^{sw} + \lambda\), so that

$$ \epsilon_m = \frac{\lambda_m^{sw}}{\lambda_m^{sc}+\lambda_m^{sw}}$$
(3)

Eventually, the ensemble output is given by:

$$H({\mathsf{x}}) = \hbox{argmax}_{y \in Y} \left(\sum\limits_{m=1}^M {\rm log} \left( \frac{1-\epsilon_m}{\epsilon_m} \right) h_m({\mathsf{x}})\right)$$
(4)

where M is the number of weak learners fixed a priori, and \(\epsilon_m\) is the error of hypothesis \(h_m\) on the training set, defined as above.
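
The bookkeeping of (3) and the vote of (4) can be sketched as follows. This is a simplified illustration, not the full algorithm of [21] (which also rescales λ after each hypothesis and draws the number of updates from a Poisson distribution); for the binary case Y = {−1, +1}, (4) is read here as the sign of the weighted sum.

```python
import math

class WeakLearnerStats:
    """Error bookkeeping of Eq. (3) for one hypothesis h_m."""
    def __init__(self):
        self.lam_sc = 1e-6  # accumulated weight of correctly classified samples
        self.lam_sw = 1e-6  # accumulated weight of misclassified samples

    def update(self, correct, lam):
        if correct:
            self.lam_sc += lam
        else:
            self.lam_sw += lam

    def error(self):
        return self.lam_sw / (self.lam_sc + self.lam_sw)

def online_ensemble_vote(predictions, stats):
    """Binary reading of Eq. (4): sign of the log-odds-weighted vote."""
    score = sum(math.log((1.0 - s.error()) / s.error()) * p
                for p, s in zip(predictions, stats))
    return 1 if score >= 0 else -1
```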

When ensemble classifiers are used online for target tracking [10, 20], there is no training set in the traditional sense. Instead, samples arrive sequentially in time and are processed (learned) one by one and then discarded. In particular, in our case at each time t the ensemble learns one positive and one negative sample. Let \({\mathsf{x}}\) be the subregion of the image \(I_t\) that has been classified by the ensemble H as the target. Then \({\mathsf{x}}\) is considered as the new positive sample to be learned. The negative sample, on the contrary, is a background patch (negatively labelled) \({\mathsf{x}}_b\) chosen at a random position in the image \(I_t\), with a preference for the area surrounding the target.

Considering every hypothesis \(h_m\) as a Naive Bayes classifier that discriminates between the background and the target, the outputs over the positive (target) and negative (background) samples can be modelled by two normal distributions \({{\mathcal{N}}}(\mu_{m,y}, \sigma_{m,y}^2)\), where y ∈ Y = {−1, 1}. The class of a sample \({\mathsf{x}}\) is determined by the highest posterior probability given by the Bayesian classifier (maximum a posteriori)

$$ h_m({\mathsf{x}}) = \hbox{argmax}_{y \in Y} P (y \,|\, {\mathsf{x}}), \quad P (y \,|\, {\mathsf{x}}) \propto P(y)\, {{\mathcal{N}}} ({\mathsf{x}}; \mu_{m,y}, \sigma_{m,y}^2)$$
(5)

To learn a sample \({\mathsf{x}}\), the parameters of the distributions at time t are modified as follows

$$\mu_{m,y,t} = (1-\rho) \mu_{m,y,t-1} + \rho h({\mathsf{x}})$$
(6)
$$\sigma^{2}_{m,y,t} = (1-\rho) \sigma^{2}_{m,y,t-1} + \rho (h({\mathsf{x}}) - \mu_{m,y,t})^2$$
(7)

where ρ is a learning rate parameter.
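
A possible reading of (5)–(7) as code is sketched below; the scalar feature response `feat(x)` (a Haar, LBP or histogram score) and the equal class priors are assumptions of this sketch, which interprets the quantity being modelled as the feature response.

```python
import math

class GaussianWeakLearner:
    """One hypothesis h_m: a Gaussian model of a scalar feature response
    per class y in {-1, +1}, updated with the running estimates (6)-(7)."""
    def __init__(self, feat, rho=0.25):
        self.feat = feat            # callable: image patch -> scalar response
        self.rho = rho              # learning rate (rho = 0.25 in Sect. 4.1)
        self.mu = {-1: 0.0, +1: 0.0}
        self.var = {-1: 1.0, +1: 1.0}

    def learn(self, x, y):
        f = self.feat(x)
        self.mu[y] = (1 - self.rho) * self.mu[y] + self.rho * f
        self.var[y] = (1 - self.rho) * self.var[y] + self.rho * (f - self.mu[y]) ** 2

    def classify(self, x, priors={-1: 0.5, +1: 0.5}):
        """MAP decision of Eq. (5) with Gaussian likelihoods."""
        f = self.feat(x)
        def posterior(y):
            v = max(self.var[y], 1e-9)
            return priors[y] * math.exp(-(f - self.mu[y]) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
        return max((-1, +1), key=posterior)
```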

3 Processing steps

This section provides a step-by-step description of the processing phases required to build on-the-fly the proposed online boosted cascaded ensemble of classifiers.

The principle behind a cascade of classifiers is quite simple: the hypotheses are organized in a pyramidal fashion, often called a cascade of attention, composed of several levels that follow a coarse-to-fine strategy, as illustrated in Fig. 1. The upper levels, which comprise a small number of classifiers, are applied first. Their job is to provide a quick response with high sensitivity (few false negatives). In terms of object tracking, this means that these levels are unlikely to miss the object to be tracked, but a lot of background will probably also be considered similar to the target. The deeper the level, the more populated the ensemble, the more confident the classifier, and the more operations are required.

Fig. 1 Architecture of the proposed attention cascade algorithm

In [33], a single level of the cascade is built by analysing the performance on the training set and using a threshold (e.g. on false positives) to determine the stopping point. As already mentioned, while the classical boosting algorithm maintains a distribution over the training data according to the “difficulty” of the samples, no such distribution is available in the case of OB, since new data are continuously streaming in (i.e. new frames from a video sensor) [20].

We propose to exploit the error associated with each hypothesis (3) at time t to estimate the accuracy of the classifier at time t + 1. In particular, a weak learner with a low error has, by definition, a low rate of false negatives and false positives. We can therefore suppose that a low-error weak learner at time t is likely to perform well at time t + 1. We also suppose that the object we are searching for maintains a similar appearance in two subsequent images, which is a reasonable assumption given the high input frame rate.

Moreover, when the ensemble is composed of different types of classifiers or features as weak base learners, each of them can require a different number of operations to be applied; for this reason, a cost factor \(\gamma_m\), which indicates the time needed to compute a hypothesis \(h_m\), has to be considered.

Following this reasoning, the hypotheses with low error and low computational cost should be placed in the first levels of our cascade, in order to reject the highest number of regions not containing the object (true negatives) while preserving a low false negative rate and maintaining the real-time constraint.

3.1 Cost factor

When structuring a framework that applies several types of hypotheses in real time, an important consideration is the number of operations each single weak learner requires in practice, so that the classifiers can be properly distributed among the levels.

To each hypothesis \(h_m\) we assign a weight \(0 \leq w_m \leq 1\) that denotes its “importance”

$$w_m = 1- \frac{\epsilon_m + \gamma_m}{\epsilon_m \gamma_m +1}$$
(8)

which comprises the error \(\epsilon_m\) and the computational cost \(\gamma_m\). By means of this value we can build the cascade, including in the first levels the classifiers with a high weight, so as to provide a good mix of accuracy and speed. We then normalize all the \(w_m\) so that \(\sum_m w_m = 1\).

The cost \(\gamma_m\) related to hypothesis \(h_m\) in (8) can vary considerably depending on the kind of classifiers involved: Haar features are computationally less expensive, for example, than colour histograms. The costs are determined by an off-line procedure that times the process of extracting and probing each feature, thus avoiding the need to set them manually. This set-up phase is independent of the video sequence used and also of the size of the target, as these features can be computed in constant time through fast data structures such as integral images and integral histograms. After timing each feature type, its cost is obtained by dividing by the slowest time. In this way, the cost values are normalized so that \(0 < \gamma_m \leq 1\), with the slowest feature type having \(\gamma_m = 1\).
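
The off-line timing step could be sketched as below (a hypothetical helper, not the authors’ code); each probe callable stands for the extraction and evaluation of one feature type on a patch.

```python
import time

def measure_costs(feature_probes, patch, repeats=1000):
    """Time each feature type and normalize by the slowest one, so that
    0 < gamma_m <= 1 and the slowest feature type gets gamma = 1."""
    elapsed = {}
    for name, probe in feature_probes.items():
        t0 = time.perf_counter()
        for _ in range(repeats):
            probe(patch)
        elapsed[name] = time.perf_counter() - t0
    slowest = max(elapsed.values())
    return {name: t / slowest for name, t in elapsed.items()}

# Example (dummy probes standing in for Haar / LBP / colour histogram):
# costs = measure_costs({"haar": haar, "lbp": lbp, "hist": hist}, patch)
```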

Considering an upper limit 0 ≤ W ≤ 1 for the sum of the weights of the classifiers in each level l, the strong classifier at level l is

$$H_l = \left\{ h_1, \ldots, h_k \right\} : \sum_{k} w_k \leq W$$
(9)

We notice that the higher W, the more classifiers are included in the first levels. In this case the cascade is shorter, as shown in Fig. 2, and the computation time increases: to search for the object of interest, more operations have to be performed in the first levels on all the subregions of the image.
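
The level construction of (8)–(9) can be sketched as follows: weights are computed from error and cost, normalized to sum to one, and the sorted hypotheses are greedily packed into levels until each level’s weight budget W is reached. This is an illustrative sketch under the assumption that ties and degenerate cases need no special handling.

```python
def hypothesis_weights(errors, costs):
    """Eq. (8): w_m = 1 - (eps_m + gamma_m) / (eps_m * gamma_m + 1),
    then normalized so that sum_m w_m = 1."""
    w = [1.0 - (e + g) / (e * g + 1.0) for e, g in zip(errors, costs)]
    total = sum(w)
    return [wi / total for wi in w]

def build_levels(hypotheses, weights, W):
    """Eq. (9): sort by descending weight and fill each level until the
    sum of its weights would exceed the budget W."""
    order = sorted(range(len(hypotheses)), key=lambda m: weights[m], reverse=True)
    levels, current, budget = [], [], 0.0
    for m in order:
        if current and budget + weights[m] > W:
            levels.append(current)
            current, budget = [], 0.0
        current.append(hypotheses[m])
        budget += weights[m]
    if current:
        levels.append(current)
    return levels
```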

Fig. 2 Illustration of the distribution of the classifiers in the levels of the cascade when varying the parameter W

3.2 Enforcing the real-time constraint

Since real-time execution is the goal of the proposed approach, we let the frame rate drive the process of building the cascade. Exploiting the fact that W can modify the number of levels and thus the shape of the cascade (Fig. 2), we tune the level limit W to reflect the need for speed. Since W determines the number of features included in a level, if the cascade has been too slow at time t, its shape should be revisited and stretched: \(W_{t+1}\) should be a more tolerant limit, and therefore have a smaller value.

Let \({\rm FR}_{\rm opt}\) be the desired frame rate and \({\rm FR}_t\) the current frame rate at time t; then \(W_{t+1}\) becomes

$$W_{t+1} = W_{t} \times \frac{{\rm FR}_t}{{\rm FR}_{\rm opt}}$$
(10)

Note that values of W > 1 have no meaning, since the upper limit W = 1 corresponds to the entire set of classifiers being placed in a single level (monolithic ensemble). With respect to our previously adopted solution, where two thresholds on the error rate and the cost rate were used to build the cascade levels, here we use only the frame rate to construct the cascade. More precisely, the approach is now almost threshold-free: the current frame rate \({\rm FR}_t\) can be read from the system, the target frame rate \({\rm FR}_{\rm opt}\) is generally set to 25 frames per second (fps), and the initial value of W can be randomly initialized or hard-coded to any default value in (0, 1) (e.g. 0.5); the algorithm then adjusts it automatically via (10) to achieve the required frame rate \({\rm FR}_{\rm opt}\). That is, the only input required from the user is the desired frame rate \({\rm FR}_{\rm opt}\). With this new solution, the real-time constraint is enforced, while at the same time the classifiers of each level provide a compromise between accuracy and maximum number of operations per frame, so that the total computational cost of the level is limited and the error is kept as low as possible.
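
The frame-rate feedback of (10) amounts to a one-line update, sketched here with an explicit clamp to (0, 1] since, as noted above, values of W > 1 are meaningless; the lower bound is an assumption of this sketch, used only to keep W strictly positive.

```python
def update_W(W_t, fr_current, fr_opt=25.0, w_min=1e-3):
    """Eq. (10): shrink W when the cascade was too slow, grow it (up to 1)
    when there is spare time; w_min keeps the budget strictly positive."""
    return min(1.0, max(w_min, W_t * fr_current / fr_opt))
```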

Algorithm 1 summarizes the steps taken to construct the cascade. First of all, at time t the hypotheses are trained with both a positive and a negative sample; the error of each boosted classifier is combined with its cost coefficient, providing a weight for each hypothesis. The classifiers are sorted in descending order of weight and organized into levels, which are partitions of the original set. In particular, classifiers are included in a level until the limit W is reached, considering the sum of the weights of the classifiers in that level. The process is repeated by filling the next levels with the remaining base learners. After the levels are completed, their concatenation forms the cascade, which is applied to the image at time t + 1, and the process is repeated.

Algorithm 1 Cascade construction
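
A minimal per-frame sketch of this construction loop, reusing the helpers sketched earlier (hypothesis_weights, build_levels, update_W) and assuming each hypothesis exposes learn() and an error() estimate as in (3), could look as follows. It illustrates the steps described above and is not a reproduction of Algorithm 1.

```python
def construct_cascade(hypotheses, costs, pos_sample, neg_sample,
                      W_t, fr_current, fr_opt=25.0):
    """One iteration of the cascade construction at time t."""
    # 1. Online update of every weak learner with the new samples.
    for h in hypotheses:
        h.learn(pos_sample, +1)
        h.learn(neg_sample, -1)
    # 2. Combine error and cost into a normalized weight per hypothesis.
    errors = [h.error() for h in hypotheses]
    weights = hypothesis_weights(errors, costs)
    # 3. Adapt the per-level budget W from the measured frame rate, Eq. (10).
    W_next = update_W(W_t, fr_current, fr_opt)
    # 4. Sort by weight and partition into levels; their concatenation
    #    is the cascade applied to the frame at time t + 1.
    levels = build_levels(hypotheses, weights, W_next)
    return levels, W_next
```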

Algorithm 2 describes the processing steps for the application of the cascaded ensemble. The cascade obtained at time t − 1 processes every subregion of the frame at time t, also assigning each of them a confidence value. Every sample can be rejected at any level or passed on to the next one; if the bottom of the cascade is reached and the final output is positive, the subregion is flagged with the positive label. If the subregion is rejected at any level, the normalized weighted outputs of the classifiers determine the confidence of the cascade on that subregion; a confidence map keeps record of the decisions of the ensemble.

Algorithm 2 Application of the cascaded ensemble
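
The application phase can be sketched as follows: each subregion descends the cascade, accumulating the confidence of (11)–(13) and being rejected as soon as the cumulative response turns non-positive (14); the accumulated value is stored in the confidence map in every case, and the target is taken at the map’s maximum. Again this is an illustrative sketch with hypothetical names, not the authors’ implementation.

```python
import math

def apply_cascade(levels, errors_per_level, subregions):
    """levels[l]: list of weak learners of level l; errors_per_level[l]:
    their errors eps_{k,l}, giving the coefficients beta_{k,l} of Eq. (13)."""
    confidence_map = {}
    positives = []
    for region_id, x in subregions.items():
        conf = 0.0
        reached_bottom = True
        for level, errs in zip(levels, errors_per_level):
            # Eq. (11): add this level's weighted responses to the confidence.
            conf += sum(math.log((1.0 - e) / e) * h.classify(x)
                        for h, e in zip(level, errs))
            if conf <= 0:                 # Eq. (14): early rejection
                reached_bottom = False
                break
        confidence_map[region_id] = conf  # kept for rejected regions too
        if reached_bottom and conf > 0:   # Eq. (15): final positive output
            positives.append(region_id)
    # The target location is the maximum of the confidence map (Fig. 3).
    best = max(confidence_map, key=confidence_map.get)
    return best, positives, confidence_map
```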

3.3 Cascade confidence

Concerning the output of the cascade, we consider both the response of the classifier set \(H_l\) at level l and, recursively, the output of the previous levels. On a subregion \({\mathsf{x}}\) of the image, the confidence of the ensemble at level l is defined as the sum of the confidence of the previous level and the response of the \(K_l\) weak classifiers in \(H_l\), as given by the following formula

$$conf_l({\mathsf{x}}) \equiv conf_{l-1}({\mathsf{x}}) + \sum\limits_{k=1}^{K_l} \beta_{k,l} h_{k,l}({\mathsf{x}})$$
(11)

where the confidence of the first level is defined as

$$conf_0({\mathsf{x}}) \equiv \sum\limits_{k=1}^{K_0} \beta_{k,0} h_{k,0} ({\mathsf{x}})$$
(12)

and the coefficients \(\beta_{k,l}\) are given by

$$\beta_{k,l} = \log \left(\frac{ 1 - \epsilon_{k,l}}{\epsilon_{k,l}}\right)$$
(13)

as in the boosting weighting scheme. The smaller the error of \(h_{k,l}\) on the training samples, the larger the coefficient \(\beta_{k,l}\) assigned to it. The number of classifiers \(K_l\) in level l is \(|H_l|\), where the operator |·| provides the cardinality of a set, and it follows that \(\sum_l K_l = M\).

As detailed in Algorithm 2, in the training phase the weak learners are updated with the boosting algorithm. At classification (probing) time, if the ensemble \(H_l\) at level l validates the subregion \({\mathsf{x}}\), establishing that

$$ \hbox{sign}(\hbox{conf}_l({\mathsf{x}})) = +1 $$
(14)

then the sample is evaluated by level l + 1. The output of a level is based on the response of the previous levels and on its own hypotheses, as per (11). The output of the cascaded ensemble \(H_{\rm out}\) on the input \({\mathsf{x}}\) is thus given by:

$$H_{\rm out}({\mathsf{x}}) = \hbox{sign} \left( \hbox{conf}({\mathsf{x}}) \right)$$
(15)

where

$$\hbox{conf}({\mathsf{x}}) = \sum\limits_{l=1}^{L} \sum\limits_{k=1}^{K_l} \beta_{k,l} h_{k,l}({\mathsf{x}})$$
(16)

Finally, considering the responses of the classifier on every subregion of the image, a confidence map is built. The target position corresponds to its maximum and can be found by a simple max operation. An example of a confidence map is presented in Fig. 3.

Fig. 3 Example of a confidence map on a frame; the highest peak represents the position of the target

The computational effort at run time depends on the number of hypotheses applied per region. With our solution, not all the hypotheses of the ensemble are necessarily applied at the same time on the region of interest, but possibly only a small subset whose cardinality depends on the threshold W. The probing time for the strong classifier is \({{\mathcal{O}}}(\sum\nolimits_{l=1}^L K_l) = {{\mathcal{O}}}(\sum\nolimits_{l=1}^L \left| H_l \right|) = {{\mathcal{O}}}(\left| H\right|)\), since in the worst case all the hypotheses are tested.

4 Experiments

We have applied the proposed approach to object detection and tracking for video surveillance purposes. In this section, several experiments performed on standard real-world video sequences are presented to validate the proposed framework. The hardware employed in all the tests is an AMD Athlon64 3500+ with 1 GB of RAM. All the modules have been implemented in C++ using fast data structures, i.e. integral images and integral histograms, to reduce the computational requirements.

4.1 Settings

We used three different types of features to describe moving objects: Haar features [16], local binary patterns (LBP) [19], and colour histograms [27]. The common start-up procedure for ensemble tracking methods requires that the object to be tracked be initialized by a change detection procedure or by hand [20]. Similarly to most of the recent literature [10], we employ a fixed-size window to track the object. As a natural evolution of the present work, we will consider in the future the application of the cascade to variable-scale target tracking.

As regards the ρ parameter for the classifier update, it was set to 0.25, taking into account the frame rate and the typical movement speed of the observed objects [30]; small changes in its value produced no significant effects.

4.2 Comparison of different values of W

First of all, we performed an experiment testing different values of W, since the variation of this parameter leads to a trade-off between execution time and ensemble accuracy. This experiment compares, in terms of accuracy and speed, our cost-based cascade framework with an ensemble of fixed dimension, called monolithic, when tracking an object of interest. We applied three ensembles, consisting of 300, 400 and 500 heterogeneous classifiers respectively, to a standard sequence taken from [29]. The video comprises 1,340 frames at 320 × 240 pixel resolution in which a puppet doll is moved under a light bulb. The target was manually initialized in the first frame on the puppet’s muzzle with a 40 × 40 pixel area.

Figure 4 shows the average evaluation time and the corresponding frame rate for each ensemble type. The evaluation time comprises the time required for building the cascade and the time required for actually probing it on the image. For each ensemble type, the performance of the monolithic and the cascaded versions is shown. As can be seen, the monolithic version is generally slower than its cascaded counterpart, and the performance gap narrows for increasing values of W. That is, the higher the value of W, the slower the cascade, as its levels eventually collapse into one (monolithic classifier). Of course, low values of W yield high frame rates but at the cost of accuracy.

Fig. 4 a Average evaluation time (in milliseconds) for the different-sized cascades for increasing W values. b Frames per second achieved by the proposed algorithm when varying the number of classifiers involved and W

To evaluate the accuracy of the classifiers, we measured their sensitivity and specificity on the aforementioned video sequence by varying the acceptance function of the classifiers (14) and plotting ROC curves. Note that the acceptance threshold has been changed only for the sake of the experiment; the focus should be on the effects given by different values of W. Figure 5 shows the ROC curves that indicate the performance of the cost-based cascades; these should be correlated with the results of Fig. 4. Each chart plots the ROC curves obtained by the monolithic classifier and by three cascades with different settings of W: the first one has W = 0.1, the second one W = 0.3, and in the third one W is self-tuned according to the proposed mechanism (10) for a target frame rate FR = 25. The two fixed values have been chosen to reflect different possible conditions: the lower W, the stricter the limit imposed on the number of eligible hypotheses, thus preferring a fast system with more lightweight levels. A medium tolerance threshold W allows a larger number of hypotheses in the initial levels. The higher W, the larger the number of weak learners admitted in the first stages of the computation, thus generating a shorter cascade that requires more operations in the initial stages.

Fig. 5 ROC curves for three proposed cascades in the case of 300 (a), 400 (b) or 500 (c) features involved. The monolithic ensemble is compared with the cascades obtained using W = 0.1, W = 0.3 and the proposed solution that automatically tunes W

We tested three ensembles composed of 300, 400 and 500 heterogeneous features, respectively; considering both the area under the curve (AUC) and the speed, as expected, the more accurate ensembles are those that require more computational time. In particular, the non-cascaded approach scored the highest accuracy when applied to the Sylvester scenario, but the most accurate system is also the most computationally demanding. The cascade with W = 0.3 is slightly less accurate than the monolithic detector, but it drastically reduces the application time as the number of features grows, dramatically increasing the frame rate. The same can be said of the other cascades, including the self-adapting one with variable W.

The rationale behind this is that the most expensive features (LBP and colour histograms) are the most robust and accurate, so they classify the samples better, but they are placed in the lower levels of the cascades due to their high cost. For instance, Fig. 6 shows the distribution of 300 features across the levels for three different values of W. As we can see, when strict cost constraints are imposed, only a small number of “heavy” features is included in the first levels, while lightweight features, even if inaccurate, are preferred. However, while in the monolithic ensemble all the features are applied at the same time, this is avoided when the cascade comes into play, because the levels are applied subsequently only when necessary, speeding up the application.

Fig. 6 Examples of population of a cascade of 300 features for a W = 0.6, b W = 0.4, c W = 0.2

4.3 Comparison with the OB

In the first two rows of Fig. 7, we can see the output of the monolithic detector and the output of the cascaded detector with automatic setting of W, both comprising 400 features. In the last row, the distance of the monolithic and the cascaded detector from the ground truth is presented; the shift is represented by its absolute value in pixels. In the second part of the sequence, both detectors have a slight offset with respect to the ground truth. This is due to the wide and sudden changes in both the illumination and the orientation of the target, and, in particular, to the transition from the frontal view to the full profile or to a top view. The cascaded detector proved slightly more sensitive to variations: in the worst case, the cascade showed a 14 pixel shift, while 11 pixels was the maximum shift for the monolithic classifier. In this experiment, the average error of the monolithic classifier was 4.0 pixels, while the average error of the cascade was 4.8 pixels.

Fig. 7 Results on the Sylvester sequence [29]. The first row shows the monolithic approach, while the second row shows the output of the cascaded ensemble with automatic setting of W. The bottom row presents the shift in pixels with respect to the ground truth for the x (left box) and y (right box) coordinates of the cascade (a, b) and of the monolithic classifier (c, d). The cascaded detector proved slightly more sensitive to variations; in the worst case, it showed a 14 pixel shift, while 11 pixels was the maximum offset for the monolithic classifier. The cascade achieved 25 frames per second (fps), the monolithic ensemble 15 fps

Regarding the confidence of the cascades compared with the monolithic detector, the more classifiers in the ensemble, the better the confidence value. The trend of the confidence values for different ensemble sizes in the monolithic and cascaded cases is shown in Fig. 8a–c. The variation of the parameter W in the case of automatic setting is presented in Fig. 8d. In the first part of the sequence, the object maintains a frontal position and its appearance does not vary noticeably; in the second half, on the contrary, the puppet rotates out of plane, rolls and translates, causing the confidence to decrease.

Fig. 8 Confidence values for a 300 (a), 400 (b), and 500 (c) cascade classifier on the video sequence of Fig. 7. d Dynamically tuned W values obtained by the proposed cascaded classifier

The cascaded ensemble processed the video at an average of 25 fps, as shown in Fig. 9, while the monolithic ensemble performed at an average of 15 fps. The imposed optimal frame rate FRopt was 25 fps.

Fig. 9 Frame rate of the cascaded ensemble composed of 400 features. The choice of W is automatic (Fig. 8d) and the optimal frame rate was set to 25 fps

The results obtained on several sequences from the CAVIAR dataset (http://homepages.inf.ed.ac.uk/rbf/CAVIAR/) are presented in Table 1. The cascaded online boosting (COB) and the OB results are compared in terms of Euclidean distance from the ground truth. The mean error (\(\mu_{\rm COB}\) and \(\mu_{\rm OB}\)) and its standard deviation (\(\sigma_{\rm COB}\) and \(\sigma_{\rm OB}\)) are reported for both algorithms, each using 400 classifiers. The data are averaged over the number of subjects (#Obj) involved, and #Frames refers to the number of frames of tracking only. We set \({\rm FR}_{\rm opt} = 25\) and tracked one target in the scene. The search area was restricted to 35% more than the size of the target. The bounding box in the initial frame was set manually or via a change detection algorithm.

Table 1 Results on CAVIAR sequences

4.4 Multiple targets

To test the performance of the system when more than one object is in the scene, we employed the CAVIAR video sequence dataset to measure the computation times of the proposed approach with ensembles of different sizes, compared with the fixed-size ensemble.

We initialized the targets with the output of a simple change detector, and then the OB algorithm and the proposed one were employed to track two targets. In this case, a simple background subtraction technique could not be employed to detect the pedestrians in all the frames of the sequence because of the sudden shadows on the floor and the illumination changes throughout the scene. On the contrary, detection via classification is more robust and can be applied at each frame.

In Figs. 10 and 11, the computation times and the frame rates for the different algorithms are presented. Three fixed-size ensembles, consisting of 300, 400 and 500 classifiers, are compared with three cascades of the same size. As expected, with respect to a single-object scenario the computation times are doubled and the frame rate is halved, while the cascades remain faster than the non-cascaded ensembles. As we can see, for all the algorithms involved, the computation time increases with W; in particular, beyond a certain value of W the cascades have a computational cost equal to or greater than that of the monolithic approach. This can be explained by the fact that the evaluation time of the cascades also includes the overhead of sorting the classifiers and building the whole structure. Generally, we can state that for high values of W the cascades increase their accuracy but also their computational cost, making them less preferable than a fixed-size approach because of the overhead of organizing the whole structure.

Fig. 10 Average time of computation for the different-sized cascades while increasing W

Fig. 11 Frames per second achieved by the proposed algorithm when varying both the number of classifiers and W

In Fig. 12, the confidences of the ensembles on the aforementioned video sequence are displayed. Here only the result on the first 550 frames is shown and, for the sake of readability, only the confidence of the cascaded ensemble consisting of 400 features is presented. In the first part of the sequence a man walks along the corridor until he exits the scene. Another person comes from the upper part of the image and meets the previous man, who re-enters the scene. As we can see in the graph, the confidence of the detector decreases in some frames when there are ambiguities in the scene, such as the two men crossing or a variation in appearance or shape. This does not affect the final frame rate; however, it slightly affects the detection accuracy, as a small shift of a few pixels is visible.

Fig. 12 Confidences of three cascaded ensembles consisting of 400 classifiers, each one assigned to a distinct subject in the same CAVIAR sequence. When the two targets cross each other, the detection accuracy is affected, and a small shift of a few pixels is visible. Generally, the confidence of the detector decreases when there are ambiguities or variations in the appearance of the objects

With continuous updates and using 400 heterogeneous features to build the cascade, the proposed approach processed the output at about 25 fps, which was the desired frame rate \({\rm FR}_{\rm opt}\). The values of W, which is automatically adjusted throughout the sequence, are presented in Fig. 13. In the first frames W takes high values, decreasing progressively afterwards. This behaviour allows short but accurate cascades to be built at the beginning of the sequence, while at the end of the sequence the cascades are longer and consist of more levels so as to reduce the computational burden.

Fig. 13 Values of W in a sample sequence taken from the CAVIAR dataset and referring to the scene presented in Fig. 12. W starts with a high value, and then decreases allowing lighter computational requirements

5 Conclusions

In this paper, we devised an algorithm that dynamically builds a cascade of classifiers to speed up the OB technique. The cascade explicitly considers the computational cost of the involved features to maintain real-time performance. The structure of the cascade and its classifiers are automatically adjusted, balancing speed and accuracy. Comparisons with monolithic online ensembles, in terms of accuracy and speed, on standard real-world video sequences have demonstrated the effectiveness of our idea.