1 Introduction

Background subtraction (BS) is a method to classify the pixels within the generic frame of a video sequence into background and foreground sets. The foreground pixels represent moving objects, usually exhibiting significant variations over time, whereas the background pixels represent stationary objects, typically remaining static for a long period of time. BS and the subsequent foreground detection are fundamental steps in many computer vision applications, such as intelligent video surveillance, visual observation of animals and insects, optical motion capture, human–machine interaction and content-based video coding [1, 2].

Recently, the design of real-time (RT) embedded video systems and smart cameras has received a great deal of attention as appealing hardware supports for the above-cited applications [3–7]. Unfortunately, several BS schemes known in the literature, such as [8–18], even though offering high accuracy in detecting foregrounds, are not suitable for smart camera applications as they are compute-intensive, thus resulting in high power consumption and low speed. Conversely, other BS approaches, such as [19–24], due to their lower computational complexity, can significantly reduce power consumption and increase speed, thus being ideal for smart cameras, but achieve lower accuracy.

Each BS algorithm has, therefore, its own benefits and drawbacks, also depending on the exploited color representation. In fact, the adopted pixel color model can significantly influence the overall performance, both in terms of computational complexity and in terms of accuracy. When pure color models (e.g., HSV, c1c2c3, YCrCbCg) are exploited, critical situations, such as camouflage, moved background objects, dynamic background, beginning moving objects, sleeping foreground objects and shadows, can be successfully addressed [1]. Conversely, the use of more complex color descriptors, such as the color invariants (CIs) [25], allows obtaining information independent of illumination intensity, reflectance property and viewpoint. Unfortunately, the exploitation of such descriptors further increases the computational complexity. For these reasons, in RT systems speed and power performance are often traded off against the simplicity of the background model, the lack of adaptive background updates and the sensitivity to even small background changes.

This paper presents a novel lightweight and efficient BS algorithm for RT applications. The proposed multimodal approach, named multimodal background subtraction exploiting color invariant and grayscale information (MBSCIG), exploits two simple background models separately built for the color invariant H and the grayscale pixel intensities. An approximated version of the novel algorithm (called MBSCIGA), targeting an efficient hardware implementation with limited requirements of computational and memory resources, is also presented. In order to evaluate the accuracy achieved by the two versions of the proposed algorithm, several tests have been performed on different video sequences. The comparison with several algorithms known in the literature demonstrates that both versions of the proposed algorithm reach higher percentages of correctly classified pixels with a reduced computational complexity.

With the main objective of furnishing a hardware implementation suitable for integration in low-cost high-definition embedded video systems and smart cameras [3–7], the proposed design has been characterized for several image sizes. Post-place and route implementation results demonstrate that, when implemented within a Xilinx Zynq-7000 programmable SoC, the presented system, based on the MBSCIGA algorithm proposed here, occupies only ~38, ~19 and ~4% of the slice LUTs, slice FFs and 36 Kb Block RAMs, respectively, available within the xc7z020 device and processes Full HD (1920 × 1080) RGB video sequences with a frame rate up to 74 fps.

The rest of the paper is organized as follows: Sect. 2 provides an overview of the BS algorithms proposed in the last decade; the novel MBSCIG algorithm and its approximated hardware-oriented version are described in Sect. 3; experimental results obtained processing several video sequences are presented and discussed in Sect. 4; the hardware implementation of the MBSCIGA algorithm is described in Sect. 5; and, finally, conclusions are drawn in Sect. 6.

2 Related works

BS algorithms classify the pixels of each frame in an acquired video sequence either as foreground or as background pixels [1, 2, 12, 15–19, 25–33]. This classification process is usually performed through the following main steps (a minimal sketch in code follows the list):

  • Background model initialization: a background reference is computed to define the stationary objects initially present in the scene; the initial model can be built from a single frame or by gathering several frames. This step can be critical when fast and sudden changes occur in the observed scene.

  • Background model update: the background model is updated with the information within the currently processed frame. Such a step is not critical in the presence of objects that remain static for a long period of time. On the contrary, managing sudden changes in the scene is a challenge, since significant changes in the chromatic properties of the current frame can cause pixel misclassifications (i.e., pixels belonging to the background are recognized as foreground pixels, and vice versa).

  • Foreground detection: the objects that were not present in the initial scene are detected by comparing the current frame with the background model and are then classified as foreground.

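As a concrete, minimal illustration of the three steps above, the following C++ sketch implements the simplest possible instance (frame differencing against an exponentially blended reference); the names, the 0.95/0.05 blend and the threshold of 30 are illustrative choices, not taken from any of the cited algorithms.

```cpp
// Minimal instance of the three-step BS pipeline: init, detect, update.
#include <cstdint>
#include <cstdlib>
#include <vector>

using Frame = std::vector<uint8_t>;   // one intensity value per pixel

void processSequence(const std::vector<Frame>& frames) {
    // 1) Background model initialization: here, naively, the first frame.
    Frame background = frames.at(0);

    for (size_t f = 1; f < frames.size(); ++f) {
        const Frame& cur = frames[f];
        std::vector<bool> isForeground(cur.size());  // mask for downstream use

        for (size_t p = 0; p < cur.size(); ++p) {
            // 3) Foreground detection: threshold the frame difference.
            isForeground[p] =
                std::abs(int(cur[p]) - int(background[p])) > 30;
            // 2) Background model update: blend the current frame in.
            background[p] =
                uint8_t(0.95 * background[p] + 0.05 * cur[p]);
        }
    }
}
```
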
The background models used in BS algorithms can be mainly classified as summarized in Table 1.

Table 1 Overview of existing algorithms

Basic models are based on the analysis of a certain number of frames over time; the average, the mode, the median or a histogram is computed to model each pixel of the background [34]. Then, the difference between the current frame and the background is thresholded to identify foreground pixels [8, 9]. The threshold can be either kept at a fixed value for the entire process or updated at run-time according to the results coming from the foreground detection step [35].
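
For instance, a basic model built as the per-pixel median of the last N frames with a fixed decision threshold could look like the following sketch; N and T are illustrative values, not prescribed by the cited works.

```cpp
// Basic background model: per-pixel median of the last N frames,
// compared to the current pixel with a fixed threshold T.
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstdlib>

constexpr int N = 5;    // frames kept in the pixel history
constexpr int T = 30;   // fixed foreground threshold

// history holds the last N values of one pixel, newest last.
uint8_t medianBackground(std::array<uint8_t, N> history) {
    // Partial sort is enough to place the median element.
    std::nth_element(history.begin(), history.begin() + N / 2, history.end());
    return history[N / 2];
}

bool isForeground(uint8_t current, const std::array<uint8_t, N>& history) {
    return std::abs(int(current) - int(medianBackground(history))) > T;
}
```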

Statistical models use a statistical analysis of individual pixels to build the background model. The pixel information of each processed frame is used to dynamically update the statistics of the pixels that belong to the background model. Examples of statistical models are those based on mixtures of Gaussian distributions (GMM), which offer more robustness against frequent and small illumination changes, thus usually achieving good accuracy in outdoor environments [10–13]. In such models, the history of each pixel is modeled over time by the mean and variance values of a Gaussian distribution. However, in the presence of fast variations, a relatively high number of Gaussians must be considered to correctly model the background. Edge segmentation is proposed in [36] to adapt to the motion variation of the background environment. This approach stores static and moving edges into lists and statistically models the background in terms of weight, position, motion, size and shape variation information. A kernel density estimation is also used to build a nonparametric background model in [37]. These algorithms are popular due to their robustness in critical situations, such as the presence of noise, shadows and illumination changes [35].
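
A minimal single-Gaussian pixel model, i.e., the K = 1 special case underlying the GMM approaches cited above, can be sketched as follows; the learning rate alpha and the k·σ decision band are illustrative assumptions.

```cpp
// One pixel modeled by a single Gaussian (mean, variance); a full GMM
// would keep K such modes per pixel, each with its own weight.
struct GaussPixel {
    float mean = 0.0f;
    float var  = 225.0f;   // initial variance (sigma = 15)

    // Classifies the new sample x, then adapts the model. Returns true
    // if the sample is flagged as foreground.
    bool update(float x, float alpha = 0.01f, float k = 2.5f) {
        const float d = x - mean;
        const bool fg = d * d > k * k * var;   // outside k*sigma band
        if (!fg) {                             // adapt only on background
            mean += alpha * d;
            var  += alpha * (d * d - var);
        }
        return fg;
    }
};
```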

Cluster models assume that the pixels in the frame can be temporally represented by clusters and are able to capture structural background motion over a long period of time [14–16, 38]. In [14], an algorithm that uses a codebook model for the background has been proposed: for each pixel, a codebook consisting of one or more codewords is built. The K-means algorithm proposed in [39] uses clusters sorted on the basis of their likelihood to deal with lighting variations and dynamic backgrounds. A variation of the codebook algorithm is proposed in [38], where not all the pixels are handled with the same number of codewords.

Predictive models dynamically build the background model from the past frames by means of a predictive approach. The model can be predicted by Kalman filters [40], Wiener filters [41, 42] and neural networks [43, 44].
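
As an illustration of the predictive category, the following toy scalar Kalman filter tracks a single pixel's background intensity; the process and measurement noise variances q and r, and the residual-based decision, are illustrative assumptions rather than the formulation of [40].

```cpp
// Scalar Kalman filter predicting one pixel's background intensity.
struct KalmanPixel {
    float b = 0.0f;   // background estimate
    float p = 1.0f;   // estimate variance

    // Feeds the observed pixel value z; returns the residual, whose
    // magnitude can be thresholded to flag foreground.
    float step(float z, float q = 0.05f, float r = 16.0f) {
        p += q;                        // predict (static background model)
        const float gain = p / (p + r);
        b += gain * (z - b);           // correct with the observation
        p *= (1.0f - gain);
        return z - b;                  // large |residual| => foreground
    }
};
```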

Fuzzy models have been recently proposed to bring the ability of fuzzy logic to handle uncertainty and imprecision into background subtraction as well [45, 46].

Apart from the adopted BS model, the benefits and drawbacks of each BS algorithm known in the literature also depend on the exploited color representation [1, 2, 12, 15–19, 25–33]. In fact, the color model can significantly influence the overall performance, both in terms of computational load and in terms of foreground detection accuracy. The pure RGB color model is commonly used [32], but alternative color models have been recently investigated. As an example, in [30, 31] it is shown that the usage of the YCbCr and HSV color spaces can improve the pixel classification, whereas [33] demonstrates that using the normalized RGB color components leads to higher overall quality and speed performance than those reachable with the c1c2c3 color representation. In [25], CI expressions have been derived that allow the effects of a large set of disturbing factors, such as illumination, viewing direction, surface orientation and highlights, to be significantly reduced in computer vision applications. A way to efficiently exploit CIs in BS algorithms has been investigated in [17], where the background model is built referring to N previous frames; each frame is described by the color invariants H, Wx and Wy, and each pixel is modeled with a single Gaussian distribution. An alternative approach was presented in [18], where mixtures of Gaussians are calculated on both the grayscale pixel intensity and the color invariant Hx. The two channels are then combined to reduce the number of pixel misclassifications in the presence of shadows, noise and illumination changes.

Recently, several hardware-oriented BS algorithms developed to support RT applications have been proposed [21–24], and specific IP modules have been designed for FPGA platforms [20]. The single Gaussian (SG) algorithm presented in [21] furnishes an efficient way of modeling the generic pixel through a single Gaussian distribution. Conversely, the approach demonstrated in [22] exploits a statistical model based on a ΣΔ multimodal modulation that models each pixel by K distributions, each characterized by a ΣΔ mean, a ΣΔ variance and a weight. The fuzzy running average (FRA) and the fuzzy background update (FBU) algorithms presented in [23] and [24], respectively, provide examples of RT algorithms in which the generic pixel within the background model is updated by means of fuzzy approaches that take into account either the misclassified pixels in the past frames [23] or the neighboring pixels [24]. In [20], the above algorithms have been compared when implemented in hardware using a Xilinx Spartan-3A FPGA chip. Their results will be referred to later in this paper.

In the following Sections, two sets of BS algorithms are referenced as counterparts of the novel BS algorithm proposed here. The first set includes algorithms selected to cover different background and color models, whereas the second set includes hardware-oriented algorithms that are strong competitors for efficient hardware implementations.

3 The novel MBSCIG algorithm

The background model used in the proposed MBSCIG algorithm belongs to the basic models category. The main computational steps performed by the novel algorithm can be summarized as follows. RGB input frames are first processed to obtain the grayscale image GS and the color invariant H. The latter has been chosen because it is less sensitive to imaging conditions and simpler to compute than the alternative color invariants C, W, E, as defined in [25]. The adopted background model consists of N frames acquired before the current one, therefore called historical frames, and one modeled frame containing the current background. As soon as the (N + 1)-th frame (the current frame) is acquired, the foreground detection starts and is executed pixel-by-pixel on both the grayscale and color invariant channels. Each pixel of the current frame is compared to its counterparts in the historical frames to establish whether it varies significantly or not. The pixel is classified as belonging to the foreground only if both channels are consistent with such a decision; otherwise, the pixel is classified as belonging to the background. After that, the current background is updated for the next computation by taking into account both the current pixel value and its stored history with appropriate weights.

The block diagram of the new algorithm is depicted in Fig. 1. The grayscale image GS and the color invariant H are computed as given in (1a) and (1b), respectively, with R, G and B representing the red, green and blue components of the input frame, and E, Eλ and Eλλ being the spectral differential quotients calculated by Eq. (2). It is worth noting that the coefficients used in Eq. (1a) come from the standard RGB-to-gray conversion, whereas the coefficients used in Eq. (2) are provided in [25].

$${\text{GS}} = 0.299 \times R + 0.587 \times G + 0.114 \times B$$
(1a)
$$H = \frac{{E_{\lambda } }}{{E_{\lambda \lambda } }}$$
(1b)
$$\left[ {\begin{array}{*{20}c} E \\ {E_{\lambda } } \\ {E_{\lambda \lambda } } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {0.06} & {0.63} & {0.27} \\ {0.3} & {0.04} & { - 0.35} \\ {0.34} & { - 0.6} & {0.17} \\ \end{array} } \right] \cdot \left[ {\begin{array}{*{20}c} R \\ G \\ B \\ \end{array} } \right]$$
(2)
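
In software, Eqs. (1a), (1b) and (2) reduce to a few multiply–accumulate operations per pixel, as in the following C++ sketch; the small epsilon guarding the division by Eλλ is our addition for numerical safety and is not part of the formulation above.

```cpp
// Per-pixel computation of GS and H from RGB, per Eqs. (1a), (1b), (2).
#include <utility>

std::pair<float, float> grayAndH(float R, float G, float B) {
    const float GS  = 0.299f * R + 0.587f * G + 0.114f * B;   // Eq. (1a)
    const float El  = 0.30f * R + 0.04f * G - 0.35f * B;      // E_lambda, Eq. (2)
    const float Ell = 0.34f * R - 0.60f * G + 0.17f * B;      // E_lambdalambda, Eq. (2)
    const float H   = El / (Ell != 0.0f ? Ell : 1e-6f);       // Eq. (1b)
    return {GS, H};
}
```
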
Fig. 1 Block diagram of the proposed algorithm

The initialization phase of the background model computation is summarized in the pseudo-code reported in Fig. 2a.

Fig. 2 Main computational steps of the novel algorithm: a model initialization; b foreground detection and model update

The first N frames are captured and stored as historical frames. GShi and Hhi (with i = 1,…, N) indicate the grayscale and color invariant histories, respectively, whereas Gm and Hm denote the current background models, which are initially filled with GShN and HhN. The foreground detection and the model update phases are then executed. Such processes are identical for both the GS and H channels; thus, in the following, only the GS channel is referred to. As summarized in the pseudo-code of Fig. 2b, the newest acquired frame GS is compared pixel-by-pixel to all the historical frames GShi and, for each pixel at position (x, y), the percentage variations D between GS(x, y) and the corresponding pixels GShi(x, y) are computed. Then, the number Dc of historical frames with respect to which GS(x, y) varies negligibly (i.e., for which D is less than the threshold Tg) is counted. Similarly, GS(x, y) is compared to the corresponding pixel Gm(x, y) within the current background; its percentage variation is indicated with DD in the pseudo-code of Fig. 2b. Then, the final detection step is executed: if DD is less than the threshold Tg and Dc is higher than the threshold Tgc, it can be concluded that the examined pixel belongs to the background of the image and the output flag IsFg is asserted low. Otherwise, IsFg is asserted high to indicate that the pixel is potentially part of the foreground. In both cases, the pixel Gm(x, y) in the current background is updated as shown in lines 18 and 21 of the pseudo-code of Fig. 2b. The parameters ρB and ρF are used to properly and differently tune this combination in the case of a detected background pixel and a recognized foreground pixel, respectively. Details on the values experimentally selected for N, Tg, Tgc, ρB and ρF are provided in the following section.

As the final step, the historical frames are updated by discarding the oldest one and storing the latest captured frame.

As illustrated in Fig. 1, the same detection/update process is separately performed on the H channel, which generates the flag IsFh. The currently examined pixel is recognized as a foreground pixel only if the flags IsFg and IsFh are both high.
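
Putting the prose of Fig. 2b together, one channel of the detection/update process can be sketched in C++ as follows. Since the exact blending expressions of lines 18 and 21 of the pseudo-code are not reproduced in the text, the two exponential blends below are an assumption about how ρB and ρF weight the combination of the current pixel and the stored model.

```cpp
// One-pixel detection/update for the GS channel (the H channel runs
// the same code with Th/Thc). Parameter values follow Sect. 4.
#include <array>
#include <cmath>

constexpr int   N    = 4;
constexpr float Tg   = 25.0f;   // percentage-variation threshold
constexpr int   Tgc  = 2;       // minimum count of stable history matches
constexpr float rhoB = 0.95f, rhoF = 0.04f;

// Percentage variation of the current value with respect to a reference
// (the guard against a zero reference is our addition).
float pctVar(float cur, float ref) {
    return 100.0f * std::fabs(cur - ref) / (ref != 0.0f ? ref : 1.0f);
}

bool detectAndUpdateGS(float GS, const std::array<float, N>& GSh, float& Gm) {
    int Dc = 0;                         // history frames with negligible change
    for (float h : GSh)
        if (pctVar(GS, h) < Tg) ++Dc;

    const float DD  = pctVar(GS, Gm);   // change vs. the modeled background
    const bool isFg = !(DD < Tg && Dc > Tgc);

    // Model update, weighted differently for background and foreground
    // (assumed form of lines 18 and 21 of the pseudo-code).
    Gm = isFg ? (1.0f - rhoF) * Gm + rhoF * GS
              : rhoB * Gm + (1.0f - rhoB) * GS;
    return isFg;
}
// A pixel is finally foreground only when both channel flags are high.
```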

A hardware-oriented version of the proposed algorithm has also been investigated. In the following, it is named MBSCIG approximated (MBSCIGA) to indicate that it exploits an approximated formulation purposely introduced to make the hardware implementation friendlier. The novel formulation adopted here and provided in (3) approximates the matrix elements used in (2) to compute E, Eλ and Eλλ to their nearest powers of two. By comparing Eqs. (2) and (3), it can be observed that, using the proposed approximations, all the pixel components are slightly de-emphasized. This happens somewhat less prominently for the red component in the computation of the Eλ coefficient and for the green component in the computation of the Eλλ coefficient. The resulting advantage is that Eq. (3) can be easily implemented in hardware through simple shifts and additions, without requiring expensive multiplications. In the following section, the effects of the adopted approximation on the achieved accuracy are examined in depth.

$$\left[ {\begin{array}{*{20}c} E \\ {E_{\lambda } } \\ {E_{\lambda \lambda } } \\ \end{array} } \right] = \frac{1}{{2^{7} }} \cdot \left[ {\begin{array}{*{20}c} {2^{3} } & {2^{6} } & {2^{5} } \\ {2^{5} } & {2^{2} } & { - 2^{5} } \\ {2^{5} } & { - 2^{6} } & {2^{4} } \\ \end{array} } \right] \cdot \left[ {\begin{array}{*{20}c} R \\ G \\ B \\ \end{array} } \right]$$
(3)
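
In code, Eq. (3) indeed needs no multiplications: with 8-bit RGB inputs, the three quotients follow from shifts and additions alone, as in this sketch. Note that the common 1/2^7 scale factor can stay implicit, since H = Eλ/Eλλ is a ratio in which it cancels.

```cpp
// Multiplier-free evaluation of Eq. (3) using only shifts and adds,
// mirroring the MBSCIGA hardware path; inputs are 8-bit RGB values.
#include <cstdint>

struct Quotients { int32_t E, El, Ell; };   // all scaled by 2^7

Quotients spectralApprox(uint8_t R, uint8_t G, uint8_t B) {
    const int32_t E   = (R << 3) + (G << 6) + (B << 5);
    const int32_t El  = (R << 5) + (G << 2) - (B << 5);
    const int32_t Ell = (R << 5) - (G << 6) + (B << 4);
    return {E, El, Ell};   // H = El / Ell; the 1/2^7 factor cancels
}
```
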

4 Experimental results

The proposed MBSCIG algorithm and its approximated version have been characterized in terms of computational complexity and several accuracy metrics, measured referring to datasets available online [47–50]. As summarized in Table 2, the selected benchmarks represent typical situations that can make the extraction of moving objects critical, thus furnishing an effective way to highlight the strengths and weaknesses of each analyzed algorithm.

Table 2 Video sequences used as benchmarks

C++ routines of the proposed MBSCIG and MBSCIGA algorithms have been used to evaluate their segmentation quality. To this purpose, their output sequences were compared to the ground truths available within the chosen datasets. This allowed counting the true positive (TP) and true negative (TN) pixels, which represent, respectively, the number of foreground and background pixels correctly recognized as foreground and background, together with the false positive (FP) and false negative (FN) pixels, which refer, respectively, to the number of background pixels erroneously classified as foreground and to the number of foreground pixels misclassified as background. Then, the percentage of correct classifications (PCC) given in (4a), of correct background classifications (PCCB) provided in (4b), and of correct foreground classifications (PCCF) shown in (4c) have been computed. As is well known, these percentages, taken independently of each other, cannot offer an adequate comparison when analyzing different algorithms [51, 52]. Therefore, the F1 and the similarity (SM) [53] metrics defined in (4d) and (4e) have also been calculated.

$${\text{PCC}} = \frac{{({\text{TP}} + {\text{TN}})}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} \times 100$$
(4a)
$${\text{PCCB}} = \frac{\text{TN}}{{{\text{TN}} + {\text{FP}}}} \times 100$$
(4b)
$${\text{PCCF}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}} \times 100$$
(4c)
$$F1 = 2\; \times \;\frac{\text{TP}}{{2\; \times \;{\text{TP}} + {\text{FP}} + {\text{FN}}}} \times 100$$
(4d)
$${\text{SM}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}} + {\text{FN}}}} \times 100$$
(4e)

As discussed in the previous section, the proposed MBSCIG algorithm uses N historical frames to model the background, and the thresholds Tg, Tgc, Th and Thc, with the weights ρB and ρF, to update it. As in every BS algorithm, threshold and parameter values have to be chosen (see Table 3). This choice is usually performed by maximizing the accuracy over a given set of benchmark sequences. Experimental tests demonstrated that Tgc = Thc = 2 allows good results in the foreground classification process to be achieved, whereas lower threshold values make the model more sensitive to sudden changes in the background. The experimental tuning process has also shown that higher Tg and Th values increase the number of false negative pixels, whereas lower values increase the number of false positives. Then, averaging the results over the benchmarks in Table 2, Tg = 25 and Th = 20 have been found to be the values that ensure the best accuracy. Finally, the parameters ρB and ρF have been set to 0.95 and 0.04, respectively, taking into account that experiments have demonstrated that lower values of ρB and higher values of ρF introduce over-smoothing effects.

Table 3 Parameters used in the compared BS algorithms

Using the parameters chosen above, several tests have been performed on the selected video sequences summarized in Table 2, with N ranging from 1 to 100. Using the same evaluation approach exploited in [20], the achieved PCC values have then been averaged, thus obtaining the results plotted in Fig. 3. The latter shows that N = 4 leads to a quite high average accuracy, only 0.3% lower than that reachable by increasing N to 100. Since N = 4 also ensures significantly lower memory requirements, it has been chosen as the best trade-off.

Fig. 3 Average PCC versus N

The BS algorithms presented in [13, 14, 17, 18, 21–24] were selected for comparison purposes.

They have been chosen not only because they are among the most efficient approaches existing in the literature, but also because, similarly to the proposed MBSCIG algorithm, they perform the foreground detection pixel-by-pixel. Also for these algorithms, several parameters, such as the number of historical frames used to model the background, the threshold values and the number of statistical distributions involved in the background model, must be carefully set to guarantee that the best accuracy is achieved for a given set of benchmark scenarios. Table 3 provides basic information about all the evaluated algorithms, including the adopted color models, the parameters and thresholds with their corresponding meaning, and their experimentally selected values.

Samples of the segmented frames are collected in Fig. 4, which also shows the ground truths referenced to evaluate their accuracies. From a qualitative analysis, it can be observed that some algorithms fail to identify a sufficient number of pixels of the foreground objects in specific sequences. As an example, this occurs for GMM, GMMHG and FBU in the Highway video, and for CB and FBU in the Lobby benchmark. Outputs are generally noisy for the CB and CIHW algorithms. On average, the FRA and MBSCIG algorithms seem to produce the best results, even though somewhat blurred figures are produced by FRA for Highway and Office, and by the proposed MBSCIG algorithms for Lobby and Bootstrap.

Fig. 4 Examples of the processed images

The quantitative accuracy analysis has been performed by examining all frames contained in the benchmark sequences. Results are collected in Table 4.

Table 4 Accuracy results in terms of PCC, PCCB, PCCF, F1 and SM

Preliminarily, we have to point out that the SG, SDM and FRA algorithms do not include a specific background initialization phase. Instead, they rely on the acquisition of a frame free from foreground objects. In all the benchmarks, Bootstrap excepted, frames fully free from foreground objects exist and were selected during the accuracy tests. However, in real environments, such a condition is not always guaranteed, so an appropriate training phase is mandatory [54]. For these reasons, we believe that the precision of the above algorithms could be somewhat overestimated.

Averaging all the F1 and SM measures in Table 4, it can be seen that the proposed MBSCIG algorithms reach the highest accuracy scores. As also shown in the histogram of Fig. 5, the closest competitor is the FRA algorithm. The proposed algorithms are also characterized by a very low sensitivity to the scene. In fact, considering the standard deviation σ and the mean value µ computed for F1 and SM, they show $\sigma/\mu(F_1)$ and $\sigma/\mu(\mathrm{SM})$ as low as 0.18 and 0.22, respectively. Only the SDM algorithm does better, but with a lower average precision. It is also worth noting that the above-mentioned de-emphasis of the RGB components, due to the adopted approximation of Eq. (3), produces beneficial effects in video sequences characterized by repetitive background motion, such as Fountain. In fact, as shown in Table 4, for this benchmark, the PCCF of the MBSCIGA algorithm is the highest one. However, this advantage is strictly related to the video sequence content and cannot be generalized.

Fig. 5 Comparison results in terms of F1 and SM metrics

Each algorithm of Table 4 has, however, its own strengths and weaknesses. The SDM algorithm seems to be the most efficient at coping with the global illumination changes in the Lobby benchmark. GMM and MBSCIG are the most precise when repetitive background motions are present in the scene (Fountain). Even with the limit of the above-mentioned ideal setting of the initial background, FRA and CB show the best reaction to multiple moving objects (Highway). Pixel classification when the background is obstructed by moving objects (Bootstrap) is best handled by the MBSCIG algorithms, whereas SG and FRA excel when static and moving objects have the same color (Office). A nice property of the novel algorithms is that, when they do not win the comparison with their competitors, their precision level is very close to the highest one for all the benchmark scenes, thus resulting in the highest F1 and SM averages.

The computational complexities of all the above algorithms are now evaluated in terms of the number of addition/subtraction (AS) and multiplication/division (MD) operations required in their main computational steps, which can be summarized as: (i) the preprocessing step, needed when color transformations, grayscale intensities and/or CIs have to be computed from the RGB pixels; (ii) the initialization of the background model; (iii) the update of the background model; (iv) the foreground detection. We recall that the SG, SDM and FRA algorithms do not perform any initialization phase, but in practical applications they could require further operations for an appropriate training phase [54].

Table 5 collects the computational complexity data, where Np represents the number of pixels within each frame. The meaning of all the other parameters is reported in Table 3.

Table 5 Computational complexities of the compared algorithms

From Table 5, it is clear that higher computational complexities do not always lead to higher accuracies. As an example, the GMM [13], CB [14] and GMMHG [18] algorithms, though being among the most complex, show relatively low F1 and SM performances. On the contrary, the FRA algorithm [23], which has the lowest number of operations, achieves accuracies well above those of most of the compared algorithms, provided that a suitable initial background frame can be chosen. Undoubtedly, the ability of the MBSCIG and MBSCIGA algorithms to efficiently manage also the above critical conditions makes them suitable for a much wider range of actual application contexts.

5 The hardware architecture

The proposed algorithm could be implemented using DSP-, GPU- or FPGA-based hardware platforms. All these solutions are viable, and the most suitable platform depends on the specific project constraints in terms of image size, throughput, latency, power consumption, cost and development time. The FPGA technology has been selected as the target, since modern FPGA devices run at frequencies compatible with RT applications and offer sufficient logic resources to support complex processing. Moreover, they can achieve computational speeds higher than DSPs and power consumption lower than GPUs, and they guarantee high flexibility with relatively low development time [55, 56].

The proposed algorithm has been implemented using an Avnet ZedBoard that contains an xc7z020 Xilinx Zynq FPGA chip. Xilinx Zynq system-on-chip devices allow the design of a complex embedded system to be efficiently realized by exploiting their embedded Processing System, based on a dual-core Cortex-A9. This processor is equipped with 32/32 KB I/D caches, 256 KB of on-chip RAM and several interfaces on AMBA buses. The xc7z020 SoC has an FPGA fabric with 85 K logic cells and 140 36 Kb memory blocks available for the specific user design.

Two different implementations of the proposed algorithm have been evaluated. In the former, the software code of the MBSCIGA algorithm is executed by one of the available Cortex-A9 cores running at an 800 MHz clock frequency, while the second core runs the Linaro 14.04 “Trusty” operating system. In such an implementation, the FPGA fabric is used just for realizing sub-systems for testing purposes, and the DDR3 memory available on the ZedBoard is exploited. Input video sequences with QQVGA and QVGA resolutions have been processed with frame rates up to 57 and 15.2 fps, respectively. Obviously, larger image sizes can be processed, but the low frame rates achieved would be inappropriate for RT applications.

The latter implementation is a specific stand-alone architecture dedicated to the background subtraction and fully implemented in the FPGA fabric of the SoC. To this purpose, the novel background subtraction algorithm has been coded in VHDL with a limited use of IP cores, thus minimizing the effort required to retarget the design, if needed, onto different hardware platforms. The top-level architecture of the whole circuit is depicted in Fig. 6, where the two computational channels described above are clearly visible. For each pixel in the generic input frame, the modules RGB2H and RGB2G compute the color invariant H and the grayscale data GS, respectively. The grayscale and H data of the four historical frames (GShi and Hhi, with i = 1,…, 4) and of the current background frame (Gm and Hm) are separately stored within the memory modules, also depicted in Fig. 6.

Fig. 6 Top-level hardware architecture

The circuits Check&UpdateH and Check&UpdateG compute the flags IsFh and IsFg that are then logically ANDed to produce the final output IsF, which is asserted to indicate that the processed pixel is recognized as a foreground pixel.

To implement the module RGB2G, the color space converter IP core available within the Xilinx design libraries has been exploited, whereas the circuit illustrated in Fig. 7 has been purpose-designed for the module RGB2H. It can be seen that Eq. (3), applied in the dashed box to compute Eλ and Eλλ, is easily implemented in hardware through right-shifts, 2’s complements and additions. Then, Eλ and Eλλ are divided through a 21-stage pipelined fixed-point divider IP core that computes, at each clock cycle, a new 16-bit value of H with an 8-bit fractional part.

Fig. 7 Structure of the module RGB2H

Figure 8 shows in detail the architecture of the Check&UpdateG module. Its computation can be divided into three main steps, highlighted by dashed boxes: Historical check, Current check and Update. The Historical check sub-module compares the grayscale input pixel GS(x, y) with the corresponding pixels belonging to the N historical frames, in parallel. Thus, N instances of the CC sub-circuit establish whether or not the input pixel differs significantly from the previously stored ones. The binary outputs obtained in this way are then added and compared to the threshold Tgc. The Current check sub-module compares the input pixel GS(x, y) with the corresponding pixel of the current background Gm(x, y) in a similar manner. Such information is then processed to detect whether the pixel is recognized as part of the background or not, and to compute its updated value Gmup(x, y) to be stored in place of Gm(x, y) for the next computation. The same architecture is utilized within the color invariant channel.

Fig. 8 Check&UpdateG module

A different view of the processing system, in which the memory banks are shown explicitly, is illustrated in Fig. 9. N + 1 frame buffers (RAM1–RAM5) are used to store the four historical frames and the current background model. The Initialization System manages the storing of the first four frames in the buffers. Then, when the first pixel GS(x, y) of the (N + 1)-th frame arrives, the computation of the updated background pixel starts. After one clock cycle, all the GShi(x, y) terms have been consumed; therefore, the oldest one [e.g., GSh1(x, y)] can be replaced with the value of the currently processed pixel GS(x, y).

Fig. 9 Memory module

A rotating register keeps track of which buffer holds the oldest frame and is used to control writing into the correct memory bank. After one more clock cycle, the updated background is computed and written into the RAM5 block.
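
In software terms, the rotating-register scheme is simply a ring buffer over the N historical frame banks; the following sketch mirrors that behavior (type and member names are illustrative).

```cpp
// Software analogue of the rotating-register scheme of Fig. 9: a ring
// buffer of N frame banks in which the oldest frame is overwritten in
// place by the newly acquired one.
#include <array>
#include <cstdint>
#include <vector>

template <int N>
struct FrameHistory {
    std::array<std::vector<uint8_t>, N> bank;   // RAM1..RAMN equivalents
    int oldest = 0;                             // the rotating register

    void push(const std::vector<uint8_t>& frame) {
        bank[oldest] = frame;                   // overwrite the oldest bank
        oldest = (oldest + 1) % N;              // rotate
    }
};
```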

When realized on the 85 K logic cells xc7z020 FPGA chip to process RGB video sequences with a frame resolution of 128 × 160 pixels, the proposed system occupies 1868 slice LUTs, 1376 slice registers and 75 internal 36 Kb Block RAMs. It has a latency of 81,955 clock cycles and reaches a maximum clock frequency of 154 MHz, thus producing an output frame every ~0.13 ms. For the same frame resolution, the frame rate reached by the proposed hardware design is more than 132 times higher than that of the pure software execution described above.
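
As a sanity check on the quoted figure, assuming the pipeline sustains one pixel per clock cycle (consistent with the architecture described above), the steady-state frame period is

$$t_{\text{frame}} \approx \frac{128 \times 160\ \text{cycles}}{154 \times 10^{6}\ \text{cycles/s}} = \frac{20{,}480}{154 \times 10^{6}}\ \text{s} \approx 0.133\ \text{ms}$$

which matches the ~0.13 ms reported above; under this assumption, the 81,955-cycle latency affects only the delay of the first output, not the throughput.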

Due to the limited amount of internal RAM resources, if higher resolutions are targeted, external memories have to be used. Of course, several different physical implementations of the memory buffer following the approach shown in Fig. 9 could be adopted, taking into account that a maximum overall bandwidth of 24 Gb/s has to be sustained. As an example, a memory buffer for processing 1024 × 768 RGB video sequences could be realized by using four IDT70T3509M 1024K × 36 synchronous dual-port SRAM chips, each allowing a transfer rate up to 9.5 Gb/s. In this case, one memory chip stores the grayscale information of the historical frames, two more chips store the 16-bit color invariant data and the last one stores the current background.

The realized Zynq-based prototype exploits the DDR3 memory available on the ZedBoard, which allows a data transfer rate up to 50 Gb/s to be achieved. In this case, the proposed system can process Full HD RGB video sequences with a frame rate up to 74 fps and occupies 20,156 slice LUTs, 15,898 slice FFs, 6 36 Kb Block RAMs and 32 MB of DDR3 memory.

Post-place and route characteristics of all the designs implemented using the algorithm proposed here are summarized in Table 6, which also shows data related to the hardware designs presented in [20] for the SDM [22], FRA [23] and FBU [24] algorithms. The hardware system presented in [53] is also included in the comparison, since it represents a good touchstone for the proposed hardware design when processing high-resolution video streams, although the original paper provides accuracy results obtained with a different set of benchmark sequences.

Table 6 Post-place and route implementation results

To guarantee a fair comparison also with [53], the novel algorithm has also been implemented within the XC6VLX240T-1FF1156 FPGA device, referring to the ML605 board equipped with a 512 MB DDR3 SDRAM chip used as the external memory resource. Also in this case, the available 50 Gb/s DDR3 memory bandwidth suffices to support the performance required by the memory buffer of Fig. 9. To process Full HD video sequences, the proposed circuit occupies 14,141 slice LUTs, 14,648 slice FFs and 12 36 Kb Block RAMs, and it reaches a 129 MHz running frequency, thus achieving a frame rate of about 62 fps. The design implemented in [53] reaches a similar frame rate of 60 fps, but it requires more than six times the internal memory, ~69% more external memory, ~17% more slices and ~52% more flip-flops. Accuracy tests, performed on the same benchmarks used in [53], demonstrated that such an architecture reaches average F1 and SM values only ~2.3 and ~3.8% higher than those of the proposed MBSCIGA. It is worth noting that all the above resource counts take into account the memory controllers required to interface the background subtraction engine to the external memories.

6 Conclusion

A novel algorithm for background subtraction tasks, called MBSCIG, has been presented. It jointly exploits grayscale information and the color invariant H at the pixel level and uses a very simple background model, consisting of only four historical frames. The background model is unconventionally updated by analyzing the percentage changes of the current pixels with respect to their counterparts within the historical frames.

An approximated version of the novel algorithm, here named MBSCIGA, has also been proposed, having an efficient hardware implementation as its target. In this version, complex operations are avoided to reduce the logic and memory resource requirements and the computational time.

Referring to several benchmark video sequences, the achieved accuracy has been analyzed in terms of the percentages of correctly classified background and foreground pixels, as well as in terms of the F1 and similarity metrics, which evaluate the overall accuracy of the computed segmented images. The obtained results demonstrate not only that the introduced approximations do not compromise the achieved accuracy, but also that the novel algorithm efficiently trades off the strengths of several state-of-the-art competitors.

With the main objective of demonstrating that the novel algorithm is suitable for integration within low-cost embedded video systems and smart cameras oriented to real-time applications, several hardware implementations have been characterized for different image sizes. Reconfigurable FPGA devices have been selected as the target hardware platform, but ASIC-, DSP- and GPU-based implementations can also be easily carried out. Moreover, since VHDL has been used with a limited number of specific IP cores, the proposed design can be retargeted to different platforms with reduced effort.

When realized within an xc7z020 FPGA chip, the proposed system can process Full HD (1920 × 1080) RGB video sequences with frame rates up to 74 fps, occupying, apart from 32 MB of external RAM, only ~38, ~19 and ~4% of the slice LUTs, slice FFs and 36 Kb Block RAMs, respectively, available within the chip.