
1 Introduction

The Internet of Things (IoT) has been driving research in industry as well as academia for the past decade. An IoT system presents numerous challenges to designers of different expertise. One of the major tasks in designing an IoT system is developing the sensors at the front-end of the system. These sensors are required to perform their respective tasks in real time with limited energy and area. Inefficient sensor implementations in IoT clusters with battery-operated nodes result in limited operational time, as high energy consumption quickly drains the battery. Although energy-efficient communication in IoT has been thoroughly researched, relatively little progress has been made in the design of energy-efficient sensors.

Vision sensors in the form of smart cameras are expected to be a core part of the IoT. This is because many real-life applications depend on visual understanding of a scene. Designing a smart camera for IoT that infers from the scene is a challenging proposition. First, computer vision algorithms have high computational complexity, making them inefficient for IoT. Second, accuracy is generally compromised in hardware implementations of computer vision algorithms to achieve higher speed, lower power, or smaller area.

Background subtraction (BS) is a core computer vision algorithm for segmenting moving objects from a dynamic or static background. BS algorithms aim to build a background model that is robust against lighting changes, camera jitter, and dynamic backgrounds [1]. Generally, BS provides a region of interest (the moving object) in the frame of a given video, which is either used to trigger an alarm or further analyzed to understand the scene. Computational gain is achieved by analyzing only the moving objects in the scene rather than the whole frame.

Researchers have proposed many BS schemes in the past. Almost all of these schemes target high accuracy of BS. Such an approach is useful in some applications but not for smart cameras. The reason is that the algorithms that solely target high accuracy are computationally complex, resulting in high power consumption, delay, and area. On the other hand, some researchers have proposed dedicated implementations of BS, which provide relatively high speed and low-power consumption. However, these implementations generally degrade the accuracy of the algorithms.

Previously, we have described the use of BS in surveillance systems and its implications, and proposed novel methods. Khan et al. [2] present a strategy for obtaining the ideal learning rate of GMM-based BS to minimize energy consumption. A dual frame-rate system, which allows efficient use of memory in a blackbox system, was proposed in [3]. We proposed an accurate, fast, and low-power BS scheme in [4].

Shot noise is the most dominant source of noise in camera systems and is modeled by a Poisson distribution. Therefore, in this chapter, we use a Poisson distribution under the shot noise assumption to model the background pixel intensity. In fact, we use a Poisson mixture model (PMM) to model dynamic backgrounds. We use a relatively stable approach for online approximation of the parameters of the distribution. As a result, the proposed method provides competitive performance compared to common BS schemes.

The rest of the chapter is structured as follows. Section 2 describes some efficient implementations of BS from the literature. In Sect. 3, we present a brief review of EBSCam for BS. Section 4 describes the proposed method of BS. Experimental results are discussed in Sect. 5 and Sect. 6 concludes the chapter.

2 Previous Work

This section is divided into two parts. In the first part, we describe numerous BS algorithms. In the second part, efficient implementations of BS algorithms are presented.

2.1 Background Subtraction Algorithms

Previously, numerous surveys have been performed on BS schemes [5,6,7]. These BS schemes can be classified into region-based and pixel-based categories.

Region-based schemes make use of the fact that background pixels of the same object have similar intensities and variations over time. In [8], the authors divide a frame into N × N blocks and process each block separately. Samples of the \(N^2\)-dimensional vectors are then used to train a principal component analysis (PCA) model. The PCA model is used for foreground classification. A similar technique is described in [9]. In [10], independent component analysis (ICA) of images from a training sequence is used for foreground classification. A hierarchical scheme for region-based BS is presented in [11]. In the first step, a block in the image is classified as background or foreground; in the second step, the block is updated.

Pixel-based BS schemes have attracted more attention due to their simpler implementations. In these schemes, a background model is maintained for each pixel in the frame. The simplest of these methods are classified as frame differencing, where the background is modeled by the previous frame [12] or by a (weighted) average of previous frames. The authors in [13] use the running average of the most recent frames so that old information is discarded from the background model. This scheme requires storing some of the previous frames, thereby consuming larger memory space. In [14] and [15], a univariate Gaussian distribution is associated with each pixel in the frame, i.e., pixel intensities are assumed to follow a Gaussian distribution, and the parameters of the distribution (mean and variance) for each pixel are updated with every incoming frame. The mean and variance of the background model for each pixel are then used to distinguish foreground pixels from the background.
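For illustration, a minimal NumPy sketch of one update step of such a per-pixel running Gaussian model is given below; the learning rate alpha and the deviation threshold k are illustrative values, not those prescribed in [14, 15].

```python
import numpy as np

def running_gaussian_step(frame, mean, var, alpha=0.01, k=2.5):
    """One step of a per-pixel running Gaussian background model.

    frame, mean, and var are float arrays of the same (H, W) shape;
    alpha is the learning rate and k scales the deviation threshold.
    """
    # A pixel is foreground if it deviates from the mean by more
    # than k standard deviations.
    diff = frame - mean
    foreground = diff ** 2 > (k ** 2) * var

    # Recursively blend the new frame into the model; unlike the
    # running average of [13], no previous frames need to be stored.
    mean = (1 - alpha) * mean + alpha * frame
    var = (1 - alpha) * var + alpha * diff ** 2
    return foreground, mean, var
```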

Sigma-delta (Σ-Δ)-based BS [16, 17] is another scheme which is popular in embedded applications [18, 19]. Inspired by analog-to-digital converters, these methods use simple non-recursive approximations of the background image. The background pixel intensity is incremented, decremented, or left unchanged if the input pixel intensity is greater than, less than, or similar to the background pixel intensity, respectively.
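A minimal sketch of this mechanism follows; the amplification factor n and the clamping of the variance estimate are illustrative choices rather than the exact formulation of [16, 17].

```python
import numpy as np

def sigma_delta_step(frame, bg, var, n=2):
    """One step of a basic sigma-delta background model.

    All arrays are signed integers of the same shape; n amplifies
    the deviation used to track the variance estimate.
    """
    # Nudge the background estimate one grey level toward the input.
    bg = bg + np.sign(frame - bg)

    # Track the dispersion of the deviations with the same mechanism.
    delta = np.abs(frame - bg)
    var = np.maximum(var + np.sign(n * delta - var), 1)

    # Label pixels whose deviation exceeds the tracked variance.
    foreground = delta > var
    return foreground, bg, var
```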

Kernel density estimation (KDE) schemes [20, 21] accumulate a histogram for each pixel of a given scene to model the background. Although claimed to be non-parametric, KDE-based schemes require the kernel bandwidth to be chosen in advance. A small bandwidth produces rough density estimates, whereas a large bandwidth produces over-smoothed ones.

Perhaps the most popular BS methods are those based on Gaussian mixture models (GMM). These methods, first introduced in [22], assume that background pixels follow a Gaussian distribution and model each background pixel with multiple Gaussian distributions to capture the multiple colors of the background. Numerous improvements have been suggested to the foreground classification [23] as well as the speed [24,25,26,27] of GMM-based methods. Notable variants of the original work are [28] and [29]. In [28], an adaptive learning rate is used to update the model. In [29], usually referred to as extended GMM or EGMM, the author uses a Dirichlet prior in some of the update equations to determine the sufficient number of distributions to associate with each pixel. The Flux Tensor with Split Gaussians (FTSG) scheme [30] uses separate models for the background and foreground. The method develops a tensor based on spatial and temporal information and uses it for BS.

Another pixel-based method for BS is the codebook scheme [31, 32], which assigns a code word to each background pixel. A code word, drawn from a codebook, captures the long-term background motion associated with a pixel. This method requires offline training and cannot add new code words to the codebook at runtime.

ViBe [33] is a technique enabling BS at very high speed. The background model for each pixel consists of some of the previously observed pixels at the given pixel location as well as at neighboring locations. A pixel identified as background replaces a randomly selected background sample at the corresponding and neighboring pixel locations. The rate of update is controlled by a fixed parameter called the sampling factor. Despite its advantages, the BS performance of ViBe is unsatisfactory in challenging scenarios such as dark backgrounds, shadows, and frequent background changes [34].

PBAS [35] is another BS scheme that maintains only background samples in the background model. PBAS has a set of parameters similar to ViBe's. It operates on the three color channels independently and combines their results for foreground classification. Another sample-based scheme is SACON [36]. This method computes a sample consensus by considering a large number of previous pixels, similar to ViBe. The authors assign time-out map values to pixels and objects, which indicate for how long a pixel has been observed in consecutive frames. These values are used to add static foreground objects to the background.

Recently, several BS schemes based on human perception have been proposed. In [37], the authors assume that human vision does not take in a scene globally but rather focuses on key points in the scene. Their proposed method uses key-point detection and matching to maintain the background model. Saliency is used in [38] to develop a BS technique. In [34], the authors consider how the human visual system perceives noticeable intensity deviations from the background. The authors in [39] present a BS method for cell-phone cameras under view-angle changes.

2.2 Hardware Implementation of Background Subtraction Schemes

Some implementations of BS algorithms for constrained environments have been proposed in the past. The BS algorithm presented in [40] has been implemented on a Spartan 3 FPGA in [41]. Details of the implementation are missing in their work. Furthermore, [40] assumes the background to be static, which is impractical. In [42], the authors present a modification to the method proposed in [43]. They have achieved significant gains in memory consumption and execution time; however, they have not presented BS performance results over a complete dataset. The authors in [44] implement the algorithm presented in [45] on a Spartan 3 FPGA. The implemented algorithm, however, is non-adaptive and applicable to static backgrounds only. Furthermore, the algorithm of [45] lacks quantitative evaluation. An implementation of single Gaussian-based BS on a Digilent FPGA is given in [46]. Like [45], the implemented algorithm cannot model dynamic backgrounds. Another implementation of a BS scheme for static backgrounds, more specifically the selective running Gaussian filter-based BS, has been performed on a Virtex 4 FPGA in [47]. In [48], the authors present a modified multi-modal Σ-Δ BS scheme. They have achieved very high speed with their implementation on a Spartan II FPGA. Like the other methods discussed above, they have not evaluated the BS performance of their method on a standard dataset.

Many researchers have implemented comparatively better-performing BS schemes as well. A SoC implementation of GMM is presented in [49], which consumes 421 mW. In [50], an FPGA implementation of GMM is presented that is faster and requires less energy than previous implementations. The authors maneuver the update equations to simplify the hardware implementation. Other implementations of GMM include [51] and [52]. An implementation of the codebook algorithm for BS is presented in [53]. Similarly, FPGA implementations of the ViBe and PBAS algorithms have been presented in [54] and [55], respectively.

It should be noted that the above implementations do not exactly implement the original algorithms but use a different set of parameters or post-processing to adapt their methods for better hardware implementations; therefore, the BS performance of these implementations is expected to be different from the original algorithms.

3 Review of EBSCam

EBSCam is a BS scheme proposed in [4]. It uses a background model that is robust against the effect of noise. In this work, we use EBSCam to estimate the parameters of the PMM distribution for every pixel.

It is shown in [4] that noise in the input samples causes the background model of each pixel to fluctuate. This leads the BS scheme to make classification errors, i.e., identifying background pixels as foreground and vice versa. These errors are typically described in terms of false positives and false negatives. A false positive occurs when a pixel belonging to the background is identified as part of a moving object. Similarly, a false negative occurs when a pixel belonging to a moving object is identified as part of the background. In [4], the probabilities of false positives and false negatives with GMM-based BS are shown to be

$$\displaystyle \begin{aligned} \mbox{P}\left[FP\right]=1-\mbox{erf}\left(\frac{\sqrt{T}\sigma_{BG}}{\sqrt{2\left(\sigma^{2}_{BG}+s^{2}_{\mu,k}\left(\alpha_k,\sigma_{BG}\right)+\psi+\sqrt{T}s^2_{\sigma,k}\left(\alpha_k,\sigma_{BG}\right)\right)}}\right) \end{aligned} $$
(1)

and

$$\displaystyle \begin{aligned} \mbox{P}\left[FN\right]=1-\mbox{erf}\left(\frac{\mbox{E}\left[I_{FG}\right]-\left(\mbox{E}\left[I_{BG}\right]+\sqrt{T}\sigma_{BG}\right)}{\sqrt{2\left(\sigma^{2}_{FG}+s^{2}_{\mu,k}\left(\alpha_k,\sigma_{BG}\right)+\sqrt{T}s^2_{\sigma,k}\left(\alpha_k,\sigma_{BG}\right)\right)}}\right), \end{aligned} $$
(2)

respectively. Note that \(s^2_{\mu,k}\) and \(s^2_{\sigma,k}\) denote the variance in the estimated mean and standard deviation parameters of GMM, respectively. Kindly refer to [4] for the definitions of the remaining symbols, as they are not relevant here. From the above equations, it is seen that the variance in the estimated parameters increases the error probabilities. EBSCam mitigates this variance in the estimated parameters to reduce errors in BS.

In EBSCam, the background intensity of each pixel is limited to at most K different values. In other words, the background can have K different layers, allowing the method to model dynamic backgrounds. For every pixel i, a set \(E_i\) of K elements is formed. Each element of \(E_i\) represents a single layer of the background.

The pixel intensity at the i-th pixel is compared against the elements of the set \(E_i\) in order to populate \(E_i\). If the pixel intensity differs from all the elements of \(E_i\) by more than a constant D, then it is stored as a new element in \(E_i\). Let us define

$$\displaystyle \begin{aligned} R_{i,t}=\cup_{j=1}^{K}[E_{i,t-1}^{(j)}-D,E_{i,t-1}^{(j)}+D], \end{aligned} $$
(3)

where \(E_{i,t-1}^{(j)}\) is the j-th element of \(E_i\) at time t − 1 and D is a global threshold. The sampling frame (if the input intensity belongs to \(S_{i,t}\), then it is included in \(E_i\)) at time t is then defined as

$$\displaystyle \begin{aligned} S_{i,t}=\boldsymbol{I}\setminus R_{i,t}. \end{aligned} $$
(4)

Here \(\boldsymbol{I}\) is the set of all possible values of background pixels and ∖ denotes set subtraction.

It is seen from the above equations that the background model only changes when the input intensity differs from all the elements of \(E_i\) by more than D; otherwise, the background model does not fluctuate. In other words, the update of the background model in EBSCam is triggered only beyond a step-size of D. Furthermore, the pixel intensities themselves can be used to estimate the background intensity because, under the assumption of a normal distribution, the mean and the mode coincide. That is, the most frequently observed value of the background intensity is likely to be very close to its mean, i.e.,

$$\displaystyle \begin{aligned} \arg\max_{I_{BG}}f_{I_{BG}}(I_{BG})=\mbox{E}[I_{BG}], \end{aligned} $$
(5)

where \(f_{I_{BG}}\) is the probability density function (PDF) of the background intensity.

The background model in EBSCam comprises not only the estimates \(E_i\) but also the credence of each estimate, stored in a set \(C_i\). The credence gives the confidence in each estimate. In [4], it is shown that the credence should be incremented if an estimate is observed and decremented if it is not. More precisely,

$$\displaystyle \begin{aligned} C_{i,t}^{(j)}=C_{i,t-1}^{(j)}+1~\mbox{if}~I_{i,t}\in[E_{i,t-1}^{(j)}-D,E_{i,t-1}^{(j)}+D] \end{aligned} $$
(6)

and

$$\displaystyle \begin{aligned} C_{i,t}^{(j)}=C_{i,t-1}^{(j)}-1~\mbox{if}~(I_{i,t}\notin[E_{i,t-1}^{(j)}-D,E_{i,t-1}^{(j)}+D]\wedge C_{i,t-1}^{(j)}>C_{th}), \end{aligned} $$
(7)

where \(C_{th}\) is a constant.

The background model in EBSCam is updated blindly. In a blind update, the input intensity at a pixel updates the background model regardless of whether it belongs to the background or the foreground. In a non-blind update, on the other hand, only background pixel intensities update the background model. Whenever a pixel intensity that differs from all the elements of \(E_i\) by more than D is observed, it is included as a new estimate with a credence value of zero. The new estimate replaces the estimate about which we are least confident, i.e., the one with the minimum credence.
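The update rules (6) and (7), together with this replacement policy, can be summarized in the following single-pixel sketch; the values of D and C_th are illustrative only.

```python
def ebscam_update(I, E, C, D=10, C_th=20):
    """Blind EBSCam update for one pixel (a sketch of Eqs. (6)-(7)).

    I: input intensity; E: list of K layer estimates;
    C: list of K credence values. D and C_th are assumed constants.
    """
    matched = False
    for j in range(len(E)):
        if abs(I - E[j]) <= D:        # Eq. (6): estimate observed,
            C[j] += 1                 # raise its credence
            matched = True
        elif C[j] > C_th:             # Eq. (7): decay the credence of
            C[j] -= 1                 # unobserved, confident layers
    if not matched:
        # Blind update: the new intensity replaces the estimate with
        # the minimum credence and starts with zero credence.
        j_min = C.index(min(C))
        E[j_min], C[j_min] = I, 0
    return E, C
```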

The background model is used to classify each pixel as either belonging to the background (background pixels) or the foreground. The decision is straightforward. First, the set of estimates about which we are sufficiently confident is defined as

$$\displaystyle \begin{aligned} B_{i,t}=\cup_{j=1}^{K}(E_{i,t-1}^{(j)}|C_{i,t-1}^{(j)}>C_{th}), \end{aligned} $$
(8)

i.e., if the credence of an estimate is greater than a threshold, then the estimate is considered valid. The pixel intensity is compared against the valid background estimates. If the pixel intensity matches any of the valid background estimates, it is considered part of the background; otherwise, it is considered part of the foreground. The foreground mask at pixel i is obtained as

$$\displaystyle \begin{aligned} F_{i,t}=\left\{ \begin{array}{ll} 0 & \mbox{if}~I_{i,t}\in [B_{i,t}^{(m)}-D,B_{i,t}^{(m)}+D]~\mbox{for any}~m, \\ 1 & \mbox{otherwise,} \end{array} \right. \end{aligned} $$
(9)

where \(B_{i,t}^{(m)}\) is the m-th element of \(B_{i,t}\).

4 EBSCam with Poisson Mixture Model

In EBSCam, a fixed threshold is used to distinguish foreground pixels from the background. Although such an approach has been used by numerous authors, e.g., [33], it lacks a theoretical foundation. The variance of the background pixel intensities changes both spatially and temporally over video frames; therefore, an adaptive threshold should be used to distinguish foreground pixels from background pixels.

Generally, the Poisson distribution is used to model the shot noise of image sensors. The probability that m photons have been absorbed at a pixel i is given by

$$\displaystyle \begin{aligned} \mbox{P}\left[X_{i,t}=m\right]=g(m;\lambda_i)=\frac{\lambda_i^m e^{-\lambda_i}}{m!}, \end{aligned} $$
(10)

where \(\lambda_i\) is a parameter denoting both the mean and the variance of the distribution. Assuming that the number of photons is such that the relationship between the observed intensity and the number of photons is linear, the observed intensity can be written as

$$\displaystyle \begin{aligned} I_{i,t}=aX_{i,t}, \end{aligned} $$
(11)

where a is the gain factor. Thus, the mean and the variance of the observed intensity are given by

$$\displaystyle \begin{aligned} \mu_{i,t}=a\lambda_i \end{aligned} $$
(12)

and

$$\displaystyle \begin{aligned} \sigma_{i,t}^2=a^2\lambda_i=a\mu_{i,t}, \end{aligned} $$
(13)

respectively.
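The relation in (13) is easy to verify numerically, as in the short check below; the gain and rate are arbitrary illustrative values.

```python
import numpy as np

# Numerical check of Eqs. (11)-(13): for I = a * X with
# X ~ Poisson(lam), the variance of I equals a times its mean.
rng = np.random.default_rng(0)
a, lam = 0.25, 400.0
I = a * rng.poisson(lam, size=1_000_000)

print(I.mean())        # ~ a * lam     = 100
print(I.var())         # ~ a**2 * lam  = 25
print(a * I.mean())    # ~ a * mu      = 25, matching the variance
```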

The Poisson distribution can be used to model the noise or the variance of the background pixel intensities. In order to deal with dynamic backgrounds, we propose using a Poisson mixture model (PMM) for modeling the background intensities. Each subpopulation of the mixture model is representative of a layer of the background. Thus, the background pixel intensity can be modeled as

$$\displaystyle \begin{aligned} \mbox{P}\left[I_{BG,i,t}=m\right]=\sum_{k=1}^{K}\psi_i^{(k)} g(m;\lambda_i^{(k)}). \end{aligned} $$
(14)

The parameters of the distribution can be estimated as

$$\displaystyle \begin{aligned} \lambda_{i}^{(k)}=\frac{\sum_{t=1}^{T}1\{I_{BG,i,t}\in k\}I_{BG,i,t}}{\sum_{t=1}^{T}1\{I_{BG,i,t}\in k\}} \end{aligned} $$
(15)

and

$$\displaystyle \begin{aligned} \psi_i^{(k)}=\frac{\sum_{t=1}^{T}1\{I_{BG,i,t}\in k\}}{T}, \end{aligned} $$
(16)

where T is the total number of frames and \(1\{\cdot\}\) is the indicator function. Generally, the expectation maximization (EM) algorithm is used to estimate the above parameters.
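For reference, (15) and (16) can be evaluated directly under the simplifying assumption that every background sample has already been hard-assigned to a subpopulation; in practice, these assignments would themselves come from EM iterations.

```python
import numpy as np

def pmm_batch_estimate(samples, assignments, K):
    """Direct evaluation of Eqs. (15)-(16) for one pixel.

    samples: (T,) background intensities; assignments: (T,) integer
    labels in {0, ..., K-1} playing the role of the indicator 1{.}.
    """
    T = len(samples)
    lam = np.zeros(K)
    psi = np.zeros(K)
    for k in range(K):
        mask = assignments == k            # 1{I_BG in k}
        if mask.any():
            lam[k] = samples[mask].mean()  # Eq. (15)
        psi[k] = mask.sum() / T            # Eq. (16)
    return lam, psi
```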

The approach in (15) and (16) for estimating the parameters of the distribution of the background pixel intensities is not feasible for two reasons. First, it requires storing all the frames of the video. Second, EM is an iterative procedure, which conflicts with the speed requirements of the target applications. Here, we propose using EBSCam as an online approximation for the parameters of the PMM. Since the mean and the mode of the Poisson distribution coincide, we can write

$$\displaystyle \begin{aligned} \lambda_{i,t}^{(k)}=E_{i,t}^{(k)}. \end{aligned} $$
(17)

Rather than using the normalized values of \(\psi _i^{(k)}\) for which \(\sum _{k=1}^{K}\psi _i^{(k)}=1\), we can use non-normalized values by replacing \(\psi _i^{(k)}\) of (16) by

$$\displaystyle \begin{aligned} \phi_i^{(k)}=\sum_{n=1}^{T}1\{I_{BG,i,n}\in k\}. \end{aligned} $$
(18)

This approximation is plausible because the \(\psi_i^{(k)}\) values are only used for comparison between subpopulations. Since the division by T in (16) is common to all k, scaling \(\psi_i^{(k)}\) by T does not affect the result of the comparison over k. From (18),

$$\displaystyle \begin{aligned} \phi_{i,t}^{(k)}=\sum_{n=1}^{t}1\{I_{BG,i,n}\in k\}=\phi_{i,t-1}^{(k)}+1\{I_{BG,i,t}\in k\}. \end{aligned} $$
(19)

In detail, the approximate weight of the k-th subpopulation is incremented if the input intensity matches the k-th subpopulation. Similarly, to discard old, unobserved background layers, we decrement \(\phi_{i,t}^{(k)}\) if \(I_{BG,i,t}\) does not belong to the k-th subpopulation. When a new background layer needs to be stored, the least observed background layer is removed. Based on this, the approximate weight \(\phi_{i,t}^{(k)}\) can be replaced by the credence \(C_{i,t}^{(k)}\).

The decision whether the input intensity matches an estimate is modified as

$$\displaystyle \begin{aligned} \left(I_{i,t}-E_{i,t-1}^{(j)}\right)^2<c^2\left(\sigma_{i,t-1}^{(j)}\right)^2 \end{aligned} $$
(20)

or

$$\displaystyle \begin{aligned} \left(I_{i,t}-E_{i,t-1}^{(j)}\right)^2<c^2E_{i,t-1}^{(j)} \end{aligned} $$
(21)

or

$$\displaystyle \begin{aligned} \left|I_{i,t}-E_{i,t-1}^{(j)}\right|<c\sqrt{E_{i,t-1}^{(j)}}, \end{aligned} $$
(22)

where c is a constant. Thus, an adaptive threshold is used to distinguish foreground intensities from the background. The threshold is determined by the mean or, equivalently, the variance of the PMM, which can be approximated by \(E_i\).

The update of \(E_i\) and the foreground classification are performed as in EBSCam. Since the proposed method uses EBSCam to estimate the distribution parameters of the PMM, we term the proposed scheme EBSCam-PMM. A step-wise procedure is shown in Algorithm 1.

Algorithm 1 EBSCam-PMM
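The per-pixel procedure can be sketched in Python as follows; the value of C_th and the list-based data layout are illustrative assumptions, and the match test follows (24).

```python
import math

def ebscam_pmm_step(I, E, C, c=2, C_th=20):
    """One EBSCam-PMM classification and update step for one pixel.

    I: integer input intensity; E: K integer layer estimates (the
    lambda of each subpopulation); C: K credence values (weights).
    """
    K = len(E)
    # Adaptive match test of Eq. (24): |I - E| < c * floor(sqrt(E)).
    M = [abs(I - E[j]) < c * math.isqrt(E[j]) for j in range(K)]

    # Classification, Eqs. (8)-(9): background only if the pixel
    # matches a layer whose credence exceeds C_th.
    fg = 0 if any(M[j] and C[j] > C_th for j in range(K)) else 1

    # Credence update, Eqs. (6)-(7), with the adaptive threshold.
    for j in range(K):
        if M[j]:
            C[j] += 1
        elif C[j] > C_th:
            C[j] -= 1

    # Blind update: when no layer matches, the new intensity
    # replaces the least-credible layer with zero credence.
    if not any(M):
        j_min = C.index(min(C))
        E[j_min], C[j_min] = I, 0
    return fg, E, C
```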

5 Parameter Selection

In EBSCam-PMM, there are two parameters, namely c and K. If too large a value of K is used, the model will cover more than the background, resulting in false negatives. Similarly, too small a value of K will result in false positives, as the background intensities will not be fully covered by the estimated mixture model. For c, we propose a value of 2, i.e., the intensity at a pixel is said to match a background estimate if it is within two standard deviations of the estimate. We choose c = 2 for two reasons. First, two standard deviations sufficiently cover the distribution. Second, assuming integer values of the input intensity, we can write the condition in (22) as

$$\displaystyle \begin{aligned} \left|I_{i,t}-E_{i,t-1}^{(j)}\right|<\left\lfloor c\sqrt{E_{i,t-1}^{(j)}}\right\rfloor. \end{aligned} $$
(23)

With a non-negative integer c, the above can be written as

$$\displaystyle \begin{aligned} \left|I_{i,t}-E_{i,t-1}^{(j)}\right|<c\left\lfloor\sqrt{E_{i,t-1}^{(j)}}\right\rfloor. \end{aligned} $$
(24)

By using (24), we can replace the costly square-root operation with a simple LUT. For example, for all values of \(E_{i,t-1}^{(j)}\) between 196 and 224, the value of \(\left \lfloor \sqrt {E_{i,t-1}^{(j)}}\right \rfloor \) is 14. Generally, for an n-bit wide input intensity, we require only \(\lfloor 2^{\frac {n}{2}}\rfloor \) if-else conditions to compute \(\left \lfloor \sqrt {E_{i,t-1}^{(j)}}\right \rfloor \).
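A behavioral sketch of such a LUT for 8-bit intensities is shown below; the loop over the 16 perfect-square boundaries stands in for the chain of if-else conditions.

```python
def floor_sqrt_8bit(E):
    """floor(sqrt(E)) for 0 <= E <= 255 via boundary comparisons.

    The thresholds are the perfect squares 1, 4, ..., 225; e.g.,
    every E with 196 <= E <= 224 yields 14.
    """
    r = 0
    for s in range(1, 16):   # floor(2^(8/2)) = 16 cases in total
        if E >= s * s:
            r = s
    return r
```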

We applied EBSCam-PMM to the CDNET-2014 dataset [56] with different values of K. In Fig. 1, we show the percentage of wrong classifications (PWC), defined as

$$\displaystyle \begin{aligned} \text{PWC}=\frac{FP+FN}{TP+TN+FP+FN}\times 100. \end{aligned} $$
(25)

Here TP and TN denote true positives and true negatives, respectively. With EBSCam-PMM, optimal performance is achieved with K = 5, as seen in Fig. 1.
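For completeness, (25) can be computed directly from a binary foreground mask and its ground truth, as in this small sketch.

```python
import numpy as np

def pwc(mask, gt):
    """Percentage of wrong classifications, Eq. (25), for binary
    masks; TP + TN + FP + FN equals the total number of pixels."""
    fp = np.sum((mask == 1) & (gt == 0))
    fn = np.sum((mask == 0) & (gt == 1))
    return 100.0 * (fp + fn) / mask.size
```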

Fig. 1 The effect of changing K on the background subtraction performance of EBSCam-PMM over the CDNET-2014 dataset

6 Hardware Implementation of EBSCam-PMM

In this section, a dedicated implementation of EBSCam-PMM is described, as dedicated implementations can attain much higher processing speeds and much lower energy consumption than general-purpose implementations.

The abstract diagram of the overall system is shown in Fig. 2. The scene is captured by the sensor array. Afterwards, the intensity values of the scene are passed to the image signal processor (ISP). The ISP performs multiple image processing tasks, including providing the luminosity values used in BS. Note that this front-end configuration is not fixed and can be replaced by any system that provides the luminosity values of the scene.

Fig. 2 Abstract diagram of the overall system

The EBSCam-PMM engine performs the task of identifying the foreground pixels from the background. The EBSCam-PMM engine is composed of the memory unit and the EBSCam-PMM circuit. The memory unit maintains the background model, which is used by the EBSCam-PMM circuit to classify a given pixel into foreground or background.

The block diagram of the BS circuit is shown in Fig. 3. This implementation uses 8 bits for pixel intensities and 10 bits for \(C_{i}^{(j)}\) for all j. In the graphical representation of each of the constituent modules of the EBSCam-PMM circuit, we have excluded the pixel index i, as the same circuit is used for all pixels. Similarly, we have excluded the time index from the notation in this section, as it can be derived directly from the time index of the input intensity. Also, we have not included clock and control signals in the figures, to emphasize the main data flow of the system.

Fig. 3 Block diagram of the EBSCam-PMM circuit. Bit-widths are based on the FPGA implementation (Sect. 7)

The comparator module in the BS circuit compares all the elements of \(E_i\) with \(I_{i,t}\) in parallel and generates a K-bit wide output M. \(M^{(j)}=1\) indicates that the pixel has matched \(E_{i}^{(j)}\), and \(M^{(j)}=0\) that it has not; the decision is made by comparing \(|I_{i,t}-E_{i}^{(j)}|\) with \(c\left \lfloor \sqrt {E_{i}^{(j)}}\right \rfloor \). The rest of the hardware is the same for EBSCam and EBSCam-PMM. Note that multiple \(M^{(j)}\) can be high at the same instant.

A new estimate needs to be added to the background model if \(M^{(j)}=0\) for all j = 1 to K. The addNewEstimate module checks this condition by performing a logical NOR of all the bits of M.

Next, the credenceUpdate module updates the \(C_{i}^{(j)}\) values based on \(M^{(j)}\); i.e., the credenceUpdate module implements (6) and (7).

The addEstimate module is used to add a new estimate to the background model. A new estimate is added to the background model of a pixel if the output of the addNewEstimate module is high. The module is further subdivided into two submodules.

The replaceRequired submodule determines the index of the estimate that needs to be replaced, and the replaceEstimate submodule generates the updated estimates and credence values. If required, the addEstimate module replaces the estimate with the minimum credence value by the intensity of the pixel, and the corresponding credence value is initialized to zero. Note that if multiple \(C_{i}^{(j)}\) are minimal at the same time, then the \(C_{i}^{(j)}\) and \(E_{i}^{(j)}\) with the smallest j are initialized to zero and \(I_{i,t}\), respectively.

The foreground output for a pixel should be high if, for every j, either \(M^{(j)}=0\) or \(C_{i}^{(j)}\le C_{th}\), i.e., if the pixel matches no valid estimate. This task is implemented by the fgGenerate block. The output of the fgGenerate block is then passed to the postProcess block, which applies a 7 × 7 median filter to its input.

7 Experimental Results

To analyze the performance of EBSCam-PMM, we present results of applying EBSCam-PMM to standard datasets. Also, in this section we discuss the FPGA implementation and results of EBSCam-PMM. The performance of the proposed method is compared against some state-of-the-art implementations as well.

7.1 Background Subtraction Performance

To analyze the accuracy of BS under different scene conditions, we have applied EBSCam-PMM to the CDNET-2014 dataset, which is the most thorough BS evaluation dataset available online. CDNET-2014 [56] is an extensive dataset of real-life videos. It includes 53 video sequences divided into 11 categories.

We have used a fixed set of parameters for the evaluation of our method. In practice, the value of \(C_{th}\) should be varied with the frame rate of the video; however, here we have used a fixed value of \(C_{th}\) over the whole dataset. We have used a 7 × 7 median filter as a post-processing step. The PWC metric has been used to evaluate performance, as it is a commonly used metric for evaluating and comparing binary classifiers such as BS.

In Table 1, we present and compare the performance of EBSCam-PMM with GMM [22], EGMM [29], KDE [57], ViBe [33], PBAS [35], and EBSCam [4]. From Table 1, it is seen that EBSCam-PMM shows improved performance compared to GMM, EGMM, KDE, and ViBe, and is second only to PBAS.

Table 1 PWC comparison of different methods on CDNET-2014

To compare the accuracy of dedicated implementations, we show the PWC results of FPGA implementations of the ViBe and PBAS algorithms. The results are shown over the CDNET-2012 dataset [58]. It is seen that EBSCam-PMM outperforms all of the methods except ViBe. However, as will be seen shortly, the hardware complexity of ViBe is much higher than that of the proposed method (Tables 2, 4 and 6).

Table 2 PWC comparison of FPGA implementations on CDNET-2012
Table 3 Foreground masks of CDNET-2014 using EBSCam-PMM
Table 4 Memory bandwidth comparison
Table 5 EBSCam-PMM engine on FPGA
Table 6 EBSCam-PMM BS circuit and comparison with previous art

The actual foreground masks generated by different algorithms for a variety of scenes are shown in Table 3. False negatives are observed when the foreground and background intensities are very similar. However, EBSCam-PMM generally shows a lower number of false positives than the other methods. It is also seen that PBAS fails to distinguish the foreground from the background in the dynamic background sequence.

7.2 FPGA Implementation of EBSCam-PMM

To analyze the area utilization, power consumption, and speed of EBSCam-PMM for embedded applications such as smart cameras, in this subsection we discuss the FPGA implementation. We also compare the FPGA implementation of the proposed method with other implementations of BS schemes.

For synthesis of the circuit, we used XST. ISE was used for place and route on the FPGA. To verify the functionality of the RTL, the Xilinx ISE simulator (ISim) was used. Power estimates were obtained using the XPower Analyzer.

Due to increasing video resolutions and frame rates, the number of pixels that need to be processed in digital videos is continuously increasing. Thus, memory bandwidth becomes an important aspect of hardware design; algorithms that require lower memory bandwidth are more suitable for hardware implementation. The number of bits maintained per pixel of the frame in EBSCam-PMM is 90, i.e., K = 5 layers at 8 + 10 = 18 bits each for the estimate and its credence. It is seen in Table 4 that the memory bandwidth of EBSCam-PMM is lower than that of GMM. Compared to ViBe and PBAS, the memory bandwidth of EBSCam-PMM is very small, showing the utility of the proposed method relative to these implementations. In [59], the authors use compression to reduce the memory bandwidth of GMM-based BS. However, compression adds circuitry, resulting in increased power consumption, area, and delay.

The implementation results of the EBSCam-PMM engine on an FPGA are shown in Table 5. The background model is stored in the internal SRAM of the FPGA. From these results, it is seen that EBSCam-PMM requires little area, achieves high speed, and consumes little power. As an example, for SVGA sequences EBSCam-PMM consumes 1.192 W at 30 fps. Also, a frame rate of approximately 96 fps can be achieved with EBSCam-PMM at SVGA resolution.

To compare the performance of the proposed method against previous art, we show the implementation results of our method and other methods in Table 6. It is seen that EBSCam-PMM requires significantly lower power than GMM. This comes as no surprise, as EBSCam-PMM uses parameters that do not fluctuate rapidly, resulting in lower switching activity and, thus, lower power consumption. EBSCam-PMM also uses fewer FPGA logic resources and can achieve higher speeds than GMM. The logic requirements and speed of EBSCam-PMM are significantly better than those of ViBe and PBAS. In fact, the power consumption of EBSCam-PMM is almost negligible compared to that of PBAS. Note that EBSCam-PMM requires only slightly more logic resources than EBSCam, with almost identical speed and power consumption.

Recently, [61] proposed a method to reduce the memory bandwidth of GMM. From Table 7, it is seen that their method achieves a lower memory bandwidth than EBSCam-PMM. However, our implementation is much faster and consumes much less power than that of [61]. Also, note that their method can at best achieve the accuracy of GMM, whereas, as seen earlier, EBSCam-PMM outperforms GMM in BS accuracy.

Table 7 Comparison of CoSCS-MoG and EBSCam-PMM

8 Conclusion

This chapter presented EBSCam-PMM, a new background subtraction scheme based on Poisson mixture models. Since shot noise is the most dominant form of noise in natural images, it is natural to use a Poisson mixture model for the background intensity. A sequential approach to estimating the parameters of the Poisson mixture model was also presented. The estimated parameters are more robust against noise in the samples. As a result, the proposed method shows superior performance compared to numerous common algorithms. Furthermore, an FPGA implementation of the proposed method was presented. EBSCam-PMM requires low memory bandwidth and battery power while providing very high frame rates at high resolutions. These features make EBSCam-PMM a suitable candidate for smart cameras.