1 Introduction

With recent advances in computer vision and the availability of inexpensive digital cameras, the demand for both indoor and outdoor video surveillance systems has grown substantially. Video processing is an important branch of image processing and computer vision that focuses on extracting information from real-scene videos. Among the various video processing techniques, background subtraction has attracted great attention as a developing research area over the past few years [1,2,3,4]. Background subtraction aims at dividing the pixels of each frame of a video into two complementary sets: the background pixels, corresponding to the stationary part of the frame, and the foreground pixels, corresponding to the moving part. This makes the subsequent post-processing tasks efficient and relatively easy. Most background subtraction techniques aim at choosing algorithms and task-scheduling methods that make the real-time processing of videos efficient and accurate [1]. Furthermore, processing a high-resolution video stream efficiently is also a challenging task [2]. Recently, techniques have been proposed that limit the number of historical frames, thus making the background subtraction process suitable for real-time implementation [3]. Background subtraction is a widely used technique that finds application particularly in surveillance videos, object detection and tracking [5], urban traffic detection [6], long-term monitoring of a scene [7], video compression [8], etc. Some critical challenges faced in the implementation of a background subtraction method are:

  i. Selection of the initial frames.

  ii. Development of a model for the background.

  iii. Selection of an effective threshold for the classification of pixels.

  iv. Updating process for the background, the threshold, or both.

Nowadays, the amount of video obtained from surveillance cameras deployed throughout the world is increasing dramatically. This continuously growing volume of video not only makes the evaluation of background subtraction methods crucial but also creates a pressing need for real-time techniques that can manage the increasing video data.

Recently, multiple background subtraction methods have been developed [9, 10] that gradually update the low-rank structure underlying the background by considering one frame at a time, thus decreasing the computation time. These methods can effectively carry out background subtraction for real-time videos by significantly speeding up the calculations performed for this task. However, there are still some evident defects in these techniques when employed for real-time videos. On the one hand, these approaches mostly neglect dynamic camera jitters, including rotation, scaling, and illumination changes, which occur frequently in real life, while assuming that the background possesses a low-rank structure. On the other hand, most of the current methods make use of a fixed loss term, such as the \(L_1\) or \(L_2\) norm loss. These loss terms make an implicit assumption about the noise, i.e., the foreground part of the video, by considering it to have a fixed probability distribution such as Laplacian or Gaussian. In real scenarios, however, the foreground always varies dramatically over time, so the assumption of a fixed noise distribution deviates from reality. In some cases, the foreground contains multiple modalities of noise, as shown in Fig. 1; such cases require more complex noise models. In practice, neglecting this essential intuition about the diversity of the foreground prevents the current methods from being robust against the noise variations that occur in real-time videos.

The rest of the paper is organized as follows: Sect. 2 gives a brief review of existing background subtraction methods. Section 3 covers the proposed methodology and the related algorithms. Section 4 explains the results obtained by applying the proposed algorithm to various video datasets. Finally, Sect. 5 concludes the proposed work.

Fig. 1

Background subtraction results for a bootstrap sequence [12]. The first row contains, from left to right, the original frame, the noise, and the background extracted by [12]. The second row contains three noise components that correspond to the foreground: the moving object, its shadow, and a feeble camera noise

2 Related work

Over the last couple of years, various research studies have been presented on background subtraction, which has consequently attained great importance as a developing research area [13,14,15,16]. Some of these recent studies have served as references for the development of the proposed model. Gaussian parameter compression techniques [14] as well as combinations of the Mixture of Gaussians (MoG) and compressive sensing (CS) [16] have recently been used. Furthermore, parallel implementation strategies are being utilized for object detection in UAV-sourced videos [15]. As described earlier, background subtraction is widely used particularly in surveillance videos, object tracking and detection, traffic or crowd monitoring, etc., where the main focus is to separate the moving objects, i.e., the foreground, from the stationary background [17]. However, the detection of moving objects in videos remains a challenging task owing to several issues such as shadows, varying illumination, occlusion, background motion, and camera jitter, as well as several types of ambiguities like atmospheric disturbances, noise, and object-overlapping outliers [18].

Statistical models are among the most effective means of overcoming these challenges. They may be non-parametric or parametric. The most commonly applied non-parametric methods are Kernel Density Estimation (KDE) and Eigenvalue Decomposition, but these methods have extensive memory requirements as well as high computational complexity [19]. Conversely, parametric statistical models rely on statistical distributions for background modeling. One of the most popular is the finite GMM, which is capable of coping with slight illumination changes as well as moving backgrounds with small repetitive motion [19]. Parametric methods have their own limitations: the learning parameters need to be set automatically, complex dynamic backgrounds have to be handled, shadows have to be dissociated from objects, etc. [19]. To accommodate these challenges and overcome the aforementioned limitations, [20] proposed an improved version of GMM for the detection of moving objects. It uses Gaussian components to model the intensity values of a block of pixels and compensates for the learning rate limitation using a dynamic learning rate. Considering a block of pixels, rather than a single pixel value, reduced the computation time by a factor of almost four while keeping the performance nearly on par with earlier methods [20].

Recently, matrix decomposition methods, for instance, Robust Principal Component Analysis (RPCA), have become an efficient framework for background subtraction. These methods aim to decompose a matrix into a low-rank component (the background) and a sparse component (the foreground). However, as the size of the input data increases, and because of the absence of sparsity constraints, these methods can show weak performance: they cannot handle real-time challenges and yield inaccurate foreground areas. To address this problem, an online single unified optimization framework for the simultaneous detection of the foreground and learning of the background was proposed in [21]. This method performs better, as it provides a more reliable and efficient low-rank component, but it cannot be used with moving cameras.
Although RPCA provides a good framework for background subtraction, it still has a very high computational complexity and large memory requirements because of its batch optimization. To solve this issue, online RPCA (OR-PCA) [22] was developed, which can process such high-dimensional data in a stochastic manner. However, the sparse component obtained by OR-PCA cannot always handle the numerous background modeling challenges, which degrades the performance of the system. To overcome these challenges, [23] presented a multi-feature-based OR-PCA scheme. Integrating multiple features into OR-PCA not only improves the quality of the detected foreground but also enhances the quantitative performance of the technique, as compared to single-feature OR-PCA and PCP-based RPCA methods [23]. However, when OR-PCA is applied to real sequences with a dynamically changing background, its performance is also reduced. Therefore, OR-PCA needs to be enhanced to cope with the increased complexity and variety of videos. In [24], an online algorithm built on Incremental Nonnegative Matrix Factorization (INMF) is presented, which resolves the problems encountered in OR-PCA using non-negative and structured sparsity constraints. In complex scenes, this algorithm reduces the number of missed and false detections.

Subspace learning methods such as Matrix Completion (MC) and RPCA have been explored and have attracted significant attention during the last few years [12, 25]. These methods are based on low-rank modeling and are meant to reduce the dimensionality of a very-high-dimensional space. Unfortunately, most of the conventional matrix decomposition algorithms based on MC and RPCA share some prevalent challenges. First, they use batch processing; second, all the frames have to be accessed in each iteration of the optimization process. As a result, they have extensive memory requirements and are computationally inefficient. The method proposed in [26] considers the sequence of images as constructed from a low-rank matrix for the background and a dynamic tree-structured sparse matrix for the foreground. It solves the decomposition by an approximated RPCA, which is extended to handle camera motion. This method decreases the complexity, requires less computation time, and does not require huge memory for larger videos. Similarly, to estimate a robust background model for dynamic videos, a robust spatiotemporal low-rank matrix completion (SLMC) algorithm is presented in [25], in which spectral graphs are regularized to encode spatiotemporal constraints. The SLMC algorithm is further extended to Spatiotemporal RPCA (SRPCA) with dynamic frame extraction. Together, these algorithms make the process robust and accurate, but since SRPCA optimizes the foreground and background at the same time, its computational time is increased.

Furthermore, impressive achievements in deep learning have inspired researchers to apply deep neural networks to the task of background subtraction [27,28,29]. Usually, in a CNN, some essential information may be lost by the first convolution layer and is thus not available to the deeper layers, limiting the network's ability to extract multiple features.
To overcome this limitation, a CNN combining multi-scale representation is studied in [27], and a multi-scale fully convolutional network architecture is proposed for background subtraction that takes advantage of features from multiple layers. Similarly, in [28], a background subtraction method based on depth data in SBM-RGBD is proposed that achieves more accurate results than traditional methods. The major difference between depth data and traditional data is the distance information provided by the depth sensors; in this method, the impact of edge noise and absent pixels is reduced by a preprocessing step. Most of the currently available CNN-based foreground detection methods use 2-D CNNs and hence fail to account for the temporal features present in image sequences. These temporal features are beneficial for improving the performance of IR foreground detection, because IR images lack rich spatial features. To solve this problem, a background subtraction method based on a 3-D convolutional network, named MFC3-D (multi-scale 3-D Fully Convolutional Network), is proposed in [29]. The proposed network can effectively learn deep, hierarchical, multi-scale spatial-temporal features and hence performs well for foreground detection in IR videos. Although these deep learning-based algorithms perform better, their drawback is their increasing dependence on costly hardware resources owing to the demanding training process. Because of the restricted computational resources and high real-time demands, these approaches are not realistic for visual surveillance. Moreover, most deep learning-based algorithms are supervised, which means that they need ground truth to train the model. These ground truths are constructed either by a human expert or by other unsupervised background subtraction methods. However, background subtraction methods for video surveillance have to be unsupervised; thus, CNN-based background subtraction methods may not be very useful in practical applications. Considering the pros and cons of all the techniques reviewed in this literature survey, in this work, we propose a technique that is computationally efficient and has improved accuracy in terms of the detected background and foreground, with fewer false detections.

To mitigate the limitations of existing techniques, an innovative technique for background subtraction is proposed in this paper. The proposed technique progressively fits a particular subspace for the background, obtained from \(L_1\)-low-rank matrix factorization (LRMF) using the cyclic weighted median (CWM), and a certain mixture of Gaussians (MoG) distribution of noise for the foreground. This fit is achieved by regularization with the background and foreground information acquired from the preceding frames. In contrast to conventional methods that use a fixed noise distribution for all the frames of a video, a separate MoG distribution is used to model the noise, or foreground, of each frame. The expectation maximization (EM) algorithm is applied to optimize the Gaussian mixture model (GMM). To eliminate camera jitter effects, an affine transformation operator is employed to align successive frames. Finally, the efficiency of the proposed method is augmented using a subsampling technique that can accelerate the method to execute, on average, more than 250 frames per second while maintaining good accuracy.

3 Proposed methodology

In this section, the proposed algorithm for background subtraction is presented in detail. The gist of the proposed work is that for each new video frame \(x_t\), the aim is to progressively fit a particular subspace for the background that is obtained from \(L_1\)-LRMF and a certain MoG distribution of noise for the foreground. This fit is achieved by regularization of the background and foreground information that is acquired from the preceding frames.

Fig. 2

Airport dataset: from left to right OMoGMF [12], DECOLOR [40], PRMF [41], LRMF-CWM. First row: Original frame; second row: background; third row: foreground

3.1 LRMF to obtain background subspace

Low-rank matrix factorization is considered a significantly important technique in data science. The main idea of matrix factorization is that data often contain latent structures, the uncovering of which yields a compressed representation of the data. Matrix factorization provides a unified method for dimensionality reduction, matrix completion, and clustering by factorizing the original data matrix into low-rank matrices. The \(L_1\)-norm LRMF problem can be formulated as follows. Let X be the video data matrix, such that X = (\(x_1\), \(x_2\), ..., \(x_n\)), where \(X \in R^{d \times n}\), d is the dimensionality, and n is the number of frames. Each column of X, i.e., \(x_i\), represents a d-dimensional video frame. The missing entries in the matrix X are represented by an indicator matrix \(W \in R^{d\times n}\), whose elements \(w_{ij}\) are zero when the corresponding element is missing and one otherwise [30].
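As a minimal illustration of this setup, the following sketch assembles X and W from a stack of grayscale frames. The function and variable names are ours, not from the paper, and missing pixels are assumed to be marked with NaN:

```python
import numpy as np

def build_data_matrix(frames):
    """Stack video frames into X (d x n) and build the indicator W.
    frames: list of 2-D grayscale arrays; NaN marks a missing pixel."""
    X = np.stack([f.ravel() for f in frames], axis=1)  # one column per frame
    W = (~np.isnan(X)).astype(float)                   # 1 = observed, 0 = missing
    return np.nan_to_num(X), W
```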

Given X and W, it is possible to formulate a general LRMF problem [31] as

$$\begin{aligned} \text {min}_\mathbf{U,V }||\mathbf{W } \odot (\mathbf{X } - \mathbf{U }\mathbf{V }^{T})||_{L_{1}}, \end{aligned}$$
(1)

where U and V denote the basis and coefficient matrices, respectively. Furthermore, U = [\(u_1\), \(u_2\),..., \(u_k\)], V = [\(v_1\), \(v_2\),..., \(v_k\)], with \(U \in R ^{d \times r}\), \(V \in R^{n \times r}\), and \(r \ll \min (d,n)\); \(\odot \) denotes the Hadamard product, i.e., component-wise multiplication, which differs from the common matrix product. The condition \(r \ll \min (d,n)\) encodes the low-rank property of U V\(^T\). Under the framework of Maximum-Likelihood Estimation (MLE), Eq. (1) can also be understood as

$$\begin{aligned} x_{ij}=(\bar{u_{i}})^{T}\bar{v_{j}}+e_{ij}, \end{aligned}$$
(2)

where \({\bar{u}}_{i}\) is the ith row vector of U, \({\bar{v}}_{j}\) is the \({j}\hbox {th}\) row vector of V, and \(\hbox {e}_{{ij}}\) is the noise element embedded in \(\hbox {x}_{{ij}}\).
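For concreteness, the weighted \(L_1\) objective of Eq. (1) can be evaluated in a few lines. This is a hedged sketch with illustrative names, assuming X, W, U, and V are NumPy arrays of the shapes defined above:

```python
import numpy as np

def l1_lrmf_objective(X, W, U, V):
    """|| W o (X - U V^T) ||_L1 of Eq. (1): the Hadamard-weighted
    sum of absolute reconstruction errors over observed entries."""
    return np.abs(W * (X - U @ V.T)).sum()
```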

To make the proposed model robust to the complex noises present in real-time videos, the term \(e_{ij}\) is better modeled as a parametric probability distribution. Among the candidate probability distributions, the MoG possesses a strong capability to approximate general distributions [31]; it is therefore selected here to adapt flexibly to real cases. In particular, it is assumed that each element \(x_{ij}\) follows:

$$\begin{aligned} x_{ij} \sim \sum _{k=1}^{K}\pi _{k}\, N(x_{ij}|(\bar{u_{i}})^{T} \bar{v_{j}},{\sigma _{k}}^{2}). \end{aligned}$$
(3)

Unfortunately, the \(L_1\)-norm minimization in Eq. (1) is difficult to solve for two reasons. On the one hand, the optimization is non-convex, which generally makes finding a global minimum difficult; in the case of missing entries, it is even proven to be an NP-hard problem [32]. On the other hand, because \(L_1\)-norm minimization is non-smooth, standard optimization tools can hardly find an effective closed-form iteration formula [33]. In the proposed technique, we make use of the simple cyclic coordinate descent algorithm [34], which shows outstanding performance on \(L_1\)-norm LRMF.

The core idea of the cyclic coordinate descent algorithm is to break the underlying complex minimization problem into a sequence of simple elementary subproblems, each with only one scalar parameter, which are then optimized recursively. Each subproblem is a convex optimization problem that can be solved exactly by a weighted median filter, which eliminates the need for time-consuming inner loops of numerical optimization. Moreover, the recursive employment of the weighted median filter makes the method largely robust to missing entries as well as outliers.
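The weighted median itself, i.e., the minimizer of \(\sum _{i} w_{i}|y_{i}-v|\) over \(v\), can be computed by sorting and scanning cumulative weights. A minimal sketch (our own illustrative implementation, not code from the paper):

```python
import numpy as np

def weighted_median(y, w):
    """Return argmin_v sum_i w_i * |y_i - v|: sort y and pick the first
    value at which the cumulative weight reaches half the total weight."""
    order = np.argsort(y)
    y, w = y[order], w[order]
    cum = np.cumsum(w)
    return y[np.searchsorted(cum, 0.5 * cum[-1])]
```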

3.2 CWM algorithm for solving \(L_1\)-norm LRMF problem

The main idea of the CWM algorithm for solving the minimization in Eq. (1) is to recursively apply the weighted median filter [35] to update each element of U = [\(u_1\), \(u_2\),..., \(u_k\)] and V = [\(v_1\), \(v_2\),..., \(v_k\)], where \(U \in R^{d \times r}\) and \(V \in R^{n \times r}\). The algorithm can be described by the following steps:

Step 1 To update each element \(v_{ij}\) of V, where \(i=1,\ldots ,k\) and \(j=1,\ldots ,n\), the weighted median filter is applied cyclically while keeping all other components of U and V fixed. This is done by solving

$$\begin{aligned} {v_{ij}}^*=\text {argmin}_{v_{ij}}||\mathbf{w }_{j}\odot {e_{j}}^{i}-\mathbf{w }_{j}\odot \mathbf{u }_{i} v_{ij} ||_{L_{1}}, \end{aligned}$$
(4)

where \(\mathbf{w }_{j}\) denotes the \({j}\hbox {th}\) column vector of W and \(\mathbf{e }_{j}^{i}\) denotes the \({j}\hbox {th}\) column vector of \(\mathbf{E }_{i}\), defined as

$$\begin{aligned} {\mathbf{E }}_{i}=\mathbf{X }-\sum _{k \ne i}^{}\mathbf{u }_{k}\mathbf{v }_{k}^{T}. \end{aligned}$$
(5)

Step 2 In the next step, cyclically apply the weighted median filter to update each element \(u_{ij}\) of U, keeping all other components of U and V fixed. This is done by solving the following minimization problem:

$$\begin{aligned} {u_{ij}}^{*}=\text {argmin}_{u_{ij}}{||{\tilde{\mathbf{w }}}_{j} \odot {{\tilde{\mathbf{e }}}_{j}}^{i}- {\tilde{\mathbf{w }}}_{j} \odot {\mathbf{v }}_{i} u_{ij} ||}_{L_{1}}, \end{aligned}$$
(6)

where \(\tilde{\mathbf{w }}_{j}\) denotes the \({j}\hbox {th}\) row vector of W and \(\tilde{\mathbf{e }}_{j}^{i}\) denotes the \({j}\hbox {th}\) row vector of \(\mathbf{E }_{i}\).

The factorized matrices U and V are recursively updated via iterative application of the above procedure until the termination condition is fulfilled.

In step 1 of the algorithm, the initial values of U and V are obtained from PCA [37], performed prior to the \(L_1\)-LRMF. As for the termination condition, the objective function of Eq. (1) decreases monotonically during the iteration process; therefore, the algorithm terminates either when the rate of change of U and V falls below a predetermined threshold or when the maximum number of iterations has been reached.
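The following sketch puts Steps 1 and 2 together as one CWM sweep, reusing the weighted_median helper above. Each scalar subproblem \(\min _{v}\sum _{l} w_{lj}|e_{lj}-u_{li}v|\) is rewritten as a weighted median of the ratios \(e_{lj}/u_{li}\) with weights \(w_{lj}|u_{li}|\). This is our hedged reading of the algorithm, not the authors' code:

```python
import numpy as np

def cwm_sweep(X, W, U, V):
    """One cyclic sweep of Steps 1 and 2 over all rank components.
    Uses weighted_median() from the previous snippet."""
    d, n = X.shape
    r = U.shape[1]
    for i in range(r):
        # E_i = X - sum_{k != i} u_k v_k^T; it excludes component i,
        # so it stays valid for both the V and U updates below.
        E = X - U @ V.T + np.outer(U[:, i], V[:, i])
        for j in range(n):                          # Step 1: update v_{ij}
            m = (W[:, j] > 0) & (U[:, i] != 0)
            if m.any():
                V[j, i] = weighted_median(E[m, j] / U[m, i],
                                          W[m, j] * np.abs(U[m, i]))
        for l in range(d):                          # Step 2: update u_{li}
            m = (W[l, :] > 0) & (V[:, i] != 0)
            if m.any():
                U[l, i] = weighted_median(E[l, m] / V[m, i],
                                          W[l, m] * np.abs(V[m, i]))
    return U, V
```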

Table 1 Comparison of F-Measure for airport dataset without jitter effect
Table 2 Comparison of F-Measure for airport dataset with jitter effect

3.2.1 GMM for foreground modeling

In the proposed technique, as discussed earlier, rather than using a fixed noise distribution for all the frames in a video, the noise or foreground \(e_{ij}\) of each frame is modeled as a separate MoG distribution, regularized by a penalty that enforces its parameters to stay close to those estimated from the preceding frames. This penalty may also be reformulated as the conjugate prior for the MoG of the current frame, encoding the noise knowledge learned previously. As the MoG can effectively approximate an extensive range of distributions, the proposed method is capable of finely adapting to the variations in the foreground, even for noises with complex dynamic structures. The expectation maximization (EM) algorithm is then used to iteratively approximate the maximum-likelihood (ML) estimates of the mixture model parameters. One very useful property of the EM algorithm is that the likelihood of the data increases monotonically with each iteration, which implies that it is guaranteed to approach a local maximum or a saddle point. The EM algorithm, as the name suggests, essentially involves two steps: the E-step, or expectation step, and the M-step, or maximization step. Iterative repetition of these two steps until convergence gives the maximum-likelihood estimate. The following steps explain the working of the EM algorithm (a compact code sketch of the per-frame fit is given after the list):

  i.

    Initialize the model parameters, that is, the means \(\mu _{k}\), covariances \(\Sigma _{k}\), and mixture weights \(\pi _{k}\), where \(k=1,2,\ldots ,K\). Estimate the initial value of the log-likelihood.

  ii.

    After initialization, the second step is the expectation step. In this step, using the current parameters, the responsibilities are evaluated as

    $$\begin{aligned} {\gamma _{ik}}^{t}=\frac{\pi _{k} N({x_{i}}^{t}|(\bar{u_{i}})^{T}\bar{v},{\sigma _{k}}^{2})}{\sum _{k=1}^{K}\pi _{k} N({x_{i}}^{t}|(\bar{u_{i}})^{T}\bar{v},{\sigma _{k}}^{2})}, \end{aligned}$$
    (7)

    where \({\gamma _{ik}}^{t}\) corresponds to the latent (hidden) variable for the \({k}\hbox {th}\) Gaussian, and \(x_{i}^{t}\) denotes the \({i}\hbox {th}\) pixel of the newly arrived frame \(x^{t}\).

  iii.

    The next step is the maximization step. This step involves the recalculation of the model parameters \(\pi _{k}\), \(\sigma _{k}^{2}\), and \(\bar{v}\) using the currently obtained values. This recalculation is done using the following equations:

    $$\begin{aligned} \pi _{k} & = {} {\pi _{k}}^{t-1}-\frac{\bar{N}}{N}({\pi _{k}}^{t-1}-{\bar{\pi }}_{k}), \end{aligned}$$
    (8)
    $$\begin{aligned} {\sigma _{k}}^{2} & = {} {{\sigma _{k}}^{t-1}}^{2}-\frac{\bar{N}_k}{N_{k}}({{\sigma _{k}}^{t-1}}^{2}-{\bar{\sigma _{k}}}^{2}), \end{aligned}$$
    (9)

    where \(\bar{N}=d\); \(\bar{N}_{k}=\sum _{i=1}^{d}\gamma _{ik}^{t}\); \({\bar{\pi }}_{k}=\bar{N}_{k}/\bar{N}\); \({\bar{\sigma }_{k}}^{2}=\frac{1}{\bar{N}_{k}}\sum _{i=1}^{d}\gamma _{ik}^{t}\,(x_{i}^{t}-(\bar{u_{i}})^{T}\bar{v})^{2}\); \(N=N^{t-1}+\bar{N}\); and \(N_{k}=N_{k}^{t-1}+\bar{N}_{k}\). The coefficient vector is updated as

    $$\begin{aligned} \bar{v}=(U^{T}\mathrm{diag}(w^{t})^{2}U)^{-1}U^{T}\mathrm{diag}(w^{t})^{2}x^{t}, \end{aligned}$$
    (10)
  iv.

    Finally, evaluate the log-likelihood [12]

    $$\begin{aligned} \ln p(\mathbf{X }|\mathbf{U },\mathbf{V },\mathbf \Sigma ,\mathbf \Pi )=\sum _{n=1}^{N}\ln \left\{ \sum _{k=1}^{K} \pi _{k}\, N(x_{ij}|(\bar{u_{i}})^{T}\bar{v_{j}},{\sigma _{k}}^{2})\right\} , \end{aligned}$$
    (11)

    where \(\Pi =\{\pi _{k}\}_{k=1}^{K}\). If either the parameters or the log-likelihood has converged, the desired result has been achieved; if not, return to step (ii) and iterate until convergence. After the convergence of the EM algorithm, the fitted model is utilized to update the current subspace \(\mathbf{U }_t\) using the updating rules [12] defined as follows:

    $$\begin{aligned} {\mathbf{u }_{i}}^{t} & = {} {\bar{\mathbf{A }_{i}}}^{t}{\bar{\mathbf{b }_{i}}}^{t}, \end{aligned}$$
    (12)
    $$\begin{aligned} \bar{{\mathbf{A }_{i}}^{t}} & = {} \frac{1}{\rho }\left( {\bar{\mathbf{A }_{i}}}^{t-1}-\frac{{{w_{i}}^{t}}^{2}{\bar{\mathbf{A }_{i}}}^{t-1}{\bar{v}}^{t}(\bar{v}^{t})^{T}{\bar{\mathbf{A }_{i}}}^{t-1}}{\rho +{{w_{i}}^{t}}^{2}(\bar{v}^{t})^{T}{\bar{A_{i}}}^{t-1}{\bar{v}}^{t}}\right) , \end{aligned}$$
    (13)
    $$\begin{aligned} \bar{{\mathbf{b }_{i}}^{t}} & = {} \rho {\bar{\mathbf{b }_{i}}^{t-1}}+{{w_{i}}^{t}}^{2}{x_{i}}^{t}{\bar{v}}^{t}, \end{aligned}$$
    (14)

    where \(\bar{{\mathbf{A }_{i}}^{t}}\) and \(\bar{{\mathbf{b }_{i}}^{t}}\) denote the model variables that are used as the background prior. \(\bar{{\mathbf{A }_{i}}^{t}}\) is a semi-definite matrix, which makes it easy to learn the subspace U. \(w_{i}^{t}\) denotes the \({i}\hbox {th}\) element of the indicator matrix W, defined as \(w_{i}^{t}=\sqrt{\sum _{k=1}^{K}\frac{\gamma _{ik}^{t}}{2\sigma _{k}^{2}}}\), and \(\rho \) controls the strength of the priors. Its value is set so that the subspace leans slightly toward the current frame; in the proposed work, it is set to 0.98.
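As noted before the list, the following is a compact sketch of the per-frame EM fit: a zero-mean K-component MoG is fitted to the residual \(e = x^{t}-U\bar{v}\) of the current frame. It implements only the plain E- and M-steps (Eq. 7 and the standard re-estimates); the blending with previous-frame statistics in Eqs. (8)-(9) and the subspace update of Eqs. (12)-(14) are omitted, and all names are illustrative:

```python
import numpy as np

def mog_em(residual, K=3, iters=20, eps=1e-8):
    """Fit a zero-mean K-component MoG to a frame residual by EM.
    residual: 1-D array of length d (pixels of x_t minus background)."""
    d = residual.size
    pi = np.full(K, 1.0 / K)                                   # mixture weights
    sigma2 = np.var(residual) * np.arange(1, K + 1) / K + eps  # spread scales
    for _ in range(iters):
        # E-step: responsibilities gamma[i, k], cf. Eq. (7) with zero means
        lik = pi / np.sqrt(2 * np.pi * sigma2) * \
              np.exp(-residual[:, None] ** 2 / (2 * sigma2))
        gamma = lik / (lik.sum(axis=1, keepdims=True) + eps)
        # M-step: re-estimate weights and variances from responsibilities
        Nk = gamma.sum(axis=0) + eps
        pi = Nk / d
        sigma2 = (gamma * residual[:, None] ** 2).sum(axis=0) / Nk + eps
    return pi, sigma2, gamma
```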

Fig. 3

Airport dataset with jitter effect. First row: misaligned frame; second row: aligned frame; third row: background; fourth row: foreground

4 Performance evaluation

For the performance evaluation of the proposed technique, we conducted background subtraction experiments on three different datasets selected from the Li dataset, which have static as well as dynamic backgrounds:

  i. Airport video without camera jitter effect.

  ii. An unaligned face with different illuminations.

  iii. Synthetically transformed airport video with camera jitter effect.

All these experiments were run on a computer with an Intel Core m3 CPU, 8 GB RAM, and the Windows 10 operating system. The proposed work is implemented in MATLAB R2015.

Fig. 4

Results for the Airport dataset using a subsampling rate of 0.01. a and b are without jitter effect, whereas c and d are with jitter effect

Table 3 Comparison of the computational time (in seconds) of OMoGMF [12] and proposed technique (without subsampling)
Table 4 Comparison of the computational time (in seconds) of OMoGMF [12] and proposed technique (with subsampling)

4.1 Quantitative evaluation

The quantitative metric used to assess the performance is the F-measure, which is expressed as

$$\begin{aligned} F\text {-measure}=2 \times \frac{\text {Precision}\times \text {Recall}}{\text {Precision}+\text {Recall}}. \end{aligned}$$
(15)

Recall measures how many of the true positives are identified correctly, whereas precision measures how many of all positive classifications are actually correct. In Eq. (15), precision and recall are estimated as follows:

$$\begin{aligned} \text {Precision} & = {} \frac{|S_{f}\cap S_{gt}|}{|S_{f}|}, \end{aligned}$$
(16)
$$\begin{aligned} \text {Recall} & = {} \frac{|S_{f}\cap S_{gt}|}{|S_{gt}|}, \end{aligned}$$
(17)

where \(S_{f}\) is the support set of the foreground estimated by the proposed method and \(S_{gt}\) is the ground-truth set.
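Given binary foreground masks, these three quantities reduce to a few set operations. The sketch below (illustrative names; a small eps is added to avoid division by zero) computes the F-measure of Eqs. (15)-(17):

```python
import numpy as np

def f_measure(fg_mask, gt_mask, eps=1e-8):
    """F-measure from boolean masks: fg_mask = estimated foreground
    support S_f, gt_mask = ground-truth support S_gt."""
    tp = np.logical_and(fg_mask, gt_mask).sum()         # |S_f ∩ S_gt|
    precision = tp / (fg_mask.sum() + eps)              # Eq. (16)
    recall = tp / (gt_mask.sum() + eps)                 # Eq. (17)
    return 2 * precision * recall / (precision + recall + eps)
```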

The quantitative analysis is performed by comparing the results obtained by the proposed technique and the OMoGMF method proposed in [12]. This comparison of the F-measure is shown in Tables 1 and 2: Table 1 compares the airport dataset without jitter effect, whereas Table 2 compares the airport dataset with jitter effect. It can be seen that in both cases, our proposed technique performs best. For qualitative comparison, we have compared the proposed technique with four existing techniques. In these experiments, we have taken into account the two major problems that were not solved in the existing literature, i.e., dynamic background (such as illumination changes, waving trees, and ripples in water) and camera jitter effects (such as translation, rotation, scaling, or a combination of these), thereby making the proposed technique more robust to dynamic background changes. Figure 2 illustrates the acquired results by comparing them with three existing techniques, i.e., OMoGMF [12], DECOLOR [40], and PRMF [41].

To demonstrate the effectiveness of the proposed technique against camera jitter effects, we took the same Airport sequence, but this time with camera jitter effects in it: each successive frame is either rotated or translated by a certain amount with respect to the previous frame. The results of applying the proposed technique, OMoGMF [12], and t-GRASTA [43] to such a video sequence are presented in Fig. 3. Under a certain low-rank assumption, it is possible to reconstruct a large low-rank matrix from a small number of its entries [38]. Stimulated by previous efforts [39] on this issue, the efficiency of the proposed method is further improved by appending the subsampling technique used in [12]. The introduction of the subsampling technique can accelerate the execution to, on average, more than 250 frames per second. Furthermore, this add-on does not affect the accuracy of the proposed method, i.e., good accuracy is maintained. In these experiments, the subsampling rate is 0.01. The results obtained using the subsampling technique on the three video sequences are illustrated in Fig. 4.
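One plausible reading of this subsampling step, sketched under our own assumptions (illustrative names; the weighted least-squares form follows Eq. (10) restricted to the sampled rows): only a random 1% of the pixels of the incoming frame are used to estimate the coefficient vector \(\bar{v}\), after which the full-resolution background is reconstructed as \(U\bar{v}\).

```python
import numpy as np

def estimate_v_subsampled(x_t, U, w_t, rate=0.01, rng=None):
    """Estimate v from a random pixel subset, then reconstruct the
    full background as U @ v (Eq. (10) on the sampled rows only)."""
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(x_t.size, size=max(1, int(rate * x_t.size)),
                     replace=False)
    Us, xs, ws = U[idx], x_t[idx], w_t[idx]
    A = Us.T @ (ws[:, None] ** 2 * Us)         # U^T diag(w)^2 U on sampled rows
    v = np.linalg.solve(A, Us.T @ (ws ** 2 * xs))
    return v, U @ v                            # coefficients, full background
```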

To show the effectiveness of our proposed alignment approach using the affine transformation, we tested it on aligning the frames of the “Dummy” dataset shown in Fig. 5. This dataset contains multiple images that are not only misaligned but also suffer from illumination variation and block occlusion. The results obtained by the proposed technique and by the methods of [42, 43] and [12] are shown in Fig. 5. It can be seen that, with the existing methods, the face becomes enlarged during the alignment process in some images; furthermore, some images remain darker, which means that the effect of illumination variation has not been removed effectively. With the proposed technique, the image alignment is improved and the illumination variation is minimized.

Fig. 5

a Images in “Dummy” dataset containing misaligned images with block occlusion and illumination variation. Frames with block occlusion removed and illumination effect minimized using b RASL [42], c t-GRASTA [43], d OMoGMF [12], and e LRMF-CWM

4.2 Computational complexity

The computational complexity of the proposed technique is reduced compared to the other state-of-the-art methods used for background subtraction. This is basically due to the CWM method that we have employed for the \(L_{1}\)-norm LRMF. If we take n and d as the size and dimensionality of the input data matrix, respectively, then the computational complexity of the proposed method is of the order O(d+n), whereas that of other state-of-the-art algorithms is O(dn) [36]. A comparison of the computational time of our proposed algorithm with that of the OMoGMF [12] algorithm is made in Tables 3 and 4. As can be seen from Table 3, the computation time of the proposed technique is lower than that of the OMoGMF method [12]: for the Airport dataset, it is reduced from 14.288 to 13.884 s; for the Dummy dataset, from 0.9968 to 0.8636 s; and for the airport dataset with camera jitter effects, from 14.866 to 13.288 s. To further enhance the efficiency of the proposed technique, subsampling is embedded into the calculation, with a subsampling rate of 0.01, i.e., 1%. This accelerates the method to execute, on average, more than 250 frames per second while maintaining good accuracy. Similarly, Table 4 shows that the proposed technique has a reduced computation time compared to OMoGMF [12] even with the subsampling technique.

5 Conclusion

In this paper, a technique has been proposed that aims at making background subtraction feasible for real-time videos in terms of both speed and accuracy. It gradually fits a specific subspace for the background, obtained from \(L_{1}\)-LRMF using the CWM, and a certain MoG distribution of noise for the foreground. This fit is achieved by regularization with the background and foreground information acquired from the preceding frames. As opposed to conventional methods that use a fixed noise distribution for all frames in a video, a separate MoG distribution is utilized to model the noise, or foreground, of each frame. The EM algorithm is used for the optimization of the GMM. To eliminate camera jitter effects, an affine transformation operator is employed to align successive frames. The efficiency of the proposed method is augmented using a subsampling technique that can accelerate the method to execute, on average, more than 250 frames per second while maintaining good accuracy.