1 Introduction

Video surveillance [54], traffic monitoring [52], and gesture recognition [45] are important areas of research, driven by the need for security and surveillance at places such as railway stations, airports, and other public and private spaces. Moving object segmentation [40, 41, 44] and tracking [15, 38, 51] are crucial steps in these applications, since suspicious or unusual activities must be detected before further high-level processing can be performed. Most segmentation and tracking techniques for these applications rely on shape, motion, or color features. However, it is difficult to segment the foreground objects accurately because of illumination variation, changing backgrounds, fake or abrupt motion, and occlusion [31]. Optical flow [10], frame differencing [39], and background subtraction [1] are the main techniques for the object detection task. In the optical flow technique, foreground objects are detected with the help of the flow vectors of the moving objects; however, this technique is complex and, owing to the differential operations involved, is susceptible to noise and illumination variations. Frame differencing is a simple technique based on change detection across successive frames, but it suffers from foreground aperture and ghosting problems. Background subtraction, one of the most widely used and straightforward methods, detects the moving foreground by subtracting a background frame from the current frame; to remove the effects of lighting changes and spurious events, this method updates the background regularly. Owing to its real-time performance and stability in dynamic situations, it is widely used for motion detection. A vast amount of data must be stored and transmitted in the aforementioned application areas; therefore, data are stored and transmitted in compressed form. Several techniques are used for this purpose, and the wavelet transform is one of them: it divides the frame sequences into approximate and detail components, and further operations are performed on these components only [2, 27].

Numerous works have been proposed by computer vision groups to segment moving objects from the background. A change detection and Canny edge detection-based moving object segmentation algorithm in the wavelet domain was originally proposed by Huang et al. [22]; however, it suffers from foreground aperture and ghosting problems. This method was further enhanced by Khare et al. [25] using double change detection [39] and Daubechies complex wavelet coefficients [30]; soft thresholding and Canny edge detection [8] are employed to detect the potential edges in the sub-bands. Discrete wavelet transform (DWT) [20] and variance method [34]-based object detection and tracking is performed by Gangal et al. [16]. This method combines background subtraction and frame differencing to detect moving objects; however, its computational complexity is high because it (1) employs two techniques, background subtraction and frame differencing, for a single task, and (2) combines all the sub-bands of the 2-D DWT for the detection task. A complex wavelet and approximate median filter-based moving object segmentation algorithm has been proposed for video surveillance environments [26]. A hybrid algorithm for moving object detection in the wavelet compressed domain was proposed by Töreyin et al. [46]. An inter-frame differencing and Daubechies complex wavelet transform-based motion segmentation algorithm is presented in [24], where the Daubechies complex wavelet transform is selected for its better directional selectivity and shift-invariance properties.

An unsupervised motion segmentation technique for extracting moving areas from compressed data with the help of Markov random field classification [28] and global motion estimation [13] was proposed by Chen et al. [9]. Hsia et al. [21] proposed a motion detection approach based on a modified directional lifting-based 9/7 discrete wavelet transform; this technique preserves fine shape information in the low-resolution image and reduces the computational cost. An efficient moving object detection method based on normalized self-adaptive optical flow is presented by Sengar et al. [37]; Otsu's approach and a self-adaptive window technique are employed to compute the optimum threshold and to select the moving object area, respectively. However, slow-moving, small-sized objects cannot be detected using this approach. A histogram-based frame differencing approach has been combined with the W4 method [19] to remove the foreground aperture and ghosting problems in moving object detection [42]. Various convolutional neural network (CNN)-based approaches have been proposed in [3, 4, 11, 33]; in the current scenario, these approaches are highly effective at finding moving objects, but the computational complexity of CNN architectures is very high. Bouwmans et al. [6, 48] suggested the use of robust principal component analysis in the field of data science. The role and importance of features for background modeling and foreground detection have been shown by Bouwmans et al. [5], who demonstrate the importance of color, texture, edge, stereo, motion, and local histogram features in different environments. A background subtraction method using multi-scale structured low-rank and sparse factorization has been proposed by Zheng et al. [55]; an optimization technique is employed to effectively detect the moving objects, but it has higher computational complexity.

A number of techniques have been proposed for the segmentation of moving objects in the compressed domain. However, deficiencies remain under illumination variations, camera shake, etc. We therefore propose a statistical background subtraction-based motion segmentation technique in the wavelet compressed domain. In our method, we first apply the dual-tree complex wavelet transform (DTCWT) to the video sequence to divide it into four sub-bands (LL, LH, HL, and HH). Subsequently, an initial background frame is estimated from each of the detail coefficients (LH, HL, and HH) using a median operation. In the next step, the statistical parameters, weighted mean and weighted variance, are computed for each detail coefficient using the respective background frames. Successively, the foreground objects for each detail sub-band are computed with the help of a background subtraction approach based on these statistical parameters. Next, all the foreground objects are combined and super-sampled to detect the moving object. Finally, post-processing in the form of morphology, connected component analysis, and flood fill is employed to segment the moving objects efficiently and accurately. The background frames generated from the detail sub-bands are updated regularly to adapt to changes in the environment. Experimental results and performance analysis on different publicly available benchmark video datasets show that our approach is considerably better than other existing moving object detection techniques.

The main contributions of this work are summarized as below:

  1. To obtain robust features, DTCWT-based wavelet decomposition is employed; the DTCWT reduces frequency-aliasing components and shift-variant features.

  2. A weighted mean and weighted variance-based background model is generated to accurately detect the moving objects; the weights are designed to reduce the effect of outliers.

  3. To obtain an accurate foreground, the squared Mahalanobis distance between the current frame and the weighted mean is computed and compared with a mean and standard deviation-based threshold.

  4. The target object is obtained by combining the foreground outputs of all sub-bands, followed by super-sampling the video frames.

  5. Morphological operations, connected component analysis, and the flood-fill algorithm are employed to suppress noise, detect and label the connected components, and generate the silhouettes of the foreground objects, respectively.

  6. To address the issues of dynamic backgrounds (shaking camera and illumination variations), our background model is updated for every consecutive frame.

  7. Qualitative and quantitative analyses on different benchmark datasets show that our technique outperforms several state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 gives the background of the dual-tree complex wavelet transform. The proposed statistical parameter-based motion segmentation method in the compressed domain is elaborated in Section 3. Experimental results and performance analysis, with both qualitative and quantitative evaluations, are provided in Section 4. Finally, Section 5 concludes our work.

2 Dual-tree complex wavelet transform

The discrete wavelet transform (DWT) [20] has been extensively employed in numerous image and video processing applications such as de-noising, compression, and feature extraction [29]. However, it has some shortcomings owing to aliasing [50], shift variance [7], and lack of directionality. Because of the shift variance of the DWT, even a small shift in the input data greatly distorts the wavelet coefficients and changes the energy of each sub-band. To overcome these drawbacks and obtain robust wavelet-based features, the DTCWT [23, 36, 43] was developed; it reduces frequency-aliasing components and is nearly shift-invariant. As shown in Fig. 1, the DTCWT enhances the ordinary DWT by employing two parallel DWTs with different low-pass (\({h_{0}^{i}}(n), {g_{0}^{i}}(n)\)) and high-pass (\({h_{1}^{i}}(n), {g_{1}^{i}}(n)\)) filters at each scale of an input signal. The first and second DWTs represent the real and imaginary parts of the transform, respectively. The wavelet function ψg(t) of the second DWT (imaginary part) is the Hilbert transform of the wavelet function ψh(t) of the first DWT (real part), i.e. ψg(t) = H[ψh(t)], where H[.] denotes the Hilbert transform; this helps to achieve perfect reconstruction. The Hilbert transform pair condition is satisfied by the wavelet functions if the low-pass filter g0(n) of the second tree is the half-sample-delayed version of the low-pass filter h0(n) of the first tree [35, 53]. This condition is expressed in the time domain as follows.

$$ g_{0}(n)\approx h_{0}(n-0.5) \implies \psi_{g}(t) \approx H(\psi_{h}(t)) $$
(1)

It can be converted into the frequency domain as below:

$$ G_{0}(w)=e^{-jw/2}H_{0}(w) \quad \text{for } |w|<\pi $$
(2)

The half-sample delay condition cannot be satisfied exactly by FIR filters, and consequently the wavelet function pairs do not fulfill the perfect analyticity condition. Therefore, instead of employing a half-sample delay system, an approximation is made [35] by using different filters in the first stage from those in the following stages. An orthonormal perfect-reconstruction filter pair is used for the first stage, satisfying the following equation:

$$ {g_{0}^{i}}(n)={h_{0}^{i}}(n-1) $$
(3)

where \({h_{0}^{i}}(n)\) and \({g_{0}^{i}}(n)\) are the low-pass filters of the real and imaginary trees respectively, and i = 1, 2, 3 denotes the sub-band level for a 3-level decomposition. An approximately analytic DTCWT is obtained at every stage if the condition of (3) is satisfied.

Fig. 1 Representation of the dual-tree complex wavelet transform over 3 levels
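For concreteness, the sketch below computes a 3-level forward DTCWT using the open-source `dtcwt` Python package (an implementation choice of ours; the paper does not name one). Note that this package exposes six directional complex high-pass sub-bands per level, which we treat as the counterpart of the detail coefficients used in the rest of the paper.

```python
import numpy as np
import dtcwt  # open-source DTCWT implementation (assumed available)

def dtcwt_decompose(frame, nlevels=3):
    """Forward multi-level DTCWT of a grayscale frame.

    Returns the real low-pass image and a tuple of complex high-pass
    coefficient arrays, one per level; in this package each level
    stacks six directional sub-bands along its last axis.
    """
    pyramid = dtcwt.Transform2d().forward(frame.astype(np.float64),
                                          nlevels=nlevels)
    return pyramid.lowpass, pyramid.highpasses

# Example on a synthetic 256x256 frame.
lowpass, highpasses = dtcwt_decompose(np.random.rand(256, 256))
print(lowpass.shape, [h.shape for h in highpasses])
```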

3 Proposed work

In this work, a statistical parameter-based background subtraction method is employed in the dual-tree complex wavelet transform domain to automatically segment the moving objects. The steps of our technique are as follows:

  1. Suppress noise by applying a Gaussian filter to each frame.

  2. Convert the frame sequence into the wavelet domain using the DTCWT to obtain the approximate and detail coefficients.

  3. Apply the statistical background subtraction method to each detail wavelet coefficient with the help of a statistical parameters-based thresholding technique.

  4. Detect the moving objects.

  5. Execute the post-processing steps, using morphological operations, connected component analysis, and the flood-fill algorithm, to generate the silhouettes of the target objects.

  6. Update the background model to adapt to changes in dynamic environments.

The schematic diagram of the proposed technique is displayed in Fig. 2, and the steps of the proposed work are elaborated below:

Fig. 2 Schematic diagram of the proposed approach

3.1 Noise smoothing

A smoothing operation is applied to each frame to suppress noise. Several linear and non-linear operators [14, 32, 47] are available for noise suppression in frame sequences. Among these, the Gaussian filter is one of the most effective noise-smoothing filters; we apply it to each frame using 1-D Gaussian masks along each spatial dimension in a cascaded manner, as per the following equation.

$$ F_{smooth} = \frac{1}{\sigma \sqrt{2\pi}} \int \left[ \int F(x,y)\, e^{-\left(x^{2}+y^{2}\right)/2\sigma^{2}}\, dx \right] dy $$
(4)
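A minimal sketch of this cascaded 1-D smoothing, here using SciPy's 1-D Gaussian filter (the library and the σ value are our illustrative choices, not specified in the paper):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_frame(frame, sigma=1.0):
    """Cascade two 1-D Gaussian masks (rows, then columns), which is
    equivalent to 2-D Gaussian smoothing of the frame."""
    tmp = gaussian_filter1d(frame.astype(np.float64), sigma, axis=0)
    return gaussian_filter1d(tmp, sigma, axis=1)
```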

3.2 Wavelet decomposition

To describe a signal in terms of its lower and higher frequency components, wavelet-based band-pass filtering is employed. The approximate coefficients (LL part) represent the lower frequencies, and the detail coefficients (LH, HL, and HH parts) represent the higher frequencies. The detail coefficients provide the high-frequency details in the horizontal, vertical, and diagonal directions, and these details effectively yield the motion information in the subsequent steps. Owing to the shift-invariance and anti-frequency-aliasing properties of the DTCWT, our approach first estimates the wavelet coefficients using the DTCWT (see Section 2) for N frames to compute the initial background frame, where N = 15 (selected experimentally). Next, the wavelet coefficients of the current frame are estimated.

3.3 Statistical background model-based moving object detection

To detect moving objects accurately, a background frame must be constructed. In the first step of motion segmentation, a statistical background model is generated to represent the background accurately. A pixel-wise median operation among frames is the simplest method to produce a background frame. Therefore, a pixel-wise median operation is first applied to the initial set of detail coefficients (FLH, FHL, and FHH) of N frames to build a reference background for the statistical background model; the results are named the median coefficients \(F_{LH}^{med}\), \(F_{HL}^{med}\), and \(F_{HH}^{med}\) (shown in (5)–(7)).

$$ \begin{array}{@{}rcl@{}} F_{LH}^{med}(x,y)&=&median(F_{LH}^{i}(x,y)) \end{array} $$
(5)
$$ \begin{array}{@{}rcl@{}} F_{HL}^{med}(x,y)&=&median(F_{HL}^{i}(x,y)) \end{array} $$
(6)
$$ \begin{array}{@{}rcl@{}} F_{HH}^{med}(x,y)&=&median(F_{HH}^{i}(x,y)) \end{array} $$
(7)

Here i = 1, 2, ..., N, where N is the number of frames used to build the reference background frame. \(F_{LH}^{i}(x,y)\), \(F_{HL}^{i}(x,y)\), and \(F_{HH}^{i}(x,y)\) are the coefficient values at location (x, y) of the detail sub-bands of the ith frame.
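In code, with the detail coefficients of the first N frames stacked along a leading axis, (5)–(7) reduce to a per-pixel median. A sketch (variable names are ours; we assume real-valued detail coefficients, e.g. magnitudes of the complex DTCWT outputs):

```python
import numpy as np

N = 15  # number of initial frames, selected experimentally in the paper

def median_background(F_sub):
    """Pixel-wise median over the stacked detail coefficients of the
    first N frames, eqs. (5)-(7). F_sub has shape (N, H, W)."""
    return np.median(F_sub, axis=0)

# F_LH_med = median_background(F_LH), and likewise for the HL and HH bands.
```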

The following weighted mean (shown in (8)–(10)) and weighted variance (shown in (11)–(13)) are used to form the statistical background model for each pixel of the detail coefficients of the Z frames:

$$ \begin{array}{@{}rcl@{}} \mu_{LH}(x,y)&=&\frac{{\sum}_{i=1}^{Z}W_{LH}^{i}(x,y).F_{LH}^{i}(x,y)}{{\sum}_{i=1}^{Z}W_{LH}^{i}(x,y)} \end{array} $$
(8)
$$ \begin{array}{@{}rcl@{}} \mu_{HL}(x,y)&=&\frac{{\sum}_{i=1}^{Z}W_{HL}^{i}(x,y).F_{HL}^{i}(x,y)}{{\sum}_{i=1}^{Z}W_{HL}^{i}(x,y)} \end{array} $$
(9)
$$ \begin{array}{@{}rcl@{}} \mu_{HH}(x,y)&=&\frac{{\sum}_{i=1}^{Z}W_{HH}^{i}(x,y).F_{HH}^{i}(x,y)}{{\sum}_{i=1}^{Z}W_{HH}^{i}(x,y)} \end{array} $$
(10)
$$ \begin{array}{@{}rcl@{}} \sigma^{2}_{LH}(x,y)&=&\frac{{\sum}_{i=1}^{Z}W_{LH}^{i}(x,y).(F_{LH}^{i}(x,y)-\mu_{LH}(x,y))^{2}}{\frac{Z-1}{Z}{\sum}_{i=1}^{Z}W_{LH}^{i}(x,y)} \end{array} $$
(11)
$$ \begin{array}{@{}rcl@{}} \sigma^{2}_{HL}(x,y)&=&\frac{{\sum}_{i=1}^{Z}W_{HL}^{i}(x,y).(F_{HL}^{i}(x,y)-\mu_{HL}(x,y))^{2}}{\frac{Z-1}{Z}{\sum}_{i=1}^{Z}W_{HL}^{i}(x,y)} \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} \sigma^{2}_{HH}(x,y)&=&\frac{{\sum}_{i=1}^{Z}W_{HH}^{i}(x,y).(F_{HH}^{i}(x,y)-\mu_{HH}(x,y))^{2}}{\frac{Z-1}{Z}{\sum}_{i=1}^{Z}W_{HH}^{i}(x,y)} \end{array} $$
(13)

Here \(F_{LH}^{i}(x,y)\), \(F_{HL}^{i}(x,y)\), and \(F_{HH}^{i}(x,y)\) are the coefficient values of the pixels located at (x, y) in the LH, HL, and HH sub-bands of the ith frame respectively. The weight parameters \(W_{LH}^{i}(x,y)\), \(W_{HL}^{i}(x,y)\), and \(W_{HH}^{i}(x,y)\) are used to minimize the effect of outliers, i.e. coefficients that deviate strongly from the median background, and are computed as:

$$ \begin{array}{@{}rcl@{}} W_{LH}^{i}(x,y)&=&exp\left( \frac{(F_{LH}^{i}(x,y)-F_{LH}^{med}(x,y))^{2}}{-2SD^{2}}\right) \end{array} $$
(14)
$$ \begin{array}{@{}rcl@{}} W_{HL}^{i}(x,y)&=&exp\left( \frac{(F_{HL}^{i}(x,y)-F_{HL}^{med}(x,y))^{2}}{-2SD^{2}}\right) \end{array} $$
(15)
$$ \begin{array}{@{}rcl@{}} W_{HH}^{i}(x,y)&=&exp\left( \frac{(F_{HH}^{i}(x,y)-F_{HH}^{med}(x,y))^{2}}{-2SD^{2}}\right)\end{array} $$
(16)

where the value of the parameter SD is set to 5.
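A sketch of the weighted statistics (8)–(16) in NumPy, under the same stacked-array convention as above:

```python
import numpy as np

SD = 5.0  # spread parameter of the weights, eqs. (14)-(16)

def weighted_stats(F_sub, F_med, Z):
    """Weighted mean (8)-(10) and weighted variance (11)-(13) of one
    detail sub-band over Z frames. F_sub: (Z, H, W); F_med: (H, W)."""
    W = np.exp(-((F_sub - F_med) ** 2) / (2.0 * SD ** 2))  # weights (14)-(16)
    mu = (W * F_sub).sum(axis=0) / W.sum(axis=0)           # weighted mean
    var = ((W * (F_sub - mu) ** 2).sum(axis=0)
           / (((Z - 1) / Z) * W.sum(axis=0)))              # weighted variance
    return mu, var
```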

How much the current frame deviates from the mean of the background model indicates where the foreground objects lie. Using this idea, the foreground object of each sub-band is obtained from the generated mean and variance-based background models via the squared Mahalanobis distances in (17)–(19). A pixel value is set to 1 (foreground) if the Mahalanobis distance of the corresponding coefficient is greater than the specified threshold (Th); otherwise it is set to 0. This process is expressed as follows:

$$ \begin{array}{@{}rcl@{}} FG_{LH}(x, y)=\left\{\begin{array}{ll} 1 & \text{if} {\frac{(F_{LH}(x, y)-\mu_{LH}(x,y))^{2}}{\sigma_{LH}^{2}(x,y)}> Th_{1}}\\ 0 & \text{otherwise} \end{array}\right. \end{array} $$
(17)
$$ \begin{array}{@{}rcl@{}} FG_{HL}(x, y)=\left\{\begin{array}{ll} 1 & \text{if} { \frac{(F_{HL}(x, y)-\mu_{HL}(x,y))^{2}}{\sigma_{HL}^{2}(x,y)}> Th_{2}}\\ 0 & \text{otherwise} \end{array}\right. \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} FG_{HH}(x, y)=\left\{\begin{array}{ll} 1 & \text{if} {\frac{(F_{HH}(x, y)-\mu_{HH}(x,y))^{2}}{\sigma_{HH}^{2}(x,y)}> Th_{3}}\\ 0 & \text{otherwise} \end{array}\right. \end{array} $$
(19)

where the values of Th1, Th2, and Th3 are computed using the following equations:

$$ \begin{array}{@{}rcl@{}} Th_{1}=mean\left( \frac{(F_{LH}(x,y)-\mu_{LH}(x,y))^{2}}{\sigma_{LH}^{2}(x,y)}\right)+\gamma*std \left( \frac{(F_{LH}(x,y)-\mu_{LH}(x,y))^{2}}{\sigma_{LH}^{2}(x,y)}\right) \end{array} $$
(20)
$$ \begin{array}{@{}rcl@{}} Th_{2}=mean\left( \frac{(F_{HL}(x,y)-\mu_{HL}(x,y))^{2}}{\sigma_{HL}^{2}(x,y)}\right)+\gamma*std \left( \frac{(F_{HL}(x,y)-\mu_{HL}(x,y))^{2}}{\sigma_{HL}^{2}(x,y)}\right) \end{array} $$
(21)
$$ \begin{array}{@{}rcl@{}} Th_{3}=mean\left( \frac{(F_{HH}(x,y)-\mu_{HH}(x,y))^{2}}{\sigma_{HH}^{2}(x,y)}\right)+\gamma*std \left( \frac{(F_{HH}(x,y)-\mu_{HH}(x,y))^{2}}{\sigma_{HH}^{2}(x,y)}\right) \end{array} $$
(22)

Here, an effective threshold is computed from the mean and standard deviation of the squared distances: the mean gives the average distance, and the standard deviation captures the spread around it. The parameter γ controls the contribution of the standard deviation; its value should lie between 0 and 1, and it is set experimentally to 0.5 in our case.
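The per-sub-band foreground test of (17)–(22) can be sketched as follows; the small `eps` guard against zero variance is our addition:

```python
import numpy as np

GAMMA = 0.5  # weight of the standard deviation in the threshold, as in the paper

def foreground_mask(F_cur, mu, var, eps=1e-8):
    """Foreground detection for one sub-band, eqs. (17)-(22).
    F_cur: current frame's detail coefficients (H, W)."""
    d2 = (F_cur - mu) ** 2 / (var + eps)  # squared Mahalanobis distance
    th = d2.mean() + GAMMA * d2.std()     # adaptive threshold, (20)-(22)
    return (d2 > th).astype(np.uint8)     # binary mask, (17)-(19)
```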

3.4 Moving object detection

Moving objects for the ith frame are detected with the following steps:

  1. To accurately detect the moving target, combine the foreground outputs of each sub-band (computed in Section 3.3):

    $$ FG(x,y)=FG_{LH}(x,y)+FG_{HL}(x,y)+FG_{HH}(x,y) $$
    (23)

  2. To keep our approach simple and to produce output at the size of the original frame, super-sample the combined output of the previous step (a code sketch of both steps follows the list):

    $$ FG(x,y)=resize(FG(x,y),2^{l}) $$
    (24)

    where l denotes the number of decomposition levels of the DTCWT.
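A sketch of these two steps; interpreting the sum in (23) as a binary union and using nearest-neighbour interpolation for the super-sampling are our reading of the method:

```python
import cv2
import numpy as np

def combine_and_upsample(fg_lh, fg_hl, fg_hh, level=1):
    """Fuse the three sub-band masks (23) and super-sample the result
    to the original frame size (24)."""
    fg = np.clip(fg_lh + fg_hl + fg_hh, 0, 1).astype(np.uint8)  # union of masks
    h, w = fg.shape
    scale = 2 ** level  # l = number of DTCWT decomposition levels
    return cv2.resize(fg, (w * scale, h * scale),
                      interpolation=cv2.INTER_NEAREST)  # keeps the mask binary
```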

3.5 Post-processing

We carry out the post-processing steps, employing morphological operations, connected component analysis, and the flood-fill algorithm, to suppress noise, detect and label the connected components, and generate the silhouettes of the foreground objects from the binary frame FG(x, y). A morphological closing operator (shown in (25)) with a diamond structuring element (SE) of size 3×3 is applied to eliminate noise; closing is the erosion of the dilated version of a binary frame using the same structuring element [12].

$$ FG = (FG\oplus SE)\ominus SE $$
(25)

Connected component labeling with a thresholding operation is employed to remove isolated small-sized noisy blobs. The binary foreground objects so obtained contain holes; therefore, we employ the flood-fill algorithm to fill these holes and finally obtain the silhouettes of the target objects.
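A sketch of the post-processing chain using OpenCV; the blob-area threshold is illustrative (the paper does not state a value), and the flood fill assumes the frame border is background:

```python
import cv2
import numpy as np

MIN_BLOB_AREA = 50  # illustrative area threshold; not specified in the paper

def post_process(fg):
    """Closing (25), small-blob removal, and hole filling on a 0/1 mask."""
    # 3x3 cross-shaped SE, equivalent to a radius-1 diamond.
    se = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))
    fg = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, se)

    # Remove isolated small-sized noisy blobs via connected components.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(fg, connectivity=8)
    for i in range(1, n):  # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] < MIN_BLOB_AREA:
            fg[labels == i] = 0

    # Flood-fill the border-connected background; remaining zeros are holes.
    mask = np.zeros((fg.shape[0] + 2, fg.shape[1] + 2), np.uint8)
    filled = fg.copy()
    cv2.floodFill(filled, mask, (0, 0), 1)  # assumes pixel (0, 0) is background
    return fg | (1 - filled)  # foreground plus its filled interior holes
```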

3.6 Background update

Under illumination variations, jitter, and camera shake, the background frame needs to be updated regularly to ensure that it appropriately corresponds to the current frame. A motion-based background updating method is used to generate the adaptive background frame. The sub-bands of the current frame and of the previous frame are employed to construct the sub-bands of the current background using the following equations.

$$ \begin{array}{@{}rcl@{}} B_{LH}^{C}(x,y)=\left\{\begin{array}{ll} \alpha.B_{LH}^{C-1}(x,y)+(1-\alpha)F_{LH}^{C}(x,y) & if (x, y)~is~stationary\\ B_{LH}^{C-1}(x,y) & \text{otherwise} \end{array}\right. \end{array} $$
(26)
$$ \begin{array}{@{}rcl@{}} B_{HL}^{C}(x,y)=\left\{\begin{array}{ll} \alpha.B_{HL}^{C-1}(x,y)+(1-\alpha)F_{HL}^{C}(x,y) & if (x, y)~is~stationary\\ B_{HL}^{C-1}(x,y) & \text{otherwise} \end{array}\right. \end{array} $$
(27)
$$ \begin{array}{@{}rcl@{}} B_{HH}^{C}(x,y)=\left\{\begin{array}{ll} \alpha.B_{HH}^{C-1}(x,y)+(1-\alpha)F_{HH}^{C}(x,y) & if (x, y)~is~stationary\\ B_{HH}^{C-1}(x,y) & \text{otherwise} \end{array}\right. \end{array} $$
(28)

According to the above equations, for foreground (moving) coefficients the current background sub-bands (\(B_{LH}^{C}\), \(B_{HL}^{C}\), and \(B_{HH}^{C}\)) simply retain the previous background sub-bands (\(B_{LH}^{C-1}\), \(B_{HL}^{C-1}\), and \(B_{HH}^{C-1}\)); for stationary coefficients, the previous background sub-bands are blended with the current frame's sub-bands (\(F_{LH}^{C}\), \(F_{HL}^{C}\), and \(F_{HH}^{C}\)). The weighting parameter α expresses the influence of the previous background in the updating procedure; its value is 0.4 in the proposed method.
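A sketch of the selective update (26)–(28) for one sub-band:

```python
import numpy as np

ALPHA = 0.4  # influence of the previous background, as in the paper

def update_background(B_prev, F_cur, stationary):
    """Selective running-average update of one sub-band, eqs. (26)-(28).
    stationary: boolean mask, True where no motion was detected."""
    blended = ALPHA * B_prev + (1.0 - ALPHA) * F_cur
    return np.where(stationary, blended, B_prev)
```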

4 Experimental results and analysis

To prove its efficacy, our compressed-domain moving object detection technique has been applied to standard benchmark datasets, namely CDnet 2014, Hall, Walk, Meet, and Traffic. A detailed description of these video datasets is provided as follows:

  • CDnet 2014 [49]: This large-scale dataset contains 11 categories, each consisting of 4 to 6 video sequences. Each sequence has 600 to 7999 frames, with spatial resolutions varying from 320 × 240 to 720 × 576. We evaluate our method on 30 videos from 6 challenging categories: bad weather, baseline, dynamic background, intermittent object motion, shadow, and low frame rate.

  • Hall: In this indoor surveillance color video, two persons carrying boxes walk in from opposite sides. The colors of the background and foreground are very similar, and there are illumination variations. The video contains 299 frames of size 240 × 352 pixels (rows × columns). The bit rate and frame rate of this dataset are 24 Kbps and 30 frames/sec respectively.

  • Walk: In this outdoor color video, a slowly moving person (foreground) is brighter than the background. The dataset has 132 frames of size 240 × 368 pixels (rows × columns). The bit rate and frame rate are 24 Kbps and 25 frames/sec respectively.

  • Meet: In this indoor surveillance color video, two persons enter from opposite sides, shake hands, and leave together. The video has varied illumination with dark and bright areas in the background, and the foreground objects are small. There are 716 frames of size 288 × 384 pixels (rows × columns). The bit rate and frame rate are 24 Kbps and 25 frames/sec respectively.

  • Traffic: In this outdoor traffic surveillance gray-scale video, several fast-moving vehicles of varying number and size appear, along with a person moving slowly on the footpath. The background has small illumination variations, dark and bright regions, trees, and other natural objects. The video contains 49 frames of size 512 × 512 pixels (rows × columns). The bit rate and frame rate are 24 Kbps and 30 frames/sec respectively.

We have implemented all the tested techniques and employed the same parameters as recommended by the authors of the corresponding works. In the following sections, we compare our method with other existing approaches through both qualitative and quantitative analysis.

4.1 Qualitative analysis

The results of our compressed-domain moving object detection technique and of other existing methods are displayed in Figs. 3–9 for some representative frames. The original video frames, the corresponding ground truth, and the detected object results are shown in these figures. Results are given for Highway (baseline category of CDnet 2014; frames 505, 666, 890, 1277), Fountain2 (dynamic background category of CDnet 2014; frames 138, 671, 760, 1272), Snowfall (bad weather category of CDnet 2014; frames 795, 906, 1390, 3143), Hall (frames 72, 100, 191, 265), Walk (frames 7, 81, 105, 130), Meet (frames 311, 418, 610, 702), and Traffic (frames 10, 21, 38, 43), for the proposed as well as seven existing techniques. In Figs. 3–9, the first two rows display the original video frame and the ground truth respectively; the next seven rows (top to bottom) display the results of Huang et al. [22], Gangal et al. [16], Khare et al. [24], Srivastava et al. [25], Yue et al. [18], Tao et al. [17], and Dou et al. [11] respectively.

Fig. 3 Results on the Highway video for frames (a) 505, (b) 666, (c) 890, and (d) 1277; row-wise, top to bottom: original frame, ground truth, Huang et al. [22], Gangal et al. [16], Khare et al. [24], Srivastava et al. [25], Yue et al. [18], Tao et al. [17], Dou et al. [11], and the proposed method

Fig. 4 Results on the Fountain2 video for frames (a) 138, (b) 671, (c) 760, and (d) 1272; row-wise, top to bottom: original frame, ground truth, Huang et al. [22], Gangal et al. [16], Khare et al. [24], Srivastava et al. [25], Yue et al. [18], Tao et al. [17], Dou et al. [11], and the proposed method

Fig. 5 Results on the Snowfall video for frames (a) 795, (b) 906, (c) 1390, and (d) 3143; row-wise, top to bottom: original frame, ground truth, Huang et al. [22], Gangal et al. [16], Khare et al. [24], Srivastava et al. [25], Yue et al. [18], Tao et al. [17], Dou et al. [11], and the proposed method

Fig. 6 Results on the Hall video for frames (a) 72, (b) 100, (c) 191, and (d) 265; row-wise, top to bottom: original frame, ground truth, Huang et al. [22], Gangal et al. [16], Khare et al. [24], Srivastava et al. [25], Yue et al. [18], Tao et al. [17], Dou et al. [11], and the proposed method

Fig. 7 Results on the Walk video for frames (a) 7, (b) 81, (c) 105, and (d) 130; row-wise, top to bottom: original frame, ground truth, Huang et al. [22], Gangal et al. [16], Khare et al. [24], Srivastava et al. [25], Yue et al. [18], Tao et al. [17], Dou et al. [11], and the proposed method

Fig. 8 Results on the Meet video for frames (a) 311, (b) 418, (c) 610, and (d) 702; row-wise, top to bottom: original frame, ground truth, Huang et al. [22], Gangal et al. [16], Khare et al. [24], Srivastava et al. [25], Yue et al. [18], Tao et al. [17], Dou et al. [11], and the proposed method

Fig. 9 Results on the Traffic video for frames (a) 10, (b) 21, (c) 38, and (d) 43; row-wise, top to bottom: original frame, ground truth, Huang et al. [22], Gangal et al. [16], Khare et al. [24], Srivastava et al. [25], Yue et al. [18], Tao et al. [17], Dou et al. [11], and the proposed method

The last row of each of these figures shows the result of the proposed method. As the figures show, our method accurately detects the foreground objects, with considerably higher similarity to the ground truth than the other tested techniques on all the video datasets. The other techniques (Huang et al. [22], Gangal et al. [16], Khare et al. [24], Srivastava et al. [25], Yue et al. [18], Tao et al. [17], Dou et al. [11]) cannot segment the moving objects correctly and misclassify many foreground pixels as background and vice versa; thus, with these approaches the segmented objects cannot be efficiently discriminated from the background region.

4.2 Quantitative analysis

The previous section compared the proposed technique qualitatively with existing methods and showed that it performs considerably better than several existing schemes. The qualitative analysis also shows that perfect segmentation of moving objects is a very challenging task for all the techniques, so it is difficult to judge the performance of the proposed and existing methods using the human visual system alone; quantitative analysis together with qualitative evaluation gives a more accurate performance measure. To measure the performance of the proposed and tested techniques, we use five metrics [49] based on the numbers of false positives FP (background pixels detected as foreground), false negatives FN (foreground pixels detected as background), true positives TP (correctly detected foreground pixels), and true negatives TN (correctly detected background pixels). These performance metrics are expressed as follows:

$$ \begin{array}{ll} Recall=TP/(TP+FN) & Precision=TP/(TP+FP)\\ FPR=FP/(FP+TN) & FNR=FN/(FN+TP) \\ Specificity=TN/(TN+FP) & \end{array} $$
(29)
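For reference, a sketch computing all five measures from the pixel-level counts:

```python
def metrics(tp, fp, tn, fn):
    """Pixel-level performance measures of eq. (29)."""
    return {
        "Recall":      tp / (tp + fn),
        "Precision":   tp / (tp + fp),
        "FPR":         fp / (fp + tn),
        "FNR":         fn / (fn + tp),
        "Specificity": tn / (tn + fp),
    }

# Example: metrics(tp=900, fp=100, tn=9000, fn=100)
```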

To detect the foreground objects correctly, the values of recall, precision, and specificity should be high, while FPR and FNR should be low. Tables 1 and 2 display the recall, precision, FPR, FNR, and specificity for all the tested datasets. Table 1 reports the average performance on the bad weather, baseline, dynamic background, intermittent object motion, shadow, and low frame rate categories of the CDnet 2014 dataset (30 videos in total), whereas Table 2 reports the average performance on the Hall, Walk, Meet, and Traffic video sequences. As shown in these tables, our technique has significantly higher recall, precision, and specificity values and considerably lower FNR and FPR values than almost all the tested techniques. Dou et al. [11] performs better than our proposed method on some categories; however, owing to its complex deep learning architecture, its processing time is much higher than that of our simple technique.

Table 1 Performance of the different methods on CDnet 2014 dataset
Table 2 Performance of the different methods on other datasets

5 Conclusion

In this paper, a new approach has been developed for moving object detection in the wavelet compressed domain. The method detects motion using only the detail sub-band information of the dual-tree complex wavelet transform, without performing the inverse wavelet transform. The shift-invariance and better edge representation properties of the dual-tree complex wavelet transform make our technique more suitable for segmenting moving targets than approaches based on other wavelet transforms. An adaptive thresholding-based background subtraction technique, built on weighted-mean and weighted-variance statistical parameters, has been employed to detect the moving target. Connected component analysis, morphological operations, and the flood-fill algorithm are used as post-processing steps to accurately segment the motion and generate the silhouettes of the target objects. Experimental results with both qualitative and quantitative analysis on different standard video datasets prove that the segmentation performance of the proposed wavelet-domain approach is high even without operating on the actual pixel data. The method has several additional advantages over other compressed-domain methods: it is simple, computationally efficient, and produces more accurate results.