1 Introduction

Detecting moving objects in a video is the first and most important step in many computer vision systems [1]. An efficient moving object detection process increases the accuracy of video surveillance, video retrieval, behavior detection, pattern recognition and many other computer vision applications [2, 3]. A single method of moving object detection cannot address all the above problems in all kinds of videos, due to the different characteristics of camera sensors, light variations, object speeds and other environmental effects. Therefore, moving object detection is still a challenging task for researchers in the area of video processing. The three most popular and widely used approaches are temporal differencing [4], optical flow [5] and background subtraction [6,7,8,9]. Moving object detection methods are further categorized based on temporal information, spatial information or a combination of the two [7, 10, 11]. Paragios et al. [4] proposed a moving object detection method based on geodesic active contours and level sets. Zheng et al. [12] used a frame differencing method to design an automatic moving object detector for video surveillance applications. Sengar et al. [13] detected the moving area based on normalized self-adaptive optical flow. Noise, illumination variation and fake motion degrade the accuracy of optical flow-based moving object detection; another drawback of the optical flow method is its high computational complexity [14]. In background subtraction methods, the key issues are background modeling and background updating. Wu et al. [15] modeled the background based on a ternary pattern feature and pattern kernel density estimation. Dou et al. [16] used a circular shift operator on the neighboring pixels to model complex backgrounds, with the background updated at an adaptive rate; the foreground mask is extracted by background subtraction followed by graph cut adaptation on the distance map. Zhang et al. [11] proposed a fast method for moving object detection based on the Gaussian mixture model (GMM). The weights of the Gaussian distributions are used as decision criteria for tuning the parameters of the GMM, and the background is updated only when the parameters of static pixels are smaller than those of moving pixels, which reduces the computational complexity. Further, to improve the performance of GMM-based background modeling, Xia et al. [17] modified the GMM by introducing information from neighboring frames, and the moving objects were extracted in a decision-based process. Most moving object detection methods in the literature address video surveillance and traffic control systems. Therefore, the literature has largely focused on the extraction of fast moving objects rather than slowly moving ones.

The extraction of slowly moving objects in a video is very useful in video conferencing, indoor scene monitoring and video telephony applications. In temporal differencing-based moving object detection for scenes with slowly moving objects, the misclassification of object pixels as background is very high. Therefore, background modeling based on GMM fails to deal with slowly moving objects, and detecting them using only motion information is a difficult task. Neri et al. [8] proposed a method based on higher-order statistics (HOS) of a group of temporal differences and a motion detection phase, where potential slowly moving objects are detected based on an adaptive HOS threshold. However, this method cannot handle all kinds of slowly moving objects, and its complexity increases due to the HOS calculation. Recently, Zhu et al. [7] proposed detecting slowly moving objects in video by combining temporal and spatial information. The motion analysis is carried out based on a higher-order statistical feature of a group of inter-frame differences, and each frame is segmented based on GMM and the expectation maximization (EM) algorithm. Though this method handles slowly moving objects better than that of Neri et al. [8], its computational complexity is very high. The major drawbacks of GMM and higher-order statistics methods are high computational complexity and high inconsistency due to the unknown number of classes. Motivated by the temporal and spatial information merging approaches of Xia et al. [17] and Zhu et al. [7], we propose a fast and novel method to detect slowly moving objects in a video. First, the motion information is extracted based on an average frame differencing method to minimize the background noise and increase the true positive rate. Then, a valley-based thresholding is developed to extract the homogeneous regions. Finally, these two sources of information are merged to extract the slowly moving object. The rest of the paper begins with the extraction of motion information and spatial information in Sects. 2.1 and 2.2, respectively. The fusion of temporal and spatial information to extract the slowly moving object is described in Sect. 2.3. The experimental results along with the performance evaluation are discussed in Sect. 3, followed by the Conclusion in Sect. 4.

2 Proposed slow moving object detection algorithm

The proposed slowly moving object detection algorithm has three steps: (i) extraction of motion information (MI), (ii) extraction of spatial information (SI) and (iii) object detection.

2.1 Extraction of motion information

In this step, the objective is to extract motion information with minimum background error. In fast moving object detection, the temporal difference value is compared with a threshold to extract the motion information. However, the magnitude of the temporal change for a slowly moving object is much smaller than for a fast moving object, and it is quite difficult to distinguish temporal change due to slowly moving objects from that due to dynamic backgrounds. Therefore, in this section, we propose an efficient method for extracting the motion information with a high true positive-to-false positive ratio (TFR) in comparison with existing methods. In the proposed MI, we consider the true (signed) temporal difference value instead of the absolute temporal difference value. We assume that the temporal pixel difference in the background (dynamic/light variation) is zero-mean Gaussian noise. Over a large number of frames, the sum of temporal pixel differences at any arbitrary position (x, y) in a dynamic background area will therefore be close to zero, whereas the temporal pixel difference in the object area does not follow a zero-mean distribution. Therefore, to minimize the effect of the dynamic background and efficiently capture most of the slowly moving object pixels, we define an average frame difference matrix (AFDM) at each temporal instant. The AFDM at the kth temporal instant is denoted \(\mathbf{AFD }_{k}\) and evaluated as follows

$$\begin{aligned} \mathbf{AFD }_{k}= \frac{1}{L}~~\sum _{l=0}^{L-1}\,\mathbf{F }_{d}(k+l\lambda ) \end{aligned}$$
(1)

and

$$\begin{aligned} \mathbf{F }_{d}(k)=\mathbf{F }_{k}-\mathbf{F }_{k+1} \end{aligned}$$
(2)

where \(\mathbf{F }_{d}(k)\) is the frame difference matrix (FDM) between the kth frame \(\mathbf{F }_{k}\) and the \((k+1)\)th frame \(\mathbf{F }_{k+1}\), \(\lambda \) is a constant defining the temporal sampling distance between two consecutive FDMs used in the evaluation of the AFDM, and L is the total number of FDMs used in the evaluation of \(\mathbf{AFD }_{k}\). The elements of \(\mathbf{AFD }_{k}\) lie between \(-1\) and 1. Experimentally, the values of L and \(\lambda \) are selected as 10 and 6, respectively. The motion information (MI) at the kth frame is extracted from \(\mathbf{AFD }_{k}\) as follows

$$\begin{aligned} \mathbf{ MI }_{k}(x,y)=\left\{ \begin{array}{ll} 0 &{}\quad \text {if} ~~\left| \mathbf{AFD }_{k}(x,y) \right| <T_{h_{k}}\\ 1 &{}\quad \text {Otherwise}. \end{array} \right. \end{aligned}$$
(3)

where the threshold \(T_{h_{k}}=\kappa \sigma _{k}\), \(\sigma _{k}\) is the standard deviation of \(\mathbf{AFD }_{k}\) and \(\kappa \) is a constant, experimentally set to 0.273 to obtain an optimum threshold \(T_{h_{k}}\) that reduces the misclassification error in (3).
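For illustration, Eqs. (1)–(3) can be sketched in a few lines of NumPy. This is only a sketch under stated assumptions, not the authors' implementation: the function name and the convention that the video is pre-loaded as an \((N_{\mathrm{F}}, H, W)\) array of gray-level frames scaled to [0, 1] are ours.

```python
import numpy as np

def motion_information(frames, k, L=10, lam=6, kappa=0.273):
    """Sketch of the MI extraction of Eqs. (1)-(3).

    `frames` is assumed to be an (N_F, H, W) float array of gray-level
    frames scaled to [0, 1]; L, lam and kappa default to the values
    reported above (L = 10, lambda = 6, kappa = 0.273).
    """
    # Eq. (2): signed frame difference F_d(j) = F_j - F_{j+1}
    fd = lambda j: frames[j] - frames[j + 1]

    # Eq. (1): average of L FDMs sampled lam frames apart; zero-mean
    # background fluctuations cancel out, object motion does not
    afd = np.mean([fd(k + l * lam) for l in range(L)], axis=0)

    # Eq. (3): binary mask from the kappa * sigma threshold
    th = kappa * afd.std()
    return (np.abs(afd) >= th).astype(np.uint8)
```

Note that with the default arguments, frames up to index \(k + (L-1)\lambda + 1\) must be available.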

2.2 Extraction of spatial information

In the case of multiple objects, the histogram can be modeled as a mixture of Gaussian distributions, where each mixture component represents a unique object in the image. We treat each dominant peak as the center of a distribution, and the valley between two peaks as the boundary of separation between two distributions. Therefore, a peak and valley algorithm is proposed to extract the number of homogeneous regions in a frame. The positions of the valleys are used as threshold points to segment the frame into an optimal number of homogeneous regions. At first, each color frame is converted into a gray image as follows [23].

$$\begin{aligned} \mathbf {F_{k}} = 0.299\,\mathbf {F_{R_{k}}} + 0.587\,\mathbf {F_{G_{k}}} + 0.114\,\mathbf {F_{B_{k}}},\quad k=1, 2,\ldots ,N_{\mathrm{F}}. \end{aligned}$$
(4)

where \(\mathbf {F_{k}}\) is the gray-level version of the kth frame, \(\mathbf {F_{R_{k}}}\), \(\mathbf {F_{G_{k}}}\) and \(\mathbf {F_{B_{k}}}\) are the red, green and blue planes, respectively, of the kth frame, and \(N_{\mathrm{F}}\) is the number of frames in the video. The histogram of the kth frame is evaluated as

$$\begin{aligned} \mathbf {h_{k}}=\left[ \frac{n_{0_{k}}}{N},\frac{n_{1_{k}}}{N},\ldots ,\frac{n_{255_{k}}}{N}\right] \end{aligned}$$
(5)

where \(n_{i_{k}}\) is the number of times the ith gray value occurs in the kth frame and N is the number of pixels in a frame. A mask of size \(1\times {5}\) is convolved with \(\mathbf {h_{k}}\) to generate a smoothed histogram \(\mathbf {h_{S_{k}}}\); the smoothing suppresses noise and redundant peaks.

$$\begin{aligned} h_{S_{k}}(i)=\left\{ \begin{array}{ll} h_{k}(i) &{}\quad \text {if}~~i=0, 1, 254, 255 \\ a_{i} &{}\quad \text {Otherwise} \end{array} \right. \end{aligned}$$

where

$$\begin{aligned} a_{i}=\frac{1}{5}\sum _{l=-2}^{l=+2}\,h_{k}\left[ i+l\right] \end{aligned}$$
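For concreteness, Eqs. (4)–(5) and the \(1\times 5\) moving-average smoothing can be sketched as below; the helper name and the (H, W, 3) uint8 input convention are illustrative assumptions, not part of the method.

```python
import numpy as np

def smoothed_histogram(rgb_frame):
    """Sketch of Eqs. (4)-(5) plus the 1x5 moving-average smoothing.
    Assumes an (H, W, 3) uint8 RGB frame."""
    # Eq. (4): weighted RGB-to-gray conversion
    gray = (0.299 * rgb_frame[..., 0] + 0.587 * rgb_frame[..., 1]
            + 0.114 * rgb_frame[..., 2]).astype(np.uint8)

    # Eq. (5): normalized 256-bin gray-level histogram
    h = np.bincount(gray.ravel(), minlength=256) / gray.size

    # 1x5 moving average; the two outermost bins on each side are
    # left unsmoothed, matching the definition of h_S above
    hs = h.copy()
    for i in range(2, 254):
        hs[i] = h[i - 2:i + 3].mean()
    return hs
```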

Using \(\mathbf {h_{S_{k}}}\), a valley-based multiclass thresholding is performed to segment the frame into homogeneous regions.

2.2.1 Valley-based thresholding algorithm

The steps of valley-based thresholding algorithm are as follows:

Step 1 Consider the smoothed histogram \(\mathbf {h_{S}}\) of the frame and detect all initial peaks \(\mathbf {I_{P}}\) and initial valleys \(\mathbf {I_{V}}\) as follows:

$$ \begin{aligned} \mathbf {I_{P}} = \left[ \left\{ i,h_{S}(i) \right\} \mid \left( h_{S}(i)>h_{S}(i-1)\right) ~\& ~\left( h_{S}(i)> h_{S}(i+1) \right) \right] . \end{aligned}$$
(6)
$$ \begin{aligned} \mathbf {I_{V}} = \left[ \left\{ j,h_{S}(j) \right\} \mid \left( h_{S}(j)< h_{S}(j-1)\right) ~\& ~\left( h_{S}(j)< h_{S}(j+1) \right) \right] . \end{aligned}$$
(7)

Step 2 Delete all peaks that satisfy neither of the following conditions: \((h_{S}(i)-h_{S}(i-2))>H_{1}\) or \((h_{S}(i)-h_{S}(i+2))>H_{1}\), where i is the position of a peak in the smoothed histogram \(\mathbf {h_{S}}\).

Step 3 If no valley point from \(\mathbf {I_{V}}\) is present between two consecutive peaks of the peak list \(\mathbf {I_{P}}\) updated in step 2, delete the lower of these two consecutive peaks and update \(\mathbf {I_{P}}\). If a peak is deleted, restart step 3 with the updated peak list \(\mathbf {I_{P}}\). Once all consecutive peak pairs are exhausted, go to step 4.

Step 4 Update the valley list \(\mathbf {I_{V}}\) by keeping the single lowest valley point and deleting all other valleys between each pair of consecutive peaks in the updated peak list \(\mathbf {I_{P}}\).

Step 5 If the distance between a valley and its peak is less than \(H_{2}\), delete the respective peak from the peak list \(\mathbf {I_{P}}\) and go to step 4; otherwise, go to step 6.

Step 6 Concatenate one valley point at gray value zero at the beginning of the \(\mathbf {I_{V}}\) list and one valley point at gray value 255 at the end.

Step 7 Use the positions of valleys from the updated \(\mathbf {I_{V}}\) as multiple threshold points to generate the segmented frame \(\text {SI}_{k}(x,y)\) as follows:

$$\begin{aligned} \text {SI}_{k}(x,y)= i \quad \text {if}~~I_{\mathrm{V}}(i-1)\le F_{k}(x,y) < I_{\mathrm{V}}(i); \end{aligned}$$
(8)

\(i=2, 3, \ldots N_{\mathrm{V}}\), where \(I_{\mathrm{V}}(i)\) and \(I_{\mathrm{V}}(i-1)\) are the ith and \((i-1)\)th valley points, \(N_{\mathrm{V}}\) is the number of valleys in the final list after step 6 and \(F_{k}(x,y)\) is the gray value at (x, y) of the kth frame. The values of \(H_{1}\) and \(H_{2}\) are empirically set to 10 and 8, respectively.
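The whole procedure can be sketched as follows. This is a minimal interpretation, not the authors' code: the function name is ours, \(H_{1}\) and \(H_{2}\) are assumed to be expressed in the same units as the smoothed histogram, and the "distance" in step 5 is read as the height gap between a peak and its valley, which the text leaves implicit.

```python
import numpy as np

def valley_thresholds(hs, H1, H2):
    """Sketch of the valley-based thresholding of Sect. 2.2.1."""
    n = len(hs)
    # Step 1: strict local maxima (peaks) and minima (valleys)
    peaks = [i for i in range(1, n - 1) if hs[i - 1] < hs[i] > hs[i + 1]]
    valleys = [j for j in range(1, n - 1) if hs[j - 1] > hs[j] < hs[j + 1]]

    # Step 2: keep peaks rising at least H1 above a neighbor 2 bins away
    peaks = [i for i in peaks if hs[i] - hs[max(i - 2, 0)] > H1
             or hs[i] - hs[min(i + 2, n - 1)] > H1]

    # Step 3: if no valley separates two consecutive peaks, drop the lower one
    merged = True
    while merged:
        merged = False
        for a, b in zip(peaks, peaks[1:]):
            if not any(a < v < b for v in valleys):
                peaks.remove(a if hs[a] <= hs[b] else b)
                merged = True
                break

    # Steps 4-5: keep the single lowest valley between consecutive peaks;
    # drop a peak (and redo step 4) if it rises less than H2 above its valley
    while True:
        kept = [min((v for v in valleys if a < v < b), key=lambda v: hs[v])
                for a, b in zip(peaks, peaks[1:])]
        weak = [p for p, v in zip(peaks, kept) if hs[p] - hs[v] < H2]
        if not weak:
            break
        peaks.remove(weak[0])

    # Step 6: sentinel valleys at gray values 0 and 255
    return [0] + kept + [255]
```

With the returned list, the multilevel thresholding of step 7 reduces (up to a constant label offset) to `np.digitize(F_k, valley_thresholds(hs, 10, 8))` on the gray frame.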

2.3 Object detection

The kth spatial segmented frame \(\mathbf{SI }_{k}\) can be represented as

$$\begin{aligned} \mathbf{SI }_{k}=\bigcup _{i=1}^{N_{\mathrm{V}}-1}\; \mathbf{R}_{\mathbf{k}_{\mathbf{i}}}, \end{aligned}$$
(9)
$$\begin{aligned} \hbox {and}\; \mathbf{R}_{\mathbf{k}_{\mathbf{i}}} \bigcap \mathbf{R}_{\mathbf{k}_{\mathbf{j}}}=\varnothing \; \; \forall \; \; \mathbf{i} \ne \mathbf{j}, \end{aligned}$$
(10)

where \(\mathbf{R}_{\mathbf{k}_\mathbf{i}}\) and \(\mathbf{R}_{\mathbf{k}_\mathbf{j}}\) are the ith and jth homogeneous regions in \(\mathbf{SI }_{k}\). Each \(\mathbf{R}_{\mathbf{k}_\mathbf{i}}\), \(i=1, 2,\ldots , N_{\mathrm{V}}-1\), may have more than one connected region of the ith level. Therefore, each homogeneous region can be defined as

$$\begin{aligned} \mathbf{R}_{\mathbf{k}_\mathbf{i}}=\bigcup _{l=1}^{L_{i}}\;\mathbf{C}_{\mathbf{k}_\mathbf{i}}^\mathbf{l}, \end{aligned}$$
(11)

where \(\mathbf{C}_{\mathbf{k}_\mathbf{i}}^\mathbf{l}\) is the lth connected sub-region in the ith homogeneous region and \(L_{i}\) is the number of connected sub-regions in the ith homogeneous region of the kth frame. The motion information \(\mathbf{MI }_{k}\) of the kth frame and the connected components of each region \(\mathbf{R}_{\mathbf{k}_{\mathbf{i}}}\), \(i=1, 2,\ldots , N_{\mathrm{V}}-1\), of \(\mathbf{SI }_{k}\) are used to extract the slowly moving object in the kth frame. In the first step, the connected components belonging to each region are extracted; in the second step, the connected components belonging to the moving object are extracted for all regions; and in the final step, the moving object is extracted by combining the results of the second step. The part of the moving object in \(\mathbf{R}_{\mathbf{k}_{\mathbf{i}}}\) is defined as

$$\begin{aligned} \mathbf{D}_{\mathbf{k}_\mathbf{i}}=\bigcup _{l}\left( \mathbf{C}_{\mathbf{k}_{\mathbf{i}}}^{\mathbf{l}} \,\Bigg |\,\frac{N_{L}}{M_{L}}> \alpha \right) \end{aligned}$$
(12)

where \(N_{L}= \sharp \;\left( \mathbf{C}_{\mathbf{k}_{\mathbf{i}}}^{\mathbf{l}}\cap \mathbf{MI }_{k}\right) \) and \(M_{L} = \sharp \; \mathbf{C}_{\mathbf{k}_{\mathbf{i}}}^{\mathbf{l}}\), with \(\sharp \) denoting the number of nonzero pixels. The slowly moving object at the kth frame is then defined as

$$\begin{aligned} \mathbf{D}_{\mathbf{k}}= \bigcup _{\mathbf{i=1}}^{\mathbf{N}_{\mathbf{V}}-1}\; \mathbf{D}_{\mathbf{k}_{\mathbf{i}}} \end{aligned}$$
(13)
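A compact sketch of this fusion step, using `scipy.ndimage` for the connected component analysis, is given below. The function name and the value of \(\alpha \) (not specified in this section) are assumptions for illustration.

```python
import numpy as np
from scipy import ndimage

def detect_object(si, mi, alpha):
    """Sketch of the fusion of Eqs. (9)-(13): keep each connected
    sub-region whose overlap with the motion mask exceeds alpha."""
    out = np.zeros_like(mi)
    for i in np.unique(si):                    # homogeneous regions R_k_i
        comps, n = ndimage.label(si == i)      # connected sub-regions C_k_i^l
        for l in range(1, n + 1):
            c = comps == l
            m_l = c.sum()                      # M_L: pixels in the component
            n_l = (c & (mi > 0)).sum()         # N_L: pixels also flagged by MI_k
            if n_l > alpha * m_l:              # Eq. (12)
                out[c] = 1                     # Eqs. (12)-(13): union of parts
    return out
```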

3 Experimental results

The proposed fast valley-based segmentation (FVBS) for slowly moving object detection is validated on eight benchmark videos: Akiyo (V1), Grandma (V2), Mother–daughter (V3), Miss (V4), Tennis (V5), Silent (V6), Suzie (V7) and Salesman (V8), from Derf's collection [18] and the Trace website database [19]. The performance of the proposed method is compared with four state-of-the-art methods: the first is the combination of our proposed MI and Otsu's SI [20], the second is the combination of MI and JSEG [21], the third is the combination of MI and Singla's method [22], and the fourth is Zhu's method [7]. The quality of the motion information is reflected by a low false positive (FP) count and a high true positive (TP) count. Therefore, we use the ratio of TP to FP, known as the TFR, as a measure to evaluate the performance of the motion information extraction.

$$\begin{aligned} \hbox {TFR}=\frac{\hbox {TP}}{\hbox {FP}}\times {100} \end{aligned}$$
(14)

where TP is the number of moving object pixels detected with respect to the ground truth (GT) and FP is the number of background pixels wrongly detected as moving object pixels. Therefore, counting nonzero pixels, \(\hbox {TP}= \sharp \;\left( \hbox {GT}\,\cap \,\hbox {MI}\right) \) and \(\hbox {FP}=\sharp \;\left( \hbox {MI}-\left( \hbox {GT}\,\cap \,\hbox {MI}\right) \right) \). The performance of the motion information is described in Sect. 3.1. The performance of the spatial information is evaluated based on visual perception and time complexity in Sect. 3.2. Similarly, the performance of the final object detection is measured based on misclassification error rate and time complexity in Sect. 3.3. The simulations are carried out in MATLAB on a computer with an Intel(R) Core(TM) i3-2330M CPU and 3 GB RAM.
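As a sketch (assuming binary NumPy masks; the helper name is ours), the TFR of (14) can be computed as:

```python
import numpy as np

def tfr(gt, mi):
    """TFR of Eq. (14) from binary ground-truth and MI masks."""
    tp = np.count_nonzero((gt > 0) & (mi > 0))   # object pixels correctly found
    fp = np.count_nonzero((gt == 0) & (mi > 0))  # background flagged as object
    return 100.0 * tp / fp if fp else float('inf')  # undefined when FP = 0
```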

Fig. 1 a, e, i Sample frames of test videos (V1), (V3) and (V5); b, f, j frame differencing; c, g, k binary mask; d, h, l proposed MI

3.1 Performance of motion information

The performance of the proposed MI is compared with two methods, frame differencing (FD) [2] and the binary mask (BM) [7], using (14). The motion information for the Akiyo (V1), Mother–daughter (V3) and Tennis (V5) videos is presented in Fig. 1. It is observed visually from Fig. 1a–d that the frame difference method fails to extract the motion information properly. Although the BM of Zhu et al. [7] captures the most motion information about the slowly moving object, it also has the highest background error. The proposed method is better than that of Zhu et al. in terms of lower background error and better than FD in terms of capturing the motion information. A similar observation holds for the Mother–daughter video (V3), presented in Fig. 1e–h. In the Tennis (V5) video, the motion of the ball and bat is slightly faster than that of the moving objects in the Akiyo (V1) and Mother–daughter (V3) videos. In this video, our proposed MI detects almost all the motion information with a very low background error in comparison with BM, as shown in Fig. 1k, l. Though the background error is zero in the case of FD, the motion information is not sufficient for the detection of the moving object, as shown in Fig. 1j. In terms of TFR, tabulated in Table 1, the proposed MI has the highest TFR value for all the videos. The time complexity in terms of CPU time for MI, FD and BM is tabulated in Table 2. It is observed that the proposed MI is faster than BM but slower than the FD method. However, considering both TFR and complexity, our proposed method is superior to FD and BM.

Table 1 The measure of TFR
Table 2 Average CPU time in seconds for motion information (MI)

3.2 Performance of spatial information

The positions of the dominant valleys, detected by the algorithm described in Sect. 2.2.1, are used as potential thresholds to segment the frame under consideration. The smoothed histogram of the 20th frame of the Akiyo video (V1), along with the detected peaks and valleys, is presented in Fig. 2a; the corresponding plot for the 20th frame of the Mother–daughter video (V3) is shown in Fig. 2b. The "\(\blacktriangle \)" marks represent the dominant peaks, and the "\(\bullet \)" marks represent the valleys. It is clearly observed from Fig. 2a, b that our proposed peak and valley detection algorithm efficiently detects all the dominant valleys in a multimodal and irregularly shaped histogram. In the Akiyo video (V1), the peak valley algorithm detects eight peaks and seven valleys, whereas in the Mother–daughter video (V3) it detects seven dominant peaks and six valleys. In the Mother–daughter (V3) histogram in Fig. 2b, three consecutive peaks are visible between gray values 100 and 125; the proposed algorithm selects the middle one as the dominant peak and ignores the other two, along with the valley between these peaks.

Fig. 2 a Peak valley plot (Akiyo), b peak valley plot (Mother–daughter)

Fig. 3 a, e, i Otsu's segmentation; b, f, j JSEG; c, g, k Zhu's method; d, h, l proposed valley-based segmentation (SI)

Table 3 Average CPU time in seconds for spatial segmentation

The segmentation results for the Akiyo (V1), Grandma (V2) and Silent (V6) videos based on Otsu's multilevel thresholding [20], the JSEG method [21], Singla's method [22], Zhu's EM-based method [7] and the proposed SI method are presented in Fig. 3. The number of classes considered for Otsu's method is set equal to the number of valleys detected by our valley detection algorithm. Observing the segmented results, some portions of the background region and the moving object region were labeled as the same class. Apart from this, the object and background are mostly connected in JSEG, as shown in the hair of Akiyo (V1), the neck of Grandma (V2), and the head and body of Silent (V6) in Fig. 3b, f, j. In the segmentation results of Otsu's multilevel thresholding and Zhu's method, a few portions of the dynamic background and the moving objects share the same level, which may lead to misclassification error in the object extraction stage. In our proposed method, these two problems are negligible in comparison with the other four methods. The CPU time for all these segmentation methods is tabulated in Table 3. Otsu's thresholding is the fastest method; our valley-based thresholding has the second lowest CPU time, very close to Otsu's. However, Otsu's method needs the number of classes a priori, which an automatic detection process cannot assume. Therefore, in terms of CPU time and automatic detection of the number of classes, the proposed valley-based thresholding is suitable for real-time application.

Table 4 Average total CPU time in seconds for different methods
Table 5 Average misclassification error in percentage
Fig. 4 Col 1: sample frames of the test videos; Col 2: ground truth of the moving objects in Col 1; Cols 3–7: extracted slowly moving objects using Otsu, JSEG, Singla, Zhu and the proposed method, respectively

3.3 Performance of object extraction stage

The slowly moving objects in the kth frame are extracted by combining the motion information \(\mathbf{MI }_{k}\) and spatial information \(\mathbf{SI }_{k}\), followed by morphological closing and median filtering. The performance of the proposed slowly moving object detection is compared with the Otsu, JSEG, Singla and Zhu methods, where the segmentation results of Otsu, JSEG and Singla are fused with the proposed MI to extract the moving object. The results of the object extraction stage are reported in Fig. 4, where each row shows the analysis of a particular test video; eight test videos are considered from the publicly available benchmark databases [18, 19]. The first column of Fig. 4 shows sample frames from the test videos, and the second column shows the manually constructed ground truth of the slowly moving objects. Columns 3, 4, 5, 6 and 7 show the slowly moving objects detected by the Otsu, JSEG, Singla, Zhu and proposed methods, respectively. It is clearly observed from the last column that our method extracts the slowly moving object more efficiently from the sample frames of all the test videos than the other four approaches. The overall time complexity of the proposed method is reduced considerably in comparison with the other methods due to the reduction in the complexity of the spatial segmentation stage as well as the motion information stage. The average time complexity based on CPU time for all these methods is tabulated in Table 4. The generation of ground truth for a particular frame is carried out manually, which is a time-consuming process. Therefore, we have considered 100 sample frames in each video, with a sampling distance of \(\frac{N_{\mathrm{F}}}{100}\), to evaluate the average misclassification error. As tabulated in Table 5, the average misclassification error of the proposed method is the lowest among all these methods. In terms of average misclassification error and average time complexity, the proposed method is faster and superior to the other four methods.
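The misclassification error formula is not spelled out in this section; a common pixel-wise definition, offered here purely as an assumption, is the percentage of pixels whose detected label disagrees with the ground truth:

```python
import numpy as np

def misclassification_error(gt, det):
    """Assumed pixel-wise misclassification error in percent:
    fraction of pixels where the binary masks disagree."""
    return 100.0 * np.mean((gt > 0) != (det > 0))
```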

4 Conclusion

In this paper, a novel and fast method is proposed to detect slowly moving objects in a video. The proposed method is based on the combination of spatial and temporal information. We have developed a simple temporal segmentation, the average frame differencing method, to reduce the misclassification error due to background pixels. The time complexity is also reduced by introducing a valley-based spatial segmentation, and a new fusion criterion is proposed to detect the slowly moving object in a video. The proposed method is tested on several videos from publicly available databases. It is faster in terms of CPU time, and it also outperforms the Otsu, JSEG, Singla and Zhu methods in terms of misclassification error. The proposed MI detection process produces a high misclassification error for sudden scene changes or light variations on a background monitor or television screen. To solve this problem, future work may combine codebook-based background modeling with temporal differencing methods to improve the efficiency of the motion information.