1 Introduction

Video analysis and understanding is an active area of research in the field of computer vision. Foreground detection, which aims to accurately detect foreground or moving objects in a scene, serves as a basic step in many computer vision applications such as visual surveillance [1, 19, 25, 30, 40, 47, 54, 63], human-machine interaction [33, 52, 56, 59], object tracking [22, 28, 34], vision-based hand gesture recognition [48], person counting [29, 38], traffic surveillance [12, 17, 23, 31, 37, 42, 50], content-based video coding [11], and optical motion capture [6]. This pre-processing task helps higher-level processes achieve better object classification and tracking; hence, accurate foreground detection with a minimum false error rate is crucial to their performance. The basic method for detecting moving objects in a video sequence is background subtraction. In the background subtraction technique, an up-to-date background model must be built and maintained, and each pixel of the next frame is compared with this model; pixels that deviate from the background model are labeled as moving objects. Accurate foreground detection in complex environments remains difficult, as background subtraction methods face key challenges such as dynamic backgrounds, sudden or gradual illumination changes, image noise, camouflage, and camera jitter.

Many significant efforts have been made to achieve high accuracy for moving object detection under complex environments [5]. Foreground detection is broadly realized by three approaches, namely the background subtraction approach, the frame-difference approach, and the optical flow approach. Among these, the background subtraction approach is the most widely utilized technique in background modeling [61]. Besides the aforementioned approaches, several model-based approaches exist for robust foreground detection, such as the HOG detector [14] and the DPM detector [2]. However, these detectors are well suited only to specific applications and may fail in general cases. To mitigate this issue, a popular feature extractor, the spatio-temporal local binary pattern (STLBP), came into existence and has been employed in several reported works for foreground detection in challenging real-time applications. STLBP is a variant of the local binary pattern (LBP) in which each pixel is modeled as a group of STLBP dynamic texture histograms that collectively capture both spatial texture features and temporal motion features from a video frame [64]. However, the overall foreground detection accuracy of STLBP degrades under strong multimodal background conditions. This is because the texture features extracted by the STLBP method are sensitive to the dynamic behavior and illumination variation of the background. Further, a constant learning rate and threshold value are adopted during background modeling, which increases the false detection rate. In the suggested scheme, an STLBP feature-based robust foreground detection algorithm is proposed for background modeling. The key contributions of the suggested scheme are outlined as follows:

  1. A modification of the change description rule of the traditional STLBP method is proposed to obtain robust features under complex situations.

  2. An adaptive strategy is formulated in background modeling to compute the learning rate (αb) and the threshold value (Tp).

  3. The efficacy of the proposed scheme is validated through qualitative and quantitative measures against state-of-the-art approaches.

The remaining sections of the article are structured as follows. Section 2 elaborates on relevant work in foreground detection. Section 3 discusses preliminaries of the LBP and STLBP techniques. Section 4 puts forward the proposed scheme for foreground detection. Experimental results and discussions are critically analyzed in Section 5. Finally, Section 6 provides concluding remarks.

2 Related work

In the recent past, several research findings on foreground detection have been reported [5, 49, 53]. In 2004, Piccardi categorized foreground detection methods based on accuracy, speed, and memory requirements [49]; this comparison of the complexity of different methods allows readers to select a suitable method for a specific purpose. In the present work, an efficient background modeling strategy is adopted under challenging conditions such as illumination variation, dynamic background, camouflage, rippling water, and noise. Bouwmans et al. [4] provided a detailed survey of statistical modeling used for foreground detection. In most of the earlier schemes, the foreground is extracted by subtracting a reference/background frame from the current frame. The reference frame can be taken as (1) the first frame or the previous frame [51], (2) the average of several consecutive frames [36], or (3) the median of several consecutive frames [39]. These methods are very simple and fast for foreground detection; however, they are very sensitive to background changes, which in turn degrades overall performance.

To overcome the limitations of the basic background subtraction method, Wren et al. [59] proposed a simple Gaussian distribution to model the intensity value of a pixel at a fixed location over time. Further, the authors used an adaptive updating technique for background modeling under dynamic backgrounds. However, this uni-modal Gaussian distribution model is not suitable for multi-modal backgrounds, and multi-modal Gaussian distribution models were subsequently proposed to resolve this issue.

In 1999, the most widely accepted multi-modal distribution over time was proposed by Stauffer and Grimson [55]; it is popularly known as the mixture of Gaussians (MoG) or the Gaussian mixture model (GMM). In this work, the background model is adaptively updated in an iterative manner and performs significantly well under gradual illumination changes and non-stationary background conditions. The GMM model provides a practical solution for foreground extraction in many situations, and numerous improvements of the GMM model have been reported in the literature [8, 58, 65]. However, when the background varies very fast (many modes, i.e., more than 5), a few Gaussians (usually 3 to 5) are not sufficient to model the background accurately. GMM also needs an appropriate threshold value and learning rate to obtain improved results.

To overcome the problems of GMM, in 2000, Elgammal et al. [16] proposed a non-parametric approach based on the kernel density estimator (KDE). This approach estimates the probability distribution of background values from the recent history of each pixel over time; the distribution is then used to decide whether a pixel belongs to the foreground or the background. The major limitation of KDE is its huge computational time, as it maintains the history of each pixel to estimate the density.

The codebook method is another notable approach to modeling the background [35]. In this work, each pixel is assigned a codebook, and each codebook consists of several codewords; the number of codewords captures the variation of the pixel over time. An improvement of this scheme reported in [60] uses both spatial and temporal information. This method is able to detect moving objects under dynamic environments and decreases the false positive rate to a certain extent.

In [26], Heikkila et al. proposed a texture-based approach for background modeling using a group of histograms based on the local binary pattern (LBP). The most important properties of LBP are its robustness to illumination changes and its computational simplicity. This method exhibits improved results for slowly moving background pixels. However, its performance degrades sharply when the background changes strongly (under multi-modal background conditions).

Zhang et al. [64] extended the algorithm in [26] by measuring the local binary pattern in the spatio-temporal domain, named the spatio-temporal local binary pattern (STLBP). To model each pixel, the authors use weighted LBP histograms that include both spatial and temporal information. The limitation of this approach is its constant learning rate, which reduces detection performance when the speed of a moving object varies across successive frames. In [3], the authors proposed a local binary self-similarity pattern and demonstrated its effectiveness against state-of-the-art methods.

Several foreground detection algorithms are based on subspace learning models such as Principal Component Analysis (PCA) [7]. In PCA, the background sequence is modeled by a low-rank matrix, and a sparse error constitutes the foreground objects. In particular, PCA-based BGS techniques are robust against noise and illumination changes. Several variants of PCA-based techniques for change detection have been designed and developed [10, 57]. The major challenge of PCA-based BGS is the absence of an update mechanism for background/foreground classification; further, the overall execution time is quite high for long video sequences [7]. In 2013, Feng et al. [18] proposed an online Robust PCA (OR-PCA) to reduce the memory cost and enhance computational efficiency. Furthermore, Javed et al. [32] improved the original OR-PCA for accurate foreground detection using an image decomposition technique: the input video frames are decomposed into Gaussian and Laplacian images, OR-PCA is applied to both decomposed images with specific parameters to model the background separately, and an alternating initialization process is applied to speed up the optimization. Han et al. [24] further improved OR-PCA for background subtraction under camera jitter conditions.

In 2018, Goyal et al. [20] proposed counter-based adaptive background learning with respect to changes in the background. The authors extended the Heikkila technique [26] for foreground detection by updating the learning rate under different challenging conditions. However, this method achieves lower accuracy under conditions such as camouflage, shadow, outdoor varying illumination, and highly dynamic backgrounds. In [62], the authors utilize intensity, color, and Local Ternary Pattern (LTP) features for background modeling. In this scheme, five features are obtained for each pixel, and the dominant feature is adaptively selected per pixel in the time domain to improve foreground detection accuracy under complex backgrounds; a double threshold is also used for further improvement in detection accuracy. In 2019, Moudgollya et al. [41] used the angle co-occurrence matrix (ACM) along with the gray-level co-occurrence matrix (GLCM) to extract six features, employing the merits of ACM and GLCM for foreground detection under different challenging environments. The major drawback of this method is its huge computational overhead.

In 2017, Zhong et al. [66] proposed a robust background modeling method that updates the background at the pixel level and the object level to detect motionless or slow-moving objects. This method works effectively in many real-time applications, particularly when the speed of foreground objects varies. In 2018, Panda et al. [45] proposed a spatio-temporal BGS method utilizing the Wronskian change detection model, together with a new fuzzy adaptive learning rate to improve detection accuracy. In [46], the authors used a multi-channel Wronskian change detection model and a codebook technique for foreground detection. Although these methods provide impressive results under challenging environments, they fail to reduce the false detection rate under dynamic environments, because the aforementioned schemes [45, 46] utilize a constant threshold value for all types of complex situations in a video sequence.

From the literature, it has been noticed that most foreground detection schemes operate in the spatial domain rather than in both the spatial and temporal domains. Further, most reported background modeling schemes use a constant learning rate and a constant threshold value; this arbitrary selection does not ensure improved results in all challenging situations. These facts motivate us to propose a modified STLBP-based technique with an adaptive formulation of the learning rate and threshold value for foreground detection.

3 Overview of LBP and STLBP methods for BGS

This section presents an overview of the LBP and STLBP-based approaches for background subtraction, as explained in the following subsections.

3.1 Conceptual overview of local binary pattern

The local binary pattern (LBP) is a grayscale-invariant and powerful texture descriptor [26, 43, 44], and a well-accepted feature extractor in many pattern recognition and computer vision applications. The prime usefulness of the LBP operator is that it is very robust to monotonic transformations of pixel gray values and provides excellent information about region texture features.

The LBP operator labels the neighborhood of a pixel using a binary derivative technique: the binary derivative of a pixel in a given image region is measured by thresholding the relation of the neighborhood pixels to the center pixel. The classical or traditional LBP operator works on a 3 × 3 local region of an image. It generates a binary value of 1 when a neighboring pixel is greater than or equal to the center pixel, and a binary value of 0 when the neighboring pixel is less than the center pixel. LBP is expressed mathematically as follows:

$$ LBP(x_{c},y_{c})=\sum\limits_{i=0}^{7}s(g_{i}-g(x_{c},y_{c}))\times2^{i} ,\quad i=0,1,2...7 $$
(1)

where gi represents the gray values of the surrounding pixels with respect to the center pixel coordinates (xc, yc). The function s(gi − g(xc, yc)) is defined as,

$$ s(g_{i}-g(x_{c},y_{c}))=\left\{\begin{array}{ll} 1 & \quad if\quad g_{i}-g(x_{c},y_{c})\geqslant 0\\ 0 & otherwise \end{array}\right. $$
(2)

The 8 neighbors of the center pixel in the 3 × 3 region are thus represented by an 8-bit number. Thereafter, this binary number is converted to its equivalent decimal value, which represents the LBP label of the center pixel. The basic working principle of the LBP operator is shown in Fig. 1.
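To make the operator concrete, the following minimal Python sketch computes the basic 3 × 3 LBP code of (1)-(2); the clockwise neighbor ordering is one common convention, and the function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def basic_lbp(patch):
    """LBP code for the center pixel of a 3x3 grayscale patch, Eqs. (1)-(2)."""
    center = patch[1, 1]
    # 8 neighbors in clockwise order starting from the top-left pixel
    # (an assumed ordering; any fixed ordering yields a valid LBP code)
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    code = 0
    for i, g in enumerate(neighbors):
        if g >= center:        # s(g_i - g_c) = 1 when g_i - g_c >= 0
            code |= 1 << i     # weight 2^i
    return code

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
print(basic_lbp(patch))  # decimal LBP label of the center pixel
```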

Fig. 1

Example of the basic LBP operator. (i) 3 × 3 region of an image, (ii) binary conversion after thresholding, (iii) weights assigned to the neighborhood pixels (equivalent decimal values of the binary number in clockwise direction starting from pixel coordinate (1,1)), (iv) summing of weights, (v) LBP code for the center pixel

Ojala et al. [44] extended the basic definition of LBP to I equally spaced pixels located on a circle of radius R, from which the LBP code of the center pixel is measured as shown in Fig. 2. Unlike the fixed 8 neighbors of the basic LBP operator, I may be any number: a small I reduces the computational overhead, whereas a large I is required to capture large-scale texture features. The circular LBP can be mathematically expressed as,

$$ LBP_{I,R}(x_{c},y_{c})=\sum\limits_{i=0}^{I-1}s(g_{i}-g(x_{c},y_{c}))\times2^{i}, \quad i=0,1,2...I-1 $$
(3)

where g(xc, yc) is the value of the center pixel with coordinates (xc, yc) of the circular region of radius R, and gi (i = 0, 1, 2 ... I − 1) are the gray values of the I equally spaced pixels on the circle of radius R centered at (xc, yc). The definition of s(gi − g(xc, yc)) is the same as in (2). The (I, R) values may be varied, e.g., (4,1), (8,1) or (16,2), as proposed in [44]. The coordinates of the circular neighbor pixels are given as,

$$ [x_{i},y_{i}]=\left[x_{c}+R\cos\left(\frac{2\pi i}{I}\right),\ y_{c}-R\sin\left(\frac{2\pi i}{I}\right)\right],\quad i=0,1,2...I-1 $$
(4)

The coordinates of the neighbors that do not fall on pixel points are approximated through bilinear interpolation [44].
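A sketch of the circular operator follows, with bilinear interpolation for neighbors that fall off the pixel grid; setting alpha = 0 recovers (3), while a positive alpha gives the α-shifted variant introduced later in (5). Names are illustrative.

```python
import numpy as np

def bilinear(img, x, y):
    """Bilinearly interpolated gray value at a real-valued location (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x0 + 1]
            + (1 - dx) * dy * img[y0 + 1, x0] + dx * dy * img[y0 + 1, x0 + 1])

def circular_lbp(img, xc, yc, I=6, R=2, alpha=0.0):
    """LBP_{I,R}(xc, yc) of Eq. (3); alpha > 0 gives the variant of Eq. (5)."""
    center = img[yc, xc]
    code = 0
    for i in range(I):
        xi = xc + R * np.cos(2 * np.pi * i / I)   # Eq. (4)
        yi = yc - R * np.sin(2 * np.pi * i / I)
        gi = bilinear(img, xi, yi)
        if gi - center + alpha >= 0:              # thresholding of Eq. (2)
            code |= 1 << i
    return code

img = np.random.randint(0, 256, (20, 20)).astype(float)
print(circular_lbp(img, 10, 10, I=6, R=2))
```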

Fig. 2

LBP code for a circular neighborhood with (I, R) = (6, 2)

3.2 Background subtraction using LBP method

In 2006, Heikkila et al. [26] proposed an efficient approach for moving object detection that models the background using a region texture-based method and makes the circular LBP more robust against small or negligible changes in pixel value. Here, the circular LBP operator is modified as,

$$ LBP_{I,R}(x_{c},y_{c})=\sum\limits_{i=0}^{I-1}s(g_{i}-g(x_{c},y_{c})+\alpha)\times2^{i} $$
(5)

where the expression for s(x) is the same as defined in (2), and α retains the discriminative power of the LBP operator when the gray values of the neighboring pixels are very close to the center pixel value of a region [26].

The limitation of this method is that it does not include any temporal information about the pixels; as a result, it fails to detect moving objects accurately under dynamic environmental conditions.

3.3 Background subtraction using STLBP method

To mitigate the aforementioned problem, Zhang et al. [64] extended the work reported in [26] by presenting a new dynamic texture operator, namely the spatio-temporal LBP (STLBP).

As shown in Fig. 3, let Ft and Ft−1 be the current and previous frames, respectively. In the current frame Ft, g(xc, yc;t) and g(i;t) (i = 0, 1, 2 ... I − 1) are the center pixel and the circular neighbor pixels, respectively. To measure the motion information along the temporal direction, only the circular neighbors g(i;t − 1) (i = 0, 1, 2 ... I − 1) of frame Ft−1 are considered. The two LBP codes of the center pixel g(xc, yc;t) with respect to the spatial and temporal neighboring pixels are expressed as follows:

$$ LBP_{I,R}(x_{c}, y_{c};t)=\sum\limits_{i=0}^{I-1}s(g(i;t)-g(x_{c},y_{c};t)+\alpha)\times2^{i} $$
(6)
$$ LBP_{I,R}(x_{c}, y_{c};t-1)=\sum\limits_{i=0}^{I-1}s(g(i;t-1)-g(x_{c},y_{c};t)+\alpha)\times2^{i} $$
(7)
Fig. 3

STLBP histogram measurement for a circular neighborhood. (i) Circular neighbors of two consecutive frames with the same coordinates relative to the center pixel, (ii) histogram measurement on 2I bins for each frame, and (iii) resultant histogram

LBPI,R(xc, yc;t) and LBPI,R(xc, yc;t − 1) represent the spatial LBP and temporal LBP of a circular region of radius R. The spatial histogram (on the current frame) and the temporal histogram (on the previous frame) are computed using (6) and (7), respectively, over a circular region of radius Rregion. Figure 3 illustrates the STLBP histogram measurement of a circular neighborhood. Next, the STLBP histogram is calculated as a weighted combination of the spatial and temporal histograms as follows:

$$ H_{t}=\omega \times H_{t-1,i}+(1-\omega)\times H_{t,i},\quad i=0,1,2...2^{I}-1 $$
(8)

where Ht,i and Ht−1,i are the ith-bin histograms of the current and previous frames, respectively, as shown in Fig. 3, Ht is the resultant spatio-temporal histogram at time t, and ω is the spatio-temporal rate. Each pixel is modeled with a group of adaptive STLBP-based histograms computed over a circular region of radius Rregion.

The advantage of the STLBP method is that it extracts both spatial and temporal information for each pixel, which reduces the false detection rate under dynamic conditions. However, a constant threshold and learning rate are used in the STLBP method; as a result, the overall foreground detection accuracy is reduced to a great extent.

4 Proposed approach

In the proposed work, an improvement of the spatio-temporal LBP (STLBP) descriptor [64] is proposed for foreground detection under complex scenarios. In the suggested scheme, a change description rule is formulated that modifies the traditional STLBP method of (6) and (7). The modified change detection rule is defined as follows:

$$ LBP_{I,R}(x_{c}, y_{c},t)=\sum\limits_{i=0}^{I-1}s_{T_{r}}(g(i;t)-\mu_{g}(t)+\alpha)\times2^{i} $$
(9)
$$ LBP_{I,R}(x_{c}, y_{c},t-1)=\sum\limits_{i=0}^{I-1}s_{T_{r}}(g(i;t-1)-\mu_{g}(t)+\alpha)\times2^{i} $$
(10)

where,

$$ s_{T_{r}}(x)=\left\{\begin{array}{ll} 1 & \quad if\quad x\geqslant T_{r}\times F_{t}(x_{c},y_{c})\\ 0 & otherwise \end{array}\right. $$
(11)

and

$$ \mu_{g}(t)=\frac{1}{I}\sum\limits_{i=0}^{I-1}g(i;t) $$
(12)

Tr (= 0.2) is a user-defined constant.
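A minimal sketch of the modified rule of (9)-(12) for one pixel follows; the neighbor values are assumed to have already been sampled on the circle of radius R, and names are illustrative.

```python
import numpy as np

def s_tr(x, center_value, Tr=0.2):
    """Eq. (11): 1 if x >= Tr * F_t(xc, yc), else 0."""
    return 1 if x >= Tr * center_value else 0

def modified_lbp_code(neighbors, mu_g, center_t, alpha=3.0, Tr=0.2):
    """Eqs. (9)-(10): LBP code of the given neighbors against the mean mu_g."""
    code = 0
    for i, gi in enumerate(neighbors):
        code |= s_tr(gi - mu_g + alpha, center_t, Tr) << i
    return code

neighbors_t = np.array([120., 118., 122., 119., 121., 117.])   # g(i;t)
neighbors_t1 = np.array([119., 118., 121., 120., 122., 116.])  # g(i;t-1)
mu_g = neighbors_t.mean()            # Eq. (12): always the current frame's mean
spatial = modified_lbp_code(neighbors_t, mu_g, center_t=120.0)    # Eq. (9)
temporal = modified_lbp_code(neighbors_t1, mu_g, center_t=120.0)  # Eq. (10)
print(spatial, temporal)
```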

To support our claim regarding the proposed foreground detection with the modified change description rule, two experiments are carried out. The first experiment shows the variation of the feature under a dynamic environment (Fig. 4); the second focuses on variation in illumination (refer to Fig. 5). Here, the mean of the surrounding pixels is taken instead of the center pixel across the local circular region of radius R = 2 to generate the spatial LBP defined in (9). The reason for selecting the mean of the surrounding pixels rather than the center pixel is that, for two consecutive frames of a highly dynamic region, the mean values of the surrounding pixels are nearly invariant compared to the variation of the center pixel of these regions; this motivated the authors to extract invariant features under dynamic and illumination-varying environments. In the experiment, two widely accepted video sequences, namely WT (Waving Tree) [9] and OTCBVS [15], with 280 frames are used. The feature variation of a particular pixel at coordinate (65, 77) in WT and of a particular pixel at coordinate (128, 85) in OTCBVS across all 280 frames is shown in Figs. 4 and 5, respectively.

Fig. 4

Feature variation of a particular pixel (65, 77) on WT dataset across 280 frames

Fig. 5

Feature variation of a particular pixel (128, 85) on OTCBVS dataset across 280 frames

From these figures, it is noticed that in both video sequences (WT and OTCBVS), the proposed spatio-temporal LBP feature values show less variation with respect to the erratic behavior of a dynamic pixel at a particular location along the 280 frames. Additionally, the proposed scheme shows less variation of feature values in the temporal direction than the traditional STLBP scheme.

Additionally, to show the effectiveness of the proposed foreground detection, another experiment is conducted between the traditional spatial LBP of (6) and the proposed spatial LBP of (9), with results shown in Fig. 6. In this experiment, texture features are first extracted with both the traditional and the proposed spatial LBP. The extracted texture features of both techniques are then given as input to GMM [55], with identical GMM parameters for both inputs. Performance measures, namely Recall, Precision and Fmeasure, resulting from the GMM process are computed. Figure 6 compares the encodings of the traditional and proposed spatial LBP on the Highway [21], PETS 2006 [21] and OTCBVS [15] datasets, respectively. From Fig. 6, it is clearly noticed that the proposed spatial LBP achieves considerable improvements over the traditional spatial LBP with respect to the aforementioned performance measures.

Fig. 6

An experiment showing the comparison between the encodings of the traditional spatial LBP and the proposed spatial LBP for the highway (HW), PETS 2006 and OTCBVS video sequences. The texture features from the traditional and proposed spatial LBP are given as input to GMM. The second, third and fourth rows show the Recall, Precision and Fmeasure values, respectively

Further, Fig. 7 compares the foreground generated by the proposed spatio-temporal LBP with that of the traditional spatio-temporal LBP at different threshold values. In this comparison, we adopt the background modeling procedure reported in [26], and all parameters (ω = 0.01, αb = αw = 0.01, TB = 0.8, α = 3 and Tr = 0.2) are kept fixed for a fair comparison across all methods. In Fig. 7, (b)-(d) show the foreground segmented by the traditional spatio-temporal LBP method of [64]; (e)-(g) show the foreground segmented using STLBP features extracted by taking the mean of the surrounding pixels instead of the center pixel in (6) and (7); and (h)-(j) show the moving objects segmented by the proposed spatio-temporal LBP method with the modified change description rule of (9) and (10). For each method, results are shown for threshold values (Tp) of 0.6, 0.7 and 0.8, respectively.

Fig. 7

Comparison of foreground segmentation results between the traditional STLBP and the proposed STLBP for different thresholds (Tp). a 250th frame of the WT video sequence, b-d segmented foreground results of the traditional STLBP method [64] with Tp values 0.6, 0.7 and 0.8, respectively, e-g segmented foreground results of the modified STLBP method with Tp values 0.6, 0.7 and 0.8, respectively, h-j segmented foreground results of the proposed STLBP method with Tp values 0.6, 0.7 and 0.8, respectively. Both the traditional STLBP and the proposed STLBP are input to the background modeling technique reported in [26]

Next, we compute the two histograms Ht and Ht−1 of the binary patterns resulting from a circular region, as produced by (9) and (10). The histogram of binary patterns over a region is used as the local texture feature. As shown in Fig. 3, Ht and Ht−1 over a region are computed as follows:

$$ H_{t,n}=\sum\limits_{(x,y)\in R}N(LBP_{I,R}(x,y;t)=n), \qquad n=0,1,2...2^{I}-1 $$
(13)

and

$$ H_{t-1,n}=\sum\limits_{(x,y)\in R}N(LBP_{I,R}(x,y;t-1)=n),\qquad n=0,1,2...2^{I}-1 $$
(14)

where Ht,n and Ht−1,n represent the nth-bin histograms at times t and t − 1, respectively, and N(z) is 1 if z is true and 0 otherwise. The STLBP histogram Ht is generated from the aforementioned two histograms as follows:

$$ H_{t}=\omega \times H_{t-1,n}+(1-\omega)\times H_{t,n}, \qquad n=0,1,2...2^{I}-1 $$
(15)

The STLBP histogram expressed above provides information about the spatial texture features and the temporal motion features. The parameter ω depends on the degree of dynamism of the background: if the background changes very fast, a larger value of ω is adopted, otherwise a small value is used. In our case, the value of ω is set to 0.01 experimentally.
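A sketch of the histogram construction of (13)-(15) is given below, assuming the LBP codes of all pixels inside the region of radius Rregion have already been computed for frames t and t − 1; names are illustrative.

```python
import numpy as np

def region_histogram(lbp_codes, I=6):
    """Eqs. (13)-(14): per-bin counts of the LBP codes over a region."""
    hist = np.zeros(2 ** I)
    for code in lbp_codes:
        hist[code] += 1            # N(LBP = n) accumulated bin by bin
    return hist

def stlbp_histogram(codes_t, codes_t1, omega=0.01, I=6):
    """Eq. (15): blend the spatial and temporal histograms with rate omega."""
    return (omega * region_histogram(codes_t1, I)
            + (1 - omega) * region_histogram(codes_t, I))

codes_t = np.random.randint(0, 2 ** 6, 80)   # codes in the region at time t
codes_t1 = np.random.randint(0, 2 ** 6, 80)  # codes at time t-1
H = stlbp_histogram(codes_t, codes_t1)
print(H.sum())   # total mass equals the number of pixels in the region
```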

4.1 Proposed background modeling algorithm using STLBP feature

Background modeling is considered a prime step of the background subtraction process, as it has a high impact on overall detection accuracy. For background model initialization, the STLBP histogram is evaluated over a circular region of radius Rregion centered at a particular pixel (x, y) using the first and second frames. Following MoG, the K background models for a pixel consist of {\(H_{m_{k}}^{(x,y)},~ k=1,2...K\)}. Initially, the weights of all model histograms are set to equal values, and the weights wk (k = 1,2...K) are normalized so that they sum to 1. From the next frame (the third frame) onward, the STLBP histogram H over a circular region of radius Rregion centered at the pixel (x, y) is computed using the last stored frame. The current histogram H(x, y) is then compared with the K model histograms using the proximity measure expressed in (16).

$$ \bigcap(H_{m},H)=\sum\limits_{l=0}^{L-1}min(H_{m,l},H_{l}) $$
(16)

where L = 2I represents the number of bins. The intuition behind the histogram intersection in (16) is that it measures the common part of two histograms and thereby their similarity.
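The proximity of (16) reduces to a single element-wise minimum, as the short sketch below shows; both histograms are assumed to be normalized to the same total mass.

```python
import numpy as np

def proximity(Hm, H):
    """Eq. (16): overlap (common part) of two histograms."""
    return np.minimum(Hm, H).sum()

Hm = np.array([0.2, 0.3, 0.5])
H = np.array([0.1, 0.6, 0.3])
print(proximity(Hm, H))   # min sums: 0.1 + 0.3 + 0.3 = 0.7
```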

For better understanding, let us take an example with K = 4 (a multimodal model). Let the proximity measures between the current histogram and the four model histograms at pixel (x, y) be ∩1, ∩2, ∩3 and ∩4. The background model is then updated as in [26], as follows:

  • If max(∩1,∩2,∩3,∩4) < Tp (where Tp is the threshold value of the proximity measure), then the current histogram replaces the model histogram having the lowest weight. The weight of this newly added histogram is initialized with a low value, i.e., 0.01 in our experiments.

  • If max(∩1,∩2,∩3,∩4) > Tp, then the best-matching model histogram, i.e., the one with the highest proximity value, is selected and updated with the current histogram as follows:

    $$ H_{m_{k}}^{x,y}=\alpha_{b}\times H^{x,y}+(1-\alpha_{b})\times H_{m_{k}}^{x,y}, \qquad \alpha_{b}\in [0,1] $$
    (17)

    where αb is a user-defined learning rate.

The weights of the histograms are updated as follows:

$$ \omega_{k}=\alpha_{w} \times M_{k} + (1-\alpha_{w})\times \omega_{k}, \qquad \alpha_{w}\in [0,1] $$
(18)

where αw is a user-defined learning rate, and Mk is 1 for the best-matching model histogram and 0 otherwise.
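The two bullet points and (17)-(18) can be combined into a single update step per pixel, sketched below under the stated assumptions (the 0.01 initial weight follows the first bullet; names are illustrative).

```python
import numpy as np

def update_models(models, weights, H, Tp, alpha_b, alpha_w=0.01):
    """One update of the K model histograms with the current histogram H."""
    prox = np.array([np.minimum(Hm, H).sum() for Hm in models])  # Eq. (16)
    k_best = int(np.argmax(prox))
    if prox[k_best] < Tp:
        # no model matches: replace the lowest-weight model with H
        k_low = int(np.argmin(weights))
        models[k_low] = H.copy()
        weights[k_low] = 0.01            # low initial weight
    else:
        # blend H into the best-matching model, Eq. (17)
        models[k_best] = alpha_b * H + (1 - alpha_b) * models[k_best]
        M = np.zeros(len(models))
        M[k_best] = 1.0                  # M_k = 1 only for the best match
        weights[:] = alpha_w * M + (1 - alpha_w) * weights   # Eq. (18)
    weights /= weights.sum()             # keep the weights summing to 1
    return models, weights

models = [np.full(64, 1 / 64) for _ in range(4)]   # K = 4 model histograms
weights = np.full(4, 0.25)
H = np.random.dirichlet(np.ones(64))               # a normalized histogram
update_models(models, weights, H, Tp=0.7, alpha_b=0.01)
```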

In the suggested scheme, we present a new adaptive updating learning rate (αb), expressed in (19).

$$ \alpha_{b}=C_{1}\times \exp\left(C_{2}\times\left(\max(..)-T_{p}\right)\right) $$
(19)

where max(..) refers to the maximum proximity measure between the model histograms and the current histogram, and C1 and C2 are constants. The intuition behind this adaptive learning rate is that if the matched proximity max(..) is very close to Tp (i.e., the difference max(..) − Tp is very small), the learning rate converges to its lower value (i.e., C1), whereas as the match becomes stronger (i.e., max(..) − Tp increases), the learning rate increases. This improves the convergence rate at each frame. It has been observed that when the background changes quickly under dynamic (unstable) conditions, the matched proximity max(..) stays very close to Tp, so a small learning rate is desirable for updating the model; on the contrary, if the background is stable (i.e., max(..) >> Tp), the model is updated with a higher learning rate.
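A sketch of (19) follows; the paper's constants C1 and C2 are not restated numerically in this section, so the values below are placeholders.

```python
import numpy as np

def adaptive_alpha_b(max_prox, Tp, C1=0.01, C2=5.0):
    """Eq. (19): alpha_b = C1 * exp(C2 * (max(..) - Tp)). C1, C2 assumed."""
    return C1 * np.exp(C2 * (max_prox - Tp))

print(adaptive_alpha_b(0.71, 0.70))  # near C1: unstable background, slow update
print(adaptive_alpha_b(0.95, 0.70))  # larger: stable background, fast update
```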

Further, in this work we formulate an adaptive threshold value Tp that changes dynamically for each pixel, since the background also changes dynamically over time in a complex environment. Throughout this work we consider Tp = Tp(x, y). In the proposed scheme, we use a counter that counts the number of times a pixel is identified as a foreground pixel: the counter at a particular location (x, y) is incremented by one each time the pixel is continuously classified as foreground along the temporal direction, and it is reset to zero whenever the pixel is classified as background in any frame t. The criteria for the aforementioned conditions are as follows:

$$ \left\{\begin{array}{ll} Foreground & \qquad count_{t}=count_{t-1}+1\\ Background & \qquad count_{t}=0 \end{array}\right. $$
(20)

Under illumination changes, a pixel is sometimes wrongly classified as foreground instead of background, while in succeeding frames the same pixel location is classified as background; for example, a slight movement of a leaf may be classified as foreground although it is expected to be background. In such circumstances, the counter is reset to zero to avoid misclassification. If the counter value increases steadily, the pixel is considered foreground: in our experiments, once the counter value exceeds \(T_{h}^{+}\), the pixel is regarded as foreground and the Tp(x, y) value is increased. The intuition behind increasing Tp(x, y) is that pixels whose proximity values are very close to Tp(x, y) (i.e., \(max(..)\geqslant T_{p}(x,y)\)) would otherwise be classified as background; to mitigate this, Tp(x, y) is increased by \(\frac{\beta}{count}\). If the pixel at position (x, y) is continuously detected as foreground, the counter value increases, which decreases \(\frac{\beta}{count}\); therefore, Tp is approximately equal to Tpinit at high counter values, which additionally removes false positives in a frame. The Tp value is kept constant for counter values within the range [\(T_{h}^{-},T_{h}^{+}\)], and the same reasoning is adopted to decrease Tp(x, y) when the counter value equals zero. For better clarity, the proposed adaptive thresholding approach is outlined in Algorithm 1, and a minimal sketch follows it. In Section 5.1, the influence of selecting an adaptive threshold Tp(x, y) and learning rate αb is presented in detail.

Algorithm 1
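A minimal sketch of the counter and threshold update of (20) and Algorithm 1 is given below; the exact decrease rule at a counter value of zero and the parameter values (Tpinit, β, Th−, Th+) are assumptions, since Algorithm 1 is not reproduced here verbatim.

```python
def update_counter_and_tp(count, is_foreground, Tp_init=0.7, beta=0.05,
                          Th_minus=10, Th_plus=50):
    """Eq. (20) counter plus the per-pixel Tp(x, y) adaptation rule."""
    count = count + 1 if is_foreground else 0          # Eq. (20)
    if count > Th_plus:
        # persistent foreground: raise Tp by beta/count; as count grows the
        # increment decays, so Tp returns toward Tp_init
        Tp = Tp_init + beta / count
    elif count == 0:
        Tp = Tp_init - beta    # assumed symmetric decrease at a reset
    else:
        Tp = Tp_init           # held constant within [Th_minus, Th_plus]
    return count, Tp

count, Tp = 0, 0.7
for fg in [True] * 60:         # pixel persistently detected as foreground
    count, Tp = update_counter_and_tp(count, fg)
print(count, round(Tp, 4))     # Tp slightly above Tp_init, decaying back
```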

Not all of the produced model histograms are necessarily generated by the background process. To determine whether a histogram model belongs to the background, the persistence of the histogram in the model is computed. Equation (18) shows that persistence is directly related to the model histogram's weight: the larger the weight, the higher the probability of it being a background histogram produced by the background process. During the final stage of the updating procedure, the model histograms at a pixel are sorted in decreasing order of their weights, and the most probable B background model histograms at the top of the list are selected.

$$ B=\arg\min_{b}\left(\sum\limits_{k=1}^{b}\omega_{k}>T_{B}\right),\qquad T_{B}\in [0,1] $$
(21)

where TB is a user-defined threshold.

For foreground detection, the histogram H at a pixel in the current frame is compared against the B background model histograms using the proximity measure of (16). This foreground detection step is performed before the background model is updated. If the proximity measure exceeds Tp for at least one background model histogram, the pixel is classified as background; otherwise, it is classified as foreground.
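A sketch of the B-model selection of (21) and the final per-pixel decision is given below; names are illustrative and the weights are assumed to be normalized.

```python
import numpy as np

def classify_pixel(models, weights, H, Tp, TB=0.8):
    """Return 'background' if H matches one of the top-B model histograms."""
    order = np.argsort(weights)[::-1]           # decreasing weight
    cum = np.cumsum(weights[order])
    B = int(np.searchsorted(cum, TB, side="right")) + 1   # Eq. (21)
    for k in order[:B]:                          # top-B models = background
        if np.minimum(models[k], H).sum() > Tp:  # proximity of Eq. (16)
            return "background"
    return "foreground"
```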

4.2 Parameter value selection in the proposed scheme

In the proposed scheme, the learning rate (αb) and threshold value (Tp) are computed adaptively. However, several influencing parameters help to compute these two quantities, and their values must be selected with care. Since many tunable parameters are used in the proposed scheme, constant values are adopted in most cases to reduce the overall computational burden and the number of variables in the algorithm. The considerations for choosing the parameter values for the different datasets are discussed below.

The circular-neighborhood LBP operator needs proper values of I and R. A larger I increases the number of bins (i.e., 2I), which in turn increases the length of the histogram, while a small I risks information loss in the local region as the histogram length decreases. Furthermore, memory size is an important factor in choosing I. In the proposed scheme, the radius R is taken as 2: a larger R decreases the correlation among pixels, which makes it difficult to recover the shape of moving objects, while a smaller R increases the number of false positives because the extracted texture information is too local. In this work, the values of I and R are set experimentally to 6 and 2, respectively. The circular region of radius Rregion is the region over which the histogram is calculated. A large Rregion incorporates too much information into the histogram, so the shapes of moving objects are not extracted properly; a smaller Rregion leads to proper detection of moving object shapes but creates a large number of false positives. Therefore, Rregion is chosen as 5 as a compromise between the two requirements.

The required number of models differs across video sequences, depending on the complexity of each video. For a dynamic environment (i.e., an outdoor scenario), which follows a multimodal distribution, the value of K must be chosen high (preferably more than 5), whereas for a unimodal environment (i.e., indoors), K can be small (approximately 2). A higher K increases the memory requirement and computational complexity, while a small K greatly deteriorates the detection result. In the suggested scheme, K is set to 4 for all videos in the experiments.

The parameter TB is directly related to the number of models required. The value of TB should be very high (≈ 0.8) for multimodal backgrounds and small (≈ 0.2) for unimodal backgrounds; we use 0.8 for all our video sequences.

The learning rate parameters (αb, αw) are responsible for the adaptation speed of the background model. In this work, only the learning rate αb, depicted in (19), is updated adaptively. A higher learning rate leads to faster incorporation of the current histogram into the background model histogram, which can cause slow-moving objects to be misclassified as background; a low learning rate causes slow updating, which increases false positives. Therefore, the αb value in (19) changes in accordance with the speed of the moving objects in a video sequence. The learning parameter αw is set to 0.01 for all video sequences. For a fair comparison, we keep the parameter values constant across all datasets. Table 1 lists the parameter values of the proposed scheme for all datasets.
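For quick reference alongside Table 1, the fixed parameter values stated in this section can be gathered in one place, as in the sketch below (C1, C2 and the Tp/counter bounds are adapted or tuned and therefore omitted).

```python
PARAMS = {
    "I": 6,           # circular neighbors
    "R": 2,           # neighborhood radius
    "R_region": 5,    # radius of the histogram region
    "K": 4,           # model histograms per pixel
    "omega": 0.01,    # spatio-temporal rate, Eqs. (8)/(15)
    "alpha": 3,       # LBP bias, Eq. (5)
    "Tr": 0.2,        # change-description constant, Eq. (11)
    "alpha_w": 0.01,  # weight learning rate, Eq. (18)
    "TB": 0.8,        # background-selection threshold, Eq. (21)
}
```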

Table 1 The parameter values of the proposed method for all datasets

5 Experimental analysis

To appraise the efficacy of the proposed scheme, extensive simulations are carried out in MATLAB R2019a running on 64-bit Windows with an Intel Core i7 processor and 32 GB of RAM under pertinent and specific test environments. For the exhaustive simulations, standard and widely accepted video sequences from different datasets covering different challenging scenarios are adopted, and the considered datasets are categorized into indoor and outdoor environments. Table 2 gives detailed information about the video sequences used in the experiments, and the associated challenges and download sources of the video sequences are listed in Tables 3 and 4, respectively. The proposed approach is compared with eight state-of-the-art approaches, namely GMM [55], KDE [16], STBS [13], PBAS [27], T-BGS [26], STLBP-BGS [64], Goyal's method [20] and Moudgollya's method [41]. To retain fairness throughout the experimentation and comparison, no post-processing operations are adopted to improve the detection performance of the proposed scheme. The overall experiment is divided into qualitative and quantitative evaluations.

Table 2 Information about the video sequences used in the proposed scheme
Table 3 Challenges associated with different video sequences
Table 4 Download source of the video sequences used in our experiment

For qualitative evaluation, visual assessments of frames from different video sequences are shown with respect to the segmented foreground.

For quantitative measurement, performance metrics in terms of Recall, Precision, F-measure (F1), Similarity, percentage of correct classification (PCC) and average classification error (ACE) are computed. These quantitative metrics are defined as follows:

$$ \begin{array}{@{}rcl@{}} Recall=\frac{TP}{TP+FN} \end{array} $$
(22)
$$ \begin{array}{@{}rcl@{}} Precision=\frac{TP}{TP+FP} \end{array} $$
(23)
$$ \begin{array}{@{}rcl@{}} F-measure=2 \frac{Recall\times Precision}{Recall+Precision} \end{array} $$
(24)
$$ \begin{array}{@{}rcl@{}} PCC=\frac{TP+TN}{TP+TN+FP+FN} \end{array} $$
(25)
$$ \begin{array}{@{}rcl@{}} Similarity=\frac{TP}{TP+FP+FN} \end{array} $$
(26)
$$ \begin{array}{@{}rcl@{}} ACE=\frac{FP+FN}{M} \end{array} $$
(27)

where M is the number of frames taken for evaluation. The quantities used for measuring the performance are defined as follows:

  1. True Positive (TP): the number of pixels that are correctly detected as foreground pixels.

  2. False Positive (FP): the number of background pixels that are incorrectly detected as foreground pixels.

  3. True Negative (TN): the number of background pixels that are correctly marked as background pixels.

  4. False Negative (FN): the number of foreground pixels that are incorrectly marked as background pixels.

Additionally, the pseudo-code to generate TP, TN, FP and FN over the selected frames is described in Algorithm 2; a compact sketch of the metric computation follows.

Algorithm 2
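A compact sketch of the metric computation of (22)-(27) from a pair of binary masks is given below; ACE additionally divides the accumulated errors by the number of frames M, which is assumed to be handled by the caller when looping over frames.

```python
import numpy as np

def confusion_counts(pred, gt):
    """TP, FP, TN, FN from boolean foreground masks of identical shape."""
    TP = int(np.sum(pred & gt))
    FP = int(np.sum(pred & ~gt))
    TN = int(np.sum(~pred & ~gt))
    FN = int(np.sum(~pred & gt))
    return TP, FP, TN, FN

def metrics(TP, FP, TN, FN):
    """Eqs. (22)-(26); ACE of Eq. (27) is (FP + FN) / M over M frames."""
    recall = TP / (TP + FN)
    precision = TP / (TP + FP)
    f_measure = 2 * recall * precision / (recall + precision)
    pcc = (TP + TN) / (TP + TN + FP + FN)
    similarity = TP / (TP + FP + FN)
    return recall, precision, f_measure, pcc, similarity

pred = np.random.rand(240, 320) > 0.5   # a dummy detection mask
gt = np.random.rand(240, 320) > 0.5     # a dummy ground-truth mask
print(metrics(*confusion_counts(pred, gt)))
```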

5.1 Influence of adaptive threshold and learning rate

In this section, we show the influence of selecting the threshold value (Tp) and learning rate (αb) adaptively instead of heuristically. In this experiment, qualitative and quantitative measures are evaluated using the WT and WS video sequences, shown in Figs. 8 and 9, respectively. In the proposed scheme without adaptive parameter selection, the values of Tp and αb are set to 0.7 and 0.01, respectively. The figures clearly show that the proposed STLBP scheme with adaptive parameter selection outperforms the scheme without adaptive parameter selection. Henceforth, for uniformity in the subsequent results and analysis, the proposed schemes without and with adaptive parameter selection are named proposed STLBP1 and proposed STLBP2, respectively.

Fig. 8

Qualitative and quantitative comparison of proposed STLBP1 and proposed STLBP2. First row (qualitative comparison): a 250th frame of the WT video sequence, b ground truth, c proposed STLBP1 and d proposed STLBP2 results. The second row shows the quantitative comparison in terms of Recall, Precision and Fmeasure, respectively, for the WT video sequence

Fig. 9

Qualitative and Quantitative comparison of proposed STLBP1 and proposed STLBP2 on WS video sequence. First row (Qualitative comparison): a 1601st frame of WS video sequence, b Ground truth, c Proposed STLBP1 and d Proposed STLBP2 results. Second row shows the Quantitative comparison in terms of Recall, Precision and Fmeasure respectively of WS video sequences

5.2 Qualitative analysis

The qualitative comparison, in terms of visual assessment, of the proposed scheme (proposed STLBP2) with the benchmark schemes, namely GMM [55], KDE [16], STBS [13], PBAS [27], T-BGS [26], STLBP-BGS [64], Goyal's method [20] and Moudgollya's method [41], on the WT, HW, OTCBVS, CA, FO2, FO1, WS, IR and CAN video sequences is shown in Figs. 10, 11, 12, 13, 14 and 15. The video sequences used in the experiments cover different complex situations such as waving trees, rippling water, low contrast, varying illumination, camouflage, noise and varying foreground object speed. Figure 10 compares the visual quality of the proposed STLBP2 scheme with that of the benchmark schemes: the numbers of false positives and false negatives are reduced significantly by the proposed STLBP2 scheme in comparison to the benchmark schemes. In Fig. 11, for the HW sequence, the proposed STLBP2 scheme brings considerable improvement in foreground extraction over the PBAS, T-BGS, STLBP and Goyal BGS approaches, while techniques such as GMM, KDE, STBS and Moudgollya's method fail to extract the desired foreground with respect to the ground truth.

Fig. 10

Visual assessment of Moving object detection for WT video sequence: a original frames, b Ground Truth, c GMM, d KDE, e STBS, f PBAS, g T-BGS, h STLBP, i Goyal’s method, j Moudgollya’s method and k Proposed STLBP2

Fig. 11

Visual assessment of Moving object detection for HW video sequence: a Original frames, b Ground Truth, c GMM, d KDE, e STBS, f PBAS, g T-BGS, h STLBP, i Goyal’s method, j Moudgollya’s method and k Proposed STLBP2

Fig. 12

Visual assessment of moving object detection for the OTCBVS, CA and FO2 video sequences: a Original frames, b Ground Truth, c GMM, d KDE, e STBS, f PBAS, g T-BGS, h STLBP, i Goyal's method, j Moudgollya's method and k Proposed STLBP2

Fig. 13

Visual assessment of Moving object detection for FO1, WS and IR video sequences: a Original frames, b Ground Truth, c GMM, d KDE, e STBS, f PBAS, g T-BGS, h STLBP, i Goyal’s method, j Moudgollya’s method and k Proposed STLBP2

Fig. 14

Visual assessment of Moving object detection CAN video sequence: a Original frames, b Ground Truth, c GMM, d KDE, e STBS, f PBAS, g T-BGS, h STLBP, i Goyal’s method, j Moudgollya’s method and k Proposed STLBP2

Fig. 15

Visual assessment of moving object detection for different video sequences: a Original frames, b Ground Truth, c GMM, d KDE, e STBS, f PBAS, g T-BGS, h STLBP, i Goyal’s method, j Moudgollya’s method and k Proposed STLBP2

Further, the visual assessment for some complex sequences, namely OTCBVS, FO2 and CA, is presented in Fig. 12. These sequences exhibit complex scenarios such as illumination change, rippling water, dynamic background with low contrast and camouflage. From Fig. 12, in the case of varying illumination (the OTCBVS sequence), the proposed STLBP2 method exhibits illumination-invariant behavior and accurately detects the foreground object with a reduced false positive rate. For the low-contrast and dynamic sequences (CA and FO2), the proposed STLBP2 scheme shows similar improvements over the state-of-the-art approaches; in particular, it minimizes wrong classifications and detects the foreground objects accurately.

Figure 13 includes some complex sequences, namely WS and FO1, where issues such as dynamic background, water ripples and noise are present. The figure shows that, for the WS sequence, the foreground extracted by the T-BGS and STLBP methods is absorbed into the background, owing to their constant threshold value and learning rate. The proposed STLBP2 scheme is more visually appealing than the other schemes, with reduced false detections. Similarly, for the IR video sequence (a noisy sequence), the proposed STLBP2 scheme produces superior foreground detection in comparison to the competing schemes.

Further, Fig. 14 shows the most complex video sequence, namely CAN, with many challenging scenarios such as illumination variation, rippling water, camouflage (i.e., some chromatic features of the background are similar to the foreground) and a foreground object that remains still for a while. For the CAN video sequence, the proposed STLBP2 scheme does not show significant improvements in foreground detection compared to BGS methods such as KDE and PBAS.

A more detailed visual assessment of the proposed STLBP2 scheme against the eight well-accepted schemes is shown in Fig. 15. In conclusion, the proposed STLBP2 scheme shows impressive performance in most cases in terms of visual quality compared to the reported schemes used in the experiments.

5.3 Quantitative analysis

The quantitative analysis and comparison of the proposed schemes (i.e., proposed STLBP1 and proposed STLBP2) are conducted in terms of Recall, Precision, F-measure, percentage of correct classification (PCC), average classification error (ACE) and Similarity.

Tables 5, 6, 7, 8, 9, 10, 11, 12 and 13 present the quantitative comparison of the proposed approach against eight state-of-the-art BGS methods on different video sequences. From Tables 5-13, it is seen that the proposed STLBP2 scheme outperforms the other BGS methods in almost all cases in terms of Recall, Precision, F-measure, PCC, ACE and Similarity.

Table 5 Quantitative Comparison of proposed method with that of other state-of-the-art BGS method for ‘WT’ data sequences
Table 6 Quantitative Comparison of proposed method with that of other state-of-the-art BGS method for ‘HW’ data sequences
Table 7 Quantitative Comparison of proposed method with that of other state-of-the-art BGS method for ‘OTCBVS’ data sequences
Table 8 Quantitative Comparison of proposed method with that of other state-of-the-art BGS method for ‘CA’ data sequences
Table 9 Quantitative Comparison of proposed method with that of other state-of-the-art BGS method for ‘FO2’ data sequences
Table 10 Quantitative Comparison of proposed method with that of other state-of-the-art BGS method for ’FO1’ data sequences
Table 11 Quantitative Comparison of proposed method with that of other state-of-the-art BGS method for ‘WS’ data sequences
Table 12 Quantitative Comparison of proposed method with that of other state-of-the-art BGS method for ‘IR’ data sequences
Table 13 Quantitative Comparison of proposed method with that of other state-of-the-art BGS method for ‘CAN’ data sequences

Figure 16a shows the average Precision of all BGS schemes. The average Precision of PBAS is 0.7898, the best among the methods; the next best average Precision, 0.6866, is obtained by the proposed STLBP2 method, about 13% below the best reported scheme. Similarly, the average Recall values shown in Fig. 16b for the T-BGS, STLBP, proposed STLBP1 and proposed STLBP2 methods are 0.9269, 0.9230, 0.8806 and 0.9224, respectively. The average F-measure is illustrated in Fig. 16c. Although the proposed STLBP2 scheme attains only the second-best average Precision and the third-best average Recall, its average F-measure is the highest; this is plausible because the F-measure is the harmonic mean of Precision and Recall. The obtained average F-measure is 10.69% higher than that of the second-best method (i.e., PBAS). Similarly, for the average PCC measure, the proposed STLBP2 scheme and the second-best scheme (i.e., PBAS) achieve 98.74% and 98.09%, respectively, so the proposed scheme attains a 0.66% higher value than the second-best scheme. Improvements of the proposed STLBP2 approach over the second-best method are also noticed for average Similarity and average ACE, of 17.27% and 13.28%, respectively, as depicted in Fig. 16.

Fig. 16

Average Quantitative metrics comparison of different competing BGS methods with proposed method: a Precision, b Recall, c F-measure, d PCC, e Similarity, and f ACE

Finally, an overall performance ranking [46] of the different BGS schemes, including the proposed scheme, is evaluated based on the accuracy metrics and shown in Table 14. The ranking covers ten BGS approaches over nine video sequences. In the ranking strategy, the contenders are sorted in ascending order of performance (for Precision, Recall, F-measure, PCC and Similarity) and assigned scores of 1, 2, 3, 4, ..., 10; for the average classification error (ACE), the values are arranged in descending order. A cumulative score is computed over all considered test sequences, and the total score of each contender is obtained by adding its cumulative scores over all performance metrics (Precision, Recall, F-measure, etc.). Finally, the contenders are ranked 10, 9, 8, ..., 1 (where 10 is the best and 1 is the worst) based on their overall total score. Table 14 shows that the proposed STLBP2 and proposed STLBP1 methods score highest and second highest, respectively, in comparison to the other BGS schemes.

Table 14 Overall performance evaluation of algorithms

6 Conclusion

In the present work, an improved scheme for foreground detection under different complex scenarios is proposed. The proposed scheme utilizes a spatio-temporal local binary pattern (STLBP) based approach to capture both spatial texture features and temporal motion features from a video. In the suggested scheme, the change description rule of the traditional STLBP method is improved to capture robust features under challenging scenarios such as waving trees, rippling water, camouflage, noisy video, low contrast and illumination variation. During the background modeling step, instead of adopting constant values for the learning rate αb and the threshold value Tp, an adaptive formulation strategy is proposed to compute αb and Tp so as to detect foreground objects more accurately with a reduced false error rate. A thorough experimental analysis has been conducted on challenging video sequences, and the performance of the suggested scheme has been contrasted and validated against benchmark schemes under a similar experimental setup. The qualitative analysis and quantitative evaluations clearly show that the suggested scheme brings considerable improvement in accurate foreground detection over the state-of-the-art background subtraction approaches.

It is realized that the proposed algorithm is capable of detecting foreground objects effectively under challenging conditions such as dynamic background, illumination variation, shadow, low contrast and noisy sequences. However, the proposed scheme has not shown consistent results for foreground detection under camouflage and backgrounds with rippling water.

Developing an efficient foreground detection algorithm remains an open challenge in the field of computer vision. Future work should focus on the following directions: first, devising multi-feature-based foreground detection for different background conditions; second, formulating alternative background modeling methods to enhance foreground detection accuracy and further reduce the false positive rate; and finally, exploring recent deep learning techniques to segment the foreground accurately.