Keywords

1 Introduction

Moving Object Segmentation (MOS) is a fundamental topic in computer vision. MOS has several applications [17], and it is also a fundamental step in numerous computer vision and image/video processing tasks such as video surveillance, security, visual object tracking, and human activity analysis, among others. The target of MOS is to separate the moving from the static objects in a scene [4]. MOS is also considered as a classification problem, where the objective is to classify between background and foreground each pixel in the videos, and therefore MOS is also known as background subtraction [7]. MOS becomes very challenging in the presence of undesirable variations in the background scene for static and moving camera sequences [21].

The approaches to solve MOS can be categorized as unsupervised, semi-supervised, and supervised learning methods. Unsupervised methods usually rely on a background model [25] of the video to segment the moving pixels (or foreground) [4,5,6, 28]. However, these unsupervised methods usually fail in dynamic background and moving camera sequences. On the other hand, supervised methods usually train very deep Convolutional Neural Networks (CNN) [31] to predict the foreground [8, 22, 35, 50]. Common CNNs in the literature contain millions of parameters [23], and thus several tricks such as dropout and data augmentation techniques [30] are required to avoid the overfitting problem. Furthermore, a general answer about the sample complexity required to guarantee performance in deep learning regimen does not exist in CNNs [15, 47, 53]. Recently, a new semi-supervised learning category for MOS has emerged [18, 20, 21]. These semi-supervised methods for MOS can achieve good performance in several challenges with a small amount of labeled information, and they are inspired from the Graph Signal Processing (GSP) community [37, 48].

GSP tries to extend the classical concepts of digital signal processing to signals supported on graphs. Graphs can model complex relationships among data, and therefore GSP has attracted a lot of interest in the last few years. Specifically, the problems of sampling and reconstruction of graph signals have been broadly studied [2, 13, 39, 40, 46]. In the same way, GSP has been used in numerous applications such as computer vision [18, 20, 21], image and 3D point cloud processing [38, 52, 57], sensor networks [16], and the estimation of new cases of Coronavirus Disease 2019 [19].

In this work, we explain the basic notions, as well as the mathematical and technical foundations of MOS using GSP, for simplicity we call Graph Signal Processing-Moving Object Segmentation (GSP-MOS) to this new approach. We propose two new architectures and a new evaluation procedure for GSP-MOS. This new evaluation scheme takes into account the amount of labeled information used in the GSP-MOS algorithms. The idea of GSP-MOS is to model regions of pixels in videos as nodes living in a high dimensional space, and thereafter a semi-supervised method classifies if each node is either a static or moving object. GSP-MOS algorithms are usually composed of: 1) a segmentation method, 2) feature extraction procedure for nodes representation, 3) a graph construction method, 4) graph signal representation, 5) a graph signal sampling strategy, and 6) a semi-supervised learning algorithm inspired from reconstruction methods of GSP. GSP-MOS has the advantage of having good performance for static and moving camera videos with small amounts of labeled information. Furthermore, Giraldo et al. [21] answered the question of the sample complexity in semi-supervised learning regimen for GSP-MOS using the concept of Paley-Wiener space in graphs [44].

The rest of the paper is organized as follows. Section 2 presents the related work in GSP-MOS. Section 3 explains the basic concepts and mathematical notions of GSP-MOS algorithms. Section 4 introduces the experimental framework that has been used for GSP-MOS, and the new evaluation procedure. Finally, Sects. 5 and 6 present a summary of results and the conclusions of this work, respectively.

2 Related Work

There are few GSP-MOS algorithms in the literature. The first GSP-MOS algorithm was proposed in [20] and coined as Graph BackGround Subtraction with Total Variation minimization (GraphBGS-TV), this method uses a Mask Region CNN (Mask R-CNN) [24] with a set of handcrafted features. GraphBGS-TV solves the semi-supervised learning problem using the Total Variation (TV) of graph signals [29]. Giraldo and Bouwmans [18] proposed the GraphBGS method, where the segmentation step uses a Cascade Mask R-CNN [10], and the semi-supervised learning problem is solved with the Sobolev norm of graph signals [19]. Finally, Giraldo et al. [21] proposed Graph Moving Object Segmentation (GraphMOS). This method uses superpixel segmentation with handcrafted plus deep features. GraphMOS solves the semi-supervised learning problem with the Sobolev norm or TV of graph signals. Additionally, Giraldo et al. [21] extended the GSP-MOS ideas to Video Object Segmentation (VOS) [41], and they coined their algorithm as GraphVOS. This work summarizes and discusses the results of the GSP-MOS algorithms. Similarly, we present potential future directions in the field of GSP-MOS.

3 Moving Object Segmentation and Graph Signal Processing

This section presents the basic concepts and mathematical foundations of GSP-MOS. Figure 1 shows the general pipeline for GSP-MOS. Most GSP-MOS algorithms follow the same pipeline with some modifications, in this work we refer to architecture as the pipeline of a certain GSP-MOS algorithm.

Fig. 1.
figure 1

General pipeline for GSP-MOS algorithms. These architectures use background initialization and segmentation. Each output of the segmentation algorithm represents a node in G, and therefore each node is represented with some features. The labeled information is used to classify if a node is a moving ( nodes) or a static object ( nodes). Finally, some nodes are sampled and the semi-supervised algorithm classify all the non-sampled nodes in the graph. (Color figure online)

3.1 Graph Signals Preliminaries

A graph \(G=(\mathcal {V},\mathcal {E})\) is represented with a set of nodes \(\mathcal {V}=\{1,\dots ,N\}\) and a set of edges \(\mathcal {E}=\{(i,j)\}\), where (ij) is the edge between the nodes i and j. \(\mathbf {W} \in \mathbb {R}^{N\times N}\) is the adjacency matrix of the graph such that \(\mathbf {W}(i,j)\in \mathbb {R}^+\) is the weight of the edge (ij) and \(\mathbf {W}(i,j)=0~\forall ~(i,j)\notin \mathcal {E}\). Therefore, \(\mathbf {W}\) is symmetric for undirected graphs. \(\mathbf {D} \in \mathbb {R}^{N\times N}\) is a diagonal matrix called the degree matrix of G such that \(\mathbf {D} = \mathrm {diag}(\mathbf {W1})\), where \(\mathbf {1} \in \mathbb {R}^{N}\) is a vector of ones. Similarly, \(\mathbf {L} = \mathbf {D}-\mathbf {W}\) is the combinatorial Laplacian operator. \(\mathbf {L}\) is positive semi-definite for undirected graphs, and thus \(\mathbf {L}\) has eigenvalues \(0=\lambda _1 \le \lambda _2 \le \dots \le \lambda _N\) and corresponding eigenvectors \(\{ \mathbf {u}_1,\mathbf {u}_2,\dots ,\mathbf {u}_N\}\). Finally, a graph signal is a function \(y:\mathcal {V} \rightarrow \mathbb {R}\), and we can represent this signal as \(\mathbf {y} \in \mathbb {R}^N\).

The graph Fourier basis of G is given by the eigenvalue decomposition of \(\mathbf {L}=\mathbf {U}\mathbf {\Lambda }\mathbf {U}^{\mathsf {T}}\) [37], where \(\mathbf {U}=[\mathbf {u}_1,\mathbf {u}_2,\dots ,\mathbf {u}_N]\) and \(\mathbf {\Lambda }=\mathrm {diag}(\lambda _1,\lambda _2,\dots ,\lambda _N)\). The Graph Fourier Transform (GFT) \(\mathbf {\hat{y}}\) of the graph signal is given by \(\mathbf {\hat{y}}=\mathbf {U}^{{\mathsf {T}}}\mathbf {y}\), and similarly the inverse GFT is given by \(\mathbf {y} = \mathbf {U}\mathbf {\hat{y}}\) [37]. A graph signal \(\mathbf {y}\) is called bandlimited if \(\exists ~\rho \in \{ 1,2,\dots ,N-1 \}\) such that its GFT satisfies \(\mathbf {\hat{y}}(i)=0~\forall ~i > \rho \). Similar to the classical digital signal processing theory, \(\rho \) is called the bandwidth of \(\mathbf {y}\) in GSP, and \(\lambda _{\rho }\) is the frequency associated with that bandwidth. Pesenson [44] used these concepts of frequency to define the space of all \(\omega \)-bandlimited graph signals such as \(PW_{\omega }(G)=\mathrm {span}(\mathbf {U}_{\rho }:\lambda _{\rho } \le \omega )\), where \(\mathbf {U}_{\rho }\) is the first \(\rho \) eigenvectors of \(\mathbf {L}\), and \(PW_{\omega }(G)\) is called the Paley-Wiener space of the graph.

3.2 Nodes Representation in GSP-MOS

GSP-MOS requires a representation for the nodes \(\mathcal {V}\) in the graph. A node of G is represented with a feature characterization of a group of pixels, where these regions are obtained with some segmentation algorithm. State-Of-The-Art (SOTA) GSP-MOS algorithms have mainly used superpixel [1], block-based [21], semantic [12], and instance segmentation [24] for the representation of nodes. Figure 2 shows an example of nodes representation with instance segmentation, where each instance is mapped to a node in the graph using certain feature representation. Figure 2 also shows the representation with the graph signals, where the green nodes are moving objects, the blue nodes are static objects, and the black nodes are unknown. Intuitively, the task of the semi-supervised learning algorithm is to color the graph signals (blue and green) as best as possible from the sampled graph (last graph to the right in Fig. 2).

Fig. 2.
figure 2

Mapping of the segmented regions (moving or static objects) to G and graph signal sampling. (Color figure online)

Superpixel Segmentation. Giraldo et al. [21] used the Simple Linear Iterative Clustering (SLIC) method for superpixel segmentation [1]. SLIC is a well known classical method for superpixel segmentation. In the same way, more recent superpixel segmentation methods such as [56] can also be used. SLIC method outputs approximately \(\zeta \) superpixels per image, where \(\zeta \) is an important parameter for GSP-MOS algorithms. For example, bigger values of \(\zeta \) allow the processing of more detailed regions, but at the same time these values increase the computational complexity of the GSP-MOS algorithm. Indeed, Giraldo et al. [21] used a powerful Nvidia DGX-2 server machine to execute the experiments related to superpixels.

Block-Based Segmentation. Giraldo et al. [21] also used block-based segmentation for the representation of nodes in G. This method divides each image in equally sized blocks for further processing. Block-based has the advantage of being faster than SLIC, but it also introduces degradation in performance as shown in [21].

Semantic Segmentation. Some GSP-MOS algorithms in the literature have used Fully CNN based segmentation methods. For instance, Giraldo et al. [21] used the DeepLab method [12] for the representation of nodes. Semantic-based GSP-MOS algorithms work well in several challenging conditions, however it fails when there is interception of static and moving objects of the same semantic class. Figure 3 shows the output of a semantic segmentation method in a scene where there are moving cars just in front of parked cars in the background. In such cases, semantic methods fail for GSP-MOS since the method cannot differentiate between the static and moving object.

Fig. 3.
figure 3

Segmentation outputs for semantic and instance segmentation.

Instance Segmentation. Several GSP-MOS algorithms have focused on instance segmentation methods such as Mask R-CNN [24] and Cascade Mask R-CNN [10]. For example, GraphBGS-TV [20] used a Mask R-CNN with a ResNeXt-101 [55] as backbone, while GraphBGS [18] used a Cascade Mask R-CNN with a ResNeSt-200 [58] as backbone. Unlike semantic segmentation, instance segmentation for GSP-MOS algorithms has the advantage of differentiating between intercepted static and moving objects of the same semantic class as shown in Fig. 3. Instance segmentation also dramatically reduces the computational complexity with respect to superpixel and block-based segmentation [21].

3.3 Feature Extraction

The feature representation plays a central role for GSP-MOS. SOTA GSP-MOS algorithms have used combinations of handcrafted and deep features. For example, GraphBGS [18] and GraphBGS-TV [18] uses a combination of optical flow [33], intensity, background, and texture [36] features. In the same way, GraphMOS [21] uses the same combination of handcrafted features of GraphBGS and GraphBGS-TV, plus deep features inspired from the visual object tracking community [14, 27]. The deep features of GraphMOS are extracted from a pre-trained VGG-m model [11]. Each node in \(\mathcal {V}\) is represented by a M-dimensional vector \(\mathbf {x}_v \in \mathbb {R}^{M}\), where \(\mathbf {x}_v\) is the concatenation of all previously extracted features.

3.4 Graph Construction

GSP-MOS algorithms have little explored the construction of the graph, SOTA methods have used a k Nearest Neighborhood (k-NN) with a Gaussian kernel. Let \(\mathbf {X} \in \mathbb {R}^{N\times M}\) be the matrix of features of the N nodes such that \(\mathbf {X}=[\mathbf {x}_1,\mathbf {x}_2,\dots ,\mathbf {x}_N]^{\mathsf {T}}\). The construction of G is accomplished connecting the first k neighbors of each node, and thereafter the following kernel is applied to get the weight of each edge:

$$\begin{aligned} w_{ij} = \exp {{-\frac{\Vert \mathbf {x}_i - \mathbf {x}_j \Vert _2^2}{\sigma ^2}}}, \end{aligned}$$
(1)

where \(\sigma \) is the standard deviation of the Gaussian kernel. \(\sigma \) is computed as follows:

$$\begin{aligned} \sigma = \frac{1}{\vert \mathcal {E} \vert + N}\sum _{(i,j) \in \mathcal {E}} \Vert \mathbf {x}_i - \mathbf {x}_j \Vert _2. \end{aligned}$$
(2)

Finally, we got the adjacency matrix \(\mathbf {W}(i,j)=w_{ij}~\forall ~(i,j) \in \mathcal {E}\) from this procedure.

3.5 Graph Signal

The graph signal is a matrix \(\mathbf {Y} \in \{0,1\}^{N\times Q}\), where Q is the number of classes of the problem. For example, Q is equal to two corresponding to static and moving objects in MOS. In the same way, Q is equal to the number of different objects for each video in VOS. Each row of \(\mathbf {Y}\) is a one-hot vector encoding the class of each node (static or moving object for MOS). Furthermore, the construction of \(\mathbf {Y}\) depends on the segmentation algorithm. One can construct the graph signal using the Intersection over Union (IoU) and the Intersection over Node (IoN) metrics proposed in [21]. The decision of whether a node is a static or moving object has been done using empirical formulations based on IoU, IoN, and the specific segmentation method. For further details, the readers are referred to [21].

3.6 Sample Complexity for Semi-supervised Learning

An important question in GSP-MOS is: how many labeled nodes are required to get perfect classification in the semi-supervised learning algorithm? The answer can be found in the Paley-Wiener space of G. Let \(\mathbf {y}\) be one of the columns of \(\mathbf {Y}\), if \(\mathbf {y} \in PW_{\omega }(G)\) then we can perfectly reconstruct the original graph signal \(\mathbf {y}\) with at least \(\rho \) samples. Formally, let \(\mathcal {S} \subset \mathcal {V}\) be a subset of nodes with \(\mathcal {S}=\{s_1,s_2,\dots ,s_m\}\), where \(m = \vert \mathcal {S}\vert \le N\) is the number of sampled nodes. Let \(\mathbf {y}(\mathcal {S})=\mathbf {My}\) be the sampled graph signal such that \(\mathbf {M}\) is a sampling matrix whose entries are given by \(\mathbf {M} = [\boldsymbol{\delta }_{s_1},\dots ,\boldsymbol{\delta }_{s_m}]^{\mathsf {T}}\) and \(\boldsymbol{\delta }_{v}\) is the one-hot vector centered at v. Let \(\mathbf {\tilde{y}} = \mathbf {\Phi My}\) be the reconstructed graph signal where \(\mathbf {\Phi } \in \mathbb {R}^{N \times m}\) is an interpolation operator. One can get perfect reconstruction if \(\mathbf {\Phi M}=\mathbf {I}\), i.e., \(\mathbf {\tilde{y}}=\mathbf {I}\mathbf {y}=\mathbf {y}\). In general, perfect reconstruction is not possible because \(\mathrm {rank}\,(\mathbf {\Phi M}) \le m \le N\). However, Chen et al. [13] proved that perfect reconstruction is possible when \(|\mathcal {S}| \ge \rho \). Giraldo et al. [21] used this idea to compute the sample complexity in GSP-MOS with the following corollary:

Corollary 1

(Giraldo’s corollary [21]). Let \(\mathbf {Y} \in \mathbb {R}^{N \times Q}\) be a graph signal matrix of a semi-supervised learning problem with Q classes. Let \(N_s\) be the sample complexity of the learning problem. \(\mathbf {Y}\) has a set of cutoff frequencies \(\{ \omega _1,\dots ,\omega _Q\}\), with corresponding bandwidths \(\{\rho _1,\dots ,\rho _Q\}\) for each column of the graph signal. As a consequence, \(N_s\) is bounded as follows:

$$\begin{aligned} N_s \le \mathrm {max}\{\rho _1,\dots ,\rho _Q\}. \end{aligned}$$
(3)

Proof: see [21].

Corollary 1 provides an upper bound in the sample complexity for a semi-supervised learning problem. The upper bound is the maximum of the bandwidths \(\{\rho _1,\dots ,\rho _Q\}\) of each column of \(\mathbf {Y}\), i.e., we are assuming that each column of \(\mathbf {Y}\) is in the Paley-Wiener space of G.

3.7 Semi-supervised Learning Algorithms

GSP-MOS algorithms have mainly used semi-supervised learning algorithms inspired from reconstruction of graph signals. The objective functions of these semi-supervised learning problems usually involve a graph signal norm. For instance, GraphBGS [18] and GraphMOS [21] used the Sobolev norm, while GraphBGS-TV [20] used the TV norm.

Sobolev Norm. The Sobolev norm of graph signals was defined by Pesenson [45] as follows:

Definition 1

For parameters \(\epsilon \ge 0\) and \(\alpha \in \mathbb {R}\), the Sobolev norm of graph signals is introduced as follows:

$$\begin{aligned} \Vert \mathbf {f} \Vert _{\alpha ,\epsilon } = \Vert (\mathbf {L}+\epsilon \mathbf {I})^{\alpha /2} \mathbf {f} \Vert ^2_2, \alpha \in \mathbb {R}, \end{aligned}$$
(4)

where \(\mathbf {I}\) is the identity matrix.

The semi-supervised learning algorithm based on the Sobolev norm in Definition 1 can be expressed as follows:

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{\mathbf {z}_q} \Vert \mathbf {z}_q \Vert _{\alpha ,\epsilon } \quad \text {s.t.} \quad \mathbf {Mz}_q = \mathbf {y}_q(\mathcal {S}), \end{aligned}$$
(5)

where we need to find \(\mathbf {z}_q~\forall ~1 \le q \le Q\). Giraldo et al. [21] highlighted two valuable properties of Eqn. (5): 1) the optimization problem is convex, and 2) the Sobolev term \(\mathbf {L}+\epsilon \mathbf {I}\) is always invertible for \(\epsilon > 0\) in undirected graphs. For small values of N, the optimization function for all \(1\le q \le Q\) in Eqn. (5) can be solved as follows:

$$\begin{aligned} ((\mathbf {L}+\epsilon \mathbf {I})^{-1})^\alpha \mathbf {M}^{\mathsf {T}}(\mathbf {M}((\mathbf {L}+\epsilon \mathbf {I})^{-1})^\alpha \mathbf {M}^{\mathsf {T}})^{-1}\mathbf {Y}(\mathcal {S}), \end{aligned}$$
(6)

where \(\mathbf {Y}(\mathcal {S})\) is the sub-matrix of \(\mathbf {Y}\) with rows indexed by \(\mathcal {S}\). For larger values of N, one can solve Eqn. (5) using the interior-point method from quadratic programming or the GSP toolbox [43].

Total Variation. The optimization problem for TV minimization is given as follows:

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{\mathbf {z}_q} \Vert \mathbf {z}_q \Vert _{\text {TV}} \quad \text {s.t.} \quad \mathbf {Mz}_q = \mathbf {y}_q(\mathcal {S}), \end{aligned}$$
(7)

where the TV norm is such that:

$$\begin{aligned} \Vert \mathbf {y} \Vert _{\text {TV}} = \sum _{i \in \mathcal {V}} \sum _{j \in \mathcal {N}_i} \sqrt{\mathbf {W}(i,j)} \Vert \mathbf {y}(j)-\mathbf {y}(i)\Vert _1, \end{aligned}$$
(8)

where \(\mathcal {N}_i\) is the set of first neighbors of the ith node. The minimization of the TV is related to the cluster assumption [29] and leads to piecewise constant signals as the solution of the optimization problem. GraphBGS-TV [20] solved Eqn. (7) using the GSP [43] and the Unlocbox [42].

4 Experimental Framework

The evaluation of GSP-MOS algorithms in [18, 20, 21] have been adjusted to the classical evaluation of MOS method (such as in background subtraction and VOS) for comparing with previous SOTA methods. These comparisons have been conducted with the F-measure metric defined as follows:

$$\begin{aligned} \text {F-measure} = 2\frac{\text {Precision}\times \text {Recall}}{\text {Precision}+\text {Recall}},\text { where}\end{aligned}$$
(9)
$$\begin{aligned} \text {Recall} = \frac{\text {TP}}{\text {TP}+\text {FN}}, \text { } \text {Precision} = \frac{\text {TP}}{\text {TP}+\text {FP}}, \end{aligned}$$
(10)

where TP, FP, and FN are the number of true positives, false positives, and false negatives pixels, respectively. However, another interesting approach for evaluation is to compute the F-measure along a set of different sampling densities (percentage of training images). As a result, we can see what is the performance of each method for a small labeled data regimen. In this work, we adopt both evaluation approaches: the classical procedure and the evaluation of GSP-MOS algorithms for several sampling densities.

4.1 Dataset

GSP-MOS algorithms have been evaluated on several datasets for background subtraction and VOS. Arguably, The most popular dataset is the Change Detection (CDNet2014) [54]. CDNet2014 contains 11 challenging categories including: bad weather, low frame rate, night videos, turbulence, baseline, dynamic backgrounds, Pan–Tilt–Zoom cameras (PTZ), camera jitter, intermittent object motion, shadow, and thermal. Each challenge of CDNet2014 contains from four up to six videos, and each video has a certain amount of labeled ground truth images.

4.2 Resources for GSP-MOS

There are some important resources to study GSP-MOS. For GSP, the survey papers [37, 48] are fundamental sources of information. Moreover, the paper [21] contains detailed theoretical explanations and proofs for GSP-MOS along with a GitHub repositoryFootnote 1 with publicly available codes. The readers are also encouraged to visit the GSP WebsiteFootnote 2.

5 Results and Discussion

Table 1 shows some visual results of the GSP-MOS algorithms GraphBGS-TV [19] and GraphBGS [18] on CDNet2014 against the deep learning method BSUV-Net [51]. Furthermore, Table 2 shows the numerical comparison of three GSP-MOS algorithms (GraphBGS [18], GraphBGS-TV [20], and GraphMOS [21]), two deep learning methods (FgSegNet v2 [32] and BSUV-Net [51]), and three unsupervised methods (SuBSENSE [49], IUTIS-5 [3], and SemanticBGS [9]). All methods in Table 2 have been evaluated with an agnostic video scheme [19, 34, 51] (unseen videos). One can see that GSP-MOS algorithms are good in almost all challenges except in intermittent object motion (I-O Motion) and in low frame rate (Low-F Rate). All GSP-MOS algorithms have used the median filter as the background initialization method, however more sophisticated methods such as [26] can be used to improve the performance of GSP-MOS in I-O Motion. Furthermore, the computation of the optical flow for the feature extraction can be affecting the performance of GSP-MOS algorithms due to the nature of Low-F Rate videos.

Table 1. Some visual results on CDNet2014 dataset for the GSP-MOS algorithms GraphBGS-TV [20] and GraphBGS [18] compared with BSUV-Net [51].
Table 2. Comparisons of average F-measure over nine challenges of CDNet2014. GSP-MOS algorithms are compared with unsupervised and supervised methods in MOS. The best and second best performing method for each challenge are shown in and , respectively.

For the new evaluation strategy using several sampling densities, we adopt the following convention to name the architecture of a GSP-MOS algorithm: (semi-supervised learning method) - (segmentation method) - (k parameter for the k-NN method). Similarly, the following abbreviations are given: 1) SN means Sobolev norm with \(\epsilon =0.2\) and \(\alpha =1\), 2) TV means Total Variation, 3) M50 means instance segmentation with a Mask R-CNN with a ResNet-50, 4) X101 means instance segmentation with a Mask R-CNN with a ResNeXt-101, 5) C200 means instance segmentation with a Cascade Mask R-CNN and a ResNeSt-200, and 6) DL means semantic segmentation with DeepLab method. For example, SN-M50-30 is a GSP-MOS architecture that uses Sobolev norm minimization, a Mask R-CNN with ResNet-50, and k-NN with \(k=30\).

Fig. 4.
figure 4

Average F-measure vs sampling density for several architectures of GSP-MOS in nine challenges of CDNet2014. Each point in the plots is a Monte Carlo cross-validation experiment with 5 repetitions.

Figure 4 shows the results of several configurations of GSP-MOS algorithms. For example, TV-X101-20 corresponds to GraphBGS-TV [19], while SN-C200-30 corresponds to GraphBGS [18]. GraphMOS [21] was not evaluated using this procedure due to its high computational complexity. Similarly, we evaluate two new architectures namely SN-DL-30 and SN-X101-30. The design of an architecture for GSP-MOS should be focused on the specific challenges that we can have in our dataset. For instance, TV works better than Sobolev norm minimization in PTZ, camera jitter, and bad weather. Similarly, C200 works better than the others segmentation methods in almost all cases, however one should consider the high computational complexity of C200. Figure 4 also helps to evaluate certain algorithms given an amount of labeled information. For example, if one has a dataset with intermittent objects motions, and the percentage of labeled images in the dataset is less than \(2\%\), TV-X101-20 works better than the other architectures in that case.

6 Conclusions

In this study, we summarized the ideas and theoretical developments of SOTA GSP-MOS algorithms. Similarly, we proposed two new architectures of GSP-MOS and a new evaluation procedure for taking into account the sampling density of labeled information (the amount of training images). GSP-MOS architectures are usually composed of: 1) a segmentation method, 2) a background model, 3) a feature extraction for nodes representation, 4) a graph construction, 5) graph signal representation, 6) a sampling method, and 7) a semi-supervised learning method. GSP-MOS algorithms have shown good results for static as well as moving camera sequences. Furthermore, GSP-MOS algorithms have important theoretical guarantees rooted from the rich theory of GSP.

GSP-MOS algorithms have several directions for future research. For example, one can try to use light CNN architectures to learn better movement features than using handcrafted and pre-trained CNN features. Similarly, one can try to improve the performance of GSP-MOS using other tools such as more sophisticated background initialization methods, or feature selection/dimensionality reduction after the feature extraction procedure.