The Emerging Field of Graph Signal Processing for Moving Object Segmentation

Giraldo, Jhony H.; Javed, Sajid; Sultana, Maryam; Jung, Soon Ki; Bouwmans, Thierry

doi:10.1007/978-3-030-81638-4_3

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1405))

Included in the following conference series:

International Workshop on Frontiers of Computer Vision

903 Accesses
24 Citations

Abstract

Moving Object Segmentation (MOS) is an important topic in computer vision. MOS becomes a challenging problem in the presence of dynamic background and moving camera videos such as Pan–Tilt–Zoom cameras (PTZ). The MOS problem has been solved using unsupervised and supervised learning strategies. Recently, new ideas to solve MOS using semi-supervised learning have emerged inspired from the theory of Graph Signal Processing (GSP). These new algorithms are usually composed of several steps including: segmentation, background initialization, features extraction, graph construction, graph signal sampling, and a semi-supervised learning algorithm inspired from reconstruction of graph signals. In this work, we summarize and explain the theoretical foundations as well as the technical details of MOS using GPS. We also propose two architectures for MOS using semi-supervised learning and a new evaluation procedure for GSP-based MOS algorithms. GSP-based algorithms are evaluated in the Change Detection (CDNet2014) dataset for MOS, outperforming numerous State-Of-The-Art (SOTA) methods in several challenging conditions.

Access provided by Autonomous University of Puebla. Download conference paper PDF

GRUNTS: Graph Representation for UNsupervised Temporal Segmentation

Multiple Object Tracking by Efficient Graph Partitioning

Hierarchical Graph-Based Segmentation in Detection of Object-Related Regions

Keywords

1 Introduction

Moving Object Segmentation (MOS) is a fundamental topic in computer vision. MOS has several applications [17], and it is also a fundamental step in numerous computer vision and image/video processing tasks such as video surveillance, security, visual object tracking, and human activity analysis, among others. The target of MOS is to separate the moving from the static objects in a scene [4]. MOS is also considered as a classification problem, where the objective is to classify between background and foreground each pixel in the videos, and therefore MOS is also known as background subtraction [7]. MOS becomes very challenging in the presence of undesirable variations in the background scene for static and moving camera sequences [21].

The approaches to solve MOS can be categorized as unsupervised, semi-supervised, and supervised learning methods. Unsupervised methods usually rely on a background model [25] of the video to segment the moving pixels (or foreground) [4,5,6, 28]. However, these unsupervised methods usually fail in dynamic background and moving camera sequences. On the other hand, supervised methods usually train very deep Convolutional Neural Networks (CNN) [31] to predict the foreground [8, 22, 35, 50]. Common CNNs in the literature contain millions of parameters [23], and thus several tricks such as dropout and data augmentation techniques [30] are required to avoid the overfitting problem. Furthermore, a general answer about the sample complexity required to guarantee performance in deep learning regimen does not exist in CNNs [15, 47, 53]. Recently, a new semi-supervised learning category for MOS has emerged [18, 20, 21]. These semi-supervised methods for MOS can achieve good performance in several challenges with a small amount of labeled information, and they are inspired from the Graph Signal Processing (GSP) community [37, 48].

GSP tries to extend the classical concepts of digital signal processing to signals supported on graphs. Graphs can model complex relationships among data, and therefore GSP has attracted a lot of interest in the last few years. Specifically, the problems of sampling and reconstruction of graph signals have been broadly studied [2, 13, 39, 40, 46]. In the same way, GSP has been used in numerous applications such as computer vision [18, 20, 21], image and 3D point cloud processing [38, 52, 57], sensor networks [16], and the estimation of new cases of Coronavirus Disease 2019 [19].

In this work, we explain the basic notions, as well as the mathematical and technical foundations of MOS using GSP, for simplicity we call Graph Signal Processing-Moving Object Segmentation (GSP-MOS) to this new approach. We propose two new architectures and a new evaluation procedure for GSP-MOS. This new evaluation scheme takes into account the amount of labeled information used in the GSP-MOS algorithms. The idea of GSP-MOS is to model regions of pixels in videos as nodes living in a high dimensional space, and thereafter a semi-supervised method classifies if each node is either a static or moving object. GSP-MOS algorithms are usually composed of: 1) a segmentation method, 2) feature extraction procedure for nodes representation, 3) a graph construction method, 4) graph signal representation, 5) a graph signal sampling strategy, and 6) a semi-supervised learning algorithm inspired from reconstruction methods of GSP. GSP-MOS has the advantage of having good performance for static and moving camera videos with small amounts of labeled information. Furthermore, Giraldo et al. [21] answered the question of the sample complexity in semi-supervised learning regimen for GSP-MOS using the concept of Paley-Wiener space in graphs [44].

The rest of the paper is organized as follows. Section 2 presents the related work in GSP-MOS. Section 3 explains the basic concepts and mathematical notions of GSP-MOS algorithms. Section 4 introduces the experimental framework that has been used for GSP-MOS, and the new evaluation procedure. Finally, Sects. 5 and 6 present a summary of results and the conclusions of this work, respectively.

2 Related Work

There are few GSP-MOS algorithms in the literature. The first GSP-MOS algorithm was proposed in [20] and coined as Graph BackGround Subtraction with Total Variation minimization (GraphBGS-TV), this method uses a Mask Region CNN (Mask R-CNN) [24] with a set of handcrafted features. GraphBGS-TV solves the semi-supervised learning problem using the Total Variation (TV) of graph signals [29]. Giraldo and Bouwmans [18] proposed the GraphBGS method, where the segmentation step uses a Cascade Mask R-CNN [10], and the semi-supervised learning problem is solved with the Sobolev norm of graph signals [19]. Finally, Giraldo et al. [21] proposed Graph Moving Object Segmentation (GraphMOS). This method uses superpixel segmentation with handcrafted plus deep features. GraphMOS solves the semi-supervised learning problem with the Sobolev norm or TV of graph signals. Additionally, Giraldo et al. [21] extended the GSP-MOS ideas to Video Object Segmentation (VOS) [41], and they coined their algorithm as GraphVOS. This work summarizes and discusses the results of the GSP-MOS algorithms. Similarly, we present potential future directions in the field of GSP-MOS.

3 Moving Object Segmentation and Graph Signal Processing

This section presents the basic concepts and mathematical foundations of GSP-MOS. Figure 1 shows the general pipeline for GSP-MOS. Most GSP-MOS algorithms follow the same pipeline with some modifications, in this work we refer to architecture as the pipeline of a certain GSP-MOS algorithm.

3.1 Graph Signals Preliminaries

A graph $G=(\mathcal {V},\mathcal {E})$ is represented with a set of nodes $\mathcal {V}=\{1,\dots ,N\}$ and a set of edges $\mathcal {E}=\{(i,j)\}$, where (i, j) is the edge between the nodes i and j. $\mathbf {W} \in \mathbb {R}^{N\times N}$ is the adjacency matrix of the graph such that $\mathbf {W}(i,j)\in \mathbb {R}^+$ is the weight of the edge (i, j) and $\mathbf {W}(i,j)=0~\forall ~(i,j)\notin \mathcal {E}$. Therefore, $\mathbf {W}$ is symmetric for undirected graphs. $\mathbf {D} \in \mathbb {R}^{N\times N}$ is a diagonal matrix called the degree matrix of G such that $\mathbf {D} = \mathrm {diag}(\mathbf {W1})$, where $\mathbf {1} \in \mathbb {R}^{N}$ is a vector of ones. Similarly, $\mathbf {L} = \mathbf {D}-\mathbf {W}$ is the combinatorial Laplacian operator. $\mathbf {L}$ is positive semi-definite for undirected graphs, and thus $\mathbf {L}$ has eigenvalues $0=\lambda _1 \le \lambda _2 \le \dots \le \lambda _N$ and corresponding eigenvectors $\{ \mathbf {u}_1,\mathbf {u}_2,\dots ,\mathbf {u}_N\}$. Finally, a graph signal is a function $y:\mathcal {V} \rightarrow \mathbb {R}$, and we can represent this signal as $\mathbf {y} \in \mathbb {R}^N$.

The graph Fourier basis of G is given by the eigenvalue decomposition of $\mathbf {L}=\mathbf {U}\mathbf {\Lambda }\mathbf {U}^{\mathsf {T}}$ [37], where $\mathbf {U}=[\mathbf {u}_1,\mathbf {u}_2,\dots ,\mathbf {u}_N]$ and $\mathbf {\Lambda }=\mathrm {diag}(\lambda _1,\lambda _2,\dots ,\lambda _N)$. The Graph Fourier Transform (GFT) $\mathbf {\hat{y}}$ of the graph signal is given by $\mathbf {\hat{y}}=\mathbf {U}^{{\mathsf {T}}}\mathbf {y}$, and similarly the inverse GFT is given by $\mathbf {y} = \mathbf {U}\mathbf {\hat{y}}$ [37]. A graph signal $\mathbf {y}$ is called bandlimited if $\exists ~\rho \in \{ 1,2,\dots ,N-1 \}$ such that its GFT satisfies $\mathbf {\hat{y}}(i)=0~\forall ~i > \rho $. Similar to the classical digital signal processing theory, $\rho $ is called the bandwidth of $\mathbf {y}$ in GSP, and $\lambda _{\rho }$ is the frequency associated with that bandwidth. Pesenson [44] used these concepts of frequency to define the space of all $\omega $-bandlimited graph signals such as $PW_{\omega }(G)=\mathrm {span}(\mathbf {U}_{\rho }:\lambda _{\rho } \le \omega )$, where $\mathbf {U}_{\rho }$ is the first $\rho $ eigenvectors of $\mathbf {L}$, and $PW_{\omega }(G)$ is called the Paley-Wiener space of the graph.

3.2 Nodes Representation in GSP-MOS

GSP-MOS requires a representation for the nodes $\mathcal {V}$ in the graph. A node of G is represented with a feature characterization of a group of pixels, where these regions are obtained with some segmentation algorithm. State-Of-The-Art (SOTA) GSP-MOS algorithms have mainly used superpixel [1], block-based [21], semantic [12], and instance segmentation [24] for the representation of nodes. Figure 2 shows an example of nodes representation with instance segmentation, where each instance is mapped to a node in the graph using certain feature representation. Figure 2 also shows the representation with the graph signals, where the green nodes are moving objects, the blue nodes are static objects, and the black nodes are unknown. Intuitively, the task of the semi-supervised learning algorithm is to color the graph signals (blue and green) as best as possible from the sampled graph (last graph to the right in Fig. 2).

Superpixel Segmentation. Giraldo et al. [21] used the Simple Linear Iterative Clustering (SLIC) method for superpixel segmentation [1]. SLIC is a well known classical method for superpixel segmentation. In the same way, more recent superpixel segmentation methods such as [56] can also be used. SLIC method outputs approximately $\zeta $ superpixels per image, where $\zeta $ is an important parameter for GSP-MOS algorithms. For example, bigger values of $\zeta $ allow the processing of more detailed regions, but at the same time these values increase the computational complexity of the GSP-MOS algorithm. Indeed, Giraldo et al. [21] used a powerful Nvidia DGX-2 server machine to execute the experiments related to superpixels.

Block-Based Segmentation. Giraldo et al. [21] also used block-based segmentation for the representation of nodes in G. This method divides each image in equally sized blocks for further processing. Block-based has the advantage of being faster than SLIC, but it also introduces degradation in performance as shown in [21].

Semantic Segmentation. Some GSP-MOS algorithms in the literature have used Fully CNN based segmentation methods. For instance, Giraldo et al. [21] used the DeepLab method [12] for the representation of nodes. Semantic-based GSP-MOS algorithms work well in several challenging conditions, however it fails when there is interception of static and moving objects of the same semantic class. Figure 3 shows the output of a semantic segmentation method in a scene where there are moving cars just in front of parked cars in the background. In such cases, semantic methods fail for GSP-MOS since the method cannot differentiate between the static and moving object.

Instance Segmentation. Several GSP-MOS algorithms have focused on instance segmentation methods such as Mask R-CNN [24] and Cascade Mask R-CNN [10]. For example, GraphBGS-TV [20] used a Mask R-CNN with a ResNeXt-101 [55] as backbone, while GraphBGS [18] used a Cascade Mask R-CNN with a ResNeSt-200 [58] as backbone. Unlike semantic segmentation, instance segmentation for GSP-MOS algorithms has the advantage of differentiating between intercepted static and moving objects of the same semantic class as shown in Fig. 3. Instance segmentation also dramatically reduces the computational complexity with respect to superpixel and block-based segmentation [21].

3.3 Feature Extraction

The feature representation plays a central role for GSP-MOS. SOTA GSP-MOS algorithms have used combinations of handcrafted and deep features. For example, GraphBGS [18] and GraphBGS-TV [18] uses a combination of optical flow [33], intensity, background, and texture [36] features. In the same way, GraphMOS [21] uses the same combination of handcrafted features of GraphBGS and GraphBGS-TV, plus deep features inspired from the visual object tracking community [14, 27]. The deep features of GraphMOS are extracted from a pre-trained VGG-m model [11]. Each node in $\mathcal {V}$ is represented by a M-dimensional vector $\mathbf {x}_v \in \mathbb {R}^{M}$, where $\mathbf {x}_v$ is the concatenation of all previously extracted features.

3.4 Graph Construction

GSP-MOS algorithms have little explored the construction of the graph, SOTA methods have used a k Nearest Neighborhood (k-NN) with a Gaussian kernel. Let $\mathbf {X} \in \mathbb {R}^{N\times M}$ be the matrix of features of the N nodes such that $\mathbf {X}=[\mathbf {x}_1,\mathbf {x}_2,\dots ,\mathbf {x}_N]^{\mathsf {T}}$. The construction of G is accomplished connecting the first k neighbors of each node, and thereafter the following kernel is applied to get the weight of each edge:

$$\begin{aligned} w_{ij} = \exp {{-\frac{\Vert \mathbf {x}_i - \mathbf {x}_j \Vert _2^2}{\sigma ^2}}}, \end{aligned}$$

(1)

where $\sigma $ is the standard deviation of the Gaussian kernel. $\sigma $ is computed as follows:

$$\begin{aligned} \sigma = \frac{1}{\vert \mathcal {E} \vert + N}\sum _{(i,j) \in \mathcal {E}} \Vert \mathbf {x}_i - \mathbf {x}_j \Vert _2. \end{aligned}$$

(2)

Finally, we got the adjacency matrix $\mathbf {W}(i,j)=w_{ij}~\forall ~(i,j) \in \mathcal {E}$ from this procedure.

3.5 Graph Signal

The graph signal is a matrix $\mathbf {Y} \in \{0,1\}^{N\times Q}$, where Q is the number of classes of the problem. For example, Q is equal to two corresponding to static and moving objects in MOS. In the same way, Q is equal to the number of different objects for each video in VOS. Each row of $\mathbf {Y}$ is a one-hot vector encoding the class of each node (static or moving object for MOS). Furthermore, the construction of $\mathbf {Y}$ depends on the segmentation algorithm. One can construct the graph signal using the Intersection over Union (IoU) and the Intersection over Node (IoN) metrics proposed in [21]. The decision of whether a node is a static or moving object has been done using empirical formulations based on IoU, IoN, and the specific segmentation method. For further details, the readers are referred to [21].

3.6 Sample Complexity for Semi-supervised Learning

An important question in GSP-MOS is: how many labeled nodes are required to get perfect classification in the semi-supervised learning algorithm? The answer can be found in the Paley-Wiener space of G. Let $\mathbf {y}$ be one of the columns of $\mathbf {Y}$, if $\mathbf {y} \in PW_{\omega }(G)$ then we can perfectly reconstruct the original graph signal $\mathbf {y}$ with at least $\rho $ samples. Formally, let $\mathcal {S} \subset \mathcal {V}$ be a subset of nodes with $\mathcal {S}=\{s_1,s_2,\dots ,s_m\}$, where $m = \vert \mathcal {S}\vert \le N$ is the number of sampled nodes. Let $\mathbf {y}(\mathcal {S})=\mathbf {My}$ be the sampled graph signal such that $\mathbf {M}$ is a sampling matrix whose entries are given by $\mathbf {M} = [\boldsymbol{\delta }_{s_1},\dots ,\boldsymbol{\delta }_{s_m}]^{\mathsf {T}}$ and $\boldsymbol{\delta }_{v}$ is the one-hot vector centered at v. Let $\mathbf {\tilde{y}} = \mathbf {\Phi My}$ be the reconstructed graph signal where $\mathbf {\Phi } \in \mathbb {R}^{N \times m}$ is an interpolation operator. One can get perfect reconstruction if $\mathbf {\Phi M}=\mathbf {I}$, i.e., $\mathbf {\tilde{y}}=\mathbf {I}\mathbf {y}=\mathbf {y}$. In general, perfect reconstruction is not possible because $\mathrm {rank}\,(\mathbf {\Phi M}) \le m \le N$. However, Chen et al. [13] proved that perfect reconstruction is possible when $|\mathcal {S}| \ge \rho $. Giraldo et al. [21] used this idea to compute the sample complexity in GSP-MOS with the following corollary:

Corollary 1

(Giraldo’s corollary [21]). Let $\mathbf {Y} \in \mathbb {R}^{N \times Q}$ be a graph signal matrix of a semi-supervised learning problem with Q classes. Let $N_s$ be the sample complexity of the learning problem. $\mathbf {Y}$ has a set of cutoff frequencies $\{ \omega _1,\dots ,\omega _Q\}$, with corresponding bandwidths $\{\rho _1,\dots ,\rho _Q\}$ for each column of the graph signal. As a consequence, $N_s$ is bounded as follows:

$$\begin{aligned} N_s \le \mathrm {max}\{\rho _1,\dots ,\rho _Q\}. \end{aligned}$$

(3)

Proof: see [21].

Corollary 1 provides an upper bound in the sample complexity for a semi-supervised learning problem. The upper bound is the maximum of the bandwidths $\{\rho _1,\dots ,\rho _Q\}$ of each column of $\mathbf {Y}$, i.e., we are assuming that each column of $\mathbf {Y}$ is in the Paley-Wiener space of G.

3.7 Semi-supervised Learning Algorithms

GSP-MOS algorithms have mainly used semi-supervised learning algorithms inspired from reconstruction of graph signals. The objective functions of these semi-supervised learning problems usually involve a graph signal norm. For instance, GraphBGS [18] and GraphMOS [21] used the Sobolev norm, while GraphBGS-TV [20] used the TV norm.

Sobolev Norm. The Sobolev norm of graph signals was defined by Pesenson [45] as follows:

Definition 1

For parameters $\epsilon \ge 0$ and $\alpha \in \mathbb {R}$, the Sobolev norm of graph signals is introduced as follows:

$$\begin{aligned} \Vert \mathbf {f} \Vert _{\alpha ,\epsilon } = \Vert (\mathbf {L}+\epsilon \mathbf {I})^{\alpha /2} \mathbf {f} \Vert ^2_2, \alpha \in \mathbb {R}, \end{aligned}$$

(4)

where $\mathbf {I}$ is the identity matrix.

The semi-supervised learning algorithm based on the Sobolev norm in Definition 1 can be expressed as follows:

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{\mathbf {z}_q} \Vert \mathbf {z}_q \Vert _{\alpha ,\epsilon } \quad \text {s.t.} \quad \mathbf {Mz}_q = \mathbf {y}_q(\mathcal {S}), \end{aligned}$$

(5)

where we need to find $\mathbf {z}_q~\forall ~1 \le q \le Q$. Giraldo et al. [21] highlighted two valuable properties of Eqn. (5): 1) the optimization problem is convex, and 2) the Sobolev term $\mathbf {L}+\epsilon \mathbf {I}$ is always invertible for $\epsilon > 0$ in undirected graphs. For small values of N, the optimization function for all $1\le q \le Q$ in Eqn. (5) can be solved as follows:

$$\begin{aligned} ((\mathbf {L}+\epsilon \mathbf {I})^{-1})^\alpha \mathbf {M}^{\mathsf {T}}(\mathbf {M}((\mathbf {L}+\epsilon \mathbf {I})^{-1})^\alpha \mathbf {M}^{\mathsf {T}})^{-1}\mathbf {Y}(\mathcal {S}), \end{aligned}$$

(6)

where $\mathbf {Y}(\mathcal {S})$ is the sub-matrix of $\mathbf {Y}$ with rows indexed by $\mathcal {S}$. For larger values of N, one can solve Eqn. (5) using the interior-point method from quadratic programming or the GSP toolbox [43].

Total Variation. The optimization problem for TV minimization is given as follows:

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{\mathbf {z}_q} \Vert \mathbf {z}_q \Vert _{\text {TV}} \quad \text {s.t.} \quad \mathbf {Mz}_q = \mathbf {y}_q(\mathcal {S}), \end{aligned}$$

(7)

where the TV norm is such that:

$$\begin{aligned} \Vert \mathbf {y} \Vert _{\text {TV}} = \sum _{i \in \mathcal {V}} \sum _{j \in \mathcal {N}_i} \sqrt{\mathbf {W}(i,j)} \Vert \mathbf {y}(j)-\mathbf {y}(i)\Vert _1, \end{aligned}$$

(8)

where $\mathcal {N}_i$ is the set of first neighbors of the ith node. The minimization of the TV is related to the cluster assumption [29] and leads to piecewise constant signals as the solution of the optimization problem. GraphBGS-TV [20] solved Eqn. (7) using the GSP [43] and the Unlocbox [42].

4 Experimental Framework

The evaluation of GSP-MOS algorithms in [18, 20, 21] have been adjusted to the classical evaluation of MOS method (such as in background subtraction and VOS) for comparing with previous SOTA methods. These comparisons have been conducted with the F-measure metric defined as follows:

$$\begin{aligned} \text {F-measure} = 2\frac{\text {Precision}\times \text {Recall}}{\text {Precision}+\text {Recall}},\text { where}\end{aligned}$$

(9)

$$\begin{aligned} \text {Recall} = \frac{\text {TP}}{\text {TP}+\text {FN}}, \text { } \text {Precision} = \frac{\text {TP}}{\text {TP}+\text {FP}}, \end{aligned}$$

(10)

where TP, FP, and FN are the number of true positives, false positives, and false negatives pixels, respectively. However, another interesting approach for evaluation is to compute the F-measure along a set of different sampling densities (percentage of training images). As a result, we can see what is the performance of each method for a small labeled data regimen. In this work, we adopt both evaluation approaches: the classical procedure and the evaluation of GSP-MOS algorithms for several sampling densities.

4.1 Dataset

GSP-MOS algorithms have been evaluated on several datasets for background subtraction and VOS. Arguably, The most popular dataset is the Change Detection (CDNet2014) [54]. CDNet2014 contains 11 challenging categories including: bad weather, low frame rate, night videos, turbulence, baseline, dynamic backgrounds, Pan–Tilt–Zoom cameras (PTZ), camera jitter, intermittent object motion, shadow, and thermal. Each challenge of CDNet2014 contains from four up to six videos, and each video has a certain amount of labeled ground truth images.

4.2 Resources for GSP-MOS

There are some important resources to study GSP-MOS. For GSP, the survey papers [37, 48] are fundamental sources of information. Moreover, the paper [21] contains detailed theoretical explanations and proofs for GSP-MOS along with a GitHub repository^{Footnote 1} with publicly available codes. The readers are also encouraged to visit the GSP Website^{Footnote 2}.

5 Results and Discussion

Table 1 shows some visual results of the GSP-MOS algorithms GraphBGS-TV [19] and GraphBGS [18] on CDNet2014 against the deep learning method BSUV-Net [51]. Furthermore, Table 2 shows the numerical comparison of three GSP-MOS algorithms (GraphBGS [18], GraphBGS-TV [20], and GraphMOS [21]), two deep learning methods (FgSegNet v2 [32] and BSUV-Net [51]), and three unsupervised methods (SuBSENSE [49], IUTIS-5 [3], and SemanticBGS [9]). All methods in Table 2 have been evaluated with an agnostic video scheme [19, 34, 51] (unseen videos). One can see that GSP-MOS algorithms are good in almost all challenges except in intermittent object motion (I-O Motion) and in low frame rate (Low-F Rate). All GSP-MOS algorithms have used the median filter as the background initialization method, however more sophisticated methods such as [26] can be used to improve the performance of GSP-MOS in I-O Motion. Furthermore, the computation of the optical flow for the feature extraction can be affecting the performance of GSP-MOS algorithms due to the nature of Low-F Rate videos.

Table 1. Some visual results on CDNet2014 dataset for the GSP-MOS algorithms GraphBGS-TV [20] and GraphBGS [18] compared with BSUV-Net [51].

Full size table

Table 2. Comparisons of average F-measure over nine challenges of CDNet2014. GSP-MOS algorithms are compared with unsupervised and supervised methods in MOS. The best and second best performing method for each challenge are shown in and , respectively.

For the new evaluation strategy using several sampling densities, we adopt the following convention to name the architecture of a GSP-MOS algorithm: (semi-supervised learning method) - (segmentation method) - (k parameter for the k-NN method). Similarly, the following abbreviations are given: 1) SN means Sobolev norm with $\epsilon =0.2$ and $\alpha =1$, 2) TV means Total Variation, 3) M50 means instance segmentation with a Mask R-CNN with a ResNet-50, 4) X101 means instance segmentation with a Mask R-CNN with a ResNeXt-101, 5) C200 means instance segmentation with a Cascade Mask R-CNN and a ResNeSt-200, and 6) DL means semantic segmentation with DeepLab method. For example, SN-M50-30 is a GSP-MOS architecture that uses Sobolev norm minimization, a Mask R-CNN with ResNet-50, and k-NN with $k=30$.

Figure 4 shows the results of several configurations of GSP-MOS algorithms. For example, TV-X101-20 corresponds to GraphBGS-TV [19], while SN-C200-30 corresponds to GraphBGS [18]. GraphMOS [21] was not evaluated using this procedure due to its high computational complexity. Similarly, we evaluate two new architectures namely SN-DL-30 and SN-X101-30. The design of an architecture for GSP-MOS should be focused on the specific challenges that we can have in our dataset. For instance, TV works better than Sobolev norm minimization in PTZ, camera jitter, and bad weather. Similarly, C200 works better than the others segmentation methods in almost all cases, however one should consider the high computational complexity of C200. Figure 4 also helps to evaluate certain algorithms given an amount of labeled information. For example, if one has a dataset with intermittent objects motions, and the percentage of labeled images in the dataset is less than $2\%$, TV-X101-20 works better than the other architectures in that case.

6 Conclusions

In this study, we summarized the ideas and theoretical developments of SOTA GSP-MOS algorithms. Similarly, we proposed two new architectures of GSP-MOS and a new evaluation procedure for taking into account the sampling density of labeled information (the amount of training images). GSP-MOS architectures are usually composed of: 1) a segmentation method, 2) a background model, 3) a feature extraction for nodes representation, 4) a graph construction, 5) graph signal representation, 6) a sampling method, and 7) a semi-supervised learning method. GSP-MOS algorithms have shown good results for static as well as moving camera sequences. Furthermore, GSP-MOS algorithms have important theoretical guarantees rooted from the rich theory of GSP.

GSP-MOS algorithms have several directions for future research. For example, one can try to use light CNN architectures to learn better movement features than using handcrafted and pre-trained CNN features. Similarly, one can try to improve the performance of GSP-MOS using other tools such as more sophisticated background initialization methods, or feature selection/dimensionality reduction after the feature extraction procedure.

Notes

References

Achanta, R., et al.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 34(11), 2274–2282 (2012)
Article Google Scholar
Anis, A., Gadde, A., Ortega, A.: Efficient sampling set selection for bandlimited graph signals using graph spectral proxies. IEEE Trans. Signal Process. 64(14), 3775–3789 (2016)
Article MathSciNet Google Scholar
Bianco, S., Ciocca, G., Schettini, R.: Combination of video change detection algorithms by genetic programming. IEEE Trans. Evol. Comput. 21(6), 914–928 (2017)
Article Google Scholar
Bouwmans, T.: Traditional and recent approaches in background modeling for foreground detection: an overview. Comput. Sci. Rev. 11, 31–66 (2014)
Article Google Scholar
Bouwmans, T., El Baf, F., Vachon, B.: Background modeling using mixture of Gaussians for foreground detection-a survey. Recent Patents Comput. Sci. 1(3), 219–237 (2008)
Article Google Scholar
Bouwmans, T., Zahzah, E.H.: Robust PCA via principal component pursuit: a review for a comparative evaluation in video surveillance. Comput. Vis. Image Underst. 122, 22–34 (2014)
Article Google Scholar
Bouwmans, T., et al.: Decomposition into low-rank plus additive matrices for background/foreground separation: a review for a comparative evaluation with a large-scale dataset. Comput. Sci. Rev. 23, 1–71 (2017)
Article Google Scholar
Bouwmans, T., et al.: Deep neural network concepts for background subtraction: a systematic review and comparative evaluation. Neural Netw. 117, 8–66 (2019)
Article Google Scholar
Braham, M., Piérard, S., Van Droogenbroeck, M.: Semantic background subtraction. In: IEEE ICIP (2017)
Google Scholar
Cai, Z., Vasconcelos, N.: Cascade R-CNN: high quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 43, 1483–1498 (2019)
Article Google Scholar
Chatfield, K., et al.: Return of the devil in the details: delving deep into convolutional nets. In: BMVC (2014)
Google Scholar
Chen, L.C., et al.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
Article Google Scholar
Chen, S., et al.: Discrete signal processing on graphs: sampling theory. IEEE Trans. Signal Process. 63(24), 6510–6523 (2015)
Article MathSciNet Google Scholar
Danelljan, M., et al.: ECO: efficient convolution operators for tracking. In: IEEE CVPR (2017)
Google Scholar
Du, S.S., et al.: How many samples are needed to estimate a convolutional neural network? In: NeurIPS (2018)
Google Scholar
Egilmez, H.E., Ortega, A.: Spectral anomaly detection using graph-based filtering for wireless sensor networks. In: IEEE ICASSP (2014)
Google Scholar
Garcia-Garcia, B., Bouwmans, T., Silva, A.J.: Background subtraction in real applications: challenges, current models and future directions. Comput. Sci. Rev. 35, 100204 (2020)
Article MathSciNet Google Scholar
Giraldo, J.H., Bouwmans, T.: GraphBGS: background subtraction via recovery of graph signals. In: ICPR (2021)
Google Scholar
Giraldo, J.H., Bouwmans, T.: On the minimization of Sobolev norms of time-varying graph signals: estimation of new Coronavirus disease 2019 cases. In: IEEE MLSP (2020)
Google Scholar
Giraldo, J.H., Bouwmans, T.: Semi-supervised background subtraction of unseen videos: minimization of the total variation of graph signals. In: IEEE ICIP (2020)
Google Scholar
Giraldo, J.H., Javed, S., Bouwmans, T.: Graph moving object segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (2020)
Google Scholar
Giraldo, J.H., Le, H.T., Bouwmans, T.: Deep learning based background subtraction: a systematic survey. In: Handbook of Pattern Recognition and Computer Vision, p. 51 (2020)
Google Scholar
He, K., et al.: Deep residual learning for image recognition. In: IEEE CVPR (2016)
Google Scholar
He, K., et al.: Mask R-CNN. In: IEEE CVPR (2017)
Google Scholar
Javed, S., et al.: Spatiotemporal low-rank modeling for complex scene background initialization. IEEE Trans. Circuit Syst. Video Technol. 28(6), 1315–1329 (2016)
Article Google Scholar
Javed, S., et al.: Background-foreground modeling based on spatiotemporal sparse subspace clustering. IEEE Trans. Image Process. 26(12), 5840–5854 (2017)
Article MathSciNet Google Scholar
Javed, S., et al.: Robust structural low-rank tracking. IEEE Trans. Image Process. 29, 4390–4405 (2020)
Article MathSciNet Google Scholar
Javed, S., et al.: Moving object detection in complex scene using spatiotemporal structured-sparse RPCA. IEEE Trans. Image Process. 28(2), 1007–1022 (2018)
Article MathSciNet Google Scholar
Jung, A., et al.: Semi-supervised learning in network-structured data via total variation minimization. IEEE Trans. Signal Process. 67(24), 6256–6269 (2019)
Article MathSciNet Google Scholar
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NeurIPS (2012)
Google Scholar
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Article Google Scholar
Lim, L.A., Keles, H.Y.: Learning multi-scale features for foreground segmentation. Pattern Anal. Appl. 23(3), 1369–1380 (2020)
Article Google Scholar
Lucas, B.D., Kanade, T., et al.: An iterative image registration technique with an application to stereo vision (1981)
Google Scholar
Mandal, M., Vipparthi, S.K.: Scene independency matters: an empirical study of scene dependent and scene independent evaluation for CNN-based change detection. IEEE Trans. Intell. Transp. Syst., 1–14 (2020)
Google Scholar
Mandal, M., et al.: 3DCD: scene independent end-to-end spatiotemporal feature learning framework for change detection in unseen videos. IEEE Trans. Image Process. 30, 546–558 (2020)
Article Google Scholar
Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 7, 971–987 (2002)
Article Google Scholar
Ortega, A., et al.: Graph signal processing: overview, challenges, and applications. Proc. IEEE 106(5), 808–828 (2018)
Article Google Scholar
Pang, J., et al.: Optimal graph Laplacian regularization for natural image denoising. In: IEEE ICASSP (2015)
Google Scholar
Parada-Mayorga, A., et al.: Blue-noise sampling on graphs. IEEE Trans. Signal Inf. Process. Netw. 5(3), 554–569 (2019)
MathSciNet Google Scholar
Parada-Mayorga, A., et al.: Sampling of graph signals with blue noise dithering. In: IEEE DSW (2019)
Google Scholar
Perazzi, F., et al.: A benchmark dataset and evaluation methodology for video object segmentation. In: IEEE CVPR (2016)
Google Scholar
Perraudin, N., et al.: UNLocBoX a Matlab convex optimization toolbox using proximal splitting methods. arXiv preprint arXiv:1402.0779
Perraudin, N., et al.: GSPBOX: a toolbox for signal processing on graphs. arXiv preprint arXiv:1408.5781 (2014)
Pesenson, I.: Sampling in Paley-Wiener spaces on combinatorial graphs. Trans. Amer. Math. Soc. 360(10), 5603–5627 (2008)
Article MathSciNet Google Scholar
Pesenson, I.: Variational splines and Paley-Wiener spaces on combinatorial graphs. Constructive Approximation 29(1), 1–21 (2009)
Article MathSciNet Google Scholar
Romero, D., Ma, M., Giannakis, G.B.: Kernel-based reconstruction of graph signals. IEEE Trans. Signal Process. 65(3), 764–778 (2016)
Article MathSciNet Google Scholar
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
Book Google Scholar
Shuman, D.I., et al.: The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 30(3), 83–98 (2013)
Article Google Scholar
St-Charles, P.L., Bilodeau, G.A., Bergevin, R.: SuBSENSE: a universal change detection method with local adaptive sensitivity. IEEE Trans. Image Process. 24(1), 359–373 (2014)
Article MathSciNet Google Scholar
Sultana, M., et al.: Unsupervised deep context prediction for background estimation and foreground segmentation. Mach. Vis. Appl. 30(3), 375–395 (2019)
Article Google Scholar
Tezcan, O., Ishwar, P., Konrad, J.: BSUV-Net: a fully-convolutional neural network for background subtraction of unseen videos. In: IEEE WACV (2020)
Google Scholar
Thanou, D., Chou, P.A., Frossard, P.: Graph-based compression of dynamic 3D point cloud sequences. IEEE Trans. Image Process. 25(4), 1765–1778 (2016)
Article MathSciNet Google Scholar
Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (2013). https://doi.org/10.1007/978-1-4757-2440-0
Book MATH Google Scholar
Wang, Y., et al.: CDnet 2014: an expanded change detection benchmark dataset. In: IEEE CVPR-W (2014)
Google Scholar
Xie, S., et al.: Aggregated residual transformations for deep neural networks. In: IEEE CVPR (2017)
Google Scholar
Yang, F., et al.: Superpixel segmentation with fully convolutional networks. In: IEEE CVPR (2020)
Google Scholar
Zhang, C., Florencio, D., Loop, C.: Point cloud attribute compression with graph transform. In: IEEE ICIP (2014)
Google Scholar
Zhang, H., et al.: ResNeSt: split-attention networks. arXiv preprint arXiv:2004.08955 (2020)

Download references

Author information

Authors and Affiliations

La Rochelle Université, 17000, La Rochelle, France
Jhony H. Giraldo & Thierry Bouwmans
Khalifa University, Abu Dhabi, 127788, United Arab Emirates
Sajid Javed
Kyungpook National University, Daegu, 41566, South Korea
Maryam Sultana & Soon Ki Jung

Authors

Jhony H. Giraldo
View author publications
You can also search for this author in PubMed Google Scholar
Sajid Javed
View author publications
You can also search for this author in PubMed Google Scholar
Maryam Sultana
View author publications
You can also search for this author in PubMed Google Scholar
Soon Ki Jung
View author publications
You can also search for this author in PubMed Google Scholar
Thierry Bouwmans
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jhony H. Giraldo .

Editor information

Editors and Affiliations

Chonnam National University, Gwangju, Korea (Republic of)
Hieyong Jeong
Aoyama Gakuin University, Kanagawa, Japan
Kazuhiko Sumi

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Giraldo, J.H., Javed, S., Sultana, M., Jung, S.K., Bouwmans, T. (2021). The Emerging Field of Graph Signal Processing for Moving Object Segmentation. In: Jeong, H., Sumi, K. (eds) Frontiers of Computer Vision. IW-FCV 2021. Communications in Computer and Information Science, vol 1405. Springer, Cham. https://doi.org/10.1007/978-3-030-81638-4_3

Download citation

DOI: https://doi.org/10.1007/978-3-030-81638-4_3
Published: 15 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81637-7
Online ISBN: 978-3-030-81638-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics