
1 Introduction

The idea and definition of superpixels were first given by Ren and Malik in [12]. The two authors describe a segmentation method including an oversegmentation pre-processing step that groups pixels into small, homogeneous and regular regions called superpixels. By using them instead of pixels, they significantly reduce the complexity of their algorithm. Since then, superpixels have been successfully integrated into many methods [4, 20]. Oversegmentation is currently an active research field, with new methods published steadily [1, 2, 8].

Previous Work. The first review of oversegmentation methods, made in 2012 by Achanta et al. [1], compares six algorithms: Normalized Cuts (NC) [12], the Felzenszwalb algorithm (FZ) [3], Quick Shift (QS) [17], TurboPixels (TP) [6], the Veksler method (VK) [18] and the Simple Linear Iterative Clustering method (SLIC) [1]. Tests were carried out using 100 images of the Berkeley Segmentation Dataset (BSD) [9]. For each method, Achanta et al. [1] analyze the complexity, the execution time and the quality of the produced oversegmentation results. They use two metrics: the undersegmentation error rate (UE) and the boundary recall measure (BR). Both compare the oversegmentation result S to a ground truth G. The UE score considers, for each object \(G_{i}\) in G, the set of superpixels required to cover it and counts the number of leaking pixels: \( UE(S,G) = \frac{1}{N} \sum_{G_{i} \in G} \sum_{S_{j} \cap G_{i} \ne \emptyset} \min(|S_{j} \cap G_{i}|, |S_{j} - G_{i}|) \), where N is the number of pixels and |E| denotes the cardinality of the set E. The result is a rate between 0 and 1, 0 denoting an error-free oversegmentation. The BR score checks whether boundary pixels around objects in G match boundary pixels in S. We denote by \(B_{G}\) the set of boundary pixels in G and by \(B_{S}\) the set of boundary pixels in S. If we assume that there is no doubt about whether a pixel belongs to a boundary, the quality of an oversegmentation can be evaluated by computing the rate of boundary pixels in G corresponding to boundary pixels in S: \(BR(S,G) = \frac{|B_{S} \cap B_{G}|}{|B_{G}|}\).
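The boundary-matching tolerance used in practice is discussed in the next paragraph; before that, the UE score can be made concrete with a minimal NumPy sketch (the label maps `seg` and `gt` hold integer region identifiers; all names here are ours, not taken from a published implementation):

```python
import numpy as np

def undersegmentation_error(seg, gt):
    """UE(S, G): for each ground-truth region G_i and each superpixel S_j
    overlapping it, accumulate min(|S_j ∩ G_i|, |S_j - G_i|), then divide
    by the total number of pixels N."""
    n = seg.size
    error = 0
    for g in np.unique(gt):
        g_mask = (gt == g)
        for s in np.unique(seg[g_mask]):
            s_mask = (seg == s)
            inside = np.count_nonzero(s_mask & g_mask)   # |S_j ∩ G_i|
            leak = np.count_nonzero(s_mask & ~g_mask)    # |S_j - G_i|
            error += min(inside, leak)
    return error / n
```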

In fact, even for a human, it is sometimes difficult to decide whether a pixel belongs to a boundary. Achanta et al. therefore allow a distance \(\tau_{br}\) of 2 pixels between pixels in \(B_{G}\) and in \(B_{S}\). The BR score lies in the range [0, 1], 1 meaning that all boundaries in G match boundaries in S. In 2015, Stutz et al. made a second evaluation [14]. Their first contribution is to include seven supplementary oversegmentation methods: Entropy Rate Superpixels (ERS) [7], Superpixels via Pseudo-Boolean Optimization (SPBO) [21], Contour Relaxed Superpixels (CRS) [2], Superpixels Extracted via Energy-Driven Sampling (SEEDS) [16], Topology Preserved Superpixels (TPS) [15], Depth-Adaptive Superpixels (DAS) [19] and Voxel Cloud Connectivity Segmentation (VCCS) [11]. They also use, as an additional dataset, 400 images of the New York University dataset (NYU) [13]. As the dimensions of the photographs in the two datasets are not identical (\(481 \times 321\) pixels for BSD, \(640 \times 480\) pixels for NYU), Stutz et al. modify the threshold \(\tau_{br}\), allowing matching between boundary pixels up to \(0.0075\times diag\) apart, where diag is the image diagonal length. In the review of Achanta et al. [1], the FZ and SLIC methods achieve the best scores. The evaluation of Stutz et al. [14] corroborates this result and shows that the QS, CRS and ERS algorithms achieve performances similar to those of FZ and SLIC. On the BSD images, the best scores are achieved with oversegmentations containing approximately 1000 superpixels. For the FZ, SLIC, QS, ERS and CRS methods, UE is lower than or equal to 0.04 and BR is greater than or equal to 0.99. On the NYU dataset, with about 1500 superpixels, UE is lower than or equal to 0.09 and BR is greater than or equal to 0.99. For these two datasets, execution times are about one second on a computer with a \(3.4\,\text{GHz}\) Intel Core i7 processor and \(16\,\text{GB}\) of RAM.
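Under the same assumptions as the previous sketch, BR with the tolerance described above can be computed as follows; a Euclidean distance transform gives each ground-truth boundary pixel its distance to the nearest superpixel boundary pixel, and the default `tau` mirrors the \(0.0075 \times diag\) threshold of Stutz et al. [14]:

```python
import numpy as np
from scipy import ndimage

def boundary_map(labels):
    """Mark pixels that have a 4-neighbor carrying a different label."""
    b = np.zeros(labels.shape, dtype=bool)
    b[:-1, :] |= labels[:-1, :] != labels[1:, :]
    b[1:, :]  |= labels[1:, :] != labels[:-1, :]
    b[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    b[:, 1:]  |= labels[:, 1:] != labels[:, :-1]
    return b

def boundary_recall(seg, gt, tau=None):
    """BR(S, G): fraction of ground-truth boundary pixels lying within a
    distance tau of a superpixel boundary pixel."""
    if tau is None:
        tau = 0.0075 * np.hypot(*gt.shape)   # threshold used by Stutz et al.
    b_s, b_g = boundary_map(seg), boundary_map(gt)
    # Distance from every pixel to the nearest pixel of B_S.
    dist_to_s = ndimage.distance_transform_edt(~b_s)
    return np.count_nonzero(dist_to_s[b_g] <= tau) / np.count_nonzero(b_g)
```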

Contributions. In the evaluation of Stutz et al. [14], the BR results achieved by the five best oversegmentation methods are so close to the maximal score that it seems difficult to propose a new method allowing a significant improvement. However, one can ask whether the two datasets used are sufficient for an exhaustive evaluation. The BSD and NYU images cover a wide range of situations, containing both outdoor (BSD) and indoor (NYU) images with weak local contrast, strong noise and lighting problems. However, the images of the two datasets have similar sizes (a few hundred thousand pixels) and are small in comparison to images taken with common cameras. Hence, our first contribution is a Heterogeneous Size Image Dataset (HSID), mainly containing large images (millions of pixels). HSID makes it possible to check that algorithms do not suffer from a bias related to image dimensions and to better evaluate how they scale up. As HSID contains images with heterogeneous sizes, we need to transform BR into a fuzzy boundary adherence measure, FBR, which is the second contribution of this article. Moreover, we show that UE is not suitable for HSID. The demonstration leading us to this conclusion allows a better understanding of the behavior of UE and should be valuable for other works. The major contribution of this paper is a careful analysis of the results of the five best oversegmentation methods (FZ, QS, SLIC, ERS and CRS) and of a recently proposed algorithm, Waterpixels (WP) [8]. Unlike previous evaluations, we show that, applied to HSID, each method encounters specific difficulties. The remainder of the article is organized as follows: in Sect. 2 we state the properties of a suitable oversegmentation algorithm and describe the algorithms that we compare. In Sect. 3 we describe HSID and discuss the UE and BR measures. In Sect. 4 we analyze the evaluation results. We conclude with a discussion of the perspectives of this work.

2 State-of-the-Art Methods

The review of works using superpixels [4, 5, 10, 20, 22, 23] shows that a good oversegmentation method must satisfy five properties: validity (an oversegmentation must be an image partition into connected components), boundary adherence (superpixels must not overlap different objects of the image), conciseness (an oversegmentation must produce as few superpixels as possible), simplicity (the number of neighbors of each superpixel must be as small as possible, to avoid a complex adjacency graph) and efficiency (an oversegmentation algorithm must have an execution time as low as possible). The simplicity and efficiency properties ensure that the time spent oversegmenting the image and the time needed to take the superpixel neighborhood into account will not exceed the time saved by using superpixels instead of pixels. Because boundary adherence is much more difficult to satisfy with large superpixels, this property generally conflicts with conciseness. We call adaptivity the ability of an algorithm to find the best compromise between these two properties by reducing the number of superpixels in wide homogeneous regions.

According to the review of Stutz et al. [14], five methods outperform the other algorithms: FZ [3], QS [17], ERS [7], SLIC [1] and CRS [2]. The FZ [3] and ERS [7] algorithms use a graph-based representation of the image \(G = (V, E)\), with V the set of elements to be grouped (i.e. the pixels) and E the set of edges linking pairs of neighboring elements. Each edge is weighted using a dissimilarity measure. FZ uses a predicate checking that the dissimilarity between elements along the boundary of two components is greater than the dissimilarity between neighboring elements within each of the two components, to produce a partition of G into K connected components corresponding to superpixels. ERS is a greedy algorithm selecting a subset \(A \subset E\) of edges; keeping only these edges yields a partition of G which maximizes an entropy rate. The QS method [17] is a modification of the medoid-shift algorithm that efficiently finds the modes of a Parzen density estimate P. The color and location of each pixel are used as feature vectors, which are clustered by linking each vector to its nearest neighbor that increases P. The SLIC method [1] is an adaptation of the k-means algorithm. Starting from an oversegmentation of the image into a regular grid, the average color and location features of each superpixel are computed. Then, each pixel is re-assigned to the most similar superpixel and the average features of the superpixels are re-computed. Finally, a post-processing step reassigns disjoint pixel sets to nearby superpixels, to ensure an image partition into connected components. The CRS algorithm [2] finds a partition S into superpixels which has a high likelihood of having generated the observed image. Starting from an initial segmentation into rectangular superpixels, CRS maximizes the probability function by reallocating some boundary pixels to another superpixel. In our evaluation, we add to these five state-of-the-art methods WP, a watershed-transformation-based algorithm using a spatially regularized gradient, recently proposed by Machairas et al. [8].
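As an illustration of the SLIC principle summarized above, here is a deliberately simplified local k-means sketch: grid seeding, a joint color/position distance, and a search window restricted around each center. It is not the authors' implementation, it omits the connectivity post-processing step, and the parameter names (`k`, `m`, `n_iter`) are ours:

```python
import numpy as np

def slic_like(image, k=500, m=10.0, n_iter=10):
    """Simplified SLIC-style local k-means.
    image: (H, W, 3) float array, ideally in the CIELAB color space.
    k: requested number of superpixels; m: spatial regularity weight."""
    h, w = image.shape[:2]
    step = max(int(np.sqrt(h * w / k)), 1)        # grid spacing between seeds
    ys = np.arange(step // 2, h, step)
    xs = np.arange(step // 2, w, step)
    # Cluster centers: (row, col, color...) for each seed on the regular grid.
    centers = np.array([[y, x, *image[y, x]] for y in ys for x in xs],
                       dtype=float)
    yy, xx = np.mgrid[0:h, 0:w]
    labels = np.zeros((h, w), dtype=int)
    dist = np.full((h, w), np.inf)

    for _ in range(n_iter):
        dist.fill(np.inf)
        for i, (cy, cx, *color) in enumerate(centers):
            # Restrict the search to a window of roughly 2*step around the
            # center: this keeps the complexity linear in the pixel count.
            y0, y1 = int(max(cy - step, 0)), int(min(cy + step + 1, h))
            x0, x1 = int(max(cx - step, 0)), int(min(cx + step + 1, w))
            dc = np.linalg.norm(image[y0:y1, x0:x1] - np.array(color), axis=2)
            ds = np.hypot(yy[y0:y1, x0:x1] - cy, xx[y0:y1, x0:x1] - cx)
            d = np.hypot(dc, (m / step) * ds)     # joint color/space distance
            closer = d < dist[y0:y1, x0:x1]
            dist[y0:y1, x0:x1][closer] = d[closer]
            labels[y0:y1, x0:x1][closer] = i
        # Move each center to the mean position and color of its pixels.
        for i in range(len(centers)):
            mask = labels == i
            if mask.any():
                centers[i] = [yy[mask].mean(), xx[mask].mean(),
                              *image[mask].mean(axis=0)]
    return labels
```

The restricted search window is the design choice that gives SLIC the linear complexity exploited in Sect. 4.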

3 New Oversegmentation Evaluation Benchmark

We provide a new oversegmentation evaluation dataset, including 100 images from Wikimedia Commons. The photographs have been selected to cover a wide variety of difficulties, including blur, noise, shadows, weak contrast and objects with similar colors. For each image, a hand-drawn ground truth is provided. First, the objects to extract are identified. Then, a segmentation is drawn by locating each region and filling all regions corresponding to the same object with the same color. Finally, the boundaries of the regions are automatically extracted, allowing us to visually check the result and correct mistakes made in the previous step. The images, the ground truth and a file giving the image licenses are available online.
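A rough sketch of the automatic boundary-extraction step is given below, assuming the ground truth is stored as an image in which every region belonging to the same object is filled with the same flat color; the function name and loading details are hypothetical:

```python
import numpy as np
from PIL import Image

def ground_truth_boundaries(path):
    """Load a color-filled ground-truth image and mark every pixel whose
    4-neighborhood contains a different color, i.e. the region boundaries."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.int64)
    # Collapse each RGB triple into a single integer label per pixel.
    labels = (rgb[..., 0] << 16) + (rgb[..., 1] << 8) + rgb[..., 2]
    b = np.zeros(labels.shape, dtype=bool)
    b[:-1, :] |= labels[:-1, :] != labels[1:, :]
    b[1:, :]  |= labels[1:, :] != labels[:-1, :]
    b[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    b[:, 1:]  |= labels[:, 1:] != labels[:, :-1]
    return b
```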

Need for a Cautious Use of Undersegmentation Error. In the reviews [1, 14], UE was one of the two measures used to check the boundary adherence of superpixels. However, our investigation shows that this measure must be used with caution, in particular for datasets like HSID containing images of highly varying complexities. Figure 1 shows an oversegmentation into regular square superpixels of two kinds of images: a portrait-like image in Fig. 1a, with a single big object in the foreground, and a panoramic-like image in Fig. 1b, with multiple small objects in the foreground. The foreground areas in the two images are equal and a visual analysis shows that the superpixels fail similarly to match the object boundaries. However, the UE score equals 0.09 in the first case (Fig. 1a) and 0.23 in the second case (Fig. 1b). This difference is explained by the fact that in Fig. 1a, superpixels wholly included within the object boundaries are more numerous than in Fig. 1b (105 against 68). These superpixels carry no information about boundary adherence and yet are taken into account by the UE score, decreasing it. Hence the average UE score over these two images is not relevant for measuring boundary adherence. We encountered the same problem with HSID images and chose not to use UE.

Fig. 1. Background is in gray, foreground in black and superpixel boundaries in white.

Fuzzy Boundary Recall Measure. The BR metric measures the capacity of a method to produce superpixels whose boundaries match the ground-truth boundaries. To take into account the uncertainty about boundary location in a hand-drawn ground truth, previous approaches use a static threshold: they accept matches between boundary pixels at a distance of up to \(0.0075\times diag\) pixels, where diag is the image diagonal length. On the big images of HSID (several millions of pixels) this distance (more than 30 pixels) is clearly too large. Rather than choosing another static threshold, we propose to amend the standard BR formula using fuzzy-set theory, introducing some tolerance near the border pixels. Let G be a partition of the image into L connected components corresponding to objects (\(G_{1}, \cdots, G_{L}\)) and S an oversegmentation into K superpixels (\(S_{1}, \cdots, S_{K}\)), with \(L \ll K\). A pixel \(p_{i}\) is on a boundary of S if \(\exists p_{j}\) such that \(p_{j} \in {{\mathrm{nei}}}(p_{i}) \wedge (p_{i} \in S_{n} \Rightarrow p_{j} \notin S_{n})\), where \({{\mathrm{nei}}}\) is a function giving, for each pixel, the set of its neighbors. Likewise, \(p_{i}\) is a boundary pixel of G if \(\exists p_{j}\) such that \(p_{j} \in {{\mathrm{nei}}}(p_{i}) \wedge (p_{i} \in G_{n} \Rightarrow p_{j} \notin G_{n})\). Let \(B_{G}\) be the set of boundary pixels in G and \(B_{S}\) the set of boundary pixels in S. The rate of boundary pixels in G matching boundary pixels in S is given by the classic BR measure. From \(B_{G}\) we define the fuzzy set \(B_{G \cap S}^{*}\) with the membership function \(f_{G \cap S}(p_{i}) = \exp\left(-\frac{d(p_{i}, p_{i'})^{2}}{2\sigma^{2}}\right)\), where \(d(p_{i}, p_{j})\) is the distance between the locations of \(p_{i}\) and \(p_{j}\) and \(p_{i'} = \underset{p_{j} \in B_{S}}{{{\mathrm{arg\,min}}}}\, d(p_{i}, p_{j})\). The function \(f_{G \cap S}\) returns a value in the range [0, 1], a value of 1 meaning a perfect coincidence between an element of \(B_{G}\) and an element of \(B_{S}\). Finally, we propose the fuzzy boundary recall measure \(FBR(S,G) = \frac{1}{|B_{G}|} \sum_{p \in B_{G}} f_{G \cap S}(p)\).
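A possible NumPy/SciPy implementation of FBR follows; a Euclidean distance transform provides, for every ground-truth boundary pixel \(p_i\), the distance to its nearest superpixel boundary pixel \(p_{i'}\). The value of \(\sigma\) is left as a free parameter, since the text above does not fix it, and the helper names are ours:

```python
import numpy as np
from scipy import ndimage

def fuzzy_boundary_recall(seg, gt, sigma=3.0):
    """FBR(S, G): mean, over ground-truth boundary pixels, of the Gaussian
    membership exp(-d^2 / (2 sigma^2)) of the distance d to the nearest
    superpixel boundary pixel."""
    def boundaries(labels):
        b = np.zeros(labels.shape, dtype=bool)
        b[:-1, :] |= labels[:-1, :] != labels[1:, :]
        b[1:, :]  |= labels[1:, :] != labels[:-1, :]
        b[:, :-1] |= labels[:, :-1] != labels[:, 1:]
        b[:, 1:]  |= labels[:, 1:] != labels[:, :-1]
        return b
    b_s, b_g = boundaries(seg), boundaries(gt)
    # d(p_i, p_i'): distance from every pixel to the nearest pixel of B_S.
    dist_to_s = ndimage.distance_transform_edt(~b_s)
    membership = np.exp(-dist_to_s[b_g] ** 2 / (2.0 * sigma ** 2))
    return membership.mean()
```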

4 Experimental Results

We evaluate the ability of the algorithms FZ [3], QS [17], SLIC [1], ERS [7], CRS [2] and WP [8] to satisfy the properties defined in Sect. 2, using the implementations provided by their authors. By design, all the tested algorithms satisfy the validity property. The boundary adherence of the superpixels is evaluated using the FBR score, simplicity by computing the average number of neighbors per superpixel (Nei), conciseness by giving the average number of superpixels per image (K) and efficiency by measuring the execution time (T), on a desktop computer with a \(2.6\,\text{GHz}\) Intel Core i7 processor and \(16\,\text{GB}\) of RAM. We focus this evaluation on two distinct aspects: the adaptivity of the algorithms and their ability to scale up. To measure adaptivity, we analyze the evolution of the mean and standard deviation of both the FBR and K scores. The fact that HSID contains a majority of big images allows us to check the ability of each method to scale up, by comparing the scores achieved on HSID with the results obtained in previous evaluations. Table 1 gives the mean and the standard deviation achieved by the six algorithms for all these measures. We ran 8 tests in which the methods are configured to produce about 500 (test 1), 700 (test 2), 900 (test 3), 1100 (test 4), 1300 (test 5), 1500 (test 6), 1700 (test 7) and 1900 (test 8) superpixels, respectively. Because the execution time strongly depends on the image size, the standard deviation for this measure is high, often similar to the mean. Even if these two values must be analyzed with caution, they are sufficient to compare the ability of the algorithms to scale up. Visual results for the complete dataset and the parameter values used for each method are available online.
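For reference, the Nei measure can be computed from a label map as sketched below, assuming 4-connectivity (the connectivity convention and helper name are ours; the paper does not state them):

```python
import numpy as np

def average_neighbor_count(labels):
    """Nei: average number of distinct adjacent superpixels per superpixel,
    computed from a label map with 4-connectivity."""
    pairs = set()
    # Collect unordered pairs of labels that touch horizontally or vertically.
    for a, b in ((labels[:, :-1], labels[:, 1:]),
                 (labels[:-1, :], labels[1:, :])):
        diff = a != b
        lo = np.minimum(a[diff], b[diff])
        hi = np.maximum(a[diff], b[diff])
        pairs.update(zip(lo.tolist(), hi.tolist()))
    degree = {}
    for u, v in pairs:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum(degree.values()) / len(np.unique(labels))
```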

Table 1. Quantitative results of the evaluated algorithms.

Initially designed as a segmentation method, FZ [3] cannot be used with the parameters suggested by its authors. We first used the parameters learned by Stutz et al. [14], but a visual analysis of the produced results shows that they have been over-fitted to the BSD and NYU datasets and are not suitable for HSID, where they produce segmentation-like results. A second attempt with the parameters learned by Mathieu et al. [10] gives much better results and allows a fair comparison with the rest of the state-of-the-art methods. The analysis of the superpixel numbers and FBR scores shows that FZ has good adaptivity. When its parameters are set to produce many superpixels (\(K>1500\)), both the mean of FBR and the standard deviation of K increase, showing that images that are correctly oversegmented when FZ is set to produce fewer superpixels (for example 500) are still partitioned into a small number of superpixels. Boundary adherence is satisfactory, with FBR scores better than those of SLIC and slightly worse than those of ERS. The only drawback of FZ is its execution time.

With an FBR score significantly lower than those of the FZ, SLIC and ERS algorithms, and the longest execution times of all the evaluated methods, QS is the first algorithm failing to oversegment HSID. A visual analysis of the produced superpixels shows that QS has strong difficulties with images where some boundaries are slightly blurred. Moreover, the QS parameters are related to the image size. Thus, the large standard deviation of K is not the consequence of good adaptivity, but the result of these size-dependent parameters, which reduce the superpixel number when dealing with a small image.

The evaluation on HSID, which includes images with several millions of pixels, highlights the huge advantage of the linear complexity of SLIC, whose execution times are 3 to 71 times shorter than those of the other algorithms. This main strength is offset by the FBR results, SLIC requiring more superpixels to achieve performance similar to the ERS or FZ algorithms. Moreover, the standard deviation of K is low, revealing that, even if configuring SLIC to produce more superpixels reduces boundary adherence errors, this improvement comes at the cost of oversegmenting simple images (with few objects) into many more superpixels than necessary. In other terms, SLIC is not adaptive.

ERS is the method achieving the best compromise between conciseness and boundary adherence. Unfortunately, this result comes with the second highest execution time after QS. The second drawback of ERS is its complete lack of adaptivity, with a standard deviation of the superpixel number equal to 0, meaning that the superpixel number is fixed by the user, without any possibility for the method to adapt it to the image complexity. Thus, to reduce errors on images with many tiny details, photographs with large homogeneous areas are oversegmented into numerous small superpixels. Conversely, even with 1900 superpixels, thin elements of some images are not correctly segmented.

The CRS algorithm is the second algorithm failing to oversegment HSID. Even with more than 2000 superpixels, it achieves poor FBR scores, lower than 0.5. This result is easily explained by studying the CRS algorithm. Even with numerous superpixels, the initial partition into regular rectangles corresponds to a significant error on the biggest HSID images. Consequently, the convergence towards a more relevant solution by only moving boundary pixels is very slow. In addition, the statistical distributions of colors inside the superpixels are often so wrong that the algorithm remains stuck in a local optimum far away from a correct oversegmentation. For example, multiplying the number of iterations by 100 brings no visible improvement, but the execution time rises above 2000 s.

The WP method is the last case of a method achieving good results on BSD but failing to reach similar performance on HSID. While the evaluation of Machairas et al. [8] shows that WP and SLIC have similar boundary adherence, with lower execution times for WP, the FBR results and computation times of WP are, in our evaluation, far from those of SLIC. In addition, WP is the only algorithm unable to oversegment the whole of HSID, failing on img-012, img-066 and img-072.

5 Conclusion and Perspectives

The evaluation of QS [17], CRS [2] and WP [8] shows that, even if these algorithms achieve good results on the BSD and NYU datasets, they fail to correctly oversegment the HSID images. Moreover, none of the remaining algorithms (FZ, SLIC and ERS) reaches a satisfactory compromise between boundary adherence, conciseness and efficiency. Thus, the proposed dataset HSID shows that there is room for improvement and new propositions in the oversegmentation research area. We are currently working on a method based on a region-merging approach. Our goal is to obtain a new algorithm able to adapt to the image content, reaching a compromise between conciseness and boundary adherence while keeping a short execution time. Regarding HSID, even if this dataset is sufficient to draw interesting conclusions about the state-of-the-art oversegmentation methods, we think that enlarging it with supplementary images and the associated ground truths would be valuable. We hope that a collaborative effort will be made. Finally, we invite all researchers working on new oversegmentation methods not only to evaluate their algorithms using previous benchmarks but also to show that they are competitive when dealing with HSID.