Keywords

1 Introduction

Point cloud segmentation, being usually a step preceding semantic analysis of a set, is a task of high interest of many researchers. Besides cases like recognizing structures within a point cloud [23], hand pose and face segmentation [10,11,12], big data visualisation [13] or urban and architectural reconstruction [14], segmentation may be successfully used for compression purposes [22] thanks to storing objects by means of mathematical formulas instead of thousands of 3D points. Certainly, these are just a few out of plenty of possibilities where 3D segmentation contributes.

Beyond any doubt, any algorithm, by definition, is meant to produce a good or expected results in the context of presumed characteristics. These characteristics, or rather measures, allowed us to clearly compare algorithms and understand their flaws. That is why a quality measure should be carefully chosen prior to actual problem and method definition.

In this paper, we introduced two novel measures for segmentation quality assessment: weighted classification statistics (WCS) - improved version of the ordinary classification statistics (OCS), and planarity statistics (PS) as an indicator of planes extraction quality. Additional contribution of this paper is evaluation of new and existing measures for selected model-based method of planes detection (as a reference a recent Li et al. method [3] has been chosen) and proposal of recommendations concerning particular measures.

2 Related Works

2.1 Planes Segmentation Methods

Among all methods aiming at segmentation of point clouds, Nguyen and Le identified five main categories, namely: edge-based, region-based, attributes-based, graph-based, and model-based methods [15]. Current studies focus mainly on model-based group, especially on the approaches based on random sample consensus (RANSAC). Below, we review above groups briefly to emphasize the main differences and to justify why recently the most effective Normal Distribution Transform cell-based RANSAC method, by Li et al. [3], was opted for our experiments.

Edge-based methods are constituted by methods employing gradient- or line-fitting algorithms for locating interesting points. Bhanu et al. proposed range images processing for edges extraction [1]. On the other hand, Sappa and Devy [19] relied on binary edge map generation with scan-lines approximation in orthogonal directions. These methods are characterized by sensitivity both to existing noise and non-uniform point distribution.

Region-based algorithms use greedy approach to examine similarity or distinctiveness in a limited vicinity. Rabbani et al. [18] applied simple region growing approach taking into account surface normal and points’ connectivity. On the other hand, Xiao et al. [24] proposed subregion based region growing, where each subregion is considered as a planar one or not, based on Principal Component Analysis (PCA) or KLT [25]. Region-based methods, suffer from dependence on seed point selection as well as from the fact, that decision is made locally, and it may be not correct from the global point of view.

Generally, segmentation methods based on attributes make use of clustering algorithms built onto extracted attributes. Mean-shift clustering [6], hierarchical clustering [7], contextual processing [8] or statistically supported clustering [9] are cases in point. Limitations of this group are clustering methods constraints themselves. For k-means clustering, number of clusters need to be known in advance. On the other hand, for other methods, high noise sensitivity or time complexity involved with multidimensionality may occur.

Among the algorithms based on graphs, one may find the method making use of colour information added to laser scans [21] or the method of vicinity graph presented in [2]. Clearly, graph-based methods may need a complex preprocessing phase, like training [15].

Most of methods employing a model makes use of RANSAC algorithm. Its main advantage is inherent robustness against outlying data, unlike the other methods. Currently, it appears in many variations. Schnabel et al. [20] used RANSAC for efficient detection of planes, cylinders and spheres. They evaluated their method by means of correctly detected regions. Oehler et al. [16] combined RANSAC with Hough transform [17] to increase quality of planes detection. Here, the authors identified number of true positives (TP) with 80% overlap region in order to evaluate their algorithm. Xu et al. [5] evaluated usability of different functions for weighted RANSAC in terms of their completeness (Eq. 4), correctness (Eq. 3), and quality (Eq. 5). Calvo et al. [33] engaged k-means strategy together with RANSAC to search for predefined model (like cube or pyramid) in a point cloud. They evaluated this algorithm by comparing an average angular deviation between normal vectors detected in a cloud and those known from a reference set. Finally, Li et al. [3] introduced an improved RANSAC-based method for uniform space division, which, following authors, presents itself as the most efficient state-of-art solution. Thus it was selected as a point cloud reference segmentation method.

The algorithm proposed by Li et al. relies on space division into cells of fixed size, whose dimensions have to be tuned for point cloud specifically. Cells are then classified either as planar ones or not. The decision is made on the grounds of eigenvalues, being the output of PCA procedure. If the ratio of the two greatest eigenvalues is lower than assumed threshold te, a cell is said to be a planar one (Eq. 1). For the dataset Room-1 [31], the authors assumed \(te=0.01\), whereas for Room-2 dataset [31], they took \(te=0.02\). The formula for te was determined by the authors empirically, taking into account that it is influenced by the cell size s and the noise level \(\epsilon \) (Eq. 2).

$$\begin{aligned} \frac{\lambda _1}{\lambda _2}<te{.} \end{aligned}$$
(1)

where \(\lambda _1\) and \(\lambda _2\) are, respectively the greatest, and the middle eigenvalue.

$$\begin{aligned} (\frac{\epsilon }{s})^2<te<0.04{.} \end{aligned}$$
(2)

Subsequently, the authors performed plane segmentation procedure, called NDT- RANSAC. In short, it consists of RANSAC- like examination of an individual planar cell. From each planar cell, the minimal set of three points is randomly taken to construct a hypothetical plane which may differ from that obtained with PCA. Having calculated a plane parameters, the rest of planar cells are compared to the current one, in terms of normal vector angular deviation and relative shift between objects. The plane, obtained by merging coherent patches is then refined taking into account its consensus set. The authors used verification measures based on confusion matrix analysis and they claim the quality of their segmentation procedure exceeds 88.5% of correctness (Eq. 3) and 85% of completeness (Eq. 4).

2.2 Current Quality Measures

Classification-Based Measures. Many current researchers, appraise their methods with confusion matrix analysis, treating the segmentation task as a kind of classification problem [3,4,5]. This kind of assessment relies on calculation of three basic measures using for classification evaluation: correctness (also referred to as precision, Eq. 3), completeness (known also as recall or sensitivity, Eq. 4), and quality (Eq. 5) according to maximum overlapping technique introduced by Awrangjeb and Fraser [32].

$$\begin{aligned} correctness=\frac{||TP||}{||TP|| + ||FP||}\end{aligned}$$
(3)
$$\begin{aligned} completeness=\frac{||TP||}{||TP|| + ||FN||}\end{aligned}$$
(4)
$$\begin{aligned} quality=\frac{||TP||}{||TP|| + ||FN|| + ||FP||} \end{aligned}$$
(5)

where \(||\cdot ||\) states for a cardinality of a set; TPFPFN states, respectively, for: true positives, false positives, and false negatives.

These measures do require unambiguous correspondence finding among groups of reference clustering \(\mathcal {R}_i\), such that \(\mathcal {R}=\{\mathcal {R}_1,\mathcal {R}_2,\mathcal {R}_3, ...,\mathcal {R}_n\}\) (\(\bigcup _{i=1}^n \mathcal {R}_i=\mathcal {D}\)) and clustering being apprised \(\mathcal {O}_j\), where \(\mathcal {O}=\{\mathcal {O}_1,\mathcal {O}_2,\mathcal {O}_3,...,\mathcal {O}_m\}\) (\(\bigcup _{j=1}^m \mathcal {O}_j=\mathcal {D}\)). Both clusterings are built over the dataset \(\mathcal {D}\). This correspondence is usually determined by searching patches that overlap the most [32], namely we look for the corresponding clusters \(\mathcal {O}_j\) and \(\mathcal {R}_i\) where:

\({{\mathrm{arg\,max}}}_{j}\frac{||\mathcal {O}_j \cap \mathcal {R}_i||}{||\mathcal {O}_j||}={{\mathrm{arg\,max}}}_{i}\frac{||\mathcal {O}_j \cap \mathcal {R}_i||}{||\mathcal {R}_i||}\).

Fig. 1.
figure 1

A case of improper correspondence finding with maximum- overlapping method

Nevertheless these metrics consider solely number of clusters classified as TP, FP or FN. They do not take into account cardinality of overlapping regions, hence these statistics may be easily far-fetched and results might not be reliable. Depending on the presumed overlapping threshold (if any), these measures fail in case of high ratio of FP or FN, or highly unexpected output (see Fig. 1). In Fig. 1, we may clearly see that, using maximum overlapping strategy, a solid-line triangle will be associated with a dashed rectangle rather than its actual counterpart.

Micro- and Macro- averaging. The other approach for clustering quality assessment, or its variation applicable for any measures, are techniques being widely used in machine learning domain for multi-label classification, called micro- and macro- averaging. They aggregate intermediate results into global information. Manning et al. [29] defined macro-averaging as arithmetical average across classes and micro- averaging as weighted mean of multi-label classification measures. Hence, we may formally define macro-averaging as the arithmetic mean of the values and micro-averaged completeness, correctness and quality as in the Eqs. 7 and 8 [30].

$$\begin{aligned} correctness=\frac{\sum ||TP||}{\sum (||TP||+||FP||)} \end{aligned}$$
(6)
$$\begin{aligned} completeness=\frac{\sum ||TP||}{\sum (||TP||+||FN||)} \end{aligned}$$
(7)
$$\begin{aligned} quality=\frac{\sum ||TP||}{\sum (||TP||+||FP||+||FN||)} \end{aligned}$$
(8)

where TP,FN,FP states respectively for true positives, false negatives, and false positives.

Variation of Information. Besides correctness, completeness, and quality relying solely on the binary decision: does cluster correspond or not, there is another group of methods for clustering comparison [34] making use of fuzzy correspondences, defined by information theory and information entropy of Shannon [26] (Eq. 9).

$$\begin{aligned} H(\mathcal {R})=-\sum _{i=1}^{n} \frac{||\mathcal {R}_i||}{||\mathcal {D}||}\log _2(\frac{||\mathcal {R}_i||}{||\mathcal {D}||}) \end{aligned}$$
(9)

where \(\mathcal {R}_i\) is the cluster being considered and \(||\mathcal {R}_i||\) stands for its cardinality.

One of measures constructed on the notion of information entropy, is mutual information \(I(\cdot ,\cdot )\). Mutual Information of two random variables \(\mathcal {O}\) and \(\mathcal {R}\) (\(I(\mathcal {O},\mathcal {R})\)), as Gel\(\acute{\mathrm{f}}\)and and Yaglom [27] defined, is the amount of information of \(\mathcal {O}\) contained within the variable \(\mathcal {R}\) and may be represented with Eq. 10.

$$\begin{aligned} I(\mathcal {O},\mathcal {R})=\sum _{j \le m} \sum _{i \le n} \frac{||\mathcal {R}_i \cap \mathcal {O}_j||}{||\mathcal {D}||}\log \frac{||\mathcal {R}_i \cap \mathcal {O}_j|| \cdot ||\mathcal {D}||}{||\mathcal {O}_j|| \cdot ||\mathcal {R}_i||} \end{aligned}$$
(10)

Mutual Information evaluates many-to-many relations, unlike classification-based measures relying on one-to-one correspondences. Hence Mutual Information provides insight into segmentation result by means of many-to-many relations. Another method, inspired by information entropy was introduced by Meilă. She derived measure of Variation of Information (VoI) dedicated for comparing clusterings. Meilă [28] defined VoI as loss and gain of information while switching from clustering \(\mathcal {O}\) to the clustering \(\mathcal {R}\).

$$\begin{aligned} VoI(\mathcal {O},\mathcal {R}) =H(\mathcal {O})+H(\mathcal {R})-2I(\mathcal {O},\mathcal {R}) \end{aligned}$$
(11)

where \(H(\cdot )\) is an information entropy (Eq. 9) and \(I(\cdot ,\cdot )\) is Mutual Information (Eq. 10).

Although, VoI is not directly dependent on the number of points \(||\mathcal {D}||\), value of Variation of Information is constrained with upper bounds according to (12). The best possible value of VoI, for the perfect clustering (\(\mathcal {O}=\mathcal {R}\)) is zero. Moreover, it is true metric, unlike Mutual Information.

$$\begin{aligned} VoI_{max}(\mathcal {O},\mathcal {C})\le \log ||\mathcal {D}|| \end{aligned}$$
(12)

Superiority of VoI over classification- based measures is the fact that no one-to-one correspondence has to be found a priori. It produces reliable results even in case of significant clusters’ granulation and partial overlapping. Hence it may be thought of as fuzzy decision about clusters correspondences rather than binary: yes or no.

3 New Quality Measures

Since we focus on evaluation of planarity detection methods, by example of Li et al. [3] approach, four distinguished measures were used.

  1. 1.

    Variation of Information - (VoI)

  2. 2.

    ordinary classification statistics (used in [3]) - (OCS)

  3. 3.

    micro- and macro- weighted classification statistics in terms of overlapping size (see Subsect. 3.1) - (WCS)

  4. 4.

    micro- and macro- averaged planarity statistics (see Subsect. 3.2) - (PS)

Two last of them, weighted classification statistics (WCS) and planarity statistics (PS), are newly derived ones as none of the reviewed authors exploited them and it became the contribution of this paper.

3.1 Weighted Classification Statistics

Weighted classification statistics (WCS) measure is influenced by the number of common part between a reference cluster and the best fitting resulting cluster not being associated yet. In the Fig. 2 we may see correspondence found between a reference cluster (dashed border) and an output cluster (solid line rectangle). The correspondence is found with maximum common-part strategy. In that image (Fig. 2), TP is the cardinality of the inner white region, strips signify region whose cardinality is said to be FP, and the number of points belonging to grey region pose the number of FN.

Fig. 2.
figure 2

Idea of WCS measure. The dashed lines state for a reference cluster, solid line borders identify output cluster. White region is said to be TP, grey area- FN, and stripped region- FP

To clearly state it, for each reference cluster \(R_i \in \mathcal {R}\) the set of remaining output clusters is searched to identify the cluster \(\mathcal {O}_j \in \mathcal {O}\) whose the largest number of points lies within \(\mathcal {R}_i\). Having found one-to-one correspondence between the reference \(\mathcal {R}_i\) and the output cluster \(\mathcal {O}_j\), TP, FP, and FN are calculated. True positives are thought of as the common part between corresponding clusters. The number of false positives (Eq. 13) is calculated as a difference between sum of cardinalities of the output clustering \(\mathcal {O}_j \in \mathcal {O}'\) (assuming some clusters of \(\mathcal {O}\) may be rejected \(\mathcal {O}' \subset \mathcal {O}\)) and the number of TP. False negatives (Eq. 14) are calculated as a difference of the whole set cardinality \(||\mathcal {D}||\) and the number of TP. This way of calculating local clustering characteristics gives us appraisal of an individual result influenced by overlapping size.

Contrary to the OCS, a significance of correspondence in the case presented in the Fig. 1 will be properly diminished with respect to the size of overlapping part. In the Fig. 2 one may see, that value of WCS will vary much whereas OCS may still indicate the same values.

$$\begin{aligned} ||FP||&=\bigcup _{\mathcal {O}_j \in \mathcal {O}'}||\mathcal {O}_j||-||TP|| ~~\text {where} ~~ \mathcal {O}'\subset \mathcal {O} \end{aligned}$$
(13)
$$\begin{aligned} ||FN||&=||\mathcal {D}||-||TP|| \end{aligned}$$
(14)

3.2 Planarity Statistics

Planarity statistics (PS) measure supplies the estimation of actual planar fragments detection without penalty for division of fragments which constitute the one actual plane (Fig. 3). Penalty is put only for those points of an output cluster which exceed a reference plane and those of an output cluster which do not have their counterpart in a reference one. It identifies TP, FP, and FN as in the Fig. 3, where inner white region describes TP, strips signify FP, and gray region- FN. Everything under the condition that overlapping part of a single output cluster has to be at the level of, at least, 50% to consider a part of a cluster as TP.

Fig. 3.
figure 3

Idea of planarity statistics (PS) appraisal. The dashed lines state for a reference cluster, solid line borders identify output clusters

4 Results and Discussion

Experiments were carried out for the dataset Room-1 [31] down-sampled to the size of 121,988 points, and for Room-2 [31] built of 292,997 points. Two reference clusterings, for each room dataset, were manually labelled. The detailed clusterings identified as much planar fragments as it was perceptually justified, whereas general clusterings, resembled the reference clusterings of Li et al. [3].

For Room-1, the first, more exact and detailed clustering, including significantly more planar clusters, contained 73 groups of points, whereas the second one - general, only 10 clusters. For Room-2, detailed clustering had 32 groups of points, and general included only 7 clusters. Experiment configuration was like that used in the Li’s method [3] (see Table 1) and reflected semantic characteristics of considered rooms. Detailed clusterings reflected all perceptually perceived planar fragments, whereas general, less detailed, clusterings reflected mainly the largest planar fragments. Different number of clusters, though possible, were not considered.

Table 1. Values of parameters of the method by Li et al. [3] used during experiments

Analysing quality measures, applied for classification tests performed for the planarity detection procedure introduced in [3], we may notice how considered measures reflected selected aspects of the datasets classification procedure. As the output of our experiments, we have examined four classification quality measures: Variation of Information (VoI), ordinary classification statistics (OCS), and those introduced in this paper: weighted classification statistics (WCS), and planarity statistics (PS). Experiments were conducted for the general clustering of the reference sets (see Table 2) - similar to that used by Li et al., and for the detailed clustering of the reference sets (see Table 4). We have also compared values of OCS and WCS for general clusterings with switched values of te - clustering method parameter, as to evaluate usability of the proposed measures (Table 3).

Table 2. Values of measures for Room-1 and Room-2 datasets for Li’s method [3] obtained by applying general clustering and suggested values of te (\(te=0.01\) for Room-1 and \(te=0.02\) for Room-2)
Table 3. Values of measures for Room-1 and Room-2 datasets for Li’s method [3] obtained by applying general clustering and switched values of te (\(te=0.02\) for Room-1 and \(te=0.01\) for Room-2)
Table 4. Values of measures for Room-1 and Room-2 datasets for Li’s method [3] obtained by applying detailed reference clustering

Values of OCS for general clustering of reference sets are close to these, presented by Li et al. [3]. On the other hand, the OCS values for detailed clustering of reference sets (see Table 4) differ much, but it is clear, since we provide much more detailed clustering. We may think that all output clusters have found their reference counterpart. But there are many planar fragments skipped by the algorithm [3], like a chair or a suitcase. Hence some reference clusters do not have their corresponding clusters in the output, what obviously leads to low completeness indicator. On the other hand, WCS values seems to vary a lot from OCS (Table 4). Correctness values of WCS, lower than that of OCS, point out that corresponding reference and output clusters do not overlap perfectly and may be shifted with respect to each other. On the other hand, very high completeness values inform us about low number of FN. This might be interpreted in such a way that larger planes found proper counterpart and smaller reference clusters were not associated.

High values of PS for Room-1, both micro- and macro- averaged, suggest that planar fragments, mainly, do not contain many disturbing points from other planar patches. For Room-2 we may see poorer values of micro-averaged completeness and quality of PS. From this, we may suspect that the Li’s method [3] finds only basic planes in Room-2 like a wall, a ceiling, or a floor.

Let us compare values of measures between Tables 2 and 3, where values of te were switched. Values of OCS indicate increase of correctness and fall of completeness and quality. These values indicate that, approximately, one more correct correspondence was found (TP) but number of undetected reference clusters (FN) grew. WCS values give us more information. For Room-1, micro- and macro- averaged correctness remain virtually equal, what suggests that actually numbers of TP and FP have not changed much- the same correspondences were found. On the other hand, fluctuations in WCS completeness and quality tell us that number of FN increased for larger fragments (clusters of more points). For Room-2 we may see slightly different tendency. Values of correctness and quality increased and the completeness remained the same. This would suggest that \(te=0.01\) for Room-2 suits better than \(te=0.02\). However, values of OCS and PS indicate something opposite. Actually, fewer points from larger and smaller clusters were grouped correctly.

Having analysed results presented in the Tables 2 and 4, several conclusions may be withdrawn. First of all, ordinary classification statistics might be useful for tasks of clusters counting, for instance, how many roofs terrestrial laser scan contains or how many potential walls we have in our indoor scan. OCS provides quantitative evaluation of an output clustering. On the other hand, if we would like to know how well our resulting clusters fit a reference set, we need a qualitative measure, supported by WCS. This measure allows a researcher to construct a method that focuses on maximizing overlapping parts for major clusters, whereas penalty put onto undetected small regions is accordingly smaller. This measure may be valuable for compression purposes, where we expect an algorithm to reduce size, most of all of the greatest regions, preserving at the same time the highest possible quality. Measuring PS, in turn, gives us insight into the process of space division. Regardless to the approach we use for space division, either hierarchical or uniform, it has to supply sufficiently small patches that only one plane is contained therein. Planarity statistics let us appraise whether space was divided enough to enable then the proper segments aggregation.

Comparing values of both - general and detailed clusterings (respectively, Tables 2 and 4), one may noticed that values of OCS differ a lot, when number of reference clusters has changed. The opposite tendency we may see for our measures: WCS and PS, which indicate quantitative evaluation, which regardless to the reasonable changes of number of reference clusters, point out similar assessment of the method.

5 Conclusions

In this paper, we presented two new measures used for sophisticated assessment of planes detection methods, namely: weighted classification statistics and planarity statistics. These measures were compared with two, the most popular ones. The results of the performed experiments indicate benefits of the introduced measures, because they focus on different aspects of segmentation and supplement classical approach. Whereas OCS may sometimes indicate better results, WCS clearly suggest deterioration of clustering. Planarity statistics show how good planes we found, regardless to the fact how many of them were considered. Thanks to that, we may estimate if partition is sufficient to cover each plane. On the other hand, weighted classification statistics provide us with the information how big regions were found correctly. Since each measure has its own application, we provided also recommendations concerning using particular measures for dedicated purposes.