
1 Introduction

At the latest since the emergence of Convolutional Neural Networks (CNNs), it has become clear that supervised machine learning (ML) systems are severely hindered by the lack of labeled training data. To boost the development of such systems, many labeled benchmark data sets were crafted both in the domain of imagery [7, 17] and of 3D point clouds [8, 28]. However, these data sets might be insufficient for new tasks, for instance in remote sensing (e.g., airborne vs. terrestrial systems). Although labeled data sets also exist in the remote sensing domain [18, 24, 27], due to the rapid development of new sensor types and system designs (e.g., for airborne laser scanning (ALS): conventional ALS [24], UAV laser scanning [18] often enriched by imaging sensors, single photon LiDAR [23]), labeled data might quickly become outdated, requiring new labeling campaigns. Although transfer learning might help to reduce the amount of task-specific ground truth (GT) data by building upon GT from another domain, it is often necessary to generate one's own training data [25].

Such a labeling process is typically carried out by experts [18, 24], which is both time-consuming and cost-intensive. Hence, the idea is to hand this tedious task over to others in the sense of crowdsourcing, in order to free experts from such duties. In this context, many platforms for carrying out crowd campaigns have emerged (such as Amazon Mechanical Turk [5] or microWorkers [12]). In addition to such crowdsourcing platforms, services which take over the complete labeling process are also coming into focus (such as Google’s Data Labeling Service [9]). Although the expert loses control of the labeling campaign when handing it over completely (outsourcing vs. crowdsourcing), such services are justified by avoiding time-consuming campaign management (hiring, instructing, checking and paying crowdworkers). Hence, for crowdsourcing to remain competitive, these management tasks ought to be automated.

Another major challenge of employing crowdworkers is the quality control of the results received from the crowd. Walter & Soergel [33] have shown that data quality varies significantly. Therefore, an employer either needs to check results manually (which might become even more labor-intensive than the actual labeling task) or to rely on proper means for quality control. This problem is most pronounced in the context of paid crowdsourcing, where often the sole aim of crowdworkers is to make money as fast as possible and there might even be malicious crowdworkers. In this regard, the motivation and consequently the quality of work in paid crowdsourcing differ significantly from volunteered crowdsourcing (or volunteered geographic information, to be precise). In case of the latter, workers are intrinsically motivated and aim to contribute to a greater cause, for example freely available map data in the case of OpenStreetMap [4]. Nevertheless, paid crowdsourcing has already been used successfully for the annotation of airborne imagery [33], for detecting and describing trees in 3D point clouds [32], and for labeling individual points according to a specified class catalog [15]. To minimize the labeling effort (for the crowd), a common approach is Active Learning (AL).

2 Related Work on Active Learning

AL aims at selecting only the most informative instances, thus justifying the manual annotation effort [29]. Mackowiak et al. [22] query such instances (2D image subsets) by combining predictive uncertainty and expected human annotation effort for deriving a semantic segmentation of 2D imagery. Luo et al. [21] transferred the AL idea to the semantic segmentation of mobile mapping point clouds, relying on a sophisticated higher-order Markov Random Field. However, only few works focus on ALS point clouds, such as Hui et al. [14], who apply an AL framework for iteratively refining a digital elevation model. For the semantic segmentation of ALS data, Li & Pfeifer [19] introduce an artificial oracle by propagating the few available class labels to queried points based on their geometric similarity.

To go beyond automatically answering the query of the AL loop, Lin et al. [20] define an AL regime for the semantic segmentation of ALS point clouds relying on the PointNet++ [26] architecture, where labels are given by an omniscient oracle. The inherent problem of employing CNN approaches in AL is that usually the majority of points do not carry a label and thus cannot contribute to the loss function. Often, this problem is circumvented by first performing an unsupervised segmentation to build subsets of points, which are then completely annotated by the oracle [13, 20]. Although such a procedure drastically reduces the number of necessary labels, the oracle is still asked to deliver full annotations (of subsets), requiring a lot of human interaction. In Kölle et al. [16] this issue is directly addressed by excluding unlabeled points from the loss computation while still implicitly learning from them as geometric neighbors of labeled points. Additionally, the authors found that AL loops utilizing the state-of-the-art SCN [10] architecture can cause more computational effort, due to relearning (or at least refining) features in every iteration step, and might converge more slowly than conventional feature-driven classifiers. Hence, a CNN design might not be optimal for AL.

In most of the aforementioned works it is assumed that labels of the selected primitives are received from an omniscient oracle, which is a naive assumption regardless of whether an expert or the crowd is labeling [33]. Consequently, to fully relieve experts from labeling efforts and to form a feasible hybrid intelligence [31] or human-in-the-loop system [2], crowdsourced labeling must be integrated into the AL procedure in an automated manner.

Our contribution can be summarized as follows: We develop a framework referred to as CATEGORISE (Crowd-based Active Learning for Point Semantics), which is tailored to 3D point annotation but can easily be transferred to other tasks as well. This includes a detailed discussion of i) possibilities for automated quality control tested in various crowd campaigns (Sect. 3.1), ii) measures for automating the crowd management (Sect. 3.2), and iii) a suitable intrinsic quality measure for the AL loop that enables an operator to monitor the training progress of the machine (Sect. 3.3). Please note that, in contrast to related work in this domain [15, 16], which mainly focuses on how to employ AL in a crowd-based scenario for the semantic segmentation of point clouds, the focus of this paper lies on the automation itself, enabling such AL loops to run with real crowdworkers as if the annotation were a subroutine of a program.

3 The CATEGORISE Framework

As outlined above, the backbone of our framework is the combination of crowdsourcing with AL in order to iteratively point out only the subset of points worth labeling. Starting from an initial training data set (see Sect. 5.2), a first classifier C is trained and used to predict on the remaining training points. In our case, we apply a Random Forest (RF) classifier [3] (features are adopted from Haala et al. [11]). The predicted a posteriori probabilities p(c|x) (that point x belongs to class c) are then used to determine the samples the classifier is most uncertain about (i.e., for which it would benefit most from knowing the actual label). This sampling score can be derived via the entropy E:

$$\begin{aligned} x_{E} = \underset{x}{\text {argmax}} \left( - \sum _{c} p(c|x) \cdot \log p(c|x) \right) \end{aligned}$$
(1)

In order to also account for imbalanced class occurrences, we further rely on a weighting function derived from the total number of points \(n_T\) currently present in the training data set and the number of representatives of each class \(n_c\) at iteration step i: \(w_c(i) = n_T(i)/n_c(i)\). To avoid sampling points which are similar in terms of their representation in feature space (in the context of pool-based AL) and to boost the convergence of the iteration, we adapt the recommendation of Zhdanov [35]. Precisely, we apply k-means clustering in feature space and sample one point from each cluster (the number of clusters equals the number of points \(n_{AL}\) to be sampled) in each iteration step.
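A minimal sketch of this query step is given below (assuming integer class labels; how exactly the class weight enters the sampling score and which point is taken per cluster are not fixed above, so weighting each candidate by the weight of its predicted class and taking the highest-scoring cluster member are assumptions of this sketch):

```python
import numpy as np
from sklearn.cluster import KMeans

def query_points(proba, train_labels, pool_features, n_al=300, seed=0):
    """Weighted-entropy sampling with k-means diversity (one point per cluster).

    proba:         (N, C) a posteriori probabilities p(c|x) of the current classifier
    train_labels:  integer class labels currently in the training set (for w_c)
    pool_features: (N, F) feature vectors of the candidate pool
    """
    # Entropy score of Eq. (1)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)

    # Class weights w_c = n_T / n_c; weighting each candidate by the weight of
    # its predicted class is an assumption, the text does not fix this detail.
    classes, counts = np.unique(train_labels, return_counts=True)
    weights = np.ones(proba.shape[1])
    weights[classes] = counts.sum() / counts
    score = entropy * weights[np.argmax(proba, axis=1)]

    # Diversity following Zhdanov [35]: n_AL clusters, one point per cluster
    # (here: the highest-scoring member of each cluster).
    km = KMeans(n_clusters=n_al, n_init=10, random_state=seed).fit(pool_features)
    selected = [np.flatnonzero(km.labels_ == c)[np.argmax(score[km.labels_ == c])]
                for c in range(n_al)]
    return np.asarray(selected)
```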

To especially account for the employment of real crowdworkers, we rely on a sampling add-on proposed by Kölle et al. [16] referred to as RIU (Reducing Interpretation Uncertainty), which aims at reducing the interpretation uncertainty of points situated on class borders, where the true class is hard to determine even for experts. Precisely, in each case we use a point with the highest sampling score as seed point but select an alternative point within a distance of \(d_{RIU}\) (in object space) instead. Combining these sampling strategies yields a selection of points that is optimal both in terms of informativeness and of the crowd interpretability crucial for our framework.
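A possible realization of this add-on is sketched below; how the alternative point within \(d_{RIU}\) is chosen is not fixed above, so taking the spatial neighbor with the lowest predictive entropy is an assumption of this sketch:

```python
import numpy as np
from scipy.spatial import cKDTree

def riu(seed_indices, xyz, entropy, d_riu=1.5):
    """RIU add-on: replace each seed point by a nearby point that is easier to interpret.

    The choice of the alternative point within d_RIU is not specified above;
    taking the spatial neighbor with the lowest predictive entropy is an
    assumption made for this sketch.
    """
    tree = cKDTree(xyz)  # object-space coordinates, shape (N, 3)
    alternatives = []
    for s in seed_indices:
        neighbors = tree.query_ball_point(xyz[s], r=d_riu)
        alternatives.append(min(neighbors, key=lambda i: entropy[i]))
    return np.asarray(alternatives)
```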

3.1 Automation of Quality Control

Such a fully automated framework is only applicable if the operator can trust the labels received from the crowd, which are used for training a supervised ML model. Since results from crowdworkers can be of heterogeneous quality [33], quality control is of high importance. Although the interpretation of 3D data on 2D screens requires a distinct spatial imagination, Kölle et al. [15] have already shown that crowdworkers are generally capable of annotating 3D points. Within the present work, we aim to analyze in which way the performance of crowdworkers for 3D point annotation can be further improved. Quality control measures can be categorized as i) quality control on task designing and ii) quality improvement after data collection [34]. In case of labeling specific selected points, which can be thought of as a categorization task, one realization of the latter can be derived from the phenomenon of the wisdom of the crowd [30]. This means that aggregating the answers of many yields a result of similar quality to one given by a single dedicated expert. In our case, the wisdom of the crowd translates to a simple majority vote (MV) over the class labels given by a group of crowdworkers (i.e., a crowd oracle). Consequently, this raises the question of how many crowdworkers are necessary to obtain results sufficient to train an ML model of the desired quality. A detailed discussion of the experimental setup can be found in Sect. 5.1.

We would like to stress that, to answer this question, it is insufficient to run a labeling campaign multiple times while varying the number n of crowdworkers employed, since the results would be highly sensitive to the individual acquisitions, which might be extraordinarily good or bad (especially for small n). To derive a more general result, we ran the campaign only once, and each point was covered by a total of k crowdworkers (\(k \ge n\)). From those k acquisitions, for each n (i.e., the number of crowdworkers required; range is \(\left[ 1,k\right] \)) we derive all possible combinations:

$$\begin{aligned} n_{comb} = \left( \begin{array}{c}k\\ n\\ \end{array}\right) = \frac{ k! }{ (k-n)! \cdot n! } \end{aligned}$$
(2)

For each n, the acquisitions of each combination were aggregated via MV and evaluated in terms of Overall Accuracy (OA) and classwise mean F1-score. Afterwards, the quality measures of all combinations (for a specific n) were averaged. Consequently, our quality metrics can be considered the typical result obtained when asking n crowdworkers for labels.
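This evaluation can be sketched as follows (ties in the majority vote are broken arbitrarily; for large k one would sample combinations instead of enumerating all of them):

```python
from itertools import combinations
from collections import Counter
import numpy as np

def mean_oa(acquisitions, reference, n):
    """Average OA of majority voting over all C(k, n) subsets of crowd answers.

    acquisitions: (k, P) array, labels given by k crowdworkers for P points
    reference:    (P,) reference labels
    """
    k = acquisitions.shape[0]
    oas = []
    for subset in combinations(range(k), n):
        votes = acquisitions[list(subset)]
        mv = np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
        oas.append(np.mean(mv == reference))
    return float(np.mean(oas))
```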

However, a drawback of accomplishing quality control by the wisdom of the crowd is the increased cost due to multiple acquisitions. Therefore, it is beneficial to also employ quality control on task designing [34]. In our case, this is realized by including check points in our tasks. Precisely, in addition to labeling the queried AL points, each crowdworker is asked to label a specific number of check points with known class label. These additional points can then be used to filter out and reject results of low quality. Hence, labels from i) crowdworkers who did not understand the task, ii) crowdworkers who are not capable of dealing with this kind of data, or even iii) malicious crowdworkers (i.e., who try to maximize their income by randomly selecting labels in order to finish the task quickly) can be filtered. This poses the question of the right number of check points to include and of their impact on labeling accuracy (analyzed in Sect. 5.1).
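As a minimal sketch of this filter (the data layout, in particular the key of the repeated check point, is hypothetical):

```python
def passes_check(answers, reference, repeated_id, min_correct=3):
    """Check-point filter for one submitted job.

    answers:     dict check_point_id -> label given by the crowdworker
                 (the repeated check point appears twice, here under the
                 hypothetical key f"{repeated_id}_2")
    reference:   dict check_point_id -> true label of the check points
    min_correct: quality level, e.g. 3 for "passed 3 pts"
    """
    consistent = answers[repeated_id] == answers[f"{repeated_id}_2"]
    n_correct = sum(answers[cp] == reference[cp] for cp in reference)
    return consistent and n_correct >= min_correct
```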

Fig. 1. Architecture of the CATEGORISE framework (a) and the interface used by the operator (b). The latter is designed so that the operator may monitor and control the complete AL run.

3.2 Automation of Crowd Management

To realize a truly automated process, we need to avoid any direct engagement between the operator (i.e., the employer) and the crowdworkers. Within our framework (visualized in Fig. 1(a)), we draw on the crowd of microWorkers, which also handles the payment of crowdworkers by crediting salaries to their microWorkers accounts (avoiding transfers to individual bank accounts, which would be laborious and would incur fees). Crowd campaigns are prompted in an automated manner by leveraging the microWorkers API. Simultaneously, a respective web tool is set up (by feeding parameters to custom web tool blueprints) and the input point clouds to be presented to crowdworkers are prepared (all point cloud data is hosted on the operator’s server). An exemplary tool is visualized in Fig. 2(a). When a crowdworker on microWorkers accepts a task, they use the prepared web tool and the necessary point cloud data is transferred to them. After completion of the task, the results are transmitted to the operator’s server (via PHP) and the crowdworker receives the payment through microWorkers.
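The campaign creation step can be illustrated with a short sketch; the endpoint URL, path and payload fields below are placeholders and do not reproduce the actual microWorkers API, they only illustrate that campaigns are created programmatically rather than through the web front end:

```python
import requests

API = "https://api.example-crowd-platform.com"   # placeholder, not the real microWorkers endpoint
API_KEY = "..."                                   # operator credentials

def post_campaign(point_ids, pay=0.10, bonus=0.05, positions=3):
    """Create one labeling campaign programmatically (hypothetical payload)."""
    jobs = [point_ids[i:i + 10] for i in range(0, len(point_ids), 10)]  # 10 points per job
    payload = {
        "title": "Label 3D points",
        "externalUrl": "https://example.org/webtool.php?job={job_id}",  # operator-hosted tool
        "paymentPerTask": pay,
        "bonusPerTask": bonus,
        "positionsPerTask": positions,   # 3 acquisitions per point for MV
        "tasks": [{"points": job} for job in jobs],
    }
    r = requests.post(f"{API}/campaigns", json=payload,
                      headers={"Authorization": f"Bearer {API_KEY}"})
    r.raise_for_status()
    return r.json()["campaignId"]
```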

Meanwhile, using our control interface (see Fig. 1(b)) hosted on the operator’s server, the operator can both request the state of an ongoing crowd campaign from the microWorkers server (e.g., the number of jobs completed, the respective ratings, the average task duration (see Sect. 3.1), etc.) and the current training progress from the operator’s server in order to monitor the overall progress. The control interface and the web tools are implemented in JavaScript (requests are handled with AJAX). As soon as all points of one iteration step are labeled, the evaluation routine (implemented in Python) is called and the AL iteration continues (provided the stopping criterion is not met).

3.3 Automated Stopping of the AL Loop

When we recall our aim of an automated framework, it is crucial that it not only runs but also stops automatically once there is no significant quality gain anymore (i.e., the iteration converges). Therefore, our aim is to find an effective measure of quality on which we can build our stopping criterion for the AL loop and which does not require GT data. Inspired by the approach of Bloodgood & Vijay-Shanker [1], we accomplish this by determining the congruence of the labels predicted (for the distinct test set) in the current iteration step with those of the previous one (i.e., we compute the relative number of points for which the predicted class label has not changed). In addition to this overall congruence \(C_o\), to sufficiently account for small classes, we further derive a classwise congruence value by first filtering the points currently predicted as c and checking whether this class was assigned to those points in the previous iteration step as well. These individual class scores are averaged to obtain an overall measure \(C_{ac}\) that is equally sensitive to each class. For actually stopping the iteration, we require that the standard deviation of the congruence values of the previous \(n_{stop}\) iteration steps (counted from the current one) converges towards 0, which means that the change in predictions stays almost constant (i.e., only the classes of a few most demanding points change).
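A minimal sketch of both congruence measures and the stopping test (the tolerance value is an assumption; the text only requires the standard deviation to approach 0):

```python
import numpy as np

def congruence(pred_prev, pred_curr):
    """Overall congruence C_o and class-averaged congruence C_ac."""
    c_o = np.mean(pred_prev == pred_curr)
    per_class = [np.mean(pred_prev[pred_curr == c] == c)
                 for c in np.unique(pred_curr)]
    return float(c_o), float(np.mean(per_class))

def should_stop(congruence_history, n_stop=5, tol=0.004):
    """Stop once the std. dev. of the last n_stop congruence values is close to 0.

    tol is an assumed threshold; the 0.4% value reported in Sect. 5.2 motivates
    the default, the text itself only requires convergence towards 0.
    """
    if len(congruence_history) < n_stop:
        return False
    return float(np.std(congruence_history[-n_stop:])) < tol
```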

4 Data Sets

All our tests were conducted on both of ISPRS’ current benchmark data sets: the well-known Vaihingen 3D (V3D) data set [24] captured in August 2008, and the recently introduced Hessigheim 3D (H3D) data set [18] acquired in March 2018. V3D depicts a suburban environment covering an area of \(0.13\,\mathrm{km}^2\) described by about \(1.2\,\mathrm{M~points}\). We colorized the points by orthogonal projection of colors from an orthophoto received from Cramer [6]. Color information is used both for deriving color-based features and for presenting point cloud data to crowdworkers. H3D is a UAV laser scanning data set and consists of about \(126\,\mathrm{M\,points}\) covering a village with an area of about \(0.09\,\mathrm{km}^2\). The class catalogs of both data sets can be seen in Table 1. Please note that, in order to avoid labeling mistakes by crowdworkers due solely to an ambiguous class understanding, in case of V3D class Powerline was merged with class Roof, and classes Tree, Shrub and Fence were merged into class Vegetation. In case of H3D, class Shrub was merged with Tree (to Vegetation), Soil/Gravel with Low Vegetation, Vertical Surface with Façade, and Chimney with Roof.
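These merges amount to a simple renaming of classes; a minimal sketch (class names taken verbatim from the text above, "Facade" written without the cedilla for portability):

```python
# Class merges applied before presenting points to the crowd, as listed above.
V3D_MERGE = {"Powerline": "Roof",
             "Tree": "Vegetation", "Shrub": "Vegetation", "Fence": "Vegetation"}
H3D_MERGE = {"Shrub": "Vegetation", "Tree": "Vegetation",
             "Soil/Gravel": "Low Vegetation",
             "Vertical Surface": "Facade", "Chimney": "Roof"}

def merge_classes(labels, mapping):
    """Map original class names to the reduced crowd class catalog."""
    return [mapping.get(label, label) for label in labels]
```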

5 Results

Within this section, we first discuss our experiments for determining proper measures for quality control (Sect. 5.1), which constitute the basis for conducting our crowd-based AL loops presented in Sect. 5.2.

5.1 Impact of Quality Control Measures to Label Accuracy

To determine measures for quality control (see Sect. 3.1), a total of 3 crowd campaigns were conducted for H3D. These campaigns are dedicated to i) analyzing the labeling accuracy w.r.t. the number of multiple acquisitions when quality control is done by MV only, ii) exploring the impact of including check points, and iii) deriving the optimal number of multiple acquisitions when combining check points and MV (i.e., realizing both quality control on task designing and quality improvement after data collection). For each campaign, we randomly selected 20 points per class and organized them in jobs of 6 points each (one point per class), which results in a total of 20 jobs. Points were randomly shuffled within each job so that crowdworkers do not recognize a pattern in the point organization. Each job was processed by \(k=20\) different crowdworkers, who used the web tool visualized in Fig. 2(a). In addition to the point to be labeled, we extract a \(2.5\,\mathrm{D}\) neighborhood with a radius of \(20\,\mathrm{m}\) (a trade-off between a subset large enough for feasible interpretation and the required loading time, which is limited by the available bandwidth) to preserve spatial context.

Fig. 2. Developed web tool used by crowdworkers for labeling 3D points (a) and derived results. We compare the result of pure MV (b) to the result of the same task when adding check points (d). The quality improvement by check points is displayed in (c). (The web tool can be tried out at https://crowd.ifp.uni-stuttgart.de/DEMO/index.php.)

Figure 2(b) depicts the labeling accuracy of the crowd w.r.t. the number of acquisitions used for MV (please note that results are averaged over all \(n_{comb}\) combinations; see Eq. 2). In addition to the OA and classwise F1-scores, we further derive the entropy of the relative class votes as a measure of uncertainty of the crowd labels. This first campaign shows that MV leads to almost perfect results in the long run, confirming the concept of the wisdom of the crowd. However, this requires many repeated acquisitions and thus causes increased costs. The F1-scores of most classes converge from about \(n=10\) acquisitions onwards. However, most classifiers can cope with erroneous labels to some extent. In Kölle et al. [16] it was shown that about \(10\,{\%}\) of label errors only marginally harm the performance of a classifier. Considering those findings, a significantly smaller n of about 5 actually suffices.

Nevertheless, we aim to further reduce costs by introducing check points (see Sect. 3.1). The second campaign was dedicated to determining the optimal number of check points. Precisely, we added the same 7 check points to each task and showed the first and last check point twice in order to check the consistency of the given labels. Crowdworkers were informed about the presence of check points, but without giving further details or the resulting consequences. In post-processing, we used these check points to gradually filter out results which do not meet a certain quality level (see Fig. 2(c)). In this context, the quality level consistent means that the check point presented twice was labeled identically, but not necessarily correctly. Correctness, however, is required for level passed 1 pt (the following quality levels additionally require correctness of more than one check point). We can see that the OA can be improved by about 10 percentage points (pps) when enforcing the highest quality level. On the other hand, with this quality level about \(30\,{\%}\) of the jobs would not pass our quality control and would have to be rejected. Additionally, the extra labeling effort of 8 points per job causes additional costs (since crowdworkers should be paid fairly, proportional to the accomplished work). Therefore, we decided to use quality level passed 3 pts for our future campaigns, which offers a good trade-off between accuracy gain and the number of jobs rejected.

Considering these findings, we posted the third campaign, which differs from the previous ones in that it combines both quality control strategies (MV & check points). By using a total of 3 check points (one of them shown twice for consistency), we aim at receiving only high-quality results as input for MV. As a trade-off between error tolerance and additional incentive, we allowed the false annotation of one check point but offered a bonus of \(0.05\,{\$}\) on top of the base payment of \(0.10\,{\$}\) per job (which is also the base payment for campaigns 1 and 2) when all check points are labeled correctly. The results obtained are displayed in Fig. 2(d). We observe that adding check points drastically accelerates the convergence of accuracy. Using check points and relying on results from 10 crowdworkers leads to an even better result than considering 20 acquisitions without check points (see Fig. 2(b) vs. (d)). This holds true for all classes, with class Urban Furniture having the worst and class Car the best accuracy (the overall trend is identical to the first campaign). Class Urban Furniture is naturally difficult to interpret since it actually serves as class Other [18], which makes a unique class affiliation hard to determine. If we again accept a labeling OA of about \(90{\%}\) for training our classifier, 3 acquisitions are sufficient (2 fewer than in the first campaign). This significantly reduces costs for larger crowd campaigns in which many hundreds of points are to be annotated.

5.2 Performance of the AL Loop

Finally, we employ the CATEGORISE framework to conduct a complete AL loop for both the V3D and the H3D data set. For setting up the initial training set, random sampling is often pursued. Since this might lead to severe undersampling of underrepresented classes (such as Car), we instead launch a first crowd campaign in which a total of 100 crowdworkers are asked to select one point for each class. Since this kind of job cannot be checked directly (due to the lack of labeled reference data), we present the points selected by each crowdworker to another crowdworker for verification. Points tagged as false are discarded. For both data sets we conduct a total of 10 iteration steps, sample \(n_{AL}=300\) points in each step, and parametrize all RF models with 100 binary decision trees with a maximum depth of 18.
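The classifier configuration can be written down in a few lines; using scikit-learn here is an assumption of this sketch, since the Random Forest implementation is not named above:

```python
from sklearn.ensemble import RandomForestClassifier

def train_rf(features, labels):
    """RF parametrization as stated above: 100 trees, maximum depth 18.

    The use of scikit-learn is an assumption of this sketch; the paper does
    not name the Random Forest implementation it relies on.
    """
    rf = RandomForestClassifier(n_estimators=100, max_depth=18, n_jobs=-1)
    rf.fit(features, labels)  # crowd-provided labels
    return rf

# proba = train_rf(X_train, y_train).predict_proba(X_pool)  # p(c|x) for Eq. (1)
```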

Fig. 3. Results of the AL loop for both the V3D and the H3D data set. We compare the OA achieved by the crowd for labeling (top) to the mean F1-score of the classifier’s performance (middle). The dotted black line represents the baseline result of PL. As an intrinsic measure to describe training progress, we rely on the predictive congruence (bottom) evaluated for our crowd-based runs (using \(\mathcal {O}_C\)).

Performance of the Crowd within the AL Loop. Figure 3 (top row) visualizes the accuracy of crowd labeling (obtained by MV over 3 acquisitions using quality level passed 3 pts) throughout the iteration process, both when queried points are presented to the crowd oracle \(\mathcal {O}_C\) directly and when points are selected by the RIU-adapted query function. Please note that OA is used as quality metric since AL tends to sample points in an imbalanced manner w.r.t. class affiliation, so that the mean F1-score would only poorly represent the actual quality. In both cases (V3D & H3D), we observe a negative trend of the labeling accuracy. This is due to our AL setting, where we start with points that are easy to interpret (i.e., points freely chosen by the crowd), for which the crowd consequently yields top accuracies. From the first iteration step on, the selection of points is handled by the classifier. In the beginning it might select points which are also easy to interpret (since such points might simply be missing from the training set so far), and it then continues with points which are more and more special (and typically harder to interpret). However, RIU at least alleviates this problem. Using \(d_{RIU}=1.5\,\mathrm{m}\) leads to an OA which is about \(4\,\mathrm{pps}\) higher in case of V3D and up to \(6\,\mathrm{pps}\) higher for H3D. In case of the latter, this effect especially improves quality in later iteration steps, when points that are more complex to interpret are sampled.

Forming the training data set within AL (i.e., running a complete iteration) causes costs of \(190\,{\$}\) for the payment of the crowdworkers (\(100\,\mathrm{pts} \cdot 0.10\,{\$} + 100\,\mathrm{pts} \cdot 3\,\mathrm{rep.} \cdot 0.15\,{\$} + 10\,\mathrm{it.\,steps} \cdot (n_{AL}/10\,\mathrm{pts\,per\,job}) \cdot 3\,\mathrm{rep.} \cdot 0.15\,{\$}\)) and is completed in about 5 days (\(\text {approx.} \, 11\,\mathrm{h} \cdot 10\,\mathrm{it.\,steps} + \text {approx.} \, 16\,\mathrm{h} \, \text {for initialization} = 126\,\mathrm{h}\)). In comparison, the total expenses of Passive Learning (PL) are hard to estimate, since they depend on one hand on external factors such as the skills and salary of the annotator and the required software and hardware, and on the other hand on the complexity of the scene, the targeted class catalog, etc.
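For reference, the bracketed cost expression evaluates as follows (300 queried points per iteration step, organized in jobs of 10 points, 3 acquisitions per job):

$$\begin{aligned} 100 \cdot 0.10\,{\$} + 100 \cdot 3 \cdot 0.15\,{\$} + 10 \cdot \frac{300}{10} \cdot 3 \cdot 0.15\,{\$} = 10\,{\$} + 45\,{\$} + 135\,{\$} = 190\,{\$} \end{aligned}$$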

Table 1. Comparison of accuracies reached both for V3D and H3D for PL and various AL approaches using different oracle types and sampling functions.

Performance of the Machine within the AL Loop. Our RF classifier is trained on these crowd-provided labels. We compare the results of simulated AL runs using an omniscient oracle \(\mathcal {O}_O\) to the respective runs using a real crowd oracle \(\mathcal {O}_C\), each with and without RIU, and to the baseline result of PL. Figure 3 (middle row) & Table 1 present the accuracies achieved when predicting on the unknown and disjoint test set of each data set. For V3D, we achieve a result (\(AL_{RIU}(\mathcal {O}_C)\)) which differs by only about \(4\,\mathrm{pps}\) from the result of PL, both in OA and mean F1-score, while this gap is only about \(2\,\mathrm{pps}\) for H3D. In case of V3D, relying on \(AL_{RIU}(\mathcal {O}_C)\) results in a significantly steeper increase of the mean F1-score in early iteration steps compared to \(AL(\mathcal {O}_C)\) (which is beneficial when the labeling budget is limited) and performs close to the corresponding optimal AL run \(AL_{RIU}(\mathcal {O}_O)\). For the run without RIU, the performance of the RF is diminished, which is due to the lower labeling accuracy of the crowd. We would also like to underline the effectiveness of RIU even when labels are given by \(\mathcal {O}_O\), which demonstrates that this strategy also helps to obtain a more generalized training set by avoiding to sample only the most uncertain points. In case of H3D, all AL runs perform similarly well, with \(AL_{RIU}(\mathcal {O}_C)\) marginally outperforming \(AL(\mathcal {O}_C)\). For both data sets, Table 1 shows that classes with a high in-class variance such as Urban Furniture and Façade (note that façade furniture also belongs to this class) suffer in accuracy, whereas underrepresented classes such as Car yield better results than the PL baseline.

Terminating the AL Loop. As mentioned in Sect. 3.3, we need a criterion for deciding when to stop the iteration. The congruence values derived for this purpose are displayed in Fig. 3 (bottom row). For the initialization, congruence is 0 due to the lack of a previous prediction. Generally, the congruence curves correspond well to the test results of our AL iterations. For instance, consider the accuracy values of \(AL_{RIU}(\mathcal {O}_C)\) (V3D), which reach a close-to-final accuracy level in the third iteration step. Accordingly, from the fourth iteration step on (one step offset), the intrinsic congruence values \(C_o(AL_{RIU})\) & \(C_{ac}(AL_{RIU})\) have also reached a stable level. We would like to stress that AL runs with a close-to-linear increase in accuracy (\(AL(\mathcal {O}_C)\) for V3D and \(AL(\mathcal {O}_C)\)/\(AL_{RIU}(\mathcal {O}_C)\) for H3D) show a similar behavior in congruence. All congruence measures flatten with an increasing number of iteration steps. To avoid stopping too early, we set \(n_{stop}=5\) (Sect. 3.3), which indeed leads to std. dev. values close to 0 for \(AL_{RIU}(\mathcal {O}_C)\) (V3D); this run can therefore be stopped earlier than the other runs (\(0.4\,{\%}\) vs. ca. \(1.4\,{\%}\) for the other runs at it. step 10).

In case of H3D, the decrease of congruence in the second iteration step is noteworthy, indicating that the predictions have changed significantly. In other words, we assume that the newly added training points have a positive impact on the training process, which is rather unstable at this point, i.e., the iteration should not be stopped here at all. The explanation for this effect is that the labels of the initialization (iteration step 0) lead to a level of prediction which is slightly improved in iteration step 1 by adding unknown but comparably similar points (w.r.t. their representation in feature space), which were simply missing so far (in fact, mostly façade points were selected). The second iteration step then adds a greater variety of classes, so that accuracy increases significantly, causing changed class labels of many points and thus a drop in congruence. However, one inherent limitation of our intrinsic congruence measure is that it can only detect whether a stable state of training has been reached, which does not necessarily correspond to accuracy.

6 Conclusion

We have shown that the combination of crowdsourcing and AL allows ML models to be trained automatically by generating training data on the fly. The peculiarity of our CATEGORISE framework is that, although humans (i.e., crowdworkers) are involved in the process, the pipeline can be controlled just like any program, so that carrying out a labeling campaign can be considered a subroutine of this program. This requires automated quality control, automation of the crowd management, and an effective stopping criterion, all addressed within this work. Although the framework was designed for the annotation and semantic segmentation of 3D point clouds, it can be easily adapted to other categorization tasks (e.g., for imagery) by i) adapting the web tool used by crowdworkers (mainly concerning the data viewer) and ii) minor changes to the AL loop (as long as individual instances such as points or pixels are to be classified).