1 Introduction

1.1 Motivation and state-of-the-art

Quality control plays an important role in industrial production systems. When manual inspection of surfaces is applied [55], the quality often suffers from the human reviewer’s visual acuity, inconsistency of assessment criteria and concentration, as well as workplace lighting conditions, the time available for inspection and the provision of feedback or knowledge of results to the reviewer. This results in missed defects that may cause further repair costs in the production line. In this context, automated computer vision systems are promising due to their constant inspection quality and additional documentation of test results, which allows errors to be traced back to the production process.

While early surface inspection systems were based on predefined and fixed rules [17, 32], they have been evolving during the last 10–15 years into automated learning systems, usually based on machine learning (ML) techniques from the research field of classification. Instead of using a static classifier that is not adjustable to new error types, an automatic system tries to reproduce human cognitive abilities [45]. With the usage of ML techniques, the system learns to identify defective parts from a set of examples, typically feature vectors characterizing regions of interest [10], such as defect regions in the image [50], and thus the operator’s effort can be reduced [22, 26, 51, 54].

In general, ML classifiers require an extensive training database with full labels available. While the feature vectors can be measured and stored fully automatically during the online production process, the label information has to be obtained from human beings (operators, experts, users) during a post-processing phase. This is usually a highly involved task demanding several man-hours, especially in multi-class classification settings, as we are dealing with in this paper. For each class, a sufficient number of samples should be annotated to circumvent the curse of dimensionality effect during the classifier training stage. This is a well-known fact from statistics and ML, see e.g., [16, 25]. Still, after acquiring a comprehensive training set, new event classes may arise during the production process.

Furthermore, whenever a substantial modification in the system occurs or the system operational modes expand to unexplored states, the current ML classifier may become outdated over time or remain too conservative in its originally trained state. This usually deteriorates its predictive performance significantly [44]. Hence, incremental classifier updates have been presented within visual quality or surface inspection scenarios [11, 41]. The drawback of these approaches is that they do not focus on reducing the annotation effort, as they use the total number of available samples for classifier updates; this is unrealistic in most applications, and sometimes simply impossible. Additionally, using the classifier’s own predictions as ground-truth class labels for each update, in a kind of semi-supervised learning approach [12], also leads to a deterioration of its performance, especially when the predictions are wrong or highly uncertain [44].

By applying active learning (AL) algorithms [57], it is possible to selectively ask the operator to annotate interesting events that are most beneficial for classifier training and thereby reduce the annotation effort significantly. Following the pool-based approach, where samples are selected from a static pool of data, [31, 37] applied AL in the area of image classification and [18, 49] in a more general context. Pool-based active learning has recently been extended to learn from multiple crowd-sourced annotators [28, 38, 52], taking into account the inherent uncertainty of non-expert annotators. Pool-based methods are not applicable in our online data stream environment, where samples are processed in a single-pass manner and processing speed is critical. An early approach for single-pass AL was motivated by the memory consumption and processing time of large-scale support vector machine learning [5]. Two approaches restricted to linear binary classifiers are presented in [15, 56]. The problem of balancing exploration and exploitation in stream-based AL has been addressed in [14, 39], while [61] proposed a method for stream-based AL of graph-based models.

1.2 Our approach

In this work, we present an enhancement of a general surface inspection system. Instead of using a fixed trained classifier for the classification of defects, we propose an AL environment that supports the user in adapting the classifier to new data and in increasing its sensitivity (see Sect. 2 for the framework). The AL component is able to operate in a fully single-pass manner. This means that a single sample is loaded, classified, checked for its usefulness for an update and then immediately discarded. Our system goes beyond the state of the art and allows different sizes of the AL buffer. In this way, the operator is not disturbed by each single interesting (actively selected) sample, but can decide the time point at which to label the selected samples (by setting the buffer size to the most comfortable level). The concrete AL strategies are based on uncertainty-sampling techniques (see Sect. 4.1), which are specifically designed for two classification model architectures: random forests (RF) and evolving fuzzy classifiers (EFC). Both will be explained in Sect. 3 together with their learning and dynamic update engines, also respecting possible new upcoming classes defined by the operators during an AL labeling cycle.

Fig. 1 Overview of the proposed framework. The upper part of the image shows the process of a conventional inspection system, starting with the segmentation of defects on a surface (denoted by events), and the extraction of features for these events. A classifier then predicts the class label for each event and further passes it into a rating system where the classified events are evaluated with respect to quality criteria. In the lower part of the image, one can find the architecture of our proposed AL component that continuously trains and updates the classifier. This part will be studied in detail in this paper

The proposed framework will be evaluated on online data streams recorded from two real-world surface inspection systems for microfluidic chips and elevator sheaves, respectively (see Sect. 6), and benchmarked against the stream-based active learning algorithm proposed by Loy et al. [39]. The applicability and feasibility of our approach is confirmed by the following outcomes:

  1. High accuracy trend lines on multi-class problems including 6 and 11 classes (event types), respectively, can be achieved, while keeping the annotation effort at a low level (10 % max.).

  2. The sensitivity to the AL buffer size turns out to be negligible, opening a wide parametrization range for the user.

  3. The integration of new classes on-the-fly turns out to be fast (steeply rising performance curves), efficient (low number of samples used) and robust (not affecting the performance on the already established classes).

  4. On the surface inspection datasets, the proposed architecture clearly outperforms the benchmark.

Furthermore, when using only about 10 % of the samples, the accuracy is only slightly lower than with a full update on all streaming samples. In addition, static classifiers, where no update takes place at all, drop significantly in their performance over time.

2 Our framework

We propose a new framework, similar to [22, 26, 51, 54], which expands the usual setting of an industrial inspection system that uses a fixed and static classifier. As illustrated in the upper part of Fig. 1, a conventional system typically consists of an event detection system that segments errors on an inspected surface and extracts features from them. Afterwards, these features are passed into a classifier that predicts a class label pointing to the corresponding event type. Usually, these are strict multi-class classification tasks with a significant number of classes. In general, the classifier is trained on a fixed set of possible defect types that arise during production and will not be changed throughout the inspection process. This set has to be labeled by one or more operators [45], which obviously results in high costs and effort in annotation cycles. The retrieved class labels (or class probabilities) are then evaluated by a separate system that decides upon the severity of the detected errors by providing a definitive feedback whether the inspected part/item is “OK” or “NOT OK” (binary classification). Usually, this is performed based on so-called aggregated features [22, 26, 54]. This procedure is denoted as the inspection system layer and its performance strongly depends on the performance of the single event classifier.

In online versions of industrial inspection systems, we are dealing with sequentially arriving data streams where events are individually passed to the classification architecture. As a result, the inspection system is required to perform fast decisions on the class label of individual events. In the majority of cases, this can be achieved within milliseconds when using today’s conventional classifiers established with ML and data mining methodologies. However, the complexity and dynamics of today’s real-world installations or systems are increasing. Thus, classifiers initially established within a batch off-line design phase may quickly become outdated since they rely on a historic training database or expert knowledge about the process. Keeping the classifiers static usually leads to a significant downtrend in classification accuracy, as reported in [41] with up to 20 % decreased accuracy. Therefore, in contrast to those common inspection systems, it is beneficial or often even necessary to update the classifiers with newly acquired samples.

In this work, we introduce an AL component into the system to accomplish a proper update with as low effort for operators as possible. This is illustrated in the lower part of Fig. 1, and is referred to as AL layer. In the following, we will perform a compact summary of the functionality of each component. Classifier learning and update as well as user interaction within AL cycles will be explained in more detail in Sects. 3 and 4.

Classifier This is the core part of the inspection system. It integrates a decision support module (the “Classifier”) which is able to characterize the events appearing on new incoming items as recorded by a camera. The characterization yields a sorting of the events into so-called event types, which can be directly associated with classes in the sense of ML classification tasks. This is feasible as the events are independent from each other and do not follow a quantitative ordering. Thus, assigning integer indices (from 1 to K) as class labels to the events and applying ML concepts is a valid procedure.

The classifier is initially trained on a small data set with given class labels. This initial data set needs to be collected from the process and annotated by an expert. A small training set results in shorter labeling times for the experts which causes a higher acceptance rate among the operators. At the same time, the quality of the classifier may suffer especially on new online samples lying further in the future. This is especially the case when the dimensionality of the feature space is high due to the curse of dimensionality effect in sparsely populated feature spaces [24, 25].

AL filter Depending on the strategy, the AL filter continuously evaluates the incoming samples from the online data stream of a sequential production system. It selects the most informative ones among them using the current classifier structure and the feature vector extracted from the actual event. This allows the system to decide whether a sample is helpful or even necessary for keeping the accuracy of the classifier on the same or even an increased level as after the initial setup. Usually, such samples lie close to the decision boundaries or in sparsely populated regions. Furthermore, the AL filter has to operate in a single-pass, sample-wise manner, which rules out the usage of conventional state-of-the-art pool-based methods. In the results section, we will see that by applying our single-pass selection strategies, we are able to achieve increasing classifier accuracy trend lines over time while selecting only a small portion of samples from the whole stream. Finally, AL can be seen as a kind of pre-filtering stage for the buffer that stores the interesting samples that need to be addressed and further annotated by the user. Parameters in the AL filter allow the user to select more or fewer samples, i.e., to make the AL filter more restrictive or more generous, and thus to adapt it to his or her convenience.

Buffer The buffer collects samples that are selected by the AL filter. The size of the buffer controls how often the user has to interact with the system, which also determines how often he or she may be “disturbed” by the system during his or her daily work. The smaller the buffer becomes, the more often the system needs manual input from a user. It is necessary to find a trade-off between (1) a too small buffer size: this results in short time intervals between required interactions and keeps the user from his or her other work; and (2) a too large one: this ends up in fewer interruptions; however, there will be long time periods for and between manual label assignments; this also means that the classifier may not be updated for a long time, causing a huge update delay and the risk that the classifier is severely affected in its performance. In our experiments, we will show the sensitivity of the classifier accuracy when changing the buffer size within a range of 1–50 samples.

User interaction As soon as the buffer reaches its predefined maximum size, the user is required to provide feedback to the collected samples. For this purpose, the user sees a preview of the samples and possibly additional information, e.g., where the defect occurred on the object, the controversial class label, classifier certainties in predictions, image of the small surrounding area of the event, etc. The user then assigns a class label, or may even introduce a new class label into the classifier. The latter denotes a specific case and has to be handled with care in the classifier updates engines to guarantee proper integration of new classes, see Sect. 3. During the manual interaction, the classifier may continuously provide predictions for the remaining samples’ labels which can facilitate and shorten the user interaction.

Classifier update Depending on the user’s decisions, the classifier is trained on the new input data. This can either be (1) a full re-training of the classifier based on a combination of the initial training set and the newly labeled samples, or (2) an update of the classifier in case we are dealing with incremental classification algorithms. The advantage of the first variant is that a re-parametrization of the most influential learning parameters may take place to always guarantee batch-optimized parameters. Usually, this results in a highly accurate classifier. On the other hand, this may slow down the whole learning process and in some cases may not terminate in real time. This is an issue which can be overcome by the second type of learning methods which are able to update classifiers incrementally based on single samples. Furthermore, an essential property of incremental single-pass training is that it is linear in terms of the number of samples [4].

Due to the different strengths of re-training versus incremental learning, we propose two classification approaches in our framework: RF and EFC. Our RF implementation operates fully in batch (re-trained) mode, and is based on the original version by [8]. Of course, there are other implementations available, such as [53], but they have not been evaluated in our framework due to the already fast training and prediction time of common RF. EFC are able to incrementally expand and shorten the classification rule base, sample-wise and in single-pass manner [1, 43]. The motivation behind the choice of these two approaches is further underlined by the fact that RF are an extension of decision trees for more robustness due to the exploration of the sample space in a kind of bootstrapping manner. They thus establish an ensemble of trees which are fused into a final decision on new query samples. In this sense, RF are known to be among the top performers in classification problems [60]. Fuzzy classifiers are able to provide some sort of interpretability in the form of IF-THEN rules [42, 46]. This may be an important aspect when intending to gain insight into the class representations or to change or define rules for new classes. Furthermore, fuzzy classifiers can be reliably equipped with an incremental, evolving learning engine due to the association of rules with local regions of the feature space (see Sect. 3.2). Both classifier architectures allow the application of certainty-based classification concepts, which is necessary for the AL component.

In the following, we will concentrate on all parts concerning the classification system, which support a generic usability in many inspection systems, regardless of the type of input data. Application-specific tasks including event segmentation, feature extraction and fault class evaluation are out of scope of this paper.

3 Classifier training and dynamic updating

In this section, we provide a short summary of the classifiers and learning engines we used within our AL framework: conventional batch-trained RF (Sect. 3.1) and incrementally trained and evolvable fuzzy classifiers (Sect. 3.2). We will explain the basic concepts to provide a better understanding of the single-pass AL topologies developed and employed for these two types of classifiers.

3.1 Random forests

The basic concept on which RF build [8] is the generation of an ensemble of decision trees from randomized sample distributions which are created by the bootstrapping method [20, 21]. Decision trees, introduced by Breiman et al. [9], are able to decompose the feature space into several conditions coded in tree nodes and having the form \(x_i<C\), with C a constant within the feature range.

3.1.1 Construction and update

The induction of a tree is based on an iterative search for the best binary feature splits in the input space. In each iteration, a split criterion is sought in each node along each feature separately which achieves the highest information gain. The latter is measured in terms of the impurity change from the parent T to the left child \(\textit{TL}\) and right child node \({ TR}\), respectively. It is denoted by \(W(T) = I(T) - I(\textit{TL}) - I(\textit{TR})\), where the impurity I is calculated per default through the Gini diversity index [9]:

$$\begin{aligned} I(T) = \sum _{i\ne j} p_ip_j \end{aligned}$$
(1)

with \(p_i\) the relative proportion of samples falling into class i. If all samples fall into the same class, the impurity is 0. Other impurity measures are suggested in the literature [48], such as the misclassification error and the deviance. The split inducing the highest change in impurity is used and the iteration continues to further expand the tree, until a termination criterion is fulfilled.

Based on m bootstrap replications \(B_1,B_2,\ldots ,B_m\) containing n samples, m trees are constructed based on \(n_1\approx \frac{2n}{3}\) subset samples and the remaining \(n_2=n-n_1\) samples are used as out-of-bag data, which are further applied for tree impurity elicitation and out-of-bag error calculation. The basic difference in the tree construction is that at each node only a random subset of variables is used as candidates in the inequality criterion of the node: in case of classification, \(\sqrt{p}\) of the p variables are taken into account for that purpose. Variable importance scores are computed in addition to the forest construction. Summarizing these aspects, the following algorithm is obtained for RF construction in batch mode [8]:

  1. Start with \(b=1\), the first tree.

  2. Extract bootstrap sample \(B_b\) by drawing samples from the whole data set with replacement.

  3. Denote \(n_1\approx \frac{2n}{3}\) samples used for tree construction and \(n_2=n-n_1\) samples used as out-of-bag samples.

  4. Construct the tree \(T_b\) using the decision tree algorithm after [9] on the \(n_1\) samples (see also above).

  5. During construction, use only \(\sqrt{p}\) randomly pre-selected variables as candidates for \(x_i\) in the node criteria.

  6. Calculate the impurity of the whole tree, \(\pi _b\), by summing up the impurities of all leaves based on the \(n_2\) out-of-bag data.

  7. Permute variable \(x_j\) for all \(j=1,\ldots ,p\) and calculate the tree impurity \(\pi _{bj}\).

  8. Define the variable importance of all \(j=1,\ldots ,p\) variables by \(\delta _{bj} = \pi _{bj}-\pi _b\).

  9. Repeat Steps 2–8 above with \(b=2,\ldots ,m\) and store \(\delta _{1j},\delta _{2j},\ldots ,\delta _{mj}\) for all j.

  10. Calculate the overall variable importance score:

    $$\begin{aligned} \Phi _j = \frac{1}{m} \sum _{b=1}^{m} \delta _{bj} \end{aligned}$$
    (2)

Additionally, we calculate the out-of-bag error of each generated tree using the corresponding \(n_2\) out-of-bag samples. The average over this error is equal to the misclassification rate of the whole RF. The nice feature of the out-of-bag error is that it is undistorted [8], as opposed to conventional cross-validation, which may be distorted with unknown bias [58]. Furthermore, another important feature of RF is that it belongs to the meta learning algorithms which explore the data space [7], performing a kind of implicit noise reduction through bootstrapping, as is the case for most ensemble classification variants [35]. Compared to a single decision tree, which has a high variance such that a wrong decision in an “upper” node intrinsically affects all nodes further down, RF weakens the likelihood of this effect.
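For illustration, the following minimal sketch mimics this construction scheme with scikit-learn (which we use here purely for demonstration, not as the original system's implementation): a forest of m trees grown on bootstrap replications with \(\sqrt{p}\) candidate features per split, the out-of-bag error estimate, and permutation-based variable importance as a proxy for steps 7–10 above. The data arrays X and y are hypothetical placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Hypothetical placeholder data: n = 500 samples, p = 24 features, 6 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 24))
y = rng.integers(0, 6, size=500)

rf = RandomForestClassifier(
    n_estimators=100,      # m bootstrap replications / trees
    criterion="gini",      # Gini diversity index as impurity measure, cf. (1)
    max_features="sqrt",   # sqrt(p) randomly pre-selected variables per node, cf. step 5
    bootstrap=True,        # sampling with replacement, cf. step 2
    oob_score=True,        # out-of-bag error estimate, cf. step 6
    random_state=0,
)
rf.fit(X, y)
print("out-of-bag accuracy:", rf.oob_score_)

# Permutation-based variable importance as a proxy for steps 7-10 and (2)
imp = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
print("variable importance scores:", imp.importances_mean)
```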

For updating the RF model, we perform a simple re-training of the model based on the previously learned samples as well as the new information coming from the AL buffer. This means that, after an initial training on N samples, every time i the buffer reaches its maximum size \(S_i\), we add the new samples to the learning base. This results in a growing learning base with \(N + S_1 + S_2 + \cdots + S_i\) samples after the ith time the buffer is full. Thus, RF always sees the full information contained in the stream so far, but on the other hand its runtime performance is expected to decrease over time as the whole training database grows—an issue which can be overcome using an online version of RF [53], or the EFC concept (see Sect. 3.2), which is able to conduct its update cycles based on single samples, thus linear in time.
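A sketch of this periodic re-training on a growing learning base; the class and method names are our own illustration (again using scikit-learn only as a stand-in for the actual RF implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class RetrainedRF:
    """Keeps the full labeled history and re-trains whenever the AL buffer is flushed."""

    def __init__(self, X_init, y_init):
        self.X = np.asarray(X_init)
        self.y = np.asarray(y_init)
        self.model = RandomForestClassifier(n_estimators=100, max_features="sqrt")
        self.model.fit(self.X, self.y)

    def update(self, X_buffer, y_buffer):
        # learning base grows to N + S_1 + ... + S_i samples after the ith flush
        self.X = np.vstack([self.X, np.asarray(X_buffer)])
        self.y = np.concatenate([self.y, np.asarray(y_buffer)])
        self.model.fit(self.X, self.y)  # full batch re-training on all samples seen so far
```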

3.1.2 Classification of new query samples

The classification stage of an RF is quite straightforward: each query sample is passed down all trees, and each tree votes for a class label. The final class label is then assigned by taking the class predictions of all trees, \(L_1,\ldots ,L_m\), and choosing the final classification response as the majority vote L over all K classes:

$$\begin{aligned} L = \mathrm{argmax}_{k=1,\ldots ,K} |\{L_b = k | b=1,\ldots m\}| \end{aligned}$$
(3)

where m is the number of trees in the model. By dividing the votes by the number of all trees, we gain a measure of how probable a class label is for a query sample. This will be used for the uncertainty-based AL scheme as discussed in Sect. 4.

3.2 Evolving fuzzy classifiers

We apply a recent extension for the multi-class classification case, termed the all-pairs architecture, which decomposes the class space into binary sub-spaces and thus is able to induce simpler decision boundaries between class pairs [43]. This decreases the severity of class imbalance and increases the likelihood that new classes are integrated faster and more robustly. The term evolving stems from the fact that the fuzzy classifiers are equipped with a learning engine which is able to expand its rule base and also recursively adapt its parameters, as will be explained in Sect. 3.2.3.

3.2.1 The architectures for fuzzy classifiers

The classifier structure of all-pairs learning in multi-class classification scenarios is based on a decomposition of the whole problem into several binary sub-problems [43]. Formally, this can be expressed by a classifier \(\mathbb {C}_{k,l}\) which is induced by a training procedure \(\mathbb {T}_{k,l}\) when using (only) the class samples belonging to classes k and l:

$$\begin{aligned}&\mathbb {C}_{k,l}\longleftarrow \mathbb {T}_{k,l}(X_{k,l}) \nonumber \\&X_{k,l}=\{\mathbf {x}|L(\mathbf {x}) = k \vee L(\mathbf {x})=l\} \end{aligned}$$
(4)

with \(L(\mathbf {x})\) the class label associated with feature vector \(\mathbf {x}\). This means that \(\mathbb {C}_{k,l}\) is a binary classifier for separating samples belonging to class k from those belonging to class l.

Each \(\mathbb {C}_{k,l}\) follows the extended classification structure by including the confidence levels of all classes in each rule. This yields the following definition of rules [6]:

$$\begin{aligned} \text {Rule}_i{:}\; \text {IF } x_1 \text { IS } \mu _{i1} \text { AND } \ldots \text { AND } x_p \text { IS } \mu _{ip} \text { THEN } l_i = [\mathrm{conf}_{i1}, \mathrm{conf}_{i2}, \ldots , \mathrm{conf}_{iK}] \end{aligned}$$
(5)

where in our case of binary partial classifiers \(K=2\). Thus, a local region represented by a rule in the form of (5) can better model class overlaps in the corresponding part of the feature space. A geometric interpretation of rules modeling two overlapping classes and classes spread in various local regions over the feature space is presented in Fig. 2 for three classes. In this figure, we also integrate the less complex, linear decision boundaries induced by the all-pairs structure on the three classes as dotted lines.

Fig. 2 Three rules covering three classes; the left one shows the overlapping case with the confidences in the consequent class labels, the other two are clean rules containing only samples from one class each; the decision boundary induced by classical rules as in (5) is shown as a solid line, the decomposed decision boundaries between class pairs as dotted lines

When classifying a new sample \(\mathbf {x}\), each classifier outputs a confidence level \(\mathrm{conf}_{k,l}\) which denotes the degree of preference of class k over class l for this sample. This degree lies in [0, 1] where 0 means no preference, i.e., a crisp vote for class l, and 1 means a full preference, i.e., a crisp vote for class k. This is conducted for each pair of classes and stored into a preference relation matrix R:

$$\begin{aligned} R = \left[ \begin{array}{c@{\quad }c@{\quad }c@{\quad }c@{\quad }c} 0 &{} \mathrm{conf}_{1,2} &{} \mathrm{conf}_{1,3} &{} \ldots &{} \mathrm{conf}_{1,K} \\ \mathrm{conf}_{2,1} &{} 0 &{} \mathrm{conf}_{2,3} &{} \ldots &{} \mathrm{conf}_{2,K} \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \mathrm{conf}_{K,1} &{} \mathrm{conf}_{K,2} &{} \mathrm{conf}_{K,3} &{} \ldots &{} 0 \end{array} \right] \end{aligned}$$
(6)

If we assume reciprocal preferences, i.e., \(\mathrm{conf}_{k,l} = 1-\mathrm{conf}_{l,k}\), then we can omit the training of half of the classifiers, hence finally obtain \(\frac{K(K-1)}{2}\) binary classifiers.

Compared to the classical architecture, this architecture is able to decompose a complicated multi-class problem into several much simpler binary sub-problems. Due to this trick, usually the decision boundaries get simpler, often even linear; also the likelihood to suffer from imbalanced problems is diminished, such that an increased accuracy can be expected [23, 30, 43], especially for under-represented classes (which is the case for new classes at their origin).
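A compact sketch of this all-pairs decomposition is given below; a generic probabilistic base learner stands in here for the fuzzy partial classifiers \(\mathbb {C}_{k,l}\), and its predicted probability of class k is used as the preference \(\mathrm{conf}_{k,l}\):

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def train_all_pairs(X, y, base=LogisticRegression):
    """One binary classifier C_{k,l} per class pair, trained only on samples of classes k and l."""
    classifiers = {}
    for k, l in combinations(np.unique(y), 2):
        mask = (y == k) | (y == l)
        clf = base().fit(X[mask], (y[mask] == k).astype(int))  # 1 = class k, 0 = class l
        classifiers[(k, l)] = clf
    return classifiers

def preference_matrix(classifiers, x, classes):
    """Preference relation matrix R, cf. (6), assuming conf_{l,k} = 1 - conf_{k,l}."""
    idx = {c: i for i, c in enumerate(classes)}
    R = np.zeros((len(classes), len(classes)))
    for (k, l), clf in classifiers.items():
        conf_kl = clf.predict_proba(x.reshape(1, -1))[0, 1]  # preference of class k over class l
        R[idx[k], idx[l]] = conf_kl
        R[idx[l], idx[k]] = 1.0 - conf_kl
    return R
```

The final class response is then obtained from the row-wise scores of R, as discussed in Sect. 3.2.2.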

3.2.2 Online classification stage

In a winner-takes-all context (the most common choice in fuzzy classifiers [34]), the classifier output L for each binary fuzzy classifier architecture is obtained by

$$\begin{aligned} L= & {} l_{i^*} \nonumber \\ l_{i^*}= & {} \text {argmax}_{1\le k \le K} \mathrm{conf}_{i^*k} \nonumber \\ i^*= & {} \text {argmax}_{1\le i \le C} \mu _{i}(\mathbf {x}) \end{aligned}$$
(7)

where \(\mu _{i}(\mathbf {x})\) is the membership fulfillment degree of the ith rule, i.e., when using product t-norm to achieve classical multivariate Gaussians in the high-dimensional space (characterizing the ellipsoidal contours of the rules), \(\mu _{i}(\mathbf {x}) = \prod _{j=1}^p \mu _{ij}\). However, we will focus on generalized rules being able to model local correlations between features due to their rotational behavior (as done for stream regression problems in [47]), thus \(\mu _{i}(\mathbf {x}) = \mathrm{exp} (-\frac{1}{2} (\mathbf {x}-\mathbf {c}_i)^\mathrm{T}\Sigma _i^{-1}(\mathbf {x}-\mathbf {c}_i))\) with \(\mathbf {c}_i\) the center and \(\Sigma _i^{-1}\) the inverse covariance matrix.
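A short sketch of this generalized rule membership (the rule center and inverse covariance matrix are assumed to be maintained by the learning engine described in Sect. 3.2.3):

```python
import numpy as np

def rule_membership(x, center, sigma_inv):
    """Generalized Gaussian membership mu_i(x) = exp(-0.5 (x - c_i)^T Sigma_i^{-1} (x - c_i))."""
    d = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return float(np.exp(-0.5 * d @ sigma_inv @ d))
```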

In a more enhanced (weighted) scheme [43], the degree of purity is respected as well and integrated into the calculation of the final classification response L:

$$\begin{aligned} L= & {} \text {argmax}_{k=m,m*} \nonumber \\&\left( \mathrm{conf}_k = \frac{\mu _{1}(\mathbf {x})h*_{1,k}+\mu _{2}(\mathbf {x})h*_{2,k}}{\mu _{1}(\mathbf {x})+\mu _{2}(\mathbf {x})}\right) \end{aligned}$$
(8)

with

$$\begin{aligned} h*_{1,k} = \frac{h_{1,k}}{h_{1,m}+h_{1,m*}} \quad h*_{2,k} = \frac{h_{2,k}}{h_{2,m}+h_{2,m*}} \end{aligned}$$
(9)

and \(h_{i,k}\) the class frequency of class k in rule i, and \(\mu _1(\mathbf {x})\) the membership degree of the nearest rule (with majority class m), and \(\mu _2(\mathbf {x})\) the membership degree of the second nearest rule with a different majority class \(m\ne m*\). This difference is important as two nearby lying rules with the same majority class label do not induce a decision boundary in-between them. Equation (8) induces a shift of the decision boundary towards the less purified rule, enlarging the decision range of the more purified rule (i.e., where the relative class situation is more clear).

In case of the all-pairs approach, the default classification scheme returns the class with the highest overall preference over all other classes as the classifier’s response:

$$\begin{aligned} L= & {} \mathrm{argmax}_{k=1,\ldots ,K} (\mathrm{score}_k) \nonumber \\= & {} \mathrm{argmax}_{k=1,\ldots ,K} \left( \sum _{K\ge i\ge 1} \mathrm{conf}_{k,i}\right) \end{aligned}$$
(10)

but there are also other, more complex possibilities, see e.g. [2] for specific voting schemes on fuzzy preference relation matrices or [43] for the integration of ignorance degrees in the preference relation entries \(\mathrm{conf}_{k,i}\).

3.2.3 Incremental training and evolving methodology

The incremental learning takes place in the cluster space (one cluster is associated with one rule) and follows the same concept for the direct multi-class classifiers and for the binary classifiers within the all-pairs approach. We assume a new incoming sample \((\mathbf {x},y)\) from the AL cycle, with \(\mathbf {x}=\{x_1, x_2,\ldots ,x_p\}\) (see Sect. 4.1) containing the p input feature values and y the class label assigned by the operator/expert—if several samples have been annotated, we split the buffer into single samples and proceed as explained below for each sample separately and independently.

Starting with an empty rule base, a new rule/cluster is evolved whenever its statistical tolerance region in the multi-dimensional space is exceeded by a new incoming sample \(\mathbf {x}\) in the input feature space. The prediction interval for multivariate Gaussians, defined by center \(\mathbf {c}\) and covariance matrix \(\Sigma \), serves as such tolerance region and is defined by [33]:

$$\begin{aligned} \sqrt{(\mathbf {x}-\mathbf {c})^\mathrm{T} \Sigma ^{-1} (\mathbf {x}-\mathbf {c})} \le \chi _p^2(\alpha ) \end{aligned}$$
(11)

with p the degrees of freedom and \(\alpha \) the significance level of the quantile function of the \(\chi ^2\) distribution. Note that when using the classical t-norm operator, \(\Sigma ^{-1}\) reduces to a diagonal matrix and thus the left-hand side of (11) to the Euclidean distance. Hence, the conventional t-norm case for composing rules in the binary classifiers is also covered in our learning engine.

To assure variability of the prediction interval size and to expand it for clusters (rules) with low densities, we arrive at the following rule evolution criterion:

$$\begin{aligned} \min _{i=1,\ldots ,C} \sqrt{((\mathbf {x}-\mathbf {c}_i)^\mathrm{T} \Sigma _i^{-1} (\mathbf {x}-\mathbf {c}_i))} > r_\mathrm{win} \end{aligned}$$
(12)

with

$$\begin{aligned} r_\mathrm{win} = \mathrm{fac}*p^{1/\sqrt{2}}*\frac{1.0}{\left( 1-1/(k_{\text {win}}+1)\right) ^m} \end{aligned}$$
(13)

the tolerance radius for the nearest (= winning) cluster, nearest in the sense of the Mahalanobis distance as occurring on the left-hand side of (11). Equation (13) respects the dimensionality of the feature space p: the higher it is, the larger the tolerance region for evolving new clusters becomes. This dependency is important to reduce the curse of dimensionality effect. The exponent approximates the traditional quantile function of the \(\chi ^2\) distribution with p degrees of freedom quite well.

If (12) is fulfilled, a new cluster is evolved by setting its center to the sample \(\mathbf {c}_{C+1}=\mathbf {x}\) and initializing its inverse covariance matrix to

$$\begin{aligned} \Sigma _{C+1}^{-1} = \frac{\sum _{i=1}^C \Sigma _{i}^{-1}}{C} \end{aligned}$$
(14)

If (12) is not fulfilled, the nearest cluster is updated by moving its center towards the current sample \(\mathbf {x}\) by a small fraction (according to the evolving vector quantization concept [40]) and updating its inverse covariance matrix \(\Sigma _\mathrm{win}^{-1}\) through a recursive formula including rank-1 modification for increased stability.
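The following sketch summarizes the rule evolution and update step under stated simplifications: the learning rate eta, the exponent m in (13) and the omitted rank-1 covariance modification are illustrative assumptions, not the exact formulas of the actual engine.

```python
import numpy as np

def evolve_or_update(x, centers, sigma_invs, counts, fac=1.0, m=2.0, eta=0.1):
    """Single-pass rule evolution, cf. (12)-(14); centers, sigma_invs and counts are parallel lists."""
    x = np.asarray(x, dtype=float)
    p = x.shape[0]
    if not centers:  # first sample: create the initial rule
        centers.append(x.copy()); sigma_invs.append(np.eye(p)); counts.append(1)
        return
    # Mahalanobis distance to every existing rule (cluster), cf. left-hand side of (12)
    dists = [np.sqrt((x - c) @ S @ (x - c)) for c, S in zip(centers, sigma_invs)]
    win = int(np.argmin(dists))
    # tolerance radius of the winning cluster, cf. (13); m is a hypothetical exponent
    r_win = fac * p ** (1.0 / np.sqrt(2.0)) / (1.0 - 1.0 / (counts[win] + 1)) ** m
    if dists[win] > r_win:
        # evolve a new rule: center = sample, inverse covariance = mean of existing ones, cf. (14)
        centers.append(x.copy())
        sigma_invs.append(sum(sigma_invs) / len(sigma_invs))
        counts.append(1)
    else:
        # update the winning rule (evolving vector quantization step); the recursive rank-1
        # update of Sigma_win^{-1} is omitted here for brevity
        centers[win] = centers[win] + eta * (x - centers[win])
        counts[win] += 1
```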

As we also have the true class label y available, we use it to learn the consequents of the fuzzy rules in each binary classifier as defined in (5). Thereby, we introduce a so-called hit matrix H, whose entry \(h_{ij}\) denotes the number of samples falling into cluster (or rule) i and belonging to class j. Then, the relative frequencies of classes falling into cluster i can be computed as:

$$\begin{aligned} \mathrm{conf}_{ik} = \frac{h_{ik}}{\sum _{j=1}^K h_{ij}} \end{aligned}$$
(15)

and directly assigned as confidence levels appearing in the consequents of the ith rule. Each local hit matrix contains two columns per class pair. This induces smaller and much lighter hit matrices than when using a single model architecture without class decomposition, and thus also decreases the likelihood that new classes represented by only a few samples after their origin are “overwhelmed” by older classes; new classes are therefore expected to be learnt more quickly, i.e., correctly returned earlier, than by the direct multi-class architecture.
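A toy sketch of this bookkeeping (the hit matrix H is assumed to hold one row per rule and one column per class of the respective binary classifier):

```python
import numpy as np

def update_consequents(H, rule_idx, class_idx):
    """H[i, j]: number of samples falling into rule i and belonging to class j."""
    H[rule_idx, class_idx] += 1
    # relative class frequencies per rule become the consequent confidences, cf. (15)
    return H / H.sum(axis=1, keepdims=True).clip(min=1)
```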

In particular, whenever a new class is defined by the user, the all-pairs architecture needs to be expanded by K binary classifiers between the new class \(K+1\) and all older classes \(k=1,\ldots ,K\). Assume that M samples from a new class (assigned the index \(K+1\)) are available, \(\mathrm{New}=\{\mathbf {x}_1,\ldots ,\mathbf {x}_M\}\), e.g. as annotated by an operator with the usage of our AL framework. Then, K new classifiers, namely \(\mathbb {C}_{l,K+1}\) with \(l=1,\ldots ,K\), have to be newly established pointing to the preference of each class l over the new class \(K+1\)—note that then automatically \(\mathrm{conf}_{K+1,l} = 1-\mathrm{conf}_{l,K+1}\), introducing the preference scores of the new class \(K+1\) over all other classes as the last row in (6). This means that some (annotated) past samples Past from each class have to be kept in memory:

$$\begin{aligned} \mathrm{Past} = \{\mathbf {x}_i \mid (i \ll N) \wedge (\{y_i\}\cap \{1,\ldots ,K\} \ne \{\}) \wedge \forall k\, (|\{y_i = k\}|\approx M) \} \end{aligned}$$
(16)

with N the number of samples seen so far. The size of Past is ideally \(K*M\) with equally balanced samples over the classes [the last condition in (16)], leading to a balanced initialization of the binary classifiers \(\mathbb {C}_{l,K+1}\). Furthermore, we extend the rule evolution criterion as defined in (12) by

$$\begin{aligned} \left( \min _{i=1,\ldots ,C} \sqrt{(\mathbf {x}-\mathbf {c}_i)^\mathrm{T} \Sigma ^{-1} (\mathbf {x}-\mathbf {c}_i)} > r_i \right) \text { OR } (y = K+1) \text { OR } \exists _{k=1,\ldots ,K} \left( \sum _{c=1,\ldots ,C} h_{ck} = 0\right) \end{aligned}$$
(17)

with y the real class label in sample \(\mathbf {x}\) annotated by the operator within the AL buffer. This assures that some rules will represent the new class and some others the other classes in each \(\mathbb {C}_{l,K+1}\).

Remarkably, all binary classifiers already established on the older classes remain untouched when a new class is introduced; their output behavior therefore remains essentially unchanged, which increases the robustness of the overall classifier against the introduction of a new class.

4 Active learning topology

In our proposed system, we integrate the technology of AL into an inspection system. This is a necessary component since labeling each new incoming event is tedious and time-consuming due to the high effort and disturbance in the operator’s daily work. In most cases, i.e., production lines where the throughput of items is in the range of several seconds, a complete annotation of online data is simply not possible at all. Therefore, the algorithm in the framework has to select those samples which are intrinsically important for keeping the accuracy of the classifier on a high or even increasing level. Usually, these samples contain novel information for the classifier upon which it should be expanded or refined to sharpen its decision boundary. The automatic selection can be achieved via AL approaches [57]. In the literature, there are different query strategies to choose the relevant or interesting samples, which can basically be divided into certainty-based, density-based and committee-based strategies, including various hybrids of these, see e.g., [18]. However, most of these are pool-based methods, meaning that they iterate over the whole reference database from which samples are selected, always performing a re-training of the classifier to cause an improvement in accuracy [13, 27]. Such a procedure is not possible in a single-pass online setting where no pool of samples is available. Therefore, the method has to decide in a quick manner which samples to select.

Since we are dealing with a stream-based approach in contrast to pool-based, each sample is passed individually to the classifier. Therefore, for each individual event it is beneficial to immediately decide between the following cases:

  1. The event’s class label can clearly be retrieved by the classifier without additional effort.

  2. The prediction of the class label is ambiguous and needs special attention.

The first case can easily be resolved by simply predicting the event’s class label and passing it on to the evaluation algorithm. The latter is depicted as last step in the inspection system layer of Fig. 1.

In contrast to this, events belonging to the second type have a high potential of bearing information that can be used to refine and improve the current classifier. Thus, before making a decision, these samples are collected in a buffer. This assures that the operator is not disturbed on single samples. After the buffer reaches its predefined size, the operator is asked to supply class label information for all events in the buffer. Then, since the system is provided with new information, i.e., input samples of the form \(\mathbf {x}=\{x_1, x_2,\ldots ,x_p\}\) with class label y assigned by the operator, the classifier is updated (in case of EFC), respectively, re-trained (in case of RF), as explained in Sect. 3.

In the following, we provide the concrete sample selection concepts which are used in combination with the two architectures.

4.1 Active learning with single-pass certainty-based sampling

Certainty-based sampling is one of the most prominent techniques for batch or pool-based AL [36, 59]. It relies on the idea of improving the classifier’s certainty for future predictions. For this purpose, samples are mainly selected (1) for which the classifier has been most uncertain in its predictions, e.g., due to high class overlaps, (2) which lie close to the decision boundary, or (3) which represent significant extrapolation cases. In the classical pool-based version, after having established an initial classifier, the classifier’s certainties in the predictions on unlabeled samples from a pool are measured. According to the certainty levels, the samples are ranked and the highly ranked samples are finally chosen. The certainties in the predictions can be regarded as confidence levels in deciding on the proposed class label. They can be computed depending on the internal structure and composition of the final classification output of the selected classifier. This is the case for both RF and EFC.

By measuring these certainties separately and independently for the different incoming samples, we are able to adopt this strategy for the online single-pass case. Thus, instead of an extensive and time-intensive ranking over a sliding window (as e.g., conducted in [5]), we introduce a threshold parameter \(c_{\mathrm{thresh}}\) that controls the degree of uncertainty allowed, in such a way that highly certain samples are not selected. Due to different settings of the threshold, we are able to obtain a different percentage of selected samples used for classifier updates. We are then also able to study the expected sensitivity of the accuracy trends over time for the different settings. The ideal setting, obtained during off-line simulation, is a trade-off between accuracy and the number of samples used for updating, and is expected to be most useful within the online installation environment.
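A schematic view of this single-pass selection and buffering is sketched below; classifier, certainty_fn and annotate_fn are placeholders for the concrete model, its certainty measure (Sects. 4.1.1 and 4.1.2) and the operator interaction:

```python
def al_filter_stream(stream, classifier, certainty_fn, annotate_fn,
                     c_thresh=0.3, buffer_size=50):
    """Single-pass AL filter: classify each event, buffer uncertain ones,
    trigger a classifier update once the buffer is full."""
    buffer = []
    for x in stream:
        y_pred = classifier.predict(x)               # passed on to the rating system
        if certainty_fn(classifier, x) <= c_thresh:  # uncertain -> informative sample
            buffer.append(x)
        if len(buffer) >= buffer_size:
            labeled = annotate_fn(buffer)            # operator assigns (possibly new) class labels
            classifier.update(labeled)               # re-training (RF) or incremental update (EFC)
            buffer = []
```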

In the following, the certainty measures for RF and EFC will be explained.

4.1.1 AL with random forests

Assume that m trees are available in the ensemble, \(B_1,B_2,\ldots ,B_m\); according to (3), the majority of the class votes delivers the final classification decision, i.e., the class label. This can easily be extended so that the classifier returns K probabilities \(p_i\), \(i\in \{1,\ldots ,K\}\), with \(\sum _{i=1}^K p_i = 1\) for a given query event, which represent the probabilities of belonging to class i:

$$\begin{aligned} p_i = \frac{|\{L_b = i | b=1,\ldots m\}|}{m} \end{aligned}$$
(18)

with \(L_b\) the class label returned by the bth tree in the forest. We then compute the certainty c as

$$\begin{aligned} c = p_\mathrm{max1} - p_\mathrm{max2} \end{aligned}$$
(19)

where \(p_\mathrm{max1}\) represents the highest probability for a class label, and \(p_\mathrm{max2}\) the second highest. This captures how confident the classifier is in making its decision. Whenever c is low, there are at least two class labels with very similar probabilities. Therefore, in this case the classifier cannot produce a certain classification due to a high conflict level (see also the subsequent section). In the other case, when c is high, the probability of the winning class label stands out from the others, which makes the classification result more reliable.
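A minimal sketch of this certainty computation, assuming a scikit-learn-style forest whose predict_proba method returns the relative tree votes:

```python
import numpy as np

def rf_certainty(rf, x):
    """Certainty c = p_max1 - p_max2 from the tree-vote probabilities, cf. (18) and (19)."""
    probs = rf.predict_proba(np.asarray(x).reshape(1, -1))[0]
    top_two = np.sort(probs)[-2:]       # second highest and highest probability
    return float(top_two[1] - top_two[0])
```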

4.1.2 AL with evolving fuzzy classifiers

In case of EFC, the sample selection criterion relies on two certainty-based concepts: conflict and ignorance. Conflict refers to the degree to which a new query point \(\mathbf {x}\) lies in-between two classes, or falls into a strongly overlapping region of two or more classes. It can thus be measured as \(\mathrm{conflict}_\mathrm{deg}=\mathrm{conf}_k\) in (8), or as the scoring proportion between the two classes with the highest scores (say k and l) according to (10):

$$\begin{aligned} \mathrm{conflict}_\mathrm{deg} = \frac{\text {score}_k}{\text {score}_k + \text {score}_l} \end{aligned}$$
(20)

The conflict degree lies in [0.5, 1] and can be compared with a certainty threshold \(c_{\mathrm{thresh}}\): if \(\text {conflict}_\mathrm{deg} < c_{\mathrm{thresh}}\), the sample is stored into the AL buffer.

Ignorance measures the degree of extrapolation in the current query point and thus demands a knowledge expansion of the classifier. In fact, samples lying in so far unexplored regions may also be outliers, which is a reason why many AL techniques prefer density-based strategies or combine them with uncertainty-based ones [18, 19]. However, in our particular case of surface inspection, an operator labels the samples on request, based on the images showing the events corresponding to the feature vectors which have been fed into the classification process. This means that the operator is able to recognize outliers easily by visualization and hence refrain from labeling them. Thus, real outliers appearing in the extrapolation space will never be learnt into the classifier.

In the case of fuzzy classifiers, which model representative rules for the local regions where training samples fall, the distance to the nearest rule is a quite natural indicator of extrapolation [29]. This can be estimated with the highest membership degree of the current query point \(\mathbf {x}\) to all rules in a classifier, i.e., in case of each binary classifier \(\mathbb {C}_{k,l}\) the ignorance degree is given by:

$$\begin{aligned} \text {ign}\_\text {deg} = 1-\text {max}_{i=1,\ldots ,C(k,l)} \mu _i(\mathbf {x})(k,l) \end{aligned}$$
(21)

with C(k, l) the number of rules in \(\mathbb {C}_{k,l}\). It is sufficient if a rule in any binary classifier covers the current query point sufficiently well. Then, the output of this classifier can already be seen as certain—as long as there is no high overlap in this rule, which is then covered by the conflict degree described above. Thus, the overall ignorance degree becomes:

$$\begin{aligned} \text {ign}\_\text {deg} = 1- \max _{k,l=1,\ldots ,K} (\max _{i=1,\ldots C(k,l)} \mu _i(\mathbf {x})(k,l)) \end{aligned}$$
(22)

with C(k, l) the number of rules in the binary classifier between classes k and l, \(\mathbb {C}_{k,l}\).
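Both degrees can be computed directly from the class scores and rule memberships; a small sketch (the memberships dictionary maps each class pair (k, l) to the list of rule membership values \(\mu _i(\mathbf {x})\) of its binary classifier):

```python
import numpy as np

def conflict_degree(scores):
    """Conflict between the two highest class scores, cf. (20); lies in [0.5, 1]."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]
    return float(s[0] / (s[0] + s[1]))

def ignorance_degree(memberships):
    """1 minus the best rule coverage over all binary classifiers, cf. (22)."""
    best = max(max(mu_list) for mu_list in memberships.values())
    return 1.0 - best
```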

Moreover, the inclusion of the ignorance concept in the final preference relation values leads to the weighted scoring concept:

$$\begin{aligned} L= & {} \text {argmax}_{k=1,\ldots ,K} \end{aligned}$$
(23)
$$\begin{aligned}&\left( \sum _{K\ge i\ge 1} conf_{k,i}*\max _{j=1,\ldots ,C(k,i)}(\mu _j(\mathbf {x}))\right) \end{aligned}$$
(24)

with C(ki) the number of rules in classifier \(\mathbb {C}_{k,i}\) and \(\mu _j\) the membership degree to the jth rule. This then guarantees that binary classifiers in which at least one rule covers the current query point sufficiently well are the strongest in the preference relation matrix and in the scoring for final class response elicitation. On the other hand, classifiers which show a strongly extrapolated situation in the current query show a tendency towards 0, and are thus masked-out in the scoring process.

Figure 3 shows examples of ignorance and conflict cases in classical as well as binary (all-pairs) fuzzy classifiers for various query points. The query points are indicated by dots.

Fig. 3 The same rule partition as in Fig. 2, but this time showing three query point cases as dark dots: Query #1 can be safely classified to Class #2, for Query #2 there is a high conflict level between Class #2 and #3 and for Query #3 there is a high ignorance level, indicating different possibilities of the decision boundaries as dotted lines in this outer region (wide variety of the version space)

Obviously, ignorance in Eqs. (21) and (22) measures the degree of extrapolation. Labeling new query points falling into extrapolation regions and using them for classifier updates is able to reduce the variation of the version space—as indicated in the lower right part of Fig. 3. The number of possible decision boundaries can be reduced if, for instance, the query point is verified to belong to Class #1. Then the upper two dotted decision boundary possibilities vanish, restricting the so-called version space and increasing the classification certainty in that region.

5 Application scenario

In the following, we will describe the experiments that were conducted to demonstrate the applicability of AL in a surface inspection environment. In our application case study, we are dealing with the inspection of microfluidic chips and elevator sheaves. The microfluidic chips are used for sample preparation for DNA sequencing. On the chip, the DNA and primers are packed into aqueous droplets in an oil phase. Their sizes are controlled by the diagnostic instrument via image vision in a closed loop. Therefore, it is necessary to inspect each chip adequately to avoid loss of sample. To identify the impact of an event during production, it is necessary to classify events as exactly as possible. Event types can be of the form of: (1) black spots in the polymer, which can cause troubles during image acquisition; (2) dust in the micro channels that can cause clogging; (3) bonding voids as potential causes for air bubbles; (4) visual defects like scratches on the surface or dirty regions; or (5) micro scratches in printed layers like printed electrodes. Some typical examples are visualized in Fig. 4.

Fig. 4 Examples of errors on microfluidic chips as used in our experiments

Full-surface inspection by a line-scan camera detects defects on the surface as well as inside the substrate or printed layer. Additionally, single areas are inspected by a high-resolution “microscope” matrix camera, inspecting a \(2\times 2\) mm area with a resolution of \(1~\upmu \mathrm{m}\).

The inspected sheaves have a diameter of around 100 mm and are used in wire rope-based elevators. They consist of a steel wheel body coated with polyurethane on the circumference that carries the rope notch. Coating is done in a casting process with the final notch shape produced on a turning lathe. Event types can be (1) surface spots that have been damaged during lathing, (2) events detected by the segmentation process that do not correspond to surface faults, (3) delamination of the coating, (4) closed and (5) open cavities and (6) swarf (turnings) that adhere to the circumference. Typical examples of these event types are depicted in Fig. 5.

Fig. 5 Examples of errors on elevator sheaves as used in our experiments. From top-left, row-wise: (4) closed cavities, (5) open cavities, (1) lathing damages, (3) delamination, (6) swarf and (2) false-positives from segmentation

Inspection of the sheaves is restricted to the coated circumference that is scanned by a line-scan camera with a resolution of \(20~\upmu \mathrm{m}\). The actual classifier for the existing microfluidic chip inspection system is based on a rudimentary hard-coded rule base which is tuned manually and operates on features extracted from the regions of interest. This tuning takes a lot of time and man-power as it can only be done by experts. Classification of event types detected on the elevator sheaves is currently done using a random forest classifier. The classifier is trained on a fully expert-annotated database.

Experience from inspection systems in a variety of application domains shows that drifts in the production processes are also reflected in the detected events and may lead to the appearance of new event classes. Naturally, in the classical setting, where classifier training is limited to an initial setup phase of the system, these changes are not accounted for and lead to a degradation of classification performance. Keeping the performance at the highest level possible, therefore, requires automatic adaptation of parameters and classifier structure as well as the online integration of new event classes. To keep the annotation effort on an economically realistic level during continuous maintenance, classifier training needs to be performed in an active learning scheme. These issues arising in real-world inspection systems are addressed by the concepts proposed in this paper.

5.1 Data characteristics

For our experiments, we used data collected online from the microfluidic chip and sheaves inspection systems. Thus, the events in the stored data file are ordered in the same manner as they appeared during the online manufacturing process. Both the images and the corresponding feature vectors have been stored in our database, containing features describing the shape, size, intensity levels and histograms of the events. For confidentiality reasons, we cannot fully disclose and describe the features extracted from the images, but this also goes beyond the scope of this paper. In sum, 24 and 20 features have been extracted, leading to high-dimensional multi-class classification problems with 11 and 6 classes for the microfluidic chip and sheaves inspection systems, respectively. These features have been designed by inspection system experts during system setup and are known to sufficiently characterize the events. For brevity, we will from now on refer to the data streams recorded from the microfluidic chip and sheaves inspection systems as Micro and Sheave, respectively. For both streams, class labels were retrieved from domain experts.

Micro contains \(N_m=13719\) samples with class labels \(l_i\in \{0,\ldots ,10\}, i=\{1,\ldots ,N_m\}\). The shorter Sheave consists of \(N_s=3106\) samples with class labels \(l_i\in \{1,\ldots ,6\}, i=\{1,\ldots ,N_s\}\). The labels for Micro were manually generated by an expert on two consecutive days. As the uncertainty of the user, and thus the inconsistency of some labellings, is known to be high for some samples, the overlap degree of the classes can be expected to be significant. This makes the classification problem a challenging one, representing a realistic case, since in the real online process users will also label the data, sometimes with high uncertainty. This was also confirmed in an off-line experiment in which six users manually annotated a subset of this data stream. Here, the manual labels differed among each other, which again demonstrates the challenge of the task. The Sheave stream was annotated by a production expert in a single day.

An overview of the class distribution of both data streams is given in Fig. 6a, c, and the appearance of each class in the streams is shown in Fig. 6b, d. We can see that for Micro, classes 4 and 8 are the most frequently occurring labels, whereas there exist only relatively few samples for classes 0, 2 and 7. The maximal class imbalance ratio, i.e., the ratio between the most populated and the least populated class, is higher than 100:1. In stream Sheave, classes 2 and 4 dominate and make up more than 75 % of the dataset. Although being more moderate, the maximal class imbalance ratio is still 13:1. This leads to highly imbalanced and thus hard multi-class classification problems containing 6 or 11 classes each. Figure 6b, d shows the frequency distribution of the appearance of each class over time in each stream. If this is close to a straight line, the samples of the corresponding class are nearly evenly distributed over the whole stream.

Fig. 6 Overview of the class distribution. a Total number of samples per class in Micro, and c for Sheave. b Number of samples per class at each time point in Micro, and d Sheave

6 Experiments

In the following, we will present the experiments that were conducted to evaluate the proposed AL framework.

6.1 Accuracy of active learning vs. annotation effort

For all experiments, the original order in which the samples appeared in both data streams was preserved. We evaluated the proposed AL framework by simulating an inspection system that is fed with data from the streams.

We constructed 18 experiments in which we evaluated the effects of our AL algorithm with both proposed methods (see Table 1 for an overview as well as the abbreviations). In each experiment, the AL buffer size was set to 50 upon the operator's request, and we used uncertainty sampling in single-pass manner as query strategy (see Sect. 4.1). The thresholds \(c_{\mathrm{thresh}}\) of the sampling strategy varied between \(c_{\mathrm{thresh}} \in \{0.1, 0.3, 0.5, 0.7, 0.9, 1.0\}\) for RF and between \(c_{\mathrm{thresh}} \in \{0.55, 0.6, 0.7, 0.8, 0.9, 1.0\}\) for EFC, which means that if \(c \le c_{\mathrm{thresh}}\) holds for a sample, it will be included in the AL buffer. Of course, \(c_{\mathrm{thresh}}=1.0\) represents the situation where each event is passed to the buffer (i.e., a full update). Loy et al.'s method requires no parameter to control sample selection. In addition, we added the case of \(c_{\mathrm{thresh}}=0.0\) where only the initially trained classifier is used to predict the class labels in the stream without any further update. This will be referred to as the static classifier. We experimented with the initial number of training samples by setting it to 20, 100 and 500. 500 samples is the most feasible upper bound which the operator is willing to label manually; 100 samples is a comfortable number the operators at this inspection system have agreed upon. In this sense, the initial classification training stage is quite short compared to the adaptive learning on the whole stream and will not be investigated in our performance comparisons. For reference, we also included the case where only 20 samples are labeled initially.

Table 1 Overview of the 18 constructed experiments

In all experiments, the procedure can be described by the following steps (a minimal code sketch of this loop is given after the list):

  1. A new event occurs in the stream.

  2. The event's class label is predicted based on the currently available classifier.

  3. If the calculated classifier's certainty \(c\) is below the fixed threshold \(c_{\mathrm{thresh}}\) (for RF and EFC), or Loy's query criterion is met, the event is passed to the AL buffer.

  4. If the AL buffer is full, the contained samples are used to perform the classifier update; if the buffer is not full, continue with step 1.
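
The following sketch summarizes steps 1–4 for the certainty-threshold variants (RF and EFC). The interfaces `classifier.predict`, `classifier.update` and `annotate` are hypothetical placeholders, not the actual implementations; Loy's query criterion would replace the threshold test in step 3.

```python
def run_al_loop(stream, classifier, annotate, c_thresh=0.3, buffer_size=50):
    """Single-pass AL loop over a data stream (hypothetical interfaces).

    classifier.predict(x) -> (label, certainty); classifier.update(samples, labels)
    re-trains (RF) or incrementally updates (EFC) the model; annotate(x) queries
    the operator for the ground-truth label of a buffered sample.
    """
    al_buffer = []
    for x in stream:                                    # 1. a new event occurs
        y_hat, certainty = classifier.predict(x)        # 2. predict with the current classifier
        if certainty <= c_thresh:                       # 3. uncertain events go to the AL buffer
            al_buffer.append(x)
        if len(al_buffer) == buffer_size:               # 4. full buffer triggers a classifier update
            labels = [annotate(s) for s in al_buffer]   #    operator supplies the class labels
            classifier.update(al_buffer, labels)
            al_buffer = []                              #    empty the buffer, continue with step 1
```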

Fig. 7 Accumulated accuracy for our experiments on Micro using RF (a, c, e) and EFC (b, d, f) compared to the benchmark. For both methods, we used the parameterization thresholds as explained in Table 1. For the initial training, we used a, b 20, c, d 100 and e, f 500 samples. The legend shows the certainty threshold as well as the number of samples used for updating the model in relation to the whole data stream

Fig. 8 Accumulated accuracy for our experiments on Sheave using RF (a, c, e) and EFC (b, d, f) compared to the benchmark. For both methods, we used the parameterization thresholds as explained in Table 1. For the initial training, we used a, b 20, c, d 100, and e, f 500 samples. The legend shows the certainty threshold as well as the number of samples used for updating the model in relation to the whole data stream

As evaluation measure, we compute the accumulated accuracy over time, following the interleaved test-then-train protocol, a widely used technique in the data stream mining community that is also provided within the massive online analysis framework [3]. After each prediction of a sample's class label (step 2), we compare the prediction \(\hat{y}\) with the true target value y and update the accuracy:

$$\begin{aligned} \mathrm{Acc}(N) = \frac{\mathrm{Acc}(N-1)\cdot (N-1) + I(\hat{y},y)}{N} \end{aligned}$$
(25)

with \(\mathrm{Acc}(0)=0\) and I the indicator function, i.e., \(I(a,b)=1\) if \(a=b\) and \(I(a,b)=0\) otherwise. Please note that for our evaluation purposes the true targets y are available in the evaluation stream, see Sect. 5.1.
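
A minimal sketch of this interleaved evaluation, assuming a sequence of (prediction, true label) pairs is available, could look as follows; the pairs below are illustrative only:

```python
def update_accumulated_accuracy(acc_prev, n, y_hat, y):
    """One step of Eq. (25): Acc(n) from Acc(n-1), with n >= 1."""
    return (acc_prev * (n - 1) + (1 if y_hat == y else 0)) / n

# toy (prediction, true label) pairs; in the framework these come from step 2
# and the true targets stored with the evaluation stream (Sect. 5.1)
pairs = [(1, 1), (2, 1), (0, 0), (4, 4)]
acc = 0.0                                    # Acc(0) = 0
for n, (y_hat, y) in enumerate(pairs, start=1):
    acc = update_accumulated_accuracy(acc, n, y_hat, y)
print(acc)                                   # 0.75
```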

Fig. 9 Percentage of selected samples (x-axis) versus the maximal/end/average accumulated accuracy over the life-time of Micro: a–c for RF, and d–f for EFC. We compared the results for initially training on 20 (a, d), 100 (b, e) and 500 (c, f) samples

We evaluated the accumulated accuracy over time for both proposed methods, the RF architecture with coupled AL strategy as well as EFC, and compared them against Loy et al.'s benchmark algorithm, which was modified to use the same scheme of initial training samples and batch updates. RF and EFC followed the parametrization scheme presented in Table 1. Figure 7 shows the trend lines for Micro, and Fig. 8 for Sheave.

By applying RF on Micro, we can see in Fig. 7a, c, e that, independently of the initial number of training samples, there is little difference in accuracy between the thresholds 0.3–1.0: they all reach a final accumulated accuracy around 0.78. However, reducing the threshold from 1.0 to 0.3 makes a large difference in annotation effort, since the user then only has to manually label 26 % of the samples instead of the whole data stream. In contrast, the accuracy for threshold 0.1 is lower, reaching 0.73 for 20, 0.77 for 100, and 0.75 for 500 initial training samples. EFC ends up with similar (almost monotonically increasing) trend lines as RF. In the case of 100 % of the samples (full update) as well as 55 % (threshold 0.9), EFC even performs a bit better, reaching a saturation above 80 % accuracy (whereas RF stays slightly below 80 %). This is remarkable considering that EFC operates on an incremental sample-wise basis, i.e., it sees only one sample at a time for updating its structure and parameters, whereas RF sees and uses the whole data set loaded so far for its re-training cycles and can thus be expected to become slower and slower as more samples pass by. EFC, on the other hand, learns in constant time per sample (single-pass), i.e., linear in the number of samples over time. Nevertheless, when comparing the results for using around 30 % of the data, which is more feasible for our application area, both RF and EFC reach similar accuracy trend lines around 0.77. Loy et al.'s benchmark method, which requires 24 % of the samples to be labeled, stays below the performance of both RF and EFC across all settings of their inclusion thresholds. Even when using less than half of the samples, RF and EFC still outperform the benchmark. The difference is most prominent in the setting with 500 initial training samples and decreases when fewer samples are used before starting active learning.

Fig. 10 Percentage of selected samples (x-axis) versus the maximal/end/average accumulated accuracy over the life-time of Sheave: a–c for RF, and d–f for EFC. We compared the results for initially training on 20 (a, d), 100 (b, e) and 500 (c, f) samples

In addition, it is interesting to see that, regardless of the number of initial training samples and the method used, a downtrend in accuracy during the first 1000 "online" stream samples as well as around sample 6000 can be clearly observed for Micro. The same behavior occurred when re-training other types of classifiers on each new incoming sample based on a sliding window over the last 5000 samples, which can be seen as a hard benchmark; however, such classifiers usually do not terminate in real time and are thus not usable in online installations. In this sense, this occurrence does not depend on the classifier architecture, but rather on the nature of the data itself. Analysis of the dataset has shown that the second downtrend can most likely be attributed to a change in annotation behavior, since the stream was annotated on two days, with samples before and after the downtrend being annotated on separate days.

In fact, a more detailed analysis revealed that many samples are selected within this time frame because they denote extrapolation cases outside the previously seen feature ranges. Prediction performance usually deteriorates in case of severe extrapolation, but due to our AL scheme the accuracies rise again over time, leading to a saturation and thus a convergence effect.

In the plots of Figs. 7 and 8, we also show the static classifier (i.e., the classifier trained only on the initial training set without further updates). For the Micro stream, the static classifier's performance clearly deteriorates (black solid line) and confirms our expectation that it cannot compete with a continuously trained classifier that adapts to current system changes, no matter whether RF or EFC is used. While still below the updated classifiers, the static classifier's performance difference is less pronounced for the Sheave dataset. This indicates that the Sheave dataset is more homogeneous over time, which is also backed by the class distribution lines of Fig. 6.

On the Sheave stream, RF outperforms EFC, with final accuracies slightly below 75 % for all but the strictest inclusion threshold (0.1), compared to below 70 % for EFC. Loy et al.'s method again stays below the performance of RF and EFC in terms of final accumulated accuracy. Compared to EFC, Loy shows superior performance until around sample 2000 of the stream. Except for a small window between samples 250 and 1000 in RF's run with 20 initial training samples and threshold 0.1, Loy outperforms RF only up until sample 750 in the setting with 500 initial training samples. It is interesting to see that in this region the large number of non-actively selected samples deteriorates RF's performance to around 66 %, whereas it is around 70 % at the same sample number when only 20 initial training samples are used.

Finally, to provide an insight into the usability of our methods, we offer plots showing the percentage of selected samples (x axis) versus the accuracy (y axis) for both methods (RF and EFC) on Micro in Fig. 9, and on Sheave in Fig. 10. For this purpose, we aggregated the accumulated accuracy trend lines shown above into three different values: maximal accuracy, average accuracy, and the accuracy at the end of the stream.

Fig. 11 End accuracies versus the percentage of selected samples for different AL buffer sizes on Micro (left two columns) and Sheave (right two columns); the closer they lie together, the less sensitive the method. Note the difference in the y axis!

From the figures we see that on Micro RF performs slightly better in the range of 10–25 % selected samples, while EFC outperforms RF when more samples are selected and at least 100 initial training samples are used. Using only 20 initial training samples gives the best results among the RF experiments, which also holds for the Sheave data set (Fig. 10), where RF generally outperforms EFC with accuracies around 74 % compared to 70 % for EFC. It is interesting to see that for both methods there is a performance optimum around 40–50 % of selected samples, with degradation if more samples are used. For noisy data sets, the saturation of classifier performance with increasing sample size can be explained by the unavoidable error due to noise, which imposes an upper bound on accuracy. The underlying mechanisms by which selective sampling achieves higher performance than the full update are worth analyzing in further work.

In general, when using the same set of selection thresholds, the methods select more samples for labeling on Sheave than on the Micro data set. Together with the lower accuracy values, this is an indication that the Sheave data set is more difficult to learn and has higher class overlap.

6.2 Sensitivity analysis of the active learning buffer

In the previous section, we described experiments covering the influence of the confidence threshold, which controls the number of samples that are passed into the AL buffer. Another interesting topic in terms of usability of the whole AL framework is its sensitivity to the AL buffer size, since this size has a strong influence on the update delay. A too small buffer causes many interruptions of the user's daily work, since input is required as soon as the buffer is filled; a too large buffer, on the other hand, delays the update of the classifier model. For this purpose, we constructed an experiment in which the size of the buffer varied over \(\{1, 5, 10, 20, 50\}\). We applied a size of 50 as upper boundary since it has been specified as a comfortable default value by the operators working with the system. For these experiments, we used 20, 100 as well as 500 samples for the initial training of the classifier and a confidence (conflict) threshold \(c_{\mathrm{thresh}}\) of 0.3 for RF and 0.55 for EFC, together with ignorance switched on (see Sect. 4.1). For these classifiers, these combinations produced the best trade-off between accumulated accuracy and annotation effort, see the results below. Loy et al.'s method does not require a parameter to control sample inclusion and is therefore not included in this analysis.

Fig. 12 Accumulated accuracy of individual classes of Micro at different introduction points. The entry point was defined as the first occurrence after 5000 samples (a, d for RF, b, e for EFC, c, f for Loy). a–c show class #1 after its point of inclusion and the number of samples used for updates; d–f the same for class #10; the legends indicate the classifier and the number of samples used

Fig. 13 Accumulated accuracy of individual classes of Sheave at different introduction points. The entry point was defined as the first occurrence after 1500 samples (a, d for RF, b, e for EFC, c, f for Loy). a–c show class #1 after its point of inclusion and the number of samples used for updates; d–f the same for class #4; the legends indicate the classifier and the number of samples used

Fig. 14 Micro: Evaluation of the stability of our proposed methods RF (a, d), EFC (b, e) and Loy (c, f) when introducing a–c class #1, and d–f class #10 into the stream (dotted line), compared to never introducing the class (solid line)

Fig. 15 Sheave: Evaluation of the stability of our proposed methods RF (a, d), EFC (b, e) and Loy (c, f) when introducing a–c class #1, and d–f class #4 into the stream (dotted line), compared to never introducing the class (solid line)

The experiment was performed on both data streams, examining the accumulated accuracy at the end of the stream (see Fig. 11). The different percentages of selected samples stem from the various parametrizations (see Table 1). On the Micro stream, the sensitivity lies in the range of 2–3 % for both classifier variants (Fig. 11), with better results for smaller buffer sizes. On Sheave, both methods are even less sensitive to the selected buffer size, with accuracies varying only within 0.05–1 %. Finally, we can conclude that both methods appear robust against different settings of the AL buffer size. In an online setting this means that, instead of requiring an expert on standby to annotate every selected sample, up to 50 samples can be collected for an annotation cycle at the cost of 3 and 1 % accuracy for the respective data sets.

6.3 Online integration of new classes

Within our AL framework, a challenging task is studying the behavior of our classifiers whenever new classes appear. In these situations, samples from a new class have been selected by the AL filter and the operator has assigned a new class label; the new information then has to be integrated into the classifiers on-the-fly. In the case of general RF, a re-training step is always conducted on all training samples, including the initial training samples and all samples that were placed into the AL buffer; a new class is thus automatically taken into account once it is reflected in some samples. In the case of fuzzy classifiers, particular mechanisms in the learning engines are designed to assure a quick and stable integration of a new class with correct classifier outputs on that class (see Sect. 3.2.3). A new class is usually under-represented in comparison to the older classes. Thus, we study how this affects the continuing classifier accuracy trend lines on the older classes and especially on the new class.

We artificially simulated the case where samples of a new class (i.e., a class that has not been seen so far) suddenly appear in the system. For this, we performed a leave-one-out (LOO) procedure for each of the classes in each data stream: for each class C, we excluded all samples of C from the data set until a particular time point t and let the whole learning process run until t. As soon as time point t is reached, samples from C may appear in the stream in the same manner as in their natural appearance in the original stream (i.e., the first sample after t from class C is the first sample of C in this modified data stream). The introduction time point t was set to approximately half the number of stream samples, \(t=5000\) for Micro and \(t=1500\) for Sheave.
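
A minimal sketch of this stream construction, assuming the original stream is available as parallel lists of samples and labels and that t is an index into the original stream, could look as follows; the variable names are placeholders, not the actual implementation:

```python
def leave_one_out_stream(samples, labels, left_out_class, t):
    """Remove all samples of `left_out_class` that occur before position t;
    from t onward the class appears in its natural order (LOO stream for one class)."""
    modified = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        if y == left_out_class and i < t:
            continue                 # class is withheld until the introduction point
        modified.append((x, y))      # all other samples keep their original order
    return modified

# e.g., withhold class 1 until sample 5000 of the Micro stream (t = 5000):
# micro_loo = leave_one_out_stream(micro_samples, micro_labels, left_out_class=1, t=5000)
```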

Additionally, we generated benchmark results where the left-out class C was never introduced and thus completely removed from the stream. We compare the drop in accuracy after the introduction time point t in the experiment where the class was introduced to this benchmark experiment. Again, these experiments were run on both data streams. This provides an insight into the robustness of the classifiers on the already established classes when new ones are introduced.

To keep the paper length reasonable, results are given with a focus on two classes for both datasets. Figure 12 shows the results for leaving out class #1 (a–c) and class #10 (d–f) in Micro; Fig. 13 shows the same for the left-out classes #1 and #4 in Sheave. In each plot, the legend indicates the percentage of samples used for classifier updates, as this has to be seen in relation to the accuracy trend lines. In all of the figures, the number of used samples slightly differs within and across methods because the inclusion threshold for RF and EFC is fixed and the resulting fraction of selected samples is influenced only indirectly.

For integrating class #1 in Micro, RF outperforms both EFC and Loy and reaches an accumulated accuracy of 56 % on class #1 samples at the end of the stream, where around 1100 class #1 samples occurred after sample 5000 (see Fig. 6b). Considering that the accuracy is accumulated starting from the inclusion time, this means more than 80 % of the new class #1 samples are correctly classified at the end of the stream. Class #10 is learned almost equally well by Loy and RF, with slight advantages for Loy, reaching more than 90 % accumulated accuracy and close to 100 % at the end of the stream. The learning speed of class #10 with EFC is clearly lower; however, its steadily increasing accumulated accuracy of 40 % until the end of the stream still corresponds to above 85 % accuracy when only the end of the stream is considered.

On the shorter Sheave stream, the integration of class #1 after sample 1500 cannot be achieved by RF and Loy until the end of the stream at around 3100 samples, where approximately 140 class #1 samples occurred after the inclusion point. For those methods, integration starts around sample 2300 and progresses slowly afterward. EFC gives the first correct classifications around sample 1800 and classifies around 45 % of the class #1 samples correctly at the end of the stream (yielding an accuracy of 27 % accumulated from the inclusion point). Given the relatively low sample count and the monotonically increasing slope of the accumulated accuracy curve, we expect EFC's performance to reach near static-case levels on longer streams. Class #4 makes up almost 30 % of the Sheave stream, with close to 500 samples occurring after the inclusion point. Again, EFC performs best on this class, with an accumulated accuracy of 71 % and above 80 % for samples at the end of the stream. RF performs slightly worse, with 67 % accumulated accuracy and 79 % for samples at the end of the stream. Loy takes significantly longer to integrate class #4, with the first correct classifications starting at sample 1800 and a flatter learning curve that is still increasing at the end of the stream.

Furthermore, in our application scenario, it is intrinsically important that the classifier remains stable on the other classes once a new class is introduced. This assures the reliability of the proposed methods and therefore preserves the operators' confidence in our framework; a deterioration of this performance due to a new class would lead to mistrust in its stability. Hence, we studied the accumulated accuracies on the older, already established classes once when no new class is introduced at all and once when a new class is introduced at a certain point in time. We then compared the accuracy trend lines on the older classes from the time of inclusion of the new class until the end of the stream. Figure 14 shows the results for two example classes (#1 and #10) of Micro. An interesting fact is that in the case of class #10, the accumulated accuracy even gets higher after introducing the class than when never introducing the class at all (see Fig. 14d, f). Obviously, the selected class can be learned well into the model, which is also reflected in Fig. 12d, f.

The Sheave data set shows a similar picture, with very close accuracy lines (Fig. 15). On this dataset, the performance on the remaining classes even improves slightly for the RF and Loy methods when introducing class #1 (a, c). In general, the accuracy trend lines are very similar whether a class is introduced or not, with a maximum deviation in the range of 1–2 %, which is negligible in practice.

7 Conclusion

In this paper, we have introduced a novel framework for learning classifiers in surface inspection systems. At the core of our framework, an AL filter significantly reduces the annotation effort required to increase and maintain classification accuracy during system setup and adaptation. It enables continuous classifier updates at an economically realistic level of interaction, requiring only occasional input on the ground truth of new online samples. Dynamic adaptation of the event classification is another essential characteristic of our framework, including the capability to introduce new classes on-the-fly. This is especially important for non-stationary environments and dynamic production processes, where the performance of static, state-of-the-art classifiers deteriorates over time, an issue which could be verified on data from the inspection of microfluidic chips and elevator sheaves.

Two classifier architectures have been used in combination with extended concepts within their learning and sample selection strategies: RF and EFC, the latter equipped with a new class decomposition strategy to decrease the likelihood of class imbalance, especially for new and thus under-represented classes. These classifiers have been benchmarked against Loy et al.’s stream-based active learning method. RF clearly outperforms the benchmark on both datasets, yielding higher accuracies using only half the number of samples. The same is true for EFC on the microfluidic chip data set, while Loy gives slightly better performance at the start of the sheave data set and is outperformed by EFC only at the end.

New classes defined by operators during the AL cycles are robustly integrated soon after their first appearance. For all tested methods, the stability of the classifiers on the older, already established classes is high. Furthermore, RF and EFC are both quite insensitive to different sizes of the AL buffer, providing a wide parametrization range to choose from according to the operator's convenience. In summary, RF is preferable to EFC in terms of predictive performance, but in our current implementation it operates in re-training mode with increasing sample sizes (it uses all training samples seen so far) and is thus expected to slow down over time. In contrast, EFCs always update their structure and parameters in constant time per sample, as they operate in a single-pass, incremental manner.

Future work includes the integration and assessment of an online version of RF, from which we expect faster results in addition to reduced memory requirements. So far, our proposed implementation always needs to store the full training base, including the initial training set and all samples selected by the AL query strategy; online implementations would overcome this issue. Additionally, a detailed analysis of the computational effort of RF in comparison to EFC would add a new dimension for comparing these two approaches. Furthermore, our proposed framework so far only concentrates on handling the occurrence of new event classes in the data stream. It would be of high interest to expand it to detect drifts in the production system that may arise from changes in the production line; this would add great benefit for the end user. Finally, evaluating our framework on other stream-based data sets is another possible next step.