1 Introduction

In many real-world applications such as image classification [1], information extraction [2] and text classification [3], unlabeled data are abundant and easy to obtain, while labels are costly. Active learning [4, 5] aims to achieve higher classification accuracy with fewer labels through human–computer interaction. The key issue is to select critical instances to label. There are at least two views, each with corresponding approaches. One view is that uncertain instances are critical. Typical approaches include query-by-committee [6], fuzzy-rough based [7], and ambiguity-based [8] algorithms. The other view is that representative instances are critical. The most typical approaches are clustering-based [9, 10] algorithms. Naturally, combination approaches [11, 12] take advantage of both.

Three-way active learning [10, 13, 14] is a new type of clustering-based [9] algorithm. From the viewpoint of three-way decision [15, 16], there are three possible actions for each block. If a block does not have enough labeled data, representative instances of the block are queried. If it has enough labeled data, all with the same label, the remaining instances in the block are classified. If its labeled data carry different labels, the block is clustered for further treatment. Since the process is iterative, the paradigm can be better represented by sequential three-way decision [17,18,19]. From the viewpoint of granular computing [20, 21], the learning process follows the coarse-to-fine scheme [22,23,24]. Instances are queried or classified in the appropriate granule, which is represented by a block.

Unfortunately, the performance of three-way active learning relies heavily on the base clustering technique. Different clustering techniques are applicable to different data. This is also a common situation for clustering-based active learning algorithms [5]. A natural question is: given a dataset, can we find appropriate clustering techniques in the learning process?

Fig. 1 A running example of TACS: a depicts the general clustering and learning process; b, d, e depict the block and the best clustering technique; c, f, g, h depict the final blocks. Triangles indicate queried positive labels, and circles indicate queried negative labels

In this paper, we propose the three-way active learning through clustering selection (TACS) algorithm for this issue. Figure 1 illustrates a running example of the new algorithm. Some representative instances in the whole block \({\mathbf {B}}\) are queried at the beginning. Since there are both positive and negative labels, \({\mathbf {B}}\) is split into \({\mathbf {B}}_1\) and \({\mathbf {B}}_2\) using the currently best algorithm (Density Peaks, DP [25]). Then a few more instances are queried for \({\mathbf {B}}_1\). Since there are only positive labels, the remaining instances are classified as positive. More instances are queried for \({\mathbf {B}}_2\) as well. With both positive and negative labels, \({\mathbf {B}}_2\) is split into \({\mathbf {B}}_{21}\) and \({\mathbf {B}}_{22}\) using the currently best algorithm (kMeans). This process repeats until all instances are either queried or classified. Note that the whole process is similar to building a decision tree, but the splitting technique is different. Moreover, the binary split structure works for any dataset, not only for those with binary classes.

The contribution of the paper is to address four issues of our algorithm. The first issue is: which instances should be queried? This is also the key issue of any active learning algorithm. In the beginning, we select some representative instances to query. Here representativeness is measured by both the density and the distance to the closest instance with higher density [10]. These instances also help clustering technique selection. During the iteration, we select either representative or informative instances. Here informativeness is measured by the total distance to the labeled instances of the same block. In this way, the number of queries is also controlled.

The second issue is: how to revise existing clustering techniques to suit our algorithm? Since our algorithm always splits a block into two, general clustering techniques should be revised. For both kMeans [26] and fuzzy c-means (FCM) [27], we choose a pair of instances with a large distance as the initial centers. For density peaks (DP) [25], we choose the top two representative instances with different labels for splitting. For Hierarchical [28], the solution is straightforward since the cluster tree is built in a bottom-up manner. For DBScan [29], we adjust the density threshold such that at least one core exists. For random walks (RW) [30], the largest block does not change, and the other instances form the second block. We also revise these techniques to speed up the learning process.

The third issue is: how to select the most appropriate technique to cluster the current block? Naturally, this is the key issue of this paper. We design a weighted entropy measure based on 1-nearest-neighbor. It considers both labeled and unlabeled data. Labeled data are trustworthy, hence their weight is 1. Unlabeled data are first classified by 1-nearest-neighbor to obtain pseudo-labels, whose weight is less than 1. In this way, the weighted entropy can be calculated. Moreover, we design a retrospective technique to recalculate the weighted entropy after obtaining more labels.

The fourth issue is: how to deal with the lack of labels? In real applications, the expert usually cannot provide enough queries due to a limited budget. Consequently, we would like to query labels in larger blocks, which is more appropriate for clustering-based algorithms. For this purpose, we design the breadth-first visit of the tree depicted in Fig. 1h. This is implemented by a block queue, in which blocks smaller than a threshold are discarded. Compared with the depth-first visit, different blocks of the same level obtain more balanced queries. When the budget runs out, only some instances in relatively small blocks remain unclassified. Finally, they are classified using kNN.

Experiments are undertaken on twelve datasets. Results show that in most cases, TACS can find appropriate base clustering techniques. It is more accurate than supervised classification algorithms such as C4.5 [31] and Naïve Bayes (NB) [32], active learning algorithms such as query-by-committee (QBC) [6] and manifold adaptive experimental design (MAED) [33], and single clustering technique based algorithms such as ALEC [10].

The rest of the paper is organized as follows. Section 2 introduces the basis of our algorithm, including the problem statement and revisions to some clustering techniques. Section 3 presents our algorithmic framework and some key issues along with corresponding techniques. Section 4 describes the experimental process and lists some results. Section 5 discusses some related work. Finally, Sect. 6 presents conclusions and outlines further research trends.

2 Basis

This section presents some basis of the work, including the data model, the problem statement, and some issues of the clustering techniques. Table 1 lists some notations used throughout the paper.

Table 1 Notations

2.1 The data model

Let \({\mathbf {X}} = (x_{ij})_{n \times m}\) be the data matrix, where n is the number of instances, m is the number of conditional attributes, and \({\mathbf {x}}_i =\)\((x_{i1}\), \(x_{i2}\), ..., \(x_{im}\)) is the ith instance. Let further \({\mathbf {y}} = (y_1, y_2, \ldots , y_n)^{\mathrm {T}}\) be the class label vector. In our active learning scenario, the value of \(y_i \in [1 .. c]\) (\(1 \le i \le n\)) is unknown. It should be either queried or classified.

The set of all instances is \({\mathbf {U}} = \{{\mathbf {x}}_i | 1 \le i \le n\}\). For brevity, we also denote it using the indices, i.e., \({\mathbf {U}} = [1..n]\). In this way, any block is a subset of [1..n].

2.2 Problem statement

Since a query requires an oracle's effort, we assume that the number of queries is limited. The problem, which is the same as that of [10], is stated as follows.

Problem 1

Active learning with a fixed number of labels.

Input: The data \({\mathbf {X}}\), and the proportion of queried instances qi.

Output: The set of queried instances \({\mathbf {Q}} \subset {\mathbf {U}}\), the classified labels for \({\mathbf {U}} \setminus {\mathbf {Q}}\).

Optimization objective: Maximize the classification accuracy.

First, we assume that no label is known at the beginning of the learning process. In this way, it is easy to compare the classification accuracy of different algorithms. Second, it is required that \(|{\mathbf {Q}}| \le \lfloor n \times qi\rfloor\). When \(|{\mathbf {Q}}| < \lfloor n \times qi\rfloor\), the queries are not used up. This situation rarely occurs in applications because the number of queries is usually insufficient. Third, the classification accuracy only accounts for classified instances.

2.3 Revised clustering techniques

Our algorithm is based on six popular clustering techniques. These include two prototype-based techniques (kMeans [26] and FCM [27]), two density-based techniques (DBScan [29] and DP with the Gaussian kernel [25]), one hierarchical technique (Hierarchical [28]), and one graph-based technique (RW [30]). Instead of explaining them from the very beginning, we will focus on the following issues:

(1) How to revise the techniques for binary splitting shown in Fig. 1b?

(2) For big datasets, how to decrease the space and/or time complexity?

(3) How to initialize clustering centers if needed?

(4) How to balance clusters if needed?

In the following context, the current block will be denoted by \({\mathbf {B}}\), and the sub-blocks will be denoted by \({\mathbf {B}}_1\) and \({\mathbf {B}}_2\), respectively.

2.3.1 Prototype-based techniques

kMeans and fuzzy c-means (FCM) are popular prototype-based clustering techniques. Both require some initial centers, which often affect the results. Since two initial centers are required, we would like to choose the pair of instances with the maximal distance for this purpose. That is

$$\begin{aligned} ({\mathbf {x}}_{i^*}, {\mathbf {x}}_{j^*}) = \mathop {\hbox {argmax}}\limits _{(i, j) \in {\mathbf {B}} \times {\mathbf {B}}} d_{ij}, \end{aligned}$$
(1)

where \(d_{ij}\) is the distance between \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\). The time complexity of computing \(d_{ij}\) is O(m) if we use simple measures such as Manhattan and Euclidean distance. Hence the time complexity of Eq. (1) is \(O(m|{\mathbf {B}}|^2) = O(mn^2)\), which is rather high.

In practice, a pair of sufficiently distant instances usually suffices to obtain good clusters. It is unnecessary to find the instance pair with the maximal distance. Hence we design an alternative approach. We randomly choose some pairs of instances from \({\mathbf {B}}\) to form

$$\begin{aligned} {\mathcal {P}} \subseteq {\mathbf {B}} \times {\mathbf {B}}. \end{aligned}$$
(2)

The pair with the maximal distance in \({\mathcal {P}}\) is

$$\begin{aligned} ({\mathbf {x}}_{i^*}, {\mathbf {x}}_{j^*}) = \mathop {\hbox {argmax}}\limits _{(i, j) \in {\mathcal {P}}} d_{ij}. \end{aligned}$$
(3)

In this way, the time complexity is reduced to \(O(m|{\mathcal {P}}|)\). In our experimentation, \(|{\mathcal {P}}| = 10|{\mathbf {B}}|\), and the time complexity is O(mn).
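
The following Java sketch illustrates this sampling scheme. It is a minimal illustration assuming Euclidean distance; the class and method names (FarPairSelector, selectFarPair) are ours and not taken from the TACS source code.

```java
import java.util.Random;

// Illustrative sketch of Eqs. (2) and (3): sample |P| = 10|B| random index pairs
// from the current block and keep the pair with the maximal distance.
public class FarPairSelector {
  // Returns the indices of an approximately farthest pair within the block.
  static int[] selectFarPair(double[][] data, int[] block, Random rnd) {
    int pairs = 10 * block.length;              // |P| = 10|B|
    int bestI = block[0], bestJ = block[block.length - 1];
    double bestDist = -1;
    for (int p = 0; p < pairs; p++) {
      int i = block[rnd.nextInt(block.length)];
      int j = block[rnd.nextInt(block.length)];
      if (i == j) continue;                     // skip degenerate pairs
      double d = distance(data[i], data[j]);
      if (d > bestDist) { bestDist = d; bestI = i; bestJ = j; }
    }
    return new int[] { bestI, bestJ };
  }

  // Euclidean distance; O(m) per pair, hence O(m|P|) overall.
  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int k = 0; k < a.length; k++) sum += (a[k] - b[k]) * (a[k] - b[k]);
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    double[][] x = { {0, 0}, {1, 1}, {8, 9}, {2, 1}, {9, 8} };
    int[] block = { 0, 1, 2, 3, 4 };
    int[] pair = selectFarPair(x, block, new Random(42));
    System.out.println("Initial centers: x" + pair[0] + " and x" + pair[1]);
  }
}
```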

2.3.2 Density-based techniques

Density-based techniques [25, 29] require the computation of the density of each instance. There are two popular kernels, one is cutoff, and the other is Gaussian. With the cutoff kernel, the neighborhood of \({\mathbf {x}}_i\) is

$$\begin{aligned} {\mathbf {N}}_i = \{j \ne i | d_{ij} \le d_c\}, \end{aligned}$$
(4)

where \(d_c\) is the neighborhood distance threshold. The local density of \({\mathbf {x}}_i\) is

$$\begin{aligned} \rho _i = |{\mathbf {N}}_i|, \end{aligned}$$
(5)

where \(|\cdot |\) indicates the cardinality of a set.

With the Gaussian kernel, the local density of \({\mathbf {x}}_i\) is redefined as

$$\begin{aligned} \rho _i = \sum _{j \ne i} e^{- \left( \frac{d_{ij}}{d_c}\right) ^2}. \end{aligned}$$
(6)

The cutoff kernel only considers instances within the neighborhood, while the Gaussian kernel takes all instances into account, with the neighborhood contributing most of the density. The cutoff kernel produces an integer-valued density, while the Gaussian kernel produces a real-valued density. The cutoff kernel often produces the same density for different instances, making the master tree (see Eq. (9)) sensitive to the data order, an issue that rarely arises with the Gaussian kernel.

Now we consider the time complexity of density computation. The time complexity of computing \(d_{ij}\) is O(m). According to Eqs. (4), (5) and (6), the time complexity of computing the density of a single instance is O(mn) with either kernel. Computing the densities of all instances thus takes \(O(mn^2)\), which is rather high.

In applications, there is essentially no need to consider instances far from the current one. Hence we propose the following approach. First, the dataset is clustered into \(\lfloor n / n' \rfloor\) blocks using the kMeans algorithm. Since kMeans only iterates a few rounds, the time complexity is \(O(mn n / n') = O(mn^2 / n')\). Second, when computing the density of an instance, we only consider instances in the same block. The time complexity is \(O(mn')\). So the time complexity of computing the density of all instances is \(O(mnn')\). If we set \(n' = \sqrt{n}\), the total time complexity will be reduced to \(O(mn(n / n' + n')) = O(mn^{1.5})\).

With the cutoff kernel, this acceleration method may produce a density value that is slightly different from the original definition. With the Gaussian kernel, there will always be small differences. Fortunately, for our learning task, such difference rarely influences the final results especially when n is big.
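
The following Java sketch illustrates the accelerated Gaussian-kernel density computation under our assumptions: Euclidean distance is used, and the pre-partition is simply given as an array of index blocks (e.g., produced by a quick kMeans run). The names are illustrative and not those of the actual implementation.

```java
// Illustrative sketch of the accelerated density of Eq. (6): densities are
// computed only within the pre-clustered block that contains each instance,
// reducing the cost from O(mn^2) to roughly O(mn^{1.5}) when each block
// contains about sqrt(n) instances.
public class LocalDensity {
  static double[] gaussianDensities(double[][] data, int[][] blocks, double dc) {
    double[] rho = new double[data.length];
    for (int[] block : blocks) {                 // pre-partition from kMeans
      for (int i : block) {
        double sum = 0;
        for (int j : block) {                    // only instances of the same block
          if (j == i) continue;
          double ratio = distance(data[i], data[j]) / dc;
          sum += Math.exp(-ratio * ratio);       // Gaussian kernel
        }
        rho[i] = sum;
      }
    }
    return rho;
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
    return Math.sqrt(s);
  }

  public static void main(String[] args) {
    double[][] x = { {0, 0}, {0.5, 0.4}, {0.6, 0.5}, {5, 5}, {5.2, 5.1}, {5.1, 4.9} };
    int[][] blocks = { {0, 1, 2}, {3, 4, 5} };   // a toy pre-partition
    double[] rho = gaussianDensities(x, blocks, 1.0);
    for (int i = 0; i < rho.length; i++) System.out.printf("rho[%d] = %.3f%n", i, rho[i]);
  }
}
```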

Now we discuss some implementation issues of DBScan. Instance \({\mathbf {x}}_i\) is called a core iff

$$\begin{aligned} \rho _i \ge MinPts, \end{aligned}$$
(7)

where MinPts is the density threshold.

The setting of \(d_c\) and MinPts is an important issue. In our implementation, \(MinPts = \sqrt{|{\mathbf {B}}| + 1}\) where \({\mathbf {B}}\) is the current block. Similar to Eq. (3), to save the runtime, we set

$$\begin{aligned} d_c = rt \times \max _{(i, j) \in {\mathcal {P}}} d_{ij}, \end{aligned}$$
(8)

where \(rt \in (0, 1)\) is a distance ratio with an initial value such as 0.1. If there is no core, we will increase rt to 1.5rt until at least one core exists.

DBScan generates a number of clusters. To obtain exactly two clusters, the largest cluster forms \({\mathbf {B}}_1\), and \({\mathbf {B}}_2 = {\mathbf {B}} \setminus {\mathbf {B}}_1\).
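
The following Java sketch illustrates this parameter adaptation, assuming Euclidean distance and the cutoff-kernel density of Eq. (5) for the core test. The names and the way the maximal sampled pair distance is passed in are illustrative; the actual implementation may organize this differently.

```java
// Illustrative sketch of the DBScan parameter adaptation of Eq. (8):
// starting from an initial ratio rt, d_c is enlarged (rt <- 1.5 rt) until at
// least one core instance exists.
public class DbscanThreshold {
  static double adaptDc(double[][] data, int[] block, double maxPairDist, double rt, int minPts) {
    while (true) {
      double dc = rt * maxPairDist;              // Eq. (8)
      for (int i : block) {
        int neighbors = 0;                       // cutoff-kernel density, Eq. (5)
        for (int j : block) {
          if (j != i && distance(data[i], data[j]) <= dc) neighbors++;
        }
        if (neighbors >= minPts) return dc;      // at least one core exists
      }
      rt *= 1.5;                                 // no core: enlarge the radius
    }
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
    return Math.sqrt(s);
  }

  public static void main(String[] args) {
    double[][] x = { {0, 0}, {0.2, 0.1}, {0.3, 0.2}, {4, 4}, {4.1, 4.2} };
    int[] block = { 0, 1, 2, 3, 4 };
    int minPts = (int) Math.sqrt(block.length + 1);   // MinPts = sqrt(|B| + 1)
    double dc = adaptDc(x, block, 5.8, 0.1, minPts);  // 5.8 approximates the max sampled pair distance
    System.out.printf("Adapted d_c = %.3f with MinPts = %d%n", dc, minPts);
  }
}
```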

For density peaks [25], we have some more issues to handle. Let \(ms_i = \{j | \rho _j > \rho _i\}\) be the set of instances with higher density than \({\mathbf {x}}_i\). The master of \({\mathbf {x}}_i\) is

$$\begin{aligned} m_i = \mathop {\hbox {argmin}}\limits _{j \in ms_i} d_{ij}. \end{aligned}$$
(9)

With \(m_i\), we organize all instances using a tree, where the master of \({\mathbf {x}}_i\) is its parent. This tree, which will be called the master tree, will be used in the algorithm. The distance to master is

$$\begin{aligned} \delta _i = d_{im_i} = \min _{j \in ms_i} d_{ij}. \end{aligned}$$
(10)

The representativeness of \({\mathbf {x}}_i\) is measured by [10]

$$\begin{aligned} \gamma _i = \rho _i \delta _i. \end{aligned}$$
(11)

To split \({\mathbf {B}}\) into \({\mathbf {B}}_1\) and \({\mathbf {B}}_2\), we sort the labeled instances in \({\mathbf {B}}\) in descending order according to \(\gamma _i\). Let the label of the first instance be lf. Let further the first instance whose label differs from lf be \({\mathbf {x}}_f\). \({\mathbf {x}}_f\) and all of its offspring in \({\mathbf {B}}\) form \({\mathbf {B}}_1\). Naturally, \({\mathbf {B}}_2 = {\mathbf {B}} \setminus {\mathbf {B}}_1\).

Note that for each block, we will build a new master tree. However, the density of each instance will not be recomputed. The reason is that pairwise distance computation is very time consuming, and recomputing is unnecessary.
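
A minimal Java sketch of this split is given below. It assumes precomputed densities and pairwise distances, at least two different labels among the labeled instances, and follows the common density-peaks convention of assigning the maximal distance to the root of the master tree; the class and method names are ours.

```java
import java.util.*;

public class DensityPeaksSplit {
  // Splits a block (given as global indices) at the most representative labeled
  // instance whose label differs from that of the top-ranked labeled instance,
  // returning B_1; B_2 is the rest of the block.
  static List<Integer> split(double[][] dist, double[] rho, int[] block,
                             Map<Integer, Integer> labels) {
    int n = block.length;
    int[] master = new int[n];                        // positions within the block
    double[] gamma = new double[n];
    for (int a = 0; a < n; a++) {
      master[a] = -1;
      double delta = Double.MAX_VALUE;
      for (int b = 0; b < n; b++)                     // Eqs. (9) and (10)
        if (rho[block[b]] > rho[block[a]] && dist[block[a]][block[b]] < delta) {
          delta = dist[block[a]][block[b]];
          master[a] = b;
        }
      if (master[a] == -1) {                          // the root has no master;
        delta = 0;                                    // use the maximal distance,
        for (int b = 0; b < n; b++)                   // a common DP convention
          delta = Math.max(delta, dist[block[a]][block[b]]);
      }
      gamma[a] = rho[block[a]] * delta;               // Eq. (11)
    }
    // Sort the labeled positions by representativeness in descending order.
    List<Integer> labeled = new ArrayList<>();
    for (int a = 0; a < n; a++) if (labels.containsKey(block[a])) labeled.add(a);
    labeled.sort((a, b) -> Double.compare(gamma[b], gamma[a]));
    int lf = labels.get(block[labeled.get(0)]);
    int f = -1;
    for (int a : labeled) if (labels.get(block[a]) != lf) { f = a; break; }
    // B_1 is x_f together with all of its offspring in the master tree.
    List<Integer> b1 = new ArrayList<>();
    for (int a = 0; a < n; a++) {
      int cur = a;
      while (cur != -1 && cur != f) cur = master[cur];
      if (cur == f) b1.add(block[a]);
    }
    return b1;
  }

  public static void main(String[] args) {
    double[][] x = { {0, 0}, {0.2, 0}, {0.4, 0.1}, {5, 5}, {5.2, 5.1}, {5.1, 4.8} };
    int m = x.length;
    double[][] dist = new double[m][m];
    double[] rho = new double[m];
    for (int i = 0; i < m; i++)
      for (int j = 0; j < m; j++) {
        double dx = x[i][0] - x[j][0], dy = x[i][1] - x[j][1];
        dist[i][j] = Math.sqrt(dx * dx + dy * dy);
      }
    for (int i = 0; i < m; i++)                       // Gaussian densities, d_c = 1
      for (int j = 0; j < m; j++)
        if (j != i) rho[i] += Math.exp(-dist[i][j] * dist[i][j]);
    Map<Integer, Integer> labels = new HashMap<>();
    labels.put(1, 0);                                 // queried: negative
    labels.put(3, 1);                                 // queried: positive
    System.out.println("B_1 = " + split(dist, rho, new int[] {0, 1, 2, 3, 4, 5}, labels));
  }
}
```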

We have tested two versions of the density peaks clustering technique based on cutoff and Gaussian kernels, respectively. It is observed that the cutoff kernel rarely defeats the Gaussian kernel. Hence we will only use the Gaussian kernel in our experimentation.

2.3.3 Hierarchical technique

The hierarchical clustering technique works in a bottom-up manner. Let \({\mathbf {B}}_i\) and \({\mathbf {B}}_j\) be two blocks, and their centers be \({\mathbf {x}}_i'\) and \({\mathbf {x}}_j'\), respectively. Here \({\mathbf {x}}_i'\) and \({\mathbf {x}}_j'\) may not be true data points. The distance between \({\mathbf {B}}_i\) and \({\mathbf {B}}_j\) is defined as the distance between \({\mathbf {x}}_i'\) and \({\mathbf {x}}_j'\).

Let \({\mathcal {B}} = \{{\mathbf {B}}_1, {\mathbf {B}}_2, \ldots , {\mathbf {B}}_{n'}\}\) be the current round set of blocks. At the beginning, \({\mathbf {B}}_i = \{{\mathbf {x}}_i\}\) for \(1 \le i \le n\) and \(n' = n\). Let \({\mathcal {B}}'\) be the next round set of blocks, initialized as \(\emptyset\). One round of merging repeats the following two steps \(\lfloor \frac{n'}{2}\rfloor\) times. Step 1: take out \({\mathbf {B}}_i\) and \({\mathbf {B}}_j\) from \({\mathcal {B}}\) such that their distance is minimal, i.e., \({\mathcal {B}} = {\mathcal {B}} \setminus \{{\mathbf {B}}_i, {\mathbf {B}}_j\}\). Step 2: let \({\mathcal {B}}' = {\mathcal {B}}' \cup \{{\mathbf {B}}_i \cup {\mathbf {B}}_j\}\). If \(\lfloor \frac{n'}{2}\rfloor \ne \lceil \frac{n'}{2}\rceil\), i.e., \(n'\) is odd, the last remaining element of \({\mathcal {B}}\) is added to \({\mathcal {B}}'\) directly.

In the first round, the n nodes are merged into \(\lceil \frac{n}{2}\rceil\) blocks. In the following rounds, the current \(n'\) blocks are merged into \(\lceil \frac{n'}{2}\rceil\) bigger ones. In our implementation, the process stops when there are exactly 2 blocks.
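
A minimal Java sketch of one merging round under these conventions is given below, using the mean of a block as its center; the class and method names are illustrative.

```java
import java.util.*;

// Illustrative sketch of one round of the revised bottom-up merging: the two
// closest blocks (by the distance between their centers) are repeatedly taken
// out and merged, so n' blocks become ceil(n'/2) blocks per round.
public class HierarchicalRound {
  static List<List<double[]>> mergeRound(List<List<double[]>> blocks) {
    List<List<double[]>> next = new ArrayList<>();
    List<List<double[]>> current = new ArrayList<>(blocks);
    while (current.size() >= 2) {
      int bi = 0, bj = 1;
      double best = Double.MAX_VALUE;
      for (int i = 0; i < current.size(); i++)        // Step 1: closest pair of blocks
        for (int j = i + 1; j < current.size(); j++) {
          double d = distance(center(current.get(i)), center(current.get(j)));
          if (d < best) { best = d; bi = i; bj = j; }
        }
      List<double[]> merged = new ArrayList<>(current.get(bi));
      merged.addAll(current.get(bj));                 // Step 2: merge the two blocks
      current.remove(bj);                             // remove the larger index first
      current.remove(bi);
      next.add(merged);
    }
    if (!current.isEmpty()) next.add(current.get(0)); // odd n': keep the leftover block
    return next;
  }

  static double[] center(List<double[]> block) {      // the center may not be a data point
    double[] c = new double[block.get(0).length];
    for (double[] x : block) for (int k = 0; k < c.length; k++) c[k] += x[k] / block.size();
    return c;
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
    return Math.sqrt(s);
  }

  public static void main(String[] args) {
    List<List<double[]>> blocks = new ArrayList<>();
    double[][] x = { {0, 0}, {0.1, 0}, {5, 5}, {5.1, 5}, {9, 0} };
    for (double[] p : x) blocks.add(new ArrayList<>(List.of(p)));
    while (blocks.size() > 2) blocks = mergeRound(blocks);  // stop at exactly 2 blocks
    System.out.println("Block sizes: " + blocks.get(0).size() + " and " + blocks.get(1).size());
  }
}
```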

2.4 Graph-based technique

In graph-based techniques, the data is organized as a weighted graph, where each edge represents a direct connection between nodes (i.e., instances). To use this technique, we should first construct a weighted graph for a data matrix, as defined in the beginning of this section. During the learning process, we should construct a weighted graph for the current block \({\mathbf {B}}\).

For instance \({\mathbf {x}}_i\), we find its k nearest neighbors to build direct connections. In our experiments, we set \(k = 2\). Let instance \({\mathbf {x}}_j\) be one of the k nearest neighbors. The edge between \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) has the weight

$$\begin{aligned} w_{ij} = \left\{ \begin{array}{ll}MW, &{} \text { if } d_{ij} = 0;\\ 1 / d_{ij}, &{} \text { otherwise,} \end{array}\right. \end{aligned}$$
(12)

where MW is a constant which is set to 10,000 in our experimentation. By assigning 0 to all other weights, we obtain the initial adjacency matrix \(W = (w_{ij})_{|{\mathbf {B}}| \times |{\mathbf {B}}|}\).

Next, we let \(w_{ij} = \max \{w_{ij}, w_{ji}\}\). In this way, W becomes a symmetric matrix, which is the standard input for graph-based techniques.

We adopt the simple version of random walks described in [30]. The number of resulting blocks cannot be controlled by the algorithm. Hence in our implementation, the largest block is assigned to \({\mathbf {B}}_1\), and the other instances form \({\mathbf {B}}_2 = {\mathbf {B}} \setminus {\mathbf {B}}_1\).
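
The graph construction step can be sketched in Java as follows, assuming Euclidean distance; the names are illustrative and the random-walk partitioning itself is not shown.

```java
import java.util.Arrays;

// Illustrative sketch of the graph construction for the random-walk technique:
// each instance is connected to its k = 2 nearest neighbors with the weight of
// Eq. (12), and the matrix is then symmetrized with w_ij = max(w_ij, w_ji).
public class BlockGraph {
  static final double MW = 10000;                     // weight for coincident instances

  static double[][] buildAdjacency(double[][] data, int[] block, int k) {
    int n = block.length;
    double[][] w = new double[n][n];
    for (int a = 0; a < n; a++) {
      // Sort the block positions by distance to block[a].
      Integer[] order = new Integer[n];
      for (int b = 0; b < n; b++) order[b] = b;
      final int aa = a;
      Arrays.sort(order, (p, q) -> Double.compare(
          distance(data[block[aa]], data[block[p]]),
          distance(data[block[aa]], data[block[q]])));
      for (int t = 1; t <= k && t < n; t++) {         // order[0] is the instance itself
        int b = order[t];
        double d = distance(data[block[a]], data[block[b]]);
        w[a][b] = (d == 0) ? MW : 1.0 / d;            // Eq. (12)
      }
    }
    for (int a = 0; a < n; a++)                       // symmetrize the matrix
      for (int b = 0; b < n; b++)
        w[a][b] = w[b][a] = Math.max(w[a][b], w[b][a]);
    return w;
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(s);
  }

  public static void main(String[] args) {
    double[][] x = { {0, 0}, {1, 0}, {0, 1}, {5, 5} };
    double[][] w = buildAdjacency(x, new int[] {0, 1, 2, 3}, 2);
    for (double[] row : w) System.out.println(Arrays.toString(row));
  }
}
```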

3 The algorithm

In this section, we first present the algorithm framework. Then we analyze some key issues of the algorithm, including instance selection strategies, clustering quality evaluation, and other special case handling techniques.

3.1 Algorithm description

Algorithm 1 The TACS algorithm

Algorithm 1 lists the TACS algorithm. The initialization is implemented in Lines 1–3. Line 1 sets the current block \({\mathbf {B}}\) as the whole dataset. Line 2 indicates that all instances are unlabeled. Line 3 calculates the number of queries \(N_q\). Line 4 calculates the number of representatives \(N_r\). The user does not specify \(N_q\) and \(N_r\) directly, since qi is easier for the user to specify, and qr, the proportion of representative instances, can be dataset independent.

Instance selection strategies are implemented in Lines 5–7, 14 and 24. Inspired by the density peaks algorithm [25], the representativeness of an instance is quantified by the product of its density and its distance to the master. The informativeness of an instance is quantified by the total distance to the labeled instances of the same block. These measures will be discussed in more detail in Sect. 3.2.

According to the tree structure of the learning process, there are two possible implementations. One is depth-first using a stack, while the other is breadth-first using a queue. Since labels are often scarce, the depth-first approach tends to spend them on small blocks within a few branches. In contrast, the breadth-first approach distributes labels more evenly among different big blocks. Therefore, we choose the breadth-first approach for balanced querying. Line 9 initializes the queue. Line 10 judges whether all impure blocks are processed. Line 11 takes a block out of the queue. Lines 21–22 append two sub-blocks to the queue.

There are two pruning techniques. The first one deals with small blocks. As indicated in Lines 12–14, small blocks are abandoned in the while loop. This is because small blocks are not suitable for clustering-based active learning. The second one deals with the lack of labels. As indicated by the second condition of Line 10, if the number of queries reaches \(N_q\), the remaining blocks in the queue are suspended. Due to these pruning techniques, some instances are still not handled. They are classified using kNN, as indicated in Line 27.
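
To make the control flow concrete, the following Java sketch reproduces the queue-based breadth-first loop with both pruning rules on a toy one-dimensional dataset. It is a simplified skeleton, not the actual Algorithm 1: the clustering selection and representative selection steps are replaced by trivial stand-ins (splitting a block in half and querying evenly spaced positions), and all names are ours.

```java
import java.util.*;

public class TacsSkeleton {
  public static void main(String[] args) {
    int n = 40;
    int[] truth = new int[n];                          // hidden oracle labels
    for (int i = 0; i < n; i++) truth[i] = (i < n / 2) ? 0 : 1;
    Integer[] predicted = new Integer[n];              // queried or classified labels
    int budget = (int) Math.floor(n * 0.2);            // N_q = floor(n * qi), qi = 0.2
    int queries = 0, smallBlockThreshold = 4;

    Deque<List<Integer>> queue = new ArrayDeque<>();   // breadth-first visit
    List<Integer> whole = new ArrayList<>();
    for (int i = 0; i < n; i++) whole.add(i);
    queue.add(whole);

    while (!queue.isEmpty() && queries < budget) {     // second pruning rule: budget
      List<Integer> block = queue.poll();
      if (block.size() < smallBlockThreshold) continue;   // first pruning rule: small blocks
      // Query a few instances (stand-in: evenly spaced positions in the block).
      Set<Integer> labels = new HashSet<>();
      for (int r = 0; r < 3 && queries < budget; r++) {
        int i = block.get(r * (block.size() - 1) / 2);
        if (predicted[i] == null) { predicted[i] = truth[i]; queries++; }
        labels.add(predicted[i]);
      }
      if (labels.size() == 1) {                        // pure block: classify it
        int label = labels.iterator().next();
        for (int i : block) if (predicted[i] == null) predicted[i] = label;
      } else if (labels.size() > 1) {                  // impure block: split and enqueue
        queue.add(new ArrayList<>(block.subList(0, block.size() / 2)));
        queue.add(new ArrayList<>(block.subList(block.size() / 2, block.size())));
      }                                                // no label obtained: leave for kNN
    }
    // Remaining instances are classified by kNN (stand-in: nearest labeled index).
    for (int i = 0; i < n; i++)
      if (predicted[i] == null)
        for (int d = 1; d < n && predicted[i] == null; d++) {
          if (i - d >= 0 && predicted[i - d] != null) predicted[i] = predicted[i - d];
          else if (i + d < n && predicted[i + d] != null) predicted[i] = predicted[i + d];
        }
    System.out.println("Queries used: " + queries + "/" + budget);
    System.out.println("Predictions: " + Arrays.toString(predicted));
  }
}
```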

3.2 Instance selection strategies

As shown in Line 15 of the algorithm, in each round, the active learner should select a number of critical instances to query. This is done by sorting the unlabeled instances according to their representativeness in descending order. In the current block \({\mathbf {B}}\), let the set of labeled instances be \({\mathbf {B}}_l\) and the other part be \({\mathbf {B}}_u\). We propose two strategies to evaluate each instance \({\mathbf {x}} \in {\mathbf {B}}_u\).

The local density \(\rho _i\) of \({\mathbf {x}}_i \in {\mathbf {B}}_u\) is computed by Eq. (6). As discussed in Sect. 2.3.2, it is computed once and reused directly to reduce the time complexity. The distance to master is redefined as

$$\begin{aligned} \delta _i({\mathbf {B}}) = \min _{j \in {\mathbf {B}}, \rho _j > \rho _i} d_{ij}. \end{aligned}$$
(13)

With the density peak strategy, the representativeness of \({\mathbf {x}}_i\) in \({\mathbf {B}}\) is measured by

$$\begin{aligned} \gamma _i({\mathbf {B}}) = \rho _i \delta _i({\mathbf {B}}). \end{aligned}$$
(14)

With the distance-sum-based strategy, the informativeness of \({\mathbf {x}}_i\) in the current block is measured by

$$\begin{aligned} ds_i({\mathbf {B}}_l) = \sum _{{\mathbf {x}}_j \in {\mathbf {B}}_l} d_{ij}. \end{aligned}$$
(15)

In a word, the density peak strategy tends to query central instances, while the distance-sum-based strategy tends to query informative ones.
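
The two measures can be sketched in Java as follows, assuming precomputed pairwise distances and densities; for the densest instance of a block, whose \(\delta\) in Eq. (13) is undefined, we follow the usual density-peaks convention and use the maximal distance. The names are illustrative.

```java
// Illustrative sketch of the two instance-selection measures of Sect. 3.2:
// gamma_i(B) = rho_i * delta_i(B) from Eq. (14) and the distance sum
// ds_i(B_l) from Eq. (15).
public class SelectionMeasures {
  // Eqs. (13) and (14): representativeness within the current block.
  static double gamma(int i, int[] block, double[][] dist, double[] rho) {
    double delta = Double.MAX_VALUE;
    for (int j : block)
      if (rho[j] > rho[i] && dist[i][j] < delta) delta = dist[i][j];
    if (delta == Double.MAX_VALUE) {
      // The densest instance of the block: use the maximal distance by convention.
      delta = 0;
      for (int j : block) delta = Math.max(delta, dist[i][j]);
    }
    return rho[i] * delta;
  }

  // Eq. (15): informativeness as the total distance to the labeled instances.
  static double distanceSum(int i, int[] labeled, double[][] dist) {
    double sum = 0;
    for (int j : labeled) sum += dist[i][j];
    return sum;
  }

  public static void main(String[] args) {
    double[][] dist = {
      {0.0, 1.0, 4.0},
      {1.0, 0.0, 3.5},
      {4.0, 3.5, 0.0} };
    double[] rho = { 2.0, 3.0, 1.0 };
    int[] block = { 0, 1, 2 };
    int[] labeled = { 1 };
    for (int i : block)
      System.out.printf("x%d: gamma = %.2f, ds = %.2f%n",
          i, gamma(i, block, dist, rho), distanceSum(i, labeled, dist));
  }
}
```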

3.3 Clustering quality evaluation

There are both internal and external methods to evaluate the quality of a clustering algorithm. Internal methods do not rely on any class label, while external methods rely on the labels of all instances. During the active learning process, only part of the labels have been queried. We therefore propose a new measure to evaluate the quality of clustering.

Suppose that with the clustering algorithm, a block \({\mathbf {B}} \subseteq {\mathbf {U}}\) is partitioned into a set of blocks \({\mathcal {B}} = \{{\mathbf {B}}_1, {\mathbf {B}}_2, \ldots , {\mathbf {B}}_k\}\), i.e., \(\cup _{i = 1}^k {\mathbf {B}}_i = {\mathbf {B}}\) and \(\forall 1 \le i < j \le k\), \({\mathbf {B}}_i \cap {\mathbf {B}}_j = \emptyset\). Let the set of labeled instances in \({\mathbf {B}}_i\) be \({\mathbf {Q}}_i\) for any \(1 \le i \le k\).

Given \(x_l \in {\mathbf {B}}_i \setminus {\mathbf {Q}}_i\), the label of \(x_l\) is unknown. Let \(n(x_l, {\mathbf {Q}}_i)\) be the nearest neighbor of \(x_l\) in \({\mathbf {Q}}_i\), namely

$$\begin{aligned} n(x_l, {\mathbf {Q}}_i) = \mathop {\hbox {argmin}}\limits _{x_m \in {\mathbf {Q}}_i} d_{lm}. \end{aligned}$$
(16)

Inspired by the 1-nearest-neighbor (1NN) rule, let the pseudo label of x be \(d'(x) = d(n(x, {\mathbf {Q}}_i))\), where \(d(\cdot)\) denotes the actual label of a queried instance. Now every instance in \({\mathbf {B}}_i \setminus {\mathbf {Q}}_i\) has a pseudo label.

The number of actual labels of the jth class in \({\mathbf {B}}_i\) is

$$\begin{aligned} l({\mathbf {B}}_i, j) = |\{x \in {\mathbf {Q}}_i | d(x) = l_j\}|, \end{aligned}$$
(17)

and the number of pseudo labels of the jth class in \({\mathbf {B}}_i\) is

$$\begin{aligned} l'({\mathbf {B}}_i, j) = |\{x \in {\mathbf {B}}_i \setminus {\mathbf {Q}}_i | d'(x) = l_j\}|. \end{aligned}$$
(18)

Let

$$\begin{aligned} f_\alpha ({\mathbf {B}}_i, j) = \frac{l({\mathbf {B}}_i, j) + \alpha l'({\mathbf {B}}_i, j)}{|{\mathbf {Q}}_i| + \alpha |{\mathbf {B}}_i \setminus {\mathbf {Q}}_i|}, \end{aligned}$$
(19)

where \(\alpha \in [0, 1]\) is a weight, and

$$\begin{aligned} g_\alpha ({\mathbf {B}}_i, j) = \left\{ \begin{array}{ll}0, &{} \text { if } f_\alpha ({\mathbf {B}}_i, j) = 0;\\ f_\alpha ({\mathbf {B}}_i, j) \log f_\alpha ({\mathbf {B}}_i, j), &{} \text {otherwise.}\end{array} \right. \end{aligned}$$
(20)

The weighted information entropy of \({\mathbf {B}}_i\) is given by

$$\begin{aligned} H_\alpha ({\mathbf {B}}_i) = -\sum _{j = 1}^c g_\alpha ({\mathbf {B}}_i, j). \end{aligned}$$
(21)

Finally, the weighted information entropy of \({\mathcal {B}}\) is given by

$$\begin{aligned} H_\alpha ({\mathcal {B}}) = \sum _{i = 1}^k \frac{|{\mathbf {B}}_i|}{|{\mathbf {B}}|}H_\alpha ({\mathbf {B}}_i). \end{aligned}$$
(22)
Fig. 2 An example of data and label distribution

Now we discuss the computation of information entropy through a running example.

Example 1

As illustrated in Fig. 2, \({\mathbf {B}}\) is partitioned into \({\mathcal {B}} = \{{\mathbf {B}}_1, {\mathbf {B}}_2\}\). There are only positive labels in \({\mathbf {B}}_1\), so the other 6 instances have positive pseudo labels, and \(H_{0.7}({\mathbf {B}}_1) = 0\).

There are 12 unlabeled instances in \({\mathbf {B}}_2\): 4 are close to the positive instance and 8 are close to the negative instance. So there will be 5 positive and 9 negative labels, including both pseudo and actual ones. We have \(H_{0.7}({\mathbf {B}}_2) = -\frac{1 + 8\times 0.7}{2 + 12 \times 0.7} \log \frac{1 +8\times 0.7}{2 + 12 \times 0.7} - \frac{1 + 4 \times 0.7}{2 + 12 \times 0.7} \log \frac{1 + 4 \times 0.7}{2 + 12 \times 0.7} = -\frac{6.6}{10.4} \log \frac{6.6}{10.4} - \frac{3.8}{10.4} \log \frac{3.8}{10.4} = 0.285.\)

Finally, \(H_{0.7}({\mathcal {B}}) = 0 + \frac{2 + 12 \times 0.7}{4 + 18 \times 0.7}H_{0.7}({\mathbf {B}}_2) = 0 + 0.178 = 0.178,\) where each block is weighted by its weighted size \(|{\mathbf {Q}}_i| + \alpha |{\mathbf {B}}_i \setminus {\mathbf {Q}}_i|\).
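
For reference, the following Java snippet evaluates Eqs. (19)–(21) for a single block and reproduces the value 0.285 of Example 1. The base-10 logarithm is used here because it matches the numbers in the example; the choice of base only rescales the measure. The names are illustrative.

```java
// Numerical check of Example 1: the weighted entropy of a single block given
// the counts of actual and pseudo labels per class.
public class WeightedEntropy {
  // actual[j] = l(B_i, j), pseudo[j] = l'(B_i, j).
  static double blockEntropy(int[] actual, int[] pseudo, double alpha) {
    double labeled = 0, unlabeled = 0;
    for (int j = 0; j < actual.length; j++) { labeled += actual[j]; unlabeled += pseudo[j]; }
    double denominator = labeled + alpha * unlabeled;          // |Q_i| + alpha |B_i \ Q_i|
    double h = 0;
    for (int j = 0; j < actual.length; j++) {
      double f = (actual[j] + alpha * pseudo[j]) / denominator;  // Eq. (19)
      if (f > 0) h -= f * Math.log10(f);                         // Eqs. (20) and (21)
    }
    return h;
  }

  public static void main(String[] args) {
    // Block B_2 of Example 1: one actual and four pseudo positive labels,
    // one actual and eight pseudo negative labels, alpha = 0.7.
    int[] actual = { 1, 1 };
    int[] pseudo = { 4, 8 };
    System.out.printf("H_0.7(B_2) = %.3f%n", blockEntropy(actual, pseudo, 0.7));
  }
}
```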

3.4 Other issues

Three more issues are handled as follows.

(1) How to control the number of queries? According to the problem statement, the maximal number of queries is \(\lfloor n \times qi\rfloor\). Therefore, the while loop should terminate as soon as the query budget runs out. In fact, Line 15 should be extended in the algorithm implementation to enable loop termination.

(2) How to classify small cluster blocks? The generation of small cluster blocks indicates that our approach is not suitable for them. Hence we leave them alone in the while loop, and use kNN to classify them at the end of the algorithm.

(3) How to classify the remaining unprocessed instances? With a limited number of queries, there are often some unprocessed blocks after the while loop. We also use kNN to classify the instances in them. Note that the neighbors are searched globally instead of within the current block, as illustrated by the sketch below.
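
A minimal Java sketch of this global kNN fallback follows; it assumes Euclidean distance and majority voting among the k nearest labeled instances, and the names are ours rather than those of the implementation.

```java
import java.util.*;

// Illustrative sketch of the final kNN fallback: instances left unprocessed by
// the while loop are classified by the majority label of their k nearest
// labeled instances, searched over the whole dataset rather than the block.
public class KnnFallback {
  static int classify(double[] x, double[][] data, Integer[] labels, int k) {
    // Collect the indices of all labeled instances (global search).
    List<Integer> labeled = new ArrayList<>();
    for (int i = 0; i < data.length; i++) if (labels[i] != null) labeled.add(i);
    labeled.sort(Comparator.comparingDouble(i -> distance(x, data[i])));
    Map<Integer, Integer> votes = new HashMap<>();
    for (int t = 0; t < k && t < labeled.size(); t++)
      votes.merge(labels[labeled.get(t)], 1, Integer::sum);
    return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(s);
  }

  public static void main(String[] args) {
    double[][] data = { {0, 0}, {0.2, 0.1}, {5, 5}, {5.1, 4.9}, {2.4, 2.6} };
    Integer[] labels = { 0, 0, 1, 1, null };             // the last instance is unprocessed
    labels[4] = classify(data[4], data, labels, 3);
    System.out.println("Classified label: " + labels[4]);
  }
}
```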

4 Experiments

In this section, we report the results of experiments that analyze the effectiveness of the TACS algorithm. Through the experiments, we aim to answer the following questions:

(1) Is the TACS algorithm more accurate than popular supervised classification algorithms?

(2) Is the TACS algorithm more accurate than popular active learning algorithms?

(3) Is the TACS algorithm more accurate than single clustering technique based algorithms?

(4) Can the TACS algorithm select appropriate base clustering techniques?

(5) Is the TACS algorithm efficient?

The experimental environment is a Windows 10 64-bit operating system, 8 GB memory, Intel (R) Core 2 Quad CPU Q9500@2.83 GHz. The source code with a graphical user interface (GUI) written in Java is available at github.com/fansmale/tacs.

4.1 Dataset and parameter setting

Table 2 summarizes the 12 datasets used in our experiments. Three synthetic datasets come from [25], and the others are from the UC Irvine ML repository [34]. The application areas range from biology to daily life. The number of attributes per dataset ranges from 2 to 10,000.

Table 2 Dataset information

The initial value of rt, introduced in Eq. (8), is also presented in the table. Some datasets are sorted by class label. For example, in the Iris dataset, the first 50 instances have the label “Iris-setosa”, the second 50 instances have the label “Iris-versicolor”, and the last 50 instances have the label “Iris-virginica”. Other datasets such as Spiral have the same characteristic. Unfortunately, the result may be influenced by this order. One reason is that when multiple instances have the highest (or lowest) value, we tend to use the first one. To break this bias, we reordered each dataset before the learning process by randomly swapping some instance pairs. Consequently, the results tend to differ between runs. Memory overflow occurred while testing TACS and the comparison algorithms on large datasets. Hence for DLA, we sampled 1% of the instances to form a new dataset. The class distribution is maintained after sampling.

4.2 Comparison with supervised classification algorithms

Table 3 Accuracy comparison with nine supervised algorithms (in %)

Table 3 compares TACS with nine popular supervised classification algorithms. These algorithms are implemented in the Weka platform. C4.5 is the implementation of the C4.5 algorithm [31, 35]. NB is the implementation of the Naïve Bayes algorithm [32]. CVR is the implementation of the Classification Via Regression algorithm [36]. RF is the implementation of the Random Forest algorithm [37]. ABM is the implementation of the Discrete AdaBoost algorithm [38]. LB is the implementation of the LogitBoost algorithm [39]. Bagging is the implementation of the Bagging algorithm [31]. MCC is the implementation of the Multiple Class Classifier algorithm [40]. FC is the implementation of the Filtered Classifier algorithm [41].

For these algorithms, we randomly selected training sets with the sizes indicated in Table 2. The average ranks were obtained by applying the Friedman test [42], which is the most well-known non-parametric test. The Friedman test analyzes whether there are significant differences among the algorithms. The best results are highlighted in boldface. Here are some detailed observations. (a) TACS generally outperformed the existing supervised classification algorithms. In particular, TACS had the highest accuracy on seven datasets, and its mean rank is 2.16. For the Spiral dataset, the accuracy reached 100%, whereas none of the other algorithms reached 100% with only 10% of the labels. (b) There were a few cases in which TACS performed worse than some of the other algorithms. For example, the NB algorithm had better accuracy than TACS on the Twonorm and DLA datasets. However, the difference in classification accuracy was less than 1%.

4.3 Comparison with other active learning algorithms

In this section, we compare the TACS algorithm with three state-of-the-art active learning algorithms: query-by-committee (QBC) [6], the kernel version of QBC (KQBC) [43], and MAED [33]. QBC is probably the most notable algorithm in active learning [6]. It is an ensemble approach that requires the establishment of a voting committee [5]. Gilad-Bachrach et al. [43] presented the KQBC algorithm to decrease the runtime. MAED [33] is an uncertainty-based active learning algorithm. It minimizes the expected error to select the most discriminative instances for labeling.

Fig. 3 Comparison of the learning curves of TACS and three popular active learning algorithms

For QBC, KQBC, and MAED, we adopted the source code provided by the authors and the best settings given in references [6, 33, 43] to ensure the best performance. The KQBC and MAED code ran on the Matlab platform, while the QBC code ran on the Java platform. QBC and KQBC use an ensemble of classifiers, including kNN, J48 and NB, and the final result is determined by the committee vote.

Figure 3 depicts the learning curves of the TACS algorithm and the three active learning algorithms. Each value is the average of 50 runs of the corresponding algorithm. Some detailed observations are as follows. (a) TACS converges faster than the other algorithms. For example, for the Flame dataset, it reached 100% classification accuracy by labeling only 10\(\%\) of the instances, while the QBC, MAED and KQBC algorithms were unable to reach 100% accuracy. This indicates that TACS is more appropriate for the cold-start situation. (b) On 9 out of 12 datasets, TACS is more accurate than the other algorithms. (c) There are also some cases in which TACS performs worse. For example, MAED outperforms TACS on the R15, Heart and USps_2_6 datasets.

4.4 Comparison with clustering-based active learning algorithms

Table 4 Accuracy comparison with algorithms based on single clustering technique (mean ± std in %)
Table 5 The average number of times each clustering technique was selected

We compare the new approach with the ones based on only one clustering technique. Table 4 compares the accuracy of the TACS algorithm with that of six clustering-based active learning algorithms. Each algorithm is run ten times to obtain the average accuracy and standard deviation. These six methods adopt the same active learning strategy as TACS. The difference is that each of them adopts a single classical clustering algorithm, while TACS adopts a clustering selection strategy. Table 5 shows the average number of times each clustering technique was selected. It not only shows how often each clustering technique is used, but also indicates the actual number of iterations of the active learning process.

From Tables 4 and 5, we observe that the TACS algorithm selects appropriate base clustering techniques. According to the Friedman test, TACS generally outperforms the other algorithms, with an average rank of 1.58, which is the best on these datasets. TACS achieved the best performance on eight of the 12 datasets, while kMeans achieved good performance on the Twonorm and Ionosphere datasets. In some cases, different clustering techniques produce the best result for the given block, and the respective counters are all incremented by 1.

4.5 Algorithm efficiency

The efficiency of TACS depends mainly on the base clustering techniques. As shown in Table 5, for the DLA and Letter datasets, TACS only needs to run fewer than 100 rounds of clustering. Therefore, the time complexity of TACS can be regarded as approximately the sum of those of the base clustering techniques. Specifically, in our implementation, the time complexity is \(O(mn^2)\) when RW is not selected.

Fig. 4 The runtime of TACS with different base clustering techniques

Figure 4 depicts the runtime on the DLA dataset as the number of instances n changes. The curve labeled TACS indicates that the five remaining clustering techniques are all available for selection. The other versions are restricted to specific clustering techniques; for example, TACS-kMeans indicates that only kMeans is available for splitting each block. Note that RW was excluded from this experiment for two reasons. (a) RW is very time consuming since we need to construct a graph from the original data. Hence it is inappropriate for large datasets. (b) RW rarely outperforms the other clustering techniques. This issue has been analyzed in more detail in Sect. 4.4.

Here are some detailed observations. (a) When the number of instances is doubled, the runtime increases to about 4 times. This shows that the time complexity is proportional to \(n^2\). In other words, the TACS algorithm is scalable. (b) Different versions share the same trend, showing the effectiveness of the acceleration techniques presented in Sect. 2. (c) The runtime of TACS is smaller than the sum of the other versions. This is due to the fact that some overhead is shared.

4.6 Discussions

Now we can answer the questions proposed at the beginning of this section.

(1) TACS is more accurate than popular supervised classification algorithms, including C4.5, NB, RF, etc. This is validated by Table 3. Unfortunately, on some datasets such as Ionosphere, it is significantly worse than some other algorithms such as RF. The reason may be that clustering techniques do not perform well on those datasets.

(2) TACS is more accurate than popular active learning algorithms, including QBC, MAED, and KQBC. This is validated by Fig. 3. It was, however, defeated by MAED on the Heart dataset. The reason may be that for some datasets, informative instances are more important than representative ones.

(3) TACS is more accurate than single clustering technique based algorithms. This is validated by Table 4. It is the best or the second best on all datasets.

(4) In most cases, TACS can find appropriate base clustering techniques. This is validated by Table 5.

(5) TACS is efficient as long as we do not employ inefficient clustering techniques. This is validated by Table 5 and Fig. 4.

5 Related work

This section discusses some related work, including active learning, clustering-based active learning and three-way decision. Their relationships with this work will be emphasized.

5.1 Active learning

Active learning [4, 44] is a special form of semi-supervised learning [45, 46], which in turn is a special form of classification. Classification tasks typically require a large amount of labeled data to form a training set for building a classifier. However, labeled data are difficult, expensive, or time consuming to obtain [46]. Semi-supervised learning uses abundant unlabeled data together with a few labeled ones to build a better classifier. When man–machine interaction is available, active learning obtains key labels to further improve classification accuracy.

The key issue for active learning is: which instances are critical and therefore should be labeled? General answers include informative ones [6, 47], diverse ones [7, 48], representative ones [10, 49], and their combinations [11, 12, 48]. Uncertainty sampling [50], query-by-committee [6], and query by margin [51] are classical approaches to finding informative instances. Fuzzy-rough approaches [7, 48] have been developed to find diverse ones. Clustering-based active learning [10, 49], as the basis of this work, tends to find representative ones.

There are a number of typical active learning scenarios. The classical scenario [10] specifies the number of labels provided by the oracle. It is explicitly defined in Problem 1. The cost-sensitive active learning scenario [13] involves query costs and misclassification costs. The learning task is to achieve a trade-off between them to minimize the total cost. The active learning from data streams scenario [52] involves the dynamic nature of the data. The learner should query instances based on the data observed so far to maximize the classification accuracy on future instances. Moreover, instances usually must be queried immediately to save storage.

5.2 Three-way decision

Three-way decision [15] is a general methodology for problem solving, especially in data mining and machine learning. It is rooted in the decision-theoretic rough set model [53], became more independent since 2010 [15], and matured in 2018 with the TAO model [16]. Liu et al. [54] enriched the theory to deal with incomplete information systems. Hu [55] established three-way decision spaces based on axiomatic definitions of the decision measurement, decision condition and evaluation function. Fang et al. [56] discussed cost-sensitive approximate attribute reduction with three-way decisions. Zhang et al. [57] introduced a general model of decision-theoretic three-way approximations of fuzzy sets based on the particle swarm optimization algorithm.

It has been connected with other theories to form new ones. Li et al. [58] investigated three-way cognitive concept learning from the perspective of multigranularity information fusion. Shivhare et al. [59] explored the process of three-way cognitive concept learning to achieve a more precise recall. Qi et al. [60] proposed three-way formal concept analysis to reveal more detailed information from formal contexts. Zhi et al. [61] extended dual formal concept with three-way decision.

It has also inspired the presentation or solution of machine learning tasks. Zhang et al. [62] developed a three-way regression approach for recommender systems. Yu et al. [63] proposed an active three-way clustering method via low-rank matrices for multi-view data. Li et al. [17] designed a deep neural network based sequential granular feature extraction method for image processing. Jia et al. [64] applied the three-way decision-theoretic rough set model to address multi-class cost-sensitive learning problems. Min et al. [65] introduced tri-patterns and provided an efficient mining algorithm for different types of sequential data.

Three-way decision was first applied to active learning in [66]. Since then, three-way active learning has served as a special form of clustering-based active learning. Wang et al. [10] introduced the first three-way active learning algorithm. It can be viewed as a special case of the TACS algorithm in which only the DP clustering technique with the Gaussian kernel is available. Wu et al. [13] studied the cost-sensitive active learning scenario under the uniform label distribution assumption. Wang et al. [14] explored further assumptions and established a statistical model. As mentioned in the introduction, there are three possible actions for each block. Due to the iterative process, the paradigm can be better represented by sequential three-way decision [17, 18].

6 Conclusions and further work

This study has proposed the TACS algorithm to dynamically select the appropriate clustering technique during the active learning process. A number of techniques were discussed, including clustering selection, query balancing, and tree pruning. Experimental results verify the effectiveness of the algorithm.

The following research topics deserve further investigation:

(1) More clustering techniques in the algorithm framework. Currently, only a few clustering techniques have been incorporated into TACS. Other techniques can be incorporated to accommodate data with different shapes. In addition, these techniques should be fine-tuned or modified to fit the framework of the algorithm.

(2) Better evaluation measures to select clustering techniques. TACS uses weighted entropy to evaluate the quality of clustering. New measures based on the Gini index might be good alternatives. More sophisticated measures can be designed to consider other information such as block size ratios and data distribution.

(3) Clustering ensemble techniques for active learning. TACS only selects the currently “best” clustering technique. By designing clustering ensemble techniques, new blocks can be obtained from these different techniques. Hence it is possible to obtain more stable blocks of better quality. Moreover, different classification strategies can be employed for instances inside or outside stable blocks.

In summary, TACS is a general algorithm framework that can be enriched in the future.