1 Introduction

Machine learning studies the generation of “models” from data, but a necessary prerequisite for generating effective models is sufficient high-quality data. However, in many practical applications, obtaining enough labeled instances is time-consuming and expensive. Therefore, learning an effective model from a large quantity of data with few labels is of great significance. Active learning is the most popular method for solving this problem: it actively selects the most “valuable” samples for the model and adds them to the training set, aiming to train the expected model with as little data labeling cost as possible. In recent years, active learning has been widely applied to object categorization [1], image retrieval and classification [2,3,4], speech recognition [5], multilabel annotation [6], and feature representation [7].

The core of active learning is the sample selection strategy, and reasonable sample selection can effectively reduce the data labeling cost. Cluster-based sample selection is one of the most popular methods. McCallum and Nigam [8] proposed a general schema of cluster-based active learning by constructing hierarchical cluster trees where each leaf corresponds to a sample and each inner node corresponds to a cluster. The process of active learning involves pruning the main tree and randomly selecting samples from each cluster. Dasgupta and Hsu [9] proposed a method based on hierarchical clustering. Wang [10] proposed a method based on density peak clustering. Cluster-based approaches do not require additional classifiers and make full use of unlabeled data, avoiding the problem of sampling bias.

However, the performance of cluster-based active learning algorithms depends heavily on the quality of the clustering results. The density peak clustering algorithm calculates the local density as the number of samples within the cutoff distance, and an appropriate cutoff distance can greatly improve the quality of the clustering results. However, it is difficult to set a cutoff distance that is appropriate for all samples, especially if the dataset is unbalanced [11]. In addition, most existing cluster-based methods use a deterministic sample selection strategy that selects the same number of key samples in each cluster to be classified. This strategy ignores the distribution information of the clusters to be classified and inevitably selects redundant samples that contribute little to model performance.

To solve the above problems, an adaptive active learning algorithm based on optimized density clustering, named AAKC (adaptive active learning through k-nearest neighbor optimized local density clustering), is proposed in this paper. Firstly, we introduce the k-nearest neighbor information of each instance to redefine its local density. This improves the stability of the clustering results and, more importantly, better describes the local data structure, which makes it easier to find the most representative samples. Secondly, we propose an adaptive instance selection strategy. This strategy adaptively determines the number of samples to be selected in each cluster, avoiding the choice of redundant data. Finally, we compare the proposed algorithm with ten commonly used supervised learning algorithms and eight of the most popular active learning algorithms. The comparative experiments show that AAKC achieves higher classification accuracy and good stability using fewer labels on datasets with different numbers of instances, dimensions, and clusters.

2 Related work

2.1 The density peak clustering algorithm

Rodriguez and Laio [12] proposed the density peak clustering algorithm (DPC), which can automatically find cluster centers and efficiently cluster data of arbitrary shape. It is based on two straightforward assumptions about cluster centers: 1) a cluster center's local density is greater than the local density of its neighbors; 2) the distance between the centers of different clusters is relatively large. The DPC algorithm uses a decision graph to select cluster centers. The decision graph is generated from two fundamental attributes of each instance, ρi and δi. For an instance xi, the local density ρi is defined as follows:

$$ \rho_{i}=\sum\limits_{j=1}^{|U|}\chi(d_{ij} - d_{c}). $$
(1)

where dij is the distance between instances xi and xj, and dc is the cutoff distance. The value of dc is usually chosen empirically so that the average number of neighbors of a sample is approximately \(1\sim 2\%\) of the total number of instances. χ(x) = 1 when x < 0, and χ(x) = 0 otherwise. δi is calculated as follows:

$$ \delta_{i}=\underset{j:\rho_{j}>\rho_{i}}{\min}(d_{ij}). $$
(2)

δi denotes the distance between xi and its nearest neighbor with a higher ρi. In addition, for the xi with the largest local density ρi, we conventionally take

$$ \delta_{i}=\underset{j}{\max}(d_{ij}). $$
(3)

The decision graph uses ρi as the x-axis and δi as the y-axis. According to the decision graph, instances with larger ρi and δi are selected as cluster centers, and the remaining instances are clustered.
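For concreteness, the following minimal NumPy sketch (not part of the original DPC paper; the function name and the brute-force pairwise-distance computation are our own choices) computes the two decision-graph quantities of (1)–(3):

```python
import numpy as np

def dpc_decision_values(X, dc):
    """Compute rho (Eq. 1) and delta (Eqs. 2-3) for every row of X,
    given a cutoff distance dc. A sketch for illustration only."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    # Eq. (1): count the neighbors closer than dc (the instance itself adds a
    # constant 1 to every rho, which does not change the ranking).
    rho = np.sum(d < dc, axis=1)
    # Eqs. (2)-(3): distance to the nearest instance with higher density;
    # the densest instance takes its maximum distance to any other instance.
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    return rho, delta
```

Instances with both large ρ and large δ are then picked as cluster centers from the (ρ, δ) decision graph.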

The clustering performance of the DPC algorithm is excellent, but it still has its limitations. Firstly, it is difficult for the DPC algorithm to find a cutoff distance dc that fits all samples, especially for unbalanced datasets. Secondly, the clustering result of the DPC algorithm is susceptible to the value of dc. If dc changes slightly, the clustering results may be completely different [11]. These shortcomings of the DPC algorithm will naturally be passed on to the active learning algorithm based on it.

3 The proposed method

Currently, most clustering-based active learning algorithms adopt a deterministic sample selection strategy, which inevitably wastes labels. In this paper, we propose a novel adaptive active learning algorithm based on optimized density clustering. Figure 1 illustrates the algorithm overview, and the pseudocode is summarized in Algorithm 1.

Fig. 1 The overview of the AAKC algorithm

Algorithm 1 Adaptive active learning through k-nearest neighbor optimized local density clustering

3.1 Initialization

Given a dataset \({U}=\{x_{i}\}_{i=1}^{n}\), let n and m be the numbers of instances and features, respectively. In the initialization, we search for the k nearest neighbors of each sample and use an exponential kernel function with width 𝜃 = 1 to redefine the local density based on this k-nearest neighbor information. This better represents the local structure of the data and avoids the detrimental effect of the cutoff distance on clustering and active learning. The k-nearest neighbor based local density ρknn(i) of xi is defined as follows:

$$ \rho_{knn(i)}=\sum\limits_{j\in {knn(i)}}\exp(-d_{ij}). $$
(4)

where dij is the Euclidean distance between xi and xj, and knn(i) is the set of the k nearest neighbors of xi. The smaller the sum of the distances between xi and the samples in knn(i), the greater the local density of xi, and vice versa. We also compute δi, the distance between xi and the nearest instance with a higher local density, as in (2) and (3).

In cluster-based active learning, if only the local density of the samples is considered, the algorithm usually selects only samples that are close to the center of a cluster. Although such samples are generally highly representative, they are also highly similar to each other. Selecting too many of them does not greatly improve the performance of active learning and wastes limited labeled data, which is unacceptable for active learning tasks with a limited labeling budget. To solve this problem, we define an index to measure the influence of a sample within its cluster, which is calculated as follows:

$$ \gamma_{i}=\rho_{knn(i)}\times\delta_{i} $$
(5)

In this way, each cluster center region contains only one sample with a large intracluster influence, which ensures the representativeness and diversity of the training set and avoids wasting labels.
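As a hedged illustration of this initialization step (the function below is our own sketch, not the authors' code; it assumes Euclidean distances and excludes an instance from its own neighbor set), ρknn, δ, and γ of (4), (2)–(3), and (5) can be computed as follows:

```python
import numpy as np

def aakc_initialization(X, k=5):
    """Sketch of the AAKC initialization: k-NN based local density (Eq. 4),
    distance to the nearest denser instance (Eqs. 2-3), and the
    intracluster influence gamma (Eq. 5)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # Euclidean distances
    # Eq. (4): sum of exp(-d_ij) over the k nearest neighbors of x_i.
    knn_idx = np.argsort(d, axis=1)[:, 1:k + 1]          # column 0 is the instance itself
    rho_knn = np.exp(-np.take_along_axis(d, knn_idx, axis=1)).sum(axis=1)
    # Eqs. (2)-(3): distance to the nearest instance with a higher density.
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho_knn > rho_knn[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    gamma = rho_knn * delta                              # Eq. (5)
    return d, rho_knn, delta, gamma
```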

3.2 Adaptive instance selection

The core of active learning is the instance selection strategy, which must answer two basic questions: which samples to choose, and how many samples to choose. A deterministic instance selection strategy usually selects the same number of samples in different clusters. This ignores the different requirements of different clusters. As a result, too many instances are selected in some clusters, which wastes the limited label budget and reduces the number of queryable clusters without greatly improving the algorithm's performance.

To solve this problem, we propose an adaptive instance selection strategy. Firstly, we perform an optimized local density clustering algorithm to divide the entire dataset into b clusters. The initial value of b is two, and its value is increased by one for each iteration.
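The paper does not spell out the assignment rule for non-center instances, so the sketch below assumes the standard DPC rule (each instance inherits the cluster label of its nearest denser neighbor), with the b instances of largest γ taken as centers; the function and variable names are ours and the fallback branch is an assumption:

```python
import numpy as np

def cluster_by_gamma(d, rho_knn, gamma, b):
    """Sketch of the optimized density clustering step: the b instances
    with the largest gamma are the cluster centers; every other instance
    inherits the label of its nearest neighbor with higher density
    (the standard DPC assignment rule, assumed here)."""
    n = len(gamma)
    labels = np.full(n, -1)
    centers = np.argsort(gamma)[::-1][:b]
    labels[centers] = np.arange(b)
    for i in np.argsort(rho_knn)[::-1]:          # densest instances first
        if labels[i] != -1:
            continue                              # cluster centers are already labeled
        denser = np.where(rho_knn > rho_knn[i])[0]
        if denser.size == 0:                      # densest point that is not a center
            denser = centers                      # fall back to the centers (assumption)
        nearest = denser[np.argmin(d[i, denser])]
        labels[i] = labels[nearest]
    return labels
```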

Secondly, we build a decision graph for each cluster and use a sliding window algorithm to adaptively find the turning point; the detailed process is shown in Algorithm 2. The y-axis of the decision graph represents the γ value of each instance, and the x-axis represents the instances sorted in descending order of γ. From the decision graph, we observe that γ decreases sharply before the turning point and decreases gently after it. This indicates that the representativeness of the samples after the turning point is relatively weak, so we do not need to spend expensive time and cost querying their labels. To find the turning point in the decision graph, we design a sliding window algorithm. In the sliding window algorithm, we use the variance of γ within a sliding window of width w to measure how gently the sample representativeness declines. The larger the variance in the sliding window, the more rapidly the sample representativeness declines. We sort the samples in bli in descending order of γ to obtain the subscript array s, and let p range from 0 to s.length − w, where w is the width of the sliding window. The turning point is defined as the minimum p for which the variance within the sliding window is less than λ, as follows:

$$ tp = \underset{p}{\min}\left\{p \mid var(\gamma_{s_{[p]}}, \ldots, \gamma_{s_{[p+w-1]}}) < \lambda\right\}, \quad p \in [0, s.length-w]. $$
(6)

Finally, we select the tp samples with the largest γ values to be labeled.

Algorithm 2 Sliding window algorithm
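A minimal sketch of this sliding window rule (our own implementation of (6), assuming NumPy's population variance and, as a hypothetical fallback not discussed in the paper, returning the cluster size when no window is flat enough):

```python
import numpy as np

def turning_point(gamma_cluster, w=5, lam=0.001):
    """Sketch of Eq. (6): sort gamma in descending order and return the
    smallest window start p whose variance is below lambda."""
    g = np.sort(gamma_cluster)[::-1]             # descending order of gamma
    for p in range(len(g) - w + 1):              # p in [0, s.length - w]
        if np.var(g[p:p + w]) < lam:             # flat window found
            return p                              # tp: query the tp largest-gamma samples
    return len(g)                                 # fallback (assumption, not in the paper)
```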

3.3 Classify

After selecting and querying the labels of these critical samples, we classify the other instances in the cluster based on the existing labels. If the labeled sample set is pure (i.e., the labels of the samples in pi are all the same, where pi is the label set of the cluster bli), the cluster is considered pure, and the remaining unlabeled instances are assigned the same label. Otherwise, we choose critical instances based on the current information in the next iteration, until the cluster becomes pure or the label budget N is exhausted. If there are still impure clusters with unlabeled data after the N labels are used up, we use the standard voting method to assign the most frequent label in the cluster to the remaining samples. For example, assume that a cluster to be classified contains 20 samples: 5 samples to be classified, 10 samples of class A, 2 samples of class B, and 3 samples of class C. Class A is the most frequent in the cluster, so the standard voting method classifies the remaining 5 samples as class A.
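The worked example above can be reproduced with a short sketch (the function name and interface are ours; a cluster is represented simply by its queried labels and its number of still-unlabeled instances):

```python
from collections import Counter

def classify_cluster(queried_labels, n_unlabeled):
    """Sketch of the classification step: a pure cluster propagates its
    single label; an impure cluster (once the budget is exhausted) uses
    the standard voting method, i.e. the most frequent queried label."""
    counts = Counter(queried_labels)
    label = counts.most_common(1)[0][0]           # majority label (equals the
    return [label] * n_unlabeled                  # unique label if the cluster is pure)

# Worked example from the text: 10 x A, 2 x B, 3 x C queried, 5 unlabeled.
print(classify_cluster(['A'] * 10 + ['B'] * 2 + ['C'] * 3, 5))  # five 'A' labels
```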

3.4 Illustrative example

To illustrate our algorithm, we analyze a running example on the Aggregation dataset. The Aggregation dataset is a two-dimensional dataset from the UCI machine learning repository [18], containing 788 samples in 7 clusters. The experimental parameters were set to k = 5, w = 5, and λ = 0.001. The label budget for AAKC is N = 78, accounting for 10% of the total number of samples. Table 1 shows the size of each cluster, the current label information, and the number of newly selected samples. Figures 2 and 3 show the decision graphs of bl1 and bl2, respectively. To facilitate locating the turning point, we mark its position with a red dotted line.

Table 1 Active learning process information of AAKC where “AS” represents the number of samples that have been selected, “NS” represents the number of newly selected samples, and the total number of selected instances is 32
Fig. 2 The decision graph of bl1

Fig. 3 The decision graph of bl2

We can see that the AAKC algorithm clusters the dataset into two clusters after initialization and generates a decision graph for each. The turning points in the two decision graphs are the eleventh point and the seventh point. At this time, there are no labeled instances in bl1 and bl2, so we need to select 11 and 7 instances in bl1 and bl2, respectively. After labeling, the label sets in bl1 and bl2 are p1 = {4,1,3,6,3,1,1,3,1,3,1} and p2 = {5,0,2,2,5,2,5}, respectively. Since neither is a pure cluster, both wait for the next iteration. The number of labels used is 11 + 7 = 18, which is less than 78, so AAKC performs the next iteration. In the second iteration, the AAKC algorithm first divides the entire dataset into bl1, bl2, and bl3 and builds a decision graph for each cluster. The turning points in the three decision graphs are the 7th, 7th, and 6th points. There are 6 labeled instances in bl1, 7 in bl2, and 5 in bl3, so we select 1 new instance each in bl1 and bl3 and none in bl2. After labeling, the label sets in bl1 and bl2 are p1 = {4,3,6,3,3,3,3} and p2 = {5,0,2,2,5,2,5}, respectively. Since neither is a pure cluster, both wait for the next iteration. The label set in bl3 is p3 = {1,1,1,1,1,1}; all the labels in this cluster are the same, so it is a pure cluster, and we directly assign the remaining instances in bl3 to that class. The number of labels used after the second iteration is 18 + 1 + 1 = 20, which is less than 78, so the iteration continues. The remaining iterations proceed in the same way. Finally, the AAKC algorithm uses 32 labels, iterates six times while dividing the dataset into seven clusters, and completes the classification of all instances.

For comparison, we also run the ALEC algorithm on the Aggregation data. Table 2 shows the size of each cluster, the current label information, and the number of newly selected samples. In the first iteration, ALEC divides the main tree into two clusters, bl1 and bl2, through clustering. Since N = 78, ALEC selects \(\sqrt {N} = 8\) instances in each of bl1 and bl2. At this time, the number of labels used is 8 + 8 = 16, which is less than 78, so ALEC performs the next iteration. In the second iteration, ALEC first divides the entire dataset into bl1, bl2, and bl3 and then selects 8 instances in each cluster. After the second iteration, the number of labels used is 16 + 8 + 8 + 8 = 40, which is less than 78, so ALEC performs the next iteration. The remaining iterations run in the same manner. Finally, the ALEC algorithm runs out of budget in the fourth iteration and then uses the standard voting method to classify the remaining unclassified instances. The ALEC algorithm tends to select more instances in the clusters generated earlier, which causes it to run out of labels and exit the loop quickly. For example, ALEC chooses a total of 24 instances in the bl2 cluster, and the remaining instances can only be classified by the standard voting method. AAKC selects only truly representative samples in each cluster, which allows it to iterate more times and finally complete the classification of all instances. In this example, ALEC uses all 78 labels and achieves a classification accuracy of 0.9506, while AAKC uses only 32 labels and achieves a classification accuracy of 0.9972.

Table 2 Active learning process information of ALEC where “AS” represents the number of samples that have been selected, “NS” represents the number of newly selected samples, and the total number of selected instances is 78

3.5 Complexity analysis

The time complexity of the AAKC algorithm consists of two parts. (1) Initialization: we calculate ρknn(i), δi, and γi for each sample, which takes O(mn2) time. (2) Adaptive instance selection: this stage usually involves several iterations. In each iteration, we take O(nlogn) time to sort the γ values and select the centers, and O(n) time to assign a cluster index to the non-center instances, which yields several clusters to be classified. Assume that the size of a cluster is \(n^{\prime }\); we take \(O(n^{\prime }\log n^{\prime })\) time to sort within the cluster and \(O(n^{\prime })\) time to calculate the window variances. Since the size of a cluster is less than n, this part takes O(nlogn). Classifying a pure cluster with \(n^{\prime }-|p_{i}|\) unlabeled instances takes \(O(n^{\prime }-|p_{i}|)\) time, which is O(n) because \(n^{\prime }-|p_{i}|\) is less than n. If the algorithm iterates \(k^{\prime }\) times, the total time complexity of AAKC is \(O(mn^{2}) + O(k^{\prime }(n\log n+n\log n+n))\). Since \(k^{\prime }\) is much smaller than n, the total time complexity of the AAKC algorithm is O(mn2).

4 Experiment

To verify the effectiveness of the AAKC algorithm, we compare it with mainstream supervised classification algorithms and the latest active learning algorithms. The parameters of the AAKC algorithm are set to w = 5 and λ = 0.001. We use four synthetic datasets [12] and five UCI benchmark datasets [18], and Table 3 shows information about these datasets.

Table 3 Dataset information where “n” represents the number of samples, “c” represents the number of categories, and “m” represents the number of features

The experiment is conducted under Windows 10 64-bit with 32GB of memory and an Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz processor. We use the accuracy index to evaluate algorithm performance, which is calculated as follows:

$$ Acc=\frac{|U_{t}|-e}{|U_{t}|}. $$
(7)

|Ut| is the size of the testing set, and e is the number of misclassified instances. If the active learner queries N labels, |Ut| = n − N. Otherwise, |Ut| ≥ n − N, indicating that the active learner chooses to predict more labels than expected.
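For clarity, (7) together with this test-set convention amounts to the following small sketch (the numbers in the usage line are hypothetical):

```python
def accuracy(n_test, n_errors):
    """Eq. (7): fraction of correctly classified test instances, where the
    test set consists of all instances whose labels were never queried."""
    return (n_test - n_errors) / n_test

# Hypothetical example: n = 1000 instances, 100 labels queried, 18 errors.
print(accuracy(1000 - 100, 18))  # 0.98
```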

4.1 Comparison with supervised classifiers

The AAKC algorithm is an active learning algorithm for classification, so we compare it with ten supervised learning algorithms on nine datasets: K Nearest Neighbors (KNN) [19], C4.5 [20], Naive Bayes (NB) [21], Random Forest (RF) [22], AdaBoostM1 (ABM) [23], Classification Via Regression (CVR) [24], Logit Boost (LB) [25], Bagging [26], Multi Class Classifier (MCC) [27], and Filtered Classifier (FC) [28]. These algorithms are implemented in the Weka platform [29]. All datasets are normalized, and the number of labels N and the training set size are both 0.1|U|.

Table 4 lists the experimental results, with the best results in bold. Among all the compared algorithms, AAKC has the highest accuracy on six datasets, the highest average accuracy of 0.9227, and the lowest average rank of 2.11. It is worth noting that AAKC achieves an accuracy of 1.000 on the Spiral and Flame datasets, which none of the other algorithms achieve. AAKC is superior to the classical supervised classification algorithms.

Table 4 Classification accuracy of the supervised learning algorithm

We select three classical supervised learning algorithms (KNN, C4.5, and NB) for a further experimental analysis of the label number N. In this experiment, the number of labels N and the size of the training set range from 0.01|U| to 0.1|U|. Figure 4 presents the results as line graphs. The x-axis represents the number of labeled samples (or the size of the training set), and the y-axis represents the classification accuracy. The AAKC algorithm has the following advantages over the other algorithms. Firstly, AAKC is the fastest to reach peak accuracy on seven datasets; it is only slightly slower than the NB and KNN algorithms on the Flame and Ionosphere datasets. On six datasets, AAKC queries the labels of only 0.01|U| samples and its accuracy almost reaches the peak. Secondly, on all datasets, the accuracy of the AAKC algorithm increases steadily with increasing N, while the accuracy of the other three algorithms fluctuates. For example, on the DCCC dataset, when the training set size increases from 0.01|U| to 0.02|U|, the accuracy of the NB algorithm decreases by 0.0848. On the Spiral dataset, when the training set size increases from 0.05|U| to 0.06|U|, the accuracy of the C4.5 algorithm decreases by 0.2279. On the Seeds dataset, when the training set size increases from 0.06|U| to 0.07|U|, the accuracy of the C4.5 algorithm decreases by 0.0791.

Fig. 4 Comparison of classification accuracy with supervised learning algorithms

4.2 Comparison with active learning classifiers

In this section, we verify the effectiveness of the adaptive instance selection strategy. We compare the AAKC algorithm with eight state-of-the-art active learning algorithms, including the committee-based algorithms QBC [14] and KQBC [15], the uncertainty sampling algorithm MAED [30], and the cluster-based algorithms QUIRE [13], ABD [31], ALEC [10], TASC [16], and ALTA [17]. These are among the most popular approaches to active learning. The code and optimal parameters of all comparison algorithms are provided by their authors.

Table 5 shows the experimental results of AAKC and the other eight active learning algorithms when the number of labels is 0.1|U|; the bold numbers indicate the best results. The AAKC algorithm achieves the highest accuracy on seven datasets and is only slightly lower than the QBC algorithm on the DCCC and Ionosphere datasets. The mean accuracy of AAKC is 0.9227, and its mean rank is 1.44. Compared with ALEC, the mean accuracy is higher by 0.0393 and the mean rank is better by 2.56.

Table 5 Classification accuracy of the active learning algorithm

Figure 5 compares the accuracy of the AAKC algorithm with that of the eight state-of-the-art active learning algorithms. For all datasets, the proportion of queries ranges from 0.01|U| to 0.1|U|. We observe that, on the nine datasets, the AAKC algorithm has the following advantages over the other algorithms. Firstly, the AAKC algorithm achieves peak accuracy using fewer labels than the other active learning algorithms.

Fig. 5 Comparison of classification accuracy with active learning algorithms

In addition, AAKC performs exceptionally well on the Aggregation, Spiral, Seeds, Heart-statlog, and Ionosphere datasets: when the proportion of queries is 0.02|U|, the accuracy is almost at its peak, so from the perspective of minimizing query costs there is no need to continue querying instances. Secondly, the accuracy of AAKC increases steadily as N increases, whereas the accuracy of QUIRE and MAED fluctuates. For example, on the DCCC dataset, the accuracy of the QUIRE algorithm fluctuates constantly: when the proportion of queries is 0.02|U|, the accuracy reaches its maximum value, and then it decreases slightly as the number of labels increases. On the Heart-statlog and Seeds datasets, when the number of labels increases from 0.04|U| to 0.05|U|, the accuracy of the MAED algorithm decreases by 0.0774 and 0.0656, respectively. AAKC is consistently more accurate and more stable than ALEC. In conclusion, the AAKC algorithm is generally superior to the other active learning algorithms.

4.3 Effects of parameters

The critical parameter of AAKC is k, which determines the number of neighbors of each instance. To evaluate the effect of k and the stability of our algorithm, we vary k from 3 to 10. Figure 6 presents the results of the experiment, and Table 6 shows the statistics of the experimental results. We observe that AAKC is very robust with respect to the parameter k. Additionally, although the value of k at which AAKC obtains the highest accuracy is not always the same on different datasets, it always lies in the range of 3 to 6. This demonstrates the stability of our algorithm with respect to the parameter k.

Fig. 6 The experimental results of AAKC for different k

Table 6 The effects of k on nine datasets, where k ranges from 3 to 10, and the “k” column represents the k value for maximum accuracy

5 Conclusion

This paper proposed the AAKC algorithm, which uses k-nearest neighbor information to redefine the local density of instances and adopts an adaptive instance selection strategy to select samples automatically. Experimental results on nine datasets confirm that our algorithm outperforms classic supervised learning algorithms and the latest active learning algorithms. The time complexity of AAKC is O(mn2), the same as that of the DPC algorithm. However, it is not efficient enough for high-dimensional and large datasets. In future research, we plan to further improve the efficiency of the algorithm.