
1 Introduction

Many supervised learning algorithms from machine learning have been widely used in pattern recognition tasks. In all of these methods, the accuracy of the classifier depends heavily on the labeled sample set. In practice, however, unlabeled training samples are easy to obtain while labeled samples are scarce, so acquiring labels comes at a high price. Moreover, redundantly labeled samples slow down training without helping the classifier. To reduce the time and cost of labeling [1], the samples selected during training should not only be informative but also diverse from one another. Active learning [2] is an effective way to meet these requirements. Active learning algorithms select highly informative unlabeled examples to be labeled by experts [3, 4]; over several query loops the classification accuracy gradually increases, and the classifier thus attains strong generalization ability at minimal labeling cost. Compared with traditional supervised learning methods, active learning can significantly reduce the time and cost of labeling. How to choose as few samples as possible while achieving high classification accuracy is the core issue of active learning, so the sampling strategy [5] is naturally a central concern of active learning algorithms.

Traditional sampling methods generally fall into two categories. The first is based on uncertainty [6]: it measures the information content of a sample solely by its uncertainty and selects the samples whose classification is most uncertain for labeling. Although this approach has a wide range of applications and achieves good results in practical tasks, it only considers the relationship between the current sample and the labeled samples, ignoring the distribution of the unlabeled set; as a result, outliers are inevitably chosen during training and may reduce classification accuracy. The second category overcomes this difficulty by considering both the uncertainty and the representativeness of a sample [7]. For different data, however, it is hard to balance the relative importance of uncertainty and representativeness, i.e., their respective weights; moreover, these methods do not consider the redundancy among the selected samples. Some batch mode active learning methods suffer from the same problem: to accelerate learning, a batch mode method must select more than one sample at each iteration, so the diversity of the selected samples also has to be taken into account. To solve the above problems, this chapter proposes a multicriterion query-based active learning method. Experimental results show that the proposed method achieves good performance.

2 Related Work

A general active learner can be modeled as a quintuple (G, Q, S, L, U) [8]. Initially, the training set L contains a few labeled samples used to train the classifier G. The query function Q then selects a set of the most informative samples from the unlabeled pool U, and the supervisor S assigns a class label to each of them. These newly labeled samples are added to L, and the classifier G is retrained on the updated training set. This closed loop of querying and retraining continues for a predefined number of iterations or until a stopping criterion is satisfied.

There has been considerable research on sampling strategies for active learning. Uncertainty-based sampling methods are the most commonly used, and entropy is the most common uncertainty measure. Sample entropy can represent the uncertainty of a sample well: the greater the entropy, the greater the uncertainty. In multiclass problems, however, entropy often fails to reflect classification uncertainty, since a sample with lower entropy may actually be harder to classify than one with higher entropy when the confusion is concentrated on a few classes. To address this, Joshi [9] proposed a more direct sample selection criterion, Best-versus-Second-Best (BvSB), which takes the difference between the probability values of the two classes with the highest estimated probabilities as the measure of uncertainty; the method performs well in practical applications. Another common sampling strategy is based on reducing the version space, of which the query-by-committee (QBC) algorithm is the most widely known. A committee is constituted from a set of classifiers, each committee member votes on the labelings of the query candidates, and the most informative query is the instance about which they disagree most. In essence, QBC is also uncertainty-based sampling. However, these methods only consider the relationship to the labeled samples and ignore the distribution of the unlabeled set.

The literature [7] shows that unlabeled samples have a great influence on classification accuracy. If the current sample can represent the remaining unlabeled samples well, we say that it has high representativeness, meaning its information content is higher. Several studies combine uncertainty with representativeness. Settles and Craven [7] proposed the information density (ID) method, which first measures the basic information content of the current sample with an uncertainty criterion, then uses the cosine similarity of feature vectors to compute its average similarity to all other samples in the input distribution; the uncertainty score is multiplied by this density term, with a fixed weight controlling the influence of the density item.

In many problems it is also necessary to speed up the learning process by selecting more than one sample at each iteration. Li and Sethi [8] proposed a method that estimates the uncertainty level of each sample from the output score of a classifier and selects only those samples whose scores fall within an uncertainty range. Patra and Bruzzone [10] proposed a fast clustering-based active learning method for multiclass classification problems. However, these methods select a batch of samples at each iteration using only the uncertainty criterion, which leads to the selection of redundant samples that slow the classifier down without adding information. To solve this problem, Brinker [11] presented a batch mode method that takes diversity into account, and in [12, 13] clustering was introduced into the design of the active learning query function to measure diversity. There is also some research combining uncertainty, representativeness, and diversity. Lin and Bilmes [14] studied batch mode active learning with submodular graph functions for training hidden Markov models in speech recognition, but their method is mainly designed for representativeness. Xu et al. [15] incorporated diversity, density, and relevance (analogous to uncertainty) into a single query criterion, but their approach simply interpolates the three scores with two empirically tuned weights, and tuning such weights for active learning is more challenging in a real scenario than tuning for classification accuracy.

Different from the above methods, we propose an active learning algorithm that integrates different selection criteria. The framework of our method consists of three key components: an uncertainty, a representativeness, and a diversity criterion. Compared with previous work, this method has the following advantages: (1) it takes into account the relationship with the labeled samples while also making full use of the remaining unlabeled samples, ensuring that the selected samples have high uncertainty and good representativeness; in addition, it considers the correlation between the selected samples, so that they are diverse; (2) it requires no weights for uncertainty, representativeness, and diversity, since our method first considers the uncertainty of samples and then combines it with representativeness and diversity; (3) compared with methods that deal directly with all unlabeled samples, it greatly reduces the cost of sample selection and thereby improves efficiency.

3 MCQAL: Multicriterion Query-Based Active Learning

3.1 Problem Analysis

A limitation of the uncertainty sampling strategy is that it relies on the relative correctness or confidence of the current model, which can be problematic, especially in the early stages. Uncertainty sampling can also suffer from a lack of exploration of the feature space and may not work well in some scenarios. As shown in Fig. 1, the red triangles and green circles represent labeled instances, and the remaining white diamonds represent the unlabeled sample set. Since sample \( x_{A} \) lies on the classification boundary, it would be selected by the uncertainty sampling strategy and given to human experts for labeling. In fact, \( x_{B} ,x_{C} ,x_{D} \) are the samples with higher information content, since they better reflect the distribution of the sample set. We therefore combine representativeness and diversity criteria with the uncertainty sampling strategy, so that this part of the samples is selected as well.

Fig. 1 An illustration of when uncertainty sampling can be a poor strategy

3.2 Uncertainty of Sample Selection

For the uncertainty sampling strategy we employ the BvSB approach [9], which considers the difference between the probability values of the two classes having the highest estimated probabilities as the measure of uncertainty. Let the estimated probability distribution for a given example be denoted by P, where \( p_{i} \) is the membership probability for class i, and suppose that P attains its maximum for class h. Based on current knowledge, the most likely set of classifiers in contention is \( C_{h} \). The classification confidence for the classifiers in this set is indicated by the difference between the estimated class probability values, \( p_{h} - p_{i} \); this difference indicates how informative the particular example is to a given classifier. Minimizing the difference \( p_{h} - p_{i} \), or equivalently maximizing the confusion (uncertainty), yields the BvSB measure

$$ \begin{aligned} {\text{BvSB}}^{*} & = \arg \mathop {\hbox{min} }\limits_{{x_{i} \in U}} \left( \mathop {\hbox{min} }\limits_{{y \in Y,y \ne y_{\text{Best}} }} \left( p(y_{\text{Best}} |x_{i} ) - p(y |x_{i} )\right) \right) \\ & = \arg \mathop {\hbox{min} }\limits_{{x_{i} \in U}} \left( p(y_{\text{Best}} |x_{i} ) - p(y_{\text{Second-Best}} |x_{i} )\right) \\ \end{aligned} $$
(1)
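To make the criterion concrete, below is a minimal NumPy sketch of Eq. (1); the function names and the extension to a batch of size m are our own illustration, not code from the chapter.

```python
import numpy as np

def bvsb_margins(proba):
    """BvSB margin p(y_Best|x) - p(y_Second-Best|x) for each row.

    proba: (n_samples, n_classes) array of estimated class
    probabilities, e.g. from a probabilistic SVM.
    Smaller margins mean higher classification uncertainty.
    """
    top2 = np.partition(proba, -2, axis=1)  # two largest values end up last
    return top2[:, -1] - top2[:, -2]

def select_most_uncertain(proba, m):
    """Indices of the m samples minimizing the BvSB margin, i.e. the
    arg-min of Eq. (1) extended to a batch of size m."""
    return np.argsort(bvsb_margins(proba))[:m]
```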

3.3 Representativeness Measure

In addition to the most uncertain samples, we also prefer the most representative ones. The representativeness of a sample can be evaluated by how many samples are similar or near to it, so samples with a high representativeness degree are less likely to be outliers, and adding them to the training set affects a large number of unlabeled samples. In this section, we use a Gaussian Process model [16] to measure the mutual information between each sample in the uncertainty set and the remaining unlabeled samples.

Using the mutual information criterion, we define the representativeness measure for a candidate sample \( x_{i} \) as below

$$ {\text{rep}}(x_{i} ) = I(x_{i} ,U_{{x_{i} }} ) = H(x_{i} ) - H(x_{i} |U_{{x_{i} }} ) $$
(2)

where \( U_{{x_{i} }} \) denotes the index set of unlabeled instances after removing \( x_{i} \) from \( U \).

We propose to compute the entropy terms in (2) within a Gaussian Process framework. A Gaussian Process is a joint distribution over a (possibly infinite) set of random variables, such that the marginal distribution over any finite subset of variables is multivariate Gaussian. For our problem, we associate a random variable \( \chi (x) \) with each instance. A symmetric positive definite kernel function \( K( \cdot , \cdot ) \) is then used to produce the covariance matrix, such that

$$ \sigma_{i}^{2} = K(x_{i} ,x_{i} ) $$
(3)
$$ \Sigma_{{U_{i} U_{i} }} = \left( {\begin{array}{*{20}c} {K(x_{1} ,x_{1} )} & {K(x_{1} ,x_{2} )} & \cdots & {K(x_{1} ,x_{u} )} \\ {K(x_{2} ,x_{1} )} & {K(x_{2} ,x_{2} )} & \cdots & {K(x_{2} ,x_{u} )} \\ \vdots & \vdots & \ddots & \vdots \\ {K(x_{u} ,x_{1} )} & {K(x_{u} ,x_{2} )} & \cdots & {K(x_{u} ,x_{u} )} \\ \end{array} } \right) $$
(4)

where the covariance matrix \( \Sigma_{{U_{i} U_{i} }} \) is actually a kernel matrix defined over all the unlabeled instances indexed by \( U_{i} \); we assume \( U_{i} = \left\{ {1,2, \ldots ,u} \right\} \). One commonly used kernel function is the Gaussian kernel \( K(x_{i} ,x_{j} ) = \exp \left( { - \frac{{\left\| {x_{i} - x_{j} } \right\|^{2} }}{{2\lambda^{2} }}} \right) \).

Closed-form solutions exist for the entropy of multivariate Gaussian distributions such that

$$ H(x_{i} ) = \frac{1}{2}\ln \left( {2\pi e\,\Sigma_{ii} } \right) $$
(5)
$$ H(x_{i} |U_{{x_{i} }} ) = \frac{1}{2}\ln \left( {2\pi e\,\Sigma_{{i|U_{i} }} } \right) $$
(6)

Using (5) and (6), where \( \Sigma_{{i|U_{i} }} = \Sigma_{ii} - \Sigma_{{i,U_{i} }} \Sigma_{{U_{i} U_{i} }}^{ - 1} \Sigma_{{U_{i} ,i}} \) is the standard Gaussian conditional variance of \( x_{i} \) given the remaining unlabeled instances, the representativeness definition given in (2) can finally be rewritten in the following form

$$ {\text{rep}}(x_{i} ) = H(x_{i} ) - H(x_{i} |U_{{x_{i} }} ) = \frac{1}{2}\ln \left( {\frac{{\Sigma_{ii} }}{{\Sigma_{{i|U_{i} }} }}} \right) $$
(7)
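As a sketch of how Eq. (7) might be computed, the snippet below builds the Gaussian kernel matrix of (4) and uses the standard identity \( \Sigma_{i|U_{i}} = 1/(\Sigma^{-1})_{ii} \) for the conditional variance of one Gaussian variable given all the others; the jitter term and parameter defaults are our own assumptions for numerical stability, not part of the chapter.

```python
import numpy as np

def gaussian_kernel_matrix(X, lam=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * lam^2)) over rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * lam ** 2))

def representativeness(X, lam=1.0, jitter=1e-6):
    """rep(x_i) = 0.5 * ln(Sigma_ii / Sigma_{i|U_i}) for every row of X, per Eq. (7).

    For a joint Gaussian with covariance Sigma, the conditional variance
    of variable i given all remaining variables is 1 / (Sigma^{-1})_{ii},
    so a single matrix inversion yields all u scores at once.
    """
    Sigma = gaussian_kernel_matrix(X, lam) + jitter * np.eye(len(X))
    cond_var = 1.0 / np.diag(np.linalg.inv(Sigma))
    return 0.5 * np.log(np.diag(Sigma) / cond_var)
```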

3.4 Diversity Analysis

The diversity criterion aims to maximize the training utility of a batch: we prefer batches in which the examples differ strongly from one another. In this step we cluster all samples in the uncertainty set based on the representativeness measure proposed in Sect. 3.3. Samples in the same cluster can be considered similar to each other, so we select the most informative sample from each of the different clusters at one time. We apply the kernel k-means clustering algorithm.

In greater detail, let us assume that the kernel k-means clustering algorithm divides the m samples into h clusters \( C_{1} ,C_{2} , \ldots ,C_{h} \). After \( C_{1} ,C_{2} , \ldots ,C_{h} \) are obtained, the h most informative samples are selected as

$$ x_{k} = \arg \mathop {\hbox{max} }\limits_{{x \in C_{k} }} {\text{rep}}(x),\quad k = 1,2, \ldots ,h. $$
(8)
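A possible realization of this step is sketched below. For brevity it uses scikit-learn's plain k-means as a stand-in for the kernel k-means step described above, and within each cluster it selects the sample maximizing rep(x) per Eq. (8); function names and defaults are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_batch(X_uncertain, rep_scores, h, seed=0):
    """Pick one sample per cluster from the uncertainty set (Eq. (8)).

    X_uncertain: (m, d) features of the m most uncertain samples.
    rep_scores:  (m,) representativeness values from Eq. (7).
    h:           number of clusters, i.e. the batch size.
    Plain k-means stands in for the kernel k-means used in the text.
    """
    labels = KMeans(n_clusters=h, n_init=10, random_state=seed).fit_predict(X_uncertain)
    batch = []
    for k in range(h):
        members = np.flatnonzero(labels == k)
        batch.append(members[np.argmax(rep_scores[members])])  # most representative
    return np.asarray(batch)
```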

3.5 A Combination Framework

In this section, we study how to combine these criteria and strike a proper balance among them for maximum effectiveness.

We first consider the uncertainty criterion and choose the m samples with the highest informativeness scores from the whole pool. This preselection makes the later steps faster, since the uncertainty set is much smaller than the pool. We then cluster the samples in the uncertainty set and add the centroid of each cluster to the batch. The centroid of a cluster is the most representative sample in that cluster, since it has the largest density, and samples from different clusters can be considered diverse from each other; in this way we account for the representativeness and diversity criteria at the same time. The overall algorithm is summarized in Table 1, and a code sketch follows the table.

Table 1 The pseudo-code of MCQAL is summarized in Algorithm 1
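Since the table itself is not reproduced here, the following end-to-end sketch shows how the three steps could fit together, reusing the helper functions sketched in Sects. 3.2–3.4; the oracle interface and SVC settings are our own assumptions, not the chapter's exact pseudo-code.

```python
import numpy as np
from sklearn.svm import SVC

def mcqal(X_lab, y_lab, X_pool, oracle, m, h, iterations, lam=1.0):
    """One hypothetical MCQAL loop: uncertainty -> representativeness -> diversity.

    oracle(X) plays the role of the supervisor S and returns true labels;
    m is the uncertainty-set size, h the batch size per iteration.
    """
    for _ in range(iterations):
        clf = SVC(kernel="rbf", probability=True).fit(X_lab, y_lab)
        # Step 1: keep the m pool samples with the smallest BvSB margin
        unc = select_most_uncertain(clf.predict_proba(X_pool), m)
        # Step 2: representativeness scores of the uncertain samples,
        # computed against the whole pool (cubic cost -- fine for a sketch)
        rep = representativeness(X_pool, lam)[unc]
        # Step 3: cluster the uncertainty set and take one sample per cluster
        picked = unc[diverse_batch(X_pool[unc], rep, h)]
        # Query the oracle and move the batch into the training set
        X_lab = np.vstack([X_lab, X_pool[picked]])
        y_lab = np.concatenate([y_lab, oracle(X_pool[picked])])
        X_pool = np.delete(X_pool, picked, axis=0)
    return SVC(kernel="rbf", probability=True).fit(X_lab, y_lab)
```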

4 Experimental Results

4.1 Design of Experiments

In order to assess the effectiveness of our algorithm, four standard UCI data sets were used in the experiments; Table 2 shows their characteristics. For comparison we choose (1) information density (ID) active learning [7], a representative approach that selects informative and representative instances, and (2) a cluster-assumption-based batch mode active learning technique [17], a representative approach that selects informative and diverse instances. An RBF kernel with default parameters is used (performance with a linear kernel is not as stable as with the RBF kernel), and LibSVM is used to train an SVM classifier for all active learning approaches in the comparison. To reduce the influence of random partitioning on the measured classification error, we run the experiment 10 times for every data set, each with a random partition of the data. In the next section we present the results of three experiments: the first compares the accuracy of the proposed technique with the other two techniques on the four data sets, the second examines the diversity of the samples selected at each iteration, and the third analyzes the computational load of the different methods. A sketch of this evaluation protocol is given below.
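The snippet below illustrates one way to implement the repeated-random-partition protocol; the split sizes and the run_experiment interface are illustrative assumptions rather than the chapter's exact setup.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate(run_experiment, X, y, n_runs=10, seed=0):
    """Mean/std of final accuracy over n_runs random partitions (Sect. 4.1).

    run_experiment(X_init, y_init, X_pool, y_pool, X_test, y_test) is
    assumed to perform one full active learning run and return accuracy.
    """
    accs = []
    for r in range(n_runs):
        # hold out a test set, then split the rest into a small initial
        # labeled set and the unlabeled pool (sizes are illustrative)
        X_rest, X_test, y_rest, y_test = train_test_split(
            X, y, test_size=0.3, random_state=seed + r)
        X_init, X_pool, y_init, y_pool = train_test_split(
            X_rest, y_rest, train_size=0.05, random_state=seed + r)
        accs.append(run_experiment(X_init, y_init, X_pool, y_pool,
                                   X_test, y_test))
    return float(np.mean(accs)), float(np.std(accs))
```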

Table 2 Characteristics of the experimental data sets

4.2 Results

Figure 2 shows the classification accuracy of different active learning approaches with varied numbers of queries.

Fig. 2 Classification accuracy of different active learning approaches on the four data sets

On the Ionosphere and Letters data sets, the proposed method is superior to the other two methods. When the number of labeled samples is small, our method and CAAL perform similarly and both are significantly better than IDAL; as the number of labeled samples increases, the superiority of our method gradually emerges and it becomes significantly better than CAAL. At each iteration, the samples selected by our method are therefore more informative (representative), and with the same number of labeled samples our method increases the classification accuracy more effectively.

On the Pen-Digits data set, since our method first screens by uncertainty and only then selects samples for representativeness and diversity, its performance is constrained by the uncertainty sample set. As the corresponding section of the curve shows, the selected samples are not optimal there: samples with high representativeness but low uncertainty may be ignored, while IDAL selects samples with higher information content.

The Balance Scale data set has an imbalanced class distribution. In imbalanced data sets, samples of certain categories are extremely rare, yet these samples often carry high information content, so they should be picked out as far as possible and submitted to human experts for annotation. The results show that our approach also performs well on this imbalanced data set. Initially, its classification accuracy is much higher than that of the others because its selection is relatively more balanced; once the labeled samples reach a certain number, the highly valuable samples have already been added to the training set, and the consideration of sample balance no longer affects the classification accuracy so obviously.

On the Pen-Digits and Letters data sets, we recorded the number of classes among the selected samples at each iteration. The statistics in Fig. 3 show that, with the same number of samples labeled at each iteration, the class distribution of the samples selected by our method is relatively more balanced. Considering the diversity of the samples reduces selection redundancy, so the selected samples are more informative and the classification performance improves quickly.

Fig. 3 The number of classes among the selected samples at each iteration

We also measured the computational time required by the different techniques under the same experimental setting as described above. All experiments were carried out on a PC (Pentium(R) Dual-Core i5-2400 CPU @ 3.1 GHz, 4 GB RAM). Table 3 reports the computational time (in seconds) required by the three techniques on all four data sets. Compared with the other two methods, our method greatly reduces the running time in most cases.

Table 3 Computational time required by three techniques for all four data sets

5 Discussion and Conclusion

This chapter presents a new active learning algorithm that combines uncertainty, representativeness, and diversity criteria. On the basis of uncertainty sampling, we incorporate a measure of sample representativeness and an analysis of sample diversity. The technique shortens the time required to train on samples while maintaining classification accuracy, so the labeling cost can be reduced without degrading performance. For different data, our method currently requires the size of the uncertainty set to be specified experimentally; how to determine this size dynamically from the sample distribution so as to guarantee optimal performance is the focus of future work.