1 Introduction

With the growing popularity of the Internet, short texts are generated in large volumes by users and services on platforms such as Twitter and Facebook [10]. Moreover, short text classification techniques have a number of applications, such as sentiment classification [18] and semantic analysis [16]. There is therefore considerable demand for short text classification or categorization [1]. Short text classification requires a large number of labeled samples for training [2], because it involves a supervised learning process [26]. However, the volume of short texts generated by Web services is enormous, and manually labeling all these data items is time consuming and expensive, creating a bottleneck in most short text classification problems.

Active learning is a widely used technique for solving this problem [36]. Its main idea is to use machine learning algorithms to choose a small set of representative samples, so that only this small set needs to be labeled and used as the training set, saving a large amount of time and money. The main task of this paper is to apply active learning to reduce manual labeling work in supervised classification tasks.

A number of active learning algorithms have been developed, e.g., uncertainty sampling [21] and density-based selection methods [18, 39, 40]. Uncertainty sampling is widely used in natural language processing tasks such as text categorization and grammar matching. However, its limitation is that outliers (i.e., abnormal samples) may be selected as representative samples, which degrades the quality of its results. In other words, outliers have a strong negative impact on the performance of uncertainty sampling. To solve this problem, sampling methods based on the density distribution of the training set have been proposed [18, 39, 40]. As outliers are generally located in low-density regions, these methods select samples from high-density regions to avoid choosing outliers. However, such methods also have drawbacks: samples located along the borders of categories are not necessarily located in high-density regions, so these methods cannot reliably select border samples, even though such samples are important for classification.

In this paper, we propose a sampling algorithm called Top K Representative (TKR). This algorithm is inspired by the K-Nearest Neighbor (KNN) algorithm [3, 8], an important classification algorithm. In KNN, each sample can be represented by the k samples most similar to it, and the categories of these top k samples are used to determine the category of the sample. In the classification process, the more similar a sample is to its top k samples, the more likely it is to be correctly classified. As shown in Fig. 1a, \(t_1\) is the sample to be classified, and \(t_2\)–\(t_7\) are the top k samples closest to \(t_1\) in the training set. In KNN, \(t_1\) is represented by \(t_2\)–\(t_7\) in the training set (as shown in Fig. 1b), and the category of \(t_1\) is determined according to the categories of \(t_2\)–\(t_7\). In light of this, we argue that if we can find a subset that can represent every sample in a training set, this subset can represent the entire training set. As shown in Fig. 1c, the set S is a subset of the training set T; if any sample \(t_m\) in T can be represented by one or more samples in S, we assume that the set S can represent the entire set T. To obtain set S, we need to find, among all possible subsets of T, the one that maximizes the sum of similarities between all samples in training set T and their top k samples in S. In this way, almost every training sample can find nearby samples in the generated subset. In other words, the data distribution of set S is substantially consistent with that of set T. Therefore, samples along the borders of the categories are not neglected. Moreover, as outliers are distant from other samples, they are unlikely to be selected.

Fig. 1
figure 1

Idea inspired by KNN. a K-nearest neighbor. b A sample represented by the Top K samples. c The training set represented by a subset

However, finding the most suitable set S is an NP-hard problem, because it involves traversing all subsets of the training set. Thus, an exact solution cannot be obtained in acceptable time when the training set is large. We propose an optimization algorithm based on the greedy strategy [9] to obtain an approximate solution that is close to the exact one.

Several experiments are conducted to compare the performance of our proposed method with baseline sample selection methods. The results show that the AVG-F1 value of Top K Representative is higher than that of the density-based method by 3.7% and that of uncertainty sampling by 10.1%. We also conducted experiments to verify the efficiency of the proposed algorithm: its run time is acceptable, remaining below 100 s for approximately 10,000 samples.

The contributions of our work here can be summarized as follows:

  1. We propose a sample selection algorithm, TKR, that determines a subset containing samples along the borders of categories while avoiding outliers.

  2. As finding the most suitable subset in TKR is an NP-hard problem, we propose an optimization algorithm based on the greedy strategy to obtain an approximate solution close to the exact solution.

  3. We conducted several experiments to verify the effectiveness of TKR and its efficiency on datasets of different sizes.

In the remainder of this paper, we introduce related work in Sect. 2. The proposed model and its corresponding optimization algorithm are detailed in Sect. 3. Our experiments and their results are reported in Sect. 4, and Sect. 5 concludes the paper and discusses future work.

2 Related work

2.1 Active learning

2.1.1 Uncertainty sampling

Uncertainty sampling [22] is a popular sampling technique that is widely used in natural language processing, for example in word sense disambiguation [5, 6], text classification [22, 36], statistical parsing [35], named entity recognition [22], and image retrieval [17]. In addition, in [18], uncertainty sampling is applied to sentiment classification to find the most informative samples in the corpus and thereby reduce human effort.

Uncertainty sampling first chooses K samples as the initial set to train the classifier. Samples are then selected one by one: when selecting the Nth sample, the N-1 samples already selected are used to train a classifier \(C_{N-1}\). Classifier \(C_{N-1}\) classifies the samples in dataset X that have not yet been selected and yields the probability \(p_{ij}\) that sample i belongs to category j. We then compute the entropy:

$$\begin{aligned} H_i = - \sum _{j=1}^{n}p_{ij}log_{b}p_{ij}, \end{aligned}$$
(1)

where b is the base of the logarithm, usually 2 or e. In information theory, entropy measures the uncertainty of information. In our case, a large entropy means that the probabilities of the sample belonging to each class are very close, i.e., it is very hard to decide which class the sample belongs to. Therefore, if the entropy is large, the classifier has a poor ability to distinguish this sample, and putting this sample into the training set is likely to bring more information. So we choose the sample:

$$\begin{aligned} X_{N} = arg max_{i}H_{i}. \end{aligned}$$
(2)

The key idea of active learning based on uncertainty sampling is that the sample with the largest uncertainty, which carries the most information, is selected for manual labeling in every cycle. A large uncertainty means that the current classifier has little confidence in classifying the sample. In this framework, the classifier chooses the samples it is uncertain about to learn from. Methods in this framework usually apply a probabilistic learning scheme. For example, in a binary classification problem, uncertainty sampling simply selects the samples whose predicted probability is near 0.5 [22]. In the multi-class case, entropy can be used to measure the uncertainty of the classifier. An intuitive explanation is that uncertainty sampling chooses the samples near the decision boundary and uses them to update the decision boundary and obtain a more precise result. However, if uncertainty sampling chooses outliers, we fail to obtain a more precise decision boundary [35]. Although these outliers have high uncertainty, they cannot improve the classifier.
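To make this concrete, the selection step can be sketched as follows in Python (a minimal illustration assuming a scikit-learn-style classifier with fit/predict_proba; the function names are ours and not taken from the cited work):

    import numpy as np

    def entropy(prob_row, base=2.0):
        """Entropy of one sample's class-probability vector (Eq. 1)."""
        p = prob_row[prob_row > 0]                   # skip zero probabilities to avoid log(0)
        return float(-np.sum(p * (np.log(p) / np.log(base))))

    def uncertainty_select(classifier, X_labeled, y_labeled, X_pool):
        """Return the index of the pool sample with the largest entropy (Eq. 2)."""
        classifier.fit(X_labeled, y_labeled)
        probs = classifier.predict_proba(X_pool)     # shape: (n_pool, n_classes)
        entropies = np.array([entropy(row) for row in probs])
        return int(np.argmax(entropies))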

In uncertainty sampling, the choice of the initial training set is also important. In previous studies of active learning, the initial training set is usually chosen randomly, based on the assumption that random sampling selects a set of samples that shares the same distribution as the whole dataset. However, since a very small initial training set is usually chosen in practice, this assumption may not hold. In [40], a method using sampling by clustering (SBC) is proposed to select the most typical samples as the initial training set. To this end, the corpus is clustered into a specific number of groups (depending on the size of the initial training set), and the samples closest to the cluster centroids are taken as the most typical samples.

2.1.2 The selection method based on density distribution

The key idea of selection methods based on density distribution is to focus on the whole input space; such methods are therefore unlikely to choose outliers. They compute the density over the space of samples to choose typical samples. However, a high-density area is not necessarily near the decision boundary, so density is usually combined with uncertainty.

In [40], a framework for describing information density is proposed. It is a general weighting method based on density and uncertainty. Its key idea is that the most informative samples are not only uncertain but also located in a high-density area of the input space. In [40], the unlabeled samples are evaluated by a density measure based on the K nearest neighbors. The K nearest neighbors of sample x are represented as the following set:

$$\begin{aligned} S(x) = \{S_{1}, S_{2}, ..., S_{K}\}. \end{aligned}$$
(3)

Hence, the density DS(x) over the K nearest neighbors is defined as:

$$\begin{aligned} DS(x) = \frac{\sum _{s_{i}\in S(x)}cos(x, s_{i})}{K}, \end{aligned}$$
(4)

where \(cos(x, s_i)\) is the cosine similarity between sample x and \(s_i\).

A further method is proposed in [39]. It is based on uncertainty sampling and density measurement, where the uncertainty is measured by entropy and the density is measured by the K-means method. In this method, uncertainty and density are combined into a measure called Density * Entropy, computed as in Eq. 5. This is a general way of combining uncertainty and density:

$$\begin{aligned} DSH(x) = DS(x) * H(x), \end{aligned}$$
(5)

where H(x) is the entropy, computed via Eq. 1.
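As an illustration only (not the authors' code), Eqs. 4 and 5 can be sketched as follows, where the rows of pool are the feature vectors of the unlabeled samples and h_x is the entropy of x from Eq. 1:

    import numpy as np

    def knn_density(x, pool, k):
        """DS(x) in Eq. 4: mean cosine similarity between x and its K most similar samples S(x).
        The caller should exclude x itself from pool."""
        sims = pool @ x / (np.linalg.norm(pool, axis=1) * np.linalg.norm(x) + 1e-12)
        return float(np.sort(sims)[-k:].mean())      # average over S(x) in Eq. 3

    def density_entropy(x, pool, k, h_x):
        """DSH(x) = DS(x) * H(x) as in Eq. 5."""
        return knn_density(x, pool, k) * h_x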

Density-based sample selection approaches have been applied in many research fields. In [18], they are introduced into cross-lingual sentiment classification. Since most recent work in natural language processing focuses on English and there are not enough labeled sentiment resources in other languages, sample selection methods are needed to reduce the cost of manually constructing annotated sentiment corpora for a new language. This work applies density measures to unlabeled samples to avoid selecting outliers.

2.1.3 Supervised sample selection methods

When active learning is combined with specific applications, supervised sample selection methods are introduced; these require extra labeled samples for training.

In [23], a supervised approach is proposed to solve the domain adaptation problem in sentiment classification. The domain adaptation problem arises when a sentiment classifier trained with labeled samples from one domain performs poorly in another domain; the difficulty is that the data distribution in the source domain differs from that in the target domain. The authors of [23] perform active learning for cross-domain sentiment classification with a supervised approach that needs a small amount of labeled data for training. First, the labeled source and target data are used to train two individual classifiers. Then, the unlabeled samples are passed through both classifiers, and a sample is selected when the two classifiers disagree on it.

In [34], an active learning support vector machine (SVM) is proposed for image classification tasks. It prefers to select uncertain samples from the unlabeled dataset as the classifier's training samples. The goal of the active learning task is to select the support vectors, which are the most informative samples for classification. In the supervised part, the labeled samples are used to find the best model parameters and thus obtain an optimal classifier.

The supervised sample selection methods discussed above combine active learning with specific applications. However, such approaches are hard to apply elsewhere. For example, the model proposed in [34] applies active learning to SVMs, but it is hard to use with other classifiers such as KNN. The model presented in this paper is an unsupervised sample selection method, and it is a general method developed for applications from different research fields, such as text classification [19], image retrieval, sentiment search [13,14,15, 37], etc.

2.2 The expression of sentiment texts

Most sentiment texts on the Internet are online reviews from e-commerce platforms or posts on Twitter or Facebook. These texts are short in length, so they are short texts, and the sentiment classification problem can be transformed into a short text classification problem. In short text classification, expressing a short text as a vector is an important step. It includes two parts: finding features that represent the semantic meaning of the text, and composing these features. In the field of short text classification, the most widely used model is the Vector Space Model [30].

In the Vector Space Model, the Bag of Words model, which ignores the order of words, is usually used because samples may be sparse. In this way, a text is expressed as a vector in which every feature stands for a word and its value is the weight of that word.

Specific to Chinese, apart from the Bag of Words model, the N-gram model is also a popular method [4]. It takes N successive characters as a feature (e.g., given the Chinese phrase “大学新生” (college freshman), the Bag of Words model yields “大学” (college) and “新生” (freshman), whereas the 2-gram model yields “大学” (college), “学新” (a combination with no meaning in Chinese), and “新生” (freshman), i.e., the combinations of each pair of successive characters).
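For illustration only (a minimal sketch assuming character-level 2-grams, as in the example above), such features can be extracted as follows:

    def char_bigrams(text):
        """2-gram features: every pair of successive characters becomes one feature."""
        return [text[i:i + 2] for i in range(len(text) - 1)]

    # e.g. char_bigrams("大学新生") -> ["大学", "学新", "新生"]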

2.3 Term weighting

Term weighting is the process of deciding the values of the features when the Vector Space Model [4, 30] is used to express a text as a vector. Traditionally, term weighting is unsupervised, which means that the labels of the texts are ignored.

There are some popular models:

  • 0/1 model: if a word appears in the text, the corresponding feature takes the value 1; if it does not appear, the feature takes the value 0.

  • TF-IDF (Term Frequency–Inverse Document Frequency) model [33]: given a text, the term frequency is the number of times a specific word appears in this text. This value is usually normalized so that it does not grow simply because the text is longer. The inverse document frequency measures the importance of a word, based on the assumption that the fewer documents a word appears in, the more important it is.

However, because a short text contains only a few words, unsupervised term weighting methods do not perform well. It is therefore better to use supervised methods, which take the category labels of the texts into account.

IG and \(\chi ^{2}\) are two typical statistical measures [29] that can act as the confidence of the dependency between a word and a category. In these term weighting methods, a multi-class classification problem is converted into multiple binary classification problems.

3 Top K representative

The classification of short texts involves a supervised learning process, which needs a dataset labeled by humans. However, creating a massive training corpus is expensive and highly time consuming in practice; this process is therefore the bottleneck when applying such classification to a new field. The purpose of our study is to use active learning to minimize the amount of manual work needed while achieving the same final performance as obtained by manually labeling all data. Active learning is a widely used framework that allows for the selection of the samples carrying the largest amount of information. The key idea underlying active learning is to apply machine learning algorithms to actively acquire training labels from a given dataset [28]. In this manner, we can attain higher accuracy with fewer training labels. We thus attempt to reduce manual labeling work in supervised learning problems through active learning while achieving satisfactory performance.

In this paper, we propose a sample selection method based on the idea of KNN. KNN performs well in text classification because the more similar two texts are, the more likely they are to belong to the same category [11]. This is exactly how KNN works: given a text, it finds the k samples most similar to it in the training set, votes according to the categories of these samples, and chooses the category that receives the most votes [31]. Therefore, if the k samples are strongly similar to the given text, the categorization is probably correct; conversely, if the samples found are only weakly similar to the text, the categorization is likely incorrect. Given this, we propose a sample selection method whose idea is to select a subset of the samples for manual labeling such that, for any test text, the top k samples found in the subset are strongly similar to it. To this end, we first define the concept of representativeness between samples and sets.

3.1 Representing sample B using sample A

The representativeness of sample A with respect to sample B:

$$\begin{aligned} R_{1}(a, b) = Sim(a, b), \end{aligned}$$
(6)

where Sim(a, b) represents the similarity between samples a and b; that is, the representativeness of sample a with respect to sample b is simply their similarity. Similarity can be computed using different methods according to the problem at hand. As we focus on short texts in this paper, we use cosine similarity. Furthermore, according to the tf-idf weights of the terms appearing in a given text, its feature vector can be represented as:

$$\begin{aligned} q_{i} = < t_{i1}:w_{i1}, t_{i2}:w_{i2}, ..., t_{in}:w_{in}>, \end{aligned}$$
(7)

where the \(t_{ij}\) are the terms appearing in the dataset and \(w_{ij}\) is the weight of \(t_{ij}\), computed by tf-idf:

$$\begin{aligned} w_{ij} = tf_{ij} * idf_{j} = tf_{ij} * log\left( \frac{N}{n_{j}}\right) . \end{aligned}$$
(8)

In Eq. 8, N is the total number of texts and \(n_{j}\) is the number of texts containing the corresponding term. The definition proposed in this paper is given in the context of short text classification, but it does not exhaust the meaning of the term. The proposed Top K Representative active learning is a general method that can also be used with data other than text, e.g., images. Depending on the data type, different definitions of representativeness can be used, e.g., the reciprocal of the Euclidean distance.
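A minimal sketch of Eqs. 6–8 follows (our illustration; doc_freq maps each term to \(n_{j}\), the number of texts containing it, and n_docs is N):

    import math
    from collections import Counter

    def tfidf_vector(terms, doc_freq, n_docs):
        """Sparse feature vector of Eq. 7 with tf-idf weights from Eq. 8 (raw term frequency)."""
        tf = Counter(terms)
        return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf if t in doc_freq}

    def r1(v_a, v_b):
        """R_1(a, b) = Sim(a, b): cosine similarity of two sparse tf-idf vectors (Eq. 6)."""
        dot = sum(w * v_b.get(t, 0.0) for t, w in v_a.items())
        norm_a = math.sqrt(sum(w * w for w in v_a.values()))
        norm_b = math.sqrt(sum(w * w for w in v_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0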

3.2 Representing a sample using a set

The representativeness of set \(S_{1}\) with respect to sample b:

$$\begin{aligned} R_{2}(S_{1}, b) = \sum _{a\in TopKSim(b,S_{1},k)}R_{1}(a, b), \end{aligned}$$
(9)

where \(TopKSim(b, S_{1}, k)\) denotes the k samples in \(S_{1}\) most similar to b, and k is chosen by a human. \(R_{2}(S_{1}, b)\) measures both whether the k samples most similar to a test sample can be found in set \(S_{1}\) and how strong that similarity is. We use the sum of the similarities between b and these k samples as the representativeness of set \(S_{1}\) with respect to the test sample. This is based on the assumption mentioned above: if a set represents a sample well, the k samples in the set most similar to this sample must share a strong similarity with it. To treat the k samples equally, we take the sum of their similarities to b. This is not the only way to measure these samples; the method can be altered according to the problem. For instance, in some problems it is important to find the single most similar sample, in which case \(R_{2}(S_{1}, b)\) can be defined through a weighting scheme.
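The definition in Eq. 9 can be sketched directly as follows (an unoptimized illustration; sim(a, b) stands for any similarity function, e.g., the r1 cosine similarity above applied to the tf-idf vectors of a and b):

    def top_k_sim(b, S1, k, sim):
        """TopKSim(b, S1, k): the k samples in S1 most similar to b."""
        return sorted(S1, key=lambda a: sim(a, b), reverse=True)[:k]

    def r2(S1, b, k, sim):
        """R_2(S1, b) in Eq. 9: sum of the similarities between b and its top-k samples in S1."""
        return sum(sim(a, b) for a in top_k_sim(b, S1, k, sim))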

3.3 Representing set \(S_{2}\) using set \(S_{1}\)

The representativeness of set \(S_{1}\) with respect to set \(S_{2}\) is defined as:

$$\begin{aligned} R_{3}(S_{1}, S_{2}) = \sum _{b\in S_{2}} R_{2}(S_{1}, b). \end{aligned}$$
(10)

Given an unlabeled dataset X, we need to find a subset of X. We call this subset \(X_{S}\) and it should satisfy the following conditions:

$$\begin{aligned}&(a) \quad |X_{S}| = m, \quad \quad \quad \quad \quad \quad \quad \quad \quad \nonumber \\&(b) \quad X_{S} = argmax_{X_{S}\subset X} R_{3}(X_{S},X), \end{aligned}$$

where m is the number of samples that we need to label manually, which depends on the labeling manpower available. Subset \(X_{S}\), which satisfies the two conditions above, best represents set X: given any sample in set X, we can always find k samples in its subset \(X_{S}\) that share a strong similarity with it.

\(R_{3}(X_{S}, X)\) denotes the representativeness of a subset with respect to its superset. If this value is high, then for any sample in the superset, the k most similar samples found in the subset are strongly similar to it. In other words, we have obtained a subset in which we can always find k samples that satisfactorily represent any sample in the superset, i.e., the subset has a distribution similar to that of its superset.
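For very small datasets, the objective in conditions (a) and (b) can be evaluated by exhaustive search, which makes the definition concrete (a brute-force sketch built on the r2 function above; it illustrates the problem rather than giving a practical algorithm):

    from itertools import combinations

    def best_subset_exact(X, m, k, sim):
        """Exhaustively search for X_S = argmax R_3(X_S, X) with |X_S| = m (Eq. 10).
        Feasible only for tiny X, since the number of subsets grows combinatorially."""
        best_subset, best_score = None, float("-inf")
        for subset in combinations(range(len(X)), m):
            candidates = [X[i] for i in subset]
            score = sum(r2(candidates, b, k, sim) for b in X)   # R_3(candidates, X)
            if score > best_score:
                best_subset, best_score = list(subset), score
        return best_subset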

figure g

3.4 Optimization algorithm

We want to find the subset that best satisfies the aforementioned conditions. Because traversing every subset of a set is NP-hard [20], we cannot obtain a precise result in an acceptable amount of time. However, we do not need a precise solution to this problem, and an approximate result can yield acceptable performance. Therefore, we propose the greedy algorithm shown in Algorithm 1, which adds the sample with the largest marginal gain to the subset in each iteration.

Lines 1–4: As \(X_{S}\) is empty at the outset, the TopKSim set of every sample is empty. Therefore, adding a sample a to \(X_{S}\) causes \(TopKSim(b, X_{S}, k)\) of every sample b in dataset X to gain the new member a, and \(R_2(X_{S}, b)\) increases by Sim(a, b). As \(R_{3}(S_{1}, S_{2}) = \sum _{b\in S_{2}} R_{2}(S_{1}, b)\), the overall \(R_3(X_S, X)\) increases by \(\sum _{b\in X}Sim(a, b)\). Therefore, \(Score(a) = \sum _{b\in X}Sim(a, b)\).

Lines 6–11: Following the greedy strategy, we choose the sample with the largest value of Score, i.e., the one that increases \(R_3(X_S, X)\) by the largest amount.

Lines 12–31: Having selected sample \(a_{selected}\), for every other sample a, if the similarity between \(a_{selected}\) and a is higher than the similarity between a and the \(k\)th most similar member of its current set \(topKSim(a, X_S, k)\), then \(a_{selected}\) is added to \(topKSim(a, X_S, k)\). If the size of \(topKSim(a, X_S, k)\) then exceeds k, the member with the lowest similarity is removed. After this update, if \(topKSim(a, X_S, k)\) contains exactly k members, the Score of every remaining sample b may have changed, because adding a new sample b later would now evict a different lowest-similarity member, even if b itself would eventually stay in topKSim. Therefore, we must update the Score of every sample in X. We first recompute \(KthSim(a, X_S, k)\) and denote its previous value by \(KthSim(a, X_S, k)_{old}\); note that \(KthSim(a, X_S, k) > KthSim(a, X_S, k)_{old}\). If \(Sim(a, b) > KthSim(a, X_S, k)\), adding b to \(X_S\) would increase \(R_2(X_S, a)\) by Sim(a, b) and then reduce it by \(KthSim(a, X_S, k)\) through the eviction; in this case, Score(b) decreases by \(KthSim(a, X_S, k) - KthSim(a, X_S, k)_{old}\). If \(KthSim(a, X_S, k)_{old}< Sim(a, b) < KthSim(a, X_S, k)\), the contribution of Sim(a, b) must be removed from Score(b), which means subtracting \(Sim(a, b) - KthSim(a, X_S, k)_{old}\) from it. Otherwise, if \(Sim(a, b) < KthSim(a, X_S, k)_{old}\), no change is needed, as it was already impossible for b to enter the topKSim set of a.
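Algorithm 1 itself is shown as a figure; the following is a simplified Python sketch of the same greedy selection. It recomputes each candidate's marginal gain in every round instead of maintaining the incremental Score updates described above, so it is slower than Algorithm 1 but illustrates the selection logic (our own naming and simplification, not the authors' code):

    import heapq

    def greedy_tkr(X, m, k, sim):
        """Greedy approximation of the TKR subset (simplified, non-incremental version)."""
        n = len(X)
        selected = []
        # top_k_sims[b]: min-heap of the similarities between sample b and its current
        # top-k most similar selected samples (at most k entries).
        top_k_sims = [[] for _ in range(n)]

        for _ in range(m):
            best_idx, best_gain = None, float("-inf")
            for a in range(n):
                if a in selected:
                    continue
                gain = 0.0
                for b in range(n):
                    s = sim(X[a], X[b])
                    kth = top_k_sims[b][0] if len(top_k_sims[b]) == k else 0.0
                    if s > kth:
                        gain += s - kth          # increase of R_2(X_S, b) if a were added
                if gain > best_gain:
                    best_idx, best_gain = a, gain
            selected.append(best_idx)
            # Update every sample's top-k similarities with the newly selected sample.
            for b in range(n):
                s = sim(X[best_idx], X[b])
                if len(top_k_sims[b]) < k:
                    heapq.heappush(top_k_sims[b], s)
                elif s > top_k_sims[b][0]:
                    heapq.heapreplace(top_k_sims[b], s)
        return selected                          # indices of the m selected samples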

3.5 Complexity analysis

In our algorithm, when a new sample is considered, there are two main loops. The first loop checks whether the top k samples need to be updated; when an update is needed, the second loop updates the Score values of all other samples. Both loops are linear in |X|, and the process of updating the top k samples is linear in k. The complexity of our algorithm is therefore \(O(k * m * |X|^2)\), where m is the number of samples we want to provide for manual labeling and |X| is the size of the entire dataset. For each sample, we need to retain the top k samples, so the space complexity is \(O(k * |X|)\). To reduce the computation, we use a matrix to record the similarity between each pair of samples. This matrix is sparse (in practice, fewer than 10% of its values are non-zero), so we can skip the zero values while traversing the matrix to speed up the computation.
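One possible way to precompute such a sparse similarity matrix, using scikit-learn (the library choice is ours and is not specified in the paper), is:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def sparse_similarity_matrix(texts):
        """Pairwise cosine similarities between tf-idf vectors, kept in sparse form
        so that zero entries can be skipped when accumulating the Score values."""
        tfidf = TfidfVectorizer().fit_transform(texts)        # sparse document-term matrix
        return cosine_similarity(tfidf, dense_output=False)   # sparse similarity matrix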

Table 1 AVG-F1 value of different sampling methods
Fig. 2
figure 2

The performance of different sampling methods

4 Experiments

4.1 Datasets

In this experiment, we compare the performance of different sampling methods. The proposed sample selection method is general and can be applied in different fields, for example, short text classification or sentiment classification. In the experiments, we apply the sample selection methods to short text classification: samples are selected using the different sampling methods and used to train classifiers, and the performance of each classifier represents the performance of the corresponding sampling method. In this experiment, we assume that the dataset has no labels; a sample obtains its label only when it is selected by a sampling method. We use the Google Snippets dataset [26], which consists of 10,060 training snippets and 2280 test snippets from 8 categories. On average, each snippet has 18.07 words. \(10\%\) of this dataset is used as the test set, denoted as T, while \(90\%\) is regarded as the optional set, denoted as X. We do not use all the samples in the optional set X as our training set; instead, we use only N samples of X as the training set. Ten-fold cross-validation [24] is applied to make the results convincing.

4.2 Evaluation metric

The evaluation metric used in this paper is AVG-F1 [27]. It is calculated as follows:

$$\begin{aligned} AVG-F1 = \frac{1}{N} \sum _{i=1}^{N} F1_{i}, \end{aligned}$$
(11)

where N is the number of classifiers, and \(F1_{i}\) is the F1 value for classifier i. \(AVG-F1\) is the average F1 value of all classifiers. \(F1_{i}\) is calculated as follows [27]:

$$\begin{aligned} F1_{i} = 2 * \frac{Precision_{i}*Recall_{i}}{Precision_{i}+Recall_{i}}. \end{aligned}$$
(12)

The precision and recall values are calculated as follows [12, 25]:

$$\begin{aligned} Precision_{i}= & {} \frac{TP_{i}}{TP_{i}+FP_{i}}, \end{aligned}$$
(13)
$$\begin{aligned} Recall_{i}= & {} \frac{TP_{i}}{TP_{i}+FN_{i}}, \end{aligned}$$
(14)

where \(TP_{i}\) is the number of samples correctly classified as belonging to the positive class by classifier i, \(FP_{i}\) is the number of samples incorrectly classified as positive, and \(FN_{i}\) is the number of samples incorrectly classified as belonging to the negative class.
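Assuming that the \(F1_{i}\) correspond to the per-class F1 values of one-vs-rest classifiers, AVG-F1 can be computed as in the following sketch using scikit-learn (it is then equivalent to the macro-averaged F1):

    import numpy as np
    from sklearn.metrics import f1_score

    def avg_f1(y_true, y_pred):
        """AVG-F1 (Eq. 11): unweighted mean of the per-class F1 values (Eqs. 12-14)."""
        return float(np.mean(f1_score(y_true, y_pred, average=None)))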

4.3 Comparing methods

The sampling methods compared in this experiment are as follows:

  • Random sampling: N samples are selected randomly as the training set [7].

  • Uncertainty Sampling: K samples are selected randomly at first, and then another N samples are selected one by one. When selecting the Nth sample, the N-1 samples already selected are used to train a classifier \(C_{N-1}\), which yields the probability \(p_{ij}\) that sample i belongs to class j. The entropy of sample i is then calculated via Eq. 1. In information theory [32], entropy is used to measure uncertainty. A higher entropy value means that the probabilities of a sample belonging to each class are similar, so it is harder to determine a class for that sample. Therefore, a high entropy indicates that the classifier performs poorly on this sample, and the sample should be put into the training set to bring more classifying information to the classifier. The Nth sample is selected via Eq. 2.

  • Density*Entropy: Zhu et al. propose a method for describing information density [40], called Density*Entropy. This is a weighted method based on density and uncertainty. Its main idea is that the most informative samples are not only uncertain but also located in a dense area of the input space. In [40], the unlabeled samples are evaluated by a density measure based on the K nearest neighbors. The K nearest neighbors of sample x are represented as the set \(S(x) = \{S_{1}, S_{2}, ..., S_{k}\}\), and the density DS(x) of the K nearest neighbors is defined in Eq. 4. The formula for Density*Entropy is given in Eq. 5. This method is widely used to combine uncertainty-based and density-based sampling.

  • Top K Representative: the sampling method proposed in this paper.

Table 2 Relation between the number of samples in optional set |X| and the run time
Fig. 3
figure 3

Relation between the number of samples in optional set |X| and the run time

Table 3 Relation between the number of samples we selected m and the run time
Fig. 4
figure 4

The relation between the number of samples we selected m and the running time

4.4 Comparison of sample selection methods

We also compared the performance of the proposed TKR with other sample selection methods. We chose random sampling, uncertainty sampling, and Density * Entropy as the baseline methods, as they are popular sample selection techniques. These methods were applied to select samples from the Snippets dataset, and the selected samples were used to train a KNN classifier. We used the AVG-F1 value [27] to evaluate the performance of each sampling method according to the performance of the corresponding classifier. The experimental results are shown in Fig. 2 and Table 1. As shown in the figure, classification accuracy increased with the number of selected samples, because with more samples the classifiers became better informed and could identify more features in the dataset. The figure also shows that our proposed Top K Representative yielded a higher AVG-F1 value than the other methods when selecting the same number of samples. The gap between Top K Representative and the other methods narrowed as the number of selected samples increased, because the corresponding classifiers then had sufficient information to choose samples more accurately. We also found that the AVG-F1 values of Density * Entropy and TKR were similar and higher than those of random sampling and uncertainty sampling. This is because the latter methods struggle with outliers, whereas TKR and Density * Entropy can handle them. The AVG-F1 value of TKR was higher than that of Density * Entropy, the second-best method, by 3.7% because the former could select important samples along the boundaries of the categories whereas the latter could not. Note that the AVG-F1 value of Top K Representative for 100 selected samples was 0.47, whereas random sampling attained this value only with 200 samples, and uncertainty sampling needed 400 samples to obtain the same value. That is to say, our proposed method can reduce labeling work by half while delivering the same performance as the other methods, saving both the time and the human resources needed to label samples.

In general, traditional methods such as uncertainty sampling perform worse than random selection on short texts. We found that uncertainty sampling tended to select more outliers in a short text dataset, which had a negative effect on classification performance. A short text dataset is usually sparse; that is, some words in a sample may never have appeared before. Such samples lie at large distances from the others, and classifiers cannot obtain the information needed from the other samples to classify them. Under uncertainty sampling, these samples are selected as training samples but contribute little to the classification process; thus, uncertainty sampling yields poor performance on short texts. Our proposed method and Density * Entropy do not encounter this problem because they choose samples that can represent the entire dataset. As outliers have low similarity with the other samples in a dataset, the samples obtained by our proposed method are less likely to be outliers.

4.5 Time efficiency of top K representative

This part of our experiment was intended to verify the efficiency of the optimization algorithm proposed in this paper and gauge its complexity. We also used the Snippets dataset in this experiment. There are two parameters in the optimization algorithm that can influence efficiency. The first is the total number of samples, i.e., the number of samples in the optional set, denoted by |X|, and the second parameter is the number of samples we want to select from all samples, denoted by m.

The relation between the number of samples in the optional set |X| and the run time is shown in Table 2 and Fig. 3, with m fixed to 100. As shown in the results, the run time was close to quadratic in |X|: it increased 2.96 times when |X| grew from 1000 to 2000, and 3.47 times from 2000 to 4000. The quadratic relation was not obvious when |X| was small, because program initialization and other fixed overheads occupied part of the time. As the number of samples increased, however, this fixed share of the run time decreased, and the curve of the run time with respect to |X| approached a quadratic function.

We conducted another experiment to explore the relationship between the number of target samples m and the run time. The results are shown in Table 3 and Fig. 4, with the number of samples in the optional set fixed to \(|X| = 12387\). The figure shows that the run time and the number of target samples m have an approximately linear relationship. The similarities among all samples are calculated in the initialization process, which consumes a large amount of time even when the number of target samples is small: for example, it took approximately 19.221 s to select only 100 samples, because most of the time was consumed by initialization. The run time increased to 34.626 s when the number of target samples was increased to 30,000. The linear relationship became apparent when the number of target samples exceeded 100; in this range, the run time increased by 1–2 s per 100 additional samples.

In general, it is clear from the experiment that the complexity of the optimization algorithm is \(O(k*m*|X|^2)\). The run time of our algorithm for a sample size of approximately 10,000 was less than 100 s, an acceptable result. Thus the proposed algorithm can be applied to practical problems.

5 Conclusion and future work

In this paper, we propose a method of selecting samples for manual labeling. The proposed method can be applied in different fields, for example, text classification and sentiment classification, to reduce the labeling workload for datasets. First, we propose a series of definitions of representativeness; we then convert the selection problem into finding the subset with the highest representativeness. As this is an NP-hard problem and we do not need a precise result, we propose a greedy algorithm to solve it.

We did not consider updating the model following manual labeling; sample selection was therefore unsupervised. Further work could use the manually labeled samples to improve the model, or combine them with uncertainty sampling to obtain a method that performs both distribution estimation and uncertainty estimation. Although the greedy algorithm we used is efficient, its result is not exact. Further research could use the hill-climbing algorithm, genetic algorithms, or other algorithms to yield better results and improve sampling performance.