1 Introduction

Supervised learning algorithms require extensive labeled data to train models and then make predictions on new data [7, 8, 25, 30]. Labeling tasks have conventionally been performed by domain experts or well-trained workers [19]. This kind of method provides high-quality labels, but is inefficient and expensive [10, 11]. Social network services have supplied a novel way to address the labeling problem. In fact, programs such as the Listen game [21] have proven the feasibility of using public resources to address difficult machine learning problems [22]. Although these methods provide free labeled data, guaranteeing their quality is difficult. Therefore, a more direct and economical method is to hire online crowd workers to label the data. This has become possible thanks to the rapid growth of crowdsourcing platforms such as Amazon Mechanical Turk and Crowdflower.

These crowdsourcing platforms have been widely used to obtain extensive labeled data in applications such as ImageNet [3], computer vision [13], and natural language processing [9]. However, owing to differences in personal preferences and cognitive abilities, the quality of labels collected from a single crowd worker is often poor, which may compromise practical applications that use such data. To solve this problem, multiple labels are frequently requested from different workers for a single instance. Indeed, many existing works [20, 29] have revealed the effectiveness of repeated labeling. After each instance has been labeled by different crowd workers and has thus obtained its multiple noisy label set, a strategy is needed to infer the unknown true label from this set, a process known as label aggregation (integration).

In recent years, label aggregation (integration) from multiple noisy labels has attracted much attention, and a large number of label aggregation strategies have been proposed [18, 28]. Dawid and Skene [1] proposed the DS strategy, which uses maximum likelihood estimation to estimate a confusion matrix for each labeler and a class prior. Raykar et al. [16] proposed the RY strategy, which is based on Bayesian estimation to model the sensitivity and the specificity of labelers. Demartini et al. [2] proposed the ZC strategy, which uses a two-element parameter to weight the reliability of a labeler. Karger et al. [9] proposed the KOS strategy, which is based on the reliabilities of labelers to capture the presence of spammers. Zhang et al. [28] proposed the GTIC strategy based on Bayesian statistics for multi-class labeling. Ma et al. [14] proposed the FaitCrowd strategy, which uses a novel probabilistic Bayesian model to address the challenge of inferring fine-grained source reliability. Zhang et al. [26] proposed the BLC strategy, which clusters two layers of features (conceptual-level and physical-level) to infer the true labels of instances. Zhang et al. [24] proposed the MNLDP strategy, which considers the intercorrelation among the multiple noisy label sets of different instances.

Of the numerous strategies, majority voting (MV) is the most straightforward, efficient, and widely used [4, 6, 19]. However, it discards a lot of useful information, such as the certainty information of the majority class and all the information of the minority class. To solve this problem, Sheng et al. [17] proposed four improved strategies, including two soft MV strategies and two paired soft MV strategies, to avoid this loss of information. Nevertheless, these strategies do not account for the labeling qualities of the crowd workers, especially the differences in the quality of workers labeling different instances. In other words, these strategies assume that different crowd workers have the same labeling quality, which is rarely true in real-world crowdsourcing scenarios.

To relax this assumption, in this paper we propose four new strategies, two weighted soft MV strategies and two weighted paired soft MV strategies, which assign different weights to workers when they label different instances. Specifically, we first use the similarity among worker labels to estimate the specific quality of a worker on different instances. Then, we build a classifier on the training set with the labels given by the worker and evaluate its classification accuracy on the test set as the overall quality of the worker across all instances. Finally, we combine these two qualities to define the weight of the worker labeling a particular instance. Our proposed strategies thus take into account the differences in the quality of workers labeling different instances. More importantly, extensive empirical studies validate the effectiveness of our four newly proposed strategies.

The remainder of this paper is organized as follows. Our research starts from soft MV and pairing and, thus, we first provide a comprehensive introduction in Sect. 2. Then, we propose four new strategies in Sect. 3. The experiments and results are reported in Sect. 4. Some extensions to multi-class classification are discussed in Sect. 5. Finally, the conclusions are drawn and some main directions for future work are outlined in Sect. 6.

2 Soft MV and pairing

For a crowdsourcing system, a training instance set is defined as \(E=\{e_{i}\}_{i=1}^{n}\), where each instance is \(e_{i}=\langle x_{i},y_{i},{\mathcal {L}}_{i}\rangle \), \(x_{i}\) is the feature vector, \(y_{i}\) is the unknown true label, and \({\mathcal {L}}_{i}=\left\{ l_{i j}\right\} _{j=1}^{m}\) is the multiple noisy label set provided by m crowd workers for the ith instance \(x_i\). For simplicity, in this paper, we provisionally restrict our discussion to binary classification, and thus both \(y_{i}\) and \(l_{i j}\) take values from the finite set \(\{+,-\}\) only.

When each instance has only a multiple noisy label set, conventional supervised learning algorithms cannot learn a model from these instances directly. Thus, label aggregation strategies are required to infer the unknown true label from each multiple noisy label set. Of the numerous strategies, MV is the most straightforward, efficient, and widely used. However, it discards a lot of useful information, such as the certainty information of the majority class and all the information of the minority class. For example, consider two instances with the multiple noisy label sets \(\{+,+,+,+,-\}\) and \(\{+,+,+,-,-\}\), respectively. According to MV, their aggregated (integrated) labels are of course the majority class \(+\). However, the certainty (confidence) information of \(+\) is ignored, which means that we cannot express how strongly each instance belongs to \(+\). At the same time, all the information of the minority class − is thoroughly discarded. As a result, we cannot distinguish between these two entirely different multiple noisy label sets, although their certainties (confidences) of belonging to the majority class \(+\) are totally different.

2.1 Soft MV

By exploiting the certainty information of the majority class, Sheng et al. [17] proposed two soft MV strategies: MV-Freq and MV-Beta. Similar to MV, MV-Freq and MV-Beta still use the majority class of a multiple noisy label set as the aggregated label, but at the same time assign a weight that represents the certainty of the majority class.

For MV-Freq, the certainty of the majority class is defined as the appearance frequency of the majority class in the multiple noisy label set. The detailed formula is

$$\begin{aligned} W_{H_{i}}=\left\{ \begin{array}{ll}{P\left( +| {\mathcal {L}}_{i}\right) ,} &{}\quad {P\left( +| {\mathcal {L}}_{i}\right) \ge P\left( -| {\mathcal {L}}_{i}\right) } \\ {P\left( -| {\mathcal {L}}_{i}\right) ,} &{}\quad {P\left( +| {\mathcal {L}}_{i}\right) <P\left( -| {\mathcal {L}}_{i}\right) }\end{array}\right. , \end{aligned}$$
(1)

where \(P\left( +| {\mathcal {L}}_{i}\right) \) (or \(P\left( -| {\mathcal {L}}_{i}\right) \)) is the certainty of the majority class \(+\) (or −) of the multiple noisy label set \({\mathcal {L}}_{i}\) of the ith instance \(x_i\), which can be estimated by

$$\begin{aligned} P(+|{\mathcal {L}}_{i})= & {} \frac{\sum _{j=1}^{m} \delta \left( l_{i j},+\right) }{\sum _{j=1}^{m} \delta \left( l_{i j},+\right) +\sum _{j=1}^{m} \delta \left( l_{i j},-\right) }, \end{aligned}$$
(2)
$$\begin{aligned} P(-|{\mathcal {L}}_{i})= & {} \frac{\sum _{j=1}^{m} \delta \left( l_{i j},-\right) }{\sum _{j=1}^{m} \delta \left( l_{i j},+\right) +\sum _{j=1}^{m} \delta \left( l_{i j},-\right) }, \end{aligned}$$
(3)

where \(l_{i j}\) is the class label provided by the jth worker for the ith instance, and \(\delta (\cdot )\) is an indicator function that outputs 1 if its two parameters are identical, and 0 otherwise.

Please note that Eqs. (2)–(3) differ slightly from those of the original paper [17], which uses the Laplace correction to reduce the effect of extreme probability estimations. However, we believe the Laplace correction should be removed from these equations so that they reflect the true frequency of each class in the multiple noisy label set. More importantly, our experiments show that using the Laplace correction reduces the performance of the related strategies to some extent. To save space, we do not present the detailed experimental results in this paper.

Now, for the above two different multiple noisy label sets \(\{+,+,+,+,-\}\) and \(\{+,+,+,-,-\}\), their weights are \(\frac{4}{5}=0.8\) and \(\frac{3}{5}=0.6\), respectively. Therefore, they can be represented by \(\{(+,0.8)\}\) and \(\{(+,0.6)\}\), respectively.

For MV-Beta, the certainty of the majority class in the multiple noisy label set is defined as

$$\begin{aligned} W_{H_{i}}=\max \left\{ I_{0.5}(\alpha _{i}, \beta _{i}), 1-I_{0.5}(\alpha _{i}, \beta _{i})\right\} , \end{aligned}$$
(4)

where \(I_{0.5}(\alpha _{i}, \beta _{i})\) is the value of the cumulative distribution function (CDF) of the Beta distribution at the decision threshold 0.5. The detailed formula is

$$\begin{aligned} I_{0.5}(\alpha _{i}, \beta _{i})=\sum _{k=\alpha _{i}}^{\alpha _{i}+\beta _{i}-1} \frac{(\alpha _{i}+\beta _{i}-1) !}{k!(\alpha _{i}+\beta _{i}-1-k) !} 0.5^{\alpha _{i}+\beta _{i}-1}, \end{aligned}$$
(5)

where \(\alpha _{i}\) and \(\beta _{i}\) are two shape parameters of the Beta distribution, which are calculated by

$$\begin{aligned} \alpha _{i}= & {} \sum _{j=1}^{m} \delta \left( l_{i j},+\right) +1. \end{aligned}$$
(6)
$$\begin{aligned} \beta _{i}= & {} \sum _{j=1}^{m} \delta \left( l_{i j},-\right) +1. \end{aligned}$$
(7)
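To make the two soft MV strategies concrete, the following minimal Python sketch (the function names are our own) computes both certainties for a binary multiple noisy label set; Eq. (5) is evaluated directly via its binomial sum.

```python
from math import comb

def mv_freq_certainty(labels):
    """Certainty of the majority class under MV-Freq, Eqs. (1)-(3)."""
    pos = labels.count('+')
    p_pos = pos / len(labels)
    return max(p_pos, 1 - p_pos)

def mv_beta_certainty(labels):
    """Certainty of the majority class under MV-Beta, Eqs. (4)-(7)."""
    alpha = labels.count('+') + 1   # Eq. (6)
    beta = labels.count('-') + 1    # Eq. (7)
    n = alpha + beta - 1
    # I_0.5(alpha, beta): Beta CDF at the decision threshold 0.5, Eq. (5)
    i_half = sum(comb(n, k) for k in range(alpha, n + 1)) * 0.5 ** n
    return max(i_half, 1 - i_half)  # Eq. (4)

print(mv_freq_certainty(list('++++-')))   # 0.8
print(mv_freq_certainty(list('+++--')))   # 0.6
print(mv_beta_certainty(list('++++-')))   # 0.890625
print(mv_beta_certainty(list('+++--')))   # 0.65625
```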

2.2 Paired soft MV

As shown in Sect. 2.1, MV-Freq and MV-Beta indeed exploit the certainty information of the majority class. However, like the simplest MV, they still discard all the information regarding the minority class. According to the observations in [17], the information regarding the minority class is also very important, especially when only a few labels are available in the multiple noisy label set.

By further exploiting the information about the minority class, Sheng et al. [17] proposed two paired soft MV strategies: Paired-Freq and Paired-Beta. Different from MV-Freq and MV-Beta, Paired-Freq and Paired-Beta generate a pair of weighted instances (a majority-class instance and a minority-class instance) from a single instance with a multiple noisy label set, where the weights of the pair are defined as the certainty of the majority class and the certainty of the minority class, respectively.

For Paired-Freq, the certainty of the majority class is also calculated by Eqs. (1)–(3). After obtaining the certainty of the majority class, the certainty of the minority class can be estimated by \(1-W_{H_{i}}\). Now, the above two different multiple noisy label sets \(\{+,+,+,+,-\}\) and \(\{+,+,+,-,-\}\) can be represented by \(\left\{ \left( +, 0.8\right) ,\left( -, 0.2\right) \right\} \) and \(\left\{ \left( +, 0.6\right) ,\left( -, 0.4\right) \right\} \), respectively. For Paired-Beta, the certainty of the majority class is also calculated by Eqs. (4)–(7). In the same way, the certainty of the minority class is \(1-W_{H_{i}}\).
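Continuing the earlier sketch, Paired-Freq simply attaches the complementary certainty to the minority class; the helper below (our naming) reuses mv_freq_certainty from the sketch in Sect. 2.1.

```python
def paired_freq(labels):
    """Decompose one instance into a weighted (majority, minority) pair."""
    w = mv_freq_certainty(labels)
    majority = '+' if labels.count('+') >= labels.count('-') else '-'
    minority = '-' if majority == '+' else '+'
    return [(majority, w), (minority, 1 - w)]

print(paired_freq(list('++++-')))   # [('+', 0.8), ('-', 0.2)]
print(paired_freq(list('+++--')))   # [('+', 0.6), ('-', 0.4)]
```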

3 Proposed strategies

Compared with the simplest MV, the four improved strategies [17] MV-Freq, MV-Beta, Paired-Freq, and Paired-Beta indeed avoid the loss of much information, such as the certainty information of the majority class and the certainty information of the minority class. However, none of these methods consider the labeling qualities of the crowd workers, especially the differences in the quality of workers labeling different instances. In other words, all of them assume that different crowd workers have the same labeling quality, which is rarely true in real-world crowdsourcing scenarios.

In many real-world crowdsourcing scenarios, even a high-quality worker may provide an incorrect label for a particular instance, whereas a low-quality worker may provide a correct label. If we assume that the same worker has the same labeling quality on different instances, the influence of incorrect labels from high-quality workers will be strengthened, whereas the influence of correct labels from low-quality workers will be weakened. We call this phenomenon “quality inversion.” For example, for a particular instance with a multiple noisy label set \(\{+,+,+,-,-\}\), if we do not account for the labeling qualities of the crowd workers, Paired-Freq represents it as \(\left\{ \left( +, 0.6\right) ,\left( -, 0.4\right) \right\} \). However, suppose that these five workers have different labeling qualities, such as \(\{0.95,0.6,0.94,0.92,0.59\}\), on this instance; then it can be represented as \(\left\{ \left( +, 0.6225\right) ,\left( -, 0.3775\right) \right\} \). By taking the labeling quality into account, we scale up the certainty of the majority class \(+\) and reduce the certainty of the minority class −. Now suppose that each of these five workers has the same labeling quality on another instance with a multiple noisy label set \(\{-,-,+,+,+\}\); then this instance is represented as \(\left\{ \left( +, 0.6125\right) ,\left( -, 0.3875\right) \right\} \). Thus, the certainty of the majority class \(+\) decreases slightly, whereas the certainty of the minority class − increases slightly. In other words, for this instance, the influence of the incorrect label (−) from a high-quality worker (the first worker, with labeling quality 0.95) is strengthened, whereas the influence of the correct label (\(+\)) from a low-quality worker (the last worker, with labeling quality 0.59) is weakened.
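The numbers in this example can be reproduced by treating the stated labeling qualities directly as label weights in the frequency estimate, as in the short check below (function name and rounding are ours).

```python
qualities = [0.95, 0.60, 0.94, 0.92, 0.59]

def weighted_freq_pair(labels, weights):
    """Quality-weighted Paired-Freq representation of one instance."""
    pos = sum(w for l, w in zip(labels, weights) if l == '+')
    p_pos = pos / sum(weights)
    return (('+', round(p_pos, 4)), ('-', round(1 - p_pos, 4)))

print(weighted_freq_pair(['+', '+', '+', '-', '-'], qualities))
# (('+', 0.6225), ('-', 0.3775))
print(weighted_freq_pair(['-', '-', '+', '+', '+'], qualities))
# (('+', 0.6125), ('-', 0.3875))
```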

To deal with the phenomenon of “quality inversion” discussed previously, in this paper, we argue that the same worker may also have different labeling qualities on different instances. Based on this premise, we propose four new strategies, including two weighted soft MV strategies and two weighted paired soft MV strategies, by assigning different weights to workers when labeling different instances. Specifically, a label similarity-based weighting method that combines the specific quality of the worker on different instances and the overall quality of the worker across all instances is proposed to estimate the weight of each crowd label. We simply denote the resulting strategies by WMV-Freq, WMV-Beta, WPaired-Freq, and WPaired-Beta, respectively.

3.1 Weighted soft MV

Similar to MV, MV-Freq and MV-Beta also assume that all crowd workers have the same labeling quality. To improve MV-Freq and MV-Beta, we propose two weighted soft MV strategies: WMV-Freq and WMV-Beta, respectively.

For WMV-Freq, the weight formula is the same as Eq. (1). We repeat it here for convenience:

$$\begin{aligned} W_{H_{i}}=\left\{ \begin{array}{ll}{P\left( +| {\mathcal {L}}_{i}\right) ,} &{}\quad {P\left( +| {\mathcal {L}}_{i}\right) \ge P\left( -| {\mathcal {L}}_{i}\right) } \\ {P\left( -| {\mathcal {L}}_{i}\right) ,} &{}\quad {P\left( +| {\mathcal {L}}_{i}\right) <P\left( -| {\mathcal {L}}_{i}\right) }\end{array}\right. , \end{aligned}$$
(8)

where \(P\left( +| {\mathcal {L}}_{i}\right) \) (or \(P\left( -| {\mathcal {L}}_{i}\right) \)) is also the certainty of the majority class \(+\) (or −) of the multiple noisy label set \({\mathcal {L}}_{i}\) of the ith instance \(x_i\), but they are estimated using Eqs. (9)–(10) instead of Eqs. (2)–(3), respectively:

$$\begin{aligned} P(+|{\mathcal {L}}_{i})= & {} \frac{\sum _{j=1}^{m} w_{i j}\delta \left( l_{i j},+\right) }{\sum _{j=1}^{m} w_{i j}\delta \left( l_{i j},+\right) +\sum _{j=1}^{m} w_{i j}\delta \left( l_{i j},-\right) }, \end{aligned}$$
(9)
$$\begin{aligned} P(-|{\mathcal {L}}_{i})= & {} \frac{\sum _{j=1}^{m} w_{i j}\delta \left( l_{i j},-\right) }{\sum _{j=1}^{m} w_{i j}\delta \left( l_{i j},+\right) +\sum _{j=1}^{m} w_{i j}\delta \left( l_{i j},-\right) }, \end{aligned}$$
(10)

where \(w_{i j}\) is the weight of \(l_{i j}\).

For WMV-Beta, the weight formula is the same as Eq. (4). We also repeat it here for convenience:

$$\begin{aligned} W_{H_{i}}=\max \left\{ I_{0.5}(\alpha _{i}, \beta _{i}), 1-I_{0.5}(\alpha _{i}, \beta _{i})\right\} , \end{aligned}$$
(11)

where

$$\begin{aligned} I_{0.5}(\alpha _{i}, \beta _{i})=\sum _{k=[\alpha _{i}]}^{[\alpha _{i}+\beta _{i}]-1} \frac{([\alpha _{i}+\beta _{i}]-1) !}{k!([\alpha _{i}+\beta _{i}]-1-k) !} 0.5^{[\alpha _{i}+\beta _{i}]-1}, \end{aligned}$$
(12)

where \(\left[ \cdot \right] \) is an integer-valued function, and \(\alpha _{i}\) and \(\beta _{i}\) are again the two shape parameters of the Beta distribution, but they are now estimated using Eqs. (13)–(14) instead of Eqs. (6)–(7), respectively:

$$\begin{aligned} \alpha _{i}= & {} \sum _{j=1}^{m} w_{i j} \delta \left( l_{i j},+\right) +1, \end{aligned}$$
(13)
$$\begin{aligned} \beta _{i}= & {} \sum _{j=1}^{m} w_{i j} \delta \left( l_{i j},-\right) +1. \end{aligned}$$
(14)
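A weighted version of the earlier sketch follows; here the integer-valued function \([\cdot]\) in Eq. (12) is realized as rounding to the nearest integer, which is one plausible reading, not necessarily the original implementation.

```python
from math import comb

def wmv_freq_certainty(labels, weights):
    """WMV-Freq certainty, Eqs. (8)-(10), with per-label weights w_ij."""
    pos = sum(w for l, w in zip(labels, weights) if l == '+')
    p_pos = pos / sum(weights)
    return max(p_pos, 1 - p_pos)

def wmv_beta_certainty(labels, weights):
    """WMV-Beta certainty, Eqs. (11)-(14); [.] taken as rounding here."""
    alpha = sum(w for l, w in zip(labels, weights) if l == '+') + 1  # Eq. (13)
    beta = sum(w for l, w in zip(labels, weights) if l == '-') + 1   # Eq. (14)
    a, n = round(alpha), round(alpha + beta) - 1
    i_half = sum(comb(n, k) for k in range(a, n + 1)) * 0.5 ** n     # Eq. (12)
    return max(i_half, 1 - i_half)                                   # Eq. (11)
```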

3.2 Weighted paired soft MV

Similar to MV-Freq and MV-Beta, WMV-Freq and WMV-Beta use only the certainty information of the majority class and discard all the information of the minority class. Consequently, to improve WMV-Freq and WMV-Beta, we also adapt Paired-Freq and Paired-Beta to propose two weighted paired soft MV strategies: WPaired-Freq and WPaired-Beta, respectively.

For WPaired-Freq, the certainty of the majority class is also calculated using Eqs. (8)–(10). After we obtain the certainty of the majority class, the certainty of the minority class can be estimated by \(1-W_{H_{i}}\). For WPaired-Beta, the certainty of the majority class is also calculated using Eqs. (11)–(14). In the same way, the certainty of the minority class can be estimated by \(1-W_{H_{i}}\).

3.3 Label similarity-based weighting

Now, the only question left to answer is how to define the weight \(w_{ij}\) of each crowd worker labeling a particular instance. Generally speaking, there are mainly two kinds of methods to define (learn) such weights. The first is to conduct a sophisticated search process to find the weights that maximize the performance of the resulting model. Usually, this kind of method leads to a good weight assignment, but it requires a significant amount of time and an appropriate fitness function for the search. The other is to directly compute the weights using the statistical characteristics of the available data, and thus it is often more efficient.

In this paper, we focus on the second kind of method and propose a label similarity-based weighting method, which combines the specific quality of the worker on different instances and the overall quality of the worker across all instances to estimate the weight \(w_{ij}\) of each crowd label. We expect the learned weights to weaken the influence of incorrect labels from high-quality workers and strengthen the influence of correct labels from low-quality workers. Inspired by [12], we define the normalized weight \(w_{ij}\) of each crowd label as

$$\begin{aligned} w_{i j}=\frac{1}{Z}w_{i j}^{\prime }, \end{aligned}$$
(15)

where Z is a normalization constant given by Eq. (16), which ensures that the sum of all crowd label weights for the ith instance remains equal to m, and \(w_{i j}^{\prime }\) is the non-normalized weight of each crowd label defined by Eq. (17):

$$\begin{aligned} Z= & {} \frac{1}{m}\sum _{j=1}^{m}w_{i j}^{\prime }, \end{aligned}$$
(16)
$$\begin{aligned} w_{i j}^{\prime }= & {} \frac{1}{1+e^{-\gamma _{i j}}}, \end{aligned}$$
(17)

where \(\gamma _{i j}\) is estimated by

$$\begin{aligned} \gamma _{i j}=\tau _{j}\left( 1+s_{i j}^{2}\right) , \end{aligned}$$
(18)

where \(\tau _{j}\) is the overall quality of the jth worker across all instances, \(s_{i j}\) is the specific quality of the jth worker for the ith instance, \(s_{i j}^2\) is used to strengthen the influence of the specific quality \(s_{i j}\), and \(1+s_{i j}^{2}\) is used to avoid the effect of the extreme estimation of \(s_{i j}=0\).

Next, we introduce how to estimate \(s_{i j}\). Inspired by the similarity assumption [12], we propose to use the similarity among worker labels to estimate the specific qualities of the same worker on different instances. For a specific instance \(e_{i}\), if the jth worker provides the same label as most other workers, this indicates that the worker labels this instance with a high degree of confidence. That is, the specific quality of the jth worker on the ith instance is very high. Based on this idea, we define \(s_{i j}\) as the label similarity among workers:

$$\begin{aligned} s_{i j}=\sum _{j^{\prime }=1\wedge j^{\prime } \ne j}^{m} \delta \left( l_{i j},l_{i j^{\prime }}\right) . \end{aligned}$$
(19)

We now introduce how to estimate \(\tau _{j}\). Estimating the overall qualities of different workers is not a new research topic in the crowdsourcing learning community. There exist many state-of-the-art algorithms, such as Dawid–Skene [1], ZenCrowd [2], KOS [9], and DEW [15, 23]. However, none of them exploit the feature vectors of instances, which makes it impossible to take full advantage of the statistical characteristics of the available data when evaluating label qualities. According to the observation in [30], in traditional supervised learning, there exists a schema that exhibits the relationship between data features and ground-truth labels. For example, if there exists a high-quality worker, the data schema will be well inherited in their labels, because the difference between their labels and the ground-truth labels is small. Meanwhile, if there exists a low-quality worker, the data schema may be broken, because their labels will be very different from the ground-truth labels. Therefore, we can estimate the overall quality of a worker by evaluating how well the schema is inherited in their labels. Specifically, we first extract all training instances' feature vectors and the corresponding crowd labels provided by the jth worker to form a new single-label data set. Then, we use tenfold cross-validation to evaluate the classification accuracy of a classifier built on this data set. In theory, this can be any classifier. Finally, we define the overall quality of the jth worker as the classification accuracy of the built classifier. The detailed formula can be expressed as

$$\begin{aligned} \tau _{j}=\frac{\sum _{i=1}^{n} \delta \left( l_{i j},f_{j}\left( x_{i}\right) \right) }{n}, \end{aligned}$$
(20)

where n is the size of the extracted data set and \(f_{j}(x_{i})\) is the class label of the feature vector \(x_{i}\) predicted by the built classifier.
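The whole weighting pipeline can be sketched as follows, assuming scikit-learn is available and using its CART decision tree as a stand-in for C4.5; the array shapes and the function name are our own choices.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def label_similarity_weights(X, L):
    """Normalized weights w_ij of Eqs. (15)-(20).

    X: (n, d) feature matrix; L: (n, m) matrix of crowd labels.
    """
    n, m = L.shape
    # Specific quality s_ij: number of other workers giving the same label, Eq. (19)
    s = np.array([[np.sum(L[i] == L[i, j]) - 1 for j in range(m)]
                  for i in range(n)])
    # Overall quality tau_j: tenfold cross-validated accuracy of a classifier
    # trained on worker j's labels, Eq. (20); CART approximates C4.5 here
    tau = np.array([
        cross_val_score(DecisionTreeClassifier(), X, L[:, j], cv=10).mean()
        for j in range(m)
    ])
    gamma = tau[None, :] * (1 + s ** 2)        # Eq. (18)
    w_prime = 1.0 / (1.0 + np.exp(-gamma))     # Eq. (17)
    Z = w_prime.mean(axis=1, keepdims=True)    # Eq. (16), one Z per instance
    return w_prime / Z                         # Eq. (15): each row sums to m
```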

It can be seen that although the existing FaitCrowd strategy [14] also considers the quality of workers on different tasks, our label similarity-based weighting method is totally different from it. The FaitCrowd strategy jointly models question content and source answering behavior to learn latent topics and estimate the topical source expertise. By contrast, our method directly uses the similarity among worker labels to estimate the specific quality of the worker on different instances. Yet at the same time, our method is totally different from the existing BLC strategy [26] that also takes the features of instances into account. The BLC strategy utilizes the conceptual-level features extracted from crowdsourced labels to infer the true labels of instances by clustering. By contrast, our method only uses the original features of instances to build a classifier to estimate the overall quality of the worker across all instances.

4 Experiments and results

The purpose of this section is to validate the effectiveness of our proposed strategies: WMV-Freq, WMV-Beta, WPaired-Freq, and WPaired-Beta. Therefore, we designed four groups of experiments to compare them with the original MV-Freq, MV-Beta, Paired-Freq, and Paired-Beta, respectively. We conducted our experiments on 12 real-world datasets from the University of California at Irvine (UCI) repository [5] listed in Table 1, which includes all nine binary datasets from the website of the CEKA platform [27] and three transformed binary datasets used in [17].

Table 1 Descriptions of the used datasets
Fig. 1 Classification accuracy \((\%)\) comparisons for WMV-Freq versus MV-Freq. The labeling quality \(p\in (0.3,0.9)\)

Fig. 2 Classification accuracy \((\%)\) comparisons for WMV-Beta versus MV-Beta. The labeling quality \(p\in (0.3,0.9)\)

Fig. 3 Classification accuracy \((\%)\) comparisons for WPaired-Freq versus Paired-Freq. The labeling quality \(p\in (0.3,0.9)\)

Fig. 4 Classification accuracy \((\%)\) comparisons for WPaired-Beta versus Paired-Beta. The labeling quality \(p\in (0.3,0.9)\)

To simulate a crowdsourcing process to obtain multiple noisy labels of each instance, the original true labels of all instances were hidden, and all simulated workers were employed to label each instance. For each worker, the original true label was assigned to each instance with the probability p, and the opposite value was assigned with the probability \(1-p\). In our experiments, the labeling quality p of each worker was generated randomly from a uniform distribution on the interval (0.3, 0.9). In fact, in our experiments, we also tested some other distributions, such as the normal (Gaussian) distribution \(N(0.65,0.35^2)\), to randomly generate the labeling quality of each worker. Owing to virtually the same experimental conclusions and for brevity, we do not present the detailed experimental results here.
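A minimal sketch of this simulation, assuming the true labels are given as a NumPy array of \(\pm 1\); the function name and seed are illustrative only.

```python
import numpy as np

def simulate_crowd_labels(y_true, m, seed=0):
    """Simulate m workers whose qualities p_j ~ U(0.3, 0.9).

    y_true: NumPy array of labels in {+1, -1}. Each worker copies the
    true label with probability p_j and flips it with probability 1 - p_j.
    """
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.3, 0.9, size=m)                 # per-worker labeling quality
    keep = rng.random((len(y_true), m)) < p[None, :]  # keep true label w.p. p_j
    L = np.where(keep, y_true[:, None], -y_true[:, None])
    return L, p
```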

After obtaining the multiple noisy label set of each instance, we use label aggregation strategies to infer its aggregation label. Then, the classifier is built on the training set with the aggregation labels and evaluated on the test set with the true labels. Because the simulation process has a certain degree of randomness, we use tenfold cross-validation to evaluate the classification accuracy of the built classifier. In our experiment, we use C4.5, one of the top 10 data mining algorithms, to estimate the overall quality of each worker \(\tau _{j}\) (\(j=1,2,\ldots ,m\)) in our proposed strategies and evaluate the performance of all label aggregation strategies.

Figures 1, 2, 3, and 4 show the detailed classification accuracy \((\%)\) comparison results between our proposed four strategies and their original counterparts, respectively. From these comparison results, we can see that assigning different weights to different workers when labeling different instances can largely improve the performance of the existing label aggregation strategies. Now, we summarize some of the highlights.

1. Our proposed two weighted soft MV strategies (WMV-Freq and WMV-Beta) are better overall than the original two soft MV strategies (MV-Freq and MV-Beta). Our proposed two weighted paired soft MV strategies (WPaired-Freq and WPaired-Beta) are also better overall than the original two paired soft MV strategies (Paired-Freq and Paired-Beta). All these results validate our viewpoints: different workers should have different labeling qualities, and the same worker should also have different labeling qualities on different instances.

2. The accuracies of our weighted strategies, WMV-Freq, WMV-Beta, WPaired-Freq, and WPaired-Beta, are much higher than those of the original MV-Freq, MV-Beta, Paired-Freq, and Paired-Beta, respectively. However, the advantage of our weighted strategies over the original strategies gradually diminishes as the number of workers increases.

3. As expected, the accuracies of both our weighted strategies and the original strategies improve rapidly as the number of workers increases. However, as with Paired-Freq [17], we notice that the performance of WPaired-Freq does not improve as expected when more and more labels for each instance become available. Its learning curves are completely flat and even fall back a little on three datasets (“biodeg”, “ionosphere”, and “vote”). Why does WPaired-Freq behave this way? The fundamental reason is that WPaired-Freq also retains the noise completely. Suppose there exists an instance with a multiple noisy label set \(\{+,+,+,-,-\}\) and the labeling qualities of these five workers are 0.95, 0.6, 0.94, 0.92, and 0.59, respectively. WPaired-Freq represents it as \(\{(+,0.6225),(-, 0.3775)\}\). If these five workers label this instance twice, its multiple noisy label set becomes \(\{+,+,+,-,-,+,+,+,-,-\}\), yet WPaired-Freq still represents it as \(\{(+,0.6225),(-, 0.3775)\}\). It can be seen that as more labels are acquired for each instance, the certainty of the majority class \(+\) and the certainty of the minority class − no longer change. By contrast, WPaired-Beta does not incur this issue. For the same example, WPaired-Beta represents the two label sets as \(\{(+,0.6563),(-, 0.3437)\}\) and \(\{(+,0.8867),(-, 0.1133)\}\), respectively. That is to say, as more labels are acquired for each instance, the certainty of the majority class \(+\) keeps rising and the certainty of the minority class − keeps decreasing, which means that the influence of correct labels (\(+\)) is further strengthened and the influence of incorrect labels (−) is further weakened (see the sketch after this list).
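This contrast can be checked with the unweighted routines from the sketch in Sect. 2 (the weighted variants behave analogously): duplicating every label leaves the Freq certainty unchanged but pushes the Beta certainty toward 1.

```python
labels = list('+++--')
print(mv_freq_certainty(labels), mv_beta_certainty(labels))
# 0.6 0.65625
print(mv_freq_certainty(labels * 2), mv_beta_certainty(labels * 2))
# 0.6 0.7255859375
```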

Table 2 Classification accuracy \((\%)\) comparisons for our proposed four strategies versus ZC, RY, KOS, and GTIC
Table 3 Integration accuracy \((\%)\) comparisons for our proposed four strategies versus ZC, RY, KOS, and GTIC

To further validate the effectiveness of our proposed four strategies, we performed another group of experiments to compare them with other existing state-of-the-art label aggregation strategies: ZC [2], RY [16], KOS [9], and GTIC [28]. Owing to virtually the same experimental conclusions and for brevity, we only show the detailed comparison results when the number of workers is six. Table 2 shows the detailed comparison results in terms of the classification accuracy of the target classifier. From these comparison results, we can see that the average classification accuracies (79.14%, 79.5%, 80.93%, and 83.66%) of our proposed four strategies are all much higher than those of ZC (77.47%), RY (78.41%), KOS (75.65%), and GTIC (76.44%). Besides, we also examined the performance of our proposed four strategies in terms of the integration accuracy, which is defined as the proportion of instances whose integration labels are the same as their true labels. Table 3 shows the detailed comparison results. From these comparison results, we can see that the average integration accuracies (87.72%, 87.57%, 87.95%, and 88.92%) of our proposed four strategies are also all higher than those of ZC (83.54%), RY (86.52%), KOS (87.33%), and GTIC (83.23%).

5 Discussion

Having validated the effectiveness of our proposed four strategies for binary classification, we now discuss how to extend them to multi-class classification in this section.

First, we focus on how to define the uncertainty of the majority class when multi-class classification is needed. Given a multiple noisy label set \({\mathcal {L}}_{i}\) of the ith instance \(x_i\), we can directly borrow the impurity measures used for decision-tree nodes to define its uncertainty of the majority class. In decision-tree learning, Error, Gini, and Entropy have been widely used to measure the impurity of a given node. The detailed definitions are

$$\begin{aligned} \mathrm{Error}({\mathcal {L}}_{i})= & {} 1-\max \limits _{k=1}^q P(c_k|{\mathcal {L}}_{i}), \end{aligned}$$
(21)
$$\begin{aligned} \mathrm{Gini}({\mathcal {L}}_{i})= & {} 1-\sum _{k=1}^{q}P(c_k|{\mathcal {L}}_{i})^2, \end{aligned}$$
(22)
$$\begin{aligned} \mathrm{Entropy}({\mathcal {L}}_{i})= & {} -\sum _{k=1}^{q}P(c_k|{\mathcal {L}}_{i})\log _2 P(c_k|{\mathcal {L}}_{i}), \end{aligned}$$
(23)

where q is the number of classes and \(0 \log _2 0=0\).

Then, the corresponding certainties of the majority class are defined as

$$\begin{aligned} W_{H_{i}}= & {} 1-\mathrm{Error}({\mathcal {L}}_{i})=\max \limits _{k=1}^q P(c_k|{\mathcal {L}}_{i}), \end{aligned}$$
(24)
$$\begin{aligned} W_{H_{i}}= & {} 1-\mathrm{Gini}({\mathcal {L}}_{i})=\sum _{k=1}^{q}P(c_k|{\mathcal {L}}_{i})^2, \end{aligned}$$
(25)
$$\begin{aligned} W_{H_{i}}= & {} 1-\mathrm{Entropy}({\mathcal {L}}_{i})=1+\sum _{k=1}^{q}P(c_k|{\mathcal {L}}_{i})\log _2 P(c_k|{\mathcal {L}}_{i}), \end{aligned}$$
(26)

where \(P(c_k|{\mathcal {L}}_{i})\) is the appearance frequency of class \(c_k\) in \({\mathcal {L}}_{i}\) estimated by

$$\begin{aligned} P(c_k|{\mathcal {L}}_{i})=\frac{\sum _{j=1}^{m} w_{ij}\delta \left( l_{i j},c_k\right) }{\sum _{k=1}^{q}\sum _{j=1}^{m} w_{ij}\delta \left( l_{i j},c_k\right) }, \end{aligned}$$
(27)

where

$$\begin{aligned} w_{i j}=\frac{1}{Z}\frac{1}{1+(q-1)e^{-\gamma _{i j}}}, \end{aligned}$$
(28)

where Z is a normalization constant.

Finally, weighted pairing can also be extended by decomposing each instance with a multiple noisy label set \({\mathcal {L}}_{i}\) into q class-specific weighted instances, where the weight of each class-specific instance is defined as the certainty of the corresponding class \(P(c_k|{\mathcal {L}}_{i})\).
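A sketch of the three certainty measures for the multi-class case, with the weighted class frequencies of Eq. (27) computed as in the binary sketches; the function name and class encoding are our own, and the \(0\log _2 0=0\) convention is handled explicitly.

```python
import numpy as np

def multiclass_certainty(labels, weights, classes, measure='gini'):
    """Majority-class certainty W_Hi for multi-class labels, Eqs. (24)-(27)."""
    p = np.array([sum(w for l, w in zip(labels, weights) if l == c)
                  for c in classes])
    p = p / p.sum()                              # Eq. (27)
    if measure == 'error':
        return float(p.max())                    # Eq. (24)
    if measure == 'gini':
        return float(np.sum(p ** 2))             # Eq. (25)
    nz = p[p > 0]                                # convention: 0 * log2(0) = 0
    return float(1 + np.sum(nz * np.log2(nz)))   # Eq. (26), measure='entropy'

# Example: five workers, three classes, unit weights -> Gini certainty 0.44
print(multiclass_certainty(['a', 'a', 'b', 'c', 'a'], [1] * 5, ['a', 'b', 'c']))
```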

6 Conclusion and future work

In this paper, we have argued that a single worker may even have different labeling qualities on different instances. Based on this premise, we have proposed two weighted soft MV strategies and two weighted paired soft MV strategies. We have simply denoted the resulting strategies as WMV-Freq, WMV-Beta, WPaired-Freq, and WPaired-Beta, respectively. In addition, we have proposed a label similarity-based weighting method, which combines the specific quality of the worker on different instances and the overall quality of the worker across all instances to estimate the weight of each worker labeling different instances. The experimental results have validated the effectiveness of our proposed four new strategies.

Given a weighted multiple noisy label set, the definition of the certainty of the majority class is a crucial problem in our proposed strategies, and thus exploring other effective definitions is the main direction for our future work. In addition, exploiting other sophisticated weight-learning methods is another interesting topic for future work.