1 Introduction

Labeled data play a crucial role in machine learning. In recent years, crowdsourcing has been a popular cost-saving way for label collection. The power of crowdsourcing relies on two conditions. One is the possibility to obtain highly accurate estimation of true labels by aggregating the collected noisy labels. Another is that the cost paid to the workers during the label collection process is not large, hence crowdsourcing is much more economical than to recruit domain experts. Particularly, in practice the budget to pay is usually limited. So it is important to study the approaches for balancing between cost reduction and estimation performance.

Unfortunately, there is a trade-off between aggregation accuracy and cost. As more labels are collected, typically, the aggregated accuracy increases, while the cost also increases. One way to deal with this problem is to design better label aggregation methods, without controlling the data collection process (Raykar et al. 2010; Dalvi et al. 2013; Zhang et al. 2014). Another more active way is to design effective task assignment mechanisms, for saving budget meanwhile maintaining the aggregation quality (Karger et al. 2011; Ho et al. 2013). However, all these methods do not utilize the subjective behavior of the workers. Though seldom studied previously, it is interesting to consider an alternative kind of mechanisms utilizing the subjective uncertainty of the crowd, by allowing workers to choose unsure option instead of actually labeling the data. The advantage is that since the confidence of the workers often has a close relationship with their potential abilities, the quality of the returned labels may be improved. In Zhong et al. (2015), the setting of providing unsure option was studied under the active learning with crowd scenario, and the effect of label quality improvement was empirically justified from experiments. On the other hand, for ensuring the honesty of the workers, choosing unsure option should also be paid. Otherwise the workers would prefer to make guesses when their confidence is low. As a result, providing unsure option also leads to the potential danger of budget waste, since under the same budget, the number of returned labels is decreased. It is important to theoretically answer when providing unsure option can lead to significant cost reduction.

In this work, we take the first step towards the analysis of the cost-saving effect for the crowdsourcing with unsure option setting. Firstly, we provide the sufficient conditions for employing unsure option to be indeed effective on cost reduction. Secondly, we show how confidence threshold can be set properly. Thirdly, motivated by the theoretical results, we propose an alternative online mechanism. It is suitable to use for threshold selection when the statistics about the crowd are difficult to estimate due to the lack of golden standard tasks with known labels.

The rest of the paper is organized as follows. In Sect. 2, the related work is discussed. In Sect. 3, we describe the basic formulations and assumptions. Sections 4 and 5 give the theoretical analysis on the quality ensured and the unsure mechanisms. They include the main results of this paper. Section 6 discusses the possible extensions of the incentive compatible payment schemes. Section 7 introduces the online algorithm. Section 8 shows the experimental results. Section 9 concludes the paper.

2 Related work

Crowdsourcing, typically crowd-sourced data labeling, has been a fruitful topic in machine learning. One of the central tasks is to achieve desirable learning performance using the noisy labels returned by the crowd. This can be done by learning good classifiers utilising the noisy labels directly (Dekel and Shamir 2009a, b; Urner et al. 2012). Meanwhile, in many researches the label estimation problem has been thouroughly studied (Raykar et al. 2010; Zhou et al. 2012, 2015; Dalvi et al. 2013; Li and Yu 2014; Zhang et al. 2014). The focus of these two kinds of researches is on improving learning performance, while the label collection cost is not directly considered.

Wang and Zhou (2016) pointed out the importance of the cost-saving effect in crowdsourcing, showing that cost reduction is one of the central tasks involved. Meanwhile, Wang and Zhou (2015) theoretically showed that it is indeed possible to achieve desired accuracy with reasonable cost. From the algorithmic perspective, there have been many researches about task assignment and budget allocation, which try to balance between aggregation accuracy and data collection cost. Both non-adaptive task assignment mechanisms (Karger et al. 2011; Tran-Thanh et al. 2013), which assign the tasks off-line before the worker comes, and the adaptive mechanisms which assign the tasks on-line during the labeling process (Ho et al. 2013; Chen et al. 2013; Abbasi-Yadkori et al. 2015), have been studied thoroughly. The accuracy-cost trade-off is also the major issue discussed in this paper. The difference is that the task assignment and budget allocation mechanisms focus on improving the behavior of the task requesters, while for designing the unsure mechanisms, the target is to improve the behavior of the workers, by utilizing their own uncertainty about the tasks. There have also been several studies on employing the bandit model into crowdsourcing. In Abraham et al. (2013), a special crowdsourcing problem called the bandit survey problem was considered. In Jain et al. (2014), the bandit model was employed to deal with the task assignment problem. In Zhou et al. (2014), the bandit arm identification problem was employed for worker selection. All these works have different settings to our work.

There have not been many works studying the unsure mechanisms in crowdsourcing. Zhong et al. (2015) considered providing unsure option under the active learning from crowd scenario. In their work, the purpose of allowing unsure option was to improve worker reliability. This is similar to the purpose of employing unsure mechanism in our work. The difference is that in their work, they empirically justified that providing unsure option to the crowd could make labeling quality improved in active learning by experiments. While in our work, instead of considering the active learning scenario, we consider the more general crowdsoured labeling task. Furthermore, our focus is on the theoretical analysis of the cost-saving effect of employing unsure option. In Shah and Zhou (2015), a double or nothing incentive compatible mechanism was proposed, to make workers behave honestly based on their confidence. Their proposed mechanism is provable to avoid spammers from the crowd, under the assumption that every worker wants to maximize the expected payment. In their setting, employing unsure option is a way for ensuring high quality of the returned labels. The potential accuracy-cost trade-off is not considered. While in our work, the main focus is on the accuracy-cost trade-off, other than designing incentive compatible mechanisms.

3 Problem formulation

We consider the tasks of collecting binary labels of \(\{-1,1\}\). The well-known Dawid–Skene model (Dawid and Skene 1979; Zhou et al. 2012; Zhang et al. 2014; Karger et al. 2011; Zhou et al. 2015; Li and Yu 2014) under the binary classification case is adopted. Under the D-S model, the tasks are assumed to be homogeneous. The homogeneity means that the potential costs for different tasks are the same. As a result, in the rest of the paper, we focus on dealing with the cost for one single task.

We adopt the anonymous worker assumption introduced in Karger et al. (2011) for modeling the crowd. Let \(\theta _i\) be the ability of the ith worker, i.e. the probability that the returned label is correct. The process of choosing a number of anonymous workers is modeled as independent random draws \(\theta _1, \theta _2, \theta _3 \dots \in [0,1]\) over the crowd ability distribution \(\theta _i \sim A\). Once drawn, the worker is asked to return the label or to choose unsure option based on the crowdsourcing mechanism applied, such as the quality ensured mechanism and unsure mechanism introduced below, or return the label directly if no mechanism is applied. The accuracy of the returned label is decided by the ability of the worker. We assume that after label collection, labels are aggregated by majority voting, i.e. to estimate the label based on the choices of the majority. In this paper, we assume that the mean \(\mu \) of A should be larger than 1 / 2, i.e.

$$\begin{aligned} \mu > 1/2. \end{aligned}$$

The assumption is based on the well known result such that for majority voting, the estimated label does not converge to the true label when \(\mu \le 1/2\). From this result, if \(\mu \le 1/2\) then any crowdsourcing mechanism that makes the estimated label asymptotically correct can trivially leads to cost reduction. So we only consider the non-trivial case such that \(\mu > 1/2\). We also adopt the following assumption:

$$\begin{aligned} \Pr \,\left( \theta < \mu -1/2\right) = 0. \end{aligned}$$
(1)

It states that we do not allow the workers with very low abilties which are far from the mean to exist. This assumption tightens our analysis. Indeed, these workers may correspond to the malicious workers who aim at attacking the crowdsourcing system, and should be excluded by any effective crodsourcing mechanism. Since malicious attacks prevention beyonds the scope of this paper, we just leave this as a preset assumption. Besides the above assumptions, we do not assume the crowd ability distribution A to be any specific distributions.

We define the confidence of the workers to be the subjective accuracy they believe to have, which is denoted by \(c_1,c_2, c_3 \dots \in [1/2,1]\). The minimum value is 1 / 2 since the weakest choice in one honest worker’s mind is to make random guess. Similar to the crowd ability distribution A, we assume that the confidence for workers is independently drawn from the crowd confidence distribution \(c_i \sim C\). Obviously, there is a close relationship between one’s ability and confidence. Thus modeling this relationship is necessary. In this paper we adopt a two-step analysis. At first, we can consider the mechanism, which simply filters out low quality labels based on the ability without using the information of the confidence. By this subroutine, we can further consider the setting of providing unsure option, by introducing reasonable assumptions on the relationship between workers’ confidence and abilities.

As a consequence, we define a quality ensured mechanism as the following process: given an ability threshold T, when one worker is drawn for labeling, the label is accepted only when the ability is above T. This mechanism is ideal since it assumes that once a worker is drawn, the ability can be also obtained. An unsure mechanism is a surrogate for the idealized mechanism. When an unsure mechanism is employed, a confidence threshold T is adopted. One worker is asked to do labeling only when the confidence \(c \ge T\), otherwise she/he is asked to use the unsure option, for example, to return the label “0” which represents the unsureness.

The budget is defined as the total cost on the label collection. For simplicity of the analysis, we assume that returning one label and choosing unsure option once are both paid for 1, i.e. the total cost equals to the total number of workers involved in the task. We also assume that workers behave honestly according to their true abilities and confidence. Under the assumption that each worker aims at maximizing their payments, the honesty can be satisfied by adopting the incentive compatible payment mechanisms, which is discussed in Sect. 6.

The goal of our analysis is to answer the question that, on which kind of crowd ability distributions, an quality ensured mechanism or an unsure mechanism is provable to be effective:

Definition 1

For label aggregation being correct with high probability, denote m as the provable cost needed for the simple aggregation without any mechanisms, and \(m'\) as the provable cost needed when utilizing a quality ensured or an unsure mechanism. Then the utilized mechanism is effective if

$$\begin{aligned} m' \ll m. \end{aligned}$$

Note that this is just a qualitative definition, and will be further quantified in definition 2 and 3. The effectiveness is defined for verifying the significance of cost reduction.

In crowdsourcing, it is usually not possible to estimate the ability or the confidence of any individual worker, since the potential number of the tasks for an individual to accomplish is usually very limited. On the opposite, the statistical properties of the crowd ability distribution A and the crowd confidence distribution C, are more practical to estimate. As a result, our analysis focuses on utilizing some simple statistics of A and C, which are also much easier to estimate than to exactly model the distributions. Our target is to derive:

  1. (1)

    How to properly set the ability and the confidence threshold.

  2. (2)

    On what kind of A and C, the quality ensured and the unsure mechanism can be effective.

The following two inequalities are important for the analysis (Shalev-Shwartz and Ben-David (2014, lemma B.6)):

Lemma 1

For independent random variables \(X_1,X_2,\ldots ,X_m\) bounded in [ab] with \(\mathbb {E}\left[ X_i\right] =\mu _i, i=\{1,2,\ldots ,m\}\), the following inequality holds holds (Shalev-Shwartz and Ben-David (2014, lemma B.9)):

$$\begin{aligned} \Pr \,\left( \frac{1}{m}\sum \limits _{i=1}^m\mu _i -\frac{1}{m}\sum \limits _{i=1}^m X_i > \epsilon \right) < \exp (-2m\epsilon ^2/(b-a)^2). \end{aligned}$$
(2)

Lemma 2

For independent random variables \(X_1,X_2,\ldots ,X_m\) bounded in \([-M,M]\) with \(\mathbb {E}\left[ X_i\right] =0, i=\{1,2,\ldots ,m\}\) and variance \(\sigma _1^2, \sigma _2^2,\ldots ,\sigma _m^2\), for all \(\epsilon > 0\), the following inequality holds:

$$\begin{aligned} \Pr \,\left( \sum \limits _{i=1}^m X_i > \epsilon \right) < \exp \left( -\frac{\epsilon ^2/2}{\sum \limits _{i=1}^m\sigma _i^2+M\epsilon /3}\right) . \end{aligned}$$
(3)

4 Analysis on the quality ensured mechanisms

In this section, we focus on studying the quality ensured mechanisms. The analysis in this section forms the foundation to the analysis of the unsure mechanism in Sect. 5.

To show that a quality ensured mechanism is effective, it is necessary to compare between the cost bounds with and without the mechanism. The first result is on the cost needed for using simple aggregation without any mechanisms:

Lemma 3

Let \(\mu \) denote the mean of crowd ability distribution A, for simple majority voting aggregation under the settings introduced in Sect. 3, the aggregated label is correct with probability at least \(1-\delta \) if the total cost m satisfies

$$\begin{aligned} m \ge \frac{2[1+\frac{2}{3}(2\mu -1)]\log \frac{1}{\delta }}{(2\mu -1)^2}. \end{aligned}$$
(4)

Proof

Given labels \(a_1, a_2,\ldots ,a_m,\; a_i\in \{-1,1\}\) from workers, the majority voting rule is:

$$\begin{aligned} \hat{y} = 1 \leftrightarrow \sum \limits _{i=1}^m a_i >=0, \quad \hat{y} = -1 \leftrightarrow \sum \limits _{i=1}^m a_i <0. \end{aligned}$$

The target is to bound \(\Pr \,\left( y\hat{y}<0\right) = \Pr \,\left( y\sum \nolimits _{i=1}^m a_i <0 \right) \). By assumption, \(\mathbb {E}\left[ a_i\right] =\theta _i,\; \mathbb {E}\left[ \theta _i\right] = \mu \), then

$$\begin{aligned} \begin{aligned} \Pr \,\left( y\sum \limits _{i=1}^m a_i <0\right)&= \Pr \,\left( \sum \limits _{i=1}^m (2\mu -1-ya_i) > m(2\mu -1)\right) . \end{aligned} \end{aligned}$$

\(2\mu -1-ya_i\) is a zero mean random variable bounded in \([-2,2]\), thus we use a lemma to bound the variance:

Lemma 4

(Boucheron et al. 2013, Corollary 3.2) If \(\sum _{i=1}^m X_i\) has the bounded difference property with constant c, then

$$\begin{aligned} \mathrm {Var}\left( \sum _{i=1}^m X_i\right) \le \frac{1}{4} \sum _{i=1}^m c^2. \end{aligned}$$

From this we obtain that \(\mathrm {Var}(\sum \nolimits _{i=1}^m(2\mu -1-ya_i)) \le m\) since the bounded difference property with constant 2 holds. By Eq. 3,

$$\begin{aligned} \begin{aligned} \Pr \,\left( y\hat{y}<0 \right)&< \exp \left( -\frac{m^2(2\mu -1)^2/2}{\sum \limits _{i=1}^m\mathrm {Var}(2\mu -1-ya_i)+2m(2\mu -1)/3}\right) \\&= \exp \left( -\frac{m^2(2\mu -1)^2/2}{\mathrm {Var}(\sum \limits _{i=1}^m(2\mu -1-ya_i))+2m(2\mu -1)/3}\right) \\&\le \exp \left( -\frac{m(2\mu -1)^2/2}{1+2(2\mu -1)/3}\right) .\\ \end{aligned} \end{aligned}$$

Let \(\delta = \exp (-\frac{m(2\mu -1)^2/2}{1+2(2\mu -1)/3})\), solving for m gives the desired result. \(\square \)

The main order term in Eq. 4 is \(\frac{1}{(2\mu -1)^2}\). Then we show another lemma, which gives the cost bound for a quality ensured mechanism.

Lemma 5

Let \(T>1/2\) be the ability threshold, and \(\eta =\Pr \,\left( \theta \ge T\right) \) be the upper tail probability of the crowd ability distribution A. When the quality ensured mechanism is employed, for majority voting aggregation under the settings introduced in Sect. 3, the aggregated label is correct with probability at least \(1-\delta \), if the total cost \(m'\) satisfies

$$\begin{aligned} m' \ge \frac{2(1-\eta )\log \frac{2}{\delta }}{\eta }+\frac{4\log \frac{2}{\delta }}{(2T-1)^2\eta }+\frac{2\log \frac{2}{\delta }}{3\eta }. \end{aligned}$$
(5)

Proof

The first step: For those workers whose confidence is larger than T, we bound the number of workers needed when the aggregated estimation is incorrect with probability at most \(\delta /2\), which is denoted as \(m'_0\).

The target is to bound \(\Pr \,\left( y\hat{y}<0\right) = \Pr \,\left( y\sum \nolimits _{i=1}^{m'_0} a_i <0 \right) \). Note that for \(\forall i\), \(\mathbb {E}\left[ ya_i\right] = \Pr \,\left( ya_i=1\right) - \Pr \,\left( ya_i=-1\right) = 2\theta _i-1 \ge 2T-1.\) Then \(\mathbb {E}\left[ \sum \nolimits _{i=1}^{m'_0} ya_i \right] \ge m'_0(2T-1)\). Then

$$\begin{aligned} \begin{aligned} \Pr \,\left( y\sum \limits _{i=1}^{m'_0} a_i <0 \right)&=\Pr \,\left( \mathbb {E}\left[ y\sum \limits _{i=1}^{m'_0} a_i\right] - y\sum \limits _{i=1}^{m'_0} a_i> \mathbb {E}\left[ y\sum \limits _{i=1}^{m'_0} a_i \right] \right) \\&\le \Pr \,\left( \mathbb {E}\left[ y\sum \limits _{i=1}^{m'_0} a_i\right] - y\sum \limits _{i=1}^{m'_0} a_i > m'_0(2T-1)\right) . \end{aligned} \end{aligned}$$

\(ya_i\) is an independent random variable in \(\{-1,1\}\), by Eq. 2,

$$\begin{aligned} \Pr \,\left( y\hat{y}<0 \right) < \exp (-m(2T-1)^2/2). \end{aligned}$$

Let \(\delta /2 = \exp (-m'_0(2T-1)^2/2)\), solving for \(m'_0\) gives

$$\begin{aligned} m'_0 = \frac{2\log (2/\delta )}{(2T-1)^2}. \end{aligned}$$
(6)

The second step: we bound the cost when sufficient number of workers have confidence larger than T.

Consider the random variables \(\mathbb I(\theta _i>T)\), the target is to bound \(\Pr \,\left( \sum \nolimits _{i=1}^{m'}\mathbb I(\theta _i > T) < m'_0\right) \). We have

$$\begin{aligned} \begin{aligned}&\Pr \,\left( \sum \limits _{i=1}^{m'}\mathbb I(\theta _i> T) < m'_0\right) \\&= \Pr \,\left( \mathbb {E}\left[ \sum \limits _{i=1}^{m'}\mathbb I(\theta _i> T)\right] - \sum \limits _{i=1}^{m'}\mathbb I(\theta _i> T)> \mathbb {E}\left[ \sum \limits _{i=1}^{m'}\mathbb I(\theta _i > T)\right] - m'_0 \right) . \\ \end{aligned} \end{aligned}$$

Since \(\mathbb {E}\left[ \mathbb I(\theta _i>T)\right] = \Pr \,\left( \theta _i>T\right) =\eta , \mathbb {E}\left[ \mathbb I(\theta _i>T)-\mathbb {E}\left[ I(\theta _i>T)\right] \right] ^2= \eta (1-\eta )\), by Eq. 3,

$$\begin{aligned} \Pr \,\left( \sum \limits _{i=1}^{m'}\mathbb I(\theta _i > T)< m'_0\right) < \exp \left( -\frac{(m'\eta -m'_0)^2/2}{m'\eta (1-\eta )+(m'\eta -m'_0)/3}\right) . \end{aligned}$$

Let \(\exp (-\frac{(m'\eta -m'_0)^2/2}{m'\eta (1-\eta )+(m'\eta -m'_0)/3}) = \delta /2\), we have

$$\begin{aligned} \frac{1}{2}\eta ^2m'^2-\left[ (4/3-\eta )\log \frac{2}{\delta } + m'_0\right] \eta m' + \frac{1}{2}{m'_0}^2+\frac{1}{3}m'_0\log \frac{2}{\delta } = 0, \end{aligned}$$

then

$$\begin{aligned} m' = \frac{(4/3-\eta )\log \frac{2}{\delta } + m'_0+ \sqrt{\left[ (4/3-\eta )\log \frac{2}{\delta } + m'_0\right] ^2-{m'_0}^2-\frac{2}{3}m'_0\log \frac{2}{\delta }}}{\eta }. \end{aligned}$$

It is easy to see that \(m'=\frac{2((4/3-\eta )\log \frac{2}{\delta } + m'_0)}{\eta }\) also satisfies the condition. Together with Eq. 6, the desired result can be shown by union bound. \(\square \)

The main order term in Eq. 5 is \(\frac{1}{(2T-1)^2\eta }\). Given the above two cost bounds, we propose a more concrete definition on the effectiveness, based on how much reduction on the main order of the cost:

Definition 2

Let \(\mu \) be the mean of the crowd ability distribution, T be the ability threshold for a quality ensured mechanism and \(\eta = \Pr \,\left( \theta \ge T\right) \) be the probability for a worker to have the ability above T, then a quality ensured mechanism is at least \(\alpha \) -effective if

$$\begin{aligned} \frac{1}{(2T-1)^2\eta } \le \frac{1}{(2\mu -1)^\alpha }. \end{aligned}$$
(7)

When Eq. 7 is satisfied, by Lemmas 3 and 5, the main order term of the bound improves in the order of \(\frac{1}{2\mu -1}\), from 2 to \(\alpha \). \(\alpha \) is a measure of significance, as \(\alpha \) decreases, the improvement on the cost becomes more significant. We can also see that the ability threshold T should be larger than the mean ability \(\mu \), i.e.

$$\begin{aligned} T > \mu . \end{aligned}$$

The reason is that from Eq. 7, it is not possible for a quality ensured mechanism with \(T\le \mu \) to be \(\alpha \)-effective with \(\alpha < 2\), which means that the mechanism can not lead to cost reduction. Now we are ready to show the main result of this section, a general sufficient condition for a quality ensured mechanism to be at least \(\alpha \)-effective.

Theorem 1

For crowd ability distribution with mean \(\mu > 1/2\) and variance \(\sigma ^2\) under the condition in Eq. 1, when

$$\begin{aligned} (1+\gamma )\frac{(1+\sqrt{1-4\sigma ^2})^2}{\sqrt{3}\sigma ^2[2\sigma ^2+(2\mu -1)]^2} \le \frac{1}{(2\mu -1)^\alpha }, \end{aligned}$$
(8)

in which

$$\begin{aligned} \gamma = \frac{\Pr \,\left( \theta \le \mu - \frac{1}{2}\sqrt{1-\sqrt{1-4\sigma ^2}}\right) }{\Pr \,\left( \theta \ge \mu + \frac{1}{2}\sqrt{1-\sqrt{1-4\sigma ^2}}\right) }, \end{aligned}$$
(9)

then the quality ensured mechanism with ability threshold

$$\begin{aligned} T= \mu + \frac{1}{2}\sqrt{1-\sqrt{1-4\sigma ^2}} \end{aligned}$$
(10)

is at least \(\alpha \)-effective.

Proof

First we show a lemma which gives a lower bound on the tail probability for the crowd ability distribution. The intuition is that when the variance of a random variable is high, the tail probability should not be too small.

Lemma 6

Given a random variable \(x \sim A, x\in [0,1]\) with mean \(\mu \) and variance \(\sigma ^2\), under the condition in Eq. 1, for \(0< r < \sigma \), we have

$$\begin{aligned} \Pr \,\left( |x-\mu |\ge r\right) \ge \frac{2\sqrt{3}(\sigma ^2-r^2)}{1-4r^2}. \end{aligned}$$

Proof

Suppose that

$$\begin{aligned} \Pr \,\left( |x-\mu | > r\right) = A. \end{aligned}$$

We can derive an upper bound of the variance \(\sigma ^2\). Denote \(Y=|x-\mu |\), by \(\mu > 1/2\) and Eq. 1, we have \(Y \in [0,1/2]\). Then

$$\begin{aligned} \begin{aligned} \sigma ^2 = \mathbb {E}\left[ (x-\mu )^2\right]&=\int _0^{\frac{1}{2}} y^2\mathrm {d}P(Y\le y)=\int _0^{\frac{1}{2}} 2y\Pr \,\left( Y> y\right) \mathrm {d}y\\&= \int _0^{r}2y\,\Pr \,\left( Y> y\right) \mathrm {d}y + \int _{r}^{\frac{1}{2}} 2y\Pr \,\left( Y>y\right) \mathrm {d}y\\&\le \int _0^{r}2y\mathrm {d}y + \sqrt{\int _{r}^{\frac{1}{2}}4y^2\mathrm {d}y\int _{r}^{\frac{1}{2}}(\Pr \,\left( Y>y\right) )^2\mathrm {d}y}\\&\le \int _0^{r}2y\mathrm {d}y + \sqrt{\int _{r}^{\frac{1}{2}}4y^2\mathrm {d}y\int _{r}^{\frac{1}{2}}A^2\mathrm {d}y}\\&\le r^2 + \sqrt{\left( \frac{1}{6}-\frac{4}{3}r^3\right) \left( \frac{1}{2}-r\right) A^2}. \end{aligned} \end{aligned}$$

The first inequality is due to \(\Pr \,\left( Y>y\right) \le 1\) and Cauchy-Schwarz inequality. The second inequality is due to \(\Pr \,\left( Y > y\right) \le A\) when \(y>r\). We then have

$$\begin{aligned} A \ge \frac{2\sqrt{3}(\sigma ^2-r^2)}{\sqrt{(1-8r^3)(1-2r)}}. \end{aligned}$$

and

$$\begin{aligned} \Pr \,\left( |x-\mu |\ge r\right) \ge \Pr \,\left( |x-\mu | > r\right) = A \ge \frac{2\sqrt{3}(\sigma ^2-r^2)}{\sqrt{(1-8r^3)(1-2r)}}. \end{aligned}$$

Observe that

$$\begin{aligned} (1-8r^3)(1-2r) = 1-((2r)^3+2r)+(2r)^4 \le 1 - 2(2r)^2 + (2r)^4 = (1-(2r)^2)^2. \end{aligned}$$

Then

$$\begin{aligned} \Pr \,\left( |x-\mu |\ge r\right) \ge \frac{2\sqrt{3}(\sigma ^2-r^2)}{1-4r^2}. \end{aligned}$$

\(\square \)

Denote the ability threshold as \(T = \mu + r > \mu \), and assume that

$$\begin{aligned} \Pr \,\left( \theta \le \mu -r\right) = \gamma \Pr \,\left( \theta \ge \mu + r\right) , \end{aligned}$$

then

$$\begin{aligned} \Pr \,\left( \theta \ge T\right) = \Pr \,\left( \theta \ge \mu + r\right) \ge \frac{1}{1+\gamma }\frac{\sqrt{3}(\sigma ^2-r^2)}{1-4r^2}. \end{aligned}$$
(11)

Now we turn to the task of finding the minimum of \(\frac{1}{(2T-1)^2\eta }\), which is equivalent to maximizing \((2T-1)^2\eta \). Relaxation can be made to make use of the lower bound given by Eq. 11. Then we turn to the maximization of

$$\begin{aligned} \frac{1}{1+\gamma }\frac{(\sqrt{3})(\sigma ^2-r^2)}{1-4r^2}(2\mu +2r-1)^2. \end{aligned}$$

We have a constraint \(0<r<\sigma \), and this constraint is enough to ensure \(T=\mu +r \le 1\). To see this, for random variable x in [0, 1] with mean \(\mu _x > 1/2\) and variance \(\sigma _x^2\), the maximum of the variance is attained on the distribution such that we have probabilities of 1 / 2 only at \(x=1\) and \(2\mu _x-1\). Then we can verify that under this distribution, we have \(\sigma ^2_x = (1-\mu _x)^2\) and \(\mu _x + \sigma _x = 1\).

By \(0< r < \sigma \), we have \((\frac{2\sigma +(2\mu -1)}{2\sigma })2r \le 2r+2\mu -1\). We turn to the maximization of

$$\begin{aligned} (4\sqrt{3})\left( \frac{1}{1+\gamma }\right) \left( \frac{2\sigma +(2\mu -1)}{2\sigma }\right) ^2\frac{(\sigma ^2-r^2)}{1-4r^2}r^2. \end{aligned}$$

Let \(s=r^2\) and drop the constants for the moment. We consider

$$\begin{aligned} \max \limits _{0< s < \sigma ^2} f(s) = \frac{\sigma ^2s-s^2}{1-4s}. \end{aligned}$$

We then have

$$\begin{aligned} f'(s) = \frac{(\sigma ^2-2s)(1-4s)+4(\sigma ^2s-s^2)}{(1-4s)^2}=\frac{4s^2-2s+\sigma ^2}{(1-4s)^2}. \end{aligned}$$

Without consideration of constraints on s, the optimum is attained when \(s=(1-\sqrt{1-4\sigma ^2})/4\). The solution satisfies the constraints since

$$\begin{aligned} 1 - 4\sigma ^2 \le \sqrt{1-4\sigma ^2} \iff \sigma ^2 \ge (1-\sqrt{1-4\sigma ^2})/4. \end{aligned}$$

So we have

$$\begin{aligned} T= \mu + \frac{1}{2}\sqrt{1-\sqrt{1-4\sigma ^2}} \end{aligned}$$

and

$$\begin{aligned} m' = (1+\gamma )\frac{(1+\sqrt{1-4\sigma ^2})^2}{\sqrt{3}\sigma ^2[2\sigma ^2+(2\mu -1)]^2}. \end{aligned}$$

\(\square \)

From the theorem, it can be seen that as \(\sigma ^2\) increases, the left hand side of Eq. 8 decreases such that smaller \(\alpha \) can be achieved. In Eq. 9, the smaller the \(\gamma \) is, the larger the upper tail probability at T can be ensured. It is reasonable to assume that for an effective quality ensured mechanism, \(\gamma \) can not be large, since \(\Pr \,\left( \theta \ge T\right) \) can not be small.

In Eq. 10, T increases as the variance gets larger. The intuition behind is that when the variance of the crowd ability distribution increases, then we have more workers with high ability, and we can safely increase T to make higher demand on the quality of the returned labels.

5 Analysis on the unsure mechanisms

In this section, we consider the unsure mechanisms. In the previous analysis, the quality ensured mechanisms guarantee that the abilities of the workers who return their labels are above the threshold T. However, due to the potential mismatch between one’s confidence and ability, an unsure mechanism does not guarantee this property, since the workers behave based on their confidence, not their true abilities. The only reasonable assumption we can make is that there can be a positive correlation between the confidence and ability for an individual worker. This makes it difficult to estimate the mean ability of the crowd directly, which is essential for deriving the cost bound. In spite of this difficulty, one of the major theoretical findings in this section is: If we can filter out a bit more workers with low abilities besides the assumption in Eq. 1, then we can lower bound the mean ability. As a result, we introduce the following worker testing stage conducted before the actual labeling tasks start:

  1. 1.

    Keep a small pool of golden standard tasks with known labels.

  2. 2.

    For each worker in the crowd, k golden standard tasks are drawn for testing the ability.

  3. 3.

    We only send tasks to workers who correctly labels all k golden standard tasks.

Note that we do not assume to have a sufficient number of golden standard tasks to accurately estimate each workers’ abilities, or to make k large so that the mean ability of workers can be boosted. On the opposite, we assume that k is very small since by this it is enough to filter out a bit more low quality workers. The experimental results in Sect. 8 show that introducing the worker testing stage is effective even when \(k=1\). We assume that no rewards are paid during the test stage, since the workers tend to have the motivation for passing the test. It is also essential to ensure that the workers behave the same among the test and the real tasks. As we assume workers’ honesty in this paper, this is not a problem. While in applications it is necessary to utilize incentive compatible payment mechanism to ensure honesty, as discussed in Sect. 6.

The next task is to model the relationship between one’s ability and confidence. First we introduce some notations, as listed in Table 1.

Table 1 Some notations used in further analysis

To follow the analysis process in Sect. 4, we assume that

$$\begin{aligned} {\left\{ \begin{array}{ll} T> \mu _{\theta ,0}, &{} \quad \eta _{c,1} \ge \eta _{\theta ,1}, \\ T > \mu _{c,1},&{} \quad \eta _{c,1} < \eta _{\theta ,1}. \\ \end{array}\right. } \end{aligned}$$
(12)

Furthermore, we introduce the following assumptions:

When \(\eta _{c,1} \ge \eta _{\theta ,1}\), there exists \(k_{0} > 0\), for all \(T > \mu _{\theta ,0}\),

$$\begin{aligned} \frac{\Pr \,\left( c \ge T | \theta \ge T, passed\right) }{\eta _{c,1}}\ge & {} \Big (\mathbb E_{\theta ,0}[\theta ^k]\Big )^{k_0}\left( \frac{1}{\eta _{\theta ,0}}\right) , \end{aligned}$$
(13)
$$\begin{aligned} \frac{(\mu _{\theta ,0})^{k+1}}{(\mathbb E_{\theta ,0}[\theta ^k])^{1-k_0}}\ge & {} 1/2. \end{aligned}$$
(14)

When \(\eta _{c,1} < \eta _{\theta ,1}\), then there exsits \(k_1 \ge 0\), for all \(T > \mu _{c,1}\),

$$\begin{aligned} \frac{\Pr \,\left( c \ge T | \theta \ge T, passed\right) }{\eta _{c,1}}\ge & {} (\mu _{c,1})^{k_1}\left( \frac{1}{\eta _{c,1}}\right) , \end{aligned}$$
(15)
$$\begin{aligned} (\mu _{c,1})^{k_1+1}\ge & {} 1/2. \end{aligned}$$
(16)

Ignoring the exponential terms, Eqs. 13 and 15 imply that

$$\begin{aligned} \Pr \,\left( c \ge T | \theta \ge T, passed\right) \ge \eta _{c,1} = \Pr \,\left( c\ge T| passed\right) . \end{aligned}$$

This is a reasonable assumption, since for an unsure mechanism to be useful, the positive correlation between confidence and ability is necessary. The task dependent constants \(k_0\) and \(k_1\) control the magnitude of this positive correlation. Since we have \(\mathbb E_{\theta ,0}[\theta ^k] < 1\) and \(\mu _{c,1} < 1\), the larger \(k_0\) and \(k_1\) are, the weaker the positive correlation becomes. While these two constants are usually small since it is common for the confidence and the ability to be correlated. Equations 14 and 16 are adopted for cost bound derivation, ensuring the transformed threshold \(T'\) to be above 1 / 2 (see Eqs. 18, 19). Under the above assumptions, we can get the following cost bound for an unsure mechanism:

Lemma 7

Assume the conditions in Eqs. (1216) to hold. Employ the unsure mechanism with confidence threshold T and the worker testing stage with k golden standard tasks. For majority voting aggregation under the settings introduced in Sect. 3, the aggregated label is correct with probability at least \(1-\delta \) if the cost satisfies

$$\begin{aligned} m' \ge \frac{2(1-\eta )\log \frac{2}{\delta }}{\eta }+\frac{8\log \frac{2}{\delta }}{(2T'-1)^2\eta }+\frac{2\log {2}{\delta }}{3\eta }. \end{aligned}$$
(17)

When \(\eta _{c,1} \ge \eta _{\theta ,1}\),

$$\begin{aligned} \eta = \left( \frac{T^k}{\mathbb E_{\theta ,0}[\theta ^k]}\right) \eta _{\theta ,0}, \quad T' = \left( \frac{T^k}{(\mathbb E_{\theta ,0}[\theta ^k])^{1-k_0}}\right) T. \end{aligned}$$
(18)

When \(\eta _{c,1} < \eta _{\theta ,1}\),

$$\begin{aligned} \eta = \eta _{c,1}, \quad T' = (\mu _{c,1})^{k_1}T. \end{aligned}$$
(19)

Proof

The key idea is to estimate two quantities for the crowd after the worker testing stage. One is the proportion of the workers who have confidence above T, i.e. \(\eta _{c,1}\). When \(\eta _{c,1} < \eta _{\theta ,1}\), we directly use \(\eta _{c,1}\). Otherwise when \(\eta _{c,1} \ge \eta _{\theta ,1}\), we should lower bound \(\eta _{\theta ,1}\).

$$\begin{aligned} \eta _{\theta ,1} = \frac{\eta _{\theta ,0}\Pr \,\left( passed|\theta \ge T\right) }{\mathrm {Pr}_{\theta ,0}(passed)}. \end{aligned}$$

It is easy to see that \(\Pr \,\left( passed|\theta \ge T\right) \ge T^k\), and

$$\begin{aligned} \mathrm {Pr}_{\theta ,0}(passed)= \mathbb E_{\theta ,0}[passed|\theta ]= \mathbb E_{\theta ,0}[\theta ^k]. \end{aligned}$$

So we have

$$\begin{aligned} \eta _{\theta ,1} \ge \left( \frac{T^k}{\mathbb E_{\theta ,0}[\theta ^k]}\right) \eta _{\theta ,0}. \end{aligned}$$

The other quantity to lower bound is the mean ability of workers who have confidence above T, i.e. \(\mathbb {E}\left[ \theta | c \ge T, passed\right] \). We have

$$\begin{aligned} \mathbb {E}\left[ \theta | c \ge T, passed\right] \ge T \Pr \,\left( \theta \ge T|c \ge T,passed\right) \end{aligned}$$

and

$$\begin{aligned} \Pr \,\left( \theta \ge T|c \ge T,passed\right) =\frac{\eta _{\theta ,1}\Pr \,\left( c\ge T|\theta \ge T,passed\right) }{\eta _{c,1}}. \end{aligned}$$

Then by Eqs. 13 and 15, we have the desired result. The remaining part of the proof is similar to Lemma 5. \(\square \)

Denote \(B_1 = \frac{T^k}{(\mathbb E_{\theta ,0}[\theta ^k])^{1-k_0}}, B_2 = \frac{T^k}{\mathbb E_{\theta ,0}[\theta ^k]}\), the main order term of cost under an unsure mechanism is \(\frac{1}{(2B_1T-1)^2B_2\eta _{\theta ,0}}\) when \(\eta _{c,1} \ge \eta _{\theta ,1}\) and \(\frac{1}{(2(\mu _{c,1})^{k_1}T-1)^2\eta _{c,1}}\) when \(\eta _{c,1} < \eta _{\theta ,1}\). As previously discussed, k is usually a small number. Then \(B_1,B_2\) scale like constant factors. Thus we can let \(B_1 = \frac{(\mu _{\theta ,0})^k}{(\mathbb E_{\theta ,0}[\theta ^k])^{1-k_0}}\) to consider the worst case, and ignore \(B_2\). Similar to definition 2, we define the \(\alpha \)-effectiveness for the unsure mechanisms:

Definition 3

The unsure mechanism with confidence threshold T, utilizing the worker testing stage with k golden standard tasks, is at least \(\alpha \) -effective if

$$\begin{aligned} {\left\{ \begin{array}{ll} \frac{1}{\left( 2\frac{(\mu _{\theta ,0})^k}{(\mathbb E_{\theta ,0}[\theta ^k])^{1-k_0}}T-1\right) ^2\eta _{\theta ,0}} \le \frac{1}{(2\mu _{\theta ,1}-1)^\alpha },\quad \eta _{c,1} \ge \eta _{\theta ,1}, \\ \frac{1}{(2(\mu _{c,1})^{k_1}T-1)^2\eta _{c,1}} \le \frac{1}{(2\mu _{\theta ,1}-1)^\alpha },\quad \eta _{c,1} < \eta _{\theta ,1}. \\ \end{array}\right. } \end{aligned}$$
(20)

The \(\alpha \)-effectiveness again measures the significance of the improvement on the cost bound, with respect to doing simple aggregation from the crowd after the worker testing stage. Then we can show the condition when an unsure mechanism can be \(\alpha \)-effective, which is similar to Theorem 1. The process of the proof is also similar to Theorem 1, thus is omitted.

Theorem 2

Assume the conditions in Eqs. (1, 1216) to hold.

  1. (1)

    When \(\eta _{c,1} \ge \eta _{\theta ,1}\), let

    $$\begin{aligned} m'= & {} (1+\gamma )\frac{(1+\sqrt{1-4\sigma _{\theta ,0}^2})^2}{\sqrt{3}\sigma _{\theta ,0}^2\left[ 2B_1\sigma _{\theta ,0}^2+(2B_1\mu _{\theta ,0}-1)\right] ^2},\quad B_1 = \frac{(\mu _{\theta ,0})^k}{(\mathbb E_{\theta ,0}[\theta ^k])^{1-k_0}}, \end{aligned}$$
    (21)
    $$\begin{aligned} \gamma= & {} \frac{\mathrm {Pr}_{\theta ,0}\left( \theta \le \mu _{\theta ,0} - \frac{1}{2}\sqrt{1-\sqrt{1-4\sigma _{\theta ,0}^2}}\right) }{\mathrm {Pr}_{\theta ,0}\left( \theta \ge \mu _{\theta ,0} + \frac{1}{2}\sqrt{1-\sqrt{1-4\sigma _{\theta ,0}^2}}\right) }, \end{aligned}$$
    (22)

    and

    $$\begin{aligned} T= \mu _{\theta ,0} + \frac{1}{2}\sqrt{1-\sqrt{1-4\sigma _{\theta ,0}^2}}. \end{aligned}$$
    (23)
  2. (2)

    When \(\eta _{c,1} < \eta _{\theta ,1}\), let

    $$\begin{aligned} m'= & {} (1+\gamma )\frac{(1+\sqrt{1-4\sigma _{c,1}^2})^2}{\sqrt{3}\sigma _{c,1}^2\left[ 2\mu _{c,1}^{k_1}\sigma _{c,1}^2+(2\mu _{c,1}^{k_1+1}-1)\right] ^2}, \end{aligned}$$
    (24)
    $$\begin{aligned} \gamma= & {} \frac{\mathrm {Pr}_{c,1}\left( c \le \mu _{c,1} - \frac{1}{2}\sqrt{1-\sqrt{1-4\sigma _{c,1}^2}}\right) }{\mathrm {Pr}_{c,1}\left( c \ge \mu _{c,1} + \frac{1}{2}\sqrt{1-\sqrt{1-4\sigma _{c,1}^2}}\right) }, \end{aligned}$$
    (25)

    and

    $$\begin{aligned} T= \mu _{c,1} + \frac{1}{2}\sqrt{1-\sqrt{1-4\sigma _{c,1}^2}}. \end{aligned}$$
    (26)

    Then if

    $$\begin{aligned} m' \le \frac{1}{(2\mu _{\theta ,1}-1)^\alpha }, \end{aligned}$$
    (27)

    the unsure mechanism with confidence threshold T, utilizing the working testing stage with k golden standard tasks, is at least \(\alpha \)-effective.

6 Discussion on the payment strategy

In the above analysis, we assume that returning labels and choosing unsure option are equally paid. In many crowdsourcing applications, this payment strategy may lead to the potential danger for causing workers to always choose the unsure option without returning any labels. This phenomenon violates the assumption that the workers are honest. Using alternative incentive compatible payment method (Shah and Zhou 2015) can be helpful to deal with this problem. As an example, the following payment method incentivizes the workers to behave honestly, under the assumption that the workers aim to maximize their payments:

  1. (1)

    Choosing unsure option is paid for T, the value of the confidence threshold.

  2. (2)

    Among the returned labels, the ones that accord with the aggregated label are paid for 1, otherwise are paid for 0.

It is easy to show that this payment strategy is incentive compatible. If the worker has confidence \(c > T\), then the expected payment for returning the label is also c, while the payment for choosing unsure option is T. Thus the worker is desirable to return the label. If \(c < T\) the reason is similar for the worker to choose unsure option. The analysis in previous sections is a good approximation for this payment method. The reason is that, since \(T>1/2\), and for an effective unsure mechanism, most of the returned labels should agree with the aggregated label, thus assuming returning labels and choosing unsure option are both paid for 1 does not sacrifice much tightness for the cost bounds. Overall, it is interesting to study how the optimal incentive compatible payment method and the unsure mechanism can be integrated for different application scenarios. We leave this as future work.

7 Online algorithm with unsure option

figure a

The central task for applying an unsure mechanism is to determine the confidence threshold T. According to the previous analysis, setting T can be transfered to the problem of estimating the mean and variance of the crowd ability distribution or the crowd confidence distribution. For the case that we need to consider the crowd ability distribution, doing accurate estimation requires a sufficient number of golden standard tasks with known labels. However, this condition is difficult to be satisfied in practice. For solving this problem, we propose an alternative online bandit based algorithm for setting the confidence threshold T. Note that we still allow to use a small number of golden standard tasks to perform the worker testing stage.

The task is to properly choose the confidence threshold T, which can be treated as bandit arms. We can model the crowdsourcing process as the following bandit game: To collect a new label, a random worker is drawn from the crowd, and a confidence threshold T is provided. T is updated online by the bandit algorithm. We consider only discrete candidate set of \(t_j,j\in \{1,2,\ldots ,K\}\), which segments the interval [0.5, 1] into finite number of parts. Motivated by the previous analysis, we define the reward as

$$\begin{aligned} r_i = (2T_i-1)^2\mathbb I(c_i \ge T_i), \end{aligned}$$
(28)

in which i denotes the ith round, \(T_i\) denotes the chosen confidence threshold, and \(\mathbb I(c_i \ge T)\) denotes the indicator function of the event that the worker does not choose the unsure option. Under this definition, for each arm \(t_k\), let \(N_k\) denote the number of times the arm is chosen. The average reward \(\hat{r}_k = \frac{1}{N_k}\sum _{n=1}^{N_k}(2t_k-1)^2\mathbb I(c_i \ge t_k)\) is the empirical estimation of \((2t_k-1)^2\Pr \,\left( c_i\ge t_k\right) \), i.e. the inverse of main order term of cost for accurate estimation, which should be maximized. \(r_i\) are i.i.d. random variables bounded in [0, 1]. The above problem can be solved by many bandit optimization methods, such as the UCB-1 algorithm (Auer et al. 2002), which is illustrated in Algorithm 1. Note that more sophisticated bandit algorithm can be designed in this task, since the sample collected on one arm may provide additional information on other arms. As the major topic in this paper is theoretical analsys other than algorithm design, we leave this as future work.

Fig. 1
figure 1

Experimental results. In the legend, “Theory” denotes the result from setting the confidence threshold as Eq. 25. “OLU” denotes the results from online unsure mechanism algorithm. “SA” denotes simple aggregation without unsure mechanism

8 Experiments

We used synthetic data to test the theoretical results and the online algorithm. In the experiments, a set of binary labeling tasks were generated, and the ground-truth labels were uniformly sampled from \(\{-1,1\}\). To simulate on different types of crowd ability distributions, the abilities of the workers were sampled from different Beta distributions. The choices of distribution parameters \(\{\alpha ,\beta \}\) were \(\{0.55,0.5\}, \{1.1,1\}\) and \(\{2.2,2\}\). The corresponding mean was 0.52 and corresponding variances were 0.1217, 0.0805 and 0.0480. Since the mean is close to 1 / 2, the left hand side of the distributions were not cut according to Eq. 1. The returned labels were sampled from the Bernoulli distributions according to the abilities. To simulate the situation that the confidence can be largely deviated from the ability, we assumed that the unsure option was used when the sampled ability is above T or below \(1-T\).

For each task, we collected the same number of returned labels for majority voting aggregation. The baseline method was simple aggregation without using the unsure mechanism. For examining theory, we adopted the unsure mechanism, on which the confidence threshold is set according to Eq. 23. To implement the online algorithm, we employed the candidate threshold set \(\{0.55,0.60,\ldots ,1\}\). For the fairness of the comparison, we adopted the worker testing stage with \(k=1\) for all methods.

The results are illustrated in Fig. 1. In all kinds of crowd ability distributions, employing unsure mechanism outperformed simple aggregation. As the variance got larger, the number of workers needed for high accuracy aggregation was significantly reduced. Furthermore, when the number of tasks got larger, the performance of the online algorithm got better. This phenomenon indicates that the online algorithm is capable to be employed when the number of tasks to be done is large.

9 Conclusions

In this work, we theoretically study the cost-saving effect of the crowdsourcing with unsure option setting. We give the sufficient condition for an unsure mechnanism can lead to significant cost reduction, show how confidence threshhold can be properly set. Motivated by the theoretical analysis, we also propose an alternative online algorithm for setting the confidence threshold. We also hope our work to be a motivation for further studies on how crowdsourcing can benefit by utilizing subjective uncertainty of workers.