1 Introduction

Crowdsourcing platforms are disrupting traditional work marketplaces. Their ability to compute high-quality matchings between tasks and workers, instantly and worldwide, for paid or voluntary work, has made them unavoidable actors of the 21\(^{st}\) century economy. Early crowdsourcing platforms did not (and still do not) require strong or specific skills from their workers; they include for example Amazon Mechanical Turk (for online micro-tasks), Uber (for car-driving tasks), or TaskRabbit (for simple home-related tasks, e.g., cleaning or repairing). Today's crowdsourcing platforms go one step further by addressing skill-intensive contexts (e.g., general team building, collaborative engineering) through the collection and use of fine-grained worker profiles. Such platforms carry the promise to facilitate, accelerate, and spread innovation at an unprecedented scale.

However, abusive behaviors of crowdsourcing platforms against workers, whether deliberate or not, are frequently reported in the news or on dedicated websites (see, e.g., the privacy scandals due to illegitimate accesses to the geolocation data of a well-known drivers-riders company, or the large-scale exposure of workers' identifying and sensitive information, such as real name, book reviews, or wish list, through Amazon Mechanical Turk IDs [8]). The problem is even more acute with skill-intensive crowdsourcing platforms since they collect detailed worker profiles in order to compute highly accurate matchings (e.g., demographics, comprehensive set of skills, detailed past experiences, personal preferences, daily availabilities, tools possessed). We thus advocate a sound protection of workers' profiles against illegitimate uses: in addition to the necessary compliance with fundamental rights to privacy, it is a precondition for a wide adoption of crowdsourcing platforms by individuals.

Fig. 1. Our approach to privacy-preserving task assignment

Computing (or at least facilitating) the assignment of tasks to workers is the fundamental role of the platform. This paper considers precisely the problem of computing a high-quality matching between skill-intensive tasks and workers while preserving workers' privacy. To the best of our knowledge, this problem has only been addressed by a single recent work [6]. However, that work is based on costly homomorphic encryption primitives, which strongly hamper its performance and prevent it from reasoning about skills within the assignment algorithm (e.g., no use of semantic proximity).

We propose an approach (see Fig. 1) that addresses these issues by making the following contributions:

  1.

    A simple skills model for a worker’s profile: a bit vector and a taxonomy.

  2.

    An algorithm run independently by each worker for perturbing her profile locally before sending it to the platform. By building on the proven randomized response mechanism [3, 11], this algorithm is privacy-preserving (provides sound differential privacy guarantees [5]), and lightweight (no cryptography, no distributed computation, only bitwise operations).

  3.

    A suite of weight functions to be plugged into a traditional assignment algorithm run by the platform, dedicated to increasing the quality of matchings performed over perturbed profiles. Our weight functions reduce the impact of the perturbation by leveraging the skills taxonomy, vertically and horizontally: they average skills according to their semantic proximity, which reduces the variance of the differentially-private perturbation. The variance reduction is mathematically sound and does not jeopardize privacy.

  4.

    An experimental study (see the full version [1]), over a synthetic taxonomy and various synthetic datasets, that shows promising preliminary results about the practical adequacy of our approach from the sides of performance and quality.

For space reasons, we give in this paper an overview of our approach. We refer the interested reader to the full version of our work [1], which describes our approach in detail, presents its experimental results, and positions it with respect to related work.

The rest of the paper is organized as follows. Section 2 introduces the notions used in our approach and defines more precisely the problem we tackle. We overview our algorithms in Sect. 3 and conclude in Sect. 4 by outlining future work.

2 Problem Definition

Skills and Participants. The set of skills that can be possessed by a worker (resp. requested by a task) is denoted \(\mathcal {S}\). A worker's profile \(p_i \in \mathcal {P}\) (resp. a task \(t_i \in \mathcal {T}\)) is represented by a bit vector, i.e., \(p_i \in {\{0, 1\}}^{|\mathcal {S}|}\) (resp. \(t_i \in {\{0, 1\}}^{|\mathcal {S}|}\)), where each bit corresponds to a skill \(s_j \in \mathcal {S}\) and is set to 1 if the given worker has the given skill (resp. if the given task requests the given skill). Without loss of generality, we consider that each requester has a single task and that the number of workers and requesters is the same (i.e., \(|\mathcal {P}|=|\mathcal {T}|\)). Furthermore, we assume that a skills taxonomy \(\mathcal {S}_T\) exists [9], structuring the skills according to their semantic proximity, and such that the skills in \(\mathcal {S}\) are the leaves of \(\mathcal {S}_T\) (i.e., no non-leaf node can be possessed or requested). The non-leaf nodes of the taxonomy are called super-skills (Fig. 2).

Fig. 2. Example of a skill taxonomy \(\mathcal {S}_T\)
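To make the data model concrete, here is a minimal Python sketch of a profile encoded as a bit vector over the leaves of a taxonomy stored as child-to-parent pointers; the toy taxonomy and all names are purely illustrative (they are not taken from Fig. 2).

# Minimal sketch: hypothetical toy taxonomy stored as child -> parent pointers;
# leaves are concrete skills, internal nodes are super-skills, the root maps to None.
PARENT = {
    "python": "programming", "java": "programming",
    "welding": "manufacturing", "milling": "manufacturing",
    "programming": "engineering", "manufacturing": "engineering",
    "engineering": None,  # root
}
SKILLS = sorted(s for s in PARENT if s not in PARENT.values())  # leaves only

def make_bitvector(possessed):
    """Encode a set of skills as a bit vector aligned with SKILLS."""
    return [1 if s in possessed else 0 for s in SKILLS]

p = make_bitvector({"python", "welding"})  # a worker's profile
t = make_bitvector({"python"})             # a task uses the same encoding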

The platform is essentially in charge of intermediating between workers and requesters. The workers’ profiles are considered private while the requesters’ tasks are not. The platform holds the set of workers’ profiles, perturbed to satisfy differential privacy (defined below) and denoted \(\widetilde{\mathcal {P}}\), as well as the exact set of requesters’ tasks \(\mathcal {T}\). All participants, i.e., workers, requesters, and the platform, are considered to be honest-but-curious. This means that they participate in the protocol without deviating from its execution sequence (e.g., no message tampering, no data forging) but they will try to infer anything that is computationally-feasible to infer about private data (i.e., the set of workers’ non-perturbed profiles \(\mathcal {P}\)).

The Traditional Tasks-to-Workers Assignment Problem. In a traditional context, where workers' profiles are not considered private, the objective of the crowdsourcing platform is to assign a worker to each task such that the overall expected quality is maximized. This well-known combinatorial optimization problem is referred to as the assignment problem and can be expressed as a standard linear program [7] (assuming \(|\mathcal {T}| = |\mathcal {P}|\)). Assignment algorithms rely on a weight function \(\mathtt {C}:\mathcal {T}\times \mathcal {P}\rightarrow \mathbb {R}\) in charge of defining the cost of each assignment, i.e., the divergence between the requirements vector of a task and the skills vector of a worker. Common weight functions include the usual distance metrics (e.g., Hamming distance) or dissimilarities (e.g., cosine distance). Since our approach is independent of the assignment algorithm, we simply use the Hungarian method [7], a standard academic choice.
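As an illustration, the following sketch plugs a weight function into an off-the-shelf linear assignment solver (scipy's linear_sum_assignment, which solves the same optimization problem as the Hungarian method); the Hamming-style cost shown here is only a placeholder for the weight functions of Sect. 3.2.

# Illustrative sketch: task assignment with a pluggable weight function C.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hamming_cost(task, profile):
    """Number of positions where the task and profile bit vectors differ."""
    return int(np.sum(np.asarray(task) != np.asarray(profile)))

def assign(tasks, profiles, weight=hamming_cost):
    """Return (task index, worker index) pairs minimizing the total cost."""
    cost = np.array([[weight(t, p) for p in profiles] for t in tasks])
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))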

Security. We say that our approach is secure against honest-but-curious participants if and only if no participant learns any information about the set of non-perturbed profiles \(\mathcal {P}\) that has not been perturbed by a differentially-private mechanism, where differential privacy [4], the current de facto standard model for disclosing personal information under sound privacy guarantees, is defined below. Differential privacy is self-composable [10] and secure under post-processing [5].

Definition 1

(Differential Privacy [4]). A randomized mechanism \(\mathtt {M}\) satisfies \(\epsilon \)-differential privacy with \(\epsilon >0\) if for any possible sets of workers' profiles \(\mathcal {P}\) and \(\mathcal {P}'\) such that \(\mathcal {P}'\) is \(\mathcal {P}\) with one additional profile (or one profile fewer), and any possible set of outputs \(O \subseteq Range(\mathtt {M})\),

$$\begin{aligned} \Pr [\mathtt {M}(\mathcal {P}) \in O] \le e^\epsilon \times \Pr [\mathtt {M}(\mathcal {P}') \in O] . \end{aligned}$$
(1)

Quality. The inherent information loss due to the differentially-private perturbation impacts the quality of the worker-to-task assignment. This is the price to pay to satisfy a sound privacy model. We quantify this impact by measuring the relative increase of the assignment cost as well as the fraction of profiles that have all the skills required by the task to which they are assigned (see the full version for formal definitions [1]).
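For concreteness, a possible reading of these two quality measures is sketched below (the formal definitions are in [1]); the function names and conventions are ours.

def relative_cost_increase(cost_perturbed, cost_exact):
    """Relative increase of the total assignment cost due to the perturbation."""
    return (cost_perturbed - cost_exact) / cost_exact

def fully_qualified_fraction(assignment, tasks, true_profiles):
    """Fraction of assigned workers possessing every skill required by their task."""
    ok = sum(
        all(p >= t for t, p in zip(tasks[ti], true_profiles[wi]))
        for ti, wi in assignment
    )
    return ok / len(assignment)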

3 A Flip-Based Approach

This section overviews our approach. We first focus on the workers’ side: we describe the algorithm that we propose for perturbing each worker’s profile, show its adequacy to our context, and demonstrate that it complies with our security model. Second, we shift to the platform’s side. We explain how to reduce the impact of the differentially private perturbation (while still satisfying differential privacy) and we describe the assignment algorithm based on perturbed profiles. Finally, we overview technical means for letting workers fetch their assignment in a secure way in order to complete it. We refer the interested reader to the full version [1] for more details (including the formal proofs).

3.1 At a Worker’s Side: Local Perturbation

Building Block: Randomized Response. Randomized response [11] is a simple yet powerful perturbation mechanism shown to satisfy differential privacy (see below). Basically, it inputs a single bit (e.g., the answer of an individual to a sensitive boolean question) and flips it randomly according to a well-chosen probability distribution. We describe below the variant called innocuous question that we use in this paper and show that it satisfies differential privacy. Let \(x \in \{0,1\}\) be a private value. The randomized response mechanism simply outputs the perturbed value of x, denoted \(\widetilde{x}\), as follows:

$$\begin{aligned} \widetilde{x} = {\left\{ \begin{array}{ll} x &{}\text {with probability } 1-\Pr _{flip} \\ 1 &{}\text {with probability } \Pr _{flip}\times \Pr _{inno} \\ 0 &{}\text {with probability } \Pr _{flip}\times (1-\Pr _{inno}) \\ \end{array}\right. } \end{aligned}$$

where \(\Pr _{flip}\) depends on \(\epsilon \) (see below) and \(\Pr _{inno} \in [0, 1]\). We use \(\Pr _{inno} = 0.5\) in the rest of the paper, since it minimizes the variance of the value of \(x\) estimated after perturbation [3].

Claim

For a given differential privacy parameter \(\epsilon >0\), and a worker’s profile made of a single bit to be flipped, the innocuous question randomized response scheme satisfies \(\epsilon \)-differential privacy if \(\Pr _{flip} = \frac{2}{1+e^\epsilon }\) (see [1] for the proof).
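The following Python sketch implements the innocuous-question randomized response described above, with \(\Pr _{inno} = 0.5\) and \(\Pr _{flip} = \frac{2}{1+e^\epsilon }\) as in the claim. As a quick sanity check, with these values the probability of outputting 1 is \(e^\epsilon /(1+e^\epsilon )\) when \(x=1\) and \(1/(1+e^\epsilon )\) when \(x=0\); their ratio is exactly \(e^\epsilon \), consistent with the claim.

import math
import random

def randomized_response(x, epsilon):
    """Innocuous-question randomized response on one private bit x (sketch).
    With probability 1 - pr_flip, report x truthfully; otherwise answer the
    innocuous question, i.e., output 1 with probability pr_inno = 0.5."""
    pr_flip = 2.0 / (1.0 + math.exp(epsilon))
    if random.random() < pr_flip:
        return 1 if random.random() < 0.5 else 0
    return x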

Flip Mechanism. Our Flip mechanism (Algorithm 1) essentially applies the randomized response mechanism to each binary skill of a worker's profile before sending it to the platform, and thus inherits its high efficiency. The self-composability of differential privacy allows this by distributing the privacy budget \(\epsilon \) over the bits of the skills vector.

Algorithm 1. The Flip mechanism

Claim

The Flip mechanism satisfies \(\epsilon \)-differential privacy (see [1] for the proof).
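A possible sketch of the Flip mechanism, reusing the randomized_response function above; the uniform split of the budget over the bits is our assumption for illustration (the exact budget allocation is given in Algorithm 1 of the full version [1]).

def flip(profile, epsilon):
    """Sketch of Flip: perturb each bit with randomized response, spending an
    equal share epsilon / len(profile) per bit so that, by self-composition,
    the whole perturbed profile satisfies epsilon-differential privacy."""
    eps_per_bit = epsilon / len(profile)
    return [randomized_response(bit, eps_per_bit) for bit in profile]

# Each worker runs flip locally, e.g., noisy_p = flip(p, epsilon=1.0),
# and uploads only noisy_p to the platform.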

3.2 At the Platform’s Side: Task Assignment

Efficient traditional assignment algorithms do not need any modification to work with perturbed profiles, which are bit vectors exactly like non-perturbed profiles. The main question is the impact of our perturbation on quality, and this impact is naturally related to the weight function \(\mathtt {C}:\mathcal {T}\times \mathcal {P}\rightarrow \mathbb {R}\) on which assignment algorithms rely. As there is no clear consensus on what makes a good weight function for task assignment, in the sequel we recall several reasonable functions, which either ignore or use the skill taxonomy. We also propose new weight functions and explain how they cope with the differentially-private perturbation.

Existing Weight Functions. Numerous weight functions have been proposed as metrics over skills. The Hamming distance is a common choice to compute the dissimilarity between two bit vectors, but it does not capture the semantics needed for crowdsourcing (e.g., a worker possessing all the skills has a high Hamming distance from a task requiring only one skill, although she is perfectly able to perform it). The weight function proposed in [9] addresses this problem by relying on a taxonomy. We slightly adapt it and call it the Ancestors weight function (AWF for short).

Definition 2

( \(\mathtt {Ancestors}\) Weight Function (adapted from [9])). Let \(d_{max}\) be the maximum depth of the taxonomy \(\mathcal {S}_T\). Let \(\mathrm {lca}(s, s') \in \mathcal {S}_T\) be the lowest common ancestor of skills \(s\) and \(s'\) in the taxonomy.

$$\begin{aligned} \mathtt {AWF}(t, \widetilde{p}) = { \sum _{s_i \in \widetilde{p}} { \min _{s_j \in t} \left( \frac{d_{max} - \mathrm {depth}(\mathrm {lca}(s_i, s_j))}{d_{max}} \right) } } \end{aligned}$$
(2)
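A direct transcription of AWF on the toy taxonomy sketched in Sect. 2 could look as follows (illustrative helper names; it assumes a single-rooted taxonomy and a task requiring at least one skill).

def depth(node):
    """Depth of a node in the taxonomy (the root has depth 0)."""
    d = 0
    while PARENT[node] is not None:
        node = PARENT[node]
        d += 1
    return d

def lca(a, b):
    """Lowest common ancestor of two skills in the taxonomy."""
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = PARENT[node]
    node = b
    while node not in ancestors:
        node = PARENT[node]
    return node

D_MAX = max(depth(s) for s in SKILLS)

def awf(task, profile):
    """Ancestors weight function over bit vectors (Definition 2)."""
    t_skills = [s for s, b in zip(SKILLS, task) if b]
    p_skills = [s for s, b in zip(SKILLS, profile) if b]
    return sum(
        min((D_MAX - depth(lca(si, sj))) / D_MAX for sj in t_skills)
        for si in p_skills
    )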

Naive Skill-Level Weight Functions. The Missing weight function (MWF for short) between a worker and a task revisits the Hamming distance. It defines a task-to-worker assignment cost that is intuitive in a crowdsourcing context: the number of skills required by the task that the worker does not have (see Definition 3).

Definition 3

( \(\mathtt {Missing}\) Weight Function \(\mathtt {(MWF)}\) ). \(\mathtt {MWF}:\mathcal {T}\times \widetilde{\mathcal {P}}\rightarrow \mathbb {R}\) is defined as follows:

$$\begin{aligned} \mathtt {MWF} (t, \widetilde{p}) = \sum _{\forall i} t[i] \wedge \lnot \widetilde{p}[i] \end{aligned}$$
(3)
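MWF translates directly into a one-line count over the bit vectors, for instance:

def mwf(task, profile):
    """Missing weight function: skills required by the task that the
    (perturbed) profile does not contain (Definition 3)."""
    return sum(t & (1 - p) for t, p in zip(task, profile))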

Leveraging the Taxonomy. In realistic contexts, the differentially-private perturbation may overwhelm the information contained in the original profiles and make the perturbed profiles close to uniformly random bit vectors. We cope with this challenging issue by building on the taxonomy \(\mathcal {S}_T\). Indeed, the taxonomy makes it possible to group large numbers of skills according to their semantic proximity and to reduce the variance of the perturbation by using group averages [2].

Climbing Weight Function. The Climbing weight function (CWF for short) leverages the vertical relationship given by the taxonomy by averaging, for each profile, the skills along the root-to-leaf paths. In other words, before performing the assignment, the platform converts each perturbed profile into a tree, i.e., the same tree as the taxonomy \(\mathcal {S}_T\), and for each node n of the tree, it computes the mean of the skills that appear below n (interpreting the boolean values 1 and 0 as integers). For a given node, this mean is a rough estimator of the fraction of descendant skills possessed; we call it the node's score below. During an assignment, given a task and a perturbed profile, the Climbing weight function essentially computes, at each level, the distance between the scores vector of the task and the scores vector of the profile. Definition 4 formalizes the Climbing weight function.

Definition 4

( \(\mathtt {Climbing}\) Weight Function \(\mathtt {(CWF)}\) ). Let \(v_i\) (resp. \(u_i\)) denote the scores vector at level i in the tree corresponding to the profile \(\widetilde{p}\) (resp. to the task t), and \(\mathtt {d} :\mathbb {R}^n\times \mathbb {R}^n\rightarrow \mathbb {R}\) be a classical distance function on real-valued vectors (e.g., Cosine). Then, \(\mathtt {CWF}:\mathcal {T}\times \widetilde{\mathcal {P}}\rightarrow \mathbb {R}\) is defined as follows:

$$\begin{aligned} \mathtt {CWF} (t, \widetilde{p}) = \sum _{\forall i} i \times \mathtt {d} (u_i, v_i) \end{aligned}$$
(4)
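One possible reading of CWF on the toy taxonomy above (reusing PARENT, SKILLS, depth and D_MAX from the previous sketches) is given below; including the leaf level, scoring a leaf by its own bit, and the cosine convention for all-zero score vectors are all our assumptions.

from math import sqrt

def ancestors_of(s):
    """Strict ancestors of a node, bottom-up."""
    out, node = [], PARENT[s]
    while node is not None:
        out.append(node)
        node = PARENT[node]
    return out

def leaves_below(node):
    """Leaf skills in the subtree rooted at node (the node itself if a leaf)."""
    return [s for s in SKILLS if node in ancestors_of(s) or s == node]

def level_scores(vec, level):
    """Scores (mean of descendant leaf bits) of all nodes at the given depth."""
    bits = dict(zip(SKILLS, vec))
    nodes = sorted(n for n in PARENT if depth(n) == level)
    return [sum(bits[l] for l in leaves_below(n)) / len(leaves_below(n)) for n in nodes]

def cosine_distance(u, v):
    """Cosine distance between two real-valued score vectors."""
    nu, nv = sqrt(sum(x * x for x in u)), sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 1.0  # convention when one score vector is all zeros
    return 1.0 - sum(x * y for x, y in zip(u, v)) / (nu * nv)

def cwf(task, profile):
    """Climbing weight function: level-weighted distance between score vectors."""
    return sum(
        level * cosine_distance(level_scores(task, level), level_scores(profile, level))
        for level in range(1, D_MAX + 1)
    )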

Touring Weight Function. The Touring weight function (TWF for short) leverages the horizontal relationship given by the taxonomy, i.e., the proximity between neighbouring skills. As described in Definition 5, it returns the average path-length, according to the taxonomy \(\mathcal {S}_T\), between the skills required by the task and the skills of the worker's profile. The expected variance reduction comes from averaging the path-lengths over the full Cartesian product of the skills required by a task and the skills set to 1 in a perturbed profile. The reduction depends on the taxonomy (similarly to Climbing) and on the number of skills averaged.

Definition 5

( \(\mathtt {Touring}\) Weight Function \(\mathtt {(TWF)}\) ). Let \(\leadsto \) denote the path-length operator between two skills in the taxonomy \(\mathcal {S}_T\). Then, \(\mathtt {TWF}:\mathcal {T}\times \widetilde{\mathcal {P}}\rightarrow \mathbb {R}\) is defined as follows:

$$\begin{aligned} \mathtt {TWF} (t, \widetilde{p}) = \frac{ \sum _{\forall i} \sum _{\forall j} (t[i] \wedge \widetilde{p}[j]) \times ( s_i \leadsto s_j ) }{ \sum _{\forall i} \widetilde{p}[i] \times \sum _{\forall i} t[i] } \end{aligned}$$
(5)
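TWF can be sketched as follows, reusing the depth and lca helpers above; the handling of a task or profile with no skill set to 1 is our own convention.

def path_length(a, b):
    """Number of taxonomy edges between two skills, going through their LCA."""
    return depth(a) + depth(b) - 2 * depth(lca(a, b))

def twf(task, profile):
    """Touring weight function: average path length between the skills required
    by the task and the skills set to 1 in the (perturbed) profile (Definition 5)."""
    t_skills = [s for s, b in zip(SKILLS, task) if b]
    p_skills = [s for s, b in zip(SKILLS, profile) if b]
    if not t_skills or not p_skills:
        return float("inf")  # our convention: no usable skill information
    total = sum(path_length(si, sj) for si in t_skills for sj in p_skills)
    return total / (len(t_skills) * len(p_skills))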

3.3 Post-assignment Phase

Workers need a secure way to fetch their own assignment. This can be solved easily by well-known technical means. For example, the platform could post the assignments on the Web (e.g., on a dedicated webpage for each perturbed profile) so that each worker would then access it through a secure web browser (e.g., TOR).

4 Conclusion

We have overviewed in this paper a lightweight privacy-preserving approach to the problem of assigning tasks to workers. Our approach allows each worker to perturb her skill profile locally in order to satisfy the stringent differential privacy model without any additional communication or computation cost. We have proposed novel weight functions that can easily be plugged into traditional centralized assignment algorithms, and that cope with the differentially-private perturbation by leveraging a skill taxonomy. Additionally, promising preliminary results of experiments performed over a synthetic taxonomy and synthetic datasets are presented in the full version of the paper [1]. Future work includes consolidating the experiments (e.g., more profiles and tasks, alternative quality measures), using the taxonomy within the Flip mechanism, collecting a large-scale skill dataset, and further exploring the performance/quality tradeoff by designing other profile perturbation strategies (e.g., collaborative perturbation protocols designed to minimize the perturbation).