Key Points

Wider deployment of machine learning in pharmacovigilance requires further algorithmic evaluations, and appropriate contemporaneous test sets are lacking.

Crowdsourcing has become a frequently leveraged approach to a wide range of challenges; we present its application to the review of public domain data of potential use to drug safety.

We evaluated the crowdsourced approach and showed it to be a scalable, rapid, and effective approach for developing annotated social media data.

1 Introduction

The volume of individual case safety reports (ICSRs) seems to be ever increasing [1]. While data from spontaneous reporting systems undoubtedly remain the cornerstone of post-marketing safety surveillance, there is a lack of evidence that the increasing volumes of such ICSRs result in an increased ability to find safety signals [2].

With technological advances across many sectors, we are increasingly seeing automation introduced into routine processes. This includes, for example, the many transactional activities that occur in the financial services sector, such as invoice processing [3]. The ability to automate elements of ICSR intake is of paramount importance given the ever-increasing pharmacovigilance (PV) ICSR burden [4]. In addition to more efficient handling of reports, potential benefits include a more consistent approach to the handling of reports with a transparent audit trail. Automation offers the potential to allow PV medical expertise to be focused on activity where clinical input is critical and therefore far more likely to impact the risk-benefit balance.

Machine learning (ML) approaches are an appealing solution to many of the challenges of automation. The ability to ‘learn’ rules from training data without explicit a priori rules is of particular use with unstructured data, such as ICSRs, where the number of combinations and variants of data entry makes a few simple decision rules challenging to define. Machine learning algorithms ‘learn’ from training data; results are then produced on a validation set, and generalized performance is demonstrated through application to a separate ‘test’ set. Reliable annotated training/test data of sufficient volume and generalizability are therefore essential for ML. Continual feedback through external annotation of the data classification is required to enable dynamic improvements and adaptations as unstructured data change over time. This is of particular importance as PV data are constantly changing (often in a non-random manner, as therapeutic advances occur, healthcare system changes are introduced, or reporting patterns shift, e.g., the Weber effect); thus, the need to continuously update algorithms to ensure contemporaneous automation is of paramount importance.

As automation using ML and artificial intelligence is explored in PV, a challenge is how best to train the algorithm [5]. This is time consuming as it requires manual clinical review and, given that training and test data sets themselves need to be updated over time, is far from trivial. While ICSRs are held in public domain repositories, the narratives are not normally available, at least not in full detail. Sharing of full ICSR data, including clinical narratives, is challenging given the difficulties in effective anonymization of free-text narratives at scale. Other public domain sources with potentially relevant narratives include social media data. Comfort et al. [6], in developing an automated classifier to identify valid ICSRs from a set of digital social media data, estimated that manual human review of the unannotated social media posts would have required 44,000 work hours. Abatemarco et al. [7] built an annotated corpus of 14,000 ICSRs (of an originally planned 20,000), and were explicit about the challenges in the time, effort, and cost of producing high-quality data for training and testing automation techniques. While organizations may potentially share training data, and/or develop simulated data for some training needs, developing scalable approaches to the creation of accurate training or test data sets is a current major limitation in PV and will be of great importance in facilitating the use of automation in PV. As PV, therapies, and healthcare evolve, there will be a need for ongoing development of new training/validation/test data to ensure ML algorithms for automation remain trained and appropriate for ongoing routine use; thus, the capability to rapidly produce good new training data sets will be ever more important.

To develop automated approaches, it is necessary to start with training data sets relevant to PV. We examined one such approach, crowdsourcing, as a potential solution for transforming a large amount of unstructured data (using discussions of adverse events on social media as an exemplar) into structured, accurately classified data that can subsequently be used to develop scalable, contemporaneous algorithms to support automation.

Crowdsourcing is defined as the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, especially from the online community, rather than from traditional employees or suppliers [8]. The use of crowdsourcing for health-related activities and across the medicine development lifecycle is still an emerging area, but preliminary results have been promising [9,10,11]. For example, the study by MacLean et al. concluded that crowd-labeled data provide a scalable and effective approach for automatically identifying medical terms in patient-authored text [12]. There has only been limited exploration of crowdsourcing in PV [13]. This study investigated whether crowdsourcing applied to social media data could be an effective tool for developing training data sets that would ultimately be used for PV automation algorithms.

2 Methods

2.1 Reference Dataset

Our team began evaluating the usefulness of social media posts for PV by examining commercially available, anonymized Facebook and Twitter posts from Epidemico that contained mentions of 15 products of interest over a 1-year period (1 September, 2013–31 August, 2014). The 15 products were chosen as a diverse representation of prescription and over-the-counter drugs. Initially, over 212,000 posts were identified for inclusion in the study. We looked for methods to reduce the volume of these data so that they could be reviewed by our curation team (experienced safety physicians and scientists) in a reasonable amount of time. Language-detection algorithms were applied to retain only the posts written in English. In addition, steps were taken to remove duplicate posts and to reduce potential noise due to spam or advertising. These methods reduced the dataset to 81,300 posts, which were still too numerous to review within the timeline of the project. Approximately 19% of posts (15,490 posts) were selected by random sampling to be read and reviewed by our curation team.

Prior to reviewing the data, 22 medical topics relevant to PV were identified, to test whether these types of topics could be mined from social media data. The wide variety of topics included the type of person who was posting (e.g., patient, family member, healthcare provider, or friend), socio-economic indicators (e.g., occupation, level of education), and contributing health factors (e.g., pregnancy, smoking, or alcohol use). Additional items of interest included more complex topics such as drug indication/dose, adverse events, and efficacy. These 22 topics were evaluated across the set of all 15 drug products.

The curation process was completed using a software tool called Insight Explorer, which had been built by researchers at GSK to help manage the curation process [14]. The team comprised 13 curators who worked over a 5-month period (November 2014 to March 2015) to create our reference dataset. The software tracked the total time spent reviewing the 15,490 posts, which came to 333 hours, averaging ~ 1.3 min per post.

2.2 Experimental Design

Crowdsourcing is the process of dividing a large project or task into small assignments and outsourcing them to a large number of people (i.e., a crowd), typically using the Internet [14]. Amazon’s “Mechanical Turk” (MTurk) is one of the more popular crowdsourcing platforms and was the one chosen for this study [15]. Mechanical Turk has been extensively used for research in other fields including social science for several years [16,17,18,19]. The workers, whose identities are hidden from the requestor, are known as “Turkers.” They complete assignments known as Human Intelligence Tasks (HITs). A HIT is a task that cannot easily be completed by a computer, requiring a real person to read, understand, and respond to one or more questions [20].

For this scientific methodological study, the same 22 medical concepts (see Table 1 in Section 3.1 for the list) being analyzed by the curation team were also given to the Turkers, using instructions written at a level that could be easily understood. The project was divided into two phases. The first phase tested the impact of price on the accuracy and efficiency of Turker reviews, when compared to the experts. This was done by (1) having a post reviewed by either one or multiple Turkers and (2) varying the price paid per post reviewed. In phase II, the questionnaire, reward per HIT, and number of reviewers per post were adjusted based on the results of phase I. Additionally, the accuracy and time efficiency of the Turkers were evaluated on a larger, more representative scale by increasing the number of posts reviewed from 500 to 5000.

The MTurk system allowed us to create a new project with relative ease. For our study, we developed a single survey form to use as our template, which contained the medical concepts to be evaluated by the Turkers (see Fig. 1). The MTurk system generated unique HITs based on this template after we uploaded the dataset containing our randomly sampled, anonymized social media posts. Once all HITs were completed, MTurk handled the collation of results for further analysis.

Fig. 1

Amazon’s “Mechanical Turk” (MTurk) user interface. A Turker interface was designed to mimic the social media curation interface, Insight Explorer, which was used by the expert safety scientists in the creation of the gold standard data set [11]. For each Human Intelligence Task (HIT), a single post would appear in the box labeled “Post” followed by overall instructions and detailed directions on how to answer each of the 22 questions corresponding to the 22 medical topics of interest

MTurk allows the requester to specify how many times each HIT should be reviewed, making it simple to test whether multiple reviews add value to the process. Additional criteria specifying which Turkers can complete a HIT may be enforced. For this study, we selected workers who reside in the USA to increase the likelihood of English language proficiency.
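For illustration, the sketch below shows how a comparable batch of HITs could be published programmatically through the MTurk API using the Python boto3 client; the study itself used the MTurk requester web interface, and the title, reward, durations, and question XML shown here are illustrative assumptions rather than the study's actual settings.

```python
# Illustrative sketch only: the study used the MTurk requester web interface.
# Reward, durations, and question XML below are assumed values.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Built-in Worker_Locale qualification, restricting work to US-based Turkers.
US_LOCALE_QUALIFICATION = [{
    "QualificationTypeId": "00000000000000000071",
    "Comparator": "EqualTo",
    "LocaleValues": [{"Country": "US"}],
}]

def publish_post(question_xml: str, max_assignments: int, reward_usd: str) -> str:
    """Create one HIT (one social media post plus the 22 questions) and return its ID."""
    response = mturk.create_hit(
        Title="Review a social media post for medical topics",
        Description="Read a short post and answer 22 questions about its content.",
        Question=question_xml,            # QuestionForm/HTMLQuestion XML built from the survey template
        Reward=reward_usd,                # passed as a string, e.g., "0.10"; varied by batch in phase I
        MaxAssignments=max_assignments,   # three reviewers per post in phase I, one in phase II
        AssignmentDurationInSeconds=600,
        LifetimeInSeconds=7 * 24 * 3600,
        QualificationRequirements=US_LOCALE_QUALIFICATION,
    )
    return response["HIT"]["HITId"]
```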

The Turkers were presented with an interface that displayed the social media post content to be reviewed and the 22 questions to identify the medical topics (such as, “Is more than one medication mentioned within the post?” or “Is illicit drug use mentioned within the post?”) that were the same as those identified by the GSK curation team (see Fig. 1). All but one of the attributes (poster type) were modified to binary outcomes (Yes/No) for the Turkers to answer. This reduced the need for Turkers to have extensive training in pharmacology and related medical disciplines and made comparison to the reference dataset more straightforward.

2.3 Phase I

In phase I of the study, 500 posts were randomly sampled from the reference dataset and uploaded to MTurk. This phase tested the influence of the reward level per HIT on Turker efficiency and accuracy. Additionally, this phase was used to understand whether there was a difference in accuracy between (1) having multiple Turkers review a single post and aggregating their results through a voting system and (2) using only a single Turker to review a post. The sample data to be reviewed were published at various reward levels per HIT.

To test our voting system method, each post was reviewed by three unique Turkers in this phase. The quality of completed HITs was reviewed to ensure that the Turker had not left any questions blank and had spent an adequate amount of time reviewing the post (> 10 s). Tasks that failed this quality test were rejected and republished until successfully completed.
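As a minimal sketch of this quality check, assuming each completed assignment is available as a simple record of its answers and the time spent on it (the field names are illustrative, not the MTurk schema):

```python
MIN_REVIEW_SECONDS = 10  # minimum plausible time to read a post and answer the questions

def passes_quality_check(assignment: dict) -> bool:
    """True if no question was left blank and the Turker spent more than 10 s on the post."""
    answers = assignment["answers"]  # mapping of question id -> response
    no_blanks = all(value not in (None, "") for value in answers.values())
    enough_time = assignment["work_time_seconds"] > MIN_REVIEW_SECONDS
    return no_blanks and enough_time

def triage(assignments: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split completed assignments into accepted and rejected; rejected HITs were republished."""
    accepted = [a for a in assignments if passes_quality_check(a)]
    rejected = [a for a in assignments if not passes_quality_check(a)]
    return accepted, rejected
```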

2.4 Phase II

In phase II of the study, 5000 posts were randomly sampled from the reference dataset and uploaded to MTurk. Phase I results, which are discussed in detail later, showed that neither the reward per assignment nor having a post reviewed by multiple Turkers had a statistically significant effect on overall agreement with the reference dataset. In phase I, two questions (regarding adverse events and poster type) had an agreement rate below 80%. For phase II, the wording of those attributes was revised to help Turkers better understand how to answer the questions appropriately; the specific changes were:

  • In phase I, a lack of effect was treated as a possible adverse event, but this was not specifically clarified for Turkers. In phase II, the wording was changed to state: “Lack of effect or an ineffective drug is also considered an adverse event”.

  • In phase I, poster type was defined as the relationship between the author and the person receiving the medication. However, in phase II, each of the seven poster types was specifically defined to aid the Turker.

In phase II, each social media post was reviewed by just one Turker. The reward level for phase II was raised, which had the added benefit of making our HITs attractive enough to be completed in a timely manner. As with phase I, all results were checked for quality assurance; those that failed were rejected and republished until successfully completed.

2.5 Data Analysis

In phase I, a majority voting algorithm was implemented to compare Turker results to the reference dataset. For each post, all three Turker responses were compared across all 22 attributes. If more than one Turker voted yes for a single attribute, the majority ruled and a “Yes” response was recorded for that question. Only the “poster type” attribute was computed differently because it was not a binary outcome; the default response was set to “Unknown” unless two or more Turkers agreed on a single classification, such as “HCP” or “Patient”. The consensus vote for each post was then compared to the reference dataset response for match and accuracy.
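The voting logic can be sketched as follows, assuming each post's three Turker responses are available as dictionaries keyed by attribute name (the attribute names shown are an illustrative subset, not the full questionnaire):

```python
from collections import Counter

# Illustrative subset of the 21 binary attributes evaluated per post.
BINARY_ATTRIBUTES = ["adverse_event", "multiple_medications", "illicit_drug_use"]

def aggregate_votes(responses: list[dict]) -> dict:
    """Collapse the three Turker responses for one post into a single consensus record."""
    consensus = {}
    for attribute in BINARY_ATTRIBUTES:
        yes_votes = sum(1 for r in responses if r[attribute] == "Yes")
        consensus[attribute] = "Yes" if yes_votes > 1 else "No"  # 2-of-3 majority

    # Poster type is multi-class: default to "Unknown" unless >= 2 Turkers agree.
    counts = Counter(r["poster_type"] for r in responses)
    top_type, top_count = counts.most_common(1)[0]
    consensus["poster_type"] = top_type if top_count >= 2 else "Unknown"
    return consensus
```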

A matrix of match indicators was created, with one column per attribute and entries populated with 1 for a match to the reference dataset and 0 otherwise. In phase I, there were 500 social media posts tested, each with 22 attributes, for a total of 11,000 entries in the matrix. A simple analysis was then performed to find the overall percentage match to the reference dataset. A similar process was followed for phase II, except that no consensus vote was required because each of the 5000 posts was reviewed only once. This resulted in a matrix of 110,000 entries, and the same analysis performed for phase I was computed against these results.

The total time for Turkers to complete the HITs was computed to the nearest hour by determining the duration between when the first and last HITs were completed. Finally, we computed the percentage of false-positive and false-negative errors for the phase II results.
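A minimal sketch of these computations, assuming the consensus (or single-Turker) responses and the reference annotations are aligned post by post; all names are illustrative:

```python
from datetime import datetime

def match_matrix(turker: list[dict], reference: list[dict], attributes: list[str]) -> list[list[int]]:
    """One row per post, one column per attribute; 1 = agreement with the reference, 0 otherwise."""
    return [[int(t[a] == r[a]) for a in attributes] for t, r in zip(turker, reference)]

def percent_match(matrix: list[list[int]]) -> float:
    """Overall percentage match (e.g., over 11,000 cells in phase I, 110,000 in phase II)."""
    cells = [cell for row in matrix for cell in row]
    return 100.0 * sum(cells) / len(cells)

def error_rates(turker: list[dict], reference: list[dict], attribute: str) -> tuple[float, float]:
    """False-positive and false-negative percentages for one binary attribute."""
    fp = sum(1 for t, r in zip(turker, reference) if t[attribute] == "Yes" and r[attribute] == "No")
    fn = sum(1 for t, r in zip(turker, reference) if t[attribute] == "No" and r[attribute] == "Yes")
    n = len(reference)
    return 100.0 * fp / n, 100.0 * fn / n

def elapsed_hours(completion_times: list[datetime]) -> int:
    """Duration, to the nearest hour, between the first and last completed HIT."""
    span = max(completion_times) - min(completion_times)
    return round(span.total_seconds() / 3600)
```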

3 Results

3.1 Phase I Results

Accuracy for the phase I batches (1500 post reviews across three batches) was computed to be 92.8% (range 92.6–92.9%), remaining steady across varying reward amounts. Turkers struggled the most with correctly identifying (1) whether there was an adverse event (82–83% accuracy) and (2) the poster type (71–73% accuracy). This decrease in accuracy is likely owing to the greater complexity of these concepts; for example, the average person may not realize that “lack of effect” is considered an adverse event for regulatory reporting purposes. Further details and a concept-by-concept breakdown of the results from phase I can be found in Table 1. The time to complete the 1500 reviews (no difference seen across the batches, data not shown) was approximately 441 h.

Table 1 Phase I summary statistics: agreement to a reference data set

3.2 Phase II Results

Phase II attempted to measure overall accuracy when using a larger sample size of posts (5000) that would be more representative of a real-world scenario, and also measured the duration to complete the HIT assignments at a higher reward level. It also tested whether improving the wording of the two questions about adverse events and poster type would help increase the accuracy of matching those attributes against the reference dataset.

In phase II, the Turkers achieved an overall match of 91.8% compared to the reference dataset (similar to the 92.8% overall accuracy results in phase I). In addition, all assigned HITs were completed and accepted within 33 hours. This was a dramatic reduction in time compared to phase I.

We can extrapolate that the time to complete all 15,490 posts would be less than a week, or about 5% of the total amount of time it took the experts to create the reference dataset. Further details of the results from phase II can be found in Table 2. Furthermore, to test that the classification was not biased toward either a yes or no answer, the percentages of false positives and false negatives were computed for all the binary variables (excluding “poster type”) and are shown in Table 3.
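As a rough check of the extrapolated review time, assuming throughput scales linearly with batch size: 33 h × (15,490 / 5,000) ≈ 102 h, or roughly 4–5 days of elapsed crowdsourced review.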

Table 2 Phase II summary statistics
Table 3 False positives and false negatives: phase II

4 Discussion

To test the accuracy and efficiency of crowdsourcing for rapidly and effectively developing a PV training set through the review and annotation of social media data, crowdsourced curation using MTurk was compared to a reference dataset of posts reviewed by a trained curation team (experienced safety physicians and scientists). A random sample of 5000 posts was uploaded to MTurk, where Turkers were asked to identify the same 22 attributes the curation team had been asked to detect. Curation of these 5000 posts was completed in 33 hours and had an overall agreement with the reference dataset of 91.8%.

4.1 Limitations

Although our results show the viability of using crowdsourcing for the review of social media posts, there are several potential limitations. First, social media data were used for this study; generalizability to other data sources is unknown. For this analysis, we included anonymized Facebook and Twitter social media data provided by an external vendor, Epidemico, which has conducted research into the use of social media for drug safety for many years, including various regulatory collaborations with the US Food and Drug Administration and participation in the public-private Innovative Medicines Initiative-funded project on social media, WEBRADR [21,22,23,24,25]. The landscape of available social media data is ever changing [26, 27]: not only will new data sources emerge, but some existing data sources are likely to change their terms of use over time and/or make substantive changes to their data platforms [28, 29]. The method outlined in this paper will likely be generalizable to many sources of consumer data of relevance or potential relevance to PV, and perhaps beyond, for example, to electronic medical records, particularly if crowdsourcing participation were limited to individuals with a healthcare/medical qualification. However, one needs to understand the terms of use for each data source as well as carefully evaluate data characteristics (e.g., consumer vs HCP data) and the context of use (e.g., drug safety) when deciding on the appropriateness of crowdsourced training data sets, to ensure appropriate legal, ethical, and scientific use of the data. It should be noted that our project underwent extensive internal reviews by our safety, privacy, and legal groups to ensure appropriate use, governance, and oversight.

Additionally, only basic medical/safety insights were evaluated (adverse events, medication usage, dose); thus, the ability to identify more complicated insights (causality assessment, drug interactions) has yet to be determined. However, most social media posts come from non-medically trained individuals, so the Turkers are likely to be somewhat representative of the people who post on social media. The generalizability of our approach beyond general social media and simple medical/safety insights requires further investigation. Additional “fine tuning” of crowdsourced training data sets by domain experts may be required for some activities (e.g., clarifying that a lack of effect should be considered an adverse event). However, crowdsourcing the initial training data set with subsequent refinement by domain experts should still take significantly less time than having domain experts create the entire data set from scratch.

Another limitation is some opaqueness in how the Turker review was conducted. There is no way to know who the individual performing the work is, whether they correctly entered their demographic information, or whether they are actually putting thought into answering the questions rather than simply clicking buttons for monetary reward. One safeguard with crowdsourcing is the benefit of large numbers of reviewers, which should minimize these risks. As a result, one should consider crowdsourcing a large data set as well as placing a limit on the number of posts a single person can review (for our pilot, we capped the maximum amount a single person could earn in an effort to minimize single-source bias/confounding). There may also be value, as a routine activity to ensure trust in the outputs, in having two or more people annotate the same post. One can also assess overall performance by comparing the verbatim free-text post to the Turker review output. One should nevertheless be aware of the trade-off between speed and cost and some unavoidable uncertainty with respect to the accuracy of the work.

The demographics of the Turkers are not well understood and are likely to change non-randomly over time. Some studies have shown that most Turkers are based in the USA, but their precise geographic location, age, and educational background are unknown [15]. As with other convenience samples, such data can be very useful, but one needs to consider generalizability and the use of the outputs carefully [30]. This limitation was mitigated to an extent by the use of a well-established, extensively studied crowdsourcing platform that is likely to be more stable to changes over time than newer, smaller, or less well-known platforms; additionally, there is extensive literature investigating the generalizability of this particular platform, for example [17, 31]. Nevertheless, one must be cognizant that cultural differences may introduce unanticipated bias into the work and could potentially limit the generalizability of the results, although it is not clear to what extent this would be problematic when the outputs are used as training sets for developing ML solutions.

Accuracy of annotation and minimizing the gaming that might occur with crowdsourcing are of paramount importance to ensuring the best training data sets possible [32]. Our study demonstrated that having each post reviewed by three individuals (voting system) in phase I yielded similar accuracy to having each post reviewed by a single individual in phase II (92.9% vs 91.8%). Some studies have shown similar results, whereas others have shown greater variability in overall accuracy and differences between group and individual reviews, such as Good et al., who found that the greatest improvement in performance came from going from two to three annotators per post [33,34,35,36,37,38,39]. There are some inherent limitations to having posts reviewed by a single individual. We attempted to reduce single-source bias in our study by limiting the number of posts a single individual could review; nevertheless, the distribution of work showed that most individuals reviewed a very small number of posts, whereas a small group of individuals reviewed a disproportionately larger number of posts. As a result, some bias may still have been introduced into our results. Activities that may improve accuracy and minimize bias include, but are not limited to, using ‘Master Turkers’ (Turkers with significant positive feedback), limiting the number of posts that can be reviewed by a single person, using qualification tests to help with diversity, and using various techniques to help prevent/detect potential problems (e.g., wording questions in a manner that prevents simply clicking on the default answer, identifying outliers with respect to time to complete the task).

4.2 Moral, Ethical, and Legal Considerations

The Amazon MTurk platform was used specifically for this scientific methodological study, and no promotional activity was conducted as part of the engagement. It should be noted that ethical questions have arisen around the use of crowdsourcing platforms in general, and Amazon MTurk specifically, since this study was initiated; see, for example, [40, 41]. We are committed to conducting studies to the highest ethical standards. Therefore, for any future studies looking at this issue, further assessment of differential ethical approaches across crowdsourcing platforms is warranted.

In addition to these limitations, there are also a number of other ethical considerations. Publicly available, de-identified posts from existing social listening activities were used; however, posting these on a public site like mturk.com may draw undue attention to them. In theory, a Turker could copy a post verbatim and search for it on the Internet in an attempt to identify its original source. If successful, the Turker might then have access to the original author's username and could potentially even respond to the post. If the approach herein were pursued further, additional anonymization of the posts might be considered.

A final consideration is setting the rate of pay. The perception of creating an Internet “sweatshop” is a real risk with any funded crowdsourcing approach [42]. Our approach to mitigating this risk was to benchmark pay in the much smaller phase I trial based on an initial set of assumptions and then adjust pay for the larger phase II trial based on empirical results to ensure an appropriate level of compensation. Additionally, we limited the maximum amount a single individual could earn in aggregate. However, it should be noted that a study by Cohen et al. showed that increasing the rate of pay for each completed task did not lead to a linear increase in total pay, as the more they paid per activity, the longer the Turkers took to complete the task [43]. Clearly, foundational methodological research for drug safety aimed at developing test sets that could be shared across many organizations would seem less of a concern than potential commercial applications, provided its purpose is made clear. Using established, widely known, and well-tested crowdsourcing applications with appropriate processes and safeguards in place clearly reduces the risk. As more experience is gained with a specific application, for example, regarding the time needed to conduct certain tasks appropriately, further safeguards can be included to ensure that participants are remunerated appropriately. Although crowdsource workers may not feel that they are being taken advantage of, ensuring participants are appropriately recompensed in all future applications is critical [44, 45].

Finally, there is very little guidance on the use of technologies such as crowdsourcing. In the meantime, the best approach is for stakeholders to share lessons learned and begin to develop industry-wide best practices on the use of crowdsourcing.

5 Conclusions

Crowdsourcing is an accurate and efficient method that can be used to develop training data sets in support of PV automation. More research is needed to better understand the breadth and depth of possible uses as well as the strengths, limitations, and generalizability of results.