An Examination of the Convergence of Online Panel Data and Conventionally Sourced Data

“I have recommended reject on every paper I’ve reviewed using this technique. I hope that it is a passing fad, because it is already hurting the integrity of our journals and quality of our science.” –Review Board Member

“This is a great survey tool! I look forward to seeing more papers using such a survey technique.” –Review Board Member

We live in turbulent times for survey research methods. Social scientists in general, and survey researchers in the areas of applied psychology in particular, are finding it more difficult to access high-quality survey data. In response, applied psychology researchers have increasingly turned to commercial firms that recruit pools of potential respondents to participate in survey and opinion research, usually for compensation. Because recruitment and access to subjects are largely conducted through the internet, data provided by companies such as MTurk, StudyResponse, and Qualtrics have come to be known as online panel data (OPD). OPD services typically recruit a large pool of respondents who agree in advance to participate in survey studies on a variety of different topics. Essentially, anyone with internet access can volunteer to become a panel member or “opt in” and can choose to participate in a given task or not. Many online panels provide payment for participation in the form of cash incentives, gift cards, or charitable contributions, sometimes as little as $0.25 for a short survey. However, questions exist about the suitability of OPD for applied psychology research.

Researchers have used OPD in a range of fields since the 1990s (Postoaca, 2006). Goodman and Paolacci (2017) note that 43% of the behavioral studies published in the Journal of Consumer Research from June 2015–April 2016 were conducted on MTurk. As such, much of the research regarding the reliability of OPD comes from the consumer research field (e.g., Goodman & Paolacci, 2017; Sharpe Wessling, Huber, & Netzer, 2017). The adoption of OPD in applied psychology, although less pervasive, has grown considerably in the last 5 years. To demonstrate this point, we manually reviewed the last 10 years of six highly cited applied psychology journals (i.e., Academy of Management Journal, Journal of Applied Psychology, Journal of Management, Journal of Organizational Behavior, Organizational Behavior and Human Decision Processes, and Personnel Psychology). We found only 31 samples that used OPD in the 5 years from 2006 through 2010, but 307 samples in the 5 years from 2011 through 2015, an almost tenfold increase. Although we can glean some insight from the consumer research studies, it is important to consider the suitability of OPD for empirical studies explicitly in applied psychology.

Two main concerns with OPD revolve around the measurement properties of OPD and the characteristics of OPD samples (Landers & Behrend, 2015; Paolacci, Chandler, & Ipeirotis, 2010). Regarding measurement properties, the key question is the extent to which OPD respondents provide data that are reliable and meaningful. Regarding characteristics of OPD samples, the key question is how different OPD respondents are from “typical” respondents. A number of studies have examined demographic and employment characteristics of OPD samples relative to other, more traditional sampling techniques, such as student or organizational samples (Behrend, Sharek, Meade, & Wiebe, 2011; Paolacci et al., 2010; Gosling, Vazire, Srivastava, & John, 2004; Sprouse, 2011). However, this approach has both empirical and conceptual limitations. Demographic comparisons do not address the extent to which relationships among constructs in OPD samples differ from those in conventional applied psychology samples (Shadish, Cook, & Campbell, 2002). We attempt to address this question of generalizability by comparing relations among constructs based on OPD with established population estimates of these same construct relationships.

The Current Study

The purpose of our study is to examine evidence regarding the extent to which online panel samples produce psychometrically sound and criterion-valid research results in the field of applied psychology. The strategy we adopt is to identify a set of frequently examined relations in studies using OPD, including such independent variables as leadership, personality, and affect and their relationship with outcome variables including job satisfaction, organizational commitment, organizational citizenship, and counterproductive work behavior. We then conduct a set of meta-analyses on published and unpublished studies in the field of applied psychology that have used OPD and compare the scale reliabilities and the effect size estimates from these studies with meta-analytic estimates already established in the existing literature. If the reliability and effect size estimates based on OPD studies fall within the credibility intervals provided by established meta-analyses (based on conventionally sourced data), we infer that OPD is not substantively biased relative to conventional samples currently in use. As described previously, others have used primary data to examine the demographic characteristics of OPD as a means of assessing external validity. This paper is the first to focus directly on the extent to which observed results using OPD are consistent with population estimates in the field. Our strategy, based on meta-analytic estimates, complements previous approaches that are based on primary data alone.

Theoretical Concerns with Online Panel Data

Landers and Behrend (2015) suggest reviewers often dismiss OPD as a sample source due to a variety of assumptions that remain largely untested and perhaps even unstated. Fortunately, several scholars have expressed their concerns with OPD explicitly and systematically in published form (Harms & DeSimone, 2015; McGonagle, 2015; Feitosa, Joseph, & Newman, 2015). Below, we review issues of external validity and internal consistency as they relate to OPD and develop the research questions of the study.

External Validity and Online Panel Data

Some scholars question the external validity of OPD because the variety of recruitment methods used results in a nonprobability respondent population (e.g., Harms & DeSimone, 2015). This means that the total pool of potential online respondents is not a representative sample of the US or world working population, the population to which most applied psychology researchers at least implicitly wish to generalize (Landers & Behrend, 2015). Indeed, evidence suggests that OPD samples are more diverse, younger, more educated, but more poorly paid than the general US population (Paolacci et al., 2010; Gosling et al., 2004; Sprouse, 2011) and, at the same time, more diverse, older, and more work experienced than a typical undergraduate research sample (Behrend et al., 2011). However, representative sampling or stratified random sampling is rarely used in applied behavioral science research, including applied psychology research (Fisher & Sandell, 2015; Shadish et al., 2002). Rather, samples of convenience are used, most often employees drawn from a single work organization. Such samples are unlikely to be representative of the entire US working population, much less the worldwide working population (Highhouse & Gillespie, 2009; Landers & Behrend, 2015). For example, Bergman and Jean (2016) showed that, in the aggregate, samples in top I–O journals over-represent salaried, managerial, professional, and executive employees and under-represent wage earners, low- and medium-skilled employees, first-line personnel, and contract workers, relative to the US and international labor pool. Do the lack of representative sampling techniques and the resulting non-representative samples mean that the vast majority of the survey research in the field of applied psychology lacks external validity? Not necessarily.

Methodologists have long argued that the importance of representative sampling depends on the purpose for which the research sample is drawn (Fisher, 1955; Highhouse & Gillespie, 2009; Gillespie, Gillespie, Brodke, & Balzer, 2016). For example, public opinion pollsters as well as consumer behavior researchers typically seek to generalize a sample statistic (e.g., the sample mean) to the larger population in order to predict the voting or buying behavior of that population. They typically rely on representative sampling because a non-representative sample will lead to an inaccurate point estimate of a given attitude or behavior in the general population. Applied psychologists, on the other hand, are typically interested in theoretical generalizability. Theoretical generalizability concerns the extent to which presumed causal relationships among constructs can be expected to hold across other times, settings, or people (Cook & Campbell, 1976; Sackett & Larson Jr, 1990; Shadish et al., 2002). Sackett and Larson Jr (1990) argue that reasonable sacrifices of representative sampling are justifiable if the primary question is whether the presumed causal relationship under investigation can occur and if the purpose of the study is to falsify a theory through null hypothesis significance testing, circumstances that are typical of the applied psychology field. According to Sackett and Larson Jr (1990), under these circumstances, the sole criterion for selecting a setting and sample is that the sample be a relevant sub-group of the general population to which one wishes to generalize.

The logic of theoretical generalizability thus justifies the use of convenience samples for specific scientific purposes even when they do not strictly represent the population to which one wishes to generalize, so long as they may reasonably be seen as a sub-population of the larger population (Sackett & Larson Jr, 1990; Shadish et al., 2002). Several scholars have in fact argued that OPD are more generalizable than typical organizational samples precisely because they are more diverse and because demographic and other characteristics can be screened for in advance to compose samples with the desired characteristics (Bergman & Jean, 2016; Landers & Behrend, 2015). However, some scholars suggest that OPD samples are so different they essentially do not form a sub-group of the population to which the researcher wishes to generalize. Demographic and other characteristics are self-reported and respondents may have financial or other reasons to provide inaccurate information regarding, for example, their nationality or employment status (Feitosa et al., 2015; McGonagle, 2015). Although the typical organizational sample may not be representative of the working population or even of the entire organization from which it is drawn (Bergman & Jean, 2016; Landers & Behrend, 2015), at least the researcher has some confidence respondents are indeed employed workers at the organization (McGonagle, 2015).

These authors suggest that OPD samples differ from traditional samples of convenience on key demographic and employment characteristics and, further, that we can never know for certain how much they differ due to the potential for false reporting. However, as our review of the external validity literature suggests, the critical question is not whether samples of convenience differ from the general population. Rather, the question is whether these differences are substantial enough to have a systematic influence on the theoretical relationships of interest to the researcher (Highhouse & Gillespie, 2009; Gillespie et al., 2016; Sackett & Larson Jr, 1990). Fortunately, we can compare the effect size estimates produced by OPD samples with those produced by conventional data without knowing anything about the underlying characteristics of the samples. Therefore, failure to find substantive effect size differences suggests, indirectly, that either sample characteristics do not differ substantially across these two types of data sources or that they differ on characteristics that do not have a significant influence on the effect size estimates.

The strategy we use in this paper is based upon comparisons of cumulative results using meta-analysis rather than a single primary sample. We conduct an omnibus test for differences between OPD and conventionally sourced data, assessing overall differences in effect size resulting from all factors that might differ between the two types of data. If OPD samples do differ from traditional samples used in applied psychology to such an extent that they do not derive from the same general population, we should expect to find the effect size estimates based on studies using OPD to differ significantly from those using traditional organizational samples. If we find substantial differences in effect size estimates, generalization from OPD samples to the general working population will be unjustified without serious consideration of the way these characteristics moderate or limit OPD results. If, on the other hand, we fail to find substantive differences, the field can be more confident that, although OPD samples may be different in a variety of ways, they make up a sub-population of the full population to which we wish to generalize. We might then treat them as we would any other sample of convenience, as the source of tentative theoretical generalizations to the broad working population but with observed effects open to further exploration for moderation in different or less range restricted samples. This logic leads us to our first research question.

Research question 1: Do relationships among independent and dependent variables derived from online panel data differ from the same relationships found in conventionally sourced data?

Measurement Error and Online Panel Data

The second concern with OPD relates to measurement error. Measurement error occurs when individuals’ answers are not accurate or “true” (Dillman, Smyth, & Christian, 2014). One of the primary reasons measurement error may occur is that respondents pay little attention to survey items in anonymous or low-stakes responding situations. Huang, Curran, Keeney, Poposki, and DeShon (2012) have defined insufficient effort responding (IER) as a response set in which participants answer survey questions with little motivation to comply with survey instructions, correctly interpret item content, or provide accurate responses. The effects of such careless responding have generally been assumed to be the introduction of more random measurement error and thus weaker observed relationships with criterion variables (Schmidt & Hunter, 2014; Nunnally, 1978). However, patterned responding (e.g., pick 4 for all questions) may inflate internal reliability if scale items are grouped together and no reverse-scored items are used (Huang et al., 2012) or may inflate observed correlations when the IER response set biases means in the same direction across multiple variables (Huang, Liu, & Bowling, 2015). Researchers have suggested a number of techniques for detecting IER, such as response time, extreme infrequency or bogus items, and psychometric antonyms (Huang et al., 2012; Meade & Craig, 2012).
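To make the detection step concrete, the following R sketch shows how a researcher might flag potential IER using two of these indicators, response time and a bogus item. It is illustrative only and is not drawn from any of the cited studies; the column names and cutoffs are hypothetical.

```r
# Illustrative IER screening sketch (hypothetical data, column names, and cutoffs)
raw <- data.frame(
  id           = 1:6,
  duration_sec = c(480, 95, 610, 70, 520, 450),  # total survey completion time in seconds
  bogus_item   = c(1, 4, 1, 5, 1, 2)             # e.g., "I have never used a computer" (1 = strongly disagree, 5 = strongly agree)
)

too_fast     <- raw$duration_sec < 120           # assumed minimum plausible completion time
failed_bogus <- raw$bogus_item >= 4              # agreement with a statement that should be false for everyone
raw$ier_flag <- too_fast | failed_bogus

# Flagged respondents would be reviewed or excluded before analysis
raw[raw$ier_flag, ]
```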

A number of scholars have suggested OPD may be more prone to IER because respondents have a primarily monetary motivation for responding (McGonagle, 2015). Further, “professional” panel members, that is, members who participate in many surveys or belong to more than one panel, might maximize their income by speeding through surveys with little attention to the accuracy of their responses (Baker et al., 2010; Smith & Hofma Brown, 2006; Sparrow, 2007). Some research has examined the motivation of OPD responders and found that compensation is indeed a primary motivation of survey participation, but interest in the topic, self-insight, and altruism are also important motivators (Behrend et al., 2011; Brüggen, Wetzels, de Ruyter, & Schillewaert, 2011; Paolacci et al., 2010). Evidence linking frequent participation in surveys to IER is also weak. For example, Hillygus, Jackson, and Young (2014) showed that experienced survey takers complete surveys more quickly, but there was no relationship between participation frequency and poor responding. In fact, Hillygus et al. (2014) found less bias in the frequent responders than in the infrequent survey responders in the YouGov panel sample they examined relative to population benchmarks.

Other scholars have used detection techniques to directly examine IER in OPD sources. While evidence for IER is present, it is not clear that IER is more prevalent in OPD than in other types of samples. For example, Harms and DeSimone (2015) report that 9.5% of their sample responded incorrectly to bogus items inserted in their survey and as much as 35% of their MTurk sample provided extreme outlier response patterns. However, Ran, Liu, Marchiondo, and Huang (2015) reported that infrequent item response rates, ranging from 2.5 to 11.2% across four MTurk-based datasets, were similar to the rates found in four of their student samples. Ran et al. (2015) concluded that OPD and student samples were equally prone to IER. Likewise, Fleischer, Mead, and Huang (2015) found that 15–20% of OPD respondents were identified as inattentive, rates only somewhat higher than those of student samples (Meade & Craig, 2012). Fleischer et al. (2015) suggested that features of some online panel sources, such as MTurk’s respondent quality ratings function, may render OPD less prone to IER than traditional samples if used properly.

Finally, researchers have directly examined the quality of OPD based on psychometric properties. These scholars typically conclude OPD is at least as high-quality as student and field samples. For example, Buhrmester, Kwang, and Gosling (2011) found Cronbach’s alpha and 3-week test–retest reliability of OPD to be good to excellent. Likewise, Behrend et al. (2011) found slightly higher internal consistency estimates in the OPD than in the student sample they examined. Behrend et al. (2011) also used item response theory analyses (Meade, 2010) and found minimal difference in the response characteristics of the OPD and student samples. Feitosa et al. (2015) assessed measurement equivalence (Vandenberg & Lance, 2000) of a measure of Big Five personality on an OPD (MTurk) sample, a student sample, and an organizational sample. They used the default settings for MTurk survey data collection, which includes workers with a 95% approval rate but no specified geographic origin. They found a lack of measurement equivalence with the student and organizational samples when using the whole MTurk sample. However, they found both configural invariance (i.e., the same pattern of factor loadings across samples) and metric invariance (i.e., factor loadings constrained to be equal across samples) when IP addresses were used to eliminate probable non-native English-speaking subjects from the MTurk sample. They conclude that OPD demonstrates measurement equivalence when data is collected from countries where English is the native language.

Thus, while a number of questions have been raised about OPD, previous empirical research suggests that the psychometric properties of OPD are not significantly worse than those of other sample sources. Each of the studies reviewed above is based on the analysis of primary data. Although meta-analytic data cannot be used to conduct item-level data quality analyses, it can be used to assess scale-level indicators of the psychometric quality of OPD, such as reliability. Use of meta-analytic techniques complements the work done with primary data because it allows us to draw more general conclusions about OPD. We therefore compare meta-analytically derived reliabilities based on OPD and traditional data sources in the literature. If the psychometric properties differ, we can conclude that OPD has more measurement error than traditional samples and researchers should give serious consideration to the use of IER detection techniques with such data. If, however, differences do not emerge, we may conclude that OPD and traditional samples have similar internal reliabilities.

Research question 2: Do the internal reliability estimates of samples using online panel sources differ from those of conventionally sourced data?

Methods

Identification of Studies

Our meta-analysis included 90 independent samples based on online panel data for 32,121 online panel participants. Of the 90 samples, 54 were published in academic journals and 36 were from dissertations or samples that were unpublished. To increase the likelihood of gathering available studies based on online samples, we first searched electronic databases (i.e., PsycINFO, Google Scholar, ABI Inform, and ProQuest Dissertations) for the following keywords and various combinations thereof: online panel, Study Response, StudyResponse, MTurk, Mechanical Turk, Qualtrics Panel, Survey Monkey, Zoomerang, online respondent, online study, internet sample, internet panel, and online sample. Combined there were over 25,000 studies that cited one or more of the search terms as of December 31, 2015. We also conducted a manual search of six top applied psychology journals that have published OPD (i.e., Academy of Management Journal, Journal of Applied Psychology, Journal of Management, Journal of Organizational Behavior, Organizational Behavior and Human Decision Processes, and Personnel Psychology) for the years 2006–2015. Finally, we posted calls for additional in-press or unpublished articles on two OB/HR listservs, HRDIV_NET and RMNET; we gathered six additional studies in this way.

Inclusion Criteria

Our initial search included over 25,000 total citations with one or more of the search terms. We were interested in finding empirical data from an online respondent pool (e.g., StudyResponse, MTurk, Qualtrics) which had included a common OB/HR relationship with existing meta-analytic data that could be used for comparison. Of the total citations that included one or more of the online panel search terms, 5463 also included mention of at least one key variable of interest (i.e., an independent variable (IV) or dependent variable (DV)). As our search included information from several databases, we then searched for any duplicate citations, which reduced the remaining number to 3158 citations. We then determined which of these studies included quantitative, statistical data, resulting in 838 potential studies remaining. Of these 838 quantitative studies, only 107 contained a relationship (i.e., an IV–DV relationship) of interest (e.g., conscientiousness to OCB). Many studies using online panels were experimental in nature, testing a new manipulation or intervention on a DV rather than an IV–DV relationship of interest.

Of the 107 studies considered for inclusion, 23 studies provided data that was not useable for our purposes (see Appendix 3 for a full list of these studies). The following study types were excluded: studies which used an online webhosting service (e.g., Qualtrics) but collected data from a conventional sample (e.g., employees at a specific company, k = 10), studies which mixed conventional and OPD samples together (k = 9), studies which used online panel data drawn from a specific, non-generalizable population (e.g., a sample drawn from Craigslist in a given area, k = 3), and studies which used online panel participants and examined relationships of interest but did not report an effect size (k = 1). Furthermore, if a paper contained multiple studies, only data from studies using exclusively an OPD sample were included. The available OPD needed to consist of relationships that were comparable to existing conventionally sourced meta-analyses; only those relationships for which enough OPD studies were available (i.e., k ≥ 3) were analyzed and compared. We followed Wood’s (2008) detection heuristic to ensure that we did not include any duplicate study effects.

Following guidelines outlined by Schmidt and Hunter (2014), we averaged correlations obtained from samples using multiple measures of the same construct (e.g., OCB) so that each effect size reflected a unique sample. We corrected the variance of the averaged effect size using equations provided by Borenstein, Hedges, Higgins, and Rothstein (2009). Finally, there were no criteria regarding the publication date or sample nationality. The nationality of sample participants was not clearly reported for most of the samples (k = 50). Of the 40 samples whose participants’ nationality was reported, most were exclusively from the USA (k = 30). There was one exclusively Dutch sample. The remaining samples (k = 9) were of mixed nationalities with participants from the USA and other countries. Of those nine samples, seven samples included a majority of US participants and two samples included a majority of participants from India. Two members of the authorship team coded the studies. These individuals independently coded a random subset of the studies and the interrater reliability was high at 99.3% (868 cells/874 cells; Cohen’s kappa = .986). The discrepancies were resolved through discussion.
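To illustrate the averaging of correlations from a sample using multiple measures of the same construct, the following R sketch combines two correlations from one hypothetical sample and adjusts the variance of the average for the correlation between the two measures, in the spirit of the equations in Borenstein et al. (2009). All numeric values, including the assumed inter-measure correlation, are invented.

```r
# Averaging two correlations from one sample (hypothetical values throughout)
r1  <- .28; r2 <- .34        # correlations of the IV with two OCB measures
n   <- 250                   # sample size
r12 <- .60                   # assumed correlation between the two OCB measures

v1 <- (1 - r1^2)^2 / (n - 1) # large-sample sampling variance of each correlation
v2 <- (1 - r2^2)^2 / (n - 1)

r_bar <- mean(c(r1, r2))                                  # one effect size per sample
v_bar <- (1 / 2)^2 * (v1 + v2 + 2 * r12 * sqrt(v1 * v2))  # variance of the average of two correlated outcomes

c(effect = r_bar, variance = v_bar)
```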

We coded the OPD studies for the type of data pre-screening and quality checks used by the original authors. Unfortunately, 34% of the samples provided no information about pre-screening of participants and 53% provided no information about data quality checks. Since non-reporting does not necessarily mean no checks were employed, we deemed this coding too “noisy” to analyze. Nevertheless, it may be instructive to know that 30% of the samples reported requiring participants to have a specific work status (e.g., full time or a minimum number of hours per week), 27% required other specific work characteristics (e.g., have a direct supervisor), and 24% required a specific geographic setting (however, only 16% reported using screening questions to ascertain these participant attributes). Further, some type of insufficient effort responding checks (e.g., bogus items or pattern responding) was used in almost 35% of the samples. Elimination of subjects for missing data was reported in 27% of the samples.

Selection of Comparison Conventional Meta-analyses

To determine whether the OPD population estimate falls within the 80% credibility interval of existing, conventionally sourced meta-analyses, we created a protocol to identify existing meta-analytic data to use. The research team agreed upon the following decision rules before one researcher searched for and identified meta-analyses examining the common OB/HR relationships of interest. First, the researcher found all existing meta-analyses which had data for a given relationship. Then, if multiple meta-analyses were identified for a single relationship, the study with the highest k around which CVs could be constructed was chosen. It was important to use the point estimate and corresponding CVs with the highest k to provide the most accurate and reliable population estimate of conventionally sourced data. Furthermore, since we are comparing overall effects between OPD and conventional meta-data, the overall effect sizes were used when possible (i.e., data from “main effects” tables) instead of choosing effect sizes as part of moderator analyses. Thus, whenever possible, we compare main effects and corresponding CVs of conventional meta-data with main effects of OPD. When applicable, we used weighted averages to calculate an overall effect size for constructs. We noted instances of this at the bottom of Table 4 in Appendix 1. Finally, we ensured that the corrected scores for all meta-analytic results were as comparable as possible. All but one of the meta-analyses corrected for reliability in the independent and dependent variables and made no other corrections. One conventional meta-analysis (Chiaburu, Oh, Berry, Li, & Gardner, 2011) also corrected for range restriction in the predictor (personality) values using the estimated range restriction ratio (ux) from Schmidt, Shaffer, and Oh (2008).
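The comparison rule itself is straightforward. The R sketch below, using made-up numbers, shows how an 80% credibility interval can be formed from a conventional meta-analytic estimate (ρ ± 1.28 × SDρ) and how containment of the OPD estimate is then checked.

```r
# Checking whether an OPD estimate falls within a conventional 80% credibility interval
# (all values hypothetical)
rho_conv <- .30   # conventional meta-analytic estimate of rho
sd_rho   <- .10   # SD of rho after artifact corrections
rho_opd  <- .36   # OPD-based meta-analytic estimate

cv_lower <- rho_conv - 1.28 * sd_rho   # 10th percentile of the rho distribution
cv_upper <- rho_conv + 1.28 * sd_rho   # 90th percentile

within_cv <- rho_opd >= cv_lower & rho_opd <= cv_upper
c(lower = cv_lower, upper = cv_upper, within = within_cv)
```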

Meta-analytic Techniques

We used Schmidt and Hunter’s (2014) psychometric meta-analysis methods to analyze the effect sizes of the OPD correlational relations. We performed the calculations using metafor in R (Viechtbauer, 2010). To ensure that the OPD true score calculations were as comparable as possible, we corrected for reliability in the independent and dependent variables for all of our analyses. For those data missing reliability information, we used artifact distributions (Schmidt & Hunter, 2014). Additionally, we used the ux values from Schmidt et al. (2008) to correct for direct range restriction in the personality values when calculating the true score values between the Big Five personality traits and OCB (to be comparable with Chiaburu et al., 2011). The ux values used were as follows: conscientiousness .92, agreeableness .91, neuroticism .91, extraversion .92, and openness to experience .91.
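For readers unfamiliar with these corrections, the simplified R sketch below illustrates an attenuation correction for unreliability in both variables together with a Thorndike Case II correction for direct range restriction using ux. The input values are hypothetical (apart from the ux value for conscientiousness reported above), and the full Schmidt and Hunter (2014) procedure, including artifact distributions and the exact ordering of corrections, involves details not shown here.

```r
# Simplified individual-correction sketch (hypothetical inputs)
r_obs <- .20   # observed correlation, e.g., conscientiousness with OCB
rxx   <- .85   # reliability of the predictor
ryy   <- .80   # reliability of the criterion
ux    <- .92   # restricted/unrestricted SD ratio for conscientiousness (Schmidt et al., 2008)

# Correct for direct range restriction in the predictor (Thorndike Case II)
U    <- 1 / ux
r_rr <- (r_obs * U) / sqrt(1 + r_obs^2 * (U^2 - 1))

# Correct for unreliability in the predictor and the criterion
rho_hat <- r_rr / sqrt(rxx * ryy)
rho_hat
```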

To compare scale reliabilities, we used reliability generalization, a framework developed by Vacha-Haase (1998) based on the concept of validity generalization, as a means to amalgamate the variability in reliability estimates that occurs across measurements. The goal of reliability generalization is similar to that of a traditional meta-analysis: to obtain a weighted mean alpha and estimate the degree of variability in alpha across different measurements and samples. Consistent with best practices (Botella, Suero, & Gambara, 2010), we performed all calculations on non-transformed estimates of alpha. We weighted the alphas by their inverse variance. We calculated the variance using derivations of the SE of alpha as explained by Duhachek, Coughlan, and Iacobucci (2005).
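A minimal sketch of this weighting step appears below. The alphas and their sampling variances are invented; in the analyses reported here, the variances were derived from the standard error of alpha following Duhachek et al. (2005).

```r
# Inverse-variance weighted mean alpha (hypothetical inputs)
alphas  <- c(.88, .91, .84, .90)            # coefficient alphas from individual OPD samples
v_alpha <- c(.0004, .0003, .0006, .0005)    # sampling variances of those alphas

w         <- 1 / v_alpha                    # inverse-variance weights
alpha_bar <- sum(w * alphas) / sum(w)       # weighted mean alpha
alpha_bar
```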

Moderator Analysis

Although the primary purpose of this research study was to compare the effects of OPD to those from conventional data sources, we performed some supplementary analyses to examine potential moderators that may influence the OPD effect sizes. We examined three potential moderators: publication status, OPD source, and publication date. Regarding publication status, it is likely that reviewers have more closely scrutinized data from published studies and therefore these data have undergone more data cleaning and integrity checks than data in unpublished studies. These additional integrity checks may moderate the examined relationships. Regarding OPD source, subjects from MTurk often have lower compensation rates than other paid OPD sources, such as StudyResponse or Qualtrics. Therefore, MTurk respondents may have systematic differences from the other OPD sources due to the lower compensation (e.g., they may speed through the survey randomly selecting choices, which may attenuate relationships). Finally, it may be possible that the nature of OPD respondents has changed over time, as OPD has become more popular. Therefore, the date when the OPD were collected may moderate relationships. We used the metafor program in R (Viechtbauer, 2010) with restricted maximum-likelihood estimation to examine whether or not these three moderators influenced the OPD relationships. For publication status and OPD source, we examined relationships where we had at least three studies in each group. For publication date, we performed the moderator analysis when there was at least one study published in three different years.
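The sketch below illustrates, with invented data, how such a categorical moderator test can be run in metafor with REML estimation; it is not the authors’ script, and the effect sizes, variances, and grouping are hypothetical.

```r
library(metafor)

# Hypothetical effect sizes (yi) and sampling variances (vi) for one OPD relationship,
# coded by OPD source (MTurk versus other panels)
dat <- data.frame(
  yi     = c(.25, .31, .22, .40, .35, .28),
  vi     = c(.004, .006, .005, .007, .004, .006),
  source = c("MTurk", "MTurk", "MTurk", "Other", "Other", "Other")
)

# Random-effects model with OPD source as a categorical moderator
mod <- rma(yi, vi, mods = ~ source, data = dat, method = "REML")
summary(mod)  # the QM test indicates whether source moderates the relationship
```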

Results

Research Question 1: External Validity

Our first research question was whether relationships among variables derived from online panels differ from conventionally sourced data. We present the meta-analytic estimates from OPD samples in Table 1 and graphically in Fig. 1. We compare the results from the OPD meta-analysis to the meta-analytic estimates that we gathered from the existing literature, which we present in Table 4. Recall that our research question asks whether ρ-OPD, the population estimate of the size of a given relationship based upon studies using online panel data, falls within the 80% credibility interval of the population estimate based on the conventionally sourced data. We found that 86% (37/43) of the IV–DV relationships fell within the 80% credibility intervals of conventionally sourced data.

Table 1 Results for meta-analysis of online panel samples
Fig. 1

Relationship number is on the x-axis; magnitude of correlation (from −1 to 1) is on the y-axis. The OPD point estimate (ρ) is designated with a circle and the 80% CVs are indicated with bold error bars; the point estimate (ρ) from existing meta-analyses is designated with a triangle and 80% CVs are indicated with thin error bars. The order of relationships follows the same order found in Table 4 in Appendix 1 and corresponds to the following numbers as shown on the x-axis: (1) positive leadership and job satisfaction (DeGroot, Kiker, & Cross, 2000), (2) positive leadership and organizational commitment (Jackson, Meyer, & Wang, 2013), (3) positive leadership and turnover intentions (Griffeth, Hom, & Gaertner, 2000), (4) positive leadership and CWB (Hershcovis et al., 2007), (5) abusive supervision and job satisfaction (Mackey et al., 2015), (6) abusive supervision and organizational commitment (Mackey et al., 2015), (7) abusive supervision and CWB (Mackey et al., 2015), (8) agreeableness and job satisfaction (Judge, Heller, & Mount, 2002), (9) agreeableness and organizational commitment (Choi et al., 2015), (10) agreeableness and turnover intentions (Zimmerman, 2008), (11) agreeableness and OCB (Chiaburu et al., 2011), (12) agreeableness and CWB (Berry, Ones, & Sackett, 2007), (13) conscientiousness and job satisfaction (Judge et al., 2002), (14) conscientiousness and organizational commitment (Choi et al., 2015), (15) conscientiousness and turnover intentions (Zimmerman, 2008), (16) conscientiousness and OCB (Chiaburu et al., 2011), (17) conscientiousness and CWB (Berry et al., 2007), (18) extraversion and job satisfaction (Judge et al., 2002), (19) extraversion and organizational commitment (Choi et al., 2015), (20) extraversion and turnover intentions (Zimmerman, 2008), (21) extraversion and OCB (Chiaburu et al., 2011), (22) extraversion and CWB (Berry et al., 2007), (23) neuroticism and job satisfaction (Judge et al., 2002), (24) neuroticism and turnover intentions (Zimmerman, 2008), (25) neuroticism and OCB (Chiaburu et al., 2011), (26) neuroticism and CWB (Berry et al., 2007), (27) openness to experience and job satisfaction (Judge et al., 2002), (28) openness to experience and organizational commitment (Choi et al., 2015), (29) openness to experience and turnover intentions (Zimmerman, 2008), (30) openness to experience and OCB (Chiaburu et al., 2011), (31) openness to experience and CWB (Berry et al., 2007), (32) positive affect and job satisfaction (Thoresen, Kaplan, Barsky, Warren, & de Chermont, 2003), (33) positive affect and organizational commitment (Thoresen et al., 2003), (34) positive affect and turnover intentions (Thoresen et al., 2003), (35) positive affect and CWB (Cochran, 2014), (36) negative affect and job satisfaction (Thoresen et al., 2003), (37) negative affect and organizational commitment (Thoresen et al., 2003), (38) negative affect and turnover intentions (Thoresen et al., 2003), (39) negative affect and OCB (Dalal, 2005), (40) negative affect and CWB (Cochran, 2014), (41) justice and job satisfaction (Colquitt, Conlon, Wesson, Porter, & Ng, 2001), (42) justice and organizational commitment (Colquitt et al., 2001), (43) justice and CWB (Cochran, 2014)

The relationships that fall outside the credibility interval tend to be stronger for the OPD sources than for the conventional sources, whether more positive or more negative. Three of the five relationships that were outside the credibility interval involved turnover intentions. The relationship between positive leadership and turnover intentions was more negative for OPD (ρ = −.50) than in conventional samples (80% CV −.40, −.06). The relationship between conscientiousness and turnover intentions was also more strongly negative for OPD (ρ = −.29) than in conventional samples (80% CV −.24, −.08). Finally, the relationship between openness to experience and turnover intentions was consistently negative for OPD (ρ = −.17; 80% CV −.28, −.07), whereas there was a less consistent relationship in the conventional samples (80% CV −.15, .17).

We also examined the confidence intervals to note any pattern of significant differences in the OPD versus conventional superpopulation effect sizes. Confidence intervals were reported in the conventional meta-analyses for 29 of the effect sizes (not all conventional meta-analyses reported confidence intervals). We found that ρ-OPD was within the 95% confidence interval of the conventional meta-analytic effect size in 10 of the cases, was outside the upper bound of the confidence interval in nine of the cases, and was outside the lower bound of the confidence interval in 10 of the cases. Of the 19 effect sizes that fell outside the confidence interval (either upper or lower bounds), 11 of the OPD effect sizes were stronger than the conventional effect sizes and eight of the OPD effect sizes were weaker than the conventional effect sizes. These results suggest that there is no systematic difference between the OPD effect sizes and the conventional effect sizes. This is not to say that there are not differences, rather the differences do not seem to follow any interpretable pattern. As a final check of the confidence intervals, we examined whether or not the 95% confidence interval from the OPD meta-analysis overlapped with the 95% confidence interval from the conventional meta-analyses. There were three cases where the confidence interval did not overlap: conscientiousness-turnover intentions, openness to experience-turnover intentions, and negative affect-CWB.

Moderator Results

We examined three potential moderators that may influence the OPD relationships of interest: publication status, OPD source, and publication date. Although a few differences emerged, these differences were generally small and no systematic pattern of differences emerged. Publication status (published versus non-published) moderated only three of the 18 relationships that we examined (neuroticism-job satisfaction, neuroticism-CWB, and negative affect-CWB). Two of the three relationships were attenuated by publication status (negative affect-CWB was strengthened). Source (MTurk versus other) moderated two of the 19 relationships that we examined (conscientiousness-job satisfaction and negative affect-CWB). One of the two relationships was attenuated by source (negative affect-CWB was strengthened). Finally, publication date moderated four of the 39 relationships examined (extraversion-turnover intentions, extraversion-CWB, openness-job satisfaction, and negative affect-turnover intentions). One of the four relationships was attenuated by date (the relationship between openness and job satisfaction was weaker as the publication date increased). Because of the null findings, the results of these analyses are not included in the manuscript but are available from the first author upon request.

Research Question 2: Reliability Generalization

Our second research question asked whether the internal reliability estimates from online panel sources differ from those found in conventionally sourced data. The results for the reliability generalization are presented in Table 2 and, graphically, in Fig. 2. Here, we compare the results of the reliability generalization analysis using OPD sources to a comprehensive reliability generalization study conducted by Greco, O’Boyle, Cockburn, and Yuan (2015). We were able to compare the reliability point estimate of 12 constructs from the Greco et al. (2015) analysis to reliability generalization using the OPD sources. All 12 point estimates from the OPD analysis fell within the 80% credibility intervals from the larger reliability generalization study. These results suggest that the internal consistency of scales with OPD samples is similar to that of conventional sample sources.

Table 2 Comparison of scale reliabilities for OPD and conventionally sourced samples
Fig. 2

Comparison of reliability generalization using OPD studies versus the reliability generalization information from published management studies. The reliability generalization estimate from the OPD studies is designated with a circle; the 80% CVs from the Greco et al. (2015) reliability generalization are indicated with error bars. The scales represented in the figure are as follows: (1) abusive supervision, (2) agreeableness, (3) conscientiousness, (4) counterproductive work behaviors, (5) extraversion, (6) job satisfaction, (7) justice, (8) leadership, (9) negative affect, (10) neuroticism, (11) openness to experience, (12) organizational citizenship behaviors, (13) organizational commitment, (14) positive affect, and (15) turnover intentions

Discussion

Online panel sources are increasingly being used to compose research samples in the field of applied psychology. The purpose of our research was to examine the external validity and measurement properties of OPD. We used meta-analytic techniques to aggregate the published and unpublished online survey data and compare the psychometric properties and criterion validity of this data to that found in conventional data sources. Our reliability generalization analyses showed that 100% (12 of 12) of the reliability generalization estimates from OPD samples were within the 80% credibility values of the reliability estimates based on conventional samples (Greco et al., 2015). Based on both the primary data analyses reported in previous work and our analyses using aggregate data reported here, it appears that OPD does not systematically affect internal consistency in applied psychology research.

Little previous research has examined the criterion validity of OPD in the field of applied psychology. To test external validity, we calculated meta-analytic effect size estimates for 43 IV–DV relations frequently found in OPD and compared them to these same relations based on conventional data. The OPD population estimate fell within the 80% credibility interval established in previous meta-analyses based on conventional data 86% of the time, suggesting differences between OPD and conventional data do not exceed chance. Thus, OPD appears to provide effect size estimates that do not differ from conventional data in the field. Together, our examination of the internal and external validity of data provided by online panel sources suggests such data are as appropriate as other samples of convenience used in the field of applied psychology. As with all convenience samples, it is important to be able to justify that the sample source is appropriate for addressing the hypotheses/research questions. For example, it would be difficult to justify MTurk as a sample source for a study on CEOs.

Theoretical Implications

It is important to understand the purposes for which OPD is or is not appropriate. OPD, like the vast majority of samples used in applied psychology, provides a convenience sample in the sense that it is not necessarily a representative sample of the US or world working population. It is not appropriate to generalize sample statistics, such as a mean, to a population when using a non-representative sample. However, point estimates are rarely the focus of research in the applied psychology field, which tends to focus much more on causal relations among constructs and rely on the concept of theoretical generalizability. According to the logic of theoretical generalizability (Sackett & Larson Jr, 1990), samples of convenience are appropriate when one wishes to generalize presumed causal relationships among constructs to a broader population and if the convenience sample is reasonably similar to the population to which one wishes to generalize. For such purposes, a completely random or stratified random sampling of the population is not necessary. Rather, one can make a strong case for generalizability if the convenience sample is reasonably similar to the larger population, for example, if the convenience sample is a subsample of the population. Some authors (Harms & DeSimone, 2015) have suggested that OPD respondents may not be truthful about their demographic or employment characteristics and may be so different as to preclude generalization to the broad working population. If this is so, our approach cannot tell us exactly what demographic and work experience characteristics OPD respondents possess, but our results do show that OPD data demonstrate psychometric properties and criterion validities that are not meaningfully different from conventional field data. Thus, even if OPD samples differ from organizational samples on a number of attributes, these differences do not seem to have a systematic influence on the theoretical relationships we examined. This strongly suggests that the OPD samples are reasonably similar to other samples typically used in the field and thus make up an appropriate convenience sample.

Practical Implications

Our results and review of the literature on OPD yield a number of practical implications for scholars seeking to use OPD in their research beyond the theoretical considerations discussed above. Although we coded OPD studies for the types of respondent screening and data cleaning procedures used, reporting was inconsistent and incomplete, so we could not determine exactly which procedures were used or what effect each data handling technique might have on the quality of the data. It is important to note that some data screening procedures were used in the majority of the studies that make up our OPD meta-analyses. Therefore, until we can gather more accurate information regarding exactly which screening techniques are used, the conservative approach is to recommend a relatively comprehensive list of the screening procedures we found in the OPD-based studies. Table 3 provides a summary of best practices for data handling derived from the literature and the techniques already used with OPD in the field (see also DeSimone, Harms, & DeSimone, 2015). Overall, we recommend researchers carefully consider the purposes of their study, the population sampling frame, the incentives they use to select and motivate respondents, and the data screening procedures they use to eliminate poor responders. Further, we strongly suggest expressly detailing these procedures in the methods section of the article. Future research should determine which of these procedures are effective.

Table 3 Recommendations when using online panel data

OPD may not be appropriate if a researcher is theorizing about specific contextual processes (e.g., information processing) or is concerned with a specific group of people (e.g., CEOs), since the convenience sample may not experience the relevant contextual influences and may not make up a subsample of the desired population. Bergman and Jean (2016) go further to suggest that unrepresentative samples may lead scholars to overlook important workplace phenomena that exist only in specific subgroups, such as food insufficiency or economic tenuousness. However, others have suggested that OPD sources can be of great utility precisely because they are more diverse and provide access to under-represented populations (Smith, Sabat, Martinez, Weaver, & Xu, 2015). Researchers should always be able to justify the appropriateness of the sample (source) for addressing their specific hypotheses.

Limitations and Future Research

This study, based as it is on meta-analytic techniques, has limitations common to meta-analysis. First, because the use of online panels is relatively recent in the field, the number of relationships examined and the number of studies in each meta-analysis is limited. Although we include personality, work attitudes, and leader behavior as independent variables and attitudes, behavioral intentions, and employee behavior as dependent variables, future research might extend our results to a broader range of IV–DV relations. However, the consistent nature of our results leads us to expect similar outcomes with other constructs. Second, the small number of studies for each effect size estimate restricts our ability to conduct moderation analyses by OPD source. Examining our data by OPD service source revealed no substantive differences, but future research based on a greater number of studies could explore this potential moderation with more statistical confidence.

Incomplete reporting in the primary studies regarding the way data were collected limited our ability to explore the extent to which data screening and cleaning might improve data quality. Our results suggest that the data handling procedures currently used in the field are adequate, since the OPD and conventional data do converge, but a more systematic understanding of these factors might make data collection smoother and more cost effective. Further research might also focus on the techniques and practices that the online panel firms themselves use to develop and maintain high-quality survey respondents, including the forms of compensation, identification protocols, and quality feedback from end users (Callegaro, Villar, Yeager, & Krosnick, 2014). Online panel participants and online panel service practices may change at any time, so continued attention to OPD quality issues is warranted.

A third limitation is that some of the more recent meta-analyses that we used to establish the 80% CV for conventional data themselves include a small number of OPD samples. We examined each of the conventional meta-analyses for studies that used OPD samples and found slight overlap. The Choi, Oh, and Colbert (2015) and Chiaburu et al. (2011) meta-analyses each contained one study that used OPD. The Mackey, Frieder, Brees, and Martinko (2015) meta-analysis contained five studies that used OPD. We chose to use the existing meta-analyses to represent the established true score estimates in the field because the small number of OPD samples is unlikely to have much influence and because the number of judgment calls necessary to update all of these meta-analyses would inevitably raise questions of their own.

A final limitation is that the majority of the OPD sources used in this study were from USA-based companies (MTurk, StudyResponse, Qualtrics). Due to differences in labor markets, social welfare, the culture of employee-employer relations, and other cultural differences, these results may not generalize to OPD from other countries.

As these future research ideas suggest, there is much more we might want to know about the nature of online panel samples and services. However, our results support a growing body of evidence that online panels can provide data that are appropriate for testing some hypotheses about the general population within the field of applied psychology.