Structural equation modeling is a theory-driven statistical technique for analyzing and making sense of the correlations between variables. It can be used to assess the plausibility of different theoretical models, including models that impose a putative causal structure on non-manipulated constructs assessed at a single point in time or at multiple points in time, models of the effects of manipulating a variable (such as in intervention research), and models involving mediating and moderating variables. The focus of this paper is on meta-analytic structural equation modeling (MASEM), which, as its name suggests, is a statistical technique that brings meta-analysis and structural equation modeling together (Becker, 1992, 1995; Cheung & Chan, 2005; Cheung, 2015a, 2019; Jak & Cheung, 2020). MASEM allows researchers to investigate how empirically distinct theoretically related constructs are and whether a proposed structural model serves as a reasonable way of organizing the underlying constructs; it is therefore well-suited to addressing topics of interest to both theoretically oriented and applied prevention scientists. While both meta-analysis and structural equation modeling are regularly used in prevention science, they are rarely applied in combination in the prevention science literature (the first such application in Prevention Science is Shen et al., 2021). We hope that this primer on MASEM will encourage prevention scientists to make greater use of these methods in the future. In addition to this paper, good conceptual introductions to MASEM can be found in Cheung and Hafdahl (2016) and Cheung (2020).

MASEM usually involves two basic steps. First, the relationships observed in the studies in the meta-analytic database are used to create a meta-analytic (pooled) correlation matrix; this pooling step is the primary distinction between MASEM and a structural equation model based on the data observed in a single study. Second, structural equation models are fitted to the pooled correlation matrix. The average correlation matrix, together with its sampling covariance matrix, can be analyzed with a weighted least squares estimation method. As with all structural equation models, test statistics and goodness-of-fit indices can be used to help judge the exact and approximate fit of the proposed models.
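
To make these two steps concrete, the following minimal sketch (not taken from this paper) shows how they map onto the tssem1() and tssem2() functions in the metaSEM package. The objects cor_list, n_vec, and RAM are assumptions: hypothetical containers for the per-study correlation matrices, the study sample sizes, and a RAM specification of the proposed model.

```r
# A minimal sketch of the two MASEM steps with metaSEM (assumed objects:
# `cor_list`, a list of per-study correlation matrices with NA for cells a
# study did not report; `n_vec`, the study sample sizes; `RAM`, a RAM
# specification of the proposed model, e.g., produced by lavaan2RAM()).
library(metaSEM)

# Step 1: pool the correlation matrices under a random-effects model
stage1 <- tssem1(cor_list, n_vec, method = "REM")
summary(stage1)

# Step 2: fit the structural model to the pooled correlation matrix
stage2 <- tssem2(stage1, Amatrix = RAM$A, Smatrix = RAM$S)
summary(stage2)
```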

To create and interpret MASEM models, then, researchers need familiarity with the methods of systematic reviewing, meta-analysis, and structural equation modeling. Fortunately, researchers who already know structural equation modeling well will transition easily to applying that knowledge to MASEM. Both the random-effects two-stage structural equation modeling (TSSEM) approach proposed by Cheung (2014, 2015a) and one-stage meta-analytic structural equation modeling (OSMASEM; Jak & Cheung, 2020), which are introduced in this paper, use structural equation models to meta-analyze correlation matrices and fit structural equation models. Therefore, model fit indices and parameter estimates in MASEM can be interpreted in much the same way as those in SEM. That said, there are some practical differences between SEM and MASEM. For example, applications of SEM usually involve latent variable models with large degrees of freedom (dfs), whereas regression or path models with small dfs are fitted in MASEM. This point is important because some fit indices, for example, the root mean square error of approximation (RMSEA), may not behave well in models with small dfs (Kenny et al., 2015). Researchers in both SEM and MASEM should be cautious in these cases.

An Overview of the Order of Operations for MASEM

Researchers interested in using MASEM should begin by formulating a precise research question, or set of research questions, that can be addressed using the technique. Then, researchers should set the Participant, Intervention, Comparison, Outcome, Study designs (PICOS) criteria, that is, arrive at theoretical and operational definitions of the relevant participants, interventions (if applicable), comparisons (if applicable), outcome measures used, and study designs. The PICOS criteria will inform the systematic review’s literature search. A structured process that yields reproducible results should be used to identify and extract information from eligible studies.

The five primary considerations for arriving at the meta-analytic (pooled) correlation matrix are (a) whether to use a fixed effects or a random effects approach, the considerations for which are the same in MASEM as they are in any meta-analysis, (b) how to handle study dependence (e.g., if a study presents multiple outcome estimates for the same construct), (c) how to address possible publication bias, (d) whether to correct the estimates for attenuation, and (e) how to incorporate judgments of study quality. We address each of these briefly in turn.

Fixed Effects vs Random Effects

The fixed effects model is defensible if it is reasonable to believe that all study effects are estimating the same (or common) population parameter or if all study-level influences are known and can be accounted for (e.g., via regression adjustment). In either case, this assumption implies that sampling error is the only reason that the observed effect varies across studies. If, on the other hand, researchers believe that study effects will vary across samples, settings, and study characteristics, and that some of these characteristics are unknown and cannot be accounted for, then the random effects model will generally be more appropriate. We should note that the fixed effects model is also used when the researchers’ interest is limited to the studies at hand (i.e., they want to generalize only to the specific studies in the meta-analytic database and studies highly similar to them), something we suspect will be rare in MASEM applications.

There are two random-effects models in MASEM. In the conventional approach, a structural equation model is fitted to the average correlation matrix, and the random effects in a study represent the differences between the population correlation matrix in that study and the average correlation matrix. This is the model we use in the present study. An alternative model treats the random effects in a study as the differences between the population structural parameters (e.g., regression coefficients and factor loadings) in that study and the average structural parameters. Readers interested in these issues may refer to Cheung and Cheung (2016) and Ke et al. (2019) for further discussion.

Dependent Estimates

The estimates contributing to a meta-analytic correlation matrix are presumed to be independent. Non-independence can happen for many different reasons. For example, several studies included in the meta-analytic database we present below used multiple measures of depression and therefore had more than one estimate of the correlation between depression and another construct of interest. If the estimates contributing to the meta-analytic correlation are not independent, then the meta-analytic standard errors will be too small, and studies contributing multiple effect sizes will be given too much weight in the analysis. There are two ways of addressing dependence. One approach is structural, in the sense that researchers can make intentional choices to limit dependence. In the case of having multiple measures of the same construct, for example, the research team might have a preference for one measure over another or might drop measures randomly until only one remains (see Lipsey & Wilson, 2001). Of course, these kinds of decisions should not be contingent on the effect size observed and ideally will be articulated in a publicly available protocol created prior to data collection. The other approach is statistical. Common statistical methods for addressing dependence are multilevel modeling (see, for example, Konstantopoulos, 2011), robust variance estimation (see, for example, Tanner-Smith & Tipton, 2014), and averaging dependent effects (Cooper, 2017). For MASEM applications, we recommend averaging dependent effect sizes within studies because it provides a framework for testing moderators in MASEM. The average correlation matrix with its asymptotic sampling covariance matrix from multilevel modeling or robust variance estimation can be used to fit structural equation models with weighted least squares as the estimation method.
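
As an illustration of the averaging approach, the following minimal sketch (not the authors' code) averages dependent correlations within each study before pooling. The data frame dat and its columns study_id, pair, and r are hypothetical names chosen for the example.

```r
# A minimal sketch (assumed input): `dat` is a long-format data frame with one
# row per reported correlation and hypothetical columns study_id, pair (e.g.,
# "Dys-Dep"), and r. Dependent correlations are averaged within study and pair.
averaged <- aggregate(r ~ study_id + pair, data = dat, FUN = mean)
head(averaged)  # one (averaged) correlation per study and construct pair
```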

Publication Bias

Sometimes, the decision about whether to publish a study depends on the nature of its findings. When a study goes unpublished because it does not have statistically significant findings on its main outcomes, this is known as publication bias, and it can represent an important threat to the validity of the conclusions arising from a systematic review and meta-analysis. In general, publication bias will tend to result in estimated relationships between constructs that are too large and heterogeneity estimates that are too small. In the context of MASEM, this means that the estimated path coefficients will tend to be too large. Vevea et al. (2019) provide a good discussion of the various options for detecting the possible presence of publication bias and diagnosing the extent of its possible impact.

Prevention is always the best defense against publication bias (Jak & Cheung, 2020; Vevea et al., 2019), meaning that researchers should carry out an exhaustive search for studies relevant to their research question. An exhaustive search is always an important part of a systematic review, but it assumes even greater importance in a MASEM given that, to date, no consensus has emerged on how to address possible publication bias in MASEM. If researchers carrying out a MASEM project conclude that publication bias has an important impact on one or more of their paths of interest, then it might make sense to redouble efforts aimed at locating relevant unpublished studies. If the researchers choose to proceed with MASEM despite the possible influence of publication bias, they should clearly communicate the rationale for this decision to their readers and should be very cautious when interpreting results.

Attenuation Corrections

If researchers have item-level correlation matrices, MASEM can handle measurement error by fitting confirmatory factor analytic models or structural equation models. When path models on composite scores are fitted in MASEM, problems such as measurement error and artificial dichotomization of continuous variables reduce the correlation that will be observed between two variables. For example, even if two constructs are perfectly correlated, their observed correlation will not be 1.0 in the presence of measurement error. This phenomenon is known as attenuation, and it is not uncommon for researchers to correct correlations for attenuation caused by measurement error and other problems in meta-analysis (in fact, doing so is the norm in some fields, organizational psychology being one example). If attenuation corrections for measurement error are desirable, one way to implement them would be to rely on good systematic reviews and meta-analyses of coefficient alpha for measures of relevant constructs (these are often referred to as reliability generalization studies; Rodriguez & Maeda, 2006). Another would be to meta-analyze the reliability coefficients reported in the studies in the meta-analysis. Schmidt et al. (2019) have a very good discussion of how this could be done. At the same time, users should be aware that attenuation corrections can sometimes lead to a meta-analytic correlation matrix that is non-positive definite (e.g., Charles, 2005). Furthermore, Michel et al. (2011) found that results based on correlations corrected and uncorrected for unreliability were similar in several published datasets. However, it remains unclear whether this finding is supported by simulation studies.
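
For readers who do wish to apply the correction, the following minimal sketch (not from this paper) implements the standard correction for attenuation due to unreliability, dividing the observed correlation by the square root of the product of the two reliabilities. The function name and input values are illustrative.

```r
# A minimal sketch of the standard attenuation correction for unreliability:
# r_corrected = r_xy / sqrt(rel_x * rel_y). Inputs below are illustrative only.
correct_attenuation <- function(r_xy, rel_x, rel_y) {
  r_xy / sqrt(rel_x * rel_y)
}

correct_attenuation(r_xy = 0.40, rel_x = 0.90, rel_y = 0.85)  # approximately 0.46
```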

There are a couple of reasons for a non-positive definite matrix in MASEM, especially when using attenuation corrections. One is that the individual correlation matrices are not positive definite, either because pairwise deletion was used to calculate the primary correlation matrices (Wothke, 1993) or because a corrected correlation was larger than one. Researchers should therefore check whether the correlation matrices are positive definite. The provided R code illustrates how to do this. Any non-positive definite matrices should be excluded from the analysis. The average correlation matrix may also be non-positive definite. Based on our experience, this is rare. If it happens, however, researchers should identify and exclude the variable causing it. An alternative approach is to regularize the average correlation matrix using ridge SEM (e.g., Yuan et al., 2011). Finally, if the number of elements in the variance component is large compared with the number of studies, it is difficult to estimate the full variance component. A common practice is to estimate the variances of the random effects while fixing the covariances at zero.
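
The check itself can be carried out with the is.pd() function in the metaSEM package. The sketch below is a simplified version of that check, not the code from our OSF project; cor_list and n_vec are hypothetical objects holding the per-study correlation matrices and sample sizes.

```r
# A minimal sketch of screening out non-positive definite correlation matrices.
# Assumed objects: `cor_list`, a list of per-study correlation matrices (NA for
# unreported cells), and `n_vec`, the corresponding sample sizes.
library(metaSEM)

pd_check <- is.pd(cor_list)     # one result per study
pd_check                        # inspect which studies, if any, are problematic

keep <- !(pd_check %in% FALSE)  # drop matrices flagged as not positive definite
cor_list <- cor_list[keep]
n_vec    <- n_vec[keep]
```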

Incorporating Judgments of Study Quality

There is widespread recognition that study quality, which in the context of a systematic review and meta-analysis can be defined as the extent to which the methods deployed to answer a research question fit the goals of the review (Valentine, 2019), can have an influence on study effect sizes. Therefore, it is common for experts to recommend that study quality indicators be assessed and that judgments about study quality be incorporated into the review. For example, the AMSTAR-2 checklist holds that systematic reviews ought to assess the impact of varying study quality on meta-analytic results (Shea et al., 2017).

Assessing study quality is perhaps the most challenging aspect of conducting a systematic review, in part because the important study quality indicators vary as a function of study design — the issues relevant to a randomized experiment are not exactly the same as the issues involved in a cross-sectional investigation of a pattern of relationships. Important study quality indicators almost certainly also vary as a function of study context. For example, participant attrition is probably a more severe threat to the inferences arising from targeted prevention efforts than to those arising from universal prevention efforts.

In general, we recommend that individuals carrying out a MASEM project identify the relevant study quality indicators for their research question and then adopt a two-stage approach: Use the most important of these as inclusion criteria and then treat others empirically. Ideally, with a sufficient number of studies, the observed effect sizes could be adjusted (via meta-regression) in an attempt to take into account the impact of varying study quality on the meta-analytic results. A general approach for identifying relevant study quality indicators for a particular research question can be found in Valentine (2019).

Our Motivating Example: Cognitive Theories of Depression

One of the major cognitive theories explaining depression is Beck’s cognitive theory of depression (1976). One reason for its status as a major theory is that it is the basis of one of the most effective approaches to treating (Butler et al., 2006) and preventing (e.g., Pössel et al., 2018) depression. Beck’s theory includes multiple cognitive elements and posits, among other things, that the relationship between dysfunctional attitudes and depression is mediated by negative automatic thinking. However, while the elements themselves are well established, their empirical relationships with each other and with depression are less clear (e.g., Pössel & Black, 2014). Drawing on data collected for a larger project that examines the relationships between constructs from multiple cognitive models of depression (i.e., Beck’s cognitive theory, 1976; the Hopelessness Model, Abramson et al., 1989; Response Style Theory, Nolen-Hoeksema, 1991) and personality traits known to be associated with depression, in this tutorial we test two effects to illustrate MASEM: the direct effect of dysfunctional attitudes on depression and the extent to which negative automatic thoughts mediate the relationship between dysfunctional attitudes and depression. We also test three theoretically relevant study-level moderators: sample age (children vs adults), sample recruitment (general vs mixed vs clinical), and the cultural context in which the study took place (North America vs other locations).

Method

The Literature Search

Two experts in theories of depression, supported by several students in psychology programs, developed a list of terms designed to capture constructs related to cognitive models of depression. We focus on three for this tutorial: depression, depressive schema/dysfunctional attitudes, and automatic thinking. We searched Medline, PsycINFO, and the Psychological and Behavioral Sciences Collection simultaneously in EBSCO, as well as ProQuest Dissertations and Theses. As shown in Table 1, we searched document abstracts for at least two constructs of interest (e.g., depression and automatic thoughts) and the full text of documents for terms suggesting that a correlation matrix might be available (i.e., matrix or correlation* or intercorrelation*).

Table 1 Search strategy for EBSCOhost (tutorial studies only)

This latter point merits further discussion. We focused on studies reporting a full correlation matrix for their measured variables in part for efficiency’s sake (it is easier to search for a correlation matrix than for individually reported correlations) and in part to reduce our exposure to outcome reporting bias (Pigott et al., 2013), that is, within-study selective reporting of findings that are statistically significant. Outcome reporting bias has the potential to inflate effect size estimates in a meta-analysis. Fortunately, most researchers create correlation matrices that report the intercorrelations between all of their measured variables and not just the statistically significant ones. Therefore, relative to using individually reported correlations, which are potentially selectively reported, relying on a full correlation matrix likely reduces the impact of outcome reporting bias on our results.

The search generated more than 3200 documents that were potentially eligible for review. Our goal for this stage was to identify documents clearly suggesting that at least two constructs of interest were not measured (i.e., we sought to eliminate all clearly irrelevant documents). To this end, we used three pilot rounds of testing (during which all screeners coded either 99 or 100 articles). During these pilot rounds, we tracked agreement and identified sources of disagreement. After the pilot rounds, two researchers working independently screened each study. All disagreements were resolved by one of the depression experts. We used abstrackr to manage the screening process (Wallace et al., 2012). Over 2700 documents passed this stage of screening.

We then attempted to obtain the full text of the 2770 studies that remained after eliminating the clearly irrelevant documents. We were successful in obtaining 2424 of these. We proceeded to full-text screening, which involved a single researcher examining the full text of each document to determine (a) whether at least two constructs of interest were assessed in the study and (b) whether the study reported a full correlation matrix. We also screened out studies that included an experimental manipulation relevant to depression and reported only post-manipulation measures (pre-manipulation correlations were eligible for review). This process led to 161 documents that were provisionally eligible to be reviewed for this tutorial. Ultimately, after two researchers independently examined the full text of these 161 documents, 104 studies were included in the data on which this tutorial is based. By far, the most common reason studies dropped out at this stage was that at least one of the measures used did not pass expert screening (this was true for 41 of the 57 studies that did not pass full-text screening). The next most common reason was that the study did not report a full correlation matrix (k = 8). Two studies assessed constructs of interest after a relevant experimental manipulation, one study was a duplicate, and another is awaiting translation. The remaining four studies had either unusable data (e.g., correlations based on change scores) or ambiguities that prevented us from using the data (e.g., inconsistent values reported for the data of interest). The references for the 104 included studies can be found in the Appendix.

Coding Studies

Studies were coded by two researchers working independently. One of us served as the first coder for all of the studies, supported by one of five additional coders. One of our experts in depression served to arbitrate disagreements. In addition to coding the relevant correlations of interest, we coded several study-level characteristics, including the year the study was published or released, sample characteristics (e.g., age, clinical sample vs not, reported race/ethnicity), and measure characteristics (e.g., measure name, reported reliability estimates). Overall, we coded 288 individual correlations, from 125 independent samples nested in 104 studies, that contributed to the data used in this tutorial. The number of coded findings in the studies ranged from 1 to 24.

With respect to assessing study quality, the ideal study for our research questions would involve giving a large probability sample perfectly reliable and valid measures of our constructs of interest. Because we (accurately) anticipated that no studies would be based on a probability sample, we did not code for this characteristic. We did consider coding for any validity evidence reported in a study that was based on the study’s sample, but pilot testing indicated that most studies did not provide any information on validity apart from (a) mentioning validity work done in another study, (b) providing evidence of convergent validity by using multiple measures of the same construct, and (c) providing evidence of discriminant validity by providing correlations across different constructs. Therefore, the primary study quality characteristic that varied across the studies in this review was score reliability. When available, the sample coefficient alpha for each measure is reported in Table 2. Most of the measures used in the studies included in the meta-analysis will be recognized by scholars familiar with cognitive models of depression, adding a degree of face validity to our results, though it should also be noted that researchers occasionally revised existing measures by selecting particular items for use.

Table 2 Study characteristics

Participant Characteristics

As can be seen in Table 2, most studies included in the data supporting this tutorial were based on adults (88 of 104), which we defined as having an average sample age between 18 and 65. Only two were based on older adults (65 and older). The remaining studies were based on children. Most studies were conducted in the USA (78) or Canada (14). Twenty-seven studies recruited participants from a clinical setting (e.g., an inpatient hospital), 68 studies recruited from a more general location, and 13 recruited from both clinical and non-clinical sources. Three studies recruited from both clinical and non-clinical sources and reported separate correlations by recruitment source; these studies are included twice in the counts above. For studies taking place in the USA, the median of the reported percentages of white participants was 75% (the range was 0% to more than 90%). Within studies, sample sizes ranged from 10 to 1063, with a median of 143 and the middle 50% of the sample size distribution ranging from 57 to 241.

Measure Characteristics

With respect to measurement, the BDI-I and BDI-II were the most commonly used depression measures in the studies in this review; their usage was more than five times greater than that of the Center for Epidemiologic Studies Depression Scale, the next most used measure. In all, 24 unique measures of depression were used in this review. The Automatic Thoughts Questionnaire was, by far, the most commonly used measure of automatic thoughts. In some cases, authors assessed positive automatic thinking (which should be negatively correlated with depression); for these, we reversed the sign of the reported correlation coefficient. The Dysfunctional Attitudes Scale was the only measure of dysfunctional attitudes used in the studies in this tutorial (though studies differed in the specific subscales and items used).

Statistical Models

The random-effects TSSEM and OSMASEM approaches were used to conduct the MASEM. Correlation coefficients are pooled in a multivariate way by taking their dependence (sampling covariances) into account. Specifically, given the diversity of the studies included in this review, including sample and measure characteristics, we could not support the assertion that the studies share a common effect size, and therefore the correlation matrices were combined with a random-effects model in the first stage of analysis. To address non-independence, when a study provided multiple estimates for the same construct pair, we averaged the correlations within studies. We chose this method because (a) dependencies arose from having multiple estimates of the same construct pair with equal or nearly equal sample sizes and (b) it reduced measurement error by making use of all relevant effect sizes rather than selecting just one. Although a multilevel model or robust variance estimation might also be used, one major limitation is that these models do not allow study characteristics to be included as moderators of the structural parameters.

As can be seen in Table 2, about 60% of the studies in our meta-analytic dataset provided a sample-based coefficient alpha for at least one measure. The coefficient alphas were quite high. For depression, the median coefficient alpha was .90 and the middle 50% of the distribution ranged from .85 to .91. For automatic thoughts, the median coefficient alpha was .94 and the middle 50% of the distribution ranged from .92 to .96. For depressive schema, the median coefficient alpha was .91 and the middle 50% of the distribution also ranged from .85 to .91. In part because attenuation corrections for measurement error would have little impact given the typically high reliability estimates observed in these studies, we opted not to perform this correction.

The average correlation matrix and the variance component of the random effects (i.e., the heterogeneity variances) were estimated with the full information maximum likelihood (FIML) estimation method, which is unbiased and efficient in handling data that are missing completely at random or missing at random (Enders, 2010). FIML is preferable to other methods, such as listwise and pairwise deletion, for handling missing data in MASEM (Cheung, 2015a). If a particular correlation of interest is not present in any of the included studies, then the associated variable has to be dropped from the analysis.

Both TSSEM and OSMASEM fit the structural equation models using the precision of the correlation matrices. Therefore, once the asymptotic sampling covariance matrices of the correlations are calculated, the study sample sizes do not affect the parameter estimates, their standard errors, or the fit statistics of the proposed model. The study sample sizes do slightly affect some of the approximate fit indices, e.g., the RMSEA, which depend on the sample size used to fit the structural model. In TSSEM and OSMASEM, the sum of the study sample sizes is used as the sample size when calculating fit indices, following the practice used in multiple-group SEM.

The average correlation matrix and its asymptotic sampling covariance matrix were used to fit structural equation models in the second stage of analysis in TSSEM. Figure 1 displays the proposed mediation model. The corresponding path coefficients are labelled a, b, and c for ease of reference. As is usual in SEM, the indirect effect is estimated by the product term a*b. Conventional confidence intervals, also known as Wald confidence intervals, are usually based on the parameter estimate ± 1.96*standard error. The sampling distribution of an indirect effect is complicated and tends not to be normal (MacKinnon et al., 2004), making Wald confidence intervals less appropriate. Therefore, following the advice provided in Cheung (2009), in the second stage of the analysis we report the 95% likelihood-based confidence interval (LBCI), which captures the non-normal distribution of the parameters. The main drawback to this approach is that it is computationally intensive. It is therefore advisable to use the LBCI only when researchers are interested in functions of the parameters, e.g., the indirect effect in this paper.

Fig. 1 Proposed mediation model. Dys, dysfunctional attitudes; Aut, automatic thoughts; Dep, depression. Path c represents the hypothesized direct effect of dysfunctional attitudes on depression. The indirect effect of dysfunctional attitudes on depression via automatic thoughts is represented by the a × b pathway
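
A minimal sketch of how this second-stage model might be specified with the metaSEM package appears below. It is not our exact code: the objects cor_list and n_vec and the variable names Dys, Aut, and Dep are assumptions, and the mx.algebras argument is used to request an LBCI for the indirect effect a*b.

```r
# A minimal sketch (assumed objects: `cor_list`, a list of per-study correlation
# matrices with variables Dys, Aut, and Dep; `n_vec`, the study sample sizes).
library(metaSEM)

# Stage 1: pool the correlation matrices with a random-effects model
stage1 <- tssem1(cor_list, n_vec, method = "REM", RE.type = "Diag")

# Stage 2: the mediation model in Fig. 1, with likelihood-based CIs and the
# indirect effect a*b requested as an additional quantity
model <- "Aut ~ a*Dys
          Dep ~ b*Aut + c*Dys"
RAM   <- lavaan2RAM(model, obs.variables = c("Dys", "Aut", "Dep"))

stage2 <- tssem2(stage1,
                 Amatrix = RAM$A, Smatrix = RAM$S,
                 intervals.type = "LB",
                 mx.algebras = list(ind = mxAlgebra(a * b, name = "ind")))
summary(stage2)  # reports a, b, c, and the indirect effect with 95% LBCIs
```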

Conventional MASEM, e.g., the TSSEM approach proposed by Cheung and Chan (2005) and Cheung (2014), cannot handle continuous moderators. To address the moderating effects, we used the OSMASEM approach recently proposed by Jak and Cheung (2020) to test whether the selected study characteristics moderate the regression coefficients. The key idea is that the structural parameters, say a, b, and c in Fig. 1, are formulated as functions of the moderators. We may then test whether the moderators can be used to explain these parameters. All the analyses were conducted using the metaSEM package (Cheung, 2015b) implemented in the R statistical platform (R Development Core Team, 2020). The data, R code, and output are available as a registered Open Science Framework project at https://osf.io/h976y.
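
The sketch below illustrates this idea for a single binary moderator. It is not the code from our OSF project: the objects cor_list, n_vec, and clinical (a dummy-coded study characteristic), the variable names, and the choice to moderate all three paths are assumptions made for the example.

```r
# A minimal sketch of OSMASEM with one binary study-level moderator.
# Assumed objects: `cor_list` (per-study correlation matrices for Dys, Aut,
# Dep), `n_vec` (study sample sizes), `clinical` (a 0/1 indicator per study).
library(metaSEM)

df <- Cor2DataFrame(cor_list, n_vec)
df$data <- data.frame(df$data, clinical = clinical, check.names = FALSE)

model <- "Aut ~ a*Dys
          Dep ~ b*Aut + c*Dys"
RAM <- lavaan2RAM(model, obs.variables = c("Dys", "Aut", "Dep"))

# Model without the moderator
M0   <- create.vechsR(A0 = RAM$A, S0 = RAM$S)
T0   <- create.Tau2(RAM = RAM, RE.type = "Diag")
fit0 <- osmasem(model.name = "No moderator", Mmatrix = M0, Tmatrix = T0, data = df)

# Model in which the moderator shifts all three path coefficients
vars <- c("Dys", "Aut", "Dep")
Ax <- matrix(0, nrow = 3, ncol = 3, dimnames = list(vars, vars))
Ax["Aut", "Dys"] <- Ax["Dep", "Dys"] <- Ax["Dep", "Aut"] <- "0*data.clinical"

M1   <- create.vechsR(A0 = RAM$A, S0 = RAM$S, Ax = Ax)
fit1 <- osmasem(model.name = "Moderator", Mmatrix = M1, Tmatrix = T0, data = df)

anova(fit1, fit0)  # likelihood ratio test with df = 3 (one per moderated path)
```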

Results

Descriptive Statistics

In Table 3, we report descriptive statistics for each correlation pair, including the number of studies and the total number of participants on which the meta-analytic data are based. As can be seen, the random effects meta-analytic correlation between automatic thinking and depression is quite large (+.67). The other two correlations are smaller but still substantial: +.47 for the correlation between dysfunctional attitudes and automatic thinking and +.40 for the correlation between dysfunctional attitudes and depression. The null hypothesis that all the correlation matrices are homogeneous was rejected, χ2(df = 123) = 360.38, p < .001, suggesting that the correlation matrices are not homogeneous. An approximate prediction interval for each meta-analytic correlation can be obtained by adding and subtracting 1.96*τ (reported in Table 3) to and from the average correlation.

Table 3 Weighted descriptive statistics for the correlations of interest
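
As a worked example of the prediction interval calculation (the τ value here is hypothetical rather than taken from Table 3):

```r
# Approximate 95% prediction interval for a pooled correlation: r_bar ± 1.96*tau.
# r_bar is the pooled correlation reported above; tau is a hypothetical value.
r_bar <- 0.67
tau   <- 0.15
c(lower = r_bar - 1.96 * tau, upper = r_bar + 1.96 * tau)
#>  lower  upper
#>  0.376  0.964
```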

Assessing Publication Bias

Because almost two-thirds of the studies in this review are dissertations or theses, we did not expect that publication bias would have an important influence on the meta-analytic correlations we observed. Still, to be safe, we used the trim and fill approach to assess the possible impact of publication bias on these results. We ran the trim and fill analysis three times, once for each correlation of interest, using the metafor package in R (Viechtbauer, 2010). In two of the three cases, the trim and fill analysis did not impute studies to the left of the mean effect size (the direction we would expect to be censored if publication bias were operating). In the other case, which involved correlations between measures of automatic thoughts and measures of depression, the trim and fill procedure identified two possibly censored studies. Based on the large percentage of dissertations in our evidence base and on the results of the trim and fill analysis, we conclude that publication bias is unlikely to be an important influence on our results.
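
For readers who want to run a similar check, a minimal sketch using metafor appears below. This is not our exact code; the vectors ri and ni, holding the correlations and sample sizes for one construct pair, are assumptions.

```r
# A minimal sketch of a trim-and-fill check for one construct pair.
# Assumed objects: `ri`, the observed correlations, and `ni`, the sample sizes.
library(metafor)

dat <- escalc(measure = "ZCOR", ri = ri, ni = ni)  # Fisher's z and its variance
fit <- rma(yi, vi, data = dat, method = "REML")    # random-effects meta-analysis

trimfill(fit, side = "left")  # impute potentially censored studies on the left
```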

Testing a Structural Model

Because the proposed model is saturated, the model fit is perfect. Referring to Fig. 1, the estimated coefficients and their 95% LBCIs are 0.47 (0.42, 0.52) for path a (dysfunctional attitudes → automatic thoughts), 0.62 (0.56, 0.69) for path b (automatic thoughts → depression), and 0.11 (0.05, 0.17) for path c (dysfunctional attitudes → depression). As can be inferred from the confidence intervals, all individual paths are statistically significant at α = .05. The estimated direct effect (path c) is only 0.11, whereas the estimated indirect effect via paths a and b is 0.47 × 0.62 = 0.29 (0.24, 0.35), more than two and a half times as large as the direct effect. A likelihood ratio test rejects the null hypothesis that the direct effect and the indirect effect are the same magnitude in the population, χ2(df = 1) = 13.88, p < .001. This suggests that, consistent with the theory, the effect of dysfunctional attitudes on depression operates primarily through automatic thoughts; that is, the indirect effect is stronger than the direct effect.

Testing Study-Level Moderators in a Structural Model

We next tested three study-level characteristics that might moderate the relationships we observed. The first potential study-level moderator we examined was whether studies sampled children versus adults (including older adults). Because the moderator could affect all three path coefficients, we compared the models with and without the moderating effects. The test statistic was not statistically significant, χ2(df = 3) = 3.77, p = .29, suggesting that there is not enough evidence that the path coefficients in samples of children differ from those obtained from other samples. We conducted the same analysis comparing North American samples with samples from other locations. The test statistic was not statistically significant, χ2(df = 3) = 5.14, p = .16, suggesting that there is not enough evidence that the path coefficients differ between North American samples and samples from other locations.

Finally, we tested whether the path coefficients differed across general (k = 78), clinical (k = 32), and mixed (k = 15) samples. The test statistic was statistically significant, χ2(df = 6) = 25.41, p < .001, suggesting that there are differences in the path coefficients among these groups. Further analyses of the individual paths showed that they were all statistically significant at α = .05. Table 4 shows the path coefficients in the different samples. As can be seen, the primary difference is that for clinical samples, the path from dysfunctional attitudes to automatic thoughts is somewhat smaller (.28) than it is for other samples (.48 for general samples and .52 for mixed samples).

Table 4 Moderating effects on the path coefficients and their standard errors

Discussion

We had several goals in writing this primer. One was to highlight how the strengths of systematic reviewing, meta-analysis, and structural equation modeling can be used to help inform theories and practices relevant to prevention science. To this end, we brought together 104 studies, all of which had one thing in common: They provided a correlation between at least two of three constructs relevant to Beck’s model of depression. This model posits, in part, that dysfunctional attitudes lead to negative automatic thinking, which leads to depression.

Although the paper is primarily focused on articulating how MASEM can be used, based on our data we concluded that the indirect effect of dysfunctional attitudes on depression is substantially larger than the direct effect. Therefore, our results suggest that researchers developing prevention and treatment strategies for depression, and clinicians implementing these, should be aware of the importance of negative automatic thinking as a mediator of the relationship between dysfunctional attitudes and depression. We also tested three possible study-level moderators of this relationship. For two of these, we were unable to reject the null hypothesis that the path coefficients are equal across groups. For the other, which examined whether the study sample was recruited from a general population, a clinical population, or a mix of these two sources, results suggested that while the overall patterns of the relationships between the variables were similar, the path from dysfunctional attitudes to automatic thoughts was somewhat smaller for clinical samples than it was for other samples. In generating these conclusions, we outlined the major issues that potential users of MASEM will need to consider (model choice, dealing with dependent effect sizes, assessing publication bias, and whether to carry out attenuation corrections).

In highlighting how the strengths of systematic reviewing, meta-analysis, and structural equation modeling can be used to help inform theories and practices relevant to prevention science, we aimed not only to fulfill a tutorial function and show readers how this can be done, but also to convince readers of the value of doing so. To this end, it is worth emphasizing that of the three relationships we address in this paper, the one with the least amount of information (the relationship between dysfunctional attitudes and automatic thoughts) is based on more than 25 times the number of participants in the median study (3718 participants across 19 studies vs a median sample size of 143). And because these data were drawn from studies that varied along a number of theoretically important dimensions, there is much more opportunity to test mediation and moderation within the MASEM framework than there is in an individual study. Finally, while the results of an individual study are conditional on the specific operations used, a systematic review draws on multiple samples from diverse populations collected at different times, and these features almost certainly translate into better generalizability. All of these are reasons to consider adding MASEM to the toolbox of methods used to advance prevention science.