In research and clinical practice, psychopathology has traditionally been viewed in terms of distinct diagnostic categories. Recently, however, evidence has been accumulating for the presence of a general psychopathology dimension, referred to as the ‘p’ factor, that spans mental disorders (e.g., Caspi et al., 2014; Lahey et al., 2012, 2021; Ronald, 2019). The p factor is conceptualized as a transdiagnostic construct reflecting severity of overall psychopathology and/or the degree of comorbidity among a range of symptoms (Fried et al., 2021). The concept of the p factor is supported by an abundance of research demonstrating high comorbidity, transdiagnostic risk factors, and shared sequelae of various forms of psychopathology (Gili et al., 2019; Kessler et al., 2005; Nolen-Hoeksema & Watkins, 2011). The p factor shows high heritability and stability over time (Allegrini et al., 2020; Class et al., 2019; McElroy et al., 2018). Further, the p factor is associated with a number of relevant psychological constructs, such as functional impairment, psychiatric diagnoses, psychiatric medication use, family history of psychopathology, and neural correlates of mental disorders (Caspi et al., 2014; Moore et al., 2019; Pettersson et al., 2018).

Despite growing support for the use of the p factor in clinical research, questions remain about how the p factor is assessed. First, whereas research with adults has typically relied on self-reports to assess the p factor, research with youth typically takes a multi-informant approach to assessment wherein symptom ratings are obtained from youth and a collateral informant (typically a caregiver). Yet research on the multi-informant assessment of p is scarce, leaving major gaps in the literature. Second, estimating the p factor requires assessing a large number of symptoms that span a range of disorders, resulting in high reporter burden. Some recent evidence suggests p can be estimated more efficiently using computerized adaptive testing (CAT) procedures with little or no cost to construct or predictive validity (Moore et al., 2019). However, this research applied CAT only to youth self-reported symptoms. It is unclear whether applying CAT to parent-reported youth symptoms would yield similarly promising results. The present investigation aims to address these gaps in the literature.

Multi-Informant Assessment of Youth Psychopathology

It is a well-established practice for researchers and clinicians interested in youth mental health to obtain ratings from multiple informants (e.g., youth, caregiver, teacher). However, an abundance of evidence suggests that informants often disagree, sometimes quite substantially, in their ratings of various symptoms and behaviors (e.g., Achenbach et al., 1987; De Los Reyes et al., 2023). A meta-analysis of 25 years of multi-informant studies reported an average overall correlation of 0.29 between parent and youth reports of internalizing (r = 0.26) and externalizing (r = 0.32) symptoms (De Los Reyes et al., 2015). Discrepancies in reports of youth psychopathology symptoms have important clinical implications. For example, in inpatient psychiatric settings, greater discrepancies in youth and parent reports of youth symptoms have been prospectively associated with use of intensive and restrictive treatment regimens (e.g., standing antipsychotics, locked door seclusion; Makol et al., 2019). Further, in outpatient settings, parent-child discrepancies at the beginning of therapy predicted poorer overall treatment response (Goolsby et al., 2018), and increased parent-youth concordance over the course of treatment predicted better outcomes (Becker-Haimes et al., 2018).

To our knowledge, there has been very limited investigation of cross-informant agreement on youth p. Allegrini et al. (2020) used principal components analysis (PCA) to derive a p factor score from youth and parent reports of youth psychopathology at multiple time points and found that correlations among the first components extracted (i.e., the general components) across time points ranged from approximately 0.30 to 0.40, similar to the correlations reported by De Los Reyes et al. (2015). However, there are limitations to a PCA-derived p factor. Namely, PCA does not capture the hierarchical structure of psychopathology, with a general p factor and several sub-factors (see Lahey et al., 2021). Indeed, loadings on the first principal component are almost certainly biased by the underlying multidimensionality (Reise et al., 2011, 2015). Other statistical approaches, such as bifactor modeling, correct for this multidimensionality. Bifactor modeling (Reise, 2012; Reise et al., 2010) is unique in that it allows each item to load on two factors simultaneously: one specific factor capturing covariance among items in a sub-domain (e.g., externalizing) and one general factor (p) capturing covariance among all items. A critical advantage of bifactor modeling in this context is that it includes direct relationships between the general factor (p) and the items themselves. Contrast this with a second-order model (see Reise et al., 2010; Fig. 1c) in which items load on their specific factor only and these specific factors load on a second-order general factor. This configuration will not work in the present context because one cannot estimate item parameters on the general factor without a direct relationship between the item and the factor. Item parameter estimates (specifically, item response theory discrimination and threshold parameters) are necessary to build a CAT. In the present work, therefore, we employed bifactor multidimensional item response theory (MIRT) models (Reckase, 1997) to derive adolescent- and parent-reported p factor scores and evaluate agreement between them.
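As a schematic illustration (our notation, not estimated values), a bifactor loading matrix for four items and two specific factors has the form

$$ \Lambda = \begin{pmatrix} \lambda_{1p} & \lambda_{1s_1} & 0 \\ \lambda_{2p} & \lambda_{2s_1} & 0 \\ \lambda_{3p} & 0 & \lambda_{3s_2} \\ \lambda_{4p} & 0 & \lambda_{4s_2} \end{pmatrix}, $$

where the first column contains each item's direct loading on p. In a second-order model this column is absent; p influences items only indirectly through the specific factors, so there are no item-level parameters on p from which to build a CAT.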

Computerized Adaptive Testing (CAT)

In the extant literature, p factor scores have typically been derived from lengthy and time-consuming clinical assessments that include numerous items spanning a range of disorders and symptom clusters. Other areas of research and assessment (e.g., standardized testing, cognitive/IQ tests) have adopted adaptive procedures that substantially reduce testing burden while maintaining a high level of precision relative to the non-adaptive version. In CAT, after the first item is completed, an algorithm estimates the respondent's level on the trait being assessed and then chooses the most appropriate next item, where "most appropriate" is determined by the amount of information an item will provide at that examinee's estimated interim trait level. After the second item is completed, the algorithm uses both responses to re-estimate the trait level to select the next most appropriate item, and so on until some stopping criterion is reached. Recently, Moore et al. (2019) applied CAT to data from the Philadelphia Neurodevelopmental Cohort (PNC) study to create a publicly available adaptive screener for assessing p, which they called the Overall Mental Illness (OMI) screener. They reported that the CAT version (substantially shorter than the full form) performed nearly as well as the full version in predicting psychiatric diagnoses and brain parameters. However, Moore et al. (2019) used only youth report to create the adaptive assessment of p. We extended this work in the current investigation by applying CAT to both parent and adolescent reports of youth psychopathology in the PNC and examining agreement between CAT p factor scores. If p factor scores derived from CAT perform similarly to scores derived from lengthy p factor assessments, CAT could reduce assessment burden, and thus provide researchers and clinicians with an efficient way to collect multi-informant ratings of youth psychopathology.

The Present Study

In the present investigation, we utilized data from a large sample of youth and collateral informants in the PNC study (Calkins et al., 2015) to evaluate the degree of agreement between youth- and parent-reported p. The PNC includes over 5,000 youth-parent dyads who were independently administered a structured clinical interview by an assessor and thus provides a rich resource for robustly evaluating issues related to informant agreement. Prior multi-informant analyses with the PNC sample revealed substantial discrepancies in youth and parent reports of adolescent substance use (Jones et al., 2017a), suicidal ideation (Jones et al., 2019), and psychosis spectrum symptoms (Xavier et al., 2022). The present analyses extend this prior work by utilizing all the symptom-level data available and employing bifactor models to create overall psychopathology p factor scores for youth and parent reports. In addition to evaluating agreement between p factor scores, we also examined which items loaded most highly on youth- and parent-reported p to see which, if any, of the highest loading items were common across reporters. We also tested associations between each reporter’s p factor score and assessor-rated youth global functioning. We chose this criterion of clinical validity because, like the p factor, it is transdiagnostic. Equally important, the validity criterion was independently rated by an assessor, rather than reported by the adolescent and/or parent. This aspect of our study design addressed issues recently raised with use of mono-source paradigms for measurement validation (see De Los Reyes et al., 2023; Watts et al., 2022). Finally, we applied simulated CAT to parent and youth reports of psychopathology to determine whether adaptive testing substantially alters the degree of agreement between parent and youth reports of p or the associations with youth global functioning.

Method

Participants

The PNC includes 9,498 community youth between the ages of 8 and 21 years from the greater Philadelphia area. Participants were recruited from the Children’s Hospital of Philadelphia (CHOP) pediatric healthcare network. Importantly, participants were not recruited from mental health treatment centers; thus, the PNC is not enriched for individuals seeking psychiatric care. To be eligible for the PNC, participants had to (a) be aged 8–21 years, (b) be proficient in English, and (c) have no mental or physical conditions that could interfere with completion of study procedures.

Notably, for adolescents between the ages of 11 and 17 years, both the youth and a collateral informant completed a structured clinical interview that screens for a wide range of youth psychopathology symptoms. The present analyses included 5,060 adolescents (mean age = 14.54 years, SD = 1.98; 52% female) and collateral informants (87% mother/mother figure; 10% father/father figure; 3% other caregiver/legal guardian) from the PNC. The sub-sample was racially diverse: 57% White, 32% Black, and 11% other races (e.g., multiracial, Pacific Islander).

Procedures

After study procedures were described in detail, written parental consent and youth assent were obtained. Parents and youth were assessed independently and were informed that all their responses would be kept confidential, with the exception of legal reporting requirements (i.e., self/other harm, evidence of abuse). See Calkins et al. (2015) for additional details about study recruitment, sample, and procedures. The Institutional Review Boards of CHOP and the University of Pennsylvania approved all study procedures.

Measures

Clinical assessment. Trained assessors administered the computerized GOASSESS structured clinical interview to youth and collateral informants (Calkins et al., 2015). GOASSESS is derived from the Schedule for Affective Disorders and Schizophrenia for School-Aged Children (KSADS; Kaufman et al., 1997) and screens for major domains of psychopathology (e.g., mood, anxiety, attention/behavior, psychosis spectrum). In the present investigation, youth and parent p factor scores were derived from 107 dichotomous items from GOASSESS that assess a range of psychopathology symptoms (see Table S1 in Supplemental Materials for a list of all 107 items). Participants were also asked about distress/impairment associated with symptoms and about history of mental health treatment for mood or behavioral problems. Based on all information provided during the clinical interviews, the assessor assigned a global functioning score using the Children’s Global Assessment Scale (CGAS; Shaffer et al., 1983). GOASSESS has been validated as a psychopathology screener in numerous studies (e.g., Barzilay et al., 2019a; Calkins et al., 2015; Jones et al., 2017b, 2021; Moore et al., 2019; Satterthwaite et al., 2014). Previous bifactor analyses of the GOASSESS youth symptom data found an overall psychopathology factor (i.e., p) and four subfactors: (1) anxious-misery (e.g., depression, generalized anxiety disorder), (2) fear (e.g., panic disorder, phobias), (3) externalizing (e.g., ADHD, conduct disorder), and (4) psychosis spectrum symptoms (Calkins et al., 2015; Moore et al., 2019, 2023; Shanmugan et al., 2016).

Analytic Approach

Multidimensional Item Response Theory Models

To determine the optimal factor configuration for youth and collateral informants, we used exploratory multidimensional item response theory (MIRT; McDonald, 2000; Reckase, 1997) models implemented in the mirt package (Chalmers, 2012) in R. The number of factors to extract (four) was based on previous analyses of these data (Moore et al., 2019, 2023; Shanmugan et al., 2016), as well as the theoretical 4-factor structure of psychopathology. This 4-factor structure combines the original 3-factor structure from Krueger (1999), roughly comprising anxious-misery, fear, and externalizing, with a fourth factor, psychosis spectrum symptoms (Calkins et al., 2015; cf. Markon, 2010). Both subjective evaluation of the scree plot and the minimum average partial (MAP) method (Velicer, 1976) supported the choice of four factors. With the optimal item configuration determined by exploratory MIRT, a confirmatory bifactor MIRT model was fit to youth and collateral informant data. This allowed the general p factor to be estimated for each reporter with optimal weights uncontaminated by that reporter’s unique multidimensionality (Reise et al., 2015). Note that, if we wished to make claims about the “structure of psychopathology,” the above analyses would need to be performed in a cross-validated framework whereby the model configuration is determined in a separate sub-sample from the one used for the confirmatory analysis. However, we make no such claims here; confirmatory models were used only to calibrate the items needed for all downstream analyses. Item calibration produced parameter estimates (e.g., each item’s severity) that were then used as an item bank for performing CAT simulations and calculating p factor scores (interim and final).
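For illustration, this pipeline can be sketched in R with the mirt package. The following is a minimal sketch under stated assumptions, not our exact analysis code: `youth` is assumed to be an N × 107 matrix of dichotomous (0/1) GOASSESS responses, and `spec` a length-107 integer vector assigning each item to one of the four subfactors.

```r
library(mirt)

# Exploratory 4-factor MIRT model (2PL for dichotomous items)
expl <- mirt(youth, model = 4, itemtype = "2PL")
summary(expl, rotate = "oblimin")  # inspect loadings to assign items to subfactors

# Confirmatory bifactor model: each item loads on the general factor (p)
# plus the one specific factor indicated in `spec`
bf <- bfactor(youth, model = spec)
coef(bf, simplify = TRUE)          # item parameter estimates (the item bank)

# Full-form p factor scores: the first column of the EAP score matrix
# corresponds to the general factor
p_full <- fscores(bf, method = "EAP")[, 1]
```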

Computerized Adaptive Testing

The measurement theory facilitating CAT is item response theory (IRT; Embretson & Reise, 2000), which describes how latent traits relate to the probability of item responses. The simplest illustrative IRT model is the unidimensional two-parameter logistic model described by Eq. 1:

$$ p_i\left( X_i = 1 \mid \theta \right) = \frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}} $$
(1)

Where $p_i(X_i = 1 \mid \theta)$ is the probability of endorsing item $i$ (given $\theta$), $a_i$ is the discrimination of item $i$, $b_i$ is the item severity (or “difficulty”), and $\theta$ is the trait level (e.g., psychopathology severity) of the person. For example, if an item had discrimination ($a$) = 2.0 and severity ($b$) = 0.5, and an examinee had trait level ($\theta$) = 1.0, the probability that the examinee would endorse the item would be $e^{2.0(1.0 - 0.5)}/(1 + e^{2.0(1.0 - 0.5)}) = e^{1}/(1 + e^{1}) = 2.7183/(1 + 2.7183) = 0.731$. By providing an estimate of endorsement probability, the item parameter estimates ($a_i$ and $b_i$) provide an estimate of item “quality” at any specific point along the trait dimension, because knowing $a_i$ and the probability of endorsement allows one to calculate the information provided by that item:

$$ I\left( \theta \right) = a_i^2 \, p_i\left( \theta \right) q_i\left( \theta \right) $$
(2)

Where $I(\theta)$ is the information produced by the item, $a_i$ is the item discrimination, $p_i(\theta)$ is the probability of endorsement, and $q_i(\theta) = 1 - p_i(\theta)$ is the probability of non-endorsement. Using the example above, the information provided by that item ($a$ = 2.0) for that examinee ($\theta$ = 1.0) would be $2^2(0.731)(0.269) = 4(0.197) = 0.788$. Item information, in turn, relates to the standard error of measurement as:

$$ SE\left( \theta \right) = \frac{1}{\sqrt{I\left( \theta \right)}} $$
(3)

Where $SE(\theta)$ is the standard error of measurement, meaning that as information increases, error decreases. For the example above, $SE(\theta)$ would be $1/\sqrt{0.788} = 1.127$ standard deviations. A trait level ($\theta$) estimate from that one item would therefore have quite wide 95% confidence intervals around it, spanning roughly 2.2 standard deviations above and below. As additional items are administered, more information is accumulated, reducing the standard error.
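The running example can be verified numerically. The following base R snippet (a sketch, using the example’s hypothetical item and examinee values) reproduces Eqs. 1–3:

```r
a <- 2.0; b <- 0.5; theta <- 1.0   # hypothetical item and examinee

p    <- exp(a * (theta - b)) / (1 + exp(a * (theta - b)))  # Eq. 1: ~0.731
info <- a^2 * p * (1 - p)                                  # Eq. 2: ~0.786
se   <- 1 / sqrt(info)                                     # Eq. 3: ~1.13

round(c(p = p, info = info, se = se), 3)
```

(The text’s value of 0.788 for the information reflects rounding $p_i(\theta)q_i(\theta)$ to 0.197 before multiplying.)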

Equation 1 above can be expanded into a multidimensional model (Reckase, 2009) as:

$$ p_i\left( X_i = 1 \mid \theta_1, \theta_2 \right) = \frac{e^{a_{i1}(\theta_1 - b_i) + a_{i2}(\theta_2 - b_i)}}{1 + e^{a_{i1}(\theta_1 - b_i) + a_{i2}(\theta_2 - b_i)}} $$
(4)

Where $p_i(X_i = 1 \mid \theta_1, \theta_2)$ is the probability of endorsement of item $i$ given two different trait dimensions ($\theta_1$ and $\theta_2$), $a_{i1}$ is the item discrimination for dimension 1, $a_{i2}$ is the item discrimination for dimension 2, and $b_i$ is the item difficulty. However, while the present study does use a multidimensional model (bifactor; Reise, 2012), the focus is on only one dimension (i.e., the general factor).
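Eq. 4 generalizes to any number of dimensions. A minimal R sketch (with illustrative values, not estimates from our data):

```r
# Probability of endorsement under the multidimensional 2PL of Eq. 4;
# `a` and `theta` are vectors of the same length, one element per dimension
p_mirt <- function(a, theta, b) {
  z <- sum(a * (theta - b))
  exp(z) / (1 + exp(z))
}

# e.g., an item loading on a general dimension and one specific dimension
p_mirt(a = c(1.5, 0.8), theta = c(0.5, -0.2), b = 0.4)
```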

Given the ability to characterize item quality (information provided) as a function of examinee trait level, adaptive test administration follows naturally. Imagine an examinee of unknown trait level, assumed to be average (θ = 0). Each item in a collection (item bank) of calibrated items provides a known amount of information for an examinee at any trait level, and a test can start by administering the item with maximum information at θ = 0. The response to that item (endorsed or not) would result in an updated trait level estimate; for example, suppose the examinee endorsed it and θ is re-estimated to be 0.80. The next item to administer would be the one that provides maximum information at θ = 0.80. Suppose the examinee does not endorse this second item, resulting in a re-estimated trait level of 0.10. The item providing maximum information at θ = 0.10 would then be administered, and so on. The test stops when some pre-established stopping criterion is met. A common approach is to stop when the standard error [SE(θ) from Eq. 3 above] reaches a lower bound (e.g., 0.30).
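This logic can be made concrete with a short simulation function. The following is a minimal base R sketch, not the implementation used in our analyses: it assumes a calibrated 2PL bank with discrimination vector `a` and severity vector `b`, a person’s full response vector `resp` (0/1), and uses grid-based EAP updating for simplicity.

```r
simulate_cat <- function(a, b, resp, se_stop = 0.30) {
  grid  <- seq(-4, 4, length.out = 161)   # quadrature grid for theta
  post  <- dnorm(grid)                    # start from a N(0, 1) prior
  avail <- rep(TRUE, length(a))           # items not yet administered
  used  <- integer(0)
  repeat {
    theta <- sum(grid * post) / sum(post)                    # EAP trait estimate
    se    <- sqrt(sum((grid - theta)^2 * post) / sum(post))  # posterior SD as SE
    if (se <= se_stop || !any(avail)) break
    p_hat <- plogis(a * (theta - b))      # Eq. 1 at the interim theta
    info  <- a^2 * p_hat * (1 - p_hat)    # Eq. 2 for every item
    info[!avail] <- -Inf
    j  <- which.max(info)                 # most informative remaining item
    pj <- plogis(a[j] * (grid - b[j]))
    post <- post * (if (resp[j] == 1) pj else 1 - pj)  # Bayesian update
    avail[j] <- FALSE
    used <- c(used, j)
  }
  list(theta = theta, se = se, items = used)
}
```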

Using the above analytic framework, CAT sessions were simulated to calculate the score each person would have received had they taken an adaptive version of the psychopathology screener (and received far fewer items). This was possible because all participants had already answered all items (in the full form), meaning any item administered in the CAT simulation had a corresponding response from that person (given when they took the full version). A standard error cutoff of 0.30 was used in the simulations, meaning each simulated CAT stopped when the standard error reached that level, and the person received whatever score the algorithm had estimated at that point in the CAT sequence. Because CAT simulation scores are estimated in a Z metric (as is typical of CAT), they could be compared to the full form scores (estimated using the full item bank). CAT scores were estimated using the default Bayesian modal method (Birnbaum, 1969) common in CAT, and full-form scores were estimated using expected a posteriori (EAP; Bock & Mislevy, 1982) scoring.
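Continuing the sketch above (again as an illustration, not our production code, which used Bayesian modal rather than EAP interim scoring), the simulation amounts to:

```r
# resp: N x 107 response matrix; a, b: hypothetical item parameter vectors
# extracted from the calibrated model `bf`; p_full: full-form scores from above
sims <- apply(resp, 1, function(r) simulate_cat(a, b, r))

cat_scores <- sapply(sims, function(s) s$theta)
mean(sapply(sims, function(s) length(s$items)))  # average CAT test length

# Within-reporter agreement between CAT and full-form p factor scores
cor(cat_scores, p_full)
```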

p Factor Agreement and Associations with Youth Global Functioning

We used Pearson correlations to evaluate youth-parent agreement on p factor scores and the associations between each reporter’s p factor scores and youth CGAS scores. To compare youth-parent agreement on p in the full version versus the CAT version, we used Steiger’s test (Steiger, 1980). To test whether the magnitude of the correlation between p factor and CGAS scores differed by reporter (parent vs. youth) or version (full vs. CAT), we used Williams’ test (Williams, 1959). Given the large sample size and the number of comparisons, we set a stringent significance threshold of p < .001.
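Tests for dependent correlations of this kind are available in standard R packages. A sketch with the psych package and illustrative values (not our exact inputs):

```r
library(psych)

# Williams-type test: does the p-CGAS correlation differ by reporter?
# Variable 1 = CGAS, 2 = youth p, 3 = parent p; r23 is youth-parent agreement.
r.test(n = 5060, r12 = -.48, r13 = -.46, r23 = .44)
```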

Results

Bifactor Model Results

Fit of the bifactor models with four subfactors was acceptable for youth (CFI = 0.97, SRMR = 0.04, RMSEA = 0.03) and collateral informants (CFI = 0.97, SRMR = 0.05, RMSEA = 0.03). In addition, while interpretation is beyond the present scope, it is common to report several indices specific to bifactor models (Rodriguez et al., 2016a); these are presented in Supplementary Tables S2 and S3 for youth and collateral informants, respectively. These indices (such as omega, omega-hierarchical, and factor determinacy) are used to judge properties such as (1) factor reliability, especially that unique to a specific factor (e.g., externalizing relative to p), (2) the extent to which factor scores can be expected to represent the factor used to calculate them, and (3) the proportion of inter-item correlations that are “uncontaminated” by multidimensionality, among other similar properties. All general factor indices were within the acceptable range (Rodriguez et al., 2016b).
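Such indices can be approximated in R. A rough sketch with the psych package (a Schmid-Leiman-based approximation rather than values from the fitted MIRT model; `poly = TRUE` requests tetrachoric correlations for the dichotomous items):

```r
library(psych)

om <- omega(youth, nfactors = 4, poly = TRUE)
om$omega_h    # omega-hierarchical for the general factor
om$omega.tot  # total omega
```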

Bifactor Item Loadings

The 10 highest loading items for youth- and parent-reported p are presented in Table 1 (see Table S1 in Supplemental Materials for all item loadings). It is noteworthy that 7 out of 10 items were common across reporters and that most of the items capture symptoms of obsessive-compulsive disorder (OCD). For example, the GOASSESS item assessing intrusive and repetitive bad or forbidden thoughts was the highest loading item on p factor scores for both parents and adolescents.

Table 1 Top 10 Highest Loading Items in Bifactor Models

CAT Simulations

In CAT simulations, youth were administered 46 items on average (range = 7–107) and parents were administered 47 items on average (range = 6–107). This corresponds to approximately 43% of the original (full form, 107-item) test length across reporters. Notably, OCD symptoms were not among the 10 most frequently used items in the CAT simulations, and there was minimal item overlap between youth and parents (see Table 2). The items most frequently used in CAT were related to mood, externalizing, and psychosis spectrum symptoms. Nevertheless, the within-reporter correlation between the full version p factor score and the CAT-derived p factor score was high for both youth report (r = 0.95) and parent report (r = 0.95). Exploratory analyses including all 214 parent and youth items combined are reported in Supplemental Materials.

Table 2 Top 10 Most Frequently Administered Items in CAT Simulation

Agreement between Youth- and Parent-Reported p Factor Scores

The correlation between youth- and parent-reported p factor scores on the full form was moderate, r = 0.44, p < .001. Parent-youth agreement on CAT-derived p factor scores was similar to the full form version, r = 0.40, p < .001. The difference in the magnitude of these correlations, although small (0.04), was statistically significant, p < .001.

Associations of p Factor Scores with Overall Functioning

Both youth- and parent-reported p (full and CAT versions) were negatively associated with assessor-rated youth global functioning (rs ranged from −0.44 to −0.48, ps < .001; see Table 3). Within each reporter, using CAT only slightly reduced the magnitude of the association between p and CGAS scores (by 0.03 for youth and 0.02 for parents, ps < .001; rows in Table 3). Across reporters, the correlation between p factor scores (full or CAT version) and CGAS scores did not significantly differ for youth report versus collateral informant report (columns in Table 3).

Table 3 Associations between p and Youth Global Functioning

Discussion

Interest in a general psychopathology p factor has grown rapidly in recent years among researchers, clinicians, and psychometricians. Yet several important questions about the p factor have been insufficiently addressed. Although multi-informant assessment of youth psychopathology is considered part of “best practices” in clinical research and practice, almost no research has taken a multi-informant approach to studying the p factor (see Allegrini et al., 2020 and Watts et al., 2022, for exceptions). Our results fill several gaps in the literature related to cross-informant agreement on p and the clinical validity of the p factor when reported by youth and a collateral informant. In addition, our results suggest that lengthy clinical assessments used to generate p factor scores could be substantially abbreviated through adaptive testing procedures with minimal consequences to parent-youth agreement or construct validity.

It is noteworthy that among the 10 items that loaded most highly on youth- and parent-reported p in the bifactor models, seven of the items were common across reporters. The majority of the highest loading items for each reporter assessed symptoms of OCD and were related to repetitive bad thoughts and repetitive behaviors. Other high loading items across reporters were related to mood dysregulation. In some of the earliest research on the p factor, Caspi et al. (2014) found that thought disorder symptoms (which included OCD symptoms) loaded most highly on the general psychopathology factor. This finding is consistent with recent evidence suggesting that OCD symptoms among youth in the PNC study, particularly repetitive bad thoughts, are associated with increased risk for depression, psychosis, and suicidal ideation, indicating thought disturbance as a transdiagnostic risk factor for psychopathology (Barzilay et al., 2019b). The study by Barzilay and colleagues relied only on youth self-reported OCD symptoms. The present results extend this prior work by suggesting that parent-reported youth OCD symptoms may meaningfully capture general psychopathology risk. Further, researchers have recently proposed a general cognitive vulnerability factor (dubbed the ‘c’ factor) that is significantly associated with the p factor and is a transdiagnostic risk factor for multiple psychopathologies in youth (Schweizer et al., 2020). Our results further support thought disturbance (particularly intrusive repetitive thoughts) as a major underlying component of the p factor, and it is noteworthy that this was the case across reporters.

However, it is important to note that there is ongoing debate about what the p factor captures and the possible underlying mechanisms that might explain a single general dimension of psychopathology (see Watts et al., 2024, for a review of key issues). One important criticism is that there is currently no satisfactory explanation for why p exists; there are theories, but they are often unfalsifiable or otherwise unsatisfactory. Researchers have proposed several mechanisms that are transdiagnostic and cut across a range of disorders, including dispositional negative affectivity, emotion dysregulation, impairment, and thought dysfunction (Caspi & Moffitt, 2018; Deutz et al., 2020; Lahey et al., 2021; Smith et al., 2020; Tackett et al., 2013). Further compounding the problem, the strongest indicators of p vary from study to study, such that one study’s p factor might be strongly determined by psychosis symptoms while another’s is most strongly determined by mood symptoms (Watts et al., 2024). These differences are partially attributable to differences in measurement approach (as there is no standardized assessment of p) and sample characteristics across studies. Notably, even within the same sample, the strongest indicators of the p factor vary across reporters: Allegrini et al. (2020) found that autism traits and externalizing problems loaded most highly on parent- and teacher-reported p, whereas internalizing symptoms loaded most highly on youth-reported p.

This lack of replication across p factor studies is a problem for the field, further exemplified by our finding that OCD symptoms most strongly defined the p factor in the present study. However, we argue this problem does not imply that p is an illusory or false construct. That p exists is demonstrated by the consistent finding that virtually any combination of psychopathology items (or collection of psychopathology measures) produces a correlation matrix in which all items correlate in the same direction. For researchers who wish to parse specific types of psychopathology, this “positive manifold” of symptoms is a nuisance: it results in collinearity, often severe, and causes uncertainty about the optimal way to model the distinct disorders or dimensions (such as internalizing versus externalizing). However, in some cases (as here), the distinctions among disorders or dimensions are less important; the construct of interest may simply be overall psychopathology or mental health risk. In these cases, the “positive manifold” works in the researchers’ favor, and the multidimensionality is a nuisance (accounted for by the bifactor configuration). To summarize, although we may not have a satisfactory explanation of what p is, it likely exists, serves some research purposes (e.g., as a general indicator of mental health risk), can be compared across parent and youth reports, and correlates with relevant psychological constructs (e.g., functioning). It is therefore worthwhile to pursue ways of making the assessment of p, as operationalized here, more time efficient than querying the entire range of symptoms. Such time savings could be particularly important in low-yield populations such as community samples or healthy controls in clinical studies. An efficient measure of p could also be used as a screening tool, enabling identification of individuals for whom more detailed evaluation is indicated.

In contrast to the bifactor loadings, OCD symptoms were not among the most frequently used items in CAT simulations for either reporter. Critically, a high factor loading (analogous to a high item discrimination parameter, where higher is preferable) does not necessarily indicate a “high-quality” item. This is because item quality depends not only on discrimination but also on difficulty (here, severity): items of extreme severity (high or low) tend to be less informative on average. The OCD symptoms here are a good example: they have very high loadings (and discrimination parameters) but were not selected often in the CAT simulations because they tend to be severe/rare. Further, unlike the highest loading items, which were largely similar across reporters, there was little overlap between reporters in the items most frequently selected in the CAT simulations. The youth-reported items most frequently selected in CAT were related to internal experiences and, interestingly, 40% of them assess psychosis spectrum symptoms. By contrast, the parent-reported items most frequently selected in CAT largely capture externalizing symptoms, and no psychosis spectrum items were selected. These results are consistent with prior work suggesting that parents may be more attuned to observable, problematic behaviors but may miss less salient internal states that youth experience (De Los Reyes et al., 2015; Xavier et al., 2022). Despite these differences, correlations between full form p factor scores and CAT-derived p factor scores were very high regardless of reporter (r = .95).

Youth-parent agreement on the full form p factor was moderate (r = .44). This correlation compares favorably to the average level of parent-youth agreement on youth psychopathology symptoms reported in studies over a 25-year period (r = .29; De Los Reyes et al., 2015). Unlike prior cross-informant studies, which have typically compared average or sum scores across reporters on a particular category of symptoms (e.g., depression symptoms), we examined agreement on a latent general psychopathology factor derived from ratings on a range of symptoms spanning multiple disorders. In bifactor models of psychopathology, the general factor loadings are optimized by accounting for the correlations among the symptoms specific to each sub-factor (Lahey et al., 2021), meaning that in some cases (as here) the specific sub-factors are modeled merely as “nuisance” dimensions to arrive at the true p (general factor) loadings. Consistent with the growing endorsement of a dimensional rather than categorical approach to psychopathology, symptoms are correlated across disorders and dimensions (e.g., internalizing/externalizing), and youth may present with a range of symptoms that do not fit neatly within a single disorder category. Thus, as our results suggest, a latent dimensional approach to capturing a general psychopathology factor may yield better youth-parent agreement on psychopathology than comparing mean scores on disorder-specific assessments.

Higher concordance between adolescent and parent reports of youth psychopathology is not a trivial result, as parent-youth divergence in symptom reports has important clinical implications. For example, in a large community sample, parent-adolescent disagreement in reports of youth symptoms was prospectively associated with youth self-harm, substance use, and referral to mental health services (Ferdinand et al., 2004). Further, in clinical samples, pre-treatment divergence in symptom reports has been associated with less parental involvement in therapy and poorer treatment outcomes among anxious youth (Becker-Haimes et al., 2018; Israel et al., 2007). Importantly, reducing reporter discrepancies over the course of treatment was associated with better treatment outcomes (Becker-Haimes et al., 2018). Based on these prior results, and the degree of concordance of p factor scores reported in the present study, additional research on parent-youth agreement on p and its implications for clinical outcomes is warranted.

A critical barrier to implementing the p factor approach to multi-informant assessment of youth psychopathology is informant response burden. Assessing the broad range of symptoms from which a p factor score is derived takes substantial time and effort (e.g., a clinical interview or comprehensive self-report battery), and in a multi-informant context, which is typical in youth clinical assessment, this measurement burden is doubled. Such assessment is likely not feasible in many clinical and research settings. Thus, investigations into ways to abbreviate p factor assessments are warranted. In the current work, our p factor score was derived from over 100 items rated by each reporter. Importantly, our results suggest that using CAT could substantially reduce burden on informants, cutting the average number of items administered by more than half with minimal impact on parent-youth concordance or the clinical validity of p factor scores. All differences in correlations between full form and CAT-derived p factor scores across reporters were ≤ 0.04. In previous research with the PNC sample, Moore et al. (2019) demonstrated that a youth p factor score could be derived from as few as 10 items and still show clinical validity comparable to that of the full form version derived from over 100 items. Thus, adaptive assessments that include a broad range of psychopathology symptoms may be a promising and efficient approach to multi-informant assessment of youth psychopathology.

Our results further support the construct validity of a general psychopathology p factor among adolescents. Both youth and parent reports of p (full form and CAT versions) were negatively associated with youth global functioning. It is noteworthy that, in this study, youth global functioning was independently rated by an assessor rather than self-reported by youth and/or parents. This methodological approach addressed important issues recently raised related to mono-informant reports for measurement validation (De Los Reyes et al., 2023; Watts et al., 2022). Some researchers have proposed that the p factor captures impairment, rather than severity of psychopathology (Smith et al., 2020; Watts et al., 2020). Psychopathology, by definition, is characterized by functional impairment; therefore, a significant association between a general psychopathology factor and impairment is expected. However, the moderate correlations between p factor scores and assessor-rated functioning reported in this study (rs from −0.44 to −0.48) suggest that p captures more than just impairment. An important direction for future research is to further elucidate the potential mechanisms underlying the general psychopathology factor by including independent assessments of putative processes and examining associations with p factor scores.

Limitations and Future Directions

The present findings should be interpreted in the context of several study limitations that point to future avenues for research. Our results highlight the potential benefits of employing the p factor approach in multi-informant assessments of youth psychopathology and the ability to reduce response burden through adaptive testing. However, a barrier to implementing this strategy is that there is no widely used, standardized measure of p or its sub-factors. Moore and colleagues (2019) created a publicly available Overall Mental Illness (OMI) screener (full and CAT versions) to measure p using items from the PNC study. The initial psychometrics of this measure are promising, and the results of the present study further support it. Additional research aimed at validating and standardizing a comprehensive measure of p across diverse sociodemographic groups would benefit the field.

Use of the general psychopathology p factor in research has increased substantially in recent years and evidence continues to accumulate to support its construct validity. However, there are ongoing debates about p from both a statistical angle and a conceptual perspective (e.g., Lahey et al., 2021; Watts et al., 2020, 2024). Statistically, there are debates about the use of bifactor models in p factor research and what they can and cannot tell researchers about model validity and causality (e.g., Heinrich et al., 2023; Watts et al., 2024). One concern is that bifactor models will always tend to fit better than alternative models simply because they are less parsimonious (i.e., more parameters, or loadings, are estimated in the bifactor model), and this superior fit is often erroneously used to justify or argue in favor of the p factor, the bifactor model, or both. However, there are many (perhaps most) cases where the lower parsimony (higher complexity) of the bifactor model is necessary to obtain unbiased parameter (loading) estimates, because the bifactor model does not impose proportionality constraints on the loadings (Gignac, 2016). For example, Supplemental Figure S1 shows two theoretical bifactor configurations, where the loading pattern in panel “a” fulfills the proportionality assumption of the second-order model (if it were estimated instead of the bifactor) and the loading pattern in panel “b” does not. The model in panel “a” is an example of a case where the bifactor model is unnecessary; the general and specific factor loadings are proportional, so all information could be captured in a second-order model. Any superior fit of a bifactor model in the panel “a” example would be due only to overfitting caused by the added complexity of the bifactor. However, the information in the panel “b” model could not be captured by a second-order model; only a bifactor model could estimate those loadings accurately. This issue is clarified in Moore and Lahey (2022), in which Fig. 1 shows what happens to score estimates (i.e., how wrong scores can be) as the proportionality assumption of second-order models is increasingly violated. We have reason to believe that the true factor structure underlying item responses in the present study is more like panel “b” than panel “a”, because (1) the estimated loadings of the bifactor show clear disproportionality (e.g., general/specific loading ratios ranging from 0.59 to 2.87 for the psychosis spectrum items in youth) and (2) model fit of the bifactor model is far better than that of the second-order model, beyond what would be expected from the overfitting example given above. Conceptually, as noted above, there is no agreed upon theory or mechanism to explain a single general dimension of psychopathology (but see Lahey et al., 2017). Future research using the p factor will continue to inform these important issues.
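To state the proportionality constraint concretely (our notation; see Gignac, 2016): a second-order model implies that each item’s general-factor loading equals the product of its specific-factor loading and that subfactor’s loading ($\gamma_s$) on the general factor,

$$ \lambda_{i,p} = \gamma_s \, \lambda_{i,s} \quad \Rightarrow \quad \frac{\lambda_{i,p}}{\lambda_{i,s}} = \gamma_s \;\; \text{for every item } i \text{ on subfactor } s. $$

General-to-specific loading ratios that vary across items within a subfactor, such as the 0.59 to 2.87 range noted above, therefore cannot be reproduced by any second-order model.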

The present study included a community sample of youth and parents from a major metropolitan area in the northeastern United States. It is unclear whether our results would generalize to clinical samples or to samples from other areas. Further, for our purposes in this study, the exploratory and confirmatory MIRT models and CAT simulations were all performed on the same sample. Future research could employ a cross-validation approach in which item calibration and CAT simulations are performed on independent samples.

Conclusion

In sum, we observed moderate parent-youth agreement on p and significant associations between p and youth global functioning across reporters. Further, our results suggest that the degree of parent-youth agreement and associations between p and global functioning are only slightly diminished by reducing reporting burden through adaptive testing procedures. Although there are ongoing debates and unanswered questions about the p factor (Watts et al., 2024), these novel results highlight the promise and potential clinical utility of a multi-informant p factor approach and set the stage for additional investigations of youth p.