Introduction

Typically, behavioural disorders are classified into two broad categories: Externalizing Disorders (ED; e.g. Attention-Deficit/Hyperactivity Disorder, ADHD; Conduct Disorder, CD; and Oppositional Defiant Disorder, ODD) and Internalizing Disorders (ID; e.g. Anxiety Disorders and Major Depression) [1]. Although structured diagnostic interviews are clearly the gold standard for assessing these disorders [2], these procedures are costly. Thus, self- or informant-reported questionnaires are often used as a first step in community screenings or large-scale community studies [3].

The Strengths and Difficulties Questionnaire (SDQ) [4] was developed as an extension of Rutter’s parent questionnaire [5, 6], and has become one of the most commonly used instruments for measuring psychopathological symptoms in school-age children and adolescents. All versions of the SDQ (parent and teacher versions for children and adolescents aged 4 to 17 years, and a self-report version for adolescents aged 11 to 17 years) comprise 25 items rated on a 3-point scale (“Not true”, “Somewhat true”, or “Certainly true”). Five items (7, 11, 14, 21, 25) are reverse-scored. The SDQ is available free of charge for non-commercial purposes (www.sdqinfo.com) in 40 languages [7]. Based on initial principal component analyses (PCA), five component scores are generally formed. Four reflect behavioural Difficulties (Emotional Symptoms, Conduct Problems, Hyperactivity-Inattention and Peer Problems) and one reflects behavioural Strengths (assessed through Prosocial behaviours). It logically follows that the former four subscales should combine into a higher-order Difficulties factor negatively correlated with the latter Strengths/Prosocial Behaviours (S/PB) factor. Another possible structure would combine the Hyperactivity-Inattention and Conduct Problems subscales into a higher-order ED factor, and the Emotional Symptoms and Peer Problems subscales into a higher-order ID factor [8]. Although these alternative structures make sense and each has received some support, there appears to be a need for clarification regarding the optimal factor structure of the SDQ.
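To make the scoring concrete, the sketch below computes the five subscale scores from a single rating. It is a minimal illustration only: it assumes the conventional 0-1-2 coding of the response categories and the standard item-to-subscale key published at www.sdqinfo.com, and is not taken from the present study.

# Minimal sketch of conventional SDQ scoring (assumed coding: "Not true" = 0,
# "Somewhat true" = 1, "Certainly true" = 2); item-to-subscale mapping follows
# the standard published scoring key.
SUBSCALES = {
    "emotional":     [3, 8, 13, 16, 24],
    "conduct":       [5, 7, 12, 18, 22],
    "hyperactivity": [2, 10, 15, 21, 25],
    "peer":          [6, 11, 14, 19, 23],
    "prosocial":     [1, 4, 9, 17, 20],
}
REVERSED = {7, 11, 14, 21, 25}  # positively worded items, reverse-scored

def score_sdq(responses):
    """responses: dict mapping item number (1-25) to a 0/1/2 rating."""
    recoded = {i: (2 - r) if i in REVERSED else r for i, r in responses.items()}
    scores = {name: sum(recoded[i] for i in items) for name, items in SUBSCALES.items()}
    # Total Difficulties (0-40) sums the four difficulty subscales, excluding Prosocial.
    scores["total_difficulties"] = sum(
        scores[k] for k in ("emotional", "conduct", "hyperactivity", "peer"))
    return scores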

For this purpose, we systematically searched the MEDLINE, EMBASE, ERIC, PsycINFO and ScienceDirect databases for papers published between January 1st 1995 and December 31st 2013, using the strings “SDQ” or “Strengths and Difficulties Questionnaire” combined with “factor analysis” or “factor structure”. Based on available abstract and/or full-text content, we discarded papers that: (a) reported on a different instrument with the same acronym (e.g. the Self-Description Questionnaire); (b) did not have the structure of the SDQ as their main focus; (c) did not analyse all 25 items of the SDQ; or (d) only reported research conducted on special populations (e.g. intellectually disabled children) or on preschool children using a specific version of the SDQ. This systematic review encompasses 54 publications, which are summarized in Tables S1 (self-report version), S2 (parent version), and S3 (teacher version) of the online supplements, together with their references. For each study, we report the sample size, age range, language version, country where the study was conducted, and whether analyses were conducted in specific subsamples (e.g. males and females). We also report the method used to analyse the data (exploratory factor analysis, principal component analysis, or confirmatory factor analysis), and whether the number of factors was fixed a priori to 5 in a confirmatory manner or determined through an exploratory approach.

First-order factor structure

Studies provide some support for the a priori first-order 5-factor structure, but many suffer from important limitations, such as the reliance on principal component analyses (PCA), which are not suited to the analysis of the underlying structure of psychological constructs [9], and the reliance on exploratory procedures when an a priori structure has already been defined [10, 11]. Furthermore, studies relying on confirmatory factor analyses (CFA) failed to provide clear and unequivocal support for the adequacy of this a priori structure, showing the fit to be “good” in 20 cases, “acceptable” in 22 cases and “poor” in 24 cases.

Studies also generally failed to support the adequacy of two alternative 3-factor structures reflecting S/PB, ID and ED: one where S/PB is defined using the five a priori items (1, 4, 9, 17, 20), and the Dickey and Blumberg [12] model where the S/PB factor was refined through exploratory procedures and comprised 8 items (adding items 7, 11, 14). In this model, the ED factor was defined through 9 items (2, 5, 10, 12, 15, 18, 21, 25 and item 7, which cross-loaded on S/PB), the ID factor was defined through 8 items (3, 6, 8, 13, 16, 19, 23 and 24), and item 22 (“Steals from home, school or elsewhere”) was discarded.

Higher-order factor structure

A total of 6 studies estimated a model including one higher-order Difficulties factor, defined from the four first-order difficulty factors and allowed to correlate with the S/PB factor. Only one of these studies clearly supported this model, whereas three clearly failed to support it. Another model, including two correlated higher-order factors representing ED and ID, both allowed to correlate with the S/PB factor, was tested in 3 studies. These studies provided only partial support for this structure, limited to specific subsamples or SDQ versions. Thus, although the question of whether SDQ items form global constructs over and above the five specific subscales appears important, it remains unclear whether a higher-order model is the best way to explore this issue.

Alternative representations

In psychiatric measurement, a crucial question is whether a primary dimension (e.g. ED) exists as a unitary construct encompassing specificities, or whether these specificities rather define distinct facets without a common core (i.e. a first-order CFA). Higher-order models, where higher-order factors are defined from the covariance among first-order factors, represent one way of addressing this issue [13, 14]. However, bifactor models provide a more flexible alternative [15, 16], based on the assumption that an f-factor solution exists for a set of n items with one global (G) factor and f-1 specific (S) factors. Bifactor models can easily be expanded to include more than one G-factor. The S-factors are typically specified as uncorrelated (orthogonal) with one another and with the G-factor(s). The Schmid–Leiman transformation (SLT) [17] can be used to convert a higher-order model into a bifactor approximation. However, each item’s association with the SLT G- and S-factors is obtained by multiplying its first-order loadings by constants, resulting in a ratio of G- to S-factor loadings that is exactly the same for all items associated with a given first-order factor. This is one reason why true bifactor models are more flexible and tend to provide a better fit to the data than higher-order models [13, 15, 18]. Recently, bifactor models have been found to provide a superior representation of ADHD [19, 20] and depression [21] relative to higher-order models. Similar results have also been found for the SDQ, unfortunately without using all 25 items [22].
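To illustrate the proportionality constraint implied by the SLT, the sketch below converts hypothetical higher-order parameters (loadings invented for the example, not taken from any SDQ dataset) into G- and S-loadings; within each first-order factor, the ratio of the two is constant.

import math

def schmid_leiman(first_order, higher_order):
    """Schmid-Leiman transformation of a higher-order model.
    first_order:  dict mapping factor name -> {item: first-order loading}
    higher_order: dict mapping factor name -> loading of that factor on the G-factor
    Returns (g_loadings, s_loadings)."""
    g_loadings, s_loadings = {}, {}
    for factor, items in first_order.items():
        gamma = higher_order[factor]
        residual = math.sqrt(1.0 - gamma ** 2)      # residualized first-order factor
        for item, lam in items.items():
            g_loadings[item] = lam * gamma          # item's loading on the general factor
            s_loadings[item] = lam * residual       # item's loading on the specific factor
            # g / s == gamma / residual for every item of this factor: the SLT forces a
            # constant G-to-S loading ratio, which a true bifactor model does not impose.
    return g_loadings, s_loadings

# Example with hypothetical values:
g, s = schmid_leiman({"hyperactivity": {2: 0.7, 10: 0.6}}, {"hyperactivity": 0.8})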

Gender and age/grade similarities and differences

An important test of the generalizability of a measurement model has to do with the possibility of replicating results across multiple meaningful subgroups of participants and the demonstration that unbiased group comparisons are possible. This verification requires systematic tests of measurement invariance [23, 24]. Among the reviewed studies, a handful separately estimated and compared the SDQ measurement models across meaningful subgroups of participants (defined by origin, gender, age or grade, or combinations of these) [25, 26]. These studies generally report similar measurement models across subgroups, although they sometimes suggest variations as a function of age or grade. However, the results from the 9 studies that conducted systematic tests of measurement invariance generally supported some level of invariance of the SDQ measurement model across genders, age/grade levels, languages, or informants.

An important test of the discriminant validity of a measure lies in its ability to recover known group differences in the constructs of interest. Importantly, measurement invariance should be verified prior to tests of group-based mean differences. In relation to the SDQ, English norms (http://www.sdqinfo.com/norms/UKNorm2.pdf) show that boys tend to present higher levels than girls on the first-order Conduct (Cohen’s d = 0.39), Hyperactivity (Cohen’s d = 0.60), and Peer Relationships (Cohen’s d = 0.17) factors, and lower levels on the S/PB (Cohen’s d = 0.56) factor. Similarly, younger children are known to present higher levels on the Emotional (Cohen’s d = 0.11), Hyperactivity (Cohen’s d = 0.15) and S/PB (Cohen’s d = 0.08) first-order factors, although these effect sizes are negligible (http://www.sdqinfo.com/norms/UKNorm3.pdf). Among the reviewed studies, only one systematically explored latent mean differences as a function of gender after having established the measurement invariance of the model [27]. This study showed that boys tended to present higher scores on the Conduct Problems, Hyperactivity-Inattention, and Peer Problems factors than girls, who tended to present higher levels of S/PB.
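For reference, the effect sizes above are Cohen’s d values; the sketch below shows the usual pooled-standard-deviation computation on illustrative data (not the UK norm samples).

import statistics

def cohens_d(group_a, group_b):
    """Cohen's d: standardized mean difference using the pooled SD of two groups."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical subscale scores for two groups:
print(cohens_d([4, 5, 6, 7, 3], [3, 4, 4, 5, 2]))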

The present study

In the present study, we aim to provide a comprehensive test of the complete factor structure of the teacher version of the French SDQ. After contrasting alternative representations of the first-order structure, we investigate the more global constructs present in the SDQ (i.e. ED, ID or Difficulties) using alternative higher-order and bifactor models. We then test whether the best-fitting model is invariant across groups formed on the basis of gender (boys/girls) and school level (kindergarten, primary, secondary) to ascertain whether answers provided to the SDQ can be meaningfully compared across these groups. We then verify whether well-documented group-based differences, or lack thereof, in latent means can be replicated.

Methods

Participants, material, and procedures

This paper uses data from the ChiP-ARD (Children and Parents with ADHD and Related Disorders) study, targeting French children and adolescents from the general population aged between 4 and 18 years [19, 20, 28]. Overall, 262 teachers participated in the study (mean age = 43.9; SD = 8.6; range = 24–61); 47 were male (17.94 %). Each was asked to rate 2–4 youths from their classes whose names began with a letter randomly drawn from the alphabet. The official French adaptation of the teacher version of the SDQ for 4- to 17-year-olds was obtained from the official website (http://www.sdqinfo.org). SDQ ratings were returned for a total of 889 youths (including 455 girls, 51.18 %): 132 attended kindergarten (14.85 %; including 64 girls), 350 attended primary schools (39.37 %; including 174 girls), and 407 attended secondary schools (45.78 %; including 217 girls). Girls were aged on average 5.69 (SD = 0.29) years in kindergarten, 8.62 (SD = 1.54) years in primary school, and 13.87 (SD = 2.16) years in secondary school. Boys were aged on average 5.65 (SD = 0.36) years in kindergarten, 8.61 (SD = 1.51) years in primary school, and 13.47 (SD = 1.75) years in secondary school. The Commissioner of Education and the Department of Education supported this study, which complied with normative ethical prescriptions for French medical research. The Commission Nationale Informatique et Liberté approved the procedures used to keep the data secure and anonymous.

Analyses

The main models were estimated with Mplus 7.11 [29] from polychoric correlation matrices using robust weighted least squares (WLSMV) estimation, which has been found to outperform Maximum Likelihood with ordered-categorical items involving five or fewer answer categories [30–34]. The fit of five alternative a priori first-order models was contrasted: (M1) a one-factor model defined from all SDQ items; (M2) a model including 2 correlated factors (Strengths, defined from all 5 prosocial items, and Difficulties, defined from the other 20 items); (M3) a model including 3 correlated factors [ID (10 items), ED (10 items), and S/PB (5 items)]; (M4) a 3-factor model defined according to the Dickey and Blumberg [12] specifications [ID (8 items), ED (9 items), and S/PB (8 items)]; (M5) the a priori SDQ model including 5 correlated factors (5 items each). Assuming that the a priori 5-factor model provided the highest level of fit to the data, two higher-order factor models were then contrasted: (M6) a model including a single higher-order Difficulties factor defined on the basis of the 4 first-order difficulty factors (Emotional Symptoms, Conduct Problems, Hyperactivity-Inattention and Peer Problems) and correlated with an S/PB factor; (M7) a model including two correlated higher-order Internalizing (defined on the basis of the Emotional Symptoms and Peer Problems first-order factors) and Externalizing (defined on the basis of the Conduct Problems and Hyperactivity-Inattention first-order factors) Disorder factors, correlated with an S/PB factor. Likewise, two bifactor models were contrasted: (M8) a model including a single Difficulties G-factor defined on the basis of all difficulty items, associated with four S-factors and correlated with an S/PB factor; (M9) a model including two correlated ID (defined on the basis of all items associated with the Emotional Symptoms and Peer Problems S-factors) and ED (defined on the basis of all items associated with the Conduct Problems and Hyperactivity-Inattention S-factors) G-factors, correlated with an S/PB factor.
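To make the structure of the key bifactor model explicit, the sketch below encodes the M9 factor-to-item mapping as a plain data structure. It is purely illustrative: the models themselves were estimated in Mplus with WLSMV, and the item assignments shown here follow the standard SDQ scoring key rather than a listing given in the article.

# Illustrative encoding of bifactor model M9 (standard SDQ item numbers); the
# actual estimation was performed in Mplus 7.11, not with this snippet.
M9 = {
    # Two correlated G-factors, each defined by all items of two a priori subscales.
    "g_factors": {
        "ED": [5, 7, 12, 18, 22] + [2, 10, 15, 21, 25],   # Conduct + Hyperactivity items
        "ID": [3, 8, 13, 16, 24] + [6, 11, 14, 19, 23],   # Emotional + Peer items
    },
    # Four S-factors, specified as orthogonal to the G-factors and to one another.
    "s_factors": {
        "conduct":       [5, 7, 12, 18, 22],
        "hyperactivity": [2, 10, 15, 21, 25],
        "emotional":     [3, 8, 13, 16, 24],
        "peer":          [6, 11, 14, 19, 23],
    },
    # First-order S/PB factor, allowed to correlate with both G-factors.
    "first_order": {"prosocial": [1, 4, 9, 17, 20]},
}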

From the best-fitting model, we then performed tests of measurement invariance across gender (males versus females) and school level (kindergarten, primary school, secondary school) following Meredith’s recommendations [23] as adapted for ordered-categorical items [21, 35]. The sequence of tests is as follows: (a) configural invariance; (b) metric/weak invariance (invariance of the factor loadings); (c) scalar/strong invariance (invariance of the factor loadings and thresholds); (d) strict invariance (invariance of the factor loadings, thresholds and uniquenesses); (e) invariance of the latent variances–covariances (invariance of the factor loadings, thresholds, uniquenesses and variances–covariances); and (f) latent mean invariance (invariance of the factor loadings, thresholds, uniquenesses, variances–covariances and latent means).
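The sketch below simply encodes this cumulative sequence of equality constraints as data, which can be convenient when scripting the corresponding model runs; the labels are assumptions of this illustration, not Mplus syntax.

# Cumulative equality constraints imposed across groups at each invariance step.
INVARIANCE_SEQUENCE = [
    ("configural",            []),  # same factor pattern, all parameters free per group
    ("metric/weak",           ["loadings"]),
    ("scalar/strong",         ["loadings", "thresholds"]),
    ("strict",                ["loadings", "thresholds", "uniquenesses"]),
    ("variances-covariances", ["loadings", "thresholds", "uniquenesses", "var_cov"]),
    ("latent means",          ["loadings", "thresholds", "uniquenesses", "var_cov", "means"]),
]

for label, constrained in INVARIANCE_SEQUENCE:
    print(f"{label}: constrain {constrained or 'nothing'} to equality across groups")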

The fit of all models was evaluated using the WLSMV chi-square statistic (χ²), the Comparative Fit Index (CFI), the Tucker-Lewis Index (TLI), and the Root Mean Square Error of Approximation (RMSEA) with its 90 % confidence interval [36, 37]. Values greater than 0.95 for the CFI and TLI are considered indicative of adequate model fit. Values smaller than 0.08 or 0.06 for the RMSEA support acceptable and excellent model fit, respectively. Chi-square difference tests were computed using the Mplus DIFFTEST function (MDΔχ²) [38, 39], with the significance level used to identify non-invariance fixed at 0.01 to take into account the overall number of MDΔχ² tests performed [40–42]. Because the χ² and MDΔχ² are oversensitive to sample size and to minor model misspecifications, additional indices were used in the comparison of nested invariance models. Thus, a decrease in CFI of 0.01 or less and an increase in RMSEA of 0.015 or less between a model and the preceding model in the invariance sequence indicate that the invariance hypothesis should not be rejected [43, 44].
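As a rough illustration of how these indices relate to the chi-square, the sketch below uses one common parameterization of the RMSEA, CFI and TLI together with the ΔCFI/ΔRMSEA decision rule described above. The exact values reported by Mplus under WLSMV involve estimator-specific corrections, so this is an approximation for didactic purposes only.

def rmsea(chi2, df, n):
    """RMSEA: population misfit per degree of freedom (one common formula)."""
    return max((chi2 - df) / (df * (n - 1)), 0.0) ** 0.5

def cfi(chi2, df, chi2_null, df_null):
    """CFI: improvement of the target model over the independence (null) model."""
    return 1.0 - max(chi2 - df, 0.0) / max(chi2_null - df_null, chi2 - df, 0.0)

def tli(chi2, df, chi2_null, df_null):
    """TLI (non-normed fit index), which also rewards parsimony."""
    return (chi2_null / df_null - chi2 / df) / (chi2_null / df_null - 1.0)

def invariance_retained(cfi_prev, cfi_curr, rmsea_prev, rmsea_curr):
    """Decision rule used above: retain invariance if CFI decreases by <= .01
    and RMSEA increases by <= .015 relative to the preceding model."""
    return (cfi_prev - cfi_curr) <= 0.01 and (rmsea_curr - rmsea_prev) <= 0.015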

Scale score reliability for the estimated factors was estimated using omega (ω) [45], which has the advantage over traditional scale score reliability estimates (e.g. Cronbach’s α) of taking into account the strength of association between items and all constructs, as well as item-specific measurement errors [46]. This makes it more realistic for complex measurement models such as those considered here.
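For the simple unidimensional case, omega can be sketched as the ratio of common to total variance computed from standardized loadings and uniquenesses, as below; the model-based estimates reported in this paper were obtained from the full bifactor model, where omega-type coefficients additionally partition variance between G- and S-factors.

def omega(loadings, uniquenesses):
    """McDonald's omega for a single factor: common variance over total variance,
    from standardized factor loadings and item uniquenesses (residual variances)."""
    common = sum(loadings) ** 2
    return common / (common + sum(uniquenesses))

# Hypothetical standardized loadings for a 5-item scale:
lam = [0.70, 0.65, 0.60, 0.75, 0.55]
print(omega(lam, [1 - l ** 2 for l in lam]))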

Results

The results reported in Table 1 show that, among the alternative first-order models, only M5 reaches an acceptable, albeit marginal, fit to the data. The parameter estimates from this model are reported in Table 2. All items presented satisfactory factor loadings on their main factors (λ = 0.413–0.951; M = 0.746) except item 23 (λ = 0.248). All scale score reliability coefficients proved satisfactory (ω = 0.758–0.914). However, the correlations between the Conduct Problems and Hyperactivity-Inattention factors (r = 0.748) and between the Emotional Symptoms and Peer Problems factors (r = 0.513), as well as the marginal fit of the model, suggest a need for further verification. More precisely, these correlations are in line with the idea that these two pairs of factors may in fact assess two overarching constructs of, respectively, ED and ID. Although some of the other correlations (for instance those involving the S/PB factor) are of a similar magnitude, there is no a priori theoretical reason to expect that these other factors would also form a reduced set of overarching constructs.

Table 1 Fit Indices for the alternative measurement models
Table 2 Original first-order 5-factor model (M5)

Among the two alternative higher-order models, M6 failed to reach an acceptable level of fit to the data, whereas M7 provided an acceptable, yet marginal, fit to the data that could not be empirically distinguished from M5 (ΔCFI = −0.002; ΔTLI = −0.002; ΔRMSEA = +0.001). Similarly, bifactor model M8 provided a marginal level of fit to the data that could not be empirically distinguished from either M5 or M7. In contrast, the fit of bifactor model M9 proved fully acceptable and substantially better than the fit of the alternative models M5, M7 and M8 according to the goodness-of-fit indices (ΔCFI = +0.018 to +0.023; ΔTLI = +0.019 to +0.021; ΔRMSEA = −0.007 to −0.008), although the confidence interval of the RMSEA suggests that these differences may not be fully significant.

Model M9 was thus retained (Fig. 1), and the corresponding parameter estimates reported in Table 3 show that the S/PB factor and the ED G-factor are well defined through relatively high factor loadings (λ = 0.679–0.830 and λ = 0.467–0.791, respectively), and present a satisfactorily high level of scale score reliability (ω = 0.883 and ω = 0.926, respectively). Noticeably, over and above their associations with this ED G-factor, most of these items present a relatively low level of specificity associated with the Conduct Problems (λ = −0.119 to 0.492) and Hyperactivity-Inattention (λ = −0.161 to 0.758) S-factors, themselves defined mostly by relatively low loadings. The two items with the greatest level of specific association with the Conduct Problems S-factor concern covert, and thus less exteriorized, forms of violence (often lies or cheats, λ = 0.492; steals from home, school or elsewhere, λ = 0.439). As a result, the scale score reliability of this S-factor remains quite low (ω = 0.544). Similarly, two items present a high level of specificity on the Hyperactivity-Inattention S-factor, and both assess symptoms of Hyperactivity (restless, overactive, cannot stay still for long, λ = 0.622; constantly fidgeting or squirming, λ = 0.758), rather than Inattention. Given the magnitude of these two specific loadings, the scale score reliability of this S-factor remains generally satisfactory (ω = 0.705).

Fig. 1

Final retained bifactor Model M9: loadings are represented as solid lines with an arrow (non-significant loadings are in grey) and factor correlations are represented as broken lines with arrows at both ends (see numerical values in Table 3)

Table 3 Final retained bifactor model (M9)

The ID G-factor is not as well defined as the ED G-factor, although it still reflects reasonably well a common core of ID manifestations. This G-factor is defined through: (a) 4 items with λ > 0.500 covering manifestations of social rejection (has at least one good friend; generally liked by other children; picked on or bullied by other children) and generic unhappiness (often unhappy, depressed or tearful); (b) 4 items with 0.200 ≤ λ < 0.500 covering manifestations of anxiety (often complains of headaches, stomach aches or sickness; nervous in new situations, easily loses confidence; many fears, easily scared) and preference for solitude (would rather be alone than with other youth); (c) two items with low or non-significant factor loadings covering a generic tendency to worry (many worries or often seems worried) and a preference for adult company (gets along better with adults than with other children). Supporting the potential usefulness of this G-factor, its model-based scale score reliability appears fully satisfactory (ω = 0.804). The S-factor reflecting Emotional Symptoms over and above this generic presence of ID is also well defined through high factor loadings (λ = 0.468–0.780) and a satisfactory level of scale score reliability (ω = 0.832). In contrast, the Peer Problems S-factor appears to be more strongly defined through items reflecting a preference for solitude (would rather be alone than with other youth, λ = 0.570; gets along better with adults than with other children, λ = 0.707) than through items reflecting peer rejection (λ = 0.123–0.313), which present stronger relations with the ID G-factor, resulting in a lower S-factor scale score reliability estimate (ω = 0.650).

Results for model M9 (Table 1) supported the complete invariance of the measurement model, as well as the invariance of the latent variances and covariances, across gender groups and across all school levels considered (kindergarten, primary, secondary); some MDΔχ² proved significant, but none of the ΔCFI, ΔTLI and ΔRMSEA exceeded the recommended cutoffs. The teacher version of the SDQ thus provides results that are fully comparable across male and female youths attending kindergarten, primary schools, and secondary schools. The results also support the absence of latent mean differences across school levels (ΔCFI = 0.000; ΔTLI = +0.001; ΔRMSEA = −0.001), and the presence of latent mean differences across genders (ΔCFI = −0.017; ΔTLI = −0.016; ΔRMSEA = +0.007). When girls’ latent means are fixed to 0 for identification purposes and differences are expressed in standard deviation units, boys have higher latent means on the ED G-factor (0.493, p < 0.05), the ID G-factor (0.252, p < 0.05) and the Hyperactivity-Inattention S-factor (0.287, p ≤ 0.05), but lower latent means on the Strengths/Prosocial Behaviours factor (−0.467, p < 0.05) and the Emotional Symptoms S-factor (−0.231, p < 0.05). Latent means did not differ across genders on the Conduct Problems and Peer Problems S-factors.

Discussion

The factor structure of all SDQ versions has been extensively assessed across cultures. Our review showed that the a priori first-order 5-factor model has generally received strong support. In contrast, research results are mixed regarding the presence of more global constructs reflecting ID, ED or global Difficulties [8]. This could partly be related to the reliance on higher-order factor models rather than bifactor models [13, 15, 18].

In the present study, we explored the global and specific factor structure of the SDQ using data from the general population. Alternative first-order CFA models were first contrasted to verify the adequacy of the a priori 5-factor model over alternative models. Examination of the parameter estimates revealed well-defined factors and satisfactory model-based estimates of scale score reliability. However, the fit of this model remained close to the lower bound of acceptability according to conventional guidelines. Furthermore, the estimated factor correlations suggested exploring the presence of more global constructs.

None of the higher-order representations of the SDQ considered provided a satisfactory alternative to the a priori first-order factor model. Conversely, a bifactor model including two correlated G-factors reflecting ID and ED (themselves correlated to an S/PB first-order factor) provided a satisfactory level of fit to the data and a clear improvement over the fit of the first-order factor model. It should be noted that this conclusion is limited by the fact that the confidence intervals for the RMSEA mainly overlapped across models, although the efficacy of this specific indicator of model fit has yet to be systematically investigated in the context of WLSMV estimation. Similarly, some of the estimated factor loadings for this model turned out to be non-significant, which is consistent with the nature of bifactor models where each item cannot realistically be assumed to present equally strong associations with Global and Specific factors [47]. Rather, the specific patterns of significant versus non-significant loadings helped us to refine the interpretation of the G- and S-factors.

Parameter estimates from the final model revealed three well-defined S/PB, ED and ID factors, although the ID G-factor mainly reflects the social rejection and anxiety components of ID. These three factors also present high and satisfactory scale score reliability (ω = 0.804–0.926), supporting the use of the corresponding total score in research and practice. Similarly, the S-factor reflecting Emotional Symptoms is well defined through high-factor loadings from all the items, and presents satisfactory scale score reliability (ω = 0.832), which confirms the importance of using scores on this factor to complement ID ratings obtained based on the G-factor.

The S-factors reflecting the a priori SDQ scales do not appear to be defined as well as the G-factors, but still convey meaningful specificity over and above the assessment provided by the G-factors. Whereas the G-factor appears to provide a relatively complete overarching assessment of ED, the S-factor related to Conduct Problems mainly reflects covert forms of conduct disorder related to stealing and cheating, going beyond the more overt forms of violence that are directly covered by the ED G-factor. Similar distinctions between overt and covert manifestations of violence are often noted in the research literature on Conduct Disorders [48–50]. However, although this distinction appears worthy of consideration in the measurement model of the SDQ as a way to control for the specificity of these covert behaviours beyond what is already assessed through the ED G-factor, the low scale score reliability of this S-factor (ω = 0.544) suggests that this specific Conduct Problems subscale should not be used in practice in its current state. Rather, future research should seek ways to improve the assessment of covert behaviours to increase the meaningfulness of this subscale. One hypothesis may be that teachers alone cannot capture all facets of overt/covert behaviours, and that one way of improving this assessment might be to use multiple informants across different settings. This would be expected based on multi-rater studies of conduct problems and antisocial behaviours [51, 52].

The Hyperactivity-Inattention S-factor mostly covers manifestations of Hyperactivity, rather than Inattention, going beyond the common core of ED. This is in line with the subtype specificity of ADHD proposed by the DSM-5 [1], stating that ADHD symptoms can be dominated either by Hyperactivity or Inattention, as well as with previous bifactor representations of ADHD [19, 20]. Similarly, the Peer Problems S-factor is mainly defined through items reflecting a preference for solitude, rather than the social rejection component of peer-related problems covered within the ID G-factor. The distinction between peer rejection and preference for solitude has been found to have important substantive implications in previous research on peer problems [53, 54]. Although the results from the first-order 5-factor model suggested a need for the re-assessment of item 23 (“Gets on better with adults than with other children”) due to a very low factor loading on its a priori factor, the results from the final retained bifactor model rather suggest that this item plays an important role in the definition of this S-factor reflecting a preference for solitude (λ = 0.707). The level of specificity related to these S-factors (Hyperactivity-Inattention and Peer Problems) is sufficient to provide satisfactory scale score reliability estimates that fully justify their use to complement ratings on the G-factors to assess hyperactivity (ω = 0.705) and preference for solitude (ω = 0.650), over and above levels of ED and ID.

As a preliminary test of generalizability, we conducted tests of measurement invariance of the obtained factor structure across subgroups of participants. This verification is particularly relevant for the SDQ, which was developed to be suitable for youths between the ages of 4 and 17, and thus relies on the assumption that SDQ ratings are comparable across this full developmental period. In line with these expectations, the final retained bifactor model proved to be fully invariant (configuration, loadings, thresholds, uniquenesses, and even variances and covariances) across genders and the three school levels considered. We further verified whether the latent means obtained on the estimated factors would replicate the results from previous studies, as a test of the discriminant validity of the model. Supporting previous studies, no mean-level differences could be observed as a function of school level. Conversely, but also in line with previous studies based either on the SDQ [27] or on other instruments [19, 20], our results showed that boys had higher levels of ED, ID, and Hyperactivity-Inattention, and lower levels of S/PB and Emotional Symptoms, than girls.

In summary, our study shows that it is legitimate to compute the five a priori scores when using teacher ratings of the SDQ. Additionally, it may be even more informative to compute ED and ID scores and then to interpret subscale-specific scores on the Conduct Problems, Hyperactivity-Inattention, Peer Problems and Emotional Symptoms factors as a function of the information they add to refine initial interpretations based on the ED and ID scores. More precisely, our results suggest that the Emotional Symptoms score is meaningful in its own right, as the content of this subscale is only imperfectly reflected in the ID factor. Conversely, the Conduct Problems, Hyperactivity-Inattention, and Peer Problems scores appear mainly to reflect, respectively, Covert Behaviours, Hyperactivity, and Preference for Solitude once global scores on the ED and ID factors are taken into account. Although our results are promising, future studies should still investigate the validity, sensitivity, and specificity of SDQ assessments by comparing teacher, parent, and self-ratings of the same instrument with formal clinical assessments conducted using structured interviews. Indeed, although this study focused on the psychometric properties of the teacher version of the SDQ, in practice the assessment of behavioural disorders typically seeks to identify behaviours that are pervasive across settings and thus aims to integrate multiple sources of information (parents, teachers, clinicians, self). This is important as these informants are known to provide different perspectives on the behaviours being rated [55] due to their reliance on distinct frames of reference in their interactions with the child being rated. In addition, the generalizability of the present results to other versions of the SDQ and to other linguistic groups should be more thoroughly investigated.