Introduction

The identification of mental health problems in children and adolescents poses several challenges for mental health and research professionals. One important criterion of mental health problems, amongst others, is that they are characterised by a deviation from an appropriate reference group. An appropriate reference group might be a group of persons with similar demographic characteristics, such as age, sex and cultural background (e.g., [1, 2]). It is important that screening instruments are validated in nationally representative samples in different age groups and for both boys and girls before they are used for identifying mental health problems.

The Strengths and Difficulties Questionnaire (SDQ) is one of the most widely used mental health screening instruments for children and adolescents, and has been translated into over 60 languages [3, 4]. It comprises five subscales: emotional symptoms, peer problems, conduct problems, hyperactivity/inattention and prosocial behaviour. The SDQ was originally developed for children and adolescents aged 4–17 years. An early-years version of the SDQ for children aged 2–4 years was developed in 2014 (Footnote 1). The evaluation can be done by parents and teachers of children and adolescents aged 2–17 years (parent/teacher version) or by the children themselves if they are 11 years or older (self-report version). The SDQ parent version (SDQ-P) is the most widely used version [5].

Epidemiological studies showed that psychopathological abnormalities are prevalent in about 10–20% of children and adolescents [6]. Many countries have national norms that were derived from nationally representative samples (for a review, see [5]). Some authors accounted for potential sex and age differences. Based on the norms, cutoff scores defining a normal range for the SDQ total sum score can be derived for the screening of mental health problems. Due to prevalence estimates between 10% and 20%, the 80th and 90th centiles of nationally representative samples have been frequently used as cutoff scores for defining clinically relevant behaviour. More recently, Goodman et al. proposed using the 80th, 90th and 95th centiles as cutoffs for defining the scores as ‘close to average’ (< 80th centile), ‘slightly raised’ (80th–90th centile), ‘high’ (90th–95th centile) and ‘very high’ (> 95th centile) [7]. Such information could be used to interpret the severity of abnormality.

The German SDQ-P has been tested and validated in five studies [8,9,10,11,12,13]. Two of the studies were based on clinical samples comprising children aged 6–18 years with ADHD [8] and children aged 5–17 years with any psychiatric diagnosis [9]. The three remaining studies examined community samples that were considered nationally representative. Norms of the German SDQ-P stem from Woerner et al. and are based on parents’ reports of approximately 1000 children aged 6–16 years [12, 13]. These norms have been confirmed by Rothenberger et al. in a sample of approximately 2500 parents of 7–16-year-old children [11]. However, there are no studies examining preschoolers, limiting the generalisability of the results to children of 6 years or older. German norms for preschoolers are essential because preschoolers are not yet able to report on their mental health, so that parents’ or carers’ reports are the main source of information (for a review, see [14]).

Emotional or behaviour problems might manifest differently depending on a child’s age [1, 2]. The behaviour of an 11-year-old might be of clinical relevance, although the same behaviour might be normal at the age of, say, 3 years. Therefore, because the SDQ-P can be applied to the wide age range from 2 to 17 years, developmental differences in the phenotype of emotional or behaviour problems might arise. This would preclude any comparison of (subscale) scores for children of different ages. Further, it raises the concern that younger children are systematically reported as showing abnormal behaviour more or less frequently than older children if norms of older children are used (e.g., [15]). Examination of measurement invariance of the SDQ across age is, therefore, of crucial importance: if the German SDQ-P is not measurement invariant across age, age-specific norms are required. If, in contrast, measurement invariance can be shown, normal ranges (usually defined by centiles of a reference population) are constant across different ages and there is no need for age-specific reference values.

Woerner et al. reported that for children and adolescents aged 6–16 years scale scores are unrelated to age, except for the hyperactivity/inattention subscale, for which younger children had slightly higher scores than older children and adolescents [12, 13]. The authors concluded that the differences in the German SDQ-P across age groups were negligible. Rothenberger et al. obtained similar results for children and adolescents aged 7–16 years [11]. Again, younger children had significantly higher hyperactivity scores than older children and adolescents. In addition, younger children had slightly higher SDQ total scores and lower prosocial behaviour scores than older children. This suggests that developmental processes should be taken into account at least for some subscales. However, their investigations are limited to children aged 6–16 years and do not reveal the developmental course of SDQ problems from preschool to school age. Within their preschool sample, Klein et al. found no age differences between 3- and 5-year-olds [10]. In addition, they compared SDQ scale means of their 3–5-year-olds with the scale means of the German representative sample of 6–16-year-olds of Woerner et al. [12, 13]. Despite the lack of representativeness and comparability, Klein et al. attributed differences between the samples to age rather than other factors, leading to conclusions that are in conflict with the available literature [10]: the authors concluded that for 3–5-year-olds prosocial behaviour and hyperactivity scores were higher and peer problem scores were lower than in 6–16-year-olds. Hölling et al., in contrast, showed that 3–6-year-olds and 14–17-year-olds had lower problem behaviour than 7–13-year-olds [16].

In all the studies, either Spearman correlation analysis or classical statistical tests and descriptive analysis were performed [12, 13]. While the former tests for the existence of any monotone relationship between the SDQ total score and age, the latter assesses whether the central tendency (like mean or median) is different across age groups. Although addressing interesting questions, neither type of analysis answers the question of whether the SDQ operates the same way for all age groups (i.e., if the SDQ is measurement invariant).

We used data of the German Health Interview and Examination Survey for Children and Adolescents (KiGGS) [17], to address the following goals: (1) to replicate the original scale structure of the German SDQ-P in the whole sample, to validate the German SDQ-P across the whole age range, including preschool children, (2) to assess whether the German SDQ-P is measurement invariant across the full age range, and (3) to provide norms that are age-specific if measurement invariance cannot be shown and not age-specific if measurement invariance can be shown. All analyses are performed separately for boys and girls because of substantial sex differences in problem behaviour [3, 16, 18].

Methods

Study design and sample

The German Health Interview and Examination Survey for Children and Adolescents (KiGGS) is a nationally representative cross-sectional health interview and examination survey for children and adolescents that took place in Germany from May 2003 to May 2006 [17]. The KiGGS Study was approved by the Charité/Universitätsmedizin Berlin Ethics Committee and the Federal Office for the Protection of Data, and was conducted according to the Declaration of Helsinki. Details on the objective, study design and sampling strategy were described elsewhere [19,20,21].

Participants were sampled based on a complex two-stage sampling procedure, with 167 sample points from an inventory of German communities stratified according to the BIK classification system. Data on sociodemographic characteristics as well as parameters related to physical, psychological and social health were obtained from 17,640 children and adolescents for the entire age range from 0 to 17 years. Among those, 14,835 children were 3 years or older.

Instruments

The Strengths and Difficulties Questionnaire (SDQ) is a widely used screening instrument for emotional and behaviour problems, and also contains a subscale on prosocial behaviour. It consists of 25 items positively or negatively worded, thereby assessing both strengths and difficulties. In the SDQ-P, the items are rated by parents as untrue (corresponding to a score of 0), somewhat true (score of 1) or certainly true (score of 2). They can be grouped into four problem subscales and one competence subscale, each comprising five items with sum scores ranging from 0 to 10. The four problem subscales assess conduct problems, hyperactivity/inattention, emotional symptoms and peer problems. The competence subscale assesses prosocial behaviour. The total difficulties score is composed of the scores of the problem subscales only and thus ranges from 0 to 40, with larger scores suggesting greater problem behaviour.

Statistical analysis

Derivation of SDQ subscale scores in the presence of missing data

If the parent answered three or more items of a given subscale, the respective subscale score was derived. Current practice consists in imputing the scores of items that were not answered using the mean score of the subscale, if no more than two answers per subscale are missing. For example, if a parent rates the first three items of a subscale with 1, 0 and 1, but does not rate the other two items of the subscale, the scores for these two items are imputed by the value 0.667 (= [1 + 0 + 1]/3). This gives a subscale score of 3.333 (= 1 + 0 + 1 + 0.667 + 0.667). If the items of all other subscales are answered, one obtains a real-valued SDQ total score.
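The imputation rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `prorated_subscale_score` and the use of `None` for an unanswered item are assumptions made for this example.

```python
from typing import Optional, Sequence


def prorated_subscale_score(items: Sequence[Optional[int]]) -> Optional[float]:
    """Subscale score from five item ratings (0, 1 or 2).

    `None` marks an unanswered item (a convention assumed for this sketch).
    Returns None if fewer than three of the five items were answered.
    """
    answered = [x for x in items if x is not None]
    if len(items) != 5 or len(answered) < 3:
        return None
    # Each missing item is replaced by the mean of the answered items.
    item_mean = sum(answered) / len(answered)
    n_missing = len(items) - len(answered)
    return sum(answered) + item_mean * n_missing


# Example from the text: ratings 1, 0, 1 and two unanswered items.
example = prorated_subscale_score([1, 0, 1, None, None])
print(round(example, 3))  # 3.333
```

The score is deliberately left real-valued, in line with the decision described in the text not to round.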

Note that the derivation is slightly different from the original algorithm provided at http://www.sdqinfo.org. In the original algorithm, real-valued scores are rounded to the nearest integer. Rounding, however, introduces a bias. In the example above, the value 3.333 is rounded to 3, which leads to an underestimation of the child’s problem score. Moreover, according to the implementations provided at http://www.sdqinfo.org, the SDQ total score is computed from the rounded subscale scores, so that rounding errors can accumulate. For example, if 3.333 is the subscale score of all four subscales, the derived SDQ total score is 12 (= 3 + 3 + 3 + 3), although the value 13 (≈ 3.333 + 3.333 + 3.333 + 3.333) would be more appropriate. This introduces an additional bias. For this reason, we did not round any values to the nearest integer.
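The effect of rounding before summing can be reproduced with the numbers from the example. The sketch below is only an illustration: the helper `total_from_subscales` is hypothetical, and Python's built-in `round` is used as a stand-in for the rounding in the original scoring algorithm (the two may treat exact halves differently, which does not matter for this example).

```python
def total_from_subscales(subscales, round_first):
    """Total difficulties score from the four problem subscale scores.

    round_first=True mimics summing rounded subscale scores;
    round_first=False sums the real-valued scores.
    """
    if round_first:
        return float(sum(round(s) for s in subscales))
    return float(sum(subscales))


# Four problem subscales, each with the prorated score 3.333... from the text.
problem_scores = [10 / 3] * 4

rounded_total = total_from_subscales(problem_scores, round_first=True)
unrounded_total = total_from_subscales(problem_scores, round_first=False)

print(rounded_total)              # 12.0
print(round(unrounded_total, 2))  # 13.33
```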

Confirmatory factor analysis (CFA) for replicating the five-factor structure of the SDQ-P

CFA was used to evaluate the five-factor structure of the SDQ, based on the sample population for which all 25 items of the SDQ were answered. The analysis was performed both for the total sample population and separately for boys and girls. Diagonally weighted least squares (WLSMV) estimation was used to account for the ordinal scale of the items [22]. There are several criteria for assessing model fit, such as the chi-square statistic or fit indices. Since the chi-square statistic depends on sample size, it is likely that in large population-based samples, even very small improvements in model fit might become significant [23]. This is why the chi-square statistic was not used for assessing model fit. Instead, model fit was assessed using the Bentler comparative fit index (CFI; [24]) and the Tucker–Lewis index (TLI), for which values > 0.90 signify acceptable fit and values > 0.95 good fit, and the root mean square error of approximation (RMSEA), for which a value < 0.08 indicates acceptable fit and a value < 0.05 good fit [25]. In this study, model fit was considered acceptable if CFI ≥ 0.9, TLI ≥ 0.9 and RMSEA < 0.08. All fit indices were computed from the scaled chi-square statistic (therefore, termed ‘scaled CFI’, etc.).
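The acceptance rule stated at the end of this paragraph amounts to a simple conjunction of three conditions. A minimal Python sketch follows; the function name `fit_is_acceptable` is an assumption of this example, and the check is applied to the scaled indices for the overall sample as reported in the Results.

```python
def fit_is_acceptable(cfi: float, tli: float, rmsea: float) -> bool:
    """Acceptance rule used in this study: CFI >= 0.9, TLI >= 0.9, RMSEA < 0.08."""
    return cfi >= 0.90 and tli >= 0.90 and rmsea < 0.08


# Overall sample after allowing the residual covariance (values from the Results).
print(fit_is_acceptable(cfi=0.912, tli=0.900, rmsea=0.051))  # True
```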

Multi-group confirmatory factor analysis (MGCFA) for assessing measurement invariance across age

MGCFA was used to assess measurement invariance of the SDQ across age. The sample was categorised into the following age groups: 3–4 years, 5–6 years, 7–8 years, 9–10 years, 11–12 years, 13–14 years and 15–17 years. This categorisation is a trade-off between sufficient sample sizes per age group and homogeneous groups. With this categorisation, the number of subjects in each age group within girls or boys was not below 900 and is thus sufficiently large for MGCFA (Footnote 2). At the same time, the pooling of two (or, for the oldest group, three) adjacent ages to form homogeneous age groups is considered to be acceptable.
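For illustration, the age categorisation can be written as a small lookup. This is a sketch under the assumption that age is available in completed years; the names `AGE_GROUPS` and `age_group` are hypothetical.

```python
# Age groups used for MGCFA, as (lowest age, highest age) in completed years.
AGE_GROUPS = [(3, 4), (5, 6), (7, 8), (9, 10), (11, 12), (13, 14), (15, 17)]


def age_group(age_years: int) -> str:
    """Return the label of the MGCFA age group for an age in completed years."""
    for low, high in AGE_GROUPS:
        if low <= age_years <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age_years} is outside the analysed range of 3-17 years")


print(age_group(10))  # 9-10
print(age_group(16))  # 15-17
```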

First, the proposed five-factor model (termed baseline model), in which factor loadings and thresholds varied freely over age groups, was assessed based on fit indices. Configural invariance was assumed if this model had acceptable model fit. Note that the SDQ contains categorical items that are evaluated on an ordinal scale (answer format: untrue, somewhat true, certainly true), and invariance testing with continuous and categorical items differs. According to Muthén and Muthén, in the presence of categorical item responses, thresholds and loadings should be varied in tandem since the item characteristic curves depend on both parameters [27] (see also [28] for the MGCFA methodology with categorical item responses). Thus, weak invariance testing is appropriate for continuous but not for categorical item responses and was not applied for this reason.

If configural invariance held, strong measurement invariance was assessed by comparing the baseline model to a more constrained model (termed strong invariance model), in which all items’ factor loadings and thresholds were held equal across age groups. Strong measurement invariance was not established if the difference in the models’ CFI indices (ΔCFI) exceeded 0.01 [29, 30]. Note that this is a tolerant criterion and that stricter criteria for declaring measurement non-invariance have been proposed. Meade et al., for example, proposed declaring measurement non-invariance if ΔCFI ≥ 0.002, which has greater power to detect non-invariance if it is present but also bears a higher risk of falsely declaring measurement non-invariance [31]. In this study, we used the less strict criterion (ΔCFI ≥ 0.01) for assuming measurement non-invariance because we wanted to minimise the risk of incorrectly declaring the SDQ to be measurement non-invariant.
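The ΔCFI decision rule can be sketched as follows. The function names are hypothetical, and treating a difference of exactly the threshold as non-invariance (≥ rather than >) is an assumption, since the text uses both formulations. The difference is rounded to three decimals to avoid floating-point artefacts for indices reported to three decimals.

```python
def delta_cfi(cfi_baseline: float, cfi_constrained: float) -> float:
    """Loss in CFI when moving from the baseline to the constrained model."""
    return round(cfi_baseline - cfi_constrained, 3)


def non_invariance_declared(cfi_baseline: float, cfi_constrained: float,
                            threshold: float = 0.01) -> bool:
    """True if the constrained model fits substantially worse than the baseline."""
    return delta_cfi(cfi_baseline, cfi_constrained) >= threshold


# Boys: baseline CFI 0.925 vs. partial strong invariance CFI 0.916 (from the Results).
print(non_invariance_declared(0.925, 0.916))                   # False
# The same comparison under the stricter criterion of Meade et al.
print(non_invariance_declared(0.925, 0.916, threshold=0.002))  # True
```

The second call shows how strongly the conclusion depends on the chosen cutoff: a model pair accepted under the tolerant criterion would be flagged under the stricter one.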

If the difference in CFI indices was larger than 0.01 (i.e., strong measurement invariance cannot be assumed), we assessed whether it suffices to constrain the factor loadings and thresholds not for all but only for a subset of items, usually referred to as partial strong measurement invariance. Partial strong measurement invariance would indicate that specific items function differently for children of different ages whereas others do not. To identify items for which measurement invariance cannot be assumed, we tested for each item i whether the strong invariance model has a significantly worse fit than a (partial strong invariance) model in which the factor loadings and thresholds are held constant over all age groups, except for the loadings and thresholds of item i, which are allowed to vary freely over the age groups. Items with small p values (or equivalently, large score test statistics) can be considered as items for which the constraint of equal loadings and thresholds should be released. We, therefore, first sorted the items according to their p values or, equivalently, according to their score test statistics. Then, starting from the strong invariance model, we repeatedly fitted models, each time releasing the constraint for one additional item, until we obtained a model that did not have a substantially worse fit than the baseline model (i.e., ΔCFI < 0.01).
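The stepwise release procedure can be outlined as a short loop. This is only a sketch of the control flow: the actual models were fitted with lavaan in R, whereas here the model fit is abstracted into a hypothetical callback `fit_cfi` that returns the CFI of a model in which the given items are freed. The item names, score statistics and CFI values in the toy example are invented purely so that the loop can be run.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple


def release_items_stepwise(
    score_stats: Dict[str, float],
    cfi_baseline: float,
    fit_cfi: Callable[[FrozenSet[str]], float],
    max_delta: float = 0.01,
) -> Tuple[List[str], float]:
    """Free items one at a time, largest score statistic first, until
    the CFI loss relative to the baseline model drops below max_delta."""
    ranked = sorted(score_stats, key=score_stats.get, reverse=True)
    released: List[str] = []
    cfi = fit_cfi(frozenset(released))  # strong invariance model
    for item in ranked:
        if round(cfi_baseline - cfi, 3) < max_delta:
            break
        released.append(item)
        cfi = fit_cfi(frozenset(released))
    return released, cfi


# Toy stand-in for model fitting (invented numbers, not study results).
def toy_fit_cfi(freed: FrozenSet[str]) -> float:
    return 0.900 + 0.008 * len(freed)


toy_stats = {"item_a": 300.0, "item_b": 250.0, "item_c": 200.0}
released, final_cfi = release_items_stepwise(toy_stats, 0.925, toy_fit_cfi)
print(released)  # ['item_a', 'item_b']
```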

In contrast to the derivation of centile curves (described in the following paragraph), we did not use weighting factors for (MG)CFA since we do not report any numbers or percentages from these analyses that are supposed to be representative for the population in Germany.

Centile curves for deriving age-specific norms

The LMSP method of centile estimation was used to model centiles of the SDQ total score in dependence on age [32]. This method assumes that for a given age, there is a transformation of the form

$$\widetilde{\text{SDQ}} = \begin{cases} \dfrac{1}{\sigma \nu}\left[\left(\dfrac{\text{SDQ}}{\mu}\right)^{\nu} - 1\right] & \text{if } \nu \ne 0 \\[1ex] \dfrac{1}{\sigma}\log\left(\dfrac{\text{SDQ}}{\mu}\right) & \text{if } \nu = 0 \end{cases},$$

such that the transformed SDQ total score, \(\widetilde{\text{SDQ}}\), follows a standard power exponential distribution with power parameter \(\tau > 0\). The SDQ score is then said to have a Box–Cox power exponential distribution with parameters \(\mu , \sigma , \nu\) and \(\tau\) relating to the location, scale, skewness and kurtosis, respectively [32]. Each of the four parameters \(\mu , \sigma , \nu\) and \(\tau\) was modelled as a smooth non-parametric function of the exact age. The scores of the subscales take values between 0 and 10, and were assumed to follow a zero-adjusted gamma distribution. Worm plots [33] and Q statistics testing normality of residuals within age groups [34] were used as diagnostic tools to identify possible inadequacies of the fit. In addition, the smoothed centiles were compared with their empirical counterparts. A weighting factor was used for modelling centiles to correct for deviations in the sample from the population structure in Germany (as on 31 December 2010) with respect to age and region (East/West/Berlin).
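The transformation itself is easy to evaluate once the parameters at a given age are known. The following Python sketch implements only the two-case formula; the function name `bcpe_z` and the parameter values in the example are hypothetical, not fitted values from this study. The sketch rejects non-positive scores because the formula is undefined there; how scores of zero were handled in the actual model fit is not stated in the text.

```python
import math


def bcpe_z(sdq: float, mu: float, sigma: float, nu: float) -> float:
    """Transformed score for given location (mu), scale (sigma) and skewness (nu)."""
    if sdq <= 0 or mu <= 0 or sigma <= 0:
        raise ValueError("sdq, mu and sigma must be positive")
    ratio = sdq / mu
    if nu == 0:
        return math.log(ratio) / sigma
    return (ratio ** nu - 1.0) / (sigma * nu)


# Hypothetical parameter values for one age, for illustration only.
print(bcpe_z(sdq=12.0, mu=8.0, sigma=0.5, nu=1.0))  # 1.0
print(bcpe_z(sdq=8.0, mu=8.0, sigma=0.5, nu=0.3))   # 0.0
```

A score equal to \(\mu\) always maps to 0, and the \(\nu = 0\) case is the limit of the general case as \(\nu\) approaches 0. Converting the transformed score into a centile additionally requires the distribution function of the standard power exponential distribution with parameter \(\tau\), which is not shown here.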

Statistical software

All analyses were conducted with the statistical software R, version 3.3.0. CFA and MGCFA models were fit using the function cfa in the R package lavaan (version 0.5-22; [35]). Partial measurement invariance was assessed based on the results of the function lavTestScore of the same package. For modelling the centiles of the SDQ scores, the R package gamlss (version 5.0-2) and relevant functions therein were used [36].

Results

Factor structure and measurement invariance across age groups

SDQ data from 13,423 completed questionnaires (i.e., no missing items; 6810 boys and 6613 girls) were used for assessing the factor structure and measurement invariance. The fit of the five-factor models in the overall study population and within boys and girls was not optimal, and yielded CFI and TLI values below 0.9 (results not shown). The modification indices for all three models suggested that there is strong residual correlation between the items restless and fidgety of the hyperactivity/inattention subscale. After accounting for residual covariance between these items, the fit considerably improved and yielded acceptable values with CFI = 0.912, TLI = 0.900, RMSEA = 0.051 for the overall study population, CFI = 0.917, TLI = 0.905, RMSEA = 0.053 for boys and CFI = 0.912, TLI = 0.900, RMSEA = 0.050 for girls. The path diagrams for the three models including the factor loadings, thresholds and (co)variances are shown in Online Resource 1.

The models for boys and girls were subsequently specified in the framework of MGCFA to test for configural invariance across all age groups within boys and girls. The configural invariance models yielded an acceptable fit; CFI = 0.925, TLI = 0.915 and RMSEA = 0.049 for boys and CFI = 0.918, TLI = 0.907 and RMSEA = 0.047 for girls (Table 1). This suggests that the proposed five-factor structure of the German SDQ-P is appropriate for the complete age range, 3–17 years.

Table 1 Results of MGCFA for assessing measurement invariance across age groups in a specified population

The strong measurement invariance models (i.e., models with equal item loadings and thresholds across age groups) yielded a substantially worse fit for both boys and girls. The differences in CFI (ΔCFI) exceeded the threshold 0.01. This suggests that the SDQ is not measurement invariant across age groups.

Partial strong measurement invariance was assessed by testing the strong invariance model against a model in which the factor loadings and thresholds of an item may vary freely across age groups. An overview of the score test statistics and p values of all 25 items is given in Table 2. For boys, the strongest evidence against measurement invariance (i.e., the largest test statistic, or equivalently the smallest p value) was obtained for the item worries of the emotional symptoms subscale and items distractible, fidgety and restless of the hyperactivity/inattention subscale. For the item distractible, there was a non-linear change in the thresholds (see figure in Online Resource 2): The thresholds decreased within childhood and then increased again showing that 3–6 and 15–17-year-olds are less likely to become distracted than 7–14-year-olds. The item worries is less often endorsed by parents of younger children (3–10 years) since thresholds decrease constantly during these ages (Online Resource 2). The thresholds for the item fidgety were larger for older boys (results not shown). This shows that parents of younger children endorse this item more than parents of older children or adolescents. Releasing the equality constraints for the items worries and distractible yielded an acceptable partial strong invariance model with CFI = 0.916, TLI = 0.914 and RMSEA = 0.050. This model was not substantially worse than the baseline model (ΔCFI = 0.009 < 0.01; see Table 1 and Online Resource 2 for details). Partial strong measurement invariance can thus be established for boys.

Table 2 Items sorted by the score test statistic (in descending order) for assessing partial strong measurement invariance

For girls, the biggest problems were observed also for items of the emotional symptoms subscale (somatic, worries, afraid) and the hyperactivity/inattention subscale (restless, fidgety), as these items yield the largest score test statistics (see Table 2). An inspection of the age-group-specific thresholds for the items worries and somatic shows that parents of older children and adolescents are more likely to report that their child worries or has headaches, stomach-aches and sickness, respectively, than parents of younger children (see figure in Online Resource 2). Parents of younger children, in contrast, are more likely to report that their child has many fears or is easily scared (item afraid; results not shown), is constantly fidgeting or squirming (item fidgety; results not shown) and is restless or overactive (item restless; Online Resource 2). Relaxing the equality constraints for the three most problematic items, somatic, worries and restless, yielded a partial strong invariance model that was not substantially worse than the baseline model (CFI = 0.910, TLI = 0.908, RMSEA = 0.046; ΔCFI = 0.008 < 0.01; cf. Table 1). Partial strong invariance can, therefore, be established also for girls.

To conclude, the five-factor structure of the SDQ can be validated in all age groups. The SDQ and the subscales are thus applicable also in children younger than 6 years that have not been investigated in studies on the German SDQ-P so far. The results obtained from MGCFA for both boys and girls suggest that strong measurement invariance cannot be assumed for the emotional symptoms subscale and the hyperactivity/inattention subscale, while for the other three subscales, measurement invariance might be assumed. This supports the use of age-specific norms at least for the emotional symptoms subscale and the hyperactivity/inattention subscale. Age-specific norms should also be used for the total difficulties score since this is the sum of the subscale scores.

Age-specific norm values for the SDQ total difficulties score

The data of completed questionnaires (n = 13,423; 90.5%) and incomplete questionnaires with no more than two missing items per subscale (n = 1054; 7.1%) were used to derive age-specific reference values for the German SDQ-P. This makes up 97.6% of all questionnaires. Only 2.4% (n = 358 in total; 165 girls; 193 boys) of the questionnaires had to be excluded from the computation of reference values due to too many incomplete items.

The age-specific 5th, 10th, 20th, 50th, 80th, 90th, 95th centile curves for the SDQ total difficulties score are shown in Fig. 1 separately for boys and girls. The concrete values for the 50th, 80th, 90th and 95th centiles of the SDQ total difficulties score are specified in Table 3 for children of age 3–17.

Fig. 1

Age-specific norm values for the total difficulties score of the German SDQ-P. The plots show the 5th (P5), 10th (P10), 20th (P20), 50th (P50), 80th (P80), 90th (P90), 95th (P95) centiles of the SDQ total difficulties score in dependence on age in the German study population including 3–17-year-old boys (a) and girls (b)

Table 3 Age-specific norm values for the SDQ total difficulties score of the German SDQ-P for boys and girls

Figure 1 supports a dependency of the centiles on age. In particular, the upper centiles that are commonly used for defining abnormal behaviour show a strong dependency on age, and the dependency is stronger for higher centiles. The upper (i.e., 80th, 90th, 95th) centiles have their maxima at about 10 years for both boys and girls. This implies that a total difficulties score of 17, for example, is more indicative of abnormal behaviour for a 17-year-old boy than it is for a 10-year-old boy, because about 10% of the 10-year-old boys have a score of 17 or higher, while only about 5% of the 17-year-old boys have such a high or an even higher score. Or, using the notation of Goodman et al., for the 10-year-old the SDQ score of 17 is considered high (90th–95th centile), while for a 17-year-old the same score is considered very high (> 95th centile) [7].
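How the same score falls into different bands at different ages can be illustrated with a small classifier based on the four-band scheme of Goodman et al. This is a sketch only: the cutoff values below are hypothetical placeholders (the actual values are those in Table 3), the function name `severity_band` is an assumption, and so is the treatment of scores that lie exactly on a cutoff.

```python
def severity_band(score: float, p80: float, p90: float, p95: float) -> str:
    """Band of a total difficulties score given age- and sex-specific centiles."""
    if score < p80:
        return "close to average"
    if score < p90:
        return "slightly raised"
    if score <= p95:
        return "high"
    return "very high"


# Hypothetical cutoffs (P80, P90, P95) for two ages; NOT the values of Table 3.
cutoffs_age_a = (13.0, 16.5, 19.0)
cutoffs_age_b = (11.0, 14.0, 16.5)

print(severity_band(17, *cutoffs_age_a))  # high
print(severity_band(17, *cutoffs_age_b))  # very high
```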

The median and the lower centiles show hardly any differences during childhood. For boys, the median score is constant at about 8.5 up to the age of 13 years, then decreases with increasing age and reaches its minimum of about 6–7 points in 17-year-olds. For girls, the median decreases slightly from approx. 8 to 7 at the youngest ages, is constant at 7 from 6 to 13 years and then decreases slightly by about 1 point. The lower (i.e., 5th, 10th, 20th) centiles show a similar dependency on age.

Age-specific norm values for subscales

We computed the 80th, 90th and 95th centiles of the subscales since these centiles are frequently used for screening children and adolescents with hyperactivity/inattention problems, conduct problems, emotional problems, peer problems or deficits in prosocial behaviour, respectively. In addition to those, the median score (50th centile) was computed in order to assess changes of the central tendency with age. The age-specific centiles for the five subscale scores are shown in Fig. 2 separately for boys (a, c, e, g, i) and girls (b, d, f, h, j).

Fig. 2

Age-specific norm values for the subscale scores of the German SDQ-P. The plots show the 50th, 80th, 90th, 95th centiles of the subscale score in dependence on age in the German study population including 3–17-year-old boys (a, c, e, g, i) and girls (b, d, f, h, j). The prosocial behaviour subscale was inverted such that lower values indicate better prosocial behaviour

The results for the emotional symptoms subscale and the hyperactivity/inattention subscale suggest the use of age-specific norms for these two subscales, as differences in the centiles over age can be observed for both (Fig. 2a–d). Concrete values of the age-specific 50th, 80th, 90th and 95th centiles are given in Table 4. Note that the differences in the 80th, 90th and 95th centiles are pronounced, while the median varies much less with age. There is a tendency towards higher scores around the age of 10 years, in particular for the emotional symptoms subscale. In girls, higher scores are again reached in late adolescence, while for boys the centiles decrease steadily with age. The centiles for the hyperactivity/inattention subscale are smaller for ages above 10 years. In contrast to the peak at about 10 years observed for the emotional symptoms subscale, the centiles for the hyperactivity/inattention subscale remain nearly the same for 3–10-year-olds; for girls, the highest scores are even obtained for 3-year-olds.

Table 4 Age-specific norm values for the emotional symptoms subscale and the hyperactivity/inattention subscale of the German SDQ-P for boys and girls

In contrast to the emotional symptoms and the hyperactivity/inattention subscales, the differences across age in the subscales on conduct problems, peer problems and prosocial behaviour are rather small (Fig. 2e–j, Online Resource 3). In particular, there are hardly any differences for the conduct problems subscale, suggesting that scores of the conduct problems subscale are comparable across different age groups. Very small differences across age can be observed in the upper centiles for the peer problems subscale and the prosocial behaviour subscale. Note that the ‘jump’ of the median to exactly zero in Fig. 2h is an artefact which arises from the fact that a mixed discrete–continuous distribution (zero-adjusted gamma distribution) was used. Changes in the centiles of the peer problems subscale were similar for boys and girls but slightly more pronounced for girls than for boys. The upper centiles for girls take slightly smaller values from 3 up to approx. 6 years, increase from 6 to 11 years and decrease again. The same can be observed for boys, but the changes in the centiles are smaller. Note that we inverted the scores of the prosocial behaviour subscale: higher scores obtained on this subscale indicate worse prosocial behaviour. For the prosocial behaviour subscale, the centiles first slightly decrease from 3 to approx. 9 years and then increase again.

Table 5 shows the norm values not specific to age for the subscales on conduct problems, peer problems and prosocial behaviour, respectively. Note that the subscales take only a limited number of values (cf. Online Resource 3). Thus, it is common that no value corresponds exactly to the 80th, 90th or 95th centile, as was already noted in other papers (e.g., [12, 13]). In some cases, there is even a large deviation between the centile actually attained by a score and the nominal 80th, 90th or 95th centile, as seen from Table 5.
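Why a discrete score rarely hits a nominal centile exactly can be seen from a cumulative frequency table. The sketch below uses invented counts for a subscale (not KiGGS data), and the helper name `smallest_score_reaching` is hypothetical. It returns the smallest score whose cumulative percentage reaches the nominal centile, together with the percentage actually attained.

```python
from typing import Dict, Tuple


def smallest_score_reaching(counts: Dict[int, int], centile: float) -> Tuple[int, float]:
    """Smallest score whose cumulative percentage is at least `centile`."""
    total = sum(counts.values())
    cumulative = 0
    for score in sorted(counts):
        cumulative += counts[score]
        attained = 100.0 * cumulative / total
        if attained >= centile:
            return score, attained
    raise ValueError("centile must not exceed 100")


# Invented frequency table for a subscale score (illustration only).
toy_counts = {0: 50, 1: 25, 2: 12, 3: 8, 4: 5}

print(smallest_score_reaching(toy_counts, 80))  # (2, 87.0)
print(smallest_score_reaching(toy_counts, 90))  # (3, 95.0)
```

In this toy table, the cutoff chosen for the nominal 80th centile actually corresponds to the 87th centile, and the nominal 90th and 95th centiles share the same score.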

Table 5 Norm values (not age-specific) for the conduct problems subscale, the peer problems subscale and the prosocial behaviour subscale of the German SDQ-P for boys and girls

Discussion

With the present study, we validated the five-factor structure of the German SDQ-P across the entire age range of 3–17 years. Our results are in accordance with prior studies on the validation of the factor structure for the German SDQ-P, namely that the German SDQ-P can be used for children and adolescents [11,12,13]. Since we were the first to also include children younger than 6 years from a nationally representative sample, we additionally conclude that the SDQ and the subscales can be derived from parent ratings of preschoolers, as the study by Klein et al. on a small regional sample has suggested [10].

SDQ scores are often interpreted irrespective of the child’s age. The underlying assumption is that identical scores represent the same level of the construct measured by the SDQ for children of different ages. Both MGCFA and centile curves revealed that the SDQ is not measurement invariant across age, although the factor structure could be confirmed across the entire age range of the present study. The results of this study help with identifying children with abnormal behaviour and rating the severity of abnormality, while explicitly taking the developmental course of SDQ problems into account. For example, we showed that no more than 5% of the 4-year-old boys in Germany are expected to have an SDQ total score exceeding 16.96 (95th centile). A 4-year-old boy with an SDQ total score of 18, say, could accordingly be rated as showing clinically relevant behaviour that is abnormal for boys of that age. Being aware of those differences across age groups is critical. Without knowledge of differences in the SDQ scales between age groups, services might be denied to children of specific ages because their SDQ scores are below a clinical cutoff despite high levels of impairment [28]. Further, as was noted by Bowen and Masa, “Researchers might draw erroneous conclusions about relationships among social, emotional, or behavioural constructs and outcomes for subgroups [and] Their conclusions could translate into guidelines for intervention that are inappropriate for some clients” [28]. We, therefore, strongly encourage the use of age-specific norms for the SDQ total difficulties score, as well as for the emotional symptoms subscale and the hyperactivity/inattention subscale. Norms that are not specific to age might be used for the other three subscales since measurement invariance across age could be assumed for these.

Note that the KiGGS study population is far larger than that of most of the existing studies which derive norms for the SDQ or assess psychometric properties of the SDQ [5]. The number of subjects of each age is sufficiently large in KiGGS for investigations on measurement invariance across the complete age range. The present study allows detailed analyses on measurement invariance of the German SDQ-P across all ages, from early childhood to late adolescence, as well as the establishment of age-specific norms. This is in contrast to the existing studies [5]. None of these studies covered the complete age range from 3 to 17 years. The focus on a narrow age range prevents an assessment of SDQ measurement invariance across the complete age range. Further, the existing studies did not include a large number of subjects of the same age, so that children of different ages had to be pooled into age groups. In particular, age groups that cover a wide age range might be too heterogeneous, so that differences within age groups (e.g., in norm values) are concealed. In contrast to our findings, Rothenberger et al., for example, reported no differences across age based on their MGCFA results [11]. In their MGCFA, they subsumed children aged 7–10 years in one group and children aged 11–16 years in another group. The centile curves we have derived show that there are considerable differences within each of the two age groups. Further, our results show that centiles do not increase or decrease linearly with age but that there is a non-linear change with a “peak” at the age of around 10 years, indicating that the variability of SDQ scores is largest for children at the age of around 10 years. In contrast, younger and older children have narrower normal ranges. These findings suggest that the categorisation used by Rothenberger et al. is too coarse to detect any differences across age since aggregated values of 7–10-year-olds and 11–16-year-olds are similar. Note that the sample from the BELLA study used in their analyses is a subsample of our study population. This supports our interpretation that the different conclusions are not based on sample differences but are related to the subsuming of heterogeneous ages into one broad age group.

In principle, we might have also used the centiles that are observed in each age group as norm values. However, due to the limited number of subjects within each year of age, there is a large variation in these centiles, in particular for extreme centiles like the 95th or the 90th centiles, which are of special interest in the context of identifying children with abnormal behaviour. Another disadvantage of this approach is that the exact age of survey participants would be neglected when computing centiles for each year. The reference value for a boy who has just had his 12th birthday would ideally be derived from children of exactly the same age, rather than from 12-year-old boys who are about to turn 13. Smoothed centile curves account for random variations in the centiles and they reflect changes in the course of life. Centile curves have frequently been used in population-based studies to flexibly model the centiles of specific variables as a function of age. They have become an established tool for measurements related to growth and development in the context of paediatrics. To our knowledge, this method has rarely been used in psychology, and in particular not in the context of age-specific norm values for the SDQ. Note that the presented centiles, which might be used as cutoff values for identifying children with clinically relevant behaviour, do not require the subscale score or SDQ total score to be an integer. In the presence of incomplete questionnaires, we recommend comparing real-valued scores to the centiles since real-valued scores are more precise.
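The recommendation to compare real-valued scores to the centiles can be illustrated with a minimal sketch. It assumes that each subscale consists of five items scored 0, 1 or 2 and that a missing item is handled by prorating, i.e. scaling the mean of the answered items up to five items. The requirement of at least three answered items and the function name `prorated_subscale` are assumptions made for illustration, not part of the analyses reported here.

```python
from typing import Optional, Sequence


def prorated_subscale(
    items: Sequence[Optional[int]],
    n_items: int = 5,        # assumed number of items per SDQ subscale
    min_answered: int = 3,   # assumed minimum of answered items (illustrative)
) -> Optional[float]:
    """Return a real-valued subscale score, or None if too few items are answered.

    Missing items are passed as None. The score is the mean of the answered
    items multiplied by the number of items in the subscale.
    """
    answered = [x for x in items if x is not None]
    if len(answered) < min_answered:
        return None
    return n_items * sum(answered) / len(answered)


# One item missing: the four answered items sum to 6, so the prorated score is 7.5.
score = prorated_subscale([2, 1, None, 2, 1])
```

Rounding such a score to an integer before comparing it with a cutoff would discard information, which is why the unrounded value is kept in this sketch.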

When using age-specific norms, one should, however, be aware that the categorisation of SDQ scores as abnormal (90th percentile) or borderline (80th percentile) is based on a prevalence of mental health problems of 10–20% for children and adolescents, and stems from the entire age range rather than from specific age subgroups. Goodman advises using cutoffs based on knowledge of the prevalence in the general population [3]. Further, he notes that it “may be appropriate to adjust cutoffs for age and gender” ([3], p. 585). Age-specific prevalence rates based on nationally representative samples have, however, not been reported so far, and our nationally representative sample does not allow estimating the prevalence of mental health problems in a reliable way. Future studies are needed to address this issue. In their latest manuscript on this issue, Goodman et al. proposed using the 80th, 90th and 95th centiles as cutoffs for defining the scores as ‘close to average’ (< 80th centile), ‘slightly raised’ (80th–90th centile), ‘high’ (90th–95th centile) and ‘very high’ (> 95th centile) instead of categorising individuals as abnormal [7]. This categorisation does not depend on the prevalence of psychopathological abnormalities. It is thus applicable in populations with a different prevalence of psychopathological abnormalities, such as boys and girls or children of different ages. We, therefore, recommend using this interpretation of SDQ scores as long as there is no knowledge of the age-specific prevalence of mental health problems.
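The four-band interpretation can be sketched as a simple lookup against age- and sex-specific cutoffs. In the example below, only the 95th centile of 16.96 for 4-year-old boys is taken from the present study. The 80th and 90th centile values are hypothetical placeholders, and the function name `classify_sdq` is likewise an assumption. The bands as described above leave open how a score exactly equal to a cutoff should be treated, so the handling of ties in this sketch is an assumption.

```python
from typing import Tuple


def classify_sdq(score: float, cutoffs: Tuple[float, float, float]) -> str:
    """Assign a (possibly real-valued) SDQ score to one of four bands.

    cutoffs holds the 80th, 90th and 95th centiles for the child's age and sex.
    Assumed tie handling: a score equal to the 80th or 90th centile goes to the
    higher band; a score equal to the 95th centile is still rated 'high'.
    """
    c80, c90, c95 = cutoffs
    if not c80 <= c90 <= c95:
        raise ValueError("cutoffs must be ordered: 80th <= 90th <= 95th centile")
    if score < c80:
        return "close to average"
    if score < c90:
        return "slightly raised"
    if score <= c95:
        return "high"
    return "very high"


# 95th centile (16.96) as reported for 4-year-old boys;
# 12.0 and 14.5 are HYPOTHETICAL placeholders for the 80th and 90th centiles.
boys_age_4 = (12.0, 14.5, 16.96)
band = classify_sdq(18, boys_age_4)
```

With these cutoffs, a total score of 18 exceeds the 95th centile and falls into the ‘very high’ band, in line with the example of the 4-year-old boy given earlier in this discussion.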

Conclusion

We used data from a large nationally representative survey (KiGGS) to provide norm values and to validate the German parent version of the SDQ (SDQ-P) across the complete age range. For the first time, evidence was provided that the German SDQ-P is also a valid screening instrument for preschoolers. Moreover, we showed that the SDQ-P is not measurement invariant across age for either boys or girls. Results from both MGCFA and centile curves showed that the absence of measurement invariance is attributable to age-dependent response behaviour on some items of the emotional symptoms subscale (item “worries” for both boys and girls and item “somatic” for girls) and the hyperactivity/inattention subscale (item “distractible” for boys and item “restless” for girls). When screening for mental health problems, we, therefore, propose using age-specific norms and cutoff values for the SDQ total difficulties score and for the subscales on hyperactivity/inattention and emotional symptoms. In contrast, we propose using generic norms and cutoff values for the subscales on conduct problems, peer problems and prosocial behaviour since these subscales were shown to be measurement invariant across age. Norm values for the SDQ and its subscales were derived from data of a large nationally representative sample and are provided along with this paper.