Introduction

Psychosocial interventions for foster/kinship families are implemented, disseminated and evaluated in an exceptionally complex and challenging context. This context includes the wide-ranging and heterogeneous difficulties of children in alternative care, the varying capacity of carers to meet these children’s needs and the multifaceted child protection system itself. These factors, which are well established in research, are essential to understanding the methodological limitations and challenges analysed in this review.

Children’s experience of maltreatment prior to their entry into alternative care has repercussions across a wide spectrum of domains essential to their well-being and development. Enduring and complex trauma is associated with increasing complexity of symptoms (Cloitre et al. 2009). Abuse and neglect, occurring across essential developmental periods, have been associated with dysfunction in the neurobiological systems underlying physiological, arousal, relational and executive brain function (Perry 2009), resulting in poor physical health, hyperarousal, and attachment and attentional problems (Ford and Courtois 2013; Marti Haidar 2013; Fisher et al. 2000). As a result, children in care have recognised difficulties with emotional and behavioural regulation (Greenberg et al. 2001; Appleyard et al. 2005), an increased prevalence of mental disorders including depression, anxiety and posttraumatic stress disorder (Leenarts et al. 2013; Ford and Courtois 2013) and delays in cognitive and academic functioning (Jacobsen et al. 2013; Gutman et al. 2003).

When children are in family-based alternative care (i.e. foster or kinship care), the principal responsibility for meeting this wide array of needs falls on children’s carers. Ideally, foster/kinship carers provide a caring and consistent environment that helps to remediate the effects of maltreatment, with a wide range of research indicating that stable, early placement predicts better long-term outcomes across a number of domains (James et al. 2004; Sinclair et al. 2005; Oosterman et al. 2007; Rubin et al. 2007). The provision of care, however, faces significant challenges, including the behavioural problems identified as the strongest predictor of placement breakdown (Oosterman et al. 2007), which is in turn a strong predictor of subsequent breakdown. Placement instability is common in the foster care system, with a significant number of foster children moving through several placements in their childhood (Sinclair et al. 2005; James et al. 2004). Unsurprisingly, this continued instability exacerbates problems in developmental, behavioural and mental health domains (Rubin et al. 2007; Ryan and Testa 2005).

In addition to challenges related to their foster child’s needs, foster/kinship carers are required to navigate a complex, poorly resourced and at times poorly functioning child welfare system, which some carers report as their greatest single source of stress (Buehler et al. 2003). An integrative review of research on foster carers’ experience found foster carers reporting a range of stresses, including poor relationships and disagreements with case workers and a lack of effective resources and supports (Blythe et al. 2014). Navigating these challenges may be further complicated by the conflicted identity foster/kinship carers may experience in fulfilling both professional and parental roles, with concomitant issues around attachment, uncertainty and commitment (Blythe et al. 2014). This range of barriers to the provision of stable and caring foster/kinship placements may in part explain the generally poor long-term outcomes for children in alternative care. Reviewing 30 years of longitudinal data, Goemans et al. (2015) found that adaptive function and internalising and externalising behaviour problems remained unchanged during children’s time in foster care, with poorer outcomes associated with longer time spent in care.

Interventions for Foster/Kinship Carers

Researchers and clinicians have responded to this array of challenges by developing a wide range of interventions for family-based alternative care (i.e. foster/kinship care). Typically, such interventions are designed to increase foster/kinship carers’ skills, both to better manage the sequelae of children’s maltreatment and disrupted attachment and to strengthen carers’ capacity to provide stable, responsive and consistent care. Increasingly, these skills and capacities have come to be seen as essential to effective foster/kinship care (Turner et al. 2007; Everson-Hock et al. 2012). In keeping with the growing demands of evidence-based practice, foster/kinship family interventions have been evaluated using increasingly sophisticated methodologies in recent years, with growing use of randomised controlled trials (RCTs). Evidence from these trials has been synthesised in a range of reviews, including systematic and empirical reviews and meta-analyses. Review authors have reported a range of methodological limitations that challenge their ability to draw robust conclusions and make clear recommendations about the effectiveness of these interventions. Limitations relating to internal validity included mixed study quality (Leve et al. 2012; Kerr and Cossar 2014), insufficient sample sizes (Festinger and Baker 2013; Craven and Lee 2006) and lack of replication (Kinsey and Schlösser 2013). Problems relating to outcome measures included the use of non-validated measures (Craven and Lee 2006) and a heterogeneity of outcome measures that made it difficult to adequately synthesise or compare trial results (Leve et al. 2012). Challenges to external validity included trial participants who were not representative of the alternative care population, outcome measures not representative of foster child well-being (Leve et al. 2012; Everson-Hock et al. 2012; Dorsey et al. 2008), limited long-term follow-up (Luke et al. 2014; Tarren-Sweeney 2014), a lack of fidelity assurance (Festinger and Baker 2013) and an absence of metrics conveying meaningful clinical change (Tarren-Sweeney 2014). Reviewers also noted significant heterogeneity in clinical factors, including diversity in intervention aims, duration, setting and participant characteristics (Racusin et al. 2005; Everson-Hock et al. 2012; Dorsey et al. 2008; Kerr and Cossar 2014; Tarren-Sweeney 2014), all of which make intervention planning and evaluation more challenging. Two reviewers recommended that future evaluation research pay more attention to the clinical, experiential and cultural context of participants, in order to provide essential information about how these factors lead to differential responses to treatment, including possible harms (Luke et al. 2014; Tarren-Sweeney 2014). One reviewer noted that the limitations of current evaluation research reflect, to some extent, the limitations of our current theoretical frameworks, which do not adequately capture the complex developmental challenges of children in alternative care (Tarren-Sweeney 2014).

Potentially, these methodological limitations represent a significant barrier to evidence-based practice in the field. The function of RCTs is to establish an unbiased measure of an intervention’s effectiveness in a specified setting for a specified population (Ioannidis et al. 2008). Fulfilling this objective depends upon internal validity, such that methods and procedures are followed to minimise the risk of bias and limit non-systematic effects on trial outcomes (Clarke and Oxman 2000). In addition, RCTs depend upon external validity, such that the results of trials are generalisable to a defined group of patients in a specified setting (Rothwell 2005). Thus, constraints on internal validity limit the robustness of clinical trial results, while constraints on external validity limit the generalisability of trial results to clinical practice.

Systematic reviews and meta-analyses of evaluation trials function to systematically synthesise and compare trial results in order to guide clinical practice and policy. Reviewers therefore depend on both the internal and external validity of included trials, and on the use of valid criteria to include and exclude trials from review (Ioannidis 2005). In addition, meaningful comparison and synthesis of individual trial results depend on a sufficient level of uniformity between trials (Higgins et al. 2011). This is the case because clinical heterogeneity, including differences in participant characteristics, trial methods and measured outcomes, can moderate the magnitude of the intervention effect (West et al. 2010), risking the introduction of non-systematic bias into reviewers’ conclusions and, by extension, threatening the internal validity of those conclusions. Finally, because these factors also determine which participants benefit from which interventions in which settings, heterogeneity threatens the external validity that allows research conclusions to inform clinical practice and policy.
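Although this review is concerned with clinical rather than statistical heterogeneity, the two are related: differences between trials in participants, methods and outcomes surface as excess variability in effect estimates, which meta-analysts conventionally quantify with Higgins’ I² statistic (described in Higgins et al. 2011). The following is a minimal illustrative sketch only, using made-up effect sizes and variances rather than data from any trial discussed here:

```python
def i_squared(effects, variances):
    """Higgins' I^2: the proportion of total variation in effect estimates
    attributable to between-trial heterogeneity rather than sampling error."""
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) if q > 0 else 0.0

# Hypothetical standardised effect sizes and variances for three trials
print(round(i_squared([0.2, 0.5, 0.9], [0.04, 0.04, 0.04]), 2))  # → 0.68
```

An I² near zero suggests effect estimates differ only by sampling error, whereas values above roughly 50% are conventionally read as substantial heterogeneity, cautioning against naive pooling of trial results.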

Aims of Review

While existing reviews have reported a great number of limitations in the research evaluating interventions for foster/kinship families, their full impact on the evidence base has yet to be identified in a systematic review. As a result, consensus is limited as to which interventions provide robust evidence of efficacy for which problems and participants. In addition, while reviewers have provided a range of helpful recommendations for future research, these recommendations are not based on a systematic review focussed solely on methodological limitations.

This review aims to systematically identify challenges related to the complexity of the field that currently prevent a robust evaluation of interventions that can be synthesised, compared and generalised to foster/kinship carers and children in care. Specifically, it will provide a systematic review of factors that threaten both the internal and external validity of programme evaluations, and of the clinical heterogeneity that exists between programmes. Given the improvement in methodological quality over time (Festinger and Baker 2013), this study will focus on the most recent evaluations of the highest available methodological quality.

The present review will address the following research questions:

  1. What are the threats to external validity of randomised trial evaluations?

  2. How much clinical heterogeneity is present between randomised trial evaluations?

  3. What are the threats to internal validity in randomised interventions for foster families?

Method

Inclusion/Eligibility Criteria

Types of Studies

Studies were eligible for this review if they were conducted after 1990, were published in a peer-reviewed journal, had more than 20 participants and reported quasi-random or random allocation of participants to control or experimental groups. Only studies that evaluated interventions with a stated aim of improving child well-being were included.

Types of Participants

Participants were foster and kinship carers (with or without their foster children) caring for children under the age of 18. The focus of this review is on interventions primarily directed to family-based alternative care (i.e. foster or kinship care). For this reason, we excluded interventions targeting residential group care, institutional care, children referred to care by the juvenile justice system and interventions that solely targeted adoptive or biological parents.

Types of Interventions

Psychosocial interventions that reported primary aims of improving child well-being, including behavioural, mental health, relational and attachment approaches. Specifically, aims related to reducing child behaviour problems, improving child mental health, child interpersonal skills, child biomarkers of stress, foster/kinship parent–child relationships, carer well-being, parenting skills and placement stability were included. Interventions with a primary aim of improved academic or school functioning were excluded. Programmes that provided additional support outside of the direct intervention, commonly known as ‘wrap-around’ services, were excluded, as this was beyond the scope of this study.

Types of Outcome Measures

Interventions that used at least one psychometrically validated quantitative outcome measure related to child well-being.

Types of Comparisons

Studies that randomised participants to an intervention group and an active or inactive control group, and reported comparative change in outcomes after the intervention.

Search Methods for Identification of Studies

The databases CINAHL, Cochrane Library, PsycINFO, Scopus, Medline and Web of Science were searched for articles published between January 1990 and August 2016. Search terms were modified to meet the individual requirements of each database.

Selection of Studies

Studies were first screened and excluded by title by one author (JK). Studies were then screened independently by abstract and by full article by two authors (JK and AD). Discrepancies were resolved by consensus after further detailed analysis and reading. The first stage of the project involved a systematic review with qualitative data analysis consistent with the PRISMA-P statement (Moher et al. 2015). Figure 1 provides an overview.

Fig. 1

Flowchart of selection of studies for inclusion in this review

Data Extraction and Management

Data were extracted by AD and JK independently. No imputation of missing data was undertaken. Five randomly selected included studies were used to pilot the criteria and extraction process, which were modified after consultation between researchers. Data relating to intervention setting and format, participant characteristics, intervention characteristics and outcomes were extracted from each study. Setting and format data included location, duration, dosage, delivery, format and setting. Data relating to participant characteristics included age, gender, placement history, maltreatment history, foster/kinship carer experience and family characteristics. Data relating to the interventions included theoretical approaches, intervention aims and outcomes measures used to evaluate efficacy. These data categories were chosen due to their importance as predictors of foster child outcomes.

Risk of Bias

Risk of bias was based on reporting in published journal articles and was evaluated across six domains using the Cochrane risk of bias tool (Higgins et al. 2011). Bias was rated independently by two authors (JK and AD), with inter-rater reliability assessed by an independent researcher (LL). Inter-rater reliability varied across risk of bias domains, with good agreement in half of the domains and moderate agreement in the remaining half. The random sequence generation domain had inter-rater agreement of 70.6% (κ = 0.512, p < 0.002); allocation concealment, 94.1% (κ = 0.883, p < 0.001); performance bias, 100% (κ = 1, p < 0.001); outcome assessment bias, the lowest of all domains, 70.6% (κ = 0.514, p < 0.002); attrition bias, 88.2% (κ = 0.79, p < 0.001); and reporting bias, 76.5% (κ = 0.511, p < 0.001). Where raters disagreed, the reported results reflect a consensus agreement.
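Agreement statistics of this kind pair raw percentage agreement with Cohen’s κ, which corrects that agreement for the proportion expected by chance given each rater’s marginal frequencies. A minimal sketch with hypothetical ratings (not the actual rating data from this review):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    # Observed agreement: proportion of items on which the two raters agree
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal category frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    pe = sum(counts_a[c] * counts_b[c] for c in counts_a.keys() | counts_b.keys()) / n ** 2
    return (po - pe) / (1 - pe)

# Hypothetical ratings of 17 studies on a single bias domain (LOW/UNCLEAR/HIGH)
rater_1 = ["LOW"] * 8 + ["UNCLEAR"] * 5 + ["HIGH"] * 4
rater_2 = ["LOW"] * 7 + ["UNCLEAR"] * 6 + ["HIGH"] * 4
print(round(cohens_kappa(rater_1, rater_2), 3))  # → 0.909
```

For the six domains reported above, one such κ would be computed per domain over the 17 paired study-level ratings, which is why percentage agreement and κ can diverge when ratings cluster in one category.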

Results

As shown in Fig. 1, 17 studies were included in the final analysis. Table 1 provides a list of included intervention names and abbreviations. See ‘Appendix’ for further characteristics of the included studies.

Table 1 Abbreviation for included interventions

Setting and Format

Eleven of the seventeen studies included in this review were conducted in the USA, three in the UK, two in the Netherlands and one in Romania. All study results were published in peer-reviewed journals. Five were conducted by researchers currently affiliated with the Oregon Social Learning Center (KEEP1-3, KITS, MSS). All other interventions were evaluated by individual researchers.

Duration, dosage, format and delivery of the interventions varied a great deal. Figure 2 summarises intervention dosage across studies. Duration varied from 3 days to 16 weeks, individual session length from 1 to 8 hours, total sessions from 3 to 52 and total contact time from 8 to 104 hours. Three interventions were delivered to parent–child dyads in the home (ABC, FFI, PFR), and one was delivered to individual parents in a community setting (PMTO). All remaining interventions used group formats in a community setting. Just over half of the interventions (53%) were delivered to both foster/kinship carers and the children in their care (ABC1, ABC2, FFI, IY + CP, KITS, MSS, PCIT, PFR, PSB), with the remaining interventions delivered to foster/kinship carers only.

Fig. 2

Duration and length of interventions

Participant Characteristics

Figure 3 summarises the overall level of reporting of demographic information by included studies. All included studies reported some demographic information for children in care and their carers, including age, gender, placement history, maltreatment history, carer experience and family characteristics. Demographic information for foster/kinship carers was generally well reported, while reporting of demographic information for index foster/kinship children was less extensive, despite its identification as an important predictor of child well-being (Redding et al. 2000; Oosterman et al. 2007). This lack of complete reporting limited our ability to determine the representativeness of participants and the degree of heterogeneity in participant characteristics.

Fig. 3

Reporting of demographic information overall and separately for intervention and control

Figure 4 summarises age range and mean age of trial participants. Mean age was the most consistently reported participant characteristic; all but one of the studies (CBT-PT) reported the mean age of the index child at baseline. Mean age of participants varied a great deal across studies, from 18 months to 11 years. Age range of included participants varied from 1 year (ABC1) to 15 years (KEEP1).

Fig. 4

Mean age and age range of participants in included studies

Fifteen of seventeen studies (88%) reported on the proportion of male and female index foster/kinship children, with a generally even proportion of male and female participants in most studies. Eleven interventions (65%) reported data relating to ethnicity for children and carers, with significant heterogeneity between studies that ranged from 74% African American participants (ABC1) to 86% Caucasian (KEEP1).

Just over half of the studies (53%) reported data relating to the index child’s placement history, reflecting its recognised importance as a predictor of foster/kinship family functioning and placement stability. Placement history was reported in a wide variety of formats, precluding a systematic analysis or comparison of its role as a mediating variable. Four studies reported the number of children’s prior placements (KEEP1, KEEP2, MSS, PMTO). Placement history was also reported as age at removal (ABC1, PFR, PMTO, MSS), length of current placement (FCCT, IY) and time in alternative care (MSS). Seven studies reported no placement history. Variation in age of removal was evident between interventions aimed at younger children (e.g. ABC) and those including older children (e.g. KEEP2), with studies involving younger children generally reporting a younger age of removal.

Maltreatment history of the index children was reported in seven interventions (41%). Five studies reported the percentage of children who had experienced neglect and abuse (FCCT, KEEP1, MSS, IY + CP, PSB), and one reported the percentage of children with a maltreatment history (FFI). One study (ABC1) reported that all children were removed due either to neglect or to psychopathology or incarceration of biological parents. Ten studies reported no information on children’s maltreatment history (ABC2, CBT-PT, CEBPT, IY, KEEP2, KEEP3, KITS, PCIT, PFR, PMTO).

Foster/kinship carer characteristics were generally well reported. Years of carer experience was recorded in 13 studies (76%). Eleven studies reported the mean years of carer experience, which ranged from 2 to 8 years (CBT-PT, IY, KEEP1, KEEP2, KEEP3, KITS, MSS, PCIT, PFR, PMTO, PSB). Four studies did not report this variable. Fostering context (kinship or non-kinship care) was recorded in nine studies (53%) and ranged from 0% kin (CEBPT) to 47% kin.

Theoretical Approaches of Included Studies

Three interventions were described as being based on principles of Attachment Theory (ABC, PFR, FFI). Four were described as being based on the principles of Social Learning Theory (IY, IY + CP, KEEP, PMTO). One intervention was described as being founded on a combination of Attachment and Social Learning Theories (PCIT), one on Cognitive Behavioural principles (CEBPT) and one on a combination of Cognitive Behavioural and Social Learning Theory (CBT-PT). One intervention was based on a combination of Social Learning and Developmental Theory (MSS) and another (PSB) on a combination of Social Learning, Family Systems and Emotion Regulation theories. One intervention (FCCT) did not report the theoretical basis on which it was founded.

Intervention Aims

Reflecting the wide variety of theoretical approaches and the diverse and complex needs of children and their foster/kinship carers, intervention aims exhibited significant variation and complexity, with the majority of interventions stating more than one primary aim. Four interventions were reported as having different primary aims across multiple published studies (ABC1, KEEP2, KITS, MSS).

Intervention Aims Related to Improving Carer Capacity and Skills

Several interventions reported a primary aim of improving foster child well-being by improving parenting skills and capacity. Interventions based on attachment principles that involved foster children younger than 6 years stated aims including enhancing caregiver sensitivity and empathy (PFR), helping carers to create an environment that fosters regulatory abilities (ABC1, ABC2), recognising and responding to child stress (FFI) and meeting unmet needs (PFR). Interventions based on Social Learning Theory involved carers of children up to the age of eighteen and reported aims of reducing behaviour problems and placement disruption by increasing the use of positive reinforcement (IY, PMTO, CEBPT, CBT-PT, KEEP2 and KEEP3) or by increasing parent involvement (KITS). Three interventions reported aims relating to increasing carers’ confidence to manage behavioural problems (CBT-PT, IY, IY + CP). One intervention (FCCT) reported an aim of increasing foster carers’ communication skills and confidence. One intervention (PCIT) reported aims of both modifying the responsiveness of the parent and developing their behaviour management skills, reflecting its foundation in both Attachment and Social Learning Theory.

Intervention Aims Directly Related to Child Behaviour

Reduction in behaviour problems was the most commonly reported intervention aim (KEEP2, CBT-PT, PMTO). Four interventions reported an intervention aim of reducing externalising behaviour (CEBPT, IY + CP, IY, MSS).

Intervention Aims Related to Emotion Regulation

One intervention stated emotion regulation as the primary aim of the intervention (ABC1), while another reported the broader domain of self-regulation as one of several aims (MSS).

Intervention Aims Related to Placement Outcomes

Four studies reported a reduction in placement disruption, or inversely an increase in placement stability as long-term aims of the intervention (MSS, KEEP1, KEEP2, PFR).

Other Intervention Aims

In addition to the aims described above, some interventions had aims specifically related to the target of the intervention. These included development of literacy and pro-social skills (KITS) and the long-term reduction in internalising problems, substance use, delinquency and high-risk sexual behaviour (MSS).

Outcomes Measures Used to Evaluate Interventions

The outcome measures used by included studies are summarised in Table 2. Reported outcomes spanned a broad range of parent and child domains, reflecting the diversity of foster child needs, theoretical approaches and intervention aims. This resulted in a general lack of convergence in outcome measures: of the 122 outcomes reported across all trials, 100 unique measures or scales were used, spanning fifteen domains. Studies reported a mean of 7 outcome variables, and 90 of these measures were used in only one of the seventeen trials. Eighty-seven outcomes were self-report measures, and thirty-five were clinician-assessed. Four of the studies accounted for 66% of the clinician-rated measures, leaving the remaining thirteen studies relying largely on self-report measures (82%).

Table 2 List of outcomes used to evaluate efficacy in included trials

Child behaviour was the most commonly reported outcome domain across trials, with fifteen interventions (88%) reporting some kind of behavioural outcome. Despite this convergence, seven different behavioural outcome instruments were used to measure behaviour problems. In addition, a range of subscales of these measures was used (e.g. externalising, aggressive, and oppositional subscales) so that in total 14 different measures or subscales were used across the fifteen trials.

Parental stress and psychological well-being were the next most commonly reported outcome domains across trials, with eight studies (47%) reporting an outcome in both of these domains. The Parenting Stress Index (PSI) (Abidin 1990) was the most commonly used instrument, included in five studies (29%). Three other parenting stress outcome measures were used to evaluate outcomes in this domain (see Table 2).

Outcomes associated with the development of effective parenting skills were also common, reflecting that many interventions aimed to improve foster/kinship carers’ capacity to manage the challenging behaviour of foster children. Five studies reported outcomes in this domain, using seven psychometric instruments comprising nineteen different subscales.

Authors provided citations to articles establishing the validity and psychometric properties of outcome variables in the majority of cases (85%). Most of the commonly used scales, however, were not normed on the alternative care child population, presenting a potential challenge to their external validity. Some scales were custom-made by the developers and evaluators of the intervention (e.g. ABC, KEEP, PFR) or were custom-made composites of several scales used to provide a global measure of a domain (e.g. MSS). Some of these custom-made measures failed to provide information on, or references for, construct validity or psychometric properties (e.g. Reactive Attachment Disorder in PFR, Parent Attachment Diary in ABC1), or cited articles that were unpublished or inaccessible on journal databases (e.g. Parent Daily Report in KEEP).

Risk of Bias in Included Studies

Figure 5 provides a visual overview of the consensus risk of bias assessment (for detailed risk of bias judgements and comments, see ‘Appendix’).

Fig. 5

Risk of bias assessment of included studies

Selection and Allocation Bias

Selection bias can threaten the internal validity of trials by introducing non-systematic differences between intervention and control groups (Higgins et al. 2011). Methods used to minimise selection bias include allocating participants using random sequence generation and concealing participant allocation from researchers.

All seventeen included studies were described as randomised controlled trials and reported a range of methods used to minimise selection bias. Ten of the seventeen trials reported methods of random sequence generation that met criteria for a ‘LOW’ risk of selection bias. Four did not provide sufficient information about randomisation and were judged ‘UNCLEAR’. One was considered ‘UNCLEAR’ because, despite a well-reported randomisation procedure, there were significant unexplained differences in baseline levels of the primary outcome measure (KEEP3). Two studies received a ‘HIGH’ risk of bias rating due to a failure to provide sufficient information indicating that randomisation had taken place, together with significant baseline differences that may have favoured the intervention group (KEEP1, PSB).

Nine of the seventeen trials were judged to have ‘LOW’ risk of bias in the allocation bias domain because they described methods of allocation concealment sufficient to minimise risk of selection bias. Eight trials did not provide sufficient information about allocation concealment to make a clear judgment, and received an ‘UNCLEAR’ rating.

Reporting demographic data independently for the intervention and control groups can help to ensure that randomisation was successful and that trial groups are equivalent (Higgins et al. 2011), as unreported differences at baseline can lead to biased estimates of between-group differences (Clarke and Oxman 2000). Eight studies (47%) failed to report any baseline demographic data independently for intervention and control groups (ABC2, CEBPT, CBT-PT, IY + CP, KEEP1, PCIT, KITS, MSS). The remaining studies reported a mixed range of characteristics for control and intervention groups (see Fig. 3). Because placement history, maltreatment history, foster carer experience and family context are all well-established predictors of foster child difficulties and outcomes, ensuring group equivalence on these factors is an important guard against potential bias. These standards of reporting were reflected in the Cochrane risk of bias ratings given to the studies.

Performance Bias

Performance bias refers to participants’ exposure to factors other than those related to the intervention. To achieve a ‘LOW’ risk of performance bias, participants and researchers are required to be blind to treatment condition, so-called double blinding. Blinding is a recognised challenge in psychosocial interventions, as participant blinding requires an active control group (Goldbeck and Vitiello 2011). For this reason, only two studies reported double blinding and received a ‘LOW’ rating for performance bias. One study informed participants about blinding, to ensure that the 11 foster children were not aware of treatment condition; this study received an ‘UNCLEAR’ rating for performance bias. Fourteen studies received a ‘HIGH’ judgment for risk of performance bias because they used a treatment-as-usual control group.

Detection Bias

Detection bias can result when the assessment of outcomes changes in response to knowledge of participants’ allocation status. One author (AD) noted that, as parent self-reports were used for outcome assessment, parents were effectively the assessors of the study outcomes. Overall, nine studies were judged to have a ‘LOW’ risk of detection bias because at least some of the assessments were conducted by blinded research assessors. Two studies were judged to have an ‘UNCLEAR’ risk of detection bias because researchers were reported as blind but only self-report measures were used. Six studies received a judgement of ‘HIGH’ risk of detection bias because they used interview assessments in which neither party was reported as blinded.

Attrition Bias

Attrition bias refers to systematic differences between groups arising from drop-outs that may bias analysis of post-intervention outcomes (Higgins et al. 2011). Ten of the trials were judged to have a ‘LOW’ risk of attrition bias because attrition rates were less than 20% and outcome data were analysed to compensate for drop-outs using Last Observation Carried Forward (LOCF) analysis. Three studies received an ‘UNCLEAR’ risk of bias rating because they did not provide sufficient information about attrition rates. Four studies were judged to be at ‘HIGH’ risk of bias because they had high levels of attrition, included no information about attrition at all or did not report the use of an intention-to-treat analysis to compensate for drop-outs.
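LOCF is a simple single-imputation strategy: each participant’s missing post-baseline values are replaced with that participant’s most recently observed value, so drop-outs can still contribute to an intention-to-treat analysis. The sketch below is illustrative only and does not reproduce any included trial’s analysis:

```python
def locf(series):
    """Last Observation Carried Forward: replace each missing value (None)
    with the most recent observed value for this participant."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # remains None until a first observation exists
    return filled

# One participant's scores at five assessment points, dropping out mid-trial
print(locf([3.0, None, 2.5, None, None]))  # → [3.0, 3.0, 2.5, 2.5, 2.5]
```

A known limitation is that LOCF assumes no change after drop-out, which can bias effect estimates when attrition is related to outcome; this is one reason risk of bias tools scrutinise how trials handle incomplete outcome data.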

Reporting Bias

Reporting bias refers to systematic differences between reported and unreported findings and includes the selective reporting of both trials and trial outcomes (Higgins et al. 2011). Two of the included interventions (18%) were judged to have a ‘LOW’ risk of bias because they pre-registered both the aim of the intervention and the outcome measures that were subsequently reported (FCCT, PFR). Ten of the included interventions received an ‘UNCLEAR’ judgement of reporting bias because the trial was not pre-registered and the authors provided no evidence that all of the outcome measures used in the trial were reported. Finally, five of the included trials were judged to be at ‘HIGH’ risk of bias. Of these, one trial received this rating because the authors pre-registered outcome measures that were not subsequently reported and also reported outcomes that were not registered (MSS). The remaining studies received this rating because the primary outcome measure reported in the abstract was the only significant finding among many reported outcomes (IY), because they reported a primary outcome in a domain not associated with the intervention aims (PMTO), or because outcomes were added to the trial registry after the trial had been completed, shortly before a report was submitted for publication (ABC1).

Trial registration is now seen as an essential safeguard against publication bias in scientific research (DeAngelis et al. 2004). Only five of the seventeen trials (29%) registered outcome variables in a trial registry (ABC1, KEEP3, PFR, KITS, MSS) (for details see ‘Appendix’). Of these, one trial added outcome measures to the registry after trial completion (ABC1), and two registered only outcome domains, rather than specific assessment instruments (MSS, KITS). Only two interventions (11%) pre-registered the specific outcome instruments that were subsequently used as evidence of programme efficacy in published articles (PFR, KEEP3). These limitations were reflected in the reporting bias ratings given to the included trials.

Interacting Bias Across Domains

Assessing risk of bias is not an algorithmic or automatic process, but depends on assessor judgement based on a detailed analysis of reporting (Higgins et al. 2011). In some studies, interactions between biases across domains led to unclear or misleading reporting of results. Specifically, reporting bias, combined with randomisation, allocation or attrition bias, led to reports of intervention efficacy that were not reflected in systematic differences between intervention and control groups. One randomised study reported significant benefits to foster parents and children in the intervention (KEEP1). However, the significantly greater reduction in behaviour problems in the intervention group (i.e. the difference in change scores) compared to control was due to differences at baseline; there was no significant between-group difference in post-intervention scores. Another randomised study reported significantly larger changes in mean behaviour problems for the intervention group, compared to control (IY). However, this larger change was not reflected in lower levels of post-intervention problem behaviours for the intervention group. Instead, the reported changes resulted from higher baseline levels of behaviour problems in the intervention group, suggesting problems with randomisation. Additionally, the authors reported positive and significant effect sizes for the intervention based solely on pre–post scores (paired t tests), rather than the negative effect sizes associated with between-group differences in post-intervention scores. A third study clearly reported randomisation procedures that had failed to result in equivalence between control and intervention groups (KEEP3). However, the authors also reported an HLM analysis suggesting that significantly larger decreases in behaviour problems were found in the intervention group, compared to control.
These differences in change scores, reported as a significant group × time interaction effect, were the result of baseline differences in behaviour problems. Even though post-intervention scores were equivalent, suggesting there was no benefit to participants randomised to the intervention, significant and positive within-group effect sizes were reported. Finally, the significant findings reported were based on analyses in which focal siblings were swapped in during the intervention because of attrition. An additional analysis that included the same participants pre- and post-intervention was not significant, even for within-group changes. This suggests that potential allocation or attrition bias may have contributed to reporting that did not reflect intervention effects on focal children.
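The change-score artefact described above can be illustrated with a small simulation. This is a hypothetical sketch: the group sizes, means and standard deviations are invented for illustration and are not taken from any of the reviewed trials.

```python
import random
import statistics as stats

random.seed(0)

n = 50
# Failed randomisation: the intervention group starts with more behaviour
# problems (mean 60) than control (mean 50). Both groups end at the same
# post-intervention mean of 50, i.e. no actual treatment benefit.
interv_pre   = [random.gauss(60, 8) for _ in range(n)]
control_pre  = [random.gauss(50, 8) for _ in range(n)]
interv_post  = [random.gauss(50, 8) for _ in range(n)]
control_post = [random.gauss(50, 8) for _ in range(n)]

interv_change  = stats.mean(interv_post) - stats.mean(interv_pre)
control_change = stats.mean(control_post) - stats.mean(control_pre)

# Change scores suggest a large "treatment effect" driven entirely by
# the baseline imbalance (roughly -10 in favour of the intervention)...
print(round(interv_change - control_change, 1))

# ...while the post-intervention group difference is roughly zero:
print(round(stats.mean(interv_post) - stats.mean(control_post), 1))
```

A group × time interaction test on these data would be significant, yet comparing post-intervention means shows the groups ended in the same place, exactly the pattern of misleading reporting described above.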

Summary of Risk of Bias Assessment

Figure 6 summarises the mixed quality of the included studies. Fewer than fifty per cent of trials reported both the randomisation and allocation concealment procedures required for a low rating of selection bias. Because of the almost universal dependence on self-report measures in single-blinded trials, only two of the seventeen included studies had a low risk of both performance and detection bias (ABC, PFR), although this reflects a shortcoming common to evaluations of psychosocial interventions that depend on self-report. Across the included studies, risk associated with attrition bias was managed best, with almost 60% reporting low attrition and the use of analytic methods to compensate for participant drop-outs. In contrast, over 80% of included studies had a high or unclear level of reporting bias. This reflected the low level of trial registration, with only five of the seventeen trials using pre-registration. Of those registered trials, only one had complete concordance between registered and reported outcomes (KEEP2; see ‘Appendix’).

Fig. 6

Summary of risk of bias across included trials

Discussion

This review aimed to provide a systematic analysis of the methodological factors that threaten the internal and external validity of interventions for foster and kinship carers and an overview of the clinical heterogeneity currently limiting the meaningful synthesis and comparison of trial results.

The results of this analysis show how the complex and diverse context of the field (see, for example, Jones et al. 2011) gives rise to a series of methodological challenges to robustly evaluating interventions for children in alternative care. The results mirror comments by Tarren-Sweeney (2013a, b, 2014) in which he outlines the range of challenges facing researchers seeking to establish a robust evidence base for children with complex trauma, attachment problems and children in alternative care. Tarren-Sweeney identifies continuing tensions between the principles of working from an empirical evidence base and the enormous complexity of human psychological development and morbidity, much of which is yet to be adequately measured or conceptualised.

Researchers in the field of child protection have spent over 40 years attempting to make sense of this complexity, using the understanding garnered from research to develop a range of interventions that aim to improve foster/kinship care in a meaningful way. Responding to the increasing demands of evidence-based practice, researchers have subjected their interventions to increasingly sophisticated trials that seek to establish their efficacy and effectiveness. Yet despite improvements in methodology over time, there continue to be several challenges in adapting the diverse and complex interventions used in family-based alternative care to the rigid requirements of randomised trials. The full extent of this complexity and its impact on the clinical heterogeneity, internal validity and external validity of the included studies are reflected in the results of this review.

External Validity

There were some limits to external validity that may restrict the generalisability of trial results. In contrast to previous reviews, which noted that several non-validated outcome variables were used to evaluate trials, this review found that the great majority of measures were validated. This difference may relate to inclusion criteria that limited this review to high-quality randomised trials. One factor on which previous reviewers rarely commented was the use of validated measures normed for non-foster/kinship care populations, which may present a threat to the external validity of the trials (Locke and Prinz 2002). Unfortunately, the measures developed specifically for use in the trials were precisely those for which no information about validation or norming with children in care was provided.

The clinical diversity of children in care makes the reporting of demographic information particularly important for establishing external validity, especially given that children in care at varying developmental stages, with different maltreatment histories, placement histories and psychological and relational problems, have different needs (Schuengel et al. 2009). Without accurate reporting of such information, there is no way of knowing which carers and children will benefit from an intervention with demonstrated efficacy. Fortunately, the reporting of participant characteristics was generally good, with some studies collecting a range of detailed information. Even in these cases, however, this may still not be sufficient to guarantee that trial results can be generalised to clinical populations. Complex and costly interventions need to be provided to those who will benefit from them most, and those who need them most must be shown to receive benefit from them (Chambless and Ollendick 2001). If trials enrol participants who are not screened for the problems the intervention is designed to target, then the results of that trial cannot be said to demonstrate effectiveness for that population. Thus, the trial’s external validity may be compromised (Rothwell 2005). Of the included interventions, only four (ABC2, PCIT, PMTO, CEBPT) used participants assessed for difficulties that were reflected in their programme’s treatment aims and could, as a result, demonstrate strong external clinical validity.

Clinical Heterogeneity

The purpose of systematic reviews and meta-analyses of clinical trials is to synthesise and compare the results of trials to provide an overall summary of intervention effectiveness (Haidich 2010). Reviewers of foster/kinship carer interventions face particular challenges in completing this task. In contrast to trial populations with specific, homogenous problems, children in care face multiple challenges and have multiple needs manifested across behavioural, developmental, psychological, physiological and cognitive domains. In turn, these challenges have further implications for the child, the foster/kinship carer, their relationship and the stability of the placement (Jones et al. 2011; Oosterman et al. 2007).

The results of this review suggest that the extent of clinical heterogeneity in participants, aims, settings and outcomes limits the meaningful synthesis of these interventions. It is unlikely, for example, that the effectiveness of an attachment intervention on improving HPA axis functioning in infants can be meaningfully compared to a behavioural intervention targeting teenage delinquency. The enormous range of outcomes used to evaluate these different interventions, and the lack of coherence and consensus over which instruments or scales should be used to determine efficacy, provides a further barrier to comparing or synthesising these results, limiting both their value in informing practice and policy and their capacity to inform future research.

More positively, the results of this study indicate that research is now beginning to focus on developing different interventions for specific developmental stages, with attachment-based interventions (ABC1, ABC2, FFI, PFR) focussing on encouraging warm, responsive care that increases attachment security for children under the age of six. In contrast, Social Learning Theory approaches (IY, PMTO, CEBPT, CBT-PT, KEEP2 and KEEP3) tend to include older participants and aim to modify the coercive cycle of negative reinforcement that can result from child behaviour problems. Similarly, the MSS and KITS interventions were targeted at children at a particular developmental stage, also showing that efforts are being made to link interventions to identified needs. Other approaches not included in this review (e.g. Multidimensional Treatment Foster Care, MTFC; Fisher et al. 2005; Fisher and Kim 2007) are showing promise in targeting interventions towards treatment need.

As such, the results of this review suggest that clinical heterogeneity (the diversity of interventions, participants and outcomes) may present real challenges to those who wish to synthesise and compare interventions in a meaningful way. Existing reviews suggest a range of research questions and hypotheses that can be tested in future interventions. Targeting these interventions at participants screened for specific problems, at specific developmental stages, and evaluating them with a unified set of outcome variables would help to address these challenges and provide material more amenable to robust meta-analyses and systematic reviews.

Challenges to Internal Validity

Randomised controlled trials—and the statistical methods used to analyse them—are designed to test a hypothesis: that a particular intervention is effective in treating a particular disorder or problem, in a specified population (Kendall 2003). Ideally, an appropriate, validated and pre-specified primary outcome measure (or composite measure) is used to detect differences between treatment and control groups. The null hypothesis (that the intervention has no effect) is rejected if the outcome is sufficiently unlikely to be a result of chance (e.g. p < 0.05). Departures from this approach challenge both the internal validity necessary to establish efficacy and the external validity that supports generalisation of trial results. Differences between the trials’ hypotheses, interventions, settings and participants challenge the ability of reviewers to meaningfully synthesise and compare the efficacy of interventions.

In the included studies, the diverse needs of participants, the complexity of the interventions and the measurement of effect using multiple outcomes over a number of domains potentially challenge unbiased trial evaluation. Firstly, hypothesis testing depends on a clear link between theory, intervention aim, hypothesis and outcome measures. Where a hypothesis is too vague (‘this intervention may be effective on one of a broad range of outcomes’), the statistical principles of null hypothesis testing first developed by Fisher and Neyman are violated (Simonsohn et al. 2014) and rejecting the null hypothesis no longer signifies a true treatment effect (Frick 1996). Ideally, then, clinical trials should be confirmatory and depend on a priori hypotheses generated by theory and exploratory research and then tested by the experimental research. In contrast to these requirements, only two of the seventeen included studies (KEEP3, PFR) used pre-registered a priori outcomes, only five trials designated a primary outcome (and did so post hoc), and fourteen trials used multiple measures that were not specified before the trial.

There is an additional issue associated with the use of multiple outcomes without a specified primary outcome: the selective reporting of significant outcomes, which can lead to a biased evaluation of efficacy. This is a potential problem that neither more recent methods, such as p-curve analysis, funnel plots and variance testing, nor meta-analytic methods are able to detect (Bishop and Thompson 2016; Ioannidis et al. 2008). It was also notable that only one study reported adjustments to p values to compensate for the reporting of multiple outcomes (PMTO), despite the established inflation of Type I error that results in these cases. As a result, some trial results near the critical value of p = 0.05 that were reported as significant may not be robust treatment effects (Feise 2002).
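The inflation of Type I error under multiple unadjusted outcomes can be made concrete with a short calculation. This is a minimal sketch that assumes independent outcome measures, which real trial outcomes rarely are; correlated outcomes inflate the rate somewhat less.

```python
# Family-wise error rate: probability of at least one false-positive
# "significant" outcome when k independent true-null outcomes are each
# tested at the nominal alpha level.
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** k

# Ten outcomes with no true effects: ~40% chance of at least one p < .05
print(round(familywise_error_rate(10), 3))  # 0.401

# Bonferroni correction tests each outcome at alpha / k, restoring the
# family-wise rate to just under the nominal 5% level.
def bonferroni_fwer(k: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha / k) ** k

print(round(bonferroni_fwer(10), 3))  # 0.049
```

With the dozen or more outcome measures typical of the included trials, an unadjusted "at least one significant result" is close to a coin flip even when the intervention has no effect at all.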

Finally, although it makes sense for interventions that address complex problems to have several aims (something that is common in many psychosocial interventions), the aims themselves also need to be pre-registered. Unfortunately, results indicate that some included studies published multiple reports of the same trial, but modified and adapted the stated aims of the intervention consistent with the outcomes that were reported. This practice again risks violating the principles of hypothesis testing, challenges the theoretical coherence of the intervention and makes statistical tests less reliable in detecting a true effect (Frick 1996). As for external validity, targeting interventions for participants screened for specific needs and using single primary outcomes to measure effectiveness will likely increase effect, which is additionally important given the identified lack of power in many of the included studies.

The intense pressure to demonstrate effects in research (Nissen et al. 2016) using RCTs (Ioannidis 2005) has led to increased attention to research guidelines that ensure high standards of trial methodology and reporting (Higgins et al. 2011) and to an increase in the overall quality of RCTs (Begley and Ioannidis 2015). In general, however, the studies included in this review tended to rely on self-report data, and this contributed to all but one of the included trials receiving a rating of high risk of bias. In addition, the use of non-active control groups means that participants cannot be blinded to group status. Given that these participants also played a central role in assessing child outcomes (e.g. using the CBCL and PDR), the combination of performance and detection bias presented a significant risk to study results. Even leaving the performance and reporting bias domains aside, only one of the included trials (PFR) had an overall low risk of bias rating using the Cochrane risk of bias criteria for RCTs. It is notable that this trial detected only one significant change across the 16 outcomes related to the intervention’s aims. As such, trial results of the included interventions that indicate treatment efficacy must be interpreted with some caution.

Limitations

The clinical heterogeneity in the interventions reviewed here extends to the entire field of alternative care that lies beyond the scope of this review. ‘Alternative care’ is a general term for children with differing legal status, under varying forms of care (e.g. kinship vs. foster care), in a variety of settings (e.g. residential vs. family care). For practical reasons, this review focussed on interventions targeted towards foster and kinship carers, excluding several important interventions for children who have experienced maltreatment and separation from their birth parents. As a result, several targeted and robust trials of interventions (e.g. Multidimensional Treatment Foster Care for Preschoolers, MTFC-P) were excluded, as were RCT evaluations that were published only in white papers, such as the Fostering Changes Programme, which has robust methods and promising results (Briskman et al. 2012). For this reason, this methodological review cannot be taken as indicative of all research in alternative care, or as providing a complete picture of the field. A full review of the field, conducted by a large, well-funded team of researchers (for example, see Fraser et al. 2013), would provide a more comprehensive critique and therefore a more informative review of methods used across the entire field of alternative care.

A further limitation was that characteristics of trial participants were not compared with norms of the general foster care populations. Assessing whether the trial participants were representative of the wider population of carers and children in care, in the outcome measures used to determine effectiveness, would have provided a more systematic analysis of the external validity. Such an analysis was precluded by the heterogeneity of the population, of the outcomes used to establish norms, and the lack of population-wide data.

Conclusions

The complexity of the field continues to present challenges to current standards of trial methodology and reporting, with the result that evidence of the effectiveness of interventions for foster/kinship carers and children remains limited. The field is yet to develop explicit hypotheses concerning which interventions are effective in treating which difficulties for which participants, and then to test those hypotheses with interventions targeted at those problems using robust methods and appropriate outcome measures.

Recommendations

In line with previous reviewers (e.g. Dorsey et al. 2008; Everson-Hock et al. 2012; Festinger and Baker 2013; Kerr and Cossar 2014; Kinsey and Schlösser 2013; Luke et al. 2014; Tarren-Sweeney 2014), the results of this review suggest further research is needed to provide robust evidence of effectiveness in the context of the complex needs and difficulties of foster/kinship families. This context is the focus of an enormous and growing body of research and provides an opportunity to align evaluation trials to current research, and test focussed hypotheses and research questions in controlled conditions.

In order to test these hypotheses and provide robust evidence that interventions are effective, a clear link needs to be drawn between the identified needs of a specific population (e.g. age, presenting problem), the aims of the intervention and the outcome measures used to detect whether those needs are met. Where possible, the use or development of appropriate measures, normed and validated for the population, would help to improve the trials’ external validity (Tarren-Sweeney 2014). Evidence-based practice is about the judicious application of evidence to the individual patient (Sackett et al. 1996). If evaluation trials are to inform clinicians, clinicians need to know for whom interventions are effective (Tarren-Sweeney 2014).

Participant characteristics and needs, intervention aims and outcome measures should be registered a priori, and trials should follow all recommended Cochrane procedures to minimise bias. The robust results of such trials would do much to clarify the significant uncertainty that remains in the field. Outcomes need not be limited to single primary outcomes: additional aims, outcomes and mediation effects can be included in research trials to generate further exploratory hypotheses for confirmatory tests in subsequent RCTs, further extending the research field. Clinical heterogeneity between studies needs to be assessed prior to any meta-analytic or systematic synthesis, as unassessed heterogeneity precludes a valid assessment of programme effectiveness (Ioannidis 2005). Reviewers seeking to make sense of this complex field would benefit from narrowing the scope of their reviews to particular population sub-groups or participant needs, approaches similar to existing reviews of attachment-based (Kerr and Cossar 2014) and cognitive-behavioural (Turner et al. 2007) interventions. Given the consistent improvements in trial methodology over time, there is promise that future research will help to unravel the daunting complexity of the field and enable more robust evaluations of interventions for foster families.