The quality of program evaluations, and its impact on their policy relevance, is a key component of the discourse on ‘what works’ in crime and justice (e.g., Sherman et al. 1997; Boruch et al. 2000; MacKenzie 2000; Weisburd et al. 2001; Farrington 2003a; Lum and Yang 2005; Sherman et al. 2006). Academics and policymakers alike are increasingly calling for better evidence. This has led to a focus on the randomized controlled trial (RCT) as the “gold standard” research design for maximizing internal validity (e.g., Berk and Rossi 1999: 20-21; Farrington 2003b). Internal validity, or the extent to which causal inferences about program effectiveness may be drawn from a given study, is considered one of the most important of four criteria proposed by Cook and Campbell (1979) for methodological quality (Farrington 2003a; Shadish et al. 2002). Another important criterion, external validity, while not necessarily maximized by a randomized design, is nonetheless crucial to evidence-based public policy because it indicates the extent to which a program’s outcomes may be applicable to settings and populations different from those under which it was tested (see also Berk and Rossi 1999: 22).Footnote 1

Despite the extent of the recent interest in methodological rigor and the increased use of RCTs, debate around the quality of evaluation research in criminal justice has not subsided. The RCT is not, in itself, a guarantee of validity. One of the most prominent recent critiques, from the United States General Accounting Office (GAO), indicated that the majority of evaluations managed by the National Institute of Justice over the preceding 10 years, even those deemed “sufficiently designed,” were so beset with methodological and implementation problems that it was difficult to “draw meaningful conclusions about the programs’ effectiveness” (reported in Lauritsen 2006: 365). There are numerous barriers to successful experimental research in criminal justice populations and settings. Attrition of participants (both pre- and post-random assignment) can occur in any research study involving human subjects, leading to biased results and limited external validity, but may be more pronounced among the frequently “risky” and “less accessible” subjects who come into contact with the criminal justice system (e.g., Goldkamp 2008: 86). The politicized, bureaucratic nature of many criminal justice agencies can create practical and financial obstacles to access and the implementation of research projects. Overall, practitioners in the criminal justice arena are not committed to a tradition of experimental research and practice to the same extent as in other disciplines, such as health (Shepherd 2003). Complex ethical concerns about the potential risks to subjects and the public of denying (or mandating) treatment may hinder the design and implementation of true experiments, particularly when experienced practitioners believe strongly in the effectiveness of an intervention (Weisburd 2003; Farrington and Welsh 2005). With such deep-seated structural factors apparently limiting the production of high-validity research, how can criminal justice policymakers decide what really constitutes the ‘best’ evidence for guiding practice?

One (relatively) simple fix among a host of suggestions put forward by Lipsey et al. (2006: 295) in response to the GAO report is the proposal that evaluators make full results and technical details about trials available to the research and policy communities. Lipsey and colleagues argue that this would facilitate discussion on how to improve evaluation methods and practice. It could also be useful in assisting policymakers and other research consumers to sort the good evaluations from the poor when making judgments about what works. This proposal requires a sharper focus on another type of validity not discussed by Cook and Campbell: descriptive validity. Farrington (2003a: 55) defines descriptive validity as “the adequacy of the presentation of key features of an evaluation in a research report… such as the number of participants and the effect size.” Farrington places descriptive validity second only to internal validity in terms of importance for assessing the quality of a trial (ibid.: 61).

Descriptive validity has a strong relationship with both internal and external validity (Perry 2010: 333). If an RCT is to be held up as the “gold standard” of internally valid research, transparent reporting is crucial. Sufficient evidence must be provided that the experiment was designed, implemented, and analyzed such that internal validity is truly maximized. If this validity is compromised in any way, a detailed explanation must be provided to allow the research consumer to judge the extent to which the evaluation remains of satisfactory quality to be considered relevant to policy. Furthermore, if the results are to be meaningful to policymakers under any circumstances, study authors must provide details not only of the intended target population for the program, but also of the characteristics of those who received it (and where and when), so that outcomes may be generalized for implementation on a broader scale.

In terms of policy relevance, descriptive validity is also crucial to the discipline of systematic review and meta-analysis (for example, the work of the Campbell CollaborationFootnote 2), which seeks to distill rigorous evidence on a particular intervention into statements and measures of overall effectiveness for the benefit of decision-makers. Transparent and detailed reporting of studies is necessary not only to ensure comparability between the programs included in a review, and for authors to make judgments about methodological rigor, but also for the calculation of meta-analytic effect sizes. Inconsistent reporting of results across different evaluations of the same intervention may prevent the meta-analyst from calculating comparable effect sizes, thus limiting the pool of studies that can be meaningfully combined.Footnote 3 Ultimately, this prevents systematic review authors from making the strong, unequivocal statements of effectiveness policymakers want to hear, which in turn damages the policy relevance of criminal justice research. In this and all the other ways described above, the concept of descriptive validity clearly represents the ‘missing link’ in the process of translating the best research evidence into policy and practice.
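
To make the meta-analytic point above concrete, consider as an illustration (not drawn from any particular study discussed here) the standardized mean difference, a common effect size for two-group trials. Pooling studies on this metric requires each report to state the group means, standard deviations, and sample sizes:

```latex
d = \frac{\bar{X}_T - \bar{X}_C}{s_p},
\qquad
s_p = \sqrt{\frac{(n_T - 1)\,s_T^2 + (n_C - 1)\,s_C^2}{n_T + n_C - 2}}
```

If a report omits the group standard deviations (and gives no equivalent test statistic from which they could be recovered), the pooled standard deviation cannot be computed, and that study drops out of the meta-analysis.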

Despite its importance, the issue of descriptive validity is less often discussed in the field of criminology (cf. Lösel and Köferl 1989; Boruch 1997; Farrington 2003a; Petrosino et al. 2006). This contrasts sharply with the health and medical sciences, in which the development of quality standards for the reporting of trials has been widely discussed and advanced over the past 15 years. The Consolidated Standards of Reporting Trials (CONSORT), first set out by the CONSORT group in 1995 and revised in 2001 (Altman et al. 2001), currently consists of a 22-item checklist of trial characteristics to be reported. These standards have been adopted by many leading medical journals, including the Journal of the American Medical Association, the British Medical Journal, and The Lancet. Empirical studies have shown that overall, the use of CONSORT is associated with improvements in reporting quality over time (Moher et al. 2001; Plint et al. 2006). CONSORT has also been adopted by the American Psychological Association, which represents a health discipline more familiar to many criminologists (Petrosino et al. 2006).

Although general efforts have been made to improve the reporting of criminological trials (e.g., Farrington et al. 2006), and some have specifically called for the development of a checklist and proposed basic frameworks (e.g., Farrington 2003a; Petrosino et al. 2006; see also Boruch 1997: Ch. 10, for social sciences generally), no consensus on standards similar to that seen in the health sciences has been reached. Recently, several reports of RCTs in the Journal of Experimental Criminology have explicitly and voluntarily adhered to CONSORT standards (e.g., Sherman et al. 2005; Watt et al. 2008; Barnes et al. 2010). Perry and Johnson (2008) provide empirical evidence that criminological trial reports only partially adhere to CONSORT. In a review of 17 RCTs on mental health services for juvenile offenders, they found considerable variability in the extent to which certain details were reported. Acknowledging that the narrow focus of their review may have overstated CONSORT compliance in criminology because the interventions they examined were rooted in the health and psychology disciplines, Perry and her colleagues repeated their investigation with a broader range of criminological trials and concluded that overall, descriptive validity is generally low (Perry et al. 2010).

The present study

The aim of the present study is to build upon previous discussion and empirical investigation of reporting quality in criminology, incorporating the relationship of descriptive validity to internal and external validity. The two studies described above (Perry and Johnson 2008; Perry et al. 2010) represent the only comprehensive attempts to apply the CONSORT checklist in its entirety to samples of criminological trials. I extend their analysis by focusing on the extent to which RCTs in criminology contain sufficient information to allow research consumers to judge the internal and external validity of each study. I assess this by developing indicators of the extent to which studies report information relevant to internal and external validity; these indicators could then be used to create a rating system for policymakers.

It is important to emphasize at the outset that these reporting indicators are not intended to tell research consumers whether or not an experiment is internally or externally valid. Thus, I label the indicators ‘R-IV’ and ‘R-EV’ to remind the reader that they relate to information reported about the two types of validity. For example, an experiment with a high R-IV score may still have produced biased results due to differential attrition of participants, but the score shows that the report authors have provided enough information about the issues that affect internal validity to allow the reader to assess the extent to which the results remain meaningful. Conversely, a trial with a low score could have been perfectly implemented in practice, but the report provides so little information that placing substantial weight on its conclusions would rest on mere assumption. A study that rates highly on R-EV (external validity) provides sufficient descriptions of the setting, participant characteristics, and intervention details for policymakers to decide whether its outcomes could extend to the populations they serve.

Construction of internal and external validity indicators

In the absence of a reporting-standards checklist developed specifically for the field of criminology, I followed the methodology of Perry and Johnson (2008) and used the CONSORT checklist items, broken out into 45 individual elements. The elements pertinent to internal and external validity were selected to construct the indicators (see Fig. 1). Ultimately, the selection of elements was subjective, since there is also no agreed-upon measure of scientific validity or checklist for methodological quality (Shadish et al. 2002: 100; Farrington 2003a: 61-2). However, each element was chosen according to my assessment of whether it provided information relevant to the respective definitions of internal and external validity. Note that some of the selected elements are technically more relevant to Cook and Campbell’s (1979) other validity criteria—statistical conclusion validity and construct validity. These concepts are closely related to internal and external validity respectively (Shadish et al. 2002). For example, it is necessary to know details such as the number of participants in the trial to establish the existence of an effect (statistical conclusion validity) before drawing causal inferences (internal validity). A clear description of the intervention is needed to establish that it is a valid representation of the theoretical concept being measured (construct validity) as well as to extrapolate to variations of that concept (external validity).Footnote 4

Fig. 1 Reporting indicators of internal and external validity

CONSORT elements relevant to internal validity related to the generation and implementation of the random assignment sequence; the flow of participants through the trial; the length of the follow-up period; the number of participants analyzed; whether analysis was based on intention-to-treat (according to randomized treatment) or per protocol (according to treatment actually received); and whether the authors believed the results to be affected by bias. Threats to or overrides of the random assignment sequence, differential attrition, or outcome analysis based only on those who successfully completed the intervention, all affect the extent to which causation may be inferred.

CONSORT elements relevant to external validity related to the eligibility criteria for participants; the setting of the trial; the dates of the recruitment period; baseline characteristics of the participants; and the authors’ interpretation of the generalizability of their findings and the extent to which the results fit within the existing evidence-base. Details about these elements set the trial within geographic, cultural, and historical contexts, and allow policymakers to determine the extent to which the studied population and intervention align with their planned course of action in their own communities.
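
For concreteness, the two groups of elements just described can be written out as simple lists. The labels below are shorthand paraphrases of the prose above rather than the exact CONSORT item wording, and the full indicators are built from a larger set of individual items (see Fig. 1):

```python
# Shorthand paraphrases of the element groups described above; the actual
# indicators are built from the finer-grained CONSORT items listed in Fig. 1.
R_IV_ELEMENTS = [
    "sequence_generation",       # how the random assignment sequence was generated
    "sequence_implementation",   # implementation/concealment of the sequence
    "participant_flow",          # flow of participants through each stage of the trial
    "follow_up_length",          # length of the follow-up period
    "numbers_analyzed",          # number of participants analyzed
    "itt_or_per_protocol",       # intention-to-treat vs. per-protocol analysis
    "bias_assessment",           # authors' assessment of potential bias
]
R_EV_ELEMENTS = [
    "eligibility_criteria",      # eligibility criteria for participants
    "setting_location",          # setting/location of the trial
    "recruitment_dates",         # dates of the recruitment period
    "baseline_characteristics",  # baseline characteristics of the participants
    "intervention_description",  # description of the interventions delivered
    "generalizability",          # authors' interpretation of generalizability
    "fit_with_evidence_base",    # how the results fit the existing evidence-base
]
```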

Sample selection

The sample of studies to which the reporting indicators are applied was drawn from a total of 28 journals (see Fig. 2), which were hand-searched by the author. Initially, I searched the top 20 criminology and penology journals (according to the 2007 Impact Factor ranking).Footnote 5 Since some of these journals were unlikely to publish RCTs (e.g., Theoretical Criminology), I boosted the sample size with several other criminology (e.g., Journal of Experimental Criminology Footnote 6) and general evaluation journals (e.g., Evaluation Review). These additional journals were selected based on my familiarity with them and expectation that they might include experiments involving criminal justice settings and/or outcomes. I searched journal issues published between January 2002 and December 2008. This time period was selected in part to maintain a manageable number of trials for analysis by a sole author, but also to reflect a period in which debate over reporting quality, and potentially also authors’ and journal editors’ familiarity with the issue, were increasing. The year 2002 was selected as the start year to incorporate a time lag between the 2000 inception of the Campbell Collaboration and the 2001 publication of the most recent CONSORT standards in medicine.

Fig. 2 Eligible studies by journal

All primary reports of field experiments involving random allocation of human participants to treatment and control groups were examined for inclusion. Although the exclusion of studies with non-human units of analysis (i.e., place-based randomized trials such as ‘hot spots’ experiments) removes a substantial pool of recent experiments in policing, I felt that their inclusion might create a downward bias in reporting quality given my use of CONSORT. The CONSORT checklist was designed for simple two-group comparisons of human subjects, and several of the items that comprise my indicators do not apply to place-based trials in the form in which they appear in CONSORT. Thus, these studies would fail to score highly on the two indicators not because of poor reporting, but because there is currently no checklist specifically designed for the types of experiments many criminologists conduct. Laboratory-based and vignette studies were also excluded to avoid similar bias. These studies operate under more controlled conditions than those conducted in the field, so authors often do not need to report on attrition or implementation issues. Finally, I excluded any articles that did not report ‘true’ experiments: systematic reviews; follow-up surveys or analyses of subsets of RCT participants in which random assignment was not maintained; reports on multiple experiments, unless each one was fully reported (e.g., Goldkamp and White 2006); and preliminary results of trials, unless the authors purported to describe the experiment in full (e.g., Gottfredson and Exum 2002; Marlowe et al. 2003), and all the participants randomly assigned so far were included in the analysis. Based on these criteria, 38 RCTs were identified for analysis through title and abstract screening, and more thorough reading where necessary.

Coding of studies

The coding protocol developed for this study is reproduced in the Appendix. It was originally designed to gather full information about the reporting of each CONSORT element. I also recorded the publication year; the type of intervention studied (e.g., corrections); the institutional affiliation of the lead author at the time of publication; the field of the lead author’s highest degreeFootnote 7; and whether or not the authors mentioned CONSORT. Each element was coded 2 if the authors described it in the report, and 0 if they did not. The code 1 was used where the report was partial or unclear; for example, where CONSORT required a description of the interventions for both the treatment and control groups, and the authors only described the treatment group. As this was a small-scale study conducted by this author alone, it is important to stress the caveat that these studies have not been double-coded for reliability.
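
As a minimal sketch of this coding convention (reusing the shorthand element labels from the earlier illustration; the codes themselves are hypothetical), a single study's record might look like this:

```python
# Hypothetical coding record for one study:
# 2 = described, 1 = partial/unclear, 0 = not described.
example_study_codes = {
    "sequence_generation": 1,        # method mentioned but not fully explained
    "sequence_implementation": 0,    # no information on implementation/concealment
    "participant_flow": 2,
    "follow_up_length": 2,
    "numbers_analyzed": 2,
    "itt_or_per_protocol": 1,
    "bias_assessment": 2,
    "eligibility_criteria": 2,
    "setting_location": 2,
    "recruitment_dates": 2,
    "baseline_characteristics": 2,
    "intervention_description": 1,   # e.g., only the treatment condition described
    "generalizability": 1,
    "fit_with_evidence_base": 2,
}
```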

Analysis plan

The R-IV and R-EV indicators for each study were constructed by taking the mean score (0-2) across the relevant CONSORT elements listed in Fig. 1, rounded to 1 decimal place. Thus, each study is assigned an R-IV rating between 0 and 2, and an R-EV rating between 0 and 2, to indicate the extent to which factors affecting internal and external validity were described. Higher scores indicate more comprehensive reports.
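
A minimal sketch of that calculation, continuing the hypothetical element lists and coding record sketched above:

```python
def reporting_indicator(codes, elements):
    """Mean of the 0/1/2 codes over the listed elements, rounded to 1 decimal place."""
    scores = [codes[element] for element in elements]
    return round(sum(scores) / len(scores), 1)

r_iv = reporting_indicator(example_study_codes, R_IV_ELEMENTS)   # 1.4 for the example above
r_ev = reporting_indicator(example_study_codes, R_EV_ELEMENTS)   # 1.7 for the example above
```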

In the following section, I present a descriptive analysis of these reporting indicators. I first examine the mean R-IV and R-EV ratings across the full sample. Mean scores are then broken down by subgroups: publication year, to examine whether reporting standards improved over time; intervention type, to examine whether trials of certain types of programs in certain settings lend themselves to better reporting practicesFootnote 8; and the lead author’s current institution and discipline (if in academia) and field of training, both of which could influence the extent to which authors consider the reporting of certain details in their work. Lum and Yang (2005) examined these last two factors in relation to authors’ preferences for choosing experimental or non-experimental methods when designing research studies. They hypothesized that “disciplinary norms” and the field of training may determine academics’ inclination toward or confidence in producing RCTs. By extension, I investigate the possibility that these norms affect attention to detail in writing up trial reports.
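
As a sketch of this subgroup breakdown, assuming the study-level scores and attributes have been assembled into a small table (all values and column names below are placeholders rather than data from the sample):

```python
import pandas as pd

# Placeholder table of study-level attributes and scores (not the actual sample data).
studies = pd.DataFrame({
    "year":              [2002, 2003, 2003, 2004, 2005],
    "topic":             ["courts", "corrections", "courts", "psych_treatment", "corrections"],
    "lead_author_field": ["criminology", "sociology", "medicine", "psychology", "criminology"],
    "r_iv":              [0.9, 1.4, 1.2, 0.6, 1.0],
    "r_ev":              [1.5, 1.8, 1.6, 1.1, 1.4],
})

# Mean reporting scores by publication year; the same pattern applies to the other subgroups.
print(studies.groupby("year")[["r_iv", "r_ev"]].mean().round(1))
print(studies.groupby("lead_author_field")[["r_iv", "r_ev"]].mean().round(1))
```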

Finally I explore the relevance of the reporting indicators to policymakers and other research consumers. Each individual study’s R-IV and R-EV scores convey information to the reader about the sufficiency of its reporting on matters relevant to internal and external validity. However, decision-makers committed to evidence-based policy usually need to take a range of studies into account when deciding whether to implement an intervention or strategy. Graphical conceptualizations, such as matrices, are extremely useful for condensing large amounts of information into domains and patterns that can be easily digested by busy readers. A recent example in the field of criminology is the Evidence-Based Policing Matrix (Lum et al. in press), which visually classifies studies of policing strategies across several crime prevention dimensions. Following this example, I develop a 3 × 3 Descriptive Validity Matrix, which can be used to plot studies along dimensions of High, Medium, and Low descriptive validity on items relevant to the assessment of internal and external validity.

Results

Tables 1 and 2 show the R-IV and R-EV scores assigned in this study. Table 1 shows the mean scores across the whole sample, broken out by each individual CONSORT element that formed the reporting indicators. Table 2 shows the R-IV and R-EV score for each RCT coded. In the following discussion, I consider scores above 1.3 to indicate ‘High’ descriptive validity, scores between 0.7 and 1.3 ‘Medium,’ and scores below 0.7 ‘Low.’
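
A minimal sketch of this banding rule:

```python
def descriptive_validity_band(score: float) -> str:
    """Bands used in this paper: Low < 0.7, Medium 0.7-1.3, High > 1.3."""
    if score > 1.3:
        return "High"
    if score >= 0.7:
        return "Medium"
    return "Low"

print(descriptive_validity_band(1.0))  # Medium
print(descriptive_validity_band(1.5))  # High
```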

Table 1 Mean R-IV and R-EV scores by element
Table 2 R-IV and R-EV analysis by study

Overall, the mean R-IV and R-EV scores across the sample are fairly promising. These studies show medium descriptive validity in reporting on elements relevant to internal validity (mean R-IV = 1.0), and high descriptive validity on items relevant to external validity (mean R-EV = 1.5). However, the individual elements indicate a great deal of variation within these indicators. Reporting scores across the individual elements ranged from a low of 0.1 for R-IV item 18, ‘method of implementation and concealment of random assignment sequence,’ which was fully described in just one study (Zhang and Zhang 2005); to a perfect 2.0 for R-EV item 4, ‘setting/location where data collected,’ which was discussed in all the reports. Within almost all the individual elements, reporting scores ranged from 0 to 2.

Descriptive validity was high on internal validity items relating to the number of participants, dates and timing of follow up, and numbers randomly assigned and analyzed for the primary outcome. Many authors also attempted to assess potential biases in their conclusions. Less consistently reported were details about the number of participants actually receiving the treatment and completing the study protocol, whether outcomes were based on intention-to-treat or per protocol, who enrolled participants and assigned them to groups, and any deviations from the planned treatment. Low-scoring items related to the construction and implementation of the random assignment sequence and precise information on the flow of participants through each stage of the trial.

Within the external validity indicator, almost all the elements scored highly. Authors provided a good amount of detail about the setting and location of the study, eligibility criteria, dates of recruitment, description of the interventions, and how their findings related to current knowledge. However, fewer authors specifically addressed the issue of generalizability (external validity) in their conclusions.

The mean R-IV and R-EV scores were also broken down by subgroups, reflecting variation in the sample in publication date, topic area, and author affiliation and experience. Figure 3 shows the mean scores by publication year. The line represents the number of studies published in that year, or the denominator on which the means are based. The number of eligible studies published per year ranged from three in 2003 to eight each in 2002 and 2008. R-IV and R-EV were both at their highest in 2003, although the means are based on just three studies (Armstrong 2003; Gottfredson et al. 2003; Marlowe et al. 2003) that all scored highly on at least one indicator. There appears to be little pattern or variation in descriptive validity over the time period, although scores improved slightly during the later years. R-IV scores were medium in each year, and R-EV was generally high, although it dipped slightly in 2004, which was the lowest-scoring year overall.

Fig. 3 Reporting indicators by publication year

Figure 4 shows the results by topic area. Again, there is little variation in scores, with medium R-IV in all categories and high R-EV in most. Reports of RCTs in more ‘traditional’ criminal justice domains, such as corrections and courts, scored slightly higher than those in psychological therapies and treatments, which is somewhat surprising given the psychology field’s closer alliance with health science and its adoption of CONSORT standards during the time period covered in this study.

Fig. 4 Reporting indicators by topic area

In Fig. 5, average R-IV and R-EV scores are broken out by the institutional affiliation of the lead author at the time of publication. The ‘Government’ category, for example, might include authors affiliated with a state or local Department of Corrections in the U.S. The ‘Research Organization’ category includes non-profit and for-profit agencies like the RAND Corporation. Authors working in university departments or research centers are classified under ‘Academic,’ which is broken down into broad disciplinary areas to account for differing norms and practices. Unsurprisingly, authors from medical schools or departments were most successful in conveying details relevant to internal validity and also scored very highly on R-EV. Academic sociologists provided the most detail about external validity, and authors working in research organizations also scored well on that indicator. Criminologists were relatively successful, with medium R-IV and high R-EV scores. Again, authors from psychology backgrounds performed less well, with an R-IV score on the borderline of low and medium. It is important to note that these results may well be biased for affiliations other than criminology and research organizations, due to small numbers of authors falling into the other categories.

Fig. 5 Reporting indicators by lead author's institutional affiliation

Following Lum and Yang (2005), I also investigated whether the lead author’s field of training, as distinct from their current affiliation, affected their reporting practices (Fig. 6). Thirty-seven of the 38 lead authors (97.4%) either held or were pursuing a Ph.D. at the time of publication. I was unable to obtain information about the field in which two authors had received their doctorates. Again, a similar pattern emerges: R-IV is medium and R-EV is high across most fields. However, it is interesting to note that authors who received their Ph.D. in sociology or criminology/criminal justice had the highest scores on both R-IV and R-EV, whereas scores for those trained in medicine or public health were lowest. Again, small numbers of authors trained in the medical field may have skewed these results. However, the relatively high scores for sociologists (some of whom would have specialized in criminology) and criminologists are very promising for the field, indicating some tradition of good reporting practices in these disciplines even in the absence of a dedicated checklist.

Fig. 6 Reporting indicators by lead author's field of training

One reason for the promising results observed among authors trained or working in criminology and sociology may be that many of the leading proponents of evidence-based policy and experimental practice in these fields have themselves been directly involved in conducting randomized controlled trials and would likely incorporate issues of quality and descriptive validity into their reports. In order to examine this hypothesis further, I coded each study according to whether or not a current Fellow of the Academy of Experimental Criminology (AEC) had been involved, either as lead author or a co-author. AEC Fellows are specifically recognized for their experience and success in conducting randomized controlled trials in criminology. In this sample, ten AEC Fellows (including two past or current AEC presidents) are represented: eight as lead authors (several of whom also co-authored other studies in the sample), and two as co-authors.Footnote 9 There is also some overlap between AEC Fellows and members of the Campbell Collaboration Crime and Justice Group steering committee, although I do not break down the results by Campbell membership as only one member was the lead author of a trial in this sample.Footnote 10

Figure 7 shows a distinct improvement in scores, particularly for reporting of internal validity, when an AEC Fellow is involved in authoring a study, especially when he or she is the lead author (R-IV, lead author = 1.2; co-author = 1.1; no involvement = 0.9). External validity is high in all groups, but slightly higher for studies involving an AEC Fellow.

Fig. 7 Reporting indicators by AEC Fellow authorship

Having examined the characteristics of the sample as a whole, I now turn to the visual presentation of each study’s individual score (from Table 2, above) on the Descriptive Validity Matrix (Fig. 8). The Matrix is divided into 9 squares, representing each combination of Low, Medium, and High reporting quality for R-IV and R-EV. Each study’s R-EV score is plotted against its R-IV score on the Matrix to summarize reporting quality across both domains. Studies that fall closest to the top right-hand corner of the Matrix scored highest on both R-IV and R-EV. Each study is labeled with a differently shaped symbol depending on which square of the Matrix it falls into. This symbol can be used to refer back to a list of the individual study scores and references, such as the one in Table 2. Thus, the user of the Matrix might decide only to consider studies that rated high for descriptions of both internal and external validity, which are labeled with stars on the Matrix. He or she could then refer just to the starred section of the table for further information.
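
As an illustrative sketch of that layout (with made-up scores, and omitting the per-square symbols used in the published figure), the Matrix can be reproduced as a scatter plot divided by the banding cut-points:

```python
import matplotlib.pyplot as plt

# Made-up (R-IV, R-EV) pairs; in the paper these come from Table 2.
scores = [(1.4, 1.6), (1.0, 1.5), (0.6, 1.2), (1.2, 1.8), (0.8, 0.9)]

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter([r_iv for r_iv, _ in scores], [r_ev for _, r_ev in scores])

# Cut-points dividing the Low / Medium / High bands (0.7 and 1.3) on both axes.
for cut in (0.7, 1.3):
    ax.axvline(cut, linestyle="--", color="grey")
    ax.axhline(cut, linestyle="--", color="grey")

ax.set_xlim(0, 2)
ax.set_ylim(0, 2)
ax.set_xlabel("R-IV (reporting on internal validity)")
ax.set_ylabel("R-EV (reporting on external validity)")
ax.set_title("Descriptive Validity Matrix (sketch)")
plt.show()
```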

Fig. 8 Descriptive validity matrix

Figure 8 shows the application of the Matrix to the present sample. While the results above have indicated that overall, reporting quality has been medium for items relevant to internal validity and high for items relevant to external validity, the Matrix clearly demonstrates that the majority of studies in the sample had reasonably high descriptive validity. In all, eight of 38 studies (21.1%) scored ‘high’ on both indicators, and 26 of 38 (68.4%) scored highly on at least one. Only nine studies (23.7%) scored high on R-IV, compared to 25 (65.8%) scoring high on R-EV. However, the overall distribution of scores is in the direction of the top-right corner, which is a promising assessment for reporting quality in the field.

Discussion

Despite concerns about descriptive validity in the criminological literature, and findings from empirical research (Perry and Johnson 2008; Perry et al. 2010) indicating that criminal justice research has some way to go to catch up to standards in the healthcare field, the results of this study are fairly promising. This sample of RCTs published between 2002 and 2008 in leading criminology journals provides at least partial details of information crucial to assessments of internal and external validity, and thus to the policy relevance of crime and justice evidence. The studies were particularly strong in reporting on items relevant to external validity or generalizability, which is of paramount importance in translating evidence into practice across different populations and settings. Furthermore, despite the need to borrow a reporting validity checklist from the medical field, studies conducted by criminologists and sociologists or focused on more traditional criminal justice strategies and settings performed as well as, and sometimes better than, crime-related studies conducted within health science disciplines. The recent focus on experimental methods and evidence-based practice in criminology, and the founding of organizations like the Campbell Collaboration Crime and Justice Group and the Academy of Experimental Criminology, may have led to an improvement in reporting methods despite the lack of an agreed-upon standard. Studies in the present sample that were authored or co-authored by AEC Fellows had better reporting quality than those that were not, particularly on items related to internal validity.

However, the results of this study also indicate that much more needs to be done to improve reporting quality even further. A consistent finding across all the results presented above is that the R-IV score is always distinctly lower than the R-EV score. Only one study scored higher on R-IV than R-EV (Gottfredson and Exum 2002). There is substantial variability in the extent to which the individual elements of the R-IV indicator were reported. Arguably, many of the details that comprise the R-EV indicator are more obvious or easier to capture than those comprising R-IV. In a complex field RCT, researchers will certainly know the details of the intervention and the eligibility criteria for participants, but it may be much more difficult to track information about the flow of participants through the trial, especially if they are reliant on staff who work in the field to provide information over and above their normal duties. However, participant flow is vital in showing how representative the final sample of participants was of the full population, and how many were lost at each stage of the experiment. Differential attrition of participants and treatment crossover are major threats to internal validity, especially when those who do not drop out or who end up receiving the intervention are those most likely to respond positively. Fewer than half of the studies in the sample reported the numbers of participants actually receiving the intended treatments and completing the study protocol separately from the numbers randomly assigned or analyzed. Only half of the studies indicated whether the analysis was based on intention-to-treat or per protocol. The studies also provided very little information about the random assignment sequence; for example, only one study fully reported the methods of implementation and concealment of the sequence (Zhang and Zhang 2005), and just three more provided partial information (Haapanen and Britton 2002; Labriola et al. 2008; Watt et al. 2008). Allocation concealment is crucial to internal validity because it prevents selection bias by ensuring that the random assignment sequence is not known in advance. Prior knowledge of the sequence could, for example, result in participants thought to be ‘deserving’ of treatment being deliberately selected for the treatment group, which biases effect estimates (see Altman et al. 2001: 673).

Overall, the issues with reporting information about details relevant to internal validity in this sample are a cause for concern. Internal validity is considered to be the most important dimension of scientific validity (Farrington 2003a; Shadish et al. 2002), so it follows that without a good R-IV score, a high R-EV rating would not be as meaningful. Arguably, a study’s applicability to other populations and settings is irrelevant if the causal relationships it demonstrates are unreliable.

The key contribution of this study is the development of the Descriptive Validity Matrix, which visually organizes studies according to their R-IV and R-EV scores. The Matrix is a simple, intuitive way to convey to decision-makers whether a set of evaluations provides sufficient information to judge their internal and external validity. The most obvious application of the Matrix would be as an organizing scheme for a set of studies examining the same intervention or treatment: for example, a matrix could be produced that classifies all the rigorous evaluations conducted on drug courts according to R-IV and R-EV. A decision-maker who is considering implementing drug courts in his or her jurisdiction could use the Matrix to identify a subset of evaluations meeting a minimum standard of reporting quality, which would save the time of reading through reports that do not contain sufficient information. Alternatively, the Matrix could be taken in its entirety as an indicator of reporting quality across the evidence-base, providing the user with a basis for assessing and articulating confidence in his or her decisions based on the available research. As well as being a decision-making tool, the Matrix could be used by scholars of scientific validity to identify areas for improvement and develop checklists and standards in those areas.

Of course, producing a Matrix for each type of intervention would be quite a time-consuming task. However, it could in theory be successfully combined with the systematic reviews produced on behalf of the Campbell Collaboration. One of the essential steps of systematic review is the development of a coding protocol to extract information from each study about the intervention, population, and outcomes. The R-IV and R-EV indicators used in this study consist of 26 items that could be easily judged while reading the study, and recorded on the protocol. The indicators themselves are based on a simple mean and can be calculated in seconds with any statistical software or spreadsheet. Systematic review authors could generate the Matrix and include it in their reports along with the list of references. It is even concise enough to be included in shorter ‘user abstracts.’ In this way, the discipline of systematic review contributes to the development of evidence-based policy by providing summaries of both the overall effects of an intervention, and the confidence that can be placed in those effects based on the extent to which the review authors could glean information from the primary research.

This study has several limitations. It was not always possible to distinguish CONSORT items that were not reported from those that did not apply. Although all the items were relevant to criminological trials in general, it is not necessarily the case that all the issues would apply to all trials. For example, a report might fail to discuss the results in the context of current evidence, but the study may represent the first attempt to assess a particular strategy. In addition, the coding of CONSORT items was conducted by one person, and as such is based on personal judgment. Other readers of the same study reports may disagree with my assessments. However, I have been careful to apply an objective understanding of the concepts of internal and external validity based on prior literature.

The sample is a small subset of criminal justice experiments, so the studies reviewed here may not be representative of the overall quality of reporting in the field as a whole. The limited timeframe does not encompass some of the more productive eras in experimental research in criminology. Although evidence-based crime policy gained prominence relatively recently, Farrington and Welsh (2005) found 83 criminological RCTs published between 1982 and 2004, and a further 35 conducted between 1957 and 1981. More importantly, for reasons described above, the sampling criteria excluded place-based experiments, which eliminates much of the recent research on policing, a key domain of criminology that has provided a fruitful output of experimental research. Several high-quality studies (e.g., Weisburd et al. 2006; Braga and Bond 2008; Weisburd, Morris, and Ready 2008) representative of the field were excluded as a result.

Furthermore, only RCTs published in academic journals are included. Journal articles may be constrained by space and themes, and focus more on results and contributions to scholarship and criminological theory than the finer details of the project. This may explain why fewer authors specifically reported their own assessments of external validity in this study. Policymakers may be more likely to read research from their own governmental organizations, private research organizations, and technical reports submitted by academics (for example, the grant report on which a journal article may be based), which may contain more information about the full details of the study. Thus, this study may actually underestimate the quality of information available to policymakers. It is conceivable that when good research comes across their desks, it is in the form of more detailed technical reports.

Future research in this area should focus on refining the indicator system developed in this study to better capture information vital to the assessment of internal and external validity and increase its relevance to criminological trials. More work is required to unpack the definitions of internal and external validity themselves before they can be fully incorporated into reporting standards. The present indicator system also assumes that all the elements of internal and external validity are of equal importance, which may be unjustified. A refinement to this system, with the guidance of further research on the nature of scientific validity, might incorporate a weighted average to rank certain elements of validity as more or less important. In addition, this study does not examine the other important types of validity—statistical conclusion and construct validity—both of which are also important to policy relevance. For example, low statistical power is a major threat to statistical conclusion validity (Farrington 2003a: 52) and a chronic problem in criminological research (Brown 1989; Weisburd et al. 1993), yet fewer than 25% of the studies reviewed here offered information on how the sample size was determined. As discussed above, there is also an urgent need for a modified CONSORT-type reporting checklist designed specifically for the field of criminology, which takes into account the different research designs and units of analysis that are not found in the health sciences, the most obvious of which are the place-based experiments.
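
As a sketch of how such a weighted refinement might look (the element labels and weights below are arbitrary placeholders, not recommendations):

```python
def weighted_reporting_indicator(codes, weights):
    """Weighted mean of 0-2 element codes; weights express relative importance."""
    total_weight = sum(weights.values())
    weighted_sum = sum(codes[element] * weight for element, weight in weights.items())
    return round(weighted_sum / total_weight, 1)

# Placeholder weighting: allocation concealment and participant flow counted double.
weights = {"sequence_generation": 1.0, "sequence_implementation": 2.0,
           "participant_flow": 2.0, "numbers_analyzed": 1.0}
codes = {"sequence_generation": 2, "sequence_implementation": 0,
         "participant_flow": 1, "numbers_analyzed": 2}
print(weighted_reporting_indicator(codes, weights))  # 1.0
```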

It would also be instructive to conduct a similar study of internal and external validity reporting in healthcare trials and compare it to these findings (Perry et al. 2010). This would help us to learn whether criminology does need to catch up with the medical field, especially since this study suggests that criminological trials authored by scholars trained or working in medicine were not always better reported than those written by authors from social science backgrounds. Furthermore, given the extent to which research and practice are connected in the health sciences (Shepherd 2003), it would be interesting to contrast health and criminology trials on the Matrix to compare the amount of policy-relevant information they provide.

Conclusions

This paper makes the case for the importance of descriptive validity as a foundation for drawing conclusions about scientific validity in criminological research. I constructed indicators designed to help research consumers judge whether a study provided sufficient information to assess the trial’s internal and external validity. I applied the indicators to 38 randomized controlled trials of criminological interventions. Reporting quality results were mixed, with factors relevant to external validity well reported, but important information about technical aspects of study design that greatly impact conclusions about internal validity routinely missed. Although the reporting standard applied was borrowed from the healthcare field, the elements that formed the reporting indicators were equally applicable to the effective reporting of criminal justice trials, and those items that were missed were not omitted because they were irrelevant. The indicators developed were used to map studies onto a Descriptive Validity Matrix, which could be provided to policymakers to help them assess the quality of information available in the evidence-base for a particular intervention or strategy.

Although this study has some limitations, it represents an important first step in assessing how descriptive validity relates to internal and external validity, and the value of criminological research to policy and practice. The indicators developed are based on a respected, well-documented framework and have been applied to a group of studies that is representative of much of the experimental research in criminology. As such, this is a useful starting point and framework for continued assessment of descriptive validity. The General Accounting Office report has indicated that the field of criminology still has some distance to go in improving the quality of the research it offers to policy decision-makers. While descriptive validity indicators do not address the fundamental difficulties of conducting field experiments in criminal justice settings, attention to good reporting of the problems that inevitably arise could go a long way toward helping decision-makers to make sense of research quality. As criminologists continue to hold up the randomized controlled trial as the authoritative evaluation design, and expand efforts to disseminate the results of experiments and systematic reviews to policymakers, we must recognize the “moral imperative” (Weisburd 2003) not only to produce the best research, but to clearly report it to enhance the objectives of evidence-based crime policy.