Systematic heterogenization for better reproducibility in animal experimentation

Richter, S Helene

doi:10.1038/laban.1330

Perspective
Published: 31 August 2017

Volume 46, pages 343–349, (2017)

Lab Animal

Abstract

The scientific literature is full of articles discussing poor reproducibility of findings from animal experiments as well as failures to translate results from preclinical animal studies to clinical trials in humans. Critics even go so far as to talk about a “reproducibility crisis” in the life sciences, a novel headword that increasingly finds its way into numerous high-impact journals. Viewed from a cynical perspective, Fett's law of the lab “Never replicate a successful experiment” has thus taken on a completely new meaning. So far, poor reproducibility and translational failures in animal experimentation have mostly been attributed to biased animal data, methodological pitfalls, current publication ethics and animal welfare constraints. More recently, the concept of standardization has also been identified as a potential source of these problems. By reducing within-experiment variation, rigorous standardization regimes limit the inference to the specific experimental conditions. In this way, however, individual phenotypic plasticity is largely neglected, resulting in statistically significant but possibly irrelevant findings that are not reproducible under slightly different conditions. By contrast, systematic heterogenization has been proposed as a concept to improve representativeness of study populations, contributing to improved external validity and hence improved reproducibility. While some first heterogenization studies are indeed very promising, it is still not clear how this approach can be transferred into practice in a logistically feasible and effective way. Thus, further research is needed to explore different heterogenization strategies as well as alternative routes toward better reproducibility in animal experimentation.

Main

In October 2013, the biomedical research community was startled by the latest issue of The Economist running the headline “How science goes wrong”. In a short briefing, modern scientists were accused of doing “too much trusting and not enough verifying”, followed by a list of problems, pitfalls, and mistakes that currently limit the validity and reproducibility of research findings, mostly in the context of animal-based research¹. However, this is not the first time that criticism has been expressed about common practices in the field of biomedical research. Already in 2005, John Ioannidis published a paper provocatively entitled “Why most published research findings are false”², in which he pointed out that most studies are more likely to report a false finding than a true one. So, what is behind this criticism? Are these just alarming claims, or are there indeed problems with the translational value, the validity, and/or the reproducibility of research findings?

How self-correcting is science?

The scientific literature of animal-based research is indeed full of publications reporting or discussing poor reproducibility, as well as failures to translate results from preclinical animal experiments to clinical trials in humans³,⁴,⁵,⁶,⁷,⁸,⁹. In a 10-year review of drug development, for example, Kola and Landis pointed out that the success rate from first-in-man to registration for different therapeutic areas between 1991 and 2000 was on average 11%, indicating that only one in nine compounds made it through the complete development process and was approved by the regulatory authorities⁴. Notably, the success rate was even worse for trials in specific research areas, such as oncology or women's health⁴. Similarly, a systematic evaluation of how well mice mimic human inflammatory responses detected fundamental disparities in genomic responses between mice and humans. Among genes that changed significantly in humans, the murine orthologs were close to random in matching their human counterparts, calling into question the translational value of current mouse models for severe inflammation¹⁰. Thus, despite the widely recognized improvement in scientific and technological tools over recent years, novel compounds are reported to fail more often in clinical development today than in the 1970s¹¹.

However, translational failures are not the only challenge the scientific community has to face. There is also increasing concern about the rate at which published findings can be reproduced. The current debate has even generated a novel headword, “reproducibility crisis”, that increasingly finds its way into numerous high-impact journals¹¹,¹²,¹³,¹⁴,¹⁵. Against this background, it is not surprising that 90% of 1,576 surveyed life scientists believe that they are currently facing either a slight or a significant reproducibility crisis¹².

By definition, “reproducibility” refers to the degree of accordance between results of the same experiment performed independently in the same or in a different laboratory¹⁶ (Table 1). Results that cannot be reproduced cast serious doubts on the quality of experiments and hinder scientific progress. In the context of animal experimentation, poor reproducibility is also an ethical issue, as the need for additional follow-up studies undermines the aim of reducing animal use. In that respect, it is an extremely serious matter that reproducibility problems seem to be most prevalent in those research areas that work with animal model systems¹⁷,¹⁸, although recent surveys indicate that they also occur in other fields, such as psychology, chemistry and physics¹²,¹⁹,²⁰. Current estimates for irreproducibility in biomedical research are alarmingly high, ranging from 50 to 90%¹⁵,²¹. Begley and Ellis, for example, reported that only 6 out of 53 “landmark studies” in oncology could be replicated²², and Prinz and colleagues detected inconsistencies in 75 to 80% of 67 in-house projects in oncology, women's health and cardiovascular diseases¹⁸. From an economic perspective, these high irreproducibility rates have been associated with costs of approximately US$28 billion per year in the United States alone²¹. As these examples indicate, basic science has lost a great deal of credibility over recent years, emphasizing the need for fundamental changes in the conduct and analysis of experiments. However, the causes of the current limitations to translation and reproducibility need to be identified before they can be adequately addressed.

Table 1 Definitions of key terms (adapted from refs. 3,5,16,52)

Threats to translation and reproducibility

Besides possible shortcomings in the clinical trials that may contribute to high attrition rates, translational failures have been attributed to biased research approaches, overoptimistic conclusions, or the lack of external validity in preclinical studies⁵,²³ (Table 1). Similarly, most attempts to explain poor reproducibility in animal experimentation have concentrated on methodological issues, such as the inadequate choice of experimental designs and control groups or different types of biases⁵,²⁴,²⁵,²⁶ (Table 1). Knowledge of treatment assignment, for example, may consciously or unconsciously affect the outcome assessment, a phenomenon recognized for the first time at the beginning of the 20th century. At that time, a horse named Hans drew worldwide attention as the first animal with “numeracy skills”. By tapping its hoof, the horse seemed to solve arithmetic operations, read the clock, or recognize playing cards. A few years later, however, it turned out that the horse was only able to respond correctly to these tasks in the presence of the questioning person. If this person was absent or did not know the answer, the horse suddenly seemed to lose these skills. Thus, instead of being able to solve math problems, the horse was simply receptive to subtle cues given by the human questioners. Known today as the “Clever Hans Effect” or “Experimenter Bias”, this simple example illustrates how non-conscious cues from experimenters can introduce bias into testing. Similarly, so-called “Selection Biases” (i.e., biased allocation to treatment groups) may lead to selective exclusion of animals from, or inclusion of animals in, treatment groups, resulting in systematic differences in the baseline characteristics between groups³,⁵.

Steps can be taken to reduce the risk of bias. However, where risks of bias have been systematically assessed in reviews of in vivo studies, alarmingly low reporting rates of measures against these risks have been found. For example, a systematic review of studies reporting on functional outcome in animal models of acute ischemia found that random treatment allocation was reported in only 42% of the studies, blinded administration of the treatment in 22%, and blinded assessment of outcome in 40% (ref. 27). Similarly, a meta-analysis published in 2015 revealed that out of 2,671 publications reporting drug efficacy in eight different disease models, randomization was reported in only 662 publications (24.8%), blinded assessment of outcome in 788 (29.5%), and a sample size calculation in 20 cases (0.7%) (ref. 25). Notably, reporting rates of such quality criteria are low not only at the publication level, but also at the level of applications for animal experiments (i.e., before the studies have been conducted). A recent meta-analysis published in PLoS Biology indicated that out of 1,277 applications for animal experiments in Switzerland, only 3.2% included a statement about blinding, 12.6% about randomization, and 7.9% about a sample size calculation²⁸. Reporting guidelines have therefore become a major tool in overcoming risks of bias²⁹.

As one important step toward improved reporting standards in animal experimentation, the ARRIVE guidelines (Animal Research: Reporting of In Vivo Experiments) were introduced in 2010 (refs. 30,31). Based on a 20-item checklist of information to be reported in publications, the guidelines aim at maximizing the availability and utility of the information gained from every animal and every experiment. However, although the ARRIVE guidelines have been endorsed by over 1,000 journals since their introduction, little improvement in reporting standards has been observed²⁸,³². Nevertheless, overall awareness seems to have risen, as Macleod and colleagues showed that reporting rates have increased over time in at least some research areas in the biomedical sciences²⁵.

Poor reproducibility has also been linked to manifold failures in the statistical analyses and the choice of the experimental unit (i.e., the smallest physical unit that can be randomly assigned to a treatment condition)³³,³⁴,³⁵,³⁶. If, for example, a pregnant female animal is subjected to an experimental treatment, but the scientific interest is in the individual offspring, analyses are often based on individual pups³⁷. Because pups within a litter represent highly dependent entities, treating each pup as an independent experimental unit results in artificially large sample sizes associated with a substantial inflation of the nominal 0.05 alpha level³⁸. In fact, simulation studies have shown that an increase of the sample size by treating two pups per litter as independent measurements can almost triple the nominal 0.05 alpha level³⁹. Such failures to account for “litter or cage effects” contribute to an overrepresentation of false positives in the scientific literature and, hence, hamper reproducibility³³,⁴⁰. Similar problems arise in multiple testing situations. If more than one statistical test is performed on a given data set, the chance of drawing at least one false conclusion increases rapidly with the number of tests applied. Thus, failures to control the familywise error rate and adjust the p-value adequately also contribute to poor reproducibility by increasing the chance of producing and publishing false positive findings⁴¹.
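
Both statistical problems described above can be illustrated with two textbook formulas. The following Python sketch is not taken from the cited studies: it assumes independent tests for the familywise error rate, and it uses the standard design-effect formula with a hypothetical intraclass correlation (`icc`) to show how little independent information additional pups from the same litter contribute.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests


def effective_sample_size(n_litters: int, pups_per_litter: int, icc: float) -> float:
    """Number of independent observations implied by clustered data.

    icc is the (assumed) correlation between pups of the same litter:
    0 means pups are independent, 1 means littermates are interchangeable.
    """
    design_effect = 1.0 + (pups_per_litter - 1) * icc
    return n_litters * pups_per_litter / design_effect


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(f"{k:2d} tests: familywise error rate = {familywise_error_rate(0.05, k):.3f}")
    print("Bonferroni threshold for 10 tests:", bonferroni_alpha(0.05, 10))
    # Hypothetical example: 10 litters, 2 pups each, strongly correlated littermates.
    print("Effective n:", round(effective_sample_size(10, 2, icc=0.8), 1))
```

With perfectly correlated littermates (`icc=1.0`), the effective sample size collapses to the number of litters, which is why the litter rather than the pup is the appropriate experimental unit in such designs.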

Apart from these methodological issues, animal welfare constraints, poor training of researchers in experimental design and conduct¹⁷,⁴², as well as current publication ethics⁵,⁴³,⁴⁴, have all been linked to poor reproducibility and translational failures in animal experimentation. The first point, in particular, was aptly summarized at a very early stage by Trevor Poole in his article “Happy animals make good science”⁴⁵. The idea here is that a good welfare state is better not only for the animal, but also for the quality of the scientific results, which are then derived from animals of “normal” behavior and physiology⁴⁶,⁴⁷. Furthermore, current teaching standards have been criticized, because only a few scientists receive formal training in experimental techniques and statistical analyses. Many experiments are therefore planned and conducted on the basis of “lab traditions” rather than “good laboratory practices”, resulting in wrong and irreproducible findings⁴². Regarding publication ethics, key words like “publication bias”, “selective reporting”, or “p-hacking” have dominated the debate⁴⁸,⁴⁹. In particular, the overweighting of positive results leads to a subsequent overestimation of effect sizes in meta-analyses and systematic reviews. Sena and colleagues, for example, came to the conclusion that publication bias accounts for one third of the effect observed in animal stroke studies⁴³, and Simmons and colleagues argued that even before publication, researchers are remarkably adept at reaching those conclusions that mesh with their desires⁵⁰. Current publication standards that emphasize positive results are thus likely to further exacerbate the reproducibility problems described in the life sciences.

Standardization in animal experimentation: necessity or fallacy?

More recently, the concept of standardization has gained attention as an additional source of irreproducible findings, especially in the context of behavioral phenotyping studies¹³. Standardization within experiments aims at reducing variation in the data, thereby increasing test sensitivity and reducing animal use⁵¹. Furthermore, standardization between experiments aims at reducing between-experiment variation, thereby improving the comparability and reproducibility of results between studies⁵². In light of the reproducibility crisis, however, the question arises whether standardization really is a prerequisite for good reproducibility.

In practice, the concept of standardization has led to rigorous homogenization of the animals' genotype (for example, by inbreeding), the laboratory environment (for example, by using uniform cage enrichment), the daily routines (for example, by standardized handling procedures), and the test situation (for example, by defining the time of testing). The idea here is to isolate the variables of interest, minimize the background noise, and maximize the detection of even subtle treatment effects. However, while this approach may indeed allow for exploring condition-restricted hits effectively, it can reduce information gain at the same time. Taken to its extreme, fully effective homogenization would decrease inter-individual variation within a study population to zero, leading to statistically significant, but possibly irrelevant results that lack generalizability to slightly different conditions (referred to as the “standardization fallacy”⁵³,⁵⁴; Fig. 1).

**Figure 1: Simplified schematic illustration of the standardization fallacy.**

Ironically, this standardization fallacy can be best demonstrated by poor reproducibility in the scientific literature. In a groundbreaking study involving three different laboratories, Crabbe and colleagues conducted a series of common behavioral tests in eight different mouse strains that were delivered, housed, reared, handled and tested under highly homogenized conditions. Notwithstanding this extreme level of standardization between facilities, the authors observed interactions between genotype and laboratory (i.e., genetically identical mice behaved differently depending on site). Based on these results, the authors hypothesized that “experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory”⁵⁵. Following these initial findings, several other multi-laboratory studies confirmed difficulties in reproducing behavioral strain differences across labs⁵⁶,⁵⁷,⁵⁸, clearly showing that reproducibility problems arise despite rigorous standardization regimes. Because different laboratories inevitably standardize to different local constellations of experimental conditions (i.e., many factors, such as the experimenter, room architecture, or daily routines, cannot be standardized between laboratories), within-laboratory standardization will always exceed between-laboratory standardization. It is therefore not surprising that increasingly rigorous standardization within labs produces results that are increasingly distinct between laboratories (Fig. 1).

Previous approaches to solving the issue of poor reproducibility in behavioral phenotyping studies have focused on the search for tests yielding robust results across experiments and laboratories⁵⁹,⁶⁰. Suggestions for improving the situation range from establishing one “gold standard” test for each domain to using a battery of tests, all believed to measure the same construct, to assess the robustness of measures⁶¹. Other approaches have focused on the experimenter as a major source of experimental noise, leading to the development of human-free testing environments, such as the SmartCube, the IntelliCage⁶² or modern touchscreen-based procedures⁶³,⁶⁴. Automated testing minimizes the need for human intervention and its accompanying stress and is therefore believed to reduce inter-individual variation⁶⁵. However, because the development of home cage testing is still under way, currently existing test systems are often too complex for high-throughput approaches.

Systematic heterogenization rather than rigorous standardization?

A central fact in biology is that living organisms vary. Such phenotypic plasticity relies on complex gene-by-environment interactions that shape the individual phenotype. With the aim of reducing such variation, standardization neglects individual phenotypic plasticity, thereby counteracting the widely adopted idea of “reaction norms”¹³ (i.e., the pattern of phenotypes produced by a given genotype under different environmental conditions⁶⁶,⁶⁷). Instead of spiriting this biological variation away, however, inter-individual differences may be key to making study populations more representative¹³. Thus, it may be advantageous, rather than detrimental, to use samples varied across genetic and/or environmental conditions to increase the external validity of the results and improve the reproducibility of research findings. So, how can this logic be transferred into practice?

Common practice to identify idiosyncratic results is to run independent replicate experiments¹⁶,⁶⁸. Ideally, a replicate experiment is not a mere repetition of the original experiment, but extends the scope by varying a particular set of factors¹⁶. If a replication study then fails to confirm the results, the replicate study, the original study, or both may have produced false or spurious results of limited external validity⁶⁹. Although this method indeed provides information on the robustness of a finding, it raises practical and ethical questions, because the need for replicate studies may easily inflate the number of animals needed to confirm a “true” effect. It thus seems preferable to incorporate such a “robustness check” directly in the experimental design. In this respect, the concept of “systematic heterogenization” has been proposed as a powerful tool to extract robust and hence reproducible findings in animal experiments⁷⁰,⁷¹,⁷².

The underlying idea of systematic heterogenization is to introduce variation systematically into a single experiment to make study populations more representative and findings more robust across the variation that inevitably exists between experiments. In line with this idea, a recent simulation study revealed greater variation of treatment effects between different single-laboratory studies in comparison to different multi-laboratory studies. Furthermore, reproducibility was improved from less than 50% to over 80% in studies involving as few as three labs⁷. These findings clearly indicate that the inevitable increase in environmental variation in a multi-laboratory situation benefits the external validity and hence the reproducibility of treatment effects. Since it is unlikely, however, that all single experiments will be replaced by multi-laboratory approaches in the near future, systematic heterogenization aims at transferring this logic to a single-lab situation by increasing the variation within each single experiment.
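
The reasoning behind the multi-laboratory advantage can be made concrete with a simple variance calculation. The Python sketch below is an illustration under stated assumptions, not the method of the cited simulation study: it assumes a balanced two-group design, a fixed total number of animals split evenly across labs, and a hypothetical lab-specific deviation of the treatment effect (`sd_lab_by_treatment`).

```python
def effect_variance_between_studies(
    n_labs: int,
    n_total: int,
    sd_within: float,
    sd_lab_by_treatment: float,
) -> float:
    """Variance of the estimated treatment effect across repeated studies.

    Assumed model (illustrative only): each lab's true effect equals the
    overall effect plus a lab-specific deviation with standard deviation
    sd_lab_by_treatment; animals are split evenly between treatment and
    control and evenly across labs; the study estimate is the mean of the
    per-lab differences.
    """
    sampling_part = 4.0 * sd_within ** 2 / n_total    # shrinks with more animals
    lab_part = sd_lab_by_treatment ** 2 / n_labs      # shrinks only with more labs
    return sampling_part + lab_part


if __name__ == "__main__":
    # Hypothetical numbers: 24 animals in total, unit within-group SD.
    for labs in (1, 2, 3, 4):
        variance = effect_variance_between_studies(labs, 24, 1.0, 0.5)
        print(f"{labs} lab(s): between-study variance of the effect = {variance:.3f}")
```

In this simplified model, adding animals to a single-laboratory study never removes the lab-specific component, whereas spreading the same number of animals over several labs does. Systematic heterogenization tries to obtain a similar benefit by sampling several conditions within one laboratory.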

Including variation in a non-systematic and uncontrolled way may bear the risk of inflating the number of animals needed for each experiment. It is thus important to combine the approach with adequate analytical techniques and experimental designs, such as split-plot, factorial, or randomized block designs that control for the introduced variation without reducing test sensitivity and statistical power⁷⁰,⁷³,⁷⁴. The potential value of split-plot designs, for example, has recently been demonstrated by a study investigating mixed-strain housing. Co-housing individuals of different strains increased the external validity of the experiment without exerting negative effects on the data variability and the statistical power⁷⁴. Originally, these designs derived from agricultural research, where the experimental area was divided into heterogeneous blocks of land⁷³. However, such techniques are likely to have a much wider applicability in laboratory animal science, since they allow combining animals of, for example, different ages, batches, strains, litters, cages, or environmental conditions within a single experiment⁷⁵.
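
As a rough illustration of how a randomized block design keeps introduced variation under control, the following Python sketch assigns treatments at random within each block, so that every block contributes equally to every treatment group. The function name, block labels, and animal IDs are hypothetical and chosen for illustration only.

```python
import random


def randomized_block_allocation(blocks, treatments, seed=0):
    """Randomly assign treatments within each block, as evenly as possible.

    blocks: dict mapping a block label (e.g., litter, batch, cage rack)
            to a list of animal IDs.
    Returns a dict mapping animal ID -> (block label, treatment).
    """
    rng = random.Random(seed)  # fixed seed keeps the allocation reproducible
    allocation = {}
    for block, animals in blocks.items():
        # Repeat the treatment list until it covers the block, then shuffle it.
        repeats = -(-len(animals) // len(treatments))  # ceiling division
        labels = (list(treatments) * repeats)[: len(animals)]
        rng.shuffle(labels)
        for animal, treatment in zip(animals, labels):
            allocation[animal] = (block, treatment)
    return allocation


if __name__ == "__main__":
    blocks = {
        "litter_1": ["m01", "m02", "m03", "m04"],
        "litter_2": ["m05", "m06", "m07", "m08"],
    }
    plan = randomized_block_allocation(blocks, ["control", "treated"], seed=42)
    for animal, (block, treatment) in sorted(plan.items()):
        print(animal, block, treatment)
```

Because treatment is balanced within each block, the block can later be included as a factor in the analysis, so the variation it introduces is accounted for rather than added to the error term.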

Overall, such systematic and controlled forms of heterogenization may replace or at least complement many of the conventionally used standardized approaches in animal experimentation. While in some cases a highly standardized experiment can be useful to identify single condition-restricted treatment effects, a heterogenized approach may help to detect more universally applicable conclusions. The best approach, however, clearly depends on the specific research goal. From an animal-ethical point of view, systematic heterogenization contributes to the refinement and reduction of animal experimentation by either reducing the number of experiments needed to detect a meaningful result or by increasing information gained based on the same number of animals.

Toward better reproducibility: applying a heterogenization strategy

Some first heterogenization studies are indeed promising. In a series of three experiments, standardization was found to increase the incidence of spurious results in behavioral tests, accounting for poor reproducibility, while systematic heterogenization attenuated spurious results, thereby improving reproducibility⁷¹,⁷²,⁷⁶. Here, systematic heterogenization was achieved by varying two defined environmental factors that are known to interact with mouse genotype: (1) cage enrichment and (2) test age. According to a 2 × 2 factorial design, each factor was varied across two factor levels A and B, resulting in four different factor combinations (1A, 1B, 2A, 2B). Each heterogenized experiment was thus composed of mice that were kept and tested in four different ways.
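
The 2 × 2 layout described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the cited experiments: the factor names, level labels, strain names, and the round-robin allocation are assumptions made for the example.

```python
import random
from itertools import product


def factor_combinations(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def spread_over_conditions(animals_by_strain, conditions, seed=0):
    """Distribute the animals of each strain evenly over all conditions."""
    rng = random.Random(seed)
    plan = {}
    for strain, animals in animals_by_strain.items():
        shuffled = list(animals)
        rng.shuffle(shuffled)  # random order, then round-robin over conditions
        for i, animal in enumerate(shuffled):
            plan[animal] = {"strain": strain, **conditions[i % len(conditions)]}
    return plan


if __name__ == "__main__":
    conditions = factor_combinations(
        {"enrichment": ["A", "B"], "test_age": ["A", "B"]}
    )
    print(len(conditions), "conditions:", conditions)
    animals = {
        "strain_1": [f"s1_{i:02d}" for i in range(8)],
        "strain_2": [f"s2_{i:02d}" for i in range(8)],
    }
    for animal, row in sorted(spread_over_conditions(animals, conditions).items()):
        print(animal, row)
```

Each strain is represented equally in all four conditions, so strain differences are estimated across the introduced environmental variation rather than under a single condition.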

Interestingly, this simple form of systematic environmental variation was sufficient to guarantee almost perfect reproducibility of behavioral strain differences between replicate experiments within a single laboratory⁷² (but see also refs. 77,78,79). Between laboratories, however, the observed improvement was not as strong as in the single-lab situation⁷⁶. Although heterogenization improved reproducibility compared to standardization, differences in the size and direction of strain effects occurred in both experimental approaches. Thus, despite the increasing awareness of reproducibility problems, the experimental design of animal experiments is still in need of refinement. While the strict homogenization of experimental conditions obviously does not cure poor reproducibility and translational failures, it is still not clear which types of systematic heterogenization may improve the situation and how this approach can be transferred into practice.

Richard Paylor suggested splitting experiments into small batches of animals that are tested some time apart instead of testing them all at once in just one large batch⁸⁰. The underlying idea here is very close to the proposed concept of systematic heterogenization. Because each single small batch is supposed to rely on a unique time-dependent constellation of environmental and testing conditions, combining several “mini-experiments” in one big experiment is assumed to mimic a multi-laboratory situation within a single experiment and therefore to result in findings that are more robust. The approach reflects a kind of “systematic heterogenization over time” or “batch heterogenization” (Fig. 2). This is also in line with findings from computational approaches that have identified and ranked sources of variability in nociceptive responses in mice, showing that both season and time of day greatly influence the outcome measures⁸¹,⁸². Similarly, Karp and colleagues conducted an analysis of data from phenotyping studies, showing that batch (i.e., the time point of testing) explains about a quarter of the observed variation in mouse phenotypes⁸³.
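
A batch-heterogenized experiment of this kind could be planned along the following lines. The Python sketch is only an illustration: the number of batches, the interval between them, the start date, and the animal IDs are hypothetical, and the source does not prescribe any particular schedule.

```python
import random
from datetime import date, timedelta


def batch_schedule(animals_by_group, n_batches, start, gap_days, seed=0):
    """Split every treatment group evenly over batches tested some time apart.

    Returns a list of batches, each with a test date and a list of
    (animal ID, treatment group) pairs.
    """
    rng = random.Random(seed)
    batches = [
        {"test_date": start + timedelta(days=i * gap_days), "animals": []}
        for i in range(n_batches)
    ]
    for group, animals in animals_by_group.items():
        shuffled = list(animals)
        rng.shuffle(shuffled)  # random order, then round-robin over batches
        for i, animal in enumerate(shuffled):
            batches[i % n_batches]["animals"].append((animal, group))
    return batches


if __name__ == "__main__":
    groups = {
        "control": [f"c{i:02d}" for i in range(6)],
        "treated": [f"t{i:02d}" for i in range(6)],
    }
    plan = batch_schedule(groups, n_batches=3, start=date(2024, 1, 8), gap_days=14)
    for batch in plan:
        print(batch["test_date"].isoformat(), sorted(batch["animals"]))
```

Every batch contains the same number of animals from each group, so the batch can be treated as a blocking factor in the analysis while the treatment comparison remains balanced within each “mini-experiment”.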

**Figure 2: Systematic heterogenization over time (“batch heterogenization”).**

Furthermore, the experimenter has been shown to be one of the most important factors influencing the outcome of an experiment⁵⁷,⁸¹,⁸². Precisely what differentiates the experimenters between studies remains unknown, but recent work has shown that even the gender of the experimenter can affect baseline responses in behavioral testing to a significant extent⁸⁴. Involving multiple experimenters for testing, instead of using only one, may suffice to make the study populations more representative and therefore less prone to variation between studies. Alternatively, genetic rather than environmental variation may represent a promising strategy to increase the external validity and the reproducibility of research findings. Because genetic background has been found to strongly modulate mutant phenotypes⁸⁵, the systematic variation of different strains or genotypes within a single study may also contribute to increased generalizability. Testing these different strategies will reveal whether such minor variations are indeed sufficient to significantly improve the reproducibility of research findings, especially in the context of behavioral phenotyping studies³.

Conclusions

Poor reproducibility and translational failures in animal experimentation can be attributed to deficiencies on many different levels. While most researchers have linked these problems to poor experimental design and conduct, poor reporting standards, and animal welfare constraints, they may also result from strict homogenization regimes that are widely practiced in biomedical research. Instead, a systematically heterogenized experimental approach that takes biological variation into account might help to improve representativeness of study populations and contribute to improved external validity and reproducibility of research findings. However, despite some first efforts toward heterogenized experimental strategies, there is still no “golden solution” for the conduct of single laboratory experiments, highlighting the need for further improvement strategies and innovative research approaches.

References

Unreliable research. Trouble at the lab. The Economist (2013).
Ioannidis, J.P. Why most published research findings are false. PLoS Med. 2, e124 (2005).
Article PubMed PubMed Central Google Scholar
Bailoo, J.D., Reichlin, T.S. & Würbel, H. Refinement of experimental design and conduct in laboratory animal research. ILAR J. 55, 383–391 (2014).
Article CAS PubMed Google Scholar
Kola, I. & Landis, J. Can the pharmaceutical industry reduce attrition rates? Nat. Rev. Drug Discov. 3, 711–716 (2004).
Article CAS PubMed Google Scholar
Van der Worp, H.B. et al. Can animal models of disease reliably inform human studies? PLoS Med. 7, e1000245 (2010).
Article PubMed PubMed Central Google Scholar
Mogil, J.S. Laboratory environmental factors and pain behavior: the relevance of unknown unknowns to reproducibility and translation. Lab Anim. (NY) 46, 136–141 (2017).
Article Google Scholar
Würbel, H. More than 3Rs: the importance of scientific validity for harm-benefit analysis of animal research. Lab Anim. (NY) 46, 164–166 (2017).
Article Google Scholar
Garner, J.P., Gaskill, B.N., Weber, E.M., Ahloy-Dallaire, J. & Pritchett-Corning, K.R. Introducing Therioepistemology: the study of how knowledge is gained from animal research. Lab Anim. (NY) 46, 103–113 (2017).
Article Google Scholar
Jarvis, M.F. & Williams, M. Irreproducibility in preclinical biomedical research: perceptions, uncertainties, and knowledge gaps. Trends Pharmacol. Sci. 37, 290–302 (2016).
Article CAS PubMed Google Scholar
Seok, J. et al. Genomic responses in mouse models poorly mimic human inflammatory diseases. Proc. Natl. Acad. Sci. USA 110, 3507–3512 (2013).
Article CAS PubMed PubMed Central Google Scholar
Scannell, J.W. & Bosley, J. When quality beats quantity: decision theory, drug discovery, and the reproducibility crisis. PLoS ONE 11, e0147215 (2016).
Article PubMed PubMed Central CAS Google Scholar
Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).
Article CAS PubMed Google Scholar
Voelkl, B. & Würbel, H. Reproducibility crisis: are we ignoring reaction norms? Trends Pharmacol. Sci. 37, 509–510 (2016).
Article CAS PubMed Google Scholar
Peng, R. The reproducibility crisis in science: A statistical counterattack. Significance 12, 30–32 (2015).
Article Google Scholar
Begley, C.G. & Ioannidis, J.P. Reproducibility in science. Circ. Res. 116, 116–126 (2015).
Article CAS PubMed Google Scholar
van der Staay, F.J., Arndt, S.S. & Nordquist, R.E. Evaluation of animal models of neurobehavioral disorders. Behav. Brain Funct. 5, 11 (2009).
Article PubMed PubMed Central Google Scholar
Collins, F.S. & Tabak, L.A. Policy: NIH plans to enhance reproducibility. Nature 505, 612–613 (2014).
Article PubMed PubMed Central Google Scholar
Prinz, F., Schlange, T. & Asadullah, K. Believe it or not: how much can we rely on published data on potential drug targets? Nat. Rev. Drug Discov. 10, 712 (2011).
Article CAS PubMed Google Scholar
Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
Ioannidis, J.P. et al. Repeatability of published microarray gene expression analyses. Nat. Genet. 41, 149–155 (2009).
Article CAS PubMed Google Scholar
Freedman, L.P., Cockburn, I.M. & Simcoe, T.S. The economics of reproducibility in preclinical research. PLoS Biol. 13, e1002165 (2015).
Begley, C.G. & Ellis, L.M. Raise standards for preclinical cancer research. Nature 483, 531–533 (2012).
Giles, J. Animal experiments under fire for poor design. Nature 444, 981 (2006).
Ioannidis, J.P. et al. Increasing value and reducing waste in research design, conduct, and analysis. Lancet 383, 166–175 (2014).
Macleod, M.R. et al. Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biol. 13, e1002273 (2015).
Reichlin, T.S., Vogt, L. & Würbel, H. The researchers' view of scientific rigor—survey on the conduct and reporting of in vivo research. PLoS ONE 11, e0165999 (2016).
van der Worp, H.B., de Haan, P., Morrema, E. & Kalkman, C.J. Methodological quality of animal studies on neuroprotection in focal cerebral ischaemia. J. Neurol. 252, 1108–1114 (2005).
Vogt, L., Reichlin, T.S., Nathues, C. & Würbel, H. Authorization of animal experiments is based on confidence rather than evidence of scientific rigor. PLoS Biol. 14, e2000598 (2016).
McNutt, M. Journals unite for reproducibility. Science 346, 679 (2014).
Kilkenny, C., Browne, W., Cuthill, I.C., Emerson, M. & Altman, D.G. Animal research: reporting in vivo experiments: the ARRIVE guidelines. Br. J. Pharmacol. 160, 1577–1579 (2010).
Kilkenny, C., Browne, W.J., Cuthill, I.C., Emerson, M. & Altman, D.G. Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research. PLoS Biol. 8, e1000412 (2010).
Baker, D., Lidster, K., Sottomayor, A. & Amor, S. Two years later: journals are not yet enforcing the ARRIVE guidelines on reporting standards for pre-clinical animal studies. PLoS Biol. 12, e1001756 (2014).
Lazic, S.E. & Essioux, L. Improving basic and translational science by accounting for litter-to-litter variation in animal models. BMC Neurosci. 14, 37 (2013).
Festing, M.F. Design and statistical methods in studies using animal models of development. ILAR J. 47, 5–14 (2006).
Halsey, L.G., Curran-Everett, D., Vowler, S.L. & Drummond, G.B. The fickle P value generates irreproducible results. Nat. Methods 12, 179–185 (2015).
Goodman, S.N. Aligning statistical and scientific reasoning. Science 352, 1180–1181 (2016).
Wainwright, P.E. Issues of design and analysis relating to the use of multiparous species in developmental nutritional studies. J. Nutr. 128, 661–663 (1998).
Zorrilla, E.P. Multiparous species present problems (and possibilities) to developmentalists. Dev. Psychobiol. 30, 141–150 (1997).
Holson, R.R. & Pearce, B. Principles and pitfalls in the analysis of prenatal treatment effects in multiparous species. Neurotoxicol. Teratol. 14, 221–228 (1992).
Lazic, S.E. The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neurosci. 11, 5 (2010).
Noble, W.S. How does multiple testing correction work? Nat. Biotechnol. 27, 1135–1137 (2009).
Festing, M.F. We are not born knowing how to design and analyse scientific experiments. Altern. Lab. Anim. 41, 19–21 (2013).
Sena, E.S., Van Der Worp, H.B., Bath, P.M., Howells, D.W. & Macleod, M.R. Publication bias in reports of animal stroke studies leads to major overstatement of efficacy. PLoS Biol. 8, e1000344 (2010).
Cumming, G. The new statistics: why and how. Psychol. Sci. 25, 7–29 (2014).
Poole, T. Happy animals make good science. Lab. Anim. 31, 116–124 (1997).
Garner, J.P. Stereotypies and other abnormal repetitive behaviors: potential impact on validity, reliability, and replicability of scientific outcomes. ILAR J. 46, 106–117 (2005).
Prescott, M.J. & Lidster, K. Improving quality of science through better animal welfare: the NC3Rs strategy. Lab Anim. (NY) 46, 152–156 (2017).
Nuzzo, R. Statistical errors. Nature 506, 150 (2014).
Head, M.L., Holman, L., Lanfear, R., Kahn, A.T. & Jennions, M.D. The extent and consequences of p-hacking in science. PLoS Biol. 13, e1002106 (2015).
Simmons, J.P., Nelson, L.D. & Simonsohn, U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366 (2011).
Festing, M.F. Reduction of animal use: experimental design and quality of experiments. Lab. Anim. 28, 212–221 (1994).
Beynen, A.C., Baumans, V. & Van Zutphen, L.F.M. in Principles of Laboratory Animal Science (eds. L.F.M. Van Zutphen, V. Baumans & A.C. Beynen) 103–110 (Elsevier, Amsterdam, 2001).
Würbel, H. Behaviour and the standardization fallacy. Nat. Genet. 26, 263 (2000).
Würbel, H. Behavioral phenotyping enhanced–beyond (environmental) standardization. Genes Brain Behav. 1, 3–8 (2002).
Crabbe, J.C., Wahlsten, D. & Dudek, B.C. Genetics of mouse behavior: interactions with laboratory environment. Science 284, 1670–1672 (1999).
Mandillo, S. et al. Reliability, robustness, and reproducibility in mouse behavioral phenotyping: a cross-laboratory study. Physiol. Genomics 34, 243–255 (2008).
Lewejohann, L. et al. Environmental bias? Effects of housing conditions, laboratory environment and experimenter on behavioral tests. Genes Brain Behav. 5, 64–72 (2006).
Wolfer, D.P. et al. Laboratory animal welfare: cage enrichment and mouse behaviour. Nature 432, 821–822 (2004).
Wahlsten, D. Standardizing tests of mouse behavior: reasons, recommendations, and reality. Physiol. Behav. 73, 695–704 (2001).
Wahlsten, D. et al. Different data from different labs: lessons from studies of gene–environment interaction. J. Neurobiol. 54, 283–311 (2003).
Crabbe, J.C. & Morris, R.G. Festina lente: late-night thoughts on high-throughput screening of mouse behavior. Nat. Neurosci. 7, 1175–1179 (2004).
Galsworthy, M.J. et al. A comparison of wild-caught wood mice and bank voles in the Intellicage: assessing exploration, daily activity patterns and place learning paradigms. Behav. Brain Res. 157, 211–217 (2005).
Talpos, J. & Steckler, T. Touching on translation. Cell Tissue Res. 354, 297–308 (2013).
Richter, S.H. et al. Touchscreen-paradigm for mice reveals cross-species evidence for an antagonistic relationship of cognitive flexibility and stability. Front. Behav. Neurosci. 8, 154 (2014).
Richardson, C.A. Automated homecage behavioural analysis and the implementation of the three Rs in research involving mice. Altern. Lab. Anim. 40, 7–9 (2012).
Dingemanse, N.J., Kazem, A.J., Réale, D. & Wright, J. Behavioural reaction norms: animal personality meets individual plasticity. Trends Ecol. Evol. 25, 81–89 (2010).
Sarkar, S. From the Reaktionsnorm to the adaptive norm: the norm of reaction, 1909–1960. Biol. Philos. 14, 235–252 (1999).
van der Staay, F.J. Animal models of behavioral dysfunctions: basic concepts and classifications, and an evaluation strategy. Brain Res. Rev. 52, 131–159 (2006).
Muma, J.R. The need for replication. J. Speech Lang. Hear. Res. 36, 927–930 (1993).
Würbel, H. & Garner, J.P. Refinement of rodent research through environmental enrichment and systematic randomization. NC3Rs 9, 1–9 (2007).
Richter, S.H., Garner, J.P. & Würbel, H. Environmental standardization: cure or cause of poor reproducibility in animal experiments? Nat. Methods 6, 257–261 (2009).
Richter, S.H., Garner, J.P., Auer, C., Kunert, J. & Würbel, H. Systematic variation improves reproducibility of animal experiments. Nat. Methods 7, 167–168 (2010).
Grafen, A. & Hails, R. Modern statistics for the life sciences (Oxford University Press, Oxford, 2002).
Walker, M. et al. Mixed-strain housing for female C57BL/6, DBA/2, and BALB/c mice: validating a split-plot design that promotes refinement and reduction. BMC Med. Res. Methodol. 16, 11 (2016).
Festing, M.F. & Altman, D.G. Guidelines for the design and statistical analysis of experiments using laboratory animals. ILAR J. 43, 244–258 (2002).
Richter, S.H. et al. Effect of population heterogenization on the reproducibility of mouse behavior: a multi-laboratory study. PLoS ONE 6, e16461 (2011).
Würbel, H., Richter, S.H. & Garner, J.P. Reply to: “Reanalysis of Richter et al. (2010) on reproducibility”. Nat. Methods 10, 374 (2013).
Jonker, R.M., Guenther, A., Engqvist, L. & Schmoll, T. Does systematic variation improve the reproducibility of animal experiments? Nat. Methods 10, 373 (2013).
Wolfinger, R.D. Reanalysis of Richter et al. (2010) on reproducibility. Nat. Methods 10, 373–374 (2013).
Paylor, R. Questioning standardization in science. Nat. Methods 6, 253–254 (2009).
Chesler, E.J., Wilson, S.G., Lariviere, W.R., Rodriguez-Zas, S.L. & Mogil, J.S. Identification and ranking of genetic and laboratory environment factors influencing a behavioral trait, thermal nociception, via computational analysis of a large data archive. Neurosci. Biobehav. Rev. 26, 907–923 (2002).
Chesler, E.J., Wilson, S.G., Lariviere, W.R., Rodriguez-Zas, S.L. & Mogil, J.S. Influences of laboratory environment on behavior. Nat. Neurosci. 5, 1101–1102 (2002).
Karp, N.A., Melvin, D., Mott, R.F. & Sanger Mouse Genetics Project. Robust and sensitive analysis of mouse knockout phenotypes. PLoS ONE 7, e52410 (2012).
Sorge, R.E. et al. Olfactory exposure to males, including men, causes stress and related analgesia in rodents. Nat. Methods 11, 629–632 (2014).
Sittig, L.J. et al. Genetic background limits generalizability of genotype-phenotype relationships. Neuron 91, 1253–1259 (2016).

Acknowledgements

Current research on heterogenization and reproducibility is funded by the German Research Foundation (DFG, RI 2488/3-1). Furthermore, I would like to thank Norbert Sachser, Hanno Würbel, Sara Hintze, Niklas Kästner, and Vanessa von Kortzfleisch for their helpful comments on earlier drafts of this manuscript.

Author information

Authors and Affiliations

RG Behavioural Biology and Animal Welfare, Institute of Neuro and Behavioural Biology, University of Münster, Münster, Germany
S Helene Richter

Corresponding author

Correspondence to S Helene Richter.

Ethics declarations

Competing interests

The author declares no competing financial interests.

About this article

Cite this article

Richter, S. Systematic heterogenization for better reproducibility in animal experimentation. Lab Anim 46, 343–349 (2017). https://doi.org/10.1038/laban.1330

Received: 26 March 2017
Accepted: 15 May 2017
Published: 31 August 2017
Issue Date: September 2017
DOI: https://doi.org/10.1038/laban.1330
Springer Nature America, Inc.
