Abstract
It is claimed here that the confidence mathematics education researchers place in statistical significance testing (SST) as an inference tool par excellence for experimental research is misplaced. Five common myths about SST are discussed, namely that SST: (a) is a controversy-free, recipe-like method for decision making; (b) answers the question of whether there is a low probability that the research results were due to chance; (c) follows a logic that parallels mathematical proof by contradiction; (d) addresses the question of reliability or replicability; and (e) is a necessary but not sufficient condition for the credibility of results. It is argued that SST does not benefit educational research in general, or mathematics education research in particular, and that it should be discontinued as a tool for such research. Some alternatives to SST are suggested, and a call is made for mathematics education researchers to take the lead in using these alternatives.
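Myth (b) can be made concrete with a small calculation. The sketch below is an illustration only and is not taken from the article: it assumes a simple two-hypothesis setting (the null hypothesis is either true or false) and uses hypothetical values for the prior probability of the null, the significance level, and the power of the test. Under those assumptions, Bayes' rule gives the probability that the null hypothesis is true given a significant result, which is a different quantity from the significance level itself.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """Return P(null true | significant result) under a two-hypothesis model.

    prior_null: assumed prior probability that the null hypothesis is true
    alpha:      probability of a significant result when the null is true
    power:      probability of a significant result when the null is false
    All three inputs are illustrative assumptions, not values from the article.
    """
    p_sig_and_null = alpha * prior_null          # significant and null true
    p_sig_and_alt = power * (1.0 - prior_null)   # significant and null false
    return p_sig_and_null / (p_sig_and_null + p_sig_and_alt)


if __name__ == "__main__":
    alpha = 0.05
    # Two hypothetical scenarios: (prior probability of the null, power).
    for prior_null, power in [(0.5, 0.8), (0.9, 0.3)]:
        posterior = prob_null_given_significant(prior_null, alpha, power)
        print(f"prior={prior_null}, power={power}: "
              f"P(null | significant) = {posterior:.3f}")
```

With the same significance level of 0.05, the two hypothetical scenarios give quite different probabilities that the null is true, so a significant result alone does not say how likely it is that the results were "due to chance".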
Cite this article
Menon, R. Statistical significance testing should be discontinued in mathematics education research. Math Ed Res J 5, 4–18 (1993). https://doi.org/10.1007/BF03217248
DOI: https://doi.org/10.1007/BF03217248