Sport concussion is a significant public health issue, with estimated rates ranging from 0.1 to 32.1 per 1,000 athletic exposures, depending on the sport, level of play, and injury surveillance methods employed (Clay et al., 2013; Kerr et al., 2018). Consensus guidelines recommend assessment of cognitive function following suspected concussion in athletes (McCrory et al., 2017). Two testing modalities currently exist for this purpose: computerized cognitive tools and conventional tests administered as part of a neuropsychological evaluation.

While traditional “paper-and-pencil” neuropsychological tests are the gold standard for evaluating human cognitive function, the last two decades have seen the development of computerized neurocognitive tests for the evaluation of sport concussion (Meehan et al., 2012). These computerized tools have been widely adopted by athletic programs given practical advantages such as ease of administration and scoring, ability to test athletes in groups rather than individually, and portable test results (Collie et al., 2001). However, these tools are tailored to the areas of cognition most susceptible to concussion (e.g., processing speed) and as such do not evaluate other aspects of cognition including executive functioning, verbal fluency, and free recall memory (Belanger & Vanderploeg, 2005; Broglio & Puetz, 2008; Iverson & Schatz, 2015).

To inform best practices in sport concussion assessment, we sought to update and expand prior reviews evaluating the use of computerized and conventional cognitive testing in concussion assessment. Past literature reviews have identified highly variable psychometric estimates across tests (Farnsworth et al., 2017; Randolph et al., 2005; Resch et al., 2013a, b). Our goal was to utilize systematic review methodology to describe published psychometric data and to formally compare computerized cognitive tools with standard neuropsychological tests that may be used to assess athletes with a suspected concussion.

Methods

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed (Moher et al., 2009). We defined the population, intervention, comparison, and outcome (PICO), as recommended by the American Academy of Neurology guidelines for evidence-based medicine (Gronseth et al., 2011, pp. 3–4). Our clinical research question in PICO format was, “In English-speaking athletes, do computerized cognitive tools, compared to standard neuropsychological tests, provide sufficiently reliable and valid measures of cognitive functioning?” Thresholds for good reliability and sensitivity of clinical assessments, including subtest scores, were based on previously published criteria and convention (Blake et al., 2002; Nunnally & Bernstein, 1994; Slick, 2006; Weissberger et al., 2017).

Search Strategy

Searches were developed with the assistance of a medical librarian at the Medical College of Wisconsin. Terms were combined for three main concepts: concussion, cognitive assessment, and athletes (see Table S1 in supplementary materials). Databases queried included Ovid Medline, Web of Science (Core Collection), PsycINFO, Scopus, Cochrane (Database of Systematic Reviews, Register of Controlled Trials, and Protocols; searched via Wiley), the Cumulative Index to Nursing and Allied Health Literature (CINAHL; searched via EBSCO), and the Education Resources Information Center (ERIC; searched via EBSCO). The initial database query was conducted in December 2018 (without date restriction), and results were most recently updated with a follow-up search in August 2021.

Study Selection

Studies were included if they evaluated the psychometric properties (e.g., sensitivity, specificity, reliability, convergent or discriminant validity) of one or more cognitive assessments. Inclusion was restricted to studies that examined only athlete participants, although the cognitive assessments did not need to be designed solely for athlete use. We included postconcussive evaluations conducted up to 30 days post injury, with evaluations within 72 h of injury considered acute (Joseph et al., 2018). Results for non-English tests and studies that solely examined assessments of balance, vestibular function, or oculomotor function were excluded. Because only journal articles were considered, conference abstracts, commentaries, letters, and editorials were omitted. References of relevant review articles yielded by the search were examined to identify additional studies for inclusion.

Initial screening was conducted based on titles and abstracts using Rayyan (http://rayyan.qcri.org/). Two authors made independent inclusion decisions, with a third author providing an independent rating to resolve discrepancies and achieve consensus on inclusion. Articles that passed the initial screening underwent full-text review in the same manner.

Data Extraction and Analysis

Prior to data extraction, our review protocol was registered with PROSPERO, an international prospective register of systematic reviews (Record ID 180763; see supplementary materials). Sample characteristics and key findings were extracted from all studies meeting inclusion criteria, and the information was verified by a second reviewer.

Risk of bias was assessed via the Quality Assessment of Diagnostic Accuracy Studies—Version 2 (QUADAS-2; Whiting et al., 2011). The QUADAS-2 was modified to fit the aims of the review, as has previously been done to assess the methodological quality of concussion studies (e.g., McCrea et al., 2017). Four domains assessed risk of bias: patient selection, use and interpretation of the test, examination of psychometric properties, and patient flow (see supplementary material). All reviewers were authors who were trained on the tool using beta articles to ensure rating consistency and who met regularly to refine a shared understanding of the rating criteria. An overall rating for each included study was generated based on the instrument and confirmed by an independent second rater (with a third rater utilized as needed to break ties and reach consensus). Level of evidence was assessed using the Strength of Recommendation Taxonomy (SORT) scoring system (Ebell et al., 2004).

Results

A total of 10,590 combined search results were returned from Ovid, Web of Science, PsycINFO, Scopus, Cochrane, CINAHL, and ERIC. Deduplication revealed 5,409 unique results, which were screened by two authors (with screening by a third author needed for 355 results to achieve consensus). Figure 1 displays a standard PRISMA flowchart (Page et al., 2021) of inclusion and exclusion during screening and full-text review. Ultimately, 103 articles met inclusion criteria. Sample characteristics of included studies are summarized in Table 1.

Fig. 1 PRISMA flowchart of inclusions and exclusions

Table 1 Sample characteristics of the 103 studies meeting systematic review inclusion criteria

The Immediate Post-Concussion Assessment and Cognitive Test (ImPACT; Riverside Publishing) was the most widely evaluated tool, included in 65 of the 103 studies (63%). Thirteen studies evaluated a hybrid battery comprising both computerized cognitive and traditional neuropsychological tests. Psychometric data from all included studies are listed in the supplementary materials (Table S2).

Brief Summary, Risk of Bias, and Level of Evidence

Modified QUADAS-2 ratings for most studies (n = 76) indicated a moderate risk of bias. Common methodological limitations included use of convenience samples or missing patient flow diagrams (e.g., CONSORT), limited control for confounding factors, and unreported effect sizes. High risk of bias (n = 2) was due to protocol deviations (Echemendia et al., 2001, administration in hallways and buses; Maerlender & Molfese, 2015, large group administration with non-standardized instructions). See Table S2 for individual study ratings.

Test–retest reliability coefficients varied widely for both computerized tools (0.14 to 0.93) and standard neuropsychological tests (0.02 to 0.95). The level of evidence for reliability was rated SORT grade A (consistent good-quality evidence) for both computerized tools and standard neuropsychological tests. Sensitivity to acute concussion ranged from 45% to 93% for computerized tools and 18% to 80% for neuropsychological test batteries. Because estimates were inconsistent and at times below chance levels, the level of evidence for sensitivity was SORT grade B for both computerized tools and standard neuropsychological tests. Table 2 provides the range of test–retest reliability and sensitivity estimates by measure. Figures 2 and 3 depict these ranges with heat maps for computerized tools and standard neuropsychological tests, respectively.

Table 2 Range of reliability and sensitivity estimates for computerized cognitive tools and standard neuropsychological tests
Fig. 2 Heat map of test–retest reliability and sensitivity estimates for computerized cognitive tools.

Note. Bars represent ranges of test–retest reliability estimates by measure or, where noted, sensitivity (0%–100%) to acute concussion

Fig. 3 Heat map of test–retest reliability and sensitivity estimates for standard neuropsychological tests.

Note. Bars represent ranges of test–retest reliability estimates by measure or, where noted, sensitivity (0%–100%) to acute concussion

Reliability

Reliability coefficients such as Pearson r or intraclass correlation coefficients (ICCs) assess the reliability, or consistency, of test scores. Slick (2006) defines acceptable reliability as coefficients ≥ 0.70. Nunnally and Bernstein (1994, p. 265) define adequate reliability as 0.80, or a more stringent 0.90 if important decisions are to be based on specific test scores. Reliable change indices (RCIs; Chelune, 2003) are calculated from reliability coefficients and indicate whether a change between assessments is meaningful. Regression-based methods (RBM; McSweeny et al., 1993) use baseline data to calculate a predicted retest score against which observed change is evaluated. Dependent samples t-tests or repeated measures analysis of variance (ANOVA) evaluate practice effects with a significance test.
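
As a point of reference, the formulas below sketch these indices in their commonly published forms; this is a generic illustration, and exact formulations and notation vary somewhat across the cited sources.

% Standard error of measurement (SEM) and reliable change index (Jacobson-Truax form),
% with the practice-adjusted variant and the regression-based standardized residual.
\begin{align}
  SEM &= SD_{\text{baseline}}\sqrt{1 - r_{xx}}, \qquad SE_{\text{diff}} = \sqrt{2\,SEM^{2}} \\
  RCI &= \frac{X_{\text{retest}} - X_{\text{baseline}}}{SE_{\text{diff}}}, \qquad
  RCI_{\text{practice}} = \frac{(X_{\text{retest}} - X_{\text{baseline}}) - (\bar{X}_{\text{retest}} - \bar{X}_{\text{baseline}})}{SE_{\text{diff}}} \\
  % Regression-based method: observed retest score compared with the score predicted
  % from baseline, scaled by the standard error of the estimate.
  z_{\text{RBM}} &= \frac{X_{\text{retest}} - (b_{0} + b_{1}X_{\text{baseline}})}{SE_{\text{est}}}
\end{align}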

Test–Retest Reliability

Test–retest reliability was not consistently better for one testing modality over the other. Coefficients spanned a wide range for standard neuropsychological tests (0.02 to 0.95) and for computerized cognitive tools (0.14 to 0.93; see Table 2 and Figs. 2 and 3). Test–retest reliability was most often assessed for ImPACT, typically using annual preseason baseline testing, with shorter (six months, 0.35–0.86) and longer (four years, 0.29–0.69) intervals yielding estimates of similar span (Echemendia et al., 2016; Mason et al., 2020; Womble et al., 2016). Processing speed tasks had the strongest correlations for both modalities, although they were the most susceptible to practice effects at three-day retest (Register-Mihalik et al., 2012). When retested after one year or longer, estimates for verbal memory tasks were the weakest, except in one included study of athletes aged 10–12 years in which verbal memory performed best (Moser et al., 2017). This may reflect the developmental characteristics of this age group, for which a separate pediatric version of ImPACT has recently been developed.

Multiple attempts have been made to improve the test–retest reliability of ImPACT, including aggregating multiple baseline assessments (Bruce et al., 2016, 2017) and aggregating composite scores (memory and speed; Schatz & Maerlender, 2013). Both methods consistently improved reliability estimates (Brett et al., 2018; Bruce et al., 2017; Echemendia et al., 2016; Schatz & Maerlender, 2013). Overall, test–retest reliability was not improved by increasingly strict invalidity or exclusion criteria (Brett et al., 2016; Register-Mihalik et al., 2012) or within specific populations (i.e., athletes with learning disabilities or a history of headache or migraine treatment; Brett et al., 2018).

Other Forms of Reliability

Cronbach’s alpha, a measure of internal consistency, ranged from 0.64 to 0.84 for ImPACT (Gerrard et al., 2017; Higgins et al., 2018). McDonald’s omega, a less biased estimate of reliability that requires factor analysis, ranged from 0.52 to 0.72 for ImPACT (Gerrard et al., 2017) and 0.03 to 0.85 for the Automated Neuropsychological Assessment Metrics (ANAM; Glutting et al., 2020). Hinton-Bayre and Geffen (2005) examined alternate forms of standard neuropsychological tests and found moderate to strong equivalence for the Symbol Digit Modalities Test (SDMT) and the Wechsler Adult Intelligence Scale–Revised (WAIS-R) Digit Symbol subtest. Subtests from the abbreviated ImPACT Quick Test correlated variably (r = 0.14–0.40) with the complete ImPACT administration (Elbin et al., 2020). No other identified studies examined internal consistency or alternate-forms reliability.

Sensitivity and Specificity

Sensitivity and specificity evaluate how well test performance differentiates concussed and healthy athletes. Percentages represent the proportion of concussed (sensitivity) and healthy athletes (specificity) correctly identified. Thresholds for acceptable (≥ 80%) and marginal (60%–79%) sensitivity have previously been used to describe cognitive impairment (e.g., Blake et al., 2002; Weissberger et al., 2017).
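
Expressed as standard formulas (a generic definition rather than one drawn from any particular included study), where TP, FN, TN, and FP denote true positives, false negatives, true negatives, and false positives:

% Sensitivity and specificity as proportions of the concussed and healthy groups
% correctly classified by the test criterion.
\begin{align}
  \text{Sensitivity} &= \frac{TP}{TP + FN} \times 100\% \\
  \text{Specificity} &= \frac{TN}{TN + FP} \times 100\%
\end{align}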

Logistic regression, discriminant function analyses, RCI, and RBM can evaluate sensitivity and specificity. RCI and RBM are able to assess clinically meaningful changes in raw scores (Hinton-Bayre, 2015; Louey et al., 2014; Schatz & Robertshaw, 2014). Echemendia et al. (2012) found RBM to perform better than RCI methods (although similar to normative comparison), whereas examination by Erlanger et al. (2003a, b) and Merz et al. (2021) yielded comparable findings for RCI and RBM.

Sensitivity of computerized tools varied across studies and was often below acceptable limits (see Table 2 and Fig. 2). Sensitivity of 80% to 93% has been reported for ImPACT and CogState Ltd’s computerized concussion tool when administered to athletes within 72 h post-injury (Gardner et al., 2012; Louey et al., 2014; Schatz & Sandel, 2012; Van Kampen et al., 2006). However, similar studies evaluating reliable impairment in acutely concussed athletes observed only marginal sensitivity for ImPACT and CogState (Abbassi & Sirmon-Taylor, 2017; Broglio et al., 2007a; Czerniak et al., 2021; Nelson et al., 2016; Sufrinko et al., 2017). As athletes progressed from acute to subacute concussion, the percentage of athletes with reliable worsening on one or more subtests (23%–55%) fell below acceptable levels for ANAM, CogState, and ImPACT (Iverson et al., 2006; Nelson et al., 2016; Sufrinko et al., 2017).

Sensitivity for test batteries and standard neuropsychological measures also varied across studies (see Table 2 and Fig. 3). The Hopkins Verbal Learning Test (HVLT), Trail-Making Test (TMT), SDMT, Stroop, and Controlled Oral Word Association Test (COWAT) detected acute impairment at chance levels (McCrea et al., 2005), similar to subtest-level sensitivity for computerized tools (Louey et al., 2014; Nelson et al., 2016). Sensitivity to acute concussion reached 80% for a battery comprising the SDMT, WAIS-R Digit Symbol, and Speed of Comprehension Test (SOCT; Baddeley et al., as cited in Hinton-Bayre et al., 1999), although this finding was not replicated (sensitivity 18%–44%) in similar test batteries (Broglio et al., 2007b; Makdissi et al., 2010; McCrea et al., 2005).

Assessing Cognitive Effects of Acute Concussion

Significance testing (e.g., t-tests, ANOVA, logistic regression) can examine the sensitivity of cognitive performance to the effects of concussion. Athletes’ post-injury scores can be compared to those of healthy controls (between-subjects design) or to their own baseline (within-subjects design). Cohen’s d (the size of group differences) and η² (the amount of variance explained) are two widely used effect size statistics that quantify the magnitude of findings. Cohen (1988) defines d values of 0.2, 0.5, and 0.8 as small, medium, and large, respectively. Effect sizes can be calculated for ANOVA and regression, although comparison is difficult when models contain varying combinations of factors or predictors.
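
For reference, the conventional between-subjects forms of these effect sizes are shown below; this is a generic sketch, and within-subjects variants typically scale the mean change by the standard deviation of difference scores or baseline scores.

% Cohen's d for a between-subjects contrast using the pooled standard deviation,
% and eta-squared as the proportion of total variance attributable to the effect.
\begin{align}
  d &= \frac{\bar{X}_{\text{concussed}} - \bar{X}_{\text{control}}}{SD_{\text{pooled}}}, \qquad
  SD_{\text{pooled}} = \sqrt{\frac{(n_{1}-1)SD_{1}^{2} + (n_{2}-1)SD_{2}^{2}}{n_{1} + n_{2} - 2}} \\
  \eta^{2} &= \frac{SS_{\text{effect}}}{SS_{\text{total}}}
\end{align}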

Both testing modalities have consistently been shown to be sensitive to acute concussion (i.e., within 72 h postconcussion; Joseph et al., 2018), with effect sizes that are generally medium to large. Specifically, effect sizes were 0.54 to 0.86 for ANAM (within-subjects Cohen’s d), 0.80 to 1.03 for the Cambridge Neuropsychological Test Automated Battery (CANTAB; Cambridge Cognition, Ltd.; between- and within-subjects Cohen’s d), -0.88 to -0.18 for CogState (between-subjects effect size), up to 0.36 for the HeadMinder, Inc., Concussion Resolution Index (between- and within-subjects η²), and variable for ImPACT (between-subjects maximum η² = 0.99, within-subjects maximum d = 0.45, between-subjects maximum d = 0.95) (Abbassi & Sirmon-Taylor, 2017; Collins et al., 1999; Gardner et al., 2012; Iverson et al., 2003; Louey et al., 2014; Lovell et al., 2004; Pearce et al., 2015; Register-Mihalik et al., 2013; Schatz & Sandel, 2012; Schatz et al., 2006; Sosnoff et al., 2007). Standard neuropsychological tests including the COWAT, Digit Span, HVLT, O’Connor Finger Dexterity Test, Repeatable Battery for the Assessment of Neuropsychological Status, Stroop, SDMT, TMT, and Vigil Continuous Performance Test demonstrated large effect sizes (Cohen’s d = 0.80–1.03; η² = 0.91) (Echemendia et al., 2001; Moser & Schatz, 2002; Pearce et al., 2015). Acute evaluations using the COWAT, Digit Span, Grooved Pegboard, HVLT, Stroop, SDMT, and TMT were statistically significant, although effect sizes were not reported (Collins et al., 1999; Guskiewicz et al., 2001).

Validity

Convergent and Discriminant Validity

Convergent validity evaluates the correspondence between an assessment and other measures of a similar cognitive domain. Discriminant validity examines the relationship between assessments measuring differing domains. Most studies evaluated convergent and discriminant validity using correlation coefficients, which can be interpreted using Cohen’s (1988) guidelines for small (0.10), medium (0.30), and large (0.50) effects. Confirmatory factor analysis and multitrait-multimethod analysis have also been used and have the advantage of accounting for the measurement error inherent in single observed test scores (Floyd & Widaman, 1995; Strauss & Smith, 2009).

Convergent and discriminant validity were examined most frequently for ImPACT. Maerlender et al. (2010) found ImPACT correlated with the California Verbal Learning Test-2 (ImPACT Verbal Memory r = 0.40), Brief Visuospatial Memory Test–Revised (BVMT-R) (ImPACT Visual Memory r = 0.59), Conners’ Continuous Performance Test (ImPACT Reaction Time r = -0.39), Delis-Kaplan Executive Function System subtests (ImPACT Visual Motor r = 0.41), and Gronwall’s Paced Auditory Serial Addition Test (ImPACT Visual Motor r = 0.39, Reaction Time r = -0.31). SDMT correlations were stronger for ImPACT speeded subtests (r = -0.60 and 0.70) and weaker for ImPACT memory subtests (r = 0.37 and 0.46), providing support for convergent and discriminant validity (Iverson et al., 2005).

Studies using C3 Logix and CogState’s speeded subtests have reported correlations ranging from small to large with standard neuropsychological tests of processing speed. For C3 Logix, correlations with the TMT and SDMT ranged from 0.10 to 0.78 (Simon et al., 2017). CogState subtests correlated with the TMT and WAIS-R Digit Symbol (strongest r = 0.48) (Makdissi et al., 2001).

Medium to large correlations were observed among standard neuropsychological tests assessing similar cognitive domains. Valovich McLeod et al. (2006) reported correlations up to 0.92 among Buschke Selective Reminding Test scores and speeded tasks (TMT B, Wechsler Intelligence Scale for Children–Third Edition processing speed subtests). Similarly, Hinton-Bayre et al. (1997) found correlations among speeded tasks (WAIS-R Digit Symbol, SDMT, SOCT) ranging from 0.44 to 0.77.

Factor analysis generally supports factor validity (i.e., that performance reflects the aspect of cognition theoretically underlying the test domain). ImPACT memory and processing speed factors have been identified in samples of healthy and concussed athletes using exploratory (Gerrard et al., 2017; Iverson et al., 2005; Thoma et al., 2018) and confirmatory factor analysis (Maietta et al., 2021; Masterson et al., 2019; Schatz & Maerlender, 2013), as well as multitrait-multimethod analysis with traditional paper-and-pencil assessments (Thoma et al., 2018). Factors of memory and executive functioning have emerged within baseline paper-and-pencil assessments comprising the HVLT–Revised (HVLT-R), BVMT-R, TMT, and COWAT using exploratory factor analysis (Lovell & Solomon, 2011). Czerniak et al. (2021) identified support for a hierarchical factor structure comprising ANAM’s general composite and two lower-level factors. Unlike other computerized tools, C3 Logix did not conform to a two-factor model (Masterson et al., 2019).

Other Forms of Validity

Correlation of cognitive test scores with concussion symptom duration supports predictive validity. For example, ImPACT and Concussion Resolution Index scores in acutely concussed student athletes were found to predict recovery time (Erlanger et al., 2003a, b; Lau et al., 2011, 2012). Worse acute post-injury memory performance has also been associated with subjective symptom reports (Broglio et al., 2009; Erlanger et al., 2003a, b; Lovell et al., 2003).

Discussion

We conducted a systematic review of the psychometric properties of computerized cognitive tools and conventional neuropsychological tests for the assessment of concussion in athletes. There were 103 studies published between 1996 and 2021 that met the inclusion criteria. ImPACT was the most robustly evaluated test. HVLT-R, TMT, and SDMT were most common in standard neuropsychological batteries. Only 13 studies employed hybrid test batteries using both modalities.

Consistent with prior reviews (Farnsworth et al., 2017; Randolph et al., 2005; Resch et al., 2013a, b), reliability coefficients were highly variable for both computerized tools (0.14 to 0.93; see Table 2 and Fig. 2) and conventional neuropsychological tests (0.02 to 0.95; see Table 2 and Fig. 3). Reliability coefficients were generally stronger for speeded tasks. Aggregating multiple baseline assessments and composite scores (Brett et al., 2018; Bruce et al., 2017; Echemendia et al., 2016; Schatz & Maerlender, 2013), as well as detecting suboptimal effort (Walton et al., 2018), can improve reliability estimates in athletes.

Test–retest reliability for two widely used conventional neuropsychological tests was weaker in athletes (HVLT r = 0.49–0.68, TMT r = 0.41–0.79) (Barr, 2003; Register-Mihalik et al., 2012; Valovich McLeod et al., 2006) than estimates derived from non-athlete samples (HVLT 0.66–0.74, TMT 0.79–0.89) (Benedict et al., 1998; Dikmen et al., 1999). Homogeneity in athlete samples may place a limit on reliability estimates through restricted score variance. Further, the cognitive functions typically assessed in concussion (e.g., psychomotor processing speed, attention, verbal and nonverbal memory) continue to develop throughout adolescence (Carson et al., 2016; Casey et al., 2005; Giedd et al., 1999) and may impact reliability for long test–retest intervals.

Sensitivity varied widely and was at times below chance levels, as depicted in Figs. 2 and 3. When administered within 72 h post-injury, ImPACT and CogState reached acceptable sensitivity (80% or better). However, in an equal number of acute samples, only marginal sensitivity was observed for ImPACT and CogState. For standard neuropsychological test batteries, sensitivity was difficult to compare across studies given the differing numbers of measures and criteria for impairment used. The method used to classify impairment appears to modify sensitivity (Echemendia et al., 2012; Schatz & Robertshaw, 2014). RCI and RBM appear superior to normative comparison for calculating sensitivity (Brett et al., 2016; Maerlender & Molfese, 2015; Moser et al., 2017; O'Brien et al., 2018; Schatz, 2010).

Further study of the psychometric properties of cognitive assessments should include quantitative synthesis. We did not pool systematic review results beyond the provided table and figures given the wide range of measures and statistical methods included (e.g., variable test–retest time frames). While this approach was broadly inclusive, it limited more specific conclusions. Future meta-analysis may be possible for tests used across numerous studies (e.g., ImPACT, HVLT-R, TMT, and SDMT).

Continued Gaps in the Literature and Assessment Practices

Attempts to improve the psychometric properties of tests were sparse within the literature and limited to a single test (i.e., aggregating ImPACT baseline and composite scores). Current cognitive assessment tools could leverage modern technologies and procedures such as telehealth, machine learning, and virtual reality to improve ecological validity (Marcopulos & Lojek, 2019; Parsons, 2011). Structural modeling techniques have recently been used to improve a sport concussion symptom inventory (Brett et al., 2020; Wilmoth et al., 2020). Alternate forms are a known method of reducing practice effects in both traditional neuropsychological and computerized cognitive assessments, and nonequivalence of forms degrades test–retest reliability (Echemendia et al., 2009; Resch et al., 2013a, b).

Though not a pre-specified objective (our search excluded non-English tests), we informally observed during the review that few studies considered sociocultural and ethnoracial differences in their methodologies. Preliminary findings indicate differences on baseline testing related to sociocultural factors such as maternal education and racial and linguistic diversity (Houck et al., 2018; Maerlender et al., 2020; Wallace et al., 2020), which has important implications for the use of normative data and clinical interpretation. Given the current paucity of data, more research is needed to understand how intersectionality influences cognitive test performance pre- and post-injury in athletes.

Further research is needed to better understand which aspects of test psychometrics (e.g., test–retest reliability) are affected by cognitive development in adolescent and college-age athletes (Brett et al., 2016). Assessment performance in older athletes also warrants further examination. Similar reviews of the non-sport concussion literature in civilian and military populations are needed before these findings can be generalized. Although no reliable change in performance from baseline has been observed following a recent concussion (Lynall et al., 2016), more studies are needed to examine this possibility in athletes who have sustained multiple prior concussions. Future studies should screen for invalid performance, which is common in this population (Abeare et al., 2018), and examine the relationship between suboptimal effort and psychometric properties (Brett & Solomon, 2017).

Best Practice Recommendations

Ultimately, test results should be interpreted only by a healthcare provider with expertise in assessment and knowledge of psychometric principles (Bauer et al., 2012). Computerized neurocognitive tests do not eliminate the need for a clinical expert, or at minimum an appropriately trained clinician, to evaluate and interpret test results (Covassin et al., 2009; Echemendia et al., 2009; Meehan et al., 2012; Moser et al., 2015). Demographic and sociocultural factors as well as prior medical and psychiatric history have been observed to influence baseline scores (Cook et al., 2017; Cottle et al., 2017; Gardner et al., 2017; Houck et al., 2018; Wallace et al., 2020; Zuckerman et al., 2013), while pre-injury cognitive status (Merz et al., 2021; Schatz & Robertshaw, 2014), history of attention-deficit/hyperactivity disorder (Gardner et al., 2017), and concussion history (Covassin et al., 2013) have been shown to influence post-injury test results. Sport clinicians should utilize the most appropriate normative reference data available (Mitrushina et al., 2005). Normative data are becoming increasingly available for athletes (e.g., normative data for NFL players; Solomon et al., 2015). Whether using normative data or an individual’s own pre-injury baseline performance, measurement error and multivariate base rates of impairment should be considered in interpretation (Nelson, 2015).

Conclusions

There remains no clear psychometric evidence to support one testing modality over the other in the evaluation of sport concussion. Test–retest reliability for speeded tasks was generally stronger overall, although susceptible to variability over time. Sensitivity to acute concussion was greatest for ImPACT, CogState, SDMT, SOCT, and WAIS-R Digit Symbol. A hybrid model combining both modalities may counterbalance the limitations of each and optimize accuracy and efficiency. To this end, more robust formal study of psychometric properties should be pursued (Echemendia et al., 2020). It is critical for sport clinicians to have expertise in interpreting test results and to implement a multidimensional assessment that synthesizes an individual’s performance within the context of their history and situational factors. Future test revisions should capitalize on advances in psychometric and analytic approaches.