Key Points

No computer-based neurocognitive test outperforms the others

No computer-based neurocognitive test has a sensitivity and specificity necessary for clinical utility as a standalone measure

A multi-dimensional concussion assessment is recommended for patient care

1 Background

Mild traumatic brain injury (mTBI), or concussion, is the most common form of traumatic brain injury and has become a significant epidemiologic phenomenon [12]. It is estimated that 1.6–3.8 million concussions occur in sports and recreational activities annually [29]. Concussions are typically associated with increased symptom reporting and declines in neurocognitive functioning and balance [35, 41, 44, 55]. A multi-dimensional approach to concussion assessment that measures change in each of these domains is critical to concussion management protocols [36].

Neuropsychological testing has been identified as a key component of the concussion assessment protocol, and it plays a crucial role in concussion management programs at all levels of sport [3, 19]. In the athletic environment, computer-based testing is commonly implemented to establish a pre-injury baseline of neurocognitive functioning and to measure potential neurocognitive change post-injury. Compared with traditional neuropsychological testing, computer-based testing may be advantageous for multiple reasons, including administration using auditory and visual modalities, the ability to test individuals or large groups simultaneously, and the immediate availability of results for review [42]. Despite these advantages, computer-based neurocognitive testing has several drawbacks; for example, there are ongoing concerns about its psychometric reliability and validity [4, 48, 53]. However, the psychometric reliability and validity of pencil-and-paper neurocognitive tests in concussion assessment have also been questioned [43].

There exist many computer-based tests for the assessment of sport-related concussion (SRC), and the most popular include Immediate Post-Concussion Assessment and Cognitive Testing (ImPACT) [32], CogState Computerized Assessment Tool (CCAT) [1], Automated Neuropsychological Assessment Metrics (ANAM®) [45], and CNS Vital Signs [23]. ImPACT is by far the most widely implemented as it is used by 83.5% of athletic trainers [30]. Yet, CCAT, ANAM, and CNS Vital Signs remain prevalent [38].

The sensitivity and specificity of these tools have been evaluated previously with mixed findings (see [42] for a review). In one study, ImPACT was reported to possess sufficient sensitivity (91.4%) to detect post-SRC neurocognitive impairment but lower specificity (69.1%) [51]. A separate group, however, reported much lower sensitivity for ImPACT (62.5%) [7]. CCAT has been reported to have both high sensitivity and specificity for the detection of neurocognitive impairment after SRC [31]. Conversely, Maruff et al. [34] reported that CCAT was sufficiently sensitive to distinguish healthy adults from patient samples, but it was not sufficiently specific to distinguish groups of patients with mTBI. In a study investigating the sensitivity and specificity of ANAM modules in detecting SRC, sensitivity was low (< 1–6.6% at 95% CI) and specificity was high (94–100% at 95% CI) [46]. A study comparing all three tests, ImPACT, CCAT, and ANAM, reported sensitivities of 67.8%, 60.3%, and 47.6%, respectively [39]. Finally, CNS Vital Signs has been shown to adequately discriminate between various non-SRC clinical groups, but its ability to do so in SRC samples is unknown (e.g., Ref. [9]).

As computer-based neurocognitive tests are widespread throughout SRC management programs, it is essential for medical teams to employ the most clinically useful measures to ensure appropriate patient care following SRC. However, it is currently unknown which computer-based neurocognitive test battery is optimal for the clinical care of SRC. Extant research on the sensitivity and specificity of these assessment tools is often underpowered and/or does not examine these constructs in SRC samples specifically. Thus, the aim of this investigation was to evaluate, on a large scale, the accuracy of computer-based neurocognitive tests commonly implemented for SRC evaluations and to provide guidance on their clinical utility.

2 Methods

2.1 Study Participants

Study participants consisted of individuals from 30 National Collegiate Athletic Association (NCAA) military service academies and civilian universities who served as cadets or participated in an NCAA sport during the 2014–2018 academic years (29 of the institutions currently provide data). Data for this study were provided by the Concussion Assessment, Research, and Education (CARE) Consortium [8]. Specifically, the CARE dataset contained 47,397 pre-season baseline (baseline) examinations and 2752 examinations performed 24–48 h (24–48 h) post-concussion. The 24–48 h examination was completed if a study participant was diagnosed as concussed by the local medical team at their institution using a standardized injury definition [11]. Individuals self-reporting a diagnosis of attention-deficit/hyperactivity disorder (ADD/ADHD) or a learning disability (LD) were excluded from the analysis (n = 1060). Excluding these individuals is supported by Elbin et al. [20], who found that patients self-reporting ADD/ADHD and/or LD performed significantly worse on the components of the baseline ImPACT neurocognitive test. However, future research can extend the analysis presented in this manuscript to the subgroup of athletes in the CARE dataset with ADD/ADHD, given the high prevalence of these individuals within the collegiate athlete/cadet population. All individuals provided written informed consent approved by the local institution and the US Army Human Research Protection Office. All computations were completed using the software R, Version 3.5.1 (R Foundation for Statistical Computing, Vienna, Austria).

2.2 Measurements

To evaluate both sensitivity (proportion of the sample correctly identified as having sustained a SRC) and specificity (proportion of the sample correctly identified as not having sustained a SRC), two groups were identified from the data. The first group consisted of individuals who completed a baseline test and were later diagnosed with a concussion, resulting in a 24–48 h test within the same academic year (i.e., baseline/24–48 h). The second group consisted of individuals who completed a baseline test in two consecutive academic years (i.e., baseline/baseline). In this group, individuals were removed if they experienced a concussion during the 1-year gap separating baseline tests. Each study group was then stratified by test type: ImPACT, CNS Vital Signs, and CCAT. Baseline tests were screened for validity using embedded metrics and were excluded from the analysis if declared invalid (n = 340).

Change scores were then calculated for each component within each test type. A change score in the baseline/24–48 h group was calculated by subtracting the baseline score from the 24–48 h score. Similarly, a change score in the baseline/baseline group was calculated by subtracting the first baseline score from the second baseline score. Therefore, a negative change score indicated that an individual scored lower on the second test, whereas a positive change score indicated that an individual scored higher on the second test. Because the aim of this study was to analyze the neurocognitive components of the tests, symptom scores were not considered in this analysis.
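For illustration, the change-score construction described above can be sketched in R (the analysis software noted in Sect. 2.1); the data frame paired_tests and its columns baseline_score and followup_score are hypothetical names, not the CARE dataset's variable names.

# Minimal sketch of the change-score construction; data frame and column names are hypothetical.
library(dplyr)

change_scores <- paired_tests %>%                    # one row per participant and component
  mutate(change = followup_score - baseline_score)
# In the baseline/24-48 h group, followup_score is the 24-48 h score; in the
# baseline/baseline group, it is the second baseline score. A negative change means a
# lower score on the second test (for the ImPACT Reaction Time Composite, a negative
# change instead reflects a faster, i.e., better, response; see Sect. 2.2.1).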

2.2.1 Immediate Post-concussion Assessment and Cognitive Test (ImPACT)

The ImPACT test is a “computerized neurocognitive test battery that is used to assess Sequencing/Attention, Word Memory, Visual Memory, and Reaction” [27]. The four components of interest for this study were the Verbal Memory, Visual Memory, Visual Motor Speed, and Reaction Time Composites. Change scores were calculated for each. In general, higher scores indicated “better” performance with the exception of the Reaction Time Composite score, where lower/faster scores were better. Therefore, a negative change score represented improved performance for the Reaction Time Composite.

2.2.2 CNS Vital Signs (CNS)

The CNS Vital Signs computer test is a “clinical testing procedure used by clinicians to evaluate and manage the neurocognitive state of a patient” [14]. The eleven components of interest for this study were the Simple Attention Percentile, Composite Memory, Verbal Memory, Visual Memory, Psychomotor Speed, Reaction Time, Complex Attention, Cognitive Flexibility, Processing Speed, Executive Function, and Motor Speed Standard Scores. Change scores were calculated for each of these components and a positive change score represented improved performance for all measurements of interest.

2.2.3 Cogstate Computerized Cognitive Assessment Tool (CCAT)

The CCAT is a computer test that uses “psychological techniques to record learning, memory, processing speed and accuracy” [1]. Change scores for the Composite Processing Speed, Composite Attention, Composite Learning, and Working Memory Speed components were calculated and a positive change score represented improved performance for all measurements of interest.

2.2.4 Missing Data

Within the study, each participant in the ImPACT, CNS Vital Signs, and CCAT study groups required four, eleven, and four change scores, respectively. If any of these change scores were missing, the study participant was removed from the appropriate study group [ImPACT: baseline/24–48 h = 3, baseline/baseline = 5; CNS Vital Signs: baseline/24–48 h = 7, baseline/baseline = 56; CCAT: baseline/24–48 h = 1, baseline/baseline = 3]. In addition, this study initially sought to analyze the ANAM test, but insufficient data precluded doing so (baseline/24–48 h = 62, baseline/baseline = 3).

2.3 Data Analysis

Demographic variables (age, gender, race, height, weight, number of previous concussions) were compared between the baseline/24–48 h and baseline/baseline groups using the non-parametric Mann–Whitney U test for continuous/ordinal variables and the Chi-Squared test for categorical variables. A significance level of \(\alpha =0.05\) was used. If significant differences were found, a one-sided Mann–Whitney U test was used to determine the directionality for continuous and ordinal variables. For categorical variables, the directionality was determined using the contingency tables from the Chi-Squared test. Sensitivity and specificity for the ImPACT, CNS Vital Signs, and CCAT were evaluated using the Normative Change method (Normative) and the Reliable Change Index method (RCI).
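As a sketch under assumed variable names, the group comparisons described above correspond to base R tests as follows; demo, age, gender, and group are hypothetical names for a demographics data frame, its variables, and a two-level group factor (baseline/24–48 h vs. baseline/baseline).

# Two-sided Mann-Whitney U test for a continuous/ordinal variable
wilcox.test(age ~ group, data = demo)

# Chi-squared test for a categorical variable
chisq.test(table(demo$gender, demo$group))

# If the two-sided test is significant, a one-sided Mann-Whitney U test
# indicates the directionality
wilcox.test(age ~ group, data = demo, alternative = "less")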

2.3.1 Normative Change Method

To evaluate change on single or multiple components of the ImPACT, CNS Vital Signs, and CCAT tests, the Normative method used normative change scores developed from CARE Consortium data captured from 2014 to 2017 [6]. With this method, a study participant’s change score for a specific component was considered “failed” if the change score fell outside a given normative change confidence interval. Consistent with previous work, we evaluated the 75, 87.5, 90, 92.5, 95, 97.5, and 99 percent one-sided confidence intervals [6]. The overall classification of the study participant was determined by the Number of Components Failed (NCF) for a specific assessment. NCF represents how many neurocognitive test components a study participant would need to fail (with respect to the normative change confidence interval) for that study participant to be classified as concussed. For example, an NCF of two would mean the study participant’s change scores would need to exceed the confidence interval for two or more components to be classified as concussed.
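To make the NCF rule concrete, the following sketch shows one way the per-participant classification could be coded; classify_concussed is a hypothetical helper, the illustrative cut-offs mirror the ImPACT Normative 75% values quoted in Sect. 2.4 [6], and worse_if encodes the direction in which a change score indicates decline (lower scores are worse for all ImPACT composites except Reaction Time).

# Hypothetical helper implementing the NCF decision rule: a participant is classified
# as concussed if at least `ncf` components fall beyond their normative cut-offs.
classify_concussed <- function(change, cutoff, worse_if, ncf) {
  failed <- ifelse(worse_if == "below", change < cutoff, change > cutoff)
  sum(failed) >= ncf                # TRUE = classified as concussed
}

# Illustrative participant with NCF = 2: verbal memory and visual motor speed both
# exceed their cut-offs, so the participant is classified as concussed (returns TRUE).
classify_concussed(change   = c(-7.0, -1.0, -3.0, 0.02),
                   cutoff   = c(-5.0, -6.0, -2.1, 0.04),
                   worse_if = c("below", "below", "below", "above"),
                   ncf      = 2)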

2.3.2 Reliable Change Index

The RCI method is defined for ImPACT, but not for the CNS Vital Signs or CCAT tests. Therefore, this method was only used to analyze the ImPACT baseline/24–48 h and ImPACT baseline/baseline groups. With the RCI method, a meaningful change on a given component was noted if the study participant’s change score fell outside a given reliable change confidence interval. RCI calculations were performed as provided in the ImPACT Administration and Interpretation Manual [27]. To be consistent with the Normative method, the 75, 87.5, 90, 92.5, 95, 97.5, and 99 percent one-sided confidence intervals were used. As with the Normative method, the NCF determined whether a study participant was classified as concussed.
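In general terms, a reliable change index scales a change score by the standard error of the difference derived from a component’s test–retest reliability: \(\mathrm{RCI} = (X_{2} - X_{1})/S_{\mathrm{diff}}\), where \(S_{\mathrm{diff}} = \sqrt{\mathrm{SEM}_{1}^{2} + \mathrm{SEM}_{2}^{2}}\) and \(\mathrm{SEM} = \mathrm{SD}\sqrt{1 - r}\), with \(r\) the test–retest reliability. This is a generic description rather than a quotation from the manual; the specific values and confidence bounds applied in this study were those provided in the ImPACT Administration and Interpretation Manual [27].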

2.3.3 Four Models Studied

Using the Normative method and the RCI method, this study analyzed the performance of four models: ImPACT Normative, CNS Vital Signs Normative, CCAT Normative, and ImPACT RCI. Each of these models was analyzed across different confidence intervals and NCF values.

2.4 Test Performance Measures

The purpose of this study was two-fold: first, to characterize the ability of each neurocognitive test to discriminate between concussed and healthy participants by varying (1) the cut-off point (as determined by one-sided confidence intervals) used to differentiate between normal and abnormal results and (2) the number of “failed” components (as determined by NCF) used to indicate an abnormal test result; and second, to compare the accuracy of the ImPACT, CNS Vital Signs, and CCAT tests to provide clinical care guidance.

To achieve the aims of this study, the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) were determined. These values determined the sensitivity and specificity in addition to other measurements discussed later. These four values (i.e., TP, TN, FP, FN) were calculated for each model, each confidence interval, and each NCF value. For example, consider ImPACT Normative with a 75% confidence interval and NCF = 2. The change scores representing significant change are −5, −6, −2.1, and 0.04 for verbal memory, visual memory, visual motor speed, and reaction time, respectively [6]. If a patient in the baseline/24–48 h group had two or more change scores exceeding these values, then the patient was classified as concussed and represented a TP. Otherwise, the baseline/24–48 h patient was classified as not concussed and represented a FN. If a patient in the baseline/baseline group had two or more change scores exceeding these values, then the patient was classified as concussed and represented a FP. Otherwise, the baseline/baseline patient was classified as not concussed and represented a TN.
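Given per-participant classifications in each group, the confusion-matrix counts and the resulting sensitivity and specificity follow directly; the sketch below uses hypothetical logical vectors injured_classified and control_classified (TRUE = classified as concussed) for the baseline/24–48 h and baseline/baseline groups, respectively.

# Confusion-matrix tally from per-participant classifications (hypothetical vectors).
TP <- sum(injured_classified)      # baseline/24-48 h participants classified concussed
FN <- sum(!injured_classified)     # baseline/24-48 h participants classified not concussed
FP <- sum(control_classified)      # baseline/baseline participants classified concussed
TN <- sum(!control_classified)     # baseline/baseline participants classified not concussed

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)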

When determining the confidence interval and NCF that maximized a neurocognitive test’s performance, the objective of this study was to maximize the 2:1 weighted sum of sensitivity and specificity (i.e., \(\frac{2}{3}\) × Sensitivity + \(\frac{1}{3}\) × Specificity) while requiring both sensitivity and specificity to be at least 0.5. The 2:1 weighted sum was maximized because this study focused on the 24–48 h time point, when medical professionals emphasize sensitivity over specificity. The 2:1 weight is consistent with other concussion literature such as Broglio et al. [6]. Both sensitivity and specificity were required to be at least 0.5 because if either value is less than 0.5, random decision making would be more accurate than test administration. Overall, the goal was to provide medical professionals with a comprehensive view of test performance.
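A sketch of this model-selection step, assuming a hypothetical data frame results with one row per confidence interval and NCF combination and columns sensitivity and specificity:

# Select the best-performing model: maximize the 2:1 weighted sum of sensitivity and
# specificity, subject to both being at least 0.5.
weighted_sum <- function(sens, spec) (2 / 3) * sens + (1 / 3) * spec

candidates <- subset(results, sensitivity >= 0.5 & specificity >= 0.5)
best <- candidates[which.max(weighted_sum(candidates$sensitivity,
                                          candidates$specificity)), ]
best   # best confidence interval / NCF combination for this test type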

To provide additional measurements to characterize and compare each model’s accuracy, the positive predictive value (PPV) [28], negative predictive value (NPV) [28], and F1 score [2, 33] were also computed. PPV (i.e., TP/[TP + FP]) and NPV (i.e., TN/[TN + FN]) are included to complement the sensitivity and specificity results. PPV and NPV depend on the prevalence of concussed patients in each model, so the reader is cautioned when comparing the PPV and NPV values between test types (e.g., comparing CNS Vital Signs to CCAT). The F1 score balances sensitivity and the PPV. The F1 scores are presented but were not used in the analysis because the F1 score lacks an intuitive interpretation [2]. Finally, to assess the overall performance of each test type based on NCF, receiver operating characteristic (ROC) curves were plotted and the area under the ROC curve (AUC) was calculated using a linear approximation. AUC represents how well the model performs in terms of sensitivity and specificity: a higher AUC corresponds to overall higher sensitivity and specificity. This study classified the AUC values as follows: 0.5–0.59 (bad), 0.6–0.69 (poor), 0.7–0.79 (fair), 0.8–0.89 (good), 0.9–1.0 (excellent).
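Continuing from the counts sketched above, the additional accuracy measures and a linear (trapezoidal) AUC approximation could be computed as follows; roc_points is a hypothetical data frame of ROC coordinates (fpr = 1 − specificity, tpr = sensitivity), one row per confidence-interval cut-off, ordered by increasing fpr and anchored at (0, 0) and (1, 1).

ppv <- TP / (TP + FP)                                 # positive predictive value
npv <- TN / (TN + FN)                                 # negative predictive value
f1  <- 2 * (ppv * sensitivity) / (ppv + sensitivity)  # F1 score

# Trapezoidal (linear) approximation of the area under the ROC curve.
auc_trapezoid <- function(fpr, tpr) {
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
auc <- auc_trapezoid(roc_points$fpr, roc_points$tpr)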

Finally, a bootstrap-based hypothesis test for paired samples was used to compare the AUC between two NCF variations of a test type. A bootstrap-based hypothesis test for unpaired samples was used to compare the AUC between two test types. The null hypothesis in this test is that both variations have an equal AUC. We rejected this null hypothesis at a significance level of \(\alpha =0.05\). This hypothesis testing was performed using the pROC package in the software R [49].
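A sketch of this comparison with the pROC package [49]; the group label is_concussed and the per-participant predictor scores (e.g., number of components failed under each variation) are hypothetical names used only for illustration.

library(pROC)

roc_a <- roc(response = is_concussed, predictor = ncf_score_a)
roc_b <- roc(response = is_concussed, predictor = ncf_score_b)

# Paired bootstrap test of equal AUC (same participants scored under two variations
# of the same test type); comparisons between different test types would instead use
# paired = FALSE.
roc.test(roc_a, roc_b, method = "bootstrap", paired = TRUE)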

2.5 Change Scores vs. Raw Scores

The objective of this study was to determine the sensitivity and specificity of common neurocognitive tests, primarily through change scores. However, there has been much discussion in the concussion literature regarding the need for baseline testing. Some argue baseline testing is important [50], whereas others indicate there is a small or negligible benefit in using athlete-specific baseline values over normative values [13, 17, 18, 21, 22, 40, 52]. For thoroughness, this study also analyzed the three neurocognitive tests (i.e., ImPACT, CNS Vital Signs, and CCAT) with raw scores to assess the value of baseline testing.

In particular, the Year 1 normative mean and standard deviation raw scores published in Ref. [6] were used to construct the 75, 87.5, 90, 92.5, 97.5, and 99 percent one-sided confidence intervals for each test component of interest. It was appropriate to construct these confidence intervals under a normality assumption because all values published in Ref. [6] had sufficiently large sample sizes (> 50), supporting the application of the Central Limit Theorem [54]. Then, the 24–48 h raw scores from the baseline/24–48 h group were used to calculate sensitivity, and the first baseline raw scores from the baseline/baseline group were used to calculate specificity. The first baseline raw scores from the baseline/baseline group, rather than the baseline raw scores from the baseline/24–48 h group, were chosen for the specificity analysis to maintain consistency with the change score sample sizes. Similar to the change score method, the sensitivity, specificity, and other diagnostic measurements were calculated for each test type, each confidence interval, and each NCF. The best-performing model for each test type (i.e., confidence interval and NCF combination) was again determined by maximizing the 2:1 weighted sum of sensitivity and specificity while requiring both to be at least 0.5. Finally, a bootstrap-based hypothesis test for paired samples was used to compare the AUC between the best-performing change score model and the best-performing raw score model for each test type.
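A sketch of the raw-score cut-off construction under the stated normality assumption, treating the cut-offs as one-sided percentile bounds of the normative score distribution; norm_mean and norm_sd are placeholder values rather than the published normative figures [6].

conf_levels <- c(0.75, 0.875, 0.90, 0.925, 0.975, 0.99)
norm_mean <- 40   # hypothetical normative mean for one component
norm_sd   <- 8    # hypothetical normative standard deviation

# "Higher is better" components: flag a 24-48 h raw score falling below the lower bound.
lower_cutoffs <- qnorm(1 - conf_levels, mean = norm_mean, sd = norm_sd)

# "Lower is better" components (e.g., ImPACT Reaction Time): use the upper bound instead.
upper_cutoffs <- qnorm(conf_levels, mean = norm_mean, sd = norm_sd)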

3 Results

Demographic information of the study cohort is presented in Table 1. For the baseline/24–48 h group, the mean ± standard deviation (SD) time between tests was 143.01 ± 93.75 days (ImPACT), 148.56 ± 131.26 days (CNS Vital Signs), and 121.71 ± 80.08 days (CCAT). For the baseline/baseline group, the mean ± SD time between tests was 347.84 ± 73.63 days (ImPACT), 321.95 ± 85.03 days (CNS Vital Signs), and 357.39 ± 49.86 days (CCAT).

Table 1 Characteristics of study data by test group

The ImPACT baseline/24–48 h and baseline/baseline groups differed in gender (p < 0.001), race (p < 0.001), and number of previous concussions (p < 0.001), but not age (p = 0.26). The CNS Vital Signs baseline/24–48 h and baseline/baseline groups showed differences in age (p < 0.01), race (p < 0.05), and number of previous concussions (p < 0.05), but no difference in gender (p = 0.23). The CCAT baseline/24–48 h and baseline/baseline groups differed in age (p < 0.01) and number of previous concussions (p < 0.001), but not gender (p = 0.34) nor race (p = 0.67). Additional hypothesis test results for the demographic information including the directionality of significant differences can be found in Table 1. For clarification, these demographic differences were not accounted for in the analysis because the aim of this study was to analyze the overall performance of patients. Change score statistics for the ImPACT, CNS Vital Signs, and CCAT tests for the baseline/24–48 h and baseline/baseline groups are displayed in Table 2, along with the percentage of study participants who improved, declined, or did not change with respect to a specific test type and component.

Table 2 Characteristics of individuals who improved, declined, or did not change for different groups and test types

3.1 Sensitivity, Specificity, PPV, NPV, and F1 Score

The sensitivity, specificity, PPV, NPV, and F1 score were calculated for each model (ImPACT Normative, CNS Vital Signs Normative, CCAT Normative, and ImPACT RCI), for each one-sided confidence interval (75, 87.5, 90, 92.5, 95, 97.5, and 99 percent), and for each NCF value (1–4 for ImPACT/CCAT and 1–11 for CNS Vital Signs). Further, the 2:1 weighted sum of sensitivity and specificity was calculated to determine the best-performing model (i.e., the maximum 2:1 weighted sum of sensitivity and specificity while having a sensitivity and specificity of at least 0.5).

ImPACT performed best when using the Normative method with an 87.5%-confidence interval and 1 NCF (sensitivity = 0.583, specificity = 0.625, F1 = 0.308). CNS Vital Signs performed best when using the Normative method with a 90%-confidence interval and 1 NCF (sensitivity = 0.587, specificity = 0.532, F1 = 0.314). CCAT performed best using the Normative method with a 75%-confidence interval and 2 NCF (sensitivity = 0.513, specificity = 0.715, F1 = 0.290). Finally, the ImPACT RCI method performed best with an 87.5%-confidence interval and 1 NCF (sensitivity = 0.626, specificity = 0.559, F1 = 0.297). Table 3 provides a collective summary of the best-performing models. All sensitivity, specificity, PPV, NPV, and F1 results for the ImPACT Normative, CNS Vital Signs Normative, CCAT Normative, and ImPACT RCI models can be found in ESM1, ESM2, ESM3, and ESM4 of the Online Resources, respectively. Because the best-performing model was defined by the 2:1 weighted sum of sensitivity and specificity (with both at least 0.5), we present the full spectrum of performance measures across all neurocognitive tests in the Online Resources so medical professionals can choose the performance measures that best suit their needs. This study focused on the 24–48 h time point, but future work can consider which measurement values should be considered at each time point in the concussion recovery process.

Table 3 Best-performing models for change score analysis

3.2 ROC Curves and AUC

The ROC curves for each model can be found in Fig. 1. Specifically, the 1 NCF curve for CNS Vital Signs Normative dominates all other CNS Vital Signs Normative curves (AUC = 0.610). Similarly, the 2 NCF curve for CCAT Normative dominates all other CCAT Normative curves (AUC = 0.640). For the other two models, ImPACT Normative and ImPACT RCI, there are trade-offs between the NCF = 1 and NCF = 2 curves depending on the confidence interval employed. Overall, ImPACT Normative performed best with 2 NCF (AUC = 0.638) and ImPACT RCI performed best with 2 NCF (AUC = 0.632). It should be noted that the best-performing ImPACT NCF in terms of AUC is not consistent with the best-performing confidence interval and NCF defined earlier for either the Normative or the RCI method. However, AUC weights sensitivity and specificity equally, whereas the best-performing model gives more weight to sensitivity. When comparing the best-performing models to one another, the bootstrap tests supported similar AUC values (p > 0.05 for all comparisons). Details regarding this AUC hypothesis test analysis can be found in ESM8 of the Online Resources.

Fig. 1 Each neurocognitive test’s receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) for change score models

3.3 Raw Score Analysis

Table 4 displays a side-by-side comparison of the best-performing change score and raw score models for all three test types. Recall that this study defined the best-performing model as the one that maximized the 2:1 weighted sum of sensitivity and specificity while having a sensitivity and specificity of at least 0.5. First, the ImPACT change score models (Normative and RCI methods) outperformed the ImPACT raw score model when considering the 2:1 weighted sum. Further, the bootstrap-based hypothesis test showed that the AUC values of the change score and raw score models differed significantly for both the Normative and RCI methods (p < 0.05); in particular, a one-sided bootstrap-based hypothesis test showed the best-performing ImPACT change score model had a significantly higher AUC for both methods. Second, the CCAT change score model outperformed the CCAT raw score model in terms of the 2:1 weighted sum of sensitivity and specificity, but the bootstrap-based hypothesis test showed similar AUC values (p > 0.05). Third, the CNS Vital Signs raw score model outperformed the change score model in terms of the 2:1 weighted sum, but the bootstrap-based hypothesis test showed similar AUC values (p > 0.05). All values for the ImPACT, CNS Vital Signs, and CCAT raw score analyses can be found in ESM5, ESM6, and ESM7 of the Online Resources, respectively. Further details regarding the AUC hypothesis tests can be found in ESM8 of the Online Resources.

Table 4 Side-by-side comparison of best-performing models for change score and raw score analysis

4 Discussion

Sport-related concussion (SRC) is an ever-increasing public health concern, and accurate assessment of neurocognitive functioning has long been included as part of the multi-faceted post-concussion assessment. However, it remains unknown which commonly implemented computer-based neurocognitive tests are optimal for this aspect of injury management. Thus, the current study evaluated the sensitivity and specificity of three computer-based neurocognitive assessments in a large and diverse sample to provide athletic trainers, physicians, neuropsychologists, and other healthcare providers guidance on their clinical utility.

For ImPACT, change score performance was best (sensitivity = 0.583, specificity = 0.625, F1 = 0.308) with an 87.5%-confidence interval and when participants failed at least one neurocognitive test component (NCF = 1) using previously developed normative data [6]. The ImPACT RCI method performed best with an 87.5%-confidence interval and NCF = 1 (sensitivity = 0.626, specificity = 0.559, F1 = 0.297). These results are generally consistent with previous research [4, 39, 47] and with the embedded ImPACT algorithms employing 80% two-sided confidence intervals (90% one-sided) [27]. Notably, neurocognitive testing is intended to provide increased sensitivity to deficits not apparent on routine clinical examinations, but the low sensitivity results suggest its ability to do so is very poor.

The change score performance of the CCAT and CNS Vital Signs was similar to that of ImPACT. CCAT’s best-performing model had a sensitivity (0.513) and specificity (0.715) exceeding chance with a 75%-confidence interval when participants failed two neurocognitive test components (NCF = 2). CNS Vital Signs performed best (sensitivity = 0.587, specificity = 0.532, F1 = 0.314) using normative data with a 90%-confidence interval and when participants failed one assessment component (NCF = 1).

As previously mentioned, there has been much discussion regarding baseline testing in the concussion literature. The findings from the raw score analysis showed that the ImPACT and CCAT change score models performed better than the raw score models when considering the 2:1 weighted sum of sensitivity and specificity. Further, for ImPACT the change score models (both Normative and RCI) had significantly higher AUC values, whereas for CCAT the change score and raw score models had similar AUC values. The CNS Vital Signs raw score model performed better than the change score model on the 2:1 measurement, but the two exhibited similar AUC values. The CCAT and CNS Vital Signs results suggest that baseline testing has a negligible impact on neurocognitive test performance. The ImPACT results suggest that baseline testing aids neurocognitive test performance, but considering the small difference in the 2:1 weighted sum between the change score and raw score models (i.e., 0.01 for Normative and 0.017 for RCI), the impact is small. Overall, these results support the current literature, which indicates there is a small or negligible benefit in using athlete-specific baseline values over normative values [13, 17, 18, 21, 22, 40, 52].

Extant research comparing the sensitivity and specificity of traditional pencil-and-paper neurocognitive tests and computer-based tests in accurately classifying patients with SRC and non-injured patients reports mixed findings. For instance, a prior study showed that a battery of neuropsychological tests including measures of verbal learning and memory, processing speed, executive functioning, and working memory demonstrated 87.5% sensitivity and 90% specificity [15]. Yet, Randolph and colleagues [43] argue that these and other studies providing evidence that traditional neurocognitive tests are sensitive to the effects of SRC suffer from methodologic flaws that limit their comparability and generalizability. Other research has reported that computer-based neurocognitive tests do not fare any better with respect to sensitivity and specificity than traditional neurocognitive tests [42]. Along these lines, Resch et al. [48] summarized extant research on the sensitivity (79.2–94.6%) and specificity (89.4–97.3%) of ImPACT and the reported sensitivity (70.8%) of CogState. Houck et al. [26] recently reported that baseline-to-baseline testing in non-concussed athletes commonly shows failure on one testing component. Taken together, there is little evidence to suggest that one neurocognitive test measure is superior to another, leaving such decision making in the hands of the medical provider.

When considering all of the approaches to test accuracy, no test or interpretative approach evaluated here appeared substantially better than the others, suggesting equivalence between the measures. However, the overall low sensitivity and specificity estimates solidify the clinical examination, supported by a multi-dimensional objective assessment protocol, as the gold standard for concussion diagnosis. In most instances, this will include a symptom evaluation, postural control assessment, evaluation of neuropsychological status, and other functional assessments. Indeed, consensus statements support the use of neurocognitive tests [36, 37], and other studies support that neurocognitive tests, when included in a battery, increase clinical utility over symptoms alone [5, 13, 17, 21]. With this, we argue that computer-based testing should not be abandoned, but rather be used within a multi-dimensional assessment protocol or at the discretion of the appropriate clinician when circumstances dictate (e.g., when athletes are slow to recover).

4.1 Limitations

The current study should be considered in light of several limitations. First, the number of concussions each participant sustained prior to their participation in the current study was not controlled for in the analyses. As has been reported previously [16, 24], multiple concussions have been associated with prolonged symptoms, longer recovery times, and increased risk for future concussions and may have impacted participants’ performance on neurocognitive assessment. Future research incorporating the number of previously sustained SRC in analyses would help elucidate these potential neurocognitive performance differences. Second, gender, race, and socio-economic status differences in pre- and post-SRC neurocognitive performance, in addition to demographic differences between the baseline/baseline and baseline/24–48 h groups, were not controlled for in the analyses. Third, we evaluated cadets and athletes collectively, consistent with the CARE Consortium aims [8]. Fourth, this study analyzed the sensitivity and specificity of the embedded ImPACT algorithm (i.e., RCI 80% two-sided confidence interval [27]), but it did not analyze the embedded algorithms of CNS Vital Signs or CCAT; the embedded algorithms for these two tests are proprietary, and the information was not captured in the CARE dataset. Fifth, when analyzing the embedded ImPACT algorithm, the RCI values published in the ImPACT Administration and Interpretation Manual [27] were employed because the manual represents standard practice. However, future research can analyze how different RCI measurements [25] and more recent RCI calculations [10] impact the sensitivity and specificity results. Sixth, athletes and cadets with invalid baseline tests were removed from the analysis, yet Table 2 demonstrates that approximately 30–60% of study participants performed better on the second test regardless of concussion status. Such improvement from one test administration to another suggests that multiple factors, including effort, motivation, and physical and mental fatigue, affect test performance and warrant consideration when determining the validity of neurocognitive testing results. Thus, future research is needed on methods that more accurately account for these performance factors and identify invalid baseline tests, thereby improving the diagnostic utility of computer-based neurocognitive tests. Finally, the measures that comprise the computer-based neurocognitive tests utilized in the current study are not equivalent to the original paper-and-pencil measures from which they were derived. Traditional paper-and-pencil neuropsychological tests were typically designed to evaluate gross changes in neurocognitive functioning, not the subtle deficits associated with SRC. Future research that takes a more granular, task-level approach, rather than the component-level approach used here, would assist in identifying those measures that exhibit better or worse sensitivity and specificity in SRC assessment.

5 Conclusion

This investigation sought to examine the sensitivity and specificity of commonly used computer-based neurocognitive tests in SRC management to provide relevant clinicians additional guidance for appropriate patient care. Our findings indicate that no assessment or interpretative approach is substantially better than the others. The overall low sensitivity and specificity results provide additional evidence for contemporary multi-dimensional concussion assessment approaches and indicate the need for improved sensitivity of the neurocognitive tools used in concussion assessment.