Key Points

Using a data-driven approach, we built a predictive model for sport-related concussion in collegiate athletes and military cadets from baseline data representing 18 variable categories, achieving AUROC = 0.73.

Significant features in the predictive model can provide insight into risk and protective factors for sport-related concussion and be used to generate hypotheses for future research.

This is clinically important because a predictive model capable of identifying athletes at elevated risk for sustaining sport-related concussion can facilitate more targeted injury prevention, education, and surveillance strategies.

1 Introduction

Concussion, or mild traumatic brain injury (mTBI), is a common and serious injury faced by athletes and military personnel alike. The US Centers for Disease Control and Prevention estimates as many as 1.6–3.8 million sport and recreation-related concussions occur annually in the US across all age groups and levels of play [1]. Of these, approximately 10,500 concussions occur annually in the approximately 495,000 collegiate student-athletes competing in NCAA championship sports [2]. TBI is also a serious concern for US military personnel, and has been identified by the US Department of Defense as “one of the signature injuries of troops wounded in Afghanistan and Iraq” [3]. Approximately 82.3% of traumatic brain injuries sustained by Active-Duty US Service Members are mTBI/concussions, often occurring outside of combat with a mechanism similar to that in athletes [4]. In the US military service academies, non-varsity-athlete cadets frequently sustain concussions during club/intramural sport participation as well as military training.

In the short term, concussion causes a constellation of physical, cognitive, affective, and sleep-related signs and symptoms that may limit an individual’s ability to participate and perform in school, at work, on the field of play, or on the battlefield. Additionally, there is growing concern that concussions and repetitive sport-associated head trauma may increase an individual’s long-term risk for developing depression [5], dementia [6], and neurodegenerative diseases such as chronic traumatic encephalopathy later in life [7,8,9,10]. While concussions typically occur in the general population as a result of unpredictable traumatic events, athletes and military cadets are placed at risk for concussion during participation in sport activities. Given the myriad of potential short- and long-term consequences of concussion, efforts should be put forth to identify those at greatest risk for sport-related concussion so that targeted injury surveillance, prevention, and education strategies can be optimized.

We are aware of only one small study attempting to use a limited number of baseline factors to identify athletes’ future concussion risk, without success [11]. Broadly speaking, injury prediction models can utilize factors which are either positively or negatively associated with an outcome of interest. In the case of sport-related concussion, other prior work has identified numerous intrinsic (i.e., individual athlete characteristics), and extrinsic (i.e., sport-associated conditions and mechanisms) features as potential risk or protective factors for injury [12,13,14,15,16]. However, many existing studies have limited their study population to a single sex [17, 18], or sport [17,18,19,20,21,22,23,24,25], and those studies assessing large, diverse populations have still considered only a relatively small set of specific risk factors [26,27,28,29]. As such, many potential concussion risk and protective factors remain unexplored.

We, therefore, sought to leverage the Concussion Assessment Research and Education (CARE) Consortium database [30] to develop a predictive model for sport-related concussion using baseline characteristics in NCAA collegiate athletes and US military cadets. We hypothesized the information commonly collected in athletes and cadets during their baseline pre-participation examinations could be used to predict subsequent concussions in this cohort. Using machine learning techniques, we developed an interpretable risk stratification model based on a large number of baseline covariates. By inspecting the model, we also sought insight into potential risk and protective factors for sport-related concussion which can be utilized to generate novel hypotheses for future study.

2 Methods

2.1 General Study Design

This observational study utilized CARE Consortium data collected at the 21 US academic institutions and military academies participating in the study during the 2015–2016 academic year. Participating student-athletes and military service academy cadets at each institution completed a preseason baseline assessment as part of the pre-participation examination process and were then prospectively monitored for concussion. Baseline assessments included a combination of “Level A” assessment measures common across all CARE Consortium Sites (demographics; medical, sport, academic, and family history; Brief Sensation Seeking Scale [BSSS]; Sport Concussion Assessment Tool [SCAT] Symptom Checklist; Brief Symptom Inventory [BSI]; Standardized Assessment of Concussion [SAC]; Balance Error Scoring System [BESS]; and computerized neurocognitive assessment) as well as optional “Level B” assessment measures collected at each site’s discretion (clinical reaction time, advanced measures of postural stability, oculomotor/oculovestibular assessments, and/or quality of life). One of four computerized concussion tests (ImPACT, Axon/CogState, CNS Vital Signs, or Automated Neuropsychological Assessment Metrics [ANAM]) was administered at each CARE Consortium site. Additional CARE Consortium Study details, including a description of Level A and B measures, are available from Broglio et al. [30]. Institutional Review Board (IRB) approval was obtained at the lead study site, with US Department of Defense Human Research Protection Office approval as well as local IRB approval at each participating site. This study was performed in accordance with the standards of ethics outlined in the Declaration of Helsinki.

2.2 Participants

All CARE Consortium Study participants enrolled during the 2015–2016 academic year were eligible for inclusion in this analysis. All cadets are eligible to consent for CARE participation at the US military service academies, regardless of varsity athlete status, and all varsity student-athletes are eligible to consent at the traditional colleges and universities. The only criterion for exclusion from this analysis was failure to complete the 2015–2016 baseline CARE assessment prior to sustaining a concussion during the study period. A total of 15,682 participants were included in the final dataset. All participants provided informed written consent.

2.3 Primary Outcome

The primary study outcome was a clinician-diagnosed concussion, as defined by the evidence-based consensus definition adopted by the CARE Consortium Study group [30, 31], sustained as a consequence of sport participation between August 1st, 2015 and July 31st, 2016. Participants sustaining a concussion during any form of sport participation (i.e., game or competition at varsity or non-varsity levels of play) were considered positive for the primary study outcome. All other participants, including those who sustained concussions unrelated to sport, were considered negative.

2.4 Covariates

We considered 176 baseline covariates for each subject, grouped into the following categories: academic or military institution, demographic variables, anthropometric variables, academic variables, primary and secondary sports, primary sport position, primary sport equipment details, concussion history, personal medical history, medications, family medical history, social history, self-reported concussion symptoms (SCAT symptom checklist), psychological and quality of life assessment results, neuro-cognitive assessment results, computerized concussion test results, balance assessment results, vision/vestibular–ocular test results (Table 1). Prior to analysis, the data were cleaned to remove obviously incorrect responses. We then mapped each variable to a set of binary features. Categorical variables were mapped to one binary feature per category. Continuous variables were first discretized based on quintile ranges, or were alternatively classified into discrete bins using clinically relevant cutoffs (e.g., normal vs. abnormal), when available. Clinically relevant cut-offs were applied for the Brief Symptom Inventory, Hospital Anxiety and Depression Scale, and Vestibular Ocular Motor Screening examination. Continuous data were then mapped to one binary feature for each discrete range/bin of each variable. This discretization procedure allowed the incorporation of domain knowledge (i.e., clinically meaningful thresholds) and extended the linear model’s capacity to capture nonlinear relationships that could not be identified in the original continuous feature space. All baseline variables and their associated binary features are listed in Supplementary Appendices A and B.
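The mapping from variables to binary features described above can be illustrated with a minimal sketch, assuming a pandas data frame of baseline records. The function name `binarize_covariates`, the column names, the cutoff values, and the choice to give missing values their own "missing" category are illustrative assumptions and are not taken from the CARE Consortium pipeline.

```python
import pandas as pd


def binarize_covariates(df, continuous_cols, categorical_cols,
                        clinical_cutoffs=None, n_bins=5):
    """Map every variable to one binary (0/1) feature per category or bin.

    clinical_cutoffs is a hypothetical dict: column -> (bin_edges, bin_labels).
    Continuous columns without a cutoff entry are split into quantile bins.
    """
    clinical_cutoffs = clinical_cutoffs or {}
    parts = []

    for col in categorical_cols:
        # One binary feature per category; missing values form their own category.
        values = df[col].astype(object).where(df[col].notna(), "missing")
        parts.append(pd.get_dummies(values, prefix=col, dtype=int))

    for col in continuous_cols:
        if col in clinical_cutoffs:
            # Clinically motivated bins (e.g., normal vs. abnormal).
            edges, labels = clinical_cutoffs[col]
            binned = pd.cut(df[col], bins=edges, labels=labels,
                            include_lowest=True)
        else:
            # Quintile ranges by default (n_bins=5).
            binned = pd.qcut(df[col], q=n_bins, duplicates="drop")
        labels_as_text = binned.astype(str).where(df[col].notna(), "missing")
        parts.append(pd.get_dummies(labels_as_text, prefix=col, dtype=int))

    return pd.concat(parts, axis=1)
```

Because each participant falls into exactly one category or bin per variable, every row of the resulting matrix contains exactly one active feature for each original variable.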

Table 1 Baseline covariates

2.5 Data Analysis

2.5.1 Learning Algorithm

We used a linear support vector machine (SVM) model [32] to stratify subjects’ risk for sport-related concussion based on their baseline data. Nonlinear models were also considered, but performed similarly to the SVM model (please refer to Supplementary Appendix C for additional detail regarding nonlinear model performance). As such, only the SVM results are reported given the greater interpretability of the SVM model, i.e., the potential that learned model coefficients may provide insight into potential risk or protective factors [33]. We assumed risk and protective factors were shared across sports and therefore framed the problem as a single task, using a single-task learning (STL) model. Sport-specific models were also considered, but performed similarly to the STL model, likely in part because very few concussions occurred in some sports (please refer to Supplementary Appendix D for additional detail regarding sport-specific models). As such, only the STL model results are reported. We implemented our linear model using scikit-learn, a popular Python machine learning package (https://scikit-learn.org/stable/about.html).
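To illustrate why a linear model is considered interpretable, the following minimal sketch fits scikit-learn's `LinearSVC` on a binary feature matrix and ranks features by learned weight. The helper names `fit_linear_svm` and `rank_features`, the `max_iter` setting, and the default value of `C` used here are assumptions for illustration; the source does not specify which scikit-learn estimator or settings were used beyond a linear SVM.

```python
import numpy as np
from sklearn.svm import LinearSVC


def fit_linear_svm(X, y, C=1e-4, seed=0):
    """Fit a linear SVM; C trades off regularization against training loss."""
    model = LinearSVC(C=C, random_state=seed, max_iter=10000)
    model.fit(X, y)
    return model


def rank_features(model, feature_names):
    """Return (name, weight) pairs sorted from most negative to most positive.

    Negative weights lower the decision score (candidate protective factors);
    positive weights raise it (candidate risk factors).
    """
    weights = model.coef_.ravel()
    order = np.argsort(weights)
    return [(feature_names[i], float(weights[i])) for i in order]
```

Each binary feature receives a single coefficient, so the sign and magnitude of a weight can be read directly, which is not possible in the same way for most nonlinear models.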

2.5.2 Model Selection

We trained the model by repeating the following process 20,000 times (note: 20,000 repetitions were performed so we could assess the statistical significance of the predictive factors, as described in Sect. 2.5.3):

  1. Split. Data were split into a training set and a held-out test set such that approximately 80% of participants contributed to the training set and approximately 20% to the test set. For sports with fewer than 10 participants, all participants were included in the training set. For sports with more than 10 participants, we employed stratified splits to ensure equal proportions of positive and negative examples (i.e., participants with and without sport-related concussions) between the training and test sets, except when fewer than 5 positive examples were present (i.e., fewer than 5 participants sustained a concussion), in which case we used random splits.

  2. Train. A model was trained (i.e., learned the model parameters) using the training data set with a pre-selected hyperparameter to control the tradeoff between regularization and loss (i.e., to optimize model performance without overfitting). Such a setup is essential in high-dimensional settings to avoid simply memorizing the data.

    We pre-selected the hyperparameter C by repeated nested cross-validation, using a grid search over $C \in \{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$ with five-fold cross-validation [34] on the training data set for 100 splits. For computational efficiency, we used the mode of the resultant 100 hyperparameters ($C = 10^{-4}$) as the pre-selected hyperparameter for all 20,000 models (please refer to Supplementary Appendix E for additional detail regarding hyperparameter selection).

  3. Test. The trained model was applied to the test data set to evaluate its performance, as quantified using the area under the receiver operating characteristics curve (AUROC). AUROC is a common measure of a model’s ability to discriminate positive vs. negative examples (i.e., participants with vs. without sport-related concussion) [35] and intuitively represents the probability that the model ranks a randomly chosen positive example above a randomly chosen negative example. A model with AUROC of 1 would be perfect, while a model with AUROC of 0.5 would perform no better than chance.
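The split, train, and test steps above can be sketched as follows. This is a simplified illustration under stated assumptions: it stratifies on the outcome only (the per-sport rules for small sports are omitted), it uses `LinearSVC` with `roc_auc` as the grid-search scoring metric, and the function names `select_C`, `run_once`, and `repeat_experiment` are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

C_GRID = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]


def select_C(X_train, y_train, grid=C_GRID):
    """Pick C by grid search with five-fold cross-validation on training data."""
    search = GridSearchCV(LinearSVC(max_iter=10000), {"C": list(grid)},
                          scoring="roc_auc", cv=5)
    search.fit(X_train, y_train)
    return search.best_params_["C"]


def run_once(X, y, C=1e-4, seed=0):
    """One repetition: 80/20 stratified split, train, then score the test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    model = LinearSVC(C=C, random_state=seed, max_iter=10000).fit(X_tr, y_tr)
    # AUROC needs a ranking score, not hard labels, so use the signed margin.
    scores = model.decision_function(X_te)
    return roc_auc_score(y_te, scores), model.coef_.ravel()


def repeat_experiment(X, y, n_repeats=100, C=1e-4):
    """Repeat run_once with different splits; collect AUROCs and weights."""
    aurocs, coefs = [], []
    for seed in range(n_repeats):
        auroc, weights = run_once(X, y, C=C, seed=seed)
        aurocs.append(auroc)
        coefs.append(weights)
    return np.array(aurocs), np.vstack(coefs)
```

Fixing C once and reusing it across all repetitions, as the text describes, avoids rerunning the grid search inside every repetition.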

2.5.3 Model Performance and Predictive Factor Analysis

By repeating the above process 20,000 times, we produced an empirical distribution of model performance allowing calculation of the mean AUROC with an associated 95% confidence interval. The confidence interval was calculated using the 2.5th percentile and 97.5th percentile values of the empirical distribution of AUROCs. Using a linear model allowed us to investigate the learned model parameters, thus identifying potential risk/protective factors. We considered those features with the same sign over the 20,000 repeated runs as statistically significant since features with the same sign over k runs would have a p value of at most 1/k when testing a null hypothesis of zero effect size for that feature. Therefore, to claim statistical significance at a p value < $5.22 \times 10^{-5}$ (based on a Bonferroni Correction to account for comparisons over 957 features, $0.05/957 = 5.22 \times 10^{-5}$), k must be at least 19,140 (equivalent to 1/k < 0.05/957) which we elected to round up to a total of 20,000 repetitions of the experiment.
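The percentile confidence interval, the sign-consistency criterion, and the Bonferroni arithmetic described above can be expressed in a short sketch. It assumes an array of per-repetition AUROCs and a matrix of learned weights with one row per repetition and one column per feature; the function names are hypothetical.

```python
import math
from fractions import Fraction

import numpy as np


def percentile_ci(aurocs, level=0.95):
    """Mean AUROC with an empirical percentile interval (2.5th to 97.5th)."""
    lower, upper = np.percentile(
        aurocs, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return float(np.mean(aurocs)), float(lower), float(upper)


def sign_consistent_features(coefs):
    """Boolean mask of features whose weight has the same sign in every run."""
    all_positive = (coefs > 0).all(axis=0)
    all_negative = (coefs < 0).all(axis=0)
    return all_positive | all_negative


def min_repeats(alpha="0.05", n_features=957):
    """Number of runs k at which 1/k equals the Bonferroni threshold alpha/n."""
    # Exact rational arithmetic avoids floating-point error in 957 / 0.05.
    return math.ceil(Fraction(n_features) / Fraction(alpha))
```

With the defaults, `min_repeats()` returns 19140, matching the figure quoted in the text before rounding up to 20,000 repetitions.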

3 Results

Of the 15,682 study participants included in this analysis, 595 (3.79%) sustained a subsequent sport-related concussion during the study period. Characteristics of the study population are presented in Table 2.

Table 2 Study population characteristics

After preprocessing, our model included 957 binary features (Supplementary Appendices A and B). After splitting the data, our training sets contained 12,539 baseline records (476 positive for sport-related concussion) and our test sets contained 3143 baselines (119 positive for concussion), on average. Applied to the test sets, our model achieved a mean AUROC of 0.73 [95% CI 0.70–0.76]. Figure 1 presents the mean ROC curve. Model performance varied across sports, as did the number of participants available for both the training and test data sets (Fig. 2). Despite American football having the greatest number of concussions, model performance was strongest for Swimming (AUROC = 0.86 [95% CI 0.61–1.00]) and weakest for Cheerleading (AUROC = 0.41 [95% CI 0–1.00]), where the model performed no better than chance.

Fig. 1

Mean receiver operating characteristics (ROC) curve on the test set. The mean ROC curve is generated by first evaluating all the models on the test set and then macro-averaging all the true-positive rates interpolated at 100 evenly spaced false-positive rates starting from 0 to 1 inclusively
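The macro-averaging procedure described in this caption can be sketched as follows, assuming lists of test-set labels and model scores collected from each repetition. The function name `mean_roc` is hypothetical, and forcing the curve to start at the origin is a common convention assumed here, not something stated in the source.

```python
import numpy as np
from sklearn.metrics import roc_curve


def mean_roc(y_true_list, score_list, n_points=100):
    """Average ROC curves on a shared grid of false-positive rates."""
    fpr_grid = np.linspace(0.0, 1.0, n_points)  # 0 and 1 both included
    interpolated = []
    for y_true, scores in zip(y_true_list, score_list):
        fpr, tpr, _ = roc_curve(y_true, scores)
        # Interpolate this run's true-positive rate onto the shared grid.
        tpr_on_grid = np.interp(fpr_grid, fpr, tpr)
        tpr_on_grid[0] = 0.0  # assumed convention: curve starts at (0, 0)
        interpolated.append(tpr_on_grid)
    return fpr_grid, np.mean(interpolated, axis=0)
```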

Fig. 2

Sport-specific model performance and fractions of the entire study population and concussions observed across sports. AUROC error bars represent 95% confidence intervals

The mean effect sizes of those features reaching statistical significance are illustrated in Fig. 3. Features with a negative effect size can be interpreted as protective (i.e., reduce estimated risk of concussion), while features with a positive effect size are risk factors (i.e., increase estimated risk of concussion). Of the 259 statistically significant features in the model, 179 are protective and 80 are risk factors. The most heavily weighted risk and protective factors reaching statistical significance in each variable category are listed in Table 3.

Fig. 3

Relative weights (regression coefficients/mean effect sizes, mES) of all statistically significant features (p ≤ 0.00005) identified by the STL model, illustrated by variable category. Higher values are associated with greater risk of sport-related concussion, while lower values are associated with lower risk. Dot radius is proportional to mean effect size absolute value. See Table 3 for features with the smallest and largest mES in each variable category, and Supplementary Appendices A and B for all individual feature identifications

Table 3 Statistically significant risk and protective factors of greatest mean effect size (mES) by variable category

4 Discussion

Concussion is a serious injury affecting athletes and military service members that can be associated with significant short- and long-term morbidity [36, 37]. Since many concussions occur during sport participation, the ability to identify athletes at elevated risk for sustaining sport-related concussion would be of significant clinical value. Early identification of at-risk athletes could allow sports medicine providers to apply more targeted preventative measures, educational intervention, and injury surveillance strategies, and has the potential to influence injury management to prevent re-injury. To this end, we sought to develop a risk-stratification tool capable of identifying athletes and cadets at elevated risk for sport-related concussion. To our knowledge, this is the first successful attempt to develop a model for predicting concussions from prospectively collected baseline data. Using only data elements collected during athletes’ and cadets’ baseline pre-participation examinations, we developed a risk-stratification model that was able to predict those who would go on to sustain a sport-related concussion during the same academic year with an overall mean AUROC of 0.73. This is remarkable considering this model’s ability to predict future concussions falls within the range of sensitivities and specificities demonstrated by existing concussion assessment tools for distinguishing already concussed from non-concussed athletes [38,39,40,41]. The variability in model performance across sports is not surprising given the large differences in the number of athletes participating in each sport as well as the number of observed concussions.

While the primary goal of this study was to develop a predictive algorithm for sport-related concussion, evaluation of heavily weighted factors in the model can provide insight and generate hypotheses regarding potential concussion risk and protective factors. In a recent evidence-based systematic review of risk factors for sport concussion, Abrahams et al. identified only two high-certainty concussion risk factors: having sustained two or more previous concussions and match vs. practice play [12]. Of these, only concussion history is a baseline characteristic. Not surprisingly, the current investigation also identified concussion history as a significant predictor of subsequent sport-related concussion, and not having a concussion history as protective. With respect to concussion history, recent work has demonstrated that sustaining a first concussion at an earlier age is associated with a higher number of subsequent concussions [42]. Similarly, while more specific to repetitive sport-associated head impact exposure than diagnosed concussions, other evidence in retired professional American football athletes has implicated earlier age of first exposure to tackle football as a potential risk factor for adverse long-term neuroanatomical, neurocognitive, and neuropsychiatric outcomes [43,44,45,46]. It is, therefore, interesting to note in the present investigation that earlier age of first concussion did not receive a significant weight in the model. Other results from the CARE Consortium and elsewhere have similarly failed to identify an association between age of first exposure and outcomes during adolescence/young adulthood [47,48,49].

In addition to those associated with concussion history, some other heavily weighted factors in this study came as little surprise. Concussion rates are consistently found to differ across sports, with contact and collision sports associated with greater concussion risk compared to sports with limited or no contact [50,51,52]. In keeping with previous research [26, 53], this investigation also found that a number of contact/collision sports were associated with an increased concussion risk (e.g., soccer, basketball, American football, water polo, rugby, diving, lacrosse), while a number of non-contact sports were associated with decreased risk (e.g., cross-country/track, rowing, swimming, baseball, golf, field events, tennis). Additionally, this study identified female sex as a concussion risk factor, with male sex being protective. This sex-based result is consistent with previous research comparing concussion rates between males and females in sex-comparable sports with similar rules and physicality [13, 53, 54], but the present study extends this finding by analyzing concussion risk across all sports.

It is interesting to note that reporting higher baseline symptoms (3–5 positive symptoms on the SCAT Symptom Checklist; SCAT Symptom Severity Score 8–86) tended to be associated with higher weight in the model, while lower baseline concussion symptoms (SCAT Symptom Severity Score of 2–3) tended to be associated with negative weight. In addition, several significant predictive factors identified in this study are associated with neurological comorbidities generally considered to represent concussion modifying factors. For example, the model identified the presence of migraine headache at baseline, either by self-report of a prior medical diagnosis or a positive response to the ID Migraine Questionnaire [55], as a heavily weighted concussion risk factor, while the absence of migraine headaches was protective. These findings offer prospective evidence that is consistent with previous speculation for migraine headache as a potential concussion risk factor [56, 57]. In a similar manner, this model identified a history of ADHD [58] or learning disability at baseline as associated with higher risk, while the absence of both conditions at baseline was protective. While a comparable trend was present for history of depression, only the absence of depression at baseline reached statistical significance as a protective factor after Bonferroni correction. It is also interesting to note that while a normal Hospital Anxiety and Depression Scale (HADS)-Anxiety score at baseline was a significant protective factor, HADS-Anxiety scores falling in the abnormal or borderline-abnormal ranges failed to reach statistical significance as risk factors.

It is important to recognize that a feature’s significance as a risk or protective factor in this predictive statistical model does not imply causality. In fact, it is likely that a number of significant predictive factors in the model are correlated with concussion risk in the absence of any clinically interpretable causal relationship. For example, an abnormal Anxiety score on the BSI-18 at baseline was identified as a significant protective factor, in contrast to the decreased risk associated with normal HADS-Anxiety scores described above. Furthermore, the model identified being right-handed as a significant protective factor, not using marijuana or alcohol in the previous month as significant risk factors, and missing answers to the marijuana and alcohol questions as significantly protective. In addition, the model identified many academic features as significant predictors of sport-related concussion. For example, several factors corresponding to higher academic performance (e.g., having a high-school GPA in the 4th quintile or a collegiate GPA in the 4th or 5th quintile; having an ACT-Math score in the 4th or 5th quintile, an ACT-Science score in the 3rd or 5th quintile, or an ACT-Reading score in the 4th quintile) were associated with lower risk of concussion, while having a total ACT score in the 1st quintile, corresponding to lower academic performance, was associated with higher risk. The model also identified several significant factors associated with age. Specifically, freshman class status and being in the lowest age quintile (age ≤ 18 years) were significant concussion risk factors, while senior and “fifth-year-senior” class status as well as being in the fourth or fifth age quintiles (20 years < age ≤ 21 years; 21 years < age ≤ 30 years) were identified as being protective.
One might hypothesize the age/class-status trend to be associated either with physical maturation over a collegiate career or a “weed-out” effect with some athletes who sustained early concussions not continuing to compete through their entire academic careers. However, in many cases significant model observations fail to suggest clinically relevant hypotheses.

It is also noteworthy that a high degree of collinearity was present among many of the baseline variables assessed. As such, the significance of some heavily weighted factors in the model has a high likelihood of being attributable to an association with other more obvious risk or protective factors. For example, the observation that wearing protective gear and wearing a mouth guard are both concussion risk factors is likely attributable to protective equipment and mouth guards both being more frequently worn in higher risk contact sports. Furthermore, the observation that African-American race and weight falling into the 5th quintile are concussion risk factors may be attributed to disproportionate participation in American football by African-Americans (about one-third of African-Americans in the study population were football players) and athletes in the highest quintile for weight (about half of the athletes in this weight quintile were football players).

Ultimately, individual features, regardless of their weight, must be interpreted in the context of the entire model and not as independent risk or predictive factors. As such, it cannot be assumed based on the present results that intervening on a potentially modifiable risk or protective factor would necessarily influence an athlete’s future concussion risk. Such a conclusion would require additional support from a prospective intervention trial. Identification of heavily weighted factors, especially when unanticipated, should prompt novel hypothesis generation and future concussion research. For example, based on the greater risk of sport-related concussion observed in younger freshman athletes and cadets, it would be reasonable for future research to investigate the more gradual incorporation of incoming freshmen into varsity collegiate sport participation as a concussion risk reduction strategy.

This study was not without limitations. Despite likely differences in the mechanisms of injury causing concussions in different sports, as well as differences between the varsity and non-varsity levels of play, we were not able to develop sport-specific models. This was largely due to variability in the number of athletes participating in different sports and a small number of positive examples (i.e., concussions) in some sports, leading to over-fitted models that did not generalize well to test data. Other factors were also likely at play, given that American football, which included more participants than any other single varsity sport as well as the greatest number of concussions, had an AUROC falling below that of the full model. Given challenges in developing sport-specific models, we employed a model trained using data from all sports at both the varsity and non-varsity levels of play. With additional data, future research could seek to develop sport- and level-of-play-specific models, as well as models specific to military cadets, which might have greater predictive ability. Another study limitation was the presence of “missing data” and the potential for non-values to be coded in more than one way for many variables. For example, non-values could potentially be coded as “skipped” (i.e., a “skip this question” response was selected/provided), “missing” (i.e., no response was selected/provided), “unknown” (i.e., a “don’t know” response was selected/provided), or “N/A” (i.e., the question did not apply to the athlete/cadet; e.g., collegiate GPA for an incoming freshman, helmet type for an athlete participating in a non-helmeted sport, or results of a “Level B” measure or computerized concussion test not used at the athlete’s/cadet’s institution).
Given that missing data rates varied greatly across features (from a minimum of 0.6% to a maximum of 99.9%), we elected to analyze the dataset retaining all potential non-value options to account for the possibility they might contain predictive information. However, it is challenging to develop clinically relevant hypotheses for most significant non-value features identified in the model. In addition, it is also possible that some inaccurate self-reported information may have remained in the dataset despite our careful review and attempt to remove obviously incorrect data. It would be impractical to independently review records to verify all self-reported information collected during CARE baseline assessments, so self-reporting errors are possible. Furthermore, while we attempted to discretize continuous variables using clinically relevant cut-offs when available, our procedure for mapping other continuous variables to discrete sets of binary features using quintile ranges may not have captured clinically important or statistically significant cut-offs in some cases. We elected to use quintiles because of their standardization and reproducibility, but other discretization strategies may have yielded different results. Next, while the effect size values reported in Supplementary Appendices A and B are meaningful for comparing the relative model weights between features in the present study, they cannot be interpreted outside the context of this study in the same way as standardized effect size values like Cohen’s d. In addition, these results should not be extrapolated beyond a collegiate athlete/military cadet population. Even within a population of collegiate athletes and military cadets, it would be challenging to apply this model outside of the CARE Consortium study given the number of baseline variables used as model inputs, many of which are not routinely collected outside of the CARE study protocol.
Future work should build on the present results to develop more streamlined models utilizing a subset of the most predictive features so that the models can more easily be applied in a routine clinical setting. In a similar vein, these results should not be extrapolated to non-sport concussions. In fact, since this study focused only on sport-related concussion, it is possible that some participants considered negative for the primary study outcome could potentially have sustained concussions outside of sport participation during the study period. Lastly, given potential athlete under-reporting of concussion symptoms and because there is no objective confirmatory test for concussion, the concussion diagnosis relies both on accurate injury identification as well as the clinical impression of the evaluating medical provider. While issues of potential missed injuries and diagnostic uncertainty are common across all concussion research, the CARE Consortium’s use of a standard concussion definition by all sites should at least mitigate the potential for diagnostic uncertainty. Nonetheless, any diagnostic inaccuracy is undesirable in the context of developing a data-driven risk prediction model where an accurate classification of “case-ness” is relied upon heavily for the analysis and will render injury prediction and evaluation of those predictions more challenging.

5 Conclusion

This collaborative data-driven study leverages powerful analytical techniques and a robust clinical dataset to develop a novel model for predicting collegiate athletes’ and military cadets’ risk of sustaining sport-related concussion. This is clinically important because, to our knowledge, it represents the first successful attempt to predict sport-related concussion using only baseline data. As such tools are developed and refined, clinicians will increasingly need to determine how to apply them in clinical practice. Might a future model perform so accurately that one day certain athletes would be restricted from participating in certain sports due to the amount of model-predicted risk? For now, this study suggests it is feasible to identify athletes at elevated risk of sport-related concussion in whom targeted prevention, education, and injury surveillance strategies may be employed. Furthermore, this study offers insight into potential risk and protective factors for sport-related concussion, generating novel hypotheses for future concussion research. Future work is needed to develop a predictive model that can be easily applied in routine clinical practice, as well as to identify modifiable concussion risk factors that can be intervened upon to modify an athlete’s injury risk.