Introduction

Chronic pain is a highly prevalent condition associated with significant disability and societal cost [1]. The etiology of chronic pain can vary substantially across patients and often appears to be secondary to biological factors, such as musculoskeletal injury (e.g., osteoarthritis [OA]), nerve injury (neuropathic pain), autoimmunity (rheumatoid arthritis [RA], systemic lupus erythematosus), or substance abuse (alcoholic neuropathy and chronic pancreatitis). In many cases, however, chronic pain seems to be the primary illness associated with a given condition (e.g., fibromyalgia syndrome [FM; 2]). Over the past several decades, substantial effort has been dedicated to the development of methodologies for the reliable discrimination or classification of patients with a given chronic pain condition from people without the condition (i.e., “healthy controls”), patients with similar but distinct illnesses, or both.

The goal of many studies is to accurately identify individuals with a given clinical chronic pain condition based on potential mechanistic underpinnings. Given that some chronic pain conditions are associated with peripheral tissue pathophysiology, numerous studies have aimed to reliably discriminate individuals with clinical pain using peripheral measures, including structural (e.g., knee degeneration, lumbar disc pathology; [3]), functional (e.g., gait abnormality, inflammatory processes in rheumatoid arthritis; [4, 5]), and genetic data [6]. Because chronic pain often exists as a primary symptom without remarkable tissue abnormalities, there is increasing interest in its neural correlates, i.e., mechanistic or structural abnormalities derived from structural or functional magnetic resonance imaging (MRI) studies of the brain in chronic pain (e.g., grey matter density/volume, white matter integrity, or functional connectivity between pain-related brain regions). In addition, such information can be used to determine whether central nervous system (CNS) abnormalities are suitable for classifying individuals with chronic pain.

In this regard, a major goal of many MRI-based chronic pain classification studies is the development of objective biomarkers for each condition that do not rely on self-report [7, 8•, 9]. Brain-based biomarker studies of chronic pain leverage the extensive neuroimaging literature describing the critical role of certain brain regions in the sensory (e.g., thalamus, primary and secondary somatosensory cortex, posterior insula, basal ganglia), affective (e.g., anterior insula, hypothalamus, anterior cingulate cortex, amygdala, hippocampus), and cognitive-evaluative (e.g., dorsolateral prefrontal cortex, anterior cingulate cortex, thalamus) aspects of the pain experience (for review, see [10]). Although our review focuses exclusively on brain imaging biomarkers, it is worth noting that a substantial literature has also applied this approach to peripheral mechanisms. For instance, measurement of exhaled organic volatiles [11] and joint ultrasound [12] have been suggested as useful approaches for the automated classification and/or differential diagnosis of rheumatoid arthritis.

Some proponents of brain imaging-based approaches to illness classification claim that brain biomarkers could act as a “surrogate” for pain self-report [13]. This could result in greater sensitivity to measure efficacy of analgesic treatments [9] by avoiding potential pitfalls associated with the use of subjective patient and provider reports. Furthermore, such brain biomarkers could serve as targets for novel treatments [14, 15]. In particular, objective markers of chronic pain may play an important role in pain classification of patients with cognitive/psychological dysfunction and in individuals who are unable to communicate, or in cases of deception [16•, 17], as well as aid in the adjudication of legal claims of injury-related pain disorders [18]. These purported advantages have spurred interest in the development of biomarkers, although it is important to note that self-report remains the gold standard for pain measurement. Numerous studies have indicated that self-report measures of pain and psychosocial factors have excellent classification accuracy and reliability [8•, 19•, 20]. Indeed, evidence suggests that chronic pain patients can be distinguished from healthy controls with greater than 90% accuracy based on personality factors [21], perceived pain and functional disability [22], and simple visual analog scale (VAS) measures of affect [23••]. Questionnaire-based approaches have shown similar accuracy for the separation of FM and RA [24] and FM, RA, and OA [25].

The purpose of this critical review is to examine the development of brain biomarkers for chronic musculoskeletal pain disorders using machine learning (ML) in a manner which should be useful to physicians and other health care providers. Therefore, we will restrict our review to musculoskeletal conditions for which chronic pain is a primary complaint. The conceptual, scientific, and ethical issues related to this approach, however, will also apply to biomarker studies for other chronic conditions (e.g., depression).

Machine Learning Algorithms

Existing methods regarding the use of structural and functional brain imaging to discriminate patient groups from healthy controls have largely relied on ML algorithms, which provide an automated approach to making predictions about previously unknown data. ML algorithms can be broadly classified as supervised or unsupervised. Supervised methods develop models for classifying observations according to a known outcome (e.g., a clinical diagnosis). In contrast, unsupervised methods attempt to discern patterns or structure in data without guidance or pre-existing labels. Although unsupervised ML methods may be of interest for the development of subgroups within existing conditions or new diagnostic classification criteria based on brain abnormalities, ML studies in chronic pain to date have mostly relied on supervised methods. Support vector machines (SVM; [26, 27]), a type of ML algorithm that uses a training data set composed of one or more features to determine an optimal boundary separating a set of cases, have been most often used [23••, 28, 29]. However, many other ML techniques have been developed and applied to the automated classification of chronic pain patients [23••, 30]. Detailed discussions of the strengths and limitations of these methods are beyond the scope of this review, but basic descriptions of the most common ML algorithms applied in the study of chronic pain, as well as detailed references, are provided in Table 1.
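To make the idea of a learned decision boundary concrete, the following is a minimal sketch of a linear SVM trained by sub-gradient descent on the hinge loss. It is not the implementation used in any of the studies cited here: the two standardized features, the toy values, and the settings lam, lr, and epochs are all hypothetical and chosen only for illustration.

```python
# Minimal linear SVM trained by sub-gradient descent on the hinge loss.
# Illustrative only: toy data, hypothetical features, no tuning.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Return (weights, bias) for labels y in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularization shrinks the weights on every step
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Case lies inside the margin or is misclassified:
                # move the boundary toward classifying it correctly
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Classify a case by which side of the boundary it falls on."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical standardized features per participant,
# e.g. [cortical_thickness_z, regional_volume_z]
X_train = [[-2.0, -1.5], [-1.5, -2.0], [-1.0, -1.0],
           [1.0, 1.5], [1.5, 1.0], [2.0, 2.0]]
y_train = [-1, -1, -1, 1, 1, 1]  # -1 = control, +1 = patient (assumed coding)

w, b = train_linear_svm(X_train, y_train)
train_predictions = [predict(w, b, x) for x in X_train]
new_case = predict(w, b, [1.2, 0.8])  # a previously unseen participant
```

Because the toy groups are cleanly separable, the learned boundary classifies every training case correctly. Real neuroimaging features rarely separate this cleanly, which is why performance on held-out data matters.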

Table 1 Basic descriptions of common supervised machine learning algorithms

Scientific and Clinical Utility of Chronic Pain Biomarkers

In this section, we will discuss theoretical, practical, and ethical issues surrounding pain biomarker development and use, with a particular emphasis on clinical utility.

Choice of Classification Algorithm

In an ideal scenario, the best algorithm for detecting a chronic pain condition would be one that matches the underlying model or structure that distinguishes that condition from healthy individuals or other conditions (e.g., if a linear decrease of prefrontal cortical thickness is the underlying mechanism of a condition, then a linear model would be sufficient). However, given the complex and often heterogeneous nature of chronic pain conditions, this information is frequently not available. It is often difficult, if not impossible, to determine a priori which algorithm might perform best in a certain data set [37]. This problem frequently necessitates the comparison of multiple different algorithms for any single classification task (for examples, see [23••, 38]). As a general rule, simple models may not perform as well on training data but will likely generalize better when assessed on testing data, whereas more complex models tend to perform well on training data but less well on testing data (i.e., they overfit). More complex models or those that blend inputs from many different algorithms (e.g., ensemble methods) may also lead to challenges in interpretability, requiring additional pipelines to distill results in ways that are meaningful and useful for clinicians. In general, large sample sizes are often needed to increase generalizability and prevent overfitting [39]. A recent study suggests that accuracy values obtained from smaller samples are opportunistically biased [40]. Therefore, proposed biomarkers must demonstrate strong performance in generalizable samples.
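The gap between training and testing performance can be illustrated with a small simulation. The sketch below uses a 1-nearest-neighbour classifier as a stand-in for a highly flexible model (an assumption made for illustration, not a method drawn from the studies reviewed) and applies it to simulated features that carry no true signal. The sample sizes, the number of features, and the random seed are arbitrary.

```python
import random

def nearest_neighbor_predict(train_X, train_y, x):
    """1-nearest-neighbour: a flexible model that memorizes the training set."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]

def accuracy(train_X, train_y, X, y):
    """Proportion of cases in (X, y) that the model labels correctly."""
    hits = sum(nearest_neighbor_predict(train_X, train_y, x) == label
               for x, label in zip(X, y))
    return hits / len(y)

def simulate(n, n_features=20):
    """Pure-noise features with labels assigned at random (no true signal)."""
    X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    y = [random.choice([0, 1]) for _ in range(n)]
    return X, y

random.seed(0)  # fixed seed so the sketch is reproducible

train_X, train_y = simulate(40)    # small "study" sample
test_X, test_y = simulate(400)     # independent cases never seen in training

train_acc = accuracy(train_X, train_y, train_X, train_y)  # scored on itself
test_acc = accuracy(train_X, train_y, test_X, test_y)     # scored on new cases

print(f"training accuracy: {train_acc:.2f}")
print(f"testing accuracy:  {test_acc:.2f}")
```

Training accuracy is perfect because each training case is its own nearest neighbour, whereas accuracy on the independent cases stays near chance. Accuracy estimated on the data used to fit a model says little about how the model will perform on new patients.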

Biomarker Reliability and Feasibility of Implementation

A neuroimaging-derived biomarker cannot be reliably used for scientific or clinical purposes if the measures composing it do not demonstrate high test-retest reliability (i.e., the reproducibility of results over time). This concern is especially relevant given recent evidence that the test-retest reliability of functional connectivity metrics may vary widely depending on the brain regions examined [19•]. For this reason, the reliability of neuroimaging-based measures should not be taken for granted. This concern also applies to populations where typical brain function may be perturbed due to injury or illness (e.g., post-stroke [41]).
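One common way to quantify test-retest reliability is the intraclass correlation coefficient (ICC). The sketch below computes the consistency form for single measurements, often written ICC(3,1). The choice of this particular form and the example values are assumptions made for illustration; the studies cited above may have used other reliability statistics.

```python
def icc_consistency(scores):
    """ICC(3,1): two-way mixed model, consistency, single measurement.

    scores[i][j] is the measure for participant i at session j.
    """
    n = len(scores)        # number of participants
    k = len(scores[0])     # number of sessions
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Partition the total sum of squares (two-way ANOVA without replication)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # sessions
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# Hypothetical connectivity values for four participants scanned twice
stable = [[0.30, 0.35], [0.50, 0.55], [0.10, 0.15], [0.70, 0.75]]
unstable = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]

icc_stable = icc_consistency(stable)
icc_unstable = icc_consistency(unstable)
```

In the first hypothetical set, every participant shifts by the same amount between sessions, so the rank order is perfectly preserved and the ICC is 1. In the second set, participants swap positions between sessions and the ICC falls to 0.6.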

Even a biomarker that meets or exceeds criteria for appropriate use as a diagnostic tool will not be clinically useful if it cannot be readily assessed in clinic settings. As noted by Woo and Wager [42], all parameters of the biomarker and its associated testing procedure should be rigorously standardized so that all users collect data comparable to the standard used to derive the measure. Depending on the demands associated with collecting certain biomarkers (especially those derived from BOLD activity during task-based fMRI), such rigor may be difficult to achieve in clinical settings. Often the additional time, nuanced design, and/or special expertise required to acquire reliable data will limit some biomarkers’ applicability and usefulness.

What Constitutes Adequate Biomarker Performance?

A major issue in the application (and reporting) of ML-based brain biomarkers for chronic pain is determining whether a given marker’s accuracy is adequate. The criteria for adequate biomarker performance depend on the setting in which the marker will be used, as well as the cost of misclassification. Pure research-oriented use cases (e.g., identifying potential markers, phenotypic subgroups, or treatment targets; discerning disease mechanisms) may allow for less stringent performance criteria. In contrast, clinical tasks (diagnosis, prognosis, treatment planning) carry risk of harm (misdiagnosis, delaying treatment, providing inaccurate or unnecessary treatment) and will therefore require considerably more stringent performance criteria for use in patient care. Although the risk of medical error from decisions based upon ML-derived biomarkers may be no greater than that commonly accepted in current practice, the “black-box” nature of these markers raises liability concerns in the face of medical errors arising from ML-derived biomarkers [43] and may increase physicians’ reluctance to use even those that are well validated.

Ultimately, any given biomarker’s practical applicability will depend on a case-by-case assessment of its cost, deployability, and accuracy in the context of its intended use. For instance, in clinical settings, biomarker performance should be judged by taking into consideration whether it is intended to replace physicians in a role they can already perform well (e.g., differentially diagnose mechanistically and/or symptomatically distinct conditions) or, alternatively, provide novel information that clinicians would not otherwise have. This approach may help identify mechanistically distinct clinical subgroups, guide treatment decisions, indicate prognosis, or separate conditions with significant mechanistic or symptomatic overlap. In cases where biomarker application provides useful clinical information where it would not otherwise be available, higher cost, more difficult deployability, and/or relatively poorer performance may be quite tolerable [43].

Impact of Chronic Pain Prevalence on Biomarker Performance

Ethical and practical issues surrounding potential under- or overtreatment of pain remain even in cases where biomarker sensitivity (i.e., the probability of a positive biomarker given that an individual has the condition) and specificity (i.e., the probability of a negative biomarker given that an individual does not have the condition) reach very high levels, depending on the given use case. This is because, by Bayes’ theorem, biomarker positive predictive value (PPV; the probability that an individual with a positive biomarker has the condition) and negative predictive value (NPV; the probability that an individual with a negative biomarker does not have the condition) depend on the prevalence (i.e., base rate) of the condition of interest in a particular setting, as previously noted [44•]. Specifically, PPV increases with prevalence, while NPV increases as prevalence decreases. This relationship is illustrated graphically in Fig. 1 using the sensitivity (81%) and specificity (75%) of a structural brain biomarker for pain we have previously reported [23••]. As a result, even with the highest performing chronic pain biomarkers previously reported (e.g., 92% sensitivity, 92% specificity [16•]), a large share of positive results will be false positives in application settings with low prevalence, and a large share of negative results will be false negatives in settings with high prevalence. Investigators need to take the use case into account when reporting biomarker performance and judging its adequacy.
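The dependence of predictive values on prevalence follows directly from Bayes’ theorem and can be checked numerically. The sketch below uses the sensitivity and specificity values quoted above. The prevalence values are hypothetical and serve only to show the direction and size of the effect.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given prevalence, via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)   # P(condition | positive marker)
    npv = true_neg / (true_neg + false_neg)   # P(no condition | negative marker)
    return ppv, npv

# Sensitivity 81% and specificity 75%, at several assumed prevalence levels
for prevalence in (0.05, 0.20, 0.50, 0.80):
    ppv, npv = predictive_values(0.81, 0.75, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")

# A marker with 92% sensitivity and specificity at an assumed 5% prevalence
ppv_high, npv_high = predictive_values(0.92, 0.92, 0.05)
```

Under these assumptions, a marker with 92% sensitivity and specificity applied where only 5% of individuals have the condition yields a PPV of about 0.38, meaning that most positive results would be false positives.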

Fig. 1 Illustration of the relationship between chronic pain prevalence and the positive/negative predictive value of biomarkers using sensitivity (81%) and specificity (75%) values derived from Robinson et al. [23••]. Positive predictive value (PPV) increases with prevalence, while negative predictive value (NPV) decreases as prevalence increases. Thus, during real-world application of brain biomarkers for chronic pain, performance will depend largely on the expected proportion of individuals with the condition in question.

Finally, self-report of pain symptomatology is ultimately the basis for diagnosing and assessing the severity of chronic pain conditions (i.e., the so-called “gold standard”). Therefore, the predictive utility of biomarkers is necessarily limited by the accuracy of the existing diagnostic gold standard [8•, 45]. It is conceivable that subgroups within a diagnostic category could be identified solely based upon physiological measurements using unsupervised ML methods and later validated against self-report, thereby circumventing this issue. However, to our knowledge, no previous neuroimaging-based biomarker studies for chronic pain have used this approach.
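As a rough illustration of how subgroups might be derived without diagnostic labels, the sketch below applies k-means clustering to two hypothetical physiological measures. The choice of k-means, the starting centroids, and the toy values are all assumptions made for illustration; no study reviewed here has used this procedure.

```python
def kmeans(points, start_centroids, n_iter=20):
    """Plain k-means (Lloyd's algorithm) with caller-supplied starting centroids."""
    centroids = [list(c) for c in start_centroids]  # copy; leave input untouched
    labels = []
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        labels = [min(range(len(centroids)),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(point, centroids[c])))
                  for point in points]
        # Update step: move each centroid to the mean of its members
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical measures for six patients who share one diagnosis;
# no diagnostic or self-report labels are given to the algorithm
patients = [[1.0, 1.2], [0.8, 1.0], [1.2, 0.9],
            [4.0, 4.2], [4.2, 3.9], [3.8, 4.1]]

subgroups, centers = kmeans(patients, start_centroids=[patients[0], patients[3]])
```

The resulting subgroup labels could then be compared against self-reported symptoms in a separate validation step.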

Pain Biomarkers Based on Brain Imaging

Structural Biomarkers

Chronic pain brain biomarkers derived from structural MRI depend on the assumption that chronic pain conditions are associated with abnormalities in brain structure that either pre-date or result from pain chronification. Brain structure can be characterized in terms of grey matter features, which reflect the integrity of neuronal cell bodies, or white matter features, which reflect axonal integrity. Note that although white matter perturbations have been demonstrated in a variety of chronic pain conditions [46, 47, 48], to our knowledge, ML algorithms have yet to be applied to these data in order to generate chronic pain classifiers. Thus, we will focus on studies utilizing grey matter features.

Grey matter structure can be assessed in several ways. One common approach is the use of voxel-based morphometry (VBM) on high-resolution T1-weighted images. VBM produces estimates of grey matter density for each voxel and subject after warping to fit a standardized template brain [49, 50]. Another typical technique is to assess grey matter characteristics in cortical (thickness, volume, surface area, or mean curvature) or subcortical (volume) structures by assigning a neuroanatomical label to each brain voxel using a probabilistic atlas [51, 52]. Though other techniques and measures are available for assessing grey matter structure, efforts to construct biomarkers for chronic pain conditions based on grey matter structure have largely relied on grey matter density [29, 53] or thickness/volume/surface area/curvature [23••, 30].

Structural brain biomarkers for chronic pain build on a significant literature demonstrating both atrophy and hypertrophy in chronic pain patients in numerous pain-related brain regions. For example, chronic low back pain (cLBP) has been associated with lower grey matter density compared to controls in the dorsolateral prefrontal cortex (DLPFC), a region associated with cognitive/evaluative function and pain modulation [54]. At the same time, FM has been associated with both increased grey matter volume in striatum, orbitofrontal cortex, and cerebellum [55], and lower volume in anterior cingulate cortex, amygdala, thalamus, superior temporal gyrus, supplementary motor cortex, and insula [55, 56, 57, 58]. Although not directly contradicting one another, these studies differed somewhat with regard to regions showing significant differences between FM patients and controls, with differences in anterior cingulate cortex being the most consistently observed [56, 58]. The potential utility of these measures as features for classification algorithms to distinguish chronic pain patient groups depends on the assumption that reliable commonalities and differences in grey matter morphometrics can be detected between chronic pain conditions. It is also worth noting that the goal of chronic pain biomarker studies is distinct from between-group analyses of structural differences because they are focused specifically on identifying optimal combinations of features that best separate patient groups and/or healthy controls.

Classifier Studies Based on Structural Brain Abnormalities

ML classifier studies based on structural brain features have been conducted in several chronic pain conditions, including chronic pelvic pain (CPP), irritable bowel syndrome (IBS), FM, and cLBP, using samples of 26 to 160 participants. In each case, samples comprised approximately equal numbers of patients and controls. Performance of these potential biomarkers has differed substantially between studies, with sensitivity and specificity ranging from 65 to 81%. Although brain regions discriminating patient groups from normal controls varied, certain areas were more frequently reported. These included precentral gyrus (primary motor cortex), postcentral gyrus (primary somatosensory cortex), amygdala, and cuneus. In addition, many discriminating regions were convergent with those identified in previous studies focused on identifying structural differences between chronic pain patients and controls (e.g., DLPFC, amygdala, cingulate cortex, insula). For further detail regarding sample size, algorithm(s) used, classifier performance, and brain structures with the greatest contribution to performance for studies using structural brain biomarkers, see Table 2.

Table 2 Characteristics of machine learning-based classifier studies using brain imaging results

Functional Brain Biomarkers

Whereas high-resolution neuroanatomical data are the basis for studies proposing structural neuroimaging biomarkers of chronic pain, brain activity measures are used to derive functional neuroimaging biomarkers. Functional MRI (fMRI), which refers to noninvasive methods for measuring brain function, has two common variants. The first, blood oxygen level-dependent (BOLD) imaging, detects changes in cerebral blood oxygenation, and the second, arterial spin labeling (ASL), measures changes in regional cerebral blood flow (rCBF). Although neither of these methods directly measures neuronal activity, both are considered surrogate markers of such activity (for review of these technologies, see [61]). These methods excel in their ability to provide spatial information about where changes in BOLD response or rCBF occur over the course of time; however, they are intrinsically limited in temporal resolution (i.e., accuracy in determining exactly when the neuronal activation occurred) due to the physiological characteristics of blood oxygenation (BOLD) and/or extended repetition times (ASL; [62]). Although ASL has several theoretical advantages over BOLD regarding the measurement of brain function [63], most brain classification studies for chronic pain to date have relied on BOLD fMRI.

Because functional neuroimaging provides a global measure of brain activity, participants’ mental state during data collection is always important to note. For example, brain activity can be captured during a goal-directed task (e.g., participants are asked to continuously rate their levels of clinical pain or undergo acute painful stimulation), or during wakeful rest (i.e., resting state) in which participants are not instructed to engage in a particular task. Raw data collected during these scans typically undergo preprocessing (i.e., various image and signal correction techniques) to improve the ratio of desired signal to undesired noise [64]. Subsequently, preprocessed data are statistically analyzed at the individual-participant level to determine either (1) whether a significant change in signal occurred for a given region or across the whole brain (i.e., activation), or (2) the coherence in activation among spatially distinct brain regions over time (i.e., functional connectivity, FC). The resulting activation or FC values can be selected across the whole brain, or limited to hypothesis-driven regions of interest (i.e., a priori ROIs). In classification studies proposing functional neuroimaging biomarkers, individual-level activation or FC information is then entered into computational models, which classify individuals into distinct groups based on these values.
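To illustrate how FC values can become classifier features, the sketch below computes the Pearson correlation between the time series of every pair of regions for one participant. Pearson correlation is one common FC measure and is assumed here for illustration. The region names and signal values are hypothetical, and real pipelines operate on much longer preprocessed time series.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length time series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def connectivity_features(roi_timeseries):
    """Correlate every pair of regions; return {(roi_a, roi_b): r}."""
    names = sorted(roi_timeseries)
    return {(a, b): pearson_r(roi_timeseries[a], roi_timeseries[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

# Hypothetical preprocessed signals for three ROIs (arbitrary units)
roi_timeseries = {
    "anterior_insula": [0.1, 0.5, 0.9, 0.4, -0.2, -0.6, -0.1, 0.3],
    "anterior_cingulate": [0.2, 0.6, 1.0, 0.5, -0.1, -0.5, 0.0, 0.4],
    "cuneus": [0.3, -0.4, 0.2, -0.1, 0.5, -0.3, 0.1, -0.2],
}

features = connectivity_features(roi_timeseries)
for pair, r in features.items():
    print(pair, round(r, 2))
```

Each participant contributes one such set of pairwise values, which can then serve as that participant's feature vector in a classification model.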

Classifier Studies Based on Functional Brain Abnormalities

Compared to structural brain abnormalities, fewer studies have used fMRI activation or functional connectivity metrics as features in classification models for chronic pain patients. Using measures of functional brain activation in pain-related brain regions during repeated 14-s blocks of experimental electrical pain induction, Callan et al. [16•] achieved 92% accuracy, sensitivity, and specificity classifying cLBP patients and healthy controls. The two most informative regions reported by Callan et al. [16•] have been strongly implicated in pain discrimination (primary somatosensory cortex) and attention (inferior parietal cortex) [65]. In a separate study using resting state fMRI [59], RA and FM patients could be discriminated from healthy controls (HC) with 79 and 62% accuracy, respectively, using measures of functional connectivity between structures associated with stimulus evaluation (i.e., the “salience network” [SN]) and internal focus/mind wandering (i.e., “default mode network” [DMN]). These networks have been implicated in pain processing (SN) or reported as perturbed in chronic pain states (DMN). Interestingly, RA and FM patients could also be discriminated from each other with 79% accuracy, suggesting that, despite potential commonalities in neuronal plasticity, the functional neural correlates of certain chronic pain disorders may be sufficiently distinct to enable their separation [59]. Finally, in a recent fMRI study using ML [60••], the investigators could discriminate FM patients from HC with 93% accuracy based on their brain activation associated with painful pressure stimuli [66] and non-painful multisensory stimulation (e.g., simultaneous auditory, tactile, and visual stimuli). Overall, the rapid improvement of fMRI-based “pain signatures” has not only helped our mechanistic understanding of chronic pain but may also benefit the diagnosis of FM and other chronic pain conditions.

Summary

Taken together, a limited number of studies (see Table 2) have tested the ability of ML algorithms to discern chronic pain patients from healthy controls using structural or functional brain abnormalities. Nevertheless, interest in ML approaches for chronic pain diagnosis and classification is growing due to their purported potential to help elucidate chronic pain mechanisms, identify resilient or vulnerable subgroups, improve clinical decision making, predict treatment outcome, or augment [29] or even replace self-report (for critical discussion, see [8•]). As previously discussed, however, certain caveats apply to the practical application of biomarkers even where performance metrics are very high.

Conclusions

Interest in the development of clinical and ML-based biomarkers for chronic musculoskeletal pain conditions derived from structural and functional neuroimages has increased substantially in recent years. Current reports describe novel biomarkers capable of separating patient groups and healthy controls with accuracies ranging from 70 to 93%. Such studies provide valuable mechanistic information regarding both the unique and common neural correlates of these conditions, with the potential to highlight differences between both musculoskeletal pain patient groups and controls to which traditional statistical approaches may not be sensitive, or identify mechanistic subgroups within certain pain conditions. However, at this time, critical theoretical, practical, and ethical concerns preclude the replacement of patient self-report for the diagnosis of chronic musculoskeletal pain with brain imaging-derived biomarkers, as self-report remains the gold standard for pain assessment [8•].