6.1 Comparing Cognitive Screening Instruments

As is evident from the previous two chapters, there are many screening instruments, focusing on either cognitive or non-cognitive domains of function, which may be used in the assessment of patients presenting with cognitive complaints (interpreted in the context of the patient history and examination; Chap. 3). How does one decide which of these instruments should be used, which is optimal? The role of clinician preference should not be underestimated in this choice, but ideally it should be based on some rigorous method of comparison between tests.

One strategy, adopted in previous editions of this book (Larner 2012a:50–2, 97–8; 2014a:131–3, 189–91), is to construct “league tables” for various diagnostic metrics, e.g. likelihood ratios, diagnostic odds ratios, clinical utility indexes, and area under the receiver operating characteristic curve (AUC ROC) (see also Larner 2015a, Chap. 4). As was then pointed out, such “league table” comparisons relate to historical and usually non-overlapping patient cohorts, so direct comparisons between instruments cannot be made. Nonetheless, such “league tables” may give some clues as to the relative merits of the tests used in pragmatic studies.

Ideally, however, comparison requires head-to-head studies in which two (or more) instruments are administered, in random order and each blinded to the result of the other(s), to the same cohort of patients undergoing assessment. This is potentially a time-consuming strategy, and fatiguing for patients, although it has been used in some studies performed in the Cognitive Function Clinic (CFC) at the Walton Centre for Neurology and Neurosurgery (WCNN) in Liverpool to compare cognitive screening instruments (CSIs), e.g. MMSE vs. ACE (Larner 2005), MMSE vs. ACE-R (Larner 2009, 2013a), MMSE vs. MoCA (Larner 2012b), MMSE vs. MMP (Larner 2012c), MACE vs. MMSE (Larner 2015a, b), and MACE vs. MoCA (Larner 2017a). Tests should ideally be administered sequentially in counter-balanced order to avoid bias, although this is not possible in some circumstances (e.g. because MMSE is incorporated into both ACE and ACE-R; Sects. 4.1.5.1 and 4.1.5.3).

Other methods of comparison may be based on:

  • measures of discrimination (Sect. 2.3.2), as documented for individual screening instruments (Chaps. 4 and 5), such as weighted comparison or Q* index;

  • measures based on the reference standard diagnosis, such as effect size (Cohen’s d);

  • or measures of association (non-diagnostic), such as correlation, test of agreement (Cohen’s kappa statistic), or Bland-Altman limits of agreement (Sect. 2.3.3).

6.1.1 Weighted Comparison (WC) and Equivalent Increase (EI)

The shortcomings of AUC ROC as an overall measure of diagnostic test accuracy have been emphasized (Mallett et al. 2012), specifically the fact that this unitary metric combines test accuracy over a range of thresholds which may be both clinically relevant and clinically nonsensical. It has been argued that the most relevant and applicable presentation of diagnostic accuracy test results should include interpretation in terms of patients, clinically relevant values for test thresholds, disease prevalence, and clinically relevant relative gains and losses (Mallett et al. 2012).

One such index is the weighted comparison (WC) measure described by Moons et al. (1997), which gives weighting to the difference in sensitivity and specificity of two tests and takes into account the relative clinical misclassification costs of false positive diagnosis and also disease prevalence. This may be expressed by the equation:

$$ \mathrm{WC}=\varDelta \mathrm{sensitivity}+\left[\frac{1-\pi }{\pi}\times \mathrm{relative}\ \mathrm{cost}\left(\mathrm{FP}/\mathrm{TP}\right)\times \varDelta \mathrm{specificity}\right] $$

where π = prevalence; FP = false positives; and TP = true positives.

The relative misclassification cost (FP/TP) is a parameter which seeks to define how many false positives a true positive is worth. Clearly, such a “cost” is very difficult to estimate. In the context of diagnostic accuracy studies for CSIs, it may be argued that high test sensitivity to identify all true positives, with the accompanying risk of false positives (e.g. emotional consequences for a patient of an incorrect diagnosis, and/or inappropriate treatment), is more acceptable than tests with low sensitivity but high specificity which risk false negative diagnoses (i.e. missing true positives, and possibly the opportunity to initiate symptomatic or disease-modifying treatment). This argument is of course moot in the current absence of disease-modifying therapies for most causes of dementia or MCI. For studies in CFC, FP/TP was arbitrarily set at 0.1, following previous authors (Mallett et al. 2012), reflecting the desire for high test sensitivity.

Of note, the WC equation used here (Moons et al. 1997) does not take into account false negative diagnoses, which of course have their own potential cost. However, another index, addressing whether screening tests are “costworthy”, also incorporates the benefit (advantage) of TP test for an identified individual and the cost (harm) of FP test for a wrongly identified individual but without reference to false negatives (Ashford 2008).

To aid interpretation, another parameter may be calculated using WC, namely the equivalent increase (EI) in TP patients per 1000, using the equation:

$$ \mathrm{EI}=\mathrm{WC}\times \mathrm{prevalence}\times 1000 $$

As this is a measure of patient numbers, results are rounded to integer values.
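The two equations above can be sketched in a few lines of Python. This is a minimal illustration rather than the method used in the cited studies: the function and argument names are hypothetical, Δ is assumed to mean the new test minus the comparator, proportions are assumed to be expressed on a 0 to 1 scale, and the input figures are invented purely to show the arithmetic.

```python
def weighted_comparison(delta_sens, delta_spec, prevalence, relative_cost=0.1):
    """Weighted comparison (WC) of two tests.

    delta_sens, delta_spec: sensitivity and specificity of the new test minus
    those of the comparator, as proportions (assumed convention).
    relative_cost: relative misclassification cost (FP/TP); 0.1 is the value
    the text reports for the CFC studies.
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    odds_against = (1 - prevalence) / prevalence  # the (1 - pi) / pi term
    return delta_sens + odds_against * relative_cost * delta_spec


def equivalent_increase(wc, prevalence, per=1000):
    """Equivalent increase (EI) in true positives per `per` patients, rounded."""
    return round(wc * prevalence * per)


# Invented figures, not taken from any table in this chapter.
wc = weighted_comparison(delta_sens=0.10, delta_spec=-0.05, prevalence=0.4)
ei = equivalent_increase(wc, prevalence=0.4)
print(f"WC = {wc:.4f}, EI = {ei} per 1000")
```

With these invented inputs the gain in sensitivity outweighs the down-weighted loss in specificity, so WC is positive (net benefit). Raising `relative_cost` gives more weight to the specificity difference and may reverse the sign.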

Weighted comparison and calculation of equivalent increase have been undertaken for a number of the CSIs examined in CFC. These have compared patient performance-related CSIs: MMSE with ACE-R, MoCA, TYM, MMP (Larner 2013b), 6CIT (Abdel-Aziz and Larner 2015) and MACE (Larner 2015a), as well as TYM against ACE-R, and MoCA against MACE (Larner 2016a, 2017a). Comparison of performance-related CSIs and informant scales has also been examined: AD8 with MMSE and 6CIT (Larner 2015c). Most comparisons have been for the diagnosis of dementia, but some also for the diagnosis of MCI. The figures for sensitivity, specificity, AUC ROC, and prevalence of dementia/MCI/cognitive impairment were extracted from each study, and Δsensitivity, Δspecificity, WC and EI were then calculated (Tables 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8 and 6.9).

Table 6.1 Weighted comparison ACE-R vs. MMSE for diagnosis of dementia (adapted from Larner 2013b; data from Larner 2009, 2013a)
Table 6.2 Weighted comparison MoCA vs. MMSE for diagnosis of any cognitive impairment (adapted from Larner 2013b; data from Larner 2012b)
Table 6.3 Weighted comparison TYM vs. MMSE for diagnosis of dementia (adapted from Larner 2013b; data from Hancock and Larner 2011)
Table 6.4 Weighted comparison MMP vs. MMSE for diagnosis of dementia (adapted from Larner 2013b; data from Larner 2012c)
Table 6.5 Weighted comparison 6CIT vs. MMSE: (a) for diagnosis of dementia vs. no dementia (n = 150); (b) for diagnosis of dementia vs. MCI (n = 65); (c) for diagnosis of MCI vs. subjective memory complaint (n = 128) (adapted from Abdel-Aziz and Larner 2015)
Table 6.6 Weighted comparison MACE (cut-off ≤25/30) vs. MMSE (cut-off ≤24/30): (a) for diagnosis of dementia vs. no dementia (n = 135); (b) for diagnosis of MCI vs. subjective memory complaint (n = 111) (adapted from Larner 2015a)
Table 6.7 Weighted comparison TYM vs. ACE-R (adapted from Hancock and Larner 2011, corrected from Larner 2014a:137)
Table 6.8 Weighted comparison MACE (cut-off ≤25/30) vs. MoCA (cut-off ≥26/30): (a) for diagnosis of dementia vs. no dementia (n = 260); (b) for diagnosis of MCI vs. subjective memory complaint (n = 217) (adapted from Larner 2017a)
Table 6.9 Weighted comparison AD8 vs. (a) MMSE (n = 125), and (b) 6CIT (n = 169) for diagnosis of cognitive impairment vs. no cognitive impairment (adapted from Larner 2015c)

The dataset from a patient cohort seen in an old age psychiatry memory clinic (Hancock and Larner 2015; Sect. 5.2.4) permitted a further weighted comparison of MMSE and ACE-R in an independent cohort (n = 181) to be undertaken, with results akin to those from the CFC study (Larner and Hancock 2014; compare Tables 6.1 and 6.10).

Table 6.10 Weighted comparison ACE-R vs. MMSE for diagnosis of dementia (data adapted from Larner and Hancock 2014)

The various WC and EI findings are summarised in Table 6.11 (Larner 2015d:105). The data suggest that for the diagnosis of dementia ACE-R is superior to MMSE and TYM; for diagnosis of any cognitive impairment MoCA and AD8 are superior to MMSE, and AD8 is superior to 6CIT; and for diagnosis of MCI MACE is superior to MMSE. All WC evaluations were in the same direction as the available values for AUC ROC, i.e. favoured ACE-R, MoCA, MMP, and 6CIT vs. MMSE, favoured MMSE vs. TYM, and favoured ACE-R vs. TYM (Larner 2013b).

Table 6.11 Summary of weighted comparison and equivalent increase between CSIs for identification of (a) dementia vs. no dementia, (b) any cognitive impairment (= dementia + MCI) vs. no cognitive impairment, and (c) MCI vs. subjective memory complaint (adapted from Larner 2015d:105)

The calculation of WC and EI is largely dependent on differences in test sensitivity, which are ultimately dependent on the test cut-off used, like many other measures of discrimination derived from the 2 × 2 table (Sect. 2.3.2). Choice of a different method for determining test cut-off may potentially change the outcome of weighted comparisons, from net benefit to net loss (Larner 2015e).

6.1.2 Q* Index

Another potentially useful summary measure denoting the diagnostic value of a screening instrument is the Q* index derived from the ROC curve (Walter 2002). Q* index is defined as the “point of indifference on the ROC curve”, where the sensitivity and specificity are equal, or, in other words, where the probabilities of incorrect test results are equal for disease cases and non-cases (i.e. indifference between false positive and false negative diagnostic errors, with both assumed to be of equal value/cost). The Q* index is that point in ROC space which is closest to the ideal top left-hand (“northwest”) corner of the ROC curve, where the anti-diagonal through ROC space intersects the ROC curve (Fig. 6.1).
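The definition of Q* as the point where sensitivity equals specificity can be illustrated with a short Python sketch. It assumes, for simplicity, that the ROC curve is available only as a handful of empirical (sensitivity, specificity) pairs ordered by test cut-off, and that the crossing point is found by linear interpolation between neighbouring cut-offs. That interpolation is an assumption made here for illustration, not necessarily the derivation used by Walter (2002) or in the CFC studies, and the data points are invented.

```python
def q_star(roc_points):
    """Estimate Q* from (sensitivity, specificity) pairs ordered by cut-off.

    Walks along the empirical curve and linearly interpolates to the point
    at which sensitivity equals specificity (the anti-diagonal crossing).
    """
    for (sens0, spec0), (sens1, spec1) in zip(roc_points, roc_points[1:]):
        diff0 = sens0 - spec0
        diff1 = sens1 - spec1
        if diff0 == 0:
            return sens0
        if diff1 == 0:
            return sens1
        if diff0 * diff1 < 0:  # sign change: the crossing lies in this segment
            fraction = diff0 / (diff0 - diff1)
            return sens0 + fraction * (sens1 - sens0)
    raise ValueError("sensitivity and specificity never cross in these points")


# Invented points, ordered from the most specific to the most sensitive cut-off.
points = [(0.60, 0.95), (0.80, 0.90), (0.90, 0.80), (0.97, 0.60)]
print(f"Q* = {q_star(points):.2f}")
```

Because Q* is a single number between 0 and 1 that is read as both a sensitivity and a specificity, it may be compared directly across instruments.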

Fig. 6.1 (a) Typical receiver operating characteristic (ROC) curve or plot with diagonal or chance line (data for ACE-R adapted from Larner 2009, 2013b, see Fig. 4.4) reprinted with permission; (b) typical ROC curve (same data points as a) with anti-diagonal line: where the lines cross in ROC space indicates equal test sensitivity and specificity, by definition the Q* index (the point closest to the ideal top left-hand corner of the ROC curve) (Larner 2015f) reprinted with permission

Q* index was derived for a number of CSIs examined in pragmatic diagnostic test accuracy studies undertaken in CFC (Larner 2015f; Table 6.12). Q* index ranged from 0.88 for ACE-R to 0.76 for MACE. The ranking of Q* index for the various CSIs examined paralleled that for AUC ROC, with ACE-R ranked highest and MACE lowest using either parameter.

Table 6.12 Summary of Q* index for various CSIs compared with area under the receiver operating characteristic curve (AUC ROC) (adapted from Larner 2015f)

Comparing the Q* index cut-off point with cut-offs defined in CSI index studies, the former was always lower (and hence less sensitive but more specific) than the latter. Comparing test sensitivity and specificity at the Q* index cut-off point showed that for all CSIs with the exception of ACE-R, Q* index-derived test cut-offs lay between those derived from maximal correct classification accuracy and maximal Youden index. Hence, if Q* index point were used as the test cut-off, it was more sensitive (and less specific) than if using the maximal correct classification accuracy cut-off, and less sensitive (and more specific) than if using the maximal Youden index cut-off. Q* index cut-offs reduced the sensitivity of very sensitive tests such as the ACE-R, MoCA, TYM and MACE ≤25/30, but improved sensitivity for very specific tests such as MACE ≤21/30 (Larner 2015f).

If a metric to compare diagnostic tests is required, Q* index has merit and, since it is based on sensitivity and specificity, may perhaps be preferred to AUC ROC results as a more intuitive measure.

6.1.3 Comparing Test Speed Versus Test Accuracy

The trade-off between speed and accuracy in the performance of voluntary movements, such that more accurate movements are performed more slowly, has long been recognised (Woodworth 1899). This speed-accuracy trade-off may perhaps apply to any task, and since speed is inversely proportional to time it may also be formulated as a time-accuracy trade-off, longer times being required for greater accuracy.

Is there a trade-off between CSI diagnostic accuracy and administration time? In other words, are shorter CSIs less accurate than longer ones which may sample more cognitive domains and/or in greater depth? This was examined for a number of CSIs used in CFC by comparing parameters of test diagnostic accuracy against duration of test administration. The latter is not routinely measured in the clinical setting (there are exceptions when a stopwatch has been used, but this is usually for research purposes; Lees et al. 2017), although approximate timings can be given (see Sect. 2.1.3, Box 2.1). Hence, more easily accessible surrogate measures of test duration were used, namely either the overall test score or the total number of items/questions in the test (Larner 2015g, h).

Two measures of diagnostic accuracy, the correct classification accuracy or overall test accuracy (defined as the sum of true positives and true negatives divided by the total number of patients tested; Sect. 2.3.2, Box 2.3) and the area under the receiver operating characteristic curve (AUC ROC), were plotted (= output or effect, hence the dependent variable, y axis) against overall test score and against the total number of items/questions in the test (= inputs or causes, hence independent variables, x axis). Correlations between correct classification accuracy and AUC ROC and the surrogate time measures were also calculated.

Data (Table 6.13) were extracted from several pragmatic prospective diagnostic test accuracy studies examining nine performance-based CSIs: Addenbrooke’s Cognitive Examination (ACE), Addenbrooke’s Cognitive Examination-Revised (ACE-R), DemTect, Mini-Addenbrooke’s Cognitive Examination (MACE), Mini-Mental Parkinson (MMP), Mini-Mental State Examination (MMSE), Montreal Cognitive Assessment (MoCA), Six-item Cognitive Impairment Test (6CIT), and the Test Your Memory (TYM) test (see Chap. 4 for index studies).

Table 6.13 Approximate administration time for cognitive screening instruments (CSIs) and surrogate measures thereof (total test score, total number of test items/questions) with diagnostic accuracy (overall correct classification and area under ROC curve) for diagnosis of dementia (adapted from Larner 2015h)

Correct classification accuracy was positively correlated with both total test score (r = 0.58) and with total number of test items/questions (r = 0.66). Both correlations were classified as moderate, and respectively either did not reach statistical significance (t = 1.89, df = 7, p > 0.1) or showed a trend towards significance (t = 2.33, df = 7, 0.1 > p > 0.05).

AUC ROC was positively correlated with total test score (r = 0.83; Fig. 6.2) and with total number of test items/questions (r = 0.79; Fig. 6.3). Both correlations were classified as high and both reached statistical significance (t = 3.86, df = 7, p < 0.01; and t = 3.46, df = 7, p < 0.02, respectively).
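The t values quoted for these correlations follow from the standard test of a Pearson coefficient against zero, t = r√(n − 2)/√(1 − r²) with n − 2 degrees of freedom, where n is the number of paired observations (here nine CSIs, hence df = 7). A minimal Python sketch is given below; the function name is hypothetical, and because the coefficients in the text are rounded to two decimal places the recomputed t values may differ slightly in the second decimal from those reported.

```python
import math


def correlation_t(r, n):
    """t statistic and degrees of freedom for testing a Pearson r against zero.

    r: correlation coefficient; n: number of paired observations.
    """
    if n < 3:
        raise ValueError("at least three paired observations are needed")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    df = n - 2
    t = r * math.sqrt(df / (1 - r * r))
    return t, df


# Rounded coefficients quoted in the text for the nine CSIs.
for label, r in [("accuracy vs. total score", 0.58),
                 ("accuracy vs. number of items", 0.66),
                 ("AUC ROC vs. total score", 0.83),
                 ("AUC ROC vs. number of items", 0.79)]:
    t, df = correlation_t(r, n=9)
    print(f"{label}: r = {r:.2f}, t = {t:.2f}, df = {df}")
```

The resulting t is then referred to a t distribution with the stated degrees of freedom to obtain the p value.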

Fig. 6.2 Scatter plot of area under ROC curve (= measure of accuracy) versus total test score (= surrogate measure of test administration time) (adapted from Larner 2015h) reprinted with permission

Fig. 6.3 Scatter plot of area under ROC curve (= measure of accuracy) versus total number of test items/questions (= surrogate measure of test administration time) (adapted from Larner 2015h) reprinted with permission

These analyses suggested that there is a trade-off for CSIs between two surrogate measures of duration of test administration and two measures of test diagnostic accuracy. Investing more time during the clinical encounter in administering longer CSIs might therefore pay dividends in terms of improved accuracy of dementia diagnosis. In light of these findings it might be argued on pragmatic grounds that a policy of longer outpatient clinic appointments in clinic templates (Sect. 2.1.3) for patients with cognitive complaints (e.g. 45–60 min), as compared to general neurology outpatient appointments (e.g. 15–30 min), is justified in order to permit adequate time for the administration of longer CSIs to facilitate the desired outcome of more accurate diagnosis. To borrow informally an analogy from science and engineering, there may be a lower “signal to noise ratio” when using longer CSIs (where the delivered strength of “signal” is related to statistical significance, and “noise” to standard deviation) due to their increased “bandwidth” (i.e. broader range of test scores or items) (Larner 2015h). The greater neuropsychological coverage of longer CSIs, one of the desiderata suggested by expert consensus (Malloy et al. 1997), may reduce test ceiling and floor effects.

6.1.4 Meta-Analysis: ACE and ACE-R

Meta-analysis is now a standard statistical approach to combine the results of multiple studies to improve estimates of effect size or resolve uncertainties when individual results disagree.

A meta-analysis of studies was undertaken to better understand ACE (Mathuranath et al. 2000; Sect. 4.1.5.1) and ACE-R (Mioshi et al. 2006; Sect. 4.1.5.3) utility (Larner and Mitchell 2014), using methods similar to those applied in previous meta-analyses of MMSE diagnostic accuracy (Mitchell 2009, 2013, 2017).

Literature search to end May 2013 identified 29 reports of studies of the ACE, 13 using the English version and 16 using non-English versions. All the studies identified were from high prevalence specialist secondary care settings. After application of exclusion criteria, 5 ACE studies were deemed suitable for meta-analysis (Mathuranath et al. 2000; Garcia-Caballero et al. 2006; Larner 2007a; Stokholm et al. 2009; Yoshida et al. 2011). Across the 5 included studies there were 529 cases of dementia out of a population of 1090, a prevalence of 49%. There was no evidence of publication bias (Harbord bias = −8.23, 95% CI = −29.1 to 12.6, p = 0.37; Harbord et al. 2006).

Pooling the raw data from these studies demonstrated that 512 out of 529 cases were correctly identified using the ACE, giving a pooled sensitivity of 0.968. On meta-analytic weighting this was corrected to 0.969 (95% CI = 0.927 to 0.994). Non-cases (377) were correctly ruled-out from a sample of 561 comparison subjects to give a pooled specificity of 0.672. On meta-analysis this was corrected to 0.774 (95% CI = 0.583 to 0.918). Unadjusted the PPV was therefore 0.747 and the NPV 0.955 (Larner and Mitchell 2014).

Literature search to end May 2013 identified 31 reports of studies of the ACE-R, 16 using the English version and 15 using non-English versions. All the studies identified were from high prevalence specialist secondary care settings. After application of exclusion criteria, 5 studies were deemed suitable for meta-analysis (Mioshi et al. 2006; Larner 2009, 2013a; Alexopoulos et al. 2010; Yoshida et al. 2012; Dos Santos Kawata et al. 2012). Across the 5 included studies there were 560 cases of dementia out of a population of 1156, a dementia prevalence of 48%. Harbord bias was not significant (0.097, 95% CI = −18.95 to 19.14, p = 0.99; Harbord et al. 2006).

Pooling the raw data from these studies demonstrated that 514 out of 560 cases were correctly identified using the ACE-R, giving a pooled sensitivity of 0.918. This was adjusted on meta-analysis to 0.957 (95% CI = 0.922 to 0.982). Non-cases (383) were correctly ruled-out from a sample of 596 comparison subjects to give a pooled specificity of 0.643. This was corrected on meta-analysis to 0.875 (95% CI = 0.638 to 0.994). Unadjusted the PPV was therefore 0.707 and the NPV 0.893 (Larner and Mitchell 2014).
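The unadjusted (raw pooled) figures quoted for the ACE-R follow directly from the pooled 2 × 2 counts, as the short Python sketch below illustrates. It reproduces only the simple pooling of raw counts; the meta-analytically weighted estimates require the individual study data and a dedicated method, and are not attempted here. The function name is hypothetical.

```python
def two_by_two_metrics(tp, fn, tn, fp):
    """Basic diagnostic parameters from pooled 2 x 2 counts."""
    total = tp + fn + tn + fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
    }


# Raw pooled ACE-R counts quoted above: 514 of 560 cases detected,
# 383 of 596 non-cases correctly ruled out.
acer = two_by_two_metrics(tp=514, fn=560 - 514, tn=383, fp=596 - 383)
for name, value in acer.items():
    print(f"{name}: {value:.3f}")
```

Because PPV and NPV depend on prevalence, these unadjusted values apply to high prevalence settings such as those of the included studies.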

Combining the studies (n = 9) which used the MMSE against either the ACE (n = 5) or ACE-R (n = 4) generated a pooled MMSE sensitivity of 0.920 (95% CI = 0.849 to 0.968) and specificity of 0.869 (95% CI = 0.805 to 0.921) (Larner and Mitchell 2014), inverting the pattern of low sensitivity and high specificity typically seen in diagnostic test accuracy studies of MMSE (see Sect. 4.1.1; Tables 4.14.7).

6.1.5 Effect Size (Cohen’s d)

Effect size may be denoted by a variety of summary indices, of which Cohen’s d is probably the most commonly used in the medical literature (Cohen 1988). This parameter is calculated as the difference between the means of two groups divided by the pooled (weighted) standard deviation of the groups (see Sect. 2.3.2, Fig. 2.2). Cohen (1988, 1992) suggested that effect sizes of 0.2 to 0.3 were small, 0.5 medium, and ≥0.8 large.
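A minimal Python sketch of this calculation is given below. It assumes the usual pooled standard deviation, weighted by group degrees of freedom; the exact weighting used in the cited studies is not stated here, so this should be read as one common formulation. The function name and the example figures are hypothetical.

```python
import math


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d: difference in group means over the pooled standard deviation.

    The pooled variance weights each group's variance by its degrees of
    freedom (n - 1), an assumption made for this illustration.
    """
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


# Invented test scores: non-demented group versus demented group.
d = cohens_d(mean1=25, sd1=4, n1=30, mean2=20, sd2=4, n2=30)
print(f"Cohen's d = {d:.2f}")
```

In this invented example a 5-point difference in means against a standard deviation of 4 gives d = 1.25, which would be classified as large on Cohen’s scale.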

Effect size (Cohen’s d) for a number of the CSIs examined in CFC has been calculated (Larner 2014b, 2016b) based on data from previous pragmatic diagnostic accuracy studies which examined the MMSE, MMP, 6CIT, MoCA, TYM, ACE-R, AD8, and MACE. Mean test scores for demented and non-demented groups, and for mild cognitive impairment and subjective memory complaint groups, along with their standard deviations, were applied to the Cohen’s d formula to calculate effect sizes.

Comparing patients with dementia and no dementia suggested large but similar effect sizes for all of the CSIs examined (Table 6.14). These values suggested a consistent difference in test scores between demented and non-demented individuals.

Table 6.14 Effect size (Cohen’s d) for diagnosis of dementia versus no dementia (MCI + SMC) (adapted from Larner 2014b, 2015d:99)

Comparing patients with mild cognitive impairment and no dementia (subjective memory complaint) suggested smaller effect sizes for all of the CSIs examined than in the dementia versus no dementia distinction (Table 6.15). However, effect sizes for the MoCA and MACE were larger than for other tests. These values suggested a consistent difference in test scores between MCI and non-demented individuals, but with MoCA and MACE performing best. Since MoCA was designed to identify MCI cases (Nasreddine et al. 2005) this observation might be anticipated.

Table 6.15 Effect size (Cohen’s d) for diagnosis of mild cognitive impairment versus subjective memory complaint (adapted from Larner 2014b, 2015d:99, 2016b)

Looking at subgroups of older people (age ≥ 65 years) suggested larger effect sizes in this at-risk group in these cohorts (Table 6.16, Fig. 6.4; Wojtowicz and Larner 2017).

Table 6.16 Effect size (Cohen’s d) for whole cohorts and for older (≥65 years) subgroups for diagnoses of dementia versus MCI and MCI versus SMC (adapted from Wojtowicz and Larner 2017)
Fig. 6.4 Effect sizes (Cohen’s d) for whole cohorts and for older (≥65 years) subgroups for diagnoses of dementia versus MCI and MCI versus SMC (Wojtowicz and Larner 2017) reprinted with permission

6.1.6 Correlation

Calculation of a correlation coefficient (e.g. Pearson product moment correlation coefficient) between the scores of different screening instruments applied to the same population is often performed. In diagnostic test accuracy studies, unlike research into disease aetiology, correlation is not necessarily taken to imply causation (and correlation is never equivalent to causality); hence any correlates of the target disorder may be potentially diagnostically useful, independent of any causal interpretation. Correlation is a measure of the strength of association between datasets, but high correlations are sometimes incorrectly assumed to give a measure of how well tests agree. Whilst the potential of a new test may be suggested if it correlates with an existing test, indicating concurrent validity, correlation is not agreement. Indeed, high correlation may in fact mask lack of agreement (Bland and Altman 1986; see Sect. 6.1.8).

Examples of correlations between different CSI scores and between CSI scores and informant and/or non-cognitive screening instruments from studies undertaken in CFC are shown in Tables 6.17 and 6.18. Unsurprisingly CSI scores are generally highly correlated, indicating concurrent validity, whereas CSI and non-CSI scores are generally less well correlated, indicating that these tests may examine different constructs.

Table 6.17 Summary of correlation coefficients for different cognitive screening instruments examined in pragmatic diagnostic test accuracy studies in CFC (adapted from Larner 2015d:100)
Table 6.18 Summary of correlation coefficients for different cognitive and informant and/or non-cognitive screening instruments examined in pragmatic diagnostic test accuracy studies in CFC (adapted from Larner 2015d:101)

6.1.7 Cohen’s Kappa Statistic: Test of Agreement

The “test of agreement” or Cohen’s kappa statistic (Cohen 1960) compares observed diagnostic agreement with that expected by chance alone (i.e. chance corrected agreement; see Sect. 2.3.3). This metric has sometimes been used to compare diagnostic tests (Table 6.19), although it is a measure of precision rather than of accuracy.
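For two tests each giving a positive or negative result in the same patients, the chance-corrected agreement may be sketched in Python as follows, using κ = (observed agreement − expected agreement)/(1 − expected agreement). The function and argument names are hypothetical and the counts are invented for illustration.

```python
def cohens_kappa(both_pos, a_only, b_only, both_neg):
    """Cohen's kappa for two dichotomous tests applied to the same patients.

    both_pos / both_neg: patients on whom the tests agree;
    a_only / b_only: patients positive on test A only or on test B only.
    """
    n = both_pos + a_only + b_only + both_neg
    observed = (both_pos + both_neg) / n
    a_pos = (both_pos + a_only) / n
    b_pos = (both_pos + b_only) / n
    # Agreement expected by chance if the two tests were independent.
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is complete")
    return (observed - expected) / (1 - expected)


# Invented counts for 100 patients.
kappa = cohens_kappa(both_pos=40, a_only=10, b_only=5, both_neg=45)
print(f"kappa = {kappa:.2f}")
```

Here the tests agree in 85% of patients, but half of that agreement would be expected by chance, so kappa is 0.70 rather than 0.85.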

Table 6.19 Summary of Cohen’s kappa statistic (test of agreement) for different cognitive screening instruments examined in pragmatic diagnostic test accuracy studies (adapted from Larner 2015d:102)

6.1.8 Bland-Altman Limits of Agreement

As previously mentioned (Sect. 6.1.6), correlation between test scores may indicate concurrent validity, but correlation is not agreement and indeed high correlation may actually mask lack of agreement. Bland and Altman (1986) suggested a method which provides a measure of agreement between tests by estimating how far apart the two values are on average and putting an interval around this (see Sect. 2.3.3). The limits of agreement thus defined indicate how closely two methods agree, but what is accepted as “close” remains a clinical rather than a statistical judgement. The Bland-Altman methodology is a simple way to evaluate the bias between two tests from the mean of their differences, and it avoids the potentially erroneous conclusions that may be drawn from correlation analyses.
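The calculation may be sketched in Python as below. It assumes the conventional form of the limits of agreement, namely the mean difference plus or minus 1.96 standard deviations of the paired differences, and it assumes that the two tests are scored on the same scale; the confidence intervals around the limits reported in the cited study are not reproduced. The function name and the paired scores are invented for illustration.

```python
import statistics


def limits_of_agreement(scores_a, scores_b, z=1.96):
    """Mean difference (bias) and limits of agreement for paired test scores.

    z = 1.96 gives the conventional interval expected to contain about 95%
    of differences, assuming they are approximately normally distributed.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired, one per patient on each test")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff, mean_diff - z * sd_diff, mean_diff + z * sd_diff


# Invented paired scores for four patients on two tests.
bias, lower, upper = limits_of_agreement([10, 12, 14, 16], [9, 13, 12, 17])
print(f"bias = {bias:.2f}, limits of agreement = {lower:.2f} to {upper:.2f}")
```

A Bland-Altman plot simply displays each difference against the mean of the corresponding pair of scores, with horizontal lines at the bias and at the two limits.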

Bland-Altman methodology was used to calculate limits of agreement for three brief CSIs (MMSE, MoCA, MACE) which were contrasted with Pearson product moment correlation coefficients between test scores (Larner 2016c). Mean differences between test scores were small (<1 for MACE versus MoCA, up to 4 for MMSE versus MACE) but the calculated limits of agreement were broad (>10 points for MMSE versus MoCA and MACE versus MoCA; and >15 points for MMSE versus MACE). Test scores were highly correlated (r > 0.8) in all the studies (Table 6.20). A Bland-Altman plot of difference against mean for the comparison of MMSE versus MACE is shown in Fig. 6.5.

Table 6.20 Limits of agreement (with 95% confidence intervals) and Pearson’s product moment correlation coefficients (r) for different cognitive screening instruments (adapted from Larner 2016c)
Fig. 6.5 Bland-Altman plot of difference against mean for MMSE versus MACE (Larner 2016c; data from Larner 2015a, b; n = 244) reprinted with permission

6.2 Combining Screening Instruments

The expectation that a single screening instrument might be entirely adequate for the diagnosis of a multidimensional construct such as the dementia syndrome, with the changes in symptomatology which occur in that syndrome over time, is likely to be wishful thinking. Different methods in staging dementia are recognised to give different results, with moderate to fair correlation of clinical scales and MMSE but a much greater dispersion of functional capacity as measured by the Instrumental Activities of Daily Living (IADL) Scale, indicating that factors other than dementia severity influence functional capacity (Juva et al. 1994). Hence combinations of tests, perhaps addressing the different domains (cognitive; functional, behavioural, global; see Chaps. 4 and 5 respectively) might be desirable, as may combinations of patient and informant information. Combinations of test results have been examined on occasion and found to give “added value” in some instances (e.g. Mackinnon et al. 2003; De Lepeleire et al. 2005).

As previously mentioned (see Sect. 2.3.2), when using screening instruments there is always a balance or trade-off to be struck between test sensitivity and specificity, with the chosen test cut-off being determined by the needs of the particular clinical situation. To optimise this trade-off, combinations of tests may be required. For example, ACE VLOM ratio showed poor sensitivity but good specificity for the diagnosis of FTLD (see Sect. 4.1.5.2), principally because cases of bvFTD were missed (Bier et al. 2004), so combination with the Frontal Assessment Battery (FAB; see Sect. 4.2.1), which is highly sensitive for bvFTD, might be appropriate. FAB may therefore be useful as a situation-specific clinical assessment when a diagnosis of bvFTD is being considered. Use of the semantic index subscore of the ACE is appropriate if semantic dementia is being considered in the differential diagnosis (Sect. 4.1.5.2). Studies in CFC have not encouraged the view that the Ala subscore is useful prospectively for the diagnosis of DLB (Sect. 4.1.1.1), likewise the modified Ala (Sect. 4.1.5.2) and MoCA Ala (Sect. 4.1.8.1). The Mayo Fluctuations Questionnaire might be considered if DLB or PDD enters into the differential diagnosis (Ferman et al. 2004; Larner 2012d; see Sect. 5.4.3).

Following the methodology of Flicker et al. (1997), tests may be combined either in series (both tests required to be positive before a diagnosis of dementia is made: the “And” rule) or in parallel (either test positive sufficient for a diagnosis of dementia to be made: “Or” rule); in other words, respectively, sequency and simultaneity.
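The effect of the two rules on sensitivity and specificity can be illustrated with a short Python sketch. This is not the analysis code of Flicker et al. (1997) or of the CFC studies; the function names are hypothetical and the eight patients are invented so that the pattern is easy to follow.

```python
def combine_tests(test_a, test_b, rule):
    """Combine two lists of positive/negative results, one entry per patient.

    rule "and": positive only if both tests are positive (series).
    rule "or": positive if either test is positive (parallel).
    """
    if rule == "and":
        return [a and b for a, b in zip(test_a, test_b)]
    if rule == "or":
        return [a or b for a, b in zip(test_a, test_b)]
    raise ValueError('rule must be "and" or "or"')


def sensitivity_specificity(positive, diseased):
    """Sensitivity and specificity of test results against the reference diagnosis."""
    tp = sum(p and d for p, d in zip(positive, diseased))
    tn = sum((not p) and (not d) for p, d in zip(positive, diseased))
    n_diseased = sum(diseased)
    n_healthy = len(diseased) - n_diseased
    return tp / n_diseased, tn / n_healthy


# Invented data: four patients with dementia followed by four without.
diseased = [True, True, True, True, False, False, False, False]
test_a = [True, True, False, False, True, False, False, False]
test_b = [True, False, True, False, False, True, False, False]

for rule in ("and", "or"):
    sens, spec = sensitivity_specificity(combine_tests(test_a, test_b, rule), diseased)
    print(f'"{rule}" rule: sensitivity = {sens:.2f}, specificity = {spec:.2f}')
```

In this invented example the “And” rule never yields a false positive but misses most cases, whereas the “Or” rule detects more cases at the price of more false positives, which is the general pattern to be expected when tests are combined in series or in parallel.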

6.2.1 Combining Cognitive Screening Instruments: MMSE and MoCA

The combination of the MMSE and the Clock Drawing Test (“Mini-clock”) has been reported to improve detection of mild AD and MCI (Cacho et al. 2010). Since MMSE (Folstein et al. 1975) has high specificity (see Sect. 4.1.1) and the Montreal Cognitive Assessment (MoCA; Nasreddine et al. 2005) has high sensitivity (see Sect. 4.1.8) for dementia diagnosis, the effect of combining these two cognitive screening instruments has been investigated (Larner 2012b).

In patients administered both MoCA and MMSE (n = 148), combining the tests in series (“And” rule) gave results almost identical to those using the MMSE alone, whilst combining tests in parallel (“Or” rule) gave results almost identical to those using the MoCA alone (Table 6.21; compare with Tables 4.23 and 4.24). In other words, MoCA “and” MMSE was less sensitive, missing a significant proportion of the dementia and MCI cases (35% of cases) but with few false positives (greater specificity), whereas MoCA “or” MMSE identified almost all the cases of dementia and MCI (greater sensitivity) but with a large number of false positives.

Table 6.21 Diagnostic parameters for MMSE + MoCA in both series and parallel paradigms for diagnosis of any cognitive impairment (dementia + MCI) vs. no cognitive impairment (adapted from Larner 2012b)

The combination of these cognitive screening instruments therefore seems to offer little over and above their individual use (Larner 2012b). An item analysis of the MoCA and the MMSE (Damian et al. 2011) indicated that not all subtests were of equal predictive value, and that a selection of MoCA and MMSE items with high predictive value might engender a more useful hybrid test, although to the author’s knowledge this has yet to be examined.

6.2.2 Combining Informant and Cognitive Screening Instruments

The Alzheimer’s Association has recommended the combined use of an informant interview with a performance measurement to detect dementia most efficiently (Cordell et al. 2013). Data from CFC which explore such combinations are presented here.

6.2.2.1 IQCODE and MMSE/ACE-R

The combination of an informant scale, the Informant Questionnaire on Cognitive Decline in the Elderly (IQCODE; see Sect. 5.4.1), and a cognitive scale, the MMSE (Sect. 4.1.1), has been previously reported in both community (Mackinnon et al. 2003) and clinical samples (Mackinnon and Mulligan 1998; Abreu et al. 2008; Narasimhalu et al. 2008), some finding the combination helpful for detection of cases and non-cases (Mackinnon and Mulligan 1998; Mackinnon et al. 2003; Narasimhalu et al. 2008), others not (Abreu et al. 2008). This difference in findings may be related in part to the different casemix in these studies.

Many of the patients in the CFC/Brooker Centre IQCODE study (see Sect. 5.4.1) were administered the MMSE (n = 132) and/or the ACE-R (n = 114) at the same time that an informant completed the IQCODE (Hancock and Larner 2009). The IQCODE and MMSE scores showed a low negative correlation (r = −0.37; t = 4.49, df = 130, p < 0.001). Using the test of agreement (Cohen’s kappa statistic; Cohen 1960), κ = 0.23 (95% CI = 0.07–0.39), where 1 is perfect agreement between tests and 0 is agreement due to chance alone. For IQCODE and ACE-R, test scores showed a low negative correlation (r = −0.46; t = 5.46, df = 112, p < 0.001) with κ = 0.29 (95% CI = 0.11–0.46).
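For readers unfamiliar with the kappa statistic, the following minimal sketch shows how it is obtained from the cross-classification of two dichotomised tests. The function name and the counts are hypothetical, chosen only to show the calculation; they are not the IQCODE, MMSE or ACE-R data, and the confidence interval is not computed here.

```python
def cohen_kappa(both_pos: int, a_only: int, b_only: int, both_neg: int) -> float:
    """Cohen's kappa for two binary tests applied to the same patients.

    both_pos: positive on test A and test B
    a_only:   positive on A, negative on B
    b_only:   negative on A, positive on B
    both_neg: negative on both
    Undefined (division by zero) if chance agreement equals 1.
    """
    n = both_pos + a_only + b_only + both_neg
    observed = (both_pos + both_neg) / n
    a_pos = both_pos + a_only
    b_pos = both_pos + b_only
    # Agreement expected by chance, from the marginal positive/negative rates.
    expected = (a_pos * b_pos + (n - a_pos) * (n - b_pos)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical counts only.
print(f"{cohen_kappa(40, 10, 10, 40):.2f}")  # tests agree in 80 of 100 patients
print(f"{cohen_kappa(50, 0, 0, 50):.2f}")    # perfect agreement
print(f"{cohen_kappa(25, 25, 25, 25):.2f}")  # agreement no better than chance
```

The three calls illustrate the scale described above: the statistic is 1 when the tests always agree and 0 when observed agreement equals that expected from the marginal rates alone.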

Results of using IQCODE in combination with either MMSE (n = 132) or ACE-R (n = 114) in series or in parallel (method of Flicker et al. 1997) showed the expected improvement in specificity in the series (“And” rule) paradigm, with some reduction in sensitivity but with improved overall accuracy, PPV, diagnostic odds ratio and positive likelihood ratio (Table 6.22). There was little difference between results combining IQCODE and MMSE versus IQCODE and ACE-R, with a marginal advantage for ACE-R. In the parallel (“Or” rule) paradigm, there was the expected improvement in sensitivity, but with no change in accuracy, specificity or PPV (Hancock and Larner 2009).

Table 6.22 Measures of discrimination for diagnosis of dementia for IQCODE, MMSE, ACE-R, IQCODE + MMSE in series or parallel, and IQCODE + ACE-R in series or parallel (adapted from Hancock and Larner 2009)

These results were in some ways similar to those of Narasimhalu et al. (2008) who, in an Asian population with low education, found best sensitivity for combined test use with application of the “Or” rule. Overall they found that a “weighted sum” of MMSE and IQCODE produced statistically superior area under the ROC curve and specificity results.

6.2.2.2 AD8 and MMSE/6CIT/MoCA/MACE

The combination of the AD8 informant scale (Galvin et al. 2005, 2006; see Sect. 5.4.2) and a number of cognitive screening instruments has also been examined.

In the AD8 study (Larner 2015c; Table 5.10), AD8 was combined with MMSE and with 6CIT. Combining AD8 with MMSE in series (i.e. both tests required to be positive before a diagnosis of cognitive impairment is made: the “And” rule) showed the expected improvement in specificity (0.83) but with greatly reduced sensitivity (0.50), whereas in parallel (i.e. either test positive sufficient for a diagnosis of cognitive impairment to be made: “Or” rule), sensitivity was maximised (1.0) whilst specificity was very low (0.08). Combining AD8 with 6CIT showed the same pattern: in series, specificity was improved (0.59) but sensitivity reduced (0.70), whilst in parallel sensitivity was improved (0.99) but specificity was very low (0.13) (Table 6.23).

Table 6.23 Measures of discrimination for diagnosis of cognitive impairment for AD8, MMSE, 6CIT, AD8 + MMSE in series or parallel, and AD8 + 6CIT in series or parallel (adapted and corrected from Larner 2015c)

In a subsequent study (Connon and Larner 2017; Larner 2017c), AD8 was combined with either the Montreal Cognitive Assessment (MoCA) or the Mini-Addenbrooke’s Cognitive Examination (MACE).

Over a 6-month period (May–October 2016), consecutive new outpatients attending CFC accompanied by a capable informant were administered MoCA whilst the informant completed AD8. Of 46 patient-informant dyads (F:M = 19:27, 41% female; age range 32–88 years, median 64), 13 were diagnosed with dementia (DSM-IV-TR criteria; dementia prevalence = 0.28), 22 had MCI (Petersen criteria; MCI prevalence = 0.67 of non-demented); the remainder (n = 11) were diagnosed with subjective memory complaints (Larner 2017c).

Using test cut-offs for cognitive impairment from index studies (AD8 ≥2/8; MoCA <26/30), standard measures of discrimination were calculated for individual tests and for combinations of AD8 with MoCA in series and in parallel (Table 6.24). Individually both tests were highly sensitive (>0.95) but with low specificity (all ≤0.45). Combination in series improved specificity for little loss of sensitivity. Conversely, combination in parallel maintained sensitivity. Predictive values were ≥0.8 for both combinations, with predictive summary index better for parallel combination.

Table 6.24 Measures of discrimination for diagnosis of cognitive impairment for AD8, MoCA, and AD8 + MoCA (n = 46) in series or parallel (adapted from Larner 2017c)

In 67 patient-informant dyads seen over an 8-month period (May–December 2016; F:M = 33:34, 49% female; age range 26–88 years, median 64), the patients were administered MACE whilst the informants completed AD8 (Connon and Larner 2017). Fourteen patients were diagnosed with dementia (DSM-IV-TR criteria), 32 with MCI (Petersen criteria). Using cut-offs defined in index studies (AD8 ≥2/8; MACE ≤25/30), the measures of discrimination (Table 6.25) showed both instruments were very sensitive (≥0.98) but not specific (≤0.38) for diagnosis of cognitive impairment. Combination in series (“And” rule) improved diagnostic specificity for little loss of sensitivity. Combination in parallel (“Or” rule) maximised sensitivity but with poorer specificity. Series combination had better Youden index and correct classification accuracy than parallel combination. Predictive values were >0.7 for both combinations, with predictive summary index marginally better for parallel combination (Table 6.25).

Table 6.25 Measures of discrimination for diagnosis of cognitive impairment for AD8, MACE, and AD8 + MACE (n = 67) in series or parallel (adapted from Connon and Larner 2017)

The data from these studies suggested that series combination of AD8 and either MoCA or MACE may improve the balance of sensitivity and specificity for diagnosis of cognitive impairment, principally by improving diagnostic specificity in comparison to the use of individual tests.
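The measures of discrimination referred to in these studies can all be derived from a single 2 × 2 table of test result against reference diagnosis. The sketch below uses the standard definitions (Youden index = sensitivity + specificity − 1; predictive summary index = PPV + NPV − 1); the function name and the counts are hypothetical and do not reproduce any of the tables in this chapter.

```python
from typing import Dict


def discrimination(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard measures of discrimination from the cells of a 2 x 2 table.

    tp/fn: impaired patients testing positive/negative
    fp/tn: unimpaired patients testing positive/negative
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "ppv": ppv,
        "npv": npv,
        "youden_index": sensitivity + specificity - 1,
        "predictive_summary_index": ppv + npv - 1,
    }


# Hypothetical counts only: 100 impaired and 100 unimpaired patients.
for name, value in discrimination(tp=90, fp=30, fn=10, tn=70).items():
    print(f"{name}: {value:.3f}")
```

Because the Youden index depends only on sensitivity and specificity whereas the predictive summary index depends on the predictive values, and hence on prevalence, the two summary measures need not favour the same combination rule, as seen with AD8 and MACE.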

6.2.3 Combining Functional and Cognitive Screening Instruments: IADL Scale and ACE-R; Free-Cog

The Instrumental Activities of Daily Living (IADL) Scale and its derivative, the 4-IADL score (see Sect. 5.1.1), are reported to correlate strongly with measures of cognitive function such as the MMSE (Lawton and Brody 1969; Barberger-Gateau et al. 1992; De Lepeleire et al. 2004). MCI patients with impaired IADL have a higher percentage of conversion to AD than MCI patients with preserved IADL (Chang et al. 2011). Hence a combination of functional and cognitive scales might assist in dementia diagnosis.

The combination of a functional scale, IADL Scale, and a cognitive scale, ACE-R, has been examined in a subgroup of patients (n = 79; M:F = 34:45; dementia prevalence = 57%) from the IADL study (see Sect. 5.1.1; Hancock and Larner 2007). Using the same IADL Scale cut-off (≤13/14) as used in that study, sensitivity and specificity for dementia diagnosis were comparable to those of the original study (Se = 0.91 vs. 0.87; Sp = 0.62 vs. 0.50). Using the same ACE-R cut-off (≥73/100) defined in the study of that instrument (see Sect. 4.1.5.3; Larner 2009, 2013a), sensitivity and specificity for dementia diagnosis were likewise comparable (Se = 0.76 vs. 0.87; Sp = 0.91 vs. 0.91). IADL Scale scores and ACE-R scores were moderately correlated (r = 0.58; t = 6.25, df = 77, p < 0.001) and the test of diagnostic agreement between the two tests was similarly moderate (κ = 0.38, 95% CI 0.18–0.58) (Larner and Hancock 2012).
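The t statistic quoted alongside each correlation coefficient in this chapter is consistent with the standard test of a zero correlation, t = r√(df/(1 − r²)) with df = n − 2. That this was the formula used is an assumption, but the short sketch below (function name hypothetical) reproduces the values reported for IADL Scale and ACE-R.

```python
import math
from typing import Tuple


def t_from_r(r: float, n: int) -> Tuple[float, int]:
    """t statistic and degrees of freedom for testing a Pearson r against zero."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r * r))
    return t, df


# Values reported above for IADL Scale vs. ACE-R: r = 0.58, n = 79.
t, df = t_from_r(0.58, 79)
print(f"t = {t:.2f}, df = {df}")
```

When the same calculation is applied to a correlation coefficient that has been rounded to two decimal places, the resulting t may differ in the second decimal place from one computed on the unrounded data.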

Results of using IADL in combination with ACE-R in series or in parallel (as per method of Flicker et al. 1997) showed the expected improvement in specificity in the series (“And” rule) paradigm, along with improved PPV and positive likelihood ratio, but with loss of sensitivity, negative predictive value and negative likelihood ratio. In the parallel (“Or” rule) paradigm, there was the expected improvement in sensitivity, negative predictive value and negative likelihood ratio, but with loss of specificity, positive predictive value and positive likelihood ratio (Table 6.26). Parallel use might therefore be advantageous where increased sensitivity (case finding) is the aim (Larner and Hancock 2012).
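The link between the shifts in sensitivity and specificity and the corresponding shifts in likelihood ratios can be seen from the definitions LR+ = Se/(1 − Sp) and LR− = (1 − Se)/Sp, with diagnostic odds ratio = LR+/LR−. The following sketch applies these to two invented operating points, one typical of a series combination and one typical of a parallel combination; the figures are hypothetical and are not those of Table 6.26.

```python
from typing import Tuple


def likelihood_ratios(sensitivity: float, specificity: float) -> Tuple[float, float, float]:
    """Return (LR+, LR-, diagnostic odds ratio).

    Requires 0 < specificity < 1 and sensitivity < 1 to avoid division by zero.
    """
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg, lr_pos / lr_neg


# Hypothetical operating points (not study data).
for label, se, sp in [("series ('And')", 0.70, 0.95), ("parallel ('Or')", 0.98, 0.50)]:
    lr_pos, lr_neg, dor = likelihood_ratios(se, sp)
    print(f"{label}: LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}, DOR = {dor:.1f}")
```

High specificity drives LR+ upwards (better for ruling in), whereas high sensitivity drives LR− towards zero (better for ruling out), which is the pattern described above for the two paradigms.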

Table 6.26 Diagnostic parameters for IADL + ACE-R, in both series and parallel paradigms (Larner and Hancock 2012)

The Free-Cog scale (Sect. 4.1.10) attempts to incorporate assessment of cognition and function in a single instrument. Preliminary study showed that subscores for the cognitive function and executive function components had only low correlation (r = 0.47; t = 2.24, df = 18, p < 0.05), as might be anticipated when testing different constructs.

6.3 Converting Cognitive Screening Instrument Scores

The Mini-Mental State Examination (MMSE) has been available for over 40 years and has come to be regarded as the benchmark against which other simple CSIs are compared. The development of more sensitive CSIs may have reduced the utility of MMSE, as may concerns about infringement of copyright (e.g. Newman and Feldman 2011; Mitchell 2013). However, MMSE test scores may still be used as the indicator or determinant for important clinical decisions in cognitively impaired patients, such as the initiation of prescription of cholinesterase inhibitors and/or memantine.

Different screening instruments measure slightly different things, based on their different item content (Chaps. 4 and 5), but these are all aspects of the construct of cognitive function. Simple methods to convert test scores from one of the commonly administered CSIs to another might therefore be of clinical utility.

One method to do this involves deriving a conversion table of equivalent scores from equipercentile equating with log-linear smoothing (e.g. for MMSE and MoCA: Roalf et al. 2013; van Steenoven et al. 2014). Another method is the calculation of linear regression equations of the form y = a + bx. For example, Kalbe et al. (2004) reported MMSE = 19.997 + 0.567 × DemTect (other examples: for MMSE and ADAS-Cog, see Doraiswamy et al. 1997; for MMSE and one version of the clock drawing test, see Shua-Haim et al. 1997).
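Applying such an equation is simple arithmetic, as the following sketch using the coefficients quoted above from Kalbe et al. (2004) shows. The function name is hypothetical, and capping the estimate to the 0–30 range of the MMSE is an assumption added here for illustration rather than part of the published equation.

```python
def demtect_to_mmse(demtect: float) -> float:
    """Approximate MMSE score from a DemTect score (DemTect range assumed 0-18).

    Uses MMSE = 19.997 + 0.567 x DemTect as quoted from Kalbe et al. (2004).
    The estimate is capped to the MMSE range 0-30; the capping is an
    assumption of this sketch, not part of the published equation.
    """
    intercept = 19.997
    slope = 0.567
    estimate = intercept + slope * demtect
    return min(30.0, max(0.0, estimate))


for score in (0, 10, 18):
    print(f"DemTect {score} -> approximate MMSE {demtect_to_mmse(score):.1f}")
```

The high intercept means that even a DemTect score of zero maps to an approximate MMSE of about 20, a feature of such equations that is discussed further in Sect. 6.3.1.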

6.3.1 Linear Regression Equations

The datasets of several pragmatic diagnostic test accuracy studies undertaken in CFC (Abdel-Aziz and Larner 2015; Larner 2015c, 2016a, 2017a) were used to calculate regression equations of the form y = a + bx (Larner 2017d), where y, the dependent or outcome variable, was the approximate CSI score; x, the independent or explanatory variable, was the score on a different CSI with which the first CSI was being compared; a was the intercept; and b was the slope or gradient (regression coefficient) of the regression equation. Pearson product moment correlation coefficients were also calculated (Table 6.27).
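The calculation of a, b and the Pearson coefficient from paired scores may be sketched as follows, using ordinary least squares. This is a minimal illustration under stated assumptions: the function name and the six pairs of scores are invented, and the code does not reproduce the analysis or the values in Table 6.27.

```python
import math
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y = a + b*x; returns (a, b, r).

    a: intercept, b: slope (regression coefficient), r: Pearson correlation.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


# Hypothetical paired scores for six patients (not study data).
other_csi = [12, 15, 19, 22, 25, 28]
mmse = [20, 23, 24, 27, 28, 30]

a, b, r = fit_line(other_csi, mmse)
print(f"MMSE = {a:.3f} + {b:.3f} x other CSI (r = {r:.2f})")
```

For a negatively scored instrument, in which higher scores indicate greater impairment, the same calculation returns a negative slope and a negative correlation coefficient.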

Table 6.27 Regression equations and correlation coefficients of some commonly used cognitive screening instruments (adapted and extended from Larner 2017d)

As anticipated, since MoCA and MACE are scored positively and correlate positively with MMSE scores, their regression coefficients with MMSE were positive, whereas for 6CIT and AD8, which are negatively scored and correlate negatively with MMSE scores, the slope of the regression line was negative, indicating lower MMSE scores for subjects with higher 6CIT and AD8 scores. Since MoCA, MACE, 6CIT and AD8 were all more sensitive than MMSE in the base studies, the intercept values of the regression equations were all high, indicating that many correct answers may be achieved on MMSE whilst the other tests remain at floor. MMSE is recognized to include relatively easy items which are of little value in patient assessment (Sect. 4.1.1). Greater coincidence of the various test scores occurred around ceiling.

Calculation and application of these regression equations is a relatively simple way to obtain approximate scores when converting between screening instruments (calculations can be easily done on a mobile phone calculator). Whether this approach might also be used outside of the secondary care clinic setting, whence the original data were generated, remains to be addressed. The regression equations derived here may be a simple way to generate approximate MMSE scores which may be used to inform clinical decision making without recourse to administering the MMSE per se and any potential copyright issues.

6.4 Summary and Recommendations

The various comparative metrics examined here suggest that a number of CSIs are suitable for the diagnosis of dementia. In the previous edition (Larner 2014a:140), ACE-R was noted to be at or near the top in most categories, so was recommended as eminently suitable for those requiring cognitive screening in a dedicated Cognitive Function Clinic (i.e. a high prevalence setting). The withdrawal of ACE-R because of issues around MMSE copyright was regretted; it was hoped that ACE-III (Hsieh et al. 2013) might be a suitable replacement. Other options include MoCA and MACE, both of which appear to be highly acceptable, and certainly seem to be best for diagnosis of MCI. MMSE may still retain a place, acknowledging its shortcomings, in terms of both its neuropsychological limitations and questionable ecological validity (Larner 2007c). However, MMSE is certainly not good for identification of MCI, so if this frames the clinical question then MoCA or MACE are preferable.

Combinations of CSIs with informant scales or with functional instruments may have added diagnostic value compared to CSIs in isolation, and certainly pragmatic value in planning clinical interventions (Chap. 5). Conversion between test scores may also be useful; for example, if therapeutic decision making is to be based on MMSE scores, then conversion of other CSI scores to approximate MMSE scores by using linear regression equations might be used.