Introduction

Cognitive screening in adults and older individuals is relevant both to neurological/neuropsychiatric diagnostics and prevention in internal medicine patients with possible brain damage [1] and, in turn, to prognosis and interventional management [2]. Screening for cognitive deficits is indeed meant to help clinicians determine whether a II-level neuropsychological assessment (i.e., an in-depth examination of multiple cognitive/behavioral functions) is needed for a given patient [3].

Since they aim to provide practitioners with an optimal compromise between informativeness and ease of use in the early detection of cognitive changes [1], cognitive screening tests (CSTs) need to come with robust psychometric and diagnostic properties, representative norms, and evidence of clinical feasibility in target conditions (i.e., the clinical populations they are meant to be administered to) [2, 4] (see Table 1).

Table 1 Desirable psychometric and diagnostic properties of a cognitive screening test (CST)

However, it has already been acknowledged that widespread CSTs often fail to reach the aforementioned statistical standards, in turn negatively affecting their level of recommendation [9]. In this respect, cross-cultural adaptations of CSTs have been specifically highlighted as suffering from psychometric/diagnostic weaknesses [10], a major issue in light of the relevance of culture- and language-specificity to cognitive assessment [11].

In Italy, much attention has been historically devoted to providing norms within the development and adaptation of CSTs [12]. However, it is debated whether this focus might have led to neglecting other fundamental statistical aspects when standardizing tests, such as validity, reliability, and diagnostic properties [13].

In light of the above premises, this study aimed at systematically reviewing evidence on originally Italian/adapted-to-Italian CSTs in order to (a) provide an up-to-date compendium of available CSTs in Italy; (b) report their psychometric and diagnostic properties; and (c) address current issues with regard to their development, adaptation, and standardization.

Methods

Search strategy

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were consulted [14]. This review was pre-registered on the International Prospective Register of Systematic Reviews (PROSPERO; CRD42021254561: https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=254561).

The following search terms were entered into the Scopus and PubMed databases on May 1, 2021 (no date limit set): neuropsych* OR cogniti* AND screen* OR "screening test" OR "screening tool" OR "screening instrument" AND Italy OR Italian. Fields of search were title, abstract, and keywords for Scopus, and title and abstract for PubMed. Only peer-reviewed, full-text contributions written in English or Italian were considered; hence, non-peer-reviewed literature was not searched. Further contributions of possible interest were identified within the reference lists of included articles or through manual search.

Contributions focusing either on the standardization of Italian/adapted-to-Italian CSTs (i.e., psychometric/diagnostic investigations or normative studies) or on their feasibility/usability in healthy participants (HPs) and in patients with neurological or neuropsychiatric diseases were considered for eligibility. For a non-normative study to be included, at least one property among validity, reliability, and sensitivity/specificity (or related metrics) had to be assessed. Case reports/case series, reviews/meta-analyses, abstracts, research protocols, qualitative studies, and opinion papers were excluded. Among feasibility/usability studies, those focusing on selected clinical populations that would not have allowed sufficient generalizability were not considered. Investigations on proxy-report tools, questionnaires, and CSTs for pediatric populations, as well as CSTs requiring ≥45′ to be administered, were also excluded in order to improve the external validity of conclusions.

Data collection and quality assessment

Screening and eligibility stages were performed by one of the authors (E.N.A.) via Rayyan (https://rayyan.qcri.org/welcome); a second author (G.A.) supervised this stage.

Data extraction was performed by two independent collaborators (S.R. and F.C.), whereas one independent author (E.N.A.) supervised this stage and checked extracted data.

Outcomes of interest were (1) sample size, (2) sample representativeness (geographic coverage, exclusion criteria), (3) participants’ demographics, (4) test adaptation procedures, (5) modality of assessment (in-person vs. remote), (6) administration time, (7) validity metrics, (8) reliability metrics (including significant change measures), (9) measures of sensitivity and specificity, (10) metrics derived from sensitivity and specificity, (11) norming methods, and (12) other psychometric/diagnostic properties (e.g., accuracy, acceptability rate, assessment of ceiling/floor effects).

Formal quality assessment was performed for each CST according to the aforementioned categories by means of an ad hoc checklist (Cognitive Screening Standardization Checklist, CSSC) (see Table 2). The CSSC encompasses two sections, “Sampling” (range 0–13) and “Psychometrics, diagnostics, and usability” (range 0–29). The first section evaluates sampling adequacy in terms of representativeness; the second focuses on psychometric and diagnostic properties as well as on feasibility. CSSC total scores range from 0 to 42; a given CST was thus judged as “statistically sound” if scoring ≥21 (i.e., 50% of the maximum) on the CSSC. CSSC items were based on [1, 2] and [7].

Table 2 Cognitive Screening Standardization Checklist (CSSC)

Scores were assigned “cumulatively” for each CST by evaluating all included studies on it. Items targeting non-cumulative information that was nonetheless retrievable from multiple studies (e.g., the normative sample size) were scored according to the study providing the highest-quality information (e.g., the highest N).

Quality assessment was performed by one of the authors (S.R.) and supervised by a second, independent one (E.N.A.).

Results

Study selection process is shown in Fig. 1.

Fig. 1
figure 1

PRISMA flow chart displaying the study selection process. PRISMA Preferred Reporting Items for Systematic Reviews and Meta-Analyses, NPs neuropsychological. Diagram adapted from [14] (www.prisma-statement.org)

Sixty-one studies were ultimately included. Extracted outcomes are reported in Table 3. A summary of the most relevant psychometric and diagnostic properties of each included CST, along with CSSC scores, is reported in Table 4.

Table 3 Summary of extracted outcomes 
Table 4 Synopsis of the available psychometric and diagnostic properties of Italian cognitive screening tests

The vast majority of contributions were studies mostly aimed at providing normative data (N = 32), of which 11 did not report any further relevant statistical property. Twenty-seven studies instead focused on psychometric/diagnostic properties with only marginal or absent attention to normative values (whether in the context of clinical usability or not).

Included CSTs fell under the following categories: (a) domain-/disease-nonspecific (in-person: N = 14; remote: N = 3); (b) domain-specific (N = 7), targeting executive functioning, language, memory, and praxis; and (c) disease-specific (N = 16), targeting neurodegenerative disorders (Alzheimer’s, Parkinson’s, and motor neuron diseases), cerebrovascular accidents, neuropsychiatric conditions, infective sub-cortical dementias, delirium, migraine-related subjective cognitive dysfunction, and dementia in the context of intellectual disabilities. Among all the investigations, target clinical populations were present in 33 studies.

Validity was investigated in 37 studies, mostly by convergence (N = 31); divergent validity was assessed in 5 studies and criterion validity in 4 (3 via concurrent and 1 via predictive validity). Only 5 of the included studies assessed the factorial structure underlying CSTs by means of dimensionality-reduction approaches. No overt evidence of content, face, or ecological validity was detected.

Reliability was investigated in 32 studies and mostly as inter-rater (N = 17), internal consistency (N = 14), and test-retest (N = 12). Parallel forms were developed within 4 studies only.

Although sensitivity and specificity measures were often reported (N = 22), derived metrics (e.g., positive and negative predictive values and likelihood ratios) were provided in 10 studies only.

With respect to norming, regression-based and inferential-error-controlling methods (e.g., tolerance limits and/or Equivalent Scores [12]) were highly represented (N = 26). Several studies (N = 17) derived point-estimate cut-offs through receiver-operating characteristic (ROC) analyses.

Acceptability of the CST was overtly examined in 9 studies, while ceiling/floor effects in 11. When applicable, administration time ranged from 2 to 45 min.

Discussion

The present work investigates the statistical features of CSTs currently available in Italy, shedding new light on their clinical and experimental use. The information reported here has the potential to promote a more aware and critical usage of CSTs among Italian clinicians, as well as to serve as overall guidance for researchers involved either in CST development/adaptation/standardization or in addressing open issues on CST psychometrics/diagnostics.

Overall, although psychometrics and diagnostics for a given CST were often not assessed within the same study, basic properties and norms were provided across different ones, especially for the most widespread CSTs (e.g., Mini-Mental State Examination, MMSE; Montreal Cognitive Assessment, MoCA).

Moreover, although results show a general trend towards providing only normative data and cut-off values, the majority of included CSTs proved to be supported by sufficient evidence as far as basic psychometric/diagnostic requirements are concerned. The present review hints at a relatively high quality of selected global and domain-specific CSTs, e.g., the MoCA (CSSC = 34), the Addenbrooke’s Cognitive Examination — Revised (ACE-R; CSSC = 31), the Screening for Aphasia in Neurodegeneration (SAND; CSSC = 27), and the Frontal Assessment Battery (FAB; CSSC = 24). Disease-specific CSTs were shown to be less statistically robust, with a few exceptions, e.g., the ALS Cognitive Behavioral Screen (ALS-CBS; CSSC = 26).

Validity

Findings on validity revealed misinterpretations of psychometric concepts and incomplete analyses for certain CSTs.

First, it is worth mentioning that convergent and concurrent validity were at times mistaken for each other, e.g., concurrent validity being tested by means of correlations instead of regressions, or convergent and concurrent validity being addressed as the same construct [54, 55, 61, 69, 73].

In this regard, one should also note that the measures chosen for correlational analyses were at times not meant to assess the same construct as the target CST, e.g., FAB validity being tested against the MMSE [24].

Moreover, predictive validity was almost never assessed [29], despite the longitudinal dimension being relevant to the monitoring of patients’ cognitive profiles. This may be due to the high cost of performing a proper longitudinal study to assess predictive validity.

It is also worth mentioning that the vast majority of included CSTs lacked divergent validity evidence. This might be due to the fact that different CSTs are commonly found to correlate despite being meant to assess different functions; this because target constructs often overlap to some extent. Researchers are thus encouraged to test divergence by addressing measures that are supposed to deviate from a given CST as far as either construct or face validity is concerned. This could be done by comparing a CST with either a II-level, domain-specific cognitive test, or with a psychodiagnostic tool.

Furthermore, although the need for cognitive measures that are predictive of daily functioning has been highlighted [12], it has to be noted that the ecological validity of CSTs has never been found to be directly investigated within original standardization studies. This may be due to the lack of a wide consensus on how to investigate ecological validity, as well as to the scarce availability of ad hoc scales designed for assessing the specific impact of cognitive disorders on real-life functioning, going beyond a general evaluation of functional disability.

Finally, researchers should consider exploring the content validity and factorial structure of CSTs; this applies equally to tests postulated to be mono-factorial (e.g., the MMSE) and to domain-specific ones (e.g., the SAND), which might nonetheless cover multiple cognitive functions.

Reliability

Overall, reliability of Italian CSTs was frequently assessed, although often either incompletely or inefficiently.

When testing the reliability of CSTs, it is worth bearing in mind that internal consistency might be problematic: different items within the same CST may be meant to measure different facets of cognition, which is possibly even truer for multi-domain tests such as the MMSE. This aspect needs further development.
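As a minimal illustration of this point (using simulated, hypothetical data rather than actual CST scores), Cronbach's alpha computed across items tapping two unrelated domains is deflated relative to alpha within a homogeneous subscale, even when every item is individually reliable:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_subjects, n_items) score matrix."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)

rng = np.random.default_rng(0)
n = 2000
memory = rng.normal(size=n)   # latent "memory" ability
praxis = rng.normal(size=n)   # latent "praxis" ability, independent of memory
# Three items per domain, each loading on its own factor plus measurement noise
mem_items = np.column_stack([memory + 0.3 * rng.normal(size=n) for _ in range(3)])
prx_items = np.column_stack([praxis + 0.3 * rng.normal(size=n) for _ in range(3)])

alpha_subscale = cronbach_alpha(mem_items)                       # homogeneous items
alpha_total = cronbach_alpha(np.hstack([mem_items, prx_items]))  # mixed domains
print(alpha_subscale, alpha_total)  # total-scale alpha is markedly lower
```

The deflated total-scale alpha does not indicate an unreliable test, only a heterogeneous one, which is why alpha alone can mislead for multi-domain CSTs.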

By contrast, assessing reliability via test-retest or inter-rater approaches may generalize across CST categories and be more practically relevant (e.g., clinicians are interested in knowing whether a CST yields similar scores when administered under different conditions).

Furthermore, parallel-form reliability was seldom examined, and no CST came with information on its ability to detect significant change [6, 76]. Although parallel forms reduce practice effects (i.e., systematic performance improvements across consecutive assessments), the lack of appropriate methods for detecting clinically meaningful change over time has a crucial, even detrimental, impact whenever CSTs are used longitudinally to monitor cognitive functioning for diagnostic or prognostic purposes. Indeed, without thresholds for significant change, it is not possible to ascertain whether observed score variations over repeated measurements merely reflect intrinsic, expected physiological oscillations of performance, or more likely a true cognitive change (worsening or improvement).
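One established way to derive such thresholds is the Jacobson-Truax reliable change index (RCI); the sketch below uses hypothetical values for the normative standard deviation and test-retest reliability, which in practice would come from a standardization study:

```python
import math

def reliable_change_index(score_t1: float, score_t2: float,
                          sd_baseline: float, r_xx: float) -> float:
    """Jacobson-Truax RCI: score change over the standard error of the difference."""
    sem = sd_baseline * math.sqrt(1 - r_xx)  # standard error of measurement
    s_diff = math.sqrt(2) * sem              # standard error of the difference
    return (score_t2 - score_t1) / s_diff

# Hypothetical values: 0-30 scale, normative SD = 2.5, test-retest r = .80
rci = reliable_change_index(score_t1=27, score_t2=24, sd_baseline=2.5, r_xx=0.80)
reliable = abs(rci) > 1.96  # change beyond expected measurement error (alpha = .05)
print(round(rci, 2), reliable)
```

Under these assumed values, even a 3-point drop does not exceed expected measurement error, which illustrates why score differences cannot be interpreted clinically without an explicit significant-change threshold.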

Diagnostic properties

The study of the diagnostic properties of CSTs was often addressed within a nosographic-descriptive framework, which, however, might not always fit cognitive semiology [7]. Indeed, a one-to-one correspondence between cognitive profiles and neurological/neuropsychiatric conditions is often not straightforward [13]. Thereupon, the notion of “target condition” within ROC analyses may prove elusive, hence limiting the disease-specificity of certain CSTs [2, 7]. The present work indeed highlights the need for more rigorous statistical methods for deriving optimal cut-off values (e.g., the Youden statistic, an index identifying the cut-off offering the best compromise between sensitivity and specificity).
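For illustration, the Youden statistic (J = sensitivity + specificity − 1) can be maximized over candidate cut-offs as in the sketch below; the scores are simulated and merely hypothetical, assuming that lower scores indicate impairment:

```python
import numpy as np

def youden_cutoff(scores, is_impaired):
    """Return (cut-off, J) maximizing J = sensitivity + specificity - 1.
    A score <= cut-off counts as a positive (impaired) screen."""
    best_j, best_cut = -1.0, None
    for cut in np.unique(scores):
        positive = scores <= cut
        sensitivity = positive[is_impaired].mean()
        specificity = (~positive[~is_impaired]).mean()
        j = sensitivity + specificity - 1
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut, best_j

rng = np.random.default_rng(42)
# Simulated 0-30 CST scores: patients score lower than controls on average
controls = rng.normal(27, 2, size=300).clip(0, 30)
patients = rng.normal(22, 3, size=300).clip(0, 30)
scores = np.concatenate([controls, patients])
impaired = np.concatenate([np.zeros(300, bool), np.ones(300, bool)])

cut, j = youden_cutoff(scores, impaired)
print(cut, j)
```

The same exhaustive search underlies ROC-based cut-off derivation; the Youden criterion simply makes the trade-off between sensitivity and specificity explicit instead of leaving the cut-off choice informal.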

Moreover, although basic diagnostic properties were often investigated, less attention has been given to those selectively relevant to screening aims, such as accounting for disease prevalence (e.g., positive and negative predictive values) and estimating the post-test probability of cognitive impairment (e.g., positive and negative likelihood ratios) [3].
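These prevalence-aware metrics follow directly from sensitivity and specificity; the sketch below, with purely hypothetical values, shows why they matter: the same CST yields very different positive predictive values in a high-prevalence clinic versus low-prevalence community screening.

```python
def screening_metrics(sens: float, spec: float, prevalence: float) -> dict:
    """Prevalence-aware diagnostic metrics derived from sensitivity/specificity."""
    lr_pos = sens / (1 - spec)            # positive likelihood ratio
    lr_neg = (1 - sens) / spec            # negative likelihood ratio
    pre_odds = prevalence / (1 - prevalence)
    post_odds = pre_odds * lr_pos         # odds of impairment after a positive screen
    ppv = sens * prevalence / (sens * prevalence + (1 - spec) * (1 - prevalence))
    npv = spec * (1 - prevalence) / (spec * (1 - prevalence) + (1 - sens) * prevalence)
    return {"PPV": ppv, "NPV": npv, "LR+": lr_pos, "LR-": lr_neg,
            "post_test_p": post_odds / (1 + post_odds)}

# Hypothetical CST with sens = .85 and spec = .90, applied in two settings
clinic = screening_metrics(0.85, 0.90, prevalence=0.40)     # memory clinic
community = screening_metrics(0.85, 0.90, prevalence=0.05)  # community screening
print(round(clinic["PPV"], 2), round(community["PPV"], 2))  # 0.85 vs 0.31
```

Note that the post-test probability after a positive screen coincides with the PPV; reporting likelihood ratios simply lets readers plug in their own setting's pre-test probability.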

With respect to Italian CSTs, it is noteworthy that diagnostic properties were investigated for only 5 out of 16 disease-specific CSTs. Although evidence of case-control discrimination was frequently provided by means of between-group comparisons (e.g., for the ALS-CBS), it is recommended that sensitivity, specificity, and derived measures be tested in order to statistically substantiate CST applications to target conditions.

Norms

As far as normative data are concerned, although regression-based and inferential-error-controlling techniques were highly represented, a relatively high heterogeneity in norming methods was detected. First, the Equivalent Score method was at times embraced “incompletely,” by computing tolerance limits only, without Equivalent Score thresholds [12]. Second, norms were occasionally derived via approaches assuming a normal distribution, possibly undermining their adequacy, as cognitive data often present with overdispersion and skewness [43]. In this respect, checking for ceiling/floor effects in test scores is encouraged; unfortunately, this was rarely carried out in the studies included here.
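The risk posed by normality assumptions can be illustrated with simulated scores showing a ceiling effect: on such a left-skewed distribution, a Gaussian-based cut-off (mean − 1.645 SD) no longer isolates the intended lowest 5% of the population.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated scores on a 0-30 CST with a ceiling effect: a long left tail,
# since most healthy participants score near the maximum
scores = 30 - rng.gamma(shape=1.5, scale=1.5, size=5000)

# Cut-off assuming normality: the Gaussian 5th percentile
normal_cut = scores.mean() - 1.645 * scores.std(ddof=1)
# Non-parametric cut-off: the empirical 5th percentile
empirical_cut = np.percentile(scores, 5)

# Under left skew, the Gaussian cut-off sits too high and flags too many
# healthy participants as impaired
flagged = (scores < normal_cut).mean()
print(round(normal_cut, 1), round(empirical_cut, 1), round(flagged, 3))
```

Here the Gaussian-based cut-off flags roughly 7% of healthy scores rather than the nominal 5%, which is exactly the kind of miscalibration that a preliminary check for ceiling/floor effects would reveal.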

With a few exceptions, sampling proved overall adequate as far as typical/clinical sample sizes are concerned. However, the geographic coverage of normative samples was often circumscribed. Between-region differences should nonetheless receive attention as potential confounders in cognitive testing; this issue has been addressed only recently in the relevant Italian literature [66, 70].

Feasibility

Although a core feature of a CST is a short administration time [2], several CSTs reported here require up to 20′ to be administered and scored (e.g., the ACE-R), in turn limiting their usability in time-restricted settings (e.g., bedside evaluations). By contrast, these “in-depth CSTs” may be more adequate in outpatient settings [2].

With regard to cross-cultural/-linguistic adaptations, it has to be stressed that back-translation approaches have seldom been adopted, and culture- and language-related issues were often not addressed. The latter aspect is of major interest especially for items assessing language functioning, which should instead undergo dedicated, country-specific controls for psycholinguistic predictors (e.g., word frequency for naming tasks) [10].

An overall need for more systematic evidence of the acceptability/face validity of CSTs also emerged from the present work. Such evidence would help practitioners select a test based on the administration setting; for instance, the assessment of acute patients would benefit from a short, tolerable CST that is clearly recognizable as such by the patient.

Limitations and perspectives

First, it has to be noted that the goodness of a CST is not exhausted by its psychometric/diagnostic properties. Indeed, for a CST to be introduced into clinical practice, thorough evidence of its applicability in atypical populations should be provided. Moreover, it should be borne in mind that evidence of both psychometric and diagnostic soundness may also be inferred from applied studies. Thereupon, future studies should focus on reviewing available contributions on the clinical usability of Italian CSTs in order to provide a more comprehensive picture of their statistical/methodological quality.

Furthermore, Italian practitioners might benefit from a future review focused on psychometric/diagnostic properties of qualitative/proxy measures of cognition that were not addressed within the present study for generalizability reasons.

Although beyond the aim of this work, it should also be noted that more detailed item-level analyses (Item Response Theory) were conducted in only one of the included records [66]. As they can provide insights into adaptive testing and help ease interpretation issues, Item Response Theory-based analyses should be taken into consideration when assessing the psychometric/diagnostic properties of CSTs [66].

Finally, it is important to underline that, to the best of the authors’ knowledge, there is no official, worldwide consensus on the relevant properties to be addressed in cognitive screening; the choices made here may thus be incomplete or selectively reflect the knowledge of the researchers. This consideration stresses the importance of developing wider agreement within neurological/neuropsychological societies to ensure higher standards and raise awareness of the impact of statistical properties on the applicability of CSTs in both applied (e.g., clinical and forensic) and research contexts.

Conclusion

The present work shows that, although available Italian CSTs overall met basic psychometric/diagnostic requirements, their statistical profiles often proved deficient on several properties that are desirable or needed for clinical applications, with a few exceptions of high soundness among general and domain-specific CSTs, namely the MoCA and ACE-R, and the FAB and SAND, respectively. In particular, this work highlights that:

  • - psychometric/diagnostic properties of disease-specific CSTs were poorly examined;

  • - construct and criterion validity should be differentiated and assessed separately;

  • - factorial structure underlying CSTs should be tested for both general and domain-specific ones;

  • - the ecological validity of CSTs needs to be addressed to provide information relevant to patients’ everyday functioning;

  • - significant change thresholds and alternate versions of CSTs need to be developed in order to improve their longitudinal usage;

  • - a general lack of investigations on sensitivity-/specificity-derived diagnostic metrics selectively relevant to screening aims (i.e., positive and negative predictive values and likelihood ratios) was detected;

  • - a clearer definition of target conditions for a given CST is needed, especially for those thought to be disease-specific;

  • - information on CST acceptability, face validity, and administration time is desirable, as it helps practitioners make an informed, ad hoc choice of test.