Introduction

According to the statutes of the Biobanking and BioMolecular resources Research Infrastructure — European Research Infrastructure Consortium (BBMRI-ERIC), biobanks are defined as “collection, repositories and distribution centres of all types of human biological samples, such as blood, tissues, cells or DNA and/or related data such as associated clinical and research data, as well as biomolecular resources, including model- and micro-organisms that might contribute to the understanding of the physiology and diseases of humans” [1].

The terms “biobank” and “biorepository” are often used interchangeably. According to the National Institutes of Health (NIH), biorepositories are “a place, room, or container where biospecimens/tissue samples are stored”, where the term “biospecimen” refers to materials taken from the human body (such as tissues, blood, plasma and urine), commonly accompanied by information about the patient from whom the biospecimen was obtained [2]. More specifically, while biorepositories may be collections of biological material from any living organism, biobanks usually refer to collections of human biological material [3, 4]. To gather such a large amount of data, it has been necessary to develop large electronic databases [5]. Based on the Declaration of Taipei published by the World Medical Association, a health database can be defined as “a system for collecting, organising and storing health information” [6]. Imaging data were initially either not collected in biobanks or minimally represented. For example, one of the first cohort studies involving the collection of MRI data was the Multi-Ethnic Study of Atherosclerosis (MESA), which included fewer than 5000 patients with the goal of investigating subclinical cardiovascular diseases in the general population [7,8,9]. Only in the last few years have more imaging data begun to be collected in biobanks, which are often understood as collections of biological “images” that are not necessarily related to radiology but consist of digital images of pathology specimens [9].

To address the concept of imaging biobanks in the field of precision medicine, the European Society of Radiology (ESR) Research Committee established an Imaging Biobanks Working Group in 2014, which defined imaging biobanks as “organised databases of medical images and associated imaging biomarkers (radiology and beyond), shared among multiple researchers, linked to other biorepositories”, and suggested that “biobanks (which focus only on the collection of genotype data) should simultaneously come with a system to collect related clinical or phenotype data” [10]. The primary aim of imaging biobanks is to guarantee the long-term storage and retrieval of secured medical images and associated metadata for research and validation purposes, whereas the secondary goal is to connect imaging and tissue biobanks, thus providing a deep association between phenotype and genotype expressions [11, 12].

BBMRI differentiates between two types of research biobanks, i.e. population-based and disease-oriented biobanks. Population-based biobanks collect data from the general population and are focused on identifying risk factors for disease development [such as the Rotterdam Study (https://populationimaging.eu/image-data/)], whereas disease-oriented biobanks aim to investigate the pathogenesis of human diseases and are generally focused on specific diseases (mostly cancer, such as the PRIMAGE, CHAIMELEON, Pro-Cancer-I and EuCanImage H2020 projects [13,14,15,16]).

Modern biobanks are not merely collections of samples and data but complex infrastructures in which biomedical images and clinical data extracted from medical health records are organised, curated through standard procedures and securely protected from unauthorised external access [17,18,19,20,21]. All these aspects must coexist in compliance with current European legislation on data protection (such as the GDPR), in addition to the national laws of the countries involved in the construction and use of biobanks. Biobanks need to generate data governance models that protect the information provided by the subjects enrolled, while simultaneously enabling and promoting its use for biomedical research purposes with open controlled access to data sets [22]. To overcome fragmentation, the most prominent European biobanks have started collaborations within BBMRI, whose mission is to secure access to biological resources and data required for health-related research in a sustainable manner.

In 2015, the ESR Working Group on Imaging Biobanks evaluated imaging biobanks within the ESR community by administering a survey to its members. The survey identified 27 potential repositories of images and associated biomarkers that could be in line with the definition of imaging biobanks [10]. However, no specific details regarding each biobank could be retrieved at the time of the survey, and the survey has not been updated since. Although a few articles provide an overview of imaging biobanks [5], they focus on specific types of biobanks, such as population-based [23] or oncological biobanks [11]. To the best of our knowledge, no systematic review of existing imaging biobanks has been published to date. Our purpose is to explore the current status of imaging biobanks.

Methods

Literature search strategy

A systematic literature search was carried out to identify imaging biobanks. For this purpose, we searched the PubMed (https://pubmed.ncbi.nlm.nih.gov), Scopus (https://www.scopus.com) and Web of Science (https://www.webofscience.com) databases using the following combinations of search terms: “imaging” AND “biobanks” and, in a separate search, “imaging” AND “repository”. The search covered the period from January 2010 to July 2021 and was carried out in July 2021. Only articles published in English were selected.
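As an illustration, the PubMed part of this search could be expressed programmatically through the public NCBI E-utilities `esearch` endpoint. The sketch below only builds the request URLs and does not contact the server. The function name, the choice of `datetype=pdat` and the `english[la]` language filter are assumptions, since the original search was not described at this level of detail.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (public API).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(term_a: str, term_b: str) -> str:
    """Return an esearch URL for `term_a AND term_b`, January 2010 to July 2021."""
    params = {
        "db": "pubmed",
        # Boolean AND of the two terms plus an English-language filter (assumed syntax).
        "term": f'"{term_a}" AND "{term_b}" AND english[la]',
        "datetype": "pdat",   # filter on publication date (assumption)
        "mindate": "2010/01",
        "maxdate": "2021/07",
        "retmax": "0",        # only the hit count is of interest here
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# The two separate searches described in the text.
biobank_url = build_pubmed_query("imaging", "biobanks")
repository_url = build_pubmed_query("imaging", "repository")
print(biobank_url)
print(repository_url)
```

Scopus and Web of Science use different query syntaxes and access rules, so equivalent queries would need to be written separately for each database.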

Moreover, the Community Research and Development Information Service (CORDIS) database (https://cordis.europa.eu/projects) was searched using the combination of the search terms: “imaging” AND “biobanks”, with the search encompassing collections, projects, project deliverables, project publications and programmes.

Study selection

In an attempt to improve the quality of our inclusion criteria, our analysis was performed following a four-step flow diagram based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [24].

Results

All retrieved publications (totalling 3139 and 6133 articles regarding imaging biobanks and imaging repositories, respectively) were separately uploaded to EndNote™, and all duplicate records were removed (1157 and 2107 articles regarding imaging biobanks and imaging repositories, respectively).

Two reviewers (M.G., R.B.) manually screened all included articles by title and abstract and, if an article was potentially eligible, reviewed its full text. The reviewers excluded (a) entries including both terms (imaging AND biobanks or imaging AND repository) that were not necessarily related to each other in the text, (b) entries including imaging but unrelated to diagnostic imaging, (c) reviews and conference papers, (d) entries referring to the same imaging biobank and (e) entries including biobanks with data derived from imaging, but without image collection (Fig. 1). Finally, 42 articles with a reference to an imaging biobank and 46 articles with a reference to an imaging repository (34 of which referred to the same biobanks) were selected, resulting in a total of 54 imaging biobanks.

Fig. 1 PRISMA flow chart of literature search for imaging biobanks (left) and imaging repositories (right). Adapted from [24]
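The record counts reported above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are hypothetical, and it assumes that 1157 and 2107 are the numbers of duplicates removed and that the 34 overlapping repository articles point to biobanks already found by the first search.

```python
# Records retrieved and duplicates removed, as reported in the text.
retrieved = {"imaging biobanks": 3139, "imaging repositories": 6133}
duplicates = {"imaging biobanks": 1157, "imaging repositories": 2107}

# Records left for title/abstract screening after de-duplication
# (assumes the second pair of numbers counts the removed duplicates).
screened = {key: retrieved[key] - duplicates[key] for key in retrieved}

# Articles kept after screening, and the overlap between the two searches.
biobank_articles = 42
repository_articles = 46
overlapping = 34  # assumed to refer to biobanks already counted among the 42

# Each non-overlapping repository article adds one new biobank.
total_biobanks = biobank_articles + (repository_articles - overlapping)

print(screened)
print(total_biobanks)
```

Under these assumptions the two searches together yield the 54 imaging biobanks analysed in the remainder of the study.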

To obtain more detailed information on the storage and use of images by the biobanks selected, the biobank website (if available) or the trial or project pages were also accessed.

Of the 54 imaging biobanks retrieved, 21 were population-based (38.9%) [9,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44] and 33 were disease-oriented (61.1%) [45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77]. Most imaging biobanks were European (26/54, 48.1%), followed by American (20/54, 37.0%) biobanks, with the remaining ones being based in Asia (5/54, 9.3%), Oceania (2/54, 3.7%) and Africa (1/54, 1.9%). Thirty-eight out of 54 biobanks (70.4%) were accessible upon request, 1 (1.9%) supported public access, and for the remaining 15 (27.8%) no information was found (Fig. 2).

Fig. 2 Pie charts of existing imaging biobanks classified by (a) target population (i.e. disease-oriented and population-based), (b) geographical distribution and (c) type of data access
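The proportions shown in these charts follow directly from the reported counts. The short sketch below recomputes them; the helper name `share` and the dictionary labels are hypothetical, and the counts are copied from the text.

```python
def share(count: int, total: int) -> float:
    """Percentage of `total`, rounded to one decimal place."""
    return round(100 * count / total, 1)


TOTAL = 54
by_type = {"population-based": 21, "disease-oriented": 33}
by_region = {"Europe": 26, "America": 20, "Asia": 5, "Oceania": 2, "Africa": 1}
by_access = {"upon request": 38, "public": 1, "not reported": 15}

for name, groups in [("type", by_type), ("region", by_region), ("access", by_access)]:
    # Every biobank must be counted exactly once within each classification.
    assert sum(groups.values()) == TOTAL
    print(name, {label: share(n, TOTAL) for label, n in groups.items()})
```

Because each share is rounded independently, the rounded values within one classification may not add up to exactly 100%.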

The number of patients included in each biobank ranged from 240 to 3,370,929 (last access date: 14 July 2021) (Table 1). Among disease-oriented biobanks, 9 were focused on neurodegenerative diseases (9/33, 27.3%), with particular regard to Alzheimer disease and other dementias, and 9 on cancer (9/33, 27.3%), of which 3 related to multiple cancers, 2 to lung cancer, 2 to glioma, 1 to diffuse intrinsic pontine glioma (DIPG) and diffuse midline glioma (DMG) and 1 to neuroblastoma and DIPG. The remaining disease-oriented biobanks were focused on the following: other neurological and psychiatric diseases (6/33, 18.2%), including migraine (1), autism (1), neuropsychiatric disorders (2), traumatic brain injuries (1) and post-traumatic epilepsy (1); stroke (3/33, 9.1%); cardiovascular diseases and diabetes (3/33, 9.1%); and lung diseases excluding cancer (2/33, 6.1%), including tuberculosis (1) and chronic obstructive pulmonary disease (COPD) (1). Only 1 disease-oriented biobank (1/33, 3.0%) was mixed, as it included data on cancers, cardiovascular diseases and chronic diseases.

Table 1 List of existing imaging biobanks classified based on their distinguishing features. COPD = chronic obstructive pulmonary disease, DIPG = diffuse intrinsic pontine glioma, DMG = diffuse midline glioma, DO = disease-oriented, DSA = digital subtraction angiography, HR-pQCT = high-resolution peripheral quantitative computed tomography, PB = population-based

The imaging modality most frequently involved was MRI (40/54, 74.1%), followed by CT (20/54, 37.0%), PET (13/54, 24.1%) and ultrasound (12/54, 22.2%) (Fig. 3). Two-thirds of the biobanks had been developed in the periods 2009–2010 (11/54, 20.4%) and 2012–2016 (25/54, 46.3%) (Fig. 4).

Fig. 3 Diagrams of existing imaging biobanks classified by disease distribution among disease-oriented biobanks (upper) and imaging modality (lower). *Mixed biobank (Parelsnoer Institute Biobanks)

Fig. 4 Time distribution of imaging biobanks by year of development (defined as year of biobank creation, or alternatively, of beginning of patient recruitment, based on available information). The red dashed line marks 50% of the maximum number of biobanks developed in a single year (n = 7 in the year 2010)

The following 3 disease-oriented imaging biobanks were found within the CORDIS database: CHAIMELEON, PRIMAGE and CARDIATEAM [13, 14, 78].

Discussion

Existing imaging biobanks are imaging data repositories intended for research, which also contain clinical data and are mainly disease-oriented, with a more frequent focus on oncological, neurological and cardiovascular diseases. A major role of current imaging biobanks is to provide imaging data for radiomics analysis so as to obtain biomarkers that can be correlated with clinical, genomic and histopathological factors. Although all biobanks included in our study store imaging data, only some (e.g. Bioheart Study, Cimbi, Healthy Brain Network, Alzheimer’s Disease Neuroimaging Initiative, ROBINSCA, Imagen) have correlated imaging data (e.g. functional MRI) with clinical and laboratory data, and a few plan to analyse quantitative imaging features and parameters to obtain imaging biomarkers.

Among disease-oriented biobanks focused on cancer, OncoLifeS (Oncological Life Study: Living well as a cancer survivor) is a hospital-based biobank with the threefold goal of offering an infrastructure for clinical cancer research, driving translational research towards more personalised cancer care and monitoring the quality of oncological care outcomes. It also aims to integrate clinical, laboratory and pathological data with image biomarkers in order to predict a range of outcomes, such as progression and treatment response in cancer patients [45].

Several biobanks have been built to address specific tumour types. Among them, PRIMAGE (PRedictive In-silico Multiscale Analytics to support cancer personalized diaGnosis and prognosis, Empowered by imaging biomarkers) is a 4-year Horizon 2020-funded project launched in 2018. The main purpose of this project is to create an open cloud-based platform for supporting decision making in the clinical management of two paediatric cancers, i.e. neuroblastoma (the most frequent solid cancer of early childhood) and DIPG (the leading cause of brain tumour-related death in children) [13].

Other tumour-specific biobank projects include the Polish Mobit project, the French Glioblastoma Biobank and the CHAIMELEON project. CHAIMELEON (Accelerating the lab to market transition of AI tools for cancer management) is a 4-year Horizon 2020-funded research project that started in September 2020 and aims to develop a structured repository of health images and related clinical and molecular data on the most prevalent cancers in Europe, such as lung, breast, prostate and colorectal cancer [14]. The objectives of the Polish Mobit (Molecular Biomarkers for Individualized Therapy) project include establishing a lung cancer biobank collecting patients’ tissue, blood and urine samples, and developing individualised lung cancer diagnostics which integrate genomics, transcriptomics, metabolomics and PET/MRI radiomics analysis [51]. The French Glioblastoma Biobank is another national biobank, founded in 2013 with the goal of collecting clinical and imaging data, along with biological samples, to support translational research projects and AI technologies for the management of glioblastoma [61].

A disease-oriented biobank project focused on non-oncological diseases is CARDIATEAM (CARdiomyopathy in type 2 DIAbetes mellitus), a 5-year research project started in 2019 and supported by Horizon 2020. The aim of this project is to determine how distinct diabetic cardiomyopathy is from other forms of heart failure, and to assess the extent to which type 2 diabetes contributes to its development and progression. In this way, the project is expected to deliver biological markers that could indicate which diabetic patients are at greater risk of developing diabetic cardiomyopathy, possibly leading to a more detailed understanding of the disease [78].

ImaLife is a population-based project that aims to investigate early imaging biomarkers of three common diseases in the general population (i.e. cardiovascular disease, lung cancer and COPD), in an effort to prompt earlier treatment and reduce mortality [30].

The integration of quantitative features extracted from medical images with associated clinical data is a key factor in catalysing a paradigm shift in precision medicine: it can assist diagnosis and prognosis, help predict disease outcomes and improve the understanding of the mechanisms behind complex diseases by learning from retrospective data [79]. Nevertheless, most results emerging from radiomic studies are negatively affected by the small number of cases collected, and novel biomarkers need to be validated in large multicentre populations. A small sample size can be a critical shortcoming when trying to find statistically significant correlations between treatment outcomes and radiomic features with a high level of confidence [80, 81]. This problem is especially significant when the different data repositories that form biobanks are intended to be used as a basis for the generation of predictive models (i.e. for diagnosis, prognosis or response to treatment). In this case, the sample size of the datasets is crucial for the models developed to be valid and robust. Sample size estimation is essential to ensure that the results are conclusive and representative of the cohort under investigation, as well as to increase the reproducibility of studies generated from the repository data.
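To make the sample size argument concrete, the sketch below estimates how many cases are needed to detect a correlation of a given strength between a radiomic feature and an outcome. It uses the standard Fisher z-transformation approximation, n = ((z_(1-α/2) + z_(1-β)) / atanh(r))² + 3, which is one common choice and not necessarily the approach used in the cited studies [80, 81]. The function name and the default values (two-sided α = 0.05, 80% power) are assumptions.

```python
import math
from statistics import NormalDist


def n_for_correlation(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size needed to detect a Pearson correlation `r` (two-sided test)."""
    if not 0 < abs(r) < 1:
        raise ValueError("r must be non-zero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_beta = NormalDist().inv_cdf(power)           # normal quantile for the desired power
    fisher_z = math.atanh(abs(r))                  # Fisher z-transform of the target correlation
    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)


# Required cases grow quickly as the expected correlation weakens.
for r in (0.5, 0.3, 0.1):
    print(r, n_for_correlation(r))
```

Weak associations require far larger cohorts than strong ones, which is one reason why multicentre imaging biobanks matter for radiomics research.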

In light of the above, the design of a biobank oriented towards exploitation using radiomics and AI predictive models should consider that the constructed data repository must have sufficient cases corresponding to the different events that may have been experienced by the subjects enrolled in a biobank [82]. Interestingly, while in the previous decade only pathological and clinical data were exploited for making predictions of disease outcomes, nowadays genomic, proteomic and imaging data extracted by high-throughput screening techniques are available for most patients with several diseases, including cancer [83].

Oncology is a suitable field for the discovery of imaging biomarkers, since cancer patients are frequently monitored with different imaging modalities for staging and post-treatment follow-up. In this setting, the correlation of imaging data with genetic, pathological, clinical and molecular data is of paramount importance to allow the development of predictive models [80].

The quantity and representativeness of the datasets used and the availability of high-performance computing infrastructures (HPCI) can have a strong impact on the validity of the proposed model of imaging biobank [84, 85]. Moreover, standardisation ensures data quality and the interoperability of different data sources through shared message formats and controlled terminologies, thus fostering collaboration and data sharing between biobanks [84]. As an indispensable requirement for research projects, the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles must be met, ensuring that data are findable, accessible, interoperable and reusable, which has a direct impact on the definition of the biobank architecture. Standards like the Minimum Information About BIobank data Sharing (MIABIS) [86] or the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) [87] can facilitate the exchange of sample information and data and the systematic analysis of heterogeneous observational databases, making the biobank compliant with the FAIR principles. In this regard, an integration model of imaging data and biological data expanding the MIABIS standard with DICOM metadata has recently been proposed [88].
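As a rough illustration of what linking sample-level and image-level metadata might look like, the sketch below groups biobank sample records and imaging study descriptors by a shared pseudonymised subject identifier. All class and field names are hypothetical and are not taken from MIABIS, OMOP CDM or the model proposed in [88]; only the comments referring to `StudyInstanceUID` and `Modality` mirror standard DICOM attribute keywords, and the identifiers used in the example are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class SampleRecord:
    """Biological sample entry (illustrative fields, not MIABIS attributes)."""
    subject_id: str     # pseudonymised subject identifier
    material_type: str  # e.g. "blood", "tissue"


@dataclass(frozen=True)
class ImagingStudy:
    """Small subset of study descriptors kept alongside the images."""
    subject_id: str
    study_instance_uid: str  # corresponds to DICOM StudyInstanceUID
    modality: str            # corresponds to DICOM Modality, e.g. "MR", "CT", "PT"


@dataclass
class LinkedSubject:
    subject_id: str
    samples: list[SampleRecord] = field(default_factory=list)
    studies: list[ImagingStudy] = field(default_factory=list)


def link(samples: list[SampleRecord], studies: list[ImagingStudy]) -> dict[str, LinkedSubject]:
    """Group samples and imaging studies by the shared subject identifier."""
    linked: dict[str, LinkedSubject] = {}
    for sample in samples:
        linked.setdefault(sample.subject_id, LinkedSubject(sample.subject_id)).samples.append(sample)
    for study in studies:
        linked.setdefault(study.subject_id, LinkedSubject(study.subject_id)).studies.append(study)
    return linked


# Placeholder data: two subjects with samples, one of whom also has an MRI study.
catalogue = link(
    [SampleRecord("S001", "blood"), SampleRecord("S002", "tissue")],
    [ImagingStudy("S001", "1.2.3.4", "MR")],
)
print(sorted(catalogue))
```

A real implementation would map these fields onto the attributes defined by the chosen standard and would keep the link between pseudonym and patient identity outside the research database.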

The present study has some limitations. First, not all imaging biobanks could be traced back to specific literature articles, partly because several biobanks are still under construction. Second, it was difficult to find information about some biobanks (for instance, many biobanks do not have a website) and about the way imaging data were handled.

In conclusion, existing imaging biobanks can serve as infrastructures aimed at collecting radiological images, which could be used to obtain radiomic data and to train and validate AI models. Links between biobanks (either pathological, imaging or clinical) are needed to fuel AI and radiomics research, aiding the development of precision medicine and promoting earlier transfer into clinical practice.