Introduction

The development of biobanking around the world during the last two to three decades has taken a great step towards a higher level and a new quality. A critical mass of knowledge has now been achieved, and future progress and growth must effectively capitalize on the knowledge, expertise, research achievements, and experience accumulated so far.

The numbers of publications, new biobanks, new projects, and new national and international initiatives and activities (Rony, Rooney et al., 2018) reflect this global movement. As biobanking is multi-branched and multidisciplinary, it touches many research areas, including medicine, biology, systems biology, information technology (IT), artificial intelligence (AI), machine learning, modelling, mathematics, statistics, big data, and others.

Biobanks have a primary role in the era of personalised medicine (some authors use the terms precision medicine, person-centred, patient-centred, or individualized medicine) [1,2,3,4], and the availability of large collections of patient samples is a critical requirement for personalised medicine to advance patient treatment [5, 6]. Biobanks are one of the pillars of personalised medicine, addressing all of its aspects, such as prevention, diagnosis, treatment, and monitoring of the individual patient [7].

At their core, biobanks collect, store, and share biological samples and data [8, 9]. Samples and data differ in origin and structure and require different handling methods. Samples stored in human biobanks cover a great variety of the human body: body fluids (blood, serum, plasma, urine, saliva, tears, spinal fluid, and so on), frozen tissues, FFPE (formalin-fixed paraffin-embedded) tissues, cells, DNA and RNA, and even hair and nails; almost any part of the human body can serve as a biological sample if we know how to use it. For every sample type, specific methods are defined for each step of the sample life cycle (acquisition, handling/processing, cleaning, storage, distribution, scientific analysis, and restocking of used samples) [10, 11]. Relatively new sources of samples are imaging techniques such as structural and functional magnetic resonance imaging, positron emission tomography (PET), electroencephalography, and magnetoencephalography [12], which also bring a new quality of data.
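The sample life cycle described above can be made concrete as an ordered sequence of stages. The following minimal Python sketch (the class and function names are illustrative, not taken from any biobank information system) encodes the stages and a simple transition rule:

    from enum import IntEnum

    class SampleStage(IntEnum):
        # Stages of the sample life cycle as listed above [10, 11]
        ACQUISITION = 1
        PROCESSING = 2
        CLEANING = 3
        STORAGE = 4
        DISTRIBUTION = 5
        ANALYSIS = 6
        RESTOCKING = 7

    def advance(stage: SampleStage) -> SampleStage:
        """Return the next stage of the cycle, or raise once the cycle is complete."""
        if stage is SampleStage.RESTOCKING:
            raise ValueError("sample life cycle already complete")
        return SampleStage(stage + 1)

    stage = SampleStage.ACQUISITION
    while stage is not SampleStage.RESTOCKING:
        stage = advance(stage)
    print(stage.name)  # -> RESTOCKING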

Every sample is associated with related data of different types: clinical data (demographics, death/survival data, questionnaires), imaging data (ultrasound, magnetic resonance, positron emission tomography), biosample data (values from blood, urine, saliva), molecular data (genomics, proteomics), digital pathology data, data from wearable devices (blood pressure, heart rate), implantable biosensors, miniaturized sensors, and much more [13,14,15]. The qualitative and quantitative aspects of biobanking data are growing fast, the data structure is becoming ever more complicated, and managing the data throughout their entire life cycle requires specific innovative approaches. The amount of data generated every day is astonishing [16]. This exponential growth is further fuelled by the digitisation of patient-level data, stored in electronic health records (EHRs) and health information exchanges (HIEs) and enhanced with data from imaging and test results, medical and prescription claims, and personal health devices [17].

Important current sources of big data are human microbiome biobanks and collections of microbiota of the human body [18]. The microbiome, as the entire collection of microorganisms, their genomes, and their metabolic interactions in a specifically defined environment, influences many human metabolic and other functions such as energy production, body temperature, reproduction, and tissue growth [19], and as a resource of big data it has an irreplaceable role in current and future biomedical research.
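To illustrate how heterogeneous the data attached to a single sample can be, the sketch below models one stored sample with a few of the data categories listed above; all field names and values are hypothetical placeholders, not a real biobank schema:

    from dataclasses import dataclass, field
    from typing import Any, Dict, List

    @dataclass
    class BiobankSample:
        """Illustrative record linking one stored sample to its associated data (hypothetical schema)."""
        sample_id: str
        sample_type: str                                         # e.g. "plasma", "FFPE tissue", "DNA"
        clinical: Dict[str, Any] = field(default_factory=dict)   # demographics, survival data, questionnaires
        imaging: Dict[str, str] = field(default_factory=dict)    # references to image studies (MRI, PET, ...)
        molecular: Dict[str, str] = field(default_factory=dict)  # genomics/proteomics result identifiers
        device_streams: Dict[str, List[float]] = field(default_factory=dict)  # wearable/biosensor time series

    sample = BiobankSample(
        sample_id="S-0001",
        sample_type="plasma",
        clinical={"age": 54, "sex": "F"},
        device_streams={"heart_rate": [72.0, 75.0, 71.0]},
    )
    print(sample.sample_id, list(sample.device_streams))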

Big data in health is too big, too fast, and too complex to process and interpret with existing tools [20]; similarly, biobank data keep growing beyond basic computing capacity and throughput and have therefore moved into the category of “big data”. Big data has become one of the most important frontiers for innovation, research, and development in computer science [21, 22] and is becoming an innovation driver for the modern development of biobanking. Big data is a huge new phenomenon that brings together cutting-edge theory and practice from academia and industry; it is a broad landscape centred on data [23]. Big data is radically changing biomedical research [24]. Examples of how big data is used in healthcare include preventing medical errors, identifying high-risk patients, reducing hospital costs and waiting times, enhancing patient engagement and outcomes, and the widespread use of electronic health records (EHRs) [14, 25].

As biobanking is the foundation of personalised medicine [26, 27], all aspects of big data in biobanks contribute to all aspects of personalised medicine, from prevention, diagnosis, and prediction to treatment. These processes are continuously turning biobanking research into data-driven research. During the last few decades, biomedical research has undergone a transformation leading to a novel paradigm of data-driven biomedical science [28, 29] using innovative strategies [30]. Big data, not only in biobanks, promises an enormous revolution in healthcare, with important advancements in everything from the management of chronic disease to the delivery of personalised medicine [17]. We are currently in the era of “big data”, which has completely changed people’s view of healthcare activity [29].

Big data

Definitions

According to Gartner’s definition [31], “big data” is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

McKinsey’s definition 10 years later describes big data as “the datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze” [32].

Big data creates a radical shift in how we think about research [33].

The big data paradigm shift is significantly transforming healthcare and biomedical research [34].

All biobanking data, not only from an individual but from a cohort or a population, together with data from clinical trials and longitudinal studies, show the characteristics of big data. The data on human subjects stored in biobanks are diverse and heterogeneous. Electronic health records and sensory data gathered through wearable, mobile, and other types of devices [35] are additional data used in biobanks. Imaging data are, in terms of their volume, considered big data. An important feature of biobanking data is that they are generated, flowing, and growing continuously over time. From every patient with a wearable mobile application, a continuous stream of data keeps arriving. From a cohort of patients, this stream of data is much larger and more diverse than from a defined group of patients; from a whole society, the data are larger and more diverse still. The higher we go in this hierarchy, the bigger, more diverse, and more heterogeneous the data become. Identifying this hierarchy and finding efficient tools to handle and evaluate biobanking data according to research purposes is extremely difficult, because biobanking data have exceeded the characteristics of “normal” data and, as outlined above, have reached the quantity and quality of big data.

Big data are generally characterized by three major features, commonly known as “3 Vs”: volume, variety, and velocity [29, 35, 36]. Volume means “how much data?”, variety means “what kind of data?”, and velocity means “how frequent or real-time is the data?” [37].

Subsequently, the list of “Vs” was extended to 5 Vs (volume, velocity, variety, veracity, and value) [38], Andreu-Perez et al. [20] offered 6 Vs (value, volume, velocity, variety, veracity, and variability), and more recently 7 Vs have been taken into consideration (volume, velocity, variety, variability, veracity, visualization, and value) [39, 40].

  1. Volume is how much data we have, which can be measured in gigabytes (GB), zettabytes (ZB), or even yottabytes (YB); one yottabyte is 1,208,925,819,614,629,174,706,176 bytes [41] (see the short check after this list).

  2. Velocity is the speed at which data become accessible. The current opinion is expressed as “if it’s not real-time it’s usually not fast enough”.

  3. Variety describes one of the biggest challenges of big data. Data can be unstructured and can include many different types of data. Organizing the data in a meaningful way is no simple task, especially when the data themselves change rapidly.

  4. Variability is different from variety. If the meaning of the data is constantly changing, it can have a huge impact on data homogenization.

  5. Veracity is all about making sure that the data are accurate, which requires processes to keep bad data from accumulating in the systems. The simplest example is contact records with false names and inaccurate contact information.

  6. Visualization is critical in today’s world. Using charts and graphs to visualize large amounts of complex data is much more effective in conveying meaning than spreadsheets and reports packed with numbers and formulas.

  7. Value is the end game. After addressing volume, velocity, variety, variability, veracity, and visualization, which takes a lot of time, effort, and resources, the researcher and the organization need to make sure they get value from the data.
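As a quick check of the figure quoted for volume in the list above, the byte count given for a yottabyte corresponds to the binary interpretation of the prefix, 2^80 bytes (strictly speaking a yobibyte); a two-line Python sketch confirms it:

    # Binary interpretation of the yotta- prefix: 1 YB = 1024**8 = 2**80 bytes
    yottabyte_bytes = 2 ** 80
    print(f"{yottabyte_bytes:,}")  # 1,208,925,819,614,629,174,706,176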

The currently longest list of “Vs” was compiled by Borne [42], who describes 10 “Vs” of big data:

  1. Volume: lots of data; we are now dealing with a “ton of bytes”

  2. Variety: complexity; thousands or more features per data item, many data types, and many data formats

  3. Velocity: high rate of data and information flowing into and out of our systems, real-time, incoming

  4. Veracity: necessary and sufficient data to test many different hypotheses, vast training samples for various models

  5. Validity: data quality, governance, and data management on massive, diverse, distributed, heterogeneous, “unclean” data collections

  6. Value: the all-important V, characterizing the business value and the potential of big data to transform an organization from top to bottom (including the bottom line)

  7. Variability: dynamic, evolving, spatiotemporal data, time series, seasonal, and any other type of non-static behaviour in data sources, customers, objects of study, etc.

  8. Venue: distributed, heterogeneous data from multiple platforms, from different owners’ systems, with different access and formatting requirements, private vs. public cloud

  9. Vocabulary: schema, data models, semantics, ontologies, taxonomies, and others

  10. Vagueness: confusion over the meaning of big data

According to the author [11], this long list of 10 “Vs” illustrates the big challenges of big data.

Sun [22] presents an original vision of big data as 10 big characteristics, the “10 bigs”: big volume, big velocity, big variety, big veracity, big intelligence, big analytics, big infrastructure, big service, big value, and big market. Volume, velocity, variety, and veracity are fundamental characteristics of big data; intelligence, analytics, and infrastructure are technological characteristics; and the remaining three, service, market, and value, are socioeconomic characteristics.

Faroukhi et al. [43] have recently published a transparent review of big data in which the authors support the model of 7 “Vs” (volume, velocity, variety, veracity, value, variability, and visualization).

Biobanks and big data

Big data velocity

Big data is getting faster and faster. Velocity refers to how fast data are generated. As data arrive continuously from more and more sources, biobanks have to work with both “old” data and real-time data, and usually with both together. Differences in velocity between data sources make processing more complicated. When speaking about data, we usually mean stored data; real-time patient data, by contrast, arrive continuously from wearables and biosensors on or in the patient’s body and can be recorded, observed, and monitored almost immediately, in real time. A growing proportion of health-related data is generated on the go. This adds a new dimension to data processing that requires new ways of accessing the data and new tools to handle them. Another important feature of real-time data systems is that, for a particular sample, scientists may additionally and retrospectively need patient data that did not seem important at the time of collection and storage. Historical and real-time data together enable machine learning models (specified later in the text) to generate various predictions or classifications. This will help predict individual patient outcomes and risk factors, support the use of clinical notes [44], and be “really” personal.
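As a simple illustration of how historical and real-time data can be combined, the hedged sketch below keeps a rolling window of incoming readings from a hypothetical wearable and flags when it drifts away from a baseline computed on stored historical values; the numbers and the threshold are purely illustrative:

    from collections import deque
    from statistics import mean, stdev

    # Baseline computed from stored ("historical") readings; all values are invented.
    historical_heart_rate = [68, 70, 72, 69, 71, 70, 73, 68]
    baseline_mean = mean(historical_heart_rate)
    baseline_sd = stdev(historical_heart_rate)

    # Rolling window over the real-time stream from a hypothetical wearable device.
    window = deque(maxlen=5)

    def ingest(reading: float) -> bool:
        """Add one real-time reading; return True if the recent window drifts from the baseline."""
        window.append(reading)
        return len(window) == window.maxlen and abs(mean(window) - baseline_mean) > 2 * baseline_sd

    for heart_rate in [71, 74, 90, 95, 97, 99]:
        if ingest(heart_rate):
            print(f"drift from historical baseline detected at reading {heart_rate}")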

Big data volume

The volume of data refers to the size of the data sets that need to be analysed and processed, which now frequently exceed terabytes and petabytes. The sheer volume of the data requires processing technologies distinct from traditional storage and processing capabilities; in other words, the data sets in big data are too large to process with a regular laptop or desktop processor [45]. The basic data of a single patient, e.g. age, sex, body parameters, laboratory results, and clinical trial data, measured regularly, on a one-time basis, or continuously from wearable mobile applications [28], do not by themselves reach the volume of big data. Put together for a group of patients, a cohort, or a population, however, the data take on the characteristics of big data. Recently, a large contribution to biobanking big data has come from imaging, multi-omics, and EHRs. The volume of big data in biobanks is further enriched by data taken from a patient group or a population on a long-term basis, as discrete or continuous data [28]. Among big data types, imaging data can be considered the largest in volume [46].
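One practical consequence of this volume is out-of-core processing: instead of loading a cohort file at once, it is aggregated chunk by chunk. The sketch below assumes pandas is available, and the file name and column name are placeholders rather than a real biobank export:

    import pandas as pd

    # Out-of-core aggregation: a cohort file too large to load at once is processed chunk by chunk.
    total, count = 0.0, 0
    for chunk in pd.read_csv("cohort_measurements.csv", chunksize=100_000):
        total += chunk["systolic_bp"].sum()
        count += len(chunk)
    print("cohort mean systolic blood pressure:", total / count)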

Big data variety

Big data in biobanks originates from different sources, formats, and types, e.g. personal data, body parameters, imaging, multi-omics, and wearables [15, 20, 35]. Biobanking data are structured, unstructured, or semistructured, or, as described by Faroukhi et al. in 2020 [43], structured, unstructured, and/or in between. The difference between the categories is clear. Structured, quantitative data can be highly organized and easily analysed (dates, numbers, patient names, body parameters) and follow a pre-defined data model, such as a database. Unstructured, qualitative data are the opposite: textual or non-textual, human- or machine-generated content (audio, video, images, word documents, social media, notes from EHRs, clinical trial results) [47, 48] without a predefined data model [49]. The ratio between structured and unstructured data is shifting rapidly towards unstructured data. Unstructured data are more difficult to work with, but when they are accessible, searchable, available, and relevant, they can be converted into information [49]. According to the same source, e-mail is considered semistructured data that have some organizational properties and are easier to work with than unstructured data. Variety in big data also means that data come from different sources and may be incomplete with respect to subject or time [43].
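The three categories can be illustrated with minimal placeholder examples; none of the values below come from a real record:

    import json

    # Structured: a tabular record following a fixed schema (placeholder values).
    structured = {"patient_id": "P-17", "age": 61, "sex": "M", "systolic_bp": 138}

    # Semistructured: JSON/e-mail-like content with some organisational properties but no rigid schema.
    semistructured = json.loads('{"subject": "follow-up", "labs": {"CRP": 4.2}, "attachments": []}')

    # Unstructured: a free-text clinical note; extracting meaning requires text mining.
    unstructured = "Patient reports intermittent chest pain after exercise; no medication change."

    print(structured["systolic_bp"], semistructured["labs"]["CRP"], len(unstructured.split()))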

Big data veracity

Big data veracity means, in general, how accurate or truthful a data set is, cleaned of untrustworthy, unreliable, and insecure data. Veracity concerns not just the quality of the data themselves but how trustworthy and reliable the data source, type, and processing are. Removing abnormalities, inconsistencies, and duplication, and managing volatility, are just a few of the aspects that factor into improving the accuracy of big data [37]. Veracity is the most important characteristic of big data; without it, no correct results can be achieved, and it can lead to wrong predictions, as the data context is not always known [20]. Big data veracity guarantees the right starting point for predictive models and for the creation of new research theories.
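A minimal illustration of such cleaning, assuming pandas and purely invented records, could look as follows; real veracity checks in a biobank would of course be far more elaborate:

    import pandas as pd

    # Invented records containing a duplicated row and an implausible value.
    records = pd.DataFrame({
        "patient_id": ["P-1", "P-2", "P-2", "P-3"],
        "age":        [54,    61,    61,    207],   # 207 is clearly an entry error
    })

    cleaned = (records
               .drop_duplicates()                    # remove duplicated rows
               .query("age >= 0 and age <= 120"))    # drop implausible ages
    print(cleaned)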

Volatility, as another “V”, makes the situation more complicated, because data change during their life cycle, and the speed of change differs: some data change rarely and some change more frequently. Veracity also means having enough data to formulate hypotheses and design models [29]. The value of biobank big data lies especially in developing algorithms for prevention, prediction, treatment, and follow-up.

Big data visualization

Big data visualization means making the data as transparent and descriptive as possible using tables, graphs, maps, 3D models, animations, and so on [43], with graphical tools and techniques. Visualization makes decision-making and models easier and better to present and to understand.
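As a small illustration, the sketch below plots a histogram of donor ages with matplotlib; the values are randomly generated stand-ins, and a real plot would query the biobank database instead:

    import random
    import matplotlib.pyplot as plt

    # Randomly generated stand-ins for donor ages.
    ages = [random.gauss(55, 12) for _ in range(1000)]

    plt.hist(ages, bins=30)
    plt.xlabel("Age at sampling (years)")
    plt.ylabel("Number of donors")
    plt.title("Illustrative age distribution of a biobank cohort")
    plt.show()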

Big data life cycle

Like biosamples, big data also have their own “data life cycle”: data acquisition, data pre-processing and processing, data storage, management, analysis, and finally visualization. Big data is a new discipline in which dedicated data management techniques, tools, and platforms can be applied [50, 51]. New science-supporting tools, especially IT tools, AI, and machine learning, are extremely important to keep up with the development of biobanking data.
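The data life cycle can be pictured as a chain of stages; in the sketch below every function is only a placeholder for the real tooling at that stage (acquisition systems, ETL jobs, databases, analysis pipelines), and the values are invented:

    def acquire():
        return [{"patient_id": "P-1", "hb": 13.9}, {"patient_id": "P-2", "hb": None}]

    def preprocess(records):
        return [r for r in records if r["hb"] is not None]   # cleaning: keep complete records

    def store(records):
        return list(records)                                  # stand-in for a database write

    def analyse(records):
        return sum(r["hb"] for r in records) / len(records)

    def visualise(result):
        print(f"mean haemoglobin: {result:.1f} g/dL")

    visualise(analyse(store(preprocess(acquire()))))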

Data acquisition

The current situation, in which the gathered data are often unstructured and disordered yet enormously growing in value, is a challenge for bioinformaticians, biostatisticians, and IT and AI specialists. Primary or raw data (data measured and collected directly from machines, the web, etc.) are usually not in a format that is ready for analysis [52]. Biobanking data are produced actively or passively by humans, systems, or sensors from different sources and can appear in structured, semistructured, or unstructured formats [49].

Data pre-processing

Working with such data directly is almost impossible, so the next step is to pre-process the data, because quality decisions need quality data [51]. Several steps make raw or primary data ready for further work, the so-called data pre-processing: cleaning, which means using only complete data; reduction, which means that the data follow a specific model and only data with the model parameters are used; transformation, which means converting data to the specific format required for the intended analysis; and discretization, which divides data into special sets such as subgroups, intervals, subsets, and files [43]. Researchers need to decide which data are crucial and necessary and which data are ballast, not needed for the actual research. The idea that the more data we have, the more research we can do, no longer holds.
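The pre-processing steps named above can be sketched on a toy table; the example assumes pandas, and the columns, values, and cut-off points are illustrative only:

    import pandas as pd

    raw = pd.DataFrame({
        "patient_id": ["P-1", "P-2", "P-3", "P-4"],
        "age": [54, None, 61, 48],
        "glucose_mg_dl": [92, 101, 180, 75],
    })

    clean = raw.dropna()                                # cleaning: keep complete records only
    reduced = clean[["patient_id", "glucose_mg_dl"]]    # reduction: keep only model-relevant columns
    transformed = reduced.assign(                       # transformation: convert units (mg/dL -> mmol/L)
        glucose_mmol_l=reduced["glucose_mg_dl"] / 18.0)
    transformed["glucose_band"] = pd.cut(               # discretization: divide values into intervals
        transformed["glucose_mmol_l"],
        bins=[0, 5.6, 7.0, 100],
        labels=["normal", "impaired", "high"])
    print(transformed)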

Data storage

When data are stored, it must be taken into consideration that they are collected from diverse sources, so providing storage space that is large enough, reliable, and safe is a complex and structured process. New technologies such as cloud computing services mark a shift to a new computing paradigm, and it has become increasingly challenging to ensure consistency when managing such large-scale data in cloud storage [53]. The local storage space of biobanks is, or soon will be, full, and cloud storage systems are becoming more and more important. With this, the problem of the security and safety of data in clouds is growing, as are the financial aspects and questions of sustainability.

Big data and artificial intelligence

Big data is closely connected with artificial intelligence (AI). According to a widely accepted definition, AI refers to the development of machines that are capable of perceiving, thinking, learning, and adapting their behaviour, just like biological organisms [54]. Artificial intelligence is changing the world we live in [55] and plays, and will continue to play, a great role in health research and care through its ability to work with large amounts of data, sorting and processing them to better predict risk factors and thus contributing to the formulation of preventive activities, more accurate diagnosis and treatment, and finally the prediction of treatment outcomes [56]. Artificial intelligence, as a new method, is slowly making inroads into biobanking. The interrogation of vast quantities of data from a large biobank can now be completed within a few weeks (even remotely), as opposed to months or years previously [6]. Efficient use of big data in real practice, together with the tools of artificial intelligence and machine learning, makes it possible to evaluate the data, predict an incident, assess risk, and save money and doctors’ time. Artificial intelligence will thus effectively enable the use of big data in health to prevent disease, speed recovery, and save lives [55].

Machine learning and deep learning, types of AI, allow computers to “learn” without being explicitly programmed and, in any given domain, can help improve and automate decision-making [57]. AI is important for managing big data and using them to identify new targets for new drugs and new biomarkers. Imaging methods, as non-invasive methods, produce huge amounts of data, and with the help of AI, biopsies, tissue sampling, and other painful and stressful procedures could be avoided; in these cases, carefully prepared models, algorithms, and other IT solutions can replace invasive actions. Based on the discussion published by Bresnick in 2018 [56] on health IT analytics, some other crucial applications of AI in healthcare systems are in progress: AI can in some ways support the staff of hospitals, healthcare institutions and organizations, and research centres. Big data can be used efficiently for studying various populations and ethnic groups and their specific features and can help to predict risk factors. AI supports personalised medicine in many ways. Personalised and precision healthcare can become a reality rather than a concept [58].
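As a minimal, hedged example of such machine learning, the sketch below trains a logistic regression classifier on synthetic data standing in for tabular biobank features; it assumes scikit-learn and NumPy and is not tied to any of the cited studies:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for tabular biobank features (e.g. age, BMI, one biomarker) and a binary outcome.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))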

Digitalization

A great future is predicted for digitalization. We already use digital pathology, digital radiology, digital imaging, and other digital functionalities. Digitalization, as a new IT tool in biobanking, brings not only new quality but also great new amounts of data, and with them requirements for the safe collection, storage, sharing, and processing of these data.

Automatisation

One of the basic prerequisites for the veracity and truthfulness of results and for research reproducibility is the quality of samples and data. The best way to achieve it is to automate as much as possible. Not the whole biobanking process, from collection through transport and storage to sharing, can be fully automated, but the recent tendency is to automate as much as possible to avoid mistakes made by human beings. The situation is better for data, which can be automated to a higher degree and in better modes using IT solutions, algorithms, and artificial intelligence. IT solutions help to categorize data and build catalogues and databases across different biobanks, regions, networks, and consortia, and they provide visibility and utility of the samples and data stored in biobanks.

Big data in biobanks and personalised medicine

Big data show the features of “personalisation”: the right data need to be used at the right time for the right patient [59], and thus they support the implementation of the principles of personalised medicine in practice.

Big data contributes to personalised medicine. Cirillo and Valencia [24] predicted in their review that big data in personalised medicine will require significant scientific and technical developments, including infrastructure, engineering, project, and financial management. Exploiting new tools to extract meaning from large-volume information has the potential to drive real change in clinical practice, from personalised therapy and intelligent drug design to population screening and electronic health record mining [15].

Big data provides the opportunity to enable effective precision medicine by performing patient stratification [20]; fine patient stratification is the basic step towards a truly personalised approach.
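Patient stratification is often approached as a clustering problem; the sketch below, assuming scikit-learn and entirely synthetic features, groups patients into two hypothetical strata:

    import numpy as np
    from sklearn.cluster import KMeans

    # Entirely synthetic feature matrix standing in for per-patient biobank measurements.
    rng = np.random.default_rng(1)
    features = np.vstack([
        rng.normal(loc=0.0, scale=1.0, size=(100, 4)),   # one hypothetical patient subgroup
        rng.normal(loc=3.0, scale=1.0, size=(100, 4)),   # another subgroup with shifted values
    ])

    strata = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
    print("patients per stratum:", np.bincount(strata))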

Big data and AI contribute to the changing paradigm of the personalised approach to healthcare, from treatment to prevention and prediction. AI devices can be combined with each other and with wearables, biosensors, mobile diagnostics, and telemedicine, making it possible to monitor a patient continuously and to receive a vast amount of data from an individual for further scientific purposes. Machine learning algorithms and their ability to synthesize highly complex datasets may be able to elucidate new options for targeting therapies to an individual’s unique genetic makeup [56]. AI applications include gaming, patient coaching, and virtual doctor interactions, and especially in chronic patients they contribute to a novel predictive, preventive, and personalised approach in which the patient is self-managed [60, 61].

One good example of big data utilization in personalised medicine concerns cancer patients. Despite remarkable achievements in cancer research, no reliable treatment exists for these patients [62]. New machine learning algorithms based on a multi-omics approach and on big data from large cohorts of cancer patients can make it easier to find the best possible treatment for every patient, i.e. personalised treatment [63]. Another aspect is the establishment of optimal biomarker panels for individualized patient profiling and improved multi-level predictive and prognostic diagnostics [64], together with other factors such as inflammatory cells in the tumour microenvironment [65]. It has long been a problem that drugs often show heterogeneous treatment responses even for the same type of cancer, and some drugs show sensitivity only in a small number of patients [66]. AI and predictive and preventive algorithms can identify an incident earlier than traditional procedures. For such studies, algorithms and big data sets of excellent quality can be used to build a reliable basis for automated decision-making programs. For all these automated or partly automated processes, models, and algorithms, samples and data from specifically oriented biobanks are crucial [38]. Big data enables connecting real cancer biobanks into virtual biobanks with greater numbers of patients and related data, making the utilization of samples and data more efficient.

Special attention should be paid in the future to children’s biobanks. Obtaining data from paediatric patients is more difficult because of the special and sensitive nature of the data. For a child, being in hospital is stressful, and collecting, storing, and sharing the data of paediatric patients is difficult even at the national level. Children and young people are, however, more open to new technical devices, wearables, and smartphones, which are often better accepted than in adult patients; in the home environment, data can be received continuously and under the same conditions, sometimes giving better results than in hospital. An algorithm that can diagnose 90 disorders in children is already available and successfully used [56]. In the young population, long-term collection of data on risk factors such as lifestyle, smoking, alcohol consumption, drug abuse, overweight, and hypertension, together with innovative screening programmes, will also contribute to better prevention [67].

Future

Regarding the data, one does not know, indeed cannot know, how data will be used in the future or what other data they will be linked with [68]. Every day, scientists face larger and larger amounts of data that can be used, now and in the future, for better healthcare. The main task is to find the optimal tools to discover the secrets hidden in the data.

Biobanks will play an essential role in the translation to personalised medicine by linking biological data to electronic medical records [12]. The more data from a single patient are available, the better and more personalised the approach to prevention, diagnosis, and treatment will be.

The future success of biobanks lies in using the data to predict and treat diseases [12]. As healthcare undergoes the paradigm shift “from treatment to prevention”, risk factors could be identified early on the basis of big data, and more effective preventive measures could be offered to the patient. Models for health risk assessment [69], survival rate estimation, and therapeutic recommendations would contribute to better healthcare [70].

Big data is, in some respects, becoming personalised in biobanking: researchers need to use the right data at the right time for a better customer relationship [59]. The principles are to identify the scientific and patient needs as precisely as possible and to process data in near-real time or even real time. As in other research and business fields, big data is a driving force in research itself. It means using all the “Vs” of big data for the best possible individual/personal outcomes.

The world map of leaders in biobanking is continuously changing as new players enter the field, e.g. China [27], India, and African countries.

EU and big data

The importance of big data in biomedical research and human health is highlighted by the European Commission (EC) in Horizon 2020, the biggest European research and innovation programme ever.

Big data presents great opportunities, helping us develop new creative products and services, for example apps on mobile phones or business intelligence products for companies. It can boost growth and jobs in Europe, but also improve the quality of life of Europeans, and it contributes to enhancing diagnosis and treatment while preserving privacy [71]. Several projects (AEGLE: An Analytics Framework for Integrated and Personalised Healthcare Services in Europe; My Health My Data; KConnect: Khresmoi Multilingual Medical Text Analysis, Search and Machine Translation Connected in a Thriving Data-Value Chain; MIDAS: The Meaningful Integration of Data, Analytics and Services) offer various data solutions for new drug discovery, treatment, and care and try to find the optimal use of heterogeneous resources such as bio-signal streams, health records, genomics, and other -omics, with respect for patient data privacy and safety [71].

Biobanking data, which are primarily personal data, are classified as sensitive data under the novel EU-wide legal framework for the protection of personal data, the EU GDPR (European Union General Data Protection Regulation): date of birth, sex, age, weight, blood pressure, and other body parameters, as well as data about lifestyle, employment, society, religion, and so on. The EU GDPR became applicable on May 25, 2018, as a binding rule for every member state of the European Union (EU) and completely changed the rules for collecting, storing, managing, and sharing samples and data, not only within the EU but also with partners from all over the world. The GDPR affects the data during the whole life cycle: collection, environment, use and availability, storage and duration limits, sharing and data access, and reproducibility. The GDPR is the first regulation with an international scope, and as such it is affecting organizations around the world [68, 72].

The EU’s goal is to personalise care, which means more effective care, less waste of time and resources, and greater patient satisfaction [55].

Conclusions

Big data is necessary to support the transformation of biobanks to an upgraded level, and biobanks, on the other hand, contribute significantly to big data, making big data research driven.

The big data paradigm shift is significantly transforming healthcare and biomedical research [34]; large amounts of multi-omics, imaging, medical device, and electronic health record data allow personalised medicine interventions while engaging infrastructure, research management, innovation, and sustainability [24].

Big data enable the use of large volumes of medical information to look for trends or associations that are not otherwise evident in smaller data sets [15].

Big data offers both opportunities and challenges and makes it possible to ask and answer questions in new ways [28].