
14.1 Big Data

Along with worldwide population growth and the rapid spread of technology into almost every corner of human activity, an enormous quantity of information is generated every day from many sources: web traffic, mobile phone usage, social media activity, consumer preferences, financial systems, climate records, scientific reports, and medical and health-care information [1]. Moreover, the software and data company Domo Inc., in "Data Never Sleeps 5.0", its 2017 annual report on the world's data generation, indicated that the global internet population had reached almost 3.7 billion users and that 2.5 quintillion bytes of data were being created daily [2]. In this context, the digitalization, collection, storage, maintenance and analysis of these data have driven advances in data science and ushered in the so-called "Big Data era". This revolution has forced a change in prevailing paradigms, since most of these data cannot be managed by traditional and conventional data-management techniques, and novel infrastructures and approaches to data science must be adopted by executives, academics and economists all over the world.

John Mashey, a computer scientist trained at Pennsylvania State University, was largely responsible for popularizing the term "Big Data" in the early 1990s. Nowadays, Big Data is usually defined as a huge amount of highly complex data (on the order of terabytes, 10^12 bytes, and petabytes, 10^15 bytes) that cannot, within a reasonable period of time, be captured, managed, processed, interpreted and organized into information readable by human beings or by traditional data-processing software [1]. Most of the technologies underlying these novel approaches come from computational linguistics and machine learning. Table 14.1 lists some of the main software available for Big Data processing and management. In this context, according to Davenport, Big Data can be classified into two main groups: machine-generated (data created by machines without human intervention) and human-generated (data created with human intervention).

Table 14.1 Main software available for big data and data mining processing

14.1.1 Data Mining and Big Data Analysis

Managing data is one of the most important issues once you start to work with Big Data. Several authors have proposed methodologies for getting started which, in general, can be divided into data collection, data measurement, data analysis and knowledge discovery [3], framed according to the "three Vs": Volume (the amount of data generated and processed), Velocity (processing data at a speed that keeps pace with its generation) and Variety (the quality and type of data, i.e. whether it is structured or unstructured).

Data analysis and knowledge discovery yield novel and useful information from databases that can serve prediction, classification and innovation, depending on the type of information at hand. In this context, Data Mining is defined as an analytic process designed to discover or extract patterns, novel information and systematic relationships among the variables of large data sets (usually Big Data) [4]. It also involves storing and processing the data, as well as the complicated task of presenting the results in a form that is understandable and easily interpretable by everybody. The ultimate goal of Data Mining is the prediction of patterns. The term was first coined by Piatetsky-Shapiro and Frawley in the early 1990s, when software and data companies all over the world were rushing to develop novel algorithms for processing business data [5]; IBM had originally called the activity Data Collection in the 1960s. Data-mining techniques are the result of a long process of research in product and business development that began as the business of data warehousing and collection started to grow.

Using Data Mining effectively requires a complete understanding of the architecture of the data and of the analytical methods to be used, since predictive relationships in data do not necessarily represent the causes of an action or a behavior [6]. Although a detailed review of Data-Mining techniques is beyond the scope of this chapter, briefly, the process consists of three stages [7]:

  A) Exploration and collection of data: At this stage of the analysis we usually focus on the type of data we are going to work with and on the nature of the analytical problem to be addressed. This stage typically involves data curation [8], i.e. selecting adequate data with complete meta-information for the analysis, transforming the data into the correct format for the analysis software, defining the variables and predictors to be measured, and usually some exploratory analysis.

  B) Modelling the data: Building a model requires knowing the kind of question you wish to answer or the type of prediction you wish to make, since several software packages already implement well-known models [6]. This stage is critical and is the most elaborate part of Data Mining; a wide variety of techniques is available, such as association rules, classical statistics, classification, Bayesian statistics, and several machine-learning techniques [9] such as clustering, nearest neighbors, decision trees, neural networks, deep learning, bootstrap aggregating and boosting.

  C) Deployment and validation of the data: Building the model is not the end of the analysis. Once the model has been fitted and evaluated, you must validate and interpret the results, their significance and, where applicable, how the novel information should be classified, remembering that Data Mining is not statistics: we care less about how the data are distributed than about their potential for prediction and/or pattern identification [10]. Moreover, once knowledge has been gained from the data, it must be organized and presented so that anyone in the field can use it. Several informatics tools are useful at this stage for deploying data and exporting algorithms, such as the Predictive Model Markup Language, the Portable Format for Analytics, confusion matrices, SQL, R algorithms, Python, Java applications and several other programming tools [11]. A minimal worked example covering the three stages follows this list.
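As an illustration only (not a prescribed workflow), the three stages can be sketched in a few lines of Python with pandas and scikit-learn; the library's built-in breast-cancer dataset stands in here for any curated biomedical table.

    # Stage A: exploration and collection -- load curated data and inspect it.
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix, accuracy_score

    data = load_breast_cancer()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = data.target
    print(X.describe().T.head())  # quick exploratory summary of the predictors

    # Stage B: modelling -- hold out a test set and fit a decision tree.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier(max_depth=4, random_state=0)
    model.fit(X_train, y_train)

    # Stage C: deployment and validation -- judge predictive power on unseen
    # data, e.g. with the confusion matrix mentioned above.
    y_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print("accuracy:", accuracy_score(y_test, y_pred))

Note that the validation stage deliberately scores the model on data it never saw during fitting, reflecting the emphasis above on predictive power rather than on distributional assumptions.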

14.2 Big Data and Health

It is widely known that health care produces an enormous quantity of data, including diagnostic results and images, medical records, laboratory results, public health registries at different levels, and data produced by biomedical and clinical research [12]. In this context, Ruckenstein et al. [13] proposed the so-called "datafication of health", i.e. the conversion of all aspects of health, at its different levels, into quantifiable data. This means that all the techniques currently used for Big Data and Data Mining can be applied to analyze health at different levels, yielding the so-called "Biomedical Big Data" (BBD). Moreover, the value of health research based on non-traditional internet data streams such as e-mail, online purchasing and video conferencing has already been demonstrated [14].

BBD could help us analyze and store people's information throughout their lives on diseases, phenotype, genotype, behavior, environmental location, occupation and clinical data, thereby making health predictions easier for individuals and, in consequence, for populations [12]. In this context, it could provide governments, public health departments and decision makers with the right tools to implement adequate prevention policies and interventions that improve population health. For instance, machine-learning techniques have allowed several medical disciplines to create more accurate prognostic and diagnostic models based on pattern recognition (computer-aided detection), which can help physicians improve their diagnoses [15]. Several governments all over the world have started to invest in, and create departments dedicated solely to, the analysis of BBD. For instance, the National Institutes of Health (NIH) launched the "Big Data to Knowledge" (BD2K) initiative in 2012 [16], which involves multiple research centers, such as the Big Data for Discovery Science Center and the Center for Expanded Data Annotation and Retrieval, as well as a set of focused individual research and training projects, to enable biomedical research, foster novel approaches in data science, facilitate discovery and support new knowledge; the most important objective of this initiative is to index the software needed to operate these datasets. The European Commission's eHealth Action Plan 2012-2020, on the other hand, has allocated EUR 2 billion under the Horizon 2020 programme to invest in research and innovation on Big Data [17], as well as a public-private partnership with a budget of EUR 1.638 billion from the EU Commission and EUR 1.425 billion from other life-science industries and organizations. Moreover, a study by Oxford Economics revealed that at least 70% of healthcare companies are looking to invest in Big Data, Data Mining and cloud computing, expecting a significant impact on innovation across several topics in the healthcare field. In this context, as stated by Vayena et al. [14], it is important to think of health-related big data as an evolving ecosystem.

14.3 Big Data in Epidemiology

Epidemiology and public health have changed dramatically over the last years and, thanks to global advances in technology and to the interdisciplinary nature of these sciences, are now interconnected with many other disciplines. Nowadays, as Salerno et al. put it, the use of Big Data by epidemiologists means "the exploration and interpretation of very large and complex datasets derived from pooling cohorts, from omics projects, electronic stored medical records, and health digital information" [18]. An epidemiologist traditionally turns to primary and secondary data sources to initiate a study; in the Big Data era, however, there is an enormous amount of information and a far wider range of sources (medical records, biobanks, geolocalization, shopping habits, genomic data, pharmaceutical prescriptions, social behavior, among others). The new generation of epidemiologists must therefore be prepared for time-consuming data collection, curation and storage; for novel approaches using Data-Mining tools and statistics; and for novel ethical and legal challenges related to the potential harms of Big Data use, including confidentiality and privacy issues and other concerns raised by institutions and public agencies all over the world.

Several examples of epidemiological studies built on Big Data have arisen over the last years. Among the most important projects is the Nordic Arthroplasty Register Association (NARA) database, which since 1995 has gathered information on implant brands, fixation methods and implant survival in Denmark, Finland, Iceland, Norway and Sweden, covering more than 1 million patients in each country [19]. Another interesting example is the Observational Medical Outcomes Partnership (OMOP), a public-private partnership formed by several public and private representatives, including the Food and Drug Administration (FDA) and members of the pharmaceutical industry, that is mainly focused on pharmacoepidemiology; by using and incentivizing novel healthcare databases built on electronic health records, it enables record linkage and the possibility of complete lifelong follow-up [20].

Moreover, South Korea, a world leader in information-technology infrastructure, launched its Big Data Initiative in 2011, establishing a pan-governmental big-data network and analysis systems. In 2014 the country's Ministry of Science released the Medical Information Consulting Program to collect medical data, customize treatment and help the National Health Insurance Service become more efficient, providing patients with information such as duration of illness, cost of treatments, medical services, cases per location and institutions specialized in particular diseases, and providing the medical industry with information on pharmaceutical trends and on the drugs, medical equipment and devices most requested by the population [21]. A very successful case was the Seoul National University Bundang Hospital, the first hospital in the Asia-Pacific region to fully digitize its data; there, doctors and nurses can configure systems with precise clinical information, which has reduced patient referral times from 48 h to 4-6 h and, for example, reduced the dosage of antibiotics given before surgery [22].

14.4 Big Data, Biomedical Research and Omics

It has been years since the results of the Human Genome Project were published, and biology has experienced a revolution, both in our understanding of biological mechanisms and in access to the enormous amount of biomedical data now freely available. In the last few years, novel omics tools (genomics, transcriptomics, proteomics, epigenomics, metagenomics, metabolomics and microbiome studies, among others) have been applied daily to a wide range of biomedical research projects [23]. Moreover, this information now helps us understand the relationships between genotype, phenotype, behavior and many human outcomes through the development of so-called Systems Biology (Fig. 14.1) [16].

Fig. 14.1 Holistic points of view in systems biology. Analyses coming from different omics technologies are only helpful within a holistic approach spanning different levels; this is the objective of systems biology

However, the volume of information obtained from such technologies is increasing dramatically, and with it the need to develop public repositories. Moreover, each omics experiment presents its own challenges, since the standard nomenclature and analyses used for genes are not the same as those for proteins or metabolites. Each technology produces different data in different formats (Table 14.2), so standards for information and meta-information are needed to make stored data clear enough for other scientists to interpret, re-use, compare and reproduce [24]; a minimal sketch of reading one such format appears after Table 14.2.

Table 14.2 Main omics used in biomedical research and most common formats
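To make the format problem concrete, the following minimal sketch (assuming Biopython is installed; "example.fasta" is a hypothetical file used only for illustration) reads records from one common genomics format. Each omics listed in Table 14.2 would need its own parser and its own metadata conventions, which is precisely what the reporting standards discussed below try to harmonize.

    # Read sequence records from a FASTA file with Biopython.
    from Bio import SeqIO

    for record in SeqIO.parse("example.fasta", "fasta"):
        # record.description carries whatever meta-information the depositor
        # chose to include -- the free-text field that minimum-information
        # reporting standards aim to make consistent and machine-readable.
        print(record.id, len(record.seq), record.description)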

In the last few years, there has been an international effort by several consortia all over the world to develop standards under which omics results are produced and published, e.g. for genomics (Minimum Information about a Genome/Metagenome Sequence, from the Genomic Standards Consortium), transcriptomics (Minimum Information about a Microarray Experiment, from the Functional Genomics Data Society), proteomics (Minimum Information about a Proteomics Experiment, from the Human Proteome Organization), metabolomics (Core Information for Metabolomics) and epigenomics/transcriptomics (Minimum Information about a High-throughput Nucleotide Sequencing Experiment, from the Functional Genomics Data Society) [24]. In this burgeoning context, information will soon be reproducible and amenable to translation into personalized medicine.

14.5 Challenges of Big Data and Aging

It is widely known that life expectancy all over the world has increased in the last few years; according to the World Health Organization report on aging and health, in 2016 global life expectancy was 73.8 years for females and 69.1 years for males [25]. As life span grows, the prevalence of chronic conditions and age-associated diseases increases and becomes a matter of interest for several public and private stakeholders. Multi-morbidity results from this phenomenon and represents a challenge for health-care systems. Diversity increases as we age, and disadvantage plays a significant role in shaping these differences, which vary widely throughout a lifetime. In this complex and diverse context, recent advances in information, communication and biomedical technologies, such as electronic devices, digital medical diagnostics and prescriptions, assistive and wearable medical devices, and personalized medicine and genomics, represent an opportunity to provide the health-care system with new tools based on anticipation, prevention and an improved continuum of care. Another field of development, as electronic medical records improve and become widespread, is the use of a Big Data approach to provide decision makers in the public and private sectors with novel information for planning interventions and public health policies, and to help develop new pharmaceutical and health-device resources that answer the needs of an aging population [26, 27].

Modern developments in Big Data and Data Mining represent an opportunity to improve multidisciplinary exchange ("Big Knowledge") and, according to William Callaghan, a novel paradigm in the scientific method, since "...comprehensive data coverage unearths causal relationship between phenomena..." [28]. This is not far from most modern scientific projects; for instance, most genomic projects first acquire the genome and only then analyze the data to develop a hypothesis. However, since Big Data demands a holistic approach, multidisciplinary research groups must be created so that results improve and the probability of reaching the right conclusions rises, especially in the field of aging, where continuous monitoring, correct interventions and prevention are among the main challenges for health systems [26]. We must nevertheless be aware that, amid the complexity of aging in modern life, the lived reality of big data should be approached with caution: the data we shed every day are too revealing of our intimate selves, yet may also misrepresent us. Like a fluorescent light in a dark corridor, they can show both too much and not enough [29]. Therefore, since we do not yet know the true potential or the dangers of big data analysis, we must be restrained in our approach and seek a balance between enthusiasm and the reality of our limitations. It is important to mention that any government adopting any type of Big-Data approach to public health must create policies that protect individual data while at the same time promoting access to, and sharing of, such data for public benefit [14].

Over the past few years, many long-term care programs intended to implement information and communication technologies in long-term care systems, the so-called "aging in place", have been launched in countries such as Japan, Taiwan, the USA, Australia, Canada, South Korea and Denmark [26]. For instance, Japan's government made its long-term care insurance system mandatory and implemented the "Guideline to promote the Appropriate Use of Information Systems in Care for the Elderly in Conjunction with At-home Care" to stimulate Big Data networks linking information from healthcare providers, licensed in-home caregivers, nurses and doctors with patient information such as residence, medical treatment and nursing [30]. Another successful example of applying Big Data to health care for older people is the Australian e-Health Research Centre, where almost 50 researchers, software engineers and doctoral students develop technologies in partnership with clinicians to improve health-care systems; they have built an innovative platform for cardiac rehabilitation using wearable devices and mobile phones [31]. Moreover, several start-up companies have moved into the field, such as the New York-based Hometeam Inc., which developed its own software to match caregivers with families through mobile technology and highly personalized care planning for older adults. Another example is CareZapp, located in the United Kingdom, whose mobile app creates an ecosystem platform connecting caregivers, volunteers, family, friends and doctors to improve the health of older patients by means of in-home sensors, wearable devices and mobile phones with 24/7 notifications of patient status [32].

14.6 Globolomics of Aging Research

For decades, researchers in the biology of aging have focused on defining mechanisms that modulate aging by studying primarily a single metric, lifespan, sometimes described as the "gold standard". Increasingly, geroscience research is turning towards defining functional domains of aging, such as the cardiovascular system, skeletal integrity and metabolic health, as a more direct route to understanding why tissues decline in function with age (see Chap. 4 on geroscience). Each model used in aging research has strengths and weaknesses, yet we know surprisingly little about how critical tissues decline in health with increasing age and how the different systems interact. We likewise know very little about the interplay between the biological mechanisms of aging and chronic disease.

Over the last few years, research on the molecular foundations of aging, and consequently on anti-aging therapies, has increased dramatically. To mention just one example, a search for the words "molecular" and "aging" in the PubMed database returns some 35,563 documents, with the yearly output increasing from 553 documents in 1997 to 3,551 in 2017; a query of this kind can be reproduced programmatically, as sketched below. It is clear that several omics approaches, performed either with clinical human samples or with animal models (C. elegans, S. cerevisiae, M. musculus, Rattus norvegicus, etc.), have been critical for progress in the field. There is thus an enormous amount of information freely available on the web, and Big Data and Data-Mining approaches become relevant for systematically extracting information and deriving knowledge from it.
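As an illustration only, the following minimal sketch uses Biopython's Entrez interface to NCBI's E-utilities to reproduce such a PubMed count; the e-mail address is a placeholder required by NCBI, and the counts will differ as the database grows.

    # Count PubMed documents matching "molecular AND aging".
    from Bio import Entrez

    Entrez.email = "your.name@example.org"  # contact address required by NCBI

    # Total count across all years (retmax=0: fetch the count, not the records).
    handle = Entrez.esearch(db="pubmed", term="molecular AND aging", retmax=0)
    print(Entrez.read(handle)["Count"])
    handle.close()

    # Count restricted to a single publication year, e.g. 2017.
    handle = Entrez.esearch(db="pubmed", term="molecular AND aging",
                            mindate="2017", maxdate="2017",
                            datetype="pdat", retmax=0)
    print(Entrez.read(handle)["Count"])
    handle.close()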

As previously mentioned, several omics technologies are now available; however, integrating their data is the main challenge in the field and remains a black box in genomic studies [33]. A new way to improve our understanding of the aging process is to acknowledge its complexity and develop new methods capable of embracing it. Each omic tool will answer only its own specific questions, i.e. proteomics yields information that may or may not be related to metabolomics. Therefore, globolomics approaches (or "deep phenotyping" leading to molecular or genetic epidemiology, albeit at a finer resolution) must be considered when performing aging studies, so that results can be viewed from a systems perspective; one naive illustration of such data fusion is sketched below. Moreover, associations with physiological conditions must be considered [34]; projects such as the Human Physiome Project become increasingly important, as their analyses could better apportion the genetic and environmental (nature and nurture) contributions to aging and thereby contribute to the development of a novel theory of aging.
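As a toy sketch only, one naive integration strategy is to align samples across two omics layers, concatenate their features and look for joint structure; the CSV file names below are hypothetical, and real multi-omics integration relies on far more sophisticated statistical machinery.

    # Naive multi-omics fusion: concatenate matched samples, then joint PCA.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    # Hypothetical inputs: rows are samples, columns are molecular features.
    proteome = pd.read_csv("proteomics.csv", index_col=0)
    metabolome = pd.read_csv("metabolomics.csv", index_col=0)

    # Keep only samples measured on both platforms, then fuse the layers.
    shared = proteome.index.intersection(metabolome.index)
    fused = pd.concat([proteome.loc[shared], metabolome.loc[shared]], axis=1)

    # Scale each feature and project onto two joint components.
    scores = PCA(n_components=2).fit_transform(
        StandardScaler().fit_transform(fused))
    print(scores[:5])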

14.7 Expectations on Big Data in Aging

It has been quite a few years since the Madrid International Plan of Action on Ageing called for the elimination of inequalities in access to health care and for the development of novel health-care policies for older people [35]. Although there is still a long and arduous road to travel, novel technologies in informatics, genomics and communication could help us diminish such inequalities and move forward in all fields of aging research. In this context, many international efforts are already being successfully developed in the private and public sectors, and a number of researchers and governments are already compiling aging and health information in databases [30, 31, 36]. The more we learn about Big Data and Data Mining, the faster we will close this gap.