15.1 What Is “Big Data”: The Big Part, the Data Part?

“Big Data” is a term that has been used in the last 10–15 years to describe not only the increase in the volume and complexity of data available in organizations, but also the novel computational techniques and methods needed to derive knowledge from the data. One formal definition of “Big Data” was published by De Mauro, Greco and Grimaldi in 2016: “Big Data is the Information asset characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value.” [1] Big data is often described in terms of the ‘Vs’ that characterize it: Volume, Velocity, and Variety. Volume refers to the size of datasets [2]. Velocity refers to the dynamic nature of the data, meaning that it might be rapidly changing and may require frequent updates to retain value. Variety refers to data complexity. Data complexity can mean heterogeneity of data elements in a dataset (e.g., timestamps, codes, text, images), of types of data (e.g., genomic, clinical, behavioral, administrative), or of code systems (e.g., LOINC, ICD, SNOMED, RxNorm), any of which can make the work of deriving meaning from the combined data more convoluted.

In healthcare, and particularly in mental health, the ‘Big Data’ paradigm has been recognized as having great potential [2,3,4,5]. Scientific publications in this field have been increasing since 2003 (Figs. 15.1 and 15.2). The paradigm and the concept of ‘Big Data’ can be expected to continue to evolve, as will its applications to mental health informatics research.

Fig. 15.1

Search results from PubMed (as of 29 Sept 2019); keywords “big data”, and “big data” AND “mental health”

Fig. 15.2

PubMed search results on a logarithmic scale, including the search for ‘informatics’ for comparison

Compared to other health domains, mental health conditions are currently classified less by underlying mechanisms of pathology and more by symptom patterns (see Chap. 5). While it is widely recognized that mental health and illness are influenced by complex relationships between mental, interpersonal, environmental, and biological factors [6], the nature of these relationships has been elusive. Knowledge discovery in mental health depends on greater insight into relationships between these disparate phenomena. This requires access to a range of data sources, including medical, administrative, molecular, ‘omics’, environmental, socio-economic, geographic, and social media repositories [7,8,9].

What constitutes ‘Big’ (Volume), dynamic (Velocity), and varied (Variety) depends on, and is relative to, each clinical research question or problem, as well as on data availability. For instance, many relevant problems in mental health research concern rare diseases (e.g., conversion disorder, certain types of psychosis) or rare outcomes (e.g., suicide, birth defects). In this context, “big data” can mean data that is very complex and difficult to work with, even if it is not necessarily large (Volume). It could be complex because the data sources are scattered across healthcare institutions, need to be mapped and linked, and require complex analytical methods. The need for specialized infrastructures, computational tools, and methods to analyze this type of data is perhaps the key component of the big data paradigm, and what makes it distinct from other approaches to research.

15.2 Methods and Paradigms

Compared to other research methodologies, the big data paradigm is often more exploratory and data-driven. Knowledge discovery typically means applying computational and statistical methods designed to identify previously unknown patterns in the data, leading to a hypothesis-generating approach. This contrasts with a hypothesis-testing approach, where theory and a priori knowledge drive the question framing, study design, and choice of research methodology (see Fig. 15.3). More recently, intermediate and mixed approaches have also emerged to synergistically combine the best of the two contrasting modes [10].

Fig. 15.3

The columns represent different research paradigms, and the rows different stages in the research process. The text in the boxes provides examples of activities and methods for each stage under the different paradigms. Note that a given dataset may fall at different points in the spectrum throughout its lifecycle: data collected through a hypothesis-driven protocol may later be used for knowledge discovery through secondary analysis, sometimes in combination with other datasets

Analytical and computational methods that are applied within the big data paradigm may range from simple statistical association to complex machine learning (ML) algorithms (see Chap. 10). Depending on the nature of the data in a repository, several methods may need to be combined and applied to the data. For example, complex variables (e.g., images, natural language) often need to be converted to simpler structured variables that can then be used for further analysis. Machine learning algorithms that can natively deal with the complexity of the underlying data (e.g., multimodal learning algorithms) may also need to be applied. Machine learning and data mining algorithms (Chap. 10) are used to develop classification and predictive models: they automatically identify patterns in the data by converting it into representations that can be modelled computationally. These algorithms are usually divided into two main groups: supervised and unsupervised. In supervised machine learning, the data has labels, e.g. diagnostic codes or assessment scores. The algorithm uses labeled training data to produce a model that can predict a label for new, unseen data. In unsupervised learning, the data has no labels, and the algorithm tries to identify inherent patterns in the data, e.g. clusters or other groupings.
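
To make the distinction concrete, here is a minimal sketch (with purely synthetic data, and scikit-learn as an assumed library choice) that fits a supervised classifier to labeled examples and then applies an unsupervised clustering algorithm to the same unlabeled feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 "patients", 5 derived features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hypothetical binary label

# Supervised: learn from labeled training data, predict labels for unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels; look for inherent groupings (here, k-means clusters).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```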

15.2.1 Essential Elements for Big Data Repositories

Some key elements are essential to the utility of big data repositories: appropriate governance, technical infrastructure, and metadata.

15.2.1.1 Governance

The first thing that needs to be in place for a big data repository to be of value is an appropriate governance model. Governance models outline how the data in a repository can and should be used in order to comply with national and organizational regulations. This is particularly important in mental health research and other clinical research fields, where the data may contain sensitive, identifiable information. There are many different governance models, ranging from completely open repositories from which identifiable information has been removed, to strongly guarded repositories in secure environments where access to the data is restricted to approved users. Data that poses any privacy risk is usually made available only under Data Use Agreements (DUAs), which specify how the data may be used and require the user to take steps to protect participant or patient privacy.

In general, individuals providing data to a research repository give informed consent for the storage and use of that data, but the rules and regulations are complex and vary from region to region. In many cases, data that have been stripped of all identifying information may be used for research without explicit consent. In some cases, Institutional Review Boards (IRBs), the entities responsible for reviewing research proposals within a given institution against ethical standards, may grant a “waiver of consent,” allowing research to be performed without consent. In the context of retrospective data mining studies, such waivers usually apply when there is minimal potential risk to the individual and when the research could not feasibly be carried out otherwise.

15.2.1.2 Technical Infrastructure

The core element of IT infrastructure for any data repository is handling data: data storage, management, and information models (the representations that specify the types, relations, and constraints of data) in databases. This can be designed and administered in different ways, with two emerging directions described as either “centralized” or “federated.” Centralized systems are locally maintained and organized, sharing a central framework; they generally involve moving data to a common location (often at the level of physical storage on the same platform) and protecting the boundaries with common privacy and security processes and safeguards. In contrast, federated solutions leave the data in place, stored in physically and logically separate individual systems; integration is instead achieved by creating common query interfaces that serve as an abstraction layer linking the individual systems for different information needs. As an example of a centralized approach, some Nordic countries have longstanding population databases to which all hospitals are mandated to provide data, which can be linked to an individual through a unique personal identifier [11]. Federated models, on the other hand, can allow the data to remain under the ownership of the healthcare management organizations while being leveraged together for secondary use, as in the Mental Health Research Network [12], which brings together 13 centers and records from approximately 12.5 million people.

The process of accessing the data in a central system involves running a query against the common data store. In contrast, in federated systems, a query is typically passed to the common query layer, where it is broken up into pieces, with each piece sent to the appropriate source system. The results of the different query-pieces are then combined and returned to the user, as if coming from a single central system. Of course, the results will be limited by whatever constraints the common query interface imposes—for example, it’s not uncommon for such systems to return only counts of cases, but not the detailed case attributes. In general, the federated approach can impose some additional technical complexity, but it can also solve a very important, and sometimes otherwise intractable, problem in data governance: it allows different organizations to retain local control of their data while exposing the local data assets to limited forms of computation (e.g., counting cases) that are defined by the federated query interface.
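
The count-only pattern described above can be illustrated in a few lines. In this minimal sketch the “sites” and their records are hypothetical; the point is that each site exposes only an aggregate count through the query layer, and row-level data never leaves a site:

```python
# Each "site" exposes only an aggregate count over its locally held records.
def make_site(records):
    def count_cases(predicate):
        return sum(1 for r in records if predicate(r))
    return count_cases

site_a = make_site([{"dx": "F33", "age": 34}, {"dx": "F20", "age": 51}])
site_b = make_site([{"dx": "F33", "age": 29}])

def federated_count(sites, predicate):
    # The query layer fans the query out and combines the per-site counts,
    # returning a result as if it came from a single central system.
    return sum(site(predicate) for site in sites)

print(federated_count([site_a, site_b], lambda r: r["dx"] == "F33"))  # 2
# Detailed case attributes never leave the individual sites.
```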

A productive approach to increasing the utility of data in a federated model is to map the data to one of the established common data models (CDMs), which are standardized models for organizing and representing data across different repositories. For example, the Observational Medical Outcomes Partnership (OMOP) CDM [13] is increasingly used for representing data from electronic health record (EHR) systems, transforming the content to a standardized format that can then be used for further analytics (see Fig. 15.4). Another example is the National Patient-Centered Clinical Research Network (PCORnet) [14], for which a CDM has been developed to extend the research capability of data repositories [15]. In clinical research, the Clinical Data Interchange Standards Consortium (CDISC) has developed a set of common data models [16]. For instance, the Clinical Data Acquisition Standards Harmonization (CDASH) is a model for data collection, the Analysis Data Model (ADaM) for analysis, and the Operational Data Model (CDISC-ODM) for data exchange; these can help harmonize data collected by different clinical trials or investigator-initiated studies [17].

Fig. 15.4

Mapping disparate data sources to a Common Data Model such as OMOP enables federated analysis across data sources. (From https://www.ohdsi.org/data-standardization/the-common-data-model/)
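
As a rough illustration of what CDM mapping involves, the sketch below transforms a hypothetical source EHR row into a simplified record echoing OMOP’s condition_occurrence table. The field names follow OMOP in spirit, but the source layout and the concept identifiers in the lookup table are invented for the example:

```python
# Illustrative ETL step: source EHR row -> simplified OMOP-style record.
SOURCE_TO_CONCEPT = {
    # (source vocabulary, source code) -> standard concept id (values made up)
    ("ICD10", "F33.2"): 99990001,
}

def to_condition_occurrence(source_row: dict) -> dict:
    key = (source_row["vocab"], source_row["code"])
    return {
        "person_id": source_row["patient_id"],
        "condition_concept_id": SOURCE_TO_CONCEPT[key],   # standardized concept
        "condition_start_date": source_row["date"],
        "condition_source_value": source_row["code"],     # provenance retained
    }

row = {"patient_id": 123, "vocab": "ICD10", "code": "F33.2", "date": "2019-03-01"}
print(to_condition_occurrence(row))
```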

Other important aspects of technical infrastructure include ensuring appropriate compute capacity, software environments, backup procedures, firewalls, user access protocols, etc. There have been significant advances in the development of distributed, high-performance computing environments in recent years. Distributed environments allow for efficient processing of large datasets, as well as deployment of complex algorithms that a single computer or server would take much longer to run. They enable more powerful processing for increasingly complex machine learning algorithms and may also support real-time processing. Hadoop [18], released by the Apache Software Foundation under an open source license, was one of the earliest examples and is still widely used. Other examples include Spark [19], Hive [20], Flink [21] and Kafka [22], each optimized for different properties, e.g., efficient in-memory processing or streaming data. Novel developments also include technical solutions for virtual warehousing, where linkage of various data sources with different ownership and structure is enabled without moving the data to a central location (providing technical methods for the federated approach described above), as well as Platform-as-a-Service (PaaS) delivery models, which are complete virtual development and deployment environments, i.e. building and maintaining the infrastructure is done by the cloud environment provider, not the data repository owner.
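
For a flavor of how such engines are used, here is a minimal PySpark sketch, assuming a local Spark installation and a hypothetical encounters.csv file with columns patient_id and dx_code. The aggregation is expressed once; Spark can partition the same work across a cluster of executors:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("encounter-counts").getOrCreate()
df = spark.read.csv("encounters.csv", header=True)  # hypothetical input file

# Count encounters per diagnosis code; Spark distributes the aggregation.
counts = df.groupBy("dx_code").agg(F.count("*").alias("n")).orderBy(F.desc("n"))
counts.show(10)
spark.stop()
```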

15.2.1.3 Metadata

For data repositories to be useful and manageable, the raw data also needs to be organized and documented in a way that enables would-be users to understand the data: what it represents and how it was collected or created. Metadata, or data about the data, is essential to characterizing the content of a repository by adding a layer (or several layers) of information about the data itself. For instance, one layer of metadata in a data repository is structural, in that it defines the elements and their relations in the database itself, such as the tables and columns. Other metadata layers might represent descriptive information to enable searching and extracting information, e.g. disease area or protocol type in a research database. Metadata models are particularly important for mapping and linking different data repositories. To ensure the utility of any data repository, the data structure, contents, meaning, and provenance must be well documented and, as much as possible, follow appropriate standards.

15.3 Big Data and Data Repositories

15.3.1 The FAIR Guiding Principles

Since the late 2010s, there has been a movement towards “open science” that has grown into an expectation from funding agencies and major publishing outlets [23]. The intellectual starting point for this movement was the so-called “reproducibility crisis”: the failure of published findings made by one group to be reproduced and published by another. There are numerous reasons for the lack of reproducibility in research [24]. The “open science” paradigm addresses two of them, namely lack of transparency in the methods and lack of transparency of the data. One thing individual researchers can do to make their own work “reproducible” is to ensure that methods and data are available alongside any results. However, publishing these in an ad hoc or non-curated manner may be insufficient for other people to make use of them. A group of stakeholders came together to formalize a set of guiding principles for researchers to enhance data sharing and reuse [25]. These guiding principles, published in 2016, were summed up by the acronym “FAIR”: Findable, Accessible, Interoperable and Reusable. Findable entails ensuring that data are assigned a globally unique and persistent identifier, described with rich metadata, and indexed in a searchable resource. Accessible involves using open, standardized protocols for data retrieval, allowing for authorization when necessary, with metadata that persists even if the data are no longer available. Interoperable means that the repository should use a formal and standardized knowledge representation model, using standardized vocabularies that themselves follow FAIR principles. Ensuring that a repository is reusable means that it should be free from reuse restrictions and released with clear usage licenses, with rich details about the content of the data in compliance with relevant community standards (see also Chap. 7).
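
As a hedged illustration of what “Findable” and “Reusable” metadata might look like in practice, the sketch below builds a minimal machine-readable dataset description. The field names loosely follow schema.org/Dataset conventions, and the identifier and URLs are placeholders, not real resources:

```python
import json

dataset_metadata = {
    # Findable: a globally unique, persistent identifier plus rich description.
    "identifier": "https://doi.org/10.9999/example.d123",   # placeholder DOI
    "name": "Example depression cohort assessments (synthetic)",
    "description": "Item-level PHQ-9 responses collected under protocol X.",
    "keywords": ["depression", "PHQ-9", "mental health"],
    # Reusable: a clear usage license and community-standard conformance.
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "variableMeasured": [{"name": "phq9_total", "unitText": "score, range 0-27"}],
    "conformsTo": "https://example.org/community-standard/v1",  # placeholder
}
print(json.dumps(dataset_metadata, indent=2))
```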

Repositories are often created to store and allow recall of discrete sets of data for transparency and reproducibility. Curated repositories provide a way to satisfy these aims by following FAIR principles [26]. However, once a repository has been used for these purposes, it can serve a secondary purpose: further knowledge discovery using big data approaches [27].

15.4 Secondary Usage

The use of electronic health records as data repositories for research stands somewhat in contrast to the curated model for data repositories. “Learning Health Systems” (LHS), described in detail in Chap. 1, rely on data collected through clinical care to inform and enable research, which in turn informs practice. One key attribute of an LHS is that new data is captured as an integral by-product of the care experience [28]. In this paradigm, each patient encounter may be considered a data point from which to glean new knowledge. Modern EHR systems create opportunities for knowledge discovery using data collected as a by-product of clinical care, rather than as a research artifact.

This approach to knowledge discovery using EHR data is sometimes referred to as “secondary use,” to distinguish it from the primary use of EHR data in support of care delivery, health system administration, and billing. Note, however, that EHR data alone are rarely sufficient to answer research questions about a specific disease or practice area. Understanding the difficulties of using EHR data directly for research helps illuminate the benefits of more traditional data repositories. Unlike EHRs, data repositories do the hard work of organizing data for one or more research uses. When they are successful, they dramatically reduce the time necessary to “wrangle and clean” the data before using it to answer a research question. When they are exceptionally successful, they allow data to be used for many kinds of related research questions, many of which could not have been anticipated by the original designers of the repository. Thus, data repositories are likely to remain in high demand even as our health systems move to further embrace EHRs and their secondary use in research.

15.4.1 Biobanks

Biobanks are large collections of biological samples and medical data, such as blood samples and blood pressure measurements, on a group or groups of individuals, that provide a platform for the study of health science (see also Multi-Modal Data Repositories below). The UK Biobank, for example, holds information and samples from 500,000 volunteers from England, Wales, and Scotland that are available to any researcher (for a small fee) for projects in the public good [29].

15.5 Categories of Data and Data Repositories

Data repositories of big data come in many forms. Virtually any of the kinds of data that can be used to acquire biomedical or healthcare knowledge can be used in big data paradigms. However, unlike in most other forms of research, the researcher working with repositories will usually have had very little input into the collection or organization of the data. Here we discuss the kinds of data that have been organized into repositories to which big-data methods have been applied. In the tables that follow in this chapter we have listed a variety of big data resources that have been, or could be, used to carry out knowledge discovery, categorized by the type of data and the resource type. In so doing, we have used an existing categorization of resources [30]. These categories are: (1) initiatives: activities or groups creating, collecting or cataloging data for research (I); (2) platforms: applications that enable a researcher to search for data sets (P); (3) datasets: specific data resulting from a study or created for a processing challenge (D); (4) studies: the processes that collect data from individuals or individual points to create the datasets (S).

Quite commonly, big data resources have characteristics of more than one type. For mental health, the types of data repositories that have been developed and used include some developed specifically for the study of one particular disease, such as Genetic Links to Anxiety and Depression [31] or RADAR-MDD [32], both of which are primarily aimed at understanding recurrent depression in people living in the UK and Europe. Others are broader, and these tend to cover larger populations and more data types, such as the Psychiatric Genomics Consortium [33], which has input from studies around the world, and the All of Us biobank, which is collecting data to study all aspects of health and wellbeing [34]. Some repositories are easier to understand because the data has been selected and organized, which we refer to as “curated”; others require expert knowledge or tools to search, but may be more convenient for storing data because they impose fewer rules. For example, a dictionary is highly curated, but the world-wide web is not. Big data repositories may comprise many different types of data, in some cases one at a time and in others integrating many together.

15.5.1 Refined Scientific Knowledge: Publication Databases and Specialist Databases

Databases of refined scientific knowledge often have as their unit of reference the publications or records of scientific studies, which are curated with metadata to enable consistency and easy searching (see Table 15.1). Clinicians and researchers use these sorts of databases every day, both for searching for specific studies and for carrying out systematic searches of a research topic. Publication databases are one type of refined scientific knowledge repository. There are several types of publication databases, each covering some scope of medical knowledge from broad to specific. The best known of these is the Medline

Table 15.1 Refined scientific knowledge repositories. As well as internal patterns, these databases are mined for information to analyze external datasets

database, which evolved from the “Index Medicus”, published by the US National Library of Medicine (NLM) since 1879 to index published literature of medical interest. Since 1997, Medline has been available to search online through the PubMed application. It currently has over 25 million citations indexed from 5200 journals, 85% of which have an abstract [47]; citations are also indexed with a bespoke hierarchical thesaurus known as Medical Subject Headings (MeSH) [48]. More specialist repositories, such as PsycINFO® [36] for behavioral and social science publications, will be highly tuned to the storage and recall of specific publications. Use of big data paradigms has enabled new uses of this data [49]. These have particular value in looking for potentially unanticipated patterns [50]. They have proved particularly useful in considering transdiagnostic patterns and comorbidity [51] by looking beyond the contents of individual publications to the patterns of the entire corpus, which often features publications in multiple disciplines and across multiple classes of disorders.
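
Searches like those behind Figs. 15.1 and 15.2 can be reproduced programmatically. The sketch below queries PubMed’s public E-utilities esearch endpoint for hit counts; for anything beyond casual use, NCBI asks clients to supply an email address or API key and to respect its rate limits:

```python
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count(term: str) -> int:
    # rettype=count asks esearch for the hit count only, in JSON form.
    params = {"db": "pubmed", "term": term, "retmode": "json", "rettype": "count"}
    resp = requests.get(ESEARCH, params=params, timeout=30)
    resp.raise_for_status()
    return int(resp.json()["esearchresult"]["count"])

print('"big data":', pubmed_count('"big data"'))
print('"big data" AND "mental health":',
      pubmed_count('"big data" AND "mental health"'))
```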

Instead of, or as well as, publications, some findings will be recorded in other databases specialized to the study type. For example DrugBank is a database of drug binding data reported elsewhere [39] and PharmGKB is a curated database of pharmacogenetic interaction knowledge [52]. One study integrated a database on the molecular structure and interactions of medicines with one on side-effects to predict side-effects of psychiatric medication [53]. The same technique also has potential for drug repurposing and drug design [54].

15.5.2 Biological Data

Many big data repositories have been developed to store biological data, either with or without other types of data (see Table 15.2). Among them are databases of -omic data, where ‘omics’ refers to fields of study in biology, as shown in Table 15.3 and described in more detail in Chap. 11. Imaging data and data from wearable devices (see Chap. 17) are, without phenotypic data, of little use for knowledge discovery in mental health in themselves. However, these data can be used for designing and training algorithms. The algorithms can then process and summarize results in a way that makes future biologic data from linked datasets, such as biobanks, more tractable. Examples can be seen in the use of the UK Biobank imaging and genetic data. Large numbers of brain MRIs and genomes were made available as part of the UK Biobank process, resulting in a massive resource whose full processing would test the capacity of most research institutes. However, researchers have used this alongside machine learning to develop rules that allow, for example, the relative thickness of areas of the cortex to be accurately and automatically measured from brain MRI images [63, 64], and copy-number variant sites in the genome to be identified [65]. These processes can then be used to probe the relationship between these features and disease using other clinical and research datasets.

Table 15.2 Biological data repositories
Table 15.3 “Omics” are fields of study in biology

Specialized tools such as WebGestalt [38] for genetic information and the Neuroscience Information Framework (NIF) [41] for brain-related information use specialized knowledge databases and biologic data repositories to add value to each resource. For instance, a team described performing a reverse GWAS for depression: a genome-wide association study (GWAS) usually starts from a trait or phenotype to find associated genetic differences, but this team reversed the process, using WebGestalt to describe biologically meaningful subtypes of depression on the basis of the genetic differences seen between individuals [72].

15.5.3 Behavioral Data

Of particular interest for mental health and illness research are records of behavior, which may be derived from interactions with social media, computers, wearable devices, and mobile phones. Table 15.4 gives some examples of the types of data that have been used in research to date. Traditional research on behavior would use self-reported or informant-reported observations captured on questionnaires or observation schedules. In the digital era there is potential for passive data collection of physical activity (accelerometer in wearable device), location data (geolocation on phone), voice data (from phone conversations) and facial features (from video data). These Big Data streams bring many of the challenges from the ‘Vs’ (Volume, Velocity, Variety), and innovative processing methods are often used.

Table 15.4 Behavioral data repositories

An example of a data processing opportunity came about through the accelerometer data from 100,000 participants in the UK Biobank cohort study. These wrist-worn sensors recorded motion in three dimensions 100 times a second (100 Hz) for seven days, over 60 million data points for each person. The purpose of the motion capture was to assess activity and sleep in UK Biobank participants, but processing such data on this scale had not been done before. Two techniques were developed. One team had video recordings from a subset of those with accelerometers, which they manually coded and then processed using machine learning methods to pick out the accelerometer signature for activities of interest [79]. Another team summarized the data based on periodicity indicating circadian rhythms in the participants [80].
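
As a simplified illustration of this kind of summarization, the sketch below computes, on synthetic data, one commonly used accelerometry summary: the Euclidean norm of the three axes minus 1 g (to remove gravity), truncated at zero and averaged over short epochs:

```python
import numpy as np

fs, epoch_s = 100, 5                          # 100 Hz sampling, 5-second epochs
rng = np.random.default_rng(0)
xyz = rng.normal(0, 0.1, size=(fs * 60, 3))   # one synthetic minute of x, y, z (g)
xyz[:, 2] += 1.0                              # gravity acting on one axis

# Per-sample "Euclidean norm minus one", negative values truncated to zero.
enmo = np.maximum(np.linalg.norm(xyz, axis=1) - 1.0, 0)
per_epoch = enmo.reshape(-1, fs * epoch_s).mean(axis=1)   # mean per 5-s epoch
print("epoch means (g):", np.round(per_epoch, 4))
```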

Elsewhere, the challenge of processing speech and video to detect emotion has been tackled in part with a set of research community challenges called the Audio/Visual Emotion Challenge (AVEC). AVEC brings together programmers from different fields into teams that are given a problem and a training set, and compete to develop the best prototype solutions over a limited time [81]. The scope of research may be expanded beyond just research volunteers into population-level mental health research through the use of virtual data trails. For instance, web searches related to suicide have been associated with trends in suicide over place and time [82], Twitter has been used to look at attitudes towards mental health [75] and Reddit used to look at associations between social support and mental health [78].

There are practical and ethical considerations around use of behavioral data, particularly for mental health research. Public consultations have shown people are wary about technology that tries to infer mental health states, such as speech processing, in a way that they are not about physical health [83]. What’s more, use of data in the public domain may be legally acceptable, but social media users have expressed discomfort at their text being used for research [84]. A further limitation is culture-specificity of content. For example, one study in Chinese social media found risk factors for suicidal behavior not seen in English-language studies [85]. Studies may need to be repeated in cross-culturally representative databases before findings are generalized. For more on these topics, see Chaps. 13 and 18 on Natural Language Processing and Ethical, Legal and Social Issues, respectively.

15.5.4 Clinical Administrative Data Repositories

Clinical administrative databases come in two broad types, as shown in Table 15.5. The first, exemplified by the Nordic health registers, is collected for public health monitoring, has very wide population coverage (aiming to be universal), and in some cases goes back many decades. The second, collected primarily for billing and reimbursement, tracks healthcare usage more narrowly and can be subject to bias from reimbursement policies [94]. These repositories have some distinctive characteristics. Their scale has several advantages: they can include people who may not volunteer for research, detect rare outcomes, and provide the statistical power to look at subgroups in the population. Use of this data can answer highly clinically relevant questions, for example, clarifying who is at risk of antidepressant-related suicidal behavior and from which medications [95].

Table 15.5 Clinical administrative data repositories

These distinctive characteristics also have some implications, particularly with respect to the quality of the data. It is important to remember that the data is entered for administrative or regulatory purposes and is subject to the fashions and influences of time and place. This may be particularly important for mental health, in contrast with many physical disorders where signs and symptoms are more clear-cut. For mental disorders, there are frequently barriers to seeking help, receiving a diagnosis, and getting treatment, and changes in these barriers may impact administrative-dependent statistics in ways that can look like changes in prevalence [96]. For instance, UK statistics show that while the number of people with symptoms of depression has stayed more-or-less the same over time, the number with an administrative code of depression went down, and the number treated with an antidepressant went up [97, 98]. One can imagine a similar effect in the US based on changes in reimbursement for different diagnostic codes. Efficient use of these databases also requires understanding that the coding systems used in the structured part of the clinical record are complex and are based on disease classifications (ontologies) that differ between settings and change over time. Figure 15.5 uses the example of what might be labelled as recurrent depression over time (from ICD-9 to ICD-10) and between settings (secondary care using ICD-10 and primary care using SNOMED-CT). The change in the WHO’s International Classification of Disease (ICD) from ICD-9 to ICD-10 altered the way mood disorders are classified, owing to ideological shifts in the classification of psychopathology; these changes mean that one-to-one mapping of concepts is not possible. Beyond the coding of disease states as in ICD-10, other coding languages add risk states, reasons for clinical encounter, and management. The Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) is a widely used, multilingual, computer-processable ontology, but it has a complex hierarchical structure, making the creation of a comprehensive list of codes to represent a disease in SNOMED a huge task. In Fig. 15.5, a clinician has noted the current depressive episode, prescription, and referral for a patient, but a colleague might instead have coded the history of recurrent depression, or specific symptoms of depression. The choice of coded items is quite variable, and non-specific codes (e.g. “Had a chat”, SNOMED ID: 183093006) are very common. Researchers are encouraged to consult the clinicians and coders who use the language, as well as looking for established code lists.

Fig. 15.5

Representing severe recurrent depression in the International Classification of Disease (ICD) versions 9 and 10, and Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT)
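
A small sketch of applying a code list across classifications appears below. The ICD-9 and ICD-10 codes shown are real (296.33 and the F33 block denote recurrent major depression), but the record layout is invented, and, as advised above, any production code list should be assembled with the clinicians and coders who use the language:

```python
RECURRENT_DEPRESSION = {
    "ICD9": {"296.33"},        # recurrent major depression, severe, no psychosis
    "ICD10_PREFIX": ("F33",),  # recurrent depressive disorder block
}

def matches(record: dict) -> bool:
    if record["system"] == "ICD9":
        return record["code"] in RECURRENT_DEPRESSION["ICD9"]
    if record["system"] == "ICD10":
        return record["code"].startswith(RECURRENT_DEPRESSION["ICD10_PREFIX"])
    return False  # e.g., SNOMED CT would need its own curated list

records = [{"system": "ICD10", "code": "F33.2"},
           {"system": "ICD9", "code": "296.33"},
           {"system": "ICD10", "code": "F32.1"}]  # single episode: excluded
print([r["code"] for r in records if matches(r)])  # ['F33.2', '296.33']
```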

Another type of clinical administrative database, in which the unit of analysis is not a patient but a medication, is formed by spontaneous reports of adverse events associated with medications. The largest of these is the central database of the World Health Organization Programme for International Drug Monitoring, which gathers information from 123 countries and holds over 10 million reports [88, 99].

15.5.5 Electronic Health Records

Electronic health records (EHRs) contain the information entered by clinicians and administrators on a day-to-day basis in clinical care. Having evolved from systems of paper notes, they are meant to support clinical practice. EHR information can be structured, as in assessment forms, lab results, and diagnostic or medication codes, or unstructured, as in medical notes written in free text. EHRs are not designed for research use, but can be used for research purposes with certain caveats in place [100] (see Table 15.6).

While administrative databases carry summary information about health episodes, as required by the entity housing the registry, EHRs go beyond this, containing more contextual information about each healthcare encounter, even when limited to coded data. A study comparing the Clalit claims database in Israel to structured information from electronic health records for the same encounters showed incremental gains from the extra information [101]. Such gains may come at the cost of extra practical difficulties and issues of confidentiality that arise from accessing individuals’ notes, although there are a number of governance and regulation models that can facilitate access while maintaining high ethical standards (see Chap. 22). Going beyond codes by including the full text of electronic notes in the registry can vastly increase the ability to identify aspects of phenotypes that are either not frequently coded [102, 103] or not included in current ontologies [104]. It also offers some of the best opportunities for capturing personal life events, such as bereavement or domestic abuse, that are vital for research involving social determinants of health [105]. For example, knowledge discovery techniques have been applied to full-text EHR notes to explore patterns of symptoms and diagnosis [106, 107], predict risk of disorder or adverse events [108, 109], and explore disease correlations [110].
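
As a deliberately naive illustration of extracting non-coded phenotype information from free text, the sketch below applies simple regular expressions to an invented note; real pipelines (see Chap. 13) add negation and context handling, which this sketch pointedly lacks:

```python
import re

NOTE = "Pt reports low mood and poor sleep. Denies suicidal ideation."
PATTERNS = {
    "low_mood": re.compile(r"\blow mood\b", re.I),
    "sleep_problem": re.compile(r"\bpoor sleep\b|\binsomnia\b", re.I),
    "suicidal_ideation": re.compile(r"\bsuicidal ideation\b", re.I),
}

found = {name: bool(p.search(NOTE)) for name, p in PATTERNS.items()}
print(found)  # naive matching misses that the ideation here is *denied*
```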

As EHR systems have become more widely used in healthcare, the potential for using them within big data paradigms has increased. Recently, initiatives for integrating and linking EHR repositories from different healthcare institutions have been developed, such as the Informatics for Integrating Biology & the Bedside (i2b2) consortium and the Shared Health Research Information Network (SHRINE) [111, 112], which enable more comprehensive use of diverse EHR data, with both more individuals included and different disciplines represented. These initiatives use the federated model described above. Another example is PopMedNet [113], a platform that aims to enable distributed health data networks. Furthermore, EHR systems offer the opportunity to merge daily healthcare with data-driven research in (almost) real time, accelerating learning health system frameworks [114, 115]. As described above and in Chap. 1, these frameworks have the goal of providing continuous improvements in healthcare delivery by using the information generated by clinical practice to improve the care delivered to patients.

Table 15.6 Electronic Health Records (EHRs)

15.5.6 Linked Multi-Modal Data Repositories: Multiple Data Sources

Linking databases with different types of data offers immense opportunities to researchers and clinicians using big-data paradigms to acquire actionable knowledge, by maximizing the variety and volume of data available for generating hypotheses, as shown in Table 15.7. For example, a system that integrates notes from different specialties breaks down the traditional information silos that have built up, first through paper and then through lack of interoperability, increasing the variety of the data [136]. It is worth noting that all the data for a multi-modal data repository could sit in one place, or could sit in separate repositories linked by a virtual framework that allows integrated searching [137], using the “federated query” model described above.
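
One common linkage technique is to join sources on a pseudonymized identifier so that raw identifiers never co-reside. The sketch below is a minimal illustration, with invented records and a shared salt as assumptions; production linkage involves governance agreements and often probabilistic matching:

```python
import hashlib

SALT = b"agreed-secret-salt"  # shared between data controllers, never published

def pseudonym(national_id: str) -> str:
    # Same input + same salt -> same opaque key at both sources.
    return hashlib.sha256(SALT + national_id.encode()).hexdigest()

clinical = {pseudonym("AB123"): {"dx": "F33"}}           # from the EHR source
claims = {pseudonym("AB123"): {"days_absent": 41}}       # from a claims source

# Join on the shared keys to build one multi-modal record per individual.
linked = {k: {**clinical[k], **claims[k]} for k in clinical.keys() & claims.keys()}
print(linked)
```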

Table 15.7 Multi-modal data repositories

The potential for knowledge discovery about mental disorders expands greatly when there is more variety in data types, for instance when a participant’s clinical information (such as the presence or absence of psychotic illness) is linked to other types of data [138], including biological or behavioral data. Thus, genetic data linked to self-reported diagnoses can generate hypotheses about the heritability of mental disorders [139], and prescription data linked with diagnostic codes can be searched for patterns that generate hypotheses about efficacy and adverse events [94, 140, 141]. Predictive models usually perform better when more, and more varied, types of data are linked together. For instance, algorithms predicting treatment response for people with depression have been shown to be more accurate when they take into account more types of data [142], and a study looking at predictors of suicide in US soldiers found important predictors, such as service history and criminal record, in addition to standard clinical information [143]. Linkage of clinical data to external datasets can also be used to include aspects of functioning missing from clinical data of healthcare encounters, as in a study using disability claims to explore absence from work [144] and in attempts to pool educational data about younger people to look for early signs of mental disorder [145].

The linkage of detailed phenotypic data with -omics data, imaging data, and detailed geographical data was formerly limited to small-scale cohort studies or surveys, which have led to the discovery of many features that confer risk of mental disorder, but each of small effect [146]. Very large samples are required to look at the interplay of these features. The UK Biobank, for instance, has enrolled 500,000 people who spent a half-day at an assessment center and gave blood for genomics, metabolomics, and epigenetics; activity data and imaging data will be available for 100,000 each, and a focused mental health questionnaire has been answered by 160,000. This information is linked to hospital registry data for all participants, and to primary care data for a majority, and is searchable online (example in Fig. 15.6). Such data repositories are particularly useful for studies that look at associations between systems usually studied by different groups of researchers, such as metabolic phenotype with depression phenotype [147], and are ripe for data mining for potential new biomarkers [148].

Fig. 15.6

Screenshot from the UK Biobank (Credit UK Biobank ©)

The field can benefit from participation in existing data repositories, but there are still limitations, and initiatives exist to improve upon them. For example, UK Biobank has insufficient coverage of ethnic minorities to make any meaningful comparison between people of different backgrounds, or indeed to know whether findings even apply to individuals with ancestries other than the majority White European. The National Institutes of Health (NIH) in the USA has a bigger biobank, called “All of Us”, that has engaged minority communities to try to achieve coverage that will allow better studies of how ethnic background and associated factors affect mental health [149]. UK Biobank also has only a small share of questions on mental health and a restricted age range, but disease-specific biobanks such as Genetic Links to Anxiety and Depression (GLAD) have taken advantage of a completely web-based platform to recruit people of all ages across the UK who have all experienced depression or an anxiety disorder. Finally, some studies require an enormous number of observations to make discoveries, which has led to international collaborations to pool data, like the Psychiatric Genomics Consortium [33].

Much of the work done on these linked databases is to try to generate hypotheses regarding potential etiology of mental disorders, and through this insight, to suggest potential treatment and prevention. For instance, considering comorbidities of mental disorders has suggested genes and proteins that may link them [110, 150] and the biologic basis of mental disorders is being investigated by the linking of genomic data to imaging data to mental and behavioral data [151]. Linking different kinds of psychosocial data can also help to understand health outcomes, such as linking personality traits to social behavior and self-harm [152] and to look at wider outcomes of mental disorder such as educational attainment [145] and occupation [153]. Conventional mental disorder diagnostic categories are usually used in knowledge discovery, but teams have also used data to suggest refinements to diagnostic categories—for instance the finding that immunology can be used to subtype Autism Spectrum Disorders—and these subtypes have an influence on clinical trajectory [154]. Others have gone beyond categories to look at transdiagnostic patterns and dimensional phenotypes [155]. This is greatly facilitated by extracting features from full text in electronic health records [103, 104].

15.5.7 Practical Challenges of Using Data Repositories for Mental Health Research

Different kinds of data collection methods may result in different biases. A distinction may be made between those research data sources where the participants are volunteers, and administrative data sources where data is used under provisions for the ‘public good’ in a massed and de-identified way. A volunteer cohort often has a selection bias towards the health-conscious and well-educated [156]. Administrative health data is commonly only routinely collected when the participant receives medical care—usually when they are unwell. This gives rise to an observation bias (attending medical care for one disorder makes documentation of another disorder more likely), which may need attention in analysis. A consideration of these and other source-specific biases is important in planning studies and interpreting results [157]. Two particularly pertinent considerations are missingness and psychiatric diagnosis.

Missingness

Consider a situation where a researcher is interested in differences in psychiatric diagnosis between people from different racial groups. They may use an EHR repository and find a structured field for ethnicity, but discover that in over half of cases it is not completed. The researcher then finds that someone has published a natural language processing application that extracts ethnicity information from free text, but it was developed on, and designed for, primary care notes rather than secondary care notes, so how the application will perform on this new data is unknown. There is the possibility of linking the EHR to national census data (where regulations permit), but this only links cases where the person has not relocated since the last census, and the census uses a different ethnicity classification than the EHR. The overall picture is thus not simply one of missing data, but of multiple sub-optimal options for ascertaining the data, which the researcher has to navigate. While data missing at random is difficult enough, it is more likely that each of these data types carries a different bias in availability, which means that using only the cases with complete data is liable to reduce not just the size, but the representativeness, of the whole. For example, the census data will be less likely to reflect students and people with insecure housing, who might make up important strata within the study.
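
The consequence of such differential missingness can be demonstrated with a few lines of synthetic data: when one stratum (here, students) is more likely to have a field missing, complete-case analysis changes the composition of the sample, not just its size:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "student": rng.integers(0, 2, n),            # 1 = student, 0 = not
    "ethnicity": rng.choice(["A", "B", "C"], n),
})
# Suppose students are far more likely to have no recorded ethnicity.
p_missing = np.where(df["student"] == 1, 0.7, 0.2)
df.loc[rng.random(n) < p_missing, "ethnicity"] = np.nan

complete = df.dropna(subset=["ethnicity"])       # complete-case subset
print("full sample, share students:   ", round(df["student"].mean(), 2))
print("complete cases, share students:", round(complete["student"].mean(), 2))
# The complete-case subset under-represents students, skewing any comparison.
```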

Psychiatric Diagnosis

There is a ‘diagnosis’ structured field in the EHR that is an ICD-10 code, but the researcher may find that, since clinicians are obliged to complete this field as soon as they see someone in clinic, many cases are coded using “fudge codes” (such as F99, mental disorder not otherwise specified). Using hospital discharge codes instead gives a more intelligible output, but restricting to people discharged from psychiatric hospitals will distort the sample towards those who are most likely to be admitted: those who are perceived to be a risk to themselves or others. Ideally a researcher would like to know about the reliability of a discharge diagnosis through “validation studies”, but as recent reviews testify [158, 159], the variability between sources of diagnosis, probably by hospital and clinician, and possibly by gender and ethnicity [160], means that validation done in one cohort or database may not translate to another. There is also a documented phenomenon of misclassification bias away from more stigmatizing diagnoses in administrative data [161]. Ultimately, databases may never be able to give the fully considered, nuanced diagnostic formulation that a clinical interview can, and this can have consequences for research [162], so the researcher may have to embrace that uncertainty. The issue of diagnostic classification is particularly thorny when working across different cultures [163], so extra considerations in research design may be needed where this occurs [164].

15.6 Case Study: Developing a Big Data Registry/Repository

To understand the design constraints on research data repositories, it may be helpful to adopt the perspective of an entity (or entities) charged with developing and maintaining them. As an example, a task might be to develop a data repository of all data generated by research funded by the US National Institute of Mental Health, perhaps only on a single mental disorder, Autism Spectrum Disorder (ASD). The only requirement is to store data about research participants or patients (not, for instance, data generated by wet-lab experiments on bacteria strains).

The first step is to conduct a requirements analysis to answer some basic questions to an adequate level of specificity. The goal of the analysis will be to develop a reasonably clear picture of the intended data uses, the expected data sources, and a vision for how to transform and store the data from the sources so that it supports the intended uses. This analysis should aim to answer (at least) the following questions:

  • What are the intended data uses?

    • What kinds of research questions can the data answer?

    • Can prototypical analytics methods be articulated that are appropriate for the data?

    • Who are the expected users and what type of skills and knowledge relative to data use might they have?

    • Who are the important stakeholders in the data repository that may not be data users (e.g., the public, anyone providing funding, government oversight agencies, data sources, industry groups)? What does the repository need to show to keep these stakeholders informed and supportive?

    • Are there important privacy constraints on intended uses?

  • What are the expected data sources?

    • What are the expected data types that will be supported? (E.g., limiting submissions to form-derived data, or more complex experimental results or raw sensor readings that may be submitted as large files).

    • What format are the sources most likely to provide the data in? How much variability is anticipated in data submission formats and content? How much uniformity can be enforced in data submission formats and content?

    • What data linking requirements (if any) are going to be enforced? Will research participants be linked across studies? How? Are there privacy constraints on linkage?

  • What options are available for data transformation and storage that would support the intended uses?

    • What data models and architecture provide adequate representation for each intended data type? (E.g., is a relational database sufficient? Can all data points and their relations be represented accurately?)

    • What mechanisms will be available to the data users to search and retrieve data of interest?

The requirements analysis should provide input into the next step: the design phase. One main area of tension in the design is likely to revolve around how strict vs. relaxed the data submission requirements should be, which is related to how highly “curated” the repository will become. Very relaxed requirements mean anything goes in. This lowers the barrier to submission for data sources and reduces the cost of data validation for the repository. On the other hand, the result can be very difficult to use and may not support the intended data uses (such as one desideratum implied by our use case: aggregating analytic data sets across multiple studies). “Data Lakes” (repositories capable of storing all of your structured and unstructured data without first imposing any specific structure on that data) can easily become “Data Swamps” if care is not taken to curate what can flow in.
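
A toy illustration of the gate between lake and swamp: submissions pass a minimal schema check before being admitted. The required fields and types below are hypothetical stand-ins for a real submission standard:

```python
REQUIRED = {"participant_id": str, "instrument": str, "score": (int, float)}

def validate(submission: dict) -> list:
    # Return a list of problems; an empty list means the submission is admitted.
    errors = []
    for field, typ in REQUIRED.items():
        if field not in submission:
            errors.append(f"missing field: {field}")
        elif not isinstance(submission[field], typ):
            errors.append(f"bad type for {field}: {type(submission[field]).__name__}")
    return errors

ok = {"participant_id": "P001", "instrument": "PHQ-9", "score": 12}
bad = {"participant_id": "P002", "score": "twelve"}
print(validate(ok))   # [] -> admit
print(validate(bad))  # ['missing field: instrument', 'bad type for score: str']
```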

Box 15.1 Constructing a Large Data Repository

What are the main design considerations?

  • What are the data sources?

  • What are the intended uses?

  • Who are the intended users?

  • How should the data be deposited and stored so that it supports the intended uses?

What are the main design dimensions?

  • Data submission standards and quality requirements: Should they be easy or rigorous? Relaxed or tightly controlled?

  • Data volume requirements: What are the characteristics of the data, and what implications do they have on requirements?

  • Data governance: Under what conditions should/can data be made available? How many hoops does a potential data user have to jump through?

Box 15.2 Using a Large Data Repository

  • Understand the data collection protocols

  • Understand the large-scale data structure (the tables)

  • Understand the fine-grained data structure (the columns, coding schemes)

  • Learn to use cohort discovery tools, if available

  • Identify a cohort of interest

  • Apply for access

  • Access a data set

  • Run and review data quality reports and match against any data release notes

  • Run exploratory analyses and verify that you understand each variable you plan to use

  • Run actual analysis

  • Contact original data acquisition team, if needed. Don’t be shy.

  • Publish and bask in glory

  • Give credit

Box 15.3 Submitting Data to a Large Data Repository

  • Understand the policies

  • Understand the data submission process

  • Complete forms (yes, many, many forms)

  • Generate the upload package

    • Datasets

    • Associated files

    • Meta-data

  • Complete a test upload and review the validation error reports. Yes, many, many errors.

  • Fix data issues and resubmit. Rinse, repeat.

  • Bask in the warm glow of contributing to humanity’s progress by expanding the shared pool of usable data

Unfortunately, the reverse approach, a strictly curated registry with strict and extensive submission requirements, poses its own problems. It can create an insurmountable burden for data submission partners and can increase the cost of validation and metadata management to the point of making the data repository program financially unsustainable. To extend the earlier metaphor: if a “Data Lake” only admits the purest distilled water, the result might be a mere trickle feeding a minuscule “Data Puddle.” The wise designer must navigate this tension, and the practical outcome is usually far from either extreme.

15.6.1 Who Develops Disease-Specific Data Repositories in Mental Health and Why?

There are several types of organizational entities that develop data repositories in mental health. Some repositories are developed specifically for research purposes (e.g. a publication database), some develop organically in an organization through daily practice or use (e.g. a reimbursement register), and some can be a combination of both (e.g. a linked EHR database). Government agencies, like the National Health Service (NHS) in the UK or the National Institute of Mental Health (NIMH) in the US, produce, fund, or host data repositories of various types for research, policy information, and other purposes. Professional specialty societies such as the American Psychological Association (APA) or the American Academy of Neurology (AAN) also develop and host data repositories. Note that in the US, professional specialty societies may have regulatory and financial incentives for developing data assets, e.g., to help society members obtain reimbursement under the MIPS program. In other countries the incentives and players may be different.

Other examples include specific disease advocacy groups, such as the National Organization for Rare Disorders (NORD), the Anxiety and Depression Association of America (ADAA) [165], and the Simons Foundation Autism Research Initiative (SFARI). There are also academic research networks and research centers specifically focusing on certain diseases, such as the Autism Biomarker Consortium for Clinical Trials (ABC-CT).

More recently, online platforms of different types have also become important data repository sources for mental health. Some social media platforms have emerged focusing solely on mental health related topics where peer support is a main feature, e.g. platforms like PatientsLikeMe.com. Furthermore, online counseling services and internet-based cognitive behavioral therapy intrinsically generate data that could be used for knowledge discovery, though of course this approach calls for significant ethical and legal consideration.

For each of these types of repositories, it is important to consider who the stakeholders are, who might want the data and for what purposes, and to understand the context in which the repository is developed. Moreover, depending on the context, it is also important to consider the organizational or business models underlying the repository and its sustainability strategies. A final contextual aspect important to consider with any data repository is the political context for the creation and maintenance of the resource, in order to understand the strengths and limitations of the data.

15.7 Closing Thoughts: Opportunities and Challenges

We live in an era where the way mental health research is conducted can be transformed by novel combinations of technical infrastructure, data collection and availability, computational methods, and analytical approaches. Recent advances have opened unprecedented opportunities, but to truly reach a state of “reproducible” scientific practices and “open science” following the FAIR principles, certain aspects of knowledge discovery in these types of data repositories need special consideration.

Although most of the sources mentioned in this chapter are from developed countries, sufficient technology now exists in low- and middle-income countries to collect data that enables them to benefit from data insights and to create learning health systems. Data will come through conventional health information systems [166], community health workers [167], and, for demographic data, through other agencies [168]. Infectious diseases and epidemics remain the most obvious focus of these systems, but developing countries also carry a huge burden of non-communicable disease, including mental disorders, and the use of data insights may decrease this burden and promote development [169, 170]. As mobile phones have become ubiquitous and technology an integral part of the humanitarian response to disasters, data will become available on the most vulnerable populations on the globe, who have been displaced through war and natural disaster, and could be used to help future responses.

Considerable challenges to using Big Data resources remain and must be tackled by both experienced and novice researchers working in the field. In fact, your work in this area will significantly impact how we take advantage of the opportunities and navigate the challenges. When working with data repositories and knowledge discovery methods, first consider the provenance of the data, which is often collected without research in mind, or with a different type of research in mind. Second, consider that data collection, ingestion, and curation can inadvertently reduce nuanced human traits or states to dichotomous outcomes. Third, consider that linking between any two sources that were not initially designed to be linked is by no means simple or infallible.

The researcher who develops an approach to identify patterns in the data, or a predictive model based on retrospective data, often cannot interpret a finding without knowing where the data come from, how they were collected, the limitations of the repositories, and the underlying assumptions of the learning algorithms.

Despite the many remaining challenges, Big Data is growing in importance as an exceptionally exciting source of knowledge about mental health. We are confident that the growth will continue, and we hope yours will be among the many hands that will help overcome the challenges described above.