
1 Introduction

The ability to collect large volumes of molecular data as well as detailed clinical measurements for patients will impact on disease classification and clinical management. Systems medicine is an approach that uses the concepts and methods of systems biology to understand disease conditions through an integration of data at multiple levels of biological organization [1]. Systems biology has progressed rapidly in recent years due to advances in technology that enable the rapid and increasingly inexpensive capture of multiple “omics” data (e.g., genomics, epigenomics, transcriptomics, proteomics, metabolomics), together with advances in computing that make it possible to store, query, and analyze the associated large datasets. An important feature of systems medicine is the interplay between biology, computation, and technology [2].

Many diseases are heterogeneous; that is, they are associated with a variety of phenotypes, which sometimes overlap. Such diseases represent a collection of subtypes, each characterized by, for example, different aberrant pathways and processes. Patient stratification involves identifying the particular subtype of disease from which a patient is suffering, and it can steer drug discovery toward more personalized and effective treatments. If a disease condition is considered as a single homogeneous entity, potentially useful therapeutics could be discarded because they show no overall beneficial effect in the cohort as a whole. However, these therapeutics might be of value to a selected group of patients with a particular disease subtype. In addition, an understanding of the mechanistic basis of disease subtypes could lead to the development of novel subtype-specific medicines.

When carried out on a large scale, the application of systems approaches to medicine offers the potential for the development of a new taxonomy of disease, namely, a taxonomy based on molecular mechanistic features rather than the presentation of clinical symptoms (see, e.g., [3, 4]). For some diseases, such a classification may reveal a number of subtypes, each involving different molecular pathways and processes. This new classification can lead to a more individualized approach to therapy, as the identification of disease subtypes can directly impact on clinical management, with therapeutic intervention tailored to the disease subtype of the patient. A new taxonomy might also suggest that several apparently different diseases, hitherto thought to be separate and distinct conditions, share common mechanisms at the molecular level, and such information could be useful for drug repurposing (see, e.g., [5]). Barabasi and coworkers [6] have used a network approach to connect human disease genes (the disease genome) with various human diseases (the disease phenome). The relationships are represented as a bipartite graph from which two networks are extracted, a human disease network (HDN) and a disease gene network (DGN). The modular structure of the HDN reveals connections between diseases that may appear different at the phenotypic level, and that of the DGN shows groups of genes that share a disease phenotype.
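
As an illustration of this bipartite representation, the sketch below (hypothetical disease and gene names; the networkx library is assumed to be available) builds a small disease-gene graph and projects it onto a disease-disease network and a gene-gene network, the two projections corresponding conceptually to the HDN and the DGN.

```python
# Minimal sketch of the disease genome/phenome bipartite representation [6].
# Diseases and genes are hypothetical; real maps are built from curated
# disease-gene association resources.
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
diseases = ["asthma", "COPD", "diabetes"]
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
B.add_nodes_from(diseases, bipartite="disease")
B.add_nodes_from(genes, bipartite="gene")
# An edge means the gene has been associated with the disease.
B.add_edges_from([
    ("asthma", "GENE_A"), ("asthma", "GENE_B"),
    ("COPD", "GENE_B"), ("COPD", "GENE_C"),
    ("diabetes", "GENE_D"),
])

# Human disease network (HDN): diseases linked if they share a gene.
hdn = bipartite.weighted_projected_graph(B, diseases)
# Disease gene network (DGN): genes linked if they share a disease.
dgn = bipartite.weighted_projected_graph(B, genes)

print(list(hdn.edges(data="weight")))  # asthma and COPD are linked via GENE_B
print(list(dgn.edges(data="weight")))
```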

Systems medicine involves the collection of large amounts of data, including clinical data, omics data, and, more recently, data on patients’ environment and activities collected through devices that make use of wearable sensor technologies. This has led to a new approach to personalized medicine, namely, P4 systems medicine, which is personalized, preventive, predictive, and participatory [7–10]. The aim is to develop personalized healthcare plans and to monitor the status of a patient’s wellness so that early intervention can be made when appropriate, opening the possibility, through the identification of actionable molecular or lifestyle targets, of preventing the transition from wellness to disease or of promoting the reversal of early disease onset back to the normal condition. The ability to accurately predict disease progression would mean that unnecessary or inappropriate therapy could be avoided, thereby saving costs and limiting exposure to potential side effects of the therapy. Additionally and importantly, P4 systems medicine is participatory, as it involves active engagement of researchers, clinicians, and patient groups empowered through social networks. Wearable sensor technologies will mean that patients can collect their own lifestyle and exposure measurements, score and comment on how they feel on a regular basis, and share these data easily with the stakeholders they choose. The knowledge gained from community participation (e.g., the response of individuals to particular therapies or other actionable interventions) can be fed back into computational systems biology models. P4 systems medicine will also need to address challenges associated with societal issues such as ethics, privacy, and data security, as well as educational issues such as understanding how patients can serve as their own controls in monitoring the transition from health to disease states [11].

Advances in genomics have triggered a revolution in medical genetics, reducing the cost of sequencing, accelerating health-improvement projects, and providing a comprehensive resource on human genetic variants that establishes the link between genotype and phenotype [12–15]. The 1000 Genomes Project released a catalogue of validated loss-of-function (LoF) variants and naturally occurring “knockout” alleles for over 1000 human protein-coding genes; many of these genes have minimal functional annotation [16, 17]. Coding variants can affect human fitness, for example, by altering responses to pathogens or heightening susceptibility to infection.

In oncology, there have been several recent examples of the clinical potential of a strategy involving diagnosis of subtype followed by a specific therapy. Most of these have involved the presence or absence of specific mutations or chromosomal rearrangements that can be indicative of disease prognosis or drug response. The relatively short time in which crizotinib [18] was demonstrated to be an effective therapy for a subset of patients with non-small cell lung cancer (NSCLC) indicates how patient mutational status can be one route to stratification. A subset of patients with NSCLC shows a chromosomal inversion that leads to the production of a fusion protein encoded by a recombination of the echinoderm microtubule-associated protein-like 4 (EML4) gene and the anaplastic lymphoma kinase (ALK) gene. The presence of the fusion protein can act as a diagnostic marker and also as a target for the drug crizotinib [19]. Another example of a therapeutic strategy showing the potential of testing for given mutations combined with individualized therapies comes from colorectal cancer (CRC). In CRC, patients with particular mutations in the KRAS protein (which is involved in signaling pathways) show poor response to the epidermal growth factor receptor (EGFR) inhibitor drugs panitumumab and cetuximab [20].

Respiratory diseases such as asthma and chronic obstructive pulmonary disease (COPD) are examples of complex heterogeneous diseases. These diseases, characterized by airway inflammation, remodeling, and airflow limitation, involve various types of interacting elements at the molecular level and provide illustrative examples for systems medicine approaches. For a number of patients suffering from severe asthma, the disease cannot be controlled by corticosteroid therapy. A recent systems medicine project, U-BIOPRED (Unbiased BIOmarkers for the PREDiction of respiratory disease outcomes), aims to identify asthma subtypes and to stratify patients by subtype, offering the prospect of more individualized approaches to treatment [21]. The Synergy-COPD project [22, 23] is another recent example of a systems medicine initiative aimed at understanding the heterogeneity of COPD and the associated patterns of comorbidity. As part of Synergy-COPD, Turan and coauthors [24] explored skeletal muscle wasting in COPD patients by integrating physiological and gene expression data to build molecular networks, which could then be tested for pathway and functional enrichment. The extent to which asthma and COPD share common mechanisms has been discussed in the literature [25] and illustrates some of the challenges of understanding complex heterogeneous disease. Specifically, the authors of [25] curated common pathways and developed gene networks for four major respiratory diseases (asthma, COPD, tuberculosis, and essential hypertension) based on specialized studies from the literature. The network overlap between these disease types was analyzed, with the highest overlap identified between asthma and COPD. These results show a stronger association between asthma and COPD than between the other analyzed phenotypes, suggesting the potential of developing therapeutic strategies that target both diseases.

Finally, one of the most recent initiatives in P4 systems medicine is the pioneers of health and wellness pilot project [8]. The project targets deep characterization of wellness and proposes to (1) obtain a complete genome sequence for each individual; (2) follow participants with digital-monitoring devices that measure heart rate, activity, quality of sleep, weight, and blood pressure; and (3) take measurements every 3 months of blood metabolites; blood organ-specific proteins for the brain, heart, and liver; the gut microbiome; salivary cortisol; white cell methylation; telomere lengths; and clinical chemistries focused on nutrition. This combines approaches shown to be effective by two pioneering individuals who monitored themselves for environmental factors and lifestyle [26] or for multiple omics [27]; when combined with clinical assessments, such monitoring enabled the identification of early signs of disease that they could counteract through appropriate individual adaptations.

Large-scale systems medicine studies will require robust computational pipelines. The application of computing methodologies to the domain of translational medicine research to enable the storage, mining, analysis, and visualization of large patient datasets is sometimes referred to as translational informatics. In the next sections, we describe three computational challenges associated with P4 systems medicine, namely:

  1. Integrative approaches to subtype discovery

  2. Obtaining a mechanistic understanding of disease subtypes

  3. Developing a platform for translational informatics

We conclude with a brief discussion of some of the broader educational challenges associated with the development of systems medicine.

2 Integrative Approaches to Disease Subtype Discovery

Biomarkers are patterns that discriminate between disease and non-disease or between different disease subtypes. Identification of subtypes could suggest appropriate therapeutic strategies. Biomarkers can also be used prognostically, for example, to predict whether a patient may have an aggressive or benign form of a disease. The components of the patterns could be a set of genes or metabolites or other measured biological features.

Relatively few clinically useful biomarkers have been developed, despite extensive biomarker research over recent decades [28]. Given that biological data are generally noisy (in part due to the heterogeneous nature of most complex diseases), it is difficult to obtain reproducible molecular signatures. Filtering and integration strategies, which exploit prior knowledge and the ability to group related data together, can help to improve the signal-to-noise ratio [29]. In an analysis of breast cancer metastasis [30], gene expression patterns were integrated with protein-protein interaction networks, and biomarkers were identified as subgraphs. These network-based signatures were more robust than markers based on individual genes, showing greater reproducibility across studies.
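
The following sketch conveys the general idea of scoring subnetworks rather than individual genes; it is not the algorithm of [30]. Gene names, the candidate modules, and the simulated expression matrix are all hypothetical, and numpy/scipy are assumed to be available.

```python
# Illustrative sketch: rank candidate PPI subnetworks by aggregating
# per-gene discriminative statistics instead of ranking individual genes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
genes = ["G1", "G2", "G3", "G4", "G5"]
expr = rng.normal(size=(40, len(genes)))        # 40 patients x 5 genes
labels = np.array([0] * 20 + [1] * 20)          # e.g., metastatic vs non-metastatic
expr[labels == 1, :2] += 1.0                    # G1 and G2 carry the signal

# Per-gene two-sample t statistic.
t, _ = stats.ttest_ind(expr[labels == 1], expr[labels == 0], axis=0)

# Candidate subnetworks would come from a PPI graph; here they are hand-picked.
subnetworks = {"module_1": ["G1", "G2"], "module_2": ["G3", "G4", "G5"]}
idx = {g: i for i, g in enumerate(genes)}
for name, members in subnetworks.items():
    score = np.mean([abs(t[idx[g]]) for g in members])
    print(name, round(score, 2))                # module_1 should score higher
```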

Biomarker identification is usually carried out in a supervised manner. Supervised approaches use the patient phenotype as a class label associated with each patient. This might be assigned from the clinical presentation of the patient and could, for example, be associated with severity or aggressiveness of the disease. Supervised approaches aim to find sets of features that distinguish between classes.

Unsupervised approaches to disease classification make no assumption as to patient phenotype and, essentially, group data according to similarities among the molecular features (e.g., gene expression, metabolomics profiles), resulting in groups (clusters) that may represent disease subtypes. These may correspond to already known differences in phenotype or may represent as-yet undiscovered subtypes reflecting perturbations of underlying molecular processes and pathways that are not immediately apparent from the patients’ clinical measurements.
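
A minimal sketch of such an unsupervised analysis is shown below, assuming scikit-learn and simulated data in place of a real omics matrix: patients are clustered on their molecular profiles alone, and the resulting cluster labels would then be compared against clinical phenotypes.

```python
# Minimal sketch of unsupervised subtype discovery: cluster patients by
# their molecular profiles without using phenotype labels.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_per_group, n_features = 30, 200
# Two hidden subtypes differing in a subset of features.
group1 = rng.normal(0.0, 1.0, size=(n_per_group, n_features))
group2 = rng.normal(0.0, 1.0, size=(n_per_group, n_features))
group2[:, :20] += 1.5
X = np.vstack([group1, group2])

X_scaled = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(clusters)  # putative subtypes, to be compared against clinical phenotypes
```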

Omics measurements from different platforms can provide complementary types of information, and the collection and integration of data from multiple omics platforms can suggest novel disease subtypes. Several cancer-related studies have recently been published with multiple omics data types collected for the same group of patients (e.g., refer to [31]). This strategy of data collection is likely to become increasingly common in translational medicine [32, 33]. In the context of asthma treatment, this approach has been taken, for example, with the U-BIOPRED consortium, which aims to develop molecular fingerprint and handprint signatures that will lead to a better classification of the different types of severe asthma [34]. Knowledge of these subtypes may help in the development of better types of treatment for asthma patients.

One might expect that the underlying biology associated with different disease subtypes would be reflected within different types of molecular data collected across the patient cohort. The identification of consistent patterns across different omics platforms may, therefore, reflect more reliable subgroupings. Alternatively, it may be the case that the signal from an individual platform is too weak to distinguish between disease subtypes, but taken together, the data might lead to the discovery of robust patterns that are useful diagnostically and prognostically. Disease subtype discovery using multiple genomics datasets of different types (e.g., gene expression, copy number variation, methylation) collected for the same set of patients can offer new insights into the taxonomy of disease. Each data type can be clustered separately, and the concordance and conflict between clusters can be explored. An integrative approach should ideally identify (1) molecular patterns (signatures) that are common across the different omics datasets, (2) patterns that are specific to individual datasets, and (3) patterns that only emerge after data integration [35]. Patterns specific to a given data type can be termed fingerprints, and those associated with the integrated data can be termed handprints. We report below a few examples of integrative approaches to the subtype discovery problem.

The multiple “omics” data types can include gene expression, proteomics, and metabolomics. The data matrices associated with each data type j are of dimensionality n × p_j, where n is the number of patients and p_j is the number of features or variables for data type j. Typically, p_j will be different across different platforms. Several exploratory data analysis methods have been proposed to compare two omics datasets (j = 2). These include partial least squares (PLS) and canonical correlation analysis (CCA), and sparse approaches have been used to perform the integration and variable selection together [36].
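
As an illustration, the sketch below applies CCA to two simulated data matrices sharing the patient dimension n (scikit-learn assumed); in practice X and Y might hold, for example, gene expression and metabolite levels, and for p_j much larger than n the sparse or regularized variants cited above [36] would be preferred.

```python
# Sketch of canonical correlation analysis (CCA) between two data matrices
# measured on the same n patients (simulated data).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(2)
n = 50                     # patients (shared dimension)
p1, p2 = 30, 20            # features per platform (p_j differs by platform)
latent = rng.normal(size=(n, 2))                       # shared structure
X = latent @ rng.normal(size=(2, p1)) + 0.5 * rng.normal(size=(n, p1))
Y = latent @ rng.normal(size=(2, p2)) + 0.5 * rng.normal(size=(n, p2))

cca = CCA(n_components=2)
X_scores, Y_scores = cca.fit_transform(X, Y)
# Correlation of the first pair of canonical variates.
print(np.corrcoef(X_scores[:, 0], Y_scores[:, 0])[0, 1])
```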

Co-inertia analysis (CIA) [37] measures the degree to which two datasets are in concordance and is suitable for datasets where the number of features (variables) exceeds the number of samples (patients), which is usually the case with omics data. An approach using CIA has been applied to compare microarray data from two different platforms [38]. Recently, this approach has been extended to handle more than two omics datasets [39], and the multiple co-inertia analysis method is available as an R package (omicade4). For example, given the same set of n patients for which multiple data types are available (such as gene expression, transcripts, methylation levels), the omicade4 package performs co-inertia analysis by combining information from all these datasets provided for the n patients. Note that all datasets must have one common dimension (i.e., the number of patients), while the second dimension can differ.

Shen and coworkers [35] proposed an integrative method called iCluster that uses a joint latent variable approach. iCluster performs a simultaneous clustering of omics datasets, which are represented by (p × n) matrices with p being the number of features (e.g., genes) and n the number of patients. The method simultaneously projects the high-dimensional data matrices associated with the various omics platforms, each with a different number of features, onto a unified latent space of lower dimensionality. Simulations show that clustering in this latent space produces a better separation than using PCA [40]. A recent breast cancer study using iCluster demonstrated the value of integration through the identification of subtypes that were not suggested by the component data platforms of gene expression and copy number [41]. The program iCluster+ (a further development of iCluster) permits integration of different data types, e.g., binary, categorical, and continuous [42]. The program is available as an R package.
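
iCluster itself is distributed as an R package; the simplified sketch below (not the iCluster algorithm) only illustrates the underlying idea of a shared latent space, by concatenating standardized data types, projecting to a low-dimensional space, and clustering patients there.

```python
# Simplified illustration of a joint latent space (not the penalized joint
# latent variable model of iCluster [35]). Standardized data types are
# concatenated along the feature axis, projected to a shared low-dimensional
# space, and patients are clustered in that space.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n = 60
expression = rng.normal(size=(n, 500))   # n patients x p1 genes
methylation = rng.normal(size=(n, 300))  # n patients x p2 probes

blocks = [StandardScaler().fit_transform(m) for m in (expression, methylation)]
joint = np.hstack(blocks)                # shared patient dimension preserved

latent = TruncatedSVD(n_components=5, random_state=0).fit_transform(joint)
subtypes = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(latent)
print(subtypes)
```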

Kirk and coauthors [43] use an unsupervised Bayesian correlated clustering approach for multiple dataset integration (MDI). MDI allows the identification of subsets of samples that cluster across several different datasets. Lock and Dunson [44] described a flexible integrated approach that allows an overall clustering to identify shared structure and a clustering that is specific to each data modality.

A network-based approach to subtype discovery has also been developed [45]. A similarity network is constructed for each data type, such as gene expression or DNA methylation. These individual networks are then fused to form an integrated network. An advantage of this approach is that strong similarities supported by evidence from several data modalities are retained, as are weaker similarities that share a common tightly connected network neighborhood across the individual networks. The method has been shown [45] to detect clinically relevant subtypes for a variety of cancer datasets from the Cancer Genome Atlas (TCGA).
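
The simplified sketch below illustrates the spirit of this network-based integration rather than the fusion algorithm of [45]: one patient-similarity matrix is built per (simulated) data type, the matrices are simply averaged, and the combined network is clustered.

```python
# Simplified sketch of network-based integration (not the iterative fusion
# algorithm of [45]): build one patient-similarity matrix per data type,
# average them, and apply spectral clustering to the combined network.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(4)
n = 60
expression = rng.normal(size=(n, 200))
methylation = rng.normal(size=(n, 100))

similarities = [rbf_kernel(m, gamma=1.0 / m.shape[1]) for m in (expression, methylation)]
fused = np.mean(similarities, axis=0)     # naive average instead of iterative fusion

labels = SpectralClustering(
    n_clusters=3, affinity="precomputed", random_state=0
).fit_predict(fused)
print(labels)
```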

3 Toward a Mechanistic Understanding of Disease Subtypes

After identifying putative molecular signatures that are associated with disease subtypes, the next challenge for translational informatics is to use these signatures to gain mechanistic insight into disease heterogeneity. Pathways and networks describe a level of functional organization that lies between molecular function and physiological function. As such, mapping genes suspected to be involved in disease onto pathways and networks can give insight into potential mechanisms involved in the disease process and can also suggest strategies for therapeutic intervention. Pathways and networks represent collections of molecular components that interact in some way and participate in a given process, and these collections can be represented in a variety of ways and at different levels of granularity. Traditional pathway and network maps captured metabolic reactions, showing the associated reactants and products. These have been extended to include signaling pathways and pathways involved in disease.

The Systems Biology Graphical Notation (SBGN) project [46] aimed to provide a standard representation of molecular pathways and networks while recognizing that their complexity requires different views depending on the visualization requirements. The SBGN standard [46] consists of three complementary languages: process description (PD), activity flow (AF), and entity relationship (ER). Each of these languages has a particular purpose, with its own advantages and limitations (refer to Table 1). The process description language is currently the most widely used. It represents biological events such as metabolic reactions, protein phosphorylation, and complex formation in an unambiguous way and depicts causal sequences of events that are well suited for mathematical modeling and simulation. This language is arguably the best environment for knowledge representation that can be used by both mathematicians and biologists to accurately express detailed information about a biological system and to use it for model development, hypothesis generation, and prediction. One of its major limitations, shared with other pathway map visualizations, is that it cannot deal with potential combinatorial explosion [47]. The activity flow language can be seen as a simplified version of the process description language, with fewer details and a focus on the flow of activity from one molecule to another in a pathway. This SBGN language is the closest to commonly used signaling pathway diagrams, for example, those in BioCarta. This level of representation is a good fit for omics data visualization and functional analysis. Similar less-detailed compact diagrams are used, for example, in Ingenuity Pathway Analysis (Qiagen) and MetaCore (Thomson Reuters). The entity relationship language loses the sequential expressiveness of the other two languages but deals very well with combinatorial explosion. An entity relationship diagram cannot be read as a pathway or network but rather as a set of states that one molecule can be in, depending on the influences from other molecules. There are many software applications that support SBGN diagrams (www.sbgn.org/SBGN_Software).

Table 1 Advantages and limitations of the three complementary languages within SBGN

CellDesigner [48] is a software tool for developing diagrams in a format compliant with the SBGN process description language; its visual elements correspond in most cases to those of SBGN process description. These diagrams can be exported from CellDesigner in SBGN and SBML formats.
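
As a small illustration, a model exported from CellDesigner in SBML format can be inspected programmatically; the sketch below assumes the python-libsbml package and a local file named model.xml.

```python
# Minimal sketch of inspecting a model exported from CellDesigner as SBML,
# assuming the python-libsbml package and a local file "model.xml".
import libsbml

document = libsbml.readSBML("model.xml")
if document.getNumErrors() > 0:
    document.printErrors()
else:
    model = document.getModel()
    print("species:", model.getNumSpecies())
    print("reactions:", model.getNumReactions())
    for i in range(model.getNumSpecies()):
        species = model.getSpecies(i)
        print(species.getId(), species.getName())
```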

Two disease maps have recently been constructed in the area of neurodegenerative disease. Mizuno and coauthors [49] have developed a map of signaling pathways in Alzheimer’s disease, and Fujita and coauthors [50] have constructed a Parkinson’s disease (PD) map that captures known components involved in gene regulatory and metabolic processes associated with this disease. Both the Alzheimer’s map and the PD map were developed in CellDesigner in an SBGN-compliant format. The PD map is coupled to bioinformatics tools that, for example, overlay differentially expressed genes from a gene expression study in order to allow identification of the main pathways that may be perturbed. The map can be extended by the addition of other data types, such as protein interaction data and associations derived from text mining, which can facilitate hypothesis discovery.

A major challenge in the construction of disease maps is the extraction of information from the scientific literature. While text-mining approaches have considerable potential [51], they need to be supplemented by manual curation by domain experts if a reliable high-quality map is to be developed. Another challenge is the level of detail to be included. A description of a disease condition at the molecular level ideally also needs to capture the effect of, for example, single-nucleotide polymorphisms (SNPs) or information relating to enrichment of gene expression in particular tissue types. Quantitative information about reaction kinetics could be included, which would allow simulations and modeling to be carried out; however, mostly only qualitative information about correlative relationships can be found. The biological expression language (BEL) [52] is a framework for capturing causal and correlative relationships and has the advantage that it is expressive, human readable, and extensible. The BEL framework does not have its own ontology but makes use of existing ontologies.

Disease maps represented at multiple levels of granularity are important for obtaining mechanistic insight into disease subtypes and for putting biological context around experimental results. Although details of sequential reaction steps and the associated temporal and spatial information are valuable, many relationships described in the scientific literature are at a much higher level of granularity, simply suggesting that entity A has an effect on entity B. Malhotra and coauthors [53] describe an integrative approach to put functional context around putative biomarkers by capturing more speculative interactions and relationships using text mining and by including protein-protein interaction networks and gene expression data. This strategy is a synthesis of data-driven and background knowledge-based approaches.

Multi-scale modeling, which aims to describe the functionality of the whole system rather than of its components, has been used increasingly in systems medicine in recent years to facilitate the exploration of key features at various scales, from molecular (e.g., metabolite networks) to cellular (e.g., intercellular influences) and macroscopic levels (e.g., phenotype manifestation) [54]. The major advantages of multi-scale models in systems medicine are the possibility of integrating experimental data obtained at different scales, exploring features of phenomena across system layers, and investigating drug effects on the whole biological system. Although powerful, the multi-scale approach is characterized by (1) the large set of parameters needed to represent the system components and (2) wide ranges of both spatial and temporal scales (e.g., from rapid dynamics and microscopic spaces, such as metabolic activity occurring within seconds in the nucleus, to slow dynamics and extensive spaces, such as disease progression observed at the tissue and organ level over years). Multi-scale modeling has been used in cancer medicine to explore, e.g., abnormal phenomena in breast, colorectal, lung, and prostate cancers (reviewed in [54, 55] and references therein); in drug discovery and development research to investigate the impact of, e.g., cytostatic agents on tumor progression [56] and of the inhibitor danoprevir on hepatitis C virus dynamics [57]; and in epigenetics research to analyze relationships between aging and aberrant modifications of transgenerational epigenetic inheritance mechanisms [58]. A recent paper [59] introduces a multi-scale computational model that predicts functional deregulation of the lung during respiratory disease development by integrating information from CT images of COPD and asthma patients.
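
The toy sketch below (hypothetical equations, scipy assumed) illustrates the coupling of scales that such models must handle: a fast molecular variable feeds a slow tissue-level variable, so the solver has to resolve very different timescales within one system.

```python
# Toy illustration of coupled scales in a multi-scale model (hypothetical
# equations): a fast molecular variable m drives a slow tissue-level damage
# variable d, so the two processes evolve on very different timescales.
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y, k_fast=5.0, k_slow=0.01):
    m, d = y
    dm = k_fast * (1.0 - m) - 0.5 * m   # fast molecular turnover (short timescale)
    dd = k_slow * m - 0.001 * d         # slow tissue-level damage (long timescale)
    return [dm, dd]

sol = solve_ivp(rhs, t_span=(0.0, 1000.0), y0=[0.0, 0.0], max_step=1.0)
print("final molecular level:", round(sol.y[0, -1], 3))
print("final tissue damage:", round(sol.y[1, -1], 3))
```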

4 Developing a Platform for Translational Informatics

What systems need to be in place to enable the exploitation of the large datasets that are being collected as part of ongoing and planned systems medicine initiatives? Although the cost of genome sequencing has dropped dramatically in recent years to around 1000 US$, the cost of analysis and interpretation is still roughly two orders of magnitude higher [60]. Additionally, implementation of systems medicine approaches to health and wellness transitions, drug development, and treatment will require a computational infrastructure that allows for storage, retrieval, and mining of data in an integrated manner (refer to Fig. 1).

Fig. 1 Three key components for a translational medicine platform: (a) a data warehouse for storage and querying of clinical and multi-omics data associated with patient cohorts; data harmonization is needed to ensure that the data conform to standards, in order to facilitate comparison across different studies; (b) a data analytics component for visual exploration of the data and for subtype discovery; (c) a knowledge base to enable experimental results to be understood in the context of known disease pathways and processes and to suggest a possible mechanistic basis for subtypes

In addition to omics datasets being collected from different platforms (such as gene expression, copy number variation, methylation, etc.), phenotypic data are also becoming richer. Instead of a single endpoint status representing the phenotype (e.g., disease or non-disease), a set of measurements may be collected (e.g., mild, moderate, or severe disease). These can give a better description of the phenotype and may, if taken across time, describe disease progression or reversal [26, 27]. It is likely that such high-dimensional phenotype datasets (described by a large number of clinical measurements for each patient taken at given time points) will become increasingly important. As these datasets start to be routinely collected, each patient will be associated with a cloud of data consisting of millions to billions of data points, and the mining and analysis of these data are likely to offer insight into the onset of disease as well as transitions from health to disease and vice versa [8, 10].

Traditionally, various bioinformatics data repositories have tended to be centered around fixed data types. Until fairly recently, the identification of molecular signatures associated with disease conditions has been derived mainly from analyses of gene expression data, with the main public repositories being the Gene Expression Omnibus (GEO) [61] and ArrayExpress [62]. Other omics data types in addition to transcriptomics, including proteomics, metabolomics, and genome structural variation data, can provide additional molecular signatures (refer to, e.g., [63]). More recently, the Database of Genotypes and Phenotypes (dbGaP) has been established as a repository for genotype-phenotype data and includes molecular data (e.g., expression, copy number variation, methylation) as well as phenotypic data and contextual information (e.g., research protocols).

A number of initiatives are underway aimed at making bioinformatics tools more accessible and at sharing analysis workflows and results: GenomeSpace (www.genomespace.org) and Garuda (www.garuda-alliance.org) are frameworks for interoperability of bioinformatics tools; Galaxy [64] is a web-based platform for tool integration that also allows tracking of provenance and the sharing of workflows; Synapse [65] provides a framework for collaboration with particular emphasis on provenance and sharing, with respect to the data and also the results of analyses carried out on the data; cBioPortal [66] provides a web-based tool for analysis and visualization of cancer datasets of multiple data types, such as gene expression and copy number variation, and offers an R interface so that data in the repository can be queried from R scripts, as well as a MATLAB (www.mathworks.com) toolbox to allow programmatic access from MATLAB code; and OMICtools [67] is a manually curated repository of web-based tools for “omics” data analysis.
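
As an example of such programmatic access, the sketch below queries a public repository over HTTP; it assumes that the cBioPortal REST API exposes a /studies endpoint returning JSON, and the exact URL and field names should be checked against the current API documentation.

```python
# Sketch of programmatic access to a public repository, assuming the
# cBioPortal REST API exposes a /studies endpoint returning JSON; the exact
# URL and field names are assumptions to be verified against the docs.
import requests

response = requests.get("https://www.cbioportal.org/api/studies", timeout=30)
response.raise_for_status()
studies = response.json()
print("number of studies:", len(studies))
for study in studies[:5]:
    print(study.get("studyId"), "-", study.get("name"))
```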

Computational platforms to enable systems medicine will need to be accessible to researchers and clinicians and have advanced visualization functionalities to facilitate hypothesis generation. The data types that need to be stored, queried, and integrated include gene expression, proteomics, metabolomics data, and DNA structural variations such as chromosomal rearrangements, copy number variations, DNA methylation data, microRNA data, as well as data associated with medical imaging and clinical data.

One of the main challenges in the development of a data repository for translational medicine studies is the semantic heterogeneity of the data. This means that the same concept (e.g., a clinical measurement) may be referred to by different names or different concepts may be referred to by the same name. The use of available standards for clinical data (e.g., CDISC) and for multi-omics data (ISA standards [68]) will help to address this challenge and will enable cross-study comparisons. However, there will remain a problem in the harmonization of legacy data, which do not conform to current standards, and this is likely to be resource intensive, involving semi-manual curation. A translational medicine platform also needs to be secure and to conform to legislation relating to data privacy.
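
A minimal sketch of one harmonization step is shown below: synonymous clinical variable names from legacy studies are mapped to a single canonical term (all variable names are hypothetical); in practice such mappings would be curated against standards such as CDISC.

```python
# Minimal sketch of one harmonization step: mapping synonymous clinical
# variable names from legacy studies to a single canonical term.
# All variable names here are hypothetical examples.
SYNONYMS = {
    "fev1": "forced_expiratory_volume_1s",
    "fev1_l": "forced_expiratory_volume_1s",
    "forced exp. volume (1 s)": "forced_expiratory_volume_1s",
    "dob": "date_of_birth",
    "birth_date": "date_of_birth",
}

def harmonize(record: dict) -> dict:
    """Rename known synonymous keys; keep unknown keys for manual curation."""
    return {SYNONYMS.get(key.strip().lower(), key): value for key, value in record.items()}

legacy_record = {"FEV1_L": 2.8, "DOB": "1970-01-01", "smoking_status": "former"}
print(harmonize(legacy_record))
```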

The data analytics component of such a platform will need to handle the common types of analyses required by clinicians and biomedical researchers, such as exploring attributes of the patient cohorts, and include workflows that facilitate disease subtyping through the identification of molecular signatures from the omics datasets associated with each patient sample. Finally, the platform should put putative disease subtypes into biological context by using the background knowledge in disease maps, to suggest a mechanistic basis for the subtypes.

An early example of a translational informatics platform is Oncomine [69], which aimed at the integration of microarray data from cancer studies and attempted to address some of the challenges of semantic and syntactic heterogeneity of the data. The tranSMART platform [70, 71] was developed as a warehouse for both clinical and high-dimensional omics data such as gene expression and SNP data. The platform facilitates cohort selection and exploratory visual analysis of the clinical data associated with the cohorts and has been integrated with more specialized analytics tools such as GenePattern (www.genepattern.org). More recently, tranSMART has been integrated with Genedata Analyst for advanced analysis of a number of omics datasets [72]. The Innovative Medicines Initiative recently funded a project, eTRIKS (European Translational Information and Knowledge Management Services), to build a platform for translational research that uses tranSMART at the core of its infrastructure.

To advance systems medicine, clinical data, basic research data, mathematical modeling, and knowledge management need to be integrated and interlinked. Important prerequisites for achieving this are standards for data acquisition, harmonization of documentation, and a policy for data sharing [73]. BioXM, developed by Biomax Informatics, has been used for data management in the Synergy-COPD project [74], where information is represented as a network showing the evidence that relates different biological concepts [75]. This has been extended to allow integration with applications for computational modeling and simulation and with clinical decision support systems [76]. Accessibility to a wide range of biomedical and clinical researchers can, in some environments, be an important feature of a data integration platform for translational medicine. T-MedFusion [77] is a system that integrates patient data, clinical measurements, and omics data and has been evaluated in use cases for psoriasis and rheumatoid arthritis. Recently, the STATegra project developed STATegra EMS to manage clinical data together with high-throughput omics data (RNA-seq, ChIP-seq, Methyl-seq, etc.) [78].

As large longitudinal cohort studies gain momentum, patients will be represented by millions to billions of data points based on the collection of increasingly complex and heterogeneous data types, and fast, effective data analytics and visualization will be necessary if these platforms are to become clinical decision support tools. Cloud computing may provide a solution that is scalable and requires low start-up costs [79].

5 Conclusions

Systems medicine will involve the collection, integration, and analysis of large patient datasets. It offers the prospect of a new taxonomy of disease, from which patient-specific therapeutic strategies can be developed. It represents the best option for implementing participatory, personalized, predictive, and preventive medicine, thus fostering the transition from a reactive practice of medicine, in which symptoms are treated once they have developed into a full disease stage, to a proactive and anticipatory medical practice based on a scientific understanding of wellness.

Large longitudinal studies will enable the monitoring of transitions from wellness to disease and the identification of the associated perturbations in molecular networks. The application of genomic medicine will present several challenges. It will necessitate the education of both clinicians and the patient community, and modern medical curricula will need to reflect the importance of an integrative, holistic, patient-centered approach. It will also require developments in data integration, big data analytics, and visualization.

Personalized predictive approaches to medicine have the potential to impact significantly on patient wellness. The challenge, if such approaches are to be endorsed and practiced across well-resourced as well as poorly resourced social environments and scientific and medical infrastructures, is to demonstrate that they can lead to reductions in healthcare costs through more effective strategies for disease management.