After reading this chapter, you should know the answers to these questions:
  • Which key translational bioinformatics problem are AI methods positioned to solve?

  • What principles would guide your choice of which AI techniques and tools to apply to a translational bioinformatics problem?

  • What are some important “-omic” databases that can be used to interpret and validate translational bioinformatics related machine learning results from the biomedical perspective?

Introduction and Concepts

The field of translational bioinformatics is concerned with the development of storage, analytic, and interpretive methods to optimize the transformation of increasingly voluminous biomedical data and genomic data into proactive, predictive, preventive, and participatory health. Translational bioinformatics includes research on the development of novel techniques for the integration of biological and clinical data and the evolution of clinical informatics methodology to encompass biological observations. The end product of translational bioinformatics is newly found knowledge from these integrative efforts that can be disseminated to various stakeholders, including biomedical scientists, clinicians, and patients [1].

Voluminous data, often understood as data volumes of a few gigabytes and beyond, can arise from:

  • A large proportion of irrelevant information in the data. For example, bulk RNA sequencing data for just one sample is more than ten gigabytes before compression and a few gigabytes after compression [2]. However, the ratio of exon reads, which are used in later analysis, over intron reads, which are not used, is small (it is expected to be 1/24 [3]).

  • A large number of samples, also called datapoints (each of which may represent a subject), in the data. For example, the Nationwide Emergency Department Sample data [4] contains more than 30 million subjects but only slightly more than 100 features.

  • A large number of features. For example, human gene expression data have 20,412 protein-encoding genes, 14,600 pseudogenes, 14,727 long non-coding RNAs, and 5037 small non-coding RNAs. Some datasets may have large numbers of both samples and features. For example, single-cell RNA sequencing data may contain data from hundreds of thousands of cells [5], capturing the whole genome.

A Brief History of Translational Bioinformatics

Translational bioinformatics is a relatively young field. According to Ouzounis [6], translational bioinformatics started in 1996. In the beginning, the field primarily researched how to organize biomedical data and build an ontology system improving the interpretation and searching of biomedical research. After the first version of the human genome project in 2003 [7], genomic analysis was added to translational bioinformatics and continued growing to be a key area in the field. Since 2005 in Europe and 2009 in the United States, programs to widely adopt electronic medical records in patient care and research have been launched [8]. Consequently, large amounts of past clinical data stored in electronic format could readily be used in translational research. This enabled the development of the biomedical informatics component of translational bioinformatics. As translational bioinformatic techniques mature in the areas of genomics and biomedical informatics, they are further adapted to carry out research on other biomedical data, such as microbiome, chemical informatics, and metabolomics data. Today, translational bioinformatics is a multidisciplinary field, extending from the molecular level (genes, proteins, and other molecular entities below the cell) to the population level (collections of living subjects).

Concepts of AI in Translational Bioinformatics

AI in translational bioinformatics covers a broader range of problems than it does in other clinical fields. In clinical practice, the main goal of applying AI is often to complete tasks that used to require manual labor. Some AI applications, such as predicting patient readmission, may perform tasks not typically conducted by human beings. However, producing new knowledge is not required. In translational bioinformatics, besides supporting manual labor, an important goal when using AI is to infer new knowledge, with typical applications including:

  • Association Studies: mining for novel relationships among different biomedical entities.

  • Subtyping and clustering: dividing patients and samples into different groups such that each group may explicitly represent a sub-clinical outcome or a sub-phenotype.

  • Modeling and knowledge representation: mathematically representing the associations and cause-effect relations among different biomedical entities. The representation, in this case, is often in a system of differential equations.

  • Simulation: mathematically representing the changes observed in biomedical subjects by a system of dynamic equations. The system has the general form x(t + 1) = F(x(t), u(t)). Here, x(t) represents the subject at timepoint t, u(t) represents the interference at timepoint t, and x(t + 1) represents the subject at the next timepoint.

  • Spatial visualization: visualizing biomedical datapoints in 2D or 3D space.
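The simulation form above can be made concrete with a toy discrete-time model. In this sketch, x(t) is a hypothetical cell count and u(t) a treatment dose; the logistic-growth step function F and all parameter values are illustrative assumptions, not taken from the chapter:

```python
# Toy discrete-time simulation x(t+1) = F(x(t), u(t)).
# x(t): hypothetical cell count; u(t): treatment dose at timepoint t.

def F(x, u, growth=0.2, capacity=1e6, kill=2.0):
    """One simulation step: logistic growth minus a dose-dependent kill term."""
    return x + growth * x * (1 - x / capacity) - kill * u * x

def simulate(x0, doses):
    """Iterate the system over a schedule of interventions u(0), u(1), ..."""
    xs = [x0]
    for u in doses:
        xs.append(F(xs[-1], u))
    return xs

# Two untreated steps, then three treated steps.
trajectory = simulate(1e4, doses=[0.0, 0.0, 0.3, 0.3, 0.3])
# The count grows while untreated and shrinks once dosing starts.
```

Any such F can be substituted; the point is that the state at each timepoint is computed only from the previous state and the intervention applied at that time.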

Primary Data Categories in Translational Bioinformatics

Genomic Data

This chapter broadly refers to all types of data involving genes, proteins, miRNAs, metabolites, and biological reactions as genomic data, which includes data in both genomic and functional genomic subcategories.

Genomic and other -omic data, as introduced in Chap. 3, refer to the measures, characteristics, and annotations of genes. The original definition of genomics referred only to the study of genes or DNA sequences and their related information [9]. However, the data and research scope in bioinformatics also cover other molecular entities involved in the transcription and translation processes. Therefore, -omic data include:

  • Proteomics is the study of proteins [10].

  • Metabolomics studies the chemicals participating in dynamic metabolic processes [11].

  • Transcriptomics studies the transcription process [12], focusing on RNAs and other molecules with transcription-regulating functions.

In some literature [13, 14], the word “genomics” (or gene) is used interchangeably with “transcriptomics” (RNA) and “proteomics” (protein).

When analyzing translational bioinformatics data, it is important to recognize and categorize the data by measure and resolution. Measure refers to the type of molecular entities that are collected, counted, or observed. The technical terms microarray [15], copy-number variation [16], and mutation only refer to DNA. The technical terms RNA sequencing and transcript count only refer to RNA [17, 18]. The terms western blotting [19], multi-level structure, and protein-protein binding affinity [20] refer only to protein. Each measure and molecular type has its unique physical characteristics; therefore, applying a method built for one measure to another should be very carefully considered. Resolution refers to whether the measures are collected from the tissue (bulk) sample, which is a collection of cells, at the single-cell level, or at the sub-cellular molecular level (i.e., isolated proteins). While bulk and single-cell samples can have the same measures (e.g., transcript count in bulk RNA and single-cell RNA), their numerical characteristics are very different.

The results from analyzing -omic data by researchers from multiple fields are carefully curated and organized into annotated catalogs. The Gene Ontology catalog [21, 22] identifies which genes participate in specific biological processes, belong to specific cellular components, or share a specific molecular function. Pathway catalogs, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [23] and Reactome [24], annotate how groups of genes interact with each other and activate in specific orders to regulate cellular phenotypes and processes in response to external stimuli. Protein catalogs, such as UniProt [25], Protein Data Bank [26], and STRING [27], collect and organize protein structures and interactions. These catalogs are important to interpret the new omic analytic results, highlighting the molecular features that differentiate two or more phenotypes.

Clinomic Data

Clinomic data, also called the “clinotype” [28], refers to the measures and characteristics of a living subject that are useful for medical research and interventions [28, 29]. The major biomedical data categories are diagnosis-related data, laboratory test results, medication data, and medical imaging data [30]. Diagnosis-related data refer to the time, stage, and survivability of a disease or disorder in a subject. Laboratory test results, which are also called biomeasures [31], refer to a subject’s quantifiable observations and the biological material concerned (e.g., blood and urine samples). Because laboratory test results are quantifiable, we can systematically define thresholds and criteria to determine whether a result indicates particular diseases or disorders. Medication data refer to the time and types of interventions, or treatments, received by a subject. Medical imaging data, such as X-rays and functional magnetic resonance images, can be understood as observations from a subject that are not directly quantified. Therefore, interpretation of medical images in research and clinical practice often occurs on a case-by-case basis and depends on physicians’ and researchers’ training experience. In addition, biomedical data may contain other types of data that are specifically collected for particular research and clinical practice scenarios, such as smoking history [32, 33], diet type [34], exercise frequency [35], and drug side-effect history. Also, clinomic data may include genomic data when a subject’s genetic data is used to diagnose and help decide a treatment [36,37,38].

Phenotypic Data

In translational bioinformatics, the basic phenotype definition refers to the diseases and abnormalities affecting each subject, such as breast cancer and diabetes. In the broader context, phenotype refers to the subjects’ categorization and definition as assigned by biomedical experts. In this context, a phenotype definition, such as ‘cell proliferation’ and ‘chemotherapy resistance’, is specific to each research project or clinical trial. To differentiate phenotypic data from clinomic data, we may understand that phenotypic data are directly derived from clinomic observations.

Categorizing AI Applications in Translational Bioinformatics

In treating complex diseases and in precision medicine, linking omics and biomedical data is expected to improve the quality of care [39, 40] over current practice, which relies primarily on clinomic or phenomic data (also referred to as biomedical data) alone. Biomedical data will still be essential to detect and monitor disease progression, tasks for which single-type omics data has not yet shown superiority [39]. Meanwhile, omics data is essential to find driver mutations and their functionalities, prerequisites to understanding their roles in causing a disease. Moreover, even when the major causes of a disease are not exclusively genetic, such as with hypertension, knowing the patients’ omics data may still improve treatment precision [41, 42].

Linking -omics, clinotype, and phenotype data opens new questions that require advanced AI, machine learning, and data mining techniques to resolve. Clinotype-to-clinotype (C2C) association discovery, similar to “omic association” [43, 44], finds the clinotypes that co-occur in subjects’ data and determines whether one clinotype occurrence is likely to precede those of other clinotypes. Discovery of novel clinotype-to-phenotype associations may advance risk assessment beyond currently established disease-specific laboratory markers toward sets of simpler, more cost-effective risk markers. This would allow patients and physicians to take early and preventive actions [45].

From a knowledge discovery perspective, Fig. 14.1 summarizes AI applications in translational bioinformatics according to the data categories we have introduced. Here, three data categories yield six possible types of association. In this figure, genomic-to-clinomic association has not been well-defined (and is therefore not illustrated). The other five types of association are as follows.

Fig. 14.1
An illustration showing the categorization of phenotypes into genotypes and clinotypes with the possible association between the three categories, namely P2P, G2P, P2C, G2G, and C2C.

Categorizing AI Translational Bioinformatics

G2G (Genomic to Genomic)

G2G refers to applications involving finding gene-gene associations using AI techniques. G2G has many sub-problems, which are defined by the gene-gene mechanisms of interaction concerned. Sub-problems that are foci of current research include estimation of protein-protein binding affinity [46], and prediction of the targets of transcription factors [47].

G2P (Genomic to Phenotypic): Genome-Wide Association Studies (GWAS)

The main purpose of GWAS is to find the genetic variants associated with a specific disease or phenotype. According to the GWAS catalog in 2019 [48], there have been 5687 GWAS studies, which list 71,673 variant-phenotype associations computed using statistical methods. AI methods can improve GWAS results by enhancing the statistical power of associations, improving polygenic risk scoring, and ranking gene variants that are strongly associated with a genetic disease [49]. In most AI applications in GWAS, the key step is to build classification models using variant features to differentiate the phenotype, such as disease vs. normal.

From an AI perspective, the GWAS data has the following characteristics:

  • Statistical feature selection is generally applied before using AI methods to analyze the data. However, the statistical methods select the features one by one; thus, they may not address important dependencies among the features.

  • The data is often represented by a binary matrix, which represents whether or not a patient’s genome has a specific variant.

In analyzing GWAS, linear models such as regularized regression and support vector machines (Chap. 6) are widely applied (see for example [50,51,52,53]). Here, model scores, such as the predicted probabilities, can serve as risk scoring, and model coefficients can be used to rank the features. Random forest [54] models can also support these two tasks (see for example [55, 56]). Meanwhile, some research [57, 58] shows that artificial neural network models may have the advantage in risk scoring/classification performance; however, the model architecture is less conducive to feature ranking than other more straightforward approaches such as regression models.
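As a minimal sketch of this workflow, the following hypothetical example fits a regularized logistic regression to a synthetic binary variant matrix, uses predicted probabilities as risk scores, and ranks variants by coefficient magnitude (the data, effect size, and variant count are all invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_subjects, n_variants = 500, 20

# Binary genotype matrix: X[i, j] == 1 if subject i carries variant j.
X = rng.integers(0, 2, size=(n_subjects, n_variants)).astype(float)
# Simulated phenotype driven mainly by variant 0 (purely illustrative).
logit = 2.5 * X[:, 0] - 1.0
y = (rng.random(n_subjects) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

risk_scores = model.predict_proba(X)[:, 1]     # per-subject risk scores
ranking = np.argsort(-np.abs(model.coef_[0]))  # rank variants by |coefficient|
# The causal variant (index 0) should appear at the top of the ranking.
```

In a real study, the model would be evaluated on held-out subjects, and the coefficient-based ranking would feed into the validation resources described earlier in the chapter.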

P2P (Phenotypic to Phenotypic): Identify Disease Genomic Subtypes

Subtyping of complex genomic disease, such as glioblastoma multiforme (GBM) [59], can answer key questions in both pre-clinical studies and clinical practice. These diseases are caused by multiple genetic anomalies and signaling pathway disruptions; therefore, a combination of therapeutic strategies is required to treat them. The purpose of subtyping in this context is to partition the disease into multiple subgroups, and find the explicitly disrupted signaling pathways in each of them. This may allow for customizing the treatment for each group according to the affected pathways. Solving this problem requires clustering and feature selection algorithms in AI. The clustering results reduce the subtyping problem into classification (which imputed subtype does this patient belong to?) and feature selection (which signaling pathways are affected in this imputed subtype?) problems for follow-up analyses. For example, in the TCGA-GBM dataset [59], the clustering was followed by GWAS analysis in each patient group. Here, GWAS mutations defined four GBM subtypes: classical, mesenchymal, proneural, and neural, and 29 subtype-related prognostic markers.
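A toy version of this subtyping workflow, clustering a synthetic expression matrix and then selecting the features that separate the imputed subtypes, might look like the following (the two-subtype structure and the k-means choice are illustrative assumptions, not the TCGA-GBM method):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Toy expression matrix: 100 samples x 50 genes, with two hidden subtypes
# that differ in the mean expression of the first 10 genes.
group_a = rng.normal(0.0, 1.0, size=(50, 50))
group_b = rng.normal(0.0, 1.0, size=(50, 50))
group_b[:, :10] += 4.0
X = np.vstack([group_a, group_b])

# Partition the samples into two imputed subtypes.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Follow-up feature selection: which genes best separate the subtypes?
diff = np.abs(X[labels == 0].mean(axis=0) - X[labels == 1].mean(axis=0))
top_genes = np.argsort(-diff)[:10]
# top_genes should recover the 10 genes that define the hidden subtypes.
```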

P2C (Phenotypic to Clinomic)

The protocol to diagnose a disease, which often consists of a pre-defined set of laboratory tests, is the most commonly used type of phenotype-clinotype association. Other phenotype-clinotype associations, once discovered by AI techniques, are considered novel. For example, in work on the identification of hypertension risk [45], patients’ future hypertension could be predicted with AI using affordable laboratory tests other than blood pressure measurements.

C2C (Clinomic to Clinomic)

From a narrow perspective, clinotype-clinotype association refers to the correlation among laboratory test results. From a broader perspective, this type of association refers to how a specific clinotype result might change, or be predicted, given other clinotype results. For example, in work seeking to derive links between the primary translational informatics data categories [28], a linear model was constructed for each clinotype, using all other clinotypes as input features to predict its value. Here, the model coefficients were used to quantify clinotype-clinotype association.
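A minimal sketch of this C2C idea, with synthetic lab-test values, fits an ordinary least-squares model predicting one clinotype from the others and reads the coefficients as association strengths (the tests and the 0.8 effect are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# Synthetic clinotype table: three lab tests, where test2 depends on test0.
test0 = rng.normal(100.0, 15.0, n)
test1 = rng.normal(50.0, 5.0, n)
test2 = 0.8 * test0 + rng.normal(0.0, 5.0, n)

# Predict test2 from the other clinotypes with ordinary least squares;
# the fitted coefficients quantify clinotype-clinotype association.
A = np.column_stack([test0, test1, np.ones(n)])  # design matrix + intercept
coef, *_ = np.linalg.lstsq(A, test2, rcond=None)
# coef[0] (test0) should be near 0.8; coef[1] (test1) should be near 0.
```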

Informatics Challenges in Translational Bioinformatics

Big Data Characteristics

An understanding of big data characteristics is required to apply AI in translational bioinformatics. This section reviews the characteristics of big data and how each characteristic can impact AI performance.

Volume of Data

Large data size constrains AI performance in translational bioinformatics in two ways. First, because of the associated logistical challenges, large data size may make solving some AI problems impractical without sufficient computer storage and specialized hardware. For example, Quang et al. [60] show an example of a motif discovery problem that may take a computer a few weeks to complete without a big-data-specific GPU. Second, owing to the “curse of dimensionality,” the very large number of features in big data can decrease predictive performance, which is known as the Hughes phenomenon [61]. While dimension reduction may help relieve the curse of dimensionality, it also reduces the interpretability of the AI results, because the components of a reduced-dimensional representation may not map back to individual features. In other words, it leads to inferring less powerful hypotheses from the input features to the predicted output variable.

Veracity of Data

Veracity refers to data quality and, conversely, noise. Noise detection and filtering is a challenge in analyzing data from many fields. However, in translational bioinformatics, differentiating between noise and meaningful but as-yet-unverified novel information makes this challenge more difficult. In biomedical research, the data often contain yet-to-be-discovered information. This information may appear in only a very small percentage of the data; therefore, statistically, it has characteristics similar to noise. For example, single-cell expression data [62] usually show small cell clusters, consisting of less than 5% of the cell population. These clusters may correspond to “doublets” [63], a technical error (real noise), or to stem/progenitor cells (which would constitute critical and novel information). Tackling this challenge requires reasonable noise assumptions and novel hypotheses to emerge from a strong collaboration between the AI analyst and the biomedical experts concerned.

Variability of Data

Data heterogeneity, or variability, has always been among the most difficult challenges in translational bioinformatics [64]. There are many aspects of data heterogeneity. First, the data are of many types, including omics data subtypes and biomedical data subtypes. In this aspect, the data integration strategy and integrative analysis are the keys to overcoming the challenge. Second, the same data type may show significant variability due to the methods and platforms of collection, such as the batch effect [65] seen when single-cell sequencing the same tissue on the 10X and ICell8 RNA-seq platforms [66]. In this aspect, computational mapping across the platforms is critical to the analysis. Third, results with the same data type and the same collection method can still be interpreted differently by different healthcare providers. For example, the normal range for the hematocrit test can be 35–40% or 40–50%, depending on the specific patients and the physicians who analyze the test result [28]. In this aspect, accurate and comprehensive biomedical domain knowledge is required.

Velocity of Data

Velocity refers to how quickly the data must be processed and analyzed, and how quickly results must be produced. In general translational bioinformatics research, velocity is not a major challenge. However, requirements to deliver results on time must be considered when building online tools and clinical decision support. The principal requirement in tackling this challenge is to understand the data management system and hardware infrastructure on which the AI tools are to be deployed (see the section on “Applications of AI in Translational Bioinformatics”).

Social-Economic Bias

Unlike pre-clinical research, clinical translational bioinformatics research uses patients’ clinical information. In this setting, incorporating socioeconomic factors into analyses may be unavoidable. However, predictive models based on such factors raise concerns about algorithmic fairness, with the potential to exacerbate existing socioeconomic disparities (see Chap. 18). Thus, in the data processing pipeline, features strongly associated with socioeconomic factors should be removed.

Domain Knowledge Representation and Interpretability

Using domain knowledge collections to validate findings can improve the interpretability of AI results, which is desirable before making clinical decisions on the basis of these findings. In bioinformatics, the common practice is to use feature extraction methods or infer model-explicit features (biomarkers) that differentiate the sample classes. These features are then forwarded to pathway, gene set, and gene ontology analysis to reveal which biological mechanisms are involved. The concordance between the highlighted mechanisms and the known biology of the samples supports the quality of the analysis. For example, when analyzing fetal mouse heart proliferation data [67], proliferative pathways and gene ontologies, such as cell cycle and cell differentiation, are expected to be enriched. If the proliferative pathways and ontologies are not enriched, while the ‘hypertrophy’ ones are, then the analytic quality could be questionable.

Model Robustness and Quality Control

Sample imbalance is usually the first issue impacting model quality in AI applications in translational bioinformatics. In the pre-clinical setting, the proportion of negative samples, such as healthy or disease-free samples, is often much smaller. In the clinical setting, we often see a very small proportion of positive samples, since the number of patients is expected to be small compared to the population size. Extremely imbalanced data may seriously impair AI models, which are often optimized for accuracy. For example, when the positive sample proportion is only 5%, a naïve “all negative” prediction model yields an accuracy of 95% (very high). However, this model cannot predict positive samples; thus, it has no clinical value. To tackle imbalance, oversampling or undersampling can be applied to create a training dataset with a positive/negative sample ratio that is more balanced than in the whole dataset. In oversampling, a minority-class sample may randomly appear more than once in the training set, as in the Synthetic Minority Over-sampling Technique (SMOTE) [68]. Here, the sample may be slightly perturbed if it is selected more than once, using techniques of data augmentation [69, 70] analogous to those applied to images when training models for computer vision, such as rotations and reflections [71]. Alternatively, in undersampling, only a random subset of the majority-class samples is included in the training set.
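A minimal random-oversampling sketch (simpler than SMOTE, which additionally interpolates synthetic neighbors) might look like this; the 5% positive rate echoes the example above, while the feature values are arbitrary:

```python
import numpy as np

def oversample(X, y, rng):
    """Randomly duplicate minority-class samples until classes are balanced.
    (A plain sketch; SMOTE additionally interpolates synthetic neighbors.)"""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([1] * 5 + [0] * 95)  # 5% positives: highly imbalanced
X_bal, y_bal = oversample(X, y, rng)
# y_bal now contains 95 positives and 95 negatives.
```

Undersampling is the mirror image: keep all minority samples and draw a random subset of the majority class instead.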

The optimization criteria in training AI models should be carefully decided case by case. This involves two choices. The first is the choice or definition of the loss function (see Chap. 6), with commonly applied examples including the mean-square error, L1 loss, hinge loss, or cross-entropy loss. The second is the choice of metric to focus on: maximizing accuracy, AUC, positive-predictive value, or negative-predictive value.

Statistical tests for model robustness. Many AI methods are model-based and assume the data have certain characteristics and follow particular distributions. These assumptions need to be verified before applying the AI methods. The Kolmogorov–Smirnov test (KS test) [72] addresses whether a set of numbers follows a pre-defined distribution, and to what degree two sets of numbers are drawn from the same distribution. Thus, in principle, the KS test or another similarly purposed test should be applied to examine the data before deciding on the AI method. On the other hand, the model result depends on its hyperparameters [73], which must be set before running the AI algorithm; choosing hyperparameters is beyond the scope of the optimization that the AI algorithm itself can achieve. Therefore, post-hoc analyses, such as the Wald test [74] and other model-fitness tests, should be used to test the fitness of the computed parameters (also commonly called the model parameters). To conduct these tests, null model parameters need to be pre-defined; usually, a null model parameter is set to 0, which implies that the parameter plays no role in the model. If the test result is insignificant, meaning the computed model parameters are very similar to the null parameters, then the model may not be robust, and one should choose other algorithm hyperparameters and recompute the model.
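As an illustration of the KS test in this role, the following sketch (with synthetic data) checks one sample against a standard normal distribution and compares two samples against each other using SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(0.0, 1.0, 500)
skewed_data = rng.exponential(1.0, 500)

# One-sample KS test: does each sample follow a standard normal distribution?
p_normal = stats.kstest(normal_data, "norm").pvalue  # expected to be large
p_skewed = stats.kstest(skewed_data, "norm").pvalue  # expected to be tiny

# Two-sample KS test: are the two samples drawn from the same distribution?
p_two = stats.ks_2samp(normal_data, skewed_data).pvalue
```

A large p-value is consistent with the assumed distribution; a tiny one signals that a normality-assuming AI method would be a poor fit for the data.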

Translational Bioinformatics Tools & Infrastructure

The big data characteristics described in the section on “Concepts of AI in Translational Bioinformatics” necessitate efficient, scalable tools and infrastructure components. In this section we will describe some of the key tools and components required to conduct translational bioinformatics analyses.

Extended Data Management Systems

While improving translational bioinformatics data storage may not be the primary research objective of AI in medicine, the developers of AI tools should consider the existing data facilities in order to improve their runtime performance in practice. First, the object-relational database is still the primary translational bioinformatics data structure, with prominent examples including the Mayo Clinic database [75], STRING (a database of protein-protein interactions) [27], and the US National Inpatient Sample database [76]. The relational structure has the advantage of supporting the transformation of the data into any customized data structure used by AI applications. Meanwhile, to improve data retrieval performance, some translational bioinformatics data warehouses choose a non-relational structure for specific data types. Second, when the data are hierarchical or consist primarily of associations among different entities, a non-relational database is adopted. For example, the Reactome biological pathway repository [24] is built upon the Neo4j [77] graph database engine, and Gene Ontology [22] organizes data in hierarchical Extensible Markup Language (XML) files [78]. Other systems may adopt a hybrid structure when the data are extremely large and must be kept in multiple formats; for example, cBioPortal for cancer genomics [79] uses a relational table structure for patient clinical data and a file system for patient omic data. Third, hybrid and distributed data warehouses implement both relational and non-relational infrastructure, such as in CloudBurst [80], BiobankCloud [81], and Hydra [82].

Data Preprocessing Pipelines

Pipelines to Build the Data Matrix

AI techniques view data in matrix format. However, before processing, biomedical data, such as high-throughput sequencing data, are not in this format. Therefore, data-type specific pipelines are required to convert the raw biomedical data to a matrix format. Table 14.1 summarizes the well-known pipelines in translational bioinformatics.

Table 14.1 Popular standard pipelines used in translational bioinformatics

Enhancing the Data Matrix

After formatting the data as a matrix, the matrix must be further processed to remove bias and noise. Choosing which method to use in this step is an ad-hoc decision made on a case-by-case basis. Well-known problems and techniques in data processing are:

  • Dimension reduction: typically used methods include principal component analysis [87] and canonical correlation analysis [88].

  • Data scaling and normalization. For example, gene expression data are assumed to follow the negative binomial distribution [89]. Gene expression analysis packages, such as DESeq2 [90] and SAGE [91], implement negative binomial scaling before applying the main statistical analysis.

  • Batch effect correction [92] is applied when the same type of dataset is generated at different rounds of experiments. Variance due to uncontrolled and random technical issues may appear in the data.

  • Embedded data visualization, such as t-SNE [93] and UMAP [94].
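Several of these steps can be chained in a few lines. The sketch below standardizes a synthetic expression matrix and reduces it with principal component analysis; the matrix dimensions and component count are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Toy expression matrix: 200 samples x 1000 genes (values are synthetic).
X = rng.normal(size=(200, 1000))

# Standardize each feature, then project onto the top principal components.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=10).fit_transform(X_scaled)
# X_reduced has shape (200, 10): one low-dimensional row per sample.
```

The reduced matrix can then be passed to clustering, classification, or an embedding method such as t-SNE or UMAP for visualization.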

Supervised and Unsupervised Learning

Supervised machine learning (see Chap. 6), also called supervised analysis, involves finding a function that reproduces an output from an input. Here, “reproduce” implies that the correct output is already available without using the machine learning function. In biomedicine, supervised analysis often involves creating and fine-tuning computer algorithms and software that can substitute for a human performing a task. Examples of supervised analysis in biomedicine are:

  • Automated detection of a tumor region and estimation of tumor size from a radiological image [95, 96] (see Chap. 12).

  • Identifying cell type from single-cell gene expression data [97].

  • Detecting patients’ chronic disease diagnoses from their general health checkup records [28].

The main purpose of such supervised analysis is to automate and speed up manual tasks. Supervised learning can be implemented by digitizing rules and human knowledge, which is also called rule-based learning [98, 99], or through computation exclusively without such rules or knowledge. Support vector machines [100], linear regression [101], random forests [54], and deep learning methods [102] are well-known fully computational techniques that neither encode nor require human rules and knowledge. Although inferring novel knowledge may be achieved with supervised analysis (for example, by making predictions of drug effects beyond the scope of the training data), this is not usually the main goal when conducting supervised analysis in translational bioinformatics.

Unsupervised learning still finds a function from input to output, but the correct output has not been determined in advance. The main purpose of unsupervised learning in translational bioinformatics is to generate new biomedical knowledge that can be tested and verified in follow-up studies. Some examples of unsupervised learning are:

  • Identifying new disease subtypes from genomic data [28], such as identifying genetic mutations that potentially indicate subtypes of glioblastoma [103].

  • Drug repurposing [104], or finding ways of applying an old drug to treat a new disease. For example, the disease-drug and drug-drug associations could be represented by a graph [105]. Clustering this graph, which is an unsupervised learning problem, results in many clusters. Each cluster consists of multiple diseases and drugs to be further examined for repurposing.

  • Identifying and characterizing new cell subtypes in single-cell omics data.

Some popular unsupervised machine learning techniques are clustering [106], expectation-maximization methods [107], and non-negative matrix factorization [108].

Popular Algorithms in Translational Bioinformatics

Extending the discussion of algorithms and tools in Chap. 6, this section provides additional details for popular AI algorithms in translational bioinformatics.

Classification Algorithms

Random forest [54] is a well-known example of an ensemble classification algorithm. A random forest consists of many decision trees (Fig. 14.1). Each tree is a discrete classification model, which consists of multiple classification rules. Each tree is constructed by applying the decision tree algorithm [109] to a random subset of the training data, including randomly selected samples and features. Then, to classify a sample, the random forest combines all trees’ classification results using majority voting or some other aggregation procedure.

Figure 14.2 illustrates a random forest. Here, the classification problem has four features. Three decision trees are randomly constructed: tree 1 uses only x1 and x2, tree 2 uses only x3 and x4, and tree 3 uses x1, x2, and x4. Each tree is a classification model consisting of multiple if-then rules. For example, tree 1 has three rules: x1 ≤ 5 → Class: No, x1 > 5 & x2 ≤ 50 → Class: No, x1 > 5 & x2 > 50 → Class: Yes. When classifying the sample (2, 68, 342, 6899), each tree follows the decision branch matching the sample’s feature values (grey-shaded hexagons). Then, all trees’ results (2 No, 1 Yes) are combined using a majority-rule vote to make the final decision (Class: No).

Fig. 14.2

Illustration of a random forest

To measure how important a feature x is in each tree, the algorithm compares the classification accuracies when x is included in and removed from the tree [110]. The more the accuracy decreases when removing x, the more important x is in that tree. The random forest then combines x’s importance scores across all trees into an overall feature importance score for the forest, which can be used as a metric for feature ranking.
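As a concrete sketch, the majority-vote step of the toy forest in Fig. 14.2 can be written in Python. Tree 1 encodes the three rules quoted above; the rules for trees 2 and 3 are hypothetical stand-ins invented here, chosen only to reproduce the (2 No, 1 Yes) vote:

```python
from collections import Counter

# Tree 1 encodes the rules given in the text; trees 2 and 3 are hypothetical.
def tree1(x):
    if x[0] <= 5:                        # x1 <= 5 -> No
        return "No"
    return "Yes" if x[1] > 50 else "No"  # x1 > 5, split on x2

def tree2(x):
    return "Yes" if x[2] > 300 else "No"     # hypothetical split on x3

def tree3(x):
    if x[0] <= 5:                            # hypothetical split on x1, x4
        return "No"
    return "Yes" if x[3] > 5000 else "No"

def forest_classify(x):
    # combine the trees' results with a majority-rule vote
    votes = Counter(tree(x) for tree in (tree1, tree2, tree3))
    return votes.most_common(1)[0][0]
```

For the sample (2, 68, 342, 6899), the votes here are No, Yes, No, so the forest returns “No”, matching Fig. 14.2.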

Naïve Bayesian classifier. Given a datapoint x = (x1 = a1, x2 = a2, …, xn = an) in n dimensions, we denote x1, x2, …, xn as the attributes and a1, a2, …, an as their values. The objective is to predict which class Ci, among k classes, x belongs to. According to Bayes’ rule, the posterior probability that x belongs to class Ci is:

$$ p\left({C}_i|\textbf{x}\right)=\frac{p\left(\textbf{x}=\left({a}_1,{a}_2,\dots, {a}_n\right)|{C}_i\right)\times p\left({C}_i\right)}{p\left(\textbf{x}=\left({a}_1,{a}_2,\dots, {a}_n\right)\right)}. $$
(14.1)

In Eq. (14.1), the denominator can be ignored because it is the same for all k classes. Thus, the classification is decided by which class has the largest p(x = (a1, a2, …, an)| Ci) × p(Ci). Here, the prior p(Ci) is the frequency of class i in the population. The likelihood p(x = (a1, a2, …, an)| Ci) is the probability of observing a sample x = (a1, a2, …, an) in class Ci. In the naïve scenario, we assume that all attributes x1, x2, …, xn are independent of one another given the class. Therefore:

$$ p\left(\textbf{x}=\left({a}_1,{a}_2,\dots, {a}_n\right)|{C}_i\right)=p\left({x}_1={a}_1|{C}_i\right)\times p\left({x}_2={a}_2|{C}_i\right)\times \dots \times p\left({x}_n={a}_n|{C}_i\right). $$
(14.2)

In discrete data, all elements in Eqs. (14.1) and (14.2) are computed by counting. In continuous data, the probabilities p(xj = aj| Ci) are computed from the probability density function of class Ci, which requires more advanced probabilistic modeling. Also, when the attributes are not completely independent, the naïve Bayesian algorithm can be extended to a Bayesian network. The reader can practice these advanced cases using open-source, freely available machine learning libraries such as Weka [111].
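The counting approach for discrete data can be sketched in Python. The toy symptom/diagnosis dataset below is invented for illustration:

```python
from collections import Counter, defaultdict

# Toy discrete dataset: (attribute-value dict, class label).
# Attribute names, values, and labels are invented for illustration.
data = [
    ({"fever": "yes", "cough": "yes"}, "flu"),
    ({"fever": "yes", "cough": "no"},  "flu"),
    ({"fever": "yes", "cough": "yes"}, "flu"),
    ({"fever": "no",  "cough": "yes"}, "cold"),
    ({"fever": "no",  "cough": "no"},  "cold"),
    ({"fever": "no",  "cough": "yes"}, "cold"),
]

def train(data):
    class_counts = Counter(label for _, label in data)           # for p(C_i)
    value_counts = defaultdict(Counter)    # (class, attribute) -> value counts
    for attrs, label in data:
        for attr, value in attrs.items():
            value_counts[(label, attr)][value] += 1
    return class_counts, value_counts

def classify(x, class_counts, value_counts):
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / n                           # prior p(C_i)
        for attr, value in x.items():               # Eq. (14.2) by counting
            score *= value_counts[(label, attr)][value] / count
        if score > best_score:
            best, best_score = label, score
    return best
```

The classifier simply picks the class maximizing the prior times the counted likelihoods, as in Eqs. (14.1) and (14.2).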

Clustering Algorithms

Expectation-maximization clustering [112], briefly, is an iterative process to identify the parameters of the distributions underlying the data. In each iteration, the datapoints are allocated to distribution components; then, all component parameters are re-calculated according to the datapoint allocation. The process repeats until the component parameters converge.

K-means clustering [113] is a popular clustering algorithm. Here, the goal is to partition the datapoints into k clusters. The defining parameter of each cluster is its centroid, which is the average of all datapoints assigned to it. A datapoint is allocated to the cluster with the closest centroid. Clusters are initially assigned randomly, and the process of centroid estimation and cluster assignment is repeated iteratively until the centroids converge. Figure 14.3 illustrates k-means clustering using a toy example (n = 10 datapoints, k = 3).

Fig. 14.3

An illustrative example of the k-means clustering algorithm
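The assignment and centroid-update steps described above can be sketched in Python. The 2D points and fixed initial centroids below are illustrative; real implementations typically use random initialization:

```python
# Minimal k-means in 2D. The points and fixed initial centroids are
# illustrative; real implementations usually pick random starting centroids.
def kmeans(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # update step: each centroid becomes the mean of its cluster
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            for cl in clusters if cl
        ]
    return centroids, clusters

points = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, [(0, 0), (5, 5)])
```

With these points, the two centroids converge to (1/3, 1/3) and (31/3, 31/3) after the first iteration.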

Consensus clustering. In the GBM case study (introduced in the section on “Phenotypic Data”) [59], consensus clustering was applied to divide the TCGA-GBM patients into k = 4 groups (subtypes). Consensus clustering is an iterative clustering procedure. The process starts by choosing the number of expected clusters (k) in the dataset. The core clustering algorithm, which was hierarchical clustering in TCGA-GBM [59], executes multiple runs with the same k on the dataset. Then the results of these runs are aggregated and evaluated for clustering quality. The same process is repeated with other choices of k; here, larger k often yields higher clustering quality. After experimenting with many choices of k, the final k is chosen by balancing the preference for a smaller k against the desire for better clustering quality. The silhouette index [114] is a well-known metric of clustering quality.

Matrix factorization. In machine learning, briefly, matrix factorization is the decomposition of a data matrix M (m datapoints × n attributes) into a product PQ ≈ M. Here, P is an m × k matrix, Q is a k × n matrix, and k is a pre-defined number determining the dimensions of the latent feature space. Each row in P represents a datapoint in the latent space, and each column in Q represents an attribute in the latent space (although such latent attributes may be derived from multiple input features in M, and thus be relatively difficult to interpret, as discussed in the section on “Volume of Data”). In matrix-factorization clustering, the latent space is defined by the number of clusters. Then, each row in P represents which clusters the corresponding datapoint may belong to, and each column in Q represents in which clusters the corresponding attribute is enriched.

Mathematically, matrix factorization is an optimization problem (for an introduction to solving optimization problems using gradient descent, see Chap. 6): find P and Q to minimize

$$ F={\left\Vert \textbf{M}-\textbf{PQ}\right\Vert}^2+\alpha {\left\Vert \textbf{P}\right\Vert}^2+\beta {\left\Vert \textbf{Q}\right\Vert}^2. $$
(14.3)

In this formula, α and β > 0 are pre-selected regularization parameters. The popular approaches to solving Eq. (14.3) are based on gradient descent theory [115]. Computing the partial derivative of F with respect to P (the derivation for Q is symmetric), we have

$$ \frac{\partial F}{\partial \textbf{P}}=\frac{\partial \left({\left\Vert \textbf{M}-\textbf{PQ}\right\Vert}^2\right)}{\partial \textbf{P}}+2\alpha \textbf{P}. $$
(14.4)

The first term in Eq. (14.4) can be computed by analyzing each entry (i, j) in M. We have, by matrix multiplication:

$$ {m}_{i,j}\approx \sum \limits_{l=1}^k{p}_{i,l}{q}_{l,j}. $$
(14.5)

In Eq. (14.5), mi, j is the entry at the ith row and jth column of M, pi, l is the entry at the ith row and lth column of P, and ql, j is the entry at the lth row and jth column of Q. To minimize \( {\left\Vert \textbf{M}-\textbf{PQ}\right\Vert}^2=\sum \sum {\left({m}_{i,j}-\sum \limits_{l=1}^k{p}_{i,l}{q}_{l,j}\right)}^2 \), we compute the partial derivative:

$$ \frac{\partial {\left({m}_{i,j}-\sum \limits_{l=1}^k{p}_{i,l}{q}_{l,j}\right)}^2}{\partial {p}_{i,l}}=-2{q}_{l,j}\left({m}_{i,j}-\sum \limits_{l=1}^k{p}_{i,l}{q}_{l,j}\right). $$
(14.6)

This allows updating each entry pi, l by gradient descent with a small learning rate σ:

$$ {p}_{i,l}={p}_{i,l}-\sigma \left(\sum \limits_{j=1}^n\frac{\partial {\left({m}_{i,j}-\sum \limits_{l=1}^k{p}_{i,l}{q}_{l,j}\right)}^2}{\partial {p}_{i,l}}+2\alpha {p}_{i,l}\right)={p}_{i,l}+\sigma \left(\sum \limits_{j=1}^n2{q}_{l,j}\left({m}_{i,j}-\sum \limits_{l=1}^k{p}_{i,l}{q}_{l,j}\right)-2\alpha {p}_{i,l}\right). $$
(14.7)

The update rule for ql, j is derived similarly and is left as an exercise. The updates in Eqs. (14.6) and (14.7) are applied over many iterations until P and Q converge.
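A minimal Python sketch of this procedure is shown below, using a stochastic per-entry variant of the batch update in Eq. (14.7) together with its symmetric counterpart for Q. The matrix size, rank k, learning rate, and regularization weight are illustrative choices:

```python
import random

# Stochastic per-entry variant of the gradient updates in Eqs. (14.6)-(14.7),
# applied symmetrically to P and Q. Sizes and hyperparameters are illustrative.
def factorize(M, k, alpha=0.01, lr=0.01, iters=2000, seed=0):
    rng = random.Random(seed)
    m, n = len(M), len(M[0])
    P = [[rng.random() for _ in range(k)] for _ in range(m)]
    Q = [[rng.random() for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        for i in range(m):
            for j in range(n):
                # residual of entry (i, j), per Eq. (14.5)
                e = M[i][j] - sum(P[i][l] * Q[l][j] for l in range(k))
                for l in range(k):
                    p, q = P[i][l], Q[l][j]
                    P[i][l] += lr * (2 * e * q - 2 * alpha * p)
                    Q[l][j] += lr * (2 * e * p - 2 * alpha * q)
    return P, Q

def reconstruction_error(M, P, Q):
    k = len(P[0])
    return sum((M[i][j] - sum(P[i][l] * Q[l][j] for l in range(k))) ** 2
               for i in range(len(M)) for j in range(len(M[0])))
```

For a matrix that is exactly rank 2, a rank-2 factorization drives the reconstruction error close to zero (up to the small bias introduced by the regularization terms).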

Dimension Reduction Algorithms

Embedding, briefly, is a non-affine method for reducing data dimensionality. By non-affine, we mean that a high-dimensional datapoint x is mapped to a lower-dimensional (usually 2D) datapoint y such that the inverse mapping y → x has no precise formula. Embedding aims to preserve, in the embedded pairs (yi, yj), the original relative similarity between every datapoint pair (xi, xj). In t-distributed Stochastic Neighbor Embedding (tSNE) [93], the pairwise similarity of (xi, xj) in the high-dimensional data space is defined as

$$ {p}_{j\mid i}=\frac{\exp \left(-{\left\Vert {\textbf{x}}_i-{\textbf{x}}_j\right\Vert}^2/2{\sigma}_i^2\right)}{\sum_{\forall k\ne i}\exp \left(-{\left\Vert {\textbf{x}}_i-{\textbf{x}}_k\right\Vert}^2/2{\sigma}_i^2\right)}. $$
(14.8)

And the pairwise similarity of (yi, yj) in the embedded data space is defined as

$$ {q}_{j\mid i}=\frac{{\left(1+{\left\Vert {\textbf{y}}_i-{\textbf{y}}_j\right\Vert}^2\right)}^{-1}}{\sum_k{\sum}_{l\ne k}{\left(1+{\left\Vert {\textbf{y}}_k-{\textbf{y}}_l\right\Vert}^2\right)}^{-1}}. $$
(14.9)

Upon defining these similarities, tSNE minimizes the Kullback–Leibler divergence

$$ KL\left(P\Big\Vert Q\right)=\sum \limits_{i\ne j}{p}_{j\mid i}\times \mathit{\log}\left(\frac{p_{j\mid i}}{q_{j\mid i}}\right). $$
(14.10)

Then, tSNE finds the embedded datapoints y using the gradient descent approach, computing the partial derivatives ∂KL/∂yi.
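The conditional similarities of Eq. (14.8) can be computed directly for a toy dataset. The sketch below uses a single fixed bandwidth σ for every point, whereas real tSNE tunes each σi to match a target perplexity:

```python
from math import exp

# Conditional similarities of Eq. (14.8) for toy 2-D points, using one fixed
# bandwidth sigma for all points (real tSNE tunes sigma_i per point).
def conditional_p(points, sigma=1.0):
    n = len(points)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        weights = []
        for k in range(n):
            if k == i:
                weights.append(0.0)       # self-similarity is excluded
            else:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[k]))
                weights.append(exp(-d2 / (2 * sigma ** 2)))
        total = sum(weights)
        for j in range(n):
            P[i][j] = weights[j] / total  # normalize as in Eq. (14.8)
    return P

P = conditional_p([(0, 0), (0, 1), (5, 5)])
```

Each row of P sums to 1, and nearby points receive larger conditional similarities than distant ones.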

Association Mining Algorithms

In translational bioinformatics, mining associations among features depends on defining an association metric. Some popular choices are:

  • Pearson’s correlation. Briefly, Pearson’s correlation is the ratio between the covariance of two features and the product of their standard deviations. This metric requires the features to be in numeric format, and only measures how the two features linearly correlate. If the two features have a non-linear association, Pearson’s correlation may not detect the association.

  • Mutual information metrics [116]. Mutual information measures the dependency between two features. For example, consider two Boolean features A and B. In a dataset, A and B are independent if the frequency of ‘A is true’ is approximately the same as the frequency of ‘A is true given B is true’.

  • The Jaccard index [117]. Given two Boolean features A and B in a dataset, the Jaccard index is the ratio between the number of samples where both A and B are true (also called the intersection) and the number of samples where A or B is true (also called the union).
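Pearson’s correlation and the Jaccard index can be computed in a few lines of Python; the toy feature vectors below are illustrative:

```python
from math import sqrt

def pearson(x, y):
    # covariance over the product of the standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

def jaccard(a, b):
    # intersection count over union count for two Boolean features
    inter = sum(1 for u, v in zip(a, b) if u and v)
    union = sum(1 for u, v in zip(a, b) if u or v)
    return inter / union
```

A perfectly linear pair of features yields a Pearson correlation of 1, while two Boolean features sharing one of three "true" samples yield a Jaccard index of 1/3.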

Security, Privacy, and Ethical Considerations (see also Chap. 18)

In practice, the AI scientist must consider the following ethical points, according to Safdar et al. [118]:

  • Population bias. This happens when the sociodemographic composition of a research or training dataset does not reflect that of the study population. Rare demographic and ethnic groups are often under-sampled.

  • Data ownership. Translational bioinformatics data is often derived from human subjects. Therefore, consent from study participants, which allows using their data for research and tool development, must be obtained.

  • Privacy protection. Human subject identifiable information must be removed before analyzing and publicly releasing the data for research.

Team Data Science Infrastructure

AI researchers in translational bioinformatics use publicly available big data and analytics infrastructures to accelerate their research. These infrastructures not only support scaling to large data sets, but also installation of popular analytic tools. Some examples are as follows:

  • UAB UBRITE (https://ubrite.org/) integrates many machine learning programming libraries, which can be activated in R/Python scripts.

  • Google Colab (https://colab.research.google.com/notebooks/) has a free collection of pre-built machine learning Jupyter notebooks [119] available for reuse. Community users may slightly modify these notebooks for specific projects, and freely run a notebook for up to 12 hours using Google Cloud computing resources.

Applications of AI in Translational Bioinformatics

Improving Translational Bioinformatics Data Infrastructure

Experimental validation and manual curation, which are the most reliable approaches to constructing translational bioinformatics databases, are time-consuming and resource-costly. Therefore, AI methods are utilized to infer novel information and enrich these databases. For example, in the STRING database [27], text mining is the channel contributing the broadest coverage. The number of protein-protein interactions (PPIs) for human proteins in STRING is approximately 2.9 million [13] (though some of these may be inaccurate on account of natural language processing errors); meanwhile, BioGRID [120], which is among the largest experimentally validated and manually curated PPI collections, has only 268,599 PPIs. As another example, JASPAR [121] uses a hidden Markov model [122, 123] to predict 337 of its 1964 transcription factor–target interaction profiles.

Inferring Pairwise Molecular Regulation

Many biological research areas require understanding and predicting regulatory networks to provide clear insight into living cells’ cellular processes [124]. For example, in injury response, regulation from G Protein-coupled receptors and their interactees is crucial to DNA damage response [125]. In another example, the transcription factor c-JUN promotes and maintains the expression level of CCND1, which is required in cell cycle progression [126]. Thus, discovering the gene regulatory network can enrich the translational bioinformatics database (see section on “Improving translational bioinformatics data infrastructure”), and inform the design of targeted therapies [127]. On the other hand, the number of possible regulatory pairs is too large to be fully validated by biological experiments, and many have not yet been discovered. This explains why predictive methods to infer molecular regulation, especially transcription factor—target and ligand—receptor pairs, are still an active research area.

Transcription factors are a set of DNA-binding proteins, and the genes at the DNA locations where the transcription factors bind are their targets [128]. The expression of the genes encoded around the binding sites is significantly up- or down-regulated by the transcription factor. Therefore, the focus of AI in predicting transcription factor–target relationships is to identify the binding sites, which may number in the tens of thousands [128], of a transcription factor. Furthermore, the prediction must account for two complications: one binding event may control multiple target genes, and one gene may be targeted by multiple binding events. AI tools use chromatin immunoprecipitation followed by sequencing (ChIP-seq) data and expression data as inputs for prediction. IM-PET [129], RIPPLE [130], and PETModule [131] predict transcription-factor targets using the random forest algorithm. Here, the ChIP-seq data is processed to obtain the distance between the binding/enhancer sites and the targets’ DNA coding regions. In addition, deep-learning-based tools, such as DeepTFactor [132], TBiNet [133], and scFAN [134], focus on precisely predicting transcription factor binding sites. Their output can be provided to other AI methods to infer the targets.

A ligand is a substance that forms a complex with a cell surface protein (called a receptor) and thereby triggers a series of cellular signaling events [135]. These signaling events respond to the stimulus that produces the ligand. For example, in natural skin wounds, fibroblasts release the WNT5A ligand; this ligand binds to FZD1/2 receptors on the skin cell and activates the WNT signaling pathway [136], which promotes skin cell proliferation and helps heal the wound [137]. In drug discovery, after selecting which signaling pathways and related receptors to activate, the next task is designing an artificial ligand that can bind to the receptor. Computing the ligand-receptor affinity is an important task before testing the artificial ligand. Recently, machine-learning-based models have been shown to outperform non-machine-learning methods on this task [138, 139].

Table 14.2 summarizes the commonly used AI tools described in this section.

Table 14.2 Summary of AI tools used in inferring pairwise molecular regulation

Inferring and Characterizing Cellular Signaling Mechanism that Determines the Cellular Response

Identifying and characterizing signaling mechanisms is essential in complex phenotype research because the disease outcomes concerned involve many genes interacting and responding to one another in response to an external stimulus [140]. For each phenotype, the highly expressed genes and their interactions are identified in in-vitro experiments and annotated as a “pathway”. According to KEGG [23], at the time of this writing, 543 pathways have been well defined and annotated across all species included in the database. Adding species-specific pathways, Reactome [24] reports that the number may rise to 999. Extending beyond pathways, Gene Ontology [22] groups and characterizes 44,945 ontology terms, where each term concerns a set of genes that participate in a biological process, are located in the same cellular location, and/or share the same molecular profile. Among this large number of pathways and ontologies, the most frequently investigated ones often regulate cell proliferation, cell apoptosis, and cell differentiation. Understanding the mechanisms regulating these processes helps to explain the progression of, and infer potential treatments for, diseases with some of the highest mortality rates: cardiovascular diseases and cancers.

AI can support this research area by answering three basic questions. The first question concerns how to identify which pathways are involved in disease progression. Here, feature selection methods [141, 142] can identify highly differentiated genes between healthy controls and subjects with a disease of interest (the identified genes can be considered biomarkers). Then, applying pathway analysis techniques [143] to the biomarkers can yield a list of pathways involved in the disease. The second question concerns how to identify the “master regulator”, or the “origin”, of the perturbed signaling mechanisms; among highly interconnected genes, many perturbed gene signals are merely responses to other genes. The third question concerns how to find a therapeutic target: a targetable gene whose modulation yields the greatest reduction in disease progression. These three questions can be answered using AI-based system simulations.

Before deep learning, the state-of-the-art approaches in this area focused on representing the interaction network among the pathway genes as a mathematical equation system, and then solving the system by either logic programming or dynamic differential equations [144, 145]. In logic programming, the interaction between two (or more) genes is represented by a combination of logic gates. Since logic gates are finite and deterministic, the simulation is straightforward. A limitation of logic gates is that they cannot represent feedback loops or two-way gene-gene interactions. Feedback loops are critical to maintaining a stable environment inside a living subject. For example, in wound healing, after the initial platelets respond to the wound, these platelets release adenosine diphosphate (ADP); the ADP binds to P2Y1 and P2Y12 receptors to activate more platelets; more platelets produce more ADP to continue this activation loop until the wound surface is completely covered, preventing further blood loss [146]. Dynamic differential equations can overcome this limitation. In principle, dynamic differential equations discretize the system into multiple time points, define the dynamic equation for each gene’s expression at each time point, and calculate gene expression over a sequence of time points.
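The logic-gate representation described above can be sketched in Python: because the gates are deterministic, simulation reduces to repeatedly evaluating them. The three-gene network below (A is a constant input activating B, and C is the AND of A and B) is invented for illustration:

```python
# A three-gene toy network: A is a constant input, B' = A, and C' = A AND B.
# All gates are evaluated synchronously at each step; genes are illustrative.
def step(state):
    a, b, c = state
    return (a,          # A: external input, held constant
            a,          # B' = A
            a and b)    # C' = A AND B

state = (True, False, False)
trajectory = [state]
for _ in range(3):
    state = step(state)
    trajectory.append(state)
```

Starting from (A on, B off, C off), the network switches B on in one step, C on in the next, and then reaches a steady state; note that this acyclic cascade has no feedback loop, consistent with the limitation noted above.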

Below is a simple example of how to simulate a two-gene system using dynamic equations. The system has two proteins (objects), denoted PA and PB, with initial values S0 = (−1, 0) for PA (strongly inhibited) and PB (normal), respectively. PA up-regulates PB while PB down-regulates PA in a negative-feedback loop [147], as in Fig. 14.4. These interactions can be represented by the system matrix M = \( \left[\begin{array}{cc}0& 1\\ {}-1& 0\end{array}\right] \). Suppose the discrete dynamic equation, which takes both the initial values and the system matrix, is as follows (this equation is related to the equation underlying the PageRank algorithm [148]):

$$ {S}_t=0.15\times {S}_0+0.85\times {\textbf{M}}^T{S}_{t-1} $$
Fig. 14.4

The system (top) and simulation result when M = [0 1;-1 0]

The following Matlab code, reconstructed here from the equation and parameter values above, shows how to implement this system:

  % Initial signal values: PA strongly inhibited, PB normal
  S0 = [-1; 0];
  % System matrix: PA up-regulates PB; PB down-regulates PA
  M = [0 1; -1 0];
  S = S0;
  nIter = 50;                      % number of iterations to simulate
  history = zeros(2, nIter);
  for t = 1:nIter
      % the discrete dynamic equation
      S = 0.15*S0 + 0.85*M'*S;
      history(:, t) = S;
  end
  plot(1:nIter, history');         % PA and PB signals over the iterations
  legend('PA', 'PB');

The result in Fig. 14.4 shows that over time, PA and PB converge to (−0.15, −0.13), respectively. This system is closely balanced (S ≈ 0), suggesting that the initial inhibition of PA can balance the system. Figure 14.5 shows that the system results are completely different if we make a slight change to the model parameters by setting M = \( \left[\begin{array}{cc}0& -1\\ {}1& 0\end{array}\right] \), such that PA down-regulates PB, while PB up-regulates PA.

Fig. 14.5

The system (top) and simulation result when M = [0 -1; 1 0]

With deep learning approaches, signaling interactions can be used to construct and train a deep learning architecture. For example, Ma et al. describe a deep learning approach to simulate cell proliferation, in which the learning architecture is organized according to the proliferation-related gene ontology hierarchy instead of a conventional convolutional structure [149].

The AI tools described in this section are summarized in Table 14.3.

Table 14.3 Summary of AI tools for inferring and characterizing cellular signaling mechanisms

Identifying and Characterizing New Cell Types and Subtypes

Single-cell transcriptomics technologies enable measuring genetic information at the individual cell resolution [151], which is the “building-block” level of all living organisms. Single-cell transcriptomic data also present new questions, and require new analytic techniques that are not available in bulk transcriptomics. First, does the data present novel cell populations that have not been studied due to the limitations of bulk technologies, especially of the stem and progenitor cell types? Second, for signaling pathways that function differently in different cells of the same cell type and in the same tissue, how might we quantify the signaling activity within each cell? AI techniques are essential to answer these questions.

To answer the first question, state-of-the-art single-cell analytic tools [152, 153] apply clustering algorithms to partition the entire cell dataset. In each cell cluster, genes specifically expressed in that cluster are queried against the cell-type canonical marker literature, such as the CellMarker database [154], to determine which cell type the cluster corresponds to. For the clustering step, density-based clustering [155] and Louvain clustering [156] are the most popular methods. Also, embedding methods, such as tSNE [93] and UMAP [157], are often used to visualize the cell clusters. In many single-cell datasets [158,159,160], small clusters appear that highly express canonical markers from more than one cell type. These small clusters need careful examination because they could either be technical errors, such as doublets [161], or may indeed represent a new cell population.

To answer the second question, the major challenge is the potential for missing values in single-cell data, which is called the dropout effect [162]. The best contemporary single-cell transcriptomic techniques may achieve 6500 genes per cell [163], which only covers 25–30% of the human genome. Consequently, single-cell gene expression data often have a high proportion of zero values. Here, a zero can either mean the gene does not express in the cell, or that the gene does express, but the sequencing step did not capture this expression. To tackle the dropout effect, single-cell pathway analysis may employ machine learning to quantify pathway activity. This requires choosing “positive” cells, in which the pathway is known to function, and “negative” cells in which it is known not to function or to express at a very low level. For example, fetal and adult cardiomyocytes are excellent “positive” and “negative” cells for cell cycle signaling pathways. The pathway genes can be used as features to build a classifier distinguishing between the “positive” and the “negative” cells. Here, the “positive” cells should have a high model score and vice versa. Then, the model can be applied to analyze the pathway activity in other cells.
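The positive/negative-cell idea can be sketched with a simple nearest-centroid score over the pathway genes (a stand-in for the classifiers used in practice); the expression vectors below are invented for illustration:

```python
from math import sqrt

# Pathway genes are the features; "positive" cells are known to have the
# pathway active and "negative" cells known inactive. Vectors are invented.
positive_cells = [[5.0, 4.0, 6.0], [6.0, 5.0, 5.0]]
negative_cells = [[0.5, 1.0, 0.0], [1.0, 0.0, 0.5]]

def centroid(cells):
    return [sum(values) / len(cells) for values in zip(*cells)]

def distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def activity_score(cell):
    # 1.0 = resembles the positive cells, 0.0 = resembles the negative cells
    d_pos = distance(cell, centroid(positive_cells))
    d_neg = distance(cell, centroid(negative_cells))
    return d_neg / (d_pos + d_neg)
```

A new cell's pathway activity is then scored by its relative proximity to the positive centroid; cells resembling the "positive" group score near 1.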

The AI tools in this section are summarized in Table 14.4.

Table 14.4 Summary of AI tools used for identifying and characterizing new cell types and subtypes. C, P and G indicate Clinotypic, Phenotypic and Genotypic data respectively

Drug Repurposing

Drug repurposing, briefly, is applying an approved or investigational drug to treat a new disease [104]. In principle, drug repurposing can be conducted by calculating similarity. If two drugs A and B are highly similar, and A is approved to treat disease C, then B may also be used to treat C. Similarly, if two proteins D and E are highly similar, and A targets D, then A may also target E. Thus, drug repurposing includes many sub-problems for which machine learning techniques can be promising solutions.

Generating and mining a drug-drug similarity network. In this problem, each drug is represented by a vector. The drug vector represents structural chemical information, known drug-protein interactions, and information about the drug’s side effect [164]. The drug-drug pairwise similarity matrix, or network, is computed from a matrix containing vectors for all the drugs under consideration. Then, applying matrix factorization [165,166,167] results in drug clusters. Here, drugs sharing the same cluster are more likely to be repurposed for each other’s diseases.
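The drug-vector and similarity-network construction can be sketched in Python with cosine similarity; the drug names and binary feature vectors are invented for illustration:

```python
from math import sqrt

# Each drug vector concatenates (hypothetical) structural, target, and
# side-effect features; similarity is the cosine between vectors.
drugs = {
    "drugA": [1, 0, 1, 1, 0],
    "drugB": [1, 0, 1, 0, 0],
    "drugC": [0, 1, 0, 0, 1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# the pairwise similarity matrix (network) over all drugs
similarity = {(u, v): cosine(drugs[u], drugs[v]) for u in drugs for v in drugs}
```

In this toy network, drugA and drugB share features and would likely land in the same cluster, making them mutual repurposing candidates, while drugC would not.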

Mining a target-target similarity network. Similarly to mining the drug-drug similarity network, machine learning techniques can be applied to cluster the target-target network [168, 169]. Then, a drug targeting one gene may be repurposed to target the other genes in the same cluster.

Mining a bi-partite drug-target network. Here, the drug-drug similarity, target-target interaction, and drug-target networks are co-factorized [170, 171].

Model-based simulation. This approach utilizes dynamic modeling, as discussed in the section on inferring and characterizing cellular signaling mechanisms. The approach requires the following definitions. First, the initial disease condition is represented by a non-zero vector of gene expression values (S0); the computational goal is to reach S = 0, which represents the non-disease state. Second, the treatment is also characterized by a vector u. Usually, u has the same dimension as S, and the dimensions of u and S correspond to each other. Third, the gene-gene interactions and signaling mechanisms dynamically change the expression vector, yielding the equation St = F(St−1) or, with treatment, St = F(St−1, u). Then, there are two options to score a repurposing candidate. First, all drug treatments can be simulated and ranked using the recursive system St = F(St−1, u). Second, applying the system control approach [172] yields a “hypothetical treatment” that optimally returns S = 0; the hypothetical treatment can then be used as a template to match against real drug treatments to select the repurposing candidates.
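A minimal Python sketch of the simulation-and-ranking option is shown below, reusing the two-gene dynamic system from the earlier section. How the treatment vector u enters the update, and the candidate treatments themselves, are assumptions made here for illustration:

```python
# Reuses the two-gene system S_t = 0.15*S0 + 0.85*M^T*S_{t-1}, with a
# candidate treatment u added at each step (an assumed way for u to enter);
# candidates are ranked by how close the final state is to S = 0.
S0 = [-1.0, 0.0]
M = [[0, 1], [-1, 0]]        # PA up-regulates PB; PB down-regulates PA

def simulate(u, iters=200):
    S = list(S0)
    for _ in range(iters):
        MtS = [M[0][0] * S[0] + M[1][0] * S[1],   # (M^T S)_1
               M[0][1] * S[0] + M[1][1] * S[1]]   # (M^T S)_2
        S = [0.15 * S0[i] + 0.85 * MtS[i] + u[i] for i in range(2)]
    return S

def norm(S):
    return sum(x * x for x in S) ** 0.5

# hypothetical repurposing candidates, represented by treatment vectors
candidates = {"no_treatment": (0.0, 0.0), "candidate_1": (0.15, 0.13)}
ranked = sorted(candidates, key=lambda name: norm(simulate(candidates[name])))
```

The candidate whose simulated steady state is closest to S = 0 ranks first, which is the scoring criterion described above.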

The AI tools discussed in this section are presented in Table 14.5.

Table 14.5 Summary of AI tools in drug repurposing

Supporting Clinical Decisions with Bioinformatics Analysis

In genetic diseases caused by a single genetic disorder [174], mutation-analysis protocols can be used for diagnosis directly. Meanwhile, in complex diseases, bioinformatics analysis is applied case by case. For example, in work by Kim et al. [36], single-cell transcriptomic analysis, which used AI methods for clustering, detected that a JAK-STAT signaling pathway disruption was the cause of severe hypersensitivity syndrome/drug reaction in a patient. Therefore, tofacitinib, a JAK-STAT inhibitor, was selected and successfully treated the patient, although the drug is not indicated to treat hypersensitivity syndrome or drug reactions in general.

In cancer, the patient-derived xenograft (PDX) platform [175] is another direction for bioinformatics protocols to support clinical decision-making. Briefly, PDX is a technique to host a patient tumor biosample in a mouse. Cancer researchers can perform many interventions in the mouse and observe the clinical outcomes, such as survival and speed of tumor growth. Given a sufficiently large number of PDX samples with experimental results, called a PDX catalog, a new patient tumor biosample can be mapped to the PDX catalog. Here, the experimental outcomes of the most closely mapped PDX samples can be used to infer the patient’s likely clinical outcome under different treatment decisions, fulfilling the role of clinical decision support. Due to the heterogeneity of biomedical samples, designing the mapping algorithm for this application is a significant challenge that may require advanced AI techniques [176].

Predicting side effects is another clinical application where AI-based translational bioinformatics methods are potentially helpful. In pre-clinical applications, similarly to drug repurposing, predicting side effects relies on mining drug-drug similarity [177, 178]. In the clinical setting, the principle involves mining past side effects recorded in patients’ medical records. Thus, the side-effect analysis is provider-specific and customized to the medical record data infrastructure, as in the work of Sohn et al. [179]. Here, itemset (drug – side effect) mining and rule mining are standard AI methods used to predict drugs’ side effects.

Predicting Complex Biochemical Structures

After finding the target gene and other genetic causes of a disease, the next cornerstones in drug discovery are (i) representing the physical structure of the target protein (the protein encoded by the target gene) and (ii) representing the physical structure of the chemical.

Representing protein physical structure involves reconstructing the 3D arrangement of the atoms in each amino acid, given the sequential order of the amino acids on the protein polypeptide chain [180]. The sequential order of the amino acids, also called the protein primary structure [177], identifies the protein; determining it is always the first step in studying a protein and can easily be done with today's protein sequencing techniques [181]. Meanwhile, a protein's functions and its interactions with other proteins and chemicals are primarily determined by its higher-level 3D structures. Inferring the protein 3D structure from the primary structure has been a grand challenge for decades [182]. Before deep learning, machine learning techniques were used to solve some protein structure subproblems, such as predicting pairwise distances among the amino acids in 3D [179] and predicting the 3D structure class [183]. Recent deep-learning-based methods can predict the 3D positions of the protein atoms, which is a more challenging problem. Most recently, AlphaFold directly predicts the 3D coordinates of all heavy atoms of a given protein from the primary amino acid sequence [184]. AlphaFold showed significantly superior performance over other methods in the 14th Critical Assessment of Techniques for Protein Structure Prediction challenge [185], marking a major breakthrough of AI in one of the cornerstone problems of chemical biology.
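To make the pairwise-distance representation mentioned above concrete, the snippet below computes a residue-residue distance map from a set of 3D coordinates (a toy, hypothetical four-residue chain). Pre-deep-learning predictors typically targeted this matrix rather than raw coordinates, because it is invariant to rotation and translation of the protein.

```python
import numpy as np

def distance_map(coords):
    """Pairwise Euclidean distance matrix between residue coordinates.

    coords: (n_residues, 3) array, e.g. C-alpha positions. The result
    is a symmetric matrix with zeros on the diagonal."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy 4-residue chain laid out on a line, 3.8 angstroms apart
# (a typical consecutive C-alpha spacing).
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0],
                   [11.4, 0.0, 0.0]])
D = distance_map(coords)
print(D[0, 3])  # distance between the first and last residue: 11.4
```

Recovering 3D coordinates consistent with a predicted distance map is itself an optimization problem; deep models such as AlphaFold sidestep it by predicting coordinates directly.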

Representing chemical structure, also called ligand structure or drug structure, is in some sense a reverse-engineering problem relative to representing protein physical structure. Given a 3D physical structure, the objective is to find the arrangement, or order, of atoms that can create that 3D structure [186]. Here, the chemical 3D structure is defined such that the chemical can bind to the targeted protein 3D structure [187]. Solved manually by chemical engineering experts, this problem often takes years to complete [188]. Recently, it has been shown that deep learning can significantly accelerate the process. For example, GENTRL [189] designed new chemicals binding to DDR1 within just 23 days. GENTRL applied a deep-learning autoencoder [190], a technique that learns to compress datapoints into a low-dimensional latent representation and to decode that representation back into the original data space; new datapoints resembling the originals can then be synthesized by decoding novel points in the latent space. The GENTRL deep autoencoder was trained on more than 200 million 3D chemical structures in the ZINC v.15 database [191]. Then, to find new chemicals binding to DDR1, GENTRL took the existing DDR1 inhibitors, which can be found in the ChEMBL database [192], as input and synthesized ‘similar’ compounds using the autoencoder.
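To make the autoencoder idea concrete, here is a deliberately minimal sketch: a linear autoencoder trained by gradient descent on toy 4-dimensional "molecule descriptors" (all data hypothetical), which compresses each point to a 1-dimensional latent code and synthesizes a new point by perturbing and decoding a code. Real systems such as GENTRL use deep, variational architectures over molecular representations, but the compress-decode-perturb loop is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy descriptor data: 200 points near a 1-D line in 4-D space.
t = rng.normal(size=(200, 1))
X = t @ np.array([[1.0, 0.5, -0.5, 2.0]]) + 0.05 * rng.normal(size=(200, 4))

# Linear autoencoder: encode 4-D -> 1-D latent, decode back to 4-D.
W_enc = rng.normal(scale=0.1, size=(4, 1))
W_dec = rng.normal(scale=0.1, size=(1, 4))
lr = 0.01
for _ in range(2000):
    Z = X @ W_enc              # latent codes
    X_hat = Z @ W_dec          # reconstructions
    err = X_hat - X
    # Gradient descent on the mean squared reconstruction error.
    W_dec -= lr * (Z.T @ err) / len(X)
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)

mse = float(((X @ W_enc @ W_dec - X) ** 2).mean())
print("reconstruction MSE:", round(mse, 4))

# "Generate" a new datapoint: perturb an existing latent code and decode it.
z_new = (X[0] @ W_enc) + 0.1
x_new = z_new @ W_dec
print("synthetic point:", np.round(x_new, 2))
```

Because the decoder only ever saw codes of realistic datapoints, decoding a nearby latent code yields an output that resembles, but does not duplicate, the training data, which is exactly what molecular generation needs.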

Trends and Outlook

In pre-clinical research, future translational bioinformatics research will likely pivot around current and forthcoming breakthroughs in biotechnology. At the time this chapter is published, single-cell -omics, which measures the molecular environment inside individual cells, and the patient-derived xenograft, which allows hosting a patient’s biosample in a living organism, will be the major platforms for new AI translational bioinformatics techniques. Key problems that require new and further technical development are:

  • Identifying and characterizing small but novel cell types from single-cell omics data. Stem, progenitor, and highly proliferative cells are often the main focus because they are key to treating two commonly fatal disease classes: cardiovascular disease and cancer. In cardiovascular disease, where damage is compounded by cardiac tissue’s low regenerative capacity, the goal is to promote cell proliferation [193]. In cancer, by contrast, the goal is to restrain the proliferative cells.

  • Characterizing the tissue microenvironment, which significantly contributes to the growth and survivability of the tissue. Single-cell data allow observing and separating the main tissue cell types, such as neural cells in brain cancers, from environmental cell types, such as immune cells and fibroblasts. Cancer immunotherapies [194] are examples of how the microenvironment can impact tissues. Key questions concern the molecular and signaling mechanisms by which the microenvironment stimulates the tissue cell types, and which signaling pathways the tissue cells activate in response to the stimulus.

  • Characterizing cell differentiation, which helps explain tissue regeneration and tumor recurrence. Analyzing time-series single-cell data is key to approaching this problem.
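A minimal sketch of the first problem above, identifying a small cell population by clustering: the toy data below use only two genes and plain k-means with a deterministic spread-out initialization, whereas real single-cell pipelines cluster thousands of genes after dimensionality reduction and use more robust algorithms.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy single-cell expression matrix (cells x 2 genes): a large
# differentiated population and a small stem-like population.
major = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(190, 2))
rare = rng.normal(loc=[8.0, 8.0], scale=0.3, size=(10, 2))
X = np.vstack([major, rare])

def kmeans2(X, n_iter=20):
    """Plain Lloyd's algorithm for k=2, initialized at the data's
    extreme corners so the two centroids start far apart."""
    centroids = np.array([X.min(axis=0), X.max(axis=0)])
    for _ in range(n_iter):
        # Assign each cell to its nearest centroid, then recompute means.
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    return labels, centroids

labels, centroids = kmeans2(X)
print("cluster sizes:", sorted(np.bincount(labels).tolist()))  # [10, 190]
```

The smaller cluster is the candidate rare cell type; in practice it would then be characterized by its marker genes and validated against reference atlases.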

Meanwhile, utilizing deep-learning-based models, especially autoencoders, will be the main focus in virtual docking, molecular design, and systems biology simulation. In this area, the keys to successful AI applications include reducing overfitting, enlarging the molecular datasets, and setting up a gold-standard validation scheme [195].

To make a higher impact in clinical applications, AI translational bioinformatics still requires substantial effort to build integrative multi-omics, clinome, and phenotype data infrastructure. In this respect, the All of Us research program [196] is a pioneering project. While waiting for this infrastructure to emerge, AI researchers can solve smaller-scale research problems, such as predicting patients’ risk from clinome data and linking omics and clinome data via text mining. Also, as shown in the work of Jensen et al. [27], AI can already play a significant role in clinical decision support under clinical experts’ guidance in specific cases.

Questions for Discussion

  • What are three types of translational bioinformatics data?

  • Clinotype-genotype association has not been well explored in translational bioinformatics. Sketch a research strategy to mine this type of association using Natural Language Processing (NLP - Chap. 7).

    Hint: find a catalog of clinotype terms; apply NLP tools to the PubMed collection of abstracts.

  • In the section on “Clustering Algorithms”, matrix factorization, derive the formula for \( \frac{\partial F}{\partial \textbf{Q}} \) using the method in Eqs. (14.2)–(14.6).

  • The section on “Inferring and Characterizing Cellular Signaling Mechanism that Determines the Cellular Response” shows that when the system has negative feedback, the signal oscillates. This is a well-known phenomenon in systems biology modeling. Show this phenomenon again in the following system: M = \( \left[\begin{array}{cc}-1& 1\\ {}0& 0\end{array}\right] \), \( S_t = 0.15\times S_0 + 0.85\times M^T S_{t-1} \), \( S_0 = \left(-1, 0\right) \). Draw the system diagram and the PA, PB signals as in Figs. 14.3 and 14.4. What happens with \( S_0 = \left(1, 0\right) \)?

  • In the section on “Supporting Clinical Decision with Bioinformatics Analysis”, mapping new patients’ tumor expression (NPT) data to existing patients’ tumor (EPT) expression data may help predict clinical outcomes. Two data processing methods are proposed to map the NPT to the EPT. One way to select the better method is to plot the combined NPT-EPT embedding after processing the data. Recall that an embedding preserves the pairwise similarity of the original data space in the embedded space. The embedding visualization is in Fig. 14.6 (below). Which processing method is better, and why?

Fig. 14.6

Two scatter plots of NPT and EPT datapoints, showing the combined NPT-EPT embedding under two data processing methods: plot (a) visualizes the two data types intermixed, and plot (b) visualizes them separated.

(for question 3). NPT-EPT-combined embedding visualization with two data processing methods. (a) Method 1. (b) Method 2
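For the signaling-system question above, the scaffold below iterates the recurrence \( S_t = 0.15\times S_0 + 0.85\times M^T S_{t-1} \) so the PA, PB trajectories can be inspected or plotted; it sets up the experiment without answering the question.

```python
import numpy as np

# System matrix from the exercise.
M = np.array([[-1.0, 1.0],
              [0.0, 0.0]])

def simulate(S0, steps=30):
    """Iterate S_t = 0.15*S0 + 0.85*M^T S_{t-1}, returning the
    trajectory as a (steps+1, 2) array of (PA, PB) values."""
    S0 = np.asarray(S0, dtype=float)
    S, trajectory = S0.copy(), [S0.copy()]
    for _ in range(steps):
        S = 0.15 * S0 + 0.85 * (M.T @ S)
        trajectory.append(S.copy())
    return np.array(trajectory)

traj = simulate([-1.0, 0.0])
print(traj[:4])  # first few (PA, PB) states; try S0 = (1, 0) as well
```

Plotting the first column of `traj` against the step index reproduces the kind of signal curve shown in Fig. 14.4.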

Further Reading

Russell, S., & Norvig, P. (2009). Artificial intelligence: a modern approach (3rd ed.). Prentice Hall. ISBN 0-13-604259-7.

  • This book comprehensively covers AI theories, principles, algorithms, and techniques. This is considered a leading AI textbook. The student may find and read the 2nd edition as well.

Wei, D. Q., Ma, Y., Cho, W. C., Xu, Q., & Zhou, F. (Eds.). (2017). Translational Bioinformatics and Its Application. Springer.

  • This is the most up-to-date and comprehensive textbook about translational bioinformatics.

Tenenbaum, J. D. (2016). Translational bioinformatics: past, present, and future. Genomics, proteomics & bioinformatics, 14(1), 31–41.

  • This review summarizes the most prominent roles of translational bioinformatics, and provides a perspective of how the field may further improve health care.

Larranaga, P., Calvo, B., Santana, R., Bielza, C., Galdiano, J., Inza, I., ... & Robles, V. (2006). Machine learning in bioinformatics. Briefings in bioinformatics, 7(1), 86–112.

  • This review describes the most popular AI algorithms in bioinformatics at the turn of the century.

Ciaburro, G. (2017). MATLAB for machine learning. Packt Publishing Ltd.

  • Licensing cost notwithstanding (Matlab is a commercial product), Matlab is without a doubt one of the most learner-friendly platforms for learning, practicing, and implementing AI in many fields.

Lantz, B. (2013). Machine learning with R. Packt publishing ltd.

  • For open-source programming, R is a good platform for practicing AI. Compared to Matlab, it trades some convenience and speed for lower cost. Many translational bioinformatics algorithms have publicly available R implementations.

Raschka, S. (2015). Python machine learning. Packt publishing ltd.

  • For open-source programming, Python is a good platform to practice AI, and provides ready integration with a range of powerful publicly available machine learning libraries.