After reading this chapter, you should know the answers to these questions:
  • Which key translational bioinformatics problem are AI methods positioned to solve?

  • What principles would guide your choice of which AI techniques and tools to apply to a translational bioinformatics problem?

  • What are some important “-omic” databases that can be used to interpret and validate translational bioinformatics related machine learning results from the biomedical perspective?

Introduction and Concepts

The field of translational bioinformatics is concerned with the development of storage, analytic, and interpretive methods to optimize the transformation of increasingly voluminous biomedical data and genomic data into proactive, predictive, preventive, and participatory health. Translational bioinformatics includes research on the development of novel techniques for the integration of biological and clinical data and the evolution of clinical informatics methodology to encompass biological observations. The end product of translational bioinformatics is newly found knowledge from these integrative efforts that can be disseminated to various stakeholders, including biomedical scientists, clinicians, and patients [1].

Voluminous data, often understood as data volumes of a few gigabytes and beyond, can arise from:

  • A large proportion of irrelevant information in the data. For example, bulk RNA sequencing data for just one sample is more than ten gigabytes before compression and a few gigabytes after compression [2]. However, the ratio of exon reads, which are used in later analysis, over intron reads, which are not used, is small (it is expected to be 1/24 [3]).

  • A large number of samples, also called datapoints (each of which may represent a subject), in the data. For example, the Nationwide Emergency Department Sample data [4] contains more than 30 million subjects but only slightly more than 100 features.

  • A large number of features. For example, human gene expression data have 20,412 protein-encoding genes, 14,600 pseudogenes, 14,727 long non-coding RNAs, and 5037 small non-coding RNAs. Some datasets may have large numbers of both samples and features. For example, single-cell RNA sequencing data may contain data from hundreds of thousands of cells [5], capturing the whole genome.

A Brief History of Translational Bioinformatics

Translational bioinformatics is a relatively young field. According to Ouzounis [6], translational bioinformatics started in 1996. In the beginning, the field primarily researched how to organize biomedical data and build an ontology system improving the interpretation and searching of biomedical research. After the first version of the human genome project in 2003 [7], genomic analysis was added to translational bioinformatics and continued growing to be a key area in the field. Since 2005 in Europe and 2009 in the United States, programs to widely adopt electronic medical records in patient care and research have been launched [8]. Consequently, large amounts of past clinical data stored in electronic format could readily be used in translational research. This enabled the development of the biomedical informatics component of translational bioinformatics. As translational bioinformatic techniques mature in the areas of genomics and biomedical informatics, they are further adapted to carry out research on other biomedical data, such as microbiome, chemical informatics, and metabolomics data. Today, translational bioinformatics is a multidisciplinary field, extending from the molecular level (genes, proteins, and other molecular entities below the cell) to the population level (collections of living subjects).

Concepts of AI in Translational Bioinformatics

AI in translational bioinformatics covers a broader range of problems than it does in other clinical fields. In clinical practice, the main goal of applying AI is often to complete tasks that used to require manual labor. Some AI applications, such as predicting patient readmission, may perform tasks not typically conducted by human beings. However, producing new knowledge is not required. In translational bioinformatics, besides supporting manual labor, an important goal when using AI is to infer new knowledge, with typical applications including:

  • Association Studies: mining for novel relationships among different biomedical entities.

  • Subtyping and clustering: dividing patients and samples into different groups such that each group may explicitly represent a sub-clinical outcome or a sub-phenotype.

  • Modeling and knowledge representation: mathematically representing the associations and cause-effect relations among different biomedical entities. The representation, in this case, is often in a system of differential equations.

  • Simulation: mathematically representing the changes observed in biomedical subjects by a system of dynamic equations. The system has the general form x(t + 1) = F(x(t), u(t)). Here, x(t) represents the subject at timepoint t, u(t) represents the interference at timepoint t, and x(t + 1) represents the subject at the next timepoint.

  • Spatial visualization: visualizing biomedical datapoints in 2D or 3D space.
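The simulation form above can be made concrete with a toy discrete-time model. In this sketch, x(t) is a hypothetical cell count and u(t) a treatment dose; the logistic-growth step function F and all parameter values are illustrative assumptions, not taken from the chapter:

```python
# Toy discrete-time simulation x(t+1) = F(x(t), u(t)).
# x(t): hypothetical cell count; u(t): treatment dose at timepoint t.

def F(x, u, growth=0.2, capacity=1e6, kill=2.0):
    """One simulation step: logistic growth minus a dose-dependent kill term."""
    return x + growth * x * (1 - x / capacity) - kill * u * x

def simulate(x0, doses):
    """Iterate the system over a schedule of interventions u(0), u(1), ..."""
    xs = [x0]
    for u in doses:
        xs.append(F(xs[-1], u))
    return xs

# Two untreated steps, then three treated steps.
trajectory = simulate(1e4, doses=[0.0, 0.0, 0.3, 0.3, 0.3])
# The count grows while untreated and shrinks once dosing starts.
```

Any such F can be substituted; the point is that the state at each timepoint is computed only from the previous state and the intervention applied at that time.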

Primary Data Categories in Translational Bioinformatics

Genomic Data

This chapter broadly refers to all types of data involving genes, proteins, miRNAs, metabolites, and biological reactions as genomic data, which includes data in both genomic and functional genomic subcategories.

Genomic and other -omic data, as introduced in Chap. 3, refer to the measures, characteristics, and annotations of genes. The original definition of genomics referred only to the study of genes or DNA sequences and their related information [9]. However, the data and research scope in bioinformatics also cover other molecular entities involved in the transcription and translation processes. Therefore, -omic data include:

  • Proteomics is the study of proteins [10].

  • Metabolomics studies the chemicals participating in dynamic metabolic processes [11].

  • Transcriptomics studies the transcription process [12], focusing on RNAs and other molecules with transcription-regulating functions.

In some literature [13, 14], the word “genomics” (or gene) is used interchangeably with “transcriptomics” (RNA) and “proteomics” (protein).

When analyzing translational bioinformatics data, it is important to recognize and categorize the data by measure and resolution. Measure refers to the type of molecular entities that are collected, counted, or observed. The technical terms microarray [15], copy-number variation [16], and mutation only refer to DNA. The technical terms RNA sequencing and transcript count only refer to RNA [17, 18]. The terms western blotting [19], multi-level structure, and protein-protein binding affinity [20] refer only to protein. Each measure and molecular type has its unique physical characteristics; therefore, applying a method built for one measure to another should be very carefully considered. Resolution refers to whether the measures are collected from the tissue (bulk) sample, which is a collection of cells, at the single-cell level, or at the sub-cellular molecular level (i.e., isolated proteins). While bulk and single-cell samples can have the same measures (e.g., transcript count in bulk RNA and single-cell RNA), their numerical characteristics are very different.

The results from analyzing -omic data by researchers from multiple fields are carefully curated and organized into annotated catalogs. The Gene Ontology catalog [21, 22] identifies which genes participate in specific biological processes, belong to specific cellular components, or share a specific molecular function. Pathway catalogs, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [23] and Reactome [24], annotate how groups of genes interact with each other and activate in specific orders to regulate cellular phenotypes and processes in response to external stimuli. Protein catalogs, such as UniProt [25], Protein Data Bank [26], and STRING [27], collect and organize protein structures and interactions. These catalogs are important to interpret the new omic analytic results, highlighting the molecular features that differentiate two or more phenotypes.

Clinomic Data

Clinomic data, also called the “clinotype” [28], refers to the measures and characteristics of a living subject that are useful for medical research and interventions [28, 29]. The major biomedical data categories are diagnosis-related data, laboratory test results, medication data, and medical imaging data [30]. Diagnosis-related data refer to the time, stage, and survivability of a disease or disorder in a subject. Laboratory test results, which are also called biomeasures [31], refer to a subject’s quantifiable observations and the biological material concerned (e.g., blood and urine samples). Because laboratory test results are quantifiable, we can systematically define thresholds and criteria to determine whether a result indicates particular diseases or disorders. Medication data refer to the time and types of interventions, or treatments, received by a subject. Medical imaging data, such as X-rays and functional magnetic resonance images, can be understood as observations from a subject that are not directly quantified. Therefore, interpretation of medical images in research and clinical practice often occurs on a case-by-case basis and depends on physicians’ and researchers’ training experience. In addition, biomedical data may contain other types of data that are specifically collected for particular research and clinical practice scenarios, such as smoking history [32, 33], diet type [34], exercise frequency [35], and drug side-effect history. Also, clinomic data may include genomic data when a subject’s genetic data is used to diagnose and help decide a treatment [36,37,38].

Phenotypic Data

In translational bioinformatics, the basic phenotype definition refers to the diseases and abnormalities affecting each subject, such as breast cancer and diabetes. In the broader context, phenotype refers to the subjects’ categorization and definition as assigned by biomedical experts. In this context, a phenotype definition, such as ‘cell proliferation’ and ‘chemotherapy resistance’, is specific to each research project or clinical trial. To differentiate phenotypic data from clinomic data, we may understand that phenotypic data are directly derived from clinomic observations.

Categorizing AI Applications in Translational Bioinformatics

In treating complex diseases and in precision medicine, linking omics and biomedical data is expected to improve the quality of care [39, 40] over current practice, which relies primarily on clinomic or phenomic data (also referred to as biomedical data) alone. Biomedical data will still be essential to detect and monitor disease progression, tasks for which single-type omics data has not yet shown superiority [39]. Meanwhile, omics data is essential to find driver mutations and their functionalities, prerequisites to understanding their roles in causing a disease. Moreover, even when the major causes of a disease are not exclusively genetic, such as with hypertension, knowing the patients’ omics data may still improve treatment precision [41, 42].

Linking -omics, clinotype, and phenotype data opens new questions that require advanced AI, machine learning, and data mining techniques to resolve. Clinotype-to-clinotype (C2C) association discovery, similar to “omic association” [43, 44], finds the clinotypes that co-occur in subjects’ data and determines whether one clinotype occurrence is likely to precede those of other clinotypes. Discovery of novel clinotype-to-phenotype associations may advance risk assessment beyond currently established disease-specific laboratory markers toward sets of simpler, more cost-effective risk markers. This would allow patients and physicians to take early and preventive actions [45].

From a knowledge discovery perspective, Fig. 14.1 summarizes AI applications in translational bioinformatics according to the data categories we have introduced. Here, three data categories yield six possible types of association. In this figure, genomic-to-clinomic association has not been well-defined (and is therefore not illustrated). The other five types of association are as follows.

Fig. 14.1
An illustration showing the categorization of phenotypes into genotypes and clinotypes with the possible association between the three categories, namely P2P, G2P, P2C, G2G, and C2C.

Categorizing AI Translational Bioinformatics

G2G (Genomic to Genomic)

G2G refers to applications involving finding gene-gene associations using AI techniques. G2G has many sub-problems, which are defined by the gene-gene mechanisms of interaction concerned. Sub-problems that are foci of current research include estimation of protein-protein binding affinity [46], and prediction of the targets of transcription factors [47].

G2P (Genomic to Phenotypic): Genome-Wide Association Studies (GWAS)

The main purpose of GWAS is to find the genetic variants associated with a specific disease or phenotype. According to the GWAS catalog in 2019 [48], there have been 5687 GWAS studies, which list 71,673 variant-phenotype associations computed using statistical methods. AI methods can improve GWAS results by enhancing the statistical power of associations, improving polygenic risk scoring, and ranking gene variants that are strongly associated with a genetic disease [49]. In most AI applications in GWAS, the key step is to build classification models using variant features to differentiate the phenotype, such as disease vs. normal.

From an AI perspective, the GWAS data has the following characteristics:

  • Statistical feature selection is generally applied before using AI methods to analyze the data. However, the statistical methods select the features one by one; thus, they may not address important dependencies among the features.

  • The data is often represented by a binary matrix, which represents whether or not a patient’s genome has a specific variant.

In analyzing GWAS, linear models such as regularized regression and support vector machines (Chap. 6) are widely applied (see for example [50,51,52,53]). Here, model scores, such as the predicted probabilities, can serve as risk scoring, and model coefficients can be used to rank the features. Random forest [54] models can also support these two tasks (see for example [55, 56]). Meanwhile, some research [57, 58] shows that artificial neural network models may have the advantage in risk scoring/classification performance; however, the model architecture is less conducive to feature ranking than other more straightforward approaches such as regression models.
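As a minimal sketch of this workflow, the following hypothetical example fits a regularized logistic regression to a synthetic binary variant matrix, uses predicted probabilities as risk scores, and ranks variants by coefficient magnitude (the data, effect size, and variant count are all invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_subjects, n_variants = 500, 20

# Binary genotype matrix: X[i, j] == 1 if subject i carries variant j.
X = rng.integers(0, 2, size=(n_subjects, n_variants)).astype(float)
# Simulated phenotype driven mainly by variant 0 (purely illustrative).
logit = 2.5 * X[:, 0] - 1.0
y = (rng.random(n_subjects) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

risk_scores = model.predict_proba(X)[:, 1]     # per-subject risk scores
ranking = np.argsort(-np.abs(model.coef_[0]))  # rank variants by |coefficient|
# The causal variant (index 0) should appear at the top of the ranking.
```

In a real study, the model would be evaluated on held-out subjects, and the coefficient-based ranking would feed into the validation resources described earlier in the chapter.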

P2P (Phenotypic to Phenotypic): Identify Disease Genomic Subtypes

Subtyping of complex genomic disease, such as glioblastoma multiforme (GBM) [59], can answer key questions in both pre-clinical studies and clinical practice. These diseases are caused by multiple genetic anomalies and signaling pathway disruptions; therefore, a combination of therapeutic strategies is required to treat them. The purpose of subtyping in this context is to partition the disease into multiple subgroups, and find the explicitly disrupted signaling pathways in each of them. This may allow for customizing the treatment for each group according to the affected pathways. Solving this problem requires clustering and feature selection algorithms in AI. The clustering results reduce the subtyping problem into classification (which imputed subtype does this patient belong to?) and feature selection (which signaling pathways are affected in this imputed subtype?) problems for follow-up analyses. For example, in the TCGA-GBM dataset [59], the clustering was followed by GWAS analysis in each patient group. Here, GWAS mutations defined four GBM subtypes: classical, mesenchymal, proneural, and neural, and 29 subtype-related prognostic markers.
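A toy version of this subtyping workflow, clustering a synthetic expression matrix and then selecting the features that separate the imputed subtypes, might look like the following (the two-subtype structure and the k-means choice are illustrative assumptions, not the TCGA-GBM method):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Toy expression matrix: 100 samples x 50 genes, with two hidden subtypes
# that differ in the mean expression of the first 10 genes.
group_a = rng.normal(0.0, 1.0, size=(50, 50))
group_b = rng.normal(0.0, 1.0, size=(50, 50))
group_b[:, :10] += 4.0
X = np.vstack([group_a, group_b])

# Partition the samples into two imputed subtypes.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Follow-up feature selection: which genes best separate the subtypes?
diff = np.abs(X[labels == 0].mean(axis=0) - X[labels == 1].mean(axis=0))
top_genes = np.argsort(-diff)[:10]
# top_genes should recover the 10 genes that define the hidden subtypes.
```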

P2C (Phenotypic to Clinomic)

The protocol to diagnose a disease, which often consists of a pre-defined set of laboratory tests, is the most commonly used type of phenotype-clinotype association. Other phenotype-clinotype associations, once discovered by AI techniques, are considered novel. For example, in work on the identification of hypertension risk [45], patients’ future hypertension could be predicted with AI using affordable laboratory tests other than blood pressure measurements.

C2C (Clinomic to Clinomic)

From a narrow perspective, clinotype-clinotype association refers to the correlation among laboratory test results. From a broader perspective, this type of association refers to how a specific clinotype result might change, or be predicted, given other clinotype results. For example, in work seeking to derive links between the primary translational informatics data categories [28], a linear model was constructed for each clinotype, using all other clinotypes as input features to predict its value. Here, the model coefficients were used to quantify clinotype-clinotype association.
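A minimal sketch of this C2C idea, with synthetic lab-test values, fits an ordinary least-squares model predicting one clinotype from the others and reads the coefficients as association strengths (the tests and the 0.8 effect are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# Synthetic clinotype table: three lab tests, where test2 depends on test0.
test0 = rng.normal(100.0, 15.0, n)
test1 = rng.normal(50.0, 5.0, n)
test2 = 0.8 * test0 + rng.normal(0.0, 5.0, n)

# Predict test2 from the other clinotypes with ordinary least squares;
# the fitted coefficients quantify clinotype-clinotype association.
A = np.column_stack([test0, test1, np.ones(n)])  # design matrix + intercept
coef, *_ = np.linalg.lstsq(A, test2, rcond=None)
# coef[0] (test0) should be near 0.8; coef[1] (test1) should be near 0.
```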

Informatics Challenges in Translational Bioinformatics

Big Data Characteristics

An understanding of big data characteristics is required to apply AI in translational bioinformatics. This section reviews the characteristics of big data and how each characteristic can impact AI performance.

Volume of Data

Large data size constrains AI performance in translational bioinformatics in two ways. First, because of the associated logistical challenges, large data size may make solving some AI problems impractical without sufficient computer storage and specialized hardware. For example, Quang et al. [60] show an example of a motif discovery problem that may take a computer a few weeks to complete without a big-data-specific GPU. Second, owing to the “curse of dimensionality,” the very large number of features in big data can decrease predictive performance, which is known as the Hughes phenomenon [61]. While dimension reduction may help relieve the curse of dimensionality, it also reduces the interpretability of the AI results, because the components of a reduced-dimensional representation may not map back to individual features. In other words, it leads to inferring less powerful hypotheses from the input features to the predicted output variable.

Veracity of Data

Veracity refers to data quality and, conversely, noise. Noise detection and filtering is a challenge in analyzing data from many fields. However, in translational bioinformatics, differentiating between noise and meaningful but as-yet-unverified novel information makes this challenge more difficult. In biomedical research, the data often contain yet-to-be-discovered information. This information may appear in only a very small percentage of the data; therefore, statistically, it has characteristics similar to noise. For example, single-cell expression data [62] usually show small cell clusters, consisting of less than 5% of the cell population. These clusters may correspond to “doublets” [63], a technical error (real noise), or to stem/progenitor cells (which would constitute critical and novel information). Tackling this challenge requires reasonable noise assumptions and novel hypotheses to emerge from a strong collaboration between the AI analyst and the biomedical experts concerned.

Variability of Data

Data heterogeneity, or variability, has always been among the most difficult challenges in translational bioinformatics [64]. There are many aspects of data heterogeneity. First, the data are of many types, including omics data subtypes and biomedical data subtypes. In this aspect, the data integration strategy and integrative analysis are the keys to overcoming the challenge. Second, the same data type may show significant variability due to the methods and platforms of collection, such as the batch effect [65] seen when single-cell sequencing the same tissue on the 10X and ICell8 RNA-seq platforms [66]. In this aspect, computational mapping across the platforms is critical to the analysis. Third, results with the same data type and the same collection method can still be interpreted differently by different healthcare providers. For example, the normal range for the hematocrit test can be 35–40% or 40–50%, depending on the specific patients and the physicians who analyze the test result [28]. In this aspect, accurate and comprehensive biomedical domain knowledge is required.

Velocity of Data

Velocity refers to how quickly the data must be processed and analyzed, and how quickly results must be produced. In general translational bioinformatics research, velocity is not a major challenge. However, requirements to deliver results on time must be considered when building online tools and clinical decision support. The principal requirement in tackling this challenge is to understand the data management system and hardware infrastructure on which the AI tools are to be deployed (see the section on “Applications of AI in Translational Bioinformatics”).

Social-Economic Bias

Unlike pre-clinical research, clinical translational bioinformatics research uses patients’ clinical information. In this setting, incorporating socioeconomic factors into analyses may be unavoidable. However, predictive models based on such factors raise concerns about algorithmic fairness, with the potential to exacerbate existing socioeconomic disparities (see Chap. 18). Thus, in the data processing pipeline, features strongly associated with socioeconomic factors should be removed.

Domain Knowledge Representation and Interpretability

Using domain knowledge collections to validate findings can improve the interpretability of AI results, which is desirable before making clinical decisions on the basis of these findings. In bioinformatics, the common practice is to use feature extraction methods or infer model-explicit features (biomarkers) that differentiate the sample classes. These features are then forwarded to pathway, gene set, and gene ontology analysis to reveal which biological mechanisms are involved. The concordance between the highlighted mechanisms and the known biology of the samples supports the quality of the analysis. For example, when analyzing fetal mouse heart proliferation data [67], proliferative pathways and gene ontologies, such as cell cycle and cell differentiation, are expected to be enriched. If the proliferative pathways and ontologies are not enriched, while the ‘hypertrophy’ ones are, then the analytic quality could be questionable.

Model Robustness and Quality Control

Sample imbalance is usually the first issue impacting model quality in AI applications in translational bioinformatics. In the pre-clinical setting, the proportion of negative samples, such as healthy or disease-free samples, is often much smaller. In the clinical setting, we often see a very small proportion of positive samples, since the number of patients is expected to be small compared to the population size. Extremely imbalanced data may seriously impair AI models, which are often optimized for accuracy. For example, when the positive sample proportion is only 5%, a naïve “all negative” prediction model yields an accuracy of 95% (very high). However, this model cannot predict positive samples; thus, it has no clinical value. To tackle imbalance, oversampling or undersampling can be applied to create a training dataset with a positive/negative sample ratio that is more balanced than in the whole dataset. In oversampling, a minority-class sample may randomly appear more than once in the training set, as in the Synthetic Minority Over-sampling Technique (SMOTE) [68]. Here, the sample may be slightly perturbed if it is selected more than once, using techniques of data augmentation [69, 70] analogous to those applied to images when training models for computer vision, such as rotations and reflections [71]. Alternatively, in undersampling, only a random subset of the majority-class samples is included in the training set.
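A minimal random-oversampling sketch (simpler than SMOTE, which additionally interpolates synthetic neighbors) might look like this; the 5% positive rate echoes the example above, while the feature values are arbitrary:

```python
import numpy as np

def oversample(X, y, rng):
    """Randomly duplicate minority-class samples until classes are balanced.
    (A plain sketch; SMOTE additionally interpolates synthetic neighbors.)"""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([1] * 5 + [0] * 95)  # 5% positives: highly imbalanced
X_bal, y_bal = oversample(X, y, rng)
# y_bal now contains 95 positives and 95 negatives.
```

Undersampling is the mirror image: keep all minority samples and draw a random subset of the majority class instead.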

The optimization criteria in training AI models should be carefully decided case by case. This involves two choices. The first is the choice or definition of the loss function (see Chap. 6), with commonly applied examples including the mean-square error, L1 loss, hinge loss, or cross-entropy loss. The second is the choice of metric to focus on: maximizing accuracy, AUC, positive-predictive value, or negative-predictive value.

Statistical tests for model robustness. Many AI methods are model-based and assume the data have certain characteristics and follow particular distributions. These assumptions need to be verified before applying the AI methods. The Kolmogorov–Smirnov test (KS test) [72] addresses whether a set of numbers follows a pre-defined distribution, and to what degree two sets of numbers are drawn from the same distribution. Thus, in principle, the KS test or another similarly purposed test should be applied to examine the data before deciding on the AI method. On the other hand, the model result depends on its hyperparameters [73], which must be set before running the AI algorithm; choosing hyperparameters is beyond the scope of the optimization that the AI algorithm itself can achieve. Therefore, post-hoc analyses, such as the Wald test [74] and other model-fitness tests, should be used to test the fitness of the computed parameters (also commonly called the model parameters). To conduct these tests, null model parameters need to be pre-defined; usually, a null model parameter is set to 0, which implies that the parameter plays no role in the model. If the test result is insignificant, meaning the computed model parameters are very similar to the null parameters, then the model may not be robust, and one should choose other algorithm hyperparameters and recompute the model.
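As an illustration of the KS test in this role, the following sketch (with synthetic data) checks one sample against a standard normal distribution and compares two samples against each other using SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(0.0, 1.0, 500)
skewed_data = rng.exponential(1.0, 500)

# One-sample KS test: does each sample follow a standard normal distribution?
p_normal = stats.kstest(normal_data, "norm").pvalue  # expected to be large
p_skewed = stats.kstest(skewed_data, "norm").pvalue  # expected to be tiny

# Two-sample KS test: are the two samples drawn from the same distribution?
p_two = stats.ks_2samp(normal_data, skewed_data).pvalue
```

A large p-value is consistent with the assumed distribution; a tiny one signals that a normality-assuming AI method would be a poor fit for the data.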

Translational Bioinformatics Tools & Infrastructure

The big data characteristics described in the section on “Concepts of AI in Translational Bioinformatics” necessitate efficient, scalable tools and infrastructure components. In this section we will describe some of the key tools and components required to conduct translational bioinformatics analyses.

Extended Data Management Systems

While improving translational bioinformatics data storage may not be the primary research objective of AI in medicine, the developers of AI tools should consider the existing data facilities in order to improve their runtime performance in practice. First, the object-relational database is still the primary translational bioinformatics data structure, with prominent examples including the Mayo Clinic database [75], STRING (a database of protein-protein interactions) [27], and the US National Inpatient Sample database [76]. The relational structure has the advantage of supporting the transformation of the data into any customized data structure used by AI applications. Meanwhile, to improve data retrieval performance, some translational bioinformatics data warehouses choose a non-relational structure for specific data types. Second, when the data are hierarchical or consist primarily of associations among different entities, a non-relational database is adopted. For example, the Reactome biological pathway repository [24] is built upon the Neo4j [77] graph database engine, and Gene Ontology [22] organizes data in hierarchical Extensible Markup Language (XML) files [78]. Other systems may adopt a hybrid structure when the data are extremely large and must be kept in multiple formats; for example, cBioPortal for cancer genomics [79] uses a relational table structure for patient clinical data and a file system for patient omic data. Third, hybrid and distributed data warehouses implement both relational and non-relational infrastructure, such as in CloudBurst [80], BiobankCloud [81], and Hydra [82].

Data Preprocessing Pipelines

Pipelines to Build the Data Matrix

AI techniques view data in matrix format. However, before processing, biomedical data, such as high-throughput sequencing data, are not in this format. Therefore, data-type specific pipelines are required to convert the raw biomedical data to a matrix format. Table 14.1 summarizes the well-known pipelines in translational bioinformatics.

Table 14.1 Popular standard pipelines used in translational bioinformatics

Enhancing the Data Matrix

After formatting the data as a matrix, the matrix must be further processed to remove bias and noise. Choosing which method to use in this step is an ad-hoc decision made on a case-by-case basis. Well-known problems and techniques in data processing are:

  • Dimension reduction: typically used methods include principal component analysis [87] and canonical correlation analysis [88].

  • Data scaling and normalization. For example, gene expression data are assumed to follow the negative binomial distribution [89]. Gene expression analysis packages, such as DESeq2 [90] and SAGE [91], implement negative binomial scaling before applying the main statistical analysis.

  • Batch effect correction [92] is applied when the same type of dataset is generated at different rounds of experiments. Variance due to uncontrolled and random technical issues may appear in the data.

  • Embedded data visualization, such as t-SNE [93] and UMAP [94].
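Several of these steps can be chained in a few lines. The sketch below standardizes a synthetic expression matrix and reduces it with principal component analysis; the matrix dimensions and component count are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Toy expression matrix: 200 samples x 1000 genes (values are synthetic).
X = rng.normal(size=(200, 1000))

# Standardize each feature, then project onto the top principal components.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=10).fit_transform(X_scaled)
# X_reduced has shape (200, 10): one low-dimensional row per sample.
```

The reduced matrix can then be passed to clustering, classification, or an embedding method such as t-SNE or UMAP for visualization.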

Supervised and Unsupervised Learning

Supervised machine learning (see Chap. 6), also called supervised analysis, involves finding a function that reproduces an output from an input. Here, “reproduce” implies that the correct output is already available without using the machine learning function. In biomedicine, supervised analysis often involves creating and fine-tuning computer algorithms and software that can substitute for a human performing a task. Examples of supervised analysis in biomedicine are:

  • Automated detection of a tumor region and estimation of tumor size from a radiological image [95, 96] (see Chap. 12).

  • Identifying cell type from single-cell gene expression data [97].

  • Detecting patients’ chronic disease diagnoses from their general health checkup records [28].

The main purpose of such supervised analysis is to automate and speed up manual tasks. Supervised learning can be implemented by digitizing rules and human knowledge, which is also called rule-based learning [98, 99], or through computation exclusively without such rules or knowledge. Support vector machines [100], linear regression [101], random forests [54], and deep learning methods [102] are well-known fully computational techniques that neither encode nor require human rules and knowledge. Although inferring novel knowledge may be achieved with supervised analysis (for example, by making predictions of drug effects beyond the scope of the training data), this is not usually the main goal when conducting supervised analysis in translational bioinformatics.

Unsupervised learning still finds a function from input to output, but the correct output has not been determined in advance. The main purpose of unsupervised learning in translational bioinformatics is to generate new biomedical knowledge that can be tested and verified in follow-up studies. Some examples of unsupervised learning are:

  • Identifying new disease subtypes from genomic data [28], such as identifying genetic mutations that potentially indicate subtypes of glioblastoma [103].

  • Drug repurposing [104], or finding ways of applying an old drug to treat a new disease. For example, the disease-drug and drug-drug associations could be represented by a graph [105]. Clustering this graph, which is an unsupervised learning problem, results in many clusters. Each cluster consists of multiple diseases and drugs to be further examined for repurposing.

  • Identifying and characterizing new cell subtypes in single-cell omics data.

Some popular unsupervised machine learning techniques are clustering [106], expectation-maximization methods [107], and non-negative matrix factorization [108].

Popular Algorithms in Translational Bioinformatics

Extending the discussion of algorithms and tools in Chap. 6, this section provides additional details for popular AI algorithms in translational bioinformatics.

Classification Algorithms

Random forest [54] is a well-known example of an ensemble classification algorithm. A random forest consists of many decision trees (Fig. 14.1). Each tree is a discrete classification model, which consists of multiple classification rules. Each tree is constructed by applying the decision tree algorithm [109] to a random subset of the training data, including randomly selected samples and features. Then, to classify a sample, the random forest combines all trees’ classification results using majority voting or some other aggregation procedure.

Figure 14.2 illustrates a random forest. Here, the classification problem has four features. Three decision trees are randomly constructed: tree 1 uses only x1 and x2, tree 2 uses only x3 and x4, and tree 3 uses x1, x2, and x4. Each tree is a classification model consisting of multiple if-then rules. For example, tree 1 has three rules: x1 ≤ 5 → Class: No, x1 > 5 & x2 ≤ 50 → Class: No, x1 > 5 & x2 > 50 → Class: Yes. When classifying the sample (2, 68, 342, 6899), each tree follows the decision branch matching the sample’s feature values (grey-shaded hexagons). Then, all trees’ results (2 No, 1 Yes) are combined using a majority-rule vote to make the final decision (Class: No).

Fig. 14.2

Illustration of a random forest

To measure how important a feature x is in each tree, the algorithm compares the classification accuracies when x is included in and removed from the tree [110]. The more the accuracy decreases when removing x, the more important x is in that tree. The random forest then combines x’s importance scores across all trees into an overall feature importance score for the forest, which can be used as a metric for feature ranking.
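As a concrete sketch, the majority-vote step of the toy forest in Fig. 14.2 can be written in Python. Tree 1 encodes the three rules quoted above; the rules for trees 2 and 3 are hypothetical stand-ins invented here, chosen only to reproduce the (2 No, 1 Yes) vote:

```python
from collections import Counter

# Tree 1 encodes the rules given in the text; trees 2 and 3 are hypothetical.
def tree1(x):
    if x[0] <= 5:                        # x1 <= 5 -> No
        return "No"
    return "Yes" if x[1] > 50 else "No"  # x1 > 5, split on x2

def tree2(x):
    return "Yes" if x[2] > 300 else "No"     # hypothetical split on x3

def tree3(x):
    if x[0] <= 5:                            # hypothetical split on x1, x4
        return "No"
    return "Yes" if x[3] > 5000 else "No"

def forest_classify(x):
    # combine the trees' results with a majority-rule vote
    votes = Counter(tree(x) for tree in (tree1, tree2, tree3))
    return votes.most_common(1)[0][0]
```

For the sample (2, 68, 342, 6899), the votes here are No, Yes, No, so the forest returns “No”, matching Fig. 14.2.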

Naïve Bayesian classifier. Given a datapoint x = (x1 = a1, x2 = a2, …, xn = an) in n dimensions, we denote x1, x2, …, xn as the attributes and a1, a2, …, an as their values. The objective is to predict which class Ci, among k classes, x belongs to. According to Bayes’ rule, the posterior probability that x belongs to class Ci is:

$$ p\left({C}_i|\textbf{x}\right)=\frac{p\left(\textbf{x}=\left({a}_1,{a}_2,\dots, {a}_n\right)|{C}_i\right)\times p\left({C}_i\right)}{p\left(\textbf{x}=\left({a}_1,{a}_2,\dots, {a}_n\right)\right)}. $$
(14.1)

In Eq. (14.1), the denominator can be ignored because it is the same for all k classes. Thus, the classification is decided by which class has the largest p(x = (a1, a2, …, an)| Ci) × p(Ci). Here, the prior p(Ci) is the frequency of class i in the population. The likelihood p(x = (a1, a2, …, an)| Ci) is the probability of observing a sample x = (a1, a2, …, an) in class Ci. In the naïve scenario, we assume that all attributes x1, x2, …, xn are independent of one another given the class. Therefore:

$$ p\left(\textbf{x}=\left({a}_1,{a}_2,\dots, {a}_n\right)|{C}_i\right)=p\left({x}_1={a}_1|{C}_i\right)\times p\left({x}_2={a}_2|{C}_i\right)\times \dots \times p\left({x}_n={a}_n|{C}_i\right). $$
(14.2)

In discrete data, all elements in Eqs. (14.1) and (14.2) are computed by counting. In continuous data, the probabilities p(xj = aj| Ci) are computed from the probability density function of class Ci, which requires more advanced probabilistic modeling. Also, when the attributes are not completely independent, the naïve Bayesian algorithm can be extended to a Bayesian network. The reader can practice these advanced cases using open-source, freely available machine learning libraries such as Weka [111].
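The counting approach for discrete data can be sketched in Python. The toy symptom/diagnosis dataset below is invented for illustration:

```python
from collections import Counter, defaultdict

# Toy discrete dataset: (attribute-value dict, class label).
# Attribute names, values, and labels are invented for illustration.
data = [
    ({"fever": "yes", "cough": "yes"}, "flu"),
    ({"fever": "yes", "cough": "no"},  "flu"),
    ({"fever": "yes", "cough": "yes"}, "flu"),
    ({"fever": "no",  "cough": "yes"}, "cold"),
    ({"fever": "no",  "cough": "no"},  "cold"),
    ({"fever": "no",  "cough": "yes"}, "cold"),
]

def train(data):
    class_counts = Counter(label for _, label in data)           # for p(C_i)
    value_counts = defaultdict(Counter)    # (class, attribute) -> value counts
    for attrs, label in data:
        for attr, value in attrs.items():
            value_counts[(label, attr)][value] += 1
    return class_counts, value_counts

def classify(x, class_counts, value_counts):
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / n                           # prior p(C_i)
        for attr, value in x.items():               # Eq. (14.2) by counting
            score *= value_counts[(label, attr)][value] / count
        if score > best_score:
            best, best_score = label, score
    return best
```

The classifier simply picks the class maximizing the prior times the counted likelihoods, as in Eqs. (14.1) and (14.2).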

Clustering Algorithms

Expectation-maximization clustering [112], briefly, is an iterative process to identify the parameters of the distributions underlying the data. In each iteration, the datapoints are allocated to distribution components; then, all component parameters are re-calculated according to the datapoint allocation. The process repeats until the component parameters converge.

K-means clustering [113] is a popular clustering algorithm. Here, the goal is to partition the datapoints into k clusters. The defining parameter of each cluster is its centroid, which is the average of all datapoints assigned to it. A datapoint is allocated to the cluster with the closest centroid. Clusters are initially assigned randomly, and the process of centroid estimation and cluster assignment is repeated iteratively until the centroids converge. Figure 14.3 illustrates k-means clustering using a toy example (n = 10 datapoints, k = 3).

Fig. 14.3

An illustrative example of the k-means clustering algorithm
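The assignment and centroid-update steps described above can be sketched in Python. The 2D points and fixed initial centroids below are illustrative; real implementations typically use random initialization:

```python
# Minimal k-means in 2D. The points and fixed initial centroids are
# illustrative; real implementations usually pick random starting centroids.
def kmeans(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # update step: each centroid becomes the mean of its cluster
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            for cl in clusters if cl
        ]
    return centroids, clusters

points = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, [(0, 0), (5, 5)])
```

With these points, the two centroids converge to (1/3, 1/3) and (31/3, 31/3) after the first iteration.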

Consensus clustering. In the GBM case study (introduced in the section on “Phenotypic Data”) [59], consensus clustering was applied to divide the TCGA-GBM patients into k = 4 groups (subtypes). Consensus clustering is an iterative clustering procedure. The process starts by choosing the number of expected clusters (k) in the dataset. The core clustering algorithm, which was hierarchical clustering in TCGA-GBM [59], executes multiple runs with the same k on the dataset. Then the results of these runs are aggregated and evaluated for clustering quality. The same process is repeated with other choices of k; here, larger k often yields higher clustering quality. After experimenting with many choices of k, the final k is chosen by balancing the preference for a smaller k against the desire for better clustering quality. The silhouette index [114] is a well-known metric of clustering quality.

Matrix factorization. In machine learning, briefly, matrix factorization is the decomposition of a data matrix M (m datapoints × n attributes) into a product PQ ≈ M. Here, P is an m × k matrix, Q is a k × n matrix, and k is a pre-defined number determining the dimensions of the latent feature space. Each row in P represents a datapoint in the latent space, and each column in Q represents an attribute in the latent space (although such latent attributes may be derived from multiple input features in M, and thus be relatively difficult to interpret, as discussed in the section on “Volume of Data”). In matrix-factorization clustering, the latent space is defined by the number of clusters. Then, each row in P represents which clusters the corresponding datapoint may belong to, and each column in Q represents in which clusters the corresponding attribute is enriched.

Mathematically, matrix factorization is an optimization problem (for an introduction to solving optimization problems using gradient descent, see Chap. 6): find P and Q to minimize

$$ F={\left\Vert \textbf{M}-\textbf{PQ}\right\Vert}^2+\alpha {\left\Vert \textbf{P}\right\Vert}^2+\beta {\left\Vert \textbf{Q}\right\Vert}^2. $$
(14.3)

In this formula, α and β > 0 are pre-selected regularization parameters. The popular approaches to solving Eq. (14.3) are based on gradient descent theory [115]. Computing the partial derivative of F with respect to P (the derivation for Q is symmetric), we have

$$ \frac{\partial F}{\partial \textbf{P}}=\frac{\partial \left({\left\Vert \textbf{M}-\textbf{PQ}\right\Vert}^2\right)}{\partial \textbf{P}}+2\alpha \textbf{P}. $$
(14.4)

The first term in Eq. (14.4) can be computed by analyzing each entry (i, j) in M. We have, by matrix multiplication:

$$ {m}_{i,j}\approx \sum \limits_{l=1}^k{p}_{i,l}{q}_{l,j}. $$
(14.5)

In Eq. (14.5), mi, j is the entry at the ith row and jth column of M, pi, l is the entry at the ith row and lth column of P, and ql, j is the entry at the lth row and jth column of Q. To minimize \( {\left\Vert \textbf{M}-\textbf{PQ}\right\Vert}^2=\sum \sum {\left({m}_{i,j}-\sum \limits_{l=1}^k{p}_{i,l}{q}_{l,j}\right)}^2 \), we compute the partial derivative:

$$ \frac{\partial {\left({m}_{i,j}-\sum \limits_{l=1}^k{p}_{i,l}{q}_{l,j}\right)}^2}{\partial {p}_{i,l}}=-2{q}_{l,j}\left({m}_{i,j}-\sum \limits_{l=1}^k{p}_{i,l}{q}_{l,j}\right). $$
(14.6)

This allows updating each entry pi, l by gradient descent with a small learning rate σ:

$$ {p}_{i,l}={p}_{i,l}-\sigma \left(\sum \limits_{j=1}^n\frac{\partial {\left({m}_{i,j}-\sum \limits_{l=1}^k{p}_{i,l}{q}_{l,j}\right)}^2}{\partial {p}_{i,l}}+2\alpha {p}_{i,l}\right)={p}_{i,l}+\sigma \left(\sum \limits_{j=1}^n2{q}_{l,j}\left({m}_{i,j}-\sum \limits_{l=1}^k{p}_{i,l}{q}_{l,j}\right)-2\alpha {p}_{i,l}\right). $$
(14.7)

The update rule for ql, j is derived similarly and is left as an exercise. The updates in Eqs. (14.6) and (14.7) are applied over many iterations until P and Q converge.
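A minimal Python sketch of this procedure is shown below, using a stochastic per-entry variant of the batch update in Eq. (14.7) together with its symmetric counterpart for Q. The matrix size, rank k, learning rate, and regularization weight are illustrative choices:

```python
import random

# Stochastic per-entry variant of the gradient updates in Eqs. (14.6)-(14.7),
# applied symmetrically to P and Q. Sizes and hyperparameters are illustrative.
def factorize(M, k, alpha=0.01, lr=0.01, iters=2000, seed=0):
    rng = random.Random(seed)
    m, n = len(M), len(M[0])
    P = [[rng.random() for _ in range(k)] for _ in range(m)]
    Q = [[rng.random() for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        for i in range(m):
            for j in range(n):
                # residual of entry (i, j), per Eq. (14.5)
                e = M[i][j] - sum(P[i][l] * Q[l][j] for l in range(k))
                for l in range(k):
                    p, q = P[i][l], Q[l][j]
                    P[i][l] += lr * (2 * e * q - 2 * alpha * p)
                    Q[l][j] += lr * (2 * e * p - 2 * alpha * q)
    return P, Q

def reconstruction_error(M, P, Q):
    k = len(P[0])
    return sum((M[i][j] - sum(P[i][l] * Q[l][j] for l in range(k))) ** 2
               for i in range(len(M)) for j in range(len(M[0])))
```

For a matrix that is exactly rank 2, a rank-2 factorization drives the reconstruction error close to zero (up to the small bias introduced by the regularization terms).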

Dimension Reduction Algorithms

Embedding, briefly, is a non-affine method for reducing data dimensionality. By non-affine, we mean that a high-dimensional datapoint x is mapped to a lower-dimensional (usually 2D) datapoint y such that the inverse mapping y → x has no precise formula. Embedding aims to preserve, in the embedded pairs (yi, yj), the original relative similarity between every datapoint pair (xi, xj). In t-distributed Stochastic Neighbor Embedding (tSNE) [93], the pairwise similarity of (xi, xj) in the high-dimensional data space is defined as

$$ {p}_{j\mid i}=\frac{\exp \left(-{\left\Vert {\textbf{x}}_i-{\textbf{x}}_j\right\Vert}^2/2{\sigma}_i^2\right)}{\sum_{\forall k\ne i}\exp \left(-{\left\Vert {\textbf{x}}_i-{\textbf{x}}_k\right\Vert}^2/2{\sigma}_i^2\right)}. $$
(14.8)

And the pairwise similarity of (yi, yj) in the embedded data space is defined as

$$ {q}_{j\mid i}=\frac{{\left(1+{\left\Vert {\textbf{y}}_i-{\textbf{y}}_j\right\Vert}^2\right)}^{-1}}{\sum_k{\sum}_{l\ne k}{\left(1+{\left\Vert {\textbf{y}}_k-{\textbf{y}}_l\right\Vert}^2\right)}^{-1}}. $$
(14.9)

Upon defining these similarities, tSNE minimizes the Kullback–Leibler divergence

$$ KL\left(P\Big\Vert Q\right)=\sum \limits_{i\ne j}{p}_{j\mid i}\times \mathit{\log}\left(\frac{p_{j\mid i}}{q_{j\mid i}}\right). $$
(14.10)

Then, tSNE finds the embedded datapoints y using the gradient descent approach, computing the partial derivatives ∂KL/∂yi.
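The conditional similarities of Eq. (14.8) can be computed directly for a toy dataset. The sketch below uses a single fixed bandwidth σ for every point, whereas real tSNE tunes each σi to match a target perplexity:

```python
from math import exp

# Conditional similarities of Eq. (14.8) for toy 2-D points, using one fixed
# bandwidth sigma for all points (real tSNE tunes sigma_i per point).
def conditional_p(points, sigma=1.0):
    n = len(points)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        weights = []
        for k in range(n):
            if k == i:
                weights.append(0.0)       # self-similarity is excluded
            else:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[k]))
                weights.append(exp(-d2 / (2 * sigma ** 2)))
        total = sum(weights)
        for j in range(n):
            P[i][j] = weights[j] / total  # normalize as in Eq. (14.8)
    return P

P = conditional_p([(0, 0), (0, 1), (5, 5)])
```

Each row of P sums to 1, and nearby points receive larger conditional similarities than distant ones.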

Association Mining Algorithms

In translational bioinformatics, mining associations among features depends on defining an association metric. Some popular choices are:

  • Pearson’s correlation. Briefly, Pearson’s correlation is the ratio between the covariance of two features and the product of their standard deviations. This metric requires the features to be in numeric format, and only measures how the two features linearly correlate. If the two features have a non-linear association, Pearson’s correlation may not detect the association.

  • Mutual information metrics [116]. Mutual information measures the dependency between two features. For example, consider two Boolean features A and B. In a dataset, A and B are independent if the frequency of ‘A is true’ is approximately the same as the frequency of ‘A is true given B is true’.

  • The Jaccard index [117]. Given two Boolean features A and B in a dataset, the Jaccard index is the ratio between the number of samples where both A and B are true (also called the intersection) and the number of samples where A or B is true (also called the union).
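Pearson’s correlation and the Jaccard index can be computed in a few lines of Python; the toy feature vectors below are illustrative:

```python
from math import sqrt

def pearson(x, y):
    # covariance over the product of the standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

def jaccard(a, b):
    # intersection count over union count for two Boolean features
    inter = sum(1 for u, v in zip(a, b) if u and v)
    union = sum(1 for u, v in zip(a, b) if u or v)
    return inter / union
```

A perfectly linear pair of features yields a Pearson correlation of 1, while two Boolean features sharing one of three "true" samples yield a Jaccard index of 1/3.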

Security, Privacy, and Ethical Considerations (see also Chap. 18)

In practice, the AI scientist must consider the following ethical points, according to Safdar et al. [118]:

  • Population bias. This happens when the sociodemographic composition of a research or training dataset does not reflect that of the study population. Rare demographic and ethnic groups are often under-sampled.

  • Data ownership. Translational bioinformatics data is often derived from human subjects. Therefore, consent from study participants, which allows using their data for research and tool development, must be obtained.

  • Privacy protection. Human subject identifiable information must be removed before analyzing and publicly releasing the data for research.

Team Data Science Infrastructure

AI researchers in translational bioinformatics use publicly available big data and analytics infrastructures to accelerate their research. These infrastructures not only support scaling to large data sets, but also installation of popular analytic tools. Some examples are as follows:

  • UAB UBRITE (https://ubrite.org/) integrates many machine learning programming libraries, which can be activated in R/Python scripts.

  • Google Colab (https://colab.research.google.com/notebooks/) has a free collection of pre-built machine learning Jupyter notebooks [119] available for reuse. Community users may slightly modify these notebooks for specific projects, and freely run a notebook for up to 12 hours using Google Cloud computing resources.

Applications of AI in Translational Bioinformatics

Improving Translational Bioinformatics Data Infrastructure

Experimental validation and manual curation, which are the most reliable approaches to constructing translational bioinformatics databases, are time-consuming and resource-costly. Therefore, AI methods are utilized to infer novel information and enrich these databases. For example, in the STRING database [27], text mining is the channel contributing the broadest coverage. The number of protein-protein interactions (PPIs) for human proteins in STRING is approximately 2.9 million [13] (though some of these may be inaccurate on account of natural language processing errors); meanwhile, BioGRID [120], which is among the largest experimentally validated and manually curated PPI collections, has only 268,599 PPIs. As another example, JASPAR [121] uses a hidden Markov model [122, 123] to predict 337 of its 1964 transcription factor–target interaction profiles.

Inferring Pairwise Molecular Regulation

Many biological research areas require understanding and predicting regulatory networks to provide clear insight into living cells’ cellular processes [124]. For example, in injury response, regulation from G Protein-coupled receptors and their interactees is crucial to DNA damage response [125]. In another example, the transcription factor c-JUN promotes and maintains the expression level of CCND1, which is required in cell cycle progression [126]. Thus, discovering the gene regulatory network can enrich the translational bioinformatics database (see section on “Improving translational bioinformatics data infrastructure”), and inform the design of targeted therapies [127]. On the other hand, the number of possible regulatory pairs is too large to be fully validated by biological experiments, and many have not yet been discovered. This explains why predictive methods to infer molecular regulation, especially transcription factor—target and ligand—receptor pairs, are still an active research area.

Transcription factors are a set of DNA-binding proteins, and the genes at the DNA locations where the transcription factors bind are their targets [128]. The expression of the genes encoded around the binding sites is significantly up- or down-regulated by the transcription factor. Therefore, the focus of AI in predicting transcription factor–target relationships is to identify the binding sites, which may number in the tens of thousands [128], of a transcription factor. Furthermore, the prediction must account for two complications: one binding event may control multiple target genes, and one gene may be targeted by multiple binding events. AI tools use chromatin immunoprecipitation followed by sequencing (ChIP-seq) data and expression data as inputs for prediction. IM-PET [129], RIPPLE [130], and PETModule [131] predict transcription-factor targets using the random forest algorithm. Here, the ChIP-seq data is processed to obtain the distance between the binding/enhancer sites and the targets’ DNA coding regions. In addition, deep-learning-based tools, such as DeepTFactor [132], TBiNet [133], and scFAN [134], focus on precisely predicting transcription factor binding sites. Their output can be provided to other AI methods to infer the targets.

A ligand is a substance that forms a complex with a cell surface protein (called a receptor) and thereby triggers a series of cellular signaling events [135]. These signaling events respond to the stimulus that produces the ligand. For example, in natural skin wounds, fibroblasts release the WNT5A ligand; this ligand binds to FZD1/2 receptors on the skin cell and activates the WNT signaling pathway [136], which promotes skin cell proliferation and helps heal the wound [137]. In drug discovery, after selecting which signaling pathways and related receptors to activate, the next task is designing an artificial ligand that can bind to the receptor. Computing the ligand-receptor affinity is an important task before testing the artificial ligand. Recently, machine-learning-based models have been shown to outperform non-machine-learning methods on this task [138, 139].

Table 14.2 summarizes the commonly used AI tools described in this section.

Table 14.2 Summary of AI tools used in inferring pairwise molecular regulation

Inferring and Characterizing Cellular Signaling Mechanism that Determines the Cellular Response

Identifying and characterizing signaling mechanisms is essential in complex phenotype research because the disease outcomes concerned involve many genes interacting and responding to one another in response to an external stimulus [140]. For each phenotype, the highly expressed genes and their interactions are identified in in-vitro experiments and annotated as a “pathway”. According to KEGG [23], at the time of this writing, 543 pathways have been well defined and annotated across all species included in the database. Adding species-specific pathways, Reactome [24] reports that the number may rise to 999. Extending beyond pathways, Gene Ontology [22] groups and characterizes 44,945 ontology terms, where each term concerns a set of genes that participate in a biological process, are located in the same cellular location, and/or share the same molecular profile. Among this large number of pathways and ontologies, the most frequently investigated ones often regulate cell proliferation, cell apoptosis, and cell differentiation. Understanding the mechanisms regulating these processes helps to explain the progression of, and infer potential treatments for, diseases with some of the highest mortality rates: cardiovascular diseases and cancers.

AI can support this research area by answering three basic questions. The first question concerns how to identify which pathways are involved in disease progression. Here, feature selection methods [141, 142] can identify highly differentiated genes between healthy controls and subjects with a disease of interest (the identified genes can be considered biomarkers). Then, applying pathway analysis techniques [143] to the biomarkers can yield a list of pathways involved in the disease. The second question concerns how to identify the “master regulator”, or the “origin”, of the perturbed signaling mechanisms; among highly interconnected genes, many perturbed gene signals are merely responses to other genes. The third question concerns how to find a therapeutic target: a targetable gene whose modulation yields the greatest reduction in disease progression. These three questions can be answered using AI-based system simulations.

Before deep learning, the state-of-the-art approaches in this area focused on representing the interaction network among the pathway genes as a mathematical equation system, and then solving the system by either logic programming or dynamic differential equations [144, 145]. In logic programming, the interaction between two (or more) genes is represented by a combination of logic gates. Since logic gates are finite and deterministic, the simulation is straightforward. A limitation of logic gates is that they cannot represent feedback loops or two-way gene-gene interactions. Feedback loops are critical to maintaining a stable environment inside a living subject. For example, in wound healing, after the initial platelets respond to the wound, these platelets release adenosine diphosphate (ADP); the ADP binds to P2Y1 and P2Y12 receptors to activate more platelets; more platelets produce more ADP to continue this activation loop until the wound surface is completely covered, preventing further blood loss [146]. Dynamic differential equations can overcome this limitation. In principle, dynamic differential equations discretize the system into multiple time points, define the dynamic equation for each gene’s expression at each time point, and calculate gene expression over a sequence of time points.
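The logic-gate representation described above can be sketched in Python: because the gates are deterministic, simulation reduces to repeatedly evaluating them. The three-gene network below (A is a constant input activating B, and C is the AND of A and B) is invented for illustration:

```python
# A three-gene toy network: A is a constant input, B' = A, and C' = A AND B.
# All gates are evaluated synchronously at each step; genes are illustrative.
def step(state):
    a, b, c = state
    return (a,          # A: external input, held constant
            a,          # B' = A
            a and b)    # C' = A AND B

state = (True, False, False)
trajectory = [state]
for _ in range(3):
    state = step(state)
    trajectory.append(state)
```

Starting from (A on, B off, C off), the network switches B on in one step, C on in the next, and then reaches a steady state; note that this acyclic cascade has no feedback loop, consistent with the limitation noted above.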

Below is a simple example of how to simulate a two-gene system using dynamic equations. The system has two proteins (objects), denoted PA and PB, with initial values S0 = (−1, 0) for PA (strongly inhibited) and PB (normal), respectively. PA up-regulates PB while PB down-regulates PA in a negative-feedback loop [147], as in Fig. 14.4. These interactions can be represented by the system matrix M = \( \left[\begin{array}{cc}0& 1\\ {}-1& 0\end{array}\right] \). Suppose the discrete dynamic equation, which takes both the initial values and the system matrix, is as follows (this equation is related to the equation underlying the PageRank algorithm [148]):

$$ {S}_t=0.15\times {S}_0+0.85\times {\textbf{M}}^T{S}_{t-1} $$
Fig. 14.4

The system (top) and simulation result when M = [0 1;-1 0]

The following Matlab code, reconstructed here from the equation and parameter values above, shows how to implement this system:

  % Initial signal values: PA strongly inhibited, PB normal
  S0 = [-1; 0];
  % System matrix: PA up-regulates PB; PB down-regulates PA
  M = [0 1; -1 0];
  S = S0;
  nIter = 50;                      % number of iterations to simulate
  history = zeros(2, nIter);
  for t = 1:nIter
      % the discrete dynamic equation
      S = 0.15*S0 + 0.85*M'*S;
      history(:, t) = S;
  end
  plot(1:nIter, history');         % PA and PB signals over the iterations
  legend('PA', 'PB');

The result in Fig. 14.4 shows that over time, PA and PB converge to (−0.15, −0.13), respectively. This system is closely balanced (S ≈ 0), suggesting that the initial inhibition of PA can balance the system. Figure 14.5 shows that the system results are completely different if we make a slight change to the model parameters by setting M = \( \left[\begin{array}{cc}0& -1\\ {}1& 0\end{array}\right] \), such that PA down-regulates PB, while PB up-regulates PA.

Fig. 14.5

The system (top) and simulation result when M = [0 -1; 1 0]

With deep learning approaches, signaling interactions can be used to construct and train a deep learning architecture. For example, Ma et al. describe a deep learning approach to simulate cell proliferation, in which the learning architecture is organized according to the proliferation-related gene ontology hierarchy instead of a conventional convolutional structure [149].

The AI tools described in this section are summarized in Table 14.3.

Table 14.3 Summary of AI tools for inferring and characterizing cellular signaling mechanisms

Identifying and Characterizing New Cell Types and Subtypes

Single-cell transcriptomics technologies enable measuring genetic information at the individual cell resolution [151], which is the “building-block” level of all living organisms. Single-cell transcriptomic data also present new questions, and require new analytic techniques that are not available in bulk transcriptomics. First, does the data present novel cell populations that have not been studied due to the limitations of bulk technologies, especially of the stem and progenitor cell types? Second, for signaling pathways that function differently in different cells of the same cell type and in the same tissue, how might we quantify the signaling activity within each cell? AI techniques are essential to answer these questions.

To answer the first question, state-of-the-art single-cell analytic tools [152, 153] apply clustering algorithms to partition the entire cell dataset. In each cell cluster, genes specifically expressed in that cluster are queried against the cell-type canonical marker literature, such as the CellMarker database [154], to determine which cell type the cluster corresponds to. For the clustering step, density-based clustering [155] and Louvain clustering [156] are the most popular methods. Also, embedding methods, such as tSNE [93] and UMAP [157], are often used to visualize the cell clusters. In many single-cell datasets [158,159,160], small clusters appear that highly express canonical markers from more than one cell type. These small clusters need careful examination because they could either be technical errors, such as doublets [161], or may indeed represent a new cell population.

To answer the second question, the major challenge is the potential for missing values in single-cell data, which is called the dropout effect [162]. The best contemporary single-cell transcriptomic techniques may achieve 6500 genes per cell [163], which only covers 25–30% of the human genome. Consequently, single-cell gene expression data often have a high proportion of zero values. Here, a zero can either mean the gene does not express in the cell, or that the gene does express, but the sequencing step did not capture this expression. To tackle the dropout effect, single-cell pathway analysis may employ machine learning to quantify pathway activity. This requires choosing “positive” cells, in which the pathway is known to function, and “negative” cells in which it is known not to function or to express at a very low level. For example, fetal and adult cardiomyocytes are excellent “positive” and “negative” cells for cell cycle signaling pathways. The pathway genes can be used as features to build a classifier distinguishing between the “positive” and the “negative” cells. Here, the “positive” cells should have a high model score and vice versa. Then, the model can be applied to analyze the pathway activity in other cells.
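The positive/negative-cell idea can be sketched with a simple nearest-centroid score over the pathway genes (a stand-in for the classifiers used in practice); the expression vectors below are invented for illustration:

```python
from math import sqrt

# Pathway genes are the features; "positive" cells are known to have the
# pathway active and "negative" cells known inactive. Vectors are invented.
positive_cells = [[5.0, 4.0, 6.0], [6.0, 5.0, 5.0]]
negative_cells = [[0.5, 1.0, 0.0], [1.0, 0.0, 0.5]]

def centroid(cells):
    return [sum(values) / len(cells) for values in zip(*cells)]

def distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def activity_score(cell):
    # 1.0 = resembles the positive cells, 0.0 = resembles the negative cells
    d_pos = distance(cell, centroid(positive_cells))
    d_neg = distance(cell, centroid(negative_cells))
    return d_neg / (d_pos + d_neg)
```

A new cell's pathway activity is then scored by its relative proximity to the positive centroid; cells resembling the "positive" group score near 1.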

The AI tools in this section are summarized in Table 14.4.

Table 14.4 Summary of AI tools used for identifying and characterizing new cell types and subtypes. C, P and G indicate Clinotypic, Phenotypic and Genotypic data respectively

Drug Repurposing

Drug repurposing, briefly, is applying an approved or investigational drug to treat a new disease [104]. In principle, drug repurposing can be conducted by calculating similarity. If two drugs A and B are highly similar, and A is approved to treat disease C, then B may also be used to treat C. Similarly, if two proteins D and E are highly similar, and A targets D, then A may also target E. Thus, drug repurposing includes many sub-problems for which machine learning techniques can be promising solutions.

Generating and mining a drug-drug similarity network. In this problem, each drug is represented by a vector. The drug vector represents structural chemical information, known drug-protein interactions, and information about the drug’s side effect [164]. The drug-drug pairwise similarity matrix, or network, is computed from a matrix containing vectors for all the drugs under consideration. Then, applying matrix factorization [165,166,167] results in drug clusters. Here, drugs sharing the same cluster are more likely to be repurposed for each other’s diseases.
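The drug-vector and similarity-network construction can be sketched in Python with cosine similarity; the drug names and binary feature vectors are invented for illustration:

```python
from math import sqrt

# Each drug vector concatenates (hypothetical) structural, target, and
# side-effect features; similarity is the cosine between vectors.
drugs = {
    "drugA": [1, 0, 1, 1, 0],
    "drugB": [1, 0, 1, 0, 0],
    "drugC": [0, 1, 0, 0, 1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# the pairwise similarity matrix (network) over all drugs
similarity = {(u, v): cosine(drugs[u], drugs[v]) for u in drugs for v in drugs}
```

In this toy network, drugA and drugB share features and would likely land in the same cluster, making them mutual repurposing candidates, while drugC would not.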

Mining a target-target similarity network. Similarly to mining the drug-drug similarity network, machine learning techniques can be applied to cluster the target-target network [168, 169]. Then, a drug targeting one gene may be repurposed to target the other genes in the same cluster.

Mining a bi-partite drug-target network. Here, the drug-drug similarity, target-target interaction, and drug-target networks are co-factorized [170, 171].

Model-based simulation. This approach utilizes dynamic modeling, as discussed in the section on inferring and characterizing cellular signaling mechanisms. The approach requires the following definitions. First, the initial disease condition is represented by a non-zero vector of gene expression values (S0); the computational goal is to reach S = 0, which represents the non-disease state. Second, the treatment is also characterized by a vector u. Usually, u has the same dimension as S, and the dimensions of u and S correspond to each other. Third, the gene-gene interactions and signaling mechanisms dynamically change the expression vector, yielding the equation St = F(St−1) or, with treatment, St = F(St−1, u). Then, there are two options to score a repurposing candidate. First, all drug treatments can be simulated and ranked using the recursive system St = F(St−1, u). Second, applying the system control approach [172] yields a “hypothetical treatment” that optimally returns S = 0; the hypothetical treatment can then be used as a template to match against real drug treatments to select the repurposing candidates.
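A minimal Python sketch of the simulation-and-ranking option is shown below, reusing the two-gene dynamic system from the earlier section. How the treatment vector u enters the update, and the candidate treatments themselves, are assumptions made here for illustration:

```python
# Reuses the two-gene system S_t = 0.15*S0 + 0.85*M^T*S_{t-1}, with a
# candidate treatment u added at each step (an assumed way for u to enter);
# candidates are ranked by how close the final state is to S = 0.
S0 = [-1.0, 0.0]
M = [[0, 1], [-1, 0]]        # PA up-regulates PB; PB down-regulates PA

def simulate(u, iters=200):
    S = list(S0)
    for _ in range(iters):
        MtS = [M[0][0] * S[0] + M[1][0] * S[1],   # (M^T S)_1
               M[0][1] * S[0] + M[1][1] * S[1]]   # (M^T S)_2
        S = [0.15 * S0[i] + 0.85 * MtS[i] + u[i] for i in range(2)]
    return S

def norm(S):
    return sum(x * x for x in S) ** 0.5

# hypothetical repurposing candidates, represented by treatment vectors
candidates = {"no_treatment": (0.0, 0.0), "candidate_1": (0.15, 0.13)}
ranked = sorted(candidates, key=lambda name: norm(simulate(candidates[name])))
```

The candidate whose simulated steady state is closest to S = 0 ranks first, which is the scoring criterion described above.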

The AI tools discussed in this section are presented in Table 14.5.

Table 14.5 Summary of AI tools in drug repurposing

Supporting Clinical Decisions with Bioinformatics Analysis

In genetic diseases caused by a single genetic disorder [174], mutation-analysis protocols can be used for diagnosis directly. Meanwhile, in complex diseases, bioinformatics analysis is applied case by case. For example, in work by Kim et al. [36], single-cell transcriptomic analysis, which used AI methods for clustering, detected that a JAK-STAT signaling pathway disruption was the cause of severe hypersensitivity syndrome/drug reaction in a patient. Therefore, tofacitinib, a JAK-STAT inhibitor, was selected and successfully treated the patient, although the drug is not indicated to treat hypersensitivity syndrome or drug reactions in general.

In cancer, the patient-derived xenograft (PDX) platform [175] is another direction for bioinformatics protocols to support clinical decision-making. Briefly, PDX is a technique to host a patient tumor biosample in a mouse. Cancer researchers can perform many interventions in the mouse and observe the clinical outcomes, such as survival and speed of tumor growth. Given a sufficiently large number of PDX samples with experimental results, called a PDX catalog, a new patient tumor biosample can be mapped to the PDX catalog. Here, the experimental outcomes of the most closely mapped PDX samples can be used to infer the patient’s likely clinical outcome under different treatment decisions, fulfilling the role of clinical decision support. Due to the heterogeneity of biomedical samples, designing the mapping algorithm for this application is a significant challenge that may require advanced AI techniques [176].

Predicting side effects is another clinical application where AI-based translational bioinformatics methods are potentially helpful. In pre-clinical applications, similarly to drug repurposing, predicting side effects relies on mining drug-drug similarity [177, 178]. In the clinical setting, the principle involves mining past side effects recorded in patients’ medical records. Thus, the side-effect analysis is provider-specific and customized to the medical record data infrastructure, as in the work of Sohn et al. [179]. Here, itemset (drug – side effect) mining and rule mining are standard AI methods used to predict drugs’ side effects.

Predicting Complex Biochemical Structures

After finding the target gene and other genetic causes of a disease, the next cornerstones in drug discovery are (i) representing the physical structure of the target protein (the protein encoded by the target gene) and (ii) representing the physical structure of the chemical.

Representing protein physical structure involves reconstructing the 3D arrangement of the atoms in each amino acid, given the sequential order of the amino acids on the protein polypeptide chain [180]. The sequential order of the amino acids, also called the protein primary structure [177], identifies the protein; determining it is always the first step in studying a protein and can easily be done with today's protein sequencing techniques [181]. Meanwhile, a protein's functions and its interactions with other proteins and chemicals are primarily determined by its higher-level 3D structures. Inferring the protein 3D structure from the primary structure has been a grand challenge for decades [182]. Before deep learning, machine learning techniques were used to solve some protein structure subproblems, such as predicting pairwise distances among the amino acids in 3D [179] and predicting the 3D structure class [183]. Recent deep-learning-based methods can predict the 3D positions of the protein atoms, which is a more challenging problem. Most recently, AlphaFold directly predicts the 3D coordinates of all heavy atoms of a given protein from the primary amino acid sequence [184]. AlphaFold showed significantly superior performance over other methods in the 14th Critical Assessment of Techniques for Protein Structure Prediction challenge [185], marking a major breakthrough of AI in one of the cornerstone problems of chemical biology.
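To make the pairwise-distance representation mentioned above concrete, the snippet below computes a residue-residue distance map from a set of 3D coordinates (a toy, hypothetical four-residue chain). Pre-deep-learning predictors typically targeted this matrix rather than raw coordinates, because it is invariant to rotation and translation of the protein.

```python
import numpy as np

def distance_map(coords):
    """Pairwise Euclidean distance matrix between residue coordinates.

    coords: (n_residues, 3) array, e.g. C-alpha positions. The result
    is a symmetric matrix with zeros on the diagonal."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy 4-residue chain laid out on a line, 3.8 angstroms apart
# (a typical consecutive C-alpha spacing).
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0],
                   [11.4, 0.0, 0.0]])
D = distance_map(coords)
print(D[0, 3])  # distance between the first and last residue: 11.4
```

Recovering 3D coordinates consistent with a predicted distance map is itself an optimization problem; deep models such as AlphaFold sidestep it by predicting coordinates directly.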

Representing chemical structure, also called ligand structure or drug structure, is in some sense a reverse-engineering problem relative to representing protein physical structure. Given a 3D physical structure, the objective is to find the arrangement, or order, of atoms that can create that 3D structure [186]. Here, the chemical 3D structure is defined such that the chemical can bind to the targeted protein 3D structure [187]. Solved manually by chemical engineering experts, this problem often takes years to complete [188]. Recently, it has been shown that deep learning can significantly accelerate the process. For example, GENTRL [189] designed new chemicals binding to DDR1 within just 23 days. GENTRL applied a deep-learning autoencoder [190], a technique that learns to compress datapoints into a low-dimensional latent representation and to decode that representation back into the original data space; new datapoints resembling the originals can then be synthesized by decoding novel points in the latent space. The GENTRL deep autoencoder was trained on more than 200 million 3D chemical structures in the ZINC v.15 database [191]. Then, to find new chemicals binding to DDR1, GENTRL took the existing DDR1 inhibitors, which can be found in the ChEMBL database [192], as input and synthesized ‘similar’ compounds using the autoencoder.
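To make the autoencoder idea concrete, here is a deliberately minimal sketch: a linear autoencoder trained by gradient descent on toy 4-dimensional "molecule descriptors" (all data hypothetical), which compresses each point to a 1-dimensional latent code and synthesizes a new point by perturbing and decoding a code. Real systems such as GENTRL use deep, variational architectures over molecular representations, but the compress-decode-perturb loop is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy descriptor data: 200 points near a 1-D line in 4-D space.
t = rng.normal(size=(200, 1))
X = t @ np.array([[1.0, 0.5, -0.5, 2.0]]) + 0.05 * rng.normal(size=(200, 4))

# Linear autoencoder: encode 4-D -> 1-D latent, decode back to 4-D.
W_enc = rng.normal(scale=0.1, size=(4, 1))
W_dec = rng.normal(scale=0.1, size=(1, 4))
lr = 0.01
for _ in range(2000):
    Z = X @ W_enc              # latent codes
    X_hat = Z @ W_dec          # reconstructions
    err = X_hat - X
    # Gradient descent on the mean squared reconstruction error.
    W_dec -= lr * (Z.T @ err) / len(X)
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)

mse = float(((X @ W_enc @ W_dec - X) ** 2).mean())
print("reconstruction MSE:", round(mse, 4))

# "Generate" a new datapoint: perturb an existing latent code and decode it.
z_new = (X[0] @ W_enc) + 0.1
x_new = z_new @ W_dec
print("synthetic point:", np.round(x_new, 2))
```

Because the decoder only ever saw codes of realistic datapoints, decoding a nearby latent code yields an output that resembles, but does not duplicate, the training data, which is exactly what molecular generation needs.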

Trends and Outlook

In pre-clinical research, future translational bioinformatics research will likely pivot around current and forthcoming breakthroughs in biotechnology. At the time this chapter is published, single-cell -omics, which measures the molecular environment inside individual cells, and the patient-derived xenograft, which allows hosting a patient’s biosample in a living organism, will be the major platforms for new AI translational bioinformatics techniques. Key problems that require new and further technical development are:

  • Identifying and characterizing small but novel cell types from single-cell omics data. Stem, progenitor, and highly proliferative cells are often the main focus because they are key to treating two commonly fatal disease classes: cardiovascular disease and cancer. In cardiovascular disease, where damage is compounded by cardiac tissue’s low regenerative capacity, the goal is to promote cell proliferation [193]. In cancer, by contrast, the goal is to restrain the proliferative cells.

  • Characterizing the tissue microenvironment, which significantly contributes to the growth and survivability of the tissue. Single-cell data allow observing and separating the main tissue cell types, such as neural cells in brain cancers, from environmental cell types, such as immune cells and fibroblasts. Cancer immunotherapies [194] are examples of how the microenvironment can impact tissues. Key questions concern the molecular and signaling mechanisms by which the microenvironment stimulates the tissue cell types, and which signaling pathways the tissue cells activate in response to the stimulus.

  • Characterizing cell differentiation, which helps explain tissue regeneration and tumor recurrence. Analyzing time-series single-cell data is key to approaching this problem.
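A minimal sketch of the first problem above, identifying a small cell population by clustering: the toy data below use only two genes and plain k-means with a deterministic spread-out initialization, whereas real single-cell pipelines cluster thousands of genes after dimensionality reduction and use more robust algorithms.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy single-cell expression matrix (cells x 2 genes): a large
# differentiated population and a small stem-like population.
major = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(190, 2))
rare = rng.normal(loc=[8.0, 8.0], scale=0.3, size=(10, 2))
X = np.vstack([major, rare])

def kmeans2(X, n_iter=20):
    """Plain Lloyd's algorithm for k=2, initialized at the data's
    extreme corners so the two centroids start far apart."""
    centroids = np.array([X.min(axis=0), X.max(axis=0)])
    for _ in range(n_iter):
        # Assign each cell to its nearest centroid, then recompute means.
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    return labels, centroids

labels, centroids = kmeans2(X)
print("cluster sizes:", sorted(np.bincount(labels).tolist()))  # [10, 190]
```

The smaller cluster is the candidate rare cell type; in practice it would then be characterized by its marker genes and validated against reference atlases.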

Meanwhile, utilizing deep-learning-based models, especially autoencoders, will be the main focus in virtual docking, molecular design, and systems biology simulation. In this area, the keys to successful AI applications include reducing overfitting, enlarging the molecular datasets, and setting up a gold-standard validation scheme [195].

To make a higher impact in clinical applications, AI translational bioinformatics still requires substantial effort to build integrative multi-omics, clinome, and phenotype data infrastructure. In this respect, the All of Us research program [196] is a pioneering project. While waiting for this infrastructure to emerge, AI researchers can solve smaller-scale research problems, such as predicting patients’ risk from clinome data and linking omics and clinome data via text mining. Also, as shown in the work of Jensen et al. [27], AI can already play a significant role in clinical decision support under clinical experts’ guidance in specific cases.

Questions for Discussion

  • What are three types of translational bioinformatics data?

  • Clinotype-genotype association has not been well explored in translational bioinformatics. Sketch a research strategy to mine this type of association using Natural Language Processing (NLP - Chap. 7).

    Hint: find a catalog of clinotype terms; apply NLP tools to the PubMed collection of abstracts.

  • In the section on “Clustering Algorithms”, matrix factorization, derive the formula for \( \frac{\partial F}{\partial \textbf{Q}} \) using the method in Eqs. (14.2)–(14.6).

  • The section on “Inferring and Characterizing Cellular Signaling Mechanism that Determines the Cellular Response” shows that when the system has negative feedback, the signal oscillates. This is a well-known phenomenon in systems biology modeling. Show this phenomenon again in the following system: M = \( \left[\begin{array}{cc}-1& 1\\ {}0& 0\end{array}\right] \), \( S_t = 0.15\times S_0 + 0.85\times M^T S_{t-1} \), \( S_0 = \left(-1, 0\right) \). Draw the system diagram and the PA, PB signals as in Figs. 14.3 and 14.4. What happens with \( S_0 = \left(1, 0\right) \)?

  • In the section on “Supporting Clinical Decision with Bioinformatics Analysis”, mapping new patients’ tumor expression (NPT) data to existing patients’ tumor (EPT) expression data may help predict clinical outcomes. Two data processing methods are proposed to map the NPT to the EPT. One way to select the better method is to plot the combined NPT-EPT embedding after processing the data. Recall that an embedding preserves the pairwise similarity of the original data space in the embedded space. The embedding visualization is in Fig. 14.6 (below). Which processing method is better, and why?

Fig. 14.6

Two scatter plots of NPT and EPT datapoints, showing the combined NPT-EPT embedding under two data processing methods: plot (a) visualizes the two data types intermixed, and plot (b) visualizes them separated.

(for question 3). NPT-EPT-combined embedding visualization with two data processing methods. (a) Method 1. (b) Method 2
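For the signaling-system question above, the scaffold below iterates the recurrence \( S_t = 0.15\times S_0 + 0.85\times M^T S_{t-1} \) so the PA, PB trajectories can be inspected or plotted; it sets up the experiment without answering the question.

```python
import numpy as np

# System matrix from the exercise.
M = np.array([[-1.0, 1.0],
              [0.0, 0.0]])

def simulate(S0, steps=30):
    """Iterate S_t = 0.15*S0 + 0.85*M^T S_{t-1}, returning the
    trajectory as a (steps+1, 2) array of (PA, PB) values."""
    S0 = np.asarray(S0, dtype=float)
    S, trajectory = S0.copy(), [S0.copy()]
    for _ in range(steps):
        S = 0.15 * S0 + 0.85 * (M.T @ S)
        trajectory.append(S.copy())
    return np.array(trajectory)

traj = simulate([-1.0, 0.0])
print(traj[:4])  # first few (PA, PB) states; try S0 = (1, 0) as well
```

Plotting the first column of `traj` against the step index reproduces the kind of signal curve shown in Fig. 14.4.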

Further Reading

Russell, S., & Norvig, P. (2009). Artificial intelligence: a modern approach (3rd ed.). Prentice Hall. ISBN 0-13-604259-7.

  • This book comprehensively covers AI theories, principles, algorithms, and techniques. This is considered a leading AI textbook. The student may find and read the 2nd edition as well.

Wei, D. Q., Ma, Y., Cho, W. C., Xu, Q., & Zhou, F. (Eds.). (2017). Translational Bioinformatics and Its Application. Springer.

  • This is the most up-to-date and comprehensive textbook about translational bioinformatics.

Tenenbaum, J. D. (2016). Translational bioinformatics: past, present, and future. Genomics, proteomics & bioinformatics, 14(1), 31–41.

  • This review summarizes the most prominent roles of translational bioinformatics, and provides a perspective of how the field may further improve health care.

Larranaga, P., Calvo, B., Santana, R., Bielza, C., Galdiano, J., Inza, I., ... & Robles, V. (2006). Machine learning in bioinformatics. Briefings in bioinformatics, 7(1), 86–112.

  • This review describes the most popular AI algorithms in bioinformatics at the turn of the century.

Ciaburro, G. (2017). MATLAB for machine learning. Packt Publishing Ltd.

  • Licensing cost notwithstanding (Matlab is a commercial product), Matlab is without a doubt one of the most learner-friendly platforms for learning, practicing, and implementing AI in many fields.

Lantz, B. (2013). Machine learning with R. Packt publishing ltd.

  • For open-source programming, R is a good platform for practicing AI. Compared to Matlab, it trades some convenience and speed for lower cost. Many translational bioinformatics algorithms have publicly available R implementations.

Raschka, S. (2015). Python machine learning. Packt publishing ltd.

  • For open-source programming, Python is a good platform to practice AI, and provides ready integration with a range of powerful publicly available machine learning libraries.