Introduction

Since the Enlightenment, the approach to physical, chemical, and biological processes has involved the development of analytical methods that use quantitative data to build simple integrated models, leading ultimately to the prediction of drug action [1]. This was a logical approach, as it could capture a large quantity of data in a manner that was easy to store and transmit from one individual to another. However, as scientific research has become an increasingly prominent type of human activity, there has been a dramatic growth in scientific publications that report research outcomes as a combination of textual description and numerical data [2]. Most scientists spend a lot of time thinking about the best verbal ways of describing and reasoning over their results; thus, a lot of useful information and knowledge can be obtained by reading the scientific literature. As important as reading is in the life of every scientist, obtaining summative knowledge compiled from many publications is a non-scalable effort [3].

Fortunately, the advent of computer technology enables the storage and efficient processing of large amounts of data, including textual sources. The analysis of these complex data allows mechanistic inferences to be drawn that promote novel hypotheses and illuminate fundamental natural phenomena. The importance of evolving computational methods that allow consideration of a wide range of data and their implications cannot be overstated [4].

The exponentially increasing cost and decades of inefficiency in drug discovery and development have resulted in a problematic situation with respect to pharmaceutical innovation and commercialization. The overwrought drug discovery pipeline may take up to 15 years to develop a successful drug (considering hit-to-lead discovery/development, pre-clinical, and clinical studies), with an average cost estimated at $800 million to $1.5 billion [5]. This process is deemed inadequate and unsustainable, especially concerning its ability to provide therapies for diseases that affect people in poor parts of the world, such as tropical diseases, as well as those affecting a limited number of patients, such as rare diseases, because of the low potential revenue [6, 7]. A disruptive approach is required that can bring about revolutionary, not evolutionary, change equivalent to the changes that have occurred in the communications, electronics, and financial industries over the last 25 years.

Rare diseases, which are defined as conditions that affect fewer than 200,000 people in the United States or fewer than 1 in 2000 people in the European Union, are particularly in need of disruptive and revolutionary drug discovery paradigms. Although each rare disease individually affects only a small portion of the total population, their collective effect on the human population is substantial, as there are over 7000 rare diseases that together affect roughly 25–30 million people in the United States [8]. Alarmingly, very few patients can be treated with an approved medicine. Taken together, rare diseases represent a substantial burden on individuals, families, and whole economies [9, 10].

On average, developing a drug for a rare disease costs half as much as developing one for a common disease [11]. Still, considering the smaller amount of data and investment, any innovative approaches to treat these diseases will likewise be of value for drug discovery writ large. It is anticipated that once paradigms are developed for rare disease drug discovery, which offers less financial benefit than drug discovery for more prevalent diseases, drug discovery efforts in general will become more efficient [12]. In addition to financial concerns and limited patient populations, rare disease drug discovery also suffers from sparse and heterogeneous data [13], which hamper the ability to draw novel insights and treatment hypotheses. However, a growing number of rare disease registries have been incorporated into different databases [14, 15], such as Pharos [16] (https://pharos.nih.gov/), ClinVar [17, 18] (https://www.ncbi.nlm.nih.gov/clinvar/), and the Online Mendelian Inheritance in Man (OMIM) (https://omim.org/), among others [7]. While efforts have been made to promote the sharing of information within multi-disciplinary collaborations [19], there is still a need to curate and properly integrate all of this information [13, 20, 21, 22].

Computational approaches have emerged as a practical solution to accelerate drug discovery efforts and reduce costs [23, 24]. One promising approach, named Literature-Based Discovery (LBD), seeks to unlock biological observations hidden within informational sources, such as published texts and manuscripts [25]. Since 1988, when the relationship between magnesium and migraine was discovered in the literature by Swanson [26], treatment hypotheses have been generated for many other diseases, such as Parkinson’s disease [27], multiple sclerosis [28], and cancer [29]. This approach has also been used to elucidate adverse drug effects [25, 30]. As such, LBD is a powerful technology in the drug discovery arsenal.

In this chapter, we aim to review the status of available biomedical data on PubMed and describe how mining complex drug-target-disease relationships within this database could contribute to finding new targets, new repurposed medications, and novel drug candidates for rare diseases. The following discussion focuses on how accounting for complexity in drug discovery and clinical data can allow new therapies to emerge that can be rapidly screened and progressed to clinical application. The overall approach described is likely to be one component of a strategy that will regenerate pharmaceutical development and promote a rational approach to the pharmacological element of health care delivery.

Biomedical Knowledge Data in the Scientific Literature

Bioactivity data, such as the outcomes of in vivo and in vitro assays, have been growing extensively in publicly available repositories such as ChEMBL [31, 32] (https://www.ebi.ac.uk/chembl/) and PubChem [33, 34] (http://pubchem.ncbi.nlm.nih.gov/). Despite the growth of these databases, the scientific literature remains the largest repository of untapped biomedical data [2]. The United States National Library of Medicine (NLM) journal citation database (MEDLINE) is the preeminent source of biomedical literature, with ~30 million citations [35]. This database can be accessed through PubMed, a search engine maintained by NLM at the National Institutes of Health (NIH). It is possible to retrieve references for scientific articles stored in MEDLINE by querying specific terms named Medical Subject Headings (MeSH) [36], which are used to index and categorize publications stored in MEDLINE. MeSH terms encompass most drugs, targets, and diseases present in scientific publications and could potentially be used to accelerate drug discovery [37].
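As an illustration of MeSH-based retrieval, the following minimal sketch builds a PubMed search request for articles indexed with two MeSH headings. It assumes the public NCBI E-utilities `esearch` endpoint and the `[MeSH Terms]` field tag; the helper names `mesh_query` and `esearch_url` are hypothetical and introduced only for this example. The sketch constructs the URL without sending it, so it can be inspected offline.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (public API for querying PubMed).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mesh_query(*headings):
    """Join MeSH headings into one PubMed query, requiring all of them (AND)."""
    return " AND ".join(f'"{h}"[MeSH Terms]' for h in headings)


def esearch_url(term, retmax=20):
    """Build (but do not send) an esearch request returning PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Articles annotated with both a disease heading and a target heading.
query = mesh_query("Asthma", "Proto-Oncogene Proteins c-kit")
url = esearch_url(query)
print(query)
print(url)
```

Fetching the resulting URL with any HTTP client would return the matching PubMed identifiers; large-scale mining would more typically rely on bulk MEDLINE downloads than on repeated queries.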

The major approach to manipulate knowledge stored in the literature is through natural language processing, a subfield of artificial intelligence that allows computers to understand, interpret, and manipulate human language [38]. For this purpose, many dictionary-based systems that recognize passages in the literature with ontological terms have been proposed and evaluated [39]. The SciLite Annotations platform (https://europepmc.org/Annotations) provides means to link research articles with biological data through text mining [40]. In a 2016 study, text mining on PubMed and social network analysis were integrated to analyze gene-gene interactions in order to identify new potential biomarkers for breast cancer [41]. More recently, text mining has been used to analyze gene-disease associations present in PubMed by integrating MeSH terms and co-occurrence methods [42].

Drug Repurposing

As discussed in the introduction, it may take a drug up to 15 years to reach the market [43]. This process usually includes discovery and development research, preclinical studies (in vitro/in vivo evaluation), and clinical research, divided into Phase I (safety and dose evaluation in healthy individuals), Phase II (efficacy and safety in a small number of patients), Phase III (efficacy and safety in a large number of patients), and Phase IV (post-market safety monitoring). During Phase II, approximately 90% of the compounds fail due to safety concerns and poor efficacy [44].

Drug repurposing, also known as repositioning or reprofiling, is a strategy to identify novel uses for approved or investigational drugs that are outside of the original therapeutic indication [45]. Recently, this approach has been a trending topic among researchers [46] and has attracted the attention of companies due to the reduced cost associated with the low risk of failure, especially when safety evaluation has already been completed in preclinical and clinical trials [47]. Because repurposed drugs can skip safety evaluation during preclinical and Phase I studies, it is estimated that developing a repurposed drug costs on average only $300 million over a 6.5-year period [48]. In addition to reduced cost and time, approximately 30% of repurposed drugs are approved, which can be seen as a market-oriented incentive to companies [45, 49]. For comparison, the typical approval rate for drugs entering clinical trials is 9.6% [50].

Repurposing studies are very often initiated after a serendipitous observation of unexpected drug effects during clinical trials or during pharmacovigilance following release on the market [51]. Prime examples of such discoveries are the stories of sildenafil (Viagra®) [52] and thalidomide [53, 54].

Recently, a bibliographic study [55] showed that more than 60% of all approved drugs or drug candidates (∼35,000) have been tried in more than one disease, including 189 drugs that have been tried in >300 diseases each. Considering only approved drugs, more than 30% have been tested during their lifetime for at least one additional indication following their original approval [55]. Despite several success cases, drug repurposing still faces a lack of financial support due to potentially low return, lower drug prices, and short patent duration [56, 57]. Nevertheless, this approach is still considered promising, especially for rare diseases [58]. Small grant programs to help develop drugs or treatments for rare diseases are usually available from rare disease foundations [59]. The National Organization for Rare Disorders (NORD) (http://rarediseases.org/) provides recommendations to such organizations.

Using Chemotext to Infer Novel Therapies and Targets

Biological insights about the etiology of diseases, such as causative protein mutations or aberrant pathway signaling, and the potential drug treatments of these diseases are stored primarily in the biomedical literature [2]. As such, there exist biomedically relevant relationships between drugs, biological targets, and diseases, which we call DTD triangles, that lie latent within published texts [3, 60]. Using text-mining approaches, these DTD triangles can therefore be identified and extracted from the published biomedical literature [61].

Text-mining capabilities, in conjunction with the wealth of text-based data stored within PubMed, led to the development of Chemotext [62], a computational algorithm that extracts MeSH terms describing “drugs”, “targets”, and “diseases” and generates DTD triangles. Chemotext is based on the frequency with which MeSH terms of interest co-occur in abstracts of papers annotated in PubMed. Chemotext is thus an extension of Swanson’s ABC paradigm wherein “A” terms are drug (chemical) MeSH terms, “B” terms are target-associated MeSH terms, i.e., proteins and pathways, and “C” terms are disease MeSH terms (Fig. 1).

Fig. 1

Swanson’s ABC paradigm incorporated in Chemotext. Chemical A is proposed to affect disease C since both terms are associated with target B. Solid lines (edges) indicate an actual text-based relationship, while dashed lines (edges) indicate proposed connections

The underlying DTD triangle generation starts with the observation that the MeSH term of drug “A” co-occurs in the same articles as the MeSH term of target “B”, while the MeSH term of disease “C” co-occurs in the same or additional articles with the same target B. Thus, if drug A and disease C have not been mentioned together in the same article, an “A–C” connection mediated through target B can be inferred, completing a DTD triangle. This analysis, enabled by the Chemotext approach, leads to the identification of a new possible therapeutic use of drug “A”.
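The inference step described above can be sketched in a few lines. This is a simplified illustration of the ABC logic under stated assumptions, not the Chemotext implementation: each article is represented as a set of MeSH terms, the function name `infer_dtd_triangles` is hypothetical, and the two toy articles are made up for the example.

```python
from collections import defaultdict
from itertools import product


def infer_dtd_triangles(articles, drugs, targets, diseases):
    """Return (drug, target, disease) triples where drug and disease each
    co-occur with the target but never co-occur with each other."""
    drugs_by_target = defaultdict(set)     # B -> set of A terms seen with B
    diseases_by_target = defaultdict(set)  # B -> set of C terms seen with B
    known_pairs = set()                    # (A, C) pairs sharing an article

    for terms in articles:
        a_terms = terms & drugs
        b_terms = terms & targets
        c_terms = terms & diseases
        for target in b_terms:
            drugs_by_target[target] |= a_terms
            diseases_by_target[target] |= c_terms
        known_pairs |= set(product(a_terms, c_terms))

    triangles = set()
    for target, a_terms in drugs_by_target.items():
        for drug, disease in product(a_terms, diseases_by_target[target]):
            if (drug, disease) not in known_pairs:  # A-C link is only inferred
                triangles.add((drug, target, disease))
    return sorted(triangles)


# Toy corpus (made up): each set holds the MeSH terms of one article.
articles = [
    {"Imatinib", "Proto-Oncogene Proteins c-kit", "Leukemia"},
    {"Asthma", "Proto-Oncogene Proteins c-kit"},
]
triangles = infer_dtd_triangles(
    articles,
    drugs={"Imatinib"},
    targets={"Proto-Oncogene Proteins c-kit"},
    diseases={"Leukemia", "Asthma"},
)
print(triangles)
```

In this toy corpus, the drug–leukemia pair is excluded because the two terms already share an article, while the drug–asthma pair is reported as an inferred connection through the shared target.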

The utility of Chemotext is illustrated by the identification of the antineoplastic agent imatinib as a potential drug repurposing candidate for the treatment of severe refractory asthma. Imatinib is an FDA-approved tyrosine kinase inhibitor that is used in the treatment of chronic myeloid leukemia (CML). Imatinib inhibits the activity of KIT, which reduces bone marrow mast-cell numbers in patients with CML [63]. KIT is also present in lung mast cells and was hypothesized as a basis of the pathobiology of severe refractory asthma [64], which is characterized by an adverse response to traditional glucocorticoid asthma treatment [65]. Figure 2 shows how Chemotext can be used to link Imatinib (A), Proto-Oncogene Proteins c-kit (B), and asthma (C).

Fig. 2

Example showing how Chemotext connects Imatinib and Asthma with shared terms. In this example, query terms “Imatinib” and “Asthma” were searched in the Find Shared Terms module. The list of full associations was filtered by Proteins-Pathways-Intermediaries-Other. The MeSH term “Proto-Oncogene Proteins c-kit” was the fourth highest ranked shared term (two shared articles) selected as the potential biological target in the clinical outcome pathway

In 2017, a proof-of-principle trial demonstrated that imatinib reduced airway hyperresponsiveness, a physiological marker of severe asthma, as well as airway mast-cell numbers and activation in patients with severe asthma. Since this publication had not yet been entered into the MEDLINE database, it was used as a validation test of the Chemotext algorithm. Through co-occurrences of these MeSH terms in previously published studies, Chemotext was used to draw the inference between imatinib, KIT, and asthma, which constitutes a DTD triangle (Fig. 2). This case study demonstrates that Chemotext can identify drug repurposing candidates and targets through text-based inferences alone.
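A shared-term search of the kind used in this case study can be approximated as follows. This is a hedged sketch only: the scoring used by the Chemotext Find Shared Terms module is not specified here, so ranking by the smaller of the two co-occurrence counts is an assumption, as are the function name `shared_terms` and the toy articles.

```python
from collections import Counter


def shared_terms(articles, term_a, term_c):
    """Rank terms that co-occur with both query terms.

    Ranking key (an assumption, not Chemotext's documented scoring):
    the smaller of the two co-occurrence counts, ties broken by name.
    """
    with_a, with_c = Counter(), Counter()
    for terms in articles:
        others = terms - {term_a, term_c}
        if term_a in terms:
            with_a.update(others)
        if term_c in terms:
            with_c.update(others)
    shared = {t: (with_a[t], with_c[t]) for t in with_a.keys() & with_c.keys()}
    return sorted(shared.items(), key=lambda item: (-min(item[1]), item[0]))


# Toy corpus (made up): MeSH-like term sets for five articles.
articles = [
    {"Imatinib", "KIT", "Mast Cells"},
    {"Imatinib", "KIT"},
    {"Asthma", "KIT"},
    {"Asthma", "KIT", "Mast Cells"},
    {"Asthma", "Eosinophils"},
]
ranking = shared_terms(articles, "Imatinib", "Asthma")
for term, (n_with_a, n_with_c) in ranking:
    print(term, n_with_a, n_with_c)
```

Terms that appear with only one of the two query terms, such as the eosinophil term in the toy corpus, are dropped because they cannot serve as an intermediate "B" term.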

Mining Other Sources of Biomedical Data for Drug Repurposing

Mining literature data can afford rapid identification of all published studies that could confirm connections between drugs, their targets, underlying biological pathways, and diseases, including enabling new inferences of such connections [3, 60]. The elucidation of the mechanistic relationships between these connections is at the core of modern drug discovery research [61]. Currently, there are several databases with valuable information for drug discovery that could be connected to complete a DTD triangle. ChEMBL [31, 32] (https://www.ebi.ac.uk/chembl/) and PubChem [33, 34] (http://pubchem.ncbi.nlm.nih.gov/) contain many chemical–target (“A–B”) and chemical–disease (“A–C”) relationships. Other databases contain target–disease (“B–C”) associations, such as ClinVar [17, 18] (https://www.ncbi.nlm.nih.gov/clinvar/) and the Online Mendelian Inheritance in Man (OMIM) (https://omim.org/). Pharos [16] (https://pharos.nih.gov/), specifically, contains data on the whole DTD triangle for many diseases. Several databases contain parts of the triangle for rare diseases, such as Malacards [66] (http://www.malacards.org/), the National Organization for Rare Disorders (NORD) [67] (https://rarediseases.org/), the Genetic and Rare Diseases Information Center (GARD) [68] (https://rarediseases.info.nih.gov/), and the Infohub for Rare Diseases (https://rarediseases.oscar.ncsu.edu/).

Recently, NIH launched the Biomedical Data Translator program (https://ncats.nih.gov/translator), which has integrated many data sources with multiple types of content, such as diseases, patient-reported outcomes, electronic health records, microbiome, proteins, genes, and chemicals, among others. This massive project attempts to integrate currently available medical research data towards accelerated development of new treatments. The major challenge in establishing valuable connections, as in any data science project, is proper curation of the data [13, 20, 21, 22]. To establish useful relationships between these sources of data, knowledge graphs have emerged as a practical solution. A knowledge graph is a network of entities that acquires and integrates information into an ontology and applies a reasoner to derive new knowledge [68]. A 2016 study applied network-based modeling to identify promising multi-target drugs for triple negative breast cancer [11]. More recently, a study applied knowledge graphs to integrate different data sources on diseases and drugs to suggest the repurposing of 21 drugs for Autosomal Dominant Polycystic Kidney Disease (ADPKD) [68].
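To make the knowledge graph idea concrete, the sketch below stores facts from different sources as subject–predicate–object triples and applies a single hand-written rule as a stand-in for a reasoner. Everything here is illustrative: the predicate names, the source labels, and the `derive_candidates` function are assumptions, and real knowledge graphs rely on curated ontologies and far richer reasoning.

```python
from collections import namedtuple

# One edge of the graph, with the data source kept as provenance.
Triple = namedtuple("Triple", "subject predicate obj source")


def derive_candidates(triples):
    """Rule: drug inhibits target AND target implicated_in disease
    => drug candidate_for disease (a hypothesis, not a finding)."""
    inhibitors_of = {}
    for t in triples:
        if t.predicate == "inhibits":
            inhibitors_of.setdefault(t.obj, []).append(t)

    derived = []
    for t in triples:
        if t.predicate == "implicated_in":
            for drug_edge in inhibitors_of.get(t.subject, []):
                derived.append(Triple(
                    drug_edge.subject, "candidate_for", t.obj,
                    source=f"rule({drug_edge.source}+{t.source})",
                ))
    return derived


# Toy facts; "source_1" and "source_2" are placeholder source labels.
graph = [
    Triple("imatinib", "inhibits", "KIT", "source_1"),
    Triple("KIT", "implicated_in", "severe asthma", "source_2"),
]
candidates = derive_candidates(graph)
print(candidates)
```

Keeping the source on every edge reflects the curation challenge noted above: a derived hypothesis is only as reliable as the records it was assembled from.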

There has also been a growing interest in using social media to supplement established approaches for pharmacovigilance [69, 70]. The use of social media, also called “social listening”, is therefore a potential resource for repurposing. Social media has recently been used in public health to estimate trends of the cholera outbreak in the aftermath of the 2010 earthquake in Haiti [71], for seasonal influenza surveillance [72], and to detect the onset of mental illness [73]. As previously discussed, many repurposed drugs have been discovered through adverse side effects observed during clinical trials or pharmacovigilance. Many people have used social media to report adverse effects of their medications. Several studies analyzing adverse reactions on social media have been published recently [30, 74, 75], which makes social media a potential source of adverse effect data to be mined for repurposing.

Drug Repurposing and Bibliometric Analysis on Rare Diseases

Several repurposing stories for rare diseases have been reported in recent years. For instance, metformin has been studied to treat idiopathic pulmonary fibrosis [76]. A recent study suggests that inhibitors of p110β, a catalytic subunit of the phosphoinositide 3-kinase (PI3K) gene family, commonly associated with cancer, might prevent cognitive and behavioral defects and become a promising disease-modifying strategy for fragile X syndrome and other brain disorders [77]. Fenfluramine, initially proposed as an appetite suppressant and withdrawn from the market, has been submitted to the FDA for the treatment of Dravet syndrome [78].

Many computational approaches historically applied for drug discovery, such as quantitative structure-activity relationships (QSAR) modeling, similarity search, and molecular docking, have been successfully applied for drug repurposing as well [79]. Computational drug repurposing approaches have been widely applied to neglected tropical diseases [80,81,82,83,84], and, more recently, to rare diseases [58, 83]. eMatchSite, a platform for comparing drug-binding sites, has been applied to propose repurposing a steroidal aromatase inhibitor to treat Niemann-Pick disease type C [85]. A structure-based virtual screening approach has been applied to screen FDA-approved drugs on ENGase, a potential target for the treatment of N-Glycanase (NGLY1) deficiency. The authors experimentally confirmed the activity of rabeprazole (IC50 = 4.4 μM) on ENGase, making it a promising treatment for patients suffering from NGLY1 deficiency [69].

Mining literature data allows the exploitation of opportunities to reposition known drugs interacting with proteins associated with diseases [3, 60]. The integration of data on drug-target-disease to form networks has become a valuable approach for computational drug repositioning research [86]. Recently, a study has used bioinformatics methods and bibliographic research to propose the repositioning of some drugs as potential competitors against idiopathic pulmonary fibrosis [87].

As of June 2019, there were 244,911 references containing the term “rare disease” anywhere in the text and 17,134 references with the term “rare disease” in the title or abstract. Here, we performed a brief bibliometric analysis on drug repurposing for rare diseases, similar to the one that was recently published by Baker et al. [55]. We mined PubMed, building on earlier text-mining work [37], to identify articles in which a chemical entity was described in terms of its therapeutic association with a rare disease. We determined this relationship by examining the MeSH annotations in a stepwise manner (described in the supplementary material online). All drug–disease combinations were extracted, along with the year the article was published, into a separate dataset. This set included citations with no abstract and those in languages other than English, as long as they were annotated and the annotations met the criteria.
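The extraction of drug–disease combinations can be pictured with the following sketch. The actual stepwise criteria are given in the supplementary material and are not reproduced here; as a stand-in, this example assumes a record counts when a chemical heading carries the MeSH qualifier "therapeutic use" and a rare disease heading carries the qualifier "drug therapy". The record layout, the `extract_pairs` function, and the placeholder headings are all hypothetical.

```python
def extract_pairs(records, rare_diseases):
    """Collect (drug, disease, year) triples from annotated citations.

    Assumed criterion (a simplification, not the chapter's protocol):
    drug heading has qualifier "therapeutic use" and rare disease
    heading has qualifier "drug therapy" in the same citation.
    """
    pairs = set()
    for rec in records:
        drugs = [h for h, quals in rec["mesh"] if "therapeutic use" in quals]
        diseases = [
            h for h, quals in rec["mesh"]
            if h in rare_diseases and "drug therapy" in quals
        ]
        for drug in drugs:
            for disease in diseases:
                pairs.add((drug, disease, rec["year"]))
    return sorted(pairs)


# Toy citations with placeholder headings; no abstract text is needed,
# only the MeSH annotations and the publication year.
records = [
    {"year": 2015, "mesh": [("Drug X", {"therapeutic use"}),
                            ("Rare Disease Y", {"drug therapy"})]},
    {"year": 2017, "mesh": [("Drug X", {"adverse effects"}),
                            ("Rare Disease Y", {"genetics"})]},
]
pairs = extract_pairs(records, rare_diseases={"Rare Disease Y"})
print(pairs)
```

Because the sketch reads only annotations, it treats citations without abstracts or in other languages the same way as any other annotated record.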

In our analysis, we found that only 1267 out of more than 7000 rare diseases have been studied in association with a chemical entity. It was known that only a small fraction of rare diseases have associated treatments, but our findings reveal that there is still a major gap in research for rare diseases, since many of them have not been associated with any chemical entity as a potential treatment. These findings reinforce the need to expand research on the development of novel therapies for rare diseases. As one can see in Fig. 3, 6570 out of 12,376 chemicals (53%) have been associated with only one rare disease, while 4796 (39%) have been associated with two to ten diseases, 984 (8.0%) have been associated with eleven to 100 diseases, and 26 (0.21%) chemicals with more than 100 diseases.

Fig. 3

Distribution of chemicals tested in rare diseases mined from PubMed
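The shares in this distribution can be recomputed directly from the counts reported in the text. The short sketch below does so and also shows one way the bins could be assigned from a per-chemical disease count; the bin labels and the `bin_label` helper are naming choices made for this example.

```python
# Number of chemicals per bin, as reported in the text.
counts = {"1": 6570, "2-10": 4796, "11-100": 984, ">100": 26}


def bin_label(n_diseases):
    """Map the number of rare diseases linked to a chemical to its bin."""
    if n_diseases == 1:
        return "1"
    if n_diseases <= 10:
        return "2-10"
    if n_diseases <= 100:
        return "11-100"
    return ">100"


total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label:>7} diseases: {counts[label]:>5} chemicals ({pct:.2f}%)")
```

The four bins sum to the 12,376 chemicals mentioned in the text, which provides a quick consistency check on the reported counts.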

Table 1 shows the top 30 drugs that were tested for rare diseases. Sixteen out of 30 were among the most tested drugs in the previous study [55]. As one can see, most of these drugs are used to suppress the immune system and/or to decrease inflammation, such as glucocorticoid medications (prednisone, prednisolone, dexamethasone, methylprednisolone, hydrocortisone, and cortisone), cancer chemotherapy agents (cyclophosphamide, bevacizumab, methotrexate), and medications used to prevent transplant rejections (sirolimus, rituximab, cyclosporine). The rare diseases with the most publications and chemicals tested are presented in Table 2. Most of these diseases are rare forms of cancer, such as sarcomas, neoplasms, and multiple forms of carcinoma, which explains why most of the most studied drugs in Table 1 are immunosuppressant, anti-inflammatory, and anticancer drugs. Surprisingly, none of the most studied drugs were used in some of the most studied diseases, such as malaria, tuberculosis, and Alzheimer’s disease.

Table 1 Top 30 drugs most tested in rare diseases with publications count
Table 2 Top 30 rare diseases ranked by number of publications and chemicals tested

Final Remarks

There is an urgent need for the development of treatments or cures for rare diseases. The complexity of biological systems and the nature of drug discovery make iterative mechanistic strategies costly and inefficient. Current developments in databases, text mining, and machine learning tools allow efficient and inexpensive navigation through inferences to the identification of novel or repurposed drug candidates. The same principles can be employed to traverse the complexity of drug delivery systems and biopharmaceutical principles that result in optimal drug disposition to achieve the desired therapeutic effect. In this manner, the development of novel pharmaceutical treatment options can focus on the generation of data suited to regulatory scrutiny and positive clinical outcomes without investment in the tangential iterative data generation that has historically been required to support statistical validation of the action, process, or clinical observations that surround the optimal approach.