1 Introduction

The vast number of scientific publications (25 million citations in PubMed) provides an extremely valuable resource for researchers when the information is analyzed automatically. Background knowledge integrated with annotated content in biological databases (such as proteins in UniProt) or in repositories of gene function or protein-protein interactions is fundamental for hypothesis generation. A comprehensive review of the latest advances in automated literature analysis for biomedical research can be found in [1]. Our goal is to apply text mining for hypothesis generation, using proteostasis and cancer as a case study.

Ubiquitin modifies proteins and thereby targets them to a new cellular localization, for example the proteasome for degradation. E1 ubiquitin-activating enzymes (two in the human genome), E2 ubiquitin-conjugating enzymes (~30 in the human genome), and finally E3 ubiquitin ligases (~600 in the human genome) conjugate ubiquitin through sequential actions [2–4]. Specificity is mainly provided by the E3 ubiquitin ligases, which likely explains the association of specific E3 ligases with diseases. E3 ligases have been suggested as both biomarkers and targets for cancer therapy [5]. Deubiquitinases (DUBs), ~100 in the human genome, cleave ubiquitin from modified proteins [3]. Like E3 ligases, DUBs have also been suggested as both biomarkers and targets for cancer therapy [5]. DUBs can regulate both oncogenes and tumor suppressors. Aberrant DUB activity, both gain and loss of function by mutation and/or altered expression, can promote cancer. DUBs associated with cancer have been described as specific for their target proteins. Evidence suggests that DUB specificity may depend on tissue type and stage of malignancy, which makes it difficult to assess the general role of DUBs in tumorigenesis.

The success of bortezomib (Velcade™), a proteasome inhibitor used for the treatment of patients with relapsed or refractory multiple myeloma, has focused the attention of cancer biologists on potential cancer treatment strategies that target proteostasis. Examples of such strategies are listed below.

  1.

    When the production of misfolded proteins exceeds degradation, as often occurs in damaged or aging cells, or in cells exposed to chemical agents that perturb protein folding or the endoplasmic reticulum (ER) quality control (ERQC) pathway, ER-associated degradation (ERAD) is elicited. Two types of small molecules affect the ERQC pathway and can be used to modulate ER stress and trigger apoptosis: (a) pharmacologic chaperones, which enhance proteostasis by binding to and stabilizing specific proteins, and (b) proteostasis regulators, which increase the capacity of the proteostasis network.

    Certain cancer cells with high secretory capacities and basal levels of ER stress have been shown to be more sensitive to ER stress-induced cell death (e.g., multiple myeloma) [6, 7]. Bortezomib, a proteasome inhibitor that inhibits the chymotrypsin-like activity of the proteasome, is approved for the treatment of mantle cell lymphoma and relapsed or refractory multiple myeloma [8, 9]. The effect of bortezomib involves many pathways, some of which are linked to the unfolded protein response [10–13] and others to protein factors such as p53 [14, 15] and NFκB [16].

  2.

    Inhibition of the p97 ATPase, which is required for extraction from the ER membrane and subsequent transfer to the proteasome, by the drug Eeyarestatin I can induce cell death in hematologic cancer cells [17]. Eeyarestatin I has effects similar to those of bortezomib, such as accumulation of polyubiquitinated proteins and ER stress, causing downregulation of histone H2A ubiquitination with subsequent Noxa activation and cell death [1].

  3.

    Alternatively, DUBs can also be targeted, and some cancer cells are more susceptible to specific DUB inhibition than noncancer cells. This is also referred to as synthetic lethality. DUBs may regulate the stability of key oncogenes, as exemplified by USP28 stabilization of c-Myc. DUBs can also negatively regulate ubiquitin-dependent signaling cascades such as the NF-κB activation pathway [18].

  4.

    Aberrant regulation of some E3 ligases is associated with cancer development [5]. Furthermore, cancer cells frequently overexpress E3 ligases, and this correlates with increased chemoresistance and poor prognosis. E3 ligases are “druggable” and therefore potential cancer targets. Additionally, E3 ligases serve as cancer biomarkers. For example, germline mutations in the E3 ligase BRCA1 increase the predisposition to breast cancer [19]. Another example is MDM2, which targets the tumor suppressor p53 for degradation [20, 21].

In conclusion, proteostasis proteins are found to be aberrantly regulated at the expression level and mutated in cancer cells. Furthermore, proteostasis proteins are being targeted for cancer therapies. We therefore perform text mining on abstracts in PubMed to provide an overview of the most studied protein factors in connection with different cancer types.

2 Materials

The data mining was performed using the statistical programming language R. The libraries and data resources listed below were used:

  1.

    Package “RISmed” for PubMed search.

  2.

    Package “tm” for text mining.

  3.

    Package “wordcloud” for graphical display.

  4.

    Names of frequently studied proteins of the ubiquitin system were extracted from http://www.sabiosciences.com/rt_pcr_product/HTML/PAHS-3079Z.html.

  5.

    Names of proteasome factors and DUBs were downloaded from the online database HUGO.

  6.

    The list of disease-gene associations was obtained from the DISEASES resource available at http://diseases.jensenlab.org/ [22].

3 Methods

The aim of the computer-assisted text mining approach presented here is to obtain a quick overview of the factors in the ubiquitin-proteasome system and their association with cancer (see Note 1). Furthermore, the generated wordclouds are useful for displaying the most important terms related to specific ubiquitin-proteasome factors. We do not provide detailed steps for the analysis, since these would quickly become outdated; the manuals of the described software tools remain the best source of details on the computational steps.

3.1 Obtaining Text Corpus from PubMed

  1.

    Use PubMed directly, eUtils, or the R package “RISmed” to download abstracts for each of the ubiquitin and proteasome factors of interest. We used the search term “XXX AND (leukemia OR cancer OR lymphoma)” for the analysis presented here, where XXX is replaced with one of the ubiquitin and proteasome factors, e.g., “BRCA1 AND (leukemia OR cancer OR lymphoma)” (see Notes 2–4).

  2.

    Use the R command grep to filter the retrieved PubMed abstracts (use the R command “?grep” to obtain the grep manual). We chose to keep only entries that contain the ubiquitin and proteasome factors in either the abstract or the title. More sensitivity can be obtained by also including abstracts that have ubiquitin and proteasome factors as keywords. However, the context of ubiquitin and proteasome factors in relation to cancer becomes obscure without access to the full text of the paper.
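The two steps above can be sketched in R as follows. This is a minimal sketch under stated assumptions, not the exact code used for the analysis: the helper names build_query, fetch_records, and filter_records and the toy records are hypothetical, and the word-boundary pattern is only one possible way to apply grep. fetch_records assumes the RISmed functions EUtilsSummary, EUtilsGet, ArticleTitle, and AbstractText and needs network access, so it is defined here but not called.

```r
# Build one PubMed query per factor, following the search pattern in the text.
build_query <- function(factor_name) {
  paste0(factor_name, " AND (leukemia OR cancer OR lymphoma)")
}

# Hypothetical helper: download titles and abstracts with RISmed.
# Requires the RISmed package and network access; not executed in this sketch.
fetch_records <- function(factor_name, retmax = 1000) {
  hits    <- RISmed::EUtilsSummary(build_query(factor_name), type = "esearch",
                                   db = "pubmed", retmax = retmax)
  fetched <- RISmed::EUtilsGet(hits)
  data.frame(title    = RISmed::ArticleTitle(fetched),
             abstract = RISmed::AbstractText(fetched),
             stringsAsFactors = FALSE)
}

# Keep only records that mention the factor as a whole word
# in the title or in the abstract.
filter_records <- function(records, factor_name) {
  pattern     <- paste0("\\b", factor_name, "\\b")
  in_title    <- grepl(pattern, records$title, ignore.case = TRUE, perl = TRUE)
  in_abstract <- grepl(pattern, records$abstract, ignore.case = TRUE, perl = TRUE)
  records[in_title | in_abstract, , drop = FALSE]
}

# Toy records (made up) standing in for a PubMed download.
toy <- data.frame(
  title    = c("GAN regulates NFkB ubiquitination",
               "Lung tumour survey",
               "Organ preservation in oncology"),
  abstract = c("Gigaxonin is an E3 ligase adaptor.",
               "Cohort study without gene-level analysis.",
               "No ubiquitin factor is discussed."),
  stringsAsFactors = FALSE
)

query    <- build_query("BRCA1")
filtered <- filter_records(toy, "GAN")
print(query)
print(filtered$title)
```

In the toy data only the first record is kept: the second has no mention in title or abstract (as would be the case for a hit on an author name), and the third contains “gan” only inside the word “Organ”.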

Based on the text corpus retrieved above, Fig. 1 displays the association of ubiquitin ligase complexes, DUBs, and proteasome factors with different cancer types. This plot was created with the R graphical command barplot but could also be produced in Excel or another software tool. We infer from Fig. 1 that BRCA1 and BRCA2 are the most frequently described factors in ubiquitin ligase complexes and are mainly associated with breast, ovarian, and prostate cancer. We further see that different factors in ligase complexes, DUBs, and proteasome factors are associated with different cancer types.

Fig. 1. Text mining-inferred association of different factors in ligase complexes (a), deubiquitinases (DUBs) (b), and proteasome factors (c) with different cancer types
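A plot of this kind can be approximated with base R alone. The sketch below is illustrative only: the abstracts are made up, the helper count_mentions is hypothetical, and counting abstracts that mention a cancer type is an assumed simplification of how the co-occurrences behind Fig. 1 may have been tallied.

```r
# Toy abstracts per factor (made up for illustration).
abstracts <- list(
  BRCA1 = c("BRCA1 mutations in breast cancer",
            "Ovarian and breast cancer risk",
            "BRCA1 in prostate cancer"),
  MDM2  = c("MDM2 amplification in leukemia",
            "MDM2 and p53 in breast cancer")
)
cancer_types <- c("breast", "ovarian", "prostate", "leukemia")

# Number of abstracts that mention each cancer type at least once.
count_mentions <- function(texts, terms) {
  vapply(terms,
         function(term) sum(grepl(term, texts, ignore.case = TRUE)),
         integer(1))
}

# Rows are cancer types, columns are factors.
counts <- sapply(abstracts, count_mentions, terms = cancer_types)
print(counts)

# One stacked bar per factor, subdivided by cancer type.
barplot(counts, legend.text = TRUE,
        ylab = "Abstracts mentioning cancer type")
```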

3.2 Wordclouds

Wordclouds are useful to obtain a visual overview of the most important terms in a text corpus. The wordclouds in Figs. 2 and 3 were created using the R packages “tm” and “wordcloud” (loaded with library(tm) and library(wordcloud)) by running the following steps.

Fig. 2. Wordcloud for the search term “BRCA1”

Fig. 3. Wordcloud for the search term “MDM2”

  1.

    myCorpus = Corpus(VectorSource(TextBRCA1)) # (see Note 5)

  2.

    myCorpus = tm_map(myCorpus, removeWords, stopwords("english")) # (see Note 6)

  3.

    myCorpus = tm_map(myCorpus, removePunctuation)

  4.

    myDTM = TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))

    m = as.matrix(myDTM)

  5.

    v = sort(rowSums(m), decreasing = TRUE)

  6.

    wordcloud(names(v), v, scale = c(5, 0.5), max.words = 200, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
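To make clear what the frequency vector v from step 5 contains, the following base R sketch computes a comparable term-frequency table without the “tm” package. It is an approximation under stated assumptions: the function term_frequencies, the short stopword list, and the two example sentences are hypothetical, and tokenization in “tm” may differ in detail.

```r
# Two made-up sentences standing in for retrieved abstracts.
texts <- c("BRCA1 mutations increase breast cancer risk.",
           "Breast and ovarian cancer: BRCA1 and BRCA2 mutations.")

# Short illustrative stopword list (tm's stopwords("english") is much longer).
stop_words <- c("and", "the", "of", "in", "can", "use", "one")

term_frequencies <- function(texts, stop_words, min_length = 3) {
  # Lower-case the text and replace punctuation with spaces.
  cleaned <- gsub("[[:punct:]]+", " ", tolower(texts))
  # Split on whitespace and pool the tokens of all documents.
  tokens  <- unlist(strsplit(cleaned, "[[:space:]]+"))
  # Drop short tokens and stopwords.
  tokens  <- tokens[nchar(tokens) >= min_length & !(tokens %in% stop_words)]
  # Count each term and sort by decreasing frequency.
  sort(table(tokens), decreasing = TRUE)
}

v <- term_frequencies(texts, stop_words)
print(v)
```

The names and counts of such a table play the same role as names(v) and v in step 6.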

The resulting wordclouds may contain near-duplicates such as “mutation” and “mutations”. This can be resolved by using the command “myCorpus = tm_map(myCorpus, stemDocument)”. We find that the command “myCorpus = tm_map(myCorpus, removeNumbers)” has the unwanted side effect of removing all digits, so that BRCA1 becomes BRCA. However, for the two provided examples there was no need to remove numbers.
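The side effect on gene symbols can be reproduced with base R. The sketch below is an illustration, not part of the original analysis: the first substitution mimics the digit removal described above, and the second is a hypothetical alternative that removes only free-standing numbers.

```r
terms <- c("BRCA1 mutation in 50 patients", "p53 and cdk4")

# Removing every digit also strips the digits from gene symbols.
drop_all_digits <- gsub("[[:digit:]]+", "", terms)

# Removing only numbers that stand alone as a word keeps gene symbols intact.
drop_standalone <- gsub("\\b[[:digit:]]+\\b", "", terms, perl = TRUE)

print(drop_all_digits)
print(drop_standalone)
```

If such a pattern were needed inside a “tm” pipeline, it could presumably be wrapped with content_transformer and passed to tm_map; this has not been checked against the corpus used here.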

It is reassuring to see that the text mining and wordcloud for BRCA1 display terms like breast, ovarian, mutations, and genetic. This directly provides the valuable information that BRCA1 is associated with breast and ovarian cancers and with genetic predisposition. The association with other proteins such as BRCA2, p53, PARP, and ATM is also informative.

Figure 3 shows the wordcloud for MDM2. We observe that MDM2 is associated with factors such as p53/TP53, bcl2, p21, nutlin3, cdk4, and HDM2. Furthermore, more general protein families such as kinases and cyclins appear, and the term apoptosis is more abundant than in the above analysis for BRCA1 (compare Figs. 2 and 3).

3.3 Gene Associations with Diseases

We next address more rigorous approaches to extracting text for a corpus. We reuse the DISEASES resource, which lists disease-gene associations (http://diseases.jensenlab.org/). The disease-gene associations are calculated by a scoring scheme that simultaneously takes into account co-occurrences at the level of abstracts as well as individual sentences [22]. In contrast to the analysis above, we here analyzed gene associations with a larger set of annotated diseases.

Figure 4 shows the associations of the top factors in ubiquitin ligase complexes with diseases, based on the DISEASES resource. The height of each bar corresponds to the counts in the unfiltered text mining matrix (human_disease_textmining_full.tsv), and the subdivision is made by normalizing with the confidence of each disease for a specific protein factor, as reported in the filtered text mining matrix (human_disease_textmining_filtered.tsv). It is evident that a large number of diseases have been associated with factors in ubiquitin ligase complexes, and cancer clearly ranks as the disease most confidently associated with the top factors in ubiquitin ligase complexes.

Fig. 4. Association of the top factors in ubiquitin ligase complexes with diseases, based on the DISEASES resource
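The bar construction described for Fig. 4 can be sketched as follows. This is one possible reading of the normalization, shown on made-up values: the data frames full and filtered are toy stand-ins for the two DISEASES files, their column names (gene, disease, confidence) are assumptions, and the real files would first have to be read in (e.g., with read.delim) and their column layout checked.

```r
# Toy stand-in for the unfiltered matrix: one row per gene-disease pair.
full <- data.frame(
  gene    = c("BRCA1", "BRCA1", "BRCA1", "BRCA1", "MDM2", "MDM2"),
  disease = c("breast cancer", "ovarian cancer", "prostate cancer",
              "anemia", "sarcoma", "leukemia"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the filtered matrix with a confidence per association.
filtered <- data.frame(
  gene       = c("BRCA1", "BRCA1", "MDM2"),
  disease    = c("cancer", "anemia", "cancer"),
  confidence = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Bar height: number of entries per gene in the unfiltered matrix.
heights <- table(full$gene)

# Confidence per disease (rows) and gene (columns); absent pairs become 0.
conf <- unclass(xtabs(confidence ~ disease + gene, data = filtered))

# Share of each disease within a gene, then scaled to the bar height.
shares   <- sweep(conf, 2, colSums(conf), "/")
segments <- sweep(shares, 2, as.numeric(heights[colnames(shares)]), "*")
print(segments)
```

Each column of segments sums to the corresponding bar height, so passing the matrix to barplot would give one stacked bar per factor.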

A similar analysis for DUBs revealed that they are also associated with many diseases; cognitive disease, inflammatory diseases, and cancer are among those most strongly associated with DUBs (Fig. 5).

Fig. 5. Association of the top deubiquitinases (DUBs) with diseases, based on the DISEASES resource

The top ten proteasome factors were found to be associated with mainly infectious disease, cancer, and vascular diseases (Fig. 6).

Fig. 6. Association of the top proteasome factors with diseases, based on the DISEASES resource

3.4 Discussion of Results

The strongest association between the chosen genes in this analysis and a specific type of cancer was found for BRCA1 and breast and ovarian cancers (Fig. 1). Mutations in BRCA1 increase susceptibility to breast and ovarian cancers, which is reflected in the predominance of citations associating these two types of cancer with BRCA1 (Figs. 1 and 2). BRCA1 participates in the cellular response to DNA damage as a sensor molecule and as an effector by transcriptional regulation of genes [23]. The E3 ligase activity of BRCA1 is achieved by heterodimerization through its amino-terminal RING (really interesting new gene) domain with a RING partner, BARD1 [24]. We do not observe BARD1 in the wordcloud presented in Fig. 2, suggesting that few studies have focused on the role of BARD1 in the E3 ubiquitin ligase activity of BRCA1. A specific mechanism of the DNA damage response of BRCA1 involves ubiquitinylation of claspin, an essential activator of the CHK1 checkpoint kinase, by BRCA1, triggering homology-directed DNA repair [25]. Despite the similarities between the phenotypes induced by disruption of BRCA1 or BRCA2, the two proteins have distinct functions in the biological response to DNA damage [23]. BRCA2 is a mediator of the recombinase RAD51, and its role in the DNA damage response is mechanistically distinct from that of BRCA1.

MDM2 is an E3 ubiquitin-protein ligase of the RING finger class that mediates ubiquitination of the tumor suppressor p53/TP53, regulating its stability and activity [26–28]. After BRCA1 and BRCA2, MDM2 was the E3 ligase with the next highest number of co-occurrences with cancer in the retrieved PubMed abstracts, largely due to MDM2’s role in the regulation of p53 (Figs. 1 and 3). Inactivating p53 mutations occur in more than 50 % of human tumors. Variations in MDM2 due to single nucleotide polymorphisms, overexpression, or amplification impact the ubiquitination levels of p53 and consequently p53 degradation. Such variations therefore give rise to tumor-prone phenotypes. Several compounds, such as nutlin-3, have been designed to inhibit the MDM2 E3 ubiquitin ligase. In fact, an analog of nutlin-3 is in phase I trials in patients with solid tumors or leukemia [29]. The MDM2-p53 interaction illustrates how targeting the ubiquitin system and its factors can potentially succeed in drug development against cancer. Another interactor of p53 is the promyelocytic leukemia (PML) tumor suppressor protein, a central regulator of cell proliferation and apoptosis. PML figures as one of the top ten factors in E3 ligase complexes associated with different types of cancer (Fig. 1a). PML protects p53 from MDM2-mediated ubiquitination and degradation, and from inhibition of apoptosis [30].

A group of other protein factors containing a RING finger domain, such as the X-linked inhibitor of apoptosis (XIAP) and the Casitas B-lineage lymphoma (CBL) protein family, are among the top factors in E3 ligase complexes associated with cancer (Fig. 1a). XIAP inhibits the activity of the cell death proteases caspase-3, -7, and -9, and promotes the degradation of active-form caspase-3, mediated by its RING finger domain acting as an E3 ubiquitin ligase [31]. XIAP mediates oncogenic signaling by the ubiquitination of TGF-beta-activated kinase 1 (TAK1), enabling TGF-beta to activate p65/RelA and to induce the expression of prometastatic and prosurvival genes in 4T1 breast cancer cells [32]. The small CBL family of ubiquitin E3 ligases, c-Cbl, Cbl-b, and Cbl-c, regulates signaling through an N-terminal tyrosine-kinase-binding (TKB) domain composed of three different subdomains: a four-helix bundle (4H), a calcium-binding EF hand, and a divergent SH2 domain. The TKB domain is followed by a RING finger and a proline-rich domain, which mediate a myriad of interactions [33]. Cbl proteins interact through the TKB domain with tyrosine kinases such as the v-src oncogene, a preferential target of Cbl-c for degradation [34], thereby inhibiting its oncogenic activity. Indeed, Cbl-b predicts better prognosis in RANK-expressing breast cancer patients [35].

The following two examples, the Von Hippel-Lindau (VHL) disease tumor suppressor gene and gigaxonin (GAN), are elements of ubiquitylation complexes that act upon key players in cancer-driven mechanisms. VHL is found mutated in a variety of tumors, including clear cell carcinomas of the kidney, pheochromocytomas, and vascular tumors of the central nervous system and retina [36]. Under normoxic conditions, hydroxylated hypoxia-inducible factor (HIF) is recruited by the von Hippel-Lindau ubiquitination complex, leading to its ubiquitination and degradation [37]. GAN, a ubiquitin E3 ligase adaptor, and p16 protein expression contributed to senescence of cisplatin-treated cells through NFκB ubiquitination. Increased nuclear p16 expression correlates with enhanced survival of head and neck cancer patients [38].

The manual validation of the text mining performed on proteostasis factors and cancer confirms the effectiveness of these approaches for assessing, in an organized and focused manner, the vast literature on a specific subject.

4 Notes

  1.

    We apply a text mining approach, which means that the results presented are obtained semi-automatically. That is, we have not extensively manually validated the extracted words for every abstract. This means that a few terms are likely to be extracted in the wrong context. Nevertheless, we are only interested in the most abundant terms, so a few errors are unlikely to corrupt the overall picture. We have, however, performed basic validation as mentioned in the following notes.

  2.

    We directly used the official protein names for the proteostasis factors. More sophisticated approaches could also include protein descriptions and synonymous protein names.

  3.

    The keyword “AND” is very important. If it is not included, all abstracts mentioning cancer, leukemia, or lymphoma will be retrieved. The keyword “cancer” captures a large group of cancers such as lung cancer, stomach cancer, and breast cancer. For simplicity, we chose to include only “leukemia” and “lymphoma” as additional cancer types, but in principle the list of cancer keywords could be longer.

  4.

    The search terms described will also match keyword and author fields. The hits on keywords provide more sensitivity, but the hits on author fields give false matches. For example, the ubiquitin E3 ligase adaptor “GAN” matches many abstracts with this author name. We therefore subsequently use the grep command to filter this first text corpus. An alternative approach could be to use the PubMed field tag “[Text Word]” to avoid matches on authors.

  5.

    Running the command typeof(TextBRCA1) should give the output "character".

  6.

    The wordcloud still contains some noninformative words such as use, can, and one (see Fig. 2). These can be filtered away by using the command: myCorpus = tm_map(myCorpus, removeWords, c("use", "can", "one")).
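Notes 2–4 suggest two refinements of the search term: including synonymous protein names and restricting matches to text fields. A minimal sketch of how such a query string could be assembled is given below. The function build_tagged_query is hypothetical and was not used for the analysis presented here; it assumes the PubMed field tag “[tiab]” (title/abstract), which is a stricter choice than the text-word tag mentioned in Note 4.

```r
# Combine synonyms with OR, tag each one with a PubMed field tag,
# and join the result to the cancer terms with AND.
build_tagged_query <- function(symbols, tag = "tiab",
                               cancer_terms = c("leukemia", "cancer", "lymphoma")) {
  factor_part <- paste0(symbols, "[", tag, "]", collapse = " OR ")
  cancer_part <- paste(cancer_terms, collapse = " OR ")
  paste0("(", factor_part, ") AND (", cancer_part, ")")
}

# MDM2 and its synonym HDM2, restricted to title and abstract.
tagged_query <- build_tagged_query(c("MDM2", "HDM2"))
print(tagged_query)
```

Restricting the factor to title and abstract at search time would make the later grep filtering largely redundant, at the cost of losing the additional hits on keywords described in Note 4.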