1 Introduction

The deluge of biomedical data in the last few years, partially caused by the advent of high-throughput gene sequencing technologies, has been a primary motivation for efforts related to curating, integrating, publishing, querying and visualising biomedical data [7, 12]. The biomedical research domain encompasses a wide range of spatial and temporal scales, from genes to organism through protein, cell, tissue, and organ, as well as from molecular events to human lifetime through cell signalling, diffusion, motility, mitosis and protein turnover. Information available at those different scales is organised in data resources where each data resource mainly specialises in a particular type of data [8]. The result is a large number of established online datasets that describe human biology.

Nevertheless, efficient and comprehensive search across these datasets can become quite problematic, since similar data is located in many distributed datasets and is usually available in different data models and formats [9, 10, 11, 25]. As a result, an individual scientist may have to search several databases manually, take the results returned, change their format and paste them into the next database in search of an answer. Such a procedure is cumbersome and does not contribute to efficient scientific workflows.

The semantic connectivity between biomedical data constitutes a critical issue of biomedical scientific research and has been successfully exploited in a number of research projects for translational medicine and drug discovery [17]. Moreover, the adoption of linked data technologies will allow the integration of biomedical datasets provided by different and heterogeneous data sources (i.e. research groups, libraries, databases), as well as the provision of an aggregated view of the biomedical data in a machine-readable and semantically enriched way that will facilitate reuse [21].

At the schema level, these resources mainly consist of domain ontologies and terminological resources [15, 23]. Jimeno-Yepes et al. [16] propose a loose coupling between domain ontologies and lexica, which can neither be treated with the same techniques nor simply merged into a single resource [20]. The terms vocabulary, dictionary and lexicon are used interchangeably and denote a compendium of words enriched with information on their usage [14], whereas a domain ontology is an explicit specification of a conceptualisation.

In a recent study, the scope and the size of the terminological resources have been estimated taking into consideration the semantic domain covered by a specific resource [22]. This analysis – for the first time – quantified the “Lexeome”, i.e. the full range of terms provided from the terminological (and ontological) resources to give an upper estimate of entities captured in semantic resources.

In this paper the focus lies on introducing biomedical resources, especially ontologies, repositories and other data resources relevant in the context of Drug Discovery and Cancer Chemoprevention. We monitor the transformation of content into the triple representation and quantify the available content. The analysis gives an overview of which resources have to be considered and what amount of data requires integration, and it provides the opportunity to tailor semantic solutions to specific needs in terms of size and performance.
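To make the notion of a triple representation concrete, the following minimal sketch flattens one record into subject-predicate-object triples and counts them. The record, the `ex:` prefix and the field names are hypothetical and serve illustration only; real resources use their own vocabularies and identifiers.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

def record_to_triples(subject: str, record: Dict[str, object]) -> List[Triple]:
    """Flatten one key/value record into (subject, predicate, object) triples."""
    triples: List[Triple] = []
    for field, value in record.items():
        # A multi-valued field yields one triple per value.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            triples.append((subject, f"ex:{field}", str(v)))
    return triples

# Hypothetical record, not taken from any real database.
record = {"label": "example gene", "organism": "Homo sapiens", "synonym": ["G1", "G-1"]}
triples = record_to_triples("ex:gene/1", record)

# Quantification: number of triples and number of distinct subjects (entities).
entities = {s for s, _, _ in triples}
print(len(triples), "triples describing", len(entities), "entity")
```

Counting triples and distinct subjects in this way corresponds to the two quantities (triples and entities) reported for each resource in the tables.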

2 Biomedical Ontologies

There are several initiatives that address the need to standardise biomedical data. The first standard terminology, namely the International Classification of Diseases (ICD), was created in 1893Footnote 1. Since then, several terminologies have been created. However, emphasis was placed only on ensuring that there were enough terms to cover the domain of focus. Over time, terminologies have advanced from simple lists and hierarchies of terms to formal representations of concepts in a semantically standardised structure. Terminologies that use formal representations and are usable by computers are often called “ontologies” [6, 18].

In contrast to manually created hierarchical organisations of terms (referred to as taxonomies), ontologies make use of formal structures, relations and definitions to provide a conceptualisation of domain knowledge. A large collection of biomedical ontologies, or bio-ontologies, is available nowadays through services such as BioportalFootnote 2 and OBO foundryFootnote 3. These have mostly been developed as joint efforts by communities to enable easy integration of biomedical data from both the literature and publicly available biomedical databases. This section highlights the most well-studied and prominent ontologies applicable to biomedical research and especially relevant for Drug Discovery and other scenarios. Furthermore, several general ontologies used for medical and clinical terms are also investigated in order to provide insights into how data can be represented.

These ontologies fall into three main categories, namely (1) Biomedical Ontologies, (2) Drugs and Chemical Compound Ontologies and (3) Generic and Upper Ontologies. The Biomedical Ontologies are mainly used by biomedical applications and define the basic biological structures (e.g. genes, pathways etc.). The Drugs and Chemical Compound Ontologies are related to clinical drugs and their active ingredients. Finally, the Generic and Upper Ontologies describe general concepts that many biomedical ontologies share.

Biomedical Ontologies cover (amongst others): (1) Advancing Clinico-Genomic Trials on Cancer (ACGT) Master Ontology (MO)Footnote 4, (2) Biological Pathway Exchange (BioPAX)Footnote 5, (3) Experimental Factor Ontology (EFO)Footnote 6, (4) Gene Ontology (GO)Footnote 7, (5) Medical Subject Headings (MeSH)Footnote 8, (6) Microarray Gene Expression Data Ontology (MGED)Footnote 9, (7) National Cancer Institute (NCI) ThesaurusFootnote 10, (8) Ontology for Biomedical Investigations (OBI)Footnote 11, (9) Unified Medical Language System (UMLS)Footnote 12.

Drugs and Chemical Compound Ontologies mainly comprise RxNormFootnote 13, and Generic and Upper Ontologies include: (1) Basic Formal Ontology (BFO)Footnote 14, (2) OBO Relation Ontology (RO)Footnote 15, (3) Provenance Ontology (PROV-O)Footnote 16.

Table 1 provides a quantitative overview of the ontologies listed in Sect. 2, together with their year of release (as listed at BioPortal), visibility (public/private) and implementation details (language and type of data). The table also presents the size and coverage of these ontologies in terms of total triples, number of entries/entities, dependency on or reuse of other ontologies, sub-classification and a brief description. Finally, the ontologies are compared quantitatively in terms of the total number of classes, properties and individuals and the maximum depth.
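The structural metrics named here can be derived from the subclass links of an ontology. As a minimal sketch, maximum depth is computed below as the longest chain of subclass links from a root class, counted in classes. The toy hierarchy and its class names are hypothetical, the hierarchy is assumed to be acyclic, and other tools may count depth differently (for example in edges rather than classes).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

def max_depth(subclass_of: Iterable[Tuple[str, str]]) -> int:
    """Longest root-to-leaf path, counted in classes, of an acyclic hierarchy.

    Each pair is (child, parent), i.e. one rdfs:subClassOf-style link.
    """
    children: Dict[str, List[str]] = defaultdict(list)
    classes: Set[str] = set()
    has_parent: Set[str] = set()
    for child, parent in subclass_of:
        children[parent].append(child)
        classes.update((child, parent))
        has_parent.add(child)

    memo: Dict[str, int] = {}

    def depth(node: str) -> int:
        # Depth of a leaf is 1; otherwise 1 plus the deepest child.
        if node not in memo:
            memo[node] = 1 + max((depth(c) for c in children.get(node, [])), default=0)
        return memo[node]

    roots = classes - has_parent
    return max((depth(r) for r in roots), default=0)

# Hypothetical toy hierarchy: Entity > Protein > Kinase, Entity > Gene.
edges = [("Gene", "Entity"), ("Protein", "Entity"), ("Kinase", "Protein")]
print(max_depth(edges))  # 3
```

The number of classes follows from the same input as the size of the set of all names that occur in the pairs.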

3 Public Data Repositories for Drug Discovery

In this section, we analyse a comprehensive list of biomedical libraries and databases closely related to drug discovery that have been provided by the biomedical community. Since drug discovery typically focuses on a specific disease domain, we have chosen cancer chemoprevention as a use case and thus list data resources relevant for this domain.

The databases are separated into the following categories:

  • Gene, Gene Expression and Protein Databases for gene and protein annotations as well as the expression levels and related clinical data,

  • Pathway databases denoting the protein interactions and the overall functional outcomes,

  • Chemical and Structure Databases including Biological Activities for the information related to drugs and other chemicals including also toxicity observations and clinical trials,

  • Disease Specific Databases for Prevention which deliver content specific to the prevention of cancer,

  • Literature databases.

Table 2 provides implementation details and quantitative overview of the Life Sciences related databases presented in Sect. 3. In addition, it lays out information regarding the year of release, accessibility (public, private) and implementation details (language and type of data) of different databases. Size and coverage of these databases in terms of total triples, number of entries/entities, sub-classification and brief description are also presented in the table.

3.1 Gene, Gene Expression and Protein Databases

A complete understanding of disease mechanisms, e.g. in cancer, requires the ability to analyse the underlying molecular processes. This leads to the need to decompose functional processes into molecular processes and to predict the outcomes of such processes from the genetic background. Although cancer genomics tends to be complex because cancer cells deviate from regular processes, genomics information, in particular data on the function of genes, their expression and their translation into proteins, is a major source for the understanding of molecular processes.

Table 1. Quantitative overview of implementation details of public ontologies (selected). (BO:biomedical Ontologies, DCCO:Drugs and Chemical Compound Ontologies, GUO: Generic and Upper Ontologies, T/C: Type/Category, Y: Year (acc. to Bioportal), Individuals, Classes/Concepts, Properties, Depth Public, “–” = N/A.)

The following data sources have to be considered for a complete and coherent representation of such molecular processes.

GenBankFootnote 17 is an open-access annotated collection of all publicly available nucleotide sequences and their protein translations. GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 380’000 distinct organisms.

ArrayExpressFootnote 18 archive is a database of functional genomics experiments, including gene expression, from which one can query and download data collected in compliance with the Minimum Information about a Microarray Experiment (MIAME) and Minimum Information about a high-throughput SeQuencing Experiment (MINSEQE) standards.

Gene Expression Omnibus (GEO)Footnote 19 is a public repository that archives and freely distributes microarray, next-generation sequencing and other forms of high-throughput functional genomic data submitted by the scientific community.

Cancer Gene Expression Database (CGED)Footnote 20 is a database of gene expression profiles and accompanying clinical information. This database offers graphical presentation of expression and clinical data with similarity search and sorting functions. CGED includes data on breast (prognosis and docetaxel datasets), colorectal, hepatocellular, esophageal, thyroid, and gastric cancers [4].

Universal Protein Resource (UniProt)Footnote 21 is a comprehensive resource for protein sequence and annotation data. The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation [4]. This includes widely accepted biological ontologies, classifications and cross-references, as well as clear indications of the quality of annotation in the form of evidence attribution of experimental and computational data.

Protein DatabaseFootnote 22 is a collection of sequences from several sources, including translations from annotated coding regions in GenBank and TPA (Third Party Annotation) as well as records from SwissProt, Protein Information Resource (PIR), Protein Research Foundation (PRF), UniProt and PDB. Protein sequences are the determinants of biological structure and function.

Protein Data Bank (PDB)Footnote 23 is a repository for the 3D structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography or NMR (Nuclear Magnetic Resonance) spectroscopy and submitted by biologists and biochemists from around the world, is freely accessible on the Internet. Most major scientific journals and some funding agencies require scientists to submit their structure data to the PDB [4].

3.2 Pathway Databases

Modelling of pathways provides the crucial information to understand functional states in the cells. Different sources are available which partially overlap. The richest source is KEGG with about 50 M triples provided.

Kyoto Encyclopedia of Genes and Genomes (KEGG)Footnote 24 is a database resource that integrates genomic, chemical, and systemic functional information. In particular, gene catalogues are linked to higher-level systemic functions of the cell, the organism, and the ecosystem. KEGG is further expanded towards more practical applications with molecular network-based views of diseases, drugs, and environmental compounds [4].

ReactomeFootnote 25 is an open-source, open access, manually curated and peer-reviewed pathway database. The rationale behind Reactome is to convey the rich information in the visual representations of biological pathways familiar from textbooks and articles in a detailed, computationally accessible format. Entities (nucleic acids, proteins, complexes and small molecules) participating in reactions form a network of biological interactions and are grouped into pathways. Examples of biological pathways in Reactome include signalling, innate and acquired immune function, transcriptional regulation, translation, apoptosis and classical intermediary metabolism [4].

WikiPathways [19] is an open, collaborative platform dedicated to the curation of biological pathways. WikiPathways thus presents a model for pathway databases that enhances and complements ongoing efforts, such as KEGG, Reactome and Pathway Commons.

cPath: Pathway Database SoftwareFootnote 26 is a software platform for collecting/querying biological pathways. It can serve as the core data handling component in information systems for pathway visualisation, analysis and modelling. cPath can be used for content aggregation, query and analysis. More specifically, its main features include: (i) Aggregate pathway data from multiple sources (e.g. BioCyc, KEGG, Reactome), (ii) Import/Export support with different formats PSI-MI (Proteomics Standards Initiative Molecular Interaction) and BioPAX, (iii) Data visualisation using Cytoscape and (iv) Simple web service.

3.3 Chemical and Structure Databases Including Biological Activities

The treatment of any disease, and cancer in particular, is based on chemical entities with a defined biological activity. Several data sources provide information on chemical compounds, on their relevance to specific treatments and on the side effects that they may induce. The amount of data (i.e. triples) across the different data sources is large, and data integration is an ongoing, difficult task (see the Open PHACTS project). The following data sources are publicly available.

Chemical Compounds Database (Chembase)Footnote 27 collects and provides information on chemical compounds and their physical and chemical properties, NMR (Nuclear Magnetic Resonance) spectra, mass spectra, UV/Vis (Ultra-violet-Visible Spectroscopy) absorption and IR data.

Sigma-AldrichFootnote 28 product database includes datasheets for commercially available compounds including solubility.

ChemDBFootnote 29 is a public database of small molecules available on the Web. The database contains approximately 4.1 million commercially available compounds and 8.2 million isomers. It includes a user-friendly graphical interface, chemical reactions capabilities as well as unique search capabilities.

Chemical Entities of Biological Interest (ChEBI)Footnote 30 is a database and ontology of small molecular entities. The term “molecular entity” refers to any isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer etc. that is identifiable as a separately distinguishable entity. Molecules directly encoded by the genome, such as nucleic acids, proteins and peptides derived from proteins by proteolytic cleavage, are not included.

DrugBank database [24] is a bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 6826 drug entries including 1431 Food and Drug Administration (FDA)-approved small molecule drugs, 133 FDA-approved biotech (protein/peptide) drugs, 83 nutraceuticals and 5211 experimental drugs. Additionally 4435 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries.

PubChemFootnote 31 provides information on the biological activities of small molecules including substance information, compound structures, and BioActivity data in three primary databases. PubChem is integrated with Entrez, NCBI’s (National Center for Biotechnology Information) primary search engine, and also provides compound neighbouring, sub/superstructure, similarity structure, BioActivity data, and other searching features [4]. PubChem contains substance descriptions and small molecules with fewer than 1000 atoms and 1000 bonds.

Aggregated Computational Toxicology Resource (ACToR)Footnote 32 is an online warehouse of all publicly available chemical toxicity data and can be used to find data about potential chemical risks to human health and the environment. ACToR aggregates data from over 500 public sources on over 500’000 environmental chemicals searchable by chemical name and by chemical structure [4]. It allows users to search and query data from chemical toxicity databases including: (1) ToxRefDB for animal toxicity studies, (2) ToxCastDB covering data from 1’000 chemicals in over 500 assays, (3) ExpoCastDB consolidating human exposure and exposure factor data, and (4) Distributed Structure-Searchable Toxicity (DSSTox) for high quality chemical structures and annotations.

ClinicalTrialsFootnote 33 is an up-to-date registry and results database of federally and privately supported clinical trials conducted in the United States and around the world [4].

TOXicology Data NETwork (TOXNET)Footnote 34 provides access to full-text and bibliographic databases oriented to toxicology, hazardous chemicals, environmental health and related areas.

3.4 Disease Specific Databases for Prevention

More such databases will arise once the data becomes available; currently, only a small number of data resources exist, each with limited content.

Colon Chemoprevention Agents Database (CCAD) [3] contains results from a systematic review of the literature on colon chemoprevention in humans, rats and mice. Target cancers are colorectal adenoma and adenocarcinoma, aberrant crypt foci (ACF) (a preneoplastic lesion), and Min mice polyps (adenomas in Apc+/− mutant mice). The chemopreventive agents are ranked by efficacy (potency against carcinogenesis).

Dietary Supplements Labels DatabaseFootnote 35 offers information on label ingredients in more than 5’000 selected brands of dietary supplements to compare label ingredients in different brands. Information is also provided on the “structure/function” claims made by manufacturers and can therefore be used to narrow down active ingredients in different types of food which may be applicable as Chemoprevention agents. Ingredients of dietary supplements in this database are linked to other databases such as MedlinePlus and PubMed [4].

REPAIRtoire DatabaseFootnote 36 is a database resource for systems biology of DNA damage/repair. It collects and organises the information including: (i) DNA damage linked to environmental mutagenic and cytotoxic agents, (ii) pathways comprising individual processes and enzymatic reactions involved in the removal of damage, (iii) proteins participating in DNA repair and (iv) diseases correlated with mutations in genes encoding DNA repair proteins. It also provides links to publications and external databases. REPAIRtoire can be queried by the name of pathway, protein, enzymatic complex, damage and disease.

3.5 Literature Databases

The scientific literature is still one of the most comprehensive data sources for experimental findings. The content is provided in an unstructured way and some of its content is delivered through data curation into the data sources above. The most relevant data sources are listed below.

PubMedFootnote 37 is the most widely used source for biomedical literature. PubMed provides access to citations from the MEDLINE database and additional Life Science journals, including links to many full-text articles at journal Web sites and other related Web resources. PubMed was first released in January 1996. The knowledge regarding Chemoprevention agents available as publications makes PubMed a primary source of biomedical information [4].

PubMed Dietary Supplement SubsetFootnote 38 is designed to limit search results to citations from a broad spectrum of dietary supplement literature including vitamin, mineral, phytochemical, ergogenic, botanical and herbal supplements in human nutrition and animal models. It retrieves citations on topics including: chemical composition; biochemical role and function, both in vitro and in vivo; clinical trials; health and adverse effects; fortification; traditional Chinese medicine and other folk/ethnic supplement practices [13].

Table 2. Quantitative overview of implementation details of public libraries and databases (selected) (LD: Literature Databases, T/C: Type/ Category, Public, Visibility, “–” = N/A)

4 Biomedical Services for Semantic Resources

The increase in the number of ontologies and databases creates new needs in the community of ontology users to find, reconcile and relate their own data to the growing number of biomedical ontologies, thus requiring access to the full body of biomedical ontologies. A number of tools and services have already been developed for this purpose, which help the biomedical community to locate ontologies, drugs, proteins and publications. More specifically, this section reviews the following biomedical services [13]:

4.1 BioPortal

BioPortal, created by the NCBO (National Center for Biomedical Ontology), is the Web interface that provides access to the full body of ontologies from the biomedical research community. The ontologies can be accessed in a variety of standard ontology formats. BioPortal organises ontologies according to a set of categories (such as anatomy, genomics, development etc.), enabling users to find groups of ontologies of interest as well as to visualise their content. BioPortal users can rate ontologies and comment on how appropriate ontologies are for specific tasks and how well they cover their target domain (Table 3).

Table 3. Quantitative overview of ontologies listed at BioPortal (as of June 2017).

4.2 Open Biomedical Ontology (OBO)

The OBO project is a repository with a Web portal containing ontologies as well as links to controlled vocabularies for shared use between medical and biological domains. The ontologies found in the OBO library partially overlap, since they can be combined with one another by adding relations, giving rise to new ontologies. Researchers in the OBO project have also developed the OBO language for representing biomedical ontologies.

4.3 Ontobee

Ontobee is a linked data server designed for ontologies that aims to facilitate ontology data sharing, visualisation, query, integration and analysis. This service dynamically dereferences individual ontology term URIs and presents them as:

  • HTML based web pages for user-friendly web browsing and navigation.

  • RDF source code for Semantic Web applications.

Ontobee is the default linked data server for most OBO Foundry library ontologies as well as for many ontologies not registered at OBO (see Table 4).
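One common way for a client to ask for one representation or the other of the same term URI is the HTTP `Accept` header. The sketch below only constructs the two requests and sends nothing over the network. The term URI is the OBO PURL of the Gene Ontology term "biological_process". The assumption that the server chooses between HTML and RDF/XML on the basis of this header should be verified against the Ontobee documentation before relying on it.

```python
from urllib.request import Request

# OBO PURL of a Gene Ontology term, used here only as an example URI.
TERM_URI = "http://purl.obolibrary.org/obo/GO_0008150"

def build_request(uri: str, media_type: str) -> Request:
    """Prepare (but do not send) a GET request asking for one representation."""
    return Request(uri, headers={"Accept": media_type})

# Same URI, two requested representations (assumed content negotiation).
html_request = build_request(TERM_URI, "text/html")
rdf_request = build_request(TERM_URI, "application/rdf+xml")

print(html_request.get_header("Accept"))
print(rdf_request.get_header("Accept"))
```

Passing either request object to `urllib.request.urlopen` would perform the actual retrieval, which requires network access and a reachable server.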

Table 4. Quantitative overview of ontologies listed at Ontobee (as of June 2017).

4.4 Ontology Lookup Service

The Ontology Lookup Service from the European Bioinformatics Institute provides a centralised query interface for ontologies in the OBO format. All ontologies are indexed and the user can query the content of the integrated ontologies with search terms to retrieve the most relevant related concept label and ontological definition. The service is of most benefit to curators who have to check for the existence of a specific concept in their daily work (see Table 5).

Table 5. Quantitative overview of ontologies listed at Ontology Lookup Service (as of June 2017).

4.5 AmiGO

AmiGO, built by the Gene Ontology Consortium, gives efficient access to the Gene Ontology and annotations stored in a specialist GO database. This solution focuses on only one ontology, but this ontology plays an overarching role in the biomedical domain, since it encodes the key findings from biomolecular research: molecular function, biological process and cellular location. Again, this solution is mainly relevant to curation teams.

4.6 Entrez

Entrez [5] is a Web-based search and retrieval engine developed by the NCBI. It is capable of searching multiple NCBI databases through a single query. Entrez returns search results that can include a combination of many types of data on the query, such as nucleotide sequences, protein sequences, macro-molecular structures and related articles in the literature. The search engine forms a powerful means to oversee the collected information from different sources for a specific entity, e.g., a gene or a pathway.
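Programmatic access to Entrez is offered through the NCBI E-utilities. As a hedged sketch, the snippet below builds `esearch` URLs that pose the same query to several NCBI databases. The query term is an arbitrary example, and the parameters shown (`db`, `term`, `retmax`, `retmode`) are a small subset whose current behaviour and usage limits should be checked in the E-utilities documentation. No request is sent.

```python
from typing import Dict, Iterable
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an esearch URL for one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"

def esearch_urls(dbs: Iterable[str], term: str) -> Dict[str, str]:
    """One URL per database for the same query term."""
    return {db: esearch_url(db, term) for db in dbs}

# Example term chosen for illustration only.
urls = esearch_urls(["pubmed", "protein", "nuccore"], "colon chemoprevention")
for db, url in urls.items():
    print(db, url)
```

Each URL can then be fetched and the returned identifier lists merged on the client side, which approximates the single-query view across databases described above.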

4.7 E-Meducation

The Alfa Institute of Biomedical Sciences (AIBS) has created a medical portal providing a selection of open access Internet links in several medical fields, including internal medicine, infectious diseases, dermatology, nosocomial infections, antimicrobial resistance, Hepatitis B virus, general surgery and surgical infections. A feature of E-Meducation is its custom-built medical search engine, which permits the tracking of medical information without hours of filtering. The custom search engine generates results from professionally oriented sites for healthcare providers.

5 Linked Data

In March 2007 the W3C Semantic Web Education and Outreach (SWEO) Interest Group announced a new Community Project called “Interlinking Open Data”Footnote 46 that was subsequently shortened to “Linking Open Data” (LOD). The goal of the Linked Open Data project is twofold: (i) to bootstrap the Semantic Web by creating, publishing and interlinking RDF exports from open datasets, and (ii) to introduce the benefits of Semantic Web technologies to the broader Open Data community [2]. Linked Data aims to make data available on the Web in an interoperable format so that agents can discover, access, combine and consume content from different sources with higher levels of automation than would otherwise be possible. The result is a “Web of Data”, a Web of structured data with rich semantic links that agents can query in a unified manner, across sources, using standard languages and protocols. Over the past few years, hundreds of knowledge bases with billions of facts have been published according to the Semantic Web standards (using RDF as a data model and RDFS and OWL for explicit semantics) following the Linked Data principles.

5.1 Life Sciences Linked Open Data Cloud

This section reviews the linked biomedical datasets relevant in a Cancer Chemoprevention and Drug Discovery scenario. Three significant providers are as follows: (1) Linked Open Drug Data (LODD), (2) Bio2RDF, and (3) LinkedLifeData.

Linked Open Drug Data (LODD)Footnote 47 is a set of linked datasets relevant to Drug Discovery. It includes data from several datasets including Drugbank, LinkedCT, DailyMed, Diseasome, SIDER, STITCH, Medicare, RxNorm, ClinicalTrials.gov, NCBI Entrez Gene and OMIM. The LODD datasets have been crawled by the Semantic Web Search Engine (SWSE)Footnote 48 that can be accessed via a faceted browsing interface.

Bio2RDFFootnote 49 is a project that contains multiple linked biological databases, including pathway databases such as KEGG, PDB and several NCBI databases [1]. Bio2RDF is an open-source project that uses Semantic Web technologies to build and provide the largest network of Linked Data for the Life Sciences. Bio2RDF defines a set of simple conventions to create RDF(S)-compatible Linked Data from a diverse set of heterogeneously formatted sources obtained from multiple data providers.
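One such convention is a normalised URI pattern of the form `http://bio2rdf.org/<namespace>:<identifier>`. The helper below is a hypothetical illustration of that pattern and is not code from the Bio2RDF project. Lower-casing the namespace and rejecting empty parts are assumptions made for this sketch.

```python
BIO2RDF_BASE = "http://bio2rdf.org/"

def bio2rdf_uri(namespace: str, identifier: str) -> str:
    """Build a Bio2RDF-style URI from a source namespace and a record identifier."""
    ns = namespace.strip().lower()   # assumption: namespaces are lower-case
    ident = identifier.strip()
    if not ns or not ident:
        raise ValueError("namespace and identifier must be non-empty")
    return f"{BIO2RDF_BASE}{ns}:{ident}"

print(bio2rdf_uri("DrugBank", "DB00001"))  # http://bio2rdf.org/drugbank:DB00001
```

Because every source is mapped onto the same pattern, records from different providers that refer to the same identifier resolve to the same URI, which is what allows the datasets to be linked.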

As of July 2014, Bio2RDF Release 3 containsFootnote 50 about 11 billion triples across 35 datasets (based on Virtuoso 7.1.0 as the SPARQL 1.1 endpoint). New types of data have been included, for example from OrphaNet, PubMed, SIDER, GenDR, and LSR. Further local endpoints have been integrated: Chembl, LinkedSPL, PathwayCommons, and Reactome. In the current version, every URI is an instance of an owl:Class, owl:ObjectProperty, or owl:DatatypeProperty.

LinkedLifeData (LLD)Footnote 51 is a semantic data integration platform for the biomedical domain containing 5 billion RDF statements from various sources including UniProt, PubMed, EntrezGene and 20 more. LLD allows writing complex data analytical queries and answering complex bioinformatics questions, and it helps to navigate through the information or export subsets of results. LLD offers two different access levels: (1) LLD Public – completely free anonymous access; and (2) LLD Enterprise – premium service access with extra features.

5.2 Quantitative Overview of Datasets

Table 6 provides a quantitative overview of the datasets listed in Sect. 5.1, together with the year of release (as reported at http://www.datahub.io, http://www.bio2rdf.org, http://www.linkedlifedata.com), the visibility (public/private) and the implementation details (language and type of data) of the different datasets. Size and coverage of these datasets in terms of total triples, number of entries/entities, link to the SPARQL endpoint, sub-classification and a brief description are also presented in the table. A quantitative comparison of the datasets is given in terms of the total number of classes, properties, instances, triples and entities.
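Counts of this kind can be obtained from a SPARQL endpoint with aggregate queries. The sketch below pairs a few such queries with a request URL that follows the SPARQL 1.1 Protocol, in which the query text is passed as the `query` parameter of a GET request. The endpoint is one of those listed in this paper and may no longer be reachable. The definitions used for counting (for example, entities as distinct subjects) are one possible choice and not necessarily the one used for the tables. No request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# One possible set of counting queries; definitions are assumptions.
COUNT_QUERIES = {
    "triples": "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }",
    "entities": "SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s ?p ?o }",
    "properties": "SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE { ?s ?p ?o }",
    "classes": "SELECT (COUNT(DISTINCT ?c) AS ?n) WHERE { ?s a ?c }",
}

def count_url(endpoint: str, metric: str) -> str:
    """URL of a SPARQL Protocol GET request for one counting query."""
    return f"{endpoint}?{urlencode({'query': COUNT_QUERIES[metric]})}"

url = count_url("http://cu.drugbank.bio2rdf.org/sparql", "triples")

# Round-trip check: decoding the URL recovers the original query text.
recovered = parse_qs(urlsplit(url).query)["query"][0]
print(recovered == COUNT_QUERIES["triples"])
```

On large stores, full-scan aggregates such as these can be slow or may be rejected by the server, so published statistics from the provider are often the more practical source.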

Table 6 shows that the largest triple store collections (2 to 10 B triples) originate from gene or protein data and branch out to reference information after data integration.

Table 6. Counts of triples (:T) and entities (:E) across the most relevant datasets across LSLOD, Bio2RDF and LLD (T/C: Type/Category, Y/D: Year/Date, E/F: Environmental Factors, SPLs: Structured Product Labels, DIKB: Drug Interaction Knowledge Base, LLD: Linked Life Data, “–” = N/A)

Table 7. Quantitative overview of datasets involving LODD only without judgement on the number of entries versus triples. (T/C: Type/Category, Y/D: Year/Date, SPLs: Structured Product Labels, DDIs: Drug Drug Interactions, “-”: N/A)

These triple stores will serve as a reference data resource, since the data integration is performed by providers of several of the integrated databases.

The next collection of triple stores (200 to 500 M triples; PubMed, ChEMBL, CTD, PharmGKB) comprises primary data resources that cover individual observations; a scientific publication is treated as such an observation. All these data resources grow at a rate linked to ongoing research in this domain, in contrast to data resources that report on scientific entities that can only be discovered once, e.g. a specific protein in a given species.

The following two groups of data resources (50 to 100 M triples; 12 to 50 M triples) contain different types of resources. The data in the first group correlates with experiments that are performed according to discovery needs and may lose relevance over time (see Affymetrix data). The second group contains reference data resources for species (Wormbase, SGD) and pathways (KEGG, Reactome, iRefIndex), but also large-scale resources with a very specific purpose, such as Taxonomy, BioPortal, and SIDER.

Of the remaining resources, some can be expected to develop into large-scale resources like those above (MGI, dbSNP, BioModels), whereas others, by the nature of their content, will show only very limited growth, such as HGNC, DrugBank, Orphanet, and possibly InterPro. Further resources have been considered (see Table 7), but could not be analysed in the same degree of detail as the data resources given in Table 6.
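The size tiers used in this discussion can be expressed as a simple lookup. The sketch below is illustrative only: the function name `size_tier`, the tier labels and the handling of counts that fall between the stated ranges are assumptions, and the text does not say how boundary values such as exactly 50 M triples are assigned.

```python
# Triple-count ranges as stated in the text, largest first.
# Labels are shorthand introduced here, not terminology from the paper.
TIERS = [
    ("2-10 B", 2_000_000_000, 10_000_000_000),
    ("200-500 M", 200_000_000, 500_000_000),
    ("50-100 M", 50_000_000, 100_000_000),
    ("12-50 M", 12_000_000, 50_000_000),
]


def size_tier(triples: int) -> str:
    """Return the tier label for a dataset with the given number of triples."""
    if triples < 0:
        raise ValueError("triple count must be non-negative")
    # First match wins, so a count of exactly 50 M lands in "50-100 M"
    # (an arbitrary choice; the ranges in the text share that boundary).
    for label, low, high in TIERS:
        if low <= triples <= high:
            return label
    # Assumption: the "remaining resources" are those below the smallest range.
    if triples < TIERS[-1][1]:
        return "remaining (< 12 M)"
    # The stated ranges leave gaps (e.g. 100-200 M) and have no tier above 10 B.
    return "outside the stated ranges"
```

For example, `size_tier(300_000_000)` returns `"200-500 M"`, while a hypothetical dataset of 150 M triples falls into none of the ranges named in the text.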

In conclusion, the life science research community has to determine which technological solutions allow the delivery of large-scale Semantic Web triple stores to the general public. Other data resources may well be replicated at different sites for local integration work.

6 Conclusion

In this paper we analysed (and quantified) different tiers of biomedical data relevant to the Cancer Chemoprevention and Drug Discovery domain. This involves ontologies, libraries and databases in healthcare and the biomedical domain, Linked Data and Life Science Linked Open Data.

We classify ontologies into three main classes: (i) Biomedical Ontologies (e.g. EFO, OBI, GO), (ii) Drugs and Chemical Compound Ontologies (e.g. RxNorm) and (iii) Generic and Upper Ontologies (e.g. BFO, RO, PROV). Similarly, we categorise libraries and databases into five categories: (i) Gene, Gene Expression and Protein Databases, (ii) Pathway Databases, (iii) Chemical and Structure Databases including Biological Activities, (iv) Disease Specific Databases for Prevention, and (v) Literature Databases. This paper also highlights biomedical services that provide ontology and database resources relevant for drug discovery.

7 Access to the data repositories

Affymetrix (http://cu.affymetrix.bio2rdf.org/sparql), BioModels (http://cu.biomodels.bio2rdf.org/sparql), BioPortal (http://cu.bioportal.bio2rdf.org/sparql), ChEMBL (http://cu.chembl.bio2rdf.org/sparql, http://rdf.farmbio.uu.se/chembl/sparql), ClinicalTrials (http://cu.clinicaltrials.bio2rdf.org/sparql), CTD (http://cu.ctd.bio2rdf.org/sparql), DailyMed (http://purl.org/net/nlprepository/linkedSPLs), DBpedia (http://dbpedia.org/sparql), dbSNP (http://cu.dbsnp.bio2rdf.org/sparql), DIKB (http://dbmi-icode-01.dbmi.pitt.edu:2020/), Diseasome (http://www4.wiwiss.fu-berlin.de/diseasome/sparql), DrugBank (http://cu.drugbank.bio2rdf.org/sparql, http://www4.wiwiss.fu-berlin.de/drugbank/sparql), GenAge (http://cu.genage.bio2rdf.org/sparql), GenDR (http://cu.gendr.bio2rdf.org/sparql), GHO (http://gho.aksw.org), GOA (http://cu.goa.bio2rdf.org/sparql), HGNC (http://cu.hgnc.bio2rdf.org/sparql), HomoloGene (http://cu.homologene.bio2rdf.org/sparql), InterPro (http://cu.interpro.bio2rdf.org/sparql), iProClass (http://cu.iproclass.bio2rdf.org/sparql), iRefIndex (http://cu.irefindex.bio2rdf.org/sparql), KEGG (http://cu.kegg.bio2rdf.org/sparql), LinkedCT (http://data.linkedct.org/sparql), LinkedLifeData (http://linkedlifedata.com/sparql), LinkedSPL (http://cu.linkedspl.bio2rdf.org/sparql), LSR (http://cu.lsr.bio2rdf.org/sparql), Medicare (http://www4.wiwiss.fu-berlin.de/medicare/sparql), MeSH (http://cu.mesh.bio2rdf.org/sparql), MGI (http://cu.mgi.bio2rdf.org/sparql), NCBI Gene (http://cu.ncbigene.bio2rdf.org/sparql), NDC (http://cu.ndc.bio2rdf.org/sparql), OMIM (http://cu.omim.bio2rdf.org/sparql), Orphanet (http://cu.orphanet.bio2rdf.org/sparql), PathwayCommons (http://cu.pathwaycommons.bio2rdf.org/sparql), PharmGKB (http://cu.pharmgkb.bio2rdf.org/sparql), PubMed (http://cu.pharmgkb.bio2rdf.org/sparql), RDF-TCM (http://www.open-biomed.org.uk/sparql/endpoint/tcm), Reactome (http://cu.reactome.bio2rdf.org/sparql), RxNorm (http://link.informatics.stonybrook.edu/sparql/), SABIO-RK (http://cu.sabiork.bio2rdf.org/sparql), SGD (http://cu.sgd.bio2rdf.org/sparql), SIDER (http://cu.sider.bio2rdf.org/sparql, http://www4.wiwiss.fu-berlin.de/sider/sparql), STITCH (http://www4.wiwiss.fu-berlin.de/stitch/sparql), Taxonomy (http://cu.taxonomy.bio2rdf.org/sparql), UPNR (http://dbmi-icode-01.dbmi.pitt.edu:8080/sparql), WikiPathways (http://cu.wikipathways.bio2rdf.org/sparql), WormBase (http://cu.wormbase.bio2rdf.org/sparql).
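Most of the addresses above are SPARQL endpoints and can be queried over HTTP following the SPARQL 1.1 Protocol. The sketch below shows one way to issue such a query using only the Python standard library. The helper names `build_request` and `run_query` are hypothetical, and it is an assumption that a given endpoint is still reachable and returns JSON results.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results.

    Assumes the endpoint URL carries no query string of its own.
    """
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(endpoint: str, query: str, timeout: float = 30.0) -> dict:
    """Send the query and decode the JSON response (performs network I/O)."""
    with urlopen(build_request(endpoint, query), timeout=timeout) as response:
        return json.load(response)


# Example request against one endpoint from the list; nothing is sent here.
EXAMPLE = build_request(
    "http://cu.drugbank.bio2rdf.org/sparql",
    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5",
)
```

Calling `run_query` with the same arguments would contact the endpoint; a small `LIMIT` keeps such a probe inexpensive for the server.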