6.1 Introduction

Reference Expression dataset (RefEx) (Ono et al. 2017) is a web tool that allows users to browse, gene by gene, expression profiles collected from public databases. It can be searched by gene names, various types of IDs, chromosomal regions, gene families based on InterPro (Mitchell et al. 2015), gene expression patterns, or gene annotations based on Gene Ontology (Ashburner et al. 2000). Information about genes with tissue-specific expression is also provided, and the relative gene expression values are shown as choropleth maps on 3D human body images from BodyParts3D (Mitsuhashi et al. 2009). Through its web interface, RefEx provides insight into the functional interpretation of unfamiliar genes.

The purpose of RefEx is to provide a web tool for visualizing reference gene expression patterns of mammalian tissues and cell lines measured using different methods, thereby facilitating the reuse of the valuable data archived in several public databases.

RefEx provides reference gene expression datasets for 40 normal tissues from human, mouse, and rat, collected from public gene expression databases. The collected gene expression data are classified according to four different measurement strategies (Expressed Sequence Tags (ESTs), GeneChip, Cap Analysis of Gene Expression (CAGE), and RNA-Seq). These four types of data are linked through the NCBI Gene IDs in the RefEx dataset. In addition to these datasets, RefEx currently includes quantified gene expression data from the Functional Annotation of the Mammalian genome 5 (FANTOM5) dataset for human and mouse (The FANTOM Consortium & the RIKEN PMI and CLST (DGT) 2014).

The target users of RefEx are biologists who wish to reuse public data but find it difficult to access because of its sheer magnitude and complicated access procedures.

6.2 Database Overview

6.2.1 Importance of Reference Gene Expression Datasets

Gene expression data have been accumulating exponentially since the advent of methods for measuring gene expression on a genomic scale. Many datasets are now archived in the public gene expression databases [NCBI Gene Expression Omnibus (GEO) (Barrett et al. 2013) and EBI ArrayExpress (Kolesnikov et al. 2015)]. Because the dataset descriptions are written by the different researchers who produced the gene expression data, they vary so widely that they are not currently machine readable.

Nevertheless, there is strong demand for a comprehensive set of reference gene expression data drawn from the huge volume of publicly available gene expression data. The availability of such data benefits biologists who wish to reuse it, but accessing the data remains difficult because of its sheer magnitude and complicated access procedures. Recently, a meta-analysis of RNA-Seq expression data across various species, tissues, and studies was reported (Sudmant et al. 2015). However, the interpretation of such data is not easy. Biologists are often at a loss because of the sheer number of datasets that numerous researchers have deposited in public databases. Under these circumstances, reference expression datasets are needed for inferring the functions of genes, and a proper web interface for visualizing such data is essential.

In addition, when gene expression profiles obtained by different quantification methods show concerted patterns, the evidence for these patterns is strengthened. Tissue-specific expression can also be a key feature for examining the function of genes of interest, and lists of genes with tissue-specific expression can help biologists to explore unannotated genes with prominent expression patterns. Thus, functional annotation of genes derived from meta-analysis and an interface for accessing the data with graphical visualization are urgently required.

6.2.2 Current Status of Reference Gene Expression Data

As reference gene expression data on a genomic scale, expression profiles of normal mammalian tissues measured by GeneChip were first maintained by researchers at the Genomics Institute of the Novartis Research Foundation (GNF). Users can access the microarray data produced in that project at the GNF Expression Atlas (also known as GNF SymAtlas), now called BioGPS (Wu et al. 2016). EBI also maintains the Expression Atlas, which provides gene expression patterns under various biological conditions based on data archived in ArrayExpress (Petryszak et al. 2016). Recently, using RNA sequencing and CAGE (Shiraki et al. 2003), the FANTOM collaboration consortium released terabytes of transcriptome sequencing data from human and mouse adult and fetal tissues, primary cells, and cell lines that can be used as reference gene expression data (Lizio et al. 2015).

6.2.3 The Main Feature of RefEx

The main feature of RefEx is a simple web interface that allows users to compare expression profiles obtained by different methods at a glance (Fig. 6.1). It provides access to curated data from several other public databases, with expression levels in 40 tissues measured by four well-established gene expression quantification technologies (ESTs, Affymetrix GeneChip, CAGE, and RNA-Seq). The web interface allows users to browse the expression profiles by gene name, various types of IDs, chromosomal regions in genetic maps, gene families based on InterPro, gene expression patterns, or biological categories based on Gene Ontology. The web interface also provides a way to browse the expression profiles for adult and fetal human and mouse tissues obtained by the FANTOM5 project using CAGE for gene expression quantification. All the data provided through the RefEx web interface are listed with their corresponding digital object identifiers (DOIs) in Table 6.1.

Fig. 6.1 The top page of RefEx web tool (https://refex.dbcls.jp/)

Table 6.1 Summary of the entities stored in the database. The original data are available for download from the RefEx download page (https://refex.dbcls.jp/download.php?lang=en)

6.2.4 Future Update and Availability of the Database

RefEx will be updated as useful gene expression datasets become publicly available. Data in RefEx are freely available under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Users can download a concatenated version of all the data at the RefEx download page (https://refex.dbcls.jp/download.php?lang=en), including the log-transformed ratios of the gene expression, the functional annotation of the genes, the list of tissue-specific genes, and the sample information, in a tab-delimited text format (Table 6.1). Data in RefEx are also available at figshare (https://doi.org/10.6084/m9.figshare.c.3812815), and the scripts used to generate the RefEx data are available at GitHub (https://github.com/dbcls/RefEx/).

6.3 Content and Architecture of the Database

RefEx stores processed and quantified gene expression data for human, mouse, and rat. All data in RefEx originate from public databases. Table 6.2 shows the original data sources for the RefEx dataset.

Table 6.2 Original data sources for the RefEx dataset. The EST data are originally from the International Nucleotide Sequence Database (INSD). The GeneChip data are originally from the NCBI Gene Expression Omnibus (GEO), and the CAGE and RNA-Seq data are from the Sequence Read Archive (SRA)

The data in RefEx were manually collected by RefEx curators from public databases, including the International Nucleotide Sequence Database (INSD, consisting of GenBank/DDBJ/ENA) (Cochrane et al. 2016), the NCBI Gene Expression Omnibus (GEO), and RNA-Seq data in the Sequence Read Archive (SRA) (Kodama et al. 2012). The raw data from the public databases were reorganized and compared against each other. The four types of data were linked through NCBI Gene IDs; the EST data were originally keyed by UniGene IDs, and the GeneChip data by probe set IDs. Detailed information regarding the four data extraction methods is given below. All scripts used to produce the data and additional descriptions are available on the GitHub site at https://github.com/dbcls/RefEx.

6.3.1 EST

The original EST data were retrieved from the EST division of the INSD. The number of ESTs was counted by source organ based on the BodyMap method (Okubo et al. 1992) according to the cDNA annotation of each EST entry. The EST data in RefEx originated from the BodyMap-Xs database, which contains previously compiled gene expression data from the INSD EST division for reuse (Ogasawara et al. 2006). After counting the number of ESTs, gene expression data were obtained for the 40 normal tissues stored in the BodyMap-Xs database (https://doi.org/10.6084/m9.figshare.4028721). For visualization purposes, the data were grouped into ten subsets (i.e., brain, blood, connective, reproductive, muscular, alimentary, liver, lung, urinary, and endo/exocrine; https://doi.org/10.6084/m9.figshare.4028715). This categorization of the organs was also applied to the gene expression data that were obtained by the other methods.
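The counting step described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual BodyMap-Xs or RefEx code: the cluster IDs, the tissue names, and the assignment of tissues to display groups are hypothetical placeholders.

```python
from collections import Counter, defaultdict

# Hypothetical EST entries: (cluster ID, source tissue from the cDNA annotation).
est_entries = [
    ("Hs.0001", "liver"),
    ("Hs.0001", "liver"),
    ("Hs.0001", "cerebellum"),
    ("Hs.0002", "heart"),
]

# Hypothetical excerpt of a tissue-to-group table (assumed assignments).
tissue_to_group = {"liver": "liver", "cerebellum": "brain", "heart": "muscular"}


def count_ests(entries):
    """Count how many ESTs of each cluster come from each source tissue."""
    return Counter(entries)


def arrange_by_group(counts, tissue_to_group):
    """Nest the per-tissue counts of each cluster under its display group."""
    nested = defaultdict(lambda: defaultdict(dict))
    for (cluster, tissue), n in counts.items():
        nested[cluster][tissue_to_group[tissue]][tissue] = n
    return nested


counts = count_ests(est_entries)
by_group = arrange_by_group(counts, tissue_to_group)
```

Here the grouping only arranges the per-tissue counts for display; whether counts are also aggregated within a group is not stated in the text, so the sketch leaves them separate.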

6.3.2 GeneChip

The GeneChip data deposited in the NCBI GEO database were selected for the reference dataset (tissue-specific patterns of mRNA expression) (Table 6.2). Those data were analyzed using a typical microarray data analysis method (Wu et al. 2016). The expression values of the genes were calculated from the original CEL files after robust multi-array averaging (RMA) normalization (Irizarry et al. 2003) using the affy package (Gautier et al. 2004) in R (ver.3.0.3)/Bioconductor (ver.2.12) (Gentleman et al. 2004).

6.3.3 CAGE

CAGE is a technique that produces a snapshot of the 5′ ends of the mRNA population in a biological sample. The CAGE data collected in the RIKEN FANTOM5 project were counted by source organ based on the original data, the FANTOM5 CAGE peak expression, and the annotation tables (Lizio et al. 2015). The CAGE tags were mapped onto the reference genome sequences (hg19 for human and mm9 for mouse), and the tag counts reflect the expression intensity of the corresponding transcripts. The tag counts are normalized as tags per million (TPM). For the processed data in RefEx, each TPM value of the original FANTOM5 CAGE data is converted to log2, the values are organized by sample classification, and the values to which the same GeneID is assigned are averaged. In addition to the 40 normal tissues, the FANTOM5 project collected hundreds of samples from cell lines, primary cells, and adult and fetal tissues of human (https://doi.org/10.6084/m9.figshare.4028685) and mouse (https://doi.org/10.6084/m9.figshare.4028688).
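As a rough illustration of the processing described above, the following sketch normalizes tag counts to TPM, log2-transforms them, and averages the values that share a GeneID. It is not the RefEx script itself. The peak names, counts, and GeneID assignments are hypothetical, and the pseudocount of 1 added before the logarithm is an assumption made here so that zero counts remain defined; the text does not say how zeros are handled.

```python
import math
from collections import defaultdict


def tags_per_million(counts):
    """Scale the raw tag counts of one sample so that they sum to one million."""
    total = sum(counts.values())
    return {peak: c * 1_000_000 / total for peak, c in counts.items()}


def mean_log2_by_gene(tpm, peak_to_gene, pseudocount=1.0):
    """Log2-transform each TPM value, then average the values sharing a GeneID.

    The pseudocount is an assumption of this sketch, not a documented
    RefEx parameter.
    """
    grouped = defaultdict(list)
    for peak, value in tpm.items():
        grouped[peak_to_gene[peak]].append(math.log2(value + pseudocount))
    return {gene: sum(vals) / len(vals) for gene, vals in grouped.items()}


# Hypothetical CAGE peaks, tag counts, and GeneID assignments for one sample.
raw_counts = {"peak_a": 30, "peak_b": 10, "peak_c": 60}
peak_to_gene = {"peak_a": "gene_1", "peak_b": "gene_1", "peak_c": "gene_2"}

tpm = tags_per_million(raw_counts)
expression = mean_log2_by_gene(tpm, peak_to_gene)
```

Averaging across the samples of one sample classification would follow the same pattern, grouping by classification instead of by GeneID.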

6.3.4 RNA-Seq

For RNA-Seq data, the normal tissue transcriptome sequence data were selected from the SRA. Utilizing human and mouse reference genome sequences (hg19 for human and mm9 for mouse), these data were processed using a typical RNA-Seq data analysis pipeline with TopHat (ver.2.0.7) (Trapnell et al. 2009) and Cufflinks (ver.2.0.2) (Trapnell et al. 2010), and the transcript abundances were calculated and normalized to fragments per kilobase of transcript per million reads (FPKM).
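For readers unfamiliar with the unit, the FPKM definition can be written as a short function. This shows only the arithmetic of the unit with made-up numbers. Cufflinks estimates transcript abundances with its own statistical model, which this sketch does not reproduce.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    Dividing by (length / 1e3) and by (total / 1e6) is the same as
    multiplying the fragment count by 1e9 and dividing by both values.
    """
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Made-up example: 500 fragments on a 2,000 bp transcript in a library
# of 20 million mapped fragments.
value = fpkm(500, 2_000, 20_000_000)
```

With these numbers, 500 fragments over 2 kb is 250 fragments per kilobase, and dividing by 20 (million fragments) gives an FPKM of 12.5.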

The four types of data above were linked through the NCBI Gene IDs in the RefEx dataset. The EST data were clustered by sequence similarity, and the NCBI UniGene IDs were added to those clusters. The GeneChip data were based on Affymetrix probe IDs, which were originally designed based on the UniGene database (Wagner and Agarwala 2013). The remaining two methods are based on direct sequencing and were developed after the completion of the human and mouse genome sequencing projects; the data obtained by these methods can be mapped to the reference genomes by genomic position. Thus, NCBI Gene IDs, which are currently widely used to integrate other gene IDs, were adopted as the standard. Mapping the various gene IDs (UniGene ID, Affymetrix probe ID, and NCBI Gene ID) onto the various genomes was performed using the BioMart REST API (http://www.biomart.org/martservice.html).
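The linking step can be pictured as a join on a common key. The sketch below assumes that ID conversion tables (for example, ones obtained from BioMart) are already available as dictionaries. All identifiers and values are hypothetical placeholders, and the function is an illustration rather than the RefEx implementation.

```python
def link_by_gene_id(datasets, id_maps):
    """Merge per-method expression tables into one table keyed by Gene ID.

    datasets: {method: {native_id: value}}
    id_maps:  {method: {native_id: gene_id}}; a method without a map is
              assumed to be keyed by Gene ID already.
    If several native IDs map to the same Gene ID, the last one wins in
    this sketch; a real pipeline would need an explicit rule.
    """
    linked = {}
    for method, table in datasets.items():
        id_map = id_maps.get(method)
        for native_id, value in table.items():
            gene_id = id_map.get(native_id) if id_map is not None else native_id
            if gene_id is None:
                continue  # no Gene ID is known for this identifier
            linked.setdefault(gene_id, {})[method] = value
    return linked


# Hypothetical identifiers and expression values.
datasets = {
    "EST": {"Hs.0001": 12, "Hs.9999": 3},
    "GeneChip": {"0000_at": 8.1},
    "CAGE": {"gene_1": 9.4},
    "RNA-Seq": {"gene_1": 7.7},
}
id_maps = {
    "EST": {"Hs.0001": "gene_1"},
    "GeneChip": {"0000_at": "gene_1"},
}

linked = link_by_gene_id(datasets, id_maps)
```

Identifiers without a known Gene ID (here the placeholder "Hs.9999") are simply dropped, which is one possible policy among several.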

The quality of the expression data above is ensured by the use of biological replicates. For example, the RNA-Seq data used were generated from multiple read types (single-read and paired-end), and the mouse data, obtained from triplicate sequencing runs, were averaged to represent the gene expression value.

6.4 Database Access and Mining Methods

6.4.1 Gene Expression Visualization Tool in RefEx

The relative gene expression values are shown in RefEx as choropleth maps on 3D human body images from BodyParts3D (Mitsuhashi et al. 2009). BodyParts3D has been developed by the Database Center for Life Science (DBCLS) as a dictionary-type anatomy database in which anatomical concepts are represented by 3D structural data that specify the corresponding segments on a 3D whole-body model of an adult human male. The Foundational Model of Anatomy (FMA) ontology (https://bioportal.bioontology.org/ontologies/FMA) was used to map the gene expression data onto the corresponding tissues. Because drawing the choropleth maps dynamically on a 3D human body is quite labor-intensive, still images were prepared only for the GeneChip data, covering all entries. Figure 6.2 clearly illustrates that the selected transcript is highly expressed in the liver tissue. On the right of Fig. 6.2, the relative expression levels in the 40 more finely classified types of normal tissues are displayed. The visualization helps users to understand the differences in gene expression patterns among tissues more intuitively.

Fig. 6.2 The search results for liver-specific genes. This view can be reached by clicking the liver icon at the top of the RefEx page (Fig. 6.1)

6.4.2 How to Query RefEx

Users can easily query RefEx with an effective filter to extract genes with concerted gene expression profiles. For example, genes with liver-specific expression can be retrieved with only a single click. Details are available in a video tutorial (https://doi.org/10.7875/togotv.2016.068).

After the publication of the FANTOM5 project, pre-calculated gene expression data from the CAGE data in the FANTOM5 project were incorporated into RefEx. The most important benefit of the FANTOM5 CAGE data is that the search targets are much more abundant. The original version of RefEx only had forty tissue search targets (Fig. 6.3a). However, it is now possible to search more than 500 human samples, encompassing cell lines, primary cells, and adult and fetal tissues (Fig. 6.3b). RefEx also enables users to browse high-resolution gene expression data from approximately 800 samples (human plus mouse).

Fig. 6.3 Detailed expression view by a gene (Troponin T type 2). (a) Forty normal tissues. (b) FANTOM5 CAGE

By clicking the tab on the right-hand side, users can switch to a FANTOM5 CAGE data viewer (Fig. 6.3b). This viewer shows the expression patterns of all samples in the lower portion of the screen and displays an enlarged view of a specific area in the upper portion of the screen. Because this is a representation of the expression profile in humans, 556 samples are shown in a bar chart in the lower portion of the screen. Therefore, a user can observe an overview of expression patterns in all the samples. The area displayed in the enlarged box can be moved freely by dragging. When a user enters a keyword into the search window of the viewer, the sample name containing that keyword is highlighted. The FANTOM5 CAGE data correspond to the tissue classification in the original RefEx and are linked to the original FANTOM5 data. The expression values of the samples obtained in the FANTOM5 project are averaged and listed in RefEx.

6.4.3 How to Download Data from RefEx

The data shown in RefEx originate from public databases, and the sources of all data records are summarized in Table 6.2. The data used in RefEx, including the processed gene expression data, can be downloaded from the RefEx download page (https://refex.dbcls.jp/download.php?lang=en). These data are deposited in figshare (Fig. 6.4), a repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner. The forty-one datasets uploaded to figshare are accessible from the figshare collection at https://doi.org/10.6084/m9.figshare.c.3812815.

Fig. 6.4 Data in RefEx can be downloaded from figshare. https://doi.org/10.6084/m9.figshare.4028622.v5
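Because the downloadable files are tab-delimited text, they can be read with standard tools. The following sketch parses a small in-memory example with the Python standard library. The column names and values are assumptions made for illustration, so the header of the actual downloaded file should be checked before reuse.

```python
import csv
import io

# Hypothetical excerpt of a tab-delimited expression table; the column
# names and values are placeholders, not the real RefEx file layout.
sample_tsv = (
    "gene_id\ttissue_a\ttissue_b\n"
    "gene_1\t9.4\t0.3\n"
    "gene_2\t0.1\t8.8\n"
)


def read_expression_table(handle, id_column="gene_id"):
    """Read a tab-delimited table into {gene_id: {column: float value}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        gene_id = row.pop(id_column)
        table[gene_id] = {name: float(value) for name, value in row.items()}
    return table


table = read_expression_table(io.StringIO(sample_tsv))
```

For a file saved locally, the same function can be given an open file handle, for example one obtained with open(path, newline=""), in place of the in-memory string.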

6.4.4 Programmatic Technique to Access RefEx

RefEx is part of the integrated database project in Japan, and the Resource Description Framework (RDF) version of the RefEx dataset is available for use at the National Bioscience Database Center (NBDC) RDF Portal (https://integbio.jp/rdf/).
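RDF datasets are typically queried with SPARQL over HTTP. The sketch below only builds such a request and does not send it. The endpoint URL is an assumption and must be verified on the portal, and the query is deliberately generic because the RefEx RDF schema is not described here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint URL; verify the actual SPARQL endpoint on the portal.
SPARQL_ENDPOINT = "https://integbio.jp/rdf/sparql"

# Schema-agnostic query: list a few arbitrary triples.
EXAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_request(query, endpoint=SPARQL_ENDPOINT):
    """Build a SPARQL-protocol GET request asking for JSON results."""
    params = urlencode({"query": query})
    return Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


request = build_sparql_request(EXAMPLE_QUERY)
```

The request could then be sent with urllib.request.urlopen(request) and the response decoded as JSON. A query restricted to the RefEx data would additionally need the graph and predicate names used by the portal.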

6.5 Use-Cases and Demo to Utilize the Database

As a simple use-case of RefEx, gene expression profiles of specific genes of interest in normal tissues are often used in medical research. For example, the mRNA expression levels of isocitrate dehydrogenase 3 (NAD(+)) alpha (IDH3α) and vascular endothelial growth factor A (VEGF-A) were visualized as main indicators in ten major groups of normal tissues (Fig. 6.5) (Zeng et al. 2015), in conjunction with prognostic values calculated using the PrognoScan database (Mizuno et al. 2009).

Fig. 6.5 RefEx-based quantification of the mRNA expression levels of VEGF (a, c) gene and IDH3a (b, d) gene in the indicated ten major groups of normal tissues (Zeng et al. 2015)

Another application is as a reference for tissue-specific information. In a study of murine colon proteomes, colon-specific genes in the mouse version of RefEx were compared with a list of genes from murine colon proteomes generated from the researchers’ own results (Magdeldin et al. 2012). A similar example is a study of liver-specific genes that investigated biomarkers indicating liver injury in humans. The gene expression profiles of albumin (ALB), apolipoprotein H (APOH), group-specific component (GC), and α-1 microglobulin/bikunin precursor (AMBP) were used to confirm the liver-specific expression (Okubo et al. 2016).

In other studies, the data retrieved from RefEx strengthened the authors’ hypotheses without further wet-lab confirmation. The gene expression profile of noncardiac myosin, light chain 12A (MRLC) in RefEx was used to support the conclusions, because MRLC was expressed in the heart at the same level as in the skeletal muscle even though it was annotated as “noncardiac” (Mizutani et al. 2016). As another example, RefEx was used in a review article to list the gene expression profiles of all genes previously reported to cause deafness (Nishio et al. 2015).

As an intermediate use-case of RefEx using the RefEx web interface, users can add up to three genes to their list and compare these genes simultaneously. Users can compare all the detailed information about the genes in that list, including the expression data. This parallel comparison enables users to easily identify the differences among the genes. Overlapped terms, such as the Gene Ontology and the InterPro gene family terms, are arranged in the same row (Fig. 6.6). Therefore, RefEx is also useful as a tool for investigating the relationships of unknown genes found in gene expression analyses.

Fig. 6.6 Intermediate use-case. Comparison view. Up to three genes can be compared simultaneously. Users can compare all detailed information in parallel. The expression data and the overlapped annotated terms from Gene Ontology and the InterPro gene family are arranged in the same row
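The row alignment used in the comparison view can be imitated offline with set operations. The sketch below is an illustration under assumptions, not the logic of the RefEx web interface: the gene labels and annotation terms are hypothetical, and ordering rows by the number of genes sharing a term is a design choice made here.

```python
MAX_GENES = 3  # the comparison list holds up to three genes


def comparison_rows(annotations):
    """Arrange annotation terms so that each shared term occupies one row.

    annotations: {gene: set of annotation terms}
    Returns a list of (term, [present in gene 1, gene 2, ...]) rows, with
    the terms shared by the most genes listed first.
    """
    if len(annotations) > MAX_GENES:
        raise ValueError(f"at most {MAX_GENES} genes can be compared")
    genes = list(annotations)
    all_terms = set().union(*annotations.values())

    def shared(term):
        return sum(term in annotations[g] for g in genes)

    ordered = sorted(all_terms, key=lambda t: (-shared(t), t))
    return [(term, [term in annotations[g] for g in genes]) for term in ordered]


# Hypothetical genes and annotation terms.
annotations = {
    "gene_1": {"term_x", "term_y"},
    "gene_2": {"term_x", "term_z"},
    "gene_3": {"term_x", "term_y"},
}

rows = comparison_rows(annotations)
```

Each row holds one term and one flag per gene, so a term annotated to several genes appears once with the flags side by side.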

For the advanced use-case, RefEx can be used in comparative transcriptomic studies as a reliable reference expression dataset of human normal tissues and cell lines. To compare the characteristics of the human small intestine and the Bombyx mori larval midgut, gene expression profiles from human colon carcinoma (Caco-2) cells, which are used to estimate human intestinal absorption, from the human small intestine, and from the B. mori larval midgut were compared to identify drug transporters common to the human intestine and the B. mori larval midgut (Ichino et al. 2018). While the B. mori larval midgut RNA-Seq data were analyzed by the authors themselves, the gene expression data for Caco-2 cells and human small intestine were reused from RefEx [“Processed expression data of all samples for CAGE human PRJDB3010 (FANTOM5)” downloaded from https://doi.org/10.6084/m9.figshare.4028613.v4]. As a result, 26 drug transporter homologs common to the B. mori larval midgut and the human intestine were found (Fig. 6.7).

Fig. 6.7 Advanced use-case. The use-case in the comparison of B. mori midgut (red), human small intestine (yellow), and Caco-2 cells (purple)

6.6 Summary and Future Development of the Database

RefEx is a tool for interactive analysis of gene expression patterns on the web, accessible with the latest versions of web browsers such as Firefox, Safari, and Chrome. RefEx has three main applications. First, users can examine the expression profiles of unfamiliar genes in normal body tissues, cells, and cell lines based on actual measurement data rather than only on a description in a journal article. Second, a search for tissue-specific genes can be performed simply by clicking on the appropriate tissue icon at the top of the RefEx page. Third, users can compare differences in gene expression levels related to the use of different experimental methods.

Currently, transcripts based on RefSeq mRNA records are used to integrate the data from the different gene expression measurement methods. However, according to a high-throughput sequencing data analysis, over 90% of human genes undergo alternative splicing (Pan et al. 2008), and many of the resulting transcripts are not yet included in RefSeq. To address this limitation, the transcript definitions need to be revised to include noncoding RNA, in close collaboration with the FANTOM project. The upcoming version of RefEx is planned to use personalized gene expression data from the Genotype-Tissue Expression database (GTEx) (GTEx Consortium et al. 2015).