1 Introduction

In 2014, statistics from the American Cancer Society [6] indicated that cancer is the second leading cause of death in the United States, after heart disease. About one in every four deaths is caused by cancer [6]. Recently, pathway analysis has led to better cancer diagnosis and treatment.

A pathway is “a collection of genes that serves a particular function and/or genes that interact with other genes in a known biological process” [29]. The authors of [18] defined a pathway as “a collection of genes that chemically act together in particular cellular or physiologic function”, and those of [1] defined it as “a series of actions among molecules in a cell that leads to a certain product or a change in a cell”. There are three types of pathways: metabolic, gene regulation and signaling pathways. Metabolic pathways are series of chemical reactions in cells [1]. Gene regulation pathways regulate genes to be either active or inhibited [1]. Signaling pathways are series of actions in a cell that move signals from one part of the cell to another. In biological research, genes are grouped into pathways to improve gene expression analysis [11] and to simplify the analysis by examining a few groups of related genes (pathways) instead of long lists of genes [15]. Pathway analysis is concerned with finding out which pathways are responsible for a certain phenotype or which pathways are significant under certain conditions [3]. In addition, pathway analysis is used to explain biological results and as a validation phase in computational research [15]. Khatri et al. [15] pointed out two advantages of using pathway analysis. First, it reduces the complexity of the analysis from thousands of genes to a few hundred pathways. Second, identifying significant pathways is more meaningful than a list of differentially expressed genes when comparing two samples, such as normal and cancerous.

Pathway analysis has useful applications, such as discovering disease occurrence by finding the disrupted biological pathways. Another application is drug development that aims to design a drug that targets one or two disrupted pathways [1, 28]. Moreover, researchers plan to use the biological pathway approach for personalized patient treatment and drug development [1, 28].

The paper is organized as follows. Section 2 gives an overview of pathway databases. Gene expression, microarray and RNA-seq are presented in Sect. 3. Section 4 gives an overview of pathway analysis techniques. Section 5 describes some miRNA analysis techniques. The conclusion is presented in Sect. 6.

2 Pathway Databases

Pathways are curated manually from biological experiments or automatically using text mining techniques [24]. Manual curation is more accurate and reliable.

There are 547 available resources related to biological pathways [2]. For example, the KEGG (Kyoto Encyclopedia of Genes and Genomes) database is the most popular pathway resource. It contains manually curated and inferred metabolic, signaling and disease pathways for over 650 organisms [7]. Reactome is another example of a manually curated and inferred pathway database. It contains metabolic, signaling and disease pathways for human [7]. BioCarta also contains manually curated metabolic and signaling pathways for human and mouse [7].

3 Gene Expression

Genes control cell functions, and all cells have the same genetic information. Genes are active or inactive (that is, they have different gene expression) depending on the cell type and on different conditions. Gene expression measures the amount of mRNA produced in a cell [13] and gives the degree to which a gene is active under different conditions.

3.1 Microarray vs RNA-seq

Sequencing techniques allow scientists to analyze tens of thousands of genes in parallel at any given time [12]. These technologies help us to understand diseases and provide better treatments [5]. These techniques started with microarray technologies. Then, next-generation sequencing was developed, and it includes many sequencing methods that are used in gene expression analysis, such as RNA-seq.

A microarray is a small glass, plastic or silicon chip to which tens of thousands of DNA molecules (probes) are attached. A microarray can detect specific DNA molecules of interest. It works as follows: cDNAs are obtained from two mRNA samples (a test sample and a control sample), labelled with fluorescent dyes and then hybridized on the surface of the chip. Then, the chip is scanned to read the signal intensity that is emitted from the labelled and hybridized targets [12, 23].

RNA-seq is used to rapidly profile mRNA expression of the whole transcriptome [5, 9]. It works as follows: short reads are aligned to an annotated reference mRNA. Then, the number of reads that align to each of the different cDNAs is counted [4]. RNA-seq outperforms microarray in various aspects. First, it can detect and identify unknown genes and detect differential expression levels that are not detected by microarray [5]. Second, it does not require specific probes or a predefined transcriptome of interest [5]. Third, it increases specificity and sensitivity for detecting genes [5].

Gene expression is usually represented by an \(i\times j\) matrix, as in Fig. 1, where the rows represent the expression patterns of genes and the columns represent different conditions, such as different samples (normal vs cancer) or different time points [12, 13]. In microarray data, \(x_{ij}\) represents the intensity level of hybridization of the \(i\)th gene in the \(j\)th condition, while in RNA-seq data, \(x_{ij}\) represents the number of reads of gene \(i\) observed in condition \(j\).

Fig. 1. Gene expression matrix

In microarray experiments, the chip is scanned to get hybridization data that are usually represented in a spreadsheet-like format [5], where each cell represents the intensity of hybridization of a specific gene in a specific condition, as in Fig. 2. In RNA-seq, sequencing is used to get read counts that are represented in a spreadsheet-like format. Each cell represents the number of reads that align to one of thousands of different cDNAs [5], as in Fig. 3.

Fig. 2. Microarray gene expression matrix

Fig. 3. RNA-seq gene expression matrix
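
As a minimal illustration of the matrix layout described above, the following Python sketch stores a small RNA-seq style expression matrix and reads out rows (genes) and columns (conditions). The gene names, condition labels and read counts are invented for illustration only, and the helper names are hypothetical.

```python
# Hypothetical toy data: gene names, condition labels and counts are invented.
genes = ["geneA", "geneB", "geneC"]
conditions = ["normal_1", "normal_2", "cancer_1", "cancer_2"]

# RNA-seq style matrix: counts[i][j] is the read count of gene i in condition j.
counts = [
    [120, 130, 480, 510],
    [300, 310, 290, 305],
    [80, 75, 20, 15],
]


def row(matrix, gene_names, gene):
    """Return the expression pattern (one row) of a gene across all conditions."""
    return matrix[gene_names.index(gene)]


def column(matrix, condition_names, condition):
    """Return the expression of every gene (one column) in a single condition."""
    j = condition_names.index(condition)
    return [r[j] for r in matrix]


def group_means(matrix, condition_names, prefix):
    """Average each gene over the conditions whose label starts with prefix."""
    cols = [j for j, name in enumerate(condition_names) if name.startswith(prefix)]
    return [sum(r[j] for j in cols) / len(cols) for r in matrix]


normal_means = group_means(counts, conditions, "normal")
cancer_means = group_means(counts, conditions, "cancer")
```

For microarray data the same layout applies, with hybridization intensities in place of read counts.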

With gene expression data available, many analysis techniques can be applied. Pathway analysis is one of them, and it has an impact on drug development and disease diagnosis.

4 Classification of Pathway Analysis Techniques

Pathway analysis can be classified into two approaches: detecting significant pathways and discovering new pathways, as in Fig. 4. The first approach aims to define and rank significant pathways that are related to a specific phenotype, either by enrichment score analysis or by machine learning techniques [18]. Shin and Kim [23] classified the computational approaches for pathway analysis into three groups: clustering-based methods, gene-based methods and gene set-based methods. Clustering-based methods are based on the assumption that genes with similar expression have similar functions or are involved in the same biological processes [13, 23]. Therefore, genes are clustered and the pathways for each cluster are determined. In gene-based methods, differentially expressed genes (DEGs) between two samples (a test sample and a control sample) are identified, and then the significant pathways in which the DEGs are involved are determined. In gene set-based methods, the gene expression and a prior biological resource (such as pathway databases) are used to determine the significant pathways (gene sets) [23].
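
The gene-based idea can be sketched with a simple over-representation test. The following Python sketch assumes a hypergeometric test, which is one common choice and is not necessarily the statistic used in the cited works. The gene identifiers, pathway names and function names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the pathway."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1)) / total


def rank_pathways(degs, pathways, background):
    """Rank pathways by how strongly the DEGs are over-represented in them."""
    background = set(background)
    degs = set(degs) & background
    results = []
    for name, members in pathways.items():
        members = set(members) & background
        overlap = len(members & degs)
        p = hypergeom_upper_tail(overlap, len(background), len(members), len(degs))
        results.append((name, overlap, p))
    # Smallest p-value first: the most significant pathway leads the list.
    return sorted(results, key=lambda r: r[2])


# Hypothetical toy data: 20 background genes, two pathways, five DEGs.
background = [f"g{i}" for i in range(20)]
pathways = {
    "pathway_A": ["g0", "g1", "g2", "g3", "g4"],
    "pathway_B": ["g10", "g11", "g12", "g13", "g14"],
}
degs = ["g0", "g1", "g2", "g3", "g15"]
ranked = rank_pathways(degs, pathways, background)
```

In this toy example, four of the five DEGs fall in pathway_A and none in pathway_B, so pathway_A is ranked first. In practice the p-values would also need correction for multiple testing.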

Discovering new pathways can be achieved either by mining the literature through text mining techniques or by automatically inferring pathways from network interactions or gene expression data. In this paper, we focus on approaches for detecting significant pathways, and we categorize the research in the area according to the type of gene expression data analyzed into two categories: pathway-based microarray analysis and pathway-based RNA-seq analysis. Next, we review some research related to each category.

Fig. 4. Classification of pathway analysis techniques

4.1 Pathway-Based Microarray Analysis

Most research focuses on analyzing microarray gene expression, either to determine significant pathways that contribute to a phenotype of interest or to deal with the feature (gene) selection problem. Next, some research related to classification, feature selection and clustering approaches is reviewed.

Classification. It aims to define and rank significant pathways that are related to a specific phenotype using machine learning approaches. Zhang et al. [29] used four machine learning algorithms (naive Bayes, support vector machine, decision tree and random forests) to rank pathways based on classification error. Using three microarray expression datasets, they showed that machine learning algorithms outperform enrichment score analysis in identifying significant pathways. Pang et al. [20] used random forest classification and regression to analyze and rank pathways. In addition, they pointed out that their method was the first to use continuous measures for ranking pathways.
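
The idea of ranking pathways by classification error can be sketched as follows. This is only an illustration: it substitutes a simple nearest-centroid classifier with leave-one-out cross-validation for the naive Bayes, support vector machine, decision tree and random forest classifiers used in the cited work, and all gene names, pathway names and expression values are invented.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest (squared Euclidean) to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        d = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


def loo_error(samples, labels):
    """Leave-one-out classification error of the nearest-centroid classifier."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)


def rank_pathways_by_error(expression, labels, pathways):
    """Rank pathways by the error obtained when only their genes are used as features."""
    ranking = []
    for name, member_genes in pathways.items():
        # Transpose gene-wise value lists into one feature vector per sample.
        samples = [list(v) for v in zip(*(expression[g] for g in member_genes))]
        ranking.append((name, loo_error(samples, labels)))
    return sorted(ranking, key=lambda r: r[1])


# Hypothetical toy data: three normal and three cancer samples.
labels = ["normal"] * 3 + ["cancer"] * 3
expression = {
    "g1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    "g2": [2.0, 2.1, 1.9, 0.5, 0.4, 0.6],
    "g3": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
    "g4": [5.0, 5.1, 5.0, 5.1, 5.0, 5.1],
}
pathways = {"informative": ["g1", "g2"], "uninformative": ["g3", "g4"]}
ranking = rank_pathways_by_error(expression, labels, pathways)
```

A pathway whose genes separate the two classes well obtains a low error and is ranked first, while a pathway whose genes are unrelated to the phenotype obtains a higher error.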

Feature Selection. It aims to select informative genes within pathways before the pathway evaluation process, to reduce computational time and improve accuracy [19]. Misman et al. [19] pointed out that, when observing a particular biological context such as cancer, only some genes within a pathway are responsible for a phenotype. Thus, selecting a subset of genes is an important phase before ranking pathways. Zhang et al. [29] used minimum redundancy maximum relevance (mRMR) to select representative genes from each pathway. Panteris et al. [22] selected significant genes from each pathway (a pathway signature) that describe the pathway under a given experimental condition. Misman et al. [19] used SVM-SCAD to select genes within pathways and used B-type generalized approximate cross validation (BGACV) to select an appropriate tuning parameter for SVM-SCAD. Jungjit et al. [14] proposed a KEGG pathway-based feature selection method for multi-label classification. Their method selects genes based on a weighted formula that combines the predictive accuracy of genes and their occurrence in cancer-related KEGG pathways. Ibrahim et al. [11] selected strongly correlated genes for accurate disease classification by using pathways as prior knowledge. Their method was compared with five feature selection methods using two classifiers, K-nearest neighbour and support vector machine, and it performed the best on three microarray datasets.

Clustering. Detecting pathways in clustering analysis is used either as a validation measure or as a partitioning measure. A validation measure is used to demonstrate the validity of a clustering algorithm, while a partitioning measure is used to partition datasets into biologically meaningful clusters. For example, Shin and Kim [23] used hierarchical clustering with Euclidean distance to generate gene clusters from gene expression data. Then, pathways were identified in each cluster to check the validity of the clustering in identifying the subclasses of leukemia. Zhao et al. [30] proposed a pathway-based clustering approach that used pathways to identify clusters. Their aim was to identify subgroups of cancer patients that may respond to the same treatment, since cancers can have similar phenotypes but result from different genetic mutations, which leads to different responses to the same treatment. Their method is as follows: first, identify differentially expressed genes (DEGs); then, identify KEGG pathways that are enriched with DEGs; finally, classify the samples according to the expression of genes within the specified pathways. Also, Milone et al. [17] proposed a new method based on self-organizing map (SOM) clustering that used common metabolic pathways and Euclidean distance as similarity measures to construct clusters. Their objective was to improve the quality of cluster formation by incorporating pathway information. They used transcript and metabolite datasets from the Solanum lycopersicum and Arabidopsis thaliana species. Their method improved the biological meaning of clusters compared with classical SOM. Moreover, Kozielski and Gruca [16] proposed a method that combined gene expression and gene ontology to identify clusters, so that cluster membership should satisfy both gene expression and gene ontology. The proposed method is based on a fuzzy clustering algorithm. Pang and Zhao [21] proposed a method to generate clusters of pathways that are related to a phenotype of interest from pathway-based classification [20]. They used class votes from random forest as the similarity measure between pathways, together with a tight clustering approach. Table 1 summarizes the research in pathway-based clustering and describes the dataset, the aim of clustering and the aim of pathway analysis (either a validation or a partitioning measure).
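
The use of pathways as a validation measure for clustering can be sketched as follows. The sketch performs agglomerative hierarchical clustering with Euclidean distance and then counts pathway members in each cluster. Average linkage is an assumption here, since the linkage criterion of the cited work is not stated above, and all gene names, pathway names and expression profiles are invented.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean Euclidean distance over all gene pairs drawn from the two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(dist(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


def hierarchical_clusters(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[gene] for gene in sorted(profiles)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def pathways_per_cluster(clusters, pathways):
    """For each cluster, count how many of its genes belong to each pathway."""
    return [
        {name: len(set(cluster) & set(members)) for name, members in pathways.items()}
        for cluster in clusters
    ]


# Hypothetical toy data: expression profiles of four genes over four conditions.
profiles = {
    "g1": [1.0, 1.1, 5.0, 5.2],
    "g2": [0.9, 1.0, 5.1, 5.0],
    "g3": [5.0, 5.1, 1.0, 0.9],
    "g4": [5.2, 4.9, 1.1, 1.0],
}
pathways = {"pathway_X": ["g1", "g2"], "pathway_Y": ["g3", "g4"]}
clusters = hierarchical_clusters(profiles, n_clusters=2)
summary = pathways_per_cluster(clusters, pathways)
```

When each cluster is dominated by the genes of a single pathway, as in this toy example, the pathway annotation supports the biological validity of the clustering.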

4.2 Pathway-Based RNA-seq Analysis

Limited research has focused on analyzing RNA-seq gene expression to determine significant pathways that contribute to a phenotype of interest. This research focuses on statistical approaches. For example, Xiong et al. [27] developed a tool set that has multiple gene-level and gene set-level statistics to determine significant pathways. Fridley et al. [9] proposed using the gamma method with a soft truncation threshold to determine the gene sets that are related to a particular phenotype. Then, they applied the method to a smallpox vaccine immunogenetic study to identify gene sets or pathways with differentially expressed genes between high and low responders to the vaccine. Wang and Cairns [26] proposed combining differential expression with splicing information to detect significant gene sets based on a Kolmogorov-Smirnov-like statistic. Xiong et al. [27] pointed out that the Wang and Cairns method is computationally expensive. Also, Hanzelmann et al. [10] developed a method that calculates the variation of the pathway activity profile over a sample population to analyze gene sets. Their method can be applied to RNA-seq as well as microarray data.
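
To illustrate what a Kolmogorov-Smirnov-like gene set statistic looks like, the following Python sketch computes an unweighted running-sum score over a ranked gene list and estimates its significance by random gene sets. It is a generic sketch under these assumptions and does not reproduce the method of Wang and Cairns, which also uses splicing information. The gene names and function names are hypothetical.

```python
import random


def ks_like_score(ranked_genes, gene_set):
    """Maximum signed deviation of a running sum that rises on set members."""
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Share of random gene sets of equal size that score at least as extremely."""
    rng = random.Random(seed)
    observed = abs(ks_like_score(ranked_genes, gene_set))
    size = len(set(gene_set) & set(ranked_genes))
    extreme = 0
    for _ in range(n_perm):
        random_set = rng.sample(ranked_genes, size)
        if abs(ks_like_score(ranked_genes, random_set)) >= observed:
            extreme += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


# Hypothetical toy data: genes ordered from most to least differentially expressed.
ranked = [f"g{i}" for i in range(1, 11)]
top_score = ks_like_score(ranked, ["g1", "g2", "g3"])
scattered_score = ks_like_score(ranked, ["g2", "g6", "g10"])
p_top = permutation_p_value(ranked, ["g1", "g2", "g3"])
```

A gene set concentrated at the top of the ranking reaches a score close to 1, whereas a set scattered across the ranking stays near 0. The repeated permutations also show why statistics of this kind can become computationally expensive for many gene sets.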

Table 1. Clustering and pathway analysis

5 Pathway-Based MiRNA Analysis

Few studies have focused on analyzing miRNA to determine significant pathways. Among them, Chen et al. [8] proposed a mathematical model (a Bayesian implementation) that used miRNA targets to map miRNAs to pathways and then applied a hypothesis test to extract significant pathways. Zhang et al. [28] used sample-matched miRNA and mRNA expression and pathway structure to analyze glioma patient survival. Wang et al. [25] suggested using functional information (gene ontology) to improve miRNA target prediction algorithms, since genes that are regulated by the same miRNA may share similar functions. Most miRNA target prediction algorithms use physical interaction mechanisms such as free energy, seed match and sequence conservation. Wang et al. [25] built an SVM ensemble classifier that combined gene ontology and sequence information to predict miRNA targets.

6 Conclusion

Pathway analysis is reliable in discovering diseases and has various useful applications. In addition, data mining techniques are applied to pathway analysis to discover biologically interesting hidden information. Most research in the field is based on the analysis of microarray datasets, and little is based on RNA-seq. Thus, applying data mining approaches such as classification and clustering to pathway-based RNA-seq analysis may lead to more biologically meaningful results.