Keywords

1 Introduction

Stem cells [1] have remarkable potential i.e. atypical characteristics to develop into many different cell types in the body during growth of early life. Sometimes, the unrestricted growth of stem cells or the abnormal growth of cell tissue causes cancer. Stem cells are distinguished from other cell types by two important characteristics. First, they are unspecialized cells capable of renewing themselves through cell division or sometimes after long periods of inactivity. Second, under certain physiologic or experimental conditions, they can be induced to become tissue or organ-specific cells with special functions. Researchers primarily worked with two kinds of stem cells: embryonic stem cells and non-embryonic “somatic” or “adult” stem cells. The induced pluripotent stem cells (iPSCs) [24] are some specialized adult cells which “reprogrammed” genetically to assume a stem cell like state in special condition. This property is unique property of stem cell it can be explored by gene expression analysis. Microarray gene expression data to predict and analysis of cancer disease becomes very important. These data can be characterized by genome variables and with their corresponding observations (experiments) in a experimental limitations [5]. To discover co-regulated genes, analysis of gene expression data [6] is required. Previously, it has been assumed that similar patterns in gene expression profiles usually suggest relationships between the genes. Now it is proved genes targeted by same transcription factors, tend to show similar expression patterns along time. Analyzing expression profiles of genes, targeted by same transcription factors revealed complex relationships between co-regulated gene pairs and it also includes co-expression relationships. In this article, we developed a simple algorithm to find differentially co-expressed and co-regulated genes, and then, to construct pairwise co-expression network. We applied the algorithm on sample gene expression microarray data of normal stem cells (nscr), stem cells derive tumors cells (scdtr) and patch tumor cells (ptr).

2 Materials and Methods

For finding and analyzing the relationship between differentially co-expressed genes (DCEG), many techniques have been developed in the literature. Here, we applied statistical t-test [7, 8] and Benjamini Hochberg method [9] to identify differentially expressed genes (DEG) [10] within sample groups of nscr and scdtr, nscr and ptr (http://www.ncbi.nlm.nih.gov/GSE20948). In addition to this, we have studied the significance of up-regulated and down-regulated genes. We code the algorithms with using Matlab. The main steps of proposed algorithms are discussed in the following subsections.

2.1 Preprocessing of the Dataset

Our sample Bon Marrow microarray gene expression data is in normalized form. Normalized data for each gene is typically known as an ʻexpression ratio’ or as the logarithm of expression ratio. We have done data preprocessing with filtered out low expressed values and null values from the datasets.

2.2 Identification of Differentially Co-expressed Genes

In this section, we discuss the steps to identify DEG. First, a standard statistical t-test is performed for detecting significant changes between measurements of genes in sample microarray groups. It may occurs two types of errors, 1: a false positive by declaring that a gene is differentially expressed when it is not, and 2: a false negative when the test fails to identify a truly differentially expressed gene. Second, Benjamini Hochberg method [9] is used for choosing significantly differentially expressed genes. It is done by the following equation:

$$ P \le y*\frac{\alpha }{m}, $$
(1)

where \( P \) is the largest p-value called significant, \( y \) is the number of genes called significant and \( m \) is the total number of genes tested, \( \alpha \) is false discovery rate (FDR) [11, 12], defined as the expected ratio of the number of false positives to the total number of positive calls in a differential expression analysis between two sample groups [11]. FDR can be measured [11] as,

$$ FDR = Err\left[ {\frac{F}{F + T}} \right] = Err\left[ {\frac{{n_{0} \cdot [1 - specificity]}}{{n_{0} \cdot [1 - specificity] + n_{1} \cdot sensitivity}}} \right], $$
(2)

where \( F \) is the number of false positives, \( T \) is the number of true positives, and \( S \) is the total number of features called significant. Also, \( n \), number of p-values is seen more clearly, \( n_{0} \) is the number of truly null features in the study, and \( n_{1} = n - n_{0} \) is the number of truly alternative features. Regardless of whether the p-value threshold is fixed or data-dependent, the quantities \( F \), \( T \) and \( S \) are random variables. Therefore, it is common statistical practice to write the overall error measure in terms of an expected value, which we denote by \( Err \) [11].

2.3 Proposed Algorithm

Input::

phenotype1: gene expression values of normal stem cell (nscr),

phenotype2: gene expression values of stem cell derive tumor (scdtr),

phenotype3: gene expression values of patched tumor (ptr).

Output::

All differentially expressed genes.

Step1::

Perform two-sample t-test to evaluate differential expression of genes from phenotype1 and phenotype2. Their p-values and t-scores are stored.

Step2::

Perform a permutation t-test to compute the p-values of 10,000 permutations by permuting the columns of the gene expression data matrix of phenotype1 and phenotype2.

Step3::

Determine the number of genes considered to have statistical significance at the p-value ʻcutoff’ of 0.001.

Step4::

Estimate FDR for the genes with statistically significant p-values.

Step5::

Create a scatter plot of gene expression data, plotting significance versus fold change of gene expression ratios of phenotype1 and phenotype2.

2.4 Separation of Up-regulated and Down-regulated Genes

We plot the volcano plot of identified all differentially expressed genes as volcano plot (Fig. 4) of two phenotypes returns a structure containing information for genes that are considered to be both statistically significant and significantly differentially expressed. This information helps us to identify co-regulated genes, specially up-regulated and down-regulated genes. Now based on cutoff value (0.001), up-regulated and down-regulated genes are separated from the total set of differentially co-expressed genes.

2.5 Extraction of Co-expressed and Co-regulated Genes

To separate co-expressed and co-regulated genes from the identified DEG, we generate a algorithm and code the algorithm with using Matlab. DEGs which have the same \( p \)-value are called co-expressed genes. Up-regulated and down-regulated genes with similar p-values are known as co-expressed and co-regulated genes. Our proposed algorithm is as follows:

  • Algorithm

Input::

Exp: differentially expressed genes or up-regulated or down-regulated genes.

Output::

All Co-expressed and Co-regulated genes.

Step1::

Each distinct p-value of Exp and its respective position are stored in separate data vector and index vector.

Step2::

These data vector p-values are compared with the p-values of the original data, and if they match then their corresponding information is extracted from Exp with the help of the index vectors.

Step3::

The process is repeated until all distinct data vector p-values are compared.

2.6 Visualization of Differentially Co-expression and Co-regulation Network

We constructed pair-wise co-expression network to understand the relationship among co-expressed and co-regulated genes. Co-expression networks are built depending on their corresponding p-values. We have used cytoscape software to visualize the pair-wise network. The genes having same p-value construct a paired structure.

3 Results and Discussion

The Bon Marrow microarray data consists of 45,101 genes with having normal stem cell tissue (nscr) responses (5no.), stem cell derived tumor tissue (scdtr) responses (5no.) and patched tumor cell tissue (ptr) responses (4no.) are taken for the study. After analysis, we considered only the genes having p-value \( \underset{\raise0.3em\hbox{$\smash{\scriptscriptstyle-}$}}{ \prec } \) 0.001 as significant genes (Figs. 1 and 2). Out of 45,101 genes from the dataset, 2,325 genes for sample first group (nscrscdtr), 2,056 genes for sample second group (nscrptr) of data have been extracted on the basis of p-value which is approximately 5 % of total number of genes. Now, estimated false discovery rate (FDR) for statistically significant p-values are computed by using Eq. 2. As a result, we get 1,110, 792 up-regulated and 1,215, 1,264 down-regulated genes for two sample groups (Figs. 3 and 4) respectively. It implies that 2–3 % genes of total 45,101 genes are up-regulated and down-regulated.

Fig. 1
figure 1

a Histogram plot of sample group nscr and scdtr. b Histogram plot of sample group nscr and ptr

Fig. 2
figure 2

a Quantile plot of sample group nscr and scdtr. b Quantile plot of sample group nscr and ptr

Fig. 3
figure 3

a Fold change plot of sample group nscr and scdtr. b Fold change plot of sample group nscr and ptr

Fig. 4
figure 4

a Volcano plot of sample group nscr and scdtr. b Volcano plot of sample group nscr and ptr

Further, with finding similar p-values co-expressed and co-regulated genes are extracted from all differentially co-expressed genes (DCEG), up-regulated (UR) and down-regulated (DR) genes. We get 8, 4 numbers of up-regulated and down-regulated genes for group1 i.e. approximately 0.01 and 0.008 % of total 45,401 genes. Again we get 6, 14 numbers of up-regulated and down-regulated genes for group2 i.e. approximately 0.01 and 0.03 % of the total 45,101 genes. It is shown in the Table 1. The interaction among co-expressed and co-regulated genes depending on corresponding p-values, 6 paired networks are developed (Figs. 5 and 6).

Table 1 The table shows resulting differentially expressed and co-expressed genes of all sample groups
Fig. 5
figure 5

a Pairwise network of differentially co-expressed genes of sample group nscr and scdtr. b Pairwise network of differentially co-expressed genes of sample group nscr and ptr

Fig. 6
figure 6

a Pairwise network of differentially co-expressed up-regulated genes of sample group nscr and scdtr. b Pairwise network of differentially co-expressed up-regulated genes of sample group nscr and ptr. c Pairwise network of differentially co-expressed down-regulated genes of sample group nscr and scdtr. d Pairwise network of differentially co-expressed down-regulated genes of sample group nscr and ptr

The significance of the identified DEG are studied and analyzed with Gene Ontology (GO). We have listed it in Fig. 7.

Fig. 7
figure 7

Significant genes analyzed with gene ontology

4 Conclusion

In this paper, we first found DEG and then found correlations between gene-pairs for construction of co-expressed and co-regulated networks under diseased conditions that assist the interpretability of network. We also generated pairwise differentially co-expression network and constructed the same for differentially co-expressed and co-regulated genes of Bone Marrow stem cell microarray data. We also analyze the significance of DEG for the same microarray data. From our sample data we found 82 significantly co-expressed genes and 30 co-expressed and co-regulated genes. In future study, we can apply artificial intelligence based sophisticated techniques (fuzzy logic, neural networks, evolutionary computation) for better construction of cancer-specific regulatory networks.