1 Introduction

Cancer now ranks as a leading cause of death worldwide [1]. Oral cancer (OC) is a category of head and neck squamous cell carcinoma that mostly develops on the lip, floor of the mouth, cheek lining, gingiva, palate or tongue. In India, OC ranks among the top three types of cancer and accounts for more than 30% of all cancers [2]. OC is often diagnosed at an advanced stage, i.e., when the cancer has metastasized to another location, most likely the lymph nodes of the neck, which results in poor treatment outcomes and leaves patients with a significantly low survival rate [3]. Alcohol addiction, use of tobacco products such as cigarettes and smokeless tobacco, and viral infection are the most common risk factors for oral cancer [4].

Cancer is a multi-step process caused by mutations in genes that control cell behaviour. Mutated genes may result in uncontrolled growth of cells that invade and damage the adjacent tissue [5]. MicroRNAs (miRNAs) are small non-coding RNA sequences of 20–23 nucleotides that are implicated in numerous biological and anatomical processes, including cell differentiation, cell signalling, apoptosis, metastasis and response to infection [6]. A dysregulated expression pattern of miRNAs is an indicator of the initiation and progression of various diseases, including cancer [23]. Hence, identification of dysregulated miRNAs is crucial for understanding the biological mechanisms in which miRNAs take part. miRNAs govern the post-transcriptional expression of genes by complementary base pairing with target mRNAs in both normal and diseased conditions of the cell [7]. Thus, prediction of miRNA-mRNA target interactions is important for elucidating the mechanism by which miRNAs act in the carcinogenesis process. However, correctly characterizing the action of miRNAs on their mRNA targets remains a challenge, because each miRNA has multiple mRNA targets and each mRNA can be targeted by multiple miRNAs [8]. This many-to-many association highlights the importance of integrating miRNA expression with the expression of downstream mRNA target genes [9].

2 Background Study

Several studies in the literature have sought to identify novel miRNA signatures associated with cancer and to elucidate miRNA-mRNA target interactions. Seo et al. applied an integrative approach to miRNA, mRNA and protein expression data for identifying cancer-related miRNAs and investigating the gene-miRNA association [10]. Modules of highly correlated miRNAs, mRNAs and proteins were constructed using the SAMBA bi-clustering algorithm and a Bayesian network model. The regulatory relationships between these modules were then investigated for precise analysis of miRNA-target gene interactions. Another integrative approach was proposed in [11] to identify the mRNA targets of abnormally regulated miRNAs. Several aberrantly expressed miRNAs and the associated target mRNA signatures were identified in this approach across six different cancer types. Sathipati et al. proposed the SVM-HCC model, based on an inheritable bi-objective combinatorial genetic algorithm, for selecting novel miRNA signatures that predict hepatocellular carcinoma stages [12]. A hierarchical integrative model was used in [13] to uncover miRNA-mRNA associations from the sequence data of miRNAs and mRNAs. The identified miRNA-mRNA pairs were observed to be involved in processes contributing to hepatocellular carcinoma progression. A biphasic technique of machine learning based feature selection followed by survival analysis was applied in [14] to identify the most significant miRNA biomarkers for breast cancer subtype prediction. miRNAs are also strongly associated with various oral carcinogenic processes. Thus, abnormal miRNA expression detected in samples obtained from oral cancer patients is clinically significant for prediction and for the development of effective treatments [16]. Falzole et al. used miRNA expression data from GEO and TCGA miRNA profiling datasets to identify miRNA signatures specific to OC [15].

In this study we propose an integrated computational approach for the identification and analysis of dysregulated miRNAs and their target mRNAs in oral cancer. Dysregulated miRNAs were prioritized based on their contribution to predicting the diseased condition. Further, putative dysregulated target mRNAs specific to cancer were identified, and their ability to separate the clinical conditions was examined.

The paper is organized as follows: Sect. 3 briefly describes the steps of our proposed method. Section 4 presents and discusses the empirical results of this study. Finally, Sect. 5 presents the conclusions of our study.

3 Materials and Methods

3.1 Dataset Used

Next-generation sequencing based miRNA and mRNA expression data from the same patients were used in this work. The dataset was taken from the GDC data portal of TCGA (https://portal.gdc.cancer.gov/). The Cancer Genome Atlas (TCGA) is a cancer genomics consortium spanning over 33 cancer types, which applies high-throughput genome analysis techniques to characterize the genetic mutations responsible for cancer [17]. The data set consists of expression values of 1881 miRNAs and 18283 genes for 120 tumor samples and 44 matched normal samples.

3.2 Proposed Model

Figure 1 illustrates the workflow of our proposed model. The steps of the proposed work are as follows:

Fig. 1. Proposed model for identification of dysregulated miRNAs and associated target genes.

  1. (A)

    Data preparation: The data preparation step of our approach entailed removing candidate miRNAs and genes with more than 30% missing values and then replacing the remaining missing values with the mean of the sample [18]. Further, a base-2 logarithmic transformation [11] was applied to the resultant data to bring it closer to a normal distribution.

  2. (B)

    Identification of significant dysregulated miRNAs: A differential expression analysis of miRNAs and genes was performed to find those that show quantitative changes in expression levels between the two experimental groups, normal and diseased. The candidate miRNAs and genes were assessed using adjusted p-values and log-transformed fold change. A change in expression profile was the filtering criterion for identifying differentially expressed miRNAs: miRNAs with adjusted p-value \(\le \)0.05 and logFoldChange \(\ge \)2 [18] were considered significant in this study.

    To evaluate the predictive ability of the differentially expressed miRNAs and to extract a handpicked set of miRNAs, a Random Forest (RF) classifier was adopted [11]. RF is a classification method based on ensemble learning, which combines the solutions produced by multiple classifiers. The forest generated in the RF model consists of many decision trees of varying depths. The RF method applies bootstrap aggregating (bagging) to train the decision trees. The samples left out during the training of each decision tree are referred to as Out-Of-Bag (OOB) samples. For a new, unseen sample, the learned RF model predicts by aggregating the outputs of the individual decision trees, typically by majority vote. An RF classifier can also be used to rank features. Here we used Mean Decrease Accuracy (MDA) as the parameter for filtering significant miRNAs [24]. MDA quantifies the additional proportion of observations that are incorrectly classified when the feature is removed from the learned model [21]. The higher the MDA value, the more important the feature.

  3. (C)

    Identification of miRNA-target genes using an open-source repository: The potential targets of the dysregulated miRNAs obtained in the previous step were retrieved from the TargetScan database [19]. For finding the target genes, we considered conserved miRNAs only, because more than 60% of human genes contain targets of miRNAs conserved across species [20]. The top predicted target genes with the highest aggregate probability of conserved targeting (Aggregate P\(_{CT}\)), irrespective of site conservation, were considered for further analysis. Finally, a set of distinct miRNA-target genes was identified for the input dysregulated miRNAs.

  4. (D)

    Screening of statistically significant target genes: Differential expression analysis of the identified miRNA-target genes was performed to screen for genes that are expressed at different levels between clinical conditions. These genes are expected to offer precise biological insight into the processes affected by the condition(s) of interest. Furthermore, to examine the correspondence between the differentially expressed miRNA-target genes and the disease of interest, a disease relatedness analysis was carried out using data from NCBI (www.ncbi.nlm.nih.gov/geo/). Data on 8933 cancer-related (CR) genes and 316 oral cancer (OC) genes were obtained from NCBI. The screened target genes were checked for the presence of cancer genes and oral cancer genes. Three different classifiers, KNN (k = 3), SVM and RF, were applied to examine the prognostic ability of the identified target genes with respect to the two clinical conditions. Specificity, Sensitivity, Precision, F-Score, Matthews correlation coefficient (MCC) and Prediction accuracy were used to measure the classification performance [22].
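
The filtering in steps (A), (B) and (D) can be illustrated with a minimal Python sketch. It is not the implementation used in this work, and it rests on several assumptions that the text does not fix: "mean of the sample" is read as the mean of a sample over the retained features, a pseudocount of 1 is added before the base-2 logarithm, the p-value adjustment is taken to be Benjamini-Hochberg, and the fold-change cut-off is applied to the absolute value so that both up- and down-regulated features pass. The raw p-values are assumed to come from a separate statistical test, and all function names are hypothetical.

```python
import math
from statistics import mean


def prepare(expr, max_missing=0.30):
    """Filter, impute and log2-transform an expression table.

    expr maps a feature name (miRNA or gene) to a list of values,
    one per sample, with None marking a missing value.
    """
    # Drop features with more than max_missing (30%) missing values.
    kept = {
        name: values
        for name, values in expr.items()
        if sum(v is None for v in values) / len(values) <= max_missing
    }
    if not kept:
        return {}
    n_samples = len(next(iter(kept.values())))
    # Assumption: each sample has at least one observed value left.
    sample_means = [
        mean(values[j] for values in kept.values() if values[j] is not None)
        for j in range(n_samples)
    ]
    # Impute with the sample mean, then apply log2(x + 1).
    # The pseudocount of 1 is an assumption to avoid log2(0).
    return {
        name: [
            math.log2((v if v is not None else sample_means[j]) + 1.0)
            for j, v in enumerate(values)
        ]
        for name, values in kept.items()
    }


def log_fold_change(log_values, is_tumor):
    """Difference of group means on log2 data (tumor minus normal)."""
    tumor = [v for v, t in zip(log_values, is_tumor) if t]
    normal = [v for v, t in zip(log_values, is_tumor) if not t]
    return mean(tumor) - mean(normal)


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_dysregulated(log_fc, pvals, alpha=0.05, min_abs_lfc=2.0):
    """Keep features with adjusted p <= alpha and |logFC| >= cut-off."""
    names = list(pvals)
    adjusted = dict(zip(names, bh_adjust([pvals[n] for n in names])))
    return [
        n for n in names
        if adjusted[n] <= alpha and abs(log_fc[n]) >= min_abs_lfc
    ]
```

In use, `prepare` would be applied to the miRNA and gene tables separately, `log_fold_change` computed per feature from the tumor/normal labels, and `select_dysregulated` called with the resulting fold changes and the raw p-values.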

4 Results

We performed a stepwise analysis and selection of miRNA signatures and the associated target genes. The results of the proposed work are presented in three parts:

  1. (i)

    Prioritizing miRNA signatures

  2. (ii)

    Identification of significant target genes specific to the disease

  3. (iii)

    Effectiveness of the final selected target genes.

Table 1. Top 10 differentially expressed and computationally significant miRNA signatures.
  1. (i)

    Identification of prioritized miRNA signatures

    The miRNA expression and gene expression data collected from TCGA were passed to the data preparation step of our proposed model, which resulted in 493 miRNA signatures and expression values of 16478 genes. Differential expression analysis of these 493 miRNA signatures yielded 244 differentially expressed miRNA signatures with adjusted p-value <0.05 and logFoldChange >2. To further identify the miRNA signatures that are significant in disease prognosis, the RF classifier was used. miRNA signatures were ranked in decreasing order of MDA value, and those with MDA value >1 were considered significant. This resulted in 72 significant dysregulated miRNA signatures out of 244. These handpicked 72 miRNA signatures are therefore both differentially expressed and computationally significant. The top 10 significant miRNA signatures are listed in Table 1.

  2. (ii)

    Identification of significant target genes

    To obtain the potential targets of the 72 notable miRNA signatures obtained in the previous step, the web-based target prediction tool TargetScan was used. We systematically searched TargetScan to identify the biological targets of the handpicked miRNAs. Target genes were queried for conserved miRNAs only. For each conserved miRNA, we obtained the top 50 targets with the highest aggregate probability of conserved targeting (PCT) value. This resulted in 1511 unique miRNA-target genes, which were finally mapped to the pre-processed gene expression data obtained for OC. During mapping, a few miRNA-targets were dropped because they were unavailable in the acquired TCGA gene expression data. This resulted in expression values for 1334 miRNA-target genes.

    To select the target genes showing significant changes between clinical conditions, the expression vectors of the 1334 target genes for tumor and normal samples were compared with respect to adjusted p-value and logarithmic fold change in expression levels. An adjusted p-value <0.05 and logFoldChange >2 were used as cut-off parameters. This resulted in 671 differentially expressed miRNA-target genes. The volcano plot in Fig. 2a shows the differentially expressed miRNA-target genes as cyan coloured dots.

    Further, these target genes were checked for the existence of cancer-related genes and oral cancer genes using data collected from NCBI. Among the 671 differentially expressed miRNA-targets, 331 genes were observed to be related to cancer and 19 genes were oral cancer genes. The proportion of CR and OC genes in the whole set of differentially expressed genes is shown in Fig. 2b. These 350 (331 CR and 19 OC) genes were further validated in the following step with three different classifiers: KNN, RF and SVM.

  3. (iii)

    Effectiveness of the final selected target genes

    To examine the predictive efficiency of the 350 target genes obtained in the previous step, we ran the classifiers KNN (k = 3), RF and SVM. All the classifiers were run with 10-fold cross validation. Table 2 shows the results of classification. With the three classifiers KNN (k = 3), RF and SVM, the identified miRNA-target genes achieved an accuracy of 95.6%, 97.1% and 98.5% respectively. The selected target features reached an average MCC value of 0.9 across the considered classifiers. An MCC value >0.5 is generally considered significant in machine learning settings where the input classes are imbalanced. The results show that the target genes obtained with this approach can shed new light on OC prognosis. These genes can further be biologically validated to confirm their participation in disease-specific pathways and biological processes.
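
The performance measures named in step (D) and reported in Table 2 all follow from the four cells of a binary confusion matrix. The sketch below shows one way they could be computed in Python; the function names are hypothetical, tumor is assumed to be the positive class, and ratios with a zero denominator are returned as 0.0 by convention. The counts used in the usage note are made up for illustration and are not results of this study.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute the six measures from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = _ratio(tn, tn + fp)   # true negative rate
    precision = _ratio(tp, tp + fp)
    f_score = _ratio(2 * precision * sensitivity, precision + sensitivity)
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    # MCC uses all four cells, so it stays informative when the
    # classes are imbalanced (e.g. 120 tumor vs 44 normal samples).
    mcc = _ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

For example, `binary_metrics(*confusion_counts(y_true, y_pred))` could be evaluated on the pooled predictions of the 10 cross-validation folds, or per fold and then averaged; the text does not state which of the two was used.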

Fig. 2. (a) Volcano plot showing differentially expressed miRNA-target genes with cyan coloured dots. (b) Presence of cancer-related (CR) and oral cancer (OC) genes in the selected group of miRNA-target genes.

Table 2. The classifier performance result of identified miRNA-target genes.

5 Conclusion

Identification of specific miRNAs and their associated target genes is crucial for characterizing the action of miRNAs in the biological processes that lead to cancer progression. The proposed work started with sample-matched miRNA expression and gene expression data to identify the dysregulated miRNAs and their respective target genes. Up- and down-regulated miRNAs with a high mean decrease in classifier accuracy were considered to be dysregulated. The top-ranked target genes of the resulting dysregulated miRNAs were obtained from the online repository and further analyzed with respect to their differential expression and their association with the disease. The cancer-specific target genes obtained with this approach showed high prediction accuracy, which suggests their potential use in prognostic applications for diagnosis and treatment. These handpicked miRNA-target genes may further be biologically validated to confirm their role in biological processes and oncogenesis pathways.