Introduction

DNA N6-methyladenine (6 mA) regulates various biological functions, including genomic imprinting, cell development, and chromosome stability, and increases genomic diversity in both prokaryotes and eukaryotes (Xiong et al. 2019; Zhang et al. 2018). 6 mA plays an important role in examining the host DNA and defends the host genome via several modification systems (Du et al. 2019), and it is evenly distributed across the genome (Liu et al. 2018b; O'Brown and Greer 2016). However, the biological functions and epigenetic modifications of 6 mA still remain unclear. Genomic 6 mA distributions are essential for revealing the potential biological functions of this modification. Recently, different experimental methods have been used to identify 6 mA, including liquid chromatography coupled with single-molecule real-time sequencing and methyladenine-precise PCR (McIntyre et al. 2019; Zhang et al. 2015), but these methods are time-consuming, laborious, and expensive. With the explosive growth of biological sequences in the next-generation sequencing era, the rapid development of machine learning (ML)-based algorithms has driven computational chemistry to an unprecedented revolution (Chen et al. 2015; Chou 2019; Liu et al. 2016; Sun et al. 2020). Therefore, ML-based methods can be used as an alternative to experimental efforts.

Several species-specific ML-based approaches have been established for the identification of 6 mA sites, including those for rice (Basith et al. 2019; Chen et al. 2019a; Huang et al. 2020; Yu and Dai 2019) and Mus musculus (Feng et al. 2019). Although several computational methods have been proposed for 6 mA prediction in some species (Qianfei Huang et al. 2020; Wang and Yan 2018), none of them was developed to specifically identify 6 mA sites in the Rosaceae genomes. In particular, the existing prediction models are not suitable for identifying 6 mA in the Rosaceae genomes because they are species-specific. Thus, a novel predictor is needed to identify 6 mA sites in the Rosaceae genomes.

To the best of our knowledge, we propose the first computational model, named i6mA-Fuse (Identification of N6-MethylAdenine sites by Fusing multiple feature representations), to predict 6 mA sites in the Rosaceae genomes, especially in Rosa chinensis and Fragaria vesca. An overall framework is depicted in Fig. 1. First, five feature vectors were generated by the five encoding schemes of the k-mer composition (Kmer), k-space spectral nucleotide composition (KSNC), mononucleotide binary encoding (MBE), dinucleotide binary encoding (DBE), and electron–ion interaction pseudopotentials (EIIP). Subsequently, a random forest (RF) classifier was used to build the five single encoding-employing models. Finally, the predicted probability scores of the encoding-based models were combined through a linear regression to make a final prediction. As far as we know, i6mA-Fuse is the first computational predictor of 6 mA sites within the Rosaceae genomes.

Fig. 1

An overall framework of i6mA-Fuse. It involves three steps: (i) dataset construction based on MDR database; (ii) employing five different encoding schemes for converting nucleotides into numerical feature vectors; and (iii) model construction and evaluation using cross-validation. Subsequently, a webserver was constructed based on the proposed model (i6mA-Fuse), where it predicts putative 6 mA sites from the submitted query sequences

Materials and methods

An outline of i6mA-Fuse is presented in Fig. 1. Four key phases are discussed as follows: (i) datasets construction, (ii) feature extraction, (iii) probability scores calculation, and (iv) final model construction.

Datasets construction

A high quality dataset could guarantee the reliability and robustness of the proposed model (Xu et al. 2019). In this study, the positive samples (6mAs) were extracted from the reliable MDR database for F. vesca and R. chinensis (Liu et al. 2019). Each sample had a length of 41 base pairs with adenine nucleotide (“A”) at the center, whose modification score was \(\ge\) 20. To avoid any similarity bias, we utilized the CD-HIT program (Fu et al. 2012) to exclude highly similar samples by setting a cutoff threshold of 0.7. After such a screening procedure, we obtained the positive samples of 5733 and 1417 for R. chinensis and F. vesca, respectively. Meanwhile, the standard process was applied to collect negative samples as described in the previous studies (Basith et al. 2019; Lv et al. 2019a). Herein, we obtained strict and objective datasets consisting of 5733 positive and negative samples for R. chinensis, and 1417 positive and negative samples for F. vesca.

To further validate the predictive ability of the proposed model, the aforementioned datasets were randomly divided into the training and independent datasets at a 3 to 1 ratio. Finally, we obtained 4303/1067 positive and negative samples for R. chinensis/F. vesca as training datasets, while 1430/350 positive and negative samples for R. chinensis/F. vesca were treated as independent datasets. To confirm the reproducibility of the models, all the curated datasets used in this study are available at https://kurata14.bio.kyutech.ac.jp/i6mA-Fuse/help.php.

Feature extraction

One of the vital procedures is to express DNA sequences with an effective mathematical expression that can accurately reflect their intrinsic correlation with the anticipated objective (Yang et al. 2019a). In this study, the five encoding schemes of Kmer, KSNC, MBE, DBE, and EIIP were used for constructing the i6mA-Fuse predictor. The Kmer scheme encodes the occurrence frequencies of nucleotides in a DNA sequence (Liu et al. 2018a; Manavalan et al. 2018c). The DNA sequence is expressed by Kmer as \(F = f_{1}, f_{2}, f_{3}, \ldots, f_{L}\), where L is the length of a positive or negative sample and each \(f_{i}\) is one of the nucleotides A, C, G, and T. Mono-, di-, tri-, and tetra-nucleotides were encoded and combined to form a 340 (\(4^{1} + 4^{2} + 4^{3} + 4^{4}\)) dimensional feature vector. The KSNC scheme encodes the DNA nucleobase information of the curated samples using the frequency-wise pair similarity search (Charoenkwan et al. 2013; Zhou et al. 2016). The frequency of each spaced nucleobase pair is encoded and normalized as

$${\text{Frequency}} {\text{pair}} = \frac{{N\left( {nf_{i} } \right)}}{w - S - 1}$$
(1)

where \(N(nf_{i})\) is the number of occurrences of the nucleotide pair \(nf_{i}\) in a DNA sample, w is the length of the sample, and S is the space between the two nucleotides. The KSNC generates a 4 × 4 × (Smax + 1) dimensional vector for a sequence, where Smax was equal to 3. The MBE scheme encodes the nucleotide at each position as a binary vector, where A, G, T, and C are encoded as (1,0,0,0), (0,0,1,0), (0,1,0,0), and (0,0,0,1), respectively. For a sequence of length w = 41, the MBE generates a 164-dimensional vector. The DBE scheme encodes each of the 16 possible dinucleotides as a 4-dimensional binary vector (Manavalan et al. 2019a). For example, AA, AC, AT, and GG were coded to (0,0,0,0), (1,1,1,1), (0,0,0,1), and (0,0,1,0), respectively. Using the DBE, a DNA sample is transformed to a 160-dimensional feature vector. The EIIP scheme expresses the electron–ion interaction energies along the curated sequences and is extensively used in bioinformatics research (Basith et al. 2019; Jia et al. 2018). The EIIP indexes of {A, C, G, T} were set to {0.1260, 0.1340, 0.0806, 0.1335}, which generates a w-dimensional feature vector for a sequence.
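As an illustration of the encodings described above, the following Python sketch computes the Kmer, KSNC, MBE, and EIIP feature vectors for a single 41-bp sample. It is a minimal sketch rather than the implementation used for i6mA-Fuse: the function names are hypothetical, the Kmer counts are assumed to be normalized by the number of windows, and sequences are assumed to contain only A, C, G, and T. DBE is omitted because the full dinucleotide code table is not listed in the text.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# MBE codes and EIIP indexes exactly as listed in the text.
MBE_CODES = {"A": (1, 0, 0, 0), "G": (0, 0, 1, 0), "T": (0, 1, 0, 0), "C": (0, 0, 0, 1)}
EIIP_VALUES = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def kmer_features(seq, max_k=4):
    """Frequencies of all k-mers for k = 1..max_k (4 + 16 + 64 + 256 = 340 values).

    Assumption: counts are divided by the number of windows of size k.
    """
    features = []
    for k in range(1, max_k + 1):
        windows = len(seq) - k + 1
        counts = {"".join(p): 0 for p in product(NUCLEOTIDES, repeat=k)}
        for i in range(windows):
            sub = seq[i:i + k]
            if sub in counts:  # skip windows with non-ACGT characters
                counts[sub] += 1
        features.extend(c / windows for c in counts.values())
    return features


def ksnc_features(seq, s_max=3):
    """Spaced-pair frequencies following Eq. (1): N(pair) / (w - S - 1), S = 0..s_max."""
    w = len(seq)
    features = []
    for s in range(s_max + 1):
        counts = {a + b: 0 for a, b in product(NUCLEOTIDES, repeat=2)}
        for i in range(w - s - 1):
            pair = seq[i] + seq[i + s + 1]  # two nucleotides separated by s positions
            if pair in counts:
                counts[pair] += 1
        features.extend(c / (w - s - 1) for c in counts.values())
    return features


def mbe_features(seq):
    """Concatenate the 4-bit code of every position (4 * w values)."""
    return [bit for base in seq for bit in MBE_CODES[base]]


def eiip_features(seq):
    """Replace every nucleotide by its EIIP index (w values)."""
    return [EIIP_VALUES[base] for base in seq]


if __name__ == "__main__":
    sample = "ACGT" * 10 + "A"  # toy 41-bp sequence with A at the center
    for name, fn in [("Kmer", kmer_features), ("KSNC", ksnc_features),
                     ("MBE", mbe_features), ("EIIP", eiip_features)]:
        print(name, len(fn(sample)))
```

For a 41-bp sample, the four functions return vectors of 340, 64, 164, and 41 values, matching the dimensions stated above.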

Machine learning algorithms

This study utilized an ensemble method, the RF model, to develop the i6mA-Fuse predictor (Liaw and Wiener 2002; Schaduangrat et al. 2019; Shoombuatong et al. 2019; Su et al. 2019; Win et al. 2017). Typically, when training data of size T with Q features is given, RF builds Q subsets of the data by bootstrap sampling, and then randomly assigns Q features to each node to optimize the trees based on Gini impurity. We used the ‘randomForest’ package implemented in R (https://cran.r-project) with a default cut-off tree number of 1000 to evaluate the optimum performance. This package has been successfully applied to many protein and peptide prediction problems (Hasan et al. 2019b; Lv et al. 2019b; Manavalan et al. 2018b; Zhou et al. 2016). To demonstrate the effectiveness of the proposed method, we compared our RF predictor with five well-known ML algorithms, i.e. support vector machine (SVM), AdaBoost (AB), Naïve Bayes (NB), artificial neural network (ANN), and k-nearest neighbor (KNN). In this study, the ANN and NB models (Frank et al. 2004) were implemented with the WEKA software, while the KNN model was developed with an in-house Perl script. For the SVM model, we used the SVMlight package with default parameters (Hasan et al. 2019c; Khatun et al. 2019b).

Fusion model

To improve the prediction performance, the RF probability scores estimated by the Kmer, KSNC, MBE, DBE, and EIIP encoding schemes were linearly combined using the following formula:

$${\text{Combined}} = w_{1} * {\text{Kmer}} + w_{2} * {\text{KSNC}} + w_{3} * {\text{MBE}} + w_{4} * {\text{DBE}} + w_{5} * {\text{EIIP}}$$
(2)

where w1, w2, w3, w4, and w5 are the weights representing the contribution of each encoding, and their sum is 1. Herein, the linear fusion model of the five RF scores estimated by the five encoding schemes is referred to as the Combined model. To enhance the predictive performance of the proposed model, each weight coefficient was adjusted in the range of 0–1 with an interval of 0.05 using a grid-search strategy.
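The grid search over the weights in Eq. (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, each row of `scores` is assumed to hold the five RF probability scores of one sample in the order (Kmer, KSNC, MBE, DBE, EIIP), and accuracy at a 0.5 threshold is assumed as the selection criterion because the text does not state which metric was optimized.

```python
from itertools import product


def weight_grid(n_models=5, steps=20):
    """Yield weight tuples on a 1/steps grid (0.05 for steps=20) that sum to 1."""
    for head in product(range(steps + 1), repeat=n_models - 1):
        last = steps - sum(head)
        if last >= 0:  # the last weight is fixed by the sum-to-one constraint
            yield tuple(u / steps for u in head + (last,))


def combine(scores, weights):
    """Eq. (2): weighted sum of the per-model probability scores of each sample."""
    return [sum(w * s for w, s in zip(weights, row)) for row in scores]


def accuracy(combined, labels, threshold=0.5):
    """Fraction of samples whose thresholded combined score matches the label."""
    hits = sum((c >= threshold) == bool(y) for c, y in zip(combined, labels))
    return hits / len(labels)


def grid_search(scores, labels, steps=20):
    """Return the first weight tuple with the highest accuracy (assumed criterion)."""
    best_w, best_acc = None, -1.0
    for weights in weight_grid(len(scores[0]), steps):
        acc = accuracy(combine(scores, weights), labels)
        if acc > best_acc:
            best_w, best_acc = weights, acc
    return best_w, best_acc


if __name__ == "__main__":
    # Toy scores for four samples; columns follow (Kmer, KSNC, MBE, DBE, EIIP).
    toy_scores = [[0.6, 0.5, 0.9, 0.4, 0.7],
                  [0.7, 0.5, 0.2, 0.6, 0.3],
                  [0.3, 0.5, 0.8, 0.6, 0.6],
                  [0.4, 0.5, 0.1, 0.4, 0.4]]
    toy_labels = [1, 0, 1, 0]
    print(grid_search(toy_scores, toy_labels))
```

With five models and a step of 0.05, the constraint that the weights sum to 1 leaves 10,626 candidate weight vectors, so an exhaustive search is inexpensive.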

Hybrid model

We investigated the effect of the hybrid feature (H) on 6 mA site prediction. The five encoding feature vectors (F) of Kmer, KSNC, MBE, DBE, and EIIP were combined as follows:

$$H = \left( {F\left( {{\text{Kmer}}} \right), F\left( {{\text{KSNC}}} \right), F\left( {{\text{MBE}}} \right),F\left( {{\text{DBE}}} \right), F\left( {{\text{EIIP}}} \right)} \right)$$
(3)

where H is the sequential combination of five different feature vectors with 1406 dimensions.

Meta-predictor

We generated a meta-classifier to check its potential in 6 mA site prediction. In brief, the probability scores of 30 prediction models (5 encodings \(\times\) 6 ML classifiers) were considered as a new feature vector as follows:

$$P_{met} = \left( {P\left( {M\left( 1 \right),{\text{E}}\left( 1 \right)} \right), \ldots P\left( {M\left( i \right),{\text{E}}\left( j \right)} \right), \ldots , P\left( {M\left( s \right),E\left( t \right)} \right)} \right)$$
(4)

where \(P_{met}\) is the new feature vector, \(P\left(M(i), E(j)\right)\) is the probability predicted by the ML classifier M(i) with the encoding scheme E(j), i is the index of the ML classifier, j is the index of the encoding scheme, s is the total number of ML classifiers, and t is the total number of encodings.
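The layout of the meta-feature vector in Eq. (4) can be illustrated with a short Python sketch. The function name and the dictionary layout are assumptions made for illustration only: `probabilities` is assumed to map each (classifier, encoding) pair to the list of per-sample probability scores produced by that model.

```python
def meta_features(probabilities, classifiers, encodings):
    """Build one P_met row per sample from probabilities[(classifier, encoding)].

    The columns are ordered classifier by classifier and, within each
    classifier, encoding by encoding, giving s * t values per sample.
    """
    keys = [(m, e) for m in classifiers for e in encodings]
    n_samples = len(probabilities[keys[0]])
    return [[probabilities[key][n] for key in keys] for n in range(n_samples)]


if __name__ == "__main__":
    # Toy example with two classifiers, two encodings, and three samples.
    toy = {("RF", "Kmer"): [0.9, 0.2, 0.6], ("RF", "MBE"): [0.8, 0.1, 0.7],
           ("SVM", "Kmer"): [0.7, 0.3, 0.5], ("SVM", "MBE"): [0.6, 0.4, 0.9]}
    for row in meta_features(toy, ["RF", "SVM"], ["Kmer", "MBE"]):
        print(row)
```

Each row would then be passed to a second-level classifier in place of the sequence-derived features.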

Performance assessment metrics

To assess the performances of i6mA-Fuse, we used four standard measurements consisting of accuracy (Ac), sensitivity (Sn), specificity (Sp), and the Matthews correlation coefficient (MCC) (Basith et al. 2020; Ding et al. 2016; Yang et al. 2019b):

$${\text{Ac}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{(TP}} + {\text{TN}} + {\text{FP}} + {\text{FN)}}}}$$
(5)
$${\text{Sn}} = \frac{{{\text{TP}}}}{{{\text{(TP}} + {\text{FN)}}}}$$
(6)
$${\text{Sp}} = \frac{{{\text{TN}}}}{{{\text{(TN}} + {\text{FP)}}}}$$
(7)
$${\text{MCC}} = \frac{{{\text{TP}} \times {\text{TN}} - {\text{FP}} \times {\text{FN}}}}{{\sqrt {{\text{(TP}} + {\text{FP)(TP}} + {\text{FN)(TN}} + {\text{FP)(TN}} + {\text{FN)}}} }}$$
(8)

where TP and TN describe the number of positive samples correctly predicted and the number of negative samples correctly identified, respectively. Meanwhile, FN and FP indicate the number of positive samples falsely identified as negative ones and the number of negative samples falsely identified as positive ones, respectively. Furthermore, in order to assess the prediction performance of algorithms using threshold-independent parameters, AUC values were calculated by using the ROC curve.
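Equations (5)–(8) and the AUC can be computed as in the following Python sketch. The function names are hypothetical, and the AUC is computed here through its rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half), which is equivalent to the area under the ROC curve but is not necessarily the procedure used in this study.

```python
from math import sqrt


def confusion_counts(labels, predictions):
    """Count TP, TN, FP, and FN for binary labels and predictions (1 = 6 mA site)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Ac, Sn, Sp, and MCC as defined in Eqs. (5)-(8)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Ac": (tp + tn) / (tp + tn + fp + fn),
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        # MCC is set to 0 when any marginal count is zero (a common convention).
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y_true = [1, 1, 0, 0]
    y_score = [0.9, 0.4, 0.5, 0.1]
    y_pred = [int(s >= 0.5) for s in y_score]
    print(metrics(*confusion_counts(y_true, y_pred)))
    print(auc(y_true, y_score))
```

The pairwise AUC loop is quadratic in the number of samples, which is acceptable for datasets of the size used here.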

Results and discussion

Nucleotide preference of F. vesca and R. chinensis

The position-specific preferences of nucleotide compositions were analyzed by the two-sample logo software (Vacic et al. 2006), as depicted in Fig. 2. Figure 2a, b display the nucleotide preferences on the DNA sequences having a length of 41 base pairs and the A base at the center for F. vesca and R. chinensis, respectively. We examined the nucleotide preferences of the sequences adjacent to the central A bases of F. vesca and R. chinensis; the enriched nucleotides are statistically significant at a level of p < 0.05 (two-sample t-test). In the case of F. vesca, Fig. 2a shows that the A base is enriched at positions 12, 13, 15, 17–20, 25, 28, 29, 32, and 33, while the G base is more enriched at positions 13, 20, 22, 23, 26, and 29 than other nucleotides. The T base was significantly depleted at positions 13, 17–20, 23, 25, 26, 28, 29, and 38. In the case of R. chinensis, the G base was enriched at positions 1, 3, 7–11, 13, 14, 16, 22–24, 26, 29, 32, 34, 37, and 38, while the C base was more enriched at positions 2, 5–8, 10, 11, 14, 15, 19, 24, 27, 30, 35, 36, and 38–40 than other nucleotides. As seen in Fig. 2b, the T base was significantly depleted at positions 1–4, 9–20, 22, 23, 25, 26, 28, 29, and 32–39. Enrichment and depletion of nucleotides at specific positions might be significant information for discriminating positives from negatives in both the F. vesca and R. chinensis samples. Thus, in the present study, the mentioned significant position-specific preferences of nucleotide compositions are used as input features to develop the i6mA-Fuse.

Fig. 2

Nucleotide preferences of the sequences surrounding 6 mA sites in positive samples compared to negative samples. a F. vesca. b R. chinensis. The scale of the Y-axis differs between panels because of the different datasets. Only nucleotides that are significantly enriched or depleted (t-test, P < 0.05) around the central position of the positive and negative samples are shown

Performance comparison by cross-validation test

We carried out a series of comparative simulations using RF models with the five feature vectors of Kmer, KSNC, MBE, DBE, and EIIP encodings, and evaluated their performances on the training dataset by tenfold cross-validation. The cross-validation results are listed in Table 1 and Fig. 3. Previous studies suggested several ways to integrate multiple prediction models for improving the performances, including meta-predictor (Boopathi et al. 2019; Manavalan et al. 2019b, c, d), ensemble approach, and linear regression (Hasan et al. 2019a, d; Khatun et al. 2019a). Herein, the i6mA-Fuse linearly combined the five probability scores evaluated by the five single encoding-employing RF models.

Table 1 Cross-validation results of the proposed predictors and other five encodings
Fig. 3

ROC curves of i6mA-Fuse and the single encoding-employing models as evaluated by 10-fold cross-validation. a F. vesca. b R. chinensis

For F. vesca, MBE encoding achieved the highest prediction results with Ac = 0.925, Sn = 0.891, Sp = 0.958, MCC = 0.858, and AUC = 0.971. Meanwhile, the second highest prediction result was obtained by EIIP encoding, which gave Ac = 0.915, Sn = 0.879, Sp = 0.950, MCC = 0.839, and AUC = 0.963. These two encoding schemes yielded Ac values of 0.915–0.925, which were 11.6–24.6% higher than the Ac values of the other three encodings. To construct the i6mA-Fuse, the optimal weight coefficients of Kmer, KSNC, MBE, DBE, and EIIP were 0.00, 0.00, 0.75, 0.00, and 0.25, respectively. While the three encodings of Kmer, KSNC, and DBE did not contribute to the prediction performance, the MBE and EIIP contributed to 75% and 25% of the total prediction, respectively. The ROC curves of the i6mA-Fuse and the single encoding-employing models, evaluated by 10-fold cross-validation, are presented in Fig. 3a. The i6mA-Fuse yielded AUC = 0.981, MCC = 0.873, Ac = 0.934, Sn = 0.908, and Sp = 0.957, surpassing all the single encoding-employing models (Table 1). At a significance level of P < 0.05, the i6mA-Fuse significantly outperformed the three models employing the single encoding of Kmer, KSNC, and DBE.

For R. chinensis, MBE achieved the best performance among all the single encoding-employing models, with an AUC value of 0.956 and Ac of 0.912. Meanwhile, the second highest prediction with an AUC of 0.945 and Ac of 0.900 was obtained by using EIIP encoding. To construct the i6mA-Fuse, we combined the RF scores with optimal weight coefficients of Kmer, KSNC, MBE, DBE, and EIIP of 0.15, 0.00, 0.65, 0.00, and 0.20, respectively, indicating that Kmer, MBE, and EIIP contribute to 15%, 65%, and 20% of the total prediction, respectively. The i6mA-Fuse yielded a peak AUC value of 0.968, with MCC = 0.851, Ac = 0.916, Sn = 0.881, and Sp = 0.950 (Table 1). The i6mA-Fuse significantly outperformed the four models employing the single encoding of Kmer, KSNC, DBE, and EIIP (P < 0.05 by two-sample t-test).

Performance comparison among different ML algorithms by cross-validation test

To validate the effectiveness of the RF classifier in the i6mA-Fuse, we compared its performance with the five ML classifiers of SVM, AB, NB, ANN, and KNN on the training dataset. To make a fair comparison, we implemented the five ML classifiers in the same manner as the RF classifier. Figure S1 shows that the RF model provided better results than the other ML classifiers, while the prediction results of the SVM model were comparable to those of the RF model in both genomes. In F. vesca, the i6mA-Fuse achieved ~ 2–6% higher AUCs than any other ML-based combined models (Figure S1A). Meanwhile, the AUC values of the i6mA-Fuse were ~ 3–6% higher than those of the other combined models in R. chinensis (Figure S1B), thus demonstrating the superiority of RF.

Performance comparison of i6mA-Fuse with hybrid model and meta-predictor

We compared the linear regression model, employed by the i6mA-Fuse, with two different models, namely a hybrid model and a meta-predictor. First, we concatenated the five feature encoding vectors of Kmer, KSNC, MBE, DBE, and EIIP and obtained a 1406-dimensional feature vector. We generated the hybrid model for both species (F. vesca and R. chinensis), input these features to six different classifiers (RF, SVM, AB, NB, ANN, and KNN), and evaluated their performances by tenfold cross-validation on the training datasets. In the case of F. vesca, the hybrid models of RF, SVM, AB, NB, ANN, and KNN achieved AUC values of 0.978, 0.963, 0.957, 0.944, 0.919 and 0.933, respectively (Figure S2). Similarly, for R. chinensis, the hybrid models of RF, SVM, AB, NB, ANN, and KNN achieved AUCs of 0.958, 0.962, 0.927, 0.942, 0.918 and 0.922, respectively (Figure S2). Furthermore, we constructed the meta-predictor as described elsewhere (Manavalan et al. 2018a, 2019d; Wei et al. 2019). In F. vesca, the i6mA-Fuse achieved higher AUCs than the hybrid model and meta-predictor (Figures S2 & S3). In R. chinensis, the i6mA-Fuse showed ~ 1.0–8.0% higher AUCs than the hybrid model and meta-predictor (Figures S2 & S3).

Performance of i6mA-Fuse by independent test

The i6mA-Fuse was evaluated by independent tests for the two genomes. We compared the prediction performance of the i6mA-Fuse with that of the five single encoding-employing models (Kmer, KSNC, MBE, DBE, and EIIP) by using an independent dataset of F. vesca. MBE encoding gave higher performance than any other single encoding-employing model for the two genomes, as shown in Fig. 4a, b. Moreover, the ROC curves showed that the i6mA-Fuse performed better than all the single encoding-employing methods. The i6mA-Fuse achieved outstanding performances (MCC = 0.858, Ac = 0.929, Sn = 0.915, Sp = 0.943, and AUC = 0.978) and (MCC = 0.869, Ac = 0.937, Sn = 0.928, Sp = 0.948, and AUC = 0.982) for R. chinensis and F. vesca, respectively, on the independent sets (Table 2).

Fig. 4

ROC curves of i6mA-Fuse and the single encoding-employing models as evaluated by independent test. a F. vesca. b R. chinensis

Table 2 Independent test of the proposed predictors and other five encodings

Validation of i6mA-Fuse with other species datasets

To further examine the generalization of i6mA-Fuse, the proposed i6mA-Fuse was applied to identify 6 mA sites in other species, i.e. the rice and mouse genomes. We collected the rice genome dataset from SDM6A (Basith et al. 2019), which contains 221 positive and 221 negative samples, and the mouse genome dataset from iDNA6mA-PseKNC (Feng et al. 2019), from which we randomly selected 200 positive and 200 negative samples. The prediction performances in terms of AUC, MCC, Sp, Sn, and Ac are shown in Table S1. The R. chinensis- and F. vesca-specific i6mA-Fuse yielded AUC values of 0.870 and 0.928 for the rice genome, respectively. For the mouse genome, they provided AUC values of 0.748 and 0.769, respectively. The i6mA-Fuse is applicable to the rice genome, especially the F. vesca-specific i6mA-Fuse (Table S1). This suggests that the sequences surrounding 6 mA sites of Rosaceae genomes share common characteristic patterns with those of rice genomes.

Conclusions

The accurate prediction of 6 mA sites is one of the challenging tasks in bioinformatics. Because the experimental approaches are time-consuming and costly, it is desirable to develop a computational model for rapidly and accurately identifying 6 mA sites. Although several computational methods have been proposed in some species (Basith et al. 2019; Chen et al. 2019a; Feng et al. 2019; Lv et al. 2019a; Yu and Dai 2019), none of them was developed to specifically identify 6 mA sites in the Rosaceae genomes. In this study, we developed the first species-specific predictor named i6mA-Fuse for identifying 6 mA sites of the Rosaceae genomes, especially in R. chinensis and F. vesca. We constructed the random forest (RF) models with the five encoding schemes of Kmer, KSNC, MBE, DBE, and EIIP, and then combined the predicted probability scores of the five models through a linear regression. The resultant species-specific i6mA-Fuse achieved remarkably high performances with AUCs of 0.982 and 0.978 and with MCCs of 0.869 and 0.858 on the independent datasets of Fragaria vesca and Rosa chinensis, respectively. In the F. vesca-specific i6mA-Fuse, the MBE and EIIP contributed to 75% and 25% of the total prediction; in the R. chinensis-specific i6mA-Fuse, Kmer, MBE, and EIIP contributed to 15%, 65%, and 20% of the total prediction. Interestingly, the i6mA-Fuse is also applicable to the rice genome. To show the superiority of the linear regression, we compared it with the two combination methods of the hybrid model and meta-classifier. To further improve the prediction performance, we may use recently proposed integration methods (Chen et al. 2019b; Li et al. 2019; Zhang et al. 2019) and various modes of Chou’s pseudo information (Chen et al. 2016; Chou 2011; Liu et al. 2015). To assist high-throughput identification of DNA 6 mA sites, the i6mA-Fuse is publicly accessible at https://kurata14.bio.kyutech.ac.jp/i6mA-Fuse/.