Introduction

Sustainable food production is necessary to meet the demands of the ever-increasing human population (Mochida and Shinozaki 2013). However, crop plants are constantly exposed to adverse environmental perturbations that are predicted to result in a 70% yield loss in important agricultural crops (Boyer 1982; Vij and Tyagi 2007; Zurbriggen et al. 2010). Abiotic stresses, such as cold, drought, heat and salt, are a major factor limiting crop yield and productivity (Akpınar et al. 2013; Budak et al. 2015). The pervasiveness and severity of their effects on plant growth, development and quality have made them a significant concern in recent years (Anwar and Kim 2020). In response to abiotic stress, plants activate a network of genetic regulation that alters the expression of a considerable number of genes via transcriptional and/or post-transcriptional regulation (Ku et al. 2015). The expression of protective genes is specifically increased, while the expression of negative regulators is decreased. Several protein-coding genes that control how plants respond to abiotic stresses have been identified in recent years (Zhang and Wang 2015).

Recent findings suggest that plants use small (20–24 nt) endogenous RNAs called microRNAs (miRNAs) as key post-transcriptional regulators of gene expression to modulate growth and development under abiotic stress (Zhang 2015). miRNAs control gene expression through mRNA cleavage, translational suppression, chromatin remodelling and/or DNA methylation (Wang et al. 2019). Typically, miRNAs that are upregulated in response to abiotic stress downregulate their target mRNAs, while the suppression of other miRNAs allows positive regulators to accumulate and become active (Chinnusamy et al. 2007). Numerous studies have reported that abiotic stress alters miRNA expression in plants. For instance, Winter and Diederichs (2011) and Iwakawa and Tomari (2013) found that miRNAs contribute to the plant response to abiotic stress by controlling important elements of complex gene networks. Changes in plant miRNA expression in response to biotic and abiotic stresses have been analysed extensively (Noman and Aqeel 2017). miRNA-167, miRNA-169, miRNA-171, miRNA-319, miRNA-393, miRNA-394 and miRNA-396 are a few examples of miRNAs involved in various abiotic stress–related processes (Wang et al. 2014; Gao et al. 2016).

The response of miRNAs to abiotic stresses is largely determined by genotype, stress, tissue and miRNA type (Zhang 2015). For instance, miR408 expression is downregulated in rice (Zhou et al. 2010), cotton (Xie et al. 2015) and peach (Eldem et al. 2012) during drought stress, while it is upregulated in Arabidopsis (Liu et al. 2008), Medicago (Trindade et al. 2009) and barley (Kantar et al. 2011). With regard to tissue-dependent responses, Wang et al. (2013) found that miRNA expression profiles differed between roots and leaves in response to drought and salinity stresses in cotton. miR169 was found to be induced by salinity treatment in Arabidopsis but inhibited by drought stress (Li et al. 2008), demonstrating that abiotic stresses regulate the expression of miRNAs in a stress-dependent manner. As with miR169 in Arabidopsis, miR398 was activated by UVB light but suppressed by salinity, cold and oxidative stress (Sunkar et al. 2006; Jia et al. 2009). In Arabidopsis under salinity stress, the expression of miR397 was significantly induced, but that of miR398 was significantly inhibited, indicating that the plant response to abiotic stresses is miRNA-dependent (Liu et al. 2008). The studies cited above indicate that miRNAs play a substantial role in how plants react to various abiotic stresses and may be exploited as genetic targets for engineering plants that are more resilient to such stresses. Owing to their importance, miRNAs have been catalogued in various databases, including PlantMirnaT (Rhee et al. 2015), miRPlant (An et al. 2014), PMRD (Zhang et al. 2010), miRNEST (Szcześniak et al. 2012) and miRBase (Kozomara and Griffiths-Jones 2014). The most recent resource for abiotic stress–responsive miRNAs is PncStress (Wu et al. 2020), which comprises experimentally validated miRNA sequences linked to diverse abiotic and biotic stresses.

Techniques including RT-PCR, cloning, RNA microarrays and northern blots have all been extensively employed to find abiotic stress–related miRNAs. These resource-intensive wet-lab experiments also have limited analytical performance in terms of accuracy, linear range and limit of detection (Ku et al. 2015; Shriram et al. 2016). Although abiotic stress–responsive miRNAs have been identified using NGS and deep sequencing technologies (Tripathi et al. 2015), the sequencing methods are species-specific. Therefore, machine learning–based computational approaches that employ existing plant miRNA sequence data may be a better alternative for predicting abiotic stress–related miRNAs. We previously developed a machine learning–based technique termed ASRmiRNA (Meher et al. 2022), which predicts abiotic stress–responsive miRNAs from their sequences. However, predicting miRNAs responsive to a specific abiotic stress from plant miRNA sequences remains an open need. Given the significance of miRNAs in plant response to abiotic stresses and the lack of computational methods for predicting such abiotic stress–specific miRNAs, the objective of this study is to develop a machine learning–based computational model for predicting abiotic stress–specific (cold, drought, heat and salt) miRNAs using features derived from miRNA sequences. The current study is expected to supplement wet-lab techniques and sequencing approaches for discovering miRNAs responsive to particular abiotic stresses.

Materials and methods

Collection, processing and construction of datasets

The PncStress database (Wu et al. 2020) was accessed on August 23, 2022, to retrieve mature miRNA sequences that are specific to an abiotic stress. This database contains 4227 stress-responsive non-coding RNAs (miRNA, lncRNA and circRNA) from 114 plants that have been experimentally verified to respond to 48 biotic and 91 abiotic stresses. We collected 2110 miRNA sequences for 4 different abiotic stresses: drought (862), heat (241), salt (559) and cold (448). Additionally, we took into account 376 miRNA sequences that were used as a negative set in a prior study (Meher et al. 2022). We created two distinct datasets, dataset-I and dataset-II, to evaluate the performance of machine learning algorithms for predicting miRNAs that are specific to a particular abiotic stress.

Dataset-I

Thirty percent of the collected sequences in each stress category (128 sequences for cold, 267 for drought, 68 for heat and 167 for salt) were set aside as a positive independent test set. The remaining miRNA sequences from each category formed the positive set of the training dataset. To prevent homologous bias in the prediction accuracy, sequences with > 60% sequence homology to any other sequence within each stress set were eliminated using the CD-HIT algorithm (Huang et al. 2010). After removing redundant sequences, a total of 216, 350, 114 and 249 sequences were obtained for cold, drought, heat and salt stress, respectively, which were used to build the positive training set. Homology reduction was also applied to the positive independent set, yielding a total of 79, 149, 36 and 90 sequences for cold, drought, heat and salt, respectively. For a given abiotic stress, the negative training set was constructed by randomly drawing equal numbers of miRNA sequences from the other three stress categories. The independent negative set was built in a similar way.
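The construction of dataset-I described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the function names `holdout_split` and `balanced_negatives` are hypothetical, the homology-reduction step with CD-HIT is omitted, and the round-robin fallback used when one category runs out of sequences is an assumption made here so that the negative set always reaches the size of the positive set.

```python
import random

def holdout_split(seqs, test_fraction=0.3, seed=0):
    """Shuffle one stress category and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(seqs)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def balanced_negatives(train_by_stress, target, seed=0):
    """Build a negative set for `target` from the other stress categories.

    Sequences are drawn at random, one category at a time in turn, so that
    each category contributes equally until it is exhausted (assumption).
    """
    rng = random.Random(seed)
    pools = {s: list(v) for s, v in train_by_stress.items() if s != target}
    for pool in pools.values():
        rng.shuffle(pool)
    n_needed = len(train_by_stress[target])
    negatives = []
    while len(negatives) < n_needed and any(pools.values()):
        for pool in pools.values():
            if pool and len(negatives) < n_needed:
                negatives.append(pool.pop())
    return negatives
```

With four dictionaries of training sequences keyed by stress name, `balanced_negatives(train_by_stress, "drought")` would return as many negative sequences as there are drought positives, drawn from cold, heat and salt.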

Dataset-II

The negative training set for each class of stress was constructed by using an equal number of observations from the collected 376 miRBase miRNA sequences, whereas the positive training set remained the same as that of dataset-I. The positive independent set remained the same as that in dataset-I, and the remaining negative sequences in each case (after excluding the negative training set) were utilized to form the negative independent test set. A balanced dataset was used to train the model for each category in order to prevent prediction bias toward the class having a larger number of observations. Table 1 gives a summary of the positive, negative and independent datasets.

Table 1 Summary of the positive and negative datasets used in the current study

Generation of numeric features and feature selection

As the pseudo composition of nucleotides accounts for the long-range sequence order effect, we used pseudo K-tuple nucleotide composition (PseKNC) (Guo et al. 2014; Chen et al. 2014) features in this study to convert each miRNA sequence into a numeric feature vector. The PseKNC descriptor has been effectively used in several fields of computational biology, including the prediction of nucleosome positioning (Guo et al. 2014), the prediction of miRNAs that are responsive to abiotic stress (Meher et al. 2022) and others (Chen et al. 2015; Yang et al. 2018). To generate the PseKNC features, it is necessary to first specify the number of correlation tiers (\(\lambda\)), the weight factor (w) and the Kmer size (K). Since the miRNA sequences are only about 20–24 nucleotides long, correlation up to 3 tiers was taken into consideration. The default weight factor of w = 0.2 was used. With 5 different Kmer sizes (K = 1, 2, 3, 4 and 5), the number of features generated was 7, 19, 67, 259 and 1027, respectively, i.e. \(4^{K}+\lambda\) features for each K with \(\lambda = 3\). Each miRNA sequence thus yielded a total of 1379 features. The generated features are sparse in nature because miRNAs are short. Furthermore, since the dataset is small, using a large number of features may lead to overfitting. Feature selection aids in the removal of redundant and irrelevant features, reducing the computational burden and boosting classification accuracy (Aksu et al. 2010; Huang et al. 2014). Thus, key features were chosen using the SVM-recursive feature elimination (SVM-RFE) method (Guyon et al. 2002). Pse-in-One software (Liu et al. 2015) was used to generate the PseKNC features, and the “sigFeature” R-package was used to implement the SVM-RFE approach (Das et al. 2020).
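The feature counts quoted above can be checked with a short calculation. The sketch below assumes that a PseKNC vector for a given K consists of 4^K composition terms plus λ correlation terms, which reproduces the reported numbers; the function name `pseknc_dim` is illustrative and is not part of Pse-in-One.

```python
def pseknc_dim(k, lam):
    """Length of a PseKNC vector: 4**k k-mer terms plus lam correlation terms."""
    return 4 ** k + lam

LAMBDA = 3  # number of correlation tiers used in the text

# Feature counts for Kmer sizes 1 to 5 and their sum over all five sizes.
dims = [pseknc_dim(k, LAMBDA) for k in range(1, 6)]
total = sum(dims)
print(dims, total)  # [7, 19, 67, 259, 1027] 1379
```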

Prediction with machine learning algorithms

Machine learning approaches have been successfully used in different areas of bioinformatics, such as gene discovery and genome annotation (Guo et al. 2017), protein class prediction (Pradhan et al. 2022), gene expression analysis (Abbas and EL-Manzalawy 2020), complex interaction modeling in biological systems (Pradhan et al. 2021) and others. In this study, we used seven different machine learning techniques: the support vector machine (SVM) (Vapnik 1963), extreme gradient boosting (XGB) (Chen and Guestrin 2016), random forest (RF) (Breiman 2001), the light gradient boosting machine (LGBM) (Ke et al. 2017), bagging (BAG) (Breiman 1996), adaptive boosting (ADB) (Freund and Schapire 1999) and the gradient boosting decision tree (GBDT) (Friedman 2001). The learning algorithms were implemented in R. The R-packages used to execute the learning models and the parameter configurations for the different models are provided in Table 2.

Table 2 Software used and parameter setting for the learning models used for prediction of abiotic stress–responsive miRNAs

Cross validation and performance metrics

A five-fold cross-validation approach was used to assess the performance of the different learning models. Both the positive and negative datasets were randomly separated into five subsets of equal size (Jiang and Wang 2017). In each fold of the cross-validation, one subset from each class was used as the test set, and the remaining four subsets from both classes were merged to serve as the training set, so that each fold used a distinct pair of training and test sets. The performance measures were averaged across all five test sets. To measure the effectiveness of the prediction models, the following metrics were used: accuracy, area under the receiver operating characteristic curve (auROC) and area under the precision-recall curve (auPRC):

$$Accuracy=\frac{1}{2}\left(\frac{TP}{TP+FN}+\frac{TN}{TN+FP}\right)$$
$$auROC={\int }_{0}^{1}\frac{TP}{P}d\left(\frac{FP}{N}\right)$$
$$auPRC={\int }_{0}^{1}\frac{TP}{TP+FP}d\left(\frac{TP}{P}\right)$$

Here, TP, FP, TN and FN, respectively, represent the number of positive samples predicted to be positive, negative samples predicted to be positive, negative samples predicted to be negative and positive samples predicted to be negative, while P = TP + FN and N = TN + FP denote the total numbers of positive and negative samples. A flowchart illustrating each step of the proposed approach is presented in Fig. 1, and the pseudocode for the developed algorithm is as follows:

Fig. 1 Illustration of the brief outline of the proposed approach. The diagram depicts the overall design of the computational strategy followed to develop the miRNA prediction models for each abiotic stress. (A) Retrieval of experimentally validated abiotic stress–responsive miRNA sequences from the PncStress database and processing of sequence data; (B) sequence-derived PseKNC feature generation and selection of the most important features and machine learning algorithm (MLA) based on auROC and auPRC; (C) model building using machine learning techniques with selected features and assessment of cross-validation accuracy

INPUT: Cold, drought, heat and salt responsive miRNA sequences labelled as positive dataset, and equal number of non-abiotic stress responsive miRNA sequences labelled as negative dataset
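The evaluation steps described in this section (class-wise five-fold splitting and the accuracy measure defined above) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's R implementation: `fit_predict` is a hypothetical callback standing in for any classifier (for example an SVM) that is trained on the training fold and returns 0/1 labels for the test fold, and all function names are chosen here for illustration.

```python
import random

def five_fold_indices(n, n_folds=5, seed=0):
    """Randomly assign n sample indices to n_folds subsets of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def cross_validate(pos, neg, fit_predict, n_folds=5, seed=0):
    """Average the accuracy over folds built separately for each class.

    pos, neg    : lists of feature vectors for the positive/negative class
    fit_predict : hypothetical callback (train_X, train_y, test_X) -> labels
    """
    pos_folds = five_fold_indices(len(pos), n_folds, seed)
    neg_folds = five_fold_indices(len(neg), n_folds, seed + 1)
    scores = []
    for f in range(n_folds):
        held_pos, held_neg = set(pos_folds[f]), set(neg_folds[f])
        # Training set: the four remaining subsets of both classes, merged.
        train_X = [x for i, x in enumerate(pos) if i not in held_pos]
        train_y = [1] * len(train_X)
        rest_neg = [x for i, x in enumerate(neg) if i not in held_neg]
        train_X += rest_neg
        train_y += [0] * len(rest_neg)
        # Test set: the held-out subset of each class.
        test_X = [pos[i] for i in pos_folds[f]] + [neg[i] for i in neg_folds[f]]
        test_y = [1] * len(held_pos) + [0] * len(held_neg)
        scores.append(balanced_accuracy(test_y, fit_predict(train_X, train_y, test_X)))
    return sum(scores) / n_folds
```

Splitting each class separately keeps every fold balanced when the positive and negative sets have the same size, which matches the balanced design described for both datasets.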


Results

Analysis of discriminatory motifs

For each stress category, we conducted a discriminatory motif discovery analysis, which involved finding patterns enriched in the positive set relative to the negative set. Only significant motifs (p-value < 0.05) were taken into consideration, and the motif length searched was restricted to 2 to 6 nucleotides. STREME software (Bailey 2021) was used to analyse the discriminative motifs. Figure 2 shows the discriminatory motifs that were discovered for each stress. Three cold stress motifs, AUCMC, AUUGA and GCCGCS, were found to be substantially more prevalent in the positive set than in the negative set. Similarly, two significant motifs were found for drought stress, namely AAUGUU (p-value \(6.8\times 10^{-3}\)) and GCCGR (p-value \(5.1\times 10^{-3}\)). The discriminating motifs GACAGC and WGAUG were discovered for heat stress. For salt stress, three major conserved motifs were identified, namely GAUUUG, AAGGAG and ASBUGC. In conclusion, different motifs were found for different stress categories, which may be important for mRNA binding.
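Several of the motifs above contain IUPAC ambiguity codes (for example M = A or C, S = C or G, W = A or U, R = A or G, B = C, G or U). The sketch below shows one way to expand such a motif into a regular expression and to measure how often it occurs in a set of sequences. It only counts occurrences and does not reproduce the statistical test performed by STREME; the function names are illustrative.

```python
import re

# IUPAC nucleotide codes written for RNA (U instead of T).
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "S": "[CG]", "W": "[AU]",
    "K": "[GU]", "M": "[AC]", "B": "[CGU]", "D": "[AGU]",
    "H": "[ACU]", "V": "[ACG]", "N": "[ACGU]",
}

def motif_regex(motif):
    """Translate an IUPAC motif such as 'AUCMC' into a compiled regex."""
    return re.compile("".join(IUPAC_RNA[c] for c in motif))

def fraction_with_motif(motif, seqs):
    """Fraction of sequences containing at least one match of the motif."""
    pattern = motif_regex(motif)
    # Normalise case and DNA-style input before matching.
    hits = sum(1 for s in seqs if pattern.search(s.upper().replace("T", "U")))
    return hits / len(seqs)
```

Comparing `fraction_with_motif(motif, positives)` with `fraction_with_motif(motif, negatives)` gives a simple view of how strongly a reported motif separates the two sets.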

Fig. 2 Discriminatory motifs for stress-specific miRNAs. Different discriminatory motifs were found for different stresses

Performance analysis of MLA with PseKNC features

The prediction accuracy of 7 machine learning algorithms was evaluated with 5 different PseKNC feature sets using training dataset-I. SVM had the highest auPRC for cold (60.4%) and drought (55.02%), with Kmer sizes 4 and 5, respectively (Fig. 3a). For heat, with the PseKNC feature set of Kmer size 4, LGBM obtained the highest auPRC of 60.9%, followed by BAG (60.39%) and XGB (59.43%) (Fig. 3a). With K = 5, BAG and GBDT achieved the highest auPRC (56%) for salt stress, while XGB and GBDT achieved a nearly identical auPRC (55%) with Kmer size 4 (Fig. 3a). The feature sets formed with Kmer sizes 4 and 5 often had higher prediction accuracies than those generated for Kmer sizes 1 to 3, which may be due to the larger size of the feature set.

Prediction analysis for training set-I using selected features

We found that using features generated with Kmer sizes 4 and 5 increased prediction accuracy (Fig. 3a). However, a significant portion of these features is sparse owing to the short length of the miRNA sequences (20–24 nt), which may bias the accuracy. Therefore, after combining the features of all Kmer sizes, the features were ranked using the SVM-RFE approach. The number of top-ranked features needed to achieve the best accuracy differed among the stress categories (Fig. 3b). With the selected features, SVM obtained the highest accuracy for all stress categories (Fig. 3b). With SVM and the top 246 features, the highest auPRC of 60.01% was attained for cold stress. Similarly, using 230, 310 and 240 features, respectively, SVM predicted drought (53.54%), heat (78.34%) and salt (66.78%) stresses with the highest auPRCs (Fig. 3b). The accuracy declined when all of the features were used for prediction. Compared with the highest accuracy obtained with a single PseKNC feature set, i.e. Kmer size 4 for heat and 5 for salt stress, the accuracy improved by ~ 17% and ~ 10%, respectively, with the selected features (Fig. 3a and b). In contrast, the accuracy for cold and drought did not increase with the selected feature sets.
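The choice of the number of top-ranked features can be illustrated with a simple scan over prefix sizes of the ranked list. This is a sketch under assumptions: `score_fn` is a hypothetical callback that would return, for instance, the cross-validated auPRC of an SVM trained on the given features, and the step size of 10 is arbitrary because the increment used in the study is not stated.

```python
def best_top_n(ranked_features, score_fn, step=10):
    """Score the top-n ranked features for n = step, 2*step, ... and all features.

    Returns (n, score) for the best-scoring size; the smallest n wins ties.
    """
    sizes = list(range(step, len(ranked_features) + 1, step))
    if not sizes or sizes[-1] != len(ranked_features):
        sizes.append(len(ranked_features))  # always evaluate the full set too
    best_n, best_score = None, float("-inf")
    for n in sizes:
        score = score_fn(ranked_features[:n])
        if score > best_score:
            best_n, best_score = n, score
    return best_n, best_score
```

Because the full feature set is always included in the scan, the result also shows directly whether a selected subset outperforms using all features.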

Fig. 3 a Heat maps of auPRC for different machine learning algorithms with PseKNC features for Kmer size 1 to 5. b Plot of the auPRC with the ranked features selected through SVM-RFE method. The training dataset-I was used for prediction analysis in both cases

Prediction with independent test set-I

The model trained with the respective training set-I was used to predict the independent test set-I. For cold, drought, heat and salt stress, the auPRC was found to be 59.63, 66.94, 72.88 and 69.57%, respectively (Table 3). The highest auPRC was observed for heat, as in cross-validation, and the lowest for cold. Relative to the respective cross-validation values, the auPRC on the independent set was higher for drought and salt and lower for cold and heat (Table 3). For cold, drought, heat and salt, the overall accuracy was determined to be 62.02, 61.40, 77.78 and 66.67%, respectively (Table 3).

Table 3 Prediction accuracy for the independent test set-I. The prediction was performed by using the model trained with the training set-I along with the respective selected feature sets

Prediction analysis with training set-II

Five-fold cross-validation was also carried out using SVM with the selected features on the training dataset-II. For cold, drought, heat and salt, the optimal number of features to attain the maximum accuracy was 380, 272, 174 and 340, respectively (Table 4). The cross-validated auPRC values were 90.15, 90.09, 87.71 and 89.25%, respectively, with the selected feature sets (Table 4). Prediction for the independent test set-II was also performed using the model trained with training set-II, giving prediction accuracies of 84.57, 80.62, 80.38 and 82.78%, respectively (Table 4). The accuracies for the training set-II and the independent test set-II were considerably higher than those for the training set-I and independent test set-I, respectively.

Table 4 Performance metrics for the training set-II and independent test set-II. The SVM with the selected feature sets was used for prediction. Prediction for the independent test set-II was performed using the model trained with the respective training set-II

Analysis of selected features

For each stress category, tSNE plots were created before and after feature selection in order to further illustrate the discriminatory feature sets. The analysis made use of the training dataset-II. The R-package Rtsne (Krijthe et al. 2017) was used to create the tSNE plots, which were produced using both the selected feature sets and all the features (Fig. 4a). Owing to the 2-dimensional nature of the plots, the distinction between the stress and non-stress categories was not clear. Among the selected features for the training dataset-II, the largest share for cold (201), drought (138) and salt (180) came from Kmer size 5, whereas for heat stress the largest share (62) came from Kmer size 4 (Fig. 4b). This might be because there were fewer observations for heat stress, which produced more homogeneous features (mainly 0s) for Kmer size 5. Additionally, 63 of the selected features were common to all four stresses (Fig. 4c). Only 100, 51, 28 and 78 selected features were unique to cold, drought, heat and salt stresses, respectively (Fig. 4c).

Fig. 4 a tSNE plots for the different stress categories with selected features. b Pie chart of the number of selected features for different Kmer sizes. c Venn diagram for the selected feature sets

One-to-one prediction analysis

Binary classification was also performed between pairs of stress categories. A balanced dataset with the same number of observations from both classes was utilized for each pair. For the classification of cold-drought, cold-heat, cold-salt, drought-heat, drought-salt and heat-salt, the optimal number of features was 116, 80, 60, 194, 166 and 440, respectively (Fig. 5). For cold-heat and drought-heat, the overall cross-validation accuracy was 60.91% and 60.45%, respectively (Table 5). The classification accuracy for the remaining four combinations was less than 60% (Table 5). With a few exceptions, accuracy increased with the number of features up to a certain point before starting to decline (Fig. 5).
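The six one-to-one comparisons and their balanced datasets can be enumerated as in the sketch below. It assumes that balancing is done by randomly downsampling the larger class to the size of the smaller one, which is one plausible reading of the text rather than a documented detail; `balanced_pair` is a hypothetical helper name.

```python
import random
from itertools import combinations

STRESSES = ["cold", "drought", "heat", "salt"]

# All unordered pairs of stress categories: six one-to-one comparisons.
PAIRS = list(combinations(STRESSES, 2))

def balanced_pair(seqs_a, seqs_b, seed=0):
    """Downsample both classes to the size of the smaller one (assumption)."""
    rng = random.Random(seed)
    n = min(len(seqs_a), len(seqs_b))
    return rng.sample(list(seqs_a), n), rng.sample(list(seqs_b), n)
```

`PAIRS` lists the comparisons in the same order as in the text, from cold-drought to heat-salt.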

Fig. 5 Plot of the performance metrics for SVM model using the training dataset-II for the selected features

Table 5 Performance metrics for one-to-one prediction using the training set-II. The SVM with the respective selected feature sets was used for prediction

Performance analysis of deep learning models in the selected feature sets

The performance of four deep learning models, namely the one-dimensional convolutional neural network (CNN) (Kim 2014), attention-based convolutional neural network (ABCNN) (Yin et al. 2016), long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) and auto-encoder (AE) (Liou et al. 2014), was also compared with that of SVM. Prediction analysis was performed on the training dataset-II through five-fold cross-validation using the selected features (Table 4). Among the deep learning models, AE achieved the highest accuracy for all four abiotic stresses (cold: 77.07%; drought: 77.57%; heat: 77.47%; salt: 79.91%) (Table 6). ABCNN was the poorest performer among the deep learning models (Table 6). SVM outperformed all the deep learning algorithms in predicting abiotic stress–responsive miRNAs for all four abiotic stresses (Table 6). Specifically, SVM achieved 5–6% higher accuracy than the best-performing deep learning model, AE.

Table 6 Comparative performance metrics of the SVM with deep learning models. The training dataset-II along with the selected feature sets was used for prediction

Prediction tool ASmiR

We developed an online prediction server, ASmiR (https://iasri-sg.icar.gov.in/asmir/), for the prediction of abiotic stress–responsive miRNAs for cold, drought, heat and salt. The front end of the server was designed using HTML, whereas the developed R code runs at the back end with the help of PHP. The SVM model developed using dataset-II is implemented in this server owing to its high accuracy for all four abiotic stresses. Predictions can be made for each of the four abiotic stresses. The user has to paste or upload the miRNA sequences in FASTA format. The results are presented in tabular format, giving the probability with which each miRNA sequence is predicted to belong to a specific stress category.
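Before submitting sequences to such a server, it can be useful to check that the input is well-formed FASTA containing plausible mature miRNA sequences. The sketch below is a client-side illustration only and does not describe how the ASmiR server itself processes input; the function names are hypothetical, and the default length bounds simply reuse the 20–24 nt range quoted in this article.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Sequences are upper-cased and DNA-style T is converted to U.
    """
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper().replace("T", "U"))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def is_valid_mirna(seq, min_len=20, max_len=24):
    """True if seq uses only A, C, G, U and its length is within the bounds."""
    return min_len <= len(seq) <= max_len and set(seq) <= set("ACGU")
```

Records that fail `is_valid_mirna` (for example because of ambiguous bases or an unexpected length) can then be inspected before prediction.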

Evaluation of ASmiR using experimentally validated dataset

To further verify the effectiveness of the developed model ASmiR, miRNA sequences for cold, drought, heat and salt stress were manually collected from the available literature (Shriram et al. 2016; Begum 2022; Zhang et al. 2022). It was ensured that these sequences were not present in the positive set used to train the model. We obtained 51 sequences for cold, 165 for drought, 31 for heat and 50 for salt stress. The developed model was used to predict these sequences for the respective abiotic stress, and 90.20, 94.54, 93.56 and 92% of the sequences for cold, drought, heat and salt stress, respectively, were correctly predicted.

Discussion

Climate change, which has accelerated in recent years, is a key cause of abiotic stress, disrupting cellular homeostasis and having a negative impact on plant growth and development (Mickelbart et al. 2015). Plant growth is impeded by abiotic stress since plants lack the ideal environmental conditions for cell division and growth. For instance, drought stress restricts plant growth because water is required for cell turgor, which promotes cell expansion (Seleiman et al. 2021); similarly, cold stress reduces plant growth because enzyme and other protein activities are limited at low temperatures (Sanghera et al. 2011).

Plants have developed a variety of defence mechanisms against these abiotic stresses, one of which involves using miRNAs to control the expression of abiotic stress–responsive genes. In response to various abiotic stresses, miRNAs function as post-transcriptional regulators that control gene expression in a sequence-specific manner through translational inhibition (Shriram et al. 2016). The miRNA recruits Argonaute proteins to specific target mRNAs via base-pairing in order to repress their translation and stability (Chipman and Pasquinelli 2019; Yan et al. 2018). The process of translational repression begins with the specific base pairing of the miRNA with the target region, for which the order of nucleotides in the miRNA is crucial. Targeting depends in particular on the base pairing of the miRNA’s seed region, which consists of nucleotides (nts) 2–7, with sites in the 3′UTRs of mRNA. Additionally, the miRNAs’ 3′ ends play a role in controlling target specificity and regulation (Yan et al. 2018), and the degree of base pairing at the miRNA 3′ end can influence the stability of the miRNA itself (Chipman and Pasquinelli 2019). Thus, identification of abiotic stress–responsive miRNAs based on sequence information is an important area of research as far as the plant response to different environmental stresses is concerned. In this direction, we previously developed ASRmiRNA, the first machine learning–based method to predict abiotic stress–related miRNAs from plant miRNA sequences (Meher et al. 2022). However, this method is general and cannot predict stress-specific miRNAs. Given the significance of miRNAs in plant response to specific abiotic stresses, this study focused on developing a machine learning–based computational model for predicting abiotic stress–specific (cold, drought, heat and salt) miRNAs.

Construction of an appropriate dataset is one of the key factors determining the quality of a predictive model and is the cornerstone of machine learning, as it directly influences model accuracy (Sharma et al. 2021). In this study, we prepared two different datasets, dataset-I and dataset-II, for the evaluation of machine learning methods for predicting abiotic stress–specific miRNAs. The accuracy was observed to be much higher for dataset-II than for dataset-I. This difference may be due to the different negative sets used in the two datasets. Because the same miRNA can be associated with more than one abiotic stress, negative sets prepared from the observations of the other stress categories may yield lower accuracy. This may also be the reason for the lower prediction accuracy in the one-to-one prediction. In contrast, the negative sets of dataset-II were constructed from the non-abiotic stress miRNA sequences collected from miRBase (Kozomara and Griffiths-Jones 2014), which may be one of the reasons for the higher discrimination accuracy in the case of dataset-II.

Encoding of miRNAs as numeric feature vectors is essential, as machine learning algorithms can accommodate only numeric inputs (Zhang et al. 2006; Meher et al. 2018; Asefpour 2020). The sequence order of a miRNA is important for its target recognition. It has been found that mutations at certain positions may disrupt the binding of miRNAs to their original target genes (Bhattacharya and Cui 2017). Therefore, we used the pseudo K-tuple nucleotide composition (PseKNC) features to encode miRNAs into numeric feature vectors in order to capture the sequence order in a miRNA. PseKNC has also been successfully utilized in earlier studies (Guo et al. 2014; Yang et al. 2018; Meher et al. 2022) for prediction using biological sequence data.

Here, we considered Kmer sizes 1 to 5, and a total of 1379 features were generated. As miRNA sequences are only 20–24 nucleotides long, many of the generated features are likely to be 0, which may introduce redundancy in the feature set. In other words, because all features are derived from the PseKNC descriptor, prediction accuracy can be misleading when redundant or irrelevant features are present. Therefore, it is crucial to choose significant features from the generated features. In this study, the optimal feature set for the prediction of miRNAs specific to each abiotic stress was chosen using SVM-RFE (Wang et al. 2011). The SVM-RFE method has been successfully adopted in numerous other applications, such as genomics (Tang et al. 2008), proteomics (Dao et al. 2017) and metabolomics (Lin et al. 2012). The number of selected features differed among the stress categories.
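The sparsity argument can be made concrete with a small calculation: a sequence of length L contains only L − K + 1 overlapping K-mers, so for K = 5 a 21-nt miRNA can populate at most 17 of the 1024 possible composition features. The sketch below counts K-mers directly; it covers only the composition part of PseKNC, and the example sequence is an arbitrary 21-nt RNA string chosen for illustration.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count the overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def nonzero_fraction(seq, k):
    """Fraction of the 4**k possible k-mers that occur at least once."""
    return len(kmer_counts(seq, k)) / 4 ** k

example = "UGGAGAAGCAGGGCACGUGCA"  # arbitrary 21-nt sequence for illustration
for k in range(1, 6):
    print(k, len(example) - k + 1, round(nonzero_fraction(example, k), 4))
```

The printed fractions fall quickly as K grows, which is why most of the 1024 features for Kmer size 5 are zero for any single miRNA.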

We utilised seven different machine learning methods, namely SVM, RF, XGB, ADB, BAG, LGBM and GBDT, for the prediction of abiotic stress–responsive miRNAs. The prediction accuracies were generally higher with the features generated with Kmer sizes 4 and 5, which may be due to the larger size of these feature sets compared with those of Kmer sizes 1 to 3. With the selected feature sets, however, SVM achieved higher accuracies than the other learning algorithms for all abiotic stresses in both datasets. Owing to its ability to handle large and noisy data, SVM has been widely and successfully implemented in many computational studies (Brown et al. 2000; Guo et al. 2014; Chen et al. 2014). The performance of SVM was further compared with four deep learning algorithms, CNN, ABCNN, LSTM and AE, using training dataset-II with the respective selected feature sets. SVM outperformed all four deep learning algorithms. The lower accuracies of the other shallow learning models and the deep learning models may be because the features selected using SVM-RFE are not well suited to these methods.

Conclusion

The proposed tool ASmiR (https://iasri-sg.icar.gov.in/asmir/) offers an alternative approach for predicting abiotic stress–specific (cold, drought, heat and salt) miRNAs using features derived from miRNA sequences. In view of the encouraging results, ASmiR can be used for large-scale prediction of abiotic stress–specific miRNAs using only sequence information. Given the importance of miRNAs in plant response to abiotic stresses and the lack of computational methods, it is anticipated that the proposed approach will supplement the existing experimental techniques for identifying abiotic stress–specific miRNAs.