1 Introduction

Long non-coding RNAs (lncRNAs) with biological functions have been discovered in recent years in both animals [1,2,3] and plants [4,5,6]. lncRNAs have been found to play a wide range of roles in developmental processes, such as chromosome transcription and inactivation, gene expression and silencing, and the cell cycle [7]. Increasing evidence shows that a number of lncRNAs with small open reading frames (sORFs) of no more than 300 nucleotides (nt) can encode small peptides of no more than 100 amino acids (aa) [8, 9]. The first small peptide found to be encoded by a sORF in a plant lncRNA was the soybean ENOD40 peptide, which regulates the conversion and uptake of sucrose in legume root nodules [10]. Frank et al. identified a novel small protein that promotes the division and polarized growth of maize leaf epidermal cells [11]. Li et al. found that small peptides encoded by sORFs in plant lncRNAs can regulate plant organogenesis and leaf morphogenesis [12]. In Drosophila, a lncRNA was found to encode three 11-aa and one 32-aa small peptides that function during the epidermal morphogenesis of embryonic development by regulating the structure of F-actin bundles [13]. According to Pauli and colleagues, during zebrafish gastrulation a small peptide known as Toddler, encoded by a lncRNA sORF, stimulates cell motility by activating APJ/Apelin receptor signaling [14]. Olson et al. identified DWORF, a 34-aa small peptide encoded by a lncRNA sORF; DWORF is abundantly expressed in the mouse heart, regulates muscle contraction, and is suppressed in human ischemic heart tissue, suggesting an involvement in heart failure [3]. Matsumoto et al. found that the lncRNA LINC00961 encodes a small peptide called SPAR, which inhibits the activity of mTORC1 and thereby regulates muscle regeneration [15].

As more small peptides are discovered, related research has attracted increasing attention. Current research on sORFs is carried out mainly through computational prediction and biological experiments. The principal experimental approaches are ribosome profiling [16,17,18], mass spectrometry [19, 20], and immunoblot assays [21]. Because sORFs and small peptides are short, have small relative mass, and occur in very large numbers, these experiments face many limitations: they are time-consuming, inefficient, error-prone, and costly, and batch identification is difficult to achieve. With the rapid development of machine learning, such algorithms have played an important role in predicting lncRNA-disease associations [22], identifying cell-penetrating peptides [23], identifying lncRNAs [24], modeling miRNA-lncRNA interactions [25], and predicting DNA–protein binding sites [26]. Moreover, machine learning can provide a powerful reference for experimental validation, saving substantial time and cost and accelerating the pace of research.

Machine learning-based identification of small peptides is still in its infancy. CRITICA [27], CPC2 [28], and PhyloCSF [29], alignment-based methods used to distinguish mRNAs from lncRNAs, can be applied to identify small peptides. A clear disadvantage of alignment-based methods, however, is their heavy reliance on pre-existing data: if new data diverge significantly from the historical data, prediction quality suffers as a direct consequence. Alignment-free methods, in contrast, depend only on the intrinsic information of the sequence, making them more flexible and general. Zhu et al. [30] created MiPepid, a tool designed exclusively for recognizing micropeptides, which uses 4-mer features to build logistic regression (LR) models; compared with tools such as CPC, CPC2, and CPAT, it not only predicts regular-sized proteins well but also identifies micropeptides well. Tong et al. [31] proposed CPPred, which identifies coding RNAs with an SVM using 8 RNA sequence-based and protein sequence-based features collected from CPAT and CPC2, with the addition of CTD features; it also does a good job of distinguishing coding from non-coding RNAs shorter than 303 nt. Zhang et al. [32] adopted the CPPred dataset, extracting and integrating multiple basic sequence composition features as well as a newly proposed nucleotide bias descriptor. The mDS feature selection method was used to filter the features before they were fed into a CNN, yielding the CNN-based RNA coding potential prediction method DeepCPP. Notably, DeepCPP overcomes the sORF mRNA identification barrier, performing well not only on normal data but also on sORF-type data in particular. In addition, the authors collected 8 small peptides encoded by ncRNAs that are associated with cancers or other diseases for their experiments, further demonstrating the good performance of DeepCPP.

The above methods have been of great help to this work, as they have produced outstanding results in identifying small peptides and discriminating between coding and non-coding RNAs. It is noteworthy, however, that existing methods use relatively few features, and the lack of research on feature representation capability hinders further improvement of prediction performance; if discriminative information can be fully mined from multiple perspectives, prediction performance is expected to improve further. Second, existing methods are generally single shallow machine learning or deep learning models, such as XGBoost, SVM, and CNN. Single classifiers have their own drawbacks, which leaves room for further improvement.

Owing to factors such as data sample size, current research on small peptides has been skewed toward humans and animals, while relatively little has been done on plants. Since ncRNAs are produced differently in plants and animals [33], there may likewise be differences between the small peptides encoded by sORFs in plant and animal lncRNAs. Whether predictors trained on human or animal datasets can be applied directly to the study of small peptides encoded by sORFs in plant lncRNAs therefore remains to be validated, and it is imperative to develop a method suited to predicting sORFs in plant lncRNAs. Accurate and effective prediction of sORFs with coding potential in plant lncRNAs will not only lay the foundation for further identifying small peptides with biological functions encoded by sORFs in lncRNAs, but will also be of great importance for studies such as plant breeding and the exploration of plant biological processes.

To address the aforementioned challenges, a new ensemble learning-based method termed sORFPred is proposed. Its contributions can be summarized as follows. (1) A model named MCSEN, based on multi-scale convolution and Squeeze-and-Excitation Networks (SENet) [34], is designed to extract generative high-level features. (2) To fully mine the discriminative information of sORFs from various perspectives, a multi-feature integration strategy fuses 16 sequence-based and physicochemical descriptors with the generative high-level features to obtain 2307-dimensional features. (3) Principal component analysis (PCA) is utilized to optimize the feature space, and the Boruta feature selection method [35] is adopted to remove redundant features. (4) To obtain accurate and robust results, the base classifiers are tuned with the Bayesian optimization package [36] and then combined by LR to form the final predictor. The performance and generalization capability of sORFPred were verified by comparison with several existing methods; the results show that sORFPred outperforms both shallow machine learning and deep learning models, with an accuracy of 97.28% on the Arabidopsis thaliana (A. thaliana) dataset.

The rest of this paper is structured as follows. Section 2 provides an overview of the dataset acquisition, feature engineering, and sORFPred’s framework. Subsequently, Sect. 3 analyzes and discusses the experimental results. Lastly, the presented work is summarized in Sect. 4 along with a preliminary discussion of future work.

2 Materials and Methods

2.1 Framework of sORFPred

An ensemble learning method called sORFPred is proposed in this paper; its framework is shown in Fig. 1. sORFPred consists of three phases: (1) feature extraction, (2) feature optimization, and (3) ensemble classification. In phase (1), each sORF sequence is encoded into a 168-dimensional feature vector by 9 nucleotide sequence-based descriptors, while its corresponding amino acid sequence is encoded into a 1627-dimensional feature vector by 7 amino acid sequence-based descriptors, giving a 1795-dimensional manually extracted feature vector in total. In addition, the MCSEN model converts each sORF sequence into a 512-dimensional feature vector. In phase (2), the Boruta package and PCA work together to optimize the original features. The classifiers in phase (3) then make predictions based on the final feature vectors. The predictor is a two-layer model built with the stacking strategy: the first layer uses Extra Trees base classifiers tuned by the Bayesian optimization package, and the second layer uses LR to combine them.

Fig. 1
figure 1

The overall architecture of sORFPred method. It comprises four phases: A dataset construction. B Feature extraction. C Feature optimization. D Building ensemble model

2.2 Datasets Construction

Currently, owing to the scarcity of experimentally validated sORFs, credible datasets must be constructed with the help of available bioinformatics tools and public databases. A. thaliana, the most widely used model plant, has been studied intensively; Glycine max (G. max) and Physcomitrella patens (P. patens) also have relatively abundant data and have been commonly used in previous studies [37]. Therefore, the lncRNA data of these species were downloaded from GreeNC [38], and sORFfinder [39] and ORF finder [40] were then used to obtain sORFs. After the intersection and difference sets of the two tools' results were obtained, sequences with similarity higher than 80% were removed using CD-HIT [41]. Since sORFfinder predicts sORFs with coding ability, the intersection was used as the candidate positive sample set and the difference set as the candidate negative sample set. Then, based on the idea of logical reasoning [42], a knowledge base was built to further filter the candidate positive and negative sample sets and improve the credibility of the dataset, yielding Dataset1. In addition, Dataset2 was constructed to test the performance and generalization ability of sORFPred: 20 sORF sequences of functional lncRNA-encoded small peptides from Drosophila melanogaster (D. melanogaster), Homo sapiens (H. sapiens), Mus musculus (M. musculus), G. max, Zea mays (Z. mays), and A. thaliana were downloaded from ncEP [43] as positive samples, while 40 sORFs without coding potential that do not belong to Dataset1 were picked at random as the negative sample set. A summary of all datasets is presented in Table 1.
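As an illustration, the candidate labeling logic reduces to simple set operations. The sketch below uses hypothetical sequence IDs; since the text does not specify the direction of the difference set, a symmetric difference is assumed, and the CD-HIT invocation in the final comment shows typical redundancy-removal usage at the 80% identity level (the exact flags are an assumption).

```python
# Toy identifiers standing in for the two tools' outputs (hypothetical IDs)
sorffinder_hits = {"sORF_001", "sORF_002", "sORF_007"}  # predicted to have coding ability
orffinder_hits = {"sORF_001", "sORF_003", "sORF_007"}

candidate_pos = sorffinder_hits & orffinder_hits  # intersection -> candidate positives
candidate_neg = sorffinder_hits ^ orffinder_hits  # difference -> candidate negatives (assumed symmetric)
print(sorted(candidate_pos), sorted(candidate_neg))

# Redundant sequences (>80% similarity) are then removed with CD-HIT, e.g.:
#   cd-hit-est -i candidates.fa -o candidates_nr.fa -c 0.8 -n 5
```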

Table 1 Datasets information

2.3 Feature Extraction

A multi-feature integration strategy is used to fuse various feature descriptors and thus fully mine the discriminative information of sORFs from different perspectives. Based on sequence categories and extraction methods, these descriptors fall into three major categories: nucleotide sequence-based features, amino acid sequence-based features, and features extracted by the MCSEN model. The sORF sequences and their corresponding amino acid sequences are thereby encoded with 1795-dimensional manually extracted feature vectors and 512-dimensional MCSEN-extracted feature vectors. All features are summarized in Table 2, and the details of how these descriptors are encoded can be found in Supplementary Method S1. However, high feature dimensionality and superfluous features would constrain the performance of the model, so a feature selection strategy, described in the feature optimization part, is used to optimize the feature space.

Table 2 Features information

2.3.1 Nucleotide Sequence-Based Features

In order to predict the sORFs with coding potential, sequence-based features of sORFs were extracted using traditional RNA feature extraction methods, including k-mer [44], short sequence motifs (SSM) [45], signal-to-noise (SN) [46], the content of bases C and G (GC_content), the ratio of bases C and G (GC_ratio), and the length of the sequence (sORF_length). In addition, we adopted several RNA feature descriptors from recently published work [31, 47] and used them to predict sORFs for the first time: the Fickett score and Hexamer score are derived from CPAT [47], and the CTD descriptor introduced by CPPred [31] is also added.

In total, 168-dimensional features are extracted for the sORF sequences. The k-mer descriptor, as an approximate expression of codon frequencies, describes the nucleotide sequence composition. GC_ratio and GC_content are extracted because the genome of an organism, or a specific DNA or RNA segment, has a characteristic content of bases C and G. The SN descriptor can be interpreted as the strength of the 3-base periodicity per nucleotide and indicates the bias of base usage in sORFs. Since the k-mer descriptor only considers the properties of contiguous nucleotides, the SSM descriptor is introduced to describe the association between discontinuous nucleotides. The Fickett score descriptor describes the difference in the combined effect of nucleotide composition and codon usage bias in sORF sequences; it is calculated from four position values together with four composition values. The Hexamer score descriptor distinguishes coding from non-coding sequences based on hexamer usage bias, measured by the log-likelihood ratio of hexamer usage between coding and non-coding sequences. Finally, the CTD descriptor captures differences in nucleotide composition, nucleotide transition, and nucleotide distribution between coding and non-coding sequences.
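To make two of these descriptors concrete, the following minimal sketch computes k-mer frequencies and GC_content for a toy sORF; the sliding-window stride and normalization are standard choices rather than details taken from the paper, and the remaining descriptors (SSM, SN, Fickett, Hexamer, CTD) follow analogous counting schemes.

```python
from itertools import product

def kmer_freq(seq, k=3):
    """Normalized frequencies of all 4^k k-mers (sliding window, stride 1)."""
    kmers = [''.join(p) for p in product('ACGT', repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    total = max(len(seq) - k + 1, 1)
    return [counts[km] / total for km in kmers]

def gc_content(seq):
    """Fraction of C and G bases in the sequence."""
    return (seq.count('G') + seq.count('C')) / max(len(seq), 1)

sorf = "ATGGCGTGCAAGGTTTAG"
vector = kmer_freq(sorf, k=3) + [gc_content(sorf), float(len(sorf))]  # 64 + 2 values
```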

2.3.2 Amino Acid Sequence-Based Features

First, the sORF sequences in the dataset are translated into amino acid sequences based on the correspondence between codons and amino acids. For the amino acid sequences, seven descriptors collected from iFeature [48] yield 1627-dimensional features. Amino Acid Composition (AAC) describes the composition frequencies of the 20 amino acids. Based on dipoles and side chain volumes, the 20 amino acids can be categorized into 7 groups, and the k-Spaced Conjoint Triad (KSCTriad) descriptor treats any three amino acids separated by k (k = 0, 1, 2) residues as a single unit when considering the properties of an amino acid and its neighbors. The Composition, Transition, and Distribution (CTD) descriptor categorizes the 20 amino acids into 3 groups for each of 13 physicochemical attributes, indicating the distribution patterns of a given structural or physicochemical property along a peptide or protein sequence. On the basis of physical properties such as hydrophobicity, charge, and molecular size, the 20 amino acids are further classified into 5 categories, and the frequency of each group is represented by the Grouped Amino Acid Composition (GAAC) descriptor. In addition, the Grouped Dipeptide Composition (GDPC) and Grouped Tripeptide Composition (GTPC) descriptors define the grouped dipeptide and tripeptide compositions of an amino acid sequence, respectively. Moreover, the Composition of k-Spaced Amino Acid Group Pairs (CKSAAGP) descriptor is employed to calculate the frequency of amino acid group pairs separated by any k residues (k = 0, 1, 2, …, 5).
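As a concrete example of the two simplest descriptors, the sketch below computes AAC and GAAC for a toy peptide; the five GAAC groups follow the iFeature convention, and everything else (the toy sequence, helper names) is illustrative.

```python
AA = "ACDEFGHIKLMNPQRSTVWY"
# Five physicochemical groups used by GAAC (per the iFeature convention)
GROUPS = {
    'aliphatic': 'GAVLMI',
    'aromatic': 'FYW',
    'positive': 'KRH',
    'negative': 'DE',
    'uncharged': 'STCPNQ',
}

def aac(pep):
    """Amino Acid Composition: frequency of each of the 20 residues."""
    n = max(len(pep), 1)
    return [pep.count(a) / n for a in AA]

def gaac(pep):
    """Grouped Amino Acid Composition: frequency of each residue group."""
    n = max(len(pep), 1)
    return [sum(pep.count(a) for a in group) / n for group in GROUPS.values()]

peptide = "MKVLATGCW"
vector = aac(peptide) + gaac(peptide)   # 20 + 5 values
```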

2.3.3 Features Extracted by the MCSEN Model

In deep learning methods, features are extracted automatically by the artificial neural network, which reduces human intervention and provides more feature information than manually extracted features. To obtain further information about the sORF sequences, the MCSEN model, combining multi-scale convolution and SENet [34], is constructed to extract 512-dimensional local features.

Traditional encoding methods tend to ignore the correlation between nucleotides, so the p-nts [49] encoding method is used to encode the sORF sequences. Instead of a single convolution kernel, a multi-scale convolution operation is used to extract features. To compensate for the information loss caused by treating all channels as equally important during convolution and pooling, the SENet structure is introduced. SENet adopts a feature rescaling strategy that learns the importance of each channel automatically, then enhances useful features and suppresses those that are useless for the task at hand, thus highlighting key features and further improving model performance. During the training phase, hidden neurons are randomly dropped by Dropout and training is stopped early to avoid overfitting. The overall architecture is shown in Fig. 2. Feature extraction with MCSEN comprises the following steps.

Fig. 2
figure 2

Overall architecture of MCSEN model. There are three main operations: (1) encode sORFs sequence by p-nts encoding method, (2) extract local features by multi-scale convolution layer and max-pooling layer, and (3) Recalibrate channel-wise feature responses by SENet structure

Step 1: The sORF sequences are split and encoded using the p-nts (p = 3) encoding method, as sketched below.
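A minimal sketch of this step follows; it assumes the words are non-overlapping (so that a 303-nt sORF yields the 101 tokens matching the embedding in Step 2) and maps each 3-nt word to an integer id for the embedding layer.

```python
def pnts_split(seq, p=3):
    """Split a sequence into non-overlapping p-nt words (stride assumed = p)."""
    return [seq[i:i + p] for i in range(0, len(seq) - p + 1, p)]

def pnts_encode(seq, p=3):
    """Map each p-nt word to an integer id in [0, 4^p) for the embedding layer."""
    idx = {base: i for i, base in enumerate('ACGT')}
    ids = []
    for word in pnts_split(seq, p):
        code = 0
        for base in word:
            code = code * 4 + idx[base]
        ids.append(code)
    return ids

pnts_encode("ATGGCATAG")   # [14, 36, 50] for the words ATG, GCA, TAG
```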

Step 2: The embedding layer maps the encoded sequence into a 128 × 101 matrix to facilitate convolutional operations and feature extraction.

Step 3: To avoid loss of effective information, convolution kernels of 4 different scales are used to more fully extract local features. The convolution pooling operation for each scale is performed as follows.

(a) 64 convolution kernels of scale f (\(K \in \Re^{m \times f}\)) are selected for the convolution operation to obtain the convolved feature matrix C, where m denotes the convolution kernel width which is equal to the embedding dimension and f is the convolution kernel length.

(b) A max-pooling operation is performed on the feature matrix C to extract the important local feature information P, where ci is the i-th convolution feature map, f denotes the convolution kernel scale, and l is the length of the sequence. After the convolution operation with a kernel of scale f, the output of the max-pooling operation with pooling size 1 × (l − f) is as follows:

$$P_{i}^{l-f} = \max \left( c_{i}, c_{i+1}, \ldots, c_{i+l-f-1} \right), \quad i \in \left( 1, 2, \ldots, f+1 \right)$$
(1)

(c) After performing the convolution and pooling operation on the 4 scales of convolution kernels f1, f2, f3, and f4, the output results of each are concatenated to obtain the final result V of the multi-scale convolution operation, which is represented as follows:

$$V = \left[ P^{l-f_{1}}, P^{l-f_{2}}, P^{l-f_{3}}, P^{l-f_{4}} \right]$$
(2)

Step 4: Input V into the SENet structure to recalibrate channel-wise feature responses.

First, the feature map with input size W × H × N is squeezed, that is, global average pooling with pooling size W × H is performed, compressing the feature map to a 1 × 1 × N vector. Subsequently, a two-layer fully connected bottleneck structure performs the excitation operation to determine the weight of each channel in the feature map. The number of channels is reduced by the SERatio parameter to lower the computational cost; SERatio is set to 1/58 in this paper. Finally, the weight determined by the SENet structure for each channel is multiplied by the 2-D matrix of the corresponding channel in the original feature map, and the result is output.

Step 5: The results obtained in step 4 are input to the Flatten layer, which turns the multidimensional input into one dimension.

Step 6: A Dense layer with a single unit is then connected, and the sigmoid activation function maps the feature vector to [0, 1] to obtain the probability of the predicted label.

Step 7: Finally, the sORF sequences are fed into the MCSEN model to extract local features, and the output of the Flatten layer is then extracted to obtain 512-dimensional features.
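To tie Steps 1–7 together, the following Keras sketch assembles the MCSEN architecture as described above. It is a minimal sketch, not the authors' code: the four kernel scales, dropout rate, bottleneck width, and layer names are assumptions, while the 64 kernels per scale, the l − f pooling, the SE recalibration, and the 512-dimensional Flatten output follow the text.

```python
from tensorflow.keras import Model, layers

VOCAB = 64             # 4^3 possible 3-nt words (p-nts, p = 3)
SEQ_LEN = 101          # tokens per sORF, matching the 128 x 101 embedding
EMB_DIM = 128
SCALES = (2, 3, 4, 5)  # the four kernel scales f1..f4 (values assumed)

def se_block(x, bottleneck=1):
    """Squeeze-and-Excitation: learn per-channel weights and rescale."""
    n = x.shape[-1]
    s = layers.GlobalAveragePooling1D()(x)              # squeeze to 1 x 1 x N
    s = layers.Dense(bottleneck, activation='relu')(s)  # reduction (paper: SERatio = 1/58)
    s = layers.Dense(n, activation='sigmoid')(s)        # excitation: channel weights
    s = layers.Reshape((1, n))(s)
    return layers.Multiply()([x, s])                    # recalibrate channels

inputs = layers.Input(shape=(SEQ_LEN,))
x = layers.Embedding(VOCAB, EMB_DIM)(inputs)
branches = []
for f in SCALES:
    c = layers.Conv1D(64, f, activation='relu')(x)      # 64 kernels of scale f
    # pooling size l - f with stride 1 leaves 2 positions per branch:
    # 4 branches x 2 positions x 64 channels = 512 after Flatten
    p = layers.MaxPooling1D(pool_size=SEQ_LEN - f, strides=1)(c)
    branches.append(p)
v = layers.Concatenate(axis=1)(branches)                # multi-scale output V
v = se_block(v)
features = layers.Flatten(name='features')(v)           # 512-d feature vector
outputs = layers.Dense(1, activation='sigmoid')(layers.Dropout(0.5)(features))

mcsen = Model(inputs, outputs)
mcsen.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# After training, harvest the Flatten-layer output as the 512-d features (Step 7):
extractor = Model(inputs, mcsen.get_layer('features').output)
```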

2.4 Feature Optimization

The Boruta package [35] differs from common feature selection methods in that it aims to select all features that are relevant to the dependent variable. Boruta is based on the idea of shadow features and the binomial distribution: it determines feature importance by creating synthetic features consisting of the target features with their values randomly permuted. The process of Boruta is shown in Fig. 3, and the specific steps are: (1) shuffle the original feature matrix to obtain shadow features, then form a new feature matrix by concatenating the original features with the shadow features; (2) use the new feature matrix as input to train the classifier; (3) calculate the importance values of the original features and the shadow features separately; (4) if the importance value of an original feature is higher than that of the shadow features, mark the feature as “important” and retain it; otherwise, mark it as “unimportant” and remove it from the feature set; (5) delete all shadow features; (6) repeat the above steps until every feature is marked as “important” or “unimportant”.

Fig. 3
figure 3

Framework of Boruta feature selection method. The specific steps are: (1) shuffle the original feature matrix to obtain shadow features, (2) calculate the corresponding importance values of shadow features and original features separately, and (3) filter features based on feature importance in an iterative process

To investigate the effect of the feature importance ranking algorithm on the classification results, RF, XGBoost, LightGBM, and GBDT were each used as the ranking algorithm for feature selection under the Boruta framework. For fairness of comparison, 'n_estimators' was set to 'auto' and 'max_depth' was set uniformly to 5 in the Boruta framework, and the filtered features were then fed into the ensemble classification model. The experimental results are provided in Supplementary Table S1. The experiments show that with XGBoost or LightGBM as the ranking algorithm, the number of selected features is small and the information they contain is too one-sided: although accuracy improves, generalization is relatively poor, as seen on the two independent test sets. Compared with the RF ranking algorithm, the Boruta framework with GBDT further removes redundant features while retaining relatively comprehensive feature information. Therefore, GBDT is adopted as the feature importance ranking algorithm under the Boruta framework.

To enhance the model performance and better understand the features of the data, the manually extracted 1795-dimensional features are filtered by the Boruta package to obtain all features useful for prediction (Boruta1795). To remove redundant data as well as prevent the overfitting phenomenon, the features extracted by MCSEN (MCSEN512) are dimension-reduced by PCA to obtain a new feature set (MCSEN10). Then, Boruta1795 is combined with MCSEN10 to form the final feature set.
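A minimal sketch of this optimization pipeline, using the BorutaPy implementation with a GBDT estimator and scikit-learn's PCA, is shown below; the random stand-in matrices, sample count, and max_iter value are placeholders, not the paper's settings.

```python
import numpy as np
from boruta import BorutaPy
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_manual = rng.random((200, 1795))   # stand-in for the 1795 manual features
X_mcsen = rng.random((200, 512))     # stand-in for the MCSEN features
y = rng.integers(0, 2, 200)

# Boruta with GBDT as the importance-ranking estimator, as chosen above
gbdt = GradientBoostingClassifier(max_depth=5)
selector = BorutaPy(gbdt, n_estimators='auto', max_iter=20, random_state=0)
selector.fit(X_manual, y)
boruta1795 = X_manual[:, selector.support_]   # all features judged relevant

# PCA compresses the 512 MCSEN features to 10 components (MCSEN10)
mcsen10 = PCA(n_components=10).fit_transform(X_mcsen)

final_features = np.hstack([boruta1795, mcsen10])   # final feature set
```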

2.5 Bayesian Optimization Method

The Bayesian optimization method [36] builds probabilistic models based on information from previous evaluations of the objective function and finds the value that minimizes or maximizes the objective function in a minimal number of steps. Compared with commonly used algorithms such as particle swarm optimization, random search, genetic algorithms, and grid search, Bayesian optimization takes previous parameter information into account and constantly updates the prior, so it requires fewer iterations, performs better, and avoids much wasted computation.
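As an illustration of how such a package can be applied to the base classifiers of Sect. 2.6, the sketch below tunes an Extra Trees model with the bayes_opt package; the search bounds, iteration counts, and synthetic data are assumptions.

```python
from bayes_opt import BayesianOptimization
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

def et_cv(n_estimators, max_depth):
    """Objective: mean five-fold CV accuracy for the given hyperparameters."""
    clf = ExtraTreesClassifier(n_estimators=int(n_estimators),
                               max_depth=int(max_depth), random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()

optimizer = BayesianOptimization(
    f=et_cv,
    pbounds={'n_estimators': (50, 500), 'max_depth': (3, 30)},  # bounds assumed
    random_state=0,
)
optimizer.maximize(init_points=5, n_iter=20)  # 5 random probes, then 20 guided steps
print(optimizer.max)                          # best score and hyperparameters found
```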

2.6 Ensemble Learning Construction

Ensemble models can make stronger and more accurate predictions than a single classifier, because each classifier in the ensemble has its own strengths. Ensemble models have many successful applications in bioinformatics [50,51,52], and stacking is a common ensemble strategy. A two-layer stacking strategy is used in this paper. First, the final features are input into several shallow machine learning models, and the performance of each model is measured by five-fold cross-validation. The model with the best prediction performance is then selected and further tuned by the Bayesian optimization method to obtain the base classifiers of the first layer. Finally, the prediction results of the base classifiers are input into the LR model of the second layer to obtain the final predictions.
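A minimal sketch of this two-layer design using scikit-learn's StackingClassifier is given below; the number of Extra Trees base learners and their hyperparameters are assumptions (in the paper they come from the Bayesian optimization step), and the synthetic data is a placeholder.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

# First layer: diversified Extra Trees base classifiers (count assumed)
base = [(f'et{i}', ExtraTreesClassifier(n_estimators=300, random_state=i))
        for i in range(3)]

# Second layer: LR combines the base classifiers' out-of-fold predictions
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(),
                           cv=5)
print(cross_val_score(stack, X, y, cv=5, scoring='accuracy').mean())
```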

2.7 Implementation of sORFPred

MCSEN is implemented in Keras 2.2.4 with a TensorFlow 1.12.2 backend, with scripts written in Python 3.6.5, while sORFPred is implemented in Keras 2.7.0 with a TensorFlow 2.7.0 backend, with scripts written in Python 3.8.5. The experiments were run on a PC equipped with 16 GB of RAM, an AMD Radeon R7 200 series GPU, and a 4-core Intel Core i5-6500 3.2 GHz CPU.

2.8 Evaluation Criteria

In this paper, four commonly used evaluation criteria are used to evaluate the performance of sORFPred. They are formulated as follows:

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
(3)
$${\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(4)
$${\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
(5)
$${\text{F1 - score}} = \frac{{{\text{2TP}}}}{{{\text{2TP}} + {\text{FP}} + {\text{FN}}}}$$
(6)

where TP, TN, FP, and FN denote the numbers of true-positive, true-negative, false-positive, and false-negative samples, respectively. For all the metrics listed above, a higher score indicates better model performance.
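In practice these four criteria can be computed directly with scikit-learn; the toy labels below are illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]   # 1 = sORF with coding potential, 0 = without
y_pred = [1, 1, 0, 0, 1, 0]   # hypothetical predictions: TP=2, FN=1, TN=2, FP=1

print('Accuracy :', accuracy_score(y_true, y_pred))   # (TP+TN)/all = 4/6
print('Recall   :', recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/3
print('Precision:', precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/3
print('F1-score :', f1_score(y_true, y_pred))         # 2TP/(2TP+FP+FN) = 2/3
```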

3 Results

3.1 Performance Analysis of the Model in Different Feature Spaces

This section analyzes the performance of the various types of features on the A. thaliana dataset. In the feature representation stage, three categories of features were extracted for encoding sORF sequences: nucleotide sequence-based features (nt168), amino acid sequence-based features (aa1627), and features extracted by the MCSEN model (MCSEN512). The dimension of MCSEN512 was reduced with PCA to obtain a new feature set (MCSEN10). The nucleotide sequence-based and amino acid sequence-based features were fused to obtain 1795-dimensional extracted features (original1795); original1795 was then filtered with the Boruta package and fused with MCSEN10 to obtain the final feature set (final-feature). This section examines which type of feature is more discriminative in separating sORFs with coding potential from sORFs without coding ability. For a fair comparison, the nt168, aa1627, and original1795 features were processed separately by the Boruta package to obtain the optimized feature subsets Boruta168, Boruta1627, and Boruta1795, and five-fold cross-validation was conducted to compare the performance of the ensemble model with each type of feature. The performance is shown in Fig. 4A, and receiver-operating characteristic (ROC) curves are further plotted in Fig. 4B to present the comparison more clearly; detailed results are given in Supplementary Table S2. Clearly, all five types of features are effective in predicting sORFs. Notably, the fused feature Boruta1795 achieved better results than the single features Boruta168 and Boruta1627, while the final-feature obtained after fusing MCSEN10 achieved an average prediction accuracy of 97.28%, which is 6.67~9.17% higher than the manually extracted features (Boruta168, Boruta1627, and Boruta1795). This suggests that the MCSEN model can successfully learn the local features of sORF sequences. It can also be seen that the features optimized by the Boruta package (Boruta1795) achieved better performance than the original features (original1795), while reducing feature dimensionality and running time.

Fig. 4
figure 4

Results of the proposed sORFPred with different types of features on A. thaliana dataset. A The performances of sORFPred with different types of features. B ROC curves and AUC values of sORFPred with different types of features

3.2 Selection of Base Classifiers

In order to obtain the optimal base classifiers, GaussianNB, kNN [53], SVM [54], RF [55], and Extra Trees [56] were selected as candidate classifiers, and the performance of each was evaluated with five-fold cross-validation on the A. thaliana dataset. As shown in Supplementary Table S3, Extra Trees achieved the best performance, with 96.09% Accuracy, 95.31% Precision, 97.26% Recall, and 96.32% F1-score. In terms of Accuracy, Extra Trees exceeds the other models by 0.24–9.57% with a standard deviation (SD) of only 0.51%, indicating good stability. Overall, RF, Extra Trees, and kNN outperformed the other models by a large margin. Although RF is slightly higher than Extra Trees in Precision, Extra Trees outperforms RF in Accuracy, Recall, and F1-score by 0.31%, 2.74%, and 1.4%, respectively. Therefore, Extra Trees, with the best overall performance, was chosen as the base classifier.

3.3 Comparison with Other Models

To verify the effectiveness of sORFPred more impartially, it was compared on the A. thaliana dataset with commonly used deep learning models, namely CNN, BiLSTM, CNN + BiLSTM, CapsNet, and MConvMCaps [42], as well as with the state-of-the-art methods MiPepid, CPPred, and DeepCPP. The performance of each model is presented in Fig. 5, with detailed data available in Supplementary Table S4. As can be seen in Fig. 5, sORFPred clearly outperforms these models: its mean Accuracy, Precision, and F1-score are 97.28%, 97.06%, and 97.29%, respectively, which are 5.98~26.63%, 6.71~28.94%, and 5.88~24.7% higher than those of the compared models. Although sORFPred is slightly lower than CNN in Recall, the high recall of CNN comes at the expense of precision. Overall, sORFPred is more powerful than these commonly used deep learning models and state-of-the-art methods in distinguishing whether sORFs have coding ability. From Fig. 5E, it is clear that the area under the curve of sORFPred is significantly larger than that of the other curves, indicating high sensitivity and a low false-positive rate. In other words, sORFPred better learns the information embedded in the original data and thus achieves robust and credible prediction of sORFs.

Fig. 5
figure 5

Performance of the proposed sORFPred and other models on A. thaliana dataset in terms of A accuracy, B precision, C recall, D F1-score and E ROC curve

3.4 Prediction Performance on Other Species

In order to validate the generalization capability of sORFPred, experiments were conducted on the P. patens and G. max datasets. As shown in Fig. 6, the model trained on the A. thaliana dataset achieved accuracies of 76.72% on P. patens and 81.01% on G. max, indicating that sORFPred generalizes well to other plants.

Fig. 6
figure 6

Performance of sORFPred on other species

3.5 Comparison with the State-of-the-Art Methods

To further validate the performance of sORFPred, it was compared with the commonly used methods MiPepid, CPPred, and DeepCPP on Dataset2, which is composed of sORFs with validated coding capabilities. Two experiments were conducted on Dataset2: one predicted Dataset2 directly with the three existing tools, and the other retrained the existing tools on the A. thaliana dataset before making predictions on the sORFs in Dataset2. As can be seen in Fig. 7, without retraining, although MiPepid and sORFPred correctly predicted the highest number of the 20 positive samples in Dataset2, MiPepid had a false-positive rate of 40%. For the negative samples, DeepCPP correctly predicted 39 of the 40 negative samples in Dataset2, slightly better than sORFPred, but it was a poor predictor of positive samples, with a high false-negative rate. After the three tools were retrained, their performance improved significantly, yet sORFPred still performed comparatively well. The comparison of the two experiments also makes clear that the existing tools, before retraining, do not perform well in predicting sORFs in lncRNAs because of the composition of their original training datasets, which further demonstrates that sORFPred is a good method for predicting sORFs in lncRNAs.

Fig. 7
figure 7

Performance of sORFPred compared to the state-of-the-art methods

4 Conclusions

To the best of our knowledge, this research is the first to predict sORFs with coding potential in plant lncRNAs using such comprehensive and detailed features together with an ensemble learning model tuned by the Bayesian optimization method. In comparison with existing methods, it achieves greater performance and generalization capability, and we expect that sORFPred will become a potent method for the large-scale prediction of sORFs. The prediction of sORFs with coding ability in plant lncRNAs will not only lay the foundation for the discovery of lncRNA-encoded small peptides, but will also provide an important reference for biological experimental validation, which is conducive to revealing the molecular mechanisms underlying life-form traits and disease resistance and is of great value in agriculture, forestry, and other fields. In this research area, the majority of current techniques construct predictors with a single classification algorithm, such as RF or SVM; in fact, it has been demonstrated that well-established ensemble classifiers can increase prediction quality in protein fold classification, DNA-binding protein prediction, and other applications. Our future research will concentrate primarily on investigating more effective feature selection techniques and more powerful classification algorithms.