
1 Introduction

The widespread use of next generation sequencing (NGS) technologies in recent years has increased the volume of biological data generated, including both DNA sequences and derived protein sequences. A challenge arising from this increased volume is organizing, analyzing, and interpreting the data in order to create or improve genome assemblies or genome annotation, or to predict protein function, structure, and localization, among others. Some of these problems can be framed as biological sequence classification problems, i.e., assigning one of several labels to a DNA or protein sequence based on its content. Examples include predicting the presence or absence of an acceptor or donor splice site in DNA sequences centered around AG or GT dimers, respectively, and determining where a protein is localized (e.g., in the cytoplasm, inner membrane, periplasm, outer membrane, or extracellular space), a task known as protein localization.

Using machine learning or statistical inference methods allows labeling of biological data several orders of magnitude faster than it can be done manually, and with high accuracy. For example, hidden Markov models are currently used in gene prediction algorithms, and support vector machines have shown promising results on handwritten digit classification [29], optical character recognition [16, 23], classification of translation initiation sites based on proximity to the start codon within a sequence window [16] or on positional nucleotide incidences [32], classification of gene expression profiles as malignant or benign [18], ab initio gene prediction [2], classification of DNA sequences according to whether or not they have a splice site at a given location [1, 5, 10, 11, 13, 20, 25, 26, 28, 31], and classification of gene function based on gene expression data [3].

However, using a supervised classifier trained on a source domain to predict data from a different target domain usually results in reduced classification accuracy. Instead of using the supervised classifier, an algorithm developed in the domain adaptation framework can be employed to transfer knowledge from the source domain to the target domain. Such an algorithm has to take into consideration the fact that some, if not all, of the features have different probabilities in the target and source domains [12]. In other words, some of the features that are correlated with a label in the source domain might not be correlated with the same label, or with any label, in the target domain, while other features have the same label correlations in both domains. The former are known as domain specific features and the latter as generalizable features [12].

Domain adaptation algorithms are particularly applicable to many biological problems for which there is a large corpus of labeled data for some well studied organisms and much less labeled data for an organism of interest. Thus, when studying a new organism, it would be preferable if the knowledge from other, more extensively studied organism(s), could be applied to a lesser studied organism. This would alleviate the need to manually generate enough labeled data to use a machine learning algorithm to make predictions on the biological sequences from the target domain. Instead, we could filter out the domain specific features from the source domain and use only the generalizable features between the source and target domains, together with the target specific features, to classify the data.

Towards this goal, we modified the Adapted Naïve Bayes (ANB) algorithm to make it suitable for the biological data. We chose this algorithm because Naïve Bayes based algorithms are faster and require no tuning. In addition, this algorithm was successfully used by Tan et al. [27] on text classification for sentiment analysis, discussed in Sect. 2. It combines a weighted version of the multinomial Naïve Bayes classifier with the Expectation-Maximization algorithm. In the maximization step, the class probabilities and the conditional feature probabilities given the class are calculated using a weighted combination between the labeled data from the source domain and the unlabeled data from the target domain. In the expectation step, the conditional class probabilities given the instance are calculated with the probability values from the maximization step using Bayes theorem. The two steps are repeated until the probabilities in the expectation step converge. With each iteration, the weight is shifted from the source data to the target data. The key modifications we made to this algorithm are the use of labeled data from the target domain, and the incorporation of self-training [14, 22, 30] to make it feasible for biological data, as presented in more detail in Sect. 3.

We tested the ANB classifier on three biological datasets, described in Sect. 3.4, for classifying the localization of proteins and predicting splice site locations. The experimental results in Sect. 3.6 show that this classifier achieves better classification accuracy than a Naïve Bayes classifier trained on the source domain and tested on the target domain, especially when the two domains are less related.

2 Related Work

Up to now, most of the work in domain adaptation has been on non-biological problems. For instance, text classification has received a lot of attention in the domain adaptation framework. One example, the Naïve Bayes Transfer Classification algorithm [4], assumes that the source and target data have different distributions. It trains a classifier on source data and then applies the Expectation-Maximization (EM) algorithm to fit the classifier to the target data, using the Kullback-Leibler divergence to determine the trade-off parameters in the EM algorithm. When tested on datasets from Newsgroups, SRAA, and Reuters for the task of top-category classification of text documents, this algorithm performed better than support vector machine and Naïve Bayes classifiers.

Another algorithm derived from the Naïve Bayes classifier that uses domain adaptation is the Adapted Naïve Bayes classifier [27], which identifies and uses only the generalizable features from the source domain, and the unlabeled data with all the features from the target domain to build a classifier for the target domain. This algorithm was evaluated on transferring the sentiment analysis classifier from a source domain to several target domains. The prediction rate was promising, with Micro F1 values between 0.69 and 0.90, and Macro F1 values between 0.59 and 0.91. However, the classifier did not use any labeled data from the target domain.

Nigam et al. [17] showed empirically that combining a small labeled dataset with a large unlabeled dataset from the same or different domains can reduce the classification error of text documents by up to 30 %. Their algorithm also combines Expectation-Maximization with the Naïve Bayes classifier: it first learns a classifier on the labeled data, which is then used to classify the unlabeled data. The combined datasets are used to train a new classifier, and the process iterates until convergence. By augmenting the labeled data with unlabeled data, fewer labeled instances were needed than when using only labeled data.

For biological sequences, most domain adaptation algorithms have employed support vector machines. For example, Sonnenburg et al. [26] used a support vector machine with a weighted degree kernel to classify DNA sequences according to whether or not they have a splice site at the location of interest. Even though the training data was highly skewed towards the negative class, their classifier achieved good accuracy.

For more work on domain adaptation and transfer learning, see the survey by Pan and Yang [19].

3 Methodology

3.1 Identifying and Selecting Generalizable Features from the Source Domain

To successfully adapt a classifier from the source domain to the target domain, the classifier has to identify in the source domain the subset of features that generalize well and are highly correlated with the label. It then uses a combination of only these features from the source domain and all the features from the target domain to predict the labels in the target domain.

We used the feature selection method proposed by Tan et al. [27]. Theoretically, the set of features in each domain can be split into four categories, based on two selection criteria. The first selection criterion is the level of correlation. The features have varying degrees of correlation with the label assigned to a sequence. Based on the correlation between the feature and the label, the features can be divided into features that are highly related to the labels, and features that are less related to the labels. The second selection criterion is the specificity of the features. Based on this criterion, the features can be divided into features that are very specific to a domain, and features that generalize well across related domains.

To select these features from the source domain we rank all the features from the source domain based on their probabilities. The features that are generalizable would most likely occur frequently in both domains, and should be ranked higher, as shown in Fig. 1. Moreover, the features that are correlated to the labels should be ranked higher. Therefore, we use the following measure to rank the features in the source domain:

$$\begin{aligned} f(w)=\displaystyle \log \frac{P_s(w)\cdot P_t(w)}{\vert P_s(w)-P_t(w)\vert +\alpha } \end{aligned}$$
(1)

where \(P_s\) and \(P_t\) are the probability of the feature \(w\) in the source and target domain, respectively. The numerator ranks higher the features that occur frequently in both domains, since the larger both probabilities are the larger the numerator is, and thus the higher the rank of the feature is. The denominator ranks higher the features that have similar probabilities (i.e., the generalizable features), since the closer the probabilities are for a feature in both domains, the smaller the denominator value is, and thus the higher the rank. The additional value in the denominator, \(\alpha \), is used to prevent division by zero. The higher its value is the more influence the numerator has in ranking the features, and vice versa. To limit its influence on ranking the features, we chose a small value for this parameter, 0.0001. The probability of a feature in either domain is

$$\begin{aligned} P(w)=\displaystyle \frac{N(w)+\beta }{\vert D\vert + 2\cdot \beta } \end{aligned}$$
(2)

where \(N(w)\) is the number of instances in the domain in which the feature \(w\) occurs, \(\vert D\vert \) is the total number of instances in the domain, and \(\beta \) is a smoothing factor, which prevents the probability of a feature from being 0 (which would make the numerator in (1) equal to 0, for which the logarithm is undefined). We chose a small value for \(\beta \) as well, 0.0001, to limit its influence on the ranking of features. Note that the values for \(\alpha \) and \(\beta \) do not have to be the same, but they can be, as used by Tan et al. [27] and in our case.
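The ranking in Eqs. (1) and (2) can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names and the dictionary-based representation (feature mapped to the number of instances containing it) are hypothetical, and a feature absent from the target domain is assumed to have a count of 0 there.

```python
import math

def feature_probability(n_w, n_docs, beta=1e-4):
    """Eq. (2): smoothed fraction of instances in a domain that contain the feature."""
    return (n_w + beta) / (n_docs + 2 * beta)

def rank_score(p_s, p_t, alpha=1e-4):
    """Eq. (1): large when the feature is frequent in both domains and P_s is close to P_t."""
    return math.log((p_s * p_t) / (abs(p_s - p_t) + alpha))

def rank_features(source_df, target_df, n_source, n_target, alpha=1e-4, beta=1e-4):
    """Score every source feature and return (features sorted best first, scores).

    source_df / target_df map a feature to the number of instances containing it
    (hypothetical input format chosen for this sketch).
    """
    scores = {}
    for w, n_w in source_df.items():
        p_s = feature_probability(n_w, n_source, beta)
        # Assumption: a feature never seen in the target domain has count 0 there.
        p_t = feature_probability(target_df.get(w, 0), n_target, beta)
        scores[w] = rank_score(p_s, p_t, alpha)
    order = sorted(scores, key=scores.get, reverse=True)
    return order, scores

# Toy example: "AA" is frequent in both domains, "GG" is rare but equally
# likely in both, and "AC" is frequent only in the source domain.
order, scores = rank_features(
    {"AA": 80, "AC": 80, "GG": 5}, {"AA": 78, "AC": 10, "GG": 5}, 100, 100
)
print(order)
```

In this toy example the domain specific feature "AC" receives the lowest score, since its probabilities in the two domains differ the most.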

Fig. 1. Ranking of features in the source domain using Eq. (1). The rank of a feature is higher if it has a high probability or occurs with similar probability in the target domain. Note: This graph was drawn using Octave [6].

3.2 Multinomial Naïve Bayes (MNB) Classifier

The multinomial naïve Bayes classifier [15] assumes that the sample data used to train the classifier is representative of the population data on which the classifier will be used. In addition, it assumes that the frequency of the features determines the label assigned to an instance, and that the position of a feature is irrelevant (the naïve Bayes assumption). Thus, using Bayes’ theorem, a classifier can approximate the posterior probability, i.e., the probability of a class given an unclassified instance, as being proportional to the product of the prior probability of the class and the conditional probabilities of the features given the class, each raised to the number of times the feature occurs in the instance:

$$\begin{aligned} P(c_k\mid d_i)\propto P(c_k)\prod _{t\in \vert V\vert }\left[ P(w_t\mid c_k)\right] ^{\displaystyle N_{t,i}} \end{aligned}$$
(3)

where the probability of the class is

$$\begin{aligned} P(c_k)=\displaystyle \frac{\displaystyle \sum _{i\in \vert D\vert }P(c_k\mid d_i)}{\vert D\vert } \end{aligned}$$
(4)

and the conditional probability is

$$\begin{aligned} P(w_t \mid c_k)=\displaystyle \frac{\displaystyle \sum _{i\in \vert D\vert }N_{t,i}\cdot P(c_k\mid d_i) + 1}{\displaystyle \sum _{t\in \vert V\vert }\sum _{i\in \vert D\vert }N_{t,i}\cdot P(c_k\mid d_i)+\vert V\vert } \end{aligned}$$
(5)

Here, \(N_{t,i}\) is the number of times feature \(w_t\) occurs in instance \(d_i\), \(\vert V\vert \) is the number of features, and \(\vert D\vert \) is the number of instances.
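As an illustration of Eqs. (3) to (5), the following Python sketch trains a multinomial naïve Bayes model on hard-labeled instances, i.e., the special case in which \(P(c_k\mid d_i)\) is 1 for the true class and 0 otherwise. The function names and the representation of an instance as a dictionary of feature counts are assumptions made for this sketch. The posterior is computed in log space, a standard way to avoid numerical underflow that the equations themselves do not prescribe.

```python
import math
from collections import defaultdict

def train_mnb(instances, labels, vocab):
    """Estimate the class priors (Eq. 4) and feature conditionals (Eq. 5).

    instances: list of dicts mapping a feature to its count N_{t,i}.
    labels: list of class labels (hard labels, so P(c_k | d_i) is 0 or 1).
    """
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: defaultdict(float) for c in classes}
    for x, y in zip(instances, labels):
        for w, n in x.items():
            counts[y][w] += n
    cond = {}
    for c in classes:
        total = sum(counts[c][w] for w in vocab)
        # Add-one smoothing: +1 in the numerator, +|V| in the denominator.
        cond[c] = {w: (counts[c][w] + 1) / (total + len(vocab)) for w in vocab}
    return prior, cond

def posterior(x, prior, cond):
    """Eq. (3), evaluated in log space and normalized to sum to 1."""
    log_p = {}
    for c in prior:
        log_p[c] = math.log(prior[c]) + sum(
            n * math.log(cond[c][w]) for w, n in x.items() if w in cond[c]
        )
    m = max(log_p.values())
    z = sum(math.exp(v - m) for v in log_p.values())
    return {c: math.exp(v - m) / z for c, v in log_p.items()}

prior, cond = train_mnb([{"LL": 2}, {"RS": 2}], ["a", "b"], ["LL", "RS"])
post = posterior({"LL": 1}, prior, cond)
print(post)
```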

3.3 Adapted Naïve Bayes Classifier for Biological Sequences

One limitation of the MNB classifier is that it can only be trained on one domain, and when the trained classifier is used on a different domain, in most cases, its classification accuracy decreases. To address this, we used the Adapted Naïve Bayes (ANB) classifier proposed by Tan et al. [27], with two modifications: we used the labeled data from the target domain, and employed the self-training technique. These will be described in more detail shortly.

The ANB algorithm is a combination of the expectation-maximization (EM) algorithm and a weighted multinomial Naïve Bayes algorithm. Similar to the EM algorithm, it has two steps that are iterated until convergence. In the first step, the \(M\)-step, we simultaneously calculate the class probability and the class conditional probability of a feature. However, unlike the EM algorithm that uses the data from one domain to calculate these values, this algorithm uses a weighted combination of the data from the source domain and the target domain.

$$\begin{aligned} P(c_k)=\displaystyle \frac{(1-\lambda )\displaystyle \sum _{i\in D_s}P(c_k\mid d_i)+\lambda \sum _{i\in D_t}P(c_k\mid d_i)}{(1-\lambda )\vert D_s\vert +\lambda \vert D_t\vert } \end{aligned}$$
(6)
$$\begin{aligned} P(w_t\mid c_k)=\displaystyle \frac{(1-\lambda )(\eta _t N_{t,k}^s)+\lambda N_{t,k}^t+1}{(1-\lambda )\displaystyle \sum _{t\in \vert V\vert }\eta _t N_{t,k}^s+\lambda \displaystyle \sum _{t\in \vert V\vert }N_{t,k}^t+1} \end{aligned}$$
(7)

where \(N_{t,k}\) is the number of times feature \(w_t\) occurs in a domain in instances labeled with class \(k\):

$$\begin{aligned} N_{t,k}=\displaystyle \sum _{i\in D}N_{t,i}P(c_k\mid d_i) \end{aligned}$$
(8)

\(\lambda \) is the weight factor between the source and target domains:

$$\begin{aligned} \lambda =\min \{\delta \cdot \tau ,1\} \end{aligned}$$
(9)

and \(\tau \) is the iteration number. \(\delta \in (0,1)\) is a constant that determines how fast the weight shifts from the source domain to the target domain, and \(\eta _t\) is 1 if feature \(t\) in the source domain is a generalizable feature, 0 otherwise.
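The weighted \(M\)-step in Eqs. (6) to (9) can be sketched in Python as follows. This is a hedged illustration rather than the code used in the experiments: the function names, the dictionary-based data structures, and the representation of the generalizable features as a set are all assumptions. The conditional probability follows Eq. (7) as printed, including the constant 1 in the denominator.

```python
def target_weight(delta, tau):
    """Eq. (9): weight on the target domain at iteration tau."""
    return min(delta * tau, 1.0)

def class_prior(post_s, post_t, k, lam):
    """Eq. (6): post_s / post_t are lists of dicts P(c | d_i) for source / target."""
    num = (1 - lam) * sum(p[k] for p in post_s) + lam * sum(p[k] for p in post_t)
    den = (1 - lam) * len(post_s) + lam * len(post_t)
    return num / den

def soft_counts(instances, post, k):
    """Eq. (8): N_{t,k} = sum over instances of N_{t,i} * P(c_k | d_i)."""
    out = {}
    for x, p in zip(instances, post):
        for w, n in x.items():
            out[w] = out.get(w, 0.0) + n * p[k]
    return out

def feature_conditional(n_s, n_t, vocab, generalizable, lam):
    """Eq. (7) as printed; eta_t is 1 for features in `generalizable`, else 0."""
    eta = {w: 1.0 if w in generalizable else 0.0 for w in vocab}
    den = (
        (1 - lam) * sum(eta[w] * n_s.get(w, 0.0) for w in vocab)
        + lam * sum(n_t.get(w, 0.0) for w in vocab)
        + 1
    )
    return {
        w: ((1 - lam) * eta[w] * n_s.get(w, 0.0) + lam * n_t.get(w, 0.0) + 1) / den
        for w in vocab
    }

lam = target_weight(0.25, 2)  # 0.5 at the second iteration when delta = 0.25
p_a = class_prior([{"a": 1, "b": 0}, {"a": 0, "b": 1}], [{"a": 1, "b": 0}], "a", lam)
n_s = soft_counts([{"x": 4, "y": 2}], [{"a": 1.0}], "a")
cond = feature_conditional(n_s, {"x": 1.0}, ["x", "y"], {"x"}, lam)
print(lam, p_a, cond)
```

In the toy call, feature "y" is treated as domain specific, so its source counts are masked out by \(\eta _t = 0\) and only the smoothing term remains in its numerator.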

Unlike the algorithm proposed by Tan et al. [27], which considers that all the instances from the target domain are unlabeled and does not use them during the first iteration (i.e., \(\lambda = 0\)), it is reasonable to assume that there is a small number of labeled instances in the target domain, and our algorithm uses any labeled data from the target domain in the first and subsequent iterations. In the first iteration we use only labeled instances from the source and target domains to calculate the probability distributions for the class conditional probabilities given the instance. In subsequent iterations we use the class of the instance for the labeled data from the source and target domains and the probability distribution of the class for the unlabeled data from the target domain.

In the second step, the \(E\)-step, we estimate the probability of the class for each instance with the values obtained from the \(M\)-step.

$$\begin{aligned} P(c_k\mid d_i)\propto P(c_k)\prod _{t\in \vert V\vert }\left[ P(w_t\mid c_k)\right] ^{\displaystyle N_{t,i}} \end{aligned}$$
(10)

The second modification we made to the ANB classifier [27] is the use of self-training: at each iteration we select, in proportion to the class distribution, the instances with the highest class probability, and treat them as labeled in subsequent iterations. This improves the prediction accuracy of our classifier because it does not allow the unlabeled data to alter the class distribution obtained from the target labeled data.
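One possible reading of this self-training step is sketched below in Python. The text does not state how many instances are promoted per iteration, so the parameter `n_select`, the rounding of the per-class quota, and the function name are assumptions made for illustration only.

```python
def select_confident(posteriors, class_dist, n_select):
    """Choose unlabeled instances to treat as labeled in later iterations.

    posteriors: list of dicts P(c | d_i) for the unlabeled target instances.
    class_dist: dict mapping a class to its fraction in the target labeled data.
    n_select: total number of instances to promote (hypothetical parameter).
    Returns a dict mapping an instance index to its assigned class.
    """
    chosen = {}
    for c, frac in class_dist.items():
        quota = int(round(frac * n_select))
        # Candidates whose most probable class is c, most confident first.
        candidates = [
            (p[c], i) for i, p in enumerate(posteriors) if max(p, key=p.get) == c
        ]
        for _, i in sorted(candidates, reverse=True)[:quota]:
            chosen[i] = c
    return chosen

posteriors = [
    {"a": 0.90, "b": 0.10},
    {"a": 0.60, "b": 0.40},
    {"a": 0.20, "b": 0.80},
    {"a": 0.45, "b": 0.55},
]
promoted = select_confident(posteriors, {"a": 0.5, "b": 0.5}, n_select=2)
print(promoted)
```

Because the quota per class follows the class distribution, the promoted instances cannot shift that distribution, which is the property the paragraph above relies on.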

The two steps, \(E\) and \(M\), are repeated until the conditional class probabilities in (10) converge (or a given number of iterations is reached). The algorithm is summarized in Algorithm 1.

3.4 Data Sets

We used three data sets to evaluate our classifier, two for the task of protein localization and one for the task of splice site prediction. The first data set, PSORTb v2.0 [8], was first introduced in [9], and contains proteins from gram-negative and gram-positive bacteria and their primary localization information: cytoplasm, inner membrane, periplasm, outer membrane, and extracellular space. For our experiments, we identified the classes that appear in both the gram-positive and the gram-negative sets, and used 480 proteins from gram-positive bacteria (194 from cytoplasm, 103 from inner membrane, and 183 from extracellular space) and 777 proteins from gram-negative bacteria (278 from cytoplasm, 309 from inner membrane, and 190 from extracellular space). The second data set, TargetP, was first introduced in [7], and contains plant and non-plant proteins and their subcellular localization: mitochondrial, chloroplast, secretory pathway, and “other.” From this data set we used 799 plant proteins (368 mitochondrial, 269 secretory pathway, and 162 “other”) and 2,738 non-plant proteins (371 mitochondrial, 715 secretory pathway, and 1,652 “other”). Predicting protein localization is an important biological problem because the function of a protein is related to its localization. The third data set, first introduced in [24], contains DNA sequences of 141 base pairs centered around the acceptor splice site dimer AG and a label indicating whether or not that AG dimer is a true splice site. The sequences are from five organisms: C. elegans as the source domain, and C. remanei, P. pacificus, D. melanogaster, and A. thaliana as target domains. In each dataset there are about 1 % positive instances. Accurately predicting splice sites is important for genome annotation [2, 21].

Algorithm 1.

3.5 Data Preparation and Experimental Setup

Protein Localization. We represent each sequence as a count of occurrences of k-mers. We use a sliding window approach to count the \(k\)-mer frequencies. For example, the protein sequence LLRSYRS would be transformed when using 2-mers into 1, 1, 2, 1, 1 which are the counts corresponding to the occurrences of features LL, LR, RS, SY, YR.
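The sliding-window count can be reproduced with a few lines of Python. The function name is hypothetical; the example uses the sequence and 2-mers from the text.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every k-mer seen by a window of width k moved one position at a time."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = kmer_counts("LLRSYRS", 2)
print(counts)
```

Here the 2-mer RS is counted twice because it occurs at positions 3 and 6 of the sequence.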

In order to obtain unbiased estimates for classifier performance we use five-fold cross validation. We use all labeled data from the source domain for training (tSL) and randomly split the target domain data into 3 sets: 20 % used as labeled data for training (tTL), 60 % used as unlabeled data for training (tTU), and 20 % used as test data (TTL), as shown in Fig. 2(a). So, we train our classifier on tSL + tTL + tTU and test it on TTL.
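A minimal Python sketch of such a split is given below. The text does not specify how the labeled and unlabeled portions are chosen within each cross-validation trial, so the scheme shown here (one fold for testing, the next fold as labeled training data, the remaining three folds as unlabeled training data) is an assumption, as are the function name and the fixed random seed.

```python
import random

def five_fold_splits(n_instances, seed=0):
    """Yield five 20/20/60 splits of the target domain instance indices."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::5] for f in range(5)]
    for f in range(5):
        labeled_fold = (f + 1) % 5
        yield {
            "TTL": folds[f],             # 20 % held out for testing
            "tTL": folds[labeled_fold],  # 20 % labeled training data
            "tTU": [                     # 60 % unlabeled training data
                i
                for g in range(5)
                if g not in (f, labeled_fold)
                for i in folds[g]
            ],
        }

splits = list(five_fold_splits(100))
print([(len(s["tTL"]), len(s["tTU"]), len(s["TTL"])) for s in splits])
```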

Fig. 2. Experimental setup. We use 3 datasets to train our algorithm: source domain labeled (tSL), target domain labeled (tTL), and target domain unlabeled (tTU); and 1 to test it: target domain labeled (TTL).

We wanted to answer several questions. Specifically, how does the performance of the classifier vary with:

Q1. The features used (i.e., 3-mers, 2-mers, or 1-mers)?

Q2. The number of features used in the target domain (i.e., keep all features, or remove up to 50 % of the least frequent features)?

Q3. The number of features retained in the source domain after selecting the generalizable features?

Q4. The size of the target labeled and unlabeled data sets (i.e., train on 100 % tSL + \(x\)% tTL + \(y\)% tTU, where \(x\in \{5, 10, 20\}\) and \(y\in \{20,40,60\}\))?

Q5. The distance between the source and target domains?

Q6. The choice of the source and target domains?

As baselines, we compared our classifier (ANB) with the multinomial naïve Bayes classifier trained on all source data (MNB s), the multinomial naïve Bayes classifier trained on 5 % target data (MNB 5t), and the multinomial naïve Bayes classifier trained on 80 % target data (MNB 80t). Each classifier was tested on 20 % of target data. The expectation is that the prediction accuracy of our classifier will be lower bounded by MNB 5t, upper bounded by MNB 80t, and be better than MNB s.

To evaluate our classifier we used the area under the receiver operating characteristic curve (auROC), as the class distributions are relatively balanced.

Splice Site Prediction. Similar to the protein localization task, we represent each sequence as a count of occurrences of 8-mers. We use a sliding window approach to count the 8-mer frequencies.

For the splice site prediction task there are 3 folds for each organism. From each fold we use the dataset with 100,000 instances from the source domain as labeled training data (tSL), the datasets with 2,500, 6,500, 16,000, and 40,000 instances from the target domain as labeled training data (tTL), the dataset with 100,000 instances from the target domain as unlabeled training data (tTU), and 20,000 instances from the target domain to test our algorithm (TTL), as shown in Fig. 2(b). We then averaged the results over the 3 folds to obtain unbiased estimates. Just as for protein localization, we train our classifier on tSL + tTL + tTU and test it on TTL.

As for the protein localization task, we wanted to know how the performance of the classifier varies with:

Q1. The number of features used in the target domain (i.e., keep all features, or remove up to 50 % of the least frequent features)?

Q2. The number of features retained in the source domain after selecting the generalizable features?

Q3. The size of the target labeled data set (i.e., train on 100,000 tSL + \(z\) tTL + 100,000 tTU, where \(z\in \{2,500, 6,500, 16,000, 40,000\}\))?

Q4. The distance between the source and target domains?

As baseline, we compared our classifier with the best overall algorithm in [24].

To evaluate our classifier we used the area under precision-recall curve (auPRC), a metric that is preferred over area under a receiver operating characteristic curve when the class distribution is skewed, which is the case with this dataset.
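To make the metric concrete, the Python sketch below computes average precision, one common estimator of the area under the precision-recall curve. Whether this matches the estimator used in the experiments is not stated in the text, so it should be read as an illustrative assumption; the function name is hypothetical and tied scores are not given special treatment.

```python
def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive instance.

    scores: predicted probability of the positive class for each instance.
    labels: 1 for a true splice site, 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(ap)
```

Unlike auROC, this value depends directly on how many negatives are ranked above each positive, which is why it is more informative when only about 1 % of the instances are positive.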

3.6 Results

Protein Localization. This section provides empirical evidence that augmenting the labeled data from a source domain with labeled and unlabeled data from the target domain with the ANB algorithm improves the classification accuracy compared to using only the limited labeled data from the target domain or using only the data from a source domain with the MNB classifier.

Table 1 shows the average auROC values over the five-fold cross validation trials for our algorithm and for the baseline algorithms. For our algorithm, we used different amounts of labeled and unlabeled data from the target domain. For example, the top-left value is the auROC for our algorithm trained on 5 % labeled data and 20 % unlabeled data. For each dataset and the features used the largest auROC value for the ANB is highlighted.

Table 1. A comparison, on the protein localization task, between the Adapted Naïve Bayes classifier (ANB), the multinomial naïve Bayes classifier trained on all source data (MNB s), the multinomial naïve Bayes classifier trained on 5 % target data (MNB 5t), and the multinomial naïve Bayes classifier trained on 80 % target data (MNB 80t). The results are reported as average auROC values over five-fold cross validation trials. For the ANB classifier, the row headings indicate how much target unlabeled data was used in training the classifier and the column headings indicate how much target labeled data was used. The best values for the ANB are highlighted. Note that ANB is bounded by MNB 5t and MNB 80t, and that ANB predicts more accurately as the length of the k-mers increases.

We noted that the performance of the ANB classifier varies, as follows:

A1. The best results were obtained when using 3-mers as features. This makes sense, since longer \(k\)-mers capture more information associated with the relative position of each amino acid. When using 3-mers, our algorithm provides between 9.84 % and 34.14 % better classification accuracy than the multinomial naïve Bayes classifier trained on 5 % of the labeled data from the target domain, and between 0.37 % and 28.2 % better than the multinomial naïve Bayes classifier trained on labeled data from the source domain, except when the plant proteins are the target domain.

A2. When trying to establish how many features from the target domain should be used, we determined that removing any features does not improve the performance of our algorithm.

A3. When trying to ascertain how many features from the source domain should be kept after ranking them with Eq. 1, we determined that the best results were obtained when at least 50 % of the features were kept, i.e., the 50 % top-ranked features and any other features with the same rank as the last feature kept.

A4. In most cases, the largest auROC values for our algorithm were obtained when using the least amount of target unlabeled data. This suggests that even though using unlabeled data is beneficial, using too much of it is detrimental, because the unlabeled instances act as noise and corrupt the prediction from the target labeled data. In addition, intuitively, using more labeled data from the target domain should lead to better prediction accuracy. This was indeed the case with our classifier.

A5. The closer the target domain is to the source domain, the better the learned classifier. For example, the auROC is higher for the PSORTb datasets than for the TargetP datasets.

A6. For the PSORTb dataset, the ANB classifier had better prediction accuracy when the gram-negative proteins were used as the source domain than when the gram-positive proteins were used as the source domain. Similarly, for the TargetP dataset, we obtained better predictions when using non-plant proteins as the source domain than when using plant proteins as the source domain. This is because in both cases the better source domain had more instances: there were more gram-negative than gram-positive proteins, and more non-plant than plant proteins.
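The selection rule described in A3 (keep the top-ranked 50 % of the source features plus any features tied with the last one kept) could be sketched in Python as follows. The function name, the rounding up of the cutoff position, and the dictionary of scores are assumptions; the scores are assumed to come from Eq. (1).

```python
import math

def keep_top_fraction(scores, fraction=0.5):
    """Return the features ranked in the top `fraction`, extended to include ties."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return set()
    n_keep = max(1, math.ceil(fraction * len(ranked)))
    cutoff = ranked[n_keep - 1][1]  # score of the last feature kept
    return {w for w, s in ranked if s >= cutoff}

kept = keep_top_fraction({"AA": 3.0, "AC": 2.0, "GA": 2.0, "GG": 1.0})
print(sorted(kept))
```

In this example three of the four features are kept, because "GA" has the same score as "AC", the last feature inside the top 50 %.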

Table 2. auPRC values for the 4 target organisms based on the number of labeled target instances used for training: 2,500, 6,500, 16,000, and 40,000. For comparison with our algorithm, we show the values for the best overall algorithm in [24], \(\mathrm{SVM}_{S,T}\).

It is interesting to note that in some instances the multinomial naïve Bayes classifier trained on the source domain performed better than our algorithm. This occurred mainly when our algorithm used 5 % or 10 % of the target labeled data and when the features were 1-mers or 2-mers. However, this is somewhat expected, as using very little labeled data from the target domain does not provide a representative sample for the population, and using short k-mers does not capture the relative position of the amino-acids.

Splice Site Prediction. Although our algorithm works well for the protein localization task, the results for the third dataset, on the splice site prediction task, were very poor, as shown in Table 2. Our algorithm always gravitated towards classifying each instance as not containing a splice site. We believe that this is mainly because the 8-mers indicating a splice site occur with low frequency and because their position relative to the splice site is important. We will discuss in Sect. 4 how we propose to address this issue in future work.

We noted that the performance of the ANB classifier varies, as follows:

A1. Similar to the protein localization task, removing any features does not improve the performance of our algorithm.

A2. In terms of the number of features from the source domain to keep after ranking them with Eq. 1, we determined that the best results were obtained when at least 50 % of the features were kept, i.e., the 50 % top-ranked features and any other features with the same rank as the last feature kept.

A3. The auPRC values for our algorithm were very similar regardless of the amount of target labeled data.

A4. Contrary to what we would have expected, the classification performance of our algorithm did not decrease as the distance between the source and target domains increased.

The last two observations lead us to believe that our features need to take into consideration the locations of the 8-mers to improve the classification accuracy of our classifier on the splice site prediction task.

4 Conclusions and Future Work

In this paper, we proposed a domain adaptation classifier for biological sequences. This algorithm showed promising classification performance in our experiments. Our analysis indicates that the closer the target domain is to the source domain, the better the learned classifier. Other conclusions drawn from our observations are: using 2-mers or 3-mers results in better prediction, with small differences between them; removing features from the target domain reduces the accuracy of the classifier; having more target labeled data increases the accuracy of the classifier; and adding too much target unlabeled data decreases the accuracy of the classifier.

In future work we would like to investigate how assigning different weights to the data used for training would influence the prediction accuracy of the algorithm. We would like to assign a higher weight to the labeled data from the target domain, since it is more likely to correctly predict the class of the target test data than the labeled data from the source domain or the unlabeled data from the target domain.

We would also like to evaluate other methods for selecting the generalizable features. For example, we would like to investigate if selecting generalizable features using the mutual information of the features instead of their probabilities, in Eq. (1), leads to better classification accuracy.

Another aspect we would like to improve is the accuracy of this classifier on the splice site dataset, to reach accuracy similar to that of state-of-the-art splice site classifiers, e.g., SVM classifiers. We would like to reduce the number of motifs with different clustering strategies, and identify more discriminative motifs using Gibbs sampling or MEME. In addition, we would like to run experiments on smaller splice site datasets to better understand the characteristics of this problem.