1 Introduction

In our previous work (Khan and Kamal 2013a, b), we used a hidden Markov model, a neural network (NN) and a suffix tree (ST). The outcomes were good, but the approaches required considerable space and time. We also examined DNA damage for ontological alignment (Khan and Kamal 2013a, b). In 1975, Frederick Sanger and his colleagues, and Alan Maxam and Walter Gilbert, developed two different methods for determining the exact base sequence of a cloned piece of DNA (Robert 2002). These spectacular breakthroughs revolutionized molecular biology: they have allowed molecular biologists to determine the sequences of thousands of genes, and they have even made it possible to sequence all three billion base pairs of the human genome (Robert 2002). Sequencing has had a very positive impact on the functional identification of newly sequenced genes and on building reshaped DNA sequences (Doolittle 1996).

Local sequence alignment locates the best approximate subsequence match between two given sequences. In biology, the sequences may be DNA, RNA or protein sequences. Besides LSAs, gapless alignment tools such as BLAST (Altschul et al. 1990) or FASTA (Lipman and Pearson 1985, 1988) have been examined (Arratia et al. 1988; Dembo and Karlin 1990; Karlin and Altschul 1990, 1993). We have noticed that gapped alignments also play an important role in deciding the proper match for exact sequence matching. We have examined gapped alignments such as the Needleman–Wunsch (Needleman and Wunsch 1970) and Smith–Waterman (Smith and Waterman 1981; Waterman 1989, 1994) algorithms. Dynamic programming is also used in these alignments (Batzoglou et al. 2000a, b).

Generally, nucleotides are arranged along the two strands of the double helix, called the forward and reverse strands, with the nucleotides of one strand bonding with complementary nucleotides on the other. Adenine nucleotides bond only with thymine, and cytosine nucleotides bond only with guanine; a bonded pair of nucleotides is called a base pair (bp). Sequencing operates on the four bases of the double-stranded DNA structure. Suppose \(P = a_{1} ,a_{2} \cdots a_{m}\) and \(Q = b_{1} ,b_{2} \cdots b_{n} \left( {n \le m} \right)\) are two strings over an alphabet ∑. The general case of alignment is to match the similarities of these two strings under the name of local, global or semi-local alignment.

2 Related work

Among the various algorithms implemented between 2001 and 2005, Ning et al. (2001), Kent (2002), Furey et al. (2002), Schwarz et al. (2003) and Watanabe et al. (2005) examined the accuracy of sequencing. In the age of the information superhighway, we know that genome sequences may contain continual or near-continual patterns in the given or collected data sets. As a result, the outcomes may be the same for many positions. On the contrary, mutations and indels may produce incorrect judgments for sequencing and mapping, whether local, global or pairwise. The algorithms above do not consider repetitions of patterns, mutations and incorrect mapping. In such situations, the system rejects data sets and narrows the search area, and as a consequence the calculations become complicated as well as wrong. Ewing and Green (1998) proposed a solution to overcome the ambiguity, but this method produces low-quality regions. Here, we implement the Markov chain concept under the Chapman–Kolmogorov equations to establish probable sequenced positions. In the field of sequencing, a number of software systems help to detect sequences; the most frequently used are PolyPhred (Stephens et al. 2006), SNP detector (Zhang et al. 2005) and novoSNP (Weckx et al. 2005). However, these three systems only detect genotype samples; they are unable to handle dynamic patterns and sequences. Our system covers these problems under the bounding box scheme. The software used for sequencing is broadly classified into two categories: training based and detection based. The detection-based category depends on homolog finding, which in turn depends on database searching (Claverie et al. 1997). However, database searching is not always efficient due to the size and cost of the equipment.
Training-based sequencing is unable to identify the proper coding regions due to a lack of generalization (Pati et al. 2010). Besides, the predictions are vulnerable to many false-positive identifications and low sensitivity (Yok and Rosen 2011). Nowadays, many researchers have focused on the efficiency and correctness of predictions. The latest methods are designed on the basis of advanced annotations or predictions in alignment and matching, by removing false predictions (van Baren and Brent 2006) or generating new genome sequences. However, there is some scope for uncertainty in the generated sequences, which may be convergent, bounded or monotonic.

Frameshift identification (Tech and Meinicke 2006) is an important part of sequencing, whether for cross-checking genes or adjusting the translational initiation site (Stormo et al. 1982; Zhu et al. 2004, 2007). According to Wu et al. (2006), phylogenetic fingerprint DNA sequencing has improved, but the convergent features are not checked. In the related field of multiple sequencing of proteins (Keller et al. 2011), there are no guarantees of finiteness checking. Besides, some established gene identifiers such as ORPHEUS (Fleischmann et al. 1995), AUGUSTUS (Keller et al. 2011), SLAM (Alexandersson et al. 2003) and GenePrimp (Stormo et al. 1982) have strictly mapped genes. To achieve efficiency as well as reduce sensitivity, we have designed a complete package that coordinates the stages by maintaining the linkage and arrangement of sequences. Due to a lack of linkage between methods, sequencing does not always yield the proper actions. In ORPHEUS, we noticed that the system does not take invariant information into account at the matching stage. It has also been suggested (Dhar et al. 2009) that unknown motifs may be difficult to identify in intergenic regions. We checked the alignments with the OHSUMED data set, a very informative data house for medical document classification and identification. The OHSUMED data set, created for the TREC conference, has been an evaluation benchmark in automatic information classification research since 1994 (Yetisgen-Yildiz and Pratt 2005). OHSUMED is composed of 348,566 MEDLINE documents from 270 journals published between 1987 and 1991. The documents are categorized under 14,321 MeSH classes. We selected OHSUMED for its small storage and manipulation footprint; for example, 753 classes in OHSUMED contain only 1 document.
The research results and implementations of Ruiz and Srinivasan (1999) and Lewis (1992, 1996) used 49 categories related to heart diseases with at least five training documents. Data sets relating DNA and diseases are commonly used in DNA classification as well as for the identification of specific patterns from a given DNA set or DNA collection.

3 LSA and integration

Local sequence alignment is a process that matches a sequence against specific parts of the total given DNA sequences. Local alignment may be based on pairwise comparison of sequences of any length. Local alignment is very important in DNA testing for medical science, law and justice, criminal identification, baby identification and, last but not least, disease identification. Local alignment works by finding small polynucleotides (e.g., UAA, UCG, ATG) or any pattern that is responsible for identifying the desired identity. Figure 1 shows a brief scenario of local sequencing.

Fig. 1
figure 1

The process of LSAs. Perfect matches, mismatches, and insertions and deletions (indels) are marked clearly using various colors. These three measurements are the fundamental basis for local alignment sequencing

From Fig. 1, we see that the left part of the figure indicates the LSA of two short sequences (T G C G in the first row and A G C G in the second row). In this case the local sequence length is four, though it may be any length. Pairwise alignments are frequently used in repeat match, unique match, contiguous seed, spaced seed, vector seed and maximum match subsequence (MMSS) (Waqaar et al. 2008) approaches, a collection of bases that is also part of the LSA. Recently, sequencing has been performed on the basis of seed selection followed by extending the sequence according to its demand (Yu et al. 2007). The right part of Fig. 1 shows the operations of the LSA, where an indel is recorded when there is any deletion or insertion of nucleotides. The groups of nucleotides representing perfect matches are the seeds of the LSA. Seed finding is accomplished in two phases: in the first phase, the algorithm finds the seeds as perfect matches, and in the next phase the system extends the seeds toward the MMSS. The BLAST algorithm works on fixed-length alignments, whereas our algorithm works with sequences of any variable length. We also compare the alignments of our system with MUMmer and MUMmerGPU. We have noticed that RSAM is significantly faster and moderately more sensitive than BLAST, MUMmer and MUMmerGPU. Schatz et al. (2007) and Delcher et al. (2002) showed that MUMmer and MUMmerGPU produce the same outcome, where the sensitivity is very high and the complexity is O(n²) when the sequences vary in length.
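The two-phase seed process described above can be sketched as follows. This is an illustrative sketch rather than the RSAM implementation; the function names and the k-mer seed length are our own choices.

```python
def find_seeds(p, q, k=4):
    """Phase 1: collect exact k-mer matches (seeds) between p and q."""
    index = {}
    for i in range(len(p) - k + 1):
        index.setdefault(p[i:i + k], []).append(i)
    seeds = []
    for j in range(len(q) - k + 1):
        for i in index.get(q[j:j + k], []):
            seeds.append((i, j))  # seed starts at p[i] and q[j]
    return seeds

def extend_seed(p, q, i, j, k=4):
    """Phase 2: extend a seed in both directions while nucleotides match."""
    left = 0
    while i - left - 1 >= 0 and j - left - 1 >= 0 and \
            p[i - left - 1] == q[j - left - 1]:
        left += 1
    right = 0
    while i + k + right < len(p) and j + k + right < len(q) and \
            p[i + k + right] == q[j + k + right]:
        right += 1
    return p[i - left:i + k + right]  # the extended match in p
```

For example, `find_seeds("TTATGCGAA", "CCATGCGTT")` reports the shared seeds ATGC and TGCG, and extending the first one yields the longer match ATGCG.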

Suppose s(p,n) indicates the number of alignments of two DNA sequences of length n with seed size b. Basically, s(p,n) counts the (0,1) matrices with at least two rows (the given DNA sequences) and an unknown number of columns, such that each sequence contains precisely n 1s and each column contains at least one 1. The asymptotic expression of s(p,n) for a specified b as n → ∞ is a function of b:

$$s\,\left( {p,\,n} \right)\, \ge \binom{2n}{n}.$$
(1)

According to Stirling’s formula, as n → ∞ with the seed b fixed,

$$s\,\left( {p,\,n} \right) \ge \left( {\left( {\pi n} \right)^{ - 0.5} } \right)\,\left( {4^{n} + o\left( 1 \right)} \right)\,\;{\text{as}}\;\,n \to \infty .$$
(2)

For s(1,n), there is only a 2-sequence alignment. The sequencing can be described by a generating function when b ≥ 1:

$$T\,\left( x \right) = \left( {1 - x} \right)^{2} - 4x\left( {x^{b} - x + 1} \right)^{2} .$$
(3)

From Eq. (3) it is possible to determine the alignment possibilities under finite seeds or blocks.
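The bounds in Eqs. (1) and (2) can be checked numerically. The following sketch is our own illustration, not part of the original analysis; it compares the central binomial coefficient from Eq. (1) with its Stirling-type approximation from Eq. (2).

```python
import math

def central_binomial(n):
    # C(2n, n), the lower bound on s(p, n) in Eq. (1)
    return math.comb(2 * n, n)

def stirling_estimate(n):
    # (pi * n)^(-1/2) * 4^n, the asymptotic form in Eq. (2)
    return (math.pi * n) ** -0.5 * 4 ** n
```

Already at n = 50 the ratio between the exact coefficient and the estimate is within about 0.3% of 1, consistent with the o(1) correction term.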

3.1 Data reducing process

For LSA, it is very important to have a specified length of DNA, RNA or protein. In the baseline paper (Waqaar et al. 2008), the authors implemented two distinct steps: MMSS selection and MMSS anchor selection. MMSS selection by suffix tree is a costly approach due to the complexity of searching for matches at any specified length of DNA. The complexity of the suffix tree in this setting is O(n + m): O(m) time is needed to build the suffix tree and O(n) time to search for the patterns. After that, selecting the anchor length requires O(n) time. So the total complexity is O(n + m) × O(n), i.e., the total time complexity of the first two steps is O(n² + mn).

We implement these two steps using the data reducing process (DRP) for length selection on the given DNA, where the length of a genome is n and the number of genomes is t. The total size of the bounding box is n × t. The box is repeatedly divided until it reaches the unique portions of the nucleotides; here, unique means a single nucleotide. Whether the length of a genome is odd or even is an important consideration. The DRP(n, t) algorithm below describes the process in detail, and Fig. 2 shows its operation.

Fig. 2
figure 2

DRP with length n = 40. The middle part shows the first division of the selected DNA genomes, and the last part shows the second division of the same DNA with the same length and genomes toward unique nucleotides. This process leads to unique nucleotides for LSAs. To our knowledge, this DRP is used for the first time in bioinformatics LSAs and is a novel idea for sequence length selection

Data reducing process (n, t)

1. Select the length n and the dimension t, and arrange the genome sequences in an n × t array.

2. Subdivide the array n × t into (n/2 × t/2) in the first phase.

3. Continue step 2 until the subdivision reaches a single nucleotide:

$$\left( {n \times t} \right) \cdots \left( {n/k \times t/k} \right)\, \cdots \,\left( {n/2k \times t/2k} \right) \cdots \,\left( {n/4k \times t/4k} \right) \cdots \,\left( {n/nk \times t/nk} \right).$$

4. Let (n = 2n + 1), i.e., the length is odd.

5. n ← (n − 1).

6. Store the (2n + 1)th position nucleotide; repeat steps 2 and 3.

7. If ((n/nk × t/nk) == 1),

8. stop the subdivisions.

9. Otherwise,

10. repeat steps 2, 3, 4 and 5.

11. Locate the matches, mismatches and indels, as shown in Fig. 1.

The notation DRP(S) denotes the DRP applied to a set of nucleotides S. Here \({\text{DRP}} = \left( {a_{1} , \, a_{2} , \, a_{3} \cdots a_{n} } \right)\), where the \(a_{i}\) are the various nucleotide elements.
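A minimal sketch of the halving step of the DRP follows, assuming each genome is processed as a string and odd lengths leave the extra nucleotide in the right half; the function name `drp` is ours.

```python
def drp(seq):
    """Repeatedly halve a genome sequence until every piece is a
    single nucleotide (the 'unique' portions of the DRP)."""
    if len(seq) <= 1:
        return [seq]
    mid = len(seq) // 2  # odd lengths put the extra nucleotide on the right
    return drp(seq[:mid]) + drp(seq[mid:])
```

Applied to each of the t genomes in the n × t box, the subdivision terminates with single nucleotides regardless of whether n is odd or even.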

4 Integration environments

In this algorithm we check the alignment by considering individual nucleotides. After taking the sequences as input, we perform dynamic programming and inverse dynamic programming to check the alignments with set operations, and we record the modifications as matches, mismatches and indels (Fig. 3).

Fig. 3
figure 3

In this integrated algorithm, the figure shows the relationship among various machine learning approaches. The distinguishing feature of this algorithm is that it contains dynamic programming and set operations along with a Markov chain. The process starts by taking a DNA sequence as input. After taking the sequences, dynamic programming is applied to the respective sequences; this step helps to classify the sequences. In the third step, the set operation makes the distinction between the DNA segments. The next step is a very important part of this algorithm: the selection of matches, mismatches and indels

Integration Algorithm (MNU, NMNU)

Here,

MNU = matched nucleotide unit

NMNU = non-matched nucleotide unit

1. At first, input the genome sequences and identify the MNU and NMNU from the DRP algorithm.

2. Implement dynamic programming and inverse dynamic programming to identify pairwise alignments.

3. Use the set operations on the selected part of the alignments.

4. Modify the matches, mismatches and indels.

5. Finally, train the sequencing to identify the MNU and NMNU.

6. If (NNLN == NMNU),

7. repeat step 5.

8. Otherwise,

9. repeat steps 1 to 5.

10. End.

4.1 Dynamic and inverse dynamic programming

We have implemented dynamic programming to choose the MMSS. We also apply inverse dynamic programming to verify the alignments that do not match.

Suppose

\(P = p_{1} , \, p_{2} \cdots p_{n} ,{\text{ and }}Q = q_{1} , \, q_{2} , \ldots q_{m} ,\) are two DNA sequences. The indel weight of the two sequences is denoted M(g). At first, we consider the best alignment as \(F\left( {p,\,q} \right) = \hbox{max} \,\forall \left( {p^{*},\,q^{*}} \right)\), where F is a function that relates the current alignment to the new alignment. Using dynamic programming, we can compute F(p, q) recursively:

$$F\,\left( {i,\,j} \right)\, = F\,\left( {p_{1} , \, p_{2} \cdots p_{i} ;\;q_{1} , \, q_{2} ,\, \ldots q_{j} } \right),$$

where \(F\left( {0,\,0} \right) = 0,\;F\left( {0,\,j} \right) = F\left( { - ,q_{1} ,q_{2} , \ldots q_{j} } \right) = M\left( j \right)\) and \(F\left( {i,\,0} \right) = M\left( i \right)\). Then,

\(F\left( {i,\,j} \right) = \hbox{max} \left\{ {F\left( {i - 1,\,j - 1} \right) + F\left( {p_{i} ,\,q_{j} } \right),\;\mathop {\hbox{max} }\limits_{k} \left\{ {F\left( {i - k,\,j} \right) + M\left( k \right)} \right\},\;\mathop {\hbox{max} }\limits_{l} \left\{ {F\left( {i,\,j - l} \right) + M\left( l \right)} \right\}} \right\}\). For LSA the function \(L\left( {p,\,q} \right) = \hbox{max} \left\{ {F\left( {p_{u} ,p_{u + 1} , \ldots p_{v} ;\;q_{x} ,q_{x + 1} , \ldots q_{y} } \right):1 \le u \le v \le n,\;1 \le x \le y \le m} \right\}\). The dynamic programming operation to select the maximum match subsequence for the two DNA sequences P and Q is as follows.

Maximum matches subsequences (Sequence P, Sequence Q)
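A minimal Smith–Waterman-style scoring sketch of the recurrence above is given below, assuming a linear gap weight M(k) = −k and unit match/mismatch scores; these parameter choices are ours, not from the text.

```python
def local_align_score(p, q, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style local alignment score; a linear gap
    penalty stands in for the weight M(g) of the recurrence."""
    n, m = len(p), len(q)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,  # match / mismatch
                          F[i - 1][j] + gap,    # deletion (indel)
                          F[i][j - 1] + gap)    # insertion (indel)
            best = max(best, F[i][j])
    return best
```

On the Fig. 1 example, `local_align_score("TGCG", "AGCG")` returns 3, the score of the common local segment GCG, because the max-with-zero step lets the alignment restart after the initial mismatch.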

Besides the alignment with DP for the MMSS, we also check the complement of the MMSS by inverse dynamic programming (IDP). The inversion verifies the non-matching nucleotides only. Inversion is considered for both matches and mismatches, and the inversions must be the complement of the regular pairwise alignments:

$$\mathrm{INVERSION}\left( {c,\,d,\,i,\,j} \right) = F_{1} \left( {p_{c} ,\,p_{c + 1} ,\,p_{c + 2} , \ldots p_{i} ,\,q_{j} ,\,q_{j - 1} , \ldots q_{h} } \right).$$

Here,

$$\begin{gathered} {\text{INVERSE}}\left( {\text{A}} \right) = {\text{T}} \hfill \\ {\text{INVERSE}}\left( {\text{C}} \right) = {\text{G}} \hfill \\ {\text{INVERSE}}\left( {\text{G}} \right) = {\text{C}} \hfill \\ {\text{INVERSE}}\left( {\text{T}} \right) = {\text{A}}. \hfill \\ \end{gathered}$$

In the case of regular and authentic sequencing, the genome parts \(p_{c} ,\,p_{c + 1} , \ldots p_{i}\) and \(q_{j} ,\,q_{j - 1} , \ldots q_{h}\) must match after inversion.
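The inversion table above can be captured directly in code; the following is a small sketch with names of our own choosing.

```python
# nucleotide-wise inversion (complement) table, as defined above
INVERSE = {"A": "T", "C": "G", "G": "C", "T": "A"}

def invert(seq):
    """Apply the INVERSE mapping to every base of a sequence."""
    return "".join(INVERSE[base] for base in seq)
```

Because the table is its own inverse, applying `invert` twice returns the original sequence, which is the consistency property the IDP check relies on.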

4.2 Markov chain for mismatch selection

From the baseline work (Waqaar et al. 2008), we have examined the total process of mismatch seed and mismatch anchor selection, which costs O(n² + mn) time and O(mn(n² + m²)) space. We measure mismatches using the Chapman–Kolmogorov formula. A stochastic process with random variable X is \(X = \{ X_{t} ,t \in T\}\), where t is an index indicating time, X_t is the state of the process at time t, and T is the index set of times t.

Suppose \(n = 0, \, 1, \, 2,3 \ldots ,\;m = 1, \, 2, \, 3 \ldots \;{\text{and}}\;i_{0} \ldots i_{m} \in E,\) where E is the set of all possible values that the random variable X_t can assume. Then,

$$\begin{aligned} \Pr \left\{ {X_{n + 1} = i_{1} , \ldots ,X_{n + m} = i_{m} \left| {X_{n} = i_{0} } \right.} \right\} & = \Pr_{{i_{0} i_{1} }} \cdot \Pr_{{i_{1} i_{2} }} \cdots \Pr_{{i_{m - 1} i_{m} }} . \\ \Pr \left\{ {X_{n + m} = j\left| {X_{n} = i} \right.} \right\} & = \sum\limits_{k = 0}^{\infty } {\Pr \left\{ {X_{n + m} = j\left| {X_{n + 1} = k,X_{n} = i} \right.} \right\}\Pr \left\{ {X_{n + 1} = k\left| {X_{n} = i} \right.} \right\}} \\ & = \sum\limits_{k = 0}^{\infty } {\Pr \left\{ {X_{n + m} = j\left| {X_{n + 1} = k} \right.} \right\}\Pr \left\{ {X_{n + 1} = k\left| {X_{n} = i} \right.} \right\}} \\ & = \sum\limits_{k = 0}^{\infty } {\Pr_{ik} \,\Pr_{kj}^{m - 1} } \\ & = \Pr_{ij}^{m} . \\ \end{aligned}$$

In general,

$$\Pr_{ij}^{n + m} = \sum\limits_{k \in E} {\Pr_{ik}^{n} \Pr_{kj}^{m} \,{\text{for}}\,{\text{all}}\,n,\,m \ge 0,\,\;{\text{all}}\;\,i,\,j \in E.}$$
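The Chapman–Kolmogorov identity can be verified numerically with matrix powers; the two-state transition matrix below is an arbitrary example of ours, not taken from the data set.

```python
def mat_mul(A, B):
    """Matrix product: (A.B)[i][j] = sum_k A[i][k] * B[k][j]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# hypothetical 2-state transition matrix (e.g., match / mismatch states)
P = [[0.9, 0.1],
     [0.2, 0.8]]

P2 = mat_mul(P, P)   # two-step probabilities Pr^2
P3 = mat_mul(P2, P)  # three-step probabilities; Pr^{2+1} = Pr^2 . Pr^1
```

Grouping the steps the other way, Pr^1 · Pr^2, yields the same three-step matrix, which is exactly the n + m decomposition stated above.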

5 Implementation

Before starting the experiments, we narrowed the data set by sampling. Based on the nature of the sequences, we verified the data set by stratified sampling, because it selects subgroups from a defined population in a way that represents the whole population. To accomplish this sampling, we first define the population and determine the sample size. Then the desired variables are selected so that the sample is appropriate; here, the local variables are the nucleotides and polynucleotides. Finally, exact classification determines the local sequences. Compared with the baseline paper (Waqaar et al. 2008), we checked the sensitivity together with the specificity of the sequencing, based on polynucleotide segments as sequence key factors, under receiver operating characteristics (ROC).
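The stratified sampling step can be sketched as follows, assuming records are grouped by a caller-supplied key; the helper name and parameters are ours.

```python
import random

def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to per_stratum records from every stratum (subgroup)
    so each subgroup of the population is represented."""
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    strata = {}
    for record in records:
        strata.setdefault(key(record), []).append(record)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

For instance, stratifying reads by their leading base guarantees that every leading base present in the population appears in the sample.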

We implemented and experimented in a Java environment with the NetBeans Integrated Development Environment. The object-oriented implementation let us treat the nucleotides (A, C, T and G) as distinct objects. The whole process was executed in an object-oriented manner, instead of the procedural C implementation of Waqaar et al. (2008). Object orientation gives a machine-independent environment, and the machine-independent analysis, synthesis and anti-synthesis create a powerful algorithmic procedure. The steps of the algorithm are depicted in Fig. 3.

6 Experiments and result

In most cases, authors have considered one or two metrics to assess the speed and sensitivity of sequencing. Here we consider six dominant metrics to characterize the performance of our alignment algorithm: speed (X1), complexity (X2), space (X3), sensitivity (X4), accuracy (X5) and risk (X6). For speed, sensitivity and accuracy, we measure the reference value as best, average or low. We check complexity, risk, accuracy and space for the first time; many LSA tools measure sensitivity without any standard parameter. For MUMmer, Delcher et al. (2007) defined the parameter ‘q’ as the ratio between accurately aligned nucleotide pairs and the total number of nucleotides in the given sequence; total column score is another aspect of the MUMmer procedure. In AVID (Bray et al. 2003), the authors consider the alignment pairs whose score is greater than a predefined threshold. Instead of the methods above, we concentrate on set operations within a complete machine learning process on exons and introns. Intron measurement is also an essential part of the alignment, to maintain proper checking rather than checking only one parameter (exons). For speed, sensitivity and accuracy, the reference values are checked in a fuzzy manner: best (H), average (M) and low (L).

The measurements of all the parameters mentioned above are shown in the following tables (Table 1). We consider sequence lengths chosen randomly from 500,000 to 2,000,000. In this work we use the Nucleotide Database from the National Center for Biotechnology Information, the OHSUMED database and the ROSETTA (Batzoglou et al. 2000a, b) database, instead of only one database. It is clearly visible that the alignment varies from database to database. Table 2 depicts the speed comparisons of BLAST, the baseline paper (Waqaar et al. 2008) and our algorithm. The input sequence length is considered the weight for the comparisons in this system.

Table 1 The reference values for speed, sensitivity and accuracy
Table 2 The speed comparison of three algorithms: BLAST, baseline paper and this system

A very interesting and effective observation about RSAM is that as the size increases, the performance of RSAM increases; the NNs and the Markov process enable faster calculations in RSAM. Table 3 below compares the complexity of the three algorithms. As the size increases, the complexity of BLAST and the baseline paper's algorithm increases, but RSAM's remains the same even as the sequence sizes increase.

Table 3 Complexity comparison of BLAST, baseline paper and this system

From the figure above it is clearly visible that the high–low lines among BLAST, the baseline paper and RSAM indicate the differences and the faster response of the RSAM algorithm compared with the other two. Receiver operating characteristic analysis of the data set according to CPU utilization is shown in Figs. 4, 5 and 6.

Fig. 4
figure 4

CPU utilization for sequence lengths ranging from 500,000 to two million nucleotides, with exons and introns in the sequences. We clearly notice that RSAM performs better irrespective of length and, interestingly, performs better still as the lengths of the sequences increase

Fig. 5
figure 5

ROCs based on checking the linearity of the speed of RSAM, where the speed range is approximately 5 ns. The sequence lengths range from 500,000 to 2,000,000 bp

Fig. 6
figure 6

A pictorial description of the complexity analysis of the sequencing under the ROC environment and calculation. Here, we found that the complexity of this algorithm was Ω(n log n)

Table 3 below shows the complexity comparison of BLAST, the baseline paper (Waqaar et al. 2008) and RSAM. BLAST is a linear method for checking local sequences and exon sequences. Linearity is good for simple and small sequences, where exon finding is straightforward, but it becomes difficult when the lengths of the sequence and exon are not fixed. We consider the complexity with CPU utilization per million nucleotides of sequence; CPU utilization is the ratio between user time and wall clock time. Besides the wall clock time, we minimize the number of comparisons by incorporating parallel comparisons (Akl 1985). Suppose k stands for the total number of comparison rounds (time) of an algorithm on n elements. Then the total number of comparisons is denoted by a function Total(k, n). In view of complexity comparisons, Total(k, n) is the upper bound (worst case); on the contrary, the minimum total number of comparisons Minimum(k, n) is the lower bound (best case). In this work, we observe that for the maximum rounds of comparisons, the complexity is Ω(n log n). For sequence lengths from 500,000 to 2,000,000, the complexities of BLAST, the baseline paper (Waqaar et al. 2008) and our algorithm are as follows.

In our baseline paper (Waqaar et al. 2008), there is no basis for comparing the sensitivity of the sequencing, and a suffix tree is implemented for pattern matching and searching. We, on the other hand, measured sensitivity by comparing the performance of the suffix tree and the suffix array. In view of this parameter, both structures offer benefits in space and time. Due to its fundamental advantages over the suffix array, the suffix tree performs searching and matching better; on the contrary, the suffix array requires less space, and for very long sequences the suffix array is better than the suffix tree.
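The suffix array trade-off can be illustrated with a tiny sketch; this naive construction is for illustration only and is not the structure used in the baseline paper.

```python
import bisect

def suffix_array(s):
    """Sorted start positions of all suffixes of s (naive build)."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def occurrences(s, sa, pat):
    """All start positions of pat in s, via binary search on the array."""
    suffixes = [s[i:] for i in sa]  # materialized here only for clarity
    lo = bisect.bisect_left(suffixes, pat)
    # '\x7f' sorts after every nucleotide letter, so [pat, pat + '\x7f')
    # brackets exactly the suffixes that begin with pat
    hi = bisect.bisect_right(suffixes, pat + "\x7f")
    return sorted(sa[lo:hi])
```

The array itself stores only integer positions, which is where its space advantage over the suffix tree comes from; the pattern search is then a pair of binary searches.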

7 Conclusion

In this work, we have proposed and implemented a combined machine learning approach for sequence alignment and local sequencing. We have noticed that the combination of a neural network, a Markov chain, set operations, dynamic and inverse dynamic programming, and the bounding box algorithm for fixing the lengths of the local sequences makes our revised sequence alignment matching (RSAM) faster and more accurate than BLAST and the baseline paper. We also successfully compared the performance of the suffix tree and the suffix array under the sensitivity measurement. The bounding box algorithm is a very interesting, newly introduced idea for maximum match subsequence selection. One parameter we were unable to measure was specificity together with sensitivity in the ROC analysis; we will check this parameter in future work.