Keywords

Introduction

In the past decade, a lot of research work has been done in the field of DNA sequence alignment. As a result, a large number of alignment algorithms and related software are available at present [1]. Basically, a sequence alignment problem deals with matching the nitrogen base of a sequence with a reference sequence. The processes help in finding out the similar functional regions that have great importance in the field of bioinformatics and its related field.

DNA Sequencing and Alignment

DNA is generally a long sequence of base pairs, polymer of repeating subunits called nucleotide. Each nucleotide contains five-carbon sugar, a phosphate group, and a base. Different bases are adenine (A), guanine (G), thymine (T), cytosine (C), and uracil (U). Nitrogen base combines with the five-carbon sugar to form nucleoside. Nucleoside in turn combines with phosphate groups to form nucleotide [2]. During the sequencing, DNA is represented as sequence of these nitrogen bases. Various methods are there for sequencing the DNA like Sanger’s method, Maxam & Gilbert method, Pyro-sequencing, etc. [3]. Apart from these conventional methods, there are many other interdisciplinary approaches like graphical representation techniques [4], Voss representation, 2-bit binary, the 4-bit binary, the paired nucleotide, the 12-letter alphabet, the digital Z-signals, and the phase-specific Z-curve [5].

Sequence alignment is the simultaneous matching of two or more nucleotides. There are different algorithms available for this purpose. Broadly, those can be classified as algorithms based on hash tables, algorithms based on suffix trees and algorithms based on merge sorting [6]. Optimal alignment methods aim to maximize the score or minimize the cost of the process. But in case of dynamic programming approach, the time complexity of the algorithm is proportional to length of the sequence in exponential terms [7, 8]. In such approach, if the gap cost is nonlinear then complexity is extremely more [9].

Proposed Method

In this research paper, two DNA sequences have been aligned using a dynamic window algorithm. The major functions associated with this algorithm are LengthEqualiser (X, Y), dynamic window (p, q, X, Y), Modify-Y (i, j, X, Y), Modify-X (i, j, X, Y), and score (hit, c).

The function Length Equalizer takes the two sequences as arguments and tries to make them the same length by introducing the gaps at proper place. NIG represents the number of introductory gaps. X and Y are the two sequences,| | represents their cardinality.

The dynamic window function tries to match the possible common sequences. The dynamic nature of the window is maintained with the help of sub-functions—Modify-Y (i, j, X, Y) and Modify-X (i, j, X, Y). Apart from this, the function sends the hits calculated to the score function to calculate the score of the process and keeps the different common sub-sequences obtained.

These two Modify functions are accountable for the dynamic nature of the window. First of all, these functions search the right position for the shifting of a base by calculating the number of gaps required which is represented by “c”. Then after introducing the gaps in the either sequence X OR Y, they revoke the Length_Equaliser (X, Y) and Dynamic_Window (p, q, X, Y) which work recursively. They also return the c-value to Score function.

The Score () function takes hit from Dynamic_Window (p, q, X, Y) and c from Modify-X (i, j, X, Y) or Modify-Y (i, j, X, Y) and calculates the score. Here, the cost of hit is calculated using the cost of each base in the sequence. Similarly, the penalty is calculated using the cost of a single gap and cumulative number of gaps introduced. Figure 7.1 represents the schematic view of the working of the algorithm.

Fig. 7.1
figure 1

Schematic view of the working of the dynamic window algorithm

Result and Discussion

The performance of the dynamic window algorithm has been analyzed by different combinations of the following DNA sequences of β-globin genes of eight species given in Table 7.1[10]. A part of the algorithm functioning is given in Fig. 7.2.

Table 7.1 The coding sequence of the exon1 of beta-globin gene of eight different species
Fig. 7.2
figure 2

Dynamic_Window algorithm functioning; left side represents the working of Modify-Y (i, j, X, Y) function; right side represents the working of Modify-X (i, j, X, Y) function

As a result of the working of the Dynamic_Window algorithm on a sub-sequences of Human and Opossum β-globin genes, it has been found that the hits are 11, gaps are 7 using Modify-Y (i, j, X, Y) function whereas the hits are 12, gaps are 7 using Modify-X (i, j, X, Y) function. By taking the cost of a base as +1 and that of a gap as −1, the score % in first case is 13.79 (left side in Fig. 7.2) whereas in second case, it is 17.29 (right side in Fig. 7.2). The common sequences in first case in the List are {ATGGTGCAC, CT, GA} and in the second case are {ATGGTGCAC, T, GA}.

Conclusion

In this study, the common sequence alignment using dynamic window algorithm has been proposed. The structure and the working of the algorithm have been discussed. The algorithm has been used for different sets of subsequences of β-globin genes and the score % has been obtained. The dynamic nature is clear from the comparative analysis of the score percentages.