Abstract
Methods of the spectral–statistical approach (2S-approach) for revealing latent periodicity in DNA sequences are described. Results are presented from the analysis of data in the HeteroGenome database, which collects sequences similar to approximate tandem repeats in the genomes of model organisms. As a further development of the spectral–statistical approach, techniques for recognizing latent profile periodicity are considered. These techniques are based on an extension of the notion of approximate tandem repeat. Examples are given of the correlation between latent profile periodicity revealed in CDSs and structural–functional properties of the encoded proteins.
Key words
- Latent periodicity
- Approximate tandem repeats
- Profile periodicity
- HeteroGenome database
- CDS
- Spectral–statistical approach
1 Introduction
Until recently, reliable methods for recognizing latent periodicity in genomes were based on the notion of approximate tandem repeat [1, 2]. However, the application of these methods has shown that approximate tandem repeats constitute only a small part of the genome sequences of various organisms. Consequently, indirect methods for estimating the period of latent periodicity have become widespread; they are applied without determining the type of periodicity or its corresponding pattern. Such methods include Fourier analysis [3–7] and other techniques [8–15] that display dominant peaks in the graph of a single statistical parameter whose values depend on the tested periods of the DNA sequence. Without a model of periodicity, the latent period estimate obtained by such methods cannot be unambiguously interpreted.
The spectral–statistical approach to revealing latent periodicity was originally developed in [12]. Initially, the problem was to select quantitative statistical parameters for revealing approximate tandem repeats and DNA sequences similar to such repeats. In an investigation of approximate tandem repeats in the TRDB database [16], two characteristic statistical parameters were identified. The first characterizes the heterogeneity level, which takes sufficiently high values in approximate tandem repeats. The second describes the mean level of character (base) preservation at a tested period. This mean level is close to unity (~0.8) if the tested period coincides with the latent period of the approximate tandem repeat. Within the spectral–statistical approach (the 2S-approach), these statistical parameters are considered as functions of the length of the tested period in the analyzed DNA sequence. The graphs of these parameters are called spectra. They characterize the initial stage in the development of the 2S-approach, whose methodology is presented in [12, 17, 18].
The genome sequences of the model organisms Saccharomyces cerevisiae, Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster were analyzed with the help of the 2S-approach spectra. As a result of this analysis, the HeteroGenome database (http://www.jcbi.ru/lp_baze/) was created [18]; sequences similar to approximate tandem repeats were selected for it from the DNA sequences of these organisms. The methodology of the HeteroGenome database is described in the following sections.
However, according to the HeteroGenome data, DNA sequences similar to approximate tandem repeats cover only a small part of the genome (~10 %). For this reason, methods that search for latent periodicity of unknown type are widely used; they can be called indirect, because they are not based on any model of periodicity. Fourier analysis and similar techniques [3–9] are examples of such methods. Dominant peaks revealed by these methods in the spectra are used to estimate the period length of latent periodicity. Strictly speaking, such estimates of period length require additional substantiation [19].
A new notion of latent periodicity, called latent profile periodicity, has been proposed in [12, 20]. This notion is based on a model of profile periodicity (profility) [20, 21] that generalizes the notion of approximate tandem repeat. On the basis of this model, the 2S-approach has been extended to recognize latent profile periodicity in DNA sequences [21, 22]. Since the new type of periodicity generalizes the notion of approximate tandem repeat, one can expect the share of recognizable latent periodicity to grow significantly. This assumption is supported by examples of the analysis of DNA sequences from the human genome [21]. The results of the analysis made it possible to put forward a hypothesis about the existence of a two-level organization of encoding in CDSs. In addition, it appears that latent profility revealed in coding DNA regions can be translated into structural features of the protein sequence. Direct detection of such features is a rather complicated problem, because the goal of the search is unknown a priori.
New methods of the 2S-approach have been proposed [20–22] for recognizing latent profile periodicity. They are based on the model of a profile string, which is a special periodic random string whose pattern consists of independent random characters. Each of these random characters is a random variable taking values from the textual alphabet of DNA sequences. Within the 2S-approach, a DNA sequence displaying latent profile periodicity is considered as a realization of a profile string. Therefore, statistical methods and criteria have to be used for recognizing latent profile periodicity. Latent profile periodicity is recognized in a DNA sequence when the sequence is statistically close to a profile string. In fact, the problem of recognizing latent profile periodicity in a DNA sequence reduces to the problem of specifying a profile string that serves as the periodicity etalon (reference) for the sequence. The random pattern of such a profile string is an analogue of the consensus pattern deduced from the sequence of an approximate tandem repeat. One of the following sections is devoted to the description of the 2S-approach for recognizing latent profile periodicity.
2 HeteroGenome Database. Materials, Methodology, and Analysis of the Results
The methods of the 2S-approach for finding regions of a DNA sequence that are close to approximate tandem repeats have been applied to the genome sequences of the well-studied model organisms [23] S. cerevisiae, A. thaliana, C. elegans, and D. melanogaster. These organisms represent eukaryotic genomes ranging from a unicellular organism (baker’s yeast) to multicellular plants (Arabidopsis) and animals (nematode), which facilitates a general study of the phenomenon of latent periodicity in genomes. The original DNA sequences of the whole genomes of the model organisms were obtained from GenBank [24] at ftp://ftp.ncbi.nih.gov/genomes/. The results of the genome analysis have been systematized in the HeteroGenome database (http://www.jcbi.ru/lp_baze/) described in [18].
Approximate tandem repeats are the most studied type of latent periodicity in DNA sequences, because this type is described by relevant models [1, 2]. A significant number of publications are devoted to the search for approximate tandem repeats and their recognition (e.g., see Refs. [25–28]). However, such repeats constitute a rather small part of the genome sequences of various organisms [18]. In addition, methods that estimate the length of the latent period in sequences that are not approximate tandem repeats have gained widespread acceptance in the scientific literature (e.g., see Ref. [9]). With these methods, the type of periodicity remains unknown, and the estimate is not based on any model. Therefore, in creating the HeteroGenome database, the following compromise approach to searching for sequences with latent periodicity was chosen: sequences similar to approximate tandem repeats were selected. As measures of similarity, two parameters were chosen whose high values are characteristic of the periods of approximate tandem repeats. These parameters are described in detail below.
2.1 Spectral–Statistical Approach for Revealing DNA Sequences Similar to Approximate Tandem Repeats
Latent periodicity close to approximate tandem repeats was revealed by detecting heterogeneity at a high significance level \( \left(\sim {10}^{-6}\right) \) at the test periods of an analyzed nucleotide sequence. A test period of a DNA sequence is an integer that does not exceed one-half of the sequence length. For each test period λ, the analyzed sequence is divided into substrings of length λ (the last substring can be shorter).
Division into substrings of length λ allows calculating the frequency \( {\pi}_j^i\le 1 \) \( \left(i=\overline{1,4},\kern0.5em j=\overline{1,\lambda}\right) \) with which the character \( {a}_i \) from the nucleotide sequence alphabet \( A=\left\langle a={a}_1,\kern0.5em t={a}_2,\kern0.5em g={a}_3,\kern0.5em c={a}_4\right\rangle \) occurs in the jth position of the test period λ. The matrix \( \pi ={\left({\pi}_j^i\right)}_{\lambda}^K \) is called the sample λ-profile matrix of the analyzed sequence, where \( K=4 \) is the size of the alphabet A. Then the character preservation level pl(λ) of the analyzed sequence at the test period λ is determined by the formula:
In this way, a spectrum of the character preservation level pl is introduced for an analyzed sequence at its test periods. According to the results of numerical experiments [12], a character preservation level \( pl(L)\ge 0.5 \) corresponds to sequences of approximate tandem repeats with period length equal to L.
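The construction of the sample λ-profile matrix and of the pl-spectrum can be sketched in a few lines of Python. The sketch assumes that the preservation level is the mean, over the λ positions of the test period, of the largest letter frequency in each position; this form is an assumption consistent with the description of pl as a mean level of character preservation, not a quotation of Eq. 1. The function names are hypothetical.

```python
from collections import Counter

ALPHABET = "atgc"

def profile_matrix(seq, lam):
    """Return the sample lam-profile matrix as a list of lam frequency dicts.

    Column j holds the frequency of each letter among the characters found
    at position j of the consecutive substrings of length lam (the last,
    possibly shorter, substring is included).
    """
    columns = []
    for j in range(lam):
        counts = Counter(seq[j::lam])  # characters at positions j, j+lam, ...
        total = sum(counts.values())
        columns.append({a: counts[a] / total for a in ALPHABET})
    return columns

def preservation_level(seq, lam):
    """Mean of the largest letter frequency over the lam positions
    (assumed form of the character preservation level)."""
    matrix = profile_matrix(seq, lam)
    return sum(max(col.values()) for col in matrix) / lam

def pl_spectrum(seq):
    """pl at every test period from 1 to half the sequence length."""
    return {lam: preservation_level(seq, lam)
            for lam in range(1, len(seq) // 2 + 1)}
```

For a perfect tandem repeat such as `"atg" * 10`, this sketch gives a preservation level of 1 at test periods that are multiples of 3 and the background letter frequency (1/3) at test period 1.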
Along with a high value of the pl spectrum, a high level of heterogeneity is observed at the period length of an approximate tandem repeat. In the HeteroGenome, heterogeneity in a sequence of length n at the test period λ is checked with the help of the Pearson \( {\chi}^2 \) statistic [29]:
According to the results of the numerical experiments in [12], a high character preservation level makes it possible to drop the requirement of a large number of repeats at the test period λ. When the character preservation level is high (\( pl\left(\lambda \right)\sim 0.8 \) or more), the value of the statistic (Eq. 2) is not taken into consideration, even when the number of repeats is small \( \left(\frac{n}{\lambda }<5\right) \).
In searching for sequences similar to approximate tandem repeats, heterogeneity in a DNA sequence is checked at the significance level \( \alpha ={10}^{-6} \) [12]. For the test period L, this level corresponds to a critical value \( {\chi}_{crit}^2\left(\alpha, N\right) \) with \( N=\left(K-1\right)\left(L-1\right) \) degrees of freedom. If the character preservation level pl(L) is sufficiently high and the value of the statistic ν(L, n) meets the condition
then the sequence is recognized as similar to an approximate tandem repeat with period L. In this case it is assumed that the value of pl(L) is close to the maximal value of the pl-spectrum over the range of test periods of the sequence. Accordingly, the HeteroGenome database uses, as a spectral characteristic of an analyzed nucleotide sequence, a spectrum H that at the test period λ takes the value
The graph of the H-spectrum clearly shows significant heterogeneities in a sequence at those test periods where \( H\left(\lambda \right)>1 \); these test periods are then analyzed with the help of the pl-spectrum. As mentioned above, one of these test periods is selected as the estimate of the period length of latent periodicity, namely the one indicated by the first clear-cut maximum of the pl-spectrum (see Fig. 1). Such a maximal value of the pl-spectrum can be interpreted as an index of preservation of the copies of the periodicity pattern. Figure 1 gives an example of how the joint use of both parameters (H-spectrum and pl-spectrum) allows the length of the periodicity pattern to be estimated unambiguously. The graph of the H-spectrum in Fig. 1 reveals heterogeneities in the sequence under consideration at test periods that are multiples of seven. The maximal value of the pl-spectrum singles out the test period of 21 bp, which is accepted as the estimate of the periodicity pattern length. Therefore, the HeteroGenome database automatically shows a visualization of the sequence alignment at the test period of 21 bp. The user can additionally obtain the sequence alignment at other test periods.
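A minimal sketch of the heterogeneity check follows. It rests on two assumptions that are not taken from the text: the statistic is computed as the standard Pearson chi-square of the letter-by-position contingency table, and H(λ) is the ratio of that statistic to the critical value, so that H(λ) > 1 marks significant heterogeneity. The critical value is obtained with the Wilson–Hilferty approximation to stay within the standard library; it is only approximate in the far tail, and an exact quantile (e.g., from `scipy.stats.chi2.ppf`) could be substituted. All function names are hypothetical.

```python
from collections import Counter
from statistics import NormalDist

ALPHABET = "atgc"

def pearson_statistic(seq, lam):
    """Pearson chi-square statistic of the letter-by-position table
    at test period lam (assumed form of the heterogeneity statistic)."""
    n = len(seq)
    letter_totals = Counter(seq)
    stat = 0.0
    for j in range(lam):
        column = seq[j::lam]  # characters at position j of each substring
        counts = Counter(column)
        for a in ALPHABET:
            expected = letter_totals[a] * len(column) / n
            if expected > 0:  # letters absent from the sequence add nothing
                stat += (counts[a] - expected) ** 2 / expected
    return stat

def chi2_critical(df, alpha):
    """Approximate upper-tail critical value of the chi-square distribution
    (Wilson-Hilferty approximation, not an exact quantile)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

def h_spectrum(seq, alpha=1e-6):
    """Statistic divided by its critical value for test periods 2..n//2
    (assumed ratio form of the H-spectrum)."""
    k = len(ALPHABET)
    return {
        lam: pearson_statistic(seq, lam) / chi2_critical((k - 1) * (lam - 1), alpha)
        for lam in range(2, len(seq) // 2 + 1)
    }
```

Under these assumptions, test periods with a spectrum value above 1 would be passed on to the pl-spectrum for selection of the period estimate.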
2.2 Strategy of Searching for and Structuring Data in the HeteroGenome
In creating the HeteroGenome database [18], periodicity close to approximate tandem repeats was revealed by searching for DNA regions with highly significant heterogeneity (at the level \( \alpha ={10}^{-6} \)) in a series of overlapping scanning windows. The length of the initial window is 30 bp. The length of each subsequent window is doubled until a limiting value is reached. The windows scan the analyzed DNA sequence, shifting with a variable step. The general strategy of searching for sequences similar to approximate tandem repeats resembles the “shotgun strategy” of genome sequencing [30], in which relatively short, overlapping fragments are sequenced first. The fragments are then assembled computationally into more extended regions, and the borders of the revealed heterogeneity regions are optimized.
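The window schedule described above can be illustrated with a short generator. Only the initial length of 30 bp and the doubling rule come from the text; the limiting length (`limit`) and the rule tying the shift to half the window length (`step_fraction`) are placeholders chosen for illustration, since the text states neither value.

```python
def scanning_windows(seq_len, initial=30, limit=1920, step_fraction=0.5):
    """Yield half-open (start, end) windows whose length doubles from
    `initial` up to `limit`.

    The step is a fixed fraction of the current window length, so it
    varies from one window size to the next (an assumed step rule).
    Positions past the last full window of a given size are not covered.
    """
    length = initial
    while length <= min(limit, seq_len):
        step = max(1, int(length * step_fraction))
        for start in range(0, seq_len - length + 1, step):
            yield start, start + length
        length *= 2
```

Each window produced this way would be tested for heterogeneity, and significant overlapping windows would then be merged into longer regions.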
For nonredundant data representation in the HeteroGenome database, each logical record is a group of DNA sequences revealed on a chromosome with statistically significant heterogeneity (latent periodicity) that intersect and/or have the same or multiple period lengths. There are two levels of data representation in a group. At the first level is the DNA sequence of greatest length, called the group representative. The remaining sequences belong to the second level. As a rule, they correspond to well-defined local periodicity structures within the sequence of the group representative.
2.3 Results of the HeteroGenome Data Analysis
A comparison of the periodicity data for the genomes of S. cerevisiae, A. thaliana, C. elegans, and D. melanogaster in the HeteroGenome with the corresponding data in the TRDB database [16] has shown that the HeteroGenome contains practically all tandem repeats represented in the TRDB and, moreover, substantially supplements them with data on highly divergent tandem repeats.
In investigating the evolution and functional meaning of latent periodicity regions in a genome, the proportion of the whole genome covered by such regions is an important quantitative indicator. The nonredundant data on the regions of significant heterogeneity (latent periodicity) in the HeteroGenome database allow the percentage of tandem repeats in the analyzed genomes of the model organisms to be estimated. Table 1 presents these estimates.
As will be shown below, the largest part of the latent periodicity regions in the analyzed genomes is represented by micro- and mini-satellites (period length less than 100 bp). In the human genome their fraction is known to amount to 3 % [30]. Together with other approximate tandem repeats (period length of the order of 1000 bp), the latent periodicity regions in the human genome account for about 10 % [25]. Taking into consideration also the data from Table 1, it can be supposed that periodicity constitutes ~10 % of a eukaryotic genome. Such a percentage is probably due to a balance between the molecular mechanisms that generate tandem repeats and the divergence of their sequences, which stabilizes the length of the repeats.
2.3.1 Impact of Latent Periodicity on Chromosome Length
Periodicity regions are hot spots in the genome, able both to expand and to shrink in response to slippage of the DNA replicase and to recombination and duplication processes [31–33]. Over time, mutations (point substitutions, insertions/deletions of nucleotides) disturb the regular structure of DNA periodicity regions, stabilizing the region lengths. Since the method of revealing latent periodicity used in [18] allows a nonredundant estimate of the periodicity proportion in a genome, it becomes possible to investigate the influence of periodicity regions on the chromosomes.
Let us consider the percentage of periodicity regions as a function of chromosome length in the genomes of the analyzed model organisms (see Fig. 2). For each organism, a characteristic scatter of the percentages of chromosome coverage by periodicity regions is observed. While in the genomes of S. cerevisiae, C. elegans, and D. melanogaster the scatter of the percentages over the chromosomes is comparable to the mean percentage in the corresponding genome, in the A. thaliana genome the scatter is no more than 0.75 %. As Fig. 2a shows, the percentage of periodicity regions remains practically constant for Arabidopsis chromosomes as chromosome length grows.
Generally, as shown in Fig. 2, the percentage of periodicity regions tends to remain constant or even to decrease with growing chromosome length in all analyzed genomes of the model organisms. Nevertheless, owing to their ability to elongate, tandem repeats have markedly influenced chromosome length (periodicity coverage ~10 %).
2.3.2 Analysis of Periodic Structure Preservation in the Regions of Heterogeneity
On the basis of the HeteroGenome data, Fig. 3 gives an example of a histogram showing the distribution of the revealed latent periodicity regions with respect to the preservation level of their periodic structure (see Eq. 1 for the pl(L) parameter). Separately for micro- (period length in the range \( 2\le L\le 10 \)), mini- \( \left(10<L\le 100\right) \), and mega- \( \left(100<L\le 2000\right) \) satellites on each chromosome, the percentage of the repeats’ length is shown for highly divergent \( \left(0.4\le pl\le 0.7\right) \), moderately divergent \( \left(0.7< pl\le 0.8\right) \), slightly divergent \( \left(0.8< pl\le 0.9\right) \), and perfect \( \left(0.9< pl\le 1.0\right) \) tandem repeats.
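The binning behind such a histogram can be sketched as follows. The class boundaries are those listed in the text, with pl = 0.7 assigned to the highly divergent class; the record layout `(length_bp, period, pl)` and the function names are hypothetical and do not reflect the database schema.

```python
from collections import defaultdict

def satellite_class(period):
    """Classify a repeat by period length (bp); None if outside all ranges."""
    if 2 <= period <= 10:
        return "micro"
    if 10 < period <= 100:
        return "mini"
    if 100 < period <= 2000:
        return "mega"
    return None

def divergence_class(pl):
    """Classify a repeat by preservation level; None if pl is below 0.4."""
    if 0.4 <= pl <= 0.7:
        return "highly divergent"
    if 0.7 < pl <= 0.8:
        return "moderately divergent"
    if 0.8 < pl <= 0.9:
        return "slightly divergent"
    if 0.9 < pl <= 1.0:
        return "perfect"
    return None

def coverage_by_class(regions, chrom_len):
    """Percentage of chromosome length per (satellite, divergence) class.

    `regions` is an iterable of (length_bp, period, pl) tuples; regions
    that fall outside every class are skipped.
    """
    table = defaultdict(float)
    for length, period, pl in regions:
        key = (satellite_class(period), divergence_class(pl))
        if None not in key:
            table[key] += 100.0 * length / chrom_len
    return dict(table)
```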
According to Fig. 3, in the genome of A. thaliana, highly divergent mini-satellites \( \left(10<L\le 100\right) \) constitute a noticeable part (\( \sim 1-1.5\kern0.5em \% \) for each chromosome), which is comparable with the percentage of micro-satellites \( \left(2\le L\le 10\right) \). Consequently, mini- and micro-satellites contribute similarly to the structural and functional organization of the A. thaliana genome. The portion of mega-satellite repeats in the Arabidopsis genome \( \left(\sim 1\%\right) \) is also quite noticeable.
On the Database Statistics page (http://www.jcbi.ru/lp_baze/statistics/index.html) of the HeteroGenome database, one can see analogous histograms of the structural content of periodicity regions for the chromosomes of the other analyzed genomes. On the basis of these histograms, one or a few types of characteristic dominating periodicity can be distinguished in every genome [18], for example, highly divergent micro-satellites in the S. cerevisiae genome. The genomes of A. thaliana and C. elegans have a similar composition of characteristic periodicities. The considerable percentage \( \left(\sim 1.5\%\right) \) of mini- and mega-satellites is probably a consequence of active recombination processes [31–33] in the genomes of Arabidopsis and the nematode. The domination of micro-satellites in the yeast genome could be related to the large number of genome replications during yeast growth and, consequently, to frequent replicase slippage [31–33], which is conducive to the elongation of such periodicity regions.
2.3.3 Revealing Latent Periodicity in the Genome Functional Regions
Using a link to the Sequence Viewer (http://www.ncbi.nlm.nih.gov/projects/sviewer/), one can obtain, for any periodicity region in the HeteroGenome database, information about the annotation of the genome sequence in which the region is located. As shown in [18], for the genomes of S. cerevisiae, A. thaliana, C. elegans, and D. melanogaster, respectively 80, 62, 65, and 67 % of the HeteroGenome groups (see Subheading 2.2) are located in genes. Practically all the remaining groups from the database are situated in unassigned sequences of the genomes. It should be noted, however, that 2.6 % of the groups from the D. melanogaster genome are located in regions of various repeats.
2.3.4 Density of Distributing Latent Periodicity Regions Along the Chromosomes
The distribution of latent periodicity regions over the chromosomes was studied for all genomes of the model organisms in the database. Each chromosome was subdivided into consecutive intervals of equal length, corresponding to 0.5 % of the total chromosome length. Then, for each interval, the summary length of the latent periodicity regions (total number of nucleotides) revealed within the interval borders was calculated. This value, normalized by the total chromosome length and multiplied by 100 %, was considered as the part (restricted to the interval) of the whole periodicity percentage on the chromosome. Summing the parts over all intervals gives an estimate of the whole periodicity percentage on the chromosome.
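The interval computation just described can be sketched as follows. The sketch assumes that regions are given as non-overlapping half-open coordinate pairs `(start, end)` and that a region straddling an interval border contributes to each interval only the nucleotides lying inside it; both conventions, and the function name, are illustrative assumptions.

```python
def density_profile(chrom_len, regions, n_bins=200):
    """Percentage of the chromosome covered by periodicity regions
    within each of n_bins consecutive intervals.

    With n_bins=200 each interval spans 0.5 % of the chromosome length.
    `regions` is an iterable of half-open (start, end) coordinates.
    """
    # Integer interval borders; the last border equals chrom_len.
    edges = [k * chrom_len // n_bins for k in range(n_bins + 1)]
    profile = [0.0] * n_bins
    for start, end in regions:
        for k in range(n_bins):
            overlap = min(end, edges[k + 1]) - max(start, edges[k])
            if overlap > 0:
                profile[k] += 100.0 * overlap / chrom_len
    return profile
```

Summing the returned values gives the total percentage of the chromosome covered by the supplied regions, which matches the estimate described above when the regions do not overlap.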
In investigating the density distribution within the intervals, only the group representatives from the HeteroGenome were considered, since they correspond to a nonredundant estimate of chromosome coverage by the regions of latent periodicity. In addition, three further distributions were obtained for every chromosome, corresponding to the density of micro- (period length in the range \( 2\le L\le 10 \)), mini- \( \left(10<L\le 100\right) \), and mega- \( \left(100<L\le 2000\right) \) satellites.
The results for the density distribution of latent periodicity regions along the chromosomes are presented on the Database Statistics page (http://www.jcbi.ru/lp_baze/statistics/index.html) of the HeteroGenome. An example of such distributions for all chromosomes of A. thaliana is shown in Fig. 4. The lengths of a single sequential interval for chromosomes I–V were 152138, 98491, 117299, 92925, and 134877 bp, respectively [17].
As one can see from the histograms in Fig. 4, the density distribution of latent periodicity regions on a chromosome is a distinctive characteristic of that chromosome in the genome. Such histograms can be considered as a kind of individual bar code for the chromosomes of a genome.
3 Spectral–Statistical Approach for Recognizing Latent Profile Periodicity
Initially, the 2S-approach was developed as a set of methods for searching for regions of statistical heterogeneity in genomes, so that further study of these regions would lead to the discovery of new types of periodicity different from approximate tandem repeats. Among the HeteroGenome data, sequences have been identified in which a new type of latent periodicity is recognized [18]. The present section describes new methods of the 2S-approach for recognizing this type of latent periodicity, called latent profile periodicity or profility [20, 21], in DNA sequences.
3.1 Methodology of Recognizing Latent Profile Periodicity
Latent profile periodicity (latent profility) has a statistical basis. Therefore, statistical criteria are formulated below that determine the similarity of an analyzed DNA sequence to an etalon (reference) periodic random string. Accordingly, the statistical hypothesis is tested that the DNA sequence can be considered as a realization of the etalon periodic random string. If the hypothesis is accepted, the existence of latent profile periodicity in the DNA sequence is recognized, and the periodicity pattern is estimated. As a model of the periodicity, a special random string with a periodicity pattern consisting of independent random characters is proposed. This random string is a perfect tandem repeat of such a pattern and is called a profile string. The methods for recognizing latent profility are based on the model of a profile string.
3.1.1 Model of Profile String and Notion of Latent Profile Periodicity
A profile string is a particular case of a special random string that consists of independent random characters. In the general case, such a special random string of length n can be considered as a scheme of n independent trials of different random variables, where each variable has K outcomes, the letters of the alphabet \( A = \left\langle {a}_1,\dots, {a}_K\right\rangle \). For DNA sequences, \( K=4 \) is the size of the textual alphabet, which is written as \( A=\left\langle {a}_1,\dots, {a}_4\right\rangle =\left\langle a,t,g,c\right\rangle \). Each independent random variable is called a random character, is denoted by \( Chr\left(\mathbf{p}\right) \), and is determined by the probability column \( \mathbf{p}={\left({p}^1,\dots, {p}^K\right)}^T \), where \( {p}^i \) is the probability of occurrence of the ith \( \left(i=\overline{1,K}\right) \) letter of the alphabet A. Consequently, such a scheme of n independent trials can be represented by the formal string \( St{r}_n\left(\pi \right)=Chr\left({\mathbf{p}}_1\right)\dots Chr\left({\mathbf{p}}_n\right) \). This string is an n-dimensional random variable, in which \( Chr\left({\mathbf{p}}_j\right) \) is the random character describing the jth \( \left(j=\overline{1,n}\right) \) trial. Such a random string is unambiguously induced by the matrix \( \pi =\left({\mathbf{p}}_1,\dots, {\mathbf{p}}_n\right)={\left({\pi}_j^i\right)}_n^K \), called the n-profile matrix, or profile matrix, of the string \( St{r}_n\left(\pi \right) \). In accordance with [12, 20–22], any integer L in the range \( 1,\dots, {L}_{\max} \), \( {L}_{\max}\le \frac{n}{5K} \), is called a test period of this string.
Let L be a test period of the string \( Str= St{r}_n\left(\pi \right) \), let \( n=mL+M \) with \( 0\le M<L \), and let \( St{r}_n\left(\pi \right)= St{r}_L\left({\pi}_1\right)\dots St{r}_L\left({\pi}_m\right) St{r}_M\left({\pi}_{m+1}\right) \) be the decomposition of the string Str into substrings of length L. If \( M=0 \) (that is, \( \pi =\left({\pi}_1,\dots, {\pi}_m\right) \) and the string \( St{r}_M\left({\pi}_{m+1}\right) \) is empty), then the matrix \( {\varPi}_{Str}(L)=\frac{1}{m}{\displaystyle \sum}_{i=1}^m{\pi}_i \) is called the L-profile matrix of the string \( Str= St{r}_n\left(\pi \right) \). If \( M\ne 0 \), then the matrix \( {\varPi}_{Str}(L) \) is corrected correspondingly. Thus, a profile-matrix spectrum \( {\varPi}_{Str} \), defined at each test period, is introduced for the string \( Str= St{r}_n\left(\pi \right) \). If \( {\pi}_1=\dots ={\pi}_m={\pi}_0 \) and \( {\pi}_0=\left({\pi}_{m+1},{\pi}_{01}\right) \) (that is, \( {\pi}_{m+1} \) consists of the first M columns of \( {\pi}_0 \)), then the string \( St{r}_n\left(\pi \right) \) is called an L-profile string with the random periodicity pattern \( Pt{n}_L\left({\pi}_0\right)= St{r}_L\left({\pi}_0\right) \). Here it is assumed that the pattern cannot be represented as consecutive repetitions of another random string. In this case the notation \( Td{m}_L\left({\pi}_0,n\right) \) is used for the string \( St{r}_n\left(\pi \right) \). In addition, the matrix \( {\pi}_0 \) is called the general profile matrix of the string \( Td{m}_L\left({\pi}_0,n\right) \), because this matrix induces the whole profile-matrix spectrum of the string. The integer L is called the period length of the string \( Td{m}_L\left({\pi}_0,n\right) \). If \( L=1 \), then the profile string \( Td{m}_1\left({\pi}_0,n\right)=Td{m}_1\left(\mathbf{p},n\right)=\underset{n\kern1em times}{\underbrace{Chr\left(\mathbf{p}\right)\dots Chr\left(\mathbf{p}\right)}} \) is called a homogeneous string, because its period length equals unity.
A letter \( {a}_i\in A \) can be identified with the random character whose probability (frequency) column has all components equal to zero except the ith component, which equals unity. Such a random character is called a textual character. Consequently, any textual string over the alphabet A can be identified with the corresponding special random string of the same length. Such a special string is also called a textual string.
As for any random variable, the n trials corresponding to the scheme of the profile string \( Str=Td{m}_L\left({\pi}_0,n\right) \) can be carried out. As a result of these trials, a textual string str, called a realization of the string \( Str=Td{m}_L\left({\pi}_0,n\right) \), is obtained. For the string str, one can pose the question of whether latent profile periodicity exists in it. If the length n of the strings \( Str=Td{m}_L\left({\pi}_0,n\right) \), \( \left(L<\kern0.5em {L}_{\max}\le \frac{n}{5K}\right) \), and str is sufficiently large, then their profile-matrix spectra are statistically similar with high probability. This property is used in the 2S-approach for recognizing latent profile periodicity in textual strings (DNA sequences).
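The notion of a realization can be made concrete with a short sampling sketch. Here the periodicity pattern is represented as a list of L probability columns in the letter order a, t, g, c; this representation and the function names are illustrative choices, not part of the cited methods.

```python
import random

ALPHABET = "atgc"

def max_test_period(n, k=len(ALPHABET)):
    """Largest admissible test period, L_max <= n / (5K)."""
    return n // (5 * k)

def realize_profile_string(pattern, n, rng=None):
    """Draw one realization of length n of the profile string whose
    periodicity pattern is `pattern` (a list of L probability columns).

    Position j is sampled independently from column j mod L.
    """
    rng = rng or random.Random()
    period = len(pattern)
    return "".join(
        rng.choices(ALPHABET, weights=pattern[j % period])[0]
        for j in range(n)
    )
```

When every column is a textual character (a single unit component), the realization is the perfect tandem repeat of the corresponding textual pattern; with non-degenerate columns, each call gives a different textual string that need not resemble an approximate tandem repeat.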
According to the 2S-approach, to recognize latent profile periodicity in a DNA sequence, it is necessary to find a profile string of which the analyzed sequence can be considered a realization. The search for such a profile string is carried out by analyzing the spectral characteristics (the statistical spectra) of the textual string (DNA sequence) under consideration.
3.1.2 Methods for Estimating Period Length of Latent Profile Periodicity
To estimate the period of latent profile periodicity, the 2S-approach applies special statistical spectra of a textual string, which are introduced in the present section.
Let \( Str= St{r}_n\left({\pi}^{*}\right) \) be a random string of n independent random characters over the initial alphabet \( A=\left\langle {a}_1,\dots, {a}_K\right\rangle \). This string is induced by its n-profile matrix \( {\pi}^{*}=\left({\mathbf{p}}_1,\dots, {\mathbf{p}}_n\right) \), and \( \mathbf{p}={\left({p}^1,\dots, {p}^K\right)}^T=\frac{1}{n}{\displaystyle \sum}_{j=1}^n{\mathbf{p}}_j={\varPi}_{Str}(1) \) is the probability (frequency) vector of occurrence of the letters of the alphabet A in the string \( Str= St{r}_n\left({\pi}^{*}\right) \). Then, for each test period λ of the string Str, the λ-profile matrix \( {\varPi}_{Str}\left(\lambda \right)={\left({\pi}_j^i\right)}_{\lambda}^K \) determines the following value \( {\Psi}_1\left(\lambda \right) \):
In this way, a function \( {\Psi}_1 \), defined at the test periods of the string \( Str= St{r}_n\left({\pi}^{*}\right) \), is introduced; it is called the general spectrum of the string.
For a nonhomogeneous profile string \( Str=Td{m}_L\left({\pi}_0,n\right) \) with \( L\ne 1 \) (in particular, for a textual tandem repeat), the following assertion can be proven rigorously.
The general spectrum \( {\Psi}_1 \), defined by Eq. 5, of a nonhomogeneous profile string \( Str=Td{m}_L\left({\pi}_0,n\right) \) has period L. The maximal values of the spectrum \( {\Psi}_1 \) are attained only at test periods that are multiples of L. For a homogeneous string \( \left(L=1\right) \), according to Eq. 5, the general spectrum takes zero values.
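The assertion can be checked numerically on a small example. The sketch below assumes that the general spectrum has the Pearson-type form in which, for each position j of the test period, the squared deviations of the λ-profile column from the overall probability vector are divided by the overall probabilities and weighted by the number of string positions that fall on j; this is an assumed reading of the definition, and the function name is hypothetical.

```python
def general_spectrum(pattern, n, max_period):
    """General spectrum of the profile string of length n induced by
    `pattern` (a list of L probability columns), for test periods
    1..max_period, under an assumed Pearson-type form.
    """
    period = len(pattern)
    k = len(pattern[0])
    columns = [pattern[j % period] for j in range(n)]  # n-profile matrix
    p = [sum(col[i] for col in columns) / n for i in range(k)]
    spectrum = {}
    for lam in range(1, max_period + 1):
        value = 0.0
        for j in range(lam):
            group = columns[j::lam]  # columns falling on position j
            m = len(group)
            for i in range(k):
                if p[i] > 0:
                    pi = sum(col[i] for col in group) / m
                    value += m * (pi - p[i]) ** 2 / p[i]
        spectrum[lam] = value
    return spectrum
```

For a 3-profile string of length 60 whose three columns each favor a different letter with probability 0.7, this sketch gives equal maximal values at test periods 3 and 6 and values of practically zero at the other test periods up to 6, in line with the assertion; for a homogeneous pattern all values are zero.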
To illustrate the above assertions, Fig. 5 shows the graphs of the general spectra of a textual perfect tandem repeat (Fig. 5a) and of a profile string (Fig. 5b). The realizations of this profile string are not approximate tandem repeats.
By analogy with Eq. 5, a general spectrum \( {\Psi}_1 \) is introduced for a textual string str of length n; at the test period \( \lambda <{L}_{\max}\le \frac{n}{5K} \) it takes the value:
where \( {\varPi}_{str}\left(\lambda \right)={\left({\pi}_j^i\right)}_{\lambda}^K \) is the λ-profile matrix of the string str and \( {\varPi}_{str}(1)={\left({p}^1,\dots, {p}^K\right)}^T \). For realizations of a homogeneous string of length n, in accordance with the Pearson goodness-of-fit test [29], the distribution of \( {\Psi}_1\left(\lambda \right) \) is statistically equivalent to the \( {\chi}_{\left(K-1\right)\left(\lambda -1\right)}^2 \) distribution, where \( {\chi}_N^2 \) denotes the \( {\chi}^2 \) distribution with N degrees of freedom, i.e.,
When the graph of the general spectrum \( {\Psi}_1 \) is plotted for a textual string str obtained as a realization of the profile string \( Str=Td{m}_L\left({\pi}_0,n\right) \), the theoretical form of the general spectrum of the string Str is distorted. To illustrate this distortion, the graphs of the general spectra of a realization of a homogeneous (1-profile) string (see Fig. 5c) and of a 9-profile CDS sequence (Fig. 5d) from the KEGG database [34] are shown. In addition, the bold line in Fig. 5c, d shows the right-hand critical value \( {\chi}_{crit}^2\left(N,\alpha \right) \) of the \( {\chi}_N^2 \) distribution as a function of the test period λ at the significance level \( \alpha =0.05 \), where \( N=\left(K-1\right)\left(\lambda -1\right) \).
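According to Eq. 7, the 2S-approach [20–22] tests the hypothesis that a textual string str is homogeneous (at significance level \( \alpha =0.05 \)) by means of the spectrum D1, which at the test period λ takes the value:
\[ {D}_1\left(\lambda \right)=\frac{{\varPsi}_1\left(\lambda \right)}{{\chi}_{crit}^2\left(\left(K-1\right)\left(\lambda -1\right),\alpha \right)}. \qquad (8) \]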
If \( {\mathrm{D}}_1\left(\lambda \right)>1 \), then according to the goodness-of-fit test [29] the analyzed string manifests heterogeneity at the test period λ. The D1 spectrum of a textual string is therefore called the spectrum of deviation from homogeneity.
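The homogeneity check can be sketched as follows. The sketch assumes that D1(λ) is the ratio of Ψ1(λ) to the critical value of the χ² distribution with (K−1)(λ−1) degrees of freedom, which is what the threshold D1(λ) > 1 suggests. To stay within the Python standard library, the critical value is approximated with the Wilson–Hilferty formula; this approximation is a substitution made here for illustration, and an exact quantile (for example `scipy.stats.chi2.ppf`) could be used instead. The function names are hypothetical.

```python
from statistics import NormalDist


def chi2_crit(dof, alpha=0.05):
    """Approximate right-hand critical value of the chi-square distribution.

    Uses the Wilson-Hilferty approximation (an assumption of this sketch,
    not part of the 2S-approach itself).
    """
    if dof < 1:
        raise ValueError("degrees of freedom must be at least 1")
    z = NormalDist().inv_cdf(1.0 - alpha)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def deviation_from_homogeneity(psi, lam, K=4, alpha=0.05):
    """Assumed form of D1: general spectrum value over the critical value."""
    if lam < 2:
        raise ValueError("the test period must be at least 2")
    dof = (K - 1) * (lam - 1)
    return psi / chi2_crit(dof, alpha)
```

A value greater than unity at some test period then indicates heterogeneity at that period at the chosen significance level.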
For a nonhomogeneous profile string of length n, the probability distribution of the general spectrum values of the string's realizations at the test period λ does not coincide with the χ² distribution with \( N=\left(K-1\right)\left(\lambda -1\right) \) degrees of freedom. Compared with this χ² distribution, the actual distribution gives a probability of exceeding the critical level \( {\chi}_{crit}^2\left(\left(K-1\right)\left(\lambda -1\right),\alpha \right) \) that is substantially larger than \( \alpha =0.05 \). Consequently, the D1 spectra of realizations of a nonhomogeneous profile string contain test periods at which the D1 value exceeds unity. Such textual string realizations are called heterogeneous strings.
Figure 6a shows the D1 spectrum of deviation from homogeneity obtained from the general spectrum Ψ1 (see Fig. 5d or 6c). According to this D1 spectrum, the human CDS (KEGG, hsa:338872) is a heterogeneous sequence.
The general spectra of the profile string (Fig. 6d) and of its “realization” (Fig. 6c), which is in fact a CDS (KEGG, hsa:338872) from the human genome, are shown again. As Fig. 6 shows, the difference between the general spectrum of the profile string and that of its realization is practically a linear function of the test period. As with the \( {\chi}_{\left(K-1\right)\left(\lambda -1\right)}^2 \) distribution, the number of degrees of freedom of the probability distribution of the values in the general spectrum Ψ1 of the realizations of the original profile string grows with the test period λ. To level this growth for a realization str, a spectrum C is introduced as follows:
where \( M\left({\chi}_N^2\right)=\left(K-1\right)\left(\lambda -1\right) \) is the mean of the χ² distribution with \( N=\left(K-1\right)\left(\lambda -1\right) \) degrees of freedom. The spectrum C is called the characteristic spectrum of the analyzed textual string. The characteristic spectrum of the analyzed realization str is shown in Fig. 6b.
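The leveling step can be illustrated with a short Python sketch. It assumes that the characteristic spectrum is obtained by subtracting the mean (K−1)(λ−1) of the χ² distribution from the general spectrum value; this functional form is an assumption suggested by the linear growth described above, and the function name and the default K = 4 are hypothetical.

```python
def characteristic_spectrum(psi, lam, K=4):
    """Assumed leveling: general spectrum value minus the chi-square mean.

    psi : value of the general spectrum at test period lam
    lam : test period (positive integer)
    K   : alphabet size (4 for DNA)
    """
    if lam < 1:
        raise ValueError("the test period must be a positive integer")
    mean = (K - 1) * (lam - 1)  # mean of chi-square with (K-1)(lam-1) d.o.f.
    return psi - mean


def level_spectrum(psi_by_period, K=4):
    """Apply the assumed leveling to a mapping {test period: psi value}."""
    return {lam: characteristic_spectrum(psi, lam, K)
            for lam, psi in psi_by_period.items()}
```

Under this assumption, a general spectrum that merely follows the χ² mean is leveled to zero at every test period, so the remaining peaks reflect the periodicity rather than the growing number of degrees of freedom.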
Comparing the characteristic spectrum of the realization of the original 9-profile string (Fig. 6b) with the general spectrum of the 9-profile string itself (Fig. 6d) shows that the two spectra are visually similar. The 2S-approach relies on this similarity to recognize latent profile periodicity in textual strings. For heterogeneous textual string realizations, the maximal value of the characteristic spectrum is attained (up to a small random error) at the period of latent profile periodicity. The 2S-approach uses this property of the characteristic spectrum to estimate the period length of latent profile periodicity. The following rule is proposed for estimating the period length in an analyzed textual string.
First, among the test periods of the string, the test period L at which the first clear-cut maximum of the characteristic spectrum C is attained is selected. If \( {D}_1(L)>1 \), then the test period L is taken as the estimate of the latent period of profile periodicity.
The D1 spectrum of deviation from homogeneity in Fig. 6a was obtained from the general spectrum Ψ1 (see Fig. 5d or 6c), and the characteristic spectrum C in Fig. 6b corresponds to the same sequence. According to the rule above, 9 bp is proposed as the estimate of the latent period of profile periodicity in the analyzed coding DNA sequence (KEGG, hsa:338872) from the human genome.
The efficiency of this rule for estimating the period of latent profile periodicity in heterogeneous DNA sequences that cannot be considered approximate tandem repeats has been demonstrated in [20–22]. For such sequences, Fig. 7 shows examples of characteristic spectra and spectra of deviation from homogeneity. It will be shown below that latent periodicities with the periods \( L=10 \) (Fig. 7a), \( L=84 \) (Fig. 7c), and \( L=9 \) (Fig. 7e) are revealed in these sequences.
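The estimation rule can be sketched in Python. The text does not define "clear-cut" quantitatively, so the sketch makes an explicit assumption: the first clear-cut maximum is taken to be the smallest test period whose characteristic spectrum value lies within a relative tolerance `rel_tol` of the global maximum. The function name, the tolerance, and the spectrum values in the usage note are hypothetical.

```python
def estimate_latent_period(C, D1, rel_tol=0.05):
    """Estimate the latent period from the C and D1 spectra.

    C, D1   : mappings {test period: spectrum value} over the same periods
    rel_tol : assumed tolerance defining a "clear-cut" maximum

    Returns the selected test period, or None when the string is not
    heterogeneous at that period (D1 <= 1).
    """
    periods = sorted(C)
    if not periods:
        raise ValueError("the characteristic spectrum is empty")
    c_max = max(C[p] for p in periods)
    threshold = c_max - rel_tol * abs(c_max)
    for p in periods:
        if C[p] >= threshold:
            # First test period that reaches the maximum within tolerance.
            return p if D1[p] > 1.0 else None
    return None
```

For instance, with made-up values `C = {3: 20.0, 6: 22.0, 9: 60.0, 12: 25.0, 18: 59.0}` and `D1[9] = 1.8`, the sketch selects the test period 9, since the lower peaks at 3 and 6 do not reach the maximum within the tolerance.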
3.1.3 Pattern Estimate for Etalon of Latent Profile Periodicity on Basis of Goodness-of-Fit Test
Suppose that, for a textual string str, an estimate \( L>1 \) of the period of latent profile periodicity has been obtained from the C (see Eq. 9) and D1 (see Eq. 8) spectra of the string. Then, by analogy with the general spectrum (see Eq. 6), the spectrum \( {\varPsi}_L \) is used to test whether the test period L is a period of latent profile periodicity; at the test period λ it takes the value:
where \( {\varPi}_{str}\left(\lambda \right)={\left({\pi}_j^{*\ i}\right)}_{\lambda}^K \) and \( {\varPi}_{Td{m}_L}\left(\lambda \right)={\left({\pi}_j^i\right)}_{\lambda}^K \) are the λ-profile matrices of the textual string str and of the L-profile string \( Td{m}_L=Td{m}_L\left({\varPi}_{str}(L),n\right) \), respectively. For realizations of the L-profile string, according to the Pearson goodness-of-fit test [29], the following relation holds:
Using the statistic (Eq. 10) and the relation (Eq. 11), the spectrum \( {D}_L \) of deviation of the string str from L-profility is introduced; at the test period λ it takes the value:
where \( {\chi}_{crit}^2\left(N,\alpha \right) \) is the critical value of the \( {\chi}_N^2 \) distribution with N degrees of freedom at significance level \( \alpha =0.05 \). The \( {D}_L \) spectrum is used to test the hypothesis that L-profility exists in the analyzed textual string according to the following rule.
Let Q be the fraction of the test periods of the analyzed string at which the values of the \( {D}_L \) spectrum are greater than unity. The hypothesis that L-profility exists in the string is accepted if \( Q<0.05 \).
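The acceptance rule can be written as a short Python sketch. It takes the values of the spectrum of deviation from L-profility as given and only implements the counting step; the function name and the example values are hypothetical.

```python
def accepts_l_profility(d_l_values, q_max=0.05):
    """Apply the acceptance rule for L-profility.

    d_l_values : values of the deviation spectrum, one per test period
    q_max      : threshold on the fraction Q (0.05 in the text)

    Returns (Q, accepted), where Q is the fraction of test periods with
    a spectrum value greater than unity and accepted is True when Q < q_max.
    """
    values = list(d_l_values)
    if not values:
        raise ValueError("at least one test period is required")
    q = sum(1 for v in values if v > 1.0) / len(values)
    return q, q < q_max
```

The inequality is strict, so a string in which exactly 5 % of the test periods exceed unity is not accepted as L-profile under this rule.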
Let us give an example of how this rule is used. According to the spectra in Fig. 7, length estimates of 10, 84, and 9 bp have been proposed for the latent periods of profile periodicity in three DNA sequences that are not approximate tandem repeats. These estimates are visually confirmed in Fig. 8 by the spectra of deviation from the corresponding profility.
The results of the analysis of a textual string str in which latent L-profile periodicity was revealed allow a random string \( Pt{n}_L\left({\varPi}_{str}(L)\right)= St{r}_L\left({\varPi}_{str}(L)\right) \) of independent random characters to be taken as an estimate of the pattern of this periodicity. This random string is unambiguously characterized by the profile matrix \( {\varPi}_{str}(L) \) of the string str. In this case the hypothesis that the string str is statistically similar (at significance level \( \alpha =0.05 \)) to the profile string \( Td{m}_L\left({\varPi}_{str}(L),n\right) \) is accepted. Thus, the profile string \( Td{m}_L\left({\varPi}_{str}(L),n\right) \) is an etalon of profile periodicity for the string str, and the random string \( Pt{n}_L\left({\varPi}_{str}(L)\right) \) is an estimate of the pattern of this latent profile periodicity. The pattern \( Pt{n}_L\left({\varPi}_{str}(L)\right) \) is an analogue of the consensus pattern derived when approximate tandem repeats are recognized.
3.1.4 Methods, Reconstructing Spectrum of Deviation from Homogeneity and Confirming a Pattern Estimate for Etalon of Latent Profile Periodicity
Suppose that the hypothesis of latent L-profility has been accepted for a heterogeneous textual string str (see Eq. 12 and the text below it). The string str can then be considered a realization of the L-profile etalon string \( Td{m}_L=Td{m}_L\left({\varPi}_{str}(L),n\right) \).
The etalon of profile periodicity \( Td{m}_L=Td{m}_L\left({\varPi}_{str}(L),n\right) \) was formed by applying the goodness-of-fit test to the analyzed string str. An additional confirmation can be obtained for the resulting estimate of the pattern of latent profile periodicity. By analogy with the D1 spectrum (see Eq. 8), a spectrum \( T{h}_L \) is introduced for the random profile string \( Td{m}_L \); it represents the deviation of this string from homogeneity and at the test period λ takes the value:
In effect, the \( T{h}_L \) spectrum is a theoretical reconstruction of the D1 spectrum of the string str. To confirm an estimate of the pattern of latent profile periodicity, a method that compares the spectra D1 and \( T{h}_L \) of deviation from homogeneity for the strings str and \( Td{m}_L \), respectively, was proposed in [20–22]. If the pattern estimate of the etalon \( Td{m}_L=Td{m}_L\left({\varPi}_{str}(L),n\right) \) of latent profile periodicity is correct for the string str, then the spectrum \( T{h}_L \) is clearly similar to the D1 spectrum. Figure 9d shows the theoretical reconstruction of the D1 spectrum for a human CDS (KEGG, hsa:26974). The visual similarity of this reconstruction to the original D1 spectrum of deviation from homogeneity (Fig. 9b) supports the revealed latent 84-profile periodicity.
3.2 Notion of 3-Regularity in Coding Regions of DNA Sequences
A regular repetition of peaks at test periods that are multiples of three (see, e.g., Fig. 10a) was observed earlier [21] in the characteristic spectra of heterogeneous coding DNA sequences. In contrast to latent profility, this phenomenon was called the 3-regularity of DNA sequences.
Let us describe a criterion for the existence of 3-regularity in a DNA sequence [35]. The domain of the characteristic spectrum of the analyzed DNA region is divided into consecutive triplets of test periods. Within each triplet, the test period at which the characteristic spectrum attains its local maximum is assigned one, and the other two test periods are assigned zeros. The result is a binary string of zeros and ones, i.e., a textual string str over the alphabet \( A=\left\langle 0,1\right\rangle \) of size \( K=2 \). This string is compared with the perfect periodic string of the same length with the periodicity pattern 001. The index I3, equal to the number of coinciding components of the binary string str and the perfect periodic string divided by the string length, is called the index of 3-regularity of the analyzed sequence. If \( {\mathrm{I}}_3>0.7 \), then 3-regularity is observed in the characteristic spectrum. For example, according to this criterion, 3-regularity is observed in the characteristic spectra in Figs. 10a and 11b, d, f, which correspond to coding DNA sequences. In the characteristic spectrum in Fig. 10b, which corresponds to an intron sequence, 3-regularity is not revealed, as confirmed by the index value \( {\mathrm{I}}_3=0.42<0.7 \). In Figs. 10a and 11b, the 3-regularity of the characteristic spectra is evident. The existence of 3-regularity in the characteristic spectra in Fig. 11d, f is confirmed by the index values \( {\mathrm{I}}_3=0.87 \) and \( {\mathrm{I}}_3=0.78 \), respectively.
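The criterion can be sketched in Python. The sketch makes several assumptions that the text leaves open: the spectrum values are listed for the test periods 1, 2, 3, ... in order, so that the third element of every triplet is a multiple of three; an incomplete trailing triplet is discarded; and ties within a triplet are resolved in favor of the earliest test period. The function name is hypothetical.

```python
def regularity_index(spectrum):
    """Index of 3-regularity of a characteristic spectrum.

    spectrum : list of spectrum values for test periods 1, 2, 3, ... (assumed)

    Each triplet of test periods is encoded as a binary triple with a one at
    the position of its maximum, and the resulting binary string is compared
    with the periodic string 001001...; the index is the fraction of
    coinciding positions.
    """
    usable = len(spectrum) - len(spectrum) % 3  # drop an incomplete triplet
    if usable == 0:
        raise ValueError("at least three test periods are required")
    binary = []
    for start in range(0, usable, 3):
        triplet = spectrum[start:start + 3]
        peak = triplet.index(max(triplet))  # first maximum wins on ties
        binary.extend(1 if k == peak else 0 for k in range(3))
    reference = [0, 0, 1] * (usable // 3)
    matches = sum(1 for b, r in zip(binary, reference) if b == r)
    return matches / usable
```

A triplet whose maximum falls on the multiple of three contributes three coinciding positions, whereas any other triplet contributes only one, so the index computed this way ranges from 1/3 to 1.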
3.3 Results of the 2S-Approach Application to Recognizing Latent Profile Periodicity and Regularity in DNA Sequences
Here we give a number of examples of applying the 2S-approach to recognize latent profile periodicity and 3-regularity in DNA sequences.
The methods of the 2S-approach revealed latent profility of 33 bp (33-profility) in the genes of the apolipoprotein family PF01442 from Pfam (database of Protein families, http://pfam.sanger.ac.uk/) [36]. This family includes the apolipoproteins Apo A, Apo C, and Apo E, which are members of a multigene family that has probably evolved from a common ancestral gene. Apolipoproteins perform lipid transport and serve as enzyme cofactors and as ligands of cellular receptors. The family comprises more than 800 proteins from 100 different species. Figure 12a, b, c shows the characteristic spectra of the coding regions of apolipoproteins of the sea bream Sparus aurata (Apo A-I), the chicken Gallus gallus (Apo A-IV), and the mouse Mus musculus (Apo E). The maximal values in these spectra are attained at test periods that are multiples of 33 bp. According to the 2S-approach, latent 33-profility is recognized in these regions.
The well-known secondary structure of the apolipoprotein family PF01442 consists of a few pairs of alpha-helices with 11 and 22 amino acid residues. This structure correlates with the 33-bp profile periodicity of the apolipoprotein genes, since 11 amino acid residues are encoded by 33 bp. The characteristic pattern size of the latent profile periodicity in the genes of the PF01442 family possibly influences the formation of the typical secondary structure in the protein family, and it agrees with the hypothesis that the family originated from a common ancient gene.
In the characteristic spectra of coding regions, a regularity of the peaks at test periods that are multiples of three is observed (see, e.g., Fig. 12a, b, c). This is a manifestation of the first level of coding organization, which is conditioned by the triplet genetic code. A dominant peak in Fourier spectra at the frequency 0.33 frequently corresponds to this level (see, e.g., Fig. 12d). When 3-regularity is present, latent profility that is distinct from 3-profility reveals the second level of coding organization. A clear-cut maximal value in the characteristic spectrum points to this level of organization (Fig. 12a, b, c).
The latent 84-profility in the coding DNA sequence (see Figs. 8c and 9c, d) corresponds in the protein to a repeating zinc finger domain, which includes one alpha-helix and two antiparallel beta-structures. As a rule, a zinc finger domain contains about 20 amino acid residues and is stabilized by one or two zinc ions. DNA-binding transcription factors are the main group of proteins with “zinc fingers.”
Using the proposed methods of the 2S-approach, a search for 3-regularity and latent profility was carried out in 18,140 human CDSs from the KEGG database (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/) whose functional activity has experimental support. Within the statistical error of the methods, these CDSs are heterogeneous and 3-regular. Moreover, latent profile periodicity is observed in 74 % of the CDSs. The second level of encoding (different from 3-regularity and 3-profility), in which latent profility has a period length that is a multiple of three, was revealed in 11 % of the analyzed CDSs [21].
An analogous analysis was carried out for introns. The sequences of 277,477 human introns (noncoding gene parts) from the EID (The Exon-Intron Database, http://utoledo.edu/med/depts/bioinfo/database) [37] were considered. Only 3 % of them proved to be 3-regular [21]. That is, within the statistical error of the method, the absence of 3-regularity can be regarded as a characteristic property of introns.
4 Conclusion
This work has considered methods, developed within the framework of the 2S-approach, for recognizing two types of latent periodicity in DNA sequences. The first type is represented by sequences that are similar to approximate tandem repeats. The second type is based on the earlier introduced notion of latent profile periodicity (profility), which generalizes the notion of approximate tandem repeat. The presented methods of the 2S-approach allow both types to be recognized in DNA sequences.
The application of the methods that recognize DNA sequences similar to approximate tandem repeats was demonstrated by genome analysis of model organisms from the HeteroGenome database. The record structure of HeteroGenome presents data on nonoverlapping regions of latent periodicity on the chromosomes, which provides a nonredundant overview of the data. The HeteroGenome database was designed for molecular-genetic research and further study of the phenomenon of latent periodicity in DNA sequences. The analysis of data from HeteroGenome has contributed to developing the spectral–statistical approach and to recognizing a new type of latent periodicity, called latent profile periodicity. Recognizing latent profile periodicity is relevant because such periodicity can correlate with the structural–functional organization of DNA sequences and of the proteins they encode.
References
1. Benson G (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27:573–580
2. Sokol D, Benson G, Tojeira J (2007) Tandem repeats over the edit distance. Bioinformatics 23:e30–e35
3. Issac B, Singh H, Kaur H, Raghava GPS (2002) Locating probable genes using Fourier transform approach. Bioinformatics 18:196–197
4. Sharma D, Issac B, Raghava GPS, Ramaswamy R (2004) Spectral Repeat Finder (SRF): identification of repetitive sequences using Fourier transformation. Bioinformatics 20:1405–1412
5. Paar V, Pavin N, Basar I, Rosandić M, Gluncić M, Paar N (2008) Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats. BMC Bioinformatics 9:466
6. Wang L, Stein LD (2010) Localizing triplet periodicity in DNA and cDNA sequences. BMC Bioinformatics 11:550
7. Nunes MC, Wanner EF, Weber G (2011) Origin of multiple periodicities in the Fourier power spectra of the Plasmodium falciparum genome. BMC Genomics 12(Suppl 4):S4
8. Stoffer DS, Tyler DE, Wendt DA (2000) The spectral envelope and its applications. Stat Sci 15:224–253
9. Korotkov EV, Korotkova MA, Kudryashov NA (2003) Information decomposition method for analysis of symbolical sequences. Phys Lett A 312:198–210
10. Kumar L, Futschik M, Herzel H (2006) DNA motifs and sequence periodicities. In Silico Biol 6:71–78
11. Nair AS, Mahalakshmi T (2006) Are categorical periodograms and indicator sequences of genomes spectrally equivalent? In Silico Biol 6:215–222
12. Chaley M, Kutyrkin V (2008) Model of perfect tandem repeat with random pattern and empirical homogeneity testing poly-criteria for latent periodicity revelation in biological sequences. Math Biosci 211:186–204
13. Salih F, Salih B, Trifonov EN (2008) Sequence structure of hidden 10.4-base repeat in the nucleosomes of C. elegans. J Biomol Struct Dyn 26:273–281
14. Epps J (2009) A hybrid technique for the periodicity characterization of genomic sequence data. EURASIP J Bioinform Syst Biol 2009:924601
15. Glunčić M, Paar V (2013) Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm. Nucleic Acids Res 41(1):e17
16. Gelfand Y, Rodriguez A, Benson G (2006) TRDB – The Tandem Repeats Database. Nucleic Acids Res 00(Database issue):D1–D8
17. Chaley MB, Kutyrkin VA, Tuylbasheva GE, Teplukhina EI, Nazipova NN (2013) Investigation of latent periodicity phenomenon in the genomes of eukaryotic organisms. Math Biol Bioinform 8:480–501
18. Chaley M, Kutyrkin V, Tulbasheva G, Teplukhina E, Nazipova N (2014) HeteroGenome: database of genome periodicity. Database article ID bau40
19. Epps J, Ying H, Huttley GA (2011) Statistical methods for detecting periodic fragments in DNA sequence data. Biol Direct 6:21
20. Chaley MB, Kutyrkin VA (2010) Structure of proteins and latent periodicity in their genes. Moscow Univ Biol Sci Bull 65:133–135
21. Chaley M, Kutyrkin V (2011) Profile-statistical periodicity of DNA coding regions. DNA Res 18:353–362
22. Kutyrkin VA, Chaley MB (2014) Spectral-statistical approach to latent profile periodicity recognition in DNA sequences. Math Biol Bioinform 9:33–62
23. Fields S, Johnston M (2005) Cell biology. Whither model organism research? Science 307:1885–1886
24. Benson DA, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2015) GenBank. Nucleic Acids Res 43(Database issue):D30–D35
25. Boeva V, Regnier M, Papatsenko D, Makeev V (2006) Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics 22:676–684
26. Grover A, Aishwarya V, Sharma PC (2012) Searching microsatellites in DNA sequences: approaches used and tools developed. Physiol Mol Biol Plants 18:11–19
27. Gelfand Y, Hernandez Y, Loving J, Benson G (2014) VNTRseek – a computational tool to detect tandem repeat variants in high-throughput sequencing data. Nucleic Acids Res 42:8884–8894
28. Anisimova M, Pečerska J, Schaper E (2015) Statistical approaches to detecting and analyzing tandem repeats in genomic sequences. Front Bioeng Biotechnol 3:31
29. Cramer H (1999) Mathematical methods of statistics. Princeton University Press, Princeton, NJ
30. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921
31. Dieringer D, Schlötterer C (2003) Two distinct modes of microsatellite mutation processes: evidence from the complete genomic sequences of nine species. Genome Res 13:2242–2251
32. Ellegren H (2004) Microsatellites: simple sequences with complex evolution. Nat Rev Genet 5:435–445
33. Richard GF, Kerrest A, Dujon B (2008) Comparative genomics and molecular dynamics of DNA repeats in eukaryotes. Microbiol Mol Biol Rev 72:686–727
34. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M (2012) KEGG for integration and interpretation of large-scale molecular datasets. Nucleic Acids Res 40(Database issue):D109–D114
35. Chaley M, Kutyrkin V (2016) Stochastic model of homogeneous coding and latent periodicity in DNA sequences. J Theor Biol 390:106–116
36. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR et al (2014) Pfam: the protein families database. Nucleic Acids Res 42(Database issue):D222–D230
37. Shepelev V, Fedorov A (2006) Advances in the Exon-Intron Database. Brief Bioinform 7:178–185
© 2016 Springer Science+Business Media New York
Cite this protocol
Chaley, M., Kutyrkin, V. (2016). Spectral–Statistical Approach for Revealing Latent Regular Structures in DNA Sequence. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_16
DOI: https://doi.org/10.1007/978-1-4939-3572-7_16
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3570-3
Online ISBN: 978-1-4939-3572-7
eBook Packages: Springer Protocols