1 Introduction

Several genomic studies have focused on the analysis of word counts and word distances, including phylogeny studies [1], alignment-free methods [2, 3], CpG detection [4], coding region detection [5] and other DNA structure analyses [6, 7].

In the context of DNA structure analysis, non-B conformations have been shown to play an important role in DNA damage and repair, genetic instability, gene regulation, and chromatin architecture [8]. In particular, hairpin/cruciform structures are important regulators of biological processes and gene function [9].

Inverted repeats are a required feature of cruciform structures, but not all inverted repeats will form cruciforms. Cruciforms are dynamic structures that may occur when certain conditions are met, such as the coiling state of DNA, but are less stable than the normal B-DNA conformation. Although their properties and relevance in several biological processes are acknowledged, in vivo evidence of their genomic localization and mechanism of action is lacking [10, 11].

The stem and loop lengths of cruciform structures seem to vary over a wide range. According to different authors, the stem lengths vary between 6 and 100 nucleotides, while loop lengths may range from 0 to 2000 nucleotides [12,13,14]. Shorter distances could favour the occurrence of these structures, but long distances have also been reported, such as the translocation breakpoints associated with human developmental diseases or infertility [10].

Computational techniques have been used to identify DNA motifs that are known to potentially form non-B DNA structures [6, 14]. A DNA word analysis based on the distribution of the distances between adjacent symmetric words of length seven [7] showed a strong over-representation of distances up to 350, a feature that the authors considered might be associated with the potential for the occurrence of cruciform structures. Recently, the same research group extended their analysis to include distance distributions of non-adjacent inverted repeats, since adjacency is not a required condition for cruciform structures to form [15].

The present work focuses on identifying and characterizing a particular type of motif: inverted repeats whose distance distribution contains atypical frequencies (peaks) at regular intervals. The occurrence of regular peaks in the distance distribution of some symmetric word pairs was first reported in [15].

2 Methods

We aim to find structures in the human genome other than the well-known repetitive structures already described in the literature. We therefore used pre-masked sequences available from the UCSC Genome Browser (http://genome.ucsc.edu). These files contain the GRCh38 assembly sequences, in which repeats reported by RepeatMasker [16] and Tandem Repeats Finder [17] are masked with Ns.

2.1 Distance Between Symmetric Word Pairs

Consider the alphabet \(\mathscr {A}= \{\mathrm{A, C, G, T}\}\) and let w be a symbolic sequence (word) defined in \(\mathscr {A}^k\), where k is the length of w. The pair composed of one word, w, and the corresponding reversed complement word, \(w'\), is called a symmetric word pair. For example, (ACT, AGT) is a symmetric word pair.
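As a minimal illustration of this definition, the reversed complement of a word can be computed as sketched below in Python. This is not the code used in this work; the function name `reverse_complement` and the helper table `_COMPLEMENT` are hypothetical choices made for the example.

```python
# Translation table mapping each nucleotide to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Return the reversed complement w' of a word w over {A, C, G, T}."""
    # Complement every symbol, then reverse the resulting string.
    return word.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ACT"))  # AGT
```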

For a given word length k, we compute the frequency distributions of distances between occurrences of each word and all succeeding reversed complements, \(f_{w,w' \ldots w'}\) up to a maximum distance (4000 in this work).

For example, consider the following sequence:

$$\begin{aligned} \underline{\mathrm{ACT}}\mathrm{GGAA}\overline{\mathrm{AGT}}\mathrm{AAGA}\overline{\mathrm{AGT}}\underline{\mathrm{ACT}}\mathrm{TTGT} \underline{\mathrm{ACT}}\mathrm{GGG}\overline{\mathrm{AGT}}\mathrm{TTGT} \end{aligned}$$

For word \(w=\) ACT, we have, in the previous sequence, five distances to all the succeeding reversed complement words (distances 7, 14, 30, 13, and 6).

Motivated by previous work and the stem length of possible cruciform structures, and considering computational limitations, we study words of length \(k=7\). For each word w, we analyse distances up to 4000 nucleotides, but if an N symbol is found, the search for \(w'\) is stopped. To avoid the direct dependencies associated with the nucleotide composition of some words, we exclude distances shorter than k from the analysis.
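The counting procedure described in this section might be sketched in Python as follows. This is a naive illustration, not the implementation used in this work: the function names are hypothetical, the distance is taken as the difference between the start positions of the two words (which reproduces the distances of the worked example), and the search is assumed to stop as soon as the scanned window contains an N.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    return word.translate(_COMPLEMENT)[::-1]


def distance_distribution(seq: str, word: str, max_dist: int = 4000) -> Counter:
    """Count distances from each occurrence of `word` to every succeeding
    occurrence of its reversed complement, up to `max_dist`."""
    k = len(word)
    target = reverse_complement(word)
    counts = Counter()
    start = seq.find(word)
    while start != -1:
        # Distances shorter than k are excluded, so scanning begins at d = k.
        for d in range(k, max_dist + 1):
            window = seq[start + d : start + d + k]
            if len(window) < k or "N" in window:
                break  # end of sequence or masked symbol: stop this search
            if window == target:
                counts[d] += 1
        start = seq.find(word, start + 1)
    return counts


# The example sequence given above, analysed with k = 3.
example = "ACT" "GGAA" "AGT" "AAGA" "AGT" "ACT" "TTGT" "ACT" "GGG" "AGT" "TTGT"
print(sorted(distance_distribution(example, "ACT")))  # [6, 7, 13, 14, 30]
```

The nested scan costs on the order of `max_dist` comparisons per word occurrence, which is acceptable for an illustration but would likely need an indexed approach at genome scale.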

2.2 Detecting Symmetric Word Pairs with Atypical Distance Distributions

To find the symmetric word pairs with atypical (high) frequencies in the distance distribution, we developed and used a simple algorithm based on finding outliers within the distance distribution. The algorithm is the following:

  • For each symmetric pair distance distribution, compute the distances that are outliers.

    • use the MATLAB function isoutlier to find the distances whose frequencies are more than six local scaled MADs (median absolute deviations) from the local median, computed over a window of length 101;

    • verify that the mass of each candidate outlier distance is greater than ten occurrences.

  • Select as atypical the symmetric word pairs whose distance distribution contains more than ten outliers as computed above.
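The steps above can be approximated in Python as sketched below. This is not the MATLAB code used in this work. The sketch rests on several assumptions: windows are truncated at the ends of the distribution, only high frequencies are flagged, the mass condition is read as a per-distance frequency threshold, and the MAD is scaled by the usual factor of about 1.4826. The names `local_mad_outliers`, `is_atypical` and `MAD_SCALE` are hypothetical.

```python
from statistics import median

# Approximate factor that makes the MAD consistent with the standard
# deviation for normally distributed data (an assumption of this sketch).
MAD_SCALE = 1.4826


def local_mad_outliers(freq, window=101, threshold=6.0, min_count=10):
    """Return the indices of freq whose value exceeds the local median by
    more than `threshold` local scaled MADs and is larger than `min_count`."""
    half = window // 2
    flagged = []
    for i, value in enumerate(freq):
        # Window centred on i, truncated at the ends of the distribution.
        local = freq[max(0, i - half) : i + half + 1]
        med = median(local)
        mad = MAD_SCALE * median(abs(x - med) for x in local)
        if value - med > threshold * mad and value > min_count:
            flagged.append(i)
    return flagged


def is_atypical(freq, min_outliers=10, **kwargs):
    """A distribution is atypical if it has more than `min_outliers` outliers."""
    return len(local_mad_outliers(freq, **kwargs)) > min_outliers


# Toy distribution: a slowly varying background with a peak every 50 positions.
toy = [20 + (i % 3) for i in range(1000)]
for i in range(0, 1000, 50):
    toy[i] = 200
print(is_atypical(toy))  # True
```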

The application of the above algorithm to all the \(4^7\) words resulted in the identification of 247 distance distributions that were considered to have atypical (high) frequencies.

The visualization of the distance distributions with atypical frequencies revealed many words with regular peaks and with different periods of repetition of the peaks. Those observations led to the development of a method to detect periodic regularities in the distance distributions (see Sect. 2.3).

Figure 1 shows, as an example, the distance distribution of \(w=\text {CCAGCTG}\) with regular peaks spaced by 102 positions. The distances with atypical frequencies are marked with red dots.

Fig. 1 The distance distribution for word CCAGCTG, with the distances considered atypical marked by red dots

2.3 Detecting Periodic Regularities

Consider a frequency distribution f(i) of an integer variable defined for \(i=1, 2, \ldots , N\). We define a family of distributions, derived by “wrapping” f around itself, modulo n:

$$\begin{aligned} f_n(i) = \sum _{j=0}^{J(i)} f(i+jn), \end{aligned}$$
(1)

for \(i=1, 2, \ldots , n\). The upper bound of the summation is \(J(i) = \left\lfloor {\frac{N-i}{n}}\right\rfloor\).

If f contains a periodic pattern of peaks at positions \(i=a+jn\), with \(j \in \{0, 1, \ldots \}\), where n is the period and a is the position of the initial peak, then those peaks will be superimposed at the single position \(i \equiv a \pmod {n}\), with \(i \in \{1, 2, \ldots , n\}\), on the n-wrapped distribution \(f_n\). In contrast, if f also contains peaks spaced with a distinct period \(m \ne n\), then those peaks will be spread over several positions in \(f_n\). Therefore, any component peaks with period n in f will be relatively amplified and stand out against the other components in the n-wrapped distribution \(f_n\). This is the rationale for using this analysis tool.
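The wrapping operation of Eq. (1) can be illustrated with the short Python sketch below. The function name `wrap` and the toy distribution are hypothetical and are not the data of this work; the list element `f[0]` is assumed to hold \(f(1)\), so that element `i - 1` of the result holds \(f_n(i)\).

```python
def wrap(f, n):
    """Return the n-wrapped distribution f_n as a list of length n."""
    wrapped = [0.0] * n
    for index, value in enumerate(f):
        # Position i = index + 1 contributes to position ((i - 1) mod n) + 1.
        wrapped[index % n] += value
    return wrapped


# Toy distribution on 1..100: flat background with peaks at 3, 13, ..., 93.
f = [5.0] * 100
for index in range(2, 100, 10):
    f[index] = 50.0

w = wrap(f, 10)
print(w.index(max(w)) + 1, max(w))  # 3 500.0
```

In this toy case all ten peaks fall on position 3 of the ten-wrapped distribution, whereas wrapping with \(n=9\) scatters them over several positions.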

Figure 2 shows an example of a distribution f(i) defined for \(i \in \{1, 2, \ldots , 100\}\). This distribution has a total mass of 1000 and half of this mass is concentrated in positions \(i=3, 13, 23, \ldots , 93\).

Fig. 2 An example distribution f with a periodic pattern of peaks. The peaks at positions \(i \equiv 3 \pmod {10}\) are highlighted. These contain \(50\%\) of the total mass in the distribution. The maximum frequency M is shown in position \(i=83\)

Figure 3 shows four wrapped distributions obtained from the f distribution in Fig. 2. The ten-wrapped distribution (Fig. 3b) displays a distinct concentration of mass in \(f_{10}(3)\), which allows the correct identification of the period and initial position of the pattern of peaks in f. The concentration of mass vanishes with just a minimal change in the wrapping period n, as shown in Fig. 3a, c. Some concentration of mass is expected when n is a multiple of the period, as demonstrated by Fig. 3d.

Fig. 3 Wrapped distributions \(f_n\) of the distribution in Fig. 2. From left to right: \(f_{9}\), \(f_{10}\), \(f_{11}\) and \(f_{20}\) are shown. The maximum is identified in each distribution

2.3.1 Finding the Fundamental Period

To find a periodic component in a distribution f, we can generate the family of n-wrapped distributions \(f_n\) for \(n=1, 2, \ldots\) and select the one with the greatest concentration of mass in a few positions. We use the maximum frequency, \(M_f(n) = \max f_n\), as an indicator of the concentration of mass in each n-wrapped distribution. However, we expect the maxima of n-wrapped distributions to grow with decreasing n even if the original distribution has no periodic pattern of peaks. Therefore, the maxima for different periods n are not directly comparable. This is quite evident in Fig. 4 (top), which shows the maxima \(M_f(n)\) derived from the example distribution of Fig. 2. The maximum \(M_f(10)=500\) stands out, as expected, since \(n=10\) is the period of the pattern of peaks in f. However, it is surpassed by \(M_f(5)\) and \(M_f(2)\), which is unsurprising since 5 and 2 are divisors of the period.

Fig. 4 Maxima \(M_f(n)\) of the n-wrapped distributions of f (top) and the corresponding concentration scores for the same distributions (bottom)

To make the true period stand out even against its divisors, we define a concentration score by the ratio

$$\begin{aligned} s(n) = \frac{\max f_n}{\max g_n}, \end{aligned}$$
(2)

where \(g_n\) is the n-wrapped distribution of a distribution g obtained by sorting the frequencies in f in descending order. This score effectively normalizes the maxima of the n-wrapped distributions of f against those of a derived distribution from which all regularities have been removed.
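A possible Python sketch of the concentration score of Eq. (2) is given below. It is an illustration under stated assumptions rather than the implementation used in this work: the names `wrap`, `concentration_score` and `best_period` are hypothetical, the candidate periods are simply scanned from 1 up to a user-chosen bound, and the toy distribution is not the one shown in Fig. 2.

```python
def wrap(f, n):
    """Return the n-wrapped distribution of f (f[0] holds f(1))."""
    wrapped = [0.0] * n
    for index, value in enumerate(f):
        wrapped[index % n] += value
    return wrapped


def concentration_score(f, n):
    """Ratio of the maximum of f_n to the maximum of g_n, where g is f
    with its frequencies sorted in descending order."""
    g = sorted(f, reverse=True)
    return max(wrap(f, n)) / max(wrap(g, n))


def best_period(f, max_period):
    """Candidate period with the highest concentration score."""
    return max(range(1, max_period + 1), key=lambda n: concentration_score(f, n))


# Toy distribution on 1..100: flat background with peaks at 3, 13, ..., 93.
f = [5.0] * 100
for index in range(2, 100, 10):
    f[index] = 50.0

print(best_period(f, 50))  # 10
```

In this toy case the raw maxima for \(n=5\) and \(n=2\) exceed the maximum for \(n=10\), but after normalization the score is highest at the true period.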

The maxima \(\max g_n\) are shown as a thin line in Fig. 4 (top) for the example f distribution. Figure 4 (bottom) displays the corresponding concentration scores.

3 Results

The application of the method to detect regularities on the distance distribution of the 247 previously identified symmetric word pairs produced the following results.

Table 1 shows the ten words with the highest concentration scores and Table 2 lists the most frequent periods (T) found in the set of the 247 words. It may be observed in Table 1 that the symmetric word pairs have, in general, similar scores.

Table 1 The ten words with the highest concentration scores for the peak period (T)
Table 2 The most frequent periods (# of words \(\ge 10\)) and the corresponding mean and median concentration scores

The analysis of the data in Table 2 shows that there are several distinct symmetric word pairs whose distance distributions contain peak regularities with the same period.

This raises a question: are the sequences that lead to these periodic peaks spread over the entire genome, or are they localized in specific chromosomes?

The periods 44, 61 and 84, which have the highest median concentration scores s(T), were selected for a local genomic analysis to find the positions of the sequences that give rise to the regular peaks in the distance distribution.

From the analysis of the local distribution of the sequences that give rise to the regular peaks, it was found that, from the set of words with period 44, only four words (ATGGTGA, CACCATG, CATGGTG and TCACCAT) had a significant number of occurrences and that those occurred mainly in chromosome 7. For the words with period 61, only two (GCAGACT and AGTCTGC) had a significant number of occurrences, mainly in the X and Y chromosomes. For the words with period 84, only 12 words were considered relevant, and they occur mainly in chromosome 19 (see Table 3).

Table 3 Words with a peak period of 84 and a significant number of occurrences

Figures 5, 6 and 7 show the positions in the relevant chromosomes of the selected words for periods 44, 61 and 84.

Fig. 5 Positions of the first peak for four words with peak period of 44 (in chromosome 7)

Fig. 6 Positions of the first peak for two words with peak period of 61 (in chromosome X)

Fig. 7 Positions of the first peak for 12 words with peak period of 84 (in chromosome 19)

4 Conclusion

We studied the occurrence of regular peaks (over-representation) of some distances between symmetric words.

The results of this work revealed sets of words with an unusual distribution of distances to the corresponding reversed complements and with distinct periods of peak regularity.

Since we use masked sequences, the observed regularities are, to the authors' knowledge, not due to the known repetitive structures in the human genome and may indicate possible sites for the occurrence of cruciform structures.

We developed a method for detecting periodic regularities in distance distributions that is also able to find the fundamental period of the regularities.

A local analysis was carried out for some symmetric word pairs and it was found that the regular periodic pattern of peaks of the distance distribution occurs mostly at some regions of a single chromosome. Moreover, the symmetric word pairs with the same period of the regular peaks tend to occur in the same chromosome(s).

We expect this analysis to help clarify the possible association between the features of distances between symmetric words and the occurrence of cruciform structures.

To the authors' knowledge, the regularly spaced inverted repeats found in this work are a novel genomic feature. We believe that this new feature may be associated with the potential for the occurrence of more complex non-B conformations of medium or long length.