Introduction

Atypical Structure of Human Y Chromosome

One of the challenging problems in genomics concerns the evolutionary development of the Y chromosome. The Y chromosome has a unique role in human population genetics, with properties that distinguish it from all other chromosomes (Mitchell et al. 1985; Jobling and Tyler-Smith 2003; Skaletsky et al. 2003). The prevailing theory is that the X and Y chromosomes evolved from a pair of autosomes (Muller 1914; Ohno 1967; Graves 1995; Lahn and Page 1999; Marshall Graves 2006). Lack of recombination between the nonrecombining parts of the X and Y chromosomes was thought to be responsible for decay of the Y-linked genes, the pace of which slows over time, eventually leading to a paucity of genes. Identification of distinct palindromes harboring several distinct gene families unique to the long arm of the Y chromosome, together with frequent gene conversion and multiplication, has raised some doubt about progressive decay of the Y chromosome (Kuroda-Kawaguchi et al. 2001; Skaletsky et al. 2003; Rozen et al. 2003; Ali and Hasnian 2003; de Knijff 2006). It was shown that the Y chromosome has acquired a large number of testis-specific genes during the course of evolution, including those essential for spermatogenesis (Saxena et al. 1996; Silber and Repping 2002; Skaletsky et al. 2003).

Considerations of the atypical structure of the human Y chromosome have largely focused on gene-related content. However, the human Y chromosome is also replete with pronounced repetitive sequences, and multicopy gene arrays are embedded in palindromes (Tyler-Smith 1985; Wolfe et al. 1985; Tyler-Smith and Brown 1987; Oakey and Tyler-Smith 1990; Cooper et al. 1993a; Skaletsky et al. 2003; Rozen et al. 2003; Perry et al. 2007; Kirsch et al. 2008).

Alphoid Higher Order Repeats

Alphoid arrays in centromeres of human and other mammal chromosomes consist of tandem repeats of AT-rich alpha satellites (Maio 1971; Manuelidis and Wu 1978; Mitchell et al. 1985; Tyler-Smith 1985; Willard 1985; Waye and Willard 1987; Tyler-Smith and Brown 1987; Romanova et al. 1996; Warburton and Willard 1996; Warburton et al. 1996; Choo 1997; Alexandrov et al. 2001; Rudd et al. 2006). Stretches of alpha satellites lacking any higher-order periodicity mutually diverge by ~20–35% and are referred to as monomeric (Warburton and Willard 1996).

Higher order repeats (HORs) are defined as a higher order periodicity pattern superimposed on the approximately periodic tandem of alpha monomers: if an array of n monomers denoted by \( 1, 2, \ldots, n \) is followed by the next array of monomers denoted by \( n + 1, n + 2, \ldots, 2n \), where monomer 1 is almost identical (more than 95%) to monomer n + 1, monomer 2 to monomer n + 2, and so on up to monomer n and monomer 2n, these arrays belong to the nmer HOR (Warburton and Willard 1996). The HOR copies from the same locus diverge from each other by <5%, while the alpha satellite copies within any HOR copy diverge from each other by ~20–35% (Warburton and Willard 1996).
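The definition above can be illustrated with a minimal Python sketch. It is not the procedure used in this study: it assumes that the monomers are already extracted, equal in length, and aligned position by position, so that identity reduces to counting matching bases. The function names `identity` and `is_nmer_hor` are hypothetical, introduced only for this illustration.

```python
from typing import Sequence


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned monomers.

    Assumption of this sketch: no indels, so no alignment step is needed.
    """
    if len(a) != len(b) or not a:
        raise ValueError("monomers must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def is_nmer_hor(monomers: Sequence[str], n: int, threshold: float = 0.95) -> bool:
    """Return True if every monomer i is more than `threshold` identical to monomer i + n."""
    if n <= 0 or len(monomers) < 2 * n:
        return False  # at least two consecutive arrays of n monomers are required
    return all(
        identity(monomers[i], monomers[i + n]) > threshold
        for i in range(len(monomers) - n)
    )
```

With two toy monomers `a` and `b` that differ strongly from each other, the array `[a, b, a', b']` (where the primed copies differ by a single base) is recognized as a 2mer HOR but not as a 1mer repeat.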

Alphoid HORs are chromosome-specific (Willard 1985; Jorgensen et al. 1986; Willard and Waye 1987; Haaf and Willard 1992; Warburton and Willard 1996; Choo 1997). A type of polymorphism found in alphoid arrays involves HOR units that differ by an integral number of monomers (monomer insertion or deletion), but nonetheless closely related in sequence (Haaf and Willard 1992; Warburton and Willard 1996).

Investigations using restriction endonuclease digestion have revealed a major block of alphoid DNA in the centromeric region of human Y chromosome (Mitchell et al. 1985; Tyler-Smith 1985; Wolfe et al. 1985; Tyler-Smith and Brown 1987; Cooper et al. 1993a, b). The size of this alphoid block was found to be polymorphic, widely varying between different individuals (Tyler-Smith and Brown 1987; Oakey and Tyler-Smith 1990). Initially, a 5.7 kb HOR unit was reported as a major variant of secondary periodicity and 6.0 kb HOR unit as a minor variant. These HOR units were associated with 34mer and 36mer, respectively (Tyler-Smith and Brown 1987). In a more recent study, a 5941 bp secondary periodicity (35 alphoid repeat units) was reported (Skaletsky et al. 2003).

The alpha satellite DNA can be considered as a paradigm for processes of concerted evolution in tandemly repeated DNA families (Willard and Waye 1987; Willard 1991; Warburton and Willard 1996).

Bioinformatics Studies of Alphoid HORs

During the last decade, sequence contigs spanning the junction at the edges of the centromere DNA array have become available for bioinformatics analyses (Rudd et al. 2003; Skaletsky et al. 2003; Rosandić et al. 2003a, b, 2006; Rudd and Willard 2004; Paar et al. 2005, 2007; Ross et al. 2005; Nusbaum et al. 2006). However, major gaps still remain in the centromeric regions of chromosomes (Schueler et al. 2001; Henikoff 2002; Rudd and Willard 2004). In most cases, only peripheral HOR copies at the edges of the centromeric region are accessible. Previously, Rudd and Willard (2004) analyzed the Build 34 assembly using a combination of BLAST (Altschul et al. 1990) and DOTTER (Sonnhammer and Durbin 1995), and reported the presence of HORs. Recently, using Tandem Repeats Finder (TRF) (Benson 1999) and other standard bioinformatics tools, Gelfand et al. (2007) and Warburton et al. (2008) studied human HORs in more detail.

In a different approach, we have shown that the Key String Algorithm (KSA) and its extension, the Global Repeat Map (GRM), are effective in identifying and analyzing the intrinsic structure of HORs (Rosandić et al. 2003a, b, 2006; Paar et al. 2005, 2007). By applying KSA and GRM to the NCBI human genome assembly, the detailed structure of known and some new human alphoid HORs was determined.

Comparison of Human and Chimpanzee Genome Sequences

To understand the genetic basis of unique human features, the human and chimpanzee genomes have been compared in a number of studies (King and Wilson 1975; Sibley and Ahlquist 1987; Laursen et al. 1992; Haaf and Willard 1997, 1998; Chen and Li 2001; Pennacchio and Rubin 2001; Fujiyama et al. 2002; Olson and Varki 2003; Boffelli et al. 2003; Webster et al. 2003; Watanabe et al. 2004; Cheng et al. 2005; Khaitovich et al. 2005; Mikkelsen et al. 2005; Newman et al. 2005; Bailey and Eichler 2006; Patterson et al. 2006; Varki and Altheide 2005; Kuroki et al. 2006; Ebersberger et al. 2007; Kehrer-Sawatzki and Cooper 2007; Perry et al. 2008; Varki et al. 2008; Liu et al. 2009). Large variation in sequence divergence was often seen among genomic regions. For example, the last intron of the ZFY gene showed only 0.69% divergence between human and chimpanzee (Dorit et al. 1995), whereas for the OR1D3P pseudogene a divergence of 3.04% was found (Glusman et al. 2000). Thus, to have reliable estimates of the average divergences between hominoid genomes, it was concluded that sequence data from many genomic regions are needed (Chen and Li 2001). Estimates of divergence due to nucleotide substitutions were about 1.24% between selected intergenic nonrepetitive DNA segments in humans and chimpanzees, substantially lower than previous ones, of about 3%, which included repetitive sequences (Chen and Li 2001; Fujiyama et al. 2002; Ebersberger et al. 2002; Mikkelsen et al. 2005). A greater sequence divergence (1.78%) was obtained between reported finished sequence of the chimpanzee Y chromosome (PTRY) and the human Y chromosome (Kuroki et al. 2006). Comparing the DNA sequences of unique, Y-linked genes in chimpanzee and human, evidence was found that in the human lineage all such genes were conserved, and in the chimpanzee lineage, by contrast, several genes have sustained inactivating mutations (Hughes et al. 2005).

On the other hand, the overall sequence divergence, taking regions of indels into account, was estimated to be approximately 5% (Britten 2002; Britten et al. 2003; Cheng et al. 2005; Gibbs et al. 2007). In some short stretches of the human and chimpanzee genomes, so-called human-accelerated regions, a significant increase of substitution divergence was found (Pollard et al. 2006a, b; Popesco et al. 2006; Prabhakar et al. 2006; Pollard 2009). Moreover, based on phylogenetic analysis of a large number of DNA sequence alignments from human and chimpanzee, it was found that for a sizeable fraction of our genome we share no immediate genetic ancestry with chimpanzee (Ebersberger et al. 2007).

Experimental evidence suggests that a progenitor of suprachromosomal alphoid family 3 was established and dispersed to chimpanzee chromosomes homologous to human chromosomes 1, 11, 17 and X prior to the human–chimpanzee split (Durfy and Willard 1990; Baldini et al. 1991; Willard 1991; Warburton et al. 1996). Notably, the alphoid HOR organization in the X chromosome has been conserved (Durfy and Willard 1990); only the localization of the suprachromosomal family (SF) 3 alpha satellite is substantially conserved. It was concluded that the lack of sequence or HOR conservation among human and chimpanzee indicates that most alpha satellite sequences do not evolve orthologously.

In a recent publication, Hughes et al. (2010) have shown by sequence comparison of human and chimpanzee MSY that humans and chimpanzees differ radically in sequence structure and gene content. It was concluded that, since the separation of human and chimpanzee lineages, sequence gain and loss have been far more concentrated in the MSY than in the balance of the genome, indicating accelerated structural remodeling of the MSY in the chimpanzee and human lineages during the past six million years.

The previously reported 35mer alphoid HOR in the human Y chromosome (Tyler-Smith and Brown 1987; Warburton and Willard 1996; Skaletsky et al. 2003) involves the largest alphoid HOR unit found in the human genome, and it is of particular interest to look for divergence between the alphoid HORs in the human and chimpanzee Y chromosomes. An alphoid HOR in the chimpanzee Y chromosome has not yet been reported.

Given the potentially important information regarding the evolutionary role of the human and chimpanzee Y chromosomes, the availability of their genomic sequences (Skaletsky et al. 2003; Mikkelsen et al. 2005), and the demanding task of studying such long HOR units bioinformatically, we perform here an extensive study applying the novel, robust bioinformatics tool GRM. We investigate the major alphoid HOR from the Build 37.1 assembly of the human Y chromosome and determine its detailed monomer scheme and consensus sequence, finding a riddling pattern not reported previously. In the chimpanzee Y chromosome we identify and analyze an alphoid HOR for the first time. We find that the human and chimpanzee HORs differ substantially, both in the size and composition of HOR units and in the constituent monomer structure.

Furthermore, we identify and investigate in human and chimpanzee Y chromosomes more than 20 other tandems, HORs and regularly dispersed repeats based on large repeat units, showing sizeable human–chimpanzee divergence. Most of these repeats are reported here for the first time.

Materials and Methods

Key String Algorithm

Despite powerful standard bioinformatics tools, it remains difficult to identify and analyze large repeat units. For example, the detection limit of TRF is 2 kb (Gelfand et al. 2007; Warburton et al. 2008). Here, we use a new approach that is particularly useful for very long and/or complex repeats.

The KSA framework is based on the use of a freely chosen short sequence of nucleotides, called a key string, which cuts a given genomic sequence at each location of the key string within genomic sequence. Going along genomic sequence, the lengths of ensuing KSA fragments form KSA length array. Such array could be compared to an array of lengths of restriction fragments resulting from a hypothetical complete digestion, cutting genomic sequence at recognition sites corresponding to KSA key string. Any periodicity appearing in the KSA length array enables identification and location of repeat in a given genomic sequence. Analysis of repeat sequences at position of any periodicity in the KSA length array gives consensus repeat unit and divergence of each repeat copy with respect to consensus. Any presence of higher order periodicity in the KSA length array reveals the presence of HOR at that location and enables determination of consensus HOR repeat unit and divergence of each HOR copy with respect to consensus.
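The segmentation described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the KSA implementation used in this study: fragments are taken as distances between consecutive key string occurrences (the head and tail of the sequence are dropped), and periodicity is detected only as runs of exactly equal lengths, whereas the real method tolerates insertions, deletions, and substitutions. The names `ksa_fragment_lengths` and `periodic_runs` are hypothetical.

```python
from typing import List, Tuple


def ksa_fragment_lengths(sequence: str, key: str) -> List[int]:
    """Cut `sequence` at every occurrence of `key`; return the fragment length array."""
    positions = []
    start = sequence.find(key)
    while start != -1:
        positions.append(start)
        start = sequence.find(key, start + 1)  # step by 1 so overlapping hits are kept
    return [b - a for a, b in zip(positions, positions[1:])]


def periodic_runs(lengths: List[int], min_copies: int = 3) -> List[Tuple[int, int, int]]:
    """Return (start index, fragment length, copy count) for runs of equal lengths."""
    runs = []
    i = 0
    while i < len(lengths):
        j = i
        while j + 1 < len(lengths) and lengths[j + 1] == lengths[i]:
            j += 1
        if j - i + 1 >= min_copies:
            runs.append((i, lengths[i], j - i + 1))
        i = j + 1
    return runs


if __name__ == "__main__":
    toy = "TTT" + "GATTACAGGC" * 5 + "CCCC"  # toy tandem of a 10 bp unit
    print(periodic_runs(ksa_fragment_lengths(toy, "GATT")))
```

A higher order periodicity would show up in the same length array as a repeating pattern of several different lengths rather than a run of one length; detecting that pattern is beyond this sketch.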

Similarly, with a proper choice of key string, KSA fragments a given tandem repeat into monomers, cuts Alu sequences at two identical positions (providing identification of Alu sequences), cuts a palindrome (providing identification of large palindrome sequences and their substructure), and so on. KSA provides a straightforward ordering of KSA fragments, regardless of their size (from small fragments of a few bp to as large as tens of kilobasepairs). KSA provides a high degree of robustness and requires only modest computation on a PC. Due to its robustness, KSA is effective even in cases of significant deletions, insertions, and substitutions, providing detailed HOR annotation and structure, consensus sequence, and exact consensus length in a given genomic sequence even if it is highly distorted, intertwined, and riddled (segmentally fuzzy repeats). Using a HOR consensus sequence, KSA computes finer characteristics in the next step, for example the SF classification and CENP-B box/pJα distributions.

Global Repeat Map

The GRM program is an extension of KSA framework. GRM of a given genomic sequence is executed in five steps.

Step 1:

GRM-Total module Computes the frequency versus fragment length distribution for a given genomic sequence by superposing results of consecutive KSA segmentations computed for an ensemble of all 8 bp key strings (\( 4^{8} = 65536 \) key strings). In the GRM diagram, each pronounced peak corresponds to one or more repeats at that length, tandem or dispersed. GRM computation is fast and can easily be executed for a human chromosome on a PC.

Step 2:

GRM-Dom module Determines dominant key string corresponding to fragment length for each peak in the GRM diagram from the step 1. A particular 8 bp key string (or a group of 8 bp key strings) that gives the largest frequency for a fragment length under consideration is referred to as dominant key string.

Step 3:

GRM-Seg module Performs segmentation of a given genomic sequence into KSA fragments using dominant key string from the step 2. Any periodic segment within the KSA length array reveals the location of repeat and provides genomic sequences of the corresponding repeat copies.

Step 4:

GRM-Cons module Aligning all sequences of repeat copies from step 3 constructs the consensus sequence.

Step 5:

NW module Computes divergence between each repeat copy from step 3 and consensus sequence from step 4 using Needleman–Wunsch algorithm (Needleman and Wunsch 1970).
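Steps 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the GRM program itself. Instead of looping over all key strings, it indexes only the key strings that actually occur in the sequence; key strings that are absent or occur once produce no fragments, so the resulting distribution is the same. Fragment lengths are taken as distances between consecutive occurrences of the same key string. The names `grm_total` and `dominant_key` and the parameter `k` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Tuple


def grm_total(sequence: str, k: int = 8) -> Tuple[Counter, Dict[str, Counter]]:
    """Frequency versus fragment length, superposed over all k bp key strings.

    Returns the total distribution and the per-key-string distributions.
    """
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)

    total: Counter = Counter()
    per_key: Dict[str, Counter] = {}
    for key, pos in positions.items():
        lengths = Counter(b - a for a, b in zip(pos, pos[1:]))
        per_key[key] = lengths
        total.update(lengths)
    return total, per_key


def dominant_key(per_key: Dict[str, Counter], length: int) -> str:
    """Key string giving the largest frequency at the given fragment length.

    Ties are resolved by first occurrence in the sequence.
    """
    return max(per_key, key=lambda key: per_key[key][length])


if __name__ == "__main__":
    toy = "GATTACAGGC" * 6  # toy tandem of a 10 bp unit; k is reduced to fit it
    total, per_key = grm_total(toy, k=4)
    peak_length, _ = total.most_common(1)[0]
    print(peak_length, dominant_key(per_key, peak_length))
```

The dominant key string found this way would then be used for the segmentation in step 3; steps 4 and 5 (consensus construction and Needleman–Wunsch divergence) are not covered by this sketch.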

Regarding the 8 bp choice of key string size: using an ensemble of r-bp key strings, the average length of KSA fragments is ~\( 4^{r} \). With increasing length of key strings, the overall frequency of large fragment lengths increases. We tested that the 8 bp key string ensemble is suitable for identification of repeat units in a wide range of lengths, from ~10 bp to as much as ~100 kb. However, from the GRM construction it follows that fully reliable results are obtained for key string lengths not exceeding the repeat length under study.
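The dependence on key string size can be made concrete with a short calculation. It rests on a simplifying assumption that is not stated in the text: in a random sequence with four equally frequent, independent bases, a given r-bp key string is expected once per \( 4^{r} \) positions, which sets the average fragment length. Real genomic sequence deviates from this model. The function name `expected_fragment_length` is hypothetical.

```python
def expected_fragment_length(r: int, alphabet_size: int = 4) -> int:
    """Expected spacing between occurrences of one r-bp key string.

    Assumes independent, equally frequent bases (an idealized random sequence).
    """
    if r <= 0:
        raise ValueError("key string length must be positive")
    return alphabet_size ** r


if __name__ == "__main__":
    for r in (4, 6, 8, 10):
        print(f"r = {r:2d} bp -> ~{expected_fragment_length(r)} bp")
```

For r = 8 this gives 65536 bp, which is also the number of distinct 8 bp key strings in the ensemble.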

In summary, the characteristics of GRM are:

  • robustness of the method with respect to deviations from perfect repeats, i.e., substitutions, insertions, and deletions;

  • use of ensemble of all 8 bp key strings as a starting point of algorithm, thus avoiding the need to choose a particular key string for any repeat structure;

  • straightforward identification of repeats (tandem and dispersed), applicable to very large repeat units, as large as tens of kilobasepairs;

  • easy identification of HORs and determination of consensus lengths and consensus sequences.

Results and Discussion

Using the GRM algorithm, we have identified and analyzed tandem repeats, HORs, and regularly dispersed repeats with large repeat units in the human and chimpanzee Y chromosomes (Build 37.1 and Build 2.1 assemblies, respectively). A summary of all large repeat units identified and analyzed in this article, together with the human–chimpanzee comparison, is given in Tables 1, 2, and 3.

Table 1 Tandem repeats, HORs and dispersed repeats with large repeat units in contigs of human Y chromosome
Table 2 Tandem repeats, HORs and dispersed repeats with large repeat units in contigs of chimpanzee Y chromosome
Table 3 Correspondence of large repeat and HOR units in Y chromosome contigs of human and chimpanzee

Alphoid Higher Order Repeat Units in Human and Chimpanzee Y Chromosome

Riddled HOR Scheme with 45 Distinct Alphoid Monomers in Human Y Chromosome

The largest repeat array in the human Y chromosome assemblies studied here is the major alphoid HOR array which, as will be shown here, strongly diverges from the chimpanzee alphoid HOR. For this reason, we first present our results for alphoid HORs. In the contig NT_087001.1 in the centromere of human chromosome Y and in NT_011878.9 in the pericentromeric region on the proximal side of the p arm (DYZ3 locus), we identify the peripheral segments of the major block of the alphoid HOR array. In the spacing between these two contigs lies a large central section of this HOR array. This spacing of ~3 Mb has not been sequenced so far in the Build 37.1 assembly. The GRM results for the alphoid monomer structure of the two peripheral HOR segments are shown in Fig. 1 and Supplementary Table 1. In Fig. 1, we use a method of schematic presentation described by Rosandić et al. (2006).

Fig. 1

Schematic presentation of aligned monomer structure of 45mer alphoid HOR (consensus length 7662 bp) in human chromosome Y (Build 37.1). This method of schematic presentation of HOR sequences is self-evident if one compares Fig. 1 and Supplementary Table 1. Top: enumeration of columns corresponding to the 45 constituent consensus monomers (enumerated Nos. 1 to 45) in consensus HOR. (For simplicity, only every fifth number is shown.) Each HOR copy is presented by a bar in the corresponding column enumerated at the top. Monomers from different HOR copies corresponding to the same monomer from consensus HOR are presented by bars in the same column corresponding to its enumeration at the top. For example, in the first HOR copy the first monomer corresponds to monomer No. 6 in consensus HOR and is presented by a bar at the position of the 6th column (denoted by \( 6_{1} \)), the second monomer in the first HOR copy corresponds to monomer No. 7 in consensus HOR and is presented by a bar at the position of the 7th column…, the fourth monomer in the first HOR copy corresponds to monomer No. 15 in consensus HOR and is presented by a bar at the position of the 15th column…, and the last monomer in the first HOR copy (the 23rd) corresponds to monomer No. 45 in consensus HOR and is presented by a bar at the position of the 45th column. Upper panel: HOR copies in contig NT_011878.9. Lower panel: HOR copies in contig NT_087001.1. Middle panel: The 5941 bp secondary periodicity sequence from Skaletsky et al. (2003) mapped into alphoid monomers {m}. For mapping of {w}-monomers from Skaletsky et al. (2003) into {m}-monomers, see the text and Supplementary Tables 2–4. Open circle: pJα motif (essential part) in alpha monomers. The m05 monomer from the last incomplete HOR copy (\( 5_{6} \)) in contig NT_087001.1 is followed by an alpha satellite monomeric region (not shown here).
a After m08: 210 bp insertion (no similarity to HOR monomers); b after m13–m16 duplication (inserted after m17) there are two insertions: 170 bp insertion (differing in 19 bases from m24 and m34 as the closest monomers from HOR) and 168 bp insertion (differing in 20 bases from m28 as the closest monomer from HOR); c after m40: 278 bp insertion (no similarity to HOR monomers); d after the first 34 bases from m15: end of the contig NT_011878.9; e the last 166 bases of m14: start of the contig NT_087001.1; f after m17: 311 bp insertion (no similarity to HOR monomers); g after m36: 171 bp insertion (differing in 13 bases from m23 as the closest monomer from HOR); h, i two deletions in w20; j 53 bp nonalphoid insertion in w29

In each of these two segments we identify 45 distinct alphoid monomers, denoted \( \text{m01}, \ldots, \text{m45} \), arranged head-to-tail in the same orientation and mutually diverging by ~20%. The consensus length of this 45mer HOR is 7662 bp. Here, an alphoid monomer is assigned as a constituent of the HOR if it appears in at least two HOR copies at a very low mutual divergence. Consensus sequences of monomers forming the HOR are shown in Supplementary Table 2. In both contigs, the consensus sequences of monomers constituting the HOR are equal, reflecting the fact that they are two peripheral segments of the same HOR array (Table 4).

Table 4 Riddled pattern with variety of number of monomers in human alphoid HOR copies (Build 37.1 assembly)

Divergence between monomers in individual HOR copies and the corresponding consensus monomers is very low (on the average 0.3%). However, the HOR structure is characterized by some pronounced monomer deletions and insertions, giving a riddled pattern (Table 4) due to a variety of lengths of HOR copies (Fig. 1). We find monomer deletions in seven HOR copies, monomer insertions in two, and nonalphoid insertions of 0.2 to 0.3 kb in three HOR copies. (In some HOR copies there are multiple insertions and/or deletions.)

Two out of ten HOR copies contain the 10-alphoid-monomer subsequence \( \text{m24}, \ldots, \text{m33} \) (Fig. 1). These ten monomers are positioned between the monomers m23 and m34. The distance between the two highly identical 10-alphoid-monomer subsequences is ~3 Mb.

The other 35 alphoid monomers from 45 distinct alphoid monomers in the peripheral region of major alphoid HOR form a subsequence, consisting of two segments, additionally riddled at some positions. Each of these 35 alphoid monomers appears in three or more HOR copies (Fig. 1). If we delete the 10-alphoid-monomer subsequence from the 45mer, we obtain a 5957 bp 35mer, which is similar to the secondary periodicity sequence of 5941 bp reported in (Skaletsky et al. 2003).

Discussing the relationship of the initially reported 5.7 and 6.0 kb repeat units, Tyler-Smith and Brown proposed that one HOR unit is derived from the other, although more complex explanations, with both units derived from a third unknown HOR unit, were considered possible (Tyler-Smith and Brown 1987). It was considered very unlikely that the 6.0 kb unit arose from a 5.7 kb unit by addition of two alphoid monomers, because results excluded the possibility that the two additional alphoid monomers in the 6.0 kb unit are duplications of any monomers contained in the 5.7 kb unit (Tyler-Smith and Brown 1987). Therefore, the favored hypothesis was that the shorter, 5.7 kb HOR unit arose from the longer 6.0 kb HOR unit by deletion of two alpha monomers. Extending similar considerations to the present case, the 35mer in the internal centromere region could be considered as arising from the 45mer by deletion of ten alphoid monomers, all of which are distinct from the monomers in the 35mer. This is consistent with a general view (Warburton and Willard 1996) that a type of polymorphism found in alphoid arrays can be related to HOR units that differ by an integral number of alphoid monomers.

Divergence pattern provides an additional evidence that ten additional alphoid monomers \( {\text{m24}}, \ldots ,{\text{m33}} \) are constituents of major HOR. Mutual divergence between these ten monomers is similar to their mean divergence with respect to the other 35 monomers (Table 5).

Table 5 Average divergence between two subsets of alphoid monomers from 45mer HOR copies

Suprachromosomal Family Assignment of Monomers in 45mer HOR

Sequence comparison of alpha satellite monomers in human chromosomes revealed 12 types of monomers, forming five suprachromosomal families (SFs), which descend from two basic subsets of monomers, A and B: the SF types J1, D2, W4, W5, M1, and R1 belong to subset A, and J2, D1, W1, W2, W3, and R2 belong to subset B (Romanova et al. 1996; Warburton and Willard 1996; Alexandrov et al. 2001). We determine the SF assignments of the monomers constituting the alphoid HOR by pairwise comparison of every monomer from the HOR with each of the 12 SF consensus monomers from Romanova et al. (1996), which yields a 45 × 12 divergence matrix. To each monomer from the HOR we assign the SF classification of the most similar SF consensus monomer. In this way we find that, out of forty-five monomers from the HOR, forty monomers are of M1 type (in most cases the second lowest divergence corresponds to R2, and in three cases the M1 and R2 divergences are equal), and five are of R2 type (in these cases the second lowest divergence corresponds to M1 type).
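The assignment rule (each HOR monomer receives the type of its least divergent SF consensus monomer) can be sketched as follows. The sketch makes two assumptions that are not taken from this study: divergence is approximated by unit-cost edit distance divided by the longer sequence length, and ties go to the consensus listed first. The sequences in the usage example are short toy strings, not the consensus monomers of Romanova et al. (1996), and the names `edit_distance`, `divergence`, and `assign_sf` are hypothetical.

```python
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion from a
                           cur[j - 1] + 1,           # insertion into a
                           prev[j - 1] + (x != y)))  # match or substitution
        prev = cur
    return prev[-1]


def divergence(a: str, b: str) -> float:
    """Edit distance normalized by the longer sequence (a simplification)."""
    return edit_distance(a, b) / max(len(a), len(b))


def assign_sf(monomers: Dict[str, str], sf_consensus: Dict[str, str]) -> Dict[str, str]:
    """Assign each monomer the SF type of the least divergent consensus monomer."""
    assignment = {}
    for name, seq in monomers.items():
        row = {sf: divergence(seq, cons) for sf, cons in sf_consensus.items()}
        assignment[name] = min(row, key=row.get)  # one row of the divergence matrix
    return assignment


if __name__ == "__main__":
    toy_consensus = {"M1": "ACGTACGTAC", "R2": "TTTTGGGGCC"}  # toy data only
    toy_monomers = {"m01": "ACGTACGTAA", "m02": "TTTTGGGGCA"}
    print(assign_sf(toy_monomers, toy_consensus))
```

Keeping the full row of divergences, rather than only the minimum, would also expose the second lowest divergence and ties such as the three equal M1 and R2 cases noted above.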

The differences between A and B subsets are, in general, concentrated in a small region which matches functional protein binding sites for pJα in subset A and for CENP-B in subset B (Romanova et al. 1996). Analyses of human genome have indicated that a CENP-B box appears in the subset B monomers (in about 60% of B-type monomers) and is absent in the subset A monomers; while the pJα motif would occur only in some of monomers from the subset A and not in the subset B monomers (Romanova et al. 1996).

After determining the SF classification of monomers in the consensus HOR, we investigate the appearance of the CENP-B box and pJα motif in these monomers. We find that the pJα motif (essential part) is present in 55% of ten new alphoid monomers and, similarly, in 57% of the other 35 monomers, while the CENP-B box is completely absent (Fig. 1). The consensus HOR has a robust pJα distribution, containing 25 pJα motif copies. All alphoid monomers in the consensus HOR are significantly more similar to the pJα motif than to the CENP-B box: the mean deviation is 0.6 bp for the pJα motif and 4.7 bp for the CENP-B box, indicating that the absence of the pJα motif in some monomers of the 45mer HOR can be attributed mostly to a single nucleotide mutation within what was initially a pJα motif.

Since the pJα motif is essential for protein binding, an interesting question is whether the monomers with and without pJα motif have different sequence divergences. In this respect, pairwise divergence among 45 monomers shows no dependence on the presence or absence of the pJα motif.

It should be noted that HOR copies in chromosome Y are the only reported case where pJα motif is present and CENP-B box absent.

In this connection, we note a unique case of 13mer HOR (2214 bp consensus length) in chromosome 5, which contains neither CENP-B box nor pJα motif (Rosandić et al. 2006).

Alignment of Peripheral and Internal Human HOR Copies

Let us now compare our consensus HOR for the peripheral parts of major HOR alphoid block (DYZ3 locus) (Supplementary Table 2) to the 5941 bp secondary periodicity sequence in its internal part reported by Skaletsky et al. (2003) which corresponds to the sequence gap between the contigs NT_011878.9 and NT_087001.1 in the Build 37.1 assembly.

First, we fragment the 5941 bp sequence from Skaletsky et al. (2003) into 35 constituent alpha monomers, denoted \( \text{w01}, \ldots, \text{w35} \) (Supplementary Table 3). We find a peculiar feature of this secondary periodicity sequence: two of its constituent monomers, w20 and w29, exhibit sizeable length deviation from the alpha satellite consensus length of 171 bp: the alphoid monomer w20 has a length of 104 bp (i.e., 67 nucleotides are deleted with respect to consensus alpha monomer length) while the monomer w29 is 224 bp long, containing a 53 bp nonalphoid insertion with respect to consensus alpha monomer.

To align the internal monomer sequence {w} (Supplementary Table 3) to the peripheral monomer sequence {m} (Supplementary Table 2), we shift the start position of alpha monomers \( \text{m01}, \text{m02}, \ldots, \text{m45} \), obtaining the sequence denoted by \( \text{n01}, \text{n02}, \ldots, \text{n45} \) (Table 6). The 35 alphoid monomers from the sequence {w} are aligned to 35 out of 45 monomers {n} (Table 6 and Supplementary Table 4). The sequences \( \text{n26}, \ldots, \text{n35} \) have no counterpart in the {w} sequence, which corresponds to the internal part of the major alphoid HOR from Skaletsky et al. (2003).

Table 6 Transformation between monomer sets {m} and {n} and alignment between alphoid monomer sets {w} and {n}

Global Repeat Map for Riddled Alphoid HOR and Characteristic HOR-Signature in Human Chromosome Y

To investigate more closely the major alphoid HOR array in human chromosome Y, we compute the GRM diagram for genomic sequence of Y chromosome (Fig. 2). The most pronounced peaks in this diagram correspond to following tandem repeats in chromosome Y: the alphoid repeats (GRM peaks at multiples of the ~171 bp repeat unit), the 125 bp repeats (GRM peaks at multiples of the 125 bp repeat unit), GRM peaks at multiples of 5 bp repeat unit and GRM peaks corresponding to ~20.3 kb repeat unit. In addition, there are nine pronounced GRM peaks at repeat lengths above 2000 bp.

Fig. 2

GRM diagram for Build 37.1 genomic assembly of human chromosome Y for the intervals of fragment lengths: a 0–1500 bp. There are two pronounced tandem arrays with repeat units below 1.5 kb: the alphoid tandem repeat with alpha satellite repeat unit of 171 bp and the overlapping tandem repeat with repeat unit of 125 bp. The peaks at multiples of alphoid monomer repeat unit 171 bp, n · 171 bp, are denoted by nα. The peaks at multiples of 125 bp repeat unit, n · 125 bp, are denoted by . b 0–80000 bp. Pronounced peaks above 2 kb are denoted by the corresponding fragment lengths. The most pronounced peaks are approximately at 2385, 10848, 15775, 20309, 23541, and 41584 bp. Arrow i: peak corresponding to 715mer. Arrow j: peak corresponding to 1123mer. For description of peaks see the text

Here, we perform a detailed study of the alphoid HOR repeat sequence. Analyzing partial contributions to the GRM diagram of chromosome Y from individual contigs, we find that the largest frequency contributions to alphoid HOR peaks arise from the contigs NT_011878.9 and NT_087001.1. The relevant intervals of fragment lengths for these two contigs are shown in Fig. 3a and b, respectively. In both panels, peaks at approximate multiples of the basic repeat length ~171 bp decrease with increasing multiple order. That is a natural trend for tandem repeats. However, we do not find a peak corresponding to the HOR length, which for regular HORs in other chromosomes appears at their consensus lengths. This is because the Build 37.1 assembly of chromosome Y encompasses only peripheral tails of the major HOR array, and those exhibit sizeable riddling in both relevant contigs, as shown in the monomer structure of peripheral HOR copies in Fig. 1. For these riddled HOR copies there is no dominating consensus length and therefore no peak corresponding to consensus length is present. Instead, the GRM diagram shows more intricate HOR-related peaks which characterize riddled alphoid HOR copies. These peaks will be referred to as the GRM HOR-signature. The most pronounced GRM HOR-signature peaks of the riddled HOR pattern in peripheral regions of the major alphoid HOR in chromosome Y are at the lengths shown in Fig. 3a, b. These characteristic fragment lengths are fully consistent with the riddled HOR structure from Fig. 1.

Fig. 3

GRM diagrams for sequences in contigs containing alphoid HOR in chromosome Y: a NT_011878.9, b NT_087001.1, and c secondary periodicity sequence for internal part of major interior alphoid HOR block (genomic sequence from Skaletsky et al. 2003)

As an example, let us consider the largest GRM HOR-signature peak at 5551 bp, characterizing HOR pattern in NT_011878.9. This peak arises from approximate repeat of the 13–143 subsequence at the position of the 14–144 subsequence. The distance l between the corresponding bases in these two subsequences (Table 7) is equal to a distance between monomers 13 and 14 (Fig. 1 and Supplementary Table 1).

Table 7 Contributions to the fragment length 5551 bp alphoid GRM HOR-signature peak for human Y chromosome

Therefore, the GRM diagram shows a pronounced peak at the 5551 bp fragment length, reflecting the riddling structure of HORs. Similarly, we interpret all the other HOR-signature peaks which characterize riddling in HOR copies from Fig. 1.

In addition to the GRM computation for the Build 37.1 sequence of chromosome Y, let us comment on the GRM HOR-signature related irregularity (monomers w20 and w29) in the interior region of the major alphoid HOR array in chromosome Y (Supplementary Tables 3, 4). Figure 3c displays the GRM diagram computed for the 5941 bp secondary periodicity sequence from Skaletsky et al. (2003). Here again, we see the main pattern of monomer multiples ~171, ~2 × 171, ~3 × 171 bp, … with decreasing frequencies for increasing multiples. In addition, we obtain two weak subsequences of peaks, at fragment lengths ~104 bp, ~(104 + 171 bp), ~(104 + 2 × 171 bp), … and at ~224 bp, ~(224 + 171 bp), ~(224 + 2 × 171 bp), … These two additional weak subsequences are due to two distorted monomers in the 35mer periodicity (HOR) sequence that we deduced from the HOR genomic sequence in Skaletsky et al. (2003): the alphoid monomer w20 has a length of 104 bp (i.e., 67 nucleotides are deleted with respect to the consensus monomer), while the monomer w29 has a length of 224 bp, containing a 53 bp nonalphoid insertion with respect to the consensus monomer. Such deletions/insertions in two distant alphoid monomers within the HOR are absent in the peripheral regions of the major HOR array in chromosome Y, i.e., they are absent in the Build 37.1 assembly. Therefore, GRM diagrams of these regions (Fig. 3a, b) do not have these two additional weak subsequences of peaks. This underlines the interest in a future extension of the Build assembly into the sequence gap of ~3 Mb between the contigs NT_011878.9 and NT_087001.1.

Riddled 30mer HOR Scheme in Chimpanzee Chromosome Y

Applying GRM to the chimpanzee chromosome Y, we find two 30mer HOR arrays in chimpanzee contig NW_001252921.1 (NCBI Build 2.1), positioned one after another (with a gap of 599 bp in between) at the front part of the contig. The first HOR, truncated at the start of the contig is referred to as direct. In fact, it seems to be a truncated tail of a major HOR block positioned in unsequenced domain in front of the contig NW_001252921.1. We find that the reverse complement of the second HOR array is highly identical to the first HOR array, and therefore this second HOR array is referred to as reverse complement. This indicates that the direct and reverse complement HOR arrays are positioned on the opposite arms of a palindrome.

Our results for detailed monomer scheme of these two peripheral HOR arrays, which are reverse complement to each other, are shown in Fig. 4 and Supplementary Table 5. The consensus length of 30mer HOR unit is 5066 bp (consensus sequence in Supplementary Table 6).

Fig. 4

Schematic presentation of aligned monomer structure of 30mer alphoid HOR (consensus length 5066 bp) in chimpanzee chromosome Y (Build 2.1, contig NW_001252921.1). Top row enumeration of 30 constituent alpha monomers from consensus HOR. Upper panel: HOR copies in interval 264–20019. Lower panel reverse complement of HOR copies in interval from 20618–42459. After monomer No. 20 (label a): 41 bp insertion (no similarity to monomers in 30mer). For comparison with human alphoid HOR see Fig. 1. Open circle pJα motif (essential part) in alpha monomers

In GRM diagram of the whole chimpanzee Y chromosome (Fig. 5), the peak at 5066 bp fragment length is much weaker than the near-lying 5096 bp peak of another repeat structure (see Tables 2, 3) and is therefore overshadowed. For this reason, we compute the GRM diagram selectively for alphoid HOR-containing section of genomic sequence at the start of contig NW_001252921.1 (positions 1–20019) (Fig. 6). In Fig. 6, in the length interval between 0.1 and 1 kb there are pronounced peaks approximately at multiples of alphoid monomer repeat unit 171 bp (Fig. 6a), in analogy to Fig. 5a for the whole chimpanzee chromosome Y. Furthermore, the HOR-signature peaks are clearly seen in Fig. 6b as pronounced peaks at 5066 bp (~30 × 171 bp, denoted as 30α), 4895 bp (~29 × 171 bp, denoted as 29α), 3884 bp (~23 × 171 bp, denoted as 23α), and 8777 bp (~52 × 171 bp, denoted as 52α). These HOR-signature peaks can be also deduced directly from HOR structure from Fig. 4 and Supplementary Table 5.
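The assignment of these HOR-signature peaks to monomer multiples can be checked with a short calculation. The sketch below is only illustrative and is not part of the GRM method itself. It assumes the mean monomer length implied by the consensus HOR (5066 bp / 30 monomers, about 168.9 bp) rather than the nominal ~171 bp; the names `mean_monomer_bp` and `peak_orders` are hypothetical.

```python
# Peak lengths quoted in the text for Fig. 6b (in bp).
HOR_CONSENSUS_BP = 5066
HOR_MONOMERS = 30
signature_peaks_bp = [3884, 4895, 5066, 8777]

# Assumption of this sketch: use the mean monomer length of the
# consensus 30mer HOR instead of the nominal ~171 bp.
mean_monomer_bp = HOR_CONSENSUS_BP / HOR_MONOMERS

# Nearest whole number of alpha monomers for each peak.
peak_orders = {
    peak: round(peak / mean_monomer_bp) for peak in signature_peaks_bp
}

for peak, order in peak_orders.items():
    print(f"{peak} bp ~ {order} alpha monomers")
```

With this choice of monomer length the four peaks map to 23, 29, 30, and 52 monomers, matching the labels 23α, 29α, 30α, and 52α used in the text.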

Fig. 5

GRM diagram for Build 2.1 genomic assembly of chimpanzee chromosome Y for intervals of fragment lengths: a 0–1500 bp. There is only one pronounced tandem array with repeat units in the interval between 0.1 and 1.5 kb: the alphoid tandem repeat with alpha satellite repeat unit of 171 bp. The peaks at multiples of alphoid monomer repeat unit 171 bp, n · 171 bp, are denoted by nα. b 0–80000 bp. Pronounced peaks above 2 kb are denoted by the corresponding fragment lengths. The most pronounced peaks above 1.5 kb are approximately at 2383, 5096, 10762, 10853, 21218, 23578, 32071, 60523, 64624, and 72140 bp. For description of peaks see the text

Fig. 6

GRM diagram for HOR containing section from positions 1–20019 bp in the chimpanzee contig NW_001252921. Intervals of fragment lengths: a 0–1000 bp, b 0–10000 bp. For description of peaks see the text

For example, the 8777 bp (52α) HOR-signature peak arises from the approximate repeat of the 12–42 subsequence at position of the 14–44 subsequence (the 13–43 subsequence is missing due to riddling) (Table 8). Distance between the corresponding bases in these two subsequences is equal to the distance between monomers 12 and 14.

Table 8 Contributions to fragment length 8777 bp alphoid GRM HOR-signature peak for chimpanzee Y chromosome

Similarly, we interpret all the other pronounced HOR-signature peaks in Fig. 6b. The frequencies of these peaks are sizably smaller than of peaks arising from some other tandem repeats and therefore are overshadowed in Fig. 5 for the whole chimpanzee Y chromosome. We note that the HOR-signature peaks at 3884, 4895, 5066, and 8777 bp are the only significant GRM peaks above 1.5 kb in Fig. 6b.

Some peaks from GRM diagram for the whole chromosome Y (Fig. 5) are missing in GRM diagram for the HOR section in Fig. 6a. For example, the peak at 551 bp from Fig. 5a is missing in Fig. 6a, because the repeat unit of 551 bp is positioned outside of the HOR-section of genomic sequence included in Fig. 6a.

In addition to the equidistant multiple alphoid peaks, the GRM diagram in Fig. 6a contains a family of weaker equidistant peaks at fragment lengths 118, 118 + α, 118 + 2α, 118 + 3α, … (as in Fig. 5, α, 2α, 3α, … denote multiples of the alpha monomer length ~171 bp). This weak equidistant family of repeat lengths is based on the 118 bp peak, which originates from one of the monomers within the HOR, m25, being truncated from the standard ~171 bp to 118 bp. (We find an analogous appearance of additional bands, based on monomers of irregular length 104 and 224 bp, for two human monomers in the 35mer alphoid HOR in the interior part of the HOR array.)
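The nominal positions of such an offset peak family can be listed with a few lines of code. This is a minimal sketch that assumes an exact monomer length of 171 bp, whereas the observed peaks are only approximately at these positions; the function name `offset_family` is hypothetical.

```python
ALPHA_BP = 171  # approximate alpha monomer length used in the text


def offset_family(base_bp, n_terms, alpha_bp=ALPHA_BP):
    """Return nominal fragment lengths base, base + alpha, base + 2*alpha, ..."""
    return [base_bp + n * alpha_bp for n in range(n_terms)]


# Chimpanzee family based on the truncated 118 bp monomer m25.
chimp_m25_family = offset_family(118, 4)

# Human families based on the irregular monomers w20 (104 bp) and w29 (224 bp).
human_w20_family = offset_family(104, 3)
human_w29_family = offset_family(224, 3)

print(chimp_m25_family)
print(human_w20_family, human_w29_family)
```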

Comparison of Alpha Satellite Monomers in Human 45mer and Chimpanzee 30mer HORs

Computing divergence between 45 human consensus alpha monomers from consensus 45mer HOR and 30 chimpanzee consensus alpha monomers from consensus 30mer HOR (Supplementary Table 7) we see that due to scattering of divergences and the absence of any small divergence, none of chimpanzee monomers can be assigned to a particular human monomer (Supplementary Table 8). In the whole human–chimpanzee divergence matrix the lowest divergence value is 12%, appearing in a few cases only (Table 9). The mean value of the lowest human–chimpanzee divergence for each human monomer is 17% (Supplementary Table 8). The absence of identity between particular human and chimpanzee monomers from alphoid HORs is also seen from the mean values of divergences in Table 10.

Table 9 Illustration of divergences of human monomers m01 and m24 with respect to 30 chimpanzee monomers
Table 10 Comparison of mean values of human and chimpanzee consensus monomer divergences

On the other hand, we find that alpha monomers in the 30mer HORs in chimpanzee Y chromosome are predominantly of M1 SF type, like the alpha monomers in the 35mer/45mer HORs in human chromosome Y. Accordingly, as in the human Y chromosome, monomers in the chimpanzee Y chromosome are characterized by the presence of the pJα motif and the absence of the CENP-B box (Fig. 4). As already noted, the human Y chromosome was the only known case where the pJα motif is present and the CENP-B box absent; we now see that the chimpanzee Y chromosome shares this feature.

As to the degree of riddling, the human HOR is more riddled than the chimpanzee HOR. In particular, the human HOR has more insertions than the chimpanzee HOR, which is reflected in their respective GRM HOR signature.

Peculiarities of Alphoid HOR in Human Y Chromosome

We show that HOR structure in the peripheral regions of the major alphoid block in human chromosome Y is more complex than the previously reported structure for the internal region. In this computational study, we identify and fully characterize the peripheral region, in particular finding ten new monomers constituting alphoid HOR copies, different from the known 35 constituent monomers, giving evidence for the presence of 45mer in the peripheral region of HOR array. Furthermore, while 33 out of 35 constituting alphoid monomers in HOR copies in the interior HOR region are highly homologous to the corresponding monomers in the peripheral region, we find that the remaining two monomers in the interior region have a sizeable deletion and nonalphoid insertion, respectively, with respect to the corresponding monomers from the peripheral region. The study of these riddled HOR copies may be valuable for understanding possible sources of genomic diversity, but also has the potential to provide useful markers for medical, population, and forensic genetic studies, and may give a route for identifying mechanisms of DNA sequence evolution.

Some peculiarities studied in this work regarding the major alphoid HOR that may shed some new light at the mysteries of human Y chromosome are:

The 33 consensus monomers from the peripheral HOR structure are highly identical to the aligned 33 monomers of previously reported secondary periodicity sequence from Skaletsky et al. (2003). On the other hand, we find peculiar differences: the 10mer alphoid sequence, inserted in the peripheral HOR structure, is absent in the reported internal structure; and in the previously reported internal secondary periodicity structure one constituent alphoid monomer has a sizeable deletion (67 bp) and the other a sizeable nonalphoid insertion (53 bp) accompanied by clustered substitutions of 11 bases with respect to the peripheral HOR structure.

The highly identical alphoid 10mer insert appears in both peripheral regions of major HOR, but was not reported so far in the internal centromere region between the two peripheral regions.

The peripheral regions of major HOR alphoid block reveal coexistence: on one hand, very low divergence between the aligned constituent alpha monomers from different HOR copies (average divergence ~0.3%) and, on the other hand, pronounced riddling due to deletions and insertions of alpha monomers and/or due to insertions of nonalphoid segments. The HOR copies in chromosome Y are the only known case where the pJα motif is present and CENP-B box absent.

The major alphoid HOR in Y chromosome exhibits more deletions and insertions of alphoid monomers and highly distorted insertions than HORs in other chromosomes.

Difference Between Humans and Chimpanzees Alphoid HOR Repeat Units

The number of different monomers constituting HOR in human Y chromosome (45 monomers in the peripheral sections of major HOR array, and 35 monomers in the interior section) is different than in the chimpanzee genome (30 monomers).

HOR pattern in the sequenced domain in Build 37.1 assembly (peripheral region) is characterized by substantial riddling, which is more pronounced in human than in chimpanzee genome.

All alpha satellite monomers constituting major human 35/45mer HOR are different from monomers constituting chimpanzee 30mer HOR by ~20%, which is comparable to divergence between monomers within a single HOR copy.

The lengths of major alphoid HOR arrays in human and chimpanzee are widely different, ~3 and ~1 Mb, respectively.

Other Human and Chimpanzee Tandem, HOR and Regularly Dispersed Repeat Arrays Based on Large Repeat Units

Besides the alphoid HOR, in human Build 37.1 and chimpanzee Build 2.1 Y chromosome assemblies we find over 20 other large repeat units (Tables 1, 2, 3). Some of large repeat units appear both in human and in chimpanzee genomic assembly, and some in human only or in chimpanzee only. We describe here some pronounced repeats identified from GRM diagrams (labeled a in Tables 1, 2). The remaining repeats (denoted b in Tables 1, 2) are described in Supplementary information.

Chimpanzee ~550 bp Primary Repeat Unit, ~1652 bp 3mer HOR Secondary Repeat Unit, and ~23578 bp Tertiary Repeat Unit

In the GRM diagram for chimpanzee Y chromosome in the length interval between 100 and 1500 bp, besides the major peaks associated with alphoid HOR and tandem repeat based on the 125 bp repeat unit, there is an additional pronounced peak at ~550 bp (Fig. 5a). Using GRM, we find that this peak arises from 3mer HOR copies constituted from three ~550 bp monomers, denoted mc01, mc02, and mc03. These monomers mutually diverge by ~8%, while different 3mer HOR copies mutually diverge by only ~1%. A divergence between 3mer copies that is about eight times smaller than the divergence between individual monomers within each 3mer is a signature of HOR. However, these HOR copies are not in tandem, in contrast to previously known HOR structures; instead, they are dispersed with rather regular spacings. Consensus sequences of the three monomers mc01, mc02, and mc03, determined from NW_001252921.1 (using key string AGGTACTG), are given in Supplementary Table 9. The main contributions to the ~550 bp GRM peak arise from the array of ~550 bp monomers within each 3mer copy.
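The HOR criterion used here (copies of the whole unit diverge much less than the monomers inside one unit) can be illustrated on toy data. The sketch below assumes pre-aligned sequences of equal length and counts only substitutions; the sequences are invented 10 bp stand-ins rather than the real ~550 bp monomers, and the helper names are hypothetical.

```python
from itertools import combinations


def divergence_percent(seq_a, seq_b):
    """Percent of mismatching positions between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return 100.0 * mismatches / len(seq_a)


def mean_pairwise_divergence(seqs):
    """Mean divergence over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(divergence_percent(a, b) for a, b in pairs) / len(pairs)


# Toy monomers (invented data) that differ from each other at a few positions.
m1, m2, m3 = "ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAC"

# Two toy 3mer copies: the second differs from the first by one substitution.
copy_1 = m1 + m2 + m3
copy_2 = copy_1[:-1] + "G"

within_copy = mean_pairwise_divergence([m1, m2, m3])
between_copies = divergence_percent(copy_1, copy_2)

# A HOR-like pattern: monomers diverge more than whole copies do.
print(within_copy, between_copies, within_copy / between_copies)
```

For the chimpanzee 3mer described in the text, the analogous ratio is about 8% / 1%, i.e., roughly eight.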

Table 11 Dispersed 3mer HOR copies based on ~550 bp monomer in chimpanzee Y chromosome

Performing the GRM analysis we find 20 dispersed HOR copies (Table 11). In addition, in four HOR copies in NW_001252921.1 one of three ~550 bp monomers is deleted. In NW_001252921.1, we find dispersed highly identical 3mer HORs, direct and reverse complement. HOR copies after the first one are grouped into five pairs of 3mers:

  • D S D

  • R S R

  • D S D

  • R S R

  • D S D

where D is the direct 3mer copy, R is the reverse complement 3mer copy, and S is the spacing of ~24 kb (see Table 11). (Three of the 3mer copies in these pairs are truncated from three to two monomers.) Since the two 3mer copies in each pair are separated by the spacing S, there is no GRM peak at ~1.65 kb. Instead, the pairing gives rise to a tertiary repeat unit, with a ~24 kb peak (more precisely ~23578 bp) in the GRM diagram.

We even find an approximate next-higher-order pattern, three copies of a quartic repeat unit:

$$ \text{R}\;\text{S}_{2}\;\text{D}\;\text{S}\;\text{D}\;\text{S}_{1}\;\text{R}\;\text{S}\;\text{R}\;\text{S}_{2}\;\text{D}\;\text{S}\;\text{D}\;\text{S}_{1}\;\text{R}\;\text{S}\;\text{R}\;\text{S}_{2}\;\text{D}\;\text{S}\;\text{D}\;\text{S}_{1}\;\text{R} $$

where \( \text{S}_{2} \) is a spacing of ~0.40 Mb and \( \text{S}_{1} \) a spacing of ~0.28 Mb (see Table 11). The length of this unit is ~0.73 Mb. In NW_001252921.1, we find an array of three such quartic repeat units. This would give rise to a GRM peak at ~0.74 Mb fragment length, which lies outside the range computed here (the computation is performed up to 100 kb fragment lengths).

We note that in NW_001252926.1 we find a D S1 R S R subsection of the above pattern.

Human ~545 bp Primary Repeat Unit, ~1641 bp 3mer HOR Secondary Repeat Unit, and ~23541 bp Tertiary Repeat Unit

The GRM peak at 545 bp is due to the ~545 bp monomers, organized in dispersed 3mer HOR copies of ~1641 bp (Table 12). The distance between the start positions of two 3mer copies is again ~24 kb, similar to that in the chimpanzee Y chromosome, giving rise to the ~23541 bp peak in the GRM diagram.

Table 12 Dispersed 3mer HOR copies based on ~545 bp monomer in human Y chromosome

The 23541 bp repeat unit corresponds to the previously reported 23.6 kb repeat units containing RBMY genes, which had not previously been related to the 545 bp PRU (Skaletsky et al. 2003; Warburton et al. 2008).

As seen, the human HOR pattern in the sequenced Y chromosome contains fewer copies than the chimpanzee pattern and is less symmetrically organized. The human ~545 bp monomers (denoted m01, m02, m03) are similar to the chimpanzee ~550 bp monomers (denoted mc01, mc02, mc03): the divergence between the human monomers m01, m02, and m03 and the corresponding chimpanzee monomers is ~4%, while the divergence between off-diagonal monomers (i.e., m01 vs. mc02, m01 vs. mc03, …) is ~8%. Only a small subsection of ~24 kb encompassing each human HOR copy is similar to the corresponding section encompassing each chimpanzee HOR copy (divergence less than 10%), while the remaining part of the large spacings, of total length ~2 Mb, strongly diverges between human and chimpanzee. This gives a substantial contribution to the overall human–chimpanzee divergence. Furthermore, the subsequences of the ~24 kb human sequence are scattered in various parts of the chimpanzee Y chromosome.

Human ~2385 bp Primary Repeat Unit and ~7155 bp 3mer HOR Secondary Repeat Unit

The DAZ gene family, located in the AZFc region of Y chromosome, is organized into two clusters and contains a variable number of copies (Seboun et al. 1997; Glaser et al. 1998; Saxena et al. 2000; Fernandes et al. 2006). A ~2.4 kb repeat unit in DAZ genes was reported by Skaletsky et al. (2003) and Warburton et al. (2008). Accordingly, the GRM peak at 2385 bp (Fig. 2b) is due to tandem repeats with the ~2.4 kb PRU in DAZ genes. Human DAZ repetitions are located in contig NT_011903.12 (positions 1346649 to 1361029, 1425263 to 1473290, 2977988 to 2997102, and 3050498 to 3086580), i.e., from position 25.3 to 27 Mb within the human Y chromosome.

Using GRM we classify the assembly of ~2.4 kb monomers into five monomer families (consensus sequences in Supplementary Table 10). The average divergence between monomers of the same family is below 1%, while the average divergence between monomers from different families is ~11%. The monomer family with highest frequency of appearance has consensus length 2385 bp, which determines the length of the 2385 bp GRM peak. This monomer family forms a highly homologous monomeric tandem repeat, which is present in DAZ2 and DAZ4 genes.

We find that the GRM peak at 7155 bp corresponds to 3mer HOR composed of three variants of ~2.4 kb DAZ repeat monomers, denoted m01, m02, and m03 (the first three consensus sequences from Supplementary Table 10). Computing the GRM diagram of any of the 7155 bp copies we obtain two pronounced peaks, at ~2.4 and ~4.8 kb, revealing the 3mer character. We find that these 3mer HOR copies are present in all four DAZ1–DAZ4 genes. Human DAZ genes contain 12 DAZ HOR copies organized into four tandem arrays (DAZ1–DAZ4).

The ~4757 bp peak in GRM diagram corresponds to the 2mer HOR copies arising from 3mer HOR by deletion of one monomer from the 7155 bp secondary 3mer HOR unit. In GRM diagram of the 4757 bp repeat copies, we obtain only one pronounced GRM peak, at ~2.4 kb, showing the 2mer character of 4757 bp repeat copies. We find that such 2mer HOR copies are present in all four DAZ1–DAZ4 genes.

Chimpanzee ~2383 bp Primary Repeat Unit and Absence of Tandem of Higher Order Repeats

The GRM peak at ~2383 bp is due to tandem repeats with a ~2.4 kb repeat unit in DAZ genes in chimpanzee Y chromosome. Chimpanzee DAZ repetitions are located in contigs NW_001252917.1 (positions 1109191 to 1130961 and 1259092 to 1280862) and NW_001252922.1 (positions 997017 to 1028356 and 1070171 to 1099128), that is, at chromosome positions from ~3.2 to 3.4 Mb and from ~11.2 to 11.3 Mb. Positions of the corresponding subsequences differ widely between the human and chimpanzee chromosomes. Divergence between human and chimpanzee consensus sequences is ~5%.

We find that the chimpanzee Y chromosome contains 3mer and 2mer HOR copies, similar to those for human Y chromosome, but with one pronounced distinction: chimpanzee DAZ genes contain four DAZ HOR copies, which are, unlike the case of human Y chromosome, not organized into tandem but into dispersed HOR copies. Therefore, there are no GRM peaks corresponding to HORs.

The presence of tandem of DAZ HOR copies in human and absence of such tandem in chimpanzee Y chromosome provides an interesting evolutionary distinction between human and chimpanzee Y chromosomes.

Human ~3579 bp 715mer HOR Unit and 5 bp Primary Repeat Unit

The GRM peak at ~3579 bp is due to a tandem of 28 repeat copies in NT_025975.2. These copies differ in lengths from 3544 to 3589 bp. The length 3579 bp has the highest frequency and is equal to consensus length. Other copy lengths appear due to deletion or insertion of 5 bp subsequences. Average divergence of copies with respect to consensus sequence is ~1%. Due to differences in lengths of copies, the GRM peak at ~3579 bp is broadened (Fig. 2b).

In the next step, we find a strong peak at the fragment length 5 bp in GRM diagram for the 3579 bp consensus sequence. A dominant key string for segmentation of the 3579 bp consensus sequence into 5 bp fragments is ATTCC, which is the consensus sequence of 5 bp primary repeat copies. Thus the 3579 bp repeat unit is a 715mer HOR based on ATTCC primary consensus repeat unit. Here 34% of primary repeat 5 bp copies are equal to consensus, 38% differ from consensus by one base, 21% by two, 6% by three and 1% by four bases.
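The idea of segmenting a sequence at key-string sites and counting fragment lengths can be sketched as follows. This is a simplified illustration on an invented toy sequence, not the GRM implementation used in this work: it uses a single key string (the text also mentions key string ensembles), and the function name `key_string_fragment_lengths` is hypothetical.

```python
from collections import Counter


def key_string_fragment_lengths(sequence, key_string):
    """Histogram of distances between successive start positions of key_string."""
    positions = []
    start = sequence.find(key_string)
    while start != -1:
        positions.append(start)
        start = sequence.find(key_string, start + 1)
    return Counter(b - a for a, b in zip(positions, positions[1:]))


# Toy sequence (invented): ATTCC pentamers with one mutated copy (ATTCG).
toy_sequence = "ATTCC" * 6 + "ATTCG" + "ATTCC" * 3

histogram = key_string_fragment_lengths(toy_sequence, "ATTCC")
for length, frequency in sorted(histogram.items()):
    print(f"fragment length {length} bp: {frequency}")
```

In the toy sequence the mutated copy destroys one key-string site, so one fragment of 10 bp appears next to the dominant 5 bp fragments. This mirrors how peaks at multiples of the primary repeat length arise from mutated recognition sites.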

This 3579 bp HOR corresponds to the previously reported 3584 bp HOR (Skaletsky et al. 2003).

Absence of Chimpanzee HOR Unit Corresponding to Human 3579 bp 715mer HOR Unit

In the Build 2.1 assembly for chimpanzee Y chromosome we find no analog of the human 3579 bp 715mer HOR unit.

Human ~5607 bp 1123mer HOR Unit and 5 bp Primary Repeat Unit

The 5607 bp peak corresponds to a new HOR, with 5607 bp SRU (5 bp GGAAT PRU). The main contribution to this peak is from contig NT_113819.1. We identify a tandem of 11 copies, from position 496682 to 553881 (Supplementary Table 11) and determine the 5607 bp consensus sequence (Supplementary Table 12).

To investigate the structure of 5607 bp repeat unit, we compute the GRM diagram of its consensus sequence. Using 8 bp key string ensemble, we obtain the GRM diagram characterized by a set of GRM peaks at fragment lengths of 5 bp and its multiples (Supplementary Fig. 1a), revealing the underlying 5 bp PRU. However, the reciprocal distribution of GRM peaks shows deviation from the exponential distribution expected due to random mutations of fragments of multiple orders at KSA recognition sites. This deviation is due to the fact that the length of key strings in the ensemble is larger than the repeat unit. This is shown by computing the GRM diagram by using the 3 bp key string ensemble, shorter than the 5 bp PRU (Supplementary Fig. 1b). In that case the reciprocal distribution of GRM peaks corresponding to the 5607 bp consensus sequence indeed follows exponential distribution, as expected.

The 5607 bp HOR consensus unit consists of 1123 pentamer copies. Out of these copies, 353 are identical to GGAAT which is the primary repeat consensus. The mean divergence between 5 bp consensus GGAAT and pentamer copies that are not identical to consensus is ~30%. Differences are mostly due to substitutions. There are only a few indels: two copies have 1-base insertion, one has 2-base insertion, ten have 1-base deletion and one has 2-base deletion.
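The copy and indel counts quoted above can be checked against the 5607 bp consensus length. The sketch below is a simple bookkeeping exercise that assumes every pentamer copy without a listed indel is exactly 5 bp long; the variable names are hypothetical.

```python
PENTAMER_BP = 5
N_COPIES = 1123
N_IDENTICAL_TO_CONSENSUS = 353

# (number of copies, length change in bp) for the indels listed in the text.
indels = [
    (2, +1),   # two copies with a 1-base insertion
    (1, +2),   # one copy with a 2-base insertion
    (10, -1),  # ten copies with a 1-base deletion
    (1, -2),   # one copy with a 2-base deletion
]

net_indel_bp = sum(count * delta for count, delta in indels)
total_length_bp = N_COPIES * PENTAMER_BP + net_indel_bp

identical_percent = 100 * N_IDENTICAL_TO_CONSENSUS / N_COPIES

print(net_indel_bp, total_length_bp, round(identical_percent, 1))
```

Under this assumption, 1123 pentamers minus a net loss of 8 bases gives 5607 bp, and about 31% of the copies are identical to the GGAAT consensus.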

Absence of Chimpanzee HOR Unit Corresponding to the Human 5607 bp HOR Unit

In the Build 2.1 assembly for chimpanzee Y chromosome we find no repeat unit corresponding to human 5607 bp HOR unit.

Chimpanzee 10853 bp Primary Repeat Unit and 64624 bp Secondary Repeat Unit

The GRM peak at 10853 bp is due to a tandem in NW_001252917.1 (eight copies), with repeat unit consensus length 10853 bp. The 10853 bp consensus sequence is given in Supplementary Table 15. The third copy in this tandem is distorted: it is truncated after the first 6399 bases and followed by a large insertion, so that the truncated third copy and the neighboring insertion have a combined length of 21218 bp. The eighth copy is distorted in the same way as the third copy, leading again to a ~21 kb combined length.

Distance between the corresponding bases in neighboring copies (except those involving the third copy) is ~10853 bp, giving rise to the 10853 bp GRM peak.

Distance between the start of the 6399 bp subsection of the third copy and the start of the fourth copy is 21218 bp, giving rise to the 21218 bp GRM peak. Distance from the end of the second copy (which has no counterpart in the truncated third copy) to the end of the fourth copy is 10853 + 21218 bp = 32071 bp, giving rise to the 32071 bp GRM peak.

The copies No. 1, 2, and 4–7 are identical to within 1%, while the copies No. 3 and 8 share a similar truncation and additional insertion. Therefore, the copies No. 1–5 form a secondary repeat HOR copy of approximate length 2 × 10853 + 21218 + 2 × 10853 bp (precise value 64624 bp). The last three copies in the tandem, No. 6–8, represent the first three copies belonging to the second 64624 bp HOR copy.

The insertion after the truncated third copy in the chimpanzee tandem repeat with the 10853 bp PRU, of length 21218 − 6399 bp = 14819 bp, is also present in the human Y chromosome as a tandem of two repeat units (divergence ~4%) in contig NT_011903.12. Because these repetitive units are mutually reverse complement, the GRM diagram for human chromosome Y does not show this peak.
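The fragment lengths discussed for this tandem follow from three quoted numbers, which the following sketch recomputes. It uses only the consensus lengths given in the text, so the approximate secondary unit length differs slightly from the reported precise value; the variable names are hypothetical.

```python
PRU_BP = 10853            # consensus primary repeat unit
DISTORTED_BP = 21218      # truncated third copy plus neighboring insertion
TRUNCATED_BP = 6399       # retained part of the third copy
REPORTED_SRU_BP = 64624   # precise secondary repeat unit length from the text

# Length of the insertion that follows the truncated copy.
insertion_bp = DISTORTED_BP - TRUNCATED_BP

# Distance spanning one regular copy and the distorted copy.
combined_peak_bp = PRU_BP + DISTORTED_BP

# Approximate secondary unit: two regular copies, the distorted copy,
# and two more regular copies (copies No. 1-5).
approx_sru_bp = 2 * PRU_BP + DISTORTED_BP + 2 * PRU_BP

print(insertion_bp, combined_peak_bp, approx_sru_bp)
print("difference from reported SRU:", approx_sru_bp - REPORTED_SRU_BP)
```

The consensus-based estimate exceeds the reported 64624 bp by only 6 bp, consistent with small length variations among individual copies.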

Summary of Human–Chimpanzee Divergence Due to Repeats Based on Large Repeat Units

We determine approximately the number of bases which are different in repeat arrays of human and chimpanzee Y chromosome using a simple formula:

$$ d = \sum\limits_{i} d_{i}, $$

where \( d_{i} = \min (l_{i,\text{hum}}, l_{i,\text{chimp}}) \cdot p_{i} + l_{i} \).

Here, \( l_{i,\text{hum}} \) and \( l_{i,\text{chimp}} \) are sums of lengths over all copies of the ith human and chimpanzee repeat unit, respectively; \( \min (l_{i,\text{hum}}, l_{i,\text{chimp}}) \) is the smaller of the two lengths; \( l_{i} = |l_{i,\text{hum}} - l_{i,\text{chimp}}| \); and \( p_{i} \) is the divergence between the human and chimpanzee repeat unit i. In this way, we include contributions to human–chimpanzee divergence both from substitutions and indels.

For example, in the case of alphoid HOR in Y chromosome (repeat No. 1 from Tables 1, 2, 3) we have: \( l_{1,\text{hum}} = 3048138 \) bp, \( l_{1,\text{chimp}} = 1042459 \) bp, \( l_{1} = 2005679 \) bp, \( p_{1} = 0.20 \), giving \( d_{1} = 2214171 \) bp (Fig. 7). With respect to the sequence of the larger alphoid HOR, of length \( l_{1,\text{hum}} \), this corresponds to an approximate divergence \( 100 \cdot d_{1}/l_{1,\text{hum}} = 72.6\% \).
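The worked example for the alphoid HOR can be reproduced directly from the formula. The sketch below is a minimal transcription of that formula with the quoted input values; the function name `repeat_divergence_bp` is hypothetical.

```python
def repeat_divergence_bp(l_hum, l_chimp, p):
    """Number of differing bases for one repeat: substitutions plus length difference."""
    substitutions = min(l_hum, l_chimp) * p
    length_difference = abs(l_hum - l_chimp)
    return substitutions + length_difference


# Values quoted in the text for the major alphoid HOR (repeat No. 1).
L1_HUM_BP = 3048138
L1_CHIMP_BP = 1042459
P1 = 0.20

d1_bp = repeat_divergence_bp(L1_HUM_BP, L1_CHIMP_BP, P1)
d1_percent = 100 * d1_bp / L1_HUM_BP

print(round(d1_bp), round(d1_percent, 1))
```

The length-difference term dominates here: about 2.0 Mb of the ~2.2 Mb total comes from the difference in array lengths rather than from substitutions.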

Fig. 7

Schematic presentation of applying the formula for calculation of human–chimpanzee divergence for the case of a large repeat unit (major alphoid HOR)

Summing over all repeats (\( i = 1, 2, \ldots \)) from Tables 1, 2, and 3, we obtain the total number of bases that differ between human and chimpanzee large repeats: d ~ 3.4 Mb (3378539 bp). The corresponding divergence with respect to all repeats from Tables 1, 2, and 3 is:

$$ {\text{div(rep)}} = 100 \cdot d /L, $$

where the total length of all repeats from Tables 1, 2, and 3 is L = 4848892 bp.

Thus, we obtain divergence with respect to repeat sequences included in Tables 1, 2, and 3:

$$ {\text{div(rep)}} \approx 70\% . $$

If we smear out the divergence over the whole Build sequence of length \( L_{\text{as}} = 25 \) Mb, we obtain the overall divergence with respect to assembly length:

$$ {\text{div(Build)}} = 100 \cdot d /L_{\text{as}} $$
$$ {\text{div(Build)}} \approx 14\% . $$
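Both percentages follow from the totals quoted above, as the following short check shows. It takes the assembly length as exactly 25 Mb, which is the rounded value given in the text; the variable names are hypothetical.

```python
D_BP = 3378539               # total differing bases over all large repeats
L_REPEATS_BP = 4848892       # total length of all repeats in Tables 1, 2, and 3
L_ASSEMBLY_BP = 25_000_000   # assembly length, rounded to 25 Mb as in the text

div_rep_percent = 100 * D_BP / L_REPEATS_BP
div_build_percent = 100 * D_BP / L_ASSEMBLY_BP

print(f"div(rep)   ~ {div_rep_percent:.1f}%")
print(f"div(Build) ~ {div_build_percent:.1f}%")
```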

This estimate of overall divergence due to repeats based on large repeat units should be additionally increased due to overall estimates of approximately 1–2% divergence for nonrepeat sequences.

Both the human and the chimpanzee Y chromosome sequences are still incomplete; in human chromosome ~25 Mb out of total length of ~59 Mb was sequenced. Thus, a greater contiguity at several genomic regions is desired to reach more precise conclusions regarding human–chimpanzee divergence. However, the main body of results will probably stand, because, in general, nonsequenced gaps are rich in repeat structures. It should be noted that a whole-genome comparison of chimpanzee and human revealed an increased divergence in the terminal 10 Mb of the corresponding chromosomes, consistent with general association between increased divergence rates and location near the chromosome ends (Mikkelsen et al. 2005; Pollard et al. 2006a). In general, and in accordance with Gibbs et al. (2007), it can be expected that unsequenced regions of repeat elements, that are difficult to align, might for the whole Y chromosome somewhat increase the presently estimated divergence of 14% for the sequenced part. Definitive studies of genome evolution will require high-quality finished sequences (Mikkelsen et al. 2005).

An interesting question is how much the observed sizeable divergence can be generalized to the whole genome. In this sense, we have started a systematic study of human–chimpanzee divergence due to large repeats in other chromosomes.

We see a tendency that large repeat units in humans are on average larger and copy numbers greater than those in chimpanzees. This is in accordance with previous observation that microsatellites in humans are on average longer than those in chimpanzees (Vowles and Amos 2006).

We identify large repeat units which contribute substantially to divergence between humans and chimpanzees. Our results indicate that alphoid HOR and most of characteristic tandem repeats with large repeat units (some present only in human and not in chimpanzee Y chromosome, or some vice versa) have been created after the human–chimpanzee separation, while only a smaller number of tandems with large repeat units (present both in human and in chimpanzee Y chromosome at low mutual divergence) originate from a common ancestor that predated the human–chimpanzee separation. This is in accordance with previous observations in some other chromosomes that alpha satellite subsets found in great apes and humans are in general not located on their corresponding homologous chromosomes (Jorgensen et al. 1992; Warburton et al. 1996); for example, the alpha satellite subset on human chromosome 5 is a member of SF 1, while the homologous chimpanzee chromosome belongs to SF 2 (Haaf and Willard 1997, 1998). It was pointed out that this implies that the human–chimpanzee sequence divergence has not arisen from a common ancestral repeat, but instead represents initial amplification and homogenization of distinct repeats on homologous chromosomes (nonorthologous evolution).

Haaf and Willard (1997) discussed the propositions for homogenization of alpha satellites. Homogenization processes appear to proceed in localized, short-range fashion that leads to formation of large domains of sequence identity (Tyler-Smith and Brown 1987; Durfy and Willard 1989; Warburton and Willard 1990). Genomic turnover mechanisms (molecular drive; Dover 1982, 1986) must be at work that spread and homogenize individual variant repeat units throughout arrays and throughout populations (Haaf et al. 1995). However, the mechanisms by which this concerted evolution occurs seem unclear, although several genomic turnover mechanisms such as unequal crossing over between repeats of sister chromatids (Smith 1976), sequence conversion (Baltimore 1981), sequence transposition (Calos and Miller 1980), translocation exchange (Krystal et al. 1981), and disproportionate replication (Hourcade et al. 1973; Spradling 1981; Lohe and Brutlag 1987) have been observed to be active in certain genomes.

Previous FISH studies support the conclusion that the localization of SF 3 alpha satellite is substantially conserved, while alpha satellite sequences belonging to families 1 and 2 are not shared by the corresponding chimpanzee homologs (D’Aiuto et al. 1993; Archidiacono et al. 1995). Here we find that, although SF 4, which is composed of the M1 alpha satellite monomers that constitute the human and chimpanzee alphoid HORs in the Y chromosome, is conserved, both the alpha satellite monomers in the human and chimpanzee HORs and the HOR lengths differ widely.

It was pointed out that it is not known whether evolutionarily important mutations occurred predominantly in regulatory sequences or in coding regions (King and Wilson 1975; McConkey et al. 2000; McConkey 2002; Olson and Varki 2003; Carroll 2003). Preliminary data suggested that gene expression patterns in the human brain might have evolved rapidly (Enard et al. 2002; Caceres et al. 2003; Uddin et al. 2004; Dorus et al. 2004).

Comparative genomic analyses strongly indicated that the marked phenotypic differences between humans and chimpanzees are likely due more to changes in gene regulation than to modifications of the genes themselves (King and Wilson 1975; Pollard et al. 2006a, b; Popesco et al. 2006; Prabhakar et al. 2006). The gene regulatory evolution hypothesis proposes that the striking differences between humans and chimpanzees are due to gene expression: changes in the pattern and timing of turning genes on and off.

Pollard et al. (2006b) identified short genomic regions of ~100 bp that are highly conserved in vertebrates but show significantly accelerated substitution rates on the human lineage relative to the chimpanzee (Pollard et al. 2006a, b). Many of these Human Accelerated Regions (HARs), characterized by dense clusters of nucleotide substitutions, are associated, in particular, with the nervous system, reproductive system, and immune system.

Detailed studies have indicated that forces other than selection for random mutations that increase fitness in specific functional elements may be at play in strongly accelerated regions (Pollard et al. 2006a). There is a possibility that changes in the accelerated regions result from a combination of multiple evolutionary processes, perhaps including biased gene conversion and a selection-based process (Pollard et al. 2006a).

Here, we find another type of accelerated region: for some repeat arrays we find a dramatic evolutionary acceleration of the repeat pattern, from monomeric arrays in the chimpanzee to HOR organization of repeat arrays in the human Y chromosome, i.e., the rapid onset of unequal crossing over in the human lineage. Such a region of accelerated evolution of the HOR pattern will be referred to as a human accelerated HOR region (HAHOR).

The hallmark of an evolutionary shift of function is a sudden change in a region of the genome that had previously been conserved (Pollard et al. 2006b). The function of sets of genomic regulatory sequences has previously been compared to electronic microprocessing: they process the information contained in a set of regulatory elements into the corresponding pattern of gene expression. It was noted that one of the basic ways in which regulatory genomic features are related to evolutionary processes is the recruitment of existing regulatory pathways into newly evolving contexts (Gierer 1998; Tautz 2000; Pires-da Silva and Sommer 2003). These processes follow the rules of nonlinear interactions, which in turn allow for sudden or very fast changes resulting from the accumulation of rapidly succeeding small steps with self-enhancing features. Furthermore, mechanisms of bifurcation and de novo pattern formation may lead, for instance, to strikingly different developments in parts of an initially near-uniform area. Thus, in general, small causes can result in big effects (Gierer 2004). Finally, we note the possibility that accelerated large repeat units and HAHORs could have a functional role as new categories of long-range regulatory elements (Noonan and McCallion 2010).

Conclusion

In this study, we identify and analyze tandem repeats, HORs, and regularly dispersed repeats in chimpanzee and human. For the first time we report a dozen new large repeats in the chimpanzee genome and several new large repeats in the human genome. Comparing the corresponding repeats based on large repeat units in human and chimpanzee, we find a substantial contribution to the human–chimpanzee divergence from these repeats: approximately 70% divergence with respect to repeat arrays based on large repeat units. Smearing out these differences in large repeats over the whole sequenced assemblies, human Build 37.1 and chimpanzee Build 2.1, i.e., neglecting the divergence between all other segments of the genome sequences, we obtain an overall human–chimpanzee divergence between the sequenced assemblies of approximately 14%. This numerical estimate far exceeds the earlier numerical estimates available for human–chimpanzee divergence.
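
The smearing-out step can be illustrated with a minimal sketch. It assumes the simplest reading of the procedure: only the large-repeat arrays are counted as divergent, and their divergence is averaged over the whole compared length. The function name and the lengths used in the example are hypothetical and chosen for illustration only; they are not taken from the assemblies and do not reproduce the GRM computation.

```python
def smeared_divergence(repeat_bp, repeat_divergence, assembly_bp):
    """Average divergence over the whole compared sequence when only the
    large-repeat arrays are counted as divergent (all other sequence is
    treated as identical, as in the simplification described in the text).

    repeat_bp         -- total length covered by large-repeat arrays (bp)
    repeat_divergence -- divergence within those arrays, as a fraction (0..1)
    assembly_bp       -- total length of the compared sequence (bp)
    """
    if assembly_bp <= 0 or not 0 <= repeat_bp <= assembly_bp:
        raise ValueError("repeat_bp must lie between 0 and assembly_bp")
    if not 0.0 <= repeat_divergence <= 1.0:
        raise ValueError("repeat_divergence must be a fraction between 0 and 1")
    return repeat_divergence * repeat_bp / assembly_bp


# Hypothetical lengths for illustration only (not the real assembly sizes):
# arrays covering one fifth of the sequence at 70% divergence.
overall = smeared_divergence(repeat_bp=2_000_000,
                             repeat_divergence=0.70,
                             assembly_bp=10_000_000)
print(f"overall divergence: {overall:.2f}")
```

Under this simplified model the overall figure scales linearly with the fraction of the sequence occupied by large-repeat arrays, so it is a lower bound on the total divergence whenever the remaining sequence is not identical.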

Our results are in accordance with the recent publication by Hughes et al. (2010), in which an overall comparison showed that the human and chimpanzee MSYs differ radically.

We explicitly identify, analyze, and compare a dozen large repeats which contribute substantially to the human–chimpanzee divergence.

On the human lineage, relative to chimpanzee, we find several HAHORs containing HOR structures, in particular the alphoid HORs, the ~2.4 kb DAZ repetitions, and the ~15.8 kb repetitions. On the other hand, in the chimpanzee genome we find a chimpanzee-accelerated HOR region (CAHOR) based on a ~550 bp PRU.

While the HARs discovered previously (Pollard et al. 2006a, b; Popesco et al. 2006; Prabhakar et al. 2006; Pollard 2009) are characterized by short, dense clusters of nucleotide substitutions, the HAHORs found in this work are characterized by higher-order organization extending over larger genomic stretches.

Our results show explicitly that large repeat units and HORs provide a substantial contribution to the human–chimpanzee divergence.

GRM Analysis

GRM analysis was performed using a novel GRM code, which is available upon request.