Introduction

During the last two decades, an increasing number of viral, prokaryotic, and eukaryotic genomes have been released to public databases thanks to progress in sequencing and analysis technologies (Sedlazeck et al. 2018). Second-generation sequencing technologies allow high-throughput production of short reads (50 to 400 nucleotides long) of excellent quality, while third-generation sequencing technologies produce long sequencing reads from single DNA molecules. Furthermore, modern genome assemblers can integrate information from several sources. The constant decrease in the per-base-pair cost of DNA sequencing that accompanies these technological innovations widens the range of questions that can be addressed in genomic studies.

As these new research opportunities become available, it is important to remember that such studies rely on genome sequences and annotations that are only models. These models depend on the quality of the DNA sequences, the mapping technology, the automated genome assembly, and the annotation pipelines, both automated and manual. The quality of genome models is mainly evaluated from the point of view of DNA sequencing and assembly quality, through a series of metrics related to the size of the model genome, the sizes of the contigs and scaffolds, features of misassembled contigs, and how completely functional elements (mostly genes) have been assembled (Gurevich et al. 2013; Khiste and Ilie 2015). Some genomic regions, such as centromeres, have acquired a reputation for being difficult to sequence and assemble because of their repeat content (Copenhaver 2003; Kapusta and Suh 2017; Kapusta et al. 2017; Aldrup-MacDonald and Sullivan 2014; Khost et al. 2017). Because of this, they have received less attention than other genomic regions, except in the most studied model organisms such as baker's yeast (Saccharomyces cerevisiae), fruit flies (Drosophila melanogaster), thale cress (Arabidopsis thaliana), mice (Mus musculus), and humans (Homo sapiens). As a consequence, the location and size of centromeres in chromosome models are often unknown and are represented by long tracts of Ns.

Centromeres are specialized chromosomal regions that are involved in chromosome segregation during mitosis and meiosis. They are not determined by their DNA sequence but by epigenetic mechanisms that partially involve the deposition of centromere-specific histone H3 variant CENP-A (so-called CENH3) within centromeric nucleosomes and an enrichment in the histone modification H4K20me1 (Hori et al. 2014, 2017). The location of centromeres in a genome model can be identified by the position of CENP-A enrichment peaks on chromosomes using ChIP-seq data obtained with anti-CENP-A antibodies. A complementary approach is to use in situ hybridization with marker probes surrounding the centromeres (Kretschmer et al. 2018).

Centromere regions can vary greatly in size and in sequence between species, but also between chromosomes within a species (for review, see Plohl et al. 2014). Each chromosome may carry its centromere at a single locus (centric), or the centromeres may all be diffuse (holocentric). The latter type arose independently at least 13 times during the evolution of both plants and animals, but the DNA sequences of such centromeres remain poorly described. There are two main types of centric centromeres, repeat-based and repeat-free. Repeat-based centromeres are generally composed of large arrays of tandem repeats (so-called satellite DNA) in which transposable elements are interspersed and have likely accumulated over time. Species can generally be classified by their centromeres into three groups: (i) those in which all chromosomes display repeat-based centromeres (e.g., H. sapiens and D. melanogaster), (ii) those in which all chromosomes display repeat-free centromeres (e.g., S. cerevisiae), and (iii) those in which centromeres are repeat-based or repeat-free, depending on the chromosome (e.g., Equus caballus and Solanum tuberosum).

The chicken genome was the third vertebrate genome to be sequenced (International Chicken Genome Sequencing Consortium 2004), and to date, birds have been one of the most investigated vertebrate groups in genome-sequencing projects (Zhang et al. 2014). However, avian genomes remain a technical challenge to sequence and assemble, due in part to their high GC content. Recently, significant discrepancies between expected and assembled genome sizes in eukaryotes have been reported (Peona et al. 2018). One of the most striking examples is the ostrich (Struthio camelus), for which the genome model has a size of 1.23 Gbp (Zhang et al. 2015) while the genome size estimated with more classic methods is 2.16 Gbp (Eden et al. 1978). The chicken genome is organized into 10 macrochromosomes (1 to 9, plus Z) and 29 microchromosomes (10 to 38, plus W) and is reluctant to deliver “all its secrets,” especially those of certain microchromosomes. In the current galGal5 genome model (Warren et al. 2017), 6 microchromosomes (29 and 34 to 38) are not represented. In addition, the GC-rich outer arm ends and subtelomeric regions of macrochromosomes (Federico et al. 2005) were recently found to possibly harbor genes that are absent from the current chicken model (Seroussi et al. 2017; Mello and Lovell 2018). Centromeres are also poorly described in the different versions of the chicken genome model.

In the previous galGal4 chicken genome model, macrochromosome centromere sizes were arbitrarily assigned as 1,500,000 Ns and those of microchromosomes as 500,000 Ns in the absence of any evidence of their true lengths (International Chicken Genome Sequencing Consortium 2004). In the galGal5 model, centromeres were again arbitrarily assigned a stretch of Ns (500,000 this time) in all 16 chromosomes where they were annotated (see the UCSC genome browser, https://genome.ucsc.edu/cgi-bin/hgGateway). Some authors have argued that the absence of centromere sequences was due to the difficulty in sequencing and assembling centromeres (Kapusta and Suh 2017; Kapusta et al. 2017). However, this is not always the case, since the organization of some chicken centromeres has previously been described, and these were identified as being of at least two types (Shang et al. 2010). The DNA sequence of centromeres in chromosomes 1, 2, 3, 4, 7, 8, and 11 was found to consist of chromosome-specific tandem repeat arrays that span several hundred kilobases. By contrast, the DNA sequence of centromeres in chromosomes 5, 27, and Z does not contain tandem repeat sequences and spans regions of about 30 kb. Therefore, one would expect both large and small centromeres in the chicken genome, depending on the chromosome. Furthermore, small centromeres should, a priori, not present any particular difficulties for sequencing and assembly.

Here, we review current knowledge regarding centromere localization in chicken chromosomes by comparing their features in the galGal5 model with three sources of published information: (i) one dataset of Illumina reads (SRA archive DRR018430) obtained from a ChIP-seq experiment using chromatin of DT40 cells (a chicken bursal lymphoma cell line) and anti-CENP-A antibodies for immunoprecipitation (Shang et al. 2013); (ii) sequence markers close to centromeres, with a location previously verified by fluorescent in situ hybridization mapping on giant lampbrush chromosomes from growing chicken oocytes (Krasikova et al. 2006, 2012; Zlotina et al. 2010, 2012); and (iii) sequences assembled from Illumina reads obtained in a ChIP-seq experiment performed using the chromatin of DT40 cells transfected with a plasmid vector expressing a flag-CENP-A protein and anti-flag antibodies for immunoprecipitation (Shang et al. 2010).

Results

Location of N-tracts and centromeres in galGal5

We reviewed karyology studies to verify the centromere location in each chicken chromosome, categorizing them as metacentric, submetacentric, acrocentric, subtelocentric, or telocentric (Table 1, columns 2 to 5; Fechheimer 1990). Using a custom-written Perl script, the N-tracts were inventoried in galGal5 chromosomes (Online Resource ESM_1.xlsx) and compared to the annotation of centromeres available on the UCSC website (http://hgdownload.cse.ucsc.edu/goldenPath/galGal5/database/cytoBandIdeo.txt.gz). In the assembled chromosomes, centromeres are represented by tracts of 500,000 Ns, but some other long N-tracts (Table 1, columns 4 to 6) correspond to regions that were difficult to sequence and assemble and are not centromeres (e.g., in chromosome 27 between positions 1,073,340 and 1,173,806 [100,466 bp]).
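The N-tract inventory described above can be illustrated with a minimal Python sketch. It is not the authors' Perl script, which is not reproduced here; the function name `find_n_tracts`, the `min_length` parameter, and the toy sequence are hypothetical. Coordinates are assumed to be 0-based and half-open (BED style), so that the tract length equals end minus start.

```python
import re
from typing import Iterator, Tuple


def find_n_tracts(sequence: str, min_length: int = 1) -> Iterator[Tuple[int, int, int]]:
    """Yield (start, end, length) for every run of N or n in the sequence.

    Coordinates are 0-based, half-open (an assumption of this sketch),
    so length == end - start.
    """
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        if length >= min_length:
            yield match.start(), match.end(), length


# Toy sequence (hypothetical): one run of 6 Ns and one run of 3 soft-masked ns.
toy = "ACGT" + "N" * 6 + "GGCC" + "n" * 3 + "TT"

all_tracts = list(find_n_tracts(toy))
long_tracts = list(find_n_tracts(toy, min_length=5))
print(all_tracts)   # [(4, 10, 6), (14, 17, 3)]
print(long_tracts)  # [(4, 10, 6)]
```

On a real assembly, the same function would be applied to each chromosome sequence read from the FASTA file, with `min_length` set high enough to retain only long tracts such as the 500,000-N placeholders.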

Table 1 Features of centromeres in galGal5 and location of the most enriched CENP-A regions

Location of CENP-A-enriched peaks in galGal5

The locations of the 500,000 N-tracts were first compared to those of CENP-A enrichment peaks in galGal5, calculated as described by Shang et al. (2013) from a ChIP-seq dataset obtained by immunoprecipitating DT40 cell chromatin with anti-CENP-A antibodies. Briefly, the DRR018430 SRA dataset was downloaded (https://www.ncbi.nlm.nih.gov/sra/?term=DRR018430.sra), filtered, and aligned with bowtie2 (Langmead and Salzberg 2012) to the galGal5 model. The resulting BAM file was then transformed into a bedgraph file using a window of 10,000 nucleotides, and the presence of peaks was visualized in galGal5 chromosomes using the Integrative Genomics Viewer (IGV; James et al. 2011; Thorvaldsdóttir et al. 2013). Four different types of outcomes were observed: (1) chromosomes in which the CENP-A peaks and the N-tract placed the centromere in the same region, that is, in which the putative centromere (represented by 500,000 Ns; Fig. 1, gray boxes) was flanked by the main peaks of CENP-A enrichment (Fig. 1a–d); (2) chromosomes in which the putative centromere was flanked by peaks of CENP-A enrichment that were not the main peaks (e.g., chromosome 3, Fig. 1b); (3) chromosomes in which the putative centromere was not flanked by peaks of CENP-A enrichment and in which a strong CENP-A peak was localized elsewhere (e.g., chromosomes 5 and 27, Fig. 1d, e); and (4) chromosomes in which the location of the centromere was not indicated in the galGal5 annotation but could be positioned in the galGal5 model using ChIP-seq data (e.g., chromosomes 28 and Z, Fig. 1f, g). A summary of centromere locations identified using ChIP-seq data is shown in Table 1, columns 6 to 8. It revealed that the two information sources agreed on the centromere location for only 7 (1, 2, 4, 7, 8, 11, and 14) of the 16 chromosomes that were annotated with a centromere in galGal5.
The centromere locations in chromosome models 3, 5, 6, 10, 13, 25, and 27 were found to be different from those indicated in the galGal5 annotation, and the ChIP-seq data did not support the presence of a sequenced centromere in chromosome models 9 and 25, as annotated in galGal5 (Table 1). The probable reason is that chromosome models such as 9 and 25, the pericentromeric regions of which could not be assembled, contain no CENP-A-enriched region. However, centromeres, or centromeric sequences, could be positioned in chromosome models 12, 15, 16, 18, 19, 20, 21, 26, 28, and Z. For chromosome 16, which contains several types of repeats (genes and satellite DNA), previous studies have shown that the centromere is subtelomeric (Miller et al. 2014). The near-telomeric location of the CENP-A peak in chromosome model 16 was due to the absence from the model of the p-arm, which mainly consists of AT-rich repeats.
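The 10,000-nucleotide windowing step that turns alignments into a bedgraph track can be sketched as follows. This is a simplified illustration, not the pipeline used with the DRR018430 data: it assumes that alignment start positions (0-based) have already been extracted from the BAM file, and the names `window_counts`, `to_bedgraph`, and `chrToy` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, int, int, int]


def window_counts(chrom: str, read_starts: Iterable[int],
                  chrom_length: int, window: int = 10_000) -> List[Row]:
    """Count alignment starts per fixed-size window along one chromosome.

    Returns bedGraph-style rows (chrom, start, end, count) with 0-based,
    half-open coordinates; the last window is truncated at chrom_length.
    """
    counts = Counter(pos // window for pos in read_starts
                     if 0 <= pos < chrom_length)
    n_windows = (chrom_length + window - 1) // window  # ceiling division
    rows = []
    for index in range(n_windows):
        start = index * window
        end = min(start + window, chrom_length)
        rows.append((chrom, start, end, counts.get(index, 0)))
    return rows


def to_bedgraph(rows: List[Row]) -> str:
    """Format rows as tab-separated bedGraph lines."""
    return "\n".join(f"{c}\t{s}\t{e}\t{n}" for c, s, e, n in rows)


# Toy data (hypothetical): six alignment starts on a 35,000-nt chromosome.
rows = window_counts("chrToy", [100, 9_999, 10_000, 12_500, 12_600, 34_000],
                     chrom_length=35_000)
print(to_bedgraph(rows))
```

Windows with counts well above the chromosome-wide background are the candidate CENP-A enrichment peaks that were inspected visually in IGV; any numerical threshold for calling a peak would be an additional assumption.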

Fig. 1

Graphic representation, using the IGV, of inner regions of chromosome models 1 (a), 3 (b), 4 (c), 5 (d), 27 (e), and 28 (f), and of the complete model of chromosome Z (g). Centromeres corresponding to N-tracts in the official annotation of galGal5 are indicated with gray boxes. Peaks of CENP-A enrichment are indicated by pink bars. Markers (WAG35O13 and WAG44P17; Zlotina et al. 2012) used for fluorescent hybridization to locate the centromere in chromosome 3 are indicated with their names in red. Below each graphic, gene-containing regions are indicated in blue. Above each graphic, the scale of the region is indicated. In a, c, e, and g, small CENP-A peaks are shown, reflecting the putative presence of neocentromeres (Shang et al. 2013).

Confirmation of centromere locations using data from karyological markers and from prior centromeric sequences

The sequences of marker probes flanking centromeres in chromosomes 1, 2, and 3 (Zlotina et al. 2012) were used to verify centromere locations in these three chromosome models. We found that markers WAG43N11 and WAG53E23 were located at positions 65,721,251 and 76,397,565 in chromosome 1, and markers WAG21J8 and WAG18G1 at positions 50,277,689 and 53,029,789 in chromosome 2; in each chromosome, the markers surrounded the centromere (Table 1). In chromosome 3, markers WAG35O13 and WAG44P17 were located at positions 2,133,568 and 5,508,192. They did not flank the centromere described in the galGal5 annotation but flanked the CENP-A peak (Fig. 1b). This confirmed that the largest CENP-A peak marked the centromere in chromosome 3, and that the main peaks detected in all other chromosomes were very likely reliable. The CENP-A peaks flanking the chromosome 3 centromere described in the galGal5 annotation might correspond to the presence of a neocentromere. Neocentromeres are present in all eukaryote taxa and correspond to atypical centromeres that are spontaneously bound by CENP-A and are able to form on unique sequence regions (Scott and Sullivan 2014). They are able to act as centromeres when the main centromere is lost by deletion (Shang et al. 2013).
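The flanking test applied to the marker probes amounts to a simple interval check, sketched here in Python. The function name `markers_flank` is hypothetical. The marker positions are those reported for chromosomes 2 and 3; the intervals used for the centromeric regions are the positions of the centromeric repeat matches reported for these chromosomes (AB556723 and AB556724), taken as stand-ins for the CENP-A peaks; and the last interval is a purely hypothetical location chosen to show a negative result.

```python
def markers_flank(marker_a: int, marker_b: int,
                  region_start: int, region_end: int) -> bool:
    """Return True if the region lies entirely between the two markers."""
    left, right = sorted((marker_a, marker_b))
    return left <= region_start and region_end <= right


# Chromosome 2: markers WAG21J8 and WAG18G1 versus the AB556723 repeat region.
chr2_ok = markers_flank(50_277_689, 53_029_789, 52_315_814, 52_854_454)

# Chromosome 3: markers WAG35O13 and WAG44P17 versus the AB556724 repeat region.
chr3_ok = markers_flank(2_133_568, 5_508_192, 2_464_414, 2_476_273)

# Hypothetical interval outside the markers, for illustration only.
chr3_elsewhere = markers_flank(2_133_568, 5_508_192, 10_000_000, 10_500_000)

print(chr2_ok, chr3_ok, chr3_elsewhere)  # True True False
```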

Finally, centromeric sequences (NCBI accessions AB556643 to AB556736) corresponding to major families of tandem repeats in the chicken genome, which were previously reconstructed from a ChIP-seq Illumina dataset obtained with a different antibody (Shang et al. 2010), were used to search chromosome sequences. Among the 7 centromeres located in the same region by the galGal5 annotation and the CENP-A peaks, sequence AB556722 was found to match 6 repeats within the centromere of chromosome 1 (positions 74,615,536 to 75,136,859), AB556723 matched 12 repeats within the centromere of chromosome 2 (positions 52,315,814 to 52,854,454), AB556725 matched 15 repeats within the centromere of chromosome 4 (positions 18,841,161 to 18,852,575), AB556726 matched 12 repeats within the centromere of chromosome 7 (positions 7,330,539 to 7,848,612), AB556727 matched 10 repeats within the centromere of chromosome 8 (positions 10,498,281 to 11,007,717), and AB556728 matched 7 repeats within the centromere of chromosome 11 (positions 3,305,308 to 3,323,207). In chromosome 8, AB556727 repeats were also found between positions 11,308,276 and 11,379,802, interspersed with 8 copies of AF124927 sequences that belong to another family of partially inverted tandem repeats (PIR; Wang et al. 2002). This confirmed the specific centromeric origin of these sequences and their accuracy for locating centromeres.

We found that sequence AB556724 matched six repeats between positions 2,464,414 and 2,476,273 in chromosome 3, AB556729 matched two repeats between positions 3,032,478 and 3,061,773 in chromosome 5, AB556655.1 matched one repeat between positions 30,655,666 and 30,656,292 in chromosome 6, and AB556731 matched five repeats between positions 42,746,469 and 42,775,985 in chromosome Z. In these four macrochromosomes, the matches were located within the CENP-A peak, which is outside the centromeres described in the galGal5 annotation. For macrochromosome 9, no information besides the galGal5 annotation confirmed the location of a centromere in its sequence. Matches with sequences of centromeric origin were found within regions containing CENP-A peaks of microchromosomes 10, 11, 12, 16, 18, 20, 21, 27, and 28 (Table 2). No match was found between the X51431 sequence, which has been described as a 41 or 42 bp tandemly repeated sequence monomer, and the centromeres of microchromosomes and of macrochromosomes 7 and 8, where 41 and 42 bp repeat arrays were expected (Matzke et al. 1990).

Table 2 Features of complete or partial matches between centromeric sequences and some microchromosomes

Concluding remarks

The comparison of existing public datasets to the galGal5 annotation confirmed that CENP-A ChIP-seq datasets are a reliable tool to localize centromeres in the sequence of chromosome models, including those of the chicken. In this study, we were able to localize centromere positions in 25 chromosomes (1 to 8, 10 to 22, 26 to 28, and Z). Contrary to popular belief, parts of these centromeres have successfully been sequenced and assembled in the galGal5 model and could be used for investigating synteny with other avian species. Our results also supported the existence of at least three kinds of centromeres in the chicken genome: (i) centromeres consisting of chromosome-specific homologous tandem repetitive arrays that span several hundred kilobases in chromosomes 1, 2, 4, 7, 8, 11, and 17; (ii) centromeres that do not contain tandem repetitive sequences and that span regions of about 10–70 kb in chromosomes 3, 5, 6, 10, 12, 16, 18, 20, 21, 27, 28, and Z; and (iii) centromeres that contain several kinds of tandem repetitive sequences and span regions of approximately 10–70 kb, likely in chromosomes 13, 14, 15, 17, 19, 21, and 22. Although the centromere status for all of these chromosomes is now established, these results should be taken cautiously, particularly for centromeres that do not contain tandem repetitive sequences. Indeed, the lack of tandem repeats might result from an artifact, so-called muted gaps, which can arise during the assembly process (Chaisson et al. 2015; Thomma et al. 2016). When tandem repeats are nearly identical or perfectly conserved in sequence, their assembly can collapse the array into a single copy. Such artifacts should be identified by performing copy number analyses using datasets of Illumina genomic resequencing (Abyzov et al. 2011).
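The principle of such a copy number check can be sketched with a simple read-depth ratio. This is a deliberately naive illustration, not the method of the cited study: it assumes that per-base (or per-window) depths from a genomic resequencing dataset are already available, and the function name `depth_ratio` and all depth values are hypothetical.

```python
from statistics import mean, median
from typing import Sequence


def depth_ratio(region_depths: Sequence[float],
                background_depths: Sequence[float]) -> float:
    """Estimate relative copy number as mean region depth / median background depth.

    A ratio close to 1 suggests a single-copy region; a ratio of about k
    suggests that roughly k copies were collapsed into one during assembly.
    """
    background = median(background_depths)
    if background <= 0:
        raise ValueError("background depth must be positive")
    return mean(region_depths) / background


# Hypothetical depths: a genome-wide background near 30x and a candidate
# region covered at about 300x.
background = [28, 30, 31, 29, 30, 32, 30]
candidate_region = [300, 290, 310]

ratio = depth_ratio(candidate_region, background)
print(ratio)  # 10.0, compatible with about ten copies collapsed into one
```

In practice, depth is also affected by GC content and mappability, so a ratio of this kind would only flag candidate regions for closer inspection.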

Our results have also highlighted two issues with the current galGal5 assembly. The first was the absence of centromeres in chromosome models 9, 23 to 25, 30 to 33, and W. It has commonly been assumed that this was due to sequencing and assembly difficulties and would likely be resolved in future genome models. The second was that the content of regions describing the centromeres in the galGal5 annotation was found to be inaccurate; these were the centromeres of chromosomes 3, 5, 6, 9, 10, 13, and 22. These discrepancies may have two causes. First, they may have resulted from artifacts during the genome assembly step. Second, these regions were likely considered by the scientists in charge of the chicken genome project as large regions (~ 500,000 nucleotides) that were difficult to sequence and assemble. Such sequence characteristics are generally correlated with regions displaying an elevated GC content (Aird et al. 2011; Nakamura et al. 2011; Benjamini and Speed 2012; Dabney and Meyer 2012; Oyola et al. 2012; Ross et al. 2013). Interestingly, highly GC-rich regions were found in the inner regions of some macrochromosome arms (Andreozzi et al. 2001; Federico et al. 2005; Costantini et al. 2007), but they are not present in the current sequence of macrochromosome models. Therefore, regions associated with GC-rich subtelomeric regions of macrochromosomes and the missing microchromosome models 34, 35, 36, 37, and 38 might be the hidden harbors of at least a part of the 1500 lost genes in the chicken genome that are located in GC-rich regions in other vertebrate genomes (Mello and Lovell 2018).

It is important to have reliable centromere annotations for a genome model such as that of the chicken, which is a reference in avian genomics and evolutionary studies. It is also very important to understand the dynamics and plasticity of avian genomes. Knowing the true location of centromeres should allow researchers to verify whether they are the seat of repeat expansions by retrotransposition during cellular differentiation, similar to those observed in mammalian genomes (Bersani et al. 2015; Tanne et al. 2015), and to verify whether there is a source of retrotransposition in a genome where all transposable elements were believed to be extinct (Guizard et al. 2016). Knowledge of centromere positions is also important for speciation and population research, for example when studying the effect they may have via the “centromere drive” and linked selection due to low meiotic recombination rates (Henikoff et al. 2001; Weissensteiner et al. 2017; Wolf and Ellegren 2017). It is therefore important for the scientific community to re-examine the gold standard procedures for annotating centromeres and putative neocentromeres in chromosome models. For this, the existing karyology literature and CENP-A ChIP-seq data might be used more systematically to verify their location.