Introduction

Water buffalo were domesticated about 5,000 years ago and are the most important farm animal resource in developing tropical and subtropical countries, contributing greatly to the local economy of rural areas (Michelizzi et al. 2010). Two types of water buffalo are recognized, the river and the swamp buffalo. River buffalo have been selectively bred as dairy animals while swamp buffalo are typically used as draft animals (Zhang et al. 2007).

The use of molecular genetic approaches has increased the genetic gain in animal selection programs. After the release of the first de novo assembly of an Italian Mediterranean river buffalo (UMD_CASPUR_WB_2.0) (Williams et al. 2017), and with the availability of a 90K single nucleotide polymorphism (SNP) chip for buffalo (Iamartino et al. 2017), SNPs have been used in a range of studies of water buffalo (Colli et al. 2018; Whitacre et al. 2017). The current reference assembly is fragmented into 366,983 scaffolds with a low scaffold N50 of ~1.4 Mb, which makes it unsuitable for direct detection of large genetic variants that may contribute to the genetic diversity of the species (Williams et al. 2017). One such class of genetic variant, copy number variation (CNV), which consists of duplications and deletions of DNA sequence ranging from 50 bp to several megabases, impacts a large percentage of genomic sequence and potentially has a greater functional effect than SNPs (Henrichsen et al. 2009). CNVs have been extensively reported in human (Sudmant et al. 2015), primates (Gokcumen and Lee 2009), mouse (Henrichsen et al. 2009), zebrafish (Brown et al. 2012), dog (Nicholas et al. 2009), and livestock, including chicken (Yi et al. 2014), pig (Jiang et al. 2014), horse (Doan et al. 2012), sheep (Fontanesi et al. 2011), goat (Fontanesi et al. 2010) and cattle (Bickhart et al. 2016; Zhou et al. 2016; Liu et al. 2010). However, there have only been two surveys of CNVs in water buffalo, including a recent survey from our group (Li et al. 2018; Zhang et al. 2014).

Comparative genomic hybridization (CGH) arrays, SNP genotyping arrays, and high throughput sequencing (HTS) have been used for genome-wide CNV screens. However, the major limitation of CGH and SNP arrays is that they are indirect screens, providing no information on the actual structure of the variation detected (Pinto et al. 2011; Bickhart and Liu 2014; Li and Olivier 2013). Additionally, the resolution of CGH or SNP arrays is limited by the probe density of the array, and certain genetic variants, such as balanced rearrangements (e.g. inversions) and novel DNA sequence, cannot be detected using these approaches. The decreasing cost of DNA sequencing has enabled CNVs to be detected at a high effective resolution and sensitivity. Numerous methods have been developed for CNV detection using next-generation short-read sequencing, including read pair (RP), read depth (RD), split read (SR), sequence assembly (SA), and hybrid algorithms (combinatorial detection; e.g. Genome STRiP) (Snyder et al. 2010; Mills et al. 2011; Handsaker et al. 2015). Among these, RD methods are highly sensitive in discovering duplications and are capable of determining exact copy number (CN) values for each genetic locus in an individual (Sudmant et al. 2010). The mrFAST/mrsFAST and whole genome shotgun sequence detection (WSSD) method (Sudmant et al. 2010; Alkan et al. 2009; Hach et al. 2010) can be used to construct personalized CNV maps in or near segmental duplication (SD) regions by reporting all mapping locations for sequence reads, whereas other RD methods only take one mapping location per read into consideration. In those methods, when a read maps to multiple equally good locations, a random locus is often selected for downstream analyses. Due to the higher frequency of CNVs in or near duplication regions in the genome (Cheng et al. 2005; Bickhart et al. 2012), mrFAST and mrsFAST are especially efficient in detecting CNV within or near duplication- and repeat-rich regions. CNV detection methods that are locus-specific, including fluorescence in situ hybridization (FISH) and quantitative polymerase chain reaction (qPCR), can be used to detect large CNVs and are often used to experimentally validate the CNVs predicted by genome-wide methods (Doan et al. 2012; Bickhart et al. 2012). Notably, CNV detection in most livestock has been limited to only one or two methods, and lacks rigorous experimental validation.

The recent release of a water buffalo draft reference genome has accelerated genomic studies and the application of genetic selection in this species (Williams et al. 2017). However, the draft assembly is highly fragmented and not as thoroughly annotated as the cattle reference genome. This study used a comparative alignment of the buffalo DNA sequences with the completed reference genome of Bos taurus, to systematically detect CNV in the water buffalo genome. The CNV identified were validated using a CGH-based whole-genome approach, followed by FISH and qPCR confirmation for selected CNVs. By assessing the CNV distribution of water buffalo at a genome-wide level, we provide information for studies into highly duplicated regions in the water buffalo genome e.g. to uncover duplicated genes that may be associated with agriculturally important traits.

Results

Genome-wide identification of segmental duplications

We retrieved the whole genome Illumina HTS reads of the water buffalo (Olimpia) whose sequence was recently assembled and released as the draft reference genome (Williams et al. 2017). We mapped the reads to the UMD3.1 cattle genome assembly (Zimin et al. 2009). We then detected segmental duplications (SDs) using a sliding window approach, based on a previously published mrsFAST-WSSD method (≥ 1kb in length, ≥ 90% sequence identity) (Alkan et al. 2009). Due to the short lengths of chrUn contigs (i.e. unplaced contigs) and the ambiguous mapping of the chrUn sequence reads, we excluded events mapping to chrUn contigs in the subsequent analyses. We discovered 1,038 SDs in the autosomes and X chromosome of Olimpia, spanning ~44.6 Mb (~1.73%) of the cattle genome. This is comparable with the previously predicted extent of CNVs in cattle (49.2 Mb, excluding SDs in cattle chrUn contigs) (Liu et al. 2009). The buffalo SDs ranged in size from 1270bp to 750,223bp, with an average size of ~43 kb (standard deviation = 60.5 kb) (Table S1).

FISH validation of the predicted segmental duplication

To confirm the SDs detected, we experimentally validated a subset of the largest (≥ 20 kb) duplicated regions by FISH. A total of 121 cattle BAC clones corresponding to the WSSD duplicated regions were used as probes which were hybridized against the buccal epithelial cells of Olimpia (Table S2). Twenty probes failed to generate hybridization signals. We observed multiple signals for 70.3% (71/101) of the remaining probes. As expected, the majority of duplicated sequences were intra-chromosomal (52/71), while inter-chromosomal duplications showed signals on multiple non-homologous chromosomes, accounting for less than 27% of the regions tested. These data suggested that tandem intra-chromosomal duplications predominate in the water buffalo genome, which is similar to other mammalian species (Nicholas et al. 2009; Liu et al. 2009). FISH results confirmed that the following genes are duplicated: peptidase inhibitor 3 (PI3), olfactory receptor (OR) genes and pregnancy-associated glycoprotein (PAG) gene families (Fig. 1).

Fig. 1.

Validation of segmental duplications by FISH analysis. a Example of an interchromosomal duplication detected with clone CH240-1M15, covering the PI3 gene. (b, c) Two representative examples of tandem intrachromosomal duplications detected with clones CH240-447E04 and CH240-35K10, covering the OR genes and the PAG gene family

CNV discovery and dataset statistics

To study CNVs that might be polymorphic or fixed, we aligned short reads from 14 additional water buffaloes to the UMD3.1 cattle genome using the mrsFAST short read aligner, and called CNVs using the WSSD read depth approach. Based on sequence RD against the reference genome, we detected CNVs for the 14 individuals in the autosomes and the X chromosome. The number of duplications ranged from 839 (ITWB2) to 900 (ITWB7), and the number of deletions varied from 0 (ITWB8) to 273 (ITWB6). While our method had sufficient power to detect duplications, variation in RD across the autosomes, measured in standard deviations (STDEVs), limited our discovery to extreme deletion events (Bickhart et al. 2012). The CNVs from all individuals (including Olimpia) were merged if they overlapped by 1 bp or more. In total, we detected 1,344 unique CNV regions (CNVRs) (1,041 gains, 279 losses and 24 both), amounting to 59.8 Mbp or 2.2% of the total bases in the cattle genome. A full list of CNVRs is provided in Table S3. A representative overview of the CNV landscape mapped onto cattle chromosome 5 is shown in Fig. 2, and the other individual chromosome plots are shown in Figs. S1-S6.
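The merging rule described above (calls from all animals combined whenever they share at least 1 bp) can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the tuple layout, the function names and the toy coordinates are assumptions made for the example.

```python
from collections import defaultdict


def _label(kinds):
    """Return 'gain', 'loss', or 'both' for the set of call types in a region."""
    return "both" if len(kinds) > 1 else next(iter(kinds))


def merge_cnvs(calls):
    """Merge per-animal CNV calls into CNV regions (CNVRs).

    calls: iterable of (chrom, start, end, kind), with 1-based inclusive
    coordinates and kind in {"gain", "loss"} (hypothetical input format).
    Calls on the same chromosome that share at least 1 bp are merged.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, kind in calls:
        by_chrom[chrom].append((start, end, kind))

    regions = []
    for chrom in sorted(by_chrom):
        current = None  # [start, end, set of call types]
        for start, end, kind in sorted(by_chrom[chrom]):
            # With inclusive coordinates, start <= current end means >= 1 bp shared.
            if current is not None and start <= current[1]:
                current[1] = max(current[1], end)
                current[2].add(kind)
            else:
                if current is not None:
                    regions.append((chrom, current[0], current[1], _label(current[2])))
                current = [start, end, {kind}]
        if current is not None:
            regions.append((chrom, current[0], current[1], _label(current[2])))
    return regions


# Toy calls, not data from the study.
example_calls = [
    ("chr5", 100, 200, "gain"),
    ("chr5", 200, 300, "loss"),   # shares position 200 with the first call
    ("chr5", 301, 400, "gain"),   # adjacent but not overlapping
    ("chr13", 50, 60, "loss"),
]
example_cnvrs = merge_cnvs(example_calls)
```

A region that absorbs both a gain and a loss call is labelled "both", mirroring the three CNVR classes reported in the text.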

Fig. 2.

CNV map of water buffalo chromosome 5. The UMD3.1 assembly is represented as black bars with assembly gaps indicated by white boxes on the chromosomes. Tracks underneath the chromosomes represent the SDs for Olimpia, and CNV data sets for 14 additional buffaloes. The colors for each bar in the animal data set tracks represent the average estimated CN for each CNV as shown in the legend

Genes overlapping with copy number variation

Using BioMart in the Ensembl database (Ensembl Genes 79), we obtained the IDs for the genes that were located within, or overlapped with, the detected CNVRs. We identified a total of 1,245 genes, and 47.4% of the CNVRs encompass 1 or more genes (Table S3). Using the mrsFAST-WSSD algorithm, we assigned a CN estimate to each gene. Gene regions outside the predicted CNVRs were found to have a median CN estimate of 2.05, suggesting that CNV detection and CN assignment were concordant. Genes within CNVRs were found to be highly variable in CN among individuals (minimum value: 0; maximum: 299; median: 5.43; average: 7.54) (Table S4). To test the hypothesis that particular gene classes were over-represented in duplicated regions, we assigned PANTHER terms to all genes that overlapped duplications (Mi et al. 2017). We observed statistically significant enrichments in genes that participate in immune response, oxygen transport, sensory system and signal transduction, which is consistent with similar analyses of duplications in other organisms (false discovery rate [FDR] < 0.05, Table S5). Of the top 25 most copy number variable genes, most had functions related to the immune response, such as interferons, the melanoma antigen family and the PAG gene family (Table 1). One CNV-impacted gene, PI3, which encodes the trappin/elafin anti-microbial/immune system modulator protein (Belaaouaj et al. 1998; Fujishima et al. 2008), had a high CN value (average CN: 6.5) in the water buffalo (Fig. 3a, Table S4). Another highly CNV-impacted gene, UL16 binding protein 3 (ULBP3) (average CN: 8.3), encodes one of several related ligands of the KLRK1/NKG2D receptor, which is involved in the regulation of both innate and adaptive immune responses (Vivier et al. 2002) (Fig. 3b, Table S4).

Table 1. Top 25 copy number variable genes in the individual buffaloes, with locations given on the UMD3.1 bovine reference genome

Fig. 3.

Genes with copy number variations in individual water buffaloes. a Copy number values for each animal were plotted within the PI3 locus (chr13:74180018-74298194) using the color scheme depicted in the legend. Heatmap boxes represent 1-kbp sliding, nonoverlapping windows in the region. b Copy number values within the ULBP3 locus (chr17:39860701-39894230). The duplications of these two loci were confirmed using qPCR

Validation with aCGH analysis

To confirm individual CNVs, we performed aCGH experiments using the 14 additional water buffalo samples with Olimpia as the reference. We compared the RD predicted CNV intervals with the aCGH results. To make the CN estimates comparable with the aCGH results, we calculated log2 ratios between CN estimates for the 14 buffaloes and that of Olimpia using a digital aCGH approach (Sudmant et al. 2010). Based on the predicted CN values within filtered CNVs (> 20 kb that contained < 80% common repeat content), we generated log2 ratios between CN estimates and compared them with log2 ratios of the aCGH probes using a linear regression model (Sudmant et al. 2010). Within the CNV regions, we observed a high correlation (Pearson r = 0.781) between log2 ratios of CN estimates and aCGH log2 ratios (Fig. 4). The computational prediction and aCGH validation of Olimpia and another three randomly selected individuals in two CNV regions are presented in Fig. 5. These two regions cover the PI3, PAG3 and PAG6 immune related genes. The duplication of these three genes in Olimpia was confirmed by FISH analysis (Fig. 1, Table S2).
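The digital aCGH comparison described above reduces to two steps: a per-window log2 ratio of a test animal's CN estimate over the reference animal's CN estimate, and a Pearson correlation between those ratios and the array log2 ratios. The sketch below illustrates both under stated assumptions: the function names are hypothetical, the small `floor` that guards against zero CN is an addition not described in the text, and the numbers are toy values rather than study data.

```python
import math


def digital_acgh(test_cn, ref_cn, floor=0.1):
    """Per-window log2 ratio of test CN over reference CN.

    floor is an assumed safeguard so that CN = 0 does not produce log(0).
    """
    return [
        math.log2(max(t, floor) / max(r, floor))
        for t, r in zip(test_cn, ref_cn)
    ]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy CN estimates for three 1-kb windows (test animal vs. reference animal).
ratios = digital_acgh([4.0, 2.0, 1.0], [2.0, 2.0, 2.0])
# Toy array log2 ratios for the same windows.
r = pearson_r(ratios, [0.8, 0.1, -0.9])
```

In this toy case a window with twice the reference CN gives a log2 ratio of 1 and a window with half the reference CN gives -1, which is the scale on which the array probes are compared.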

Fig. 4.

Correlation between digital aCGH values (log2 ratios between CN estimates) and whole genome aCGH (log2 ratios of hybridization probes). Digital aCGH values were estimated using a log2 ratio of the 1-kbp CN windows from each water buffalo individual divided by CN estimates from Olimpia. A high correlation (r = 0.781) was found for aCGH probe values and digital aCGH values within CNV intervals > 20 kb that had fewer than 80% of their lengths occupied by common repeats

Fig. 5.

Computational predictions and aCGH validations of segmental duplication copy number differences. Depth-of-coverage tracks for Olimpia, ITWB12, ITWB13 and ITWB14 are shown below a UCSC track for each investigated gene region. Regions colored in red on the plot indicate excessive read depth (> mean + 1.5 × STDEV), whereas orange regions indicate intermediate read depth (> mean + 1 × STDEV). Normal read depth values are colored green (within mean ± 1 × STDEV). Digital aCGH tracks show the log2 ratio of the copy number of each listed animal compared to Olimpia, with high values shown in green (> 0.3), low values in red (< -0.3) and nominal values in orange (-0.3 ≤ x ≤ 0.3). Whole-genome CGH array experiments, using Olimpia as the reference sample in all cases, are shown below the digital aCGH experiments. Color schemes for the aCGH plots are the same as for the digital aCGH. The CNVRs are shown below the UCSC plot. a CNVs intersecting the PI3 locus (chr13:74180018-74298194). A duplication of this region was predicted for all animals and was confirmed by whole-genome aCGH. b CNVs intersecting the PAG3 and PAG6 locus (Bovine chr29:39994004-40119007)

qPCR analysis

We designed quantitative PCR assays to test 11 predicted CNVRs within or near annotated genes including PI3, PAG6 and other randomly chosen genes. We randomly selected 6 individuals to investigate the 11 CNV regions and designed one primer set for each locus (Table S6). The basic transcription factor 3 (BTF3) gene was chosen as the control with the assumption that there were two copies of DNA segment in this region. The validation rates of the 11 loci in the 6 samples varied from 63.6% to 90.9% with an average validation rate of 74.2% (Table S7). We selected four of the CNVRs that were validated in all 6 individuals to compare the WGS-predicted CN values with the qPCR estimates (Fig. 6). We observed a high correlation between qPCR-derived CN and WGS-derived CN, supporting the reproducibility of our genome wide CNV detection methods.

Fig. 6.

The CN observed from qPCR experiments and estimated from WGS for four genes (GUCY1B1, GZMB, PAG6 and PI3)

Discussion

This study carried out a systematic investigation of the genome-wide CNV landscape of water buffalo. We identified 1,344 CNVRs in 15 water buffaloes (including Olimpia, the reference animal for the buffalo genome assembly) and validated this sequence-based CNV set using aCGH, qPCR and FISH. Agreement in CNV assignment was found among all four methods. Two previous studies have focused on the discovery of CNVs in the water buffalo genome. One study used the NimbleGen 3×720K CGH array, and found more than half of CNVRs discovered in buffalo were shared with cattle (Zhang et al. 2014). However, this CGH array approach has several inherent drawbacks, including hybridization noise, low resolution, and that novel and rare variations are not detected (Snijders et al. 2001). Using a comparative alignment and selective filtering approach, our previous study on buffalo CNVs focused on the comparative analysis of genome features shared between buffalo and cattle (Li et al. 2018). The study identified large deletions and smaller variations in the gene regulatory regions which may impact on gene expression (Li et al. 2018). Our earlier study used the R package, cn.mops (Klambauer et al. n.d.) and JaRMS (Oldeschulte et al. 2017) to assess CNVs. These two methods tend to discover more deletion events than duplication events. In contrast, the mrsFAST-WSSD method has the tendency to detect more duplications, especially in the repetitive segmental duplication regions. We compared the CNV regions identified in these two studies, and found 49.9% (670/1344) CNVRs were also discovered using the methods of cn.mops and JaRMS, covering 25.9% (11.6 Mb) of all the variable sequences.

We found that the water buffalo studied shared several high copy number regions with cattle (Bickhart et al. 2012). For example, position of 25 Mb on chromosome 10 is enriched with 22 CNV regions, covering T-cell receptor alpha variable (TRAV) gene family members (Fig. S2). The copy number of the CNV cluster varied from 3 to 10 (average: 5.5). TRAV genes encode variable domains of the T-cell receptor alpha chain. T-cell receptors recognize foreign antigens and bind to major histocompatibility complex (MHC) molecules, which in turn are encoded by the genes located on chromosome 23 (Fig. S4), expressed on the surface of antigen presenting cells (Nikolich-Zugich et al. 2004). The expansion of the TRAV gene CN in buffalo could be partially due to the requirement for a substantial immune-regulatory T-cell population in this species to combat a wide range of pathogens (Connelley et al. 2014). Another example of CN-divergent gene families is the PAG gene family members on chromosome 29 (Fig. S5). In the water buffalo genome, we found nine CNVRs, ranging from 38.3 Mb to 42.5 Mb on chromosome 29, of which the copy number varied from 5 to 26 (average: 11.4). PAG genes are abundantly expressed in the placenta of species within the Cetartiodactyla order where they play a role in the sequestration of fetal antigens at the placenta-uterine interface (Wallace et al. 2015). We found a large cluster of CNVs covering the PAG genes in all of the sequenced buffalo individuals, indicating the pervasiveness of the duplication events. Similarly, the PAG genes have been shown to be duplicated in eight diverse cattle breeds (Bickhart et al. 2016). As we proposed before (Bickhart et al. 2016), it is possible that duplications of the genes identified in this and other studies are indicative of subfunctionalization, neofunctionalization, or overdominance effects on structurally polymorphic PAG gene alleles. 
The other examples of shared CNVRs are the ~47 – 52 Mb region of chromosome 15 associated with OR genes, ~25 – 30 Mb region of chromosome 23 associated with cattle MHC (BoLA) family members and ~5 – 6 Mb region of chromosome 27 associated with β-Defensin (DEFB) family members.

We detected the genes with the largest CN differences between the water buffaloes studied here and 75 cattle individuals from eight breeds/subspecies (Bickhart et al. 2016) (i.e. with differences in average CN values > 4). They included several cell cycle-related genes (MIS18BP1, MAD2L1 and CNTLN), several genes related to immune function (DEFB1, DEFB5, DEFB7 and NRIH4), skin disease related genes (melanoma-associated genes such as PRAME and TNFRSF10), as well as a nervous system gene (FZD3) (Fig. 7a). One mitosis-associated gene, mitotic arrest deficient 2 like 1 (MAD2L1), showed higher CN values in cattle (average CN: 35.1) than in water buffalo (average CN: 14.2) (Fig. 7b). The protein encoded by MAD2L1 has been identified as a vital mediator of the chromosomal control pathway (Kato et al. 2011). It has been reported that copy number loss of mitotic arrest deficient genes may be related to human fetal loss (Nath et al. 2012). A previous study identified frizzled class receptor 3 (FZD3) as one of the most stratified genes for taurine and indicine animals (Bickhart et al. 2016). Interestingly, the CN values of this gene in water buffaloes (average CN: 15.5) were more than twice those in cattle (average CN: 4.5), suggesting that the gene could have been under different selection pressures in the two species (Fig. 7c). FZD3 contributes to axonal growth in the central nervous system (Wang et al. 2002). The difference in CN values between cattle and water buffaloes may be partially driven by differences arising during domestication or from natural selection. We also discovered other functionally important genes stratified in CN values in these two species, although the CN differences were less than 4. For example, the PI3 gene has more copies in water buffalo (average CN: 6.5) than in cattle (average CN: 3), and the CN distributions of the two species seldom overlapped (Fig. 7d). 
The PI3 gene is implicated in resistance to fungal and bacterial pathogens, so copy number variability in water buffalo may indicate structurally polymorphic alleles in this species that confer different resistance to these pathogens. These shared CNV clusters and CN differential genes in two species warrant further investigation to understand whether the CN affects phenotypes, particularly those related to economically important traits.

Fig. 7.

Genes with copy number differences between cattle and water buffaloes. a Boxplot of CN values for the 16 genes showing the highest CN differences (> 4) between cattle and water buffaloes (legend insets denote group colors). Histograms showing the distribution of CNs among the unrelated individuals in each group are plotted for MAD2L1 (b), FZD3 (c), and PI3 (d). X-axis values indicate copy number and Y-axis values indicate sample count

One limitation of this study is that the UMD3.1 cattle reference genome was used as a basis for CNV detection in water buffaloes. We chose to align the buffalo sequence data to the cattle reference genome because the water buffalo draft reference genome is highly fragmented and contains a large number of small scaffolds that are difficult to analyse using a window-based CNV detection method. It is possible that some of the structural differences detected between the cattle and water buffalo genomes resulted from using a different species as the reference. Alignment of the water buffalo sequences with the bovine UMD3.1 reference genome identified 6.6 point mutations per 1000 bp. The alignment method, mrsFAST, allows two mismatches per aligned read, which represents a 96% identity cutoff for each 50 bp sequence. As the water buffalo diverged 12.3 Myr ago from its last common ancestor with cattle, 15% of buffalo sequences do not have a match with the cattle genome (Williams et al. 2017). These sequences represent either buffalo specific DNA or sequence absent in the UMD3.1 cattle reference genome. This missing buffalo specific genomic sequence is likely to have resulted in a loss of buffalo specific CNVs (i.e. false negatives). Other regions, which are divergent from the cattle assembly, may also influence the prediction of buffalo CNVs. However, in most of the unique and gene-rich genomic regions, buffalo sequence was highly comparable with sequence in the cattle genome. For CNV detection based on the CGH array, the probes of 50-60 bp in length usually allowed one mismatch to target the specific chromosomal region, which is more stringent than mrsFAST alignment. Discrepancies between the digital and experimental aCGH may be partly explained by the differences in mapping/hybridization efficiency between mrsFAST and aCGH probes. 
For FISH, the hybridization of cattle BAC clones to Olimpia chromosomes is less likely to be affected by small variation in the sequence, thus FISH can be used to distinguish single copy signal vs. duplication signal, accommodating sequence divergence up to 20% in probe hybridization regions.

Future directions

It is important to note that the performance of CNV detection using short read sequence data relies heavily on accurate mapping of reads. In highly repetitive regions, misalignment may lead to a high rate of false positive CNV calls. The latest long read sequencing technologies offer new opportunities in CNV detection by providing high confidence breakpoint analysis (Sedlazeck et al. 2018). In addition, long reads can be more confidently anchored to the repetitive sequences that often mediate the formation of SVs (Lucas Lledo and Caceres 2013).

Long read sequences and improved mapping technologies will lead to more precise CNV detection. These methods will facilitate further investigation of the structural organization of copy number variable regions in water buffaloes through population-level sequencing. With more confident detection of CNVs using long-read technology, the long-term goal is to explore the association of CNVs with important economic traits and incorporate them into selection programmes.

Methods

Data retrieval and sequence alignment

Illumina sequence data from an inbred, female Italian Mediterranean buffalo (Olimpia) were retrieved from NCBI BioProject PRJNA207334 submitted by the International Water Buffalo Genome Consortium (Iamartino et al. 2017). We retrieved sequence data for the 14 additional water buffaloes (ITWB1 to ITWB8 and ITWB10 to ITWB15, paired-end reads of 100 bp, ~ 10X coverage, Table S8) from NCBI bioproject PRJNA350833 submitted by a previous study (Whitacre et al. 2017). As the UMD_CASPUR_WB_2.0 water buffalo draft genome assembly is highly fragmented, our analysis was based on the Cattle UMD3.1 assembly (Zimin et al. 2009). We masked repeats of the cattle assembly using RepeatMasker (version open-3-3-0) (using the -s option and cattle RepBase libraries), Tandem Repeats Finder (version 3.21), and WindowMasker. We then aligned the buffalo reads to the masked UMD3.1 using mrsFAST (version 2.5.0.4) (Hach et al. 2010), allowing up to two mismatches (i.e., 48/50, ~96% sequence identity).

CNV calling using the read depth method

We then processed aligned reads within sliding windows using the WSSD pipeline as previously described (Bickhart et al. 2012). Reads were counted and the GC bias was corrected using Locally Weighted Scatter-plot Smoother (LOESS). We called the CNVs based on the read depth in three different sizes and types of windows. The procedure and criteria for the CNV calling were similar to that of the previous study (Alkan et al. 2009). We estimated the CN within 1-kb non-overlapping windows across all placed chromosomes. The non-overlapping estimates of CN served as a good approximation of CN within non-masked, non-gapped regions of the genome.
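The window-based CN estimate can be illustrated with a short sketch. It is not the WSSD pipeline itself. Two simplifications are assumed here: GC bias is removed by rescaling each GC bin to a common median depth, which is a cruder stand-in for the LOESS fit named above, and CN is taken as twice the ratio of a window's corrected depth to the mean depth of control windows assumed to be diploid. All function names, the bin width and the toy values are hypothetical.

```python
from statistics import mean, median


def gc_correct(depths, gc_fractions, bin_width=0.05):
    """Rescale window depths so that every GC bin has the same median depth.

    This binned-median correction is an assumed simplification of the
    LOESS-based GC correction described in the text.
    """
    overall = median(depths)
    bins = {}
    for depth, gc in zip(depths, gc_fractions):
        bins.setdefault(int(gc / bin_width), []).append(depth)
    bin_median = {b: median(values) for b, values in bins.items()}

    corrected = []
    for depth, gc in zip(depths, gc_fractions):
        m = bin_median[int(gc / bin_width)]
        # Leave the depth untouched if the whole bin has zero median depth.
        corrected.append(depth * overall / m if m > 0 else depth)
    return corrected


def estimate_cn(corrected_depths, control_idx):
    """Copy number per window: 2 * depth / mean depth of control windows.

    control_idx lists windows assumed to be present in two copies.
    """
    baseline = mean(corrected_depths[i] for i in control_idx)
    return [2.0 * depth / baseline for depth in corrected_depths]


# Toy 1-kb windows: depth differs only because of GC content.
flat = gc_correct([10, 10, 20, 20], [0.32, 0.33, 0.52, 0.53])

# Toy corrected depths: two control windows, one duplication, one deletion.
cn = estimate_cn([30.0, 30.0, 60.0, 0.0], control_idx=[0, 1])
```

With these toy values the third window comes out at CN 4 and the fourth at CN 0, which is the kind of per-window estimate that the heatmaps in Fig. 3 summarise.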

Validation of water buffalo CNVs using aCGH

Agilent whole genome high-density CGH arrays containing ~974,016 oligonucleotide probes were designed and fabricated on a single slide to provide an evenly distributed coverage on UMD3.1 with an average interval of ~3.1 kb between probes. We performed standard genomic DNA labelling (Cy3 for samples and Cy5 for references), hybridizations, array scanning, spatial correction, and data normalization as previously described (Liu et al. 2010; Bickhart et al. 2012).

qPCR validation

We designed primers using a custom script that incorporated Primer3 and Exonerate to identify unique binding sites for primer design (Bickhart et al. 2012; Untergasser et al. 2012). Only the following Primer3 settings were changed from default values: the amplicon length was set to 150–250 bp, and the GC clamp value was set to 2. Primer information is shown in Table S6. We conducted qPCR experiments using SYBR green chemistry in triplicate, each with a reaction volume of 25 μl, as previously described (Hou et al. 2011). PCRs were run on a BioRad MyIQ or iQ5 thermocycler. We chose an intron-exon junction of BTF3 as a reference location for all qPCR experiments with the assumption that there were two copies of the DNA segment in this region. We analysed the resulting crossing cycle thresholds (CT) using the relative comparative CT method and normalized against the control gene. Finally, a value of 3 or above was considered a gain and a value of 1 or below was considered a loss.
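The CT-to-copy-number conversion and the gain/loss thresholds above can be sketched as follows. This is a simplified illustration under explicit assumptions: amplification efficiency is taken as perfect (product doubles every cycle), no separate calibrator sample is used, and the function names and CT values are hypothetical rather than taken from the study.

```python
def qpcr_copy_number(ct_target, ct_control, control_copies=2.0):
    """Estimate copy number from mean CT values of target and control assays.

    Assumes 100% amplification efficiency and a control locus (BTF3 in the
    text) present in control_copies copies. A lower CT for the target than
    for the control means more starting template, hence more copies.
    """
    delta_ct = ct_target - ct_control
    return control_copies * 2.0 ** (-delta_ct)


def classify_cn(cn):
    """Apply the thresholds stated in the text: >= 3 is a gain, <= 1 a loss."""
    if cn >= 3:
        return "gain"
    if cn <= 1:
        return "loss"
    return "normal"


# Toy CT values: target crosses threshold one cycle earlier than BTF3.
cn_example = qpcr_copy_number(ct_target=24.0, ct_control=25.0)
call_example = classify_cn(cn_example)
```

One cycle of difference corresponds to a two-fold change in template, so a target that crosses the threshold one cycle before the two-copy control is estimated at four copies.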

FISH validation

We selected one hundred twenty-one cattle BAC clones from the bovine BAC library (CHORI-240 at http://bacpacresources.org/bovine240.htm) for experimental validation by FISH (Liu et al. 2010; Bickhart et al. 2012). These clones contain large (≥ 20 kb) regions where copy number variations were predicted in Olimpia. We performed FISH experiments as previously described (Liu et al. 2009; Snijders et al. 2001). We prepared both interphase and metaphase nuclei using the buccal epithelial cells of Olimpia. We examined metaphase nuclei to identify the chromosomal origins of FISH signals. Interphase nuclei analysis allowed us to evaluate the occurrence of tandem duplications.

Gene content

We assessed gene content of cattle CNVRs using the BioMart Database (http://www.ensembl.org/biomart/martview/). Ensembl genes overlapping with CNVRs, completely or partially, were considered as copy number variable and selected for further analysis. To gain an insight into the functional enrichment of the genes with copy number variations, we tested the hypothesis that the PANTHER molecular function, biological process, and pathway terms were under- or overrepresented in CNVRs after false discovery rate (FDR) correction using the PANTHER classification system (Mi et al. 2017).
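The overlap criterion used here (a gene counts as copy number variable if it overlaps a CNVR completely or partially) can be expressed in a few lines. The sketch below is illustrative only: the study retrieved overlaps through BioMart, whereas this example assumes simple in-memory lists with 1-based inclusive coordinates, and the gene IDs and positions are made up.

```python
def genes_in_cnvrs(genes, cnvrs):
    """Return IDs of genes that overlap any CNVR, fully or partially.

    genes: list of (gene_id, chrom, start, end)
    cnvrs: list of (chrom, start, end)
    Coordinates are assumed to be 1-based and inclusive.
    """
    hits = []
    for gene_id, chrom, g_start, g_end in genes:
        for c_chrom, c_start, c_end in cnvrs:
            # Two inclusive intervals overlap when each starts before the other ends.
            if chrom == c_chrom and g_start <= c_end and c_start <= g_end:
                hits.append(gene_id)
                break  # one overlapping CNVR is enough
    return hits


# Hypothetical genes and CNVRs, not coordinates from the study.
toy_genes = [
    ("GENE_A", "chr13", 100, 200),
    ("GENE_B", "chr13", 500, 600),
    ("GENE_C", "chr17", 100, 200),
]
toy_cnvrs = [("chr13", 150, 300), ("chr17", 201, 400)]
variable_genes = genes_in_cnvrs(toy_genes, toy_cnvrs)
```

The nested loop is adequate for a few thousand genes and regions; an interval tree or sorted sweep would be the usual choice for larger inputs.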

Comparison of the gene CN between cattle and water buffaloes

We collected the CN values of all annotated genes in 75 cattle individuals (Bickhart et al. 2016), and compared them with the gene CN in the water buffalo. We focused on genes common to both species, including those that are shared but differ in CN, as well as buffalo-specific CN variable genes. Sixteen genes with the highest average CN differences (> 4) between the two species are highlighted in Fig. 7a.
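The selection of CN-divergent genes can be sketched as a comparison of per-species average CN with a threshold of 4. The function name and data layout below are assumptions. For illustration, the example input uses the species averages quoted in the Discussion as single-value lists, standing in for the per-animal CN values that the actual comparison would use.

```python
from statistics import mean


def cn_divergent_genes(buffalo_cn, cattle_cn, min_diff=4.0):
    """List genes whose average CN differs between species by more than min_diff.

    buffalo_cn, cattle_cn: dict mapping gene symbol -> list of per-animal CN.
    Returns (gene, buffalo_mean - cattle_mean) sorted by absolute difference.
    """
    divergent = []
    for gene in sorted(set(buffalo_cn) & set(cattle_cn)):
        diff = mean(buffalo_cn[gene]) - mean(cattle_cn[gene])
        if abs(diff) > min_diff:
            divergent.append((gene, diff))
    return sorted(divergent, key=lambda item: -abs(item[1]))


# Average CN values quoted in the Discussion, used as stand-ins for
# per-animal data.
buffalo = {"FZD3": [15.5], "MAD2L1": [14.2], "PI3": [6.5]}
cattle = {"FZD3": [4.5], "MAD2L1": [35.1], "PI3": [3.0]}
divergent = cn_divergent_genes(buffalo, cattle)
```

With these inputs MAD2L1 and FZD3 pass the threshold while PI3 does not, matching the text's statement that the PI3 difference is below 4.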