Introduction

Structural variations (SVs) are generally defined as large-scale genomic variations that alter chromosomal structure and provide raw material for evolution (Hurles et al. 2008; Guan and Sung 2016). Originally, SVs were restricted to events of at least 1,000 bp. As DNA sequencing became routine, the operational spectrum of SVs widened to include much smaller events longer than 50 bp (Alkan et al. 2011; Kosugi et al. 2019). SVs are commonly classified by their structural features into deletions (DEL), insertions (INS), inversions (INV), duplications (DUP), translocations (BND), and copy number variations (CNVs).

Compared with single nucleotide polymorphisms (SNPs), SVs can perturb cis-regulatory regions on a large scale or directly alter gene copy number, and can therefore have greater effects on gene expression and phenotypes (Weischenfeldt et al. 2013; Chiang et al. 2017; Alonge et al. 2020). However, SVs are more difficult to detect than single nucleotide variations (Hurles et al. 2008). As a result, studies of SVs have lagged far behind those of SNPs. The rapid development of next-generation sequencing technology and reliable detection approaches has made it possible to detect the full extent of SVs and to genotype them routinely.

Numerous studies of SVs have been conducted in humans (Homo sapiens) to identify associations between genomic SVs and genetic diseases (Stankiewicz and Lupski 2010; Vacic et al. 2011; MacDonald et al. 2014). In addition, to uncover the genetic basis of economically important traits, numerous studies have been undertaken in various domesticated plants and animals. For example, in maize (Zea mays), several SVs on chromosome 4 could affect oil concentration and long-chain fatty acid composition by regulating the expression of 16 functional genes (Yang et al. 2019). An important harvesting trait in tomato (Lycopersicum esculentum), the jointless fruit pedicel, originated from four SV mutations (Alonge et al. 2020). In cattle (Bos taurus), 34 CNVs on 22 chromosomes were significantly associated with several milk production traits (Xu et al. 2014). In addition, during sheep domestication, hundreds of CNVs were related to several biological processes and traits, including follicular development and fertility, adipogenesis, wool production, milk production, oxygenated red blood cells, and spleen size (Li et al. 2020). These studies revealed the critical and underexplored roles of SVs in genotype-to-phenotype relationships. However, despite their importance, SV landscapes have been systematically characterized in only a few aquaculture species. Recently, thousands of high-confidence SVs were identified in 492 Atlantic salmon (Salmo salar), suggesting roles in genome evolution and in the genetic architecture of domestication traits (Bertolotti et al. 2020).

The Pacific oyster, Crassostrea gigas, is one of the most widely cultivated bivalve species and contributes significantly to global seafood production (Troost 2010; Zhao et al. 2012). Given its economic importance, numerous selective breeding programs for the Pacific oyster have been conducted over the years to improve economically important traits (Evans and Langdon 2006; Li et al. 2011; de Melo et al. 2016). In China, we have run a selective breeding program for the Pacific oyster since 2007, constructing breeding base populations from oysters collected from wild populations in Rushan (China), Miyagi (Japan), and Busan (South Korea). After generations of artificial selection for fast growth, several fast-growing strains with superior growth performance have been produced. Several studies based on SNPs and other molecular markers have investigated the genetic basis of fast growth in the Pacific oyster (Zhong et al. 2013; Jin et al. 2014; Kong et al. 2014; Wang and Li 2017). However, the large-scale SV landscape of the Pacific oyster has not been systematically characterized, and the potential association of SVs with growth remains largely unexplored.

In the present study, we performed whole-genome alignments and analyzed genome re-sequencing data to identify genome-wide SVs and construct the first comprehensive SV landscape of the Pacific oyster. Selective sweeps were then detected to determine SV differentiation between the fast-growing strains and their wild populations, providing insights into the potential role of SVs associated with growth in the Pacific oyster. This work provides the first comprehensive overview of SVs in this species, which will be valuable for future investigations of genome evolution under selection in oysters.

Material and Methods

Data Collection

Two sources of data were used for the detection of SVs: genome assemblies and whole-genome re-sequencing data. Two genome assemblies of C. gigas (cgigas_uk_roslin_v1 and ASM1103280v1) were retrieved from the NCBI genome database under the assembly accession numbers GCA_902806645.1 and GCA_011032805.1, respectively. For whole-genome re-sequencing data, 150-bp paired-end short reads from 495 samples were retrieved from the NCBI Sequence Read Archive (SRA) database (BioProject ID: PRJNA394055); detailed sample information is described in the previous study (Li et al. 2018). In addition, whole-genome re-sequencing data were generated for 40 samples from our selective breeding program (Li et al. 2011). Of these, 10 samples were randomly chosen from each of the fast-growing strains, which had been successively selected for 10 generations and showed superior growth compared with wild oysters. We also sequenced 20 samples from the wild populations in Rushan and Miyagi (10 samples from each population). Adductor muscle tissues of 6-month-old individuals were dissected and used for DNA extraction following a modified phenol–chloroform protocol (Li et al. 2006). DNA integrity and quantity were assessed using 1% agarose gel electrophoresis and a NanoDrop 2000 spectrophotometer (Thermo Scientific, Waltham, MA, USA). The DNA libraries were constructed with an average insert size of 350 bp. The 150-bp paired-end short reads were sequenced on the Illumina HiSeq X Ten platform at a sequencing depth of 10×.

SV Detection from Whole-Genome Alignment

Whole-genome alignment was performed between the cgigas_uk_roslin_v1 and ASM1103280v1 genome assemblies using the nucmer tool of the MUMmer (v3.1) package with the parameters “-maxmatch -l 100 -c 500” (Kurtz et al. 2004). The alignment blocks were then filtered using delta-filter in one-to-one alignment mode (-1). SVs were determined from the filtered blocks using the Assemblytics software (Nattestad and Schatz 2016). The length of short SVs was restricted to between 50 and 1000 bp, while the length of CNVs ranged from 1000 to 100,000 bp.

SV Detection from Whole-Genome Re-sequencing Data

The Fastp (v.20.0) software was employed to trim adaptor sequences and filter low-quality reads (quality score < 20 or length < 35) in order to obtain high-quality clean reads for downstream analysis (Chen et al. 2018). FastQC (v0.11.8) was used to assess the quality of the clean reads (Kim et al. 2018). The clean reads from each of the 535 samples were aligned to the cgigas_uk_roslin_v1 reference genome using BWA-mem (v0.7.17) with default parameters (Li and Durbin 2009). Alignment results were sorted and converted to BAM files using SAMtools (v1.9) (Li et al. 2009). The sequencing coverage and depth of each sample were estimated using Bamdst (https://github.com/shiquan/bamdst).

The SVs were classified into two main categories based on length: short SVs (> 50 bp and ≤ 1000 bp) and CNVs (> 1000 bp). To obtain individual-specific short SVs, variations in the 535 whole-genome re-sequencing samples were called independently using delly (v0.8.1) with the recommended parameters (Rausch et al. 2012). CNVnator (v0.3.3), a read-depth-based method, was used to call CNVs in each sample (Rausch et al. 2012). CNV calls were retained only if they had a P-value less than 0.01, a fraction of reads with zero mapping quality (q0) less than 0.05, and a size greater than 1 kb. The gene copy number of each region was determined using the “-genotype” option of CNVnator. The SVs of the 535 individuals were merged using VCFtools (v0.1.17) (Danecek et al. 2011). The CNVs were aggregated into CNV regions based on an overlap of at least 1 bp. Short SVs and CNVs detected in at least 20 individuals and with a minor allele frequency greater than 0.05 were considered common SVs among the 535 samples. RepeatMasker (v4.0.9) was used to detect repeat sequences by aligning the cgigas_uk_roslin_v1 genome sequences to the Repbase library (v 20181026). Based on their sequence characteristics, the repeat sequences were further classified into different types, including simple repeat, satellite, low complexity, retroelements, RC/Helitron, DNA transposons, and others.
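The CNV filtering thresholds and the aggregation of overlapping calls into CNV regions described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the CnvCall record, the function names, and the assumption of 1-based inclusive coordinates are all hypothetical and serve only to make the rules concrete.

```python
from typing import NamedTuple


class CnvCall(NamedTuple):
    """Hypothetical record for one CNV call (1-based, inclusive coordinates)."""
    chrom: str
    start: int
    end: int
    p_value: float
    q0: float  # fraction of reads with zero mapping quality


def passes_filter(call, max_p=0.01, max_q0=0.05, min_size=1000):
    """Keep calls with P < 0.01, q0 < 0.05, and size > 1 kb, as stated in the text."""
    size = call.end - call.start + 1
    return call.p_value < max_p and call.q0 < max_q0 and size > min_size


def merge_into_regions(calls):
    """Merge calls on the same chromosome that share at least 1 bp into regions."""
    regions = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        if regions and regions[-1][0] == call.chrom and call.start <= regions[-1][2]:
            chrom, start, end = regions[-1]
            regions[-1] = (chrom, start, max(end, call.end))
        else:
            regions.append((call.chrom, call.start, call.end))
    return regions


# Toy calls (invented values, for illustration only)
calls = [
    CnvCall("chr1", 1000, 3000, 0.001, 0.00),
    CnvCall("chr1", 3000, 6000, 0.001, 0.01),    # shares base 3000 with the first call
    CnvCall("chr1", 6001, 9000, 0.001, 0.00),    # adjacent but not overlapping
    CnvCall("chr1", 20000, 20500, 0.001, 0.00),  # too short, removed by the filter
    CnvCall("chr2", 500, 5000, 0.2, 0.00),       # P-value too high, removed
]
kept = [c for c in calls if passes_filter(c)]
cnv_regions = merge_into_regions(kept)
```

In this toy input, the first two calls collapse into one region because they share a single base, while the adjacent third call stays separate, which is the behavior implied by an "at least 1-bp overlap" rule under inclusive coordinates.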

Population Differentiation Between Fast-growing Strains and Wild Populations

To investigate the association of genome structural variations with artificial selection in the Pacific oyster, Pi ratios and Fst values were calculated between the fast-growing strains and their wild populations using VCFtools. The whole genome was scanned in sliding windows of 20 kb with a 10-kb step size. The empirical cutoffs for candidate windows were set as the bottom 5% and top 5% for Pi ratios and the top 5% for Fst values (Dennis et al. 2017; He et al. 2019; Bertolotti et al. 2020). Candidate windows shared by the Pi ratio and Fst scans were detected using BEDtools (Quinlan and Hall 2010).
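The window scan and empirical cutoffs described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, windows are represented as 0-based half-open intervals, percentiles use simple linear interpolation, and the per-window Fst and Pi ratio values are assumed to have been computed beforehand (in the study, by VCFtools). How trailing partial windows are handled here is also an assumption.

```python
def sliding_windows(chrom_length, window=20_000, step=10_000):
    """Yield (start, end) half-open windows of 20 kb with a 10-kb step."""
    start = 0
    while start < chrom_length:
        yield (start, min(start + window, chrom_length))
        start += step


def percentile(values, q):
    """Percentile with linear interpolation; q is given in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def candidate_windows(fst, pi_ratio):
    """Return windows in the top 5% of Fst AND in the top or bottom 5% of Pi ratio.

    Both arguments are dicts mapping a window key to its statistic.
    """
    fst_cut = percentile(fst.values(), 95)
    pi_high = percentile(pi_ratio.values(), 95)
    pi_low = percentile(pi_ratio.values(), 5)
    fst_hits = {w for w, v in fst.items() if v > fst_cut}
    pi_hits = {w for w, v in pi_ratio.items() if v > pi_high or v < pi_low}
    return fst_hits & pi_hits
```

Taking the set intersection mirrors the overlap step performed with BEDtools in the text; a window is reported only when both statistics flag it.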

Functional Analysis of Candidate Genes Associated with SVs Under Selection

The association of SVs with genes or functional elements was identified using Annovar according to the cgigas_uk_roslin_v1 reference genome annotation (Wang et al. 2010). The Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways of the functional genes were annotated using eggNOG-Mapper (Huerta-Cepas et al. 2017). GO term and KEGG pathway enrichment analyses of candidate genes were carried out using the clusterProfiler R package (Yu et al. 2012). GO terms and KEGG pathways with more than two enriched genes or background genes were retained.
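For readers unfamiliar with over-representation analysis, the statistic behind this kind of enrichment test can be sketched as a one-sided hypergeometric test. The Python function below is a minimal illustration, not the clusterProfiler implementation; the function name and argument names are hypothetical, and multiple-testing correction is omitted.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided hypergeometric P-value, P(X >= k).

    N: number of background genes
    K: background genes annotated to the term
    n: number of candidate genes
    k: candidate genes annotated to the term
    """
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Toy example: 10 background genes, 3 annotated, 4 candidates, 2 of them annotated
p_toy = hypergeom_enrichment_p(k=2, n=4, K=3, N=10)
```

For the toy numbers, the P-value is (3 × 21 + 1 × 7) / 210 = 1/3, which can be checked by hand.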

Results

Identification of SVs from Whole-Genome Alignment

By comparing the cgigas_uk_roslin_v1 and ASM1103280v1 genome assemblies, a total of 11,087 short SVs and 11,561 CNVs were identified across the chromosomes, including 2400 deletions, 3214 insertions, 9038 repeat contractions, 7394 repeat expansions, 21 tandem contractions, and 581 tandem expansions (Fig. 1a–e, Fig. 2a, Supplementary Table 1). Short SVs showed distinct distribution patterns across chromosomes, with densities ranging from 6.51/Mb (NC_047560.1) to 25.79/Mb (NC_047568.1) (Supplementary Table 2). The density of CNVs across chromosomes ranged from 10.13/Mb (NC_047560.1) to 24.75/Mb (NC_047568.1) (Supplementary Table 3). The frequencies of insertions and deletions among short SVs decreased markedly with increasing length, and most short SVs (80%) were between 50 and 647 bp long (Fig. 2b). A similar pattern was observed for insertions, deletions, repeat expansions, and repeat contractions among CNVs, and 80% of CNVs were between 1 kb and 10,265 bp long (Fig. 2b).

Fig. 1

Genomic structural variation (SV) landscapes of the Pacific oyster. Tracks are listed from the outer to the inner circle: a chromosome karyotype; b GC content; c gene density; d short SVs (length from 50 to 1000 bp); e CNVs (length over 1,000 bp); f rare short SVs (identified in fewer than 20 samples); g rare CNVs (identified in fewer than 20 samples); h common short SVs (identified in at least 20 samples); i common CNVs (identified in at least 20 samples); j TE content. The chromosome karyotype is drawn in units of 2 Mb, while the other tracks were calculated in 20-kb windows. Short SVs and CNVs (d, e) were obtained through the genome alignment between cgigas_uk_roslin_v1 (GCA_902806645.1) and ASM1103280v1 (GCA_011032805.1). Rare and common short SVs and CNVs (f–i) were detected based on whole-genome re-sequencing data

Fig. 2

Statistics of SVs detected by genome alignment. a The numbers of different short SV and CNV types. b The length distribution of different short SV and CNV types

SV Landscape Constructed Based on Whole-Genome Re-sequencing

To construct a comprehensive SV landscape of the Pacific oyster, whole-genome re-sequencing data from a total of 535 individuals sampled from 28 populations were analyzed, containing 8,224.61G clean reads with an average sequencing depth of 22× (Fig. 3a; Supplementary Table 4). Using a standard analysis pipeline and strict thresholds, 220,468 short SVs and 13,176 CNVs were identified per sample on average (Supplementary Table 5). Each sample carried more short SVs than CNVs, and DEL was the most frequent SV type (Fig. 3b). The short SVs and CNVs of each sample were then merged and classified into two variant sets: rare (identified in fewer than 20 samples) and common (identified in at least 20 samples). The size of the rare short SV set grew dramatically at the beginning, while the size of the common set shrank rapidly (Fig. 3c). Both trends gradually slowed and approached plateaus as the number of samples increased. Similar results were observed for the rare and common CNV sets (Fig. 3d). These results demonstrated that a considerable proportion of short SVs and CNVs were specific to single individuals or to a limited number of individuals. Together, 511,170 rare short SVs and 979,486 rare CNVs were identified, comprising 318,156 DEL (short SVs), 150,750 BND (short SVs), 29,769 DUP (short SVs), 12,286 INV (short SVs), 209 INS (short SVs), 539,142 DEL (CNVs), and 440,344 DUP (CNVs) (Fig. 3e).
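The rare/common partition used above can be expressed as a short Python sketch. It assumes a hypothetical mapping from each merged SV identifier to the set of samples carrying it; the function name and data layout are illustrative and do not reflect the scripts used in this study.

```python
def split_rare_common(carriers_by_sv, min_carriers=20):
    """Split merged SVs into rare (< 20 carriers) and common (>= 20 carriers) sets."""
    rare = {sv for sv, carriers in carriers_by_sv.items() if len(carriers) < min_carriers}
    common = set(carriers_by_sv) - rare
    return rare, common


# Toy input: SV identifier -> set of sample indices carrying the variant
toy_carriers = {
    "sv_a": set(range(25)),  # 25 carriers
    "sv_b": set(range(19)),  # 19 carriers
    "sv_c": set(range(20)),  # exactly 20 carriers
}
rare_set, common_set = split_rare_common(toy_carriers)
```

With the threshold of 20, an SV carried by exactly 20 samples falls in the common set, matching the "at least 20 samples" wording.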

Fig. 3

SVs detected by genotyping of whole-genome re-sequencing data. a The sample number and population information. b The numbers of different SV types. SV types include translocation (BND), deletion (DEL), duplication (DUP), insertion (INS), and inversion (INV). c The numbers of rare and common short SVs. Rare short SVs are identified in fewer than 20 samples, while common short SVs are present in at least 20 samples. d The numbers of rare and common CNVs. Rare CNVs are identified in fewer than 20 samples, while common CNVs are present in at least 20 samples. e The numbers of different types of rare short SVs and CNVs

A total of 63,100 short SVs and 58,182 CNVs were regarded as common variations among populations, including 30,009 DEL, 2,074 DUP, 725 INS, 613 INV, and 29,679 BND among short SVs and 30,987 DEL and 27,195 DUP among CNVs (Fig. 4a). The proportion of common BND was higher than that of the other short SV types, and DEL was the major type among common CNVs (Fig. 4a). These common short SVs and CNVs were annotated against the cgigas_uk_roslin_v1 reference genome to evaluate their potential functional impact on genes. Around half of the short SVs (50.4%, 31,806) were located in intronic regions, while only 1.5% (927) were related to exons. In contrast, a majority of CNVs (68.8%, 40,033) overlapped one or more exons. A total of 6977 (12.0%) and 5611 (9.6%) CNVs were located in intergenic regions and in 5-kb upstream or downstream regions, respectively (Fig. 4b). In addition, we investigated the distribution of short SVs and CNVs across repeat sequences. About 46.7% of common short SVs (29,460/63,100) and 88.4% of CNVs (51,413/58,182) overlapped repeat elements, most of which were DNA transposons (50.1%), RC/Helitron (19.7%), and simple repeats (19.2%) (Fig. 4c, d).

Fig. 4

The common SVs detected from the genotyping of 535 whole-genome re-sequencing samples. a The percentages of different short SV and CNV types. b Genomic annotation of short SVs and CNVs. c Distribution of repeat types in short SVs. d Distribution of repeat types in CNVs

Identification of Selective Sweeps Under Artificial Selection

Common short SVs detected in the 20 oysters of the fast-growing strains and the 20 individuals of the wild populations were further analyzed to identify selective sweeps underlying artificial selection. Both Fst and Pi ratio statistics were employed to determine population differentiation by comparing allele frequency and nucleotide diversity between the artificially selected strains and the wild populations (Fig. 5a, b). Based on the empirical threshold of the top 5% of Fst values (Fst > 0.1825), 1373 windows were identified, covering 23.05 Mb (3.6% of the genome) and containing 2022 functional genes. Meanwhile, a total of 3992 windows in the top or bottom 5% of Pi ratios (Pi ratio > 2.5858 or Pi ratio < 0.3286) were identified, covering 66.75 Mb (10.3% of the genome) and harboring 5534 genes. A total of 514 genomic regions (8.76 Mb) were identified by both approaches (green and blue dots in Fig. 5c), containing 746 candidate genes (Fig. 5d). Detailed information on the candidate regions and genes is provided in Supplementary Table 6. Notably, 61 common CNVs were identified specifically in the fast-growing strains and not in their wild populations. These CNVs contained 103 genes, which were further investigated for their potential association with the growth trait (Supplementary Table 7).

Fig. 5

Selective sweep analysis based on short SVs. Manhattan plots showing Fst (a) and Pi ratio (b) values calculated from short SVs between the fast-growing strains and their corresponding wild populations. The red dotted lines represent the significance thresholds for Fst (> 0.1825, top 5%) and Pi ratio (> 2.5858, top 5% or < 0.3286, bottom 5%) values. c Selective sweep regions identified by both the Fst and Pi ratio tests. d Venn diagram showing the candidate genes within selective sweep regions

Functional Analysis of Candidate Genes from SVs Under Selection

The 746 genes identified from short SVs under selection and the 103 genes identified from CNVs detected specifically in the fast-growing strains, 843 genes in total, were subjected to GO and KEGG pathway enrichment analysis. The GO enrichment analysis revealed 294 GO terms with P < 0.05, and the top 20 enriched GO terms, ranked by P-value, are shown in Fig. 6a. Among these, the most significantly enriched term was apical part of cell (GO:0045177). The KEGG enrichment analysis indicated that the most significantly enriched pathway was tryptophan metabolism (ko00380) (Fig. 6b), followed by histidine metabolism (ko00340) and vitamin digestion and absorption (ko04977). The biosynthesis of secondary metabolites pathway (ko01110) contained the most enriched genes, and the highest enrichment factor was detected for ascorbate and aldarate metabolism (ko00053). In addition, the specific biological functions of the candidate genes were investigated according to published studies and classified into 10 functional groups (Supplementary Table 8): tissue morphogenesis, organic compound metabolism, cell cycle, cellular component, ion and amino acid transport, neurogenesis and nerve-impulse transmission, protein modification, RNA processing, signal transduction, and immune response. Detailed information on all the GO and KEGG terms is provided in Supplementary Table 9.

Fig. 6

Functional enrichment of candidate genes. a GO enrichment analysis of candidate genes. The top 20 enriched GO terms, ranked by P-value, are displayed. Biological process, molecular function, and cellular component are indicated by different shapes. b KEGG pathway enrichment analysis of candidate genes

Discussion

Genome structural variations account for a major portion of the genomic variation in an organism and play critical roles in biological functions. In comparison with single nucleotide variations, which are relatively well studied, genome structural variations remain largely unexplored because of the limitations of detection approaches (Bertolotti et al. 2020; Qi et al. 2021). In the present study, we constructed the first comprehensive genome structural variation landscape of the Pacific oyster by aligning whole-genome assemblies and analyzing whole-genome re-sequencing data. We further detected selection signatures in the SV landscape of the fast-growing strains, which have undergone 10 generations of artificial selection for growth. This work provides valuable information for further investigations of genome evolution under selection in oysters.

SVs are an important source of the genetic variation underlying important domestication traits (Vlad et al. 2010; Dorshorst et al. 2015; Dharmayanthi et al. 2017; Duan et al. 2017; Simam et al. 2018; Chakraborty et al. 2019; Liu et al. 2019). Many SVs have been identified and shown to be closely associated with domestication or artificial breeding traits, such as berry color in grapevine (Vitis vinifera ssp. sativa) (Zhou et al. 2019) and dietary shifts between grey wolf (Canis lupus) and dhole (Cuon alpinus) (Wang et al. 2019). However, there are few studies of SVs in aquatic organisms. The Pacific oyster, which is of great economic value, has been successfully selected for over ten generations and has attained superior growth performance. Hence, it represents an excellent model for investigating the role of SVs in artificial breeding.

SVs are usually rare and occur at low frequency in a population. In this work, we therefore based the selective sweep analysis on common SVs to ensure reliability. In total, 843 genes were associated with short SVs under selection or with CNVs detected only in the fast-growing strains and not in the wild populations. The GO enrichment analysis revealed that the candidate genes were significantly associated with the apical part of cell term. In the GO resource, this term is defined as the region of a polarized cell that forms a tip or is distal to a base; key genes annotated to it include 5′-AMP-activated protein kinase subunit beta-2 (LOC105344372), angiomotin (LOC105333766), fibroblast growth factor receptor 2 (LOC105321229), and regulator of G-protein signaling 12 (LOC105343750). However, the specific role of the apical part of cell and its association with growth require further investigation. KEGG enrichment analysis demonstrated that the candidate genes were enriched in several metabolism-related pathways, such as tryptophan metabolism, histidine metabolism, vitamin digestion and absorption, and ascorbate and aldarate metabolism. Tryptophan is an essential dietary amino acid involved in the regulation of growth and immune response in animals and plants (Walton et al. 1984; Le Floc’h et al. 2011; Fukuwatari and Shibata 2013; Hiruma et al. 2013). Histidine is another important amino acid and plays a critical role in the growth and development of animals and plants (Ingle 2011; Powell et al. 2011; Brosnan and Brosnan 2020). The vitamin digestion and absorption pathway is closely associated with vitamin metabolism in organisms. Also, ascorbate and aldarate metabolism could be directly related to the biosynthesis, recycling, and degradation of vitamin C (Linster and Van Schaftingen 2007). It is well documented that vitamins have a great impact on the growth rate and stress resistance of aquatic animals (Sealey and Gatlin 2002; Kumari and Sahoo 2005; Dawood and Koshio 2018). Therefore, how these candidate genes could influence the growth performance of the Pacific oyster through the regulation of amino acid and vitamin metabolism deserves future investigation.

The classification of the SV-related candidate genes indicates that growth is a complex trait whose regulation involves diverse biological processes. The SVs affecting these candidate genes may explain the differences between the fast-growing strains and the wild populations of oysters. It will therefore be useful to combine SVs with functional genes to investigate the functions of SVs. Tissue morphogenesis could play an essential role in the growth trait, and many genes related to tissue morphogenesis were found in windows with highly differentiated SVs. For example, titin (LOC105328178) is essential for the temporal and spatial control of the assembly of striated muscles during myofibrillogenesis (Mayans et al. 1998). A total of 11 multiple epidermal growth factor-like domains protein 10 (MEGF10) genes lie within the selective sweep regions or in CNVs specific to the fast-growing strains. MEGF10 is able to interact with Notch1 via their respective intracellular domains, playing vital roles in myogenesis (Takayama et al. 2016; Saha et al. 2017). As reported, fibrillins are important components of microfibril networks, which could interact with members of the TGF-β growth factor family to take part in tissue morphogenesis (Charbonneau et al. 2004; Gansner et al. 2008; Sengle et al. 2008; Ono et al. 2009). A total of four isoforms of fibrillins were identified, all having an evolutionarily conserved domain organization (Gansner et al. 2008; Jensen and Handford 2016). Three of these, fibrillin-1, fibrillin-2, and fibrillin-3, are regarded as candidates that could be associated with growth in the Pacific oyster.

The other classifications, organic compound metabolism and ion and amino acid transport, also point to a strong connection between SVs and biological traits. The protein encoded by the 5′-AMP-activated protein kinase subunit beta-2 gene could influence the activity of AMP-activated protein kinase, an energy sensor protein kinase that plays a key role in regulating cellular energy metabolism by changing the rates of glucose uptake and fatty acid oxidation (Dyck et al. 1996; Winder and Thomson 2007). Solute carrier family 13 member 5 (SLC13A5), a sodium-coupled citrate transporter, could import citrate from the circulation into cells. Recently, emerging evidence has suggested the importance of SLC13A5 in energy homeostasis, as it could facilitate the utilization of circulating citrate for the generation of metabolic energy and for the synthesis of fatty acids and cholesterol (Inoue et al. 2002; Hardies et al. 2015; Li et al. 2017). In addition, a large number of genes related to immune response were included in the candidate list, suggesting that the immunity of the fast-growing strains of the Pacific oyster has also been shaped by artificial selection.

Conclusion

In the present study, we constructed the first comprehensive landscape of genome structural variations in the Pacific oyster. Further analysis of the SVs affected by artificial selection was performed. The Fst and Pi ratio tests revealed that 514 genomic regions (8.76 Mb), containing 746 candidate genes, were under artificial selection and could be associated with growth traits. Another 103 candidate genes were identified from the 61 common CNVs that were present only in the fast-growing strains. Functional analysis of the 843 candidates in total revealed their enrichment in the apical part of cell term and in several metabolism-related pathways, including tryptophan metabolism and histidine metabolism. Taken together, this work provides a comprehensive landscape of SVs and reveals their responses to selection, which will be valuable for further investigations of genome evolution under selection in oysters.