Introduction

Tea is one of the most popular beverages consumed worldwide, and it is produced from the leaves of the plant Camellia sinensis (L.) O. Kuntze, which belongs to the family Theaceae. The leaf of C. sinensis is a unique organ usually used because it contains beneficial compounds. To date, nearly 4000 bioactive compounds (Mahmood et al. 2010), such as polyphenols (e.g., catechins), caffeine, and theanine (Liang et al. 2001; Mamati et al. 2006), have been identified in tea leaves. Tea has significant health benefits, such as autoxidation and anticancer activity (Sasazuki et al. 2012; Yen and Chen 1995) and it decreases high blood pressure (Hodgson et al. 2013), helps in preventing cardiovascular diseases (Hollman et al. 1999), and contributes to weight reduction (Auvichayapat et al. 2008).

Rapid progress in gene identification and isolation from C. sinensis has been made via RNA-Seq. For example, several studies have focused on the transcriptome of different tissues of C. sinensis (Shi et al. 2011a; Wu et al. 2013) and transcriptome profiles of C. sinensis during cold acclimation (Wang et al. 2013b). However, compared to many other crops and model plant species, C. sinensis has a large (Taniguchi 2006; Wu et al. 2014) and high heterozygosity genome (Ma et al. 2010). Xia et al. (2017) reported the genome of tea tree (cultivar Yunkang 10) and offered valuable information to further research of tea tree. However, more publically available molecular resources are required for functional genomic research, genetic studies and breeding programs of C. sinensis.

BAC libraries are important resources for the construction of genetic and physical maps (Luo and Wing 2003), gene identification, map-based cloning, comparative genomics analysis of eukaryotic genomes, and marker development (Shizuya et al. 1992). To date, hundreds of BAC libraries and physical maps have been constructed for model plants (Initiative 2000) and important fungi (Wang et al. 2013c), animals (Song et al. 2011), and crop plants such as rice (Ammiraju et al. 2006; Wang et al. 1995; Li 2005), wheat (Cenci et al. 2003; Janda et al. 2006; Nilmalgoda et al. 2003), maize (O’Sullivan et al. 2001; Tomkins et al. 2002; Wang et al. 2013a; Yim et al. 2002), soybean (Wu et al. 2004), tomato (Frary and Hamilton 2001), barley (Schulte et al. 2011), sorghum (Woo et al. 1994), and pearl millet (Allouis et al. 2001). These BAC libraries have been used for the development of physical maps (Chen et al. 2002; Wei et al. 2009), BAC-to-BAC genome sequencing (Schnable et al. 2009; Wei et al. 2009), positional cloning of genes of agronomic importance, comparative analysis of genome structures (Ammiraju et al. 2008; Chen et al. 1997; Lin et al. 2012; Lu et al. 2009; Messing and Llaca 1998; Pan et al. 2014; Wang and Dooner 2012), and genome assemblies of whole-genome shotgun sequences (Yu et al. 2005). BAC end sequences (BESs) are genomic resources that enhance the value of BAC libraries by providing partial sequence information that can be used to understand genome content and architecture and develop genetic markers (David et al. 2008). BESs embedded in a physical map can be used as anchors in genome comparisons to detect sequence assembly errors in the same source genome and large structural changes in phylogenetically close genomes (Zhu et al. 1997).

BAC libraries for C. sinensis will facilitate in DNA-marker analysis and functional gene cloning of this economically important species. Lin et al. (2011) constructed a BAC library for the cultivar “Chin-shin oolong.” In this study, we constructed a high quality BAC library for C. sinensis “Shuchazao” using fresh petals. Three complete BAC clones and ends of 384 BAC clones selected randomly from the BAC library were sequenced and analyzed. Understanding the characteristics of the BAC library will facilitate in using it in functional genomic research of C. sinensis. The BAC library and BESs are valuable resources for assembly optimization of genome for C. sinensis.

Materials and methods

Plant materials

For construction of the BAC library, C. sinensis “Shuchazao” plants were grown at the De Chang fabrication base in Anhui, China. Fresh petals (Fig. 1) were collected from the tea plants and immediately frozen by submersion in liquid nitrogen, followed by short-term storage at −80 °C.

Fig. 1
figure 1

Camellia sinensis flowers used for library construction

Library construction

The nuclear DNA was extracted by s described by Luo (Luo and Wing 2003). Firstly, grind about 50 g of petals in liquid nitrogen with a mortar and a pestle, and polyvinylpyrrolidone (PVP40) was used for removal of phenolic compounds in the milling petal to reduce DNA degradation. Secondly, transfer the ground tissue into NIBM (NIB with 0.1% β-mercaptoethanol) for 15 min with frequent and gentle shaking. The precipitation was resuspended with NIBM after centrifugation at 2400 g for 15 min. Finally, the nuclear DNA was imbed with 1% low melting temperature agarose. Then, we conducted partial digestion tests by using different amounts of the restriction enzyme Hind III per reaction to determine optimal conditions. We observed that 1 U/μL per reaction and incubation at 37 °C for 15 min generated fragments with most being between 100 and 300 kb, and digested 8 plugs (high molecular weight (HMW) DNA buried in low melting-point agarose) in this condition. After twice the size of selections, DNA fragments from 100 to 300 kb in size were eluted from the gel slices via electroelution and ligated into the pIndigoBAC536-s vector (Shi et al. 2011b). The ligated DNA was transformed into Escherichia coli strain DH10B (Invitrogen, USA) via electroporation.

Insert sizing

To evaluate the average insert size of the BAC library, we randomly selected 430 BAC clones from 384-well plates and isolated plasmid DNAs by using the alkaline lysis method (Sanmiguel and Bennetzen 1998). The DNAs were digested with I-SceI, and separated by pulsed-field gel electrophoresis (PFGE) on 1% agarose gels (Luo and Wing 2003). The insert size of each clone was estimated using the midrange PFG marker (New England BioLabs, USA) as the molecular weight standard.

Shotgun library construction and sequencing of BAC clones

BAC clones were cultured in Luria broth medium containing 12.5 μg/mL chloramphenicol for 16 h,and then centrifuged to harvest the pellets. The QIAGEN large construct kit (QIAGEN Lot.12462,Hilden, Germany) was used to obtain high quality, genomic DNA-free BAC plasmids. The plasmid DNA was sheared to 3–5 kb fragments by using ultrasonic waves. Klenow fragments and T4 DNA polymerase were used for blunt-end repair. After adding nucleotide A to 3′-end, DNA was ligated to the T-easy vector (Promega,Fitchburg, Wisconsin, USA). The DH5α competent cell (YEASTERN, ECOS Cat. No. FYE610-10VL, Taipei, Taiwan) was used for transformation. The transformed clones were maintained in 384-well plates for storage and sequencing.

Plasmids were extracted using AxyPrep Easy-96 (AXYGEN, AP-E96-P-24G, Corning, New York, USA) and dissolved in 30 μL of ddH2O for Sanger sequencing. The sequencing reaction was performed according to the protocol of the Applied Biosystems, ABI BigDye terminator v 3.1 and v 1.1 cycle sequencing kits (Waltham, Massachusetts, USA). The following primers were used: T7 (5′-TAATACGACTCACTATAGGG-3′) and SP6 (5′-ATTTAGGTGACACTATAG-3′). Sequence data were collected using ABI 3730.

Shotgun sequence assembly

Raw sequence data were collected, and phred (Ewing and Green 1998) was used for base calling; cut-off value was set at 0.05 to get reliable sequences in the phd format. Phd2fasta was used to transfer the phd file to fasta and fasta quality files. The vector sequence was masked by crossmatch. The sequence was assembled using Phrap with default parameters. Gaps were closed using PCR, and BAC sequences were finished using Consed (Gordon 2003; Gordon et al. 1998).

BAC end sequencing

The BAC plasmids for end sequencing were extracted using the standard alkaline lysis method and dissolved in 10 μL of ddH2O for Sanger sequencing. The sequencing reaction was performed according to the protocol of ABI BigDye terminator v3.1 (https://tools.lifetechnologies.com/content/sfs/manuals/cms_081527.pdf). The following sequencing primers were used: pIndigoF, 5′-aacgacggccagtgaattg-3′; pIndigoR, 5′-gataacaatttcacacagg-3′. ABI 3730 was used to sequence BAC clones at both ends.

Search for repetitive sequences in BAC sequences and BESs

BAC sequences and BESs were searched for repeat sequences by using RepeatMasker (Smit AFA, Hubley R & Green P. RepeatMasker Open-4.0.2013–2015 <http://www.repeatmasker.org>. Smit AFA, Hubley R & Green P. RepeatMasker Open-3.0.1996–2010 <http://www.repeatmasker.org>.) against the available repetitive sequence databases.

Simple sequence repeats (SSRs) analysis of BAC sequences and BESs

BESs were clustered using CAP3 (Huang and Madan 1999) to remove redundant sequences. Then BESs and complete BAC sequences were scanned for simple sequence repeats by using SciRoKo 3.4 (Kofler et al. 2007) in (perfect: total length) mode (minimum repeats 3, minimum total length 15, and no mismatch SSRs were considered). Motif length was set to 1–6; statistics were in the detailed mode.

A total of 16 cultivated tea plants were used in this study and the genomic DNA was extracted from fresh leaves following the manufacturer’s protocol of CP Plant gDNA miniprep kit (Biomiga) with few modifications. Primer pairs were designed from BAC sequences using Primer 3 (Rozen and Skaletsky 1999) (http://bioinfo.ut.ee/primer3-0.4.0/primer3/) software in batch mode. The target fragment size was set as 100–300 bp, the optimal annealing temperature 60 °C. The PCR fragments were detected by capillary electrophoresis with fragment analyzer 96 (Kuehnbaum et al. 2013).

Gene annotation for BAC sequences and BESs

Fgene SH (http://linux1.softberry.com/) was used for gene prediction. Since there is no existing Camellia sinensis gene model, Populus trichocarpa was used as a replacement model for gene prediction. Blast2go was used for gene annotation: BLASTX (NCBI-NR database) was used with an E-value threshold of 1E-3; BLAST Top-hit species were clustered, and all functional domains were searched against the InterProScan database; then GO mapping and annotation were conducted.

Cis-regulatory elements of promoters

We choose promoter sequences 1500 bp upstream of the start codon of LAR, TCS, and TS, and the cis-regulatory elements of three genes were predicted by promoter analysis software Plant CARE (http://bioinformatics.psb.ugent.be/webtools/plantcare /html/).

Analysis and validation of alternative splicing (AS)

The clean reads of tea transcriptomes were mapped to LAR, TCS, and TS BAC sequences by Tophat (Trapnell et al. 2009), then AS was analyzed by counting junction sites. The cds of three genes were mapped to the BAC sequences, and the introns of these genes were analyzed. Gene-specific primers were designed with Primer Premier 5 to confirm the predicted splicing events (Supplementary Table S1). PCR amplification was performed by KOD FX Neo (Toyobo), and PCR products were monitored by agarose gel electrophoresis and capillary electrophoresis.

Results

Construction of tea BAC library

The BAC library was constructed with Hind III in the pIndigoBAC536-s vector (Shi et al. 2011b). Fresh petals from C. sinensis “Shuchazao” were used for BAC library construction. Nuclei were washed with PVP40 to enhance DNA cloning ability because C. sinensis contains many polyphenols that may influence the construction of a large-insert DNA library. The C. sinensis BAC library contains 161,280 clones arrayed in 420 microtiter (384 wells) plates. To determine the insert sizes of the library clones, we analyzed a random sample of 430 clones from the library. The clones were digested with I-SceI, and evaluated on 1% agarose CHEF (CHEF-DRIII apparatus; Bio-Rad, Hercules, CA) gels (Fig. 2a). Result showed that the clones had insert sizes ranging from 45 to 200 kb, with an average insert size of 113 kb (Fig. 2b). Only 22 clones (5.11%) had insert sizes less than 90 kb, with 9 clones having inserts smaller than 70 kb. About 287 clones (67%) had insert fragments larger than 110 kb, some with inserts larger than 140 kb. All 430 clones had inserts, and no empty clone was found. On the basis of the insert sizes of the 430 clones, the average insert size of the BAC library was estimated to be approximately 113 kb; thus, this BAC library would represent 6.2-fold coverage of the C. sinensis genome (based on a genome size of 2940 Mb) (Huang et al. 2013). These results indicate that the tea BAC library has a very high quality.

Fig. 2
figure 2

Analysis of the insert sizes of the Hind III library. a DNA analysis of random BAC clones from the Camellia sinensis “Shuchazao” BAC library by using pulsed-field gel electrophoresis. M represents the midrange PFG marker. b Insert size distribution of the Camellia sinensis “Shuchazao” BAC library. In total, 430 random clones were analyzed

Statistics of BAC sequences and BESs

To further evaluate the quality of the tea BAC library and make a general genome survey for tea, we sequenced 20 complete BAC clones and both ends of 384 BAC clones randomly selected from the BAC library using Sanger sequencing method. PFGE results showed that they all contained the DNA inserts (around 110 kb). Details of the three BAC sequences are presented in Table 1. The GC content is 38.12% and the average BAC insert length is around 113 kb. We successfully obtained 738 raw BES reads, with a success rate of 96.1%. The total length of these BESs was 501,459 bp, and average read length (after trimming the vector sequence) was 679.48 bp (Fig. 3); GC content was 39.55%. Majority of the sequences had lengths between 500 and 800 bp (86.04%).

Table 1 Characterization of 20 BACs
Fig. 3
figure 3

BES length distribution; length was calculated after trimming the vector sequence

Assessment of contamination of mitochondrial and chloroplast DNA

For plant BAC libraries, mitochondrial and especially chloroplast DNA contaminations are usually a serious problem. To assess the contamination rate by mitochondrial and chloroplast DNA in the tea BAC library, all BESs were searched using the BLASTN algorithm against the chloroplast and mitochondrial genome sequence databases. The BLAST results showed no significant matches between BESs and mitochondrial and chloroplast sequences. All BESs were also searched to check for the bacterial DNA contamination rate and no any significant matches were found either. These results further demonstrated that the tea BAC library has a very high quality and is a valuable resource for genome sequencing, gene cloning and other purposes in tea genome studies.

Analysis of repeat sequences

Repeat sequences in the 20 BACs and 715 BESs (nonredundancy) were searched using RepeatMasker against the available repetitive sequence databases and found to be 13.44 and 13.42%, respectively. Long terminal repeat (LTR) retrotransposons were the most abundant class of transposable elements, constituting 11.91 and 9.47% of the BAC sequences and BESs, respectively. LTR retrotransposons are among the most abundant constituents of eukaryotic genomes (Havecker et al. 2004). Few lines and sines were found in these sequences (Tables 2 and 3).

Table 2 Repeat types within C. sinensis BAC sequences identified using Repeatmasker
Table 3 Repeat types within C. sinensis BAC end sequences identified using Repeatmasker

Analysis of SSRs and SSR primer development

SSRs are effective molecular markers in genetic and genomic analysis. Tables 4 and 5 listed all SSRs in the BACs and BESs. In total, 255 SSRs were found in 20 BACs and BESs (901 and 142, respectively). Among 6 SSR motifs, mononucleotide were the most frequent (372), and then were the dinucleotide (366) in 20 BACs. (Supplementary Fig. S1). The average length of SSRs in BACs and BESs were 20.74 and 19.5 bp, respectively. For BESs, 142 (19.9%) BES files contained SSRs of which 112 files contained 1 SSR, 26 files contained 2 SSRs, 3 files contained 3 SSRs and only 1 file contained 4 SSRs (Supplementary Table S2 and S3). Based on this analysis, 14 pairs of SSR primers successfully amplified tea plant DNA (Supplementary Table S4), and most primer pairs amplified one or two fragments, meanwhile some primers amplified more than five fragments such as Csi205A01-1 and Csi205B03. (Fig. 4). (Supplementary Fig. S2). SSR markers have very useful for breeding of tea plants, and these results will be considerable value for the construction of a genetic map for tea plants as well as marker-assisted selection.

Table 4 SSR counts for BACs
Table 5 SSR counts for BES
Fig. 4
figure 4

The SSR patterns resulted from amplification with 14 pairs of SSR primers. 1 C.sinensis var. assamica cv. Zijuan. 2 C.sinensis cv. Anji-baicha,3 C.sinensis cv. Anhui 1,4 C.sinensis cv. Longjing-changye,5 C.sinensis cv. Zhonghuang 1,6 C.sinensis cv. Zhenong 113, 7 C.sinensis cv. Hongyan 12,8 C.sinensis cv. Zhenong 117,9 C.sinensis cv. Yingshuang,10 C.sinensis cv. Fuzao 2,11 C.sinensis cv. Fuding-dabaicha,12 C.sinensis cv. Longjing 43,13 C.sinensis cv. Shuchazao,14 C.sinensis var. assamica cv. Yinghong 9,15 C.sinensis cv. Tie-guanyin,16 C.sinensis var. assamica cv. Yunkang 10

Gene prediction and annotation

Genes in the three BACs (LAR, TCS, and TS) were predicted. These three genes were the key enzyme genes in characteristic metabolic pathways of tea plants. The results showed that there are 30 predicted gene models in each BAC clone. The gene span was estimated to be 3.93 kb per gene (Supplementary Table S5). BLAST top hits analysis using the predicted gene sequences in the BACs (Supplementary Figs. S3a) and the whole BAC end sequences as queries (Supplementary Fig. S3b) showed an absolute predominance to Vitis vinifera. These observations were consistent with the results of previous studies, in which unigenes of tea transcriptomes were reported to have the highest homology to genes of V. vinifera (Wu et al. 2014; Zhang et al. 2015). In GO classification, “metabolic and cellular process” was the most common functional sub-category in the category “biology process” for both BESs and BACs. For BESs, the sub-category “catalytic function” exceeded “binding function” under the category “molecular function.” For the category “cell component,” “cell function” was more common than the others (Supplementary Fig. S4).

Promoters analysis of LAR, TCS, and TS

Some important elements that make up core promoter such as TATA box and CAAT box were identified in LAR, TCS, and TS. Additionally, a larger number of light-responsive cis-acting elements such as ATCT-motif, Box 4, G-Box, GT1, GTGGC-motif, MRE, TCT-motif, chs-Unit 1 m 1, 3-AF1 binding site, Box I, CATT-motif, TCCC-motif (Figs. 5, 6, and7) were identified in LAR, TCS, and TS. According to previous reports, the biosynthesis of catechins (Wang et al. 2012; Jin et al. 2016), caffeine (Chen. 2010; Liu et al. 2013a), and theanine (Matsuura and Kakuda 1990) were greatly influenced by light. Additionally, many regulatory elements involved in binding site of MeJA-responsiveness and ethylene-responsive were also identified. There were also several cis-regulatory elements such as ARE, TC-rich repeats, GC-box, HSE, LTR, MBS, Box-W1, and TCA-element, which served as a putative binding sites for specific TFs in response to biotic and abiotic stresses.

Fig. 5
figure 5

Promoter prediction of LAS

Fig. 6
figure 6

Promoter prediction of TCS

Fig. 7
figure 7

Promoter prediction of TS

Alternative splicing (AS) analysis of LAR, TCS, and TS

The process of AS can be spliced in different patterns to produce different mRNA structurally and functionally. AS may be one of mechanisms that accounts for the greater macromolecular and cellular complexity in higher living organisms. Every conceivable pattern of alternative splicing is found in nature, and the main types of alternative splicing including intron retention, alternative 3′ splice sites, alternative 5′ splice sites and exon skipping. We obtained four alternative 5′ splice sites and two alternative 3′ splice sites in LAR gene. Moreover, there are two alternative 5′ splice sites and one intron retention in TCS gene. In addition, more AS types in TS, we found three alternative 3′ splice sites, three intron retention and one exon skipping (Fig. 8). To experimentally confirm the accuracy of the identified AS analysis, primers were designed for PCR amplification. The results showed that size of the fragments and the bands on the agarose gel and capillary electrophoresis were mostly consistent with the alternative splicing isoforms (Fig. 9).

Fig. 8
figure 8

AS prediction of LAR TCS and TS by tea transcriptomic reads

Fig. 9
figure 9

Capillary electrophoresis (a) and agarose gel (b) of AS amplified from LAR TCS and TS

Discussion

Most studies have focused on investigating metabolic genes (Liu et al. 2012; Wang et al. 2014b) and the transcriptome (Jia et al. 2015; Zhang et al. 2015) of C. sinensis. In this study, we constructed a high quality, publicly available BAC library for C. sinensis. The average insert size of C. sinensis BAC clones was 113 kb, representing approximately 6.2-fold overage of the C. sinensis genome (2940 Mb) (Huang et al. 2013).

For BAC libraries, contaminations of chloroplast and mitochondrial DNA origins are serious problems and the rates are usually used as important indicators for the quality of plant BAC libraries. We sequenced the ends of 384 random clones and did not find any matches to chloroplast and mitochondrial DNA sequences. The rates are lower than those reported for other plant BAC libraries (Anistoroaei et al. 2011; Cao et al. 2014; Feng et al. 2015; Liu et al. 2013b; Xia et al. 2014). The low rate of chloroplast DNA contamination may be attributable to the use of the petal for BAC library construction. The petals of C. sinensis contain few chloroplasts and are thus a good material for BAC library construction. Our sequence data set of BACs and BESs from this library also provided a glimpse into the sequence composition of the tea genome. Plant genomes usually contain large proportions of repeat sequences. For example, the rice genome contains >35% (Li 2005; Wang et al. 2014a) and the maize genome contains >80% of repeat sequences (Schnable et al. 2009). From the limited tea BAC sequences and BESs, we only found 18.21 and 13.42%, respectively, that match the known repeat sequences. This may mean that the tea genome contains a large amount of unknown or genome-specific repeat sequences. A high quality de novo whole-genome sequence of tea is required to reveal the structure and composition of the genome. The tea BAC library provides an important foundation for whole-genome sequencing assembly and annotation during de novo sequencing of the tea genome.

AS, the process, can generate multiple transcripts from a single coding sequence, moreover, AS of mRNA allows structurally and functionally distinct mRNA and protein variants to be produced. It has recently been proposed as a mechanism by which higher-order diversity is generated. It also provides evolutionary flexibility. We found that there were many regulatory elements involved in binding site of light in LAR, TCS, and TS. This indicated that light can be most important factors involved in regulation of core secondary metabolite genes through regulatory region.

Conclusions

A high quality BAC library for C. sinensis with an average insert size of 113 kb and over 6.2-fold coverage of the genome was constructed. This publicly available resource will provide a useful platform for genetic and genomic research of C. sinensis, such as gene identification, functional genomic research, physical mapping, gene isolation and regulation, as well as complex analysis of targeted genomic regions. The AS and promoter analysis of LAR, TCS, and TS will provide important reference for functional genomic research.