1 Introduction

Whole genome amplification (WGA), a widespread approach for amplifying DNA samples that are too small for sequencing in single-cell genomics [13], has been used extensively for single-cell copy number variation (CNV) analysis, at the cost of introducing biases [48]. CNVs are a key factor in cancer mutation [1, 9] and a common form of genomic variation closely associated with various diseases, so their detection and analysis contribute to biological and medical research.

Because the amount of specimen is limited, WGA methods such as polymerase chain reaction (PCR) and multiple displacement amplification (MDA) are widely used to facilitate CNV detection and analysis at the single-cell level. MDA, a DNA amplification method widely used in single-cell genomics studies, uses Φ29 DNA polymerase and random primers to generate large amounts of DNA from a genome sample [10]. Compared with PCR-based amplification, MDA produces high-quality output with low error rates and is not limited by target length [10].

The introduction of WGA helps ensure the accuracy of CNV detection; nevertheless, it also gives rise to amplification biases [48]. Although the mechanism by which GC content influences DNA polymerase function remains unsettled, it has been suggested that the amplification quality of a template is closely associated with its GC content [5]. GC-rich or GC-poor content can lead to over-amplification or under-amplification of specific regions of the template [5], causing those regions to be misrepresented. WGA-induced bias therefore significantly limits the sensitivity and specificity of CNV detection.

To investigate this limitation, an empirical algorithm was developed for CNV detection at the single-cell level. The proposed method consists of base call amplification, alignment and analysis for the removal of MDA-induced bias, with the aid of the Multiple Displacement Amplification Simulator (MDAsim). MDAsim is software that simulates the MDA process and generates simulated reads that closely approximate experimental ones [11]. By comparing the simulated outputs with the input chromosome 21, corrective measures were carried out to remove and compensate for the MDA-induced biases. The proposed algorithm is expected to optimize the MDAsim analysis and improve the accuracy of the simulation process.

In the proposed algorithm, chromosome 21 of the human genome was selected as the reference template and amplified to various coverages with MDAsim, generating about 50G of short-read data. Each read was trimmed to 50 bases and aligned to chromosome 21 with BWA [12]. Extensive statistical analysis was conducted to investigate the correlation between genomic GC content and the corresponding read coverage, the per-position error numbers considering only the wrong base calls, and the per-base error rate considering all base calls. Finally, we report the base substitution error frequencies.

2 Methods

A systematic pipeline was designed to analyze the simulated data set. The pipeline consists of three steps: amplification, alignment and analysis. Chromosome 21 was selected as the template whose base calls were amplified by MDAsim [11].

Step 1: Amplification. Because the whole of chromosome 21 is too large for the amplification software to process at once, the 48 M reference was split into 45 subgroups, each 3 M in length and overlapping the next by 2 M (1–3, 2–4, …). Each resulting 3 M FASTA file was then used as input for amplification. With MDAsim [11], chromosome 21 was amplified to different coverage ranges under various parameter settings to simulate reads with different GC contents.
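The overlapping split described in Step 1 can be illustrated with a minimal Python sketch. The function names `overlapping_windows` and `split_sequence` are hypothetical and not taken from the original pipeline, and the sketch assumes that a trailing fragment shorter than one window is simply dropped; the number of chunks obtained in practice depends on the exact reference length and on how the tail is handled.

```python
def overlapping_windows(total_len, window, step):
    """Yield half-open (start, end) intervals of length `window`, advancing by `step`.

    Assumption: a trailing piece shorter than `window` is not emitted.
    """
    start = 0
    while start + window <= total_len:
        yield start, start + window
        start += step


def split_sequence(seq, window, step):
    """Return the overlapping sub-sequences of `seq` (hypothetical helper)."""
    return [seq[s:e] for s, e in overlapping_windows(len(seq), window, step)]


if __name__ == "__main__":
    # Toy example: window of 3 units, step of 1 unit, so neighbours share 2 units,
    # mirroring the 3 M windows with 2 M overlap described in the text.
    for chunk in split_sequence("ACGTACGTAC", window=3, step=1):
        print(chunk)
```

For the real reference, `window` and `step` would be given in bases (for example 3,000,000 and 1,000,000), and each chunk would be written to its own FASTA file before amplification.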

Step 2: Alignment. BWA [12] was used to map the amplified reads at each coverage against the reference template. The alignment process generates an intermediate binary sai file and a final sam file. The sam file is in the SAM format [14], in which each line holds the alignment information for one read.
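To make the structure of the alignment output concrete, the following Python sketch parses one SAM alignment line into its mandatory fields and optional tags. It follows the tab-separated layout of the SAM format (eleven mandatory fields followed by `TAG:TYPE:VALUE` entries); the `SamRecord` type and the function names are illustrative assumptions, not part of the original Perl scripts.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class SamRecord(NamedTuple):
    """Subset of the SAM fields used in this sketch (hypothetical container)."""
    qname: str
    flag: int
    rname: str
    pos: int          # 1-based leftmost mapping position, as stored in SAM
    mapq: int
    cigar: str
    seq: str
    tags: Dict[str, str]


def parse_sam_line(line: str) -> SamRecord:
    """Parse a single alignment line; header lines must be filtered out first."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], tags)


def read_sam(lines: Iterable[str]) -> Iterator[SamRecord]:
    """Yield records, skipping '@' header lines and blank lines."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_line(line)
```

A typical use would be `for rec in read_sam(open(path))`, where `path` points to the sam file produced by the alignment step.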

Step 3: Analysis. To extract the classified errors from the BWA outputs and analyze the MDA-induced biases, an extensive statistical analysis was developed covering the correlation between read coverage and GC content, base substitution errors in reads, per-position error numbers considering only the wrong base calls, and the per-base error rate considering all base calls.

3 Results

Chromosome 21 was amplified to coverages ranging from 40 to 60, because only in that range did the output contain U1 reads (matches with exactly one error, either an insertion or a replacement). BWA analysis was then conducted on the resulting data sets.

Finally, we acquired 90923032 50mer reads that the Perl scripts reported as uniquely matched against the chr21 reference sequence; these were labeled U0, U1 or U2 (Fig. 1).

Fig. 1. Pie chart of the read analysis. The categories are NM, no match found; U0, exact match found without any error; U1, match with exactly one error (insertion or replacement); U2, match with exactly two errors (insertion or replacement); U0’, exact match found without any error, but with a length of less than 50.
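One way the read categories of Fig. 1 could be derived from the alignment output is sketched below in Python. This is an assumption-laden illustration rather than the logic of the original Perl scripts: it assumes that unmapped reads carry the SAM flag bit 0x4, that uniqueness is read from the `XT:A:U` tag written by `bwa samse`, and that the number of errors is taken from the `NM` edit-distance tag (which also counts deletions, whereas the text mentions only insertions and replacements). The labels "non-unique" and "other" are hypothetical additions for reads that fall outside the categories of Fig. 1.

```python
from collections import Counter

UNMAPPED_FLAG = 0x4  # SAM flag bit: segment unmapped


def classify_read(flag, seq, tags, read_len=50):
    """Assign a read to NM, U0, U0', U1 or U2 (assumed rule set, see lead-in)."""
    if flag & UNMAPPED_FLAG:
        return "NM"                      # no match found
    if tags.get("XT") != "U":
        return "non-unique"              # hypothetical label, not in Fig. 1
    edits = int(tags.get("NM", -1))      # edit distance to the reference
    if edits == 0:
        # Exact match; U0' marks exact matches shorter than the full read length.
        return "U0" if len(seq) >= read_len else "U0'"
    if edits in (1, 2):
        return "U{}".format(edits)
    return "other"                       # hypothetical label, not in Fig. 1


def tally(records):
    """Count categories over an iterable of (flag, seq, tags) tuples."""
    return Counter(classify_read(flag, seq, tags) for flag, seq, tags in records)
```

The resulting counter can be turned into the proportions of a pie chart by dividing each count by the total number of reads.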

  • Correlation between the read coverage and GC content

    Reads amplified from chromosome 21 to a coverage of 50 were aligned to analyze the GC bias in WGA. First, the number of reads starting in each sliding window of 1 kbp was counted. Comparing this statistic with the sequence characteristics of chromosome 21 shows a positive correlation between read coverage and GC content: coverage increases with GC content. However, when the GC content exceeds 45 %, coverage decreases as GC content increases.

    We defined the relative read number (RRN) [13] as the quotient of the number of reads in each observation window and the average number of reads per window; ideally it equals one. By comparing GC content and RRN, we found that the RRN tended to be below average in GC-rich genomic regions (>45 %) (Fig. 2), implying amplification bias within these regions. Furthermore, base substitution analysis was performed in these regions to correct the biases.

    Fig. 2. Correlation of the read coverage and GC content: 50mer reads acquired from the chromosome 21. Each bar corresponds to the number of reads recorded for a 1-kbp window.
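The window statistics used in this analysis can be sketched in Python as follows. The sketch assumes non-overlapping 1-kbp windows and 0-based read start positions, which may differ from the sliding-window scheme actually used; the function names are hypothetical. RRN is computed as defined above, that is, the read count of a window divided by the mean read count over all windows.

```python
def gc_fraction(seq):
    """Fraction of G and C among the A/C/G/T bases of `seq`; None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def window_read_counts(read_starts, ref_len, window=1000):
    """Count reads whose 0-based start position falls into each window."""
    n_windows = (ref_len + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < ref_len:
            counts[pos // window] += 1
    return counts


def relative_read_numbers(counts):
    """RRN per window: window count divided by the mean count over all windows."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        return [0.0] * len(counts)
    return [c / mean for c in counts]


if __name__ == "__main__":
    counts = window_read_counts([0, 5, 1000, 2001, 2500, 2999], ref_len=3000)
    print(counts, relative_read_numbers(counts))
```

Pairing `gc_fraction` of each reference window with its RRN gives the data points behind a plot such as Fig. 2; windows whose RRN is clearly below one are candidates for under-amplified regions.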

  • Analysis of base substitution errors in reads

    The overall substitution errors were calculated and are summarized in Table 1. There are twelve possible substitution errors (8 transversions and 4 transitions) for a base call. The G > A transition occurs most frequently, accounting for almost half of the substitution errors, and the least frequent substitution errors are G > T and C > T. The base most frequently substituted is G and the least frequently substituted is A. However, A is the most frequent replacement base, while T is the least.

    Table 1. Base substitution frequencies in the read data sets

    A further analysis was performed on the substitution errors in the GC-enriched regions (>45 %, Fig. 2), from which the substitution base calls in these biased regions can be compensated. Here the T > C transition occurs most frequently, accounting for almost half of the substitution errors, and the least frequent substitution errors are T > A and G > C. The base most frequently substituted is T and the least frequently substituted is A. However, A is the most frequent replacement base, while G is the least. With this estimated substitution information, we compensate for errors to correct biased readings in the GC-enriched regions (>45 %, Fig. 2).
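The substitution tally behind a table such as Table 1 can be sketched in Python as below. The sketch assumes ungapped alignments, so that each read can be compared base by base with a reference segment of the same length and orientation; the function names are hypothetical. It also shows why the twelve possible substitutions split into 4 transitions (purine to purine or pyrimidine to pyrimidine) and 8 transversions.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref, alt):
    """Return 'transition' or 'transversion' for a reference base and a called base."""
    if ref == alt:
        raise ValueError("not a substitution")
    pair = {ref, alt}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(pairs):
    """Count 'ref>called' substitutions over (reference_segment, read) pairs."""
    counts = Counter()
    for ref_seq, read_seq in pairs:
        for r, q in zip(ref_seq.upper(), read_seq.upper()):
            if r != q and r in "ACGT" and q in "ACGT":
                counts["{}>{}".format(r, q)] += 1
    return counts


def frequencies(counts):
    """Convert substitution counts into fractions of all substitutions."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()} if total else {}
```

Restricting the input pairs to reads that map into GC-enriched windows would give the region-specific frequencies discussed above.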

  • Numbers of wrong base calls in reads versus the position along the read

    All the U1, U2 and U3 reads, i.e. 3817 reads (cf. Fig. 1), were selected for analysis of the occurrence of errors per position. Two measurements are provided to quantify the errors. The first is the per-position error number, considering all the wrong base calls. The second is the per-base error rate among all the base calls. The results are shown in Fig. 3. Figure 3a shows that a high fraction of the wrong base calls occurs at the first and last positions of the read: 8.2 % of the errors in the data sets are found at read position 1, and 6.7 % at the last read position (position 50 in the data set, Fig. 3a). The rate of wrong base calls (Fig. 3b) shows a similar tendency: it is highest at the first position of the read and second highest at the last position.

    Fig. 3. Numbers of wrong base calls in reads depending on the position along the read. (a) Per-position error numbers considering all the wrong base calls. (b) Per-base error rate among all the base calls.
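The two position-wise measurements can be sketched in Python as follows. As a simplifying assumption, each read is paired with an ungapped reference segment of the same length, already oriented along the read from its first to its last position; the function names are hypothetical. The first measure gives, for each position, the share of all wrong base calls found there (the quantity behind figures such as "8.2 % at position 1"), and the second divides the wrong calls at a position by all base calls at that position.

```python
def per_position_errors(pairs, read_len=50):
    """Return (errors, calls): wrong base calls and total base calls per read position."""
    errors = [0] * read_len
    calls = [0] * read_len
    for ref_seq, read_seq in pairs:
        for i, (r, q) in enumerate(zip(ref_seq, read_seq)):
            if i >= read_len:
                break
            calls[i] += 1
            if r != q:
                errors[i] += 1
    return errors, calls


def error_share_by_position(errors):
    """Share of all wrong base calls that falls on each position."""
    total = sum(errors)
    return [e / total if total else 0.0 for e in errors]


def error_rate_by_position(errors, calls):
    """Wrong base calls divided by all base calls, per position."""
    return [e / c if c else 0.0 for e, c in zip(errors, calls)]


if __name__ == "__main__":
    toy = [("ACGT", "TCGT"), ("ACGT", "ACGA"), ("ACGT", "TCGT"), ("ACGT", "ACGT")]
    errs, calls = per_position_errors(toy, read_len=4)
    print(errs, error_share_by_position(errs), error_rate_by_position(errs, calls))
```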

4 Conclusion

In this study, an algorithm was developed to detect the bias of multiple displacement amplification and the relation between GC content and coverage at the single-cell level. The proposed method consists of base call amplification, alignment, analysis and base call substitution compensation. Chromosome 21 was selected and amplified to a coverage of 50. The defined RRN shows that coverage tends to be below average within GC-rich regions (Fig. 2). The substitution errors in the GC-rich regions and the overall substitution errors were extensively analyzed and estimated in order to compensate for base substitution errors. For the overall reads, wrong base calls are frequently preceded by base G. Base substitution error frequencies vary, with the G > A transition being the most frequent and C > T and G > T among the least frequent substitution errors. For the biased regions (GC-rich, >45 %, Fig. 2), the T > C transition occurs most frequently, and the least frequent substitution errors are T > A and G > C. With this estimated substitution information, we compensate for the errors to correct the MDA-induced bias.