1.1 Introduction

Spinal muscular atrophy (SMA) is an autosomal recessive motor neuron disease with an annual incidence of about 1 in 6000 to 1 in 10,000 live births and a carrier frequency as high as 1 in 40 [1]. Ninety-five per cent of SMA patients are homozygous for SMN1 deletion. SMN2 partially compensates for SMN1 loss, lowering the severity of SMA. However, when two SMN1 genes are carried in cis, the result is a silent carrier (i.e. a carrier that does not express the SMA phenotype) [2, 3]. Further complicating diagnosis, SMN1 and SMN2 are nearly identical inversions [4], differing by only five base pairs: c.835−45G>A, c.840C>T, c.*3+100A>G, c.*3+214A>G and c.*248A>G [5]. There is therefore a need to (1) differentiate between SMN1 and SMN2, (2) call the copy number (CN) of SMN1 and SMN2 and (3) determine the phase of SMN1 and SMN2 for SMA clinical classification, prognosis, carrier identification and diagnosis [6].

Current methodologies such as multiplex ligation-dependent probe amplification (MLPA) and real-time PCR (RT-PCR) have limitations. MLPA cannot determine the phase of SMN1 and SMN2, so it cannot identify silent carriers with two SMN1 copies on one chromosome and none on the other, which leads to false negative results. In RT-PCR, the SMN genes are amplified unevenly, which leads to inaccurate results. To overcome these limitations, we evaluated three emerging methods, namely Linked-Reads, Cytoscan (CYT) array and whole genome sequencing (WGS), against MLPA, the most commonly used method for carrier screening and diagnosis of SMA, by their ability to (1) differentiate between SMN1 and SMN2, (2) determine the CN of SMN1 and SMN2, (3) locate structural variants in SMN1 and SMN2 and (4) phase alleles.

1.1.1 Hypothesis

We hypothesise that Linked-Reads, CYT array and WGS can overcome the limitations of current methods in determining patient or carrier status by differentiating between the two almost identical SMN genes, as well as calling CN in trans.

1.2 Methods

1.2.1 Sample Information

A total of six anonymised data sets from SMA patients with known SMN1 and SMN2 CN were provided by the National University of Singapore, Department of Paediatrics.

1.2.2 Technologies

  1. Linked-Reads

DNA was sheared and size-selected. The Chromium™ system was then used for automated barcoded library construction, and the barcoded libraries were sequenced using Illumina whole exome sequencing (WES). The data obtained were visualised in Loupe, a genome browser from 10x Genomics designed for Linked-Reads data [7]. The BAM file obtained was also visualised in the Integrative Genomics Viewer (IGV).

  2. CYT Array

Gene probes were deposited on a chip. cDNA, labelled with either green or red fluorescence, was generated from extracted mRNA. cDNA that hybridised to complementary probes on the chip was detected by fluorescence emission. The data obtained were visualised in Chromosome Analysis Suite (ChAS), following the user manuals provided by Thermo Fisher Scientific Inc. [8].

  3. WGS

Patient DNA was sequenced by WGS at 40× read depth. The BAM data obtained were visualised in IGV, following the user guides provided by the Broad Institute (2018). Genome Reference Consortium Human Build 37 (GRCh37/hg19) was used as the reference for WGS samples.

  4. MLPA

DNA was denatured to separate the strands and hybridised with probes. The right probe oligo contains a stuffer sequence, which is used to identify DNA pieces. The DNA was then amplified by PCR, and the MLPA amplicons were separated by length using capillary electrophoresis [9]. The measured fluorescence was visualised as a peak pattern and used to quantify each probe. CN was determined from the probe ratio.

  5. CN calling using SMN/mean read depth ratio

The c.840C>T site on exon 7 is the critical difference between SMN1 and SMN2. Additionally, we were provided with the read depths of SMN1 and SMN2 exons 7 and 8 for each of our samples. Thus, using protocols modified from [10, 11], we determined the CN for our Linked-Reads and WGS samples by comparing the read depth of SMN1 and SMN2 exons 7 and 8 against each sample’s overall mean read depth using the following formula.

$$\text{Copy number} = \frac{\text{Read depth of exon}}{\text{Mean read depth of sample}} \times 2$$
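The formula above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `estimate_copy_number` and the depth values in the usage line are hypothetical and are not taken from the study's samples or pipeline.

```python
def estimate_copy_number(exon_read_depth: float, sample_mean_depth: float) -> float:
    """Return the CN estimate: (exon read depth / sample mean read depth) x 2."""
    if sample_mean_depth <= 0:
        raise ValueError("sample mean read depth must be positive")
    # The factor of 2 reflects the diploid baseline: an exon covered at the
    # sample's mean depth is taken to be present in two copies.
    return exon_read_depth / sample_mean_depth * 2


# Hypothetical depths for illustration only (not from the study).
print(estimate_copy_number(exon_read_depth=20.0, sample_mean_depth=40.0))  # 1.0
```

An exon covered at half the sample's mean depth therefore yields an estimate of one copy, and an exon with no coverage yields zero.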

1.3 Results

1.3.1 Linked-Reads

Linked-Reads sequencing generates reads with an integrated barcode which traces the reads back to the original DNA molecule [12]. This allowed the reads to be mapped to the SMN1 and SMN2 genes, as shown from the read coverage in Fig. 1.1. The coverage of the c.840C site on SMN1 and c.840T site on SMN2 was also verified on IGV, confirming that Linked-Reads was able to differentiate between the SMN genes.

Fig. 1.1

(a) SMN1 and SMN2 in sample 300,097. (b, c) SMN2 (left) and SMN1 (right). The green bar in the coverage track indicates read depth for the region. Genes and their exons are identified and labelled in the genes track

Structural variants (SVs) can be detected through the calls and candidates reported by Linked-Reads; calls meet a higher quality threshold than candidates and occur in unambiguous regions of the reference genome. However, no SVs were called by Linked-Reads in our samples. A deletion of exon 7 in SMN1 was nonetheless observed in sample 300,099 when viewing the reads in IGV, consistent with the known SMN1 CN of 0.

CN was also calculated by comparing the read depths of exons 7 and 8 in SMN1 and SMN2 against each sample’s mean read depth. As seen in Table 1.1, there are discrepancies between the calculated and known CN, indicating that the observed read depths underestimate the actual CN. This discrepancy could be due to difficulties in sequencing, for the following reasons: (1) the SMN1 and SMN2 genes are part of a 500 kb highly repetitive inverted duplication on chromosome 5, making it difficult to determine the organisation of this genomic region [13]; (2) the high GC content of 54% in SMN1 and SMN2 [14] leads to poor read coverage and a less complete assembly; (3) the input DNA mass of 0.4–0.5 ng was below the recommended range of 1–3 ng, affecting sequencing performance.

Table 1.1. CN call for SMN1 and SMN2 exons 7 and 8 of samples 300,097–99

Linked-Reads is able to phase alleles by assembling long reads from short reads, creating a phase block by using continuous, reliable heterozygous variants (phasing quality > 23) to connect the reads [15]. However, in this analysis, reads in SMN1, SMN2 and their flanking regions in samples 300,097–300,099 were not assigned to either haplotype, as too few informative single-nucleotide variants (SNVs) [15] were present in our samples for Long Ranger to determine phase blocks.

1.3.2 CYT Array

Both SMN genes are labelled as SMN1 and SMN2 simultaneously by ChAS (Fig. 1.2), showing that ChAS is unable to differentiate between SMN1 and SMN2.

Fig. 1.2

SMN1 and SMN2 annotation in ChAS

Mean weighted Log2 ratio and smooth signal values were calculated by ChAS and used to determine the CN. The Log2 ratio indicates a gain or loss of genetic material, with a ratio of 0 indicating a CN of 2. The Log2 ratios of SMN1 and SMN2 in CYT34 and CYT221 are close to 0, indicating that the CN of both genes in both samples is 2. Smooth signal is a smoothed, calibrated estimate which can represent non-integer CN. It uses a Gaussian function to reduce noise within the array, allowing a more accurate CN to be determined. The smooth signal values of SMN1 and SMN2 in CYT34 and CYT221 are also close to 2, corroborating the CN calculated from the Log2 ratio in both samples. However, there is a discrepancy between the calculated and the known CN (Table 1.2), as Cytoscan is unable to differentiate between highly homologous regions such as SMN1 and SMN2 [15].
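The relationship between a Log2 ratio and CN described above can be illustrated with a short Python sketch. It assumes the idealised relationship CN = 2 × 2^(Log2 ratio) relative to a two-copy reference; this is not ChAS's internal algorithm, which applies its own weighting, calibration and smoothing, and the function name is hypothetical.

```python
def copy_number_from_log2_ratio(log2_ratio: float, reference_copies: int = 2) -> float:
    """Idealised CN implied by a Log2 ratio measured against a two-copy reference.

    Assumption: the ratio is log2(sample signal / reference signal), so a
    ratio of 0 means the sample carries the same number of copies as the
    reference.
    """
    return reference_copies * 2 ** log2_ratio


print(copy_number_from_log2_ratio(0.0))   # 2.0 (no gain or loss)
print(copy_number_from_log2_ratio(-1.0))  # 1.0 (loss of one copy)
```

In practice array signals tend to be compressed, so measured ratios rarely reach these idealised values, which is one reason a smoothed estimate is reported alongside the Log2 ratio.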

Table 1.2. Mean weighted Log2 ratio, smooth signal values, calculated and known CN of samples CYT34 and CYT221

1.3.3 WGS

WGS was able to identify the SMN1 and SMN2 genes, as indicated by the presence of reads in these regions. Reads that align with the reference sequence are displayed in grey. However, WGS was not able to differentiate well between the two genes. Of the five base-pair differences between the SMN genes, misalignment of SMN1 reads to SMN2 was observed at four in sample NGS-1108 (Fig. 1.3). Critically, at the c.840 site, no reads were observed in SMN1, whereas in SMN2 reads carrying both C (24 reads) and T (17 reads) were mapped (Fig. 1.3). This misalignment reflects WGS’s inability to distinguish between homologous regions such as the SMN genes [16]. The CN calls for this sample are given in Table 1.3.
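As a purely arithmetic illustration of the c.840 observation above, the read counts at the SMN2 position can be converted into per-base fractions. The helper name `allele_fractions` is hypothetical; only the counts (24 C reads and 17 T reads) come from the text, and no CN is inferred from them here.

```python
from typing import Dict


def allele_fractions(base_counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-base read counts at one position into fractions of all reads."""
    total = sum(base_counts.values())
    if total == 0:
        raise ValueError("no reads cover this position")
    return {base: count / total for base, count in base_counts.items()}


# Read counts reported at c.840 in SMN2 for sample NGS-1108.
fractions = allele_fractions({"C": 24, "T": 17})
print({base: round(value, 3) for base, value in fractions.items()})
# {'C': 0.585, 'T': 0.415}
```

More than half of the reads aligned to the SMN2 position carry the SMN1-type base C, which is what would be expected if SMN1 reads were being placed on SMN2.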

Fig. 1.3

(a) c.840C on SMN1 of sample NGS-1108; no reads were observed. (b) c.840T on SMN2 of sample NGS-1108. The bar is coloured in proportion to the read count of each base: cytosine in blue and thymine in red

Table 1.3. CN call for SMN1 and SMN2 exons 7 and 8 of sample NGS-1108

1.3.4 MLPA

SMN1 and SMN2 genes were identified and differentiated by MLPA (Table 1.4) using probes specific to SMN1 and SMN2 exons 7 and 8. CNs were deduced from probe ratios provided by the manufacturer [9]. Sample O221 had a single copy of SMN1 exon 7 and exon 8 and is, therefore, a carrier of SMA. Sample O34 was found to have two copies of SMN1, which indicates that the patient is unaffected. However, this holds only when the SMN1 genes occur in trans. As MLPA does not phase alleles, it cannot confirm that O34 is not a silent carrier. Sample O34 has two copies of SMN2 exon 7 but only one copy of SMN2 exon 8, indicating a deletion of SMN2 exon 8 in this sample. Hence, the second copy of SMN2 is not a fully functional gene. MLPA can be considered a reliable tool for determining CN, as the obtained CN was consistent with the known CN (Table 1.4).

Table 1.4. Determined CN from probe ratio, in comparison with known CN for sample O34
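The step from an MLPA probe ratio to a CN can be sketched as follows. The sketch assumes the common convention that a ratio of about 1.0 corresponds to the two copies present in the reference samples, so that a ratio of about 0.5 corresponds to one copy. The function name and the ratios in the loop are hypothetical, and the cut-offs in the manufacturer's protocol [9] may differ from simple rounding.

```python
def copy_number_from_probe_ratio(probe_ratio: float, reference_copies: int = 2) -> int:
    """Map an MLPA probe ratio to the nearest whole copy number.

    Assumption: the ratio is normalised against reference samples that
    carry `reference_copies` copies of the target sequence.
    """
    if probe_ratio < 0:
        raise ValueError("probe ratio cannot be negative")
    # Round half up to the nearest whole copy.
    return int(probe_ratio * reference_copies + 0.5)


# Hypothetical ratios for illustration only (not from the study).
for ratio in (0.0, 0.5, 1.0, 1.5):
    print(ratio, copy_number_from_probe_ratio(ratio))
# 0.0 0
# 0.5 1
# 1.0 2
# 1.5 3
```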

1.3.5 Comparison of Linked-Reads, CYT Array and WGS with MLPA

The three emerging methods (Linked-Reads, CYT array and WGS) were compared against MLPA. Linked-Reads is capable of differentiating between the highly homologous SMN1 and SMN2 genes, a critical capability in which it matches MLPA and which sets it apart from CYT array and WGS. Linked-Reads identification of molecules is more reliable than that of CYT array because it uses a different “identification code” for each molecule [12], whereas CYT array uses microarray analysis, which relies on probes [17] that are similar for the highly homologous SMN1 and SMN2 genes. High homology is also a drawback for WGS, as reads from SMN1 may misalign to the highly homologous SMN2 gene during sequence assembly owing to their short length.

In this analysis, we estimated CN from Linked-Reads and WGS data with a simple method, comparing the read depths of SMN exons 7 and 8 with each sample’s mean read depth, because we lacked access to sophisticated computational pipelines or software (such as those described in [18]) for determining CN. Our method gives an approximate estimate which would have to be confirmed either computationally or through wet-lab experiments. Nevertheless, as we expect CN either to reflect a deletion or to lie between two and four, the exact value of the increased read depth can be rounded off. While Linked-Reads was able to detect a deletion of exon 7 in SMN1 corresponding to sample 300,099’s SMN1 CN of 0, the CNs of SMN1 and SMN2 could not be accurately determined using Linked-Reads and WGS data, unlike with MLPA.

Owing to the lack of heterozygous SNVs in close proximity within the SMN regions, Linked-Reads was not able to determine the haplotype of SMN1 and SMN2 in our samples, which is important for identifying silent carriers of SMA. However, given adequate heterozygous SNVs in the SMN regions, Linked-Reads would be able to determine phase blocks and resolve haplotypes [19], giving it a major advantage over the other three methods.

1.4 Conclusion

Linked-Reads can differentiate between the SMN1 and SMN2 genes and identify SNPs, and it has the potential to identify SVs and phase alleles to determine haplotypes. It can therefore be viewed as a possible tool for carrier screening and diagnosis, as it offers a way to overcome the limitations of MLPA and RT-PCR, namely the inability to phase SMN1 and SMN2 and the uneven amplification of genes, respectively. Although Linked-Reads and WGS were not able to call SMN gene CNs accurately in this analysis, further work can be done to optimise the technology so that it matches MLPA’s ability to call CN. To overcome the limitation of low read depth, read counts can be normalised to account for GC bias [19], and DNA input mass can be increased to 1 ng per library [20].