1 Introduction

SARS-CoV-2, the virus responsible for the COVID-19 pandemic, was first identified in Wuhan, China, in late 2019. Since the beginning of the pandemic, as of April 2022, more than 504 million cases have been reported globally, with more than 6.2 million deaths (Yu et al. 2022). COVID-19 has had tremendous global and social impacts (Sanjay 2020). Bangladesh, a low- and middle-income country of nearly 170 million people, has been similarly affected.

SARS-CoV-2 testing is a principal bulwark in the response to the pandemic. The Bangladesh government imposed restrictions and quarantines and also took the lead in launching testing in public and private facilities. Praava Health, a private healthcare facility, established one of the first PCR testing laboratories and led a concerted effort to test patients in Dhaka, Bangladesh, and neighboring areas. Testing for SARS-CoV-2 in people who have symptoms, and in those who have no symptoms but may have been exposed to the virus, can help prevent the spread. A positive test early in the course of the illness enables individuals to isolate themselves and seek treatment earlier. It may also reduce the risks of infecting others and developing severe disease, long-term disability, or death. Since nearly half of all SARS-CoV-2 infections are transmitted by people who show no symptoms, identifying asymptomatic and pre-symptomatic infected individuals plays a major role in controlling the pandemic (Johansson et al. 2021). Comorbidities such as heart disease, obesity, and diabetes are also more common in under-represented communities because of long-standing societal and environmental factors and impediments to healthcare access (Bajgain et al. 2021). COVID-19 can spread quickly in these communities, and the impact of that spread is high. Testing, particularly of asymptomatic and pre-symptomatic individuals, is the key to stopping this spread. According to WHO, SARS-CoV-2 will be difficult to eradicate and will probably continue to circulate indefinitely with periodic outbreaks and epidemics. This will make testing critical for decreasing transmission (Morens et al. 2022).
The current study was designed to investigate the genomic diversity of SARS-CoV-2 variants isolated from Bangladeshi patients and to analyze the temporal profile of the mutational accumulations within the whole genome and within the gene encoding the spike protein.

2 Methods

2.1 Collection of samples and clinical data, nucleic acid extraction, and COVID-19 testing of samples

Nasopharyngeal swabs from patients were collected in viral transport media (VTM) according to CDC guidelines, and clinical information including age, gender, symptoms, clinical classification, and locality was recorded. Total nucleic acid was extracted using commercial kits according to the manufacturer's protocol. A 5 μL aliquot of total nucleic acid was subjected to RT-PCR screening following the CDC’s 2019 Novel Coronavirus (2019-nCoV) Real-Time RT-PCR Diagnostic Panel guide. From April 2021 to January 2022, a total of 130 positive samples with Ct values less than 30 were selected randomly for whole-genome sequencing at the Genomic Research Laboratory of the Bangladesh Council of Scientific and Industrial Research (BCSIR). Informed consent was obtained from all participants in the study.
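The selection step described above (positive samples with Ct values below 30, chosen at random) can be sketched as follows. This is a minimal illustration and not the procedure actually used in the study: the record layout, the helper name select_for_sequencing, the fixed seed, and the sample IDs and Ct values are all assumptions.

```python
import random

def select_for_sequencing(samples, n, ct_cutoff=30.0, seed=0):
    """Randomly pick n positive samples whose Ct value is below ct_cutoff.

    samples: list of (sample_id, ct_value) tuples (hypothetical layout).
    """
    eligible = [s for s in samples if s[1] < ct_cutoff]
    if n > len(eligible):
        raise ValueError("not enough eligible samples")
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return rng.sample(eligible, n)

# Hypothetical positives: IDs and Ct values are invented for illustration.
positives = [("S%03d" % i, 15.0 + (i % 20)) for i in range(400)]
chosen = select_for_sequencing(positives, n=130)
```

Filtering before sampling means every eligible sample has the same chance of being chosen, which is the property that random selection is meant to provide.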

2.2 Whole-genome sequencing of SARS-CoV-2 using MiniSeq

cDNA was generated by random hexamer-primed reverse transcription using 20 μL of RNA extract, 660 μM dNTPs, 5× ImProm-II reaction buffer (Promega), 50 ng hexanucleotides, 1.5 mM MgCl₂, 20 U RNasin® Plus RNase Inhibitor (Promega, Madison, Wisconsin), and 1 U of ImProm-II™ Reverse Transcriptase (Promega). SARS-CoV-2 genomes were quantified using a qRT-PCR assay targeting a conserved region of the envelope gene. Sequencing-ready libraries were prepared using cDNA from the CoV sample (CoVOC43), the viral pool sample (ViralPool) with Nextera Flex for Enrichment (Illumina, San Diego, California), and IDT for Illumina Nextera DNA UD Indexes. The total DNA input recommended for tagmentation is 10–1000 ng. After tagmentation and amplification, samples were enriched with the Respiratory Virus Oligos Panel (Illumina, San Diego, California), which features ~7800 probes designed to detect respiratory viruses, recent flu strains, and SARS-CoV-2. After enrichment, the prepared libraries were quantified, pooled, and loaded onto the MiniSeq™ sequencing system, which produced 2 × 76-bp paired-end reads.

2.3 Data analysis

NextClade v2.9.1 (https://clades.nextstrain.org/) was used for mutation identification, clade assignment, and placing the sequences in the SARS-CoV-2 phylogenetic tree. Lineage analysis was carried out using Pangolin v4.2 (https://pangolin.cog-uk.io/). Genome sequences of 2828 SARS-CoV-2 isolates submitted to GISAID from 1 January 2021 to 1 February 2022 were collected and used as background data for sublineage analysis. Civet (https://github.com/artic-network/civet) was used to cluster sequences based on common mutations.

Genomic sequences were aligned using the FFT-NS-2 method of MAFFT v7.505, with the SARS-CoV-2 isolate Wuhan-Hu-1 complete genome (MN908947) as the reference. A neighbor-joining phylogenetic tree, assuming uniform substitution rates under the maximum composite likelihood model, was constructed and visualized in MEGA11. Ambiguous positions were discarded by pairwise deletion. The 1st, 2nd, and 3rd codon positions and noncoding positions were included in the analysis. The original dataset was resampled 1000 times to derive the bootstrap values; values for branches reproduced in fewer than 30% of replicates are not shown in the tree.
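Two ideas in this paragraph, pairwise deletion of ambiguous sites and bootstrap resampling of the alignment, can be illustrated with a short sketch. It does not reproduce the MEGA11 analysis: a simple p-distance stands in for the maximum composite likelihood model, no tree is built, and the toy sequences and function names are invented for illustration.

```python
import random

def p_distance(a, b):
    """Proportion of differing sites, skipping sites that are ambiguous or
    gapped in either sequence (pairwise deletion)."""
    valid = "ACGT"
    compared = diffs = 0
    for x, y in zip(a, b):
        if x in valid and y in valid:
            compared += 1
            diffs += x != y
    return diffs / compared if compared else float("nan")

def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    length = len(alignment[0])
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in cols) for seq in alignment]

# Toy alignment (invented sequences, not study data).
aln = ["ACGTACGTAC", "ACGTACGTTC", "ACNTAC-TTC"]
rng = random.Random(1)
replicates = [bootstrap_columns(aln, rng) for _ in range(1000)]
d01 = p_distance(aln[0], aln[1])
d02 = p_distance(aln[0], aln[2])
```

With pairwise deletion, a site is dropped only for the pair in which it is ambiguous, so the third toy sequence is compared over 8 sites while the first two are compared over all 10. In a full analysis, a tree would be rebuilt from each replicate, and the bootstrap value of a branch would be the percentage of replicates in which that branch reappears.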

Summary statistics are presented as means±SD for continuous variables and as percentages for categorical variables. Patients were stratified into six groups according to their age. The association of variants with age and gender was investigated via the chi-square test using GraphPad Prism 8.4.2 (www.graphpad.com). A p-value less than 0.05 was considered statistically significant.
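The chi-square test of independence used here can be sketched in a few lines. This is an illustration under stated assumptions, not the GraphPad Prism computation: the column totals below match the variant counts reported in this study (47 Beta, 60 Delta, 14 Omicron), but the split by gender is invented, and the closed-form p-value applies only to 2 degrees of freedom.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts (rows: male, female; columns: Beta, Delta, Omicron).
counts = [[20, 30, 5],
          [27, 30, 9]]
stat, dof = chi_square_independence(counts)
# For 2 degrees of freedom the chi-square survival function is exp(-x/2).
p_value = math.exp(-stat / 2) if dof == 2 else None
significant = p_value is not None and p_value < 0.05
```

For tables with other degrees of freedom, the p-value would have to come from a statistics library or a chi-square table rather than from this closed form.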

3 Results

In this study, a total of 130 samples from patients who underwent SARS-CoV-2 testing at Praava Health were collected and analyzed. The demographic characteristics, vaccination status, and comorbidities of the study participants are presented in table 1.

Table 1 Characteristics of the study population and comparison

Results presented in figure 1 demonstrate that of the patients who tested positive for SARS-CoV-2 by PCR, 23.8% were between 30 and 39 years, 23.1% were between 18 and 29 years, 18.4% were between 50 and 64 years, 18.4% were 65 years or older, and only 5.4% were between 1 and 17 years. Neither gender was more closely associated with infection status (results not shown). Our data indicate an increased association of hypertension (38.5%) and type 2 diabetes (32.5%) in SARS-CoV-2-positive patients.

Figure 1

Age distribution of the patients included in this study. The largest age group was 30 to 39 years, closely followed by 18 to 29 years. Patients aged 17 years or younger comprised the smallest age group in our study.

3.1 Lineage and phylogenetic analysis

Lineage and phylogenetic analysis uses the mutations within a viral genome to assign standardized lineages/variants independent of location and sample size. Lineage analysis was carried out using both the clade assigner of NextClade and the Pangolin SARS-CoV-2 lineage assigner. NextClade assigned the samples to 9 unique clades, whereas Pangolin assigned them to 13 lineages (figures 2 and 3).

Figure 2

Distribution of the Pangolin lineages of 130 samples. The lineage IDs are shown with corresponding colors on the pie chart. B.1.617.2 is the most prevalent lineage, followed by the B.1.351 lineage. The least common lineages identified among the study samples were B.1.525, BA.1.17.2, and BA.2.10.1. The number of sequences belonging to each lineage is shown and the respective frequencies are given within parentheses.

Figure 3

Distribution of the samples according to the NextClade clade assigner. The clade IDs are shown with corresponding colors on the pie chart. Among the samples, the predominant clade was 20H (Beta, V2), followed by 21A (Delta) and 21J (Delta), while 21D (Eta) and 21I (Delta) were the least frequent. The number of sequences belonging to each clade is shown and the respective frequencies are given within parentheses.

Among the sequenced cases, the predominant Delta variants, comprising 60 samples (46.2%), belonged to the 21A, 21J, and 21I clades and B.1.617.2 and AY.122 Pangolin lineages. The Delta variant first emerged from India in October 2020 and was reported to have increased transmissibility (Petersen et al. 2022). The next largest clade identified by NextClade was 20H (Beta variant), comprising 47 samples (36.1%), all belonging to the Pangolin lineages B.1.351 and B.1.351.3. The next largest was a group of 14 samples (10.8%) belonging to the 21K and 21L clades (Omicron variant) and BA.1, BA.1.17.2, BA.1.1, BA.2.10, BA.2, and BA.2.10.1 Pangolin lineages (figures 2 and 3) (Pradhan et al. 2022).

3.2 Gender/age association with SARS-CoV-2 variants

No significant gender bias with SARS-CoV-2 variants was identified in this study (table 1), and gender-wise stratification of the major clades (Beta: B.1.351 + B.1.351.3; Delta: B.1.617.2 + AY.122; Omicron: BA.1 + BA.1.17.2 + BA.1.1 + BA.2.10 + BA.2 + BA.2.10.1) revealed that SARS-CoV-2 infection was not related to gender (p=0.3851) (figure 4).
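The grouping of Pangolin lineages into the three major variants, as listed above, amounts to a lookup followed by a tally. The sketch below shows one way to build the variant-by-gender counts that feed a chi-square test; the function name and the example records are hypothetical, and only the lineage-to-variant mapping is taken from the text.

```python
from collections import Counter

# Grouping of Pangolin lineages into variants, as listed in the text.
VARIANT_OF = {
    "B.1.351": "Beta", "B.1.351.3": "Beta",
    "B.1.617.2": "Delta", "AY.122": "Delta",
    "BA.1": "Omicron", "BA.1.17.2": "Omicron", "BA.1.1": "Omicron",
    "BA.2.10": "Omicron", "BA.2": "Omicron", "BA.2.10.1": "Omicron",
}

def count_by_variant_and_gender(records):
    """records: iterable of (lineage, gender); lineages outside the three
    major variants are counted under "Other"."""
    counts = Counter()
    for lineage, gender in records:
        counts[(VARIANT_OF.get(lineage, "Other"), gender)] += 1
    return counts

# Invented records for illustration only.
records = [("B.1.351", "F"), ("AY.122", "M"), ("BA.2", "F"),
           ("B.1.617.2", "M"), ("B.1.525", "F")]
table = count_by_variant_and_gender(records)
```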

Figure 4

Gender-wise distribution of the patients infected with the major variants (Beta, Delta, and Omicron). A chi-square test was performed to assess the association of the two major variants (Beta and Delta) with patient sex. The Beta group comprised samples belonging to lineages B.1.351 and B.1.351.3; lineages B.1.617.2 and AY.122 constituted the Delta group. Samples from the lineages BA.1, BA.1.17.2, BA.1.1, BA.2.10, BA.2, and BA.2.10.1 made up the Omicron group.

Stratification of patients infected with the major variants Beta, Delta, and Omicron by age revealed that the highest percentage (29.8%) of patients infected with Beta variants were aged between 18 and 29 years, followed by those between 30 and 39 years, 50 and 64 years, and 40 and 49 years. Among patients infected with the Delta variant, the highest percentage (26.7%) belonged to the oldest age group (≥65 years), followed by age groups 50–64, 30–39, and 18–29 years. A preponderance of patients infected with the Omicron variant were between 30 and 39 years of age. Our data indicate that the Beta variant was prevalent in younger populations (18–39 years), whereas the Delta variant more frequently infected the older population (≥50 years), and the Omicron variant was more prevalent among the younger population (18–39 years) (figure 5).

Figure 5

Frequency of SARS-CoV-2-positive patients of different age groups. The lineages constituting a particular variant group are shown.

3.3 Chronological prevalence of different SARS-CoV-2 variants from April 2021 to January 2022 in the sampled population

Analysis of the significant lineages over time showed that B.1.617.2 was the predominant lineage for most of 2021, co-existing with the B.1.351, B.1.351.3, and AY.122 lineages. In early 2022, the viral population belonged exclusively to the BA.1, BA.1.17.2, BA.1.1, BA.2.10, BA.2, and BA.2.10.1 lineages (figure 6).

Figure 6

Comparison of the significant lineages over time. Each lineage identified in the study is denoted by a triangle having a specific color. The period over which a specific lineage persisted can be traced by following the horizontal distribution of a particular colored triangle.

Plotting the dominant lineages over time revealed a high diversity among circulating viral strains almost throughout 2021, with a gradual shift of the circulating strains from Beta to Delta variants. From the beginning of 2021 to mid-2021, the Beta variants (B.1.351 and B.1.351.3) predominated, but were later replaced by the highly transmissible Delta variants (B.1.617.2 and AY.122). According to our study, the viral strain B.1.617.2 persisted the longest in 2021, with identification dates ranging from the end of May 2021 until the very end of 2021. At the beginning of 2022, the identified viral strains exclusively belonged to the BA.1, BA.1.1, and BA.2 lineages, indicating a further shift of the viral population from Delta to Omicron variants, congruent with the global scenario (figure 6).

The NextClade phylogenetic tree revealed that most of the sequences from our dataset belonged to the 20H (Beta, V2), 21A (Delta), and 21M (Omicron) clades. The neighbor-joining tree grouped one sequence from 21D (Eta) with the 20H (Beta, V2) sequences. The Delta clades (21A, 21J, and 21I) were grouped in a separate branch. The Delta and Beta sequences were evolutionarily closer to each other, with the Beta sequences emerging earlier, while the sequences belonging to 20I (Alpha, V1) were found to be closer to the Omicron (21K and 21L) sequences (figures 7 and 8).

Figure 7

Phylogenetic tree of samples from our cohort superimposed on the global sample set. The samples from this study are seen in the context of the global SARS-CoV-2 phylogenetic tree as bold lines ending with a sphere. The various clades are distinguishable by colors.

Figure 8

Phylogenetic tree of the 130 study samples. Samples are colored by taxonomic affiliation to clades or subclades. The Delta variant is divided into subclades 21A, 21I, and 21J. The Omicron variant is divided into subclades 21K and 21L. Bootstrap values (>30) are shown.

3.4 Mutation analysis

The 130-sample cohort had an average of 34.01 coding mutations per sample (range 16–85) and a median of 31.0. A total of 528 unique coding mutations were observed, of which 102 were deletions, 6 were premature stop codons, and the remaining were substitutions. The number of coding mutations increased as the viral population shifted to highly mutated BA.1, BA.1.1, and BA.2 variants at the beginning of 2022 (figure 9), leading to greater genetic diversity in the most recently emerging variants (21L and 21K, Omicron) in the pandemic, as seen in figure 10. ORF1a harbored the greatest number of mutations, which can be attributed to its long ORF, which codes for a total of 10 proteins. Normalizing for ORF length, ORF1a appears significantly less tolerant to missense mutations than ORF7b, ORF8, the N gene, and the E gene (figure 11).
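The normalization for ORF length mentioned above can be expressed as mutations per kilobase. The following is a minimal sketch under stated assumptions: the ORF lengths are approximate values for the Wuhan-Hu-1 reference and should be checked against the annotation actually used, and the per-ORF mutation counts are invented (only the spike total of 132 is taken from this study).

```python
# Approximate ORF lengths in nucleotides for the Wuhan-Hu-1 reference
# (assumed values for illustration; check against the annotation in use).
ORF_LENGTH_NT = {"ORF1a": 13218, "S": 3822, "N": 1260, "ORF8": 366,
                 "E": 228, "ORF7b": 132}

def mutations_per_kb(mutation_counts, orf_lengths):
    """Number of unique coding mutations per 1000 nt of each ORF."""
    return {orf: 1000 * n / orf_lengths[orf]
            for orf, n in mutation_counts.items()}

# Hypothetical counts of unique coding mutations per ORF (not study data,
# apart from the spike total of 132 reported in the text).
counts = {"ORF1a": 200, "S": 132, "N": 50, "ORF8": 15, "E": 9, "ORF7b": 6}
density = mutations_per_kb(counts, ORF_LENGTH_NT)
ranked = sorted(density, key=density.get, reverse=True)
```

Dividing by length is what allows a long ORF such as ORF1a to carry the most mutations in absolute terms while still ranking lowest in mutation density.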

Figure 9

The number of coding variants per sample (y-axis) is broken down into major clades (x-axis). The recently appeared 21K and 21L samples show the greatest number of mutations.

Figure 10

The number of coding variants per sample (y-axis) from the major clades. The Omicron variant showed a considerably greater number of mutations per sample compared with the other variants.

Figure 11

The number of coding variants (y-axis) per ORF (x-axis) normalized to ORF length. Although ORF1a harbored the greatest total number of mutations, after accounting for ORF length, ORF7b, ORF8, the N gene, and the E gene showed a higher frequency of mutations than the other genes.

Eight mutations were observed in more than 50% of the samples sequenced. The most common mutation found in the cohort was ORF1b:P314L, which occurred at a frequency of 98.5% (128 samples). The globally dominant D614G mutation in the spike protein occurred at the second-highest frequency of 84.6% (110 samples). The deletion mutations ORF1a:S3675-, ORF1a:G3676-, and ORF1a:F3677- were found in just over half of the samples, with frequencies of 53.1%, 52.3%, and 50.8%, respectively. The other substitution mutations that occurred in more than 50% of samples were ORF1a:P2046L, S:T478K, and M:I82T.
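The frequencies quoted above are the percentage of the 130 genomes carrying each mutation. A minimal sketch of that tally follows; the function name and the toy cohort are assumptions, constructed only so that the counts for ORF1b:P314L (128 of 130) and S:D614G (110 of 130) match those reported in the text.

```python
from collections import Counter

def mutation_frequencies(samples):
    """Percentage of samples carrying each mutation.

    samples: list of sets of mutation labels, one set per genome.
    """
    counts = Counter(m for muts in samples for m in muts)
    n = len(samples)
    return {m: round(100 * c / n, 1) for m, c in counts.items()}

# Toy cohort reproducing two counts reported in the text; which genomes
# carry which mutation is an arbitrary assumption.
cohort = ([{"ORF1b:P314L", "S:D614G"}] * 110
          + [{"ORF1b:P314L"}] * 18
          + [set()] * 2)
freq = mutation_frequencies(cohort)
common = [m for m, f in freq.items() if f > 50]
```

Storing each genome's mutations as a set ensures a mutation is counted at most once per sample, so the result is a prevalence rather than a raw occurrence count.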

A total of 132 unique coding mutations were observed in the spike protein, with the 9 most prevalent mutations appearing in at least 35% of samples: D614G (84.6%), T478K (50.8%), P681R (47.7%), R158G (45.4%), T19R (45.4%), E156del (45.4%), F157del (45.4%), L452R (43.8%), and D215G (36.9%). Fourteen variations were mapped to the RBD of the spike protein, which is involved in host receptor binding (figure 12). The highly frequent T478K mutation has emerged spontaneously multiple times since as early as January 2021, predominantly in Mexico, the United States, and India (B.1.617.2 of Indian variants) (Di Giacomo et al. 2021). Present in the RBD, this mutation was predicted to hinder the Spike/ACE2 interaction (Saito et al. 2022). The P681R mutation, located near the furin cleavage site, was also highly conserved in the B.1.617.2 lineage. This mutation was found to facilitate cleavage of the spike protein and enhance viral fusogenicity (Kannan et al. 2021). The remaining highly prevalent mutations (T19R, E156del, F157del, R158G, and L452R) were also characteristic mutations of the B.1.617.2 lineage (Saito et al. 2022). The N501Y mutation (30%) was characteristic of B.1.1.7 and B.1.351. This mutation was found to increase the transmissibility of the virus by increasing the affinity of the spike protein for ACE2 (Liu et al. 2022).

Figure 12

Schematic diagram of the SARS-CoV-2 spike protein. All coding variants observed in more than 5% of samples are indicated (amino acid positions are indicated relative to the reference genome MN908947).

Clustering sequences based on the presence of nucleotide substitutions revealed five distinct clusters within the samples (table 2). Cluster 1 (n=12) was composed of sequences from clades 21K and 21L. The Delta lineage was stratified into three clusters: cluster 2 (n=5), cluster 3 (n=23), and cluster 4 (n=19). All three clusters contained sequences from the 21J and 21A clades. Finally, cluster 5 (n=37) consisted exclusively of sequences from clade 20H.

Table 2 Sequence clusters along with the common mutations in each cluster

4 Discussion

Among the 130 samples in which SARS-CoV-2 was detected by PCR, 60 were taken from male patients and 70 were taken from female patients. Clinical information and vaccination status were recorded. Patients were categorized by age groups. Among the different age groups of positive cases, the greatest numbers of patients were between ages 30 and 39 years (23.8%) followed by the 18- to 29-year-old age group, which made up 23.1% of the cohort. Other investigators described similar findings (Kushwaha et al. 2021).

Among the Pangolin lineages, B.1.617.2 was the most prevalent, followed by B.1.351. B.1.617.2 is also the most prominent lineage in India (Mlcochova et al. 2021). We also analyzed the clades of the sequenced samples. Among the sequenced cases, the predominant Delta strains, comprising 60 samples (46.2%), belonged to the 21A, 21J, and 21I clades and B.1.617.2, AY.4, AY.12, AY.6, AY.10, AY.4.4, AY.39, and AY.43 Pangolin lineages. The frequencies of infection by the major clades were independent of patient gender. Beta, Delta, and Omicron variants infected patients of different ages at different frequencies. The highest percentage of patients infected with the Beta variant (29.8%) were between ages 18 and 29 years; patients between the ages of 30–39, 50–64, and 40–49 years were infected with the Beta variant at decreasing frequencies. Omicron was more prevalent in younger patients, whereas the Delta variant infected older populations. Vaccinated and unvaccinated patients infected with the Delta variant reportedly recover more slowly than those infected with the Alpha variant, as indicated by longer hospital stays and prolonged viral shedding (Kumar et al. 2022). The Omicron variant is 6–8 times more infectious than the Delta variant (Wang et al. 2022). Our study suggests that younger patients were less susceptible to other variants of SARS-CoV-2 compared with the Omicron variant.

During the study period, the B.1.617.2 variant was the predominant lineage for most of 2021 with co-existing B.1.351, B.1.351.3, and AY.4 lineages. In early 2022, the viral population belonged exclusively to the BA.2, BA.1.1, and BA.1 lineages. Analysis of phylogenetic trees showed that most of the samples belonged to the Delta and Beta variants, but that the Omicron variants contained the greatest number of mutations.

Phylogenetic trees also revealed that the Beta (20H) and Delta (21A, 21J, and 21I) variants were closely related. The Alpha (20I) and Omicron (21K and 21L) variants were less divergent in our study, and their emergence from 20B, as suggested by the global database, could not be reproduced in this study. From mutation analysis, we observed 528 unique coding mutations, of which 102 were deletions, 6 were premature stop codons, and the remaining were substitutions. The number of coding mutations increased as the viral population shifted to highly mutated BA.1, BA.1.1, and BA.2 variants at the beginning of 2022, leading to greater genetic diversity in the variants that have emerged most recently in the pandemic (21L and 21K, Omicron). Eight mutations were observed in more than 50% of the samples sequenced. The most common mutation in this cohort was ORF1b:P314L, which occurred at a frequency of 98.5% (128 samples), while the D614G mutation in the spike protein (S_D614G) was found in 97% of the sequences in another study in Bangladesh (Rokshana et al. 2021). The deletion mutations ORF1a:S3675-, ORF1a:G3676-, and ORF1a:F3677- were found in just over half of the samples at frequencies of 53.1%, 52.3%, and 50.8%, respectively. The other substitution mutations that occurred at frequencies over 50% were ORF1a:P2046L, S:T478K, and M:I82T. Our results agree with similar findings reported in the literature (Di Giacomo et al. 2021). A total of 132 unique coding mutations were observed in the gene encoding the spike protein, and the 9 most prevalent mutations that appeared in at least 35% of samples were D614G (84.6%), T478K (50.8%), P681R (47.7%), R158G (45.4%), T19R (45.4%), E156del (45.4%), F157del (45.4%), L452R (43.8%), and D215G (36.9%). The P681R mutation in the spike protein is highly conserved, facilitates cleavage of the spike protein, and enhances viral fusogenicity (Saito et al. 2022).

Fourteen variations were mapped to the RBD of the spike protein, which is involved in binding the host receptor. Similar results have been reported by others (Saito et al. 2022). Our study also demonstrates that mutations of the N gene occurred more frequently in the Omicron variant than in other variants. The current study investigated the genomic diversity of SARS-CoV-2 strains isolated from Bangladeshi patients and helps demonstrate the temporal profile of the mutational accumulations in the genome and spike protein over the study period. It suggests, for the first time, that patients of different ages may be differentially susceptible to the variants of SARS-CoV-2. This may have important implications for how aggressively we monitor infections, distribute vaccines, and treat patients based on their age.