Background

The hard clam (Mercenaria mercenaria), commonly referred to as the northern quahog, is a bivalve mollusc native to the North American Atlantic coast, with a distribution range extending from Maritime Canada to Florida [1, 2]. Over the years, hard clams have emerged as one of the most economically significant marine resources in the United States. They are a cornerstone of a productive shellfish industry that spans the entire eastern seaboard, with over 3,600 metric tons harvested yearly, valued at around 50 million US dollars [3], and representing the most economically important species in several states. The shift from traditional harvesting of wild stocks to aquaculture represents one of the most transformative trends in the shellfish industry over the last few decades [4]. The growth of hard clam aquaculture has been particularly notable, with annual increases in production due to enhanced techniques and increased hatchery output from Massachusetts to Florida. For example, hard clam production in Florida rose from 87 million clams in 2016 to 115 million in 2021, marking a 31% increase [5].

In addition to their economic value, hard clams play an integral role in the ecosystem, particularly as benthic filter feeders [6]. This species is highly adaptable, thriving in diverse coastal environments with varying temperature and salinity levels, demonstrating significant physiological resilience [7]. Such adaptability not only allows the hard clam to manage stress from environmental changes but also supports its role in nutrient cycling within its habitats [8, 9]. Hard clams are pivotal in benthic-pelagic coupling, a process by which energy and nutrients are transferred from the water column to the benthic (ocean-floor) environment [10]. This ecological function involves filtering vast volumes of water to extract phytoplankton, thereby converting particulate matter into biomass that supports a range of higher trophic levels [11]. Their activities contribute significantly to the improvement of water quality, supporting the health of marine habitats [12]. By filtering algae and suspended particles from the water column, the hard clam helps improve water clarity, which enhances conditions for seagrass growth and helps prevent algal blooms [10, 13]. The ecological benefits extend beyond nutrient cycling, with hard clams also playing a crucial role in bioirrigation—enhancing the oxygenation and thus the overall health of the coastal sediments they inhabit [14].

Despite the ecological and economic benefits associated with hard clams, the industry faces significant challenges due to environmental and biological stressors. The rise in ocean temperatures and changes in salinity levels can lead to adverse effects on the growth, survival, and metamorphosis of bivalve species [15,16,17,18]. Additionally, ocean acidification driven by increased carbon dioxide levels in the ocean poses a particular risk to marine calcifiers like hard clams, threatening their ability to sustain biomineralization processes essential for shell formation and overall survival [19,20,21,22]. The combination of hypoxia and acidification can have additive and synergistic negative effects on the growth and survival of early life stages of bivalves, further exacerbating the challenges faced by these organisms [23]. Furthermore, diseases such as QPX disease (an infection caused by Mucochytrium quahogii, formerly Quahog Parasite Unknown) have caused considerable mortality in cultivated hard clam populations, leading to substantial economic losses [24, 25]. Previous research has indicated a genetic basis for resistance to QPX, varying by geographic origin of the clam populations, which suggests that selective breeding for disease resistance could be beneficial [26, 27].

In response to these challenges, the field of genomics offers promising strategies for enhancing clam aquaculture. Genomic techniques can revolutionize selective breeding by enhancing our understanding of molluscan genetics and providing tools for genetic improvement [28]. Traditional selective breeding has been used effectively in the past to improve specific traits in bivalves [29]. For instance, in oysters (Crassostrea virginica and Crassostrea gigas), efforts have concentrated on improving disease resistance [30,31,32], growth rate [33, 34], and salinity tolerance [35]. Similarly, in hard clams, traditional selection has targeted enhancements in survival rates and growth efficiency [36], leveraging hybrid vigor for enhanced traits [37]. The advent of genomic selection (GS) represents a transformative advance in breeding technologies. The use of GS has shown promise in expediting selection for growth performance and disease resistance in various species due to its improved accuracy compared to traditional selection [38,39,40,41]. GS commonly employs SNP arrays, a preferred tool for routine genomic evaluations in major farmed species [42]. This methodology facilitates the accurate estimation of genomic estimated breeding values (GEBVs), predicting an individual’s potential to contribute desirable traits to future generations [43]. SNP arrays are particularly valued for their cost-effectiveness, scalability, and ability to be customized to target specific genetic variations, making them a favored choice over other high-throughput genotyping platforms [44]. SNP arrays have been developed for several aquaculture species including blue mussels [45], eastern oysters [46], Pacific oysters, and European flat oysters [47, 48]. SNP arrays may enable GS and enhance the genetic improvement of hard clams by providing a more precise and efficient means of selecting for traits like disease resistance and environmental resilience. 
Previous genetic analyses in the hard clam have relied on limited numbers of SNPs or microsatellite markers [27, 49, 50]. The sequencing of the hard clam genome [51, 52] paved the way for the development of advanced genomic tools such as SNP arrays that can empower aquaculture and facilitate population genetic studies.

Considering these advancements, our research aims to develop and validate a SNP genotyping platform for the hard clam. This tool will not only facilitate the effective selection of genetically superior breeding stocks but will also allow for the monitoring of genetic diversity and inbreeding within cultured populations. By integrating genomic tools with traditional aquaculture practices, it is possible to significantly advance the productivity and sustainability of the hard clam industry, ensuring its continued economic viability and ecological contribution.

Methods

Resequencing and SNP discovery

For SNP discovery, whole-genome resequencing was conducted on two groups of hard clam (M. mercenaria) samples (Fig. 1 and Supplementary Table S1): 1) individual clams (n = 84, wild and aquacultured), with 12 clams from each of seven distinct locations across the Atlantic coast and Gulf of Mexico, and 2) pooled clam libraries, with 277 clams grouped by geographic source into seven pools of 28 to 56 clams each. These samples represent a diverse set of populations and include clams from Maine, Massachusetts, New York (2 locations), Virginia, North Carolina, South Carolina and Florida.

Fig. 1

Map of hard clam sampling locations used for the 66K SNP array design and validation. Locations are color-coded by latitude to illustrate the diverse geographic origins of the samples, ranging from Maine to Florida. The map includes an inset for a detailed view of sampling locations in New York

For the individual libraries, DNA was extracted from each clam using a Macherey–Nagel NucleoSpin kit following the manufacturer’s instructions. Extracted DNA was then used for Illumina sequencing library synthesis using the NEBNext® Ultra™ DNA Library Prep Kit, and samples were sequenced on an Illumina NovaSeq platform (S4 PE150 chemistry). The sequencing effort targeted a coverage of approximately 30× per genome, based on the Mercenaria mercenaria reference genome assembly size of 1.86 Gb [52]. For the pooled libraries, DNA was extracted from each individual clam using a standard phenol–chloroform-isoamyl alcohol (PCI) extraction protocol [53]. Equivalent quantities of DNA (~100 ng per clam) were used from each clam to create a total of seven DNA pools (1 pool per population). Pooled DNA samples were then used for Illumina library synthesis using an Illumina TruSeq Nano DNA library preparation kit, and the resulting libraries were sequenced on a NovaSeq 6000 S4 lane following the manufacturer’s protocols. The detailed breakdown of individual and pooled samples, including population codes and specific sources, is available in Supplementary Table S2. All generated reads were aligned to the hard clam genome (GCF_021730395.1) [52] using the BWA (Burrows–Wheeler Aligner) software (bwa mem -t 28 -T 20 -M, v.0.7.17). The aligned bam files were sorted and indexed using Picard-tools (version 2.23.2). Variant calling was then performed using the Genome Analysis Toolkit (GATK, v4.2.2.0–1-g24a8e02-SNAPSHOT) with default parameters. SNPs were further filtered with parameters “QD < 2, QUAL < 30, FS > 60, SOR > 3, MQ < 40, MQRankSum < -12.5, ReadPosRankSum < -8”.
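The hard-filter thresholds listed above can be illustrated with a short Python sketch. It assumes the quoted expressions are exclusion conditions (a SNP is dropped when any one of them is true), which is the usual convention for this style of hard filtering, and it assumes that an annotation missing from a record is simply skipped. The function name `fails_hard_filter` and the example records are hypothetical and are not part of the pipeline described here.

```python
# Thresholds quoted in the text. A record is dropped when ANY condition is met
# (assumed exclusion semantics; the example values below are invented).
FAIL_IF_BELOW = {"QD": 2.0, "QUAL": 30.0, "MQ": 40.0,
                 "MQRankSum": -12.5, "ReadPosRankSum": -8.0}
FAIL_IF_ABOVE = {"FS": 60.0, "SOR": 3.0}


def fails_hard_filter(annotations):
    """Return True if any available annotation violates its threshold.

    Annotations that are absent or None are skipped (an assumption).
    """
    for key, cutoff in FAIL_IF_BELOW.items():
        value = annotations.get(key)
        if value is not None and value < cutoff:
            return True
    for key, cutoff in FAIL_IF_ABOVE.items():
        value = annotations.get(key)
        if value is not None and value > cutoff:
            return True
    return False


# Hypothetical annotation sets for two SNP records.
good = {"QD": 18.4, "QUAL": 512.0, "FS": 1.2, "SOR": 0.7, "MQ": 60.0,
        "MQRankSum": 0.1, "ReadPosRankSum": -0.3}
strand_biased = dict(good, FS=75.0)  # same record, but with strong strand bias

print(fails_hard_filter(good), fails_hard_filter(strand_biased))
```

In this sketch the first record is retained and the second is removed because its FS value exceeds 60.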

SNP selection and array design

Initially, the focus was on identifying SNPs within gene regions due to their potentially informative value in genetic studies. Each of the 34,728 annotated protein-coding genes in the chromosome-level assembly of M. mercenaria genome (GCF_021730395.1) was checked for SNP presence. Criteria for selecting biallelic genic SNPs included a minor allele frequency (MAF) greater than 0.05, exclusion of SNPs within 30 base pairs of each other, and avoidance of A/T and C/G transversions. This process was facilitated using PLINK v.1.90 [54], SAMtools v.1.11 [55], BEDTools v.2.31.1 [56], BCFtools v.1.11 [57] and VCFtools v.0.1.16 [58]. The criteria for inclusion of SNPs within genes were stringent and aimed to maximize the power of the selected SNPs. A maximum of three SNPs were selected for each gene following a hierarchical selection process that prioritized coding sequences (CDS), untranslated regions (UTRs), and introns in that order. If more than three SNPs were found within the CDS of a gene, the three SNPs with the highest MAF were chosen. If exactly three SNPs were present in the CDS, all were retained without further filtering. When fewer than three SNPs were identified within the CDS, the selection was expanded to include SNPs in UTRs and introns. To achieve an even distribution across the non-coding regions of the genome, the genome was divided into 1,000 nucleotide windows using BEDTools v.2.31.1. To ensure high confidence and appropriate distribution, windows overlapping with gene coordinates, mitochondrial sequences, or repetitive DNA were eliminated. Within the remaining windows, SNPs were excluded if they had a MAF lower than 0.1, were located within 30 nucleotides of another SNP, were non-biallelic, or were A/T or C/G SNPs. The same filters were applied to collect mitochondrial SNPs. Probes for detecting the pathogen M. quahogii were selected based on the dissimilarity between the QPX genome [59] and the M. mercenaria [52] genome, following ThermoFisher Scientific’s recommendations.
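The hierarchical genic selection described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the scripts used for the array design: the `Snp` record, the function names, and the example gene are hypothetical; the proximity filter is assumed to remove both members of any pair lying 30 bp or less apart and to run before the MAF filter; and SNPs in UTRs and introns are assumed to be ranked by MAF in the same way as CDS SNPs, which the text specifies only for the CDS.

```python
from collections import namedtuple

# Hypothetical record for one biallelic SNP; region is "CDS", "UTR" or "intron".
Snp = namedtuple("Snp", "pos ref alt maf region")

REGION_PRIORITY = ("CDS", "UTR", "intron")
AMBIGUOUS = {frozenset("AT"), frozenset("CG")}  # A/T and C/G SNPs are excluded


def passes_basic_filters(snp, min_maf=0.05):
    """MAF above the threshold and not an A/T or C/G SNP."""
    return snp.maf > min_maf and frozenset((snp.ref, snp.alt)) not in AMBIGUOUS


def drop_close_neighbours(snps, min_gap=30):
    """Remove every SNP lying within min_gap bp of another SNP (assumed rule)."""
    ordered = sorted(snps, key=lambda s: s.pos)
    kept = []
    for i, snp in enumerate(ordered):
        near_prev = i > 0 and snp.pos - ordered[i - 1].pos <= min_gap
        near_next = i + 1 < len(ordered) and ordered[i + 1].pos - snp.pos <= min_gap
        if not (near_prev or near_next):
            kept.append(snp)
    return kept


def select_gene_snps(snps, max_per_gene=3):
    """Pick up to max_per_gene SNPs, filling from CDS, then UTRs, then introns."""
    candidates = [s for s in drop_close_neighbours(snps) if passes_basic_filters(s)]
    chosen = []
    for region in REGION_PRIORITY:
        in_region = sorted((s for s in candidates if s.region == region),
                           key=lambda s: s.maf, reverse=True)  # highest MAF first
        chosen.extend(in_region[: max_per_gene - len(chosen)])
        if len(chosen) == max_per_gene:
            break
    return chosen


# Invented example gene with eight candidate SNPs.
gene = [
    Snp(100, "A", "G", 0.30, "CDS"),     # 20 bp from the next SNP: both dropped
    Snp(120, "C", "T", 0.45, "CDS"),
    Snp(400, "A", "T", 0.40, "CDS"),     # A/T SNP: excluded
    Snp(900, "G", "A", 0.12, "CDS"),
    Snp(1500, "C", "T", 0.25, "UTR"),
    Snp(2500, "T", "C", 0.08, "intron"),
    Snp(3100, "G", "T", 0.33, "intron"),
    Snp(4000, "A", "C", 0.03, "intron"),  # MAF too low: excluded
]
selected = select_gene_snps(gene)
print([s.pos for s in selected])
```

For this invented gene, only one CDS SNP survives filtering, so the selection falls through to the UTR SNP and then to the intronic SNP with the higher MAF.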

Samples for array evaluation

Samples of wild clams and aquacultured stocks were used to evaluate the array (Fig. 1; Tables 1 and 3). Aquacultured stocks originated from Massachusetts (A1MA), New York (AUSDAD, AUSDAE, and AUSDAF), and Florida (A1FL, A2FL, A3FL). All analyzed clams were adults, except for subsets of juveniles in the aquacultured stocks: 204 out of 307 in AUSDAD, 203 out of 297 in AUSDAE, and 200 out of 298 in AUSDAF. These subsets consisted of juvenile clams (0.1 to 1.3 cm² in shell surface area) preserved whole (with shell) in ethanol. All analyzed clams were collected between 2019 and 2022, with the exception of one group that included clams confirmed to be positive for QPX disease: W03NY (Table 1) consisted of mantle tissues (60 individuals) or DNA (9 samples extracted from mantle tissues) collected from clams harvested in 2003 and included specimens confirmed to be positive for QPX disease using standard histopathology techniques [60]. The same population sampled in 2003 was targeted again in 2022 (W22NY) to evaluate whether the genetic composition of the stock had changed over the last two decades.

Table 1 Hard clams used for the validation of the hard clam 66K SNP array

Before genotyping, the soft tissue of ethanol-fixed juvenile clams was either used as is, or it was dissected to remove the digestive gland before DNA was extracted. This was done to evaluate the effect of removing digestive tissue (which is typically rich in inhibitors) on genotyping outcomes.

Wild samples consisted entirely of adult clams originated from New York, New Jersey, and Florida. The oldest samples used for genotyping were mantle tissues preserved in ethanol and held at -80 °C since 2003 (~ 20 years). Further information about sample types and preservation methods is given in Table 3.

Total genomic DNA was extracted from hard clam samples at the Center for Aquaculture Technologies using a magnetic bead-based protocol. Briefly, samples (~ 10–15 mg) were subsampled and processed using Mag-Bind Blood and Tissue DNA Kits (Omega BioTek, Norcross, GA) according to the manufacturer’s guidelines. Automated processing and liquid handling steps associated with the extraction protocol were performed using PurePrep 96 units (Molgen, San Diego, CA) following the kit and instrument-specific guidelines. The resulting gDNAs were assessed for yield and quality by spectrophotometry (Nanodrop) and 2% agarose gels, targeting a minimum of 35 µl at > 20 ng/µl of largely intact DNA (minimum 5 Kb). Genotyping was performed at Neogen (Lincoln, Nebraska) on custom Axiom 384HT arrays [Axiom HD Array (60 K)_Clam] using processes outlined in the Axiom Assay 384HT Array Format Automated Workflow User Guide. PCA was calculated using PLINK v.1.90 [54].

SNP array data analyses

The SNP array data were processed using the Axiom Analysis Suite 5.0 software (Thermo Fisher, CA), following the Best Practices Workflow, and using recommended threshold settings (DQC ≥ 0.82, QC call rate ≥ 97%, average call rate for passing samples ≥ 98.5%). The marker-conversion rate was calculated as the percentage of polymorphic, QC-compliant, and BestandRecommended SNPs on the array. The genetic indices were calculated using the vcfR 1.15.0 [61], adegenet v2.1.10 [62], and hierfstat [63] packages in R. Pairwise significance of FST values was based on 10,000 iterations of the data. The maximum likelihood tree was constructed using IQ-TREE [64] with the best-fit model of nucleotide substitution selected by the Akaike information criterion (AIC) in ModelFinder [65], and branch support values were estimated using UFBoot [66].
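The sample-level thresholds above amount to a simple pass/fail rule, sketched here in Python. The Axiom Analysis Suite applies these checks internally; the snippet only restates the rule. The dictionary keys (`dqc`, `call_rate`), the function names, and the four example samples are hypothetical.

```python
def sample_passes_qc(dqc, call_rate_pct, min_dqc=0.82, min_call_rate_pct=97.0):
    """A sample passes when both DQC and QC call rate meet their thresholds."""
    return dqc >= min_dqc and call_rate_pct >= min_call_rate_pct


def summarize_qc(samples):
    """Return (pass rate in %, mean call rate in % of the passing samples)."""
    passing = [s for s in samples if sample_passes_qc(s["dqc"], s["call_rate"])]
    pass_rate = 100.0 * len(passing) / len(samples)
    mean_call_rate = sum(s["call_rate"] for s in passing) / len(passing)
    return pass_rate, mean_call_rate


# Invented samples for illustration only.
samples = [
    {"id": "s1", "dqc": 0.95, "call_rate": 99.1},
    {"id": "s2", "dqc": 0.80, "call_rate": 98.0},  # fails DQC
    {"id": "s3", "dqc": 0.90, "call_rate": 95.5},  # fails call rate
    {"id": "s4", "dqc": 0.88, "call_rate": 98.7},
]
pass_rate, mean_call_rate = summarize_qc(samples)
print(pass_rate, round(mean_call_rate, 1))
```

The mean call rate of the passing samples can then be compared against the 98.5% target quoted above.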

Results

SNP discovery and array design

Through whole-genome resequencing of 84 individual clams and 277 clams from pooled libraries, we initially identified 305,753,445 SNPs across the M. mercenaria genome. Filtering processes refined this large dataset to identify the most informative SNPs for subsequent analyses. After the initial filtering of genic SNPs, 91,898 SNPs across 32,018 genes were retained. These SNPs included 79,454 within CDSs, 966 in UTRs, and 11,478 in introns. Exploration of intergenic regions yielded an additional 278,452 SNPs. The analysis also yielded 72 mitochondrial SNPs (mtSNPs), 150 SNPs associated with hard clam resistance to QPX disease [27], and 101 markers from the genome of Mucochytrium quahogii (causative agent of QPX disease). The comprehensive SNP collection of 374,463 SNPs was submitted to ThermoFisher Scientific for probe design, including a selection of 3,790 non-polymorphic sequences for design quality control (dQC). All designed probes underwent a thorough evaluation for genomic duplication, interactions with other probes, distance from known polymorphisms, and likelihood of successful probe conversion. This evaluation identified 312,064 SNPs that had a conversion probability greater than 0.6 and were free of genomic duplication or potential interference from other polymorphisms. From this refined pool, SNPs were specifically selected to enhance the array's utility: genic SNPs were prioritized for their functional insights, and intergenic SNPs were included and strategically distributed to ensure comprehensive genome coverage. Importantly, most intergenic SNPs on the array were chosen for having MAF greater than 0.2, a criterion aimed at boosting the analytical power of genetic studies by ensuring sufficient allelic variation. As a result, the final design of the screening array was determined to include 66,543 probes from the hard clam and 101 from QPX, each uniquely targeting a different genetic marker (Table 2).

Table 2 Marker composition of the hard clam Applied Biosystems Axiom Clam_Mm1 array

The array contains 17,492 genic SNPs, 48,981 intergenic SNPs, and 70 mtSNPs of M. mercenaria. Within the category of nuclear genic SNPs, there are 17,385 SNPs located within CDSs, 10 in UTRs, and 97 in introns, representing approximately 50.1% (17,411 genes) of all identified protein-coding genes in the M. mercenaria genome. The nuclear SNPs are evenly distributed across chromosomes, with an average interval of 25,641 bp between SNPs, ranging from 22,481 to 34,102 bp across different chromosomes (Fig. 2, Supplementary Table S3). The distance between the 66 mtDNA SNPs ranged from 35 to 958 bp, with an average of 244 bp.

Fig. 2

Chromosomal distribution of SNPs on the hard clam 66K SNP array

Due to the comprehensive coverage of the array and its specific focus on the hard clam, it was named the hard clam 66K SNP array. The official name for this tool, reflecting its broad scope and targeted application, is the Applied Biosystems Axiom Clam_Mm1 Array (384-plate format).

Performance of the SNP array

A genotyping study was conducted on 1,904 wild and aquacultured clams to assess the performance of the 66K SNP array. To assess the repeatability of the genotyping process, 18 individuals were genotyped twice. Of the total number of samples, 1,384 (72.7%) passed the stringent genotyping quality control (QC) standards, which included a DQC threshold greater than 0.82 and a QC call rate exceeding 97%. The average QC call rate for passing samples was 98.8%.

The evaluation revealed notable variations in sample pass rates across different tissue types and preservation methods (Table 3). For example, mantle tissue preserved in ethanol resulted in relatively high pass rates, with populations such as A1MA achieving a 100% success rate. Adductor muscle tissue preserved in ethanol showed variable success rates, ranging from 82.7% in A2FL to 96% in A3FL, the latter being among the highest observed. Conversely, tissues such as juvenile clams preserved in ethanol exhibited markedly lower pass rates. It was also observed that DNA samples stored in water had lower pass rates, typically ranging from 62.5% to 66.7%. This can be attributed to acidic degradation of DNA stored in water over time [67, 68], and the storage durations in this study ranged from three to nineteen years. In contrast, mantle tissue preserved in ethanol at -80℃ for 19 years, which has been shown to be much better for preserving sample integrity, yielded results (85% pass rate) comparable to some of the more recent ethanol-fixed samples (e.g., W1NJ, W2NJ). Tissue (mantle) homogenates preserved in phosphate-buffered saline (PBS) before DNA extraction did not yield any results (6 out of 6 individuals failed; 0% pass rate; Table 3).

Table 3 Genotyping quality control outcomes by population and tissue type on the hard clam 66K SNP array

Further analysis of juvenile clams genotyped as whole ethanol-fixed animals, with and without digestive tissues, exhibited a clear trend related to the size of the specimens (Table 4). Intact juvenile clams exhibited a negative correlation between size and genotyping success rate, with larger animals exhibiting notably lower success rates. For instance, animals between 0.1 and 0.2 cm² (shell surface area) had a pass rate of 54.2%, which decreased to as low as 9.8% for those in the 0.6–1.3 cm² range. However, this trend was not observed in samples where digestive tissue was removed prior to DNA isolation. In these cases, pass rates generally improved and were less variable with size; when the digestive tract was removed, animals sized 0.6–1.3 cm² showed a 70.5% success rate.

Table 4 Genotyping success rates by size for juvenile hard clams

After processing with the Best Practices Workflow for genotype calling, the SNPs from the hard clam 66K SNP array were categorized based on their genotyping clarity and reliability. Overall, 36,153 SNPs were designated as PolyHighResolution, indicating polymorphic SNPs with well-defined genotype clusters. In contrast, 1,133 SNPs were classified as NoMinorHom, where one of the homozygous genotypes was missing. Additionally, 1,221 SNPs were classified as MonoHighResolution, indicating monomorphic SNPs with a single, clear genotype cluster. In the array, SNPs that failed to meet the QC threshold call rate of 97% were categorized as CallRateBelowThreshold, accounting for 7,187 SNPs. OffTargetVariant (OTV) SNPs, which may indicate the presence of an additional cluster, numbered 4,598. Finally, SNPs that presented with multiple issues were grouped into the 'Other' category, which comprised 16,251 SNPs. The total count of BestandRecommended markers, which includes SNPs from PolyHighResolution, NoMinorHom, and MonoHighResolution clusters, was 38,507. This represents 57.87% of the total SNPs on the array (Table 5).

Table 5 SNP quality classification and frequency in the hard clam 66K SNP array

The overall SNP conversion rate, reflecting the number of polymorphic and recommended SNPs, was 56.03%, corresponding to 37,286 SNPs. The remaining markers either fell below the call rate threshold, were identified as OTVs, or had multiple issues preventing them from being classified as reliable markers.
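The percentages reported above follow directly from the category counts in the preceding paragraphs, as the short Python check below shows. The variable names are illustrative; the counts are those given in the text.

```python
# SNP counts per Axiom classification category, as reported in the text.
counts = {
    "PolyHighResolution": 36153,
    "NoMinorHom": 1133,
    "MonoHighResolution": 1221,
    "CallRateBelowThreshold": 7187,
    "OffTargetVariant": 4598,
    "Other": 16251,
}

total = sum(counts.values())  # all hard clam probes on the array

# BestandRecommended: the three categories with reliable genotype clusters.
recommended = (counts["PolyHighResolution"] + counts["NoMinorHom"]
               + counts["MonoHighResolution"])

# Converted markers: recommended AND polymorphic (monomorphic SNPs excluded).
converted = counts["PolyHighResolution"] + counts["NoMinorHom"]

recommended_pct = round(100 * recommended / total, 2)
conversion_pct = round(100 * converted / total, 2)
print(total, recommended, recommended_pct, converted, conversion_pct)
```

The difference between the two rates is accounted for entirely by the 1,221 MonoHighResolution SNPs, which are reliable but uninformative in the genotyped samples.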

Among the BestandRecommended markers identified, 12,262 SNPs were classified as genic, involving 12,223 distinct protein-coding genes. This represents a significant portion (35.2%) of the 34,728 protein-coding genes identified within the M. mercenaria genome. In addition, 26,179 intergenic SNPs were identified. This underscores the extensive coverage of the genomic landscape provided by the array. The inclusion of 66 mtSNPs highlights the comprehensive approach to capturing the complete genetic diversity of the hard clam, from nuclear to organellar DNA.

Among the 1,384 clam samples that passed the genotyping QC, a subset of BestandRecommended SNPs showed low MAF, with 2,138 SNPs (5.73%) having a MAF less than 0.05, and 3,123 SNPs (8.38%) with a MAF less than 0.1 (Fig. 3). SNPs with a lower MAF are less prevalent within the population but may still be of significance for certain traits and genetic diversity studies. Overall, 30,621 (82.12%) of the SNPs had MAF ≥ 0.2, making them highly informative.
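For readers less familiar with the statistic, the following Python sketch shows how a MAF can be computed for a single biallelic SNP and how SNPs can be tallied against the cut-offs used above. It assumes genotypes are coded as counts of the alternate allele (0, 1 or 2) with `None` for missing calls; that coding and the function names are assumptions made for illustration, not the format produced by the genotyping software.

```python
def minor_allele_frequency(genotypes):
    """MAF for one biallelic SNP from 0/1/2 alternate-allele counts.

    Missing calls (None) are ignored; returns None if nothing was called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))  # two alleles per diploid clam
    return min(alt_freq, 1 - alt_freq)


def tally_maf(mafs):
    """Count SNPs below 0.05, below 0.1, and at or above 0.2."""
    return {
        "maf_lt_0.05": sum(m < 0.05 for m in mafs),
        "maf_lt_0.1": sum(m < 0.1 for m in mafs),
        "maf_ge_0.2": sum(m >= 0.2 for m in mafs),
    }


# Invented genotypes for two SNPs.
snp_a = [0, 1, 2, 1, None, 0]  # alt allele frequency 4/10
snp_b = [2, 2, 2, 1]           # alt allele frequency 7/8, so the ref allele is minor
maf_a = minor_allele_frequency(snp_a)
maf_b = minor_allele_frequency(snp_b)
print(maf_a, maf_b, tally_maf([maf_a, maf_b, 0.03, 0.07]))
```

Note that the "below 0.1" count includes the SNPs that are also below 0.05, matching the cumulative way the two figures are reported in the text.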

Fig. 3

Distribution of minor allele frequency of SNP markers on the hard clam 66K SNP array based on 1,384 samples passing genotyping QC

Genotyping was repeated for 18 individual hard clams to assess the reproducibility of the results. The average concordance rate across these repeated measures was high at 99.64%. This indicates that the SNP calls were consistent, with only 0.36% of the SNPs showing discordant genotypes on retesting (Table 6). Further analysis of the discordant SNPs identified by repeated genotyping shows a high degree of reproducibility. Of the total 2,134 discordant SNPs, 1,887 occurred in only one sample, indicating that most observed discrepancies were isolated events (Supplementary Table S4). Lower frequencies of discordance involving multiple individuals were observed, with 208 discordant SNP genotypes appearing in two samples, 27 in three, and progressively fewer in four to twelve samples, indicating a minimal systematic error.
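The concordance rate for one clam genotyped twice can be expressed as the share of SNPs with identical calls among those called in both runs. The Python sketch below illustrates that calculation; excluding SNPs with a missing call in either run is an assumption about how the comparison was made, and the function name and example calls are hypothetical.

```python
def concordance(first_run, second_run):
    """Compare two genotype lists for the same individual.

    Returns (SNPs compared, discordant SNPs, concordance in %). SNPs with a
    missing call (None) in either run are not compared (an assumption).
    """
    compared = 0
    matches = 0
    for a, b in zip(first_run, second_run):
        if a is None or b is None:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no SNP was called in both runs")
    return compared, compared - matches, 100.0 * matches / compared


# Invented calls for six SNPs in one clam genotyped twice.
run1 = ["AA", "AB", "BB", None, "AB", "AA"]
run2 = ["AA", "AB", "AB", "AA", "AB", "AA"]
print(concordance(run1, run2))
```

Averaging the per-individual percentages over the 18 duplicated clams would give a summary figure of the kind reported above.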

Table 6 Number of SNPs called, number of discordances, and concordance rate in 18 hard clams genotyped twice on the hard clam 66K SNP array

Hemocyte samples showed particularly high concordance rates, ranging from 99.64% to 99.85%. Samples from juvenile clams with digestive tissues removed had slightly lower concordance rates of 99.16% to 99.34%. This observed variation in concordance could be attributed to differences in DNA quality, which is a common and understandable occurrence in samples representing more complex biological matrices.

Exploring genetic structures in hard clams using a 66K SNP array

To confirm the utility of the SNP array in elucidating clam population genetic structure, a Principal Component Analysis (PCA) was conducted. The analysis encompassed wild and aquacultured clams sampled from different geographic locations (Fig. 4A). The resulting PCA highlighted a distinct clustering pattern. For instance, aquacultured clam samples from the Northeast (AUSDAD, AUSDAE, and AUSDAF) clustered separately from Florida samples (A1FL, A2FL, and A3FL). Wild clams formed a single cluster on the PCA plot and showed little dispersion, indicating a genetic coherence among these groups. However, a closer look at wild clams showed that the clams from Florida (WFL) formed a separate cluster from the northeastern clams (W03NY, W22NY, W1NJ, W2NJ) (Fig. 4B). The PCA substantiates the capability of the hard clam 66K SNP array to resolve complex genetic relationships within and across wild and aquacultured populations.

Fig. 4

Principal component analysis of hard clams with genotype data from markers on the 66K SNP array. (A) Wild and aquacultured hard clams. (B) Detailed view of the genetic clustering among wild populations from Northeastern and Florida regions. Abbreviations: A1MA (population 1 in MA), A1FL (population 1 in FL), A2FL (population 2 in FL), A3FL (population 3 in FL), AUSDAD (USDA strain D), AUSDAE (USDA strain E), AUSDAF (USDA strain F), W03NY (clams from NY, 2003), W22NY (clams from NY, 2022), W1NJ (population 1 in NJ), W2NJ (population 2 in NJ), WFL (wild clams from FL). For full names and additional details, see Table 1

We analyzed the genetic structure and diversity of the hard clam populations using various genetic metrics, including observed heterozygosity (Ho), expected heterozygosity (He), inbreeding coefficients (FIS), and fixation indices (FST). The overall genetic structure revealed an FST value of 0.024 for the total dataset, indicating a low to moderate level of genetic differentiation among the populations (Supplementary Table S5). The overall inbreeding coefficient was 0.029, reflecting minimal inbreeding within the total sample set. Observed heterozygosity (0.396) was slightly lower than expected heterozygosity (0.408), indicating a small heterozygote deficit consistent with the low positive FIS.
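The relationship between the heterozygosity values and the inbreeding coefficient reported above can be illustrated with a minimal Python sketch. It uses the textbook relationship FIS = 1 − Ho/He and the simple per-locus estimator He = 2p(1 − p); it omits the small-sample corrections that dedicated packages may apply, so it is not a reimplementation of the R analysis. Genotypes are assumed to be coded as alternate-allele counts (0, 1, 2), and the function names and example locus are hypothetical.

```python
def locus_heterozygosity(genotypes):
    """Observed and expected heterozygosity for one biallelic locus.

    genotypes: 0/1/2 alternate-allele counts, None for missing calls.
    He uses the uncorrected estimator 2p(1 - p).
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    ho = sum(1 for g in called if g == 1) / n  # share of heterozygous clams
    p = sum(called) / (2 * n)                  # alternate allele frequency
    he = 2 * p * (1 - p)
    return ho, he


def fis(ho, he):
    """Inbreeding coefficient: relative deficit of heterozygotes."""
    return 1 - ho / he


# Invented genotypes at one locus.
ho, he = locus_heterozygosity([0, 1, 1, 2, 1, 0, None, 2])
print(round(ho, 3), he, round(fis(ho, he), 3))

# Overall values reported in the text: Ho = 0.396, He = 0.408.
overall_fis = round(fis(0.396, 0.408), 3)
print(overall_fis)
```

Applied to the overall heterozygosities, this relationship reproduces the reported inbreeding coefficient of 0.029.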

The FIS values for aquacultured populations were found to be low (from 0.000 to 0.050), indicating the implementation of effective genetic management practices that minimize the occurrence of inbreeding (Supplementary Table S6). The Ho values for these populations ranged from 0.380 to 0.403, and He ranged from 0.393 to 0.402, indicating a healthy level of genetic diversity within these populations. The wild populations (W03NY, W1NJ, W22NY, and W2NJ) exhibited slightly elevated but still low FIS values of 0.042, 0.041, 0.039, and 0.042, respectively. The observed heterozygosity for these populations was approximately 0.400, and the expected heterozygosity was approximately 0.418 (Supplementary Table S6). Pairwise FST values (Supplementary Table S7) indicated significant genetic differentiation between certain locations. The FST values for the wild populations from New York and New Jersey (W03NY, W1NJ, W22NY, W2NJ) were found to be close to zero (p-value > 0.05), indicating that there is no significant genetic differentiation among them. However, these populations exhibited significant divergence from the wild Florida population (WFL), underscoring the influence of geography and the environment on genetic differentiation. Similarly, all aquacultured populations (A1MA, AFL, AUSDA) displayed pronounced genetic distinctness.

Mitochondrial diversity in hard clam populations

Four mtSNPs were excluded from the original set of 70 due to genotyping quality control deficiencies, and one was monomorphic across all evaluated samples. The absence of heterozygotes in any of the mtSNPs confirms the haploid nature of these markers. Moreover, this observation indicates a lack of heteroplasmy, which refers to the presence of multiple mitochondrial DNA types within the cells of an organism [69], in the studied populations.

Further analyses were conducted using the remaining mtSNPs to explore phylogenetic relationships and haplotype distributions among the hard clam populations. The phylogenetic analysis identified three distinct mitochondrial haplogroups, labeled Hap1, Hap2, and Hap3, (Fig. 5A, Supplementary Fig. S1).

Fig. 5

Phylogenetic analysis and haplotype distribution of mitochondrial SNPs in hard clam populations. (A) Maximum likelihood phylogenetic tree of 1,384 hard clams based on mitochondrial SNPs. Branches represent identified haplogroups (Hap1, Hap2, Hap3), illustrating genetic relationships derived from the hard clam 66K SNP array. (B) Distribution of mitochondrial haplotypes among various hard clam populations

Phylogenetic clustering revealed distinct patterns of haplotype distribution across both wild and aquacultured samples (Fig. 5B, Supplementary Table S8). Notably, Hap1 was exclusively found in aquacultured Florida samples. Hap2 showed a broader geographic distribution: it was present in all samples except A3FL and AUSDAF, indicating its widespread occurrence across diverse geographic and breeding backgrounds. Hap3 was present at high frequency in nearly all samples, including 100% of AUSDAF samples and 93.3% of the wild Florida (WFL) samples. In contrast, Hap3 was present at lower frequency in A2FL and A3FL samples (28.7% and 7.1%, respectively).

Analysis of QPX presence

The assessment of the efficacy of the hard clam 66K SNP array's QPX probes included six histologically confirmed QPX-positive controls and 35 QPX-negative controls. The median log2 ratio is a measure of the relative abundance of specific SNP markers detected by the Affymetrix SNP Array. A high median log2 ratio indicates a significant increase in detection signal, which is used to infer the presence of QPX pathogen-related markers. Among the positive controls, only a single sample demonstrated a significantly high median log2 ratio (Fig. 6). This indicates that while the probes can identify QPX presence, their effectiveness may vary with the pathogen load within the sample.

Fig. 6

Median log2 ratios across hard clam samples from different regions and infection statuses. Blue points indicate clams from the Northeast (NJ, NY, MA). Orange points represent clams from Florida, where QPX is not known to exist. Green points are from histologically confirmed QPX-positive samples. Pink points are from QPX-negative samples. The red dashed line marks the threshold for positive detection

In contrast, all histologically validated QPX-negative samples showed very low median log2 ratios. This consistent result across the negative controls underscores the specificity of the probes under these test conditions. Furthermore, the samples originating from Florida, where QPX is not known to be present, also had low median log2 ratios, further underlining the lack of “false-positive” signals. In fact, the array may only be able to detect intense infections, as shown for some northeastern clams (where QPX is enzootic) and for one of the histologically QPX-positive clams. It should be noted that QPX disease in clams is typically focal, and a biopsy sample collected for genotyping may not contain parasite cells even if it is derived from a clam microscopically confirmed to be infected. Despite these limitations, the current findings indicate that the hard clam 66K SNP array can detect QPX in clams, but a more elaborate study that includes samples with a broad range of known concentrations of QPX DNA is needed to provide a more comprehensive assessment of the sensitivity of the array for parasite detection.

Discussion

Array design and performance

The design of the 66K SNP array for M. mercenaria represents a significant advance in genomic tool development, tailored to meet both the ecological and aquacultural demands of the hard clam industry. This customized array includes SNPs derived from an expansive pool of genomic data encompassing a large portion of the native geographical range. Approximately 305 million raw SNPs were identified through whole-genome resequencing, showcasing the vast genetic diversity inherent to M. mercenaria. The subsequent filtering and validation process refined this number, ensuring that only the most reliable and informative SNPs were included. The SNP selection was strategically conducted to balance between genic and intergenic regions, enhancing the utility of the array for various genetic studies. The final selection included 12,262 genic SNPs, mapped to 12,223 genes, representing approximately 35.2% of all protein-coding genes in the clam's genome. This extensive coverage allows the identification of gene-specific markers linked to traits such as disease resistance, stress tolerance, and growth rates. The inclusion of 26,179 intergenic SNPs provides a broader genomic landscape, facilitating the investigation of neutral processes across different clam populations.

Mitochondrial SNPs were also carefully chosen for their known relevance in tracing maternal lineages and assessing population dynamics and historical demography. The meticulous selection of these 66 mitochondrial markers underscores the array's design philosophy—precision in genetic representation to support robust ecological and evolutionary studies.

The array's architecture also considered the physical distribution of SNPs across the clam genome. SNPs were evenly spaced to maximize genomic coverage and minimize bias in genetic linkage analyses. This spatial arrangement is crucial for genome-wide association studies (GWAS), which rely on comprehensive genomic representation to accurately identify associations between genetic variants and phenotypic traits. It also supports the advancement of genomic selection in hard clam aquaculture.
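To illustrate what even spacing can mean in practice, the sketch below thins a list of candidate SNP positions on one chromosome so that at most one SNP is kept per fixed-size window, choosing the candidate closest to the window centre. This is a generic thinning heuristic shown for illustration only; it is not the selection procedure used for the array, and the positions and window size are hypothetical.

```python
def thin_snps(positions, window):
    """Keep at most one SNP per window: the candidate closest to the window centre."""
    best = {}
    for pos in sorted(positions):
        idx = pos // window                      # window index for this SNP
        centre = idx * window + window / 2       # midpoint of that window
        if idx not in best or abs(pos - centre) < abs(best[idx] - centre):
            best[idx] = pos
    return [best[i] for i in sorted(best)]


# Hypothetical candidate SNP positions (bp) on a single chromosome.
positions = [120, 480, 950, 1010, 1490, 1530, 2900, 3100, 3600]
selected = thin_snps(positions, window=1000)
print(selected)
```

A real design would additionally weigh probe quality, minor allele frequency, and genic versus intergenic status when choosing among candidates in a window.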

The performance of the hard clam 66K SNP array, with its concordance rate of 99.64%, places it at the upper end of the range reported for first-generation SNP arrays in other bivalves, typically between 96.6% and 99.8% [46, 47]. These figures demonstrate the array's robust design, comparable to or exceeding that of many initial bivalve arrays designed for similar purposes. The marker conversion rate (56.03%) is somewhat low, with a large proportion of loci classified as “Other”. This result highlights the challenge of array development for highly polymorphic species; the conversion rate may be improved by a two-step process [46].
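The two metrics discussed above can be sketched as simple proportions. In the Python example below, concordance is taken as the fraction of identical genotype calls between two replicates of the same individual (ignoring markers with a missing call in either replicate), and conversion rate as the fraction of markers assigned to a "converted" class. These definitions, the genotype encoding, and the class labels are assumptions of the sketch and may differ from the exact definitions used in the study.

```python
def concordance_rate(calls_a, calls_b):
    """Fraction of identical calls among markers called in both replicates."""
    pairs = [(a, b) for a, b in zip(calls_a, calls_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no markers called in both replicates")
    return sum(a == b for a, b in pairs) / len(pairs)


def conversion_rate(class_by_marker, converted_classes):
    """Fraction of markers whose quality class counts as converted."""
    n_converted = sum(c in converted_classes for c in class_by_marker.values())
    return n_converted / len(class_by_marker)


# Hypothetical replicate genotype calls (None = no call).
rep1 = ["AA", "AB", "BB", None, "AB", "AA"]
rep2 = ["AA", "AB", "BB", "AA", "BB", "AA"]
conc = concordance_rate(rep1, rep2)

# Hypothetical marker classes; the labels are illustrative only.
classes = {"snp1": "Converted", "snp2": "Other", "snp3": "Converted", "snp4": "Other"}
conv = conversion_rate(classes, converted_classes={"Converted"})

print(f"concordance = {conc:.2%}, conversion = {conv:.2%}")
```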

The evaluation revealed significant variability in sample pass rates across different tissue types and preservation methods, underscoring the importance of sample handling in genetic studies. It should be noted that we purposely included a small number of samples generated from other standard workflows (e.g., tissue homogenates in PBS, a preparation routinely used for DNA extraction for qPCR diagnostics of QPX [70]) to evaluate their usefulness and complementarity to genotyping with the hard clam SNP array. The findings show that protocol adaptations are needed to make such samples useful. Overall, mantle and adductor muscle tissue preserved in ethanol showed high pass rates, demonstrating the effectiveness of ethanol preservation in maintaining DNA integrity, including for samples preserved for 19 years. This finding is consistent with best practices across genomic studies in aquatic species, where ethanol preservation is often recommended to ensure the stability and quality of DNA samples [71]. Conversely, DNA samples stored in water showed markedly lower pass rates, which could be attributed to DNA degradation over time [72]. This aspect of the study highlights the critical role of preservation methods in genetic research and suggests that maintaining optimal preservation conditions is essential for maximizing the success of genotyping efforts. Juvenile clams can also be used as biomaterial for genotyping. The study revealed a clear trend related to the size of the juvenile clams, with larger animals generally having lower genotyping success rates when digestive tissues were included during DNA isolation. This observation points to potential complications from including larger amounts of digestive tissue, which may contain inhibitors or contaminants that affect DNA quality [73]. When digestive tissues were removed, success rates improved significantly, suggesting that careful tissue selection and handling can mitigate some challenges associated with genotyping larger or more complex tissue samples.
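A comparison of pass rates across sample types, as described above, amounts to grouping samples by tissue and preservation method and computing the fraction that passed quality control in each group. The Python sketch below shows one way to do this; the records are hypothetical and do not reproduce the study's data.

```python
from collections import defaultdict


def pass_rates(records):
    """Return {(tissue, preservation): fraction of samples that passed QC}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_passed, n_total]
    for tissue, preservation, passed in records:
        group = (tissue, preservation)
        counts[group][0] += int(passed)
        counts[group][1] += 1
    return {group: n_pass / n_total for group, (n_pass, n_total) in counts.items()}


# Hypothetical sample records: (tissue or material, preservation, passed QC?).
records = [
    ("mantle", "ethanol", True),
    ("mantle", "ethanol", True),
    ("adductor", "ethanol", True),
    ("adductor", "ethanol", False),
    ("extracted DNA", "water", False),
    ("extracted DNA", "water", True),
    ("extracted DNA", "water", False),
]

rates = pass_rates(records)
for group, rate in sorted(rates.items()):
    print(group, f"{rate:.2f}")
```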

Array applications

Principal component analysis (PCA) of the SNP array genotypes resolved the genetic structure of M. mercenaria populations, demonstrating the array's utility in delineating genetic variation across different geographic and management contexts. This analysis, which included wild and aquacultured populations from different geographic regions, revealed distinct genetic clustering patterns that are critical for understanding the impact of regional culture practices and environmental factors on the genetic diversity of hard clams. The PCA results showed significant differences in genetic clustering between aquacultured populations in the Northeast (AUSDA: AUSDAD, AUSDAE, and AUSDAF) and clams in Florida (AFL: A1FL, A2FL, and A3FL). This clear separation points to the influence of region-specific selective breeding practices and localized environmental adaptations. The ability to assess this variation is critical because it reflects not only adaptive responses to local conditions but also potential neutral divergence due to limited gene flow, as well as the results of human intervention in breeding strategies aimed at optimizing traits beneficial to aquaculture productivity and sustainability. Our analysis revealed clear genetic differentiation between wild and aquacultured populations of M. mercenaria in Florida, in contrast to the minimal genetic differences observed previously using microsatellite markers [74]. The wild populations from the Northeast (W03NY, W22NY, W1NJ, W2NJ) showed less genetic dispersion than the Florida wild population (WFL) and the aquacultured populations, indicating a relatively homogeneous genetic makeup within this regional group. This homogeneity is supported by the lack of significant genetic differentiation among these northeastern wild populations, as evidenced by FST values close to zero, indicating strong genetic connectivity. 
However, the significant genetic differentiation between these northeastern populations and the wild Florida population (WFL) suggests the existence of factors leading to distinct genetic structures. These factors could be geographic barriers that limit gene flow and/or distinct environmental pressures and natural selection processes in different habitats. Such genetic isolation between northeastern and southern hard clam populations has been previously reported using other genetic and genomic methods [50]. For instance, using a genotyping-by-sequencing approach, Ropp et al. [48] also showed clear genetic segregation between clams collected from New York and New Jersey (which were genetically similar) and those sourced from North and South Carolina (the Ropp study did not include clams from Florida). These findings have important implications for conservation and management strategies, as they underscore the need to consider local genetic specificities when planning breeding and conservation efforts to ensure the preservation of genetic diversity and adaptability [75]. The PCA revealed that certain individuals from the aquacultured populations A3FL and AUSDAF exhibited unique placements on the PCA plot. This unusual distribution could be attributed to several factors. One possible explanation is the presence of unique genetic traits that have arisen due to localized breeding strategies and specific environmental conditions [76]. Another critical factor to consider is ascertainment bias in SNP arrays. Ascertainment bias occurs when the SNPs included in the array are identified and selected based on a discovery panel that does not represent the entire population's genetic diversity [77]. This bias can result in skewed allele frequency distributions and may influence the observed population structure [78]. The distinct placement of some individuals from A3FL and AUSDAF on the PCA plot might be partially due to such bias. 
To fully understand the genetic structure of these populations and the potential impacts of ascertainment bias, further analysis should be conducted.
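The FST comparisons referred to above can be illustrated with a small worked example. The Python sketch below uses Hudson's estimator computed as a ratio of averages across SNPs, which is one common choice; the source does not state which estimator was used, so this is an assumption of the sketch. Allele frequencies and sample sizes are hypothetical, and n1 and n2 denote the number of sampled allele copies (twice the number of diploid individuals).

```python
def hudson_fst(freqs1, freqs2, n1, n2):
    """Hudson's FST as a ratio of averages over SNPs.

    freqs1, freqs2: per-SNP sample allele frequencies in populations 1 and 2.
    n1, n2: number of sampled allele copies in each population.
    """
    num = 0.0
    den = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        # Squared frequency difference, corrected for finite sample size.
        num += ((p1 - p2) ** 2
                - p1 * (1 - p1) / (n1 - 1)
                - p2 * (1 - p2) / (n2 - 1))
        # Probability that two alleles drawn from different populations differ.
        den += p1 * (1 - p2) + p2 * (1 - p1)
    if den == 0:
        raise ValueError("all SNPs are fixed for the same allele in both populations")
    return num / den


# Two hypothetical populations with identical frequencies: FST near zero
# (the sample-size correction can make the estimate slightly negative).
fst_same = hudson_fst([0.5, 0.5], [0.5, 0.5], n1=20, n2=20)

# Two hypothetical populations with strongly diverged frequencies: high FST.
fst_diff = hudson_fst([0.9, 0.8], [0.1, 0.2], n1=20, n2=20)

print(f"connected populations: FST = {fst_same:.3f}")
print(f"diverged populations:  FST = {fst_diff:.3f}")
```

Because array SNPs are chosen from a discovery panel, FST estimates from array data inherit the ascertainment bias discussed above and are best compared across population pairs genotyped on the same marker set.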

The ability to analyze mtSNPs in M. mercenaria increases the potential applications of the hard clam SNP array for mitochondrial DNA analysis in both research and aquaculture breeding programs. In this context, phylogenetic analysis of mtSNPs identified three mitochondrial haplogroups within M. mercenaria, illustrating the hard clam 66K SNP array’s utility in elucidating mitochondrial genetic structures. The delineation of these groups provides insight into the historical and adaptive processes that have shaped the genetic landscape of M. mercenaria. Our results complement previous mitochondrial DNA studies of hard clams, which described M. mercenaria stocks in the Atlantic as a single evolutionary unit divided into at least three closely related populations, noting regional adaptive differences among northern, central, and southern populations [2]. Our findings also reflect the importance of mitochondrial markers in revealing phylogenetic divisions within bivalve species, as demonstrated in previous studies [79].

The use of the hard clam SNP array for the detection of QPX in hard clams exemplifies how advanced genomic tools are becoming increasingly important in the management of disease in marine aquaculture. Despite the promise of this approach, our findings echo the challenges reported for the detection of some pathogens using SNP arrays in the eastern oyster [46], where specificity and sensitivity issues were encountered, particularly at lower pathogen concentrations. Our findings suggest that while the SNP array can detect the presence of QPX, the pathogen load likely influences whether the detection threshold is reached. This could be due to variability in the ability to detect different stages of infection (e.g., genome replication within a parasite cell [24] or different DNA extraction efficiencies between different parasite life stages) or to the overall pathogen load present in the tissue samples, which could affect the probes' ability to detect the pathogen [80], potentially leading to a high rate of type II error (false negatives). Additional research is needed to define QPX detection thresholds for confirmation of infection status and to clarify the relationship between parasite load and signal strength for potential quantification of parasite infections. This preliminary assessment indicates that detection of positive controls is possible, but comprehensive testing and validation are necessary before the array can be reliably used for diagnosis.

Conclusions

The development of the 66K SNP array for M. mercenaria marks a significant step forward in the integration of genomic tools into hard clam aquaculture. By facilitating detailed genetic analyses, supporting breeding for desirable traits, and potentially aiding in disease management and environmental adaptation strategies, this tool enhances the ability to manage hard clam populations more effectively. Future research should focus on expanding the applicability of the array to hard clam populations, further refining SNP selection and array design methodologies, and integrating these genomic tools with ecological and conservation approaches. This new array is expected to be a reliable tool for genome-wide association studies of clam resistance to various biological and environmental stressors and for genomic selection, thus taking clam aquaculture to the next level.