Introduction

When alleles are sufficiently close together on a chromosome, they tend to be inherited together through linked inheritance rather than being passed down independently. This means that the offspring receives blocks of alleles or haplotypes from each parent, rather than individual alleles (O’Brien et al. 2014; Ardlie et al. 2002). Linkage disequilibrium (LD) arises due to the interconnection of alleles on a chromosome, which results in a certain degree of correlation among them. This phenomenon can be observed not only for nucleotides in the genome but also for different types of genetic markers, including single nucleotide polymorphisms (SNPs) (O’Brien et al. 2014). In essence, LD between molecular markers represents the extent to which the genotypes of two SNPs are correlated (Porto-Neto et al. 2014). While physical proximity plays a significant role in determining LD, it is important to acknowledge that other factors such as evolutionary processes and historical events can also affect the correlation between molecular markers. These factors encompass inbreeding, selection, population stratification, genetic drift, genetic bottleneck, effective population size, mutation, recombination rate, and migration (Karimi et al. 2015; Ardlie et al. 2002; Reich et al. 2001).

Comprehending the concept of LD holds paramount importance in the mapping of genes and their relevance to genomic studies. Through the analysis of LD between SNPs, researchers can obtain valuable insights into the diversity across various breeds, estimate recombination event frequencies, investigate fluctuations in effective population size across generations, and identify genomic regions suitable for enhancing economically important traits. Such understanding contributes significantly to the advancement of genetic research and the targeted improvement of desired traits (O’Brien et al. 2014; Espigolan et al. 2013; McKay et al. 2007).

Effective population size is an important genetic parameter that quantifies the effect of genetic drift in a population (Crow and Kimura 1970). It is one of the best quantitative indicators of genetic diversity (Makanjuola et al. 2020) and helps determine the number of independent chromosome segments required for genomic predictions (Mrode et al. 2019). Estimating the effective population size is valuable not only from an evolutionary standpoint but also for enhancing models used in the mapping of genes related to quantitative traits (Li and Kim 2015).

The decreasing cost of high-throughput genotyping assays has opened up new possibilities for large-scale genomic studies. The effectiveness and precision of these investigations are greatly influenced by the extent and structure of LD between SNPs throughout the genome. Because LD patterns determine the level of correlation between markers, understanding LD is crucial for maximizing the utility and accuracy of genome-wide association studies (GWAS) and genomic selection (GS) (Goddard and Hayes 2012). LD between markers has been examined in the genomes of diverse taurine and indicine cattle breeds (Espigolan et al. 2013; Makina and Taylor 2015). These studies showed that moderate LD (r2 = 0.20) was observable at distances of less than 100 kilobases, which suggests that a set of 50,000 SNPs is sufficient to capture most of the LD information needed for conducting GWAS in taurine breeds (McKay et al. 2007; Espigolan et al. 2013). In contrast, a lower extent of LD (r2 = 0.20–0.34) was observed in indicine cattle at distances of less than 30 kilobases, which suggests that a higher-density SNP chip would be necessary to capture the LD information required for genomic studies in these cattle (Makina and Taylor 2015). The lower extent of LD observed in indicine cattle breeds may be attributed to a potential ascertainment bias inherent in the SNP chips used for genotyping.

The practice of crossbreeding in dairy cows has been identified as a highly effective approach for enhancing livestock productivity, reproductive efficiency, and sustainability (Leroy et al. 2016; Mbole-Kariuki et al. 2014; Bebe et al. 2003). In a single lactation, the crossbred offspring of Sahiwal (Bos indicus) and Holstein Friesian (HF; Bos taurus) cows can produce approximately 4000 L of milk containing 4% fat (Kumar et al. 2018). Crossbreeding Sahiwal and HF cattle results in offspring that benefit from hybrid vigor, which leads to improved health traits and increased productivity. The cross combines the high milk yield of HF cows with the adaptability and heat tolerance of Sahiwal cows, producing offspring that are well suited to local conditions and have improved milk yield.

In a recent study, admixture patterns and signature of selection for the same samples were studied (unpublished data). However, the investigation of genome-wide LD and its pattern using this specific chip remains unexplored. The prediction accuracy from genomic selection (GS) is affected by marker density, minor allele frequency (MAF), and genetic architecture of the target trait (Zhang et al. 2019). Also, the accuracy of the genomic prediction depends on the amount of genetic variation explained by the markers resulting from the LD between the marker and QTL (Goddard 2009). Therefore, this study examines the distribution of allelic frequencies, determines the level of LD (measured using r2), and estimates the effective population size in the population of crossbred dairy cattle, which have a major impact on the accuracy of GS in this admixed population. This study will provide valuable insights for estimating marker density in genomic studies of crossbred dairy cattle.

Materials and methods

Ethics statement

To ensure the ethical and humane treatment of animals, the study described in this paper was approved by the institutional review committee. During blood collection, a professional veterinarian was present to ensure minimal distress and harm to the animals. Before collecting any samples, the researchers met with the owners of the farm where the animals were housed to explain the purpose of the study and obtain verbal informed consent.

Animal sample and genotype quality control

The study sample consisted of 81 crossbred cattle from the Military Farm located in Renala Khurd near Okara, Punjab. These animals were selected to represent varying percentages of HF and Sahiwal genetics and came from different lactations. Because of the nature of crossbreeding, breed composition varies from individual to individual. For instance, some crossbreds have approximately 50% inheritance from each of HF and Sahiwal, while others have 31/32 HF inheritance, with the remainder being Sahiwal. Blood sampling was carried out during different visits to cattle farms in 2021 and 2022.

DNA was extracted from the blood samples using the FavorPrep™ Blood Genomic DNA Extraction Mini Kit, following the manufacturer's guidelines. The quality and quantity of the DNA were evaluated using a NanoDrop spectrophotometer, agarose gel electrophoresis, and a Qubit fluorometer. The extracted DNA was genotyped with the GGP_HDv3_C (GeneSeek® Genomic Profiler™) panel through the commercially available services of GeneSeek (Neogen Corporation, Lincoln, NE, United States). The genotypes were called and analyzed using the GenomeStudio software from Illumina, Inc. The analysis was based on the bovine genome assembly ARS-UCD1.2.

Quality control (QC)

After the genotyping process, the initial raw data comprised 139,376 SNPs for the crossbred individuals. Quality control measures were applied using the PLINK v1.9 software (Slifer 2018), which involved removing SNPs with a call rate of less than 95%, a minor allele frequency (MAF) of less than 0.02, or a Hardy–Weinberg equilibrium (HWE) test P-value of less than 10E−05. For subsequent analysis, only autosomal SNPs were considered.
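The call-rate and MAF filters described above can be illustrated with a minimal Python sketch. It is not the PLINK implementation: the genotype coding (0, 1, or 2 copies of one allele, `None` for a missing call), the function names, and the toy data are all assumptions made for illustration, and the HWE test is omitted for brevity.

```python
from typing import List, Optional

# Assumed coding: 0, 1 or 2 copies of one allele; None marks a missing call.
Genotype = Optional[int]


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of animals with a non-missing genotype at one SNP."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes: List[Genotype]) -> float:
    """MAF computed from the non-missing genotypes at one SNP."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    allele_freq = sum(called) / (2 * len(called))
    return min(allele_freq, 1.0 - allele_freq)


def passes_qc(genotypes: List[Genotype],
              min_call_rate: float = 0.95,
              min_maf: float = 0.02) -> bool:
    """Apply the call-rate and MAF thresholds used in the text."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)


# Hypothetical toy data: three SNPs genotyped in ten animals.
toy_snps = {
    "snp_ok": [0, 1, 2, 1, 0, 1, 1, 2, 0, 1],
    "snp_low_call": [0, 1, None, None, 0, 1, 1, 2, 0, 1],
    "snp_monomorphic": [0] * 10,
}
kept = [name for name, g in toy_snps.items() if passes_qc(g)]
print(kept)
```

In this toy example only the first SNP is retained: the second fails the call-rate threshold and the third has no minor allele.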

Marker statistics

The R software was employed to estimate multiple characteristics of the autosomes. These included the length of each chromosome in megabases (Mb), the count of markers on each autosome, the longest and shortest intervals between SNPs, and the average interval between SNPs across all autosomes (Team 2020).
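The study computed these statistics in R. As a rough illustration only, the same per-chromosome summaries can be sketched in Python; the input layout (a mapping from chromosome name to SNP positions in base pairs), the function names, and the toy positions are assumptions, and the marker span is used here as a stand-in for chromosome length.

```python
from statistics import mean
from typing import Dict, List


def adjacent_gaps_kb(positions_bp: List[int]) -> List[float]:
    """Distances in kb between neighbouring SNPs on one chromosome."""
    ordered = sorted(positions_bp)
    return [(b - a) / 1000 for a, b in zip(ordered, ordered[1:])]


def summarize(positions_by_chrom: Dict[str, List[int]]) -> Dict[str, dict]:
    """Marker count, covered span and gap statistics per chromosome."""
    summary = {}
    for chrom, positions in positions_by_chrom.items():
        gaps = adjacent_gaps_kb(positions)
        summary[chrom] = {
            "n_snps": len(positions),
            # Span between first and last marker, used as a proxy for length.
            "span_mb": (max(positions) - min(positions)) / 1e6,
            "min_gap_kb": min(gaps),
            "max_gap_kb": max(gaps),
            "mean_gap_kb": mean(gaps),
        }
    return summary


# Hypothetical positions, not taken from the study data.
toy_map = {
    "BTA1": [10_000, 35_000, 36_000, 120_000],
    "BTA25": [5_000, 50_000, 52_000],
}
stats = summarize(toy_map)
for chrom, row in stats.items():
    print(chrom, row)
```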

Minor allele frequency (MAF)

To calculate the MAF of autosomal SNPs, default settings in PLINK v1.9 software were used with the command “--file data --freq” (Slifer 2018). The distribution of allele frequencies across various chromosomes was analyzed using the R software. Additionally, a plot was generated to visualize the proportion of SNPs falling within different frequency categories, namely 0.02–0.10, 0.10–0.20, 0.20–0.30, 0.30–0.40, and 0.40–0.50 (Team 2020).
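The grouping of SNPs into the five frequency categories can be sketched as follows. This is an illustrative Python version rather than the R code used in the study; the boundary convention (lower bound inclusive, upper bound exclusive except for 0.50) and the toy MAF values are assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional

# Category boundaries taken from the text.
BIN_EDGES = [0.02, 0.10, 0.20, 0.30, 0.40, 0.50]


def maf_bin(maf: float) -> Optional[str]:
    """Return the label of the MAF category, or None if outside 0.02-0.50."""
    for lo, hi in zip(BIN_EDGES, BIN_EDGES[1:]):
        is_last = hi == BIN_EDGES[-1]
        # Assumed convention: [lo, hi), with the final bin closed at 0.50.
        if lo <= maf < hi or (is_last and maf == hi):
            return f"{lo:.2f}-{hi:.2f}"
    return None


def bin_proportions(mafs: List[float]) -> Dict[str, float]:
    """Proportion of SNPs falling in each MAF category."""
    labels = [maf_bin(m) for m in mafs]
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical MAF values for eight SNPs.
toy_mafs = [0.05, 0.15, 0.25, 0.35, 0.45, 0.45, 0.31, 0.50]
proportions = bin_proportions(toy_mafs)
print(proportions)
```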

Inbreeding coefficient (F) and effective population size (Ne)

To estimate F, the difference between the observed and expected numbers of homozygous loci was used in PLINK v1.9 software (Slifer 2018), according to the formula Fi = (Oi − Ei)/(Li − Ei). In this equation, Fi is the estimated inbreeding coefficient of the ith animal, Oi represents the number of observed homozygous loci, Ei represents the number of expected homozygous loci, and Li represents the number of genotyped autosomal loci.
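As a worked illustration of this ratio, the sketch below computes F for one animal in Python. It is a simplification, not PLINK's code: the expected homozygosity at each locus is taken as 1 − 2p(1 − p) under Hardy–Weinberg proportions, which is an assumption of this example, and the function name and toy inputs are hypothetical.

```python
from typing import List, Optional


def inbreeding_coefficient(genotypes: List[Optional[int]],
                           allele_freqs: List[float]) -> float:
    """F = (O - E) / (L - E) for one animal.

    genotypes: 0/1/2 allele counts per locus, None for a missing call.
    allele_freqs: population frequency p of the counted allele per locus.
    """
    observed_hom = 0      # O: loci where the animal is homozygous
    expected_hom = 0.0    # E: sum of per-locus expected homozygosity
    n_loci = 0            # L: loci with a non-missing genotype
    for g, p in zip(genotypes, allele_freqs):
        if g is None:
            continue
        n_loci += 1
        if g in (0, 2):
            observed_hom += 1
        # Assumed Hardy-Weinberg expectation of homozygosity at this locus.
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)
    return (observed_hom - expected_hom) / (n_loci - expected_hom)


freqs = [0.5, 0.5, 0.5, 0.5]
print(inbreeding_coefficient([0, 2, 1, 1], freqs))  # as many homozygotes as expected
print(inbreeding_coefficient([0, 2, 0, 2], freqs))  # fully homozygous animal
```

With all allele frequencies at 0.5, two of four loci are expected to be homozygous, so the first animal gives F = 0 and the second gives F = 1.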

To estimate the effective population size (Ne), the SNeP tool was utilized, leveraging the relationship between Ne, linkage disequilibrium (represented by r2), and recombination rate (c) (Barbato et al. 2015). This is given by Corbin et al. (2012):

$$N_T \left( t \right) = \frac{1}{4f\left( C_t \right)} \times \left[ \frac{1}{E\left( r^2_{\text{adj}} \mid C_t \right)} - \alpha \right]$$

In this equation, NT(t) represents the effective population size estimated t generations ago, Ct denotes the recombination rate corresponding to t generations ago, r2adj is the LD estimate adjusted for sampling bias, f is the mapping function, and α is a constant.
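To make the relationship concrete, the following Python sketch evaluates Ne in the form Ne = (1/(4f(c))) × (1/E[r2adj | c] − α). Several simplifying assumptions are made here and are not taken from the study: f(c) = c (no mapping function applied), α = 1, the sample-size adjustment r2adj = r2 − 1/(βn) with β = 2, and the commonly used approximation t = 1/(2c) for the number of generations ago. The function names and input values are hypothetical.

```python
def adjusted_r2(r2: float, n_animals: int, beta: float = 2.0) -> float:
    """Sample-size adjusted r2 (assumed form: r2 - 1 / (beta * n))."""
    return r2 - 1.0 / (beta * n_animals)


def effective_population_size(r2_adj: float, c: float, alpha: float = 1.0) -> float:
    """Ne for markers at recombination rate c (in Morgans).

    Assumes f(c) = c, i.e. no mapping function is applied.
    """
    return (1.0 / (4.0 * c)) * (1.0 / r2_adj - alpha)


def generations_ago(c: float) -> float:
    """Approximate time depth t = 1 / (2c) for linkage distance c."""
    return 1.0 / (2.0 * c)


# Hypothetical values: mean adjusted r2 of 0.2 at c = 0.001 Morgans.
ne = effective_population_size(0.2, 0.001)
t = generations_ago(0.001)
print(round(ne), round(t))
```

For these illustrative inputs the sketch returns an Ne of about 1000 at roughly 500 generations ago; tightly linked markers inform on the distant past, while loosely linked markers inform on recent generations.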

Linkage disequilibrium (LD)

The assessment of LD was carried out using the square of the correlation coefficient between two loci, represented as r2. This metric is regarded as robust and unaffected by fluctuations in allele frequency and population size (Zhao et al. 2007). The estimation of r2 plays a crucial role in determining the number of loci needed for conducting GWAS and quantitative trait loci (QTL) mapping. This measure helps to assess the extent of LD between markers and assists in designing the appropriate sample size and marker density for such studies (Makina and Taylor 2015). The equation for estimating LD using the r2 value is expressed as follows:

$$\begin{aligned} D & = p_{A_i B_j} - p_{A_i} \, p_{B_j} \\ r^2 & = \frac{D^2}{p_{A_i} \left( 1 - p_{A_i} \right) \, p_{B_j} \left( 1 - p_{B_j} \right)} \\ \end{aligned}$$

In this context, the frequency of the ith allele at locus A is denoted as pAi, while the frequency of the jth allele at locus B is represented as pBj. Additionally, the frequency of the haplotype AiBj in the population is denoted as pAiBj.
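A short numerical illustration of these quantities, written in Python, is given below. It takes the haplotype frequency pAiBj and the two allele frequencies as known inputs, which is an assumption of the example: with unphased SNP-chip genotypes these frequencies have to be estimated, so this is not the routine PLINK applies. The function name and input values are hypothetical.

```python
def ld_r2(p_ab: float, p_a: float, p_b: float) -> float:
    """r2 between two loci from haplotype and allele frequencies.

    p_ab: frequency of the haplotype carrying allele Ai and allele Bj
    p_a:  frequency of allele Ai at locus A
    p_b:  frequency of allele Bj at locus B
    """
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


# Both alleles at frequency 0.5, three hypothetical haplotype frequencies.
print(ld_r2(0.25, 0.5, 0.5))  # independence: D = 0
print(ld_r2(0.40, 0.5, 0.5))  # partial association
print(ld_r2(0.50, 0.5, 0.5))  # complete association
```

When the haplotype frequency equals the product of the allele frequencies, D and r2 are both zero; r2 reaches 1 only when the two alleles always occur together.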

MapThin v1.11 was used to thin the map files, selecting 20 SNPs per 10^6 bp to minimize false positive results and increase the efficiency of the analysis (Howey and Cordell 2011). PLINK v1.9 was then used with the command “--ld-snp-list mysnplist --ld-window-kb 186000 --ld-window 99999 --ld-window-r2 0” to estimate r2 between all pairs of SNPs on each autosome; the window of 186,000 kb was chosen so that it exceeds the length of the longest chromosome (chr1) (Slifer 2018).

To verify the occurrence of free recombination at physical distances greater than 10 Mb, two types of analyses were conducted on our dataset, one without a 10 Mb window and one restricted to a 10 Mb window (Zhao et al. 2014):

  1. To analyze the decay of LD, the genomic regions were divided into eight categories of 20 Mb each, namely 0–20 Mb, 20–40 Mb, 40–60 Mb, 60–80 Mb, 80–100 Mb, 100–120 Mb, 120–140 Mb, and 140–160 Mb. The LD value was then calculated for all possible regions within each category.

  2. Taking a maximum distance of 10 Mb between SNP pairs, the LD decay was calculated for all possible SNP pairs across the autosomes. The trend in LD decay for crossbred individuals was plotted across the first 10 Mb, showing how LD changes with increasing physical distance in crossbred animals.

The calculated LD decay was then categorized into eight intervals based on distance ranges: 0–10 kb, 10–25 kb, 25–50 kb, 50–100 kb, 100–500 kb, 0.5–1 Mb, 1–5 Mb, and 5–10 Mb. The mean LD in each interval was plotted against the distance range, which allows LD decay patterns in the autosomal genome to be assessed across varying distances.
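The interval-based summary described above can be sketched in Python as follows. The input is assumed to be a list of (position A, position B, r2) tuples for SNP pairs on the same chromosome, for example as parsed from pairwise LD output; the function names, the choice to include the upper boundary in each interval, and the toy values are assumptions of this illustration.

```python
import bisect
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Interval boundaries in bp, matching the eight categories in the text.
EDGES_BP = [0, 10_000, 25_000, 50_000, 100_000,
            500_000, 1_000_000, 5_000_000, 10_000_000]
LABELS = ["0-10 kb", "10-25 kb", "25-50 kb", "50-100 kb",
          "100-500 kb", "0.5-1 Mb", "1-5 Mb", "5-10 Mb"]


def distance_bin(distance_bp: int) -> Optional[str]:
    """Label of the interval containing the distance, or None beyond 10 Mb."""
    if distance_bp <= 0 or distance_bp > EDGES_BP[-1]:
        return None
    # Assumed convention: each interval includes its upper boundary.
    return LABELS[bisect.bisect_left(EDGES_BP, distance_bp) - 1]


def mean_r2_by_bin(pairs: List[Tuple[int, int, float]]) -> Dict[str, float]:
    """Mean r2 per distance interval for (bp_a, bp_b, r2) tuples."""
    grouped = defaultdict(list)
    for bp_a, bp_b, r2 in pairs:
        label = distance_bin(abs(bp_b - bp_a))
        if label is not None:
            grouped[label].append(r2)
    return {label: mean(grouped[label]) for label in LABELS if label in grouped}


# Hypothetical SNP pairs; the last one is farther apart than 10 Mb and is skipped.
toy_pairs = [
    (1_000, 6_000, 0.4),
    (1_000, 9_000, 0.2),
    (1_000, 41_000, 0.1),
    (0, 20_000_000, 0.5),
]
decay = mean_r2_by_bin(toy_pairs)
print(decay)
```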

Minor allele frequency and sample size impact

To assess the impact of minor allelic frequency (MAF) and sample size on LD, the analysis was extended. For a physical distance of 10 Mb, LD was computed using four distinct MAF thresholds (0.05, 0.10, 0.15, and 0.2). Furthermore, seven random subsets of the population were selected with different sample sizes (N = 10, 20, 30, 40, 50, 60, and 70) to investigate the impact of sample size on r2-based LD. The extent of LD was assessed for each subset, and the impact of sample size and MAF on LD (r2) was also depicted through plotting.
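The random subsets can be drawn reproducibly, as in the Python sketch below. The animal identifiers, the fixed seed, and the function name are hypothetical; the study does not state how its subsets were sampled. Each resulting list could then be written to a file and used to restrict the genotype data before recomputing r2.

```python
import random
from typing import Dict, List


def draw_subsets(animal_ids: List[str], sizes: List[int],
                 seed: int = 0) -> Dict[int, List[str]]:
    """Draw one random subset of animals, without replacement, per size."""
    rng = random.Random(seed)  # fixed seed so the subsets can be reproduced
    return {n: sorted(rng.sample(animal_ids, n)) for n in sizes}


# Hypothetical identifiers for the 81 genotyped animals.
animal_ids = [f"animal_{i:02d}" for i in range(1, 82)]
sizes = [10, 20, 30, 40, 50, 60, 70]
subsets = draw_subsets(animal_ids, sizes)
for n, members in subsets.items():
    print(n, members[:3])
```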

Results

Quality control

Figure 1 summarizes the quality control results for different MAF thresholds. For example, at a MAF threshold of 0.02, 1804 SNPs were removed because their MAF was below 0.02, 216 SNPs were removed based on Hardy–Weinberg equilibrium (HWE), and 1111 SNPs were excluded by the call rate criterion. After quality control, a total of 116,710 autosomal SNPs with a genotyping rate of 0.99 were available for downstream analysis. These steps were repeated for different MAF values with all other parameters unchanged; the effect of the MAF threshold on the final number of SNPs left for downstream analysis is depicted in Fig. 1.

Fig. 1 Effect of different MAF thresholds on the total number of SNPs left for downstream analysis

Marker statistics

After quality control, the retained SNPs covered a total of 2520.241 Mb of the genome of crossbred dairy cattle, with an average chromosome length of 86.90 Mb. The longest chromosome was BTA1 (158.8551 Mb) and the shortest was BTA25 (42.85 Mb). The number of SNPs on each chromosome was proportional to the length of the chromosome: the highest number of SNPs was observed on BTA1 (7078) and the lowest on BTA25 (1945). On average, the distance between adjacent SNPs was approximately 21.70 kb. The longest distance between SNPs was observed on BTA5 (612 kb), and the longest distance between SNPs on the same chromosome was found on BTA5 (3882 kb). Conversely, the mean shortest distance between SNPs was 0.16 kb, with the shortest distance occurring on BTA18 (0.002 kb). Descriptive statistics for the SNP markers on each autosome are provided in Table 1.

Table 1 Snapshot of the SNP markers studied and their minor allele frequency (MAF) across the autosomal chromosomes (BTA)

Minor allele frequency (MAF)

The mean MAF observed across all autosomes was recorded as 0.30. Figure 2 provides a visual representation of the distribution of MAF on all autosomes.

Fig. 2 Minor allele frequency (MAF) in all autosomes

The distribution of MAF indicated that a large percentage of SNPs had high MAF values (Fig. 3). Specifically, around 54% of the SNPs fell in the last two MAF groups (MAF ≥ 0.3), while a lower percentage fell into the initial categories. On average, around 25.20% of the SNPs had a MAF value lower than 0.2. This distribution highlights the predominance of SNPs with higher MAF values in the analyzed dataset.

Fig. 3 Proportion of SNPs categorized by minor allele frequencies (MAF) across autosomal chromosomes

It is noteworthy that all autosomes exhibited a similar trend, with a greater percentage of SNPs falling into the last two groups (MAF ≥ 0.3). However, BTA25, BTA6, BTA22, BTA9, and BTA23 had a higher percentage of SNPs in the last category (MAF ≥ 0.4). Among the chromosomes, BTA19 (10.04%), BTA17 (9.67%), BTA4 (9.61%), and BTA14 (9.54%) had a higher percentage of SNPs with MAF values between 0.02 and 0.1.

Inbreeding coefficient (F) and effective population size (Ne)

The average value of F was estimated to be 0.028, indicating that the risk of negative impacts due to inbreeding depression can be considered insignificant at this level of inbreeding.

The Ne of the crossbred dairy cattle was estimated throughout the past 1000 generations based on the average r2 values, as presented in Table 2. The results showed a declining trend in Ne, which decreased from 2775 (995 generations ago) to 150 (13 generations ago). This suggests that the crossbred dairy cattle population has experienced a decrease in genetic diversity over time.

Table 2 Effective population size (Ne) across generations determined through linkage disequilibrium (r2)

Extent of LD across the genome

Without considering the window, a total of 47,054,338 SNP pairs were obtained in the whole dataset, with a mean r2 value of 0.020025 (Table 3). For SNP pairs separated by a physical distance of ≤ 10 Mb, a total of 9,115,588 pairs across the autosomes were analyzed, with a mean r2 value of 0.128. Table 3 provides the mean LD (r2) values for different intervals of physical distance.

Table 3 Statistical summary of linkage disequilibrium (r2) over the entire genome and up to 10 Mb SNP distance

Table 3 shows that considering a 10 Mb window is important because there is an inverse relationship between LD (r2) and the distance between SNP pairs. The mean r2 values are higher when the SNP pair distance is smaller, i.e., between 0 and 10 kb, and decrease from 0–10 kb to 5–10 Mb. Therefore, it is suggested that SNP pairs within a distance of 10 Mb should be explored further.

The decay of LD in relation to the distance between pairs of SNPs is depicted in Fig. 4 for all autosomes. Higher levels of LD were predominantly observed at shorter distances between SNP pairs, highlighting the rapid decay of LD as the physical distance between SNPs increases.

Fig. 4 Average linkage disequilibrium (LD) decay in relation to the SNP pair distance

The level of LD measured by r2 varied across chromosomes and depended on the physical distance between genetic markers. To explore this relationship, the mean r2 was computed for various physical distance intervals between markers on each chromosome. Chromosomes BTA22, BTA19, BTA18, and BTA7 exhibited higher levels of LD. For markers separated by < 10 kb, the average r2 was 0.2332, which decreased to 0.1792 for markers separated by 25–50 kb. The average r2 continued to decline with increasing distance, reaching 0.0344 for the 5–10 Mb category. These results indicate that mean r2 values decrease as the physical distance between markers increases (Fig. 5).

Fig. 5 Box plot of mean r2 and SNP pair distance up to 10 Mb for all 29 autosomes

The average r2 values showed a significant difference across various autosomes, especially for SNP distances less than 10 kb. On the other hand, for SNP distances greater than 100 kb, lower mean r2 values with relatively little variation across different autosomes were observed (Fig. 5).

Minor allele frequency (MAF) and linkage disequilibrium (LD) estimates

The impact of MAF on the magnitude of LD was examined using four threshold levels: 0.05, 0.10, 0.15, and 0.20. This analysis focused on SNP pairs within a physical distance of 10 Mb (Fig. 6). The findings revealed a notable influence of the MAF threshold on the average r2, especially at shorter distances between SNPs. The r2 value between SNP pairs was lowest when the MAF threshold was set to the lowest value (0.05) and increased substantially at higher MAF thresholds. The mean r2 values ranged from 0.03 to 0.25 for MAF > 0.05, 0.03 to 0.27 for MAF > 0.10, 0.03 to 0.32 for MAF > 0.15, and 0.04 to 0.34 for MAF > 0.20. These results indicate that the MAF threshold has a considerable effect on the extent of LD between SNPs, with higher MAF thresholds resulting in stronger LD.

Fig. 6 Effect of minor allele frequency (MAF) on linkage disequilibrium extent

Sample size and LD estimates

To study the effect of sample size, random samples of sizes 10, 20, 30, 40, 50, 60, and 70 were selected from the total population for analysis. A noteworthy finding of this study is that the average r2 increased for smaller sample sizes, particularly when the physical distance intervals between SNP pairs exceeded 50 kb (Fig. 7). These results suggest that a larger sample size (at least 40 animals) is needed for an accurate estimation of r2.

Fig. 7 Effect of different sample size on mean r2 estimates

Discussion

Following rigorous quality control measures, a final set of 116,710 autosomal SNPs was retained for analysis. These SNPs spanned a genomic region of 2520.241 Mb in crossbred dairy cattle. The average MAF was 0.30, in line with MAF values previously reported in diverse cattle breeds (Makina and Taylor 2015) and in taurine breeds of cattle (O’Brien et al. 2014; Matukumalli et al. 2009; McKay et al. 2007). However, the average MAF observed in this study was notably higher than in indicine breeds of cattle, which typically exhibit MAF values ranging from 0.19 to 0.20 (O’Brien et al. 2014; Espigolan et al. 2013; Silva et al. 2010). In contrast to taurine cattle, indicine breeds typically display a distinct pattern in MAF levels, characterized by a greater representation of alleles with lower frequencies (< 0.2) (O’Brien et al. 2014; Gibbs et al. 2009; Villa-Angulo et al. 2009). This variation could be ascribed to the increased genetic diversity identified in indicine breeds (Murray et al. 2010; Gibbs et al. 2009). As there are few such reports on crossbred cattle, especially Bos indicus and Bos taurus crossbreds, this study may serve as a useful contribution and an important reference for future studies of native breeds. Furthermore, the commercially available SNP panel used in this study was designed predominantly from sequence data of Bos taurus breeds. This may introduce ascertainment bias, leading to a greater proportion of SNPs with low MAF in indicine breeds of cattle.

The distribution of MAF has a direct impact on the extent of LD, as a lower MAF can result in a greater difference in allelic pair frequencies, leading to an underestimation of LD (Wray 2005). To examine the effect of MAF on LD, four different MAF thresholds were selected. The results indicated that higher thresholds of MAF (> 0.20) were associated with higher average LD (r2) between SNPs, particularly at shorter distances (O’Brien et al. 2014; Sargolzaei et al. 2008; Uimari et al. 2005). At a lower MAF threshold (e.g., 0.05), rare variants are more likely to be included in the analysis. Rare variants are less likely to be in strong linkage with other variants, which may explain the observed decrease in r2. Although r2 is considered to be less affected by sample size, the small sample may also have contributed to this finding, which indicates that SNP chips with a higher density of SNPs and studies on larger populations may be preferable for genomic studies in native cattle breeds of Pakistan.

The estimation of LD between SNP pairs was conducted using the correlation (r2) method, which is known to be less affected by MAF (Ardlie et al. 2002) and small sample size (Zhao et al. 2014). To assess the decay of LD, the physical distance between markers was divided into distinct intervals. The findings demonstrated a swift decrease in r2 beyond a threshold of 100 kb. Furthermore, r2 declined from 0.24 to 0.17 when considering marker distances of 10 kb and 50 kb, respectively. For inter-marker distances of up to 25 kb, the average r2 value was 0.24, which is comparatively lower than previous LD estimates, documented for taurine breeds such as Angus (0.46) and Hereford (0.49), as well as indicine breeds like Brahman (0.25) and Nellore (0.27) cattle (Porto-Neto et al. 2014; Espigolan et al. 2013; Lu et al. 2012). The results showed higher mean r2 values for BTA22, BTA19, BTA18, and BTA7, while lower mean r2 values were observed for BTA27, BTA21, BTA9, and BTA26.

LD (r2) values that surpass 0.3 are deemed valuable for dependable association studies and precise genomic predictions (Ardlie et al. 2002; Meuwissen et al. 2001; Kruglyak 1999). In this study, regions up to 10 kb on BTA5, BTA7, BTA13, BTA18, BTA22, BTA26, and BTA27 exhibited r2 values larger than 0.3. BTA19, on the other hand, showed a slower decay in LD, maintaining the same level of r2 up to a distance of 25 kb. These findings deviate from the average LD levels documented in taurine breeds such as Angus, Holstein, Brown Swiss, and Fleckvieh, which reach an average r2 value of 0.3 at distances ranging from 40 to 50 kb (O’Brien et al. 2014). Instead, they align with the findings in indicine breeds such as Gyr and Nellore, which exhibit a more rapid decline in LD, reaching a similar r2 value at distances of approximately 20 kb. Similar outcomes have been observed in other taurine breeds, where r2 values remained comparable for distances equal to or less than 30 kb (Larmer et al. 2014; Bolormaa et al. 2011). Nevertheless, no correlation was identified between chromosomal size and r2 estimates (Bohmanova et al. 2010).

The LD decay analysis within a range of up to 10 megabases (Mb), using 10-kilobase (kb) windows, is presented in Fig. 4 for crossbred individuals. This analysis revealed a pronounced decline in LD at shorter distances between pairs of SNPs. This behavior is likely attributable to the relatively low number of SNP pair comparisons available at these close distances (Fig. 7).

These findings suggest that, with the set of SNPs contained in the GGP_HDv3_C chip, consistent LD levels may not be expected for genomic distances of less than 10 Mb. These results align with those reported by O’Brien et al. (2014) in their LD analysis conducted across different taurine and indicine breeds.

To examine the influence of sample size on the extent of LD, various sample sizes were used in the computation of r2 values, because a small sample size has previously been reported to lead to overestimation of LD (Yan et al. 2009; Khatkar et al. 2008). In the present investigation, a sample size of 40 cattle did not influence r2, which aligns with previous findings (Bohmanova et al. 2010; Singh et al. 2021). However, other studies have reported different threshold limits for sample size. For example, Zhu et al. (2013) suggested a minimum sample size of 100. In Holstein cattle, a minimum sample size of 400 was reported to be necessary for reliable LD decay analysis (Khatkar et al. 2008). Human studies have indicated even higher sample sizes (Chen et al. 2006). The minimum threshold for sample size appears to be around 75, as r2 accuracy is significantly compromised below this value (Khatkar et al. 2008). Similarly, another study suggested a minimum sample size of 55 (Bohmanova et al. 2010).

Previous studies have consistently emphasized the significance of employing a larger number of SNPs to adequately cover the genome in genomic evaluations, especially when analyzing data from crossbreds and indicine breeds (Makina and Taylor 2015; Espigolan et al. 2013). The findings from our study support this notion, that a higher density SNP array provides more information and enhances the reliability of GWAS and GS in crossbred dairy cattle populations. These results are consistent with other studies that have also emphasized the benefits of utilizing a higher SNP density for such analyses (Singh et al. 2021).

To gain a better understanding of the population diversity and structure, we estimated F and Ne. Our study revealed an inbreeding coefficient of 0.028. This finding is comparable to previously reported inbreeding coefficients of 3%, 4%, and 6% in the Vrindavani crossbred cattle population of India (Chhotaray et al. 2021; Singh et al. 2021; Elavarasan et al. 2023). The Karan Fries crossbred cattle of Karnal exhibited an inbreeding coefficient of 3.68% (Mumtaz et al. 2021). In Sahiwal cattle, one of the 11 breeds studied by Bang et al. (2022), the inbreeding coefficient was 0.9%. In Tharparkar cattle, different methods were employed to estimate genomic inbreeding coefficients, resulting in values of 0.0589 (FROH), 0.0215 (FHOM), 0.0532 (FGRM), and 0.0160 (FUNI) (Saravanan et al. 2022). Another study of the Holstein, Montbéliarde, and Normande breeds reported an inbreeding coefficient of 4.5–5% (Dezetter et al. 2015). For pure African taurine (Baoulé) and its crossbreeds with indicine Zebu cattle, genomic inbreeding coefficients ranged from 0 to 4% (Ouédraogo et al. 2021). Among nine breeds, the mean genomic inbreeding estimates were highest for Jersey (0.173) and lowest for Hereford (0.051) (Kelleher et al. 2017). Lower inbreeding was observed in six Colombian cattle breeds, ranging from 0.5 to 4.5% (Martinez et al. 2023). In a study of 171 cattle groups conducted by Tian et al. (2023), the average inbreeding coefficient ranged from 0.05 to 0.22.

In our study, we observed a decrease in effective population size (Ne) over multiple generations in crossbred dairy cattle, with Ne declining to an estimated 150 at 13 generations ago. When Ne decreases, the genetic diversity available for selection in genomic breeding is constrained, which limits the number of allelic variants that can be considered. Additionally, increased inbreeding resulting from a smaller Ne compromises fitness and undermines the accuracy of predictions due to the correlation of genetic variants. Moreover, the limited genetic contributions and heightened genetic drift further hinder the effectiveness of GS. By comparison, Ne values ranging from 33 to 153 have been reported in different studies on dairy cattle (Doekes et al. 2018; Rodríguez-Ramilo et al. 2015; Stachowicz et al. 2011). Effective population sizes for European taurine breeds ranged from 98 to 152, with Brown Swiss exhibiting the lowest value (98), and Limousin and Piedmontese showing the highest values (138 and 144, respectively). African taurine breeds recorded a range of 120–175 for Ne. Among indicine cattle breeds, Gir had the highest Ne estimate (180), while Tharparkar had the lowest (63) (Barbato et al. 2020). In buffalo breeds, both purebred and crossbred populations demonstrated a decreasing trend in recent Ne, with estimated values close to 387 and 113, respectively, 13 generations ago. This suggests that these animals have undergone strong selection or genetic drift, resulting in a decline in population size (Deng et al. 2019).

Conclusion

This study aimed to assess the level of LD between markers in crossbred dairy cattle from Pakistan using the GGP_HDv3_C (GeneSeek® Genomic Profiler™) SNP panel. The average estimated value of r2 was 0.24, which was lower compared to both indicine and taurine breeds. This suggests that a denser SNP panel is necessary to obtain more precise and accurate results in whole genome association studies for crossbred dairy cattle.

Additionally, the study observed a declining trend in the estimates of effective population size (Ne) in the population. This indicates the need for a well-designed breeding plan that can maintain a sufficiently large Ne to mitigate the negative effects of genetic drift and inbreeding.

Conducting studies on a larger population using a high-density array of SNPs would provide more comprehensive and reliable information regarding the extent of LD and the effective population size in crossbred dairy cattle of Pakistan.