Introduction

Rice is one of the most important food crops in the world. In response to the decline in rice acreage and the growth of the human population, the main goal of rice breeding programs in recent years has been to increase grain yield (Seck et al. 2012). Rice yield is a complex trait that is significantly influenced by various environmental factors and multiple quantitative trait loci (QTLs) and is largely determined by grain weight, panicle number, the number of grains per panicle and spikelet number per panicle. Effective tiller number (ETN), heading date (HD), plant height (PH) and grain weight per plant (GWPP) are significantly related to yield in rice (Jung and Muller 2009; Naruoka et al. 2011; Teoh et al. 2019). HD and PH are prerequisites for attaining the desired yield level in rice breeding programs. PH and ETN define the rice plant type; they shape plant architecture and enhance lodging resistance by adjusting the population structure, thereby affecting the yield per unit area (Liang et al. 2014). ETN and GWPP are significantly positively correlated with grain yield, and enhancing grain filling and the number of inferior tillers would be an effective approach to further improve the per acre yields of rice (Yang et al. 2022). Thus, understanding the genetic bases of the four traits has significant implications for rice yield improvement.

In recent years, with the rapid development of molecular biology technology, many QTLs and genes associated with HD, PH, ETN and GWPP have been mapped and cloned. For HD, 734 QTLs have been detected in rice (http://www.gramene.org/qtl), and 22 genes have been cloned (Doi et al. 2004; Lee et al. 2014; Wu et al. 2013; Zhang et al. 2015; Chai et al. 2021; Liu et al. 2021; Sun et al. 2022; Yang et al. 2021; Zong et al. 2021). Hd1, Hd2, Hd3a and Ehd1 play important roles in the flowering pathway of rice, and DTH2 is a minor-effect gene that promotes rice heading under long-day conditions. Ghd7 (grain number, plant height and heading date 7), Ghd7.1 and Ghd8 are pleiotropic QTLs with effects on heading date, plant height and grain yield under long-day conditions (Gao et al. 2014; Wei et al. 2010; Xue et al. 2008; Yan et al. 2011; Cai et al. 2021b; Chai et al. 2021; Liu et al. 2021; Sun et al. 2022; Yang et al. 2021; Zong et al. 2021). In addition, Ghd7 controls HD through its enhanced expression under long-day conditions, which represses the expression of Hd3a. Ghd8 delays flowering under long-day conditions by regulating Ehd1, RFT1, and Hd3a. Both Ghd7 and Ghd8 can enhance grain yield by 50%. For PH, 1011 QTLs have also been detected in rice (http://www.gramene.org/qtl) and are distributed on 12 chromosomes. In addition, more than 70 genes have been cloned, while only a few genes have been applied in rice breeding (Cai et al. 2021a; Chen et al. 2019; Gu et al. 2022; Kadambari et al. 2018; Li et al. 2003; Shearman et al. 2022; Zhang et al. 2006; Morales et al. 2020). One of the GA20ox genes in rice (Oryza sativa L.), OsGA20ox2 (sd1), is well known as the “Green Revolution gene”. sd1 is involved in the gibberellin synthesis pathway; loss-of-function mutations at this locus cause semi-dwarfism, which greatly increases crop yield by reducing PH and increasing the harvest index (Qiao and Zhao 2011; Su et al. 2021).
The EUI1 gene positively regulates the uppermost internode (UI) (Luo et al. 2006). In addition, the dwarfing genes d27 and d11 resulted in dwarfed plant height, increased tiller number and higher yield (Guo et al. 2014; Shi et al. 2015; Tong et al. 2018; Wu et al. 2016; Zhu et al. 2015). Rice tiller number is a variable trait that changes over time. Some experts have suggested that tiller numbers and panicles are controlled by genes at maturity (Yang et al. 2006), and the MOC1 gene is a key regulator controlling rice tillering (Zhang et al. 2021). MOC1 is mainly expressed in axillary buds and controls the initiation and outgrowth of axillary meristems (AMs) at both the vegetative and reproductive stages. This gene belongs to the GRAS (GAI, RGA and SCR) family (Sun et al. 2013), whose members mainly encode transcriptional regulators, and has positive regulatory effects on rice tillers and panicle length. The formation and maintenance of axillary buds are controlled by tillering-related genes in rice, such as MOC2, MOC3, TAD1, DLT, D53, IAA6, OsTB1 and OsPIN5b, which have important effects on rice tiller number and rice yield (Guo et al. 2013; Hirano et al. 2017; Jiang et al. 2013; Jung et al. 2015; Koumoto et al. 2013; Lin et al. 2012, 2020; Lu et al. 2015; Tanaka et al. 2015; Xu et al. 2012; Hou et al. 2021; Xia et al. 2020; Yu et al. 2021). In addition, controlling GWPP is a major objective of rice breeding programs, and more than 30 genes have been cloned, including EP3, GW6a, OsCIPK3, OsGASR9, ONAC023, TGW6, GW8.1, GW9.1, OsCKX2 and Os8N3, which have been found to regulate panicle size, spikelet number and density, and grain yield per plant (Borna et al. 2022; Ishimaru et al. 2013; Li et al. 2019, 2022; Piao et al. 2010; Song et al. 2015; Xie et al. 2008, 2006; Yu et al. 2015; Ruan et al. 2020; Tu et al. 2022). The EP3 mutant exhibits erect panicles, and the gene mainly regulates seed size and weight.
The GW6a and OsCIPK3 genes increased rice grain weight per plant by 20.6% and 40%, respectively, compared with those of the wild type. OsGASR9 is a positive regulator of responses to gibberellic acid (GA) in rice, which may regulate PH, grain size and yield through GA. OsNAC23 is an NAC transcription factor that directly represses the transcription of the Tre6P phosphatase gene TPP1 to simultaneously elevate Tre6P and repress trehalose levels, thus facilitating carbon partitioning from source to sink organs. In addition, overexpressing OsNAC23 led to an elevated photosynthetic rate and sink organ size, which consistently increased rice yields (Li et al. 2019, 2022). TGW6, GW8.1 and GW9.1, through pleiotropic effects on source organs, significantly increased yield. OsCKX2 encodes an enzyme that degrades cytokinin (CTK), and loss of function of this gene led to CTK accumulation in the panicle and increased grain weight.

Although numerous genes/QTLs related to HD, PH, ETN, and GWPP have been detected, yield traits are quantitative, and their genetic basis is thus explained only in part by these cloned genes. In addition, the associated molecular and genetic basis is unclear, and the regulatory network requires further study. Moreover, information on the molecular basis of rice panicle development remains fragmented (Li et al. 2011). Therefore, 168 rice germplasm resources were used as materials in this study, and yield-related QTLs were identified through a genome-wide association study (GWAS) and multi-trait GWAS. The objectives of this study were to (1) identify QTLs related to yield traits, including HD, PH, ETN, and GWPP; (2) mine the genes associated with plant architecture and yield traits; (3) detect favourable haplotypes; and (4) provide excellent parents and molecular information for improving plant architecture and grain yield through pyramiding breeding.

Materials and methods

Plant materials and field experiment

A total of 168 germplasm accessions were provided by the State Key Laboratory of Crop Genetics and Germplasm Innovation of Nanjing Agricultural University. The 168 rice accessions were collected from Heilongjiang (19), Hunan (10), Guangdong (4), Jiangsu (74), Tianjin (1), Henan (3), Yunnan (3), Sichuan (4), Guangdong (4), Hubei (1), Hainan (2), Fujian (2), and Anhui (5) provinces and Shanghai (1). Among them, 130 accessions were from China, and the others were from Japan (11), Indonesia (4), the Philippines (2), and Vietnam (21). Detailed information on the rice accessions is given in Table S1, including the name, country of origin, and latitude and longitude of the varieties.

The 168 accessions were planted at the Jiangpu Experimental Station of Nanjing Agricultural University in Nanjing, China, in 2019, and at the Experimental Station of Anhui Agricultural University in Hefei, China, in 2020 and 2021. At both sites, the 168 accessions were sown in mid-May and transplanted in mid-June. Each accession was planted in a two-row plot with 10 individuals per row at a spacing of 26 cm × 16 cm, with two replications per accession in each of the 3 years (2019–2021). A randomized block design was used for the 3-year experiments, and field management followed the standard management practices of local farmers.

Phenotype determination

HD is the number of days from sowing to initial heading. During the heading period, heading was recorded every two days for each accession, and each headed plant was marked to avoid repeated records in subsequent surveys. At maturity, five plants of each accession were harvested, and the PH, ETN and GWPP were measured. PH was measured from the base of the stem to the top of the panicle. ETN was determined by counting the effective panicles (productive tillers) per plant. For GWPP, the seeds of five plants of each accession were harvested, and the seeds of each plant were weighed separately to obtain the mean. Data processing was performed in Excel 2019 (Microsoft), and SPSS (version 25.0) was used for statistical analysis. Welch’s two-sample t test and ANOVA with Duncan’s multiple range test were used in SPSS to compare phenotypes among the haplotypes of the candidate genes. Correlation and frequency analyses of the four yield-related traits (HD, PH, ETN, and GWPP) were conducted using R.
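The two-group comparison mentioned above can be illustrated with a minimal Python sketch that computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom for two groups with unequal variances. This is not the SPSS procedure used in the study; the function name `welch_t` and the sample values are hypothetical and serve only to show the calculation.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's two-sample t test (unequal variances)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Variance of each group mean: s^2 / n (sample variance, n - 1 denominator).
    va = statistics.variance(sample_a) / n_a
    vb = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df

# Hypothetical heading dates (days) for accessions carrying two alleles.
allele_a = [98, 101, 95, 104, 99]
allele_b = [108, 112, 105, 110, 115]
t_stat, dof = welch_t(allele_a, allele_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The p value would then be read from a t distribution with `df` degrees of freedom, which a statistics package provides directly.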

SNP filter analysis

For the 168 accessions to be sequenced, two young leaves were collected from a single plant at the tillering stage, and genomic DNA was extracted using a standard cetyltrimethylammonium bromide protocol. Paired-end sequencing libraries were constructed using 5 µg of genomic DNA, with inserted fragments of approximately 350 bp, and the raw sequences were further processed to remove adaptor contaminants and low-quality reads, resulting in a total of 0.532 Tb of genome sequence data. Library construction, sequencing, and sequence cleaning were performed by Mega Genomics Beijing (http://www.megagenomics.cn/mobile.php/). HaplotypeCaller in GATK 3.8-0 (http://software.broadinstitue.org/gatk/) was used for SNP detection. SNPs with a missing rate ≥ 0.02 or a minor allele frequency (MAF) ≤ 0.05 were excluded in TASSEL (Yang et al. 2014). The software ANNOVAR (Wang et al. 2010) was used to annotate the SNPs against the Nipponbare genomic sequences. The annotation results were divided into intronic regions, exonic regions, splicing sites, and upstream and downstream regions. The SNPs in the exonic regions were divided into synonymous and non-synonymous SNPs; base substitutions at non-synonymous SNPs lead to amino acid changes.
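The filtering rule described above can be sketched in a few lines of Python. This is only an illustration of the exclusion criteria as stated in the text (missing rate ≥ 0.02 or MAF ≤ 0.05), not the TASSEL implementation; the genotype encoding (`"A/G"`, with `"./."` for a missing call) and the function names are assumptions made for the example, and each SNP is assumed to be biallelic.

```python
from collections import Counter

def snp_stats(genotypes):
    """Return (missing_rate, maf) for one biallelic SNP.

    genotypes: non-empty list of diploid calls such as "A/A", "A/G",
    or "./." for a missing call (hypothetical encoding).
    """
    called = [g for g in genotypes if g != "./."]
    missing_rate = 1 - len(called) / len(genotypes)
    alleles = Counter(a for g in called for a in g.split("/"))
    total = sum(alleles.values())
    # MAF is the frequency of the rarer allele; 0 if the site is monomorphic.
    maf = min(alleles.values()) / total if len(alleles) > 1 else 0.0
    return missing_rate, maf

def keep_snp(genotypes, max_missing=0.02, min_maf=0.05):
    """Keep a SNP unless missing rate >= max_missing or MAF <= min_maf."""
    missing_rate, maf = snp_stats(genotypes)
    return missing_rate < max_missing and maf > min_maf

# Hypothetical SNP typed in 100 accessions: 80 A/A and 20 G/G.
print(keep_snp(["A/A"] * 80 + ["G/G"] * 20))
```

Keeping the thresholds as parameters makes it easy to check how many SNPs survive under stricter or looser criteria.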

Population structure analysis

We calculated the genetic distance matrix of SNPs of the 168 rice accessions using VCF2Dis (http://github.com/BGI-shenzhen/VCF2Dis/). To analyse population structure, principal component analysis (PCA), neighbour-joining (Zhang et al. 2015) tree construction and K value analysis were applied. The neighbour-joining (Zhang et al. 2015) tree was constructed and visualized using iTOL (http://itol.embl.de/). PCA was conducted with GCTA (version 1.93.2) software (Yang et al. 2011) on Linux, and the results were plotted in R using the package “ggplot”. PLINK (version 1.9) was used for SNP filtering based on linkage disequilibrium (LD) and for format conversion. The number of subgroups (K) was set from 1 to 10 for the full population and inferred using STRUCTURE (version 2.3.4) (Pritchard et al. 2000), and the optimal K value was determined by ΔK (Evanno et al. 2005). The kinship was calculated using Normalized-IBS in TASSEL (version 5.2.40) software (Bradbury et al. 2007). Manhattan and Q–Q plots were created from the GWAS results using the R package “LDheatmap”.
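To make the ΔK criterion concrete, the following Python sketch computes ΔK from replicate STRUCTURE log-likelihoods. It assumes the commonly used mean-based form, ΔK = |L(K+1) − 2L(K) + L(K−1)| / sd[L(K)], where L(K) is the mean Ln P(D) over replicate runs at a given K. The number of replicate runs is not stated in the text, so the input values below are hypothetical.

```python
import statistics

def evanno_delta_k(lnp_runs):
    """Compute Delta K for every K that has both neighbours K-1 and K+1.

    lnp_runs: dict mapping K -> list of Ln P(D) values from replicate runs
    (at least two runs per K, with non-zero spread).
    """
    means = {k: statistics.mean(v) for k, v in lnp_runs.items()}
    delta = {}
    for k in sorted(lnp_runs):
        if k - 1 in means and k + 1 in means:
            # Absolute second-order difference of the mean log-likelihood.
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            delta[k] = second_diff / statistics.stdev(lnp_runs[k])
    return delta

# Hypothetical replicate log-likelihoods for K = 1..4.
runs = {1: [-1000, -1002], 2: [-800, -802], 3: [-780, -782], 4: [-770, -772]}
delta_k = evanno_delta_k(runs)
best_k = max(delta_k, key=delta_k.get)
print(best_k, delta_k)
```

ΔK is undefined for the smallest and largest K tested, which is why the range of K has to extend beyond the expected number of subgroups.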

GWAS

In this study, we performed GWAS in TASSEL (version 5.2.40) (Barnett et al. 2014) on the obtained 1,220,522 SNPs and four phenotypic datasets of HD, PH, GWPP, and ETN. A MAF ≥ 0.05 was used in the genotype dataset. GWAS results generated with the general linear model (GLM) and the mixed linear model (MLM) were visualized using Q‒Q and Manhattan plots. The Manhattan plot was drawn using the R package “CMplot”. The significance thresholds for GLM and MLM were set at p ≤ 3.78 × 10−8 and p ≤ 1.0 × 10−5, corresponding to − log10(p) ≥ 7.42 and − log10(p) ≥ 5, respectively. SNPs in the same LD region were regarded as one QTL, and the SNP with the smallest p value was regarded as the lead SNP. To test the multi-trait association between each SNP and the four traits in this study, GEMMA v0.98 was used to implement a multivariate linear mixed model (mvLMM). The false discovery rate (FDR) was calculated for significant associations using the Benjamini and Hochberg (1995) correction method.
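The Benjamini and Hochberg adjustment and the conversion of a p value threshold to the − log10 scale can be sketched as follows. This is a generic stdlib illustration of the standard step-up procedure, not the code used by the authors; the function names and the example p values are hypothetical.

```python
import math

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p values (q values) in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def neg_log10(p):
    """Convert a p value to the -log10 scale used on Manhattan plots."""
    return -math.log10(p)

# Hypothetical p values for four SNPs.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
# Thresholds quoted in the text, expressed on the -log10 scale.
print(round(neg_log10(3.78e-8), 2), round(neg_log10(1.0e-5), 2))
```

A SNP is declared significant at a chosen FDR level when its adjusted p value falls at or below that level.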

Identified important QTLs and haplotype analysis of candidate genes

QTLs detected consistently in at least 2 years with both the GLM and MLM were considered candidate QTLs. Candidate regions on chromosomes were estimated based on LD decay distances and GWAS results, and genes in the candidate regions were identified from the rice genome annotation project. First, we analysed the genes in the candidate regions to determine whether SNPs were present in the coding regions. We focused on the associated non-synonymous SNPs by comparison with the reference genome sequence of Nipponbare (http://rice.plantbiology.msu.edu/cgibin/gbrowse/rice/). Candidate genes were identified based on non-synonymous SNPs in exons and gene function. Non-synonymous SNPs in all exons were selected to narrow the range of candidate genes for haplotype analysis and to identify favourable haplotypes. The haplotypes of candidate genes were analysed using the RFGB v2.0 database (https://www.rmbreeding.cn/), which includes the genomic information of 3000 rice accessions (3 K rice genome). Finally, the candidate genes were determined according to the functional annotations.

Prediction of excellent parents

The average positive (negative) haplotype effect (AHE) of a gene locus was calculated as follows:

$$\text{AHE} = \frac{\sum h_{c}}{n_{c}}$$

where hc represents the phenotypic value of the cth haplotype with a positive (negative) effect and nc represents the number of haplotypes with positive (negative) effects at the gene locus. The most promising parents were predicted as those with the largest positive haplotype effect on PH, HD, ETN and GWPP trait-related gene loci for yield trait improvement in rice breeding.
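The AHE calculation can be illustrated with a short Python sketch. It assumes that each haplotype effect is supplied as a signed value, for example the deviation of the haplotype mean from a reference mean; that convention is an assumption made for this example and is not defined in the text. The function name and the input values are hypothetical.

```python
def average_haplotype_effect(effects, positive=True):
    """Average the haplotype effects that share the requested sign.

    effects: signed effect values h_c, one per haplotype at a gene locus.
    Returns None when no haplotype has an effect of the requested sign.
    """
    if positive:
        selected = [h for h in effects if h > 0]
    else:
        selected = [h for h in effects if h < 0]
    if not selected:
        return None
    # AHE = (sum of h_c) / n_c over the selected haplotypes.
    return sum(selected) / len(selected)

# Hypothetical effects (in trait units) for four haplotypes at one locus.
effects = [2.3, -1.1, 0.7, -0.5]
print(average_haplotype_effect(effects, positive=True))
print(average_haplotype_effect(effects, positive=False))
```

Accessions can then be ranked by the summed positive effects of the haplotypes they carry across the trait-related loci.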

Results

Descriptive statistics of the abundant variation in yield traits in the natural population

The means, standard errors (SEs), ranges, coefficients of variation (CVs), and broad-sense heritability (HB2) of HD, PH, ETN, and GWPP were derived in the natural population. There were significant differences in the HD, PH, ETN, and GWPP values among the varieties in the 3 years, with CVs ranging from 11.18 to 39.65% (Table 1). All traits had high broad-sense heritability: more than 99% for HD, PH, and GWPP and more than 91% for ETN. Among the 168 rice accessions, HD was 69–132 days, PH was 56.33–192.30 cm, ETN was 5–27, and GWPP was 5.43–84 g. The 3-year mean values of HD, PH, ETN, and GWPP were 101 days, 113 cm, 9 and 27 g, respectively. The ranges of PH and GWPP differed between 2019 and the other 2 years: across the 3 years, PH ranged from 65.33–192.3 cm, 56.33–179.7 cm and 57.33–182 cm, and GWPP ranged from 6.43–84 g, 6.5–63.33 g, and 6.88–67.81 g.
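For reference, the coefficient of variation reported above is the standard deviation expressed as a percentage of the mean. The sketch below assumes the sample standard deviation (n − 1 denominator); the text does not state which form was used, and the input values are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """CV (%) = sample standard deviation / mean * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical effective tiller numbers for five accessions.
etn = [7, 9, 12, 8, 14]
print(f"CV = {coefficient_of_variation(etn):.2f}%")
```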

Table 1 Descriptive statistics of yield traits

The results above indicated abundant phenotypic variation in the 168 rice accessions. Statistical analysis of the phenotypic values of HD, PH, ETN, and GWPP showed a normal distribution for all traits, indicating that they were quantitative traits controlled by multiple genes (Fig. 1).

Fig. 1

Histogram of the phenotypic frequency distribution of plant type, plant yield and heading date traits in 168 rice accessions: a Heading date; b Grain weight per plant; c Effective tiller number; d Plant height

Genomic variation at the SNP level in the 168 rice accessions

A total of 950 million pairs of 150 bp paired-end reads with an average coverage of 4.36× were obtained using the Illumina resequencing platform. Ninety-five percent of the total reads were mapped to the scaffold of the Nipponbare genome; 3% of reads that did not map to any position or mapped to multiple positions were removed, and SNPs with missing rates of less than 20% were selected. Through filtering, a total of 1,220,522 SNPs were detected. The density of SNPs on all 12 chromosomes was approximately 330 SNPs/Mb. We observed 351,374 SNPs in various gene regions. Of these, 40,050 were synonymous SNPs, and 46,054 were non-synonymous SNPs.

Population structure and LD analysis

We used the 168 rice accessions and SNP markers for population structure analysis (Fig. 2 and Table S2). As the logarithmic likelihood value increased with increasing K (Fig. 2a), the appropriate subgroup number was determined by assessing ΔK. The peak of ΔK appeared at K = 2 (Fig. 2b), suggesting that the whole population could be divided into two groups (Fig. 2c). The two subgroups, indica and japonica, were represented by pop1 and pop2 and included 61 and 107 rice accessions, respectively. The neighbour-joining (Zhang et al. 2015) tree (Fig. 2d) and the PCA (Fig. 2e) confirmed that the population was divided into two groups. The LD decay distance was defined as the distance at which r2 decreased to half of its maximum value. Based on the genome-wide LD decay map (Fig. 2f), the LD decay distance was 112 kb.
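The half-maximum definition of the LD decay distance can be sketched as follows. The function takes mean r2 values binned by physical distance and returns the distance at which r2 first falls to half of its maximum, using linear interpolation between the two neighbouring bins. The binning, the interpolation and the input values are assumptions made for illustration and do not reproduce the authors' calculation.

```python
def ld_decay_distance(distances_kb, mean_r2):
    """Distance (kb) at which mean r2 first drops to half of its maximum.

    distances_kb: bin distances in increasing order.
    mean_r2: mean pairwise r2 for each distance bin.
    Returns None if r2 never reaches half of its maximum.
    """
    half = max(mean_r2) / 2
    for k in range(1, len(mean_r2)):
        if mean_r2[k] <= half < mean_r2[k - 1]:
            # Interpolate linearly between the bins that bracket the half-maximum.
            d0, d1 = distances_kb[k - 1], distances_kb[k]
            r0, r1 = mean_r2[k - 1], mean_r2[k]
            return d0 + (r0 - half) * (d1 - d0) / (r0 - r1)
    return None

# Hypothetical binned LD decay curve.
print(ld_decay_distance([0, 50, 100, 150], [0.8, 0.6, 0.5, 0.3]))
```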

Fig. 2

Genetic structure analysis of the natural population constructed from 168 rice accessions: a Change in logarithmic likelihood with subgroup number; b ΔK value variation with subgroup number; c Natural population structure (K = 2). Blue and red colours represent pop1 and pop2 in Fig. 2b, respectively; d Neighbour-joining tree of the natural rice population. Blue and red colours represent pop1 and pop2 in Fig. 2b, respectively; e Principal component analysis of natural rice populations. Blue and red colours represent pop1 and pop2 in Fig. 2b, respectively; f LD decay analysis of the whole genome in natural rice populations

GWAS of the HD, PH, ETN, and GWPP traits

The GWAS of 168 rice accessions was conducted with the GLM and MLM. There were 18 significant SNPs (QTLs) associated with HD, PH, ETN, and GWPP in the Manhattan plot (Fig. 3). Five QTLs were associated with HD, six QTLs were associated with ETN, four QTLs were associated with GWPP, and three QTLs were associated with PH (Table S3). These SNPs were located on chromosomes 2–9 and 11, with phenotypic variance explained (PVE) values ranging from 13.73 to 25.70%. Chromosomes 2, 4, 5 and 11 showed significant peaks with high − log10 (p) values, suggesting that the derived SNP–trait associations were stable (Barnett et al. 2014). Three QTLs were detected in all 3 years with both models, one QTL was detected in 2 years with both models, and one QTL was detected in all 3 years with the MLM only (Table 2). qHD4 explained 13.73% of the total phenotypic variation; qETN11 explained 18.56% of the total phenotypic variation; qGWPP2.1 and qGWPP2.2 explained 19.72% and 19.24% of the total phenotypic variation, respectively; and qPH5 explained 18.26% of the total phenotypic variation (Fig. 3). No reliable non-synonymous mutations were detected in qGWPP2.1 in the later analysis. Therefore, we focused on the significant SNPs in qETN11, qGWPP2.2 and qPH5. qHD4, the only HD QTL detected in 2 years with both models, was also analysed further.

Table 2 SNP positions for four traits identified by GWAS, the proportions of PVE, and p values in 2019, 2020 and 2021
Fig. 3

Manhattan plots of the genome-wide association results for HD, ETN, GWPP, and PH obtained with the MLM: a Manhattan plot for HD in 2019; b Manhattan plot for ETN in 2019; c Manhattan plot for GWPP in 2019; d Manhattan plot for PH in 2019; e Manhattan plot for HD in 2020; f Manhattan plot for ETN in 2020; g Manhattan plot for GWPP in 2020; h Manhattan plot for PH in 2020; i Manhattan plot for HD in 2021; j Manhattan plot for ETN in 2021; k Manhattan plot for GWPP in 2021; l Manhattan plot for PH in 2021

Multi-trait GWAS of HD, PH, ETN, and GWPP

The four yield-related traits (HD, PH, ETN, and GWPP) were significantly positively correlated (Fig. 4a). Therefore, we conducted a multi-trait GWAS of the 168 rice accessions using the mvLMM. There were 9 significant SNPs (QTLs) associated with all the traits in the Manhattan plot (Fig. 4 and Table 3). Multi-trait GWAS revealed two QTLs associated with the correlated traits, where qMTY4 and qMTY9 in the multi-trait GWAS corresponded to qHD4 and qHD9 in the single-trait GWAS, respectively (Tables 2 and 3). These results showed that multi-trait GWAS detected additional SNPs and confirmed most of the significant SNPs in the single-trait GWAS, suggesting that the former can increase statistical power and complement single-trait GWAS results.

Fig. 4

a Correlation analysis between four traits in 168 rice varieties. *p < 0.05; **p < 0.01; ***p < 0.001 (ANOVA). Manhattan plots of the genome-wide association results for four traits obtained with the mvLMM: b Manhattan plot for 2019; c Manhattan plot for 2020; d Manhattan plot for 2021

Table 3 SNP positions for four traits identified by multi-trait GWAS and p values in 2019, 2020 and 2021

Identification of candidate genes

The annotated genes within 112 kb upstream and downstream of qETN11, qGWPP2.2, qPH5 and qHD4 were identified using the Nipponbare genomic reference sequences (http://rice.plantbiology.msu.edu/cgibin/gbrowse/rice/). Candidate genes playing putative roles in ETN, GWPP, PH and HD were predicted by considering all annotated genes included in the genomic regions indicated above. After checking whether SNPs were present in the coding regions of the genes in the candidate regions, all 20 genes encoding hypothetical proteins were filtered out because no SNPs were found in them. In the phenotype comparison among haplotypes, at least 10 rice accessions were used for each haplotype.
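The window-based selection described above can be sketched as a simple interval-overlap query: every annotated gene that overlaps the interval from 112 kb upstream to 112 kb downstream of a lead SNP is retained. The gene names and coordinates below are hypothetical placeholders and are not taken from the rice annotation.

```python
def genes_in_window(lead_pos, genes, window_bp=112_000):
    """Return the names of genes overlapping lead_pos +/- window_bp.

    genes: iterable of (name, start, end) tuples on the same chromosome.
    """
    lo, hi = lead_pos - window_bp, lead_pos + window_bp
    # A gene overlaps the window if it starts before the window ends
    # and ends after the window starts.
    return [name for name, start, end in genes if start <= hi and end >= lo]

# Hypothetical annotation records on one chromosome.
annotation = [
    ("geneA", 100_000, 105_000),
    ("geneB", 500_000, 510_000),
    ("geneC", 611_000, 620_000),
    ("geneD", 613_000, 615_000),
]
print(genes_in_window(500_000, annotation))
```

The default window is the LD decay distance reported for this population; a different population would need its own value.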

Identification of a candidate gene for qETN11

For qETN11, we obtained 4 candidate genes associated with significant SNP loci in the 10.0–10.8 Mb region on chromosome 11 (Fig. 5b; Table S4). Non-synonymous mutations were found in 2 of the 4 genes (Table S5). We analysed the haplotypes of the two genes with non-synonymous mutations and found that the differences among haplotypes of LOC_Os11g18366 (cycloartenol synthase) were less significant than those among haplotypes of LOC_Os11g18570 (cytochrome P450), so we selected LOC_Os11g18570 for further study. LOC_Os11g18570 encodes a cytochrome P450 protein. According to previous studies, CYP450 is involved in the synthesis of auxin precursors and promotes plant growth and development (Chaban et al. 2003). LOC_Os11g18570 contained three non-synonymous SNPs with amino acid changes: (A/C) from isoleucine to leucine, (C/G) from glutamine to glutamate, and (C/T) from alanine to valine. Based on these SNPs, the 168 rice accessions were divided into three haplotypes (Fig. 5c). The average ETN of HapA was 9.32 ± 3.25, that of HapB was 12.54 ± 5.53, and that of HapC was 9.72 ± 2.84. Haplotype analysis of the whole population showed that the ETN of HapB was significantly higher than that of HapA and HapC. There was no significant difference in ETN between HapA and HapC (Fig. 5d–f).

Fig. 5

Haplotype analysis of the candidate gene LOC_Os11g18570: a Manhattan plots for ETN. Lines represent significance thresholds; b Local Manhattan plot (top) and linkage disequilibrium heatmap (bottom), where the candidate region lies between the green dashed lines; c Schematic representation of LOC_Os11g18570 structure and single-nucleotide polymorphisms in gene cDNA between HapA, HapB, and HapC. Black boxes indicate exons; d Box plots for ETN in the three haplotypes of LOC_Os11g18570 in all accessions in 2019; e Box plots for ETN in the three haplotypes of LOC_Os11g18570 in all accessions in 2020; f Box plots for ETN in the three haplotypes of LOC_Os11g18570 in all accessions in 2021. The number of accessions (n) of each haplotype in each panel is given under the x-axis. Boxes show the median and upper/lower quartiles. Whiskers extend to 1.5× the inter-quartile range, with any remaining points indicated with dots. **p < 0.01; ***p < 0.001 (ANOVA). Letters indicate significant differences, p < 0.05 (Duncan’s multiple comparison test)

Identification of a candidate gene for qGWPP2.2

The 3.89–4.12 Mb region on chromosome 2 contained 24 genes (Fig. 6b; Table S6). Non-synonymous mutations occurred in 15 of the 24 genes (Table S7). After eliminating genes encoding hypothetical proteins, retrotransposons, and transposon proteins, we analysed the haplotypes of the remaining genes. Most of them showed little phenotypic variation among haplotypes, whereas the gene encoding a cytochrome P450 protein showed clear differences and belongs to a family that has been studied more extensively. Based on these differences and the known functions, we focused on one candidate gene (LOC_Os02g07680: cytochrome P450). The 168 rice accessions were divided into three haplotypes based on the SNPs in the cDNA of LOC_Os02g07680 (Fig. 6c). LOC_Os02g07680 contained three non-synonymous SNPs with amino acid changes: (T/A) from aspartic acid to glutamate, (G/A) from arginine to histidine, and (G/T) from glutamine to histidine. The average GWPP of HapA was 28.96 ± 7.88 g, that of HapB was 25.30 ± 6.22 g, and that of HapC was 29.41 ± 10.46 g. Haplotype analysis of the whole population showed no significant difference in GWPP between HapA and HapC, whereas the GWPP of HapB differed significantly from that of the other two haplotypes (Fig. 6d–f).

Fig. 6

Haplotype analysis of the candidate gene LOC_Os02g07680: a Manhattan plots for GWPP. Lines represent significance thresholds; b Local Manhattan plot (top) and linkage disequilibrium heatmap (bottom), where the candidate region lies between the green dashed lines; c Schematic representation of LOC_Os02g07680 structure and single-nucleotide polymorphisms in gene cDNA between HapA, HapB, and HapC. Black boxes indicate exons; d Box plots for GWPP in the three haplotypes of LOC_Os02g07680 in all accessions in 2019; e Box plots for GWPP in the three haplotypes of LOC_Os02g07680 in all accessions in 2020; f Box plots for GWPP in the three haplotypes of LOC_Os02g07680 in all accessions in 2021. The number of accessions (n) for each haplotype in each panel is given under the x-axis. Boxes show the median and upper/lower quartiles. Whiskers extend to 1.5× the inter-quartile range, with any remaining points indicated with dots. **p < 0.01 (ANOVA). Letters indicate significant differences, p < 0.05 (Duncan’s multiple comparison test)

Identification of a candidate gene for qPH5

The 5.0–5.9 Mb region on chromosome 5 contained 17 genes, including 7 with non-synonymous mutations (Fig. 7b; Tables S8; S9). Based on the haplotype results and previous reports on gene function, we selected LOC_Os05g09630, which encodes a homeobox domain-containing protein, for further study. A previous study showed that homeodomain proteins play key roles in controlling developmental programs in various organisms (Xu et al. 2015). The accessions were divided into three haplotypes (Fig. 7c). A total of 98 rice accessions were assigned to HapA, 12 rice accessions were assigned to HapB, and 58 rice accessions were assigned to HapC. LOC_Os05g09630 contained two non-synonymous SNPs with amino acid changes: (T/G) with a serine to alanine change and (C/T) with a serine to leucine change. The average PH of HapA was 102.71 ± 17.81 cm, that of HapB was 148.92 ± 26.10 cm, and that of HapC was 124.86 ± 21.84 cm. Haplotype analysis of the whole population showed that PH significantly differed among the three haplotypes (Fig. 7d–f). Among them, HapB had the highest value, and HapA had the lowest.

Fig. 7

Haplotype analysis of the candidate gene LOC_Os05g09630: a Manhattan plots for PH. Lines represent significance thresholds; b Local Manhattan plot (top) and linkage disequilibrium heatmap (bottom), where the candidate region lies between the green dashed lines; c Schematic representation of LOC_Os05g09630 structure and single-nucleotide polymorphisms in gene cDNA between HapA, HapB, and HapC. Black boxes indicate exons; d Box plots for PH in the three haplotypes of LOC_Os05g09630 in all accessions in 2019; e Box plots for PH in the three haplotypes of LOC_Os05g09630 in all accessions in 2020; f Box plots for PH in the three haplotypes of LOC_Os05g09630 in all accessions in 2021. The number of accessions (n) of each haplotype in each panel is given under the x-axis. Boxes show the median and upper/lower quartiles. Whiskers extend to 1.5× the inter-quartile range, with any remaining points indicated with dots. ***p < 0.001 (ANOVA). Letters indicate significant differences, p < 0.05 (Duncan’s multiple comparison test)

Identification of a candidate gene for qHD4

qHD4 was detected on chromosome 4 in 2020 and 2021. Three of the 16 candidate genes in the 28.5–30.0 Mb region had non-synonymous mutations (Fig. 8a; Tables S10; S11). We analysed the haplotypes of the three genes with non-synonymous mutations and found that the differences among haplotypes of LOC_Os04g48940 (uncharacterized mscS family protein) and LOC_Os04g49194 (naringenin, 2-oxoglutarate 3-dioxygenase) were less significant than those among haplotypes of LOC_Os04g49210 (naringenin, 2-oxoglutarate 3-dioxygenase), so we selected LOC_Os04g49210 for further study. LOC_Os04g49210 encodes naringenin, 2-oxoglutarate 3-dioxygenase. Naringenin alters auxin redistribution via VrPIN1, leading to morphological alterations and significantly reducing the protein precipitable tannins that further enhance protein accumulation and bioavailability (Sharma et al. 2020). It also accelerates the accumulation of plant nutrients. LOC_Os04g49210 had only one non-synonymous SNP with an amino acid change, (T/G) from leucine to arginine (Fig. 8b). The average HD of the two alleles was 100.11 ± 11.63 and 109 ± 11.26 days, respectively. ANOVA showed that HD significantly differed between the two alleles, with the B allele showing a larger value than the A allele (Fig. 8c–e).

Fig. 8

Haplotype analysis of the candidate gene LOC_Os04g49210: a Local Manhattan plot (top) and linkage disequilibrium heatmap (bottom), where the candidate region lies between the green dashed lines; b Schematic representation of LOC_Os04g49210 structure and single-nucleotide polymorphisms in LOC_Os04g49210 cDNA between the two alleles. Black boxes indicate exons; c Box plots for HD in the two haplotypes of LOC_Os04g49210 in all accessions in 2019; d Box plots for HD in the two haplotypes of LOC_Os04g49210 in all accessions in 2020; e Box plots for HD in the two haplotypes of LOC_Os04g49210 in all accessions in 2021. The number of accessions (n) of each allele in each panel is given under the x-axis. Boxes show the median and upper/lower quartiles. Whiskers extend to 1.5× the interquartile range, with any remaining points indicated with dots. **p < 0.01; NS not significant (Welch’s two-sample t test)

Haplotype distribution of candidate genes

The haplotype results of the candidate genes were further analysed by geographical region and subgroup (Fig. 9). The favourable haplotypes HapA and HapC of LOC_Os11g18570 were randomly distributed in the indica and japonica subgroups and across the different latitude regions, except for northeastern China (NEC). The non-favourable HapB haplotype of LOC_Os02g07680 was mainly distributed in the indica subgroup and in high-latitude regions such as eastern China (EC), Southwest China (SWC), South China (SC), Central China (CC), and Japan (JP). Favourable haplotypes HapA and HapC of LOC_Os02g07680 and HapB and HapC of LOC_Os05g09630 were mainly distributed in the indica subgroup and in low-latitude regions such as SC, CC, SWC and Southeast Asia (SEA). In contrast, the non-favourable HapB haplotype of LOC_Os02g07680 and HapA haplotype of LOC_Os05g09630 were mainly distributed in the japonica subgroup and in high-latitude regions such as EC, NEC and JP, and PH and GWPP decreased with increasing latitude (Fig. 9). Similarly, the favourable B allele of LOC_Os04g49210 was mainly distributed in the japonica subgroup and in high-latitude regions, such as EC, NEC, and JP. In contrast, the non-favourable A allele of LOC_Os04g49210 was mainly distributed in the indica subgroup and at low latitudes, such as in SWC, SEA, SC, and CC. The favourable haplotypes of the three candidate genes were mainly distributed in indica rice and low-latitude regions. Therefore, the accessions with greater ETN and GWPP and higher PH were mainly distributed in the indica subgroup and in low-latitude regions.

Fig. 9

Haplotypes of candidate genes in seven geographic groups and two subgroups. Blue represents favourable haplotypes, and red represents non-favourable haplotypes. The green box represents grouping by region, and the yellow box represents grouping by subgroup. EC eastern China, JP Japan, NEC northeastern China, SWC Southwest China, SEA Southeast Asia, CC Central China, SC southern China

Excellent parental combinations predicted

A total of eleven haplotypes were identified through phenotypic data and gene analysis. Four haplotypes showed negative effects, and seven haplotypes showed positive effects (Table S12). All excellent parents predicted on the basis of the four traits belonged to the indica subgroup, and the favourable haplotypes significantly increased the trait values (Table 4). Taking Baikenuo (indica) as an example, introducing the superior haplotype of the LOC_Os02g07680 gene could theoretically increase GWPP by 2.3 g. Excellent parents among the other accessions were predicted in the same way.

Table 4 Excellent parents predicted on the basis of ETN, GWPP, PH and HD

Discussion

In this study, the variation in yield traits was high across the 168 rice germplasm accessions, with CVs ranging from 11.18 to 39.65%. This range is similar to those reported in other studies. Mongiano et al. (2020) chose 40 rice accessions from a collection of 351 genotypes for yield trait identification and reported CVs ranging from 5.9 to 45.4%, including 6.8% for HD and 22.0% for grain weight (GW). Sohrabi et al. (2012) chose 50 accessions of upland rice from Peninsular Malaysia and Sabah for yield trait measurement and reported CVs ranging from 15.86 to 40.94%, with CVs of 15.86% for PH and 25.15% for ETN.
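For reference, the CV of a trait is its standard deviation expressed as a percentage of its mean. The sketch below assumes the sample standard deviation (n − 1 denominator), which is an assumption because the estimator is not stated here; the plant height values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the CV in percent: sample standard deviation / mean * 100."""
    return stdev(values) / mean(values) * 100


# Hypothetical plant heights (cm), invented for illustration only.
plant_height_cm = [90.0, 100.0, 110.0]

cv = coefficient_of_variation(plant_height_cm)
print(f"CV = {cv:.2f}%")
```

Because the CV is unitless, it allows the spread of traits measured on different scales, such as HD in days and PH in centimetres, to be compared directly.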

The study showed that sequencing at 2.5× average genome coverage can precisely detect candidate genes and that, compared with deep sequencing, low-coverage whole-genome sequencing provides an effective strategy for GWAS in rice (Wang et al. 2016; Wu et al. 2015). If population structure is not accounted for when the model is solved, site effects can be seriously overestimated, resulting in false-positive results. Determining population structure can therefore prevent false-positive associations between phenotypes and genotypes that arise from LD in natural populations (Pritchard and Rosenberg 1999). We detected two subpopulations in the structure analysis, which is consistent with the PCA and phylogenetic tree results. This agreement supports the rigour and credibility of the research process.
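The way in which unaccounted population structure inflates a site effect can be illustrated with a toy example. The sketch below is not the GLM or MLM used in this study; it is a simple stratified comparison on invented data in which the allele has no effect within either subgroup, yet a pooled comparison suggests a large effect because allele frequency and phenotype both differ between subgroups.

```python
from statistics import mean

# Hypothetical records: (subgroup, allele, phenotype). Invented for illustration.
# Within each subgroup, carriers and non-carriers have the same mean phenotype.
records = [
    ("indica", 1, 100.0), ("indica", 1, 102.0), ("indica", 1, 98.0), ("indica", 0, 100.0),
    ("japonica", 0, 80.0), ("japonica", 0, 82.0), ("japonica", 0, 78.0), ("japonica", 1, 80.0),
]


def allele_effect(rows):
    """Mean phenotype of allele-1 carriers minus that of allele-0 carriers."""
    carriers = [phenotype for _, allele, phenotype in rows if allele == 1]
    others = [phenotype for _, allele, phenotype in rows if allele == 0]
    return mean(carriers) - mean(others)


def stratified_effect(rows):
    """Average of the within-subgroup allele effects (structure accounted for)."""
    subgroups = sorted({subgroup for subgroup, _, _ in rows})
    effects = [allele_effect([r for r in rows if r[0] == s]) for s in subgroups]
    return mean(effects)


naive_effect = allele_effect(records)         # pooled, ignores subgroup
adjusted_effect = stratified_effect(records)  # compared within subgroups

print(f"naive effect:    {naive_effect:.1f}")
print(f"adjusted effect: {adjusted_effect:.1f}")
```

In this toy data set the pooled estimate reflects the difference between subgroups rather than any effect of the allele, which is the kind of spurious signal that structure correction is intended to remove.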

Multi-trait GWAS is usually used to detect QTLs that are associated with multiple traits. The stronger the genetic and phenotypic correlations between traits are, the higher the statistical power of the multi-trait GWAS (Porter and O'Reilly 2017). The single-trait GWAS detected 18 QTLs associated with the yield traits. The multi-trait GWAS identified 10 QTLs associated with the four yield traits, and colocalization with the single-trait GWAS identified three QTLs associated with related traits. The QTLs in our study were compared with those in previous studies. Three QTLs were similar to those reported, and 15 QTLs were newly found in the GLM or MLM (Fig. 10). DTH2 is located on Chr2 (30.9 Mb) (Wu et al. 2013), close to PIN1 (31.1 Mb) (Xu et al. 2005), which promotes rice heading and tillering. CYP714B2 (12.2 Mb) and ACE1 (12.9 Mb) are located on Chr3 and mainly regulate rice PH (Magome et al. 2013; Nagai et al. 2020). RCN11 (24.8 Mb) and OsCEP6.1 (23.4 Mb) both decrease the tiller number and PH of rice (Takano et al. 2015). The QTL qETN3 (18.7 Mb) in this study was near the cloned gene OsLBD37 (18.9 Mb), which encodes a DUF260 domain-containing protein. OsLBD37 limits the growth of aboveground parts and negatively regulates tillering in rice (Tu et al. 2022). Ghd8 (4.3 Mb), which was found to be near qETN8.2 (4.6 Mb), encodes a histone-like transcription factor and archaeal histone. This gene upregulates the expression of MOC1 and controls tiller and lateral branch development (Yan et al. 2011). qPH9 (6.7 Mb) was near the cloned gene OsGSK2 (6.6 Mb) (Sun et al. 2018), which belongs to the kinase group comprising CDA, MAPK, GSK3 and CLKC kinases; OsGSK2 can regulate cell division and promote rice mesocotyl elongation.

Fig. 10

QTLs detected in natural populations in 2019, 2020 and 2021 with the GLM, MLM or mvLMM. Red colours represent genes or QTLs reported in previous studies

Both qETN11 and qGWPP2.2 were detected in 2019, 2020 and 2021 with the GLM and MLM (Fig. 3). The genes LOC_Os11g18570 and LOC_Os02g07680 encode cytochrome P450 proteins. Previous studies have shown that cytochrome P450 proteins are broad-spectrum biocatalytic enzymes that are widely distributed across the tree of life and are involved in a variety of metabolic reactions. They participate in the metabolism of endogenous substances and the degradation of exogenous substances (Hofer et al. 2014; Pinot and Beisson 2011). In addition, cytochrome P450 proteins participate in the gibberellin regulation of crop growth (Zhu et al. 2006). The protein functions of the two genes are related to the growth and development of plant vegetative or reproductive organs. Therefore, we predicted that the genes LOC_Os11g18570 and LOC_Os02g07680 may be candidate genes for effective tiller number and grain weight per plant, respectively, in rice. The candidate gene LOC_Os04g49210 of qHD4 encodes naringenin, 2-oxoglutarate 3-dioxygenase, which controls Vrpin1 to regulate the distribution of auxin in plants and thereby affects protein accumulation and bioavailability (Sharma et al. 2020). An adequate supply of nutrients can enable plants to enter reproductive growth earlier. LOC_Os05g09630 encodes a homeobox domain-containing protein. The homeobox is a conserved DNA motif, and the proteins it encodes act as transcription factors that control the actions of other genes by binding to segments of DNA (Xu et al. 2015). Therefore, LOC_Os04g49210 and LOC_Os05g34600 may be candidate genes affecting rice HD and PH, respectively.

The PH, HD, ETN and GWPP of rice could be improved by using the optimal alleles detected in this study. Among these traits, PH and GWPP decrease with increasing latitude (Fig. 9), as does HD, whereas ETN exhibits no obvious change across latitudes. All six predicted parents belong to the indica subgroup, which suggests that indica rice performs better in terms of PH, HD, ETN and GWPP. The performance of all predicted superior parents in this study needs further verification in production contexts.

Conclusions

In this study, HD, PH, ETN and GWPP were assessed in a panel of 168 diverse rice accessions in 2019, 2020 and 2021. Four candidate genes, LOC_Os11g18570, LOC_Os02g07680, LOC_Os04g49210 and LOC_Os05g34600, were identified by GWAS and haplotype analysis. The gene LOC_Os04g49210 is a candidate gene located in a segment colocalized by the single-trait and multi-trait GWAS. Six indica rice accessions were predicted to be excellent parents on the basis of favourable alleles for the HD, PH, ETN and GWPP traits. These results provide a molecular basis for high-yield rice breeding and information on optimal parents.