Introduction

Soybean oil is an important edible vegetable oil that accounted for 53% of U.S. vegetable oil consumption in 2017 (American Soybean Association 2018). As the predominant saturated fatty acid, palmitic acid (16:0) typically accounts for 11% of conventional soybean oil (Fehr 2007). Although elevated palmitic acid content improves the oxidative stability of soybean oil, it also decreases oleic acid and oil contents (Stoltzfus et al. 2000). Conversely, reducing palmitic acid content has been reported to reduce the risk of developing cardiovascular diseases in humans (Hu et al. 1997). To produce edible oil with < 7% total saturates, as required by the U.S. Food and Drug Administration, several soybean lines with a reduced palmitic acid phenotype have been identified as potential genetic resources for developing low palmitic acid cultivars (Rebetzke et al. 1998). In the fatty acid biosynthetic pathway, 16:0-ACP fatty acid thioesterase (FATB) is the major target for genetically reducing the level of palmitic acid in soybean seeds.

Plant acyl–acyl carrier protein (ACP) thioesterase (TE), an enzyme that terminates plastidial fatty acid biosynthesis, catalyzes acyl-ACP thioester bond hydrolysis to release free fatty acids and ACP. The substrate specificity of individual TEs is essential for determining the chain length of fatty acids exported from the plastid (Hills 1999; Voelker 1996). The online database ThYme contains 25 TE families, of which Family TE14 includes bacterial and plant acyl-ACP TEs (Cantu et al. 2010). Based on amino acid sequence alignment and substrate specificity, the plant TEs have been categorized into two classes, FATA and FATB (Voelker et al. 1997). The FATA class primarily hydrolyzes 18:1-ACP with minor activity toward saturated acyl-ACP substrates, while the FATB class shows a preference for acyl-ACP with saturated fatty acyl chains (Dormann et al. 1995; Salas and Ohlrogge 2002). Both classes contain two helix/multi-stranded sheet motifs (hotdog domains), in which residues in the N-terminal domain affect the substrate specificity of the enzymes and highly conserved residues in the C-terminal domain are involved in catalysis (Mayer and Shanklin 2005). The two thioesterases maintain the saturated/unsaturated balance of membrane fatty acids for normal plant growth under critical conditions (Bonaventure et al. 2003).

As an allotetraploid crop species, soybean possesses a highly duplicated genome in which ~ 75% of the genes are present in multiple copies. Two whole-genome duplication events have occurred in the soybean genome, including one shared by legume species 59 million years ago and another Glycine-specific one around 13 million years ago. The number of genes involved in acyl lipid biosynthesis in soybean is almost double that in Arabidopsis (Schmutz et al. 2010). The gene families involved in fatty acid synthesis are generally much larger in soybean, such as omega-6 fatty acid desaturase (FAD2) with seven members (Lakhssassi et al. 2021). Such genetic redundancy drastically increases the complexity of the genetic basis behind agronomically important traits but provides an invaluable resource for breeding desired phenotypes.

From mutagenized soybean lines, five quantitative trait loci (QTLs) have been associated with the low palmitic acid phenotype, including fap1 in C1726, fap* in ELLP2, fap3 in A22, sop1 in J3, and fapnc in N79-2077-12 (Burton et al. 1994; Cardinal et al. 2014; De Vries et al. 2011; Rahman et al. 1996; Stijšin et al. 1998). With the exception of fapnc, which is allelic with fap3, the loci fap1, fap3, and fap* are independent alleles conferring low palmitic acid content (Primomo 2000; Schnebly et al. 1994). At fap1, a splicing-disrupting mutation in a 3-ketoacyl-ACP synthase III enzyme (GmKASIIIA) has been associated with the reduced palmitic acid phenotype (Cardinal et al. 2014). At fap3, a single nucleotide polymorphism (SNP) causes loss of function of GmFATB1A (De Vries et al. 2011). Fapnc represents the second allele of GmFATB1A, in which a deletion is responsible for the low palmitic acid phenotype (Cardinal et al. 2007). Thapa et al. (2016) identified two additional alleles of GmFATB1A from soybean mutant lines with a 30% reduction in palmitic acid content. More recently, a 254-kb genomic deletion that includes the GmFATB1A gene has been reported to result in reduced palmitic acid content in soybean seeds (Bachleda et al. 2016; Goettel et al. 2016). Alternatively, downregulation of GmFATB gene expression can reduce palmitic acid content in soybean seeds (Buhr et al. 2002; Wilson et al. 2001). In Arabidopsis, a FATB knockout mutant showed not only low saturated fatty acid content but also slow seedling growth and the development of seeds with low viability (Bonaventure et al. 2003). However, no FATB soybean mutants with a negative impact on soybean growth or seed quality have been reported so far.

TILLING (Targeting Induced Local Lesions IN Genomes) was developed in the early 2000s to screen for induced mutations in chemically mutagenized populations (McCallum et al. 2000). It combines traditional chemical mutagenesis with a high-throughput mutation screening method. Ethyl methanesulfonate (EMS) is the most commonly used chemical mutagen for randomly creating point mutations in plant genomes (Anderson et al. 2018; Kandoth et al. 2017; Koornneef et al. 1982; Lakhssassi et al. 2020a). A large number of TILLING populations have been developed in a variety of plant species, such as barley, legumes, maize, rice, sorghum, and wheat (Till et al. 2009). Using reverse genetic methods like TILLING, researchers have studied gene function for economically important traits in soybean, such as disease resistance and seed oil composition. Two missense mutations in the GmSHMT08 gene were identified in soybean cv. ‘Forrest’ mutant populations and resulted in alteration of the SCN-resistant phenotype (Liu et al. 2012). Three missense mutations in individual soybean lines were detected in GmFAD2-1A, and one of them leads to high oleic acid and low linoleic acid contents in the seed oil (Dierking and Bilyeu 2009). However, the complex traits resulting from the duplicated soybean genome dramatically lower the efficiency of mutation screening in soybean. Using gel-based TILLING, a recent study found no mutations in either GmFAD2-1A or GmFAD2-1B among 2,000 EMS-mutagenized soybean lines, whereas five mutants in either of the targeted genes were identified using forward phenotypic screening followed by targeted sequencing analysis (Lakhssassi et al. 2017). The adoption of exome capture sequencing has enabled high-throughput screening for hidden mutations in multiple homologous wheat genes controlling one trait (Krasileva et al. 2017). More recently, we developed a versatile TILLING-by-Sequencing+ technology and discovered novel genes associated with improved seed stearic acid content in soybeans (Lakhssassi et al. 2020b).

In the current study, we characterized the soybean acyl-ACP thioesterase gene family through a comprehensive analysis of phylogeny, gene structure and expression, synteny, and conserved domain variations, and identified six additional members of the GmFATB gene family. Using TILLING-by-Sequencing+, we discovered for the first time that EMS-induced mutations in GmFATA1A result in high oleic acid content in soybean seed. Mutations at four GmFATB members are also associated with low palmitic acid and high oleic acid contents. These GmFAT mutants are valuable sources for breeding new soybean cultivars with low saturated and high monounsaturated fatty acid contents.

Materials and methods

Identification of FATA and FATB from soybean and other plant species

The putative soybean acyl-ACP thioesterase genes were identified by BLASTP searches against the soybean reference genome (Glycine max, Wm82.a2.v1) at Phytozome (v12.1) using Arabidopsis thaliana acyl-ACP thioesterase protein sequences as queries (https://phytozome.jgi.doe.gov). Using the same approach, the putative acyl-ACP thioesterases were identified from the reference genomes of Phaseolus vulgaris (v2.1), Medicago truncatula (Mt4.0v1), Brassica rapa FPsc (v1.3), Oryza sativa (v7_JGI), Lotus japonicus genome assembly build 3.0 (http://www.kazusa.or.jp/lotus/), Elaeis guineensis assembly EG5 (https://www.ncbi.nlm.nih.gov/genome/2669), and Cocos nucifera assembly ASM812446v1 (https://www.ncbi.nlm.nih.gov/genome/?term=Cocos+nucifera). A total of 50 identified protein sequences with accession numbers were included in this study.

Phylogenetic analysis

Multiple sequence alignments of the full-length acyl-ACP thioesterase protein sequences from nine plant species were performed with MUltiple Sequence Comparison by Log-Expectation (MUSCLE). An unrooted phylogenetic tree was then constructed by maximum likelihood (ML) method in MEGA X using Jones-Taylor-Thornton Gamma Distributed (JTT + G) model for all FAT genes and JTT + G + I model with Invariant Sites (I) for soybean FAT genes (Hall 2013; Kumar et al. 2018).

Gene structure, expression profiling, and conserved domain analysis

The genomic and coding sequences of soybean acyl-ACP thioesterase genes retrieved from Phytozome v12.1 were aligned to generate the exon–intron structure diagram using the Gene Structure Display Server (Hu et al. 2015). To analyze the tissue-specific expression of soybean acyl-ACP thioesterase genes, normalized transcript data from six different tissues were downloaded from Soybase (https://www.soybase.org/soyseq/). The expression profiles were visualized as a heatmap using Heatmapper (Babicki et al. 2016). Following multiple sequence alignment of FATA and FATB from soybean and A. thaliana, candidate substrate-specifying residues were proposed based on the criteria described by Jing et al. (2018). Catalytic residues in conserved motifs of soybean acyl-ACP thioesterases were identified from the NCBI Conserved Domain Database (CDD) (https://www.ncbi.nlm.nih.gov/cdd).

Chromosomal localization and syntenic analysis

The locations of soybean acyl-ACP thioesterase genes on their corresponding chromosomes were drawn based on soybean genome annotation a2.v1 on SoyBase. Syntenic analysis was performed using soybean acyl-ACP thioesterase genes as locus identifiers in the plant genome duplication database (PGDD) (Lee et al. 2012). The ratios of nonsynonymous (Ka) to synonymous (Ks) substitution rates were calculated from the values retrieved from PGDD. For gene pairs whose information is not available at PGDD, the PAL2NAL program was used to estimate Ka and Ks (Suyama et al. 2006). Given the Ks values and a rate of $6.1 \times 10^{-9}$ substitutions per site per year, the divergence time (T) of each gene pair was calculated as $T = \mathrm{Ks} / (2 \times 6.1 \times 10^{-9}) \times 10^{-6}$ Mya (Chen et al. 2014).
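The divergence-time formula in the preceding paragraph can be expressed as a short calculation. The Python sketch below is illustrative only: the function name and the example Ks value are assumptions introduced here, not values taken from Table 2.

```python
# Synonymous substitution rate stated in the text
# (substitutions per synonymous site per year).
RATE = 6.1e-9


def divergence_time_mya(ks: float, rate: float = RATE) -> float:
    """Return the divergence time of a gene pair in million years ago (Mya).

    Ks / (2 * rate) gives the time in years; the factor 1e-6 converts it to Mya.
    """
    if ks < 0:
        raise ValueError("Ks must be non-negative")
    return ks / (2 * rate) * 1e-6


# Hypothetical Ks value, chosen only to illustrate the arithmetic.
example_ks = 0.09
print(f"Ks = {example_ks} -> T = {divergence_time_mya(example_ks):.2f} Mya")
```

Because T scales linearly with Ks, a pair with ten times the Ks of another is estimated to have diverged ten times earlier under this constant-rate assumption.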

Development of EMS-mutagenized soybean populations

EMS mutagenesis was performed as described previously (Meksem et al. 2008). Seeds of soybean cv. Forrest and PI88788 were used to generate the M2 population in the greenhouse at the SIUC Horticulture Research Center (HRC). Forrest carries Peking-type and PI88788 carries PI88788-type SCN resistance, with the major loci being Rhg4 + rhg1-a and rhg1-b, respectively. A total of 4032 M2 lines were advanced to the M3 generation by single-seed descent in the field between 2012 and 2015. M3 seeds from each mutant line were harvested, threshed, and stored at − 20 °C.

Mutation detection and validation

The mutations in five soybean acyl-ACP thioesterase genes were detected using the TILLING-by-Sequencing+ method. A subset of mutations at GmFATA1A, GmFATB1A, GmFATB1B, GmFATB2A, and GmFATB2B was confirmed by Sanger sequencing. PCR primers were designed to amplify the fragments covering the exons of three soybean acyl-ACP thioesterase genes using Primer3 (Koressaar and Remm 2007). The PCR program consisted of 30 cycles of amplification at 94 °C for 30 s, 52 °C for 30 s, and 72 °C for 1 min. The PCR products were then purified using the QIAquick Gel Extraction Kit (QIAGEN, Valencia, CA, USA). The purified samples were sent for sequencing at GENEWIZ (https://www.genewiz.com/). The putative mutations were identified by aligning the sample sequences to the reference using Unipro UGENE (Okonechnikov et al. 2012).

Fatty acid analysis of seeds from GmFAT mutants

The contents of five major fatty acids were measured in selected M2/M3 lines according to the two-step methylation procedure (Kramer et al. 1997). At least three seeds per line were crushed individually in 16 mm × 200 mm tubes with Teflon-lined screw caps. Two milliliters of sodium methoxide were added to each tube, followed by incubation at 50 °C for 10 min. After cooling for 5 min, the samples were mixed with 3 mL of 5% (v/v) methanolic HCl, incubated at 80 °C for 10 min, and cooled for 7 min. Then, 7.5 mL of 6% (w/v) potassium carbonate and 1 mL of hexane were added to each tube, and the tubes were centrifuged at 1200 g for 5 min. The upper layer was transferred to vials, from which the individual fatty acid content was determined by gas chromatography as a percentage of the total fatty acids of the soybean seed. A Shimadzu GC-2010 (Columbia, MD) gas chromatograph fitted with a flame ionization detector was equipped with a Supelco 60-m SP-2560-fused silica capillary famewax column (0.25 mm i.d. × 0.25 μm film thickness). The standard fatty acids were run first to create a calibration reference.
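Expressing each fatty acid as a percentage of total fatty acids is a simple normalization. The following Python sketch assumes that calibrated peak areas are already available for each fatty acid; the function name and the peak-area values are hypothetical and are not measurements from this study.

```python
def fatty_acid_percentages(peak_areas: dict[str, float]) -> dict[str, float]:
    """Convert calibrated GC peak areas into percentages of total fatty acids."""
    total = sum(peak_areas.values())
    if total <= 0:
        raise ValueError("total peak area must be positive")
    return {name: 100.0 * area / total for name, area in peak_areas.items()}


# Hypothetical calibrated peak areas for the five major soybean fatty acids
# (palmitic 16:0, stearic 18:0, oleic 18:1, linoleic 18:2, linolenic 18:3).
areas = {"16:0": 220.0, "18:0": 80.0, "18:1": 460.0, "18:2": 1080.0, "18:3": 160.0}

for name, pct in fatty_acid_percentages(areas).items():
    print(f"{name}: {pct:.1f}%")
```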

Results

Identification of plant acyl-ACP thioesterase gene family members in soybean

Four FATB genes have previously been identified in soybean, of which GmFATB1A is associated with reduced palmitic acid content (Cardinal et al. 2007). To identify the putative members of the TE family in soybean, a BLASTP search against the soybean genome database (Wm82.a2.v1) was performed using A. thaliana TE protein sequences as queries. Combined with soybean TEs from Family TE14 in the ThYme database, a total of 12 TEs were found in the soybean genome, including ten GmFATB and two GmFATA. Based on the nomenclature proposed previously, the six additional GmFATB genes are designated GmFATB3A (Glyma.04G197400), GmFATB3B (Glyma.06G168100), GmFATB4A (Glyma.04G197500), GmFATB4B (Glyma06g17625), GmFATB5A (Glyma.10G268200), and GmFATB5B (Glyma.20G122900), and the two GmFATA genes are designated GmFATA1A (Glyma.18G167300) and GmFATA1B (Glyma.08G349200) (Table 1).

Table 1 The list of soybean acyl-ACP thioesterase genes with their corresponding gene ID, nucleotide sequence characteristics, and protein sequence properties

Amino acid sequence alignment shows that the two genes in each subfamily of GmFATB and GmFATA share high similarity, such as GmFATB1A/GmFATB1B (96%), GmFATB3A/GmFATB3B (94%), and GmFATA1A/GmFATA1B (93%). The coding DNA sequence (CDS) lengths of the GmFATB genes range from 1140 to 1269 bp with an average of 1203 bp, while those of GmFATA average 1140 bp. The sizes and predicted molecular weights of the GmFATB1 and GmFATB2 subfamilies are larger than 400 amino acids and 45.8 kDa, respectively. GmFATB1A, GmFATB1B, and GmFATB2B show acidic isoelectric point (pI) values, whereas the rest of the soybean TEs show basic pI values (Table 1).

Phylogenetic analysis of plant acyl-ACP thioesterase gene family

In total, 13 FATA and 25 FATB proteins from three other legumes, two dicot species, and three monocot species were identified through BLAST searches using A. thaliana TE protein sequences. A maximum likelihood (ML) tree was constructed with 50 protein sequences to elucidate the phylogenetic relationships among TEs from nine plant species (Fig. 1). As expected, two distinct clusters are formed that separate 15 FATA members from 35 FATB members. In the FATB cluster, all 35 FATB members can be classified into four subgroups. In subgroup I, GmFATB1A, GmFATB1B, GmFATB2A, and GmFATB2B are grouped together with AtFATB, BrFATB, and eight FATB members from the three other legume species. Subgroup II contains all FATB members from the three monocot species except one OsFATB. There are seven FATB members in subgroup III, including GmFATB3A, GmFATB3B, GmFATB4A, and GmFATB4B. Subgroup IV contains GmFATB5A, GmFATB5B, and three FATB members from two legumes and one monocot species (Fig. 1). On the other hand, FATA members from all legume species are clustered apart from those in monocot species. However, AtFATA members are grouped with BrFATA members in two different branches. The phylogenetic analysis also shows a close evolutionary relationship within each of the six soybean TE gene pairs with ≥ 88% reliability (Fig. 1).

Fig. 1

Phylogenetic tree of the acyl-ACP thioesterase gene family from nine plant species. The protein sequences of all acyl-ACP thioesterases were subjected to a MUSCLE multiple alignment, and the phylogenetic tree was constructed by the maximum likelihood (ML) method using MEGA X. The two members of the soybean GmFATA gene subfamily are labeled with filled squares. The four previously identified members of the soybean GmFATB1/2 subfamilies are marked with filled circles, while the members of the GmFATB3/4/5 subfamilies newly identified in this study are labeled with empty circles. The names and abbreviations of the plant species used for the analysis are Arabidopsis thaliana (At); Glycine max (Gm); Phaseolus vulgaris (Pv); Medicago truncatula (Mt); Lotus japonicus (Lj); Brassica rapa (Br); Cocos nucifera (Cn); Elaeis guineensis (Eg); Oryza sativa (Os)

Gene structure and expression profiling of soybean acyl-ACP thioesterase genes

Given the two whole-genome duplication events, the soybean TE gene family consists of 12 members, which is four times the number in Arabidopsis and double the number of TEs in common bean, palm, and rice. Compared to an average of 5860 bp for GmFATA, the gene lengths of the GmFATB1 and GmFATB2 subfamilies are more than 4195 bp, while those of the GmFATB3, GmFATB4, and GmFATB5 subfamilies are 3143 bp on average (Table 1). GmFATB2A has the longest gene length among soybean TEs due to its extended 3′-UTR region. The gene structures of GmFATB are highly conserved, with six exons for all ten members; in contrast, GmFATA1A and GmFATA1B have seven and eight exons, respectively (Fig. 2).

Fig. 2

Phylogenetic relationships and gene structures of GmFATA and GmFATB. The protein sequences of all soybean acyl-ACP thioesterases were subjected to a MUSCLE alignment, and phylogenetic gene tree was constructed using Mega X. The structures of 12 soybean acyl-ACP thioesterase genes were plotted with yellow boxes representing exons (coding DNA sequence, CDS), black lines illustrating introns, and blue boxes indicating 5′-UTR and 3′-UTR regions. The size of gene structures could be measured by the scale in the unit of base pair (bp) at the bottom. The gene structure was drawn using the Gene Structure Display Server

GmFATB1A, GmFATB1B, and GmFATB2A show relatively high expression in soybean seeds, while the transcripts of the two GmFATB2 genes are abundant in soybean flowers. GmFATA1A and GmFATA1B are also highly expressed in soybean seeds. The two GmFATB1 genes are expressed at relatively high levels in soybean root and nodule. Additionally, the expression of GmFATB1, GmFATB2, and GmFATA exhibits similar patterns in leaves and pods. The expression of GmFATB3A, GmFATB3B, and GmFATB5A is recorded as 0 in most of the tested tissues, and no RNA-seq data are available for GmFATB4A, GmFATB4B, and GmFATB5B in Soybase (Figure S1).

Chromosomal distribution and gene duplication

Based on their physical locations, the 12 soybean TE genes are unevenly distributed on eight soybean chromosomes (Fig. 3). Chromosomes 4 and 6 contain three GmFATB genes each, while only one GmFATB gene each is present on chromosomes 5, 10, 17, and 20. The two GmFATA genes are located on chromosomes 8 and 18, respectively. Among the GmFATB subfamilies, GmFATB1 and GmFATB5 are evenly distributed on four chromosomes. In contrast, the other three subfamilies, GmFATB2, GmFATB3, and GmFATB4, are concentrated on two chromosomes with three genes on each (Fig. 3).

Fig. 3

Chromosomal locations and duplications of soybean acyl-ACP thioesterase genes. Each chromosome number is indicated above the bar by a Roman numeral, and the scale (on the left) is in megabases (Mb). The chromosome sizes and gene locations are based on soybean genome annotation a2.v1 on SoyBase. Each pair of segmental duplications in the GmFATA and GmFATB subfamilies is connected by red and blue lines, respectively. The tandemly duplicated genes are shown in rectangular boxes

The duplication analyses show that all soybean TE genes are located within eight duplicated blocks (Table 2). The gene pair GmFATB1A and GmFATB1B belongs to a large duplicated segment containing 62 anchor genes, while GmFATB3A/GmFATB4B and GmFATB5A/GmFATB5B are located in huge syntenic regions with 711 and 884 anchor genes, respectively (Figure S2). Two gene pairs, GmFATB3A/GmFATB4A and GmFATB3B/GmFATB4B, are regarded as the outcomes of tandem duplication events because they are less than 7 kb apart with no genes in between. The ratio of nonsynonymous to synonymous substitutions (Ka/Ks) was calculated for each gene pair to determine the type of natural selection acting on the coding sequences. The Ka/Ks of each soybean TE gene pair is less than 0.5, which suggests that soybean TEs have evolved under purifying selection (Juretic et al. 2005; Li et al. 1981). The duplication of the eight gene pairs is estimated to have occurred between 7.38 and 76.23 Mya based on a rate of $6.1 \times 10^{-9}$ synonymous mutations per synonymous site per year for soybean (Table 2).
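The interpretation of Ka/Ks follows a widely used convention: a ratio below 1 points to purifying selection, a ratio near 1 to neutral evolution, and a ratio above 1 to positive selection. The Python sketch below illustrates that convention; the function names and the example Ka and Ks values are hypothetical and do not come from Table 2.

```python
import math


def ka_ks_ratio(ka: float, ks: float) -> float:
    """Return the ratio of nonsynonymous to synonymous substitution rates."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form a ratio")
    return ka / ks


def selection_type(ka: float, ks: float) -> str:
    """Classify a gene pair by the conventional Ka/Ks thresholds."""
    ratio = ka_ks_ratio(ka, ks)
    if math.isclose(ratio, 1.0, rel_tol=1e-9):
        return "neutral"
    return "purifying" if ratio < 1.0 else "positive"


# Hypothetical gene pair, for illustration only.
print(selection_type(ka=0.03, ks=0.09))
```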

Table 2 Divergence and duplication of acyl-ACP thioesterase gene pairs in soybean

Conserved domain variations among plant TEs

The protein sequences of 15 soybean and Arabidopsis thaliana TEs, four FATA and eleven FATB, were aligned to compare residues within two conserved hotdog domains. A residue that is conserved within one plant TE class but differs between FATA and FATB classes may contribute to the difference in substrate specificity (Jing et al. 2018). Based on these criteria, a total of 13 residues were selected, from which A194G, T208V, and D276E have previously been reported as specificity determining positions (Mayer and Shanklin 2007). Additionally, seven residues are found as completely different between FATA and FATB classes in hotdog domain I, including K150R, N163D, V178T, H212Q, I213V, R236K, and K246R. In hotdog domain II, another three residues, T347K, D362E, and D372E, meet the same criteria (Table 3). Among the ten newly identified residues, three residues, V178T, H212Q, and T347K, present non-conservative difference in amino acid between FATA and FATB, while the rest of seven residues contain conservative changes. In addition, three conserved catalytic residues, N340, H342, and C377, may form a papain-like catalytic triad across the FATA and FATB classes (Mayer and Shanklin 2005). From Conserved Domain Database (CDD) at National Center for Biotechnology Information (NCBI), seven active sites of plant TEs have been revealed in hotdog domain I. Among them, two residues, T208V and R236K, overlap with ones identified as substrate specifying residues, while the rest are highly conserved between FATA and FATB classes except two mismatches at positions 237 and 238 (Figure S3).
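The selection criterion described above (a residue conserved within each TE class but different between FATA and FATB) can be sketched as a column-wise scan of a multiple sequence alignment. The Python example below is a minimal illustration under stated assumptions: the function name is hypothetical, the toy sequences are invented rather than real TE sequences, and the reported positions are alignment columns, not the residue numbering used in Table 3.

```python
def class_specific_columns(
    fata: list[str], fatb: list[str]
) -> list[tuple[int, str, str]]:
    """Return (1-based column, FATA residue, FATB residue) for every alignment
    column that is invariant within each class but differs between classes."""
    if not fata or not fatb:
        raise ValueError("both classes need at least one sequence")
    lengths = {len(seq) for seq in fata + fatb}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must have equal length")

    hits = []
    for col in range(lengths.pop()):
        a_residues = {seq[col] for seq in fata}
        b_residues = {seq[col] for seq in fatb}
        conserved_within = len(a_residues) == 1 and len(b_residues) == 1
        has_gap = "-" in a_residues | b_residues
        if conserved_within and not has_gap and a_residues != b_residues:
            hits.append((col + 1, a_residues.pop(), b_residues.pop()))
    return hits


# Toy aligned sequences (invented for illustration).
toy_fata = ["MKAG", "MKAG"]
toy_fatb = ["MRVG", "MRVG", "MRIG"]
print(class_specific_columns(toy_fata, toy_fatb))
```

In the toy data, column 3 is excluded because FATB carries both V and I there, which mirrors the requirement that a candidate residue be conserved within each class.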

Table 3 The predicted 13 substrate specificity sites and three conserved catalytic sites of soybean acyl-ACP thioesterase genes

Identification of new alleles of GmFAT to improve fatty acid composition in soybean seed

Five soybean acyl-ACP thioesterase genes, GmFATA1A, GmFATB1A/1B, and GmFATB2A/2B, have been included in screening mutations through TILLING-by-Sequencing+. The estimated mutation density of these five genes is 1/232 kb using the formula as the total number of mutations divided by the total number of base pairs (amplicon size x individuals screened) (Table 4) (Cooper et al. 2008). Among the 280 identified mutations in these five GmFAT genes, the typical EMS-type mutations are the majority of base changes with 45.7% in G to A and 37.1% in C to T, while the other types of mutations only took up 17.1% (Table 4). In the coding regions of these five GmFAT genes, a total of 118 amino acid changes are detected, from which 71.2% are missense mutations, 26.3% are silent mutations, and 2.5% are nonsense mutations. Nonsense mutations are found in GmFATA1A and GmFATB2A/2B genes, whereas no nonsense mutations are present in GmFATB1A/1B genes (Table 4).
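The mutation density formula and the share of EMS-type base changes described above reduce to simple ratios. The Python sketch below shows the arithmetic; the function names and all input values are hypothetical, since the amplicon sizes and per-gene counts are given in Table 4 rather than in the text.

```python
def kb_per_mutation(total_mutations: int, amplicon_bp: int, individuals: int) -> float:
    """Return N such that the mutation density is 1 mutation per N kb.

    Density = mutations / (amplicon size x individuals screened).
    """
    if total_mutations <= 0:
        raise ValueError("at least one mutation is required")
    total_bp = amplicon_bp * individuals
    return total_bp / total_mutations / 1000


def ems_type_fraction(changes: list[tuple[str, str]]) -> float:
    """Return the fraction of base changes that are G>A or C>T."""
    if not changes:
        raise ValueError("no base changes supplied")
    ems_types = {("G", "A"), ("C", "T")}
    return sum(1 for change in changes if change in ems_types) / len(changes)


# Hypothetical screen: 10 mutations in a 2000-bp amplicon across 1000 lines.
print(kb_per_mutation(10, 2000, 1000))

# Hypothetical list of (reference base, mutant base) changes.
print(ems_type_fraction([("G", "A"), ("C", "T"), ("A", "T"), ("G", "A")]))
```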

Table 4 A summary of mutations in five soybean acyl-ACP thioesterase genes identified by TILLING-by-sequencing+

A subset of GmFAT mutants has been confirmed by Sanger sequencing, and their novel alleles have been associated with altered fatty acid profiles. Six missense mutations (S37F, A55T, T146I, A231V, G277E, and V310I) are identified from GmFATA1A mutants, in which two mutants, F243 and F393, present > 30% high oleic acid content. Another four mutants, F636, F1305, F740, and F1188, display moderately high oleic acid content (> 24%) compared to Forrest wild type (18.0%) (Table 5). Five GmFATB1A mutants (F1040, F1129, F1539, F1200, and F1166) carry the missense mutations, P18L, G128R, R138H, G223E, and A371T, respectively. The palmitic acid content of these five mutants ranges from 8.9 to 10.7%, but an increase in oleic acid content (23.9–34.0%) is also found in GmFATB1A mutants (Table 5). Likewise, five missense mutations (P118S, G128E, A174T, D284N, and R348K) are detected at GmFATB1B, from which one mutant (F3) shows a decreased palmitic acid content by 51.7% when compared to Forrest wild type. All five GmFATB1B mutants also present an elevated oleic acid content up to 33.4% (Table 5). Another five missense mutations, P16L, A153T, A373T, R385Q, and G395D, are identified at GmFATB2A, from which four show a reduction in palmitic acid content. Two missense and one nonsense mutations are discovered at GmFATB2B. All GmFATB2A and GmFATB2B mutants show an increase in oleic acid content (Table 5).
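Changes such as "decreased by 51.7% when compared to Forrest wild type" are relative changes with respect to the wild-type value. The Python sketch below shows that calculation; the function name is an assumption, the oleic acid example uses the 18.0% wild-type value and the 30% threshold quoted above, and the palmitic acid values are hypothetical.

```python
def relative_change_percent(mutant: float, wild_type: float) -> float:
    """Return the change of a mutant value relative to wild type, in percent.

    Negative values indicate a reduction, positive values an increase.
    """
    if wild_type <= 0:
        raise ValueError("wild-type value must be positive")
    return 100.0 * (mutant - wild_type) / wild_type


# Oleic acid: 30% in a mutant versus 18.0% in Forrest wild type.
print(f"{relative_change_percent(30.0, 18.0):.1f}%")

# Hypothetical palmitic acid contents giving a 51.7% relative reduction.
print(f"{relative_change_percent(4.83, 10.0):.1f}%")
```

Note that a relative change differs from a difference in percentage points: moving from 18.0% to 30% oleic acid is a 12-point increase but a much larger relative increase.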

Table 5 A summary of mutants in GmFATA and GmFATB genes identified by TILLING-by-sequencing+ and their fatty acid phenotypes

Discussion

Among type II fatty acid synthases (FAS), the plant TE is a major contributing factor in determining the carbon chain length of fatty acids through its substrate specificity. The large number of soybean TE genes reflects the genome expansion of soybean compared to other plant species. Previous studies have identified four unique GmFATB genes, of which mutations at GmFATB1A result in low palmitic acid content in soybean seed (Cardinal et al. 2007). In this study, we performed a genome-wide search for soybean TE genes with the aid of the Phytozome and ThYme databases. Six additional GmFATB genes, as well as two GmFATA genes, have been identified and named according to the previously proposed nomenclature (Table 1). The total number of TE genes in soybean is four times that in A. thaliana.

Here, we conducted an overall phylogenetic analysis of plant TEs gene families from nine plant species using maximum likelihood (ML) method (Fig. 1). Interestingly, GmFATB1 and GmFATB2, GmFATB3 and GmFATB4, and GmFATB5 subfamilies are in three different subgroups under FATB cluster, respectively. For plant species with high palmitic acid levels, such as coconut and palm, their FATBs appear to evolve independently from dicot species. Although two FATA and one FATB genes are presented in Arabidopsis genome, members from FATA class are generally much fewer than ones from FATB class in other higher plants. Two GmFATA are grouped with FATAs in other three legume species but apart from ones in other dicot species (Fig. 1).

The gene structures are similar within GmFATB1/GmFATB2 subfamilies, GmFATB3/GmFATB4/GmFATB5 subfamilies, and GmFATA. Although the GmFATA have the longer gene lengths due to the extended intron length, the coding sequence lengths of GmFATA are generally shorter than those of GmFATB. The gene lengths of GmFATB1 and GmFATB2 subfamilies are longer than those of GmFATB3, GmFATB4, and GmFATB5 subfamilies, as are their coding sequences (Table 1). With the advent of intron gain/loss events, all GmFATB lose at least one intron when they evolve divergently from GmFATA (Fig. 2).

The expression profiling data reveal various expression patterns of eight soybean TE genes in six soybean tissues. The similar expression patterns point to functional redundancy during soybean evolution, which could lead to neofunctionalization and subfunctionalization within soybean TE gene family (Figure S1). As expected, the high transcript level of GmFATB1 subfamilies, GmFATB2A, and GmFATA has been detected in soybean seeds, which indicated that these genes play a major role in releasing free fatty acids to cytosol. Thus, they should be the main targets to genetically modify fatty acids composition in soybean seed. For the newly identified GmFATB members, the very low expression level of GmFATB3A, GmFATB3B, and GmFATB5A in all six tissues suggests that their functions need to be explored further (Figure S1).

The 12 soybean TE genes are distributed across eight chromosomes (Fig. 3). Chromosomes 4 and 6 contain the largest number of TE genes (3), whereas chromosomes 5, 8, 10, 17, 18, and 20 have only one TE gene each. The majority of soybean TE genes are found toward the chromosome ends, suggesting potential inter-chromosomal crossovers due to the high genetic recombination rates. Plant species acquired novel traits and adapted to various environments through gene duplication (Bowers et al. 2003; Eckardt 2004). There are three main gene duplication patterns: segmental duplication, tandem duplication, and transposition (Kong et al. 2007). Our syntenic analysis shows that the soybean TE gene family expanded through both segmental and tandem duplications (Table 2). It is also well known that two whole-genome duplication events have occurred in the soybean genome, including one shared by legume species 59 million years ago and another Glycine-specific one around 13 million years ago (Schmutz et al. 2010, 2014; Young and Bharti 2012). The duplication time of soybean TE gene pairs is estimated to match either of these two time periods. GmFATB1B/GmFATB2A, GmFATB1B/GmFATB2B, GmFATB3A/GmFATB4A, GmFATB3A/GmFATB4B, and GmFATB3B/GmFATB4B were formed between 45.90 and 76.23 Mya, while the duplication of GmFATB1A/GmFATB1B, GmFATB5A/GmFATB5B, and GmFATA1A/GmFATA1B occurred between 7.38 and 8.20 Mya (Table 2).

Within two hotdog domains, 10 newly identified residues that are completely different between GmFATA and GmFATB could be the candidate positions to determine the differences in substrate specificity of TEs in soybean. Compared to other seven residues, three residues (V178T, H212Q, and T347K) may play a more important role in substrate specifying due to their non-conservative difference in amino acid (Table 3). In the current study, one GmFATA1A mutant (F740) has been identified to possess non-conservative amino acid changes at a previously reported specificity determining position (T208V) and confer an increased level of oleic acid in soybean seed (Tables 3 and 5) (Mayer and Shanklin 2007).

Mutations at GmFATB1A have been repeatedly associated with the low palmitic acid phenotype in soybean. The expression levels of GmFATA and GmFATB may have a significant impact on soybean seed fatty acid composition; however, future studies are required to elucidate the role of soybean acyl-ACP thioesterases in controlling seed oleic acid content (Byfield and Upchurch 2007). This is the first report that novel alleles of GmFATA1A confer an elevated oleic acid content in soybean seeds. The increase in oleic acid content in the GmFATA1A mutants F243 and F393 is comparable to the high oleic acid content in either GmFAD2-1A or GmFAD2-1B mutants with the same genetic background (Table 5) (Lakhssassi et al. 2017). Meanwhile, novel alleles of GmFATB1A, GmFATB1B, GmFATB2A, and GmFATB2B are identified that confer low palmitic acid content in soybean seeds. Interestingly, these GmFATB mutants also present an elevated oleic acid content, which is consistent with the significant increase in oleic acid content in other previously reported GmFATB1A mutants (Table 5) (Bachleda et al. 2016; Cardinal et al. 2007). Zhou et al. (2019) indicate that a negative correlation exists between palmitic acid and oleic acid contents in both natural and mutagenized soybean populations. The identified GmFAT mutants are new sources of high oleic acid and low palmitic acid seed contents for soybean breeding.