Introduction

Pineapple (Ananas comosus (L.) Merrill) is the second most important tropical fruit after banana in terms of international trade. Although it originated in South America, pineapple is currently cultivated on 1 million hectares in 85 countries, and global production amounted to 24.79 million tons in 2013, having nearly doubled over the last decade (http://faostat.fao.org). Pineapple is consumed or served fresh, cooked, juiced and canned. In addition to its exceptionally palatable and juicy fruit, it has outstanding nutritional and medicinal properties. The pineapple species is a perennial monocot with a diploid number of 50 chromosomes (2n = 2x = 50); it belongs to the family Bromeliaceae in the order Bromeliales (Sharma and Ghosh 1971). Pineapple is a reasonably close relative of grasses. The pineapple plant uses the Crassulacean Acid Metabolism (CAM) photosynthetic pathway, an evolutionary adaptation that arose in some plants in response to arid conditions. CAM results in increased photosynthetic efficiency while preventing excessive water loss. Pineapple is a non-climacteric fruit that lacks the ethylene-associated respiratory peak during ripening. In contrast to the self-compatibility of wild species, cultivated pineapple is self-incompatible and therefore maintains a high level of heterozygosity. These unique attributes make pineapple an exceptionally promising system for genetic and genomic studies addressing biological questions such as obligate CAM photosynthesis, parthenocarpic fruit development, non-climacteric ripening and the molecular basis of self-incompatibility in monocots, as well as offering evidence for cereal genome evolution.

Microsatellites, also referred to as short tandem repeats (STRs) or simple sequence repeats (SSRs), are short tandem repetitive DNA sequences of 2–6 base pairs that are ubiquitously present in the genomes of prokaryotes and eukaryotes (Gur-Arie et al. 2000; Powell et al. 1996; Tautz 1989). DNA replication slipped-strand mispairing (Levinson and Gutman 1987; Tachida and Iizuka 1992) and recombination between DNA strands (Harding et al. 1992) can result in microsatellite instability, a ubiquitous phenomenon in the origin and evolution of microsatellites. SSRs can be categorized into class I (hypervariable markers), and class II (potentially variable markers) based on the length of the repeat motif, and consist of SSRs ≥20 bp and SSRs ≥12 bp < 20 bp, respectively (Temnykh et al. 2001). Owing to their abundance, high polymorphism in the number of repeats, multi-allelic nature, codominant inheritance and low-cost of rapid PCR-based tests, study and detection of microsatellites have found numerous applications allowing important advances in many fields, including the analysis of phylogenetic or genetic diversity (Castillo et al. 2008; Feng et al. 2013; Mian et al. 2005; Sharma et al. 2008), genetic mapping (Chen et al. 2015; Gailing et al. 2013; Yu et al. 2011), marker-assisted selection (MAS) (Steele et al. 2006), population genetics (Innan et al. 1997), comparative genomics (Garza et al. 1995) and gene localization (Molnar et al. 2003).
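The class I and class II length thresholds described above can be expressed as a short function. This is a minimal sketch for illustration only; the function and constant names are hypothetical and are not taken from the software used in this study.

```python
from typing import Optional

CLASS_I_MIN_BP = 20   # class I (hypervariable): SSR length >= 20 bp
CLASS_II_MIN_BP = 12  # class II (potentially variable): 12 bp <= length < 20 bp


def ssr_class(length_bp: int) -> Optional[str]:
    """Return "I", "II", or None for tracts shorter than 12 bp."""
    if length_bp >= CLASS_I_MIN_BP:
        return "I"
    if length_bp >= CLASS_II_MIN_BP:
        return "II"
    return None


# Example tracts: (CT)10 is 20 bp, (AAG)4 is 12 bp, (CT)5 is 10 bp.
print(ssr_class(2 * 10), ssr_class(3 * 4), ssr_class(2 * 5))
```

Under these thresholds a tract of exactly 20 bp falls in class I, and tracts below 12 bp are not counted as SSRs at all.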

The computational capabilities for handling massive genomic sequences, along with the completed sequencing of many genomes, have provided insights into the genomic distribution, putative functions and mutational mechanisms of microsatellites (Li et al. 2002). The density, distribution and motif composition of SSRs vary unevenly and non-randomly across species as well as among different genomic fractions, i.e. introns, exons, coding sequences (CDS), untranslated regions (UTRs) and intergenic regions (Biswas et al. 2014; Cavagnaro et al. 2010; Liu et al. 2013; Luo et al. 2015; Vásquez and López 2014). A higher overall SSR density was observed in intergenic regions compared to genic regions in almost all taxa. SSRs are significantly more frequent in non-coding regions of the genome than in coding regions, except for tri- and hexanucleotides (Zhang et al. 2004). However, most SSRs in non-coding regions are close to or linked to expressed genes and thus represent functional markers that are of particular interest (Morgante et al. 2002). SSRs from non-coding regions are associated with anonymous genomic sequences and can therefore provide sufficient polymorphisms to discriminate between closely related species or conduct genome comparisons. Expressed sequence tag SSRs (EST-SSRs) are associated with functional genes and are usually more conserved across a wide range of species owing to higher selection pressure. In plants, 1.5–7.5 % of ESTs contain SSRs (Kantety et al. 2002; Thiel et al. 2003). Compared to SSRs from non-coding regions, EST-SSRs are more transferable among related germplasm, enabling genome evolution and comparative mapping studies. Estimates suggest that class II SSRs have a significantly higher density than class I SSRs in both genomic sequences and ESTs in a wide range of plant species (Kantety et al. 2002; Wang et al. 2008).
With the rapid increase of the sequencing of cDNA clones and the current availability of many reference genomes, more EST libraries and databases representing the vast majority of the information content of the genome have been established for many organisms, thus providing an avenue for SSR mining in the expressed transcripts.

In pineapple, a number of morphological, biochemical and nucleic acid-based markers such as isozymes, RFLPs (restriction fragment length polymorphisms), AFLPs (amplified fragment length polymorphisms) or RAPDs (randomly amplified polymorphic DNAs) have been employed to characterize germplasm (Aradhya et al. 1994; Carlier et al. 2012; de Sousa et al. 2013; DeWald et al. 1992; Duval et al. 2001; Kato et al. 2005; Paz et al. 2012; Sripaoraya et al. 2001). However, most genetic markers have failed to provide high-resolution genetic maps in pineapple. Among these, the recently developed, user-friendly microsatellite markers have enjoyed much greater success in pineapple genetics owing to their rather high polymorphism and genome specificity (Feng et al. 2013; Kinsuat and Kumar 2007; Rodríguez et al. 2013; Shoda et al. 2012). The latest integrated genetic map of pineapple assembles 741 markers, including 25 SSRs and 12 EST-SSRs, in 28 linkage groups, spanning 2113 cM and covering approximately 86 % of the genome (de Sousa et al. 2013). Those microsatellites are so far the largest set mapped in this species. The most recently reported search for pineapple SSRs identified 94 SSRs from pineapple genomic libraries and 1110 SSRs in 5659 pineapple ESTs (Feng et al. 2013). The small number of SSRs previously found and mapped could be due to the limited amount of genome sequence available at that time. The number of robust and informative genome-wide and EST-derived SSR markers publicly available for pineapple is still insufficient for some studies, hindering diversity and phylogenetic studies as well as the construction of high-resolution genetic maps, which are instrumental for marker-assisted selection, positional gene cloning and comparative mapping. More polymorphic SSR markers and a denser map are needed.
Recently, the genome of cultivated pineapple variety F153 was sequenced and assembled using several whole genome sequencing approaches, generating a contig N50 of 126.5 kb and a scaffold N50 of 11.8 Mb (Ming et al. 2015). This assembly spans 382 Mb, 72.6 % of the estimated 526 Mb pineapple genome. Based on extensive RNA-Seq data from the pineapple genome project, a substantial number of novel transcripts that significantly complement current EST databases were identified and collected. The availability of the pineapple draft genome sequence and a large collection of EST sequences now provides an opportunity for large-scale development of microsatellite markers, which would benefit the pineapple research community and expedite breeding progress. The present study was conducted to provide a systematic, genome-wide in silico characterization of microsatellite sequences in the pineapple genome for crop improvement. In this study, we 1) mined microsatellites throughout the assembled pineapple genome and ESTs, 2) investigated the distribution, density, repeat and motif structure of SSRs in different genomic fractions as well as transcripts, 3) performed a comparative analysis among pineapple and other plant species using SSRs from genomic and EST sequence datasets, and 4) carried out Gene Ontology (GO) and KEGG (Kyoto Encyclopedia of Genes and Genomes) annotation and expression pattern analyses for SSR-containing genes to gain insight into the putative function of SSRs present in gene regions.

Results

SSR Classes and Density Distribution in Different Genome Regions and ESTs

In this study, the class and distribution of pineapple SSRs with a minimum repeat length of 12 bp and a unit size of 2 to 6 bp were analyzed. A total of 278,245 perfect SSRs were identified from 381.91 Mb from the most recently assembled pineapple genome with an overall density of 728.57 SSRs/Mb (i.e., one SSR per 1.37 kb of sequence, excluding mononucleotide SSRs), of which 82,261 (29.6 %) were defined as class I (≥20 bp) SSRs with a density of 215.4 SSRs/Mb, and 195,984 (70.4 %) as class II (≥12 bp and <20 bp) with a higher density of 513.17 SSRs/Mb on average (Table 1). An SSR search in coding, UTRs, intron and exon sequences was also performed to determine the distribution of SSRs on a genic scale. Following the search of SSRs for each of these regions, we found that the densities of SSRs were significantly different in coding and noncoding regions (Table 1 and Fig. 1). The densities of SSRs in noncoding regions were 2839.91 SSRs/Mb for 5′-UTR, 545.88 SSRs/Mb for 3′-UTR, 589.12 SSRs/Mb for introns, and 783.98 SSRs/Mb for intergenic regions. The abundance of SSRs in noncoding regions was much higher than that of coding regions. Only 19,727 SSRs (7 %) were located in the CDS, while 93 % were located in noncoding regions. Class II SSRs were substantially more prevalent than class I SSRs on both genomic and genic scale. The average density of SSRs in CDS (592 SSRs/Mb) was lower than the genome taken as a whole (Fig. 1). In the genic region, 3′-UTR regions were found to have the lowest density of SSRs, whereas 5′-UTR regions contained the greatest amount of SSRs. For instance, we identified 8282 (1718.96 SSRs/Mb) SSRs in 5′-UTRs and only up to 3098 (396.63 SSRs/Mb) SSRs in 3′-UTR regions. 5′-UTR sequences were observed to have between 3.6- to 5.2-fold higher SSR density than other regions and approximately four-fold higher density than in the whole genome. 
It appeared that the SSRs were denser in the intergenic region (783.98 SSRs/Mb) than in its genic region counterpart (633.7 SSRs/Mb) and genome-wide region. The genome-wide GC content (27.7 %) was lower than the genic GC content (36.4 %) (Table 1).
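The search criteria stated above (perfect repeats, unit size of 2 to 6 bp, minimum total length of 12 bp) can be illustrated with a regular-expression scan. This is a simplified sketch and not the tool used in this study; the function names are hypothetical, and the handling of overlapping or compound repeats is an assumption made here for brevity.

```python
import re
from typing import List, Tuple

MIN_LENGTH_BP = 12        # minimum total SSR length stated in the text
UNIT_SIZES = range(2, 7)  # di- to hexanucleotide motifs


def is_repeat_of_shorter_unit(motif: str) -> bool:
    """True if the motif is itself a tandem repeat, e.g. "ATAT" or "AA"."""
    n = len(motif)
    return any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))


def find_perfect_ssrs(seq: str) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for each perfect SSR found."""
    seq = seq.upper()
    hits = []
    for unit in UNIT_SIZES:
        # Smallest repeat count whose total length reaches 12 bp
        # (6 for di-, 4 for tri-, 3 for tetra- and penta-, 2 for hexa-).
        min_repeats = -(-MIN_LENGTH_BP // unit)
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that reduce to a shorter unit; this also drops
            # mononucleotide runs, which are excluded from the counts above.
            if is_repeat_of_shorter_unit(motif):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)


example = "GG" + "AT" * 6 + "CC" + "AAG" * 4 + "T"
print(find_perfect_ssrs(example))
```

Dividing the number of hits by the scanned sequence length in Mb gives a density in SSRs/Mb comparable in form to the values reported in Table 1.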

Table 1 Distribution of SSR classes identified in pineapple genome-wide, genic regions and EST sequences
Fig. 1 SSR density in different pineapple genome regions and EST sequences

From the 53.46 Mb of EST sequences, 41,962 SSRs were mined with an overall density of 619.37 SSRs/Mb, including 4339 (10.3 %) class I and 37,623 (89.7 %) class II SSRs with densities of 64.05 SSRs/Mb and 555.33 SSRs/Mb, respectively. Of the two sequence sources, the assembled genome and the EST sequences, the ESTs had a lower SSR density and a much higher GC content (53.5 %). Class I SSRs had a density of 64.05–215.4 per Mb among genomic, genic and EST sequences, while class II SSRs occurred at a significantly higher density of 446.82–555.33 per Mb.

SSR Repeat Length and Motif Length Frequency in Different Genome Regions and ESTs

The repeat length ranged from 12 to 1439 bp in the whole genome and from 12 to 25 bp in the EST sequence datasets. The predominant repeat length in both the assembled genome and the ESTs was 12 bp, accounting for 37.3 % and 43.3 % of the total SSRs (Fig. 2a) and 53 % and 48.3 % of the class II SSRs, respectively (Fig. 2b). The second most frequent repeat length was 18 bp. The trend was again similar between the assembled genome and the ESTs, with 18 bp repeats representing 10.5 % and 20.2 % of the total SSRs and 14.9 % and 22.6 % of the class II SSRs. Class I SSRs were extremely underrepresented compared to their class II counterparts. Within class I, 20 bp repeats were the most common in whole genome sequences with 13,043 SSRs (15.9 %), followed closely by 24 bp repeats with 12,369 SSRs (15.0 %) (Fig. 2b). In EST sequences, by contrast, 20 bp repeats (43.2 %) remained the most abundant but were followed by 21 bp repeats instead of 24 bp repeats (Fig. 2c).

Fig. 2 Distribution of SSRs in whole genome and EST sequences by the repeat length. a. of total SSRs; b. of class II SSRs; c. of class I SSRs

Regarding the SSR motif lengths, dinucleotide repeats of total SSRs and of each class outnumbered all of the other motif lengths and were the most abundant on a genome wide scale (Fig. 3). The observed SSR frequency decrease was not strictly correlated with the number of nucleotides. For example, in genomic regions, hexa- motifs were slightly more abundant than penta- motifs both in total and of each class of SSRs. In class I SSRs, the number of dimers was significantly higher than other motifs with a high percentage of 62.1 %, followed by tri- motif with 11.4 %, hepta- with 10.9 %, tetra- with 5.1 % (Fig. 3 and Supplementary Table 1). Hepta- to decanucleotide repeats of class II SSRs were totally absent from both sources of genome sequences. In gene models, the density of SSRs differed among exons, CDS, UTRs and introns (Fig. 1). CDS regions differed from other genic regions by their high trinucleotide SSRs percentage (78.4 % vs. 28.7 %, 29.9 %, and 17.7 % in 5′-UTR, 3′-UTR, and intron, respectively) and hexanucleotide SSRs percentage (13.1 % vs. 7.8 %, 6.2 % and 5.2 % in 5′-UTR, 3′-UTR, and intron, respectively) (Fig. 3a and Supplementary Table 1). Across the sequence sources, tri- and hexa- repeats of total SSRs and class II were more prevalent in ESTs than in the whole genome while other types of repeats were slightly reduced in the EST sequences (Fig. 3a and c). In class I SSRs, tetra- and penta- followed the same trend in addition to tri- and hexanucleotide repeats (Fig. 3b). For example, the tri- repeats of total SSRs and of each class were 1.9 to 2.5 times more frequent in ESTs than in whole genome sequences, making these the most dominant motif in class II and in the total SSRs, and the second most dominant motif in class I after dimers. The relative abundance of EST-related motif frequency (tri- and hexa-) was therefore found to be due to their abundance in genomic protein coding regions.

Fig. 3 Distribution of a. total SSRs b. class I SSRs and c. class II SSRs according to the length of repeat motif in different pineapple genome regions and EST sequences

Interspecific Comparison of SSR Distribution in Genomic and EST Sequences

The distribution of SSRs in pineapple genomic sequences and in five other selected species with comparably large genomes was analyzed and is summarized in Table 2. We analyzed the genomes of three monocots (Ananas comosus, Oryza sativa, Sorghum bicolor) and three dicots (Arabidopsis thaliana, Cucumis sativus L., Vitis vinifera). The same criteria for SSR identification were applied to all six genomes, namely a minimum repeat length of 12 bp and a unit size of 2 to 6 bp. The six species showed a large variance in microsatellite density, ranging from 315.5 SSRs/Mb to 728.57 SSRs/Mb (Table 2). Unexpectedly, pineapple had the highest density among the species analyzed, well ahead of second-place cucumber (536.7 SSRs/Mb). Sorghum had the lowest abundance of SSRs with a density of 315.5 SSRs/Mb, comparable to that of Arabidopsis (364.1 SSRs/Mb). As shown in Table 2, dinucleotides were the most common SSR motif length in pineapple genomic sequences, representing 46.8 % of the total SSRs, followed by tri- (23.6 %) and tetranucleotides (18 %). Penta- and hexanucleotides were the least common motif lengths, together covering less than 12 % of all SSRs. Across the six species, the distribution of motif lengths in the monocot pineapple genome was markedly different from the others, especially from those of the dicots cucumber and grapevine, for which tetranucleotides were the most frequent motif type. Trinucleotides were the most prevalent motif length in the other two monocots, rice and sorghum, and in one dicot, Arabidopsis. Tetranucleotides prevailed in two of the three dicot species analyzed. The density of dinucleotide repeats in pineapple was much higher than in the other species. Apart from the dominant motif length, no other significant differences were evident between monocots and dicots.

Table 2 Distribution of perfect SSRs in genomic and EST sequences of pineapple and other selected plant species

The overall microsatellite density in pineapple ESTs was lower than that in genomic sequences. Compared to genomic sequences, the frequencies of tri- and hexanucleotides were much higher in transcripts (51.6 % and 9.7 % of total SSRs), whilst dinucleotides were greatly reduced (23.1 % in ESTs) (Table 2). In other species, the distribution of microsatellites in the expressed fraction of the genome showed a similar tendency: tri- and hexanucleotides were relatively more abundant, and dinucleotides rarer, than at the genomic level. In pineapple ESTs, trinucleotides were by far the most frequent type, followed in order of abundance by di-, tetra-, hexa- and pentanucleotides. In all other species analyzed, the relative abundance of SSR types differed only slightly from that of pineapple: trinucleotides were the most abundant, followed by tetra-, di-, hexa- and pentanucleotides. Although trinucleotides were the most frequent type across species, their absolute density varied by more than 4-fold, from 108.4 SSRs/Mb in grapevine to 485.7 SSRs/Mb in rice. The trinucleotide density in monocots was above 300 SSRs/Mb, while that in dicots was under 250 SSRs/Mb.

Distribution of SSR Motifs in Genome-Wide and EST Sequences

We examined the frequencies of pineapple SSR motifs with regard to the repeat times at the genomic and transcript levels (Fig. 4). Consistently, the SSR frequency decreased dramatically as the number of repeat units increased for all analyzed five motif lengths, especially for longer ones (tetra- to hexanucleotides) which showed the most dramatic drop in frequency. As a result, the mean repeat number in dinucleotides was about 1.7–2.2 times the number of repeat units in trinucleotides, and it was nearly 2.5–3.4 times greater than the number of tetra- to hexanucleotides (Table 2). In the genomic region, the cumulative sequence length of dinucleotides was 2906 kb, the longest one compared to any other motif length type. Therefore, dinucleotides (340.58 SSRs/Mb) not only occurred most frequently in the pineapple assembled genome, but also accounted for the greatest contribution to the genome fraction occupied by SSRs (di- to hexanucleotides) due to their highest number of repeat units. In transcript sequences, although dinucleotides had a higher number of repeat units (7.9) than trinucleotides (4.7), trinucleotides, due to their higher density (319.44 SSRs/Mb), made a greater contribution to the transcript fraction occupied by SSRs: the cumulative sequence length of di- and trinucleotides was 152 kb and 307 kb, respectively. The longest SSR in the pineapple genome appeared in dinucleotide repeat patterns, which was (CA)1439, reaching up to 2878 bp, followed by the (TC)1438. Longer repeats in the genome generally tend to have higher mutation rates, which is associated with a high frequency of polymorphisms (Karaoglu et al. 2005). In transcript sequences, tetranucleotide (ATCC)24 at 96 bp and hexanucleotide (CTTCTC)14 at 84 bp were the longest and the second longest SSRs. The frequency of repeat motif varied for each length class from different sequence sources. 
A more detailed examination of the frequencies of individual repeat motifs is shown in Tables 3 and 4 and in Fig. 5 and Supplementary Table 2.

Fig. 4 Frequencies (%) of motif length with repeat numbers in SSRs identified from pineapple whole genome and EST sequences

Table 3 Distribution of SSR motifs by number of repeats identified from pineapple whole genome sequence
Table 4 Distribution of SSR motifs by number of repeats identified from pineapple EST sequence
Fig. 5 Densities of SSRs according to the nucleotide composition of motif

Dinucleotide Motifs

The analysis of AC, AG, AT and CG repeats revealed that the dominance of dinucleotides was attributable to an overrepresentation of AT motifs in the pineapple whole genome sequences (Table 3), while CG was rare, accounting for only 0.15 % of all SSRs. AT was also the most common single motif across the entire genome, representing 24.3 % of the total SSRs, followed by the dinucleotide motif AG (20 %). In contrast, the frequency of AT repeats (4 %) was dramatically lower in the pineapple transcript sequences, where AG (17 %) was the most abundant dinucleotide. In general, the transcript sequences showed a dominance of AT/GC balanced dinucleotide repeats, whereas whole genome sequences displayed a prevalence of AT-rich dinucleotides (Fig. 5). AC and CG repeats were the rarest dinucleotides in both sequence sources.

Trinucleotide Motifs

Among trinucleotide motifs, AAT and AAG were prevalent in the entire genome, representing approximately 6.4 % and 5.5 % of all SSRs, respectively, while CCG was the predominant motif in ESTs, contributing 14.3 % of the discovered SSR loci (Tables 3 and 4). The abundance of CCG in ESTs was also found in the monocots rice and sorghum. The opposite distribution was observed in dicots, where CCG was the rarest trinucleotide in both genomic and EST datasets (Cavagnaro et al. 2010). ACT was the least frequent trinucleotide repeat in both the pineapple genomic and EST datasets, representing less than 1 % in each. A strong bias in the distribution of trinucleotides towards AT-rich motifs was found in the genomic sequences, and towards GC-rich motifs in ESTs. Overall, AT-rich motifs accounted for 14.24 % of the identified SSRs in the assembled genome, nearly 1.5 times the share of GC-rich motifs (9.4 %), whereas in ESTs GC-rich motifs (36.1 %) were more than twice as frequent as AT-rich ones (15.5 %).
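Grouping motifs by nucleotide composition, as in the AT-rich and GC-rich comparison above, can be sketched as follows. The rule used here (comparing the number of A/T bases with the number of G/C bases in the motif) and the canonical-form convention (lexicographically smallest rotation of the motif or its reverse complement) are assumptions for illustration; the labels used in this paper may follow a different convention, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif: str) -> str:
    """Smallest rotation of the motif or of its reverse complement."""
    motif = motif.upper()
    revcomp = motif.translate(COMPLEMENT)[::-1]
    rotations = [s[i:] + s[:i] for s in (motif, revcomp) for i in range(len(s))]
    return min(rotations)


def composition_class(motif: str) -> str:
    """Label a motif as AT-rich, GC-rich, or balanced by base counts."""
    motif = motif.upper()
    at = motif.count("A") + motif.count("T")
    gc = len(motif) - at
    if at > gc:
        return "AT-rich"
    if gc > at:
        return "GC-rich"
    return "balanced"


for m in ("CT", "GGC", "TTA"):
    print(m, canonical_motif(m), composition_class(m))
```

Under this convention CT and its reverse complement AG are counted together as AG, and GGC is counted with CCG, so each repeat family is tallied once regardless of strand or reading phase.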

Tetranucleotide Motifs

A clear predominance of AT-rich motifs and underrepresentation of GC-rich motifs was observed in genomic and EST sequences (Fig. 5). Compared to ESTs, the shift from AT-rich to GC-rich motifs in genomic sequences was gentler. Overall AT-rich motifs composed up to 80 % in genomic sequences and 62.9 % in ESTs of all tetramer repeats, whereas the GC-rich repeats were the least common with relative frequencies of 4.4 % and 12.7 %, respectively (Fig. 5, Supplementary Table 2a and b). The most prevalent tetramer SSRs in pineapple genomic sequences was ATTT, covering 42.2 % of tetranucleotide repeats, followed by AATT for 18.4 %, AAAG for 14.9 %, AGAT for 9.6 % (Supplementary Table 2a). In relative terms, the GC-rich motifs AGGC, ACGG, ACCG, and AGCC were, in that order, the least prevalent tetramers. A similar distribution was observed in their ESTs counterparts, for which ATTT prevailed, whereas AGGC was the scarcest (Supplementary Table 2b).

Pentanucleotide and Hexanucleotide Motifs

AT-rich motifs were predominant among pentanucleotide repeats in SSRs from the entire assembled genome and ESTs, covering 85 % and 72 %, respectively (Fig. 5, Supplementary Table 2a and b). AAAAT and AAAAG were the most abundant repeats in both genomic and EST sequences, outnumbering the next most common repeats AATAT in genomic sequences or AAAAC in ESTs, by at least two-fold.

As for pentanucleotide repeats, AT-rich hexanucleotides were prominent in the pineapple genomic data (~53 % of total hexanucleotides), such as AAATTT, AAAAAG, AAAAAT, and AAAATT, followed by GC-rich (~28 %) and AT/GC balanced (~19 %) (Fig. 5 and Supplementary Table 2a). On the contrary, there was a high prevalence of GC-rich repeats represented by approximately 53 % of all hexamers in transcript data (Fig. 5 and Supplementary Table 2b), followed by AT/GC balanced (~29 %) and AT-rich (~18 %). Both pineapple genomic and ESTs sequences had a much higher overall density of hexanucleotide motifs compared to other analyzed species, and nearly 3.4–4.5 fold the density found in Arabidopsis (Table 2).

Annotation and Expression Patterns of SSR-Containing Genes

Pineapple genes can be classified as SSR-containing genes and no-SSR genes. In total, 21,631 genes (72.7 %) containing one or more SSRs were identified in this study, while only 8115 genes (27.3 %) contained no SSR (Fig. 6a). Of all SSR-containing genes, 5878 contained only one SSR and 15,753 contained more than one. The number of SSRs per gene ranged from 0 to 181. The number of genes decreased sharply as the number of SSRs per gene increased (Fig. 6b). Among SSR-containing genes, those with one SSR were the most common (5878), followed by those with two SSRs (4389) and three SSRs (3230) (Fig. 6b).

Fig. 6 Overview of SSR-containing genes in the pineapple transcriptome. a. No-SSR gene and SSR-containing gene numbers. b. Distribution of SSR-containing gene numbers versus SSR number. c. GO annotation of SSR-containing genes in the pineapple transcriptome

The expression of all pineapple genes was quantified using FPKM values, and 16,119 (54.2 %) genes had an FPKM value >0.3 in at least one tissue. The distributions of FPKM values of no-SSR genes (Fig. 7a), genes with one SSR (Fig. 7b) and genes with two or more SSRs (Fig. 7c) in different pineapple tissues, including flower, leaf, root and fruit, are shown. The heat maps indicate that, within each of the three categories, all tissues showed identical expression patterns. In general, the greater the number of SSRs in a gene, the higher its expression level appeared. Most genes in the category with two or more SSRs had higher expression levels than genes in the other two categories, while genes in the no-SSR category had the lowest expression levels. To estimate the statistical significance of the correlation between the number of SSRs contained in a gene and its expression level, genes with high FPKM values were grouped by the number of SSRs they contained (0 to 181). However, the linear regressions showed no significant correlation in any of the four tissues (R² = 0.0002–0.01494, P > 0.2) (Fig. 8).

Fig. 7 Expression patterns of a. genes within no SSR (8115), b. genes within one SSR (5878), and c. genes within two or more SSRs (15,753). The heat maps showed log2 FPKM values of genes in different pineapple tissues including flower, leaf, root, and fruit

Fig. 8 Linear correlation between the number of SSRs contained in a gene and the ratio of the number of high FPKM genes to the number of genes with a certain number of SSRs across four tissue types (a. flower, b. leaf, c. root and d. fruit)

The GO annotation of SSR-containing genes identified in this study is shown in Fig. 6c. Of the 21,631 genes containing SSRs, 11,700 (54.1 %) were able to be assigned one or more GO annotations, resulting in 14,338, 8668, and 18,110 biological process, cellular component and molecular function terms, respectively (Supplementary Table 3). The molecular function ontology category was comprised of a high portion of protein binding (61.9 %) and catalytic activity (46.5 %), followed by transporter activity (4.8 %) and nucleic acid binding transcription factor activity (2.8 %). With regard to the cellular component, 32 % were assigned to cell or cell part followed by membrane, organelle and macromolecular complex with 12.7, 10.1, and 7.0 %, respectively. In the biological process ontology category, metabolic process (48.2 %), cellular process (40.3 %) and single-organism process (27.6 %) were the top three most dominant groups. The remaining processes were localization, biological regulation, response to stimulus, cellular component biogenesis and signaling, and others. The biological interpretation of the SSR-containing genes was further completed using KEGG pathway analyses. Of all the SSR-containing genes, 1599 (7.4 %) had one or more KEGG annotations, and they belonged to 136 pathways, of which some were consistent with the biological processes identified through GO analyses. Those pathways may represent all KEGG pathways in the pineapple genome.

Discussion

Microsatellites have been of paramount importance for genetics, ecology, taxonomy and evolution studies. Analysis of pineapple microsatellites in coding and non-coding regions, coupled with information of their frequency, distribution and sequence motifs could contribute to the understanding of the pineapple genome architecture and evolution, and provide insights into the possible roles of SSRs in genomic localization and gene regulation. The high abundance of SSRs can be applied to many genetic and genomic studies in pineapple. With the recent pineapple genome release, a global analysis of SSRs is feasible. Genome-wide mining and characterization of pineapple microsatellites was performed and reported for the very first time in this work.

Classes, Frequency and Distribution of SSRs in the Pineapple

In this study, microsatellites with a repeat unit size of 2 to 6 bp and a minimum length of 12 bp were mined from the pineapple genome and analyzed. In total, the cumulative sequence length of SSRs was 5.2 Mb, which contributed 1.4 % to the 381.91 Mb genome assembly of pineapple, and the density was one SSR per 1.37 kb (728.57 SSRs/Mb). This observed SSR density was higher than those reported for other plant species (Table 2) (Biswas et al. 2014; Cavagnaro et al. 2010; Jena et al. 2012; Liu et al. 2013; Luo et al. 2015; Vásquez and López 2014; Wang et al. 2008). A generally negative correlation between genome size and the density or number of SSRs in plants has been reported (Morgante et al. 2002). Our data, obtained by applying the same criteria for SSR identification to the selected plant species, are consistent with this trend for most of them. Sorghum, with the largest genome (739 Mb), had the lowest density among the species analyzed. A similar trend has also been observed in the larger soybean (1115 Mb) (Arumuganathan and Earle 1991) and maize (2365 Mb) genomes (Huo et al. 2008). However, pineapple deviates from this general trend. Despite having an estimated genome size larger than that of rice (389 Mb) (Project IRGS, 2005), cucumber (367 Mb) (Arumuganathan and Earle 1991) and grapevine (487 Mb) (Jaillon et al. 2007), the pineapple genome (526 Mb) (Arumuganathan and Earle 1991) harbors an SSR density (728.57 SSRs/Mb) higher than the densities found in these other species. Arabidopsis, the plant with the smallest genome in our dataset, also deviates from this trend; it had a density substantially lower than that of species with 2-fold or 3-fold larger genomes. It has been observed that SSRs are preferentially found in non-repetitive or low-copy fractions of the genome (Morgante et al. 2002; Temnykh et al. 2001). The assembled pineapple genome is 381.91 Mb compared to the estimated 526 Mb.
The missing ~144 Mb could consist of repetitive DNA that has not been assembled. The high density of SSRs in the pineapple genome assembly could therefore reflect the possibility that the non-assembled sequences are repetitive parts of the genome in which microsatellites are underrepresented. It has been reported that the frequency of SSRs is considerably higher in dicot species than in monocots (Sonah et al. 2011). However, this tendency was not observed in this work.
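The summary figures quoted in this section follow from simple arithmetic on values reported in the text. The sketch below reproduces them as a consistency check; the variable names are illustrative, and small differences in the last digit reflect rounding of the reported assembly size.

```python
# Values taken from the text of this study.
assembled_mb = 381.91    # assembled genome size (Mb)
estimated_mb = 526.0     # estimated genome size (Mb)
total_ssrs = 278_245     # perfect SSRs found in the assembly
ssr_sequence_mb = 5.2    # cumulative SSR sequence length (Mb)

density_per_mb = total_ssrs / assembled_mb            # SSRs per Mb
spacing_kb = 1000.0 / density_per_mb                  # kb of sequence per SSR
ssr_fraction_pct = 100.0 * ssr_sequence_mb / assembled_mb
unassembled_mb = estimated_mb - assembled_mb
assembled_pct = 100.0 * assembled_mb / estimated_mb

print(f"density:     {density_per_mb:.2f} SSRs/Mb")
print(f"spacing:     one SSR per {spacing_kb:.2f} kb")
print(f"SSR content: {ssr_fraction_pct:.1f} % of the assembly")
print(f"unassembled: {unassembled_mb:.2f} Mb")
print(f"assembled:   {assembled_pct:.1f} % of the estimated genome")
```

If the unassembled fraction is indeed SSR-poor repetitive DNA, a density computed against the full 526 Mb would be lower than the assembly-based value, which is the point made in the paragraph above.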

The shorter class II SSRs are more abundant than class I SSRs in the genomic and the different genic regions of pineapple (Table 1, Fig. 2) and of other plant species (Cavagnaro et al. 2010; Wang et al. 2008). Microsatellites are known to be mutational hotspots in genomes and thus play a significant role in the origin and evolutionary dynamics of genomes (Ellegren 2004; Li et al. 2004b; Luo et al. 2012). This tendency might therefore be due to the inherent instability of the longer class I repeats, which are prone to mutate into imperfect SSRs through replication slippage, point mutation or recombination (Ellegren 2004), whereas the shorter class II repeats are more tolerant of mutations and are retained.
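The class I and class II categories follow the length thresholds given in the Introduction (class I: ≥20 bp; class II: ≥12 bp and <20 bp). As a minimal illustration, assuming perfect repeats and hypothetical function names, the assignment can be sketched in Python as:

```python
def repeat_length(motif, n_repeats):
    """Total length in bp of a perfect repeat, e.g. (AT)6 -> 12 bp."""
    return len(motif) * n_repeats


def ssr_class(length_bp):
    """Return 'I' for SSRs of 20 bp or more, 'II' for 12-19 bp,
    and None for repeats below the 12 bp cutoff used in this study."""
    if length_bp >= 20:
        return "I"
    if length_bp >= 12:
        return "II"
    return None
```

For example, (AT)6 (12 bp) and (AAT)4 (12 bp) fall into class II, whereas (AT)10 (20 bp) falls into class I.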

Distribution of SSRs in Different Genome Regions and Transcript Regions

This study represents the first focused analysis of the distribution of SSRs in different genome regions of pineapple. Several lines of evidence have shown that SSRs are nonrandomly distributed across different genomic fractions (Li et al. 2004b; Li et al. 2002). A previous report indicated that UTRs are SSR-rich, and 3′-UTRs were expected to have more SSRs than 5′-UTRs (Morgante et al. 2002). In our SSR search of each analyzed genome region, however, 3′-UTRs were found to have the lowest density of SSRs (545.9 SSRs/Mb), whereas 5′-UTRs had the highest (2839.9 SSRs/Mb), at least 3.6-fold higher than the other regions in general (Fig. 1). Higher SSR densities in 5′-UTRs were also observed in Arabidopsis, rice, sweet orange and cassava (Biswas et al. 2014; Lawson and Zhang 2006; Vásquez and López 2014). The SSR density varies between genomic and transcribed sequences of pineapple and other organisms (Table 2). In the intergenic or genomic regions of pineapple, microsatellites were present at a higher density than in transcribed regions, including protein-coding genes and ESTs. This trend was also found in several other plant species, such as papaya, cucumber and grapevine, whereas the opposite was found for rice, sorghum, soybean and citrus. These findings partially contradict a previous study by Morgante et al. (2002), although the discrepancies may be due to the limited data resources available at that time. In addition, inherent genome variation between species, as well as the use of different SSR detection software with different sets of parameters and algorithms, may account for the observed differences among species. In general, SSRs were less abundant in coding regions than in non-coding regions. This is consistent with the report that a considerable number of SSRs are embedded in non-coding regions, either in intergenic sequences or in introns (Ellegren 2004; Tóth et al. 2000). Although SSRs are scarcer in coding regions, the majority of SSRs in non-coding regions are actually close to or linked to expressed genes, making SSRs attractive potential markers for gene localization (Morgante et al. 2002). Furthermore, SSRs from non-coding regions are associated with anonymous genomic sequences, and can therefore provide sufficient polymorphism to discriminate between closely related species or be useful when conducting genome comparisons.

In pineapple genic regions, UTRs harbor more SSRs than the coding regions. In the untranslated portion of transcripts, SSRs were observed to be densest in 5′-UTRs, followed by introns and 3′-UTRs. The observed SSR densities in different genomic regions were congruent with previous reports from various plants (Biswas et al. 2014; Cavagnaro et al. 2010; Mun et al. 2006; Wang et al. 2008) and fungi (Labbé et al. 2011; Li et al. 2014; Murat et al. 2011). The transcript regions had a higher G + C content than the whole genomic regions (Table 1). SSRs from different genic locations (CDS, UTRs and introns) may play various roles in development, adaptation, survival and evolution. Mutations in SSRs in genic regions could affect the corresponding gene products. For example, SSR deletions or insertions in coding regions could lead to a gain or loss of gene function via frameshift mutation or expanded toxic mRNA (Li et al. 2004b; Li et al. 2002); the presence of certain polymorphic SSRs in UTRs or introns could affect gene expression levels; SSRs in 5′-UTRs could be responsible for regulating transcription and translation, gene regulation adaptation, mRNA stabilization and short-term phenotypic changes; SSR expansions in 3′-UTRs could cause transcription slippage, splicing disruption and damage to cellular function; and SSR variations in introns can affect gene transcription, mRNA splicing or export to the cytoplasm (Li et al. 2004a; Li et al. 2004b; Zhang et al. 2006). In light of this, the observed high density of SSRs in the 5′-UTRs of the pineapple genome provides a good opportunity to gain insight into the influence of SSRs on pineapple gene expression and regulation. In total, 8282 SSRs were found in the 5′-UTRs of protein-coding genes (Table 1). These gene models are good candidates for future studies.

Character of SSR Motifs, Repeat Number and Repeat Length

SSR motifs show species and genome fraction or region specificity in eukaryotic and prokaryotic organisms (Mrázek et al. 2007; Tóth et al. 2000). In pineapple, most SSRs are di-, tri- and tetranucleotide repeats, which together account for 88 % of all SSRs. The dinucleotide motifs (46.8 %) outnumbered all other microsatellite repeat categories and were the most prevalent motif length type in pineapple genomic regions (Fig. 3). In the genomic class I (long) microsatellites in particular, dimers accounted for up to 62 %. The prevalent motif length varied across species. Similar to the trend observed in pineapple, dimers prevailed in sweet orange, cranberry and cassava. These findings are in agreement with previous reports of highly abundant dinucleotides in the genomic DNA of many evaluated species (Kalia et al. 2011). However, some exceptions exist, such as the prevalence of trinucleotides in Arabidopsis, rice, sorghum and flax, whereas tetranucleotide repeats are the most common type in cucumber and grapevine. Comparison of transcript regions (i.e., CDS, exons, ESTs) with whole pineapple genomic regions showed that all repeat types except tri- and hexanucleotide repeats were comparatively less abundant in the transcript regions (Fig. 3 and Table 2). This tendency was also observed in other species. Trinucleotide repeats were found to be the most prevalent repeat type in protein-coding sequences or ESTs of the pineapple genome and of all other taxa, including plants, insects and human (Biswas et al. 2014). In transcript regions, trinucleotide SSRs were the most abundant SSRs, followed by di-, tetra-, hexa- and tetranucleotides. This trend was consistent with the most recent study on pineapple EST-SSR mining (Ong et al. 2012). This predominance of tri- and hexanucleotides over other repeat types has been attributed to negative selection against frameshift mutations. The lengths of tri- and hexanucleotide motifs are integer multiples of the codon length, so changes in their repeat number are unlikely to disrupt the reading frame, a property that may be associated with genetic conservation. Although a similar situation was observed across species, with trinucleotides being the most frequent motif, the absolute density of triplets varied extensively depending on the species. Across all examined species, we found that the density of trinucleotides in the transcript regions of monocots was at least 1.4-fold higher than that of dicots.
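The reading-frame argument can be made concrete with a small calculation: an SSR length change preserves the frame only if the number of bases gained or lost is a multiple of three. The sketch below is a simplified illustration with hypothetical function names; it considers only the net length change of a perfect repeat inside a coding sequence.

```python
def frame_offset(motif_len, repeat_change):
    """Net change in sequence length, modulo 3, caused by gaining
    (positive) or losing (negative) whole repeat units."""
    return (motif_len * repeat_change) % 3


def preserves_frame(motif_len, repeat_change):
    """True if the length change is a multiple of 3 and therefore
    leaves the downstream reading frame intact."""
    return frame_offset(motif_len, repeat_change) == 0
```

Under this simplification, any change in the repeat number of a tri- or hexanucleotide motif preserves the frame, whereas gaining or losing a single di-, tetra- or pentanucleotide unit shifts it.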

For all SSR motif types, occurrence decreased strikingly as the number of repeat units increased, at both the genomic and transcript levels. The longer motifs (tetra-, penta- and hexanucleotides) showed a more dramatic reduction in frequency as the number of repeats increased (Fig. 4). The higher mutation rates of longer repeats may account for this trend. Dinucleotides, with a cumulative length of nearly 3 Mb, were the greatest contributors to the total percentage of SSRs in the genome fraction owing to their highest number of repeat units and highest frequency (Table 2 and Fig. 4). In transcript regions, trinucleotides, with a cumulative length of 307 kb, represented the largest proportion of microsatellites owing to their highest density. The longest SSR in the pineapple genome was (CA)1439. It has been proposed that repetitive sequences are recombinogenic elements in eukaryotic chromosomes (Treco and Arnheim 1986; Wahls et al. 1990); dinucleotides in particular are preferential sites for recombination because of their high affinity for recombination enzymes (Biet et al. 1999). For molecular markers, lower stability, higher frequency and longer repeats indicate that a region will be richer in polymorphisms, which makes dinucleotides the most sought-after repeat type for practical applications in pineapple population genetics.

There may be striking differences in the frequency of SSRs with certain nucleotide compositions among eukaryotic genomes or between sequence datasets (genomic and EST) of a species. Overall, the base composition of the SSR motifs showed a strong bias towards AT-rich motifs in the pineapple genome, and an increased proportion of GC-rich or AT-CG balanced motifs in the transcript regions (Fig. 5). As in many other understudied plant species (Cavagnaro et al. 2010; Morgante et al. 2002; Tangphatsornruang et al. 2009), AT was the densest dinucleotide motif, whereas GC was the least frequent, accounting for only 1 % of dinucleotides; other genomic SSRs with GC-rich repeats were also rare. This result in pineapple contradicts previous reports indicating that AT-rich repeats prevail in dicot species but not in monocots by virtue of the relative GC content of their genomes (average 34.6 % in dicots vs. 43.7 % in monocots) (Cavagnaro et al. 2010; Wang et al. 2008). This dicot-like trend in pineapple could be explained by its relatively low GC content compared to other monocots, only 38.3 % genome-wide (Ming et al. 2015). ESTs showed higher frequencies of AG repeats than AT repeats in pineapple, in agreement with previous findings from pineapple ESTs (Wöhrmann and Weising 2011) and from many other vascular plants, e.g. Arabidopsis thaliana, coffee, kiwifruit, cassava and cereals (Fraser et al. 2004; Katti et al. 2001; Morgante et al. 2002; Poncet et al. 2006; Vásquez and López 2014). Homopurine-homopyrimidine stretches such as AG in the 5′-UTR have been reported to take part in gene regulation (Varshney et al. 2005) and are preferentially associated with genes involved in transcription, nucleic acid metabolism and the regulation of gene expression (Scaglione et al. 2009). The length polymorphism of an (AG)n repeat in the 5′-UTR of the waxy gene has also been shown to correlate with amylose content in rice (Ayres et al. 1997). As in many dicots such as legume species, but not in Arabidopsis, the trinucleotide repeat AAT was overrepresented in genomic sequences of pineapple, nearly 1.16 times more abundant than the second most frequent trinucleotide repeat. By contrast, GGC repeats were typically predominant in the monocots (Mun et al. 2006). The representation level of AAT in the pineapple genome was midway between the levels found in the legume species and those of rice. Typically, GC-rich tri- and hexanucleotide motifs dominate in the transcribed regions and are less pronounced in non-transcribed regions (Mun et al. 2006; Tóth et al. 2000), in agreement with the observations from our study. CCG as the most common triplet in transcript regions is a known feature of monocots, including pineapple in this study and all cereal species (Li et al. 2004b), and seems to correlate with the increased GC content of monocot genomes (Morgante et al. 2002). The CCG motif in the 5′-UTR of ribosomal protein genes is involved in the regulation of fertilization in maize (Dresselhaus et al. 1999). The taxon-specific accumulation of certain repeat motifs could be explained by strand slippage or by positive selection pressure, such as a preference in codon usage in exons or a regulatory effect of particular repeats in non-coding regions (Mun et al. 2006), which may drive the specialization and divergence of genomes.

Expression and Functional Annotation of SSR-Containing Genes

It is important to note that SSRs identified in genic regions are informative and potentially powerful molecular markers for the plant breeding community. Because of their location inside genes, a better understanding of these SSRs could reduce the effort and resources required in the early stages of developing markers closely linked to particular genes. Additionally, gene-related SSR markers can be employed in association with mapping studies to help map the particular genes in which they reside. However, the functional significance of SSRs in plant genes remains poorly understood. Putative functional annotation and categorization of pineapple genes containing SSRs in this study revealed that those genes have a range of functions, such as protein binding, catalytic activity, metabolic enzymes, disease signaling, structural and storage proteins, and transcription factors, suggesting a biological significance of SSRs in plant metabolism and gene evolution. In the molecular function category, the majority of genes containing SSRs were homologous to proteins with binding and catalytic activities, and in the cellular component category they were mostly associated with cell, membrane and organelle. Cellular and metabolic processes were associated with most SSR-containing genes, while a small number were involved in reproduction, biological adhesion and growth processes. Similar results were found in date palm (Zhao et al. 2012) and citrus (Liu et al. 2013) transcript sequences containing SSRs, suggesting that SSR-containing genes involved in protein metabolism and biosynthesis are well conserved in plants. Genes containing SSRs are nearly three times as frequent in the pineapple genome as genes without SSRs (Fig. 6a). In general, a positive correlation between gene expression level and the number of SSRs present was found (Fig. 7). This evidence suggests that SSRs may play an important role in the regulation of gene expression and many other associated functions. The particular role of SSRs and of preferred motifs in the function of related genes needs to be further investigated in pineapple.

Conclusions

The current work contributes a detailed characterization of microsatellites in pineapple and a comparison of these microsatellites with those of related species. We identified 278,245 and 41,962 SSRs, with overall densities of 728.57 SSRs/Mb and 619.37 SSRs/Mb, in genomic and EST sequences, respectively. This density was unexpectedly high given the moderate size of the pineapple genome. Class II SSRs were more abundant than class I SSRs in all genome fractions. Microsatellites were less abundant in pineapple ESTs than in genomic sequences. Dinucleotide repeats were the most frequent SSRs in the genome, with AT being the most common single motif overall, whereas trinucleotides strongly predominated in EST sequences. AT-rich motifs prevailed in the pineapple genome, and an increased proportion of GC-rich or AT-CG balanced motifs was observed in the transcript regions. The putative functional annotation and categorization of genes containing SSRs revealed that those genes are involved in various aspects of pineapple development. Our transcriptome analysis reflected a positive relationship between expression level and the number of SSRs contained in a gene. Building on this preliminary study, primer design and laboratory validation will be necessary to develop genomic and EST-SSR markers. These potential SSR markers, especially SSR loci with GO terms, may facilitate a number of genetic and genomic studies in Ananas comosus, such as functional genomics, genetic mapping, discrimination of genotypes, diversity analysis, transferability studies, positional gene cloning and QTL analysis.

Materials and Methods

RNA Extraction and Library Construction

Pineapple transcripts were sampled from the major tissues including flower, leaf, root, and fruit. Total RNA was extracted from ground leaves using the Qiagen RNeasy Plant Mini Kit (Qiagen, #74904) following the manufacturer’s instructions. DNA was removed with the DNA-free™ DNA Removal Kit (Life Technologies, #AM1906M). A single indexed RNAseq library was constructed using the Illumina TruSeq stranded RNA Sample Preparation Kit (Illumina, #RS-122-2001), and then sequenced on an Illumina HiSeq2500 in paired-end 100 nt mode. Three biological replicates were studied for each time point.

Source of Genomic and EST Sequences

The pineapple genome sequencing project yielded 381.91 Mb of genome sequence from A. comosus F153, accounting for 72.6 % of the estimated 526 Mb pineapple genome. The contig N50 is 126.5 kb and the scaffold N50 is 11.8 Mb (Ming et al. 2015). Pineapple ESTs were downloaded from GenBank (access date: 10/27/2014). Transcripts were assembled using TRINITY (Grabherr et al. 2011) from several RNAseq libraries, including flower, leaf, root, and fruit S1 to S8. Finally, we combined the GenBank ESTs and TRINITY transcripts, which were further assembled into 61,522 unigenes using CD-HIT (percent identity cutoff at 98 %). These two sources of pineapple sequences were used for SSR mining.

SSR Mining

A large-scale, genome-wide SSR search was performed in the pineapple genome using the Perl program MISA (MIcroSAtellite identification tool) (Thiel et al. 2003), available at http://pgrc.ipk-gatersleben.de/misa/. Both perfect and compound repeats were considered, with basic motifs ranging from 2 to 6 bp and a minimum repeat length of 12 bp (for di- to tetranucleotides), 15 bp (for pentanucleotides) and 18 bp (for hexanucleotides). Mononucleotide repeats were not considered because bona fide SSRs could be confused with errors introduced during sequencing or assembly, or with polyadenylation tracts. With respect to compound repeats (distinct and adjacent SSRs), the maximum distance between two SSRs was set to 100 bp. 3′-UTR, 5′-UTR, CDS, exon and intron sequences were extracted from the gff3 annotation file using the online gff2bed Python script (http://bedops.readthedocs.org/en/latest/content/reference/file-management/conversion/gff2bed.html), ‘bedtools getfasta’ to convert BED format to FASTA, and in-house Python scripts. Altogether we obtained ~41.95 Mb of exon sequences, ~100.49 Mb of intron sequences, ~33.32 Mb of coding sequences, and ~2.92 Mb and ~5.68 Mb of 5′- and 3′-UTR sequences, respectively. SSR density, GC content, and the distributions of motif, repeat length and repeat number in the pineapple whole genome, genic regions and EST sequences were estimated, analyzed and compared with each other using Microsoft Excel 2010 and Python scripts on Linux. Repeat motifs that differ only by position within the repeat or by strand were considered equivalent and grouped into one motif. For instance, the motif AG is equivalent to GA, TC, CT, and so forth.
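To illustrate the search criteria described above, the following Python sketch detects perfect SSRs with the stated minimum lengths and groups equivalent motifs. It is not the MISA implementation: it is a simplified regular-expression approach that ignores compound repeats, and all function names are hypothetical.

```python
import re

# Minimum total SSR length (bp) per motif size, as stated in the text.
MIN_LENGTH_BP = {2: 12, 3: 12, 4: 12, 5: 15, 6: 18}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def _is_primitive(motif):
    """False if the motif is itself a repetition of a shorter unit
    (e.g. 'AA' or 'ATAT'); this also excludes mononucleotide runs."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_perfect_ssrs(seq):
    """Return a sorted list of (start, motif, repeats) for perfect SSRs
    meeting the minimum length for their motif size (0-based start)."""
    seq = seq.upper()
    hits = []
    for size, min_len in MIN_LENGTH_BP.items():
        min_repeats = -(-min_len // size)  # ceiling division
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if _is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


def canonical_motif(motif):
    """Group a motif with its rotations and reverse-complement rotations,
    returning the alphabetically smallest form (AG, GA, TC, CT -> 'AG')."""
    motif = motif.upper()
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    rotations = [s[i:] + s[:i] for s in (motif, revcomp) for i in range(len(s))]
    return min(rotations)
```

Choosing the alphabetically smallest form as the group label is an arbitrary convention adopted here for illustration; any consistent representative would serve the same purpose.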

SSR-Containing Gene and No-SSR Gene GO Annotation and Expression Estimation

The trimmed paired-end reads of each sample were aligned to the repeat-masked pineapple assembly version 3 using TopHat v2.0.9 with default settings (Trapnell et al. 2009). The normalized FPKM values (fragments per kilobase of exon per million fragments mapped) of each sample were estimated by Cufflinks v2.2.1, followed by Cuffnorm v2.2.1, using default settings with the pineapple gene model annotation provided (−g option). Several in-house Python scripts (available upon request) were used to extract the IDs of genes with no SSR, one SSR, and two or more SSRs. The log2 FPKM value for each class of genes was used to generate heatmaps via pheatmap in R (edgeR) ver. 3.2.1 statistical package (www.CRAN.R-project.org). To further examine the relationship between the number of SSRs contained in a gene and its expression level, genes with high FPKM values were grouped by the number of SSRs they contained. Genes with a log2 FPKM value >5 were considered to derive from actively transcribed regions and were defined as highly expressed genes. GO terms and KEGG pathway information associated with each protein were computed using INTERPROSCAN (Zdobnov and Apweiler 2001).
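The grouping of genes by SSR number and the threshold for highly expressed genes can be sketched as follows. This is a minimal illustration and not the in-house scripts: the function names and the input format (gene ID, SSR count, FPKM) are assumptions, and whether a pseudocount was added before the log2 transformation is not stated in the text, so none is applied here.

```python
import math

HIGH_EXPRESSION_LOG2_FPKM = 5.0  # threshold stated in the text


def ssr_group(n_ssrs):
    """Assign a gene to one of the three SSR classes used in the text."""
    if n_ssrs == 0:
        return "no SSR"
    if n_ssrs == 1:
        return "one SSR"
    return "two or more SSRs"


def is_highly_expressed(fpkm):
    """True if log2(FPKM) > 5, i.e. FPKM > 32; zero FPKM is never high."""
    return fpkm > 0 and math.log2(fpkm) > HIGH_EXPRESSION_LOG2_FPKM


def count_highly_expressed(genes):
    """Count highly expressed genes per SSR class.

    genes: iterable of (gene_id, n_ssrs, fpkm) tuples (hypothetical format).
    """
    counts = {"no SSR": 0, "one SSR": 0, "two or more SSRs": 0}
    for _gene_id, n_ssrs, fpkm in genes:
        if is_highly_expressed(fpkm):
            counts[ssr_group(n_ssrs)] += 1
    return counts
```

Applied to a table of per-gene SSR counts and FPKM values, the resulting counts per class can be compared directly to assess whether highly expressed genes are enriched for SSRs.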