Introduction

The chloroplast is a double membrane-bounded organelle (Cooper 2000). Chloroplasts contain their own DNA and replicate independently from the nuclear genome (Palmer 1985). This important organelle plays a role in photosynthesis and sustains life on earth (Daniell et al. 2016). Chloroplast genomes exhibit a circular quadripartite structure based on the arrangement of three important regions (Palmer 1985; Daniell et al. 2016; Mehmood et al. 2019; Abdullah et al. 2020). The inverted repeat (IRa and IRb) regions are situated between the large single copy region (LSC) and the small single copy region (SSC) (Palmer 1985; Daniell et al. 2016; Abdullah et al. 2019a; Yu et al. 2019a). However, the quadripartite structure has not been observed in the chloroplast genomes of species from various lineages, such as Pinaceae (Wu et al. 2011), Cephalotaxaceae (Yi et al. 2013), Taxodiaceae (Hirao et al. 2008), Taxaceae (Zhang et al. 2014), Fabaceae (Sabir et al. 2014), and Cactaceae (Sanderson et al. 2015), due to the loss of one or two IRs, whereas very short IRs are also reported in Pinaceae (Zeb et al. 2019). Moreover, a linear chloroplast genome structure has also been reported along with the circular chloroplast genome (Oldenburg and Bendich 2016). The size of chloroplast genomes ranges from 107 kb (Cathaya argyrophylla) to 218 kb (Pelargonium) (Daniell et al. 2016). The chloroplast genome contains 120–130 genes, including tRNA genes, rRNA genes, and protein-coding genes (Palmer 1985; Ahmed et al. 2012; Daniell et al. 2016; Iram et al. 2019; Abdullah et al. 2020).

The structure of chloroplast genomes is conserved regarding gene content, intron content, and gene organisation (Palmer 1985; Daniell et al. 2016; Shahzadi et al. 2019; Mehmood et al. 2019). However, events of gene loss, intron loss, gene rearrangement, and conversion of functional genes to pseudogenes have also been reported (Ahmed et al. 2012; Menezes et al. 2018; Hu et al. 2019; Liu et al. 2019a). Contraction and expansion of the inverted repeats occur frequently in chloroplast genomes and lead to the origination of pseudogenes, duplication of genes, or conversion of duplicate genes to single copy (Ahmed et al. 2012; Menezes et al. 2018; Abdullah et al. 2019a, 2020; Liu et al. 2019a). Many mutational events occur within the chloroplast genome, including inversions, oligonucleotide repeats, microstructural changes, InDels (insertions–deletions), and substitutions (Xu et al. 2015; Abdullah et al. 2019b; Shahzadi et al. 2019; Mehmood et al. 2019). Large-scale gene rearrangements have also been reported in some plant lineages, including Marattiaceae (Roper et al. 2007), Equisetaceae (Karol et al. 2010), Fabaceae (Schwarz et al. 2015), Geraniaceae (Marcussen and Meseguer 2017), Linaceae (Lopes et al. 2018), Passifloraceae (Rabah et al. 2019; Shrestha et al. 2019), and many non-photosynthetic plant species (Wicke et al. 2016).

Chloroplast genomes are inherited maternally in most angiosperms or paternally in some gymnosperms (Neale and Sederoff 1989; Daniell et al. 2016). Unlike the nuclear genome, chloroplast genomes lack meiotic recombination (Palmer 1985; Daniell et al. 2016). These properties, along with adequate levels of polymorphism, make the chloroplast genome a suitable molecule for evolutionary studies such as phylogeography, population genetics, phylogenetics, molecular evolution, and genome evolution (Ahmed et al. 2012, 2013; Li et al. 2013; Ahmed 2014; Henriquez et al. 2014; Xu et al. 2015; Marcussen and Meseguer 2017; Li and Zheng 2018; Zhai et al. 2019). Recently, several studies either used complete chloroplast genome sequences for inferring phylogenies (Feng et al. 2019; Zhai et al. 2019) or followed the alternate approach of Ahmed et al. (2013) and identified highly suitable polymorphic loci for designing unique markers for barcoding and phylogenetic studies in several plant lineages (Bi et al. 2018; Menezes et al. 2018; Shahzadi et al. 2019; Abdullah et al. 2019a, 2020).

The plant family Araceae is a large and ancient monocot plant family. This family belongs to the order Alismatales and comprises 118 described genera and 3414 species, whereas 132 genera and 5946 species are expected (Boyce and Croat 2018). This family is unique among angiosperms based on its diverse morphology, ecology, and wide distribution from tropical to temperate regions (Gunawardena and Dengler 2006; Cabrera et al. 2008). This important family has been subdivided into eight subfamilies (Cabrera et al. 2008; Nauheimer et al. 2012), of which subfamily Monsteroideae is considered the third largest subfamily with ca. 369 described species and ca. 700 estimated species (Boyce and Croat 2018). It comprises mostly hemiepiphytic or epiphytic plants restricted to the tropics, with three intercontinental disjunctions (Zuluaga et al. 2019). Monsteroideae is part of one of the earlier diverging lineages in Araceae and may help to provide a clearer picture of the evolution of the family (Zuluaga et al. 2019). Several studies based on plastid and nuclear markers inferred the phylogeny of this important subfamily, but the phylogeny of certain clades and genera is still unresolved (Tam et al. 2004; Cabrera et al. 2008; Cusimano et al. 2011; Henriquez et al. 2014; Zuluaga 2015). A recent study inferred the phylogeny of 126 species of subfamily Monsteroideae based on five plastid and two nuclear markers, which revealed the low polymorphism and low efficacy of these markers for species-level phylogenetic reconstruction of tropical Araceae (Zuluaga et al. 2019). For inferring the phylogeny of plant lineages with complex taxonomy, specific and suitable polymorphic markers are required (Daniell et al. 2016). The comparative analyses of chloroplast genomes of subfamily Monsteroideae might be appropriate for identification of suitable loci for designing cost-effective, unique, and robust markers. However, the chloroplast genomes of only two Monsteroideae species are reported, both from the same genus: Spathiphyllum cannifolium (Liu et al. 2019b) and Spathiphyllum kochii (KR270822). These genomic resources are insufficient for determination of suitable polymorphic loci for designing cost-effective markers with high-resolution potential.

The recent phylogenetic inference, based on chloroplast and nuclear markers, of the 126 species from various genera of subfamily Monsteroideae shows that species of four genera, Spathiphyllum, Stenospermation, Monstera, and Rhaphidophora, are distantly related (Zuluaga et al. 2019). To broaden the genomic resources and uncover the molecular diversity of the subfamily, we selected one species from each of these four diverse genera, as comparative analyses of the chloroplast genome sequences of diverse species are helpful in the identification of suitable polymorphic loci for designing unique markers. The chloroplast genomes of Spathiphyllum patulinervum, Stenospermation multiovulatum, Monstera adansonii, and Rhaphidophora amplissima were assembled and annotated. These chloroplast genomes will be helpful in understanding the evolutionary dynamics and in the elucidation of chloroplast genome structure of subfamily Monsteroideae. The comparative analyses of these species gave us insight into the evolutionary patterns and molecular evolution of subfamily Monsteroideae. These resources also enabled us to identify suitable polymorphic loci for designing cost-effective, robust, and unique markers which could provide high-resolution potential for inferring phylogenies of subfamily Monsteroideae even at the species level.

Materials and methods

Sample collection, DNA extraction, and sequencing

Fresh, healthy leaf tissues were collected from the Araceae Greenhouse at the Missouri Botanical Garden in St. Louis, Missouri from four Monsteroideae species: S. patulinervum, S. multiovulatum, M. adansonii, and R. amplissima. We used 100 mg of fresh leaves for whole-genomic DNA extraction and performed two extractions per taxon using the Qiagen DNeasy Minikit (Qiagen, Germantown, Maryland, USA). The DNA of each extraction was eluted in 125 µL elution buffer. The quantity and quality of DNA were confirmed by 1% agarose gel electrophoresis and Nanodrop (ThermoScientific, Delaware, USA). The libraries were constructed according to the manufacturer’s protocol of Illumina TruSeq kits (Illumina, Inc., San Diego, California) in the Pires lab at the University of Missouri, Columbia. The qualified libraries were sequenced as single-end 100 bp reads using the Illumina HiSeq 2000 at the University of Missouri DNA Core. The sequencing of whole-genomic DNA by HiSeq 2000 with 100 bp short-read length produced from 3.36 GB (12.87 million short reads) in R. amplissima to 9.47 GB (36.28 million reads) in M. adansonii.

Short-read data-quality analyses and submission to Sequence Read Archive

The quality of the short reads was analysed by FastQC (Andrews 2017) in the Galaxy portal (https://usegalaxy.org). Because Galaxy limits file size to 2 GB, the fastq.gz files of samples whose concatenated fastq.gz file exceeded 2 GB (M. adansonii and S. multiovulatum) were uploaded individually. To compare the quality of the raw data across samples, a MultiQC (Ewels et al. 2016) analysis was also performed in the Galaxy portal. These analyses confirmed the quality of the raw reads, with average Phred scores of 35.19–37.85. The raw data of all four species were submitted to the National Center for Biotechnology Information (NCBI) under Sequence Read Archive (SRA) number PRJNA547622.
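As an illustration of what an average Phred score summarises, the following minimal sketch computes the mean per-base quality from FASTQ records. It is not part of FastQC or MultiQC; the function name `mean_phred` and the Phred+33 offset are assumptions made for the example.

```python
from statistics import mean

def mean_phred(fastq_lines, offset=33):
    """Return the mean Phred score over all bases in a list of FASTQ lines.

    Assumes standard four-line records and Phred+33 encoding (an assumption
    for this sketch; check the encoding of real data).
    """
    scores = []
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:  # every fourth line holds the quality string
            scores.extend(ord(ch) - offset for ch in line.strip())
    return mean(scores)

# Toy record: the character "I" encodes Phred 40 under the +33 offset
example = ["@read1", "ACGT", "+", "IIII"]
print(mean_phred(example))
```

For real data, the lines would be read from the (decompressed) fastq.gz file rather than from a list.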

Genome assembly and annotations

Chloroplast genomes were assembled using the Fast-Plast v. 1.2.2 pipeline (https://github.com/mrmckain/Fast-Plast) under default settings. The reads were first cleaned by Trimmomatic v. 0.36 (Bolger et al. 2014). The reads of chloroplast origin were extracted from the clean reads by mapping them to the available packaged Alismatales plastomes using bowtie2 v. 2.2.9 (Langmead and Salzberg 2012) with the “very-sensitive-local” parameter. We used SPAdes v.3.9.0 (Bankevich et al. 2012) with various k-mers to assemble the extracted mapped reads. The contigs generated by SPAdes v.3.9.0 were assembled using afin v. 1.0 (https://github.com/afinit/afin) with three iterations of 150, 50, and 50 loops, an initial contig trimming of 100 base pairs, a 20, 15, and 10 (per iteration) overlap of reads to contigs, a minimum coverage of 2, 1, and 1 reads per loop, and the full set of trimmed reads from the Trimmomatic phase. When the complete genome was assembled in a single contig, Perl code was used to orientate the contig and to identify the single copy regions and inverted repeat regions. When afin v. 1.0 did not yield a complete chloroplast genome from the contigs generated with SPAdes v.3.9.0, Sequencher (Genecodes, Ann Arbor, MI, USA) was used with the clean short reads to bridge gaps as in McKain et al. (2016). After the Sequencher step, the short reads were mapped to the assembled genome using bowtie2 v. 2.2.9. The chloroplast assembly was verified through a coverage analysis conducted in Jellyfish2 v. 2.2.6 (Marçais and Kingsford 2011) under default parameters. The 25-mer abundance was used to map a 25-bp sliding window of coverage across the chloroplast genome of each species to detect misassembled regions, if any. If any misassembled region was identified in the assembly, the chloroplast genome was reassembled by repeating all the steps from afin v.1.0 onwards. After obtaining the assembled chloroplast genome, the clean reads were once again mapped to the final assembled genome using bowtie2 v. 2.2.9. After mapping the reads to the assembled genome, Pilon v. 1.21 (Walker et al. 2014) was used to identify and fix any potential assembly issues and to identify minor mislabelled base call variants. Pilon is usually used for improvement and removal of small errors that exist in an assembled genome. Hence, we accepted the output of Pilon as the final assemblies.
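The idea behind the 25-mer sliding-window check can be sketched in a few lines of pure Python. This is only an illustration, not the Jellyfish2 workflow used in the study: the function names, the use of canonical k-mers, and the `min_abundance` threshold are assumptions, and a dedicated k-mer counter is needed for real read sets.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)

def kmer_counts(reads, k=25):
    """Count canonical k-mers across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts

def low_coverage_windows(assembly, counts, k=25, min_abundance=1):
    """Return start positions of k-bp windows whose k-mer abundance in the
    reads falls below min_abundance (hypothetical threshold)."""
    flagged = []
    for i in range(len(assembly) - k + 1):
        if counts[canonical(assembly[i:i + k])] < min_abundance:
            flagged.append(i)
    return flagged

# Toy example with k=3: the last window (CGG) is absent from the reads
counts = kmer_counts(["ACGTAC"], k=3)
print(low_coverage_windows("ACGTACGG", counts, k=3))
```

Runs of flagged windows would mark candidate misassembled regions that warrant reassembly or manual inspection.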

The coverage depth of the final assembled chloroplast genomes was determined with Bowtie2 v.2.2.9 and ranged from 43.1X (S. patulinervum) to 449.9X (S. multiovulatum). The details of the quantity and quality of raw reads, number of chloroplast genome reads, coverage depth, and NCBI accession numbers are provided in Table 1.
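A rough way to relate the number of mapped reads to coverage depth is to divide the total mapped bases by the genome length. The sketch below shows this approximation only; the read count in the example is hypothetical, and the depths reported here were derived from the actual read mapping.

```python
def mean_depth(mapped_reads, read_length, genome_size):
    """Approximate mean coverage depth as total mapped bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return mapped_reads * read_length / genome_size

# Hypothetical example: 270,000 mapped 100 bp reads on a 163,335 bp genome
depth = mean_depth(270_000, 100, 163_335)
print(f"{depth:.1f}X")
```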

Table 1 Quantity and quality of the sequencing data and coverage depth of the assembled genomes

The newly assembled chloroplast genomes were annotated using GeSeq (Tillich et al. 2017), whereas the tRNA genes were further verified by ARAGORN v1.2.38 (Laslett and Canback 2004) and tRNAscan-SE v.2.0.3 (Lowe and Chan 2016). The start and stop codons of protein-coding genes that were identified by GeSeq were further confirmed by manual visualisation as well as by blasting against homologous genes in Geneious R8.1 (Kearse et al. 2012). The stop codon of each gene was also confirmed by analysing the translation of each protein-coding gene in Geneious R8.1 (Kearse et al. 2012). A gene was declared a pseudogene if it contained an internal stop codon as compared to other homologous genes or existed as a truncated/partial copy of a gene. GB2Sequin (Lehwark and Greiner 2019) was used to generate a five-column tab-delimited annotation file for submission of the chloroplast genome of each species to GenBank at the National Center for Biotechnology Information (NCBI) with specific accession numbers (Table 1). Circular diagrams of the fully annotated plastomes were drawn by OrganellarGenomeDRAW (OGDRAW) (Lohse et al. 2007).
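The internal stop codon criterion can be illustrated with a short check on a coding sequence. This is a minimal sketch, not the Geneious-based procedure described above: the function name is hypothetical, it covers only the internal-stop criterion (not truncated copies), and it ignores RNA editing.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def internal_stop_positions(cds):
    """Return codon indices (0-based) of in-frame stop codons that occur
    before the final codon of a coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [idx for idx, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]

# An intact reading frame versus one interrupted by a premature stop
print(internal_stop_positions("ATGGGGTGA"))     # no internal stop
print(internal_stop_positions("ATGTAAGGGTGA"))  # premature TAA at codon 1
```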

Comparative analyses, determination of polymorphic loci, and phylogenetic inference

Geneious R8.1 (Kearse et al. 2012) was used for comparison of genomic features and for determination of codon usage and amino acid frequency. IRscope (Amiryousefi et al. 2018) was used for the analyses of inverted repeat region contraction and expansion at the junctions of chloroplast genomes. The Geneious R8.1 (Kearse et al. 2012) integrated Mauve alignment (Darling et al. 2004) was used to analyse chloroplast genome organisation and gene arrangement based on analyses of collinear blocks.

The rates of synonymous (Ks) and non-synonymous (Ka) substitutions and their ratio (Ka/Ks) were also determined for 76 protein-coding genes. The MAFFT alignment of each protein-coding gene was exported in FASTA format from Geneious R8.1 (Kearse et al. 2012) after removal of the stop codon and analysed in DnaSP v. 5.10.01 (Rozas et al. 2017) following previous studies (Choi et al. 2018; Kim et al. 2019; Abdullah et al. 2020). We used S. kochii as the reference for all the species of subfamily Monsteroideae, and the results were interpreted as follows: Ka/Ks > 1 indicates positive selection, Ka/Ks < 1 indicates purifying selection, and Ka/Ks = 1 indicates neutral evolution.
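The interpretation rule for Ka/Ks can be written as a small helper. This sketch only classifies values that have already been estimated (for example by DnaSP); the function name, the tolerance, and the handling of Ks = 0 are assumptions for illustration.

```python
def classify_selection(ka, ks, tol=1e-9):
    """Classify a gene by its Ka/Ks ratio.

    Returns "undefined" when Ks is zero, because the ratio cannot be computed.
    """
    if ks == 0:
        return "undefined"
    ratio = ka / ks
    if abs(ratio - 1.0) <= tol:
        return "neutral"
    return "positive" if ratio > 1.0 else "purifying"

# Hypothetical values for three genes
for gene, (ka, ks) in {"geneA": (0.0015, 0.0193),
                       "geneB": (0.0300, 0.0100),
                       "geneC": (0.0020, 0.0)}.items():
    print(gene, classify_selection(ka, ks))
```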

The REPuter program (Kurtz et al. 2001) was employed to identify four types of oligonucleotide repeats: palindromic, reverse, forward, and complementary. The parameters were set to identify repeat pairs of ≥ 30 bp with a minimum similarity index of 90%.
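The four repeat types differ only in how the second copy relates to the first. The sketch below classifies a given pair of equal-length copies using the 30 bp and 90% thresholds stated above. It is not REPuter's algorithm, which searches the whole genome efficiently; the function names and the simple position-wise identity measure are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def classify_repeat(copy1, copy2, min_len=30, min_identity=0.9):
    """Return the repeat type relating copy2 to copy1, or None.

    The first matching type is returned if a pair fits more than one.
    """
    if len(copy1) < min_len:
        return None
    candidates = {
        "forward": copy1,
        "reverse": copy1[::-1],
        "complementary": copy1.translate(_COMPLEMENT),
        "palindromic": copy1.translate(_COMPLEMENT)[::-1],  # reverse complement
    }
    for name, expected in candidates.items():
        if identity(expected, copy2) >= min_identity:
            return name
    return None

# Short toy copies, so the length threshold is lowered for the demonstration
print(classify_repeat("AACGT", "ACGTT", min_len=5, min_identity=1.0))
```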

Suitable polymorphic regions were determined using two different approaches of chloroplast genome comparison. At the subfamily level, we aligned all four reported species of subfamily Monsteroideae using the multiple alignment of MAFFT (multiple alignment using fast Fourier transform) (Katoh et al. 2005) and compared all protein-coding genes, conserved IGS (intergenic spacer regions), and conserved intronic regions following Abdullah et al. (2019b) and Shahzadi et al. (2019). Most of the IGS and intronic regions showed a high level of polymorphism and produced high missing data (> 5%) due to a large number of InDels and inversions as compared to substitutions. Hence, these regions were not considered suitable for the phylogenetic inference of the subfamily Monsteroideae and were discarded from the list of suitable polymorphic loci. Inversions can also produce false phylogenetic relationships (Menezes et al. 2018). The number of InDels and substitutions of each region was counted manually and divided by the length of the alignment to find the percentage diversity of each region. At the genus level, we compared S. patulinervum (reported in the current study) and S. kochii (KR270822) (downloaded from NCBI) using MAFFT pairwise alignment following Abdullah et al. (2019a). In the pairwise alignment, we compared each protein-coding sequence, intronic region, and intergenic spacer region to identify suitable polymorphic loci for designing unique and robust markers. The average diversity of each region was determined by dividing the number of substitutions and InDels by the size of the alignment of each region. We also removed inversions from the alignment to avoid false results.
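The diversity measure described above (substitutions plus InDels divided by alignment length) can be sketched for a pairwise alignment as follows. The counts in the study were made manually; this example assumes that each contiguous run of gap columns counts as one InDel event, which is an assumption rather than a stated rule, and the function name is hypothetical.

```python
def pairwise_diversity(seq_a, seq_b):
    """Return (substitutions, indel_events, diversity) for two aligned sequences.

    Gaps are "-"; a contiguous run of gap columns is counted as one InDel.
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    substitutions = 0
    indels = 0
    in_gap = False
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            if not in_gap:
                indels += 1  # a new gap run starts here
            in_gap = True
        else:
            in_gap = False
            if a != b:
                substitutions += 1
    return substitutions, indels, (substitutions + indels) / len(seq_a)

# Toy alignment: one substitution and two InDel events over 10 columns
print(pairwise_diversity("ACGT--ACGA", "ACCTGGAC-A"))
```

Multiplying the returned diversity by 100 gives the percentage diversity of a region.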

We used a total of 18 species in inferring the phylogeny, including 14 species downloaded from NCBI. The details of the species are provided in Table S1. We used Acorus americanus from family Acoraceae to root the tree. The phylogenetic relationships were inferred based on complete chloroplast genomes following Abdullah et al. (2019a) after removal of the IRa region from each genome. The IQ-tree program (Nguyen et al. 2015; Kalyaanamoorthy et al. 2017; Hoang et al. 2018) was used for the reconstruction of the phylogenetic tree with default parameters, including 1000 replications and 1000 iterations, along with the best fit model TVM + F + I + G4. The TreeDyn program was used to improve visualization of the phylogenetic tree (Dereeper et al. 2008).

Results

Comparison of chloroplast genomic features in subfamily Monsteroideae

The size of the chloroplast genome ranged from 163,335 bp (R. amplissima) to 164,751 bp (S. patulinervum). The size of the LSC ranged from 89,714 bp (R. amplissima) to 91,841 bp (S. patulinervum), the SSC ranged from 21,448 bp (S. multiovulatum) to 22,346 bp (S. kochii), and the size of each IR region ranged from 25,270 bp (S. kochii) to 25,931 bp (S. multiovulatum) (Table 2). All the species showed conserved intron content, gene content, and gene organisation. The circular map of chloroplast genomes and the locally collinear blocks (LCBs) of the Mauve alignment confirmed the high similarities among these species (Figs. 1, 2). The LCBs also revealed similarity in gene arrangement and chloroplast genome organisation. The average GC content of the chloroplast genomes was 36% and revealed a high extent of similarity. However, the GC content varied among the three main regions of the chloroplast genome. The IR regions showed high GC content compared to the LSC and SSC regions. All the species have 114 unique genes that included 30 tRNA genes, four rRNA genes, and 80 protein-coding genes (Table 2, Fig. 1). In the IR regions, 17 genes were present and duplicated, including seven tRNA genes, four rRNA genes, and six protein-coding genes. We found 18 intron-containing genes, including six tRNA genes and 12 protein-coding genes. Among intron containing genes, two tRNA genes and three protein-coding genes contained introns (Table S2). The infA gene exists as a pseudogene in all species. The ycf1 gene was found to be functional in all species at the junction of SSC/IRa. However, a pseudo-copy of ycf1 also originated at the junction of IRb/SSC only in S. patulinervum, along with the functional copy of ycf1 at the junction of SSC/IRa, because the gene starts in the IR region instead of lying completely within the SSC region (Fig. 3).
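The comparison of GC content among the LSC, SSC, and IR regions can be sketched as below. The region coordinates and the toy sequence are hypothetical; in practice the boundaries come from the genome annotation.

```python
def gc_content(seq):
    """Return the GC fraction of a sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def gc_by_region(genome, regions):
    """Compute GC content per region.

    regions maps a region name to a (start, end) pair, 0-based and
    end-exclusive (hypothetical coordinates for this sketch).
    """
    return {name: gc_content(genome[start:end])
            for name, (start, end) in regions.items()}

toy_genome = "ATATATATGCGCGCATAT"
toy_regions = {"LSC": (0, 8), "IRb": (8, 14), "SSC": (14, 18)}
print(gc_by_region(toy_genome, toy_regions))
```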

Table 2 Comparison of chloroplast genomes of Monstera adansonii, Rhaphidophora amplissima, Stenospermation multiovulatum, Spathiphyllum patulinervum, and Spathiphyllum kochii
Fig. 1

Circular map of Monsteroideae chloroplast genomes. Genes transcribed counter clockwise are shown inside the circle. Genes transcribed clockwise are shown outside the circle. The colour of each gene corresponds to its function

Fig. 2

Collinear block (LCB) analyses of Monsteroideae genomes. The colours of the LCBs represent genes: white represents protein-coding genes, black represents tRNA genes, green represents intron-containing tRNA genes, and blue represents rRNA genes. The LCB analyses revealed a high level of similarity among the Monsteroideae species

Fig. 3

Inverted repeat contraction and expansion analyses of quadripartite junctions of chloroplast genomes among Monsteroideae. Genes represented above the block are transcribed counter clockwise and genes represented below the block are transcribed clockwise. The number above each arrow shows the distance of the gene from the junction. Genes that span a junction are represented with a scale bar, and the number shows the extent, in base pairs, to which the gene extends from one region into the other

Analyses of inverted repeat regions contraction and expansion

The analysis of inverted repeat contraction and expansion revealed high similarities at the junctions of LSC/IRb, IRb/SSC, SSC/IRa, and IRa/LSC. At the junction of LSC and IRb, the rps19 gene was located completely in the LSC region, and rpl2 was located completely in the IRb region. At the junction of IRb and SSC, the ndhF gene was located completely in the SSC region. At the junction of SSC and IRa, the ycf1 gene was located completely in the SSC region of three species, whereas in one species (S. patulinervum), it started in the IR region and extended into the SSC region. Hence, it left a pseudogene of 81 bp at the junction of IRb and SSC in S. patulinervum. The trnH gene was located completely in the LSC region in four species but not in S. patulinervum, in which the trnH gene extended into IRa by 11 bp at the junction of IRa and LSC (Fig. 3).

Codon usage and amino acid frequency

The relative synonymous codon usage (RSCU) analyses revealed that codons that end with A/T at the 3′ end have RSCU ≥ 1 and encode the highest amount of amino acids. The codons that end with C/G at the 3′ end have RSCU < 1 and encode the lowest amount of amino acids (Table S3). The ATG codon encodes formyl-methionine as a start codon using a specific tRNA-fMet (CAU) for translation initiation and methionine during translation elongation using the tRNA-Met (CAU). This was the most common start codon in the chloroplast genomes of all species. However, other codons were also found as start codons, such as ACG (in rpl2), ATA (in cemA), and GTG (in rps19). The analysis of amino acid frequencies revealed that leucine is the most abundant and cysteine the rarest amino acid. In general, we found high similarities in codon usage and amino acid frequency among the four species of subfamily Monsteroideae (Figure S1).
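RSCU is the observed count of a codon divided by the count expected if all synonymous codons of that amino acid were used equally. The following sketch computes it from a list of coding sequences; it is an illustration rather than the Geneious procedure used here, and it assumes the standard amino acid assignments and treats the three stop codons as one synonymous family.

```python
from collections import Counter, defaultdict

# Standard genetic code laid out in TCAG order (stop codons shown as "*")
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: _AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(_BASES)
               for j, b in enumerate(_BASES)
               for k, c in enumerate(_BASES)}

def rscu(cds_list):
    """Return a dict of codon -> RSCU for amino acids observed in cds_list."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases
                counts[codon] += 1
    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        families[amino_acid].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid not observed
        expected = total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values

# Toy example: phenylalanine encoded twice by TTT and once by TTC
result = rscu(["TTTTTTTTC"])
print(result["TTT"], result["TTC"])
```

A value above 1 means the codon is used more often than expected under equal usage, and a value below 1 means it is used less often.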

Rate of evolution of protein-coding genes

The rate of synonymous substitutions (Ks), non-synonymous (Ka) substitutions, and their ratio (Ka/Ks) showed low rates of evolution for all types of protein-coding genes in the chloroplast genome. Synonymous substitutions were more common than non-synonymous substitutions; therefore, low values of Ka/Ks were observed. We observed low average values of Ks, Ka, and Ka/Ks for the different groups of genes: cytochrome group (Ks = 0.0193, Ka = 0.0015, and Ka/Ks = 0.0784), photosystem I group (Ks = 0.0312, Ka = 0.0036, and Ka/Ks = 0.1145), photosystem II group (Ks = 0.0253, Ka = 0.0042, and Ka/Ks = 0.1678), ribosomal small subunit group (Ks = 0.0359, Ka = 0.0063, and Ka/Ks = 0.1753), ATP synthase group (Ks = 0.0201, Ka = 0.0036, and Ka/Ks = 0.1797), NADPH dehydrogenase group (Ks = 0.0238, Ka = 0.0058, and Ka/Ks = 0.2434), ribosomal large subunit group (Ks = 0.0172, Ka = 0.0046, and Ka/Ks = 0.2693), and RNA polymerase group (Ks = 0.0253, Ka = 0.0073, and Ka/Ks = 0.2876). Our data revealed that purifying selection acts on the genes of the cytochrome group, photosystem I group, and photosystem II group. The details about the evolution of each gene are provided in Table S4. We found seven genes that showed Ka/Ks ≥ 1 and hence showed positive selection pressure. Genes ndhK, rbcL, ycf2, and ndhD showed positive selection in S. patulinervum; accD and rps8 in R. amplissima; and psbK in S. multiovulatum.

Repeats analyses

REPuter detected four types of oligonucleotide repeats: palindromic, reverse, forward, and complementary. The abundance of each repeat type varied among species. Forward repeats were most abundant in S. patulinervum, reverse repeats in S. multiovulatum, and both palindromic and complementary repeats in M. adansonii (Fig. 4a). The size of repeats varied among species, but most of the repeats were in the range of 35–44 bp (Fig. 4b). The LSC region contained more repeats than the SSC and IR regions, whereas some repeats were also shared among the different regions of the chloroplast genome (Fig. 4c). The analysis of repeat distribution based on functional regions of the chloroplast genome revealed that most of the repeats were present in intergenic spacer regions (IGS) as compared to protein-coding sequences and introns (Fig. 4d). The details of the repeats are also provided in Table S5.

Fig. 4

Oligonucleotide repeats in Monsteroideae chloroplast genomes. a Represents the four types of repeats: F (forward), R (reverse), P (palindromic), and C (complementary). b Represents the size of repeats, i.e., 30–34 represents repeats that range in size from 30 to 34 bp, and so on. c Represents the distribution of repeats in the three main regions of the chloroplast genome: LSC (large single copy), SSC (small single copy), and IR (inverted repeat). The LSC/IR, LSC/SSC, and IR/SSC represent shared repeats, i.e., LSC/IR represents repeat pairs in which one copy is present in the LSC and the other copy is present in the IR. d Represents the distribution of repeats in functional regions of the chloroplast genome: IGS (intergenic spacer regions), Intron (intronic regions), and CDS (protein-coding sequences). The IGS/intron and IGS/CDS represent repeats shared between these regions

Identification of suitable polymorphic loci at subfamily and genus level

The comparison among the species of the subfamily Monsteroideae identified 30 polymorphic regions suitable for phylogenetic inference at the subfamily level, drawn from IGS regions, intronic regions, and protein-coding sequences (Table 3). Most of the regions were IGS regions, namely trnQ-UUG-psbK, trnW-trnP, rpoA-rps11, petG-trnT, and trnC-petN. We did not include in the list of suitable polymorphic loci those regions which produced high levels of missing data (> 5%) due to multiple InDel and inversion events. The protein-coding regions include ycf1, psbK, ccsA, accD, rbcL, matK, and ndhF. The intronic regions of petL and rpoC1 were also included in the suitable polymorphic loci. The sequences of some IGS and intronic regions were only partially included in the list of suitable polymorphic loci to avoid those sequences which revealed high polymorphism and showed a high level of InDels and inversions. The sequences that were chosen from each region are given in Table 3, whereas the percentage of polymorphism of all the polymorphic protein-coding sequences along with some conserved IGS and intronic regions is given in Fig. 5a. These polymorphic loci might be suitable for phylogenetic inference of the subfamily Monsteroideae.

Table 3 Polymorphic loci identified by comparing four species of Monsteroideae
Fig. 5

Suitable polymorphic regions in subfamily Monsteroideae and genus Spathiphyllum. a The percent diversity of the protein-coding sequences of chloroplast genome regions based on comparison of four species of subfamily Monsteroideae. We included polymorphic protein-coding genes and the conserved intergenic spacer regions and intronic regions. We avoided regions with high indels and inversion events as these are not preferred for inferring of phylogeny at subfamily or family levels. b, c Nucleotide differences of complete chloroplast genomes based on comparison of S. patulinervum and S. kochii. The X-axis represents regions of chloroplast genome, whereas the Y-axis represents the nucleotide differences. The LSC, IR, and SSC on the X-axis indicate those regions which belong to large single copy, inverted repeat, and small single copy, respectively. Conserved regions without nucleotide differences were ignored and were not included in b and c

We also compared all the regions of the chloroplast genomes of the Spathiphyllum species in pairwise alignment. The average nucleotide differences were highest in intergenic spacer regions (0.0140), followed by intronic regions (0.0088) and then by protein-coding sequences (0.0048). We identified 30 highly polymorphic loci, most of which belong to IGS regions, including trnS-GCU-trnG-UCC, trnH-psbA, atpH-atpI, trnP-UGG-psaJ, psbK-psbI, and psaC-ndhE (Table 4). The polymorphic loci that belonged to intronic regions include atpF and ndhA, whereas the polymorphic regions that belonged to protein-coding sequences included ycf4, rpl22, and cemA. The nucleotide differences of the complete chloroplast genomes of the Spathiphyllum species are given in Fig. 5b, c. These polymorphic loci might be helpful for phylogenetic inference and population genetic studies of the species of genus Spathiphyllum.

Table 4 Polymorphic regions identified by comparison of Spathiphyllum patulinervum and Spathiphyllum kochii

Phylogenetic relationships among the aroid species

The phylogeny of subfamily Monsteroideae, without problematic species, has been inferred based on complete chloroplast genomes (IRa not included). After removal of InDels, the alignment contained 94,257 nucleotide sites in which 67,230 (71.32%) sites were constant, 6522 sites showed distinct patterns, and 15,500 sites were found to be parsimony informative. The phylogenetic tree of the studied species supports the monophyletic position of five species of subfamily Monsteroideae (Fig. 6).
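Alignment statistics of this kind can be illustrated by classifying each column of an alignment. In the sketch below, a site is constant if at most one nucleotide state is observed, and parsimony informative if at least two states each occur in at least two sequences. The function name is hypothetical, and the treatment of gaps and ambiguous symbols (ignored here) may differ from the program that produced the reported counts.

```python
from collections import Counter

def site_summary(alignment):
    """Summarise constant and parsimony-informative sites of an alignment.

    alignment is a list of equal-length aligned sequences. Only A, C, G,
    and T are considered; other symbols are ignored within a column.
    """
    n_constant = 0
    n_informative = 0
    for column in zip(*alignment):
        states = Counter(ch for ch in column if ch in "ACGT")
        if len(states) <= 1:
            n_constant += 1
        elif sum(1 for n in states.values() if n >= 2) >= 2:
            n_informative += 1
    return {"sites": len(alignment[0]),
            "constant": n_constant,
            "informative": n_informative}

toy_alignment = ["ACGTA",
                 "ACGTC",
                 "ACTTA",
                 "AATTC"]
print(site_summary(toy_alignment))
```

The proportion of constant sites is then the constant count divided by the total number of sites, as in 67,230 of 94,257 sites above.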

Fig. 6

Phylogenetic tree of family Araceae. The number on each node represents bootstrapping support

Discussion

In the current study, we report chloroplast genome sequences of four species of subfamily Monsteroideae. We compared genomic features among the species of Monsteroideae, analysed IR contraction and expansion, and identified polymorphic loci suitable for designing molecular markers.

All the analysed species of subfamily Monsteroideae exhibit conserved chloroplast genomes, and show similarities in gene content, intron content, and chloroplast genome organization. Previous studies of other angiosperm plant lineages have demonstrated both conserved (Choi et al. 2016; Li et al. 2018; Shahzadi et al. 2019; Mehmood et al. 2019) as well as highly polymorphic (Menezes et al. 2018; Abdullah et al. 2019a, 2020; Liu et al. 2019a) chloroplast genomes within specific plant lineages. The chloroplast genomes which we report here are highly conserved. In contrast to our results, in Amorphophallus, a genus of the subfamily Aroideae in Araceae, certain events of gene loss were recently reported (Hu et al. 2019; Liu et al. 2019a). However, species of Monsteroideae show conserved chloroplast genomes similar to previous studies (Choi et al. 2017). The infA gene is important as a translation initiation factor. This gene has been reported as absent in many species, either fully deleted or non-functional in multiple independent lineages (Jansen et al. 2007; Ahmed et al. 2012; Abdullah et al. 2020). This gene was found non-functional in all species of Monsteroideae. A functional copy of this gene might be present in the nuclear genome (Jansen et al. 2007).

The contraction and expansion of IRs are considered important evolutionary events which change chloroplast genome size and gene content (Menezes et al. 2018; Abdullah et al. 2020). Previously, the expansion of IRs has been reported in subfamily Lemnoideae of family Araceae, which led to duplication of the ycf1 and rps15 genes (Wang and Messing 2011), whereas the duplication of single copy genes, or vice versa, has also been reported due to IR contraction and expansion in two species of Aroideae (Araceae) (Henriquez et al. 2020). In the current study, such a high level of IR expansion was not observed, and the structure of the chloroplast genomes was similar to the chloroplast genome structure of other reported species of family Araceae (Ahmed et al. 2012; Choi et al. 2017). A truncated copy of ycf1 was observed at the junction of IRb and SSC in S. patulinervum along with one functional copy at the junction of SSC/IRa. The ycf1 pseudogene has also been reported in other angiosperms, including family Araceae (Ahmed et al. 2012; Choi et al. 2017; Yu et al. 2019b; Shahzadi et al. 2019; Abdullah et al. 2019a, 2020; Henriquez et al. 2020).

Analysis of relative synonymous codon usage (RSCU) shows how frequently each synonymous codon is used to encode a given amino acid. Codons that have either A or T at their 3′ end were used preferentially and mostly have RSCU ≥ 1. Conversely, codons that have C or G at their 3′ end were used less often and mostly have RSCU < 1. Similar results were previously reported in other angiosperms (Shahzadi et al. 2019; Mehmood et al. 2019; Abdullah et al. 2020). In addition to the normal ATG start codon, which encodes formyl-methionine (Alkatib et al. 2012a), we also observed alternative start codons, including ACG (in the rpl2 gene), ATA (in the cemA gene), and GTG (in the rps1 gene). The ACG codon is converted to AUG by RNA editing (Neckermann et al. 1994), whereas the other alternative start codons have also been reported in the chloroplast genomes of other plant species (Sugiura et al. 1998; Su et al. 2019). Usually, 32 tRNAs are required to read all codons of the mRNA (Crick 1966), and the chloroplast genome contains up to 30 tRNAs (Menezes et al. 2018; Abdullah et al. 2020). However, superwobbling, whereby a single tRNA species containing a uridine in the wobble position of the anticodon can read an entire fourfold degenerate codon box (Alkatib et al. 2012b), can reduce the required number of tRNAs, although it also reduces translation efficiency (Rogalski et al. 2008). Moreover, because the tRNA with uridine at the wobble position is essential, the tRNA gene with G at this position becomes non-essential for translation (Rogalski et al. 2008). Similar phenomena might exist in Monsteroideae species, enabling only 30 tRNAs to read all the codons in the chloroplast genome. Leucine was the most frequently encoded amino acid, whereas cysteine was rarely found. These results are also in agreement with previous studies of angiosperms (Menezes et al. 2018; Shahzadi et al. 2019; Abdullah et al. 2019a).
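
The RSCU comparison above can be made concrete with a minimal sketch. It assumes the usual definition of RSCU as the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The function name `rscu`, the reduced two-amino-acid codon table, and the toy sequence are illustrative assumptions and are not taken from the analysis pipeline of this study.

```python
from collections import Counter, defaultdict

# Illustrative subset of the standard genetic code (leucine and cysteine only).
# A real analysis would use the full codon table.
CODON_TABLE = {
    "TTA": "Leu", "TTG": "Leu", "CTT": "Leu",
    "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "TGT": "Cys", "TGC": "Cys",
}


def rscu(cds, codon_table=CODON_TABLE):
    """Return {codon: RSCU} for one in-frame coding sequence (hypothetical helper)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group synonymous codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, amino_acid in codon_table.items():
        families[amino_acid].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        expected = total / len(codons)  # count per codon under equal usage
        for c in codons:
            values[c] = counts[c] / expected
    return values


# Toy sequence: A/T-ending leucine codons dominate, as described in the text.
toy_cds = "TTA" * 3 + "CTT" * 2 + "CTG" + "TGT" + "TGC"
for codon, value in sorted(rscu(toy_cds).items()):
    print(codon, round(value, 2))
```

In this toy case TTA and CTT have RSCU > 1 while the unused C- and G-ending leucine codons have RSCU < 1, the same pattern reported for the Monsteroideae genomes.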

The rates of synonymous (Ks) and non-synonymous (Ka) substitutions, and their ratio (Ka/Ks), indicated low rates of evolution for all types of protein-coding genes in the chloroplast genomes. Synonymous substitutions were more frequent than non-synonymous substitutions, and the Ka/Ks ratio was < 1. These findings agree with various studies of angiosperm chloroplast genomes (Menezes et al. 2018; Shahzadi et al. 2019; Abdullah et al. 2020). However, our results contradict a recent study of aroid species in other subfamilies, which reported a higher rate of non-synonymous than synonymous substitutions and many genes undergoing positive selection (Kim et al. 2019). In agreement with previous reports in other angiosperms as well as aroids (Choi et al. 2016; Menezes et al. 2018; Piot et al. 2018; Kim et al. 2019; Abdullah et al. 2020), our analyses revealed strong purifying selection on genes which have a role in photosynthesis. Some genes, including rbcL, psbK, accD, rps8, ndhK, ndhD, and ycf2, were found to be under positive selection, which might be due to the different types of stress faced by these species in their respective ecological niches. These genes have also previously been reported to undergo positive selection (Yang et al. 2016; Choi et al. 2018; Yu et al. 2019b; Abdullah et al. 2019a; Kim et al. 2019).
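
The conventional reading of the Ka/Ks ratio used in this paragraph can be sketched as follows. This is a simplified illustration only: the function name `classify_selection` and the numeric values are hypothetical, and real analyses generally rely on statistical tests rather than a bare threshold at 1.

```python
def classify_selection(ka, ks):
    """Interpret a Ka/Ks ratio using the conventional threshold of 1.

    ka: non-synonymous substitutions per non-synonymous site
    ks: synonymous substitutions per synonymous site
    """
    if ks <= 0:
        return "undefined"  # ratio cannot be computed without synonymous change
    ratio = ka / ks
    if ratio < 1:
        return "purifying"  # non-synonymous changes are selected against
    if ratio > 1:
        return "positive"   # non-synonymous changes are favoured
    return "neutral"


# Hypothetical values, not results from this study.
print(classify_selection(0.002, 0.020))  # purifying
print(classify_selection(0.030, 0.010))  # positive
```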

Oligonucleotide repeats play a role in the generation of substitutions, InDels, and inversions (McDonald et al. 2011; Ahmed et al. 2012; Xu et al. 2015; Abdullah et al. 2020). These repeats have been suggested as a proxy for the identification of polymorphic loci (Ahmed et al. 2012, 2013). In the current study, we report four types of oligonucleotide repeats: forward, reverse, palindromic, and complementary. Repeat density was high in IGS and in the LSC region. Most of the repeats ranged between 35 and 44 bp in size. These observations are in agreement with previous reports (Poczai and Hyvönen 2017; Mehmood et al. 2019; Abdullah et al. 2019a, 2020).
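
The four repeat types named above differ only in how the second copy relates to the first. The sketch below assumes the commonly used definitions: a forward repeat is an identical copy, a reverse repeat is the reversed sequence, a complementary repeat is the complemented sequence, and a palindromic repeat is the reverse complement. The helper `classify_pair` is hypothetical, works on exact matches only, and is not the repeat-detection tool used in this study.

```python
# Translation table for complementing an unambiguous DNA sequence.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_pair(first, second):
    """Return the repeat types under which `second` is an exact copy of `first`."""
    first, second = first.upper(), second.upper()
    complemented = first.translate(COMPLEMENT)
    candidates = {
        "forward": first,                   # same sequence, same orientation
        "reverse": first[::-1],             # sequence read backwards
        "complement": complemented,         # each base complemented
        "palindromic": complemented[::-1],  # reverse complement
    }
    return [name for name, seq in candidates.items() if seq == second]


print(classify_pair("AAGC", "AAGC"))  # ['forward']
print(classify_pair("AAGC", "CGAA"))  # ['reverse']
print(classify_pair("AAGC", "TTCG"))  # ['complement']
print(classify_pair("AAGC", "GCTT"))  # ['palindromic']
```

Short or low-complexity units can satisfy more than one definition at once, which is why the helper returns a list rather than a single label.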

Barcoding and phylogenetic inference can be performed using either the complete chloroplast genome or lineage-specific polymorphic loci (Li et al. 2014). Owing to the high cost of sequencing a complete chloroplast genome, the use of lineage-specific polymorphic loci can at times be a better alternative (Li et al. 2014). As not all genomic regions are equally useful for inferring the phylogeny of closely related taxa or resolving taxonomic discrepancies (Daniell et al. 2016), lineage-specific polymorphic regions can serve these purposes (Ahmed et al. 2013; Ahmed 2014; Li et al. 2014) and have been reported for several species (Li et al. 2018; Menezes et al. 2018; Shahzadi et al. 2019; Abdullah et al. 2019a, 2020). The phylogenetic analysis in the current study, although based on a limited number of species, supports the monophyly of subfamily Monsteroideae. Previously, the subfamily Monsteroideae was also identified as monophyletic; however, the species-level phylogeny of the subfamily is not well resolved due to the low polymorphism of the available molecular markers (Zuluaga et al. 2019). We identified polymorphic regions that might be helpful for designing suitable and unique markers for inferring the phylogeny of subfamily Monsteroideae. In selecting these polymorphic regions we focused on substitution mutations rather than indels, as substitutions are preferred for reconstructing evolutionary history based on maximum-likelihood methods (Ahmed et al. 2013; Ahmed 2014; Menezes et al. 2018; Shahzadi et al. 2019; Abdullah et al. 2019a, 2020). Commonly used molecular markers, including rbcL and matK (Cusimano et al. 2011; Zuluaga et al. 2019), were less polymorphic than many alternatives in our study. The most recent study used trnC-petN and partial ycf1 for inferring the phylogeny of subfamily Monsteroideae (Zuluaga et al. 2019). Our study also places these loci among the suitable polymorphic regions.
However, ycf1 should be used with care in phylogenetic reconstruction in Monsteroideae, and alongside other markers, because of large-scale and frequent inversions in this gene (Menezes et al. 2018; Zuluaga et al. 2019). The trnC-petN locus also showed a high incidence of inversions and indels; accordingly, we recommend its partial sequencing. Our findings contradict a recent study (Zuluaga et al. 2019) in which the authors reported low polymorphism for the trnT-psbD, rps16-trnQ, petA-psbJ, and psbE-petL loci. In our family-level comparison, these loci were excluded from the suitable polymorphic regions not because of low polymorphism but because their high polymorphism produced high levels of missing data.

We also identified polymorphic regions by comparing two species of the genus Spathiphyllum. Here, we selected highly polymorphic loci with an alignment length ≥ 200 bp for the design of suitable polymorphic markers, following Abdullah et al. (2019a, 2020) and Shahzadi et al. (2019). For these comparisons, some of the most commonly employed loci, including matK, rbcL, and ndhF (Alverson et al. 1999; Pfeil et al. 2002; Li et al. 2014), were not necessarily the most suitable, while other commonly used loci, including the ndhA intron and trnH-psbA (Li et al. 2014; Tr et al. 2016; Huang et al. 2019), were among the 30 polymorphic regions in our findings. The polymorphic regions identified by the genus- and family-level comparisons differed. Regions that were highly polymorphic but produced high levels of missing data (> 5%), and were therefore unfit for phylogenetic inference at the family level, were included in the list of suitable polymorphic loci for phylogenetic inference in the genus Spathiphyllum. These data also suggest that loci with different levels of polymorphism can be employed at the genus (closely related species) and family (more distantly related species) levels for drawing phylogenetic inferences (Menezes et al. 2018; Abdullah et al. 2020).
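
The selection logic described above (alignment length ≥ 200 bp, exclusion of loci with > 5% missing data, and ranking by polymorphism) can be sketched as below. The `Locus` fields, the function `select_loci`, the polymorphism score, the locus names, and all numeric values are hypothetical. The sketch also assumes that the same thresholds are applied to separate genus-level and family-level alignments, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Locus:
    name: str
    aligned_length: int      # alignment length in bp
    polymorphism: float      # hypothetical variability score (e.g. substitutions per site)
    missing_fraction: float  # fraction of alignment positions that are missing


def select_loci(loci, min_length=200, max_missing=0.05, top_n=30):
    """Keep loci that pass both thresholds, ranked from most to least polymorphic."""
    kept = [
        locus for locus in loci
        if locus.aligned_length >= min_length
        and locus.missing_fraction <= max_missing
    ]
    kept.sort(key=lambda locus: locus.polymorphism, reverse=True)
    return [locus.name for locus in kept[:top_n]]


# Hypothetical measurements of the same loci in two different alignments.
family_level = [
    Locus("locus_A", 450, 0.020, 0.01),
    Locus("locus_B", 600, 0.060, 0.08),  # variable, but too much missing data
    Locus("locus_C", 150, 0.090, 0.00),  # alignment shorter than 200 bp
]
genus_level = [
    Locus("locus_A", 450, 0.004, 0.00),
    Locus("locus_B", 600, 0.015, 0.01),  # aligns cleanly between close relatives
    Locus("locus_C", 150, 0.030, 0.00),
]

print(select_loci(family_level))  # ['locus_A']
print(select_loci(genus_level))   # ['locus_B', 'locus_A']
```

In the toy data, `locus_B` is rejected at the family level because of missing data but ranks first at the genus level, mirroring the difference between the two comparisons reported in the text.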

To conclude, our study provides broad insight into the chloroplast genome structure of subfamily Monsteroideae, for which the chloroplast genomes of three species were sequenced as the first representatives of the genera Monstera, Stenospermation, and Rhaphidophora. Synonymous substitutions were more frequent than non-synonymous substitutions, and most protein-coding genes were under strong purifying selection. The polymorphic regions identified here might be suitable for designing unique and robust markers for inferring phylogeny and phylogeography among closely related species within the genus Spathiphyllum and among distantly related species within the subfamily Monsteroideae.