Introduction

Over an evolutionary time scale, a nucleotide base (i.e., adenine [A], cytosine [C], guanine [G], or thymine [T]) at a particular site in a DNA strand is subject to mutational events that can cause the base’s replacement or substitution with another type of base. Knowledge of the patterns of these single nucleotide substitutions is needed to more fully understand DNA sequence evolution and to correctly calculate taxon phylogenies using sequence data (Yang 1994, 1996). Increasingly, base substitution rates are being shown to be unequal, both among taxon lineages (Petrov and Hartl 1999; Yang and Yoder 1999) and among nucleotide sites (Wills 1995; Yang 1996). In coding regions of a genome, the success of a particular mutational event can be affected by selection. In protein-coding sequences, for example, substitutions at the first or second nucleotide position of a codon have more impact on the resulting amino acid than do substitutions at the third position (Shinozaki et al. 1983). Although nucleotide substitutions are assumed to occur more or less independently of selection in noncoding regions, functions of this DNA such as promoter and ribosome binding operations (Zurawski et al. 1984) can produce selective constraints. Randomness with respect to the 12 substitution types (i.e., A→C, A→G, A→T, C→A, C→G, C→T, G→A, G→C, G→T, T→A, T→C, T→G) is also violated by the transition/transversion (ti/tv) bias. Transitions (A↔G, T↔C) generally occur more frequently than transversions (A↔C, A↔T, C↔G, G↔T) because of the differing molecular structures of pyrimidines (C, T) versus purines (A, G). Researching noncoding regions in Drosophila and mammalian genomes, Petrov and Hartl (1999) discovered another substitution bias: a greater proportion of substitutions resulted in changes to A or T (i.e., the [A+T]/[G+C] bias).
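The division of the 12 substitution types into transitions and transversions can be made concrete with a minimal Python sketch. The function and variable names are illustrative assumptions and are not part of the original analysis; the only biology encoded is that a transition stays within the purines or within the pyrimidines.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify(src: str, dst: str) -> str:
    """Return 'transition' or 'transversion' for the substitution src -> dst."""
    if src not in BASES or dst not in BASES or src == dst:
        raise ValueError(f"not a substitution between two different bases: {src}->{dst}")
    # A transition keeps the base within its chemical class.
    same_class = {src, dst} <= PURINES or {src, dst} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Enumerate all ordered pairs of distinct bases: the 12 substitution types.
all_types = list(permutations("ACGT", 2))
transitions = [f"{a}->{b}" for a, b in all_types if classify(a, b) == "transition"]
transversions = [f"{a}->{b}" for a, b in all_types if classify(a, b) == "transversion"]

print(len(all_types), "substitution types")
print("transitions:", transitions)
print("transversions:", transversions)
```

Because there are twice as many transversion types (8) as transition types (4), a ti/tv ratio of 0.5 is what equal rates across all 12 types would produce.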

Most of the work on DNA sequence evolution, particularly on substitution patterns, has focused on animal genomes. Although there have been few detailed substitution analyses (e.g., Olmstead et al. 1998) for plastid (e.g., chloroplast) DNA, many phylogenetic plant studies have reported transition biases (e.g., Chase and Albert 1998; Van der Bank et al. 2002). Among recent findings are that ti/tv ratios are often lower in noncoding than coding sequences (Savolainen 1997; Van der Bank et al. 2002) and that transversion biases exist for some plastid regions (Yang et al. 2002). To elucidate evolutionary trends in plastid DNA more accurately, we evaluated substitution patterns in several coding and noncoding regions of 19 species of the Cucurbitaceae by inferring substitution polarity from a phylogenetic tree of the study taxa and then comparing substitution class percentages.

The Cucurbitaceae are a primarily tropical and subtropical family of vines, composed of eight tribes, including the ill-defined, predominantly Old World Benincaseae (Jeffrey 1990). Some morphological data have been interpreted as suggesting that the New World Sicyeae is evolutionarily linked to the Luffinae subtribe of the Benincaseae (Jeffrey 1980) and that the other predominantly New World tribe in the family, the Cucurbiteae, is linked to the other Benincaseae subtribe, the Benincasinae (Jeffrey 1962). Within the Benincasinae, morphological (Jeffrey 1962), chromosomal (Whitaker 1933), and isozymic data (Walters et al. 1991) suggest that the only two European genera (Ecballium and Bryonia) among the 17 total genera of the Benincasinae are relatively distinct, both from each other and from other genera of the subtribe. A preliminary study (Chung et al. 2003) based on 4 of the 14 DNA sequences used herein provided supporting evidence for the genetic distinctiveness of the three geographically differentiated lineages in the Benincaseae: the core Benincasinae taxa, these being primarily from Africa and secondarily from Asia; the two European genera, Ecballium and Bryonia; and the Luffinae, which includes only species of Luffa (four in Asia and three in the New World). Genetic affinity between the Cucurbiteae and the core Benincasinae was also noted (Chung et al. 2003).

The multiple levels of hierarchy in the Benincaseae complex make this group of taxa an excellent candidate for finding a large number of nucleotide substitutions. The choice of DNA segments to sequence was based on our experience with a set of consensus primers representing 23 short (100- to 400-base pair) sequences in various regions of the plastid genome (Chung and Staub 2003). By examining multiple, short segments of genes, introns, and spacers instead of only one or two longer sequences (e.g., whole sequences for rbcL or ndhF), we increased the randomness and probable independence of the characters used for phylogenetic analyses. Also, the short, widely distributed sequences were selected because a previous study (Chung et al. 2003) revealed that several of them were sufficiently variable among taxa of the Benincaseae to provide a large number of substitution characters.

Materials and Methods

Description of Samples

Most of the Benincaseae and Sicyeae taxa included in this project had been preliminarily evaluated for plastid sequence variation (Chung et al. 2003). In the preliminary study, the selected outgroup taxon, Cucurbita pepo L. (Cucurbiteae), fell within the ingroup in some unrooted analyses. Given also the possible evolutionary tie between the Cucurbiteae and some members of the Benincaseae, we chose a different outgroup taxon, Trichosanthes cucumerina L. (Trichosantheae), for this analysis. We also added a few more accessions to the study, for a total of 13 species representing 10 genera of the Benincaseae, 4 species representing 4 genera of the Sicyeae, and T. cucumerina and C. pepo (Table 1).

Table 1 The 19 species of the Cucurbitaceae used to investigate phylogenetic relationships in the Benincaseae and substitution patterns in the plastid genome

Description of Targeted Sequences

From 23 consensus plastid primer pairs constructed from tobacco (Nicotiana tabacum L.) plastid DNA sequences (Chung and Staub 2003), 14 were chosen (Table 2) for PCR amplification based on a preliminary evaluation of these sequences (Chung et al. 2003). The primers are described in detail elsewhere (Chung et al. 2003). The targeted sequences consist of spacer regions, introns, and gene segments and cover a wide range of the plastid genome (Table 2). They have been deposited in GenBank under the accession numbers AY396152–AY396287 and AY396289–AY396421.

Table 2 Description of the 14 plastid sequences used to investigate phylogenetic relationships and substitution category patterns in the Cucurbitaceae

Sample Preparation and Sequencing Protocols

Total DNA was extracted from one to five seeds or leaf tissue samples for each accession according to previous protocols (Chung et al. 2003). DNA was dissolved with distilled water and quantified using a minifluorometer (model TD-360; Turner Designs, Sunnyvale, CA).

All PCR reagents were purchased from Promega Corp. (Madison, WI). Each 15-μl reaction volume contained 4.0 mM MgCl2, 0.3 mM dNTPs, 15 ng DNA, 0.45 μM primer, polymerase buffer, and 0.2 U Taq DNA polymerase. Samples were overlaid with mineral oil, and amplifications were conducted using the following cycling profile: 94°C for 5 min, 35 cycles at 94°C for 60 s, primer-specific annealing temperatures (50 to 60°C) for 60 s, 72°C for 60 s, and 72°C for 6 min, followed by an indefinite soak at 4°C.

For direct DNA sequencing, PCR products were initially checked by agarose gel electrophoresis. The excess dNTPs and unincorporated primers were then removed from the remaining PCR products using the ExoSAP-IT kit (USB Company, Cleveland, OH) according to the manufacturer’s protocol. Subsequently, 2 μl of each cleansed PCR product was added to an 8-μl sequencing reaction mixture containing 2 μl of 2.5× reaction buffer (5× in 400 mM Tris, pH 9, 10 mM MgCl2), 2 μl of BigDye enzyme mix (Applied Biosystems, Foster City, CA), 0.5 μl of forward or reverse primer (20 pmol/μl), and 3.5 μl of water. Sequencing reactions were run on a Perkin–Elmer 9700 thermal cycler using the following cycling protocol: 50 cycles at 95°C for 20 s, 50°C for 30 s, 60°C for 4 min, and then 72°C for 7 min. The dye terminators were then removed through MicroBioSpin columns (BioRad, Hercules, CA) followed by sequencing of reactions. Sequencing problems meant that data could not be produced for Trichosanthes or Acanthosicyos for sequence 12.

GeneTool software (BioTools Inc., Edmonton, Canada) was used to align sequences. Complicated regions consisting of various insertion and/or deletion events (indels) were subsequently checked and, if necessary, aligned manually.

Phylogenetic Analyses

For phylogenetic tree formation, the substitution data were analyzed both separately and in combination with the presence/absence matrix representing the indel data set. Maximum parsimony (MP) and maximum likelihood (ML) analyses were performed with PAUP* 4.0b10 (Swofford 2002). All trees were rooted with the outgroup, Trichosanthes cucumerina. For parsimony analyses, the search for minimal tree lengths was performed using a heuristic search with tree bisection reconnection (TBR), MULPARS off, and simple addition sequence. Internal clade support was evaluated with a bootstrap analysis with settings of 10,000 replicates, each with 10 random sequence addition replicates, TBR swapping, and MULPARS off.

The ML analysis was performed on the substitution data matrix only. The chosen settings were empirical base frequencies, empirical rate frequencies for individual sites (gamma distribution), and the general time-reversible substitution model (=GTR + Γ). The heuristic search was performed with TBR and as-is addition sequence.

For all trees, character optimization was DELTRAN, which favors parallelisms over reversals and allows for the possibility that forward and reverse substitutions do not occur with equal frequency. Then the optimized MP consensus or ML tree that was best supported by data from other sources (e.g., Whitaker 1933; Jeffrey 1962, 1980; Walters et al. 1991) was chosen for substitution polarity assessments.

Substitution Analyses

Only those hypothesized substitutions from the phylogenetic tree results that were deemed most likely to represent true substitution events were used for substitution class analysis. For example, putative substitutions suspected to be the result of an insertion, deletion, or inversion event were not used, nor were those that occurred on ambiguous tree branches. Also, only autapomorphies (i.e., substitutions occurring in a single taxon) and synapomorphies (i.e., substitutions shared as a result of inheritance, as indicated by the phylogram results) that could be unambiguously polarized were included in the classification analysis. Excluded were substitutions exhibiting obvious homoplasy or reversal since some of these characters were arbitrarily positioned on the tree by the optimization program. Although the methods described above reduced the number of polarized substitutions we could use for classification analysis, they provided greater confidence in the dataset.

Results and Discussion

Description of the Sequences

The lengths of the targeted sequences often varied among taxa as a result of indel events in one or more taxa (Table 2). The total number of base positions evaluated in the aligned data was 2680. Based on comparison with the tobacco genome (accession number CHNTXX in GenBank), approximately half of the 2680 cucurbit sites were in coding regions and half were in noncoding regions (Table 2).

Inspection of the entire set of sequences revealed several trends. First, short nucleotide repeats characterized the sequences. Most common were single-base repeats, usually consisting of 2–10 Ts or As or 2–4 Gs. Repeats of 2–4 Cs were less common, as were 1–2 Cs followed by 3+Ts. CT, TA, CG, and GA repeats occurred sporadically. Second, only about 50% of the observed substitutions occurred in repeats, with these typically found at the ends of a repeat. Third, most of the estimated total of 75+ indel or inversion events either occurred in a single taxon (there were 40 autapomorphies) or in a region of a sequence (usually long repeat segments) where additional types of indel or inversion events occurred in other taxa. In the latter case, the overlapping gaps in the aligned sequences, some spanning 10–20 positions, were often difficult to interpret and consequently ignored for data set creation. Fourth, substitutions were the most common type of event distinguishing taxon sequences, followed by deletions, insertions (often duplications of proximal sequences), and then inversions. Finally, sequences from different parts of the plastid genome often differed dramatically in their composition of repeats, substitutions, indels, and inversions.

Phylogenetic Analyses

Because several complicated gap regions were found in the aligned sequences, each potential substitution, inversion, or indel character was individually evaluated for inclusion in the final substitution and indel data sets. Ambiguous characters (e.g., multiple overlapping deletions within a DNA segment or closely positioned multiple single-base indels within a simple repeat) were excluded. For sequences 12, 15, and 17, there were a few taxa for which 10–20 terminal nucleotides were not accurately interpreted by the sequencer; in other words, many of these positions were coded as N (indeterminate nucleotide) rather than as A, C, G, or T. Consequently, these aligned regions, which did not, in any case, exhibit any polymorphisms among the remaining taxa, were excluded from analyses. Autapomorphic indels were not included in the indel data set.

The phylogenetic substitution data set of variable base positions included 159 autapomorphic sites and 113 additional variable sites. Only 24 (9%) of the total 272 substitution sites exhibited more than two base types at a particular site. At nine of these sites, a common nucleotide type occurred in most taxa, while the second and third types occurred in a single taxon each. Only a single site exhibited all four bases among the sampled taxa.

The data set of the nonautapomorphic, easily interpretable indel events consisted of 10 indels (Table 3). Interestingly, four of the five putative insertions consisted of duplications of proximal six-base sequences.

Table 3 Description of the 10 insertion and deletion events used to investigate phylogenetic relationships in the Cucurbitaceae

MP analyses with and without the indel data produced identical topologies. The four shortest trees from the analysis based on the combined data set differed from each other with respect to the placement of Acanthosicyos and the relative positions of the Sicyeae and Luffinae branches. Shown in Fig. 1 is the bootstrap consensus tree. Most branches and the tree in general were strongly supported, as indicated by the bootstrap statistics. Homoplasy was not particularly prevalent, as indicated by the high consistency (without autapomorphies, 0.732) and retention (0.848) index values. The tree includes four major clades within the ingroup: A, the European Benincasinae; B, the core Benincasinae, with Cucurbita as nearest relative (i.e., sister); C, the Sicyeae; and D, the Luffinae (Fig. 1).

Figure 1

The bootstrap consensus tree based on maximum parsimony analysis of 272 substitutions and 10 informative indels, with Trichosanthes cucumerina as outgroup. This tree has a length of 357 steps, a consistency index (without autapomorphies) of 0.732, and a retention index of 0.848. The four major Benincaseae and Sicyeae clades that occurred in this and all other maximum parsimony and maximum likelihood analyses are marked with the letters A–D. Boldface numbers above branch lines are the numbers of character changes for each branch under DELTRAN optimization. Next to each of these numbers (in parentheses) is the number of classified substitutions, followed by the ti/tv ratio, and then the percentage of substitutions that were part of the G→A→C→T→G substitution series. Numbers below branches are bootstrap percentages. Two morphophysiological characters, male flowers in racemes (R) versus solitary (S) and male and female flowers coaxillary (C) versus in separate axils (U), have been mapped on the tree.

The ML analysis produced the same general topology as in Fig. 1, except that the Sicyeae were placed as sister clade to the Benincasinae/Cucurbita clade; Echinocystis, Marah, and Sicyos/Sechium remained an unresolved trichotomy; and Acanthosicyos was placed as the nearest relative to the Citrullus/Lagenaria/Benincasa/Praecitrullus clade. The −ln likelihood value was 2231.16, the tree length was 344 steps, the consistency index was 0.863, and the retention index was 0.845.

The consensus tree in Fig. 1 was chosen over the ML tree for substitution analysis because the latter placed the Sicyeae as sister to the Benincasinae/Cucurbita clade, a result not supported by the morphological assessments of Jeffrey (1962, 1980). For two deep branches with relatively low bootstrap percentages—the entire Benincasinae (including Cucurbita) branch (66%) and the core Benincasinae/Cucurbita branch (86%)—we mapped two supporting morphophysiological characters to the tree (Fig. 1). Although data from additional sequences are needed to resolve the few remaining ambiguities in Fig. 1 (e.g., the uncertain position of the Sicyeae), most of the branches mirror phylogenetic results (e.g., the three major clades of the Benincaseae complex; the relationship between the Cucurbiteae and the core Benincasinae) from other sources (Whitaker 1933; Jeffrey 1962, 1980; Walters et al. 1991), making the tree suitable for establishing substitution polarities for the unambiguous branches in Fig. 1.

Substitution Analyses

The substitution classification data set based on Fig. 1 consisted of 214 substitution events, each of which was scored as one of the 12 substitution types. The numbers of substitutions occurring in the 14 DNA sequences are listed in Table 2. The coding regions for ycf5 and psbC and the spacer next to the psbB gene were well represented. In addition to analyzing the full set of 214 substitutions, we divided substitution sites into two subsets: those sites that exhibited a single substitution event (192 sites) and those that exhibited two or more substitution events (11 sites with two substitution events each, i.e., 22 events). Presumably, the multiple substitution data set represents more rapidly evolving nucleotide sites.

Substitution class counts for the 214 events in the overall substitution classification data set were 31 for A→C changes, 12 C→A, 15 A→G, 22 G→A, 9 A→T, 8 T→A, 9 C→G, 7 G→C, 37 C→T, 11 T→C, 15 G→T, and 38 T→G. The percentages of single and multiple substitution sites in each substitution category are illustrated in Fig. 2. Obviously, biases exist, as the 12 substitution classes are not equally represented by the data.
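As a quick check of the tallies, the following Python sketch stores the 12 class counts reported above, confirms that they sum to the 214 classified events, and expresses each class as a percentage of the total. The dictionary layout and variable names are illustrative assumptions; the uniform expectation (total divided by 12) is shown only as a reference point, not as a formal test.

```python
# Class counts as reported in the text for the overall data set,
# keyed by (original base, replacement base).
COUNTS = {
    ("A", "C"): 31, ("C", "A"): 12, ("A", "G"): 15, ("G", "A"): 22,
    ("A", "T"): 9,  ("T", "A"): 8,  ("C", "G"): 9,  ("G", "C"): 7,
    ("C", "T"): 37, ("T", "C"): 11, ("G", "T"): 15, ("T", "G"): 38,
}

total = sum(COUNTS.values())
expected_if_uniform = total / len(COUNTS)  # count per class if all 12 were equally likely

# Percentage of all events in each class, most frequent class first.
shares = sorted(
    ((f"{src}->{dst}", 100 * n / total) for (src, dst), n in COUNTS.items()),
    key=lambda item: item[1],
    reverse=True,
)

print(f"total events: {total}")
print(f"expected per class if unbiased: {expected_if_uniform:.1f}")
for name, pct in shares:
    print(f"{name}: {pct:.1f}%")
```

The most frequent class (T→G, 38 events) is more than five times as common as the rarest (G→C, 7 events).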

Figure 2

Percentages of substitutions in each substitution class for the single substitution (white area of a bar) and multiple substitution (black area) data sets. Above the appropriate bars, transition classes are marked with an “i” and G→A→C→T→G classes are marked with an asterisk.

To put substitution biases into perspective, we calculated base composition frequencies (for all sequences combined) from two sources: the substitution data set used for phylogenetic analysis (272 sites) and the data set of 2009 invariable nucleotide sites. The composition results for the variable and invariable sites, respectively, were A (27%, 32%), C (26%, 17%), G (20%, 16%), and T (27%, 35%). The percentages of each base involved in substitutions (the first set of values) were much closer to each other in value in contrast to larger differences in the invariable data set, suggesting that base involvement in substitutions was only somewhat affected by overall base composition. The A+T content was 54% and 67% for the variable and invariable sites, respectively. Besides A+T content, the most evident difference between the two sets of sites was in C composition—a relatively large percentage of Cs were involved in substitution events compared to their lower percentage in the larger data set of invariant sites.
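The A+T figures quoted above follow directly from the per-base percentages, as the short Python sketch below illustrates. The data structure and names are assumptions made for illustration, and because the inputs are the rounded percentages from the text, the differences between the two site classes are approximate.

```python
# Reported base composition in percent: base -> (variable sites, invariable sites).
COMPOSITION = {
    "A": (27, 32),
    "C": (26, 17),
    "G": (20, 16),
    "T": (27, 35),
}

at_variable = COMPOSITION["A"][0] + COMPOSITION["T"][0]
at_invariable = COMPOSITION["A"][1] + COMPOSITION["T"][1]

# Difference (variable minus invariable) in percentage points for each base.
diffs = {base: var - inv for base, (var, inv) in COMPOSITION.items()}
largest_shift = max(diffs, key=lambda base: abs(diffs[base]))

print(f"A+T content: variable {at_variable}%, invariable {at_invariable}%")
print("difference by base (percentage points):", diffs)
print("largest single-base difference:", largest_shift)
```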

In the single substitution data set, the percentages of change to and from a base type were 28% for replacements of A, 24% for C replacements, 22% for G replacements, and 26% for T replacements; and 20% for replacements with A, 24% for replacements with C, 29% for replacements with G, and 27% for replacements with T. For C and T each, replacements of and with these bases were nearly equal and approximated their compositional frequencies in the variable site data set. Guanine was replaced least often and more Gs occurred as replacements than As, Cs, or Ts. At the other extreme, only 20% of substitutions were with A even though A replacements were relatively numerous. These biases suggest that selection or mutational pressures for higher G and lower A content are operating in some lineages of the Cucurbitaceae. Given that G represented only 16% of the total sequence base composition, and A composition was much higher (31%), these lineages appear not to be in equilibrium.

In plants, ti/tv ratios are often lower in noncoding than coding sequences (Savolainen 1997; Van der Bank et al. 2002). We noted very low ti/tv ratios (0.66 for the overall data set; 0.66 and 0.69 for the single and multiple substitution subsets, respectively), even though 60% of our data set consisted of substitutions in coding regions (Table 2). Individual lineage ti/tv ratios (Fig. 1) were variable, confirming similar observations in other plant (Palmer et al. 1988) and mammalian (Petrov and Hartl 1999) studies.
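The overall ti/tv ratio can be recomputed from the 12 class counts reported earlier for the full data set. The Python sketch below is illustrative only; the names are assumptions, and the per-subset and per-branch ratios cannot be reproduced this way because their class counts are not listed in the text.

```python
# Class counts for the overall data set, keyed by (original base, replacement base).
COUNTS = {
    ("A", "C"): 31, ("C", "A"): 12, ("A", "G"): 15, ("G", "A"): 22,
    ("A", "T"): 9,  ("T", "A"): 8,  ("C", "G"): 9,  ("G", "C"): 7,
    ("C", "T"): 37, ("T", "C"): 11, ("G", "T"): 15, ("T", "G"): 38,
}
PURINES = {"A", "G"}


def is_transition(src: str, dst: str) -> bool:
    """True when both bases are purines or both are pyrimidines."""
    return (src in PURINES) == (dst in PURINES)


ti = sum(n for (src, dst), n in COUNTS.items() if is_transition(src, dst))
tv = sum(n for (src, dst), n in COUNTS.items() if not is_transition(src, dst))
ratio = ti / tv

print(f"transitions: {ti}, transversions: {tv}, ti/tv = {ratio:.2f}")
```

With 85 transitions and 129 transversions, the ratio rounds to the reported 0.66.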

In addition to many other factors affecting the ti/tv ratio (e.g., pyrimidine/purine asymmetry, tautomeric shifts, deamination of methylcytosine, etc.), the composition of flanking nucleotides is correlated with a substitution being either a transition or a transversion (Morton 1995, 2003; Morton et al. 1997). We examined flanking nucleotides for 68 substitutions in three of the larger DNA sequences (Nos. 1, 7, 19) and also found a correlation between transversions and flanking As and Ts (data available from the first author). Given an overall T composition of 39% for the three sequences, our finding that the nearest base on the 5′ flank of a substitution was T for 45% of the substitutions suggests that T-rich areas may also have an increased chance of having a substitution of any kind. This is not surprising because our sequences were characterized by single-base repeats, usually consisting of 2–10 Ts or As. In fact, 44% of the substitutions for sequences 1, 7, and 19 were flanked on one side or the other by AA or TT, often indicating that the substitution took place at one of the ends of a repeat.

In contrast to the (A+T)/(G+C) biases observed in various studies (Blouin et al. 1998; Petrov and Hartl 1999; Kowalczuk et al. 2001), we found a nearly unbiased (A+T)/(G+C) ratio of 0.93 in the overall substitution classification dataset. The ratios for the single and multiple substitution subsets were 0.90 and 1.20, respectively. The higher value for the multiple substitution data set was primarily the result of the replacement of C by other bases; 4 of the 11 sites exhibited C→T and C→G events, and two sites exhibited C→T and C→A events. In mammals, the relatively high rate of cytosine replacement is believed to be related to cytosine methylation, which increases the probability of C→T transitions (Petrov and Hartl 1999). However, only half of the C replacements in our multiple substitution data set were with T. And the question remains why more Cs (55%) in this data set were replaced than Ts, As, and Gs combined, whereas C replacements in the single substitution data set were an unremarkable 24%. One possible explanation is that C replacement is more common at nucleotide sites undergoing more rapid change and that at these sites C replacement is less constrained to C→T transitions.
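The overall ratio of 0.93 can be reproduced from the class counts reported earlier if the (A+T)/(G+C) ratio is taken to be the number of substitutions that yield A or T divided by the number that yield G or C, regardless of the original base. That definition is an assumption inferred from the arithmetic, not one stated explicitly in the text. The Python sketch below also computes a stricter variant that ignores A↔T and C↔G changes, which gives a slightly different value.

```python
# Class counts for the overall data set, keyed by (original base, replacement base).
COUNTS = {
    ("A", "C"): 31, ("C", "A"): 12, ("A", "G"): 15, ("G", "A"): 22,
    ("A", "T"): 9,  ("T", "A"): 8,  ("C", "G"): 9,  ("G", "C"): 7,
    ("C", "T"): 37, ("T", "C"): 11, ("G", "T"): 15, ("T", "G"): 38,
}
AT = {"A", "T"}

# Assumed definition: classify each event by its replacement base only.
to_at = sum(n for (_, dst), n in COUNTS.items() if dst in AT)
to_gc = sum(n for (_, dst), n in COUNTS.items() if dst not in AT)
ratio = to_at / to_gc

# Stricter variant: count only events that change the A+T content of the site.
gc_to_at = sum(n for (src, dst), n in COUNTS.items() if src not in AT and dst in AT)
at_to_gc = sum(n for (src, dst), n in COUNTS.items() if src in AT and dst not in AT)
strict_ratio = gc_to_at / at_to_gc

print(f"to A/T: {to_at}, to G/C: {to_gc}, ratio = {ratio:.2f}")
print(f"G/C->A/T: {gc_to_at}, A/T->G/C: {at_to_gc}, strict ratio = {strict_ratio:.2f}")
```

Only the first definition matches the reported value; under either definition, changes toward G or C slightly outnumber changes toward A or T in the overall data set.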

We detected a previously undocumented type of substitution bias in our data. In the overall data set, 50% of G replacements were with A, 56% of A replacements were with C, 64% of C replacements were with T, and 67% of T replacements were with G, thus forming a complete cyclical pattern. In fact, 60% of the substitutions in the overall data set fell into this G→A→C→T→G pattern, with 10% of all substitutions representing G→A events, 14% A→C events, 17% C→T events, and 18% T→G events. The G→A→C→T→G pattern was more frequent in the single substitution data set (61% of the substitutions) than in the multiple substitution data set (50%).
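The percentages that define the cyclical pattern can be recomputed from the overall class counts reported earlier. In the Python sketch below, the variable names are illustrative assumptions; the figures for the single and multiple substitution subsets are not recomputed because their class counts are not listed in the text.

```python
# Class counts for the overall data set, keyed by (original base, replacement base).
COUNTS = {
    ("A", "C"): 31, ("C", "A"): 12, ("A", "G"): 15, ("G", "A"): 22,
    ("A", "T"): 9,  ("T", "A"): 8,  ("C", "G"): 9,  ("G", "C"): 7,
    ("C", "T"): 37, ("T", "C"): 11, ("G", "T"): 15, ("T", "G"): 38,
}
# The four classes of the G->A->C->T->G series.
SERIES = [("G", "A"), ("A", "C"), ("C", "T"), ("T", "G")]

total = sum(COUNTS.values())
in_series = sum(COUNTS[step] for step in SERIES)
series_share = 100 * in_series / total

# For each original base, the share of its replacements that follow the series.
share_of_replacements = {}
for src, dst in SERIES:
    replaced = sum(n for (s, _), n in COUNTS.items() if s == src)
    share_of_replacements[f"{src}->{dst}"] = 100 * COUNTS[(src, dst)] / replaced

print(f"{in_series} of {total} events ({series_share:.0f}%) fall in the series")
for name, pct in share_of_replacements.items():
    print(f"{name}: {pct:.0f}% of replacements of {name[0]}")
```

Each original base is thus replaced by its successor in the series at least half the time, whereas a share of about 33% would be expected if the three possible replacement bases were equally likely.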

To investigate substitution biases more thoroughly, substitutions were summarized for each branch of the phylogenetic tree on which the 214 events took place (Fig. 1). Inspection of the 12 × 29 matrix revealed that different evolutionary branches exhibited different substitution biases. In comparing taxa, the most obvious deviations from the norm were that 9 of the 20 (45%) substitution events along the Echinocystis branch were C→T changes, and 6 of the 23 (26%) events along the Cucurbita branch were G→T changes. These substitution biases contributed to relatively high (A+T)/(G+C) values in the Echinocystis (1.5) and Cucurbita (2.8) branches. For the other six branches with at least 10 classified substitution events (Fig. 1), Bryonia, Ecballium, Bryonia/Ecballium, and Marah exhibited an (A+T)/(G+C) ratio of 1.0, whereas values for the Sechium (0.8) and Sicyeae (0.6) branches were lower. Significant life history differences exist among these taxa: Echinocystis lobata and Cucurbita pepo are rapidly growing temperate annuals; Bryonia dioica, Ecballium elaterium, and Marah oreganus are temperate perennials; and Sechium edule is a subtropical perennial. In animals, high A+T content and increases in the (A+T)/(G+C) bias have been correlated with colder temperatures (Bernardi and Bernardi 1986) and faster cellular metabolism (Martin 1995), respectively. Perhaps a similar type of correlation exists for rapidly growing, north-temperate plants that have evolved from southern, warm-weather ancestors (e.g., Cucurbita pepo).

Although individual lineages exhibited different substitution trends, the percentage of substitutions that fell into the four classes of the G→A→C→T→G series was 50% or more for 25 of the 29 branches for which there were data (Fig. 1). The pattern also characterized most of the individual DNA segments, where the percentage of substitutions in these classes was 50% or more for 11 of the 14 sequences (Table 2). The coding regions for psbC (sequence 7) and ycf5 (19), which were well represented in the substitution data set, had values of 75 and 57%, respectively (Table 2). Sequence 12, in which 20 of the 23 substitutions occurred in the noncoding spacer region next to the psbB gene, exhibited a G→A→C→T→G bias of 65%. The mean G→A→C→T→G value for those sequences with substitutions mostly in coding regions (sequences 5–7, 14–15, 19–20, 23; Table 2) was higher (67%), but not significantly so (p = 0.094), than the mean for the predominantly noncoding sequences (51%).

Conclusions

In studies of prokaryotes (Kowalczuk et al. 2001), nematodes (Blouin et al. 1998), Drosophila and mammals (Petrov and Hartl 1999), and plants (Olmstead et al. 1998), two transition types, G→A and C→T, were the highest-frequency substitution classes, even outnumbering their reverse transitions, A→G and T→C. G→A and C→T are the same two transition classes that were most common in our dataset, being part of the G→A→C→T→G pattern. Consequently, it appears that these substitutions are part of a universal polarized pattern in nucleotide substitutions, and that the assumption of near-equal reversible changes, which is part of most substitution models used for phylogenetic tree construction as well as the basis for choosing the accelerated character transformation algorithm (i.e., ACCTRAN in PAUP* 4.0b10 [Swofford 2002]), should be reconsidered.

In contrast to C and T, where the rates of change to and from each base were more or less equal in the single substitution data set, substitutions in the Benincaseae complex increased the accumulation of Gs and decreased the accumulation of As. Therefore, it appears that the plastid DNA base composition of some lineages of the Cucurbitaceae is not in equilibrium but, instead, is changing as a result of unknown pressures for directional substitutions leading to G gain and A depletion.

Surprisingly, only a small proportion of the increased accumulation of Gs and depletion of As was the result of A→G substitutions. Instead, G gain was mostly the result of T→G changes and A depletion was mostly the result of A→C substitutions. Along with the G→A and C→T transitions, the A→C and T→G transversions complete a cyclical pattern in which G is preferentially converted to A, A to C, C to T, and T to G. The G→A→C→T→G substitution series was not only evident in the overall and single substitution data sets, it was also reflected in the branches of the phylogenetic tree (Fig. 1) and in most of the 14 sequenced DNA segments, including both coding and noncoding regions of the genome (Table 2).

In spite of the consistency of our results, we would like to point out two limitations of this study. First, sample sizes were so small for the examinations of individual lineages and sequence segments that those results may not be statistically significant. Second, a possible source of error for this study concerns “hidden substitutions” (i.e., unrecognized substitutions that occur between two taxa via a third intermediary taxon that is not included in the analysis). The fact that we found variation among congeneric species (e.g., Luffa, Citrullus) means that there are probably hidden substitutions in those areas of the phylogenetic tree (Fig. 1) in which evolutionary distances between taxa are great (e.g., between Ecballium elaterium and Bryonia dioica).

In spite of the limitations of this study, recent results from a review of substitution patterns for various coding and noncoding regions of plastid DNA in several plant families confirm that the G→A→C→T→G series is a common pattern, albeit not for the most commonly sequenced gene, rbcL (Decker-Walters, unpublished data). What we do not have, yet, is a biological explanation for this cyclical substitution pattern. Nevertheless, the reality of the pattern will certainly impact theories of plastid DNA evolution as well as practical applications, such as algorithm choices for phylogenetic analyses, that are based on such theories.