Introduction

According to the endosymbiont hypothesis (Margulis and Bermudes 1985), the mitochondrion and its genome are the remnants of a free-living eubacterial ancestor, probably an α-proteobacterium, which was engulfed by a eukaryotic host cell and later established a symbiotic relationship with it (Gray 1999). The host provided the bulk of the nuclear genome, and the large majority of the endosymbiont genes were either lost or transferred to the nuclear genome at an early stage in evolution. Thus, only a small fraction of the original genes is found in modern mitochondrial DNA (mtDNA).

Many features distinguish the mtDNAs of higher plants from those of animals and other organisms. Although the transfer of mitochondrial genes to the nucleus and their functional activation ceased in the common ancestor of animals, mitochondrial gene loss and gene transfer have been an ongoing and frequent process in flowering plants (Palmer et al. 2000). Extensive Southern analysis of 280 genera of flowering plants has provided a global view of gene loss in plant mtDNA (Adams et al. 2002), and the possible mechanisms of DNA transfer between organelles with closed membrane systems, and integration of the DNA into the host genome, have been reviewed by Kurland and Andersson (2000).

mtDNAs in land plants have expanded in size compared with those of green algae. Land plants evolved from green algae belonging to the Charophyceae (Graham et al. 2000), and Chara vulgaris was recently inferred to be the last common ancestor of green algae and land plants by comparisons of completely sequenced mtDNAs (Turmel et al. 2003). Since Chara possesses a densely packed mitochondrial genome with a gene content similar to that of its Marchantia counterpart (Oda et al. 1992), it was inferred that the growth in mtDNA size in Marchantia occurred by the enlargement of intergenic spacers due to frequent duplications and substitutions during evolution from charophytes to bryophytes (Turmel et al. 2002). The subsequent size increase in angiosperm mtDNAs during evolution from bryophytes occurred both by further enlargement of spacer regions due to frequent duplications and by frequent capture of sequences from the chloroplast (cp) and nuclear genomes (Marienfeld et al. 1999). Of these incoming DNAs, only cp-derived tRNA genes have gained functions in angiosperm mtDNA (Joyce and Gray 1989). Furthermore, the contribution of frequent recombination and transposition of many different classes of retrotransposons to genome expansion in the mtDNA of land plants is at most 15%. Thus, the origin of the large majority of unique sequences (~50%) in plant mtDNA is not known.

The organization of angiosperm mtDNA is characterized as a multipartite genome structure, which arises from high-frequency recombination via repeated sequences in the genome (Fauron et al. 1995). A master circle (MC) model is traditionally constructed on the basis of restriction fragment mapping of mtDNA in higher plants, in which the total genetic information can be accommodated. In contrast, an extensive electron microscopic investigation has shown that mtDNA from Chenopodium album cells cultured in suspension appears to consist mainly of linear molecules of various sizes, together with rosette-like and sigma-like structures, in vivo (Backert and Börner 2000). Since the relative amounts of these structures change during the course of cell growth, they may represent replication intermediates. Similar large branched molecules have also been observed in mtDNA from BY-2 tobacco cells under the light microscope (Oldenburg and Bendich 1996). Thus, there are differences between the forms of mtDNA molecules derived from genome mapping data and from microscopic observations. Although this discrepancy has not yet been resolved, both types of evidence indicate that the structural organization of mtDNA is highly dynamic. Furthermore, the multipartite structure, i.e., the existence of a set of mtDNAs of varying structures, can provide a redundant gene assembly and modulate the genome copy number in plant mtDNA. Low-frequency ectopic recombination among multipartite structures will produce chimeras, aberrant ORFs, and novel subgenomic DNA molecules (Abdelnoor et al. 2003). This genomic shuffling is apparently reversible and can alter plant phenotype (Kanazawa et al. 1994; Janska et al. 1998).

In the case of tobacco, complete sequencing of the mtDNA not only provides information about these features, but also establishes a basis for future research. Cultivated tobacco, Nicotiana tabacum, is a natural amphidiploid derived from two progenitors, most probably N. sylvestris and N. tomentosiformis (Kenton et al. 1993). First, it has been reported that the mtDNA of N. tabacum originated from N. sylvestris as the maternal parent, based on Southern hybridization analysis (Bland et al. 1985), but it is necessary to confirm this at the DNA sequence level. Secondly, the genetic interaction between mitochondria and chloroplasts in the various functions of plant cells is an important theme to be examined. Genetic analyses of higher-plant chloroplasts have been progressing rapidly in tobacco since the complete determination of its cpDNA sequence (Shinozaki et al. 1986), followed by the development of a chloroplast transformation technique (Svab and Maliga 1993) and chloroplast in vitro expression systems (e.g., Hirose and Sugiura 2001). Therefore, it is reasonable to analyze tobacco mtDNA. Furthermore, it has been widely documented that the expression and regulation of mitochondrial genes are under the control of the products of genes encoded in the nuclear genome. In this respect it is worth utilizing tobacco as a model plant because nuclear transformation of tobacco is easy.

In this paper we report the complete mtDNA sequence of N. tabacum. Detailed analysis of the sequence data indicates a multipartite structure for tobacco mtDNA. Comparative analysis of the complete mtDNAs of Arabidopsis thaliana (Unseld et al. 1997), Beta vulgaris (sugarbeet) (Kubo et al. 2000), Oryza sativa (rice) (Notsu et al. 2002), Brassica napus (rapeseed) (Handa 2003) and N. tabacum supports the punctuated evolution of mitochondrial gene content, and corrects some previous suggestions concerning gene loss in tobacco mtDNA. Furthermore, our analysis suggests that promoter modification or acquisition could occur upon reorganization of mitochondrial genes during evolution.

Materials and methods

Isolation of mitochondria and mtDNA

Mitochondria were isolated from green leaves of 10-week-old N. tabacum (cultivar Bright Yellow 4) plants according to the procedure of Bland et al. (1985). Briefly, mitochondrial fractions were collected by differential centrifugation. To remove any DNA released from broken organelles, the mitochondria were incubated with DNase I for 30 min at room temperature, and further purified by centrifugation in a discontinuous sucrose density gradient (0.6–2.0 M). Purified mitochondria (which band at the 1.6 M/1.2 M interface) were washed with 0.4 M sucrose and lysed in 2% Sarkosyl, and mtDNA was recovered by phenol-chloroform extraction and ethanol precipitation. Contaminating RNA was removed by treatment with a DNase-free RNase (Takara Bio).

DNA sequencing and data analysis

mtDNA libraries were constructed by digestion with Bam HI, Eco RI, Hin dIII, Sac I and Xba I. pGEM-11Zf(+) (Promega) was used as a cloning vector. The sizes of the inserted DNA fragments were determined by agarose gel electrophoresis after digestion with restriction enzyme, and they were sequenced from both ends by the dideoxy termination method using ABI 377 and 3100 DNA sequencers (Applied Biosystems). We collected 3400 sequence reads from restriction fragments, in combination with shotgun sequencing of selected clones with an insertion size greater than 5 kb. Since the clones from the five different restriction libraries did not cover the whole mitochondrial genome, we further constructed a shotgun DNA library in which ~2-kb mtDNA fragments generated by hydrodynamic shearing were cloned into the Hin cII site of pUC118 (Takara Bio). We collected 3400 sequence reads from both ends of randomly selected clones using RISA-384 DNA sequencers (Shimadzu). General methods for DNA manipulation were carried out as described by Sambrook and Russell (2001).

DNA sequences were assembled using phred/phrap/consed (Ewing et al. 1998; Gordon et al. 1998) on a PC/UNIX platform and consensus sequences were obtained. Gaps between contigs were closed by direct sequencing of selected clones and the final assembly contained 6800 sequences. The genome sequence has eight-fold coverage on average and has a phred quality value greater than 50.

A database search was carried out using stand-alone BLAST (Basic Local Alignment Search Tool; the binary code was obtained from ftp://ftp.ncbi.nih.gov/blast/executables/) against two databases: the All GenBank+EMBL+DDBJ+PDB sequences, nt.tar.gz (ftp://ftp.ncbi.nih.gov/blast/db/), and the All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF sequences, nr.tar.gz (ftp://ftp.ncbi.nih.gov/blast/db/).

Multiple sequence alignments were performed by CLUSTAL X (Thompson et al. 1997), and the tRNA gene search and repeated region search were carried out with tRNAscan-SE (Lowe and Eddy 1997) and RePuter (Kurtz et al. 2001) software, respectively, on a PC/UNIX platform.

PCR amplification

To amplify the three kinds of long repeats in the multipartite tobacco mtDNA (see below), we performed long-range PCR amplification using the GeneAmp PCR system 2700 (Applied Biosystems). The following primer pairs were based on the sequences flanking the repeat regions: seq1s (5′-ATTAATATGGTAGGCCTTGATG-3′) and seq1a (5′-AAATGAAAATCTCGTAAGAGAAAG-3′); seq2s (5′-ATGTCTTGTTCTATTCTAAGGCTC-3′) and seq2a (5′-CATCTGCTAGAGGTTGTAACAAT-3′); seq3s (5′-AACTAGTCCCTTATATTTCGCA-3′) and seq3a (5′-ATTCATATCGATTTGGGTTTT-3′); rep1s (5′-TCAATTAGTGCCAGATAAATTTC-3′) and rep1a (5′-AACTGATTAACTGTGAACTACGTG-3′); rep2s (5′-GAATCCTTATCCTGCAAACA-3′) and rep2a (5′-CTGAGGTATCTAATAACCTTGCTT-3′); and rep3s (5′-TATTTAGTGATCAGTCCGCTAGT-3′) and rep3a (5′-GCTCACAAGATTCCACTACAG-3′). PCR amplification was carried out in a 50-μl reaction mixture with LA-Taq DNA polymerase (Takara Bio) according to the protocol supplied by the manufacturer. The extension temperature was 72°C and the extension time was set to 50 s/kb.
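
The extension-time setting can be illustrated with a short calculation. The sketch below is only an illustration of the 50 s/kb rule stated above; the product sizes are the expected sizes quoted for the repeat regions in the Results, and the function name is hypothetical.

```python
# Extension time at the rate given in the protocol (50 s per kb of product).
EXTENSION_RATE_S_PER_KB = 50

def extension_time_s(product_kb, rate_s_per_kb=EXTENSION_RATE_S_PER_KB):
    """Return the extension time in seconds for a product of the given size (kb)."""
    return product_kb * rate_s_per_kb

# Expected product sizes (kb) quoted in the Results for the repeat regions.
products_kb = {
    "rep2": 5.4,
    "rep3": 7.6,
    "rep2 + rep3": 12.4,
    "rep1 + rep2": 23.6,
}

for name, size_kb in products_kb.items():
    seconds = extension_time_s(size_kb)
    print(f"{name}: {size_kb} kb -> {seconds:.0f} s ({seconds / 60:.1f} min)")
```

For the largest expected product (23.6 kb) this gives an extension step of roughly 20 min per cycle, which illustrates why much larger targets are impractical for long-range PCR.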

Nucleotide sequence accession numbers

Sequences of the mtDNA of N. tabacum have been deposited in the DDBJ/GenBank/EMBL nucleotide sequence database under the Accession Nos. AP006340, AP006341 and BA000042.

Results and discussion

Assembly of DNA sequencing data by Phred/Phrap/Consed

Phred/phrap software assembled 6800 sequence reads of tobacco mtDNA to produce two contiguous DNA sequences (contigs) of 120,754 bp and 277,468 bp, respectively (Fig. 1a). The Assembly View function of Consed (http://www.phrap.org/consed/consed.html) indicated that the forward and reverse reads of some clones mapped to positions that were too far away from each other based on the expected size of the fragments. We examined all boundaries of assembled clones in the Consed window by visual inspection and found that many of them mapped to two separate regions, suggesting the presence of direct repeats in the mitochondrial genome. We designated the repeat sequences on contig 2 as rep1, rep2 and rep3, as shown in Fig. 1b. Recombinational activity mediated by these repeats could be inferred from the presence of such multiple clones with distinct sequences flanking the repeat, which could be categorized into four different boundaries as shown in Fig. 1c–f. Examples of such clones are discussed in the following.

Fig. 1a–f

Assembled contigs and organization of long repeat regions in tobacco mtDNA. a Contig sequences 1 and 2 were 120,754 and 277,468 bp long, respectively. The 5′ and 3′ ends of contigs marked as F (838 bp), C (485 bp), E (320 bp) and D (514 bp) were identical in sequence to the regions marked with the same letter in contig 2. b Enlargement of a part of contig 2. Unique sequences are designated as seq1, seq2 and seq3. Repeat sequences are designated as rep1 (graded stippling), rep2 (black) and rep3 (gray). c Sequences of the seq1/rep1 (upper line) and seq2/rep1 (lower line) boundaries. d Sequences of the rep1/rep2 (upper line) and seq3/rep2 (lower line) boundaries. e Sequences of the rep2/rep3 (upper line) and rep2/seq1 (lower line) boundaries. f Sequences of the rep3/seq3 (upper line) and rep3/seq2 (lower line) boundaries. Sequence identities are underlined

First, the complementary sequence of the forward read of the clone Sac-88, which is a 5.3-kb Sac I fragment, was located in the 5′ region of contig 1 (at nucleotide position 3622) and the sequence of the reverse read was assembled within the 3′ region of rep3 in contig 2 (at nucleotide position 84,601). In accordance with this, the 838-bp 5′ sequence of contig 1 overlapped the 3′ region of rep3. Therefore, contig 2 can be bridged at the 3′ end of rep3 by connecting it with the 5′ region of contig 1. The boundary sequence is shown in Fig. 1f. Second, the sequence of the forward read of the Eco-118 clone, which is a 1.5-kb Eco RI fragment, was mapped to the 3′ region of contig 1 (at nucleotide position 119,928) and the complementary sequence of the reverse read was located in the 5′ region of rep1 in contig 2 (at nucleotide position 58,076). In this case, the 485-bp 3′ sequence of contig 1 overlapped the 5′ region of rep1. The boundary sequence is shown in Fig. 1c. Thus, the 3′ region of contig 1 can be connected with the 5′ region of rep1, and the circular form can be organized by connecting the sequences in the order contig 1 (seq2), rep1, rep2 and rep3 (giving subgenomic circle 2; SC2 in Fig. 2). Third, the complementary sequence of the reverse read of the Sac-62 clone, which is a 1.4-kb Sac I fragment, was located at the 5′ region of contig 2 (at nucleotide position 757) and the sequence of the forward read was mapped to the 3′ region of rep2 in contig 2 (at nucleotide position 78,860). Since the 320-bp 5′ sequence of contig 2 overlapped the 3′ region of rep2, it is possible that the 3′ end of rep2 bridges part of contig 2; i.e., that it connects the sequences in the order 5′ region of contig 2 (seq1), rep1 and rep2 (subgenomic circle 1; SC1 in Fig. 2). Figure 1e shows the boundary sequence. 
Fourth, the sequence of the reverse read of the Sac-58 clone, which is a 7.4-kb Sac I fragment, was located at the 3′ region of contig 2 (at nucleotide position 271,707) and the complementary sequence of the forward read was mapped to the 5′ region of rep2 in contig 2 (at nucleotide position 77,300). In this case, the 514-bp 3′ sequence of contig 2 overlapped the 5′ region of rep2, and the circular form can be constructed by connecting the sequences in the order rep2, rep3, and the 3′ region of contig 2 (seq3) (subgenomic circle 3; SC3 in Fig. 2). The boundary sequence is shown in Fig. 1d.
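
The logic used to flag these four clones can be sketched in a few lines. This is a minimal illustration, not the Consed Assembly View algorithm: a clone is treated as consistent only if both end reads map to the same contig and lie no farther apart than the insert size allows. The class name, function name and the slack factor of 1.5 are assumptions made for the example; the clone data are those given in the text.

```python
from dataclasses import dataclass

@dataclass
class CloneEnds:
    name: str
    insert_kb: float   # fragment size estimated by gel electrophoresis
    fwd: tuple         # (contig, position) of the forward read
    rev: tuple         # (contig, position) of the reverse read

def is_consistent(clone, tolerance=1.5):
    """True if both reads lie on one contig and no farther apart than
    tolerance * insert size (tolerance is an assumed slack factor)."""
    (contig_f, pos_f), (contig_r, pos_r) = clone.fwd, clone.rev
    if contig_f != contig_r:
        return False
    return abs(pos_f - pos_r) <= tolerance * clone.insert_kb * 1000

# Mapped read positions reported in the text for the four bridging clones.
clones = [
    CloneEnds("Sac-88", 5.3, ("contig1", 3_622), ("contig2", 84_601)),
    CloneEnds("Eco-118", 1.5, ("contig1", 119_928), ("contig2", 58_076)),
    CloneEnds("Sac-62", 1.4, ("contig2", 78_860), ("contig2", 757)),
    CloneEnds("Sac-58", 7.4, ("contig2", 77_300), ("contig2", 271_707)),
]

flagged = [c.name for c in clones if not is_consistent(c)]
print(flagged)
```

All four clones are flagged, two because their reads fall on different contigs and two because the reads lie much farther apart on contig 2 than the fragment size permits.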

Fig. 2

Predicted multipartite organization of tobacco mtDNA. Possible subgenomic circles were constructed from the assembled sequencing data (see text). Repeated sequences are depicted as in Fig. 1. Rep2 was present in SC1 (79,847 bp), SC2 (149,250 bp) and SC3 (201,500 bp) together with rep1, rep1 + rep3 and rep3, respectively. SC4 (229,097 bp) was constructed from SC1 and SC2. SC5 (281,347 bp) was constructed from SC1 and SC3. SC6 (350,750 bp) was constructed from SC2 and SC3. Two isomeric forms of the MC are shown (MC1 and MC2). They contain the entire sequence of the genome but differ from each other in the organization of repeat sequences (rep1, rep2 and rep3) and unique sequences (seq1, seq2 and seq3). Interconversion between MC1 and MC2 can occur via homologous recombination during replication. MC, SC1 + SC2 + SC3, SC1 + SC6, SC2 + SC5, and SC3 + SC4 have the same genetic information

Since we have no additional clones that could be assembled as another contig or bridge at another position, we believe that our data represent the complete nucleotide sequence of the tobacco mitochondrial genome. Thus, the entire tobacco mtDNA can be assembled into an MC as shown in Fig. 2. The length of the MC is 430,597 bp, far in excess of the 270 kb suggested previously (Satoh et al. 1993). Two isomers of the MC can be constructed based on the boundary sequences.

The multipartite structure of tobacco mtDNA

In addition to SC1, SC2 and SC3, other subgenomic circles can be postulated to be formed through homologous recombination. These are schematically represented as SC4, SC5 and SC6 in Fig. 2. Therefore, the mitochondrial genome of tobacco may be represented as two MCs and six SCs. Similar multipartite models were previously proposed for the mitochondrial genomes of maize (Fauron et al. 1995) and Arabidopsis (Klein et al. 1994).
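
The size relationships among the postulated circles can be checked with simple arithmetic. In the sketch below, the MC length and the SC1 and SC2 sizes are taken from the text and the Fig. 2 legend; SC3 is derived as the remainder of the MC, which is the value consistent with the SC5 and SC6 sizes listed in the legend. The dictionary layout and variable names are illustrative assumptions.

```python
from itertools import combinations

MC_BP = 430_597  # length of the master circle given in the text

# Sizes (bp) of the three basic subgenomic circles; SC3 is derived as the
# remainder of the MC after SC1 and SC2 are subtracted.
sizes = {"SC1": 79_847, "SC2": 149_250}
sizes["SC3"] = MC_BP - sizes["SC1"] - sizes["SC2"]

# Unique-sequence content of each circle (seq1, seq2, seq3 as in Fig. 1b).
content = {"SC1": {"seq1"}, "SC2": {"seq2"}, "SC3": {"seq3"}}

# Composite circles are built from pairs of the basic circles.
composites = {"SC4": ("SC1", "SC2"), "SC5": ("SC1", "SC3"), "SC6": ("SC2", "SC3")}
for name, (a, b) in composites.items():
    sizes[name] = sizes[a] + sizes[b]
    content[name] = content[a] | content[b]

# Pairs of circles that together carry the whole genome exactly once.
all_unique = {"seq1", "seq2", "seq3"}
complete_pairs = [
    (a, b)
    for a, b in combinations(sorted(content), 2)
    if content[a] | content[b] == all_unique and sizes[a] + sizes[b] == MC_BP
]

print(sizes)
print(complete_pairs)
```

Under these assumptions, the pairs that together carry the same information as the MC are those in which one basic circle is combined with the composite built from the other two.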

Since more than 70 short direct-repeat sequences (>30 bp and <405 bp) were found in the MC, in addition to rep1, rep2 and rep3, using RePuter software (Kurtz et al. 2001), it is reasonable to postulate that many circular molecules of 20–100 kb could be formed by homologous recombination via these short direct repeats. Some of these molecules might account for the circular DNAs of 20–33 kb observed by electron microscopy of tobacco mitochondria (Satoh et al. 1993). However, homologous recombination through these repeats might be an irreversible process, because there is no known mechanism that can reconstruct one MC correctly from many small circular molecules. Therefore, if they exist, such molecules are unlikely to be related to heredity.
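
How the spacing of a direct-repeat pair determines the size of the resulting circles can be shown with a small calculation. A single crossover between two direct-repeat copies on a circular molecule splits it into two smaller circles whose sizes are set by the distance between the copies. The repeat positions used below are hypothetical and serve only to illustrate the arithmetic; the function name is likewise an assumption.

```python
def circles_from_direct_repeat(genome_len, pos_a, pos_b):
    """Sizes (bp) of the two circles produced by one crossover between two
    direct-repeat copies starting at pos_a and pos_b on a circular genome."""
    distance = abs(pos_b - pos_a) % genome_len
    return distance, genome_len - distance

MC_BP = 430_597  # master circle length from the text

# Hypothetical start positions of the two copies of one short direct repeat.
small, large = circles_from_direct_repeat(MC_BP, 10_000, 35_000)
print(small, large)
```

With these hypothetical positions the smaller product is 25 kb, which falls within the 20–100 kb range discussed above; the two products always sum to the length of the parental circle.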

We carried out long-range PCR to test whether long direct repeats were interspersed throughout tobacco mtDNA. The combinations of PCR primers used are listed, together with the expected target regions and product sizes, in Supplementary Table S1. We amplified rep2 (5.4 kb) and rep3 (7.6 kb) using different combinations of primers. Furthermore, the regions containing both rep2 and rep3 (12.4 kb) were amplified using different combinations of primers. The primer pair seq3s/seq3a amplified a 12-kb product, suggesting the presence of SC3. We amplified the region containing rep1 and rep2 (23.6 kb) using the primer pairs seq2s/rep3a and seq1s/rep3a. However, we did not succeed in amplifying the 23.6-kb fragment using either seq1s/seq1a or seq2s/seq1a. These results indicate that part of the sequence in each subgenome structure exists in the mtDNA, suggesting the presence in vivo of a part of a multipartite structure. However, it is not possible to prove the presence of even the smallest circle, SC1, because it exceeds the size limit for amplification using long-range PCR. Quantitative measurements such as quantitative PCR will be required to estimate the relative amounts of various DNA fragments in subgenomic structures of plant mtDNA. Since there seems to be no evidence for the existence of an MC as the entity of genetic heredity in the mitochondria of higher plants, studies designed to obtain quantitative measurements of changes in the size and structure of mtDNA molecules during the growth cycle, such as that performed on C. album (Backert and Börner 2000), should be performed on many plant species. Furthermore, it will be necessary to develop a method to measure the distribution of subgenomic circles among mitochondria in the plant cell.

Gene organization in tobacco mtDNA

When we started to sequence the mtDNA of N. tabacum in 2000, only the mitochondrial gene sequences for atp9-rps13-nad1a (Bland et al. 1986), orf25 (Stamper et al. 1987) (recently designated as atp4; Burger et al. 2003), atp6 (Bland et al. 1987), cob (Ortega et al. 2000) and orf274-atp1 (partial) (Bergman et al. 2000) had been reported for this tobacco species. We used the programs BLASTN, BLASTX and tRNAscan-SE to find genes of known or unknown function that had been identified previously in other plant mtDNAs, and detected 36 protein-coding genes, three ribosomal RNA genes (rrn5, rrn18 and rrn26) and 21 tRNA genes in tobacco mtDNA, which are summarized in Supplementary Table S2. The set of genes found on tobacco mtDNA differs from those of Arabidopsis, sugarbeet, rice and rapeseed with respect to sequences that encode tRNAs, ribosomal proteins and components of Complex II. A linear physical map of MC2 of tobacco mtDNA with the 430,597-bp nucleotide sequence starting at the 5′ end of seq1 and ending at the 3′ end of rep2 is illustrated in Fig. 3; the positions of the genes on MC2 of tobacco mtDNA are shown. No clear bias was found in the distribution of the genes between the two strands: 19 protein-coding frames (the trans-assembled exons are counted individually) and 11 tRNA genes are present on one strand, and 25 protein-coding frames and 10 tRNA genes are found on the complementary strand. The number of intergenic spacers in tobacco mtDNA is 73 and their sizes vary from 1 bp between rpl5 and rps14 (ψ) to 36 kb between nad7a and trnfM, with an average of 5 kb. A comparable average size for intergenic spacers (4.7 kb) can be estimated for Arabidopsis and sugarbeet mtDNAs. Furthermore, 14 of 73 spacers exceed 10 kb in size; however, there are three pairs of genes which overlap (ccb250-trnI, rps3-rpl16 and cox3-sdh4) by 14, 110 and 73 bp, respectively.
Including these overlapping genes, there are 18 gene clusters in tobacco mtDNA: ccb250-trnI, rpl5-rps14(ψ)-cob, trnfM-rrn26, sdh3-nad2ab, nad3-rps12, rrn18-rrn5, nad4L-atp4(orf25), rps10-cox1, atp9-rps13-nad1bc, rps19-rps3-rpl16-cox2, trnD-trnS, nad4-rps1-nad5ab, trnN-trnY-nad2cde, atp8-cox3-sdh4, nad3-nad1a, rps4-nad6, trnS-trnF-trnP and nad9-trnP-trnW. We considered these gene clusters to be potential cotranscription units based on their relatively short intergenic spacers, and to date we have amplified the transcripts of nad4L-atp4(orf25), rps10-cox1, atp9-rps13-nad1bc and atp8-cox3-sdh4 by an RT-PCR method (S. Yagura and Y. Sugiyama, manuscript in preparation). In the mtDNA of N. sylvestris, orf87-nad3-rps12 and orf87-nad3-nad1a are cotranscribed units (Lelandais et al. 1996) and the same gene orders are present in N. tabacum, except that orf87 is replaced by orf265. Therefore, 18 promoters should be required for transcribing these gene clusters. The remaining 23 genes (nad1d-mat-R, atp6, nad1e, trnS, trnI, ccb574, nad5de, trnC, nad5c, trnQ, trnG, trnE, atp1, trnM, nad2ab, ccb438ab, trnE, trnH, mttB, trnK, ccb206, rpl2ab and nad7abcd) probably represent monocistronic transcription units. Thus, at least 41 promoters can be postulated in tobacco mtDNA. Supplementary Table S3 provides a list of gene clusters that represent potential cotranscriptional units in completely sequenced angiosperm mtDNAs. Despite extensive divergence in genome organization due to multiple recombination events during the evolution of higher plants, some gene clusters, such as nad3-rps12 and rrn18-rrn5, are common to the five angiosperms and nad4L-atp4 (orf25) is common to the four dicotyledonous plants. Although many blocks of colinear gene clusters are present in Marchantia mtDNA, only a part of the ribosomal gene cluster (rps10-rpl2-rps19-rps3-rpl16-rpl5-rps14) and rrn18-rrn5 are shared among the angiosperm mtDNAs.
Indeed, with respect to gene order, the organization of Marchantia mtDNA closely resembles that found in Chara mtDNA (Turmel et al. 2003).

Fig. 3

Physical map of the genes on tobacco mtDNA. The circular genome (MC2) of tobacco mitochondria is depicted in linear form starting from the boundary between rep2 and seq1. Genes are indicated by boxes above and below the horizontal lines. Ribosomal protein genes are represented by green boxes, genes involved in energy transduction by magenta boxes, other protein genes by black boxes, tRNA genes by cyan boxes and rRNA genes by yellow boxes. Genes above the line are transcribed from left to right and those below the line from right to left. The asterisk indicates a pseudogene

We first used the program BLASTN to search for potential conserved nonanucleotide motif (CNM)-type promoters (Hoffmann and Binder 2002) in tobacco mtDNA using 5′-AAAATATCATAAGAGAAG-3′ and 5′-AAAATATCGTAAGAGAAG-3′ as query sequences against the tobacco mtDNA BLAST database, with a word-size option of 10. Nine possible promoters were found in the upstream regions of the following genes [the distance (in bp) between the 3′ end of the predicted promoter and the 5′ end of the coding sequence is indicated in parentheses]: atp6 (278), trnfM (190), nad3 (1129), rrn18 (65), nad4L (1002), nad5d (1028), atp1 (1001), rps4 (2999), and nad7 (1771). Their sequences and positions in MC2 are listed in Table 1. Of these, six CNM-type promoters were located 5′ of gene clusters in tobacco mtDNA (marked as P in Supplementary Table S3). Gene clusters with CNM-type promoters were also searched for in sugarbeet, Arabidopsis and rapeseed mtDNAs by the same method, and are marked in Supplementary Table S3. Since the consensus 5′-CR(A or G)TA-3′ sequence reported for mtDNA of monocotyledonous plants (Caoile and Stern 1997) is too short to permit us to perform a BLAST search, we have not analyzed promoter sequences in rice mtDNA. It is clear from these searches that there are few CNM-type promoters in dicotyledonous mtDNAs. Therefore, many promoters that differ from the CNM-type must control mtDNA gene expression in dicotyledonous plants. Furthermore, since CNM-type promoters are distributed at random with respect to the genes that constitute a cluster, except for rrn18 and trnfM, it seems that genes (coding sequences) are not necessarily rearranged together with their promoters on mtDNA during the evolution of higher plants. Rather, only genes that acquired promoters at the new position in the reorganized mtDNA could be maintained during evolution. 
Further insight into promoter modification or acquisition during evolution awaits comparative studies on the primary transcripts of all mitochondrial genes.
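
The idea behind the promoter search can be illustrated with a simple exact-match scan of both strands. This is a deliberately simplified stand-in for the BLASTN search described above, not a reproduction of it: it finds only perfect matches to the two 18-nt query sequences, whereas BLASTN also reports partial alignments. The function names and the toy sequence are hypothetical.

```python
# The two CNM-type query sequences given in the text.
QUERIES = ("AAAATATCATAAGAGAAG", "AAAATATCGTAAGAGAAG")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_motifs(genome, queries=QUERIES):
    """Return sorted (0-based start, strand, query) for every exact match
    of each query on either strand of the genome sequence."""
    genome = genome.upper()
    hits = []
    for query in queries:
        for strand, pattern in (("+", query), ("-", revcomp(query))):
            start = genome.find(pattern)
            while start != -1:
                hits.append((start, strand, query))
                start = genome.find(pattern, start + 1)
    return sorted(hits)

# Toy sequence (hypothetical) carrying one motif on each strand.
toy = "GGCC" + QUERIES[0] + "TTTT" + revcomp(QUERIES[1]) + "AACC"
for hit in find_motifs(toy):
    print(hit)
```

In a real analysis the hit coordinates would then be compared with the annotated gene positions to obtain the promoter-to-coding-sequence distances of the kind listed above.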

Table 1 Potential CNM-type promoters in tobacco mtDNA

In tobacco mtDNA, 110 ORFs larger than 100 codons were identified, but it was clear from our BLASTX analysis that most of these ORFs are either unique to tobacco mtDNA or fragments of known sequences. Thus, they are likely to be nonfunctional.

Comparison of the protein-coding gene sets in angiosperm mtDNAs

Tobacco mtDNA contains a standard set of genes for mitochondrial proteins related to both biogenesis and bioenergetics, as in the case of other plant mtDNAs. The protein-coding genes of tobacco mtDNA are compared with those of Arabidopsis, sugarbeet, rapeseed, rice and Marchantia in Supplementary Table S4. Mitochondrial ribosomal proteins (MRPs) of plants are encoded partly by the mtDNA and partly by nuclear genes. Marchantia mtDNA codes for 16 MRPs, whereas 10, 7, 8, 6 and 11 such proteins are encoded by tobacco, Arabidopsis, rapeseed, sugarbeet and rice mtDNAs, respectively. This variability in ribosomal protein gene content in the mtDNAs of angiosperms suggests that functional gene transfer events from the mitochondrion to the nucleus are ongoing (reviewed by Palmer et al. 2000). We identified an ORF coding for a protein of 120 amino acids (orf120) as the ψrps14 gene in tobacco mtDNA, because the predicted N-terminal amino acid sequence (1–74) shows a high degree of similarity to those of other plants, whereas its C-terminal sequence (75–119) is completely different (Supplementary Fig. S1). This apparent divergence is caused by a frameshift due to the deletion of 5 bp at nucleotide position 226 in the ψrps14 gene of tobacco mtDNA. However, RT-PCR and subsequent DNA sequencing showed that ψrps14 was cotranscribed with rpl5 and that RNA editing occurred at one position.

We also identified an ORF coding for a protein of 216 amino acids (orf216) in tobacco mtDNA, which is similar to rps1 of wheat (Gonzalez et al. 1993) and rice (Notsu et al. 2002), and to orf224 of Oenothera (Mundel and Schuster 1996). To date, little information has been available for the rps1 genes in the mtDNAs of dicotyledonous plants other than Oenothera. Comparison of the deduced amino acid sequence of orf216 with that of its wheat, rice and Oenothera counterparts shows that the Oenothera orf224 gene and the tobacco orf216 gene code for proteins with an extra 40 and 39 amino acids, respectively, at the N-terminus compared to those of wheat and rice (Supplementary Fig. S2a). The rps1 gene is active in wheat mtDNA, whereas orf224 is a pseudogene in Oenothera because, although it is transcribed, RNA editing does not occur (Mundel and Schuster 1996). We examined whether orf216 is an active gene in tobacco mitochondria. RT-PCR and subsequent DNA sequence analyses indicated the presence of a transcript of this gene and, furthermore, revealed that RNA editing occurred at nucleotide positions 71 and 206 (the A of the start codon is position 1), so Ser24 and Pro69 in Supplementary Fig. S2a change to Phe and Leu, respectively. These results indicate that orf216 is the rps1 gene and is active in tobacco. It should be noted that there are two other potential start codons in orf216, and at present we cannot decide which of these is the true start codon.
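
The correspondence between the edited nucleotide positions and the affected residues can be verified with a short calculation. In the sketch below, the nucleotide positions are those given in the text; the genomic codons TCT and CCA are hypothetical, chosen only because a C-to-U change at their second position yields the reported Ser-to-Phe and Pro-to-Leu substitutions (the actual codons are not stated in the text). Function names are illustrative.

```python
# Partial codon table covering only the codons used in this example.
CODON_TABLE = {
    "TCT": "Ser", "TCC": "Ser",
    "TTT": "Phe", "TTC": "Phe",
    "CCA": "Pro", "CTA": "Leu",
}

def codon_coordinates(nt_pos):
    """Return (codon number, position within codon), both 1-based, for a
    1-based nucleotide position counted from the A of the start codon."""
    return (nt_pos - 1) // 3 + 1, (nt_pos - 1) % 3 + 1

def edit_c_to_u(codon, pos_in_codon):
    """Apply a C-to-U edit (written as T in DNA notation) at a codon position."""
    index = pos_in_codon - 1
    if codon[index] != "C":
        raise ValueError("editing site is not a C")
    return codon[:index] + "T" + codon[index + 1:]

# Edited positions from the text, paired with hypothetical genomic codons.
for nt_pos, codon in ((71, "TCT"), (206, "CCA")):
    number, position = codon_coordinates(nt_pos)
    edited = edit_c_to_u(codon, position)
    print(f"nt {nt_pos}: codon {number}, position {position}: "
          f"{CODON_TABLE[codon]}{number} -> {CODON_TABLE[edited]}")
```

Both edited sites fall at the second position of their codons (codons 24 and 69), in agreement with the residue numbers reported above.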

The rps10 gene has been extensively studied with respect to its transfer to the nucleus (Adams et al. 2000), and it is absent in the mtDNAs of Arabidopsis, sugarbeet, rice and rapeseed. In the case of tobacco mtDNA, the rps10 gene consists of two exons of 250 bp and 113 bp, which are separated by a 774-bp Group II intron. The deduced amino acid sequence of the rps10 product aligns well with those of its counterparts in pea, rice and Arabidopsis (Supplementary Fig. S2b). RT-PCR and subsequent DNA sequence analyses showed that the rps10 gene is transcribed in tobacco, the intron is spliced and RNA editing creates a putative start codon and a new stop codon (S. Yagura and Y. Sugiyama, manuscript in preparation). A similar situation has been reported for rps10 of potato (Zanlungo et al. 1995).

Since Southern hybridization suggested that the sdh3 and sdh4 genes are present in tobacco mtDNA (Adams et al. 2001), we used the program BLASTN to search the tobacco mtDNA BLAST database with the tomato sdh3 and sdh4 genes as query sequences. We identified ORFs coding for proteins of 108 and 115 amino acids as sdh3 and sdh4 of tobacco mtDNA, respectively. The amino acid sequences of the predicted products align well with those of tomato, coffee and mayapple SDH3 (Supplementary Fig. S3a) and SDH4 (Supplementary Fig. S3b). The genetic context of sdh4 is similar to that of the corresponding gene cluster found in some other plants, such as Arabidopsis, Oenothera (Giegé et al. 1998) and potato (Siqueira et al. 2002). RT-PCR and subsequent DNA sequence analyses showed that the sdh4 gene is transcribed in tobacco and edited at three positions (S. Yagura and Y. Sugiyama, manuscript in preparation).

An extensive survey of mtDNAs from 280 genera of flowering plants by Southern hybridization indicated that rps1, 2, 7, 11 and 14 have been lost from tobacco mtDNA (Adams et al. 2002). However, our present work showed that although rps7 is absent, rps1 and ψrps14 are present in tobacco mtDNA. Thus, Southern analysis alone does not always provide definitive evidence in this context. In this connection, Adams et al. (2002) reported that, in addition to the rps1 sequence in tomato mtDNA, the existence of a nuclear gene for rps1 could be inferred from searches of the EST database at the National Center for Biotechnology Information; the EST in question encodes an RPS1 with a putative mitochondrial targeting presequence predicted by MITOPROT and Target-P. However, the RPS1 proteins encoded in tobacco and Oenothera mtDNAs are highly homologous to the postulated nuclear RPS1 sequence of tomato. Surprisingly, MITOPROT and Target-P predict the presence of a mitochondrial targeting presequence in RPS1 of tobacco and Oenothera mtDNA, as well as in the nucleus-encoded RPS1 of tomato. These confusing analytical data can be interpreted as follows. The N-terminal targeting sequence predicted for RPS1 by MITOPROT and Target-P is a false positive, and the sequence deposited in the EST database may be derived from mtRNA of tomato. If this is correct, rps1 has not been transferred from mtDNA to the nucleus in tomato, and we must therefore be prudent in our interpretations on the basis of EST data. Since rps1 is present in tobacco mtDNA, further work will be required to determine whether the rps1 gene has actually been transferred from mtDNA to the nucleus in Solanaceae species.

The pattern of loss of protein-encoding genes in mtDNA during the evolution from Marchantia to the five angiosperm species is schematically depicted in Fig. 4. rpl6, rps8 and rps11 were lost during the course of evolution from Marchantia to the common ancestor of angiosperms. After the divergence of monocotyledonous and dicotyledonous plants, the lineage leading to rice (monocots) lost the genes rps10, rps14, sdh3 and sdh4, while that leading to the progenitor of tobacco, sugarbeet, Arabidopsis and rapeseed lost the rps2 gene. Then, the dicotyledonous branch leading to tobacco (Asterids) lost rps7 and rps14, that leading to sugarbeet (Caryophyllids) lost rps1, rps10, rps14, rps19, rpl2, rpl16, sdh3 and sdh4, and that leading to the progenitor of Arabidopsis and rapeseed (Rosids) lost the rps1, rps10, rps13, rps19 and sdh3 genes. Finally, the lineages leading to Arabidopsis and rapeseed lost rps14 and sdh4, respectively. This profile supports the punctuated model of mitochondrial gene loss and functional gene transfer to the nucleus in angiosperms (Adams et al. 2002).

Fig. 4 Loss of genes from the mtDNA during evolution from Marchantia to angiosperms. Phylogenetic relationships for angiosperms are based on published results (Soltis et al. 1999). The order of gene loss in each lineage is not clear.
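The branch-by-branch losses described above can be tallied with a short script. The following is a minimal sketch, not part of the original analysis: the branch labels (for example "dicot_ancestor") and the function names are chosen here for illustration, and the gene lists are transcribed from the text. It accumulates the losses along the path from each species back to the common ancestor of angiosperms, and counts on how many separate branches each gene was lost.

```python
from collections import Counter

# Branch -> (parent branch, genes lost on that branch), transcribed from the text.
# Branch labels are hypothetical names used only in this sketch.
TREE = {
    "angiosperm_ancestor": (None, {"rpl6", "rps8", "rps11"}),
    "rice": ("angiosperm_ancestor", {"rps10", "rps14", "sdh3", "sdh4"}),
    "dicot_ancestor": ("angiosperm_ancestor", {"rps2"}),
    "tobacco": ("dicot_ancestor", {"rps7", "rps14"}),
    "sugarbeet": ("dicot_ancestor", {"rps1", "rps10", "rps14", "rps19",
                                     "rpl2", "rpl16", "sdh3", "sdh4"}),
    "rosid_ancestor": ("dicot_ancestor", {"rps1", "rps10", "rps13",
                                          "rps19", "sdh3"}),
    "Arabidopsis": ("rosid_ancestor", {"rps14"}),
    "rapeseed": ("rosid_ancestor", {"sdh4"}),
}


def lost_genes(branch):
    """Return all genes lost on the path from the root to `branch`."""
    lost = set()
    while branch is not None:
        parent, genes = TREE[branch]
        lost |= genes
        branch = parent
    return lost


def independent_losses():
    """Count, for each gene, the number of branches on which it was lost."""
    return Counter(gene for _, genes in TREE.values() for gene in genes)


if __name__ == "__main__":
    for species in ("rice", "tobacco", "sugarbeet", "Arabidopsis", "rapeseed"):
        print(species, sorted(lost_genes(species)))
    print(independent_losses().most_common())
```

Genes with a count greater than one in `independent_losses()` were lost on more than one branch in this scheme, which is the kind of repeated, lineage-specific loss the paragraph describes. Note that the tobacco ψrps14 pseudogene is counted here as a loss of the functional gene.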

In tobacco mtDNA, the genes rps4, rps10 and cox1 have no conventional ATG start codon. We found that the genomic ACG codon in the rps10 gene is changed by RNA editing to the normal ATG codon. To date, it has been reported that RNA editing creates an ATG start codon in nad1 of Arabidopsis, nad1, nad4L and atp6 of sugarbeet, nad1 and nad4L of rice, and nad1 of rapeseed. Although it was also reported that the cox1 initiation codon is created by RNA editing in tomato (Kadowaki et al. 1995) and potato (Quiñones et al. 1995) mitochondria, we have not detected RNA editing at ACG in the rps4 and cox1 genes of tobacco, despite several RT-PCR experiments (S. Yagura and Y. Sugiyama, manuscript in preparation). Genes with an unknown start codon are ccb203, mat-R and mttB of Arabidopsis, mttB (orfX) of sugarbeet, and mttB (orfX) of rapeseed. N-terminal analysis of these proteins will be required to clarify the nature of their start sites.

In tobacco mtDNA, 17 cis-splicing and 6 trans-splicing introns are present in ten genes. All of them are group-II introns (Supplementary Table S5). The tobacco nad1.i3 is a trans-splicing intron whose secondary structure is formed by two transcripts derived from SC1 and SC3. Therefore, at least SC1 and SC3 must be present in a given mitochondrion to permit the synthesis of a functional nad1 product. The tobacco cox2.i1 is inserted at the same position as sugarbeet and rice cox2.i1. In contrast, Arabidopsis and rapeseed cox2.i1 is inserted at a different position, as described by Unseld et al. (1997). The tobacco nad7 gene does not contain the third intron found in Arabidopsis, rapeseed, sugarbeet and rice. The variation in the length of cis-splicing introns among the five angiosperms is generally larger than that of the protein-coding region.

Comparison of the tRNA gene sets of angiosperm mtDNAs

Complete DNA sequencing is the most direct means of determining whether the genome of the mitochondrion encodes all of the tRNA species necessary for protein synthesis in this organelle. We identified 21 tRNA genes in tobacco mtDNA (Supplementary Table S2) and verified that they can be folded into the standard cloverleaf structures, which can carry fifteen kinds of amino acids. Tobacco mtDNA uses the standard genetic code and all 61 sense codons are used in the translation of the 36 protein-coding genes (TAA, TAG and TGA are stop codons). There seems to be some bias for codons that end in A or U in four-codon families. The 21 tRNAs in tobacco mtDNA are sufficient to translate 37 of the 61 sense codons when tRNAs with U in the wobble position of the anticodon are assumed to recognize all codons in four-codon families (in the case of tobacco, Pro, CCN; Ser, TCN). Therefore, at least nine tRNAs that recognize the remaining 24 sense codons have to be imported from the cytoplasm in tobacco: tRNAAla (UGC), tRNAGly (UCC), tRNAIle (GAU), tRNALeu (UAA), tRNALeu (UAG), tRNAArg (UCG), tRNAArg (UCU), tRNAThr (UGU) and tRNAVal (UAC). In addition, tRNALeu (CAA), tRNAArg (ACG) and tRNAThr (GGU), which are encoded in Marchantia mtDNA but not in tobacco mtDNA (Table 2), may be imported from the cytosol.
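The decoding assumption used in this accounting can be illustrated with a small script. This is a simplified sketch, not the procedure used in the study: the function name `codons_read` is hypothetical, base modifications are ignored, and the wobble rules are assumptions stated in the comments (U reads all four codons of a four-codon family and A or G otherwise, G reads U or C, C reads G). It does not attempt to reproduce the tally of 37 and 24 codons, which depends on the full tRNA gene set in Supplementary Table S2.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def codons_read(anticodon):
    """Return the codons (5'->3') read by an anticodon (5'->3').

    Simplified wobble rules assumed for this sketch; modified bases
    (for example in tRNAs with an A or a modified C at the wobble
    position) are not handled.
    """
    wobble, middle, last = anticodon.upper()
    # Codon positions 1 and 2 pair with anticodon positions 3 and 2.
    prefix = COMPLEMENT[last] + COMPLEMENT[middle]
    # A four-codon family: all four third-position variants encode one amino acid.
    four_codon_family = len({CODE[prefix + b] for b in BASES}) == 1
    if wobble == "U":
        thirds = BASES if four_codon_family else "AG"
    elif wobble == "G":
        thirds = "UC"
    elif wobble == "C":
        thirds = "G"
    else:
        raise ValueError("wobble base A is not covered by this sketch")
    return [prefix + b for b in thirds]


if __name__ == "__main__":
    for name, anticodon in [("Pro", "UGG"), ("Ser", "UGA"),
                            ("Leu", "UAA"), ("Ile", "GAU")]:
        print(name, anticodon, codons_read(anticodon))
```

Under these assumptions, an anticodon such as UGG covers the whole CCN family for proline, whereas UAA covers only the two leucine codons UUA and UUG because UUN is not a four-codon family.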

Table 2 Comparison of tRNA genes in mtDNAs of higher plants

Table 2 compares the tRNA gene sets of tobacco mtDNA with those of Arabidopsis, sugarbeet, rapeseed, rice and Marchantia. Many tRNA genes [trnA(ugc), trnG(ucc), trnL(caa), trnL(uaa), trnL(uag), trnR(acg), trnR(ucg), trnT(ggu) and trnV(uac)] were lost during the course of evolution from Marchantia to the common ancestor of angiosperms. The lineage leading to rice mtDNA lost the trnG(gcc) gene after monocotyledonous and dicotyledonous plants diverged, and that leading to the progenitor of Arabidopsis and rapeseed lost the trnF(gaa) gene. In contrast to the protein-coding genes, tRNA genes were transferred from cpDNA to mtDNA during the evolution of angiosperms. The cp-derived tRNA genes, such as trnH(gug), trnMe(cau), trnN(guu), trnP(ugg), trnS(gga) and trnW(cca), are common to all angiosperms, and trnD(guc) is common to dicotyledonous plants. Of these, trnH, trnMe and trnS were integrated as part of large cpDNA fragments into rice mtDNA (Notsu et al. 2002), so transfer events differed between monocotyledonous and dicotyledonous plants, at least in the case of these tRNA genes. After the divergence of monocotyledonous and dicotyledonous plants, the lineage leading to rice acquired trnC(gca) and trnF(gaa) and that leading to tobacco acquired trnE(uuc), trnI(cau) and trnP(ugg). In contrast, it is not easy to infer the timing of transfer of trnD(guc), trnH(gug), trnMe(cau), trnN(guu), trnP(ugg), trnS(gga) and trnW(cca) among the four dicotyledonous plants, because knowledge of the sequences of DNA carrying mitochondrial and cp tRNA genes in the same plant is a prerequisite for analyzing the timing of the tRNA gene transfer events. Such a data set is available for Arabidopsis and tobacco. We have recently shown that the transfer of trnMe(cau) occurred before the divergence of Arabidopsis and tobacco, and that of trnD(guc) occurred afterwards (Sugiyama et al. 2004).
The trnS(gga) gene is unique among the cp-derived tRNA genes because its counterpart is absent in Marchantia mtDNA (Table 2). Therefore, the mechanism of integration of the trnS(gga) gene into mtDNA appears to differ from that of other cp-derived tRNA genes, as suggested by Turmel et al. (2003).

Comparison of angiosperm mtDNAs

Table 3 summarizes the general features of five angiosperm mitochondrial genomes. Genome sizes (taken as that of the MC) are in the range of 220–490 kb with similar A+T contents. Brassica species have the smallest mtDNA among higher plants (Handa 2003). Although the generation of a complex mitochondrial genome organization through the recombination of long repeats is a common feature of higher-plant mitochondria, long repeated sequences vary in size from 2427 bp in rapeseed to 127,600 bp in rice, and show no homology to each other. This implies that repeated sequences were independently acquired during the process of evolution from Marchantia to angiosperms. The “unique sequence” is the mtDNA region that excludes copies of long repeated sequences; the size of this region is smallest in rapeseed, as expected from the genome size. The total length of coding sequence is similar among the five plants. The cis-splicing introns are smallest in sugarbeet because some introns are lacking (Kubo et al. 2000).

Table 3 General features of mtDNA of angiosperms

The category of other sequences in mtDNAs comprises retrotransposons and transposons detected by BLASTX (sequences considered to be of nuclear origin), short direct repeat and inverted repeat sequences detected by RePuter, trans-spliced introns, and the vast regions that do not define recognizable ORFs of any significant length. Gene content was counted as the genes of known function that have been reported previously in mtDNA, including the gene copies located in long repeats.

Conclusion

In conclusion, the complete sequencing of tobacco mtDNA has identified the rps1 and ψrps14 genes, which had previously been thought to be absent from tobacco mtDNA based on Southern analysis, and provided the basis for identifying RNA editing sites in tobacco mitochondrial transcripts (S. Yagura and Y. Sugiyama, manuscript in preparation). Comparison of mtDNAs among Nicotiana species will reveal the origin of the multipartite organization of tobacco mtDNA (Makita et al., manuscript in preparation). Comparative studies suggest that coding sequences have not necessarily moved together with their promoters upon reorganization of the mitochondrial genome during the evolution of dicotyledonous plants. One aim of future studies will be to identify and compare the active promoters for all mitochondrial genes.