Introduction

Pentatricopeptide repeat (PPR) proteins comprise a large gene family that was discovered in a systematic search of the Arabidopsis genome for mitochondrion- and chloroplast-targeted proteins [1]. PPR proteins are characterized by a signature motif, a degenerate 35-amino acid repeat, often arranged in tandem arrays of 2–27 repeats [1, 2]. The PPR protein family is subdivided into two major subfamilies: the P subfamily, whose members contain only canonical PPR motifs, and the PLS subfamily, whose members also contain non-canonical shorter and longer PPR-like motifs [2]. PPR proteins occur in very small numbers in animals, fungi and algae, but the family is greatly expanded in higher plants. For instance, the Arabidopsis thaliana genome is estimated to have over 440 PPR protein genes and the rice genome encodes over 570 PPR proteins [2, 3]. Evidence from EST databases suggests that many other land plants also contain hundreds of PPR genes [4–6].

The PPR motif is structurally similar to the related tetratricopeptide repeat (TPR), a 34-amino acid repeating motif that is generally found in eukaryotic proteins [7]. PPR motifs are predicted to consist of a pair of antiparallel α helices, and tandem PPR motifs likely form a helical binding surface [1]. However, unlike TPR-containing proteins, which mediate protein–protein interactions [8], PPR proteins are sequence-specific RNA-binding proteins. They bind with high affinity to single-stranded RNA, but bind poorly to single- or double-stranded DNA [9–12]. Another interesting feature of PPR proteins is that most of them have an N-terminal sequence predicted to target the protein to either mitochondria or chloroplasts, and a number of recent studies have reported that PPR proteins play essential roles in regulating the expression of organelle genes. For example, PPR proteins from various higher plants suppress the expression of mitochondrial genes associated with cytoplasmic male sterility [8, 13–15]. They are also associated with transcription [16], translation [17], and many stages of mRNA processing, including splicing [18, 19], endonucleolytic cleavage [14] and RNA editing [20, 21]. The growing evidence on PPR protein functions is consistent with the fact that the majority of PPR proteins are predicted to be targeted to mitochondria or chloroplasts.

Cotton is not only the world’s most important natural textile fiber and a significant oilseed crop, but also a crop that is significant for foil energy and bio-energy production. Cotton belongs to the genus Gossypium of the family Malvaceae. Of the Gossypium species, four are cultivated: two allotetraploids (G. hirsutum and G. barbadense) and two diploids (G. herbaceum and G. arboreum). Gossypium hirsutum, also known as Upland cotton, produces over 95% of the world’s cotton. As with many other plant and animal species of biological and/or economic importance, significant efforts have been made to generate ESTs and BAC sequences for cotton. For instance, as of August 25, 2008, 265,833 ESTs were available for G. hirsutum in GenBank. PPR proteins have been identified in many plants, but there have so far been no reports on PPR proteins in Upland cotton. Cytoplasmic male sterility (CMS) is a maternally inherited trait that renders plants unable to produce functional pollen; however, male fertility can be restored by nuclear-encoded fertility restorer (Rf) genes. The CMS and Rf systems are a valuable tool for hybrid seed production in crop species, including maize, rice and a number of vegetable crops. With the exception of Rf2 from maize, all of the fertility restorer genes cloned so far are members of the PPR gene family [8, 13–15], and the Rf genes of other species are also presumed to encode PPR proteins. In this study, we extensively surveyed the EST and nucleotide databases of G. hirsutum for PPR genes. Moreover, we cloned five PPR genes from the Upland cotton line 0-613-2R and analyzed their cDNAs. We also compared the five GhPPR genes between the restoring line 0-613-2R and the CMS line 104-7A. The expression patterns of these PPR genes in different tissues and organs were characterized by real-time quantitative RT-PCR.

Materials and methods

Plant materials

The Upland cotton lines 0-613-2R and 104-7A used in this study were grown under natural conditions during the growing seasons at the Nanjing Agricultural University.

Preparation of genomic DNA and first strand cDNA synthesis

Genomic DNA was isolated from young cotton leaves using the CTAB method as described previously [22]. Total cellular RNA was extracted from roots, stems, leaves, pollen and fibers using the CTAB-acidic phenol extraction method [23]. Each RNA sample was treated with DNase I after extraction to remove residual DNA. One microgram of DNA-free total RNA was reverse-transcribed with the RevertAid™ First Strand cDNA Synthesis Kit (MBI) following the manufacturer’s instructions.

Search for genes encoding protein with PPR motifs

Eighteen PPR proteins that have been cloned from Arabidopsis, rice, maize, radish and petunia were used as query sequences for tBLASTn searches of the Gossypium hirsutum EST database (http://www.ncbi.nlm.nih.gov/). Vector sequences and low-quality sequences were removed manually from the resulting hits. The remaining non-redundant sequences were assembled with the CAP3 assembly tool [24] using default parameters. Potential genes in BAC sequences from the G. hirsutum database were predicted with FGENESH (http://www.softberry.com/), and the predicted proteins were then used as queries for BLASTP searches against the non-redundant GenBank protein database.
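The step of pooling hits from several queries into a non-redundant set of ESTs can be illustrated with a minimal sketch. It assumes the search results were saved in BLAST tabular format (subject ID in column 2, E-value in column 11); the E-value cutoff, the function name and the sequence identifiers are hypothetical and are not taken from this study.

```python
def unique_est_hits(tabular_lines, max_evalue=1e-5):
    """Collect non-redundant subject (EST) IDs from BLAST tabular output.

    max_evalue is an illustrative cutoff, not a value reported in the text.
    """
    hits = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]       # column 2: subject sequence ID
        evalue = float(fields[10])   # column 11: E-value
        if evalue <= max_evalue:
            hits.add(subject_id)     # the same EST hit by several queries is kept once
    return sorted(hits)


# Hypothetical example: two queries hit EST_A, one weak hit to EST_C is dropped.
example = [
    "query1\tEST_A\t45.0\t120\t60\t2\t1\t120\t10\t369\t2e-30\t110",
    "query2\tEST_A\t40.0\t110\t62\t3\t5\t114\t20\t349\t1e-12\t70",
    "query2\tEST_B\t38.0\t100\t58\t4\t1\t100\t30\t329\t3e-08\t55",
    "query3\tEST_C\t25.0\t40\t30\t0\t1\t40\t1\t120\t0.5\t20",
]
print(unique_est_hits(example))  # ['EST_A', 'EST_B']
```

The resulting ID list would then be used to fetch the sequences for manual cleaning and CAP3 assembly.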

Molecular cloning and sequence analysis of five GhPPR cDNAs

From these gene predictions, gene-specific primers (Table 1) were designed for PCR from cDNA and genomic DNA. The amplification products from both cDNA and genomic DNA were inserted into the pMD18-T vector (TaKaRa, Japan). Sequencing was performed by Nanjing Jinsite Biotechnology Co., Ltd. The PPR domains of the GhPPR proteins were identified with Pfam [25] and InterPro (http://www.ebi.ac.uk/Tools/InterProScan/) [26] and aligned in ClustalX version 1.81 [27]. The alignments were exported and edited with GeneDoc software [28]. The presence of putative mitochondrial localization signals was determined using TargetP [29], Predotar [30], and MitoProt [31].

Table 1 Primer pairs used in gene cloning and real-time quantitative RT-PCR

Semi-quantitative RT-PCR

Semi-quantitative RT-PCR was used to analyze gene expression in 0-613-2R and 104-7A. The gene-specific primer pairs (Table 1) were used for PCR under the following conditions: pre-denaturation at 94°C for 5 min, followed by 30 cycles of 30 s at 94°C, 60 s at the annealing temperature specific to each gene (55 or 57°C), and 90 s at 72°C. The cotton EF1α gene, which was uniformly expressed in all tissues examined, served as the internal control for constitutive expression. To exclude genomic DNA contamination, EF1α was amplified from the same cDNA samples under the same cycling conditions, but for 28 cycles.

Real-time quantitative RT-PCR

Real-time quantitative RT-PCR was performed on a Bio-Rad iCycler iQ5 machine. The cotton EF1α gene was amplified as the reference gene for target gene expression. PCR products were detected with SYBR Green I fluorescent dye (Invitrogen, USA). Each 25 μl PCR amplification mixture contained final concentrations of 1× PCR buffer, 1 mM MgCl2, 0.2 mM dNTPs, 1 μl SYBR Green I (10,000-fold dilution), 1 U rTaq, 0.4 μM each of the forward and reverse primers, and cDNA from 25 ng total RNA as template. The following PCR cycling conditions were used: one cycle at 95°C for 3 min, followed by 40 cycles of 94°C for 10 s, 55°C for 20 s and 72°C for 45 s, followed by a final elongation at 72°C for 10 min. Each sample was tested in three replicates.
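The text does not state how relative expression was calculated from the real-time data. One common approach, shown here only as a sketch, is the 2^(−ΔCt) method, in which the mean threshold cycle (Ct) of the reference gene is subtracted from that of the target gene. The function name and the Ct values below are hypothetical, and the formula assumes roughly 100% amplification efficiency for both primer pairs.

```python
from statistics import mean


def relative_expression(target_ct, reference_ct):
    """Relative expression by the 2^(-dCt) method (assumed, not stated in the text).

    target_ct, reference_ct: Ct values from technical replicates of one sample.
    """
    delta_ct = mean(target_ct) - mean(reference_ct)  # normalize to the reference gene
    return 2.0 ** (-delta_ct)


# Hypothetical triplicate Ct values for one GhPPR gene and EF1α in one tissue.
ghppr_ct = [24.0, 24.0, 24.0]
ef1a_ct = [21.0, 21.0, 21.0]
print(relative_expression(ghppr_ct, ef1a_ct))  # 0.125
```

A target that crosses the threshold three cycles later than the reference is thus scored at one-eighth of the reference level.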

Phylogenetic analysis of the GhPPR proteins with other plant PPR proteins

Eighteen previously published PPR protein sequences from higher plants were retrieved from the GenBank database and analyzed together with the five deduced GhPPR protein sequences. The phylogenetic tree was constructed with the MegAlign program of the DNAStar software package.

Results

Searching for ESTs and BACs of G. hirsutum encoding PPR motifs in NCBI database

The allotetraploid G. hirsutum comprises two subgenomes (designated AT and DT), and its genome size is estimated to be approximately 2.5 Gb [32]. However, only a limited number of genomic sequences are available for cotton in GenBank. At present, 147 BAC sequences of G. hirsutum, totaling about 14.5 Mb, have been deposited in GenBank. We surveyed these BACs for PPR genes and found six ORFs encoding 2–10 PPR motifs. Extrapolating this density (six genes per 14.5 Mb) to the whole genome gives about 1,034 PPR genes in the allotetraploid G. hirsutum genome, or about 517 PPR genes in each subgenome. We also used a second method to identify PPR genes in Upland cotton. Eighteen PPR proteins published from other higher plants were used as queries to screen the G. hirsutum EST database (http://www.ncbi.nlm.nih.gov/). After manually removing incorrect contigs, we identified 309 EST contigs encoding PPR motifs among the 265,833 G. hirsutum ESTs. Most of the contigs cover only the 5′ and/or 3′ regions, and therefore the internal sequences are unknown.
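The genome-wide estimate above is a simple density extrapolation, which the following sketch reproduces from the figures given in the text. It assumes that PPR genes are distributed uniformly across the genome and that the sequenced BACs are a representative sample; the function and variable names are illustrative.

```python
def extrapolate_gene_count(found, sampled_mb, genome_mb):
    """Scale a gene count found in a sequenced sample up to the whole genome,
    assuming uniform gene density."""
    return found / sampled_mb * genome_mb


ppr_orfs_in_bacs = 6      # ORFs encoding PPR motifs found in the BACs
bac_total_mb = 14.5       # combined length of the 147 BAC sequences
genome_mb = 2500.0        # estimated genome size of 2.5 Gb

total = extrapolate_gene_count(ppr_orfs_in_bacs, bac_total_mb, genome_mb)
per_subgenome = total / 2  # two subgenomes, AT and DT

print(int(total), int(per_subgenome))  # 1034 517
```

With only six genes in the sample, the estimate is sensitive to small changes in the count, so it is best read as an order-of-magnitude figure.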

Cloning and characterization of the five GhPPR genes

To determine the entire amino acid sequences of the putative PPR proteins, we isolated five full-length cDNAs based on three BAC sequences and two EST contigs. Five gene-specific primer pairs were designed for RT-PCR from Gossypium hirsutum 0-613-2R, and the products ranged in size from 867 to 1,779 bp (Accession numbers: FJ812358–FJ812367). To identify the intron–exon structure of cotton PPR genes, the gene-specific primers were used to amplify the five GhPPR genes from genomic DNA. A comparison between the cDNA and genomic DNA sequences revealed that none of the five GhPPR genes contained an intron. This intronless genomic organization is common among plant PPR genes.

Each of the five GhPPR cDNA sequences contains a single complete ORF encoding a protein of 288 to 592 amino acids (Table 2), and these PPR proteins contain 5–10 PPR motifs (Fig. 1a). Further analysis of the PPR domains of the deduced proteins revealed that GhPPR3–5 are typical members of the P subfamily, whereas GhPPR1–2 are members of the PLS subfamily, which is entirely specific to plants. The GhPPR2 protein also contains a DYW motif at its C-terminus, which is unique to higher plant PPR proteins and significant for RNA editing [6]. Unlike the E and E+ motifs, which are highly degenerate, the DYW motifs are highly conserved in amino acid sequence, as shown by the alignment of GhPPR2 and CRR2 (Fig. 1b). Subcellular localization prediction based on the N-terminal amino acids of the five GhPPR proteins shows that GhPPR1, GhPPR3 and GhPPR4 are probably targeted to mitochondria and GhPPR2 is probably targeted to chloroplasts, whereas no targeting signal was detected in GhPPR5 (Table 2).
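The reported protein lengths can be checked against the reported product sizes. The sketch below assumes that the smallest and largest amplified products (867 and 1,779 bp) correspond exactly to complete ORFs including the stop codon; that assumption and the function name are illustrative.

```python
def protein_length_from_orf(orf_bp):
    """Number of amino acids encoded by a complete ORF of orf_bp nucleotides.

    Assumes the ORF runs from the start codon through the stop codon,
    so one codon (the stop) does not encode an amino acid.
    """
    if orf_bp % 3 != 0:
        raise ValueError("a complete ORF must be a multiple of 3 nucleotides")
    return orf_bp // 3 - 1


for orf_bp in (867, 1779):
    print(orf_bp, "bp ->", protein_length_from_orf(orf_bp), "aa")
# 867 bp -> 288 aa
# 1779 bp -> 592 aa
```

Under this assumption the two ends of the size range agree with the 288 and 592 amino acid proteins given in the text.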

Table 2 Overview of five GhPPR genes identified in Gossypium hirsutum
Fig. 1

Schematic diagram of G. hirsutum PPR proteins containing multiple PPR motifs. a PPR motifs are shown as black boxes. Alignments of the PPR motifs of each G. hirsutum PPR protein are shown. Numbers indicate amino acid positions. The PPR consensus sequences are shaded gray. b Amino acid sequence alignment of the C-terminal DYW motifs of the GhPPR2 and CRR2 proteins. Residues identical between the two PPR proteins are shaded black

Expression profiles of GhPPRs in different developmental stages

To study the expression pattern of the five GhPPR genes during cotton growth and development, we performed real-time quantitative RT-PCR to measure their relative expression levels in Upland cotton 0-613-2R. Total RNA isolated from a variety of tissues and organs (roots, stems, leaves, pollen and 15 dpa fibers) was used for the analysis. The results showed that each of the five GhPPR genes was expressed in all tissues and organs examined, a feature common to the PPR gene family. Although all five GhPPR genes were expressed constitutively, their relative expression levels differed slightly (Fig. 2). For instance, transcript accumulation of GhPPR2 was higher in roots and leaves, whereas expression of GhPPR3 was low in leaves.

Fig. 2

Real-time quantitative RT-PCR analysis of five GhPPR transcripts in different organs and tissues of Upland cotton. a–e GhPPR1–GhPPR5, respectively. The tissues and organs sampled were roots, stems, leaves, pollen and 15 dpa fibers. The levels of the five GhPPR mRNAs in total RNA samples were analyzed by real-time quantitative RT-PCR using the EF1α gene as an internal control to ensure consistently equal sample levels

RT-PCR analysis of the five GhPPR genes in the CMS line and its restoring line

Using semi-quantitative RT-PCR, we examined whether the five GhPPR genes are potential candidates for the Rf gene of Upland cotton. The specific primers for the five GhPPR genes were used for reverse transcription (RT)-PCR on pollen of the CMS line 104-7A and the restoring line 0-613-2R (Fig. 4), and the products were recovered for sequencing. The cDNAs of the five GhPPR genes were identical between the CMS line and the restoring line. Thus, although all cloned Rf genes except Rf2 of maize encode PPR proteins, the five GhPPR genes do not appear to be related to the Rf gene of cotton, although they may play other important roles in cotton development.

Phylogenetic analysis of GhPPR proteins and other higher plant PPR proteins

To further characterize the homology and evolutionary relationship of the Upland cotton PPR proteins with other higher plant PPR proteins, we constructed a phylogenetic tree from the amino acid sequences of the five GhPPRs and 18 previously published plant PPR proteins. The complete amino acid sequences were used for the analysis. The phylogenetic tree showed that GhPPR3, GhPPR4 and GhPPR5 are closely related to members of the P subfamily, while GhPPR1 and GhPPR2 clustered with members of the PLS subfamily. This result is consistent with the classification of the five GhPPR proteins based on their PPR domains and C-terminal domains. Interestingly, the Rf genes of different plants clustered into one clade, separate from the other PPR genes in the P subfamily (Fig. 3).

Fig. 3

Phylogenetic tree of PPR proteins in higher plants. The ClustalW method of the DNAStar package was used to align the amino acid sequences of the five GhPPR proteins and 18 PPR proteins from other plants. The five GhPPR proteins are boxed

Discussion

Pentatricopeptide repeat proteins have been identified in almost all eukaryotes, but they are much more abundant in higher plants than in other eukaryotes. Arabidopsis and rice each have several hundred PPR genes in their genomes, whereas the yeast, Drosophila and human genomes are predicted to contain only 5, 2 and 6 PPR genes, respectively [2, 5]. The vast difference between the numbers of PPR genes in higher plants and non-plant organisms indicates that a massive expansion of the PPR gene family occurred during the evolution of plants, but the factors that underlie this expansion are not yet understood. In the present study, we estimated that the allotetraploid G. hirsutum genome contains about 1,034 PPR genes, based on the six PPR genes found in the 147 BAC sequences available for Upland cotton. A predominant feature of the cotton EST collection is that its tissue sources are strongly biased towards fiber and fiber-bearing ovules rather than other organs. The 309 PPR unigenes that we found by surveying the G. hirsutum EST database were mostly derived from fiber or fiber-bearing ovules, while only about 24% of them were from non-fiber and non-ovule organs. Therefore, additional PPR genes may be expressed specifically in other tissues and organs. Upland cotton thus has as many PPR genes as other higher plants, and the increasing number of PPR genes may be correlated with the evolution of land plants towards increasing complexity. Their large number in plants is probably required to meet the special needs of organelle gene expression.

Pentatricopeptide repeat proteins are characterized by tandem arrays of degenerate 35-amino acid motifs [1]. The PPR motif is not one motif but in fact three closely related motifs: the canonical PPR motif (P motif), common to all eukaryotes, and two variants specific to plants, the PPR-like S motif (for short) and the PPR-like L motif (for long) [2]. The P subfamily contains only P motifs, whereas the PLS subfamily is composed of repeats of P-L-S triplets. Plant-specific PPR proteins of the PLS subfamily usually contain additional C-terminal motifs (E, E+ and DYW), and the subfamily can be further divided into four subclasses based on C-terminal domain structure [2]. Whereas the E and E+ motifs are highly degenerate and difficult to recognize, DYW motifs contain highly conserved regions including some invariant amino acids, which is consistent with our results (Fig. 1b). Of our five cloned GhPPR genes, GhPPR3–5 belong to the classical P subfamily and contain 10, 5 and 7 P motifs, respectively. GhPPR1 and GhPPR2 belong to the PLS subfamily, and GhPPR2 was further classified into the DYW subgroup. The DYW domain has not been found in any other protein or in any organism apart from land plants, and it may have catalytic activity for RNA editing owing to the presence of invariant cysteine and histidine residues in the motif [2, 6]. The members of this subgroup are predicted to be targeted almost as often to plastids as to mitochondria. GhPPR2 is predicted to be targeted to the chloroplast and contains a DYW motif at the C-terminus, which suggests that GhPPR2 may play an important role in RNA editing in the cotton chloroplast. Many functions of plant PPR proteins are still unknown, but recent studies have demonstrated that members of the PPR protein family are involved in the suppression of cytoplasmic male sterility [8, 13–15]. In our study, the five GhPPR genes detected by RT-PCR showed no differences between the CMS line and the restoring line (Fig. 4). Interestingly, the Rf genes from monocot and dicot plants clustered together in the phylogenetic tree (Fig. 3). Therefore, the Rf genes from other plants might be sufficient for future testing of the hypothesis that the monocot and dicot Rf genes share a common ancestor distinct from that of all other PPR genes.

Fig. 4

RT-PCR analysis of the five GhPPR genes in the CMS line and the restoring line. 1, 3, 5, 7 and 9: pollen of the CMS line 104-7A; 2, 4, 6, 8 and 10: pollen of the restoring line 0-613-2R

In Arabidopsis, almost half of the PPR proteins are predicted to be targeted to mitochondria and one-quarter are predicted to be targeted to chloroplasts [2]. Another peculiarity of PPR genes in Arabidopsis and rice is that the large majority do not contain any introns [2, 5]. Our study extends these results to Upland cotton. Subcellular localization prediction for the five deduced GhPPR proteins showed that four of them have targeting signals at the N-terminus: GhPPR1, GhPPR3 and GhPPR4 are predicted to be targeted to mitochondria and GhPPR2 is probably targeted to the chloroplast. Analysis of the genomic organization of the five GhPPR genes showed that none of them has introns, and this feature may apply to most of the PPR genes in Upland cotton. The majority of PPR genes have been shown to be expressed constitutively in various experiments comparing different plant organs [2, 33]. Similarly, the five GhPPR genes were expressed in all of the tested tissues and organs (Fig. 2). We identified 309 PPR unigenes from the EST database, most of which were sequenced from fiber cDNAs. Although the GhPPR4 and GhPPR5 transcripts originated from those PPR unigenes, they were also expressed in other tissues and organs such as roots, stems and leaves. Therefore, the majority of the other GhPPR genes may also be expressed constitutively in Upland cotton.

In conclusion, we identified many PPR unigenes in Upland cotton and characterized five full-length PPR genes and their expression patterns. Such expression analyses are important for understanding the potential role of these abundant PPR genes in the molecular regulation of organelle gene expression and plant development. Further studies on these PPR genes in Upland cotton will increase our understanding of the biological significance of the large PPR protein family in land plants.