INTRODUCTION

Coccolithophorids are an extremely important calcite-producing group of unicellular algae in the marine environment. The most abundant coccolithophorid, Emiliania huxleyi, is distributed throughout the world’s oceans and coastal waters. E. huxleyi is unique among the marine phytoplankton in that it is capable of fixing atmospheric carbon into both photosynthetic and biomineralized product. This alga has a significant impact on the flux of CO2 across the air-sea interface, and also on the removal of CO2 as calcium carbonate at the deep water–sediment interface (Westbroek et al., 1993). These data indicate that E. huxleyi plays an important role in the ocean carbon cycle and may even influence the global climate system by decreasing the oceanic draw of CO2. E. huxleyi is also recognized as a major sink for calcium carbonate in the ocean (Hide, 1990; Samtleben and Bickert, 1990). Ecophysiologists and climatologists are interested in E. huxleyi’s involvement in sulfur biotransformations in the ocean and its ability to synthesize long-chain alkenones and alkyl alkenoates. The production of dimethylsulfide (DMS) in E. huxleyi blooms may affect production and regional weather patterns (Bates et al., 1987; Charlson et al., 1987), while the long-chain polyunsaturated ketones have proved to be accurate paleotemperature proxies for estimating surface water temperature distributions to determine patterns in ocean circulation and paleoclimate (Prahl et al., 1988; Sikes et al., 1991; Conte et al., 1992).

In addition to its ecologic importance, E. huxleyi has attracted the attention of materials scientists interested in using these porous shells of calcium carbonate to develop novel materials. Potential applications include the design of new lightweight ceramics, catalyst supports, robust membranes for high-temperature separation technology, and biomedical devices (Walsh and Mann, 1995). Despite its use in biogeochemistry, climatology, and materials science, little is known about the molecular genetics of this important marine alga. Molecular approaches aimed at elucidating the complex life cycle of E. huxleyi, and tools for analyzing genes that express the protein machinery responsible for calcium carbonate biomineralization and DMS production, are lacking (Paasche, 2002). The size of the E. huxleyi genome is not known, and there is little information that describes the content and organizational structure of the genome. At the time of this study, a search of databases for protein-encoding genes in E. huxleyi yielded only 5 to 10 entries; this situation has restricted our understanding of the biochemical and physiologic pathways that govern the biology of this alga.

Therefore, to accelerate the genetic and molecular characterization of the biology of E. huxleyi, we present results obtained from the identification of 3000 E. huxleyi expressed sequence tags (ESTs) based on cDNA sequencing. The analysis of ESTs generated by systematic partial sequencing of randomly picked cDNA clones is an effective means of rapidly gaining information about an organism at its most fundamental level. Analyses of ESTs have been published for several model plants, including Arabidopsis, rice, maize, and wheat (DeRisi and Iyer, 1999), but this approach has not been extensively employed with algae. We have identified transcripts that are expressed under conditions that promote calcification and coccolithogenesis, which include those encoding proteins that are likely to be involved in calcium homeostasis and transport. In addition, many apparently novel genes have been identified. These genes include transcripts that are present in Volvox, yeast, and other organisms and that are known to be involved in gametogenesis and sexual reproduction. The EST sequence information presented herein will complement the large set of physiological information already available and enable new technologies to be rapidly exploited to advance our understanding of the global significance of E. huxleyi.

MATERIALS AND METHODS

Media and Growth Conditions

E. huxleyi strain 1516 was obtained from the Provasoli-Guillard National Center for Culture of Marine Phytoplankton and grown as described previously (Laguna et al., 2001). RNA was extracted from cultures obtained by inoculating cells into 1 L of f/50 medium (Guillard, 1975) in 4-L flasks. Cultures were incubated photoautotrophically at 17° to 18°C under cool white fluorescent light (660 μmol · m−2 · s−1) under a discontinuous-light (12-hour dark, 12-hour light) cycle.

RNA Extraction

RNA was isolated from 3 L of cultures in mid to late log phase. Prior to RNA extraction, cells were decalcified by lowering the pH of the culture to 5.0 with HCl for 2 minutes, followed by rapid readjustment to pH 8.0 with NaOH. Total RNA was extracted from cells using a standard guanidinium isothiocyanate procedure (Strommer et al., 1993). Briefly, cells were lysed by grinding in liquid nitrogen with a mortar and pestle. Cell material was resuspended in extraction buffer (4 M guanidinium isothiocyanate, 25 mM sodium citrate, 0.5% sarkosyl, 0.1 M β-mercaptoethanol) to inhibit the activity of ribonucleases and disrupt membranes. Total RNA was separated from other cellular components by phenol extraction followed by isopropanol precipitation with sodium acetate. A final lithium chloride precipitation was performed to further purify the RNA. The concentration of RNA was determined from its absorbance at 260 nm, and the integrity was assessed using denaturing agarose gel electrophoresis.
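The absorbance-based estimate of RNA concentration can be sketched as a one-line conversion. The text states only that concentration was determined from absorbance at 260 nm; the factor of 40 μg/mL per absorbance unit (1-cm path length) is a widely used convention for RNA and is assumed here, as are the function and parameter names.

```python
def rna_concentration_ug_per_ml(a260, dilution_factor=1.0, path_cm=1.0, factor=40.0):
    """Estimate RNA concentration (ug/mL) from absorbance at 260 nm.

    factor: assumed conversion of 40 ug/mL per A260 unit for RNA at a
            1-cm path length (a common convention, not taken from the text).
    """
    if a260 < 0 or dilution_factor <= 0 or path_cm <= 0:
        raise ValueError("absorbance must be >= 0; dilution and path length must be > 0")
    # Normalize to a 1-cm path, convert to ug/mL, then undo the dilution.
    return a260 / path_cm * factor * dilution_factor


# Hypothetical reading: A260 of 0.5 measured on a 1:50 dilution.
example = rna_concentration_ug_per_ml(0.5, dilution_factor=50)
```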

Construction of cDNA Library and EST Sequencing

Total RNA was used for the construction of a cDNA library prepared by ResGen (Invitrogen Corp.). First-strand synthesis was performed using a NotI primer-adapter (GAC TAG TTC TAG ATC GCG AGC GGC CGC CC(T)15) and Superscript II reverse transcriptase. Following second-strand synthesis using Escherichia coli DNA polymerase, NotI/blunt end products were directionally cloned into the NotI/EcoRV sites of the Gateway cloning vector pMAB58. Plasmids were used to transform ElectroMax DH10B-TON cells via electroporation, and random clones were picked for quality control analysis.

Plasmid DNA was prepared from recombinant clones using a standard alkaline lysis procedure, and unidirectional sequencing was accomplished using the pMAB58 forward primer (TAT AAC CGC TTT GGA ATC ACT), providing sequence from the 5′ end of cDNA clones. Sequencing was performed by Integrated Genomics of Chicago, Illinois.

Data Analysis

ESTs were trimmed to remove the vector and ambiguous sequences, and high-quality sequences with a minimum of 400 bp of continuous sequence with at least 98% accuracy were retained for further analysis. High-quality sequences were compared with sequences in GenBank (National Center for Biotechnology Information, NCBI) using BLASTX. A sequence was considered to be a significant match when the BLAST probability value (e value) was less than 1 × 10−2. High-quality ESTs were assembled into contigs using the phrap/cross_match/swat package version 0.990329 (available at pg@umpqua.genome.washinton.edu). A final unique set of 1523 sequences has been deposited into GenBank (accession numbers CF753162–CF754684; dbEST_Id 20096956–20098478) and archived in our E. huxleyi database (Ehux Express). A Web interface is currently being constructed to allow keyword or sequence homology searches to be performed.
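The significance rule described above can be illustrated with a minimal sketch. It assumes that the best BLASTX e value for each EST has already been parsed into a Python list, with None standing for an EST that returned no GenBank match; the function names and the input layout are hypothetical and are not part of the pipeline used in this study.

```python
from collections import Counter
from typing import Iterable, Optional

E_VALUE_CUTOFF = 1e-2  # significance threshold stated in the text


def classify_hit(best_evalue: Optional[float]) -> str:
    """Assign one EST to a category based on its best BLASTX e value."""
    if best_evalue is None:
        return "no_match"  # assumed encoding for "no GenBank match"
    # A match is significant only when the e value is strictly below the cutoff.
    return "significant" if best_evalue < E_VALUE_CUTOFF else "nonsignificant"


def summarize(best_evalues: Iterable[Optional[float]]) -> Counter:
    """Count ESTs in each category."""
    return Counter(classify_hit(e) for e in best_evalues)


# Hypothetical e values for five ESTs.
counts = summarize([1e-30, 0.5, None, 1e-2, 9e-3])
```

Note that an e value exactly equal to the cutoff falls in the nonsignificant group, which matches the "less than" wording of the criterion.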

BLASTCLUST was used to group the initial ESTs into consensus sequences using match reward of 1, mismatch penalty of −3, non-affine gapping cost, and a word size of 28 with an e-value threshold set at 1e-6. Pairwise comparisons across the initial sequences were also used to determine the total redundancy in the library. Random subsets of ESTs (500, 1000, 1500, 2000, 2500, and 3000) were sampled, and the number of unique sequences in each subset was determined (Figure 1).
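The subsampling step can be sketched as follows. The sketch assumes that each EST has already been assigned a cluster identifier (for example, from BLASTCLUST output), so that the number of unique sequences in a subset is the number of distinct identifiers it contains. The function name, the fixed random seed, and the toy input are illustrative assumptions rather than the procedure actually used.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def discovery_curve(cluster_ids: Sequence[Hashable],
                    subset_sizes: Sequence[int],
                    seed: int = 0) -> List[Tuple[int, int]]:
    """Return (subset size, number of unique clusters) for each subset size.

    cluster_ids: one cluster identifier per EST (assumed precomputed).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    ids = list(cluster_ids)
    curve = []
    for n in subset_sizes:
        sample = rng.sample(ids, n)  # draw n ESTs without replacement
        curve.append((n, len(set(sample))))
    return curve


# Toy input: 12 ESTs falling into 4 clusters.
toy_ids = ["c1"] * 6 + ["c2"] * 3 + ["c3"] * 2 + ["c4"]
toy_curve = discovery_curve(toy_ids, [4, 8, 12])
```

With the real data, the subset sizes would be those listed above (500 to 3000), and plotting the second element of each pair against the first gives a curve of the kind shown in Figure 1.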

Figure 1

Characterization of the rate of new gene discovery expressed as the number of unique sequences obtained versus the total number of clones sequenced.

RESULTS AND DISCUSSION

EST Library Sequence Analysis

The cDNA library employed in this study consisted of 6 × 10^5 clones, from which the 5′ ends of 3000 cDNAs were sequenced. After editing to eliminate vector and other problematic sequences, high-quality ESTs with an average length of 559 nucleotides were used in database searches. As shown in Table 1, 1836 (approx. 61%) of the ESTs exhibited an e value greater than or equal to 10−2, and 78 (approx. 3%) of the sequences had no GenBank match. For the remaining 1086 (approx. 36%) of ESTs returning an e value less than 10−2, matches were found to genes from a wide diversity of organisms. Highly significant matches were most frequently obtained with sequences from animals, plants, and fungi. However, significant matches to sequences from prokaryotes and unicellular eukaryotes were also observed. Table 1 also lists significant E. huxleyi EST matches assigned into groups or domains based on GenBank search data. The GenBank search results appear to reflect the current bias in the databases toward animal sequences relative to those of eukaryotic photosynthetic organisms, plants, or algae, because one would expect sequences from E. huxleyi to be most closely related to plants or algae.

Table 1 BLAST Search Analysis of cDNA Library Clones

Analysis of rates of gene discovery indicated that our library prepared from RNA extracted from calcifying E. huxleyi cells contains more information than we had mined from this initial sequence screen. The number of sequences that can be processed and the potential new information that can be gleaned from that effort is represented graphically in Figure 1. After sequencing 3000 ESTs, 2298 different transcripts were predicted using BLASTCLUST and the rate of new sequence discovery was still at 76.6%. At this point there is no indication of a plateau effect, suggesting the sequencing of more library clones is warranted. Assembly of individual ESTs into groups of tentative consensus sequences yielded 1523 unique transcripts, a 200-fold increase in what was previously contained in GenBank. Our unigene set is composed of 1054 singletons and 459 contigs.

The average G + C content ratio from this library sampling was 0.65, with 68% of the sequences having a G + C content between 0.59 and 0.70 (Figure 2). The leptokurtic distribution suggests that the G + C content is constant across the coding region of the genome and indicates that the presence of contaminating sequences is minimal.
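The G + C ratio reported here is a simple per-sequence fraction, and a minimal sketch of the calculation is given below. It assumes that ambiguous bases (such as N) are excluded from both the numerator and the denominator, which is one common convention and not necessarily the one used for Figure 2; the function names are illustrative.

```python
from typing import Iterable


def gc_fraction(seq: str) -> float:
    """Fraction of unambiguous bases in seq that are G or C."""
    counted = [b for b in seq.upper() if b in "ACGT"]  # skip N and other codes
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "GC" for b in counted) / len(counted)


def share_within(seqs: Iterable[str], low: float, high: float) -> float:
    """Share of sequences whose G + C fraction lies in [low, high]."""
    fractions = [gc_fraction(s) for s in seqs]
    if not fractions:
        raise ValueError("no sequences supplied")
    return sum(low <= f <= high for f in fractions) / len(fractions)


# Toy sequences; real input would be the trimmed EST reads.
toy = ["GGCCGCAT", "ATATGCGC", "GCGCGCGG"]
toy_share = share_within(toy, 0.59, 0.70)
```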

Figure 2

G + C content of 3000 EST sequences from cDNA library of E. huxleyi strain 1516. The frequency distribution mean of these EST data (approx. 65%) reflects the high GC content previously described for E. huxleyi.

Given its high G + C content, E. huxleyi might be expected to use a GTG initiation codon in addition to the preferred ATG codon, as is the case with Mycobacterium tuberculosis, which has a similar G + C content (Lowery and Ludden, 1988). Analysis of the predicted start codon of a small subset of matched ESTs reported herein (n = 70) revealed that a GTG start codon was used to define the start of translation at least 14% of the time, and possibly as much as 44% of the time.

Preliminary data we have collected using open reading frames from 85 ESTs (those with the lowest e values) and 15 full-length cDNA sequences suggest that E. huxleyi exhibits a codon bias consistent with its high G + C content (Table 2). These results are in agreement with previous findings that suggested a codon bias based on the G + C composition of codon positions in cDNA clones from the actin multigene family in E. huxleyi (Bhattacharya et al., 1993). Information pertaining to the alga’s preferred codon usage is of practical importance in terms of designing degenerate primers for polymerase chain reaction and performing in vivo genetic manipulation experiments. Our preliminary data also suggest the high G + C content may reflect a biased amino acid content of the E. huxleyi proteome (Table 2). In E. huxleyi, as in M. tuberculosis and other organisms harboring genomes with a high G + C content, there appears to be a distinct preference for amino acids encoded by the GC-rich codons of Ala, Gly, Pro, Arg, and Trp, as compared with those encoded by the A + T-rich codons of Asn, Ile, Lys, Phe, and Tyr (Collins and Jukes, 1993; Foster et al., 1997; Lobry, 1997; Gu et al., 1998). Whether this preference is characteristic of the entire E. huxleyi proteome and influences the structure and chemistry of its proteins is not known and warrants further analysis.
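A codon usage tally of the kind summarized in Table 2 can be sketched as below. The sketch assumes an in-frame open reading frame made up only of A, C, G, and T; the G + C fraction at the third codon position (often called GC3) is included as one simple indicator of bias. The function names and the toy sequence are hypothetical, and this is not the analysis code used to produce Table 2.

```python
from collections import Counter


def codon_counts(orf: str) -> Counter:
    """Count codons in an in-frame ORF; trailing partial codons are dropped."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return Counter(orf[i:i + 3] for i in range(0, usable, 3))


def gc3_fraction(counts: Counter) -> float:
    """Fraction of codons whose third position is G or C."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no codons to analyze")
    gc3 = sum(n for codon, n in counts.items() if codon[2] in "GC")
    return gc3 / total


# Toy ORF: ATG GCC GCG AAA TAA
toy_counts = codon_counts("ATGGCCGCGAAATAA")
toy_gc3 = gc3_fraction(toy_counts)
```

Summing such counters across many open reading frames would give a pooled codon table, from which per-amino-acid codon preferences could then be derived.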

Table 2 Codon and Amino Acid Usage in Emiliania huxleyi from Analysis of 85 ESTs and 15 Full-length cDNA Clones

ESTs were grouped according to putative cellular function (Table 3) as described previously (Adams et al., 1995). The ESTs with putatively identified functions encompassed a wide variety of functional classes, including ribosomal proteins, cell division, gene or protein expression, cell signaling, cell structure, defense, and metabolism. Table 3 is not an inclusive list of all ESTs with e values less than 1 × 10−2, but rather a representation of a select set of ESTs from each functional class to demonstrate the apparent diversity of the library. Figure 3 shows the percentage distribution of sequences falling into each of the functional categories. Data from the 1086 ESTs with significant matches indicate that 35% of those sequences encode proteins involved in metabolism (Figure 3, A). Interestingly, 15% of the represented sequences encoded proteins involved in cell defense, supporting the hypothesis that coccolithogenesis may be a response to environmental or physiologic stress (Paasche, 2002). Genes with hypothetical or putative function represented 8.2% of the total sequences (Figure 3, B, groups 8 and 9), whereas novel sequences represented the majority of the total sequences, at 63.8% (group 10).

Table 3 Representative ESTs Showing Significant GenBank Match and Grouped into Functional Classes
Figure 3

Percentage distribution of ESTs by functional classes. A: ESTs with significant (e value <10−2) matches. (1) ribosomal proteins, 1.35%; (2) cell division, 1.6%; (3) gene/protein expression, 6.6%; (4) cell signaling, 7.9%; (5) cell structure, 9.3%; (6) cell defense, 14.7%; (7) metabolism, 35.2%; (8) other matches, 15.5%; and (9) hypothetical proteins, 7.9%. B: Total ESTs sequenced, with class numbering the same as in A. (1) 0.5%, (2) 0.6%, (3) 2.4%, (4) 2.9%, (5) 3.4%, (6) 5.4%, (7) 12.8%, (8) 5.4%, (9) 2.8%, (10) nonsignificant (e value ≥10−2) matches, 63.8%.
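The two panels of Figure 3 differ only in their denominator: panel A is expressed relative to the 1086 ESTs with significant matches, and panel B relative to all 3000 ESTs. A minimal sketch of the conversion is given below, using the panel A percentages from the caption; the variable and function names are illustrative assumptions.

```python
SIGNIFICANT_ESTS = 1086  # ESTs with e value < 1e-2 (from the text)
TOTAL_ESTS = 3000        # total ESTs sequenced (from the text)

# Panel A percentages as listed in the Figure 3 caption.
PANEL_A = {
    "ribosomal proteins": 1.35,
    "cell division": 1.6,
    "gene/protein expression": 6.6,
    "cell signaling": 7.9,
    "cell structure": 9.3,
    "cell defense": 14.7,
    "metabolism": 35.2,
    "other matches": 15.5,
    "hypothetical proteins": 7.9,
}


def rescale_to_total(percent_of_significant: float,
                     significant: int = SIGNIFICANT_ESTS,
                     total: int = TOTAL_ESTS) -> float:
    """Convert a percentage of significant ESTs to a percentage of all ESTs."""
    return percent_of_significant * significant / total


# Estimated panel B values, rounded to one decimal place.
panel_b_estimate = {name: round(rescale_to_total(p), 1) for name, p in PANEL_A.items()}
```

Because the panel A values are themselves rounded, the rescaled estimates agree with the panel B values in the caption only approximately.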

The most prevalent transcripts in the cDNA library generated from E. huxleyi cells grown under conditions promoting calcification, as determined by BLASTCLUST, are listed in Table 4. Because we constructed a nonnormalized primary library, cluster size is likely to be indicative of the relative abundance of each transcript in the messenger RNA population. Of the 3000 ESTs, a total of 25 clusters contained 10 or more sequences, together constituting 19% of the sequenced clones. The 3 largest clusters contained 131, 52, and 51 members, respectively. These transcripts, which are presumably the most abundant in the library, showed no significant similarity to sequences in GenBank. The most prevalent identifiable transcripts in the library were actin and polyubiquitin, clusters of which contained 51 and 37 members, respectively.
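The cluster-size summary described above can be sketched as a simple tally. The sketch assumes one cluster identifier per EST, as in a parsed BLASTCLUST result, and reports the clusters at or above a size threshold together with the share of ESTs they contain. The names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def prevalent_clusters(cluster_ids: Sequence[Hashable],
                       min_size: int = 10) -> Tuple[List[Tuple[Hashable, int]], float]:
    """Return clusters with at least min_size members (largest first)
    and the fraction of all ESTs that fall into those clusters."""
    if not cluster_ids:
        raise ValueError("no cluster assignments supplied")
    sizes = Counter(cluster_ids)
    big = [(c, n) for c, n in sizes.items() if n >= min_size]
    big.sort(key=lambda item: item[1], reverse=True)
    share = sum(n for _, n in big) / len(cluster_ids)
    return big, share


# Toy data: 25 ESTs in three clusters of 12, 10, and 3 members.
toy_ids = ["a"] * 12 + ["b"] * 10 + ["c"] * 3
ranked, share = prevalent_clusters(toy_ids)
```

In a nonnormalized library, the ranked list serves as a rough proxy for relative transcript abundance.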

Table 4 Most Prevalent mRNA Transcripts

Gene Content Analysis

Most known transcripts are considered housekeeping genes, such as those involved in metabolism (e.g., photosynthesis and carbon fixation, amino acid and carbohydrate metabolism, nitrogen and sulfur assimilation, and the synthesis of isoprenoids and phenylpropanoids). One metabolic transcript of particular interest is phosphoenolpyruvate (PEP) carboxykinase (5 copies), which plays a key role in C4 metabolism in plants. In many algae and vascular plants, the fixation of CO2 by PEP carboxylase works in concert with a C4-C1 decarboxylase (e.g., an NADP+- or NAD+-dependent malic enzyme) to provide CO2 to RubisCO (Raven, 1997). The presence of multiple PEP carboxykinase transcripts in the library suggests that E. huxleyi may be CO2 limited in seawater, and that C4 photosynthesis may support carbon assimilation in E. huxleyi, as described in the marine diatom Thalassiosira weissflogii (Reinfelder et al., 2000). Alternatively, PEP carboxykinase may function as another carbon-concentrating mechanism (CCM) in this alga. Many contend that E. huxleyi does not require a CCM because calcification (which shifts the DIC equilibrium toward CO2) is an efficient alternative in coccolithophorids and may even be more efficient than a traditional CCM (Steeman, 1966; Brownlee et al., 1994). Data obtained from recent studies, however, did not show a significant correlation between low CO2 concentrations and increased calcification rates, which would presumably generate more CO2 for photosynthesis (Clark and Flynn, 2000). In E. huxleyi, carboxylases other than RubisCO that have been shown to be involved in C4 photosynthesis in other organisms have not been investigated (Raven, 1997).

Our cDNA library was constructed from phosphate-stressed cells (f/50 medium), and thus it is not surprising that a number of cell stress or defense-related transcripts were present, including various heat shock proteins (HSP 70, HSP 80, HSP 81, HSP 82, and HSP 90) and the co-chaperonins Dna J and Dna K. A number of different transcripts related to programmed cell death and apoptosis were also noted. Several copies of a metalloproteinase sequence and a hypersensitive response element were identified along with cathepsin, caspase, metacaspase, and other members of the cysteine protease family. The collective presence and prevalence of these transcripts suggest that programmed cell death is an active process in E. huxleyi, and may be an adaptation to adverse environmental conditions, such as nutrient deprivation, that can trigger the rapid dissolution of algal blooms.

A number of different transcription factors and nucleic acid binding proteins were predicted from E. huxleyi ESTs by their similarity to known proteins. Although several general transcription factors are present, cmyb is the most abundantly represented transcription factor in the library, with 3 ESTs in the data set. Three other different myb transcription factors are also present. The Myb proteins are a family of transcription factors that occur in both animal and plant lineages but have been dramatically amplified in the plants. In Arabidopsis this large family of more than 100 gene regulatory proteins plays a fundamental role in regulation of metabolism. In both Arabidopsis and Chlamydomonas reinhardtii, one of the Myb transcription factors has been shown to be involved in signaling during phosphate starvation (Rubio et al., 2001). In E. huxleyi phosphate starvation is linked to calcification (Riegman et al., 2000); hence, it is reasonable to hypothesize that one of the Myb transcription factors could be involved in the regulation of genes involved in calcification and coccolithogenesis.

We were also able to identify proteins with zinc finger motifs as well as sequences with significant homology to several known homeodomain transcription factors. Although homeobox-containing genes play developmentally important roles in a wide variety of plants, animals, and fungi, few homeodomain proteins have been described in algae. A gamete-specific, sex-limited homeodomain protein has been identified in Chlamydomonas (Kurvari et al., 1998), and a homeodomain protein that appears to play a role during early reproductive development has been identified in Acetabularia acetabulum (Serikawa and Mandoli, 1999). Consequently, it is not unreasonable to envision a role for these homeobox transcription factors in the induction of phase variation events that lead to switching from the haploid (S-cell) to the diploid (C-cell) stage in the life cycle of E. huxleyi.

Another of the more interesting nucleic acid binding proteins is a posttranscriptional regulator that belongs to the pumilio family of RNA binding proteins. Members of this family of proteins in Drosophila melanogaster are responsible for maintaining germline stem cells (Forbes and Lehmann, 1998; Parisi and Lin, 1999); in Caenorhabditis elegans they promote the switch from sperm to egg production (Zhang et al., 1997; Tollervey and Caceres, 2000); and in Dictyostelium discoideum they control the development of reproductive structures (Souza et al., 1998, 1999). Pumilio-family proteins in S. cerevisiae regulate mRNA turnover by causing deadenylation and degradation of transcripts including the HO endonuclease involved in regulation of the mating-type switch (Tadauchi et al., 2001). In E. huxleyi the transition from one life cycle stage to another most likely affects the expression of a large number of transcripts, and it is again easy to imagine roles for posttranscriptional regulators such as a pumilio protein in maintaining one of the life cycle stages or in regulating mRNA turnover during phase transition. Given that life cycle phase transition in E. huxleyi has only been inferred from observational (microscopic) data (Klaveness, 1972; Laguna et al., 2001), flow cytometric data (Green et al., 1996), and more recently molecular data (Laguna et al., 2001), this study may provide the means to begin molecular and genetic characterization of the life cycle of this organism.

Several other cDNAs identified through this EST project should help to expand our knowledge of signal transduction pathways in E. huxleyi. Multiple copies of a calcium-dependent protein kinase showing significant homology to the green alga Dunaliella protein (Pinontoan et al., 2000) and a calcium/calmodulin-dependent protein kinase highly similar to the corresponding protein in Drosophila (Adams et al., 2000) were uncovered. Other signal transduction proteins related to the cell cycle and organelle inheritance included cyclin-dependent kinases (Cdks) and a cell cycle initiation mitogen-activated protein kinase with significant homology to the protein described in Chlamydomonas reinhardtii.

Knowledge of biomineralization and coccolithogenesis in E. huxleyi is in its infancy, and we have yet to unequivocally identify genes involved in these processes. In our library we have, however, found several genes encoding calcium binding proteins and proteins involved in calcium homeostasis. For example, the gene for the previously identified protein that is associated with intracellular precursors of coccolith polysaccharides (Corstjens et al., 1998) was present in our library, as was another acidic uncharacterized protein with a distinct calcium binding motif. The library was also found to contain multiple copies of the genes for both calnexin and calreticulin. Although calnexin and calreticulin reside predominately in the endoplasmic reticulum, the proteins affect many cellular functions both in the ER and outside of the ER environment. Calnexin and calreticulin are chaperones that also play a key role in calcium homeostasis and are known to affect a variety of cellular functions including lectin-like chaperoning, Ca2+ storage and signaling, regulation of gene expression, protein trafficking, and cell adhesion (Michalak et al., 1999; Huang and Beck, 2003). Whether or not these proteins are involved in the regulation of calcium in biomineralization is not known, but preliminary data from Northern analysis in our laboratory indicate transcription of calreticulin is upregulated in E. huxleyi cells grown in low-phosphate medium that promotes calcification, as compared with levels in cells grown in rich medium that appears to inhibit calcification.

We expect genes encoding proteins involved in biomineralization and coccolithogenesis to be novel sequences unlikely to be found in GenBank. Hence efforts in our laboratory are also being directed toward the most prevalent uncharacterized genes in the library that are identified in Table 4.

CONCLUSIONS

Our initial EST analysis, presented herein, is informative and indicates that the calcifying E. huxleyi cells express a complex set of genes. To our knowledge this analysis is the only available genomic resource for E. huxleyi and, as such, represents a valuable resource for future work with this important alga. A complete description of the data set is beyond the scope of this work; however, the complete data set will be deposited in GenBank, and efforts to construct an E. huxleyi database are underway in our laboratory. We have putatively identified the function of 1086 sequences, but the incomplete nature of EST sequences dictates that any inferred function for a given sequence should be interpreted with caution. Nonetheless, we have provided a conceptual framework of ESTs from which clones may be identified for more complete functional analysis by gene expression profiling, gene silencing or RNA interference, or biochemical characterization. Further studies aimed at gene discovery and functional analysis in E. huxleyi will help resolve the underlying mechanisms defining calcification, DMS emissions, and the complex life cycle of this ubiquitous and ecologically important marine organism. These efforts will be greatly facilitated by the Department of Energy’s recent selection of E. huxleyi for genome sequencing.