Introduction

The release of the first complete genome sequence of a microorganism in 1995 (Haemophilus influenzae) and of the first draft of the human genome 6 years later (Fleischmann et al. 1995; Venter et al. 2001) heralded the new age of genomics. Currently the complete genome sequences of almost 500 organisms are available, and the genome sequences of four times this number of organisms are in progress (http://www.genomesonline.org). One of the major revelations of this revolution has been the discovery of a large number of conserved/hypothetical genes, the function of which is essentially unknown. At least some insight into the function of the proteins that such genes encode might be available from their 3D structures, which could be predicted if the structure of a protein closely-related in sequence was available. However, in the late 1990s it was realized that for any uncharacterized protein (deduced amino acid sequence), the chance of finding a protein closely-related in sequence for which a crystal structure was available (to serve as a template for modeling) was very limited (Holm and Sander 1994, 1997; Orengo et al. 1999). Thus an international effort known as “structural genomics” (SG) was initiated to fill “sequence-structure space”. This was to be achieved by determining the 3D structures, mainly by X-ray crystallography, of representatives of all known protein families. It was anticipated that achieving this would require more than 5,000 new structures (Brenner 2000; Brenner et al. 1997; Burley 2000; Liu et al. 2004).

The SG concept was therefore a rational post-genomic goal to follow the success of the genome sequencing projects. In the United States the SG effort took the form of the Protein Structure Initiative spearheaded by the National Institute of General Medical Sciences (NIGMS/NIH). It began in 2000 with the goal of developing high-throughput (HTP) protocols to increase the rate, and decrease the cost, of determining the 3D structures of proteins by X-ray crystallography (X-ray), and to a lesser extent by nuclear magnetic resonance (NMR) spectroscopy, in order to find representative structures of all possible protein folds in the biological world (Brenner and Levitt 2000; Gaasterland 1998). While the individual tasks of selecting genes to be expressed, obtaining the recombinant proteins in a stable purified form, crystallizing them, and determining their structures, were commonplace, developing HTP methods for all aspects, with the goal of generating thousands of structures from a single center, was a daunting challenge. In the first 5-year pilot phase of the NIH-funded project, nine large, multi-disciplinary groups were formed between multiple university, government, and industrial laboratories [for example see Bonanno et al. (2005)] to develop the necessary technologies (see http://www.nigms.nih.gov/Initiatives/PSI for a complete list). SG efforts were also initiated around the globe with essentially the same goals and with significant international coordination and communication (http://www.isgo.org).

Initially, most of the SG groups targeted the proteins (genes) encoded by the sequenced genomes of one or more model organisms, typically including both prokaryotic and eukaryotic representatives. It was at this early stage that extremophilic microorganisms had the most impact on the SG phenomenon. In particular, among the first organisms to be targeted were the thermophilic bacteria Thermus thermophilus and Thermotoga maritima (DiDonato et al. 2004; Ito et al. 2006), and the thermophilic archaea Methanobacterium thermoautotrophicum [now Methanothermobacter thermoautotrophicus (Christendat et al. 2000)], Pyrobaculum aerophilum (Mallick et al. 2000), and Pyrococcus furiosus (Adams et al. 2003). It is important to note that one of the original SG initiatives began at RIKEN in Japan (Yokoyama 2005; Yokoyama et al. 2000) where the initial focus was exclusively on the extremophiles, Thermus and Pyrococcus. Proteins from thermophiles were prime targets in the initial stages of the SG projects mainly because of the anecdotal belief that such proteins crystallize more easily, and would therefore increase the overall success rates of transforming the gene of a hypothetical protein into a three-dimensional protein structure. Of course, there is no doubt that proteins from thermophilic organisms are more stable (Sadeghi et al. 2006; Szilagyi and Zavodszky 2000) than those from mesophiles, and thus more easily purified, manipulated, and more stable during the long time periods needed for crystallization. However, whether they really do crystallize more readily is an open question (Rees 2001). In any event, the purpose of this paper is to briefly review the interaction between the world of SG and that of extremophiles, with a specific focus on protein targets from thermophilic organisms.

Structural genomics: from the early years to the present day

The immediate goals of the SG centers were to develop new technologies for bioinformatics analyses of multiple genomes, HTP gene cloning, protein expression, purification, crystallization, and structure determination protocols. Traditionally in the biochemistry field, for any one protein under investigation by a single group, it was often the case that the process from gene cloning to the structure of the encoded protein would take many years. In contrast, early estimates were as high as 15,000 for the number of protein structures that needed to be determined in order to cover the perhaps 3,000 or more unique protein folds that may be represented in living organisms (Gaasterland 1998), and this might increase as the database of available sequences increased (Marsden et al. 2006). Consequently, by conventional technologies during the 1990s, it would have taken as many as 50,000 person-years worth of work for full coverage of all possible protein fold families, at an astronomical cost. Thus the primary goal of the SG effort was to reduce this time and cost by many orders of magnitude, by automating and streamlining as many steps as possible in the process.

One critical issue for the SG efforts was target selection, both from the perspective of limiting overlap between groups targeting homologous proteins in different organisms, and to limit protein targets to those expected to yield novel folds. The targets initially chosen by particular SG groups typically represented the research interests of individual members. Some SG groups had a common theme, for example, the SG of Pathogenic Protozoa Consortium (http://www.sgpp.org) and the TB SG Consortium (TBSGC http://www.doe-mbi.ucla.edu/TB/) focused on pathogenic protozoa and the tuberculosis-causing bacterium Mycobacterium tuberculosis, respectively. Similarly, the Center for Eukaryotic Structural Genomics (CESG http://www.uwstructuralgenomics.org) focused on the plant model organism Arabidopsis thaliana. In addition to the NIH projects in the US and RIKEN in Japan, there were also a large number of projects in Europe, including the Structural Proteomics in Europe (SPINE) project (Berry et al. 2006), as well as efforts in Germany (Banci et al. 2006), France (Abergel et al. 2003), and England (Cianci et al. 2005). Links to all relevant sites can be found at the International Structural Genomics Organization (http://www.isgo.org).

As noted above, however, a number of centers targeted proteins from thermophilic organisms, both from interest and from the perspective that their more stable proteins would be more tractable. For example, P. aerophilum was targeted by the TBSGC as an early validation test for target selection by assigning protein fold predictions to open reading frames (ORFs), in order to eliminate targets for which homologous structures may already exist (Mallick et al. 2000). Some of the earliest pioneering work, and the earliest success story, was with Methanobacterium thermoautotrophicum (Mt) carried out at the University of Toronto (Christendat et al. 2000). This study is particularly useful as it demonstrated all the strengths and weaknesses of the SG approach (vide infra). For example, one of the first problems that became evident is the high attrition rate as targets pass through the SG “pipeline” from gene to structure, and this aspect is discussed further below.

All SG groups soon realized the difficulties in obtaining recombinant forms of proteins in HTP mode. Consequently, many ORFs were automatically removed from target lists, such as those predicted to encode membrane proteins and others expected to be recalcitrant, such as very large proteins. Even so, success in recombinant protein production was much lower than anticipated. We and others (Adams et al. 2003) proposed that the high attrition rate could be due in part to proteins that are very unstable and perhaps rapidly degraded by the expression host. This instability was proposed to arise because of the lack of either simple (e.g., Fe or Zn) or complex (e.g., flavin) cofactors that are not properly “inserted” (and in some cases even synthesized) by the expression host, and/or to proteins which may only be stable when coexpressed with their partners to form a multiprotein complex (Adams et al. 2003). It was estimated that less than 20% of the ORFs in any genome would likely be expressed as a stable, properly folded protein, commonly named the “low-hanging fruit”. This prediction is consistent with the data from the first comprehensive SG report [<10 structures out of 424 targets from Mt (Christendat et al. 2000)], and with data available from many groups [Protein Data Bank (PDB), http://www.rcsb.org/pdb/, 2004]. While cloning is virtually 100% successful even at the HTP level, the loss at each step from gene target to structure can be as high as 50%, and is not correlated with research group, protocols used, or ORF targets (Acton et al. 2005), even when the more difficult membrane proteins have been removed from the target list (Bonanno et al. 2005).
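
The compounding effect of such per-step losses can be illustrated with a short calculation. The following is a minimal sketch and is not taken from any SG center's software: the function name `surviving_fraction` and the choice of four post-cloning steps are assumptions made for illustration, and the uniform 50% loss is the upper bound quoted above rather than a measured value for every step.

```python
def surviving_fraction(loss_per_step, n_steps):
    """Fraction of targets remaining after n_steps, each losing the same fraction."""
    if not 0.0 <= loss_per_step <= 1.0:
        raise ValueError("loss_per_step must be between 0 and 1")
    return (1.0 - loss_per_step) ** n_steps


# Hypothetical four-step pipeline after cloning, with the worst-case
# 50% loss assumed at every step.
fraction = surviving_fraction(0.5, 4)
print(f"{fraction:.1%} of cloned targets would reach a structure")
```

Under these assumed inputs only a few percent of cloned targets survive to a structure, which is the same order of magnitude as the yield reported for the Mt targets.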

Over the 5 years of the first phase of the NIH-funded SG effort, a number of key advances were made in bioinformatics (Bravo and Aloy 2006; Ginalski et al. 2005; Wolfson et al. 2005), recombinant protein production (Dieckman et al. 2006; Esposito and Chatterjee 2006; Hart and Tarendeau 2006; Marsischky and LaBaer 2004; Zhou and Chen 2004), and structure determination techniques (Arzt et al. 2005; Atreya and Szyperski 2005; McPherson 2004; Pusey et al. 2005; Wang et al. 2005). The results from the product-driven SG centers were recently compared and contrasted with those of conventional hypothesis-driven laboratories of individual investigators carrying out traditional (non HTP) approaches (Chandonia and Brenner 2006). An attempt was made to analyze quantitatively the cost and impact of protein structure determination between the two types of groups. It was found that about half of all novel protein structures are now solved at SG centers, and very significantly, the cost of solving a structure at these centers has been reduced to 25% of the estimated cost at a traditional laboratory (Chandonia and Brenner 2006). Nonetheless, while traditional hypothesis-driven structure laboratories may work on more difficult targets (such as protein complexes), their efficiency is similar to the HTP SG centers. In addition, publications from the traditional non-SG laboratories are cited more frequently, indicating that structures from the HTP SG laboratories are having a significantly lower impact. In fact, as discussed below, one of the major limitations of protein structures from the SG laboratories is that they are only deposited electronically and are not accompanied by a publication describing the biological consequences of the new structure.

The second and so-called production phase of the NIH-funded SG initiative in the US began in 2005 and this time two types of centers were created (Service 2005). They included large-scale centers dedicated to HTP protein production of novel targets, targeting orthologs of a particular protein from multiple species as at least one ortholog will typically be successful (Savchenko et al. 2003), and smaller, specialty centers dedicated to research on the more difficult problems, such as membrane protein and multiprotein complex expression (http://www.nigms.nih.gov/Initiatives/PSI), the so-called “high-hanging fruit”. There are currently four large production centers, and their common theme is HTP production to meet the project goal of 4,000 structures from different protein families that currently have no representative structure. In all cases technology development is a major focus, as well as dissemination of these technologies. The primary focus of the ‘Big4’ centers is to efficiently cover structure space. There has been a major philosophical shift from the first 5 years of the NIH initiative, as the goal is now to reach maximum efficiency of protein structure production and maximum coverage of sequence-structure space (http://www.psi-big4.org/index.php). As of December 2006, the ‘Big4’ have already produced almost 80% as many structures (671 vs. 854) in the first year of the second SG phase as they did in the 5 years of the first initiative. The six smaller, specialized centers will take the technologies developed in the first phase of the NIH-funded SG initiative and use these to develop more HTP protocols for the much more difficult proteins. While proteins from extremophiles and particularly thermophiles enjoyed premier status in the first 5 years of the SG initiative, there is no longer any emphasis on these proteins. Their genes are still selected, but only as one of a large group of orthologous sequences for a particular target protein of interest, and not because they are likely to generate highly stable proteins that are more amenable to crystallization.

Consequently, the protein (gene) target list has expanded dramatically in the recently initiated second phase of SG and contains many new species as sources of new orthologs of protein targets. There are currently over 600 species and strains from all three kingdoms of life and viruses represented with, in some cases, thousands of genes. While this incorporates genes from dozens of extremophiles including psychrophiles, (hyper) thermophiles, halophiles, acidophiles, etc. (see http://targetdb.pdb.org for a complete list of registered targets), these organisms no longer receive any “special” status and have no higher or lower priority than the genes from any other organism. In the second phase of SG, organisms per se are not the issue, protein families are. Nevertheless, as noted above, a number of the extremophiles were specifically targeted in the first 5 years of the SG initiative, and these projects have generated an enormous amount of information on their “extremophilic” proteins.

Targeted genomes from extremophiles

Methanobacterium thermoautotrophicum

M. thermoautotrophicum (Mt, now Methanothermobacter thermoautotrophicus) is a thermophilic (Topt 65°C) lithoautotroph isolated from sewer sludge (Zeikus and Wolfe 1972) which uses energy from H2 to reduce CO2 to CH4. Its 1.75 Mbp genome contains approximately 1,871 ORFs (Smith et al. 1997). This organism was initially the flagship organism of the SG world as a collection of ORFs from Mt was the first real test of the SG protocol (Christendat et al. 2000). Out of the 1,871 ORFs in the genome, 424 were selected (none were predicted to encode membrane proteins). Approximately 80% of these yielded protein when they were expressed in E. coli, but only about half gave rise to soluble protein. A total of 175 proteins were purified, and about half gave promising results in initial NMR and crystallization screens. Of the ones that formed crystals, 24 were chosen for optimization, and ten structures were solved (Christendat et al. 2000). In this first SG test case, the inherent problem of target attrition in the gene to protein structure pipeline was evident, but this project also demonstrated that the structure determination of proteins of unknown function could, at least in some cases, give strong indications as to their in vivo function. Of the ten structures reported in this work, five co-crystallized with ligands. For example, the structure of the uncharacterized protein MTH150 showed that NAD was bound to it. The structure also indicated a nucleotide binding fold, and biochemical assays demonstrated that the protein has nicotinamide mononucleotide adenylyltransferase activity (Saridakis et al. 2001). Other structures were for proteins of known function for which there was no existing structure. These included MTH129, which is an orotidine 5′-monophosphate decarboxylase, and MTH40, which is homologous to a subunit of RNA polymerase II and whose NMR structure revealed a novel Zn-binding motif (Christendat et al. 2000). Mt proteins continue to be targets of various SG centers and to yield novel structural information (Lee et al. 2004).

Thermotoga maritima

T. maritima (Tm) is a hyperthermophilic heterotrophic bacterium (Topt 80°C) isolated from a hot marine sediment in Vulcano, Italy (Huber et al. 1986). It ferments sugars and produces H2, making it of particular interest for the production of biofuels. The genome (1.86 Mbp) is predicted to contain 1,877 ORFs (Nelson et al. 1999), and it was also an early target of one of the most successful SG projects (Lesley et al. 2002) at the Joint Center for Structural Genomics (JCSG, http://www.jcsg.org). The JCSG alone has deposited 162 structures of Tm proteins in the PDB (http://www.rcsb.org/pdb/, 2004). Out of 21 novel protein folds discovered by this SG group to date, 15 are in Tm proteins. Overall, various SG centers have produced 220 structures of Tm proteins (including 10 determined by NMR). This represents about 12% of the total number of ORFs in this organism, which is a remarkable feat. Almost 1,000 recombinant Tm proteins have been purified successfully, 770 of them at the JCSG. An important point to note here is that this is a very valuable potential resource for the community of thermophile researchers. Many of these proteins may only be expressed at low levels, but there are now clones and data available on how to express and purify proteins representing at least half of the Tm genome. As discussed below, such resources are also available for several other extremophiles.

A number of structures of Tm proteins have yielded novel insights into the function of unknown proteins. For example, TM1662 encodes a surE homolog by sequence analysis and the structure of the Tm protein was determined (Zhang et al. 2001). The surE protein is conserved across all domains of life. However, its function is not clear, although it is expressed in stationary phase growth of E. coli. Through the SG effort, TM1662 was shown to be an acid phosphatase, despite having no sequence similarity to any other acid phosphatases. Another example is TM0654, which represented the first structure of an aminopropyltransferase (Korolev et al. 2002). This protein is involved in biosynthesis of common polyamines such as spermidine. The structure indicated that the active site was highly conserved in bacteria and eukaryotes, thus suggesting a universal catalytic mechanism and the specific residues likely to be involved (Korolev et al. 2002). Other structures of previously unknown ORFs have led to new, relevant insights into protein function. These include TM1643, which represents a completely novel family of enzymes, aspartate dehydrogenase, that catalyzes the first step of NAD biosynthesis (Yang et al. 2003). In addition, structures of unknown proteins can illuminate entire families of unknown genes. A good example is the NMR structure of TM0487, for which there are more than 200 homologs in the database. The structure of the Tm protein indicates a possible active site with a buried Asp residue (Almeida et al. 2005). Other structures of Tm proteins produced by SG groups have indicated unique covalent protein dimers and a novel DNA binding protein (Liu et al. 2005; Zhang et al. 2006).

The extensive library of structures of Tm proteins produced by SG efforts has also allowed for some initial attempts at correlating their high thermal stability with structural elements such as contact order (Robinson-Rechavi and Godzik 2005), density of salt bridges, and compactness (Robinson-Rechavi et al. 2006; Robinson-Rechavi and Godzik 2005). These data indicate that thermal stability correlates clearly with an increase in contact order between residues in the thermophilic proteins relative to mesophilic ones (Robinson-Rechavi et al. 2006). This is a particularly significant contribution to the understanding of protein stability, as there are many different proposals for the basis of the extreme stability of proteins from hyperthermophiles (Chakravarty and Varadarajan 2002; Sadeghi et al. 2006). The Tm protein collection has also been used for extensive screening of NMR structure candidates (Peti et al. 2004), protein solubility screening for crystallization optimization (Collins et al. 2005), and to design an HTP pipeline from cloning to structure determination (DiDonato et al. 2004). Clearly, the work with Tm has had a significant impact on the SG world in general and it remains one of the most studied organisms in this regard.

Thermus thermophilus

Thermus thermophilus (Tt) is an aerobic, thermophilic (Topt 68°C), Gram-negative bacterium originally isolated from a thermal environment in Japan (Oshima and Imahori 1971). This organism is of significant biotechnological interest as it is tolerant to a number of stress conditions (Koyama et al. 1986). It is amenable to genetic manipulation (Hashimoto et al. 2001) and is closely related to the mesophilic, radiation-resistant Deinococcus radiodurans (Henne et al. 2004). Its 2.12 Mbp genome is predicted to contain 2,238 ORFs. The Tt SG effort is being carried out by groups at Osaka University and RIKEN [see http://www.thermus.org/e_index.htm (Yokoyama et al. 2000)]. So far 1,450 Tt ORFs have been heterologously expressed, 930 recombinant proteins have been purified, and 632 have been crystallized. These have yielded 438 structures deposited to date in the PDB (http://www.pdb.org) by these groups, although very few have been formally described in publications, and hence few have any degree of biochemical characterization. Unfortunately, this is one of the drawbacks of the SG approach, where the primary goal is structure determination. The interpretation of a structure, particularly if it is not novel (in structural terms), is typically not a priority and is left to those outside of the SG projects.

As with Tm, the large amount of structural information generated on Tt proteins is being used to make global predictions about thermal stability, the solubility, and the crystallization ability of recombinant proteins. For example, 108 Tt sequences were used to predict structural domains, and experimentally assess these structural predictions and the stability of the recombinant proteins using NMR spectroscopy (Hondoh et al. 2006). A major part of the SG efforts with Thermus species in Japan has been the very promising development of HTP cell-free in vitro expression systems. This can eliminate a number of problems associated with in vivo expression such as cell lysis and multiple purification steps, as well as reducing the cost of isotopic labeling of protein targets (Endo and Sawasaki 2006; Yokoyama 2003; Yokoyama et al. 2000).

Pyrococcus furiosus and P. horikoshii

Two species of these obligately anaerobic, heterotrophic, hyperthermophilic archaea, both growing optimally near 100°C, have been the targets of SG projects. Pyrococcus furiosus (Pf) was isolated from a shallow marine solfatara near Vulcano, Italy (Fiala and Stetter 1986) and its genome of 1.9 Mbp contains approximately 2,200 ORFs. Pf was one of the initial target organisms at the NIH SG center SECSG (Adams et al. 2003). The specific goal was to express as many of its proteins as possible in a fully-folded, functional form. This involved developing expression protocols for recombinant proteins that contain cofactors and/or are part of multiprotein complexes, for example, by growth of the heterologous host in the presence of excess Fe or Zn for metal cofactors (Jenney et al. 2005), or by coexpression of multiple ORFs for multiprotein heteromeric complexes. It was predicted that at least 20% of the ORFs would encode membrane proteins (Holden et al. 2001), and that few of these would yield soluble proteins.

One critical issue in designing any experiment involving the entire proteome of an organism is how to precisely define that proteome, both in terms of the total number of ORFs, and their putative translation start sites. While the original annotation of the Pf genome contained 2,065 putative ORFs (Robb et al. 2001), there are two annotations currently in the major databases (http://www.ncbi.nlm.nih.gov and http://www.tigr.org) where up to 2,261 ORFs are predicted (Poole et al. 2005). One major issue in annotations that is particularly important for SG efforts is the correct start site for a given ORF. For example, 552 ORFs, or about 25% of the total proteome, in the two current annotations of the Pf genome differ in their start sites, many by the equivalent of more than 20 amino acids (Poole et al. 2005). The addition or deletion of a few critical residues at the N terminus of a protein could have a dramatic effect on protein stability, solubility and its ability to crystallize. There are no bioinformatic tools available to address this problem, so for the Pf project at the SECSG the maximum possible start site was chosen (which, if incorrect, would generate extended rather than truncated proteins) for all 2,192 of the predicted ORFs. Of these, 1,909 were cloned into an expression vector containing an N-terminal His affinity tag (MAHHHHHGS-). This allows protein purification by immobilized metal affinity chromatography (IMAC), as well as detection using an enzyme-linked immunosorbent assay (ELISA) with a commercial antibody against the His affinity tag [see http://www.secsg.org and (Sugar et al. 2005)].
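
The start-site rule described above (take the longest possible ORF when annotations disagree) can be sketched as follows. This is an illustration only and is not the SECSG's actual procedure: the function names `most_upstream_start` and `extra_residues`, the genomic coordinate convention, and the example coordinates are all assumptions.

```python
def most_upstream_start(candidate_starts, strand):
    """Pick the start coordinate that yields the longest ORF.

    candidate_starts: start coordinates proposed by different annotations
    for the same ORF (same stop codon, same reading frame).
    strand: '+' if the gene is read in the direction of increasing
    coordinates, '-' otherwise.
    """
    if not candidate_starts:
        raise ValueError("at least one candidate start is required")
    if strand == "+":
        return min(candidate_starts)  # upstream means a smaller coordinate
    if strand == "-":
        return max(candidate_starts)  # upstream means a larger coordinate
    raise ValueError("strand must be '+' or '-'")


def extra_residues(candidate_starts):
    """Extra N-terminal residues of the longest ORF relative to the shortest."""
    span_nt = max(candidate_starts) - min(candidate_starts)
    if span_nt % 3 != 0:
        raise ValueError("candidate starts are not in the same reading frame")
    return span_nt // 3


# Hypothetical ORF whose two annotations disagree by 66 nucleotides.
starts = [10_000, 10_066]
print(most_upstream_start(starts, "+"), extra_residues(starts))
```

Choosing the most upstream candidate means that a wrong guess adds residues to the N terminus instead of removing them, which matches the rationale given in the text.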

For the production of recombinant Pf proteins in E. coli, an automated screening was performed using a small scale (1 mL) expression system (SSE). The soluble and insoluble fractions of cell-free extracts were separated robotically and recombinant protein production was assessed using an antibody to the His tag (Adams et al. 2003; Sugar et al. 2005). All of the SG centers have developed and demonstrated similar types of HTP heterologous protein expression screens, for example (Acton et al. 2005; Alzari et al. 2006; Cornvik et al. 2006; Dieckman et al. 2006; Douris et al. 2006; Hart and Tarendeau 2006; Vincentelli et al. 2005). In the case of Pf, the expression screen data were used to scale production to (at least) 1-L cultures of E. coli for the purification of the milligram amounts of protein necessary for analyses by X-ray crystallography (and NMR spectroscopy). Clones that failed the expression step (either due to no or limited amounts of recombinant protein, or the production of insoluble, presumably unfolded protein) were subjected to protocols of increasing complexity, such as alternative E. coli expression strains, recloning with different expression vectors or affinity tags, different host organisms, etc. For Pf, a total of 2,381 cultures representing 1,008 unique ORFs were grown at the 1-L scale. Of these, 57% (578) produced sufficient protein to be detected after SDS-polyacrylamide gel electrophoresis (after the IMAC step) and 388 proteins representing unique ORFs have been purified. Of these, 259 (67%) gave the predicted mass when analyzed by mass spectrometry, i.e., they had not been degraded, or subjected to some unknown post-translational modification in E. coli, and 240 (62%) were submitted for X-ray crystallography screening (and 137 for NMR screening). This resulted in 108 crystals, 59 of which diffracted, and 29 structures were obtained. The results to date are indicated in Table 1.
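
The attrition through this pipeline can be tabulated directly from the counts quoted above. The sketch below is only an illustration: the function name `stepwise_yields` and the stage labels are assumptions, the stages are treated as strictly sequential (which simplifies the real workflow), and the mass spectrometry and NMR screening counts are left out because they are parallel checks rather than steps on the X-ray route.

```python
# Counts quoted in the text for the P. furiosus pipeline at the 1-L scale.
pf_stages = [
    ("ORFs grown at 1-L scale", 1008),
    ("detected after IMAC and SDS-PAGE", 578),
    ("purified", 388),
    ("submitted for X-ray screening", 240),
    ("crystals", 108),
    ("diffracting crystals", 59),
    ("structures", 29),
]


def stepwise_yields(stages):
    """Return (name, count, fraction of previous stage, fraction of first stage)."""
    first = stages[0][1]
    previous = first
    rows = []
    for name, count in stages:
        rows.append((name, count, count / previous, count / first))
        previous = count
    return rows


for name, count, step, overall in stepwise_yields(pf_stages):
    print(f"{name:35s} {count:5d}  step {step:6.1%}  overall {overall:6.1%}")
```

Read this way, about 3% of the ORFs grown at the 1-L scale (29 of 1,008) had yielded a structure at the time of writing.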

Table 1 December 2006 production statistics for Pyrococcus furiosus proteins from gene to structure, and for all structural genomics groups worldwide registered in the TargetDB [see http://www.secsg.org and Protein Data Bank (http://www.rcsb.org/pdb/, 2004)]

For the structures of Pf proteins determined by the SG effort, half of them (15 of 29) represented conserved hypothetical proteins. Unfortunately, insights into their biological functions provided by the structures were limited. For example, in the case of the hypothetical protein PF1455, its structure indicated that the protein is involved in the binding, transport, or detoxification of heavy metals (Mayer et al. 2006). On the other hand, some of the proteins enabled advances to be made in protein structure analysis. For example, PF1455 was used to demonstrate that with a rapidly collected, limited amount of NMR data (traditionally a slow method for structure determination), a structure can be modeled with sufficient detail to both render a prediction as to its possible function, and to classify it as a novel fold. Such information is extremely important in SG screening so that protein targets are not duplicated (Mayer et al. 2006). The structure of another Pf protein (rubrerythrin, PF1283) provided an example of domain swapping, an unusual observation in protein structure. In this case, domains from two different monomers in a dimer were intertwined to form a structure homologous to that of a previously characterized protein [in which the same structure is made up of domains from one monomer (Tempel et al. 2004)]. Research on Pf proteins at the SECSG has led to a number of methodological developments for HTP protein expression and structure determination (Jenney et al. 2005; Sugar et al. 2005; Valafar et al. 2004; Wang et al. 2005).

The other Pyrococcus species that is the specific target of an SG effort is P. horikoshii (Ph). In contrast to Pf, Ph was isolated from a deep sea hydrothermal vent in the Pacific Ocean (Gonzalez et al. 1998), although the two organisms are closely related and have similar size genomes (Lecompte et al. 2001). The SG effort with Ph at RIKEN (http://www.riken.go.jp) led to the production of 472 recombinant proteins, 447 of which were purified. Remarkably, this effort has led to over 180 structures of Ph proteins. Unfortunately, as is characteristic of SG, very few of these structures have been published in peer-reviewed journals and so the information is not widely disseminated to those in the field of extremophiles. Two other closely-related species (Cohen et al. 2003; Fukui et al. 2005), P. abyssi (Topt 98°C), isolated from a deep sea vent in the Pacific, and Thermococcus kodakaraensis (formerly Pyrococcus; Topt 85°C), isolated from a surface solfatara in Japan, appear on the target lists of various SG centers. However, only a few of their ORFs (36 and 6, respectively) have been utilized to produce proteins, and the organisms themselves (or rather their complete genomes) have not been targets of any SG effort.

Other extremophile targets

Extremophiles such as Tm, Pf, Tt and Mt are therefore unique as they have been specific targets of the initial SG efforts, and a large number of crystal structures of their proteins have been generated. The hyperthermophile Pyrobaculum aerophilum (Pa) was also one of the first target organisms at the beginning of one of the NIH-funded SG projects (for the Integrated Center for Structure and Function Innovation, formerly the TB Structural Genomics Consortium). However, this effort was not sustained as the focus of the center moved to the disease-causing, mesophilic bacterium Mycobacterium tuberculosis, and more recently, to a technology-based approach that emphasizes producing correctly-folded proteins regardless of source (http://techcenter.mbi.ucla.edu) (Protein Structure Initiative 2005). In a similar fashion, Methanococcus jannaschii (Mj) was a specific target organism of another SG center (at UC Berkeley, http://www.strgen.org). One of its early successes was the assignment of a biochemical function to a hypothetical Mj protein (Zarembinski et al. 1998). However, this SG group has since shifted emphasis to proteins from species of the mesophilic bacterium Mycoplasma (Chandonia et al. 2006). Although Mj is no longer a specific organismal target, its ORFs are still the subject of study: 317 targets are listed in the PDB and the structures of 20 Mj proteins have been determined, some of which have been characterized biochemically. For example, MJ0936 was shown to be a novel phosphodiesterase (Chen et al. 2004; Martinez-Cruz et al. 2002). Table 2 is a select list of some example target organisms from the TargetDB in the PDB, and demonstrates that a number of extremophiles have been targeted by the various SG centers around the world. It also shows that, at least in the early days of SG, the emphasis was clearly on thermophilic and particularly hyperthermophilic organisms (usually defined as those with Topt ≥ 80°C).

Table 2 Extremophiles as targets of structural genomics projects

Phase II of SG

Now that the 5-year pilot phase I of the NIH-funded SG initiative that began in 2000 is complete, the second, production phase is well underway. As stated above, there is a significant shift in priorities in this new phase. Individual organisms are no longer targeted, and while extremophiles had a major impact on the first phase, their proteins (genes) are now lost in the sea of orthologs that are chosen entirely by bioinformatic criteria. Nonetheless, proteins from (hyper)thermophiles and other extremophiles will certainly be included on these target lists, and should it hold true that these proteins are more stable and crystallize more easily than mesophilic proteins, then they will likely be overrepresented in the list of protein structures that are produced.

A summary of current statistics for all SG groups can be found at the PDB (http://www.rcsb.org/pdb/, 2004). At the time of this writing (December 2006), 119,506 targets from more than 600 organisms/strains have resulted in 2,767 crystal and 1,181 NMR structures (Table 1). Note also in this table that while the attrition rate across all groups has improved at some steps (for example, as many as 82% of soluble proteins are now successfully purified), only about 8% (3,948 of 46,064 proteins where expression has been attempted) have yielded either X-ray crystal or NMR structures. These numbers represent a glimpse of a rapidly changing scene, and a more in-depth analysis of these statistics has been reported recently (Chandonia and Brenner 2006). Of course, there continue to be general critiques of the SG philosophy [for example, Cyranoski (2006)], as significant funds in many countries, which could otherwise have gone to individual research laboratories, have been directed towards the SG efforts.
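The overall success rate quoted above follows directly from the counts given in the text. The short Python sketch below reproduces that arithmetic; the variable names are illustrative only and are not taken from the PDB or TargetDB. It gives roughly 8.6%, consistent with the "about 8%" quoted.

```python
# Counts quoted in the text (snapshot from December 2006).
# Variable names are illustrative, not part of any PDB/TargetDB schema.
crystal_structures = 2767
nmr_structures = 1181
expression_attempted = 46064

# Total structures is the sum of X-ray crystal and NMR structures.
total_structures = crystal_structures + nmr_structures

# Fraction of proteins with attempted expression that yielded a structure.
success_rate = total_structures / expression_attempted

print(f"Total structures: {total_structures}")
print(f"Structures per expression attempt: {success_rate:.1%}")
```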

The most serious problem in SG is that the steps in the “gene to structure” pipeline remain empirical: few predictive rules have become apparent, and these mainly concern properties of proteins such as thermostability and the correlation of physical properties with crystallization success (Canaves et al. 2004; Robinson-Rechavi et al. 2006). The hope is that a more extensive data set will allow better prediction of success in heterologous expression systems to obtain stable recombinant proteins. This would have a tremendous impact and make many more proteins available in fully-folded, functional forms for complete structural and functional characterization. As yet, such predictions are still hampered by the enormous variability inherent in proteins.

The impact of SG on extremophiles

The impact of genome sequencing on a particular organism or a group of organisms is clear-cut and readily appreciated, with quantitative results such as the number of bases and the number of predicted ORFs. The world of SG, however, is far more qualitative, and it is hard to measure how much impact has been made in a particular field, such as extremophiles. In general terms, it is clear that in a few short years the SG efforts around the world have contributed a large number of novel structures to the public databases, and many are of proteins from extremophiles. SG efforts have also yielded new HTP technologies that have accelerated bioinformatic analyses, cloning, protein expression screening, and structure determination, and these tools and protocols are available to all researchers. Those groups that are interested in the biology of extremophiles and are directly involved in such efforts have also benefited directly. However, it is more difficult to say that the SG efforts have made a very specific impact on the field of extremophiles in general. A large number of structures from particular organisms are now available, especially from Thermotoga, Pyrococcus and Methanothermobacter species, and these in turn allow structure modeling of homologous proteins from many other organisms (Todd et al. 2005). However, as yet the available SG structures have had no groundbreaking effect on extremophile research. In general, the biological contribution of SG efforts so far has been in using novel structure information to direct functional biochemical analyses (Sanishvili et al. 2003; Yakunin et al. 2004), but this has not really affected extremophiles.

The most important ramification of SG efforts for those who study extremophiles is more technical than scientific. This concerns the large number of recombinant proteins produced from a variety of genes from numerous extremophile sources. More importantly, the procedures and protocols to produce these recombinant proteins, while typically not published in the formal literature, are available on web sites from the various SG centers (links to all these centers can be found at http://www.nigms.nih.gov/Initiatives/PSI). Similarly, a huge collection of clones is also available for an even larger variety of extremophilic organisms, and these may or may not have been analyzed for the production of recombinant protein. The complete (and searchable) list of all targets selected by all SG centers worldwide can be found at the PDB (http://targetdb.pdb.org/). Such resources have been created by the SG phenomenon and are available to be utilized by the extremophile community at large.