1. Goals

To benefit from the genomic information generated in recent years, a full understanding of each gene’s function is necessary, an effort for which the term functional genomics has been coined. Even though parts of a genome such as regulatory regions, stem/loop structures or tRNAs act on the level of nucleic acids, most genomic information is translated into proteins with distinct structures, functions and interactions. The ultimate goal is to understand the workings of an entire living cell in a specific tissue, a massive web of processes that are predominantly based on proteins. Thus, proteomics goes beyond the characterization of individual proteins; its strength and goal lie in the characterization of all proteins in a given cell. Understanding all protein structures as well as their functions and dynamics in a cell is a daunting task that still requires substantial new technical developments, especially on the side of protein structure determination.

Nevertheless, and especially in the case of soluble proteins, a growing number of high-resolution structures is available, allowing the study of protein function, protein/protein and also protein/ligand interactions toward rational drug design for the treatment/prevention not only of infectious diseases but also of genetic disorders and inherited diseases. Unfortunately, in the case of membrane proteins a huge discrepancy still exists between the available information and the determination of its metabolic significance. On the one hand, these proteins are preferred targets for drugs because their exposed location on the cell surface makes them indispensable participants in drug entry into the cell. On the other hand, membrane proteins are notoriously difficult to study, and still only about two dozen high-resolution structures of membrane proteins are known.

2. Technology

The technical demands to accomplish successful proteomic investigations are tremendous at every stage of the process, and so far not a single cellular proteome has been characterized to completion. Large numbers of genes (10^4 to 10^6 per organism) have to be analysed in order to describe large numbers of proteins that in turn must be characterized structurally and functionally down to the atomic level. The information obtained will be stored in appropriate database systems that allow sufficient annotation, cross-referencing and rapid access. Therefore, every aspect of the data acquisition side of a proteomics project is labeled as ‘high throughput’, typically on the ‘micro(nano) scale’, whereas activities on the data storage and analysis side are often called ‘data mining’. Consequently, throughout the experimental stages of a large proteomics project, use is being made of robotic systems all the way from sample storage through sample preparation to data acquisition. To obtain proteomic information, a set of key technologies is being used.

2.1 Two-Dimensional Gel-Electrophoresis

The classical biochemical method for analysis of large and complex protein mixtures, high-resolution two-dimensional gel electrophoresis (2D gels), remains the key technology for proteomic research. Separation of mixtures is accomplished by isoelectric focusing of proteins according to their charge, followed by SDS polyacrylamide gel electrophoresis (PAGE), which separates according to size.[4] Over a thousand proteins can be resolved this way, and samples from healthy versus diseased tissues, for example, can be compared to provide a global picture of expression changes. With the recent and dramatic progress in computer technology and mass spectrometry, as well as the availability of genome data, a leap has taken place in proteomic research. Proteins on 2D gels can be quantified by advanced imaging software and the spots identified rapidly using mass spectrometry. Recent advances in 2D PAGE technology in proteomics have been extensively reviewed.[5]

2.2 Mass Spectrometry (MS)

Mass spectrometry is driving proteomics as modern equipment allows high-throughput identification of proteins via comparison with genomic data. Different approaches have distinct advantages and disadvantages.

2.2.1 Peptide Mass Tags

In the technique of peptide mass tagging (peptide mass fingerprinting), proteins are treated with a cleavage agent, typically the protease trypsin, which cleaves specifically after arginine and lysine residues, and the resulting fragment masses are measured as accurately as possible. This measurement is usually carried out by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS).[6] Experimentally derived masses are then matched to databases (reviewed in Fenyo[7]). The database software translates genes or expressed sequence tags (ESTs) in all possible frames, digests the ‘cyberprotein’ in silico and then calculates the theoretical masses of the fragments. Positive identification is achieved when the observed masses closely match the theoretical masses; confidence improves as measured and calculated masses agree more closely and as the number of matched peptides increases. Accuracy of mass measurement is critical,[8] but modern MALDI-TOF instruments achieving 10 ppm (+/− 0.1 Da at 10 kDa) or better, combined with sample preparation robots, provide massive throughput (10 to 100 samples/hour), making this the most widely used methodology in proteomics today.
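The digest-and-match step can be sketched as follows. This is a minimal illustration, not production search software: the protein sequence, simulated measurement error and tolerance are invented for the example, while the residue masses are standard monoisotopic values.

```python
# Sketch of peptide mass fingerprinting: digest a protein in silico with
# trypsin and match measured masses against the theoretical peptide masses
# within a ppm tolerance. Sequence and measurements are toy data.

# Monoisotopic amino acid residue masses (Da).
RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
WATER = 18.01056  # added once per peptide for the two termini

def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P (trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in 'KR' and not (i + 1 < len(sequence) and sequence[i + 1] == 'P'):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def match(measured, theoretical, ppm=10.0):
    """Pair each measured mass with any theoretical peptide within tolerance."""
    hits = []
    for m in measured:
        for pep, t in theoretical.items():
            if abs(m - t) / t * 1e6 <= ppm:
                hits.append((m, pep))
    return hits

seq = "MKWVTFISLLLLFSSAYSRGVFRR"        # toy sequence, not a real database entry
theo = {p: peptide_mass(p) for p in tryptic_digest(seq)}
measured = [peptide_mass("GVFR") + 0.00005]   # simulate a sub-ppm measurement error
print(match(measured, theo))
```

The same scoring idea scales up in real search engines, which additionally rank candidate proteins by the number and closeness of matched peptides.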

Common protocols involve selection of important targets from 2D-gel experiments for subsequent identification, but more ambitious protocols seek to systematically scan entire 2D gels in an automated fashion.[9] Software improvements now permit consideration of covalent modifications that alter experimental masses in a predictable fashion.[10] Very recent success using this approach has been achieved at Large Scale Biology Corp. (see URL: http://www.lsbc.com) by characterizing HPI v1.0, a subset of the human proteome covering 115 693 proteins derived from some 28 000 genes.[11]

2.2.2 Sequence Tags

Generation of sequence tags involves tandem mass spectrometry (MS-MS), usually after electrospray ionization (ESI).[12] First, a mass-selective filter is used to select an ion derived from a single peptide of the protein, just as for mass tags (section 2.2.1). The selected ion is then fragmented in the mass spectrometer using collision-activated dissociation (CAD), which typically breaks the peptide bonds, and the masses of the pieces are measured. Thus a sequence tag is the mass of a peptide plus the masses of the CAD fragments derived from it. Identification against the database is similar to the technique described for mass tags except that the theoretical masses of CAD fragments have to be calculated based upon the sequence. Since this technique incorporates true sequence data, confidence in protein identification is raised, though the computation is more cumbersome. Furthermore, the ESI mass spectrometers used to generate sequence tags can only perform one MS-MS experiment at a time, as opposed to MALDI-TOF instruments, which can record the masses of many different peptides in a single experiment. As a result of this constraint, reverse-phase liquid chromatography (LC) is typically coupled to ESI (LC-MS) in order to separate peptides from the digest mixture, and the throughput drops by at least an order of magnitude compared to MALDI.
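The fragment-mass calculation behind sequence-tag searching can be sketched as follows, using the common convention that cleavage at each peptide bond yields singly protonated b ions (N-terminal pieces) and y ions (C-terminal pieces). The peptide and the residue-mass subset are illustrative only.

```python
# Sketch: theoretical b- and y-ion masses for a candidate peptide, as
# database software must compute them to score a sequence tag.
# Monoisotopic residue masses (subset); singly charged fragments.

RESIDUE_MASS = {
    'P': 97.05276, 'E': 129.04259, 'T': 101.04768,
    'I': 113.08406, 'D': 115.02694, 'G': 57.02146, 'A': 71.03711,
}
PROTON = 1.00728   # mass of a proton (charge carrier)
WATER = 18.01056   # H2O, retained by y ions

def b_y_ions(peptide):
    """Masses of singly charged b ions (N-terminal pieces) and y ions
    (C-terminal pieces) from cleavage at each peptide bond."""
    b, y = [], []
    running = 0.0
    for aa in peptide[:-1]:              # b_i = residues 1..i + proton
        running += RESIDUE_MASS[aa]
        b.append(running + PROTON)
    running = WATER + PROTON             # y_j = residues + water + proton
    for aa in reversed(peptide[1:]):
        running += RESIDUE_MASS[aa]
        y.append(running)
    return b, y

b, y = b_y_ions("PEPTIDE")               # toy peptide, not real data
print([round(m, 3) for m in b])
print([round(m, 3) for m in y])
```

Matching these calculated fragment masses against the observed CAD spectrum is what raises the confidence of a sequence-tag identification over a plain mass tag.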

2.2.3 Intact Mass Tags

If a full-length protein can be mass measured accurately enough and its sequence matches that predicted by its gene, it is possible to identify proteins by their intact mass.[13] The widespread occurrence of post-transcriptional and post-translational modification in eukaryotes and especially humans generally precludes use of intact mass tags for identification, though future advances in software may bring increased focus to this arena. However, the mass spectrum of an intact protein defines the native covalent state of the gene product and its heterogeneity.[14] Thus, in order to monitor any covalent modification it is useful to record the mass spectrum of the intact protein. This is of special importance for monitoring subtle alterations that might not alter protein mobility on 2D-gels. One example is methionine oxidation. Even single methionine oxidation events can significantly modulate protein function[15] without altering charge or migration in SDS-PAGE. Toward this end protocols for extraction of intact proteins from gels are being developed.[16]
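Interpreting an intact-mass measurement then amounts to matching the shift from the theoretical mass against a table of known modification deltas. The sketch below uses standard monoisotopic modification masses, but the protein mass, tolerance and allowed modification counts are invented for illustration.

```python
# Sketch: explain an intact-protein mass shift by matching it against a
# small table of common covalent modifications. The +15.9949 Da entry is
# the methionine oxidation discussed in the text.

MOD_DELTAS = {
    "methionine oxidation": 15.9949,
    "phosphorylation": 79.9663,
    "acetylation": 42.0106,
    "disulfide bond (-2H)": -2.0157,
}

def assign_shift(theoretical, measured, tol=0.05):
    """Return (modification, count) pairs whose mass delta explains the
    observed shift, allowing each modification to occur once or twice."""
    shift = measured - theoretical
    hits = []
    for name, delta in MOD_DELTAS.items():
        for n in (1, 2):
            if abs(shift - n * delta) <= tol:
                hits.append((name, n))
    return hits

# A +15.995 Da shift on a hypothetical 10 kDa protein flags oxidation.
print(assign_shift(10000.000, 10015.995))
```

This is exactly the kind of subtle, charge-neutral change (invisible on SDS-PAGE) that intact-mass monitoring is suited to detect.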

Alternative strategies for separation of proteins from complex mixtures are also attractive because of the potential for covalent modification during 2D-gel analysis.[17] Furthermore, recent studies indicate that proteomics with whole cell extracts separated on 2D-gels is biased against less abundant proteins[18] and this has been hailed as a serious deficiency. At present there is no ‘quick fix’ for this problem and enrichment and purification of sub-fractions may be the only way to bypass this dilemma.[19] Howell, Yates and co-workers recently reported improved detection of low-abundance proteins of Golgi membranes by employing sub-fractionation techniques.[20] In studies of intact proteins by electrospray-ionization MS an added benefit arises in the ability to detect minor covalent variants of much more abundant proteins.[14] With current levels of investment in proteomics it seems likely that the near future will bring machines that automate the enrichment/purification process. The least abundant proteins may only be characterized after their capture by antibodies, and a substantial effort is underway to provide proteome-wide antibody collections (reviewed in Holt et al.[21]). Through use of Fourier-Transform Ion Cyclotron Resonance (FTICR) mass spectrometry,[22] which provides the ultimate in accuracy and resolution, McLafferty and co-workers have proposed ‘top-down’ proteomics. This involves the recording of a mass spectrum of an intact protein prior to fragmentation by CAD or other means to generate fragment ions that can be used to identify the protein and characterize its modifications.[23,24]

2.2.4 Membrane Proteins

As the benefits of mass spectrometry of intact proteins become more widely appreciated, increased attention will be devoted to the separation techniques coupled to MS. Membrane proteins make up a significant proportion of all genomes analysed to date and are an important class of drug targets. Therefore, we have devoted considerable effort to developing separation techniques for intrinsic membrane proteins that are fully compatible with the electrospray-ionization process. ESI is preferred because of the superior accuracy and resolution attained over MALDI-TOF analysis, though the latter is very useful in cases where excessive heterogeneity precludes interpretation of ESI spectra.[25] By performing reverse-phase liquid chromatography (LC)-MS on macroporous polymeric stationary phases in the presence of high concentrations of formic acid[26] it is possible to separate various intrinsic membrane proteins for ESI analysis. Rhodopsin was the first G-protein coupled receptor to be analyzed in this way; other membrane proteins thus far analyzed include the bacteriorhodopsin holoprotein and the thylakoid D1 herbicide receptor.[14] Larger membrane proteins are amenable to ESI-MS after size-exclusion high performance liquid chromatography (HPLC) in chloroform/methanol/aqueous formic acid, as demonstrated in an extensive study on the Escherichia coli lactose permease, which has 12 transmembrane helices.[27] Further studies demonstrated the broad applicability of the method, including the successful analysis of membrane proteins with up to 15 transmembrane helices and molecular weights approaching 100 kDa.[13,28] Limited separation of complex protein mixtures has been achieved on extracts containing all E. coli membrane proteins, to a degree that at least one membrane protein could be identified unambiguously by comparison with the database.[13]

2.2.5 Quantitative Proteomics Using MS

One reason for the widespread popularity of 2D-gel analysis is the ability to quantify proteins based upon staining intensity. Up or down regulation of a protein’s expression under an experimental treatment is visualized, and targets with altered expression are identified by mass spectrometry. Mass spectrometry itself is not truly quantitative due to the variable ionization efficiencies associated with different proteins. However, a number of experimental protocols aimed at providing quantitative information from mass spectrometric measurements have been reported. These methods rely upon distinguishing different masses for a given species in the mass spectrometer based upon stable isotope incorporation, thus allowing the relative abundance of the same protein in two samples to be compared within a single measurement.

In one report, relative expression of E. coli proteins was measured as a function of cadmium stress. Proteins from cells grown under experimental conditions were distinguished from controls by growth of one set in rare-isotope depleted media (15N, 13C, 2H depleted). Protein samples were mixed in equal quantities prior to analysis and relative expression was monitored by the relative intensities of the normal and light protein signals. The experimental protocol was then reversed to reveal any bias caused by growth in nutrient-depleted media.[29] This protocol could also be modified to a pulse-chase experiment to measure protein turnover rates. A similar method was employed to measure relative expression of yeast proteins as well as relative phosphorylation levels in mutants and controls.[30] Both of these techniques suffer from the disadvantage that cells must be grown in isotope-depleted media, a factor that limits the applicability of the method.

An alternative strategy is to modify control and experimental protein samples after extraction of the proteins. This was accomplished through use of isotope-coded affinity tag (ICAT) technology.[31] Chemically identical tags with thiol-directed reactivity at one end and biotin at the other are synthesized, with one batch being deuterium labeled. Parallel protein samples are tagged, pooled in identical quantities and digested with trypsin. The biotin tag is then used to isolate just the tagged peptides via an affinity procedure, greatly simplifying the complexity of the mixture to be analysed. Once again the mass spectrometer distinguishes the relative expression of a peptide (and thus the protein it was derived from) via the light and heavy tags. The technique ignores proteins that lack cysteine and others whose thiols are not accessible for reaction, though alternative specificities can be envisaged. Quantitative proteomics and mass spectrometry have been extensively reviewed.[32-37]
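The light/heavy pairing at the heart of such isotope-tagging experiments can be sketched as follows. The 8.05 Da spacing corresponds to the deuterated (d8) versus normal (d0) ICAT tag, while the peak list, single-charge assumption and tolerance are invented for illustration.

```python
# Sketch of ICAT-style relative quantification: for each tagged peptide,
# find the partner peak shifted by the heavy-tag mass difference and
# report the heavy/light intensity ratio as relative expression.

ICAT_DELTA = 8.05   # mass difference per deuterated (d8) tag

def quantify_pairs(peaks, n_cys=1, tol=0.02):
    """peaks: list of (m/z, intensity), assumed singly charged.
    Returns a list of (light_mz, heavy/light intensity ratio)."""
    pairs = []
    by_mz = sorted(peaks)
    for mz, inten in by_mz:
        target = mz + n_cys * ICAT_DELTA      # expected heavy partner
        for mz2, inten2 in by_mz:
            if abs(mz2 - target) <= tol:
                pairs.append((mz, inten2 / inten))
    return pairs

# One peptide pair: light at 998.50, heavy at 1006.55, 2-fold up-regulated;
# the 1200.00 peak has no partner and is ignored.
peaks = [(998.50, 1.0e5), (1006.55, 2.0e5), (1200.00, 5.0e4)]
print(quantify_pairs(peaks))
```

Real software must additionally handle charge states, isotope envelopes and peptides carrying more than one cysteine (the `n_cys` parameter hints at the latter).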

2.3 Determination of Protein Structures

To obtain high-resolution protein structures of reliable quality on an industrial scale, new technologies are needed to keep up with the pace of DNA sequencing. In principle, a high-resolution protein structure can be obtained either by x-ray crystallography or by solution nuclear magnetic resonance (NMR) methods. Both techniques are rather laborious and work reliably only in the case of small to medium-sized soluble proteins. No reliable technology is available for membrane proteins, and so far only about 20 have been crystallized successfully. Therefore other techniques such as Fourier transform infrared (FTIR) spectroscopy, electron paramagnetic resonance (EPR) spectroscopy or intramolecular cross-linking are being used, which usually require purification of the protein in milligram quantities.[38-41]

An impressive example of a pioneering and straightforward approach to obtaining real proteomic structure information has been established at the University of California Los Angeles, demonstrating the resources needed for this work (see URL: http://www.doe-mbi.ucla.edu/TB/). In a consortium approach integrating a few dozen renowned international laboratories with more than a hundred specialized scientists, the proteome of Mycobacterium tuberculosis will be characterized at the structural, functional and interaction level to ‘provide a foundation for a fundamental understanding of biology’. With the goal of high-throughput protein structure determination at the atomic level, the current situation in proteomics is reminiscent of the early attempts to obtain genomic information.

2.4 Data Management and Evaluation

To create protein/protein interaction maps and to identify signaling cascades, individual interacting proteins need to be identified as the building blocks for these networks. Both computational and experimental approaches have been harnessed. Using innovative strategies involving phylogenetic profiling, correlated evolution, correlated messenger RNA expression patterns and patterns of domain fusion, the function of many proteins can be predicted (reviewed in Eisenberg et al.[42]). Marcotte et al.[43] used such methods to assign function to over half of 2557 previously uncharacterized yeast proteins. Computation is also used extensively for protein structural modeling where solved structures are unavailable (reviewed in Sanchez et al.[44]), contributing greatly to understanding function and prediction of interactions.
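The phylogenetic-profiling idea mentioned above can be sketched simply: each gene is encoded as a presence/absence vector across a set of genomes, and genes with matching profiles are predicted to be functionally linked. Gene names and profiles below are invented for illustration.

```python
# Sketch of phylogenetic profiling: genes whose homologs co-occur across
# genomes (similar presence/absence profiles) are predicted to be
# functionally linked. Toy data, not real genomes.

PROFILES = {   # 1 = homolog present in that genome, 0 = absent
    "geneA": (1, 0, 1, 1, 0, 1),
    "geneB": (1, 0, 1, 1, 0, 1),   # identical profile to geneA
    "geneC": (0, 1, 0, 0, 1, 0),
    "geneD": (1, 0, 1, 0, 0, 1),   # one mismatch with geneA
}

def hamming(p, q):
    """Number of genomes where the two presence/absence profiles differ."""
    return sum(a != b for a, b in zip(p, q))

def linked_pairs(profiles, max_mismatch=0):
    """All gene pairs whose profiles differ in at most max_mismatch genomes."""
    genes = sorted(profiles)
    return [(g, h) for i, g in enumerate(genes) for h in genes[i + 1:]
            if hamming(profiles[g], profiles[h]) <= max_mismatch]

print(linked_pairs(PROFILES))                   # strict: identical profiles
print(linked_pairs(PROFILES, max_mismatch=1))   # tolerant matching
```

Allowing a small number of mismatches trades precision for sensitivity, which matters as the number of sequenced genomes grows.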

Experimental approaches to assembling large-scale interaction maps have focused on yeast two-hybrid technology, in which only interacting proteins activate a genetic switch that serves as a reporter for the interaction. Impressive information has been obtained using two-hybrid technology and, for yeast, complex maps of interactions have been published.[45-47] An alternative approach involves biochemical isolation of protein complexes and characterization of the proteins by mass spectrometry.[14,48] While this may not necessarily be a high-throughput approach, it may be the most effective way to completely describe a proteome because of its sensitivity for detection of minor species and low-abundance proteins. Both technologies may be used less often as more protein structures become available and docking studies can be performed in silico as a large-scale approach.

2.5 Expression Patterns, Microarray Data, DNA Chips and Protein Chips

Gene expression profiling is achieved using DNA microarrays (DNA chips) that allow high-throughput measurement of transcript abundance across a broad range of genes (the transcriptome). However, the frequent lack of agreement between mRNA levels and protein accumulation[49] has led many researchers toward the proteomics protocols described above. DNA chips do have a very important role to play in monitoring the functional status of the least abundant elements of the proteome itself, that is, transcription factors.

Though most practical proteomics is being done with 2D gels and mass spectrometry, there are protein chip technologies under development. One of the first is surface-enhanced laser desorption ionization (SELDI), which is showing potential for discovery of novel cancer biomarkers.[50] Specially modified surfaces are used to select subsets of proteins from complex mixtures prior to mass spectrometric analysis of intact proteins. Intact mass tags are then compared across experimental samples to detect changes in abundance. Actual identification of biomarkers then requires further proteomics experiments as described above. Antibodies will likely be incorporated into protein-chip design; an array of 27 648 recombinant human-brain proteins was screened in a single experiment, illustrating the potential strength of this technique.[51] Many other designs of protein chips are under development, and the near future will bring devices to measure protein/biomolecule interactions[52] or the activities of specific subsets of a proteome, for example, kinases.[53]

3. Proteomics, Drugs and Therapy

Powerful leads for the treatment of human disease are anticipated from large-scale proteomic research that identifies suitable targets. However, as we have learned from genomics, the mere discovery of a new disease-related gene normally does not contribute substantially to the development of successful drugs, especially since DNA rarely serves as a good drug target. Moreover, the availability of the human genome does not automatically deliver disease-related genetic information.[1,2] This can be found systematically only after determining several human genomes, that is, by sequencing different phenotypes. In the future, expression profiling through DNA arrays or sequencing of genomes from different age groups, races and tissues of patients will help to identify disease-related genes. Proteomics technology can also be used as a diagnostic tool to monitor the identity and abundance of individual proteins in body fluids and tissues at a given time. Certainly, proteomics will contribute to the understanding of dynamic interactions between proteins in the living cell; however, it will not necessarily provide therapies for diseases. In other words, proteome research will help to ask the right questions but will not automatically give the right answers.

In doing research on the human proteome, technical and ethical problems might arise due to the complexity of the human body and the potential need to identify and purify a sparsely available target protein. Fortunately, one important lesson that has been learned through genomic research is that for almost every gene found in humans a direct homolog can be found in mice. It is reasonable therefore to extrapolate these homologies in DNA sequences to homologies in protein structure, that is, identical folds and similar functions for the human and murine system. Indeed, the driving force behind sequencing the mouse genome as the first mammalian genome is exactly this high degree of homology.[54,55]

The main goal of proteomics in drug development is to understand the cellular machinery well enough to identify suitable targets against which a selective and efficient lead compound, and finally a successful drug, can be synthesized. Amid all the progress in data acquisition for sequences, structures and interactions, once the target has been identified the major work still remains: the screening of large candidate compound libraries in conjunction with clever medicinal chemistry that guarantees selective action and defined delivery of the drug.

4. Conclusion

Along with the growing knowledge of genomics, substantial efforts are underway to understand the encoded information, a challenge mainly conceived as proteomics. In the future similar fields will develop which inevitably end in ‘-omics’, such as metabolomics for the study of small metabolites, lipidomics for lipids and glycomics for sugars. The trend is to perform molecular life science in a general, global array approach that tries to integrate all molecules of a given entity using the power of modern computation. Certainly this is in contrast to traditional biochemical research aimed at in-depth exploration of individual molecules, which remains the only way to obtain information about structures, functions and mechanisms. However, to make a simple analogy, if we only look at the individual bricks of a house we do not obtain any information about the rooms, the windows and the doors. Therefore we might be able to obtain previously inconceivable insights into biology by using proteomics in the framework of careful hypothesis-driven research.