I. Introduction: Fungi as Model Organisms for SysBio Studies

Systems Biology (SysBio) is an overarching term used to describe a collection of methods that attempt to understand biological systems in a quantitative fashion, primarily using experimental data to formulate mathematical models that enable the prediction of the future behavior of a system, or of emergent properties, in response to environmental stimuli. A “system” in this context can be defined as any group of interacting biological components that either directly or indirectly perform a specific and measurable function; it can include a small or large number of components. For example, a protein complex, a regulatory pathway, an organ, a cell, or the entire organism could all be systems in different experimental contexts. SysBio is a comparatively young field. Although its exact origins are difficult to pinpoint, many ascribe its beginnings to the late 1960s, when the first attempts to investigate metabolic pathways in bacterial cells were performed (von Bertalanffy 1969). The formulation of metabolic control theory (MCT) by the groups of Kacser and Burns (1973) and Heinrich and Rapoport (1974) marks the birth of SysBio. MCT mathematically describes metabolic flux as an inherent system property whose control is shared among the different reactions of the system, which thereby influence one another.

“Scientific fields, like species, arise by descent with modification” (Kirschner 2005), and tellingly, to date, a consensus definition of what SysBio concretely means remains intensely debated. Nevertheless, SysBio has some key characteristics that differentiate it from classical reductionist-driven biological studies. Historically, scientific experiments have taken a reductionist approach, investigating a single gene under a single condition. However, biological systems are dynamic and nonlinear in nature, often making the reductionist approach unsuitable for investigating the behavior of a gene of interest. A primary goal of SysBio is to understand the wiring and connectivity in biological networks that define genotype–phenotype relations (Kohl et al. 2010). A “systems approach,” sometimes referred to as the “desk–bench–desk loop” (Mustacchi et al. 2006), combines computational and experimental tools to formulate a model based on prior knowledge of the system (i.e., from published literature, databases, and genome-wide data sets). This model is then tested under laboratory conditions and analyzed for biological insights that would otherwise have been difficult to gain without an interdisciplinary approach (Fig. 3.1). The recent introduction of whole-genome sequencing and genome-wide experimental platforms such as RNA sequencing (RNA-seq), metabolomics, microarrays, and mass spectrometry (MS), among several others, forms the core of the “omics-based” methodologies for data generation that drive SysBio studies.

Fig. 3.1.

The “bench–desk–bench” loop of SysBio. A typical SysBio workflow. Based on an observation, such as a difference in yeast colony size and phenotype under two different conditions, a hypothesis is formed based on the experimental results and further supported by prior knowledge in the published literature. This hypothesis can be addressed using a number of qualitative and quantitative methods, the results of which are deposited in publicly available databases. With these data, modeling approaches attempt to mimic, predict, and visualize the data. Once modeled, experimental verification and refinement of the model create the bench–desk–bench loop, in which iterative cycles of prediction and verification are undertaken until model predictions and experimental validation are representative of one another

The integration of data from these various technologies into SysBio models has remained a formidable challenge. Hence, SysBio approaches have been classified as “top-down,” “bottom-up,” or “middle-out” (Bray 2003; O’Malley and Dupre 2005; Bruggeman and Westerhoff 2007; Petranovic and Nielsen 2008). A top-down approach aims at extracting principles from experimental data representing the molecular properties of the studied system; it focuses on the comparison of genome-wide data sets, such as the transcriptome and proteome, to formulate a focused and testable biological hypothesis. A bottom-up approach uses knowledge of the molecular properties of the system components to predict the behavior of the system as a whole; in short, it connects smaller entities to predict, identify, and simulate the behavior of a bigger system, usually starting from prior knowledge of a specific gene and generating a model from these data to investigate the larger system. A middle-out approach describes a process of starting at the level for which the best information for the process of interest is available, and then combining higher and lower levels of structural and functional information, essentially breaking out of the stricter top-down and bottom-up loops in order to validate the hypothesis at the current state of biological understanding (Brenner et al. 2001). Regardless of the approach used, SysBio strategies differentiate themselves from more classical biological methods by consciously taking different levels and dynamics of biological data (DNA, RNA, protein) into consideration simultaneously.

Fungal SysBio was ushered in with the completion of the Saccharomyces cerevisiae genome sequence, the first completely sequenced eukaryotic genome (Goffeau et al. 1996). Since then, the genomes of many of the most common fungal pathogens including, but not limited to, Candida albicans, Aspergillus fumigatus, and Cryptococcus neoformans, have become available (Loftus et al. 2005; Nierman et al. 2005; van het Hoog et al. 2007). Community-wide initiatives such as the Fungal Genome Initiative (Cuomo and Birren 2010) and the 1,000 Fungal Genomes Project (http://1000.fungalgenomes.org/) have been useful tools for studying the evolution of fungal virulence. The discovery of key genes positively and negatively regulated during the infection process, and understanding the function of their products, will drive the design of new strategies to combat fungal pathogens.

In this chapter, we provide a comprehensive overview of recent SysBio methods suitable for the study of fungal virulence, including genome sequencing, -omics technologies, and bioinformatics tools, with an emphasis on computational and modeling-based approaches. We focus on the genera Candida and Saccharomyces; the latter stands out as a “workhorse” of fungal SysBio in which many of the methods described herein were originally established or tested (Mustacchi et al. 2006; Santamaria et al. 2011). These approaches are used to identify molecular wiring and dynamics in biological networks, with the goal of determining their biological function and eventually identifying novel therapeutic options. We describe the applicability of each method to specific experimental questions using numerous case examples and critically discuss some of the current pitfalls in the analysis of SysBio data sets.

II. High-Throughput and –Omics-Based Methods for Studying Fungal Virulence

SysBio approaches, especially top-down analyses, incorporate genome-wide data sets such as comparative genomics, transcriptomics, and proteomics data. These approaches fit into the category of “–omics” or “–ome” studies, which attempt to analyze a genome-wide response to a specific condition. –Omics studies represent an important shift in the way biological data are both produced and interpreted, complementing traditional hypothesis-driven research (Weinstein 2001). To understand SysBio as a whole, it is important to appreciate the types of data sets that it utilizes to address a given question. Several important methods have established themselves in this field over the past decade and have been used extensively to investigate fungal virulence. For clarity, we have divided these methods into qualitative and quantitative approaches. We address some of the most popular methods used at different levels of biological understanding, including DNA, RNA, and protein, as well as epigenetic modifications and validation methods, and examine the key contributions they have made to the understanding of fungal virulence.

A. Genomics

The initial genomic sequencing of S. cerevisiae was a monumental international collaboration that included some 600 scientists worldwide (Goffeau et al. 1996).

This sequencing was performed using a series of hybrid plasmids called “cosmids.” Cosmids had the advantage that much longer DNA stretches could be incorporated than with normal plasmids, so that longer stretches could be sequenced to build up the genomic library. Sequencing of polymerase chain reaction (PCR) fragments then filled the remaining gaps between sequence stretches of the assembled genomic library to complete the genome (Dujon 1993). The Candida Genome Sequencing project began directly after the S. cerevisiae sequencing in 1996, ending in 2004 with the C. albicans genome assembly known as Assembly 19 (Jones et al. 2004). This genome assembly was divided into 412 contigs (consensus stretches of DNA that are assembled to form the scaffold of the genome assembly) and sequenced with a shotgun-based sequencing strategy. In order to obtain a more complete view of the diploid sequence, Assembly 21 was created using a fosmid library, which is conceptually similar to a cosmid library except that it is based on a bacterial F-plasmid and is more stable than a cosmid because of its low copy number (Hall 2004). These early sequencing projects took years because of the low throughput.

Today, genome and transcriptome sequencing has become routine, with ever-increasing stability and coverage (several fungal genomes can now be sequenced in a couple of days) and with bioinformatics assembly tools publicly available. As DNA sequencing technologies have become more efficient, there has been a surge in the number of sequenced genomes, with over 150 fungi sequenced so far (Marcet-Houben and Gabaldon 2009). These sequences facilitate functional and comparative genomics studies. Functional genomics aims to understand relationships between genotype and phenotype. Comparative genomics attempts to identify genes or genetic rearrangements between closely related species based on their DNA sequence; in the case of fungal pathogens, this often involves comparing a highly virulent species to a significantly less virulent or even avirulent species. This is done in order to identify genetic transitions that might explain the evolutionary divergence of pathogens, or to identify novel virulence factors.

Comparative genomics studies use two main techniques: comparative whole genome sequencing or hybridization-based microarrays. Comparative whole genome sequencing attempts to identify genetic elements present in one species and absent in another based on the genome sequence; this is done by aligning the genome sequences and identifying outlier sequence stretches that do not match between them. Comparative genomic hybridization (CGH) arrays identify genome-wide variation in gene copy number. CGH experiments assume that the binding ratio of the experimental sample to the control is proportional to the sequence concentration in the samples. These methods have provided significant insight into the evolution of pathogenicity for many fungal species. For example, early comparative studies of fungi identified 228 genes with strong sequence homology across the S. cerevisiae, Schizosaccharomyces pombe, Aspergillus niger, Magnaporthe grisea, C. albicans, and Neurospora crassa genomes for which no homology was found in the human or mouse genomes, representing potential targets for pan-fungal treatment (Braun et al. 2005).

Numerous studies have investigated the evolution of pathogenicity within a single fungal clade. For example, in the Candida clade, eight genomes were sequenced, including the C. albicans WO-1 strain (which is characterized by white-opaque switching and is associated with specificity to host tissues), along with the de novo sequencing of C. tropicalis, C. parapsilosis, Lodderomyces elongisporus, C. guilliermondii, and C. lusitaniae, many of which are now classified as emerging fungal pathogens (Butler et al. 2009). These strains were compared to the previously sequenced genomes of the C. albicans clinical isolate SC5314, the marine yeast Debaryomyces hansenii, and species from the Saccharomyces clade. Some 21 gene families emerged that were enriched in pathogenic species as compared to nonpathogenic fungi. A related study investigated the closest known relative of C. albicans, C. dubliniensis, which, despite its similarities, is significantly less virulent than C. albicans. Comparative sequence analysis has identified almost 200 species-specific genes in C. albicans, including the key invasion gene ALS3 and the aspartyl proteinase family members SAP4 and SAP5, all of which are absent in C. dubliniensis (Jackson et al. 2009). ALS3 is among the most important virulence factors in C. albicans; it encodes a cell surface protein that plays a major role in adhesion to host cells and in maintenance of infection (Hoyer 2001; Hoyer et al. 2008). Notably, numerous translocations were identified in C. dubliniensis, especially in the SAP family, which is known to play a role in Candida pathogenesis. Comparative genomics has even lent itself to the investigation of genetic variation at the chromosome level: CGH of a single C. albicans isolate that had been passaged multiple times in an in vivo model organism revealed the impact of the host environment on strain evolution (Forche et al. 2009). Together, these studies demonstrate that even closely related species have diverged significantly at the genomic level, suggesting mechanisms for the evolution of fungal virulence factors.

A number of resources for fungal genomics research have recently been made available. Large genome databases, including the Broad Institute (http://www.broad.mit.edu/annotation/fgi/), the Sanger Center (http://www.sanger.ac.uk/Projects/Fungi/), the Institute for Genomic Research (http://www.tigr.org/tdb/fungal/), and the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/genomes/FUNGI/funtab.html), are all publically available.

B. Transcriptomics

The DNA sequence reveals how information is stored in the genetic code and how it can ultimately be translated into protein molecules; however, it does not provide information on the diverse molecules actually produced in the cell, either normally or in response to different environmental conditions. This information resides at the RNA level and is obtained by looking at the complete set of RNA species produced in a given population of cells at a specific time. This field is referred to as transcriptomics.

There are two main methods to investigate transcriptional dynamics, also known as expression profiling: genome-wide microarrays and, more recently, next-generation sequencing (NGS) technologies, most notably RNA-seq. Microarray experiments employ special microarray chips carrying printed copies of the entire genome and are used for assessing the relative differences in gene expression between a control sample and a treated sample. A microarray chip is, in principle, able to measure the relative changes in expression levels of all known genes simultaneously. Southern blotting was the inspiration for microarray technology (Maskos and Southern 1992). Southern blotting involves the hybridization of a DNA probe to a specific DNA fragment on a solid substrate. Microarrays use the same principle but cover a genome-wide scale rather than single genes. The chip itself is made up of probes, corresponding to short DNA stretches for all genes in the genome. Depending on the specific experimental question, there may be multiple probes for each gene of interest, and the number of spotted probes is often well into the thousands. In a typical microarray experiment, RNA is collected from an organism of interest, transcribed into cDNA (the sample is also referred to as the “target”), and hybridized to the chip, where the target forms hydrogen bonds with the probes. In order to determine the relative abundance of the transcripts, the chip with the hybridized sample is then scanned. In theory, if a gene is expressed in the organism, its transcript will have hybridized to the corresponding probe on the chip. The abundance of a gene product is then measured by the detection of chemiluminescent-labeled targets. Based on the intensity of the target–probe hybridization, the relative abundance of the RNA produced by the organism in response to a condition can be measured. Because microarray technology is chip-based, its ability to detect a specific gene or transcript is limited to the original spotting on the chip itself. This is especially important to keep in mind for organisms for which only incomplete genomic sequences are available, leading to low-quality annotations and unknown alternatively spliced products, which would remain undetected if not taken into account in the original design of the microarray.
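
The core readout of such an experiment is the ratio of target–probe intensities between the two samples. As a minimal illustration, the sketch below (Python, with hypothetical intensity values and a simple global median normalization; real platforms use dedicated normalization pipelines) computes normalized log2 ratios for a handful of probes:

```python
import numpy as np

# Hypothetical two-channel intensities for five probes (treated vs. control).
# Real experiments involve thousands of probes and platform-specific normalization.
treated = np.array([1200.0, 340.0, 15000.0, 800.0, 95.0])
control = np.array([600.0, 700.0, 5000.0, 820.0, 400.0])

# Log2 ratio: positive values indicate higher expression under treatment.
log_ratio = np.log2(treated / control)

# Simple global median normalization: assume most genes are unchanged,
# so the median log-ratio is shifted to zero.
normalized = log_ratio - np.median(log_ratio)

for i, lr in enumerate(normalized):
    print(f"probe_{i}: log2 ratio = {lr:+.2f}")
```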

The first genome-wide array was developed for S. cerevisiae in 1997 (Lashkari et al. 1997). Since then, microarrays for fungi have evolved into high-density tiling arrays (Sellam et al. 2010) and splicing-sensitive exon-junction arrays (Inada and Pleiss 2010), among many others. Microarray technology has been extensively used to investigate global changes in gene expression in response to changing environmental conditions and genetic knockouts. It has also been used in conjunction with immune cell and animal models to investigate different infection stages in vitro and in vivo. For example, the transcriptional response of both S. cerevisiae and Candida glabrata to antifungal agents and other chemical stress agents in vitro has been profiled (Lelandais et al. 2008). To identify pathogen-specific responses on the side of C. glabrata, the authors compared the transcriptional profiles of both species after treatment. Surprisingly, they found a high conservation among the regulated genes, as well as a subpopulation of genes that were pathogen-specific. Further, in vitro infection experiments of human blood cells with C. albicans identified a number of differentially expressed genes, which may be important for the survival of Candida during bloodstream infections (Fradin et al. 2003). Transcriptional profiling has been used to identify effects of phagocytosis of C. albicans by immune cells, including neutrophils (Rubin-Bejerano et al. 2003) and macrophages (Lorenz et al. 2004). These studies identified the extent of the amino-acid-deficient environment within the phagosome and characterized the dynamic starvation response of Candida over the time course of infection. The first dual transcriptional profiling using microarrays for a host–pathogen interaction was performed with conidia of A. fumigatus during infection of human airway epithelial cells. This work confirmed the upregulation of the inflammatory interleukin (IL)-6 and the immune response to conidia, as well as pathways whose activation had previously only been investigated from either the host or the pathogen perspective alone (Oosthuizen et al. 2011).

Transcriptional profiling using RNA-seq is conceptually similar to microarray analysis, insofar as the end result of the experiment is often a list of differentially expressed genes. However, the sample is sequenced using a parallel sequencing approach referred to as next-generation sequencing (NGS) instead of using hybridization-based methods. Building on Sanger sequencing, high-throughput technology began with tag-based methods developed so that multiple sequencing reactions could be run in parallel. These included serial analysis of gene expression (SAGE) (Velculescu et al. 1995), cap analysis of gene expression (CAGE) (Kodzius et al. 2006), and massive parallel signature sequencing (MPSS) (Reinartz et al. 2002). In order to increase the scale of reactions taking place, a number of novel sequencing strategies and commercially available platforms have been developed, including Roche/454, Illumina/Solexa, Life/APG, Helicos BioSciences, and Pacific Biosciences. Each system has pros and cons, depending on the biological application (Metzker 2010).

In a typical RNA-seq experiment, cDNA is first fragmented; these templates are then attached to a substrate (which will vary with the technology used) with the aid of adaptor sequences. The immobilization of the template samples allows billions of simultaneous sequencing reactions, differentiating NGS from first-generation sequencing technology in terms of capacity and cost (Metzker 2010). Templates can be sequenced either from one end (single-end sequencing) or from both ends (paired-end sequencing). The resulting sequencing reads can vary in length, depending on the technology used, from less than 30 bp to over 300 bp (Wang et al. 2009; Metzker 2010). Reads are then mapped back to the reference genome to determine gene expression and, when compared to other samples, differential gene expression.
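
Because longer genes and more deeply sequenced samples accumulate more reads, raw read counts must be normalized before expression values can be compared. A minimal sketch of one common normalization, transcripts per million (TPM), using hypothetical counts and gene lengths:

```python
import numpy as np

# Hypothetical read counts per gene and gene lengths in kilobases.
counts = np.array([500, 1200, 80, 3000], dtype=float)
lengths_kb = np.array([2.0, 4.5, 1.0, 10.0])

# TPM: normalize first by gene length, then by sequencing depth,
# so values are comparable across genes and across samples.
rate = counts / lengths_kb
tpm = rate / rate.sum() * 1e6

print(tpm)        # per-gene expression values
print(tpm.sum())  # sums to 1e6 by construction
```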

RNA-seq has rapidly gained in broad popularity over the past few years, especially because of its ability to sequence to a high depth and also because it detects low abundance transcripts, offering a more complete view of the transcriptional profile of an organism than microarrays.

Sequencing technology has significant advantages over microarrays, especially for non-model species, as the detection of expressed genes does not depend on a priori knowledge of the gene investigated. Moreover, RNA-seq does not have intrinsic limitations to the dynamic range of detection (Royce et al. 2007). RNA-seq has been especially important in the detection of novel noncoding RNA species and small RNAs, as well as for de novo annotation (Wang et al. 2009). A de novo annotation of the C. albicans transcriptome under nine different in vitro environmental conditions was recently performed and identified over 600 novel transcriptionally active regions and introns from a total of 177 million uniquely mapped reads (Bruno et al. 2010). Similarly, in A. fumigatus, RNA-seq was used to investigate planktonic and biofilm growth to identify differences in pathological and morphological characteristics between these two stages. Numerous biofilm-specific genes were identified as being regulated, representing candidate targets related to biofilm development.

Most recently, the first dual-species RNA-seq approach, sequencing an RNA mixture comprising both host and fungal pathogen transcriptomes over a time course of infection, has been accomplished. Using mathematical approaches, this study predicted, and then experimentally verified, novel host–pathogen regulatory networks implicated in the interaction. The use of a combination of sequence analysis and network inference enabled this dual-systems approach (Tierney et al. 2012). This study presents the first adaptation of network inference to model host–pathogen interactions, validating the use of network inference for the analysis of multiple-species data sets.

1. Clustering Gene Expression Data Sets

The most common output of transcriptomics is a list of genes differentially expressed in one condition versus a control condition. Differential gene expression analysis begins with testing for the statistical significance of the variation within the sample. Statistical approaches for determining differential expression have been extensively reviewed elsewhere (Cui and Churchill 2003), as have a number of freely available tools to aid in statistical analysis for both microarray (Steinhoff and Vingron 2006) and RNA-seq data sets (Sun and Zhu 2012). This step reduces a genome-wide comparison down to only those genes significantly affected under a specific condition. Convenient analysis pipelines, especially for RNA-seq data, have recently been created to help non-computational biologists in the analysis of sequencing data, from the raw data file to a list of differentially expressed genes (Oshlack et al. 2010; Garber et al. 2011). High-dimensional data such as microarray or RNA-seq samples complicate analysis because the number of measured variables far exceeds the number of samples. Because the list of differentially expressed genes can still be on the order of several hundred genes, additional methods to reduce complexity are often necessary.
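
As a minimal illustration of the testing step, the sketch below (Python, on simulated data; real pipelines use moderated statistics such as those in limma or DESeq2 rather than a plain per-gene t-test, so this is illustrative only) applies a two-sample t-test followed by Benjamini–Hochberg multiple-testing correction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated log-expression matrix: 1000 genes x 3 replicates per condition.
control = rng.normal(5.0, 1.0, size=(1000, 3))
treated = rng.normal(5.0, 1.0, size=(1000, 3))
treated[:50] += 2.0  # spike in 50 truly regulated genes

# Per-gene two-sample t-test across replicates.
_, pvals = stats.ttest_ind(treated, control, axis=1)

# Benjamini-Hochberg correction: rank p-values, scale by m/rank,
# then enforce monotonicity from the largest rank downwards.
m = len(pvals)
order = np.argsort(pvals)
ranked = pvals[order] * m / (np.arange(m) + 1)
qvals = np.empty_like(pvals)
qvals[order] = np.minimum.accumulate(ranked[::-1])[::-1]

print("genes with q < 0.05:", int((qvals < 0.05).sum()))
```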

Partitioning expression data into subgroups of genes, called clusters, facilitates visualization and interpretation of the data underlying a biological process of interest. Depending on the approach used, the groupings can then be visualized by scatter plots, histograms, dendrograms, or heat maps. Genes are clustered into specific categories, which can be functional, structural, temporal, or a combination of the above. A number of clustering approaches have been developed, including principal component analysis (PCA), hierarchical clustering, fuzzy clustering, biclustering, and mutual information analysis, each of which tackles different potential biases of the data set (Eisen et al. 1998; Kerr et al. 2008).

PCA identifies data trends within samples, called principal components, such that very large data sets can be graphically represented using a smaller number of dimensions (Ringner 2008). This technique is especially useful for visually identifying batch effects or noise between samples, which may otherwise negatively affect downstream analysis.
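
A minimal sketch of how PCA is typically applied to expression data (Python with scikit-learn; the matrix and the simulated batch shift are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical expression matrix: 12 samples (rows) x 2000 genes (columns),
# with the last six samples shifted to mimic a batch effect.
X = rng.normal(size=(12, 2000))
X[6:] += 1.5

pca = PCA(n_components=2)
coords = pca.fit_transform(X)  # each sample reduced to two coordinates

print(coords)                         # batch structure separates along PC1
print(pca.explained_variance_ratio_)  # variance captured per component
```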

Hierarchical clustering aims to create a hierarchy of gene groups, whereby relationships among genes are represented by a dendrogram. The shorter the length of dendrogram branches between objects, the more closely related the gene expression patterns are. These differences are assessed by pair-wise similarity functions. In this way, the method builds a hierarchy of gene groups by progressively merging clusters (Eisen et al. 1998). One of the major limitations of hierarchical clustering is that the decision-making for gene assignments is focused locally, without considering the data set as a whole, which can affect downstream interpretation (Tamayo et al. 1999).
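
A minimal sketch of agglomerative hierarchical clustering on hypothetical expression profiles (Python with SciPy; the profiles and the choice of average linkage with Euclidean distance are illustrative assumptions):

```python
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(2)

# Hypothetical expression profiles for 8 genes across 5 conditions;
# genes 0-3 and 4-7 follow two different temporal patterns.
profiles = np.vstack([rng.normal(0, 0.2, (4, 5)) + [0, 1, 2, 1, 0],
                      rng.normal(0, 0.2, (4, 5)) + [2, 1, 0, 1, 2]])

# Agglomerative clustering: progressively merge the most similar clusters,
# building the hierarchy that a dendrogram would display.
Z = hierarchy.linkage(profiles, method="average", metric="euclidean")

# Cut the dendrogram into two flat clusters.
labels = hierarchy.fcluster(Z, t=2, criterion="maxclust")
print(labels)  # genes with similar patterns share a label
```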

Fuzzy clustering (also referred to as soft clustering) was developed to partially counteract the local bias of hierarchical clustering approaches. Fuzzy clustering allows data elements to belong to multiple groups simultaneously with respect to a given criterion. Each data element has a “degree of belonging” to each cluster, rather than being assigned to a single cluster, and this degree represents how well the element fits each of the clusters (Dembele and Kastner 2003; Fu and Medico 2007). This is in contrast to hard clustering, where each data element is allowed to belong to only one group. A minimal fuzzy c-means sketch follows.
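
To make the “degree of belonging” concrete, here is a minimal fuzzy c-means sketch (Python; the data, the fuzzifier m = 2, and the fixed iteration count are illustrative assumptions, not the algorithm of any specific publication cited above):

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: every point receives a degree of belonging
    to each cluster instead of a single hard assignment."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)               # memberships sum to 1
    for _ in range(n_iter):
        w = u ** m                                   # fuzzified weights
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        inv = (d + 1e-12) ** (-2.0 / (m - 1))
        u = inv / inv.sum(axis=1, keepdims=True)     # standard FCM update
    return u, centers

# Two hypothetical expression patterns plus one ambiguous gene in between.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.1, (5, 2)),
               rng.normal(1, 0.1, (5, 2)),
               [[0.5, 0.5]]])
u, _ = fuzzy_cmeans(X)
print(np.round(u, 2))  # the last gene has substantial membership in both clusters
```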

Some of the newest clustering approaches have attempted to incorporate prior biological knowledge into the clustering algorithm. This has been attempted with a form of biclustering (Madeira and Oliveira 2004), a matrix-based clustering approach that includes both genes and conditions in the algorithm. One example algorithm, called cMonkey, was used to identify and cluster sequence motifs in Helicobacter pylori, S. cerevisiae, and Escherichia coli based on microarray data sets (Reiss et al. 2006). A similar clustering approach that incorporates prior knowledge, called mutual information analysis, has also been shown to identify transcriptional interactions with high fidelity in mammalian cells (Margolin et al. 2006). Finally, a number of standardized tools and analysis techniques are already publicly available (Table 3.1) to facilitate transcriptional data analysis. To date, they have been able to provide detailed views of changing transcriptional landscapes in response to different environmental conditions on a functional level, and they have been highly beneficial for the identification and prediction of virulence factors in fungi.

Table 3.1. OMICS resources

C. Proteomics

The term “proteome” was coined in 1996 to describe the complete set of proteins that is synthesized by a cell (Wilkins et al. 1996). The proteome provides the highest level of functional information about a cell, revealing the end products of transcription and downstream posttranscriptional processing. The use of proteomics data sets is also becoming a popular approach for studying proteins involved in virulence.

The major areas in proteomics research include identification of proteins and their posttranslational modifications as well as protein–protein interactions.

These areas are investigated using two main methods: traditional two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) and, increasingly, mass spectrometry (MS). In 2D-PAGE, protein samples are resolved by two intrinsic properties: first in one dimension, and then in the second dimension at a 90° rotation (O’Farrell 1975). These properties can include the isoelectric point, protein complex mass in the native state, and protein mass; the properties chosen will depend on the specific experiment. Proteins are then visualized by staining of gels, often using silver, Coomassie Blue, or Ponceau S staining techniques. Once visible, spots can be picked out by hand or, more often, using automated detection software based on their location on the gel. The identified spots are then excised, proteolytically digested, and subjected to MS analysis. Briefly, MS measures the mass-to-charge ratio of charged particles such as peptides, and this information can then be used to identify the composition of the peptide and the gene it is derived from. Experimentally, MS samples are first vaporized and then ionized, for example using an electron beam. The produced ions are sorted by the mass analyzer according to their mass-to-charge ratios and recorded by the detector, producing mass spectra that reflect the identity and quantity of the ions present. Variations of MS, including liquid chromatography tandem MS (LC-MS/MS) (Yates et al. 1999) and gel-free proteomics techniques (Stastna and Van Eyk 2012), are also widely used approaches. These facilitate the analysis of proteins that are not easily separated in 2D gels because of their high hydrophobicity or high molecular weight, as in the case of many integral membrane proteins (Aebersold and Mann 2003; de Godoy et al. 2008). MS/MS involves additional rounds of fragmentation and mass analysis; however, the reproducibility between technical replicates of a sample remains in the range of 35–60% overlap (Tabb et al. 2010). Unfortunately, absolute protein quantification remains out of reach at the moment (Peng et al. 2012). Major hurdles remain to improve the reproducibility and standardization of MS-based methods (Kniemeyer et al. 2011). Nonetheless, since 1996, the percentage of protein-coding genes in S. cerevisiae for which some biological function has been identified has increased to over 80%, greater than for any other sequenced eukaryotic genome (Botstein and Fink 2011). Proteomics studies have been highly beneficial in achieving this.

Proteomics approaches have led to the identification of a number of fungal virulence factors. Using an in vitro approach, the proteome of C. albicans yeast-form cells in the exponential or stationary growth phase was investigated in response to nutrient limitation using 2D-PAGE. The authors aimed to identify metabolic response patterns in these two cell types that might confer a tolerance phenotype (Kusch et al. 2008) similar to that observed in S. cerevisiae in response to stress (Herman 2002). They observed that stationary-phase cells upregulated a number of proteins, including those involved in the defense against reactive oxygen species and heat stress, as compared to exponentially growing cells. The ability to undergo morphological transitions between yeast and hyphal cells is an important virulence trait of many, but not all, Candida spp. This is especially important because the cell wall is constantly exposed to the host cell surface and is thus subject to immune recognition.

For example, a number of proteins are expressed in the yeast or the hyphal stage only, suggesting a potential mechanism for secretion of cytosolic proteins, which may contribute to the overall virulence of these different morphological states (Ebanks et al. 2006). These data further support the idea that the regulation of Hsp90, an essential chaperone protein that is activated in response to stress, is posttranscriptional in hyphal cells.

Additional variations of MS, such as matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) MS, have been shown to be useful tools in drug susceptibility screening of C. albicans against fluconazole (Marinach et al. 2009). Most recently, a complete map of the S. cerevisiae proteome was generated using a combination of high-throughput peptide synthesis and MS. This study provides insight into the evolution of yeast proteins and protein complexes (Picotti et al. 2013).

In A. fumigatus, conidia mediate the initial contact with the immune system of the host and are therefore an interesting target for proteomics studies looking for fungal virulence factors and posttranscriptional responses upon host recognition. For example, comparison of the proteome profiles of A. fumigatus conidia and mycelial cells revealed some 50 conidia-specific proteins (Teutschbein et al. 2010). Interestingly, the data suggested that many proteins that are not needed during the resting stage are nevertheless stored, perhaps for a rapid response upon activation of metabolic processes or recognition by the immune system.

In vitro co-culture approaches with immune cells have also been influential in investigating differential protein expression in response to infection conditions. Using a time course of interaction between C. albicans and macrophages, a combination of proteomics and transcriptomics techniques highlighted specific pathways related to the virulence of Candida spp., including the regulation of apoptosis (Fernandez-Arenas et al. 2004, 2007). The authors used a C. albicans strain of attenuated virulence, in which the kinase HOG1, important for the oxidative stress response, was absent. They were able to identify several novel C. albicans antigens and further characterized the protective antibody response of mice against C. albicans infection. The use of proteomics methods, in general, has been useful in validating transcriptional data sets. However, such comparisons have also revealed a number of discrepancies between the transcriptome and the proteome, which remains an active area of research in the validation of fungal virulence factors. Finally, many resources have recently become available, including the Proteopathogen Database for studying host–pathogen interactions with C. albicans (http://proteopathogen.dacya.ucm.es) and Compluyeast (http://compluyeast2dpage.dacya.ucm.es/cgi-bin/2d/2d.cgi), which catalogues 2D-PAGE data sets from C. albicans, Mus musculus, and S. cerevisiae for comparative proteomics.

E. Metabolomics

Metabolites are the products of metabolism or reaction intermediates and are usually small molecules serving a number of functions within the cell, including signaling and the inhibition or stimulation of enzymes, among others. As reaction intermediates, metabolites provide the “missing link” between DNA, RNA, and protein interactions within a cell. One of the major themes of metabolomics is to investigate the influence of metabolites on cellular phenotypes. The metabolome is composed of intracellular metabolites and the exo-metabolome, also referred to as the secretome, which contains all small molecules secreted from a cell. It has been estimated that over 70% of metabolites participate in more than two biological reactions and therefore represent interesting molecules for SysBio approaches (Nielsen 2003). Furthermore, from an evolutionary perspective, a number of the filamentous fungi are expected to share much of their primary metabolism with the yeast S. cerevisiae, suggesting a broad applicability of metabolomics research across the fungal research community. In fungal cells, more than 1,000 metabolites are estimated to be present at steady state (Smedsgaard and Nielsen 2005), some of which are extremely short-lived or of low abundance, making their quantification a formidable challenge.

A number of methods are in use to identify metabolite profiles in cells. The most common are nuclear magnetic resonance (NMR) spectroscopy, MS (see Sect. II.C), and metabolic labeling with radioactive isotopes (Niittylae et al. 2009; Zamboni and Sauer 2009). Another method for metabolomics investigations is gas chromatography coupled to mass spectrometry (GC-MS). GC is used in analytical chemistry to separate and identify molecules based on their migration within a capillary system. The sample is vaporized and travels through the capillary in an inert carrier gas. The time it takes for each molecule to elute from the column varies according to its molecular properties and can therefore be used to identify compounds. Combining this elution information with MS gives a highly detailed description of the molecule. High-performance liquid chromatography (HPLC) is also often used in combination with MS (HPLC-MS). HPLC is a chromatographic purification technique using a high-pressure capillary tube system, allowing for the fine separation of molecules. These methods, among others, provide a comprehensive way to identify the structure of metabolites on a genome-wide scale.

The identification and function of metabolites is highly relevant for a better understanding of fungal virulence. Fungi, more so than other pathogenic species, are known for the diversity of metabolites they produce in response to host immune defense, and they are thus useful organisms for studying metabolic diversity (Jewett et al. 2006). Notably, about a dozen A. fumigatus secondary metabolites have been implicated in niche adaptation and virulence (Galagan et al. 2005). To date, significant progress has mainly been made in the metabolic profiling of fungi such as S. cerevisiae. The first metabolic network reconstruction of S. cerevisiae used an extensive data-mining approach of the previous literature in combination with mathematical techniques to identify approximately 600 metabolites (Forster et al. 2003). Shortly thereafter, GC-MS methods verified the presence of approximately 100 of these metabolites under standard laboratory growth conditions (Villas-Boas et al. 2005). Analysis of metabolic flux in over 30 S. cerevisiae mutants demonstrated the robustness and inherent redundancies built into yeast metabolism (Blank et al. 2005). In C. albicans, LC-tandem MS was used to profile the regulation of the secretome under standard laboratory conditions (Sorgo et al. 2010) and in response to the antifungal agent fluconazole (Sorgo et al. 2011), identifying numerous immunogenic peptides as novel vaccine candidates for antifungal therapy. Recently, the metabolome of A. fumigatus was investigated using 1H-NMR metabolomics under infection conditions (Grahl et al. 2011). Using this technique, the authors detected ethanol in the lungs in a murine model of invasive pulmonary aspergillosis, suggesting a role for fungal alcohol dehydrogenase in pathogenesis (Grahl et al. 2011). 1H-NMR metabolomics has also enabled the identification of pneumococcal or cryptococcal meningitis without prior sample culture, which, if implemented in a clinical setting, would shorten the time it takes for patients to be diagnosed (Himmelreich et al. 2009).

F. Epigenomics

Among the biologically relevant –omics approaches, the most recent addition, epigenomics, has entered center stage. The epigenome describes the global epigenetic modifications that take place within a cell. Epigenetic modifications take place on the DNA, histones, and chromatin in its various functional states. They include numerous chemical modifications, such as, but not limited to, the addition of single or multiple methyl residues, ubiquitination, acetylation, phosphorylation, or adenylation, to name the most common [for review see (Hnisz et al. 2011)]. Most importantly, many modifications are reversible, providing an additional, and even heritable, level of cellular regulation.

The most common methods for investigation of the epigenetic landscape are studies on the variation of chromatin states using chromatin immunoprecipitation (ChIP).

Combinations of ChIP with microarray technology, known as ChIP-Chip or ChIP-on-Chip, and a similar combination of ChIP with NGS technology, termed ChIP-seq, have recently been introduced. ChIP identifies transient in vivo protein–DNA complexes by crosslinking DNA and associated proteins within the cell. The DNA is then fragmented either by sonication or by nuclease digestion. The protein of interest is selected using an antibody and precipitated; the associated DNA is then purified and either sequenced or hybridized to a microarray, depending on the technology used.

ChIP-Chip has been used to investigate genome-wide changes in patterns of histone methylation in the fission yeast S. pombe. A complex composed of two proteins, Swm1 and Swm2, mediates demethylation of lysine 9 in histone H3 (H3K9) (Opel et al. 2007). Epigenetic regulation via this complex, in concert with additional histone deacetylases and chromatin remodelers, is a major factor in the transcriptional regulation of S. pombe (Opel et al. 2007). In C. albicans, Nobile and colleagues identified the transcriptional network for controlling biofilm formation using a combination of ChIP-Chip and in vivo animal models. The six identified core transcriptional regulators, regulating over 1,000 target genes, provide insight into biofilm formation during host infection (Nobile et al. 2012). In yeast, ChIP-Chip was used to investigate histone and gene deletion mutants during environmental stress, highlighting the importance of epigenetic regulation in this process (Weiner et al. 2012).

In C. neoformans, the size of the capsule increases under infection conditions and is a well-established virulence factor of the species. The direct targets of Ada2 in C. neoformans were recently investigated using ChIP-seq (Haynes et al. 2011). Ada2 is a member of the Spt-Ada-Gcn5 acetyltransferase (SAGA) complex, which regulates transcription by histone acetylation. The authors identified a relationship between the function of Ada2 and capsule size, linking this epigenetic modification and its targets to the overall virulence of the species (Haynes et al. 2011). Most recently, in C. albicans, a role of chromatin-modifying enzymes in the inhibition of the yeast-to-hyphal transition was discovered using a combined approach of ChIP-seq and RNA-seq. The authors identified a role for the histone deacetylase Set3/Hos2 complex (Set3C) as a transcriptional cofactor of metabolic and morphogenesis-related gene expression. They found that the acetylation status of C. albicans chromatin influences transcription kinetics at target genes, showing that epigenetic regulation supersedes a core transcription factor circuit involved in morphogenesis, a circuit that might be shared among other fungal pathogens (Hnisz et al. 2012).

G. Data Mining Approaches and Genome-Wide Fungal Resources

1. Databases

High-throughput molecular biology techniques have enormously increased the sheer volume of data generated, and the need for proper data storage has never been higher (Kersey and Apweiler 2006). The value of this biological information depends on the ability of researchers to access and extract it in a quick and reliable format, which in turn requires high-level curation.

Databases classify, organize, and systematize information. The maintenance of databases is essential for disseminating biological data to the community. The early development of two excellent databases for S. cerevisiae, the Saccharomyces Genome Database (www.yeastgenome.org/) and the Yeast Proteome Database (http://www.proteome.com/YPDhome.html), led to the rapid adoption of S. cerevisiae as a functional genomics tool and model organism (Botstein and Fink 2011). Despite their importance, maintaining high-quality, reliable databases is a constant struggle (Baker 2012), partly because of the inherently high costs required for the curation of ever-changing biological data sets. Systems biologists make extensive use of databases to retrieve prior knowledge, both qualitative and quantitative, on the biological question at hand, increasing the strength, predictability, and identifiability of their models.

Major database initiatives include PubMed (http://www.ncbi.nlm.nih.gov/pubmed/), the Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/), and Gene Ontology (GO) (http://www.geneontology.org), all of which have established themselves as staple websites for researchers. Specialized databases are also becoming increasingly popular, such as the database of virulence factors in fungal pathogens (http://sysbio.unl.edu/DFVF/); they enable the inclusion of more in-depth information about a specific topic that may not be sufficiently covered in larger databases.

An additional benefit of including scientific data in databases is the ability to standardize the reporting format, facilitating both the integration of distinct data sets from different laboratories and the development of analysis tools. Standardization has been greatly aided by the push for minimum reporting guidelines for biological and biomedical information (Taylor et al. 2008) (http://mibbi.sourceforge.net/). Reporting guidelines now exist for all major –omics methodologies (Table 3.1), and there has been a general push from the scientific community to adhere to and popularize these standards for biological information.

2. Strain Collections

Genome-wide profiling at the RNA and protein level has been greatly aided by publically available strain libraries in the form of loss-of-function (deletions) and gain-of-function (overexpression) collections. They have provided an efficient screening tool for scientists worldwide to investigate transcriptional and posttranscriptional changes in response to external stimuli, such as drug treatment and environmental variation, or exposure to host immune surveillance. Specifically for fungi, they have increased the throughput of virulence factor screening.

Of all fungi, S. cerevisiae has contributed the largest number of strain collections. The first was the yeast knockout (YKO) strain collection, in which open reading frames (ORFs) were systematically deleted by substituting the gene of interest with a selectable drug-resistance cassette, allowing for the systematic screening of the effects of gene loss (Winzeler et al. 1999; Giaever et al. 2002). More than 20,000 strains are currently available from the Saccharomyces Genome Deletion Project (http://www-sequence.stanford.edu/group/yeast_deletion_project/), including homozygous and heterozygous diploid deletions, MATα and MATa haploids, green fluorescent protein (GFP)-tagged strains (Huh et al. 2003), and even temperature-sensitive collections of essential genes (Li et al. 2004, 2011; Yan et al. 2008).

To investigate the posttranscriptional regulation of genes and the number of protein complexes in S. cerevisiae by proteomics, a Tandem Affinity Purification (TAP) collection was created in which each ORF is tagged with a high-affinity epitope and the tagged protein is expressed from its native locus (Ghaemmaghami et al. 2003). To investigate protein–protein interactions in S. cerevisiae, a yeast two-hybrid collection was created, in which each ORF was fused to a Gal4 transcription-activation domain vector, yielding hybrid proteins from over 6,000 transformations (Uetz et al. 2000). Data sets generated using this collection have also been made publicly available (http://portal.curagen.com/). Several overexpression libraries for S. cerevisiae are also available, including a yeast GAL-GST library of over 5,000 strains containing tagged ORFs inducibly overexpressed from the GAL1 promoter, covering over 80% of the genome (Sopko et al. 2006). Additionally, a transformable plasmid-based overexpression library of S. cerevisiae has been created, including over 13,000 entries with over 95% functional coverage of the genome (Jones et al. 2008). Approximately three-quarters of the S. cerevisiae proteome is also covered by strains carrying chromosomally C-terminal-tagged GFP fusion proteins (Huh et al. 2003). Using this library of 4,159 yeast–GFP clones, the localization of proteins in response to external stimuli can easily be visualized by live-cell fluorescence microscopy.

After the early success of strain collections in S. cerevisiae, more focused collections of pathogenic fungi have been created in order to specifically address fungal virulence. In C. albicans, a library of approximately 670 homozygous deletion strains, affecting 11% of the C. albicans genome, was used to screen for virulence in a mouse model of infection, identifying 115 infectivity-attenuated mutants (Noble et al. 2010). A knockout collection of C. albicans transcriptional regulators includes over 100 strains, which were screened under 55 different growth conditions (Homann et al. 2009). Among the phenotypes identified, a number showed altered susceptibility towards antifungal treatment. These results also support the theory that there is high redundancy within transcriptional regulatory circuitry, such that a single knockout does not greatly affect the strain’s overall virulence. In C. neoformans, a knockout collection of 1,201 genes was screened in an in vitro model of murine lung tissue for virulence phenotypes (Liu et al. 2008). Using these collections, a number of previously uncharacterized genes were identified as virulence factors, including genes involved in growth at body temperature and in melanization, both dependent on and independent of capsule formation.

Smaller arrayed mutant collections of Neurospora crassa and A. fumigatus, among other pathogenic fungi, are available from the Fungal Genetics Stock Center (http://www.fgsc.net/) for screening. Finally, a single gene deletion collection comprising around 650 haploid C. glabrata genes will become available to the community shortly (Schwarzmüller, unpublished data).

III. Modeling Biological Phenomena

Most, if not all, biological processes follow a dynamic, nonlinear pattern. Nonetheless, a biological process can be approximated via a set of mathematical expressions to form a mathematical model. In this context, a “model” refers to a description of a biological process using mathematical expressions of quantitative data rather than a graphical representation. Although SysBio studies do not exclusively rely on either high- or low-throughput data sets, and lean towards a combination of both when possible, mathematical models have become increasingly useful ways of representing the information. These studies integrate computational approaches with experimental data to gain a more complete picture of how cells, tissues, and organs function and how the entire genetic information is wired and connected. Although current experimental techniques allow detailed measurements, it is impossible to gain full information about the system from discrete data sets without considering the topology and dynamics of the interacting components. Depending on what is known a priori about the specific system under investigation, and on what one wants or needs to learn about it, different modeling approaches are required. Identifying the proper modeling approach is a critical point, as not all methods will be appropriate for each experimental question. The pros and cons of each approach should be weighed for each biological question (Di Ventura et al. 2006; Karlebach and Shamir 2008). This requires close interaction between mathematicians and experimentalists, because very often it is impossible to generate all the experimental data required for a particular model. Visual examples and a summary of the pros and cons of Boolean models, ordinary differential equations (ODEs), and Petri Nets (PNs) are provided in Fig. 3.2.

Fig. 3.2.

Simulation of a negative feedback loop with different modeling techniques. (a) Boolean models incorporate only two values: 0 for the inactive state and 1 for the active state of a variable. They represent only an interaction network; no kinetic information can be included. (b) Petri Nets are an extension of the Boolean approach that includes the stoichiometric information of the considered reaction network. A reaction can take place only if the right amount of each input is present. (c) ODE model. Parameters used: \( x_1^0=0.1 \), \( x_2^0=0.1 \), \( x_3^0=0.1 \), \( {S^{\mathrm{off}}}=1 \), \( {S^{\mathrm{on}}}=50 \), \( {k_{11 }}=1 \), \( {k_{12 }}=99 \), \( {K_I}=0.001 \), \( {k_{21 }}={k_{22 }}=50 \), \( {k_{31 }}={k_{32 }}=50 \). Both the Boolean modeling approach and PNs indicate oscillatory behavior of the system, whereas the ODE model, which incorporates a higher level of detail, suggests that the system adapts to the external stimulus by reaching a new steady state
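
The caption lists parameter values but not the underlying equations. Below is a minimal sketch of one plausible three-variable negative feedback cascade using these parameter names (Python with SciPy; the specific rate laws, e.g., inhibition of \( x_1 \) production by \( x_3 \) through \( K_I \), are our illustrative assumption, not the exact system behind Fig. 3.2):

```python
from scipy.integrate import solve_ivp

# Parameter values as listed in the Fig. 3.2 caption.
S_on, S_off = 50.0, 1.0
k11, k12, K_I = 1.0, 99.0, 0.001
k21 = k22 = k31 = k32 = 50.0

def feedback(t, x, S):
    x1, x2, x3 = x
    # Assumed structure: x3 feeds back on the production of x1 via K_I;
    # x1 activates x2, which activates x3 (a simple linear cascade).
    dx1 = k11 * S * K_I / (K_I + x3) - k12 * x1
    dx2 = k21 * x1 - k22 * x2
    dx3 = k31 * x2 - k32 * x3
    return [dx1, dx2, dx3]

# Switch the stimulus on and integrate from the initial state (0.1, 0.1, 0.1).
sol = solve_ivp(feedback, (0.0, 2.0), [0.1, 0.1, 0.1], args=(S_on,))
print(sol.y[:, -1])  # the system settles to a new steady state, as in panel (c)
```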

In short, mathematical models represent a simplified and abstract view of the studied phenomena. They are extremely useful in understanding dynamic and multifaceted systems and their perturbations. These models often encompass different levels of biological understanding. Although it is important to know the consecutive action of the individual components of the system, an understanding of the time scale under which these interactions take place is essential for the whole picture.

Processes such as the ability of the cell to respond and adapt to a stimulus can be modeled using a set of ODEs. These processes are triggered by molecular interactions, which are not spontaneous; an environmental signal elicits a cellular response. A cellular response involves the consecutive activation of a series of proteins that together establish a signaling pathway. These proteins pass the signal to the transcriptional machinery via a transcription factor (TF) that sometimes shuttles between cytoplasm and nucleus. TFs regulate gene expression and define the amplitude, magnitude, and duration of the cellular response. Such interactions can be graphically represented by “networks” that connect the interacting molecules. For example, gene regulatory networks (GRNs) are types of pathways that consist only of genes; in the network, gene A is connected with gene B if the product of gene A regulates the activity of gene B. A Boolean approach is often applied to study the topology of GRNs. Thus, based on the specific experimental question, different modeling approaches will be applicable. Here, we discuss in detail several computational approaches commonly applied to model experimental data sets, emphasizing how key attributes of fungal virulence have so far been investigated using modeling approaches.

A. Boolean Models

Boolean models are often used to infer GRNs from microarray data or other types of expression analysis (Hickman and Hodgman 2009). Boolean models were first introduced by Kauffman and colleagues (Glass and Kauffman 1973). In contrast to ODE models, a Boolean approach is a discrete type of modeling in which time and states are represented by discrete values. Boolean models are suitable for studying biological problems that can be interpreted using a rather simple on/off behavior, such as gene transcription. For example, an algorithm called REVEAL (Liang et al. 1998) infers network topology from expression data: a mutual-information measure of element interactions is used to derive the logic functions that define them. Boolean models are useful for studying the existence of steady states or whether a given network topology is robust (Li et al. 2004). Furthermore, the approach can be used when the precise network topology is uncertain and the primary goal is to understand the wiring of the interactions in the system; in contrast, the topology of the system of interest must be known before kinetic models can be implemented. As an example, the Boolean approach was used to infer a Drosophila segment polarity GRN (Ay and Arnosti 2011).

A Boolean model is defined by n entities interconnected via k edges, forming a directed graph. Each model entity is in one of two states, “on” (1) or “off” (0). Using the example of gene transcription, each gene can be considered as either expressed, 1, or not, 0. In the synchronous Boolean model, the states of all model entities are simultaneously evaluated and updated at time t + 1, according to the regulatory functions and the variable states at time t. Such Boolean models are purely deterministic. Regulatory relations are described via logic functions built from the Boolean operators “and,” “or,” and “not.” If vi is a vector representing one state of the model, the state space is the set of all possible vectors vi. Thus, the state space has \( {2^n} \) elements for n entities in the network, where the vi are vectors of 0s and 1s. The elements of the state space are connected via arrows indicating the flow of model states.
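To make the update scheme concrete, the following minimal Python sketch simulates a synchronous Boolean model of a three-node negative feedback loop in the spirit of Fig. 3.2a; the regulatory logic (x1 is active unless repressed by x3, x1 activates x2, x2 activates x3) is an illustrative assumption rather than a published network.

```python
# Minimal synchronous Boolean model of a three-node negative feedback
# loop: x1 -> x2 -> x3 -| x1. States are (x1, x2, x3), each 0 or 1.

def update(state):
    x1, x2, x3 = state
    # All nodes are evaluated on the old state and updated simultaneously.
    return (int(not x3),  # x1 is on unless repressed by x3
            x1,           # x2 follows x1
            x2)           # x3 follows x2

state = (1, 0, 0)
for t in range(8):
    print(t, state)
    state = update(state)
```

Starting from (1, 0, 0), the trajectory revisits its initial state after six steps, i.e., the system sits on a cyclic attractor, reproducing the oscillatory behavior indicated in Fig. 3.2a.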

In the asynchronous Boolean model, one node at a time is chosen and updated, and the evaluation of the next selected node state takes this change into account. If the order in which the nodes are chosen is fixed, the model is called a deterministic asynchronous Boolean model; if the nodes are chosen at random, the model is termed a stochastic asynchronous Boolean model. The state space typically contains single point attractors, which are fixed points (also called steady states) towards which the system evolves in both synchronous and asynchronous Boolean models. However, the time needed to reach a fixed point can vary between the synchronous and asynchronous variants. Cyclic attractors, e.g., limit cycles, can be lost in the stochastic Boolean model (Wang et al. 2012). In either case, the identification of point and cyclic attractors of large-scale Boolean models is not a trivial task, but there are algorithms that deal with this problem (Wang et al. 2012). Because stochastic processes influence any biological process, Boolean networks have been further developed to account for noise in the system and to make the approach suitable for the study of stochasticity and uncertainty. These developments include probabilistic Boolean networks (Shmulevich et al. 2002) and Boolean models in which stochasticity is implemented either by flipping a node’s state with some probability or by allowing a biological function to fail to execute (stochasticity-in-function, SIF, models) (Garg et al. 2009).

Although a Boolean approach allows only a very simplified representation of a biological system, it can be a powerful method for studying the system’s underlying nature. For instance, a Boolean model can be used for the systematic screening of possible networks that reproduce a pattern of interest (Giacomantonio and Goodhill 2010). Recently, Boolean modeling was used to study the interplay between gene expression, chromatin modifications, and DNA methylation, where the authors linked the epigenetic landscape with the probability state space (Flöttmann et al. 2012). This innovative application shows that there are further ways of analyzing Boolean models that remain to be explored.

B. Petri Nets

There is a growing interest in applying Petri Nets (PNs) to the modeling and analysis of biological networks. In principle, a PN is a directed bipartite graph whose nodes are called either “places” or “transitions”. Places represent resources and are drawn as circles; transitions represent events or biochemical reactions and are shown as boxes. The two types of nodes are connected by arrows. An arrow (called an “arc”) from a place (input place) to a transition indicates that a compound is necessary for the reaction. Further arcs point from the reaction to its products (see Fig. 3.3).

Fig. 3.3.
figure 00033

Petri Nets. An example PN of the reaction \( {S_1}+2{S_2}\xrightarrow{\mathrm{enzyme}}P \) is shown. Marking of the PN is presented (a) before the reaction takes place and (b) after the transition has fired

Each place holds a nonnegative number of tokens, which indicate the amount of a given resource in the system. The state of a system is represented by the allocation of tokens at a given time point, which is called a marking (M); the initial marking M0 is the allocation at time point zero. A transition fires only if each of its input places holds at least as many tokens as the weight of the corresponding arc. Once a transition fires, tokens are transferred into the respective output places, and the number of tokens added to each output place is again given by the weight of the arc. In summary, a PN is formally a tuple \( \mathrm{PN} = (\mathrm{P},\mathrm{T},\mathrm{F},\mathrm{W},{{\mathrm{M}}_0}) \), where P is a set of places, T is a set of transitions, F is a set of arcs, W is a map that assigns each arc a specific weight, and M0 is the initial marking (Ackermann 2011). For the analysis of PNs, concepts like P-invariants and T-invariants are introduced. For instance, for the stoichiometric matrix N of a PN, a P-invariant is any vector x where \( {{\mathrm {x}}^T}\mathrm{N}=0 \), and it holds that \( \left\langle {\mathrm{x},\mathrm{M}} \right\rangle =const. \) for any marking M that appears during the simulation of the PN. The scalar product \( \left\langle {\mathrm{x},\mathrm{M}} \right\rangle =const. \) represents a conservation relation, e.g., \( \mathrm{ATP} + \mathrm{ADP} = const \). A T-invariant, in turn, is any vector x where \( \mathrm{Nx}=0 \). These invariants provide a decomposition of the network, and a transition that does not belong to any T-invariant can be removed from the PN. The set of T-invariants corresponds to the elementary flux modes operating in the system. The marking graph of a PN is a directed graph that represents the evolution of markings during PN simulation; this concept is analogous to the state space of Boolean models.
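As a toy illustration of the firing rule, the following Python sketch encodes the PN of Fig. 3.3 (the reaction S1 + 2 S2 → P, catalyzed by an enzyme); the place names and arc weights follow the figure, while the code itself is a minimal sketch rather than a full PN implementation.

```python
# Minimal Petri Net: transition t1 models S1 + 2 S2 --enzyme--> P.
marking = {"S1": 1, "S2": 2, "enzyme": 1, "P": 0}   # initial marking M0
inputs  = {"S1": 1, "S2": 2, "enzyme": 1}           # arc weights place -> t1
outputs = {"P": 1, "enzyme": 1}                     # arc weights t1 -> place

def enabled(m):
    # t1 may fire only if every input place holds >= the arc weight.
    return all(m[p] >= w for p, w in inputs.items())

def fire(m):
    assert enabled(m), "transition not enabled"
    for p, w in inputs.items():
        m[p] -= w           # consume tokens from input places
    for p, w in outputs.items():
        m[p] += w           # produce tokens on output places (enzyme returns)

print("before firing:", marking)
fire(marking)
print("after firing: ", marking)
```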

PN approaches are applicable to discrete modeling, such as the modeling of GRNs. For GRNs, as with Boolean models, one can consider both synchronous and asynchronous modes. PNs can also be applied to perform quantitative analysis and establish stochastic models. Fuzzy modeling based on PNs has also been introduced, and details on the application of these techniques, together with examples, have been published (Ackermann 2011).

A stochastic version of a Petri Net (SPN) has been applied to the study of the cell cycle in budding yeast (Mura and Csikasz-Nagy 2008). The authors provide a systematic comparison of the SPN results with those of the deterministic version of the corresponding ODE model proposed earlier (Novak 2002). Recently, a PN-based technique was applied to integrate the signaling, metabolic, and regulatory events participating in the S. cerevisiae HOG signaling pathway (Tomar et al. 2013).

C. Ordinary Differential Equation Models

The dynamics of biological processes are most often described via ODE models, or via partial differential equations (PDEs) when space is included. Optimally, ODE models are used for systems that can be considered “well-stirred” and that comprise large molecule numbers; when this condition is met, changes in molecule numbers can be considered continuous (Di Ventura et al. 2006). ODE models can also be used to address questions regarding changes in cellular phenotypes (Le Novere et al. 2005; Karlebach and Shamir 2008). Each equation in an ODE model describes the temporal changes of one variable, such as a molecule concentration, a phosphorylation level, cell mass, or volume. If we consider a system of n variables characterized by k positively valued parameters p, then for each variable the equation reads:

$$ \frac{\mathrm{d}{x_i}}{\mathrm{d}t} = {f_i}({x_1},\ldots,{x_n},\ {p_1},\ldots,{p_k},\ t) $$
(3.1)

where t represents time. The equation describes the dynamical changes of the variable \( {x_i} \). A mathematical expression describing increased production of a compound or its activation enters Eq. 3.1 with a positive sign; expressions for degradation or deactivation enter with a negative sign. Taken together, the system comprises m reactions, determined by the topology of our network of n state variables, which may or may not influence one another. Thus, we can also represent Eq. 3.1 in matrix form:

$$ \frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t}=\mathrm{N}\mathbf{v} $$
(3.2)

where N is an n × m stoichiometric matrix and \( \mathrm{v} = {{({{\it v}_1},\ldots,{{\it v}_{\it m}})}^{\mathrm{T}}} \) is a vector that stores the rates of all the reactions taking place in our system. For practical examples, refer to detailed earlier descriptions (Klipp et al. 2009). The solution of an ODE model is a time course simulation: given the initial state of a deterministic model, all future states can be computed. Simulations of a model that reproduces experimental data are used for understanding the time course dynamics of the interacting molecules, generating testable predictions, and designing new experiments. Understanding the system through such simulations has proven useful for the study of fungal virulence (Chen et al. 2004; Klipp et al. 2005; Leach et al. 2012). Often, only the analysis of model simulations can reveal why the cell integrates certain molecular circuits, e.g., negative or positive feedback loops.
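As an illustration of Eqs. 3.1 and 3.2, the following Python sketch simulates a three-variable negative feedback loop in the spirit of Fig. 3.2c, using the parameter values given in the figure legend; the precise rate laws (including the form of the inhibition term and the degradation constant of x1) are assumptions made for illustration, not the published model.

```python
# Three-variable negative feedback loop: x3 inhibits production of x1.
import numpy as np
from scipy.integrate import odeint

k11, k12, KI, k2, k3 = 1.0, 99.0, 0.001, 50.0, 50.0  # values from Fig. 3.2

def model(x, t, S):
    x1, x2, x3 = x
    dx1 = k11 + k12 * S * KI / (KI + x3) - x1   # production inhibited by x3
    dx2 = k2 * (x1 - x2)                        # x2 follows x1
    dx3 = k3 * (x2 - x3)                        # x3 follows x2
    return [dx1, dx2, dx3]

t = np.linspace(0.0, 20.0, 2000)
off = odeint(model, [0.1, 0.1, 0.1], t, args=(1.0,))   # stimulus off, S = 1
on = odeint(model, off[-1], t, args=(50.0,))           # stimulus on, S = 50
print("steady state, S off:", off[-1])
print("steady state, S on: ", on[-1])
```

Switching the stimulus on moves the system to a new steady state rather than sustained oscillations, consistent with the adaptive behavior of the ODE model in Fig. 3.2c.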

Another example is the molecular autoregulatory loop by which C. albicans adapts to heat stress (Leach et al. 2012). In this case, the authors developed a dynamic model of the heat stress response in C. albicans using a set of ODEs supported by experimental data. The model reveals several features of the system, such as a memory for acquired thermotolerance: when pretreated with a mild heat shock, the system becomes more resistant to a subsequent severe shock. Moreover, simulations of the model indicate a transient molecular memory in the system that is mediated through phosphorylation of the heat shock transcription factor Hsf1.

D. Flux-Balance Analysis

Flux-balance analysis (FBA) is a mathematical framework that is widely used for the analysis of the flow of metabolites throughout a metabolic network (Orth et al. 2010). This structural modeling approach solely requires knowledge of the stoichiometric matrix N of the biological network. This is generally a well-known property for metabolic networks and, hence, opens the way for genome-scale studies (Edwards and Palsson 2000; Price et al. 2003; Yus et al. 2009). The FBA approach aims to identify the optimal distribution of fluxes in the steady state, i.e., fluxes v that satisfy the following equation:

$$ \frac{{\mathrm{d}{\it x}}}{{\mathrm{d}{\it t}}}=\mathrm{N}{\it v}\mathop{=}\limits^{!}0 $$
(3.3)

FBA aims to find fluxes at which a given objective function reaches extreme values, for example, fluxes leading to the maximal growth rate (Feist and Palsson 2010) or to minimal production of toxic metabolites. The problem takes the form of a linear program. Here, constraints are included on the basis of experimental results at steady state (for instance, thermodynamics, biomass produced, or energy availability); these data are required to reduce the degrees of freedom in the solution space.
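A toy flux-balance problem can be written down directly as a linear program. The sketch below maximizes flux through a hypothetical “biomass” reaction of a three-reaction network subject to Nv = 0 (Eq. 3.3) and an uptake bound; the network, bounds, and objective are invented for illustration, and genome-scale models are better handled by dedicated tools such as the COBRA Toolbox.

```python
# Toy FBA: metabolites A, B; reactions v1 (uptake -> A), v2 (A -> B),
# v3 (B -> biomass). Maximize v3 subject to N v = 0 and 0 <= v1 <= 10.
import numpy as np
from scipy.optimize import linprog

N = np.array([[1.0, -1.0,  0.0],    # mass balance of A
              [0.0,  1.0, -1.0]])   # mass balance of B
c = [0.0, 0.0, -1.0]                # linprog minimizes, so negate v3
bounds = [(0, 10), (0, None), (0, None)]

res = linprog(c, A_eq=N, b_eq=np.zeros(2), bounds=bounds)
print("optimal flux distribution:", res.x)   # all fluxes hit the uptake limit
```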

The major advantage of FBA is that there is no need to know reaction kinetics or metabolite concentrations, because FBA addresses steady-state conditions. However, the approach inherently lacks quantitative information (e.g., enzyme concentrations). Because of this restriction, FBA cannot be used to predict, for example, how the level of a certain enzyme must be changed in order to achieve a desired effect on flux. It is also important to keep in mind that FBA does not help us to understand the dynamics of the system, because it can only reveal steady-state properties.

To enable the re-use of FBA models, they should be provided in the Systems Biology Markup Language (SBML) format, as is the case for ODE models. FBA models can then be imported and solved using tools such as MATLAB with the COBRA Toolbox (http://systemsbiology.ucsd.edu/Downloads/Cobra_Toolbox) (Becker et al. 2007).

An extension of FBA, regulated flux-balance analysis (rFBA), has been extensively reviewed (Karlebach and Shamir 2008) and aims at integrating the metabolic network with regulatory processes expressed in Boolean logic (Covert et al. 2001). Regulatory events can reflect situations where, for instance, certain regulatory proteins are not expressed, in which case the corresponding fluxes are shut down and thus set to zero.

E. Stochastic Modeling

ODE models, which are deterministic, are suitable for analyzing the average dynamical behavior of a population. They do not, however, provide information about whether and how stochastic switches or noise impact the outcome of a biological process. For instance, the choice between the lysogenic and lytic cycles of λ phage (Arkin et al. 1998) could not be explained by deterministic modeling. Stochastic models describe random processes that evolve and change over time. It is convenient to use stochastic modeling when one wants to investigate processes in which a molecule with a small copy number affects key components of the model, or when steady states are unstable. In general, simulating stochastic models is computationally more expensive than simulating deterministic ones. Moreover, to enable statistically significant conclusions, many simulation runs have to be analyzed together. Stochastic simulations can be performed using tools such as COPASI (Hoops et al. 2006) or Cain (http://cain.sourceforge.net/). For a thorough introduction to stochastic modeling, we refer the reader elsewhere (Klipp et al. 2009). The ideal type of data for stochastic modeling are time-resolved measurements of single molecules, e.g., by microscopic measurements; in practice, however, experiments can rarely be repeated for the same single cells.
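A standard entry point to stochastic simulation is Gillespie’s algorithm. The sketch below applies it to a simple production–degradation process (production at constant rate k1, degradation at rate k2 per molecule); the reaction system and rate constants are illustrative assumptions.

```python
# Gillespie simulation of a birth-death process: 0 -> X (k1), X -> 0 (k2*x).
import random

k1, k2 = 10.0, 0.1
x, t, t_end = 0, 0.0, 200.0
while t < t_end:
    a1, a2 = k1, k2 * x          # reaction propensities
    a0 = a1 + a2
    t += random.expovariate(a0)  # exponentially distributed waiting time
    if random.random() < a1 / a0:
        x += 1                   # production event fires
    else:
        x -= 1                   # degradation event fires
print("final copy number:", x)   # fluctuates around k1/k2 = 100
```

Repeating such runs many times yields the distribution of trajectories, which is exactly the information a deterministic ODE model averages away.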

F. Monte Carlo Simulation

Monte Carlo (MC) simulation is a hit-or-miss sampling method. It is typically applied to find the extremes of a function in a restricted region of the possible parameter space. MC sampling can be viewed as randomly choosing parameters (x,y) and keeping the pair that gives, for example, the highest value of f(x,y). This method, however, is not a systematic approach to approximating the optimal solution; simply put, each time the simulation is performed, we either hit or miss the solution.

For each run, parameter values such as initial conditions or kinetic constants are randomly varied. It is then recorded whether such perturbations influence the final result of the model and, if so, how. The parameters to vary are those in which significant uncertainty is encountered. From the resulting range of estimated final values of the model, one can estimate how likely a certain outcome is to occur. MC simulation usually evaluates the model hundreds to tens of thousands of times to estimate the solution.
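For example, the following sketch propagates parameter uncertainty through a model by MC simulation; the model output here is simply the steady-state level k1/k2 of the production–degradation process above, and the sampling ranges are illustrative assumptions.

```python
# Monte Carlo perturbation of uncertain parameters: sample many
# (k1, k2) pairs and inspect the distribution of the model output.
import random

outcomes = []
for _ in range(10000):
    k1 = random.uniform(5.0, 15.0)   # uncertain production rate
    k2 = random.uniform(0.05, 0.2)   # uncertain degradation rate
    outcomes.append(k1 / k2)         # model output: steady-state level

outcomes.sort()
n = len(outcomes)
print("median outcome:", outcomes[n // 2])
print("90% interval:", (outcomes[int(0.05 * n)], outcomes[int(0.95 * n)]))
```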

The MC simulation method was applied to study the dependence of drug dosage treatments on host resistance against disseminated candidiasis (Hope et al. 2006). In another study, the MC method was applied to study anticancer drug target inhibition strategies in the epidermal growth factor signaling pathway; the authors investigated the influence of changes in kinetic parameters by comparing parallel simulation runs (Wierling et al. 2012). MC simulation has also been proposed as a method for assessing the degree of completeness of GRNs, where information on gene interactions is often missing (Kuhn et al. 2009). Tools for performing MC simulations include PyBios (http://pybios.molgen.mpg.de/) and MATLAB with the Statistics Toolbox or Simulink.

Finally, Markov chain Monte Carlo (MCMC) is an optimization method for high-dimensional spaces. It is designed to randomly search the parameter space such that the optimal value can be approximated. The way in which the parameter space is searched, and the test of whether proposed parameters are accepted, are often defined by methods such as Gibbs sampling or the Metropolis–Hastings algorithm (Hastings 1970). MCMC is also applied to solve integrals in high-dimensional spaces when traditional numerical methods fail. A complete description of MCMC is beyond the scope of this chapter; for further details on the method, we refer readers elsewhere (Gamerman and Lopes 2006).
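The core of the Metropolis–Hastings algorithm fits in a few lines, as in the sketch below, which samples a one-dimensional parameter from a Gaussian target density; in a real application the target would be the likelihood (or posterior) of the model given the data, and the proposal width would need tuning. All numbers here are illustrative.

```python
# Random-walk Metropolis-Hastings sampling of a 1-D Gaussian target.
import math
import random

def log_target(p):
    # Log-density of a Gaussian with mean 2.0 and standard deviation 0.5.
    return -((p - 2.0) ** 2) / (2 * 0.5 ** 2)

p, samples = 0.0, []
for _ in range(50000):
    proposal = p + random.gauss(0.0, 0.3)    # symmetric random-walk proposal
    # Accept with probability min(1, target(proposal) / target(p)).
    if math.log(random.random()) < log_target(proposal) - log_target(p):
        p = proposal
    samples.append(p)

kept = samples[5000:]                             # discard burn-in
print("estimated mean:", sum(kept) / len(kept))   # close to 2.0
```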

G. Agent-Based Models

Agent-based models (ABMs) complement ODE models: simple rules for each agent’s actions can lead to complex dynamics in the population of interacting agents. Agents are autonomous entities that can represent molecules, cells, or organisms. ODE models are often applied to study the dynamical properties of regulatory pathways, leaving out information on the spatial distribution of molecules in the cell; this can be tackled using ABMs (Pogson et al. 2006, 2008). For example, an ABM was applied to examine the pathogenesis of gut-derived sepsis using the example of the Pseudomonas aeruginosa interaction with its host (Seal et al. 2011). In general, ABMs are suitable for studying systems in which the spatial and temporal distribution of agents influences the system dynamics, such as chemotaxis along a gradient of quorum-sensing molecules (Netotea et al. 2009; Fozard et al. 2012), biofilm formation (Mitri et al. 2011), or pheromone concentration gradients during mating. ABM has been applied to study the functionality of the immune system (Folcik et al. 2011) and granuloma formation (Segovia-Juarez et al. 2004), and for predicting the outcome of different immunotherapy strategies (Pappalardo et al. 2011).

All ABMs are systems of agents whose actions and decisions are specified by the user. The agents typically live in a 2-D world, which is a square divided into a grid of patches (or a triangulated space). These patches, representing the environment, can influence an agent’s actions, and agents can in turn affect the attributes of the environment. ABM is a discrete and stochastic modeling approach: at each step, an agent makes a decision according to its status at the time. The dynamics of biological systems are complex, and although ABMs appear conceptually clear, their code tends to be long, often running to thousands of lines, which can make such models difficult to handle.
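The sketch below gives a flavor of such rule-based code: fungal-cell agents perform a random walk on a toroidal grid of patches, and a single phagocyte removes any cell it meets. The rules, grid size, and agent numbers are invented for illustration and bear no relation to a validated model.

```python
# Minimal agent-based sketch: random-walking cells and one phagocyte.
import random

SIZE, STEPS = 20, 200
cells = [(random.randrange(SIZE), random.randrange(SIZE)) for _ in range(30)]
phagocyte = (SIZE // 2, SIZE // 2)

def step(pos):
    # Move one patch in a random direction; the grid wraps around (torus).
    dx, dy = random.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
    return ((pos[0] + dx) % SIZE, (pos[1] + dy) % SIZE)

for _ in range(STEPS):
    cells = [step(c) for c in cells]              # every agent acts per step
    phagocyte = step(phagocyte)
    cells = [c for c in cells if c != phagocyte]  # phagocytosis on contact

print("cells surviving after", STEPS, "steps:", len(cells))
```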

ABMs have gained popularity, particularly for modeling the immune system and disease dynamics and as a tool for elucidating the nature of host–pathogen interactions. Because each agent’s actions are defined by simple rules, ABM simulations also make it possible to track the dynamics of a single agent rather than the averaged behavior of a population. Tools have already been developed for simulating immune dynamics whereby the user can specify the rules for an agent’s interactions, including IMMSIM, SIMMUNE, SIS, and reactive animation [reviewed in (Bauer et al. 2009)]. Computationally oriented studies on host–pathogen interactions can be performed using CyCells, PathSim, MASyV (Bauer et al. 2009), or BSim (Gorochowski et al. 2012), which is freely available at http://bsim-bccs.sf.net.

ABMs have been implemented for the study of fungal pathogenesis using the ABM tool NetLogo (freely available at http://ccl.northwestern.edu/netlogo/docs/). NetLogo was used to study C. albicans interactions with its human host (Tyc and Klipp 2011). The authors explored the rules that determine the dynamics of a fungal population influenced by host phagocytic cells (Fig. 3.4). The model was then used to investigate the effects of potential drug treatments on fungal populations and their clearance. Another example is a study of A. fumigatus population clearance by neutrophils, in which different rules defining neutrophil movement were examined, such as chemotaxis along a chemokine gradient, random walk, and communication between the phagocytes (Tokarski et al. 2012).

Fig. 3.4.
figure 00034

Model simulation of the system treated with a low drug dose. Drug is applied at time t = 400 (arbitrary units). Treatment does not clear the infection. After initial reduction, t = 500, the fungal population recovers from the drug stress, t = 1,500. (a) Model simulation output at different time points. (b) Agents considered in the model: yeast cells (red), inactive hyphal cells (yellow), active hyphal cells causing damage to the host (gray), polymorphonuclear neutrophils (blue). (c) Graphical representation of the model simulation over time; y-axis gives the total cell number

H. Game Theory

Humans have developed numerous strategies to protect themselves against invading pathogens, such as controlling pathogen growth and dissemination by clearing pathogens from the body through activation of the innate and adaptive immune systems. Host–microbe interactions can therefore be viewed as processes that shift the balance between different populations of cells, leading to either a healthy or a diseased host state. Populations evolve and establish equilibrium states in accordance with the equilibria of the other populations, and this process repeats ad infinitum or until one partner is removed from the interaction, e.g., by death of the host or through pathogen clearance. Evolutionary game theory is suitable for modeling the dynamics of evolutionary processes in which interactions between individuals and their environment modify the population fitness landscape, a situation that standard optimization methods for population fitness or flux rates do not capture (Pfeiffer and Schuster 2005). In short, game theory (GT) is a framework suitable for modeling biological systems in which distinct strategies can be assigned to the individual partners.
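A common formalization of evolutionary GT is replicator dynamics, in which the frequency of a strategy grows in proportion to its fitness excess over the population average. The sketch below simulates a two-strategy “cooperate versus cheat” game; the payoff matrix is an illustrative assumption, not taken from any of the studies cited here.

```python
# Replicator dynamics for two strategies (cooperate, cheat).
import numpy as np
from scipy.integrate import odeint

A = np.array([[3.0, 0.0],    # payoff to a cooperator vs. (coop, cheat)
              [5.0, 1.0]])   # payoff to a cheater vs. (coop, cheat)

def replicator(x, t):
    f = A @ x                    # fitness of each strategy
    return x * (f - x @ f)       # growth proportional to excess fitness

t = np.linspace(0.0, 20.0, 200)
traj = odeint(replicator, [0.9, 0.1], t)
print("final strategy frequencies:", traj[-1])   # cheaters take over here
```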

GT has been applied to the study of survival strategies of C. albicans in macrophages (Hummert et al. 2010) and to the growth strategies adopted by individual cells when different carbon sources are provided (Friesen et al. 2004). GT has also been exploited to study the distinct utilization of metabolic pathways by cell populations (Pfeiffer and Schuster 2005; Ruppin et al. 2010), to study cooperation during evolution (Nowak et al. 2004), and as an application for optimizing altruistic behavior in microbial populations (Schuster et al. 2010). GT has further been used in the analysis of cancer cells (Gatenby and Vincent 2003), for multiple-knockout analysis in S. cerevisiae (Kaufman et al. 2004), and for the analysis of microarray data via the definition of coalitional game sets (Albino et al. 2008; Moretti et al. 2007, 2008).

I. Model Parameters

Simulation of a dynamic model requires knowledge of the kinetic parameter values. These, however, are often unavailable or extremely difficult to obtain from experimental data. Thus, a process termed parameter estimation or, more precisely, regression (Jaqaman and Danuser 2006) has to be employed. Parameter estimation is a typical inverse problem (i.e., deducing causes from effects), and its objective is to find the set of parameter values that best represents the data. Many algorithms are suitable for solving such optimization problems, which minimize the distance between experimental data points and the simulation results of the model; they comprise both local and global optimization methods (Moles et al. 2003; Baker et al. 2010). Global optimization methods search for the solution by scanning the entire parameter space, but they are computationally more expensive and time-consuming than local optimization methods. A local method is faster; however, it may provide a suboptimal solution, which in some cases is only a local optimum. Regardless of the method utilized, algorithms tend to minimize the sum of squared residuals (RSS) given by:

$$ \mathop{\min}\limits_{\mathrm{p}}\sum\limits_{i=1}^n {|y({t_i})-\mathrm{M}({\it x}({{\it t}_i}),\mathrm{p}){|^2}} $$
(3.4)

where \( y({t_i}) \) are experimental data points and \( \mathrm{M}({\it x}({{\it t}_i}),\mathrm{p}) \) are the values of the simulated model; \( x({t_i}) \) is a corresponding model variable and p is a set of parameter values.
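In practice, Eq. 3.4 is handed to a numerical optimizer. The sketch below fits a single decay-rate parameter to synthetic “experimental” data by least squares; the exponential model, noise level, and true parameter value are invented for illustration.

```python
# Least-squares parameter estimation for a one-parameter decay model.
import numpy as np
from scipy.optimize import least_squares

t = np.linspace(0.0, 5.0, 20)
true_p = 0.8
y = np.exp(-true_p * t) + np.random.normal(0.0, 0.02, t.size)  # noisy data

def residuals(p):
    # Difference between data y(t_i) and model simulation M(x(t_i), p).
    return y - np.exp(-p[0] * t)

fit = least_squares(residuals, x0=[0.1])
print("estimated decay rate:", fit.x[0])   # should recover roughly 0.8
```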

J. Sensitivity Analysis

Most mathematical models require parameters to be estimated from experimental data sets. The quality of the data fit, however, depends both on the reproducibility of the experimental data and on the model structure itself. We can study how changes in model parameters influence the model output, a technique termed sensitivity analysis. Parameters with marginal effects on the model output cannot be estimated from experimental data, nor will their numerical values significantly affect the quality of model predictions. Therefore, in order to estimate a set of parameters for a given model, one needs to focus on generating experimental data for the components that are strongly influenced by these parameters. Sensitivity analysis can be performed either locally or globally. Using local sensitivity analysis, we can analyze the effects of a comparatively small perturbation of a parameter (p) on the model output (O), such as at steady state (SS), and can represent the effect with the following equation:

$$ R_p^O=\frac{p}{{{O^{SS }}}}\frac{{\partial {O^{SS }}}}{{\partial p}} $$
(3.5)

Local sensitivity analysis does not consider interdependencies among multiple parameters. It also tends not to be robust, meaning that the results depend on the nominal parameter values used in the model. By contrast, global sensitivity analysis techniques search the entire parameter space, taking into account many parameter values rather than only one. Global techniques not only consider larger variability in parameter values, but also provide a measure of parameter interactions (Frey and Patil 2002; Marino et al. 2008).
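For a model whose steady state is known in closed form, Eq. 3.5 can be evaluated directly or approximated by a finite difference, as in the sketch below for the production–degradation process used earlier (steady state x* = k1/k2); the model is an illustrative assumption.

```python
# Scaled local sensitivity (Eq. 3.5) of the steady state x* = k1/k2
# to the parameter k1, approximated by a forward finite difference.
def steady_state(k1, k2):
    return k1 / k2

k1, k2, h = 10.0, 0.1, 1e-6
O = steady_state(k1, k2)
dO_dk1 = (steady_state(k1 + h, k2) - O) / h
R = (k1 / O) * dO_dk1
print("scaled sensitivity w.r.t. k1:", R)   # exactly 1 for this model
```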

The matrix of coefficients obtained from sensitivity analysis can be analyzed to address the question of the structural identifiability of a model. Specifically, whenever (i) each column contains a large absolute value (i.e., each parameter has a strong effect on at least one model variable) and (ii) the columns are linearly independent, the model is structurally identifiable (Jaqaman and Danuser 2006). In summary, sensitivity analysis can aid the process of parameter estimation. A strategy for estimating parameters is, however, very model-dependent; there are no gold standards, and the choice will rely on the modeler’s experience and the biological context.

K. Standards for Modeling

Standards in computational biology are necessary to ease the exchange of research results between scientific communities. SBML has been proposed for describing models of signaling pathways and GRNs (Hucka et al. 2003). SBML is used to encode ODE models, or systems of ordinary differential and algebraic equations (DAEs), such that they can be re-used in other software tools. The need for standardization of models is evident from the ever-increasing number of software tools for model implementation; a lack of standards would inevitably lead to incompatibility of models between different tools and make their broad use impossible. The minimum information requested in the annotation of biochemical models (MIRIAM) has also been proposed (Le Novere et al. 2005). Proper annotation of model components (using ontology names and unique database identifiers) allows not only the comparison of different models but also model merging. Model annotation can be done either manually or using tools such as semanticSBML (http://www.semanticsbml.org; Krause et al. 2010).

IV. Conclusions and Future Perspectives

As experimental techniques improve and evolve, the dimensionality of the biological problems under investigation increases in parallel. Our knowledge of system properties grows with the amount of biological data. Initially, studies focused on understanding the functionality of single genes and followed strictly reductionist schemes. The availability of high-throughput and genome-wide data sets (genomics, metabolomics, proteomics, epigenomics, etc.) has dramatically extended the size and complexity of tractable biological problems, yet we have remained almost paralyzed in our efforts to integrate these data into a better understanding of systems. Hence, there is an emerging need for creative and efficient solutions for placing large-scale data into a meaningful biological context.

The ever-increasing number of genome-wide data sets has significantly aided our understanding of the nature of cellular responses and how they have evolved. It has helped us to identify novel network components, making it possible to determine, for example, what makes one species more resistant to an antifungal drug than another.

Modeling techniques across the board have been useful for the visualization of both large- and small-scale data sets. For each technique, it is important to keep in mind what kind of information is being included and what the expected outcome of the system is. For instance, protein–protein interactions between molecules do not always correlate with a conserved GRN (Roguev et al. 2008). One should also interpret these models cautiously, especially when perturbing a mathematical model by including gene mutations or other alterations in the system, since such alterations may also change the physical interactions of the proteins and therefore may not fit the proposed model structure (Goh et al. 2007; Zhong et al. 2009). There are also benefits and limitations to modeling a single cell versus a cell population, and it is imperative to weigh these pros and cons when modeling biological data.

The generation and cataloguing of further data sets and models will become increasingly important in fostering our understanding of fungal virulence and for the prediction of alternative or even novel therapeutic strategies.