1 Introduction

Marine systems are vast, productive, long-lived, dynamic, and complex. However, they are also finite and sensitive and can be overwhelmingly altered or impacted by strong short-term pressures of environmental or human origin. As these pressures and impacts proceed in the natural environment, it is crucial to consider questions about how these systems function and respond to change. What feeds and balances marine ecosystems? How do biooceanographic processes vary geographically, genetically, functionally, and in accordance with environmental change? What are the strengths and weaknesses of marine systems, and what are the ways in which they are likely to adapt to future ocean conditions? Can we transform detailed observations of past and present systems into valuable predictions and answers to present and future questions?

Marine ecosystems are fed and balanced primarily by the photosynthesis and nutrient cycling of ubiquitous marine microalgae. The biology, photosynthesis, and life cycles of these organisms have adapted, evolved, and diversified as they colonized all parts of the diverse and dynamic oceans, providing a food source and biogeochemical shuttle that sustains almost all other sea life. The ubiquity and critical function of robust primary productivity by marine microalgae are now wholly evident and fundamentally accepted (Falkowski 1997; Falkowski et al. 1998). However, the true complexity, variability, and adaptive paths that govern wild marine microbial systems now and in future scenarios are far from entirely measured or understood.

Broad and precise new capabilities now exist to measure and model the identities, distribution, dynamics, and functions of cellular processes operating in biological systems. More importantly, new questions can be addressed holistically based upon these increasing data (Karsenti et al. 2011). How do marine organisms balance their complex physiologies and cellular states in order to persist, thrive, and produce? What are the organized systems of genes and proteins that have specifically evolved to sense, respond, and catalyze these processes? How do these diverse and highly evolved unicellular machines manage to cope with dynamic and challenging conditions? What are the chemical species and processes that crucially mediate competitive and cooperative exchange? How do numerous evidently co-occurring species truly co-exist? How quickly can these functions adapt to new changes, and which features of biological systems are most subject to novel environmental and evolutionary selection pressures? What do these details mean for ecosystem dynamics and larger ocean processes now and in the future?

These and other important questions are now theoretically addressable through (a) well-reasoned scientific thinking and experimentation, (b) collection of relevant comprehensive molecular data (genomes, transcriptomes, proteomes, metabolomes), and (c) systematic, robust, and scientific methods of data interpretation, aggregation, analysis, modeling, prediction, and validation.

2 Systems Biology

The specific organization and intrinsic biological programs operating within living cells in their environments define their functional and ecological roles. Cast about randomly in a marine environment, the millions of highly specialized biomolecules inside a cell would accomplish very little, fail to self-replicate, and quickly cease to exist. Simply cataloging their presence individually in that case would mean equally little: it is the programmed and compartmentalized coordination of these biomolecules that governs their existence and functions. The systematic co-organization of nucleic acids, proteins, metabolites, and systems by millions of years of evolutionary selection results in irreducible and interconnected processes. This integrative nature is essential to their functions and relevance, and this thinking motivates and defines “systems biology” as a scientific discipline (Ideker et al. 2001). It is not simple, however, to accurately, rigorously, and scientifically measure, model, and study biological systems in an integrative or exhaustively detailed manner.

Biological systems can be defined at many levels: a single metabolic pathway, signaling cascade, or set of coevolving gene functions; each operates as a molecular system. The outward functions of these systems are defined (and selected upon) as much by their coalescent properties as a whole, as by the functions of their individual components. These small systems do not function or make sense except in the context of the additional information contained within their coevolving properties and interactions. No protein, gene, or environmental response mechanism operates in isolation; the operation of one biomolecule, phenotype, or gene depends on the functions and identities of others. Viruses, organelles, cells and tissues, organisms, populations, and ecosystems are all complex biological systems, whose relevance and operation similarly cannot be reduced to individual parts without missing critical information. Likewise, the relevance and specific effect or response of any single environmental variable can only be understood or predicted in the context of co-varying conditions. The unique and irreducible information stored within the context and arrangement of a biological system in its dynamic environment is essential to understanding its nature and operation, and the principles and theories therein are as important to study, measure, model, and predict as the properties of single genes, proteins, or metabolic components.

The importance of this “systems-level” thinking has long been recognized (von Bertalanffy 1968) but seldom easily or scientifically addressed. Billions of microorganisms teem in most marine waters, and the genomes of even the simplest among them encode thousands of active biomolecules. The study of the coalescent behavior of these numerous variables is often limited to outward observations, measurements, and qualitative classifications: cell size, growth rate, fluorometry, and elemental contents. However, the complex, dynamic, and interacting molecular and genetic parameters that give rise to emergent biological functions must be explicitly considered in order to reach sufficient or predictive answers to many fundamental questions. For example, certain species dominate others, depending on environmental factors. The cyanobacteria thrive in nutrient-limited pelagic waters (Zwirglmaier et al. 2008), while the large eukaryotic phytoplankton tend to dominate in areas abundant in macronutrients (de Vargas et al. 2015). But why is this? What combinations of gene functions have critically coevolved into these divergently optimized systems-level properties? Which emergent properties of their molecular systems produce their respective advantages? What are the constraints and boundaries on their niches and adaptabilities? What are the minimum molecular and genetic differences (among many) required to explain and account for the outward differences between species? How have proliferative, competing, or cooperating species coevolved or co-opted advantages to become successful in each other’s niches through evolution, gene transfer, and symbiosis? What are the possibilities and opportunities for this to continue and change in new and future environments?

The consideration of complex molecular, genetic, ecological, and environmental data and models becomes necessary in order to accurately and scientifically answer questions like these. The advent of efficient large-scale comprehensive molecular data collection (genomics, transcriptomics, proteomics, metabolomics, etc.) now offers broad and explicit information that can help to link genotype to phenotype, phenotype to environment, species to ecosystems, and intra- and interspecies evolution to adaptation. The amount of new data available is immense and is superseded in importance and opportunity only by a careful focus on biological questions, well-conceived hypotheses, and measurements and experiments designed to address them.

3 Marine Microalgal Genomics

The first comprehensive molecular data rapidly, efficiently, and completely collected for living systems were arguably the first microbial genomes (Fleischmann et al. 1995; Fraser et al. 1995, 1997, 1998; Bult et al. 1996). The rapid, systematic sequencing of genomes provided the first precise molecular descriptions of complete biological systems. Every functional biomolecule in the organism is encoded or imprinted in its genome, as well as the hardcoded portion of the biological program that controls its physiology, metabolism, life cycle, responses to change, potential for interactions with the extracellular environment, evolutionary history and relationships, and heritable capacity to genetically evolve new functions, adaptations, and emergent properties. The whole genome of an organism is necessary (but not sufficient) to fully understand and consider its biology and its current and future capabilities.

Prokaryotic Microalgae

The most ubiquitous marine microalgae are the cyanobacteria, including the genera Prochlorococcus and Synechococcus. These are found thriving throughout marine systems, including throughout vast nutrient-limited pelagic deserts where organic nitrogen, phosphorus, iron, or silicate are insufficient for larger microalgae to thrive. The genomes of Prochlorococcus and Synechococcus are approximately 1.5–2.5 Mbp in length and contain between 1000 and 2500 genes (Dufresne et al. 2003; Palenik et al. 2003; Rocap et al. 2003), typically including all of the genes necessary for photosynthesis. These genomes appear remarkably streamlined in comparison to the 4.6 Mbp genome of the E. coli bacterium, which consists of approximately 4300 genes (Blattner et al. 1997). It is hypothesized that the small genome sizes of cyanobacteria are optimized for nutrient-limited environments (Bentkowski et al. 2017).

Eukaryotic Microalgae

The first fully sequenced marine microeukaryote genomes have revealed a surprising genetic and physiological complexity hidden within seemingly simple unicellular plankton. The genome of the siliceous “cosmopolitan” diatom Thalassiosira pseudonana, the first diatom sequenced due to its ubiquity and compact genome size (32 Mbp), consists of 24 chromosomes and encodes an estimated 11,390 genes (Armbrust et al. 2004). Roughly half of these genes bear no confident similarity to any other genes of known function. The genome of Phaeodactylum tricornutum, the second diatom to be sequenced, similarly encodes over 10,000 genes, only about half of which bear detectable similarity to those found in T. pseudonana, despite only ~90 million years of divergent evolution between the two species. Both of these marine microeukaryote genomes imply a complexity greater than that of the first fully sequenced microeukaryote, the yeast Saccharomyces cerevisiae, the 12 Mbp genome of which consists of ~6000 genes (Goffeau et al. 1996). Similarly, the ~120 Mbp genome of the freshwater microalga Chlamydomonas reinhardtii is surprisingly complex, encoding ~16,700 genes (Merchant et al. 2007), fewer than 3000 of whose products bear confident similarity to those encoded in the genomes of T. pseudonana (Armbrust et al. 2004) or P. tricornutum (Bowler et al. 2008). The collection of microalgal genomes is rapidly expanding, including several additional diatom (Lommer et al. 2012; Traller et al. 2016; Mock et al. 2017; Basu et al. 2017) and dinoflagellate (Lin et al. 2015) genomes now available for research into these lesser understood phyla. In addition to this, at least hundreds of new genomes and millions of new putative protein coding genes have been cataloged by the latest exploratory and integrative oceanographic efforts (de Vargas et al. 2015).
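The genome sizes and gene counts quoted above can be compared more directly as coding density (genes per Mbp). The following is a minimal sketch using only the approximate figures given in the text; the helper name genes_per_mbp is introduced here for illustration, and the gene counts are published estimates rather than exact values.

```python
# Genome sizes (Mbp) and approximate gene counts as quoted in the text above.
GENOMES = {
    "Escherichia coli": (4.6, 4300),
    "Saccharomyces cerevisiae": (12.0, 6000),
    "Thalassiosira pseudonana": (32.0, 11390),
    "Chlamydomonas reinhardtii": (120.0, 16700),
}


def genes_per_mbp(size_mbp, gene_count):
    """Return coding density as predicted genes per megabase pair."""
    return gene_count / size_mbp


densities = {
    name: genes_per_mbp(size, genes) for name, (size, genes) in GENOMES.items()
}

# Print organisms from the most to the least gene-dense genome.
for name, density in sorted(densities.items(), key=lambda item: -item[1]):
    print(f"{name}: {density:.0f} genes per Mbp")
```

On these figures, gene count rises with genome size while gene density falls, which is one simple way to express the contrast between streamlined prokaryotic genomes and the larger microeukaryote genomes.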

Marine Microbial Metagenomes

Within any water sample or marine ecosystem, there are multiple genomes present, numbering from the tens to trillions, depending on scale. The specific repertoires of variously encoded functions and alleles in any sample define the genetic potential of biological processes in that context and specify salient functions and features that are adapted to exist in the environment from which the sample was collected. The genomic information typically observed in marine environments is inexhaustibly complex and difficult to assemble or reduce (Gilbert and Dupont 2011; Iverson et al. 2012). Nevertheless, the sample-specific richness of taxonomy, diversity, allele distribution, and function gained through high-depth short-read sequencing of environmental samples quickly yields critical information to understand the aggregate nature and properties of marine microbial systems (Venter et al. 2004; Sunagawa et al. 2015). The genome, however, is largely passive and inactive and merely a source code in all living systems. In order to understand how the information stored and transmitted in genomes relates to biology itself, it is necessary to consider the messages and biomolecules transcribed from genomes to perform cellular functions.
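As an illustration of how sample-level richness and diversity can be summarized from metagenomic read assignments, the sketch below computes taxon richness and the Shannon index for two samples. The read counts and sample names are hypothetical, and the Shannon index is used here as one common summary statistic, not as the method of the studies cited above.

```python
import math


def richness(counts):
    """Number of taxa with at least one assigned read."""
    return sum(1 for n in counts.values() if n > 0)


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads assigned to any taxon")
    h = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            h -= p * math.log(p)
    return h


# Hypothetical read counts per taxon for two water samples.
sample_a = {"Prochlorococcus": 700, "Synechococcus": 200,
            "diatoms": 50, "dinoflagellates": 50}
sample_b = {"Prochlorococcus": 250, "Synechococcus": 250,
            "diatoms": 250, "dinoflagellates": 250}

for label, sample in (("A", sample_a), ("B", sample_b)):
    print(f"sample {label}: richness={richness(sample)}, "
          f"H'={shannon_index(sample):.3f}")
```

Both samples have the same richness, but the evenly distributed sample B has the higher Shannon index, which shows why richness alone does not capture community structure.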

4 Marine Microalgal Transcriptomics

The primary (and likely most ancient) dynamically encoded information in the cell is contained within RNAs conditionally and flexibly transcribed to and from the functionally inactive genome. Fortunately, this information is also now relatively comprehensive and easy to obtain through high-throughput sequencing. For microeukaryotes with large genome sizes, and for mixed cultures or environmental samples, the expressed transcriptome is significantly smaller and less complex than the corresponding whole genome space and by definition contains nearly all of the genetically encoded functional information that is operating under a particular condition. These data, which are in many cases now readily comprehensive, provide a wealth of detailed information about crucially acting molecular processes and the emergent patterns of gene expression that give rise to various cellular states. While they are necessary but not sufficient to explain the operation of biological systems, transcriptomic data currently represent the broadest, most sensitive, and most easily obtainable and intercomparable system-wide functional information that can be collected for most organisms and biological systems.

The putative biological activities of the proteins produced by typically about half of all microalgal transcripts can be inferred bioinformatically, and while in some examples and species the expression levels of proteins can be uncorrelated to the expression levels of their transcripts, increases or decreases of mRNA transcript levels are often concordant with changes in the cellular levels of the proteins that they encode. mRNA sequencing of genetically variable species or environments also yields nucleotide polymorphisms in conserved and evolutionarily pressured protein-coding regions that may be linked to spatial, temporal, ecological, evolutionary, and functional divergence. For all of these reasons, a wealth of transcriptomic and metatranscriptomic data has been collected for marine microalgal species in a very short time. These data can be used to rapidly characterize informative gene regulatory profiles of biological systems in accordance with varying cellular states and environments.
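The increases and decreases in transcript levels discussed above are commonly expressed as log2 fold changes between conditions. The sketch below shows one simple way to compute them from raw read counts, assuming counts-per-million normalization and a pseudocount of 1; the gene names and counts are hypothetical, and a real analysis would use replicates and a statistical model rather than a single pair of samples.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each library sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_changes(control, treatment, pseudocount=1.0):
    """Per-gene log2(treatment / control) on CPM values.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    cpm_c = counts_per_million(control)
    cpm_t = counts_per_million(treatment)
    return {
        gene: math.log2((cpm_t[gene] + pseudocount) / (cpm_c[gene] + pseudocount))
        for gene in control
    }


# Hypothetical read counts for three genes in two conditions.
control = {"geneA": 100, "geneB": 400, "geneC": 500}
treatment = {"geneA": 400, "geneB": 100, "geneC": 500}

lfc = log2_fold_changes(control, treatment)
for gene, value in lfc.items():
    print(f"{gene}: log2 fold change = {value:+.2f}")
```

Normalizing by library size before taking the ratio keeps differences in sequencing depth from being mistaken for differences in expression.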

Laboratory Studies

The first comprehensive genome-wide expression studies in marine microalgae consisted of single-species experiments in which whole-transcriptome microarrays were used to measure changes in the expression of all genes under various conditions of relevance to the natural environment. These include tracking the cyanobacterium Prochlorococcus over its day/night cycle (Waldbauer et al. 2012), measuring the comprehensive responses of the diatom T. pseudonana to a panel of typical stresses and limitations (Mock et al. 2008), and examining the interacting factors of the diel cycle, culture density, and nutrient exhaustion (Ashworth et al. 2013). For the diatoms in particular, abundant transcriptome-wide expression data have been collected using microarrays and mRNA sequencing under various laboratory conditions designed to simulate environmental variables. These include, among others: (1) T. pseudonana: silica, iron, and nitrogen limitation, low temperature and elevated pH (Mock et al. 2008), iron starvation (Thamatrakoln et al. 2012), silica starvation and re-supplementation (Shrestha et al. 2012; Smith et al. 2016b), diel growth from exponential to stationary phase (Ashworth et al. 2013), exposure to the pollutant benzo[a]pyrene (Carvalho et al. 2011), and growth at moderate and elevated CO2 levels under moderate and elevated light; and (2) P. tricornutum: silica limitation (Sapriel et al. 2009), acclimation to high light (Nymark et al. 2009), exposure to cadmium (Brembu et al. 2011), acclimation to light and dark cycles (Chauton et al. 2013), exposure to a panel of stresses and pollutants (Hook and Osborn 2012), darkness and re-illumination (Nymark et al. 2013), and growth in red, blue, and green light (Valle et al. 2014).

The comprehensive tracking of changes in gene expression over all of these various experimental conditions results in a rich and complex picture of transcriptome dynamics in these organisms (Ashworth et al. 2013, 2016; Levering et al. 2017), much of which is still yet to be sufficiently studied, understood, and fully applied to address fundamental questions and predictions with regard to marine systems and future change.

RNA Sequencing

Whole transcriptome RNA sequencing at high depth has become affordable enough to simplify the process and information necessary to obtain transcriptomic profiles for any species or biological sample. The sequencing of transcribed RNA repertoires through reverse transcription and amplification is powerful and sensitive, and the resulting sequences can be informatively classified, quantified, and assembled even in the absence of corresponding genome sequences. For eukaryotes in particular, the functional and phylogenetic information density present in transcribed messenger RNA is high, and the amplification of cDNA improves signal detection, particularly in the case of poly-dT-primed strand synthesis and 3′-directed selective amplification (Xiong et al. 2017). Laboratory mRNA sequencing studies in marine microalgae include the profiling of diatom transcriptome dynamics at different levels of carbon dioxide (Hennon et al. 2015), during silica starvation (Smith et al. 2016b), during diel cycling combined with iron limitation (Smith et al. 2016a), and across an integrated range of several other environmental, nutrient, and chemical perturbations (Levering et al. 2017). Hundreds of additional new single-strain transcriptomes have been sequenced in order to cover dozens of completely new clades of marine microeukaryotes (Keeling et al. 2014), resulting in an atlas of transcribed and functional coding sequences of unprecedented comparative breadth and depth against which to integrate and contextualize individual new studies conducted in the laboratory and in the field.

Metatranscriptomic mRNA sequencing studies, or those conducted on samples collected in natural marine systems, are beginning to produce a deep and unprecedented richness of functional eukaryotic coding sequences operating in wild environments. The transcriptomic complexity and response of a wild phytoplankton community during iron supplementation experiments at sea revealed a rapid and iron-specific response encoded in native organisms that are evolutionarily and adaptively primed to cope with and take advantage of fluctuating iron levels in the North Pacific Ocean (Marchetti et al. 2012), confirming and expanding upon observations from related laboratory experiments. Environmentally responsive biological programs can be identified within these large data sets that help to explain acclimation and evolved survival mechanisms in unprecedented precision and detail (Marchetti et al. 2017). Meanwhile, in the Atlantic Ocean, transcriptomics is being used to more deeply observe and understand the intracellular behaviors and responses of phytoplankton with respect to nutrient conditions (Alexander et al. 2015). The ability to track detailed and specific intracellular programs in the natural environment provides an opportunity to deeply understand what native cellular communities are really doing as they occupy and experience different marine environments, and this may soon be broadly automated to study and predict the dynamics of communities in situ (Ottesen et al. 2013, 2014; Aylward et al. 2015).

5 Proteomics and Metabolomics

The majority of unique functional biomolecules in marine systems are proteins, and the majority of the remainder are metabolites and the products of enzymes. It would be deeply informative to know the identity and quantity of all proteins, metabolites, and biomaterials present in a cell or sample from the ecosystem—this could be used to model and predict metabolic flux—and would be more closely representative of true physiology and activity than transcript levels, from which the levels and functions of posttranscriptional, posttranslational, and metabolic products may importantly diverge.

Proteins are more complex polymers than nucleic acids, are complicated to uniquely separate and identify, and cannot be amplified; unfortunately, unlike for DNA and RNA, there is no simple way to sequence or quantify complex pools of proteins. Proteomic technologies based on peptide generation, automated multidimensional separation, and mass spectrometry are now able to finely sample and detect thousands of different peptides at high resolution, resulting in successful, albeit noncomprehensive, proteomic analyses of microalgae in response to changing conditions (Nunn et al. 2009; Dyhrman et al. 2012; Nunn et al. 2013).

Quantification of complex protein pools is also a challenge and is best conducted using a pool of peptide standards matching the expected proteome. Exhaustive species-specific proteotypic peptides have made this possible for the human proteome (Kusebauch et al. 2016), but this does not immediately translate to other species, and the theoretical pool of possible peptides in wild environments is prohibitively large for accurate and comprehensive de novo quantification. Nevertheless, environmental proteomics has been able to directly identify certain functional proteins that are present and operate in accord with environmental conditions (Saito et al. 2014), and the proteomic detection and quantification of specifically validated and informative protein biomarkers may be an important tool for oceanography.

The broad measurement of metabolites and biomaterials in cells and natural systems similarly relies on adequate chemical separation, unique identification through mass spectrometry, and validated standards for quantification and thus is similarly challenging in terms of comprehensiveness and sensitivity compared to the measurement of amplifiable nucleic acids. Nevertheless, deep and informative biochemical datasets are emerging that can be integrated with transcriptomic, proteomic, and environmental data to obtain cellular models that are predictive of basic emergent cellular properties.

6 Integration and Meta-analysis

The critical task and opportunity in systems biology and high-dimensional measurements (including omics) are to coherently integrate and apply these data into forms that are easily distilled and amenable to testable scientific questions. Some of these questions include the following: How do multiple experiments agree? Which aggregate patterns are robust and unique? Which features are condition specific? What are the informative trends and relationships between linked but orthogonal components: genome, transcriptome, proteome, metabolome, phenotype, community, and environment? What is the organizational and reactive information contained within the system, what are its constraints, and how is it most likely transmitted?

In the case of transcriptome-wide gene expression, coherent microarray or RNA sequencing data from many independent experiments can be readily integrated to discover patterns of co-expression and conditionality that are only evident in aggregation. In the case of microalgae, as in other organisms, this can more powerfully imply condition-specific units of co-regulation and function than individual experiments alone (Hennon et al. 2015), partition all genes into subgroups of statistical and conditional relatedness within and between species (Ashworth et al. 2016), and identify core features of apparent conditional metabolic control (Levering et al. 2017). The bioinformatic and integrative construction of metabolic models helps to organize, explain, and predict the flow of metabolites in new microalgal species (Chang et al. 2011; Nogales et al. 2012; Levering et al. 2016; Kim et al. 2016). Layering additional data types together into multi-scale models is also in some sense simple, given adequate comprehensiveness and coherence (Karr et al. 2012). This is now also imminently possible for microalgae. Increasingly, deep data collection and performance-optimized “data-driven” statistics must be intimately applied to justifiable biological hypotheses in order to understand and compare profound and complex features evident in single cells or entire ecosystems. Modeling ocean processes evidently has little to do with the tidy normal distributions and convenient statistical models that are easily obtained in the laboratory. In practice, assumptions about what to expect in environmental datasets must be very carefully considered, and analyses must be designed specifically to address new questions with statistically valid findings. The computational challenge alone rivals the prodigious stargazing efforts and brilliant scientific advances of the national space programs of the 20th century. The constellations of biological entities and processes occurring in the world’s oceans still exceed our ability to fully observe them, or to accurately and predictively model all of the biological processes operating in balance with changing environments. This is a major new challenge for 21st century oceanography.
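The idea that co-expression patterns become evident only when many experiments are aggregated can be illustrated with a small sketch. It assumes that expression profiles from several experiments have already been normalized and concatenated into one vector per gene, and it flags gene pairs whose Pearson correlation exceeds a threshold. The gene names, values, and the 0.9 threshold are hypothetical, and this pairwise approach is a simple stand-in rather than the clustering methods used in the studies cited above.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def coexpressed_pairs(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for every pair correlated at or above threshold."""
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            pairs.append((g1, g2, round(r, 3)))
    return pairs


# Hypothetical log-expression values across six conditions pooled
# from several independent experiments.
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "gene2": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "gene3": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    "gene4": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}

pairs = coexpressed_pairs(profiles)
for g1, g2, r in pairs:
    print(f"{g1} and {g2} are co-expressed (r = {r})")
```

The more conditions each profile spans, the less likely it is that a high correlation arises by chance, which is the practical argument for integrating many independent experiments before inferring co-regulation.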

7 Prediction and Synthesis

The increasing depth and comprehensiveness available through high-throughput molecular data collection now parameterizes the operational details of living systems in typically overwhelming detail. This detail is necessary but not sufficient to constitute scientific knowledge and utility. The burden is on a [systems] biologist to demonstrate the soundness, practicality, and relevance of their data, models, and products to broader fields. How are high-data analysis and modeling useful, informative, and predictive? For what reasons were these complex data and models invested in, and what is their readily transferable scientific or technological value?

Two applicable aims in this regard are (1) predictions of present and future ecosystem properties and dynamics and (2) predictive testing, optimization, and reengineering of cellular and system-wide properties. Complex “whole-cell” models with predictable aggregate properties could be used to deepen predictions of genetic and cellular functions that drive ecosystems (Bragg et al. 2010), and responses and adaptations of species and strains to ecological situations may soon be modelable as a function of complex, interacting, and adaptable molecular programs whose inputs, rules, and outputs are predictable. Complementary to pure new “data-driven” approaches are efficient and multiplexed laboratory experiments that are able to directly probe the functions and impacts of specific new gene candidates identified from conditional ‘omics experiments. Two exciting examples of this are direct and thorough demonstrations of the evidently broad effects of transcription factors and regulatory systems on productive microalgal phenotypes. The direct experimental knockdowns of single transcription factors identified by targeted transcriptomics experiments in both P. tricornutum (Matthijs et al. 2017) and Nannochloropsis gaditana (Ajjawi et al. 2017) resulted in dramatic changes in metabolism and natural product profiles, demonstrating the genetic and regulatory flexibility of microalgal species with regard to microbial engineering.

The use of whole-cell modeling is also an attractive answer to the challenge of microbial engineering, as it can probe the bounds and productive potential of microalgae. This may be crucial for pathways or cellular functions whose operation involves multigenic tuning, signaling and regulatory logic, subcellular organization, or large-scale bulk cell properties and phenotypes. Honest estimates of uncertainties, model assumptions, and validating tests are all critical to the scientific relevance of integrative analyses, as are efficient and parsimonious algorithms that can be widely adopted and understood. True biotechnological gains from data-based modeling in microbial systems will require that methods be contextualized and designed specifically to be predictive, parsimonious, and practical for engineering within larger real-world goals, constraints, and opportunities (Georgianna and Mayfield 2012).

Marine systems are vast, and they are ubiquitously populated, produced, and balanced by the contextual operation of diverse and complex microalgae. Understanding the molecular and genetic dynamics of present and future marine biological systems will be crucial to interpret large-scale observations and shifts in marine ecosystems. As cellular and biological systems often vary astonishingly in their modes of genome organization, regulation, metabolism, interactions, and environments, it will be crucial to begin modeling efforts by designing data collection, analyses, models, and algorithms to suit the essential biology at hand. Many systems may not simply conform to pre-existing assumptions, tools, or frameworks. As the depth and variety of available data and modeling approaches continue to increase, continued critical, honest, practical, efficient, and rigorous adaptation of scientifically focused thinking, data collection, modeling, analysis, prediction, and validation methods will yield the most fit, fruitful, and translatable products of systems-level scientific research.