1 Molecular Structures

We begin with proteins and the four levels of protein structure. The primary structure is the amino acid sequence of the polypeptide chain; the secondary structure comprises highly regular local sub-structures such as \( \upalpha \)-helices, \( \upbeta \)-strands or \( \upbeta \)-sheets; the tertiary structure is the 3D structure of a single protein molecule; and the quaternary structure is the assembly of several protein molecules or polypeptide chains, usually called subunits in this context. Because the 3D structure defines the functionality of a protein, much effort has gone in recent years into determining it precisely, mainly with experimental techniques such as X-ray crystallography, NMR and electron microscopy. In parallel, many computational methods try to predict the 3D tertiary structure of a protein accurately from its amino acid sequence. Today over 60 K solved protein structures are hosted in wwPDB [1], of which ~86 % are derived from X-ray crystallography, ~13 % from NMR spectroscopy and less than 1 % from electron microscopy [2]. Typical resolutions vary from 1.2 to 4 Å. Similarly, ~4 K solved RNA 3D structures are hosted in NDB [3], of which 8 % correspond to PDB entries [2]. While many reviews comment on the visualization approaches for such cases [4–6], here we give an overview of the current state of cutting-edge research in the field.

Most of the available visualization tools currently depict the chemistry of biomolecules: the atoms and the bonds among them. Representations include ribbons, space-filling atoms, ball-and-stick and others. Coloring schemes highlight important parts of a protein such as binding sites, atoms with certain physicochemical characteristics, SNPs, active sites, different chains, exon boundaries or whole domains. Although it is beyond the scope of this chapter to analyze all of the available visualization tools, below we give some examples of representative, widely used tools and categorize them according to their functionality (Table 1). Although the tools are presented as a non-redundant list, many of them share functionalities and characteristics. For example, PyMol [7], Jmol [8], KiNG or Mage [9] offer typical views and can be incorporated in a web page. Others such as Chimera [10], SRS 3D [11], STRAP [12], Cn3D [13–15] or PdbViewer [16, 17] combine the 3D structural visualization with the linear amino acid sequence (Fig. 1). They are highly interactive: users can select a region in either view and the corresponding area is highlighted in the other. For example, when a sequence region is predicted to be functional, or when part of it is aligned to another sequence of interest, the corresponding 3D structural components are highlighted. This way one can look at the region of interest from either a linear or a structural point of view. Tools such as Molscript [18], PMV [19], VMD [20], ICM-Browser [21] and plusRaster3D [22] export high-dpi images for scientific publications. To superimpose two proteins and compare them directly in 3D space, tools such as MOLMOL [23], MOE, VMD [20] or PyMol [7] are suitable.
In cases where computationally expensive superimposition is required, external CPU-intensive packages such as STAMP [24], STRAP [12] or THESEUS [25] are recommended. Such cases arise when one wants to superimpose very large regions (high size complexity) or sequences with low sequence similarity (many possible combinations). For other characteristics such as hydrophobicity, electrostatics, residue conservation or Connolly surfaces, the MSMS software [26] is the most widely used. To show database annotations related to a certain part of the structure, tools such as ProSAT2 [27], JenaLib [28], PDBsum [29], SYBYL, Swiss-PdbViewer [17] or WHAT IF [30] can be used. Tools like Relibase [31, 32] and Superligands [33] can directly compare smaller molecules such as ligands with each other. Notably, while tools such as tCONCOORD [34] and FIRST/FRODA [35] can depict conformational changes, the Moviemaker [36] and Yale Morph [37] server applications can show two different transition stages of the same molecule. NOMAD-ref [38] and ANM [39] can combine many transition stages, but only for low-frequency events. Although a few of the aforementioned tools, such as PyMol [7], are also suitable for RNA structure visualization, specialized tools such as S2S Assemble [40] have been implemented for RNAs.

Fig. 1

P04637 (P53_HUMAN) tumor suppressor protein visualized by SRS3D application. a 3D structure representation of the three chains of P04637 as ribbons using three different colors. b Three different columns show the domains of the three different chains from different databases individually. c The sequence of the protein in a linear form. d Interactivity enables the highlighting of a chosen domain in every view (sequence and 3D structure). Switching between different representations, the 3D chemical structure of the specific domain is highlighted and visualized as a coil

Although visualization of macromolecular structures is today very mature compared to other areas in biology, current rendering techniques still lack the computational capacity to process more complex systems, such as protein complexes or protein interactions, at very high resolutions. In addition, molecular dynamics, simulations and motion are difficult to depict at such levels of detail, because current tools become CPU-greedy in more advanced analyses where more than one molecule must be visualized at a time. Considerable effort is still needed to come closer to a physical model and to combine chemistry-based visualization with real images from electron or cryo-microscopy, towards the generation of realistic and more informative prototypes. In terms of data integration, tools are still available as standalone applications, but a great variety of them can run as part of a web page and come with standardized file formats and services to increase portability and data exchange.

Table 1 Software tools in the area of proteomics

2 Tree Hierarchies

Tree data structures and representations are widely used in biological studies in order to show hierarchies of data [41]. These include for example the Gene Ontologies (GO) [42] to describe functional annotation of genes via a hierarchically organized set of terms or the Unified Medical Language System (UMLS) [43] which serves a similar function for biomedical notions.

Another very important area of biology is the investigation and visualization of evolution between species. Evolutionary studies try to reveal and understand how different species evolved over time, whether two species have a common ancestor and, if so, at which time point. Phylogenetic trees are the main means of depicting these evolutionary transitions. A prime example of such tree representations is the so-called tree of life [44], which displays the evolutionary relationships between species and how they have separated over millennia. Of ~1.7 million identified species, only ~80,000 have been analyzed for evolutionary relationships and assigned to a hierarchy [45] (Fig. 2).

Fig. 2

Examples of tree hierarchies in Biology a 5 protein sequences were aligned with TCoffee and clustered according to their sequence similarity. The clustering results are shown as a tree hierarchy. b The Tree of Life presented in [44]. c Example of a gene expression heatmap. The expression levels of several genes (tree hierarchy on the left) were measured across several conditions (tree hierarchy on the bottom) using the Expander software. Genes and conditions were clustered using the average linkage hierarchical clustering. Dense red or green areas show the correlations between the genes and the experimental conditions

Other areas in biology that involve high-throughput technologies, such as ChIP–chip arrays, microarrays, next generation sequencing or proteomics, often use tree-based clustering algorithms to interpret and visualize their results. In the case of microarrays [46–48], for example, genes are clustered according to their expression patterns in order to see which of them are correlated or anti-correlated. When one compares healthy with non-healthy tissue, the purpose is to find which genes are up- or down-regulated. Widely used algorithms include single linkage, average linkage, complete linkage [49], UPGMA [50], Neighbor Joining [51, 52] etc. (Fig. 2).
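As an illustration of how such a tree-based clustering proceeds, the following minimal Python sketch implements average linkage agglomerative clustering on a small distance table. The gene names and distances are made up for the example (one might, for instance, use 1 minus the correlation of two expression profiles), and the function name average_linkage is hypothetical; it is not taken from any of the cited tools.

```python
def average_linkage(names, dist):
    """Agglomerative clustering with average linkage.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b.
    Returns (tree, heights): a nested tuple of labels and the distance at each merge.
    """
    # Each cluster is a pair: (subtree built so far, tuple of its leaf labels).
    clusters = [(name, (name,)) for name in names]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                left, right = clusters[i][1], clusters[j][1]
                # Average linkage: mean distance over all cross-cluster leaf pairs.
                total = sum(dist[frozenset((a, b))] for a in left for b in right)
                d = total / (len(left) * len(right))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        heights.append(d)
    return clusters[0][0], heights


# Made-up distances between four hypothetical genes.
toy = {("g1", "g2"): 0.1, ("g3", "g4"): 0.2, ("g1", "g3"): 0.8,
       ("g1", "g4"): 0.9, ("g2", "g3"): 0.7, ("g2", "g4"): 1.0}
dist = {frozenset(pair): d for pair, d in toy.items()}
tree, heights = average_linkage(["g1", "g2", "g3", "g4"], dist)
print(tree)     # (('g1', 'g2'), ('g3', 'g4'))
print(heights)  # merge distances, smallest first
```

The nested tuple is the tree hierarchy that a dendrogram would draw, and the merge distances give the heights of its internal nodes. Replacing the mean with the minimum or the maximum of the cross-cluster distances would give single or complete linkage, respectively.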

In addition, in the field of sequence analysis, biologists try to determine the similarity between two protein or nucleotide sequences. For a given set of sequences, a multiple alignment or an all-against-all pairwise alignment is often performed to construct a distance matrix that holds the similarity scores between every pair of sequences. Widely used applications that perform such analyses include Clustal W [53], MUSCLE [54], BLAST [55] and the T-Coffee suite [56]. To classify the sequences into families, a clustering algorithm is applied to the constructed similarity matrix, bringing together those sequences that are closely related to each other. The clustering results are visualized as a tree hierarchy (Fig. 2).
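The all-against-all step can be sketched in a few lines of Python. The example below assumes the sequences have already been aligned (for instance by one of the tools named above) and uses the simple p-distance, i.e. the fraction of aligned non-gap positions that differ, as a stand-in for the scoring schemes those tools actually use. The sequences and the function names are invented for illustration.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, non-gap positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # no comparable positions: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned: dict) -> dict:
    """All-against-all distances, stored once per unordered pair."""
    names = sorted(aligned)
    return {(p, q): p_distance(aligned[p], aligned[q])
            for p, q in combinations(names, 2)}


# Made-up toy alignment; '-' marks a gap.
alignment = {
    "seqA": "MKT-LLV",
    "seqB": "MKTALLV",
    "seqC": "MRTALIV",
}
dist = distance_matrix(alignment)
for pair, d in dist.items():
    print(pair, round(d, 3))
```

For N sequences the table holds N(N − 1)/2 entries, which is the input a clustering algorithm needs in order to build the tree hierarchy.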

While a variety of computer-readable formats exist, most phylogenetic trees are described using the New Hampshire/Newick [57] format, the NHX extended Newick format or the Nexus [58] format. For tree annotation and information sharing across repositories, markup languages such as phyloXML [59] and NeXML are in demand.
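To make the Newick notation concrete, the sketch below parses a small Newick string into nested Python dictionaries. It is a minimal reader written for illustration only: it handles nested parentheses, node names and branch lengths, but not quoted labels, comments or the NHX extensions. The example tree and its branch lengths are made up.

```python
def parse_newick(text: str) -> dict:
    """Parse a Newick string into nested dicts with 'name', 'length', 'children'."""
    pos = 0

    def peek() -> str:
        return text[pos] if pos < len(text) else ""

    def parse_node() -> dict:
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1                          # consume '('
            children.append(parse_node())
            while peek() == ",":
                pos += 1                      # consume ','
                children.append(parse_node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                          # consume ')'
        start = pos
        while peek() and peek() not in ",():;":
            pos += 1
        name = text[start:pos].strip()
        length = None
        if peek() == ":":
            pos += 1                          # consume ':'
            start = pos
            while peek() and peek() not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if peek() != ";":
        raise ValueError("a Newick string must end with ';'")
    return root


def leaf_names(node: dict) -> list:
    """Collect leaf labels from left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick("((human:0.1,chimp:0.1):0.3,mouse:0.6);")
print(leaf_names(tree))  # ['human', 'chimp', 'mouse']
```

Each pair of parentheses encloses the children of one internal node, and the number after a colon is the length of the branch leading to that node, so the whole hierarchy fits on a single line of text.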

Although the most common representation of a hierarchy is a tree based on a 2D Euclidean drawing [60], treemaps [61], which present a tree hierarchy as nested rectangles, serve as an alternative and are often better suited to classifications than to phylogenies [62]. While tree visualization is today a mature area, the growing number of taxa is still a limiting factor, as a single screen does not offer enough space to represent such huge hierarchies. Traditional viewers that have been in use for many years, such as ATV [63] or TreeView [64], are nowadays poorly suited to displaying huge taxonomies with thousands of entries such as [65]. To overcome this problem, several approaches have been proposed (Table 2). One approach is the implementation of efficient zooming: as users zoom in or out, nodes expand or collapse accordingly. Tools that try to compress the information into a given smaller canvas include DOI trees [66], space trees [67] and expand-ahead browsers [68]. Another approach, followed by tools such as HyperTree [69], is to project the data onto hyperbolic space [70]. While this idea is very efficient in terms of visualization, in practice users find these views difficult to navigate [61]. Preferred tree visualizations, on the other hand, involve radial layouts like those found in iTOL [71], TreeDyn [72], TreeVector [73] or Dendroscope [74]. A third approach, followed by tools such as Paloverde [75] and the Wellcome Trust Tree of Life, uses 3D space. Although such an approach is less disorienting than hyperbolic viewers, it is still not preferred over conventional 2D visualization, with the exception of geophylogenies, where geographical and phylogenetic information is combined towards the implementation of geographic information systems [76]. Based on such approaches, georeferenced barcode DNA sequences are likely to become more widely used in biology [77].
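The idea of drawing a hierarchy as nested rectangles can be sketched with the simple slice-and-dice treemap layout, in which the split direction alternates at each level and every child receives a share of its parent's rectangle proportional to its size. The Python code below is an illustrative sketch under that assumption; the toy taxonomy, its sizes and the function names are invented, and the tools cited above may use more refined layouts.

```python
def tree_size(node: dict) -> float:
    """A leaf carries its own 'size'; an inner node sums the sizes of its children."""
    if "children" not in node:
        return node["size"]
    return sum(tree_size(child) for child in node["children"])


def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Return a list of (name, x, y, width, height) rectangles, parents first."""
    if out is None:
        out = []
    out.append((node["name"], x, y, w, h))
    total = tree_size(node)
    offset = 0.0
    for child in node.get("children", []):
        share = tree_size(child) / total
        if depth % 2 == 0:   # even depth: place children side by side
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:                # odd depth: stack children vertically
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out


# Made-up classification; 'size' could be, for example, the number of sequences.
taxonomy = {"name": "Mammalia", "children": [
    {"name": "Primates", "children": [
        {"name": "Homo", "size": 3},
        {"name": "Pan", "size": 1},
    ]},
    {"name": "Rodentia", "size": 4},
]}

rects = {name: (x, y, w, h) for name, x, y, w, h in slice_and_dice(taxonomy, 0, 0, 8, 4)}
for name, rect in rects.items():
    print(name, rect)
```

On the 8 by 4 canvas used here, the areas of the leaf rectangles (12, 4 and 16) keep the 3:1:4 ratio of the leaf sizes, which is what makes treemaps convenient for comparing the sizes of classes at a glance.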

To compare two trees directly, methods such as tanglegram alignment have been proposed [78]. In this approach, two trees are mirrored against each other and lines connect the leaves that are equivalent to each other. Alternatively, color schemes can highlight the taxa that differ between the two trees. As tree hierarchies vary in size and can hold thousands of taxa, direct comparison, navigation and exploration remain a problem: the aforementioned approaches succeed in organizing the data efficiently but often fail to deliver them visually to the user in efficient ways. An example is the visualization of the tree of life versus the visualization of the forest of life [79]. Image tiling [80] methods, which generate large images, break them into smaller pieces at different resolutions and recombine them (as in Google Earth), could be a solution, but this area remains open to further investigation.
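How readable a tanglegram is depends largely on how many of the connecting lines cross. The short Python sketch below only counts those crossings for two given leaf orders; it is not the alignment method of [78], which would additionally rotate internal nodes to reduce the count. The leaf labels and the function name are hypothetical.

```python
from itertools import combinations


def tanglegram_crossings(left_order, right_order):
    """Count pairs of connecting lines that cross between two facing leaf orders."""
    rank = {leaf: i for i, leaf in enumerate(right_order)}
    shared = [leaf for leaf in left_order if leaf in rank]
    # Two lines cross when their leaves appear in opposite order on the two sides.
    return sum(1 for a, b in combinations(shared, 2) if rank[a] > rank[b])


left = ["A", "B", "C", "D"]
print(tanglegram_crossings(left, ["A", "B", "C", "D"]))  # 0
print(tanglegram_crossings(left, ["A", "C", "B", "D"]))  # 1
print(tanglegram_crossings(left, ["D", "C", "B", "A"]))  # 6
```

A layout procedure can use such a count as its objective: among the leaf orders that each tree permits, it looks for the pair with the fewest crossings.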

Table 2 Tools to represent hierarchies

3 Next Generation Sequencing

Recent technological improvements have led to great steps towards the understanding of the genome, its genes, their expression and their function. While the Human Genome Project (1990–2003) allowed the release of the first human reference genome by determining the sequence of ~3 billion base pairs and identifying the ~25,000 human genes [81–83], current technologies allow the sequencing of a whole exome in a few days and at a very low cost. The first generation sequencing technique was developed in 1977 and is known as the Sanger (dideoxy) [24] technique. High-throughput second generation technologies have already been developed by Illumina [84], Roche/454 [85] and Applied Biosystems/SOLiD [86], while Helicos BioSciences [87], Pacific Biosciences [88], Oxford Nanopore [89] and Complete Genomics [90] belong to the third generation of sequencing techniques. Similarly to DNA sequencing, RNA sequencing [91, 92], which today allows the simultaneous measurement of gene expression in a cell, and ChIP-sequencing, which combines immunoprecipitation with massively parallel DNA sequencing mainly to identify DNA regions that are binding sites for proteins such as transcription factors [93], are now more feasible and more accurate thanks to these rapid technological advances. Projects like the 1000 Genomes Project (started in 2008), which sequences a large number of human genomes to provide a comprehensive resource for human genetic variation [94], and the International HapMap Project [95–99], which identifies common genetic variations among people from different countries, show the broad spectrum of applications of such technologies and the scale of the data that they can process.

Advances in high throughput next generation sequencing techniques allow the production of vast amounts of data in different formats that currently cannot be analyzed in a non-automated way. Visualization approaches are today called to cope with huge amounts of data, efficiently analyze them and deliver the knowledge to the user in a visual, easier to grasp, way. User friendliness, pattern recognition and knowledge extraction are the main targets that an optimal visualization tool should excel in. Issues such as de novo genome assemblies, SNP identification, visualization of structural variations, whole genome alignment, alignment of short reads, comparisons between several genomes simultaneously, alignment of unfinished genomes, intra/inter chromosome rearrangements, identification of functional elements and display of sequencing data and genome annotations are still open fields for visualization. Therefore, tasks like handling the overload of information, displaying data at different resolutions, fast searching or smoother scaling and navigation are not trivial when the information to be visualized consists of millions of elements and reaches an enormously high complexity. Modern libraries, able to scale millions of data points smoothly and visualize them using different resolutions are essential. While established genome browsers (Fig. 3) such as Ensembl [100, 101], UCSC Genome Browser [102] and IGV [103] are able to partially address some of the aforementioned challenges, visualization of genomic data in this respect is still an underdeveloped field.

Fig. 3

P04637 (P53_HUMAN) tumor suppressor protein is found in Chromosome 17 in positions: 6,375,874–878,791,773 and visualized by the UCSC Genome browser at the highest resolutions. A red mark shows where TP53 is located on the chromosome, and information about alignments, SNPs, mRNA coding regions and others is shown. Notably, one can interactively zoom in and out to see the information down to the single-nucleotide level

4 Network Biology

In Systems and Integrative Biology, often bioentities are interconnected with each other and are represented as networks where nodes (bioentities) are linked with edges. Several categories of different biological networks already exist [104] such as protein–protein interactions networks, signal transduction networks, pathways, knowledge and integration networks (where bioentities are found to be related in literature or in records of public repositories), metabolic and biochemical networks, or gene regulatory networks which picture the factors that control gene expression.

While it is not within the scope of this section to present a thorough review of available repositories for each individual network category, we shortly mention experimental, computational and high throughput techniques to detect protein–protein interactions in order to give an overview of a few of the available repositories to demonstrate the size complexity of the available data and their heterogeneity. Thus, the most widely used experimental methods include pull down assays [105], tandem affinity purification (TAP) [106], yeast two hybrid systems (Y2H) [107], mass spectrometry [108], microarrays [109] and phage display [110]. Furthermore, computational methods such as MCODE [111], jClust [112], Clique [113], LCMA [114], DPClus [115], CMC [116], SCAN [117], Cfinder [118], GIBA [119] or PCP [120] are graph-based algorithms that use graph theory to detect highly connected subnetworks. DECAFF [121], SWEMODE [122] or STM [123] have been developed to predict protein complexes incorporating graph annotations, whereas others like DMSP [124], GFA [125] and MATISSE [126] also take the gene expression data into account. A very useful review article that describes and compares the aforementioned techniques can be found in [127].

Such biological networks share common characteristics, but they can differ significantly in their topology and in properties such as the number of highly connected nodes or regions, their average eccentricity, betweenness or other types of centrality, their shortest paths or their clustering coefficient. Protein–protein interaction networks tend to have hubs, whereas signal transduction networks do not. Today a wide variety of network-specific tools exists (Table 3), as reviewed in [104, 128, 129], but network visualization remains an active field with many challenges to be addressed, as the amount of data increases exponentially and the annotation databases expand continuously.
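Two of the properties mentioned above, node degree (which identifies hubs) and the clustering coefficient, can be computed directly from an edge list. The Python sketch below uses the standard local clustering coefficient, i.e. the fraction of a node's neighbor pairs that are themselves connected; the toy network and the function names are made up for illustration.

```python
from itertools import combinations


def build_graph(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(adj, node):
    """Fraction of neighbor pairs of `node` that are directly connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Made-up network: one hub with four partners, two of which also interact.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
adj = build_graph(edges)
degree = {node: len(neighbors) for node, neighbors in adj.items()}

print(max(degree, key=degree.get))        # hub
print(round(average_clustering(adj), 3))  # 0.433
```

Networks with the same number of nodes and edges can differ widely in these values, which is one reason why a layout that works well for one category of network may work poorly for another.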

Currently the most widely used network representations are node-link diagrams, in which bioentities are represented as nodes and the interactions between them as edges, often forming a hairball; distance or similarity matrices, which hold information about every pairwise relationship (N(N − 1)/2 pairs for N nodes); and hybrid views that combine the two. While matrices are often preferred for larger-scale networks, all of these approaches scale poorly once a network consists of a few thousand nodes and edges. To make large-scale biological networks more informative, several layout algorithms [130] try to reveal the properties of the network, for example by showing the hubs with a force-directed algorithm, while simultaneously minimizing the crossovers between the lines. Similarly, for matrices, various ordering algorithms try to order the columns and rows of a distance matrix so that highly connected regions become more visible. Established tools such as Ondex [131], Pajek [132], Cytoscape [133], Medusa [134] and VisANT [135] for real-world-like networks, iPath [136], PATIKA [137] and PathVisio [138] for pathways, and EXPANDER [139], HCE [140] and ExpressionProfiler [141] for expression data implement such layouts; most of them project data onto a 2D plane and use advanced navigation techniques to make data exploration easier. Other tools such as Arena3D [142, 143] or BioLayout Express 3D [144] take advantage of 3D space to show data in a universe. While still very few of the aforementioned tools try to fill the gap between analysis and visualization, efforts have been made in recent years. The ClusterMaker [145] plugin for Cytoscape and the jClust [112] application, for example, cluster the data within the application without the help of an external application (Fig. 4). Similarly, the CentiBiN application [146] computes and visualizes different vertex and graph centrality measures.
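The principle behind a force-directed layout can be sketched as follows. The Python code is a simplified illustration in the spirit of the Fruchterman-Reingold scheme: all node pairs repel each other, linked nodes attract each other, and the allowed movement shrinks over the iterations. It is not the implementation used by any of the tools above; the parameter values, the toy network and the function name are assumptions made for the example.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=200, size=1.0, seed=0):
    """Return {node: (x, y)} with all coordinates inside a size-by-size square."""
    rng = random.Random(seed)                  # seeded so the result is reproducible
    pos = {n: [rng.uniform(0, size), rng.uniform(0, size)] for n in nodes}
    k = size / math.sqrt(len(nodes))           # preferred distance between nodes
    temperature = size / 10.0                  # largest step allowed per iteration
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion: every pair of nodes pushes apart with force k^2 / d.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[u][0] += dx / d * f
                disp[u][1] += dy / d * f
                disp[v][0] -= dx / d * f
                disp[v][1] -= dy / d * f
        # Attraction: linked nodes pull together with force d^2 / k.
        for u, v in edges:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[u][0] -= dx / d * f
            disp[u][1] -= dy / d * f
            disp[v][0] += dx / d * f
            disp[v][1] += dy / d * f
        # Move each node, capped by the temperature, and keep it on the canvas.
        for n in nodes:
            dx, dy = disp[n]
            d = math.hypot(dx, dy) or 1e-9
            step = min(d, temperature)
            pos[n][0] = min(size, max(0.0, pos[n][0] + dx / d * step))
            pos[n][1] = min(size, max(0.0, pos[n][1] + dy / d * step))
        temperature *= 0.95                    # cool down gradually
    return {n: (x, y) for n, (x, y) in pos.items()}


# Made-up star-shaped network: one hub linked to four partners.
nodes = ["hub", "a", "b", "c", "d"]
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
layout = force_directed_layout(nodes, edges)
for name, (x, y) in layout.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

The pairwise repulsion step visits N(N − 1)/2 node pairs in every iteration, which illustrates why such layouts become slow for networks with thousands of nodes unless approximations are used.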

Fig. 4

A yeast protein–protein interaction network [147] consisting of ~1600 proteins is analyzed by Cytoscape. a A force-directed layout algorithm is applied to the network. b The network was clustered with the MCL algorithm and ~240 clusters were produced. c The zooming functionality enables the user to see the clusters and the node labels in detail. d Connectivity degree versus number of nodes is plotted to show some network characteristics. Notably, different combinations of node properties can also be plotted, such as the clustering coefficient (0.42 for the specific network) versus network centralization (0.053 for this network)

While network visualization is a developing area, there is much room for improvement: for example, time series data, network evolution and dynamics are important features that still need to be represented visually. Similarly, node aggregation, edge bundling, faster and more efficient layout algorithms and their extension into 3D space, multi-dimensional data visualization, semantic zooming, interactivity and data integration remain open problems in network biology.

Table 3 Tools in network biology

5 Visualization in Biology—the Present and the Future

In the preceding sections we broadly discussed visualization tools that can be applied to different biological areas of the “omics” spectrum. These mainly include software for genome analysis, microarrays, molecular structures, phylogenies, alignments and network biology. Despite the tremendous efforts over approximately the past 20 years [148] to develop better, more efficient, more interactive and more user-friendly visualization tools, many difficulties still need to be addressed; because all of these tools share common characteristics, the future challenges partially overlap.

So far, there has been a tendency to produce tools that run mainly as standalone applications and read only their own file formats. While this has slowly changed over time, it is still a limiting factor, as integration needs to come to the foreground. Visualization tools should therefore share common human- and computer-readable file formats in order to exchange information more easily. This demand for integration can be partially met when each tool is released with its own API or when specific web services are implemented for data exchange. In addition, it is highly recommended to make tools directly available through a web interface (e.g. Flash, JavaFX, Processing.org, Applets) or directly downloadable through other technologies such as JNLP (Java Web Start) in the case of a Java implementation. Such an effort would greatly help to bridge the gap between analysis and visualization, as visualization tools often rely on external packages to perform typical analyses that are not embedded in the tool. A clear example of such a gap is network biology, where the nodes usually represent bioentities and the edges the connections between them. As such networks increase in size and complexity, clustering analysis to categorize the data and investigate the clusters individually is often in demand. Unfortunately, very few visualization tools today offer functionality to cluster data within the visual application. Another example is genomic data analysis, where tasks such as SNP and variation calling, genome assembly or genome alignment must first be performed individually and sequentially, and the results must then be visualized by different tools after reformatting them to each tool's specific input format.
In conclusion, it is expected that future visualization tools will follow gold standards in terms of data storage, analysis and integration, since manually merging software packages and combining their functionalities requires some expertise and is tedious and time-consuming. A first step would be to provide tools with a pluggable architecture, so that users can implement their own plugins for a tool based on their own expertise.

During the past 10 years, progress has been made in moving away from static images and coping with the increasing size complexity by handling biological data interactively. This includes operations such as efficient zooming, panning and navigation. Notably, multi-touch screens today encourage more modern and less conservative interfaces that handle multiple events simultaneously to increase interactivity. Similarly, vibrations could potentially be used to get the attention of the user when a property or a characteristic of the system changes. A characteristic example is MacOS, where icons start to vibrate in order to indicate that the corresponding process is running. Apart from these operations, biological data often need to be explored at multiple scales and at different resolutions. Similarly to the GoogleEarth application, which can be used to explore maps from different heights, one could imagine the biological world as a universe that can be observed from the level of the organism down to the cell or the atom. In the case of genome browsers, for example, a genome can be explored at chromosome, gene or even single-nucleotide resolution. Similarly, in biological networks, node aggregation or edge bundling could be applied while exploring the network at different levels. Exploring multi-scale data normally requires pre-processing and pre-indexing, as the enormous amount of data does not allow such calculations on the fly. The GoogleEarth application is a good model to follow: real-world images for a specific resolution are indexed and stored in a database and are loaded on the fly upon the user's request.
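The pre-indexing idea can be illustrated with a small resolution pyramid for a per-base signal such as read coverage. In the Python sketch below, each coarser level is computed once in advance by keeping the maximum of neighboring bins, and a query then picks the finest level that fits the number of points the display can show. The data, the choice of the maximum as summary and the function names are assumptions made for the example, not a description of how any particular genome browser stores its data.

```python
def build_pyramid(values, factor=2):
    """Precompute coarser views: each level keeps the maximum of `factor` neighbors."""
    levels = [list(values)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([max(prev[i:i + factor]) for i in range(0, len(prev), factor)])
    return levels


def query(levels, start, end, max_points, factor=2):
    """Return (level, bins) for base range [start, end) at the finest level that fits."""
    for depth, level in enumerate(levels):
        bin_size = factor ** depth
        lo = start // bin_size
        hi = -(-end // bin_size)               # ceiling division
        if hi - lo <= max_points:
            return depth, level[lo:hi]
    return len(levels) - 1, levels[-1]


# Made-up coverage values for eight consecutive positions.
coverage = [1, 3, 2, 8, 5, 5, 0, 4]
levels = build_pyramid(coverage)

print(query(levels, 0, 8, max_points=4))  # (1, [3, 8, 5, 4])  zoomed out
print(query(levels, 2, 4, max_points=4))  # (0, [2, 8])        zoomed in
```

Because every level is computed before the user starts navigating, zooming out only requires reading a shorter precomputed list instead of summarizing the full-resolution data again.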

Besides user interface challenges, progress in biological data management has been made over the past years. Current technologies, infrastructures and architectures allow the parallel processing of information at significantly lower costs. Unfortunately, not many visualization tools for biology make use of these technologies today, an exception being tools implemented for biological image analysis, as in the case of microscopy. Libraries like CUDA, which allows parallel calculations on Graphical Processing Units (GPUs), protocols like the Message Passing Interface (MPI), which distributes computational tasks to computers over a network, and multi-core supercomputers with multiple CPUs are ways to significantly reduce the processing and running time for huge-scale data. Similar to architectures, display hardware such as large screens, tiled arrays or virtual reality environments, which offer a very large space onto which data can be projected, should be taken into consideration by programmers and designers as it becomes more and more affordable over time. A great advantage of such technologies is that they allow the dataset to be represented as a whole, without the need for algorithms that project data to lower dimensions, something that can lead to information loss.

As visualization in biology evolves rapidly, a great variety of new visualization concepts and representations appear. While this is encouraging and can become a source of inspiration for other fields such as economics, physics, environmental or social studies, gold standards concerning design, interactivity and prototyping should be strictly defined, aiming to maximize human–computer interaction. In addition, as visualization tools are designed for a broad range of users, prototypes should take less common cases into consideration, such as choosing color schemes carefully, as 10 % of the population suffers from color-blindness.

As biological systems are highly dynamic, visualization tools that capture the behavior and property changes of such systems, and how they evolve over time, are a necessity. Approaches that depict how the properties of a system change currently include parallel coordinates, 3D representations using multi-layered graphs, and animations. The effectiveness of animation, however, is often very low and limited by human perception capabilities, mainly due to changes in the user’s mental map of the structure. More efficient approaches should be implemented to tackle this problem, as time-series visualization for biology is still a very immature field.

Finally, an ideal visualization system of the future should be able to track users’ preferences and learn a user’s behavior while he or she explores specific data types. After training, such a system could anticipate the user’s preferences and suggest possible solutions, which would minimize the time cost of solving a problem. SVMs, SOMs, neural networks and other approaches have evolved significantly and can be used as initial steps for such user profiling. Concerning data parameterization, visual analytics approaches that require human judgment are followed today, as data properties and results can vary significantly when one changes the parameters of a software tool or workflow. Automating procedures such as finding optimal parameters and profiling still remains a bottleneck.