1 Visualization in Climate Science

The output generated by current climate simulations is increasing in both size and complexity. Both aspects pose equally large challenges for the visualization and, ideally, an interactive visual analysis of the simulated data. The increase in complexity is due to the maturing of models that are able to better describe the intricacies of the climate system, while the gain in size is a direct result of finer spatial and temporal resolutions.

ICON, the ICOsahedral Non-hydrostatic model that is jointly developed by the Max Planck Institute for Meteorology (MPI-M) and the German Weather Service (DWD), is a framework based on an icosahedral grid with an equal area projection, on which data sets are sampled via primal triangular cells, dual hexagonal cells and hybrid quadrilateral cells [1]. Figure 1 shows the horizontal layout of the ICON grid, visualizing the relationship between cell (triangle) and point (hexagon) data. The vertical layout is a rectilinear grid that is sampled more densely close to the Earth’s or ocean’s surface. ICON – though unstructured – has several advantages over other grids that are regularly used in climate science: it has no computational poles, it allows for easy refinement in local areas, and it provides a simplified coupling between its oceanic, atmospheric and land components. Over the last years, ICON was extended to permit large eddy simulations at cloud resolving resolutions in a regional setup as part of the HD(CP)\(^2\) project (Footnote 1) [2] to advance the understanding of clouds, cloud formation and precipitation processes. Although the data produced was quite large (22 million cells in 2D, 3.5 billion cells in 3D), a classic post visualization approach using ParaView with a parallel processing/visualization setup on several fat nodes was still possible.

Fig. 1. Horizontal ICON grid layout showing triangles (cell data) and hexagons (point data).

Within the recently started EU-funded ESiWACE2 project (Footnote 2), the spatial – and the temporal – resolution will be further refined down to 1.25 km globally, resulting in approximately 360 million cells per level and – depending on the number of levels – around 30 to 60 billion cells in 3D, per variable and time step. In order to explore, or even just access, such data – let alone write it to disk – workflows other than the currently employed post visualization are necessary. Examples include in-situ visualization in its many forms and in-situ compression/transformation, which reorders and possibly also compresses the data to make it accessible within a modified post visualization pipeline. This paper examines these approaches from a climate science perspective, focusing explicitly on the visualization needs of climate science. It also discusses some of the initial implementations and highlights first results.

2 In-Situ Visualization

The idea of in-situ visualization dates back to the golden era of coprocessing in the 1990s [3]. As a buzzword in HPC, the term in-situ is currently almost as popular as data avalanche or I/O bottleneck, and has reverberated through scientific conferences and papers for years [3, 4]. However, the majority of users have so far tried to avoid it, united in the hope that faster hardware could remedy the problems before the need to resort to an in-situ visualization approach would become imminent. Data analysis and visualization at DKRZ has likewise been based entirely on the classic post visualization workflow, but this is – driven by projects such as ESiWACE2 – about to change. Only one simulated day with 30 min output, 75 levels and eight 3D plus twenty 2D variables – there are of course many more variables in the model that would be worth looking at – accumulates to \(\approx \)43 TB (single precision), which is substantial given that such weather simulations often run for several days, weeks or even months.
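As a back-of-the-envelope check – assuming these numbers refer to the 1.25 km global grid from the previous section (\(\approx \)360 million cells per level) and 48 output steps per simulated day – the estimate can be reproduced as follows:

```latex
% Hedged estimate; cell counts and output frequency as assumed above.
\begin{align*}
  N_{\mathrm{values}} &= \underbrace{8 \cdot 75 \cdot 360\times 10^{6}}_{\text{3D variables}}
                        + \underbrace{20 \cdot 360\times 10^{6}}_{\text{2D variables}}
                        \approx 2.23\times 10^{11}\\
  S_{\mathrm{step}}   &= N_{\mathrm{values}} \cdot 4\,\mathrm{B} \approx 893\,\mathrm{GB}\\
  S_{\mathrm{day}}    &= 48 \cdot S_{\mathrm{step}} \approx 43\,\mathrm{TB}
\end{align*}
```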

Several software packages, such as VisIt [5] and ParaView [6, 7], already provide in-situ visualization capabilities. Other exascale visualization initiatives, such as ALPINE/Ascent [8] or SENSEI [9], build directly upon these tools and extend their functionality. Owing to our great familiarity and satisfaction with ParaView, we at DKRZ have so far experimented only with Catalyst, a VTK-based in-situ visualization framework bound to ParaView [4, 10]. ParaView is thereby employed to generate a Python script that later drives the in-situ processing to create standard visualizations, to threshold and write out reduced data sets, and to create a CINEMA database. The connection to the ICON model is implemented using a so-called Catalyst adaptor (cf. Fig. 2) that handles the data transfer from ICON (FORTRAN) to ParaView/Catalyst (C++). We see the following applications/data flows as use cases:

  • Data reduction

  • Verifying the simulation during run time

  • Generation of data quicklooks and previews

  • Feature detection, extraction and tracking

Once the data is on the C++ side of the adaptor, it can flow in multiple directions, as outlined in Fig. 2. Catalyst already supports a number of these applications and comes with a variety of examples. Initially, we planned to base our in-situ developments on the implementation developed by the MPAS group (Footnote 3) [11]. However, as that code is quite complex and would have required a lot of work to be customized for ICON, we decided to develop a new Catalyst adaptor from scratch, tailored directly to the ICON model. This proved to be the right decision, as only a few hundred lines of FORTRAN and C++ code, along with some minor modifications within the ICON code itself, already allowed us to run first in-situ visualization experiments. In-situ visualization can be performed either loosely or tightly coupled, that is on dedicated visualization nodes, or on the same compute nodes on which the simulation is run. We have not yet experimented with different configurations and so far have only used the tightly coupled setup with an equal number of visualization/simulation processes per node. The data, i.e. the grid information as well as the actual data variables, is transferred from FORTRAN to C++ as zero-copy arrays.
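For illustration, the following is a heavily condensed sketch of how such a hand-over can look on the C++ side, based on the legacy Catalyst (vtkCPProcessor) API; all function and field names (icon_catalyst_initialize, icon_catalyst_coprocess, "clw") are hypothetical and not the actual ICON adaptor code:

```cpp
// Sketch of a Catalyst adaptor, callable from FORTRAN via ISO_C_BINDING.
// Names and structure are illustrative assumptions, not the ICON adaptor.
#include <vtkCPDataDescription.h>
#include <vtkCPInputDataDescription.h>
#include <vtkCPProcessor.h>
#include <vtkCPPythonScriptPipeline.h>
#include <vtkCellData.h>
#include <vtkDoubleArray.h>
#include <vtkSmartPointer.h>
#include <vtkUnstructuredGrid.h>

static vtkSmartPointer<vtkCPProcessor> processor;
static vtkSmartPointer<vtkUnstructuredGrid> grid;  // built from the transferred
                                                   // ICON grid information (omitted)

// Called once during model initialization.
extern "C" void icon_catalyst_initialize(const char* script)
{
  processor = vtkSmartPointer<vtkCPProcessor>::New();
  processor->Initialize();
  auto pipeline = vtkSmartPointer<vtkCPPythonScriptPipeline>::New();
  pipeline->Initialize(script);               // ParaView-generated Python script
  processor->AddPipeline(pipeline);
}

// Called every output step; 'clw' points directly to the FORTRAN field.
extern "C" void icon_catalyst_coprocess(double time, int step,
                                        double* clw, long nCells)
{
  auto desc = vtkSmartPointer<vtkCPDataDescription>::New();
  desc->AddInput("input");
  desc->SetTimeData(time, step);
  if (processor->RequestDataDescription(desc) == 0)
    return;                                   // pipeline needs no data this step

  // Zero-copy wrap: VTK only references the FORTRAN memory, no copy is made.
  auto cloudWater = vtkSmartPointer<vtkDoubleArray>::New();
  cloudWater->SetName("clw");
  cloudWater->SetArray(clw, nCells, /*save=*/1);
  grid->GetCellData()->AddArray(cloudWater);

  desc->GetInputDescriptionByName("input")->SetGrid(grid);
  processor->CoProcess(desc);
}
```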

The maximum number of nodes that we have used so far for parallel simulation/in-situ processing is 540 nodes with 4320 MPI processes, i.e. one sixth of our HPC cluster. The ICON atmosphere was run in a global setup with 2.5 km horizontal resolution and 75 height levels, thresholding the two 3D variables cloud liquid water and cloud ice, which were written out to disk. A threshold of 1.0 \(\times \) 10\(^{-7}\) kg kg\(^{-1}\) (kilogram of water/ice per kilogram of air) was applied to discard the empty cells and to save only those above this value. This resulted in a mesh reduction from 6.5 billion cells down to \(\approx \)150 million cells per 3D variable, which can now easily be handled in a classic post analysis/visualization workflow. Figure 3 shows a visualization of both variables along with an annotation of the Earth’s orography. Additionally, a standard visualization (single snapshot per time step) was created, along with a CINEMA database providing quicklooks and previews of the data.
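The extraction itself corresponds to a standard VTK threshold filter followed by a writer; a minimal sketch – assuming the cell array is called "clw" and using the classic ThresholdByUpper API – might look as follows:

```cpp
// Minimal sketch: keep only cells whose cloud water exceeds 1e-7 kg/kg and
// write the result to disk. Array and file names are illustrative assumptions.
#include <string>
#include <vtkDataObject.h>
#include <vtkSmartPointer.h>
#include <vtkThreshold.h>
#include <vtkUnstructuredGrid.h>
#include <vtkXMLUnstructuredGridWriter.h>

void writeThresholdedCells(vtkUnstructuredGrid* grid, int step)
{
  auto threshold = vtkSmartPointer<vtkThreshold>::New();
  threshold->SetInputData(grid);
  threshold->SetInputArrayToProcess(0, 0, 0,
      vtkDataObject::FIELD_ASSOCIATION_CELLS, "clw");
  threshold->ThresholdByUpper(1.0e-7);        // discard the (nearly) empty cells

  auto writer = vtkSmartPointer<vtkXMLUnstructuredGridWriter>::New();
  writer->SetInputConnection(threshold->GetOutputPort());
  writer->SetFileName(("clw_" + std::to_string(step) + ".vtu").c_str());
  writer->Write();
}
```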

The additional time required to initialize and actually perform the in-situ visualization is almost negligible compared to the time required for the simulation itself. The initialization of the model, as described in the run above, takes 71 s, of which Catalyst needs 6.1 s (\(\approx \)11%). The first workload, i.e. the transfer of the grid data and the Catalyst initialization, takes an additional 1.6 s. The model needs on average about 408 s (total: 408.027 s) to advance one time step, of which Catalyst uses on average 0.1 s (max 0.8 s; total: 12.000 s, \(\approx \)3%) to create a standard visualization (single image) and to write out the two thresholded 3D variables.

Fig. 2. In-situ visualization/processing pipeline using ParaView/Catalyst.

CINEMA is a useful extension for ParaView developed by Los Alamos National Laboratory; it allows an image-based analysis and visualization of large data sets by creating multiple views of the data per time step and storing these images together with a data description. Several applications exist to efficiently access such a CINEMA database, including a web-based viewer [12]. Our evaluation of CINEMA is unfortunately still incomplete, as several issues remain that are also documented on Kitware’s GitHub website (Footnote 4). Although the new SPEC-D CINEMA database format is formally implemented, currently only SPEC-A databases are written out, and at the time of writing these still have to be manually converted into SPEC-D databases to be consistent with the current CINEMA display tools (Footnote 5). Nevertheless, once working, CINEMA will be a great addition, as it allows scientists to quickly browse through large piles of data (images) in search of the proper data file to be analyzed in more detail. Other researchers have also utilized the CINEMA database to perform feature tracking of mesoscale ocean eddies using contours and moments directly on the image data [13].
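For reference, a SPEC-D database is essentially a directory of rendered images plus a data.csv file, with one row per image, the varied parameters as columns and a FILE column pointing to the artifact; a purely hypothetical excerpt (column and file names are illustrative) might look like this:

```
time,phi,theta,FILE
0,0,-90,image/time_0_phi_0_theta_-90.png
0,45,-90,image/time_0_phi_45_theta_-90.png
1800,0,-90,image/time_1800_phi_0_theta_-90.png
```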

Another direct advantage of using ParaView/Catalyst as the in-situ framework is the possibility to visualize the simulated data live while the simulation is still running on the supercomputer. This makes it possible to precisely track and supervise the progress of the simulation, and to abort it in case of errors. In practice, however, this feature will probably only be used in specific setups and for long simulations, as a batch scheduler manages the execution of the simulations.

Fig. 3. Catalyst-extracted 3D cloud ice (turquoise) and 3D cloud water (white) from a 2.5 km global ICON atmosphere simulation.

2.1 Feature Detection and Tracking

In-situ processing does not need to be limited to the generation of images and the storage of reduced data sets alone, but can easily be extended to the detection, extraction and tracking of interesting features or structures. Clouds are a popular object of study in the atmospheric sciences. Meteorologists are especially interested in their formation, as well as their development and evolution over time, i.e. the transition from one cloud type into another. Clouds and the various cloud types are listed and described in the cloud atlas (Footnote 6) and can be characterized using boundary conditions defined by specific levels of cloud water/ice, pressure, temperature, humidity, rain and upward wind velocity, as identified by the International Satellite Cloud Climatology Project (ISCCP) [14]. Figure 4 shows a visualization of an in-situ cloud classification based on a regional (Germany-centered) simulation at a cloud resolving resolution of 156 m from the HD(CP)\(^2\) project [2]. It shows four different cloud types over northern Germany: stratus (ST), stratocumulus (SC), altostratus (AS) and altocumulus (AC). This cloud classification is thresholded, i.e. the empty cells (no clouds) are discarded, and efficiently stored as a VTI (VTK integer array) file series on disk, to be later loaded and displayed in ParaView.
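As an illustration of the general idea – not of the exact boundary conditions used for Fig. 4 – the classic ISCCP mapping from cloud-top pressure and cloud optical thickness to cloud type [14] can be sketched as follows:

```cpp
// Illustrative sketch of the standard ISCCP cloud-type mapping based on
// cloud-top pressure (hPa) and cloud optical thickness [14]; the actual
// in-situ classification additionally evaluates fields such as cloud
// water/ice, humidity, rain and vertical velocity.
#include <string>

std::string isccpCloudType(double cloudTopPressure, double opticalThickness)
{
  if (opticalThickness <= 0.0)
    return "clear";                        // no cloud in this column

  if (cloudTopPressure < 440.0) {          // high clouds
    if (opticalThickness < 3.6)  return "cirrus";
    if (opticalThickness < 23.0) return "cirrostratus";
    return "deep convection";
  }
  if (cloudTopPressure < 680.0) {          // mid-level clouds
    if (opticalThickness < 3.6)  return "altocumulus";
    if (opticalThickness < 23.0) return "altostratus";
    return "nimbostratus";
  }
  if (opticalThickness < 3.6)  return "cumulus";        // low clouds
  if (opticalThickness < 23.0) return "stratocumulus";
  return "stratus";
}
```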

Another example that we have not yet implemented, but which has been shown by others to be beneficial [11], is an in-situ eddy census from a high-resolution ocean simulation, ideally accompanied by the extraction and display of additional quantitative information, such as the size, duration, speed and movement direction of the eddies, as well as how much energy and mass they transport.

Fig. 4. ISCCP in-situ cloud classification using HD(CP)\(^2\) data [2] with stratus (ST), stratocumulus (SC), altostratus (AS) and altocumulus (AC) clouds.

3 Progressive Data Visualization

One of the drawbacks of an in-situ visualization/processing approach is the indispensable need for a priori knowledge in order to extract and visualize the right features in the simulated data. Without domain knowledge of where to find interesting features and which isolevels and/or thresholds to choose, an in-situ visualization is likely to fail. An iterative approach is of course possible, but both time- and labour-intensive. The finer spatial and temporal resolutions of large scale simulations also reveal new – possibly previously unresolved – processes and correlations within the data. The thresholds and isolevels used in simulations at lower resolution can provide guidance, but are probably not a perfect fit at higher resolutions. To correctly find and visualize those new features and structures, one needs to work on the actual data in its original resolution, but the data has to be transformed and reordered to make it accessible. To achieve this, a progressive data visualization approach based on level-of-detail (LoD) rendering is required. Here, the data is decomposed into different resolutions (LoDs) and possibly also compressed before it is written out to disk in a way that facilitates later interactive access. After the data has been written to disk, a special visualization application that supports progressive LoD rendering is used for a classic user-driven post visualization of the data [15]. Such an application accesses the data in an out-of-core fashion: only the data that is relevant to the current level of detail and contained in the current view frustum is fetched, visualized and displayed.
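Conceptually, such a viewer iterates over the bricks of the LoD hierarchy and loads only those that belong to the requested refinement level and intersect the current view; the following sketch (with hypothetical Brick and Box types, not the API of an actual tool) illustrates this selection logic:

```cpp
// Minimal sketch of out-of-core, view-dependent LoD selection. The Brick
// layout and the axis-aligned approximation of the view frustum are
// illustrative assumptions.
#include <vector>

struct Box   { double min[3], max[3]; };
struct Brick { Box bounds; int level; long fileOffset; };  // one data block on disk

// Axis-aligned overlap test (a full implementation would test frustum planes).
bool overlaps(const Box& a, const Box& b)
{
  for (int i = 0; i < 3; ++i)
    if (a.max[i] < b.min[i] || b.max[i] < a.min[i]) return false;
  return true;
}

std::vector<Brick> selectBricks(const std::vector<Brick>& hierarchy,
                                const Box& viewBounds, int requestedLevel)
{
  std::vector<Brick> visible;
  for (const Brick& b : hierarchy) {
    if (b.level != requestedLevel) continue;        // wrong level of detail
    if (!overlaps(b.bounds, viewBounds)) continue;  // outside the current view
    visible.push_back(b);                           // only these bricks are fetched
  }
  return visible;
}
```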

Fig. 5. Exemplary application of the DHWT on a low-resolution ICON ocean data set: centroids of hexagons (top), edge midpoints in one direction (middle), centroids of triangles (bottom). Each row (left to right) shows the original data, the conversion to a triangular grid via icosahedral maps, the coarse data obtained by applying the DHWT on the converted grid, and the reconstructed data for two different thresholds: 10% discarded/90% preserved and 95% discarded/5% preserved.

3.1 Wavelet Decomposition and Compression

Wavelets are a classic tool for the level-of-detail decomposition of large data sets. There are numerous examples of how to use wavelets for an efficient LoD-based visualization of large data sets, yet these primarily focus on regular rectilinear grids [15, 16]. For irregular grids, such as ICON, this becomes a bit more difficult, but it is nevertheless still possible. Here, so-called icosahedral maps can be employed, which are designed to fit the geometry of the different cell configurations within the ICON model, as discussed in our prior publication [17]. As this research is still in progress, this section summarizes our previous accomplishments and outlines our current efforts in this direction.

Icosahedral maps capture the connectivity information of ICON in a highly structured two-dimensional hexagonal representation and facilitate a multi-resolution analysis of ICON data by applying a hexagonal version of the discrete wavelet transform (DHWT). A global ICON grid is thereby broken down into ten diamonds, in which the data can be accessed and processed more easily [17]; see also Fig. 6.

Fig. 6. Global ICON grid unfolded into a net consisting of ten diamonds. The vertex information of each diamond is stored in a 2D rectangular grid that corresponds to the hexagonal lattice associated with that diamond.
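In such a representation each diamond can be stored and addressed as an ordinary 2D array; a minimal sketch of this layout (field names are illustrative, not the data structures of [17]) could be:

```cpp
// Illustrative layout of an icosahedral map: the global ICON grid is split
// into ten diamonds, each holding its vertex (or cell) data in a 2D
// rectangular, row-major array that corresponds to the diamond's hexagonal
// lattice (cf. Fig. 6).
#include <array>
#include <vector>

struct Diamond
{
  int rows = 0, cols = 0;        // lattice dimensions of this diamond
  std::vector<double> values;    // row-major data, rows * cols entries

  double& at(int i, int j) { return values[static_cast<std::size_t>(i) * cols + j]; }
};

using IcosahedralMap = std::array<Diamond, 10>;  // one entry per diamond
```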

Figure 5 shows the principle of this wavelet decomposition for all three ICON grids using a low-resolution global ocean simulation. The left column shows the original data, followed by a mapping onto a triangular grid (icosahedral maps), a coarsened grid (lower level of detail) and two reconstructions. In order to observe how the data responds to compression, a quantile thresholding was applied, keeping only those detail coefficients whose magnitudes fall within a specified percentile range. The last two columns of Fig. 5 show that the wavelet transform responds very well to compression, i.e. to the discarding of certain details. The reconstructions for two different thresholds – 10\(\%\) and 95\(\%\) – are shown and demonstrate that even a very aggressive compression of 95\(\%\) – i.e. retaining only 5\(\%\) of the details – is feasible. While this is still work in progress, it clearly shows that a wavelet-based decomposition and lossy compression of ICON data is possible and desirable. The wavelet-based decomposition described here would be performed in-situ and will be implemented on the C++ side of the Catalyst adaptor in the form of an additional data flow branch, as outlined in Fig. 2. Researchers at NCAR in Boulder, Colorado, have investigated the possible gains and losses of various lossy compression algorithms and have shown that compression ratios of 1:5 are feasible without impairing the statistical signal of the data [18, 19].
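While the actual transform operates on the hexagonal lattices of the icosahedral maps [17], the underlying compression principle can be illustrated with a plain one-level Haar decomposition and quantile thresholding of the detail coefficients; the following sketch is purely illustrative and not the DHWT implementation itself:

```cpp
// Illustrative one-level Haar decomposition with quantile thresholding of the
// detail coefficients; the actual prototype applies a hexagonal DWT on the
// icosahedral maps [17].
#include <algorithm>
#include <cmath>
#include <vector>

struct HaarLevel { std::vector<double> coarse, detail; };

HaarLevel haarForward(const std::vector<double>& data)
{
  HaarLevel out;
  for (std::size_t i = 0; i + 1 < data.size(); i += 2) {
    out.coarse.push_back((data[i] + data[i + 1]) / 2.0);   // LoD approximation
    out.detail.push_back((data[i] - data[i + 1]) / 2.0);   // detail coefficient
  }
  return out;
}

// Zero out the given fraction (e.g. 0.95) of the smallest-magnitude details.
void quantileThreshold(std::vector<double>& detail, double discardFraction)
{
  if (detail.empty()) return;
  std::vector<double> mags(detail.size());
  for (std::size_t i = 0; i < detail.size(); ++i) mags[i] = std::fabs(detail[i]);
  std::sort(mags.begin(), mags.end());
  const double cut = mags[static_cast<std::size_t>(discardFraction * (mags.size() - 1))];
  for (double& d : detail)
    if (std::fabs(d) <= cut) d = 0.0;       // zeroed details compress very well
}

std::vector<double> haarInverse(const HaarLevel& lvl)
{
  std::vector<double> data;
  for (std::size_t i = 0; i < lvl.coarse.size(); ++i) {
    data.push_back(lvl.coarse[i] + lvl.detail[i]);   // reconstruct the cell pair
    data.push_back(lvl.coarse[i] - lvl.detail[i]);
  }
  return data;
}
```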

VAPOR, an interactive 3D visualization platform that is also developed at NCAR, can be used to load and display such wavelet-decomposed data sets. In fact, VAPOR features the so-called VAPOR Data Collection (VDC) data model that allows users to progressively load and visualize their data, thus enabling an interactive visualization of terascale data sets on commodity hardware [15, 20]. Initially designed to handle only regular rectilinear grids, it was more recently extended to additionally support the UGRID netCDF-CF standard (Footnote 7), i.e. allowing one to also load and display model data on irregular grids, such as MPAS and ICON simulation output.

4 Summary and Conclusion

We have discussed the current status of in-situ visualization/processing at DKRZ, along with a few other ideas for handling some of the extremely large data sets that are produced by current climate simulations. The work presented is still in progress, and although none of it is currently used in production mode, we expect the in-situ visualization and processing techniques to transition into production within a few weeks, once workflows that are easy to set up and deploy by climate scientists have been devised. After that, further in-situ feature detection and tracking, such as the discussed ocean eddy census, will be added. Computational steering, as developed as part of the exascale visualization initiative SENSEI, is also interesting, but probably a topic that lies – at least for us – a bit further in the future [9]. The wavelet decomposition and compression outlined in Sect. 3 is implemented as a standalone prototype and still needs to be moved into the Catalyst adaptor and further optimized there.

Other interesting aspects include the in-situ generation of high-quality visualizations using raytracing. As our current cluster is based on Intel CPUs with relatively outdated GPUs, our current choice is OSPRay. Nevertheless, both OSPRay and OptiX are able to create renderings of superior quality and are already used at DKRZ within the conventional post visualization workflow [21, 22]. A transition of raytracing to in-situ will probably also take some more time.

Furthermore, a purely visual display and analysis of large simulations may not be very useful in the near future, as the data is so massive that a single user will have problems finding and detecting the interesting features in the visual output. Here, the currently popular machine learning techniques might prove useful to automatically check the plausibility of a simulation, to identify outliers and extreme weather/climate events, and to direct the attention of the researcher onto exactly those regions/time steps.