Keywords

1 Introduction

Data centers are increasingly complex and hence monitoring such systems is a daunting task for system administrators. In 2013, the Distributed Management Task Force (DMTF) released Redfish [17], an open industry standard specification for server configuration that aims to supersede IPMI over the network (IPMI over LAN). The web servers are expected to provide end users with simple, secure management of scalable platform hardware by enhancing security and reliability to Baseboard Management Controller (BMC). Redfish uses industry standard RESTful API architecture over Hyper Text Transfer Protocol Secure (HTTPS) using JavaScript Object Notation (JSON) format based on Open Data Protocol (OData) as depicted in Fig. 1. Redfish-enabled technologies enable IT systems to operate in harmony. Most importantly, a Redfish-enabled system delivers, and single API for everything [13]. One of the critical advantages of REST APIs is that it introduces a great deal of flexibility, allowing any software clients with RESTful capabilities to interact with Redfish firmware.

Fig. 1.
figure 1

Redfish reference architecture [13].

Building on top of Redfish RESTful interface, this paper introduces a visual analytic prototype for monitoring high-performance computing (HPC) system events. The goals of this prototype are: (1) to monitor the multidimensional health status of multiple hosts and racks in real-time, (2) to support system administrators in detecting unusual correlation of these services in a complex and highly dynamic system, and (3) to help in performing system troubleshooting and debugging. We demonstrate our prototype at a data center at a university.

The rest of this paper is organized as follows: We first summarize existing techniques for visualizing HPC in the next Section. Then we provide an overview of visualization tasks and describe the design and supported interactive features of our prototype in detail. In Sect. 4, we present the results of an informal study with HPC experts from a university and an industrial company. We argue that using visual features of pairwise projection we can capture abnormal correlations between various health dimensions in the data center. More importantly, these visual features can be computed automatically and notify the system administrator significant events. Finally, we conclude our paper with future plans for visual monitoring and predicting through the integration with Nagios and Redfish API.

2 Existing Approaches

Although not fully explored, there are many existing tools that support monitoring high-performance computing systems. For example, LLView [14] is a client-server based application which allows monitoring the utilization of clusters controlled by batch systems like IBM LoadLeveler, PBSpro, Torque, or IBM Blue Gene system database. However, multidimensional analysis proposed in this work is not in focus of LLview.

Fig. 2.
figure 2

Main window of LLview [14]: graphical monitoring of batch system controlled cluster.

Ganglia [16] is an open source PHP-based web front-end interface that allows users to gather information via real-time dynamic web pages. This information, including CPU usage, memory usage, disk usage, network statistics, the number of running processes, is plotted on similar graphs. It leverages the widely used XML technologies for data representation. The advantage of real-time responsive on the web front-end, however, leads to high latency due to the size of the XML tree. Thus, Ganglia is not practical on the less powerful machines when the amount of data is large (Fig. 2).

Nagios Core [6] is another open-source monitoring system that is capable of handling a variety of servers and operating systems with the industry standard. It provides a primary web interface for the core monitoring engine as depicted in Fig. 3(b). To trace a problem, however, an administrator cannot capture a holistic monitoring view by using the Nagios web interface. Even though basic filtering operations are provided, system status overview, which is useful to correlate the isolated temporal/spatial issues, can be lost [5, 11]. Figure 3(a) shows an example of the JSON query on CPU temperatures (on the left) and the corresponding results (on the right) returned from the server.

Fig. 3.
figure 3

Nagios Core [6] interfaces: (a) System summary in a simple listing format and (b) Json query on CPU temperatures and the corresponding results.

These existing tools inspect system status independently. This paper takes into account the correlations between various dimensions such as CPU temperature, memory usage, fan speed, power consumption, and I/O bandwidth. Analyzing the relationships of multidimensional data is essential to understand system behaviors as a whole [2].

3 Visualization Components

Through in-depth discussions with domain experts, we have identified the set of design goals: (1) Provides spatial and temporal overview across hosts and racks, (2) Allows system administrators to quickly narrow down to the event of interest for system debugging, and (3) Inspects the correlation of system health services in a single view. To meet these goals, this paper proposes several visualization tasks:

  • Overview Display (T1). Display an overview of real-time system status on the corresponding spatial layout [15].

  • Details-On-Demand (T2). Users can inspect multidimensional historical data of a host in the system via a simple click [18].

  • Filtering (T3). Highlight critical events on a host [3] and the associated time stamps [12].

  • Multidimensional analysis (T4). Explore the correlations between dimensions [4, 10] such as fan speeds, memory usage, and power consumption.

Fig. 4.
figure 4

Visualizing power consumption on Wednesday, October 17, 2018, of the 467-node Quanah cluster at Texas Tech University.

We leverage the Nagios Core engine for data retrieval through a RESTful API web interface. In other words, we iteratively request health status of every host int he system. For each host, we obtain a set of updated status as shown on the right panel of Fig. 3(a). In the next section, we present our approach for displaying status updates of hundreds of hosts within a single view. In the next section, we first introduce the HPC system at Texas Tech University and then visualization components and the supported interactions.

3.1 HPC System Spatial Layout

Figure 4 shows a snapshot power consumption of the 467-node Quanah cluster at Texas Tech University at noon on Sunday, May 13, 2018 (the visualization task T1). Within a rack, hosts are listed top down. Rows are power consumption time series (the temporal distance between two consecutive cells is 3 min, the updating period set up on the cluster. Rectangles on each row are the computed power consumption, which is colored: red is high power consumption nodes while blue is low power consumption nodes.

3.2 Multidimensional Analysis of Health Status

Through our discussions with domain experts, it is essential to have a holistic view of the entire data center on multiple dimensions for system monitoring and especially diagnosis. Therefore, our system supports a multi-axis visualization of hosts on demand (visualization task T2). In particular, when users mouse over a host, a pop-up window is displayed to unveil the details of host information as depicted in Fig. 5. Historical temperatures of each CPU in the selected host are presented in line graphs: The vertical axis represents the CPU temperature (from 0 to 100\(^\circ \) of Fahrenheit) while the horizontal axis represents the time.

Fig. 5.
figure 5

Visualizing CPU temperature on Wednesday, October 17, 2018 of the 467-node Quanah cluster. Hosts experiencing a sudden temperature increase of 25\(^\circ \) in 3 min.

The spider or cobweb chart [7] in the lower panel of the pop-up window shows health dimensions of the selected host (visualization task T4) which are grouped by category: CPU temperature (degrees Fahrenheit), job load (number of jobs assigned on the the given host), memory usage (from 0% to 100%), fan speeds, and power consumption (in watts). The purpose of this chart is to enable users to detect similar multidimensional patterns and how they change over time.

We decide to use the spider chart to display multidimensional health status of data center since the mental images of spider charts allow users to capture and compare historical data between different hosts quickly. Figure 6 shows four examples of spider charts from four random hosts in the Quanah cluster at Texas Tech University. In particular, each closed curve on the spider charts represents a multidimensional observation. The observed period is from 11 am to 7 pm on Wednesday, October 17, 2018.

Fig. 6.
figure 6

Visualizing multidimensional heath status of four hosts in the data center.

Users can request to show multidimensional spider chart across hosts in the system at a time stamp. Figure 7 shows our multidimensional visualization of 467 hosts at Texas Tech University at 2 pm on Wednesday, October 17, 2018. Brushing and linking between the summary view and HPC spatial layout can be done via mouse over. Our prototype also supports a range of interactive features, such as zooming and filtering (visualization task T3). In Fig. 5, we only display hosts which experienced a sudden CPU heat up of over 25\(^\circ \) Fahrenheit within the two consecutive time stamps.

Fig. 7.
figure 7

Visualizing multidimensional heath status at 2 pm on Wednesday, October 17, 2018, of the 467-node Quanah cluster at Texas Tech University.

4 Discussion and Future Work

Our multidimensional analysis approach received positive feedback from both industrial and academic domain experts and encouraged to explore further. One potential direction is inspecting pairwise correlations within a 2D projection [19]. Figure 8 show an example of the 467-node Quanah cluster projected on fan speed (x axis) vs. CPU temperature (y axis). The plot reveals two clusters along the x axis and an outlier (the compute 4–17) at the lower left corner which has low fan speed and temperature.

Fig. 8.
figure 8

Visualizing multidimensional heath status of all 467 hosts in the Quanah cluster at Texas Tech University: Fan speed vs. CPU temperature.

Instead of having a human looking into every single scatterplot, visual features [20] can help to simplify this process by highlighting only unusual correlations. The following examples in Fig. 9 summarize possible 2D projections that can be captured using these visual features [9].

Fig. 9.
figure 9

Some example scatterplots and their measures [8]. In each row, the scatterplots with a low score on the associated measures are on the left while the scatterplots with a high score on the associated visual features are on the right.

5 Conclusion

This paper present a graphical tool for situational awareness and multidimensional health status of data centers using real-time data gathered through the industry-standard Redfish protocol and API (v.s. the aging IMPI protocol). The system has two components: visual feature extraction for pairwise projection and spider chart for visualizing high-dimensional health status. Our prototype supports a wide range of interactive features such as brushing and linking, zooming, and filtering.

In future work, we want to incorporate machine learning framework, such as TensorFlow [1] to predict the system health status. The model will be trained on historical data, make real-time predictions, and raise the alarm to the system administrator for timely actions. Google recently released tensorflow.js (https://js.tensorflow.org) which can be naturally adapted in our project. Virtual Reality and Augmented Reality (VR and AR) are also interesting future extensions. VR and AR interfaces enable real-time monitoring beyond traditional displays by experiencing interactive visualization techniques on mobile devices as well as within immersive environments.