Abstract
Monitoring data centers is challenging due to their size, complexity, and dynamic nature. This project proposes a visual approach for situational awareness and health monitoring of high-performance computing systems. The visualization requirements are expanded on the following dimensions: (1) High performance computing spatial layout, (2) Temporal domain (historical vs. real-time tracking), and (3) System health services such as temperature, CPU load, memory usage, fan speed, and power consumption. To show the effectiveness of our design, we demonstrate the developed prototype on a medium-scale data center of 10 racks and 467 hosts. The work was developed using feedback from both industrial and acadamic domain experts.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
- Scatterplots
- Visual features
- High-dimensional data
- Data center visualization
- High-performance computing systems
- Redfish RESTful API
- Nagios Core
- Baseboard management controller
1 Introduction
Data centers are increasingly complex and hence monitoring such systems is a daunting task for system administrators. In 2013, the Distributed Management Task Force (DMTF) released Redfish [17], an open industry standard specification for server configuration that aims to supersede IPMI over the network (IPMI over LAN). The web servers are expected to provide end users with simple, secure management of scalable platform hardware by enhancing security and reliability to Baseboard Management Controller (BMC). Redfish uses industry standard RESTful API architecture over Hyper Text Transfer Protocol Secure (HTTPS) using JavaScript Object Notation (JSON) format based on Open Data Protocol (OData) as depicted in Fig. 1. Redfish-enabled technologies enable IT systems to operate in harmony. Most importantly, a Redfish-enabled system delivers, and single API for everything [13]. One of the critical advantages of REST APIs is that it introduces a great deal of flexibility, allowing any software clients with RESTful capabilities to interact with Redfish firmware.
Building on top of Redfish RESTful interface, this paper introduces a visual analytic prototype for monitoring high-performance computing (HPC) system events. The goals of this prototype are: (1) to monitor the multidimensional health status of multiple hosts and racks in real-time, (2) to support system administrators in detecting unusual correlation of these services in a complex and highly dynamic system, and (3) to help in performing system troubleshooting and debugging. We demonstrate our prototype at a data center at a university.
The rest of this paper is organized as follows: We first summarize existing techniques for visualizing HPC in the next Section. Then we provide an overview of visualization tasks and describe the design and supported interactive features of our prototype in detail. In Sect. 4, we present the results of an informal study with HPC experts from a university and an industrial company. We argue that using visual features of pairwise projection we can capture abnormal correlations between various health dimensions in the data center. More importantly, these visual features can be computed automatically and notify the system administrator significant events. Finally, we conclude our paper with future plans for visual monitoring and predicting through the integration with Nagios and Redfish API.
2 Existing Approaches
Although not fully explored, there are many existing tools that support monitoring high-performance computing systems. For example, LLView [14] is a client-server based application which allows monitoring the utilization of clusters controlled by batch systems like IBM LoadLeveler, PBSpro, Torque, or IBM Blue Gene system database. However, multidimensional analysis proposed in this work is not in focus of LLview.
Ganglia [16] is an open source PHP-based web front-end interface that allows users to gather information via real-time dynamic web pages. This information, including CPU usage, memory usage, disk usage, network statistics, the number of running processes, is plotted on similar graphs. It leverages the widely used XML technologies for data representation. The advantage of real-time responsive on the web front-end, however, leads to high latency due to the size of the XML tree. Thus, Ganglia is not practical on the less powerful machines when the amount of data is large (Fig. 2).
Nagios Core [6] is another open-source monitoring system that is capable of handling a variety of servers and operating systems with the industry standard. It provides a primary web interface for the core monitoring engine as depicted in Fig. 3(b). To trace a problem, however, an administrator cannot capture a holistic monitoring view by using the Nagios web interface. Even though basic filtering operations are provided, system status overview, which is useful to correlate the isolated temporal/spatial issues, can be lost [5, 11]. Figure 3(a) shows an example of the JSON query on CPU temperatures (on the left) and the corresponding results (on the right) returned from the server.
These existing tools inspect system status independently. This paper takes into account the correlations between various dimensions such as CPU temperature, memory usage, fan speed, power consumption, and I/O bandwidth. Analyzing the relationships of multidimensional data is essential to understand system behaviors as a whole [2].
3 Visualization Components
Through in-depth discussions with domain experts, we have identified the set of design goals: (1) Provides spatial and temporal overview across hosts and racks, (2) Allows system administrators to quickly narrow down to the event of interest for system debugging, and (3) Inspects the correlation of system health services in a single view. To meet these goals, this paper proposes several visualization tasks:
-
Overview Display (T1). Display an overview of real-time system status on the corresponding spatial layout [15].
-
Details-On-Demand (T2). Users can inspect multidimensional historical data of a host in the system via a simple click [18].
-
Filtering (T3). Highlight critical events on a host [3] and the associated time stamps [12].
-
Multidimensional analysis (T4). Explore the correlations between dimensions [4, 10] such as fan speeds, memory usage, and power consumption.
We leverage the Nagios Core engine for data retrieval through a RESTful API web interface. In other words, we iteratively request health status of every host int he system. For each host, we obtain a set of updated status as shown on the right panel of Fig. 3(a). In the next section, we present our approach for displaying status updates of hundreds of hosts within a single view. In the next section, we first introduce the HPC system at Texas Tech University and then visualization components and the supported interactions.
3.1 HPC System Spatial Layout
Figure 4 shows a snapshot power consumption of the 467-node Quanah cluster at Texas Tech University at noon on Sunday, May 13, 2018 (the visualization task T1). Within a rack, hosts are listed top down. Rows are power consumption time series (the temporal distance between two consecutive cells is 3 min, the updating period set up on the cluster. Rectangles on each row are the computed power consumption, which is colored: red is high power consumption nodes while blue is low power consumption nodes.
3.2 Multidimensional Analysis of Health Status
Through our discussions with domain experts, it is essential to have a holistic view of the entire data center on multiple dimensions for system monitoring and especially diagnosis. Therefore, our system supports a multi-axis visualization of hosts on demand (visualization task T2). In particular, when users mouse over a host, a pop-up window is displayed to unveil the details of host information as depicted in Fig. 5. Historical temperatures of each CPU in the selected host are presented in line graphs: The vertical axis represents the CPU temperature (from 0 to 100\(^\circ \) of Fahrenheit) while the horizontal axis represents the time.
The spider or cobweb chart [7] in the lower panel of the pop-up window shows health dimensions of the selected host (visualization task T4) which are grouped by category: CPU temperature (degrees Fahrenheit), job load (number of jobs assigned on the the given host), memory usage (from 0% to 100%), fan speeds, and power consumption (in watts). The purpose of this chart is to enable users to detect similar multidimensional patterns and how they change over time.
We decide to use the spider chart to display multidimensional health status of data center since the mental images of spider charts allow users to capture and compare historical data between different hosts quickly. Figure 6 shows four examples of spider charts from four random hosts in the Quanah cluster at Texas Tech University. In particular, each closed curve on the spider charts represents a multidimensional observation. The observed period is from 11 am to 7 pm on Wednesday, October 17, 2018.
Users can request to show multidimensional spider chart across hosts in the system at a time stamp. Figure 7 shows our multidimensional visualization of 467 hosts at Texas Tech University at 2 pm on Wednesday, October 17, 2018. Brushing and linking between the summary view and HPC spatial layout can be done via mouse over. Our prototype also supports a range of interactive features, such as zooming and filtering (visualization task T3). In Fig. 5, we only display hosts which experienced a sudden CPU heat up of over 25\(^\circ \) Fahrenheit within the two consecutive time stamps.
4 Discussion and Future Work
Our multidimensional analysis approach received positive feedback from both industrial and academic domain experts and encouraged to explore further. One potential direction is inspecting pairwise correlations within a 2D projection [19]. Figure 8 show an example of the 467-node Quanah cluster projected on fan speed (x axis) vs. CPU temperature (y axis). The plot reveals two clusters along the x axis and an outlier (the compute 4–17) at the lower left corner which has low fan speed and temperature.
Instead of having a human looking into every single scatterplot, visual features [20] can help to simplify this process by highlighting only unusual correlations. The following examples in Fig. 9 summarize possible 2D projections that can be captured using these visual features [9].
5 Conclusion
This paper present a graphical tool for situational awareness and multidimensional health status of data centers using real-time data gathered through the industry-standard Redfish protocol and API (v.s. the aging IMPI protocol). The system has two components: visual feature extraction for pairwise projection and spider chart for visualizing high-dimensional health status. Our prototype supports a wide range of interactive features such as brushing and linking, zooming, and filtering.
In future work, we want to incorporate machine learning framework, such as TensorFlow [1] to predict the system health status. The model will be trained on historical data, make real-time predictions, and raise the alarm to the system administrator for timely actions. Google recently released tensorflow.js (https://js.tensorflow.org) which can be naturally adapted in our project. Virtual Reality and Augmented Reality (VR and AR) are also interesting future extensions. VR and AR interfaces enable real-time monitoring beyond traditional displays by experiencing interactive visualization techniques on mobile devices as well as within immersive environments.
References
Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/
Allcock, W., Felix, E., Lowe, M., Rheinheimer, R., Fullop, J.: Challenges of HPC monitoring. In: SC 2011: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–6, November 2011. https://doi.org/10.1145/2063348.2063378
Amar, R., Eagan, J., Stasko, J.: Low-level components of analytic activity in information visualization. In: Proceedings of the IEEE Symposium on Information Visualization, pp. 15–24 (2005)
Anand, A., Wilkinson, L., Dang, T.N.: Visual pattern discovery using random projections. In: 2012 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 43–52 (2012). https://doi.org/10.1109/VAST.2012.6400490
Andrienko, N., Andrienko, G., Gatalsky, P.: Exploratory spatio-temporal visualization: an analytical review. J. Vis. Lang. Comput. 14(6), 503–541 (2003)
Barth, W.: Nagios: System and Network Monitoring. No Starch Press, San Francisco (2008)
Bremer, N.: A different look for the D3.js radar chart (2015). https://www.visualcinnamon.com/2015/10/different-look-d3-radar-chart. Accessed Oct 2018
Dang, T., Wilkinson, L.: Scagexplorer: exploring scatterplots by their scagnostics. In: Proceedings of the 2014 IEEE Pacific Visualization Symposium, PACIFICVIS 2014, pp. 73–80. IEEE Computer Society, Washington (2014). https://doi.org/10.1109/PacificVis.2014.42
Dang, T., Wilkinson, L.: Transforming scagnostics to reveal hidden features. IEEE Trans. Vis. Comput. Graph. 20(12), 1624–1632 (2014). https://doi.org/10.1109/TVCG.2014.2346572
Dang, T.N., Anand, A., Wilkinson, L.: TimeSeer: scagnostics for high-dimensional time series. IEEE Trans. Vis. Comput. Graph. 19(3), 470–483 (2013). https://doi.org/10.1109/TVCG.2012.128
Dang, T.N., Pendar, N., Forbes, A.G.: TimeArcs: visualizing fluctuations in dynamic networks. Comput. Graph. Forum (2016). https://doi.org/10.1111/cgf.12882
Dang, T.N., Wilkinson, L.: TimeExplorer: similarity search time series by their signatures. In: Bebis, G., et al. (eds.) ISVC 2013. LNCS, vol. 8033, pp. 280–289. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41914-0_28
of the Industry, V.: How redfish specifications can improve simplicity, visibility and control of it (2016). https://datacenterfrontier.com/redfish-specifications/. Accessed Oct 2018
Karbach, C., Valder, J.: System monitoring: LLview. In: Computational Science and Mathematical Methods. Introduction to the programming and usage of the supercomputer resources at Jülich November 2015 (Course no. 78a/2015 in the training programme of Forschungszentrum Jülich), 26–27 Nov 2015, Jülich, Germany (2015). https://juser.fz-juelich.de/record/279901
Keim, D.A., Panse, C., Sips, M.: Information visualization: scope, techniques and opportunities for geovisualization. In: Dykes, J. (ed.) Exploring Geovisualization, pp. 1–17. Elsevier, Oxford (2004)
Massie, M.L., Chun, B.N., Culler, D.E.: The ganglia distributed monitoring system: design, implementation, and experience. Parallel Comput. 30(7), 817–840 (2004)
Organization, S.D.: Distributed management task force (2013). https://www.dmtf.org/standards/redfish
Shneiderman, B.: The eyes have it: a task by data type taxonomy for information visualizations. In: Proceedings of the 1996 IEEE Symposium on Visual Languages, VL 1996, p. 336. IEEE Computer Society, Washington (1996). http://dl.acm.org/citation.cfm?id=832277.834354
Wilkinson, L., Anand, A., Grossman, R.: Graph-theoretic scagnostics. In: Proceedings of the IEEE Information Visualization 2005, pp. 157–164. IEEE Computer Society Press (2005)
Wilkinson, L., Anand, A.: Grossman: D3 data-driven documents. IEEE Trans. Vis. Comput. Graph. 17(12), 2301–2309 (2011)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Dang, T. (2019). Visualizing Multidimensional Health Status of Data Centers. In: Bhatele, A., Boehme, D., Levine, J., Malony, A., Schulz, M. (eds) Programming and Performance Visualization Tools. ESPT ESPT VPA VPA 2017 2018 2017 2018. Lecture Notes in Computer Science(), vol 11027. Springer, Cham. https://doi.org/10.1007/978-3-030-17872-7_17
Download citation
DOI: https://doi.org/10.1007/978-3-030-17872-7_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17871-0
Online ISBN: 978-3-030-17872-7
eBook Packages: Computer ScienceComputer Science (R0)