1 Introduction

Current High-Performance Computing (HPC) systems, such as Supercomputers [1], Data centers, and Cloud services, are becoming larger and larger and consume more and more power [2]. This trend increases the required space, the power consumption, and the cost of deploying and maintaining HPC systems.

Due to rapidly increasing requirements for performance, measured in FLoating-point Operations Per Second (FLOPS), for Power Efficiency (FLOPS/W), for Calculation Efficiency (Real FLOPS/Peak FLOPS), and for Size Efficiency (Real FLOPS per unit area), modern HPC systems are evolving toward heterogeneous computing [3]. Current heterogeneous HPC systems mostly use General Purpose Graphics Processing Units (GPGPUs) [4] and Application-Specific Integrated Circuits (ASICs) as accelerators to efficiently solve Artificial Intelligence (AI), Machine Learning (ML), Internet of Things (IoT), and Big Data analysis tasks [5]. The next step in the evolution of HPC systems is reconfigurable computing [6].
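
For illustration, the following short C++ sketch computes these efficiency metrics for a hypothetical node; all figures are assumed, not measured.

#include <iostream>

// A minimal sketch of the efficiency metrics named above, using purely
// illustrative (assumed, not measured) figures for a hypothetical node.
int main() {
    double peak_flops  = 20e12;   // peak performance, FLOPS
    double real_flops  = 7e12;    // sustained performance on the target task, FLOPS
    double power_watts = 300.0;   // power consumption, W
    double area_units  = 8.0;     // occupied area in arbitrary units (e.g., dm^2)

    std::cout << "Power Efficiency:       " << real_flops / power_watts << " FLOPS/W\n";
    std::cout << "Calculation Efficiency: " << real_flops / peak_flops  << " (Real/Peak)\n";
    std::cout << "Size Efficiency:        " << real_flops / area_units  << " FLOPS per unit area\n";
}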

Reconfigurable computing technology is based on the Field-Programmable Gate Array (FPGA) [7, 8]. An FPGA is an integrated circuit (IC) that can change its internal structure according to the task being solved. An FPGA consists of programmable logic cells, which can perform arbitrary logic/memory functions, and a programmable interconnection matrix, which connects the logic cells together to implement complex functions. An FPGA is programmed, or configured, by a binary file, called a configuration file, which sets up the logic cells and the interconnection matrix so that the FPGA implements the task being solved. A modern FPGA contains not only logic cells and an interconnection matrix but also Digital Signal Processing (DSP) blocks, Random Access Memory (RAM) blocks, High Bandwidth Memory (HBM), and hardware-implemented controllers and transceivers for external DDR memory, the PCIe interface, and 100G Ethernet.

A state-of-the-art FPGA can be configured on the fly: the FPGA can be configured for solving a new task during execution of the current task. It can also be partially configured: a part of the FPGA can be configured for a new task while the rest of the FPGA continues to solve the current task. Finally, an FPGA can be fully or partially configured through PCIe and Ethernet for solving a particular task. This means that an FPGA-based PCIe accelerator deployed on a Host, or an FPGA-based remote accelerator connected to a Host by a high-speed channel, for example 100G Ethernet, can be dynamically configured, or re-configured, on the fly to solve a particular task with the efficiency of a hardware implementation.
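
As an illustration of configuring a PCIe-attached accelerator from the host, the following minimal C++ sketch (assuming an OpenCL runtime for the FPGA board and a precompiled configuration file, hypothetically named task.xclbin) loads a new configuration binary onto the device with the standard clCreateProgramWithBinary call; it sketches the general flow only, not the exact code of any particular system.

#include <CL/cl.h>
#include <fstream>
#include <iterator>
#include <vector>

// Minimal sketch: program a PCIe-attached FPGA accelerator with a new
// configuration binary. The file name "task.xclbin" is a placeholder.
int main() {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

    // Read the precompiled FPGA configuration file.
    std::ifstream file("task.xclbin", std::ios::binary);
    std::vector<unsigned char> binary((std::istreambuf_iterator<char>(file)),
                                      std::istreambuf_iterator<char>());

    // Loading the binary (re-)configures the FPGA for the new task.
    size_t size = binary.size();
    const unsigned char* bins[] = { binary.data() };
    cl_int status;
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &size,
                                                bins, &status, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);

    // ... create kernels from prog and enqueue work here ...

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}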

2 Materials and Method

Compared to existing heterogeneous HPC systems [9,10,11,12,13], which consist of a Multi-Processor Unit (MPU), or clusters of MPUs, and GPGPU-based accelerators, Reconfigurable Heterogeneous HPC systems (RH HPC), by using reconfigurable FPGA-based accelerators, are able to meet the requirements of particular tasks, such as data structures, calculation algorithms, and real-time constraints, and allow particular tasks to be solved more efficiently [14,15,16] in terms of Power Efficiency (FLOPS/W), Calculation Efficiency (Real FLOPS/Peak FLOPS), and Size Efficiency (Real FLOPS per unit area).

The current understanding of a System-on-Chip (SoC) is: an FPGA, often referred to as the Logic Part of the SoC; a multi-core processor, often referred to as the Processor Part of the SoC, with a GPU accelerator; and numerous embedded peripheral components [16], such as PCIe, USB3.0, MAC Ethernet, SATA, DDR4, SPI/QSPI, NAND memory, and SD Card, all deployed on the silicon of a single IC. This means that a state-of-the-art SoC has a reconfigurable heterogeneous architecture and can be treated as a tiny RH HPC system.

State-of-the-art FPGAs, SoCs, and off-the-shelf devices allow the Reconfigurable Heterogeneous (RH) architecture to be used for building Supercomputers, Data centers, and Cloud services (DC-Cloud RH HPC), office computers (Premises RH HPC), and remote high-performance computing systems (Edge RH HPC).

In this chapter, we review the proposed architectures of the Reconfigurable Heterogeneous Distributed High-Performance Computing (RHD HPC) System and the architecture of the developed and deployed PCIe-based reconfigurable accelerator.

3 Results

3.1 The Proposed Architecture of Reconfigurable Heterogeneous Distributed HPC System

The proposed architecture of the reconfigurable heterogeneous distributed HPC system (see Fig. 1) is based on our previous research on HPC architectures based on the OpenCL standard [17, 18].

Fig. 1 The proposed architecture of reconfigurable heterogeneous distributed HPC system

The DC-Cloud RH HPC (see Fig. 1) consists of several Computing Clusters and one Service Cluster. Each cluster consists of several identical computing nodes with an MPU, a Reconfigurable FPGA-based Accelerator (RA), and a Single Instruction Multiple Data (SIMD) accelerator, previously referred to as a GPGPU. The Computing Clusters are intended for solving particular computational tasks. The Service Cluster implements:

  • performance evaluation of all Computing Clusters and of the remotely connected Premises RH HPCs and Edge RH HPCs;

  • optimal task distribution between the available computational resources. The optimization criterion can be specified by the user or assigned automatically by AI deployed on the Service Cluster (a minimal sketch of such criterion-driven distribution is given after this list).
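
The following minimal C++ sketch (hypothetical resource names and figures, not measured data) illustrates the kind of criterion-driven distribution the Service Cluster performs: each task is assigned to the available resource that maximizes the selected criterion, here Power Efficiency (FLOPS/W). In the real Service Cluster the criterion and the resource figures would come from the performance-evaluation step described above.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical description of an available computational resource.
struct Resource {
    std::string name;
    double real_flops;   // sustained FLOPS on the task class
    double power_watts;  // power consumption
};

// Select the resource that maximizes the chosen criterion (here FLOPS/W).
const Resource& pickByPowerEfficiency(const std::vector<Resource>& pool) {
    return *std::max_element(pool.begin(), pool.end(),
        [](const Resource& a, const Resource& b) {
            return a.real_flops / a.power_watts < b.real_flops / b.power_watts;
        });
}

int main() {
    // Assumed, illustrative figures for three kinds of nodes.
    std::vector<Resource> pool = {
        {"Computing Cluster node (RA)",   7e12, 250.0},
        {"Computing Cluster node (SIMD)", 9e12, 450.0},
        {"Edge RH HPC",                   5e11,  20.0},
    };
    std::cout << "Task assigned to: " << pickByPowerEfficiency(pool).name << "\n";
}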

The Premises RH HPC (see Fig. 1) consists of a Premises MPU (P_MPU), a Premises Reconfigurable FPGA-based Accelerator (P_RA), and a Premises Single Instruction Multiple Data (P_SIMD) accelerator. A PCIe3.0 ×16 (PCIe4.0 ×16) interface provides the interconnection between P_MPU, P_RA, and P_SIMD. The Premises RH HPC can be:

  • identical with the Computing Node of DC-Cloud RH HPC. In this case, P_MPU, P_RA, and P_SIMD are identical to MPU, RA, and SIMD, respectively;

  • specialized for solving particular tasks.

The Edge RH HPC (see Fig. 1) consists of an Embedded MPU (E_MPU), an Embedded Reconfigurable FPGA-based Accelerator (E_RA), and an Embedded Single Instruction Multiple Data (E_SIMD) accelerator. The interconnection between E_MPU, E_RA, and E_SIMD, shown in Fig. 1 as Embedded Interconnection, can be realized by:

  • a PCIe interface, if E_MPU, E_RA, and E_SIMD, or some of them, are separate devices;

  • the interconnection matrix, if E_MPU, E_RA, and E_SIMD are all deployed inside the FPGA.

Since the Edge RH HPC is intended for interaction with sensors and actuators, an important element is the Object Connection block, highlighted in Fig. 1. The Object Connection block can contain analog-to-digital converters (ADCs), digital-to-analog converters (DACs), digital inputs/outputs (DIO), and other means of interaction with the particular object.

It is assumed that the data transmission medium, shown in Fig. 1, can be an arbitrary combination of wired (with speeds of 1–100 Gbit/s) and wireless (e.g., Bluetooth, Wi-Fi, LTE, 5G) connections. The choice depends on the characteristics of the tasks being solved and the parameters of the remote objects.

3.2 The Proposed Architecture of the Computing Node

The proposed architecture of the Computing Node (see Fig. 2) has been derived from the performance demands of Machine Learning tasks [19].

Fig. 2 The proposed architecture of the Computing Node

The proposed architecture of the Computing Node contains:

  • Two Central Processing Units (CPUs). Each CPU is itself a multiprocessing unit containing several, up to several dozen, computing cores and a number of embedded controllers for high-speed connection with external dynamic memory, the PCIe interface, 1–100G Ethernet connections, etc. The CPUs should have a direct connection and implement non-uniform memory access (NUMA).

  • Dynamic Random Access Memory (DRAM) blocks, which, at the physical level, are DDR4 memory modules. DRAM is the local memory of each processor; its width, performance, and volume depend on the purpose, the particular tasks being solved, the desired performance, and the power consumption of the Computing Node.

  • A number of SIMD accelerators. Each SIMD accelerator should have an independent connection with each CPU in the Computing Node. The independent connection can be as simple as a PCIe3.0 (PCIe4.0) ×16/×8 interface [20], or as advanced as the Open Coherent Accelerator Processor Interface (OpenCAPI), the Cache Coherent Interconnect for Accelerators (CCIX), or Compute Express Link (CXL).

  • A number of RA accelerators. Each RA accelerator should have an independent connection with each CPU in the Computing Node. The independent connection can be as simple as a PCIe3.0 (PCIe4.0) ×16/×8 interface, or as advanced as an OpenCAPI [10] or CCIX interface.

  • Task Memory, which is Random Access Memory (RAM) with a capacity of 16 GByte or more and performance comparable with DDR4 memory. The Task Memory should be connected so that each CPU can access it through PCIe3.0 (PCIe4.0) interfaces or through OpenCAPI/CCIX, implementing Uniform Memory Access (UMA) at the task level.

  • Interconnect blocks, which are the local parts of the Interconnect System (see Fig. 1). The Interconnect blocks provide high-speed wired connections between Computing Nodes within the Computing and Service Clusters and between Clusters. Each Interconnect block should contain one or several 100G connectors/controllers.

  • Smart Endpoints, which are intelligent units providing a direct channel to the outside world without involving the CPUs. Each Smart Endpoint consists of a 1–100G Ethernet block and Packet Processors. The 1–100G Ethernet block can contain one or more Ethernet/SFP/QSFP connectors and a PHysical Layer (PHY) controller. The Packet Processor is intended for solving security issues and for intelligent data processing, such as extracting particular data for further processing, with the efficiency of a hardware implementation.

The proposed architecture of the Computing Node is treated as a universal architecture for building Computing Clusters and Service Clusters and as the core architecture for the Premises RH HPC. A sketch of how such a heterogeneous node appears to host software is given below.
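
The following C++ sketch (assuming OpenCL runtimes are installed for the node's accelerators) illustrates how such a heterogeneous Computing Node can look from the software side: SIMD accelerators typically appear as GPU-type OpenCL devices, while FPGA-based RA accelerators appear as accelerator-type devices.

#include <CL/cl.h>
#include <iostream>
#include <vector>

// Sketch: enumerate the heterogeneous devices visible inside one Computing Node.
int main() {
    cl_uint nPlatforms = 0;
    clGetPlatformIDs(0, nullptr, &nPlatforms);
    std::vector<cl_platform_id> platforms(nPlatforms);
    clGetPlatformIDs(nPlatforms, platforms.data(), nullptr);

    for (cl_platform_id p : platforms) {
        cl_uint nDevices = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &nDevices) != CL_SUCCESS)
            continue;
        std::vector<cl_device_id> devices(nDevices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, nDevices, devices.data(), nullptr);

        for (cl_device_id d : devices) {
            char name[256] = {0};
            cl_device_type type;
            clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(name), name, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_TYPE, sizeof(type), &type, nullptr);
            std::cout << name << " : "
                      << (type & CL_DEVICE_TYPE_GPU         ? "SIMD (GPU)" :
                          type & CL_DEVICE_TYPE_ACCELERATOR ? "RA (FPGA accelerator)" :
                                                              "CPU / other")
                      << "\n";
        }
    }
}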

3.3 The Proposed Architecture of the Edge RH HPC

The proposed architecture of the Edge RH HPC (see Fig. 3) is based on our experience in developing and deploying high-performance Systems-on-Chip.

Fig. 3 The proposed architecture of the Edge RH HPC

The proposed architecture of the Edge RH HPC contains:

  • A Multi-Core CPU, which is the main processing unit.

  • An E_SIMD accelerator, which is tightly coupled with the Multi-Core CPU. It can be implemented as a separate Integrated Circuit (IC) or as an embedded GPGPU unit inside the SoC device.

  • An E_RA accelerator, which can be implemented as a separate IC or as an embedded unit deployed in the Logic Part of the SoC device.

  • DRAM blocks, which, at the physical level, are DDR4 memory modules. The DRAMs are the local memory of the Logic Part and the Processor Part of the SoC device. Their width, performance, and volume are functions of the purpose, the particular tasks being solved, the desired performance, and the power consumption of the Edge RH HPC.

  • A Packet Processor, which is intended for solving security issues and for intelligent data processing, such as extracting particular data for further processing, with the efficiency of a hardware implementation. The Packet Processor can be implemented as a separate IC or deployed in the Logic Part of the SoC device.

  • Embedded Interconnect, which interconnects all units of the Edge RH HPC. The Embedded Interconnect can be implemented as a PCIe3.0 (PCIe4.0), NVLink, or similar interface, or deployed in the Logic Part of the SoC device.

  • An Object Connection Unit, which provides interaction with the particular object connected to the Edge RH HPC. The Object Connection Unit can include, but is not limited to, ADCs, DACs, transceivers, analog and digital sensors, and GPS, GLONASS, BeiDou, and Galileo devices.

  • A Data transmission unit, which is the local (remote) part of the Data transmission medium (see Fig. 1). The data transmission unit should provide an arbitrary combination of wired (with speeds of 1–100 Gbit/s) and wireless (e.g., Bluetooth, Wi-Fi, LTE, 5G) connections. The choice of the particular transmission medium depends on the characteristics of the tasks being solved and the parameters of the remote objects.

3.4 Developed and Deployed Reconfigurable Accelerator for DC-Cloud RH HPC

We developed the reconfigurable accelerator PB_4× [13], based on the Xilinx Kintex UltraScale FPGA, and deployed it in the Supercomputer Center ‘Polytechnic’ [12]. The structure of the reconfigurable accelerator PB_4× (see Fig. 4) was developed to provide the highest level of parallelism for reconfigurable accelerators.

Fig. 4 The structure of the developed reconfigurable accelerator PB_4×

The reconfigurable accelerator PB_4× consists of:

  • Four KU115 devices, each a Xilinx Kintex UltraScale KU115 FPGA, the largest Kintex UltraScale device available.

  • Four DDR3 memory blocks. Each block, having 4 GByte capacity, is connected by a 64-bit data bus to its own KU115 device.

  • A PCIe Switch, which provides a non-blocking connection to each of the KU115 devices.

The architecture of the accelerator PB_4× provides the ability to connect any KU115 to any other. The throughput between KU115 devices and the bandwidth of the data channels between each KU115 and its DDR3 memory are balanced (the arithmetic is reproduced in the short sketch after this list):

  • Throughput, in any direction, between the PCIe Switch and each KU115: 8 lanes × 8 Gb/s = 64 Gb/s, provided by a PCIe3.0 ×8 interface.

  • Throughput, in any direction, between KU115 devices: 8 lanes × 16 Gb/s = 128 Gb/s, provided by an Aurora GTH 16 Gb/s ×8 interface.

  • Bandwidth of the data channels between each KU115 and its DDR3 memory: 64 bits × 1600 MT/s ≈ 100 Gb/s.
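
The following short C++ sketch simply reproduces this arithmetic; the figures repeat those listed above (the DDR3 channel transfers 64 bits per transfer at 1600 MT/s).

#include <iostream>

// Reproduce the link-balance arithmetic for PB_4x (figures as listed above).
int main() {
    double pcie_gbps   = 8  * 8.0;          // PCIe3.0 x8: 8 lanes x 8 Gb/s
    double aurora_gbps = 8  * 16.0;         // Aurora GTH: 8 lanes x 16 Gb/s
    double ddr3_gbps   = 64 * 1600e6 / 1e9; // 64-bit bus at 1600 MT/s

    std::cout << "PCIe Switch <-> KU115 : " << pcie_gbps   << " Gb/s\n";
    std::cout << "KU115 <-> KU115       : " << aurora_gbps << " Gb/s\n";
    std::cout << "KU115 <-> DDR3        : " << ddr3_gbps   << " Gb/s\n";
}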

The use of the PCIe Switch and the availability of high-speed communication channels between the KU115 devices allow the configuration of the reconfigurable accelerator to be changed flexibly. An operating system (OS) can see PB_4× as:

  • Four independent reconfigurable accelerators, each of which is connected to the PCIe bus by eight lanes of PCIe3.0, is implemented on a Xilinx Kintex UltraScale KU115 FPGA, and has 4 GByte of DDR3 memory.

  • One huge reconfigurable accelerator. In this configuration, the task being solved is implemented on all KU115 devices, connected together by the high-speed channels. Such a huge reconfigurable accelerator has 16 GByte of DDR3 memory divided into four independent channels.

The reconfigurable accelerator PB_4× was implemented as a PCIe3.0 ×16 [14] expansion card (see Fig. 5).

Fig. 5 Top view of the implemented reconfigurable accelerator PB_4×

To integrate the reconfigurable accelerator into the Xilinx SDAccel environment, we developed [12] (a host-side usage sketch follows the list):

  • the hardware platform, which is the hardware design for each KU115 enabling integration of the reconfigurable accelerator with the host computer;

  • the set of drivers enabling integration of the reconfigurable accelerator with CentOS x64 deployed on the host computer.
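
The following C++ sketch shows the host-side flow enabled by the hardware platform and drivers: program one KU115, move data to its DDR3 memory over PCIe, run a kernel, and read the result back. It assumes the platform exposes each KU115 as an OpenCL accelerator device; the binary name platform.xclbin, the kernel name vadd, and the data sizes are illustrative placeholders, not the actual artifacts of our platform.

#include <CL/cl.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Sketch of the host-side flow: program one KU115 from a precompiled binary,
// create buffers in its DDR3 memory, run a kernel, and read the result back.
int main() {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    // Program the FPGA with the platform/kernel binary (placeholder name).
    std::ifstream f("platform.xclbin", std::ios::binary);
    std::vector<unsigned char> bin((std::istreambuf_iterator<char>(f)),
                                   std::istreambuf_iterator<char>());
    size_t size = bin.size();
    const unsigned char* bins[] = { bin.data() };
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &size,
                                                bins, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel krn = clCreateKernel(prog, "vadd", &err);  // illustrative kernel

    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 n * sizeof(float), a.data(), &err);
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 n * sizeof(float), b.data(), &err);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float),
                                 nullptr, &err);

    clSetKernelArg(krn, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(krn, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(krn, 2, sizeof(cl_mem), &bufC);

    size_t global = n;
    clEnqueueNDRangeKernel(q, krn, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, bufC, CL_TRUE, 0, n * sizeof(float), c.data(),
                        0, nullptr, nullptr);
    std::cout << "c[0] = " << c[0] << "\n";

    clReleaseMemObject(bufA); clReleaseMemObject(bufB); clReleaseMemObject(bufC);
    clReleaseKernel(krn); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
}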

At present, a Computing Node with the developed reconfigurable accelerator is deployed in the Supercomputer Center ‘Polytechnic.’

4 Discussion

Our nearest future research will deal with the performance evaluation of the developed and deployed reconfigurable accelerator. We expect a significant leap in performance for tasks related to Deep Neural Network inference in the Supercomputer Center.

The second direction of future research is the implementation and performance investigation of an Edge RH HPC unit developed in accordance with the proposed architecture.

5 Conclusions

The chapter describes the developed hardware architecture of Reconfigurable Heterogeneous Distributed High-Performance Computing (RHD HPC) Systems, which integrates a wide set of distributed heterogeneous computing nodes. By meeting a set of requirements, such multi-component, distributed, intelligent computer systems can be applied to a wide range of context-aware high-performance computing tasks.