1 Introduction

Current High-Performance Computing (HPC) systems, such as Supercomputers [1], Data centers, and Cloud services, are becoming larger and larger and consume more and more power [2]. This trend increases the required space, the power consumption, and the cost of deploying and maintaining HPC systems.

Due to rapidly increasing requirements for performance, measured in FLoating-point Operations Per Second (FLOPS), for Power Efficiency (FLOPS/W), for Calculation Efficiency (Real FLOPS/Peak FLOPS), and for Size Efficiency (Real FLOPS per unit area), modern HPC systems are evolving toward heterogeneous computing [3]. Current heterogeneous HPC systems mostly use General Purpose Graphics Processing Units (GPGPUs) [4] and Application-Specific Integrated Circuits (ASICs) as accelerators to efficiently solve Artificial Intelligence (AI), Machine Learning (ML), Internet of Things (IoT), and Big Data analysis tasks [5]. The next step in the evolution of HPC systems is reconfigurable computing [6].
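
For illustration, the following short C++ sketch computes these efficiency metrics for a hypothetical node; all figures are assumed, not measured.

#include <iostream>

// A minimal sketch of the efficiency metrics named above, using purely
// illustrative (assumed, not measured) figures for a hypothetical node.
int main() {
    double peak_flops  = 20e12;   // peak performance, FLOPS
    double real_flops  = 7e12;    // sustained performance on the target task, FLOPS
    double power_watts = 300.0;   // power consumption, W
    double area_units  = 8.0;     // occupied area in arbitrary units (e.g., dm^2)

    std::cout << "Power Efficiency:       " << real_flops / power_watts << " FLOPS/W\n";
    std::cout << "Calculation Efficiency: " << real_flops / peak_flops  << " (Real/Peak)\n";
    std::cout << "Size Efficiency:        " << real_flops / area_units  << " FLOPS per unit area\n";
}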

Reconfigurable computing technology is based on the Field-Programmable Gate Array (FPGA) [7, 8]. An FPGA is an integrated circuit (IC) that can change its internal structure according to the task being solved. An FPGA consists of programmable logic cells, which can perform arbitrary logic/memory functions, and a programmable interconnection matrix, which connects the logic cells together to implement complex functions. An FPGA is programmed, or configured, by a binary file, called a configuration file, which sets up the logic cells and the interconnection matrix so that the FPGA implements the task being solved. A modern FPGA contains not only logic cells and an interconnection matrix but also Digital Signal Processing (DSP) blocks, Random Access Memory (RAM) blocks, High Bandwidth Memory (HBM), and hardware-implemented controllers and transceivers for external DDR memory, the PCIe interface, and 100G Ethernet.

A state-of-the-art FPGA can be configured on the fly: the FPGA can be configured for solving a new task during execution of the current task. It can also be partially configured: a part of the FPGA can be configured for a new task while the rest of the FPGA continues to solve the current task. Finally, an FPGA can be fully or partially configured through PCIe and Ethernet for solving a particular task. This means that an FPGA-based PCIe accelerator deployed on a Host, or an FPGA-based remote accelerator connected to a Host by a high-speed channel, for example 100G Ethernet, can be dynamically configured, or re-configured, on the fly to solve a particular task with the efficiency of a hardware implementation.
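
As an illustration of configuring a PCIe-attached accelerator from the host, the following minimal C++ sketch (assuming an OpenCL runtime for the FPGA board and a precompiled configuration file, hypothetically named task.xclbin) loads a new configuration binary onto the device with the standard clCreateProgramWithBinary call; it sketches the general flow only, not the exact code of any particular system.

#include <CL/cl.h>
#include <fstream>
#include <iterator>
#include <vector>

// Minimal sketch: program a PCIe-attached FPGA accelerator with a new
// configuration binary. The file name "task.xclbin" is a placeholder.
int main() {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

    // Read the precompiled FPGA configuration file.
    std::ifstream file("task.xclbin", std::ios::binary);
    std::vector<unsigned char> binary((std::istreambuf_iterator<char>(file)),
                                      std::istreambuf_iterator<char>());

    // Loading the binary (re-)configures the FPGA for the new task.
    size_t size = binary.size();
    const unsigned char* bins[] = { binary.data() };
    cl_int status;
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &size,
                                                bins, &status, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);

    // ... create kernels from prog and enqueue work here ...

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}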

2 Materials and Method

Compared to existing heterogeneous HPC systems [9,10,11,12,13], which consist of a Multi-Processor Unit (MPU), or clusters of MPUs, and GPGPU-based accelerators, Reconfigurable Heterogeneous HPC systems (RH HPC), by using reconfigurable FPGA-based accelerators, are able to meet the requirements of particular tasks, such as data structures, calculation algorithms, and real-time constraints, and allow particular tasks to be solved more efficiently [14,15,16] in terms of Power Efficiency (FLOPS/W), Calculation Efficiency (Real FLOPS/Peak FLOPS), and Size Efficiency (Real FLOPS per unit area).

The current understanding of a System-on-Chip (SoC) is: an FPGA, often referred to as the Logic Part of the SoC; a multi-core processor, often referred to as the Processor Part of the SoC, with a GPU accelerator; and numerous embedded peripheral components [16], such as PCIe, USB3.0, MAC Ethernet, SATA, DDR4, SPI/QSPI, NAND memory, and SD Card, all deployed on the silicon of a single IC. This means that a state-of-the-art SoC has a reconfigurable heterogeneous architecture and can be treated as a tiny RH HPC system.

State-of-the-art FPGAs, SoCs, and off-the-shelf devices allow the Reconfigurable Heterogeneous (RH) architecture to be used for building Supercomputers, Data centers, and Cloud services (DC-Cloud RH HPC), office computers (Premises RH HPC), and remote high-performance computing systems (Edge RH HPC).

In this chapter, we review the proposed architectures of the Reconfigurable Heterogeneous Distributed High-Performance Computing (RHD HPC) System and the architecture of the developed and deployed PCIe-based reconfigurable accelerator.

3 Results

3.1 The Proposed Architecture of Reconfigurable Heterogeneous Distributed HPC System

The proposed architecture of the reconfigurable heterogeneous distributed HPC system (see Fig. 1) is based on our previous research on HPC architectures based on the OpenCL standard [17, 18].

Fig. 1 The proposed architecture of reconfigurable heterogeneous distributed HPC system

The DC-Cloud RH HPC (see Fig. 1) consists of several Computing Clusters and one Service Cluster. Each cluster consists of several identical computing nodes with an MPU, a Reconfigurable FPGA-based Accelerator (RA), and a Single Instruction Multiple Data (SIMD) accelerator, previously referred to as a GPGPU. The Computing Clusters are intended for solving particular computational tasks. The Service Cluster implements:

  • performance evaluation of all Computing Clusters and of the remotely connected Premises RH HPCs and Edge RH HPCs;

  • optimal task distribution between the available computational resources. The optimization criterion can be specified by the user or assigned automatically by AI deployed on the Service Cluster (a minimal sketch of such criterion-driven distribution is given after this list).
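
The following minimal C++ sketch (hypothetical resource names and figures, not measured data) illustrates the kind of criterion-driven distribution the Service Cluster performs: each task is assigned to the available resource that maximizes the selected criterion, here Power Efficiency (FLOPS/W). In the real Service Cluster the criterion and the resource figures would come from the performance-evaluation step described above.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical description of an available computational resource.
struct Resource {
    std::string name;
    double real_flops;   // sustained FLOPS on the task class
    double power_watts;  // power consumption
};

// Select the resource that maximizes the chosen criterion (here FLOPS/W).
const Resource& pickByPowerEfficiency(const std::vector<Resource>& pool) {
    return *std::max_element(pool.begin(), pool.end(),
        [](const Resource& a, const Resource& b) {
            return a.real_flops / a.power_watts < b.real_flops / b.power_watts;
        });
}

int main() {
    // Assumed, illustrative figures for three kinds of nodes.
    std::vector<Resource> pool = {
        {"Computing Cluster node (RA)",   7e12, 250.0},
        {"Computing Cluster node (SIMD)", 9e12, 450.0},
        {"Edge RH HPC",                   5e11,  20.0},
    };
    std::cout << "Task assigned to: " << pickByPowerEfficiency(pool).name << "\n";
}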

The Premises RH HPC (see Fig. 1) consists of a Premises MPU (P_MPU), a Premises Reconfigurable FPGA-based Accelerator (P_RA), and a Premises Single Instruction Multiple Data (P_SIMD) accelerator. A PCIe3.0 ×16 (PCIe4.0 ×16) interface provides the interconnection between P_MPU, P_RA, and P_SIMD. The Premises RH HPC can be:

  • identical with the Computing Node of DC-Cloud RH HPC. In this case, P_MPU, P_RA, and P_SIMD are identical to MPU, RA, and SIMD, respectively;

  • specialized for solving particular tasks.

The Edge RH HPC (see Fig. 1) consists of an Embedded MPU (E_MPU), an Embedded Reconfigurable FPGA-based Accelerator (E_RA), and an Embedded Single Instruction Multiple Data (E_SIMD) accelerator. The interconnection between E_MPU, E_RA, and E_SIMD, shown in Fig. 1 as Embedded Interconnection, can be realized by:

  • a PCIe interface, if E_MPU, E_RA, and E_SIMD, or some of them, are separate devices;

  • the interconnection matrix, if E_MPU, E_RA, and E_SIMD are all deployed inside the FPGA.

Since the Edge RH HPC is intended for interaction with sensors and actuators, an important element is the Object Connection block, highlighted in Fig. 1. The Object Connection block can contain analog-to-digital converters (ADCs), digital-to-analog converters (DACs), digital inputs/outputs (DIO), and other means of interaction with the particular object.

It is assumed that the data transmission medium, shown in Fig. 1, can be an arbitrary combination of wired (with speeds of 1–100 Gbit/s) and wireless (e.g., Bluetooth, Wi-Fi, LTE, 5G) connections. The choice depends on the characteristics of the tasks being solved and the parameters of the remote objects.

3.2 The Proposed Architecture of the Computing Node

The proposed architecture of the Computing Node (see Fig. 2) has been derived from the performance demands of Machine Learning tasks [19].

Fig. 2 The proposed architecture of the Computing Node

The proposed architecture of the Computing Node contains:

  • Two Central Processing Units (CPUs). Each CPU is itself a multiprocessing unit containing several, up to several dozen, computing cores and a number of embedded controllers for high-speed connection with external dynamic memory, the PCIe interface, 1–100G Ethernet connections, etc. The CPUs should have a direct connection and implement non-uniform memory access (NUMA).

  • Dynamic Random Access Memory (DRAM) blocks, which, at the physical level, are DDR4 memory modules. DRAM is the local memory of each processor; its width, performance, and volume depend on the purpose, the particular tasks being solved, the desired performance, and the power consumption of the Computing Node.

  • A number of SIMD accelerators. Each SIMD accelerator should have an independent connection with each CPU in the Computing Node. The independent connection can be as simple as a PCIe3.0 (PCIe4.0) ×16/×8 interface [20], or as advanced as the Open Coherent Accelerator Processor Interface (OpenCAPI), the Cache Coherent Interconnect for Accelerators (CCIX), or Compute Express Link (CXL).

  • A number of RA accelerators. Each RA accelerator should have an independent connection with each CPU in the Computing Node. The independent connection can be as simple as a PCIe3.0 (PCIe4.0) ×16/×8 interface, or as advanced as an OpenCAPI [10] or CCIX interface.

  • Task Memory, which is Random Access Memory (RAM) with a capacity of 16 GByte or more and performance comparable with DDR4 memory. The Task Memory should be connected so that each CPU can access it through PCIe3.0 (PCIe4.0) interfaces or through OpenCAPI/CCIX, implementing Uniform Memory Access (UMA) at the task level.

  • Interconnect blocks, which are the local parts of the Interconnect System (see Fig. 1). The Interconnect blocks provide high-speed wired connections between Computing Nodes within the Computing and Service Clusters and between Clusters. Each Interconnect block should contain one or several 100G connectors/controllers.

  • Smart Endpoints, which are intelligent units providing a direct channel to the outside world without involving the CPUs. Each Smart Endpoint consists of a 1–100G Ethernet block and Packet Processors. The 1–100G Ethernet block can contain one or more Ethernet/SFP/QSFP connectors and a PHysical Layer (PHY) controller. The Packet Processor is intended for solving security issues and for intelligent data processing, such as extracting particular data for further processing, with the efficiency of a hardware implementation.

The proposed architecture of the Computing Node is treated as a universal architecture for building Computing Clusters and Service Clusters and as the core architecture for the Premises RH HPC. A sketch of how such a heterogeneous node appears to host software is given below.
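
The following C++ sketch (assuming OpenCL runtimes are installed for the node's accelerators) illustrates how such a heterogeneous Computing Node can look from the software side: SIMD accelerators typically appear as GPU-type OpenCL devices, while FPGA-based RA accelerators appear as accelerator-type devices.

#include <CL/cl.h>
#include <iostream>
#include <vector>

// Sketch: enumerate the heterogeneous devices visible inside one Computing Node.
int main() {
    cl_uint nPlatforms = 0;
    clGetPlatformIDs(0, nullptr, &nPlatforms);
    std::vector<cl_platform_id> platforms(nPlatforms);
    clGetPlatformIDs(nPlatforms, platforms.data(), nullptr);

    for (cl_platform_id p : platforms) {
        cl_uint nDevices = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &nDevices) != CL_SUCCESS)
            continue;
        std::vector<cl_device_id> devices(nDevices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, nDevices, devices.data(), nullptr);

        for (cl_device_id d : devices) {
            char name[256] = {0};
            cl_device_type type;
            clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(name), name, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_TYPE, sizeof(type), &type, nullptr);
            std::cout << name << " : "
                      << (type & CL_DEVICE_TYPE_GPU         ? "SIMD (GPU)" :
                          type & CL_DEVICE_TYPE_ACCELERATOR ? "RA (FPGA accelerator)" :
                                                              "CPU / other")
                      << "\n";
        }
    }
}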

3.3 The Proposed Architecture of the Edge RH HPC

The proposed architecture of the Edge RH HPC (see Fig. 3) is based on our experience in developing and deploying high-performance Systems-on-Chip.

Fig. 3 The proposed architecture of the Edge RH HPC

The proposed architecture of the Edge RH HPC contains:

  • A Multi-Core CPU, which is the main processing unit.

  • An E_SIMD accelerator, which is tightly coupled with the Multi-Core CPU. It can be implemented as a separate Integrated Circuit (IC) or as an embedded GPGPU unit inside the SoC device.

  • An E_RA accelerator, which can be implemented as a separate IC or as an embedded unit deployed in the Logic Part of the SoC device.

  • DRAM blocks, which, at the physical level, are DDR4 memory modules. The DRAMs are the local memory of the Logic Part and the Processor Part of the SoC device. Their width, performance, and volume are functions of the purpose, the particular tasks being solved, the desired performance, and the power consumption of the Edge RH HPC.

  • A Packet Processor, which is intended for solving security issues and for intelligent data processing, such as extracting particular data for further processing, with the efficiency of a hardware implementation. The Packet Processor can be implemented as a separate IC or deployed in the Logic Part of the SoC device.

  • Embedded Interconnect, which interconnects all units of the Edge RH HPC. The Embedded Interconnect can be implemented as a PCIe3.0 (PCIe4.0), NVLink, or similar interface, or deployed in the Logic Part of the SoC device.

  • An Object Connection Unit, which provides interaction with the particular object connected to the Edge RH HPC. The Object Connection Unit can include, but is not limited to, ADCs, DACs, transceivers, analog and digital sensors, and GPS, GLONASS, BeiDou, and Galileo devices.

  • A Data transmission unit, which is the local (remote) part of the Data transmission medium (see Fig. 1). The data transmission unit should provide an arbitrary combination of wired (with speeds of 1–100 Gbit/s) and wireless (e.g., Bluetooth, Wi-Fi, LTE, 5G) connections. The choice of the particular transmission medium depends on the characteristics of the tasks being solved and the parameters of the remote objects.

3.4 Developed and Deployed Reconfigurable Accelerator for DC-Cloud RH HPC

We developed the reconfigurable accelerator PB_4× [13], based on the Xilinx Kintex UltraScale FPGA, and deployed it in the Supercomputer Center ‘Polytechnic’ [12]. The structure of the reconfigurable accelerator PB_4× (see Fig. 4) was developed to provide the highest level of parallelism for reconfigurable accelerators.

Fig. 4 The structure of the developed reconfigurable accelerator PB_4×

The reconfigurable accelerator PB_4× consists of:

  • Four KU115 devices, each a Xilinx Kintex UltraScale KU115 FPGA, the largest Kintex UltraScale device available.

  • Four DDR3 memory blocks. Each block, having 4 GByte capacity, is connected by a 64-bit data bus to its own KU115 device.

  • A PCIe Switch, which provides a non-blocking connection to each of the KU115 devices.

The architecture of the accelerator PB_4× provides the ability to connect any KU115 to any other. The throughput between KU115 devices and the bandwidth of the data channels between each KU115 and its DDR3 memory are balanced (the arithmetic is reproduced in the short sketch after this list):

  • Throughput, in any direction, between the PCIe Switch and each KU115: 8 lanes × 8 Gb/s = 64 Gb/s, provided by a PCIe3.0 ×8 interface.

  • Throughput, in any direction, between KU115 devices: 8 lanes × 16 Gb/s = 128 Gb/s, provided by an Aurora GTH 16 Gb/s ×8 interface.

  • Bandwidth of the data channels between each KU115 and its DDR3 memory: 64 bits × 1600 MT/s ≈ 100 Gb/s.
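
The following short C++ sketch simply reproduces this arithmetic; the figures repeat those listed above (the DDR3 channel transfers 64 bits per transfer at 1600 MT/s).

#include <iostream>

// Reproduce the link-balance arithmetic for PB_4x (figures as listed above).
int main() {
    double pcie_gbps   = 8  * 8.0;          // PCIe3.0 x8: 8 lanes x 8 Gb/s
    double aurora_gbps = 8  * 16.0;         // Aurora GTH: 8 lanes x 16 Gb/s
    double ddr3_gbps   = 64 * 1600e6 / 1e9; // 64-bit bus at 1600 MT/s

    std::cout << "PCIe Switch <-> KU115 : " << pcie_gbps   << " Gb/s\n";
    std::cout << "KU115 <-> KU115       : " << aurora_gbps << " Gb/s\n";
    std::cout << "KU115 <-> DDR3        : " << ddr3_gbps   << " Gb/s\n";
}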

The use of the PCIe Switch and the availability of high-speed communication channels between the KU115 devices allow the configuration of the reconfigurable accelerator to be changed flexibly. An operating system (OS) can see PB_4× as:

  • Four independent reconfigurable accelerators, each of which is connected to the PCIe bus by eight lanes of PCIe3.0, is implemented on a Xilinx Kintex UltraScale KU115 FPGA, and has 4 GByte of DDR3 memory.

  • One huge reconfigurable accelerator. In this configuration, the task being solved is implemented on all KU115 devices, connected together by the high-speed channels. Such a huge reconfigurable accelerator has 16 GByte of DDR3 memory divided into four independent channels.

The reconfigurable accelerator PB_4× was implemented as a PCIe3.0 ×16 [14] expansion card (see Fig. 5).

Fig. 5 Top view of the implemented reconfigurable accelerator PB_4×

To integrate the reconfigurable accelerator into the Xilinx SDAccel environment, we developed [12] (a host-side usage sketch follows the list):

  • the hardware platform, which is the hardware design for each KU115 enabling integration of the reconfigurable accelerator with the host computer;

  • the set of drivers enabling integration of the reconfigurable accelerator with CentOS x64 deployed on the host computer.
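
The following C++ sketch shows the host-side flow enabled by the hardware platform and drivers: program one KU115, move data to its DDR3 memory over PCIe, run a kernel, and read the result back. It assumes the platform exposes each KU115 as an OpenCL accelerator device; the binary name platform.xclbin, the kernel name vadd, and the data sizes are illustrative placeholders, not the actual artifacts of our platform.

#include <CL/cl.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Sketch of the host-side flow: program one KU115 from a precompiled binary,
// create buffers in its DDR3 memory, run a kernel, and read the result back.
int main() {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    // Program the FPGA with the platform/kernel binary (placeholder name).
    std::ifstream f("platform.xclbin", std::ios::binary);
    std::vector<unsigned char> bin((std::istreambuf_iterator<char>(f)),
                                   std::istreambuf_iterator<char>());
    size_t size = bin.size();
    const unsigned char* bins[] = { bin.data() };
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &size,
                                                bins, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel krn = clCreateKernel(prog, "vadd", &err);  // illustrative kernel

    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 n * sizeof(float), a.data(), &err);
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 n * sizeof(float), b.data(), &err);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float),
                                 nullptr, &err);

    clSetKernelArg(krn, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(krn, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(krn, 2, sizeof(cl_mem), &bufC);

    size_t global = n;
    clEnqueueNDRangeKernel(q, krn, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, bufC, CL_TRUE, 0, n * sizeof(float), c.data(),
                        0, nullptr, nullptr);
    std::cout << "c[0] = " << c[0] << "\n";

    clReleaseMemObject(bufA); clReleaseMemObject(bufB); clReleaseMemObject(bufC);
    clReleaseKernel(krn); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
}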

At present, a Computing Node with the developed reconfigurable accelerator is deployed in the Supercomputer Center ‘Polytechnic.’

4 Discussion

Our nearest future research will deal with the performance evaluation of the developed and deployed reconfigurable accelerator. We expect a significant leap in performance for tasks related to Deep Neural Network inference in the Supercomputer Center.

The second direction of future research is the implementation and performance investigation of an Edge RH HPC unit developed in accordance with the proposed architecture.

5 Conclusions

The chapter describes the developed hardware architecture of Reconfigurable Heterogeneous Distributed High-Performance Computing (RHD HPC) Systems, which integrates a wide set of distributed heterogeneous computing nodes. By meeting a set of requirements, such multi-component, distributed, intelligent computer systems can be applied to a wide range of context-aware high-performance computing tasks.