1 Introduction

Low-dimensional semiconductor structures have become critically important with the recent revolutionary progress in lithographical technologies, [1,2,3] since they not only provide the fundamental framework for designing advanced electronic devices in the nanoscale regime, [4, 5] but also open the possibility of finding novel materials that may not be feasible with traditional bulk structures [6, 7]. A solid understanding of the electronic and material properties of nanoscale structures requires precise prediction with computer-aided simulations coupled to quantum physics, because trial-and-error determination of optimal sizes, shapes, material species, and so on would otherwise incur a huge cost in time and resources. In particular, electronic structures have been a popular target of quantum simulations since they reveal key information about low-dimensional structures, e.g., the band-structure, which provides clues to material and electronic properties of nanoscale structures such as electrostatic profiles and carrier transport [8,9,10].

As semiconductor structures are downscaled in size, their characteristics become more sensitive to atomistic fluctuations such as interface roughness, non-ideal or unintended doping, alloy composition and strain relaxation, and the associated quantum effects [11,12,13,14]. Also, experimentally realizable modern nanoscale structures usually consist of several million or more atoms even though the core regions (e.g., nanowire channels or core-cells of quantum dot structures) are just a few nanometers (nm) in size, because, in many cases, these regions are surrounded by or connected to large external layers that affect bias-dependent potential profiles and energy-level splitting in the core parts [14,15,16]. The nearest-neighbor \(sp^3d^5s^*\) tight-binding (TB) model, [17] which represents a single atom with a set of 10 bases (20 with spin-orbit coupling) and parameters fitted to reproduce band-structures of bulk materials, has been popularly used to study electronic structures of multi-million atomic systems, and has been verified through its capability to establish strong connections to experimentally observed behaviors [5, 8, 18,19,20,21] and to provide guidelines for advanced designs of nanoscale devices [22,23,24].

High-performance computing (HPC) clusters are essential for simulating electronic structures of multi-million atomic systems with empirical methods like the TB approach, because the size of the Hamiltonian matrix, which needs to be diagonalized to solve the associated Schrödinger equation, is proportional to the number of atoms residing in the simulation domain. The Nano-Electronic Modeling tool (NEMO), a well-known package for TB simulations [25, 26], has established the framework of large-scale electronic structure simulations on traditional HPC clusters of multicore processors (CPUs), overcoming the structural size-limit (\(<10^3\) atoms) of simulations with density functional theory (DFT) [27]. However, general-purpose graphics processing unit (GPU) devices, which have attracted the attention of HPC communities as accelerators for expensive scientific computations, [28,29,30,31] have not yet been fully exploited for empirical modeling of large-scale electronic structures involving multi-million atomic systems, although several pioneering works have reported remarkable performance enhancement with GPU devices for DFT simulations of electronic structures [30, 31].

This work examines the utility of GPU devices for simulations of extremely large-scale electronic structures with a TB approach. Using as a baseline our in-house code, recently introduced as a quantum simulation tool for advanced nanoscale devices (Q-AND), [32] we perform extensive code-refactoring with CUDA and benchmark the performance using Si:P quantum dots (QDs) as target devices, which are defined as huge silicon (Si) layers encapsulating a phosphorus (P) atom and have been studied with a 10-band \(sp^3d^5s^*\) TB model for designs of Si-based quantum computers [20, 21]. In particular, we justify the utility of GPU devices by elaborating the following items: (1) strategic details of our offload-computing and asynchronous data-transfer scheme, with descriptions of the major numerical approaches for solving large-scale electronic structures, (2) the speed and scalability of end-to-end simulations with Nvidia Tesla K40 devices, compared to the performance measured in CPU-only nodes, and (3) the benefits of simulations with Tesla K40 devices (a single K40 device has 2880 CUDA cores (745 MHz) and 12 GB memory [33]) in terms of computing time, data-transfer overhead, and energy consumption, particularly against their Intel counterpart, the Xeon Phi Knights Corner (KNC) 7120 family (a single KNC 7120 coprocessor has 61 cores (1.24 GHz) and 16 GB memory [34]). Extending our latest study with KNC coprocessors [32] to the area of GPU devices, this work delivers practical information on the efficiency of offload-computing that has rarely been covered for empirical modeling of large-scale electronic structures, and should thus benefit researchers in the field of nanoelectronics who consider migrating codes toward heterogeneous computing systems that host manycore devices via PCI-express (PCI-E) communications and account for \(\sim \) 30% of the top 100 HPC clusters in the world [35].

2 Methods

All the simulations of Si:P QD electronic structures considered in this work employ a 10-band \(sp^3d^5s^*\) TB model, [17] which describes a single atom with a set of 10 orthogonal orbital bases (\(s\), \(3\times p\), \(5 \times d\), \(s^*\)), ignoring spin-orbit coupling. The size of the Hamiltonian matrix associated with a specific atomic structure is therefore 10 times the number of atoms in the structure. As the TB approach we employ assumes nonzero coupling energies only among nearest-neighbor atoms, Hamiltonian matrices become sparse and are thus stored in the Compressed Sparse Row (CSR) format [36]. Simulation domains are parallelized with a hybrid utilization of the Message Passing Interface (MPI) and OpenMP: they are decomposed along the x-direction with MPI processes, and the y-z plane allocated to a single MPI process is further decomposed among multiple threads. (We note that Ref. [32] presents a detailed illustration of the domain decomposition with multicore CPUs.) Hamiltonian matrices are then decomposed in a row-wise manner, as Fig. 1a shows. The Schrödinger equation solver, which numerically tackles normal eigenvalue problems in our case, is implemented with the well-known Lanczos method [37]. A minimal sketch of the corresponding data layout is given below.
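As an illustration of the data layout, the sketch below shows what a CSR-stored Hamiltonian block owned by one MPI rank and the host-side part of the matrix-vector product could look like. It is a minimal sketch with hypothetical names (CsrBlock, spmv_host), not the actual Q-AND implementation, and the MPI halo exchange that gathers the neighbor blocks of Vin is assumed to have happened already.

```cuda
// Minimal sketch (hypothetical names) of the CSR Hamiltonian block owned by
// one MPI rank, and the host-side SpMV used inside each Lanczos iteration.
#include <vector>

struct CsrBlock {
    int nrows;                    // 10 x (number of atoms in this MPI rank)
    std::vector<int>    rowPtr;   // nrows + 1 entries
    std::vector<int>    colIdx;   // column indices, remapped into the three
                                  // concatenated Vin blocks (see Fig. 1a)
    std::vector<double> val;      // nonzero TB coupling energies
};

// x holds three block-vectors of Vin (the local block plus the two blocks
// received from neighbor MPI ranks); y is the local block of Vout.
void spmv_host(const CsrBlock &H, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)  // threads split the y-z plane
    for (int i = 0; i < H.nrows; ++i) {
        double sum = 0.0;
        for (int j = H.rowPtr[i]; j < H.rowPtr[i + 1]; ++j)
            sum += H.val[j] * x[H.colIdx[j]];
        y[i] = sum;
    }
}
```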

Fig. 1

A scheme of offload-computing for matrix-vector multiplications. a Each GPU device holds a part of the block-matrix belonging to a single MPI process in host, which sends / receives input / output vectors to / from the associated GPU device. Each MPI process does not need to send the whole input vector (Vin), since multiplications in a MPI process can be done with only three block-vectors of Vin (one in itself, two in its neighbor processes), as our TB model assumes nearest-neighbor couplings. Upon completion of the multiplications, a GPU device just needs to send one block of the output vector (Vout) back to its associated MPI process. Host CPUs and GPU devices can thus perform multiplications simultaneously with no heavy overhead of data-transfer. b Utilization of pinned memory can further reduce the overhead of data-transfer, as it not only boosts the speed of data-transfer (\(\sim \) 3\(\times \) speed-up against the case with pageable memory), but also enables asynchronous data-transfer, by which Vout can be transferred back to host while host is performing multiplications.

The two core mathematical operations, which involve MPI communications for parallel processing of Lanczos iterations, are the multiplication of a sparse matrix and a vector, and the dot-product of two vectors [37]. In particular, matrix-vector multiplications take a significant portion of the end-to-end computing time [32], becoming a hot spot that must be tackled to accelerate simulations with GPU devices. To tackle this hot spot with offload-computing, we decompose the block of the matrix belonging to a single MPI rank into two sub-blocks in a row-wise manner, where the ratio of decomposition is set as an input parameter in units of percentages. As illustrated in Fig. 1a, we then place one sub-block of the matrix in a single GPU device so that multiplications proceed simultaneously in host and GPU devices. The additional overhead caused by transfer of input (Vin) and output (Vout) vectors between host and GPU devices must be paid. The cost, however, is not huge since we assume nearest-neighbor couplings, so each MPI process needs to send only three blocks of Vin (one in itself, two in adjacent MPI processes) to the associated GPU device, and each GPU device just needs to send one block of Vout back to the associated MPI process after multiplications are completed. This is clearly illustrated in Fig. 1a, where three blocks of Vin and one block of Vout are the targets of data-transfer between a MPI process (rank 2 of 4 processes) and its associated GPU device. The cost of data-transfer can be further reduced with the aid of CUDA pinned (page-locked) memory, [38] which leads to \(\sim \) 3\(\times \) speed-up of data-transfer against the case with regular (pageable) memory. Utilization of CUDA pinned memory also enables data to be transferred via asynchronous streams, so Vout can be transferred from GPU devices to host while host is still performing multiplications, as Fig. 1b shows. A self-contained sketch of this scheme follows.
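The sketch below is a minimal, runnable illustration of that idea, assuming a hypothetical buffer size and using a trivial scaling kernel as a stand-in for the GPU share of the SpMV; only the pinned allocation, the asynchronous enqueueing, and the host-side overlap window correspond to the actual scheme of Fig. 1b.

```cuda
// Self-contained sketch of the pinned-memory / asynchronous-stream scheme
// (Fig. 1b); the kernel is only a stand-in for the GPU part of the SpMV.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(double *v, int n)            // stand-in for GPU SpMV
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0;
}

int main()
{
    const int n = 1 << 20;                         // hypothetical block size
    double *h_v, *d_v;
    cudaMallocHost(&h_v, n * sizeof(double));      // pinned (page-locked)
    cudaMalloc(&d_v, n * sizeof(double));
    for (int i = 0; i < n; ++i) h_v[i] = 1.0;

    cudaStream_t s;
    cudaStreamCreate(&s);

    // HtoD copy, kernel, and DtoH copy are enqueued without blocking host...
    cudaMemcpyAsync(d_v, h_v, n * sizeof(double), cudaMemcpyHostToDevice, s);
    scale<<<(n + 255) / 256, 256, 0, s>>>(d_v, n);
    cudaMemcpyAsync(h_v, d_v, n * sizeof(double), cudaMemcpyDeviceToHost, s);

    // ...so the host can do its own share of the multiplications here,
    // exactly as in Fig. 1b, before joining the stream.
    cudaStreamSynchronize(s);
    printf("h_v[0] = %f\n", h_v[0]);               // prints 2.000000

    cudaFreeHost(h_v);
    cudaFree(d_v);
    cudaStreamDestroy(s);
    return 0;
}
```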

Performance of sparse matrix-vector multiplications in GPU kernels is known to be limited by global memory access [39], which cannot be circumvented in our case as the nonzero elements of large sparse (Hamiltonian) matrices have to be stored in the global memory of GPU devices. We thus adopt a single-instruction, multiple-thread (SIMT) model [40] to increase the efficiency of global memory access. Figure 2 conceptually describes the performance advantage that can be achieved with a SIMT model. Figure 2a shows the first 4 rows of the Hamiltonian matrix for a [100] Si unitcell, where h(i,j) is the nonzero element at (row i, column j), and (NZ k) denotes that the element is the \(k\mathrm{th}\) nonzero value of the matrix. Figure 2b gives a fundamental view of how a GPU kernel can access the nonzero elements in the first 4 rows with multiple threads. Here, a single thread takes a single row of the matrix, so the speed of global memory access for reading nonzero elements is determined by the thread which takes the row that has the largest number of nonzero values. The optimized version of Fig. 2b, which we use in the code, is shown in Fig. 2c. Here, multiple threads simultaneously access multiple nonzero elements, where each thread accesses a single element. A group of these contiguous threads is called a CUDA WARP, [40] and a single WARP in Tesla K40 devices consists of 32 threads [41].
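To make the access pattern of Fig. 2c concrete, the kernel below sketches one WARP per matrix row in the spirit of the well-known CSR-vector scheme. It is a simplified illustration with our own hypothetical names, not the exact production kernel; the intra-WARP reduction uses __shfl_down_sync, which on the CUDA toolkits of the K40 era would be the legacy __shfl_down instead, and the launch is assumed to use a block size that is a multiple of 32.

```cuda
// Sketch of WARP-based CSR SpMV (Fig. 2c): one 32-thread WARP per row,
// so the 32 lanes read contiguous nonzeros of the row in a coalesced way.
__global__ void spmv_csr_warp(int nrows, const int *rowPtr, const int *colIdx,
                              const double *val, const double *x, double *y)
{
    const int lane   = threadIdx.x & 31;                          // 0..31
    const int warpId = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    if (warpId >= nrows) return;

    double sum = 0.0;
    for (int j = rowPtr[warpId] + lane; j < rowPtr[warpId + 1]; j += 32)
        sum += val[j] * x[colIdx[j]];            // coalesced reads of val

    for (int off = 16; off > 0; off >>= 1)       // intra-WARP reduction
        sum += __shfl_down_sync(0xffffffff, sum, off);

    if (lane == 0) y[warpId] = sum;              // lane 0 writes the row
}
```

By contrast, the purely thread-based variant of Fig. 2b assigns one thread per row, so its global memory traffic is serialized over each row's nonzeros, which is consistent with the large speed gap reported in Sect. 3.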

One of the goals this work pursues is to assess the energy efficiency of GPU (Tesla K40) devices for TB simulations of electronic structures, via a comparison to data obtained with Intel Xeon Phi (KNC 7120) coprocessors. For this purpose, the real-time power-usage of a single computing node is monitored while simulations are being performed. The power used by host (CPUs and memory) is measured with the Intel RAPL library, [42] while the power used by KNC and Tesla devices is retrieved with the Intel MICSMC utility and the NVidia Management Library (NVML) [43, 44], respectively.
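On the GPU side, such monitoring reduces to a short NVML sampling loop. The sketch below is illustrative rather than our exact harness; the 1-second sampling interval is a hypothetical choice, and nvmlDeviceGetPowerUsage reports the board power in milliwatts.

```cuda
// Minimal NVML power-sampling sketch; compile with: nvcc mon.cu -lnvidia-ml
#include <nvml.h>
#include <cstdio>
#include <unistd.h>

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);       // first Tesla K40 in the node
    for (int t = 0; t < 60; ++t) {             // sample once per second
        unsigned int mw = 0;
        nvmlDeviceGetPowerUsage(dev, &mw);     // board power in milliwatts
        printf("%3d s  %6.1f W\n", t, mw / 1000.0);
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}
```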

Fig. 2

A conceptual illustration of the performance benefits that can be obtained with WARP-based parallelization of matrix-vector multiplications in GPU kernels. a The first 4 rows of a 10-band TB Hamiltonian that describes a single [100] silicon unitcell, stored in a CSR format. h(i,j) represents the nonzero element located at (row i, column j) of the matrix, and (NZ k) denotes that the element is the \(k\mathrm{th}\) nonzero element of the matrix. b The pattern of access to global memory when purely thread-based parallelization is used. Here, a single thread accesses a single row of the matrix, so the speed of accessing the matrix stored in global memory is determined by the thread which takes the row that has the largest number of nonzero values. c With WARPs, the speed of multiplications can be improved since multiple threads (32 threads in this work) in a single WARP access multiple nonzero values simultaneously.

3 Results and discussion

Performance of TB simulations is carefully investigated for a Si:P QD system that includes a single P atom at the center of a cuboid [100] Si layer. The Si layer consists of 30 \(\times \) 80 \(\times \) 80 [100] unitcells that spatially correspond to a dimension of \(\sim \) 16 nm \(\times \) 43 nm \(\times \) 43 nm. The target device thus has a total of 1.536 million (M) atoms and involves a 15.36 M \(\times \) 15.36 M Hamiltonian system matrix. With a maximal iteration count of \(10^4\) and a convergence criterion of \(10^{-8}\) electronvolt, the calculations are continued until they either reach the maximal iteration count or find the 10 lowest energy-levels in the conduction band. All the workloads are tested with up to 3 computing nodes connected with an InfiniBand 4 \(\times \) FDR (56 Gbps) network, where each node has 2 Intel Xeon E5-2670 v2 (2.5 GHz) processors (10 cores per processor), 128 GB DDR3 SDRAM (1866 MHz), and 2 PCI-E (16\(\times \)) devices (Tesla K40 or KNC 7120). Since one MPI process is mapped to a single PCI-E device as described in the previous section, all the simulations are performed with 2 MPI processes per node, where a single MPI process has 10 threads to maximally utilize the host computing resource. In each PCI-E device, multiplication is performed with the maximum number of threads that the device can support.
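Since each MPI process drives exactly one PCI-E device, the rank-to-device binding can be expressed in a few lines. The sketch below is a generic illustration under that assumption, not code taken from Q-AND.

```cuda
// Minimal sketch: bind each MPI rank to one GPU (2 ranks per node here).
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);     // 2 Tesla K40 devices per node
    cudaSetDevice(rank % ndev);    // one MPI process per PCI-E device

    // ... construct the local Hamiltonian block and run Lanczos iterations,
    // with 10 OpenMP threads per rank on the host side ...

    MPI_Finalize();
    return 0;
}
```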

Fig. 3

Performance measured for the simulation of a 16 nm \(\times \) 43 nm \(\times \) 43 nm cuboid [100] Si:P QD that involves a \(\sim \) 15 M \(\times \) 15 M Hamiltonian matrix. a The wall-time of end-to-end simulations measured in a single node, shown as a function of the computing load of matrix-vector multiplications imposed upon GPU devices. The wall-time is minimized when GPU devices perform 70% of the multiplications. With a 70% load, the overall speed-up becomes \(\sim \) 1.46\(\times \) compared to the case when host performs all the multiplications, mainly due to \(\sim \) 2.93\(\times \) speed-up of the process of multiplications that includes data-transfer from GPU devices to host. b The strong scalability measured with up to 3 nodes is good regardless of the computing load of GPU devices (only the results at a 70% load are shown). Note that the performance here is also better than that of NEMO3D-PETA (Ref. [25]). c Utilization of pinned memory accelerates data-transfer and enables a simultaneous execution of data-transfer (GPUs \(\rightarrow \) host) and multiplications (host). With a 70% load, the speed of multiplications (MVMul+CP(DtoH)) and of data-transfer from host to GPU devices (CP(HtoD)) becomes \(\sim \) 1.4\(\times \) and \(\sim \) 3\(\times \) higher, respectively, compared to the case when pageable memory is used. d WARP-based parallelization (Fig. 2c) dramatically improves the speed of multiplications in GPU kernels, leading to \(\sim \) 7.5\(\times \) speed-up against the case with purely thread-based parallelization (Fig. 2b)

Figure 3a summarizes the performance of calculations measured in a single computing node with Tesla K40 devices. The wall-time for end-to-end simulations is shown as a function of the computing load for multiplications imposed upon GPU devices (GPU Load) and is decomposed into the following 5 components: the time taken for (1) MPI communication (Comm), (2) data-transfer from host to PCI-E devices (CP(HtoD)), (3) multiplication including data-transfer from PCI-E devices to host (MVMul+CP(DtoH)), (4) dot-product (VVDot), and (5) other operations (Others). As briefly mentioned in the Methods section, the two operations involving MPI communications, i.e., matrix-vector multiplication and vector dot-product, take a significant portion of the wall-time. In particular, the process of multiplications, which includes data-transfer from PCI-E devices to host, consumes \(\sim \) 56% of the wall-time when only host CPUs are utilized. However, the time needed to complete the process of multiplications reduces as the GPU Load increases, and finally reaches its minimum when the GPU Load is \(\sim \) 70%. A 70% GPU Load consequently minimizes the wall-time, yielding \(\sim \) 1.46\(\times \) speed-up of end-to-end simulations compared to the case when host CPUs perform all the multiplications (zero GPU Load), which is driven by \(\sim \) 2.93\(\times \) speed-up of the process of multiplications. Though other operations (Others) also take a non-negligible portion of the wall-time, they are performed in host CPUs and are not targets of GPU offloading in this work. We note that a more detailed discussion of the Others portion can be found in the supplementary document.

It is not easy to clarify with exact numbers why the wall-time is minimized at a \(\sim \) 70% GPU Load, because the speed of sparse matrix operations is affected by many factors such as computing performance, memory bandwidth, and the latency stemming from data-transfer via PCI-E lanes. An ideal ceiling for the optimal GPU Load, however, can be roughly estimated from the known theoretical (peak) performance of host CPUs and GPU devices, ignoring effects of memory access and data-transfer. Let us say that the peak performance (in FLoating point Operations Per Second (FLOPS)) of host CPUs and PCI-E devices is \(P_\mathrm{H}\) and \(P_\mathrm{D}\), respectively. The ceiling of the optimal GPU Load (x) can then be calculated with a simple equation as follows:

$$\begin{aligned} \frac{x}{100-x} \simeq \frac{P_\mathrm{D}}{P_\mathrm{H}}, \end{aligned}$$
(1)

which is justified as long as we ignore effects of memory access and data-transfer, because the speed of overall multiplications would then be maximized when host CPUs and GPU devices complete their computing operations at the same time. For double-precision floating point operations, a single computing node used in this work has \(P_\mathrm{H}\) of \(\sim \,4\times 10^{11}\) FLOPS (for 20 Xeon E5-2670 v2 cores), [45] and \(P_\mathrm{D}\) of \(\sim \,2.86\times 10^{12}\) FLOPS (for 2 Tesla K40 devices) [33]. x is therefore estimated as \(\sim \) 87.7%, which is somewhat larger than what we observe (\(\sim \) 70%) due to the effects we ignored. Although the answer is not strictly precise, Eq. (1) can still explain why the optimal load in this work is slightly larger than the one observed with Intel Xeon Phi KNC 7120 coprocessors (\(\sim \) 65%) for a workload and host computing environment identical to those of this work [32], because the peak performance of a single Tesla K40 device (\(\sim \,1.43\times 10^{12}\) FLOPS) is a bit larger than that of a single KNC 7120 coprocessor (\(\sim \,1.21\times 10^{12}\) FLOPS) [34] for double-precision operations.
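For concreteness, solving Eq. (1) for x with these peak numbers reproduces the quoted ceiling:

$$\begin{aligned} x \simeq 100\times \frac{P_\mathrm{D}}{P_\mathrm{H}+P_\mathrm{D}} = 100\times \frac{2.86\times 10^{12}}{(0.4+2.86)\times 10^{12}} \simeq 87.7\%. \end{aligned}$$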

Figure 3b shows the strong scalability at a 70% GPU Load, measured with up to 3 computing nodes (2 MPI processes per node). Here, we show that the scalability across multiple nodes remains good even though the workload involves offload-computing. The speed-up of end-to-end simulations obtained with 3 computing nodes becomes \(\sim \,2.34\times \) against the case with a single node, showing a \(\sim \) 78% scaling-ratio (\(= 2.34\div 3\)) that is not very different from the one obtained with host CPUs only (\(\sim \,85\% = 2.6\div 3\)) in both this work and Ref. [32]. Utilization of more nodes straightforwardly reduces the computing load of both multiplication and dot-product that a single MPI process carries. It is, however, worthwhile to emphasize that the overhead of data-transfer is also mitigated with an increasing number of computing nodes (or MPI processes), because of the size reduction of the Vin and Vout blocks (see Fig. 1a) that have to be transferred between host and GPU devices during the process of multiplications. The performance at a 70% GPU Load also turns out to be generally better than that of the NEMO3D-PETA package, [25] which adopts a MPI-based 3D domain decomposition for parallel processing of TB simulations of large-scale electronic structures. We note that, for simulations with NEMO3D-PETA, 2 / 4 / 6 MPI processes are used to decompose the simulation domain along the x-direction, as we did with our code. The y-z plane belonging to each MPI process, however, is decomposed with 10 MPI processes instead of 10 threads, since NEMO3D-PETA only supports MPI-based parallelization.

WARP-based parallelization of matrix-vector multiplications (Fig. 2c) and asynchronous data-transfer with pinned memory (Fig. 1b) drive remarkable speed-up of end-to-end simulations. Figure 3c, which shows the time spent for the process of multiplications (MVMul+CP(DtoH)) and data-transfer at a 70% GPU Load, delivers the following messages: (1) the time spent for CP(HtoD) and CP(DtoH) confirms that the speed of data-transfer itself increases with pinned memory, showing \(\sim \) 3\(\times \) speed-up against the case when pageable memory is used. (2) The time spent for MVMul+CP(DtoH) shows the benefit of data-transfer via asynchronous streams. When pinned memory is used for data-transfer, we find \(\sim \) 174 s are saved for MVMul+CP(DtoH) compared to the case when pageable memory is used. This time-saving is larger than the time saved for CP(DtoH) alone (\(\sim \) 139 s), and the additional saving of 35 s is thus caused by the overlap of the two processes, i.e., multiplications in host and data-transfer from GPU devices to host, which is enabled by data-transfer via asynchronous streams. Note that the speed of the multiplications themselves in host (MVMul(H)) and GPU kernels (MVMul(D)) is not much affected by the type of memory used for data-transfer. As Fig. 3d shows, multiplications in GPU kernels become drastically faster with WARP-based parallelization, where we find \(\sim \) 7.5\(\times \) speed-up at a 70% GPU Load, compared to the speed measured with thread-based parallelization (Fig. 2b).

Fig. 4

Performance measured in a single computing node equipped with GPU devices or Xeon Phi coprocessors. 70 and 65% of multiplications are performed in Tesla K40 devices and KNC 7120 coprocessors, respectively, where it is known from Ref. [32] that a 65% load gives the best performance with KNC 7120 coprocessors under the same conditions as those used in this work. a The end-to-end simulation with Tesla K40 devices takes \(\sim \) 10% less time (163 s) than the one with KNC 7120 coprocessors. b Most of the time-saving (163 s) driven with Tesla K40 devices comes from the process of multiplication (MVMul+CP(DtoH)) and the transfer of input vectors (CP(HtoD)), which take 65 and 76 s less, respectively, than the ones with KNC 7120 coprocessors. The transfer of output vectors (CP(DtoH)) with Tesla devices takes 48 s less than the one with KNC coprocessors, explaining \(\sim \) 75% of the time saved for MVMul+CP(DtoH) (65 s). The remaining \(\sim \) 25% of the time-saving (17 s) means the speed of multiplications itself is also improved with Tesla K40 devices

Figure 4a shows the wall-time of end-to-end simulations measured in a single computing node that has either Tesla K40 devices or KNC 7120 coprocessors. 70 and 65% of multiplications are performed in GPU devices and Xeon Phi coprocessors, respectively, where 65% is known to be the optimal load for KNC 7120 coprocessors in the same environment as the one used in this work [32]. With Tesla K40 devices, the wall-time is measured to be \(\sim \) 1700 s, which turns out to be \(\sim \) 10% smaller than the wall-time measured in a single node with KNC coprocessors (\(\sim \) 1863 s). Figure 4b shows the times spent for the process of multiplications (including data-transfer from PCI-E devices to host) and for data-transfer, where the labels MVMul+CP(DtoH), CP(HtoD) and CP(DtoH) are identical to the ones defined in Fig. 3. Here, the time-saving driven by Tesla K40 devices (163 s) mostly comes from CP(HtoD) and MVMul+CP(DtoH), which take 76 and 65 s less, respectively, than the times taken with KNC coprocessors. The time-savings of CP(HtoD) (76 s) and CP(DtoH) (48 s) are due to the increased speed of data-transfer that comes from utilization of the asynchronous stream through pinned memory (Fig. 1b). The time saved for MVMul (17 s) implies the speed of multiplications itself improves with Tesla K40 devices, although the multiplication process is also optimized in KNC coprocessors with Initial Many Core Instructions (IMCI), [46] which support single-instruction, multiple-data (SIMD) vectorization with 512-bit registers. It should be noted that our code also transfers data through the asynchronous stream in KNC coprocessors [47]. Another important point that must be clarified is that the pattern of convergence and the accuracy of eigenvalues do not prominently depend on the type of PCI-E devices. The fairness of the performance comparison with different PCI-E devices is supported by Table S1 in the supplementary document, which shows the total number of converged eigenvalues, the magnitudes of converged eigenvalues, and the iteration numbers at which eigenvalues converge in computing environments with CPU-only nodes, CPU+K40 devices, and CPU+KNC coprocessors.

So far, we have used the speed (wall-time) as the only indicator to assess the performance of TB simulations for large-scale electronic structures. Another indicator that is widely agreed to be important for discussing the performance of scientific computing in HPC clusters, however, is the energy efficiency, defined as the rate of computation performed for every watt (W) of power consumed (in FLOPS / W). As Tesla GPU devices are considered to be good solutions for energy-efficient computing, they are popularly adopted by clusters in Green500 sites, [48] where computing systems in Top500 sites are ranked in terms of energy efficiency that is normally quantified with the LINPACK benchmark [49]. The official energy efficiency known for Tesla K40 devices (\(\sim \,6\times 10^9\) FLOPS/W) is excellent, [33] particularly compared to that of KNC 7120 coprocessors (\(\sim \,4\times 10^9\) FLOPS/W) [34]. But this official comparison may not be directly applicable to our target problem, since the LINPACK benchmark consists of computing-bound problems involving full matrix operations, while the core of TB electronic structure simulations involves large-scale sparse matrix operations whose performance may not be computing-bound [39].

Fig. 5

Power-usage and energy consumption associated with the target simulation. Power-usage of a single computing node a with KNC 7120 coprocessors at a 65% load of multiplications and b with Tesla K40 devices at a 70% load, shown as a function of elapsed time. During the runtime of simulations, the power-usage of host is not much affected by PCI-E devices. Tesla K40 devices, however, use much less power than KNC 7120 coprocessors. c Real-time power-usage is time-averaged, where we find Tesla K40 devices use \(\sim \) 158 W to perform 70% of multiplications, while KNC 7120 coprocessors use \(\sim \) 351 W for 65% of multiplications. Host uses \(\sim \) 200 W in both cases. d With Tesla K40 devices, a computing node consumes only \(\sim \) 58% of the total energy required for the end-to-end simulation with KNC 7120 coprocessors. Focusing on the energy consumption of PCI-E devices, we find KNC coprocessors consume \(\sim \) 754 KJ to perform 65% of multiplications, while Tesla devices consume \(\sim \) 300 KJ to perform 70%

To investigate the energy efficiency of TB simulations, the power-usage of host and of the two PCI-E devices is retrieved as a function of elapsed time. Figure 5a, b shows the results measured while the simulation is performed in a single computing node with KNC 7120 coprocessors (at a 65% load) and Tesla K40 devices (at a 70% load), respectively. The pattern of power-usage is similar in both cases, roughly consisting of the following 3 steps: (1) power-usage increases during the setup of the domain and the associated Hamiltonian matrix, (2) oscillates rapidly during the process of Lanczos iterations, and (3) falls back as the workload finishes. Our results indicate that the power-usage of host (CPUs and memory) does not show a clear dependency on the type of PCI-E device during the whole runtime, but the power-usage of KNC 7120 coprocessors is much larger than that of Tesla K40 devices. Figure 5c, which shows the time-averaged power-usage of host and PCI-E devices, reveals that KNC 7120 coprocessors and Tesla K40 devices use \(\sim \) 351 and \(\sim \) 158 W, respectively, while host uses \(\sim \) 200 W in both cases. The energy consumed by the simulation, obtained by multiplying the time-averaged power-usage by the wall-time, is shown in Fig. 5d. When the simulation runs in a single computing node with KNC 7120 coprocessors, host and the two PCI-E devices consume \(\sim \) 411 kilojoule (KJ) and \(\sim \) 754 KJ, respectively, while the corresponding values become \(\sim \) 380 and \(\sim \) 300 KJ with Tesla K40 devices.

As addressed, the energy efficiency is defined as the rate of computation performed per 1 W of power consumed. If we introduce a new quantity, the total number of floating point operations that a single job processes (\(N_\mathrm{F}\)), the energy efficiency (FLOPS / W) can be extracted, since FLOPS is effectively \(N_\mathrm{F}\) divided by the time taken until the job is finished (wall-time). In particular, the energy efficiency of TB simulations (\(\eta \)) can be approximated with the following equation:

$$\begin{aligned} \eta = \frac{N_\mathrm{F}}{T_\mathrm{total}}\times \frac{1}{W} = \frac{N_\mathrm{F}}{E_\mathrm{total}}, \end{aligned}$$
(2)

where \(E_\mathrm{total}\) and \(T_\mathrm{total}\) represent the total energy and wall-time consumed by a single job (or simulation), respectively. As discussed with the results shown in Fig. 5, the total energy consumed to simulate the electronic structure of a 16 nm \(\times \) 43 nm \(\times \) 43 nm cuboid [100] Si:P QD is \(\sim \) 1165 KJ in a single computing node with KNC 7120 coprocessors and \(\sim \) 680 KJ in a node with Tesla K40 devices. As shown in Eq. (2), for the same workload, the energy efficiency is inversely proportional to the consumed energy. Consequently, assuming the energy consumed by CPUs and memory reasonably approximates the energy consumed by the host computing resource, we find Tesla K40 devices increase the energy efficiency by a factor of \(\sim \) 1.7 compared to the case when KNC 7120 coprocessors are used. As Tesla K40 devices and KNC 7120 coprocessors perform 70 and 65% of the total multiplications consuming 300 and 754 KJ, respectively, the energy efficiency of Tesla K40 devices (\(\propto \) 70\(\div \)300) becomes \(\sim \) 2.7\(\times \) that of KNC 7120 coprocessors (\(\propto \) 65\(\div \)754). We note that the superiority of Tesla K40 devices retrieved here (2.7\(\times \)) is more remarkable than the one (1.5\(\times \)) obtained with the official efficiencies known for the two PCI-E devices [33, 34].
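In explicit numbers, and writing \(N_\mathrm{F}^\mathrm{mul}\) for the total floating point operations of all multiplications (a symbol we introduce here only for this ratio), the device-only comparison reads

$$\begin{aligned} \frac{\eta_\mathrm{K40}}{\eta_\mathrm{KNC}} \simeq \frac{0.70\,N_\mathrm{F}^\mathrm{mul}\,/\,300\,\mathrm{KJ}}{0.65\,N_\mathrm{F}^\mathrm{mul}\,/\,754\,\mathrm{KJ}} \simeq 2.7. \end{aligned}$$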

4 Conclusion

We have investigated the practicality of general-purpose graphics processing unit (GPU) devices for empirical tight-binding (TB) simulations of extremely large-scale electronic structures, which target multi-million atomic systems and involve sparse Hamiltonian system matrices of \(10^7\) or larger degrees of freedom (DOFs). Major technical strategies used to exploit the strength of GPU-based offload-computing, which are data-transfer via asynchronous streams and WARP-based parallelization, have been explained in detail with short but clear descriptions of the numerical method employed to solve large-scale Schrödinger equations in parallel. The gain of performance obtained by offload-computing with Tesla K40 devices has been carefully analyzed for simulations of a phosphorus quantum dot encapsulated by large silicon (Si) layers (Si:P QD), which has 1.536 million (M) atoms involving a Hamiltonian matrix of 15.36 M DOFs.

The wall-time of end-to-end simulations varies as the GPU portion of matrix-vector multiplications, the core numerical operation of electronic structure calculations, is changed, and reaches its minimum when Tesla K40 devices perform \(\sim \) 70% of the multiplications, where \(\sim \) 1.46\(\times \) speed-up is observed with respect to the wall-time measured in CPU-only nodes, mainly due to \(\sim \) 2.93\(\times \) speed-up of the multiplications. Compared to the case when Intel Xeon Phi Knights Corner (KNC) 7120 coprocessors are used to offload matrix-vector multiplications similarly to what is done in this work, [32] Tesla K40 devices save \(\sim \) 10% of the wall-time due to the speed-up of data-transfer, and consume \(\sim \) 58% of the total energy needed to complete the target simulation. Although the purpose of this work is to present the details of technical approaches for improving the performance of large-scale electronic structure simulations with GPU computing, it should be noted that Tesla GPU devices are not cost-competitive. We thus encourage readers to carefully examine the benefit that can be obtained even at the expense of additional costs, particularly before writing codes for GPU computing. Readers who are interested in the related analysis for this work may want to refer to Table S2 in the supplementary document.