1 Introduction

Graphene is an atomically thin layer of carbon atoms that self-organize in a honeycomb lattice. For several years now, such layers have been routinely manufactured in laboratories worldwide. Electron transport in graphene, together with its other properties, is investigated intensively and forms one of the hottest fields in condensed matter and materials science. The driving force comes from prospective applications in information and nanotechnology, e.g., as field-effect transistors or spin quantum bits. In addition, chemical functionalization has triggered proposals for further technologies such as hydrogen storage or sensor applications.

In most applications, the electron transport properties of graphene-based films are crucial. Despite intensive research, these transport properties are far from being fully understood, especially concerning the influence of disorder as induced, e.g., by functionalization with adsorbates. In computational work, disorder effects have so far mostly been treated at the level of tight-binding calculations that incorporate aspects of π-electron physics; genuine ab initio treatments of transport in functionalized molecular films were not yet available. This is partially because electronic structure calculations based on density functional theory (DFT) for large, sufficiently representative flakes are computationally highly demanding. They can be treated successfully only by the most advanced ab initio packages. An additional difficulty is that such packages are often not capable of transport calculations in film geometries.

Recently, the FHI-aims code [1] has been complemented with a versatile transport module, AitransS [2–5], which implements the non-equilibrium Green’s function (NEGF) formalism and uses DFT results as input. It enables computing ab initio transport properties not only of single molecules but also of more extended structures such as molecular flakes, films, or nanotubes.

In this work, we study the effect of disorder (e.g., hydrogen adsorbates) and chemical functionalization on the conductance of graphene flakes. In contrast to the presently available tight-binding treatments, our DFT approach includes the effects of impurity charging and screening as well as of lattice distortion, i.e., strain. As a consequence, we can study the cross-talk between different impurities, which is not possible with present tight-binding implementations.

The sizes of computationally tractable graphene flakes reach 2500 carbon atoms. This is sufficient to observe the qualitative effects of quantum interference and mesoscopic fluctuations in such systems. Reaching such flake sizes was possible only because of the excellent DFT performance of the FHI-aims package on the HLRS systems. In addition, we improved the parallel scaling of the transport module AitransS.

The structure of the paper is as follows: in Sect. 2.1, we describe our transport method as implemented in the code, and we comment on the employed parallelization techniques (Sect. 2.2). In Sect. 3.1, the key findings are summarized. For reference, we provide the numerical parameters of a typical calculation in Sect. 3.2. Next, the performance achieved on the HLRS computing systems is presented, first for the DFT simulation (Sect. 3.3), then for the transport part (Sect. 3.4). In Sect. 3.5, we present two optimizations that were necessary to achieve this performance.

2 Methods

Our transport calculations are performed in a two-step procedure. First, we perform a DFT calculation with the all-electron code FHI-aims, including structural relaxation. Second, we perform a transport calculation using AitransS with the Kohn–Sham orbitals from the preceding DFT step as input. We summarize the transport calculation in the following and point out where parallelization is necessary. The implementations of FHI-aims and AitransS are described in [1] and [5], respectively.

2.1 Transport Calculations

We extract the Kohn–Sham (KS) Green’s function \(\mathbf{G}_{0}^{\text{KS}}(E) = \left (E\mathbb{1} -\mathbf{H}^{\text{KS}} +\mathrm{ i}0\right )^{-1}\) for a finite disordered graphene flake from the DFT calculations. To model the infinite extension of the system in current flow direction, we compute the self-energies \(\boldsymbol{\varSigma }_{\text{L/R}}\) using absorbing boundary conditions [5, 6]. The self-energies \(\boldsymbol{\varSigma }_{\text{L/R}}\) represent the influences of the leads. The resulting Green’s function

$$\displaystyle{ \mathbf{G}(E)^{-1} = \mathbf{G}_{0}^{\text{KS}}(E)^{-1} -\boldsymbol{\varSigma}_{\text{L}}(E) -\boldsymbol{\varSigma}_{\text{R}}(E) }$$
(1)

describes the propagation of KS particles in the device in the presence of the leads and is used to calculate the transmission \(\mathcal{T}(E) = \mathrm{Tr}\{\boldsymbol{\varGamma}_{\text{L}}\,\mathbf{G}\,\boldsymbol{\varGamma}_{\text{R}}\,\mathbf{G}^{\dagger}\}\). Here, \(\boldsymbol{\varGamma}_{\text{L/R}}\) denote the anti-Hermitian parts of the self-energies, i.e., \(\boldsymbol{\varGamma}_{\text{L/R}} = \mathrm{i}(\boldsymbol{\varSigma}_{\text{L/R}} -\boldsymbol{\varSigma}_{\text{L/R}}^{\dagger})\). They account for the level broadening in the finite graphene flake due to the coupling to the leads.

Using the retarded Green’s function, we calculate the non-equilibrium Keldysh Green’s function \(\mathbf{G}^{<} = \mathrm{i}\,\mathbf{G}\big[f_{\text{L}}\boldsymbol{\varGamma}_{\text{L}} + f_{\text{R}}\boldsymbol{\varGamma}_{\text{R}}\big]\mathbf{G}^{\dagger}\). Assuming that the scattering states originating from the left/right lead are occupied/unoccupied, the Keldysh Green’s function \(\mathbf{G}^{<}(E)\) simplifies to

$$\displaystyle{ \mathbf{G}^{<}(E) = \mathrm{i}\,\mathbf{G}(E)\,\boldsymbol{\varGamma}_{\text{L}}(E)\,\mathbf{G}^{\dagger}(E)\,. }$$
(2)

Here, we assumed zero temperature and an energy inside the voltage window, so that the lead occupations are \(f_{\text{L}} = 1\) and \(f_{\text{R}} = 0\).
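To make the structure of these formulas concrete, the following numpy sketch evaluates Eqs. (1) and (2) and the transmission for a single energy point. It is a minimal illustration under the assumption that the Hamiltonian and the lead self-energies are given as dense matrices in an orthonormal basis; it is not the AitransS implementation.

```python
import numpy as np

def negf_quantities(E, H, sigma_L, sigma_R, eta=1e-6):
    """Minimal NEGF sketch for one energy point (not the AitransS code).

    H        : (N, N) Kohn-Sham Hamiltonian in an orthonormal basis
    sigma_L/R: (N, N) lead self-energies, nonzero only in the contact regions
    Returns the transmission T(E) and the Keldysh Green's function G<(E)
    for f_L = 1, f_R = 0 (zero temperature, energy inside the bias window).
    """
    N = H.shape[0]
    # Retarded Green's function, Eq. (1); eta plays the role of the +i0 shift.
    G = np.linalg.inv((E + 1j * eta) * np.eye(N) - H - sigma_L - sigma_R)
    # Level broadenings: Gamma = i (Sigma - Sigma^dagger)
    gamma_L = 1j * (sigma_L - sigma_L.conj().T)
    gamma_R = 1j * (sigma_R - sigma_R.conj().T)
    # Transmission: T(E) = Tr{ Gamma_L G Gamma_R G^dagger }
    T = np.trace(gamma_L @ G @ gamma_R @ G.conj().T).real
    # Keldysh Green's function, Eq. (2): G< = i G Gamma_L G^dagger
    G_less = 1j * G @ gamma_L @ G.conj().T
    return T, G_less
```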

We use orthonormal basis functions \(\tilde{\varphi}_{i}(\mathbf{r})\) [constructed via Löwdin orthogonalization from the DFT basis functions \(\varphi_{i}(\mathbf{r})\)] to transform the Keldysh Green’s function into real space:

$$\displaystyle{ \mathbf{G}^{<}(\mathbf{r},\mathbf{r}',E) = \sum_{ij}\tilde{\varphi}_{i}(\mathbf{r})\,\mathbf{G}_{ij}^{<}(E)\,\tilde{\varphi}_{j}^{\ast}(\mathbf{r}')\,. }$$
(3)

The current density (per spin and energy) is then expressed as

$$\displaystyle{ \mathbf{j}(\mathbf{r},E) = \frac{1}{2\pi}\,\frac{\hbar}{2m}\lim_{\mathbf{r}'\rightarrow\mathbf{r}}\left(\boldsymbol{\nabla}_{\mathbf{r}'} -\boldsymbol{\nabla}_{\mathbf{r}}\right)\mathbf{G}^{<}(\mathbf{r},\mathbf{r}',E)\,. }$$
(4)
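Combining Eqs. (3) and (4), the gradient difference acts only on the basis functions, so the current density can be assembled from tabulated basis values and gradients. The following numpy sketch does this on a set of grid points; it is an illustration only, and the arrays phi and grad_phi holding the tabulated basis values and gradients are hypothetical inputs, not FHI-aims data structures.

```python
import numpy as np

def current_density(G_less, phi, grad_phi, hbar=1.0, m=1.0):
    """Current density j(r, E) from Eqs. (3)-(4); a sketch, not AitransS.

    G_less  : (N, N) Keldysh Green's function in the orthonormal basis
    phi     : (N, M) basis function values  phi~_i(r_k) on M grid points
    grad_phi: (N, M, 3) basis function gradients at the grid points
    """
    # lim_{r'->r} (grad_{r'} - grad_r) phi_i(r) G<_ij phi_j*(r')
    #   = sum_ij G<_ij [ phi_i(r) grad phi_j*(r) - phi_j*(r) grad phi_i(r) ]
    A = phi.T @ G_less                      # A_kj = sum_i phi_i(r_k) G<_ij
    term1 = np.einsum('kj,jkx->kx', A, grad_phi.conj())
    B = G_less @ phi.conj()                 # B_ik = sum_j G<_ij phi_j*(r_k)
    term2 = np.einsum('ikx,ik->kx', grad_phi, B)
    j = (1.0 / (2.0 * np.pi)) * (hbar / (2.0 * m)) * (term1 - term2)
    return j.real  # (M, 3) current density vectors (per spin and energy)
```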

2.2 Implementation: Parallelization Efforts

Because the computational demands of the DFT and transport steps scale very similarly with flake size (see Sect. 3.5), neither step represents the single bottleneck of the whole simulation. Thus, we also optimized and parallelized the transport module AitransS to enable the study of large flakes.

The AitransS code uses two types of parallelization (a minimal sketch of the energy-point distribution is given below):

Shared-memory parallelization, in which several threads within one process have access to the same data, i.e., the same energy point E. We use threaded LAPACK as implemented in the Intel MKL for matrix operations and OpenMP for the parallelization of loops over real space, etc.

Distributed-memory parallelization, in which each process works on separate data sets for possibly different energy points E. The Message Passing Interface (MPI) is used for work balancing, e.g., for distributing different energy points to different processes and for cutting up and distributing the real-space grid.
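As an illustration of the distributed-memory layer, the following mpi4py sketch distributes energy points round-robin over MPI ranks and collects the partial transmissions on the root process. This is a minimal Python illustration, not the actual implementation; transmission(E) stands for a hypothetical wrapper around a single-energy NEGF evaluation such as the sketch in Sect. 2.1.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Energy grid for the transmission scan (example values).
energies = np.linspace(-1.0, 1.0, 128)

# Round-robin distribution: rank r handles points r, r+size, r+2*size, ...
my_energies = energies[rank::size]
# transmission(E) is a hypothetical single-energy NEGF evaluation.
my_T = np.array([transmission(E) for E in my_energies])

# Gather the partial results on rank 0 and restore the original ordering.
all_T = comm.gather((rank, my_T), root=0)
if rank == 0:
    T = np.empty_like(energies)
    for r, T_r in all_T:
        T[r::size] = T_r
```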

A workflow diagram of the parallelized modules is shown in Fig. 1.

Fig. 1
figure 1

Workflow diagram of the transport module AitransS . The two most common module sequences are depicted in blue (transmission calculation) and red (current density and magnetic field calculation). Both sequences include the reconstruction of the Kohn–Sham Hamiltonian and the calculation of the self-energies (purple). Left: timing symbols used in the following for performance analysis, cf. Figs. 4 and 5. Right: overview of the parallelization techniques used in each module

3 Results

The results of this work are twofold. First, we summarize the physics results for the investigated graphene flakes in Sect. 3.1. Second, we present our development of computational techniques: (a) we demonstrate that the employed codes scale sufficiently well up to a few thousand cores, and (b) we discuss the improvements to our transport module that were necessary to achieve good scalability.

3.1 Overview of the Key Findings

In preparation for this work, we calculated the local current density response j(r, E) of pristine armchair graphene nanoribbons (AGNRs) of varying width [7]. We observe pronounced current patterns, which we call “streamlines”, with a threefold periodicity in the ribbon width. They arise as a consequence of quantum confinement in the direction transverse to the current flow. Neighboring streamlines are separated by stripes of almost vanishing flow. This effect can be explained in a tight-binding toy model. The response of the current density to adatoms is very sensitive to their placement: adatoms placed within the current filaments lead to strong backscattering, while in other regions adatoms have almost no impact.

Then, we switched to larger graphene flakes, calculating the local current density j(r, E) in the presence of hydrogen adsorbates [8]; an example with 5 % hydrogen is shown in Fig. 2. We discovered pronounced local current patterns, ring currents (current vortices), that go along with orbital magnetism. Importantly, the magnitude of the ring currents can exceed the average transport current by orders of magnitude. The associated magnetic fields exhibit drastic fluctuations with large field gradients reaching up to \(1\,\text{T}\,\text{nm}^{-1}\,\text{V}^{-1}\). These observations are relevant for spin relaxation in systems with very weak spin–orbit interaction, e.g., organic semiconductors. In such systems, spin relaxation induced by bias-driven orbital magnetism competes with the hyperfine interaction; both appear to be of similar strength. Based on our calculations, we propose an NMR-type experiment combined with a dc-transport setup to observe the spatial fluctuations of the induced magnetic fields. We studied several impurity concentrations and different graphene flake sizes. The described physics appears to be independent of both and should therefore also be present in larger mesoscopic samples, which are more common in experiments.

Fig. 2
figure 2

Local current density response (integrated over the out-of-plane direction) in an AGNR41 (24×41) with 5 % hydrogen adsorbates. The current density exhibits very strong mesoscopic fluctuations, exceeding the average current by more than two orders of magnitude on the logarithmic color scale. The current density is plotted relative to the average current density. The plot shows the current amplitude (color), the current direction (arrows), carbon atoms (grey crosses), and hydrogen atoms (red crosses). Some arbitrary current paths (black lines) are drawn into the plot for illustration

We also studied the statistical distribution of the current density in the graphene flakes [9]. The distribution function of the current density follows a log-normal distribution over a wide range. Its typical value is larger than the average current. Therefore, there are always significant contributions to the current density which do not contribute to the conductance, i.e., they form current rings. This work is still ongoing, but so far these features appear remarkably stable over a wide range of impurity concentrations (5–30 %) and system sizes (up to 2500 carbon atoms), and they even survive an averaging over several scattering states, e.g., when a finite bias voltage is applied to the system.

3.2 Typical Numerical Parameters

In Table 1, we list the numerical parameters of a typical calculation performed for hydrogenated graphene flakes. First, the graphene flake is structurally relaxed using FHI-aims until the remaining forces drop below \(10^{-2}\,\,\text{eV}/\text{\AA }\). This is, by far, the most expensive part of the calculation. Then, a final DFT run for the relaxed structure is performed and the output written to disk. This output is used by AitransS to perform a wide scan of the transmission function (the self-energies \(\boldsymbol{\varSigma }\) are pre-calculated since they depend only on the system size, not on the impurity configuration). Eventually, a few interesting energy points are selected from the transmission function, and the current density is calculated at those points.

Table 1 An overview of a typical calculation performed on Cray XC40 (Hornet) for a graphene flake (with 1312 carbon atoms) whose central 24 × 41 carbon atoms have been functionalized with hydrogen (compare with Fig. 2)

3.3 Achieved DFT Performance on Cray XE6 (Hermit) and Cray XC40 (Hornet)

Within a test project on Hermit, for which a budget of 40,000 core hours was granted, we carried out code porting, tuning, and performance measurements for DFT calculations of graphene to ensure that the FHI-aims code scales as required for the completion of the project goals and that the envisaged computing system Hermit was appropriate for the planned production simulations.

The scalability of the FHI-aims code was demonstrated by running the DFT module for pure planar graphene flakes with 170 (10×17), 345 (15×23), 735 (21×35), and 1500 (30×50) carbon atoms, in which the dangling bonds at the boundaries were saturated with hydrogen atoms. Each atom is represented by a tier-1 basis set, i.e., 14 basis functions per C atom and 5 per H atom. The total number of basis functions N is 2560, 5095, 10,675, and 21,550, respectively.

Figure 3 (upper plot) presents the scaling of the code, where the data of each set (color) are plotted as the reduced speedup \(T_{n}/T_{P}\). Here, \(T_{P}\) is the wall time of the calculation on P processor cores and \(T_{n}\) is the wall time on the minimum number of cores n on which the job finishes within the wall-time limit (24 h). Thus, n is 8, 16, 32, or 256, depending on the graphene flake size, i.e., each color in Fig. 3 corresponds to a different n. This normalization with n also allows an easy comparison of the speedups for the four graphene flake sizes. The data of each set have been fitted to Amdahl’s law (solid lines), showing that the serial fraction of the work α is always very small, around 1 ‰ ( = 0.001), and that it furthermore decreases with increasing graphene flake size.
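For reference, such an Amdahl fit takes only a few lines. The following sketch fits wall times to Amdahl’s law \(T_{P} = T_{1}\left[\alpha + (1-\alpha)/P\right]\); the timing values are placeholders, not the measured HLRS data.

```python
import numpy as np
from scipy.optimize import curve_fit

def amdahl(P, T1, alpha):
    """Amdahl's law: wall time on P cores with serial fraction alpha."""
    return T1 * (alpha + (1.0 - alpha) / P)

# Placeholder measurements (cores, wall time in s); not the actual HLRS data.
P = np.array([8, 16, 32, 64, 128, 256], dtype=float)
T = np.array([980.0, 495.0, 252.0, 130.0, 69.0, 39.0])

(T1, alpha), _ = curve_fit(amdahl, P, T, p0=(8000.0, 1e-3))
print(f"serial fraction alpha = {alpha:.2e}")
```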

Fig. 3
figure 3

Strong scaling of FHI-aims for pure graphene flakes on the CRAY XE6 (Hermit) cluster at HLRS (upper plot). Scaling of FHI-aims for pure graphene flakes of different sizes on the CRAY XE6 (Hermit) cluster at HLRS (weak scaling, lower plot)

Figure 3 (lower plot) shows the scaling for graphene using the same data as in Fig. 3 (upper plot), now represented by the speedup \(S(P) =\tilde{ T}_{1}/T_{P}\) achieved on P cores, where \(\tilde{T}_{1}\) is the hypothetical wall time the same calculation would take on a single core. Note that for a fixed number of cores, e.g., the set of data points for P = 256, the speedup improves as the problem size increases, approaching the ideal speedup (Gustafson’s law).

3.4 Parallelization of the Transport Module AITRANSS and Achieved Performance

In Fig. 4, we present detailed measurements of the performance of our transport code for realistic system sizes. In the tests, we distinguish between calculation of the transmission and of local observables. Transmission calculations (\(\mathcal{T} (E)\)) also include the density of states ρ(E). Local observables are the local current density j(r, E) but also its divergence \(\boldsymbol{\nabla }\boldsymbol{\cdot }\mathbf{j}(\mathbf{r},E)\), the non-equilibrium density n(r, E) and the local density of states ρ(r, E).

Fig. 4
figure 4

\(\mathcal{O}\!\left (N^{n}\right )\)-scaling: performance measurements with varying system size for a transmission calculation (upper plot) and a local observable calculation (lower plot) at a fixed number of CPU cores (P = 32). Symbols: number of basis functions N, number of energy points \(N_{E}\), number of CPU cores P

In the upper panel, the wall time for calculating \(N_{E} = 128\) transmission and density-of-states points for hydrogen-saturated AGNRs (the same as in Sect. 3.3) is plotted as a function of the number of basis functions N. The total time \(T_{\mathcal{T}}(N)\) is divided into four groups (\(t_{\text{H}}\), \(t_{\text{diag}}\), \(t_{\boldsymbol{\varSigma }}\), \(t_{\text{G}}\), cf. Fig. 1). Because the calculation of the self-energy representing the leads (via 200 iterations of the decimation technique [5]) depends directly on the number of lead basis functions \(N_{\text{lead}}\) (and only indirectly on the number of basis functions N of the device region), it is plotted separately. The main effort of a transmission calculation is the construction of these lead self-energies; it therefore makes sense to save them to hard disk if several impurity configurations sharing the same leads are processed. Then, the main effort is spent on constructing the Green’s function (\(t_{\text{G}}\)). The diagonalization (\(t_{\text{diag}}\), see Sect. 3.5.3) does not contribute significantly to the overall effort for the considered system sizes since it is performed only once and not for every energy point.

In the lower panel of Fig. 4, the same quantities are plotted for a single (\(N_{E} = 1\)) current density calculation. The number of grid points used is proportional to the system size (one grid point every 0.2 Å, 31 points in the z-direction). We first note that the total effort is dominated by the discretization of the local quantities (\(t_{(\mathbf{r})}\)). By exploiting the locality of the basis functions, we were able to optimize the calculation of local observables to scale below \(N^{2}\).Footnote 1

In Fig. 5, we discuss the parallelization efficiency of the transmission and current density calculations for a fixed system size (21×35 = 735 carbon atoms). The speedup S for many MPI processes relative to a single process is shown and compared to Amdahl’s law, \(T_{N_{ \text{MPI}}} = T_{1}\left [\alpha +(1-\alpha )/N_{\text{MPI}}\right ]\) with α ≈ 1 %. We see good scalability for the total wall time, the self-energy construction, and the Green’s function construction (\(T_{\mathcal{T}}\), \(t_{\boldsymbol{\varSigma }}\), \(t_{\text{G}}\)). The reconstruction of the KS Hamiltonian does not speed up since only the first MPI process is involved (cf. Fig. 1). For the diagonalization, we even observe that using two processes (with ScaLAPACK) is slower than using a single process (with LAPACK). Therefore, our code switches to ScaLAPACK only from 4 MPI processes on.

Fig. 5
figure 5

MPI parallelization: speedup for a fixed system size (735 (21×35) carbon atoms) and a fixed number of CPU cores per MPI process (p = 8). Left: speedup for transmission calculations. Right: speedup for local observable calculations. Symbols are the same as in Fig. 4. Here, p is the number of CPU cores per MPI process and \(N_{\text{MPI}}\) is the number of MPI processes (\(P = p\,N_{\text{MPI}}\))

3.5 Optimization of the Transport Module AitransS

3.5.1 Overview: Code Optimization

In the following sections, the two most important optimizations applied during our code development are presented. Their performance impact is summarized in Table 2. Note that both optimizations have a larger impact for larger systems.

Table 2 Increase in wall time when specific optimizations are removed from the code

The optimization “SpaceBlocks” is vital: without it, calculations for systems with more than 1000 atoms become unfeasible. The optimization “MatrixInverse” is also quite handy because it allows quick transmission scans before turning to the more expensive current density calculations.

3.5.2 SpaceBlocks: Dividing Space into Blocks

Here, we discuss how to evaluate the formulas for space-dependent local quantities such as the current density

$$\displaystyle{ \mathbf{j}(\mathbf{r},E) = \frac{1}{2\pi}\,\frac{\hbar}{2m}\sum_{ij}\mathbf{G}_{ij}^{<}(E)\left[\tilde{\varphi}_{i}(\mathbf{r})\,\boldsymbol{\nabla}\tilde{\varphi}_{j}^{\ast}(\mathbf{r}) -\tilde{\varphi}_{j}^{\ast}(\mathbf{r})\,\boldsymbol{\nabla}\tilde{\varphi}_{i}(\mathbf{r})\right], }$$

(5)

which follows from inserting Eq. (3) into Eq. (4).

In principle, the double sum runs over all basis functions of the underlying DFT simulation. FHI-aims uses numerically tabulated atom-centered orbitals (NAOs), i.e., localized basis functions. When restricting the spatial point r to a small region, most basis functions vanish inside this region. (These basis functions are, of course, nonzero elsewhere.) This locality of the basis set can be exploited in the following way.

First, we define \(r_{\text{max}}\) as the maximal radial extent of all basis functions (i.e., all basis functions are zero at points further away from their central atom than \(r_{\text{max}}\)). Second, the 3D space is divided into little cubes of edge length \(r_{\text{max}}/n\), with n being an integer (see Fig. 6 for an example). When performing the sum of Eq. (5) for any spatial point r inside the blue shaded area, the only basis functions that need to be taken into account are those centered on atoms in the green (and blue) shaded area. All other basis functions do not contribute in this area.

Fig. 6
figure 6

Dividing space into 156 (13×12×1) non-overlapping blocks, exemplified for a graphene flake with 398 atoms (2D model with n = 2, \(r_{\text{max}} = 5.05\) Å)

Hence, we divide space into cubes of edge length \(r_{\text{max}}/n\) and distribute them to separate MPI processes. The integer n is chosen such that every MPI process works on at least five blocks, to alleviate the load imbalance caused by differently populated blocks. Then, for each inner (blue) block, we restrict the Green’s function \(\mathbf{G}^{<}\) to the basis functions localized on atoms in the extended (green) block; a minimal sketch of this bookkeeping follows below.
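The index bookkeeping behind this blocking is simple. The following numpy sketch (an illustration under the assumptions stated here, not the AitransS implementation) assigns atoms to cubes and collects, for each inner block, the indices of all atoms whose basis functions can reach it:

```python
import numpy as np

def space_blocks(atom_pos, r_max, n):
    """Assign atoms to cubes of edge r_max/n; a sketch of the SpaceBlocks idea.

    atom_pos: (N_atoms, 3) atomic positions
    Returns a dict mapping each occupied block index to
    (atoms inside the block, atoms within r_max of the block),
    i.e., the 'blue' and 'green' index sets of Fig. 6.
    """
    edge = r_max / n
    origin = atom_pos.min(axis=0)
    cell = np.floor((atom_pos - origin) / edge).astype(int)  # block index per atom

    blocks = {}
    for key in {tuple(c) for c in cell}:
        inner = np.all(cell == key, axis=1)
        # 'Green' region: a basis function of radius r_max = n*edge can reach
        # the inner block only if its atom is at most n+1 block layers away.
        near = np.all(np.abs(cell - key) <= n + 1, axis=1)
        blocks[key] = (np.where(inner)[0], np.where(near)[0])
    return blocks
```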

3.5.3 MatrixInverse: Calculating the Green’s Function Inverse

As the self-energies can be read from hard disk, the most expensive part of a transmission calculation is the matrix inversion in the computation of the retarded Green’s function G, cf. Eq. (1). According to Fig. 4 (upper panel), G can be constructed in \(\mathcal{O}(N^{2})\). Without this optimization, the matrix inversion would scale as \(\mathcal{O}(N^{3})\) and would therefore dominate for large systems.

Partitioning of the Green’s function: The inverse of the Green’s function, see Eq. (1), can be calculated efficiently by transforming the Hamiltonian so that it is diagonal in the regions where the self-energies \(\boldsymbol{\varSigma }\) vanish. We partition the indices of the Green’s function inverse such that the self-energy contribution of the leads appears only in the subblock D, i.e.,

$$\displaystyle{ \begin{array}{ll} \mathbf{G}^{-1} & = E\mathbb{1} -\mathbf{H} -\boldsymbol{\varSigma }_{L}(E) -\boldsymbol{\varSigma }_{R}(E) \\ & = \left (\begin{array}{*{10}c} E\mathbb{1}_{\text{AA}} -\mathbf{H}_{\text{AA}}&& -\mathbf{H}_{\text{AD}} \\ -\mathbf{H}_{\text{DA}} &&E\mathbb{1}_{\text{DD}} -\mathbf{H}_{\text{DD}} -\boldsymbol{\varSigma }_{L}(E) -\boldsymbol{\varSigma }_{R}(E) \end{array} \right )^{-1} =: \left (\begin{array}{*{10}c} \mathbf{A}&\mathbf{B} \\ \mathbf{C}&\mathbf{D} \end{array} \right )^{-1}\,, \end{array} }$$
(6)

with the subscripts AA, AD, DA, DD denoting the restriction to the respective matrix subspace.

As advantage of this partitioning, the only non-trivial energy dependence appears in subblock \(\mathbf{D} = E\mathbb{1}_{\text{DD}} -\mathbf{H}_{\text{DD}} -\boldsymbol{\varSigma }_{L}(E) -\boldsymbol{\varSigma }_{R}(E)\). The block A can be diagonalized for all energies in a single eigenvalue problem: the eigenvalues are given by the diagonal matrix \(\mathbf{\tilde{A}} = E\mathbb{1} -\mathbf{\tilde{H}}_{\text{AA}}\) where \(\mathbf{\tilde{H}}_{\text{AA}}\) denotes the diagonal matrix with the eigenvalues of H AA. The transformation matrix V (\(\mathbf{\tilde{H}}_{\text{AA}}=\mathbf{V}^{-1}\mathbf{H}_{\text{AA}}\mathbf{V}\)) is constructed by filling its columns with the (right) eigenvectors of H AA. The off-diagonal blocks stay energy independent, i.e., \(\mathbf{\tilde{B}} = -\mathbf{V}^{-1}\mathbf{H}_{\text{AD}}\).

General matrix: For the matrix inversion, we first turn to a general matrix, which we divide into four blocks

$$\displaystyle{ \left (\begin{array}{*{10}c} \mathbf{A}&\mathbf{B}\\ \mathbf{C} &\mathbf{D} \end{array} \right )\,, }$$
(7)

so that the submatrices A and D are square matrices. The inverse is given by

$$\displaystyle{ \left (\begin{array}{*{10}c} \mathbf{A}&\mathbf{B}\\ \mathbf{C} &\mathbf{D} \end{array} \right )^{-1} =\ \left (\begin{array}{*{10}c} \mathbf{A}^{-1}(\mathbb{1} + \mathbf{B}\mathbf{E}^{-1}\mathbf{C}\mathbf{A}^{-1})&-\mathbf{A}^{-1}\mathbf{B}\mathbf{E}^{-1} \\ -\mathbf{E}^{-1}\mathbf{C}\mathbf{A}^{-1} & \mathbf{E}^{-1} \end{array} \right )\text{ with }\mathbf{E}:= \mathbf{D}-\mathbf{C}\mathbf{A}^{-1}\mathbf{B} }$$
(8)

as is easily checked by direct matrix multiplication.
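Equation (8) is the standard block inverse built on the Schur complement E of A, and it can indeed be verified in a few lines; a minimal numpy check with random, generically invertible blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
nA, nD = 6, 3
A = rng.normal(size=(nA, nA)) + np.eye(nA)    # random square blocks
B = rng.normal(size=(nA, nD))
C = rng.normal(size=(nD, nA))
D = rng.normal(size=(nD, nD)) + np.eye(nD)

Ai = np.linalg.inv(A)
E = D - C @ Ai @ B                            # Schur complement of A
Ei = np.linalg.inv(E)

# Block-wise inverse according to Eq. (8)
inv_block = np.block([
    [Ai @ (np.eye(nA) + B @ Ei @ C @ Ai), -Ai @ B @ Ei],
    [-Ei @ C @ Ai,                         Ei],
])

M = np.block([[A, B], [C, D]])
assert np.allclose(inv_block @ M, np.eye(nA + nD))  # direct multiplication check
```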

Transforming A into diagonal form \(\mathbf{\tilde{A}}\), i.e., \(\mathbf{A} = \mathbf{V}\mathbf{\tilde{A}}\mathbf{V}^{-1}\), makes the calculation of the inverse \(\mathbf{\tilde{A}}^{-1}\) trivial and we get:

$$\displaystyle{ \left (\begin{array}{*{10}c} \mathbf{A}&\mathbf{B}\\ \mathbf{C} &\mathbf{D} \end{array} \right )^{-1}=\ \left (\begin{array}{*{10}c} \mathbf{V}\mathbf{\tilde{A}}^{-1}(\mathbb{1} + \mathbf{\tilde{B}}\mathbf{E}^{-1}\mathbf{\tilde{C}}\mathbf{\tilde{A}}^{-1})\mathbf{V}^{-1} & -\mathbf{V}\mathbf{\tilde{A}}^{-1}\mathbf{\tilde{B}}\mathbf{E}^{-1} \\ -\mathbf{E}^{-1}\mathbf{\tilde{C}}\mathbf{\tilde{A}}^{-1}\mathbf{V}^{-1} & \mathbf{E}^{-1} \end{array} \right )\text{ with }\mathbf{E}:= \mathbf{D}-\mathbf{\tilde{C}}\mathbf{\tilde{A}}^{-1}\mathbf{\tilde{B}} }$$
(9)

using the abbreviations \(\mathbf{\tilde{C}}:= \mathbf{C}\mathbf{V}\) and \(\mathbf{\tilde{B}}:= \mathbf{V}^{-1}\mathbf{B}\).

Exploiting symmetries of G: In general, the Hamiltonian H is Hermitian and the self-energies \(\boldsymbol{\varSigma }\) are non-Hermitian. In most cases, we can restrict ourselves to a real symmetric Hamiltonian and complex symmetric self-energies. In that case, the Green’s function G is also (complex) symmetric, \(\mathbf{\tilde{B}}\) and \(\mathbf{\tilde{C}}\) are related by transposition, and the eigenvalue problem simplifies to a real symmetric one,Footnote 2 which makes the transformation matrix V orthogonal, i.e., \(\mathbf{V}^{-1} = \mathbf{V}^{T}\).

Basis change for non-local quantities: If we are interested only in non-local quantities such as the transmission or the density of states, we can go one step further. Such quantities do not depend on the spatial basis, and we can transform the Green’s function so that the Hamiltonian is diagonal in the subblock A:

$$\displaystyle{ \mathbf{G} \rightarrow \mathbf{S}^{-1}\mathbf{G}\mathbf{S}\,,\qquad \mathbf{S} = \left (\begin{array}{*{10}c} \mathbf{V}& 0\\ 0 &\mathbb{1}_{ \text{DD}} \end{array} \right )\,. }$$
(10)

In practice, we perform this transformation implicitly by omitting the respective factors of V in Eq. (9). All in all, the inverse is given by:

$$\displaystyle{ \mathbf{G} = \left (\begin{array}{*{10}c} \mathbf{\tilde{A}}^{-1}(\mathbb{1} + \mathbf{\tilde{B}}\mathbf{E}^{-1}\mathbf{\tilde{B}}^{T}\mathbf{\tilde{A}}^{-1})&-\mathbf{\tilde{A}}^{-1}\mathbf{\tilde{B}}\mathbf{E}^{-1} \\ \left [-\mathbf{\tilde{A}}^{-1}\mathbf{\tilde{B}}\mathbf{E}^{-1}\right ]^{T} & \mathbf{E}^{-1} \end{array} \right )\text{ with }\mathbf{E}:= \mathbf{D}-\mathbf{\tilde{B}}^{T}\mathbf{\tilde{A}}^{-1}\mathbf{\tilde{B}} }$$
(11)

using the abbreviation \(\mathbf{\tilde{B}}:= \mathbf{V}^{T}\mathbf{B}\).

Optimization traits: In Eq. (11), no matrix operations on matrices of the size of \(\mathbf{H}_{\text{AA}}\) appear (except for the initial eigenvalue problem): the inverse \(\mathbf{\tilde{A}}^{-1}\) is trivial since \(\mathbf{\tilde{A}}\) is diagonal. Therefore, this optimization is extremely useful for large systems where the contact regions to the leads are only a small part of the overall system, i.e., \(N_{\text{A}} \gg N_{\text{D}}\), with \(N_{\text{A/D}}\) denoting the sizes of the square matrices A and D, respectively.

For a short complexity analysis, we assume that matrix multiplication and the eigenvalue problem for N×N matrices both have computational complexity \(\mathcal{O}(N^{3})\). Then, without the above optimization, the direct matrix inversion used to calculate the Green’s function has complexity \(\mathcal{O}((N_{\text{A}} + N_{\text{D}})^{3})\), which for \(N_{\text{A}} \gg N_{\text{D}}\) approaches \(\mathcal{O}(N_{\text{A}}^{3})\).

With the above optimization, the complexity of the preparation step, containing the eigenvalue problem and the calculation of \(\mathbf{\tilde{B}}\), is \(\mathcal{O}(N_{\text{A}}^{3} + N_{\text{A}}^{2}N_{\text{D}})\), i.e., \(\mathcal{O}(N_{\text{A}}^{3})\) for \(N_{\text{A}} \gg N_{\text{D}}\). All subsequent inversions using Eq. (11) are only of complexity \(\mathcal{O}(N_{\text{D}}^{3} + N_{\text{A}}N_{\text{D}}^{2} + N_{\text{A}})\), i.e., \(\mathcal{O}(N_{\text{A}}N_{\text{D}}^{2})\) for \(N_{\text{A}} \gg N_{\text{D}}\). The three summands correspond to the inversion of E, to products of \(N_{\text{A}} \times N_{\text{D}}\) matrices with \(N_{\text{D}} \times N_{\text{D}}\) matrices such as \(\mathbf{\tilde{B}}\mathbf{E}^{-1}\), and to the inversion of \(\mathbf{\tilde{A}}\), respectively.

Strictly speaking, the optimization still scales cubically in N A due to the initial eigenvalue problem. Nevertheless, for energy sweeps over the density of states or the transmission, the complexity of each inversion step dominates and this effort could be reduced to complexity \(\mathcal{O}(N_{\text{A}}N_{\text{D}}^{2})\) for large systems,Footnote 3 cf. Fig. 4 (lower panel).

As stated above, the optimization applies only to non-local quantities. For local quantities such as current densities, the transformation matrix V cannot be omitted from Eq. (9), and we are back to cubic complexity.
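To illustrate the resulting workflow, the following numpy sketch performs the one-time eigendecomposition and the cheap per-energy step of Eq. (11) under the symmetry assumptions above. It is an illustration, not the AitransS implementation; since the broadenings \(\boldsymbol{\varGamma }_{\text{L/R}}\) are supported entirely in the D subspace, the DD block of G returned here already suffices for the transmission.

```python
import numpy as np

def prepare_sweep(H, idx_D):
    """One-time preparation for Eq. (11); a sketch, not the AitransS code.

    H    : (N, N) real symmetric Hamiltonian
    idx_D: indices of the contact region D coupled to the leads
    """
    idx_A = np.setdiff1d(np.arange(H.shape[0]), idx_D)
    # H_AA = V diag(eps) V^T, solved once for all energies (O(N_A^3))
    eps, V = np.linalg.eigh(H[np.ix_(idx_A, idx_A)])
    B_tilde = -V.T @ H[np.ix_(idx_A, idx_D)]           # B~ = V^T B with B = -H_AD
    H_DD = H[np.ix_(idx_D, idx_D)]
    return eps, B_tilde, H_DD

def G_DD(E, eps, B_tilde, H_DD, sigma_LR, eta=1e-9):
    """Per-energy step of Eq. (11), costing only O(N_A N_D^2).

    sigma_LR = Sigma_L + Sigma_R restricted to the D subspace.
    Returns the DD block of G, i.e., E^{-1} in Eq. (11).
    """
    a_inv = 1.0 / (E + 1j * eta - eps)                  # diagonal A~^{-1}
    D = E * np.eye(H_DD.shape[0]) - H_DD - sigma_LR
    schur = D - B_tilde.T @ (a_inv[:, None] * B_tilde)  # E = D - B~^T A~^{-1} B~
    return np.linalg.inv(schur)
```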

4 Conclusions

In this work, we calculated the local current density in large hydrogenated graphene flakes. The current flow shows complicated patterns, a fact that is ignored in most studies, which focus only on the total conductance. These patterns show large local fluctuations. The idea behind the nontrivial local current patterns is very general: scattering states of mesoscopic samples have an inner, nontrivial structure. This structure is seen here in the electric current density, but the concept is easily generalized to other observables, e.g., heat. A mesoscopic device (like a hydrogenated graphene ribbon) connected to reservoirs at different temperatures will show fluctuations in the local temperature as a result of the nontrivial structure of the scattering states.

Along the way, we parallelized and optimized our transport module AitransS to benefit from a supercomputer and thus to enable ab initio transport calculations for large graphene flakes. We showed that our techniques feature good scalability and discussed the necessary optimizations. Future work will benefit from the fact that ab initio current density calculations for disordered systems are now available for large 2D film materials and for medium-sized 3D materials. Essentially, the approach is limited only by the available computing power.