1 Introduction

The development of supercomputers has been advancing rapidly for the past five decades with the most recent supercomputer Fugaku culminating the Top 500 list in November 2021 at a peak speed of 537 PFlops and 7,630,848 cores [1]. Along with the growth of number of faster processors – as suggested by the Moore’s law [2], the interconnection network that can efficiently hold and connect such massive systems has been constantly on demand. Due to the power wall from the Moore’s law [2], however, the increase of processor’s clock speed has hit its barrier. Consequently, the development of ever-expanding supercomputers with ever-increasing speed leans on the design of optimally interconnected networks topologies. Generally, ideal interconnection networks should satisfy small network diameter, large bisection width, topological simplicity, symmetry, engineering feasibility, and modular, expandable design, to provide maximal connectivity, scalable performance, minimal engineering complexity and least monetary cost [3]. Meshes [4], tori [5,6,7,8,9,10], hypercubes [11], fat trees [12, 13] and multistage switched fabrics [4, 14] have been the mainstream networks for years. Meshes have simple layout but the large diameter slows down node-to-node communications. Fat trees can realize maximum bisection width, but the diameter grows with the number of levels of switches. The torus and its variation k-ary n-cube [15] have relatively smaller diameter and average distance. 3D torus is applied in Cray SeaStar [5] and IBM Blue Gene/P [6], while modified 3D torus formulates Cray Gemini interconnect [7]. 5D torus is incorporated in IBM Blue Gene/Q [8]. Hybrid 6D mesh/torus Tofu interconnect is integrated in K and post-K supercomputers [9, 10]. Recently, Dragonfly topology [16] among hierarchical topologies with small diameter has been deployed into Aries and Slingshot interconnects [17, 18].

In this manuscript, we propose the optimal circulant topologies with minimal diameter and mean path length (MPL) and maximum bisection width as promising low-latency network topologies by a home-grown and highly efficient parallel exhaustive search algorithm to generate and optimize circulant topologies. We obtain a set of optimal circulant topologies of size \(2^5\) through \(2^{10}\). By comparing with corresponding torus and hypercube of the same sizes and degrees, we show remarkable improvement in graph properties by the optimal circulant topologies and the Cartesian products of optimal circulant topologies and fully connected topologies. The benchmarking results by simulation also demonstrate significant performance enhancement for communication-intensive applications. In the meantime, the Cartesian products of optimal circulant topologies and fully connected topologies show potential in balancing the influence of global communication with data traffic patterns of specific applications or internal algorithms. We discuss related work in Sect. 2 and present the parallel optimization algorithm for circulant topologies in Sect. 3. Section 4 examines graph properties of optimal circulant topologies and Cartesian products by comparative analysis with torus and hypercube. Section 5 shows simulated benchmark performance and detailed analysis. We conclude the manuscript in Sect. 6.

2 Related work

The optimization of interconnection networks is essentially to reduce the node-to-node communication latency, which is directly indicated by the diameter and mean path length (MPL) of the topology. The idea of optimizing MPL can be dated back to Cerf et al., who first calculated the lower bound of MPL given arbitrary regular graph size and degree [19]. Zhang et al. modified torus by adding bypass links to construct interlaced bypass torus (iBT) [20, 21], designed efficient broadcast algorithms on iBT [22] and evaluated its performance by building a re-configurable cluster and system [23]. Feng et al. further explored 6D mesh/iBT on custom routing algorithms [24] and performance evaluation by simulation [25]. Deng et al. implemented parallel exhaustive search of regular graphs [26] and random search with rotational symmetry to obtain small and large (near) optimal topologies and evaluated the performance by benchmarking both on real clusters and simulation platform [27]. More small optimal topologies with symmetries and other special properties were produced and presented in a structured table [28]. Distributed shortcut networks targeting the diameter and cable length trade-off [29] and host-switch graphs designed by minimizing diameter and/or MPL using randomized heuristics [30] were also introduced and benchmarked by simulation.

In addition, network topologies with additional symmetry properties provide advantages in routing and load balancing [4] and also in design and engineering simplicity. Recent research on constructing supercomputer network utilizing Lie algebra symmetry [3, 31] and optimizing MPL by random or heuristic algorithms with rotational symmetry constraints [27, 32] have demonstrated the importance of the symmetry perspective in network design. The total number of symmetries of a graph calculated as its automorphism group size has also been included in further optimization of smaller topologies [28]. Vertex-symmetric topologies are one class of highly symmetric topologies, where any two vertices are equivalent under self mappings that preserve adjacency [4]. Torus consisting of equal-sized rings, hypercube and dragonfly are notable examples of vertex-symmetric topologies that have been widely applied in current mainstream high-performance computing networks as introduced in Sect. 1.

Among vertex-symmetric topologies, circulant topologies, also known as multi-loop networks, have been extensively explored in both theory and applications as summarized in three major surveys [33,34,35]. The lower bounds of diameter and MPL for circulant topologies have been estimated in theory [34]. For degree-4 (double-loop) circulants, such lower bounds can be achieved [36, 37] and their layouts [38], routing [39, 40] and broadcast algorithms [41] have been studied. However, for arbitrary size and degree, the existence of such optimal circulant topologies that reach theoretical lower bounds of diameter and MPL still remains open.

The synthesis of general arbitrary-degree circulant topologies with small diameter and/or MPL has been conducted by both theoretical construction and computer search [35]. Population-based heuristic algorithms [42] have been analyzed when optimizing MPL for circulants with given size, but may only obtain near-optimal graph properties comparing with exhaustive search. Techniques such as tree-like search guided by girth heuristic [43] and direct product construction [44, 45] have been developed to solve the related degree/diameter problem, which is to maximize the size of circulant topologies given fixed diameter and degree. To the best of our knowledge, there are only a few parallel algorithms for the synthesis and optimization of arbitrary-degree circulant topologies, and even less work on the performance evaluation of circulant topologies as communication networks. Our optimization method of efficient parallel exhaustive search provides another approach in finding all the definitive optimal circulant topologies given graph size and degree. We further select the optimal circulant topology with the largest bisection width to add to its potential in performance enhancement. Moreover, our simulation results bring insight into the actual performance of the optimal circulant topologies for computational applications.

3 Discovery of optimal circulant topologies

An N-vertex circulant graph [46] is defined by a jump set \(S=\{s \vert 1 \le s \le \lfloor {N/2}\rfloor \}\), where each vertex \(i=0, 1, ..., N-1\) is connected to vertices \(i\pm s \pmod {N}\) for every jump s. A degree-k circulant graph has \(\lceil {k/2}\rceil\) number of jumps. In particular, the rings and complete graphs are both circulant, when \(S=\{1\}\) and \(S=\{1, 2, ..., \lfloor {N/2}\rfloor \}\) respectively. Examples of 16-vertex circulant graphs are shown in Fig. 1, which are optimal as we discovered.

Fig. 1
figure 1

16-vertex optimal circulant graphs

We search optimal circulant graphs of minimal diameter (D), minimal mean path length (MPL), and maximal bisection width (BW). To exhaustively search through all N-vertex degree-k circulant graphs, denoted by (Nk), is equivalently to compute all combinations of choosing \(\lfloor {k/2}\rfloor\) jumps from 1 to \(\lfloor {(N-1)/2}\rfloor\). Jump N/2 is excluded as the jump set S contains N/2 if and only if k is odd. Since we require the circulant graphs to be connected, by theorem of Boesch and Tindell [47], a circulant graph is connected if and only if \(\mathrm {gcd}(N,S)=1\). Thus, when N is a power of 2, there must be an odd jump. Such an odd jump is also relatively prime to N, which can be mapped to jump 1 in an equivalent jump set by Adam isomorphism of the circulant graph [47]. As a result, for N equal power of 2, we can set one of the jumps to 1 and reduce the exhaustive search to choosing \(\lfloor {k/2}\rfloor -1\) jumps from 2 to \(\lfloor {(N-1)/2}\rfloor\). For instance, the numbers of (1024, 10) and (1024, 15) circulant graphs to optimize can be considerably reduced from \(\sim 3 \cdot 10^{11}\) to \(\sim 3 \cdot 10^{9}\) and from \(\sim 2 \cdot 10^{15}\) to \(\sim 2 \cdot 10^{13}\) respectively.

To effectively perform the exhaustive search of optimal circulant graphs, we adapt the co-lexicographic ordering algorithm of combinations in the FXT library, which is a highly efficiently library in combinatorial generation [48, 49]. One key feature of the algorithm is that the generation of each combination only depends on the previous combination in co-lexicographic order. Therefore, the generation process can be reduced to a minimum number of operations which makes it as fast as possible. Moreover, such feature enables the parallelization of the total combinatorial generation by incorporating with an unranking algorithm [50]. In the co-lexicographic ordering, each combination corresponds to a unique ordinal number which is called the rank of the combination. The unranking algorithm maps a rank back to its corresponding combination in co-lexicographic order, which allows the generation to start and end at arbitrary ranks.

Utilizing the above combinatorial techniques including the reduction of search space, the co-lexicographic ordering algorithm and the unranking algorithm, we build our own parallel exhaustive search program for circulant graphs. Given (Nk) as input, we design our parallel exhaustive search algorithm to start by counting the total number of (Nk) circulant graphs to optimize, and divide the number evenly among all processors. Then each processor unranks its start rank, generates circulant graphs in the format of jump sets by enumerating through combinations in co-lexicographic order while filtering by finding the ones with minimal diameter and then minimal MPL, and stops at its end rank. For the calculation of diameter and MPL, we run breadth-first search only once on each graph since circulant graphs are vertex-symmetric. For N equal power of 2, we use bit arrays and bit shifts to mark traversed vertices in the implementation of breadth-first search to further improve its speed. Finally, all processors communicate only once to find the overall minimal diameter and MPL. Then each processor filters once more to obtain the optimal circulant graphs. Hence our parallel exhaustive search algorithm has perfectly balanced workloads, minimal communication and synchronization delay, and highly efficient implementations of both combinatorial generation and calculation of diameter and MPL. We execute the parallel exhaustive search on the SeaWulf supercomputer at Stony Brook University. The processing speed on compute nodes with Intel Xeon Gold 6148 CPUs can reach \(\sim 10^{5}\) graphs per second per core, which makes it practical to optimize large high-degree circulant graphs even up to (1024, 15) by exhaustive search.

When there are multiple optimal circulant graphs with minimal diameter and MPL, we filter further by optimizing bisection width. To compute the bisection width of a graph, we use the KaHIP program which can achieve strictly balanced bisection as required [51]. We also eliminate isomorphic optimal circulant graphs using the nauty and Traces programs [52]. We present the discovered optimal circulant graphs (in Table 1) which have minimal diameters and MPLs and maximum bisection widths filtered in such specific order. We believe they serve as promising candidates for interconnection network topologies.

Table 1 Optimal circulant graphs discovered by parallel exhaustive search

4 Graph analysis of optimal circulant topologies

We analyze the graph properties of optimal circulant topologies by comparing with torus and hypercube as well as Cartesian products of one smaller optimal circulant topology and the 4-vertex fully connected topology. We first calculate and present the diameter, MPL and bisection width of the three categories of topologies in Table 2. For each N from \(2^5\) to \(2^{10}\), single optimal circulant topologies are denoted as Optimal Circulant, while Cartesian products of optimal circulant topologies are denoted as Optimal Circulant Product. For tori and Cartesian products, the product components are placed from small to large starting on the right. An example of Cartesian product of optimal circulant topologies is shown in Fig. 2. We also calculate the number of edges \(\vert E\vert =Nk/2\) for all topologies in Table 2.

Fig. 2
figure 2

(32, 5) Optimal Circulant Product (8,2)\(\times\)(4,3)

Moreover, for a given graph size we further normalize the diameter, MPL and bisection width in Table 2 by comparing all topologies with corresponding torus and compute the graph property ratios. For diameter and MPL, the ratios are inverted as \(\text {D}_\text {Torus}/\text {D}_\text {Topology}\) and \(\text {MPL}_\text {Torus}/\text {MPL}_\text {Topology}\). For bisection width, the ratio stays as \(\text {BW}_\text {Topology}/\text {BW}_\text {Torus}\). All (inverse) ratios are shown in Fig. 4b–d by bar plots, such that higher bars always represent better graph properties. In the end, for each graph category, we show the average (inverse) ratios of graph properties across size N in Fig. 4a. We use low-degree and high-degree to denote the two different graph degrees for each N as in Table 2, and use consistent color scheme as in Fig. 3 to represent different topology categories in all bar plots Figs. 4, 5, 6, 7, 8, 9 and 10.

Table 2 Graph properties of optimal circulant (product) topologies and tori/hypercubes
Fig. 3
figure 3

Color scheme to represent different topology categories in all bar plots

Fig. 4
figure 4

Graph properties measured in (inverse) ratios to torus

Table 2 and Fig. 4 show that the optimal circulant topologies and Cartesian products always have smaller diameter, MPL and larger bisection width than corresponding torus. The low-degree optimal circulant topologies and Cartesian products have average diameter inverse ratios 1.58/1.68 and average MPL inverse ratios 1.18/1.24 respectively comparing with torus, indicating 37%/40% smaller diameter and 15%/19% smaller MPL in average. Meanwhile, the high-degree optimal circulant topologies have further smaller diameter and MPL, 52% and 30% decrease in average comparing with torus. For bisection width, the low-degree optimal circulant topologies and Cartesian products have clear advantages over torus, and almost as good as hypercube, while the high-degree optimal circulant topologies can achieve up to 235% average increase. We also observe that for the optimal circulant topologies and Cartesian products, both the diameter inverse ratios and bisection width ratios are increasing along with the topology size N with slight fluctuation, while the MPL inverse ratios mostly stay around its average.

5 Benchmarks of optimal circulant topologies

5.1 Simulation platform and network routing

To investigate the performance of optimal circulant topologies, we simulate the topologies listed in Table 2 on SimGrid (version 3.26) [53]. SimGrid offers flexible and accurate parallel simulation platform, especially the SMPI interface which enables simulation of MPI applications [53]. We configure the cluster parameters in SimGrid by setting 100 GFlops computational speed per core of single-core nodes, 10 Gbps network bandwidth and 40 ns latency per link. In the cluster the computing nodes are set to be directly connected via input topologies. We utilize the built-in implementation of MVAPICH2 in SimGrid as the MPI library. All the simulations are run on the SeaWulf supercomputer at Stony Brook University.

For network routing, we use static shortest-path routing with full routing table for all simulated topologies with a custom-designed vertex-symmetric routing for the optimal circulant topologies. To determine the routing table, we first apply breadth-first search once and find the initial routes from node 0 to all the other nodes. Then we add i (\(\bmod \ N\)) to all the nodes on the initial routes to form the routes from node i to all the other nodes. In this way we can construct the full routing table for the optimal circulant topologies. For torus, hypercube, and Cartesian products, we implement the widely used dimension-order routing which achieves shortest distances between nodes in Cartesian product topologies [4].

5.2 Benchmark results and analysis

The following benchmark programs are simulated to examine and compare the performance of network topologies: effective bandwidth (b_eff) [54, 55], FFTE [56, 57], Graph 500 [58, 59] and the NAS Parallel Benchmarks (NPB) [60, 61] consisting of FT, IS, CG, MG, LU, BT and SP. We run each benchmark on all topologies listed in Table 2 and evaluate the performance by ratio of processing speeds to torus of the same size. The performance ratio is calculated as \(S_\text {Topology}/S_\text {Torus}\), where S is the average speed over multiple runs reported by the benchmark. All performance ratios are demonstrated by bar plots in Figs. 5, 6, 7, 8, 9 and 10. The terms low-degree and high-degree are used to denote the two different graph degrees for each N. The same color scheme as in Fig. 3 is used to represent different topology categories in all bar plots. Like the comparative analysis of graph properties in Sect. 4, we also present in Fig. 5 the average performance ratios across topology size N for each benchmark.

Fig. 5
figure 5

Average performance measured in ratios to torus

Figure 5 shows that comparing with torus, the low-degree and high-degree optimal circulant topologies and Cartesian products achieve average performance gains of 15%, 52% and 15% respectively over all simulated benchmarks. The performance ratios are much higher especially on effective bandwidth, FFTE, NPB FT and Graph 500, while reaching higher or equal average performance for the other NPB benchmarks. In particular, the high-degree optimal circulant topologies are capable of enhancing the performance up to 128% and 124% in average on FFTE and NPB FT respectively, even 24% and 21% higher than hypercube. We present more detailed analysis on the benchmarks and simulation results in the rest of this section.

5.2.1 Effective bandwidth

Effective bandwidth (b_eff, version 3.6.0.1) [54, 55] measures the total network bandwidth over multiple ring and random communication patterns. It also compares three different communication methods: MPI_Sendrecv, MPI_Alltoallv and non-blocking MPI_Irecv and MPI_Isend with MPI_Waitall, and reports the maximum bandwidth of all methods. To fairly compare the performance of optimal circulant topologies with torus, we modify the benchmark to run over random communication patterns only and collect the reported bandwidth at maximum message size of 1 MB per processor.

Fig. 6
figure 6

Performance ratios to torus on effective bandwidth

The effective bandwidth performance ratios to torus are plotted in Fig. 6. The high-degree optimal circulant topologies substantiate 42% average performance gain, reaching 62% at \(N=1024\) which is even 22% higher that hypercube. The low-degree optimal circulant topologies and Cartesian products show higher performance gain starting with larger topology size from \(N=256\), with 17%/14% in average and as high as 31%/34% respectively at \(N=1024\), almost equal to hypercube. Since effective bandwidth essentially measures cross-network traffic, optimal circulant topologies with small diameter and MPL and large bisection width therefore can significantly improve the performance.

5.2.2 FFTE

FFTE (version 7.0) [56, 57] is simulated, which requires global all-to-all data transpositions across the network. We perform the parallel 1D FFTE routine with transform array lengths ranging from \(2^{10}\) to \(2^{31}\), equal to total transform array sizes from 16 KB to 32 GB. Then we collect the sustained computational speed at the maximum transform array length.

Fig. 7
figure 7

Performance ratios to torus on 1D FFTE

Figure 7 shows the 1D FFTE performance ratios to torus. The low-degree and high-degree optimal circulant topologies and Cartesian products have slightly fluctuating performance ratios, maintaining average performance gains of 64%, 128% and 25% respectively. Moreover, the high-degree optimal circulant topologies outperform hypercube for every topology size, adding to 24% higher average performance gain. The high performance gains result from the performance dependence of 1D FFTE on global communication, which relies heavily on the optimization of circulant network topology.

5.2.3 Graph 500

The Graph 500 (version 2.1.4) [58, 59] tests graph search and shortest path algorithms on an tremendously large undirected graph distributed among all processors. It evaluates large-scale data-intensive performance for supercomputers. The processing speed is reported as mean TEPS, i.e. traversed edges per second. Due to implementation issues with SimGrid, we use the previous version 2.1.4 and the replicated version of parallel breadth-first search. We choose the scale of 29 with edge factor 4, generating an initial unweighted graph of 36 GB.

Fig. 8
figure 8

Performance ratios to torus on Graph 500

The performance ratios to torus for Graph 500 are shown in Fig. 8. The low-degree and high-degree optimal circulant topologies and Cartesian products can achieve average performance gains of 25%, 70% and 27% respectively. The global communication involved in Graph 500 makes the optimal circulant topologies suitable for enhancing the performance. We also note that hypercube brings top performance for Graph 500, which shows that the extremely symmetric topology structure coupled with proper routing may compensate its other disadvantages by matching with traffic patterns and internal MPI algorithms.

5.2.4 NAS parallel benchmarks (NPB)

Fig. 9
figure 9

Performance ratios to torus on NPB kernels

Fig. 10
figure 10

Performance ratios to torus on NPB pseudo applications

The NAS Parallel Benchmarks (version 3.4.1) [60, 61] consist of programs derived from computational fluid dynamics (CFD) applications. We benchmark four kernel programs: discrete 3D FFT (FT), integer sort (IS), conjugate gradient method (CG) to calculate the smallest eigenvalue and multi-grid solver (MG) on a sequence of meshes, and three pseudo applications: lower-upper (LU) Gauss-Seidel solver, block tri-diagonal (BT) solver and scalar penta-diagonal (SP) solver. FT tests long-distance all-to-all communication; IS uses random memory access and tests both integer computation speed and communication; CG uses irregular memory access and tests unstructured long-distance communication; MG tests highly structured short- and long-distance communication with intensive memory access [60, 61]. We select the standard problem size Class C for each benchmark. Since BT and SP require a square number of processors, we perform these two applications only on topologies of \(N=64\), 256 and 1024.

Figures 9 and 10 show the NPB performance ratios to torus. The performance of FT (Fig. 9a) is similar to 1D FFTE (Fig. 7) in accordance with global communication, in which the low-degree and high-degree optimal circulant topologies and Cartesian products all have high performance gains of 66%, 124% and 27% in average respectively, with high-degree optimal circulant topologies 21% higher than hypercube. For IS (Fig. 9b), the performance is also similar to Graph 500 (Fig. 8) but with relatively lower performance ratios, in which the high-degree optimal circulant topologies and Cartesian products achieve average performance gains of 30% and 18%. The performances of other NPB benchmarks fluctuate around the average (Figs. 9c, d and 10 ). The high-degree optimal circulant topologies are able to reach higher average performance gains of 28%, 54% and 36% in CG, BT and SP, almost as good as hypercube, while maintaining around 3% and 10% in MG and LU. The Cartesian products achieve higher average performance gains of 35% and 18% in CG and MG, while keeping around equal performance in LU, BT and SP. Meanwhile, the low-degree optimal circulant topologies have almost equal average performance compared with torus in most of the NPB benchmarks apart from FT.

The NPB performance ratios to torus demonstrated multiple perspectives for performance enhancement. On one hand, applications with intensive global communication such as FT are strongly dependent on network topology, where the optimal circulant topologies contribute most to enhancing the performance. On the other hand, other factors such as data traffic patterns, internal algorithms and routing also influence the performance. Limited parallelism in LU [61] may result in its relatively lower performance ratios in general among all topologies. The fluctuation of performance ratios especially for the linear solvers such as MG, BT and SP can be caused by different partitions of the fixed input 3D grid onto topologies of varying sizes, which creates diverse data traffic patterns. In the meantime, some application can reach particularly high performance when its internal traffic pattern matches with the network connections, such as hypercube for IS. The Cartesian products of optimal circulant topologies and fully connected topologies may serve as better candidates for balancing the influence of global communication with specific traffic patterns, such as in CG and MG.

6 Discussion and conclusion

Optimal circulant graphs, obtained by highly efficient parallel exhaustive search algorithm in our study, are recognized as low-latency network topologies for large clusters of computing systems. The optimal circulant graphs and their Cartesian products, obtained from a huge search space of graphs with the same graph size and degree, have minimal diameter and MPL and maximum bisection width. These favorable graph properties prompt us to verify a common belief that they can significantly improve the performance for communication-intensive applications.

Indeed, the second contribution of our work is to demonstrate the enhancement of performance of these optimal circulant topologies, compared with other mainstream topologies including torus and hypercube of the same size, on effective bandwidth, 1D FFTE, NPB FT and Graph 500. These applications not only have high communication-to-computation ratio, but also have dominant dependence on long-distance global communication such as cross-network and all-to-all data traffic. The characterization of such applications is in accordance with our optimization methodology in which diameter, MPL and bisection width are overall global topology properties. As a result, on these applications our newly discovered optimal circulant topologies fulfill their advantages to the most extent. We also observe that, although the optimal circulant topologies and their Cartesian products achieve higher performance than torus, the performance ratios fluctuate around the average on different topology sizes for certain benchmarks such as NPB MG, BT and SP. This exhibits the influence on performance from other factors including specific data traffic patterns, internal algorithms, topology and problem scale, routing methods and memory access. A noticeable phenomenon is the benefit of extremely symmetric hypercube for the performance of Graph 500 and NPB IS. More investigation shows that specific algorithms for MPI collective operations such as recursive doubling and halving [62] match exactly the hypercube network connections. The relation between communication pattern and network topology can be explored to develop topology-aware mapping algorithms and frameworks [63, 64], which may even lead to hardware-software co-design [65, 66]. Additionally, different routing methods including adaptive routing on circulant topologies [67] can be examined and both topology- and routing-aware task mapping [68] may further enhance the performance of optimal circulant topologies.

To extend to the design of larger-scale networks, we propose and evaluate Cartesian products of optimal circulant topologies and fully connected topologies, which have both optimal local and global structures and thus can balance the global communication with specific traffic patterns as shown by our simulations. Various graph products [69, 70] of multiple optimal circulant topologies may also be applied to expand the network size. As another advantage over torus and hypercube, the node degree of circulant topologies can be arbitrary. Therefore, optimal circulant topologies are ideal candidates for global connections in large hierarchical networks [71, 72].