1 Introduction and motivation

In the past decade or so, high-performance computing has developed rapidly: the number of supercomputers keeps growing and their computing speeds keep increasing. High-performance computing is widely used in fields such as laser fusion, oil exploration and weather forecasting (Xiaowen et al. 2009; Dogru et al. 2011; Kimura 2002).

Parallel computing uses high-performance computers as hardware platforms and solves scientific computing problems by having multiple processors work together in a coordinated manner. Compared with serial computing, parallel computing can solve a problem of the same scale in a shorter time without losing accuracy, or solve a larger problem in the same amount of time. On large-scale parallel computers, MPI (message passing interface) (The MPI 2008) is currently the most widely used parallel programming interface.

MPI is a message-passing parallel programming technique. The MPI standard defines a set of portable programming interfaces, and there are several major implementations, such as OpenMPI, MPICH, Intel MPI and MVAPICH; they all follow the MPI interface standard but differ internally. MPI_Allreduce is a global reduction operation in MPI, equivalent to an MPI_Reduce followed by an MPI_Bcast. The reduction operations defined by MPI include sum, product, maximum, minimum, maximum and its location, minimum and its location, etc. MPI_Reduce lets all processes in a communicator participate in the reduction of the same variable and delivers the reduced result to a specified process, usually the master (root) process. (MPI_Scatter, by contrast, simply distributes chunks of a vector to the processes.) Therefore, compared with MPI_Reduce, after MPI_Allreduce every process holds the reduced result, whereas after MPI_Reduce only the specified process does.
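
For illustration, a minimal C++/MPI sketch of the relationship described above: MPI_Allreduce leaves the global sum on every process and has the same effect as MPI_Reduce followed by MPI_Bcast (the variable names are ours; error handling is omitted).

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = rank + 1.0;              // each process's contribution
    double sum_all = 0.0, sum_root = 0.0;

    // Every process receives the reduced result.
    MPI_Allreduce(&local, &sum_all, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    // Equivalent effect: reduce to the root, then broadcast.
    MPI_Reduce(&local, &sum_root, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Bcast(&sum_root, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("allreduce = %g, reduce+bcast = %g\n", sum_all, sum_root);
    MPI_Finalize();
    return 0;
}
```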

When performing floating-point operations on computers, some numerical problems arise. First, computers store floating-point numbers in binary, and floating-point operations produce rounding errors, so the computed results deviate from the exact results; in particular, when the result of one operation is used in subsequent operations, rounding errors accumulate, which can make the results unreliable in some cases. Second, floating-point addition does not satisfy the associative law, so a parallel reduction performed with different numbers of processes may produce different results, i.e., non-reproducible results. A proven way to address these phenomena is to study high-precision numerical algorithms and implementations.
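
A minimal illustration of the non-associativity mentioned above (our own example, not from the paper): the same three numbers summed in two different orders give different results in IEEE double precision.

```cpp
#include <cstdio>

int main() {
    double a = 1.0e16, b = -1.0e16, c = 1.0;
    double left  = (a + b) + c;   // = 1.0
    double right = a + (b + c);   // = 0.0, since b + c rounds back to -1.0e16
    std::printf("(a+b)+c = %.1f, a+(b+c) = %.1f\n", left, right);
    return 0;
}
```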

The all-reduce operation combines values from all processes and distributes the result to all processes. It is commonly used in parallel computing, and in the MPI standard (The MPI 2008) the routine for this operation is MPI_Allreduce. Currently, the most widely used all-reduce scheme is the butterfly-like algorithm (Rabenseifner 2004; Rabenseifner and Traff 2004; van de Geijn 1994). When the network can support the butterfly communication pattern without contention, this algorithm is optimal both in the latency term (using the minimum number of communication rounds) and in the bandwidth term. The problem is that the butterfly communication pattern can cause network contention on many contemporary clusters. Patarasuk and Yuan therefore implemented an efficient all-reduce operation for large data sizes: the ring-based all-reduce they proposed is bandwidth-optimal (Patarasuk and Yuan 2009), and the per-process communication volume of the RingAllreduce algorithm is essentially constant, independent of the number of processes. However, in the presence of rounding errors, the reduction result of the RingAllreduce algorithm can suffer from the accuracy problems described above.

Zhou (1980) proposed a formula relating computer word length, speed, and memory by establishing a probability model for the accumulation of rounding errors, which was early research on floating-point rounding errors in China. Without high-precision summation algorithms, some applications become inaccurate or incorrect. Demmel and Hida (2004) proposed a fast and accurate floating-point summation algorithm and applied it to computational geometry. Demmel and Nguyen (2013) proposed a fast reproducible floating-point summation algorithm, and later a parallel reproducible summation algorithm (Demmel and Nguyen 2015) based on it. Higham's monograph (Higham 2002) gives a very comprehensive treatment of the accuracy and stability of numerical algorithms, and Blanchard et al. (2020) proposed a class of fast and accurate high-precision floating-point summation algorithms together with theoretical analysis. Rump proposed a variety of high-precision summation algorithms and analyzed them in detail (Rump et al. 2008a, b; Rump 2009). Muller's monograph (Muller et al. 2010) covers floating-point arithmetic in great detail. Building on Rump's work, we published an article (Lei et al. 2021) proposing a new fast parallel high-precision summation algorithm based on a high-precision MPI_Allreduce, with theoretical analysis and experimental verification. We also implemented a reproducible BiCGSTAB (Lei et al. 2023) based on Demmel's ReproBLAS (Ahrens et al. 2020) and Iakymchuk's ExBLAS (Iakymchuk et al. 2015).

The remaining sections of this paper are structured as follows. In Sect. 2, we explain the notation and introduce the RingAllreduce algorithm. In Sect. 3, we introduce the double-double format and its basic operations (Li et al. 2002; Hida et al. 2001), and then propose our high-precision RingAllreduce algorithm, which combines the double-double format with the RingAllreduce algorithm; we also analyze the error bound of the proposed algorithm, confirming that it achieves results of approximately double-double precision. In Sect. 4, we present the experimental results, compare the accuracy and performance with those of the RingAllreduce algorithm, and verify that the theoretical error bound is tight. Finally, we conclude the paper in Sect. 5 and suggest some future work.

2 RingAllreduce algorithm

2.1 Notation

This section introduces the notation used in this paper, summarized in Table 1; the first column gives the symbol and the second column its meaning.

Table 1 Notation meaning

2.2 RingAllreduce algorithm

In the RingAllreduce algorithm, the processes are arranged in a logical ring. Each process has a left neighbor and a right neighbor; it only ever sends data to its right neighbor and receives data from its left neighbor. The algorithm proceeds in two steps: first a scatter-reduce, then an allgather. In the scatter-reduce step, the processes exchange data such that every process ends up with one chunk of the final result. In the allgather step, the processes exchange those chunks such that every process ends up with the complete final result.

Let the N processes be \(P_{0}, P_{1}, \ldots , P_{N-1}\). Using the RingAllreduce algorithm, the scatter-reduce operation is performed as follows. Assuming each process has K values, the K values in each process are first partitioned into N chunks, all chunks having \(\lceil \frac{K}{N}\rceil\) values except the last chunk, which has \(K-(N-1)\lceil \frac{K}{N}\rceil\) values. Let us number the chunks \(chunk_{0}, chunk_{1}, \ldots , chunk_{N-1}\). The scatter-reduce operation is then carried out by performing \(N-1\) iterations of the logical ring pattern.

We use a specific example to illustrate the scatter-reduce step. Suppose we have three processes; in the first iteration, the chunks sent and received by the three processes are shown in Table 2. After each process receives a data chunk, it performs a reduction operation on the received chunk and its corresponding local chunk (the chunk with the same index), and replaces its own data with the (partial) reduction result. Figure 1 shows the scatter-reduce step implemented on the logical ring of three processes.

Table 2 Scatter-reduce data transfers
Fig. 1 Logical ring scatter-reduce algorithm

After the scatter-reduce step is complete, every process holds one chunk of the final result, which includes contributions from all the processes. To complete the all-reduce operation, the processes must exchange those chunks, so that all processes have all the final results.

The allgather proceeds identically to the scatter-reduce (with \(N-1\) iterations of sends and receives), except that instead of accumulating the values they receive, the processes simply overwrite their chunks.

The RingAllreduce algorithm pseudocode is shown in Listing 1.

[Listing 1: RingAllreduce algorithm pseudocode]
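
Since the original listing is not reproduced above, the following is an illustrative C++/MPI sketch of a ring all-reduce for summation, assuming for simplicity that the local array length K is divisible by the number of processes N; the function name ring_allreduce_sum and the chunk indexing are our own and may differ from the paper's Listing 1.

```cpp
#include <mpi.h>
#include <vector>

// Ring all-reduce (sum) over doubles; data holds K values per process.
void ring_allreduce_sum(std::vector<double>& data, MPI_Comm comm) {
    int rank, N;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &N);
    const int K = static_cast<int>(data.size());
    const int chunk = K / N;                    // assume K % N == 0
    const int right = (rank + 1) % N;
    const int left  = (rank - 1 + N) % N;
    std::vector<double> recv(chunk);

    // Scatter-reduce: after N-1 rounds, process r owns the fully
    // reduced chunk (r + 1) % N.
    for (int step = 0; step < N - 1; ++step) {
        int send_idx = (rank - step + N) % N;      // chunk sent to the right
        int recv_idx = (rank - step - 1 + N) % N;  // chunk received from the left
        MPI_Sendrecv(data.data() + send_idx * chunk, chunk, MPI_DOUBLE, right, 0,
                     recv.data(), chunk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < chunk; ++i)            // local reduction
            data[recv_idx * chunk + i] += recv[i];
    }

    // Allgather: circulate the reduced chunks, overwriting instead of adding.
    for (int step = 0; step < N - 1; ++step) {
        int send_idx = (rank + 1 - step + N) % N;
        int recv_idx = (rank - step + N) % N;
        MPI_Sendrecv(data.data() + send_idx * chunk, chunk, MPI_DOUBLE, right, 0,
                     data.data() + recv_idx * chunk, chunk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```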

Next we analyze the communication cost of the RingAllreduce algorithm. Assume that each process owns K values. In the RingAllreduce algorithm, each of the N processes sends and receives values \(N-1\) times for the scatter-reduce and \(N-1\) times for the allgather; each time, a process sends \(\lceil \frac{K}{N}\rceil\) values. Therefore, the total amount of data transferred to and from every process is

$$\begin{aligned} Data\, Transferred=2(N-1)\left\lceil \frac{K}{N}\right\rceil , \end{aligned}$$

which, crucially, is essentially independent of the number of processes.

Baidu has successfully applied the RingAllreduce algorithm to deep-learning training and released its RingAllreduce implementation as a library: https://github.com/baidu-research/baidu-allreduce.

Next, we analyze the error bound of the RingAllreduce algorithm. Following Higham (2002), we define \(\gamma _{n}\) as

$$\begin{aligned} \gamma _{n}:=\frac{n{\texttt{u}}}{1-n{\texttt{u}}}, n\in {\mathbb {N}}, \end{aligned}$$

When using \(\gamma _{n}\), we implicitly assume that \(n{\texttt{u}} < 1\).

Let \(p=(p_{1},\ldots ,p_{n})^{T}\in {\mathbb {F}}^{n}\). Then it holds that (Higham 2002)

$$\begin{aligned} {{\tilde{s}}}: = fl\left(\sum \limits _{i = 1}^n {{p_i}} \right) \Rightarrow \mid {{{\tilde{s}}} - \sum \limits _{i = 1}^n {{p_i}} } \mid \le {\gamma _{n - 1}}\sum _{i = 1}^n {{\mid {p_i} \mid }}. \end{aligned}$$
(1)

Note that (1) is valid for any order of addition in the summation.

Let us denote s and S by

$$\begin{aligned} s: = \sum \limits _{i = 1}^n {{p_i}},S: = \sum \limits _{i = 1}^n {\mid {p_i}\mid }. \end{aligned}$$

The condition number of the summation of the vector p is defined by

$$\begin{aligned} cond\left(\sum \limits _{i = 1}^n {{p_i}} \right): = \frac{S}{{\mid s\mid }}, s \ne 0. \end{aligned}$$

The error bounds of the result \({\texttt{res}}\) by RingAllreduce are given as follows:

Theorem 1

Let \({\texttt{res}}\) be the result obtained by \({\texttt{RingAllreduce}}\), then

$$\begin{aligned} \mid {{\texttt{res}} - s} \mid \le {\gamma _{n - 1}}S. \end{aligned}$$
(2)

Moreover, if \(s\ne 0\), then

$$\begin{aligned} \mid {\frac{{{\texttt{res}} - s}}{s}} \mid \le {\gamma _{n - 1}}cond \left(\sum {{p_i}} \right). \end{aligned}$$
(3)

From the error bounds, we can see that the relative error of a summation problem is related to both the number of summands and the condition number of the problem. This theorem allows us to assess the accuracy of the RingAllreduce algorithm. Suppose we need to sum 100 numbers: if the condition number of the summation problem is of order \(10^{13}\), then the relative error bound between the result given by the algorithm and the exact solution is of order one, i.e., the computed result may have no correct significant digits.
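
As a worked illustration of this claim (our own arithmetic, using \({\texttt{u}}=2^{-53}\approx 1.11\times 10^{-16}\)):

$$\begin{aligned} \gamma _{99}=\frac{99{\texttt{u}}}{1-99{\texttt{u}}}\approx 99\times 1.11\times 10^{-16}\approx 1.1\times 10^{-14},\qquad \gamma _{99}\cdot 10^{13}\approx 0.1, \end{aligned}$$

so the bound (3) no longer guarantees even one correct significant digit.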

3 ddRingAllreduce algorithm

3.1 Double-double formats

In this section, we use the same notation as in Yamanaka et al. (2008). Let \({\mathbb {F}}\) be the set of floating-point numbers. Throughout this paper, we assume floating-point arithmetic adhering to IEEE standard 754 (ANSI 2019). Let \(p = (p_i)\in {\mathbb {F}}^{n}\) and let \({fl}(\cdot )\) denote the result of evaluating the expression inside the parentheses in ordinary floating-point arithmetic with rounding to nearest. We denote the machine epsilon by \({\texttt{u}}\); in IEEE 754 double precision, \({\texttt{u}}=2^{-53}\).

The basic double-double arithmetic operations are built from algorithms in the QD multi-precision software library (Hida et al. 2001) developed by Hida, Li and Bailey; they make numerical results accurate to approximately double-double precision.

A double-double number x is represented by the combination of two non-overlapping double precision floating-point numbers \(x_{h}\) and \(x_{l}\), that is, \(x = x_{h} + x_{l}\) with \(\mid x_{l}\mid \le \frac{1}{2}{\mathrm{{ulp}}}(x_{h})\le {\texttt{u}}\mid x_{h}\mid\), where \(\mathrm{{ulp}}(x_{h})\) is the gap between the two floating-point numbers nearest to \(x_{h}\) (Jiang 2013).

We now describe double-double addition. First, we introduce the error-free transformation for the addition of two floating-point numbers. Assume that a and b are floating-point numbers and \(fl(a~op~b)\in {\mathbb {F}}\); by a fundamental property of floating-point arithmetic, the rounding error of such an operation is itself a floating-point number. Therefore, we obtain:

$$\begin{aligned} x = fl\,(a \pm b) \Rightarrow a \pm b = x + y,y \in {\mathbb {F}},\\x = fl\,(a \times b) \Rightarrow a \times b = x + y,y \in {\mathbb {F}}. \end{aligned}$$

An error-free transformation converts a floating-point pair (a, b) into another floating-point pair (x, y), where x is the floating-point result and y is the rounding error. The errors can be accumulated and later used to compensate the result.

[Algorithm 1: FastTwoSum]

FastTwoSum is an error-free transformation used for adding two floating-point numbers, which requires the condition \(\mid a \mid \ge \mid b \mid\) to be satisfied.
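
For reference, a C++ sketch of Dekker's FastTwoSum, which should correspond to the algorithm referenced here:

```cpp
// FastTwoSum: exact error of a + b, assuming |a| >= |b|.
// On return, x = fl(a + b) and x + y == a + b exactly.
inline void FastTwoSum(double a, double b, double& x, double& y) {
    x = a + b;
    y = b - (x - a);   // the exact rounding error of the addition
}
```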

[Algorithm 2: TwoSum]

TwoSum has no conditional requirements and is still valid in the case of underflow.
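
A C++ sketch of Knuth's TwoSum, which should correspond to the algorithm referenced here:

```cpp
// TwoSum: exact error of a + b with no precondition on |a| and |b|.
// On return, x = fl(a + b) and x + y == a + b exactly.
inline void TwoSum(double a, double b, double& x, double& y) {
    x = a + b;
    double z = x - a;
    y = (a - (x - z)) + (b - z);   // the exact rounding error
}
```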

[Algorithm 3: Split]

The Split algorithm splits a floating-point number a with an m-bit significand into two floating-point numbers with at most \(s-1\) significand bits each, where \(s:=\lceil m/2\rceil\).
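
A C++ sketch of the Veltkamp/Dekker Split for IEEE double precision (m = 53, s = 27), which should correspond to the algorithm referenced here:

```cpp
// Split: a == a_hi + a_lo exactly, with a_hi and a_lo each having
// at most 26 nonzero significand bits (for double precision, s = 27).
inline void Split(double a, double& a_hi, double& a_lo) {
    const double factor = 134217729.0;   // 2^27 + 1
    double c = factor * a;
    a_hi = c - (c - a);
    a_lo = a - a_hi;
}
```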

[Algorithm 4: TwoProd]

TwoProd is an error-free transformation algorithm for floating-point number multiplication proposed by Dekker (1971).
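
A C++ sketch of TwoProd based on Split (an FMA-based variant is also common), which should correspond to the algorithm referenced here:

```cpp
// TwoProd: exact error of a * b (barring overflow/underflow).
// On return, x = fl(a * b) and x + y == a * b exactly.
// Uses Split from the sketch above.
inline void TwoProd(double a, double b, double& x, double& y) {
    x = a * b;
    double a_hi, a_lo, b_hi, b_lo;
    Split(a, a_hi, a_lo);
    Split(b, b_hi, b_lo);
    y = ((a_hi * b_hi - x) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo;
}
```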

[Algorithm 5: add_dd_dd]

add_dd_dd adds two double-double numbers; a sketch in code is given below, after the following property. Double-double arithmetic approximates a floating-point format with a 106-bit significand and satisfies the following property (Li et al. 2002):

$$\begin{aligned} {fl}\,(a\, op \,b)=(a\, op\, b)(1+\delta ), \end{aligned}$$

where a and b are in double-double format, \(op\in \{+, -, \times , \div \}\), and \(\delta\) satisfies

$$\begin{aligned} \mid \delta \mid \leqslant {u_{dd}},{\text { }}op \in \left\{ { +, - } \right\} ;{\text { }}\mid \delta \mid \leqslant 2{u_{dd}},{\text { }}op \in \left\{ { \times , \div } \right\} , \end{aligned}$$

where \({\texttt{u}}_{dd}=2{\texttt{u}}^{2}=2^{-105}\).
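
The following C++ sketch shows one common QD-style formulation of double-double addition built from TwoSum and FastTwoSum; the struct name dd is our own, and the paper's Algorithm 5 may differ in details (for instance, the QD library also provides a more accurate variant with an extra renormalization step).

```cpp
// A double-double value x = hi + lo with |lo| <= u*|hi| (non-overlapping).
struct dd { double hi, lo; };

// add_dd_dd: sum of two double-double numbers, accurate to roughly
// double-double precision. Uses TwoSum and FastTwoSum from the sketches above.
inline dd add_dd_dd(dd a, dd b) {
    double s, e;
    TwoSum(a.hi, b.hi, s, e);        // s + e == a.hi + b.hi exactly
    e += a.lo + b.lo;                // fold in the low-order parts
    double hi, lo;
    FastTwoSum(s, e, hi, lo);        // renormalize the pair
    return {hi, lo};
}
```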

3.2 ddRingAllreduce algorithm

In the RingAllreduce algorithm, we perform the reduce operation in double-double arithmetic, which yields a high-precision RingAllreduce algorithm that we call ddRingAllreduce. First, we convert the input data to double-double type. In the scatter-reduce stage, we use Algorithm 5 to add each received chunk from the neighboring process to the corresponding local chunk, and then send the resulting double-double values to the next process; that is, we add the input data in double-double format. After the final iteration is completed, we round the double-double results to double and then enter the allgather stage.
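
To make this concrete, here is an illustrative C++/MPI sketch of the scatter-reduce stage with double-double accumulation, reusing the dd type, add_dd_dd, and the chunk indexing of the earlier ring all-reduce sketch; the function name dd_scatter_reduce and the way the (hi, lo) pairs are transmitted (as pairs of doubles) are our own assumptions, not necessarily the paper's implementation.

```cpp
#include <mpi.h>
#include <vector>

// Scatter-reduce with double-double accumulation. Assumes the dd struct and
// add_dd_dd from the Sect. 3.1 sketches, K % N == 0, and that dd is laid out
// as two contiguous doubles (so one dd chunk can be sent as 2*chunk doubles).
void dd_scatter_reduce(std::vector<double>& data, MPI_Comm comm) {
    int rank, N;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &N);
    const int K = static_cast<int>(data.size());
    const int chunk = K / N;
    const int right = (rank + 1) % N, left = (rank - 1 + N) % N;

    std::vector<dd> work(K);                     // promote inputs to double-double
    for (int i = 0; i < K; ++i) work[i] = {data[i], 0.0};
    std::vector<dd> recv(chunk);

    for (int step = 0; step < N - 1; ++step) {
        int send_idx = (rank - step + N) % N;
        int recv_idx = (rank - step - 1 + N) % N;
        MPI_Sendrecv(work.data() + send_idx * chunk, 2 * chunk, MPI_DOUBLE, right, 0,
                     recv.data(), 2 * chunk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < chunk; ++i)          // double-double reduction
            work[recv_idx * chunk + i] =
                add_dd_dd(work[recv_idx * chunk + i], recv[i]);
    }

    // Round the owned, fully reduced chunk back to double; the allgather
    // stage then proceeds exactly as in RingAllreduce.
    const int owned = (rank + 1) % N;
    for (int i = 0; i < chunk; ++i)
        data[owned * chunk + i] = work[owned * chunk + i].hi
                                + work[owned * chunk + i].lo;
}
```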

Theorem 2

Let \({\texttt{res}}\) be the result obtained by \({\texttt{ddRingAllreduce}}\), then

$$\begin{aligned} \mid {{\texttt{res}} - s}\mid \leqslant \frac{{(n - 1){{\texttt{u}}_{dd}}}}{{1 - (n - 1){{\texttt{u}}_{dd}}}}S. \end{aligned}$$
(4)

Moreover, if \(s\ne 0\), then

$$\begin{aligned} \mid {\frac{{{\texttt{res}} - s}}{s}} \mid \leqslant \frac{{(n - 1){{\texttt{u}}_{dd}}}}{{1 - (n - 1){{\texttt{u}}_{dd}}}}cond\left(\sum {{p_i}}\right ). \end{aligned}$$
(5)

Proof

The sum of two numbers in the reduction is denoted as \({T_i} = {p_k} + {p_j}\); the sum \({{\hat{T}}_i}\) computed in floating point satisfies

$$\begin{aligned} {{\hat{T}}_i} = \frac{{{p_k} + {p_j}}}{{1 + {\delta _i}}},{\text { }}\mid {{\delta _i}} \mid \leqslant {{\texttt{u}}_{dd}},{\text { }}i = 1:n - 1. \end{aligned}$$

The local error introduced in computing \({{\hat{T}}_i}\) is \({\delta _i}{{\hat{T}}_i}\); since summation is a linear process, the overall error is the sum of the local errors:

$$\begin{aligned} {E_n}: = {\texttt{res}} - s = \sum \limits _{i = 1}^{n - 1} {{\delta _i}{{{\hat{T}}}_i}}. \end{aligned}$$

Since \(\mid {{\delta _i}} \mid \leqslant {{\texttt{u}}_{dd}}\), we can get

$$\begin{aligned} \mid {{E_n}} \mid \leqslant {{\texttt{u}}_{dd}}\sum \limits _{i = 1}^{n - 1} {\mid {{{{\hat{T}}}_i}} \mid }, \end{aligned}$$

and \(\mid {{{{\hat{T}}}_i}} \mid \leqslant \sum \nolimits _{j = 1}^n {\mid {{p_j}} \mid } + O({{\texttt{u}}_{dd}})\) holds for each i; therefore, we obtain the upper bound

$$\begin{aligned} \mid {{E_n}} \mid \leqslant (n - 1){{\texttt{u}}_{dd}}\sum \limits _{i = 1}^n {\mid {{p_i}} \mid } + O({{\texttt{u}}_{dd}}^2). \end{aligned}$$

Then, expanding the bound as a series, we obtain

$$\begin{aligned} \mid {{\texttt{res}} - s} \mid \leqslant \frac{{(n - 1){{\texttt{u}}_{dd}}}}{{1 - (n - 1){{\texttt{u}}_{dd}}}}S. \end{aligned}$$

Dividing both sides by \(\mid s \mid\) yields formula (5). \(\square\)

4 Numerical results

The following experiments are performed on a Sugon HPC cluster with 172 compute nodes (16 of which are accelerator nodes), each consisting of two 12-core processors (24 cores per node). The MPI library used for the experiments is OpenMPI. Accuracy is evaluated by the relative error \(e=\mid {\texttt{res}}-s \mid /\mid s \mid\), where \({\texttt{res}}\) is an estimate of s, and s is the exact value, either known in advance or computed with the MPFR library (Fousse et al. 2007), an arbitrary-precision numerical library written in C. A few remarks on the three examples: in all of them, the data to be summed exhibit severe cancellation between positive and negative values, and the exact sums are small, so the problems tend to have large condition numbers. Indeed, from the definition of the condition number, for this type of problem the denominator is small while the numerator is relatively large, so the condition number is large; ordinary recursive summation may then fail to provide accurate results, and high-precision algorithms are needed instead.

4.1 Example 1

We use Algorithm 6.1 in Ogita et al. (2005) to generate arbitrarily ill-conditioned sum data. First, ill-conditioned dot-product data of length n are generated; the algorithm TwoProd is then used to convert these dot-product data into ill-conditioned sum data of length 2n by error-free transformation. Finally, randomly shuffling the 2n summands yields ill-conditioned sum data with different condition numbers.

The following presents the experimental results and their analysis.

Fig. 2 Left: a relative error; right: b CPU time

Figure 2a shows the relative error for a total data size of 200, using 200 processes with one number per process. The reason for this setup is that it allows us to compute the exact sum of these 200 numbers, and hence the relative error between the results obtained by our algorithm and the exact values. In practical applications, each process holds multiple chunks, each with multiple values, and a global reduction is performed per chunk. Our algorithm and implementation support this, but we cannot compute the exact value of each chunk's reduction and therefore cannot compute the relative error of the proposed algorithm in that setting. We therefore use one number per process to verify the accuracy of our algorithm, and the later accuracy comparisons use the same approach.

Figure 2a shows the relative error of the two algorithms. The horizontal axis is the condition number of the summation problem, and the vertical axis is the relative error. The exact value of the summation problem is computed with the MPFR library. From Fig. 2a, it can be observed that the ddRingAllreduce algorithm provides results with relative errors of the same order of magnitude as the machine precision, and numerical calculations cannot be expected to yield results more accurate than the machine precision. In contrast, the relative error of the RingAllreduce algorithm grows as the condition number of the summation problem increases; when the condition number is around \(10^{15}\), the RingAllreduce algorithm gives completely incorrect results. Based on Fig. 2a, we conclude that the ddRingAllreduce algorithm is more accurate than the RingAllreduce algorithm when the condition number of the summation problem is large.

Figure 2b shows the CPU time, averaged over 5 runs for each summation scale; subsequent CPU-time comparisons are also averaged over 5 runs. The vertical axis is CPU time. When the data size exceeds 100, ddRingAllreduce is slower than RingAllreduce. When the data size is below 100, the extra computational overhead introduced by double-double arithmetic in ddRingAllreduce is overlapped by communication time, so the additional floating-point operations do not increase the CPU time. In Fig. 2b, over the five summation scales, the ddRingAllreduce algorithm is on average 1.0656 times slower than the RingAllreduce algorithm.

4.2 Example 2

We use the same data-generation method as the ReproBLAS library (Ahrens et al. 2015), i.e., \(p_{i}=\sin (2.0\times \pi \times (mpi\_rank \div mpi\_size-0.5))\). Each process generates one number, where \(mpi\_rank\) is the process rank and \(mpi\_size\) is the total number of processes.
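
For concreteness, a small C++ sketch of this data generation (the function name generate_value is our own):

```cpp
#include <mpi.h>
#include <cmath>

// Each process produces the single summand sin(2*pi*(rank/size - 0.5)).
double generate_value(MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    const double pi = 3.14159265358979323846;
    return std::sin(2.0 * pi * (static_cast<double>(rank) / size - 0.5));
}
```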

Fig. 3 Left: a relative error; right: b CPU time

Figure 3a shows the relative error, with the number of summands on the horizontal axis. ddRingAllreduce provides machine-precision-level results at all five scales, while the results obtained by the RingAllreduce algorithm are incorrect at all five scales; the ddRingAllreduce algorithm is therefore more accurate than the RingAllreduce algorithm. Figure 3b shows the CPU time. As in Example 1, when the number of summands is less than 100 the two algorithms take similar time. Over the five summation scales, the ddRingAllreduce algorithm is on average 1.0906 times slower than the RingAllreduce algorithm.

4.3 Example 3

We use Algorithm 4.2 in Yamanaka et al. (2008) to generate arbitrarily ill-conditioned sum data, converting the ill-conditioned dot-product data into ill-conditioned sum data in the same way as in Example 1.

In this example, we also include a variant of RingAllreduce that uses \(\_\_float128\) in the scatter-reduce phase, which we refer to as FP128RingAllreduce.

Fig. 4 Left: a relative error; right: b CPU time

Figure 4a shows the relative error. The exact result of the summation problem is \(cond^{-1}\). From the figure, it can be seen that the ddRingAllreduce algorithm can handle summation problems with larger condition numbers and is more accurate than the RingAllreduce algorithm on such problems. FP128RingAllreduce is the most accurate of the three, because \(\_\_float128\) carries a 113-bit significand while double-double carries only 106 bits. Figure 4b shows the CPU time. As in Examples 1 and 2, when the number of summands is less than 100 the algorithms take similar time. Over the five summation scales, the ddRingAllreduce algorithm is on average 1.1238 times slower than the RingAllreduce algorithm; FP128RingAllreduce is on average 1.0829 times faster than ddRingAllreduce but 1.0378 times slower than RingAllreduce. \(\_\_float128\) is supported natively by some compilers, such as GCC and MPIC++, whereas double-double is a software emulation that requires more floating-point operations; therefore, FP128RingAllreduce is faster than ddRingAllreduce. However, not all compilers support \(\_\_float128\); for example, NVCC does not. Therefore, when high precision is required, we recommend using \(\_\_float128\) if the compiler supports it, and double-double otherwise.

Fig. 5 Error bounds and true relative errors. Left: a RingAllreduce; right: b ddRingAllreduce

Next, we evaluate how tight the error bound for ddRingAllreduce in Theorem 2 is in practice. To do this, we set \(n=200\) and vary the condition number cond from 1 to \(10^{100}\) in Algorithm 4.2 (Yamanaka et al. 2008); the exact sum equals \(cond^{-1}\). The error bound (3) and the true relative errors of the results obtained by RingAllreduce are displayed in Fig. 5a, and the error bound (5) and the true relative errors of the results obtained by ddRingAllreduce are displayed in Fig. 5b. In Fig. 5, the lines labeled 'exp.' denote the experimental errors and the lines labeled 'est.' denote the error bounds. This experiment shows that the theoretical error bounds of RingAllreduce and ddRingAllreduce are tight.

4.4 Performance comparison for large-scale problems

In this section, we compare the performance of RingAllreduce, FP128RingAllreduce, and ddRingAllreduce on two large-scale problems. The array size of each process is \(K=100,000\) and \(K=1,000,000\), respectively. Each element value is set to 1.5, and the number of nodes varies from 1 to 5 (i.e., the number of processes ranges from 24 to 120).

Fig. 6 Performance comparison for large-scale problems. Left: a \(K=100,000\); right: b \(K=1,000,000\)

Figure 6 shows the comparison results. Figure 6a shows the experiment with a per-process array size of \(K=100,000\): the ddRingAllreduce algorithm is on average 1.3786 times slower than the RingAllreduce algorithm, while FP128RingAllreduce is on average 1.1130 times faster than ddRingAllreduce but 1.2385 times slower than RingAllreduce. Figure 6b shows the experiment with \(K=1,000,000\): the ddRingAllreduce algorithm is on average 1.5863 times slower than the RingAllreduce algorithm, while FP128RingAllreduce is on average 1.0438 times faster than ddRingAllreduce but 1.5198 times slower than RingAllreduce. On larger-scale problems, the ddRingAllreduce algorithm is slower than the RingAllreduce algorithm because double-double arithmetic adds more floating-point operations.

5 Conclusions and future work

In this paper, we address the accuracy problems of the RingAllreduce algorithm, a specific algorithm for global reduction operations. We propose a high-precision version of the RingAllreduce algorithm called ddRingAllreduce, and our analysis and experiments confirm that it achieves higher accuracy than the RingAllreduce algorithm. For summation problems with large condition numbers, the ddRingAllreduce algorithm is more accurate than the RingAllreduce algorithm, and in practice the proposed algorithm often yields results more accurate than the theoretical error bound suggests. The ddRingAllreduce algorithm incurs little extra time for small-scale problems, but it does incur some overhead for large-scale problems.

For future work, one could implement other high-precision reduction operations based on the RingAllreduce algorithm, such as multiplication, or implement high-precision versions of other MPI_Allreduce algorithms, such as the butterfly algorithm. Although high-precision algorithms offer higher accuracy, they require more floating-point computation and communication, so balancing computation speed and accuracy remains an ongoing research direction.