
1 Introduction

Associated Legendre polynomials are solutions of the differential equation

$$\begin{aligned} (1-x^{2}){\frac{d^{2}}{dx^{2}}}P_{n}^{m}(x)-2x{\frac{d}{dx}}P_{n}^{m}(x)+\left[ n(n+1)-{\frac{m^{2}}{1-x^{2}}}\right] P_{n}^{m}(x)=0, \end{aligned}$$
(1)

where the degree n and the order m are integers satisfying \(0 \le n\) and \(0 \le m \le n\), and x is a real variable in the interval \([-1,1]\), usually expressed as \(\cos \theta \), where \(\theta \) is the colatitude [1, Chap. 15].

These polynomials play an important role in defining the Earth's geopotential [2, 3], spherical functions in molecular dynamics [4], and angular momentum in quantum mechanics [5], as well as in a number of other physical applications. The accuracy and scale of numerical simulations directly depend on the maximum degree of polynomial that can be computed correctly. Modern applications typically operate on first-kind (\(m = 0\)) polynomials of degree \(10^3\) or higher. The functions \(P_n^m(x)\) grow combinatorially with n and can overflow for n larger than about 150. Therefore, for large n, normalized associated Legendre polynomials are computed instead of \(P_n^m(x)\). There are a number of different normalization methods [6, Chap. 7]. We consider the computation of fully normalized polynomials

$$\begin{aligned} \bar{P}_n^{m}(x) = \sqrt{\frac{2n+1}{2}\frac{(n-m)!}{(n+m)!}}(1-x^2)^{m/2}\frac{\partial ^{m}}{\partial x^{m}}P_n(x), \end{aligned}$$
(2)

which satisfy the following equation:

$$\begin{aligned} \int _{-1}^1 \{\bar{P}_n^m(x)\}^2 dx = 1. \end{aligned}$$
(3)

Mathematical properties and numerical tables of \(\bar{P}_n^{m}(x)\) are given in [7]. A number of recursive algorithms have been suggested for evaluating \(\bar{P}_n^{m}(x)\) [8,9,10]. One of the most common is based on the following relation [8]:

$$\begin{aligned} \bar{P}_{n}^{m-1}(x) = \tfrac{2m x}{\sqrt{(1-x^2)(n+m)(n-m+1)}}\bar{P}_{n}^{m}(x) - \sqrt{\tfrac{(n-m)(n+m+1)}{(n+m)(n-m+1)}}\bar{P}_{n}^{m+1}(x). \end{aligned}$$
(4)

The starting points for recursion (4) are the values \(\bar{P}_{n}^{n+1}(x) = 0\) and

$$\begin{aligned} \bar{P}_{n}^{n}(x) = \sqrt{\frac{1}{2}\frac{3 \cdot 5 \cdots (2n+1)}{2 \cdot 4 \cdots 2n}}(1-x^2)^{n/2}. \end{aligned}$$
(5)

Equations (4) and (5) are asymptotically stable for all admissible values of x, m and n, so, considered in terms of pure mathematics, they are suitable for computing polynomials of arbitrarily high degree. In practical computation, however, difficulties arise in evaluating (4) and (5) when n becomes large [3], for the following reasons:

  • computations take an unacceptably long time;

  • overflow or underflow exceptions may occur.

The first of these problems stems from the fact that numerical simulation requires computing many polynomials of different degrees at a fixed angle, or many polynomials of a fixed degree for a variety of angles. An effective solution to this problem has become possible thanks to the intensive development of new-generation massively parallel computing architectures, such as graphics processing units (GPUs).

The second problem is related to the limited dynamic range of real numbers represented in computers [11]. As a result, if x is close to \(\pm 1\), the computation of the starting value \(\bar{P}_{n}^{n}(x)\) leads to underflow, even though the desired value \(\bar{P}_{n}^{m}(x)\) is within an acceptable dynamic range. For example, if \(x = 0.984808\), which corresponds to the angle \(\theta \approx 10^{\circ }\), then \(\bar{P}_{5000}^{5000}(x) \approx 1.42\times 10^{-3801}\), while \(\bar{P}_{5000}^{0}(x) \approx 3.32\times 10^{-1}\). The smallest normal value in the IEEE-754 double-precision format is approximately \(10^{-308}\). Thus, to evaluate \(\bar{P}_{5000}^{5000}(x)\), the exponent range must be extended by more than an order of magnitude. The value of \(\bar{P}_{5000}^{5000}(x)\) is of no independent practical interest; however, the recursion for calculating \(\bar{P}_{5000}^{0}(x)\) cannot be started unless it is computed correctly, because if, due to underflow, \(\bar{P}_{5000}^{5000}(x) = 0\), then all subsequent values \(\bar{P}_{5000}^{4999}(x)\), \(\bar{P}_{5000}^{4998}(x)\), etc. will also be zero. On the other hand, computing the fraction in (5) in the conventional way (first the numerator, then the denominator) may lead to an overflow exception. Some information about the range of angles and the limits on polynomial degrees at which calculations in IEEE-754 arithmetic do not result in exceptions is given in [2].

To avoid overflow or underflow problems, methods using global scaling coefficients have been suggested [9]. However, as noted in [3], this solves the problem only for limited ranges of arguments and degrees. A general solution to the overflow and/or underflow problem when computing normalized Legendre polynomials is suggested in [8] and involves the use of extended-range arithmetic.

In this paper we consider the parallel computation of normalized polynomials \(\bar{P}_{n}^{m}(x)\) of high degree in extended-range arithmetic using CUDA-compatible GPUs. Owing to the high level of task parallelism, moving the computations to the GPU has made it possible to achieve a significant performance improvement over the CPU implementation.

2 Extended-Range Arithmetic

2.1 Basic Algorithms

The IEEE-754 standard is currently the main standard for binary floating-point arithmetic [12]. It defines the two most widely used formats: the single-precision format (binary32) and the double-precision format (binary64). These formats are supported, to some extent, both at the hardware level and at the programming-language level. In 2008, a revision of the IEEE-754 standard was published, which additionally describes the quadruple-precision binary format, binary128, and two decimal formats, decimal64 and decimal128 [13]. However, support for these new formats is still rare. The properties of the single- and double-precision binary formats are presented in Table 1. In this table, the number of digits of the significand, p, defines the precision of the format; the integers \(e_{\min }\) and \(e_{\max }\) are the extremal exponents; \(n_{\max }\) is the largest positive finite number, \(n_{\min }\) is the smallest positive normal number, and \(s_{\min }\) is the smallest positive subnormal number; the segment \([n_{\min }, n_{\max }]\) specifies the range of positive normal numbers, and the segment \([s_{\min }, n_{\max }]\) specifies the total range of positive finite numbers.

Table 1. The properties of IEEE-754 single-precision and double-precision formats

Parameter        binary32                          binary64
\(p\)            24                                53
\(e_{\min }\)    \(-126\)                          \(-1022\)
\(e_{\max }\)    \(+127\)                          \(+1023\)
\(n_{\max }\)    \(\approx 3.40\times 10^{38}\)    \(\approx 1.80\times 10^{308}\)
\(n_{\min }\)    \(\approx 1.18\times 10^{-38}\)   \(\approx 2.23\times 10^{-308}\)
\(s_{\min }\)    \(\approx 1.40\times 10^{-45}\)   \(\approx 4.94\times 10^{-324}\)

The situation when the intermediate result of an arithmetic operation or function exceeds in magnitude the largest finite floating-point number \(n_{\max } = (2-2^{1-p})\times 2^{e_{\max }}\) is defined in the IEEE-754 standard as overflow. On overflow, the result, depending on the rounding mode used, is replaced with infinity (\(\pm \infty \)) or with the largest finite floating-point number. The situation when the intermediate result of an arithmetic operation is too close to zero, i.e. its magnitude is strictly less than the smallest positive normal number \(n_{\min } = 2^{e_{\min }}\), is defined as underflow [13, 14]. On underflow, the result is replaced with zero, a subnormal number, or the smallest positive normal number. In all cases, the sign of the rounded result coincides with the sign of the intermediate result. These exceptional situations are illustrated in Fig. 1.

Fig. 1. Overflow and underflow in floating-point arithmetic

One of the ways to eliminate overflow or underflow is scaling. This method requires estimating the source operands and multiplying them by a factor K chosen so that all intermediate results are within the normal range. After the final result has been computed, it is unscaled by dividing it by K [14]. In terms of computing speed, this technique is clearly the best one. However, it requires a detailed analysis of the whole computing process and is not applicable in many cases. A more common approach is the emulation of extended-range arithmetic. To do this, an integer e is paired with a conventional floating-point number f, and the pair is treated as the number

$$\begin{aligned} f\times B^e, \end{aligned}$$
(6)

where B is a predetermined constant that is a power of the floating-point base [8, 11]. The significand f can take values in the interval \((1/B, B)\). Given this, B must be chosen so that no underflow or overflow occurs in any arithmetic operation performed on f. It is advisable for B to be an integer power of two (on a binary computer).

For instance, if

  • f is a double-precision number (binary64),

  • e is a 32-bit signed integer (\(-2147483648 \le e \le 2147483647\)) and

  • \(B = 2^{256}\),

then the range of the represented numbers will exceed \(10^{\pm 165492990270}\).
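The quoted bound can be verified directly: since \(|f| < B\), the magnitude of any representable number is below \(B^{\,e+1}\) with \(e \le 2^{31}-1\), and

$$\begin{aligned} \log _{10}\bigl (B^{\,2^{31}}\bigr ) = 2^{31}\cdot 256\cdot \log _{10}2 \approx 1.65493\times 10^{11}, \end{aligned}$$

i.e. the representable magnitudes extend beyond \(10^{165492990270}\), in agreement with the figure above; the same estimate with \(e = -2^{31}\) gives the lower end of the range.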

The algorithms for basic extended-range operations are considered in [8, 11]; therefore, we will focus only on some of them. In the following, we assume that the exponent base B is fixed and that an extended-range number is represented by the pair (f, e). Algorithm 1 performs the “adjustment” of a number. It is one of the basic extended-range arithmetic algorithms: it keeps the value of the significand f within its permitted range and corrects the representation when the input is an incorrect encoding of zero.

Algorithm 1. Adjustment of an extended-range number (Adjust)
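The listing of Algorithm 1 is not reproduced above. A minimal sketch of such an adjustment step, assuming that B is a power of two and using illustrative names (er_t, ER_LOG2_BASE) rather than the authors' identifiers, might look as follows:

```
#include <math.h>
#include <stdint.h>

/* Hypothetical extended-range type: the pair (f, e) represents f * B^e. */
typedef struct {
    double  f;   /* significand, kept in (1/B, B) */
    int64_t e;   /* exponent of the base B        */
} er_t;

#define ER_LOG2_BASE 1                          /* B = 2^ER_LOG2_BASE (B = 2 by default) */
#define ER_BASE      (ldexp(1.0, ER_LOG2_BASE)) /* the base B itself                     */

/* One adjustment step: rescale f by one power of B if it has left its range,
 * and force the canonical encoding of zero (f = 0, e = 0). Rescaling by a
 * power of two is exact, so no rounding error is introduced. */
void er_adjust(er_t *a)
{
    if (a->f == 0.0) {
        a->e = 0;
    } else if (fabs(a->f) >= ER_BASE) {
        a->f = ldexp(a->f, -ER_LOG2_BASE);
        a->e += 1;
    } else if (fabs(a->f) < 1.0 / ER_BASE) {
        a->f = ldexp(a->f, ER_LOG2_BASE);
        a->e -= 1;
    }
}
```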

It is important to note that performing the Adjust procedure only once is not always enough. This can happen in at least the following two cases: (a) when a number is converted from the machine format, or from a format whose exponent base differs from the current one; (b) when subtraction of almost equal numbers (or addition of numbers with different signs) is performed. In either of these cases it is possible that, after the Adjust procedure has been performed, the significand f is still less than 1 / B. If this is ignored and the computation is continued, gradual “zeroing” of the result is likely to take place. To avoid this, a cyclic adjustment procedure, implemented by Algorithm 2, can be used.

Algorithm 2. Cyclic adjustment of an extended-range number
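A corresponding sketch of the cyclic adjustment, reusing the er_t type, ER_BASE and er_adjust from the previous sketch:

```
/* Repeat the single adjustment step until the significand is either zero
 * (in its canonical encoding) or strictly inside (1/B, B). */
void er_adjust_cyclic(er_t *a)
{
    do {
        er_adjust(a);
    } while (a->f != 0.0 &&
             (fabs(a->f) >= ER_BASE || fabs(a->f) < 1.0 / ER_BASE));
}
```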

Algorithm 3 performs addition of extended-range representations. The algorithms for subtraction and comparison are quite similar to the addition algorithm, so they are not described here.

Algorithm 3. Addition of extended-range numbers
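The listing of Algorithm 3 is likewise not reproduced above. One straightforward sketch, again reusing the hypothetical er_t, ER_LOG2_BASE and er_adjust_cyclic from the previous sketches, rescales the operand with the smaller exponent, adds the significands, and re-adjusts the result:

```
/* r = a + b in extended-range representation. */
void er_add(er_t *r, const er_t *a, const er_t *b)
{
    const er_t *hi = (a->e >= b->e) ? a : b;   /* operand with the larger exponent  */
    const er_t *lo = (a->e >= b->e) ? b : a;
    int64_t d = hi->e - lo->e;                 /* exponent difference, d >= 0       */

    r->e = hi->e;
    if (d > 55 + 2 * ER_LOG2_BASE) {
        /* the smaller operand is below the rounding error of the larger one */
        r->f = hi->f;
    } else {
        r->f = hi->f + ldexp(lo->f, (int)(-d * ER_LOG2_BASE));
    }
    er_adjust_cyclic(r);   /* restore f to (1/B, B); important after cancellation */
}
```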

2.2 Implementation of Extended-Range Arithmetic

We have implemented all the basic algorithms of extended-range arithmetic, as well as a number of mathematical functions, for CPUs and NVIDIA CUDA-compatible GPUs. To do this, the data types shown in Fig. 2 were declared.

Fig. 2. Extended-range data types: er_frac_t, a standard floating-point number (double by default), and er_exp_t, a machine integer (int64_t by default)
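The declarations in Fig. 2 themselves are not reproduced above; based on the caption, they might look roughly as follows (the combined type and its field names are assumptions, kept consistent with the pair (f, e) used in the sketches of Sect. 2.1):

```
#include <stdint.h>

typedef double  er_frac_t;   /* significand: standard floating-point number */
typedef int64_t er_exp_t;    /* exponent: machine integer                   */

typedef struct {
    er_frac_t f;             /* significand, f in (1/B, B) */
    er_exp_t  e;             /* exponent of the base B     */
} er_t;
```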

The list of implemented CPU- and CUDA-functions includes the following:

  • memory management and constants initialization;

  • addition, subtraction, multiplication and division, supporting four IEEE-754 rounding modes, as well as comparison functions;

  • integer floor and ceiling functions, computation of the fractional part;

  • functions for converting numbers from the double-precision IEEE-754 data type to the extended-range data type, and vice versa;

  • factorial, power, square root, and a number of other mathematical functions.

The exponent base B is defined by a parameter; by default, \(B = 2\), which is quite sufficient for the computations carried out here. The declarations of the CPU and GPU functions are identical (the cuda namespace is used for GPU functions). Pointers are used for efficient parameter passing. All functions are thread-safe.

The efficiency of extended-range arithmetic is largely determined by the speed of converting numbers from the machine floating-point representation to the extended-range representation, and vice versa. To implement these procedures, we use bitwise operations. In particular, Fig. 3 shows the subroutine er_set_d, which converts a conventional IEEE-754 double-precision number into the extended-range representation. This subroutine uses the DoubleIntUnion structure, which allows the double and integer data types to be stored in the same memory location.

Fig. 3. Conversion of a double-precision floating-point number into the extended-range representation. For the double data type, SIGN_OFFSET = 63, EXP_OFFSET = 52 and EXP_BIAS = 1023.
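The listing in Fig. 3 itself is not reproduced above. A sketch of such a conversion for \(B = 2\), using the constants from the caption, the DoubleIntUnion structure, and the hypothetical er_t type from the earlier sketches, might look as follows:

```
#include <math.h>
#include <stdint.h>

/* Union in the spirit of the DoubleIntUnion structure mentioned above:
 * it lets the bit pattern of a double be read as a 64-bit integer. */
typedef union {
    double   d;
    uint64_t i;
} DoubleIntUnion;

#define SIGN_OFFSET 63    /* would be needed only if the sign were stored separately */
#define EXP_OFFSET  52
#define EXP_BIAS    1023

/* Sketch of er_set_d for B = 2: the unbiased binary exponent of x becomes e,
 * and the exponent field of the stored significand is replaced by the bias,
 * so that |f| ends up in [1, 2). Zero and subnormal inputs fall back to frexp;
 * infinities and NaNs are not handled in this sketch. */
void er_set_d(er_t *r, double x)
{
    DoubleIntUnion u;
    u.d = x;

    uint64_t raw_exp = (u.i >> EXP_OFFSET) & 0x7FFULL;
    if (raw_exp == 0) {                  /* x is zero or subnormal */
        int e2;
        r->f = frexp(x, &e2);            /* f in [0.5, 1) or 0     */
        r->e = e2;
        return;
    }
    r->e = (int64_t)raw_exp - EXP_BIAS;

    /* clear the exponent field and insert the bias, keeping sign and fraction */
    u.i = (u.i & ~(0x7FFULL << EXP_OFFSET)) | ((uint64_t)EXP_BIAS << EXP_OFFSET);
    r->f = u.d;
}
```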

3 Computation of Normalized Legendre Polynomials on CPU and GPU

3.1 Computation of Starting Point of Recursion

Our implementation of the computation of normalized Legendre polynomials is based on recursion (4), which, in turn, requires the computation of relation (5). For a polynomial of high degree n, direct computation of (5) is time-consuming, since it requires computing two double factorials in extended-range arithmetic, \((2n+1)!! = 3\cdot 5\cdots (2n+1)\) and \((2n)!! = 2\cdot 4\cdots 2n\). When a single polynomial is computed, this is not critical; however, the problem becomes significant when many polynomials of various degrees are computed sequentially. In addition, direct computation of the factorials in machine-precision floating-point arithmetic can accumulate significant rounding errors. To partially solve the problem, a read-only lookup table (LUT) can be used, which stores the values \(\frac{(i\cdot 2h+1)!!}{(i\cdot 2h)!!}\) for \(i = 1,2,\dots , N\), where h and N are integers. Then, to compute \(\frac{(2n+1)!!}{(2n)!!}\), where n is the polynomial degree, one takes the \(\lfloor n/h\rfloor \)-th value from the LUT, computes \(\prod _{i = 1}^{n-q}{\frac{2(q+i)+1}{2(q+i)}}\), where \(q = h\lfloor n/h\rfloor \), and multiplies these two values. The size of the LUT is determined by the step h and the maximum polynomial degree n we want to compute. For instance, if \(n = 50000\) and \(h = 100\), the LUT contains \(N = 500\) values. The LUT content is computed in advance with high precision, after which it is converted into the extended-range format. A sketch of this scheme is given below.
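The following is a minimal sketch of the LUT scheme, written in ordinary double precision for readability (the actual implementation keeps the table entries and the running product in extended-range format and builds the table with high-precision arithmetic); the names are illustrative, not taken from the authors' code:

```
#define LUT_STEP 100               /* h */
#define LUT_SIZE 500               /* N, enough for n up to 50000 */

static double lut[LUT_SIZE + 1];   /* lut[i] = (2*i*h+1)!! / (2*i*h)!! */

/* Precompute the table once. */
void init_lut(void)
{
    double r = 1.0;
    for (int k = 1, i = 1; i <= LUT_SIZE; ++i) {
        for (; k <= i * LUT_STEP; ++k)
            r *= (2.0 * k + 1.0) / (2.0 * k);
        lut[i] = r;
    }
}

/* (2n+1)!!/(2n)!!: take lut[n/h] and append the remaining n - q factors. */
double dfact_ratio(int n)
{
    int q = (n / LUT_STEP) * LUT_STEP;
    double r = (n >= LUT_STEP) ? lut[n / LUT_STEP] : 1.0;
    for (int i = 1; i <= n - q; ++i)
        r *= (2.0 * (q + i) + 1.0) / (2.0 * (q + i));
    return r;
}
```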

Table 2. Subroutines to compute normalized Legendre polynomials
Fig. 4. CUDA kernel for computing normalized Legendre polynomials

3.2 Developed Software for Computing Legendre Polynomials

Based on the implemented extended-range arithmetic functions (Subsect. 2.2), CPU and CUDA subroutines were developed that allow computing \(\bar{P}_{n}^{m}(x)\) for large \(n \ge 0\) and any m, \(0 \le m \le n\). They are listed in Table 2.

For the GPU implementation, a direct parallelization scheme was chosen, according to which the i-th GPU thread computes a polynomial of degree n[i] and order m[i] for the argument x[i]. The result is written with the corresponding offset to the res array. The number of required thread blocks is given by

$$\begin{aligned} N = \left\lceil \frac{vector\_size}{max\_threads\_per\_block}\right\rceil , \end{aligned}$$
(7)

where \(vector\_size\) is the size of the input vectors and \(max\_threads\_per\_block\) is the maximum number of threads in a block. If N does not exceed the maximum number of blocks for the device, fully parallel computation of all polynomials is possible; otherwise, some threads compute more than one polynomial. A listing of the CUDA kernel legendre_lst is given in Fig. 4.
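The listing of legendre_lst in Fig. 4 is not reproduced above. A sketch of a kernel implementing the parallelization scheme just described might look as follows; er_t and cuda::er_legendre are hypothetical stand-ins for the actual extended-range type and polynomial routine:

```
// Hypothetical extended-range type and device routine; the real declarations
// (cf. Table 2 and Sect. 2.2) are assumed to be available in the cuda namespace.
struct er_t { double f; long long e; };

namespace cuda {
__device__ void er_legendre(er_t *res, int n, int m, double x);  // recursion (4), (5)
}

// The i-th thread computes the polynomial of degree n[i] and order m[i] at x[i];
// a grid-stride loop lets a thread handle more than one polynomial when the
// number of blocks is capped by the device.
__global__ void legendre_lst(er_t *res, const int *n, const int *m,
                             const double *x, int vector_size)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < vector_size;
         i += blockDim.x * gridDim.x) {
        cuda::er_legendre(&res[i], n[i], m[i], x[i]);
    }
}

// Host-side launch with the block count N from Eq. (7):
//   int N = (vector_size + max_threads_per_block - 1) / max_threads_per_block;
//   legendre_lst<<<N, max_threads_per_block>>>(d_res, d_n, d_m, d_x, vector_size);
```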

4 Experimental Results

The correctness and efficiency of the developed subroutines were evaluated on an HP SL390 + NVIDIA Tesla M2090 node of the UniHUB.ru platform at the Institute for System Programming of the Russian Academy of Sciences [15]. Three software implementations of the recursive algorithm (4) were examined: CPU and GPU implementations based on extended-range arithmetic, and calculations using the GNU MPFR library.

In the first experiment, we examined the dependence of the computation time of a first-kind (\(m = 0\)) polynomial on n. The value \(\cos (179^\circ ) \approx -0.999848\) was taken as the argument. The degree n varied in the range of 100 to 53200 and was doubled at each stage of the testing procedure. The results are presented in Fig. 5(a).

Fig. 5. Experimental results: computation time of \(\bar{P}_{n}^{m}(x)\) at fixed \(m = 0\) and \(x = \cos (179^\circ )\) versus degree n (a); computation time of the vector of \(\bar{P}_{n}^{m}(x)\) at fixed \(m = 0\) and \(n = 20000\) versus the vector size (b)

In the second experiment, vectors of polynomials with fixed \(m = 0\) and \(n = 20000\) were computed, with vector sizes ranging from 32 to 8192. The arguments were defined by the formula \(x_i = \cos \left( i \times \frac{180^\circ}{vector\_size}\right) \), so that each vector covered the angular range \([0^\circ , 180^\circ ]\) with a uniform step determined by the vector size (\(vector\_size\)). The results are shown in Fig. 5(b). The longer GPU computation time observed for vector sizes greater than 2048 is explained by the limited resources of the device used.

It is worth noting that subroutines for computing normalized and non-normalized associated Legendre polynomials are implemented in a number of well-known software packages, such as the GNU Scientific Library, Boost, and ALGLIB. However, they allow calculations only for rather small degrees (up to several thousand), so these implementations were not analysed in the experiments.

The computed polynomials \(\bar{P}_n^m(\cos \theta )\) for \(n = 1000, 5000, 15000, 20000\) and \(m = 0\), with \(\theta \) sampled at \(1^\circ \) intervals, are shown in Fig. 6, and the logarithms of the starting values (5) for recursion (4) are shown in Fig. 7.

Fig. 6. Normalized associated Legendre polynomials

Fig. 7. Logarithms of the starting values for recursive computation of normalized associated Legendre polynomials

5 Conclusion

The paper considers the problem of GPU-based calculation of normalized associated Legendre polynomials. At high degrees n and orders m, these polynomials are characterized by a large spread of numerical values, which greatly limits the possibility of their correct computation in IEEE-754 arithmetic. In particular, for \(n = 20000\), correct results in the double-precision format can be obtained only for angles ranging from \(75^\circ \) to \(105^\circ \); calculations for angles beyond this range result in underflow. To overcome this limitation, extended-range arithmetic has been implemented on the GPU. The experimental evaluation shows that, thanks to the natural task parallelism and a simple computation scheme, the use of the GPU is effective even with small vectors. With increasing problem size, the speedup becomes more significant.

When polynomials of the same degree are computed in parallel for a set of different arguments, the computation scheme is balanced, since the time complexity of the extended-range arithmetic operations does not depend significantly on the magnitude of the arguments. If vectors of polynomials of different (high) degrees are computed, then, to improve the performance of the GPU implementation, a more complicated computation scheme involving load balancing can be applied.