1 Introduction

Numerical non-Archimedean computation was pioneered by Sergeyev and his Grossone Methodology [1], which allows machines to operate on infinitely large and infinitely small numbers in addition to the usual finite ones. Since its advent, numerous applications have benefited from it, especially in the domain of multi-objective optimisation, e.g., linear programming [2, 3], quadratic programming [4], evolutionary algorithms [5], game theory [6], artificial intelligence [7, 8], etc.

The reference framework of this study is the non-Archimedean model built upon the Alpha Theory [9], which introduces the set of Euclidean numbers and their associated numerical encoding, called the Bounded Algorithmic Number (BAN) format. The BAN encoding is a fixed-length representation, guaranteeing that any operation involving two BANs outputs a result that occupies the same memory as the operands, exactly as happens with computations between 32-bit IEEE 754 floats, where the result of addition, subtraction, multiplication, and division is again a 32-bit float. However, since the Euclidean numbers form a superset of the real ones, their numerical representation is heavier than that of floats, making the processing of BANs cumbersome. CPU vectorization can be exploited to optimise computations: since the BAN encoding can be implemented as multiple fixed-length coefficients, these fit inside vector registers, and coefficient-wise operations reduce to a handful of vector instructions.

Using vector instructions has already proved to be a good solution for handling non-native data types that lack hardware acceleration [10, 11]. In this paper, we present an optimisation of a C++ BAN library called BANcpp [12]. The goal is to make the library leverage CPU vectorization when dealing with BAN coefficients. In particular, we focus on BANs with eight 64-bit coefficients and on 512-bit vector instruction sets (e.g., Intel AVX-512). We also present a benchmark application consisting of several iterations of a non-Archimedean optimisation problem. We evaluate the effectiveness of automatic vectorization as a baseline, and then enhance it manually wherever the compiler fails to optimise the code.

The paper is organised as follows: (i) Sect. 2 briefly introduces Alpha Theory and the Euclidean numbers; (ii) Sect. 3 details the BAN format; (iii) Sect. 4 explains the choices made to implement the enhanced vectorization of the library; (iv) Sect. 5 shows the application benchmark and the results obtained in terms of timing performance and throughput.

2 Alpha Theory and The Euclidean Numbers

The reference set of non-Archimedean numbers in Alpha Theory is indicated with the symbol \(\mathbb {E}\) and is called the set of \(\alpha \)-Euclidean numbers or, in brief, just Euclidean numbers. The peculiar name of the theory comes from the definition of a reference infinite value within \(\mathbb {E}\), indicated by the symbol \(\alpha \). Any Euclidean number can then be represented as a function of \(\alpha \), and only functions of \(\alpha \) are numbers in \(\mathbb {E}\); this guarantees that the Euclidean numbers and the mathematical operations among them behave according to their counterparts in \(\mathbb {R}\), i.e., commutative operations remain commutative, differentiable functions remain differentiable, etc. For instance, the following are all Euclidean numbers:

$$\begin{aligned} \alpha ^3, \qquad \frac{1}{\alpha }, \qquad \frac{1}{\alpha ^2}-e^\alpha , \qquad -\frac{1}{2^\alpha }, \qquad -\ln \left( \frac{1}{\alpha }\right) . \end{aligned}$$
(1)

As opposed to Archimedean mathematics, the notions of infinite and infinitesimal numbers are here sharply defined rather than left vague.

Definition 1

Given \(\xi \in \mathbb {E}\), then

  • \(\xi \) is infinite \(\Longleftrightarrow \) \(\forall \,n\in \mathbb {N}, |\xi |>n\)

  • \(\xi \) is finite \(\Longleftrightarrow \) \(\exists \,n\in \mathbb {N},\ \tfrac{1}{n}<|\xi |<n\)

  • \(\xi \) is infinitesimal \(\Longleftrightarrow \forall \,n\in \mathbb {N},\) \(|\xi | < \tfrac{1}{n}\).

Therefore, in (1) the first and fifth numbers are positive and infinite, the third is negative and infinite, the second is positive and infinitesimal, while the fourth is negative and infinitesimal. A more detailed presentation of Alpha Theory and the set \(\mathbb {E}\) can be found in [13].

3 The BAN Format

The BAN encoding is a finite-length representation for Euclidean numbers; however, as with IEEE 754 floating-point numbers, it cannot represent the whole set \(\mathbb {E}\), since that would require infinitely many possible binary representations, i.e., infinite computer memory. Any Euclidean number of the following form can be represented as a BAN:

$$\begin{aligned} \xi = \sum _{i=1}^L r_i\alpha ^{p-i}, \end{aligned}$$

where \(L\in \mathbb {N}\) is the encoding length, \(r_i\in \mathbb {R}\), and \(p\in \mathbb {Z}\). From another perspective, a BAN can be defined as a Euclidean number representable by a linear combination of \(L\) consecutive integer powers of \(\alpha \).

At first glimpse, one may notice that the BAN representation of a Euclidean number closely resembles that of a polynomial, which suggests how cumbersome algebraic operations between BANs can be. An example of addition and multiplication between BANs with \(L=3\) follows:

$$\begin{aligned} (3.2\alpha ^2-0.5\alpha +1.4)+(0.2\alpha +1-1.5\alpha ^{-1}) = 3.2\alpha ^2-0.3\alpha +2.4-1.5\alpha ^{-1} \end{aligned}$$
$$\begin{aligned} (3.2\alpha ^2-0.5\alpha +1.4)\times (0.2\alpha +1-1.5\alpha ^{-1}) = 0.64 \alpha ^3 +3.1\alpha ^2 - 5.02\alpha + 2.15 - 2.1\alpha ^{-1} \end{aligned}$$

Both computations output a result that is not a BAN, since it requires more than three consecutive powers of \(\alpha \) to be represented. Therefore, the result must be approximated by keeping only the three highest powers of \(\alpha \), i.e., the most significant ones, which amounts to truncating the result. Below we report the numerical execution of the previous two operations, along with the division, realized by a software simulator of a BAN Processing Unit (BPU), whose hardware design has been recently proposed in [12]. The latter manipulates and outputs BANs in the normal form [13], i.e., in the standardized format which guarantees the uniqueness of the representation.

figure a (numerical execution of the operations above by the BPU simulator)

4 Vectorization of BANcpp Library

We vectorized the BANcpp library by mixing two approaches: (i) leveraging the automatic vectorization offered by the compiler; (ii) enhancing vectorization manually whenever the automatic optimisation of the compiler fails. In particular, we needed to implement manual vectorization in the following cases:

  • when a for-loop contains control-flow instructions on the BAN coefficients: due to possible branches and FPU exceptions, the compiler refuses to emit vector instructions for these loops. The solution is to vectorize manually, exploiting masked instructions driven by comparisons. This is the case of comparisons between BANs or checks on BAN values.

  • when an outer and an inner for-loop have indexes that depend on each other. The solution is to vectorize manually by leveraging the “geometry” of the problem. This is the case of multiplication between two BANs, which reduces to a one-dimensional convolution.


5 Benchmark Application and Results

The problem used as a benchmark in this study is one of the first ever used for testing and showing the efficacy of non-Archimedean numerical computations, namely Kite [3]. It is a bi-objective lexicographic linear programming problem, i.e., an optimisation problem over two linear functions, ordered by strict priority, on a linearly defined domain. To solve it, we adopted a Simplex-like non-Archimedean algorithm [3], precisely tailored to this type of task. To make the setting more realistic, we wrapped the problem within the I-Big-M framework [14], which adds a third objective to generalize the optimisation to the case of an unknown starting feasible basis.

We ran the benchmark application for \(10^5\) steps with both the auto-vectorized (baseline) and the enhanced-auto-vectorized (enhanced) versions of the BAN library, collecting the time spent on each iteration. We smoothed the data with a 200-step moving-average window and computed a least-squares fit to plot the trends of the average time per iteration and the average throughput (in iterations per second). The benchmark was run on an Intel Xeon Gold 6238R processor at 2.2 GHz, using BANs with eight 64-bit coefficients. Figure 1 compares the two versions in terms of average time per iteration and overall throughput (iterations per second). The mean value and standard deviation of both metrics are reported in Table 1.

Fig. 1. Comparison between the time spent on each iteration (left) and the throughput (iterations per second, right) for the two versions of the BAN library, with the associated fitted curves.

Table 1. Mean value and standard deviation over \(10^5\) iterations of the benchmark application, with eight 64-bit BAN coefficients and AVX-512.

6 Conclusions

In this work, we presented the acceleration of a C++ library for Bounded Algorithmic Numbers (BANs) exploiting vector instructions, and tested it on a non-Archimedean optimisation benchmark. The results showed that manually enhancing the automatic vectorization produced by the compiler can improve the performance of such applications even without full hardware support for BANs. We found that the performance of the compiler's automatic vectorization is significantly inferior to that achieved by manually choosing which intrinsics to use (and in which order) and how to load the data (again, in which order and according to which scheme). This can also be helpful to the compiler-development community, since it means there is room for improving compilers on the specific use case tackled in this work.