1 Introduction

Nowadays, one of the most fruitful applications of the pervasiveness of DNNs is image processing in the automotive industry. This field brings new problems and challenges for DNNs. On the one side, there is the need to reduce network architecture and computation complexity, in order to accomplish real-time tasks on resource-constrained devices. On the other side, there is the need for platform-specific DNN accelerators [e.g. NVIDIA cuDNN for NVIDIA graphics processing units (GPUs)] that provide substantial speed-ups to neural network processing, in both the training and inference phases.

On the complexity-reduction side, one of the most explored and interesting topics is the use of alternative representations of real numbers, to reduce the number of bits used to represent the weights of DNNs. Some formats have already been proposed by industry, such as Google (Brain Float, BFloat16 [1]), Intel (Flexpoint, FP16 [2, 3]) and the Facebook AI Group [4]. Another promising representation that diverges from the floating-point standard is the posit number system [5,6,7]. This type has been proven to be a perfect drop-in replacement of 32-bit IEEE 754 floats in machine learning, using just 16 bits [8,9,10,11,12,13]. Moreover, it has been productively exploited in low-precision inference down to 8-bit posit representation, with very little degradation of network inference accuracy. Furthermore, as also explained in Sect. 2 and in [9], this number system can be exploited to build fast, approximated and efficient activation functions for neural networks, such as the sigmoid function, using only the arithmetic logic unit (ALU) already present in the CPU.

On the side of platform-specific accelerators, the ubiquity of operations such as dot products, matrix multiplications and filter convolutions points out the need for optimized routines able to increase the throughput of these operations. While GPUs are widespread in this field, their use may be precluded by both high implementation costs and low-power requirements. Microprocessor vendors have already moved in this direction by providing vectorized extensions of their instruction set architectures (ISAs). In particular, ARM first proposed the NEON instruction set, later evolved and improved into the ARM scalable vector extension (SVE) [14, 15]. Furthermore, ARM has already developed a deep neural network library that supports its NEON vectorization backend [16], but at the time of writing it lacks SVE support. The SVE extension, together with the ARM compiler, allows producing executable binaries that exploit the SVE instruction set in two ways. The first is auto-vectorization, with the compiler autonomously producing vectorized instructions by exploiting data parallelism in the code (e.g. loop unrolling). The second is the explicit use of specific high-level instructions to instrument vectorization, made possible by the ARM C Language Extension (ACLE) for SVE [14].

Combining the reduction in information size with vectorization is thus very appealing: if we halve the bits of a given representation without losing decimal accuracy, we can fit twice as many elements in the same vector register, increasing the overall throughput. In this paper, we develop a vectorized extension of the cppPosit C++ posit arithmetic library, following both approaches. This extension is then tested on common DNN and machine learning operations and within the tinyDNN C++ DNN library.

1.1 Organization of the paper

In Sect. 2, we present the posit format and its properties, along with some interesting arithmetic operators implemented using only integer arithmetic. In Sect. 3, we discuss how posit properties interact with DNN design choices (activation functions, value distributions, loss strategies and data pre-processing). In Sect. 4, we summarize the main characteristics of the new ARM SVE architecture, pointing out the tools and approaches that we can exploit for the development of our vectorized backend. In Sect. 5, we present the cppPosit library developed by the authors and propose a vectorized extension for it, providing implementations of common operations (such as dot product and convolution) and addressing the issues and challenges of posits in general. This work has been carried out within the H2020 European Processor Initiative (EPI). The obtained results provide useful feedback to the EPI CPU designers, since they can evaluate in advance the impact of their design choices.

In Sect. 6, we present the results obtained on the official ARM Instruction Emulator, pointing out the differences in processing time between the different versions and vectorization levels. Finally, in Sect. 7 we present the results achieved using the tinyDNN library (equipped with the cppPosit library) on very deep convolutional neural networks (using synthetic images). These benchmarks are relevant to real-time image processing in the automotive scenario. Single-operation benchmarks show the impact of our approach on image processing building blocks (e.g. convolution is a very common operation in image processing and filtering), while the tinyDNN benchmarks focus on neural networks widely used in the autonomous driving world. In particular, the evaluated networks are employed as basic blocks in automotive computer vision tasks (e.g. object detection and semantic segmentation).

2 Posit arithmetic

As widely shown in [7, 8, 10, 17, 18], the posit format is a fixed-length alternative representation to floating-point numbers. A posit is parameterized by the total number of bits (nbits) and the number of exponent bits (es). It has up to four fields, as shown in Fig. 1:

  • Sign field (1 bit). Posits are 2’s complement.

  • Regime field (variable length): it is identified as the sequence of identical bits r followed by the opposite bit \({\bar{r}}\).

  • Exponent field (maximum length of es bits). This field can be shorter or even missing altogether for some representations, even when \(es > 0\).

  • Fraction field (variable length); it can be missing too.

Fig. 1 Illustration of a posit\(\langle 16,2\rangle \)

Given such a format, the real value x represented by the signed integer v associated with the posit is:

$$\begin{aligned} x= {\left\{ \begin{array}{ll} 0,&{} \text {if }v=0 \\ \text {NaN} ,&{} \text {if }v=-2^{(nbits-1)} \\ sign(v)\times useed^k \cdot 2^e \cdot (1 + f) ,&{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

where \(useed = 2^{2^{es}}\), k is the value encoded by the regime field, e is the value encoded by the exponent field and \(f=\phi \cdot 2^{-F}\) is the fractional part, with \(\phi \) the fraction field read as an unsigned integer and F its length in bits.

Figure 2 shows an example of posit format decoding:

Fig. 2 An example of a 16-bit posit with 3 bits for the exponent (\(es=3\)). Given the sequence on top of the figure, after detecting that it starts with 1, we compute the 2’s complement of all the remaining bits (passing from 110-110-111011001 to 001-001-000100111). Then we can decode the posit: the regime value is \(k=-2\), the exponent is \(e=1\) and the fraction is 39/512. The final value is therefore \(-256^{-2}\cdot 2 \cdot (1 + 39/512) = -1/65536 \cdot 2 \cdot (1 + 39/512) \approx -0.00003284\)
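To make the decoding of Eq. (1) concrete, the following is a minimal, unoptimized C++ sketch (not the cppPosit implementation; the function name and structure are ours) that decodes a posit\(\langle 16,es\rangle \) held in an int16_t into a double:

#include <cmath>
#include <cstdint>

// Decode a posit<16,es> stored as a 2's complement int16_t, following Eq. (1).
double decode_posit16(int16_t v, unsigned es) {
    if (v == 0) return 0.0;
    if (v == INT16_MIN) return NAN;                    // Not-a-Real encoding
    const int nbits = 16;
    const int sign = (v < 0) ? -1 : 1;
    uint16_t u = static_cast<uint16_t>(sign < 0 ? -v : v);  // 2's complement of negatives
    // Regime: run of identical bits terminated by the opposite bit.
    int i = nbits - 2;
    const int r0 = (u >> i) & 1;
    int run = 0;
    while (i >= 0 && ((u >> i) & 1) == r0) { ++run; --i; }
    if (i >= 0) --i;                                   // skip the terminating bit
    const int k = r0 ? (run - 1) : -run;               // regime value
    // Exponent: up to es bits; missing (truncated) bits count as 0.
    int e = 0, ebits = 0;
    while (ebits < static_cast<int>(es) && i >= 0) { e = (e << 1) | ((u >> i) & 1); --i; ++ebits; }
    e <<= (static_cast<int>(es) - ebits);
    // Fraction: the remaining F bits, read as phi * 2^-F.
    const int F = i + 1;
    const int phi = (F > 0) ? (u & ((1 << F) - 1)) : 0;
    const double f = (F > 0) ? phi / std::pow(2.0, F) : 0.0;
    const double useed = std::pow(2.0, 1 << es);
    return sign * std::pow(useed, k) * std::pow(2.0, e) * (1.0 + f);
}

Applied to the bit pattern of Fig. 2 (0xEDD9, with \(es=3\)), this sketch returns approximately \(-3.28\cdot 10^{-5}\), matching the value computed in the caption.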

2.1 Fast approximated operations on posits using only the ALU

In this subsection, we will refer to x as the represented real value and to v and Y as the integers representing the posits, with v being the input of the operation and Y the output. The represented real value can be obtained from its representation using Eq. (1).

Please notice how, when \(es=0\), the formula in Eq. (1) can be further simplified as:

$$\begin{aligned} x = 2^k\cdot (1+f) \end{aligned}$$
(2)

where \(k = -R\) for \(x < 1\) and \(k = R-1\) for \(x \ge 1\), and R is the regime length, as shown in Fig. 1.

This posit formulation allows implementing some arithmetic operators that, unlike IEEE 754 float numbers, can be evaluated by using only the ALU on the integer v representing the posit bit string.

2.1.1 The twice operation (2x)

When applying the twice operator, we consider three different cases for the posit value: \(x\in [2,+\infty )\), \(x\in [1,2)\) and \(x\in [0,1)\) (the same holds for negative values). We implement the twice operator in the different cases as follows (in the following expressions, \(x \ll n\) denotes x left-shifted by n bits, \(x \gg n\) denotes x arithmetically right-shifted by n bits, \(x \gg ^* n\) denotes x logically right-shifted by n bits, | is the bitwise or operation and \(\oplus \) is the bitwise exclusive or (xor) operation).

$$\begin{aligned} vabs= & {} abs(v)\\ s= & {} sign(v)\\ vs= & {} v \ll 1\\ Y_t= & {} {\left\{ \begin{array}{ll} vs \gg 1,&{} \text {if }x \ge 2 \\ vs \oplus twicemask, &{} \text {if }1 \le x< 2\\ vs \ll 1 ,&{} \text {if }x < 1 \end{array}\right. } \end{aligned}$$

where twicemask is obtained as follows:

$$\begin{aligned} twicemask = (1 \ll (nbits - 2)) \,|\, (1 \ll (nbits - 3)) \end{aligned}$$
(3)

We obtain the final result as:

$$\begin{aligned} Y = (Y_t \gg ^* 1) \oplus s - s. \end{aligned}$$

A similar approach can be applied to the half (x/2) operation. We only need to change the transformation in the three different cases:

$$\begin{aligned} Y_t = {\left\{ \begin{array}{ll}vs \ll 1,&{} \text {if }x \ge 2 \\ vs \oplus twicemask ,&{} \text {if }1 \le x< 2\\ vs \gg 1 ,&{} \text {if }x < 1 \end{array}\right. }. \end{aligned}$$

2.1.2 The one’s complement operator (\(1-x\))

The one’s complement operator also requires the posit to be in the range [0, 1] and can be implemented as follows:

$$\begin{aligned} Y = (1 \ll (nbits-2)) - v. \end{aligned}$$

2.1.3 Fast reciprocate function (1/x)

We can implement a fast and approximated version of the reciprocate function as follows (where \(\lnot \) is a bitwise negation, \(\oplus \) is the exclusive-or operator and signmask is a bit mask for the sign bit):

$$\begin{aligned} Y = (v \oplus \lnot signmask). \end{aligned}$$

The signmask can be obtained as follows:

$$\begin{aligned} msb= & {} 1 \ll (nbits-1)\\ signmask= & {} \lnot ((msb \,|\, (msb - 1)) \gg ^* 1). \end{aligned}$$
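A minimal scalar C++ sketch of this operator for a posit\(\langle 16,X\rangle \) held in an int16_t (a direct transcription of the formulas above, not the cppPosit code; the function name is ours):

#include <cstdint>

// Fast approximate reciprocal: flip every bit of the representation except the sign bit.
int16_t fast_reciprocate_p16(int16_t v) {
    constexpr int nbits = 16;
    constexpr uint16_t msb      = 1u << (nbits - 1);                                  // 0x8000
    constexpr uint16_t signmask = static_cast<uint16_t>(~((msb | (msb - 1u)) >> 1));  // 0x8000
    return static_cast<int16_t>(v ^ static_cast<uint16_t>(~signmask));                // v XOR 0x7FFF
}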

Moreover, some interesting nonlinear activation functions used in DNNs can be approximated with this format. Some of the most important functions that can be approximated in this way are the sigmoid (see [7]), the hyperbolic tangent and the extended linear unit function (see [9]), as explained in Sects. 2.1.4–2.1.6.

Fig. 3 Accuracy comparison between exact (original formula applied to the posit format) and approximated versions of the hyperbolic tangent (TANH) and extended linear unit (ELU) using posit\(\langle 8,0\rangle \). The functions were computed on each point of the posit\(\langle 8,0\rangle \) range. The mean squared error between the TANH function versions is \(2.8\cdot 10^{-3}\), while the mean squared error for the ELU ones is \(3.7\cdot 10^{-3}\)

2.1.4 Fast sigmoid activation function

The sigmoid function \(sigmoid(x) = 1 / (1 + e^{-x})\) can be approximated, for posits with \(es=0\), as follows (where v is the integer representing the posit x and Y the integer representing the posit that approximates sigmoid(x)):

$$\begin{aligned} Y = ( (1 \ll (nbits - 1)) + v + 2) \gg 2. \end{aligned}$$

The approximated sigmoid function can be used as a building block for the other two functions, using linear combinations that exploit fast approximated operators of posit arithmetic seen before.
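A minimal scalar C++ sketch of this formula for posit\(\langle 16,0\rangle \) (not the cppPosit implementation; the function name is ours). The sum is computed in a 32-bit intermediate so that the \(1 \ll (nbits-1)\) term does not overflow the 16-bit holder:

#include <cstdint>

// Fast approximate sigmoid on the raw posit<16,0> representation:
// Y = ((1 << (nbits-1)) + v + 2) >> 2, evaluated in a wider register.
// Note: the NaR pattern (INT16_MIN) is not handled specially here.
int16_t fast_sigmoid_p16(int16_t v) {
    constexpr int nbits = 16;
    const int32_t t = (int32_t(1) << (nbits - 1)) + v + 2;
    return static_cast<int16_t>(t >> 2);   // result encodes a posit in [0, 1]
}

For instance, v = 0 (x = 0) yields Y = 0x2000, i.e. the posit 0.5, as expected.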

2.1.5 Fast hyperbolic tangent

The hyperbolic tangent can be obtained from a linear combination of the sigmoid function using the double and the one’s complement operators:

$$\begin{aligned} \text {tanh}(x) = 2\cdot \text {sigmoid}(2x) - 1 = - \left( 1 - 2\cdot \text {sigmoid}(2x)\right). \end{aligned}$$

Finally, instead of using the exact sigmoid formula, we approximate the hyperbolic tangent using the fast approximated version of the sigmoid described in Sect. 2.1.4. In order to satisfy the one’s complement requirement (\(x \in [0,1]\)), we only consider negative x values, which result in the sigmoid output being in [0, 1/2]. We then exploit the odd symmetry of the hyperbolic tangent around 0 to obtain the values for positive arguments.

2.1.6 Fast extended linear unit

Similarly, the extended linear unit function for negative arguments can be implemented as a linear combination of the sigmoid using the said operators:

$$\begin{aligned} e^x - 1 = -2\cdot \left[ 1-\frac{1}{2\cdot \text {sigmoid}(-x)}\right]. \end{aligned}$$

As in the previous case, if we substitute the exact sigmoid function with its fast approximated version, we obtain the fast approximated version of the ELU. Figure 3 shows the accuracy comparison between the two, while Fig. 4 shows the processing time comparison between the exact and approximated versions of the two functions. More mathematical details can be found in [17]. In the following sections, we will see how to speed up DNN training and inference using the SVE feature of modern ARM CPUs.

Fig. 4 Processing time comparison between exact (original formula applied to the posit format) and fast approximated versions of the hyperbolic tangent (TANH) and extended linear unit (ELU). The reported results come from evaluations of the functions on each point of the posit\(\langle 8,0\rangle \) domain. As reported, the approximated ELU function is on average five times faster than the exact version, while the approximated TANH function is more than 18 times faster on average

3 Posits and DNNs

When considering posit numbers for DNNs, we need to take into account that the highest density of posit numbers is in the range \([-1,1]\); this range indeed covers half of the posit projective circle. This can be exploited to design networks that behave better when used together with posit numbers, and it can be addressed in different ways, as discussed in the next subsections.

3.1 Activation functions

When choosing activation functions, we need to consider their output range. For example, the ReLU activation function discards all negative arguments, flattening them to 0. The sigmoid function, limiting the output to [0, 1], does not exploit the precious high-density region \([-1,0]\); the hyperbolic tangent, instead, can fully exploit the region \([-1,1]\). However, since modern deep neural network architectures are very deep (the number of layers is huge), S-shaped functions like the hyperbolic tangent suffer from vanishing gradients and are thus not suitable for the training process. The ELU function and, in general, scaled extended linear units (SELUs [19]) manage to cover a wider range, typically parameterized by two real factors \(\alpha \) and \(\beta \): \([-\alpha \cdot \beta ,+\infty )\).

3.2 Distribution of values

When stacking layers in a deep model, we need to take care of the shift of value distributions during forward passes. Adding a batch normalization layer [20] after some convolution and activation steps re-scales the values by subtracting the batch mean and dividing by the batch standard deviation. This results in a value distribution with zero mean and unit standard deviation, thus matching the high-density region described above.

3.3 Loss strategies

If we want to perform low-precision inference without losing too much accuracy (e.g. switching to posit\(\langle 8,0\rangle \) for inference), we need to take into account the dynamic range of such types (e.g. posit\(\langle 8,0\rangle \) has a range \([-64,64]\)). This means that, during training, we must penalize high network weights. This can be addressed by using different types of regularization; recent trends in regularization for neural networks are surveyed in [21]. For example, a weight decay approach (see [22]) with a decay rate \(\lambda \) adds the following L2 regularization term to the loss:

$$\begin{aligned} R(w) = \lambda \cdot \frac{1}{2} \cdot \left\Vert w \right\Vert ^2_2 . \end{aligned}$$
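To see why this term keeps the weights small, note that its gradient is simply \(\lambda w\), so each gradient-descent step (with learning rate \(\eta \) and data loss L) shrinks the weights multiplicatively:

$$\begin{aligned} w \leftarrow w - \eta \left( \nabla _w L + \lambda w \right) = (1-\eta \lambda )\, w - \eta \nabla _w L. \end{aligned}$$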

This has been proven to reduce overfitting and training error in [23]. In general, avoiding overfitting helps to maintain low weight values; therefore, the use of other layers designed to help generalization, such as the dropout layer [24], can be useful as well.

3.4 Data pre-processing

When considering low-precision inference, we also need to take into account the encoding of the data fed to the neural network. For example, in an RGB dataset each pixel channel is encoded as an integer in [0, 255]. If we feed this type of data to a posit\(\langle 8,0\rangle \) network, values above 64 will be clipped to the maximum value; moreover, we are not exploiting the negative axis. To address this problem, we may re-scale the encoding before even training the network. Simply re-scaling the image to \([-1,1]\) is not always a good solution, since it may result in an unacceptable loss of information. Another important point on the posit circle is \(useed = 2^{2^{es}}\) (and its negative counterpart), which is strictly connected to the dynamic range of a posit, \(\pm useed^{nbits-2}\). For example, re-scaling an image to the range \([-useed,useed]\) of posit\(\langle 8,0\rangle \) (thus having the pixels encoded in \([-2,2]\)) has been proven to be an effective encoding for both the training phase (with higher-precision types) and the inference phase (with low-precision types). The formula to re-scale the value p of each pixel is therefore:

$$\begin{aligned} n(p) = 2\cdot useed \cdot \frac{p}{255} - useed. \end{aligned}$$
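For instance, for posit\(\langle 8,0\rangle \) we have \(useed=2\), so the mapping becomes:

$$\begin{aligned} n(p) = \frac{4p}{255} - 2, \qquad n(0)=-2,\quad n(128)\approx 0.008,\quad n(255)=2. \end{aligned}$$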

In the Appendix, we describe a MATLAB tool that helps the user choose the best posit configuration, depending on the needs of the application at hand.

4 ARM SVE architecture

The ARM scalable vector extension (SVE [14]) is a vector extension for the ARM AArch64 architecture supported by the ARMv8 instruction set. The main difference between SVE and other single-instruction multiple-data (SIMD) engines (Intel AVX/SSE or ARM NEON) is that it does not fix the width of the vector registers; it only constrains it to be a multiple of 128 bits, up to 2048 bits. This approach, called vector length agnostic (VLA), allows us to implement a single vectorized version of our operations, exploiting both auto-vectorization and the ARM C Language Extensions (ACLE), without the need to target specific hardware platforms (Figs. 5, 6).

Fig. 5 ARM SVE Z data registers: 32 vector length agnostic registers of width \(L=128\cdot k\) bits, \(k \in [1,16]\)

Fig. 6 ARM SVE P predicate registers: 16 vector length agnostic registers of width \(P = Z/8\)

The SVE architecture introduces new kinds of registers:

  • Z registers: 32 registers with configurable width, from 128 to 2048 bits, as said above. These registers are meant to be data registers. SVE allows data in Z registers to be interpreted as 8-bit (bytes), 16-bit (half-words), 32-bit (words) and 64-bit (double words) elements. For instance, referring to posits, a 2048-bit Z register can hold up to 256 posit\(\langle 8,X\rangle \) values (for any exponent configuration).

  • P registers: 16 predicate registers, with one bit controlling each byte of a Z register (a 2048-bit Z register is controlled by a 256-bit P register). Each bit in a P register is interpreted as a Boolean. A predicate lane, made of 1 to 8 predicate bits, indicates whether the corresponding Z register lane is active or not, according to its least significant bit.

5 The C++ library cppPosit

For this work, we used the cppPosit C++ posit library developed in Pisa. This library exploits C++ templates to provide flexibility in the posit configuration, with the total number of bits ranging from 4 to 64. The main feature of cppPosit is the separation of the posit type into an interface frontend and a backend. The cppPosit frontend exposes all the implemented operations on posits, regardless of the underlying implementation. The backend implements the actual operations offered by the frontend, in one of the following flavours.

The supported backends are: i) fixed-point, ii) software floating-point (exploiting the Berkeley SoftFloat library), iii) hardware floating-point unit (FPU), if present, and iv) tabulated (or log-tabulated). The latter two deserve a deeper analysis: they become important backends when a hardware posit processing unit (PPU) is not available and the number of bits does not need to be large (as in DNNs).

5.1 Tabulated posits

When dealing with low-bit posits (e.g. 8-, 10-, 12-bit posits), we can pre-compute the arithmetic operators and some convenient functions in look-up tables to be used at run time for posit processing. Without optimization, these two-operand tables grow quadratically with the number of representable posit values. The main optimizations exploit the symmetry of addition and the antisymmetry of subtraction to halve the tables. The log-tabulated approach also optimizes the multiplication and division operations, by noticing that \(\log (a\cdot b) = \log (a)+\log (b)\) (see also [4] for logarithmic numbers). In this way, we only need two single-operand tables, for logarithm and exponentiation, to perform both multiplication and division. These single-operand tables scale only linearly with the number of posit values, thus reducing the overall size of the look-up tables. Note that the log-tabulated approach may result in some products or powers being off in the last bit.

5.2 Operational levels

The cppPosit library also classifies the posit operations into four different operational levels:

  • L1: These operations only require bit manipulations of the signed integer v representing the posit and thus can be executed with the sole support of the ALU, in a fast and efficient way. L1 operations are the most efficient ones and are of crucial importance. Table 1 shows some of the L1 operations implemented in the library.

  • L2: These operations require decoding the posit into its sign, regime, exponent and fraction fields, an additional unpacking step that slows down the computation (it includes the use of count-leading-zeros (CLZ) operations).

  • L3: These operations also require the complete construction of the posit fields, including the join between the exponent and regime fields, with an additional computation cost. Note that in the 0-bit exponent case L3 and L2 operations have the same complexity.

  • L4: These operations require the posit to be fully unpacked and reconstructed in the chosen backend.

As reported in Table 1, most of the L1 operations require 0 exponent bits, due to the properties emerging for posits with this particular configuration, as already explained in Sect. 2. Moreover, other functions, such as the 1’s complement, require the posit to be in the unitary range.

Table 1 The most important L1 operations implemented in cppPosit, including commonly used activation functions such as the sigmoid, hyperbolic tangent and extended linear unit

5.3 Vectorized extension

In this section, we introduce the vectorized extension of the cppPosit library, aimed at providing vector versions of the posit operations. Firstly, we need to take into account the differences between the operational levels. L1 and L2 operations are the easiest to vectorize; they only require bit manipulations of unsigned or signed integers, plus additional encoding and decoding steps. Instead, L3 and L4 operations need to be brought back to the chosen backend; then, in the case of a hardware floating-point backend, we can use native SIMD vectorization, if available.

In order to provide a more general and abstract interface to posit vectorized operations, the architecture has a separate posit vector frontend and a specialized posit vector backend that, in our case, implements the vectorized operations using the ARM ACLE for SVE (Fig. 7).

Fig. 7 UML class diagram for an example SVEBackend that allows the vectorized computation of the fastSigmoid

When implementing vectorized operations, we have a common template to follow:

  • Prologue: We need to prepare the data to be fed to the SIMD engine. For posits and L1 operations, this means preparing a vector with the signed integers representing the posits. In the SVE case, this means loading the posit holder type content (e.g. int16_t for posit\(\langle 16,X\rangle \)) into the Z registers using the svld1(...) intrinsic. For L3/L4 operations, we instead need to unpack the posits to the underlying backend (fixed, floating or tabulated) and load the backend type into the registers as well, performing a full decoding of the posit type.

  • Body: The body contains all the arithmetic and logic functions needed to apply the considered operation. In the SVE case, this may contain the SVE intrinsics that operate on the Z vector registers holding the posit data. For instance, when implementing the fastSigmoid function, we use the built-in intrinsics svasr_x(...) for the right shift and svadd_x(...) for the sum: the first performs the same right shift on all the vector elements, while the second adds the value \((1\ll nbits-2)\) to all the vector elements (see the sketch after this list).

  • Epilogue: We need to build the posits back into the result vector from the signed integers we have just manipulated in the function body. For SVE, this means invoking the svst1(...) intrinsic on the SVE result obtained in the previous step. For L3/L4 operations, we instead need to pack the posits back up to the frontend, performing a full encoding of the posit type.
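Putting the three steps together, the following is a minimal sketch of a vectorized fastSigmoid for posit\(\langle 16,0\rangle \) (not the actual cppPosit SVE backend; the function name is ours). To stay within 16-bit lanes, it uses the equivalent sign-bit-flip formulation of the scalar formula (XOR with the sign bit followed by a logical right shift), which matches it up to the \(+2\) rounding term:

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

// Vectorized fast sigmoid: prologue (load), body (xor + shift), epilogue (store).
void fast_sigmoid_p16_sve(const int16_t* in, int16_t* out, size_t n) {
    const uint16_t* uin  = reinterpret_cast<const uint16_t*>(in);
    uint16_t*       uout = reinterpret_cast<uint16_t*>(out);
    for (size_t i = 0; i < n; i += svcnth()) {                      // svcnth(): 16-bit lanes per Z register
        svbool_t pg = svwhilelt_b16((uint64_t)i, (uint64_t)n);      // predicate covering the tail
        svuint16_t v = svld1(pg, uin + i);                          // prologue: load posit bit patterns
        v = sveor_x(pg, v, svdup_u16((uint16_t)0x8000));            // body: flip the sign bit ...
        v = svlsr_x(pg, v, svdup_u16((uint16_t)2));                 // ... and logical right shift by 2
        svst1(pg, uout + i, v);                                     // epilogue: store the resulting posits
    }
}

Compiled with, e.g., armclang++ -O3 -march=armv8-a+sve, the same binary adapts to any SVE vector length thanks to the VLA approach.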

When vectorizing non-L1 operations that require the posit to be decoded into its components (sign, regime, exponent and fraction), we need to take into account two phases of the prologue. The first and simplest one is the conversion of the posit to the underlying signed integer holder type, which is just a pointer cast from the posit type to the holder one and has practically no cost. The second and hardest one is the vectorization of the posit decoding step, since it involves many operations and branches on the bit string. After this decoding, the function body is the same as applying vectorization to the backend type (native ARM floats in our case).

The same behaviour holds for the epilogue as well. Predictably, both the prologue and the epilogue of non-L1 operations introduce some overhead in the function computation, due to the conversion of the posits to the underlying backend. This means that, to see the real effectiveness of this vectorized approach, we need to test it on large-scale data and SVE vector sizes. This will be addressed more deeply in the next section.

6 Single-operation benchmarks results

In this section, we present benchmark results on isolated operations such as activation functions, dot products and DNN convolutions. All benchmarks are compiled in two different versions using the armclang++ 19.3 compiler. One version is compiled enabling all compiler optimizations with the -Ofast flag and targeting the armv8-a+sve architecture, to enable the generation of SVE vectorized assembly instructions. The other version (naive from now on) is compiled without targeting any vectorization platform. In this way, we can compare the execution times of single operations under the vectorized and the naive approach. At the time of writing, the only announced hardware supporting the SVE instruction set is the Fujitsu A64FX CPU [25], which employs 512-bit wide SVE vector registers; unfortunately, it is not available to us at the moment. Therefore, all the benchmarks presented in this paper are executed on the ARM SVE Instruction Emulator, with SVE vector lengths ranging from 128 to 2048 bits. The emulator runs on a HiSilicon Hi1616 CPU with 32 ARM Cortex-A72 cores running at 2.4 GHz (only single-core performance is addressed for the single-operation benchmarks).

Table 2 shows the activation function comparison between the vectorized and naive approaches on different benchmarks. Each benchmark has been executed on 8192-bit vectors with different vectorization levels in the case of SVE. Each computation is repeated 1000 times, and the average is computed and reported. As we can see, every function benefits from a substantial speed-up thanks to vectorization, up to \(18\times \).

Table 3 shows the benchmark results for the vectorized and naive approaches. Dot-product benchmarks have been executed on 8192-bit vectors with different vectorization levels in the case of SVE. Matrix multiplication benchmarks have been executed on \(64\times 64\) matrices. Each computation is repeated 1000 times, and the average is computed and reported. Convolution operations employ a \(3\times 3\) filter. More details are provided in the next subsections.

Note that the vectorized multiply-and-accumulate instruction (namely svmla) offered by the ARM SVE ISA allows these operations to be performed as fused floating-point additions of products. As stated in the ARM SVE ACLE documentation [26], this instruction does not perform an intermediate rounding step after the multiplication (this is a very important behaviour).

Table 2 Common activation function benchmark result comparison between vectorized and naive approaches
Table 3 Common vector operation benchmark result comparison between vectorized and naive approaches
Table 4 Common pooling operation benchmark result comparison between vectorized and naive approaches

6.1 Dot product

As reported, the vectorized dot product benefits from an impressive speed-up, even without vectorization of the posit decoding. This can be straightforwardly explained: both the naive and the vectorized approaches need to convert posits to the chosen backend (native float in our case). Once converted, the vectorized approach fully exploits the floating-point unit and SIMD acceleration, while the naive one exploits only floating-point acceleration.

6.2 Matrix–matrix multiplication

Matrix–matrix multiplication also benefits substantially from vectorization. In addition, as shown in [15], this operation has been realized avoiding the traditional sequence of dot products between rows and columns. The idea is to carry a register-wide slice of a row of the second multiplication operand in a vector register, to be multiplied by a single element of the first operand. Let \(A \in {\mathbb {R}}^{M\times K}\), \(B \in {\mathbb {R}}^{K\times N}\) and \(C=A \cdot B\) be the matrices involved in the operation. The values of C can be obtained in batches of length equal to the vector register capacity, say L, for the used representation as follows:

$$\begin{aligned} C_{i,[j:j+L-1]} = \sum _{k=1}^{K} A_{i,k} \cdot B_{k,[j:j+L-1]}. \end{aligned}$$
(4)
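A hedged C++ sketch of Eq. (4) on the float backend (assuming row-major M×K, K×N and M×N buffers; the function name is ours): each scalar \(A_{i,k}\) is broadcast and multiplied with a register-wide slice of row k of B, accumulating into a slice of row i of C:

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

// C[i][j:j+L-1] += A[i][k] * B[k][j:j+L-1], batched over j by the SVE vector length.
void matmul_sve(const float* A, const float* B, float* C,
                size_t M, size_t K, size_t N) {
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; j += svcntw()) {                    // svcntw(): 32-bit lanes per Z register
            svbool_t pg = svwhilelt_b32((uint64_t)j, (uint64_t)N);
            svfloat32_t acc = svdup_f32(0.0f);
            for (size_t k = 0; k < K; ++k) {
                svfloat32_t b = svld1(pg, B + k * N + j);             // slice B[k][j : j+L-1]
                acc = svmla_x(pg, acc, b, svdup_f32(A[i * K + k]));   // acc += A[i][k] * b
            }
            svst1(pg, C + i * N + j, acc);                            // slice C[i][j : j+L-1]
        }
    }
}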

6.3 Convolution

Computing the convolution (a sequence of matrix–vector multiplications) is the most demanding part of the forward pass of a convolutional deep neural network; thus, speeding it up is crucial. We therefore considered a \(3\times 3\) convolution operation, where we obtained significant improvements from the vectorization approach, gaining a very impressive speed-up compared to the plain version. Our approach works for any filter size when the stride is equal to 1; for other types of convolution, the auto-vectorized version is preferable, still providing consistent speed-ups. The basic idea is to perform three different one-dimensional convolutions, one for each filter row, moving the filter along the image matrix. For each filter stride, we convolve the filter rows with a batch of matrix row elements loaded in a vector register. In order to do this, we pre-fetch the nine filter elements into the vector registers. The pseudocode for the algorithm is shown in Algorithm 1. Note that the vector multiplication instructions are controlled by three different predicates, one for each filter column. The first predicate allows elements of the filter to be multiplied with elements inside the window \(\left[ j;\,j+L-2\right] \), where j is the current position of the filter in the image columns and L is the SVE vector length. Similarly, the second predicate allows multiplication only in \(\left[ j+1;\,j+L-1\right] \) and the third one in \(\left[ j+2;\,j+L\right] \). The algorithm can be easily extended to \(5\times 5\) convolutions, at the cost of increased register pressure.

Algorithm 1 Vectorized \(3\times 3\) convolution (pseudocode)
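Since Algorithm 1 relies on the three-predicate masking scheme described above, we show here a simpler (and less register-efficient) hedged C++ variant on the float backend that re-loads a shifted slice for each filter column instead. It is not the implementation used in our benchmarks, but it illustrates how L contiguous output pixels are produced per iteration:

#include <arm_sve.h>
#include <cstddef>
#include <cstdint>

// 3x3 convolution, stride 1, "valid" padding, on a row-major H x W image.
// w points to the 9 filter coefficients in row-major order.
void conv3x3_sve(const float* img, float* out, size_t H, size_t W, const float* w) {
    const size_t OH = H - 2, OW = W - 2;                      // output size
    for (size_t i = 0; i < OH; ++i)
        for (size_t j = 0; j < OW; j += svcntw()) {
            svbool_t pg = svwhilelt_b32((uint64_t)j, (uint64_t)OW);
            svfloat32_t acc = svdup_f32(0.0f);
            for (size_t r = 0; r < 3; ++r)                    // one 1-D convolution per filter row
                for (size_t c = 0; c < 3; ++c) {
                    svfloat32_t x = svld1(pg, img + (i + r) * W + j + c);
                    acc = svmla_x(pg, acc, x, svdup_f32(w[r * 3 + c]));
                }
            svst1(pg, out + i * OW + j, acc);                 // L contiguous output pixels
        }
}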

6.4 Pooling

Pooling kernels (i.e. average and max pooling) are important operations aimed at reducing the spatial size of network layers. In general, the spatial behaviour of these kernels is similar to that of convolution operations. In particular, the average pooling layer can be seen as an \(f\times f\) convolution with all the filter elements fixed to \(1/f^2\). Therefore, for average pooling we can draw the same conclusions drawn for convolution operations, obtaining similar results.

Typically, we want to reduce the layer output size by a factor k, equal to 2 or 3. To obtain a reduction by a factor of 2, we need to employ a \(2\times 2\) kernel with stride equal to 2. In general, if we want to reduce the size by a factor k, we need to employ a \(k\times k\) kernel with stride equal to k (with appropriate padding). The implemented algorithm for vectorized max pooling (see Algorithm 2) is quite different from the convolution one, employing both intrinsic vectorization and auto-vectorization mechanisms. Consider a typical \(3\times 3\) max pooling with stride 3. The first step is to perform an element-wise maximum between the three rows targeted by the current filter's top-left row position i. We then need to perform the same operation column-wise. To avoid expensive gather loads to fetch the matrix columns, we perform this reduction with an additional, auto-vectorizable loop. We instruct the compiler to vectorize this loop, whose index step is 3 (that is, the separation between the groups of items involved in the maximum operation), using the following pragma directive:

#pragma clang loop interleave_count(3).

This pre-processor directive aims to increase both the instruction-level parallelism inside the loop (by unrolling it) and the data-level parallelism (by interleaving data), with a compiler-specified interleave factor of 3.

For max pooling, we employed \(225\times 225\) images with a \(3\times 3\) pooling kernel with stride 3. Each computation was repeated 1000 times, and the mean timing results are shown in Table 4. As reported therein, the max pooling operation benefits greatly from SVE vectorization, gaining a massive speed-up compared to the naive version. Moreover, we analysed the Clang vectorization report to verify that the additional loop is indeed interleaved by the compiler.

Algorithm 2 Vectorized \(3\times 3\) max pooling (pseudocode)
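The following is a hedged C++ sketch of this max-pooling scheme (3×3 kernel, stride 3) on the float backend, not the actual Algorithm 2 code: the SVE intrinsics compute the element-wise maximum of the three rows of each band, while the strided column reduction is left to the auto-vectorizer via the interleave pragma:

#include <arm_sve.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3x3, stride-3 max pooling on a row-major H x W image (H and W multiples of 3).
void maxpool3x3_sve(const float* img, float* out, size_t H, size_t W) {
    const size_t OH = H / 3, OW = W / 3;
    std::vector<float> rowmax(W);                             // row-wise maxima of the current band
    for (size_t oi = 0; oi < OH; ++oi) {
        const float* r0 = img + (3 * oi + 0) * W;
        const float* r1 = img + (3 * oi + 1) * W;
        const float* r2 = img + (3 * oi + 2) * W;
        for (size_t j = 0; j < W; j += svcntw()) {            // vectorized maximum over the three rows
            svbool_t pg = svwhilelt_b32((uint64_t)j, (uint64_t)W);
            svfloat32_t m = svmax_x(pg, svld1(pg, r0 + j), svld1(pg, r1 + j));
            m = svmax_x(pg, m, svld1(pg, r2 + j));
            svst1(pg, rowmax.data() + j, m);
        }
        // Column-wise reduction with step 3, left to interleaved auto-vectorization.
        #pragma clang loop interleave_count(3)
        for (size_t oj = 0; oj < OW; ++oj)
            out[oi * OW + oj] = std::max({rowmax[3 * oj], rowmax[3 * oj + 1], rowmax[3 * oj + 2]});
    }
}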

7 tinyDNN benchmarks results

In this section, we present benchmark results for the vectorized cppPosit library when used inside the tinyDNN neural network framework. For this benchmark, we used several very deep neural network models; in particular, models proven to be successful in the ImageNet challenge [27]. AlexNet [23] is an eight-layer deep neural network, with internal convolutional layers reaching up to 192 convolution kernels, for an overall 60M parameters. ResNet architectures [28] are built by stacking so-called residual blocks, composed of two or, in deeper models, three convolutional layers; for each block, the input is summed to the output. This approach has been proven to improve classification performance in the ImageNet challenge. We tested our vectorized method on the 34-layer and 152-layer versions of the ResNet architecture. The VGG16 and VGG19 models [29] are deep convolutional neural networks with a series of stacked convolutional layers (16 and 19 weight layers, respectively), where dimensionality reduction is operated by max-pooling layers.

Table 5 shows the inference results for the said benchmark networks. As reported, the speed-up gained by the SVE-enabled version over the non-vectorized one is impressive.

Table 5 Image processing time (in seconds) for various very deep neural network models using posit\(\langle 8,0\rangle \)

8 Conclusions

Since many current image processing applications use deep neural networks, it is crucial to speed up at least the DNN forward-pass phase. In this paper, we presented an approach based on the use of a novel representation of real numbers (the posit format) and on the speed-up of DNN operations using SIMD instructions. Our approach is interesting for image processing applications where a GPU is not available, such as most smart camera applications or even some assisted-driving applications. More precisely, in this work we have presented a vectorized extension of posit arithmetic targeting the ARM SVE architecture, implementing some core functions of machine learning and deep neural networks, where we have taken advantage of both explicit and auto-vectorization. We extended our cppPosit C++ software posit arithmetic library, exploiting the properties of L1 operations and applying vectorization to the integer arithmetic behind them. This allowed us to obtain a substantial speed-up in the computation of fast approximated activation functions such as the sigmoid, hyperbolic tangent and extended linear unit. Moreover, we proposed an approach for implementing machine learning vector and matrix operations with the posit format, exploiting the underlying native vectorization for ARM floats, again gaining a solid speed-up in the computation of operations such as dot products and convolutions. Finally, we applied the acquired knowledge to the tinyDNN C++ deep neural network library for the low-precision inference phase with 8-bit posits, reporting a relevant improvement in the mean sample inference time when switching from the non-optimized version to the optimized one. Future work includes porting more portions of the tinyDNN library to the ARM SVE architecture, to further extend the presence of vectorization within it.