#### **ORIGINAL RESEARCH**



# **A Precision‑Aware Neuron Engine for DNN Accelerators**

**Sudheer Vishwakarma1 · Gopal Raut2 · Sonu Jaiswal2 · Santosh Kumar Vishvakarma2 · Dhruva Ghai[1](http://orcid.org/0000-0002-8204-6330)**

Received: 13 January 2024 / Accepted: 31 March 2024 © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2024

#### **Abstract**

Deep Neural Networks (DNNs) form the backbone of contemporary deep learning, powering various artificial intelligence (AI) applications. However, their computational demands, primarily stemming from the resource-intensive Neuron Engine (NE), present a critical challenge. The NE comprises Multiply-and-Accumulate (MAC) and Activation Function (AF) operations, which contribute significantly to the overall computational overhead. To address these challenges, we propose a Precision-aware Neuron Engine (PNE) architecture that supports both low-bit and high-bit precision computations with minimal resource utilization. The PNE's MAC unit pre-loads the accumulator register with the bias value, eliminating the need for additional components such as an extra adder, a multiplexer, and a bias register. An 8-bit sfixed $\langle N, q \rangle$ implementation of the MAC in the PNE shows 29.23% savings in resource utilization and 32.91% savings in critical delay compared with the IEEE architecture, and 24.91% savings in power-delay product (PDP) compared with the Booth architecture. Our evaluation shows that the PNE maintains inference accuracy across quantized and unquantized models. The proposed design achieves precision-awareness with a minimal increase ($\approx 10\%$) in resource overhead, while also achieving a 34.61% increase in throughput and a reduction in critical delay (34.37% faster than the conventional design). A software emulator shows minimal accuracy losses ranging from 0.6% to 1.6%, and the PNE proves its versatility across different precisions and datasets, including MNIST (on LeNet) and ImageNet (on CaffeNet). The flexibility and configurability of the PNE make it a promising solution for precision-aware neuron processing, particularly in edge AI applications with stringent hardware constraints. This research contributes an advancement towards enhancing the efficiency of DNN computations through precision-aware architecture, paving the way for more resource-efficient and high-performance AI systems.

**Keywords** Deep neural networks · Neuron engine · Edge-AI · Multiply-accumulate unit · Activation function · Precision-aware architecture · Approximate computing

Sudheer Vishwakarma and Gopal Raut have contributed equally to this work.

 $\boxtimes$  Dhruva Ghai dhruvaghai@orientaluniversity.in

Sudheer Vishwakarma vsudheer062@orientaluniversity.in

Gopal Raut gopalraut05@gmail.com

Sonu Jaiswal phd2101191002@iiti.ac.in

Santosh Kumar Vishvakarma skvishvakarma@iiti.ac.in

- <sup>1</sup> Department of Electronics and Communication Engineering, Oriental University, Indore, India
- <sup>2</sup> Department of Electrical Engineering, Indian Institute of Technology Indore, Indore, India

# **Introduction and Motivation**

The demand for efficient deep learning (DL) hardware is escalating with the increasing need for advanced AI applications. Within the realm of DL, deep neural networks (DNNs) have gained prominence for diverse applications, including object detection, pattern and character recognition, audio and video processing, language translation, trading, gaming, and cyber-security, as indicated by Sim et al. [[1](#page-12-0)]. Their capability to map complex relationships within non-linear data sets them apart from other prediction techniques, as highlighted by Khalil et al. [\[2](#page-12-1)]. The DNN's Neuron engine plays a pivotal role in computation and accuracy, yet its hardware implementation is known for its demand for power and resources, as pointed out by Shawl et al. [[3\]](#page-12-2). Optimizing the physical performance of the Neuron engine (NE) is therefore crucial in addressing the heightened computational requirements of DNNs. This engine handles the Multiply-Accumulate (MAC) operation and the non-linear transformation function, known as the activation function (AF). The MAC and AF operations are responsible for 90% of the computation in the neural network, so optimizing the architecture of this computational unit is essential to enhance DNN performance.

A typical DNN consists of two parts: feature extraction, as illustrated in Fig. [1](#page-1-0)a, and output classification, as depicted in Fig. [1](#page-1-0)b. During feature extraction, input features undergo convolution with filters, requiring hundreds or thousands of parallel neuron processing engines to carry out these computations [\[4](#page-12-3)]. Consequently, there is a compelling need to optimize the Neuron engine responsible for executing the fundamental computations within DNN inference. The design of the NE, encompassing MAC and AF, involves various design parameters, including arithmetic precision, data types, approximation in computation, data quantization, computation algorithms, and hardware implementation platforms, among others [[5\]](#page-12-4). The hardware platforms used for implementation encompass CPUs, GPUs, FPGAs, and ASICs, each with its respective advantages and drawbacks [[6\]](#page-12-5). For edge-AI solutions, FPGA and ASIC-based implementations are preferred. For power-efficient solutions, ASIC-based implementation is favored, although it lacks reconfigurability compared to FPGAs. To enhance computational efficiency and reduce architecture complexity, numerous investigations explore approximation in computation and data quantization [[7](#page-12-6)]. However, these techniques often result in reduced accuracy, so they must be applied to the neuron engine with care. Researchers are presently exploring quantization in the NE to enhance computational capacity while preserving model precision. Additionally, the use of different, application-dependent AFs within the same network is recommended [[8](#page-12-7)]. Traditionally, this necessitates separate hardware for the individual AFs and their configuration, leading to increased hardware resources and critical circuit delays.

In pursuit of reduced complexity in DNN hardware accelerators, lower arithmetic precision and integer or fixed-point data representation are preferred in both MAC and AF computations. During convolution, where a $k \times k$ kernel convolves with an input feature map (Fig. [1](#page-1-0)a), parallel multiplication and accumulation occur, increasing the output precision to $2N + M$. Here, N represents the input precision, and M signifies the overhead bits, which depend on the number of accumulations performed by the corresponding MAC unit. The MAC output is then provided to the AF, dictating the precision of the AF. Conventional AF implementations, such as those based on Look-Up Tables (LUTs) or Read-Only Memory (ROM), become hardware-costly for higher precision because the number of memory elements grows to $2^P$, where *P* is the input bit precision of the AF (traditionally $2N + M$). The MAC output can be quantized to N bits before it is applied to the AF [[8](#page-12-7), [9](#page-12-8)]. However, this approach may result in accuracy loss, particularly when higher accuracy is prioritized for complex input features. Therefore, an efficient neuron must provide the option to select between quantized and unquantized MAC outputs. Moreover, if unquantized data is fed to the AF, a conventional AF implementation becomes undesirable, as it is power and resource-intensive, especially for higher precision. Consequently, a solution that accommodates both quantized and unquantized MAC outputs is needed. To tackle this challenge, we introduce the Precision-aware Neuron Engine (PNE). The distinctive features of the PNE and the primary contributions of this work are outlined below:



<span id="page-1-0"></span>**Fig. 1** The convolution layer in DNN performs 2D matrix multiplication between the input feature map and kernel weights for feature extraction using MAC. FC layer in DNN performs 1D element-wise data computation. The ultimate classifying output layer is an FC layer

SN Computer Science A SPRINGER NATURE journal

- A resource and power-efficient MAC architecture in the PNE with a state machine design is presented, eliminating the multiplexer and utilizing a pre-loaded bias for precision-aware computations, offering both quantized and unquantized outputs.
- We present an adaptable AF using ROM and Cordic, capable of producing tanh and sigmoid functions across varying bit-precisions, exhibiting minimal accuracy degradation and reduced LUT usage compared to tensor-based models.
- The PNE is designed using the proposed MAC and AF, achieving high accuracy at low-bit precision through the quantized MAC output and ROM AF output, as well as at high-bit precision using the unquantized MAC output and Cordic AF output.
- The PNE's inference accuracy is assessed using Python emulation of the LeNet and CaffeNet DNN models, and its performance parameters on FPGA hardware, including resource utilization, power, and delay, are compared with state-of-the-art architectures.

This paper is an extension of our previous work presented at the IFIP-IoT conference [[10\]](#page-12-9) where an adaptable AF is presented. In this paper, we present a precision-aware neuron architecture using a precision-aware MAC and the adaptable AF.

## **Organization**

This article is structured as follows: related research is presented in section ["Related Research](#page-2-0)". The proposed PNE architecture, its state machine design, and its components, i.e. the MAC and AF, are discussed in section "[Proposed](#page-4-0) [PNE Architecture"](#page-4-0), followed by performance analysis and discussion of results in section "[Inference Accuracy and Hard](#page-7-0)[ware Performance: Evaluation and Analysis"](#page-7-0). Finally, the concluding remarks are given in section "[Conclusions and](#page-11-0) [Future Research"](#page-11-0).

## <span id="page-2-0"></span>**Related Research**

Various hardware accelerator architectures for neural networks have been introduced in recent years [[11\]](#page-12-10). Shallow neural networks are no longer adequate: the number of hardware neurons and connections in modern networks makes them unsuitable for the deep learning era. CNNs (a type of DNN) are frequently used in both video and image recognition systems, and usually employ a number of filters or convolution matrices [[12\]](#page-12-11). As convolution matrices have fewer parameters than FC network layer weights, parallelism can be introduced. To reduce network complexity, quantized data representation is preferred: the MAC output is quantized during inference, and weights and biases are rounded, since they are fixed once training is finished [[13](#page-12-12)]. In [14] and [15], a flexible, multi-precision per-layer data compression procedure is presented and implemented. Pruning aims to eliminate the subset of network units (i.e. weights or filters) that are least important for the network's intended task [\[16\]](#page-12-15). All of the above strategies call for a programmable and precision-aware PNE, which involves MAC computation followed by AF [[13\]](#page-12-12).

The MAC unit comprises an adder, a multiplier, and an accumulator register. The multiplication result is transferred to the accumulator, added, and the output is stored in a register. The types of adders, multipliers, and registers used lead to variations in area and delay [\[17](#page-12-16)]. Various articles have addressed MAC optimization by modifying the multiplication and addition techniques. Existing literature proposes different multiplication methods, such as Vedic [[18\]](#page-12-17), array, Wallace tree, Booth [[19\]](#page-12-18), shift and add [\[20,](#page-12-19) [21](#page-12-20)], and modified Booth [[22,](#page-12-21) [23](#page-12-22)]. Researchers have also focused on optimizing the addition and quantized accumulation process using techniques such as approximation, quantization/ data resize, bit-serial computation, and reduced precision, as discussed in a study by Garland et al. [[9\]](#page-12-8). Limited hardware resources make it challenging to implement MAC with parallel multipliers [\[24](#page-12-23)]. The conventional architecture of MAC (showing a single multiplier followed by an accumulator, along with input–output precision) is illustrated in Fig. [3](#page-3-0).

The typical RTL view of the MAC architecture is depicted in Fig. [2](#page-3-1), which includes a multiplier, two adders, a multiplexer, and an accumulator register [[25\]](#page-13-0). The state-of-the-art revised architecture presented in Fig. [3](#page-3-0) uses only one adder, one multiplexer, and register files, saving one adder compared to the typical architecture [[26\]](#page-13-1). However, this design still uses a bias register file and a multiplexer, which consume additional hardware resources and add delay through the bias register loading time and the multiplexer itself. The design can therefore be further optimized to increase throughput in DNN accelerator applications. The conventional MAC design produces an output of $2N + M$ bits, where N is the input bit width and M is the number of overflow bits that arise during accumulation. The overflow bit width depends on the number of accumulations: $M = \log_2 j$, where j is the number of accumulations required in the MAC, which in turn depends on the number of inputs.
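The bit-width bookkeeping above can be made concrete with a small calculation. The following is a minimal sketch, not the authors' tooling: the function names are hypothetical, and rounding $\log_2 j$ up to the next integer for accumulation counts that are not powers of two is an assumption. It also evaluates the $2^P$ ROM size quoted in the introduction for an AF fed with the unquantized MAC output.

```python
def overflow_bits(num_accumulations):
    """Overflow bit width M for j accumulations.

    Assumption: M = log2(j) is rounded up when j is not a power of two.
    (j - 1).bit_length() equals ceil(log2(j)) for every j >= 1.
    """
    return (num_accumulations - 1).bit_length()


def mac_output_bits(n_bits, num_accumulations):
    """Unquantized MAC output width: 2N + M."""
    return 2 * n_bits + overflow_bits(num_accumulations)


def rom_entries(p_bits):
    """Number of memory elements a ROM/LUT-based AF needs for a P-bit input."""
    return 2 ** p_bits


# Example: 8-bit inputs and a 5 x 5 kernel, i.e. j = 25 accumulations.
p_unquantized = mac_output_bits(8, 25)     # 2 * 8 + 5 = 21 bits
entries_unquantized = rom_entries(p_unquantized)
entries_quantized = rom_entries(8)         # MAC output resized to N = 8 bits
```

Under these assumptions the unquantized path would need a ROM with 2,097,152 entries against 256 for the quantized path, which illustrates why a ROM-based AF is only attractive at low bit-precision.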

The designs shown in Figs. [2](#page-3-1) and [3](#page-3-0), as well as our proposed design, use a fixed-point ⟨*N*, *q*⟩ arithmetic representation with an implied binary point separating the integer and fractional bits, as shown in Fig. [4](#page-3-2). The representation comprises N bits: N − q integer bits (the MSB of which is the sign bit) and q fractional bits. As overflow bits are necessary in the accumulation stage, the accumulator's bit size must be increased, and the overflow bit size is determined by the input size (i.e., the number of accumulations) in the corresponding neuron. In Fig. [3,](#page-3-0) the fixed-point N-bit numbers are depicted, as explained in Fig. [4](#page-3-2), along with the logic elements and a 2:1 MUX for the N-bit data line. The MUX with a select line is employed to choose between the pre-trained bias and the accumulation process. The excess hardware, i.e., the MUX and bias register, shown in the dotted red box [\[1\]](#page-12-0) in Fig. [3](#page-3-0), occupies additional hardware resources and increases delay. To address this issue, we optimized the design by pre-loading the bias value in the accumulator register, enabling resizing of the MAC output and handshaking with the activation.

<span id="page-3-1"></span>**Fig. 2** Conventional RTL design for typical MAC with fixed-point representation. Here, N-bit precision is considered at the input with an output of  $2N + M$  bits which includes overflow bits

<span id="page-3-0"></span>**Fig. 3** Conventional MAC architecture with a single multiplier, an accumulator register with additional overhead bits, and a MUX for selecting the bias value

<span id="page-3-2"></span>**Fig. 4** The fixed-point arithmetic representation for an N-bit number, wherein q represents the fraction bits,  $(N - q)$  represents the integer bits, and MSB represents the sign bit
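To illustrate the ⟨*N*, *q*⟩ representation described above, the following minimal sketch converts real values to and from an 8-bit format with 6 fractional bits. The helper names are hypothetical, and two points are assumptions rather than statements from the paper: the sign bit is counted among the N − q integer bits (so that the total is N bits), and out-of-range values saturate.

```python
def to_fixed(value, n=8, q=6):
    """Encode a real value as a signed N-bit integer with q fractional bits.

    Assumptions: round to nearest (Python's round) and saturate on overflow.
    """
    raw = round(value * (1 << q))
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, raw))


def from_fixed(raw, q=6):
    """Decode a raw fixed-point integer back to a real value."""
    return raw / (1 << q)


def to_bits(raw, n=8):
    """Two's-complement bit string; the first (N - q) characters are the
    integer part, whose MSB is the sign bit, and the last q are the fraction."""
    return format(raw & ((1 << n) - 1), f"0{n}b")


a = to_fixed(0.75)    # 0.75 * 64 = 48
b = to_fixed(-0.5)    # -0.5 * 64 = -32
c = to_fixed(3.0)     # out of range, saturates at the largest code 127
```

With these definitions, `to_bits(a)` is `00110000` and `to_bits(b)` is `11100000`, and the largest representable value is `from_fixed(127) = 1.984375`.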

Alongside correct re-initialization of the weight parameters, the AF is key to improving the network's learning capabilities. The sigmoid function is widely used with backpropagation training algorithms. Selecting the right AF is crucial for machine learning training and inference: the type of AF can affect the convergence and accuracy of network training, as well as increase the computational cost of the training and inference phases [\[27](#page-13-2)]. Hence the need for an adaptable AF. Ref. [\[28](#page-13-3)] presents a polynomial model for implementing the fractional exponent part of the tanh AF. These approaches achieve accurate approximations with minimal resource usage, but do not address configurability in the AF. An energy-efficient DNN accelerator with variable precision support, improved performance, and reduced energy consumption is evaluated at the MAC level in [\[29](#page-13-4)], but the AF still warrants investigation for further enhancement. The Cordic method, originally introduced by Volder and later modified by Walther, performs circular, linear, and hyperbolic operations [[30\]](#page-13-5). To address the additional resources and higher critical delays associated with Cordic, a resource-reuse Cordic-based architecture in [[8\]](#page-12-7) realizes sigmoid and tanh AFs using the same logic resources. This approach has two main drawbacks: low accuracy and high LUT utilization for bit-precision $\leq 8$. The adaptable AF presented in this paper combines the Cordic algorithm for high bit-precision AF and ROM for low bit-precision AF ($\leq 8$). For hardware implementation with fixed-point notation, ROM-based AFs are not suitable for high bit-precision applications due to their significant resource utilization (i.e., LUTs in FPGAs and memory elements in ASICs). The LUT-based approach splits non-linear input ranges into regions and stores their data in LUTs as straight-line segments. FPGA-based customizable hardware designs for AFs have been proposed in [\[31\]](#page-13-6), which are configurable, but consume more on-chip area compared to ASICs. FPGAs use BRAM to reduce computation overhead, but increased BRAM utilization trades memory usage for bit precision [[26\]](#page-13-1). In [[32](#page-13-7)], the authors present a library of VLSI implementations of various AFs for hardware-efficient NN accelerators.

## <span id="page-4-0"></span>**Proposed PNE Architecture**

In this section, we examine the hardware architecture and state machine of the precision-aware neuron engine (PNE) proposed in this research. The design incorporates an optimized MAC unit to address precision sensitivity. The PNE also includes an adaptable AF with ROM and Cordic-based implementations to improve memory efficiency in PNE processing while preserving versatile precision capabilities.

### **PNE and State Machine**

In this section, we present the PNE architecture and its corresponding state machine. The PNE, depicted in Fig. [5](#page-4-1), comprises the MAC unit followed by the AF unit. As shown in Fig. [6,](#page-4-2) the MAC unit within the PNE is designed to yield two outputs: quantized and unquantized. Both MAC outputs are inputs to a 2:1 MUX, which, depending on the control signal (precision\_ctrl), determines whether the quantized or unquantized output is provided as input to the AF. The (precision\_ctrl) signal also determines the AF type based on the output produced by the MAC unit (i.e., quantized or unquantized).

The PNE has three inputs: input, weight, and bias. While the design is versatile enough to handle different bit-precisions, we opt for a signed 8-bit arithmetic computation for performance evaluation and result extraction. The final output, identified as  $AF_{out}$, is configured as an 8-bit value within the PNE. In the AF, the hardware incorporates a select pin (precision\_ctrl) for AF processing. For lower precision, specifically the quantized output, we have employed a ROM-based AF designed to support 8-bit and lower precision. However, for the unquantized output, which has a higher bitwidth, a ROM-based approach might not be the most efficient choice for the AF. Consequently, we introduce a novel AF using iterative Cordic, ensuring support for higher precision computation with minimal hardware overhead compared to ROM-based implementations.

<span id="page-4-1"></span>

<span id="page-4-2"></span>**Fig. 6** PNE showcasing quantized and unquantized MAC outputs fed through 2:1 MUX to give MAC<sub>out</sub>, followed by AF output: AF<sub>out</sub>. PNE<sub>out</sub> is the same as AF<sub>out</sub>

Figure [7](#page-5-0) shows the state machine of the PNE. The state machine comprises several states: idle, initial, pre\_MAC, MAC, post\_MAC, and AF. In the idle state, the machine sets various input and output signals to their initial values and waits for the ComputeInit signal to begin processing. Upon receiving a ComputeInit signal, the machine transitions to the initial state, where input data is registered, and the initial sum value, including any bias values, is calculated. Specifically, mult\_reg is set to 0, and sum\_reg is initialized to the bias value (shown in Fig. [6](#page-4-2)). This state persists for a single clock cycle, after which the machine transitions to pre\_MAC, initiating the first multiplication. In this state, the initial bias is loaded into the accumulator register, and the machine multiplies the input data with the appropriate weight value, adding the result to the sum (pre-loaded bias). The index flag is also set to the number of inputs for later use. The machine then progresses to the MAC state, where the multiplication and accumulation operations continue until the index flag reaches zero. The index, initialized during the initial state, is decremented at every clock cycle during the MAC state. In this state, both the multiplier and accumulator are enabled. Upon the index reaching zero, the machine transitions to post\_MAC, where the final accumulation occurs and the multiplier is disabled. Here, the final calculations are completed, producing two outputs: quantized and unquantized. One of these outputs is selected and passed through the MUX. Additionally, in this state, preparations are made for the AF to be applied, along with the control signal precision\_ctrl. The state then changes to AF, during which the output of the MAC unit (i.e., the output of the 2:1 MUX with sum\_reg), quantized or unquantized depending on precision\_ctrl as shown in Fig. [6,](#page-4-2) is applied to the AF. In the AF state, the machine operates based on the control signal AF\_ctrl: it applies the activation function to the final sum value and sets a "done" flag to indicate that processing is complete. Here, the output of the proposed adaptable AF becomes the output of the PNE. The machine then returns to the idle state and waits for the next processing request. The detailed architectures of the MAC and AF are discussed in subsections ["Precision-Aware MAC Unit with](#page-5-1) [Pre-Loaded Bias](#page-5-1)" and ["Adaptable AF Using ROM/ Cordic](#page-6-0)". Table [1](#page-5-2) shows the various modes of operation and outputs of the PNE.

<span id="page-5-0"></span>**Fig. 7** State machine of the PNE showing reset signal and idle mode. The MAC computation happens during the two states: pre\_MAC and post\_MAC
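The state sequence described above can be traced with a small behavioural model. This is a hedged sketch rather than the authors' RTL: the function `run_pne`, the one-product-per-cycle pipelining (each product is held in `mult_reg` for one step before being accumulated), and counting the index down from the number of remaining inputs are all assumptions made for illustration.

```python
def run_pne(inputs, weights, bias, af):
    """Step through idle -> initial -> pre_MAC -> MAC -> post_MAC -> AF -> idle.

    Returns (af_out, sum_reg, trace). ComputeInit is assumed to be asserted
    immediately, so the machine leaves idle on the first step.
    """
    trace = ["idle"]
    state = "initial"
    mult_reg = sum_reg = 0
    index = k = 0
    af_out = None
    while state != "idle":
        trace.append(state)
        if state == "initial":
            mult_reg, sum_reg = 0, bias       # bias pre-loaded into sum_reg
            state = "pre_MAC"
        elif state == "pre_MAC":
            mult_reg = inputs[0] * weights[0]  # first multiplication
            k = 1
            index = len(inputs) - 1            # assumed: products still to form
            state = "MAC" if index > 0 else "post_MAC"
        elif state == "MAC":
            sum_reg += mult_reg                # accumulator enabled
            mult_reg = inputs[k] * weights[k]  # multiplier enabled
            k += 1
            index -= 1
            if index == 0:
                state = "post_MAC"
        elif state == "post_MAC":
            sum_reg += mult_reg                # final accumulation, no multiply
            state = "AF"
        elif state == "AF":
            af_out = af(sum_reg)               # activation, then "done"
            state = "idle"
    trace.append("idle")
    return af_out, sum_reg, trace


# Three inputs with an identity activation, to expose the accumulated sum.
out, acc, states = run_pne([1, 2, 3], [4, 5, 6], bias=10, af=lambda v: v)
```

For this example the trace is idle, initial, pre\_MAC, MAC, MAC, post\_MAC, AF, idle, and the accumulated value is 10 + 4 + 10 + 18 = 42. No extra cycle is spent adding the bias, because the accumulator starts from it.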

### <span id="page-5-1"></span>**Precision‑Aware MAC Unit with Pre‑Loaded Bias**

The proposed design employs two inputs for the multiplication operation: the j-input feature and the pre-trained j-weight, both stored in the weight register file (shown in Fig. [6](#page-4-2)). The multiplication of these inputs yields a 2N-bit output (mult\_reg), serving as input to the accumulator (sum\_reg), where trained biases are pre-loaded. The adder takes the multiplier output and the feedback (accumulation) from the output register as inputs, storing the results in the output register. Iteratively accumulating the adder output, the output register produces a final unquantized output of $2N + M$ bits, where M ($M = \log_2 j$) represents the overflow bit width resulting from the iterative operations. The output generated at the accumulate stage is also quantized into an N-bit value. Both the quantized and unquantized results are then passed to the non-linear transformation (AF) through a multiplexer based on the precision\_ctrl signal, as depicted in Fig. [6](#page-4-2) and shown in Table [1](#page-5-2). Notably, the MUX and bias\_reg present in the conventional design (Fig. [3\)](#page-3-0) have been eliminated in the proposed design. The pre-trained bias is now loaded directly into the accumulator register (sum\_reg). This optimization reduces resource utilization and critical delay in the MAC operation. The proposed design can also accommodate any bit precision, since the integer and fractional bits of the fixed ⟨N, q⟩ format can be specified before synthesis. For the purpose of design analysis and implementation, however, an 8-bit precision architecture has been employed. Although we have validated the proposed architecture on an FPGA using a Hardware Description Language (HDL), the advantages demonstrated by this design are expected to extend to Application-Specific Integrated Circuits (ASICs) as well. In Fig. [6,](#page-4-2) the MAC architecture utilized in the PNE is emphasized with a dashed box.

<span id="page-5-2"></span>

In the first clock cycle, two signed fixed-point values, each with an N-bit width, are multiplied using a multiplier (mult\_reg), and the bias value is pre-loaded into sum\_reg. This saves an extra clock cycle that is conventionally required for bias accumulation. The multiplication result is a $2 \times N$-bit value with 4 integer bits and 12 fractional bits stored in mult\_reg. At each clock cycle, the value in mult\_reg is accumulated in sum\_reg. sum\_reg is initialized with the bias value at the beginning of every layer computation, and extra bits are used to prevent overflow. Once all MAC operations are complete, the value in sum\_reg is resized to the fixed $\langle 8, 6 \rangle$ format using the inbuilt IEEE library resize function. We have used the 'resize' function provided by *Xilinx* at the output of the MAC, which comes with rounding and an inherent accuracy loss. Rounding replaces a number with another number that has approximately the same value but fewer digits. This resized value is then fed into the sigmoid ROM for the AF operation. In the fixed $\langle 8, 6 \rangle$ format, 2 integer bits (the MSB being the sign bit) and 6 fractional bits represent the signed fixed-point value (Fig. [4](#page-3-2)). The output of the AF is the output of the PNE for the current layer.
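The arithmetic of the pre-loaded-bias MAC can be sketched on raw integers as follows. This is an illustrative model under stated assumptions, not the authors' HDL: the function name is hypothetical, the bias is assumed to be stored in the same ⟨8, 6⟩ format as the inputs and shifted left by q bits so that it lines up with the 2q fractional bits of the products, and the final resize is modelled as round-half-up with saturation, which may differ from the rounding mode of the library `resize` function used in the paper.

```python
def mac_preloaded_bias(inputs, weights, bias, n=8, q=6):
    """MAC on raw fixed<n, q> integers with the bias pre-loaded in sum_reg.

    Returns (quantized, unquantized):
      unquantized keeps 2q fractional bits (the wide accumulator value),
      quantized is the accumulator resized back to fixed<n, q>.
    """
    sum_reg = bias << q                  # pre-load bias, aligned to 2q fraction bits
    for x, w in zip(inputs, weights):
        mult_reg = x * w                 # 2N-bit product with 2q fractional bits
        sum_reg += mult_reg              # accumulate; Python ints never overflow
    unquantized = sum_reg
    # Assumed resize: add half an LSB, drop q bits, then saturate to n bits.
    rounded = (sum_reg + (1 << (q - 1))) >> q
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    quantized = max(lo, min(hi, rounded))
    return quantized, unquantized


# Inputs 0.5 and -0.25, weights 0.5 and 0.5, bias 0.25, all in fixed<8, 6>:
# raw values are value * 64.
quant, unquant = mac_preloaded_bias([32, -16], [32, 32], bias=16)
```

Here the accumulator ends at 1536, i.e. 1536 / 4096 = 0.375 with 12 fractional bits, and the quantized output is 24, i.e. 24 / 64 = 0.375. A sum that exceeds the ⟨8, 6⟩ range saturates at 127 in the quantized output while the unquantized output retains the full value, which is the trade-off the precision\_ctrl selection exposes.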

#### <span id="page-6-0"></span>**Adaptable AF Using ROM/ Cordic**

The output generated by the MAC unit, selected by the precision\_ctrl pin, is fed into the novel AF designed to accommodate both higher and lower precision arithmetic. In this study, we have utilized a ROM-based implementation for processing the quantized output of the MAC (N bits), and a Cordic-based implementation for processing the unquantized output ($2N + M$ bits). The AF design offers adaptability in the selection of precision (via precision\_ctrl) and the selection of AF (via AF\_ctrl). The architecture supports both tanh and sigmoid AF computations. A detailed description of this performance-efficient adaptable AF's design architecture and computation techniques has been provided in [\[10](#page-12-9)]. We have integrated this AF into our proposed PNE, as depicted in red in Fig. [6](#page-4-2). Within the PNE, the core of the adaptable AF is represented in Fig. [8.](#page-6-1) This core consists of a ROM/ Cordic Configuration Block and a processing block that facilitates adaptability for AF type selection. Two control signals, namely precision\_ctrl and AF\_ctrl, are employed for this purpose.

The ROM/ Cordic Configuration Block, shown in Fig. [8,](#page-6-1) integrates adders/ subtractors, shifters, and memory elements [\[10](#page-12-9)]. In the Cordic-based approach, the most significant bit (MSB) of $R_{in}$, i.e. the sign bit $R_{in}[N-1]$, generates the directional signal $d_i$ [\[33\]](#page-13-8), determining whether addition or subtraction is performed to converge $R_{out}$ to 0. Here, $d_i \in \{0, 1\}$, representing the sign bit $R_{in}[N-1] \in \{0, 1\}$. In the ROM-based approach (Fig. [8\)](#page-6-1), the value at the $R_{in}$ address is accessed as ROM[$R_{in}$]. Depending on the ROM implementation and the configuration of the control pins, the AF's output for sigmoid or tanh is obtained. Thus, $R_{out}$ = ROM[$R_{in}$] for the ROM-based approach, while $R_{out}$ converges to 0 for the Cordic-based approach, as highlighted in the ROM/ Cordic Configuration Block in Fig. [8](#page-6-1). The Cordic block produces the values $\cosh(R_{in})$ and $\sinh(R_{in})$ at $P_{out}$ and $Q_{out}$, respectively. These outputs are used for the exponential calculation, as described in Eq. [1.](#page-6-2)

<span id="page-6-2"></span>
$$
e^{R_{in}} = \cosh(R_{in}) + \sinh(R_{in}) \tag{1}
$$

The adaptable AF (Fig. [8](#page-6-1)) incorporates the select signals precision\_ctrl and AF\_ctrl, which are summarized in Table [2](#page-7-1), to determine outputs using either ROM or Cordic. The input data $R_{in}$ serves as $AF_{in}$ (or $MAC_{out}$) to the AF block, which produces the output ROM[$R_{in}$] in the subsequent clock cycle or converges to 0 after the $N^{th}$ Cordic iteration. The ROM/ Cordic Configuration Block provides three outputs: $\sinh(R_{in})$, $\cosh(R_{in})$, and 0/ ROM[$R_{in}$], as depicted in Fig. [8](#page-6-1), with AF\_ctrl controlling MUX1 and MUX2 for Cordic-based sigmoid or tanh AF selection. $\sinh(R_{in})$ and $\cosh(R_{in})$ are sent to ADDER1. The output of ADDER1 is $e^{R_{in}} = \sinh(R_{in}) + \cosh(R_{in})$, which serves as input to ADDER2. The output of ADDER2 is $1 + e^{R_{in}}$. MUX1 has inputs $e^{R_{in}}$ and $\sinh(R_{in})$, and MUX2 has inputs $\cosh(R_{in})$ and $1 + e^{R_{in}}$, with AF\_ctrl as the select line. The outputs of MUX1 and MUX2 are processed in the divider to calculate Cordic[$R_{in}$] for sigmoid/ tanh evaluation. Subsequently, MUX3 is used to select ROM[$R_{in}$] or Cordic[$R_{in}$] based on the precision\_ctrl signal. The state of the control signals for generating tanh and sigmoid AFs using the ROM/ Cordic approaches is presented in Table [2](#page-7-1). Although ReLU is not implemented in this paper, Fig. [9](#page-7-2) and Table [3](#page-8-0) show how the ReLU AF can also be integrated into the proposed AF, demonstrating its adaptability. Comparing with Fig. [8,](#page-6-1) it can be observed that the additional hardware requirements are 1 control signal (AF\_ctrl2) and 2 MUXes (MUX4 and MUX5). MUX4 is used to implement the ReLU AF, and MUX5 provides the selection (via the control signal AF\_ctrl2) between the ROM[$R_{in}$] and ReLU[$R_{in}$] outputs. The output of MUX5 is one of the inputs to MUX3, the other input being Cordic[$R_{in}$].

<span id="page-6-1"></span>**Fig. 8** The design of the adaptable AF for variable precision, consisting of the ROM/ Cordic Configuration Block and additional logic elements
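The data path described above can be mirrored in a short floating-point model. This is a sketch under stated assumptions, not the authors' fixed-point hardware: the function names are hypothetical, the control encodings (precision\_ctrl = 0 for the ROM path, AF\_ctrl = 0 for sigmoid) are assumed for illustration, the ROM stores unrounded function values, and the Cordic core is the textbook rotation-mode hyperbolic variant with shift indices 4 and 13 repeated. Without range extension that variant only converges for $|R_{in}|$ up to roughly 1.1.

```python
import math


def hyperbolic_cordic(r_in, iterations=16):
    """Rotation-mode hyperbolic Cordic; returns (P_out, Q_out, R_out)."""
    # Shift indices start at 1; indices 4 and 13 are repeated, as the
    # hyperbolic variant requires for convergence.
    indices, i, repeat = [], 1, 4
    while len(indices) < iterations:
        indices.append(i)
        if i == repeat and len(indices) < iterations:
            indices.append(i)
            repeat = 3 * repeat + 1
        i += 1
    gain = math.prod(math.sqrt(1.0 - 4.0 ** -i) for i in indices)
    p, q, r = 1.0 / gain, 0.0, r_in      # pre-scaled so P -> cosh, Q -> sinh
    for i in indices:
        d = 1.0 if r >= 0 else -1.0      # direction from the sign of R
        p, q = p + d * q * 2.0 ** -i, q + d * p * 2.0 ** -i
        r -= d * math.atanh(2.0 ** -i)   # R is driven towards 0
    return p, q, r


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_rom(func, n=8, q=6):
    """ROM with 2**n entries, addressed by the n-bit two's-complement input."""
    rom = []
    for addr in range(1 << n):
        raw = addr - (1 << n) if addr >= (1 << (n - 1)) else addr
        rom.append(func(raw / (1 << q)))
    return rom


def adaptable_af(r_in, precision_ctrl, af_ctrl, rom, n=8, q=6):
    if precision_ctrl == 0:              # assumed: quantized input, ROM path
        lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
        raw = max(lo, min(hi, round(r_in * (1 << q))))
        return rom[raw & ((1 << n) - 1)]  # ROM[R_in]
    p_out, q_out, _ = hyperbolic_cordic(r_in)  # cosh(R_in), sinh(R_in)
    exp_r = p_out + q_out                # ADDER1, Eq. 1
    if af_ctrl == 0:                     # assumed: sigmoid
        num, den = exp_r, 1.0 + exp_r    # MUX1 / MUX2, ADDER2 forms 1 + e^R
    else:                                # assumed: tanh
        num, den = q_out, p_out
    return num / den                     # divider output, Cordic[R_in]


sigmoid_rom = build_rom(sigmoid)
y_rom = adaptable_af(0.5, precision_ctrl=0, af_ctrl=0, rom=sigmoid_rom)
y_sig = adaptable_af(0.5, precision_ctrl=1, af_ctrl=0, rom=sigmoid_rom)
y_tanh = adaptable_af(-0.8, precision_ctrl=1, af_ctrl=1, rom=sigmoid_rom)
```

In this model a single ROM holds one function, so switching the ROM path between sigmoid and tanh would mean building the ROM with a different `func`. The Cordic path, by contrast, derives both functions from the same $\cosh$/$\sinh$ pair, which mirrors the resource sharing described in the text.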

## <span id="page-7-0"></span>**Inference Accuracy and Hardware Performance: Evaluation and Analysis**

We have evaluated the inference accuracy of the proposed PNE for image classifcation tasks. Additionally, an analysis of resource utilization and delay of the PNE has been carried out with Cordic-based and ROM-based AF, targeting both quantized and unquantized results with precision requirements. The results demonstrate the efectiveness of our PNE, which has signed fxed-point pre-loaded bias MAC unit and an adaptable AF, enabling precision-aware DNN computations.



<span id="page-7-2"></span>**Fig. 9** ReLU AF integrated with our proposed adaptable AF design. Additional hardware (MUX4, MUX5 and AF\_ctrl2) is required to enable the ReLU AF

SN Computer Science A SPRINGER NATURE journal

<span id="page-7-1"></span>**Table 2** AF selection using AF\_ctrl and precision\_ctrl signals for ROM/Cordic AF as depicted in Fig. [8](#page-6-1)

| AF\_ctrl | precision\_ctrl | $R_{out}$ | $AF_{out}$ |
|----------|-----------------|-----------|------------|
| 0 | 0 | Cordic[$R_{in}$] | sigmoid($R_{in}$) = $\frac{e^{R_{in}}}{1+e^{R_{in}}}$ |
| 1 | 0 | Cordic[$R_{in}$] | tanh($R_{in}$) = $\frac{\sinh(R_{in})}{\cosh(R_{in})}$ |
| X | 1 | ROM[$R_{in}$] | sigmoid($R_{in}$) OR tanh($R_{in}$) |

## **Experimental Validation of PNE: Quantized and Unquantized Models**

The experimental results presented in Table [4](#page-8-1) provide a detailed analysis of the inference accuracy of the PNE for both quantized and unquantized models. The comparison covers 4-bit, 8-bit, and 16-bit precision across different datasets (MNIST, CIFAR-10, CIFAR-100) [\[34\]](#page-13-9) and DNN architectures (LeNet [\[35\]](#page-13-10), VGG-16 [\[36\]](#page-13-11)). For the target applications of the proposed PNE, i.e., feature extraction and output classification, a bit precision of up to 16 bits is optimal [[37,](#page-13-12) [38](#page-13-13)]. Higher bit resolutions would lead to higher power consumption, higher critical delay and higher resource utilization, which are not desirable. The proposed PNE (P) is benchmarked against the Tensor-based MAC and AF model in the neuron engine (T) [\[39\]](#page-13-14). For the unquantized output, i.e., the PNE with Cordic-based AF, the proposed PNE loses less than 2% accuracy relative to the TensorFlow model, with differences ranging from 0.6% to 1.6%. These results underscore the robustness of the proposed PNE in preserving accuracy under higher-precision settings.

Furthermore, for the PNE with ROM-based AF (quantized), the proposed PNE consistently shows a small accuracy loss of at most 1.6% compared to the Tensor-based model across different precisions and datasets. Across the 4-bit, 8-bit, and 16-bit settings, the accuracy difference ranges from 0.7% to 1.6%, which shows that the PNE remains effective in quantized scenarios with lower bit precision. Overall, the proposed PNE maintains accuracy across the evaluated bit precisions, DNN architectures and datasets. It is crucial to emphasize that, in evaluating accuracy, we employed a ROM-based approach for both 8-bit and 16-bit precision. Additionally, for the Cordic-based algorithm, we conducted accuracy assessments using both 8-bit and 16-bit precision computations. However, for the hardware implementation, where resource constraints are a consideration, we adopted the ROM-based architecture for the 8-bit computation and the Cordic-based architecture for the 16-bit computation.

<span id="page-8-0"></span>**Table 3** ReLU integration with the proposed AF. One additional control signal (AF\_ctrl2) and two MUXes are required, as depicted in Fig. [9](#page-7-2)

<span id="page-8-1"></span>**Table 4** Comparative analysis of accuracy: PNE vs. Tensor-based neuron model having MAC and AF (T) [\[39\]](#page-13-14) for LeNet and VGG-16 DNN models

## **Hardware Implementation and Result Comparison of the Proposed PNE**

This section describes the experimental setup and methodology used to evaluate the proposed PNE. The assessment involves two distinct blocks, namely the Multiply-Accumulate (MAC) unit and the Activation Function (AF), implemented on the Zybo-Xilinx evaluation kit. Hardware performance parameters are extracted at an operating frequency of 50 MHz and an operating temperature of 25 °C. The modular RTL design architecture is tailored for ASIC implementation. Simulation and performance-parameter extraction consider 1 sign bit and 8 magnitude bits (2 integer and 6 fractional). Comparisons with the conventional architecture cover LUT and register utilization, critical delay, and total on-chip power. All architectures are implemented on the same FPGA platform for a fair comparison, as detailed in [\[40\]](#page-13-15). The results obtained for the proposed design on the Zybo-Xilinx FPGA SoC kit are reported below.
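To make the number format concrete, the sketch below converts a real value to a signed fixed-point code with 2 integer and 6 fractional bits and back. It is illustrative only: two's-complement coding with saturation is an assumption, and the helper names `to_sfixed` and `from_sfixed` are hypothetical rather than taken from the implementation.

```python
def to_sfixed(value, int_bits=2, frac_bits=6):
    """Quantize a real value to a signed fixed-point integer code."""
    scale = 1 << frac_bits                        # weight of 1.0 in code units
    max_code = (1 << (int_bits + frac_bits)) - 1  # largest positive code
    min_code = -(1 << (int_bits + frac_bits))     # most negative code (assumed two's complement)
    code = int(round(value * scale))
    return max(min_code, min(max_code, code))     # saturate instead of wrapping (assumption)


def from_sfixed(code, frac_bits=6):
    """Convert a fixed-point integer code back to a real value."""
    return code / (1 << frac_bits)
```

With 6 fractional bits the resolution is $2^{-6} = 0.015625$, and values outside roughly $[-4, 4)$ saturate under these assumptions.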

Various state-of-the-art processing engines that support a fixed precision were investigated; they apply rounding or truncation between the MAC unit and the AF, which results in accuracy loss. In contrast, our proposed design gives developers the flexibility to choose whether the MAC result is quantized. Here, we compare the proposed architecture with a conventional design. The proposed design supports both quantized and unquantized data. If the same feature were incorporated into the conventional architecture, it would require 328 LUTs (102 + 226), whereas the proposed design utilizes only 248 LUTs. The specific advantages of ROM-based AF at lower precision and of Cordic-based AF are discussed in section "[Resource Utilization of the ROM/Cordic-Based Adaptable AF](#page-11-1)". Our focus here is on architectural optimization and implementation. However, the approach can be extended to different types of MAC units and AFs, along with a state-of-the-art comparison. If higher accuracy is desired, the MAC output is processed directly through the AF. Conversely, if accuracy is not a significant concern, the MAC result is quantized using the precision\_ctrl pin. It is crucial to note that quantized data is routed to the ROM-based AF, while unquantized data passes through the Cordic-based AF. The detailed architectural design and computational arithmetic for the MAC unit and AF are elaborated in sections "[Precision-Aware MAC Unit with Pre-Loaded Bias](#page-5-1)" and "[Adaptable AF Using ROM/Cordic](#page-6-0)", respectively. The hardware implementation results reported for the Zybo FPGA board are presented below.

A comprehensive comparison of the proposed PNE with its conventional counterparts on a Zybo board at 50 MHz is presented in Table [5](#page-9-0). The table outlines the standalone architectures supporting unquantized and quantized MAC operations in the conventional design. Additionally, it presents the proposed precision-aware design, which offers the flexibility for both operations. Notably, the proposed design can be configured through an external signal, enabling the PNE and supporting the adaptable AF. The precision-aware configuration, specifically the proposed precision-aware pre-loaded bias MAC with adaptable Cordic AF, incurs only a minor resource overhead: the LUT count grows from 226 to 248, an overhead of about 9.7%. In the conventional MAC design with only Cordic-based AF, the critical delay is 34.37% higher than that of the proposed PNE. This is due to the additional hardware resources, namely the bias register and MUX selection, used for bias loading. These elements introduce an extra clock delay and consequently reduce the throughput of the system. In contrast, the proposed design achieves higher throughput, with the precision-aware pre-loaded bias MAC and adaptable Cordic AF delivering 34.61% more GOp/s than the conventional MAC with only Cordic AF. This performance metric underscores the advantage of the precision-aware design, which both outperforms the conventional MAC with only Cordic AF and adds quantization-enabled computation. These findings highlight the efficiency gains achievable through the proposed pre-loaded bias MAC architecture and the adaptable AF integrated into the PNE.
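As a quick check of the LUT figures quoted in this section, the overhead of adding precision awareness (226 to 248 LUTs) and the saving relative to a conventional design that offers the same feature (328 LUTs) work out as follows. The second percentage is derived here from the stated LUT counts and is not a figure reported in the tables.

$$
\frac{248 - 226}{226} \approx 9.7\%, \qquad \frac{328 - 248}{328} \approx 24.4\%
$$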

## **Optimized MAC Unit with Quantization‑Enabled Output Selection and Pre‑Loaded Bias**

In this section, we conduct a comparative analysis of the MAC unit, which generates two distinct outputs. The first is the quantization-enabled output, in which the MAC unit's output is quantized from $2N + M$ bits to $N$ bits through LSB truncation. This enables the use of an N-bit Activation Function (AF) for subsequent computations. The second is the unquantized output, with a precision of $2N + M$ bits, where $M$ represents the extra overhead bits addressed earlier. A Multiplexer (MUX) selects between these outputs according to the level of accuracy required for the computation. We conduct a comprehensive comparison of state-of-the-art Multiply-Accumulate (MAC) architectures, all tailored for accurate computations at 8-bit precision. We present a detailed analysis of resource utilization, power consumption, and delay for the proposed design and other existing architectures on the Zybo SoC Xilinx FPGA board, as summarized in Table [6.](#page-10-0) Our proposed design demonstrates resource utilization comparable to the IEEE standard, even with the integration of the output selection. Notably, our optimized architecture eliminates the bias register and the MUX used for bias selection, so bias handling incurs no additional hardware overhead. Pre-loading the bias value into the accumulator register significantly improves the critical computation delay and removes one clock delay. This also leads to a reduction in power consumption. Although the extra Multiplexer (MUX) at the MAC output incurs a minimal hardware overhead, this is effectively compensated by the aforementioned savings. Table [6](#page-10-0) provides a comparison with architectures such as Booth multiplication, Wallace tree, Vedic, shift-and-add, Cordic, and IEEE. While each design has its own merits and drawbacks, the choice depends on the specific application requirements, whether prioritizing hardware efficiency, accuracy, or performance. Our proposed design supports both quantized and unquantized outputs, aligning with diverse application needs.
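The behaviour of such a MAC unit can be illustrated with a small integer model. This is a sketch under stated assumptions rather than the implemented RTL: the default of `extra_bits=4` for $M$, the saturation of the accumulator, and the function name `mac_preloaded_bias` are all hypothetical choices made for illustration.

```python
def mac_preloaded_bias(inputs, weights, bias, n_bits=8, extra_bits=4, quantize=False):
    """Integer model of a MAC whose accumulator starts at the bias value."""
    acc_bits = 2 * n_bits + extra_bits           # accumulator width: 2N + M
    acc_min = -(1 << (acc_bits - 1))
    acc_max = (1 << (acc_bits - 1)) - 1

    acc = bias                                   # pre-load: no separate bias adder or register
    for x, w in zip(inputs, weights):
        acc += x * w                             # multiply-and-accumulate
        acc = max(acc_min, min(acc_max, acc))    # saturation (assumed behaviour)

    if quantize:
        # Quantized branch of the output MUX: drop the N + M least significant bits.
        return acc >> (n_bits + extra_bits)
    return acc                                   # unquantized (2N + M)-bit result
```

For example, `mac_preloaded_bias([100, 120], [90, 110], 5000)` accumulates 5000 + 9000 + 13200 = 27200, and the same call with `quantize=True` keeps only the upper bits of that value. Because the bias is already in the accumulator when the first product arrives, no extra cycle is spent adding it.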

The proposed MAC unit has been compared with established state-of-the-art designs, including Vedic, Wallace, Booth, Shift-Add, Cordic, and IEEE architectures, as detailed in Table [6](#page-10-0). The reported parameters include slice LUTs and slice registers. The analysis reveals that the proposed MAC design utilizes hardware resources comparable to those of state-of-the-art architectures. While slightly exceeding the resource utilization of shift-and-add and Booth algorithm-based MAC units, our design demonstrates superior critical delay. This improvement results from the pre-loaded bias, which reduces the MUX delay in the critical path and eliminates the additional clock delay traditionally required to add the bias value to the accumulated products of inputs and weights, as discussed in section "[Precision-Aware MAC Unit with Pre-Loaded Bias](#page-5-1)" and shown in Figs. [6,](#page-4-2) [7](#page-5-0). To highlight the advantages of the proposed design, a comparison has been made with Vedic and IEEE DSP package MAC architectures. Specifically, the Vedic-based architecture [\[18\]](#page-12-17) uses 159 slice LUTs, whereas our proposed design uses only 92 slice LUTs, a 42.13% reduction compared to the Vedic architecture. Similarly, our design consumes 29.23% fewer slice LUTs than the IEEE architecture. Compared with Wallace's architecture, our proposed design shows a 12.38% reduction. Additionally, the proposed architecture utilizes 61 slice registers, a 45.53% reduction compared to Wallace's architecture, which uses 112 slice registers. In summary, the analysis suggests that the proposed design is suitable for applications whose accuracy demands are application-specific and that require reduced resource usage and critical delay.

<span id="page-9-0"></span>**Table 5** Comparison of proposed PNE (Quantized/Unquantized MAC + adaptable AF) for different bit-widths with conventional processing engine (conventional MAC + AF) evaluated on Zybo board at 50 MHz operating frequency

<span id="page-10-0"></span>**Table 6** Performance evaluation: comparison of the precision-aware MAC with state-of-the-art models

The critical delay represents the maximum delay within a circuit, attributed to the longest combinational path. As a crucial performance metric, it dictates the circuit's maximum operating frequency. The critical delay (in nanoseconds) is presented for both the proposed and state-of-the-art architectures with 8-bit precision in Table [6](#page-10-0). Notably, the Cordic architecture exhibits the highest critical delay, 9.06 ns, due to its iterative computation (n iterations, i.e., n times the critical delay of each iteration), despite its superior resource utilization. In comparison, the critical delay of the IEEE standard DSP architecture [[41\]](#page-13-16) is lower, at 3.98 ns. Our proposed design achieves a critical delay of 2.67 ns (calculated using Vivado), a 32.91% improvement over the IEEE standard and the lowest among the compared architectures. This is attributed to pre-loading the bias and eliminating the multiplexing used for bias selection in conventional MAC designs.
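The link between critical delay and operating frequency can be made explicit with a one-line conversion. The sketch below gives an idealized upper bound only: it takes the reciprocal of the critical delay and ignores setup time, clock skew and routing effects, so the frequencies it yields are not measured results. The helper name `max_frequency_mhz` is hypothetical.

```python
def max_frequency_mhz(critical_delay_ns):
    """Idealized upper bound on clock frequency for a given critical delay."""
    # 1 / (delay in ns) is a frequency in GHz; multiply by 1000 for MHz.
    return 1e3 / critical_delay_ns


# Critical delays quoted in the text for 8-bit precision.
for name, delay_ns in [("Proposed", 2.67), ("IEEE", 3.98), ("Cordic", 9.06)]:
    print(f"{name}: {max_frequency_mhz(delay_ns):.2f} MHz")
```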

Additionally, the Power-Delay Product (PDP) is computed for 8-bit precision across all methods and detailed in Table [6.](#page-10-0) The Booth-multiplier-based approach comes closest to our proposed design, with a PDP of 2.77 pJ compared to 2.08 pJ for the proposed design. Our proposed design is thus 24.91% more efficient than the Booth technique [\[19\]](#page-12-18) in terms of PDP. The comparison in Table [6](#page-10-0) shows that our proposed design outperforms all of the compared techniques in terms of PDP efficiency.
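The relative improvements quoted for critical delay and PDP follow from a simple percentage-reduction formula. The helper below is a small illustrative sketch (the name `percent_reduction` is hypothetical) applied to the values stated in the text.

```python
def percent_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


delay_gain = percent_reduction(3.98, 2.67)   # IEEE vs. proposed critical delay (ns)
pdp_gain = percent_reduction(2.77, 2.08)     # Booth vs. proposed PDP (pJ)
print(f"critical delay: {delay_gain:.2f}%  PDP: {pdp_gain:.2f}%")
```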

#### <span id="page-10-2"></span>**Time Slack**

It is important to note that the PNE simulations have been carried out at 50 MHz, while the slack of the standalone MAC unit has been evaluated at 100 MHz. This difference in frequencies helps us assess the effectiveness of the design under optimum operating conditions. The slack calculation for various bit precisions is reported in Table [7](#page-10-1). The table includes three types of slack values, Worst Negative Slack (WNS), Worst Hold Slack (WHS), and Worst Pulse Width Slack (WPWS), for three different bit precisions: 8-bit, 12-bit, and 16-bit. The governing relations are given in Eq. [2](#page-10-2), where $T_r$ and $T_a$ denote the earliest start time and the actual start time of a task, respectively. $T_{setup}$ and $T_{slack}$ refer to the time available before the earliest start time and the time available after the actual start time, respectively. The slack information indicates whether the design functions at the desired frequency, as explained in [\[42](#page-13-17)].

<span id="page-10-1"></span>**Table 7** Slack calculation at 100 MHz for different bit precisions for optimum operating frequency

| Slack Type | 8-Bit  | 12-Bit | 16-Bit |
|------------|--------|--------|--------|
| WNS (ns)   | 2.356  | 4.898  | 5.136  |
| WHS (ns)   | 0.068  | 0.104  | 0.236  |
| WPWS (ns)  | 49.500 | 49.500 | 49.500 |

$$
T_{setup} = T_r - T_a \tag{2a}
$$

$$
T_{slack} = T_a - T_r \tag{2b}
$$

$$
T_a = T_{total} + T_{rc} + T_{cq} \tag{2c}
$$

$$
T_a \ge T_{hold} \tag{2d}
$$

$$
T_{slack,setup} = T_{cycle} - T_a - T_{setup} \tag{2e}
$$

$$
T_{slack,hold} = T_a - T_{hold} \tag{2f}
$$

As the bit precision increases from 8 to 16 bits, the reported WNS rises from 2.356 ns to 5.136 ns and the WHS from 0.068 ns to 0.236 ns. All values are positive, so, by Eqs. 2e and 2f, the setup and hold constraints are met at 100 MHz for every precision considered. Higher bit precision still implies more complex circuits and longer propagation delays, so the timing margin should be re-checked whenever the precision or the target frequency is raised, since a negative slack would indicate a timing violation and reduced performance. The WPWS values remain constant at 49.500 ns across all three bit precisions, which indicates that the minimum pulse-width requirement does not change with bit precision.
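Eqs. 2e and 2f translate directly into a small timing check. The sketch below uses hypothetical timing values (a 10 ns cycle for a 100 MHz clock, a 7.0 ns arrival time, a 0.5 ns setup requirement and a 0.2 ns hold requirement); they are chosen only to exercise the formulas and do not come from the reported measurements.

```python
def setup_slack(t_cycle, t_arrival, t_setup):
    """Setup slack, following Eq. 2e: T_cycle - T_a - T_setup."""
    return t_cycle - t_arrival - t_setup


def hold_slack(t_arrival, t_hold):
    """Hold slack, following Eq. 2f: T_a - T_hold."""
    return t_arrival - t_hold


def meets_timing(t_cycle, t_arrival, t_setup, t_hold):
    """A path meets timing when both slacks are non-negative."""
    return (setup_slack(t_cycle, t_arrival, t_setup) >= 0
            and hold_slack(t_arrival, t_hold) >= 0)


# Hypothetical values in nanoseconds, for illustration only.
print(setup_slack(10.0, 7.0, 0.5), hold_slack(7.0, 0.2), meets_timing(10.0, 7.0, 0.5, 0.2))
```

A positive result from either function means the corresponding constraint is satisfied with that much margin; a negative result flags a violation at the chosen clock period.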

## <span id="page-11-1"></span>**Resource Utilization of the ROM/ Cordic‑Based Adaptable AF**

In the hardware-based evaluation, resource utilization is assessed by implementing the adaptable AF in Verilog HDL, and the corresponding parameters are extracted using the *Vivado-Xilinx* tool. The proposed design is implemented on the Zybo Evaluation Kit, with a specific focus on the sigmoid AF, which effectively utilizes all the hardware resources within the configurable architecture. The AF provides the nonlinear behavior of artificial neurons, and its hardware implementation is particularly costly in conventional approaches. Increasing arithmetic precision raises the design complexity, because higher precision requires exponential growth in computational complexity or memory elements. For instance, an n-bit precision requirement in ROM demands $2^n$ memory elements. To illustrate, 8-bit precision requires 256 elements, while 16-bit precision would need 65,536 elements, which is impractical when an AF is dedicated to each neuron. Contemporary architectures aim for efficient design and implementation. However, each design favors either lower or higher bit precision, along with adaptability in AF type selection. To address these challenges, our proposed solution uses adaptable logic for selecting both the AF type and the precision setting of the computation.
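The exponential growth of a ROM-based AF can be seen by generating the lookup table in software. The sketch below builds a sigmoid table with one entry per input bit pattern; the two's-complement address interpretation, the choice of fractional bits and the function name `build_sigmoid_rom` are assumptions made for illustration and may differ from the ROM contents used in the implementation.

```python
import math


def build_sigmoid_rom(n_bits, frac_bits):
    """Return a list of 2**n_bits sigmoid output codes indexed by input bit pattern."""
    size = 1 << n_bits                 # one entry per possible n-bit input
    scale = 1 << frac_bits
    rom = []
    for address in range(size):
        # Interpret the address as a two's-complement n-bit input code (assumption).
        code = address - size if address >= size // 2 else address
        x = code / scale
        y = 1.0 / (1.0 + math.exp(-x))
        rom.append(int(round(y * scale)))   # store the output in the same fixed-point scale
    return rom


print(len(build_sigmoid_rom(8, 6)), len(build_sigmoid_rom(16, 12)))
```

The table length is $2^n$ regardless of how the entries are encoded, which is why the ROM approach is attractive at 8 bits and impractical at 16 bits when every neuron has its own AF.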

Table [8](#page-11-2) compares resource utilization for ROM, Cordic, and BRAM-based techniques at various bit precisions. The ROM-based design uses 6 LUTs for 4-bit precision, whereas the Cordic-based design uses 45 LUTs and 37 flip-flops (FFs), resulting in an 86.66% LUT saving for the ROM-based method. For 8-bit precision, the ROM-based solution uses 16 LUTs, compared to Cordic's 84 LUTs and 72 FFs, resulting in an 80.95% LUT saving. The ROM-based design depends only on LUTs, with no FFs required. However, as precision increases to 16-bit, the ROM-based design requires a substantial 2111 LUTs, as opposed to Cordic's 140 LUTs and 126 FFs. Thus, implementing a 32-bit ROM-based design on smaller FPGAs would be inefficient due to the exponential increase in resource requirements. We also report results for a BRAM-based approach, which shows a significant rise in BRAM utilization as precision increases. Specifically, for 4, 8, and 16-bit precisions, the BRAM requirements are 0.5, 0.5, and 17 BRAMs, respectively. Overall, the Cordic-based technique demonstrates better LUT utilization for higher-precision computations, while the ROM-based implementation of AFs performs better at lower precision. These findings provide valuable insights for selecting appropriate AF implementations based on precision requirements and resource constraints in the PNE.
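The trade-off in Table [8](#page-11-2) can be expressed as a simple selection rule. The sketch below picks the AF implementation with the lower LUT count at each reported precision; it deliberately ignores FFs, BRAMs, delay and power, so it is a simplified illustration rather than the selection policy of the PNE, and the names `AF_LUTS` and `cheaper_af` are hypothetical.

```python
# LUT counts per precision, taken from Table 8.
AF_LUTS = {
    4: {"ROM": 6, "Cordic": 45},
    8: {"ROM": 16, "Cordic": 84},
    16: {"ROM": 2111, "Cordic": 140},
}


def cheaper_af(precision_bits):
    """Return the AF implementation with the fewest LUTs at the given precision."""
    costs = AF_LUTS[precision_bits]
    return min(costs, key=costs.get)


for bits in (4, 8, 16):
    print(bits, cheaper_af(bits))
```

This reproduces the choice made for the hardware implementation, with the ROM-based AF at 8-bit precision and the Cordic-based AF at 16-bit precision.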

## <span id="page-11-0"></span>**Conclusions and Future Research**

In summary, our research introduces a Precision-aware Neuron Engine (PNE) for efficient deep neural network (DNN) computations. The PNE features a signed fixed-point pre-loaded bias Multiply-Accumulate (MAC) unit and an adaptable Activation Function (AF) supporting both ROM and Cordic implementations. Our experimental results, which evaluate inference accuracy across quantized and unquantized models, various bit precisions, and datasets, highlight the PNE's effectiveness. Under unquantized scenarios, the Cordic-based AF exhibits robust accuracy with less than a 2% loss compared to TensorFlow models. Even in quantized scenarios (4-bit, 8-bit, 16-bit), the PNE performs strongly, with accuracy differences ranging from 0.7% to 1.6%, showcasing its versatility. Hardware implementation on the Zybo-Xilinx FPGA platform demonstrates resource utilization comparable to state-of-the-art models together with superior critical delay and throughput. The precision-aware MAC unit allows developers to choose between quantized and unquantized operations, offering flexibility in balancing accuracy and speed. The comprehensive analysis of resource utilization, power consumption, and critical delay underscores the efficiency gains achievable through our proposed architecture. The adaptable AF, implemented using ROM and Cordic, caters to diverse precision requirements. Currently, the AF supports sigmoid and tanh implementations. However, it can be adapted to implement ReLU and Gaussian AFs with additional hardware, as part of future research. In conclusion, our Precision-aware Neuron Engine provides a holistic solution for efficient DNN computations, contributing valuable insights to hardware-efficient neural network accelerators and advancing precision-aware computing architectures.

<span id="page-11-2"></span>**Table 8** Resource utilization of the adaptable AF for different bit-widths evaluated on Zybo board

| Precision | ROM LUTs | ROM FFs | Cordic LUTs | Cordic FFs | BRAMs |
|-----------|----------|---------|-------------|------------|-------|
| 4-bit     | 6        | 0       | 45          | 37         | 0.5   |
| 8-bit     | 16       | 0       | 84          | 72         | 0.5   |
| 16-bit    | 2111     | 0       | 140         | 126        | 17    |

**Acknowledgements** This article is an extended version of our previous conference paper presented at [[10](#page-12-9)].

**Data Availability** Data sharing is not applicable to this article as no data sets were generated or analyzed during the current study, and detailed circuit simulation results are given in the manuscript.

#### **Declarations**

**Conflict of interest** The authors declare that they have no conflict of interest and that no human or animal testing or participation was involved in this research. All data were obtained from public domain sources.

## **References**

- <span id="page-12-0"></span>1. Sim H, Lee J. Cost-Effective Stochastic MAC circuits for Deep Neural Networks. Neural Netw. 2019;117:152–62.
- <span id="page-12-1"></span>2. Khalil K, Eldash O, Kumar A, Bayoumi M. An efficient approach for neural network architecture. In: 2018 25th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2018;745–748. IEEE
- <span id="page-12-2"></span>3. Shawl MS, Singh A, Gaur N, Bathla S, Mehra A. Implementation of Area and Power Efficient Components of a MAC unit for DSP Processors. In: 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), 2018;1155–1159. IEEE.
- <span id="page-12-3"></span>4. Machupalli R, Hossain M, Mandal M. Review of ASIC Accelerators for Deep Neural Network. Microprocess Microsyst. 2022;89:104441.
- <span id="page-12-4"></span>5. Merenda M, Porcaro C, Iero D. Edge machine learning for AI-enabled IoT devices: A review. Sensors. 2020;20(9):2533.
- <span id="page-12-5"></span>6. Shantharama P, Thyagaturu AS, Reisslein M. Hardware-accelerated platforms and infrastructures for network functions: A survey of enabling technologies and research studies. IEEE Access. 2020;8:132021–85.

- <span id="page-12-6"></span>7. Hashemi S, Anthony N, Tann H, Bahar RI, Reda S. Understanding the impact of precision quantization on the accuracy and energy of neural networks. In: Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017;1474–1479. IEEE.
- <span id="page-12-7"></span>8. Raut G, Rai S, Vishvakarma SK, Kumar A. RECON: Resource-Efficient CORDIC-based Neuron Architecture. IEEE Open Journal of Circuits and Systems. 2021;2:170–81.
- <span id="page-12-8"></span>9. Garland J, Gregg D. Low Complexity Multiply-Accumulate Units for Convolutional Neural Networks with Weight-Sharing. ACM Transactions on Architecture and Code Optimization (TACO). 2018;15(3):1–24.
- <span id="page-12-9"></span>10. Vishwakarma S, Raut G, Dhakad NS, Vishvakarma SK, Ghai D. A Configurable Activation Function for Variable Bit-Precision DNN Hardware Accelerators. In: IFIP International Internet of Things Conference, 2023;433–441. Springer.
- <span id="page-12-10"></span>11. Posewsky T, Ziener D. Efficient deep neural network acceleration through fpga-based batch processing. In: 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2016;1–8. IEEE.
- <span id="page-12-11"></span>12. Schmidhuber J. Deep Learning in Neural Networks: An overview. Neural Netw. 2015;61:85–117.
- <span id="page-12-12"></span>13. Jelčicová Z, Mardari A, Andersson O, Kasapaki E, Sparsø J. A neural network engine for resource constrained embedded systems. In: 2020 54th Asilomar Conference on Signals, Systems, and Computers, 2020;125–131. IEEE
- <span id="page-12-13"></span>14. Qiu J, Wang J, Yao S, Guo K, Li B, Zhou E, Yu J, Tang T, Xu N, Song S, et al. Going deeper with embedded fpga platform for convolutional neural network. In: Proceedings of the 2016 ACM/ SIGDA International Symposium on Field-programmable Gate Arrays, 2016;26–35.
- <span id="page-12-14"></span>15. Zhang Y, Suda N, Lai L, Chandra V. Hello edge: Keyword spotting on microcontrollers. arXiv preprint [arXiv:1711.07128](http://arxiv.org/abs/1711.07128) 2017.
- <span id="page-12-15"></span>16. Cheng Y, Wang D, Zhou P, Zhang T. Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges. IEEE Signal Process Mag. 2018;35(1):126–36.
- <span id="page-12-16"></span>17. Masadeh M, Hasan O, Tahar S. Input-Conscious Approximate Multiply-Accumulate (MAC) Unit for Energy-Efficiency. IEEE Access. 2019;7:147129–42.
- <span id="page-12-17"></span>18. Krishna AV, Deepthi S, Nirmala Devi M. Design of 32-Bit MAC unit using Vedic Multiplier and XOR Logic. In: Proceedings of International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications, 2021;715–723. Springer.
- <span id="page-12-18"></span>19. Farrukh FUD, Zhang C, Jiang Y, Zhang Z, Wang Z, Wang Z, Jiang H. Power Efficient Tiny Yolo CNN using Reduced Hardware Resources based on Booth Multiplier and Wallace Tree Adders. IEEE Open Journal of Circuits and Systems. 2020;1:76–87.
- <span id="page-12-19"></span>20. Johansson K. Low power and Low Complexity Shift-and-Add based Computations. PhD thesis, Linköping University Electronic Press 2008.
- <span id="page-12-20"></span>21. Gudovskiy DA, Rigazio L. Shiftcnn: Generalized Low-Precision Architecture for inference of Convolutional Neural Networks. arXiv preprint [arXiv:1706.02393](http://arxiv.org/abs/1706.02393) 2017.
- <span id="page-12-21"></span>22. Janveja M, Niranjan V. High performance Wallace tree multiplier using improved adder. ICTACT j microelectron. 2017;3(01):370–4.
- <span id="page-12-22"></span>23. Yuvaraj M, Kailath BJ, Bhaskhar N. Design of optimized MAC unit using integrated vedic multiplier. In: 2017 International Conference on Microelectronic Devices, Circuits and Systems (ICM-DCS), 2017;1–6. IEEE.
- <span id="page-12-23"></span>24. Sze V, Chen Y-H, Yang T-J, Emer JS. Efficient processing of deep neural networks: A tutorial and survey. Proc IEEE. 2017;105(12):2295–329.
- <span id="page-13-0"></span>25. Sharma VP, Vishwakarma SK. Analysis and Implementation of MAC Unit for different Precisions. signal  $(\mu W)$  70(120):240
- <span id="page-13-1"></span>26. Raut G, Biasizzo A, Dhakad N, Gupta N, Papa G, Vishvakarma SK. Data Multiplexed and Hardware Reused Architecture for Deep Neural Network Accelerator. Neurocomputing. 2022;486:147–59.
- <span id="page-13-2"></span>27. Wuraola A, Patel N, Nguang SK. Efficient activation functions for embedded inference engines. Neurocomputing. 2021;442:73–88.
- <span id="page-13-3"></span>28. Aggarwal S, Meher PK, Khare K. Concept, design, and implementation of reconfigurable CORDIC. IEEE Trans Very Large Scale Integr VLSI Syst. 2015;24(4):1588–92.
- <span id="page-13-4"></span>29. Lee J, et al. Unpu: An energy-efficient deep neural network accelerator with fully variable weight bit precision. IEEE J Solid-State Circuits. 2018;54(1):173–85.
- <span id="page-13-5"></span>30. Lin C-H, Wu A-Y. Mixed-scaling-rotation CORDIC (MSR-CORDIC) algorithm and architecture for high-performance vector rotational DSP applications. IEEE Trans Circuits Syst I Regul Pap. 2005;52(11):2385-96.
- <span id="page-13-6"></span>31. Mohamed SM, et al. FPGA implementation of reconfigurable CORDIC algorithm and a memristive chaotic system with transcendental nonlinearities. IEEE Trans Circuits Syst I Regul Pap. 2022;69(7):2885–92.
- <span id="page-13-7"></span>32. Prashanth H, Rao M. SOMALib: Library of Exact and Approximate Activation Functions for Hardware-efficient Neural Network Accelerators. In: 2022 IEEE 40th International Conference on Computer Design (ICCD), 2022;746–753. IEEE.
- <span id="page-13-8"></span>33. Mehra S, Raut G, Das R, Vishvakarma SK, Biasizzo A. An Empirical Evaluation of Enhanced Performance Softmax Function in Deep Learning. IEEE Access 2023.
- <span id="page-13-9"></span>34. Alex K. Learning multiple layers of features from tiny images. <https://www.cs.toronto.edu/kriz/learning-features-2009-TR.pdf> 2009.
- <span id="page-13-10"></span>35. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.
- <span id="page-13-11"></span>36. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint [arXiv:1409.1556](http://arxiv.org/abs/1409.1556) 2014.
- <span id="page-13-12"></span>37. Park J-S, Park C, Kwon S, Kim H-S, Jeon T, Kang Y, Lee H, Lee D, Kim J, Lee Y, Park S, Jang J-W, Ha S, Kim M, Bang J, Lim SH, Kang I. A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC. In: 2022 IEEE International Solid-State Circuits Conference (ISSCC), 2022;65:246–248.
- <span id="page-13-13"></span>38. Chang J-K, Lee H, Choi C-S. A Power-Aware Variable-Precision Multiply-Accumulate Unit. In: 2009 9th International Symposium on Communications and Information Technology, 2009;1336–1339.
- <span id="page-13-14"></span>39. Abadi M, et al. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org 2015.
- <span id="page-13-15"></span>40. Raut G, Mukala J, Sharma V, Vishvakarma SK. Designing a Performance-Centric MAC Unit with Pipelined Architecture for DNN Accelerators. Circuits, Systems, and Signal Processing, 2023;1–27.
- <span id="page-13-16"></span>41. Multiplier v12.0 LogiCORE IP Product Guide. [https://www.xilinx.com/support/documentation/ipdocumentation/multgen/v120/pg108-mult-gen.pdf](https://www.xilinx.com/support/documentation/ipdocumentation/multgen/v120/pg108-mult-gen.pdf)
- <span id="page-13-17"></span>42. Venkataramani G, Goldstein SC. Slack Analysis in the System Design Loop. In: Proceedings of the 6th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, 2008;231–236.

**Publisher's Note** Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.