A Reconfigurable Multiplier/Dot-Product Unit for Precision-Scalable Deep Learning Applications

Urbinati, Luca; Casu, Mario R.

doi:10.1007/978-3-031-26066-7_2

Part of the book series: Lecture Notes in Electrical Engineering ((LNEE,volume 1005))

Included in the following conference series:

Annual Meeting of the Italian Electronics Society

Abstract

Across different Deep Learning (DL) applications or within the same application but in different phases, bitwidth precision of activations and weights may vary. Moreover, energy and latency of MAC units have to be minimized, especially at the edge. Hence, various precision-scalable MAC units optimized for DL have recently emerged. Our contribution is a new precision-configurable multiplier/dot-product unit based on a modified Radix-4 Booth signed multiplier with Sum-Together (ST) mode. Besides 16-bit full-precision multiplications, it can be reconfigured to perform dot products among two 8-bit or four 4-bit sub-words of the input operands without requiring an external adder, thus reducing the number of cycles of MAC operations. The synthesis results in performance, power and area on a 28-nm technology show that our unit (1) is superior to other state-of-the-art ST multipliers in area (\(\approx \)35% less) in the clock frequency range between 100 and 1000 MHz and (2) reduces latency by up to 4x when used to compute a convolutional layer, at the cost of limited overheads in area (+10%) and power (+13%) compared to a conventional 16-bit Booth multiplier. This unit can play an important role in designing variable-precision MAC units or DL accelerators for edge devices.

1 Introduction

At the core of Deep Learning (DL) algorithms are convolutions and matrix multiplications, which require the computation of many dot products and simple scalar multiplications between features and weights. These operations are typically executed by multiply-and-accumulate (MAC) units. In particular, when running DL applications on edge devices, energy and latency of these MAC units have to be minimized. This requires reducing the data bit-width to the minimum, while keeping a satisfactory level of accuracy. Such minimum data precision may vary across different applications, but also within the same application in different phases (e.g., mixed-precision quantization for convolutional layers).

Multipliers with Sum-Together (ST) mode [1] are good candidates to realize precision-scalable MAC units [2,3,4,5]. They are special sub-word parallel multipliers that can perform either a multiplication at full precision or a dot-product at lower precision. In particular, with N = 1, 2 or 4, they compute N lower-precision multiplications in parallel and sum them into a single dot-product, with operand precision inversely proportional to N (e.g., 16/N bits). When used inside MAC units, they reduce the overall latency to as little as 1/N of the full-precision value, because each operation folds N products into one result and thus saves N-1 MAC additions.
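The ST behavior described above can be sketched as a purely functional model. This is an illustrative sketch, not the authors' hardware description: the helper names (`to_signed`, `split_subwords`, `pack`, `st_multiply`) are hypothetical, the sub-word ordering (least significant first) is an assumption, and only the symmetric 16/N-bit configurations are modeled.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def split_subwords(word: int, n: int, width: int = 16) -> list[int]:
    """Split a `width`-bit word into n signed sub-words, least significant first."""
    sub_bits = width // n
    return [to_signed(word >> (i * sub_bits), sub_bits) for i in range(n)]


def pack(values: list[int], sub_bits: int) -> int:
    """Pack signed values into one word, first value in the least significant bits."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << sub_bits) - 1)) << (i * sub_bits)
    return word


def st_multiply(a: int, b: int, n: int) -> int:
    """Behavioral ST model: one product (n = 1) or an n-term dot product."""
    if n not in (1, 2, 4):
        raise ValueError("n must be 1, 2 or 4")
    return sum(x * y for x, y in zip(split_subwords(a, n), split_subwords(b, n)))


# Two 8-bit pairs packed into 16-bit operands: 3*5 + (-2)*7 = 1
print(st_multiply(pack([3, -2], 8), pack([5, 7], 8), n=2))
```

A single call with n = 2 or n = 4 replaces two or four separate multiply-and-add steps, which is where the N-1 saved additions come from.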

In the literature we find different proposals for the implementation of ST multipliers. The design of [2] uses four 16-bit Booth multipliers with a configurable partial products compression array and three configurable 33-bit adders, instead of the usual final adder. The two sub-word parallel dot-product units of [3] consist of two 17-bit multipliers or four 9-bit multipliers, respectively, followed by a 32-bit adder. The reconfigurable parallel inner product of [4], the ancestor of [5], dynamically composes and decomposes 4 or 8-bit multipliers with a network of combinational logic.

In this context, we propose a new precision-scalable modified Radix-4 Booth signed multiplier with ST mode. Its configuration options are reported in Fig. 1. The main difference with respect to other state of the art (SoA) ST multipliers is that our design does not require a dedicated adder to sum the low-precision products together, but it exploits the normal alignment of partial products as in a standard multiplier (Fig. 2).

The comparison between the SoA ST multipliers and our design in performance, power and area (PPA) on a 28-nm technology shows that, at the cost of limited overhead in area and power compared to a conventional non-ST Booth design, our multiplier/dot-product unit (1) is superior to the other SoA units in the clock frequency range between 100 and 1000 MHz and (2) could reduce latency and energy of convolutional layers when used in MAC units or in variable-precision DL accelerators.

2 Hardware Design

We propose a new precision-configurable Radix-4 Booth signed multiplier with ST mode. The supported configurations are those of Fig. 1(b). The key feature of our multiplier/dot-product unit is the lack of a dedicated external adder to sum together the low-precision products during dot-product operations. Instead, such addition comes for free in our design. In fact, our unit exploits the normal alignment of partial products in a standard multiplier, enabling the computation of dot products when two or four scalar inputs are packed in each operand, as shown in Fig. 2. The bits of output P (yellow circles) are obtained by vertically summing the full-colored circles representing the bits of the eight partial products (PP0-PP7). These full-colored bits are the result of the products of operands with the same color, while the half-colored bits are zeroed as explained below.
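The idea that the addition "comes for free" from partial-product alignment can be illustrated at the sub-word level. The sketch below is an arithmetic illustration only: it works on whole sub-word products rather than on Booth partial-product bits, and it assumes that the sub-words of B are presented in reversed order so that the wanted products share the same weight. The exact arrangement is the one of Fig. 2, which is not reproduced here, and the function name `st_via_alignment` is hypothetical.

```python
def st_via_alignment(a_sub: list[int], b_sub: list[int], sub_bits: int) -> int:
    """Dot product obtained from product alignment, with cross terms zeroed.

    a_sub and b_sub hold the signed sub-words, least significant first.
    """
    n = len(a_sub)
    # Assumption: B's sub-words are reversed so that a_i meets b_i at one weight.
    b_arranged = b_sub[::-1]
    total = 0
    for i, a in enumerate(a_sub):
        for j, b in enumerate(b_arranged):
            weight = (i + j) * sub_bits  # natural alignment of a_i * b_arranged[j]
            if i + j == n - 1:
                # Wanted term: b_arranged[j] is b_sub[i], so this adds a_i * b_i.
                total += (a * b) << weight
            # Any other (i, j) pair is a cross term and is zeroed.
    # All wanted terms sit at the same weight; shift the sum down to the LSB.
    return total >> ((n - 1) * sub_bits)


print(st_via_alignment([3, -2], [5, 7], sub_bits=8))               # 3*5 + (-2)*7 = 1
print(st_via_alignment([1, -2, 3, -4], [5, 6, -7, -8], sub_bits=4))  # 4
```

Because every wanted product lands in the same column range, the ordinary column-wise summation of the multiplier already adds them, and only a final right shift is needed.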

In the multiplier architecture shown in Fig. 3, the yellow blocks are the standard components of a Radix-4 Booth multiplier, while the green ones are for precision reconfiguration, for zeroing the half-colored bits of Fig. 2, and for sign extending the inputs in asymmetric configurations, like 16 \(\times \) 8 and 8 \(\times \) 4, in order to treat them as 16 \(\times \) 16 and 8 \(\times \) 8. We implement the reduction tree as a Wallace tree with 4:2 compressors, while the final adder is a Carry Propagate Adder with Prefix Network. Regarding the additional logic for reconfigurability, the configuration signal CONFIG controls: 1) how the bits of operand A are composed to form the \(X_0\)–\(X_7\) input triplets for the encoder; 2) how the sub-words of operand B are arranged and presented to the \(Y_0\)–\(Y_7\) inputs of the selector; 3) the number of positions by which the output is right-shifted to the LSB position (Fig. 2).
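For readers unfamiliar with the encoder triplets, the following sketch shows textbook Radix-4 Booth recoding of a 16-bit signed operand into eight digits, one per partial product (PP0-PP7). It is the standard non-reconfigurable scheme, not the modified encoder of this paper: the CONFIG-dependent composition of the triplets and the zeroing logic are not modeled, and the function names are hypothetical.

```python
def booth_radix4_digits(a: int, bits: int = 16) -> list[int]:
    """Recode a `bits`-bit two's-complement value into radix-4 digits in {-2..2}."""
    a &= (1 << bits) - 1
    padded = a << 1  # append the implicit 0 below the LSB
    digits = []
    for i in range(bits // 2):
        triplet = (padded >> (2 * i)) & 0b111  # bits a[2i+1], a[2i], a[2i-1]
        hi, mid, lo = (triplet >> 2) & 1, (triplet >> 1) & 1, triplet & 1
        digits.append(-2 * hi + mid + lo)
    return digits


def booth_multiply(a: int, b: int, bits: int = 16) -> int:
    """Sum the partial products d_i * b * 4**i; `b` is a signed Python int."""
    digits = booth_radix4_digits(a, bits)
    return sum((d * b) << (2 * i) for i, d in enumerate(digits))


print(len(booth_radix4_digits(-300)))  # 8 digits, hence 8 partial products
print(booth_multiply(-300, 25))        # -7500
```

Each digit selects 0, ±B or ±2B as a partial product, which is why a 16-bit operand yields eight rows instead of sixteen.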

3 Experimental Results

For a fair comparison, we re-implemented the ideas of the SoA multipliers introduced in Sect. 1, making these minimal adjustments:

1) we standardized their configurations to match those presented in Fig. 1 and removed all logic not needed to implement the ST multiplier's behavior;
2) since the authors of [3] implemented their ST multiplier with a behavioral RTL description, we did the same, but we forced the synthesizer to use a 16-bit Booth multiplier for the 16\(\times \)16/16\(\times \)8 configurations;
3) we right-shifted the output of [4] to align it to the LSB because, as in our design, it produces the sum-of-products on higher bit positions;
4) we added input and output registers to all the ST multipliers, ours included.

We synthesized the designs with Synopsys Design Compiler on a 28-nm technology, varying the clock frequency from 100 to 1000 MHz. The PPA results in the area vs clock period space and power vs clock period space are in Fig. 4(a)–(b), respectively. Our unit is Pareto optimal in area at all frequencies, with \(\approx \)35% less area than the other SoA competitors, while in power all the designs almost overlap. A close examination at 1000 MHz is reported in Table 1(a), where ST multipliers are also compared with a conventional non-ST 16-bit Booth multiplier. From this table we find that our Booth multiplier with ST mode requires 10% more area and 13% more power than the baseline version.

Table 1. ST multipliers vs baseline (non-ST 16-bit Booth multiplier) at 1000 MHz (a); theoretical reduction in MAC operations and latency for the first layer of MobileNetV1 and EfficientNet-B0 computed with an ST-multiplier (b).

In Table 1(b) we show how an ST multiplier could reduce the number of MAC operations and the latency of a convolutional layer, in this case the first of MobileNetV1 and EfficientNet-B0. The theoretical reduction is a factor of N, i.e., both quantities drop to 1/N of their full-precision values, where N depends on the precision of activations and weights at which the layer is computed. Finally, we also expect a significant energy saving at lower precisions (N = 2 or 4) because energy scales with latency, while the power overhead is constant.
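The 1/N scaling and the energy argument can be made concrete with a small back-of-the-envelope model. This is a sketch under stated assumptions, and its numbers are not taken from Table 1(b): the layer shape used in the example (3x3 kernel, 3 input channels, 32 output channels, 112x112 output) is the commonly reported first layer of MobileNetV1 for a 224x224 input, perfect packing of N products per operation is assumed, and the 13% power overhead of Table 1(a) is assumed to apply unchanged in every configuration. All function names are hypothetical.

```python
import math


def conv_macs(out_h: int, out_w: int, out_ch: int, k_h: int, k_w: int, in_ch: int) -> int:
    """Number of scalar multiply-accumulates in a standard convolutional layer."""
    return out_h * out_w * out_ch * k_h * k_w * in_ch


def st_operations(macs: int, n: int) -> int:
    """Idealized number of ST operations when n products are folded into each one."""
    return math.ceil(macs / n)


def relative_energy(n: int, power_overhead: float = 0.13) -> float:
    """Energy relative to a non-ST baseline: (1 + overhead) * latency ratio 1/n."""
    return (1.0 + power_overhead) / n


macs = conv_macs(out_h=112, out_w=112, out_ch=32, k_h=3, k_w=3, in_ch=3)
print(macs)                    # 10838016
print(st_operations(macs, 4))  # 2709504
print(relative_energy(4))      # energy at N = 4 relative to the baseline
```

Under these assumptions the energy at N = 1 is slightly higher than the baseline because of the power overhead, and it falls below the baseline as soon as N = 2.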

4 Conclusion

This Booth multiplier with ST mode can play an important role inside precision-scalable MAC units or in variable-precision DL accelerators for edge devices [6] because it supports low-precision configurations which can reduce latency and energy. It also outperforms the SoA alternatives in area with limited reconfigurability overheads against a conventional non-configurable Booth multiplier.

References

1. Camus, V., et al.: Review and benchmarking of precision-scalable multiply-accumulate unit architectures for embedded neural-network processing. IEEE J. Emerg. Sel. Topics Circ. Syst. (JESTCS) 9(4), 697–711 (2019)
2. Zhang, X., Li, Z., Zheng, Q.: Design of a configurable fixed-point multiplier for digital signal processor. In: Proceedings Asia Pacific Conference on Postgraduate Research in Microelectronics & Electronics (PrimeAsia), pp. 217–220. IEEE, Shanghai (2009)
3. Gautschi, M., et al.: Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 25(10), 2700–2713 (2017)
4. Lin, R.: Reconfigurable parallel inner product processor architectures. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 9(2), 261–272 (2001)
5. Sharma, H., et al.: Bit fusion: bit-level dynamically composable architecture for accelerating deep neural network. In: Proceedings 45th Annual International Symposium on Computer Architecture (ISCA), pp. 764–775. ACM/IEEE (2018)
6. Urbinati, L., Casu, M.R.: A reconfigurable depth-wise convolution module for heterogeneously quantized DNNs. In: Proceedings International Symposium on Circuits and Systems (ISCAS), pp. 128–132. IEEE, Austin (2022)

Author information

Authors and Affiliations

Department of Electronics and Telecommunications, Politecnico di Torino, Torino, Italy
Luca Urbinati & Mario R. Casu

Corresponding author

Correspondence to Luca Urbinati.

Editor information

Editors and Affiliations

DIMES, University of Calabria, Arcavacata, Italy
Giuseppe Cocorullo
DIMES, University of Calabria, Arcavacata, Italy
Felice Crupi
Dipartimento di Ingegneria Elettronica, University of Rome Tor Vergata, Rome, Italy
Ernesto Limiti

About this paper

Cite this paper

Urbinati, L., Casu, M.R. (2023). A Reconfigurable Multiplier/Dot-Product Unit for Precision-Scalable Deep Learning Applications. In: Cocorullo, G., Crupi, F., Limiti, E. (eds) Proceedings of SIE 2022. SIE 2022. Lecture Notes in Electrical Engineering, vol 1005. Springer, Cham. https://doi.org/10.1007/978-3-031-26066-7_2

DOI: https://doi.org/10.1007/978-3-031-26066-7_2
Published: 28 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26065-0
Online ISBN: 978-3-031-26066-7
eBook Packages: Engineering, Engineering (R0)
