Implementation of Low Power and Area Efficient Floating-Point Fused Multiply-Add Unit

Dhanabal, R.; Sahoo, Sarat Kumar; Bharathi, V.

doi:10.1007/978-81-322-2671-0_31


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 397)


Abstract

This paper presents a modified architecture for a floating-point fused multiply-add (FMA) unit aimed at low-power, reduced-area applications. An FMA unit computes the floating-point operation (A × B) + C as a single instruction. The design uses a bridge unit that connects the existing floating-point multiplier (FMUL) to the FMUL’s add-round unit in the co-processor to perform the FMA operation. The main objective is to reuse as many components as possible, so that the unit supports either parallel floating-point addition and multiplication or fused multiply-add by adding only a small amount of hardware to the FMUL’s add-round unit. Each unit is designed in Verilog HDL, simulated using Altera ModelSim, and synthesized using Cadence RTL Compiler in 45 nm technology. All floating-point arithmetic is implemented in IEEE-754 double precision format. The proposed FMA architecture achieves a 17 % improvement in power and a 6 % improvement in area compared to the existing bridge FMA unit.


1 Introduction

The floating-point fused multiply-add (FMA) operation has become one of the fundamental operations in digital signal processing applications. Many commercial processors, such as the IBM PowerPC and Intel Itanium, include an FMA unit in their floating-point units to execute double precision fused multiply-add operations [1]. An FMA unit improves the accuracy of the floating-point $ (A \times B) + C $ operation because it performs a single rounding instead of two. The FMA operation is most useful when a floating-point multiplication is followed by a floating-point addition.

Floating-point fused multiply-add implementation has two advantages over implementation of floating-point addition (FADD) and floating-point multiplication (FMUL) separately: (1) The FMA operation is performed with only one rounding instead of two (one for floating-point adder and other for floating-point multiplier) reducing overall error due to rounding. (2) There will be a reduction in delay and hardware required by sharing components [2, 3].
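The effect of the single rounding in point (1) can be illustrated numerically. The following is a minimal software sketch, not a model of the hardware: it emulates a fused operation by computing the exact result in rational arithmetic and rounding once. The function names and operand values are chosen here for illustration and do not come from the paper.

```python
from fractions import Fraction

def mul_then_add(a, b, c):
    """Separate FMUL and FADD: the product is rounded, then the sum is rounded."""
    return (a * b) + c

def fused_mul_add(a, b, c):
    """Emulated FMA: exact (a*b)+c in rational arithmetic, rounded once to double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# Operands chosen so that the exact product 1 - 2**-60 is not representable.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

two_roundings = mul_then_add(a, b, c)
one_rounding = fused_mul_add(a, b, c)
print(two_roundings, one_rounding)
```

With these operands the rounded product is exactly 1.0, so the two-step result is 0.0, whereas the single-rounding result keeps the residual −2⁻⁶⁰.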

In some designs the existing FMUL and FADD units are entirely replaced with an FMA unit. Such a unit performs a single FMUL operation by setting C = 0 and a single FADD operation by setting A = 1 (or B = 1) in $ (A \times B) + C $, e.g., $ (A \times B) + 0.0 $ for a single multiply and $ (A \times 1.0) + C $ for a single add. However, because the inserted constants still travel through the full FMA datapath, the latencies of stand-alone FMUL and FADD operations increase. Such designs also cannot execute FMUL and FADD instructions in parallel [3].

The first floating-point FMA unit was introduced on the IBM RISC System/6000 in 1990 to execute $ (A \times B) + C $ as a single, indivisible instruction [2, 4]. The basic FMA unit cannot execute FMUL and FADD operations in parallel. The Concordia FMA architecture in [5] places alignment blocks before the multiplier array. This widens the input range of the multiplier tree, so a larger, variable-size multiplier tree is required. The Lang/Bruguera fused multiply-add architecture, which is designed for reduced latency, identifies a few possible solutions [3, 6], but it does not reach the latency of a common FADD or FMUL instruction. The bridge FMA design introduced in [7] avoids the stand-alone FMUL and FADD latency penalty caused by the insertion of constants by adding extra blocks between the existing FMUL and FADD components in the processor. The cost of this architecture is an increase in area and power consumption compared to the basic FMA architecture.

The main objective of this work is to design a low-power, area-efficient FMA unit that performs either an FMA operation or parallel FMUL and FADD operations, as required. A modified add-round block is designed that supports add-round for both FMUL and FMA. Using a common add-round unit for both FMA and FMUL instructions saves chip area.

All floating-point arithmetic operations here use the IEEE-754 double precision format. The standard IEEE-754 double precision format [8] consists of 64 bits divided into three sections (a 1-bit sign, an 11-bit exponent, and a 52-bit fraction), as shown in Fig. 1. All three sections are combined to represent a floating-point number. The value of a double precision floating-point number is given by Eq. (1), where the bias is 1023.

$$ A = (-1)^{\text{sign}_{A}} \times 1.\text{fraction}_{A} \times 2^{\exp_{A} - \text{bias}} \qquad (1) $$
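As a quick check of Eq. (1), the sketch below splits a double into its three fields and re-evaluates the formula. It is an illustrative software example that covers normal numbers only; the helper names `decode_double` and `value_from_fields` are hypothetical and are not part of the design.

```python
import struct

BIAS = 1023  # exponent bias for IEEE-754 double precision

def decode_double(x):
    """Return (sign, biased exponent, fraction) of a Python float."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

def value_from_fields(sign, exponent, fraction):
    """Evaluate Eq. (1) for a normal number (exponent not 0 or 2047)."""
    significand = 1 + fraction / 2**52      # the hidden 1 plus the 52-bit fraction
    return (-1) ** sign * significand * 2.0 ** (exponent - BIAS)

fields = decode_double(-6.5)                # -6.5 = -1.625 * 2**2
print(fields, value_from_fields(*fields))
```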

2 Architecture of Proposed FMA Unit

The block diagram of the proposed floating-point fused multiply-add unit is shown in Fig. 2. The FMA unit is built around the common multiplier and adder units, which can also perform stand-alone operations. The main components of the design are:

1. Floating-Point Multiplier
2. Floating-Point Adder
3. Bridge Unit
4. Add-Round Unit for FMA/FMUL
5. Add-Round Unit for FADD

The FMA unit performs either parallel floating-point addition and multiplication or a floating-point fused multiply-add operation, as required. When an FMA operation is to be performed, the bridge unit is connected between the existing FMUL and the FMUL’s add-round unit. When an FMA operation is not needed, stand-alone FMUL and FADD operations can be performed without using the intermediate bridge unit.

2.1 Multiplier

An efficient double precision floating-point multiplier has been implemented using the radix-4 modified Booth encoding (MBE) algorithm and the Dadda algorithm. This hybrid multiplier combines the advantages of both algorithms: MBE reduces the number of partial products to be added, and the Dadda scheme adds the partial products quickly [9, 10]. The objective of combining the two schemes is to make the multiplier power efficient and area efficient. The two rows finally obtained (sums and carries) are added using an efficient parallel prefix adder [11].

MBE generates at most $ \left\lfloor {\frac{N}{2}} \right\rfloor + 1 $ partial products, where N is the number of bits. Radix-4 recoding uses the digit set {−2, −1, 0, 1, 2}, as shown in Table 1. Each group of three consecutive bits of the multiplier B forms the input to the Booth recoding block. This block selects the operation to apply to the multiplicand A: shift and invert (−2A), invert (−A), zero (0), no operation (+A), or shift (+2A). Figure 3 shows the generation of one partial product using MBE.
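The recoding can be sketched in software as follows. This is a behavioral illustration of radix-4 Booth recoding for an unsigned multiplier, not the paper's Verilog; the function names are hypothetical, and each overlapping bit triplet is assumed to map to the digit b(2i−1) + b(2i) − 2·b(2i+1) with an implicit zero below the LSB.

```python
def booth_radix4_digits(b, n):
    """Recode an unsigned n-bit multiplier into radix-4 Booth digits, LSB first.

    Each digit lies in {-2, -1, 0, 1, 2}; floor(n/2) + 1 digits are produced.
    """
    if not 0 <= b < (1 << n):
        raise ValueError("multiplier does not fit in n bits")
    padded = b << 1                      # implicit 0 appended below the LSB
    digits = []
    for i in range(n // 2 + 1):
        triplet = (padded >> (2 * i)) & 0b111
        low, mid, high = triplet & 1, (triplet >> 1) & 1, (triplet >> 2) & 1
        digits.append(low + mid - 2 * high)
    return digits

def booth_multiply(a, b, n):
    """Sum the partial products digit * a * 4**i selected by the recoding."""
    return sum(d * a * 4**i for i, d in enumerate(booth_radix4_digits(b, n)))

a = (1 << 52) | 0x123456789ABCD          # example 53-bit significands
b = (1 << 53) - 1
print(len(booth_radix4_digits(b, 53)), booth_multiply(a, b, 53) == a * b)
```

For 53-bit operands the recoding yields 27 digits, one per partial product.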

Table 1 Radix-4 modified booth’s recoding (for A × B)


In the Dadda scheme, the partial products are reduced in stages using half adders and full adders. The height of each stage is determined by working back from the final stage: the height of each preceding stage must not exceed $ \left\lfloor {3 \cdot \text{successor height}/2} \right\rfloor $ [10].

A double precision floating-point multiplication multiplies two 53-bit numbers (1 hidden bit + 52 mantissa bits). Conventional partial product generation would produce 53 partial products, but MBE reduces this number to 27. Each partial product is obtained using the block shown in Fig. 3. These 27 rows of partial products are reduced to 2 rows in 7 reduction stages, with stage heights of 19, 13, 9, 6, 4, 3, and 2 going down the Dadda reduction scheme. The dot diagram for 10-bit by 10-bit Booth encoding with Dadda reduction is shown in Fig. 4. The same method is extended to 53 bits by 53 bits.
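The stage heights quoted above follow directly from the Dadda height rule. The short sketch below (a hypothetical helper, not part of the design) builds the sequence 2, 3, 4, 6, 9, ... by repeatedly applying the floor(3h/2) rule and keeps the heights smaller than the initial row count.

```python
def dadda_stage_heights(rows):
    """Return the target height of each Dadda reduction stage, first stage first."""
    if rows <= 2:
        return []                         # nothing to reduce
    heights = [2]
    # Work back from the final height of 2: each earlier stage may be at most
    # floor(3 * successor / 2) rows high, and must be below the starting height.
    while heights[-1] * 3 // 2 < rows:
        heights.append(heights[-1] * 3 // 2)
    return heights[::-1]

print(dadda_stage_heights(27))
```

For 27 rows this gives seven stages, whereas the 53 rows of a design without Booth encoding would need nine.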

The final sums and carries are added using a parallel prefix adder, because such adders offer a highly efficient solution to binary addition and are well suited to VLSI implementation [11, 12]. Among parallel prefix adders, the Kogge–Stone architecture is the most widely used and is generally regarded as one of the fastest adder designs [11]. The architecture of an 8-bit Kogge–Stone adder is shown in Fig. 5. In this design a 109-bit Kogge–Stone adder adds the final sums and carries.
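The prefix computation behind a Kogge–Stone adder can be sketched behaviorally as below. This is an assumption-laden software model (no carry-in, whole words processed with integer bit operations), intended only to show how the generate and propagate signals are combined over doubling distances; the function name is hypothetical.

```python
def kogge_stone_add(x, y, width):
    """Add two width-bit integers with a Kogge-Stone prefix network.

    Returns (sum modulo 2**width, carry_out). No carry-in is modeled.
    """
    mask = (1 << width) - 1
    x &= mask
    y &= mask
    propagate = x ^ y                     # bit-level propagate
    generate = x & y                      # bit-level generate
    group_g, group_p = generate, propagate
    distance = 1
    while distance < width:               # ceil(log2(width)) prefix levels
        group_g = (group_g | (group_p & (group_g << distance))) & mask
        group_p = (group_p & (group_p << distance)) & mask
        distance <<= 1
    carries = (group_g << 1) & mask       # carry into bit i comes from bits [i-1:0]
    total = (propagate ^ carries) & mask
    carry_out = (group_g >> (width - 1)) & 1
    return total, carry_out

print(kogge_stone_add((1 << 109) - 1, 1, 109))
```

For a 109-bit adder the loop runs seven times, which is the logarithmic depth that makes this structure attractive.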

2.2 Floating-Point Adder

The modified FMA architecture uses Farmwald’s dual-path floating-point adder design [7]. The FADD design is shown in Fig. 6. It uses two paths, a close path and a far path, to handle different data cases. The far path handles significand addition and subtraction when the exponent difference is greater than 1. The close path handles significand addition and subtraction when the exponents are equal or differ by $ \pm 1 $. In the far path both significands pass through swap multiplexers. The larger significand is routed to far_op_greater, and the smaller significand is aligned until the exponents match and is routed to far_op_smaller. In the close path the two significands are pre-shifted by 1. The original and pre-shifted significands are fed to swap multiplexers, which swap them based on the exponent difference. When the exponents are equal, a comparator compares the two significands. The greater significand in the close path is routed to close_op_greater and the smaller to close_op_smaller.
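The path selection and operand swap described above can be summarized in a small behavioral sketch. It is a simplification under stated assumptions: the function name and return format are hypothetical, the inputs are unsigned exponents and significands, and the close-path pre-shift is represented simply as an alignment shift of 0 or 1.

```python
def fadd_select(exp_a, sig_a, exp_b, sig_b):
    """Choose the adder path and order the operands.

    Returns (path, greater_significand, smaller_significand, align_shift),
    where align_shift is the right shift applied to the smaller operand.
    """
    diff = exp_a - exp_b
    # Close path: exponents equal or differing by 1; far path otherwise.
    path = "close" if abs(diff) <= 1 else "far"
    # With equal exponents the significand comparator decides the order.
    a_is_greater = diff > 0 or (diff == 0 and sig_a >= sig_b)
    if a_is_greater:
        greater, smaller = sig_a, sig_b
    else:
        greater, smaller = sig_b, sig_a
    return path, greater, smaller, abs(diff)

print(fadd_select(1030, 0b1010, 1024, 0b1111))
```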

2.3 Bridge Unit

The bridge unit is shown in Fig. 7. It carries data from the multiplier array to the FMUL’s add-round unit so that the FMA operation $ (A \times B) + C $ is performed efficiently. Its inputs are the mantissa of operand C and the carry-save form of the product $ A \times B $ from the multiplier array. Operand C is aligned based on the difference between the exponent of C and the exponent of the product. After alignment, the select line ‘sub’ decides whether inversion is performed. This inversion provides the effective 2’s complement for effective subtraction. If sub = 1, the aligned data is inverted; otherwise the aligned data is buffered.

The bridge unit adds the product $ A \times B $ (i.e., mul_sum and mul_carry) to the lower part ([108:0]) of the pre-aligned 161-bit addend (operand C) using a 3:2 CSA, as shown in Fig. 7. The remaining 52 bits ([160:109]) of the 161-bit addend are fed to the incrementer in the FMA/FMUL add-round unit. The 109-bit sum and carry obtained from the 3:2 CSA are fed to the multiplexer stage in the FMA/FMUL add-round unit.
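A 3:2 carry-save adder reduces three operands to a sum vector and a carry vector without propagating carries. The sketch below models that step and the split of the aligned addend into its lower 109 bits and upper 52 bits. It is an illustrative software model; the names `csa_3to2` and `split_addend` are hypothetical, and Python integers are used in place of fixed-width buses, so the carry vector is not truncated here.

```python
LOW_BITS = 109    # width handled by the CSA and the main adder
HIGH_BITS = 52    # width handled by the incrementer

def csa_3to2(x, y, z):
    """3:2 carry-save adder: returns (sum_vector, carry_vector) with x+y+z preserved."""
    sum_vector = x ^ y ^ z
    carry_vector = ((x & y) | (x & z) | (y & z)) << 1
    return sum_vector, carry_vector

def split_addend(aligned_c):
    """Split a 161-bit aligned addend into (low 109 bits, high 52 bits)."""
    low = aligned_c & ((1 << LOW_BITS) - 1)
    high = (aligned_c >> LOW_BITS) & ((1 << HIGH_BITS) - 1)
    return low, high

mul_sum, mul_carry = 0b101101, 0b011010          # toy carry-save product
low_c, high_c = split_addend((0b11 << LOW_BITS) | 0b110011)
s, c = csa_3to2(mul_sum, mul_carry, low_c)
print(s + c == mul_sum + mul_carry + low_c, high_c)
```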

Let D be the difference between the exponent of C and the exponent of the product $ A \times B $, that is, $ D = \exp (C) - (\exp (A) + \exp (B)) $, where exp(A), exp(B), and exp(C) are the exponents of operands A, B, and C, respectively.

When $ D \ge 0 $ (i.e., $ \exp (C) \ge \exp (A) + \exp (B) $), a conventional aligner would shift the product $ A \times B $ right by D bits, or shift C left by D bits, until exp($ A \times B $) = exp(C). When $ D \ge 56 $, the sum and carry are placed to the right of the LSB of operand C.

When $ D < 0 $, operand C is shifted right by $ |D| $ bits until exp($ A \times B $) = exp(C). For a right shift greater than 105 (i.e., $ D < - 105 $), operand C is placed to the right of the LSB of the sum and carry (product).

To avoid a bidirectional shifter, the alignment is implemented entirely as a right shift by placing operand C to the left of the sum and carry, with two extra bits (a guard bit and a round bit) between them. Combining both cases, the shift amount ranges from 0 to 161, so a 161-bit right shifter suffices. Figure 8 shows in detail how operand C is aligned in the different cases.

For $ D \ge 0 $, the shift amount is $ \max \{ 0,\; 56 - D \} $.
For $ D < 0 $, the shift amount is $ \min \{ 161,\; 56 - D \} $.
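The two cases can be written as a single function of D. The sketch below is a direct transcription of the two expressions above; the function name is hypothetical.

```python
def alignment_shift(d):
    """Right-shift amount for operand C, given D = exp(C) - (exp(A) + exp(B))."""
    if d >= 0:
        return max(0, 56 - d)      # C far larger than the product: no shift needed
    return min(161, 56 - d)        # C far smaller: saturate at the shifter width

for d in (60, 56, 0, -1, -105, -200):
    print(d, alignment_shift(d))
```

The saturation points match the text: the shift is 0 once D reaches 56 and stays at 161 for D below −105.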

2.4 FMA/FMUL Add-Round Unit

The FMA/FMUL add-round unit is shown in Fig. 9. The same add-round unit serves both FMA and FMUL operations: it acts as the FMUL add-round unit when a stand-alone FMUL is required and as the FMA add-round unit when an FMA is required. A multiplexer stage selects between FMA and FMUL. A 109-bit Kogge–Stone adder adds the data from the mux stage. In parallel, the upper part of the aligner output (the 52 MSBs) from the bridge unit is fed to an incrementer. Based on the carry out of the 109-bit adder, a 2:1 mux selects either the aligner output or the incrementer output. The result is complemented if necessary, normalized, and then sent for rounding.
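The adder-plus-incrementer arrangement avoids a full 161-bit carry-propagate adder: only the lower 109 bits are added, and the carry out selects between the upper bits and the upper bits plus one. A behavioral sketch follows, assuming both CSA outputs have already been truncated to 109 bits as stated in the text; the function name is hypothetical, and complementing, normalization, and rounding are omitted.

```python
LOW_BITS = 109
HIGH_BITS = 52
LOW_MASK = (1 << LOW_BITS) - 1
HIGH_MASK = (1 << HIGH_BITS) - 1

def add_with_incrementer(sum_vec, carry_vec, high):
    """Combine the 109-bit adder with the 52-bit incrementer and the 2:1 mux."""
    low_total = (sum_vec & LOW_MASK) + (carry_vec & LOW_MASK)
    carry_out = low_total >> LOW_BITS                  # 0 or 1
    incremented = (high + 1) & HIGH_MASK               # computed in parallel in hardware
    selected_high = incremented if carry_out else (high & HIGH_MASK)
    return (selected_high << LOW_BITS) | (low_total & LOW_MASK)

result = add_with_incrementer(LOW_MASK, 1, 5)          # low part overflows
print(result == (6 << LOW_BITS))
```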

The three bits after the LSB decide the rounding: the guard bit (g), the round bit (r), and the sticky bit (s), in that order. The sticky bit is the logical OR of all bits beyond the round bit. In Fig. 9, R[2:0] represents {g, r, s}. The round-up method of [13] is used, for which both result and result + 1 must be generated. The result is rounded using the rounding table given in [13], which yields the mantissa of the FMA/FMUL output. In parallel, the exponent is adjusted whenever normalization or shifting is performed.
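To illustrate how {g, r, s} select between result and result + 1, the sketch below applies the standard IEEE round-to-nearest-even rule. This is an assumption made for illustration: the rounding table of [13] is not reproduced in the text, so the rule and the function name here should not be read as the paper's exact logic.

```python
def round_nearest_even(value, drop_bits):
    """Drop the low drop_bits (>= 2) of value, rounding to nearest, ties to even."""
    result = value >> drop_bits
    guard = (value >> (drop_bits - 1)) & 1
    round_bit = (value >> (drop_bits - 2)) & 1
    sticky = int(value & ((1 << (drop_bits - 2)) - 1) != 0)   # OR of the remaining bits
    lsb = result & 1
    # Above the halfway point: guard set and any lower bit set.
    # Exactly halfway: guard set, round and sticky clear; round up only if lsb is 1.
    round_up = guard & (round_bit | sticky | lsb)
    return result + 1 if round_up else result                 # mux between result and result + 1

print(round_nearest_even(0b1011100, 3), round_nearest_even(0b1010100, 3))
```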

2.5 FADD Add-Round Unit

The add-round unit shown in Fig. 10 is used exclusively for the FADD operation. The far-path and close-path operands from the floating-point adder are fed to the FADD add-round unit. The two selected inputs are passed through a 56-bit Kogge–Stone adder and a 56-bit 3:2 CSA. To perform round-up, the third input of the CSA is set to {55′b0, 1′b1}. Rounding is done in the same way as for the multiplier. The sum and carry from the CSA are then added using a second 56-bit Kogge–Stone adder, which yields the mantissa of the adder output. In parallel, the exponent is adjusted whenever normalization or shifting is performed.

3 Results and Discussion

Simulation results for the floating-point multiply-add operation are shown in Fig. 11. Simulation results for parallel floating-point addition and multiplication are shown in Fig. 12. The synthesis report for the proposed FMA design and the FMA design in [7], obtained using Cadence RTL Compiler in 45 nm technology, is given in Table 2.

Table 2 Delay, area, power report in 45 nm technology


Table 2 shows that the delay and power consumption of stand-alone FADD and FMA operations decreased. Comparison charts for delay, power, and area are shown in Figs. 13, 14, and 15, respectively.

The proposed FMA unit achieves 7 % and 18 % improvements in delay for FADD and FMA instructions, respectively, 19 % and 17 % improvements in power consumption for FADD and FMA instructions, respectively, and a 6 % improvement in total area compared to the FMA in [7]. For the FMUL instruction, the proposed unit consumes almost the same power as the FMA in [7]. Its drawback is a 7 % degradation in timing for the FMUL instruction compared to the FMA in [7].

The stand-alone FMUL and FADD operations in existing floating-point units and ALUs [14, 15, 16] can be replaced by a floating-point fused multiply-add whenever a floating-point multiplication is followed by a floating-point addition. This FMA design can also be extended and implemented using the Residue Number System, which is gaining popularity for fast arithmetic operations [17].

4 Conclusion

This paper presents a low-power, area-efficient double precision floating-point fused multiply-add unit. The use of a common add-round unit for FMUL and FMA instructions is the main reason for the reduction in area, and it also lowers the overall power consumption of the unit. Compared with the existing bridge FMA, the design is more efficient in terms of power and area. Its only drawback is the degradation in timing for the FMUL instruction. The proposed unit can perform an FMA operation, or it can perform stand-alone FMUL and FADD operations in parallel without any insertion of constants, which is not possible with the classic FMA unit. The design is suitable for the high-performance floating-point units of co-processors.

References

1. Schmookler M, Trong SD, Schwarz E, Kroener M (2007) P6 Binary floating-point unit. In: Proceedings of the 15th IEEE symposium on computer arithmetic, Montpellier, pp 77–86, June 2007
2. Hokenek E, Montoye R, Cook PW (1990) Second-generation RISC floating point with multiply-add fused. IEEE J Solid-State Circuits 25(5):1207–1213
3. Lang T, Bruguera JD (2004) Floating-point multiply-add-fused with reduced latency. IEEE Trans Comput 53(8):988–1003
4. Montoye RK, Hokenek E, Runyon SL (1990) Design of the IBM RISC System/6000 floating point execution unit. IBM J Res Dev 34:59–70
5. Pillai RVK, Shah SYA, Al-Khalili AJ, Al-Khalili D (2001) Low power floating point MAFs-a comparative study. In: Sixth international symposium on signal processing and its applications, vol 1, pp 284–287
6. Lang T, Bruguera JD (2002) Floating-point fused multiply-add: reduced latency. In: Proceedings of the 2002 IEEE international conference on computer design: VLSI in computers and processors, pp 145–150
7. Quinnell E, Swartzlander EE, Lemonds C (2008) Bridge floating-point fused multiply-add design. IEEE Trans Very Large Scale Integr (VLSI) Syst 16(12):1727–1731
8. IEEE Standard for Binary Floating-Point Arithmetic (1985) ANSI/IEEE Standard 754–1985, Reaffirmed 6 Dec 1990
9. Dadda L (1964) Some schemes for parallel multipliers. IEEE Trans Comput 13:14–17
10. Waters RS, Swartzlander EE (2010) A reduced complexity wallace multiplier reduction. IEEE Trans Comput 59(8):1134–1137
11. Dimitrakopoulos G, Nikolos D (2005) High-speed parallel-prefix VLSI ling adders. IEEE Trans Comput 54(2):225–231
12. Anitha RV, Bagyaveereswaran (2012) High performance parallel prefix adders with fast carry chain logic. Int J Adv Res Eng Technol 3(2):01–10
13. Quach N, Takagi N, Flynn M (1991) On fast IEEE rounding. Stanford University, Stanford, CA, Technical Report CSL-TR-91-459, Jan 1991
14. Dhanabal R, Bharathi V, Shilpa K, Sujana DV, Sahoo SK (2014) Design and implementation of low power floating point arithmetic unit. Int J Appl Eng Res 9(3):339–346. ISSN: 0973-4562
15. Ushasree G, Dhanabal R, Sahoo SK (2013) VLSI implementation of a high speed single precision floating point unit using verilog. In: Proceedings of IEEE conference on information and communication technologies (ICT 2013), pp 803–808
16. Dhanabal R, Bharathi V, Salim S, Thomas B, Soman H, Sahoo SK (2013) Design of 16-bit low power ALU-DBGPU. Int J Eng Technol 5(3):2172–2180
17. Dhanabal R, Sahoo SK, Barathi V, Samhitha NR, Cherian NA, Jacob PM (2014) Implementation of floating point mac using residue number system. J Theor Appl Inf Technol 62(2), April 2014


Author information

Authors and Affiliations

VLSI Division, SENSE, VIT University, Vellore, India
R. Dhanabal
SELECT, VIT University, Vellore, 632014, India
Sarat Kumar Sahoo
VLSI Division, CSE, GGR College of Engineering, Anna University, Vellore, India
V. Bharathi


Corresponding author

Correspondence to R. Dhanabal.

Editor information

Editors and Affiliations

Noorul Islam Centre for Higher Education, Kumaracoil, Tamil Nadu, India
L. Padma Suresh
IIT Delhi, New Delhi, Delhi, India
Bijaya Ketan Panigrahi


About this paper

Cite this paper

Dhanabal, R., Sahoo, S.K., Bharathi, V. (2016). Implementation of Low Power and Area Efficient Floating-Point Fused Multiply-Add Unit. In: Suresh, L., Panigrahi, B. (eds) Proceedings of the International Conference on Soft Computing Systems. Advances in Intelligent Systems and Computing, vol 397. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2671-0_31


DOI: https://doi.org/10.1007/978-81-322-2671-0_31
Published: 29 December 2015
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2669-7
Online ISBN: 978-81-322-2671-0
eBook Packages: Engineering, Engineering (R0)
