Keywords

1 Introduction

Technology advances brought together large number of devices on a small silicon area, thereby increasing power density on a chip. Low-power designs of these devices are therefore needed to prolong lifetime of a chip as well as the battery. One way to reduce power of a bigger module is to reduce power dissipation at subblocks level. This requires implementation of judicial design strategies and architectures to obtain the best result for any application. Matrix multiplication unit is used in digital signal processing applications such as, digital imaging, signal processing, computer graphics, and multimedia. So, it is very crucial that it occupies less area, works fast, and consumes low power. IEEE standard 754 floating point (FP) is a representation for real numbers on computers. The goal of this paper is to present a low-power design for FP matrix multiplier. Paper publications in this area have been limited to building algorithm [1, 2] and subblocks required for floating point matrix multipliers, which include floating point multiplier [3] and floating point adder [4]. The circuit level simulation of designs at 180 and 45 nm technology nodes is performed using low-voltage and low-power design techniques in Virtuoso-Cadence environment.

The rest of the paper is organized as follows. Section 2 presents architecture of FP matrix multiplier along with its subblocks like multiplexer, FP multiplier, FP adder, and register. Section 3 presents the simulation results and comparison of delay, power, and energy-delay product (EDP) at 180 and 45 nm node for low-power design. Finally, conclusion is drawn in Sect. 4.

2 Matrix Multiplier Implementation

FP matrix multiplication is performed basically in two steps, FP multiplication and FP addition. In this paper, power is reduced by reusing the adder architecture to perform addition one after the other [1, 2]. This is an efficient technique to reduce power consumption by breaking the main circuit in partially sequential and partially parallel parts. Matrix multiplication operation is performed as follows:

$$\left[ {\begin{array}{*{20}c} {X_{11} } & {X_{12} } & {X_{13} } & {X_{14} } \\ {X_{21} } & {X_{22} } & {X_{23} } & {X_{24} } \\ {X_{31} } & {X_{32} } & {X_{33} } & {X_{34} } \\ {X_{41} } & {X_{42} } & {X_{43} } & {X_{44} } \\ \end{array} } \right] \times \left[ {\begin{array}{*{20}c} {Y_{11} } & {Y_{12} } & {Y_{13} } & {Y_{14} } \\ {Y_{21} } & {Y_{22} } & {Y_{23} } & {Y_{24} } \\ {Y_{31} } & {Y_{32} } & {Y_{33} } & {Y_{34} } \\ {Y_{41} } & {Y_{42} } & {Y_{43} } & {Y_{44} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {Z_{11} } & {Z_{12} } & {Z_{13} } & {Z_{14} } \\ {Z_{21} } & {Z_{22} } & {Z_{23} } & {Z_{24} } \\ {Z_{31} } & {Z_{32} } & {Z_{33} } & {Z_{34} } \\ {Z_{41} } & {Z_{42} } & {Z_{43} } & {Z_{44} } \\ \end{array} } \right],$$
(1)

where

$$Z_{11} = X_{11} Y_{11} + X_{12} Y_{21} + X_{13} Y_{31} + X_{14} Y_{41} ,$$
(2)
$$Z_{12} = X_{11} Y_{12} + X_{12} Y_{22} + X_{13} Y_{32} + X_{14} Y_{42} ,$$
(3)

and so on…

The block diagram of FP matrix multiplier along with its subblocks is shown in Fig. 1a–e. Figure 1a depicts the 4 × 4 matrix multiplier. Figure 1b shows operation of each cell of the matrix multiplier. It comprises of 4:1 MUX, FP multiplier, FP adder, and memory register. The circuit diagram of 4:1 MUX is presented in Fig. 1c. MUX selects the inputs of the two matrices to be multiplied. These are 4:1 MUX for matrices of size 4 × 4. Each MUX gives output as a 32-bit single-precision floating point number. These are inputs to the floating point multiplier unit.

Fig. 1
figure 1

Representation of FP matrix multiplier and its subblocks. a Architecture of 4 × 4 FP matrix multiplier. b FP matrix multiplier block diagram. c MUX schematic. d FP multiplier block diagram. e FP adder block diagram. f 1-bit register circuit

The FP multiplier unit has three subparts as shown in Fig. 1d. Floating point multiplication [3] is done in three steps: sign calculation, exponent calculation, and mantissa calculation. The multiplier unit used for mantissa calculation is the slowest path as it has to perform operation on 24 bits (23 bits mantissa and 1 bit normalizing bit). So, in order to make this subblock fast, different architectures of multipliers have been studied. FP multiplier can be implemented using several designs as divide and conquer multiplier [5], Wallace tree multiplier [6], and Barun multiplier [5]. Divide and conquer multiplier is used mostly in FPGAs, as this leads to fast calculation. However, this comes at the expense of more hardware. Wallace tree multiplier has less number of bits in each level, but delay is not evenly distributed. So, it may cause glitches. Also, the tree structure uses a lot of interconnection; therefore uses a lot more area. Barun multiplier shown in Fig. 2 offers evenly distributed delay for the inputs of a full adder and it can be extended to multiply higher number of bits. Based on the simulation results presented in Sect. 3 for all the above three multipliers, it has been inferred that Barun multiplier offers lesser EDP. Consequently, this is incorporated in the present low-power FP matrix multiplier design.

Fig. 2
figure 2

Barun multiplier

The basic block in FP matrix multiplier is a full adder. Different architectures of full adder like 17-T, 14-T, 10-T have been discussed in [7]. The 17-T adder proves to be the best architecture in terms of power, delay, and power-delay product. A 17-T adder was made using pass transistor logic, but there is a probability of getting wrong logic at the output. For example, say when a = b = 0 the output at a ⊕ b is V tp and that at Ex-NOR is V dd − V tn. This problem increases when adders are cascaded as inputs a and b can be output of any previous adder. So, modified 17T architecture is proposed here and used in this work. This is shown in Fig. 3. In this, transmission gate is used instead of pass transistor which transfers full logic.

Fig. 3
figure 3

Proposed full adder

The addition of two FP numbers is the trickiest part. The algorithm for FP addition is taken from [4] and block diagram for the same is shown in Fig. 1e. FP adder performs addition of partial products, as shown in Eqs. (2) and (3). First, there is a need to make exponent of both the operands equal in order to perform addition. According to Fig. 1e, if \(|B_{\exp } | > |A_{\exp } |\;(B_{\exp } = B\langle 30:23\rangle \,{\text{and}}\,A_{\exp } = A\langle 30:23\rangle )\), then swapping of the numbers is required, which implies that the signal swap select = 1. The mantissa of smaller number (A) is shifted right by the shift amount to make exponent of both equal and then added to or subtracted from other number, as per the signs of the two numbers. Normalizing unit normalizes the output to IEEE 754 format again.

A 32-bit register stores the value of the adder block, such that it can be added again to the new multiplied value. Figure 1f shows the circuit diagram of 1-bit register.

Power of the FP matrix multiplier in the present design is reduced by,

  1. (1)

    Using multiplier module with sleep transistor logic.

  2. (2)

    Reducing the bit width for the floating point multiplier input from 24 to 12 bits [8]. This method is called bit truncation.

The inputs to the multiplier in Fig. 1d are of 24 bits, so after multiplication output is of 48 bits. But as this output needs to be fitted in FP format again, rounding has to be done. This requires more power. Consequently to reduce power, bit truncation is used in which inputs are reduced to 12 bits, giving output of 24 bits. This can be effectively used in most of the applications like sensors, image processing, etc., where limited accuracy can be tolerated.

3 Simulation Results

Simulations are performed on circuits made using UMC 180 nm and gpdk 45 nm technology node library in Cadence Virtuoso, so as to compare low-power designs of both technologies in terms of EDP. Figure 4a shows the comparison of EDP for different multiplier architectures. It is evident from the figure that Braun multiplier is best in terms of lower EDP and minimum energy-delay point is observed at 1 V. So, this multiplier is built again using 45 nm technology cells and compared under same conditions with 180 nm circuit. It is clear from Fig. 4b that the best case (1 V) value of 45 nm circuit is lower than the best case (1 V) value of 180 nm circuit.

Fig. 4
figure 4

a EDP of different multipliers at 180 nm technology node. b EDP comparison of Barun multiplier at 180 and 45 nm technology nodes

The functionality of FP multiplier and adder units have been analyzed. It has been investigated that FP multiplier discussed here has an improvement of 61.9% in EDP with respect to [9] (904.2E−21 J) and FP adder has an improvement of 94.8% in power with respect to [10] (5.6 mW). For logic and functionality verification, single element of the FP matrix multiplier is implemented and presented below. Using Eq. (2), the first element of the output matrix is computed as:

$$Z_{11} = X_{11} Y_{11} + X_{12} Y_{21} + X_{13} Y_{31} + X_{14} Y_{41} ,$$

Let,

$$\begin{aligned} X_{11} & = - 16.23998, \, X_{12} = - 12.55799, \, X_{13} = 1.26839, \, X_{14} = 1.26839, \\ Y_{11} & = - 12.2398, \, Y_{21} = 0.01246,\,Y_{31} = 124.8939,\,{\text{and}}\,Y_{41} = 0.01246. \\ \end{aligned}$$

Here X 11, X 12, X 13, and X 14 are the inputs to MUX1, which is first element in Fig. 1b. These comprise row vector. While Y 11, Y 21, Y 31, and Y 41 are the inputs to MUX2, these comprise column vector. M1 (X 11 × Y 11), M2 (X 12 × Y 21), M3 (X 13 × Y 31), and M4 (X 14 × Y 41) are corresponding products of the row and column values. Z is the final output, which is M1 + M2 + M3 + M4. The output from the simulator ADE-XL of virtuoso is shown in Fig. 5. Table 1 shows the simulated result of a single element Z (equals to M1 + M2 + M3 + M4) for floating point matrix multiplier. It is analyzed that the error between desired and simulated output is 0.044% which is because of rounding the product of multiplication to 24 bits as in FP multiplication algorithm. Table 2 presents power, delay, and EDP obtained from the simulation results of FP matrix multiplier at 180 nm for normal voltage of 1.8 V and low-power design at 1 V. The same FP matrix multiplier circuit is implemented for 45 nm technology node. Table 3 compares the results for 180 and 45 nm technology at 1 V.

Fig. 5
figure 5

Simulation table for floating point matrix multiplier

Table 1 Output of first element of floating point matrix multiplier (180 nm)
Table 2 Simulation results of floating point matrix multiplier at 180 nm
Table 3 Comparison of logic circuits at 180 and 45 nm at supply of 1 V

4 Conclusion

The floating point matrix multiplier operating at low voltages is designed and simulated at 180 and 45 nm technology nodes. Various parameters like propagation delay, power dissipation, and energy-delay product have been studied. Based on the study, it is concluded that best performance (in terms of EDP) is achieved by fixing supply voltage to 1 V at 180 nm node. The FP matrix multiplier circuits made by using 45 nm technology reduce delay by 43%, power by 93.44%, and EDP by 97.86% as compared to circuits made at 180 nm technology cells at 1 V. In addition, the 45 nm technology cell occupies only 6.25% of the area as compared to 180 nm cells.