Abstract
Constant multiplier performs a multiplication of a data-input with a constant value. Constant multipliers are essential components in various types of arithmetic circuits, such as filters in digital signal processor (DSP) units and they are prevalent in modern VLSI designs. This study presents efficient algorithms and their fast hardware implementation for performing multiplying-by-(2k ± 1), or (2k ± 1)N, operation with additions. No multiplications are needed. The value of (2k ± 1)N can be computed by adding (±N) to its k-bits left-shifted value 2kN. The additions can be performed by the full-adder-based (FA-based) ripple carry adder (RCA) for simple architecture. This paper presents the unit cells for additions (UCAs). Results show that the UCA-based RCA achieves 34 % faster than the FA-based RCA. Further, in order to improve the speed performance with lower hardware cost, this paper also presents a simple and modular hybrid adder with the proposed UCA concept, where the hybrid adder takes the lower-bit carry lookahead adder (CLA) as a module and many of the CLA modules are serially connected in a fashion similar to the RCA. Results show that the proposed hybrid adder achieved speed performance improvement while maintaining its modular and regular structure.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Constant multiplier performs a multiplication of a data-input with a constant value. Constant multipliers are essential components in various types of arithmetic circuits, such as filters in digital signal processor (DSP) units, dominate the hardware complexity of digital filters [1]. In addition, they are prevalent in modern VLSI designs.
The multiplication by a fixed-point constant can be done “multiplier-less” using additions and shifts only. In such filters the number of adders determines the implementation cost. Since the shifters are implemented as hard-wired inter-block connections, they are considered “free” in transposed implementation of an FIR filter; each input is multiplied by several coefficients [1–3]. Constant multiplier design has been investigated for several decades. However, the emphasis was placed on minimizing the number of additions required to achieve the multiplication of a given constant [4].
In this study, the emphasis is placed on performing the constant multiplication with a faster adder in only one addition operation. This paper targets the development of the multiplication of a constant (2k ± 1). The value of (2k + 1)N can be computed by adding N to its k-bit left-shifted value 2kN. On the other hand, The value of (2k-1)N can be computed by subtracting N from its k-bits left-shifted value 2kN, or adding (−N) to 2kN. The additions can be performed by a simple ripple carry adder (RCA), or a higher speed carry-lookahead adder (CLA).
The unit cells for additions (UCAs) are introduced in this study to construct the UCA-based RCA. Results will show that the UCA-based RCA achieves approximately 34 % faster than Full-adder-(FA)-based RCAs. In order to further improve the speed performance, a simple and modular hybrid adder is also presented, where reasonably smaller bit size of CLA is used as a module and many modules are serially connected in a fashion similar to the RCA.
In the next section, the conventional multiplication of a constant (2k ± 1) using FA-based RCA is discussed. Section III presents the proposed UCA-based RCA structures for (2k ± 1)N operation. Section IV describes a hybrid adder for the constant multiplication. Finally, a brief concluding remark is given in Section V.
2 FA-based RCAs for (2k ± 1)N Operations
Let N = (an-1an-2…a0) be an n-bit number. For the (2k + 1)N operation, there exists a number m such that n = m × k (if n < m × k, sign extension is applied). Thus, N can be expressed as (Am-1Am-2…A0), where Ai = (a(i+1)k-1…aik+1aik), i = 0, 1, …,m-1, and the (2k + 1)N operation can be performed by adding N to its k-bits left-shifted value 2kN, i.e., (1 + 2k)N = N + 2kN, where 2kN = (Am-1Am-2…A00) and 0 = (0..00).
The 3N operation is performed by 3N = N + 2 N,
and the 9N = N + 8 N operation with k = 3 is operated as
On the other hand, (2k-1)N = 2kN+(−N) = 2kN + N* + 1, where N* is the bit-complement of N, i.e., N* = (Am-1’Am-2’…A0’) and Ai’ = (a(i+1)k-1’…aik+1’aik’), i = 0,1,…,m-1. Thus, (2k-1)N operation can be performed by adding the k-bits left-shifted value 2kN to N* with an initial carry of 1. For example, the 7 N operation is performed as follows,
Figure 1a shows a full-adder (FA) cell [5, 6]. A FA cell takes two data inputs, ai-1 and ai, and a carry input bit ci-1, and produces a carry output bit ci and a sum bit si, for i = 0 ~ n, where a−1 = 0 and c−1 = 0. The logic functions are
The point-to-point delays of the FA and HA (Half-adder) cells are
Figure 1b shows an (n + 1)-bit FA-based RCA for 3 N operation. The RCA is comprised of (n-2) FAs and 2 HAs connected in series. The critical path includes the input-to- carry of the right-most HA (Δic(HA)), the carry-to-carry (Δcc(FA)) of (n-2) FAs, and the carry-to-sum (Δcs(HA)) of the left-most HA, i.e.,
In general, the critical path delay of the (km)-bits FA-based RCA for (2k + 1)N operation can be expressed as
Similarly, the critical path delay of the (km)-bits FA-based RCA for (2k-1)N operation can be expressed as
The term, (HA)* in (9), indicate a HA resulted from a FA with an input “1”, where Δ ic(HA) * = Δ inv + Δ NAND2, Δ cc(HA) * = Δ NAND2, and Δ cs(HA) * = Δ XNOR.
By (8) and (9), with k = 3, the delays of the RCA in Fig. 1c and d for 9 N and 7 N operations respectively are
Finally, Fig. 1e illustrates the (3 m)-bits FA-based RCA with dual mode for both 7 N and 9 N operations, where mode = 0 for 9 N operation and mode = 1 for 7 N operation. Similarly, one can easily realize a (km)-bits FA-based RCA for (2k ± 1)N operations.
3 UCA-based RCAs for (2k ± 1)N Operations
This section presents the UCA-based RCAs for (2k + 1)N operation. The implementation concept is readily applied for (2k-1)N operations.
3.1 UCA-based RCA for 3 N Operation
Consider the 3 N operation, as illustrated in (2). By Shannon expansion theorem, the carry and sum functions in (5) can be re-written as
where W0i-1 = ai-1ci-1 and W1i-1 = ai-1 + ci-1. To construct the carry propagation paths, we consider both W0i and W1i, in the next stage. By (12), we can easily derive
By (13), the functions ui can be expressed as
Figure 2a shows a UCA cell for 3 N operation. The cell is comprised of three blocks: Carry Propagation Path (CPP) block, Function ui Generation (FUG) block, and Sum Generation (SMG) block. The critical path of an n-bit UCA-based RCA, as illustrated in Fig. 2b, includes an NAND gate in cell#1, (n-3) UCAs in cell#2 to cell#(n-2), and the inverter, NOR2, and XOR gates in cell#(n-1), where ΔUCA = ΔNOR2, i.e., the delay is
3.2 UCA-based RCA for 5 N Operation
Consider the 5 N operation, as shown in (1) with k = 2, where Ai = (a2i+1,a2i), i = 0,1,…,m-1, and n = 2 m. (Note that a zero is added as the most significant bit if n is not an even number.) Two additions, (a2i+1 + a2i-1 + c2i) and (a2i + a2i-2 + c2i-1), are performed. By (5),
By (17), plugging c2i to c2i+1, we have
where
The 2-bits UCA cell for 5 N, referred to as a UCA2 cell can be derived as in the following property.
Property 1
The logic functions of the CPP block are expressed as
and the functions u2i and u2i+1 of the FUG block are
Proof
Consider W002i+1 = a2i+1a2ic2i+1, with c2i+1 in (18), we obtain W002i+1 = a2i+1a2iW112i-1. Similarly, one can easily derive the remaining terms in (19). By (17),
Since
and since
Similarly,
Figure 3a shows the UCA2 cell, where the CPP block can be realized by using NAND2/NOR2 gates, as illustrated in Fig. 3b. The critical path, as indicated in red, of the (2 m)-bits UCA2-based RCA in Fig. 3c includes an INV and an NOR2 gate in cell #0, two NOR2 gates in each of cells #1 to #(m-2), and the FUG and SMG blocks in cell #(m-1). Thus, the propagation delay is
3.3 UCA-based RCA for (2k + 1)N Operation
We first construct the CPP block and then the FUG and SMG blocks
CPP Block
Let rj be a binary number, either a 0 or a 1,
Thus, W012i+1 in (19) can be expressed as
where r1’ and r0’ are the bit-complement of r1 and r0, respectively.
Figure 3b shows the block diagram of the UCA2 cells, where the CPP block is highlighted. The CPP block contains 4 carry propagation paths, Pa, a = 0 ~ 3, and each path contains 2 gates, Jab, b = 0 or 1. The gate type of Jab is determined by the index of the path output Wab. For example, the output W10 of the path pa which includes both gates J21 and J20. Here, W10 means a = 1 and b = 0, i.e., J21 is an OR gate (because a = 1) and J20 is an AND gate (because b = 0). Fig. 4a is a symbolic representation of that in Fig. 3b.
Similarly, the CPP block of the UCAk cells for (2k + 1)N operation can be expressed as
Figure 4b shows the symbolic representation of the logic function for the CPP block of the UCAk cell and its delay is ΔUCA(k) = kΔNOR2.
FUG and SMG Blocks
Let U(x0) and U(x1) be defined as follows: (“x” means “don’t care term”)
U(x0) takes the terms Wx0, while U(x1) is formed by the terms Wx1. Let V(0x) and V(1x) be defined as
Thus, (19) can be expressed as
Figure 5a is the symbolic representation of FUG and SMG blocks of UCA2 cell for 5 N operation, where U(x0)* is an AND3 gate with 3 inputs a2i’, W002i-1’, and W102i-1, and U(x1)* is that with a2i, W012i-1’, and W112i-1. Thus, the delay path, as shown in Fig. 5b, includes an AND3, a OR2, and a XOR, and it can be implemented with an INV, an NAND3, an NAND2, and a XOR as follows,
Similarly, the UCA3 cell for 9 N operation, the FUG block can be expressed as follows,
Figure 5c illustrates the delay path for 9 N operation, where u3i+2, in (29), can be realized by four AND4 gates and one OR4 gates. However, the term a3i+1’a3i’U(x00) can be realized by an AND2 with a3i+1’ and a3i’ and an AND3 which takes the output of AND2 and two inputs of U(x00). Since the function AND2 with a3i+1’ and a3i’ can be pre-calculated and the AND2 gate will not be in the critical path, as shown in Fig. 5c. The OR4 is realized by a NOR2 followed by a NAND2. This results in an INV, an NAND3, an NOR2, and an NAND2 included in the critical path of the FUG block of the UCA3 cell, i.e.,
Similarly, the delay path in Fig. 5d for UCA cell is,
3.4 Performance Evaluation
The proposed design methodology is mainly about the structure of hardware implementation which is based on the newly developed equations. The actual realization of the proposed UCA, for example, may vary significantly. Therefore, the following performance comparisons were conducted assuming the same type of hardware realization. Thus, the speed performance estimation only consider the total gate delay in the critical path under the same conditions, where the path delays and loading effects were not considered for this rough estimation. In fact, the performances are evaluated based on the gate delays of the standard cells in TSMC 0.18 μm CMOS process technology, as tabulated in Table 1 [2], where the cell height is 5.04 μm.
The delays of UCA-based RCAs are, n = 2t, t ≥ 4, where (Δinv + (k-1)ΔNOR2) is in Cell#0 for 5 N and 17 N operations,
By (8), the delays of FA-based RCAs are, n = 2t, t ≥ 4,
Based on (32), (33), and Table 1, the delays of both FA-based RCAs and UCA-based RCAs for 3 N (k = 1), 5 N (k = 2), and 17 N (k = 4) with the 16 ~ 1024 bits were computed and the values were tabulated in Table 2. Results show that UCA-RCA for 17 N operation has slightly better than that for 5 N operation. This is so simply because (2ΔNOR2 + ΔNAND2) = 0.1176 ns and (ΔNAND3 + ΔNOR3) = 0.1044 ns, where the difference is about 0.01 ns.
Based on Table 2, Fig. 6a and b plot the delays of both FA-based RCA and UCA-based RCA, respectively, for 3 N, 5 N, and 17 N operations. Results show that their delays are almost the same for the same size with the same approach.
Table 2 tabulates the normalized speed performance, where the delay of FA-based RCA is normalized and set to 1. The delay ratios of UCA-based UCAs for various bit sizes are shown in Fig. 6c. Results show that the delay ratios are approximately 65 % which can be simply calculated from the delay ratio of UCA cell (ΔNOR2) over a FA cell (2ΔNAND2), i.e., Δ NOR2/(2Δ NAND2) = 0.0426/0.0648 = 0.66 = 66 %.
Even though the proposed UCA-based RCAs are about 34 % faster than the FA-based ones, their speed performance is still too slow particularly for wider bit sizes. The following section attempts to further improve the speed performance.
4 Hybrid Adders for (2k ± 1)N Operations
As mentioned, the RCA achieves lower hardware cost, while the CLA offers high speed performance. In order to achieve higher speed performance for the additions, a hybrid adder combining RCA and CLA (or generate-propagate adder), as shown in Fig. 7, is presented. The adder is simple and modular, where the lower bit CLA is taken as a module.
4.1 UCA-based Hybrid Adder for 3N Operation
For 3N operation, by (14), the carry-out bits at the first 4 stages of the CPP blocks can be written as follows,
Thus, both W03 and W13 can be written as
where
Similarly, both W07 and W17 are expressed as
where
By (35), both W07 and W17 in (37) can be re-written as
Figure 8a shows a 32-bits hybrid adder, where each CLA unit processes 8-bit data. Each CLA unit, adopting the parallel prefix adder structure [7, 8], is comprised of two 4-bit RQ generator units (RQGUs, or RQGU-4), a 2-bit CLA (or CLA-2), and two 4-bit Function U generation units (FUGUs, or FUGU-4). The RQGU-4, as shown in Fig. 8b, realizes the functions in (34) and it replaces both PG unit and Block CLA-4 (BCLA-4) in the conventional CLA structure. The CLA-2, as illustrated in Fig. 8c, implements the functions in (35) and (38), and the FUGU-4 in Fig. 8d, is for the functions in (34) and (15). Note that ci = (W0i,W1i) in Fig. 8a, The delays of these units can be expressed as
The critical path of the 32-bit hybrid adder includes a RQGU-4, four CLA-2 s, a FUGU-4, and a SUM, i.e.,
In general, for n = 2t, t > 3, the delay
Similarly, the 4-bit CLA module of the hybrid adder in Fig. 8 can be replaced by 8-bit one. The delay is
where
The CLA module of the hybrid adder may include multi-level structure similar to the conventional CLA. For example, the CLA module of the 64-bit hybrid adder in Fig. 9 is with 2-level CLA structure. Let L denote as the number of levels in the CLA module. Thus, the delay of such hybrid adder is, r = m*2L, m = 4 or 8, L ≥ 2, and n ≥ 2r,
4.2 UCA-based Hybrid Adder for (2k + 1)N Operation
Similar to the derivation process for 3 N operation in (35)-(38), the following component delays can be derived from Property 1 for 5 N operation, where
Similarly, one can also derive for 17 N operation as
For 9N operation with k = 3, the UCA3 cells are employed and each cell contains 3 bits. Thus, the delays are
Similarly, for (2k + 1)N operations, the k-bit cells, UCAk cells, are employed. Thus, the hybrid adder may include the k-bit CLA modules.
4.3 Performance Evaluation
This section roughly estimates the speed performances of the FA-based RCAs, UCA-based RCA, and the hybrid adders with various bit sizes for 3N, 5N, 9N, and 17N operations. However, the performance evaluation process is readily applied for any bit sizes and (2k ± 1)N operations.
The speed performance is roughly evaluated based on the gate delays of the standard cells listed in Table 1, where the AND (NAND) and OR (NOR) gates are limited to their fan-ins up to 3. Thus, the AND5 and OR4 gates in (41) are implemented in two-level logic gates, i.e., AND5 = (AND3) •(AND2), and OR4 = (OR2) + (OR2).
The delays of the RQGU-4 and RQGU-8 for 3N, 5N, and 17N operations are the same. By Table 1,
Based on (39, 40, 41, 42, 43, 44, 45, and 46), the delays of the hybrid adders are calculated and tabulated in Table 3.
Results show that the delays are indeed improved as the number of levels, L, increases. For example, for 17 N operation with 1024 bits, the delays was improved from 66.51 ns for the FA-based RCA to 43.64 ns for the proposed UCA-based RCA, as shown in Table 2, Further, the delay is reduced to 12.07 ns and 6.44 ns for the proposed hybrid adders with 4-bits and 8-bits modules, respectively, as illustrated in Table 3 and plotted in Fig. 10a. The delays can be further decreased to 2.31 ns and 1.76 ns by using the proposed hybrid with 4-levels of 4-bits and 8-bits modules, respectively, as plotted in Fig. 10b.
4.4 Discussion
The proposed UCA design concept has the advantages of better speed performance than the conventional FA design approach. This sub-section compares the speed performance of the UCA-CLA and conventional CLA (CV-CLA). The delays of both BCLA-4 and CLA-4 [5–7] are
For n-bit CV-CLAs, n = 4t, with CLA-4 and BCLA-4 modules, the delay is
and the delay of an n-bit UCA-CLA is
Note that the delay (ΔRQGU-4 + ΔFUGU-4) in UCA-CLA is compared with (ΔPGU + 2ΔBCLA-4) in CV-CLA. Table 4 compares the speed performance of CV-CLA and UCA- CLA for 3 N, 5 N, and 17 N operations with various bit sizes.
The UCA-CLAs for 3N, 5N, and 17N operations respectively achieve approximately 30 %, 25 %, and 15 % less delays than CV-CLA for 16-bit addition, and about 13 %, 11 %, and 7 % for 1024 bit addition. As mentioned, the delay of FUGU-2k block increases with k considerably. Thus, the proposed UCA-CLAs may gain the advantage of better speed performance for lower values of k. The proposed simple and modular UCA-RCA are perfectly applied for those constant multiplications with smaller bit sizes, such as 3 N, 5 N, and 7 N operations for arithmetic coding, Booth encoding, and the divider design [9] as a cost-effective solution. For higher speed performance, the proposed UCA-CLA can also be applied.
5 Conclusion
Constant multiplier performs a multiplication of a data-input with a constant value. They are essential components in various types of arithmetic circuits, such as filters in digital signal processor (DSP) units and they are prevalent in modern VLSI designs. This study has presented efficient algorithm and fast hardware implementation for (2k ± 1)N operation with additions. No multiplications were needed.
The salient features of the proposed UCA design concept came from the use of the carry function of (12) instead of (5), thus improving the propagation delay path and achieving 34 % in speed performance improvement. The design concept is perfectly applied for the RCA design for faster speed. The proposed UCA-HyA provides a cost-effective solution for constant multiplication, and the proposed UCA-CLA further improves the speed performance.
In addition, the study presents the hardware implementation for (2k + 1)N operation, how to apply the proposed UCA design concept for any other form of constants with one addition also leads a very interesting topic for future research. Finally, as mentioned in [2], the constant divider can be developed in a similar way.
References
Cappello, P. R., & Steiglitz, K. (1984). Some complexity issues in digital signal processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(No. 5), 1037–1041.
Wey, C. L., Jui, P.-C., & Sung, G.-N. (2014). Efficient Multiply-by-3 and Divide-by-3 Algorithms and Their Fast Hardware Implementation. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E97-A(2), 616–623.
Jui, P.-C., Sung, G.-N., & Wey, C. L. (2012). Efficient algorithm and hardware implementation of 3N for arithmetic and for radix-8 encodings, Proc. of IEEE Midwest Symp. on Circuits and Systems, Boise, Idaho (pp. 418–421).
Oudjida, A. K., & Chaillet, N. (2014). Radix-2r Arithmetic for multiplication by a constant. IEEE Transactions on Circuits and Systems – II: Express Briefs, 61, 349–353.
Koren, I. (1993). Computer arithmetic algorithms. New Jersey: Prentice-Hall, Inc.
Huang, K. (1979). Computer arithmetic: principles, architecture, and design. New York: Wiley.
Chang, S.-K., & Wey, C. L. (2012). A fast 64-bit hybrid adder design in 90 nm CMOS process. Boise: Proc. of IEEE Midwest Symp. on Circuits and Systems.
Brent, R. P., & Kung, H. T. (1982). A regular layout for parallel adders. IEEE Transactions on Computers, 31(3), 260–264.
Wey, C. L. (1996). Built-In Self-Test (BIST) design of high-speed carry-free dividers. IEEE Transactions on VLSI Systems, 4(1), 141–145.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Jui, PC., Wey, CL. & Shiue, MT. Multiplication of a Constant (2k ± 1) and Its Fast Hardware Implementation. J Sign Process Syst 82, 41–53 (2016). https://doi.org/10.1007/s11265-015-0978-4
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11265-015-0978-4