1 Introduction

Constant multiplier performs a multiplication of a data-input with a constant value. Constant multipliers are essential components in various types of arithmetic circuits, such as filters in digital signal processor (DSP) units, dominate the hardware complexity of digital filters [1]. In addition, they are prevalent in modern VLSI designs.

The multiplication by a fixed-point constant can be done “multiplier-less” using additions and shifts only. In such filters the number of adders determines the implementation cost. Since the shifters are implemented as hard-wired inter-block connections, they are considered “free” in transposed implementation of an FIR filter; each input is multiplied by several coefficients [13]. Constant multiplier design has been investigated for several decades. However, the emphasis was placed on minimizing the number of additions required to achieve the multiplication of a given constant [4].

In this study, the emphasis is placed on performing the constant multiplication with a faster adder in only one addition operation. This paper targets the development of the multiplication of a constant (2k ± 1). The value of (2k + 1)N can be computed by adding N to its k-bit left-shifted value 2kN. On the other hand, The value of (2k-1)N can be computed by subtracting N from its k-bits left-shifted value 2kN, or adding (−N) to 2kN. The additions can be performed by a simple ripple carry adder (RCA), or a higher speed carry-lookahead adder (CLA).

The unit cells for additions (UCAs) are introduced in this study to construct the UCA-based RCA. Results will show that the UCA-based RCA achieves approximately 34 % faster than Full-adder-(FA)-based RCAs. In order to further improve the speed performance, a simple and modular hybrid adder is also presented, where reasonably smaller bit size of CLA is used as a module and many modules are serially connected in a fashion similar to the RCA.

In the next section, the conventional multiplication of a constant (2k ± 1) using FA-based RCA is discussed. Section III presents the proposed UCA-based RCA structures for (2k ± 1)N operation. Section IV describes a hybrid adder for the constant multiplication. Finally, a brief concluding remark is given in Section V.

2 FA-based RCAs for (2k ± 1)N Operations

Let N = (an-1an-2…a0) be an n-bit number. For the (2k + 1)N operation, there exists a number m such that n = m × k (if n < m × k, sign extension is applied). Thus, N can be expressed as (Am-1Am-2…A0), where Ai = (a(i+1)k-1…aik+1aik), i = 0, 1, …,m-1, and the (2k + 1)N operation can be performed by adding N to its k-bits left-shifted value 2kN, i.e., (1 + 2k)N = N + 2kN, where 2kN = (Am-1Am-2…A00) and 0 = (0..00).

$$ \begin{array}{llllllllll}\mathrm{N}\hfill & 0\hfill & {\mathrm{A}}_{\mathrm{m}-1}\hfill & \dots \hfill & {\mathrm{A}}_{\mathrm{i}+1}\hfill & {\mathrm{A}}_{\mathrm{i}}\hfill & {\mathrm{A}}_{\mathrm{i}-1}\hfill & \dots \hfill & {\mathrm{A}}_1\hfill & {\mathrm{A}}_0\hfill \\ {}\underset{\bar{\mkern6mu}}{+{2}^{\mathrm{k}}\mathrm{N}}\hfill & {\mathrm{A}}_{\mathrm{m}-1}\hfill & {\mathrm{A}}_{\mathrm{m}-2}\hfill & \dots \hfill & {\mathrm{A}}_{\mathrm{i}}\hfill & {\mathrm{A}}_{\mathrm{i}-1}\hfill & {\mathrm{A}}_{\mathrm{i}-2}\hfill & \dots \hfill & {\mathrm{A}}_0\hfill & 0\hfill \\ {}\left(1+{2}^{\mathrm{k}}\right)\mathrm{N}\hfill & {\mathrm{S}}_{\mathrm{m}}\hfill & {\mathrm{S}}_{\mathrm{m}-1}\hfill & \dots \hfill & {\mathrm{S}}_{\mathrm{i}+1}\hfill & {\mathrm{S}}_{\mathrm{i}}\hfill & {\mathrm{S}}_{\mathrm{i}-1}\hfill & \dots \hfill & {\mathrm{S}}_1\hfill & {\mathrm{S}}_0\hfill \end{array} $$
(1)

The 3N operation is performed by 3N = N + 2 N,

$$ \begin{array}{lllllll}\mathrm{N}\hfill & 0\hfill & {\mathrm{a}}_{\mathrm{n}-1}\hfill & {\mathrm{a}}_{\mathrm{n}-2}\hfill & \dots \hfill & {\mathrm{a}}_1\hfill & {\mathrm{a}}_0\hfill \\ {}\underset{\bar{\mkern6mu}}{ + 2\mathrm{N}}\hfill & {\mathrm{a}}_{\mathrm{n}-1}\hfill & {\mathrm{a}}_{\mathrm{n}-2}\hfill & {\mathrm{a}}_{\mathrm{n}-3}\hfill & \dots \hfill & {\mathrm{a}}_0\hfill & 0\hfill \\ {}3\mathrm{N}\hfill & {\mathrm{s}}_{\mathrm{n}}\hfill & {\mathrm{s}}_{\mathrm{n}-1}\hfill & {\mathrm{s}}_{\mathrm{n}-2}\hfill & \dots \hfill & {\mathrm{s}}_1\hfill & {\mathrm{s}}_0\hfill \end{array} $$
(2)

and the 9N = N + 8 N operation with k = 3 is operated as

$$ \begin{array}{lllllllll}\mathrm{N}\hfill & 0\hfill & 0\hfill & 0\hfill & {\mathrm{a}}_{3\mathrm{m}-1}\hfill & \dots \hfill & {\mathrm{a}}_{3\mathrm{i}+2}{\mathrm{a}}_{3\mathrm{i}+1}{\mathrm{a}}_{3\mathrm{i}}\hfill & \dots \hfill & {\mathrm{a}}_3{\mathrm{a}}_{2\ }{\mathrm{a}}_1{\mathrm{a}}_0\hfill \\ {}\underset{\bar{\mkern6mu}}{ + 8\mathrm{N}}\hfill & {\mathrm{a}}_{3\mathrm{m}-1}\hfill & {\mathrm{a}}_{3\mathrm{m}-2}\hfill & {\mathrm{a}}_{3\mathrm{m}-3}\hfill & {\mathrm{a}}_{3\mathrm{m}-4}\hfill & \dots \hfill & {\mathrm{a}}_{3\mathrm{i}-1\ }{\mathrm{a}}_{3\mathrm{i}-2\ }{\mathrm{a}}_{3\mathrm{i}-3}\hfill & \dots \hfill & {\mathrm{a}}_{0\ }00\ 0\hfill \\ {}9\mathrm{N}\hfill & {\mathrm{s}}_{3\mathrm{m}+2}\hfill & {\mathrm{s}}_{3\mathrm{m}+1}\hfill & {\mathrm{s}}_{3\mathrm{m}}\hfill & {\mathrm{s}}_{3\mathrm{m}-1}\hfill & \dots \hfill & {\mathrm{s}}_{3\mathrm{i}+2}{\mathrm{s}}_{3\mathrm{i}+1}{\mathrm{s}}_{3\mathrm{i}}\hfill & \dots \hfill & {\mathrm{s}}_3{\mathrm{s}}_{2\ }{\mathrm{s}}_1{\mathrm{s}}_0\hfill \end{array} $$
(3)

On the other hand, (2k-1)N = 2kN+(−N) = 2kN + N* + 1, where N* is the bit-complement of N, i.e., N* = (Am-1’Am-2’…A0’) and Ai’ = (a(i+1)k-1’…aik+1’aik’), i = 0,1,…,m-1. Thus, (2k-1)N operation can be performed by adding the k-bits left-shifted value 2kN to N* with an initial carry of 1. For example, the 7 N operation is performed as follows,

$$ \begin{array}{lllllllll}\mathrm{N}*\hfill & 1\hfill & 1\hfill & 1\hfill & {\mathrm{a}}_{3\mathrm{m}-1}'\hfill & \dots \hfill & {\mathrm{a}}_{3\mathrm{i}+2}'{\mathrm{a}}_{3\mathrm{i}+1}'{\mathrm{a}}_{3\mathrm{i}}'\hfill & \dots \hfill & {\mathrm{a}}_3'{\mathrm{a}}_2'{\mathrm{a}}_1'{\mathrm{a}}_0'\hfill \\ {}\underset{\bar{\mkern6mu}}{ + 8\mathrm{N}}\hfill & {\mathrm{a}}_{3\mathrm{m}-1}\hfill & {\mathrm{a}}_{3\mathrm{m}-2}\hfill & {\mathrm{a}}_{3\mathrm{m}-3}\hfill & {\mathrm{a}}_{3\mathrm{m}-4}\hfill & \dots \hfill & {\mathrm{a}}_{3\mathrm{i}-1\kern0.5em }{\mathrm{a}}_{3\mathrm{i}-2\kern0.5em }{\mathrm{a}}_{3\mathrm{i}-3}\hfill & \dots \hfill & {\mathrm{a}}_{0\ }00\kern0.5em 0\hfill \\ {}7\mathrm{N}\hfill & {\mathrm{s}}_{3\mathrm{m}+2}\hfill & {\mathrm{s}}_{3\mathrm{m}+1}\hfill & {\mathrm{s}}_{3\mathrm{m}}\hfill & {\mathrm{s}}_{3\mathrm{m}-1}\hfill & \dots \hfill & {\mathrm{s}}_{3\mathrm{i}+2\ }{\mathrm{s}}_{3\mathrm{i}+1\kern0.5em }{\mathrm{s}}_{3\mathrm{i}}\hfill & \dots \hfill & {\mathrm{s}}_3{\mathrm{s}}_{2\ }{\mathrm{s}}_1{\mathrm{s}}_0\hfill \end{array} $$
(4)

Figure 1a shows a full-adder (FA) cell [5, 6]. A FA cell takes two data inputs, ai-1 and ai, and a carry input bit ci-1, and produces a carry output bit ci and a sum bit si, for i = 0 ~ n, where a−1 = 0 and c−1 = 0. The logic functions are

Figure 1
figure 1

FA-based RCA: a FA cell; b for 3N; c for 9N; d for 7N; and e for dual mode – 7N and 9N.

$$ \begin{array}{c}\hfill {\mathrm{s}}_{\mathrm{i}}={\mathrm{a}}_{\mathrm{i}}\oplus {\mathrm{a}}_{\mathrm{i}\hbox{-} 1}\oplus {\mathrm{c}}_{\mathrm{i}\hbox{-} 1}\hfill \\ {}\hfill {\mathrm{c}}_{\mathrm{i}}=\left({\mathrm{a}}_{\mathrm{i}}\oplus {\mathrm{a}}_{\mathrm{i}\hbox{-} 1}\right){\mathrm{c}}_{\mathrm{i}\hbox{-} 1}+{\mathrm{a}}_{\mathrm{i}}{\mathrm{a}}_{\mathrm{i}\hbox{-} 1}\hfill \end{array} $$
(5)

The point-to-point delays of the FA and HA (Half-adder) cells are

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{cc}\left(\mathrm{F}\mathrm{A}\right)}={\varDelta}_{\mathrm{carry}-\mathrm{in}-\mathrm{t}\mathrm{o}-\mathrm{carry}-\mathrm{out}\left(\mathrm{F}\mathrm{A}\right)}=2{\varDelta}_{\mathrm{NAND}2}\hfill \\ {}\hfill {\varDelta}_{\mathrm{cs}\left(\mathrm{F}\mathrm{A}\right)}={\varDelta}_{\mathrm{carry}-\mathrm{t}\mathrm{o}-\mathrm{sum}\left(\mathrm{F}\mathrm{A}\right)}={\varDelta}_{\mathrm{XOR}}\hfill \\ {}\hfill {\varDelta}_{\mathrm{ic}\left(\mathrm{F}\mathrm{A}\right)}={\varDelta}_{\mathrm{input}-\mathrm{t}\mathrm{o}-\mathrm{carry}\left(\mathrm{F}\mathrm{A}\right)}={\varDelta}_{\mathrm{XOR}}+2{\varDelta}_{\mathrm{NAND}2}\hfill \\ {}\hfill {\varDelta}_{\mathrm{cc}\left(\mathrm{H}\mathrm{A}\right)}={\varDelta}_{\mathrm{ic}\left(\mathrm{H}\mathrm{A}\right)}={\varDelta}_{\mathrm{NOR}2}\;{\varDelta}_{\mathrm{cs}\left(\mathrm{H}\mathrm{A}\right)}={\varDelta}_{\mathrm{XOR}}\hfill \end{array} $$
(6)

Figure 1b shows an (n + 1)-bit FA-based RCA for 3 N operation. The RCA is comprised of (n-2) FAs and 2 HAs connected in series. The critical path includes the input-to- carry of the right-most HA (Δic(HA)), the carry-to-carry (Δcc(FA)) of (n-2) FAs, and the carry-to-sum (Δcs(HA)) of the left-most HA, i.e.,

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{RCA}\left(\mathrm{F}\mathrm{A}\right)\left(3\mathrm{N}\right)}={\varDelta}_{\mathrm{ic}\left(\mathrm{H}\mathrm{A}\right)}+\left(\mathrm{n}-2\right){\varDelta}_{\mathrm{cc}\left(\mathrm{F}\mathrm{A}\right)}+{\varDelta}_{\mathrm{cs}\left(\mathrm{H}\mathrm{A}\right)}\hfill \\ {}\hfill ={\varDelta}_{\mathrm{NOR}2}+2\left(\mathrm{n}-2\right){\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{XOR}}\hfill \end{array} $$
(7)

In general, the critical path delay of the (km)-bits FA-based RCA for (2k + 1)N operation can be expressed as

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{RCA}\left(\mathrm{F}\mathrm{A}\right)\left(\left({2}^{\mathrm{k}+1}\right)\mathrm{N}\right)}={\varDelta}_{\mathrm{ic}\left(\mathrm{H}\mathrm{A}\right)}+\left(\mathrm{k}\mathrm{m}-\mathrm{k}-1\right){\varDelta}_{\mathrm{cc}\left(\mathrm{F}\mathrm{A}\right)}+\left(\mathrm{k}-1\right){\varDelta}_{\mathrm{cc}\left(\mathrm{H}\mathrm{A}\right)}+{\varDelta}_{\mathrm{cs}\left(\mathrm{H}\mathrm{A}\right)}\hfill \\ {}\hfill =\mathrm{k}{\varDelta}_{\mathrm{NOR}2}+2\left(\mathrm{k}\mathrm{m}-\mathrm{k}-1\right){\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{XOR}}\hfill \end{array} $$
(8)

Similarly, the critical path delay of the (km)-bits FA-based RCA for (2k-1)N operation can be expressed as

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{RCA}\left(\mathrm{F}\mathrm{A}\right)\left(\left({2}^{\mathrm{k}\hbox{-} 1}\right)\mathrm{N}\right)}={\varDelta}_{\mathrm{ic}\left(\mathrm{H}\mathrm{A}\right)*}+\left(\mathrm{k}\mathrm{m}-\mathrm{k}-1\right){\varDelta_{\mathrm{cc}}}_{\left(\mathrm{F}\mathrm{A}\right)}+\left(\mathrm{k}-1\right){\varDelta}_{\mathrm{cc}\left(\mathrm{H}\mathrm{A}\right)*}+{\varDelta}_{\mathrm{cs}\left(\mathrm{H}\mathrm{A}\right)*}\hfill \\ {}\hfill ={\varDelta}_{\mathrm{inv}}+\left[2\left(\mathrm{k}\mathrm{m}-1\right)-\mathrm{k}\right]{\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{XNOR}}\hfill \end{array} $$
(9)

The term, (HA)* in (9), indicate a HA resulted from a FA with an input “1”, where Δ ic(HA) * = Δ inv + Δ NAND2, Δ cc(HA) * = Δ NAND2, and Δ cs(HA) * = Δ XNOR.

By (8) and (9), with k = 3, the delays of the RCA in Fig. 1c and d for 9 N and 7 N operations respectively are

$$ {\varDelta}_{\mathrm{RCA}\left(\mathrm{F}\mathrm{A}\right)\left(9\mathrm{N}\right)}=3{\varDelta}_{\mathrm{NOR}2}+2{\left(3\mathrm{m}-4\right)}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{XOR}} $$
(10)
$$ {\varDelta}_{\mathrm{RCA}\left(\mathrm{F}\mathrm{A}\right)\left(7\mathrm{N}\right)}={\varDelta}_{\mathrm{inv}}+\left(6\mathrm{m}-5\right){\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{XNOR}} $$
(11)

Finally, Fig. 1e illustrates the (3 m)-bits FA-based RCA with dual mode for both 7 N and 9 N operations, where mode = 0 for 9 N operation and mode = 1 for 7 N operation. Similarly, one can easily realize a (km)-bits FA-based RCA for (2k ± 1)N operations.

3 UCA-based RCAs for (2k ± 1)N Operations

This section presents the UCA-based RCAs for (2k + 1)N operation. The implementation concept is readily applied for (2k-1)N operations.

3.1 UCA-based RCA for 3 N Operation

Consider the 3 N operation, as illustrated in (2). By Shannon expansion theorem, the carry and sum functions in (5) can be re-written as

$$ {\mathrm{c}}_{\mathrm{i}}={\mathrm{a}}_{\mathrm{i}}'\left({\mathrm{a}}_{\mathrm{i}-1}{\mathrm{c}}_{\mathrm{i}-1}\right)+{\mathrm{a}}_{\mathrm{i}}\left({\mathrm{a}}_{\mathrm{i}-1}+{\mathrm{c}}_{\mathrm{i}-1}\right)={\mathrm{a}}_{\mathrm{i}}'\mathrm{W}{0}_{\mathrm{i}-1}+{\mathrm{a}}_{\mathrm{i}}\mathrm{W}{1}_{\mathrm{i}-1} $$
(12)
$$ {\mathrm{s}}_{\mathrm{i}} = {\mathrm{a}}_{\mathrm{i}}\oplus {\mathrm{u}}_{\mathrm{i}},\mathrm{where}\ {\mathrm{u}}_{\mathrm{i}}={\mathrm{a}}_{\mathrm{i}-1}\oplus {\mathrm{c}}_{\mathrm{i}-1} $$
(13)

where W0i-1 = ai-1ci-1 and W1i-1 = ai-1 + ci-1. To construct the carry propagation paths, we consider both W0i and W1i, in the next stage. By (12), we can easily derive

$$ \begin{array}{l}\mathrm{W}{0}_{\mathrm{i}} = {\mathrm{a}}_{\mathrm{i}}{\mathrm{c}}_{\mathrm{i}}={\mathrm{a}}_{\mathrm{i}}\left({\mathrm{a}}_{\mathrm{i}}'\mathrm{W}{0}_{\mathrm{i}-1}+{\mathrm{a}}_{\mathrm{i}}\mathrm{W}{1}_{\mathrm{i}-1}\right)={\mathrm{a}}_{\mathrm{i}}\mathrm{W}{1}_{\mathrm{i}-1};\ \mathrm{and}\hfill \\ {}\mathrm{W}{1}_{\mathrm{i}}={\mathrm{a}}_{\mathrm{i}}+{\mathrm{c}}_{\mathrm{i}}={\mathrm{a}}_{\mathrm{i}}+{\mathrm{a}}_{\mathrm{i}}'\mathrm{W}{0}_{\mathrm{i}-1}+{\mathrm{a}}_{\mathrm{i}}\mathrm{W}{1}_{\mathrm{i}-1}={\mathrm{a}}_{\mathrm{i}}+\mathrm{W}{0}_{\mathrm{i}-1}\hfill \end{array} $$
(14)

By (13), the functions ui can be expressed as

$$ {\mathrm{u}}_{\mathrm{i}}=\left[{\mathrm{a}}_{\mathrm{i}-1}{\mathrm{c}}_{\mathrm{i}-1}+{\mathrm{a}}_{\mathrm{i}-1}'{\mathrm{c}}_{\mathrm{i}-1}'\right]'=\left[\mathrm{W}{0}_{\mathrm{i}-1}+\mathrm{W}{1}_{\mathrm{i}-1}'\right]'=\mathrm{W}{0}_{\mathrm{i}-1}'\mathrm{W}{1}_{\mathrm{i}-1} $$
(15)

Figure 2a shows a UCA cell for 3 N operation. The cell is comprised of three blocks: Carry Propagation Path (CPP) block, Function ui Generation (FUG) block, and Sum Generation (SMG) block. The critical path of an n-bit UCA-based RCA, as illustrated in Fig. 2b, includes an NAND gate in cell#1, (n-3) UCAs in cell#2 to cell#(n-2), and the inverter, NOR2, and XOR gates in cell#(n-1), where ΔUCA = ΔNOR2, i.e., the delay is

Figure 2
figure 2

UCA-based RCA: a UCA cell for 3N; and b (n + 1)-bit RCA.

$$ {\varDelta}_{\mathrm{RCA}\left(\mathrm{U}\mathrm{C}\mathrm{A}\right)\left(3\mathrm{N}\right)}={\varDelta}_{\mathrm{NAND}2}+\left(\mathrm{n}-2\right){\varDelta}_{\mathrm{NOR}2}+{\varDelta}_{\mathrm{inv}}+{\varDelta}_{\mathrm{XOR}} $$
(16)

3.2 UCA-based RCA for 5 N Operation

Consider the 5 N operation, as shown in (1) with k = 2, where Ai = (a2i+1,a2i), i = 0,1,…,m-1, and n = 2 m. (Note that a zero is added as the most significant bit if n is not an even number.) Two additions, (a2i+1 + a2i-1 + c2i) and (a2i + a2i-2 + c2i-1), are performed. By (5),

$$ \begin{array}{c}\hfill {\mathrm{c}}_{2\mathrm{i}}={\mathrm{a}}_{2\mathrm{i}}'\left({\mathrm{a}}_{2\mathrm{i}-2}{\mathrm{c}}_{2\mathrm{i}-1}\right)+{\mathrm{a}}_{2\mathrm{i}}\left({\mathrm{a}}_{2\mathrm{i}-2}+{\mathrm{c}}_{2\mathrm{i}-1}\right);\hfill \\ {}\hfill {\mathrm{s}}_{2\mathrm{i}}={\mathrm{a}}_{2\mathrm{i}}\oplus {\mathrm{u}}_{2\mathrm{i}},\ \mathrm{where}\ {\mathrm{u}}_{2\mathrm{i}}={\mathrm{a}}_{2\mathrm{i}-2}\oplus {\mathrm{c}}_{2\mathrm{i}-1};\hfill \\ {}\hfill {\mathrm{c}}_{2\mathrm{i}+1}={\mathrm{a}}_{2\mathrm{i}+1}'\left({\mathrm{a}}_{2\mathrm{i}-1}{\mathrm{c}}_{2\mathrm{i}}\right)+{\mathrm{a}}_{2\mathrm{i}+1}\left({\mathrm{a}}_{2\mathrm{i}-1}+{\mathrm{c}}_{2\mathrm{i}}\right);\hfill \\ {}\hfill {\mathrm{s}}_{2\mathrm{i}+1}={\mathrm{a}}_{2\mathrm{i}+1}\oplus {\mathrm{u}}_{2\mathrm{i}+1},\ \mathrm{where}\ {\mathrm{u}}_{2\mathrm{i}+1}={\mathrm{a}}_{2\mathrm{i}-1}\oplus {\mathrm{c}}_{2\mathrm{i}};\hfill \end{array} $$
(17)

By (17), plugging c2i to c2i+1, we have

$$ \begin{array}{l}{\mathrm{c}}_{2\mathrm{i}+1}={\mathrm{a}}_{2\mathrm{i}+1}'{\mathrm{a}}_{2\mathrm{i}-1}\left[{\mathrm{a}}_{2\mathrm{i}}'\left({\mathrm{a}}_{2\mathrm{i}-2}{\mathrm{c}}_{2\mathrm{i}-1}\right)+{\mathrm{a}}_{2\mathrm{i}}\left({\mathrm{a}}_{2\mathrm{i}-2}+{\mathrm{c}}_{2\mathrm{i}-1}\right)\right]+\hfill \\ {}\begin{array}{lll}\hfill & \begin{array}{ll}\hfill & \hfill \end{array}\hfill & {\mathrm{a}}_{2\mathrm{i}+1}\left({\mathrm{a}}_{2\mathrm{i}-1} + {\mathrm{a}}_{2\mathrm{i}}'\left({\mathrm{a}}_{2\mathrm{i}-2}{\mathrm{c}}_{2\mathrm{i}-1}\right)+{\mathrm{a}}_{2\mathrm{i}}\left({\mathrm{a}}_{2\mathrm{i}-2}+{\mathrm{c}}_{2\mathrm{i}-1}\right)\right);\hfill \end{array}\hfill \\ {}\begin{array}{ll}\hfill & ={\mathrm{a}}_{2\mathrm{i}+1}'{\mathrm{a}}_{2\mathrm{i}}'\left({\mathrm{a}}_{2\mathrm{i}-1}{\mathrm{a}}_{2\mathrm{i}-2}{\mathrm{c}}_{2\mathrm{i}-1}\right) + {\mathrm{a}}_{2\mathrm{i}+1}'{\mathrm{a}}_{2\mathrm{i}}\left[{\mathrm{a}}_{2\mathrm{i}-1}\left({\mathrm{a}}_{2\mathrm{i}-2}+{\mathrm{c}}_{2\mathrm{i}-1}\right)\right]+\hfill \end{array}\hfill \\ {}\begin{array}{lll}\hfill & \begin{array}{ll}\hfill & \hfill \end{array}\hfill & {\mathrm{a}}_{2\mathrm{i}+1}{\mathrm{a}}_{2\mathrm{i}}'\left[{\mathrm{a}}_{2\mathrm{i}-1}+{\mathrm{a}}_{2\mathrm{i}-2}{\mathrm{c}}_{2\mathrm{i}-1}\right] + {\mathrm{a}}_{2\mathrm{i}+1}{\mathrm{a}}_{2\mathrm{i}}\left[{\mathrm{a}}_{2\mathrm{i}-1} + {\mathrm{a}}_{2\mathrm{i}-2}+{\mathrm{c}}_{2\mathrm{i}-1}\right]\hfill \end{array}\hfill \\ {}\begin{array}{ll}\hfill & ={\mathrm{a}}_{2\mathrm{i}+1}'{\mathrm{a}}_{2\mathrm{i}}'\mathrm{W}{00}_{2\mathrm{i}-1}+{\mathrm{a}}_{2\mathrm{i}+1}'{\mathrm{a}}_{2\mathrm{i}}\mathrm{W}0{1}_{2\mathrm{i}-1}+\hfill \end{array}\hfill \\ {}\begin{array}{lllll}\hfill & \hfill & \hfill & \hfill & {\mathrm{a}}_{2\mathrm{i}+1}{\mathrm{a}}_{2\mathrm{i}}'\mathrm{W}1{0}_{2\mathrm{i}-1}+{\mathrm{a}}_{2\mathrm{i}+1}{\mathrm{a}}_{2\mathrm{i}}\mathrm{W}{11}_{2\mathrm{i}-1}\hfill \end{array}\hfill \end{array} $$
(18)

where

$$ \begin{array}{cc}\hfill \mathrm{W}{00}_{2\mathrm{i}-1} = {\mathrm{a}}_{2\mathrm{i}-1}{\mathrm{a}}_{2\mathrm{i}-2}{\mathrm{c}}_{2\mathrm{i}-1};\hfill & \hfill \mathrm{W}0{1}_{2\mathrm{i}-1} = {\mathrm{a}}_{2\mathrm{i}-1}\left({\mathrm{a}}_{2\mathrm{i}-2}+{\mathrm{c}}_{2\mathrm{i}-1}\right);\hfill \\ {}\hfill \mathrm{W}1{0}_{2\mathrm{i}-1} = {\mathrm{a}}_{2\mathrm{i}-1}+{\mathrm{a}}_{2\mathrm{i}-2}{\mathrm{c}}_{2\mathrm{i}-1};\hfill & \hfill \mathrm{W}{11}_{2\mathrm{i}-1} = {\mathrm{a}}_{2\mathrm{i}-1}+{\mathrm{a}}_{2\mathrm{i}-2}+{\mathrm{c}}_{2\mathrm{i}-1};\hfill \end{array} $$

The 2-bits UCA cell for 5 N, referred to as a UCA2 cell can be derived as in the following property.

Property 1

The logic functions of the CPP block are expressed as

$$ \begin{array}{c}\hfill \mathrm{W}{00}_{2\mathrm{i}+1} = {\mathrm{a}}_{2\mathrm{i}+1}\cdotp {\mathrm{a}}_{2\mathrm{i}}\cdotp \mathrm{W}{11}_{2\mathrm{i}-1}\hfill \\ {}\hfill \mathrm{W}0{1}_{2\mathrm{i}+1} = {\mathrm{a}}_{2\mathrm{i}+1}\cdotp \left({\mathrm{a}}_{2\mathrm{i}} + \mathrm{W}1{0}_{2\mathrm{i}-1}\right)\hfill \\ {}\hfill \mathrm{W}1{0}_{2\mathrm{i}+1} = {\mathrm{a}}_{2\mathrm{i}+1} + \left({\mathrm{a}}_{2\mathrm{i}}\cdotp \mathrm{W}0{1}_{2\mathrm{i}-1}\right)\hfill \\ {}\hfill \mathrm{W}{11}_{2\mathrm{i}+1} = {\mathrm{a}}_{2\mathrm{i}+1} + {\mathrm{a}}_{2\mathrm{i}} + \mathrm{W}{00}_{2\mathrm{i}-1}\hfill \end{array} $$
(18)

and the functions u2i and u2i+1 of the FUG block are

$$ \begin{array}{c}\hfill {\mathrm{u}}_{2\mathrm{i}}=\left(\mathrm{W}{00}_{2\mathrm{i}-1}'\mathrm{W}0{1}_{2\mathrm{i}-1}\right)+\left(\mathrm{W}1{0}_{2\mathrm{i}-1}'\mathrm{W}{11}_{2\mathrm{i}-1}\right)\hfill \\ {}\hfill {\mathrm{u}}_{2\mathrm{i}+1}={\mathrm{a}}_{2\mathrm{i}}'\left[\mathrm{W}{00}_{2\mathrm{i}-1}'\mathrm{W}1{0}_{2\mathrm{i}-1}\right]+{\mathrm{a}}_{2\mathrm{i}}\left[\mathrm{W}0{1}_{2\mathrm{i}-1}'\mathrm{W}{11}_{2\mathrm{i}-1}\right]\hfill \end{array} $$
(20)

Proof

Consider W002i+1 = a2i+1a2ic2i+1, with c2i+1 in (18), we obtain W002i+1 = a2i+1a2iW112i-1. Similarly, one can easily derive the remaining terms in (19). By (17),

$$ \begin{array}{c}\hfill {\mathrm{u}}_{2\mathrm{i}}={\mathrm{a}}_{2\mathrm{i}-2}\oplus {\mathrm{c}}_{2\mathrm{i}-1}=\left({\mathrm{a}}_{2\mathrm{i}-2}{\mathrm{c}}_{2\mathrm{i}-1}\right)\oplus \left({\mathrm{a}}_{2\mathrm{i}-2}+{\mathrm{c}}_{2\mathrm{i}-1}\right)\hfill \\ {}\hfill =\left[{\mathrm{a}}_{2\mathrm{i}-1}\oplus \left({\mathrm{a}}_{2\mathrm{i}-2}{\mathrm{c}}_{2\mathrm{i}-1}\right)\right]\oplus \left[{\mathrm{a}}_{2\mathrm{i}-1}\oplus \left({\mathrm{a}}_{2\mathrm{i}-2}+{\mathrm{c}}_{2\mathrm{i}-1}\right)\right]\hfill \\ {}\hfill =\left(\mathrm{W}{00}_{2\mathrm{i}-1}'\mathrm{W}1{0}_{2\mathrm{i}-1}\right)\oplus \left(\mathrm{W}0{1}_{2\mathrm{i}-1}'\mathrm{W}{11}_{2\mathrm{i}-1}\right)\hfill \\ {}\hfill =\left(\mathrm{W}{00}_{2\mathrm{i}-1}'\mathrm{W}1{0}_{2\mathrm{i}-1}\right)'\left(\mathrm{W}0{1}_{2\mathrm{i}-1}'\mathrm{W}{11}_{2\mathrm{i}-1}\right)+\hfill \\ {}\hfill \left(\mathrm{W}{00}_{2\mathrm{i}-1}'\mathrm{W}1{0}_{2\mathrm{i}-1}\right)\ \left(\mathrm{W}0{1}_{2\mathrm{i}-1}'\mathrm{W}{11}_{2\mathrm{i}-1}\right)'\hfill \end{array} $$

Since

$$ \begin{array}{c}\hfill \mathrm{W}{00}_{2\mathrm{i}-1}\mathrm{W}{11}_{2\mathrm{i}-1}=\mathrm{W}{00}_{2\mathrm{i}-1},\ \mathrm{W}1{0}_{2\mathrm{i}-1}'\mathrm{W}0{1}_{2\mathrm{i}-1}'=\mathrm{W}1{0}_{2\mathrm{i}-1}',\hfill \\ {}\hfill \mathrm{W}1{0}_{2\mathrm{i}-1}\mathrm{W}0{1}_{2\mathrm{i}-1}=\mathrm{W}0{1}_{2\mathrm{i}-1},\ \mathrm{and}\ \mathrm{W}{00}_{2\mathrm{i}-1}'\mathrm{W}{11}_{2\mathrm{i}-1}'=\mathrm{W}{11}_{2\mathrm{i}-1}',\hfill \\ {}\hfill {\mathrm{u}}_{2\mathrm{i}} = \mathrm{W}{00}_{2\mathrm{i}-1}\mathrm{W}0{1}_{2\mathrm{i}-1}'+\mathrm{W}1{0}_{2\mathrm{i}-1}'\mathrm{W}{11}_{2\mathrm{i}-1}+\hfill \\ {}\hfill \mathrm{W}{00}_{2\mathrm{i}-1}'\mathrm{W}0{1}_{2\mathrm{i}-1}+\mathrm{W}1{0}_{2\mathrm{i}-1}\mathrm{W}{11}_{2\mathrm{i}-1}'\hfill \end{array} $$

and since

$$ \begin{array}{c}\hfill \mathrm{W}{00}_{2\mathrm{i}-1}\mathrm{W}0{1}_{2\mathrm{i}-1}'=0\ \mathrm{and}\ \mathrm{W}1{0}_{2\mathrm{i}-1}\mathrm{W}{11}_{2\mathrm{i}-1}'=0,\ \mathrm{thus}\hfill \\ {}\hfill {\mathrm{u}}_{2\mathrm{i}} = \left(\mathrm{W}{00}_{2\mathrm{i}-1}'\mathrm{W}0{1}_{2\mathrm{i}-1}\right)+\left(\mathrm{W}1{0}_{2\mathrm{i}-1}'\mathrm{W}{11}_{2\mathrm{i}-1}\right)\hfill \end{array} $$

Similarly,

$$ \begin{array}{c}\hfill {\mathrm{u}}_{2\mathrm{i}+1}={\mathrm{a}}_{2\mathrm{i}-1}\oplus {\mathrm{c}}_{2\mathrm{i}}={\mathrm{a}}_{2\mathrm{i}-1}\oplus \left[{\mathrm{a}}_{2\mathrm{i}}'\left({\mathrm{a}}_{2\mathrm{i}-2}{\mathrm{c}}_{2\mathrm{i}-1}\right)+{\mathrm{a}}_{2\mathrm{i}}\left({\mathrm{a}}_{2\mathrm{i}-2}+{\mathrm{c}}_{2\mathrm{i}-1}\right)\right]\hfill \\ {}\hfill = {\mathrm{a}}_{2\mathrm{i}}'\left[{\mathrm{a}}_{2\mathrm{i}-1}\oplus \left({\mathrm{a}}_{2\mathrm{i}-2}{\mathrm{c}}_{2\mathrm{i}-1}\right)\right]+{\mathrm{a}}_{2\mathrm{i}}\left[{\mathrm{a}}_{2\mathrm{i}-1}\oplus \left({\mathrm{a}}_{2\mathrm{i}-2}+{\mathrm{c}}_{2\mathrm{i}-1}\right)\right]\hfill \\ {}\hfill ={\mathrm{a}}_{2\mathrm{i}}'\left[\mathrm{W}{00}_{2\mathrm{i}-1}'\mathrm{W}1{0}_{2\mathrm{i}-1}\right]+{\mathrm{a}}_{2\mathrm{i}}\left[\mathrm{W}0{1}_{2\mathrm{i}-1}'\mathrm{W}{11}_{2\mathrm{i}-1}\right]\hfill \end{array} $$

Figure 3a shows the UCA2 cell, where the CPP block can be realized by using NAND2/NOR2 gates, as illustrated in Fig. 3b. The critical path, as indicated in red, of the (2 m)-bits UCA2-based RCA in Fig. 3c includes an INV and an NOR2 gate in cell #0, two NOR2 gates in each of cells #1 to #(m-2), and the FUG and SMG blocks in cell #(m-1). Thus, the propagation delay is

Figure 3
figure 3

5N operation: a & b UCA2 cell; and c UCA2-based RCA.

$$ {\varDelta}_{\mathrm{RCA}\left(\mathrm{U}\mathrm{C}\mathrm{A}2\right)\left(5\mathrm{N}\right)}={\varDelta}_{\mathrm{INV}}+{\varDelta}_{\mathrm{NOR}2}+\left(\mathrm{m}-2\right){\varDelta}_{\mathrm{UCA}2}+{\varDelta}_{\mathrm{FUG}\left(5\mathrm{N}\right)}+{\varDelta}_{\mathrm{SMG}} $$
(21)

3.3 UCA-based RCA for (2k + 1)N Operation

We first construct the CPP block and then the FUG and SMG blocks

CPP Block

Let rj be a binary number, either a 0 or a 1,

$$ \begin{array}{c}\hfill {\#}_{\mathrm{r}\mathrm{i}}={\#}_0=``\cdotp "\left(\mathrm{Logic}\ \mathrm{AND}\ \mathrm{operation}\right),\ \mathrm{i}\mathrm{f}\ {\mathrm{r}}_{\mathrm{i}}=0,\hfill \\ {}\hfill {\#}_{\mathrm{r}\mathrm{i}}={\#}_1=``+"\left(\mathrm{logic}\ \mathrm{OR}\ \mathrm{operation}\right),\kern0.5em \mathrm{i}\mathrm{f}\ {\mathrm{r}}_{\mathrm{i}}=1.\hfill \end{array} $$
(22)

Thus, W012i+1 in (19) can be expressed as

$$ \mathrm{W}{\left({\mathrm{r}}_1{\mathrm{r}}_0\right)}_{2\mathrm{i}+1} = {\mathrm{a}}_{2\mathrm{i}+1}{\#}_0\left({\mathrm{a}}_{2\mathrm{i}}{\#}_1\mathrm{W}{\left({\mathrm{r}}_1'{\mathrm{r}}_0'\right)}_{2\mathrm{i}-1}\right) $$
(23)

where r1’ and r0’ are the bit-complement of r1 and r0, respectively.

Figure 3b shows the block diagram of the UCA2 cells, where the CPP block is highlighted. The CPP block contains 4 carry propagation paths, Pa, a = 0 ~ 3, and each path contains 2 gates, Jab, b = 0 or 1. The gate type of Jab is determined by the index of the path output Wab. For example, the output W10 of the path pa which includes both gates J21 and J20. Here, W10 means a = 1 and b = 0, i.e., J21 is an OR gate (because a = 1) and J20 is an AND gate (because b = 0). Fig. 4a is a symbolic representation of that in Fig. 3b.

Figure 4
figure 4

Symbolic representation: a for 5N; and b for (2k + 1)N.

Similarly, the CPP block of the UCAk cells for (2k + 1)N operation can be expressed as

$$ \mathrm{W}{\left({\mathrm{r}}_{\mathrm{k}-1}..{\mathrm{r}}_1{\mathrm{r}}_0\right)}_{\mathrm{k}\left(\mathrm{i}+1\right)-1}={\mathrm{a}}_{\mathrm{k}\left(\mathrm{i}+1\right)-1}{\#}_{\mathrm{r}\mathrm{k}-1}\left(\dots {\#}_{\mathrm{r}1}{\left({\mathrm{a}}_{\mathrm{k}\mathrm{i}}{\#}_{\mathrm{r}0}\mathrm{W}\left({\mathrm{r}}_{\mathrm{k}-1}'\dots {\mathrm{r}}_0'\right.\right)}_{\mathrm{k}\mathrm{i}-1}\right) $$
(24)

Figure 4b shows the symbolic representation of the logic function for the CPP block of the UCAk cell and its delay is ΔUCA(k) = kΔNOR2.

FUG and SMG Blocks

Let U(x0) and U(x1) be defined as follows: (“x” means “don’t care term”)

$$ \mathrm{U}\left(\mathrm{x}0\right)=\mathrm{W}{00}_{2\mathrm{i}-1}'\mathrm{W}1{0}_{2\mathrm{i}-1},\ \mathrm{U}\left(\mathrm{x}1\right)=\mathrm{W}0{1}_{2\mathrm{i}-1}'\mathrm{W}{11}_{2\mathrm{i}-1} $$
(25)

U(x0) takes the terms Wx0, while U(x1) is formed by the terms Wx1. Let V(0x) and V(1x) be defined as

$$ \mathrm{V}\left(0\mathrm{x}\right)=\mathrm{W}{00}_{2\mathrm{i}-1}'\mathrm{W}0{1}_{2\mathrm{i}-1},\mathrm{V}\left(1\mathrm{x}\right)=\mathrm{W}1{0}_{2\mathrm{i}-1}'\mathrm{W}{11}_{2\mathrm{i}-1} $$
(26)

Thus, (19) can be expressed as

$$ \begin{array}{l}{\mathrm{u}}_{2\mathrm{i}+1}={\mathrm{a}}_{2\mathrm{i}}'\mathrm{U}\left(\mathrm{x}0\right)+{\mathrm{a}}_{2\mathrm{i}}\mathrm{U}\left(\mathrm{x}1\right)\hfill \\ {}{\mathrm{u}}_{2\mathrm{i}}=\mathrm{V}\left(0\mathrm{x}\right)+\mathrm{V}\left(1\mathrm{x}\right)\hfill \end{array} $$
(27)

Figure 5a is the symbolic representation of FUG and SMG blocks of UCA2 cell for 5 N operation, where U(x0)* is an AND3 gate with 3 inputs a2i’, W002i-1’, and W102i-1, and U(x1)* is that with a2i, W012i-1’, and W112i-1. Thus, the delay path, as shown in Fig. 5b, includes an AND3, a OR2, and a XOR, and it can be implemented with an INV, an NAND3, an NAND2, and a XOR as follows,

Figure 5
figure 5

Delay paths: a FUG and SMG blocks for 5N; b Delay paths for 5N; c for 9N; and d for 17N.

$$ {\varDelta}_{\mathrm{FUG}\left(5\mathrm{N}\right)}={\varDelta}_{\mathrm{INV}}+{\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2};{\varDelta}_{\mathrm{SMG}}={\varDelta}_{\mathrm{XOR}} $$
(28)

Similarly, the UCA3 cell for 9 N operation, the FUG block can be expressed as follows,

$$ \begin{array}{c}\hfill {\mathrm{u}}_{3\mathrm{i}+2}={\mathrm{a}}_{3\mathrm{i}+1}'{\mathrm{a}}_{3\mathrm{i}}'\mathrm{U}\left(\mathrm{x}00\right)+{\mathrm{a}}_{3\mathrm{i}+1}'{\mathrm{a}}_{3\mathrm{i}}\mathrm{U}\left(\mathrm{x}01\right)+\hfill \\ {}\hfill \kern1.56em {\mathrm{a}}_{3\mathrm{i}+1}{\mathrm{a}}_{3\mathrm{i}}'\mathrm{U}\left(\mathrm{x}10\right)+{\mathrm{a}}_{3\mathrm{i}+1}{\mathrm{a}}_{3\mathrm{i}}\mathrm{U}\left(\mathrm{x}11\right)\hfill \\ {}\hfill {\mathrm{u}}_{3\mathrm{i}+1}={\mathrm{a}}_{3\mathrm{i}}'\left[\mathrm{V}\left(0\mathrm{x}0\right)+\mathrm{V}\left(1\mathrm{x}0\right)\right]+{\mathrm{a}}_{3\mathrm{i}}\left[\mathrm{V}\left(0\mathrm{x}1\right)+\mathrm{V}\left(1\mathrm{x}1\right)\right]\hfill \\ {}\hfill {\mathrm{u}}_{3\mathrm{i}} = \mathrm{V}\left(0\mathrm{x}0\right)+\mathrm{V}\left(1\mathrm{x}0\right)+\mathrm{V}\left(0\mathrm{x}1\right)+\mathrm{V}\left(1\mathrm{x}1\right)\hfill \end{array} $$
(29)

Figure 5c illustrates the delay path for 9 N operation, where u3i+2, in (29), can be realized by four AND4 gates and one OR4 gates. However, the term a3i+1’a3i’U(x00) can be realized by an AND2 with a3i+1’ and a3i’ and an AND3 which takes the output of AND2 and two inputs of U(x00). Since the function AND2 with a3i+1’ and a3i’ can be pre-calculated and the AND2 gate will not be in the critical path, as shown in Fig. 5c. The OR4 is realized by a NOR2 followed by a NAND2. This results in an INV, an NAND3, an NOR2, and an NAND2 included in the critical path of the FUG block of the UCA3 cell, i.e.,

$$ {\varDelta}_{\mathrm{FUG}\left(9\mathrm{N}\right)}={\varDelta}_{\mathrm{INV}}+{\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NOR}2}+{\varDelta}_{\mathrm{NAND}2};{\varDelta}_{\mathrm{SMG}}={\varDelta}_{\mathrm{XOR}} $$
(30)

Similarly, the delay path in Fig. 5d for UCA cell is,

$$ {\varDelta}_{\mathrm{FUG}\left(17\mathrm{N}\right)}={\varDelta}_{\mathrm{INV}}+2{\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NOR}3};{\varDelta}_{\mathrm{SMG}}={\varDelta}_{\mathrm{XOR}} $$
(31)

3.4 Performance Evaluation

The proposed design methodology is mainly about the structure of hardware implementation which is based on the newly developed equations. The actual realization of the proposed UCA, for example, may vary significantly. Therefore, the following performance comparisons were conducted assuming the same type of hardware realization. Thus, the speed performance estimation only consider the total gate delay in the critical path under the same conditions, where the path delays and loading effects were not considered for this rough estimation. In fact, the performances are evaluated based on the gate delays of the standard cells in TSMC 0.18 μm CMOS process technology, as tabulated in Table 1 [2], where the cell height is 5.04 μm.

Table 1 Cell delay data [2]

The delays of UCA-based RCAs are, n = 2t, t ≥ 4, where (Δinv + (k-1)ΔNOR2) is in Cell#0 for 5 N and 17 N operations,

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{RCA}\left(\mathrm{U}\mathrm{C}\mathrm{A}\right)\left(3\mathrm{N}\right)}={\varDelta}_{\mathrm{NAND}2}+\left(\mathrm{n}-2\right){\varDelta}_{\mathrm{NOR}2}+{\varDelta}_{\mathrm{inv}}+{\varDelta}_{\mathrm{XOR}}\hfill \\ {}\hfill {\varDelta}_{\mathrm{RCA}\left(\mathrm{U}\mathrm{C}\mathrm{A}\right)\left(5\mathrm{N}\right)}=\left(\mathrm{n}-3\right){\varDelta}_{\mathrm{NOR}2}+2{\varDelta}_{\mathrm{inv}}+{\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{XOR}}\hfill \\ {}\hfill {\varDelta}_{\mathrm{RCA}\left(\mathrm{U}\mathrm{C}\mathrm{A}\right)\left(17\mathrm{N}\right)}=\left(\mathrm{n}-5\right){\varDelta}_{\mathrm{NOR}2}+2{\varDelta}_{\mathrm{inv}}+2{\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NOR}3}+{\varDelta}_{\mathrm{XOR}}\hfill \end{array} $$
(32)

By (8), the delays of FA-based RCAs are, n = 2t, t ≥ 4,

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{RCA}\left(\mathrm{F}\mathrm{A}\right)\left(3\mathrm{N}\right)}={\varDelta}_{\mathrm{NOR}2}+2\left(\mathrm{n}-2\right){\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{XOR}}\hfill \\ {}\hfill {\varDelta}_{\mathrm{RCA}\left(\mathrm{F}\mathrm{A}\right)\left(5\mathrm{N}\right)}=2{\varDelta}_{\mathrm{NOR}2}+2\left(\mathrm{n}-3\right){\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{XOR}}\hfill \\ {}\hfill {\varDelta}_{\mathrm{RCA}\left(\mathrm{F}\mathrm{A}\right)\left(17\mathrm{N}\right)}=4{\varDelta}_{\mathrm{NOR}2}+2\left(\mathrm{n}-5\right){\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{XOR}}\hfill \end{array} $$
(33)

Based on (32), (33), and Table 1, the delays of both FA-based RCAs and UCA-based RCAs for 3 N (k = 1), 5 N (k = 2), and 17 N (k = 4) with the 16 ~ 1024 bits were computed and the values were tabulated in Table 2. Results show that UCA-RCA for 17 N operation has slightly better than that for 5 N operation. This is so simply because (2ΔNOR2 + ΔNAND2) = 0.1176 ns and (ΔNAND3 + ΔNOR3) = 0.1044 ns, where the difference is about 0.01 ns.

Table 2 Performance evaluation for RCAs

Based on Table 2, Fig. 6a and b plot the delays of both FA-based RCA and UCA-based RCA, respectively, for 3 N, 5 N, and 17 N operations. Results show that their delays are almost the same for the same size with the same approach.

Figure 6
figure 6

Speed performance: a FA-based RCA; b UCA-based RCA;and c comparison.

Table 2 tabulates the normalized speed performance, where the delay of FA-based RCA is normalized and set to 1. The delay ratios of UCA-based UCAs for various bit sizes are shown in Fig. 6c. Results show that the delay ratios are approximately 65 % which can be simply calculated from the delay ratio of UCA cell (ΔNOR2) over a FA cell (2ΔNAND2), i.e., Δ NOR2/(2Δ NAND2) = 0.0426/0.0648 = 0.66 = 66 %.

Even though the proposed UCA-based RCAs are about 34 % faster than the FA-based ones, their speed performance is still too slow particularly for wider bit sizes. The following section attempts to further improve the speed performance.

4 Hybrid Adders for (2k ± 1)N Operations

As mentioned, the RCA achieves lower hardware cost, while the CLA offers high speed performance. In order to achieve higher speed performance for the additions, a hybrid adder combining RCA and CLA (or generate-propagate adder), as shown in Fig. 7, is presented. The adder is simple and modular, where the lower bit CLA is taken as a module.

Figure 7
figure 7

Hybrid adder – RCA & CLA.

4.1 UCA-based Hybrid Adder for 3N Operation

For 3N operation, by (14), the carry-out bits at the first 4 stages of the CPP blocks can be written as follows,

$$ \begin{array}{ll}\begin{array}{l}\mathrm{W}{0}_0={\mathrm{a}}_0\mathrm{W}{1}_{-1}\hfill \\ {}\mathrm{W}{0}_1={\mathrm{a}}_1\mathrm{W}{1}_0={\mathrm{a}}_0{\mathrm{a}}_1+{\mathrm{a}}_1\mathrm{W}{0}_{-1}\hfill \\ {}\mathrm{W}{0}_2={\mathrm{a}}_1{\mathrm{a}}_2+{\mathrm{a}}_0{\mathrm{a}}_2\mathrm{W}{1}_{-1}\hfill \\ {}\mathrm{W}{0}_3={\mathrm{a}}_2{\mathrm{a}}_3+{\mathrm{a}}_0{\mathrm{a}}_1{\mathrm{a}}_3+{\mathrm{a}}_1{\mathrm{a}}_3\mathrm{W}{0}_{-1}\hfill \end{array}\hfill & \begin{array}{l}\mathrm{W}{1}_0={\mathrm{a}}_0+\mathrm{W}{0}_{-1}\hfill \\ {}\mathrm{W}{1}_1={\mathrm{a}}_1+\mathrm{W}{0}_0={\mathrm{a}}_1+{\mathrm{a}}_0\mathrm{W}{1}_{-1}\hfill \\ {}\mathrm{W}{1}_2={\mathrm{a}}_2+{\mathrm{a}}_0{\mathrm{a}}_1+{\mathrm{a}}_1\mathrm{W}{0}_{-1}\hfill \\ {}\mathrm{W}{1}_3={\mathrm{a}}_3+{\mathrm{a}}_1{\mathrm{a}}_2+{\mathrm{a}}_0{\mathrm{a}}_2\mathrm{W}{1}_{-1}\hfill \end{array}\hfill \end{array} $$
(34)

Thus, both W03 and W13 can be written as

$$ \begin{array}{ll}\mathrm{W}{0}_3={\mathrm{R}}_{01}+{\mathrm{R}}_{00}\mathrm{W}{0}_{-1}\hfill & \mathrm{W}{1}_3={\mathrm{Q}}_{01}+{\mathrm{Q}}_{00}\mathrm{W}{1}_{-1}\hfill \end{array} $$
(35)

where

$$ {\mathrm{R}}_{00}={\mathrm{a}}_1{\mathrm{a}}_3,\ {\mathrm{R}}_{01}={\mathrm{a}}_2{\mathrm{a}}_3+{\mathrm{a}}_0{\mathrm{a}}_1{\mathrm{a}}_3,{\mathrm{Q}}_{00}={\mathrm{a}}_0{\mathrm{a}}_2,\ {\mathrm{Q}}_{01}={\mathrm{a}}_3+{\mathrm{a}}_1{\mathrm{a}}_2 $$
(36)

Similarly, both W07 and W17 are expressed as

$$ \begin{array}{ll}\mathrm{W}{0}_7={\mathrm{R}}_{11}+{\mathrm{R}}_{10}\mathrm{W}{0}_3\hfill & \mathrm{W}{1}_7={\mathrm{Q}}_{11}+{\mathrm{Q}}_{10}\mathrm{W}{1}_3\hfill \end{array} $$
(37)

where

$$ {\mathrm{R}}_{10}={\mathrm{a}}_5{\mathrm{a}}_7,\ {\mathrm{R}}_{11}={\mathrm{a}}_6{\mathrm{a}}_7+{\mathrm{a}}_4{\mathrm{a}}_5{\mathrm{a}}_7,\ {\mathrm{Q}}_{00}={\mathrm{a}}_4{\mathrm{a}}_6,\ {\mathrm{Q}}_{01}={\mathrm{a}}_7+{\mathrm{a}}_5{\mathrm{a}}_6 $$

By (35), both W07 and W17 in (37) can be re-written as

$$ \begin{array}{l}\mathrm{W}{0}_7={\mathrm{R}}_{11}+{\mathrm{R}}_{10}{\mathrm{R}}_{01}+{\mathrm{R}}_{10}{\mathrm{R}}_{00}\mathrm{W}{0}_{-1}\hfill \\ {}\mathrm{W}{1}_7={\mathrm{Q}}_{11}+{\mathrm{Q}}_{10}{\mathrm{Q}}_{01}+{\mathrm{Q}}_{10}{\mathrm{Q}}_{00}\mathrm{W}{1}_{-1}\hfill \end{array} $$
(38)

Figure 8a shows a 32-bits hybrid adder, where each CLA unit processes 8-bit data. Each CLA unit, adopting the parallel prefix adder structure [7, 8], is comprised of two 4-bit RQ generator units (RQGUs, or RQGU-4), a 2-bit CLA (or CLA-2), and two 4-bit Function U generation units (FUGUs, or FUGU-4). The RQGU-4, as shown in Fig. 8b, realizes the functions in (34) and it replaces both PG unit and Block CLA-4 (BCLA-4) in the conventional CLA structure. The CLA-2, as illustrated in Fig. 8c, implements the functions in (35) and (38), and the FUGU-4 in Fig. 8d, is for the functions in (34) and (15). Note that ci = (W0i,W1i) in Fig. 8a, The delays of these units can be expressed as

Figure 8
figure 8

Proposed hybrid adder for 3N operation: a 32-bit hybrid adder; b RQGU-4; c CLA-2; and d FUGU-4 & SUM.

$$ \begin{array}{l}{\varDelta}_{\mathrm{RQGU}-4}={\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2};{\varDelta}_{\mathrm{CLA}-2}=2{\varDelta}_{\mathrm{NAND}3};\hfill \\ {}{\varDelta}_{\mathrm{FUGU}-4}={\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{INV}}+{\varDelta}_{\mathrm{NOR}2};{\varDelta}_{\mathrm{SUM}}={\varDelta}_{\mathrm{XOR}};\hfill \end{array} $$

The critical path of the 32-bit hybrid adder includes a RQGU-4, four CLA-2 s, a FUGU-4, and a SUM, i.e.,

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{HyA}\left(3\mathrm{N}\right)(32)}={\varDelta}_{\mathrm{RQGU}-4}+4{\varDelta}_{\mathrm{CLA}-2}+{\varDelta}_{\mathrm{FUGU}-4}+{\varDelta}_{\mathrm{SUM}}\hfill \\ {}\hfill =10{\varDelta}_{\mathrm{NAND}3}+2{\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{INV}}+{\varDelta}_{\mathrm{NOR}2}+{\varDelta}_{\mathrm{XOR}}\hfill \end{array} $$

In general, for n = 2t, t > 3, the delay

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{HyA}\left(3\mathrm{N}\right)\left(\mathrm{n}\right)}={\varDelta}_{\mathrm{RQGU}-4}+\left(\mathrm{n}/8\right){\varDelta}_{\mathrm{CLA}-2}+{\varDelta}_{\mathrm{FUGU}-4}+{\varDelta}_{\mathrm{SUM}}\hfill \\ {}\hfill =\left(\mathrm{n}/4+2\right){\varDelta}_{\mathrm{NAND}3}+2{\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{INV}}+{\varDelta}_{\mathrm{NOR}2}+{\varDelta}_{\mathrm{XOR}}\hfill \end{array} $$
(39)

Similarly, the 4-bit CLA module of the hybrid adder in Fig. 8 can be replaced by 8-bit one. The delay is

$$ {\varDelta}_{\mathrm{HyA}\left(3\mathrm{N}\right)\left(\mathrm{n}\right)}={\varDelta}_{\mathrm{RQGU}-8}+\left(\mathrm{n}/16\right){\varDelta}_{\mathrm{CLA}-2}+{\varDelta}_{\mathrm{FUGU}-8}+{\varDelta}_{\mathrm{SUM}} $$
(40)

where

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{RQGU}-8}={\varDelta}_{\mathrm{AND}5}+{\varDelta}_{\mathrm{OR}3};\hfill \\ {}\hfill {\varDelta}_{\mathrm{FUGU}-8}={\varDelta}_{\mathrm{AND}5}+{\varDelta}_{\mathrm{OR}4}+{\varDelta}_{\mathrm{INV}}+{\varDelta}_{\mathrm{NOR}2}.\hfill \end{array} $$
(41)

The CLA module of the hybrid adder may include multi-level structure similar to the conventional CLA. For example, the CLA module of the 64-bit hybrid adder in Fig. 9 is with 2-level CLA structure. Let L denote as the number of levels in the CLA module. Thus, the delay of such hybrid adder is, r = m*2L, m = 4 or 8, L ≥ 2, and n ≥ 2r,

Figure 9
figure 9

Sixty-four-bits hybrid adder with 2-level CLA module.

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{HyA}\left(3\mathrm{N}\right)\left(\mathrm{n}\right)}={\varDelta}_{\mathrm{RQGU}-\mathrm{m}}+2\left(\mathrm{L}-1\right){\varDelta}_{\mathrm{BCLA}-2}+\left(\mathrm{n}/\mathrm{r}\right){\varDelta}_{\mathrm{CLA}-2}+\hfill \\ {}\hfill {\varDelta}_{\mathrm{FUGU}-\mathrm{m}}+{\varDelta}_{\mathrm{SUM}}\hfill \end{array} $$
(42)

4.2 UCA-based Hybrid Adder for (2k + 1)N Operation

Similar to the derivation process for 3 N operation in (35)-(38), the following component delays can be derived from Property 1 for 5 N operation, where

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{RQGU}-4\left(5\mathrm{N}\right)}={\varDelta}_{\mathrm{RQGU}-4\left(3\mathrm{N}\right)}={\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2};\hfill \\ {}\hfill {\varDelta}_{\mathrm{FUGU}-4\left(5\mathrm{N}\right)}=2{\varDelta}_{\mathrm{NAND}3}+2{\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{INV}};\hfill \\ {}\hfill {\varDelta}_{\mathrm{RQGU}-8\left(5\mathrm{N}\right)}={\varDelta}_{\mathrm{RQGU}-8\left(3\mathrm{N}\right)}={\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2}+2{\varDelta}_{\mathrm{NOR}2}\hfill \\ {}\hfill {\varDelta}_{\mathrm{FUGU}-8\left(5\mathrm{N}\right)}=2{\varDelta}_{\mathrm{NAND}3}+2{\varDelta}_{\mathrm{NAND}2}+2{\varDelta}_{\mathrm{NOR}2}+{\varDelta}_{\mathrm{INV}};\hfill \end{array} $$
(43)

Similarly, one can also derive for 17 N operation as

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{RQGU}-4\left(17\mathrm{N}\right)}={\varDelta}_{\mathrm{RQGU}-4\left(3\mathrm{N}\right)}={\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2};\hfill \\ {}\hfill {\varDelta}_{\mathrm{FUGU}-4\left(17\mathrm{N}\right)}=3{\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{INV}}+{\varDelta}_{\mathrm{NOR}3};\hfill \\ {}\hfill {\varDelta}_{\mathrm{RQGU}-8\left(17\mathrm{N}\right)}={\varDelta}_{\mathrm{RQGU}-8\left(3\mathrm{N}\right)}={\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2}+2{\varDelta}_{\mathrm{NOR}2};\hfill \\ {}\hfill {\varDelta}_{\mathrm{FUGU}-8\left(17\mathrm{N}\right)}=3{\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2}+2{\varDelta}_{\mathrm{NOR}2}+{\varDelta}_{\mathrm{INV}}+{\varDelta}_{\mathrm{NOR}3}.\hfill \end{array} $$
(44)

For 9N operation with k = 3, the UCA3 cells are employed and each cell contains 3 bits. Thus, the delays are

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{RQGU}-3\left(9\mathrm{N}\right)}=2{\varDelta}_{\mathrm{NAND}2};\hfill \\ {}\hfill {\varDelta}_{\mathrm{RQGU}-6\left(9\mathrm{N}\right)}={\varDelta}_{\mathrm{AND}4}+{\varDelta}_{\mathrm{OR}3}={\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{NOR}2}+{\varDelta}_{\mathrm{OR}3}\hfill \\ {}\hfill {\varDelta}_{\mathrm{FUGU}-3\left(9\mathrm{N}\right)}=3{\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{INV}}+{\varDelta}_{\mathrm{NOR}2}+2{\varDelta}_{\mathrm{NAND}3};\hfill \\ {}\hfill {\varDelta}_{\mathrm{FUGU}-6\left(9\mathrm{N}\right)}=2{\varDelta}_{\mathrm{NAND}2}+2{\varDelta}_{\mathrm{NOR}2}+{\varDelta}_{\mathrm{INV}}+{\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{OR}3}.\hfill \end{array} $$
(45)

Similarly, for (2k + 1)N operations, the k-bit cells, UCAk cells, are employed. Thus, the hybrid adder may include the k-bit CLA modules.

4.3 Performance Evaluation

This section roughly estimates the speed performances of the FA-based RCAs, UCA-based RCA, and the hybrid adders with various bit sizes for 3N, 5N, 9N, and 17N operations. However, the performance evaluation process is readily applied for any bit sizes and (2k ± 1)N operations.

The speed performance is roughly evaluated based on the gate delays of the standard cells listed in Table 1, where the AND (NAND) and OR (NOR) gates are limited to their fan-ins up to 3. Thus, the AND5 and OR4 gates in (41) are implemented in two-level logic gates, i.e., AND5 = (AND3) •(AND2), and OR4 = (OR2) + (OR2).

The delays of the RQGU-4 and RQGU-8 for 3N, 5N, and 17N operations are the same. By Table 1,

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{RQGU}-4}={\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2}=0.0777\mathrm{ns}\hfill \\ {}\hfill {\varDelta}_{\mathrm{RQGU}-8}={\varDelta}_{\mathrm{AND}5}+{\varDelta}_{\mathrm{OR}4}\hfill \\ {}\hfill \kern2.04em ={\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2}+2{\varDelta}_{\mathrm{NOR}2}=0.1629\mathrm{ns}\hfill \\ {}\hfill {\varDelta}_{\mathrm{RQGU}-3}=2{\varDelta}_{\mathrm{NAND}2}=0.0648\mathrm{ns}\hfill \\ {}\hfill {\varDelta}_{\mathrm{RQGU}-6}={\varDelta}_{\mathrm{NAND}}+{\varDelta}_{\mathrm{NOR}2}+{\varDelta}_{\mathrm{NOR}3}+{\varDelta}_{\mathrm{INV}}=0.1602\mathrm{ns}\hfill \end{array} $$
(46)

Based on (39, 40, 41, 42, 43, 44, 45, and 46), the delays of the hybrid adders are calculated and tabulated in Table 3.

Table 3 Performance comparison

Results show that the delays are indeed improved as the number of levels, L, increases. For example, for 17 N operation with 1024 bits, the delays was improved from 66.51 ns for the FA-based RCA to 43.64 ns for the proposed UCA-based RCA, as shown in Table 2, Further, the delay is reduced to 12.07 ns and 6.44 ns for the proposed hybrid adders with 4-bits and 8-bits modules, respectively, as illustrated in Table 3 and plotted in Fig. 10a. The delays can be further decreased to 2.31 ns and 1.76 ns by using the proposed hybrid with 4-levels of 4-bits and 8-bits modules, respectively, as plotted in Fig. 10b.

Figure 10
figure 10

Propagation delays for 17N operation with 1024 bits: a Various adders; and b Hybrid adders with 4-bits and 8-bits modules.

4.4 Discussion

The proposed UCA design concept has the advantages of better speed performance than the conventional FA design approach. This sub-section compares the speed performance of the UCA-CLA and conventional CLA (CV-CLA). The delays of both BCLA-4 and CLA-4 [57] are

$$ \begin{array}{c}\hfill {\varDelta}_{\mathrm{CLA}-4}={\varDelta}_{\mathrm{AND}5}+{\varDelta}_{\mathrm{OR}5}={\varDelta}_{\mathrm{NAND}3}+{\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{NOR}3}+{\varDelta}_{\mathrm{NOR}2}\hfill \\ {}\hfill {\varDelta}_{\mathrm{BCLA}-4}={\varDelta}_{\mathrm{AND}4}+{\varDelta}_{\mathrm{OR}4}=2\left({\varDelta}_{\mathrm{NAND}2}+{\varDelta}_{\mathrm{NOR}2}\right)\hfill \end{array} $$

For n-bit CV-CLAs, n = 4t, with CLA-4 and BCLA-4 modules, the delay is

$$ {\varDelta}_{\mathrm{CV}-\mathrm{L}\mathrm{A}}={\varDelta}_{\mathrm{PGU}}+2\left(\mathrm{t}-1\right){\varDelta}_{\mathrm{BCLA}-4}+{\varDelta}_{\mathrm{CLA}-4}+{\varDelta}_{\mathrm{SUM}} $$
(47)

and the delay of an n-bit UCA-CLA is

$$ {\varDelta}_{\mathrm{UCA}-\mathrm{C}\mathrm{L}\mathrm{A}}={\varDelta}_{\mathrm{RQGU}-4}+2\left(\mathrm{t}-2\right){\varDelta}_{\mathrm{BCLA}-4}+{\varDelta}_{\mathrm{CLA}-4}+{\varDelta}_{\mathrm{FUGU}-4}+{\varDelta}_{\mathrm{SUM}} $$
(48)

Note that the delay (ΔRQGU-4 + ΔFUGU-4) in UCA-CLA is compared with (ΔPGU + 2ΔBCLA-4) in CV-CLA. Table 4 compares the speed performance of CV-CLA and UCA- CLA for 3 N, 5 N, and 17 N operations with various bit sizes.

Table 4 Performance comparison (CLA Modules)

The UCA-CLAs for 3N, 5N, and 17N operations respectively achieve approximately 30 %, 25 %, and 15 % less delays than CV-CLA for 16-bit addition, and about 13 %, 11 %, and 7 % for 1024 bit addition. As mentioned, the delay of FUGU-2k block increases with k considerably. Thus, the proposed UCA-CLAs may gain the advantage of better speed performance for lower values of k. The proposed simple and modular UCA-RCA are perfectly applied for those constant multiplications with smaller bit sizes, such as 3 N, 5 N, and 7 N operations for arithmetic coding, Booth encoding, and the divider design [9] as a cost-effective solution. For higher speed performance, the proposed UCA-CLA can also be applied.

5 Conclusion

Constant multiplier performs a multiplication of a data-input with a constant value. They are essential components in various types of arithmetic circuits, such as filters in digital signal processor (DSP) units and they are prevalent in modern VLSI designs. This study has presented efficient algorithm and fast hardware implementation for (2k ± 1)N operation with additions. No multiplications were needed.

The salient features of the proposed UCA design concept came from the use of the carry function of (12) instead of (5), thus improving the propagation delay path and achieving 34 % in speed performance improvement. The design concept is perfectly applied for the RCA design for faster speed. The proposed UCA-HyA provides a cost-effective solution for constant multiplication, and the proposed UCA-CLA further improves the speed performance.

In addition, the study presents the hardware implementation for (2k + 1)N operation, how to apply the proposed UCA design concept for any other form of constants with one addition also leads a very interesting topic for future research. Finally, as mentioned in [2], the constant divider can be developed in a similar way.