ASIC Implementation and Comparison of Diminished-one Modulo $2^n+1$ Adder

Raj Kishore Kumar¹, Vikram Kumar²

¹Shivalik Institute of Engineering & Technology
²Assistant Professor, Shivalik Institute of Engineering & Technology

Abstract: Modulo $2^n+1$ adder is one of the most common adder which have frequent use in RNS operations that has significant critical path, and often encountered in applications like pseudo-random number generation and cryptography as well. In this paper we have presented various types of diminished-one modulo $2^n+1$ adder using globular carry selection (GCS) and several parallel prefix adders. The adder has been simulated using verilog HDL codes and mapped this design to the TSMC (90 nm) and calculated the area, power dissipation and time for n= 8,16,32 & 64 and shown in the graph which adder will have better power efficiency and time delay. Area occupied by several bits is presented in the table.

Keyword:- RNS, Parallel Prefix Adder, GCS, ASIC

1. Introduction

Arithmetic modulo $2^n+1$ has wide range of applicability ranging from pseudorandom number generation and cryptography without round-off errors [1], [2], [3]. Also, modulo $2^n+1$ arithmetic operators are frequently included in residue number system (RNS) application. Most of the case, RNS is extremely good for many applications-such as computer security, speech processing, communication engineering, digital signal processing, image processing, and transformation in which the critical arithmetic operations are addition and multiplication. RNS offers in several cases apart from enhanced operation speed, low-power characteristics. RNS are more complex than the conventional arithmetic, and, consequently, cannot be implemented directly with the same efficiency. Therefore, in practice, residue arithmetic is frequently realized in terms of lookup-tables (to avoid the complex combinational-logic circuits) and conventional arithmetic(i.e. arithmetic in some standard notation). For example, the sign-and-magnitude approach may be convenient for representing signed numbers in RNS, but actual arithmetic operations might be best realized in terms of radix-complement arithmetic. We shall also see that certain choices of representational parameters in RNs naturally lead to diminished-radix complement arithmetic.

The difficulty of a modulo $2^n+1$ arithmetic unit is defined by the representation chosen for the input operands. There are three representations which have been considered: namely, the signed-LSB representation, the normal weighted one and the Diminished-one. The adoption of the signed-LSB representation does not lead to more efficient circuits in delay or area terms so that we only consider the last two representations in the following. when performing arithmetic operations modulo $2^n+1$ in all cases, the input operands and the results are limited between 0 and 2n. In the case of the normal weighted representation, each input operand requires n+1 bits for its representation but only utilizes C representations out of the $2^n+1$ that these can provide. Other way to simplify arithmetic operations modulo $2^n+1$ are offered by the diminished-one representation. In the diminished-one representation, A is represented as $a_iA^+$, where $a_i$ is a single bit, frequently called the zero indication bit, and $A^+$ is an n-bit vector, often called the number part. $a_i=0$ and $A^+=A-1$ only if $A>0$, whereas for $A=0$, $a_i=1$ and $A^+=0$.

For example, the diminished-one representation of $A=6\bmod 17$ is 001011 considering that the most common operation required in modulo $2^n+1$ arithmetic are multiplication by a power of two, negation and addition, so in the case of diminished-one representation, allows to limit these operations to n bits. Specifically negation is performed by complementing every bit of $A^+$, if $a_i=0$ and prevent any change when $a_i=1$. In this paper we have presented a paper on diminished-one modulo 2 +1 arithmetic adder using GCS which is based on the paper[5]. This adder consists of a DS-CLA, a GCS and a multiplexer. The DS-CLA adder produces two sets of modulo results in parallel and the GCS computes the carry out bit and globularly control the multiplexer to get the correct modulo sum from DS-CLA adder.

The paper has been organized in six section. In section 2, the details of the basics of parallel-prefix addition is presented. All operations involved in modulo $2^n+1$ addition for diminished-one operands are discussed in details in section 3. Section 4 gives the details of the diminished-one modulo adder using GCS is presented. Section 5 gives the performance parameter comparison and the conclusion part is given in section 6.

2. Parallel-Prefix addition basics

Consider that $A=A_{01}A_{02}......A_0$ and $B=B_{01}B_{02}......B_0$ represent the two numbers to be added and $S=S_{01}S_{02}......S_0$ shows there sum. An adder can be divided as a three stage circuit. The preprocessing stage computes the
carry-generate bits $G_i$, the carry-propagate bits $P_i$, and the half-sum bits $H_i$ for all $i$, $0 \leq i \leq n-1$, according to

$$G_i = \overline{A_i \cdot B_i} \quad \quad P_i = A_i \cdot B_i \quad \quad H_i = A_i + B_i$$

Where the sign $\bullet$, $\oplus$, denotes logical AND, OR, and exclusive-OR, respectively. After this stage, the second stage of the adder called the carry signals $C_i$, for $0 \leq i \leq n-1$ using the carry propagate and carry generate bits $G_i$ and $P_i$. The third stage computes the sum bits as

$$S_i = H_i \cdot C_i$$

Carry computation get changed into a parallel prefix problem by using the $\bullet$ operator, which includes pair of generate and propagate signals and was defined in [10] as

$$(G, P) \bullet (G', P') = (G + P \cdot G', P \cdot P')$$

The notation $(G_{K\cdot j}, P_{K\cdot j})$, with $K > j$, is used to denote the group generate/propagate term produced out of bits $K$, $K-1$,........$j$, that is

$$(G_{K\cdot j}, P_{K\cdot j}) = (G_{K}, P_{K}) \bullet (G_{K-1}, P_{K-1}) \bullet .... \bullet (G_{j}, P_{j})$$

Since every carry $C_i = G_i \oplus P_i$, various types of algorithms have been introduced for computing all the carries using only $\bullet$ operators. Fig.1 presents the well-known approaches for the design of an 8-bit adder, while Fig.2 shows the logic-level implementation of the basic cells used throughout the paper. When we use large word lengths, the design of sparse parallel prefix adders is preferred, because the wiring and area of the design are significantly reduced without sacrificing delay. The basic design of sparse adders depends on the use of a sparse parallel-prefix carry computation unit and carry select blocks. It saves considerable amount of area in the carry-computation unit[4]. Carry selection block work with the two sets of sum bits corresponding to the two possible values of the incoming carry and select the proper sum without any delay overhead.

3. Modulo $2^n$ addition basics

3.1 Modulo $2^n$-1 adders:-

The modulo $2^n$-1 addition depends on a conditional operation defined as

$$(A + B) \mod (2^n + 1) = \begin{cases} (A + B), & A + B < 2^n \\ (A + B + 1) \mod 2^n, & A + B \geq 2^n \end{cases}$$

When $A+B\geq2^n$, A modulo $2^n$-1 adder can be implemented using an integer adder that increments also its sum when the carry output is one, which is equivalent to feed the carry-input of the adder with the carry-output of the first addition. Additional carry increment stage can implement the conditional increment as shown in Fig.4a. In this case, one extra level of cells $\bullet$ driven by the carry output of the adder, is required. When $A+B=2^n$, the adder may produce an all 1s output vector, in place of the expected result which is equal to zero. To implement a modulo $2^n$-1 adder requires the connection of the carry output $C_{n-1}=G_{n-1}\oplus P_{n-1}$ of an integer adder to its carry-input port. The carries of the modulo $2^n$-1 adder is equal to $G_{n-1} + P_{n-1} \cdot C_n$. So that connecting the carry output to the carry input leads to $= G_{n-1} + P_{n-1} \cdot C_{n-1}$.

![Figure 1](image_url)

**Figure 1:** Examples of 8-bit parallel-prefix structures for integer adders (a) Kogge-stone[7] (b) Ladner-Fischer[8]

This relation can be simplified [1] to

$$C_i = G_{i:0} + P_{i:0} \cdot G_{n-1:i+1} \quad \quad (1)$$

This equation is equivalent to

$$C_i \leftrightarrow (G_i, P_i) \oplus (G_0, P_0) \oplus (G_{n-1}, P_{n-1}) \oplus (G_{n-2}, P_{n-2}) \oplus \ldots \oplus (G_i, P_i) \quad \quad (2)$$

Equation (2) computes the modulo $2^n$-1 carries have a cyclic form and, in terms of the number of generate and propagate pairs $(G_i, P_i)$ is equal to $n$. Therefore the parallel-prefix carry computation unit of a modulo $2^n$-1 adder has increased area complexity than that of a corresponding integer adder.
number parts $A'$ and $B'$ are added modulo $2^n+1$. When ant one of the two inputs is zero the result is nonzero operand and when both operands are zero, the result is zero. The Diminished-one sum is derived by the number parts $A'$ and $B'$ of the input operands, when none of the input operands is zero, as follows:

$$S' = (A' + B') \mod (2^n+1)$$

$$= \begin{cases} 
(A' + B' + 1) \mod 2^n, & A' + B' < 2^n \\
(A' + B') \mod 2^n, & A' + B' \geq 2^n 
\end{cases} \text{[3]}
$$

4. Globular Carry Selection Diminished-one Modulo $2^n+1$ Adder

Suppose that the two n-bit diminished-one operands are $A' = a_{n-1}...a_0$ and $B' = b_{n-1}...b_0$. So their sum $S' = s_{n-1}...s_0$ is derived by doing modulo $2^n+1$ addition of these two operands. Sum can be changed into the uncomplicated function with performing modulo $2^n$ addition as:

$$S' = (A' + B' + C_{n-1}) \mod 2^n \text{[4]}$$

Where $C_{n-1}$ denotes as an original carry-out bit of $(A' + B')$.

According to the carry look ahead adder function [12], the carry term

$$C_{i} = G_{i} + \sum_{j=0}^{n-2} (\prod_{k=j+1}^{i} P_{k}) G_{j} + \sum_{k=0}^{n-1} P_{k} \text{ for } i = 0, 1, ..., n-1$$

Where denotes the carry-in bit. According to GCS technique, we set $C_{i} = C_{n-1}$. The boolean function of each sum bit in (4) can be shown as:

$$S_{i} = C_{i-1} \oplus P_{i}$$

$$= (G_{i-1} + \sum_{j=0}^{n-2} (\prod_{k=j+1}^{i} P_{k}) G_{j} + \sum_{k=0}^{n-1} P_{k}) \oplus P_{i} \text{ [5]}$$

Where,

$$C_{n-1} = G_{n-1} + \sum_{j=0}^{n-2} (\prod_{k=j+1}^{n-1} P_{k}) G_{j} \text{ [6]}$$


3.2 Modulo $2^n+1$ adders

In terms of complexity, Diminished-1 modulo $2^n+1$ addition is more complex so that special care is required when at least one of the input operand is zero (100...0). When none of the input operands is zero their

Since $C_{n-1} \in \{0, 1\}$, so that

$$S_{i} = \begin{cases} 
S_{i,1} = (G_{i-1} + \sum_{j=0}^{n-2} (\prod_{k=j+1}^{i} P_{k}) G_{j} + \sum_{k=0}^{n-1} P_{k}) \oplus P_{i}, & \text{if } C_{n-1} = 0 \\
S_{i,0} = (G_{i-1} + \sum_{j=0}^{n-2} (\prod_{k=j+1}^{i} P_{k}) G_{j} + \sum_{k=0}^{n-1} P_{k}) \oplus P_{i}, & \text{if } C_{n-1} = 1 
\end{cases} \text{ [7]}
$$

In (7), We can clearly see that they have the same term $(G_{i-1} + \sum_{j=0}^{n-2} (\prod_{k=j+1}^{i} P_{k}) G_{j}) \oplus P_{i}$, so that it is easy to design a DS-CLA adder to produce two sums $S_{i,1}$ and $S_{i,0}$. Meanwhile, the carry $C_{n-1}$ generated by the CLA function of (6) is globally used to control multiplexer for getting the appropriate outputs $S_{i,15}$. The block diagram of GCS Diminished-one modulo $2^n+1$ adder is shown in Fig.3. At a clear point of view, Fig.4 shows the detailed logic design for GCS Diminished-one modulo $2^n+1$ adder.
To increase the speed of the GCS modular adder for the large dimension of n, we divide the n-bit GCS modular adder into m r-bit GCS addition blocks and a fast GCS where n=m x r. Fig. 5 illustrates the partitioned GCS modular adder architecture. Both the input operands are divided into m blocks inputs a

\[ A^t = \{ A^t_{m-1} \ldots A^t_0 \} \] and \[ B^t = \{ B^t_{m-1} \ldots B^t_0 \} \]

where \[ A^t_i = \{ a^t_i(t+1)r-1 \ldots a^t_i1 \} \] and \[ B^t_i = \{ b^t_i(t+1)r-1 \ldots b^t_i1 \} \] for \( t=0, \ldots , (m-1) \).

The block sum \( S^t_i = \{ s^t_i(t+1)r-1 \ldots s^t_i1 \} \) is derived

![Figure 3: Block diagram of GCS diminished-one modulo 2^8+1 adder](image)

![Figure 4: The m x r partitioned GCS modular adder](image)

by the equation \( A^t_i + B^t_i + K^t_{r-1} \) where \( K^t_{r-1} \) denotes the carry-out bit of the \( (t-1)^{th} \) addition block. In each GCS addition block, the DS-CLA adder generates two block sums \( S^t_{i,0} = S^t_i \) for \( K^t_{r-1} = 1 \) and \( S^t_{i,0} = S^t_i \) for \( K^t_{r-1} = 0 \) in parallel. Same as, the carry-out bit \( K^t_{r-1} \) is used to select the correct block sum. Each carry-

\[
K^t_{r-1} = G^t_{i-1} + \sum_{j=0}^{r-2} (\prod_{\xi=j+1}^{r-1} P^t_j) G^t_j + \prod_{\xi=0}^{r-1} P^t_{j} \text{ if } C_{n-1} = 0
\]

\[
K^t_{r-1} = G^t_{i-1} + \sum_{j=0}^{r-1} (\prod_{\xi=j+1}^{r-1} P^t_j) G^t_j \text{ if } C_{n-1} = 0
\]

Where the block generate term \( G^t_{i} = G^t_{i+r(r-1)} + \sum_{j=r}^{r} (\prod_{\xi=j+1}^{r-1} P^t_{k}) G^t_j \) and the block propagate term \( P^t_{i} = \prod_{\xi=r}^{r} P^t_{k} \) are given by the \( i^{th} \) GCS addition block. So that the carry-out bit

\[
= C^t_{n-1} + \sum_{j=0}^{m-2} (\prod_{\xi=j+1}^{m-1} P^t_j) G^t_j \text{ ...........(9)}
\]
5. Comparison and VLSI Implementation

We compare the GCS diminished-one modulo 2^n+1 adder against three parallel-prefix modular adder\[7\] [8] [10] which are regarded as the fast and the more area efficient design for arithmetic addition. In order to make an accurate comparison, we use TSMC 90 nm design kit with Cadence’s rc tool to implement the designs for n = 8, 16, 32, 64. Then calculated the area, power dissipation and time delay and shows the AD (Area x Delay) and DP (Delay x Power) products. All the area results are calculated in µm², delay results are given in ns and power dissipation results are expressed in µw. Table1 shows the comparison in terms of area, delay and power consumption and its product for various bits. The shaded parts in the table shows the best results for the specific dimension of n. In the graph, we can see that for n > 8, the GCS modular adder has less power delay product than [7], [8], [10]. This result shows that GCS modular adder to be profitable for many real application when to require a good compromise in area, delay and power.

![Logic circuit of CCS diminished-one modulo 2^n+1 adder][5]

### Table 1: Comparison of the Synthesized Adders

<table>
<thead>
<tr>
<th>n</th>
<th>Design</th>
<th>Area(µm²)</th>
<th>Delay(ns)</th>
<th>Power(µw)</th>
<th>Area x Delay</th>
<th>Delay x Power</th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>KS 8x1</td>
<td>58</td>
<td>3.184</td>
<td>2.04</td>
<td>185</td>
<td>6.49</td>
</tr>
<tr>
<td></td>
<td>LF</td>
<td>209</td>
<td>2.34</td>
<td>8.64</td>
<td>489</td>
<td>20.21</td>
</tr>
<tr>
<td></td>
<td>BK</td>
<td>196</td>
<td>2.21</td>
<td>8.21</td>
<td>433</td>
<td>18.14</td>
</tr>
<tr>
<td></td>
<td>GCS 2X4</td>
<td>98</td>
<td>2.39</td>
<td>3.28</td>
<td>234</td>
<td>7.83</td>
</tr>
<tr>
<td>16</td>
<td>KS 8X2</td>
<td>112</td>
<td>4.75</td>
<td>5.14</td>
<td>532</td>
<td>24.41</td>
</tr>
<tr>
<td></td>
<td>KS 16X1</td>
<td>120</td>
<td>5.2</td>
<td>4.37</td>
<td>624</td>
<td>22.7</td>
</tr>
<tr>
<td></td>
<td>LF</td>
<td>463</td>
<td>2.48</td>
<td>24.65</td>
<td>1148</td>
<td>61.13</td>
</tr>
<tr>
<td></td>
<td>BK</td>
<td>454</td>
<td>2.62</td>
<td>24.29</td>
<td>1189</td>
<td>63.63</td>
</tr>
<tr>
<td></td>
<td>GCS 4X4</td>
<td>138</td>
<td>3.21</td>
<td>4.34</td>
<td>443</td>
<td>13.93</td>
</tr>
<tr>
<td></td>
<td>GCS 1X16</td>
<td>116</td>
<td>3.98</td>
<td>3.67</td>
<td>461</td>
<td>14.6</td>
</tr>
<tr>
<td>32</td>
<td>KS 8X4</td>
<td>231</td>
<td>10.78</td>
<td>7.65</td>
<td>2490</td>
<td>82.46</td>
</tr>
<tr>
<td></td>
<td>KS 32X1</td>
<td>244</td>
<td>9.23</td>
<td>9.59</td>
<td>2252</td>
<td>88.51</td>
</tr>
<tr>
<td></td>
<td>LF</td>
<td>993</td>
<td>2.72</td>
<td>54.83</td>
<td>2701</td>
<td>149.13</td>
</tr>
<tr>
<td></td>
<td>BK</td>
<td>943</td>
<td>2.96</td>
<td>52.65</td>
<td>2791</td>
<td>155.84</td>
</tr>
<tr>
<td></td>
<td>GCS 8X4</td>
<td>203</td>
<td>3.67</td>
<td>13.76</td>
<td>745</td>
<td>50.49</td>
</tr>
<tr>
<td></td>
<td>GCS 2X16</td>
<td>192</td>
<td>3.82</td>
<td>11.23</td>
<td>733</td>
<td>42.89</td>
</tr>
<tr>
<td>64</td>
<td>KS 8X8</td>
<td>584</td>
<td>5.37</td>
<td>26.68</td>
<td>3136</td>
<td>143.27</td>
</tr>
<tr>
<td></td>
<td>KS 64X1</td>
<td>633</td>
<td>10.29</td>
<td>27.7</td>
<td>6513</td>
<td>285.03</td>
</tr>
<tr>
<td></td>
<td>LF</td>
<td>1568</td>
<td>3.81</td>
<td>83.56</td>
<td>5974</td>
<td>318.36</td>
</tr>
<tr>
<td></td>
<td>BK</td>
<td>1523</td>
<td>3.96</td>
<td>81.03</td>
<td>6031</td>
<td>320.87</td>
</tr>
<tr>
<td></td>
<td>GCS 16X4</td>
<td>429</td>
<td>4.01</td>
<td>23.01</td>
<td>1720</td>
<td>92.27</td>
</tr>
<tr>
<td></td>
<td>GCS 4X16</td>
<td>403</td>
<td>4.21</td>
<td>19.56</td>
<td>1696</td>
<td>82.34</td>
</tr>
</tbody>
</table>
6. Conclusion

A GCS Diminished-one modulo $2^{n+1}$ adder has been implemented and compared with various parallel prefix adder to drive the most compromising design in terms of area, delay, power. Based on the 90 nm CMOS technology, the ASIC implementation of GCS modular adder has better area-delay and delay-power performance over those of the parallel-prefix adder.

References


[7] P.M. Kogge and H.S. Stone, “A Parallel Algorithm for the Efficient solution of a General class of...


