A High Performance Modulo 2^n+1 Squarer Design Based on Carbon Nanotube Technology

A Thesis Presented

by

Weifu Li

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirement

for the degree of

Master of Science

in

Electrical Engineering

in the field of

Electronic Circuits and Semiconductor Devices

Northeastern University

Boston, Massachusetts

November, 2012

Abstract

The modulo 2^n+1 squarer is widely used as an important component in digital systems such as digital signal processing (DSP), cryptography and residue arithmetic. In this thesis, an improved high-speed, low-power design of the modulo 2^n+1 squarer is proposed. The improvement comes primarily from the algorithm, the circuit implementation and the implementation technology. For the algorithm, the partial product matrix reconstruction is optimized to achieve a larger input range and fewer operation steps. A modified Wallace tree is also employed in the partial product compression process. For the circuit implementation, the full adders in the traditional Wallace tree structure are replaced by 3:2 compressors, and a sparse-tree based inverted End-Around-Carry (EAC) modulo 2^n+1 adder is used to implement the final addition stage. The proposed design demonstrates better performance in terms of speed, power and area than the existing design. Considering the limitations of CMOS technology, a novel carbon nanotube (CNT) implementation technology is utilized. HSpice simulation shows that, in terms of critical path delay, power and power-delay product (PDP), the CNT-based implementation is a competitive candidate for high-speed, low-power applications. A Monte Carlo simulation is also performed to demonstrate the better process, voltage and temperature (PVT) properties of CNT technology.

Keywords: modulo 2^n+1 squarer, Wallace tree structure, sparse-tree modulo 2^n+1 adder, carbon nanotube technology, Monte Carlo, HSpice simulation

Acknowledgements

It is a pleasure to thank the many people who made this dissertation possible. I would like to express my gratitude to my research advisor, Dr. Yong-Bin Kim, whose expertise, understanding and patience added considerably to my graduate experience. I appreciate their vast knowledge in many areas and their assistance in writing technical articles.

I am especially grateful to the members of my dissertation committee, Prof. Fabrizio Lombardi and Prof. Minsu Choi, and to the faculty members of the Department of Electrical and Computer Engineering.

Thanks to my student colleagues, Jing Yang and Jing Lv, who provided me with a collaborative, stimulating and enjoyable environment.

I want to dedicate this work to my parents, Yi Li and Xiaoxia Lan. None of my achievements would have been possible without their support. Words cannot express my gratitude to them.

Finally, I want to say thanks to my dear wife, Zijian Chen. With your company, the two years in Boston have become one of the best memories of my life. I love you forever.

Boston, Massachusetts

Weifu Li

Nov 12, 2012

Contents

I. Introduction
1.1 Background
1.2 Work Statement
1.3 Thesis Outline

II. Algorithm
2.1 Partial Product Generation and Repositioning
2.2 Partial Product Matrix Compression
2.3 Final Addition Stage
2.4 Computation Example
2.5 Algorithm Performance Comparison
2.5.1 Existing Algorithm Review
2.5.2 Performance Analysis

III. Circuit Implementation on CMOS Technology
3.1 Partial Product Generation and Repositioning Module
3.2 Wallace Tree Compression Module
3.2.1 Design of 3:2 Compressors
3.3 Modulo 2^n+1 Adder
3.4 Design of Primitive Blocks
3.4.1 Circuit Design of XOR-XNOR
3.4.2 Circuit Design of MUX
3.4.3 Circuit Design of GP generator
3.5 Summary and Simulation Result

IV. Circuit Implementation on CNT Technology
4.1 Introduction to CNTFET
4.2 Circuit Implementation
4.3 Performance Comparison
4.3.1 PDP Comparison
4.3.2 PVT Comparison

V. Conclusion

Reference

Appendices
Appendix A. HSpice Code
1.1 HSpice code for modulo 2^n+1 squarer
1.2 HSpice code for modulo 2^n+1 adder
1.3 HSpice code for partial product Matrix
1.4 Traditional Partial Product Process
1.5 HSpice code for Compressors
1.6 HSpice code for Proposed FPP
1.7 HSpice code for Sub-circuit on CMOS technology
1.8 HSpice code for Sub-circuit on CMOS technology
Appendix B. Monte Carlo Simulation Data
2.1 Monte Carlo Simulation for CMOS
2.2 Monte Carlo Simulation for CNT

List of Tables

Table-1 Initial Partial Product Matrix
Table-2 Modified Partial Product Matrix
Table-3 Repositioned Partial Product Matrix
Table-4 Shifted Partial Product Matrix
Table-5 Final Partial Product Matrix
Table-6 Number of Wallace Tree Stages
Table-7 Initial Partial Product Matrix of example
Table-8 Repositioned Partial Product Matrix of example
Table-9 Shifted Partial Product Matrix of example
Table-10 Compression Process of example
Table-11 Final Result of example
Table-12 Modified Partial Product Matrix of Existing Algorithm
Table-13 Performance Comparison of Compressor
Table-14 Performance Compression of First Two Stage
Table-15 Performance Comparison between Full Adder and 3:2 Compressor
Table-16 Performance Comparison between Different Compression Processes
Table-17 Performance Comparison of existing modulo 2^n+1 adder
Table-18 Performance Comparison between Proposed FPP and Sparse Tree
Table-19 Performance Comparison between GP generators
Table-20 Performance of modulo 2^8+1 Squarer on Two Technologies
Table-21 Process Variation of CMOS Technology
Table-22 Process Variation of CNT Technology
Table-23 Supply Voltage Variation of CMOS Technology
Table-24 Supply Voltage Variation of CNT Technology
Table-25 Temperature Variation of CMOS Technology
Table-26 Temperature Variation of CNT Technology

List of Figures

Figure 1: Wallace tree Compression Process
Figure 2: Schematic of nand gate
Figure 3: Schematic of full adder
Figure 4: Schematic of 3:2 Compressor
Figure 5: Modified 3:2 Compressor
Figure 6: Compression process based on Existing Algorithm
Figure 7: Compression process based on Proposed Algorithm
Figure 8: Critical path delay of traditional compression process
Figure 9: Critical path delay of proposed compression process
Figure 10: Schematic of Proposed FPP
Figure 11: Schematic of Sparse Tree
Figure 12: Schematic of Conditional Sum Generator
Figure 13: Critical path delay of proposed FPP
Figure 14: Critical path delay of sparse tree adder
Figure 15: Schematic of existing XOR-XNOR gate
Figure 16: Schematic of improved XOR-XNOR gate
Figure 17: Schematic of MUX
Figure 18: Schematic of Simple GP Generator
Figure 19: Schematic of AOI and OAI Gate
Figure 20: Delay and Rise time of modulo 2^n+1 squarer
Figure 21: Power consumption of modulo 2^n+1 squarer
Figure 22: Non-critical delay of modulo 2^n+1 squarer
Figure 23: Schematic of CNTFET transistor
Figure 24: Schematic of unrolled nanotube
Figure 25: CNTFET threshold voltage varies with n
Figure 26: I-V characteristic of CNTFET transistor
Figure 27: Delay with various number of nanotube
Figure 28: Critical path delay of modulo 2^8+1 squarer
Figure 29: Power of modulo 2^8+1 squarer
Figure 30: Process Delay Variation of CMOS Technology
Figure 31: Process Delay Variation of CNT Technology
Figure 32: Supply Voltage Delay Variation of CMOS Technology
Figure 33: Supply Voltage Delay Variation of CNT Technology
Figure 34: Temperature Delay Variation of CMOS Technology
Figure 35: Temperature Delay Variation of CNT Technology
Figure 36: Monte Carlo analysis of CMOS implementation
Figure 37: Monte Carlo analysis of CNT implementation

I. Introduction

1.1 Background

In the past decades, modular arithmetic has been playing an important role in various

digital computing systems, such as digital signal processing (DSP), cryptography and

residue arithmetic. In particular, the residue number system (RNS) can be considered

as the most common application field of modular arithmetic [1].

In RNS, every operand is represented as a sequence of residues, e.g., (a1, a2, …, an). Hence, a two-operand RNS operation can be defined as [2]:

$$(x_1, x_2, \ldots, x_n) = (a_1, a_2, \ldots, a_n) \diamond (b_1, b_2, \ldots, b_n) \tag{1}$$

where ◇ denotes addition, subtraction or multiplication. Considering (1), the computation of X can be regarded as a combination of multiple separate operations between ai and bi performed in parallel, so that the overall computation speed can be significantly improved. Due to its superior performance in applications with large operand widths, the RNS has been found well suited to high-precision applications, such as Fast Fourier Transforms (FFT), Finite Impulse Response (FIR) filters [3] and convolution [4].

Since efficient conversion between RNS and binary numbers can be realized based on the Chinese remainder theorem [5], the RNS base of the form {2^n-1, 2^n, 2^n+1} is currently considered the most appropriate one for VLSI (Very Large Scale Integration) implementation among the various base forms for RNS [6]. However, since an (n+1)-bit input is required in modulo 2^n+1 arithmetic, the difference in input width among these moduli can result in new problems. Numerous algorithms and architectures have been proposed for this issue.
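As a rough illustration of (1), the sketch below applies an operation channel by channel on the base {2^n-1, 2^n, 2^n+1} and converts the result back with the Chinese remainder theorem. The function names `to_rns`, `rns_op` and `from_rns` and the choice n = 4 are assumptions made for this example only; they are not taken from the cited works.

```python
from math import prod

def to_rns(x, moduli):
    """Represent integer x by its residue in every channel."""
    return tuple(x % m for m in moduli)

def rns_op(a, b, moduli, op):
    """Apply op independently in each residue channel, as in (1)."""
    return tuple(op(ai, bi) % m for ai, bi, m in zip(a, b, moduli))

def from_rns(residues, moduli):
    """Chinese remainder theorem reconstruction (moduli pairwise coprime)."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # modular inverse, needs Python 3.8+
    return x % M

n = 4                                   # illustrative width (assumption)
moduli = (2**n - 1, 2**n, 2**n + 1)     # (15, 16, 17), dynamic range 4080
a, b = to_rns(123, moduli), to_rns(31, moduli)
product = from_rns(rns_op(a, b, moduli, lambda u, v: u * v), moduli)
difference = from_rns(rns_op(a, b, moduli, lambda u, v: u - v), moduli)
```

Each channel only ever handles operands of about n bits, which is the source of the parallelism mentioned above; the result is valid as long as it stays within the dynamic range given by the product of the moduli.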

To overcome the problem resulting from the extra input bit, the diminished-one representation introduced in [7] is adopted by [8]. In the diminished-one number system, each operand is represented as X* = X-1, so that an n-bit modulo 2^n+1 squarer and multiplier can be realized. However, the zero operand is not allowed in this scheme, since its diminished-one code would be negative and is therefore invalid. For the partial product compression process, a Dadda tree structure is employed with full adders and half adders. The final Sum and Carry vectors are then added by a diminished-1 modulo 2^n+1 parallel adder. Compared with previous solutions for (n+1)-bit operands, the design proposed in that article offers a significant improvement in terms of delay and power. Other solutions based on the diminished-1 operand representation for modulo 2^n+1 arithmetic are proposed in [9] and [10].

In the work of Curiger, H. Bonnenberg and H. Kaeslin [11], only one operand is represented in diminished-1 code, so that the complexity of the circuit implementation is reduced to some degree. In addition, the correction factor can be computed in parallel with the carry-save stage. In certain applications of digital signal processing, such as ciphers and image processing, a considerable improvement can be provided. However, a relatively complex correction circuit is required and the zero operand is also not allowed.
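The diminished-one code described above can be sketched in a few lines. The helper names `dim1_encode` and `dim1_multiply` are hypothetical, and the multiplication identity used here (A·B - 1 = A*·B* + A* + B*) is a generic arithmetic fact rather than the specific architecture of [8] or [11].

```python
def dim1_encode(x, n):
    """Diminished-one code X* = X - 1 for 1 <= X <= 2**n (zero has no code)."""
    if not 1 <= x <= 2**n:
        raise ValueError("operand outside 1..2**n cannot be encoded")
    return x - 1

def dim1_multiply(a_star, b_star, n):
    """Diminished-one code of A*B modulo 2**n + 1.

    Follows from A = A* + 1 and B = B* + 1:
    A*B - 1 = A* B* + A* + B*.
    """
    return (a_star * b_star + a_star + b_star) % (2**n + 1)
```

The encoder makes the limitation explicit: an operand of zero would map to -1, which is why the zero case needs separate handling in diminished-one designs.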

Although implementations based on the diminished-1 operand representation demonstrate a great advantage in terms of delay, power and area for modular arithmetic, the conversion between the diminished-1 system and the weighted number system unnecessarily adds complexity to the system and increases the risk of error in VLSI implementation. Therefore, an efficient modulo 2^n+1 arithmetic algorithm for weighted operands is necessary.

In the work of Wrzyszcz and Milford [12], the partial product matrix of the modulo 2^n+1 multiplier is reconstructed, so that an n × n partial product matrix can be achieved without diminishing the operands. Due to the novel partial product compression process using carry-save adders (CSA) and the periodic property of powers of 2, the correction process is simplified by combining it with other operations. Since the entire partial product computation module is composed exclusively of half adders, full adders and multiplexers, the design in that article is more suitable for regular VLSI implementation with acceptable power consumption. It also allows a potential pipelined computation structure, which can significantly improve the operating frequency of the modulo 2^n+1 multiplier.

In the later work of Vergos and Efstathiou [13], an improved implementation of the modulo 2^n+1 multiplier is proposed based on [12]. In that article, the partial product matrix is also divided into four groups, which are then reconstructed in a different way. Compared with the previous work in [12], the OR-AND-XOR gates, which consume more area and power, can be replaced to achieve an area- and power-efficient design. On the other hand, the correction factors resulting from partial product repositioning in each process are summed into a final correction factor with the value of 3. Therefore, the correction process is further improved by only adding an extra vector of correction factor to the partial product matrix. For the final addition stage, modulo 2^n+1 addition is converted into modulo 2^n addition by using the other part of the correction factor and is implemented by an inverted End-Around-Carry (EAC) modulo 2^n adder, which is more suitable for VLSI implementation.

In [1], a fast low-power modulo 2^n+1 squarer is proposed based on the algorithm in [13]. The same partial product matrix reconstruction is performed as in [13], and the equivalent pairs in each column that come from the identical input are shifted to further reduce the number of partial product vectors before the partial product compression process. In addition, compressors with a large number of inputs, such as 7:2, 5:2 and 4:2 compressors, are utilized to compress the partial products in each column to achieve a greater saving of power and delay. For the final addition stage, a novel sparse-tree based inverted EAC modulo 2^n+1 adder is introduced. Compared with previous designs of the modulo 2^n+1 adder, the power and area of the novel design are substantially decreased due to fewer operators for carry-out computation with different weights. The wire routing in the sparse-tree based modulo 2^n+1 adder is also simplified to be more suitable for VLSI design. The simulation of the entire implementation shows a considerable improvement in terms of delay and area.

In addition to the algorithm and implementation, the improvement of complementary metal-oxide semiconductor (CMOS) technology can be considered a potential research direction as well. During the past few years, the scaling of the gate channel length from 0.35 μm to 32 nm contributed greatly to the improvement of the metal-oxide semiconductor field-effect transistor (MOSFET) in terms of power and system-level performance. However, further scaling of CMOS in the nanometer range may not offer the same performance advantages as before, primarily for the following reasons. Firstly, due to the significantly increased leakage current, the static power can make the MOSFET uncompetitive for ultra-low-power applications in the nanometer range. Secondly, the stability of the MOSFET can be worse due to its higher sensitivity to unavoidable process variations in fabrication [14]. Thirdly, considering the smaller amount of charge at a circuit node resulting from lower supply voltage and smaller capacitance, CMOS circuits with ultra-short channels become more vulnerable to external voltage variation. Finally, in the nanometer range, the effective control of short-channel effects is weakened. These device non-idealities can cause the current-voltage (I-V) characteristics of an ultra-short-channel MOSFET to differ substantially from the ideal ones [14].

In recent years, various new devices and materials have been investigated. In [15], an ultra-thin body device, the FinFET, is introduced. In a FinFET, a double gate is built on the SOI substrate and the conducting channel is wrapped by a thin silicon fin, so that the gates on either side can be tied together or electrically isolated [16]. Due to this structure, the two gates can either both be used to turn on the transistor, or one gate can be used to turn on the transistor while the other is used to adjust the threshold voltage. Therefore, the dynamic and static performance can be tuned, with lower leakage and better control of short-channel effects. However, both the thin silicon fin and the matched gates on multiple sides of the fin are difficult to fabricate.

Carbon nanotube (CNT) technology, nowadays considered another promising technology, can largely avoid most of the fundamental limitations of the traditional MOSFET [17]. Compared with traditional CMOS technology, CNT technology offers better characteristics in terms of timing, frequency response and power consumption. In addition, the possibility of channel burning in a CNTFET is significantly decreased, because the heat generated in a small fraction of the CNTFET can be dissipated all along the channel. It can be foreseen that CNT technology has excellent prospects in the VLSI area.

1.2 Work Statement

In general, current implementations of the modulo 2^n+1 squarer provide excellent performance in terms of delay, power, area and stability. However, there is still some room for further improvement. In this thesis, the modulo 2^n+1 squarer is primarily improved in the aspects of algorithm, circuit configuration and implementation technology.

For the algorithm, the (n+1)-bit operand is used without any special representation to avoid extra code conversion between weighted operands and diminished-1 operands. In addition, the reconstruction of the partial product matrix is also optimized. Compared with the previous method, the bit-wise operation before the partial product compression process is saved, which further reduces the number of gates on the critical delay path. For the partial product compression process, a Wallace tree structure is introduced to increase the compression speed with lower power.

Regarding the circuit implementation of the modulo 2^n+1 squarer, the improvement comes primarily from the following aspects. The full adders and half adders used in the traditional Wallace tree structure are replaced by 3:2 compressors, which achieve a much better PDP. For the final addition stage, the optimal implementation of the modulo 2^n+1 adder is decided based on a performance comparison among various qualified candidates, and the sparse-tree based modulo 2^n+1 adder is selected. Furthermore, instead of a simple combination of NAND and OR gates, novel And-Or-Inverter (AOI) and Or-And-Inverter (OAI) gates are employed to compute the carry-outs with different weights.

Finally, the modulo 2^n+1 squarer is implemented on CNT (carbon nanotube) technology. Critical path delay, power and PDP are compared with the CMOS implementation using HSPICE simulation results. For the PVT characteristics, a Monte Carlo simulation is performed for both CNT and CMOS technology with sample numbers of 100 and 1000 respectively.

1.3 Thesis Outline

The rest of this thesis is organized as follows. The improved algorithm for the modulo 2^n+1 squarer is presented in Section II. The optimal configuration of the modulo 2^n+1 squarer, including the design of the modified Wallace tree structure and the sparse-tree based modulo 2^n+1 adder, is decided and implemented on CMOS technology in Section III. In Section IV, the optimal configuration of the modulo 2^n+1 squarer is implemented on CNT technology and the performance comparison in different aspects is presented.

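II. Algorithm

In this section, an improved algorithm for the modulo 2^n+1 squarer with two (n+1)-bit unsigned inputs is proposed. In addition to the introduction of the algorithm, a computation example and a performance comparison with the existing algorithm in [12] are presented as well.

2.1 Partial Product Generation and Repositioning

Let X be an (n+1)-bit unsigned input denoted as $X = x_n x_{n-1} \ldots x_0$. The square of X modulo 2^n+1 can then be represented as follows:

$$Q = \left|X^2\right|_{2^n+1} = \left|\sum_{i=0}^{n}\sum_{j=0}^{n} x_i x_j 2^{i+j}\right|_{2^n+1} \tag{2}$$

The partial products derived for the term Q of (2) are shown in Table-1.

Table-1 Initial Partial Product Matrix

$$
\begin{array}{ccccccccccccc}
2^{2n} & 2^{2n-1} & 2^{2n-2} & \cdots & 2^{n+2} & 2^{n+1} & 2^{n} & 2^{n-1} & 2^{n-2} & \cdots & 2^{2} & 2^{1} & 2^{0} \\ \hline
 & & & & & & p_{n,0} & p_{n-1,0} & p_{n-2,0} & \cdots & p_{2,0} & p_{1,0} & p_{0,0} \\
 & & & & & p_{n,1} & p_{n-1,1} & p_{n-2,1} & p_{n-3,1} & \cdots & p_{1,1} & p_{0,1} & \\
 & & & & p_{n,2} & p_{n-1,2} & p_{n-2,2} & p_{n-3,2} & p_{n-4,2} & \cdots & p_{0,2} & & \\
 & & & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & & & \\
 & & p_{n,n-2} & \cdots & p_{4,n-2} & p_{3,n-2} & p_{2,n-2} & p_{1,n-2} & p_{0,n-2} & & & & \\
 & p_{n,n-1} & p_{n-1,n-1} & \cdots & p_{3,n-1} & p_{2,n-1} & p_{1,n-1} & p_{0,n-1} & & & & & \\
p_{n,n} & p_{n-1,n} & p_{n-2,n} & \cdots & p_{2,n} & p_{1,n} & p_{0,n} & & & & & &
\end{array}
$$

where $p_{i,j} = x_i \cdot x_j$ is the partial product.

Since the two inputs of the modulo 2^n+1 squarer are identical, the partial products $p_{i,j}$ and $p_{j,i}$ are always equal, and each such pair can be replaced by shifting either $p_{i,j}$ or $p_{j,i}$ to the next left column and removing the other one. Therefore, the partial product matrix in Table-1 can be modified as shown in Table-2.

Table-2 Modified Partial Product Matrix

$$
\begin{array}{ccccccccccccc}
2^{2n} & 2^{2n-1} & 2^{2n-2} & \cdots & 2^{n+2} & 2^{n+1} & 2^{n} & 2^{n-1} & 2^{n-2} & \cdots & 2^{2} & 2^{1} & 2^{0} \\ \hline
 & & & & & & & p_{n-1,0} & p_{n-2,0} & \cdots & p_{2,0} & p_{1,0} & p_{0,0} \\
 & & & & & & p_{n-1,1} & p_{n-2,1} & p_{n-3,1} & \cdots & p_{1,1} & p_{0,1} & \\
 & & & & & p_{n-1,2} & p_{n-2,2} & p_{n-3,2} & p_{n-4,2} & \cdots & p_{0,2} & & \\
 & & & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & & & \\
 & & & \cdots & p_{4,n-2} & p_{3,n-2} & p_{2,n-2} & p_{1,n-2} & p_{0,n-2} & & & & \\
p_{n-1,n} & & p_{n-1,n-1} & \cdots & p_{3,n-1} & p_{2,n-1} & p_{1,n-1} & p_{0,n-1} & & & & & \\
p_{n,n} & p_{n-2,n} & p_{n-3,n} & \cdots & p_{1,n} & p_{0,n} & & & & & & &
\end{array}
$$

Taking (3) into account, each partial product term with weight greater than $2^{n-1}$ can be divided into two parts, a repositioned partial product and a correction factor:

$$\left|s\,2^{i}\right|_{2^n+1} = \left|-s\,2^{|i|_n}\right|_{2^n+1} = \left|\bar{s}\,2^{|i|_n} + 2^{n}\,2^{|i|_n}\right|_{2^n+1} \tag{3}$$

where s is the value of the repositioned bit. Because $\left|2^{2n}\right|_{2^n+1} = 1$, the repositioning result of the term $p_{n,n}$ can simply be denoted as $p_{n,n}$ in the column with weight $2^0$, without any correction factor. Therefore, the n×n partial product matrix is rewritten in Table-3.

Table-3 Repositioned Partial Product Matrix

$$
\begin{array}{ccccccc}
2^{n-1} & 2^{n-2} & 2^{n-3} & \cdots & 2^{2} & 2^{1} & 2^{0} \\ \hline
p_{n-1,0} & p_{n-2,0} & p_{n-3,0} & \cdots & p_{2,0} & p_{1,0} & p_{0,0} \\
p_{n-2,1} & p_{n-3,1} & p_{n-4,1} & \cdots & p_{1,1} & p_{0,1} & \bar{p}_{n-1,1} \\
p_{n-3,2} & p_{n-4,2} & p_{n-5,2} & \cdots & p_{0,2} & \bar{p}_{n-1,2} & \bar{p}_{n-2,2} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
p_{0,n-1} & \bar{p}_{n-1,n-1} & \bar{p}_{n-2,n-1} & \cdots & \bar{p}_{3,n-1} & \bar{p}_{2,n-1} & \bar{p}_{1,n-1} \\
\bar{p}_{n-2,n} & \bar{p}_{n-3,n} & \bar{p}_{n-4,n} & \cdots & \bar{p}_{1,n} & \bar{p}_{0,n} & p_{n-1,n} \\
0 & 0 & 0 & \cdots & 0 & 0 & p_{n,n}
\end{array}
$$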
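The pair merging and the repositioning rule (3) can be checked numerically. The following is a minimal arithmetic sketch, not a model of the circuit: `reposition` and `square_mod` are hypothetical names, and the correction factors are simply accumulated into the running sum instead of being collected into a separate vector.

```python
def reposition(s, w, n):
    """Map a bit s of weight 2**w (0 <= w <= 2n) into columns 0..n-1.

    Returns (bit, column, correction) following (3).
    """
    if w < n:
        return s, w, 0                    # already inside the n columns
    if w == 2 * n:
        return s, 0, 0                    # 2**(2n) is congruent to 1
    k = w - n                             # |w|_n for n <= w < 2n
    return 1 - s, k, (1 << n) << k        # inverted bit plus 2**n * 2**k

def square_mod(x, n):
    """|x**2| modulo 2**n + 1 from merged and repositioned partial products."""
    bits = [(x >> i) & 1 for i in range(n + 1)]
    total = 0
    for i in range(n + 1):
        for j in range(i, n + 1):
            s = bits[i] & bits[j]
            # p_ij and p_ji are equal: keep one copy, one column to the left
            w = i + j if i == j else i + j + 1
            bit, col, cf = reposition(s, w, n)
            total += (bit << col) + cf
    return total % ((1 << n) + 1)
```

For example, with n = 4 a set bit of weight 2^5 becomes an inverted bit (0) in column 2^1 plus a correction of 32, which is what `reposition(1, 5, 4)` returns.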

Observing the partial product matrix in Table-3, there are still some equal terms appearing twice as $p_{i,j}$ or $\bar{p}_{i,j}$ in the same column, so the number of vectors in the partial product matrix can be further reduced by the method mentioned above. The matrix after shifting is shown in Table-4.

Table-4 Shifted Partial Product Matrix

$$
\begin{array}{ccccccc}
2^{n-1} & 2^{n-2} & 2^{n-3} & \cdots & 2^{2} & 2^{1} & 2^{0} \\ \hline
p_{n-2,0} & p_{n-3,0} & p_{n-4,0} & \cdots & p_{1,0} & \bar{p}_{n-1,1} & p_{0,0} \\
p_{n-3,1} & p_{n-4,1} & p_{n-5,1} & \cdots & p_{1,1} & \bar{p}_{n-2,2} & \bar{p}_{n-1,0} \\
p_{n-4,2} & p_{n-5,2} & p_{n-6,2} & \cdots & \bar{p}_{n-1,2} & \bar{p}_{n-3,3} & \bar{p}_{n-2,1} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
p_{\frac{n-3}{2},\frac{n-1}{2}} & p_{\frac{n-5}{2},\frac{n-1}{2}} & p_{\frac{n-5}{2},\frac{n-3}{2}} & \cdots & \bar{p}_{\frac{n-1}{2},\frac{n+3}{2}} & \bar{p}_{\frac{n-1}{2},\frac{n+1}{2}} & \bar{p}_{\frac{n-3}{2},\frac{n+1}{2}} \\
\bar{p}_{n-2,n} & \bar{p}_{n-3,n} & \bar{p}_{n-4,n} & \cdots & \bar{p}_{1,n} & \bar{p}_{0,n} & p_{n-1,n} \\
0 & 0 & 0 & \cdots & 0 & 0 & p_{n,n}
\end{array}
$$

where n is assumed to be odd.

In addition to the partial product matrix, the correction factor, which results from the repositioning of partial products as shown in (3), needs to be considered as well. The overall correction factor consists of three parts: the correction factor from matrix repositioning, denoted as CF1; the correction factor from identical pairs shifting, denoted as CF2; and the correction factor from partial products compressing, denoted as CF3.

For the computation of CF1, the number of bits (m) that need to be repositioned in each vector is incremented from 0 to n-1, and the sum of the correction factors in each vector, denoted as CFV_j, can be computed as follows:

$$\mathrm{CFV}_j = \left(\sum_{i=0}^{m-1} 2^{i}\right) 2^{n} = 2^{n}\left(2^{m} - 1\right) \tag{4}$$

where j represents the jth vector.

Hence, the correction factor for matrix repositioning, CF1, is

$$
\begin{aligned}
\mathrm{CF1} &= \sum_{j=2}^{n} \mathrm{CFV}_j \\
&= 2^{n}\left[(2^{1} - 1) + (2^{2} - 1) + \cdots + (2^{n} - 1) - 1\right] \\
&= 2^{n}\left[2\,(1 + 2 + \cdots + 2^{n-1}) - (n + 1)\right] \\
&= 2^{n}\left[2^{n+1} - n - 3\right]
\end{aligned} \tag{5}
$$

The correction factor for identical pairs shifting, denoted as CF2, results from identical partial product pairs being shifted from the column with weight $2^n$ to the column with weight $2^0$ in the partial product matrix, and the correction factor of each shift is $2^n$. Therefore, the value of CF2 is determined by the number of identical partial product pairs in one column and is given as:

$$\mathrm{CF2} = \begin{cases} \dfrac{n-2}{2}\,2^{n} & \text{when } n \text{ is even} \\[6pt] \dfrac{n-1}{2}\,2^{n} & \text{when } n \text{ is odd} \end{cases} \tag{6}$$

The third part of the correction factor, CF3, is generated during the partial products compressing process. In this process, the most significant bit of each carry-out has to be shifted from the leftmost column to the rightmost column, which results in a correction factor with the value of $2^n$, and the number of shifted terms depends on the value of n. Considering even and odd values of n respectively, the vector number of partial products including the correction factor is:

$$\mathrm{VNPP} = \begin{cases} n - \dfrac{n-4}{2} + 1 & \text{when } n \text{ is even} \\[6pt] n - \dfrac{n-3}{2} + 1 & \text{when } n \text{ is odd} \end{cases} \tag{7}$$

where VNPP is the vector number of partial products.

These partial product vectors are used to produce the final Sum and Carry vectors, so that the correction factor CF3 is given as:

$$\mathrm{CF3} = \begin{cases} \dfrac{n+2}{2}\,2^{n} & \text{when } n \text{ is even} \\[6pt] \dfrac{n+1}{2}\,2^{n} & \text{when } n \text{ is odd} \end{cases} \tag{8}$$
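To see where the per-carry correction of 2^n comes from, the following sketch performs one 3:2 carry-save step on n-bit vectors and wraps the carry-out of weight 2^n back to the least significant position. It assumes the wrapped bit is inverted in the same way as in (3); the function name `csa_mod` and its return convention are hypothetical and only meant to illustrate the arithmetic.

```python
def csa_mod(a, b, c, n):
    """One 3:2 carry-save step on n-bit vectors modulo 2**n + 1.

    Returns (sum_vector, carry_vector, correction) such that
    a + b + c is congruent to their total modulo 2**n + 1.
    """
    mask = (1 << n) - 1
    s = a ^ b ^ c                               # bitwise sum
    carry = (a & b) | (a & c) | (b & c)         # bitwise majority
    msb = (carry >> (n - 1)) & 1                # carry-out, weight 2**n
    # shift the carries left; the inverted carry-out re-enters at weight 2**0
    carry_vector = ((carry << 1) & mask) | (1 - msb)
    return s, carry_vector, 1 << n              # each wrap costs 2**n
```

Because every carry vector produced during compression contributes one such term, the total CF3 is the number of carry vectors multiplied by 2^n, in line with (8).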

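Therefore, the overall correction factor for the modulo 2^n+1 squarer can be computed by summing up the three correction factors as follows:

$$
\begin{aligned}
\mathrm{CF_{all}} &= \mathrm{CF1} + \mathrm{CF2} + \mathrm{CF3} \\
&= \begin{cases}
\left|\left(\dfrac{n-2}{2} + \dfrac{n+2}{2} + 2^{n+1} - n - 3\right) 2^{n}\right|_{2^n+1} & \text{when } n \text{ is even} \\[8pt]
\left|\left(\dfrac{n-1}{2} + \dfrac{n+1}{2} + 2^{n+1} - n - 3\right) 2^{n}\right|_{2^n+1} & \text{when } n \text{ is odd}
\end{cases} \\
&= \left|\left(2^{n+1} - 3\right) 2^{n}\right|_{2^n+1} = 5
\end{aligned} \tag{9}
$$

where n ≥ 3.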
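The arithmetic in (5) to (9) can be verified numerically. The sketch below simply evaluates the three correction factors as written above and reduces their sum modulo 2^n+1; the function names are chosen for this example only.

```python
def cf1(n):
    """CF1 expanded as in (5): 2**n * [(2**1-1) + ... + (2**n-1) - 1]."""
    return (1 << n) * (sum((1 << m) - 1 for m in range(1, n + 1)) - 1)

def cf2(n):
    """CF2 from (6): one 2**n per shifted identical pair."""
    pairs = (n - 2) // 2 if n % 2 == 0 else (n - 1) // 2
    return pairs * (1 << n)

def cf3(n):
    """CF3 from (8): one 2**n per wrapped carry-out."""
    carries = (n + 2) // 2 if n % 2 == 0 else (n + 1) // 2
    return carries * (1 << n)

def cf_all(n):
    """Overall correction factor of (9), reduced modulo 2**n + 1."""
    return (cf1(n) + cf2(n) + cf3(n)) % ((1 << n) + 1)
```

Since 2^n is congruent to -1 modulo 2^n+1, the factor (2^(n+1) - 3)·2^n reduces to (-2 - 3)·(-1) = 5, which is why the result no longer depends on n once n ≥ 3.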

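The final partial product matrix including the overall correction factor is shown in Table-5. To convert the modulo 2^n+1 adder to a modulo 2^n adder, as explained in detail in Section 2.3, only part of the correction factor is added to the final partial product matrix in this step.

Table-5 Final Partial Product Matrix

$$
\begin{array}{ccccccc}
2^{n-1} & 2^{n-2} & 2^{n-3} & \cdots & 2^{2} & 2^{1} & 2^{0} \\ \hline
p_{n-2,0} & p_{n-3,0} & p_{n-4,0} & \cdots & p_{1,0} & \bar{p}_{n-1,1} & p_{0,0} \\
p_{n-3,1} & p_{n-4,1} & p_{n-5,1} & \cdots & p_{1,1} & \bar{p}_{n-2,2} & \bar{p}_{n-1,0} \\
p_{n-4,2} & p_{n-5,2} & p_{n-6,2} & \cdots & \bar{p}_{n-1,2} & \bar{p}_{n-3,3} & \bar{p}_{n-2,1} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
p_{\frac{n-3}{2},\frac{n-1}{2}} & p_{\frac{n-5}{2},\frac{n-1}{2}} & p_{\frac{n-5}{2},\frac{n-3}{2}} & \cdots & \bar{p}_{\frac{n-1}{2},\frac{n+3}{2}} & \bar{p}_{\frac{n-1}{2},\frac{n+1}{2}} & \bar{p}_{\frac{n-3}{2},\frac{n+1}{2}} \\
\bar{p}_{n-2,n} & \bar{p}_{n-3,n} & \bar{p}_{n-4,n} & \cdots & \bar{p}_{1,n} & \bar{p}_{0,n} & p_{n-1,n} \\
0 & 0 & 0 & \cdots & 1 & 0 & p_{n,n}
\end{array}
$$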

2.2 Partial Product Matrix Compression

To implement the final addition stage by the modulo $2^n+1$ adder, each column in the final partial product matrix has to be compressed to obtain the final Sum and Carry Vector. Hence, the Wallace Tree structure, which is well known to optimally implement multi-operand binary addition, is introduced [18] and the required stage number of the Wallace Tree is determined as follows [19]:

$$N(V_{NPP}) \approx \log_{1.5} K \qquad (10)$$

The required number of Wallace Tree stages for different VNPP is given in Table-6. It can be observed that the Wallace Tree structure becomes more efficient, meaning that more power and gate delay can be saved, as the bit number of the input increases.

Table-6 Number of Wallace Tree Stages

| VNPP | 0~2 | 3 | 4 | 5~6 | 7~9 | 10~13 | 14~19 | 20~28 | 29~42 | 43~63 |
|---|---|---|---|---|---|---|---|---|---|---|
| N(VNPP) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
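The entries of Table-6 can be reproduced by iterating the 3:2 reduction directly. The sketch below assumes that each stage groups as many triples of vectors as possible and passes the leftover vectors through unchanged; the function name `wallace_stages` is hypothetical.

```python
def wallace_stages(rows):
    """Count the 3:2 compression stages needed to reduce `rows` vectors to two."""
    stages = 0
    while rows > 2:
        # Every full group of three vectors becomes two (sum and carry);
        # the remaining one or two vectors pass to the next stage.
        rows = 2 * (rows // 3) + rows % 3
        stages += 1
    return stages


# Upper end of each range in Table-6 and the expected stage count.
table_6 = {2: 0, 3: 1, 4: 2, 6: 3, 9: 4, 13: 5, 19: 6, 28: 7, 42: 8, 63: 9}
for rows, expected in table_6.items():
    assert wallace_stages(rows) == expected
```

For example, ten partial product vectors are reduced as 10, 7, 5, 4, 3, 2, which gives five stages.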

To illustrate the analysis clearly, the schematic of the partial product compression process for a 16-bit modulo $2^n+1$ squarer is shown in Figure 1. According to (7), the partial product vector number of the 16-bit modulo $2^n+1$ squarer is 10 and 5 Wallace Tree stages are required for this example.

Figure 1: Wallace tree Compression Process (columns weighted $2^0$ … $2^{15}$)

In Figure 1, one symbol denotes a partial product and a second symbol denotes an inverted partial product.

In general, the superior performance of the Wallace Tree structure in regular multipliers and squarers comes at the high expense of irregular interconnection and wasted area [20]. In the modulo $2^n+1$ squarer, terms with weight larger than $2^n$ have to be repositioned and the most significant bits of the Carry vectors, generated during the compression process, have to be fed back to the least significant bit as well, so that the shape of this partial product matrix is always rectangular and the number of partial products in each column is always equal, as shown in Figure 1. This means that the wasted area and irregular interconnection can be dramatically reduced in the modulo $2^n+1$ squarer.

2.3 Final Addition Stage

An n-bit Carry vector and an n-bit Sum vector are generated by the partial product compression process. To obtain the final result of the modulo $2^n+1$ squarer, these two vectors need to be modulo $2^n+1$ added in this final stage.

Assuming $A = a_n a_{n-1} \dots a_0$ and $B = b_n b_{n-1} \dots b_0$ are two n-bit operands of the modulo $2^n+1$ adder, then the sum of A and B can be represented as:

$$|A + B|_{2^n+1} = \begin{cases} A + B - (2^n + 1) & \text{if } A + B \ge 2^n + 1 \\ A + B & \text{if } A + B < 2^n + 1 \end{cases} \qquad (11)$$

Therefore, (11) could be rewritten as:

$$\begin{aligned} |A + B + 1|_{2^n+1} &= \begin{cases} A + B + 1 - (2^n + 1) & \text{if } A + B + 1 \ge 2^n + 1 \\ A + B + 1 & \text{if } A + B + 1 < 2^n + 1 \end{cases} \\ &= \begin{cases} A + B - 2^n & \text{if } A + B \ge 2^n \\ A + B + 1 & \text{if } A + B < 2^n \end{cases} \\ &= \begin{cases} |A + B|_{2^n} & \text{if } A + B \ge 2^n \\ A + B + 1 & \text{if } A + B < 2^n \end{cases} \end{aligned} \qquad (12)$$

Since the Cout of the final result is equal to 0 if $A + B < 2^n$, otherwise it is equal to 1, (12) can be considered as described in [9]:

$$|A + B + 1|_{2^n+1} = |A + B + \overline{C_{out}}|_{2^n} \qquad (13)$$
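Identity (13) can be checked exhaustively for a small width. The sketch below works on plain integers and assumes both operands lie in $[0, 2^n)$; the function names are hypothetical. Only the low n bits are compared, because the single result value $2^n$ (reached when $A + B + 1 = 2^n$) does not fit in n bits and would need separate handling.

```python
def mod_add_plus_one(a, b, n):
    """Left-hand side of (13): |A + B + 1| modulo 2^n + 1."""
    return (a + b + 1) % (2 ** n + 1)


def inverted_eac_add(a, b, n):
    """Right-hand side of (13): n-bit add with the inverted carry-out as carry-in."""
    c_out = (a + b) >> n              # 1 when A + B >= 2^n, otherwise 0
    return (a + b + (1 - c_out)) % (2 ** n)


n = 4
for a in range(2 ** n):
    for b in range(2 ** n):
        assert mod_add_plus_one(a, b, n) % (2 ** n) == inverted_eac_add(a, b, n)
```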

Therefore, the final stage addition of the modulo $2^n+1$ squarer can be implemented by an Inverted EAC modulo $2^n$ adder, an adder more suitable for VLSI implementation, by adding the constant "1" to the input vectors. This constant "1" comes from the correction factor, so that no extra operation or hardware is needed.

To compute the value of the carryout of each column, denoted as $C_i$, let's define the generate bit as $g_i = a_i \cdot b_i$ and the propagate bit as $p_i = a_i + b_i$, so that the generate and propagate group $(G_i, P_i)$ in a regular parallel prefix adder, where $G_i = C_i$, can be computed as follows:

$$\begin{aligned} (G_i, P_i) &= (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \dots \circ (g_1, p_1) \circ (g_0, p_0) \\ &= (g_i + p_i \cdot g_{i-1},\ p_i \cdot p_{i-1}) \circ \dots \circ (g_1, p_1) \circ (g_0, p_0) \\ &= (G_{i:n+1} + P_{i:n+1} \cdot G_{n:0},\ P_{i:n+1} \cdot P_{n:0}) \end{aligned} \qquad (14)$$

where $-1 \le n \le i-1$.

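According to (13) and $G^*_{-1} = \overline{C_{out}} = \overline{G_{n-1}}$, the generate bits for the modulo $2^n+1$ adder, $G^*_i$, are given as:

$$G^*_i = \begin{cases} G_i + P_i \cdot G^*_{-1} = G_i + P_i \cdot \overline{G_{n-1}} & 0 \le i \le n-2 \\ \overline{G_{n-1}} & i = -1 \end{cases} \qquad (15)$$

Therefore, the generate and propagate group for the modulo $2^n+1$ adder, denoted as $(G^*_i, P^*_i)$, is given as:

$$\begin{aligned} (G^*_i, P^*_i) &= (G_i, P_i) \circ \overline{(G_{n-1}, P_{n-1})} \\ &= (G_i, P_i) \circ (\overline{G_{n-1}}, P_{n-1}) \\ &= (G_i + P_i \cdot \overline{(G_{n-1:i+1} + P_{n-1:i+1} \cdot G_i)},\ P_i \cdot P_{n-1}) \\ &= (G_i + P_i \cdot \overline{G_{n-1:i+1}} \cdot (\overline{P_{n-1:i+1}} + \overline{G_i}),\ P_i \cdot P_{n-1:i+1}) \\ &= (G_i, P_i) \circ (\overline{G_{n-1:i+1}}, P_{n-1:i+1}) \\ &= (G_i, P_i) \circ \overline{(G_{n-1:i+1}, P_{n-1:i+1})} \\ &= \overline{(\overline{P_i}, \overline{G_i}) \circ (G_{n-1:i+1}, P_{n-1:i+1})} \end{aligned} \qquad (16)$$

where $0 \le i \le n-2$.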

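Taking (15) and (16) into account, the mathematical expressions of each carryout for the modulo $2^n+1$ adder can be computed using the following set of equations:

$$\begin{aligned} C^*_{-1} &= \overline{C_n} = \overline{(g_0, p_0) \circ (g_1, p_1) \circ \dots \circ (g_n, p_n)} \\ C^*_0 &= (g_0, p_0) \circ \overline{(g_1, p_1) \circ \dots \circ (g_n, p_n)} \\ C^*_1 &= (g_0, p_0) \circ (g_1, p_1) \circ \overline{(g_2, p_2) \circ \dots \circ (g_n, p_n)} \\ &\ \ \vdots \\ C^*_{n-2} &= (g_0, p_0) \circ \dots \circ (g_{n-2}, p_{n-2}) \circ \overline{(g_{n-1}, p_{n-1}) \circ (g_n, p_n)} \\ C^*_{n-1} &= (g_0, p_0) \circ \dots \circ (g_{n-1}, p_{n-1}) \circ \overline{(g_n, p_n)} \end{aligned} \qquad (17)$$

Hence, the final result of the modulo $2^n+1$ adder is given as:

$$S_i = a_i \oplus b_i \oplus C_{i-1} \qquad (18)$$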
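A bit-level sketch of this final addition stage is given here for illustration. It assumes n-bit operands with bits indexed 0 to n-1, builds the group terms $(G_i, P_i)$ serially rather than with a parallel prefix tree, and uses (15) for the carries and (18) for the sum bits. The names `prefix_op` and `inverted_eac_add_bits` are hypothetical.

```python
def prefix_op(left, right):
    """(G, P) o (G', P') = (G + P*G', P*P'); `left` covers the more significant bits."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def inverted_eac_add_bits(a, b, n):
    """n-bit addition whose carry-in is the inverted carry-out of A + B."""
    # Bit-level generate g_i = a_i AND b_i and propagate p_i = a_i OR b_i.
    gp = [((a >> i) & (b >> i) & 1, ((a >> i) | (b >> i)) & 1) for i in range(n)]
    # Group terms (G_i, P_i) over bits i..0, accumulated serially for clarity.
    groups = [gp[0]]
    for i in range(1, n):
        groups.append(prefix_op(gp[i], groups[i - 1]))
    carry_in = 1 - groups[n - 1][0]                    # G*_{-1} = NOT G_{n-1}
    carries = [g | (p & carry_in) for g, p in groups]  # G*_i = G_i + P_i * NOT G_{n-1}
    result = 0
    for i in range(n):
        c_prev = carry_in if i == 0 else carries[i - 1]
        s = ((a >> i) ^ (b >> i) ^ c_prev) & 1         # S_i = a_i XOR b_i XOR C*_{i-1}
        result |= s << i
    return result


# The low n bits agree with |A + B + 1| modulo 2^n + 1 for every 4-bit operand pair.
for a in range(16):
    for b in range(16):
        assert inverted_eac_add_bits(a, b, 4) == ((a + b + 1) % 17) % 16
```

A hardware implementation would replace the serial loop with a prefix tree; the loop is used here only to keep the correspondence with (15) easy to follow.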

2.4 Computation Example

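In this section, a computation example is provided to further explain the algorithm mentioned above.

Let x=87 be the input of the modulo $2^n+1$ squarer; its binary format is x=01010111. Thus, the initial partial product matrix is shown in Table-7.

Table-7 Initial Partial Product Matrix of example

0 1 0 1 0 1 1 1 × 0 1 0 1 0 1 1 1

| Row | Partial product vector | Left shift (bit positions) |
|---|---|---|
| 1 | 0 1 0 1 0 1 1 1 | 0 |
| 2 | 0 1 0 1 0 1 1 1 | 1 |
| 3 | 0 1 0 1 0 1 1 1 | 2 |
| 4 | 0 0 0 0 0 0 0 0 | 3 |
| 5 | 0 1 0 1 0 1 1 1 | 4 |
| 6 | 0 0 0 0 0 0 0 0 | 5 |
| 7 | 0 1 0 1 0 1 1 1 | 6 |
| 8 | 0 0 0 0 0 0 0 0 | 7 |

Then, the partial product matrix is modified and the terms with a weight greater than $2^6$ will be repositioned as shown in Table-8.

Table-8 Repositioned Partial Product Matrix of example

| $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 1 |
| 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |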
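The rows of Table-8 can be regenerated with a short sketch. It rests on an assumption inferred from the table rather than stated in this section: each row j holds the terms $a_i \cdot a_j$ for i = 0 … n-1, a term whose weight i + j reaches n is moved to weight i + j - n and inverted, and the terms with i = n are not modelled (they vanish here because the top bit of x = 87 is 0). The function name `repositioned_matrix` is hypothetical.

```python
def repositioned_matrix(x, n):
    """Rows of the repositioned partial product matrix as bit strings, MSB first."""
    a = [(x >> i) & 1 for i in range(n + 1)]        # a[0] .. a[n]
    rows = []
    for j in range(n + 1):                          # one row per multiplier bit a_j
        bits = []
        for w in range(n - 1, -1, -1):              # column weights 2^(n-1) .. 2^0
            if w >= j:
                bits.append(a[w - j] & a[j])            # term keeps its position
            else:
                bits.append(1 - (a[w - j + n] & a[j]))  # wrapped term, inverted
        rows.append("".join(str(bit) for bit in bits))
    return rows


for row in repositioned_matrix(87, 7):
    print(row)
```

Because $2^n \equiv -1$ modulo $2^n+1$, a wrapped term p of weight $2^w$ contributes $-p \cdot 2^w = (\overline{p} - 1) \cdot 2^w$, so each inversion leaves a surplus of $2^w$ that has to be absorbed by the correction factor.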

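Before the partial product compression, the identical pairs in the matrix are shifted to the right column and the correction factor is also included in this matrix as shown in Table-9.

The modified partial product matrix will be compressed using the Wallace Tree structure and then the final Carry and Sum vectors will be obtained. To input them into the modulo $2^n+1$ adder, the most significant bit of the Carry vector needs to be shifted to the column with weight $2^0$ and inverted. The compression process is shown in Table-10.

Table-9 Shifted Partial Product Matrix of example

| $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 0 | 0 | 0 | 1 | 0 | 1 |

Table-10 Compression Process of example

| Stage | $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
|---|---|---|---|---|---|---|---|
| Input | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| Input | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| Input | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| Input | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| Input | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Input | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| After stage 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| After stage 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| After stage 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| After stage 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| After stage 2 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| After stage 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| After stage 2 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| After stage 3 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| After stage 3 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |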
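One compression step with the inverted end-around carry can be sketched as follows. The function `csa_mod` is a hypothetical name, and the sketch assumes the step described in the text: a bitwise 3:2 compression whose carry vector is shifted left by one position, with its most significant bit inverted and placed in the column of weight $2^0$.

```python
def csa_mod(a, b, c, n):
    """3:2 compression of three n-bit vectors with inverted end-around carry."""
    mask = (1 << n) - 1
    s = a ^ b ^ c                                  # bitwise sum
    maj = (a & b) | (a & c) | (b & c)              # bitwise carry (majority)
    msb = maj >> (n - 1)                           # carry bit leaving the top column
    carry = ((maj << 1) & mask) | (1 - msb)        # shift left, feed back inverted MSB
    return s, carry


# First three vectors of Table-9.
print(csa_mod(0b0101101, 0b1011010, 0b0101111, 7))
# Three vectors that reduce to the final Sum (103) and Carry (113) of the example.
print(csa_mod(0b0111010, 0b1100001, 0b0111100, 7))
```

Each such step satisfies $s + carry \equiv a + b + c + 1$ modulo $2^n+1$, which is the surplus that the correction factor CF3 accounts for.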

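Finally, the final Sum and Carry Vector are modulo $2^7+1$ added to obtain the final computation result as shown in Table-11.

Table-11 Final Result of example

| | $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ | Value |
|---|---|---|---|---|---|---|---|---|
| | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 103 |
| + | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 113 |
| = | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 87 |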

2.5 Algorithm Performance Comparison

In [1], an efficient algorithm for the modulo $2^n+1$ squarer, based on the one presented by Vergos and Efstathiou in [13], is described. To demonstrate the advantages of the design in this thesis, a performance comparison between these two algorithms is presented, and the simulation result of an 8-bit example for each algorithm is also introduced to make the comparison more straightforward.

2.5.1 Existing Algorithm Review

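Different from the proposed algorithm, an OR operation is executed at first in [1]. Then, the terms in each column with weight greater than $2^{n-1}$ are repositioned to the corresponding position on the right. The partial product matrix after modification and repositioning is shown in Table-12 and the correction factor is also included. Since the number of repositioned bits and identical pairs is different in the existing algorithm, the new overall correction factor is 3.

Table-12 Modified Partial Product Matrix of Existing Algorithm

| $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
|---|---|---|---|---|---|---|
| $p_{n-1,0} \vee q_{n-1}$ | $p_{n-2,0}$ | $p_{n-3,0}$ | … | $p_{2,0}$ | $p_{1,0}$ | $p_{0,0} \vee q_{n-1} \vee p_{n,n}$ |
| $p_{n-2,1}$ | $p_{n-3,1}$ | $p_{n-4,1}$ | … | $p_{1,1}$ | $p_{0,1}$ | $\overline{p_{n-1,1} \vee q_0}$ |
| $p_{n-3,2}$ | $p_{n-4,2}$ | $p_{n-5,2}$ | … | $p_{0,2}$ | $\overline{p_{n-1,2} \vee q_1}$ | $\overline{p_{n-2,2}}$ |
| … | … | … | … | … | … | … |
| $p_{1,n-2}$ | $p_{0,n-2}$ | $\overline{p_{n-1,n-2} \vee q_{n-3}}$ | … | $\overline{p_{4,n-2}}$ | $\overline{p_{3,n-2}}$ | $\overline{p_{2,n-2}}$ |
| $p_{0,n-1}$ | $\overline{p_{n-1,n-1} \vee q_{n-2}}$ | $\overline{p_{n-2,n-1}}$ | 0 | $\overline{p_{3,n-1}}$ | $\overline{p_{2,n-1}}$ | $\overline{p_{1,n-1}}$ |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 |

where $q_i = p_{n,i}$ or $p_{i,n}$ and the symbol ∨ stands for the OR operation.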

As mentioned in the proposed algorithm, the values of $p_{i,j}$ and $p_{j,i}$ are always equal, so that the identical pairs in each column can be shifted to the left to further reduce the number of partial product vectors, and the identical pairs in the leftmost column should be inverted and fed back to the rightmost column.

In [1], it is said that the optimal possible implementation for partial product matrix compression can be achieved by using compressors with a large number of input bits, which can save more power and delay compared with compressors with a small number of input bits. Taking the modulo $2^{15}+1$ squarer as an example, the optimal compression configuration will consist of a 7:2 compressor, a 4:2 compressor and a 3:2 compressor in order. The most significant bits of each Carry vector, generated during the compression process, need to be inverted and fed back as well.

Finally, the final Sum and Carry vectors are modulo $2^n+1$ added by an Inverted EAC modulo $2^n$ adder to obtain the final computation result.

2.5.2 Performance Analysis

Considering the excellent performance of Inverted EAC adder, demonstrated in 2.3,

the proposed algorithm is mainly improved in the partial product generation and

repositioning stage and partial product matrix compression stage, in following several

aspects:

1. By repositioning the last vector of partial products as mentioned in the previous section, the OR operation is no longer needed in the proposed algorithm. Hence, an OR gate can be removed from the critical path to increase computation speed and reduce power consumption and area. Though the number of partial product vectors in the proposed algorithm is one more than that of the existing algorithm, the number of vectors that need to be compressed is the same because there is one more identical pair in each column.

2. The input range of the existing algorithm is extended. For inputs in the form of 1Z1, where Z is an (n-2)-bit vector, both $p_{n,i}$ and $p_{0,0}$ will be "1", so that the OR operation is no longer valid to compute the value of the term with weight $2^0$, and the same condition happens to $p_{n,n-1}$ as well. In the proposed algorithm, because there isn't any operation between terms in the matrix before partial product compression, the correctness of the final computation result is no longer influenced by the value of the input, so that the range of input is extended from $[0, 2^n-1)$ to $[0, 2^{n+1}-1]$ [13].

3. In [1], the best possible implementation of partial product compression is described as using compressors with a larger number of inputs. In the proposed algorithm, compressors with a large number of inputs are replaced by 3:2 compressors arranged in a Wallace Tree structure. Due to the simpler structure and better critical path delay of 3:2 compressors, the performance of the compression process is further improved. For example, an efficient 7:2 compressor is proposed in [21]. According to the proposed algorithm, the 7:2 compressor can be equivalently replaced by a Wallace Tree structure using 3:2 compressors and the performance analysis is shown in Table-13.

Table-13 Performance Comparison of Compressor

| | 7:2 Compressor | 3:2 Compressor based Wallace Tree |
|---|---|---|
| Number of Input/Output | 9/4 | 9/4 |
| Gate on Critical Path | 6 | 4 |
| Transistor Count | 172 | 150 |

Unfortunately, the proposed algorithm is not perfect either. From (9), it can be found that the correction factor of 5 is valid only under the condition that the bit number of the input is not less than 3, which imposes a stricter requirement on the proposed algorithm. However, this drawback will not have much negative influence on the application of the proposed algorithm, since modulo $2^n+1$ squarers with a 2-bit or 1-bit input are rarely employed in practice.

To verify the analysis mentioned above, the simulation results of the first two stages of the modulo $2^n+1$ squarer based on the proposed algorithm and the existing algorithm are shown in Table-14.

Table-14 Performance Comparison of First Two Stages

| | Existing Algorithm | Proposed Algorithm | Performance Improvement |
|---|---|---|---|
| Delay (ps) | 314.3 | 276.3 | 13.8% |
| Power (μW) | 69.35 | 56.50 | 22.7% |
| PDP (J) | 22.8e-15 | 15.6e-15 | 46.2% |
| Gate Count | 138 | 120 | 15% |
| Transistor Count | 1620 | 1374 | 17.9% |

III. Circuit Implementation on CMOS Technology

Based on the proposed algorithm and the performance comparison between different possible implementations, the modulo $2^n+1$ squarer is implemented on 32nm CMOS technology in this section.

The circuit implementation of the modulo $2^n+1$ squarer is divided into three parts: the partial product generation and repositioning module, the Wallace tree compression module and the modulo $2^n+1$ adder.

3.1 Partial Product Generation and Repositioning Module

In this module, the partial product matrix, shown in Table-1, is generated by simple NAND gates. Instead of simply adding inverters at the output ports of the NAND gates, the output of each NAND gate can be used in the compression module directly in our design, so that a large number of inverters can be saved in this module. The schematic of the NAND gate is shown in Figure 2.

Figure 2: Schematic of nand gate (inputs A and B, output Out, supplies Vdd and GND)

3.2 Wallace Tree Compression Module

3.2.1 Design of 3:2 Compressors

Different from the traditional Wallace tree compression configuration, 3:2 compressors, governed by (19), are employed to replace full adders in this thesis. The schematics of the full adder and the 3:2 compressor are shown in Figure 3 and Figure 4 respectively.

$$X_1 + X_2 + X_3 = \mathrm{Sum} + 2 \times \mathrm{Carry} \qquad (19)$$

Figure 3: Schematic of full adder (inputs A, B and Cin; outputs Sum and Carry)

Figure 4: Schematic of 3:2 Compressor (an XOR-XNOR gate and two MUXes; inputs X1, X2 and X3; outputs Sum and Carry)
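Relation (19) can be checked against a behavioural model of a compressor built from one XOR stage and two multiplexers. The wiring below is an assumption based on the common XOR/MUX arrangement and the block names in Figure 4; the thesis schematic is not reproduced here, and the function names are hypothetical.

```python
def mux(sel, when0, when1):
    """2:1 multiplexer: returns `when1` if `sel` is 1, otherwise `when0`."""
    return when1 if sel else when0


def compressor_3_2(x1, x2, x3):
    """Behavioural 3:2 compressor: returns (Sum, Carry)."""
    xor12 = x1 ^ x2
    total = mux(xor12, x3, 1 - x3)   # Sum = x1 XOR x2 XOR x3
    carry = mux(xor12, x1, x3)       # Carry = majority(x1, x2, x3)
    return total, carry


# Exhaustive check of X1 + X2 + X3 = Sum + 2 * Carry.
for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            s, c = compressor_3_2(x1, x2, x3)
            assert x1 + x2 + x3 == s + 2 * c
```

When $X_1$ and $X_2$ are equal, they decide the carry by themselves and the sum equals $X_3$; when they differ, $X_3$ decides the carry and the sum is its complement.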

Compared with the traditional full adder structure, the number of gates on the critical path for each Wallace tree stage is reduced from 3 to 2. Considering the increased number of Wallace tree stages required for a large input width of the modulo $2^n+1$ squarer and the total critical path delay of the compression module, given as (20), the critical path delay of the entire module can be significantly improved.

$$T_{Com\_M} = n \times T_{S\_C} \qquad (20)$$

where n is the number of Wallace tree stages and $T_{S\_C}$ is the delay of a single compression stage.

The performance comparison between a single full adder and a 3:2 compressor is also shown in Table-15 in detail. The critical path delay of the 3:2 compressor is 49% faster than that of the full adder, with a 21.9% lower power consumption. In addition to the analysis above, the partial product compression modules based on the traditional and the modified Wallace tree are only combinations of full adders and 3:2 compressors respectively, so that the total power consumption will be significantly reduced as well.

Table-15 Performance Comparison between Full Adder and 3:2 Compressor

| | Full Adder | 3:2 Compressor | Performance Improvement |
|---|---|---|---|
| Delay (ps) | 76.3 | 51.2 | 49% |
| Power (μW) | 1.988 | 1.631 | 21.9% |
| PDP (J) | 151.68e-18 | 83.51e-18 | 81.6% |
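The figures in Table-15 are consistent with a power-delay product computed as delay times power and with improvements expressed relative to the improved (new) value. That reading is an inference from the numbers, not a definition given in the text; the helper names below are hypothetical.

```python
def improvement(old, new):
    """Relative improvement in percent, measured against the new value."""
    return (old - new) / new * 100.0


def pdp(delay_ps, power_uw):
    """Power-delay product in joules from a delay in ps and a power in microwatts."""
    return delay_ps * 1e-12 * power_uw * 1e-6


print(round(improvement(76.3, 51.2), 1))      # delay improvement in percent
print(round(improvement(1.988, 1.631), 1))    # power improvement in percent
print(pdp(76.3, 1.988), pdp(51.2, 1.631))     # PDP of full adder and 3:2 compressor
```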

3.2.2 Design of Wallace Tree Compression Configuration

The Wallace tree structure is an efficient hardware implementation for multiplication.

In this structure, any three partial products with the same weight are input into a 3:2 compressor until the final carry and sum vectors are obtained. Hence, one third of the available partial products in each column can be removed at the expense of only two gate delays.

As mentioned in section 3.1, some of the terms in the partial product matrix need to be inverted before compression, so that extra inverters have to be added into the critical path. Based on (21), the extra inverters can actually be eliminated from the critical path by moving them into the bypass of certain 3:2 compressors as shown in Figure 5.

$$\overline{a} \oplus \overline{b} = a \oplus b \qquad (21)$$

Figure 5: Modified 3:2 Compressor (an XOR-XNOR gate and two MUXes; inputs X1, X2 and X3; outputs Sum and Carry)

To clarify the Wallace tree compression configuration in the modulo $2^n+1$ squarer and demonstrate the analysis in section 2.5, the compression processes of the modulo $2^{15}+1$ squarer based on both the proposed algorithm and the existing algorithm in [1] are introduced respectively.

For the modulo $2^{15}+1$ squarer, the number of partial product vectors after shifting will be 10. Therefore, the optimal compression configuration based on the existing algorithm in [1] consists of a 7:2 compressor, a 4:2 compressor and a 3:2 compressor as shown in Figure 6. For the proposed algorithm, 5 Wallace tree stages are required and its schematic is shown in Figure 7.

Figure 6: Compression process based on Existing Algorithm (a 7:2 compressor, a 4:2 compressor and a 3:2 compressor; inputs X0~X8 and X9; outputs Final Sum and Final Carry)

Figure 7: Compression process based on Proposed Algorithm (eight 3:2 compressors; inputs X9~X7, X6~X4, X3~X1 and X0; outputs Final Sum and Final Carry)

As shown in Figure 8 and Figure 9, the critical path delay of the Wallace tree compression configuration is almost 17% faster than that of the traditional one. In addition, the simulation result in Table-16 also shows that the Wallace tree compression configuration performs much better in terms of power consumption and area. Considering the large contribution of the compression process to the delay and power of the entire circuit, the Wallace tree compression configuration can efficiently improve the overall performance.

Figure 8: Critical path delay of traditional compression process

Figure 9: Critical path delay of proposed compression process

Table-16 Performance Comparison between Different Compression Processes

| | Existing Compression Configuration in [1] | Wallace Tree Configuration | Performance Improvement |
|---|---|---|---|
| Delay (ps) | 457.2 | 392.1 | 16.6% |
| Power (μW) | 12.2 | 8.64 | 41.2% |
| PDP (J) | 5.58e-15 | 3.39e-15 | 64.7% |
| Gate Count | 24 | 24 | - |


3.3 Modulo 2^n+1 Adder

In this section, an inverted EAC modulo $2^n+1$ adder is designed to implement the final addition stage. The performance of various existing modulo $2^n+1$ adders, reported in [22], is summarized in Table-17.

Table-17 Performance Comparison of existing modulo 2^n+1 adder

| Architecture | Delay (ns), N=8 | Number of operators, N=8 | Delay (ns), N=16 | Number of operators, N=16 |
|---|---|---|---|---|
| Sklansky [9] | 0.63 | 40 | 0.76 | 80 |
| Kogge-Stone [9] | 0.62 | 65 | 0.74 | 161 |
| Parallel-Prefix [23] | 0.50 | 68 | 0.62 | 196 |
| Proposed FPP | 0.44 | 60 | 0.54 | 160 |
| Proposed RAPP | 0.51 | 52 | 0.75 | 144 |

where N is the bit number of the input.

In Table-17, the proposed fast parallel-prefix modulo $2^n+1$ adder (FPP) is considered the fastest possible implementation with an acceptable number of operators. Different from other parallel-prefix adders, the Ling equation is utilized to compute the carryout in the proposed FPP. In the Ling equation, a pseudo carryout ($H_i$) is proposed as given in (22), which allows a single local propagate signal to be removed from the critical path [22]. The schematic of the proposed FPP modulo $2^n+1$ adder is shown in Figure 10.

$$C_i = H_i \cdot p_i \qquad (22)$$

where $C_i$ is the traditional carryout.

Figure 10: Schematic of Proposed FPP (8-bit example with pseudo carries H7 … H0 and sum outputs S7 … S0)
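Relation (22) can be illustrated with a ripple-style model. The sketch assumes the usual Ling definition of the pseudo carry, $H_i = g_i + C_{i-1}$, which is not spelled out in the text, together with $g_i = a_i \cdot b_i$ and $p_i = a_i + b_i$ and a zero carry-in. The function name is hypothetical.

```python
def carries_and_pseudo_carries(a, b, n):
    """Return a list of (C_i, H_i, p_i) for an n-bit addition with zero carry-in."""
    c_prev = 0
    signals = []
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p = ai & bi, ai | bi
        h = g | c_prev             # assumed Ling pseudo carry: H_i = g_i + C_{i-1}
        c = g | (p & c_prev)       # conventional carry: C_i = g_i + p_i * C_{i-1}
        signals.append((c, h, p))
        c_prev = c
    return signals


# C_i = H_i * p_i holds for every bit of every 4-bit operand pair.
for a in range(16):
    for b in range(16):
        assert all(c == (h & p) for c, h, p in carries_and_pseudo_carries(a, b, 4))
```

The identity holds because $g_i$ implies $p_i$, so $p_i \cdot (g_i + C_{i-1}) = g_i + p_i \cdot C_{i-1}$.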

For wide operands, the proposed FPP modulo $2^n+1$ adder suffers from area and power issues due to the large number of operators. Furthermore, the complex wire routing of the proposed FPP will further degrade its performance in practice and increase the implementation difficulty. Therefore, a sparse-tree based inverted EAC modulo $2^n+1$ adder is implemented in this thesis, based on the algorithm discussed in Section 2.3.

The sparse tree modulo $2^n+1$ adder combines the advantages of the parallel prefix adder and the conditional sum generator. It has the minimum logic depth of $\log_2 n$ and its maximum fanout is 3. Compared with the proposed FPP, the sparse tree adder computes the carryout into each 4-bit group using a valency-2 tree structure similar to Sklansky, so that the number of operators can be dramatically reduced to achieve a lower-power and more area-efficient implementation with much simpler wire routing [24]. In addition, since the critical path of the sparse tree adder comes from the gates used to compute the carryout, the output delay skew of the final sum vector can be improved. The schematic of the sparse tree modulo $2^n+1$ adder is shown in Figure 11.

Figure 11: Schematic of Sparse Tree (16-bit example; inputs (a15, b15) … (a0, b0); carry tree feeding the Conditional Sum Generators)

Besides the carryout computation circuit, 4-bit conditional sum generators are also needed in the sparse-tree based modulo $2^n+1$ adder as shown in Figure 11. Since the delay of the sum computation path is less than that of the carryout computation path, the conditional sum generator only contributes one multiplexer to the entire critical path. Its schematic is shown in Figure 12.

Figure 12: Schematic of Conditional Sum Generator (XOR-XNOR, GP and MUX blocks; inputs (a3, b3) … (a0, b0) and Cin; outputs S3 … S0)

The simulation results of the critical path delay of the 8-bit proposed FPP and sparse tree modulo $2^n+1$ adders are shown in Figure 13 and Figure 14 respectively and their performance is summarized in Table-18. Although the critical path delay of the sparse tree is slightly larger than that of the proposed FPP, the power consumption is significantly reduced, so that the PDP of the sparse tree is almost 4 times better. In addition, since the transistor count of the sparse tree is 20% lower than that of the proposed FPP, and this percentage will increase further for wider operands, the sparse-tree based modulo $2^n+1$ adder is more area efficient.

Figure 13: Critical path delay of proposed FPP

Figure 14: Critical path delay of sparse tree adder

Table-18 Performance Comparison between Proposed FPP and Sparse Tree

| | Proposed FPP | Sparse Tree | Performance Improvement |
|---|---|---|---|
| Delay (ps) | 96.01 | 100.66 | -2.5% |
| Power (μW) | 60.35 | 12.54 | 3.8x |
| PDP (J) | 57.94e-16 | 12.62e-16 | 3.6x |

3.4 Design of Primitive Blocks

In this section, the design of the primitive blocks used in each module, such as the XOR-XNOR gate, the MUX and the GP generator, is presented.

3.4.1 Circuit Design of XOR-XNOR

In Figure 15 (a), a basic XOR-XNOR gate with the lowest transistor count is presented. However, due to the Vth of the NMOS transistors, this XOR-XNOR gate suffers from a weak logic "1" output at the xnor node, which results in a poor fall time at the xor node. In addition, the xor output is generated by inverting xnor, which leads to an output delay skew.

In Figure 15 (b), a complementary pass-gate XOR-XNOR is introduced to solve the problem of weak logic at the expense of larger area and higher power consumption. However, this XOR-XNOR structure also suffers from the output delay skew and the limited driving capacity of the xor node.

To solve the problem of output delay skew, a symmetrical structure based on the basic XOR-XNOR gate is shown in Figure 15 (c). However, it also has the problem of weak logic output, for the same reason as the basic one.

In [25], a low-power XOR-XNOR gate is implemented to solve the problems of output delay skew and weak logic by employing a pair of feedback transistors at the output ports as shown in Figure 15 (d). However, it encounters a very low output voltage and a very high current during the transition from any other pattern to "00" or "11", because both feedback transistors are turned on simultaneously and act as a high impedance driver.

Figure 15: Schematic of existing XOR-XNOR gate, panels (a) to (d) (inputs A and B; outputs xor and xnor)

An improved design of the XOR-XNOR gate, proposed in [26], is employed in this thesis. By adding a feedback transistor pair to the structure shown in Figure 15 (c), the problem of weak logic output is solved, and the two keeper transistor pairs at the xor and xnor nodes ensure the output voltage level under the input patterns "00" and "11". The schematic of the improved XOR-XNOR gate is shown in Figure 16.

Figure 16: Schematic of improved XOR-XNOR gate (inputs A and B; outputs xor and xnor)

3.4.2 Circuit Design of MUX

In [27], two possible implementations of the MUX used for the 3:2 compressor are presented, as shown in Figure 17.

In Figure 17 (a), the simple structure and low transistor count allow the pass-gate based MUX to be applied in various low power applications with an acceptable speed. However, due to the poor driving capacity of the pass-gate, this design can only be used in the intermediate output stage. Although the driving capacity can be strengthened by adding an extra buffer at the output port of the MUX, the increased number of transistors and the extra gates on the critical path make the design uncompetitive in our application.

Figure 17: Schematic of MUX, panels (a) and (b) (inputs A and B, select Sel, output Out)

In Figure 17 (b), an alternative design is proposed. By adding an inverter at the output port, sufficient driving strength of the MUX can be achieved. Compared with the previous design in Figure 17 (a), the gate number on the critical path is not increased in this implementation. Although there are two more transistors in the proposed design, the total power consumption is still acceptable for low power applications.

3.4.3 Circuit Design of GP generator

To compute the carryout in both the modulo $2^n+1$ adder and the conditional sum generator, a generate and propagate group $(G_i, P_i)$ generator is necessary. In general, the GP generator is implemented by the combination of "and" and "or" gates as shown in Figure 18. However, "and" and "or" gates usually consume more power and time compared with NAND and NOR gates in regular VLSI implementation. Since the number of GP generators in the modulo $2^n+1$ squarer is large and will increase further with a larger input width, the overall delay and power saving will be considerable.

Figure 18: Schematic of Simple GP Generator (inputs A, B and C; output Out)

Considering (22), the traditional GP generator can be replaced by the combination of AOI and OAI gates, shown in Figure 19 (a) and Figure 19 (b) respectively, to achieve a faster speed with lower power consumption and smaller area.

$$(a \cdot b) + c = \overline{(\overline{a} + \overline{b}) \cdot \overline{c}} \qquad (22)$$
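The De Morgan identity above, and its dual, can be verified exhaustively with behavioural models of the two gates. The function names `aoi` and `oai` are hypothetical and model only the logic function, not the transistor-level circuit.

```python
def aoi(a, b, c):
    """AND-OR-Invert gate: NOT((a AND b) OR c)."""
    return 1 - ((a & b) | c)


def oai(a, b, c):
    """OR-AND-Invert gate: NOT((a OR b) AND c)."""
    return 1 - ((a | b) & c)


for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            na, nb, nc = 1 - a, 1 - b, 1 - c
            # An OAI gate fed with inverted inputs yields the true AND-OR term.
            assert (a & b) | c == oai(na, nb, nc)
            # Dually, an AOI gate fed with inverted inputs yields the true OR-AND term.
            assert (a | b) & c == aoi(na, nb, nc)
```

This is why stages that alternate between AOI and OAI gates can pass inverted signals from one stage to the next without inserting inverters.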

In the proposed GP generator, only one OAI gate or AOI gate is required for each Wallace tree stage. Compared with a single traditional GP generator, the speed of the proposed one is increased 2 times with 51% less power consumption, and the number of transistors is also reduced from 12 to 6. Furthermore, due to the particular functionality of the OAI and AOI gates, the outputs of NAND and NOR gates can be used directly without extra inverters. Therefore, the gate number on the carryout computation path, which is also the critical path of the entire circuit, can be reduced.

Figure 19: Schematic of AOI and OAI Gate (inputs A, B and C; output Out)

The performance comparison between the two GP generator implementations for an adder with a logic depth of 3 is summarized in Table-19.

Table-19 Performance Comparison between GP generators

| | Traditional GP | AOI/OAI based GP | Performance Improvement |
|---|---|---|---|
| Delay (ps) | 93.45 | 51.65 | 80.9% |
| Power (μW) | 5.78 | 4.15 | 39.3% |
| PDP (J) | 0.54e-15 | 0.21e-15 | 1.58x |
| Gate Count | 72 | 29 | 1.48x |

In Table-19, the critical path delay of the 3-stage proposed GP generator is improved by 80.9% with 39.3% lower power consumption. In addition, the count reported in the last row of Table-19 drops from 72 to 29, a reduction of about 60% relative to the traditional structure.

3.5 Summary and Simulation Result

In this thesis, the implementation of the modulo $2^n+1$ squarer is mainly improved in the following aspects:

Firstly, 3:2 compressors are employed in the Wallace tree structure to replace full adders. Because of the smaller number of gates on both the critical path and the non-critical paths, the speed of the partial product compression process is improved with a much lower power consumption.

Secondly, a sparse-tree based inverted EAC modulo $2^n+1$ adder is used to implement the final addition stage. Different from the full tree adder, it does not compute the carryout of each bit and requires a smaller number of GP generators. Therefore, the total power consumption and area are improved at almost the same speed as the full tree adder. In addition, the wire routing, which is also an important factor for VLSI implementation in practice, is simplified. Due to the usage of 4-bit conditional sum generators, the output delay skew of the final result can be improved.

Finally, the proposed GP generator is introduced to further improve the performance of the modulo $2^n+1$ squarer. In the final addition stage, the critical path is mainly composed of the GP generators used to compute the carryout of each bit. By employing AOI and OAI gates as the GP generator, the inverters used to implement "and" and "or" gates can be saved, and the power consumption of the proposed GP generator is also shown to be significantly lower than that of the traditional one.

The simulation results of the modulo $2^n+1$ squarer with a fanout of four, including critical path delay, power consumption and non-critical path delay, are shown in Figure 20, Figure 21 and Figure 22 respectively.

Figure 20: Delay and Rise time of modulo 2^n+1 squarer

Figure 21: power consumption of modulo 2^n+1 squarer

Figure 22: Non-critical delay of modulo 2^n+1 squarer

IV. Circuit Implementation on CNT Technology

In this section, a novel Carbon Nanotube (CNT) technology is introduced and the optimal configuration of the modulo $2^n+1$ squarer is implemented on CNT technology as well. The simulation results of CNT technology and CMOS technology are compared in terms of critical path delay, power and area. A Monte Carlo simulation of PVT variation is also performed.

4.1 Introduction to CNTFET

In the structure of the CNTFET transistor, the bulk silicon utilized as channel material in the MOSFET transistor is replaced by a single nanotube or an array of nanotubes [28]. The schematic of the CNTFET transistor is shown in Figure 23.

Figure 23: Schematic of CNTFET transistor (substrate, dielectric, CNTs, and the drain, source and gate terminals)

In Figure 24, the single-wall nanotube utilized as channel material in the CNTFET transistor is unrolled as a sheet of graphite with a roll-up vector given as (23), and the diameter of the nanotube can be given as (24).

Figure 24: Schematic of unrolled nanotube (chiral angle, roll-up vector $\vec{C_h}$ and lattice unit vectors $\vec{a_1}$, $\vec{a_2}$)

$$\vec{C_h} = n\vec{a_1} + m\vec{a_2} \qquad (23)$$

$$D_{CNT} = \frac{\sqrt{3}\, a_0}{\pi} \sqrt{n^2 + m^2 + nm} \qquad (24)$$

where (n, m) is a pair of positive integers, $(\vec{a_1}, \vec{a_2})$ are the lattice unit vectors and $a_0$ is the interatomic distance with the value of 0.142 nm.

Due to the difference in chiral angle and nanotube diameter, both of which result from the variation of the positive integer pair (n, m), the carbon nanotube can be either metallic, if n = m or n − m = 3i where i is an integer, or semiconducting, if n − m equals any other value.

Similar with MOSFET transistor, the CNTFET transistor can’t be turned on until the

46

voltage between gate and source, denoted as Vgs is larger than the threshold voltage,

The threshold voltage of CNT channel be approximated to the inverse function of

nanotube diameter and is given as (25)[29]:

Vth ≈Eg𝑒

= √3α×Vπ3𝑒×𝐷𝐶𝑁𝑇

(25)

where α = 2.49 is the atom distance between carbons, Vπ = 3.033 eV is the carbon π-π

bond energy in the tight bonding model and e is the unit electron charge.

Considering (24) and (25) together, the threshold voltage of the CNTFET decreases as the integers (n, m) increase. With m held constant at 0, the diameter is proportional to n, so the threshold voltage is inversely proportional to n, as shown in Figure 25.

Figure 25: CNTFET threshold voltage varies with n
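The trend in Figure 25 can be approximated numerically from (24) and (25). The sketch below assumes that α = 2.49 is expressed in angstrom (0.249 nm) and evaluates only the right-hand expression of (25). The function names are hypothetical and the printed values are estimates from these formulas, not the simulated data behind the figure.

```python
import math

A0_NM = 0.142     # interatomic distance a0 from (24), in nm
ALPHA_NM = 0.249  # alpha = 2.49 from (25), assumed here to be in angstrom
V_PI_EV = 3.033   # carbon pi-pi bond energy from (25), in eV


def cnt_diameter_nm(n: int, m: int = 0) -> float:
    """Nanotube diameter from equation (24), in nm."""
    return math.sqrt(3) * A0_NM / math.pi * math.sqrt(n * n + m * m + n * m)


def threshold_voltage_v(n: int, m: int = 0) -> float:
    """Approximate threshold voltage from equation (25), in volts.

    Dividing an energy in eV by the electron charge e gives the same
    number in volts, so V_PI_EV is used directly.
    """
    return math.sqrt(3) / 3 * ALPHA_NM * V_PI_EV / cnt_diameter_nm(n, m)


if __name__ == "__main__":
    for n in (10, 13, 16, 19, 22):
        print(f"n = {n:2d}: Vth ~ {threshold_voltage_v(n):.3f} V")
```

Because the diameter grows linearly with n when m = 0, the product n × Vth stays constant in this model.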

The I-V characteristic of the CNTFET is shown in Figure 26. As in the MOSFET, the channel current of a turned-on CNTFET increases with increasing drain-to-source voltage (Vds). However, the current saturates once Vds reaches a certain value, beyond which a further increase in Vds only slightly influences the current. In addition, a longer physical channel can result in a larger saturation current.

Figure 26: I-V characteristic of CNTFET transistor

4.2 Circuit Implementation

Because the CNTFET has a similar operating principle and device structure to the MOSFET, the configuration of the CNT implementation is almost the same as that of the CMOS one. However, in order to implement the modulo 2^n+1 squarer on CNT technology, the optimal number of nanotubes in a single CNTFET, which is equivalent to the channel width of a MOSFET, has to be decided. It is determined from the simulation of a 3-stage inverter chain with a fanout of 4. The simulated delay for various numbers of nanotubes is shown in Figure 27.

In Figure 27, the inverter delay initially improves as the number of nanotubes in a single CNTFET grows, because the total drive current in the channel increases. However, the delay worsens when the number of nanotubes is larger than 8, because the drive current of each carbon nanotube is reduced by the increased inter-nanotube charge screening [30]. Therefore, 8 nanotubes can be considered optimal for circuit implementation on CNT technology, and the sizing of complex logic gates should also be based on this value.

Figure 27: Delay with various number of nanotube

In addition, the width ratio between pFET and nFET, computed on CNT technology as the ratio of their nanotube numbers, is changed to 1, because the pFET and nFET have similar driving capability in the CNT case. This ratio should also be used in the design of complex logic gates.

Based on the analysis above, the modulo 2^8+1 squarer is implemented on CNT technology. The simulated critical path delay and power of the modulo 2^8+1 squarer with a fanout of 4 are shown in Figure 28 and Figure 29 respectively.

[Plot of Figure 27: FO4 CNT delay (ps) versus number of nanotubes per CNTFET]

Figure 28: Critical path delay of modulo 2^8+1 squarer

Figure 29: Power of modulo 2^8+1 squarer

4.3 Performance Comparison

Compared with traditional CMOS technology, CNT technology performs much better in terms of delay, power, frequency response and stability.

4.3.1 PDP Comparison

Owing to the lower threshold voltage of the CNTFET, described in section 4.1, the CNTFET can be turned on at a lower voltage than the MOSFET, so that CNTFET-based logic gates achieve a faster rise/fall time and hence a smaller delay. In addition, the tunable threshold voltage makes the CNTFET more competitive for low-supply-voltage applications.

With channel lengths in the nanometer range, the static power consumption generated by the leakage current can dominate the total power consumption. In [14], the leakage currents of various basic logic gates on 32nm MOSFET and CNTFET technologies are compared. According to the simulation results, the maximum and minimum leakage power of the CNTFET-based logic gates are 75 times and 3 times smaller, respectively, than those of the MOSFET-based ones. Therefore, although the dynamic power of the CNTFET may be larger than that of the MOSFET because of its larger dynamic current, the total power consumption can still be significantly reduced in nanometer-range applications.

For the frequency response, an AC simulation of inverters implemented in both CMOS and CNT technology is performed. According to the simulation results, the voltage gain of the CNTFET inverter is 3dB larger than that of the MOSFET inverter and its 3dB frequency is 3 times higher.

To support the conclusions above, the delay, power and PDP of the modulo 2^8+1 squarer on both implementation technologies are compared in Table-20.

Table-20: Performance of Modulo 2^8+1 Squarer on Two Technologies

                  CMOS        CNTFET      Performance Improved
Delay (ps)        401.81      29.63       13.6x
Rise Time (ps)    35.84       3.84        9.3x
Power (μW)        27.42       11.74       2.3x
PDP (J)           11.02e-15   0.35e-15    31.8x
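The PDP and improvement columns of Table-20 follow directly from the delay and power entries. The following sketch recomputes them from the tabulated values. The dictionary and function names are illustrative, and the results carry the rounding of the table entries.

```python
# Values copied from Table-20.
cmos = {"delay_ps": 401.81, "rise_ps": 35.84, "power_uw": 27.42}
cnt = {"delay_ps": 29.63, "rise_ps": 3.84, "power_uw": 11.74}


def pdp_joule(delay_ps: float, power_uw: float) -> float:
    """Power-delay product in joules (ps converted to s, uW converted to W)."""
    return (delay_ps * 1e-12) * (power_uw * 1e-6)


def improvement(cmos_value: float, cnt_value: float) -> float:
    """How many times smaller the CNT value is than the CMOS value."""
    return cmos_value / cnt_value


if __name__ == "__main__":
    for key in ("delay_ps", "rise_ps", "power_uw"):
        print(f"{key}: {improvement(cmos[key], cnt[key]):.1f}x")
    pdp_cmos = pdp_joule(cmos["delay_ps"], cmos["power_uw"])
    pdp_cnt = pdp_joule(cnt["delay_ps"], cnt["power_uw"])
    print(f"PDP CMOS = {pdp_cmos:.3e} J, PDP CNT = {pdp_cnt:.3e} J")
    print(f"PDP ratio = {improvement(pdp_cmos, pdp_cnt):.1f}x")
```

Since PDP is the product of delay and power, its improvement factor is the product of the delay and power improvement factors, which is why it is close to 32.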

Table-20 shows that the critical path delay and rise time of the modulo 2^8+1 squarer on CNT technology are 13.6 times and 9.3 times better than those on CMOS technology, respectively. In addition, the leakage current of the CNT implementation shown in Figure 29 is much smaller than that of the CMOS implementation shown in Figure 21, which results in lower power consumption. Finally, a nearly 32 times better PDP is achieved by CNT technology.

4.3.2 PVT Comparison

To further compare the performance of the modulo 2^n+1 squarer on both technologies, a PVT simulation that varies one factor at a time is performed in this section. In each simulation, only one of the process parameters, the supply voltage or the temperature is varied, by the same amount in both the CMOS and the CNT implementation of the modulo 2^8+1 squarer.

For the process variation, there are typically three corners: typical (T), fast (F) and slow (S), for both pFET and nFET. Five combinations of corners are therefore used: FF, FS, TT, SF and SS, where the first letter stands for the corner of the nFET and the second for that of the pFET. The variation of critical path delay, rise time and power with process corner is shown in Table-21 and Table-22. The trend of the critical path delay is shown in Figure 30 and Figure 31.

Table-21: Process Variation of CMOS Technology

Corner (5%)         SS        FS        TT        SF        FF
Delay (ps)          595.07    455.43    401.83    370.5     281.1
Percent Variation   48.09%    7.79%     -         13.34%    30.04%
Power (μW)          19.93     31.03     27.42     36.47     45.44
Rise Time (ps)      79.156    69.081    35.843    44.241    37.667

Table-22: Process Variation of CNT Technology

Corner (5%)         SS        FS        TT        SF        FF
Delay (ps)          35.86     32.075    29.303    29.359    26.302
Power (μW)          7.47      14.65     11.74     8.14      20.41
Rise Time (ps)      4.489     4.301     3.84      4.23      3.675
Percent Variation   16.90%    22.56%    -         10.16%    4.30%

Figure 30: Process Delay Variation of CMOS Technology (critical path delay in ps versus process corner SS, FS, TT, SF, FF)

Figure 31: Process Delay Variation of CNT Technology (critical path delay in ps versus process corner SS, FS, TT, SF, FF)

For the voltage variation, the supply voltage is varied from 0.72V to 0.88V in steps of 0.04V, and 0.8V is taken as the nominal condition in this thesis. The variation of critical path delay, rise time and power of the CMOS and CNT implementations is summarized in Table-23 and Table-24 respectively. The trend of their critical path delay is shown in Figure 32 and Figure 33.

Table-23: Supply Voltage Variation of CMOS Technology

Voltage (V)       0.72      0.76      0.8       0.84      0.88
Delay (ps)        530.4     456.74    401.81    361.04    329.77
Power (μW)        17.69     22.87     27.42     31.59     36.19
Rise Time (ps)    45.217    40.195    35.84     33.568    31.162

Table-24: Supply Voltage Variation of CNT Technology

Voltage (V)       0.72      0.76      0.8       0.84      0.88
Delay (ps)        31.96     30.415    29.303    27.883    27.13
Power (μW)        7.85      9.58      11.74     14.21     16.99
Rise Time (ps)    4.066     3.958     3.84      3.907     3.793


Figure 32: Supply Voltage Delay Variation of CMOS Technology (critical path delay in ps versus supply voltage from 0.72V to 0.88V)

Figure 33: Supply Voltage Delay Variation of CNT Technology (critical path delay in ps versus supply voltage from 0.72V to 0.88V)

For the temperature variation, the ambient temperature is varied from 0 to 100 degrees centigrade, and 25 degrees centigrade is taken as the normal room temperature for comparison. The variation of critical path delay, rise time and power with temperature is shown in Table-25 and Table-26. The trend of the critical path delay is shown in Figure 34 and Figure 35.

Table-25: Temperature Variation of CMOS Technology

Temperature (°C)    0         25        50        75        100
Delay (ps)          338.12    401.83    476.29    558.52    645.96
Percent Variation   15.84%    -         18.53%    38.99%    60.75%
Power (μW)          26.13     27.42     28.13     27.68     27.97
Rise Time (ps)      28.937    35.84     44.003    52.632    61.953

Table-26: Temperature Variation of CNT Technology

Temperature (°C)    0         25        50        75        100
Delay (ps)          29.291    29.303    29.313    29.329    29.34
Power (μW)          11.78     11.74     11.69     11.65     11.58
Rise Time (ps)      3.830     3.843     3.841     3.842     3.841
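The percent variation rows in these tables appear to be the relative deviation from the nominal condition. As a hedged illustration, the sketch below applies that definition (an assumption, since the thesis does not state the formula explicitly) to the delay rows of Table-25 and Table-26, with 25 degrees centigrade as the reference. The names are illustrative.

```python
def percent_variation(value: float, nominal: float) -> float:
    """Relative deviation from the nominal value, in percent."""
    return abs(value - nominal) / nominal * 100.0


# Critical path delay in ps, keyed by temperature in degrees centigrade.
cmos_delay_ps = {0: 338.12, 25: 401.83, 50: 476.29, 75: 558.52, 100: 645.96}
cnt_delay_ps = {0: 29.291, 25: 29.303, 50: 29.313, 75: 29.329, 100: 29.34}

if __name__ == "__main__":
    for name, delays in (("CMOS", cmos_delay_ps), ("CNT", cnt_delay_ps)):
        nominal = delays[25]
        for temp, delay in delays.items():
            if temp != 25:
                print(f"{name} {temp:3d}C: {percent_variation(delay, nominal):.2f}%")
```

Under this definition the CNT delay stays within a small fraction of a percent of its 25 degree value over the whole range, while the CMOS delay changes by tens of percent.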


Figure 34: Temperature Delay Variation of CMOS Technology (critical path delay in ps versus temperature from 0C to 100C)

Figure 35: Temperature Delay Variation of CNT Technology (critical path delay in ps versus temperature from 0C to 100C)

Figure 30 to Figure 35 show that the critical path delay of the CNT implementation follows almost the same trend as that of the CMOS one. However, the percentage variations of critical path delay and rise time of the CNT implementation are both much smaller. For example, the average variations of critical path delay and rise time over the process corners on CNT technology are 9.76% and 13.48%, while the equivalent values for the CMOS implementation are 24.82% and 60.54%. The minimum percentage variations of delay, rise time and power on CNT technology are 0.03%, 0.03% and 0.34% respectively, and they all come from the temperature variation. This is primarily because the ambient temperature has only a slight influence on the I-V characteristics of the CNTFET [31]. Although the percentage power variation of the CNT implementation is a little higher than that of the CMOS one, its absolute variation is much smaller, so it will not significantly influence the performance in practice.
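Two of the averages quoted above can be checked against the percent variation rows printed in Table-21 and Table-22. The sketch assumes the average is the arithmetic mean over the four non-typical corners. The list names are illustrative.

```python
def mean_variation(percent_values):
    """Arithmetic mean of a list of percentage variations."""
    return sum(percent_values) / len(percent_values)


# Percent variation rows as printed in the tables (non-typical corners only).
cmos_delay_corner_pct = [48.09, 7.79, 13.34, 30.04]  # Table-21, delay
cnt_rise_corner_pct = [16.90, 22.56, 10.16, 4.30]    # Table-22, rise time

if __name__ == "__main__":
    print(f"CMOS delay: {mean_variation(cmos_delay_corner_pct):.3f}%")
    print(f"CNT rise time: {mean_variation(cnt_rise_corner_pct):.3f}%")
```

These means agree with the 24.82% and 13.48% figures in the text to within rounding.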

In practice, PVT variations always occur together. Therefore, a Monte Carlo simulation that varies all the PVT factors in one run is performed for both the CNT and the CMOS implementation of the modulo 2^8+1 squarer. In each Monte Carlo run, the threshold voltage, ambient temperature and supply voltage of both implementations are randomly selected within a range of ±3% at the same time. One thousand samples are collected for the CMOS implementation, as shown in Figure 36, and one hundred and twenty samples are collected for the CNT implementation, as shown in Figure 37.
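The sampling step of such a Monte Carlo run can be sketched as follows. This is only an illustration of drawing all three factors at once within ±3%. It assumes a uniform distribution and a hypothetical nominal threshold voltage of 0.29V, neither of which is stated in the thesis, and it does not reproduce the HSpice setup that was actually used.

```python
import random

# Nominal operating point. vdd_v and temp_c follow the text;
# vth_v is a hypothetical value used only for illustration.
NOMINAL = {"vth_v": 0.29, "temp_c": 25.0, "vdd_v": 0.8}


def sample_pvt(nominal, rng, spread=0.03):
    """Draw every factor independently within +/- spread of its nominal value."""
    return {
        name: value * (1.0 + rng.uniform(-spread, spread))
        for name, value in nominal.items()
    }


def run_monte_carlo(nominal, runs, seed=0):
    """Return one randomly perturbed parameter set per Monte Carlo run."""
    rng = random.Random(seed)
    return [sample_pvt(nominal, rng) for _ in range(runs)]


if __name__ == "__main__":
    samples = run_monte_carlo(NOMINAL, runs=120)
    print(len(samples), "parameter sets, first one:", samples[0])
```

Each parameter set would then be passed to the circuit simulator, which returns one critical path delay per run.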

Figure 36: Monte Carlo analysis of CMOS implementation (critical path delay in ps versus Monte Carlo index, 1000 samples)

Figure 37: Monte Carlo analysis of CNT implementation (critical path delay in ps versus Monte Carlo index, 120 samples)

Compared with the 21.95% maximum variation of the CMOS implementation in Figure 36, the maximum variation of the CNT implementation in Figure 37 is only 5.7%. In addition, 131 samples in Figure 36 vary by more than 10%, which is about 13.1% of all samples, whereas in Figure 37 only 11.6% of the samples vary by more than 2%. Hence, the CNT implementation of the modulo 2^n+1 squarer offers considerably better stability of critical path delay than the CMOS one.
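The two statistics used in this comparison, the maximum variation and the share of samples beyond a threshold, can be computed from a list of simulated delays as sketched below. The function name and the five sample delays are made up for illustration and are not the thesis data.

```python
def variation_stats(delays_ps, nominal_ps, threshold_pct):
    """Return (maximum variation in %, share of samples above threshold in %)."""
    deviations = [abs(d - nominal_ps) / nominal_ps * 100.0 for d in delays_ps]
    exceeding = sum(1 for v in deviations if v > threshold_pct)
    return max(deviations), exceeding / len(deviations) * 100.0


if __name__ == "__main__":
    # Hypothetical delays in ps around a nominal value of 100 ps.
    example = [100.0, 105.0, 90.0, 112.0, 99.0]
    max_var, share = variation_stats(example, nominal_ps=100.0, threshold_pct=10.0)
    print(f"max variation = {max_var:.1f}%, samples above 10% = {share:.1f}%")
```

Applied to the text, 131 of 1000 CMOS samples above the 10% threshold corresponds to a share of 13.1%.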

V. Conclusion

In this thesis, a novel modulo 2^n+1 squarer is implemented based on an improved algorithm. Compared with the existing algorithm, the input range of the modulo 2^n+1 squarer is extended without any extra cost and the number of gates on the critical path is further reduced. In the partial product compression stage, the use of a 3:2 compressor-based Wallace tree configuration results in a considerable improvement in terms of delay, power and area. For the final addition stage, a sparse tree IEAC adder is introduced to further improve the delay and power with fewer gates and simpler wire routing.

In addition, the improved modulo 2^n+1 squarer is implemented on both CMOS technology and CNT technology. Compared with the traditional MOSFET, the CNTFET shows much better performance in terms of power and delay. A Monte Carlo simulation is also performed to demonstrate the better PVT properties of the CNT implementation. Hence, the CNTFET-based modulo 2^n+1 squarer can be considered a competitive candidate for low-power and high-performance applications.

Reference

[1] Rajashekar Modugua, Yong-Bin Kim, Minsu Choi, “Fast Low-Power Modulo 2^n+1 Squarer Hardware for Efficient Data Processing,” http://www.ece.neu.edu/groups/hpvlsi/publication/2009_Modulo.pdf.

[2] M. A. Soderstrand, W. K. Jenkins, G. A. Jullien, and F. J. Taylor, Eds., “Modern Applications of Residue Number System Arithmetic to Digital Signal Processing,” New York: IEEE Press, 1986.

[3] K. G. Smitha and A. P. Vinod, “A reconfigurable high speed RNS-FIR channel filter for multi-standard software radio receivers,” in Proceedings of the 11th IEEE Singapore International Conference on Communication Systems (ICCS’08), pp.1354–1358, Guangzhou, China, 2008.

[4] P. E. Beckmann and B. R. Musicus, “Fast fault-tolerant digital convolution using a polynomial residue number system,” IEEE Trans. Signal Process., vol.41, no.7, pp.2300–2313, Jul.1993.

[5] Yuke Wang, M. N. S. Swamy, and M. Omair Ahmad, “Residue-to-binary number converters for three moduli set,” IEEE Transactions on Circuits and Systems, vol.46, No.2, Feb.1999.

[6] D. Gallaher, F. Petry, and P. Srinivasan, “The digital parallel method for fast RNS to weighted number system conversion for specific moduli (2^k-1; 2^k; 2^k+1),” IEEE Trans. Circuits Syst. II, vol.44, pp.537, Jan.1997.

[7] L.M. Leibowitz, “A simplified binary arithmetic for the Fermat number transform,” IEEE Trans. Acoust. Speech Signal Process, pp.356–359, 1976.

[8] H.T. Vergos, and C. Efstathiou, “Diminished-1 modulo 2^n+1 squarer design,” Computers and Digital Techniques, IEEE Proceedings, vol.152, no.5, pp.561-566, Sep.2005.

[9] R. Zimmerman, “Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication,” IEEE trans. Compute., pp.1389-1399, 2002.

[10] Yutai Ma, “A Simplified Architecture for Modulo (2^n + 1) Multiplication,” IEEE Transactions on Computers, vol.47, no.3, pp.333-337, Mar.1998.

[11] V. Curiger, H. Bonnenherg, and H. Kaeshi, “Regular VLSI architectures for multiplication modulo (2^n + 1),” IEEE J. Solid-State Circuit, vol.26, pp.990-994, Jul.1991.

[12] A. Wrzyszcz, and D. Milford, “A new modulo 2^a+1 multiplier,” Int. Conf. Computer Design (ICCD3), pp.614-617, 1995.

[13] H.T. Vergos, and C. Efstathiou, “Design of efficient modulo 2^n + 1 multipliers,” IET Comput. Digit. Tech., vol.1, No.1, pp.49-57, 2007.

[14] Yong-Bin Kim, “Integrated circuit design based on carbon nanotube field effect transistor,” IEEE Journal of Trans. On EE Materials, vol.12, No.5, pp.175-188, Oct.25, 2011.

[15] S.A. Tawfik, Z. Liu, and V. Kursun, “Independent-gate and tied-gate FinFet SRAM circuits: Design guidelines for reduced area and enhanced stability,” International Conference on Microelectronics, pp.171–174, 2007.

[16] Behzad Ebrahimi, Masoud Rostami, Ali Afzali-Kusha and Massoud Pedram, “Statistical Design Optimization of FinFET SRAM Using Back-Gate Voltage,” IEEE Transactions on VLSI Systems, vol.19, Issue.10, pp.1911–1916, Oct.2011.

[17] R. Chau, “Benchmarking nanotechnology for high-performance and low-power logic transistor applications,” IEEE Transactions on Nanotechnology, vol.4, Issue.2, pp.153–158, Aug.2004.

[18] C.S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Electronic Computers, vol.EC-13, pp.14-17, 1964.

[19] Zhongde Wang, G.A. Jullien and W.C. Miller, “An Efficient Tree Architecture for Modulo 2^n + 1 Multiplication,” University of Windsor, 401 Sunset, and Windsor, Ontario N9B3P4, Canada Received Mar.11, 1996.

[20] Mounir Bohsali, Michael Doan, “Rectangular Styled Wallace Tree Multipliers,” Berkeley University.

[21] Mahnoush Rouholamini, Omid Kavehie, Amir-Pasha Mirbaha, Somaye Jafarali Jasbi, “A New Design for 7:2 Compressors,” Computer Systems and Applications, AICCSA '07. IEEE/ACS, 2007.

[22] H.T. Vergos, C. Efstathiou, “Efficient modulo 2^n+1 adder architectures,” Computer Engineering and Informatics Department, University of Patras, 26500 Patras, Greece; Informatics Department, ATEI of Athens, 12210 Egaleo, Athens, Greece.

[23] H.T. Vergos, C. Efstathiou, and D. Nikolos, “Diminished-one modulo 2^n+1 adder design,” IEEE Trans. Comput., pp.1389–1399, 2002.

[24] Radu Zlatanovici, Sean Kao, and Borivoje Nikolic, “Energy–Delay Optimization of 64-Bit Carry-Lookahead Adders With a 240 ps 90 nm CMOS Design Example,” IEEE Journal of Solid-State Circuits, vol.44, No.2, Feb.2009.

[25] K. Prasad and K. K. Parhi, “Low-power 4-2 and 5-2 compressors,” in Proc. of the 35th Asilomar Conf. on Signals, Systems and Computers, vol.1, pp.129–133, 2001.

[26] Chip-Hong Chang, Jiangmin Gu, and Mingyan Zhang, “Ultra Low-Voltage Low-Power CMOS 4-2 and 5-2 Compressors for Fast Arithmetic Circuits,” IEEE Transactions on Circuits and Systems, vol.51, No.10, Oct.2004.

[27] V. Sreehari, M. Kirthi, A. Lingamneni, and R. Sreekanth, “Novel architectures for high-speed and low-power 3-2, 4-2 and 5-2 compressor,” IEEE 20th International Conference on VLSI Design, pp.324-329, Jan.2007.

[28] Sander J. Tans, Alwin R. M. Verschueren, and Cees Dekker, “Room-temperature transistor based on a single carbon Nanotube,” Nature, vol.393, Issue.6680, pp.49-52, 1998.

[29] Stanford University CNFET website, http://nano.stanford.edu/model.php?id=23.

[30] Jie Deng, “Carbon nanotube transistor circuits: Circuit-level performance benchmarking and design options for living with imperfections,” International Solid-State Circuits Conference, pp.70-71, San Francisco, CA, Feb.2007.

[31] Ouyang Yijian, and Guo Jing, “Heat dissipation in carbon nanotube transistors,” Applied Physics Letters, vol.89, Issue.18, Oct.2006.

Appendices

Appendix A. HSpice Code

1.1 HSpice code for modulo 2^n+1 squarer

Weifu Li Modulo_Squarer
******************************************
.include 'PTM_customized_32nm_nom.lib'
x1 PP07 PP06 PP05 PP04 PP03 PP02 PP01 PP00
+PP17 PP16 PP15 PP14 PP13 PP12 PP11
+PP27 PP26 PP25 PP24 PP23 PP22
+PP37 PP36 PP35 PP34 PP33
+PP47 PP46 PP45 PP44
+PP57 PP56 PP55
+PP67 PP66
+PP77
+a0 a1 a2 a3 a4 a5 a6 a7
+vdd
+PartialProduct
*****The first_round******
x2 vdd PP67 PP77 PP00 Sum00 Carry00 Compressor32_bar
x3 vdd PP06 PP15 PP24 Sum10 Carry10 Compressor32
x4 vdd PP07 PP16 PP25 Sum01 Carry01 Compressor32
x5 vdd PP34 PP44 0 Sum11 Carry11 Compressor32
x6 vdd PP17 PP26 PP35 Sum02 Carry02 Compressor32
x7 vdd PP01 PP11 0 Sum12 Carry12 Compressor32_bar
x8 vdd PP27 PP36 PP45 Sum03 Carry03 Compressor32
x9 vdd PP55 0 PP02 Sum13 Carry13 Compressor32_bari
****only one input needs to be reversed: PP02****
x10 vdd PP37 PP46 0 Sum04 Carry04 Compressor32
x11 vdd PP03 PP12 PP22 Sum14 Carry14 Compressor32_bar
x12 vdd PP47 PP56 PP66 Sum05 Carry05 Compressor32
x13 vdd PP04 PP13 vdd Sum15 Carry15 Compressor32_bar
x14 vdd PP57 0 PP05 Sum06 Carry06 Compressor32_bari
x15 vdd PP14 PP23 PP33 Sum16 Carry16 Compressor32_bar
*****The second_round*****
x16 vdd Sum00 Sum10 Carry06 Sum20 Carry20 Compressor32_bari
x17 vdd Sum01 Sum11 Carry00 Sum21 Carry21 Compressor32
x18 vdd Sum02 Sum12 Carry01 Sum22 Carry22 Compressor32
x19 vdd Sum03 Sum13 Carry02 Sum23 Carry23 Compressor32
x20 vdd Sum04 Sum14 Carry03 Sum24 Carry24 Compressor32
x21 vdd Sum05 Sum15 Carry04 Sum25 Carry25 Compressor32
x22 vdd Sum06 Sum16 Carry05 Sum26 Carry26 Compressor32
*****The final_round*****
x23 vdd Carry16_D Carry26 Sum20 Sum30 Carry30 Compressor32_barii
x24 vdd Carry10_D Carry20 Sum21 Sum31 Carry31 Compressor32
x25 vdd Carry11_D Carry21 Sum22 Sum32 Carry32 Compressor32
x26 vdd Carry12_D Carry22 Sum23 Sum33 Carry33 Compressor32
x27 vdd Carry13_D Carry23 Sum24 Sum34 Carry34 Compressor32
x28 vdd Carry14_D Carry24 Sum25 Sum35 Carry35 Compressor32
x29 vdd Carry15_D Carry25 Sum26 Sum36 Carry36 Compressor32
x30 vdd Carry36 Carry36_bar inverter
x31 vdd Carry16 Carry16_D buffer_chain
x32 vdd Carry10 Carry10_D buffer_chain
x33 vdd Carry11 Carry11_D buffer_chain
x34 vdd Carry12 Carry12_D buffer_chain
x35 vdd Carry13 Carry13_D buffer_chain
x36 vdd Carry14 Carry14_D buffer_chain
x37 vdd Carry15 Carry15_D buffer_chain
*******************************************************
***************Final Sparse Adder*********************
*******************************************************
x38 vdd
+Sum30 Sum31 Sum32 Sum33 Sum34 Sum35 Sum36 Sum36
+Carry36_bar Carry30 Carry31 Carry32 Carry33 Carry34 Carry35 Carry35
+S0 S1 S2 S3 S4 S5 S6
+ Modulo_Sparse
x39 vdd S0 S0_fanout fanout
x40 vdd S1 S1_fanout fanout
x41 vdd S2 S2_fanout fanout
x42 vdd S3 S3_fanout fanout
x43 vdd S4 S4_fanout fanout
x44 vdd S5 S5_fanout fanout
x45 vdd S6 S6_fanout fanout

1.2 HSpice code for modulo 2^n+1 adder

Weifu Li Modulo_Adder
************The Sparse Tree Adder**************
.subckt Modulo_Sparse vdd
+a0 a1 a2 a3 a4 a5 a6 a7
+b0 b1 b2 b3 b4 b5 b6 b7
+S0 S1 S2 S3 S4 S5 S6
x1 vdd a0 b0 g0 nand
x2 vdd a1 b1 g1 nand
x3 vdd a2 b2 g2 nand
x4 vdd a3 b3 g3 nand
x5 vdd a4 b4 g4 nand
x6 vdd a5 b5 g5 nand
x7 vdd a6 b6 g6 nand
x35 vdd a7 b7 g7 nand
x8 vdd a0 b0 p0 nor
x9 vdd a1 b1 p1 nor
x10 vdd a2 b2 p2 nor
x11 vdd a3 b3 p3 nor
x12 vdd a4 b4 p4 nor
x13 vdd a5 b5 p5 nor
x14 vdd a6 b6 p6 nor
x36 vdd a7 b7 p7 nor
********************************************************
x15 vdd g0 p1 g1 g10 IOA
x16 vdd g2 p3 g3 g32 IOA
x17 vdd g4 p5 g5 g54 IOA
x18 vdd g6 p7 g7 g76 IOA
x19 vdd p0 p1 p10 nor
x20 vdd p2 p3 p32 nor
x21 vdd p4 p5 p54 nor
x22 vdd p6 p7 p76 nor
********************************************************
x23 vdd g10 p32 g32 g30 AOI
x24 vdd g54 p76 g76 g74 AOI
x25 vdd p76 p54 p74 nand
x26 vdd p32 p10 p30 nand
x27 vdd g30 p74 g74 C7_bari IOA
x28 vdd p30 g74_bar g30 C3i IOA
x29 vdd g74 g74_bar inverter
x31 vdd C3i C3_bar inverter
x32 vdd C7_bari C7 inverter
x40 vdd C7 C7_bar inverter
x41 vdd C3_bar C3 inverter
x33 vdd
+a0 a1 a2 a3
+b0 b1 b2 b3
+S0 S1 S2 S3
+C7 C7_bar Conditional_Sum
x34 vdd
+a4 a5 a6 a7
+b4 b5 b6 b7
+S4 S5 S6 x
+C3 C3_bar Conditional_Sum
.ends
************The Conditional Adder**************
.subckt Conditional_Sum vdd
+a0 a1 a2 a3
+b0 b1 b2 b3
+S0 S1 S2 S3 Cin Cin_bar
x1 vdd a0 b0 xor_0 xnor_0 xor_xnor
x2 vdd a1 b1 xor_1 xor_s
x3 vdd a2 b2 xor_2 xor_s
x4 vdd a3 b3 xor_3 xor_s
x5 vdd a0 b0 g0 nand
x6 vdd a1 b1 g1 nand
x7 vdd a2 b2 g2 nand
x8 vdd a3 b3 g3 nand
x9 vdd a0 b0 p0 nor
x10 vdd a1 b1 p1 nor
x11 vdd a2 b2 p2 nor
x12 vdd a3 b3 p3 nor
x101 vdd g0 g0_bar inverter
x102 vdd p0 p0_bar inverter
x103 vdd g2 g2_bar inverter
x104 vdd p2 p2_bar inverter
x13 vdd xor_1 g0_bar mux10 xor_s
x14 vdd xor_1 p0_bar mux11 xor_s
x15 vdd p0 p1 g1 Out_2 IOA
x16 vdd Out_2_bar xor_2_bar mux21 xor_s
x17 vdd p1 g0 g1 Out_4 IOA
x18 vdd Out_4 xor_2 mux20 xor_s
x30 vdd xor_2 xor_2_bar inverter
x31 vdd xor_3 xor_3_bar inverter
x32 vdd Out_2 Out_2_bar inverter
x19 vdd p2_bar Out_2 g2_bar Out_6 AOI
x20 vdd Out_6 xor_3_bar mux31 xor_s
x21 vdd p2_bar Out_4 g2_bar Out_8 AOI
x22 vdd Out_8 xor_3_bar mux30 xor_s
x23 vdd xor_0_b xnor_0_b Cin Cin_bar S0 mux
x24 vdd mux10 mux11 Cin Cin_bar S1 mux
x25 vdd mux20 mux21 Cin Cin_bar S2 mux
x26 vdd mux30 mux31 Cin Cin_bar S3 mux
x27 vdd xor_0 xor_0_b buffer
x28 vdd xnor_0 xnor_0_b buffer
.ends

1.3 HSpice code for partial product Matrix

****************The Partial Product Generator****************
.subckt PartialProduct
+PP07 PP06 PP05 PP04 PP03 PP02 PP01 PP00
+PP17 PP16 PP15 PP14 PP13 PP12 PP11
+PP27 PP26 PP25 PP24 PP23 PP22
+PP37 PP36 PP35 PP34 PP33
+PP47 PP46 PP45 PP44
+PP57 PP56 PP55
+PP67 PP66
+PP77
+a0 a1 a2 a3 a4 a5 a6 a7
+vdd
x1 vdd a0 a0 PP00 nand
x2 vdd a0 a1 PP01 nand
x3 vdd a0 a2 PP02 nand
x4 vdd a0 a3 PP03 nand
x5 vdd a0 a4 PP04 nand
x6 vdd a0 a5 PP05 nand
x7 vdd a0 a6 PP06 nand
x8 vdd a0 a7 PP07 nand
x9 vdd a1 a1 PP11 nand
x10 vdd a1 a2 PP12 nand
x11 vdd a1 a3 PP13 nand
x12 vdd a1 a4 PP14 nand
x13 vdd a1 a5 PP15 nand
x14 vdd a1 a6 PP16 nand
x15 vdd a1 a7 PP17 nand
x16 vdd a2 a2 PP22 nand
x17 vdd a2 a3 PP23 nand
x18 vdd a2 a4 PP24 nand
x19 vdd a2 a5 PP25 nand
x20 vdd a2 a6 PP26 nand
x21 vdd a2 a7 PP27 nand
x22 vdd a3 a3 PP33 nand
x23 vdd a3 a4 PP34 nand
x24 vdd a3 a5 PP35 nand
x25 vdd a3 a6 PP36 nand
x26 vdd a3 a7 PP37 nand
x27 vdd a4 a4 PP44 nand
x28 vdd a4 a5 PP45 nand
x29 vdd a4 a6 PP46 nand
x30 vdd a4 a7 PP47 nand
x31 vdd a5 a5 PP55 nand
x32 vdd a5 a6 PP56 nand
x33 vdd a5 a7 PP57 nand
x34 vdd a6 a6 PP66 nand
x35 vdd a6 a7 PP67 nand
x36 vdd a7 a7 PP77 nand
.ends

1.4 Traditional Partial Product Process

Weifu Li
***************************
.include 'PTM_customized_32nm_nom.lib'
x1 PP07 PP06 PP05 PP04 PP03 PP02 PP01 PP00
+PP17 PP16 PP15 PP14 PP13 PP12 PP11
+PP27 PP26 PP25 PP24 PP23 PP22
+PP37 PP36 PP35 PP34 PP33
+PP47 PP46 PP45 PP44
+PP57 PP56 PP55
+PP67 PP66
+PP77
+a0 a1 a2 a3 a4 a5 a6 a7
+vdd
+PartialProduct
x2 vdd PP07 PP16 PP16R nand
x3 vdd PP17 PP26 PP26R nand
x4 vdd PP27 PP36 PP36R nand
x5 vdd PP37 PP46 PP46R nand
x6 vdd PP47 PP56 PP56R nand
x7 vdd PP57 PP66 PP66R nand
x8 vdd PP67 PP67R inverter
x9 vdd P76R PP06 PP06R nand
x10 vdd PP77 PP00 PP67 PP00R nand_3
x11 vdd PP16R PP16Ri inverter
x12 vdd PP00R PP00Ri inverter
x13 vdd PP26R PP26Ri inverter
x14 vdd PP24_b PP16_b PP15 PP16Ri PP00Ri Sum00 Carry00 Cout00 Compressor42
x15 vdd PP44 PP34 PP26 PP25 PP26Ri Sum01 Carry01 Cout01 Compressor42
x101 vdd PP24_bb PP24_b buffer
x102 vdd PP16_bb PP16_b buffer
x103 vdd PP16 PP16_bb buffer
x104 vdd PP24 PP24_bb buffer
x16 vdd PP11 PP11i inverter
x17 vdd PP36R PP36Ri inverter
x18 vdd PP01 PP01i inverter
x19 vdd PP36 PP35 PP01i PP11i PP36Ri Sum02 Carry02 Cout02 Compressor42
x20 vdd PP46R PP46Ri inverter
x21 vdd PP02 PP02i inverter
x22 vdd PP46 PP45 PP02i PP55i PP46Ri Sum03 Carry03 Cout03 Compressor42
x23 vdd PP22 PP22i inverter
x24 vdd PP03 PP03i inverter
x25 vdd PP12 PP12i inverter
x26 vdd PP56R PP56Ri inverter
x27 vdd PP56 PP12i PP03i PP22i PP56Ri Sum04 Carry04 Cout04 Compressor42
x28 vdd PP23 PP23i inverter
x29 vdd PP04 PP04i inverter
x30 vdd PP13 PP13i inverter
x31 vdd PP66R PP66Ri inverter
x32 vdd PP23i PP23i PP04i PP13i PP66Ri Sum05 Carry05 Cout05 Compressor42
x33 vdd PP33 PP33i inverter
x34 vdd PP06 PP06i inverter
x35 vdd PP05 PP05i inverter
x36 vdd PP14 PP14i inverter
x38 vdd PP33i PP06i PP05i PP14i PP06R Sum06 Carry06 Cout06 Compressor42
************************The Second Round***********************
x39 vdd Carry06 Sum00 Cout06 Sum10 Carry10 Compressor32_barii
x40 vdd Carry00 Sum01 Cout00 Sum11 Carry11 Compressor32
x41 vdd Carry01 Sum02 Cout01 Sum12 Carry12 Compressor32
x42 vdd Carry02 Sum03 Cout02 Sum13 Carry13 Compressor32
x43 vdd Carry03 Sum04 Cout03 Sum14 Carry14 Compressor32
x44 vdd Carry04 Sum05 Cout04 Sum15 Carry15 Compressor32
x45 vdd Carry05 Sum06 Cout05 Sum16 Carry16 Compressor32
****************************************************************
x46 vdd 0 Sum10 Carry16 Sum20 Carry20 Compressor32_bari
x47 vdd vdd Sum11 Carry10 Sum21 Carry21 Compressor32
x48 vdd 0 Sum12 Carry11 Sum22 Carry22 Compressor32
x49 vdd 0 Sum13 Carry12 Sum23 Carry23 Compressor32
x50 vdd 0 Sum14 Carry13 Sum24 Carry24 Compressor32
x51 vdd 0 Sum15 Carry14 Sum25 Carry25 Compressor32
x52 vdd 0 Sum16 Carry15 Sum26 Carry26 Compressor32

1.5 HSpice code for Compressors

Weifu Li
***************************
****************The 32 Compressor Group****************
.subckt Compressor32_bar vdd x y z Sum Carry
x1 vdd x y xor xnor xor_xnor
x2 vdd xor xnor z_bar z Sum mux
x3 vdd x_bar z_bar xor xnor Carry mux
x4 vdd x x_bar inverter
x5 vdd z z_bar inverter
.ends
.subckt Compressor32 vdd x y z Sum Carry
x1 vdd x y xor xnor xor_xnor
x2 vdd xor xnor z z_bar Sum mux
x3 vdd x_buffer z_buffer xor xnor Carry mux
x4 vdd z z_bar inverter
x5 vdd x x_buffer buffer
x6 vdd z z_buffer buffer
.ends
.subckt Compressor32_bari vdd x y z Sum Carry
x1 vdd x y xor xnor xor_xnor
x2 vdd xor xnor z_bar z_buff Sum mux
x3 vdd x z_bar xor xnor Carry mux
x4 vdd z_buff z_bar inverter
x5 vdd z_b z_buff buffer
x6 vdd z z_b buffer
.ends
.subckt Compressor32_barii vdd x y z Sum Carry
x1 vdd x y xor xnor xor_xnor
x2 vdd xor xnor z z_bar Sum mux
x3 vdd x_bar z xor xnor Carry mux
x4 vdd x x_bar inverter
x5 vdd z z_bar inverter
.ends
.subckt Compressor72 vdd
+x1 x2 x3 x4 x5 x6 x7 Cin1 Cin2
+Sum Carry Cout1 Cout2
x1 vdd x5 x6 x7 c2 CGEN
x2 vdd x5 x6 s2 xor
x3 vdd x2 x3 x4 c1 CGEN
x4 vdd x2 x3 s1 xor
x5 vdd c2 c1 c3 xor
x6 vdd s2 x7 s4 xor
x7 vdd s1 x4 s3 xor
x8 vdd s3 s4 s5 xor
x9 vdd s4 s3 x1 c4 CGEN
x10 vdd c3 c4 Cout1 xor
x11 vdd c1 c4 c3 c3_bar Cout2 mux
x12 vdd c3 c3_bar inverter
x13 vdd s5 x1 s6 xor
x14 vdd s6 Cin2 s7 xor
x15 vdd s6 Cin1 s7 s7_bar Carry mux
x17 vdd s7 s7_bar inverter
x16 vdd Cin1 s7 Sum xor
.ends
.subckt Compressor42 vdd x1 x2 x3 x4
+Cin Sum Carry Cout
x1 vdd x1 x2 1 2 xor_xnor
x2 vdd x1 x3 1 2 Cout mux_single
x3 vdd x3 x4 3 4 xor_xnor
x4 vdd 1 2 3 4 5 6 mux
x5 vdd 5 6 Cin Cin_bar Sum mux_single
x6 vdd x4 Cin 5 6 Carry mux_single
x7 vdd Cin Cin_bar inverter
.ends

1.6 HSpice code for Proposed FPP

.subckt Modulo_FPP vdd
+a0 a1 a2 a3 a4 a5 a6 a7
+b0 b1 b2 b3 b4 b5 b6 b7
+S0 S1 S2 S3 S4 S5 S6
x1 vdd a0 b0 g0 nand
x2 vdd a1 b1 g1 nand
x3 vdd a2 b2 g2 nand
x4 vdd a3 b3 g3 nand
x5 vdd a4 b4 g4 nand
x6 vdd a5 b5 g5 nand
x7 vdd a6 b6 g6 nand
x23 vdd a7 b7 g7 nand
x8 vdd a0 b0 p0 nor
x9 vdd a1 b1 p1 nor
x10 vdd a2 b2 p2 nor
x11 vdd a3 b3 p3 nor
x12 vdd a4 b4 p4 nor
x13 vdd a5 b5 p5 nor
x14 vdd a6 b6 p6 nor
x24 vdd a7 b7 p7 nor
x15 vdd g7 g6 G77 nand
x16 vdd g6 g5 G66 nand
x17 vdd g5 g4 G55 nand
x18 vdd g4 g3 G44 nand
x19 vdd g3 g2 G33 nand
x20 vdd g2 g1 G22 nand
x21 vdd g1 g0 G11 nand
x22 vdd g0 p7_bar G00 nand
x25 vdd p7 p6 P77 nor
x26 vdd p6 p5 P66 nor
x27 vdd p5 p4 P55 nor
x28 vdd p4 p3 P44 nor
x29 vdd p3 p2 P33 nor
x30 vdd p2 p1 P22 nor
x31 vdd p1 p0 P11 nor
x32 vdd p0 g7_bar P00 nor
x33 vdd g7 g7_bar inverter
x34 vdd p7 p7_bar inverter
x35 vdd G77_bar P66_bar G00 G760 AOI
x36 vdd G77_bar G55_bar P75 nand
x37 vdd G33_bar P22_bar P44_bar G432 AOI
x38 vdd P75 G432 G760 H0 IOA
x39 vdd P00 P77_bar G11 G071 AOI
x40 vdd P00 G66_bar P06 nand
x41 vdd G44_bar P33_bar P55_bar G435 AOI
x42 vdd P06 G435 G071 H1 IOA
x43 vdd P11 G00 G22 G102 AOI
x44 vdd P11 G77_bar P17 nand
x45 vdd G55_bar P44_bar P66_bar G546 AOI
x46 vdd P17 G546 G102 H2 IOA
x47 vdd P22 G11 G33 G213 AOI
x48 vdd P22 P00 P20 nand
x49 vdd G66_bar P55_bar P77_bar G657 AOI
x50 vdd P20 G657 G213 H3 IOA
x51 vdd P33 G22 G44 G324 AOI
x52 vdd P33 P11 P31 nand
x53 vdd P31 G760 G324 H4 IOA
x54 vdd P44 G33 G55 G435 AOI
x55 vdd P44 P22 P42 nand
x56 vdd P42 G071 G435 H5 IOA
x57 vdd P55 G44 G66 G546 AOI
x58 vdd P55 P33 P53 nand
x59 vdd P53 G102 G546 H6 IOA
x60 vdd P66 G55 G77 G657 AOI
x61 vdd P66 P44 P64 nand
x62 vdd P64 G213 G657 H7 IOA
x63 vdd a0 b0 d0 xor
x64 vdd a1 b1 d1 xor
x65 vdd a2 b2 d2 xor
x66 vdd a3 b3 d3 xor
x67 vdd a4 b4 d4 xor
x68 vdd a5 b5 d5 xor
x69 vdd a6 b6 d6 xor
x70 vdd a7 b7 d7 xor
x72 vdd H0 H0_bar inverter
x73 vdd H1 H1_bar inverter
x74 vdd H2 H2_bar inverter
x75 vdd H3 H3_bar inverter
x76 vdd H4 H4_bar inverter
x77 vdd H5 H5_bar inverter
x78 vdd H6 H6_bar inverter
x79 vdd H7 H7_bar inverter
x80 vdd d0_bar p7_bar z0 xor
x81 vdd d1_bar p0 z1 xor
x82 vdd d2_bar p1 z2 xor
x83 vdd d3_bar p2 z3 xor
x84 vdd d4_bar p3 z4 xor
x85 vdd d5_bar p4 z5 xor
x86 vdd d6_bar p5 z6 xor
x87 vdd d7_bar p6 z7 xor
x88 vdd d0 d0_bar inverter
x89 vdd d1 d1_bar inverter
x90 vdd d2 d2_bar inverter
x91 vdd d3 d3_bar inverter
x92 vdd d4 d4_bar inverter
x93 vdd d5 d5_bar inverter
x94 vdd d6 d6_bar inverter
x95 vdd d7 d7_bar inverter
x96 vdd G77 G77_bar inverter
x97 vdd G66 G66_bar inverter
x98 vdd G55 G55_bar inverter
x99 vdd G44 G44_bar inverter
x100 vdd G33 G33_bar inverter
x101 vdd G22 G22_bar inverter
x102 vdd G11 G11_bar inverter
x103 vdd G00 G00_bar inverter
x104 vdd P77 P77_bar inverter
x105 vdd P66 P66_bar inverter
x106 vdd P55 P55_bar inverter
x107 vdd P44 P44_bar inverter
x108 vdd P33 P33_bar inverter
x109 vdd P22 P22_bar inverter
x110 vdd P11 P11_bar inverter
x111 vdd P00 P00_bar inverter
x112 vdd d0_bar z0 H7 H7_bar S0 mux
x113 vdd d1 z1 H0 H0_bar S1 mux
x114 vdd d2 z2 H1 H1_bar S2 mux
x115 vdd d3 z3 H2 H2_bar S3 mux
x116 vdd d4 z4 H3 H3_bar S4 mux
x117 vdd d5 z5 H4 H4_bar S5 mux
x118 vdd d6 z6 H5 H5_bar S6 mux
x119 vdd d7 z7 H6 H6_bar x mux
.ends
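A word-level reference model can help when checking the simulated outputs of the adder above. The Python sketch below assumes, based only on the abstract, that `Modulo_FPP` implements the inverted end-around-carry modulo 2^n+1 addition with n = 8 and that its operands use the diminished-one representation (a value v is stored as v - 1). Neither assumption is confirmed by the netlist itself, and the names `dim1_add` and `check` are hypothetical.

```python
def dim1_add(a_star, b_star, n):
    """Diminished-one modulo 2**n + 1 addition with inverted end-around carry.

    a_star and b_star are n-bit diminished-one codes (value minus one).
    The carry out of the n-bit sum is inverted and added back in.
    """
    mask = (1 << n) - 1
    total = a_star + b_star
    cout = total >> n                      # carry out of the n-bit addition
    return (total + (1 - cout)) & mask     # add the inverted carry, keep n bits


def check(n):
    """Compare dim1_add against ordinary modular arithmetic for all operands."""
    m = (1 << n) + 1
    for a in range(1, m):
        for b in range(1, m):
            s = (a + b) % m
            if s == 0:
                continue  # zero has no diminished-one code; skipped in this sketch
            if dim1_add(a - 1, b - 1, n) != s - 1:
                return False
    return True
```

For example, `check(8)` walks every nonzero operand pair of the 8-bit case. Results that are congruent to zero need separate handling, which this sketch deliberately leaves out.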

1.7 HSpice code for Sub-circuit on CMOS technology

******************************************************************
**************************The Subckt******************************
******************************************************************
*******Buffer*******
.subckt buffer vdd input output
m1 input_bar input vdd vdd pmos w=64n l=32n
m2 input_bar input 0 0 nmos w=32n l=32n
m3 output input_bar vdd vdd pmos w=128n l=32n
m4 output input_bar 0 0 nmos w=64n l=32n
.ends
*****Buffer Chain*****
.subckt buffer_chain vdd input Output
X1 vdd input input_D buffer
X2 vdd input_D input_DD buffer
X3 vdd input_DD input_DDD buffer
X4 vdd input_DDD Output buffer
.ends
************2 Input IOA**************
.subckt IOA vdd a b c Out
m1 1 a vdd vdd pmos w=128n l=32n
m2 Out b 1 vdd pmos w=128n l=32n
m3 Out c vdd vdd pmos w=64n l=32n
m4 Out c 2 0 nmos w=64n l=32n
m5 2 a 0 0 nmos w=64n l=32n
m6 2 b 0 0 nmos w=64n l=32n
.ends
************2 Input AOI**************
.subckt AOI vdd a b c Out
m1 1 a vdd vdd pmos w=128n l=32n
m2 1 b vdd vdd pmos w=128n l=32n
m3 Out c 1 vdd pmos w=128n l=32n
m4 Out c 0 0 nmos w=32n l=32n
m5 Out a 2 0 nmos w=64n l=32n
m6 2 b 0 0 nmos w=64n l=32n
.ends
************2 Input XOR**************
.subckt xor vdd a b Out
m1 a b Out_bar 0 nmos w=32n l=32n
m2 b a Out_bar 0 nmos w=32n l=32n
m3 Out_bar a 1 vdd pmos w=64n l=32n
m4 1 b vdd vdd pmos w=64n l=32n
m5 Out Out_bar vdd vdd pmos w=64n l=32n
m6 Out Out_bar 0 0 nmos w=32n l=32n
.ends
************2 Input XOR_S**************
.subckt xor_s vdd a b Out
x1 vdd a a_bar inverter
x2 vdd b b_bar inverter
m1 b a_bar Out 0 nmos w=32n l=32n
m2 Out a b vdd pmos w=64n l=32n
m3 b_bar a Out 0 nmos w=32n l=32n
m4 Out a_bar b_bar vdd pmos w=64n l=32n
.ends
************2 Input XNOR**************
.subckt xnor vdd a b Out
m1 a b Out 0 nmos w=32n l=32n
m2 b a Out 0 nmos w=32n l=32n
m3 Out a 1 vdd pmos w=64n l=32n
m4 1 b vdd vdd pmos w=64n l=32n
.ends
************2 Input NOR**************
.subckt nor vdd a b Out
m1 1 a vdd vdd pmos w=128n l=32n
m2 Out b 1 vdd pmos w=128n l=32n
m3 Out a 0 0 nmos w=32n l=32n
m4 Out b 0 0 nmos w=32n l=32n
.ends
************2 Input NAND**************
.subckt nand vdd a b Output
m1 Output a vdd vdd pmos w=64n l=32n
m2 Output b vdd vdd pmos w=64n l=32n
m3 Output a 1 0 nmos w=64n l=32n
m4 1 b 0 0 nmos w=64n l=32n
.ends
************The Inverter**************
.subckt inverter vdd input Output
m1 Output input vdd vdd pmos w=64n l=32n
m2 Output input 0 0 nmos w=32n l=32n
.ends
**************3 Input nand****************
.subckt nand_3 vdd a b c Output
m1 Output a vdd vdd pmos w=64n l=32n
m2 Output b vdd vdd pmos w=64n l=32n
m3 Output c vdd vdd pmos w=64n l=32n
m4 Output a 1 0 nmos w=96n l=32n
m5 1 b 2 0 nmos w=96n l=32n
m6 2 c 0 0 nmos w=96n l=32n
.ends
************2 Input XOR_XNOR**************
.subckt xor_xnor vdd a b xor xnor
m1 a b xor vdd pmos L=32n W=64n
m2 xor a b vdd pmos L=32n W=64n
m3 xor b 1 0 nmos L=32n W=32n
m4 1 a 0 0 nmos L=32n W=32n
m5 xnor xor vdd vdd pmos L=32n W=64n
m6 xor xnor 0 0 nmos L=32n W=32n
m7 2 b vdd vdd pmos L=32n W=64n
m8 xnor a 2 vdd pmos L=32n W=64n
m9 a b xnor 0 nmos L=32n W=32n
m10 xnor a b 0 nmos L=32n W=32n
.ends
************2-1 Mux**************
.subckt mux vdd a b sel sel_bar Output
m1 1 b vdd vdd pmos w=64n l=32n
m2 Out_bar sel_bar 1 vdd pmos w=64n l=32n
m3 Out_bar sel 2 0 nmos w=32n l=32n
m4 2 b 0 0 nmos w=32n l=32n
m5 3 a vdd vdd pmos w=64n l=32n
m6 Out_bar sel 3 vdd pmos w=64n l=32n
m7 Out_bar sel_bar 4 0 nmos w=32n l=32n
m8 4 a 0 0 nmos w=32n l=32n
m9 Output Out_bar vdd vdd pmos w=64n l=32n
m10 Output Out_bar 0 0 nmos w=32n l=32n
*****Sel=0 Output = a*****
.ends
************ fanout **************
.subckt fanout vdd input output
x1 vdd input output inverter
x2 vdd input output inverter
x3 vdd input output inverter
x4 vdd input output inverter
.ends
************ CGEN **************
.subckt CGEN vdd a b cin Carry
m1 1 b vdd vdd pmos l=32n w=128n
m2 1 a vdd vdd pmos l=32n w=128n
m3 2 cin 1 vdd pmos l=32n w=128n
m4 2 cin 3 0 nmos l=32n w=64n
m5 3 b 0 0 nmos l=32n w=64n
m6 3 a 0 0 nmos l=32n w=64n
m7 4 b vdd vdd pmos l=32n w=128n
m8 2 a 4 vdd pmos l=32n w=128n
m9 2 a 6 0 nmos l=32n w=64n
m10 6 b 0 0 nmos l=32n w=64n
x1 vdd 2 carry inverter
.ends
************ Dual-Output Mux **************
.subckt mux_dual vdd a b set set_bar out outbar
m1 a set out 0 nmos W=32n L=32n
m4 a set_bar outbar 0 nmos W=32n L=32n
m2 b set_bar out 0 nmos W=32n L=32n
m3 b set outbar 0 nmos W=32n L=32n
m5 out outbar vdd vdd pmos W=64n L=32n
m6 outbar out vdd vdd pmos W=64n L=32n
.ends
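The logic functions of the `IOA` and `AOI` cells in this library can be read off their transistor networks. The Python sketch below is a switch-level abstraction under simplifying assumptions (ideal switches, a PMOS conducts when its gate is 0 and an NMOS when its gate is 1). It models the pull-up and pull-down networks as written in the two sub-circuits and checks that they are complementary. The function names are illustrative.

```python
from itertools import product


def ioa(a, b, c):
    """Switch-level view of the IOA cell: Out = NOT((a OR b) AND c)."""
    pull_up = bool((not a and not b) or (not c))   # m1-m2 in series, m3 in parallel
    pull_down = bool(c and (a or b))               # m4 in series with m5 || m6
    assert pull_up != pull_down                    # exactly one network conducts
    return int(pull_up)


def aoi(a, b, c):
    """Switch-level view of the AOI cell: Out = NOT((a AND b) OR c)."""
    pull_up = bool((not a or not b) and (not c))   # m1 || m2, in series with m3
    pull_down = bool(c or (a and b))               # m4 in parallel with m5-m6
    assert pull_up != pull_down
    return int(pull_up)


def truth_table(gate):
    """Return {(a, b, c): out} for all eight input combinations."""
    return {bits: gate(*bits) for bits in product((0, 1), repeat=3)}
```

`truth_table(ioa)` and `truth_table(aoi)` give the expected static outputs, which can be compared with DC results from the netlists.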


1.8 HSpice code for Sub-circuit on CNT technology

Weifu Li
***************************
***********2 Input IOA**************
.subckt IOA vdd a b c Out
x1 1 a vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x2 Out b 1 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x3 Out c vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x4 Out c 2 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x5 2 a 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x6 2 b 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
***********The Output of this gate is Out=[(a+b)*c]_bar**************
.ends
************2 Input AOI**************
.subckt AOI vdd a b c Out
x1 1 a vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x2 1 b vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x3 Out c 1 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x4 Out c 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x5 Out a 2 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x6 2 b 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
.ends
************2 Input XOR_S**************
.subckt xor_s vdd a b Out
x1 vdd a a_bar inverter
x2 vdd b b_bar inverter
x3 b a_bar Out 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x4 Out a b 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x5 b_bar a Out 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x6 Out a_bar b_bar 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
.ends
************2 Input NOR**************
.subckt nor vdd a b Out
x1 1 a vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x2 Out b 1 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x3 Out a 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x4 Out b 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
.ends
************2 Input NAND**************
.subckt nand vdd a b Output
x1 Output a vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x2 Output b vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x3 Output a 1 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x4 1 b 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
.ends
************The Inverter**************
.subckt inverter vdd input Output
x1 Output input vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x2 Output input 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
.ends
************2 Input XOR_XNOR**************
.subckt xor_xnor vdd a b xor xnor
x1 a b xor 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x2 xor a b 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x3 xor b 1 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x4 1 a 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x5 xnor xor vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=2 pitch=4e-9
x6 xor xnor 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=2 pitch=4e-9
x7 2 b vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x8 xnor a 2 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x9 a b xnor 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x10 xnor a b 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
.ends
************2-1 Mux**************
.subckt mux vdd a b sel sel_bar Output
x1 1 b vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x2 Out_bar sel_bar 1 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x3 Out_bar sel 2 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x4 2 b 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x5 3 a vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x6 Out_bar sel 3 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x7 Out_bar sel_bar 4 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x8 4 a 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x9 Output Out_bar vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x10 Output Out_bar 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
*****Sel=0 Output = a*****
.ends
**********fanout***********
.subckt fanout vdd input output
x1 vdd input output inverter
x2 vdd input output inverter
x3 vdd input output inverter
x4 vdd input output inverter
.ends
**********Buffer***********
.subckt buffer vdd input output
x1 input_bar input vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x2 input_bar input 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x3 output input_bar vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x4 output input_bar 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
.ends


Appendix B. Monte Carlo Simulation Data

2.1 Monte Carlo Simulation for CMOS

1 1.1 1.2 1.3 1.4 15.5 15.6

415 403.9 403.5 420 425.1 412.2 425.2 1.5 1.6 1.7 1.8 1.9 16 16.1

434.1 395.7 408.7 424.8 429.7 357.7 411.2 2 2.1 2.2 2.3 2.4 16.5 16.6

442.2 390.7 407.7 402.6 398.5 413.5 336.6 2.5 2.6 2.7 2.8 2.9 17 17.1

422.4 401.6 455 394.5 414.2 382.9 374.6 3 3.1 3.2 3.3 3.4 17.5 17.6

449.9 433.5 378.3 477.5 408.2 407 442.1 3.5 3.6 3.7 3.8 3.9 18 18.1

424.6 447.4 357.2 459.2 421.2 378.4 386.7 4 4.1 4.2 4.3 4.4 18.5 18.6

415.2 389.1 374.3 408.9 424.3 378.5 395.2 4.5 4.6 4.7 4.8 4.9 19 19.1

405.9 394.6 392.7 447.5 397.4 373.4 385.3 5 5.1 5.2 5.3 5.4 19.5 19.6

461.1 438.8 408.5 478.1 416.9 376.6 388.5 5.5 5.6 5.7 5.8 5.9 20 20.1

437.8 453.1 435.4 393.7 412.2 397.6 420 6 6.1 6.2 6.3 6.4 20.5 20.6

378.1 435.2 413.3 417.1 412.8 396.2 368.4 6.5 6.6 6.7 6.8 6.9 21 21.1

437.9 353.6 414.2 417.5 407.1 403 444.7 7 7.1 7.2 7.3 7.4 21.5 21.6

406.5 397.9 430.3 417.8 433.5 421.1 386 7.5 7.6 7.7 7.8 7.9 22 22.1

431.5 471.2 424 465.5 462.3 428.9 378.2 8 8.1 8.2 8.3 8.4 22.5 22.6

399.4 410.6 387.8 436.7 425.1 410.3 391 8.5 8.6 8.7 8.8 8.9 23 23.1

401.6 414.8 439.5 369.4 441.3 435.8 420.7 9 9.1 9.2 9.3 9.4 23.5 23.6

424.2 414.5 414.7 406.2 369.8 413.5 434.2 9.5 9.6 9.7 9.8 9.9 24 24.1

399.9 411 434 377.4 401.5 402 377.7 10 10.1 10.2 10.3 10.4 24.5 24.6 421 446.6 484.4 450.4 394.2 395.2 384.6


10.5 10.6 10.7 10.8 10.9 25 25.1 418.9 390.1 390.1 402.6 432.8 447.1 426.5

11 11.1 11.2 11.3 11.4 25.5 25.6 391.4 431.8 380.6 396.8 402 424.4 438.5 11.5 11.6 11.7 11.8 11.9 26 26.1 409.8 372.8 384.4 401.6 405 365.8 422.1

12 12.1 12.2 12.3 12.4 26.5 26.6 416.8 369.3 387.2 380.1 375.9 424.9 344.8 12.5 12.6 12.7 12.8 12.9 27 27.1 398.9 379.8 427.9 372.6 393.4 394.1 385.8 15.9 30 30.1 30.2 30.3 30.4 44.5 398.5 408.7 433.3 469 436 385.5 388.7 16.4 30.5 30.6 30.7 30.8 30.9 45 388.2 406.3 378 378.4 390.8 420.4 440.3 16.9 31 31.1 31.2 31.3 31.4 45.5 387.1 408.7 452.3 398.1 414.5 418.9 418.2 17.4 31.5 31.6 31.7 31.8 31.9 46 409.5 427.8 390 402.9 418.4 423.2 361.5 17.9 32 32.1 32.2 32.3 32.4 46.5 433.3 435.5 386 401.3 397.8 392.7 418.9 18.4 32.5 32.6 32.7 32.8 32.9 47 402.2 416.6 395.7 447.5 388 406.6 388.9 18.9 33 33.1 33.2 33.3 33.4 47.5 414.3 443.1 427 373.1 469.5 402.7 413.2 19.4 33.5 33.6 33.7 33.8 33.9 48 401.7 419.2 440.4 351.6 491.2 415.7 381.5 19.9 34 34.1 34.2 34.3 34.4 48.5 379.2 408.6 386.6 368.2 404.2 417.7 384.9 20.4 34.5 34.6 34.7 34.8 34.9 49 372.6 401.1 390.4 387.5 440.8 392.4 380.7 20.9 35 35.1 35.2 35.3 35.4 49.5 408.7 454 432.2 402.2 470.1 409.9 384 21.4 35.5 35.6 35.7 35.8 35.9 50 413.2 430.5 445.1 428.9 387.9 415.8 402.5 21.9 36 36.1 36.2 36.3 36.4 50.5 416.7 374.2 429.1 406.3 411.1 407.2 401.3 22.4 36.5 36.6 36.7 36.8 36.9 51 386.8 431.9 350.4 408.3 411.1 401.6 398 22.9 37 37.1 37.2 37.3 37.4 51.5 402.3 398.7 390.5 424 412 427.1 416.2 23.4 37.5 37.6 37.7 37.8 37.9 52 395.8 424.3 462.7 418.4 458.7 453.6 423.5 23.9 38 38.1 38.2 38.3 38.4 52.5


409.4 392.9 405.1 381.6 429.7 419.4 405.7 24.4 38.5 38.6 38.7 38.8 38.9 53 413.4 395.9 409.1 433 363.2 433.8 430.9 24.9 39 39.1 39.2 39.3 39.4 53.5 386.9 390.7 400.9 408.8 406.6 418.9 407.3 25.4 39.5 39.6 39.7 39.8 39.9 54 402.5 394.6 403.4 427.2 372.1 396.4 399.1 25.9 40 40.1 40.2 40.3 40.4 54.5 409.4 414.7 440 476.9 443.7 388.5 389.9 26.4 40.5 40.6 40.7 40.8 40.9 55 400.9 413.5 386 385.3 396.3 426 441.8 26.9 41 41.1 41.2 41.3 41.4 55.5 395.9 397.2 437.7 386.5 402.4 407.5 419.3 27.4 41.5 41.6 41.7 41.8 41.9 56 402.3 415.1 379.2 391 407.1 410.4 362.4 44.8 44.9 59 59.1 59.2 59.3 59.4 428.4 382.9 381.1 389.1 398.1 398.4 407.9 45.3 45.4 59.5 59.6 59.7 59.8 59.9 455.7 397.7 385.1 394.6 416.7 363.2 386.5 45.8 45.9 60 60.1 60.2 60.3 60.4 377.6 403.3 403.8 427.4 464.1 431 377.2 46.3 46.4 60.5 60.6 60.7 60.8 60.9 399.6 394.7 404.5 374 373 386.6 414.9 46.8 46.9 61 61.1 61.2 61.3 61.4 399.6 389.9 407.6 450.5 398.2 413.6 417.8 47.3 47.4 61.5 61.6 61.7 61.8 61.9 400 414.9 426.3 388.8 401.6 418.1 421.8 47.8 47.9 62 62.1 62.2 62.3 62.4 445.5 440.2 434.5 385.2 399.7 396.5 391.6 48.3 48.4 62.5 62.6 62.7 62.8 62.9 417.1 406.9 415.7 394.5 446.7 386.9 406.7 48.8 48.9 63 63.1 63.2 63.3 63.4 354.6 420.3 441.6 425.6 372.7 467.9 401.3 49.3 49.4 63.5 63.6 63.7 63.8 63.9 396.5 407.3 417.3 439.8 350.7 490 413.9 49.8 49.9 64 64.1 64.2 64.3 64.4 362.4 385.9 408.2 385 367.1 402.8 417.1 50.3 50.4 64.5 64.6 64.7 64.8 64.9 429.5 376.5 399.8 388.9 385.7 439.7 392 50.8 50.9 65 65.1 65.2 65.3 65.4 384.6 414.5 453.4 430.3 401.7 468.9 407.3 51.3 51.4 65.5 65.6 65.7 65.8 65.9 404.2 408.6 429.4 443.5 427.3 388 414.5


51.8 51.9 66 66.1 66.2 66.3 66.4 408 411.3 373.9 427.6 406 409.4 405.8 52.3 52.4 66.5 66.6 66.7 66.8 66.9 387.6 384.6 431.1 349.4 406.7 409.3 400.1 52.8 52.9 67 67.1 67.2 67.3 67.4 379.7 396.8 398.2 391 421.8 410.7 425.6 53.3 53.4 67.5 67.6 67.7 67.8 67.9 457.7 392.2 423.7 462.5 416.9 458.3 453.4 53.8 53.9 68 68.1 68.2 68.3 68.4 477.5 405.4 392.7 401.5 379.5 428.7 417.8 54.3 54.4 68.5 68.6 68.7 68.8 68.9 393.1 407.9 392.9 408.1 431 362 432.7 54.8 54.9 69 69.1 69.2 69.3 69.4 429.2 385.3 390.2 399.4 406 406.6 417.9 55.3 55.4 69.5 69.6 69.7 69.8 69.9 447.8 398.5 392.9 403.7 426.4 370.9 394.9 55.8 55.9 70 70.1 70.2 70.3 70.4 378.3 405.3 413.1 438.6 475.2 442 386.6 56.3 56.4 70.5 70.6 70.7 70.8 70.9 400.7 396.5 411.2 385.2 383.3 395.6 424.3

13 13.1 13.2 13.3 13.4 27.5 27.6 423.3 410.4 357.9 448.6 384.9 418.5 456.3 13.5 13.6 13.7 13.8 13.9 28 28.1 401.6 421.8 337.3 469 398 388.5 398.8

14 14.1 14.2 14.3 14.4 28.5 28.6 392.6 369.4 355.2 388.2 401.1 388.4 403.7 14.5 14.6 14.7 14.8 14.9 29 29.1 384.2 373.7 370.8 421.2 375.3 386.3 395.2

15 15.1 15.2 15.3 15.4 29.5 29.6 435.5 413.7 386.2 449.2 391.9 388.8 398.3

80 80.1 80.2 80.3 80.4 80.5 80.6 398 439.3 387.2 404.2 408.6 416.2 378.7 91 91.1 91.2 91.3 91.4 91.5 91.6

430.9 415.7 363.4 457.7 392.2 407.3 429 93 93.1 93.2 93.3 93.4 93.5 93.6

441.8 420.5 392.8 457.8 398.5 419.3 433 95 95.1 95.2 95.3 95.4 95.5 95.6

390.4 380.5 413.3 400.8 416.2 414.9 450.4 97 97.1 97.2 97.3 97.4 97.5 97.6

381.1 389.1 398.1 398.4 407.9 385.1 394.6 99 99.1 99.2 99.3 99.4 99.5 99.6

407.6 450.5 398.2 413.6 417.8 426.3 388.8 1 1.1 1.2 1.3 1.4 1.5 1.6


441.6 425.6 372.7 467.9 401.3 417.3 439.8 3 3.1 3.2 3.3 3.4 3.5 3.6

453.4 430.3 401.7 468.9 407.3 429.4 443.5 5 5.1 5.2 5.3 5.4 5.5 5.6

398.2 391 421.8 410.7 425.6 423.7 462.5 7 7.1 7.2 7.3 7.4 7.5 7.6

390.2 399.4 406 406.6 417.9 392.9 403.7 9 9.1 9.2 9.3 9.4 9.5 9.6

404.4 447.6 394.1 410.3 415.7 423.9 386.6 11 11.1 11.2 11.3 11.4 11.5 11.6 439 422.5 370.7 464.8 398.5 415.6 437 13 13.1 13.2 13.3 13.4 13.5 13.6

449.8 427.9 399.5 466.6 406.3 426.5 444.6 15 15.1 15.2 15.3 15.4 15.5 15.6

399.8 442.2 389.3 405.9 410.3 419.1 382 17 17.1 17.2 17.3 17.4 17.5 17.6

433.6 418.1 366.8 459.8 394.4 410.7 432.1 56.6 56.7 56.8 56.9 58.6 58.7 58.8 341.2 399.2 400 390.4 397.8 420.5 355.7 57.1 57.2 57.3 57.4 90.6 90.7 90.8 380.5 413.3 400.8 416.2 385.3 436.1 379.7 57.6 57.7 57.8 57.9 92.6 92.7 92.8 450.4 407.9 447.2 441.2 380.4 377.5 429.2 58.1 58.2 58.3 58.4 94.6 94.7 94.8 393.5 371.4 419 407.6 341.2 399.2 400 27.9 42 42.1 42.2 42.3 42.4 56.5 446.7 422.7 373.8 389.7 385.8 383.9 420 28.4 42.5 42.6 42.7 42.8 42.9 57 412.4 404.6 384.2 434.3 378.7 396.3 390.4 28.9 43 43.1 43.2 43.3 43.4 57.5 426.8 430.2 414.2 363.1 455.7 391.8 414.9 29.4 43.5 43.6 43.7 43.8 43.9 58 413 406.7 428 342.1 476.4 404.5 383.9 29.9 44 44.1 44.2 44.3 44.4 58.5 391.2 397.6 373.7 359.7 394.2 407 386 80.9 90 90.1 90.2 90.3 90.4 90.5 411.3 423.5 375.5 390.9 387.6 384.6 405.7 91.9 92 92.1 92.2 92.3 92.4 92.5 405.4 399.1 373.2 360.7 393.1 407.9 389.9 93.9 94 94.1 94.2 94.3 94.4 94.5 405.3 362.4 417.4 396.6 400.7 396.5 420 95.9 96 96.1 96.2 96.3 96.4 96.5 441.2 383.9 393.5 371.4 419 407.6 386


97.9 98 98.1 98.2 98.3 98.4 98.5 386.5 403.8 427.4 464.1 431 377.2 404.5 99.9 100 100.1 100.2 100.3 100.4 100.5 421.8 434.5 385.2 399.7 386.5 391.6 415.7

1.9 2 2.1 2.2 2.3 2.4 2.5 413.9 408.2 385 367.1 402.8 417.1 399.8

3.9 4 4.1 4.2 4.3 4.4 4.5 414.5 373.9 427.6 406 409.4 405.8 431.1

5.9 6 6.1 6.2 6.3 6.4 6.5 453.4 392.7 401.5 379.5 428.7 417.8 392.9

7.9 8 8.1 8.2 8.3 8.4 8.5 394.9 413.1 438.6 475.2 442 386.6 411.2

9.9 10 10.1 10.2 10.3 10.4 10.5 419.6 431.6 380.2 397.6 394 388.5 413.6 11.9 12 12.1 12.2 12.3 12.4 12.5 412.1 405.7 379.8 365.6 400.1 415 397.6 13.9 14 14.1 14.2 14.3 14.4 14.5 460.4 401.3 421.8 435.7 421 373.8 417.7 15.9 16 16.1 16.2 16.3 16.4 16.5 414.9 427 376.7 393.3 389.8 383.9 407.7 17.9 18 18.1 18.2 18.3 18.4 18.5 406.9 401.8 375.3 361.7 395.4 409.7 393.8 96.7 96.8 96.9 4.6 4.7 4.8 4.9 420.5 355.7 421.9 349.4 406.7 409.3 400.1 98.7 98.8 98.9 6.6 6.7 6.8 6.9 373 386.6 414.9 408.1 431 362 432.7

100.7 100.8 100.9 8.6 8.7 8.8 8.9 446.7 386.9 406.7 385.2 383.3 395.6 424.3

2.7 2.8 2.9 10.6 10.7 10.8 10.9 385.7 439.7 392 392.1 444 385.2 404.6 15.7 15.8 44.6 44.7 27.7 27.8 12.8 411.6 374.1 379 376 412.1 452.8 436.9 16.2 16.3 45.1 45.2 28.2 28.3 14.8 393 394.2 419.6 390.6 376.8 423.6 416.7 16.7 16.8 45.6 45.7 28.7 28.8 12.9 390.5 392.2 431.7 416.4 426.2 359.3 388.4 17.2 17.3 46.1 46.2 29.2 29.3 14.9 406.3 393.8 417.2 395.7 402.3 404.1 424.6 17.7 17.8 46.6 46.7 29.7 29.8 16.6 401.9 439.2 339.9 396.9 421.5 368.2 386.9 18.2 18.3 47.1 47.2 80.7 80.8 18.6 366.7 412 379 412.4 391 408 384.4 18.7 18.8 47.6 47.7 91.7 91.8 16.7


413.5 348.7 449.2 406.4 342.7 477.5 438.6 19.2 19.3 48.1 48.2 93.7 93.8 18.7 391.3 390.7 393 370.8 418.6 378.3 378.2 19.7 19.8 48.6 48.7 95.7 95.8 16.8 409.6 356.6 397.7 419.7 407.9 447.2 381.3 20.2 20.3 49.1 49.2 97.7 97.8 18.8 457.1 423.7 392.6 395.8 416.7 363.2 431.4 20.7 20.8 49.6 49.7 99.7 99.8 16.9 367.6 379.1 393.4 415 401.6 418.1 399.7 21.2 21.3 50.1 50.2 1.7 1.8 18.9 392.4 407.3 425.2 462.9 350.7 490 384 21.7 21.8 50.6 50.7 3.7 3.8 12.6 397 413.1 373.2 372.9 427.3 388 385.8 22.2 22.3 51.1 51.2 5.7 5.8 14.6 395.6 391 439.3 386.2 416.9 458.3 436.6 22.7 22.8 51.6 51.7 7.7 7.8 12.7 440.7 383.3 378.7 391 426.4 370.9 386.2 23.2 23.3 52.1 52.2 9.7 9.8 14.7 367.9 442.8 375.5 390.9 399.1 414.6 391.3 23.7 23.8 52.6 52.7 11.7 11.8 346.9 483.8 385.3 436.1 348.7 486.2 24.2 24.3 53.1 53.2 13.7 13.8 363.2 397.8 415.7 363.4 422.8 384.3 24.7 24.8 53.6 53.7 15.7 15.8 382.6 434.2 429 342.7 394.6 409.7 25.2 25.3 54.1 54.2 17.7 17.8 397.1 462.9 373.2 360.7 346 480.7 25.7 25.8 54.6 54.7 58.9 96.6 423 382.1 380.4 377.5 421.9 397.8 26.2 26.3 55.1 55.2 90.9 98.6 401.3 404.6 420.5 392.8 396.8 374 26.7 26.8 55.6 55.7 92.9 100.6 401.7 404 433 418.6 385.3 394.5 27.2 27.3 56.1 56.2 94.9 2.6 417.5 406.1 417.4 396.6 390.4 388.9


2.2 Monte Carlo Simulation for CNT

1 2 3 4 5 86 87

28.9 28.71 28.59 28.95 28.92 29.24 29.56 6 7 8 9 10 91 92

28.96 28.82 28.97 29.38 29.11 29.03 28.85 11 12 13 14 15 96 97

28.87 29.09 28.8 29 28.85 29.5 29.25 16 17 18 19 20 101 102

28.6 28.52 29.16 29.6 29.25 29.38 29.06 21 22 23 24 25 106 107

29.67 29.42 29 28.9 28.82 28.8 29.5 26 27 28 29 30 111 112

28.69 28.84 28.74 29.6 29.81 29.84 29.15 31 32 33 34 35 116 117

29.23 29.92 29.4 29.39 29.6 29.45 28.96 36 37 38 39 40 121 122

29.1 29.48 29.31 29.41 29.5 28.83 29.48 41 42 43 44 45 126 127

29.17 29.77 29.53 29.12 28.91 29.89 29.2 46 47 48 49 50 88 89

28.72 28.62 28.84 29.14 29.77 29.23 29.64 51 52 53 54 55 93 94

29.22 29.84 29.58 29.04 29.05 28.75 28.67 56 57 58 59 60 98 99

28.91 28.73 28.89 29.13 29.72 29.53 29.21 61 62 63 64 65 103 104

29.24 29.77 29.27 29.51 29.2 28.85 28.75 66 67 68 69 70 108 109

29.63 29.13 29.72 29.24 29.77 29.38 29.55 71 72 73 74 75 113 114

29.14 29.66 29.26 29.7 29.44 29.63 29.25 76 77 78 79 80 118 119

28.98 28.93 28.86 28.7 28.88 28.92 28.84 81 82 83 84 85 123 124

29.44 29.27 29.51 29.2 29.63 29.37 29.47 95 100 105 110 90 128 129

28.79 29.63 28.65 29.02 29.39 29.15 29.08 115 120 125 130

29.68 28.69 29.04 28.84
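One way to compare the spread of the two Monte Carlo data sets is the ratio of standard deviation to mean. The Python sketch below is illustrative only: it uses just the first seven values listed in Section 2.1 (CMOS) and Section 2.2 (CNT), the appendix does not state the measured quantity or its unit, and the helper name `spread` is hypothetical. A real comparison should load every sample from both tables.

```python
from statistics import mean, stdev

# First seven values of each table in this appendix (units not stated there).
cmos_sample = [415, 403.9, 403.5, 420, 425.1, 412.2, 425.2]
cnt_sample = [28.9, 28.71, 28.59, 28.95, 28.92, 29.24, 29.56]


def spread(values):
    """Return (mean, sample standard deviation, sigma/mean) for a list of samples."""
    mu = mean(values)
    sigma = stdev(values)
    return mu, sigma, sigma / mu


cmos_stats = spread(cmos_sample)
cnt_stats = spread(cnt_sample)
```

The third element of each tuple is dimensionless, so it can be compared across the two technologies even though their absolute values differ by an order of magnitude.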
