A High Performance Modulo 2^n+1 Squarer Design Based on Carbon Nanotube Technology
A Thesis Presented
by
Weifu Li
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirement
for the degree of
Master of Science
in
Electrical Engineering
in the field of
Electronic Circuits and Semiconductor Devices
Northeastern University
Boston, Massachusetts
November, 2012
Abstract
The modulo 2^n+1 squarer is an important component widely used in digital systems such as digital signal processing (DSP), cryptography and residue arithmetic. In this thesis, an improved high-speed, low-power design of the modulo 2^n+1 squarer is proposed. The improvement comes from three sources: the algorithm, the circuit implementation and the implementation technology. For the algorithm, the partial product matrix reconstruction is optimized to achieve a larger input range and fewer operation steps. A modified Wallace tree is also employed in the partial product compression process. For the circuit implementation, the full adders of the traditional Wallace tree structure are replaced by 3:2 compressors, and a sparse-tree based inverted End-Around-Carry (EAC) modulo 2^n+1 adder is used to implement the final addition stage. The proposed design demonstrates better speed, power and area than existing designs. Considering the limitations of CMOS technology, a novel carbon nanotube (CNT) implementation technology is also used. HSpice simulation of critical path delay, power and power-delay product (PDP) shows that the CNT-based implementation is a competitive candidate for high-speed, low-power applications. A Monte Carlo simulation is also performed to demonstrate the better process, voltage and temperature (PVT) properties of CNT technology.
Keywords: modulo 2^n+1 squarer, Wallace tree structure, sparse-tree modulo 2^n+1 adder, carbon nanotube technology, Monte Carlo, HSpice simulation
Acknowledgements
It is a pleasure to thank the many people who made this dissertation possible. I would like to express my gratitude to my research advisor, Dr. Yong-Bin Kim, whose expertise, understanding and patience added considerably to my graduate experience. I appreciate his vast knowledge in many areas and his assistance in writing technical articles.
I am especially grateful to the members of my dissertation committee, Prof. Fabrizio Lombardi and Prof. Minsu Choi, and to the faculty members of the Department of Electrical and Computer Engineering.
Thanks to my student colleagues, Jing Yang and Jing Lv, who provided me with a collaborative, stimulating and enjoyable environment.
I want to dedicate this work to my parents, Yi Li and Xiaoxia Lan. None of my achievements would have been possible without their support. Words cannot express my gratitude to them.
Finally, I want to say thanks to my dear wife, Zijian Chen. With your company, the two years in Boston became one of the best memories of my life. I love you forever.
Boston, Massachusetts
Weifu Li
Nov 12, 2012
Contents
I. Introduction
    1.1 Background
    1.2 Work Statement
    1.3 Thesis Outline
II. Algorithm
    2.1 Partial Product Generation and Repositioning
    2.2 Partial Product Matrix Compression
    2.3 Final Addition Stage
    2.4 Computation Example
    2.5 Algorithm Performance Comparison
        2.5.1 Existing Algorithm Review
        2.5.2 Performance Analysis
III. Circuit Implementation on CMOS Technology
    3.1 Partial Product Generation and Repositioning Module
    3.2 Wallace Tree Compression Module
        3.2.1 Design of 3:2 Compressors
    3.3 Modulo 2^n+1 Adder
    3.4 Design of Primitive Blocks
        3.4.1 Circuit Design of XOR-XNOR
        3.4.2 Circuit Design of MUX
        3.4.3 Circuit Design of GP generator
    3.5 Summary and Simulation Result
IV. Circuit Implementation on CNT Technology
    4.1 Introduction to CNTFET
    4.2 Circuit Implementation
    4.3 Performance Comparison
        4.3.1 PDP Comparison
        4.3.2 PVT Comparison
V. Conclusion
Reference
Appendices
    Appendix A. HSpice Code
        1.1 HSpice code for modulo 2^n+1 squarer
        1.2 HSpice code for modulo 2^n+1 adder
        1.3 HSpice code for partial product Matrix
        1.4 Traditional Partial Product Process
        1.5 HSpice code for Compressors
        1.6 HSpice code for Proposed FPP
        1.7 HSpice code for Sub-circuit on CMOS technology
        1.8 HSpice code for Sub-circuit on CMOS technology
    Appendix B. Monte Carlo Simulation Data
        2.1 Monte Carlo Simulation for CMOS
        2.2 Monte Carlo Simulation for CNT
List of Table
Table- 1 Initial Partial Product Matrix
Table- 2 Modified Partial Product Matrix
Table- 3 Repositioned Partial Product Matrix
Table- 4 Shifted Partial Product Matrix
Table- 5 Final Partial Product Matrix
Table- 6 Number of Wallace Trees Stage
Table- 7 Initial Partial Product Matrix of example
Table- 8 Repositioned Partial Product Matrix of example
Table- 9 Shifted Partial Product Matrix of example
Table- 10 Compression Process of example
Table- 11 Final Result of example
Table- 12 Modified Partial Product Matrix of Existing Algorithm
Table- 13 Performance Comparison of Compressor
Table- 14 Performance Compression of First Two Stage
Table- 15 Performance Comparison between Full Adder and 3:2 Compressor
Table- 16 Performance Compression between Different Compression Processes
Table- 17 Performance Comparison of existing modulo 2^n+1 adder
Table- 18 Performance Compression between Proposed FPP and Sparse Tree
Table- 19 Performance Comparison between GP generators
Table- 20 Performance of modulo 2^8+1 Squarer on Two Technologies
Table- 21 Process Variation of CMOS Technology
Table- 22 Process Variation of CNT Technology
Table- 23 Supply Voltage Variation of CMOS Technology
Table- 24 Supply Voltage Variation of CNT Technology
Table- 25 Temperature Variation of CMOS Technology
Table- 26 Temperature Variation of CNT Technology
List of Figure
Figure 1: Wallace tree Compression Process
Figure 2: Schematic of nand gate
Figure 3: Schematic of full adder
Figure 4: Schematic of 3:2 Compressor
Figure 5: Modified 3:2 Compressor
Figure 6: Compression process based on Existing Algorithm
Figure 7: Compression process based on Proposed Algorithm
Figure 8: Critical path delay of traditional compression process
Figure 9: Critical path delay of proposed compression process
Figure 10: Schematic of Proposed FPP
Figure 11: Schematic of Sparse Tree
Figure 12: Schematic of Conditional Sum Generator
Figure 13: Critical path delay of proposed FPP
Figure 14: Critical path delay of sparse tree adder
Figure 15: Schematic of existing XOR-XNOR gate
Figure 16: Schematic of improved XOR-XNOR gate
Figure 17: Schematic of MUX
Figure 18: Schematic of Simple GP Generator
Figure 19: Schematic of AOI and OAI Gate
Figure 20: Delay and Rise time of modulo 2^n+1 squarer
Figure 21: power consumption of modulo 2^n+1 squarer
Figure 22: Non-critical delay of modulo 2^n+1 squarer
Figure 23: Schematic of CNTFET transistor
Figure 24: Schematic of unrolled nanotube
Figure 25: CNTFET threshold voltage varies with n
Figure 26: I-V characteristic of CNTFET transistor
Figure 27: Delay with various number of nanotube
Figure 28: Critical path delay of modulo 2^8+1 squarer
Figure 29: power of modulo 2^8+1 squarer
Figure 30: Process Delay Variation of CMOS Technology
Figure 31: Process Delay Variation of CNT Technology
Figure 32: Supply Voltage Delay Variation of CMOS Technology
Figure 33: Supply Voltage Delay Variation of CNT Technology
Figure 34: Temperature Delay Variation of CMOS Technology
Figure 35: Temperature Delay Variation of CNT Technology
Figure 36: Monte Carlo analysis of CMOS implementation
Figure 37: Monte Carlo analysis of CNT implementation
I. Introduction
1.1 Background
In the past decades, modular arithmetic has played an important role in various digital computing systems, such as digital signal processing (DSP), cryptography and residue arithmetic. In particular, the residue number system (RNS) can be considered the most common application field of modular arithmetic [1].
In RNS, every operand is represented as a sequence of residues, e.g., (a1, a2, …, an). Hence, a two-operand RNS operation can be defined as [2]:
(x1, x2, …, xn) = (a1, a2, …, an) ◇ (b1, b2, …, bn)   (1)
where ◇ denotes addition, subtraction or multiplication. Considering (1), the computation of X can be treated as a combination of multiple separate operations between ai and bi performed in parallel, so that the overall computation speed can be significantly improved. Due to its superior performance in applications with wide operands, the RNS has been found well suited to high-precision applications such as Fast Fourier Transforms (FFT), Finite Impulse Response (FIR) filters [3] and convolution [4].
Since efficient conversion between RNS and binary numbers can be realized based on the Chinese remainder theorem [5], the RNS base of the form {2^n-1, 2^n, 2^n+1} is currently considered the most appropriate one for VLSI (Very Large Scale Integration) implementation among the various base forms for RNS [6]. However, since an (n+1)-bit input is required in modulo 2^n+1 arithmetic, the difference in input width among these moduli can result in new problems. Numerous algorithms and architectures have been proposed for this issue.
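As an illustration of (1), the following minimal Python sketch converts two integers into the base {2^n-1, 2^n, 2^n+1}, operates on each residue channel independently, and recovers the result with the Chinese remainder theorem. The function names (`rns_moduli`, `to_rns`, `rns_op`, `from_rns`) are hypothetical helpers chosen for this example and are not taken from the thesis or from [2].

```python
import operator
from math import prod


def rns_moduli(n):
    """Return the three moduli 2^n - 1, 2^n, 2^n + 1 (pairwise coprime)."""
    return (2**n - 1, 2**n, 2**n + 1)


def to_rns(x, moduli):
    """Represent the integer x as a tuple of residues."""
    return tuple(x % m for m in moduli)


def rns_op(a, b, moduli, op):
    """Apply op channel by channel; each channel needs only its own modulus."""
    return tuple(op(ai, bi) % m for ai, bi, m in zip(a, b, moduli))


def from_rns(residues, moduli):
    """Recover the integer in [0, M) using the Chinese remainder theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the modular inverse of Mi modulo m (Python 3.8+).
        total += r * Mi * pow(Mi, -1, m)
    return total % M


if __name__ == "__main__":
    moduli = rns_moduli(4)            # (15, 16, 17), dynamic range 4080
    a = to_rns(123, moduli)
    b = to_rns(29, moduli)
    product = rns_op(a, b, moduli, operator.mul)
    print(from_rns(product, moduli))  # equals 123 * 29 while it stays below 4080
```

The result is only meaningful while the true value stays inside the dynamic range M, the product of the moduli, which is why wide operands call for larger n.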
To overcome the problem caused by the extra input bit, the diminished-one representation introduced in [7] is adopted by [8]. In the diminished-one number system, each operand is represented as X* = X-1, so that an n-bit modulo 2^n+1 squarer and multiplier can be realized. However, the zero operand is not allowed in this scheme, since a negative input is invalid. For the partial product compression process, a Dadda tree structure built from full adders and half adders is employed. The final Sum and Carry vectors are then added by a diminished-1 modulo 2^n+1 parallel adder. Compared with the previous solution for (n+1)-bit operands, the design proposed in that article offers a significant improvement in delay and power. Other solutions based on the diminished-1 operand representation for modulo 2^n+1 arithmetic are proposed in [9] and [10].
In the work of Curiger, Bonnenberg and Kaeslin [11], only one operand is represented in diminished-1 code, so that the complexity of the circuit implementation is reduced to some degree. In addition, the correction factor can be computed in parallel with the carry-save stage. In certain digital signal processing applications, such as ciphers and image processing, a considerable improvement can be obtained. However, a relatively complex correction circuit is required and the zero operand is again not allowed.
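The diminished-one idea can be sketched in a few lines of Python. This is only an arithmetic illustration under the definition X* = X-1 given above, not the circuit of [8]; the function names are hypothetical. It uses the identity A·B - 1 = A*·B* + A* + B*, which follows from expanding (A*+1)(B*+1) - 1.

```python
def to_diminished_one(x, n):
    """Map x in [1, 2^n] to its n-bit diminished-one code x - 1."""
    if not 1 <= x <= 2**n:
        raise ValueError("zero and values above 2^n have no n-bit diminished-one code")
    return x - 1


def diminished_one_multiply(a_star, b_star, n):
    """Return the diminished-one code of (A * B) mod (2^n + 1).

    Uses A*B - 1 = A_star*B_star + A_star + B_star.
    """
    m = 2**n + 1
    r = (a_star * b_star + a_star + b_star) % m
    if r == 2**n:
        # A * B is congruent to 0, which the diminished-one code cannot express.
        raise ValueError("product is zero modulo 2^n + 1")
    return r


if __name__ == "__main__":
    n = 4
    a_star = to_diminished_one(3, n)   # 2
    b_star = to_diminished_one(4, n)   # 3
    print(diminished_one_multiply(a_star, b_star, n) + 1)  # value of 3 * 4 mod 17
```

The explicit error branches mark the zero-operand restriction mentioned in the text; hardware designs handle that case by other means, which this sketch does not model.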
Although implementations based on the diminished-1 operand representation show a great advantage in delay, power and area for modular arithmetic, the conversion between the diminished-1 system and the weighted number system unnecessarily adds complexity to the system and increases the risk of error in VLSI implementation. Therefore, an efficient modulo 2^n+1 arithmetic algorithm for weighted operands is necessary.
In the work of Wrzyszcz and Milford [12], the partial product matrix of the modulo 2^n+1 multiplier is reconstructed, so that an n × n partial product matrix can be obtained without diminishing the operands. Due to the novel partial product compression process, which uses carry-save adders (CSA) and the periodic property of powers of 2, the correction process is simplified by combining it with other operations. Since the entire partial product computation module is composed exclusively of half adders, full adders and multiplexers, the design in that article is more suitable for regular VLSI implementation with acceptable power consumption. It also allows a potential pipelined computation structure, which can significantly improve the operating frequency of the modulo 2^n+1 multiplier.
In the later work of Vergos and Efstathiou [13], an improved implementation of the modulo 2^n+1 multiplier is proposed based on [12]. In that article, the partial product matrix is also divided into four groups, which are then reconstructed in a different way. Compared with the previous work in [12], the OR-AND-XOR gates, which consume more area and power, can be replaced to achieve an area- and power-efficient design. In addition, the correction factors resulting from partial product repositioning in each step are summed into a final correction factor with the value of 3. The correction process is therefore further simplified to adding one extra correction vector to the partial product matrix. For the final addition stage, modulo 2^n+1 addition is converted into modulo 2^n addition by using the other part of the correction factor, and is implemented by an inverted End-Around-Carry (EAC) modulo 2^n adder, which is more suitable for VLSI implementation.
In [1], a fast low-power modulo 2^n+1 squarer is proposed based on the algorithm in [13]. The same partial product matrix reconstruction is performed as in [13], and the equal pairs in each column, which come from the identical inputs, are shifted to further reduce the number of partial product vectors before the partial product compression process. In addition, compressors with a large number of inputs, such as the 7:2, 5:2 and 4:2 compressors, are used to compress the partial products in each column to achieve a greater saving in power and delay. For the final addition stage, a novel sparse-tree based inverted EAC modulo 2^n+1 adder is introduced. Compared with previous modulo 2^n+1 adder designs, the power and area of the novel design are substantially decreased because fewer operators are needed for the carry-out computation with different weights. The wire routing in the sparse-tree based modulo 2^n+1 adder is also simplified to be more suitable for VLSI design. The simulation of the entire implementation shows a considerable improvement in delay and area.
In addition to the algorithm and implementation, the improvement of complementary metal-oxide-semiconductor (CMOS) technology can be considered a potential research direction as well. During the past few years, gate channel length scaling from 0.35 μm to 32 nm has contributed greatly to the improvement of the metal-oxide-semiconductor field-effect transistor (MOSFET) in terms of power and system-level performance. However, further scaling of CMOS in the nanometer range may not offer the same performance advantages as before, primarily for the following reasons. Firstly, due to the significantly increased leakage current, static power can make the MOSFET uncompetitive for ultra-low-power applications in the nanometer range. Secondly, the stability of the MOSFET can degrade due to its higher sensitivity to unavoidable process variations in fabrication [14]. Thirdly, because the amount of charge at a circuit node is smaller as a result of lower supply voltage and smaller capacitance, CMOS circuits with ultra-short channel length become more vulnerable to external voltage variation. Finally, in the nanometer range, control over short-channel effects is weakened. These device non-idealities can make the current-voltage (I-V) characteristics of an ultra-short-channel MOSFET substantially different from the ideal ones [14].
In recent years, various new devices and materials have been investigated. In [15], an ultra-thin body device, the FinFET, is introduced. In a FinFET, a double gate is built on the SOI substrate and the conducting channel is wrapped by a thin silicon fin, so that the gates on either side can be tied together or electrically isolated [16]. Due to this structure, either both gates are used to turn on the transistor, or only one gate is used to turn on the transistor while the other is used to adjust the threshold voltage. Therefore, the dynamic and static performance can be tuned, with lower leakage and better control of short-channel effects. However, both the thin silicon fin and the matched gates on multiple sides of the fin are difficult to fabricate.
Carbon nanotube (CNT) technology, now considered another promising technology, can largely avoid most of the fundamental limitations of the traditional MOSFET [17]. Compared with traditional CMOS technology, CNT technology shows better characteristics in timing, frequency response and power consumption. In addition, the possibility of channel burning in a CNTFET is significantly decreased, because the heat generated in a small fraction of the CNTFET can be dissipated all along the channel. CNT technology can therefore be expected to have an excellent prospect in the VLSI area.
1.2 Work Statement
In general, current implementations of the modulo 2^n+1 squarer provide excellent performance in delay, power, area and stability. However, there is still some room for further improvement. In this thesis, the modulo 2^n+1 squarer is improved primarily in the aspects of algorithm, circuit configuration and implementation technology.
For the algorithm, the (n+1)-bit operand is used without any special representation, to avoid extra code conversion between weighted operands and diminished-1 operands. In addition, the reconstruction of the partial product matrix is optimized. Compared with the previous method, the bit-wise operation before the partial product compression process is saved, which further reduces the number of gates on the critical delay path. For the partial product compression process, a Wallace tree structure is introduced to increase the compression speed with lower power.
With regard to the circuit implementation of the modulo 2^n+1 squarer, the improvement comes primarily from the following aspects. The full adders and half adders used in the traditional Wallace tree structure are replaced by 3:2 compressors, which achieve a much better PDP. For the final addition stage, the optimal implementation of the modulo 2^n+1 adder is decided based on a performance comparison among various qualified candidates, and the sparse-tree based modulo 2^n+1 adder is selected. Furthermore, instead of a simple combination of NAND and OR gates, novel And-Or-Inverter (AOI) and Or-And-Inverter (OAI) gates are employed to compute the carry-out with different weights.
Finally, the modulo 2^n+1 squarer is implemented on CNT (carbon nanotube) technology. The critical path delay, power and PDP are compared with the CMOS implementation using HSPICE simulation results. For the PVT characteristics, a Monte Carlo simulation is performed for both CNT and CMOS technology with sample numbers of 100 and 1000 respectively.
1.3 Thesis Outline
The rest of this thesis is organized as follows: the improved algorithm for the modulo 2^n+1 squarer is presented in Section II. The optimal configuration of the modulo 2^n+1 squarer, including the design of the modified Wallace tree structure and the sparse-tree based modulo 2^n+1 adder, is decided and implemented on CMOS technology in Section III. In Section IV, the optimal configuration of the modulo 2^n+1 squarer is implemented on CNT technology and the performance comparison in different aspects is presented.
II. Algorithm
In this section, an improved algorithm for the modulo 2^n+1 squarer with two (n+1)-bit unsigned inputs is proposed. In addition to the introduction of the algorithm, a computation example and a performance comparison with the existing algorithm in [12] are presented as well.
2.1 Partial Product Generation and Repositioning
Let X be an (n+1)-bit unsigned input denoted as X = x_n x_{n-1} … x_0. Then the square of X modulo 2^n+1 can be represented as follows:
\[ Q = |X^2|_{2^n+1} = \left| \sum_{i=0}^{n} \sum_{j=0}^{n} x_i x_j 2^{i+j} \right|_{2^n+1} \tag{2} \]
The partial products derived for the term Q of (2) are shown in Table-1.
Table- 1 Initial Partial Product Matrix
| 2^{2n} | 2^{2n-1} | 2^{2n-2} | … | 2^{n+2} | 2^{n+1} | 2^n | 2^{n-1} | 2^{n-2} | … | 2^2 | 2^1 | 2^0 |
| | | | | | | p_{n,0} | p_{n-1,0} | p_{n-2,0} | … | p_{2,0} | p_{1,0} | p_{0,0} |
| | | | | | p_{n,1} | p_{n-1,1} | p_{n-2,1} | p_{n-3,1} | … | p_{1,1} | p_{0,1} | |
| | | | | p_{n,2} | p_{n-1,2} | p_{n-2,2} | p_{n-3,2} | p_{n-4,2} | … | p_{0,2} | | |
| | | … | … | … | … | … | … | … | … | | | |
| | | p_{n,n-2} | … | p_{4,n-2} | p_{3,n-2} | p_{2,n-2} | p_{1,n-2} | p_{0,n-2} | | | | |
| | p_{n,n-1} | p_{n-1,n-1} | … | p_{3,n-1} | p_{2,n-1} | p_{1,n-1} | p_{0,n-1} | | | | | |
| p_{n,n} | p_{n-1,n} | p_{n-2,n} | … | p_{2,n} | p_{1,n} | p_{0,n} | | | | | | |
where p_{i,j} = x_i·x_j is a partial product.
Since the two inputs of the modulo 2^n+1 squarer are identical, the partial products p_{i,j} and p_{j,i} are always equal, and each such pair can be replaced by shifting either p_{i,j} or p_{j,i} to the next column on the left and removing the other one. Therefore, the partial product matrix in Table-1 can be modified as shown in Table-2.
Taking (3) into account, each partial product term with weight greater than 2^{n-1} can be divided into two parts, a repositioned partial product and a correction factor. Since |2^{2n}|_{2^n+1} = 1, the repositioning result of the term p_{n,n} can be simply denoted as p_{n,n} in the column with weight 2^0, without any correction factor. Therefore, the n×n partial product matrix is rewritten in Table-3.
\[ |s 2^i|_{2^n+1} = \left| -s 2^{|i|_n} \right|_{2^n+1} = \left| \bar{s} 2^{|i|_n} + 2^n 2^{|i|_n} \right|_{2^n+1} \tag{3} \]
where s is the value of the repositioned bit.
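The identity (3) can be checked numerically. The sketch below is an illustrative assumption, not code from the thesis: `reposition` is a hypothetical helper that returns the inverted bit, its new column, and the correction term for a bit of weight 2^i with n ≤ i < 2n, which is the range where 2^i is congruent to -2^(i mod n).

```python
def reposition(s, i, n):
    """Reposition bit s of weight 2^i (n <= i < 2n) according to (3).

    Returns (inverted_bit, column, correction) such that
    s * 2^i == inverted_bit * 2^column + correction  (mod 2^n + 1).
    """
    if not n <= i < 2 * n:
        raise ValueError("this sketch only covers n <= i < 2n")
    k = i % n                      # |i|_n, the target column
    return 1 - s, k, (2**n) * (2**k)


def check_eq3(n):
    """Exhaustively verify (3) for both bit values and every weight in range."""
    m = 2**n + 1
    for i in range(n, 2 * n):
        for s in (0, 1):
            bit, k, cf = reposition(s, i, n)
            if (s * 2**i) % m != (bit * 2**k + cf) % m:
                return False
    return True


if __name__ == "__main__":
    print(all(check_eq3(n) for n in range(3, 11)))
```

The correction term does not depend on s, which is why the correction factors can be collected into a constant and added once.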
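Table- 2 Modified Partial Product Matrix
| 2^{2n} | 2^{2n-1} | 2^{2n-2} | … | 2^{n+2} | 2^{n+1} | 2^n | 2^{n-1} | 2^{n-2} | … | 2^2 | 2^1 | 2^0 |
| | | | | | | | p_{n-1,0} | p_{n-2,0} | … | p_{2,0} | p_{1,0} | p_{0,0} |
| | | | | | | p_{n-1,1} | p_{n-2,1} | p_{n-3,1} | … | p_{1,1} | p_{0,1} | |
| | | | | | p_{n-1,2} | p_{n-2,2} | p_{n-3,2} | p_{n-4,2} | … | p_{0,2} | | |
| | | | … | … | … | … | … | … | … | | | |
| | | | … | p_{4,n-2} | p_{3,n-2} | p_{2,n-2} | p_{1,n-2} | p_{0,n-2} | | | | |
| p_{n-1,n} | | p_{n-1,n-1} | … | p_{3,n-1} | p_{2,n-1} | p_{1,n-1} | p_{0,n-1} | | | | | |
| p_{n,n} | p_{n-2,n} | p_{n-3,n} | … | p_{1,n} | p_{0,n} | | | | | | | |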
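The effect of merging equal pairs can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the thesis algorithm: it merges every pair p_{i,j}, p_{j,i} with i ≠ j at once (the thesis does this in two steps, in Table-2 and later in Table-4), and the helper names are hypothetical.

```python
def bits(x, width):
    """Little-endian bit list of x, so bits(x, w)[i] is x_i."""
    return [(x >> i) & 1 for i in range(width)]


def square_mod_full(x, n):
    """Sum all (n+1) x (n+1) partial products of (2), then reduce."""
    xb = bits(x, n + 1)
    m = 2**n + 1
    total = sum(xb[i] * xb[j] * 2**(i + j)
                for i in range(n + 1) for j in range(n + 1))
    return total % m


def square_mod_merged(x, n):
    """Same value, but each equal pair is replaced by one term shifted left."""
    xb = bits(x, n + 1)
    m = 2**n + 1
    # Diagonal terms p_{i,i} = x_i * x_i = x_i keep their weight 2^(2i).
    total = sum(xb[i] * 2**(2 * i) for i in range(n + 1))
    # Off-diagonal pairs: p_{i,j} + p_{j,i} = p_{i,j} * 2, i.e. one column left.
    total += sum(xb[i] * xb[j] * 2**(i + j + 1)
                 for i in range(n + 1) for j in range(i + 1, n + 1))
    return total % m


if __name__ == "__main__":
    n = 4
    print(all(square_mod_full(x, n) == square_mod_merged(x, n) == (x * x) % 17
              for x in range(2**n + 1)))
```

Merging roughly halves the number of AND terms that have to be generated, which is the motivation for the shifting step in the text.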
Table- 3 Repositioned Partial Product Matrix
| 2^{n-1} | 2^{n-2} | 2^{n-3} | … | 2^2 | 2^1 | 2^0 |
| p_{n-1,0} | p_{n-2,0} | p_{n-3,0} | … | p_{2,0} | p_{1,0} | p_{0,0} |
| p_{n-2,1} | p_{n-3,1} | p_{n-4,1} | … | p_{1,1} | p_{0,1} | \bar{p}_{n-1,1} |
| p_{n-3,2} | p_{n-4,2} | p_{n-5,2} | … | p_{0,2} | \bar{p}_{n-1,2} | \bar{p}_{n-2,2} |
| … | … | … | … | … | … | … |
| p_{0,n-1} | \bar{p}_{n-1,n-1} | \bar{p}_{n-2,n-1} | … | \bar{p}_{3,n-1} | \bar{p}_{2,n-1} | \bar{p}_{1,n-1} |
| \bar{p}_{n-2,n} | \bar{p}_{n-3,n} | \bar{p}_{n-4,n} | … | \bar{p}_{1,n} | \bar{p}_{0,n} | p_{n-1,n} |
| 0 | 0 | 0 | 0 | 0 | 0 | p_{n,n} |
In the partial product matrix of Table-3, there are still some equal terms appearing twice as p_{i,j} or \bar{p}_{i,j} in the same column, so the number of vectors in the partial product matrix can be further reduced by the method mentioned above. The matrix after shifting is shown in Table-4.
In addition to the partial product matrix, the correction factor, which results from the repositioning of partial products as shown in (3), needs to be considered as well. The overall correction factor consists of three parts: the correction factor from matrix repositioning, denoted as CF1; the correction factor from shifting identical pairs, denoted as CF2; and the correction factor from partial product compression, denoted as CF3.
Table- 4 Shifted Partial Product Matrix
| 2^{n-1} | 2^{n-2} | 2^{n-3} | … | 2^2 | 2^1 | 2^0 |
| p_{n-2,0} | p_{n-3,0} | p_{n-4,0} | … | p_{1,0} | \bar{p}_{n-1,1} | p_{0,0} |
| p_{n-3,1} | p_{n-4,1} | p_{n-5,1} | … | p_{1,1} | \bar{p}_{n-2,2} | \bar{p}_{n-1,0} |
| p_{n-4,2} | p_{n-5,2} | p_{n-6,2} | … | \bar{p}_{n-1,2} | \bar{p}_{n-3,3} | \bar{p}_{n-2,1} |
| … | … | … | … | … | … | … |
| p_{(n-3)/2,(n-1)/2} | p_{(n-5)/2,(n-1)/2} | p_{(n-5)/2,(n-3)/2} | … | \bar{p}_{(n-1)/2,(n+3)/2} | \bar{p}_{(n-1)/2,(n+1)/2} | \bar{p}_{(n-3)/2,(n+1)/2} |
| \bar{p}_{n-2,n} | \bar{p}_{n-3,n} | \bar{p}_{n-4,n} | … | \bar{p}_{1,n} | \bar{p}_{0,n} | p_{n-1,n} |
| 0 | 0 | 0 | 0 | 0 | 0 | p_{n,n} |
where n is assumed to be odd.
For the computation of CF1, the number of bits that need to be repositioned in each vector, m, increases from 0 to n-1, and the sum of the correction factors in each vector, denoted as CFV_j, can be computed as follows:
\[ CFV_j = \left( \sum_{i=0}^{m-1} 2^i \right) 2^n = 2^n (2^m - 1) \tag{4} \]
where j represents the j-th vector.
Hence, the correction factor for matrix repositioning, CF1, is
\[ CF1 = \sum_{j=2}^{n} CFV_j = 2^n \left[ (2^1 - 1) + (2^2 - 1) + \cdots + (2^n - 1) - 1 \right] = 2^n \left[ 2(1 + 2 + \cdots + 2^{n-1}) - (n + 1) \right] = 2^n \left[ 2^{n+1} - n - 3 \right] \tag{5} \]
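The closed forms in (4) and (5) can be confirmed with a short numerical check. The sketch below simply evaluates both sides of each equation as written; the function names are hypothetical and the loop bounds follow the bracketed sum in (5).

```python
def cfv(m, n):
    """Left-hand side of (4): m repositioned bits, each contributing 2^n * 2^i."""
    return sum(2**i for i in range(m)) * 2**n


def cfv_closed(m, n):
    """Right-hand side of (4)."""
    return 2**n * (2**m - 1)


def cf1_bracket_sum(n):
    """Second expression of (5): 2^n * [(2^1-1) + ... + (2^n-1) - 1]."""
    return 2**n * (sum(2**k - 1 for k in range(1, n + 1)) - 1)


def cf1_closed(n):
    """Final expression of (5): 2^n * (2^(n+1) - n - 3)."""
    return 2**n * (2**(n + 1) - n - 3)


if __name__ == "__main__":
    ok4 = all(cfv(m, n) == cfv_closed(m, n)
              for n in range(3, 13) for m in range(n + 1))
    ok5 = all(cf1_bracket_sum(n) == cf1_closed(n) for n in range(3, 13))
    print(ok4, ok5)
```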
The correction factor for identical pair shifting, denoted as CF2, results from identical partial product pairs that are shifted from the column with weight 2^n to the column with weight 2^0 in the partial product matrix, and the correction factor of each shift is 2^n. Therefore, the value of CF2 is determined by the number of identical partial product pairs in one column and is given as:
\[ CF2 = \begin{cases} \left( \frac{n-2}{2} \right) 2^n & \text{when } n \text{ is even} \\ \left( \frac{n-1}{2} \right) 2^n & \text{when } n \text{ is odd} \end{cases} \tag{6} \]
The third part of the correction factor, CF3, is generated during the partial product compression process. In this process, the most significant bit of each carry-out has to be shifted from the leftmost column to the rightmost column, which results in a correction factor with the value of 2^n, and the number of shifted terms depends on the value of n. Considering even and odd values of n respectively, the vector number of partial products including the correction factor is:
\[ VNPP = \begin{cases} n - \frac{n-4}{2} + 1 & \text{when } n \text{ is even} \\ n - \frac{n-3}{2} + 1 & \text{when } n \text{ is odd} \end{cases} \tag{7} \]
where VNPP is the vector number of partial products.
These partial product vectors are used to produce the final Sum and Carry vectors, so that the correction factor CF3 is given as:
\[ CF3 = \begin{cases} \left( \frac{n+2}{2} \right) 2^n & \text{when } n \text{ is even} \\ \left( \frac{n+1}{2} \right) 2^n & \text{when } n \text{ is odd} \end{cases} \tag{8} \]
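Therefore, the overall correction factor for the modulo 2^n+1 squarer can be computed by summing the three correction factors as follows:
\[ CF_{all} = CF1 + CF2 + CF3 = \begin{cases} \left| \left( \frac{n-2}{2} + \frac{n+2}{2} + 2^{n+1} - n - 3 \right) 2^n \right|_{2^n+1} & \text{when } n \text{ is even} \\ \left| \left( \frac{n-1}{2} + \frac{n+1}{2} + 2^{n+1} - n - 3 \right) 2^n \right|_{2^n+1} & \text{when } n \text{ is odd} \end{cases} = \left| (2^{n+1} - 3) 2^n \right|_{2^n+1} = 5 \tag{9} \]
where n ≥ 3.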
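The result of (9) can be reproduced by evaluating (5), (6) and (8) directly. The following Python sketch is only a numerical check under those formulas as stated; the function names are hypothetical.

```python
def cf1(n):
    """Equation (5)."""
    return 2**n * (2**(n + 1) - n - 3)


def cf2(n):
    """Equation (6): (n-2)/2 shifts for even n, (n-1)/2 for odd n."""
    count = (n - 2) // 2 if n % 2 == 0 else (n - 1) // 2
    return count * 2**n


def cf3(n):
    """Equation (8): (n+2)/2 carry feedbacks for even n, (n+1)/2 for odd n."""
    count = (n + 2) // 2 if n % 2 == 0 else (n + 1) // 2
    return count * 2**n


def cf_all(n):
    """Equation (9): overall correction factor reduced modulo 2^n + 1."""
    return (cf1(n) + cf2(n) + cf3(n)) % (2**n + 1)


if __name__ == "__main__":
    for n in range(3, 17):
        print(n, cf_all(n))
```

For every n ≥ 3 the sum reduces to the same constant because 2^n is congruent to -1 and 2^(n+1) to -2 modulo 2^n+1, so (2^(n+1) - 3)·2^n is congruent to (-5)·(-1). For n = 2 the modulus is 5 itself and the residue collapses to 0, which is consistent with the restriction n ≥ 3.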
The final partial product matrix including the overall correction factor is shown in Table-5. To convert the modulo 2^n+1 adder to a modulo 2^n adder, as explained in detail in Section 2.3, only part of the correction factor is added to the final partial product matrix in this step.
Table- 5 Final Partial Product Matrix
| 2^{n-1} | 2^{n-2} | 2^{n-3} | … | 2^2 | 2^1 | 2^0 |
| p_{n-2,0} | p_{n-3,0} | p_{n-4,0} | … | p_{1,0} | \bar{p}_{n-1,1} | p_{0,0} |
| p_{n-3,1} | p_{n-4,1} | p_{n-5,1} | … | p_{1,1} | \bar{p}_{n-2,2} | \bar{p}_{n-1,0} |
| p_{n-4,2} | p_{n-5,2} | p_{n-6,2} | … | \bar{p}_{n-1,2} | \bar{p}_{n-3,3} | \bar{p}_{n-2,1} |
| … | … | … | … | … | … | … |
| p_{(n-3)/2,(n-1)/2} | p_{(n-5)/2,(n-1)/2} | p_{(n-5)/2,(n-3)/2} | … | \bar{p}_{(n-1)/2,(n+3)/2} | \bar{p}_{(n-1)/2,(n+1)/2} | \bar{p}_{(n-3)/2,(n+1)/2} |
| \bar{p}_{n-2,n} | \bar{p}_{n-3,n} | \bar{p}_{n-4,n} | … | \bar{p}_{1,n} | \bar{p}_{0,n} | p_{n-1,n} |
| 0 | 0 | 0 | 0 | 1 | 0 | p_{n,n} |
2.2 Partial Product Matrix Compression
To implement the final addition stage with a modulo $2^n+1$ adder, each column in the final partial product matrix has to be compressed to obtain the final Sum and Carry vectors. Hence, the Wallace tree structure, which is well known to optimally implement multi-operand binary addition, is introduced [18] and the required number of Wallace tree stages is determined as follows [19]:
$$N(VNPP) \approx \log_{1.5} K \qquad (10)$$
The required number of Wallace tree stages for different values of VNPP is given in Table-6. It can be observed that the stage count grows only logarithmically with the number of vectors, so the Wallace tree structure becomes more efficient, saving more power and gate delay, as the bit width of the input increases.
Table- 6 Number of Wallace Trees Stage
VNPP 0~2 3 4 5~6 7~9
N(VNPP) 0 1 2 3 4
VNPP 10~13 14~19 20~28 29~42 43~63
N(VNPP) 5 6 7 8 9
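The entries of Table-6 can be reproduced by counting reduction stages directly. The sketch below is an illustration rather than the thesis's procedure: it assumes that each stage groups the vectors in threes, replaces every full group by two vectors (one sum, one carry), and passes any leftover vectors through unchanged. The function name `wallace_stages` is hypothetical.

```python
def wallace_stages(vectors: int) -> int:
    """Count 3:2 compression stages needed to reduce `vectors` rows to two."""
    stages = 0
    while vectors > 2:
        # Each full group of three rows becomes two rows; leftovers pass through.
        vectors = 2 * (vectors // 3) + vectors % 3
        stages += 1
    return stages


# Range boundaries taken from Table-6: (low, high, expected stage count).
table_6 = [(0, 2, 0), (3, 3, 1), (4, 4, 2), (5, 6, 3), (7, 9, 4),
           (10, 13, 5), (14, 19, 6), (20, 28, 7), (29, 42, 8), (43, 63, 9)]
for low, high, expected in table_6:
    for v in range(low, high + 1):
        assert wallace_stages(v) == expected
```

Under this assumption, 10 vectors need 5 stages (10, 7, 5, 4, 3, 2), which agrees with the 16-bit example discussed in this section.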
To illustrate the analysis clearly, the schematic of the partial product compression process for a 16-bit modulo $2^n+1$ squarer is shown in Figure 1. According to (7), the partial product vector number of the 16-bit modulo $2^n+1$ squarer is 10, and 5 Wallace tree stages are required for this example.
Figure 1: Wallace tree Compression Process
In Figure 1, one marker symbol denotes a partial product and a second marker symbol denotes an inverted partial product.
In general, the superior performance of the Wallace tree structure in regular multipliers and squarers comes at the high expense of irregular interconnection and wasted area [20]. In the modulo $2^n+1$ squarer, terms with weight larger than $2^n$ have to be repositioned to the left, and the most significant bits of the Carry vectors, generated during the compression process, have to be fed back to the least significant bit as well. As a result, the shape of the partial product matrix is always rectangular and the number of partial products in each column is always equal, as shown in Figure 1, which means that the wasted area and irregular interconnection are greatly reduced in the modulo $2^n+1$ squarer.
2.3 Final Addition Stage
An n-bit Carry vector and an n-bit Sum vector are generated by the partial product compression process. To obtain the final result of the modulo $2^n+1$ squarer, these two vectors need to be modulo $2^n+1$ added in this final stage.
Assuming $A = a_n a_{n-1} \ldots a_0$ and $B = b_n b_{n-1} \ldots b_0$ are two n-bit operands of the modulo $2^n+1$ adder, then the sum of A and B can be represented as:
$$|A + B|_{2^n+1} = \begin{cases} A + B - (2^n + 1) & \text{if } A + B \ge 2^n + 1 \\ A + B & \text{if } A + B < 2^n + 1 \end{cases} \qquad (11)$$
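Therefore, (11) could be rewritten as:
$$|A + B + 1|_{2^n+1} = \begin{cases} A + B + 1 - (2^n + 1) & \text{if } A + B + 1 \ge 2^n + 1 \\ A + B + 1 & \text{if } A + B + 1 < 2^n + 1 \end{cases}$$
$$= \begin{cases} A + B - 2^n & \text{if } A + B \ge 2^n \\ A + B + 1 & \text{if } A + B < 2^n \end{cases}$$
$$= \begin{cases} |A + B|_{2^n} & \text{if } A + B \ge 2^n \\ A + B + 1 & \text{if } A + B < 2^n \end{cases} \qquad (12)$$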
Since the carry-out $C_{out}$ of $A + B$ is equal to 0 if $A + B < 2^n$ and equal to 1 otherwise, (12) can be expressed as described in [9]:
$$|A + B + 1|_{2^n+1} = |A + B + \overline{C_{out}}|_{2^n} \qquad (13)$$
Therefore, the final stage addition of the modulo $2^n+1$ squarer can be implemented by an Inverted EAC modulo $2^n$ adder, an adder more suitable for VLSI implementation, by adding a constant "1" to the input vectors. This constant "1" comes from the correction factor, so that no extra operation or hardware is needed.
To compute the value of the carry-out of each column, denoted as $C_i$, let us define the generate bit as $g_i = a_i \cdot b_i$ and the propagate bit as $p_i = a_i + b_i$, so that the generate and propagate group $(G_i, P_i)$ in a regular parallel prefix adder, where $G_i = C_i$, can be computed as follows:
$$(G_i, P_i) = (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \cdots \circ (g_1, p_1) \circ (g_0, p_0)$$
$$= (g_i + p_i \cdot g_{i-1},\ p_i \cdot p_{i-1}) \circ \cdots \circ (g_1, p_1) \circ (g_0, p_0)$$
$$= (G_{i:n+1} + P_{i:n+1} \cdot G_{n:0},\ P_{i:n+1} \cdot P_{n:0}) \qquad (14)$$
where $-1 \le n \le i-1$.
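The prefix operator in (14) can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the thesis's circuit: it models an ordinary binary adder without the end-around carry, evaluates each $G_i$ serially rather than with a parallel tree, and uses hypothetical names (`prefix_op`, `carries`, `prefix_add`).

```python
from functools import reduce


def prefix_op(left, right):
    """(g, p) o (g', p') = (g + p*g', p*p'); `left` is the more significant pair."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def carries(a: int, b: int, width: int):
    """Return [C_0, ..., C_{width-1}] with C_i = G_i and no carry-in."""
    # g_i = a_i AND b_i, p_i = a_i OR b_i, as defined in the text.
    gp = [((a >> i) & (b >> i) & 1, ((a >> i) | (b >> i)) & 1) for i in range(width)]
    result = []
    for i in range(width):
        # Fold (g_i, p_i) o (g_{i-1}, p_{i-1}) o ... o (g_0, p_0).
        group_generate, _ = reduce(prefix_op, reversed(gp[: i + 1]))
        result.append(group_generate)
    return result


def prefix_add(a: int, b: int, width: int) -> int:
    """Add two `width`-bit numbers using the group generate signals as carries."""
    c = carries(a, b, width)
    total = 0
    for i in range(width):
        carry_in = c[i - 1] if i > 0 else 0
        total |= (((a >> i) ^ (b >> i) ^ carry_in) & 1) << i
    return total | (c[width - 1] << width)


# Exhaustive check for 4-bit operands.
assert all(prefix_add(a, b, 4) == a + b for a in range(16) for b in range(16))
```

Because the operator is associative, a hardware implementation is free to regroup the same fold into a logarithmic-depth tree, which is what parallel prefix adders exploit.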
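According to (13) and $G^*_{-1} = \overline{C_{out}} = \overline{G_{n-1}}$, the generate bit for the modulo $2^n+1$ adder, $G_i^*$, is given as:
$$G_i^* = \begin{cases} G_i + P_i \cdot G^*_{-1} = G_i + P_i \cdot \overline{G_{n-1}} & 0 \le i \le n-2 \\ \overline{G_{n-1}} & i = -1 \end{cases} \qquad (15)$$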
Therefore, the generate and propagate group for the modulo $2^n+1$ adder, denoted as $(G_i^*, P_i^*)$, is given as:
$$(G_i^*, P_i^*) = (G_i, P_i) \circ \overline{(G_{n-1}, P_{n-1})}$$
$$= (G_i, P_i) \circ (\overline{G_{n-1}}, P_{n-1})$$
$$= \left(G_i + P_i \cdot \overline{(G_{n-1:i+1} + P_{n-1:i+1} \cdot G_i)},\ P_i \cdot P_{n-1}\right)$$
$$= \left(G_i + P_i \cdot \overline{G_{n-1:i+1}} \left(\overline{P_{n-1:i+1}} + \overline{G_i}\right),\ P_i \cdot P_{n-1:i+1}\right)$$
$$= (G_i, P_i) \circ (\overline{G_{n-1:i+1}}, P_{n-1:i+1})$$
$$= (G_i, P_i) \circ \overline{(G_{n-1:i+1}, P_{n-1:i+1})}$$
$$= \overline{(\overline{P_i}, \overline{G_i}) \circ (G_{n-1:i+1}, P_{n-1:i+1})} \qquad (16)$$
where $0 \le i \le n-2$.
Taking (15) and (16) into account, the mathematical expressions of each carry-out for the modulo $2^n+1$ adder can be computed using the following set of equations:
$$C^*_{-1} = \overline{C_n} = \overline{(g_0, p_0) \circ (g_1, p_1) \circ \cdots \circ (g_n, p_n)}$$
$$C^*_0 = (g_0, p_0) \circ \overline{(g_1, p_1) \circ \cdots \circ (g_n, p_n)}$$
$$C^*_1 = (g_0, p_0) \circ (g_1, p_1) \circ \overline{(g_2, p_2) \circ \cdots \circ (g_n, p_n)}$$
$$\vdots$$
$$C^*_{n-2} = (g_0, p_0) \circ \cdots \circ (g_{n-2}, p_{n-2}) \circ \overline{(g_{n-1}, p_{n-1}) \circ (g_n, p_n)}$$
$$C^*_{n-1} = (g_0, p_0) \circ \cdots \circ (g_{n-1}, p_{n-1}) \circ \overline{(g_n, p_n)} \qquad (17)$$
Hence, the final result of the modulo $2^n+1$ adder is given as:
$$S_i = a_i \oplus b_i \oplus C_{i-1} \qquad (18)$$
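The inverted end-around-carry relation (13) that underlies this section can be checked with an arithmetic model. The sketch below is illustrative only: `inverted_eac_add` is a hypothetical name, the operands are assumed to lie in $[0, 2^n)$, and only the low n bits of the result are compared. One case deserves a note that is an observation of this sketch rather than a statement from the thesis: when $A + B + 1 = 2^n$, the modulo $2^n+1$ result needs an (n+1)-th bit, so its low n bits are all zero.

```python
def inverted_eac_add(a: int, b: int, n: int) -> int:
    """Low n bits of A + B + NOT(Cout), i.e. the right-hand side of (13)."""
    mask = (1 << n) - 1
    total = a + b
    carry_out = total >> n  # 0 or 1 for n-bit operands
    return (total + (1 - carry_out)) & mask


def reference(a: int, b: int, n: int) -> int:
    """Low n bits of |A + B + 1| modulo (2^n + 1), the left-hand side of (13)."""
    return ((a + b + 1) % ((1 << n) + 1)) & ((1 << n) - 1)


n = 5
for a in range(1 << n):
    for b in range(1 << n):
        assert inverted_eac_add(a, b, n) == reference(a, b, n)
```

The model makes the design point visible: feeding the inverted carry-out back in replaces the subtraction of $2^n+1$ with an ordinary modulo $2^n$ addition.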
2.4 Computation Example
In this section, a computation example is provided to further explain the algorithm
mentioned above.
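Let x=87 be the input of the modulo $2^n+1$ squarer and the binary format of it is x=01010111. Thus, the initial partial product matrix is shown in Table-7.

Table- 7 Initial Partial Product Matrix of example

0 1 0 1 0 1 1 1 × 0 1 0 1 0 1 1 1

| Multiplier bit | Partial product row | Weight of the row's least significant bit |
|---|---|---|
| $x_0 = 1$ | 0 1 0 1 0 1 1 1 | $2^0$ |
| $x_1 = 1$ | 0 1 0 1 0 1 1 1 | $2^1$ |
| $x_2 = 1$ | 0 1 0 1 0 1 1 1 | $2^2$ |
| $x_3 = 0$ | 0 0 0 0 0 0 0 0 | $2^3$ |
| $x_4 = 1$ | 0 1 0 1 0 1 1 1 | $2^4$ |
| $x_5 = 0$ | 0 0 0 0 0 0 0 0 | $2^5$ |
| $x_6 = 1$ | 0 1 0 1 0 1 1 1 | $2^6$ |
| $x_7 = 0$ | 0 0 0 0 0 0 0 0 | $2^7$ |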
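Then, the partial product matrix is modified and the terms with a weight greater than $2^6$ will be repositioned as shown in Table-8.

Table- 8 Repositioned Partial Product Matrix of example

| $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 1 |
| 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |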
Before the partial product compression, the identical pairs in the matrix are shifted to the right column and the correction factor is also included in this matrix as shown in Table-9.
The modified partial product matrix will be compressed using the Wallace tree structure and then the final Carry and Sum vectors will be obtained. To input them into the modulo $2^n+1$ adder, the most significant bit of the Carry vector needs to be shifted to the column with weight $2^0$ and inverted. The compression process is shown in Table-10.

Table- 9 Shifted Partial Product Matrix of example

| $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 0 | 0 | 0 | 1 | 0 | 1 |
Table- 10 Compression Process of example
0 1 0 1 1 0 1 1 0 1 1 0 1 0 1 0 1 1 0 0 0 0 1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 0 1 1 0 0 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 0 0 0 1 0 1
0 1 1 1 0 1 0 1 1 0 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1 1 1 1 0 0
Finally, the final Sum and Carry vectors are modulo $2^7+1$ added to obtain the final computation result as shown in Table-11.

Table- 11 Final Result of example

| | Binary | Decimal |
|---|---|---|
| | 1 1 0 0 1 1 1 | 103 |
| + | 1 1 1 0 0 0 1 | 113 |
| = | 0 1 0 1 1 1 1 | 87 |
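The end result of this example, $87^2 \bmod (2^7+1) = 87$, can be reproduced from the partial products themselves. The sketch below is a simplified arithmetic model and not the thesis's bit-level procedure: instead of inverting repositioned bits and adding a correction factor, it assumes signed weights and uses $2^n \equiv -1 \pmod{2^n+1}$ to fold every term with weight $2^n$ or above back into the n low columns. The function name is hypothetical.

```python
def modular_square_by_partial_products(x: int, n: int) -> int:
    """Square x modulo 2^n + 1 by folding partial products p_ij = x_i AND x_j."""
    modulus = (1 << n) + 1
    total = 0
    for i in range(n + 1):          # x is treated as an (n+1)-bit input
        for j in range(n + 1):
            p = (x >> i) & (x >> j) & 1
            # 2^(i+j) = 2^(q*n + r) is congruent to (-1)^q * 2^r.
            q, r = divmod(i + j, n)
            sign = -1 if q % 2 else 1
            total += sign * p * (1 << r)
    return total % modulus


# The worked example of this section: x = 87, n = 7, modulus 129.
assert modular_square_by_partial_products(87, 7) == 87

# Exhaustive agreement with direct squaring for every input in [0, 2^7].
assert all(modular_square_by_partial_products(x, 7) == (x * x) % 129
           for x in range(129))
```

The inverted bits and the correction factor in the thesis serve the same purpose as the negative signs here: they let hardware represent each subtracted term with unsigned bits only.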
2.5 Algorithm Performance Comparison
In [1], an efficient algorithm for the modulo $2^n+1$ squarer, based on the one presented by Vergos and Efstathiou in [13], is described. To demonstrate the superiority of the design in this thesis, a performance comparison between these two algorithms is presented, and the simulation result of an 8-bit example for each algorithm is also introduced to make the comparison more straightforward.
2.5.1 Existing Algorithm Review
Different from the proposed algorithm, an OR operation is executed at first in [1]. Then, the terms in each column with weight greater than $2^{n-1}$ are repositioned to the corresponding position on the right. The partial product matrix after modification and repositioning is shown in Table-12 and the correction factor is also included. Since the number of repositioned bits and identical pairs is different in the existing algorithm, the new overall correction factor is 3.

Table- 12 Modified Partial Product Matrix of Existing Algorithm

| $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
|---|---|---|---|---|---|---|
| $p_{n-1,0} \vee q_{n-1}$ | $p_{n-2,0}$ | $p_{n-3,0}$ | … | $p_{2,0}$ | $p_{1,0}$ | $p_{0,0} \vee q_{n-1} \vee p_{n,n}$ |
| $p_{n-2,1}$ | $p_{n-3,1}$ | $p_{n-4,1}$ | … | $p_{1,1}$ | $p_{0,1}$ | $\overline{p_{n-1,1} \vee q_0}$ |
| $p_{n-3,2}$ | $p_{n-4,2}$ | $p_{n-5,2}$ | … | $p_{0,2}$ | $\overline{p_{n-1,2} \vee q_1}$ | $\overline{p_{n-2,2}}$ |
| … | … | … | … | … | … | … |
| $p_{1,n-2}$ | $p_{0,n-2}$ | $\overline{p_{n-1,n-2} \vee q_{n-}}$ | … | $\overline{p_{4,n-2}}$ | $\overline{p_{3,n-2}}$ | $\overline{p_{2,n-2}}$ |
| $p_{0,n-1}$ | $\overline{p_{n-1,n-1} \vee q_{n-}}$ | $\overline{p_{n-2,n-1}}$ | 0 | $\overline{p_{3,n-1}}$ | $\overline{p_{2,n-1}}$ | $\overline{p_{1,n-1}}$ |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 |

where $q_i = p_{n,i}$ or $p_{i,n}$ and the symbol ∨ stands for the OR operation.
As mentioned in the proposed algorithm, the values of $q_{i,j}$ and $q_{j,i}$ are always equal, so the identical pairs in each column can be shifted to the left to further reduce the number of partial product vectors, and the identical pairs in the leftmost column should be inverted and fed back to the leftmost column.
In [1], it is said that the optimal implementation for partial product matrix compression can be achieved by using compressors with a large number of input bits, which can save more power and delay compared with compressors with a small number of input bits. Taking the modulo $2^{15}+1$ squarer as an example, the optimal compression configuration will consist of a 7:2 compressor, a 4:2 compressor and a 3:2 compressor in order. The most significant bits of each Carry vector, generated during the compression process, need to be inverted and fed back as well.
Finally, the final Sum and Carry vectors are modulo $2^n+1$ added by an Inverted EAC modulo $2^n$ adder to obtain the final computation result.
2.5.2 Performance Analysis
Considering the excellent performance of the Inverted EAC adder, demonstrated in 2.3, the proposed algorithm is mainly improved in the partial product generation and repositioning stage and the partial product matrix compression stage, in the following aspects:
1. By repositioning the last vector of partial products as mentioned in the previous section, the OR operation is no longer needed in the proposed algorithm. Hence, an OR gate can be removed from the critical path to increase computation speed and reduce power consumption and area. Though the number of partial product vectors in the proposed algorithm is one more than that of the existing algorithm, the number of vectors that need to be compressed is the same because there is one more identical pair in each column.
2. The range of input of the existing algorithm is extended. For inputs in the form of 1Z1, where Z is an (n-2)-bit vector, both $p_{n,i}$ and $p_{0,0}$ will be "1", so that the OR operation is no longer valid to compute the value of the term with weight $2^0$, and the same condition applies to $p_{n,n-1}$ as well. In the proposed algorithm, because there is no operation between terms in the matrix before partial product compression, the correctness of the final computation result is no longer influenced by the value of the input, so that the range of input is extended from [0, 2n-1) to [0, 2n+1-1] [13].
3. In [1], the best possible implementation of partial product compression is described as using compressors with a larger number of inputs. In the proposed algorithm, compressors with a large number of inputs are replaced by 3:2 compressors arranged in a Wallace tree structure. Due to the simpler structure and better critical path delay of 3:2 compressors, the performance of the compression process is further improved. For example, an efficient 7:2 compressor is proposed in [21]. According to the proposed algorithm, the 7:2 compressor can be equivalently replaced by a Wallace tree structure using 3:2 compressors, and the performance analysis is shown in Table-13.
Table- 13 Performance Comparison of Compressor

| | 7:2 Compressor | 3:2 Compressor based Wallace Tree |
|---|---|---|
| Number of Input/Output | 9/4 | 9/4 |
| Gate on Critical Path | 6 | 4 |
| Transistor Count | 172 | 150 |
Unfortunately, the proposed algorithm is not perfect either. From (9), it can be found that the correction factor of 5 is valid only under the condition that the bit number of the input is not less than 3, which imposes a stricter requirement on the proposed algorithm. However, this drawback will not have much negative influence on the application of the proposed algorithm, since modulo $2^n+1$ squarers with 2-bit or 1-bit input are rarely employed in practice.
To verify the analysis mentioned above, the simulation results of the first two stages in the modulo $2^n+1$ squarer based on the proposed algorithm and the existing algorithm are shown in Table-14.

Table- 14 Performance Comparison of First Two Stages
| | Existing Algorithm | Proposed Algorithm | Performance Improvement |
|---|---|---|---|
| Delay (ps) | 314.3 | 276.3 | 13.8% |
| Power (μW) | 69.35 | 56.50 | 22.7% |
| PDP (J) | 22.8e-15 | 15.6e-15 | 46.2% |
| Gate Count | 138 | 120 | 15% |
| Transistor Count | 1620 | 1374 | 17.9% |
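The thesis does not state how the "Performance Improvement" column is computed. The sketch below assumes, as an inference from the tabulated numbers and not as the author's stated definition, that the improvement is the difference relative to the proposed design, (existing − proposed) / proposed. With that assumption every entry of Table-14 is reproduced to one decimal place.

```python
def improvement_percent(existing: float, proposed: float) -> float:
    """Improvement relative to the proposed design, in percent (assumed convention)."""
    return (existing - proposed) / proposed * 100.0


# (metric, existing, proposed, improvement reported in Table-14)
table_14 = [
    ("Delay (ps)", 314.3, 276.3, 13.8),
    ("Power (uW)", 69.35, 56.50, 22.7),
    ("PDP (J)", 22.8e-15, 15.6e-15, 46.2),
    ("Gate Count", 138, 120, 15.0),
    ("Transistor Count", 1620, 1374, 17.9),
]
for metric, existing, proposed, reported in table_14:
    assert round(improvement_percent(existing, proposed), 1) == reported, metric
```

Normalising by the proposed value yields larger percentages than normalising by the existing value, which is worth keeping in mind when comparing these figures with results reported elsewhere.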
III. Circuit Implementation on CMOS Technology
Based on the proposed algorithm and the performance comparison between different possible implementations, the modulo $2^n+1$ squarer is implemented on 32nm CMOS technology in this section.
The circuit implementation of the modulo $2^n+1$ squarer is divided into three parts: the partial product generation and repositioning module, the Wallace tree compression module and the modulo $2^n+1$ adder.
3.1 Partial Product Generation and Repositioning Module
In this module, the partial product matrix, shown in Table-1, is generated by simple nand gates. Instead of adding inverters at the output ports of the nand gates, the nand outputs are used directly in the compression module in our design, so that a large number of inverters can be saved in this module. The schematic of the nand gate is shown in Figure 2.
Figure 2: Schematic of nand gate
3.2 Wallace Tree Compression Module
3.2.1 Design of 3:2 Compressors
Different from traditional Wallace tree compression configuration, 3:2 compressors,
governed by (19), are employed to replace full adders in this thesis. The schematic of
full adder and 3:2 compressor is shown in Figure 3 and Figure 4 respectively.
$$X_1 + X_2 + X_3 = Sum + 2 \times Carry \qquad (19)$$
Figure 3: Schematic of full adder
Figure 4: Schematic of 3:2 Compressor
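Equation (19) can be checked against a behavioural model of an XOR-XNOR plus MUX compressor. The wiring below is an assumption, since the figure itself did not survive extraction: the XOR of $X_1$ and $X_2$ is used as the select signal of two multiplexers, one choosing between $X_3$ and its complement for Sum and one choosing between $X_1$ and $X_3$ for Carry. This is a common arrangement, but it is not confirmed to be the exact circuit of Figure 4.

```python
from itertools import product


def mux(select: int, when_zero: int, when_one: int) -> int:
    """Two-input multiplexer."""
    return when_one if select else when_zero


def compressor_3_2(x1: int, x2: int, x3: int):
    """Behavioural 3:2 compressor built from one XOR and two multiplexers."""
    xor12 = x1 ^ x2
    # If x1 == x2, Sum is x3 and Carry is x1; otherwise Sum is NOT x3 and Carry is x3.
    total = mux(xor12, x3, 1 - x3)
    carry = mux(xor12, x1, x3)
    return total, carry


# Equation (19): X1 + X2 + X3 = Sum + 2 * Carry for all eight input patterns.
for x1, x2, x3 in product((0, 1), repeat=3):
    s, c = compressor_3_2(x1, x2, x3)
    assert x1 + x2 + x3 == s + 2 * c
```

In this arrangement both outputs are one multiplexer away from the XOR result, which is consistent with the two-gate critical path discussed in this section.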
Compared with the traditional full adder structure, the number of gates on the critical path of each Wallace tree stage is reduced from 3 to 2. Considering the increased number of Wallace tree stages required for a large input width of the modulo $2^n+1$ squarer and the total critical path delay of the compression module, given as (20), the critical path delay of the entire module can be significantly improved.
$$T_{Com\_M} = n \times T_{S\_C} \qquad (20)$$
where n is the number of Wallace tree stages and $T_{S\_C}$ is the delay of a single compression stage.
The performance comparison between a single full adder and a 3:2 compressor is also shown in Table-15 in detail. The critical path delay of the 3:2 compressor is 49% better than that of the full adder, with 21.9% lower power consumption. Moreover, since the partial product compression modules based on the traditional and the modified Wallace tree consist solely of full adders and 3:2 compressors respectively, the total power consumption will be significantly reduced as well.
Table- 15 Performance Comparison between Full Adder and 3:2 Compressor

| | Full Adder | 3:2 Compressor | Performance Improvement |
|---|---|---|---|
| Delay (ps) | 76.3 | 51.2 | 49% |
| Power (μW) | 1.988 | 1.631 | 21.9% |
| PDP (J) | 151.68e-18 | 83.51e-18 | 81.6% |
3.2.2 Design of Wallace Tree Compression Configuration
The Wallace tree structure is an efficient hardware implementation for multiplication. In this structure, any three partial products with the same weight are fed into a 3:2 compressor, and this is repeated until the final carry and sum vectors are obtained. Since each compressor turns three inputs into two outputs, one third of the available partial products in each column can be removed per stage at the expense of only two gate delays.
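The reduction described above can be sketched on whole vectors at once, treating each partial product vector as an integer. This is a simplified model with assumed names (`compress_3_2`, `wallace_reduce`): it performs plain binary carry-save reduction and deliberately omits the inverted end-around feedback of the carry's most significant bit that the modulo $2^n+1$ squarer requires.

```python
def compress_3_2(x: int, y: int, z: int):
    """Bitwise 3:2 compression of three vectors into a sum and a shifted carry."""
    sum_vector = x ^ y ^ z
    carry_vector = ((x & y) | (y & z) | (x & z)) << 1
    return sum_vector, carry_vector


def wallace_reduce(vectors):
    """Reduce a list of vectors to at most two; return (rows, stage_count)."""
    rows = list(vectors)
    stages = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        next_rows = []
        for k in range(0, full, 3):
            next_rows.extend(compress_3_2(*rows[k:k + 3]))
        next_rows.extend(rows[full:])  # leftover rows wait for the next stage
        rows = next_rows
        stages += 1
    return rows, stages


# Ten vectors, as in the modulo 2^15 + 1 example, take five stages.
rows, stages = wallace_reduce(range(1, 11))
assert stages == 5
assert sum(rows) == sum(range(1, 11))
```

Every stage preserves the arithmetic sum of the rows, so only one carry-propagating addition is needed at the very end.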
As mentioned in section 3.1, some of the terms in the partial product matrix need to be inverted before compression, so that extra inverters have to be added into the critical path. Based on (21), the extra inverters can actually be eliminated from the critical path by moving them into the bypass of certain 3:2 compressors as shown in Figure 5.
$$\overline{a} \oplus \overline{b} = a \oplus b \qquad (21)$$
Figure 5: Modified 3:2 Compressor
To clarify the Wallace tree compression configuration in the modulo $2^n+1$ squarer and demonstrate the analysis in section 2.5, the compression processes of the modulo $2^{15}+1$ squarer based on the proposed algorithm and on the existing algorithm in [1] are both introduced.
For the modulo $2^{15}+1$ squarer, the number of partial product vectors after shifting will be 10. Therefore, the optimal compression configuration based on the existing algorithm in [1] consists of a 7:2 compressor, a 4:2 compressor and a 3:2 compressor as shown in Figure 6. For the proposed algorithm, 5 Wallace tree stages are required and its schematic is shown in Figure 7.
Figure 6: Compression process based on Existing Algorithm
Figure 7: Compression process based on Proposed Algorithm
As shown in Figure 8 and Figure 9, the critical path delay of the Wallace tree compression configuration is almost 17% better than that of the traditional one. In addition, the simulation result in Table-16 shows that the Wallace tree compression configuration also performs considerably better in terms of power consumption and area. Considering the large contribution of the compression process to the delay and power of the entire circuit, the Wallace tree compression configuration can efficiently improve the overall performance.
Figure 8: Critical path delay of traditional compression process
Figure 9: Critical path delay of proposed compression process
Table- 16 Performance Comparison between Different Compression Processes

| | Existing Compression Configuration in [1] | Wallace Tree Configuration | Performance Improvement |
|---|---|---|---|
| Delay (ps) | 457.2 | 392.1 | 16.6% |
| Power (μW) | 12.2 | 8.64 | 41.2% |
| PDP (J) | 5.58e-15 | 3.39e-15 | 64.7% |
| Gate Count | 24 | 24 | - |
| Transistor Count | 262 | 240 | 9.2% |
3.3 Modulo $2^n+1$ Adder
In this section, an inverted EAC modulo $2^n+1$ adder is designed to implement the final addition stage. The performance of various existing modulo $2^n+1$ adders, reported in [22], is summarized in Table-17.

Table- 17 Performance Comparison of existing modulo $2^n+1$ adder

| Architecture | N=8 Delay (ns) | N=8 Number of operators | N=16 Delay (ns) | N=16 Number of operators |
|---|---|---|---|---|
| Sklansky [9] | 0.63 | 40 | 0.76 | 80 |
| Kogge-Stone [9] | 0.62 | 65 | 0.74 | 161 |
| Parallel-Prefix [23] | 0.50 | 68 | 0.62 | 196 |
| Proposed FPP | 0.44 | 60 | 0.54 | 160 |
| Proposed RAPP | 0.51 | 52 | 0.75 | 144 |

where N is the bit number of the input.
In Table-17, the fast parallel-prefix modulo $2^n+1$ adder (FPP) proposed in [22] is considered the fastest possible implementation with an acceptable number of operators. Different from other parallel-prefix adders, the Ling equation is utilized to compute the carry-out in the proposed FPP. In the Ling equation, a pseudo carry-out ($H_i$) is introduced as given in (22), which allows a single local propagate signal to be removed from the critical path [22]. The schematic of the proposed FPP modulo $2^n+1$ adder is shown in Figure 10.
$$C_i = H_i \cdot p_i \qquad (22)$$
where $C_i$ is the traditional carry-out.
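Relation (22) can be verified with a short bit-level model. The thesis does not spell out the pseudo carry, so this sketch assumes the standard Ling definition $H_i = g_i + C_{i-1}$ together with $g_i = a_i \cdot b_i$ and $p_i = a_i + b_i$; the function name `ling_relation_holds` is hypothetical, and a plain binary adder is modelled rather than the modulo $2^n+1$ FPP of [22].

```python
def ling_relation_holds(width: int = 4) -> bool:
    """Check C_i == H_i AND p_i for every bit of every operand pair."""
    for a in range(1 << width):
        for b in range(1 << width):
            carry = 0  # C_{-1}
            for i in range(width):
                a_i, b_i = (a >> i) & 1, (b >> i) & 1
                g, p = a_i & b_i, a_i | b_i
                pseudo_carry = g | carry        # assumed H_i = g_i + C_{i-1}
                next_carry = g | (p & carry)    # conventional C_i = g_i + p_i * C_{i-1}
                if next_carry != (pseudo_carry & p):
                    return False
                carry = next_carry
    return True


assert ling_relation_holds(4)
```

The identity works because $g_i \cdot p_i = g_i$ when the propagate bit is defined with OR, so the final AND with $p_i$ can be postponed until the sum is formed, taking it off the carry path.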
Figure 10: Schematic of Proposed FPP
For wide operands, the proposed FPP modulo $2^n+1$ adder suffers from area and power issues due to the large number of operators. Furthermore, the complex wire routing of the proposed FPP will further degrade its performance in practice and increase the implementation difficulty. Therefore, a sparse-tree based inverted EAC modulo $2^n+1$ adder is implemented in this thesis, based on the algorithm discussed in Section 2.3.
The sparse tree modulo $2^n+1$ adder combines the advantages of the parallel prefix adder and the conditional sum generator. It has the minimum logic depth of $\log_2 n$ and its maximum fanout is 3. Compared with the proposed FPP, the sparse tree adder computes the carry-out into each 4-bit group using a valency-2 tree structure similar to Sklansky, so that the number of operators can be dramatically reduced to achieve a lower-power and more area-efficient implementation with much simpler wire routing [24]. In addition, since the critical path of the sparse tree adder comes from the gates used to compute the carry-out, the output delay skew of the final sum vector can be improved. The schematic of the sparse tree modulo $2^n+1$ adder is shown in Figure 11.
Figure 11: Schematic of Sparse Tree
Besides the carry-out computation circuit, 4-bit conditional sum generators are also needed in the sparse-tree based modulo $2^n+1$ adder as shown in Figure 11. Since the delay of the sum computation path is less than that of the carry-out computation path, the conditional sum generator contributes only one multiplexer to the entire critical path. Its schematic is shown in Figure 12.
Figure 12: Schematic of Conditional Sum Generator
The simulation results of the critical path delay in the 8-bit proposed FPP and sparse tree modulo $2^n+1$ adders are shown in Figure 13 and Figure 14 respectively, and their performance is summarized in Table-18. Although the critical path delay of the sparse tree is slightly larger than that of the proposed FPP, the power consumption is significantly reduced, so that the PDP of the sparse tree is almost 4 times better. In addition, since the transistor count of the sparse tree is about 20% lower than that of the proposed FPP, and this percentage will increase further for wide operands, the sparse-tree based modulo $2^n+1$ adder is more area-efficient.
Figure 13: Critical path delay of proposed FPP
Figure 14: Critical path delay of sparse tree adder
Table- 18 Performance Comparison between Proposed FPP and Sparse Tree

| | Proposed FPP | Sparse Tree | Performance Improvement |
|---|---|---|---|
| Delay (ps) | 96.01 | 100.66 | -2.5% |
| Power (μW) | 60.35 | 12.54 | 3.8x |
| PDP (J) | 57.94e-16 | 12.62e-16 | 3.6x |
| Transistor Count | 119 | 99 | 20.2% |
3.4 Design of Primitive Blocks
In this section, the design of the primitive blocks used in each module, such as the XOR-XNOR gate, the MUX and the GP generator, is presented.
3.4.1 Circuit Design of XOR-XNOR
In Figure 15 (a), a basic XOR-XNOR gate with the lowest transistor count is presented. However, due to the Vth drop of the NMOS, this XOR-XNOR gate suffers from a weak logic "1" output at the xnor node, which results in a very slow fall time at the xor node. In addition, the xor output is generated by inverting xnor, which leads to an output delay skew.
In Figure 15 (b), a complementary pass-gate XOR-XNOR is introduced to solve the problem of weak logic at the expense of larger area and higher power consumption. However, this XOR-XNOR structure also suffers from the output delay skew and the limited driving capacity of the xor node.
To solve the problem of output delay skew, a symmetrical structure based on the basic XOR-XNOR gate is shown in Figure 15 (c). However, it also has a weak logic output, for the same reason as the basic one.
In [25], a low-power XOR-XNOR gate is implemented to solve the problems of output delay skew and weak logic by employing a pair of feedback transistors at the output ports as shown in Figure 15 (d). However, it encounters a very low output voltage and a tremendously high current during the transition from any other pattern to "00" or "11", because both feedback transistors will be turned on simultaneously and act as a high impedance driver.
Figure 15: Schematic of existing XOR-XNOR gate, panels (a) to (d)
An improved design of the XOR-XNOR gate, proposed in [26], is employed in this thesis. By adding a feedback transistor pair to the structure shown in Figure 15 (c), the problem of weak logic output is solved, and the two keeper transistor pairs at the xor and xnor nodes ensure the output voltage level under the input patterns "00" and "11". The schematic of the improved XOR-XNOR gate is shown in Figure 16.
Figure 16: Schematic of improved XOR-XNOR gate
3.4.2 Circuit Design of MUX
In [27], two possible implementations of the MUX used in the 3:2 compressor are presented, as shown in Figure 17.
In Figure 17 (a), the simple structure and low transistor count allow the pass-gate based MUX to be applied in various low power applications with an acceptable speed. However, due to the poor driving capacity of the pass-gate, this design can only be used in the intermediate output stage. Although the driving capacity can be strengthened by adding an extra buffer at the output port of the MUX, the increased number of transistors and the extra gates on the critical path make the design uncompetitive in our application.
Figure 17: Schematic of MUX, panels (a) and (b)
In Figure 17 (b), an alternative design is proposed. By adding an inverter at the output port, sufficient driving strength of the MUX can be achieved. Compared with the previous design in Figure 17 (a), the number of gates on the critical path is not increased in this implementation. Although there are two more transistors in the proposed design, the total power consumption is still acceptable for low power applications.
3.4.3 Circuit Design of GP generator
To compute the carry-out in both the modulo $2^n+1$ adder and the conditional sum generator, a generate and propagate group $(G_i, P_i)$ generator is necessary. In general, the GP generator is implemented by the combination of "and" and "or" gates as shown in Figure 18. However, "and" and "or" gates usually consume more power and have a longer delay than nand and nor gates in regular VLSI implementations. Since the number of GP generators in the modulo $2^n+1$ squarer is large and will increase further with larger input width, the overall delay and power saving from a better GP generator will be considerable.
Figure 18: Schematic of Simple GP Generator
Considering (22), the traditional GP generator can be replaced by the combination of AOI and OAI gates, shown in Figure 19 (a) and Figure 19 (b) respectively, to achieve a faster speed with lower power consumption and smaller area.
$$(a \cdot b) + c = \overline{(\overline{a} + \overline{b}) \cdot \overline{c}} \qquad (22)$$
In the proposed GP generator, only one OAI gate or AOI gate is required for each Wallace tree stage. Compared with a single traditional GP generator, the speed of the proposed one is doubled with 51% less power consumption, and the number of transistors is also reduced from 12 to 6. Furthermore, due to the inverting functionality of the OAI and AOI gates, the outputs of nand and nor gates can be used directly without extra inverters. Therefore, the number of gates on the carry-out computation path, which is also the critical path of the entire circuit, can be reduced.
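The role of the De Morgan identity in the GP generator can be illustrated with truth tables. The sketch below is an interpretation, not the thesis's netlist: it assumes that consecutive prefix levels alternate between an AOI gate fed with true-polarity signals and an OAI gate fed with inverted-polarity signals, so that no separate inverter is needed between levels. The helper names `aoi` and `oai` are illustrative.

```python
from itertools import product


def aoi(a: int, b: int, c: int) -> int:
    """AND-OR-Invert: NOT(a*b + c)."""
    return 1 - ((a & b) | c)


def oai(a: int, b: int, c: int) -> int:
    """OR-AND-Invert: NOT((a + b) * c)."""
    return 1 - ((a | b) & c)


for p, g_low, g in product((0, 1), repeat=3):
    group_generate = g | (p & g_low)  # G + P * G' from the prefix operator
    # The De Morgan identity: (a*b) + c == NOT((NOT a + NOT b) * NOT c).
    assert group_generate == oai(1 - p, 1 - g_low, 1 - g)
    # True-polarity inputs into an AOI gate give the inverted group generate ...
    assert aoi(p, g_low, g) == 1 - group_generate
    # ... and inverted inputs into an OAI gate give the true group generate.
    assert oai(1 - p, 1 - g_low, 1 - g) == group_generate
```

Alternating the two gate types therefore keeps each prefix level at a single inverting gate, in line with the reduced gate count on the carry path claimed above.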
Figure 19: Schematic of AOI and OAI Gate
The performance comparison between the two GP generator implementations for an adder with a logic depth of 3 is summarized in Table-19.

Table- 19 Performance Comparison between GP generators

| | Traditional GP | AOI/OAI based GP | Performance Improvement |
|---|---|---|---|
| Delay (ps) | 93.45 | 51.65 | 80.9% |
| Power (μW) | 5.78 | 4.15 | 39.3% |
| PDP (J) | 0.54e-15 | 0.21e-15 | 1.58x |
| Gate Count | 72 | 29 | 1.48x |
| Transistor Count | 216 | 130 | 66.2% |

In Table-19, the critical path delay of the 3-stage proposed GP generator is improved by 80.9% with 39.3% lower power consumption. In addition, the transistor count of the proposed structure is only 60% of that of the traditional one.
3.5 Summary and Simulation Result
In this thesis, the implementation of the modulo $2^n+1$ squarer is mainly improved in the following aspects:
Firstly, 3:2 compressors are employed in the Wallace tree structure to replace full adders. Because of the smaller number of gates on both the critical path and the non-critical paths, the speed of the partial product compression process is improved with much lower power consumption.
Secondly, a sparse-tree based inverted EAC modulo $2^n+1$ adder is used to implement the final addition stage. Different from the full tree adder, it does not compute the carry-out of each bit and requires a smaller number of GP generators. Therefore, the total power consumption and area are improved at almost the same speed as the full tree adder. In addition, the wire routing, which is also an important factor for VLSI implementation in practice, is simplified. Due to the usage of 4-bit conditional sum generators, the output delay skew of the final result can be improved.
Finally, the proposed GP generator is introduced to further improve the performance of the modulo $2^n+1$ squarer. In the final addition stage, the critical path is mainly composed of the GP generators used to compute the carry-out of each bit. By employing AOI and OAI gates as the GP generator, the inverters used to implement "and" and "or" gates can be saved, and the power consumption of the proposed GP generator is also shown to be significantly lower than that of the traditional one.
The simulation results of the modulo $2^n+1$ squarer with a fanout of four, including critical path delay, power consumption and non-critical path delay, are shown in Figure 20, Figure 21 and Figure 22 respectively.
Figure 20: Delay and Rise time of modulo $2^n+1$ squarer
Figure 21: power consumption of modulo $2^n+1$ squarer
Figure 22: Non-critical delay of modulo $2^n+1$ squarer
IV. Circuit Implementation on CNT Technology
In this section, a novel Carbon Nanotube (CNT) technology is introduced and the optimal configuration of the modulo $2^n+1$ squarer is implemented on CNT technology as well. The simulation results of CNT technology and CMOS technology are compared in terms of critical path delay, power and area. A Monte Carlo simulation of PVT variation is also performed.
4.1 Introduction to CNTFET
In the structure of the CNTFET transistor, the bulk silicon utilized as channel material in the MOSFET transistor is replaced by a single nanotube or an array of nanotubes [28]. The schematic of the CNTFET transistor is shown in Figure 23.
Figure 23: Schematic of CNTFET transistor
In Figure 24, the single-wall nanotube utilized as channel material in the CNTFET transistor is unrolled as a sheet of graphite with a roll-up vector given as (23), and the diameter of the nanotube can be given as (24).
Figure 24: Schematic of unrolled nanotube
$$\vec{C_h} = n\,\vec{a_1} + m\,\vec{a_2} \qquad (23)$$
$$D_{CNT} = \frac{\sqrt{3}\, a_0}{\pi} \sqrt{n^2 + m^2 + nm} \qquad (24)$$
where (n, m) is a pair of positive integers, $(\vec{a_1}, \vec{a_2})$ are the lattice unit vectors and $a_0$ is the interatomic distance with the value of 0.142nm.
Because the chiral angle and the nanotube diameter both depend on the integer pair (n, m), the carbon nanotube can be either metallic, if n = m or n − m = 3i where i is an integer, or semiconducting, if n − m takes any other value.
Similar to the MOSFET transistor, the CNTFET transistor cannot be turned on until the voltage between gate and source, denoted as Vgs, is larger than the threshold voltage. The threshold voltage of the CNT channel can be approximated as inversely proportional to the nanotube diameter and is given as (25) [29]:
$$V_{th} \approx \frac{E_g}{e} = \frac{\sqrt{3}\, \alpha\, V_\pi}{3\, e\, D_{CNT}} \qquad (25)$$
where α = 2.49 Å is the atom distance between carbons, $V_\pi$ = 3.033 eV is the carbon π-π bond energy in the tight-binding model and e is the unit electron charge.
Considering (24) and (25) together, the threshold voltage of the CNTFET transistor decreases as the integers (n, m) increase, since a larger pair gives a larger diameter. Keeping m = 0, the diameter is proportional to n and the threshold voltage is inversely proportional to n, as shown in Figure 25.
Figure 25: CNTFET threshold voltage varies with n
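The dependence of diameter and threshold voltage on the chirality can be evaluated numerically from (24) and the right-hand side of (25). The sketch below makes two assumptions that are not stated explicitly in the text: α = 2.49 is read in angstroms (0.249 nm), and dividing an energy in eV by the electron charge is taken to give volts numerically. The function names are hypothetical.

```python
import math

A0_NM = 0.142      # interatomic distance a0 from the text, in nm
ALPHA_NM = 0.249   # alpha = 2.49 interpreted as angstroms (assumption), in nm
V_PI_EV = 3.033    # carbon pi-pi bond energy, in eV


def cnt_diameter_nm(n: int, m: int = 0) -> float:
    """Nanotube diameter from equation (24), in nm."""
    return math.sqrt(3) * A0_NM / math.pi * math.sqrt(n * n + m * m + n * m)


def threshold_voltage_v(n: int, m: int = 0) -> float:
    """Threshold voltage from the right-hand side of equation (25), in volts."""
    return math.sqrt(3) * ALPHA_NM * V_PI_EV / (3 * cnt_diameter_nm(n, m))


def is_metallic(n: int, m: int) -> bool:
    """Metallic when n - m is a multiple of 3, which includes n == m."""
    return (n - m) % 3 == 0


for n in (10, 13, 16, 19):
    print(n, round(cnt_diameter_nm(n), 3), round(threshold_voltage_v(n), 3))
```

With m = 0 the diameter scales linearly with n, so doubling n halves the computed threshold voltage; a (19, 0) tube, for instance, comes out at roughly 1.49 nm and 0.29 V under these assumptions.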
The I-V characteristic of the CNTFET transistor is shown in Figure 26. Similar to the MOSFET transistor, the channel current in the CNTFET increases with increasing drain-to-source voltage (Vds) when it is turned on. However, the current saturates once Vds reaches a certain value, and further increasing Vds then influences the current only slightly. In addition, a longer physical channel can result in a larger saturated current.
Figure 26: I-V characteristic of CNTFET transistor
4.2 Circuit Implementation
Because the CNTFET transistor has a similar operating principle and device structure to
the MOSFET transistor, the configuration of the CNT implementation is almost the same as
that of the CMOS one. However, in order to implement the modulo 2^n+1 squarer on CNT
technology, the optimal number of nanotubes in a single CNTFET transistor, which is
equivalent to the channel width of a MOSFET transistor, has to be decided. This is done
by simulating a 3-stage inverter chain with a fanout of 4. The simulated delay for
various numbers of nanotubes is shown in Figure 27.
In Figure 27, it can be seen that the inverter delay initially improves, because a larger
number of nanotubes in a single CNTFET transistor increases the total drive current in
the channel. However, the delay worsens when the number of nanotubes is larger than 8,
because the increased inter-nanotube charge screening reduces the drive current of each
carbon nanotube [30]. Therefore, 8 nanotubes can be considered optimal for circuit
implementation on CNT technology, and the scaling of complex logic gates should also be
based on this value.
Figure 27: Delay with various number of nanotube
In addition, the width ratio between pFET and nFET, computed as the ratio of
nanotube numbers on CNT technology, is changed to 1, because the pFET and nFET have
similar driving capability in the CNT case. The same ratio should also be used in
the design of complex logic gates.
Based on the analysis above, the modulo 2^8+1 squarer is implemented on CNT
technology. The simulated critical path delay and power of the modulo 2^8+1
squarer with a fanout of 4 are shown in Figure 28 and Figure 29 respectively.
[Figure 27 plot: FO4 CNT delay (ps), y-axis from 1.85 to 2.2, versus number of nanotubes, x-axis ticks 1, 2, 4, 6, 8, 12, 14, 16, 18]
Figure 28: Critical path delay of modulo 2^8+1 squarer
Figure 29: Power of modulo 2^8+1 squarer
4.3 Performance Comparison
Compared with traditional CMOS technology, CNT technology performs considerably
better in terms of delay, power, frequency response and stability.
4.3.1 PDP Comparison
Due to the lower threshold voltage of the CNTFET transistor, discussed in Section 4.1,
the CNTFET transistor can be turned on at a lower voltage than a MOSFET transistor,
so CNTFET-based logic gates achieve faster rise/fall times and therefore better delay
performance. In addition, the tunable threshold voltage makes the CNTFET transistor
more competitive for low-supply-voltage applications.
Because the channel length is in the nanometer range, the static power consumption,
generated by the leakage current, can dominate the total power consumption. In [14],
the leakage current of various basic logic gates is compared between 32nm MOSFET and
CNTFET implementations. According to the simulation results, the maximum and minimum
leakage power of the CNTFET-based logic gates are 75 times and 3 times smaller,
respectively, than those of the MOSFET-based ones. Therefore, although the dynamic
power of the CNTFET may be larger than that of the MOSFET due to the larger dynamic
current, the total power consumption can still be significantly improved in
nanometer-range applications.
For the frequency response, an AC simulation of inverters implemented on both
CMOS technology and CNT technology is performed. According to the simulation
results, the voltage gain of the CNTFET inverter is 3 dB larger than that of the MOSFET
inverter and its 3 dB frequency is 3 times higher.
To support the conclusion above, the delay, power and PDP of the modulo 2^8+1
squarer on both implementation technologies are summarized in Table-20.
Table-20 Performance of Modulo 2^8+1 Squarer on Two Technologies
                  CMOS        CNTFET      Performance Improved
Delay (ps)        401.81      29.63       13.6x
Rise Time (ps)    35.84       3.84        9.3x
Power (μW)        27.42       11.74       2.3x
PDP (J)           11.02e-15   0.35e-15    31.8x
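The PDP row of Table-20 follows from the delay and power rows, since the power-delay product is simply delay times power. The short Python sketch below recomputes it from the tabulated values; the helper name is an illustrative assumption and the script is not part of the original thesis.

```python
def pdp_joules(delay_ps: float, power_uw: float) -> float:
    """Power-delay product in joules, from delay in ps and power in microwatts."""
    return (delay_ps * 1e-12) * (power_uw * 1e-6)


# Delay and power values copied from Table-20.
cmos_pdp = pdp_joules(401.81, 27.42)
cnt_pdp = pdp_joules(29.63, 11.74)

print(f"CMOS PDP    = {cmos_pdp:.3e} J")
print(f"CNTFET PDP  = {cnt_pdp:.3e} J")
print(f"delay ratio = {401.81 / 29.63:.1f}x")
print(f"power ratio = {27.42 / 11.74:.1f}x")
print(f"PDP ratio   = {cmos_pdp / cnt_pdp:.1f}x")
```

The recomputed PDP ratio comes out between 31 and 32, close to the improvement reported in the table; small differences are expected from rounding of the tabulated inputs.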
In Table-20, it can be clearly observed that the critical path delay and rise time of the
modulo 2^8+1 squarer on CNT technology are 13.6 times and 9.3 times better than those
of CMOS technology respectively. In addition, the leakage current of the CNT
implementation shown in Figure 29 is much smaller than that of the CMOS one shown in
Figure 21, which results in lower power consumption. Finally, a nearly 32 times better
PDP is achieved by CNT technology.
4.3.2 PVT Comparison
To further compare the performance of the modulo 2^n+1 squarer on both technologies,
a PVT simulation that varies one factor at a time is performed in this section. In each
simulation, only one of the process parameters, the supply voltage or the temperature is
varied, by the same degree in both the CMOS and CNT implementations of the modulo 2^8+1
squarer.
For the process variation, there are typically three corners: typical (T), fast (F)
and slow (S), for both pFET and nFET. Five combinations of corners are used,
FF, FS, TT, SF and SS, where the first letter stands for the corner of the nFET and
the second for that of the pFET. The variation of critical path delay, rise time and
power with process corner is shown in Table-21 and Table-22. The variation trend
of the critical path delay is shown in Figure 30 and Figure 31.
Table-21 Process Variation of CMOS Technology
Corner (5%)         SS        FS        TT        SF        FF
Delay (ps)          595.07    455.43    401.83    370.5     281.1
Percent Variation   48.09%    7.79%     -         13.34%    30.04%
Power (μW)          19.93     31.03     27.42     36.47     45.44
Percent Variation   27.32%    13.09%    -         33.01%    65.2%
Rise Time (ps)      79.156    69.081    35.843    44.241    37.667
Percent Variation   120.86%   92.75%    -         23.44%    5.09%
Table-22 Process Variation of CNT Technology
Corner (5%)         SS        FS        TT        SF        FF
Delay (ps)          35.86     32.075    29.303    29.359    26.302
Percent Variation   22.36%    6.00%     -         0.46%     10.20%
Power (μW)          7.47      14.65     11.74     8.14      20.41
Percent Variation   36.37%    24.79%    -         30.66%    73.76%
Rise Time (ps)      4.489     4.301     3.84      4.23      3.675
Percent Variation   16.90%    22.56%    -         10.16%    4.30%
Figure 30: Process Delay Variation of CMOS Technology (critical path delay in ps versus process corner: SS, FS, TT, SF, FF)
Figure 31: Process Delay Variation of CNT Technology (critical path delay in ps versus process corner: SS, FS, TT, SF, FF)
For the voltage variation, the supply voltage is varied from 0.72V to 0.88V in steps
of 0.04V, and 0.8V is considered the normal condition in this thesis. The critical path
delay, rise time and power variations of the CMOS and CNT implementations are
summarized in Table-23 and Table-24 respectively. The variation trend of their critical
path delay is shown in Figure 32 and Figure 33.
Table-23 Supply Voltage Variation of CMOS Technology
Voltage (V)         0.72      0.76      0.8       0.84      0.88
Delay (ps)          530.4     456.74    401.81    361.04    329.77
Percent Variation   32.00%    13.66%    -         10.15%    17.93%
Power (μW)          17.69     22.87     27.42     31.59     36.19
Percent Variation   35.49%    16.59%    -         15.21%    31.98%
Rise Time (ps)      45.217    40.195    35.84     33.568    31.162
Percent Variation   26.16%    12.15%    -         6.34%     13.5%
Table-24 Supply Voltage Variation of CNT Technology
Voltage (V)         0.72      0.76      0.8       0.84      0.88
Delay (ps)          31.96     30.415    29.303    27.883    27.13
Percent Variation   9.07%     3.79%     -         4.85%     1.19%
Power (μW)          7.85      9.58      11.74     14.21     16.99
Percent Variation   33.13%    18.40%    -         21.04%    44.72%
Rise Time (ps)      4.066     3.958     3.84      3.907     3.793
Percent Variation   5.89%     3.07%     -         1.74%     1.22%
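The "Percent Variation" rows appear to be the relative deviation from the nominal column. The Python sketch below assumes that definition, |value - nominal| / nominal, and applies it to the CMOS delay row of Table-23; the function name is hypothetical and the script is not taken from the thesis.

```python
def percent_variation(value: float, nominal: float) -> float:
    """Relative deviation from the nominal value, in percent."""
    return abs(value - nominal) / nominal * 100.0


# CMOS critical path delay (ps) versus supply voltage (V), from Table-23.
cmos_delay_ps = {0.72: 530.4, 0.76: 456.74, 0.80: 401.81, 0.84: 361.04, 0.88: 329.77}
nominal_ps = cmos_delay_ps[0.80]

for voltage, delay in cmos_delay_ps.items():
    if voltage == 0.80:
        continue  # nominal condition, shown as "-" in the table
    print(f"{voltage:.2f} V: {percent_variation(delay, nominal_ps):.2f}%")
```

Under this assumption the recomputed values agree with the tabulated CMOS delay percentages to within rounding.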
Figure 32: Supply Voltage Delay Variation of CMOS Technology (critical path delay in ps versus supply voltage, 0.72V to 0.88V)
Figure 33: Supply Voltage Delay Variation of CNT Technology (critical path delay in ps versus supply voltage, 0.72V to 0.88V)
For the temperature variation, the environment temperature is varied from 0 to 100
degrees centigrade, and 25 degrees centigrade is considered the normal room
temperature for comparison. The variation of critical path delay, rise time and power
with temperature is shown in Table-25 and Table-26. The variation trend of the
critical path delay is shown in Figure 34 and Figure 35.
Table-25 Temperature Variation of CMOS Technology
Temperature (°C)    0         25        50        75        100
Delay (ps)          338.12    401.83    476.29    558.52    645.96
Percent Variation   15.84%    -         18.53%    38.99%    60.75%
Power (μW)          26.13     27.42     28.13     27.68     27.97
Percent Variation   4.70%     -         2.59%     0.95%     2.01%
Rise Time (ps)      28.937    35.84     44.003    52.632    61.953
Percent Variation   19.26%    -         22.78%    46.85%    72.86%
Table-26 Temperature Variation of CNT Technology
Temperature (°C)    0         25        50        75        100
Delay (ps)          29.291    29.303    29.313    29.329    29.34
Percent Variation   0.04%     -         0.03%     0.09%     0.13%
Power (μW)          11.78     11.74     11.69     11.65     11.58
Percent Variation   0.34%     -         0.43%     0.77%     1.36%
Rise Time (ps)      3.830     3.843     3.841     3.842     3.841
Percent Variation   0.34%     -         0.05%     0.03%     0.05%
Figure 34: Temperature Delay Variation of CMOS Technology (critical path delay in ps versus temperature, 0C to 100C)
Figure 35: Temperature Delay Variation of CNT Technology (critical path delay in ps versus temperature, 0C to 100C)
From Figure 30 to Figure 35, it can be seen that the critical path delay of the CNT
implementation follows almost the same variation trend as that of the CMOS one. However,
the percent variations of critical path delay and rise time of the CNT implementation
are both much smaller. For example, the average variations of critical path delay and
rise time over the process corners on CNT technology are 9.76% and 13.48%. The
corresponding values for the CMOS implementation are 24.82% and 60.54%. The minimum
percent variations of delay, rise time and power on CNT technology are 0.03%, 0.03% and
0.34% respectively, and they all come from the temperature variation. This is primarily
because the environment temperature has only a slight influence on the I-V
characteristics of the CNTFET transistor [31]. Though the percent variation of power in
the CNT implementation is a little higher than that in the CMOS one, the absolute
variation of the CNT implementation is much smaller, so it will not
significantly influence the performance in practice.
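The average process-corner variations quoted above can be reproduced by averaging the four non-nominal entries of the corresponding "Percent Variation" rows in Table-21 and Table-22. A minimal Python sketch, with the dictionary layout chosen purely for illustration:

```python
from statistics import mean

# Percent variation at the SS, FS, SF and FF corners (TT is the reference),
# copied from Table-21 (CMOS) and Table-22 (CNT).
process_variation_pct = {
    ("CMOS", "delay"): [48.09, 7.79, 13.34, 30.04],
    ("CMOS", "rise time"): [120.86, 92.75, 23.44, 5.09],
    ("CNT", "delay"): [22.36, 6.00, 0.46, 10.20],
    ("CNT", "rise time"): [16.90, 22.56, 10.16, 4.30],
}

averages = {key: mean(values) for key, values in process_variation_pct.items()}

for (technology, metric), value in averages.items():
    print(f"{technology:4s} {metric:9s}: average variation = {value:.3f}%")
```

The averages land on the quoted 24.82%, 60.54%, 9.76% and 13.48% once rounded to two decimals.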
In practice, PVT variations always occur together. Therefore, a Monte Carlo
simulation that varies all the PVT factors at once is performed for both the
CNT and CMOS implementations of the modulo 2^8+1 squarer. In each Monte Carlo
run, the threshold voltage, environment temperature and supply voltage of both
implementations are randomly selected within a range of ±3% at the same time. One
thousand samples are collected for the CMOS implementation, as shown in Figure 36,
and one hundred twenty samples are collected for the CNT implementation, as shown
in Figure 37.
Figure 36: Monte Carlo analysis of CMOS implementation (critical path delay in ps versus Monte Carlo index, 0 to 1000)
Figure 37: Monte Carlo analysis of CNT implementation (critical path delay in ps versus Monte Carlo index, 0 to 120)
Compared with the 21.95% maximum variation of the CMOS implementation in
Figure 36, the maximum variation of the CNT implementation in Figure 37 is only 5.7%.
In addition, 131 samples in Figure 36 vary by more than 10%, which
accounts for about 13.1% of the whole sample set. In Figure 37, only 11.6% of the
samples vary by more than 2%. Hence, the CNT implementation of the
modulo 2^n+1 squarer offers a considerable improvement in the stability
of the critical path delay compared with the CMOS one.
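The two statistics used above, the maximum variation and the share of samples beyond a threshold, can be computed from a list of Monte Carlo delay samples as sketched below in Python. The raw HSpice samples are not given in the text, so the sample list here is synthetic and purely hypothetical; the function name and the ±5% spread are assumptions made only to exercise the code.

```python
import random


def variation_stats(delays_ps, nominal_ps, threshold_pct):
    """Return (maximum percent deviation, percent of samples deviating more than threshold)."""
    deviations = [abs(d - nominal_ps) / nominal_ps * 100.0 for d in delays_ps]
    max_deviation = max(deviations)
    share_over = sum(1 for x in deviations if x > threshold_pct) / len(deviations) * 100.0
    return max_deviation, share_over


# Synthetic stand-in for the 120 CNT samples (NOT the thesis data).
rng = random.Random(0)
nominal = 29.303  # nominal CNT critical path delay in ps, from Table-22
samples = [nominal * (1.0 + rng.uniform(-0.05, 0.05)) for _ in range(120)]

max_dev, share = variation_stats(samples, nominal, threshold_pct=2.0)
print(f"maximum variation: {max_dev:.2f}%")
print(f"samples varying more than 2%: {share:.1f}%")
```

Applied to the measured samples behind Figure 36 and Figure 37, the same function would yield the figures quoted in the text.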
V. Conclusion
In this thesis, a novel modulo 2^n+1 squarer is implemented based on an improved
algorithm. Compared with the existing algorithm, the input range of the modulo 2^n+1
squarer is extended without any extra cost and the number of gates on the critical path
is further reduced. In the partial product compression stage, the use of a
3:2 compressor-based Wallace tree configuration results in a considerable
improvement in terms of delay, power and area. For the final addition stage, a sparse
tree IEAC adder is introduced to further improve the delay and power with fewer
gates and simpler wire routing.
In addition, the improved modulo 2^n+1 squarer is implemented on both CMOS
technology and CNT technology. Compared with the traditional MOSFET transistor, the
CNTFET transistor is shown to perform considerably better in terms of
power and delay. A Monte Carlo simulation is also performed to demonstrate the
better PVT properties of the CNT implementation. Hence, the CNTFET-based modulo
2^n+1 squarer can be considered a competitive candidate for low-power and
high-performance applications.
Reference
[1] Rajashekar Modugua, Yong-Bin Kim, Minsu Choi, “Fast Low-Power Modulo 2^n+1 Squarer Hardware for Efficient Data Processing,” http://www.ece.neu.edu/groups/hpvlsi/publication/2009_Modulo.pdf.
[2] M. A. Soderstrand, W. K. Jenkins, G. A. Jullien, and F. J. Taylor, Eds., “Modern Applications of Residue Number System Arithmetic to Digital Signal Processing,” New York: IEEE Press, 1986.
[3] K. G. Smitha and A. P. Vinod, “A reconfigurable high speed RNS-FIR channel filter for multi-standard software radio receivers,” in Proceedings of the 11th IEEE Singapore International Conference on Communication Systems (ICCS’08), pp.1354–1358, Guangzhou, China, 2008.
[4] P. E. Beckmann and B. R. Musicus, “Fast fault-tolerant digital convolution using a polynomial residue number system,” IEEE Trans. Signal Process., vol.41, no.7, pp.2300–2313, Jul.1993.
[5] Yuke Wang, M. N. S. Swamy, and M. Omair Ahmad, “Residue-to-binary number converters for three moduli set,” IEEE Transactions on Circuits and Systems, vol.46, No.2, Feb.1999.
[6] D. Gallaher, F. Petry, and P. Srinivasan, “The digital parallel method for fast RNS to weighted number system conversion for specific moduli (2^k-1; 2^k; 2^k+1),” IEEE Trans. Circuits Syst. II, vol.44, pp.537, Jan.1997.
[7] L.M. Leibowitz, “A simplified binary arithmetic for the Fermat number transform,” IEEE Trans. Acoust. Speech Signal Process., pp.356–359, 1976.
[8] H.T. Vergos and C. Efstathiou, “Diminished-1 modulo 2^n+1 squarer design,” Computers and Digital Techniques, IEEE Proceedings, vol.152, no.5, pp.561-566, Sep.2005.
[9] R. Zimmerman, “Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication,” IEEE Trans. Comput., pp.1389-1399, 2002.
[10] Yutai Ma, “A Simplified Architecture for Modulo (2^n + 1) Multiplication,” IEEE Transactions on Computers, vol.47, no.3, pp.333-337, Mar.1998.
[11] V. Curiger, H. Bonnenherg, and H. Kaeshi, “Regular VLSI architectures for multiplication modulo (2^n + 1),” IEEE J. Solid-State Circuit, vol.26, pp.990-994, Jul.1991.
[12] A. Wrzyszcz and D. Milford, “A new modulo 2^a+1 multiplier,” Int. Conf. Computer Design (ICCD3), pp.614-617, 1995.
[13] H.T. Vergos and C. Efstathiou, “Design of efficient modulo 2^n + 1 multipliers,” IET Comput. Digit. Tech., vol.1, No.1, pp.49-57, 2007.
[14] Yong-Bin Kim, “Integrated circuit design based on carbon nanotube field effect transistor,” IEEE Journal of Trans. On EE Materials, vol.12, No.5, pp.175-188, Oct.25, 2011.
[15] S.A. Tawfik, Z. Liu, and V. Kursun, “Independent-gate and tied-gate FinFet SRAM circuits: Design guidelines for reduced area and enhanced stability,” International Conference on Microelectronics, pp.171–174, 2007.
[16] Behzad Ebrahimi, Masoud Rostami, Ali Afzali-Kusha and Massoud Pedram, “Statistical Design Optimization of FinFET SRAM Using Back-Gate Voltage,” IEEE Transactions on VLSI Systems, vol.19, Issue.10, pp.1911–1916, Oct.2011.
[17] R. Chau, “Benchmarking nanotechnology for high-performance and low-power logic transistor applications,” IEEE Transactions on Nanotechnology, vol.4, Issue.2, pp.153–158, Aug.2004.
[18] C.S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Electronic Computers, vol.EC-13, pp.14-17, 1964.
[19] Zhongde Wang, G.A. Jullien and W.C. Miller, “An Efficient Tree Architecture for Modulo 2^n + 1 Multiplication,” University of Windsor, 401 Sunset, and Windsor, Ontario N9B3P4, Canada, Received Mar.11, 1996.
[20] Mounir Bohsali, Michael Doan, “Rectangular Styled Wallace Tree Multipliers,” Berkeley University.
[21] Mahnoush Rouholamini, Omid Kavehie, Amir-Pasha Mirbaha, Somaye Jafarali Jasbi, “A New Design for 7:2 Compressors,” Computer Systems and Applications, AICCSA '07, IEEE/ACS, 2007.
[22] H.T. Vergos, C. Efstathiou, “Efficient modulo 2^n+1 adder architectures,” Computer Engineering and Informatics Department, University of Patras, 26500 Patras, Greece; Informatics Department, ATEI of Athens, 12210 Egaleo, Athens, Greece.
[23] H.T. Vergos, C. Efstathiou, and D. Nikolos, “Diminished-one modulo 2^n+1 adder design,” IEEE Trans. Comput., pp.1389–1399, 2002.
[24] Radu Zlatanovici, Sean Kao, and Borivoje Nikolic, “Energy–Delay Optimization of 64-Bit Carry-Lookahead Adders With a 240 ps 90 nm CMOS Design Example,” IEEE Journal of Solid-State, vol.44, No.2, Feb.2009.
[25] K. Prasad and K. K. Parhi, “Low-power 4-2 and 5-2 compressors,” in Proc. of the 35th Asilomar Conf. on Signals, Systems and Computers, vol.1, pp.129–133, 2001.
[26] Chip-Hong Chang, Jiangmin Gu, and Mingyan Zhang, “Ultra Low-Voltage Low-Power CMOS 4-2 and 5-2 Compressors for Fast Arithmetic Circuits,” IEEE Transactions on Circuits and Systems, vol.51, No.10, Oct.2004.
[27] V. Sreehari, M. Kirthi, A. Lingamneni, and R. Sreekanth, “Novel architectures for high-speed and low-power 3-2, 4-2 and 5-2 compressor,” IEEE 20th International Conference on VLSI Design, pp.324-329, Jan.2007.
[28] Sander J. Tans, Alwin R. M. Verschueren, and Cees Dekker, “Room-temperature transistor based on a single carbon nanotube,” Nature, vol.393, Issue.6680, pp.49-52, 1998.
[29] Stanford University CNFET website, http://nano.stanford.edu/model.php?id=23.
[30] Jie Deng, “Carbon nanotube transistor circuits: Circuit-level performance benchmarking and design options for living with imperfections,” International Solid-State Circuits Conference, pp.70-71, San Francisco, CA, Feb.2007.
[31] Ouyang Yijian and Guo Jing, “Heat dissipation in carbon nanotube transistors,” Applied Physics Letters, vol.89, Issue.18, Oct.2006.
Appendices
Appendix A. HSpice Code
1.1 HSpice code for modulo 2^n+1 squarer
Weifu Li Modulo_Squarer
******************************************
.include 'PTM_customized_32nm_nom.lib'
x1 PP07 PP06 PP05 PP04 PP03 PP02 PP01 PP00
+PP17 PP16 PP15 PP14 PP13 PP12 PP11
+PP27 PP26 PP25 PP24 PP23 PP22
+PP37 PP36 PP35 PP34 PP33
+PP47 PP46 PP45 PP44
+PP57 PP56 PP55
+PP67 PP66
+PP77
+a0 a1 a2 a3 a4 a5 a6 a7
+vdd
+PartialProduct
*****The first_round******
x2 vdd PP67 PP77 PP00 Sum00 Carry00 Compressor32_bar
x3 vdd PP06 PP15 PP24 Sum10 Carry10 Compressor32
x4 vdd PP07 PP16 PP25 Sum01 Carry01 Compressor32
x5 vdd PP34 PP44 0 Sum11 Carry11 Compressor32
x6 vdd PP17 PP26 PP35 Sum02 Carry02 Compressor32
x7 vdd PP01 PP11 0 Sum12 Carry12 Compressor32_bar
x8 vdd PP27 PP36 PP45 Sum03 Carry03 Compressor32
x9 vdd PP55 0 PP02 Sum13 Carry13 Compressor32_bari
****only one input needs to be reversed: PP02****
x10 vdd PP37 PP46 0 Sum04 Carry04 Compressor32
x11 vdd PP03 PP12 PP22 Sum14 Carry14 Compressor32_bar
x12 vdd PP47 PP56 PP66 Sum05 Carry05 Compressor32
x13 vdd PP04 PP13 vdd Sum15 Carry15 Compressor32_bar
x14 vdd PP57 0 PP05 Sum06 Carry06 Compressor32_bari
x15 vdd PP14 PP23 PP33 Sum16 Carry16 Compressor32_bar
*****The second_round*****
x16 vdd Sum00 Sum10 Carry06 Sum20 Carry20 Compressor32_bari
x17 vdd Sum01 Sum11 Carry00 Sum21 Carry21 Compressor32
x18 vdd Sum02 Sum12 Carry01 Sum22 Carry22 Compressor32
x19 vdd Sum03 Sum13 Carry02 Sum23 Carry23 Compressor32
x20 vdd Sum04 Sum14 Carry03 Sum24 Carry24 Compressor32
x21 vdd Sum05 Sum15 Carry04 Sum25 Carry25 Compressor32
x22 vdd Sum06 Sum16 Carry05 Sum26 Carry26 Compressor32
*****The final_round*****
x23 vdd Carry16_D Carry26 Sum20 Sum30 Carry30 Compressor32_barii
x24 vdd Carry10_D Carry20 Sum21 Sum31 Carry31 Compressor32
x25 vdd Carry11_D Carry21 Sum22 Sum32 Carry32 Compressor32
x26 vdd Carry12_D Carry22 Sum23 Sum33 Carry33 Compressor32
x27 vdd Carry13_D Carry23 Sum24 Sum34 Carry34 Compressor32
x28 vdd Carry14_D Carry24 Sum25 Sum35 Carry35 Compressor32
x29 vdd Carry15_D Carry25 Sum26 Sum36 Carry36 Compressor32
x30 vdd Carry36 Carry36_bar inverter
x31 vdd Carry16 Carry16_D buffer_chain
x32 vdd Carry10 Carry10_D buffer_chain
x33 vdd Carry11 Carry11_D buffer_chain
x34 vdd Carry12 Carry12_D buffer_chain
x35 vdd Carry13 Carry13_D buffer_chain
x36 vdd Carry14 Carry14_D buffer_chain
x37 vdd Carry15 Carry15_D buffer_chain
*******************************************************
***************Final Sparse Adder*********************
*******************************************************
x38 vdd
+Sum30 Sum31 Sum32 Sum33 Sum34 Sum35 Sum36 Sum36
+Carry36_bar Carry30 Carry31 Carry32 Carry33 Carry34 Carry35 Carry35
+S0 S1 S2 S3 S4 S5 S6
+ Modulo_Sparse
x39 vdd S0 S0_fanout fanout
x40 vdd S1 S1_fanout fanout
x41 vdd S2 S2_fanout fanout
x42 vdd S3 S3_fanout fanout
x43 vdd S4 S4_fanout fanout
x44 vdd S5 S5_fanout fanout
x45 vdd S6 S6_fanout fanout
1.2 HSpice code for modulo 2^n+1 adder
Weifu Li Modulo_Adder
************The Sparse Tree Adder**************
.subckt Modulo_Sparse vdd
+a0 a1 a2 a3 a4 a5 a6 a7
+b0 b1 b2 b3 b4 b5 b6 b7
+S0 S1 S2 S3 S4 S5 S6
x1 vdd a0 b0 g0 nand
x2 vdd a1 b1 g1 nand
x3 vdd a2 b2 g2 nand
x4 vdd a3 b3 g3 nand
x5 vdd a4 b4 g4 nand
x6 vdd a5 b5 g5 nand
x7 vdd a6 b6 g6 nand
x35 vdd a7 b7 g7 nand
x8 vdd a0 b0 p0 nor
x9 vdd a1 b1 p1 nor
x10 vdd a2 b2 p2 nor
x11 vdd a3 b3 p3 nor
x12 vdd a4 b4 p4 nor
x13 vdd a5 b5 p5 nor
x14 vdd a6 b6 p6 nor
x36 vdd a7 b7 p7 nor
********************************************************
x15 vdd g0 p1 g1 g10 IOA
x16 vdd g2 p3 g3 g32 IOA
x17 vdd g4 p5 g5 g54 IOA
x18 vdd g6 p7 g7 g76 IOA
x19 vdd p0 p1 p10 nor
x20 vdd p2 p3 p32 nor
x21 vdd p4 p5 p54 nor
x22 vdd p6 p7 p76 nor
********************************************************
x23 vdd g10 p32 g32 g30 AOI
x24 vdd g54 p76 g76 g74 AOI
x25 vdd p76 p54 p74 nand
x26 vdd p32 p10 p30 nand
x27 vdd g30 p74 g74 C7_bari IOA
x28 vdd p30 g74_bar g30 C3i IOA
x29 vdd g74 g74_bar inverter
x31 vdd C3i C3_bar inverter
x32 vdd C7_bari C7 inverter
x40 vdd C7 C7_bar inverter
x41 vdd C3_bar C3 inverter
x33 vdd
+a0 a1 a2 a3
+b0 b1 b2 b3
+S0 S1 S2 S3
+C7 C7_bar Conditional_Sum
x34 vdd
+a4 a5 a6 a7
+b4 b5 b6 b7
+S4 S5 S6 x
+C3 C3_bar Conditional_Sum
.ends
************The Conditional Adder**************
.subckt Conditional_Sum vdd
+a0 a1 a2 a3
+b0 b1 b2 b3
+S0 S1 S2 S3 Cin Cin_bar
x1 vdd a0 b0 xor_0 xnor_0 xor_xnor
x2 vdd a1 b1 xor_1 xor_s
x3 vdd a2 b2 xor_2 xor_s
x4 vdd a3 b3 xor_3 xor_s
x5 vdd a0 b0 g0 nand
x6 vdd a1 b1 g1 nand
x7 vdd a2 b2 g2 nand
x8 vdd a3 b3 g3 nand
x9 vdd a0 b0 p0 nor
x10 vdd a1 b1 p1 nor
x11 vdd a2 b2 p2 nor
x12 vdd a3 b3 p3 nor
x101 vdd g0 g0_bar inverter
x102 vdd p0 p0_bar inverter
x103 vdd g2 g2_bar inverter
x104 vdd p2 p2_bar inverter
x13 vdd xor_1 g0_bar mux10 xor_s
x14 vdd xor_1 p0_bar mux11 xor_s
x15 vdd p0 p1 g1 Out_2 IOA
x16 vdd Out_2_bar xor_2_bar mux21 xor_s
x17 vdd p1 g0 g1 Out_4 IOA
x18 vdd Out_4 xor_2 mux20 xor_s
x30 vdd xor_2 xor_2_bar inverter
x31 vdd xor_3 xor_3_bar inverter
x32 vdd Out_2 Out_2_bar inverter
x19 vdd p2_bar Out_2 g2_bar Out_6 AOI
x20 vdd Out_6 xor_3_bar mux31 xor_s
x21 vdd p2_bar Out_4 g2_bar Out_8 AOI
x22 vdd Out_8 xor_3_bar mux30 xor_s
x23 vdd xor_0_b xnor_0_b Cin Cin_bar S0 mux
x24 vdd mux10 mux11 Cin Cin_bar S1 mux
x25 vdd mux20 mux21 Cin Cin_bar S2 mux
x26 vdd mux30 mux31 Cin Cin_bar S3 mux
x27 vdd xor_0 xor_0_b buffer
x28 vdd xnor_0 xnor_0_b buffer
.ends
1.3 HSpice code for partial product Matrix
****************The Partial Product Generator****************
.subckt PartialProduct
+PP07 PP06 PP05 PP04 PP03 PP02 PP01 PP00
+PP17 PP16 PP15 PP14 PP13 PP12 PP11
+PP27 PP26 PP25 PP24 PP23 PP22
+PP37 PP36 PP35 PP34 PP33
+PP47 PP46 PP45 PP44
+PP57 PP56 PP55
+PP67 PP66
+PP77
+a0 a1 a2 a3 a4 a5 a6 a7
+vdd
x1 vdd a0 a0 PP00 nand
x2 vdd a0 a1 PP01 nand
x3 vdd a0 a2 PP02 nand
x4 vdd a0 a3 PP03 nand
x5 vdd a0 a4 PP04 nand
x6 vdd a0 a5 PP05 nand
x7 vdd a0 a6 PP06 nand
x8 vdd a0 a7 PP07 nand
x9 vdd a1 a1 PP11 nand
x10 vdd a1 a2 PP12 nand
x11 vdd a1 a3 PP13 nand
x12 vdd a1 a4 PP14 nand
x13 vdd a1 a5 PP15 nand
x14 vdd a1 a6 PP16 nand
x15 vdd a1 a7 PP17 nand
x16 vdd a2 a2 PP22 nand
x17 vdd a2 a3 PP23 nand
x18 vdd a2 a4 PP24 nand
x19 vdd a2 a5 PP25 nand
x20 vdd a2 a6 PP26 nand
x21 vdd a2 a7 PP27 nand
x22 vdd a3 a3 PP33 nand
x23 vdd a3 a4 PP34 nand
x24 vdd a3 a5 PP35 nand
x25 vdd a3 a6 PP36 nand
x26 vdd a3 a7 PP37 nand
x27 vdd a4 a4 PP44 nand
x28 vdd a4 a5 PP45 nand
x29 vdd a4 a6 PP46 nand
x30 vdd a4 a7 PP47 nand
x31 vdd a5 a5 PP55 nand
x32 vdd a5 a6 PP56 nand
x33 vdd a5 a7 PP57 nand
x34 vdd a6 a6 PP66 nand
x35 vdd a6 a7 PP67 nand
x36 vdd a7 a7 PP77 nand
.ends
1.4 Traditional Partial Product Process
Weifu Li
***************************
.include 'PTM_customized_32nm_nom.lib'
x1 PP07 PP06 PP05 PP04 PP03 PP02 PP01 PP00
+PP17 PP16 PP15 PP14 PP13 PP12 PP11
+PP27 PP26 PP25 PP24 PP23 PP22
+PP37 PP36 PP35 PP34 PP33
+PP47 PP46 PP45 PP44
+PP57 PP56 PP55
+PP67 PP66
+PP77
+a0 a1 a2 a3 a4 a5 a6 a7
+vdd
+PartialProduct
x2 vdd PP07 PP16 PP16R nand
x3 vdd PP17 PP26 PP26R nand
x4 vdd PP27 PP36 PP36R nand
x5 vdd PP37 PP46 PP46R nand
x6 vdd PP47 PP56 PP56R nand
x7 vdd PP57 PP66 PP66R nand
x8 vdd PP67 PP67R inverter
x9 vdd PP67R PP06 PP06R nand
x10 vdd PP77 PP00 PP67 PP00R nand_3
x11 vdd PP16R PP16Ri inverter
x12 vdd PP00R PP00Ri inverter
x13 vdd PP26R PP26Ri inverter
x14 vdd PP24_b PP16_b PP15 PP16Ri PP00Ri Sum00 Carry00 Cout00 Compressor42
x15 vdd PP44 PP34 PP26 PP25 PP26Ri Sum01 Carry01 Cout01 Compressor42
x101 vdd PP24_bb PP24_b buffer
x102 vdd PP16_bb PP16_b buffer
x103 vdd PP16 PP16_bb buffer
x104 vdd PP24 PP24_bb buffer
x16 vdd PP11 PP11i inverter
x17 vdd PP36R PP36Ri inverter
x18 vdd PP01 PP01i inverter
x19 vdd PP36 PP35 PP01i PP11i PP36Ri Sum02 Carry02 Cout02 Compressor42
x20 vdd PP46R PP46Ri inverter
x21 vdd PP02 PP02i inverter
x22 vdd PP46 PP45 PP02i PP55i PP46Ri Sum03 Carry03 Cout03 Compressor42
x23 vdd PP22 PP22i inverter
x24 vdd PP03 PP03i inverter
x25 vdd PP12 PP12i inverter
x26 vdd PP56R PP56Ri inverter
x27 vdd PP56 PP12i PP03i PP22i PP56Ri Sum04 Carry04 Cout04 Compressor42
x28 vdd PP23 PP23i inverter
x29 vdd PP04 PP04i inverter
x30 vdd PP13 PP13i inverter
x31 vdd PP66R PP66Ri inverter
x32 vdd PP23i PP23i PP04i PP13i PP66Ri Sum05 Carry05 Cout05 Compressor42
x33 vdd PP33 PP33i inverter
x34 vdd PP06 PP06i inverter
x35 vdd PP05 PP05i inverter
x36 vdd PP14 PP14i inverter
x38 vdd PP33i PP06i PP05i PP14i PP06R Sum06 Carry06 Cout06 Compressor42
************************The Second Round***********************
x39 vdd Carry06 Sum00 Cout06 Sum10 Carry10 Compressor32_barii
x40 vdd Carry00 Sum01 Cout00 Sum11 Carry11 Compressor32
x41 vdd Carry01 Sum02 Cout01 Sum12 Carry12 Compressor32
x42 vdd Carry02 Sum03 Cout02 Sum13 Carry13 Compressor32
x43 vdd Carry03 Sum04 Cout03 Sum14 Carry14 Compressor32
x44 vdd Carry04 Sum05 Cout04 Sum15 Carry15 Compressor32
x45 vdd Carry05 Sum06 Cout05 Sum16 Carry16 Compressor32
****************************************************************
x46 vdd 0 Sum10 Carry16 Sum20 Carry20 Compressor32_bari
x47 vdd vdd Sum11 Carry10 Sum21 Carry21 Compressor32
x48 vdd 0 Sum12 Carry11 Sum22 Carry22 Compressor32
x49 vdd 0 Sum13 Carry12 Sum23 Carry23 Compressor32
x50 vdd 0 Sum14 Carry13 Sum24 Carry24 Compressor32
x51 vdd 0 Sum15 Carry14 Sum25 Carry25 Compressor32
x52 vdd 0 Sum16 Carry15 Sum26 Carry26 Compressor32
1.5 HSpice code for Compressors
Weifu Li
***************************
****************The 32 Compressor Group****************
.subckt Compressor32_bar vdd x y z Sum Carry
x1 vdd x y xor xnor xor_xnor
x2 vdd xor xnor z_bar z Sum mux
x3 vdd x_bar z_bar xor xnor Carry mux
x4 vdd x x_bar inverter
x5 vdd z z_bar inverter
.ends
.subckt Compressor32 vdd x y z Sum Carry
x1 vdd x y xor xnor xor_xnor
x2 vdd xor xnor z z_bar Sum mux
x3 vdd x_buffer z_buffer xor xnor Carry mux
x4 vdd z z_bar inverter
x5 vdd x x_buffer buffer
x6 vdd z z_buffer buffer
.ends
.subckt Compressor32_bari vdd x y z Sum Carry
x1 vdd x y xor xnor xor_xnor
x2 vdd xor xnor z_bar z_buff Sum mux
x3 vdd x z_bar xor xnor Carry mux
x4 vdd z_buff z_bar inverter
x5 vdd z_b z_buff buffer
x6 vdd z z_b buffer
.ends
.subckt Compressor32_barii vdd x y z Sum Carry
x1 vdd x y xor xnor xor_xnor
x2 vdd xor xnor z z_bar Sum mux
x3 vdd x_bar z xor xnor Carry mux
x4 vdd x x_bar inverter
x5 vdd z z_bar inverter
.ends
.subckt Compressor72 vdd
+x1 x2 x3 x4 x5 x6 x7 Cin1 Cin2
+Sum Carry Cout1 Cout2
x1 vdd x5 x6 x7 c2 CGEN
x2 vdd x5 x6 s2 xor
x3 vdd x2 x3 x4 c1 CGEN
x4 vdd x2 x3 s1 xor
x5 vdd c2 c1 c3 xor
x6 vdd s2 x7 s4 xor
x7 vdd s1 x4 s3 xor
x8 vdd s3 s4 s5 xor
x9 vdd s4 s3 x1 c4 CGEN
x10 vdd c3 c4 Cout1 xor
x11 vdd c1 c4 c3 c3_bar Cout2 mux
x12 vdd c3 c3_bar inverter
x13 vdd s5 x1 s6 xor
x14 vdd s6 Cin2 s7 xor
x15 vdd s6 Cin1 s7 s7_bar Carry mux
x17 vdd s7 s7_bar inverter
x16 vdd Cin1 s7 Sum xor
.ends
.subckt Compressor42 vdd x1 x2 x3 x4
+Cin Sum Carry Cout
x1 vdd x1 x2 1 2 xor_xnor
x2 vdd x1 x3 1 2 Cout mux_single
x3 vdd x3 x4 3 4 xor_xnor
x4 vdd 1 2 3 4 5 6 mux
x5 vdd 5 6 Cin Cin_bar Sum mux_single
x6 vdd x4 Cin 5 6 Carry mux_single
x7 vdd Cin Cin_bar inverter
.ends
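As a cross-check on the netlists above, the logic of the non-inverting 3:2 compressor can be modeled behaviorally. The Python sketch below is not part of the thesis code; it assumes the usual XOR/MUX structure suggested by the Compressor32 netlist (Sum selects between z and its complement, Carry selects between x and z, both controlled by x XOR y). The port order of the mux subcircuit is not shown in this section, so the select polarity is an assumption.

```python
from itertools import product


def compressor32(x: int, y: int, z: int):
    """Behavioral model of a 3:2 compressor built from one XOR stage and two muxes."""
    xor_xy = x ^ y
    total_sum = xor_xy ^ z         # Sum mux: z_bar when x XOR y is 1, else z
    carry = z if xor_xy else x     # Carry mux: z when x XOR y is 1, else x
    return total_sum, carry


# Exhaustive check against the arithmetic definition x + y + z = Sum + 2*Carry.
for x, y, z in product((0, 1), repeat=3):
    s, c = compressor32(x, y, z)
    assert x + y + z == s + 2 * c
    print(f"x={x} y={y} z={z} -> Sum={s} Carry={c}")
```

The `_bar`, `_bari` and `_barii` variants differ only in which inputs are taken in complemented form, so they are not modeled separately here.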
1.6 HSpice code for Proposed FPP
.subckt Modulo_FPP vdd +a0 a1 a2 a3 a4 a5 a6 a7 +b0 b1 b2 b3 b4 b5 b6 b7 +S0 S1 S2 S3 S4 S5 S6 x1 vdd a0 b0 g0 nand x2 vdd a1 b1 g1 nand x3 vdd a2 b2 g2 nand x4 vdd a3 b3 g3 nand x5 vdd a4 b4 g4 nand x6 vdd a5 b5 g5 nand x7 vdd a6 b6 g6 nand x23 vdd a7 b7 g7 nand x8 vdd a0 b0 p0 nor x9 vdd a1 b1 p1 nor x10 vdd a2 b2 p2 nor x11 vdd a3 b3 p3 nor x12 vdd a4 b4 p4 nor x13 vdd a5 b5 p5 nor x14 vdd a6 b6 p6 nor x24 vdd a7 b7 p7 nor x15 vdd g7 g6 G77 nand x16 vdd g6 g5 G66 nand x17 vdd g5 g4 G55 nand x18 vdd g4 g3 G44 nand x19 vdd g3 g2 G33 nand
76
x20 vdd g2 g1 G22 nand x21 vdd g1 g0 G11 nand x22 vdd g0 p7_bar G00 nand x25 vdd p7 p6 P77 nor x26 vdd p6 p5 P66 nor x27 vdd p5 p4 P55 nor x28 vdd p4 p3 P44 nor x29 vdd p3 p2 P33 nor x30 vdd p2 p1 P22 nor x31 vdd p1 p0 P11 nor x32 vdd p0 g7_bar P00 nor x33 vdd g7 g7_bar inverter x34 vdd p7 p7_bar inverter x35 vdd G77_bar P66_bar G00 G760 AOI x36 vdd G77_bar G55_bar P75 nand x37 vdd G33_bar P22_bar P44_bar G432 AOI x38 vdd P75 G432 G760 H0 IOA x39 vdd P00 P77_bar G11 G071 AOI x40 vdd P00 G66_bar P06 nand x41 vdd G44_bar P33_bar P55_bar G435 AOI x42 vdd P06 G435 G071 H1 IOA x43 vdd P11 G00 G22 G102 AOI x44 vdd P11 G77_bar P17 nand x45 vdd G55_bar P44_bar P66_bar G546 AOI x46 vdd P17 G546 G102 H2 IOA x47 vdd P22 G11 G33 G213 AOI x48 vdd P22 P00 P20 nand x49 vdd G66_bar P55_bar P77_bar G657 AOI x50 vdd P20 G657 G213 H3 IOA x51 vdd P33 G22 G44 G324 AOI x52 vdd P33 P11 P31 nand x53 vdd P31 G760 G324 H4 IOA x54 vdd P44 G33 G55 G435 AOI x55 vdd P44 P22 P42 nand x56 vdd P42 G071 G435 H5 IOA x57 vdd P55 G44 G66 G546 AOI
77
x58 vdd P55 P33 P53 nand x59 vdd P53 G102 G546 H6 IOA x60 vdd P66 G55 G77 G657 AOI x61 vdd P66 P44 P64 nand x62 vdd P64 G213 G657 H7 IOA x63 vdd a0 b0 d0 xor x64 vdd a1 b1 d1 xor x65 vdd a2 b2 d2 xor x66 vdd a3 b3 d3 xor x67 vdd a4 b4 d4 xor x68 vdd a5 b5 d5 xor x69 vdd a6 b6 d6 xor x70 vdd a7 b7 d7 xor x72 vdd H0 H0_bar inverter x73 vdd H1 H1_bar inverter x74 vdd H2 H2_bar inverter x75 vdd H3 H3_bar inverter x76 vdd H4 H4_bar inverter x77 vdd H5 H5_bar inverter x78 vdd H6 H6_bar inverter x79 vdd H7 H7_bar inverter x80 vdd d0_par p7_bar z0 xor x81 vdd d1_bar p0 z1 xor x82 vdd d2_bar p1 z2 xor x83 vdd d3_bar p2 z3 xor x84 vdd d4_bar p3 z4 xor x85 vdd d5_bar p4 z5 xor x86 vdd d6_bar p5 z6 xor x87 vdd d7_bar p6 z7 xor x88 vdd d0 d0_bar inverter x89 vdd d1 d1_bar inverter x90 vdd d2 d2_bar inverter x91 vdd d3 d3_bar inverter x92 vdd d4 d4_bar inverter x93 vdd d5 d5_bar inverter x94 vdd d6 d6_bar inverter x95 vdd d7 d7_bar inverter x96 vdd G77 G77_bar inverter x97 vdd G66 G66_bar inverter
x98 vdd G55 G55_bar inverter
x99 vdd G44 G44_bar inverter
x100 vdd G33 G33_bar inverter
x101 vdd G22 G22_bar inverter
x102 vdd G11 G11_bar inverter
x103 vdd G00 G00_bar inverter
x104 vdd P77 P77_bar inverter
x105 vdd P66 P66_bar inverter
x106 vdd P55 P55_bar inverter
x107 vdd P44 P44_bar inverter
x108 vdd P33 P33_bar inverter
x109 vdd P22 P22_bar inverter
x110 vdd P11 P11_bar inverter
x111 vdd P00 P00_bar inverter
x112 vdd d0_bar z0 H7 H7_bar S0 mux
x113 vdd d1 z1 H0 H0_bar S1 mux
x114 vdd d2 z2 H1 H1_bar S2 mux
x115 vdd d3 z3 H2 H2_bar S3 mux
x116 vdd d4 z4 H3 H3_bar S4 mux
x117 vdd d5 z5 H4 H4_bar S5 mux
x118 vdd d6 z6 H5 H5_bar S6 mux
x119 vdd d7 z7 H6 H6_bar x mux
.ends
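When comparing simulated outputs of the adder netlist that ends above, a behavioural reference can be useful. The sketch below assumes that the netlist implements the inverted End-Around-Carry modulo 2^n+1 adder named in the abstract, that the operands are in diminished-1 form, and that n = 8; none of this is stated in the netlist itself, and the function name `d1_add_mod` is hypothetical. It covers nonzero results only, because a zero result cannot be represented in n diminished-1 bits and needs a separate flag.

```python
def d1_add_mod(a_d1: int, b_d1: int, n: int = 8) -> int:
    """Diminished-1 modulo 2^n+1 addition with an inverted end-around carry.

    a_d1 and b_d1 are n-bit diminished-1 operands (true value minus 1).
    Returns the n-bit diminished-1 sum. A true result of zero is not
    representable here and has to be detected separately.
    """
    mask = (1 << n) - 1
    total = a_d1 + b_d1
    carry_out = total >> n              # carry out of the n-bit addition (0 or 1)
    # Feed the inverted carry-out back in as the carry-in.
    return (total + (1 - carry_out)) & mask


if __name__ == "__main__":
    n = 8
    modulus = (1 << n) + 1
    # Exhaustive comparison against ordinary modular arithmetic.
    for a in range(1, modulus):
        for b in range(1, modulus):
            s = (a + b) % modulus
            if s != 0:                  # zero results are outside this model
                assert d1_add_mod(a - 1, b - 1, n) == s - 1
    print("reference model agrees with (a + b) mod (2^n + 1)")
```

The exhaustive loop is cheap for n = 8 (65,536 operand pairs), so the same function can serve as a golden model when checking individual simulation vectors.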
1.7 HSpice code for Sub-circuit on CMOS technology
******************************************************************
**************************The Subckt******************************
******************************************************************
*******Buffer*******
.subckt buffer vdd input output
m1 input_bar input vdd vdd pmos w=64n l=32n
m2 input_bar input 0 0 nmos w=32n l=32n
m3 output input_bar vdd vdd pmos w=128n l=32n
m4 output input_bar 0 0 nmos w=64n l=32n
.ends
*****Buffer Chain*****
.subckt buffer_chain vdd input Output
X1 vdd input input_D buffer
X2 vdd input_D input_DD buffer
X3 vdd input_DD input_DDD buffer
X4 vdd input_DDD Output buffer
.ends
************2 Input IOA**************
.subckt IOA vdd a b c Out
m1 1 a vdd vdd pmos w=128n l=32n
m2 Out b 1 vdd pmos w=128n l=32n
m3 Out c vdd vdd pmos w=64n l=32n
m4 Out c 2 0 nmos w=64n l=32n
m5 2 a 0 0 nmos w=64n l=32n
m6 2 b 0 0 nmos w=64n l=32n
.ends
************2 Input AOI**************
.subckt AOI vdd a b c Out
m1 1 a vdd vdd pmos w=128n l=32n
m2 1 b vdd vdd pmos w=128n l=32n
m3 Out c 1 vdd pmos w=128n l=32n
m4 Out c 0 0 nmos w=32n l=32n
m5 Out a 2 0 nmos w=64n l=32n
m6 2 b 0 0 nmos w=64n l=32n
.ends
************2 Input XOR**************
.subckt xor vdd a b Out
m1 a b Out_bar 0 nmos w=32n l=32n
m2 b a Out_bar 0 nmos w=32n l=32n
m3 Out_bar a 1 vdd pmos w=64n l=32n
m4 1 b vdd vdd pmos w=64n l=32n
m5 Out Out_bar vdd vdd pmos w=64n l=32n
m6 Out Out_bar 0 0 nmos w=32n l=32n
.ends
************2 Input XOR_S**************
.subckt xor_s vdd a b Out
x1 vdd a a_bar inverter
x2 vdd b b_bar inverter
m1 b a_bar Out 0 nmos w=32n l=32n
m2 Out a b vdd pmos w=64n l=32n
m3 b_bar a Out 0 nmos w=32n l=32n
m4 Out a_bar b_bar vdd pmos w=64n l=32n
.ends
************2 Input XNOR**************
.subckt xnor vdd a b Out
m1 a b Out 0 nmos w=32n l=32n
m2 b a Out 0 nmos w=32n l=32n
m3 Out a 1 vdd pmos w=64n l=32n
m4 1 b vdd vdd pmos w=64n l=32n
.ends
************2 Input NOR**************
.subckt nor vdd a b Out
m1 1 a vdd vdd pmos w=128n l=32n
m2 Out b 1 vdd pmos w=128n l=32n
m3 Out a 0 0 nmos w=32n l=32n
m4 Out b 0 0 nmos w=32n l=32n
.ends
************2 Input NAND**************
.subckt nand vdd a b Output
m1 Output a vdd vdd pmos w=64n l=32n
m2 Output b vdd vdd pmos w=64n l=32n
m3 Output a 1 0 nmos w=64n l=32n
m4 1 b 0 0 nmos w=64n l=32n
.ends
************The Inverter**************
.subckt inverter vdd input Output
m1 Output input vdd vdd pmos w=64n l=32n
m2 Output input 0 0 nmos w=32n l=32n
.ends
**************3 Input nand****************
.subckt nand_3 vdd a b c Output
m1 Output a vdd vdd pmos w=64n l=32n
m2 Output b vdd vdd pmos w=64n l=32n
m3 Output c vdd vdd pmos w=64n l=32n
m4 Output a 1 0 nmos w=96n l=32n
m5 1 b 2 0 nmos w=96n l=32n
m6 2 c 0 0 nmos w=96n l=32n
.ends
************2 Input XOR_XNOR**************
.subckt xor_xnor vdd a b xor xnor
m1 a b xor vdd pmos L=32n W=64n
m2 xor a b vdd pmos L=32n W=64n
m3 xor b 1 0 nmos L=32n W=32n
m4 1 a 0 0 nmos L=32n W=32n
m5 xnor xor vdd vdd pmos L=32n W=64n
m6 xor xnor 0 0 nmos L=32n W=32n
m7 2 b vdd vdd pmos L=32n W=64n
m8 xnor a 2 vdd pmos L=32n W=64n
m9 a b xnor 0 nmos L=32n W=32n
m10 xnor a b 0 nmos L=32n W=32n
.ends
************2-1 Mux**************
.subckt mux vdd a b sel sel_bar Output
m1 1 b vdd vdd pmos w=64n l=32n
m2 Out_bar sel_bar 1 vdd pmos w=64n l=32n
m3 Out_bar sel 2 0 nmos w=32n l=32n
m4 2 b 0 0 nmos w=32n l=32n
m5 3 a vdd vdd pmos w=64n l=32n
m6 Out_bar sel 3 vdd pmos w=64n l=32n
m7 Out_bar sel_bar 4 0 nmos w=32n l=32n
m8 4 a 0 0 nmos w=32n l=32n
m9 Output Out_bar vdd vdd pmos w=64n l=32n
m10 Output Out_bar 0 0 nmos w=32n l=32n
*****Sel=0 Output = a*****
.ends
************ fanout **************
.subckt fanout vdd input output
x1 vdd input output inverter
x2 vdd input output inverter
x3 vdd input output inverter
x4 vdd input output inverter
.ends
************ CGEN **************
.subckt CGEN vdd a b cin Carry
m1 1 b vdd vdd pmos l=32n w=128n
m2 1 a vdd vdd pmos l=32n w=128n
m3 2 cin 1 vdd pmos l=32n w=128n
m4 2 cin 3 0 nmos l=32n w=64n
m5 3 b 0 0 nmos l=32n w=64n
m6 3 a 0 0 nmos l=32n w=64n
m7 4 b vdd vdd pmos l=32n w=128n
m8 2 a 4 vdd pmos l=32n w=128n
m9 2 a 6 0 nmos l=32n w=64n
m10 6 b 0 0 nmos l=32n w=64n
x1 vdd 2 carry inverter
.ends
************ Dual-Output Mux **************
.subckt mux_dual vdd a b set set_bar out outbar
m1 a set out 0 nmos W=32n L=32n
m4 a set_bar outbar 0 nmos W=32n L=32n
m2 b set_bar out 0 nmos W=32n L=32n
m3 b set outbar 0 nmos W=32n L=32n
m5 out outbar vdd vdd pmos W=64n L=32n
m6 outbar out vdd vdd pmos W=64n L=32n
.ends
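The logic function of the static gates listed in this section can be checked at switch level before running HSpice. The sketch below is an idealised model, not part of the thesis code: each PMOS conducts when its gate is 0, each NMOS conducts when its gate is 1, and sizing and timing are ignored. The helper names (`resolve`, `ioa`, `aoi`, `mux`, `cgen`) are hypothetical; the pull-up and pull-down expressions are read off the transistor connections of the IOA, AOI, mux and CGEN sub-circuits.

```python
from itertools import product


def resolve(pull_up: bool, pull_down: bool) -> int:
    """Logic level of a static node; exactly one network may conduct."""
    assert pull_up != pull_down, "node would float or be contended"
    return int(pull_up)


def ioa(a, b, c):
    up = (not a and not b) or (not c)      # m1-m2 in series, m3 in parallel
    down = c and (a or b)                  # m4 in series with m5 || m6
    return resolve(bool(up), bool(down))


def aoi(a, b, c):
    up = (not a or not b) and (not c)      # m1 || m2 in series with m3
    down = c or (a and b)                  # m4 in parallel with m5-m6
    return resolve(bool(up), bool(down))


def mux(a, b, sel):
    sel_bar = 1 - sel
    up = (not b and not sel_bar) or (not a and not sel)   # m1-m2, m5-m6
    down = (sel and b) or (sel_bar and a)                 # m3-m4, m7-m8
    return 1 - resolve(bool(up), bool(down))              # m9/m10 inverter


def cgen(a, b, cin):
    up = (not cin and (not a or not b)) or (not a and not b)  # m1-m3, m7-m8
    down = (cin and (a or b)) or (a and b)                    # m4-m6, m9-m10
    return 1 - resolve(bool(up), bool(down))                  # x1 inverter


for a, b, c in product((0, 1), repeat=3):
    assert ioa(a, b, c) == int(not ((a or b) and c))
    assert aoi(a, b, c) == int(not ((a and b) or c))
    assert mux(a, b, c) == (b if c else a)     # third input used as sel
    assert cgen(a, b, c) == int(a + b + c >= 2)
print("all four gates match their expected Boolean functions")
```

Under this model the mux passes `a` when `sel` is 0, in agreement with the comment in the netlist, and CGEN produces the majority of its three inputs.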
1.8 HSpice code for Sub-circuit on CNT technology
Weifu Li
***************************
***********2 Input IOA**************
.subckt IOA vdd a b c Out
x1 1 a vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x2 Out b 1 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x3 Out c vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x4 Out c 2 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x5 2 a 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x6 2 b 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
***********The Output of this gate is Out=[(a+b)*c]_bar**************
.ends
************2 Input AOI**************
.subckt AOI vdd a b c Out
x1 1 a vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x2 1 b vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x3 Out c 1 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x4 Out c 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x5 Out a 2 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x6 2 b 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
.ends
************2 Input XOR_S**************
.subckt xor_s vdd a b Out
x1 vdd a a_bar inverter
x2 vdd b b_bar inverter
x3 b a_bar Out 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x4 Out a b 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x5 b_bar a Out 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x6 Out a_bar b_bar 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
.ends
************2 Input NOR**************
.subckt nor vdd a b Out
x1 1 a vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x2 Out b 1 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x3 Out a 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x4 Out b 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
.ends
************2 Input NAND**************
.subckt nand vdd a b Output
x1 Output a vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x2 Output b vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x3 Output a 1 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x4 1 b 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
.ends
************The Inverter**************
.subckt inverter vdd input Output
x1 Output input vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x2 Output input 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
.ends
************2 Input XOR_XNOR**************
.subckt xor_xnor vdd a b xor xnor
x1 a b xor 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x2 xor a b 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x3 xor b 1 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x4 1 a 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x5 xnor xor vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=2 pitch=4e-9
x6 xor xnor 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=2 pitch=4e-9
x7 2 b vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x8 xnor a 2 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x9 a b xnor 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x10 xnor a b 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
.ends
************2-1 Mux**************
.subckt mux vdd a b sel sel_bar Output
x1 1 b vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x2 Out_bar sel_bar 1 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x3 Out_bar sel 2 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x4 2 b 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x5 3 a vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x6 Out_bar sel 3 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x7 Out_bar sel_bar 4 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x8 4 a 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x9 Output Out_bar vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x10 Output Out_bar 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
*****Sel=0 Output = a*****
.ends
**********fanout***********
.subckt fanout vdd input output
x1 vdd input output inverter
x2 vdd input output inverter
x3 vdd input output inverter
x4 vdd input output inverter
.ends
**********Buffer***********
.subckt buffer vdd input output
x1 input_bar input vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x2 input_bar input 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=8 pitch=4e-9
x3 output input_bar vdd 0 PCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
x4 output input_bar 0 0 NCNFET Lch=32e-9 Lss=32e-9 Ldd=32e-9 Dout=0 Sout=0 n1=19 n2=0 tubes=16 pitch=4e-9
.ends
Appendix B. Monte Carlo Simulation Data
2.1 Monte Carlo Simulation for CMOS
1 1.1 1.2 1.3 1.4 15.5 15.6
415 403.9 403.5 420 425.1 412.2 425.2
1.5 1.6 1.7 1.8 1.9 16 16.1
434.1 395.7 408.7 424.8 429.7 357.7 411.2
2 2.1 2.2 2.3 2.4 16.5 16.6
442.2 390.7 407.7 402.6 398.5 413.5 336.6
2.5 2.6 2.7 2.8 2.9 17 17.1
422.4 401.6 455 394.5 414.2 382.9 374.6
3 3.1 3.2 3.3 3.4 17.5 17.6
449.9 433.5 378.3 477.5 408.2 407 442.1
3.5 3.6 3.7 3.8 3.9 18 18.1
424.6 447.4 357.2 459.2 421.2 378.4 386.7
4 4.1 4.2 4.3 4.4 18.5 18.6
415.2 389.1 374.3 408.9 424.3 378.5 395.2
4.5 4.6 4.7 4.8 4.9 19 19.1
405.9 394.6 392.7 447.5 397.4 373.4 385.3
5 5.1 5.2 5.3 5.4 19.5 19.6
461.1 438.8 408.5 478.1 416.9 376.6 388.5
5.5 5.6 5.7 5.8 5.9 20 20.1
437.8 453.1 435.4 393.7 412.2 397.6 420
6 6.1 6.2 6.3 6.4 20.5 20.6
378.1 435.2 413.3 417.1 412.8 396.2 368.4
6.5 6.6 6.7 6.8 6.9 21 21.1
437.9 353.6 414.2 417.5 407.1 403 444.7
7 7.1 7.2 7.3 7.4 21.5 21.6
406.5 397.9 430.3 417.8 433.5 421.1 386
7.5 7.6 7.7 7.8 7.9 22 22.1
431.5 471.2 424 465.5 462.3 428.9 378.2
8 8.1 8.2 8.3 8.4 22.5 22.6
399.4 410.6 387.8 436.7 425.1 410.3 391
8.5 8.6 8.7 8.8 8.9 23 23.1
401.6 414.8 439.5 369.4 441.3 435.8 420.7
9 9.1 9.2 9.3 9.4 23.5 23.6
424.2 414.5 414.7 406.2 369.8 413.5 434.2
9.5 9.6 9.7 9.8 9.9 24 24.1
399.9 411 434 377.4 401.5 402 377.7
10 10.1 10.2 10.3 10.4 24.5 24.6
421 446.6 484.4 450.4 394.2 395.2 384.6
10.5 10.6 10.7 10.8 10.9 25 25.1
418.9 390.1 390.1 402.6 432.8 447.1 426.5
11 11.1 11.2 11.3 11.4 25.5 25.6
391.4 431.8 380.6 396.8 402 424.4 438.5
11.5 11.6 11.7 11.8 11.9 26 26.1
409.8 372.8 384.4 401.6 405 365.8 422.1
12 12.1 12.2 12.3 12.4 26.5 26.6
416.8 369.3 387.2 380.1 375.9 424.9 344.8
12.5 12.6 12.7 12.8 12.9 27 27.1
398.9 379.8 427.9 372.6 393.4 394.1 385.8
15.9 30 30.1 30.2 30.3 30.4 44.5
398.5 408.7 433.3 469 436 385.5 388.7
16.4 30.5 30.6 30.7 30.8 30.9 45
388.2 406.3 378 378.4 390.8 420.4 440.3
16.9 31 31.1 31.2 31.3 31.4 45.5
387.1 408.7 452.3 398.1 414.5 418.9 418.2
17.4 31.5 31.6 31.7 31.8 31.9 46
409.5 427.8 390 402.9 418.4 423.2 361.5
17.9 32 32.1 32.2 32.3 32.4 46.5
433.3 435.5 386 401.3 397.8 392.7 418.9
18.4 32.5 32.6 32.7 32.8 32.9 47
402.2 416.6 395.7 447.5 388 406.6 388.9
18.9 33 33.1 33.2 33.3 33.4 47.5
414.3 443.1 427 373.1 469.5 402.7 413.2
19.4 33.5 33.6 33.7 33.8 33.9 48
401.7 419.2 440.4 351.6 491.2 415.7 381.5
19.9 34 34.1 34.2 34.3 34.4 48.5
379.2 408.6 386.6 368.2 404.2 417.7 384.9
20.4 34.5 34.6 34.7 34.8 34.9 49
372.6 401.1 390.4 387.5 440.8 392.4 380.7
20.9 35 35.1 35.2 35.3 35.4 49.5
408.7 454 432.2 402.2 470.1 409.9 384
21.4 35.5 35.6 35.7 35.8 35.9 50
413.2 430.5 445.1 428.9 387.9 415.8 402.5
21.9 36 36.1 36.2 36.3 36.4 50.5
416.7 374.2 429.1 406.3 411.1 407.2 401.3
22.4 36.5 36.6 36.7 36.8 36.9 51
386.8 431.9 350.4 408.3 411.1 401.6 398
22.9 37 37.1 37.2 37.3 37.4 51.5
402.3 398.7 390.5 424 412 427.1 416.2
23.4 37.5 37.6 37.7 37.8 37.9 52
395.8 424.3 462.7 418.4 458.7 453.6 423.5
23.9 38 38.1 38.2 38.3 38.4 52.5
409.4 392.9 405.1 381.6 429.7 419.4 405.7
24.4 38.5 38.6 38.7 38.8 38.9 53
413.4 395.9 409.1 433 363.2 433.8 430.9
24.9 39 39.1 39.2 39.3 39.4 53.5
386.9 390.7 400.9 408.8 406.6 418.9 407.3
25.4 39.5 39.6 39.7 39.8 39.9 54
402.5 394.6 403.4 427.2 372.1 396.4 399.1
25.9 40 40.1 40.2 40.3 40.4 54.5
409.4 414.7 440 476.9 443.7 388.5 389.9
26.4 40.5 40.6 40.7 40.8 40.9 55
400.9 413.5 386 385.3 396.3 426 441.8
26.9 41 41.1 41.2 41.3 41.4 55.5
395.9 397.2 437.7 386.5 402.4 407.5 419.3
27.4 41.5 41.6 41.7 41.8 41.9 56
402.3 415.1 379.2 391 407.1 410.4 362.4
44.8 44.9 59 59.1 59.2 59.3 59.4
428.4 382.9 381.1 389.1 398.1 398.4 407.9
45.3 45.4 59.5 59.6 59.7 59.8 59.9
455.7 397.7 385.1 394.6 416.7 363.2 386.5
45.8 45.9 60 60.1 60.2 60.3 60.4
377.6 403.3 403.8 427.4 464.1 431 377.2
46.3 46.4 60.5 60.6 60.7 60.8 60.9
399.6 394.7 404.5 374 373 386.6 414.9
46.8 46.9 61 61.1 61.2 61.3 61.4
399.6 389.9 407.6 450.5 398.2 413.6 417.8
47.3 47.4 61.5 61.6 61.7 61.8 61.9
400 414.9 426.3 388.8 401.6 418.1 421.8
47.8 47.9 62 62.1 62.2 62.3 62.4
445.5 440.2 434.5 385.2 399.7 396.5 391.6
48.3 48.4 62.5 62.6 62.7 62.8 62.9
417.1 406.9 415.7 394.5 446.7 386.9 406.7
48.8 48.9 63 63.1 63.2 63.3 63.4
354.6 420.3 441.6 425.6 372.7 467.9 401.3
49.3 49.4 63.5 63.6 63.7 63.8 63.9
396.5 407.3 417.3 439.8 350.7 490 413.9
49.8 49.9 64 64.1 64.2 64.3 64.4
362.4 385.9 408.2 385 367.1 402.8 417.1
50.3 50.4 64.5 64.6 64.7 64.8 64.9
429.5 376.5 399.8 388.9 385.7 439.7 392
50.8 50.9 65 65.1 65.2 65.3 65.4
384.6 414.5 453.4 430.3 401.7 468.9 407.3
51.3 51.4 65.5 65.6 65.7 65.8 65.9
404.2 408.6 429.4 443.5 427.3 388 414.5
51.8 51.9 66 66.1 66.2 66.3 66.4
408 411.3 373.9 427.6 406 409.4 405.8
52.3 52.4 66.5 66.6 66.7 66.8 66.9
387.6 384.6 431.1 349.4 406.7 409.3 400.1
52.8 52.9 67 67.1 67.2 67.3 67.4
379.7 396.8 398.2 391 421.8 410.7 425.6
53.3 53.4 67.5 67.6 67.7 67.8 67.9
457.7 392.2 423.7 462.5 416.9 458.3 453.4
53.8 53.9 68 68.1 68.2 68.3 68.4
477.5 405.4 392.7 401.5 379.5 428.7 417.8
54.3 54.4 68.5 68.6 68.7 68.8 68.9
393.1 407.9 392.9 408.1 431 362 432.7
54.8 54.9 69 69.1 69.2 69.3 69.4
429.2 385.3 390.2 399.4 406 406.6 417.9
55.3 55.4 69.5 69.6 69.7 69.8 69.9
447.8 398.5 392.9 403.7 426.4 370.9 394.9
55.8 55.9 70 70.1 70.2 70.3 70.4
378.3 405.3 413.1 438.6 475.2 442 386.6
56.3 56.4 70.5 70.6 70.7 70.8 70.9
400.7 396.5 411.2 385.2 383.3 395.6 424.3
13 13.1 13.2 13.3 13.4 27.5 27.6
423.3 410.4 357.9 448.6 384.9 418.5 456.3
13.5 13.6 13.7 13.8 13.9 28 28.1
401.6 421.8 337.3 469 398 388.5 398.8
14 14.1 14.2 14.3 14.4 28.5 28.6
392.6 369.4 355.2 388.2 401.1 388.4 403.7
14.5 14.6 14.7 14.8 14.9 29 29.1
384.2 373.7 370.8 421.2 375.3 386.3 395.2
15 15.1 15.2 15.3 15.4 29.5 29.6
435.5 413.7 386.2 449.2 391.9 388.8 398.3
80 80.1 80.2 80.3 80.4 80.5 80.6
398 439.3 387.2 404.2 408.6 416.2 378.7
91 91.1 91.2 91.3 91.4 91.5 91.6
430.9 415.7 363.4 457.7 392.2 407.3 429
93 93.1 93.2 93.3 93.4 93.5 93.6
441.8 420.5 392.8 457.8 398.5 419.3 433
95 95.1 95.2 95.3 95.4 95.5 95.6
390.4 380.5 413.3 400.8 416.2 414.9 450.4
97 97.1 97.2 97.3 97.4 97.5 97.6
381.1 389.1 398.1 398.4 407.9 385.1 394.6
99 99.1 99.2 99.3 99.4 99.5 99.6
407.6 450.5 398.2 413.6 417.8 426.3 388.8
1 1.1 1.2 1.3 1.4 1.5 1.6
441.6 425.6 372.7 467.9 401.3 417.3 439.8
3 3.1 3.2 3.3 3.4 3.5 3.6
453.4 430.3 401.7 468.9 407.3 429.4 443.5
5 5.1 5.2 5.3 5.4 5.5 5.6
398.2 391 421.8 410.7 425.6 423.7 462.5
7 7.1 7.2 7.3 7.4 7.5 7.6
390.2 399.4 406 406.6 417.9 392.9 403.7
9 9.1 9.2 9.3 9.4 9.5 9.6
404.4 447.6 394.1 410.3 415.7 423.9 386.6
11 11.1 11.2 11.3 11.4 11.5 11.6
439 422.5 370.7 464.8 398.5 415.6 437
13 13.1 13.2 13.3 13.4 13.5 13.6
449.8 427.9 399.5 466.6 406.3 426.5 444.6
15 15.1 15.2 15.3 15.4 15.5 15.6
399.8 442.2 389.3 405.9 410.3 419.1 382
17 17.1 17.2 17.3 17.4 17.5 17.6
433.6 418.1 366.8 459.8 394.4 410.7 432.1
56.6 56.7 56.8 56.9 58.6 58.7 58.8
341.2 399.2 400 390.4 397.8 420.5 355.7
57.1 57.2 57.3 57.4 90.6 90.7 90.8
380.5 413.3 400.8 416.2 385.3 436.1 379.7
57.6 57.7 57.8 57.9 92.6 92.7 92.8
450.4 407.9 447.2 441.2 380.4 377.5 429.2
58.1 58.2 58.3 58.4 94.6 94.7 94.8
393.5 371.4 419 407.6 341.2 399.2 400
27.9 42 42.1 42.2 42.3 42.4 56.5
446.7 422.7 373.8 389.7 385.8 383.9 420
28.4 42.5 42.6 42.7 42.8 42.9 57
412.4 404.6 384.2 434.3 378.7 396.3 390.4
28.9 43 43.1 43.2 43.3 43.4 57.5
426.8 430.2 414.2 363.1 455.7 391.8 414.9
29.4 43.5 43.6 43.7 43.8 43.9 58
413 406.7 428 342.1 476.4 404.5 383.9
29.9 44 44.1 44.2 44.3 44.4 58.5
391.2 397.6 373.7 359.7 394.2 407 386
80.9 90 90.1 90.2 90.3 90.4 90.5
411.3 423.5 375.5 390.9 387.6 384.6 405.7
91.9 92 92.1 92.2 92.3 92.4 92.5
405.4 399.1 373.2 360.7 393.1 407.9 389.9
93.9 94 94.1 94.2 94.3 94.4 94.5
405.3 362.4 417.4 396.6 400.7 396.5 420
95.9 96 96.1 96.2 96.3 96.4 96.5
441.2 383.9 393.5 371.4 419 407.6 386
97.9 98 98.1 98.2 98.3 98.4 98.5
386.5 403.8 427.4 464.1 431 377.2 404.5
99.9 100 100.1 100.2 100.3 100.4 100.5
421.8 434.5 385.2 399.7 386.5 391.6 415.7
1.9 2 2.1 2.2 2.3 2.4 2.5
413.9 408.2 385 367.1 402.8 417.1 399.8
3.9 4 4.1 4.2 4.3 4.4 4.5
414.5 373.9 427.6 406 409.4 405.8 431.1
5.9 6 6.1 6.2 6.3 6.4 6.5
453.4 392.7 401.5 379.5 428.7 417.8 392.9
7.9 8 8.1 8.2 8.3 8.4 8.5
394.9 413.1 438.6 475.2 442 386.6 411.2
9.9 10 10.1 10.2 10.3 10.4 10.5
419.6 431.6 380.2 397.6 394 388.5 413.6
11.9 12 12.1 12.2 12.3 12.4 12.5
412.1 405.7 379.8 365.6 400.1 415 397.6
13.9 14 14.1 14.2 14.3 14.4 14.5
460.4 401.3 421.8 435.7 421 373.8 417.7
15.9 16 16.1 16.2 16.3 16.4 16.5
414.9 427 376.7 393.3 389.8 383.9 407.7
17.9 18 18.1 18.2 18.3 18.4 18.5
406.9 401.8 375.3 361.7 395.4 409.7 393.8
96.7 96.8 96.9 4.6 4.7 4.8 4.9
420.5 355.7 421.9 349.4 406.7 409.3 400.1
98.7 98.8 98.9 6.6 6.7 6.8 6.9
373 386.6 414.9 408.1 431 362 432.7
100.7 100.8 100.9 8.6 8.7 8.8 8.9
446.7 386.9 406.7 385.2 383.3 395.6 424.3
2.7 2.8 2.9 10.6 10.7 10.8 10.9
385.7 439.7 392 392.1 444 385.2 404.6
15.7 15.8 44.6 44.7 27.7 27.8 12.8
411.6 374.1 379 376 412.1 452.8 436.9
16.2 16.3 45.1 45.2 28.2 28.3 14.8
393 394.2 419.6 390.6 376.8 423.6 416.7
16.7 16.8 45.6 45.7 28.7 28.8 12.9
390.5 392.2 431.7 416.4 426.2 359.3 388.4
17.2 17.3 46.1 46.2 29.2 29.3 14.9
406.3 393.8 417.2 395.7 402.3 404.1 424.6
17.7 17.8 46.6 46.7 29.7 29.8 16.6
401.9 439.2 339.9 396.9 421.5 368.2 386.9
18.2 18.3 47.1 47.2 80.7 80.8 18.6
366.7 412 379 412.4 391 408 384.4
18.7 18.8 47.6 47.7 91.7 91.8 16.7
413.5 348.7 449.2 406.4 342.7 477.5 438.6
19.2 19.3 48.1 48.2 93.7 93.8 18.7
391.3 390.7 393 370.8 418.6 378.3 378.2
19.7 19.8 48.6 48.7 95.7 95.8 16.8
409.6 356.6 397.7 419.7 407.9 447.2 381.3
20.2 20.3 49.1 49.2 97.7 97.8 18.8
457.1 423.7 392.6 395.8 416.7 363.2 431.4
20.7 20.8 49.6 49.7 99.7 99.8 16.9
367.6 379.1 393.4 415 401.6 418.1 399.7
21.2 21.3 50.1 50.2 1.7 1.8 18.9
392.4 407.3 425.2 462.9 350.7 490 384
21.7 21.8 50.6 50.7 3.7 3.8 12.6
397 413.1 373.2 372.9 427.3 388 385.8
22.2 22.3 51.1 51.2 5.7 5.8 14.6
395.6 391 439.3 386.2 416.9 458.3 436.6
22.7 22.8 51.6 51.7 7.7 7.8 12.7
440.7 383.3 378.7 391 426.4 370.9 386.2
23.2 23.3 52.1 52.2 9.7 9.8 14.7
367.9 442.8 375.5 390.9 399.1 414.6 391.3
23.7 23.8 52.6 52.7 11.7 11.8
346.9 483.8 385.3 436.1 348.7 486.2
24.2 24.3 53.1 53.2 13.7 13.8
363.2 397.8 415.7 363.4 422.8 384.3
24.7 24.8 53.6 53.7 15.7 15.8
382.6 434.2 429 342.7 394.6 409.7
25.2 25.3 54.1 54.2 17.7 17.8
397.1 462.9 373.2 360.7 346 480.7
25.7 25.8 54.6 54.7 58.9 96.6
423 382.1 380.4 377.5 421.9 397.8
26.2 26.3 55.1 55.2 90.9 98.6
401.3 404.6 420.5 392.8 396.8 374
26.7 26.8 55.6 55.7 92.9 100.6
401.7 404 433 418.6 385.3 394.5
27.2 27.3 56.1 56.2 94.9 2.6
417.5 406.1 417.4 396.6 390.4 388.9
2.2 Monte Carlo Simulation for CNT
1 2 3 4 5 86 87
28.9 28.71 28.59 28.95 28.92 29.24 29.56
6 7 8 9 10 91 92
28.96 28.82 28.97 29.38 29.11 29.03 28.85
11 12 13 14 15 96 97
28.87 29.09 28.8 29 28.85 29.5 29.25
16 17 18 19 20 101 102
28.6 28.52 29.16 29.6 29.25 29.38 29.06
21 22 23 24 25 106 107
29.67 29.42 29 28.9 28.82 28.8 29.5
26 27 28 29 30 111 112
28.69 28.84 28.74 29.6 29.81 29.84 29.15
31 32 33 34 35 116 117
29.23 29.92 29.4 29.39 29.6 29.45 28.96
36 37 38 39 40 121 122
29.1 29.48 29.31 29.41 29.5 28.83 29.48
41 42 43 44 45 126 127
29.17 29.77 29.53 29.12 28.91 29.89 29.2
46 47 48 49 50 88 89
28.72 28.62 28.84 29.14 29.77 29.23 29.64
51 52 53 54 55 93 94
29.22 29.84 29.58 29.04 29.05 28.75 28.67
56 57 58 59 60 98 99
28.91 28.73 28.89 29.13 29.72 29.53 29.21
61 62 63 64 65 103 104
29.24 29.77 29.27 29.51 29.2 28.85 28.75
66 67 68 69 70 108 109
29.63 29.13 29.72 29.24 29.77 29.38 29.55
71 72 73 74 75 113 114
29.14 29.66 29.26 29.7 29.44 29.63 29.25
76 77 78 79 80 118 119
28.98 28.93 28.86 28.7 28.88 28.92 28.84
81 82 83 84 85 123 124
29.44 29.27 29.51 29.2 29.63 29.37 29.47
95 100 105 110 90 128 129
28.79 29.63 28.65 29.02 29.39 29.15 29.08
115 120 125 130
29.68 28.69 29.04 28.84
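The Monte Carlo samples in this appendix can be summarised by their mean, standard deviation and coefficient of variation. The sketch below is illustrative only: the function name `summarize` is hypothetical, the tables in this appendix do not state the measured quantity or its unit, and only the first row of values from each table is used as input, so the printed numbers are not statistics of the full data sets.

```python
from statistics import mean, stdev


def summarize(samples):
    """Return (mean, sample standard deviation, coefficient of variation)."""
    mu = mean(samples)
    sigma = stdev(samples)          # sample standard deviation (n - 1 denominator)
    return mu, sigma, sigma / mu


# First row of values from each Monte Carlo table in this appendix.
cmos_first_row = [415, 403.9, 403.5, 420, 425.1, 412.2, 425.2]
cnt_first_row = [28.9, 28.71, 28.59, 28.95, 28.92, 29.24, 29.56]

for name, row in (("CMOS", cmos_first_row), ("CNT", cnt_first_row)):
    mu, sigma, cv = summarize(row)
    print(f"{name}: mean={mu:.2f} std={sigma:.2f} cv={cv:.4f}")
```

To analyse a complete table, the value rows can be concatenated into one list and passed to the same function; the coefficient of variation is unit-free, which makes it a convenient figure for comparing the spread of the two technologies.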