ECE 636 Reconfigurable Computing Lecture 11 Reconfigurable Computing Applications
DESIGN AND ANALYSIS OF RECONFIGURABLE MEMORIES...3.6 Example conditional gang clear operation -...
Transcript of DESIGN AND ANALYSIS OF RECONFIGURABLE MEMORIES...3.6 Example conditional gang clear operation -...
DESIGN AND ANALYSIS OF RECONFIGURABLE
MEMORIES
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Ken Mai
June 2005
c© Copyright by Ken Mai 2005
All Rights Reserved
ii
I certify that I have read this dissertation and that, in my opin-
ion, it is fully adequate in scope and quality as a dissertation
for the degree of Doctor of Philosophy.
Mark A. Horowitz(Principal Adviser)
I certify that I have read this dissertation and that, in my opin-
ion, it is fully adequate in scope and quality as a dissertation
for the degree of Doctor of Philosophy.
Oyekunle Olukotun
I certify that I have read this dissertation and that, in my opin-
ion, it is fully adequate in scope and quality as a dissertation
for the degree of Doctor of Philosophy.
Bruce A. Wooley
Approved for the University Committee on Graduate Stud-
ies:
iii
Abstract
Decades of technology scaling have made available an unprecedented amount of compu-
tational power from today’s integrated circuits. However,continued process scaling has
come at the price of ever more exotic process technologies and complex designs. Due these
difficulties, the non-recurring engineering cost of customASIC development is growing,
making ASICs economically infeasible for all but the highest volume parts. However, fu-
ture applications will require more efficient, high-performance computation than general
purpose processors can provide. One promising approach to breaking this impasse is to use
reconfigurable architectures that keep the low non-recurring engineering costs of general
purpose silicon, yet still provide efficiency and performance near that of custom ASICs.
While there is a large body of work on designing reconfigurable computation, reconfig-
urable memory systems have been largely ignored. In this work, we examine how to add
reconfigurability to the memory system.
Looking closely at how memory is used in modern digital systems, we recognized that
the most common memory structures, such as caches, FIFOs, and scratchpads, use very
similar memory building blocks. By adding a few meta-data bits and a small amount of
peripheral logic to a basic SRAM array, we designed a reconfigurable memory mat that
could form the core of a many common memory structures. Adding a flexible interconnec-
tion network between the mats and the computation facilitates aggregation of the mats into
larger, more complex memory structures.
To evaluate our design, we implemented a prototype reconfigurable memory testchip
in a 0.18µm CMOS technology. The testchip operates at 1.1GHz at the nominal 1.8V
Vdd and room temperature. The prototype uses a 16kb SRAM mat and achieved area and
power overheads of 32% and 23% of the totals respectively. Our projections show that the
iv
reconfiguration overheads can be reduced to below 15% of the area and below 10% of the
power by using larger capacity SRAMs. The testchip shows that we can build a generic
reconfigurable memory block that can form the basis of many different memory structures,
while maintaining high-performance and low power.
v
Acknowledgements
One of the benefits of having an extremely long graduate student career is that I have had
the opportunity to interact with and learn from many wonderful people. Their contributions
to this work have been invaluable.
I could not have asked for a better advisor than Mark Horowitz. His technical depth and
breadth, ability to bore to the heart of a problem, steadfastrefusal to cut corners, scientific
integrity, dedication and accessibility to his students, and patience made him a great advisor
and role-model. I would like to thank him for giving me the opportunity to work with him.
My reading committee members, Kunle Olukotun and Bruce Wooley, and my orals
committee members, Christos Kozyrakis and Boris Murmann, provided insightful com-
ments on my research.
Charlie Orgish, Joe Little, and Jason Conroy kept the computing infrastructure run-
ning smoothly. The support staff, notably Darlene Hadding,Terry West, Deborah Harber,
Lindsay Brustin, Taru Fisher, Penny Chumley, Teresa Lynn, and Ann Guerra, provided
administrative assistance.
Fujitsu Labs, Philips, and DARPA offered their financial support.
The Horowitz research group has consistently been an engaging, friendly, and helpful
environment. Ron Ho has been my near-constant collaboratorand a good friend. This
work has benefited greatly from his help and expertise. Also deserving of special note are
Guyeon Wei, Dan Weinlader, Birdy Amrutur, and Elad Alon who have not only been good
colleagues, but also good friends. At times our forays into non-research related topics may
have lengthened our time in graduate school, but their friendship most certainly made the
time more enjoyable. My officemates, Toshi Mori, Junji Ogawa, and Sam Palermo stoically
tolerated me and the mountains of papers.
vi
My parents and brother Glenn offered their encouragement and support.
My wife Kaarin’s graduate student career overlapped with mine, but despite her own
trials and travails, she has always been encouraging, empathetic, and supportive.
Finally, I would like to thank our cats, Simon and Max, who somehow always knew
when a good hairball was the proper stress reliever.
vii
Contents
Abstract iv
Acknowledgements vi
1 Introduction 1
1.1 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 SRAM Design 4
2.1 Array Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5
2.2 Decoder Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Datapath Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Sense Amplifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Write Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.3 Bitline Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.4 Clocking and Control . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Transport Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Memory Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Reconfigurable Memory Architecture 26
3.1 Reconfigurable Memory Architecture . . . . . . . . . . . . . . . . .. . . 27
3.2 Memory mat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Operation Modifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
viii
3.4.2 Pointer Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.3 Read Modify Writes . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.4 Conditional Operations . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.5 Conditional Gang Operations . . . . . . . . . . . . . . . . . . . . 35
3.5 Mat Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6 Micro-architecture and Implementation . . . . . . . . . . . . .. . . . . . 37
3.6.1 Meta-data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6.2 Peripheral Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4 Interconnect Networks 67
4.1 Interconnection Network Design . . . . . . . . . . . . . . . . . . . .. . . 69
4.2 Inter-mat Control Network . . . . . . . . . . . . . . . . . . . . . . . . .. 71
4.3 Processor Interconnect Network . . . . . . . . . . . . . . . . . . . .. . . 75
4.3.1 Request Crossbar . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.2 Reply Crossbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.4 Processor Interface . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5 Experimental Results 94
5.1 Testchip Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.1.1 SRAM core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.1.2 Peripheral Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.1.3 Interconnect and Test Infrastructure . . . . . . . . . . . . .. . . . 104
5.2 Measured Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6 Conclusions 114
A SRAM Survey 117
Bibliography 119
ix
List of Tables
2.1 SRAM control signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Mat operation modifier applicability . . . . . . . . . . . . . . . .. . . . . 32
3.2 Conditional gang clear truth table . . . . . . . . . . . . . . . . . .. . . . . 35
3.3 Mat I/Os . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Clock cycle comparison for virtual multi-porting . . . . .. . . . . . . . . 41
3.5 Modified conditional gang clear truth table . . . . . . . . . . .. . . . . . . 48
3.6 Ternary CAM stored value meaning . . . . . . . . . . . . . . . . . . . .. 56
4.1 Request crossbar I/O signals . . . . . . . . . . . . . . . . . . . . . . .. . 77
4.2 Reply crossbar I/O signals . . . . . . . . . . . . . . . . . . . . . . . . .. 80
5.1 Process and testchip features . . . . . . . . . . . . . . . . . . . . . .. . . 96
5.2 SRAM area breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3 Mat area breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.4 Testchip area breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . .108
5.5 Mat power breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
A.1 SRAM survey data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
x
List of Figures
2.1 6-transistor SRAM cell . . . . . . . . . . . . . . . . . . . . . . . . . . . .5
2.2 SRAM block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Partitioned 512Kb SRAM array using 4 macros each with 8 16Kb blocks . 7
2.4 SRAM block size survey (see Appendix A) . . . . . . . . . . . . . . .. . 8
2.5 Self-resetting, 2-input, AND gate . . . . . . . . . . . . . . . . . .. . . . . 10
2.6 Self-resetting, 2-input, AND gate timing diagram . . . . .. . . . . . . . . 11
2.7 Nambu OR-gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.8 2-input source driven NAND gate . . . . . . . . . . . . . . . . . . . . .. 13
2.9 Latch-style sense amplifier . . . . . . . . . . . . . . . . . . . . . . . .. . 14
2.10 Write driver with 2:1 NMOS-only column mux . . . . . . . . . . .. . . . 15
2.11 Write driver with reduced NMOS stack height . . . . . . . . . .. . . . . . 16
2.12 Bitline reset circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . 17
2.13 Bitline reset circuit with split read and write reset . .. . . . . . . . . . . . 17
2.14 Bitline reset circuit with pseudo-static keeper circuit . . . . . . . . . . . . . 18
2.15 Capacitance and current ratioed replica bitlines . . . .. . . . . . . . . . . 21
2.16 Example differential low-swing interconnect . . . . . . .. . . . . . . . . . 22
2.17 Example low-swing driver, sized for a 0.18µm technology[54] . . . . . . . 22
2.18 Example low-swing interconnect receiver, sized for a 0.18µm technology[54] 23
3.1 Basic reconfigurable memory system architecture . . . . . .. . . . . . . . 28
3.2 Memory system block diagram - 16 mats in an 8 x 2 array . . . . .. . . . 29
3.3 Caching configuration with 2-way data and instruction caches . . . . . . . 29
3.4 Streaming configuration with data FIFOs, instruction memory, and scratchpad 29
3.5 Example gang operation - set md[3], clear md[2:1], NOP md[0] . . . . . . 32
xi
3.6 Example conditional gang clear operation - clearmd[0] if md[1] == 1 . . . 35
3.7 Generic memory structure . . . . . . . . . . . . . . . . . . . . . . . . . .39
3.8 Mat block diagram showing meta-data with support logic (RMW decoder
and PLA) and peripheral logic blocks (pointer logic, write buffer, and com-
parator) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.9 Meta-data bitcell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.10 One mat cell row . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.11 Gang operation I/O logic for one column . . . . . . . . . . . . . .. . . . . 44
3.12 Two cell rows using interleaved meta-data bitcells . . .. . . . . . . . . . . 44
3.13 Meta-data bit cell with embedded match circuit . . . . . . .. . . . . . . . 46
3.14 Conditional gang clear implementation using two transistors . . . . . . . . 47
3.15 Modified conditional gang clear implementation using one transistor . . . . 47
3.16 RMW decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.17 RMW decoder timing diagram . . . . . . . . . . . . . . . . . . . . . . . .50
3.18 Pipelined RMW decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.19 PLA block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.20 Example 3-input, 3-output NOR NOR PLA structure . . . . . .. . . . . . 54
3.21 Ternary CAM trit cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . .55
3.22 PLA ternary CAM timing diagram . . . . . . . . . . . . . . . . . . . . .. 57
3.23 Normal self-reset circuit . . . . . . . . . . . . . . . . . . . . . . . .. . . 58
3.24 Normal self-reset circuit timing diagram . . . . . . . . . . .. . . . . . . . 58
3.25 Fast reset off self-reset circuit . . . . . . . . . . . . . . . . . .. . . . . . . 59
3.26 Fast reset off self-reset circuit timing diagram . . . . .. . . . . . . . . . . 59
3.27 PLA SRAM cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.28 Pointer logic block diagram . . . . . . . . . . . . . . . . . . . . . . .. . . 62
3.29 Storing two FIFOs in one mat . . . . . . . . . . . . . . . . . . . . . . . .63
3.30 A FIFO spanning four mats . . . . . . . . . . . . . . . . . . . . . . . . . .63
3.31 Maskable comparator gate-level diagram . . . . . . . . . . . .. . . . . . . 64
4.1 Interconnect overview . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 68
4.2 4 x 6 crossbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
xii
4.3 Inter-mat control network . . . . . . . . . . . . . . . . . . . . . . . . .. . 72
4.4 Ext out logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 Ext in logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.6 IMCN wired-OR driver and pull-up circuit . . . . . . . . . . . . .. . . . . 74
4.7 Processor interconnect overview . . . . . . . . . . . . . . . . . . .. . . . 76
4.8 Request crossbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.9 Request crossbar crosspoint . . . . . . . . . . . . . . . . . . . . . . .. . . 78
4.10 Reply crossbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.11 Reply crossbar crosspoint . . . . . . . . . . . . . . . . . . . . . . . .. . . 81
4.12 Low swing driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.13 Low swing receiver - modified StrongARM latch . . . . . . . . .. . . . . 84
4.14 Address splitter block diagram . . . . . . . . . . . . . . . . . . . .. . . . 87
4.15 Scratchpad mat configuration for address splitter example . . . . . . . . . . 88
4.16 Address split for contiguous word scratchpad . . . . . . . .. . . . . . . . 89
4.17 Address split for interleaved word scratchpad . . . . . . .. . . . . . . . . 89
4.18 Cache mat configuration for address splitter example . .. . . . . . . . . . 90
4.19 Cache tag address split for contiguous word data array .. . . . . . . . . . 91
4.20 Cache data address split for contiguous word data array. . . . . . . . . . . 91
4.21 Cache tag address split for interleaved word data array. . . . . . . . . . . 92
4.22 Cache data address split for interleaved word data array . . . . . . . . . . . 92
5.1 Testchip die photo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 Testchip block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . .97
5.3 SRAM predecoder gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4 SRAM wordline driver gate . . . . . . . . . . . . . . . . . . . . . . . . . .99
5.5 SRAM read I/O circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.6 SRAM bitline reset circuits . . . . . . . . . . . . . . . . . . . . . . . .. . 101
5.7 SRAM replica timing path . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.8 SRAM replica bitline and wordline . . . . . . . . . . . . . . . . . . .. . . 103
5.9 Pointer logic decoder gate . . . . . . . . . . . . . . . . . . . . . . . . .. 104
5.10 On-die voltage sampler with device widths inµms for a 0.18µm technology[56]106
xiii
5.11 Process monitor block . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 107
5.12 Area overhead and breakdown . . . . . . . . . . . . . . . . . . . . . . .. 109
5.13 Area overhead vs. SRAM capacity . . . . . . . . . . . . . . . . . . . .. . 110
5.14 Power overhead and breakdown . . . . . . . . . . . . . . . . . . . . . .. 111
5.15 Power overhead vs. SRAM capacity . . . . . . . . . . . . . . . . . . .. . 112
5.16 Worst-case read at 1GHz, 1.8V, room temperature . . . . . .. . . . . . . . 112
xiv
Chapter 1
Introduction
Decades of technology scaling have made available an unprecedented amount of compu-
tational power from today’s integrated circuits. This has opened up new application areas
such as computational biology and ubiquitous computing, where previously computation
was either not powerful enough or not efficient enough to attack the problem. These types
of new application areas will keep up the demand for ever moreefficient, high-performance
computation.
However, continued process scaling has come at the price of ever more exotic process
technologies and complex designs. Designers face increasing difficulties such as intercon-
nect delay, device mismatch, leakage, power density constraints, and complexity manage-
ment. Due these difficulties, the cost of custom ASIC development is increasing at an
alarming rate. Within the next few process generations, thecost of mask sets will exceed
$1 million [1], and the design costs will reach into the tens of millions of dollars [2].
For system designers, the increasingly high cost of custom ASIC development means
that the highest performance, highest efficiency solutionsfor their applications are becom-
ing economically infeasible. Broadly speaking, system designer can choose to use off-the-
shelf parts or custom design an ASIC for their application. The off-the-shelf parts have
low non-recurring-engineering costs, yet lag the custom ASICs in performance and effi-
ciency, sometimes by orders of magnitude [3]. Further, the off-the-shelf, general purpose
processor solutions are losing steam, running out of available parallelism and encountering
1
CHAPTER 1. INTRODUCTION 2
architectural scaling problems [4]. While some applications can still use off-the-self pro-
cessor and memory solutions, a growing number of applications need better performance
and efficiency than off-the-shelf components can provide, but cannot economically justify
custom ASIC development.
These applications have stimulated a growing interest in reconfigurable computing so-
lutions. Field programmable gate arrays (FPGAs) [5][6] have been commercially available
for quite some time and have been gaining in popularity as fast prototyping and emulation
platforms. Additionally, a number of academic and industryefforts have combined a gen-
eral purpose processor with an FPGA fabric [7][8][9] for both ease of programming and
reconfigurability. Further, some next-generation computing architecture projects have ex-
plored architectures that can exploit multiple types of application parallelism [10][11] and
even entirely polymorphic computing architectures [12][13][14].
These solutions bridge the gap between custom ASICs and general purpose solutions by
allowing the user to customize a pre-fabricated reconfigurable component. Thus the non-
recurring-engineering costs remain low, while the performance and efficiency are better
than those of off-the-shelf, general purpose solutions [4][15][16]. However, these solutions
still lag custom ASICs in performance and efficiency due to limitations in their reconfig-
urability and overheads associated with that reconfigurability [3][15]. This illustrates the
fundamental design trade-off between efficiency and flexibility: the more general-purpose
a system, the less efficient it is at a specified task; a highly directed design can be extremely
efficient at the target application, but is very inefficient at other tasks.
While there is a large body of work on how to design reconfigurable compute units, the
memory systems of such designs have been largely ignored. Many systems with reconfig-
urable computation, have entirely hardwired memory systems [12][14]. FPGAs have only
recently added small, somewhat reconfigurable memory blocks in the computing fabric.
However, the memory system plays a critical role in determining the performance, power,
and cost of modern machines [17][18]. Just as every application has unique compute re-
quirements, every application has unique memory access patterns which perform optimally
using different memory paradigms. A memory system that can be reconfigured to match
the application memory access pattern requirements can significantly boost both the per-
formance and efficiency of the system. Building such a memorysystem is the goal of this
CHAPTER 1. INTRODUCTION 3
thesis.
1.1 Organization
To set the stage for our discussion of reconfigurable memory,we begin in Chapter 2 by
reviewing contemporary SRAM design techniques. These techniques include: array par-
titioning; post-charged decoder logic; datapath circuit design; control and clocking using
replica timing; and efficient data and address signaling anddistribution. We then examine
how these highly optimal SRAM blocks are used to form caches,FIFOs, and scratchpads
in modern memory systems.
Chapter 3 explores motivating factors for reconfigurable memory and the architecture
of our proposed memory system. The architecture is driven from below by memory circuit
design considerations and driven from above by the requiredarchitectural flexibility. We
also detail the design and implementation of the base reconfigurable memory block called
amemory mat.
We continue the description of the proposed memory system inChapter 4 by detailing
the interconnection networks between the memory mats and between the mats and the
computation. Each network has unique communication requirements and characteristics,
and we tailor the implementations to efficiently meet those needs.
To explore the feasibility and overheads of our architecture, we designed and imple-
mented a prototype reconfigurable memory testchip. Chapter5 describes the testchip im-
plementation and measured results. From these results, we extrapolate to larger, more
complex implementations of the proposed reconfigurable memory system.
Chapter 2
SRAM Design
In today’s processors and ASICs, one of the most voracious consumers of die area and
device count is the on-die memory. The predominant memory technology used on-die with
computation is 6T SRAM. While embedded DRAM has appeared in anumber of aca-
demic and commercial designs, such as the Berkeley IRAM project [19], gaming consoles
[20][21], and the Mosys 1-T “SRAM” memory block [22], it has not been widely used due
to the additional cost of a merged logic-DRAM process and higher design complexity. A
6T SRAM memory, however, can be manufactured in a standard logic process and contin-
ues to offer high-performance at a reasonable density and power dissipation. SRAMs will
likely remain the dominant on-die memory technology, as SRAM cells have already been
demonstrated down to the 32nm technology node [23].
A 6T SRAM cell uses a pair of cross-coupled inverters as its bi-stable storage element
with two additional NMOS devices for read and write access (Figure 2.1). The cells are
aggregated into cell arrays to share the decoding and I/O logic (Figure 2.2). On a read, the
decoder raises the wordline (WL) of the desired word. The bitlines (BL andBL b) have
been precharged to a reference voltage, and the cell drives adifferential current onto the
bitlines according to the stored value. The cell current is relatively weak for the bitline
capacitance, so to speed the read operation, a sense amplifier in the I/O logic amplifies the
bitline differential voltage to produce a full swing logic value. On a write, the write driver
in the I/O logic places the write data onto the bitlines as full-rail signals. The decoder again
raises the wordline of the accessed word, and the cells storethe new data values. A number
4
CHAPTER 2. SRAM DESIGN 5
WL
Vdd Vdd
a a_b
BL BL_b
Figure 2.1: 6-transistor SRAM cell
of cell columns can share I/O circuitry via the column multiplexor. There are a number of
ways in which designers have optimized the architecture andcircuit design of SRAMs to
improve the performance and energy efficiency. In this chapter, we will review a few of the
most prevalent and significant optimizations.
2.1 Array Partitioning
While the simple monolithic cell array architecture in Figure 2.2 is appropriate for small
memories on the order of a few Kbytes, for larger memories, designers partition the cell
array for better performance and energy efficiency [24][25][26]. For partitioned arrays, we
will use the same terminology as Amrutur [24] to describe thepartitioning. An SRAM is
divided intonmmacros, each of which is accessed simultaneously. Every macro operates
independently, with the exception of possibly sharing a portion of the decoder with the
other macros. Each macro contains a portion of the accessed word called the sub-word.
Each macro is again divided into a number of blocks. The requested sub-word is contained
CHAPTER 2. SRAM DESIGN 6
6T
dec
Coldec
Col mux
Sense amps
Write buffer
Control
datain
dataout
se
we
col_sel
BL BL_b
WL
addr
Row
Figure 2.2: SRAM block diagram
CHAPTER 2. SRAM DESIGN 7
entirely in the block, which we define as an array of cells thatshares local wordline drivers
and bitline I/O circuits.1 Each block hasbw cells in a row, andbh cells in a column. The
bitlines run vertically, and the wordlines run horizontally.
Figure 2.3 shows an example partitioning of a 512Kb SRAM. TheSRAM has 4 macros,
each with 8 16Kb blocks. The blocks are 128 rows by 128 columns(i.e. bh= 128 andbw=
128) and have an access width of 16b. The 16b access width requires each block to perform
8:1 column multiplexing.
word
16 16 16 16
128
128
macro
16Kbblock
64
sub−word
Figure 2.3: Partitioned 512Kb SRAM array using 4 macros eachwith 8 16Kb blocks
In a partitioned SRAM only a portion of the array is activatedevery access. By us-
ing hierarchical wordline decoding [27][28] and hierarchical bitline architectures [29], de-
signers can keep the lines short and avoid significant wire RCdelays. The shorter, lower
capacitance lines and partial activation of the array serveto reduce the energy per access.
1A block can be though of as logically a monolithic cell array,but in practice it does not have to be. Forexample, a block could place the decoder in the middle of the cell array to reduce the wordline RC delay, andthus actually contain two monolithic cell arrays.
CHAPTER 2. SRAM DESIGN 8
Partitioning does, however, decrease the area efficiency ofthe SRAM, and designers must
trade that off against the improved energy and performance.Eventually this increase in the
area offsets the gains from a smaller block and is detrimental to the overall performance
and energy dissipation of the memory [24]. The optimal blocksize depends on the opti-
mization goal (e.g. delay, energy, area, energy-delay), SRAM architecture, circuit style,
and technology, but typically is in the 16KB to 128KB range and is not a strong function
of process technology [25][30][31][24].
Total Capacity
Blo
ckca
pac
ity
16Kb 64Kb 256Kb 1Mb 4Mb 16Mb 64Mb
1Kb
4Kb
16Kb
64Kb
256Kb
1Mb
Figure 2.4: SRAM block size survey (see Appendix A)
Figure 2.4 shows a scatter plot of the block sizes for a numberof recent SRAM designs.
The detailed characteristics and full citations of the SRAMs can be found in Appendix A.
With the exception of a few outliers, most of the SRAMs have partition sizes that lie be-
tween 16Kb and 128Kb. This bears out the conclusions of both Amrutur [24] and Evans
[25] that the optimal energy-delay partition size falls within this range. The SRAMs plotted
were fabricated in processes that span the 90nm to 0.65µm technology generations. The
large block outliers are SRAMs that were optimized for area efficiency. The small parti-
tion outliers are SRAMs that were optimized for extreme low power or high performance
operation.
CHAPTER 2. SRAM DESIGN 9
For a partitioned SRAM, we can break the memory access down into four phases: re-
quest transport, block decode, block datapath, and reply transport. In therequest transport
phase, we send the the address and data from the global inputsto the requested block. A
portion of the decode may occur during this phase in the selection of which block to access
[27][28]. Theblock decodephase is the local decode from the block address input to the
local wordline assertion. Theblock datapathphase is from the wordline assertion through
the cell driving the bitlines to the output of the local senseamplifier or the cell writing in
the data on the bitlines. Thereply transportphase is from the output of the local sense
amplifier to the global data output. This phase is only necessary if the request returns data.
Other researchers have chosen to break a memory access into two parts, calling our first
two stagesdecodeand last twodatapath, but as we will see in Chapter 3, our four segment
division is logically cleaner for our reconfigurable memorydesign.
2.2 Decoder Design
The decoder logic can be thought of as a series of high-fanin AND gates with a very low
activity factor. For a memory with2n words, the decoder is logically 2n n-input AND
gates, designed such that only one AND-gate fires (i.e. raises its wordline) for a given n-bit
address input. The decoders are usually hierarchical [27][28] sharing pre-decoder gates for
common boolean terms. The address inputs are typically clustered into pre-decode groups
of three to four bits for pre-decoding. The pre-decoder outputs are then combined in the
global and local row decoder gates to generate the block wordline.
For a fast decoder, we would like to use a low logical effort [32] logic family like
precharged domino logic. However, while domino logic wouldprovide high performance,
the precharge phase is excessively wasteful in energy for decoders, because decoders have
an inherently low activity factor. On each access, only a small percentage of the gates in the
decoder fire. But in precharged logic, the precharge controlsignal goes to all gates in the
decoder and drives the gate capacitance of all the reset devices, regardless of whether they
have discharged or not. This is excessively wasteful in energy, because we only needed to
reset the gates that fired. Ideally, we would only reset thosegates, because all other gates
would not require resetting.
CHAPTER 2. SRAM DESIGN 10
To retain the high-performance of precharged logic, but achieve low-power, designers
instead use post-charged, self-resetting logic styles in the decoder [33][34]. As the name
implies, a short delay after a self-resetting gate asserts its output, it resets itself . The
output is a pulse whose pulsewidth is set by the self-reset delay. The pulse-mode nature of
these gates allows us to carefully control the pulsewidth ofthe wordline which is critical
for decreasing the read power [35]. For a read, the wordline pulsewidth determines how
long the accessed cells drive the bitlines and thus the read bitline swing and bitline energy
dissipation. Generally, the wordline pulsewidth is set to be just long enough to generate the
bitline differential voltage needed to overcome offsets inthe sense amplifiers. Section 2.3
discusses this further.
M1
Vdd Vdd VddVdd
in0
in1
r3
out
n0
M2
M3
M7
M6 M4
M5
M0
Figure 2.5: Self-resetting, 2-input, AND gate
Figure 2.5 shows the circuit schematic for a simple, self-resetting, 2-input, AND gate.
Figure 2.6 shows the timing diagram for the gate. When both inputsin0 andin1 pulse high,
M2 andM3 pull noden0 low. N0 going low pulls the outputout high through theM4/M5
inverter. After the three inverter delay,r3 goes low turning on the reset deviceM6. M6 pulls
n0 high, back to its quiescent state. We could wait for theM4/M5 inverter to pullout low
CHAPTER 2. SRAM DESIGN 11
out
r3
in0
in1
n0
Figure 2.6: Self-resetting, 2-input, AND gate timing diagram
again, but we can speed theout transition edge using an explicit reset deviceM7. Both in0
andin1 are assumed to be pulses and must be low by the timer3 goes low, turning onM6,
to avoid a drive fight betweenM2/M3 andM6. The gate that generatesin0 and in1 must
carefully control their pulsewidth to avoid this drive fight. Other designs for self-resetting
gates can operate correctly without this restriction on theinputs [31].
We can skew the NAND gate formed byM0-M3 for a fast assert edge,n0pull-down, and
theM4/M5 inverter can be skewed for a fastoutpull-up. This speeds the assert transition of
the output, but slows the reset transition. By adding the explicit reset devicesM6 andM7
we can also have a sharp reset edge (i.e. outpull-down). Thus we can precisely control the
pulsewidth of the output, which is key for controlling the read bitline swing and minimizing
the read power [35].
Using self-resetting gates allows the decoder to achieve both high-performance and
low-power. In 2001, Amrutur and Horowitz explored the design space and found that the
optimal decoding structure consists of a mixture of different types of dynamic AND gates
[34]. Their optimal decoder uses an initial stage of clockedNambu OR-gates [36], followed
by self-resetting, source-driven NAND gates [37][38].
The Nambu OR-gate allows us to use a fast dynamic NOR gate topology, while keeping
the power dissipation low. If we used a traditional dynamic NOR gate in the predecoder,
CHAPTER 2. SRAM DESIGN 12
n0
Vdd Vdd
VddVdd
clk
clk
in0
out
in1
Figure 2.7: Nambu OR-gate
then for each predecoder, all but one of the outputs would transition. Only the gate that had
all its inputs low would not transition. While this would be afast circuit topology, it would
dissipate a lot of power. The Nambu OR-gate retains the fast dynamic NOR topology,
but avoids high power dissipation, by directly cascading two dynamic stages as shown in
Figure 2.7. Correct operation of the gate relies on the dynamic NOR evaluating faster than
the following dynamic inverter and requires careful sizingand simulation.
A source driven NAND gate (Figure 2.8) implements the NAND function using a circuit
topology that only has a single NMOS device in the pull-down stack. One of the inputsin n
drives the source of the NMOS device rather than a transistorgate.In n is asserted when it
is low, andin p is asserted when it is high. Thus, when both input are asserted the NMOS
device has a full Vdd across its Vgs and is fully on. This gate can achieve delay near that of
an inverter, while still performing the NAND function. Thistype of gate has also be used
with half-swing inputs to reduce power dissipation if low-Vt devices are available [39][40].
CHAPTER 2. SRAM DESIGN 13
WL
Vdd
in_p
in_n
out
Figure 2.8: 2-input source driven NAND gate
2.3 Datapath Design
The block datapath phase is from the assertion of the wordlines through to the cell I/O
circuits. The cell I/O consists of the read sense amplifiers,the write drivers, and the bitline
reset devices. Both the sense amplifiers and write drivers may have a column mux between
them and the bitlines if they are shared among multiple columns.
2.3.1 Sense Amplifier
On a read access, the cells in the selected row drive a differential current on to the bitlines
based on their stored values. The bitlines have been previously equalized and reset to a
reference voltage by the bitline reset devices. While we could simply wait until the bitlines
have slewed full-rail to a digital logic value, to save powerand reduce the read delay, most
design use sense amplifiers to sense the the bitlines as soon as their differential voltage
has reached approximately 100mV, enough to overcome the built-in sense amplifier offsets
[35]. When the sense amplifiers are enabled, the wordlines are also shut off to prevent
additional, unnecessary swing of the bitlines and wasted power.
To further reduce power dissipation, designers have eschewed static sense amplifiers,
in favor of clocked designs. Figure 2.9 shows a commonly usedlatch-type sense amplifier.
Clocked sense amplifiers have a sense enable signalsethat triggers the sensing. The figure
shows a 2:1 column multiplex using only PMOS devices. This assumes that the bitline
reference level is Vdd, so we only need PMOS devices for the passgate column mux. Some
CHAPTER 2. SRAM DESIGN 14
lower reference levels (e.g.Vdd/2) may require a full transmission gate for the column mux.
sel[1]
Vdd Vdd
se
bit[0] bit[1] bit_b[0] bit_b[1]
sel[0]
Figure 2.9: Latch-style sense amplifier
There are many different clocked sense amplifier circuits, including both voltage mode
and current mode designs. SRAMs that are willing to swing thebitlines full-rail on a read
can use skewed inverters for simple single-ended voltage mode sensing.
2.3.2 Write Driver
On a write access, the write driver sets the bitlines to the input data value. After the decoder
asserts the selected wordline, the cell stores the data value on the bitlines. For a success-
ful write, the bitlines must typically swing full-rail or nearly so. There has been some
CHAPTER 2. SRAM DESIGN 15
work on low-swing writes, but the proposed techniques have not been widely adopted
[41][42][39][40][31][43][44][45]. The write enable control signalwe controls when the
write driver asserts the input data value on the bitlines. Weuse an NMOS only passgate
mux for the 2:1 column multiplexor, because the write driverpulls the bitline to ground
but relies on the bitline reset to keep the bitline that remains high near Vdd. The write
drive as shown has three NMOS devices in a series stack. For a low effective resistance,
this would require the NMOS devices to be quite large. To reduce the stack height, we can
pre-compute the AND ofweanddataanddata b, and only have two devices in the stack
(Figure 2.11).
bit[1]
bit bit_b
data data_b
sel[0]
sel[1]
we
bit_b[1]bit_b[0]bit[0]
Figure 2.10: Write driver with 2:1 NMOS-only column mux
CHAPTER 2. SRAM DESIGN 16
data_b
bit_b
bit_b[1]bit_b[0]
bit
sel[0]
sel[1]
bit[0] bit[1]
we we
data
Figure 2.11: Write driver with reduced NMOS stack height
2.3.3 Bitline Reset
The bitlines sit at a pre-determined reference voltage level when the block is not active.
On a read or write the bitlines deviate from that reference voltage, and the bitline reset
circuits restore the bitlines to the reference voltage level after an operation. The bitline
loads can either be static or clocked. Static loads do not require any complex control or
clocking, but dissipate static power anytime the bitlines deviate from the reference voltage.
Most modern designs use clocked reset circuits like the onesshown in Figure 2.12. The
reset devices pull the bitlines to the reference voltage, typically Vdd, when the bitline reset
control signalbl rst b is asserted (low). The shorting device ensures thatbit andbit b reset
to the same voltage level which is especially important for reads.
There have been some designs that use a lower bitline reference voltage, usually Vdd-
Vt or Vdd/2, to reduce the bitline energy [39][40] and to reduce leakage onto the bitlines
from nominally unselected cells [46], but again, these techniques have not been widely
adopted.
The amount of bitline reset required is typically quite different for reads and writes. On
a read, the bitlines only dip about 100mV from the Vdd reference level, but on a write, one
CHAPTER 2. SRAM DESIGN 17
bit_b
Vdd Vdd
bl_rst_b
bit
Figure 2.12: Bitline reset circuit
of the bitlines is driven to Gnd. Given the very different swings that need to be corrected,
some designs have separate read and write bitline reset circuits. The write reset circuits
use larger devices, because one of the bitlines must be brought all the way from Gnd to
Vdd. Figure 2.13 shows a split reset circuit using larger recovery devices for the write reset
which are enabled bywr bl resetb.
small
Vdd Vdd
Vdd Vdd
bit bit_b
bl_rst_b
wr_bl_rst_blargelarge
small
Figure 2.13: Bitline reset circuit with split read and writereset
Designers can also add static or pseudo-static keeper devices on the bitlines to mitigate
CHAPTER 2. SRAM DESIGN 18
the effects of leakage and crosstalk. A static load can be built using a grounded gate PMOS
device with the source connected to Vdd and the drain to the bitline. A pseudo-static load,
similar to those used in dynamic logic gates, consists of an inverter driving a PMOS device
as shown in Figure 2.14. The advantage of this technique is that the keeper PMOS device
shuts itself off during writes, thus avoiding excessive power dissipation. However, it may
be difficult to fit the extra devices needed for the pseudo-static load in the small horizontal
cell pitch.2 Some designs have used more complex techniques to actively compensate for
the leakage onto the bitline due to nominally off cells [48][49][50].
small
Vdd Vdd
Vdd Vdd
Vdd Vdd
bl_rst_b
wr_bl_rst_blargelarge
small small
bit bit_b
small
Figure 2.14: Bitline reset circuit with pseudo-static keeper circuit
2This may become less of an issue as designers move to a short, wide cell layout for shorter bitlines andbetter manufacturability [47].
CHAPTER 2. SRAM DESIGN 19
Table 2.1: SRAM control signals
signal name unit operation descriptionse sense enablesense amplifier read enables sense of bitlineswe write enable write driver write enables drive of write data onto bitlines
bl rst bitline reset bitline load both turns on bitline reset deviceswl wordline cells both turns on cell access devices
2.3.4 Clocking and Control
An SRAM has a number of key control signals whose timing relationships must be tightly
controlled to maintain correct operation. There are additional timing relationships that can
be maintained for high-performance, low-power operation.Table 2.1 lists some of the key
control signals.
We can generate the control signals either by delay matchingor clock selection. The
clock selection method requires that there be a number of finely spaced clock signals avail-
able. Either by simulation-based dead reckoning at design time or by a training sequence,
we generate the control signals from the clock edges that correspond most closely with the
ideal control signal edges [51].
The delay matching method generates the control signals based on replica timing paths
that match the delay of the SRAM access path [52][31]. Matching the delay of elements
in the decoder is relatively easy, because the decoder consists of logic gates and buffers.
By using delay elements that are made up of identical gates orusing logical effort and
simulation, a delay element can be made that matches the decoder delay reasonably well
across process, voltage, and temperature variations [31].
However, during a read, there is an element in the SRAM accesspath that is not a logic
gate, namely the cell driving the bitline. Matching the delay of the cell driving the bitline is
critical for accurately generating the sense enable signalthat activates the sense amplifiers.
If we assert the sense enable signal too early, there may not be a sufficient differential
voltage on the bitlines to overcome offsets in the sense amplifier, and the sense amplifier
may latch-in incorrect data. If we fire the sense enable too late, the SRAM will still operate
CHAPTER 2. SRAM DESIGN 20
correctly, but the performance will not be optimal. Amrutur[35] showed that a logic delay
chain does not track the read bitline delay very well over a range of process, voltage, and
temperature variations.
However, a replica bitline that mimics the delay of the actual bitline can track the actual
bitline delay well over process, voltage, and temperature variations. To mimic the actual
bitline as accurately as possible, we use a driver cell on thereplica bitline that is identical
to a real SRAM cell, except that it has a hardwired stored value. Unlike the actual bitline
that only swings approximately 100mV, the replica bitline must generate a full digital logic
level output level. We can accomplish this by either a capacitance ratioed or current ratioed
replica bitline [35]. A capacitance ratioed replica bitline uses a single driver cell and a
replica bitline that is a fraction of the length and capacitance of the real bitline. A current
ratioed replica bitline uses a full length and capacitance bitline, but use multiple driver cells.
The exact length of the replica bitline in the capacitance ratioed case or the exact number
of driver cells in the current ratioed case can be fine tuned insimulation to match the actual
bitline delay. Figure 2.15 illustrates the two types of replica bitlines. The capacitance
ratioed replica bitline uses a rblk times shorter than the real bitline, and the current ratioed
replica bitline usesj times the drive strength as a regular cell.
While a replica bitline does track the actual bitline much better than a chain of logic
gates, the tracking is not perfect. Additionally, despite careful simulation-based tuning, the
replica path and the actual path may differ due to random device mismatch, inexact mod-
eling, or local voltage and temperature variations. To ensure that the memory will operate
correctly under worst-case conditions, designers pad the the replica path by between 10%
and 30% extra delay. This extra time may end up being wasted time if the actual path is
faster than the replica path, but ensures correct operationunder worst-case conditions.
2.4 Transport Design
The request and reply transport phases transfer the addressand data to and from the memory
block. These phases of the memory access can account for the majority of the delay and
energy in large, partitioned memories [25][31]. Designersuse a number of techniques to
reduce the delay and power of the transport phases.
CHAPTER 2. SRAM DESIGN 21
rbl
regularcolumn
100mV
Vdd
Vdd
driver cell
n/k cells
n cells
capacitance ratioreplica bitline
rbl
current ratio
n cells
j driver cells
replica bitline
Figure 2.15: Capacitance and current ratioed replica bitlines
CHAPTER 2. SRAM DESIGN 22
In his SRAM design space study, Evans concludes that “multiplexed address line” de-
signs that route address and data to the blocks are optimal for low energy and high per-
formance [25][53]. These designs perform a portion of the decode in the interconnect in
selecting which blocks should be accessed. Broadcasting the request and reply data and
address to all blocks is excessively wasteful of energy.
rcv+
clk
data_out
data_b
data
wire
wire
drv
drv
Figure 2.16: Example differential low-swing interconnect
8.1/0.18
8.1/0.18Φ
Wire
Bit
Bit
1.8v1.8v
1.8v1.8v
0.5v
Figure 2.17: Example low-swing driver, sized for a 0.18µm technology[54]
To reduce the energy of the transport further many memories use low-swing signaling
[31][39][40]. Low-swing schemes can also improve the performance of the interconnect
over full-swing methods [55][54]. There are a number of low-swing signaling techniques
that have been used in memories and other types of designs [55][56]. Figures 2.16-2.18
CHAPTER 2. SRAM DESIGN 23
voa a o
v
Vdd
Bit
Φ
Φ
Φ Φ
Φ
2.88/0.24
S/R latch
Bit
Figure 2.18: Example low-swing interconnect receiver, sized for a 0.18µm technology[54]
show an example low-swing interconnect system. The wires are precharged to Vlow, a
voltage near ground. The drivers conditionally pull the wire to Gnd based on the data. The
modified StrongARM latch [57] at the far end amplifies the low-swing differential input
voltage to a full-swing signal.
Low-swing signaling can, however, require an extra supply and precise clocking of the
receiver sense amplifiers. Additionally, the small swings can be susceptible to noise, but
this can be mitigated by careful circuit design and layout [39][40][54]. The low-swing
drivers and receivers can be merged with the routing logic ina “multiplexed address line”
design.
Since a large portion of the access delay can be in the transport, we can increase the
cycle rate of the memory by pipelining in the transport. Manymicroprocessors already
allocate multiple cycles for a cache or even a register file access due to the interconnect de-
lay. By pipelining in the transport we minimize the necessary latching elements by placing
them before the block decoder.
CHAPTER 2. SRAM DESIGN 24
2.5 Memory Systems
In the previous sections, we have reviewed the basic architecture of a partitioned mem-
ory array and a number of contemporary SRAM circuit techniques for achieving high-
performance and low-power. These highly optimized SRAMs are rarely used by them-
selves, but rather aggregated together to form more complexmemory systems. On modern
processors and ASICs, the on-die memory system occupies a large percentage of the die
area and has a large impact on performance and power dissipation.3 For current system-on-
a-chip ASICs, the memory occupies on average over half of thedie area [58][59]. Similarly,
memory occupies a large portion of the die on modern general purpose CPUs. For exam-
ple, on a recently reported Intel Itanium 2 processor, the caches account for over 73% of
the total device count and occupy over 50% of the die area [60].
Digital systems use memories for a variety of functions, butthey typically fall into one
of three categories: scratchpads, caches, and FIFO buffers. Scratchpads are simple, fast,
software-managed, local memories mapped directly into theaddress space for storing fre-
quently used data and instructions [61]. For fast, deterministic access times, scratchpads
are typically small and located close to the computation. Onthe other hand, caches facili-
tate fast access to much larger memory spaces for access patterns that exhibit temporal or
spatial locality by remapping locations of a larger memory space into a smaller, faster cache
memory. Caches improve performance and efficiency of the memory system by hiding the
latency of large, slow storage or by filtering requests to bandwidth limited resources such as
off-chip DRAM. Some access pattern, however, do not exhibitlocality, but rather streaming
behavior [62][63][64]. Theses access patterns perform most efficiently with a FIFO stream
buffer memories [65][66][67][68]. FIFO structures can also be used to provide elasticity
between blocks with variable or bursty computation rates.
Even within each memory type there can be a multiplicity of different functionality.
For example, while a cache for a large microprocessor may store coherence protocol in-
formation, replacement policy information, speculative state, or dirty information for each
3While the on-die memory itself may not necessarily be the dominant source of power dissipation, itcan have a large impact on the number of off-chip memory requests issued. This off-chip I/O can be quiteexpensive in both power and performance. Additionally, theleakage power from large memory structurescan be significant.
CHAPTER 2. SRAM DESIGN 25
line. These extra meta-data bits may require special handling such as read-modify-write
functionality. However, a cache for a small micro-controller may simply store a valid bit,
and not require any additional functionality. Further, thelarge microprocessor’s caches
are likely to be multi-way set associative and organized in ahierarchy, while the micro-
controller may only have a single level of relatively small caches. While these memory
structures can have widely varying functionality, on closer examination, all of them are
quite similar in implementation. Each of them can be decomposed into a number of RAM
blocks, interconnection between the blocks and the computation, and interface logic be-
tween the computation and the interconnect.
Processors and ASICs often include examples of all three types of memories in various
sizes and forms on the same die. On a processor, the are many caching structures (e.g.
main cache hierarchy, branch target buffer, TLB), FIFOs (e.g. instruction queue, reorder
buffer, reservation stations, DRAM write buffer), and scratchpad memories (e.g. register
file). Even on more application specific chips such as multimedia processors and ASICs,
there can be a mixture of these memory types [69][70][71][58]. The reason for this variety
of memory structures on-die is that different parts of the design have different memory
requirements, and each memory type is most efficient for a certain type of access patterns.
On a larger scope, the memory systems of ASICs targeted at different applications vary
widely, because they are optimized for different applications with different memory access
patterns.
Chapter 3
Reconfigurable Memory Architecture
A conventional hardwired memory system is typically implemented for best average-case
performance and efficiency, which is acceptable for an ASIC with a very narrow application
range, but for a general purpose design, a hardwired memory system that cannot adapt to the
varying memory access characteristics across a range of applications can be a substantial
hindrance. Because the memory system is often the bottleneck for applications [17], a
sub-optimal memory system can significantly degrade the performance and efficiency of
the design [18][72][73][74]. If we had a configurable memorysystem, each application or
application segment could have a memory system tailored to its specific needs and thus run
faster and more efficiently [75][76].
A number of researchers have sought to build memory systems that could adapt to the
application characteristics. The concept of caching is a way to dynamically reconfigure
the memory to suit the application based on its past memory access behavior. Researchers
have further enhanced cache performance and efficiency by altering the cache structure
itself to meet the application needs. These enhancements have included reconfiguring the
line size [72][77], the set associativity [78][18][72][79], and shutting off unneeded lines or
banks [80][81][82]. Some proposed caches have a direct access, scratchpad-like mode that
eschews the tag check to save power [83]. A few DSP chips can even redistribute memory
resources between instruction cache, data cache, and scratchpad memory, on a cache way
granularity [84][85]. Some researchers have gone further by adding the ability to alter the
balance between compute and memory by reconfiguring the caches to become computation
26
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 27
[86].
Reconfigurable memory of a different sort can be found on commercial FPGAs and
coarse-grain reconfigurable computing fabrics. To providethe local memory that appli-
cations need for buffering and local data storage, designers have included block RAMs
distributed throughout the computing fabric which have lower overheads than using the
configurable logic blocks (CLBs) as memory [6][5][87][14][88]. These block RAMs can
reconfigure their access width, many can become FIFOs, and some can be configured as
CAMs or logic blocks [89][90]. These designs provide relatively bare memory modules
and use the surrounding configurable logic fabric to create any higher-level functionality
needed by the system. However, the high overhead associatedwith the configurable logic
degrades the performance of more complex memory functions.
The extant work on reconfigurable memory has focused on the architectural aspects of
reconfiguring the memory, but has largely ignored the circuit-level effects of adding re-
configuration to a cutting-edge memory design. As an example, the FPGA block RAMs
specifications are fairly inefficient and slow when comparedto current, cutting-edge, em-
bedded SRAM designs [5][91]. In this work, we take a different approach to the problem
by looking at reconfigurable memory design from a bottom-up,tabula rasaperspective,
starting with the circuit design of the memories themselves, and then examining where and
what architecturally useful configurability can be added for low overhead. To provide a re-
alistic evaluation, we apply modern, full-custom, circuittechniques to both the SRAM and
the peripheral logic, which allows us to achieve performance and power on par with con-
temporary embedded SRAMs. To guide our development of the additional logic, we target
three different memory configurations: caches, FIFOs, and scratchpads, but the resultant
reconfigurable memory design is not limited to these configurations.
3.1 Reconfigurable Memory Architecture
Our reconfigurable memory systems consists of three sections: the memory, the intercon-
nect, and the interface logic. Figure 3.1 shows the block diagram of the architecture. For
our design, we leverage the natural partitioning of large SRAMs into smaller blocks, by
adding reconfiguration on these partitioning boundaries. By choosing the reconfiguration
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 28
grain size based on circuit- and VLSI-level concerns, we avoid over or under partitioning
the array and minimize the reconfigurability overhead costs.
Processor
Mat . . .
. . .Inter−mat Control Network
. . .
Rec
onfig
urab
le m
emor
y sy
stem
Memory
Interconnect
Interface Logic
Figure 3.1: Basic reconfigurable memory system architecture
The memory section consists of a homogeneous array of memoryblocks calledmats.
Each mat can be configured to be a portion of a cache, a FIFO, or scratchpad memory. We
choose the mat size based on the optimal energy-delay SRAM block size and the necessary
architectural flexibility. A small network called the inter-mat control network runs between
the mats allowing them to pass a few bits of control information to each other. Mats can be
aggregated together to form larger complex memories such ascaches, FIFOs, or scratchpad
memories. Figure 3.2 shows the block diagram of an example reconfigurable memory
system with 16 mats arranged as an 8 x 2 array. Figures 3.3 and 3.4 show this memory
system configured for caching and for stream processing respectively.
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 29
Processor Interconnect Network (PIN)
Data in/outAddressOpcode
mat
Control bits
IMCN
Figure 3.2: Memory system block diagram - 16 mats in an 8 x 2 array
hit
DataData Data Data
Data Data DataData DataTag
TagData TagData
Data Tag
Data Cache Instr Cache
hit
Figure 3.3: Caching configuration with 2-way data and instruction caches
Scratchpad
InstrFIFOFIFOFIFOFIFO
FIFO FIFO FIFO FIFO
Scratch
Scratch ScratchInstr
FIFO
FIFO
Data FIFOs InstructionMemory
Scratch
Figure 3.4: Streaming configuration with data FIFOs, instruction memory, and scratchpad
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 30
Because many memory structures are internally very similar, we can use this general-
ized memory mat to form the core of many types of memories. However, memory struc-
tures do differ significantly in the way that they connect to the computation engine. For
example, a cache requires requests to two (or more) memory blocks, the tag array and the
data array. A FIFO, however, only requires a single memory request per access. To allow
maximum reconfigurability, we use a dynamically routed crossbar between the memories
and the computation. Each mat has its own independent port into the interconnection net-
work. The interconnect is actually two uni-directional interconnection networks: one for
memory requests from the computation to the mats, and one forthe reply from the mats
to the computation engine. The computation engine launchesrequests into the intercon-
nect, which are then dynamically routed to the addressed mats. The interconnect allows
for multi-casting from a single request port to multiple mats. The mat replies are always
routed back to the computation engine port that issued the request.
Memory structures can be quite varied in the way that they exist in the address space.
A scratchpad is typically mapped directly into the address space. However, a cache has no
mapping into the address space, but rather is a surrogate forthe main memory, and must
maintain the illusion that accessing the cache location is logically the same as accessing
the main memory location. A FIFO may be addressed as a unit, rather than on a per word
basis, with simple push and pop commands. The interface logic between the processor and
interconnect translates the address from the processor to ahardware location. This logic is
similar to the address centrifuge of the Cray T3E [92]. Chapter 4 provides more details on
the interconnect and interface logic.
3.2 Memory mat
A memory mat is a flexible memory block which can emulate a widevariety of memory
structures. The mat has a fixed access width, corresponding to the data word width of the
computation engine. While having variable access width would facilitate applications that
operate on smaller data widths, it decreases the efficiency of the memory [30] and compli-
cates the design of the mat, interconnect, and interface logic. We can however aggregate
multiple mats for wider access widths.
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 31
One of the key features that distinguishes many of the memorystructures we wish to
emulate from basic RAMs is that they store status bits for each word of data. For example,
a cache tag may store a valid bit, the LRU status, the coherence protocol state, or the
speculation state for each cache line. A FIFO may store a full/empty bit for each data
word to indicate if the word holds valid data. In our reconfigurable memory, we support
these status bits by adding a generalized status bits calledmeta-databits to each data word.
These bits hold data about the data, hence the name “meta-data.” The meta-data bits are
read and written along with the main data on all accesses. Additionally, there are special
operations that only operate on the meta-data.
3.3 Operations
A mat can perform three basic operations:read, write, and a special meta-data operation
called agangoperation. A read reads the requested word, data and meta-data, from the mat
memory array and sends it to the mat data output. A write writes the accessed word, data
and meta-data, with the word presented to the mat data input.A gang operation operates
on all of the meta-data bits in the mat on a per-column basis ina single cycle.
A gang operation can set, clear, or leave alone (NOP) any combinations of meta-data
columns. For example, if there were 4 meta-data bits,md[3:0], with a single gang op-
eration, we could setmd[3], clearmd[2:1], and NOPmd[0]. Figure 3.5 illustrates this
example gang operation.
There are two additional mat operations used for reading andwriting the configura-
tion state,configuration readandconfiguration write. The configuration state is memory
mapped into a special address space accessible only via the configuration read and write
operations. The ability to read the configuration state is primarily a testability feature and
not strictly needed.
3.4 Operation Modifiers
There are four operation modifiers than can be applied to the three basic operations per
Table 3.1. An “X” indicates that the modifier can be applied tothe operation. The modifiers
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 32
gang
1
0
0
0
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0 1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1 0
0
0
0
NO
P
clea
r
clea
r
set
NO
P
clea
r
clea
r
set
meta−data
Figure 3.5: Example gang operation - set md[3], clear md[2:1], NOP md[0]
Table 3.1: Mat operation modifier applicability
Instruction Cmp Ptr RMW Cond
Read X X X XWrite X XGang X
are compare (cmp), pointer (ptr), read-modify-write (RMW), and conditional (cond). All
modifiers are orthogonal.
3.4.1 Comparisons
One of the most common logic operations following a memory read is a compare. For
example, the vast majority of accesses to a cache tag memory are reads followed by a
compare to determine if there is a cache hit. Some configurations may only want to perform
a comparison on a portion of the stored word. For example, in acache tag that stores both
LRU and valid information in the meta-data, a tag check will want to compare the outgoing
address to the stored tag and ensure that the valid bit is set by checking that the valid bit
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 33
is set. But the check will not care about the state of the LRU information. So we want
compare the main data and the valid bit, but wildcard the LRU bits.
The memory mat supports a maskable compare modifier, that follows up a read opera-
tion with a comparison. Unlike a normal read, a compare requires data input for the value
to compare the stored value against. Additionally, for the maskable comparison, we need
a short mask field to determine which fields are to be compared.The mask field is one
bit longer than the amount of meta-data to allow us to mask outany combination of the
meta-data bits and the main data as a chunk. Comparator hardware is relatively small, so
embedding a comparator in each mat only has a modest area overhead.
3.4.2 Pointer Operations
Many memory structures have a level of indirection in the address path. A simple example
of this is a FIFO, in which an access is either a pop of the head or a push to the tail. A request
does not name a specific memory address to access, but insteada pointer, the head or tail, to
access. Internally, the FIFO maps the head and tail pointer to an actual memory location. A
more complex example of address indirection occurs in a cache, where the accessed word
is selected based on matching the memory access tag field. Ourmemory system requires
use of two (or more) mats to support this type of complex address indirection, one to store
the tags and another for the data.
We support one level of simple address indirection via pointer operations. Any read or
write can be a pointer access that names a pointer number to access, rather than a memory
address. A special block in the address path called thepointer logictranslates this pointer
number into a memory address. Each pointer has an associatedstride value that the pointer
logic uses to update the pointer value after an access. The pointer value updates are optional
and can be increments or decrements. The pointer and stride values are read or written using
configuration reads and writes, described later.
3.4.3 Read Modify Writes
A read-modify-write (RMW) operation is an atomic operationthat reads out a word of data,
performs a logic function on the meta-data, then writes the modified meta-data back into
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 34
the previous accessed word. The logic function may take other inputs besides just the read
out meta-data. A write-modify-write operation is also possible, but the usefulness of such
an instruction is somewhat suspect.
An example use of this operation would be in a cache tag that stores LRU information
for each line. In a mat configured as the tag, the LRU bit(s) would be stored in the meta-
data. When the tag is accessed, we calculate the new LRU valuebased on the previous
LRU value, the local hit/miss result, and the global hit/miss result. By using a RMW
operation, we can perform this very simple logic and write the meta-data back all within
the memory. While the operation implementation may be pipelined internally, we maintain
the appearance of atomicity to all requesters. So if a RMW is immediately followed by a
read of the same word, the second read returns the updated meta-data value generated in
the modify logic.
3.4.4 Conditional Operations
Many memory structures have operations whose execution is contingent on an internal or
external condition. An example of a conditional operation based on an external condition
would be the data write of a cache. In the data memory, the write should only occur if there
is a hit in the tag. The hit signal must be passed from the tag memory to the data memory
to tell it whether or not to execute the data write.
A conditional operation based on an internal condition is contingent on a pattern match
against the meta-data of the word. An example of this would bea FIFO push, a write
into tail pointer location. The write is conditional on the FIFO having an empty slot (i.e.
the FIFO is not full). The write is therefore contingent on the meta-data bit storing the
full/empty information being 0 indicating an empty location. An internal conditional oper-
ation specifies the condition similar to specifying the maskfor maskable compares.
Any operation can be a conditional operation. A normal memory read does not alter
the state of the data word, and thus a conditional read does not seem strictly necessary, but
when aggregating multiple mats to form larger memory structures, a conditional read can
be useful. For any operation than can change the state of the word (i.e. a write or a read-
modify-write), a conditional operation is needed. A conditional gang operation requires
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 35
Table 3.2: Conditional gang clear truth table
md[1] md[0] md[0]’0 0 00 1 11 0 01 1 0
additional support as described next.
3.4.5 Conditional Gang Operations
A conditional gang operation gang sets or clears a meta-databit based on the value of
another meta-data bit. Figure 3.6 shows a conditional gang clear ofmd[0] based onmd[1].
If md[1] is 1, we clearmd[0], otherwisemd[0] is left alone. Table 3.2 shows the truth table
for this operation.
cond
iton
1
1
1 0
0
0
cgang clr
meta−data
1
0
0
1
1
1
0
0
0
1
1 0
0
1
1 1
0
0
targ
et
targ
et
cond
iton
Figure 3.6: Example conditional gang clear operation - clear md[0] if md[1] == 1
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 36
Table 3.3: Mat I/Os
Signal Bits Description
opcode o pre-decoded one-hot control signalsaddress a memory address or pointer numbermask m+1 bit mask for compares and gang ops
mdatain m meta-data inputdatain d data input
mdataout m meta-data outputdataout d data outputext in e external control inputext out e external control output
matchout 1 compare resultvalid out 1 valid status of output datacomplete 1 general completion signal
Systems that allow speculative memory operations such as the Hydra multiprocessor
[93] can make extensive use of conditional gang operations.Each processor in Hydra runs
a different thread, all but one of which is speculative. Hydra caches have a number of
special status bits that they use to keep track of the speculative state of the cache lines.
The Modified bit keeps track of whether a line has been speculatively written by the local
processor, set to 1 if it has. This bit would be stored in the meta-data of a mat acting
as the cache tag in a Hydra implementation on our reconfigurable memory system. If a
processor speculates incorrectly (e.g. speculatively reads a word that is later written by a
less speculative thread), then it must perform a backup and invalidate all lines in the cache
that it has speculatively written. If the Modified bit is set,then the processor has incorrectly
speculatively written the line and must clear the Valid bit.This requires a conditional gang
clear of the Valid bits based on the Modified bits.
3.5 Mat Interface
In order to support the variety of operations beyond simple reads and writes, the mat inter-
face is more complex than a basic memory. Table 3.3 lists the interface signals. Theopcode
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 37
signal iso bits wide and indicates what operation the mat should perform. This takes the
place of the usual read and write enable signals. The mat address isa bits wide, meaning
the mat stores 2a words. In pointer and gang operations that do not need to provide an
address, we embed other necessary operation specifiers in the address. Pointer operations
embed the pointer number, update enable, and update add/subtract specifier. Gang oper-
ations embed thegangdata field, which is used in conjunction withmaskto determine
which columns to set, clear, or NOP. Compares also use themaskfield to determine which
bits to use in the compare. Themaskfield is one bit wider than the meta-data, because
compares can mask out the main data word as a chunk.
Each word hasd bits of main data andm bits of meta-data. The interface to the IMCN
used for communicating control information to/from other mats is via theebits wideext in
andext out signals. The mat exports control data onext out and receives it onext in. The
matchoutsignal is the result of the comparison. If the operation is not a compare, the signal
remains low. Thevalid signal indicates to the interconnect that the mat has valid data to
output. Thecompletesignal is a general purpose completion signal to tell the computation
that the memory request has executed successfully.
There are many ways to implement this reconfigurable memory mat architecture. The
next section presents a full-custom implementation that was used in the prototype testchip
discussed in Chapter 5.
3.6 Micro-architecture and Implementation
This section examines a full custom circuit design implementation of the reconfigurable
memory mat architecture that emphasizes tight integrationwith the SRAM core, low la-
tency, fast cycle time, and low overhead. Chapter 5 discusses additional details and exper-
imental results from the prototype testchip using this design. The SRAM design uses the
techniques discussed in Chapter 2 for high performance and efficiency.
To a basic SRAM array, we add meta-data and peripheral logic blocks to create a flex-
ible memory mat. The overarching goal for these additional logic blocks is to have small
performance, power, and area overheads, and yet be general and flexible enough to meet
the needs of many memory configurations. We generalize the logic blocks and functions
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 38
as much as is feasible to increase hardware sharing across different configurations and to
increase flexibility.
For this peripheral logic, we primarily used hardwired logic, rather than reconfigurable
logic such as the look-up tables (LUTs) used in FPGAs, because we only need a few,
relatively simple peripheral logic blocks. By using hardwired logic, we ensure that the
logic will be small, fast, and efficient, at the cost of havingsome unused peripheral logic
blocks in some configurations. This also reduces the amount of configuration state and
reconfiguration time, because we don’t have large LUT arraysto program.
The peripheral logic can be broken down into two basic blocks: address logic and
datapath logic. The address logic sits between the address input and the memory array
decoder and operates on the address bits. The datapath logicsits between the data I/O and
the SRAM core and performs logic on the data read from or written to the memory array.
Figure 3.7 shows a block diagram of a generic memory buildingblock. Our design adds
the necessary peripheral logic blocks for the target configurations, as shown in Figure 3.8.
Despite adding this peripheral logic for additional functionality, we wish to maintain
high performance. Thus, we chose an aggressive cycle time goal of 10 fan-out-of-four
inverter delays (FO4) based on the achievable access times for SRAM arrays in the optimal
energy-delay size range. This short clock tick stresses thecircuit design of the memory
cores as well as the peripheral circuits. Due to this aggressive cycle time, the mat access is
pipelined, and the total delay through the mat is 2 cycles or 20 FO4. The first half-cycle is
spent in the pre-access logic: pointer logic or write buffer. The next full cycle is spent in
the SRAM access. The last half-cycle is spent in the post-access logic: control logic and
the comparator. The mat is fully pipelined, accepting a new request every 10 FO4 cycle.
In comparison to our memory system’s 10 FO4 clock cycle, state-of-the-art micropro-
cessors generally run at 15-20 FO4 [94], a typical synthesized custom ASIC at 40 FO4,
and FPGAs at around 100 FO4.1 While a few compute blocks [95] and caches [96][97] op-
erating at approximately 10 FO4 have been demonstrated in commercial microprocessors,
these units were usually internally “double pumped,” operating at twice the main clock
rate. One notable exception is the IBM/Toshiba/Sony Cell processor which does clock at
1While the theoretical maximum clock frequency of FPGAs is usually around 50 FO4, configured FPGAstypically run at around 100 FO4.
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 39
BLs
cell I/O
addrlogic
cell arraym
eta−
data
datapath logic
deco
der
addr data in/outmdata in/out
corecorecoreaddr mdata data
WLs
mdataBLs
data
Figure 3.7: Generic memory structure
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 40
data in/out
cell I/O
cell array
met
a−da
ta
WB
PLA
coremdata
coredata
comparator
write bufferlogicptr
deco
der
RM
W
deco
der
coreaddr
mdataBLs
dataBLs
WL0
WL1
addr
mdata in/out
Figure 3.8: Mat block diagram showing meta-data with support logic (RMW decoder andPLA) and peripheral logic blocks (pointer logic, write buffer, and comparator)
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 41
Table 3.4: Clock cycle comparison for virtual multi-porting
Clock cycle (FO4) Frequency in 0.18µm (MHz) Accesses per cycle
Memory 10 1000 1CPU 20 500 2ASIC 40 250 4FPGA 100 100 10
11 FO4 [98]. The local memory of the Cell processor is pipelined [99] to meet the fast
clock cycle.
Running the memory system at a faster clock rate than the processor allows us to vir-
tually multi-port the memory. During a single processor cycle time, we could access the
memory multiple times, thus emulating the functionality ofa memory that has multiple
read/write ports. Even for a state-of-the-art microprocessor running at 20 FO4, we could
make two memory accesses per cycle with our 10 FO4 memory system. For an ASIC or
FPGA, we could make many more accesses per cycle, as detailedin Table 3.4.
By using a fast single-ported memory, rather than a true multi-ported memory, we can
achieve significant savings in both area and power, especially in systems that demand a
large number of ports [100]. In a virtual multi-ported memory, the accesses are staggered,
because the accesses are not truly simultaneous. This is less of a factor in systems where
the memory runs much faster than the processing units, sincethe memory accesses are only
staggered by a small amount of time (i.e. one memory clock cycle) from the processing
unit’s perspective.
To meet our aggressive access and cycle time goals, we employthe highly optimized
circuit techniques used in modern SRAMs for high performance and low power through-
out the design. In this way, the peripheral logic can match the performance and power
characteristics of the memory core. The next section focuses on the implementation of the
meta-data bits, because they are the key feature that enables the configurable memory and
required the most circuit innovation.
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 42
3.6.1 Meta-data
We tightly integrate the meta-data with the main data SRAM array to share as many re-
sources (e.g., decoder, replica control path) as possible for low overhead and high-performance.
Thus, we implement the meta-data using additional storage cells in the main SRAM array
(Figure 3.8). The meta-data bit cell, shown in Figure 3.9, isan explicitly two-ported cell to
efficiently support RMW operations.
M0
Vdd Vdd
a a_b
b0 b1 b0_bb1_b
WL0
gset gclr
WL1
M1
Figure 3.9: Meta-data bitcell
All normal accesses use port 0 (WL0, b0, b0b). The main decoder drivesWL0which
goes to all meta-data cells and all single-ported normal data cells in a row. Port 1 is a
special port used by read-modify-write operations. Note that the special wordlineWL1
goes only to the meta-data cells as shown in Figure 3.10. In our prototype implementation,
the single-ported data cell has a free horizontal metal track, which allows the meta-data cell
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 43
WL1
WL0
DataMeta−data
2−portcell
1−portcell
Figure 3.10: One mat cell row
to be laid out in the same cell pitch as the single-ported datacell, using the free metal track
for WL1 in the meta-data cells.
As was discussed earlier, we support a number of special operations on the meta-data.
Gang operations allow us to operate on all bits of the meta-data on a per-column basis in
a single cycle. Any meta-data column can be gang set, gang cleared, or left alone (NOP).
To implement gang operations, we add two additional devicesin the meta-data bitcellM0
andM1. All cells in a column share thegsetandgclr lines. Assertinggsetsets all cells in
a column to 1. Assertinggclr clears all cells in a column to 0. Figure 3.11 shows the gang
I/O logic which generates thegsetandgclr signals.Gangen is simply the gang operation
enable signal.2
The additional wordline and gang control lines can increasethe size of the meta-data
bitcell. In some cell designs there is a free horizontal metal track, in which case, the cell
height can remain the same as the normal single-ported data cells. The cell may then
become wider due to the addition of the gang control lines, but will not adversely affect
the vertical cell pitch. If there is no free metal track the meta-data cell will be taller than
the normal data cells, and so if we wish to have the meta-data and data cells in the same
row, the row vertical pitch will have to increase. This decreases the area efficiency of the
memory, because the normal data cells are now no longer theirminimum size. To avoid
2The same functionality could also be achieved by grounding the Vdd of one side of the cell. This maybe more efficient, because it does not require additional devices or control lines in the cell. However, it doesrequire that the cell Vdd’s and Gnd’s run vertically. The cell used in the testchip used horizontal Vdd’s andGnd’s, so this technique was not adopted. Some modern cell designs do run Vdd and Gnd horizontally, andthus could use this technique to implement gang operations.
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 44
gdata[n]
gmask[n]
gclr[n]gset[n]
gang_en
Figure 3.11: Gang operation I/O logic for one column
this problem, we can interleave the meta-data cells (see Figure 3.12) to keep the normal
data cell pitch at a minimum.
1 = word 10 = word 0
0 0
0 0 0 0
1 1
1 1 1 1
Data cellsMeta−data cells
Figure 3.12: Two cell rows using interleaved meta-data bitcells
Conditional Operations
Conditional operations require additional logic in the meta-data cell. A conditional opera-
tion based on an internal condition is contingent on a match of themdatain with the stored
value’s meta-data. This could be performed as a pipelined 3-cycle operation, similar to a
read-modify-write. However, because a conditional operation’s final access may read or
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 45
write the main data, we would either have to stall the mat or add a second port to the main
data word.
As an alternate implementation, we could add CAM-like compare logic to the meta-
data bit cells and gate the main wordline based on the match result. Figure 3.13 shows the
modified meta-data bit cell. The I/O logic would drive themdatain value to be matched
against on theK andK b lines masked by themaskinput field. All meta-data bits masked
out would drive a 0 on bothK andK b preventing a mismatch on that bit position. This
implementation takes the match out of the critical path, because the match occurs in parallel
with the main decode.
The wordline driver will be slightly slower due to the additional input of the match
result. The additional match circuitry and signal lines increase the size of the meta-data
bitcell. It is unlikely that the cell can be layed out in a normal single-ported cell pitch, and
we would have to use an interleaved meta-data cell arrangement as shown in Figure 3.12.
We would need to detect whether the operation successfully completed and return this result
to the processor via the general purpose completion signal.This could be implemented
using a wide OR of the wordlines. While this enables conditional reads and writes based
on internal conditions, conditional gang operation are a special case discussed in the next
section.
Conditional Gang Operations
Conditional gang operations also require special support circuits in the meta-data. Since
most configurations that use conditional gang operations only need one bit conditionally
cleared, we limit our conditional gang implementation to that function. We link two meta-
data columns, with one column (e.g. md[1]) as the control column and another column (e.g.
md[0]) as the target column. On a conditional gang clear, the target column is cleared if the
meta-data bit in the condition column is 1. This is the same function previously described
in Table 3.2.
A conditional gang clear can be implemented by linking two columns with a two NMOS
pulldown stack (Figure 3.14). Whencgangis asserted, if the cell inmd[1] is 1, then the
pulldown stack connects the a storage node ofmd[0] to Gnd, writing a 0 into the cell.
An alternate method of implementing conditional gang clearonly uses a single NMOS
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 46
ML
Vdd Vdd
a a_b
b0 b1 b0_bb1_b
M1 M0
WL1
WL0
aa_b
K
gset
K_b
gclr
Figure 3.13: Meta-data bit cell with embedded match circuit
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 47
md[0]
cgang
md[1]
Figure 3.14: Conditional gang clear implementation using two transistors
cgang
md[0]md[1]
Figure 3.15: Modified conditional gang clear implementation using one transistor
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 48
Table 3.5: Modified conditional gang clear truth table
md[1] md[0] md[1]’ md[0]’0 0 1 00 1 0 11 0 1 01 1 1 0
device to connectmd[1] b to md[0] with the device gate controlled bycgang(Figure 3.15).
This performs the desired logic function onmd[0], but in the case wheremd[0] is 0, and
md[1] is 0, a 1 will be written intomd[1]. If md[1] will be gang cleared before further use
(e.g.the Modified bit in Hydra is cleared on a speculative thread backup after the conditional
gang clear of the Valid bits [93]) then this push-back frommd[0] is a benign side-effect.
Table 3.5 shows the truth table for the modified gang clear operation.
Because all other operations on the meta-data bits support either true or complement
operations, having a uni-directional conditional gang operation still makes the meta-data
bits logically complete. The sense of any of the bits could always be inverted without the
loss of any functionality.
Read-Modify-Write Decoder
To support read-modify-write operations, we need an additional special decoder, called the
RMW decoder, and a reconfigurable logic block to modify the meta-data bits. The latency
of a read-modify-write operation is much longer than a single memory access, because it
requires two memory accesses, a read and a write, plus the time for the modify logic. We
pipeline read-modify-write operations to avoid adverselyaffecting the memory mat cycle
time. Additionally, since a new operation can arrive each cycle, we use the second port
of the meta-data bits for the RMW write so that it does not conflict with the incoming
operation.
Thus, a read-modify-write operation is a pipelined, 3-cycle latency operation. In the
first cycle, a standard read operation reads a word out of the array. In the next cycle, a
reconfigurable logic block operates on the read-out meta-data bits. In the final cycle, the
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 49
output of the reconfigurable logic block is written back intothe meta-data of the word
accessed two cycles ago. From the requester’s standpoint, this operation is atomic: any
subsequent read of the word will retrieve the updated meta-data and no subsequent write
will be over-written by the RMW writeback (i.e. no WAW hazard). The first cycle operation
could also be a compare or even a write.
The RMW decoder remembers which word was accessed during theinitial cycle of a
RMW operation and drives the meta-data second port wordlineduring the write cycle. To
minimize the number of forwarding paths, we want a simultaneous main read and RMW
write to return the newest meta-data value. Thus the RMW decoder must fire the second
port wordline before the main wordline activates. This needfor fast RMW decoder oper-
ation eliminated the possibility of latching the read address and re-doing a full decode for
the RMW write. Instead we chose to conditionally latch the wordlines using a crosscoupled
inverter storage cell sized as an SRAM cell (Figure 3.16) on the RMW read. Figure 3.17
shows the timing diagram for the RMW decoder operation.
rst
Vdd
WL0
n0
store
store_b
r2
WL1
M1
M2
M3 M4
rmw_b
Figure 3.16: RMW decoder
During the RMW read cycle, the RMW I/O logic assertsrmw b low. In the accessed
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 50
n0
rst
store
rmw_b
WL0
WL1
r2
clk
read modify write
Figure 3.17: RMW decoder timing diagram
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 51
row, the main decoder pulsesWL0which writes a 1 into the storage cell viaM1. The rst
signal stays low in this cycle. During the modify cycle,rmw b is held high, andrst stays
low. In the write cycle,rst rises which bringsn0 low throughM2 in the row that was
accessed in the read cycle.N0 going low assertsWL1 and activates the self-reset delay
chain. The delay chain drivesr2 low, turning onM4. M4 then resetsn0 andWL1. As r2
goes lowM3 clears the storage cell. The decoder can now accept another RMW operation.
In rows accessed in the read cycle, the storage cell undergoes a “read” whenrst asserts.M3
must be carefully sized to avoid excessive power dissipation.
The rst signal is not just a delayed version of thermw b signal, because a write to
the same word during the modify cycle aborts the writeback ofthe modified data. This
condition must be checked in therst drive logic. Because the writeback occurs early in
the cycle, a write to the same word during the writeback cyclewill overwrite the modified
data. Because the mats are fully pipelined, we can have threeRMW operations in-flight at
any given time. Thus, we need three storage cells per row (Figure 3.18), and there are three
sets ofrmw b andrst signals,rmw b[2:0] andrst[2:0] . Which cell and signal pair is used
rotates on a per cycle basis. On system reset, all RMW decoderstorage cells are reset to 0
via a reset NMOS device (not shown).
Reconfigurable PLA
The other support block for RMW operations besides the RMW decoder is the reconfig-
urable logic block that performs the modify logic. There aremany ways to design a re-
configurable logic block, but we chose to implement this block using a reconfigurable PLA
(Figure 3.19), because PLAs are dense array structures suitable for small logic functions.
The reconfigurable PLA is a NOR-NOR PLA (Figure 3.20), with the first NOR plane im-
plemented as a ternary CAM, and the second NOR plane implemented as an SRAM with
a special logic line. Because the PLA is basically a CAM and anSRAM, we can apply the
fast, low-power SRAM circuit techniques seen in Chapter 2 tothe PLA implementation.
The PLA can perform three operations: configuration read, configuration write, and
logic. Configuration reads and writes allow requests to readand write the CAM and SRAM
storage locations to change the programmed logic function.The logic operation performs
the programmed logic function on the incoming meta-data bits. During a logic operation,
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 52
rst[2:0]
Vdd
store[1]
store_b[1]
store[0]
store_b[0]
store[2]
store_b[2]
WL0
n0
r2
WL1
rmw_b[2:0]
Figure 3.18: Pipelined RMW decoder
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 53
LWL
logic
TCAMTCAMTCAMTCAM
TCAMTCAMTCAMTCAM
SRAM SRAM
SRAM SRAM
rml_b
LWL
replica matchline
ML
k_bk
ML
z[n] o[n]
Figure 3.19: PLA block diagram
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 54
out2
programmableconnection 1st NOR plane 2nd NOR plane
in0 in1 in2 out0 out1
Figure 3.20: Example 3-input, 3-output NOR NOR PLA structure
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 55
the CAM performs a match operation and the SRAM uses its special logical wordline for a
wired-NOR read.
b0_bb0k
ML
b1_b
o o_bz_bz
WL
k_b b1
Figure 3.21: Ternary CAM trit cell
The NOR-style ternary CAM uses self-resetting matchlines,and the activation of the
second NOR plane is timed using a replica matchline. Each CAMtrit cell consists of two
6T storage cells and two NMOS pull-down stacks (Figure 3.21). The CAM’s relatively
small size allows for full-swing matchlines. Each row of trit cells forms a NOR gate with
the matchline as the output. The ternary CAM cells pull down the matchline on a mis-
match, and after a delay, self-reset circuits restore any mismatched matchlines to Vdd. The
self-resetting matchline scheme will save power over a global matchline precharge scheme
provided the configurations have relatively low matchline activity factors. Table 3.6 shows
the correspondence between the stored values and logical meaning.
The activation of the second NOR plane is based on a replica matchline. One of the
challenges in a NOR-style CAM design is to know when the matchlines have settled. In the
rows that match, the matchlines remain high, but in the rows that mismatch, the matchlines
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 56
Table 3.6: Ternary CAM stored value meaning
z o Value Action
0 0 * Match on all0 1 1 Match on 11 0 0 Match on 01 1 Null Mismatch on all
are pulled low by some number of CAM cells. The falling delay of the matchline varies
depending on the number of cells in the row that mismatch. By using a replica matchline
that only mismatches on one cell, we replicate the worst-case matchline pull-down delay.
Because both the actual matchlines and replica matchlines are full-swing signals, there
is no signal amplification needed between the actual and replica matchlines as in SRAM
replica bitlines [35].
The replica matchline generates a positive pulse on therml b line that mimics the delay
of the worst-case matchline going low on a mismatch. The interface gate, a static CMOS
NAND followed by an inverter, ANDs together the matchlines and rml b to generate the
logical word line signalsLWLswhich go to the SRAM array. Figure 3.22 shows a timing
diagram of the ternary CAM and interface gate operation.
To ensure that theLWLsof all mismatched words do not erroneously glitch high, the
matchline low time must be long enough to fully encompass therml b positive pulse. We
designed the replica matchline to be 1 FO4 delay slower than the worst-case real matchline
delay to ensure that rising edge of therml b positive pulse occurs after the falling edge of
even the slowest falling matchline. We also must ensure thatthe matchlines do not reset
too early, thus causing a glitch on the back-edge ofrml b. In the worst-case, a row could
mismatch on all cells and thus the matchline would be pulled low as quickly as possible
starting the matchline self-reset chain as early as possible. The matchline self-reset delay
must be long enough to ensure that the matchline low time willstill fully encompass the
rml b pulse under these conditions.
On the matchlines, we want both a long pulsewidth (i.e. long low time) and fast cycle
time. If we used a standard self-reset circuit (Figures 3.23and 3.24), in order to get a long
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 57
match mismatch
k
k_b
ML
rml_b
LWL
logic
Figure 3.22: PLA ternary CAM timing diagram
pulsewidth, we must unnecessarily extend the cycle time to avoid a drive fight. After the
matchline has been reset, we must traverse the delay chain a second time beforeM1 is
turned off and we can pull-down the matchline again. If we do not wait, then there will
be a drive fight betweenM1 and the CAM cell(s) pulling the matchline low. The second
traversal of the delay chain uses the opposite transition from the first traversal. For a long
pulsewidth and short cycle time, we need to skew the delay chain for a slow assert edge, to
get a long pulsewidth, and a fast de-assert edge, to get a fastcycle time. One way to do this
is to skew the P/N ratios of the inverter in the delay chain. However, this causes extremely
slow slew rates on the slow transition which burn short-circuit current and are susceptible
to noise.
Instead, we replace the second-to-last inverter with a NOR gate (Figures 3.25 and 3.26).
The delay chain behaves as normal during the assert transition. However, on the second
traversal, once the matchline reaches the logic threshold of the NOR gate, the NOR gate
fires, skipping over the first two inverters. This shortens the delay chain de-assert time
to two gate delays: one NOR gate and one inverter. We leave in the NOR gate and final
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 58
ML
VddVdd
r0 r1 r2 r3
k0M1M0
Figure 3.23: Normal self-reset circuit
M1 off
ML
r0
r1
r2
r3
Figure 3.24: Normal self-reset circuit timing diagram
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 59
M0
VddVdd
r0 r3
k0
r1
r2
M1
ML
Figure 3.25: Fast reset off self-reset circuit
r3
M1 off
ML
r0
r1
r2
Figure 3.26: Fast reset off self-reset circuit timing diagram
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 60
inverter to allow the matchline time to reset all the way to Vdd from the NOR gate logic
threshold. There is an additional keeper device (M0) on each matchline to ensure that the
matchline reaches Vdd. By using this technique we can have a long pulsewidth, fast cycling
matchline, while maintaining fast edges on all transitionsof the delay chain.
bit_b
WL
LWL
bitlogic
a a_b
Figure 3.27: PLA SRAM cell
The LWLs generated from the matchlines andrml b activate the second NOR plane
implemented as an SRAM with a special logic output line. Eachbitcell of the SRAM
array (Figure 3.27) has a special logic line (logic) that is shared among all cells in a col-
umn. Within a column, all cells that store a one form a wired-NOR of the logical word
lines (LWLs), with the logic node as the output.Logic is a full swing signal which is
precharged early in the cycle, during the CAM evaluation. The small number oflogic
lines and their light loading made the precharge overhead relatively small and a reasonable
trade-off against the increased area of self-reset delay chains.
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 61
3.6.2 Peripheral Logic
In addition to the meta-data and associated support logic, we add three peripheral logic
blocks to a basic SRAM array: the pointer logic in the addresspath for pointer operations,
the comparator in the datapath for comparisons, and a write buffer in both the address and
data paths for conditional writes on external conditions. With the aggressive cycle and
access time targets for the mat, the implementation of even very logically simple structures
such as these require advanced circuit techniques.
Pointer Logic
The pointer logic consists of two small SRAMs and an adder/subtracter (Figure 3.28). A
dual-ported SRAM stores the pointer address values, and a single-ported SRAM stores the
strides, one stride per pointer. The “address” input to these memories is the pointer number
specified by the operation.
On a pointer operation, the pointer SRAM reads out the address corresponding to the
named pointer and sends it to the main decoder. The stride SRAM simultaneously reads
out of the associated stride. The adder/subtracter calculates updated pointer address value
from the pointer address and stride. We then optionally write the updated pointer address
back into accessed pointer stored in the pointer address SRAM.
We dual-port the pointer address SRAM to allow simultaneouspointer address read and
write-back of an updated pointer value. The SRAM properly handles the write-through case
when the same pointer is simultaneously updated and read by outputting the updated pointer
address value to the read. This allows us to do back-to-back pointer operations using the
same pointer number. Because the strides do not need to be updated, the stride SRAM is
only single-ported. The pointer address values and stridesare read- and write-able via the
configuration read and write instructions.
For a simple FIFO, we only need to store two pointers, the headand the tail, with each
of their strides set to 1. However, by storing more than two pointers per mat, we can enable
using a single mat as multiple FIFOs. For example, the testchip implementation stores four
pointers per mat, which allows us to hold two FIFOs in a singlemat. We could use pointers
P0 andP1 for FIFO 0 andP2 andP3 for FIFO 1. By using a stride of 2 and offsetting the
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 62
to main decoder
strideSRAM1−port
add/sub
numberpointer
strideptr addr
updated pointer
read portwrite port
ptr SRAM2−port
Figure 3.28: Pointer logic block diagram
starting head/tail values by one, we can interleave FIFO 0 and FIFO 1 in a single mat, with
FIFO 0 occupying all even addresses, and FIFO 1 occupying allodd addresses. Figure 3.29
illustrates this configuration.
We can also enable FIFOs larger than one mat by storing a pointer address value that
is longer than necessary to address the words of one mat. All requests to the FIFO are
multicast to all mats in the FIFO, and the mats range check theupper bits of the pointer to
determine which mat the request should access. All mats keeptheir pointer logic in lock-
step, so even if the mat is not accessed, the pointer logic is updated. The number of extra
bits in the pointer addresses determines the maximum FIFO size. For example, a pointer
that has two extra bits allows for a maximum FIFO size of four mats. Chapter 4 discusses
how the interconnect structure can multicast to all the matsin the FIFO. Figure 3.30 shows
a single FIFO that spans four mats.
Maskable Comparator
The maskable comparator implementation is relatively simple, differing only slightly from
regular comparators, on which there is a large body of work [101][102][103][104]. Our
comparator differs in that it can mask out select fields of thedata, treating them as don’t
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 63
word 0
word 1
word 2
word 0
word 1
word 2
Addr
0
1
2
3
4
5
FIFO 0
FIFO 1
Stride = 2
Stride = 2
Figure 3.29: Storing two FIFOs in one mat
00 ... 00
Address rangeof mat
Top 2 bits arerange checked
Mat 0 Mat 1 Mat 2 Mat 3
11 ... 11
00 ... 0011
1111 ... 1110
10 00 ... 0000 ... 0001
01 11 ... 1111 ... 1100
00
Figure 3.30: A FIFO spanning four mats
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 64
cares. We can mask out any combination of the meta-data bits and the main data as a chunk.
The comparator (Figure 3.31) uses a single mask bitmask[0] for all of the main data, but
individually masks each bit of meta-data . We implement the final wide fan-in AND gate
as a tree of smaller fan-in gates using the principles of logical effort [32].
data1[d−1]data0[d−1]
mask[1]mdata1[0]
mask[k]
mdata0[0]
mask[0]
data0[0]data1[0]mask[0]
match_out
mdata1[m−1]mdata0[m−1]
tree
Figure 3.31: Maskable comparator gate-level diagram
The comparator is implemented as a tree of dynamic gates. We could embedded the ini-
tial comparison XOR gate in the bitline column muxes, but this introduces more complexity
in the column mux circuits. Depending on the implementation, this could also introduce
another series device in the bitline path.
Write Buffer
To enable writes contingent on an external condition, we adda write buffer in both the
address and data paths. On a conditional write, the write buffer stores the incoming write
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 65
address, data, and opcode. When the external condition evaluates, the write buffer invali-
dates the entry if the condition fails, or allows the entry towrite into the main data array in
a pipelined fashion if the condition is met. This occurs in the same way that some cache
designs pipeline writes to allow single cycle cache write hits [105] as described below.
In a traditional cache implementation, a cache write operation must serialize the tag
check and the write into the data array. Because the write into the data array is an unre-
coverable change of state, we cannot execute it until we knowthat the access is a cache
hit. The serialization of the tag check and data write increases the latency of a cache write
significantly.3
A well-known technique to hide the data write dependence on the tag check is to
pipeline the cache write into two logical stages: tag check,data write. Because not ev-
ery access is a write, the operations are not pipelined on a per cycle basis, but rather on a
per write basis. This technique requires a small write buffer in the data array to store the
incoming write data until its tag check has completed. The number of storage locations in
the write buffer depends on the mat implementation, specifically on the latency of the tag
check. The write buffer must be searchable to ensure that anyreads or compares to the
stored write address that occur before the data can be written into the main array behave
properly.
3.7 Summary
In this chapter, we explored the motivation for a reconfigurable memory system and our
proposed memory architecture. This chapter also examined the design of the reconfigurable
memory mat that forms the core of the memory system. By addingthe meta-data bits and a
few peripheral logic blocks to a simple RAM array, we can create a flexible memory block
that can emulate a portion of a cache, a FIFO, or scratchpad memory. We have proposed
an implementation of that architecture using full-custom circuit design techniques for high
performance and efficiency. The next chapter will detail theremaining portions of the
3For highly set-associative caches, designers take advantage of this dependence to save power. Theyintentionally serialized the tag check with the data access(even for reads) to reduce the number of data arraysthat are activated fromn (number of ways) to just 1 (hit) or 0 (miss in all ways).
CHAPTER 3. RECONFIGURABLE MEMORY ARCHITECTURE 66
reconfigurable memory system: the inter-mat control network, the memory to processor
interconnect, and the processor interface logic.
Chapter 4
Interconnect Networks
As seen in the previous chapter, because memory structures use similar building blocks, we
can use a generalized memory mat to form the core of our reconfigurable memory system.
However, memory systems do differ significantly in the way that they connect the memory
blocks to the computation and to each other. For example, in acache, each memory request
goes to both the tag and data array, and the tag array communicates hit/miss information
to the data array. However, in a simple scratchpad memory, each memory request only
goes to one memory block, and the memory blocks that make up the scratchpad never
communicate with each other. So, for our configurable memorysystem that can emulate
multiple types of memories, we need a configurable interconnect network between the mats
and the processor and between the mats themselves.
The interconnect network must handle two types of communication, mat-to-mat and
mat-to-processor, that have quite different requirementsand characteristics. Mat-to-mat
communication is primarily one-to-many (e.g.a tag mat sending its hit/miss information to
multiple mats in the cache data array), and the width of this data is typically very narrow,
consisting of only one or two bits. Additionally, the mat-to-mat communication is unidi-
rectional, requiring only outgoing data pushes without anyreplies from the recipients. The
latency, however, must be extremely low, because the communication can be in the critical
path of the memory structure.
67
CHAPTER 4. INTERCONNECT NETWORKS 68
On the other hand, mat-to-processor communication is bi-directional and many-to-
many. There can be multiple concurrent outgoing requests from the processor, all to dif-
ferent mats, as well as multiple returning replies from the mats. The requests and replies
are wide, composed of data, meta-data, and possibly opcode and address fields. While the
latency of the mat-to-processor communication must be low,because it can be in a critical
loop that determines the minimum processor cycle time, it still can be on the order of a
cycle.
Opcode
Data in/outAddressOpcode
Requestport
mat
Control bits
IMCN
Processor Interconnect Network
Processor Interface Logic
Data in/outAddress
Figure 4.1: Interconnect overview
Because the mat-to-mat communication and mat-to-processor communication differ
significantly in their characteristics and requirements, we separate the interconnect into
two networks: the mat-to-matinter-mat control networkand the mat-to-processorproces-
sor interconnect network. Figure 4.1 shows an overview of the interconnection networks.
We implement the inter-mat control network as multiple segmented buses and the processor
interconnect network as a pair of crossbars. However, before delving into the specific archi-
tecture and implementation of the networks, we first review some relevant issues regarding
CHAPTER 4. INTERCONNECT NETWORKS 69
interconnection network design.
4.1 Interconnection Network Design
For many years, interconnection networks have been used to connect computers and dis-
crete processors [106][107]. Today’s die capacities, however, allow multiple processor
cores and large module blocks such as memories to fit on a single die. Thus, researchers
have proposed on-die interconnection networks to connect processors and large modules
together rather than hardwired global routing [108][109][110][111]. These networks offer
a more structured communication environment than custom global routing, which allows
for more design re-use and easier application of custom circuit techniques for lower power
and higher performance.
While the concept of a “network on a chip” to connect large modules together has
gained popularity, the use of interconnection networks between processor functional blocks
has been quite limited [112]. Most functional blocks are still connected together with fixed
routing or buses. While buses do offer a some communication flexibility, they do not
perform well under communication patterns with multiple concurrent requests [113]. For
increased request concurrency, some designs use split transaction buses where the requester
releases the bus while waiting for the reply [105][85], and other transactions can use the
bus for requests or replies in the mean time.
Another technique to improve bus bandwidth is to physicallysegment the bus and allow
concurrent transactions on different segments [114][115][116]. This technique can achieve
high concurrency for communication patterns where units communicate locally. Bus seg-
mentation can also reduce the power dissipation of the bus bylimiting the portion of the
bus that is activated.
Buses are very area efficient, but as the number of concurrenttransactions increases,
they are no longer the optimal interconnection topology. A crossbar interconnect topology
handles multiple concurrent transactions better than a bus-based system [113][117] because
it is a fully non-blocking network. Any of the inputs can be connected to any of the outputs
as long as both the input and output are free. A crossbar is a one-hop network (i.e. has no
intermediate stages between the input and output) and thus can have low latency. As per
CHAPTER 4. INTERCONNECT NETWORKS 70
convention, we will denote a crossbar withN inputs andM outputs as anN x M crossbar,
with each possible connection point known as a crosspoint. An N x M crossbar would have
N*M crosspoints. Figure 4.2 shows an example 4 x 6 crossbar.
out[5]
in[0]
in[1]
in[2]
in[3]
crosspoint
out[0] out[1] out[2] out[3] out[4]
Figure 4.2: 4 x 6 crossbar
Crossbars topologies have been used in many networking switching architectures [107][106],
and most previous crossbar implementations have either been stand-alone designs for net-
working switches [118][119][120] or on-chip interconnection networks connecting proces-
sors to main memory [121][122][123][124]. Crossbars that interconnect individual func-
tional blocks are much rarer [112][125][126], although, one very common crossbar-like
interconnection structure between functional units is a pipelined processor’s functional unit
bypass network [105][127][128].
While a crossbar does support multiple concurrent requests, conflicts can occur if more
than one input requests the same output. Telecom crossbars avoid conflicts using complex
input queues, arbiters, and scheduling logic [129][130][131][132]. The arbitration and
scheduling add to the latency, area, power, and complexity of the design. The delay from
input to output is now dependent on the request pattern and isnon-deterministic. This is
tolerable in networking switching applications, but for fixed pipeline, inter-functional unit
CHAPTER 4. INTERCONNECT NETWORKS 71
networks a deterministic latency that is as low as possible is desirable. An alternative to
dynamic scheduling is to statically schedule the crossbar.This avoids the added latency
and complexity of arbitration, while still allowing multiple concurrent requests [122].
Despite its larger area, a crossbar can be more energy efficient per transaction than a
segmented bus architecture, due to the large amount of concurrency available [137]. One
way to significantly decrease the energy consumption of a crossbar is to use low-swing in-
terconnect techniques [56][55][133][134], because a crossbar has a large number of long,
high-capacitance wires. To further reduce the energy dissipation, we can segment a cross-
bar to decrease the wire capacitance driven [135][124].
Besides crossbars, there are many other non-blocking network topologies and hybrid
topologies that offer a different trade-offs between area efficiency, energy consumption, and
performance [106][107]. For the inter-mat control networkand the processor interconnect
network, we chose topologies that seemed most suitable to the communication patterns
on those networks. Next, we will examine an implementation for the inter-mat control
network.
4.2 Inter-mat Control Network
The inter-mat control network allows the mats to pass a few bits of control information to
each other. For example, in a cache configuration, the tag matmust pass the hit/miss infor-
mation to the data mat, so that the data mat knows whether or not to abort its operation. On
a hit, the data mat proceeds with its operation, but on a miss,it aborts. We implement the
IMCN as multiple, 1-bit, segmented buses. Figure 4.3 shows an example implementation
with four buses with four mats per segment. The IMCN is well suited to this implemen-
tation, because it is a network with uni-directional, one-to-many communication. The bus
segments are set at configuration time via configuration registers.
We use full-swing wires in the IMCN, for a number of reasons, First, it is a relatively
short-haul network, and the overhead latency of the low-swing driver and receiver are not
tolerable. Second, because the latency is a fraction of a cycle, there is no convenient clock
edge off of which to trigger the sense amplifiers. Finally, the IMCN does not contribute
significantly to the overall power dissipation of the reconfigurable memory system, so the
CHAPTER 4. INTERCONNECT NETWORKS 72
bus
Mat
bus[0]
bus[1]
bus[2]
bus[3]
ext_out[1:0] ext_in[1:0]
1segment
segmenter
Figure 4.3: Inter-mat control network
overall savings from reducing the IMCN power would not be significant.
Each mat has a two-bit output to the IMCN,ext out, and a two-bit input from the IMCN,
ext in. A number of signals are multiplexed to generate the two bitsof ext out as seen in
Figure 4.4. Similarlyext in is demultiplexed to a number of locations as shown in Figure
4.5 The inputs and outputs are statically configured via configuration registers.
The IMCN bus drivers can implement a wired-OR function on thebus. The driver is
a pull-down only driver and thus can form a wide OR gate with any other active drivers
on the bus segment. Figure 4.6 show the bus driver and pull-upcircuit. The wired-OR
functionality is useful for aggregating control information. For example in a multi-way
cache configuration, by OR-ing the hit/miss signal from every way, we can generate the
global hit/miss for the cache.
The pre-charged IMCN bus line is vulnerable to signal coupling, but the weak keeper
helps mitigate coupling problems. If the IMCN lines are still deemed too vulnerable, shield
wires could be routed next to each IMCN line to reduce coupling effects. Given the small
number of IMCN lines, the additional area penalty of the shield lines is not too great.
CHAPTER 4. INTERCONNECT NETWORKS 73
control outputsconfigurationregisters
bus[0]
bus[1]
bus[2]
bus[3]
from mat core
ext_out[i]
Figure 4.4:Ext out logic
ext_in[i]
configurationregisters
bus[0]
bus[1]
bus[2]
bus[3]
to mat core
Figure 4.5:Ext in logic
CHAPTER 4. INTERCONNECT NETWORKS 74
weak keeper
Vdd
pq_b
Vdd
config[i]
ext_out[i]
bus[n]
Figure 4.6: IMCN wired-OR driver and pull-up circuit
CHAPTER 4. INTERCONNECT NETWORKS 75
4.3 Processor Interconnect Network
In our reconfigurable memory system, the processor interconnect network takes the place of
the fixed interconnection between the computation and the memory in hardwired systems.
The processor issues memory requests from request ports in the interface logic. Any port
can send a request to any memory mat. The mat always returns the reply to the port that
issued the request. Section 4.3.4 describes the interface logic request ports more fully.
We implement the processor interconnect as a pair of uni-directional crossbars as shown
in Figure 4.7. For the sake of clarity in the figure, the crossbars are shown as distinct, but
in the implementation, they are interleaved into one block.Therequest crossbarroutes the
requests from the processor ports to the memory mats, and thereply crossbarroutes the
replies from the mats back to the requesting processor port.
We chose to use crossbars for the processor interconnect, because they support mul-
tiple concurrent transactions, have low, one-hop latency,and offer high flexibility in the
interconnections. We made the design decision to use two uni-directional crossbars rather
than a single bi-directional crossbar to simplify both the circuit design and system archi-
tecture at the cost of some area. The circuit design for a uni-directional crossbar is simpler,
and we avoid the additional delay and energy from driving theparasitic capacitance of the
drivers and receiver of the unused signaling direction. At the system level, a bi-directional
crossbar would require more careful memory access scheduling, especially for requests
that both push and pull data, such as a compare. This could require arbitration for cross-
bar and cause the memory access latency to be non-deterministic, further complicating the
processor interaction with the memory system.
4.3.1 Request Crossbar
The request crossbar is a dynamically-routed, multicast-capable crossbar responsible for
routing requests from the processor ports to the memory mats. Each request port can send
a request to any mat or group of mats. Every request has amat ID field that determines the
destination of the request. This routing based on themat ID can be used as the high-level
decode for multi-mat memory structures, similar to the block selection mechanism of large
SRAMs [53][25][31]. Figure 4.8 shows a block diagram of the request crossbar with each
CHAPTER 4. INTERCONNECT NETWORKS 76
Mat MatMat Mat
Port Port Port Port
Request
Request Reply
Reply
Request Crossbar
Reply Crossbar
Figure 4.7: Processor interconnect overview
crosspoint detailed in Figure 4.9. Table 4.1 lists the request crossbar I/O signals.
Themat ID andmat ID maskdetermine the request destination mat(s). Thepayloadis
the request itself that goes to the mat. It contains all the mat input fields from the processor:
opcode, address, mask, mdatain, anddata in. Thevalid bit indicates whether or not the
outgoing request is valid. Thereply bit alerts the reply crossbar scheduler that the request
has reply data. The crosspoint will then schedule a reply data packet transfer on the reply
crossbar.
To avoid request conflicts, we statically schedule which ports have access to which mats
in the processor interface logic. The request crossbar still dynamically routes the requests
based on themat IDfield, so the ports have some freedom in which mats to send request to
and can use the crossbar for high-level decoding in multi-mat memory structures. However,
there is no need for queuing, arbitration, or scheduling logic, and the delays through the
crossbar are fixed and deterministic.
An example of static mat scheduling would be splitting the access to a multi-processor
CHAPTER 4. INTERCONNECT NETWORKS 77
Memory Mats
Port[0] Port[1] Port[2] Port[3]
Processor Interface Logic
Figure 4.8: Request crossbar
Table 4.1: Request crossbar I/O signals
Name Width Description
mat ID i Which mat to send the request to?mat ID mask i Allows for wildcarding in the mat IDpayload p Request to the memory matvalid 1 Is this a valid requestreply 1 Does this request have return data?
CHAPTER 4. INTERCONNECT NETWORKS 78
mat ID / valid
=maskablecomparator
enable
mat ID mask
payload
to mat
crosspoint ID
Figure 4.9: Request crossbar crosspoint
CHAPTER 4. INTERCONNECT NETWORKS 79
cache tag between the normal access and the snoop access. Every even cycle would be
dedicated to the local processor cache access and every odd cycle to the snoop access. If
we ran the memory and crossbar at twice the rate of the processor core, the processor would
still get one cache access per processor cycle. This is an application of virtual multi-porting
described in Chapter 3.
The downside is that static mat scheduling potentially wastes bandwidth on low us-
age requesters, and we need to ensure that there is enough bandwidth to the memories
to satisfy all requesters under the static allocation policy. If the memory mats and cross-
bar have more bandwidth available than most applications need, statically allocating the
bandwidth, although wasteful in some cases, may not adversely affect the application per-
formance. Also, because the processor interconnection network emulates the static inter-
connect found in hardwired memory systems, we do not expect configurations to require
dynamic scheduling. However, using static mat allocation is not fundamental to the design,
and the reconfigurable memory system could be implemented using dynamic scheduling at
the cost of increased design complexity.
4.3.2 Reply Crossbar
The reply crossbar routes memory replies from the mats back to the processor ports. The
crossbar always routes the reply back to the requesting port. Thus, we can schedule the
reply crossbar based on the request routing and which requests will have replies as indicated
by thereplybit in the request. Because the mat latency is fixed, the scheduling of the reply
crossbar is simply a time delayed version of the request schedule. Figure 4.10 shows an
overview of the reply crossbar, and Figure 4.11 details a reply crossbar crosspoint. Table
4.2 lists the I/O signals for the reply crossbar.
The contents of the reply packet depend on the request operation. If the request is any
form of a read (e.g. read, compare, read-modify-write), then the reply contains data and
meta-data. Writes and gang operations do not return data or meta-data. Every request
operation returns three control bits,valid, match, andcomplete, in the reply packet.
The request crossbar can multicast requests to multiple mats, but there can be only one
valid reply to a multicast request. The reply crossbar multiplexes the outputs of the accessed
CHAPTER 4. INTERCONNECT NETWORKS 80
Table 4.2: Reply crossbar I/O signals
Name Width Description
data d Main datamdata m Meta datavalid 1 Is this data valid?match 1 Match output resultcomplete 1 General completion bit
Memory Mats
Port[0] Port[1] Port[2] Port[3]
Processor Interface Logic
Figure 4.10: Reply crossbar
CHAPTER 4. INTERCONNECT NETWORKS 81
valid mdatadata
data / mdata
valid
enable
reply_d3
reply_d3 = 3 cycle delayed reply signal
from mat
to port
Figure 4.11: Reply crossbar crosspoint
CHAPTER 4. INTERCONNECT NETWORKS 82
mats and only send a single reply back to the processor port. Thevalid bit from the mats
indicates which mat has the valid reply packet. By design there can only be one valid
reply to a multicast request, and this must be guaranteed by the configuration of the mats.
This type of multicast-request/multiplexed-reply combination is useful for configurations
such as multi-way set associative caches. In such configurations, we multicast the memory
request to all ways of the cache, but only want the reply for way that hits. If no way hits,
no reply is received.
4.3.3 Implementation
For both the request and reply crossbar, we employ low-swing, differential signaling using
NMOS only drivers and clocked sense amplifiers [54]. The NMOSdriver circuit, shown
in Figure 4.12, pulls one of the differential output lines toVdd low and the other line to
Gnd based on the data input.Vdd low is a special, low-voltage supply that can either be
generated locally via DC-DC conversion [136] or taken in from the outside world. By
using a lowered supply, rather than just limiting the swing,we can achieve a quadratic
energy savings in swinging the wire [56]. We do then, however, need to route, decouple,
and potentially generate the lowered supply voltage. The receiver sense amplifier is a
modified StrongARM latch (Figure 4.13) [57]. Because we use alow-swing, differential
signaling scheme, we require two wires to transmit each bit.We twist the differential
lines and interleave them with power supply lines to reduce the differential crosstalk noise
[56][39][40]
There are additional techniques for reducing the power of the crossbar such as segmen-
tation [135], but we chose not to employ them in our design to keep the design as simple as
possible and avoid any excess increase in the latency. Therehave been a number of works
that explore the energy consumption of crossbars in much more detail [137][135][138].
Conceptually a crossbar has all inputs arriving from the sides and all outputs exiting on
the top and/or bottom as in Figure 4.2. For our design, both the memory mats are above
the crossbar and the processor ports are below the crossbar to achieve a rectangular shape
for the entire reconfigurable memory. Figures 4.8 and 4.10 show the wiring arrangement
needed to support this configuration.
CHAPTER 4. INTERCONNECT NETWORKS 83
Vdd_low
data
data_b
Vdd_low
out_b
data_b
out
data
Figure 4.12: Low swing driver
We allocate an entire clock cycle for signals to traverse thecrossbar. The drivers trans-
mit the signals on the rising edge of the clock. The receiversflop them on the next rising
edge. From circuit-level models of the crossbar, 10 FO4, 1nsin our target 0.18µm technol-
ogy, is enough time to traverse the crossbar in either direction. A study of inter-functional
unit crossbars shows that a similar 16-port, 32-bit crossbar design can maintain sub-1ns
latency in a 0.25µm technology [112]. Chapter 5 provides more details on the testchip
implementation of the crossbars.
4.3.4 Processor Interface
While the memory mats connect directly to the processor interconnect network, the proces-
sor launches requests into the processor interconnect network via theprocessor interface
logic. The main function of the interface logic is to translate thememory address issued by
the processor to the address space used by the interconnect network.
Hardware Address Space
In modern computing systems, there are often two sets of address spaces, virtual and phys-
ical [105]. The applications operate in individual virtualaddress spaces. Virtualizing the
memory removes the burden of memory management from the application programmer,
and allows multiple applications to more easily share the limited physical memory. The
physical address is used to address the physical main memory.
CHAPTER 4. INTERCONNECT NETWORKS 84
s_b
Vdd
clk
clkclk
clk
in in_b
r s
Vdd VddVdd Vdd
r
r_b
s
s_b
out out_b
r s
r_b
Figure 4.13: Low swing receiver - modified StrongARM latch
CHAPTER 4. INTERCONNECT NETWORKS 85
When the processor issues a memory request, the request names the desired virtual
address location. This virtual address undergoes address translation to a physical address
via a translation table which maps the virtual address spaceto physical address space.
Because the translation table itself can be quite large, often used translations are cached
near the processor in a look-up structure called thetranslation look-aside buffer(TLB). In
a hierarchical memory system, the virtual to physical translation can take place anywhere
in the hierarchy before the main memory.
Memory is broken up into fixed size chunks calledpagesor variable sized chunks called
segments. Because the total amount virtual memory is larger than the physical memory,
data in the virtual address space is stored on disk and paged into main memory by the OS
when needed. The main memory can be thought of as a cache for the disk system, similar
to how the on-die processor cache is a cache for main memory. While virtual memory is
a useful for systems that run multiple concurrent applications, some simpler systems, such
as some DSPs and embedded systems, only use a single address space. Their applications
are responsible for memory management, but the overall system is less complex.
For our reconfigurable memory system, we introduce a third address space, thehard-
ware address space, which is conceptually below the physical address space. Just as the
physical address designates locations in the main memory, the hardware address names
memory locations in the on-die reconfigurable memory system. The hardware address
consists of two fields, themat ID andmat address.1 The processor interconnect network
uses the mat ID to route requests to the proper mat. The mat decoder uses the mat address
to select the desired word in the mat. In a hardwired memory system, the hardware ad-
dress space is not necessary, because the local memory on theprocessor die is “addressed”
implicitly via the hardwired interconnections between theprocessor and memory.
The basic issue is that the processor generates virtual addresses, but in order to access
the correct local memory, the virtual address must be translated into a hardware address.
Usually since the hardware is fixed, the mapping between the virtual address and the hard-
ware address is hardwired into the hardware structures. However, for a reconfigurable sys-
tem, the memory configuration is not fixed, and as a consequence the translation between
1In a tiled system with multiple mat arrays, there can also be atile ID field that denotes which tile thememory location resides in.
CHAPTER 4. INTERCONNECT NETWORKS 86
the virtual and hardware address spaces is not fixed. Thus, weneed a block that performs
the virtual to hardware address translation. This address translation is fairly simple when
compared to the virtual to physical address translation.
Address Splitter
From the virtual address issued by the processor, we need to generate two fields to properly
access the reconfigurable memory, the mat ID and the mat address. The mat ID can either be
represented as a bit vector or a mat ID number and mat ID mask. This latter representation
restricts multicasting to groups of powers-of-two mats along powers-of-two boundaries.
The mat address is simply the internal mat address of the requested word.
Because the reconfigurable memory system takes the place of the hardwired first-level
memory system of the processor, we need low latency access toavoid increasing the delay
of critical pipeline loops. Thus, we choose to employ a hardware translation method similar
to the address centrifuge used in the Cray T3E [92] to performthe virtual-to-hardware
address translation. For each of the fields we must generate,we extract a portion of the
virtual address and use it as an offset from a base value. The base value is either statically
assigned to the issuing port or determined from the high-order bits of the virtual address.
While this type of hardware address translator reduces the addressing flexibility, it is fast,
avoiding large table look-ups.
Theaddress splitterunit performs the virtual-to-hardware address translation. The mat
ID base, mat ID mask, and mat address base are determined froma lookup table index
by the high-order bits of the virtual address. We call these high-order bits thelogical
memory ID. They identify which logical memory structure to access. A logical memory is
a collection of mats that make up a logical memory unit, such as a cache tag array, cache
data array, scratchpad memory, or FIFO. The logical memory ID also determines the fields
of the virtual address that are extracted to form the mat ID offset and mat address offset.
The logical memory ID can have different interpretations depending on the processor port.
So the same virtual address sent to processor ports handlinga cache’s tag and data, will
route the requests to different mats, the tag mats and data mats respectively.
The extractor unit extracts the mat ID offset and mat addressoffset from the virtual
address. Each extraction is a mask and shift operation. These offset fields are added to
CHAPTER 4. INTERCONNECT NETWORKS 87
Extractor
+Mat IDoffset
+
Mat addroffset
Mat addrbase
Mat IDbase
Extractconfig
LUT
Mat IDTo xbar
LM ID
Virt
ual a
ddre
ss
To xbarMat addr
Figure 4.14: Address splitter block diagram
CHAPTER 4. INTERCONNECT NETWORKS 88
previously determined base fields to generate the final mat IDand mat address. Having a
mat address base field allows a single mat to be shared among multiple logical memories.
In accesses to cache tag arrays, the unused remaining portion of the virtual address is used
as input data to the mat for the compare.
We can avoid added delay of the logical memory ID lookup tables if each processor
port only ever accesses one logical memory. For example, if aprocessor port is dedicated
to be the instruction cache tag port, then the mat ID base, matID mask, mat address base,
and offset field split can be held constant, and no table lookup is necessary.
Scratch
Mat 0 Mat 1 Mat 2 Mat 3
2048 x 32b
8192 word scratchpad
Scratch Scratch Scratch
Figure 4.15: Scratchpad mat configuration for address splitter example
Figures 4.16 and 4.17 show example address splitter configurations for a scratchpad
memory detailed in Figure 4.15. The example memory system has 16 mats each holding
2048 32b word, and the scratchpad memory spans four mats, mats 0 to 3. The address split
in Figure 4.16 is for a configuration with the scratchpad words packed contiguously into
four mats. The address split in Figure 4.17 is for a configuration with the scratchpad words
interleaved among the four mats. So word 0 is in mat 0, word 1 inmat 1, word 2 in mat 2,
and so on. This banking arrangement would be desirable to avoid request conflicts if there
were multiple requesters for the scratchpad. In a statically schedule crossbar, this would
allow multiple requesters to simultaneously access the scratchpad as long as their requests
were guaranteed to access different banks.
The address split for a cache is more complex. Figure 4.18 shows an example 2-way
set associative cache. Each way’s tag is held in one mat, and four mats make up each data
array. Each way holds 2048 4 word lines. Mats 0 and 1 are the twotag mats, and mats
8-15 make up the two data arrays. This mat numbering allows for easy multicasting of the
tag and data accesses, since the tag and data arrays are power-of-2 in number and along
CHAPTER 4. INTERCONNECT NETWORKS 89
MSBByteoffset
Mat IDoffset
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0
00 Mat ID baseMat addr base
11
Mat addr
4
Mat ID
000 ... 000 LM ID
2 211
Mat addr offset
11 2
LSB
Figure 4.16: Address split for contiguous word scratchpad
MSBByteoffset
Mat IDoffset
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0
00 Mat ID baseMat addr base
11
Mat addr
4
Mat ID
000 ... 000 LM ID
2
Mat addr offset
2 11
11 2
LSB
Figure 4.17: Address split for interleaved word scratchpad
CHAPTER 4. INTERCONNECT NETWORKS 90
4 words/line
Tag
Tag Data
Data Data Data Data
DataDataData
Way 0
Way 1
Mat 0
Mat 1
Mat 8 Mat 9 Mat 10 Mat 11
Mat 12 Mat 13 Mat 14 Mat 15
2048 x 32b
2−way set associative2048 lines
Figure 4.18: Cache mat configuration for address splitter example
power-of-2 boundaries.
Figures 4.19 and 4.20 show the address splits of the tag and data requests respectively.
The data split assumes that the lines are packed contiguously into the data mats. Note that
in the tag array, the remainder portion of the virtual address is used as compare data in the
tag access. Figures 4.21 and 4.22 show the address split again for the tag and data, but for
a cache with the line interleaved among the four data array mats.
The address splitter logic could be combined with the regular address generation logic
typically done in the EX processor pipeline stage. Additionally, the mat decoder could be
implemented a sum-addressed-decoder [139] to remove some of the logic from the address
splitter. This would however increase the width of the address field sent to the mat. We
assume that the traditional virtual-to-physical address translation occurs at higher levels of
the memory hierarchy. The local caches built from mats are virtually-indexed, virtually-
tagged.
To make up a traditional memory request (i.e. full data-width address and data) to a
cache would require two ports, one for the address and one forthe data. The ”address” of
the request packet is just the hardware address that the payload is sent to. For example, in a
system with a cache, the physical address is really just the data portion of one of the ports
which is sent to the mat(s) acting as the cache tag.
CHAPTER 4. INTERCONNECT NETWORKS 91
To data field
Byteoffset
Lineoffset
0 0 0 0 0 0 0 0 0
0 0
00Mat addr base
11
Mat addr
4
Mat ID
LM ID
2
Line
2 11
0
11
Mat ID base0
0 0
0 *
LSB MSBRemainder
Figure 4.19: Cache tag address split for contiguous word data array
RemainderByteoffset
Lineoffset
0 0 0 0 0 0 0 0 0
0 0
00 Mat ID baseMat addr base
11
Mat addr
4
Mat ID
LM ID
2
Line
2 11
0
11
1 * 0
2
LSB MSB
Figure 4.20: Cache data address split for contiguous word data array
CHAPTER 4. INTERCONNECT NETWORKS 92
To data field
Byteoffset
Lineoffset
0 0 0 0 0 0 0 0 0
0 0
00Mat addr base
11
Mat addr
4
Mat ID
LM ID
2
0 Mat ID base0
0 0
0 *
211
Line
11
LSB MSBRemainder
Figure 4.21: Cache tag address split for interleaved word data array
RemainderByteoffset
Lineoffset
0 0 0 0 0 0 0 0 0
0 0
00Mat addr base
11
Mat addr
4
Mat ID
LM ID
2
0 Mat ID base
211
Line
11 2
1 * 0
LSB MSB
Figure 4.22: Cache data address split for interleaved word data array
CHAPTER 4. INTERCONNECT NETWORKS 93
If a system had 8 ports, the ports could be configured as 4 traditional memory ports.
However, if the processor was only going to address local memory, then all 8 ports could
be used to push/pull a data word every cycle. Any mix of the above is also possible. A
single address port could be associated with multiple data ports to pull down a wide data
word. Or some ports could use a traditional address space scheme, while others directly
accessed the hardware addresses.
4.4 Summary
This chapter described the two interconnection networks inour reconfigurable memory sys-
tem: the inter-mat control network and the processor interconnect network. The inter-mat
control network allows the mats to pass a few bits of control information to each other. It
uses narrow, one-to-many communication, and we implement it as a number of segmented
buses. The processor communicates to the mats via the processor interconnect network.
It must support wide, many-to-many communication with manyconcurrent requests. We
implement this network as two uni-directional crossbars, one for the requests from the pro-
cessor ports to the mats and one for the replies from the mats back to the processor ports.
There are many ways to implement these networks, and we choseinterconnect topologies
that were suited to their respective communication patterns and emphasized flexibility over
area or energy efficiency. But the optimal topology depends heavily on the specific memory
architecture and optimization goals.
Chapter 5
Experimental Results
In the previous chapters, we have detailed a reconfigurable memory design based on a re-
configurable SRAM mat and a flexible interconnection network. Our design goal was to
make the memory flexible, yet maintain high performance and efficiency. As a proof-of-
concept for our design, we implemented a prototype reconfigurable memory testchip. The
testchip demonstrates that we can build a fast, flexible, efficient reconfigurable memory,
sanity checks our design decisions, and quantifies the reconfigurability overheads. From
the prototype results, we can also extrapolate the overheads for larger, more complex re-
configurable memory designs.
5.1 Testchip Overview
We designed and fabricated a reconfigurable memory testchipin a 0.18µm CMOS 6-metal
Al TSMC process (Figure 5.1) [140][141]. The die measures 3mm by 3.3mm and has 68
pads along the periphery. It was packaged in a 84-pin ceramicleadless chip carrier. Table
5.1 summarizes the testchip and process technology features.
Figure 5.2 shows the testchip block diagram. The testchip contains four reconfigurable
memory blocks (Mem0-3), a dynamically routed, low-swing crossbar interconnect, test
vector storage, and a process monitor block. Mem0 contains an SRAM core that uses
a self-timed, pulsed-mode circuit style for fast access andshort cycle time. Mem1 and
Mem2 contain complete memory mats using the SRAM core from Mem0. Mem3 contains
94
CHAPTER 5. EXPERIMENTAL RESULTS 95
Figure 5.1: Testchip die photo
CHAPTER 5. EXPERIMENTAL RESULTS 96
Table 5.1: Process and testchip features
Mat area 0.6 mm2
SRAM area 0.4 mm2
Cell area 9 µm2
Meta-data cell area 18 µm2
Supply voltage 1.8VFrequency 1.1GHzMat power dissipation 125 mW (50% reads, 50% writes)Technology 0.18µm CMOS, 6-metal AlTransistors NMOS Vt = 0.5V, PMOS Vt = -0.5V
the pointer logic and PLA used in the mats. We separated out the Mem0 SRAM and the
Mem3 peripheral blocks for increased observability and to isolate their power measure-
ments. Mem0 and Mem3 use isolated power supplies so that we can measure the power
dissipation of each block.
We chose an aggressive cycle time goal of 10 fan-out-of-fourinverter delays (FO4) for
the testchip. This short clock tick stresses the circuit design of the memory cores as well as
the peripheral circuits and interconnect. Also, this cycletime would allow us to virtually
multi-port the memory system for a slower cycling processor, as described in Chapter 3.
Due to this aggressive cycle time, the mat access is pipelined, and the total delay through
the mat is 2 cycles or 20 FO4. The first half-cycle is spent in the pre-access logic: pointer
logic or write buffer. The next full cycle is spent in the SRAMaccess. The last half-cycle is
spent in the post-access logic: control logic and the comparator. The mat is fully pipelined,
accepting a new request every 10 FO4 cycle. For each crossbartraversal, we allocate a full
10 FO4 cycle. An entire memory access requires 4 cycles, or 40FO4.
5.1.1 SRAM core
The SRAM core used in Mem0, Mem1, and Mem2 has a capacity of 18Kb. This is on
the small end of the 16Kb to 128Kb optimum energy-delay rangediscussed in Chapter 2.
We chose this capacity to keep the area of the testchip reasonable, while still being able to
include a number of SRAM cores on the die. The SRAM holds 16Kb of data, arranged as
CHAPTER 5. EXPERIMENTAL RESULTS 97
Request
MatMem1Mat
Mem0SRAM
Mem3PLA/Ptr
ProcessMonitorTVM0 TVM1
Crossbar
Request Reply
Reply
Request Reply
Mem2
Figure 5.2: Testchip block diagram
CHAPTER 5. EXPERIMENTAL RESULTS 98
512 32b words. Along with each data word, we store an additional 4 bits of meta-data, so
the SRAM is logically 512 x 36 bits. To balance the wordline and bitline lengths, we use
4:1 column multiplexing to achieve a nearly square cell array of 128 x 144 cells. The row
address is then 7 bits and the column address is 2 bits.
Our decoder structure closely follows the optimal topologypresented by Amrutur and
Horowitz in 2001[34]. The pre-decoder divides the 7 bit row address into two groups and
decodes them using a 3:8 decoder and a 4:16 decoder. Each of the 24 pre-decode gates uses
a modified Nambu OR-gate (Figure 5.3) [36][142]. We insert analways-on full CMOS
transmission gate in the clock path to the dynamic inverter to slightly delay its clock. This
reduces the output glitch on the gates that do not fire.
n0
Vdd Vdd
Vdd
Vdd
Vdd
clk
clk
in0
out
in1
Figure 5.3: SRAM predecoder gate
The wordline driver gates are self-resetting, two-input, source-driven AND gates (Fig-
ure 5.4) [37][38][33][40]. There is an explicit pull-down cut-off device to avoid drive
fights during reset. The 3:8 pre-decoder generates the negative pulsesin n[7:0] that drive
the source inputs of the wordline driver gates. The 4:16 pre-decoder generates the positive
pulsesin p[15:0] that drive the gate inputs of the wordline driver gates. The buffer delays
CHAPTER 5. EXPERIMENTAL RESULTS 99
for the 3:8 pre-decoder and 4:16 pre-decoder are designed tobe matched so that the positive
and negative pulses arrive at the wordline driver gates at the same time.
in_n
Vdd
Vdd
WLin_pn0
r0
Figure 5.4: SRAM wordline driver gate
The decoder also contains the read-modify-write decoder asdescribed in Chapter 3.
The decoder drives the wordlines into the cell array. The cell array contains both the two-
ported meta-data cells and the single ported data cells. Themeta-data cells fit in the cell
pitch of the regular single-ported data cells, but they are wider to accommodate the two
pairs of bitlines. The prototype meta-data cells support read-modify-writes and gang oper-
ations, but not conditional or conditional gang operations.
The cell bitlines feed into the cell I/O block that contains the read and write support
circuits. The read cell I/O consists of a 4:1 PMOS passgate column mux and a StrongARM
amplifier followed by a skewed SR latch (Figure 5.5) [57]. We chose this design over the
more common latch-based sense amplifier to remove a series device from the bitline data
path. To further reduce the bitline loading, the write driver is pull-down only which allows
the write column mux to be a simple 4:1 NMOS passgate mux.
After an access, the bitlines must be reset to Vdd before the next operation. The bitline
reset circuits (Figure 5.6) consist of a keeper (M0 andM1), read reset (M2, M3, andM4),
and write reset (M5 andM6). During writes the bitlines swing full-rail, and we need large
CHAPTER 5. EXPERIMENTAL RESULTS 100
Col mux
Vdd VddVdd Vdd
VddVdd
VddVdd Vdd Vdd
bit bit_b
reset
reset_b
set
reset set
set_breset_b
set_b
se
out out_b
bit_bbit
sel_b[0]
bit[0] bit[1] bit[2] bit[3] bit_b[3]bit_b[2]bit_b[0] bit_b[1]
sel_b[1]
sel_b[2]
sel_b[3]
SR latch
Sense amp
Figure 5.5: SRAM read I/O circuits
CHAPTER 5. EXPERIMENTAL RESULTS 101
devices to fully reset the bitlines before the next access. However, only 1 out of 8 bitlines
needs to be reset from Gnd to Vdd, because we use 4:1 column multiplexing and only one
bitline per activated pair is discharged in a write. We use self-resetting bitlines to avoid the
excess power dissipation of activating unnecessary write reset devices. A self-reset delay
chain activates the write reset devices,M5 andM6. The self-reset activation is predicated
on write enable being de-asserted (wr en b being high) to prevent a drive fight between the
reset device and the write driver.
bl_reset_b
VddVdd
Vdd Vdd
Vdd Vdd
bit_bbit
M2M3
M4
M0 M1
M5 M6
wr_en_b
Figure 5.6: SRAM bitline reset circuits
We use a current ratioed replica bitline [35] to time the activation of the clocked sense
amplifiers, for good tracking of the replica path with the actual data path across process,
voltage, and temperature variations. The full replica timing path consists of a mimic
pre-decoder gate, mimic wordline driver gate, replica wordline, scan tunable replica cell,
replica bitline, and sense enable buffer. The delay of the sense enable buffer is matched to
the delay of the pre-decoder output driver using logical effort [32]. Figure 5.7 shows both
timing paths and the matching of each of the delays such that the sense enable signal is
CHAPTER 5. EXPERIMENTAL RESULTS 102
aligned in time with the desired bitline split of 100mV.
wl drv
mimic mimicwl drv se buf
pdec cell
replicacell
pdec buf
pdec
Figure 5.7: SRAM replica timing path
Proper replica wordline loading requires an extra row of dummy cells, and proper
replica bitline loading requires an extra column of dummy cells. Despite the careful match-
ing, the replica path delay can be slightly different from the data path due to the slow edge
rate of the replica bitline, device mismatch, or inexact delay tracking across PVT varia-
tions. To combat the possible problems with replica path matching, we implemented a
tunable replica bitline pull-down. Ten replica bitline drivers, two per driver cell, are laid
out in the replica row as shown in Figure 5.8. A scan-set-ableregister determines the num-
ber of drivers enabled via thedrive[n] signals. In simulation, the replica path matched the
data path at the middle setting with 5 drivers enabled. This allows for 5 steps in either di-
rection, faster or slower, if the replica path does not match. The prototype implementation
performed at its maximum 1.1GHz clock frequency with the replica bitline driver at the
nominal setting.
5.1.2 Peripheral Logic
Mem1 and 2 are complete memory mats containing the SRAM core,reconfigurable PLA,
pointer logic, maskable comparator, write buffer, and control logic. The reconfigurable
PLA has 16 logic terms (i.e. rows) with 6 input and 4 outputs. The 6 inputs are 4 bits
of meta-data, the 1 bit match result from the mat comparator,and one external bit. The 4
outputs correspond to the 4 meta-data bits.
The pointer logic stores pointers that are 11 bits long, which is 2 bits longer than neces-
sary to address all 512 words in the main SRAM array. The 2 extra bits allow a maximum
CHAPTER 5. EXPERIMENTAL RESULTS 103
rbl
Vdd VddVdd
cell array drive[1] drive[0]
rwl
rbl
driver cells (5)
dum
my
cells
(12
8)
rwl
dummy cells (139)
Figure 5.8: SRAM replica bitline and wordline
CHAPTER 5. EXPERIMENTAL RESULTS 104
FIFO size of 4 mats. The pointer storage array holds 4 pointers allowing for a maximum
of two FIFOs per mat. The strides were chosen to be 4 bits long.The pointer storage array
decoder gate is a self-resetting, pulse-latch with embedded logic (Figure 5.9). We imple-
ment the adder/subtracter using complex dual-rail domino gates and uses a mixed carry-
tree/carry-select topology. We previously detailed the pointer logic micro-architecture in
Section 3.6.2.
WL
Vdd
Vdd
CLK
addr[0]
addr[1]
n0
M1
M2
Figure 5.9: Pointer logic decoder gate
5.1.3 Interconnect and Test Infrastructure
The memory blocks connect to the two test vector memories viatwo dynamically routed,
uni-directional, low-swing crossbars. The request crossbar routes 60b request packets from
CHAPTER 5. EXPERIMENTAL RESULTS 105
the two test vector memories to the four memory blocks. It supports multicasting to any
combination of the memory blocks. The reply crossbar routes40b reply packets from the
memories back to the test vector memories. It supports muxing of the return data as detailed
in Chapter 4. We used low-swing wires in the crossbars in order to reduced their power
dissipation. If we had used full-swing wires, the crossbarswould have burned as much
power as the rest of the testchip. We implemented low-swing differential signaling using
NMOS-only drivers and clocked sense amplifiers as describedin Chapter 4.
The two test vector memories (TVMs) emulate processor portssending requests into
the reconfigurable memory system and receiving the replies.Each TVM stores 16 64b
requests and can launch a request every cycle. The TVMs also capture the replies coming
back from the memory system. Each TVM can store 16 40b replies, accepting one per
cycle. Putting the TVMs on die allows us to test the memory mats at speed without using
high-speed I/O pins or a high-speed tester.
We implement the TVMs as wide looping shift registers that output a memory request
and record a response every cycle. All of the test vector memory storage cells are connected
in a serpentine scan chain which allows us to read and write the input and output vectors
via a simple, low-speed, scan interface. The TVMs can be run in a looping mode where
the 16 requests are repeated over and over again. This mode isuseful for taking power
measurements and for probing internal nodes with the on-dievoltage samplers.
The testchip has 43 voltage samplers on key internal nodes and the clock. These voltage
samplers were described by Ho in 1998 [143], and our sampler design is identical to that
used on an earlier testchip in the same technology [56]. The sampler uses a boosted supply
voltage to allow the NMOS sample-and-hold passgates to sample voltages up to the normal
core Vdd. By running the sampler clock at a slightly different frequency as the core clock,
we sub-sample the signal and can generate a time dilated version of the signal. This time
dilated signal is a much lower frequency than the actual signal and is easily driven off chip.
By using the clock sampler output as a reference, we can relate the sampler time-dilated
output to actual time. In the next section, we show a waveformcapture of a number of key
nodes in the SRAM core using the samplers.
The testchip contained a process monitor block (Figure 5.11) which allows us to mea-
sure the delay of various logic gates including the fanout-of-four inverter delay of each die
CHAPTER 5. EXPERIMENTAL RESULTS 106
data
vcal
slave_clk
mas_data
mas_calib
enable
1.44s
0.36s 0.72s 0.36s
P:1.08
N:0.44
2.16
s
3.78
s
P:2.52
N:1.26
2.16
s
Gnd
81:0.72
s
Gnd
8.1:0.72
s
sample_out
1.44sM1 M2
M3
M4
Figure 5.10: On-die voltage sampler with device widths inµms for a 0.18µmtechnology[56]
[144]. The block contains four ring oscillators, each usinga different type of gate for the
delay element. A mux selects which of the ring oscillator outputs goes to a by-32 toggle
flop frequency divider. Its output is buffered then sent to anoutput pad. The block has 10
dedicated power and signal I/O pads. The pads are not along the periphery of the die and
require special bonding.
5.2 Measured Results
The testchip operated at 1.1GHz at the nominal 1.8V supply and room temperature. The
process monitor block measured the fanout-of-four inverter delay to be an average of 89ps
across 5 die. The 1.1GHz operating frequency corresponds toa target cycle time of 10
FO4. The area breakdown for the SRAM core, memory mat, and testchip are shown in
Tables 5.2, 5.3, 5.4 respectively. The meta-data, meta-data cell I/O, and the RMW decoder
occupy 24% of the total SRAM area.
Figure 5.12 shows the percentage area breakdown of the mat. The peripheral logic
occupied 32% of the mat area, but over half of the 32% was routing area. With better
optimized layout, we could have reduced the routing area significantly. Figure 5.13 shows
CHAPTER 5. EXPERIMENTAL RESULTS 107
Ring
DIV 32Out
Enable[3:0]
FO4 InverterRing
FO1 InverterRing
FO3 NAND2Ring
FO2 NOR2
Figure 5.11: Process monitor block
Table 5.2: SRAM area breakdown
Unit Area mm2
Decoder 0.029RMW decoder 0.029Mdata array 0.052Data array 0.186
Data cell I/O 0.036Mdata cell I/O 0.012
Table 5.3: Mat area breakdown
Unit Area mm2
SRAM 0.376PLA 0.021
Pointer 0.017WB/Cmp 0.05Routing 0.097
CHAPTER 5. EXPERIMENTAL RESULTS 108
Table 5.4: Testchip area breakdown
Unit Area mm2
SRAM 0.376Mat 0.637
PLA/Ptr 0.047Crossbars 1.62
TVM 0.241Process Monitor 0.388
Table 5.5: Mat power breakdown
Unit Power mW
SRAM 104.4PLA 9.9
Pointer 6.3WB/Cmp 4.4
the area overhead as a function of the mat capacity, keeping the data and meta-data widths
constant. The projected area overheads are generated usingan SRAM area estimator [24]
and and the testchip area results. With a 64Kb mat, the peripheral logic occupies less than
15% of the total area, still using the non-optimal peripheral logic layout.
The power breakdown for a memory mat is shown in Figure 5.14 and Table 5.5. The
peripheral logic accounted for 23% of the power. The test vector was an even mix of
compare-modify-writes and pointer writes. The PLA function was a 4-bit counter. Figure
5.15 shows the power overhead as a function of the mat capacity. As with the area projec-
tions, the data and meta-data widths are kept constant. The projected power overheads are
generated using an SRAM power estimator [24] and and the testchip power results. With a
64Kb mat, the peripheral logic accounts for less than 10% of the power.
Figure 5.16 shows a waveform capture from the SRAM using the on-die voltage sam-
plers for a worst-case read after write at 1.0GHz and 1.8V. The write occurs from time -1ns
to 0ns, and the read occurs from time 0ns to 1ns. The bitlines have approximately 90mV
CHAPTER 5. EXPERIMENTAL RESULTS 109
68%
4%
3%
8%
17%
SRAM
PLA
ptr
misc
routing
Figure 5.12: Area overhead and breakdown
CHAPTER 5. EXPERIMENTAL RESULTS 110
Capacity (Kb)
Per
cen
tare
ain
per
iph
eral
log
ic
140120100806040200
35
30
25
20
15
10
5
Figure 5.13: Area overhead vs. SRAM capacity
of split when the sense enable signal fires. Due to insufficient decoupling capacitance on
the sampler supply, the bitline samples are DC shifted up by about 200mV. The jaggedness
of the bitlines near Gnd is due to noise and increased sensitivity of the sampler calibration
near ground.
5.3 Summary
This chapter described the implementation of a prototype reconfigurable memory testchip
and the measured results. The testchip contained four memory test structures, including
two complete 16Kb memory mats, a low-swing crossbar interconnect network, test vector
storage, and a process monitor block. The testchip operatedat the target 10 FO4 clock cycle
under nominal conditions. The fast cycle time of both the memory mats and interconnect
opens the possibility of virtual multi-porting the memory system. The reconfigurable logic
occupied 32% of the mat area and accounted for 23% of the mat power. The mat memory
capacity is on the small end of the optimal range, and we project that for larger memory
CHAPTER 5. EXPERIMENTAL RESULTS 111
77%
15%
5%3%
SRAM
PLA
ptr
misc
Figure 5.14: Power overhead and breakdown
CHAPTER 5. EXPERIMENTAL RESULTS 112
Capacity (Kb)
Per
cen
tpow
erin
per
iph
eral
log
ic
140120100806040200
24
22
20
18
16
14
12
10
8
6
4
Figure 5.15: Power overhead vs. SRAM capacity
sewe wl
bit b
bit
clock cycle
Time (ns)
Volta
ge
(V)
21.510.50-0.5-1
2
1.5
1
0.5
0
Figure 5.16: Worst-case read at 1GHz, 1.8V, room temperature
CHAPTER 5. EXPERIMENTAL RESULTS 113
capacity mats, the overheads for area and power would fall below 15% and 10% respec-
tively. These results show that we can implement the proposed reconfigurable memory
design with reasonable overheads while maintaining high-performance operation.
Chapter 6
Conclusions
In this work, we have examined a reconfigurable memory architecture and evaluated its
feasibility via a prototype testchip. The motivation for such an architecture stems from
current process technology and design trends that indicatethat custom ASICs will become
increasingly difficult and costly to design. However, future applications will require more
efficient, high-performance computation than current general purpose processors can pro-
vide. One promising approach to breaking this impasse is to use reconfigurable architec-
tures that keep the low non-recurring engineering costs of general purpose silicon, yet still
provide the efficiency and high-performance near that of custom ASICs. For such recon-
figurable architectures, we propose adding reconfigurability to the memory system as well
as the computation. While reconfigurable logic has been studied extensively, the design
space for reconfigurable memory is relatively uncharted. However, in modern designs, the
memory system plays an increasingly important role in the overall performance, area, and
power of the design.
Unlike previous attempts at reconfigurable memory design, we took a bottom-up, circuit-
level approach, starting with an efficient SRAM design and adding configurability where
and when it could be done for low overhead. In Chapter 2, we explored modern SRAM
design to understand the base design substrate. Modern SRAMs are highly partitioned de-
signs that use specialized circuits in the decoder, datapath, and request/reply transport for
high-performance and low-power. For the low-activity factor decoders, designers use self-
resetting logic to achieve the speed of a dynamic logic family without the excessive power
114
CHAPTER 6. CONCLUSIONS 115
dissipation of a global precharge signal. To save power in the bitlines, the read bitlines only
swing a fraction of Vdd, but this requires a sense amplifier torestore the read out data to
full logic levels. Clocked sense amplifiers are used for low power, but they require precise
and robust triggering. Using replica bitlines allows us to accurately time the sense amplifier
activation across process, voltage, and temperature variations. To reduce the power in the
transport phases, we use low-swing signaling techniques onthe wires. These highly opti-
mized SRAMs are used to build common memory structures such as scratchpads, caches,
and FIFOs.
Looking closely at these memory structures, we recognized that they use very similar
memory building blocks. We designed a single reconfigurablememory mat that could form
the core of a many common memory structures. To the memory array, we add a few extra
bits of meta-data and logic to support read-modify-writes and gang operations. Pointer
logic in the address path enables a level of address indirection for FIFO configurations. A
comparator and write buffer in the mat datapath enable compares and conditional writes.
For a realistic design evaluation, we use modern SRAM circuit design practices described
in Chapter 2 in the mat design and tightly couple the meta-data and peripheral logic with
the memory core.
Despite having very similar memory building blocks, the target memory structures vary
widely in how they connect their memory blocks to each other and to the computation. To
support this diversity, we use two flexible interconnectionnetworks, the inter-mat control
network (IMCN) for mat-to-mat communication and the processor interconnect network
for mat-to-compute communication. These two networks havedifferent communication
needs and we tailor our implementations accordingly: the narrow one-to-many IMCN im-
plemented as a segmented bus, and the wide many-to-many processor interconnect im-
plemented as a pair of multicasting/muxing crossbars. To properly interface between the
crossbar and the computation, the address splitter performs address translation from the
address space used by the computation to the memory system hardware address space.
To evaluate our design, we implemented a prototype reconfigurable memory testchip
based on our architecture in a 0.18µm CMOS technology. The testchip demonstrates that
we can build such a reconfigurable memory with low overhead, while still maintaining
a fast 10 FO4 cycle time. While this cycle time did require us to pipeline the memory
CHAPTER 6. CONCLUSIONS 116
access, it opens the possibility using such a fast memory in avirtual multi-ported fashion,
especially if the computation is a relatively slow running ASIC or FPGA. The prototype
uses a 16Kb SRAM mat, a capacity on the low end of the optimal energy-delay range,
but still achieved area and power overheads of 32% and 23% of the totals respectively.
Our projections based on the experimental results show thatthe reconfiguration overheads
can be reduced to below 15% of the area and below 10% of the power by using SRAMs
larger than in our prototype design, but still in the near optimum energy-delay block size
range. These overheads may reduce even further if we merge the peripheral logic with
other necessary logic already present in the SRAM, such as a BIST controller.
Thus, we can build a fast, efficient, reconfigurable memory block by adding only a small
amount meta-data and peripheral logic to a basic memory array. The meta-data was a com-
mon memory paradigm used across many memory structures. While the peripheral logic
blocks were more tailored to individual memory structures,the necessary logic was limited
and thus the overhead for hardwired peripheral logic units was small. Where the memory
structures did vary significantly is in the way that the mats were connected to each other
and to the computation. This led us to use very rich interconnect structures for the inter-
mat control and the processor interconnect networks. Similar to conventional memories,
the performance and power of reconfigurable memory designs may begin to be dominated
not by the memory cores themselves, but rather by the interconnection networks. To that
end, we believe that the design of interconnection networksfor reconfigurable memories
warrants additional scrutiny. Under further study, a simple, low-overhead interconnection
topology may emerge that can emulate the necessary interconnect for many memory struc-
tures, just as our memory mat does for the memory cores.
Appendix A
SRAM Survey
In Figure 2.4, we plotted the block sizes of a number of contemporary SRAM against the
total SRAM capacity. These SRAMs ranged in total capacity from 72Mb to 15Kb, in
technologies from 0.65µm to 90nm. However, the majority of the designs used block sizes
that only ranged from 16Kb to 128Kb. This supports the conclusions of both Amrutur [24]
and Evans [25] that the optimal energy-delay partition sizefalls within this range. For our
reconfigurable memory, we set the mat memory capacity based on the optimal energy-delay
block range. Table A.1 below shows the raw data used to generate Figure 2.4. The SRAMs
are sorted by total capacity, in descending order.
117
APPENDIX A. SRAM SURVEY 118
Table A.1: SRAM survey data
Citation Capacity (kb) Block size (kb) Technology (µm) Process type
Cho[145] 73728 16 0.10 CMOSWeiss[146] 24576 24 0.18 CMOSZhao[147] 18432 64 0.18 CMOSPilo[148] 18432 72 0.18 CMOSOsada[149] 16384 512 0.13 CMOSIshibashi[150] 4096 64 0.25 CMOSBraceras[151] 4096 36 0.30 CMOSIshibashi[152] 4096 64 0.25 CMOS,4TBateman[153] 4096 9 0.35 CMOSKimura[154] 2048 64 0.65 BiCMOSKushiyama[155] 1024 4 0.35 CMOSShibata[156] 1024 128 0.30 CMOS/SOIShibata[157] 1024 32 0.50 CMOSShimizu[158] 288 288 0.18 CMOSPelella[159] 288 36 0.50 CMOSSato[160] 256 32 0.40 BiCMOSAkiyoshi[161] 144 36 0.09 CMOSMori[39] 32 1 0.25 CMOSLu[162] 15 4 0.50 CMOS
Bibliography
[1] P. Silverman, “Who Can Afford Advanced Lithography?”Microlithography World,
Nov. 2003.
[2] R. Merritt, “Gurus Urges IC Design Overhaul,”EE Times, Sept. 2, 2003.
[3] N. Zhang and R. Brodersen, “The Cost of Flexibility in Systems on a Chip Design for
Signal Processing Applications,”
http://bwrc.eecs.berkeley.edu/Classes/EE225C/Papers/archdesign.doc, 2002.
[4] N. Tredennick and B. Shimamoto, “The death of microprocessors,”Embedded Sys-
tems Programming, Aug. 11, 2004.
[5] Xilinx Inc., ”Using the Virtex Block SelectRAM+ Features,”
http://www.xilinx.com/bvdocs/appnotes/xapp130.pdf,Application Note XAPP130,
Dec. 18, 2000.
[6] Altera Inc., ”TriMatrix Embedded Memory Blocks in Stratix II Devices,” Stratix II
Device Handbook, vol. 2, chapt. 2,
http://www.altera.com/literature/hb/stx2/stratix2handbook.pdf, Document Number
SII5V2-1.3, July 2004.
[7] S. Hauck,et al., “The Chimera Reconfigurable Functional Unit”,IEEE Symposium
on FPGAs for Custom Computing Machines, 1997.
[8] J. Hauser and J. Wawrzynek, “Garp: a MIPS Processor With aReconfigurable Co-
processor,”IEEE Symposium on FPGAs for Custom Computing Machines, 1997.
119
BIBLIOGRAPHY 120
[9] http://www.stretchinc.com
[10] K. Sankaralingam,et al., “Exploiting ILP, TLP, and DLP with the polymorphous
TRIPS architecture,”Proceedings of the International Symposium on Computer Ar-
chitecture, 2003.
[11] R. Krashinsky,et al., “The vector-thread architecture,”International Symposium on
Computer Architecture, June 2004.
[12] M. Taylor, et al., “The RAW microprocessor: a computational fabric for software
circuits and general-purpose programs,”IEEE Micro, Mar/Apr 2002.
[13] K. Mai et al., “Smart Memories: A Modular Reconfigurable Architecture,”Proceed-
ings, International Symposium on Computer Architecture, pp. 161-71, June 2000.
[14] Pact Corp.,“The XPP white paper,”
http://www.pactcorp.com/xneu/download/xppwhite paper.pdf, Mar. 27, 2002.
[15] D. Parlour, “The Reality and Promise of Reconfigurable Computing in Digital Signal
Processing,”Tutorial, IEEE International Solid-State Circuits Conference, Feb. 2004.
[16] K. Wu and Y. Tsai, “Structured ASIC, evolution or revolution?,” International Sym-
posium on Physical Design, April 2004.
[17] R. Sites. “It’s the Memory, Stupid!”Microprocessor Report, pages 19-20, August 5,
1996.
[18] C. Zhang,et al., “A highly configurable cache architecture for embedded systems,”
Proceedings, International Symposium on Computer Architecture, June 2003.
[19] D. Patterson,et al., “Intelligent RAM (IRAM): Chips that remember and compute,”
Digest of Technical Papers, IEEE International Solid-State Circuits Conference, Feb.
1997.
[20] A. Khan, et al., “A 150 MHz graphics rendering processor with 256Mb embedded
DRAM,” Digest of Technical Papers, IEEE International Solid-State Circuits Confer-
ence, Feb. 2001.
BIBLIOGRAPHY 121
[21] B. Dipert, “Cutting-edge consoles target the television,” EDN, December 12, 2001.
[22] W. Leung,et al., “The ideal SoC memory: 1T-SRAM,”Proceedings of the 13th An-
nual IEEE International ASIC/SOC Conference, Sept. 2000.
[23] D. Fried,et al., “Aggressively scaled (0.143µm2) 6T-SRAM cell for the 32 nm node
and beyond,”Technical Digest Electron Devices Meeting, Dec. 2004.
[24] B. Amrutur and M. Horowitz, “Speed and Power Scaling of SRAM’s,” IEEE Journal
of Solid-State Circuits, Feb. 2000.
[25] R. Evans, “Energy consumption modeling and optimization for SRAMs,”Ph.D. dis-
sertation, Dept. of Electrical and Computer Engineering, North Carolina State Uni-
versity, July 1993.
[26] T. Wada,et al., “An analytical access time model for on-chip cache memories,” IEEE
Journal of Solid-State Circuits, Aug. 1992.
[27] M. Yoshimoto,et al., “A 64kb CMOS RAM with divided word line structure,”Digest
of Technical Papers, IEEE International Solid-State Circuits Conference, Feb. 1983.
[28] T. Hirose,et al., “A 20ns 4Mb CMOS SRAM with hierarchical word decoding archi-
tecture,”Digest of Technical Papers, IEEE International Solid-State Circuits Confer-
ence, Feb. 1990.
[29] K. Osada,et al., “A 2ns access, 285MHz, two-port cache macro using double global
bit-line pairs,” Digest of Technical Papers, IEEE International Solid-State Circuits
Conference, Feb. 1997.
[30] R. Evans and P. Franzon, “Energy consumption modeling and optimization for
SRAMs,” IEEE Journal of Solid-State Circuits, May 1995.
[31] B. Amrutur, “Design and analysis of fast low power SRAMs,” Ph.D. dissertation,
Dept. of Electrical Engineering, Stanford University, Aug. 1999.
[32] I. Sutherlandet al., Logical Effort: Designing Fast CMOS Circuits, Morgan Kauf-
mann, January 1999.
BIBLIOGRAPHY 122
[33] T. Chappell,et al., “A 2-ns cycle, 3.8-ns access 512kb CMOS ECL SRAM with a
fully pipelined architecture,”IEEE Journal of Solid-State Circuits, Nov. 1991.
[34] B. Amrutur and M. Horowitz, “Fast low-power decoders for RAMs,” IEEE Journal
of Solid-State Circuits, Oct. 2001.
[35] B. Amrutur and M. Horowitz, “A replica technique for wordline and sense control in
low-power SRAMs,”IEEE Journal of Solid-State Circuits, Aug. 1998.
[36] H. Nambu,et al., “A 1.8ns access, 550MHz 4.5Mb CMOS SRAM,”Digest of Tech-
nical Papers, IEEE International Solid-State Circuits Conference, Feb. 1998.
[37] K. Sakai,et al., “A 15ns 1-Mbit CMOS SRAM,”IEEE Journal of Solid-State Circuits,
Oct. 1988.
[38] M. Matsumiya,et al., “A 15ns 16-Mb CMOS SRAM with interdigitated bit-line ar-
chitecture,”IEEE Journal of Solid-State Circuits, Nov. 1992.
[39] T. Mori, it et al., “ A 1V 0.9mW at 100MHz 2k*16b SRAM utilizing a half-swing
pulsed decoder and write-bus architecture in 0.25µm dual-Vt CMOS”Digest of Tech-
nical Papers, IEEE International Solid-State Circuits Conference, Feb. 1998.
[40] K. Mai, et al., “Low-Power SRAM Design Using Half-Swing Pulse-Mode Tech-
niques,”IEEE Journal of Solid-State Circuits, Nov. 1998.
[41] J. Alowersson,et al., “SRAM cells for low-power write in buffer memories,”IEEE
Symposium on Low Power Electronics, Oct. 1995.
[42] H. Mizuno,et al., “Driving source-line cell architecture for sub-1-V high-speed low-
power applications,”IEEE Journal of Solid-State Circuits, April 1996.
[43] M. Margala,et al., “Low-power SRAM circuit design,”Records of the 1999 IEEE
International Workshop on Memory Technology, Design and Testing, Aug. 1999.
[44] J. Wang,et al., “Low-power embedded SRAM with the current-mode write tech-
nique,” IEEE Journal of Solid-State Circuits, Jan. 2000.
BIBLIOGRAPHY 123
[45] K. Kanda,et al., “90% write power-saving SRAM using sense-amplifying memory
cell,” IEEE Journal of Solid-State Circuits, June 2004.
[46] S. Heo,et al., “Dynamic Fine-Grain Leakage Reduction using Leakage-Biased Bit-
lines,” International Symposium on Computer Architecture, May 2002.
[47] K. Osada,et al., “Universal-Vdd 0.65-2.0-V 32-kB cache using a voltage-adapted
timing-generation scheme and a lithographically symmetrical cell,” IEEE Journal of
Solid-State Circuits, Nov. 2001
[48] K. Agawa,et al., “A bitline leakage compensation scheme for low-voltage SRAMs,”
IEEE Journal of Solid-State Circuits, May 2001.
[49] Y. Ye, et al., “A 6-GHz 16-kB L1 cache in a 100-nm dual-Vt technology usinga bitline
leakage reduction (BLR) technique,”IEEE Journal of Solid-State Circuits, May 2003.
[50] A. Alvandpour,et al., “Bitline leakage equalization for sub-100nm caches,”Proceed-
ings of the 29th European Solid-State Circuits Conference, Sept. 2003.
[51] S. Tachibana,et al., “A 2.6-ns wave-pipelined CMOS SRAM with dual-sensing-latch
circuits,” IEEE Journal of Solid-State Circuits, April 1995.
[52] S. Schuster,et al, “A 15-ns CMOS 64K RAM,”IEEE Journal of Solid-State Circuits,
Oct. 1986.
[53] T. Higuchi,et al., “A 500MHz synchronous pipelined 1Mbit CMOS SRAM,”Techni-
cal Report of IEICE, May 1996, pp. 9-14, (in Japanese).
[54] R. Ho, et al., “Efficient On-Chip Global Interconnects,”Digest of Technical Papers,
Symposium on VLSI Circuits, June 2003.
[55] H. Zhanget al., “Low-swing on-chip signaling techniques,”IEEE Transactions on
VLSI, pp. 414-419, April 1993.
[56] R. Ho, “On-chip wires: scaling and efficiency,”Ph.D. dissertation, Dept. of Electrical
Engineering, Stanford University, Aug. 2003.
BIBLIOGRAPHY 124
[57] B. Nikolic, et al., “Sense Amplifier Based Flip-Flop,”Digest of Technical Papers,
IEEE International Solid-State Circuits Conference, Feb. 1999.
[58] L. Benini, et al., “Energy-aware design of embedded memories: A survey of tech-
nologies, architectures, and optimization techniques,”ACM Transactions on Embed-
ded Computing Systems, Feb. 2003.
[59] F. Balasa,et al., “Background memory area estimation for multidimensionalsignal
processing systems,”IEEE Transactions on VLSI Systems, June 1995.
[60] J. Stinson and S. Rusu, “A 1.5GHz third generation Itanium processor,”Digest of
Technical Papers, IEEE International Solid-State Circuits Conference, Feb. 2003.
[61] R. Banakar,et al., “Scratchpad memory: a design alternative for cache on-chip mem-
ory in embedded systems,”International Symposium on Hardware/Software Code-
sign, May 2002..
[62] R. Lee and M. Smith, “Media processing: a new design target,” IEEE Micro, Aug.
1996.
[63] D. Zucker,et al., “Improving performance for software MPEG players,”Proceedings
of Compcon, 1996.
[64] K. Diefendorff and P. Dubey, “How multimedia workloadswill change processor de-
sign,” IEEE Computer, Sept. 1997.
[65] N. Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small
Fully-Associative Cache and Prefetch Buffers,”Proceedings of the International Sym-
posium on Computer Architecture, May 1990.
[66] S. Rixner,et al., “A Bandwidth Efficient Architecture for Media Processing,” Inter-
national Symposium on Microarchitecture, 1998.
[67] S. Palacharla,et al., “Evaluating stream buffers as a secondary cache replacement,” In
Proceedings of the 21st Annual International Symposium on Computer Architecture,
pages 24-33, 1994.
BIBLIOGRAPHY 125
[68] K. Farkas,et al., “How useful are non-blocking loads, stream buffers and speculative
execution in multiple issue processors?”In Proceedings of the First International
Conference on High Performance Computer Architecture, pages 78-89, Jan. 1995.
[69] W. Dally, et al., “Merrimac: supercomputing with streams,”Proceedings of the
ACM/IEEE Supercomputing Conference SC2003, Nov. 2003.
[70] N. Jayasena,et al., “Stream register files with indexed access,”Proceedings of the
Tenth International Symposium on High Performance Computer Architecture, Feb.
2004.
[71] E. Kilgariff and R. Fernando, “The Geforce 6 series architecture,” GPU Gems 2,
Addison-Wesley, 2005.
[72] D. Albonesi, “Selective cache ways: on-demand cache resource allocation,”Interna-
tional Symposium on Microarchitecture, 1999.
[73] D. Albonesi,et al., “Dynamically Tuning Processor Resources with Adaptive Pro-
cessing,”IEEE Computer, Dec. 2003.
[74] R. Balasubramonian,et al., “Memory hierarchy reconfiguration for energy and per-
formance in general-purpose processor architectures,”International Symposium on
Microarchitecture, Dec. 2000.
[75] P. Panda,et al., “Data memory organization and optimizations in application-specific
systems,”IEEE Design and Test of Computers, May-June 2001.
[76] P. Panda,et al., “Local memory exploration optimization in embedded systems,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
Jan. 1999.
[77] A. Veidenbaum,et al., “Adapting cache line size to application behavior,”Interna-
tional Conference on Supercomputing, 1999.
[78] P. Ranganathan,et al., “Reconfigurable caches and their application to media process-
ing,” Proceedings, International Symposium on Computer Architecture, 2000.
BIBLIOGRAPHY 126
[79] M. Qureshi,et al., “The V-way cache: demand-based associativity via global replace-
ment,” International Symposium on Computer Architecture, June 2005.
[80] S. Kaxiras,et al., “Cache decay: exploiting generational behavior to reducecache
leakage power,”International Symposium on Computer Architecture, June 2001.
[81] N. Kim, et al., “Drowsy instruction caches: leakage power reduction using dy-
namic voltage scaling and cache sub-bank prediction,”Proceedings of the 35th annual
ACM/IEEE International Symposium on Microarchitecture, Nov. 2002.
[82] A. Malik, et al., “A low power unified cache architecture providing power andperfor-
mance flexibility,” International Symposium on Low Power Electronics and Design,
2000.
[83] E. Witchel,et al., “Direct addressed caches for reduced power consumption,”34th
International Symposium on Microarchitecture, Dec. 2001.
[84] Texas Instruments Inc., ”TMS320C621x/C671x DSP Two Level Internal Memory
Reference Guide (Rev. B),”
http://www-s.ti.com/sc/psheets/spru609b/spru609b.pdf, TI Literature Number:
SPRU609B, June 2004.
[85] B. Ackland, et al., “A single chip, 1.6-billion, 16-b MAC/s multiprocessor DSP,”
IEEE Journal of Solid-State Circuits, March 2000.
[86] H. Kim, et al., “A Reconfigurable Multifunction Computing Cache Architecture,”
IEEE Transactions on Very Large Scale Integration, Aug. 2001.
[87] V. Srini, et al., “Reconfigurable memory module in the RAMP system for streampro-
cessing,”Proceedings of International Symposium on Computer Architecture Work-
shop, June 2001.
[88] T. Ngai,et al., “An SRAM-programmable field-configurable memory,”IEEE Custom
Integrated Circuits Conference, 1995.
BIBLIOGRAPHY 127
[89] F. Heile,et al., “Programmable memory blocks supporting content-addressable mem-
ory,” International Symposium on Field Programmable Gate Arrays, 2000.
[90] S. Guccione,et al., “A reconfigurable content addressable memory,”Proceedings of
the 15th International Parallel and Distributed Processing Workshops (IPDPS), May
2000.
[91] Personal communications with Manoj Chirania, XilinixInc., Jan. 2004.
[92] S. Scott, “Synchronization and communication in the T3E multiprocessor,”Proceed-
ings of the Architectural Support for Programming Languages and Operating Sys-
tems, Oct. 1996.
[93] L. Hammond, “Hydra: a chip multiprocessor with supportfor speculative thread-
level parallelization,”Ph.D. dissertation, Dept. of Electrical Engineering, Stanford
University, Mar. 2002.
[94] M. Horowitz and W. Dally, “How Scaling Will Change Processor Architecture”Di-
gest of Technical Papers, IEEE International Solid-State Circuits Conference, Feb.
2004.
[95] D. Sager,et al., “A 0.18µm CMOS IA32 Microprocessor with a 4GHz Integer Ex-
ecution Unit,” Digest of Technical Papers, IEEE International Solid-State Circuits
Conference, Feb. 2001.
[96] M. Matson,et al., “Circuit Implementation of a 600MHz Superscalar RISC Micropro-
cessor,”Proceedings of the International Conference on Computer Design, October,
1998
[97] S. Hsu,et al., “A 4.5-GHz 130-nm 32-KB L0 cache with a leakage-tolerant self
reverse-bias bitline scheme,”IEEE Journal of Solid-State Circuits, May 2003.
[98] D. Pham,et al., “Design and implementation of a first-generation CELL processor,”
International Solid-State Circuits Conference, Feb. 2005.
BIBLIOGRAPHY 128
[99] S. Dhong,et al., “A 4.8GHz fully pipelined embedded SRAM in the streaming proces-
sor of a CELL processor,”International Solid-State Circuits Conference, Feb. 2005.
[100] S. Rixner,et al., “Register organization for media processing,”International Sym-
posium on High Performance Computer Architecture, Jan. 2000.
[101] O. Ergin,et al.,”A Circuit-Level Implementation of Fast, Energy-Efficient CMOS
Comparators for High-Performance Microprocessors,”IEEE International Confer-
ence on Computer Design, Sept. 2002.
[102] C. Wang,et al., “High fan-in dynamic CMOS comparators with low transistor
count,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Ap-
plications, Sept. 2003
[103] C. Huang,et al., “High-performance and power-efficient CMOS comparators,” IEEE
Journal of Solid-State Circuits, Feb. 2003
[104] H. Mizuno,et al., “A 1-V, 100-MHz, 10-mW cache using a separated bit-line mem-
ory hierarchy architecture and domino tag comparators,”IEEE Journal of Solid-State
Circuits, Nov. 1996
[105] J. Hennessy and D. Patterson.Computer Architecture, a Quantitative Approach. 2nd
Ed. 1996.
[106] W. Dally and B. Towles,Principles and Practices of Interconnection Networks, Mor-
gan Kaufmann,2004.
[107] J. Duato,et al., Interconnection Networks, Morgan Kaufmann, 2003.
[108] C. Seitz, “Let’s route packets instead of wires,”Advanced Research in VLSI: Pro-
ceedings of the Sixth MIT Conference, 1990.
[109] W. Dally and B. Towles, “Route packets, not wires: on-chip interconnection net-
works,” Proceeding of the ACM/IEEE Design Automation Conference, June 2001.
[110] L. Benini and G. DeMicheli, “Networks on chips: a new SoC paradigm,”IEEE
Computer, Jan. 2002.
BIBLIOGRAPHY 129
[111] P. Magarshack and P. Paulin, “System-on-chip beyond the nanometer wall,”Pro-
ceedings of the ACM/IEEE Design Automation Conference, June 2003.
[112] S. Dutta,et al., “High-performance crossbar interconnect for a VLIW videosignal
processor,”Ninth Annual IEEE International ASIC Conference and Exhibit, Proceed-
ings, Sept. 1996.
[113] A. Boxer, “Where buses cannot go,”IEEE Spectrum, Feb. 1995.
[114] Y. Zhang,et al., “An alternative architecture for on-chip global interconnect: seg-
mented bus power modeling,”Conference Record of the Thirty-Second Asilomar Con-
ference on Signals, Systems and Computers, Nov. 1998.
[115] J. Chen,et al., “Segmented bus design for low-power systems,”IEEE Transactions
on Very Large Scale Integration Systems, March 1999.
[116] J. Plosila,et al., “Implementation of a self-timed segmented bus,”IEEE Design and
Test of Computers, Nov.-Dec. 2003.
[117] V. Lahtinen,et al., “Comparison of synthesized bus and crossbar interconnection
architectures,”Proceedings of the International Symposium on Circuits andSystems,
May 2003.
[118] R. Naik,et al.“Large integrated crossbar switch,”Proceedings of the Seventh Annual
IEEE International Conference on Wafer Scale Integration, Jan. 1995.
[119] K. Choi and W. Adams, “VLSI implementation of a 256256 crossbar interconnec-
tion network,”Proceedings of the Sixth International Parallel Processing Symposium,
March 1992.
[120] N. McKeown,et al. “Tiny Tera: a packet switch core,”IEEE Micro, Jan.-Feb. 1997.
[121] A. Lines, “Asynchronous interconnect for synchronous SoC design,”IEEE Micro
Jan-Feb. 2004.
BIBLIOGRAPHY 130
[122] M. Borgatti,et al., “A multi-context 6.4Gb/s/channel on-chip communicationnet-
work using 0.18µm Flash-EEPROM switches and elastic interconnects,”Digest of
Technical Papers, IEEE International Solid-State Circuits Conference, Feb. 2003.
[123] S. Lee,et al., “An 800MHz star-connected on-chip network for application to sys-
tems on a chip,”Digest of Technical Papers, IEEE International Solid-State Circuits
Conference, Feb. 2003.
[124] K. Lee, et al., “A 51mW 1.6GHz on-chip network for low-power heterogeneous
SoC platform,”Digest of Technical Papers, IEEE International Solid-State Circuits
Conference, Feb. 2004.
[125] J. Labrousse and G. Slavenburg, “A 50 MHz microprocessor with a very long in-
struction word architecture,”Digest of Technical Papers, IEEE International Solid-
State Circuits Conference, Feb. 1990.
[126] L. Lin, et al., “CCC: crossbar connected caches for reducing energy consumption
of on-chip multiprocessors,”Proceedings of the Euromicro Symposium on Digital
System Design, Sept. 2003.
[127] K. Sankaralingam,et al., “Routed inter-ALU networks for ILP scalability and per-
formance,”Proceedings of the 21st International Conference on Computer Design,
Oct. 2003.
[128] E. Fetzer,et al., “A fully bypassed six-issue integer datapath and registerfile on the
Itanium-2 microprocessor,”IEEE Journal of Solid-State Circuits, Nov. 2002.
[129] P. Gupta and N. McKeown, “Designing and implementing afast crossbar scheduler,”
IEEE Micro, Jan.-Feb. 1999.
[130] J. Liang,et al. “ASOC: a scalable, single-chip communications architecture,” Pro-
ceeding of the International Conference on Parallel Architectures and Compilation
Techniques, Oct. 2000.
BIBLIOGRAPHY 131
[131] K. Lee, et al., “A high-speed and lightweight on-chip crossbar switch scheduler
for on-chip interconnection networks,”Conference on European Solid-State Circuits,
Sept. 2003.
[132] K. Lee,et al., “A distributed crossbar switch scheduler for on-chip networks,” Pro-
ceedings of the IEEE Custom Integrated Circuits Conference, Sept. 2003.
[133] M. Sinha and W. Burleson, “Current-sensing for crossbars,”Proceedings of the 14th
Annual IEEE International ASIC/SOC Conference, Sept. 2001.
[134] P. Wijetunga, “High-performance crossbar design forsystem-on-chip,”Proceedings
of the IEEE International Workshop on System-on-Chip for Real-Time Applications,
2003.
[135] H. Wang,et al., “Power-driven design of router microarchitectures in on-chip net-
works,” Proceedings of the International Symposium on Microarchitecture, 2003.
[136] A. Stratakos,et al., “A low voltage CMOS DC-DC converter for a portable battery-
operated system,”Proceedings of IEEE Power Electronics Specialists Conference,
June 1994.
[137] Y. Zhang and M. Irwin, “Power and performance comparisons of crossbars and buses
as on-chip interconnect structures,”Conference Record of the Thirty-Third Asilomar
Conference on Signals, Systems, and Computers, Oct. 1999.
[138] E. Geethanjali,et al., “An analytical power estimation model for crossbar intercon-
nects,”15th Annual IEEE International ASIC/SOC Conference, Sept. 2002.
[139] R. Heald,et al., ”64 KByte sum-addressed-memory cache with 1.6 ns cycle and
2.6 ns latency,”Digest of Technical Papers, IEEE International Solid-State Circuits
Conference, Feb. 1998.
[140] K. Mai, et al., “Architecture and Circuit Techniques for a ReconfigurableMemory
Block,” Digest of Technical Papers, IEEE International Solid-State Circuits Confer-
ence, Feb. 2004.
BIBLIOGRAPHY 132
[141] K. Mai, et al., “Architecture and circuit techniques for a 1.1GHz 16kb reconfigurable
memory in 0.18µm CMOS,” IEEE Journal of Solid-State Circuits, Jan. 2005.
[142] S. Vangal,et al., “5-GHz 32-bit integer execution core in 130-nm dual-Vt CMOS,”
IEEE Journal of Solid-State Circuits, Nov. 2002.
[143] R. Ho, et al., “Applications on On-Chip Samplers for Test and Measurement of
Integrated Circuits,”Digest of Technical Papers, Symposium on VLSI Circuits, pp.
138-9, June 1998.
[144] R. Ho,et al., “The Future of Wires,”Proceedings of the IEEE, April 2001.
[145] U. Cho,et al., “A 1.2 V 1.5 Gb/s 72 Mb DDR3 SRAM,”Digest of Technical Papers,
International Solid-State Circuits Conference, Feb. 2003.
[146] D. Weiss,et al., “An on-chip 3MB subarray-based 3rd level cache on an itanium
microprocessor,”Digest of Technical Papers, International Solid-State Circuits Con-
ference, Feb. 2002.
[147] C. Zhao,et al., “An 18-Mb, 12.3-GB/s CMOS pipeline-burst cache SRAM with 1.54
Gb/s/pin,”IEEE Journal of Solid-State Circuits, Nov. 1999.
[148] H. Pilo, et al., “An 833 MHz 1.5 W 18 Mb CMOS SRAM with 1.67 Gb/s/pin,”
Digest of Technical Papers, International Solid-State Circuits Conference, Feb. 2000.
[149] K. Osada,et al., “16.7 fA/cell tunnel-leakage-suppressed 16 Mb SRAM for handling
cosmic-ray-induced multi-errors,”Digest of Technical Papers, International Solid-
State Circuits Conference, Feb. 2003.
[150] K. Ishibashi,et al., “A 300 MHz 4-Mb wave-pipeline CMOS SRAM using a multi-
phase PLL,”Digest of Technical Papers, International Solid-State Circuits Confer-
ence, Feb. 1995.
[151] G. Braceras,et al., “A 350 MHz 3.3 V 4 Mb SRAM fabricated in a 0.3µm CMOS
process,”Digest of Technical Papers, International Solid-State Circuits Conference,
Feb. 1997.
BIBLIOGRAPHY 133
[152] K. Ishibashi,et al., “A 6-ns 4-Mb CMOS SRAM with offset-voltage-insensitive cur-
rent sense amplifiers,”IEEE Journal of Solid-State Circuits, April 1995.
[153] B. Bateman,et al., “A 450 MHz 512 kB second-level cache with a 3.6 GB/s data
bandwidth,”Digest of Technical Papers, International Solid-State Circuits Confer-
ence, Feb. 1998.
[154] T. Kimura,et al., “Design of 1.28-GB/s high bandwidth 2-Mb SRAM for integrated
memory array processor applications,”IEEE Journal of Solid-State Circuits, June
1995.
[155] N. Kushiyama,et al., “A 295 MHz CMOS 1 M (256) embedded SRAM using bi-
directional read/write shared sense amps and self-timed pulsed word-line drivers,”
Digest of Technical Papers, International Solid-State Circuits Conference, Feb. 1995.
[156] N. Shibata,et al., “A 2-V 300-MHz 1-Mb current-sensed double-density SRAM
for low-power 0.3-µm CMOS/SIMOX ASICs,”IEEE Journal of Solid-State Circuits,
Oct. 2001.
[157] N. Shibata,et al., “A 1-V, 10-MHz, 3.5-mW, 1-Mb MTCMOS SRAM: with charge-
recycling input/output buffers,”IEEE Journal of Solid-State Circuits, June 1999.
[158] H. Shimizu,et al., “A 1.4 ns access 700 MHz 288 kb SRAM macro with expandable
architecture,”Digest of Technical Papers, International Solid-State Circuits Confer-
ence, Feb. 1999.
[159] M. Pelella,et al., “A 2 ns access, 500 MHz 288 Kb SRAM macro,”Digest of Tech-
nical Papers, Symposium on VLSI Circuits, June 1996.
[160] H. Sato,et al., “A 3.6 mW 1.4 V SRAM with non-boosted, vertical bipolar bitline
contact memory cell,”Digest of Technical Papers, International Solid-State Circuits
Conference, Feb. 1998.
[161] H. Akiyoshi, et al., “A 320ps access, 3GHz cycle, 144Kb SRAM macro in 90nm
CMOS technology using an all-stage reset control signal generator,”Digest of Tech-
nical Papers, International Solid-State Circuits Conference, Feb. 2003.
BIBLIOGRAPHY 134
[162] P. Lu,et al., “A 15Kb 1.5 ns Access On-chip Tag SRAM,”Proceedings of Technical
Papers, International Symposium on VLSI Technology, Systems, and Applications,
June 1997.