Frank Vahid, UC Riverside
New Opportunities with Platform Based Design
Frank Vahid, Associate Professor
Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend
How Much is Enough?
  Perhaps a bit small
  Reasonably sized
  Probably plenty big
  More than typically necessary
  Very few people could use this
How Much Custom Logic is Enough?
  1993: ~1 million logic transistors. Perhaps a bit small
  1996: ~5-8 million logic transistors. Reasonably sized
  1999: ~10-50 million logic transistors. Probably plenty big
  2002: ~100-200 million logic transistors. More than typically necessary
  2008: >1 billion logic transistors (1993: 1 M). Perhaps very few people could design this
Point of diminishing returns
  32-bit ARM: ~30K
  MPEG decoder: ~1M
Other examples
  Fast cars (> 100 mph)
  High res digital cameras (> 4M)
  Disk space
  Even IC performance
Very Few Companies Can Design High-End ICs
Designer productivity growing at slower rate
  1981: 100 designer months, ~$1M
  2002: 30,000 designer months, ~$300M
[Figure: Design productivity gap, 1981 to 2009. IC capacity (logic transistors per chip, in millions) versus productivity (K transistors per staff-month); the gap between the two curves widens. Source: ITRS'99]
Meanwhile, ICs Themselves are Costlier
And take longer to fabricate
  While market windows are shrinking
Fewer than 1,000 out of 10,000 ASIC designs have volumes to justify fabrication in 0.13 micron

Tech (micron):  0.8      0.35     0.18     0.13
NRE:            $40k     $100k    $350k    $1,000k
Turnaround:     42 days  49 days  56 days  76 days
Market:         $3.5B    $6B      $12B     $18B
Source: DAC'01 panel on embedded programmable logic
Summarizing So Far...
* Transistors are less scarce
• ICs are big enough, fast enough
* ICs take more time and money to design and fabricate
• While market windows are shrinking
Designers: buy pre-fabricated system-level ICs (platforms)
Trend Towards Pre-Fabricated Platforms: ASSPs
ASSP: application specific standard product
  Domain-specific pre-fabricated IC
  e.g., digital camera IC
ASIC: application specific IC
ASSP revenue > ASIC
ASSP design starts > ASIC
  Unique IC design
  Ignores quantity of same IC
ASIC design starts decreasing
  Due to strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest, September '01
Will High End ICs Still be Made?
YES
  The point is that mainstream designers likely won't be making them
  Very high volume or very high cost products
  Platforms are one such product: high volume
    Need to be highly configurable to adapt to different applications and constraints
[Figure: Cost per IC versus volume, with curves for 1990, 2000, and 2010 and a marker for mainstream design; high-end ICs are becoming out of reach of mainstream designers]
Configurable Platform Design: Cache
[Figure: Pre-fabricated platform (a pre-designed system-level architecture) on one IC: uP, L1 cache, DSP, JPEG decoder, peripherals, FPGA]
ARM920T: Caches consume half of total power (Segars 01)
M*CORE: Unified cache consumes half of total power (Lee/Moyer/Arends 99)
Best Cache Architecture for Embedded Systems
Not clear
  Huge variety among popular embedded processors
What's the best...
  Associativity, Line size, Total size?

                          Instruct. Cache          Data Cache
Processor                 Size     As.  Line       Size     As.  Line
AMD-K6-IIIE               32K      2    32         32K      2    32
Alchemy AU1000            16K      4    32         16K      4    32
ARM 7                     8K/U     4    16         8K/U     4    16
ColdFire                  0-32K    DM   16         0-32K    N/A  N/A
Hitachi SH7750S (SH4)     8K       DM   32         16K      DM   32
Hitachi SH7727            16K/U    4    16         16K/U    4    16
IBM PPC 750CX             32K      8    32         32K      8    32
IBM PPC 7603              16K      4    32         16K      4    32
IBM750FX                  32K      8    32         32K      8    32
IBM403GCX                 16K      2    16         8K       2    16
IBM Power PC 405CR        16K      2    32         8K       2    32
Intel 960JA               2K       2    N/A        1K       2    N/A
Intel 960JD               4K       2    N/A        2K       2    N/A
Intel 960IT               16K      2    N/A        4K       2    N/A
Motorola MPC8240          16K      4    32         16K      4    32
Motorola MPC8540          32K      4    32/64      32K      4    32/64
Motorola MPC7455          32K      8    32         32K      8    32
NEC VR5500                32K      2    32         32K      2    32
NEC VR4131                16K      2    16/32      16K      2    16/32
NEC VR4181                4K       DM   16         4K       DM   16
NEC VR4181A               8K       DM   32         8K       DM   32
NEC VR4121                16       DM   16         8K       DM   16
PMC Sierra RM9000X2       16K      4    N/A        16K      4    N/A
PMC Sierra RM7000A        16K      4    32         16K      4    32
SandCraft sr71000         32K      4    32         32K      4    32
Sun Ultra SPARC Iie       16K      2    N/A        16K      DM   N/A
SuperH                    32K      4    32         32K      4    32
TI TMS320C6414            16K      DM   N/A        16K      2    N/A
TriMedia TM32A            32K      8    64         16K      8    64
Xilinx Virtex IIPro       16K      2    32         8K       2    32
(As. = associativity; DM = direct mapped)
Cache Associativity
Direct mapped cache
  Certain bits “index” into cache
  Remaining “tag” bits compared
Set associative cache
  Multiple “ways”
  Fewer index bits, more tag bits, simultaneous comparisons
  More expensive, but better hit rate
[Figure: Addresses A (00 0 000), B (01 0 000), C (10 0 000), and D (11 0 000) all map to index 0000 and conflict in a direct mapped (1-way set associative) cache, which holds only D with tag 11. A 2-way set associative cache with index 000 holds both D (tag 110) and C (tag 100).]
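The index/tag split and the conflict shown in the figure can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the 6-bit toy addresses follow the figure (C = 10 0 000, D = 11 0 000), while the function names, the access trace, and the LRU replacement policy are assumptions made for the example.

```python
from collections import OrderedDict

def split_address(addr, index_bits):
    """Split an address (line offset already removed) into (tag, index)."""
    index = addr & ((1 << index_bits) - 1)
    tag = addr >> index_bits
    return tag, index

def count_misses(addresses, index_bits, ways):
    """Simulate a cache with 2**index_bits sets and `ways` lines per set (LRU assumed)."""
    sets = [OrderedDict() for _ in range(1 << index_bits)]
    misses = 0
    for addr in addresses:
        tag, index = split_address(addr, index_bits)
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)          # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)   # evict the least recently used line
            lines[tag] = True
    return misses

# Toy addresses from the figure.
C, D = 0b100000, 0b110000
trace = [C, D] * 4   # hypothetical trace alternating between the two lines

# Same capacity (16 lines), organized two different ways.
direct_mapped_misses = count_misses(trace, index_bits=4, ways=1)
two_way_misses = count_misses(trace, index_bits=3, ways=2)
print(direct_mapped_misses, two_way_misses)   # prints "8 2"
```

In the direct mapped organization C and D evict each other on every access; with two ways they coexist in the same set, so only the first access to each misses.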
Cache Associativity
Reduces miss rate, thus improving performance
Impact on power and energy?
  (Energy = Power * Time)
[Figure: Miss rate (0.0% to 2.0%) versus associativity (1, 2, 4) for epic and mpeg2]
Associativity is Costly
Associativity improves hit rate, but at the cost of more power per access
Are the power savings from reduced misses outweighed by the increased power per hit?
[Figure: Energy access breakdown for 8 Kbyte, 4-way set associative cache (considering dynamic power only). Components: decode_data, wordline_data, bitline_data, sa_data, data output driver, mux driver, comparator, decode_tag, wordline_tag, bitline_tag, sa_tag]
[Figure: Energy per access (nJ, 0.0 to 1.0) for 8 Kbyte cache versus associativity (1-way, 2-way, 4-way)]
Associativity and Energy
Best performing cache is not always lowest energy
[Figure: Miss rate (0.0% to 2.0%) versus associativity (1, 2, 4) for epic and mpeg2]
[Figure: Normalized energy (0.0 to 1.0) versus associativity (1, 2, 4) for epic and mpeg2; annotation: "Significantly poorer energy"]
Associativity Dilemma
Direct mapped cache
  Good hit rate on most examples
  Low power per access
  But poor hit rate on some examples
    High power due to many misses
Four-way set-associative cache
  Good hit rate on nearly all examples
  But high power per access
    Overkill for most examples, thus wasting energy
Dilemma: Design for the average or worst case?
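The dilemma can be made concrete with a very simple energy model: every access pays the cache's per-access energy, and every miss pays an extra penalty. The sketch below is only an illustration under stated assumptions. All numbers (per-access energies, miss penalty, miss rates) are hypothetical placeholders, not measurements from the talk, and the function names are invented for the example.

```python
def memory_energy(accesses, miss_rate, energy_per_access_nj, energy_per_miss_nj):
    """Total energy in nJ: each access pays the cache access energy,
    and each miss additionally pays a refill/off-chip penalty."""
    misses = accesses * miss_rate
    return accesses * energy_per_access_nj + misses * energy_per_miss_nj

ACCESSES = 1_000_000
MISS_PENALTY_NJ = 20.0   # hypothetical energy cost of one miss

# Hypothetical per-access energies: more ways cost more per access.
ENERGY_PER_ACCESS_NJ = {"1-way": 0.4, "4-way": 0.8}

# Hypothetical program X: few conflicts, direct mapped already hits well.
program_x = {"1-way": 0.010, "4-way": 0.008}
# Hypothetical program Y: many conflicts, direct mapped misses far more often.
program_y = {"1-way": 0.080, "4-way": 0.010}

def best(miss_rates):
    """Return (lowest-energy configuration, energy of every configuration)."""
    energy = {
        name: memory_energy(ACCESSES, miss_rates[name], e, MISS_PENALTY_NJ)
        for name, e in ENERGY_PER_ACCESS_NJ.items()
    }
    return min(energy, key=energy.get), energy

print(best(program_x)[0])   # "1-way": extra ways are overkill here
print(best(program_y)[0])   # "4-way": miss energy dominates here
```

With these placeholder numbers neither fixed associativity wins on both programs, which is the situation the slide describes.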
Associativity Dilemma
Obviously not a clear choice
(See the table of instruction and data cache configurations of popular embedded processors under "Best Cache Architecture for Embedded Systems".)
Our Solution: Configurable Cache
Can be configured as 4, 2, or 1 way
  Ways can be concatenated
Size can also be configured
  By shutting down ways
  Saves static power (leakage)
[Figure: Way concatenation example for address 11 0 000; one address bit selects the way]
[Figure: Way shutdown example for address 11 0 000]
Configurable Cache Design: Way Concatenation (4, 2 or 1 way)
[Figure: Way-concatenatable cache circuit. Address fields: tag address (a31 to a13), a12, a11, index (a10 to a5), line offset (a4 to a0). Configuration registers reg0 and reg1 together with address bits a11 and a12 feed a configuration circuit that produces way-enable signals c0 to c3. Each way has 6x64 decoders, a data array with bitlines, and a tag part, followed by column mux, sense amps, mux driver, and data output. The critical path is marked.]
Small area and performance overhead
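One way to read the figure is that address bits a11 and a12 act as tag bits in 4-way mode and as way-select bits when ways are concatenated. The sketch below illustrates that idea only. The field widths (5 offset bits, 6 index bits) come from the figure's address labels, but the exact pairing of ways in 2-way mode and the function names are assumptions, not the circuit's documented behavior.

```python
LINE_OFFSET_BITS = 5   # a4..a0
INDEX_BITS = 6         # a10..a5, matching the 6x64 decoders in the figure

def row_index(addr):
    """Row selected inside each way by index bits a10..a5."""
    return (addr >> LINE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

def enabled_ways(addr, associativity):
    """Physical ways (0..3) that are activated for `addr` in a configuration."""
    a11 = (addr >> 11) & 1
    a12 = (addr >> 12) & 1
    if associativity == 4:
        return [0, 1, 2, 3]          # a11 and a12 are compared as tag bits
    if associativity == 2:
        return [a11, a11 + 2]        # assumed pairing: a11 picks one way per pair
    if associativity == 1:
        return [(a12 << 1) | a11]    # a12 and a11 pick exactly one way
    raise ValueError("associativity must be 4, 2, or 1")

addr = 0x1820   # hypothetical address with a12 = 1, a11 = 1, a5 = 1
print(row_index(addr))          # 1
print(enabled_ways(addr, 4))    # [0, 1, 2, 3]
print(enabled_ways(addr, 2))    # [1, 3]
print(enabled_ways(addr, 1))    # [3]
```

In every configuration the total capacity is unchanged; what changes is how many ways are activated per access, which is where the per-access energy saving would come from.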
Configurable Cache Experiments
Motorola PowerStone benchmark g3fax
  Way concatenation outperforms 4-way and direct mapped
[Figure: Energy (nJ, 0.0000 to 0.0040) per cache configuration for g3fax]
Configurable Cache Experiments
Configurable cache with both way concatenation and way shutdown was best on average
  Considered programs from Powerstone, MediaBench, and Spec2000
  And, it was superior on every benchmark
[Figure: Normalized energy (100% = 4-way conventional cache) per benchmark (padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjpeg, ucbqsort, v42, adpcm, epic, jpeg, mpeg2, pegwit, g721, art, mcf, parser, vpr, Average) for series Cnv, I1D1, cnct, shut, and both; off-scale bars are labeled 114%, 268%, and 116%]
Configurable Cache Experiments – Line Size Too
Best line size also differs per example
  Our cache can be configured for a line of 16, 32 or 64 bytes
  64 is usually best, but 16 is much better in a couple of cases
A configurable cache with way concatenation, way shutdown, and variable line size can save a lot of energy
[Figure: Normalized energy (100% = 4-way conventional cache) per benchmark (padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjpeg, ucbqsort, v42, adpcm, epic, g721, pegwit, mpeg, jpeg) for series csb16, csb32, csb64, cnv4w32, and cnv1w32 (csb: concatenate plus shutdown cache); off-scale bars are labeled 127%, 127%, 122%, 126%, 129%, 119%, 144%, 147%, 230%, 133%, 144%, and 125%]
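Choosing a configuration per program amounts to a small search over the combinations the cache supports. The sketch below enumerates one plausible configuration space and picks the lowest-energy entry. It assumes an 8 KB base cache built from four 2 KB ways, that shutdown leaves 4, 2, or 1 ways active, and that concatenation can only lower associativity. The energy function is a made-up placeholder standing in for simulation or measurement. None of these details are stated in this form on the slides.

```python
from itertools import product

WAY_SIZE_KB = 2   # assumption: 8 KB base cache made of four 2 KB ways

def configurations():
    """Enumerate (size, associativity, line size) combinations."""
    configs = []
    for active_ways, line_bytes in product((4, 2, 1), (16, 32, 64)):
        for assoc in (4, 2, 1):
            if assoc <= active_ways:   # cannot have more ways than are powered
                configs.append({
                    "size_kb": active_ways * WAY_SIZE_KB,
                    "assoc": assoc,
                    "line": line_bytes,
                })
    return configs

def pick_lowest_energy(configs, energy_of):
    """Return the configuration with the smallest estimated energy."""
    return min(configs, key=energy_of)

def toy_energy(cfg):
    """Hypothetical stand-in for a per-program energy estimate."""
    return 0.2 * cfg["assoc"] + 8 / cfg["size_kb"] + 32 / cfg["line"]

all_configs = configurations()
print(len(all_configs))                             # 18
print(pick_lowest_energy(all_configs, toy_energy))
```

In practice the placeholder `toy_energy` would be replaced by an energy estimate obtained for each benchmark, and a different program could select a different entry.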
Configurable Platform Use
[Figure: Pre-fabricated platform IC: uP, L1 cache, DSP, JPEG decoder, peripherals, FPGA]
Platforms increasingly come with on-chip FPGA
  Can we use that FPGA to improve software performance and energy?
Commercial Single-Chip Microprocessor/FPGA Platforms
[Figure: Triscend E5 chip: configurable logic, 8051 processor plus other peripherals, memory]
Triscend E5: based on 8-bit 8051 CISC core
  10 Dhrystone MIPS at 40 MHz
  60 kbytes on-chip RAM
  Up to 40K logic gates
  Cost only about $4 (in volume)
Single-Chip Microprocessor/FPGA Platforms
Atmel FPSLIC
  Field-Programmable System-Level IC
  Based on AVR 8-bit RISC core
  20 Dhrystone MIPS
  5k-40k configurable logic gates
  On-chip RAM (20-36Kb) and EEPROM
  $5-$10
Courtesy of Atmel
Single-Chip Microprocessor/FPGA Platforms
Triscend A7 chip
  Based on ARM7 32-bit RISC processor
  54 Dhrystone MIPS at 60 MHz
  Up to 40k logic gates
  On-chip cache and RAM
  $10-$20 in volume
Courtesy of Triscend
Single-Chip Microprocessor/FPGA Platforms
Altera’s Excalibur EPXA 10
  ARM (922T) hard core
  ~200 Dhrystone MIPS at ~200 MHz
  Devices range from ~200k to ~2 million programmable logic gates
Source: www.altera.com
Single-Chip Microprocessor/FPGA Platforms
Xilinx Virtex II Pro
  PowerPC based
    420 Dhrystone MIPS at 300 MHz
  1 to 4 PowerPCs
  4 to 16 gigabit transceivers
  12 to 216 multipliers
  3,000 to 50,000 logic cells
  200k to 4M bits RAM
  204 to 852 I/O
  $100-$500 (>25,000 units)
[Figure: Virtex II Pro chip showing PowerPCs, configurable logic, and up to 16 serial transceivers (622 Mbps to 3.125 Gbps)]
Courtesy of Xilinx
Single-Chip Microprocessor/FPGA Platforms
Why wouldn’t future microprocessor chips include some amount of on-chip FPGA?
Single-Chip Microprocessor/FPGA Platforms
Lots of silicon area taken up by configurable logic
  As discussed earlier, less of an issue every year
  Smaller area doesn’t necessarily mean higher yield (lower costs) any more
    Previously could pack more die onto a wafer
    But die are becoming pad (pin) limited in nanoscale technologies
Configurable logic typically used for peripherals, glue logic, etc.
  We have investigated another use...
Software Improvements using On-Chip Configurable Logic
Partitioned software critical loops onto on-chip FPGA for several benchmarks
  Most time spent in one or two loops
Extensive simulated results for 8051 and MIPS
  For Powerstone (PS), MediaBench (MB) and Netbench (NB)
[Figure: Six bar charts of percent execution time versus loop (1 to 10)]
Software Improvements using On-Chip Configurable Logic
Example    Archit  Cycles_orig    Cycles_sw      Cycles_hw   Clk_hw  Sp.   P_sw  P_hw   E_orig  E_sw/hw  E_sav  Area
PS_g3fax   8051    19,675,456     10,812,544     176,562     25      2.2   0.05  0.032  0.1142  0.05408  53%    2,858
PS_crc     8051    291,196        180,224        7,168       25      2.5   0.05  0.028  0.0017  0.00071  58%    770
PS_summin  8051    109,821,892    20,394,080     384,416     25      1.2   0.05  0.033  0.6376  0.53657  16%    4,191
PS_brev    8051    330,064        305,768        1,360       25      12.9  0.05  0.034  0.0019  0.00015  92%    3,961
PS_matmul  8051    119,420        101,576        2,560       25      5.9   0.05  0.035  0.0007  0.00012  82%    5,882
PS_g3fax   MIPS    15,600,000     4,720,000      599,000     100     1.4   0.07  0.111  0.0265  0.02163  18%    2,858
PS_adpcm   MIPS    113,000        29,300         5,440       100     1.3   0.07  0.181  0.0002  0.00018  6%     8,075
PS_crc     MIPS    5,040,000      3,480,000      460,800     100     2.5   0.07  0.061  0.0086  0.00379  56%    770
PS_des     MIPS    142,000        70,700         15,100      100     1.6   0.07  0.197  0.0002  0.00019  20%    9,031
PS_engine  MIPS    915,000        145,000        28,100      100     1.1   0.07  0.082  0.0016  0.00146  6%     2,074
PS_jpeg    MIPS    7,900,000      646,000        171,000     100     1.1   0.07  0.092  0.0134  0.01360  -1%    3,161
PS_summin  MIPS    2,920,000      1,270,000      266,000     100     1.5   0.07  0.111  0.0050  0.00375  24%    4,191
PS_v42     MIPS    3,850,000      846,000        216,000     100     1.2   0.07  0.102  0.0065  0.00605  7%     3,319
PS_brev    MIPS    3,566          2,499          138         100     3.0   0.07  0.107  0.0000  0.00000  62%    3,961
MB_g721    MIPS    838,230,002    457,674,179    9,985,261   100     2.1   0.07  0.152  1.4250  0.75035  47%    5,811
MB_adpcm   MIPS    32,894,094     32,866,110     1,183,260   42      11.6  0.07  0.130  0.0559  0.00821  85%    14,132
MB_pegwit  MIPS    42,752,919     33,276,287     2,167,651   50      3.1   0.07  0.170  0.0727  0.03241  55%    18,150
NB_dh      MIPS    1,793,032,157  1,349,063,192  45,156,767  69      3.5   0.07  0.121  3.0482  1.00547  67%    21,383
NB_md5     MIPS    5,374,034      3,046,881      289,877     47      1.8   0.07  0.251  0.0091  0.00722  21%    90,074
NB_tl      MIPS    57,412,470     29,244,221     2,479,552   58      1.8   0.07  0.059  0.0976  0.05930  39%    5,478
Average:                                                             3.2                                 34%    10,507

Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)
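The Sp. and E_sav columns can be related to the other columns with a short calculation. The sketch below is one reading of the table, not a formula given in the talk: it assumes Cycles_sw is the processor cycle count of the loops moved to hardware, Cycles_hw is their cycle count on the FPGA at Clk_hw MHz, and the processor runs at 25 MHz (8051) or 100 MHz (MIPS). The processor clock values are assumptions. Under those assumptions the rows checked here reproduce the table's rounded values.

```python
def speedup(cycles_orig, cycles_sw, cycles_hw, clk_cpu_mhz, clk_hw_mhz):
    """Overall speedup when loops costing cycles_sw processor cycles are
    replaced by hardware needing cycles_hw cycles at clk_hw_mhz."""
    time_orig = cycles_orig / clk_cpu_mhz
    time_new = (cycles_orig - cycles_sw) / clk_cpu_mhz + cycles_hw / clk_hw_mhz
    return time_orig / time_new

def energy_savings(e_orig, e_swhw):
    """Fraction of energy saved by the partitioned (sw/hw) version."""
    return 1.0 - e_swhw / e_orig

# PS_g3fax on the 8051; the 25 MHz processor clock is an assumption.
s_g3fax = speedup(19_675_456, 10_812_544, 176_562, clk_cpu_mhz=25, clk_hw_mhz=25)
# MB_adpcm on MIPS; the 100 MHz processor clock is an assumption.
s_adpcm = speedup(32_894_094, 32_866_110, 1_183_260, clk_cpu_mhz=100, clk_hw_mhz=42)
# PS_g3fax on the 8051, from the E_orig and E_sw/hw columns.
sav_g3fax = energy_savings(0.1142, 0.05408)

print(round(s_g3fax, 1), round(s_adpcm, 1), round(sav_g3fax * 100))   # 2.2 11.6 53
```

The same shape explains why a loop that dominates execution time (as in MB_adpcm) gives a large overall speedup even when the hardware clock is slower than the processor clock.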
Speedup Gained with Relatively Few Gates
Created several partitioned versions of each benchmark
Most speedup gained with first 20,000 gates
  Surprisingly few gates
[Figure: Speedup (1.0 to 5.0) versus gates (0 to 25,000) for G721(MB), ADPCM(MB), PEGWIT(MB), DH(NB), MD5(NB), TL(NB), and URL(NB); off-scale annotations: 27.2, and 2.05 at 90,000]
Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002
Stitt and Vahid, IEEE Design and Test, Dec. 2002
J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).
Software Improvements using On-Chip Configurable Logic – Verified through Physical Measurement
Performed physical measurements on Triscend A7 and E5 devices
  Similar results (even a bit better)

A7 results
Benchmark  Time_orig  Time_sw/hw  Sp.  P_orig  P_sw/hw  E_orig  E_sw/hw  E_sav
PS_g3fax   11.47      7.44        1.5  1.320   1.332    15.140  9.910    35%
PS_crc     10.92      4.51        2.4  1.320   1.320    14.414  5.953    59%
PS_brev    9.84       3.28        3.0  1.332   1.344    13.107  4.408    66%
Average:                          2.3                                    53%

E5 results
Benchmark  Time_orig  Time_sw/hw  Sp.  P_orig  P_sw/hw  E_orig  E_sw/hw  E_sav
PS_g3fax   15.16      7.11        2.1  0.252   0.270    3.820   1.920    50%
PS_crc     10.64      4.64        2.3  0.207   0.225    2.202   1.044    53%
PS_brev    17.81      1.81        9.8  0.252   0.270    4.488   0.489    89%
Average:                          4.8                                    64%

[Figure: Triscend A7 development board with the A7 IC]
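The measured columns are tied together by Energy = Power * Time, so the energy, speedup, and savings columns can be recomputed from the time and power columns. The sketch below does this for the A7 rows. The values are copied from the table; the function and variable names are invented for the example, and the table does not state units, so none are assumed here.

```python
# A7 rows: benchmark -> (time_orig, time_swhw, power_orig, power_swhw)
a7 = {
    "PS_g3fax": (11.47, 7.44, 1.320, 1.332),
    "PS_crc":   (10.92, 4.51, 1.320, 1.320),
    "PS_brev":  (9.84, 3.28, 1.332, 1.344),
}

def summarize(rows):
    """Recompute speedup, energies, and energy savings for each benchmark."""
    out = {}
    for name, (t_orig, t_new, p_orig, p_new) in rows.items():
        e_orig = p_orig * t_orig          # Energy = Power * Time
        e_new = p_new * t_new
        out[name] = {
            "speedup": t_orig / t_new,
            "e_orig": e_orig,
            "e_swhw": e_new,
            "e_sav": 1.0 - e_new / e_orig,
        }
    return out

results = summarize(a7)
avg_speedup = sum(r["speedup"] for r in results.values()) / len(results)
avg_savings = sum(r["e_sav"] for r in results.values()) / len(results)

print(round(results["PS_g3fax"]["e_orig"], 2))        # 15.14
print(round(avg_speedup, 1), round(avg_savings * 100))  # 2.3 53
```

Note that power rises slightly in the sw/hw versions, yet energy still falls because execution time drops by a larger factor.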
Other Types of Configurability
Microprocessor (other researchers)
  VLIW configurations
  Voltage scaling
Peripherals
  e.g., JPEG decoder with different precisions
Bus topology
Etc.
[Figure: Platform IC: uP, L1 cache, DSP, JPEG decoder, peripherals, FPGA]
Conclusions
Trend is away from semi-custom IC fabrication
  Pressures encourage buying pre-fabricated platforms
Platforms must be highly configurable
  To be useful for a variety of applications, and hence mass produced
We have discussed
  Software speedup/energy benefits of on-chip configurable logic: 3x speedups and 34% energy savings with only ~10,000 gates
  Creating a highly-configurable cache architecture: 40% energy savings compared to conventional cache
Designing highly-configurable platforms, and facilitating their use with good exploration tools, can help enable platform-based design
See http://www.cs.ucr.edu/~vahid for more information