Raw Fabrics for PCA Status and Plans
Anant Agarwal, Saman Amarasinghe
Agenda
• 09:00 – 10:00 Raw Fabrics Status and Plans (Agarwal)
• 10:00 – 10:30 Streams and software systems (Amarasinghe)
• 10:30 – 11:00 Morphware update (Thies)
• 11:00 – 11:20 Operating system update (Strumpen)
• 11:20 – 11:50 Lab visit and demos (All)
• 11:50 – 12:20 Applications (Crago)
• 12:20 – 12:35 Stream Algorithms (Hoffman)
• 12:35 – 12:50 x86 on Raw (Wentzlaff)
• 12:50 – 1:30 Discussion (All)
• 12:00 Lunch
The Raw Chip
Raw Architecture
[Figure: Raw chip and one Raw tile. Chip-level labels: Disk stream, Video1, SDRAM, Packet stream. Tile-level labels: SMEM, SWITCH, PC, DMEM, IMEM, REG, PC, FPU, ALU]
• A 16-tile 2-D fabric (1K tiles in 2010)
• Memory is distributed
• RISC-like core in each tile, with FP
• Fast, programmable interconnect (r-r 3 cycles)
• ~1100 off-chip data I/O
Taylor et al., IEEE Micro ‘02, ISSCC ‘03
Raw Handheld
Well… Raw Handheld
• First program, 80 MHz: Jan 03
• “Thorough testing”, 300 MHz: May 03
1024 Channel Audio Beam Forming
[Figure: beamformer block diagram. Labels: Microphone Array, Audio A-to-D (16 channels each), ADAT Optical, Audio Interface for RAW (FPGA), RAW; bus widths marked 2, 16, 128, 190]
One PCA chip beats current 640 channel custom hardware beamformer!
First proposed at review in Nov 02
[Figure: beamformer data rates. A-to-D (16 KHz, 24 bits) to CPLD: 768 Kbits/sec; x16: 12 Mbits/sec; x32 into FPGA, RAW, FPGA: 384 Mbits/sec; 1024 microphones]
A 1024-Node Acoustic Beamformer
2-Microphone Card
32-Microphone Column
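The data rates on the beamformer figure can be checked with a little arithmetic. The sketch below assumes a hierarchy of 2 microphones per card, 16 cards per column, and 32 columns; that grouping is inferred from the figure labels ("2-Microphone Card", "32-Microphone Column", the 16 and 32 multipliers) rather than stated in the slides.

```python
# Sketch: reproduce the beamformer data rates from the figure labels.
# The grouping below (2 mics/card, 16 cards/column, 32 columns) is an
# inference from the slide labels, not a documented specification.

SAMPLE_RATE_HZ = 16_000   # 16 KHz sampling, from the slide
BITS_PER_SAMPLE = 24      # 24-bit samples, from the slide
MICS_PER_CARD = 2         # "2-Microphone Card"
CARDS_PER_COLUMN = 16     # assumed: 16 cards make a "32-Microphone Column"
COLUMNS = 32              # assumed: 32 columns give 1024 microphones


def beamformer_rates():
    """Return the aggregate bit rate (bits/sec) at each level of the array."""
    per_mic = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
    per_card = per_mic * MICS_PER_CARD
    per_column = per_card * CARDS_PER_COLUMN
    whole_array = per_column * COLUMNS
    return {
        "mic": per_mic,
        "card": per_card,
        "column": per_column,
        "array": whole_array,
    }


rates = beamformer_rates()
microphones = MICS_PER_CARD * CARDS_PER_COLUMN * COLUMNS

for level, bits_per_sec in rates.items():
    print(f"{level:>6}: {bits_per_sec / 1000:,.0f} Kbits/sec")
print(f"microphones: {microphones}")
```

Under these assumptions a card produces 768 Kbits/sec, matching the slide. A column produces 12,288 Kbits/sec and the full array 393,216 Kbits/sec, which line up with the slide's "12 Mbits/sec" and "384 Mbits/sec" if one Mbit is read as 1024 Kbits.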
Raw Chip Specifications
• IBM SA27E Process
– 0.15 µm, 6-metal copper ASIC process
• 16 Tile RAW Processor
– 18.17mm x 18.17mm
– 1657 pin CCGA package
– 1152 signal pins
• Clock and Power
– 420MHz (actual)
– 10 watts (power save turned on)
– 18 watts typical
– 35 watts if everything is used!
“PowerPoint” Performance
• Raw Chip (@420MHz)
– ~7 GOPS/GFLOPS (SP)
– ~100 GBytes/s of on-chip memory bandwidth
– ~90 GBytes/s of on-chip “bisection bandwidth”
– ~40 GBytes/s I/O bandwidth
No bugs so far!
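The "~7 GOPS" figure is consistent with a simple peak estimate: tiles times clock rate. The sketch below assumes each tile issues one operation per cycle; that issue rate is an assumption made for the arithmetic, not a number taken from the slides.

```python
# Sketch: peak-throughput estimate for one Raw chip.
# ops_per_tile_per_cycle=1 is an assumption for illustration.

TILES = 16                 # from the slide: 16-tile chip
CLOCK_HZ = 420_000_000     # from the slide: 420 MHz (actual)


def peak_ops_per_second(tiles, clock_hz, ops_per_tile_per_cycle=1):
    """Upper bound on operations per second if every tile issues every cycle."""
    return tiles * clock_hz * ops_per_tile_per_cycle


peak = peak_ops_per_second(TILES, CLOCK_HZ)
print(f"peak: {peak / 1e9:.2f} GOPS")
```

This gives 6.72 GOPS, which rounds to the slide's "~7". The bandwidth figures depend on port widths that the slides do not give, so they are not reproduced here.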
Progress on the Raw Chip
• Complete Spec Feb ’00
• IBM Initial Design Review Mar ’00
• Feature complete Netlist May ’00
• Arch. Timing optimization Feb ’01
• Floorplanning Mar ’01
• Prelim Placement/Timing opt Jun ’01
• Raw H21 system board (ISI) Jun ’01
• Raw in Emulation Jun ’01
• Detailed Placement/Timing opt Dec ’01
• Release to IBM for initial layout Dec ’01
• Timing closure after layout Mar ’02
• All backend checks pass May ’02
• Release to IBM for production layout May ’02
• Final function and timing validation Jul ’02
• Final manuf. release to IBM Aug ’02
• Chip prototypes back Oct ’02
PCA Phase 2 Effort
The Raw Processor
Rawcc Compiler
Stream Compiler
Resource Management: The Raw OS
Applications and Evaluation
Embedded systems e.g., network router
libStream
Raw Fabric Testbed
PCA Raw Fabrics, Systems, Apps
• Raw Chips Oct 02
• Handheld (H) board arrives from ISI Dec 02
• H Board bringup – Small program 80 MHz Jan 03
• H Board testing, speed gasket 300 MHz May 03
• USB Interface, 500 Mbits/s xface July 03
• H Board refab (in fab now), to partners Sep 03
• Fabric-Array and Fabric-IO board design Jun 03
• Fabric-Array and Fabric-IO board fab Sep 03
• 16 and 64-chip PCA fabric bringup
• Applications and experiments
• PCA demonstrations
– Embedded networking board
– Audio beamformer system
– 802.11b,g,a wireless system
– Graphics system
– Virtual x86
Partner Support Activity
• Handheld boards Sep 03
• USB xface
• PCI xface
• “Raw User Day” videos and documentation
• Expansion interface testing and documentation (used in beamformer)
• Software distribution
– Simulator (useful to debug small assembly programs)
– C compiler
– rGDB debugger
– StreamIt language and compiler
– Lots of other goodies
• 1024-tile (64 chip) fabric simulator (since Dec 02)
• 16, 64 node Fabrics
64-Node Raw Fabric
[Figure: 64-node Raw fabric. Edge labels: DRAM, Network I/O]
Fabric System Architecture
• Design: two distinct board designs; HOW???
• replicate and connect
• Board 1: Quad Raw Board
• Board 2: I/O & Memory Board
The Challenge
• How do we use the same board designs for every position in the fabric? Fabric board is easy enough.
The Challenge
• How do we use the same board designs for every position in the fabric? E.g., I/O board
The Saman Flip
• How do we use the same board designs for every position in the fabric?
– IO Board
• symmetric about x-axis
• compensate for board flip in firmware
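One way to picture "compensate for board flip in firmware" is as a port remapping. The sketch below is purely illustrative: it assumes the flip amounts to mirroring the board about its x-axis, and the side names, port indexing, and `ports_per_side` value are all hypothetical, not taken from the actual firmware.

```python
# Sketch: remap a physical port to its logical position on a flipped board.
# Hypothetical model: ports on the top/bottom edges are indexed left to right,
# ports on the left/right edges are indexed bottom to top.


def flip_about_x(side, index, ports_per_side=4):
    """Return the (side, index) a port occupies after mirroring about the x-axis.

    Mirroring about the x-axis swaps the top and bottom edges (x positions are
    unchanged) and reverses the bottom-to-top order along the left and right edges.
    """
    if not 0 <= index < ports_per_side:
        raise ValueError("port index out of range")
    if side == "top":
        return ("bottom", index)
    if side == "bottom":
        return ("top", index)
    if side in ("left", "right"):
        return (side, ports_per_side - 1 - index)
    raise ValueError(f"unknown side: {side!r}")


def logical_port(side, index, flipped):
    """Port position as seen by the rest of the fabric."""
    return flip_about_x(side, index) if flipped else (side, index)


print(logical_port("top", 1, flipped=True))
print(logical_port("left", 0, flipped=True))
```

Because a mirror is its own inverse, applying `flip_about_x` twice returns the original port, so one table can serve both directions of the mapping.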
Quad Board
•4 RAW chips per board
•16 152-pin MICTOR connectors total (4 per side)
•Power distributed over separate cables from other signals
•MICTOR connectors are stacked to save space
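To make the replicate-and-connect idea concrete, the sketch below maps a global tile coordinate in a 64-chip fabric to its quad board, chip, and tile. It assumes each 16-tile chip is a 4x4 grid, each quad board holds its 4 chips as 2x2, and the 64 chips form an 8x8 array (so 16 quad boards in a 4x4 arrangement). The chip and board arrangements are assumptions for illustration; the slides give only the counts.

```python
# Sketch: locate a tile in a 1024-tile (64-chip) fabric built from quad boards.
# Layout assumptions (not from the slides): 4x4 tiles per chip,
# 2x2 chips per quad board, 4x4 quad boards in the fabric.

TILES_PER_CHIP_SIDE = 4
CHIPS_PER_BOARD_SIDE = 2
BOARDS_PER_FABRIC_SIDE = 4

FABRIC_SIDE = TILES_PER_CHIP_SIDE * CHIPS_PER_BOARD_SIDE * BOARDS_PER_FABRIC_SIDE


def locate(tile_x, tile_y):
    """Split a global tile coordinate into board, chip-on-board, and tile-on-chip."""
    if not (0 <= tile_x < FABRIC_SIDE and 0 <= tile_y < FABRIC_SIDE):
        raise ValueError("tile coordinate outside the fabric")
    chip_x, local_x = divmod(tile_x, TILES_PER_CHIP_SIDE)
    chip_y, local_y = divmod(tile_y, TILES_PER_CHIP_SIDE)
    board_x, chip_on_board_x = divmod(chip_x, CHIPS_PER_BOARD_SIDE)
    board_y, chip_on_board_y = divmod(chip_y, CHIPS_PER_BOARD_SIDE)
    return {
        "board": (board_x, board_y),
        "chip": (chip_on_board_x, chip_on_board_y),
        "tile": (local_x, local_y),
    }


total_tiles = FABRIC_SIDE * FABRIC_SIDE
quad_boards = BOARDS_PER_FABRIC_SIDE ** 2
print(total_tiles, quad_boards)
print(locate(5, 9))
```

With these assumptions the fabric is 32x32 tiles (1024 in total) on 16 quad boards, consistent with the "1024-tile (64 chip)" fabric mentioned elsewhere in the deck.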
Quad Board Layout
[Figure: quad board layout, dimensions marked 11” x 11”]
I/O & Memory Board
•4 FPGAs
•2 64-bit PCI slots
•2 Expansion Ports (same as on Raw Handheld board)
•4 SDRAM banks
•symmetric design
[Figure: I/O & Memory board layout, dimension marked 11”]
IO/Memory Board schematic
[Figure: QUAD RAW IO BOARD schematic (QUAD_RAW_IO, rev 1.0, dated 7-10-2003; Information Sciences Institute, University of Southern California). Blocks: configuration controller, configuration memory, FPGA PCI 0, FPGA PCI 1, FPGA EXP 0, FPGA EXP 1, SDRAM PCI 0, SDRAM PCI 1, SDRAM EXP 0, SDRAM EXP 1, PCI connectors (AD[63:0]), expansion connectors (IO0_[189:0], IO1_[189:0]), connectors PI0 to PI7 and PO0 to PO7 ([36:0] each), utility, clocks, power]
Power Distribution
• 48V distributed to all boards, then down-converted
• DC-DC converters on each board
– 1.8V Raw core
– 1.5V Raw I/O
– 3V other logic
– 1.5V is also further down converted to 0.75V supply for HSTL termination
• System-wide power supply can be up to 3kW
At 1.8V, 64 Raw chips can draw 1280 amps!!!!!!!!!!!
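The current figure and the choice of 48V distribution both follow from P = V x I. The sketch below is a rough check, not a power budget: it assumes every chip draws its 35 W maximum entirely from the 1.8V core rail (in practice some of that power is on the 1.5V I/O rail), and it ignores DC-DC converter losses.

```python
# Sketch: why the fabric distributes 48V and converts locally.
# Simplifying assumptions: all 35 W per chip comes from the 1.8V rail,
# and DC-DC conversion is lossless.

CHIPS = 64
MAX_WATTS_PER_CHIP = 35      # from the chip specification slide
CORE_VOLTS = 1.8
DISTRIBUTION_VOLTS = 48
SUPPLY_WATTS = 3000          # "up to 3kW" system-wide supply


def amps(watts, volts):
    """Current drawn by a load of the given power at the given voltage."""
    return watts / volts


total_chip_watts = CHIPS * MAX_WATTS_PER_CHIP
core_amps = amps(total_chip_watts, CORE_VOLTS)
bus_amps = amps(SUPPLY_WATTS, DISTRIBUTION_VOLTS)

print(f"chips at full load: {total_chip_watts} W")
print(f"current at 1.8V:    {core_amps:.0f} A")
print(f"current at 48V:     {bus_amps:.1f} A for the full {SUPPLY_WATTS} W")
```

This gives roughly 1244 A at 1.8V, close to the slide's 1280 A (which corresponds to 20 A per chip), versus 62.5 A at 48V for the whole 3 kW supply. Carrying the power between boards at 48V keeps cable and connector currents about twenty times lower.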
Power Distribution
• Distributed over special connectors, separately from signals
• external power supply feeds top and bottom rows of I/O Boards
power supply
Clock Distribution
• signal generated and distributed from a center board over MICTOR connectors
• uses DLLs to deskew the clock at each connection
• every quad board sends a copy of the clock to each neighbor and receives one from each; dip switches select which input clock to use
clock generator
Clock Distribution
[Figure labels: from external input, DLL]
• Synchronized clocks for all Raw chips in fabric
• Delay-Locked Loop uses feedback to tune delay line for clock synchronization
• Dip switches keep clock dist. general, no custom firmware
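The feedback idea behind a delay-locked loop can be shown with a toy model: keep adding delay until the delayed clock edge lines up with a reference edge. The sketch below is a simplified discrete model, not the DLL used on the boards; the tap size, tap count, and path delay are made-up values for illustration.

```python
# Sketch: toy delay-locked loop with a tapped delay line.
# All numeric values except the 420 MHz clock are hypothetical.


def lock_dll(period_ps, path_delay_ps, tap_ps=50, max_taps=256):
    """Add delay taps until the output edge aligns with a reference edge.

    The phase detector reports how far the delayed edge lags the most recent
    reference edge. The loop adds one tap at a time until that lag is smaller
    than one tap, i.e. total delay is (almost) a whole number of periods.
    Returns (taps_used, residual_phase_error_ps).
    """
    for taps in range(max_taps + 1):
        total_delay = path_delay_ps + taps * tap_ps
        phase_error = total_delay % period_ps
        if phase_error < tap_ps:
            return taps, phase_error
    raise RuntimeError("delay line too short to lock")


PERIOD_PS = 2381   # about one period at 420 MHz
taps, error = lock_dll(PERIOD_PS, path_delay_ps=700)
print(f"locked with {taps} taps, residual skew {error} ps")
```

In this model a 700 ps distribution delay is padded out to one full clock period, leaving a residual skew smaller than one tap. Each board locking its own copy this way is what lets every Raw chip see edges that coincide, whatever the wiring delay from its neighbor.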
Reset Distribution
• signal generated by one of the I/O boards and distributed over MICTOR connectors
[Figure label: reset originates here]