www.huawei.com
HUAWEI TECHNOLOGIES Co., Ltd.
Smart Memory
Bill Lynch
Sailesh Kumar
and Team
HUAWEI TECHNOLOGIES CO., LTD. Page 2
Overview
High Performance Packet Processing
Challenges
Solution – Smart Memory
Smart Memory Architecture
HUAWEI TECHNOLOGIES CO., LTD. Page 3
Packet Processing Workload Challenges
Sequential memory references
› For lookups (L2, L3, L4, and L7)
› Finite automata traversal
Read-modify-write
› Statistics, counters, token-bucket, mutex, etc.
Pointer and linked-list management
› Buffer management, packet queues, etc.
Traditional implementations use
› Commodity memory to store data
› NPs and ASICs to process data in memory
[Figure: two rows of processors (P) above a row of memory blocks]
Performance Barriers:
1. Memory and chip I/O bandwidth
2. Memory latency
3. Lock for atomic access
Tons of memory references and minimal compute
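The read-modify-write pattern named on this slide can be illustrated with a token-bucket policer. The sketch below is a minimal illustration, not the presenters' implementation; the class name `TokenBucket`, its parameters, and the byte-based accounting are assumptions made for the example. Each call reads the stored state, does a few additions and comparisons, and writes the state back, which is the "many memory references, minimal compute" profile the slide describes.

```python
class TokenBucket:
    """Hypothetical token-bucket policer; state is (tokens, last_update)."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = float(burst_bytes)  # bucket starts full (assumption)
        self.last_update = 0.0

    def allow(self, packet_bytes, now):
        # Read: fetch the stored state.
        tokens, last = self.tokens, self.last_update
        # Modify: refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        conforming = tokens >= packet_bytes
        if conforming:
            tokens -= packet_bytes
        # Write: store the updated state back.
        self.tokens, self.last_update = tokens, now
        return conforming


bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
first = bucket.allow(1000, now=0.0)   # bucket is full, packet conforms
second = bucket.allow(1000, now=0.0)  # only 500 tokens left, packet exceeds
third = bucket.allow(1000, now=0.5)   # 500 more tokens accrued, packet conforms
```

In a multi-core system the read and the write above must not interleave with another core's update of the same bucket, which is why the slide lists "lock for atomic access" as a performance barrier.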
HUAWEI TECHNOLOGIES CO., LTD. Page 4
Illustration of Performance Barrier I
[Figure: processors (P) and memory blocks connected through an interconnection network; an IP lookup tree (binary trie with edges labeled 0/1 and prefixes P1 to P5) is stored in memory]
IP lookup tree
› Requires several transactions between memory and processors
› High lookup delay due to high interconnect and memory latency
› Low IPC: need more processors
› More latency in interconnect
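To make the transaction count concrete, the following sketch walks a binary trie for a longest-prefix match and counts one memory read per node visited. It is only an illustration under stated assumptions: the class names, the bit-string keys, and the prefixes inserted here are hypothetical and are not the tree drawn on the slide.

```python
class TrieNode:
    __slots__ = ("children", "next_hop")

    def __init__(self):
        self.children = [None, None]  # child for bit 0 and bit 1
        self.next_hop = None          # set only where a prefix ends


class BinaryTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix_bits, next_hop):
        node = self.root
        for bit in prefix_bits:
            index = int(bit)
            if node.children[index] is None:
                node.children[index] = TrieNode()
            node = node.children[index]
        node.next_hop = next_hop

    def lookup(self, addr_bits):
        """Return (longest-match next hop, number of node reads)."""
        node = self.root
        reads = 1  # reading the root node
        best = node.next_hop
        for bit in addr_bits:
            node = node.children[int(bit)]
            if node is None:
                break
            reads += 1  # each further node is another memory transaction
            if node.next_hop is not None:
                best = node.next_hop
        return best, reads


trie = BinaryTrie()
trie.insert("0", "P1")
trie.insert("01", "P2")
trie.insert("0110", "P3")
trie.insert("1", "P4")

hop, reads = trie.lookup("01101100")
print(hop, reads)  # P3 5
```

When the processor and the trie sit on opposite sides of the interconnect, each of those reads is a separate round trip, so lookup delay grows with trie depth.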
HUAWEI TECHNOLOGIES CO., LTD. Page 5
Illustration of Performance Barrier II
Lookups are read-only, so relatively easy
Linked lists, counters, policers, etc. are read-modify-write
Requires a per-memory-address lock in multi-core systems
[Figure: processors (P) and memory blocks connected through an interconnection network]
Enqueue/Dequeue
› Lock free-list
› Get free node
› Unlock free-list
› Lock list tail
› Read list tail
› Link free node
› Update list tail
› Unlock list tail
Counters
› Lock counter
› Read counter
› Write counter
› Unlock counter
Locks often kept in memory
› Requires another transaction
› Adds significant latency
Single queue or single counter operations are extremely slow
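The counter sequence on this slide can be sketched as a transaction count. The model below is a simplification with assumed names (`RemoteMemory`, `locked_increment`): it treats lock acquire and release as one memory transaction each and ignores contention, so it shows the best case for the traditional design.

```python
class RemoteMemory:
    """Plain memory: every read or write is one trip over the interconnect."""

    def __init__(self):
        self.cells = {}
        self.transactions = 0

    def read(self, addr):
        self.transactions += 1
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        self.transactions += 1
        self.cells[addr] = value


def locked_increment(mem, counter_addr, lock_addr, amount=1):
    # Lock counter (modeled as a single uncontended transaction).
    mem.write(lock_addr, 1)
    # Read counter.
    value = mem.read(counter_addr)
    # Write counter.
    mem.write(counter_addr, value + amount)
    # Unlock counter.
    mem.write(lock_addr, 0)
    return value + amount


mem = RemoteMemory()
result = locked_increment(mem, "ctr", "ctr.lock")
print(result, mem.transactions)  # 1 4
```

Even without contention, one increment costs four round trips, and every other core updating the same counter has to wait for all four to finish.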
HUAWEI TECHNOLOGIES CO., LTD. Page 6
Overview
High Performance Packet Processing Challenges
› Memory bandwidth and latency
› Limited I/O bandwidth
Solution – Smart Memory
› Attach simple compute with data
› Attach lock with data
› Enable local memory communication
Smart Memory Architecture
› Hybrid memory – eDRAM + DDR3-DRAM
› Serial I/O
HUAWEI TECHNOLOGIES CO., LTD. Page 7
Introduction to Smart Memory
What is the real problem?
› Compute occurs far away from data
› Lock acquire/release occurs far from data
Solution: Make memory smarter by:
› Keeping compute close to data
› Managing lock close to data
› Enabling local communication
Fortunately, compute for packet processing jobs is very modest!
[Figure: before, processors (P) reach plain memory blocks through the interconnection network; after, each memory block has a compute element attached]
HUAWEI TECHNOLOGIES CO., LTD. Page 8
Introduction to Smart Memory
What is the real problem?
› Compute occurs far away from data
› Lock acquire/release occurs far from data
Fortunately, compute for packet processing jobs is very modest!
[Figure: before, processors (P) reach plain memory blocks through the interconnection network; after, each memory block has a compute element attached]
Smart Memory Advantages (get more out of fewer transactions!)
1. Lower I/O bandwidth
2. Lower processing latency
3. Higher IPC
4. Significantly higher single counter/queue performance
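The idea of attaching compute and the lock to the data can be sketched as a memory block that accepts an "add" command and performs the read-modify-write locally. This is a software analogy only: `SmartMemoryBlock` and `atomic_add` are hypothetical names, and a single Python lock per block stands in for whatever per-address locking the hardware provides.

```python
import threading


class SmartMemoryBlock:
    """Hypothetical memory block that executes simple commands next to the data."""

    def __init__(self):
        self.cells = {}
        self.transactions = 0
        self._lock = threading.Lock()  # the lock lives with the data

    def atomic_add(self, addr, amount=1):
        # One command from the processor: lock, read, modify, write and
        # unlock all happen inside the block.
        with self._lock:
            self.transactions += 1
            new_value = self.cells.get(addr, 0) + amount
            self.cells[addr] = new_value
            return new_value


def worker(block, addr, count):
    for _ in range(count):
        block.atomic_add(addr)


block = SmartMemoryBlock()
threads = [threading.Thread(target=worker, args=(block, "ctr", 1000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(block.cells["ctr"], block.transactions)  # 4000 4000
```

Each increment is one transaction across the interconnect rather than a separate lock, read, write and unlock, which is where the lower I/O bandwidth and higher single-counter performance on this slide come from.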
HUAWEI TECHNOLOGIES CO., LTD. Page 9
Overview
High Performance Packet Processing Challenges
› Memory bandwidth and latency
› Limited I/O bandwidth
Solution – Smart Memory
› Attach simple compute with data
› Attach lock with data
› Enable local memory communication
Smart Memory Architecture
› Hybrid memory – eDRAM + DDR3-DRAM
› Serial chip I/O
HUAWEI TECHNOLOGIES CO., LTD. Page 10
Smart Memory Capacity and Bandwidth @100G
[Figure: memory bandwidth (billion accesses / packet) versus memory capacity (MB) for each application. Capacity axis: 2, 4, 8, 16, 32, 64, 128, 256, 512+. Bandwidth axis: .15, .31, .62, 1.25, 2.5, 5, 10, 20, 40. Applications plotted: Basic Layer 2, Layer 2 forwarding, FIB (algorithmic), ACL (algorithmic), Statistics/Counter, Packet Buffer, Queuing/Scheduling, DPI (string), DPI (regex), Video Buffer]
HUAWEI TECHNOLOGIES CO., LTD. Page 11
Smart Memory Capacity and Bandwidth @100G
[Figure: memory bandwidth (billion accesses / packet) versus memory capacity (MB) for each application. Capacity axis: 2, 4, 8, 16, 32, 64, 128, 256, 512+. Bandwidth axis: .15, .31, .62, 1.25, 2.5, 5, 10, 20, 40. Applications plotted: Basic Layer 2, Layer 2 forwarding, FIB (algorithmic), ACL (algorithmic), Statistics/Counter, Packet Buffer, Queuing/Scheduling, DPI (string), DPI (regex), Video Buffer. The chart area is divided between the two memory types below]
64 banks eDRAM
8 channels of DDR3-DRAM
Smart Memory uses intelligent algorithms to split the data structures
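The slides do not describe the splitting algorithm, so the sketch below is purely an assumed policy for illustration: shallow trie nodes, which are few and visited by every lookup, go to one of the 64 eDRAM banks, and deeper nodes go to one of the 8 DDR3 channels, with a hash spreading nodes across banks. The function name `place_node`, the depth threshold, and the use of CRC32 as the hash are all hypothetical.

```python
import zlib

EDRAM_BANKS = 64    # from the slide
DDR3_CHANNELS = 8   # from the slide


def place_node(node_id, depth, edram_depth_limit=16):
    """Return (tier, index) for a data-structure node.

    Assumed policy: nodes shallower than `edram_depth_limit` are placed
    in eDRAM, the rest in DDR3; a hash of the node id picks the bank or
    channel so that accesses spread out.
    """
    digest = zlib.crc32(node_id.encode("ascii"))
    if depth < edram_depth_limit:
        return ("eDRAM", digest % EDRAM_BANKS)
    return ("DDR3", digest % DDR3_CHANNELS)


shallow = place_node("0110", depth=4)
deep = place_node("0110" * 6, depth=24)
print(shallow[0], deep[0])  # eDRAM DDR3
```

Under this assumption the high-bandwidth, low-capacity part of each structure lands in eDRAM and the bulk lands in DDR3, matching the capacity/bandwidth trade-off the chart is about.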
HUAWEI TECHNOLOGIES CO., LTD. Page 12
Smart Memory High Level Architecture
[Figure: a packet processor complex (4 x 4 grid of processors, P) connected to a Smart Memory complex made of eDRAM blocks, each paired with an SM engine, plus DDR3 DRAM behind a DRAM SM engine]
Local interconnect: provides local communication between smart memory blocks
Global interconnect: provides fair communication between processors and smart memory
HUAWEI TECHNOLOGIES CO., LTD. Page 13
Smart Memory High Level Architecture
[Figure: the packet processor complex and Smart Memory complex from the previous slide, with a FIB lookup example: a processor sends a FIB key to an SM engine, the engine reads eDRAM and DDR3 DRAM locally, and returns the result]
Split tables into eDRAM and DRAM
Computation occurs close to memory, reducing latency
Requires fewer memory transactions
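The FIB example on this slide can be sketched as an engine that receives one key, performs the eDRAM and DRAM reads itself, and returns one result. The two-stage table layout used here (an eDRAM index on the first few key bits, exact entries in DRAM) is an assumption chosen to keep the example short; the slides do not specify the real layout, and `SMEngine` and its methods are hypothetical names.

```python
class SMEngine:
    """Hypothetical SM engine holding a two-stage FIB next to the data."""

    def __init__(self, stride=4):
        self.stride = stride
        self.edram = {}       # first `stride` key bits -> [default hop, has DRAM entries]
        self.dram = {}        # full key -> next hop
        self.requests = 0     # transactions seen by the processor
        self.local_reads = 0  # reads that stay inside the smart memory

    def add_default(self, head_bits, next_hop):
        entry = self.edram.setdefault(head_bits, [None, False])
        entry[0] = next_hop

    def add_exact(self, key_bits, next_hop):
        entry = self.edram.setdefault(key_bits[:self.stride], [None, False])
        entry[1] = True
        self.dram[key_bits] = next_hop

    def lookup(self, key_bits):
        self.requests += 1      # one request in, one result out
        self.local_reads += 1   # eDRAM read
        entry = self.edram.get(key_bits[:self.stride])
        if entry is None:
            return None
        default_hop, has_dram_entries = entry
        if has_dram_entries:
            self.local_reads += 1  # DRAM read, still local to the engine
            return self.dram.get(key_bits, default_hop)
        return default_hop


engine = SMEngine()
engine.add_default("0110", "P1")
engine.add_exact("01101100", "P3")
results = [engine.lookup(k) for k in ("01101100", "01100000", "11110000")]
print(results, engine.requests, engine.local_reads)
```

The processor issues three requests while five reads happen inside the memory complex; in the traditional design all five would cross the global interconnect.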
HUAWEI TECHNOLOGIES CO., LTD. Page 14
I/O Technology Choice in Smart Memory
Based on MoSys data
- Bandwidth, latency and I/O bandwidth gap is growing
- On-chip bandwidth is much higher than memory I/O
Smart Memory uses serial I/O
- 4X the throughput of RLDRAM and QDR
- 3X fewer pins than DDR3 and DDR4
- 2.5X lower I/O power
Smart Memory reduces the chip I/O bandwidth significantly
How to further optimize it?
HUAWEI TECHNOLOGIES CO., LTD. Page 15
High Speed Line Card with Smart Memory
[Figure: Traditional line card: MAC and four PHYs feeding four NPs, each with TCAM, SRAM and several DDR3 devices; two TMs with SRAM; a FIC to the switch fabric; a CPU with DDR3 control plane memory. DDR3 memory: 10+ DIMM, 900+ pins. Line card with SM: MAC and four PHYs, one NP with four SM devices, a FIC to the switch fabric, and a CPU with DDR3]

         Traditional Line Card    Line Card with SM
Power    540+ W                   212- W
Cost     5600+ $                  2520 $
Area     472+ cm2                 148 cm2

Improvement: 2-3 times
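The "2-3 times" figure can be checked against the table with a few divisions. The snippet treats the "+" and "-" bounds as exact values, which is an assumption, so the ratios are approximate.

```python
traditional = {"power_w": 540, "cost_usd": 5600, "area_cm2": 472}
with_sm = {"power_w": 212, "cost_usd": 2520, "area_cm2": 148}

# Ratio of the traditional line card to the line card with SM.
ratios = {key: traditional[key] / with_sm[key] for key in traditional}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x")
# power_w: 2.55x
# cost_usd: 2.22x
# area_cm2: 3.19x
```

Power and cost improve by a little over 2x and area by a little over 3x, consistent with the range quoted on the slide.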
HUAWEI TECHNOLOGIES CO., LTD. Page 16
Concluding Remarks
Packet Processing Bottlenecks
› Data away from compute
› I/O and memory bandwidth
Smart Memory
› Keep compute close to data
› Keep locking close to data
› Provide inter-memory connect
Advantages
› Reduced chip I/O bandwidth
› High performance and low latency
› Feature rich, flexible and programmable
› Lower cost
› One chip for several functions
HUAWEI TECHNOLOGIES CO., LTD. Page 17
Thank You
www.huawei.com