Full Stack Software Hardware Co-design - GitHub Pages · Full Stack Software Hardware Co-design...
Transcript of Full Stack Software Hardware Co-design - GitHub Pages · Full Stack Software Hardware Co-design...
Full Stack Software Hardware Co-design Pygmy RISC-V AI SoC on Zebu
Cissy Yuan ( RIVAI ) Xing Wang (SYNOPSYS)
© 2019 Synopsys, Inc. 2
AIoT – Fast Growing, Fragmented Market
o AIo Smart Homeo Smart Cityo Securityo Autonomous Drivingo Socialo Educationo Manufacture
Customized architectures achieve much better compute efficiency
$212B 2019 $1.6T 2025
© 2019 Synopsys, Inc. 3
Fast Evolution of AI Calls for SW HW Co-design
SW HW Co-design can bring best cost-performance
Memory CentricCustomized ASICComputing close to memory3-10x faster
Computing Centric TPUCustomized ASIC for Tensor Flow15-30x faster than GPGPU, GPCPU
Save wasted computing/memorySW HW Co-design3-7x energy efficiency
© 2019 Synopsys, Inc. 4
SW HW Co-design on Pygmy, Heterogenous Multicore AI SoCSh
ared
Mem
ory
LPDDR DRAM Controller + Scatter-Gather DMA
IO (USB3, SDIO, SPIO, I2S, I2C, UART, GPIO)
AXI L
ite /
Test
I/FC
ache
TC
MC
ache
TC
M
MMUAI Engine
AI Engine
AI Engine
AI Engine
RV64GCV CoresC
ache
TC
MC
ache
TC
MPLL
JTAG DFT
Vector AI engines have customized RISC-V ISA64bit multicores boot full Linux with SMP
FastInterrupt
RV32
MBIST
Compile application to customized ISA.AI applications directly program vector engines.
Pygmy
OS (Linux)
NN Models
Tensor Flow Lite, Glow
Frontend Compiler
Backend Compiler(LLVM, GCC)
Manually Mapping
RISC-V extended ISA enabled full stack SW HW [email protected]
© 2019 Synopsys, Inc. 5
Co-design Requires Challenging Full Stack Simulation & Debug
Zebu is a powerful weapon to overcome these challenges
Enormous flow effort at functional, performance, power simulation & debug
vLong RuntimeBoot Linux kernel at block-level takes 24 hours, 20 times slower at chip level.Co-Simulation with precise CPU behavioral model is 1.5 times slower.Hours to days manually port ASIC RTL to FPGA.2-6 hours just to generate bit file.
vHard to debugDifficult to dump out at useful time period.
Hard to reproduce bugs.Visibility is very low.
© 2019 Synopsys, Inc. 6
Differentiated ZeBu Technologies drive emulation efficiencyZeBu Supports All High Value Emulation Use Cases
ZeBu Server 4Unified, Parallel and Incremental Compile
Hybrid / Virtual Host
Virtual Integrations
Transactors and Models Speed Adaptors
Streaming
Replay
Triggers
Performance
ZeBu Technologies
ZeBu Use Cases OS Boot and Driver/FW for
IP and systemsFull Chip RTL Verification
System Validation using ICE or Virtual
IntegrationGate-Level Emulation
Power Management Validation (UPF)
Software-driven Power Analysis
Performance Validation
Simulation Acceleration
© 2019 Synopsys, Inc. 7
AI Engine
ORV RISCV Core
Cac
he
DD
R c
ontro
ller
Perip
hera
ls
LPD
DR
4
AXI4
Lite
/ Te
st I/
F
AXI4
Lite
Pygmy
ZeBu Server
ZeBu Runtime Server
AI Engine
AI Engine
AI EngineCac
he
Cac
heC
ache
UARTXTOR
UART Terminal
USB 3.0 DevXTOR
SDIOXTOR
SPI Mst./Slv.XTOR
I2CXTOR
I2SXTOR
USB 3.0 Dev TB/TC
SDIO TB/TC
SPI TB/TC
I2C TB/TC
I2S TB/TC
RIS
C-V
Deb
ugge
r
ZeBu Smart Z-ICE
ZeBu Pygmy Emulation Environment
AXI4MasterXTOR
zRciMain
ProcessUser shared objects
User C program
© 2019 Synopsys, Inc. 8
Pygmy RISC-V AI Chip on ZeBu Demonstration
© 2019 Synopsys, Inc. 9
Linux Boot test (billions of cycles)
Pygmy Linux Boot Debug Case Study
Stimuli
Frame 0
HWState
DPI (trackers/monitors) streaming
Multiple Level critical signal streaming
Stimuli
Frame N
HWState
Stimuli
Frame N-1
HWState
Verdi DebugReplay & Waveform Capture
Debug Window
Stimuli
Frame 1
HWState
Stimuli
Frame N
HWState
Stimuli
Frame N-1
HWState
Test Failure
CPU Hang
Linux Boot test (billions of cycles)
Linux Boot test (billions of cycles)
CPU Hang
Linux Boot test (billions of cycles)
Triggered
Full Visibility Signals
© 2019 Synopsys, Inc. 10
ZeBu Runtime Server
10
Pygmy Performance Debug Case Study
10
With ZEMI3, performance issues can be debugged at higher level more efficiently
ZEMI-3 Hardware
Part
ZeBu
SystemVerilog DPICPU CoreL2B Module PMU Module
AI Engine
AI Engine
AI Engine
AI Engine
ORV64 RISCV Cores
Cac
heC
ache
DD
R
cont
rolle
rPe
riphe
rals
AXI4
Lite
Bus
Pygmy
Cac
heC
ache
ZEMI-3 Hardware
Part
Time
Num
of I
nstru
ctio
ns
© 2019 Synopsys, Inc. 11
Analyze Power using RTL: Hours instead of WeeksRTL average power for millions of cycles expands beyond legacy emulation flows for 10,000s of cycles
High Performance SW Waveform
Reconstruction
Essential Signals N x Grid Jobs
Dump Waveform for all power essential signals (Registers)
Waveform reconstruction for all signals and generate SAIF Average power for millions of cycles
RTL Power Estimation
Average Power
1 Hr x 150 Grid Jobs*20khz (18min)* RTL Power Estimation for Average Power
*1 Manhattan frame, 2B gates design on ZS4
RTL ZTDB SAIF
Flops + hierarchy boundaries
© 2019 Synopsys, Inc. 12
Ø SW HW co-design brings best cost-performance in fast growing AIoT market.
ØZeBu delivered highest efficiency for full stack Functionality Verification, Performance Debug and Power Analysis.
ØRISC-V extended ISA enabled full stack SW HW co-design flow.
Ø软硬件联合开发能获得最好的性价比, Zebu助力实现最快的全栈迭代,这一切得益于RISC-V
Conclusions
磨刀不误砍柴工
Thank YouWe are hiring J [email protected]