
Page 1: AMD_11th_Intl_SoC_Conf_UCI_Irvine

Platform Coherency and SoC Verification Challenges

PANKAJ SINGH, CHETHAN-RAJ M, PRAKASH RAGHAVENDRA, ANINDYASUNDAR NANDI, DIBYENDU DAS AND TONY TYE

THE 11TH INTERNATIONAL SYSTEM-ON-CHIP (SOC) CONFERENCE, EXHIBIT, AND WORKSHOPS, OCTOBER 2013, IRVINE, CALIFORNIA WWW.SOCCONFERENCE.COM

ACKNOWLEDGEMENTS:

PHIL ROGERS, AMD CORPORATE FELLOW; ROY JU AND BEN SANDER, SENIOR FELLOWS; NARENDRA KAMAT, PRAVEEN DONGARA AND LEE HOWES

Page 2:

2 | 11th Intl. SoC Conference | Oct 23-24, 2013

TODAY’S TOPICS

1. A New Parallel Computing Platform – Heterogeneous System Architecture (HSA): Opportunities, Benefits and Feature Roadmap

2. Kaveri Platform Coherency: Shared Memory, Platform Atomics

3. Kaveri Verification Approach

4. SoC Verification Challenges and Solutions

Page 3:

A New Parallel Computing Platform – Heterogeneous System Architecture (HSA)

Page 4:

APU: ACCELERATED PROCESSING UNIT

The APU is a great advance compared to previous platforms: it combines scalar processing on the CPU with parallel processing on the GPU, plus high-bandwidth access to memory.

Challenge: how do we make it even better going forward?

- Easier to program
- Easier to optimize
- Easier to load balance
- Higher performance
- Lower power

(Figure: CPU core pairs and GPU SIMD units on one die.)

Page 5:

THE HSA OPPORTUNITY ON MODERN APPLICATIONS

Problem: GPUs and other hardware blocks are hard to program, and not all workloads accelerate. The result is significant but niche value: a few hundred accelerated apps and tens of thousands of GPU coders. Historically, developers program CPUs: roughly 4M apps and 20+M CPU coders (*IDC).

Solution: HSA + libraries = productivity and performance with low power. Lowering the developer investment (effort, time, new skills) while raising the developer return (differentiation in performance, reduced power, features, time to market) yields good user experiences and a wide range of differentiated experiences: a few hundred thousand HSA apps and a few million HSA coders.

Page 6:

HSA AND ITS BENEFITS

HSA is an enabler of the APU's higher performance and power efficiency:

- Our industry-leading APUs speed up applications beyond graphics
- The CPU and GPU work cooperatively, directly in system memory
- It makes programming the APU as easy as C++
- It improves performance per watt

HSA IS A COMPUTING PLATFORM THAT DRIVES A NEW CLASS OF APPLICATIONS: app-accelerated software applications spanning serial and task-parallel, data-parallel, and graphics workloads.

Ref [1]

Page 7:

HSA EFFICIENCY IMPROVEMENT (AN EXAMPLE)

Energy/computation breakdown: MotionDSP 720p video clean-up. HSA improves power and performance by moving the application from the CPU to the GPU, removing data copies, and reducing launch time.

(Charts: measured performance, 0-25 fps, and measured power, 0-35 W, for CPU vs. CPU+GPU, broken down into CPU cores, NB+GPU, and DRAM.)

Simulated removal of memory copies gives 1.32x; combined, 1.11 * 2.88 * 1.32 = 4.22x better energy efficiency, with easier programming and removed copies.

Ref [1]

Page 8:

HETEROGENEOUS SYSTEM ARCHITECTURE FEATURE ROADMAP

Physical Integration: integrate CPU & GPU in silicon; unified memory controller; common manufacturing technology.

Optimized Platforms: GPU compute C++ support; user-mode scheduling; bi-directional power management between CPU and GPU.

Architectural Integration: unified address space for CPU and GPU; fully coherent memory between CPU & GPU; GPU uses pageable system memory via CPU pointers.

System Integration: GPU compute context switch; GPU graphics pre-emption; quality of service.

Page 9:

PLATFORM COHERENCY

Page 10:

KAVERI SOC – ENABLING SHARED MEMORY AND PLATFORM ATOMICS

Shared-memory accesses between the CPU and GPU happen via system memory.

- This corresponds to the notion of shared virtual memory (SVM) in OpenCL 2.0, available via the clSVMAlloc() call. With SVM, CPUs and GPUs can share an address space and pass pointers to the same memory location.
- The compiler supports SVM and atomics calls that work across the CPU-GPU boundary.
- System-memory accesses may take one of three paths:
  - If coherence with the CPU is not required: the GARLIC path.
  - If kernel-granularity coherence with the CPU is required: the ONION bus path.
  - If instruction-granularity coherence with the CPU is required (as for atomics): bypass L2 via the ONION+ bus.

Page 11:

CONCURRENT STACK PUSH USING ATOMIC COMPARE-AND-EXCHANGE (AN EXAMPLE)

Each CPU thread and each GPU work-item executes the following code concurrently. It is an example implementation of a concurrent stack's "push" operation: compare_exchange_strong is an atomic call that ensures only one CPU thread or GPU work-item at a time succeeds in updating the "head" pointer of the stack stored in list[0].

    do {
        head = list[0];   // redundant, because the atomic call updates head on failure
        list[i] = head;
    } while (!atomic_compare_exchange_strong(&list[0], &head, i));

Time instant          | Work-item i=2      | Work-item i=4
Before ACE            | head=3, list[2]=3  | head=3, list[4]=3
ACE                   | Wins!              | Loses, goes back and retries
After ACE completes   | list[0]=2          | list[0]=2

(Figure: i=2 and i=4 contest the ACE. List before: 3 (head) -> 5 -> -1. List after i=2 wins: 2 (head) -> 3 -> 5 -> -1.)

Page 12:

IMPLEMENTING PLATFORM ATOMICS FOR KAVERI

The compiler implements these atomics (per the OpenCL 2.0 standard) for Kaveri. OpenCL 2.0 and C11 atomics support various kinds of memory_scope and memory_order. The key issue in implementing them is to make sure that both the CPU and GPU see the shared memory in a coherent state.

Coherency is implemented using the ONION+ memory path and GPU ISA instructions that can selectively invalidate or bypass the L1/L2 caches from the GPU side and snoop to invalidate the CPU caches. This support is provided in the KV SoC.

For example, an atomic_load with acquire semantics generates the following code on the GPU side (in Kaveri, L2 is always bypassed for coherent access):

1. load with glc=1     // bypass the L1 cache
2. s_waitcnt 0         // wait for the load to complete
3. buffer_wbinv_vol    // invalidate L1 so that any following load reads from memory

Similarly, an atomic_store with release semantics generates:

1. s_waitcnt 0         // wait for any previous memory op to complete
2. store with glc=0    // L1 is a write-through cache, so the write reaches memory as L2 is bypassed
3. s_waitcnt 0         // prevent any following memory op from moving up

Page 13:

KAVERI SOC VERIFICATION APPROACH

Page 14:

TRADITIONAL VERIFICATION AND THE SOC CHALLENGE

CPU-based verification:

- Assembly-based input
- A memory image of x86 machine code is preloaded into the DRAM model
- The CPU fetches instructions from DRAM and executes them

GPU-based verification:

- Higher-level language (C/C++)
- A BFM model is used across a PCIe-based interface to inject data
- The GPU sends requests to DRAM over two paths: coherent and non-coherent

(Figure: CPU and graphics model attached to a NorthBridge, with a SouthBridge BFM and a DRAM model.)

SoC verification challenge: the HSA coherency environment adds a layer of complexity.

- The SoC GPU needs to be programmed, which requires a host.
- The SoC CPU can be used as the host; however, running the same host software stack results in huge simulation time.
- One approach is a mailbox, but it is inefficient due to the lack of CPU-GPU interaction and longer run times.
- GPU-focused verification is not suitable for CPU-GPU interaction (HSA).

Page 15:

SOC VERIFICATION METHODOLOGY: TEST FLOW

Running driver code on a simulated CPU is impossible due to simulation run-times. Intent Capture is a mechanism that allows existing discrete-GPU graphics tests to execute on the CPU in a heterogeneous APU simulation. Ref [4]

- The memory accesses and configuration writes from the test are extracted into C function calls. Intent Capture performs this activity and encapsulates the GPU test into a function called Replay.
- On the CPU side, one thread runs the Replay function while the other threads execute the CPU side of the test.
- The composite test (CPU test + generated FusionReplay function) is compiled using cxshell to generate a .sim memory image.

(Flow: OpenCL test -> sp3 shader -> GPU test -> Intent Capture -> Replay() on one thread [driver CPU] alongside the CPU test on other threads -> CX Shell -> .sim memory image -> APU RTL sim -> test output -> capture output.)

Page 16:

POWER MANAGEMENT: BAPM

In a multi-core design, the apps running on the CPU/GPU cores may consume less power than the budget allows. Power-efficient algorithms exploit this power headroom for performance: the GPU can borrow power credit from the CPU in GPU-centric scenarios, and vice versa.

Illustration with a CPU-centric scenario:

- A power monitor per compute unit calculates CPU power; a GPU power monitor calculates GPU power.
- Firmware converts power into temperature estimates.
- The temperature is compared to the limit and voltage/frequency is adjusted: if Temp > Limit, reduce the power allocation; if Temp < Limit, increase it.

SW/OS view vs. HW view: the OS sees software P-states (SWP0, SWP1, ...), which map onto hardware P-states (P0/Pbase, P1, ...) plus multiple boost P-states (Pb0 ... Pbx).

(Figure: APU power split into core power and rest-of-APU power, with die temperature, for all cores at Pbase across three cases: App1 with high CAC, all cores active; App2 with medium CAC, half the cores active; App3 with low CAC, all cores active.)

Ref [2]

Page 17:

BAPM VERIFICATION APPROACH AT SOC

(Diagram: CPU Core1/Core2 and GPU Core1/Core2, each with its own power monitor, feeding the NB CAC manager and SMU firmware, with multiple boost P-states in a CPU-centric configuration.)

- Developed high- and low-power-consuming CPU patterns based on micro-architecture and power analysis.
- Interleaved high- and low-power patterns in random stimulus.
- Used an irritator to manipulate the credits sent to the CAC manager at times, to hit corner cases like back-to-back boost/throttle.
- Modeled the firmware algorithm using a simple BFM.
- Added a CSR framework to drive reads/writes to the CAC manager.
- Ran a few sanity tests with real firmware loaded through the backdoor to check the end-to-end flow.
- Used irritators to model GPU power-credit reporting instead of running GPU applications; the GPU power monitor itself was verified at the GPU IP level.

Efficient coverage-driven random verification:

- The CPU boosted because the GPU gave away credits, and vice versa.
- Crosses of CPU/GPU events and their effect on BAPM.

Page 18:

SOC VERIFICATION CHALLENGES & SOLUTIONS

Page 19:

TEST STIMULUS REUSE AND PORTING TO SOC

Goal: improve quality, reduce development time. Tool, flow, and setup differences across IP and SoC make stimulus reuse difficult.

- Intent Capture and Playback methodology.
- Test setup updated at the IP level to support test runs with the SoC as a new target; configuration changes and test-stimulus defines allow IP tests to be reused.
- A functional model is used to simulate the IP (RTL) in an SoC scenario, for IP test development and easy porting to the SoC: a simple HSA SoC test with one read-write takes about 18 hours in RTL, versus under 1 hour on the heterogeneous C model.

(Figure: CPU C model/RTL, bus unit, MPMM, and GPU C model connected to cMemory and MEMIO memory models, with CPU-to-GPU accesses; the APU test output and GPU C model test output are checked by capturing the DV test output and replaying the captured output.)

IP2SoC script: exports the suite and test key, common test options, sim output directories, and reports; creates a job spec [ip2soc -merge] from Perf_options.yml, the memory config, run_job command-line options (GNB, XNB, UNB), NB/DCT programming options, and UNB perf options; then runs the regression.

Page 20:

HW-SW INTERACTION: MODELING AND ABSTRACTION

Goal: improve quality, reduce development time.

Complex and evolving logic is moving from hardware to firmware for better controllability. Challenges:

- Firmware algorithms are compute-intensive and often developed late in the design cycle.
- Loading and executing the software adds further time to verification runs.

Approach (used, for example, for Connected Standby verification):

- Model the relevant section of the software using a BFM with a proper interface to the hardware.
- Add sufficient controllability to stress different paths of the BFM model and find coverage.
- Use adaptive stimulus based on coverage of the BFM/state machine.

Page 21:

ADAPTIVE STIMULUS

Goal: improve quality, reduce development time.

Typically, power-management transitions kick off after active code execution stops. This produces deep corner cases tied to thread-level coordination in a multi-core design, and predicting the occurrences of these deeper phases and targeting them with code/stimulus is difficult.

- Define the power-management modes as state machines, with each state having granular phases that include thread-specific information.
- A dynamic irritator monitors these state transitions, inserts random or directed asynchronous events (different sorts of interrupts, probes, warm reset), and updates a scoreboard.
- Events are generated very close to the relevant points, which provides great controllability.
- The dynamic irritator adapts based on scoreboard statistics, eventually putting more weight on the less frequently covered <state> x <event> buckets.

Page 22:

CONSTRAINED-RANDOM STIMULUS AND RANDOMIZATION AT SOC

Goal: improve quality, reduce development time. A complex SoC requires randomization at different levels. Ref [3]

(Diagram: random initial states such as S11, S21, S23; register IP constraints and SoC constraints feed a randomization utility together with package-level info, build-fuse modes (LFBR, BfD, long_init/unfused test), and command-line options; a RandomConfig executable produces config values at time t=0, imports values after reset, and runs.)

Page 23:

OVERCOMING LIMITATIONS OF GATE-LEVEL SIMULATION

Goal: improve quality, reduce development time.

Challenges with netlist simulation: longer run-times and longer debug times.

Approach to minimize runtime: replace the compute-intensive RTL and associated verification components with a less intensive test-vector applicator that applies test vectors directly from an FSDB file. Flow: run the RTL simulation and get the FSDB; create the gatesim files (gatesim.v, forces.v); build with the netlist, the gatesim files, and a testbench that drives stimulus from the FSDB; run the netlist sims (with FSDB dump). This gives about a 10x runtime improvement over the traditional approach. Ref [5]

Approach to minimize debug effort: a Verdi NPI-based methodology to automate debug.

Page 24:

THANK YOU

Page 25:

REFERENCES

[1] A New Parallel Computing Platform – HSA. CTHPC 2013 keynote speech, Roy Ju, AMD Senior Fellow.

[2] AMD APUs: Dynamic Power Management Techniques. DAC 2013, Praveen Dongara, System Architect.

[3] Wilson Research Group – MGC, 2013.

[4] Kaveri DTP. Internal document.

[5] Innovative Approach to Overcome Limitations of Netlist Simulation. SNUG 2013, Prodip K, Pankaj S, Meera M, Narendran K.

Page 26:

GLOSSARY

GPU – Graphics Processing Unit

APU – Accelerated Processing Unit

OpenCL™ – Open Computing Language

TDP – Thermal Design Power; the average thermal dissipation power a design's cooling infrastructure must handle

AMD Turbo Core Technology – AMD boost mechanism

BAPM – Bidirectional Application Power Management

CAC – Capacitance AC switching; measures the switching activity of a cluster

Pstate – Processor performance state

GARLIC – Graphic Accelerated Reduced Latency Integrated Channel

ONION – On-chip Northbridge-to-I/O Non-coherent bus

FSDB – Fast Signal Database

Page 27:

BACKUP

Page 28:

DYNAMIC FINE-GRAINED POWER TRANSFERS

The dynamically calculated temperature of each core and of the GPU enables the operating point of each to be dynamically balanced in order to maximize performance within temperature limits. Low activity in one core enables it to act as a thermal sink for a more active core.

(Charts: per-core temperatures, roughly 75-100, for GPU-centric, balanced, and CPU-centric workloads.)

Ref [2]

Page 29:

Disclaimer

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD makes no representations or warranties with respect to the contents hereof and assumes no responsibility for any inaccuracies, errors or omissions that appear in this information.

AMD specifically disclaims any implied warranties of merchantability or fitness for any particular purpose. In no event will AMD be liable to any person for any direct, indirect, special or other consequential damages arising from the use of any information contained herein, even if AMD is expressly advised of the possibility of such damages.

Trademark Attribution

AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL and the OpenCL logo are trademarks of Apple, Inc. and used by permission of Khronos. Microsoft, Windows and DirectX are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.

©2011 Advanced Micro Devices, Inc. All rights reserved.