Can we Systematically Evaluate and Exploit Heterogeneous Accelerators?
A 10x10 Perspective
Andrew A. Chien Dept of Computer Science, University of Chicago
MCS, Argonne National Laboratory
SAAHPC Keynote July 11, 2012
Outline
• The Future is Heterogeneous
• Accelerators in Perspective
• Towards Systematic Accelerator Evaluation
• 10x10: Systematic Heterogeneous Architecture
• Summary and Futures
July 11, 2012 © Andrew A. Chien, 2012
The Future is Heterogeneous
Heterogeneous Supercomputers
• Tianhe-1 (NUDT, Nov 2010): 5 PF; 14,336 Xeons + 7,168 Teslas
• Titan (ORNL, fall 2012): 19K AMD CPUs + 960 GPUs; grow to 20 PF in fall? ~20 PF / 2 TF => 10K Nvidia GPUs (Kepler?)
• Blue Waters (NCSA, late 2012): 11.5 PF, 1.5 PB; 49K AMD CPUs (380K cores); 3K Nvidia GPUs (Kepler)
July 11, 2012 © Andrew A. Chien, 2012
7/11/12
3
Heterogeneity Dominates
• Heterogeneity is growing dramatically – on single chips, in systems, and in high-volume deployment
  o Sandy Bridge / Fusion / Denver, Tegra 2 / OMAP / A5
• Heterogeneity in architecture and implementation is the dominant computing platform of the future
[Figure: by 2015, heterogeneity spans market segments – smart phones, laptops and tablets, desktops, servers – ranging from homogeneous, to some heterogeneity (2x), to extreme heterogeneity (10x)]
Exploding Diversity
• Highly competitive markets, many without a dominant leader
• Smart phone market: highly fragmented – and diverse
• Laptop and tablet market: fragmenting?
[Figure: 2015 market diversity – smart phones: Marvell (RIM), TI OMAP, Apple, Qualcomm, Mediatek; laptops and tablets: Intel, AMD, Apple, Nvidia, Qualcomm, TI, …]
Accelerators in Perspective
Accelerators in HPC Systems
• Waning Moore's Law
  o Energy-limited, data-movement-limited [Borkar & Chien, CACM May 2011]
• Base vs. Base+Acc vs. Ratioed
  o Performance, coupling, capacity
• Cost: compute chips, total energy, compute/cu. ft., price
• DIFF: Delivery in whole compute chips
[Figure: accelerator configurations – integrated CPU/APU nodes, symmetric CPU+CPU, and a CPU with one or more discrete GPUs attached over PCI]
How Accelerators Deliver Performance
• Location: path-oriented accelerators (flow and offload), NIC offload, PIM
• Special resources: high-performance memory (e.g. GDDR, Convey)
• Customization: specialized logic and dense packaging/coupling
• Assumption: regular, replicated organization, scaled to thousands or millions
• Challenges: programming, specialization, integration
Programming
• Porting effort? (software architecture, algorithms)
• Performance attainable?
• The fast road?
• ...or the road to nowhere?
• ...How long is good enough?
Programming
• Critical: Avoiding Disaster!!
Specialization
• "Everyone uses only 10% of the functionality; the only trouble is it's a different 10% for everyone"
• (image, character, graphics, floating point) Embedded, smartphone, laptop, server processors
• (parallel) Multithreaded applications?
• (floating point) DOE scientific applications, mini-apps, and PETSc – 20-30% of operation count
• Architect: what to specialize, and how to expose it?
• Software architect: what abstractions (datatype, representation, movement)? What interfaces and partitions?
• => see 10x10
Integration
• Future programming is about orchestrating data movement, not operations. Data movement dominates energy consumption.
  – DARPA Exascale Software report (2009)
• Parallel computing – horizontal, internode
• Exascale computing – vertical and horizontal, internode and intranode (memory and accelerator hierarchy)
• Not just about computing, but about the relationship of computing to memory and to each other (and to the network)
Accelerator Integration
• Shared Nothing, Asymmetric
• Shared Memory, Symmetric
• Shared Memory, Internal Customization, Symmetric
[Figure: integration options – fused CPU/Acc parts, symmetric shared-memory CPUs, and a CPU with a discrete accelerator attached over PCI]
What’s a Programmer to do?
Towards Systematic Accelerator Evaluation
OmniBench: Systematic Evaluation of Accelerators
• Objective: neutral evaluation of performance
• Idea: benchmark with codes designed for an accelerator... but not "exactly this" accelerator
• Software complexity is the key driver
• "The community can tune for 1, but not for dozens"
Omnibench Experiment
• Challenging kernel programs
  o SGEMM, SpMV, BFS, FFT
• Standard interface – OpenCL 1.2
  o Simple model: CPU + ACC
• Range of heterogeneous platforms
  o Vary compute capabilities
  o Vary special resources
  o Vary integration and memory hierarchy approaches
  o Range of cost/power levels
Heterogeneous CPU-GPU Systems
• IvyBridge: Intel Core i5 3570K
  o 4 CPU cores (3.4 GHz), 64 graphics cores (1.15 GHz)
  o 6 MB LLC, shared by CPU and integrated graphics
  o Dual-channel DDR3 memory, 25.6 GB/s
  o 77 W, 22 nm, 216 mm², 1.4 billion transistors
• APU: AMD A8-3850
  o 4 CPU cores (2.9 GHz), 400 graphics cores (0.6 GHz)
  o 4 MB LLC, dedicated to the CPU cores
  o Dual-channel DDR3 memory, 29.9 GB/s
  o 100 W, 32 nm, 228 mm², 1.45 billion transistors
• Tesla: NVIDIA Tesla C2075
  o 448 cores = 14 multiprocessors × 32 (1.15 GHz)
  o 768 KB LLC, 64 KB shared memory/multiprocessor
  o Private GDDR5 memory, 144 GB/s
  o 225 W, 40 nm, 520 mm², 3 billion transistors
  o CPU-GPU link: PCI-Express x16 Gen 2, 8 GB/s
[Figure: platform organizations – integrated CPU/ACC parts sharing memory vs. a CPU with a discrete GPU over PCI]
One-sided Performance (SGEMM)
Simple Performance (SGEMM)
Self-normalized Accessible Performance (SGEMM)
Self-normalized (integration)
• Integration (data transfer and computation)
• Fraction of “Peak” Performance
• Fraction of Accessible Performance
Relative Accessible Performance (SGEMM)
• Same terms
• Normalized to performance of the fastest accelerator
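The metrics on these slides can be made concrete with a small sketch. All GFLOP/s numbers below are hypothetical placeholders, not the measured results from the charts:

```python
# Hedged sketch of the two normalizations; all numbers are hypothetical,
# not the measured results from the talk.

# platform -> (kernel-only GFLOP/s, end-to-end GFLOP/s incl. host<->device transfer)
measured = {
    "IvyBridge": (90.0, 80.0),
    "APU":       (120.0, 70.0),
    "Tesla":     (600.0, 300.0),
}

def self_normalized(platform):
    """Fraction of a platform's own accessible (kernel-only) rate
    that survives once data transfer is included."""
    kernel, end_to_end = measured[platform]
    return end_to_end / kernel

def relative_accessible(platform):
    """End-to-end rate normalized to the fastest accelerator's end-to-end rate."""
    fastest = max(e for _, e in measured.values())
    return measured[platform][1] / fastest

for p in measured:
    print(p, round(self_normalized(p), 2), round(relative_accessible(p), 2))
```

Self-normalization highlights how much each platform loses to integration overheads; relative normalization ranks platforms against the fastest one.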
Self-normalized Accessible Performance (BFS)
Highlighting Integration
Relative Accessible Performance (BFS)
Observations
• Data home location has a significant impact on accessible performance; it should be captured in benchmarking
• Organizational differences are not highlighted by compute-intensive applications, but are exposed clearly by memory-intensive ones
• Data movement management is problematic in current integrated CPU-GPU systems (sw/hw)
• Performance of discrete accelerators dominates on compute-intensive, but not on memory-intensive workloads (even w/o equal chip resources)
Related and Future Work
• Related work
  o Accelerator benchmarking: CUDA, OpenCL benchmarks (Rodinia, SHOC, ...)
  o Extensive performance modeling
• Future work
  o Additional platforms – configs, types, variations
  o Improved software (always): drivers, memory hierarchy, compilers
  o Higher-level software interfaces: beyond OpenCL? OpenACC? ...
  o Larger systems: larger nodes (e.g. 2 hybrid vs. CPU+GPU), parallel (multi-node) systems
10x10 Systematic Heterogeneity
Three Paths Forward
[Figure: performance vs. time – Dennard scaling gives way to energy-limited scaling; three paths forward: big cores (10's), small cores (100's), and heterogeneous (incl. hybrid) designs]
[Borkar and Chien, “Technology Scaling creates New Landscape for Computer Architecture”, Communications of the ACM, May 2011]
Path #3: Customize, Scale Up
• Customize: a collection of custom tools forms a core
  o Designed for a narrow domain; high performance and energy efficient
  o Tool domains complement each other to cover the general-purpose space
• Separation maximizes energy efficiency
  o Layout density, isolation
  o Exercise one/few tools at a time
• Challenges: programmability, code portability, design effort, architecture, Si utilization
Examples: SoC & Integrated GPU
• Apple’s A5
• Nvidia’s Tegra 2 and 3
• Intel Ivy Bridge
• AMD Fusion (Ontario)
What’s WRONG with these chips?
Not very programmable...
10x10 Framework Enables Systematic Exploitation of Heterogeneity
[Figure: trade-off across tight clusters, loose clusters, and no clusters (general-purpose) in micro-engine workload coverage, micro-engine energy efficiency, and overall workload energy efficiency]
10x10 = Federated Heterogeneity
Traditional Core vs. 10x10 Core
[Figure: a traditional core pairs an L1 instruction cache and L1 data cache with a single pipeline; a 10x10 core federates a basic RISC CPU with µengines #2-#6 (one <tbd>), each with its own I-cache, all sharing an L1 data cache]
Traditional Optimization: 90/10 Paradigm
• Workloads: analyze and derive common cases (90%)
• Invent architectural features and implementation optimizations with broad impact (90%)
• Improve performance by adding optimizations
• Aggregation and efficiency: the 8080's 80 instructions => Sandy Bridge's 500+ instructions
[Figure: the 90/10 flow – workloads → abstracted "common" cases ("ILP", "reuse locality", "linear access", "bit-field opns", "branch patterns") → optimizations ("pipelining", "superscalar", "caches & blocks", "multimedia", "branch pred")]
Amdahl's Law; Hennessy & Patterson, Computer Architecture: A Quantitative Approach
10x10 Optimization Paradigm
• Identify 10 application clusters, compute structures, and datatypes (focus on 10 distinct bins)
• Optimize the architecture and implementation of each separately (improve energy-delay product by 10-100x)
• Compose them together, sharing memory hierarchy and interconnect (preserve the benefits of customization)
[Figure: "workload" (7 idioms, 29 SPEC, 13 dwarfs, 11 NPB) → factor into 10 bins → compose a micro-engine per bin]
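A toy sketch of the composition idea – each bin runs on the micro-engine with the best energy-delay product (EDP). The engine names and energy/delay numbers below are invented for illustration; they are not measurements from the 10x10 project:

```python
# Hypothetical sketch: assign each workload bin to the micro-engine that
# minimizes energy-delay product (EDP). Engines and numbers are invented.

# (energy per op in pJ, delay per op in ns) for each (engine, bin) pair
profiles = {
    ("risc",    "integer"): (50.0, 1.0),
    ("risc",    "fp"):      (80.0, 1.5),
    ("fp-slim", "integer"): (70.0, 1.2),
    ("fp-slim", "fp"):      (20.0, 0.5),
}

def best_engine(bin_name):
    """Pick the micro-engine with the lowest EDP for this bin."""
    edp = {eng: e * d for (eng, b), (e, d) in profiles.items() if b == bin_name}
    return min(edp, key=edp.get)
```

With these toy numbers, integer bins stay on the basic RISC engine while floating-point bins move to the specialized engine – the "customize separately, compose, then route" pattern the slide describes.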
The Big Picture: 10x10
• Spectrum of energy efficiency vs. programmability
• ASICs, SoCs, GPUs, parallel CPUs, CPUs
• Where are we going? Overlap, dominate?
• The answer is deeply a hardware and software question
  o Waning days of Moore's Law, end of Moore's Law, success of near-threshold and device-scaling heroics
  o Software translation technology for cross-compilation, transformation and optimization, higher-level programming
[Figure: energy efficiency (ops/J, at a fixed process) vs. programmability/portability – ASIC > SoC/IP accelerators > GPU > core variants (core, +features, +GPU, +M-core); 10x10 targets the "ideal compute chip" corner of high efficiency and high programmability]
10x10 Workload Clustering
• Challenges
  o How to cluster? (try LOTS of things)
  o How many clusters for good coverage?
  o How much benefit?
• Broad set of workloads (34 total, varied)
  o UHPC challenge problems (5) – "Super"
    • Streaming sensor, chess, graph, MD, shock hydro
    • DARPA "Extreme Computing"
  o PARSEC (12) – "PC"
    • Data mining, vision, financial, genetic, physics, …
  o Embedded benchmarks (10) – "Mobile, IoT"
    • Image, crypto, coding, signal processing
  o BioBench (7) – "Data mining"
    • Alignment, assembly, phylogeny, database search
What Characteristics Matter?
• Where the time goes
  o Focus on important sections – >90% coverage from each application
• Architecturally significant features
  o Cluster based on like requirements
  o Supports sharing of customization
• Two feature vectors
  o Low resolution: (Datatype × Size)
  o High resolution: (Datatype × Operation × Size)
[Figure: clustering pipeline – codes and benchmarks → dynamic profiles (loops, operations, memory) → vector clustering → clustered regions]
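The vector-clustering stage can be sketched with a toy k-means over made-up feature vectors. Everything here – the region names, the feature buckets, and the choice of plain k-means – is an illustrative assumption, not the project's actual profiling data or clustering method:

```python
# Toy clustering of code regions by operation/datatype feature vectors.
# Regions, buckets, and the use of plain k-means are illustrative only.
import random

# Each region: fraction of dynamic operations in a few feature buckets
# [INT 4B, INT 8B, FLT 4B, FLT 8B, REG_XFER, BR]
regions = {
    "bfs_loop":    [0.6, 0.2, 0.0, 0.0, 0.1, 0.1],
    "sgemm_inner": [0.1, 0.0, 0.7, 0.1, 0.1, 0.0],
    "fft_twiddle": [0.1, 0.0, 0.1, 0.7, 0.1, 0.0],
    "spmv_gather": [0.5, 0.3, 0.0, 0.0, 0.1, 0.1],
}

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal k-means; returns a cluster label per input vector."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[min(range(k), key=lambda i: dist2(v, centers[i]))].append(v)
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return [min(range(k), key=lambda i: dist2(v, centers[i])) for v in vectors]

names = list(regions)
labels = kmeans([regions[n] for n in names], k=2)
```

With these toy vectors, the two integer-heavy regions land in the same cluster, separate from at least one floating-point region – the kind of architecture-correlated grouping the slides report.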
Low Res Clusters (8)
[Figure: hot code regions as bars, ordered by dynamic weight; legend is Operation × Datatype (BR 8B, INT 8B, REG_XFER 8B, INT 4B, FLT 8B, FLT 4B, …)]
• Width is "hot region" count; ordered by dynamic weight
• Legend is Operation × Datatype
• #1 integer; #2-5 FP single, double, vector
• Much simpler, cleaner clustering...
• 8 clusters (100%)
Low Res Clusters (32)
[Figure: hot code regions, same Operation × Datatype legend]
• #1 => #1, 2, 3, 5, 8 – integer split by size; #4, 6, 7, ... FP
• Very similar clusters (tight)...
• 8 clusters (70%), 16 clusters (85%), 32 clusters (100%)
Low Res Clusters (128)
[Figure: hot code regions, same Operation × Datatype legend]
• Essentially homogeneous clusters (very tight)...
• 8 clusters (50%), 16 clusters (70%), 32 clusters (80%), 128 clusters (100%)
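The "N clusters cover X%" figures quoted above can be computed by summing dynamic weight per cluster and taking the heaviest clusters first. A minimal sketch with made-up weights and labels:

```python
# Sketch of workload-coverage accounting: what fraction of dynamic
# execution weight do the top-k clusters capture? Data is made up.

# region -> (cluster id, dynamic weight as fraction of all execution)
regions = {
    "r1": (0, 0.40), "r2": (0, 0.10),
    "r3": (1, 0.25), "r4": (2, 0.15), "r5": (3, 0.10),
}

def coverage(regions, k):
    """Fraction of dynamic weight captured by the k heaviest clusters."""
    totals = {}
    for cluster, weight in regions.values():
        totals[cluster] = totals.get(cluster, 0.0) + weight
    top = sorted(totals.values(), reverse=True)[:k]
    return sum(top)
```

Here one cluster covers 50% of the weight, two cover 75%, and all four cover 100% – the same shape of coverage curve the slides report for 8/16/32/128 clusters.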
Cluster Insights
• Clusters draw from across the workloads – not in any obvious "application domain" structure
• Clusters reflect a wide variety of different computational needs that correlate with architectural structure
  o Call- and branch-intensive
  o 32-bit integer oriented
  o Bit/byte oriented
  o Mixed 32- and 64-bit oriented
  o Single-precision floating point
  o … and so on…
• Clusters separate cleanly (overpartition); ample opportunities for customization and energy efficiency
Benefit Models
• Specialization: fraction of instructions unneeded
• Interpolation from Nehalem to DP float energy
[Figure: energy-efficiency improvement (x), up to ~30x, vs. fraction of unimplemented opcodes (0-1), under square-root, linear, quadratic, and cubic benefit models]
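The curve shapes in the figure can be sketched as simple functions of the unimplemented-opcode fraction. A minimal sketch: the specific functional forms and the 30x ceiling below are illustrative assumptions, not the talk's actual models (which interpolate measured Nehalem energies):

```python
# Hedged sketch of the benefit-model idea: if a micro-engine leaves a
# fraction u of opcodes unimplemented, energy efficiency improves by some
# model f(u). The square-root/linear/quadratic/cubic forms below are
# illustrative guesses at the curve shapes, not the talk's actual models.

def improvement(u, model="linear", max_benefit=30.0):
    """Energy-efficiency improvement (x) for unimplemented-opcode fraction u,
    scaled so that dropping everything (u=1) yields max_benefit."""
    shape = {
        "square-root": u ** 0.5,
        "linear":      u,
        "quadratic":   u ** 2,
        "cubic":       u ** 3,
    }[model]
    return 1.0 + (max_benefit - 1.0) * shape
```

All four models agree at the endpoints (no benefit at u=0, maximum at u=1); they differ in how optimistic they are about partial specialization, with square-root the most optimistic and cubic the most conservative.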
Weighted Benefit vs. Benefit Model
[Figure: average benefit (x), roughly 0-30x, across the square-root, linear, quadratic, and cubic benefit models]
Weighted Benefit (linear) vs. # Cores
[Figure: weighted benefit (x) under the linear benefit model vs. number of cores (1-64), for high- and low-resolution clusterings (hr8-hr128, lr8-lr128); benefits roughly 0-12x]
Related Work
• System on Chip (SoC, SoP, 3D, etc.) [CE products]
  o Rapid system integration, not architectural design; less stable, discontinuous change, partitioned software
  o 6 months to α-silicon, 6 months to product in market
• CPU + reconfigurable hardware (FPGAs, LUTs, adders, etc.)
  o Convey HC-1, Sankaralingam11, Xilinx Zynq [EPP]
  o Advantages: flexibility
  o Disadvantages: lose customized implementation, speed, energy efficiency
• Hybrid computing (CPU-GPU, APU, GenX…)
  o Advantages: silicon today
  o Disadvantages: programmability, 1-way hetero, cost, energy efficiency
• Low-level programmability and heterogeneity
  o QSCores/GreenDroid: super instructions; Khan11 [morphing], Wu11 [VM-based, single ISA]
  o Advantage: doesn't require much software support
  o Disadvantages: local impact
• Build "chip generators", not chips
  o Horowitz; customization and closed systems
  o Custom for everything: programmability?
Summary and Perspective
• Heterogeneity is endemic, and a basic source of efficiency (local)
• We need integrated assessment – programmable, usable, delivered performance (demand it)
• Omnibench – uniform, systematic assessment of accessible performance for a diverse accelerator future
• 10x10 – federated heterogeneous architecture based on systematic optimization of energy efficiency (major benefit)
• Prepare wisely for a heterogeneous future!
More Information
• Papers
  o The Future of Microprocessors. Communications of the ACM 54(5): 67-77 (2011) [Borkar & Chien]
  o 10x10: A General-purpose Architectural Approach to Heterogeneity and Energy Efficiency. Procedia CS 4: 1987-1996 (2011) [Chien, Snavely, Gahagan]
  o 10x10: Taming Heterogeneity for General-purpose Architecture. 2nd Workshop on New Directions in Computer Architecture, June 2011, held at ISCA-38 [Chien]
  o Systematic Evaluation of Workload Clustering for Designing Heterogeneous, General-Purpose Architectures. UChicago Tech Report 2012 [A. Guha, A. Chien]
  o An Empirical Foundation for Heterogeneity: Clustering Applications by Computation and Memory Behavior. UChicago Tech Report 2011 [A. Guha, P. Cicotti, A. Snavely, A. Chien]
• Acknowledgements
  o Apala Guha, Yao Zhang, Mark Sinclair
  o Allan Snavely, Pietro Cicotti, Mark Gahagan
  o Insightful feedback from Shekhar Borkar (Intel) and Bill Harrod (DARPA)
  o Supported by the National Science Foundation under NSF Grant OCI-1057921 and DARPA MTO
Questions?