MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4:...

26
Graphicionado A High-Performance and Energy-Efficient Graph Analytics Accelerator Tae Jun Ham Lisa Wu Narayanan Sundaram Nadathur Satish Margaret Martonosi MICRO 2016 Slide: http://tiny.cc/graphicionado

Transcript of MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4:...

Page 1: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

GraphicionadoA High-Performance and Energy-Efficient

Graph Analytics Accelerator

Tae Jun Ham Lisa Wu

Narayanan SundaramNadathur Satish

Margaret Martonosi

MICRO 2016

Slide: http://tiny.cc/graphicionado

Page 2: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Graph Analytics– A key workload of modern-day computing

§ Application Domains

§ Algorithms• Path Analysis (e.g., BFS, SSSP)

• Centrality Analysis (e.g., PageRank, Betweenness Centrality)

• Recommendation systems (e.g., Collaborative Filtering)

• Community Analysis

o Social Networks, Bioinformatics, Healthcare, Cybersecurity, Simulation, etc.

Page 3: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Graph Analytics is difficult to code§ Graph Analytics has …

o Irregular data access patterns

o Extremely low computation-to-communication ratio

o Poor spatial/temporal data locality

o Difficult-to-extract parallelism

o Memory-bound performance

§ Solution: Software Graph Processing Frameworkso Programmers express graph algorithms using a programming abstraction (i.e., vertex program)

o Frameworks orchestrate data movements for given computation using efficient backend SW

( )

Pregel

GraphMat

PGX

Graph Engine

Extended Giraph

nvGraph

System G

GraphLab Create

Galois (SOSP 13)

Ligra (PPoPP 13)

X-Stream (SOSP 13)

GraphX (OSDI 14)

Gunrock (PPoPP 16)...

TurboGraph (KDD 13)Pegasus (ICDM 09)Industry Academia

Page 4: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

§ Application-oblivious memory system in conventional processorso Caches blindly cache all data at a coarse granularity including ...

o Low temporal locality data

o Low spatial locality data (i.e., only 4B out of 64B is utilized)

Our measurement shows that SW frameworks consume 2x-27x more memory bandwidth than the optimal communication case

Software Frameworks have limitations

§ Expensive data movement in conventional processors

o Graph analytics has extremely low computation-to-communication ratio

Most algorithms: < 6% of instructions are used for actual computation;94% used for data movements; results in huge energy consumption.

= Insufficient tailoring to HW

Page 5: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

§ Retains Programmability

o Specialized-while-flexible HW pipeline

§ Overcomes Limitationo Application-specific memory system

o Application-specific pipeline design

Graphicionado Approach

Graphicionado: A high-performance, energy-efficient graph analytics HW accelerator which overcomes the limitations of software frameworks while retaining the programmability benefit of SW frameworks

Page 6: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

§ Why Graphicionado?

§ Vertex Programming Abstraction

§ Constructing Graphicionado Pipeline from Abstraction

§ Optimizing Graphicionado

§ Optimizing Data Movement

§ Scaling On-Chip Memory Usage

§ Parallelization

§ Performance/Energy Evaluation

§ Conclusions

Presentation Outline

Page 7: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Graph Data Structure

§ Graph consists of vertices & edges o Each Vertex has an ID and and a property (or state)

o Each Edge is defined as a tuple (SRC ID, DST ID, edge property)

§ Graph analytics: an iterative computation to calculate the desired vertex property for each vertex

1 2 3

4 5 6

Page 8: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

For each active Vertex UFor each outgoing edge E(U,V)

Res = Process_Edge (Eweight, Uprop)Vtemp = Reduce (Vtemp, Res)

EndEndFor each Vertex V

Vprop = Apply(Vtemp, Vprop)End

Vertex Programming Abstraction

§ Vertex programming abstraction is commonly used in SW frameworks(e.g., Intel GraphMat, Google Pregel, etc.)

Processing Phase

ApplyPhase

123456789

Process_EdgeReduceApply

§ Graphicionado uses the same abstraction to retain the programmability of SW frameworks

o Programmers express graph analytics algorithm with three custom functions

Page 9: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

For each active Vertex UFor each outgoing edge E(U,V)

Res = Eweight + UpropVtemp = min(Vtemp, Res)

EndEndFor each Vertex V

Vprop = min(Vtemp, Vprop)End

Vertex Programming Abstraction

Processing Phase

ApplyPhase

123456789

Process_EdgeReduce

Apply

Process_Edge: Reduce: Apply:

Single Source Shortest Path (SSSP) Example

Vprop: minimum distance to V on last iteration Vtemp: minimum distance to V on this iteration

Eweight + Uprop min(Vtemp, Res) min(Vtemp, Vprop)

Page 10: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

For each active Vertex UFor each outgoing edge E(U,V)

Res = Eweight + UpropVtemp = min(Vtemp, Res)

EndEndFor each Vertex V

Vprop = min(Vtemp, Vprop)End

Vertex Programming Abstraction

Processing Phase

ApplyPhase

123456789

Process_Edge: Reduce: Apply:

Single Source Shortest Path (SSSP) Example

Vprop: minimum distance to V on last iteration Vtemp: minimum distance to V on this iteration

Calculate a distance to V through this edgeUpdate the current minimum distance to V

Compare the current minimum distancewith the value from the last iteration;

If it changes its value, add V to the active vertex set for the next iteration

Eweight + Uprop min(Vtemp, Res) min(Vtemp, Vprop)

Page 11: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

From abstraction to HWfor (i=0; i<ActiveVertex.size(); i++) {

vid = ActiveVertexID[i]; vprop = ActiveVertexProp[i]; eptr = PtrToEdgeList[vid]; for (e = Edges[eptr]; e.src == vid; e = Edges[++eptr]) {res = Process_Edge(e.weight, vprop);temp = TempVertexProp[e.dst];temp = Reduce(temp, res); TempVertexProp[e.dst] = temp;

}} // Apply Phase updates ActiveVertexProp with TempVertexProp

123456789

1011

Updating the destination vertexwith the programmer supplied computation

Read the active (SRC) vertex

Traversing edges of the given active vertex

Process EdgeRead ActiveSRC Property

Read Edge Pointer

Read Temp. DST Property

Reduce Write Temp. DST Property

Line 5Line 4 Line 6

Line 7 Line 8 Line 9

Read Edges for given SRC

Line 2-3Processing Phase Block Diagram

Page 12: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Read Temp. DST Property

From abstraction to HWfor (i=0; i<ActiveVertex.size(); i++) {

vid = ActiveVertexID[i]; vprop = ActiveVertexProp[i]; eptr = PtrToEdgeList[vid]; for (e = Edges[eptr]; e.src == vid; e = Edges[++eptr]) {res = Process_Edge(e.weight, vprop);temp = TempVertexProp[e.dst];temp = Reduce(temp, res); TempVertexProp[e.dst] = temp;

}} // Apply Phase updates ActiveVertexProp with TempVertexProp

123456789

1011

Process EdgeRead ActiveSRC Property

Read Edge Pointer

Reduce Write Temp. DST Property

Processing Phase Block DiagramLine 5Line 4Line 2-3 Line 6

Line 7 Line 8 Line 9

Non-sequential (vertex)Memory Access

Sequential (vertex) Memory Access

Non-sequential (edge ptr) Memory Access

Non-sequential and thenSequential (edge) Memory Access

Custom Computation

Sequential Memory Access Unit

Read ActiveSRC Property

Read Temp. DST Property

Reduce Write Temp. DST Property

Process Edge

Non-SequentialMemory Access Unit

Custom Computation Unit

Read Edges for given SRC

Read Edge Pointer

Page 13: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

From abstraction to HWfor (i=0; i<ActiveVertex.size(); i++) {

vid = ActiveVertexID[i]; vprop = ActiveVertexProp[i]; eptr = PtrToEdgeList[vid]; for (e = Edges[eptr]; e.src == vid; e = Edges[++eptr]) {res = Process_Edge(e.weight, vprop);temp = TempVertexProp[e.dst];temp = Reduce(temp, res); TempVertexProp[e.dst] = temp;

}} // Apply Phase updates ActiveVertexProp with TempVertexProp

123456789

1011

Processing Phase Pipeline

Sequential Memory Access Unit

Custom Computation Unit Control

Atomic Update

Line 5Line 4Line 2-3 Line 6

Line 7 Line 8 Line 9Atomic Update

Control Unit

Read-Modify-Write Updateneeds to be atomic for the

same destination vertex (e.dst)

Read ActiveSRC Property

Read Edge Pointer

Read Temp. DST Property

Reduce Write Temp. DST Property

Process EdgeRead Edges for given SRC

Non-SequentialMemory Access Unit

Page 14: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Optimizing Data MovementProcessing Phase Pipeline

Sequential Memory Access Unit

Custom Computation Unit

S5: ControlAtomic Update

Atomic UpdateControl Unit

S1: Read ActiveSRC Property

S7: Reduce

S3: Read Edges for given SRC

S2: Read Edge Pointer

S6: Read Temp. DST Property

S8: Write Temp. DST Property

S4: Process Edge

Use fine-grained on-chip SPM for data used for these modules

1. Non-sequential access; low-spatial-locality

o Only use 4-16B out of 64B off-chip memory access granularity;leads to off-chip memory BW waste

2. High-temporal locality

o Data will be used again for edges with the same destination vertex

Non-SequentialMemory Access Unit

PtrtoEdgeList[vid]TempVertexProp[e.dst]

TempVertexProp[e.dst]

Page 15: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Optimizing Data MovementOn-chip SPMAccess Unit

Processing Phase Pipeline

Sequential Memory Access Unit

Custom Computation Unit

S5: ControlAtomic Update

Atomic UpdateControl Unit

S1: Read ActiveSRC Property

S2: Read Edge Pointer

S6: Read Temp. DST Property

S7: Reduce S8: Write Temp. DST Property

S4: Process Edge

S3: Read Edges for given SRC

Use fine-grained on-chip SPM for data used in these modules

1. Non-sequential access; low-spatial-locality

o Only use 4-16B out of 64B off-chip memory access granularity;leads to off-chip memory BW waste

2. High-temporal locality

o Data will be used again for edges with the same destination vertex

PtrtoEdgeList[vid]TempVertexProp[e.dst]

Non-SequentialMemory Access Unit

TempVertexProp[e.dst]

Page 16: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Optimizing Data Movement

1. Sequentially accessed; high-spatial locality

o Has predictable access pattern

o All 64B loaded from the off-chip memory will be used

2. Low-temporal locality

o Each active vertex or an edge is processed only once. This data will not be accessed again for the current iteration.

On-chip SPMAccess Unit

Processing Phase Pipeline

Sequential Memory Access Unit

Custom Computation Unit

S5: ControlAtomic Update

Atomic UpdateControl Unit

S1: Read ActiveSRC Property

S2: Read Edge Pointer

S6: Read Temp. DST Property

S7: Reduce S8: Write Temp. DST Property

S4: Process Edge

S3: Read Edges for given SRC

Use prefetcher; Do not use on-chip SPM for data used for these modules

ActiveVertexProp[i]Edges[eptr++]

Non-SequentialMemory Access Unit

Page 17: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Scaling On-chip Memory Usage§ Graphicionado utilizes an on-chip storage to avoid wasting off-chip BW

o Often requires 4-16B on-chip storage per vertex

Not enough on-chip storage for 10M+ vertices

1 2 3

4 5 6

Original Graph

§ Solution: Partition a graph before the execution; Then, process each subgraph at a timeo Assign edges to different slices based on their destination vertices

Page 18: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Scaling On-chip Memory Usage

1 2 3

4 5 6

1 2 3

4 5 6

Only half of the vertices to be stored in on-chip storage at a time

*Paper has additional scaling techniques

1 2 3

4 5 6

Original GraphSlice 1 (Edges with DST vertices 1,2,3)

Slice 2 (Edges with DST vertices 4,5,6)

§ Graphicionado utilizes an on-chip storage to avoid wasting off-chip BWo Often requires 4-16B on-chip storage per vertex

Not enough on-chip storage for 10M+ vertices§ Solution: Partition a graph before the execution; Then, process each subgraph

at a timeo Assign edges to different slices based on their destination vertices

Page 19: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

S1: Read ActiveSRC Property

S2: Read Edge ID Table

S3: Read Edges for given SRC

S1: Read ActiveSRC Property

S2: Read Edge ID Table

S3: Read Edges for given SRC

NxNSwitchN x N Switch

SRC Vertex 0

§ Naive Parallelization Approacho Replicate Pipelineo Each pipeline stream processes a portion of the active (SRC) vertices

Different streams will try to read/write the same memory address (TempVertexProp[]) at the same time on S6 & S8

(e.g., a modulo of the SRCID determines which stream for processing)

SRC Vertex 2

Two modules trying to read the same address(TempVertexProp[3])

Design complexity (SPM port count) and performance degradation

S1: Read ActiveSRC Property

S2: Read Edge ID Table

S3: Read Edges for given SRC

S1: Read ActiveSRC Property

S2: Read Edge ID Table

S3: Read Edges for given SRC

Parallelizing Graphicionado Pipeline

S4: Process Edge

S5: ControlAtomic Update

S6: Read Temp. DST Property

S7: ReduceS8: Write Temp.

DST Property

S4: Process Edge

S5: ControlAtomic Update

S6: Read Temp. DST Property

S7: ReduceS8: Write Temp.

DST Property

S4: Process Edge

S5: ControlAtomic Update

S6: Read Temp. DST Property

S7: ReduceS8: Write Temp.

DST Property

S4: Process Edge

S5: ControlAtomic Update

S6: Read Temp. DST Property

S7: ReduceS8: Write Temp.

DST Property

Edge (0,3)

Edge (2,3)

Page 20: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

S1: Read ActiveSRC Property

S2: Read Edge ID Table

S3: Read Edges for given SRC

S1: Read ActiveSRC Property

S2: Read Edge ID Table

S3: Read Edges for given SRC

DST Access Portion

SRC Access Portion

Parallelizing Graphicionado Pipeline

NxNSwitchN x N Switch

SRC Vertex 0

Now each unit access an exclusive portion of the scratchpad memory

SRC Vertex 2

Each unit accesses an exclusive memory region

N x N Switch

§ Graphicionado Approacho Each pipeline stream is separated to two pieces:

SRC access / DST access portion o N x N switch routes data to the correct DST stream (e.g., a modulo of

the DSTID)

S1: Read ActiveSRC Property

S2: Read Edge ID Table

S3: Read Edges for given SRC

S1: Read ActiveSRC Property

S2: Read Edge ID Table

S3: Read Edges for given SRC

S4: Process Edge

S5: ControlAtomic Update

S6: Read Temp. DST Property

S7: ReduceS8: Write Temp.

DST Property

S4: Process Edge

S5: ControlAtomic Update

S6: Read Temp. DST Property

S7: ReduceS8: Write Temp.

DST Property

S4: Process Edge

S5: ControlAtomic Update

S6: Read Temp. DST Property

S7: ReduceS8: Write Temp.

DST Property

S4: Process Edge

S5: ControlAtomic Update

S6: Read Temp. DST Property

S7: ReduceS8: Write Temp.

DST Property

Edge (0,3)

Edge (2,3)

Reduction in design complexity and improvement in throughput

Page 21: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Presentation Outline§ Why Graphicionado?

§ Vertex Programming Abstraction

§ Constructing Graphicionado Pipeline from Abstraction

§ Optimizing Graphicionado

§ Optimizing Data Movement

§ Scaling On-Chip Memory Usage

§ Parallelization

§ Performance/Energy Evaluation

§ Conclusions

Page 22: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

FR FB Wiki LJ RMAT TW AVG FR FB Wiki LJ RMAT TW AVG FR FB Wiki LJ RMAT TW AVG NF SB AVG

Spee

dup

over

Sof

twar

e Fr

amew

ork

0

1

2

3

4

5

6

7

PageRank (PR) Breadth-First Search (BFS) Single Source Shortest Path (SSSP)

Collaborative Filtering (CF)

Software FrameworkGraphicionado

Overall Graphicionado Performance

§ Graphicionado achieves ~3x speedup over state-of-the-art software framework (Intel GraphMat) across different workloadso Software framework was run on a 32-core Haswell Xeon servero Both system were provisioned the same theoretical peak memory BW (78GB/s)

PageRank (PR) Breadth-First Search (BFS) Single-Source Shortest Path (SSSP)

Collaborative Filtering (CF)

GraphicionadoSoftware Framework

Page 23: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Graphicionado Energy Consumption

§ Graphicionado consumes < 2% of the energy (50x-100x) compared to the software processing frameworko Most of the energy (70%+) was spent for the eDRAM static energyo 20-25x power saving & 2-5x speedup

FR FB Wiki LJ RMAT TW FR FB Wiki LJ RMAT TW FR FB Wiki LJ RMAT TW NF SB

Nor

mal

ized

Ene

rgy

Con

sum

ptio

n

0

0.005

0.01

0.015

0.02

0.025

0.03

0.035

0.04

PageRank (PR) Breadth-First Search (BFS) Single Source Shortest Path (SSSP) Collaborative Filtering (CF)

eDRAM Static EnergyeDRAM Dynamic EnergyGraphicionado Pipeline Static EnergyGraphicionado Pipeline Dynamic EnergyCustom Computation Energy

PageRank (PR) Breadth-First Search (BFS) Single-Source Shortest Path (SSSP)

Collaborative Filtering (CF)

eDRAM Static EnergyeDRAM Dynamic EnergyGraphicionado Pipeline Static EnergyGraphicionado Pipeline Dynamic EnergyCustom Computation Energy

Page 24: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Effect of Parallelization & Optimization

FR FB Wiki LJ RMAT AVG FR FB Wiki LJ RMAT AVG FR FB Wiki LJ RMAT AVG NF SB AVG

Nor

mal

ized

Per

form

ance

0

5

10

15

20

25

30

PageRank (PR) Breadth-First Search (BFS) Single Source Shortest Path (SSSP) Collaborative Filtering (CF)

BaselineCFG 1 = Stream Parallelization(x2)CFG 2 = Stream Parallelization(x4)CFG 3 = Stream Parallelization(x8)

CFG 4 = CFG 3 + PrefetchingCFG 5 = CFG 4 + Pipeline Balancing*

§ The parallelization and optimization of Graphicionado achieves up to 27x speedup over the baseline single-stream pipeline

PageRank (PR) Breadth-First Search (BFS) Single Source Shortest Path (SSSP) Collaborative Filtering (CF)

Nor

mal

ized

Per

form

ance

0

5

10

15

20

25

30

PageRank (PR) Breadth-First Search (BFS) Single-Source Shortest Path (SSSP)

Collaborative Filtering (CF)

BaselineCFG1 = Stream Parallelization (2x)CFG2 = Stream Parallelization (4x)CFG3 = Stream Parallelization (8x)

CFG4 = CFG3 + PrefetchingCFG5 = CFG4 + Pipeline Balancing*

o Without effective parallelizations and optimizations, simply using HW does not necessarily bring any benefit

Page 25: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

Conclusions

§ Software graph processing frameworks has limitationso Inefficient on-chip memory (cache) usageo Expensive data movement

§ Graphicionado: Specialized HW accelerator for graph analyticso Efficient use of on-chip memoryo Specialized pipeline tailored for graph analytics data movemento Effective parallelism

~3x speedup and 50x+ energy saving over state-of-the-art software framework

Page 26: MICRO 2016 Graphicionado - Princeton University Computer ...tae/graphicionado_slide.pdf · S4: Process Edge S5: Control Atomic Update S6: Read Temp. DST Property S7: Reduce S8: Write

GraphicionadoA High-Performance and Energy-Efficient

Graph Analytics Accelerator

Tae Jun Ham Lisa Wu

Narayanan SundaramNadathur Satish

Margaret Martonosi

MICRO 2016