Direction-Optimizing Breadth-First Search

• Breadth-first search (BFS) is a common graph algorithm building block

• Hybrid approach demonstrates speedups (8x) over prior work

• Since last retreat
– Much better insight into when and why hybrid approach is faster
– Distributed (MPI) hybrid implementation (w/ Aydin Buluc)
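
A minimal Python sketch of the hybrid idea, assuming an undirected graph stored as adjacency lists. The switching rule (go bottom-up once the frontier holds more than n / alpha vertices) and the `alpha` parameter are simplifying assumptions for illustration, not the heuristic from the poster.

```python
def hybrid_bfs(adj, source, alpha=4):
    """Return a BFS parent array for an undirected graph given as adjacency lists.

    alpha is a hypothetical tuning knob: switch to bottom-up when the
    frontier holds more than n / alpha vertices.
    """
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = {source}
    while frontier:
        next_frontier = set()
        if len(frontier) > n / alpha:
            # Bottom-up: each unvisited vertex looks for any parent in the frontier
            # and stops at the first hit, skipping the rest of its edges.
            for v in range(n):
                if parent[v] == -1:
                    for u in adj[v]:
                        if u in frontier:
                            parent[v] = u
                            next_frontier.add(v)
                            break
        else:
            # Top-down: each frontier vertex claims its unvisited neighbors.
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        next_frontier.add(v)
        frontier = next_frontier
    return parent
```

Setting `alpha` very small forces pure top-down and very large forces pure bottom-up, which is a quick way to check that both directions visit vertices at the same depths.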

Scott Beamer

• GPUs are often underutilized.
• More CPU/GPU integration.
• Can offload GC to the GPU.
• Present a new algorithm for Mark & Sweep GC on a GPU.
• Need to use memory bandwidth, maximize parallelism.
• Achieve mark performance within 1.4-1.8x of CPU.
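
A rough Python sketch of why the mark phase can expose parallelism: it proceeds level by level, and every object in the current frontier could be handled by its own thread. This is a generic frontier-based mark and sweep under an assumed heap model (a dict from object id to referenced ids), not the algorithm presented on the poster.

```python
def mark(heap, roots):
    """Level-by-level mark; each object in `frontier` could map to one GPU thread."""
    marked = set(roots)
    frontier = list(roots)
    while frontier:
        next_frontier = []
        for obj in frontier:              # conceptually one thread per object
            for ref in heap[obj]:
                if ref not in marked:
                    marked.add(ref)
                    next_frontier.append(ref)
        frontier = next_frontier
    return marked


def sweep(heap, marked):
    """Return a new heap holding only the marked (reachable) objects."""
    return {obj: refs for obj, refs in heap.items() if obj in marked}
```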

Martin Maas

GPUs as an Opportunity for Off-loading Garbage Collection

Philip Reames

+ J. Morlan, K. Asanovic, A. Joseph, J. Kubiatowicz

SecureCell: HW for Private Cloud

• OS sees encrypted view of app’s memory

• HW re-encrypts data leaving cell

• “Capabilities” for keys
• Investigation of arch for crypto.
• Trust neither App nor OS not to leak data

Eric Love

[Figure labels: Payload Data, Metadata, Key1, mmap(), Encrypted Data, Virtual Memory Address Space, AES Key Registers, DRAM Contents, Decrypted Key, CPU Private Mem, Decrypted Object Contents, decryptRange]

With John Kubiatowicz, Krste Asanović

Untrusted Storage

• Four cell types:
– Event-triggered (NEW!)
– Time-triggered
– Non-multiplexed
– Best effort

• One resource multiplexer per hardware thread

• “communication-avoiding gang-scheduling”: use global time base to avoid coordination in the common case
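
A small Python sketch of the coordination-free idea: if every hardware thread's multiplexer reads the same global time and holds the same static schedule, they all reach the same decision without exchanging messages. The schedule layout, the cell names, and `active_cell` are hypothetical and only illustrate the time-triggered case.

```python
def active_cell(schedule, period, now):
    """Return the cell that owns the time slot containing `now`.

    schedule: list of (start_offset, end_offset, cell_name) within one period.
    """
    offset = now % period
    for start, end, cell in schedule:
        if start <= offset < end:
            return cell
    return None  # idle slot, e.g. available to best-effort cells


schedule = [(0, 30, "cell_A"), (30, 50, "cell_B")]
PERIOD = 100

# Four hardware threads consult the same global time and agree without messaging.
decisions = [active_cell(schedule, PERIOD, now=235) for _hw_thread in range(4)]
```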

Decentralized Cell Multiplexing in Tessellation

Juan Colmenares

Gage Eads

John Kubiatowicz

Hilfi Alkaff

QoS for Storage in Tessellation OS
• Current OSes often don’t provide good QoS for storage
In Tessellation:
• Disk accessed by multiple services and applications.
• Services need to access multiple devices.
• Hierarchical composition of SLAs
• Work in early stages

Nitesh Mor

& Israel Jacques, Juan Colmenares, John Kubiatowicz

Verification and Testing of MPI Programs

• Previous work on Shared Memory Concurrency
– Scalable data race detection for UPC
– Compiler instrumentation for hybrid programming models
• Active Testing for MPI
– Many legacy codes and frameworks for MPI
– Message races in MPI are a cause of non-determinism

Chang-Seo Park

Concurrit: A Domain Specific Language for Writing Concurrent Tests

• How to write unit tests for concurrent programs?
– Must fix: Inputs + Thread schedule
• DSL: Specify thread schedules for SUT
– Formal, concise, and convenient way
– Describe model checking algorithms

• Tool: Systematically explore all-and-only schedules of SUT specified in DSL
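
To make "systematically explore schedules" concrete, here is a minimal Python sketch that enumerates every interleaving of two threads and runs a non-atomic increment under each one. It is not Concurrit; the helper names and the two-step read/write model are assumptions made for illustration.

```python
def interleavings(threads):
    """Yield every schedule (list of thread ids) that runs each thread's steps in order.

    threads: dict mapping thread id -> number of remaining steps.
    """
    if all(n == 0 for n in threads.values()):
        yield []
        return
    for tid, n in threads.items():
        if n > 0:
            rest = dict(threads)
            rest[tid] = n - 1
            for tail in interleavings(rest):
                yield [tid] + tail


def run_increment(schedule):
    """Each thread does a non-atomic increment: step 0 reads, step 1 writes."""
    shared = 0
    local = {}
    step = {}
    for tid in schedule:
        if step.get(tid, 0) == 0:
            local[tid] = shared           # read the shared counter
        else:
            shared = local[tid] + 1       # write back the stale value plus one
        step[tid] = step.get(tid, 0) + 1
    return shared


schedules = list(interleavings({"TA": 2, "TB": 2}))
buggy = [s for s in schedules if run_increment(s) != 2]
```

Fixing both the inputs and the schedule is what makes each run a repeatable unit test; `buggy` collects the schedules that lose an update.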

Tayfun Elmas, Jacob Burnim, George Necula, Koushik Sen

[Figure labels: Test written in Concurrit DSL; Software Under Test (SUT); Insights/ideas about thread schedules]

// Example test in Concurrit DSL
TA, TB, TC = WAIT_FOR_DISTINCT_THREADS()
LOOP UNTIL TA, TB, TC COMPLETE {
    BACKTRACK HERE WITH T IN [TA, TB, TC]
    RUN T UNTIL READS OR WRITES
}

A SEJITS Specializer for BLB

• SEJITS
• BLB
• Applications
– Machine Learning/Statistical Analysis for Big Data
• Results
– 127,000 vectors; 96,000 features
– 109.8 seconds
– Statistically robust
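
For readers unfamiliar with BLB (Bag of Little Bootstraps), here is a plain Python sketch of the procedure: draw small subsamples, bootstrap each one up to the full data size using multinomial counts, and average the per-subsample quality estimates. The defaults (`gamma`, `subsamples`, `resamples`) and the weighted-mean estimator are illustrative assumptions; this is not the specializer's code.

```python
import random
import statistics


def blb_stderr(data, estimator, gamma=0.7, subsamples=5, resamples=20, seed=0):
    """Estimate the standard error of `estimator` with the Bag of Little Bootstraps.

    estimator(values, weights) must accept integer multiplicities as weights.
    """
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))               # size of each little subsample
    per_subsample = []
    for _ in range(subsamples):
        subset = rng.sample(data, b)          # sample without replacement
        estimates = []
        for _ in range(resamples):
            # Multinomial counts: draw n points from the b-point subsample.
            counts = [0] * b
            for i in rng.choices(range(b), k=n):
                counts[i] += 1
            estimates.append(estimator(subset, counts))
        per_subsample.append(statistics.stdev(estimates))
    return statistics.mean(per_subsample)     # average the per-subsample quality


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

Each resample touches only `b` distinct points even though it represents `n` draws, which is what keeps the per-subsample work small.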

Aakash Prasad, David Howard

High Performance Analysis of Filtered Semantic Graphs

• Graph algorithms applied to graphs with “attributes” on edges

• Integrate with Knowledge Discovery Toolbox

• Filtering graph to only include wanted edges = 80x slowdown

• SEJITS moves algorithms from being bound by interpreter to being memory bound

Shoaib Kamil

[Figure: Roofline Perf Bounds. Labels: C++, SEJITS, Memory Bandwidth, Pure Python]

A Distributed Algorithm for 3D Radar Imaging on eWallpaper

• eWallpaper: thousands of embedded Rocket processors. One antenna per processor.

• Application: Use the radio transceivers to image the room. For assisted living, gesture interaction, soundfield synthesis, etc.

• Algorithm: each radio transmits pulses. The responses are combined using SAR techniques to form an image.

• Challenges:
– Response distributed amongst 16 000 processors
– Restrictive mesh topology
– Limited local memory per processor

Patrick Li, Simon Scott

PyCASP: Python-Based Audio Content Analysis using Specialization
• Programmer tool for developing audio analysis applications
• Goals:
– Productivity, Efficiency, Portability and Scalability
• Three example applications:
– Speaker Diarization
– Music Recommendation (Pardora)
– Video event detection

• CPU, GPU and cluster platforms

Katya Gonina, Gerald Friedland

with: Eric Battenberg, Penporn Koanantakool, Michael Driscoll, Evangelos Georganas and Kurt Keutzer

clSpMV: A Cross-Platform OpenCL SpMV Autotuner on GPUs

• This is the first SpMV OpenCL work that covers a wide spectrum of sparse matrix formats (9 formats in total)

• We propose a new sparse matrix format, the Cocktail Format, that takes advantage of the strengths of many different sparse matrix formats

• We have developed the clSpMV autotuner to analyze the input sparse matrix at runtime, and recommend the best representation of the given sparse matrix

• clSpMV outperforms state-of-the-art GPU SpMV works
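
As background, a short Python sketch of SpMV in CSR, a widely used sparse format, plus a toy format chooser. The `pick_format` rule (prefer ELL when row lengths are nearly uniform) is a hypothetical heuristic for illustration and is not the clSpMV autotuner's decision procedure.

```python
def to_csr(dense):
    """Convert a dense list-of-lists matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A x for A stored in CSR."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def pick_format(row_ptr):
    """Hypothetical heuristic: ELL suits uniform row lengths, CSR suits irregular ones."""
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    return "ELL" if max(lengths) - min(lengths) <= 1 else "CSR"
```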

Bor-Yiing Su

Bor-Yiing Su, Kurt Keutzer, "clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs," in International Conference on Supercomputing (ICS 2012), Italy, June 2012.

Three Fingered Jack: Automatic code generation for CPU, GPU, and FPGA/ASIC

• Using automatic speech recognition as the driver of our system

• Want to productively explore a large design space and choose the right implementation fabric
– Cannot afford custom programming environments for each target platform
• From a single Python description, our system gives us the ability to target GPU, CPU, and FPGA/ASIC
– We use reordering transforms to target these diverse platforms

David Sheffield

Modeling of Strokes in the Cerebral Vasculature

• Problem
– Stroke treatment not quantitative
• Solution
– Patient specific risk-stratification
• Progress
– 3x increase in image processing speed
– Improved patient analysis algorithm
– Approval for 160 new patient scans

Tobias Harrison-Noonan

Patterns

Sparse Linear Algebra

Dense Linear Algebra

Structured Grid

Map-Reduce

Chisel Project Updates

• More results
– Sodor educational processors
  • 1stage, 2stage, 5stage, uCode, OOO
– Vector research processor
• More functionality
– Improved syntax
– Vec(n){ item }
– Reduce Verilog warnings
– Logic simplification
– Combinational loop detection
– Standard library

Huy Vo, Jonathan Bachrach

Productive Design of Extensible Cache Coherence Protocols
• Language for describing global coherence transactions as “flows” with declarative concurrency

• Language for describing local pipeline logic as transactional processes

• Adding both to Chisel as language extensions and composing them

• Integrating synthesis with formal verification tools

Henry Cook, Jonathan Bachrach

Chisel in the Classroom

• Undergraduate computer architecture class (CS 152)
– 5 labs
• Menagerie of processors
– 1-stage, 2-stage, 5-stage
– Micro-coded
– Out-of-Order (MIPS r10k style)
– Hwacha (vector-thread)
– Dual-core Rocket

Chris Celio

Energy-Aware Resource Allocation on the Sandy Bridge Processor
• Performance and energy characterization
• Parsec, DaCapo, SPEC and Par Lab applications
• Hierarchical clustering
– LLC, prefetcher, BW, thread scalability
– Representative selection
• Tradeoff threads and cache capacity
– Consolidate several applications without affecting execution time, while reducing overall energy consumption

M. Moreto, H. Cook, S. Bird, K. Dao, K. Asanovic, D. Patterson

pOSKI (Parallel Optimized Sparse Kernel Interfaces)
Update: Run-time autotuning with “History Data”
• pOSKI-v1.0.0 was released (4/27/12)

• pOSKI autotuning

Jong-Ho Byun, Richard Lin

• History Data using SQLite
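
One way such a history store could look, sketched with Python's standard `sqlite3` module. The table schema, the column names, the kernel labels, and the helper functions are all assumptions for illustration; they are not pOSKI's actual schema or API.

```python
import sqlite3


def open_history(path=":memory:"):
    """Create (if needed) a hypothetical table mapping matrix features to the best kernel."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS history ("
        "n_rows INTEGER, n_cols INTEGER, nnz INTEGER, kernel TEXT, gflops REAL, "
        "PRIMARY KEY (n_rows, n_cols, nnz))"
    )
    return db


def lookup(db, n_rows, n_cols, nnz):
    """Return (kernel, gflops) from a previous run, or None if this matrix is new."""
    cur = db.execute(
        "SELECT kernel, gflops FROM history WHERE n_rows=? AND n_cols=? AND nnz=?",
        (n_rows, n_cols, nnz),
    )
    return cur.fetchone()


def record(db, n_rows, n_cols, nnz, kernel, gflops):
    """Store a tuning result, keeping only the fastest kernel seen for these features."""
    old = lookup(db, n_rows, n_cols, nnz)
    if old is None or gflops > old[1]:
        db.execute(
            "INSERT OR REPLACE INTO history VALUES (?, ?, ?, ?, ?)",
            (n_rows, n_cols, nnz, kernel, gflops),
        )
        db.commit()
```

A tuner built this way would call `lookup` first and only fall back to a full search when it returns `None`.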

1,000-cycle memory latencies
Benchmarking CPU/GPU memories:
• 2,500 cycle average memory latency
• Occasional 1,000,000 cycle stalls
• 500 cycles to broadcast cache line across 10 cores
• 2,000 cycles to broadcast cache line across 4 sockets
• Naïve barrier beats OpenMP barrier 2x

Vasily Volkov

Reproducibility of FP computations
• Because of rounding errors, floating-point computations are not reproducible: results depend on the order of computation.
• Repeatability can be attained, BUT at what cost?
• Approach 1: reproducible reduction tree
– fix the order of computation
• Approach 2: higher precision
– reduce computing errors.
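
A small Python illustration of both approaches on a four-element sum. `naive_sum` and `tree_sum` are illustrative helpers; `math.fsum` stands in for "higher precision" here because it tracks exact partial sums, which is an assumption about what a higher-precision accumulator would achieve on this input.

```python
import math


def naive_sum(xs):
    """Plain left-to-right accumulation; the result depends on the order of xs."""
    total = 0.0
    for x in xs:
        total += x
    return total


def tree_sum(xs):
    """Sum with a fixed pairwise reduction tree: same tree shape, same bits."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return tree_sum(xs[:mid]) + tree_sum(xs[mid:])


values = [1e16, 1.0, -1e16, 1.0]
in_order = naive_sum(values)              # the first 1.0 is absorbed by 1e16
reordered = naive_sum(sorted(values))     # both 1.0 terms are absorbed
accurate = math.fsum(values)              # exact partial sums recover the true total
fixed_tree = tree_sum(values)             # repeatable, though not more accurate
```

The fixed tree buys repeatability without improving accuracy, while the higher-precision route improves accuracy at extra cost per addition.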

Hong Diep Nguyen

Communication-Avoiding Parallel Strassen: Implementation and Performance
• CAPS decreases both computation and communication compared to classical matrix multiplication

• CAPS outperforms all previous algorithms, classical and Strassen-based

• Our poster has lots of performance data and discusses the practicality of Strassen

• If allowed, CAPS would improve the LINPACK score of machines on TOP500 list
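
For reference, a sequential Python sketch of the Strassen recursion that CAPS parallelizes: seven recursive products per level instead of eight. It assumes square matrices whose size is a power of two and says nothing about the parallel data layout or communication schedule, which are the poster's contribution.

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def classical(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def split(M):
    """Return the four quadrants (M11, M12, M21, M22) of a square matrix."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=2):
    """Multiply square power-of-two matrices using 7 recursive products per level."""
    if len(A) <= cutoff:
        return classical(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```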

Grey Ballard, Jim Demmel, Olga Holtz, Ben Lipshitz, Oded Schwartz

[Figure labels: OURS, THEIRS, good]

Energy & Communication-Avoiding Algorithms

• Energy consumption of major interest in client/cloud

• We propose a simple model of algorithm energy consumption, and then extend comm. bounds to obtain bounds on energy

• Using *.5 algorithms, matrix multiplication and naïve n-body have a region of perfect energy strong scaling

• Reviewed during earlier talk, happy to discuss details or suggestions

Jim Demmel, Andrew Gearhart, Ben Lipshitz, Oded Schwartz

Communication Avoiding Optimizations for the Geometric Multigrid on GPUs

• Communication Avoiding (CA): minimize communication between different levels of memory to increase performance

• Multigrid Method: multilevel technique to accelerate iterative solver convergence

• Iterates towards convergence via a hierarchy of grid resolutions, operates in a V-cycle

Amik Singh

Optimizations:
1. More Ghost Zones
2. Wavefront Approach

Communication time decreased by a factor of two!
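
The ghost-zone trade can be sketched in Python on a 1D smoother split across two subdomains: with `depth` ghost cells per side, the subdomains exchange data once per `depth` sweeps and redundantly recompute the overlap in between. The 1D setup and function names are assumptions for illustration; the poster's work targets multigrid smoothers on GPUs.

```python
def sweep(u):
    """One Jacobi-style smoothing sweep; the two end values are held fixed."""
    return [u[0]] + [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)] + [u[-1]]


def global_sweeps(u, steps):
    """Reference: smooth the whole array with no domain decomposition."""
    for _ in range(steps):
        u = sweep(u)
    return u


def ghosted_sweeps(u, steps, depth):
    """Two subdomains exchange `depth` ghost cells once per `depth` sweeps.

    Returns the final array and the number of exchanges performed.
    """
    assert steps % depth == 0
    mid = len(u) // 2
    exchanges = 0
    for _ in range(steps // depth):
        # Exchange: each side copies `depth` cells from its neighbor.
        left = u[: mid + depth]
        right = u[mid - depth:]
        exchanges += 1
        for _ in range(depth):            # local sweeps, no communication
            left, right = sweep(left), sweep(right)
        # Only the owned cells are still valid; stale ghost cells are discarded.
        u = left[:mid] + right[depth:]
    return u, exchanges
```

Deeper ghost zones cut the number of exchanges at the price of redundant work in the overlap, and the owned cells still match the undecomposed result.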

Numerical Aspects of Communication-Avoiding Iterative Solvers

Erin Carson

Krylov Subspace Methods (KSMs) are popular iterative solvers for large, sparse, linear systems

Problem: Communication-bound operations in each iteration limit performance!

Communication-Avoiding KSMs (CA-KSMs)

One communication step per s iterations

Reduces communication cost by a factor of s!

Problem: Round-off error in finite precision CA-KSM algorithms grows with s!

Trade off between performance and accuracy

Can we regain accuracy without sacrificing performance benefits?
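
One well-known source of the accuracy loss can be seen in a few lines of Python: the monomial basis [v, Av, ..., A^s v] that a simple s-step method builds becomes nearly linearly dependent as s grows, because successive vectors line up with the dominant eigenvector. The 3x3 matrix and helper names below are made up for illustration, and other bases behave differently.

```python
import math


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def monomial_basis(A, v, s):
    """Return [v, Av, ..., A^s v], the vectors an s-step method generates in one block."""
    basis = [v]
    for _ in range(s):
        basis.append(matvec(A, basis[-1]))
    return basis


def cosine(u, w):
    dot = sum(a * b for a, b in zip(u, w))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in w)))


A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
basis = monomial_basis(A, [1.0, 0.0, 0.0], s=8)

# Alignment between consecutive basis vectors; values near 1 mean near-dependence.
alignment = [abs(cosine(basis[k], basis[k + 1])) for k in range(8)]
```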

• Communication avoidance (CA) and overlapping seem to be orthogonal
• We combine CA algorithms and overlapping techniques for:
– Matrix Multiplication
– Triangular solve
– Cholesky factorization

• Performance modeling to guide our optimizations

Communication Avoiding and Overlapping for Numerical Linear Algebra

Jorge González-Domínguez, Evangelos Georganas

2.5D Cholesky factorization with overlapping

1. 2.5D algorithm replicates the matrix on different layers

2. Each layer updates its trailing matrix using a different subpanel

3. The trailing matrix is reduced among different layers

In the second phase we can overlap broadcasts with computations

Edgar Solomonik, Yili Zheng, Katherine Yelick