Direction-Optimizing Breadth-First Search
• Breadth-first search (BFS) is a common graph alg. building block
• Hybrid approach demonstrates speedups (8x) over prior work
• Since last retreat
– Much better insight into when and why hybrid approach is faster
– Distributed (MPI) hybrid implementation (w/ Aydin Buluc)
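The hybrid idea can be sketched in a few lines of Python, assuming an undirected graph stored as adjacency lists. The name `hybrid_bfs`, the thresholds `alpha` and `beta`, and the simplified switching rule are illustrative assumptions, not the exact heuristic from the poster.

```python
def hybrid_bfs(adj, source, alpha=14, beta=24):
    """Return {vertex: BFS depth}. alpha/beta are assumed tuning knobs."""
    n = len(adj)
    depth = {source: 0}
    frontier = [source]
    # Edges incident to vertices that have not been visited yet.
    unvisited_edges = sum(len(nbrs) for nbrs in adj.values()) - len(adj[source])
    bottom_up = False
    level = 0
    while frontier:
        frontier_edges = sum(len(adj[v]) for v in frontier)
        if not bottom_up and frontier_edges > unvisited_edges / alpha:
            bottom_up = True    # large frontier: let unvisited vertices search for a parent
        elif bottom_up and len(frontier) < n / beta:
            bottom_up = False   # small frontier: conventional top-down step
        level += 1
        next_frontier = []
        if bottom_up:
            in_frontier = set(frontier)
            for v in adj:
                if v not in depth:
                    for u in adj[v]:
                        if u in in_frontier:
                            depth[v] = level
                            next_frontier.append(v)
                            break   # first parent found, skip the remaining edges
        else:
            for u in frontier:
                for v in adj[u]:
                    if v not in depth:
                        depth[v] = level
                        next_frontier.append(v)
        unvisited_edges -= sum(len(adj[v]) for v in next_frontier)
        frontier = next_frontier
    return depth
```

The early `break` in the bottom-up branch is where the savings would come from: once a vertex finds any neighbor in the frontier, its other edges are never examined.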
Scott Beamer
• GPUs are often underutilized.
• More CPU/GPU integration.
• Can offload GC to the GPU.
• Present a new algorithm for Mark & Sweep GC on a GPU.
• Need to use memory bandwidth, maximize parallelism.
• Achieve mark performance within 1.4-1.8x of CPU.
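For readers unfamiliar with mark & sweep, here is a minimal sequential Python sketch. It is not the GPU algorithm from the poster. The heap layout (a dict from object id to referenced ids) and the names `mark` and `sweep` are assumptions for illustration. The mark phase is written level by level, since a frontier of independent objects is the kind of work that could in principle be spread across many GPU threads.

```python
def mark(heap, roots):
    """Return the set of objects reachable from the roots."""
    marked = set(roots)
    frontier = list(roots)
    while frontier:
        next_frontier = []
        for obj in frontier:            # each frontier object is independent work
            for ref in heap[obj]:
                if ref not in marked:
                    marked.add(ref)
                    next_frontier.append(ref)
        frontier = next_frontier
    return marked


def sweep(heap, marked):
    """Keep only marked objects; everything else is garbage."""
    return {obj: refs for obj, refs in heap.items() if obj in marked}


heap = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["e"], "e": ["d"]}
live = sweep(heap, mark(heap, ["a"]))
print(sorted(live))  # the unreachable cycle d <-> e is collected
```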
Martin Maas
GPUs as an Opportunity for Off-loading Garbage Collection
Philip Reames
+ J. Morlan, K. Asanovic, A. Joseph, J. Kubiatowicz
SecureCell: HW for Private Cloud
• OS sees encrypted view of app’s memory
• HW re-encrypts data leaving cell
• “Capabilities” for keys
• Investigation of arch for crypto.
• Trust neither App nor OS not to leak data
Eric Love
[Figure labels: Payload Data, Metadata, Key 1, mmap(), Encrypted Data, Virtual Memory Address Space, AES Key Registers, DRAM Contents, Decrypted Key, CPU Private Mem, Decrypted Object Contents, decryptRange]
With John Kubiatowicz, Krste Asanović
[Figure label: Untrusted Storage]
• Four cell types:
– Event-triggered (NEW!)
– Time-triggered
– Non-multiplexed
– Best effort
• One resource multiplexer per hardware thread
• “communication-avoiding gang-scheduling”: use global time base to avoid coordination in the common case
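One way to picture coordination-free gang scheduling is sketched below in Python. It assumes every hardware thread's multiplexer holds the same slot table and reads the same global clock. The table layout and the name `active_cell` are hypothetical and do not reflect Tessellation's actual interfaces.

```python
def active_cell(schedule, period, now):
    """schedule: list of (start_offset, length, cell_name) within one period."""
    t = now % period
    for start, length, cell in schedule:
        if start <= t < start + length:
            return cell
    return None  # unreserved slot, e.g. available to best-effort cells


# Every multiplexer evaluates the same pure function of global time, so all
# hardware threads of a cell switch together without exchanging messages.
schedule = [(0, 3, "cell_A"), (3, 5, "cell_B")]
decisions = [active_cell(schedule, 10, 14) for _hw_thread in range(4)]
print(decisions)
```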
Decentralized Cell Multiplexing in Tessellation
Juan Colmenares
Gage Eads
John Kubiatowicz
Hilfi Alkaff
QoS for Storage in Tessellation OS
• Current OSes often don’t provide good QoS for storage
In Tessellation:
• Disk accessed by multiple services and applications.
• Services need to access multiple devices.
• Hierarchical composition of SLAs
• Work in early stages
Nitesh Mor
& Israel Jacques, Juan Colmenares, John Kubiatowicz
Verification and Testing of MPI Programs
• Previous work on Shared Memory Concurrency
– Scalable data race detection for UPC
– Compiler instrumentation for hybrid programming models
• Active Testing for MPI
– Much legacy code and many frameworks for MPI
– Message races in MPI are a cause of non-determinism
Chang-Seo Park
Concurrit: A Domain Specific Language for Writing Concurrent Tests
• How to write unit tests for concurrent programs?
– Must fix: Inputs + Thread schedule
• DSL: Specify thread schedules for SUT
– Formal, concise, and convenient way
– Describe model checking algorithms
• Tool: Systematically explore all-and-only schedules of SUT specified in DSL
Tayfun Elmas, Jacob Burnim, George Necula, Koushik Sen
[Figure: Software Under Test (SUT) + insights/ideas about thread schedules, expressed as a test written in Concurrit DSL]
// Example test in Concurrit DSL
TA, TB, TC = WAIT_FOR_DISTINCT_THREADS()
LOOP UNTIL TA, TB, TC COMPLETE {
  BACKTRACK HERE WITH T IN [TA, TB, TC]
  RUN T UNTIL READS OR WRITES
}
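The test above asks the tool to try every choice of thread at every memory access. A rough Python analogue of that exploration is sketched below. It is not Concurrit itself: threads are modeled only by their remaining step counts, and `all_schedules` and `run_schedule` are hypothetical helpers.

```python
def all_schedules(steps_left):
    """Yield every interleaving; steps_left maps thread name -> remaining steps."""
    runnable = [t for t, n in steps_left.items() if n > 0]
    if not runnable:
        yield []
        return
    for t in runnable:                  # like: BACKTRACK HERE WITH T IN [...]
        remaining = dict(steps_left)
        remaining[t] -= 1               # like: RUN T UNTIL READS OR WRITES
        for rest in all_schedules(remaining):
            yield [t] + rest


def run_schedule(schedule):
    """Toy SUT: each thread reads shared x, then writes back its copy + 1."""
    x, local, steps_done = 0, {}, {}
    for t in schedule:
        if steps_done.get(t, 0) == 0:
            local[t] = x                # step 1: read
        else:
            x = local[t] + 1            # step 2: write
        steps_done[t] = steps_done.get(t, 0) + 1
    return x


outcomes = {run_schedule(s) for s in all_schedules({"TA": 2, "TB": 2})}
print(sorted(outcomes))  # prints [1, 2]: some interleavings lose an update
```

Fixing both the inputs and the schedule, as the slide requires, is what makes each of these runs a repeatable unit test.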
A SEJITS Specializer for BLB
• SEJITS
• BLB
• Applications
– Machine Learning/Statistical Analysis for Big Data
• Results
– 127,000 vectors; 96,000 features
– 109.8 seconds
– Statistically robust
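Assuming BLB here refers to the Bag of Little Bootstraps, the procedure can be sketched in plain Python as follows. This is unspecialized code, not the SEJITS specializer. The parameter names `s`, `r` and `gamma` and their defaults are assumptions, and each resample is materialized in full rather than represented by multinomial weights as an efficient implementation would.

```python
import random
import statistics


def mean(xs):
    return sum(xs) / len(xs)


def blb_stderr(data, estimator, s=5, r=20, gamma=0.7, rng=None):
    """Estimate the standard error of `estimator` on `data`."""
    rng = rng or random.Random(0)
    n = len(data)
    b = max(2, int(n ** gamma))                      # size of each little subsample
    per_subsample = []
    for _ in range(s):
        subsample = rng.sample(data, b)              # b distinct points, no replacement
        estimates = []
        for _ in range(r):
            resample = rng.choices(subsample, k=n)   # size-n bootstrap from b points
            estimates.append(estimator(resample))
        per_subsample.append(statistics.stdev(estimates))
    return mean(per_subsample)                       # average quality over subsamples


gen = random.Random(1)
data = [gen.gauss(0.0, 1.0) for _ in range(1000)]
print(blb_stderr(data, mean))  # should be near 1/sqrt(1000) for unit-variance data
```

The `s` outer iterations and `r` inner iterations are independent of one another, which is what makes the method a natural target for parallel specialization.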
Aakash Prasad, David Howard
High Performance Analysis of Filtered Semantic Graphs
• Graph algorithms applied to graphs with “attributes” on edges
• Integrate with Knowledge Discovery Toolbox
• Filtering graph to only include wanted edges = 80x slowdown
• SEJITS moves algorithms from being bound by interpreter to being memory bound
Shoaib Kamil
[Figure: Roofline Perf Bounds; labels: C++, SEJITS, Memory Bandwidth, Pure Python]
A Distributed Algorithm for 3D Radar Imaging on eWallpaper
• eWallpaper: thousands of embedded Rocket processors. One antenna per processor.
• Application: Use the radio transceivers to image the room. For assisted living, gesture interaction, soundfield synthesis, etc.
• Algorithm: each radio transmits pulses. The responses are combined using SAR techniques to form an image.
• Challenges:
– Response distributed amongst 16,000 processors
– Restrictive mesh topology
– Limited local memory per processor
Patrick Li, Simon Scott
PyCASP: Python-Based Audio Content Analysis using Specialization
• Programmer tool for developing audio analysis applications
• Goals:
– Productivity, Efficiency, Portability and Scalability
• Three example applications:
– Speaker Diarization
– Music Recommendation (Pardora)
– Video event detection
• CPU, GPU and cluster platforms
Katya Gonina, Gerald Friedland
with: Eric Battenberg, Penporn Koanantakool, Michael Driscoll, Evangelos Georganas and Kurt Keutzer
clSpMV: A Cross-Platform OpenCL SpMV Autotuner on GPUs
• This is the first SpMV OpenCL work that covers a wide spectrum of sparse matrix formats (9 formats in total)
• We propose a new sparse matrix format, the Cocktail Format, that takes advantage of the strengths of many different sparse matrix formats
• We have developed the clSpMV autotuner to analyze the input sparse matrix at runtime, and recommend the best representation of the given sparse matrix
• clSpMV outperforms state-of-the-art GPU SpMV works
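To illustrate the idea of mixing formats, the Python sketch below stores densely filled diagonals in a DIA-like structure and everything else in CSR, then computes y = Ax as the sum of both parts. It covers only two formats, and the `min_fill` threshold is a made-up selection rule. The real Cocktail Format and the clSpMV autotuner's analysis are not reproduced here.

```python
def to_cocktail(dense, min_fill=0.75):
    """Split a square matrix into (DIA part, CSR part)."""
    n = len(dense)
    counts = {}                                   # nonzeros per diagonal offset
    for i in range(n):
        for j in range(n):
            if dense[i][j] != 0:
                counts[j - i] = counts.get(j - i, 0) + 1
    # Diagonals that are mostly full go to DIA (padded to length n).
    dia = {off: [dense[i][i + off] if 0 <= i + off < n else 0 for i in range(n)]
           for off, cnt in counts.items() if cnt >= min_fill * (n - abs(off))}
    values, col_idx, row_ptr = [], [], [0]        # leftovers go to CSR
    for i in range(n):
        for j in range(n):
            if dense[i][j] != 0 and (j - i) not in dia:
                values.append(dense[i][j])
                col_idx.append(j)
        row_ptr.append(len(values))
    return dia, (values, col_idx, row_ptr)


def cocktail_spmv(dia, csr, x):
    n = len(x)
    y = [0] * n
    for off, diag in dia.items():                 # DIA contribution
        for i in range(max(0, -off), min(n, n - off)):
            y[i] += diag[i] * x[i + off]
    values, col_idx, row_ptr = csr                # CSR contribution
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y
```

For a tridiagonal matrix with a few stray entries, the three diagonals land in the DIA part and only the stray entries pay the indexing overhead of CSR.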
Bor-Yiing Su
Bor-Yiing Su, Kurt Keutzer, "clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs," in International Conference on Supercomputing (ICS 2012), Italy, June 2012.
Three Fingered Jack: Automatic code generation for CPU, GPU, and FPGA/ASIC
• Using automatic speech recognition as the driver of our system
• Want to productively explore a large design space and choose the right implementation fabric
– Cannot afford custom programming environments for each target platform
• From a single Python description, our system gives us the ability to target GPU, CPU, and FPGA/ASIC
– We use reordering transforms to target these diverse platforms
David Sheffield
Modeling of Strokes in the Cerebral Vasculature
• Problem
– Stroke treatment not quantitative
• Solution
– Patient specific risk-stratification
• Progress
– 3x increase in image processing speed
– Improved patient analysis algorithm
– Approval for 160 new patient scans
Tobias Harrison-Noonan
Patterns
Sparse Linear Algebra
Dense Linear Algebra
Structured Grid
Map-Reduce
Chisel Project Updates
• More results
– Sodor educational processors
  • 1-stage, 2-stage, 5-stage, uCode, OOO
– Vector research processor
• More functionality
– Improved syntax
– Vec(n){ item }
– Reduce Verilog warnings
– Logic simplification
– Combinational loop detection
– Standard library
Huy Vo, Jonathan Bachrach
Productive Design of Extensible Cache Coherence Protocols
• Language for describing global coherence transactions as “flows” with declarative concurrency
• Language for describing local pipeline logic as transactional processes
• Adding both to Chisel as language extensions and composing them
• Integrating synthesis with formal verification tools
Henry Cook, Jonathan Bachrach
Chisel in the Classroom
• Undergraduate computer architecture class (CS 152)
– 5 labs
• Menagerie of processors
– 1-stage, 2-stage, 5-stage
– Micro-coded
– Out-of-Order (MIPS r10k style)
– Hwacha (vector-thread)
– Dual-core Rocket
Chris Celio
Energy-Aware Resource Allocation on the Sandy Bridge Processor
• Performance and energy characterization
• Parsec, DaCapo, SPEC and Par Lab applications
• Hierarchical clustering
– LLC, prefetcher, BW, thread scalability
– Representative selection
• Tradeoff threads and cache capacity
– Consolidate several applications without affecting execution time, while reducing overall energy consumption
M. Moreto, H. Cook, S. Bird, K. Dao, K. Asanovic, D. Patterson
pOSKI (Parallel Optimized Sparse Kernel Interfaces) Update: Run-time autotuning with “History Data”
• pOSKI-v1.0.0 was released (4/27/12)
• pOSKI autotuning
Jong-Ho Byun, Richard Lin
• History Data using SQLite
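A history store of this kind could look roughly like the Python sketch below, which uses the standard `sqlite3` module. The table schema, the column names, and the helpers `open_history`, `record` and `best_config` are assumptions made for illustration. pOSKI's actual schema and C interface are not shown on the slide.

```python
import sqlite3


def open_history(path=":memory:"):
    """Open (or create) a tuning-history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS history ("
        "matrix_key TEXT, kernel TEXT, config TEXT, gflops REAL)"
    )
    return conn


def record(conn, matrix_key, kernel, config, gflops):
    """Store the measured performance of one tuning configuration."""
    conn.execute(
        "INSERT INTO history VALUES (?, ?, ?, ?)",
        (matrix_key, kernel, config, gflops),
    )
    conn.commit()


def best_config(conn, matrix_key, kernel):
    """Return (config, gflops) of the fastest known run, or None if unseen."""
    return conn.execute(
        "SELECT config, gflops FROM history "
        "WHERE matrix_key = ? AND kernel = ? "
        "ORDER BY gflops DESC LIMIT 1",
        (matrix_key, kernel),
    ).fetchone()
```

A `None` result would signal that the matrix has no history yet, so the tuner would fall back to a full search and record what it finds.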
1,000-cycle memory latencies
Benchmarking CPU/GPU memories:
• 2,500 cycle average memory latency
• Occasional 1,000,000 cycle stalls
• 500 cycles to broadcast cache line across 10 cores
• 2,000 cycles to broadcast cache line across 4 sockets
• Naïve barrier beats OpenMP barrier 2x
Vasily Volkov
Reproducibility of FP computations
• Because of rounding errors, floating-point computations are not reproducible: the result depends on the order of computation.
• Repeatability can be attained, BUT at what cost?
• Approach 1: reproducible reduction tree
– fix the order of computation
• Approach 2: higher precision
– reduce computing errors
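Both approaches can be demonstrated on four numbers in Python. In this sketch `chunked_sum` imitates a reduction over `p` processors, `tree_sum` fixes the reduction shape, and `math.fsum` (an exactly rounded sum) stands in for the higher-precision approach. The helper names are illustrative and none of this is the poster's implementation.

```python
import math


def naive_sum(values):
    total = 0.0
    for v in values:        # plain left-to-right double-precision addition
        total += v
    return total


def chunked_sum(values, p):
    """Each of p 'processors' sums a contiguous chunk, then partials are added."""
    size = math.ceil(len(values) / p)
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


def tree_sum(values):
    """Pairwise reduction whose shape depends only on len(values)."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return tree_sum(values[:mid]) + tree_sum(values[mid:])


values = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(values, 1))   # 1.0
print(chunked_sum(values, 2))   # 0.0: same data, different processor count
print(tree_sum(values))         # 0.0 on every run, since the order is fixed
print(math.fsum(values))        # 2.0, the exact answer
```

The fixed tree is repeatable but not more accurate, while the exactly rounded sum is both. The extra work it does is one instance of the cost the slide asks about.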
Hong Diep Nguyen
Communication-Avoiding Parallel Strassen: Implementation and Performance
• CAPS decreases both computation and communication compared to classical matrix multiplication
• CAPS outperforms all previous algorithms, classical and Strassen-based
• Our poster has lots of performance data and discusses the practicality of Strassen
• If allowed, CAPS would improve the LINPACK score of machines on TOP500 list
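As background for the computation savings, here is a sequential Python sketch of Strassen's recursion, which uses 7 half-size products instead of 8. It assumes square matrices whose size is a power of two and a hypothetical `cutoff` for switching to the classical algorithm. The parallel scheduling and communication-avoiding data layout that define CAPS are not modeled.

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def classical(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def split(M):
    """Return the four quadrants M11, M12, M21, M22."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=2):
    if len(A) <= cutoff:
        return classical(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]       # glue quadrants back together
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```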
Grey Ballard, Jim Demmel, Olga Holtz, Ben Lipshitz, Oded Schwartz
[Figure labels: OURS, THEIRS, good]
Energy & Communication-Avoiding Algorithms
• Energy consumption of major interest in client/cloud
• We propose a simple model of algorithm energy consumption, and then extend comm. bounds to obtain bounds on energy
• Using *.5 algorithms, matrix multiplication and naïve n-body have region of perfect energy strong scaling
• Reviewed during earlier talk, happy to discuss details or suggestions
Jim Demmel, Andrew Gearhart, Ben Lipshitz, Oded Schwartz
Communication Avoiding Optimizations for the Geometric Multigrid on GPUs
• Communication Avoiding (CA): minimize communication between different levels of memory to increase performance
• Multigrid Method: multilevel technique to accelerate iterative solver convergence
• Iterates towards convergence via a hierarchy of grid resolutions, operates in a V-cycle
Amik Singh
Optimizations:
1. More Ghost Zones
2. Wavefront Approach
Communication time decreased by a factor of two!
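The ghost-zone idea can be shown on a toy problem. The Python sketch below uses a 1D periodic 3-point averaging stencil rather than multigrid on a GPU. Each subdomain fetches `g` ghost cells per side in one exchange and then runs `g` sweeps with no further communication. The function names and the stencil are assumptions chosen for brevity.

```python
def sweep_periodic(u):
    """Reference: one sweep over the whole periodic array."""
    n = len(u)
    return [(u[i - 1] + u[i] + u[(i + 1) % n]) / 3 for i in range(n)]


def shrinking_sweep(block):
    """One sweep on a local block; the valid region loses one cell per side."""
    return [(block[i - 1] + block[i] + block[i + 1]) / 3
            for i in range(1, len(block) - 1)]


def ghost_zone_sweeps(u, steps, parts, g):
    """Run `steps` sweeps on `parts` subdomains, exchanging every g sweeps."""
    n = len(u)
    size = n // parts              # assumes parts divides n
    exchanges = 0
    done = 0
    while done < steps:
        k = min(g, steps - done)
        new_u = []
        for p in range(parts):
            lo, hi = p * size, (p + 1) * size
            # One exchange fills k ghost cells on each side of the subdomain.
            block = [u[i % n] for i in range(lo - k, hi + k)]
            for _ in range(k):     # k local sweeps without communication
                block = shrinking_sweep(block)
            new_u.extend(block)
        u = new_u
        exchanges += 1
        done += k
    return u, exchanges
```

Deeper ghost zones trade redundant computation at subdomain edges for fewer exchanges: with `g = 2` the example needs half as many exchanges as with `g = 1` and produces the same values.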
Numerical Aspects of Communication-Avoiding Iterative Solvers
Erin Carson
Krylov Subspace Methods (KSMs) are popular iterative solvers for large, sparse, linear systems
Problem: Communication-bound operations in each iteration limit performance!
Communication-Avoiding KSMs (CA-KSMs)
One communication step per s iterations
Reduces communication cost by a factor of s!
Problem: Round-off error in finite precision CA-KSM algorithms grows with s!
Trade off between performance and accuracy
Can we regain accuracy without sacrificing performance benefits?
• Communication avoidance (CA) and overlapping seem to be orthogonal
• We combine CA algorithms and overlapping techniques for:
– Matrix Multiplication
– Triangular solve
– Cholesky factorization
• Performance modeling to guide our optimizations
Communication Avoiding and Overlapping for Numerical Linear Algebra
Jorge González-Domínguez
Evangelos Georganas
2.5D Cholesky factorization with overlapping
1. The 2.5D algorithm replicates the matrix on different layers
2. Each layer updates its trailing matrix using a different subpanel
3. The trailing matrix is reduced among different layers
In the second phase we can overlap broadcasts with computations
Edgar Solomonik, Yili Zheng, Katherine Yelick