The Other HPC: High Productivity Computing
Transcript of The Other HPC: High Productivity Computing
05/03/2023 1
The Other HPC: High Productivity Computing in Polystore Environments
Bill Howe, Ph.D.
Associate Director, eScience Institute
Senior Data Science Fellow, eScience Institute
Affiliate Associate Professor, Computer Science & Engineering
Bill Howe, UW
[Chart: amount of data in the world and processing power, both plotted against time]
What is the rate-limiting step in data understanding?
Processing power: Moore's Law
Amount of data in the world
Human cognitive capacity
Idea adapted from "Less is More" by Bill Buxton (2001)
slide src: Cecilia Aragon, UW HCDE
How much time do you spend “handling data” as opposed to “doing science”?
Mode answer: “90%”
“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval, dissection into data blocks and processing steps, order in which steps are performed to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with certain features (e.g. capping ENCODE data), testing features and feature products to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs human-derived variants)
So roughly 50% of the project was testing and improving the model, 30% figuring out how to do things (engineering) and 20% getting files and getting them into the right format.
I guess in total [I spent] 6 months [on this project].”
At least 3 months on issues of scale, file handling, and feature extraction.
Martin Kircher, Genome Sciences

Why?
3k NSF postdocs in 2010
$50k / postdoc
at least 50% overhead
maybe $75M annually at NSF alone?
Where does the time go? (2)
Productivity
How long I have to wait for results
months | weeks | days | hours | minutes | seconds | milliseconds
HPC Systems
Databases
feasibility threshold
interactivity threshold
These two performance thresholds are really important; other requirements are situation-specific
Data models: Table, Graph, Array, Matrix, Key-Value, Dataframe
Systems: MATLAB, GEMS, GraphX, Neo4J, Dato, RDBMS, HIVE, Spark, R, Pandas, Ibis, Accumulo, SciDB, HDF5
Myria Polystore Algebra
Desiderata for a Polystore Algebra
• Captures user intent
• Affords reasoning and optimization
• Accommodates best-known algorithms
Why do we care? Algebraic Optimization
N = ((z*2) + ((z*3) + 0)) / 1

Algebraic Laws:
1. (+) identity: x + 0 = x
2. (/) identity: x / 1 = x
3. (*) distributes: n*x + n*y = n*(x+y)
4. (*) commutes: x*y = y*x

Apply rules 1, 3, 4, 2:
N = (2+3)*z

Two operations instead of five, and no division operator.
Same idea works with the Relational Algebra
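As a sanity check (my addition, not from the slides), the original and rewritten expressions agree on every input:

```python
# Verify the algebraic rewrite: ((z*2) + ((z*3) + 0)) / 1  ==  (2+3)*z

def n_original(z):
    # five operations: two multiplies, two additions, one division
    return ((z * 2) + ((z * 3) + 0)) / 1

def n_optimized(z):
    # two operations after applying the identity, distributive,
    # and commutative laws
    return (2 + 3) * z

for z in [0, 1, -4, 2.5]:
    assert n_original(z) == n_optimized(z)
print("rewrites agree")
```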
The Myria Algebra is…
Relational Algebra + While/Sequence + Flatmap + Window Ops + Sample (+ Dimension Bounds)
https://github.com/uwescience/raco/
[Architecture diagram: MyriaL programs compile to a Polystore Algebra; rewrite rules map it to per-system parallel algebras (e.g., an Array Algebra) and back-end APIs (MyriaX API, Radish API, SciDB API, Graph API), targeting MyriaX, Radish, SciDB, and GEMS. The middleware layer provides orchestration and services: visualization, logging, discovery, history, browsing]
How does this actually work?
(1) Client submits a program in one of several Big Data languages (or programs directly against the API).
(2) The program is parsed as an expression tree.
(3) The expression tree is optimized into a parallel, federated execution plan involving one or more Big Data platforms.
(4) Depending on the back end, the parallel plan may be directly compiled into executable code.
(5) The middleware orchestrates the parallel, federated plan execution across the platforms.
(6) It exposes query execution logs and results through a REST API and a visual web-based interface.
[Diagram: Client → MyriaQ → Sys1, Sys2]
What can you do with a Polystore Algebra?
1) Facilitate Experiments
– Provide reference implementations
– Apply shared optimizations for apples-to-apples comparisons
– K-means, Markov chain, Naïve Bayes, TPC-H, Betweenness Centrality, Sigma-clipping, Linear Algebra
– LANL is using this idea to express algorithms that solve the governing equations for heat-transfer models
What can you do with a Polystore Algebra?
2) Rapidly develop new applications
– Microbial Oceanography
– Neuroanatomy
– Music Analytics
– Video Analytics
– Clinical Analytics
– Astronomical image de-noising
Ex: SeaFlow
[Instrument diagram: laser, microscope objective, pinhole lens, nozzle (d1, d2); detectors measure forward scatter (FSC), orange fluorescence, and red fluorescence]
Francois Ribalet, Jarred Swalwell, Ginger Armbrust
Ex: SeaFlow
[Scatter plot: red fluorescence vs. FSC, with clusters labeled Prochlorococcus, Picoplankton, Ultraplankton, and Nanoplankton]
Continuous observations of various phytoplankton groups from 1–20 µm in size
Based on RED fluo: Prochlorococcus, Pico-, Ultra-, and Nanoplankton
Based on ORANGE fluo: Synechococcus, Cryptophytes
Based on FSC: Coccolithophores
SeaFlow in Myria
• “That 5-line MyriaL program was 100x faster than my R cluster, and much simpler”
Dan Halperin, Sophie Clayton
select a.annotation, var_samp(d.density) as var
from density d
join annotation a on d.x = a.x and d.y = a.y and d.z = a.z
group by a.annotation
order by var desc
limit 10
Sample variance by annotation across all experiments
Are two regions connected?
adjacent(r1, r2) :- annotation(experiment, x1, y1, z1, r1),
                    annotation(experiment, x2, y2, z2, r2),
                    x2 = x1+1 or y2 = y1+1 or z2 = z1+1

connected(r1, r2) :- adjacent(r1, r2)
connected(r1, r3) :- connected(r1, r2), adjacent(r2, r3)
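The recursive rules compute a transitive closure; a minimal Python sketch with hypothetical region names (not the actual annotation data):

```python
# Sketch of the Datalog rules above as a fixpoint computation.
# 'adjacent' pairs are assumed given; the real adjacency would be
# derived from voxel coordinates as in the rule.

def connected_pairs(adjacent):
    """Transitive closure of the adjacency relation."""
    connected = set(adjacent)          # base rule: adjacent => connected
    changed = True
    while changed:                     # iterate to fixpoint, like Datalog
        changed = False
        for (a, b) in list(connected):
            for (c, d) in adjacent:
                if b == c and (a, d) not in connected:
                    connected.add((a, d))
                    changed = True
    return connected

adj = {("r1", "r2"), ("r2", "r3")}
print(connected_pairs(adj))  # includes ("r1", "r3")
```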
Music Analytics
segments = scan(Jeremy:MSD:SegmentsTable);
songs = scan(Jeremy:MSD:SongsTable);

seg_count = select song_id, count(segment_number) as c from segments;
density = select songs.song_id, (seg_count.c / songs.duration) as density
          from songs, seg_count
          where songs.song_id = seg_count.song_id;
store(density, public:adhoc:song_density);

Computing song density (Million Song Dataset)
http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/
Blog post on how to run it in 20 minutes on Hadoop…
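For illustration, the density computation above as a pure-Python sketch over a hypothetical mini-dataset (not the real MSD tables):

```python
# Hypothetical stand-ins for the SegmentsTable and SongsTable scans.
segments = [
    {"song_id": "a", "segment_number": 0},
    {"song_id": "a", "segment_number": 1},
    {"song_id": "b", "segment_number": 0},
]
songs = [
    {"song_id": "a", "duration": 200.0},
    {"song_id": "b", "duration": 100.0},
]

# seg_count: count(segment_number) grouped by song_id
seg_count = {}
for seg in segments:
    seg_count[seg["song_id"]] = seg_count.get(seg["song_id"], 0) + 1

# density: segments per second of audio, joined on song_id
density = {
    s["song_id"]: seg_count[s["song_id"]] / s["duration"] for s in songs
}
print(density)  # {'a': 0.01, 'b': 0.01}
```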
-- calculate probability of outcomes
Poe = select input_sp.id as inputId,
             sum(CondP.lp) as lprob,
             CondP.outcome as outcome
      from CondP, input_sp
      where CondP.index = input_sp.index
        and CondP.value = input_sp.value;

-- select the max probability outcome
classes = select inputId, ArgMax(outcome, lprob) from Poe;
Naïve Bayes Classification: Million Song Dataset
Predict song year in a 515,345-song dataset using eight timbre features, discretized into intervals of size 10
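The two queries amount to summing log-probabilities per (input, outcome) pair and taking the argmax; a minimal Python sketch with made-up numbers (the real model uses eight discretized timbre features over 515,345 songs):

```python
from collections import defaultdict

# CondP: (feature index, discretized value, outcome) -> log-probability
# (hypothetical values for illustration)
cond_p = {
    (0, "low", 1990): -0.5, (0, "low", 2000): -1.5,
    (1, "high", 1990): -2.0, (1, "high", 2000): -0.3,
}
# input_sp: one song's discretized features
input_sp = {(0, "low"), (1, "high")}

# Poe: join on (index, value), then sum log-probabilities per outcome
lprob = defaultdict(float)
for (idx, val, outcome), lp in cond_p.items():
    if (idx, val) in input_sp:
        lprob[outcome] += lp

# classes: ArgMax(outcome, lprob)
prediction = max(lprob, key=lprob.get)
print(prediction)  # 2000 (log-prob -1.8 vs -2.5 for 1990)
```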
[Plot: average heart rate (beats/minute) and average relative heart-rate variance vs. time (hours); annotations: "bad data?", "lower heart rate variance"]
MIMIC Information Flow
[Diagram: client (headless Octave + web interface) → Myria middleware (REST interface, optimization, orchestration) → MyriaX (structured data) and SciDB (waveform data)]
https://metanautix.com/tr/01_big_data_techniques_for_media_graphics.pdf
Ollie Lo, Los Alamos National Lab
What can you do with a Polystore Algebra?
3) Reason about algorithms
• Apply application-specific optimizations (in addition to automatic optimizations)
Sigma-clipping, V0

CurGood = SCAN(public:adhoc:sc_points);
DO
  mean = [FROM CurGood EMIT val=AVG(v)];
  std = [FROM CurGood EMIT val=STDEV(v)];
  NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
  CurGood = CurGood - NewBad;
  continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);
Sigma-clipping, V1: Incremental

CurGood = P
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)];
cnt = [FROM CurGood EMIT CNT(*)];
NewBad = []
DO
  sum = sum - [FROM NewBad EMIT SUM(val)];
  sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
  cnt = cnt - [FROM NewBad EMIT CNT(*)];
  mean = sum / cnt
  std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
  NewBad = FILTER([ABS(val-mean) > 2 * std], CurGood)
  CurGood = CurGood - NewBad
WHILE NewBad != {}
Sigma-clipping, V2

Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = []
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
DO
  new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
  aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,
          sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
  stats = [FROM aggs EMIT mean=_sum/cnt,
           std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
  newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
  tooLow = [FROM Points, bounds, newBounds
            WHERE newBounds.lower > v AND v >= bounds.lower EMIT v=Points.v];
  tooHigh = [FROM Points, bounds, newBounds
             WHERE newBounds.upper < v AND v <= bounds.upper EMIT v=Points.v];
  newBad = UNIONALL(tooLow, tooHigh);
  bounds = newBounds;
  continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;
output = [FROM Points, bounds WHERE Points.v > bounds.lower
          AND Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);
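All three MyriaL variants compute the same result; a minimal Python sketch of the incremental idea behind V1/V2 (my illustration, not Myria code):

```python
import math

def sigma_clip(values, k=2.0):
    """Iteratively discard points more than k standard deviations
    from the mean, maintaining sum / sum-of-squares / count
    incrementally instead of rescanning the surviving points."""
    good = list(values)
    s = sum(good)
    sq = sum(v * v for v in good)
    n = len(good)
    while True:
        mean = s / n
        # sample standard deviation from the running sums
        std = math.sqrt((n * sq - s * s) / (n * (n - 1)))
        bad = [v for v in good if abs(v - mean) > k * std]
        if not bad:
            return good
        # subtract only the removed points (the incremental trick)
        s -= sum(bad)
        sq -= sum(v * v for v in bad)
        n -= len(bad)
        good = [v for v in good if abs(v - mean) <= k * std]

print(sigma_clip([1, 2, 3, 2, 1, 2, 3, 100]))  # the outlier 100 is clipped
```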
What can you do with a Polystore Algebra?
3) Orchestrate Federated Workflows
[Diagram: Client → MyriaX, SciDB]
More Orchestrating Federated Workflows
[Diagram: MyriaQ coordinating Spark, Hadoop, and RDBMS back ends]
What can you do with a Polystore Algebra?
4) Study the price of abstraction
Compiling the Myria algebra to bare-metal PGAS programs
RADISH
ICDE 15
Brandon Myers
Query compilation for distributed processing
[Diagram: two compilation strategies.
(a) The entire pipeline emitted as parallel code → parallel compiler → machine code (Myers '14).
(b) Pipeline fragment code, per fragment → sequential compiler → machine code (Crotty '14, Li '14, Seo '14, Murray '11)]
1% selection microbenchmark, 20GB
Avoid long code paths
Q2 SP2Bench, 100M triples, multiple self-joins
Communication optimization
Graph Patterns
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then compiled again by a low-level PGAS compiler
• One of Myria's supported back ends
• Comparison with Shark/Spark, which itself has been shown to be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
select A.i, B.k, sum(A.val * B.val)
from A, B
where A.j = B.j
group by A.i, B.k
Matrix multiply in RA
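The SQL above is exactly sparse matrix multiply over (row, col, val) triples; a minimal Python sketch of the same join + group-by (illustration only):

```python
from collections import defaultdict

def matmul(A, B):
    """Sparse matrix multiply where A and B are lists of
    (row, col, val) triples, mirroring the relational query:
    join on A.j = B.j, group by (A.i, B.k), sum(A.val * B.val)."""
    result = defaultdict(float)
    for (i, j, a_val) in A:
        for (j2, k, b_val) in B:
            if j == j2:                      # where A.j = B.j
                result[(i, k)] += a_val * b_val
    return dict(result)

A = [(0, 0, 1.0), (0, 1, 2.0)]   # the 1x2 matrix [[1, 2]]
B = [(0, 0, 3.0), (1, 0, 4.0)]   # the 2x1 matrix [[3], [4]]
print(matmul(A, B))  # {(0, 0): 11.0}
```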
Complexity of matrix multiply
n = number of rows; m = number of non-zeros; sparsity exponent r such that m = n^r
[Chart: complexity exponent vs. sparsity exponent r]
  naïve sparse algorithm: mn
  best known sparse algorithm: m^0.7 n^1.2 + n^2
  best known dense algorithm: n^2.38
  lots of room here
slide adapted from Zwick; R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication
BLAS vs. SpBLAS vs. SQL (10k); off-the-shelf database
15X
Relative Speedup of SpBLAS vs. HyperDB
- speedup = T_HyperDB / T_SpBLAS
- benchmark datasets with r = 1.2, plus the real data cases (the three largest datasets: 1.17 < r < 1.20)
- on star (nTh = 12), on dragon (nTh = 60)
- As n increases, the relative speedup of SpBLAS over HyperDB decreases.
- soc-Pokec: the speedup is only around 5X.
- On star, HyperDB got stuck thrashing on the soc-Pokec data.
20k X 20k matrix multiply by sparsity: CombBLAS, MyriaX, Radish
50k X 50k matrix multiply by sparsity: CombBLAS, MyriaX, Radish; filter to the upper-left corner of the result matrix
What can you do with a Polystore Algebra?
5) Provide new services over a Polystore Ecosystem
Lowering barrier to entry
Exposing Performance Issues
Dominik Moritz, EuroSys 15
[Heatmap: source worker vs. destination worker]
Voyager: Visualization Recommendation
Kanit "Ham" Wongsuphasawat, InfoVis 15
Scalable Graph Clustering (Seung-Hee Bae)
Version 1: Parallelize the best-known serial algorithm (ICDM 2013)
Version 2: Free 30% improvement for any algorithm (TKDD 2014)
Version 3: Distributed approximation algorithm, 1.5B edges (SC 2015)
Viziometrics: Analysis of Visualization in the Scientific Literature
[Chart: proportion of non-quantitative figures in paper vs. paper impact, grouped into 5% percentiles]
Poshen Lee
http://escience.washington.edu
http://myria.cs.washington.edu
http://uwescience.github.io/sqlshare/