Cluster Computing with DryadLINQ Mihai Budiu, MSR-SVC PARC, May 8 2008.
Cluster Computing with DryadLINQ
Mihai Budiu, MSR-SVC
PARC, May 8 2008
Acknowledgments
MSR SVC and ISRC SVC
Michael Isard, Yuan Yu, Andrew Birrell, Dennis Fetterly
Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey
Computer Evolution
1961 2008 2040
?
Computer Evolution
ENIAC 1943: 30 tons, 200 kW
Datacenter 2008: 500,000 ft², 40 MW
2040: ?
2040
Layers
Networking
Storage
Distributed Execution
Scheduling
Resource Management
Applications
Identity & Security
Caching and Synchronization
Programming Languages and APIs
Operating System
This Work
The Rest of This Talk
Windows Server
Cluster Services
Distributed Filesystem
Dryad
DryadLINQ
Windows Server
Windows Server
Windows Server
CIFS/NTFS
Large Vectors
Machine Learning
TeraSort
How fast can you sort 10^10 100-byte records (1 TB)?
Sequential scan/disk = 4.6 hours
Current record: 435 seconds (7.2 min), cluster of 40 Itanium2, 2520 SAN disks
Code: 3300 lines of C
Our result: 349 seconds (5.8 min), cluster of 240 AMD64 (quad) machines, 920 disks
Code: 17 lines of LINQ
Outline
• Introduction
• Dryad
• DryadLINQ
• Building on DryadLINQ
Outline
• Introduction
• Dryad
  – deployed since 2006
  – many thousands of machines
  – analyzes many petabytes of data/day
• DryadLINQ
• Building on DryadLINQ
Goal
Design Space
[Figure labels: Throughput, Latency, Internet, Private data center, Data-parallel, Shared memory, Dryad, Search, HPC, Grid, Transaction]
Data Partitioning
RAM
DATA
DATA
2-D Piping
• Unix Pipes: 1-D
  grep | sed | sort | awk | perl
• Dryad: 2-D
  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
Dryad = Execution Layer
Job (application) ≈ Pipeline
Dryad ≈ Shell
Cluster ≈ Machine
Virtualized 2-D Pipelines
• 2D DAG
• multi-machine
• virtualized
Dryad Job Structure
[Figure: a job graph whose vertices (processes) run grep, sed, sort, awk, and perl; vertices are grouped into stages and connected by channels, reading input files and writing output files.]
Channels
X
M
Items
Finite Streams of items
• distributed filesystem files (persistent)
• SMB/NTFS files (temporary)
• TCP pipes (inter-machine)
• memory FIFOs (intra-machine)
Architecture
[Figure labels: job manager, cluster, NS, PD (x3), V (x3); control plane: job schedule; data plane: files, TCP, FIFO, network]
Fault Tolerance
Dynamic Graph Rewriting
[Figure: completed vertices X[0], X[1], X[3]; slow vertex X[2]; duplicate vertex X'[2]]
Duplication Policy = f(running times, data volumes)
Dynamic Aggregation
[Figure: a static plan in which S vertices feed T directly, next to the dynamic plan in which aggregation vertices (A) sit between S and T; S and A vertices are tagged with rack numbers (#1, #2, #3).]
Data-Parallel Computation
Storage
Execution
Application
Parallel Databases
Map-Reduce
GFS, BigTable
Dryad
Outline
• Introduction
• Dryad
• DryadLINQ
• Building on DryadLINQ
DryadLINQ
Dryad
LINQ
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
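The query on this slide can be tried locally with LINQ to Objects. The following is a minimal sketch, not the talk's code: the `Record` type, the legality rule, and the `Hash` body are hypothetical stand-ins, since the slide only declares signatures. It assumes C# 9 or later for `record`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqSketch
{
    // Hypothetical element type: the slide only shows members named key and value.
    public record Record(string key, int value);

    // Assumed legality rule, for illustration only.
    public static bool IsLegal(string key) => !string.IsNullOrEmpty(key);

    // Assumed "hash": upper-cases the key so the output is easy to read.
    public static string Hash(string key) => key.ToUpperInvariant();

    public static List<(string hash, int value)> Run(IEnumerable<Record> collection)
    {
        // Same shape as the slide's query: filter, then project.
        var results = from c in collection
                      where IsLegal(c.key)
                      select (hash: Hash(c.key), c.value);
        return results.ToList();
    }
}
```

Calling `LinqSketch.Run` on records with keys "a", "" and "b" keeps only the first and last record, since the empty key fails the assumed `IsLegal` check.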
DryadLINQ = LINQ + Dryad
[Figure: the C# query over collection becomes a query plan (a Dryad job) whose vertices run C# vertex code; the job reads the data in collection and produces results.]
Data Model
Partition
Collection
C# objects
Query Providers
DryadLINQ
Client machine
(11)
Distributed query plan
C#
Query Expr
Data center
Output Tables
Results
Input Tables
Invoke Query
Output DryadTable
Dryad Execution
C# Objects
JM
ToDryadTable
foreach
Demo
Example: Histogram
public static IQueryable<Pair> Histogram(
    IQueryable<LineRecord> input, int k)
{
    var words = input.SelectMany(x => x.line.Split(' '));
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x.count);
    var top = ordered.Take(k);
    return top;
}
“A line of words of wisdom”
[“A”, “line”, “of”, “words”, “of”, “wisdom”]
[[“A”], [“line”], [“of”, “of”], [“words”], [“wisdom”]]
[ {“A”, 1}, {“line”, 1}, {“of”, 2}, {“words”, 1}, {“wisdom”, 1}]
[{“of”, 2}, {“A”, 1}, {“line”, 1}, {“words”, 1}, {“wisdom”, 1}]
[{“of”, 2}, {“A”, 1}, {“line”, 1}]
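The trace above can be reproduced on one machine with LINQ to Objects. This is a minimal sketch under stated assumptions: the `LineRecord` and `Pair` definitions are hypothetical (the slide does not show them), and `IEnumerable` replaces `IQueryable`, so nothing here runs on Dryad. It assumes C# 9 or later for `record`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HistogramSketch
{
    // Hypothetical definitions; the slide uses these types without showing them.
    public record LineRecord(string line);
    public record Pair(string word, int count);

    public static List<Pair> Histogram(IEnumerable<LineRecord> input, int k)
    {
        return input
            .SelectMany(x => x.line.Split(' '))        // lines -> words
            .GroupBy(x => x)                           // words -> groups of equal words
            .Select(g => new Pair(g.Key, g.Count()))   // groups -> (word, count)
            .OrderByDescending(p => p.count)           // most frequent first
            .Take(k)
            .ToList();
    }
}
```

LINQ to Objects' `OrderByDescending` is a stable sort, so words with equal counts stay in first-occurrence order, which is why "A" and "line" follow "of" in the top three. A distributed execution need not preserve that tie order.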
Histogram Plan
SelectMany, HashDistribute
Merge, GroupBy
Select
OrderByDescending, Take
MergeSort, Take
Map-Reduce in DryadLINQ
public static IQueryable<S> MapReduce<T,M,K,S>(
    this IQueryable<T> input,
    Expression<Func<T, IEnumerable<M>>> mapper,
    Expression<Func<M,K>> keySelector,
    Expression<Func<IGrouping<K,M>,S>> reducer)
{
    var map = input.SelectMany(mapper);
    var group = map.GroupBy(keySelector);
    var result = group.Select(reducer);
    return result;
}
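As a usage illustration, the sketch below applies the same three-operator chain to a word count, executed locally through `AsQueryable()` rather than on a cluster. The `WordCount` helper and its choice of `KeyValuePair` as the reducer output are assumptions made for this example. `Split(new[] { ' ' })` is used because expression trees on current .NET reject calls that rely on optional arguments.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class MapReduceSketch
{
    // Same operator chain as the slide: SelectMany, GroupBy, Select.
    public static IQueryable<S> MapReduce<T, M, K, S>(
        this IQueryable<T> input,
        Expression<Func<T, IEnumerable<M>>> mapper,
        Expression<Func<M, K>> keySelector,
        Expression<Func<IGrouping<K, M>, S>> reducer)
    {
        return input.SelectMany(mapper).GroupBy(keySelector).Select(reducer);
    }

    // Hypothetical example job: count word occurrences across lines.
    public static Dictionary<string, int> WordCount(IEnumerable<string> lines)
    {
        return lines.AsQueryable()
            .MapReduce(
                line => line.Split(new[] { ' ' }).AsEnumerable(),        // map: line -> words
                word => word,                                            // key: the word itself
                g => new KeyValuePair<string, int>(g.Key, g.Count()))    // reduce: count per word
            .ToDictionary(p => p.Key, p => p.Value);
    }
}
```

Because the arguments are expression trees rather than delegates, a query provider can inspect them and generate a distributed plan; here the in-memory provider simply compiles and runs them.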
Map-Reduce Plan
[Figure: the map-reduce query plan at three stages of refinement, (1) to (3). Vertex legend: M = map, Q = sort, G1 = groupby, R = reduce, D = distribute, MS = mergesort, G2 = groupby, X = consumer. Phases: map, partial aggregation, reduce. An S/A/T aggregation tree is shown alongside.]
Distributed Sorting in DryadLINQ
public static IQueryable<TSource> DSort<TSource, TKey>(
    this IQueryable<TSource> source,
    Expression<Func<TSource, TKey>> keySelector,
    int pcount)
{
    var samples = source.Apply(x => Sampling(x));
    var keys = samples.Apply(x => ComputeKeys(x, pcount));
    var parts = source.RangePartition(keySelector, keys);
    return parts.OrderBy(keySelector);
}
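The sample, compute-keys, range-partition, sort sequence can be imitated on a single machine. The sketch below is only an illustration of that idea for `int` keys: the sampling rule (every other element), the separator policy in `ComputeKeys`, and the `PartitionOf` helper are all assumptions, and `GroupBy` stands in for DryadLINQ's `RangePartition`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangeSortSketch
{
    // Pick pcount - 1 separator keys from a sorted sample (hypothetical policy).
    public static List<int> ComputeKeys(IEnumerable<int> sample, int pcount)
    {
        var sorted = sample.OrderBy(k => k).ToList();
        var keys = new List<int>();
        if (sorted.Count == 0) return keys;
        for (int i = 1; i < pcount; i++)
            keys.Add(sorted[i * sorted.Count / pcount]);
        return keys;
    }

    // Partition index = number of separators that are <= key.
    public static int PartitionOf(int key, List<int> separators) =>
        separators.Count(s => s <= key);

    public static List<int> DSort(IReadOnlyList<int> source, int pcount)
    {
        // Deterministic stand-in for sampling: every other element.
        var sample = source.Where((_, i) => i % 2 == 0);
        var separators = ComputeKeys(sample, pcount);

        return source
            .GroupBy(x => PartitionOf(x, separators))   // range partition
            .OrderBy(part => part.Key)                  // partitions in range order
            .SelectMany(part => part.OrderBy(x => x))   // sort inside each partition
            .ToList();
    }
}
```

Since every key in partition i is smaller than every key in partition i + 1, concatenating the locally sorted partitions yields a globally sorted result; the sample only affects how evenly the partitions are balanced.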
Distributed Sorting Plan
[Figure: the sorting plan at three stages of refinement, (1) to (3), with vertices labeled O, DS, H, D, M, and S.]
Outline
• Introduction
• Dryad
• DryadLINQ
• Building on DryadLINQ
Machine Learning in DryadLINQ
Dryad
DryadLINQ
Large Vector
Machine learning
Data analysis
Operations on Large Vectors: Map 1
f: T → U
f preserves partitioning
Map 2 (Pairwise)
f: (T, U) → V
Map 3 (Vector-Scalar)
f: (T, U) → V
Reduce (Fold)
f: (U, U) → U
Linear Algebra
T U Vnmm ,,=, ,
T
Linear Regression
• Data: $x_t \in \mathbb{R}^n,\ y_t \in \mathbb{R}^m,\ t \in \{1, \dots, n\}$
• Find: $A \in \mathbb{R}^{m \times n}$
• S.t.: $A x_t \approx y_t$
Analytic Solution
$$A = \Big(\sum_t y_t x_t^T\Big)\Big(\sum_t x_t x_t^T\Big)^{-1}$$
[Figure: Map computes X×X^T and Y×X^T on each of the partitions X[0], X[1], X[2] and Y[0], Y[1], Y[2]; Reduce sums (Σ) each group; the X sum is inverted ([ ]^-1) and multiplied (*) with the Y sum to give A.]
Linear Regression Code
Vectors x = input(0), y = input(1);
Matrices xx = x.Map(x, (a,b) => a.OuterProd(b));
OneMatrix xxs = xx.Sum();
Matrices yx = y.Map(x, (a,b) => a.OuterProd(b));
OneMatrix yxs = yx.Sum();
OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Mult(b));
$$A = \Big(\sum_t y_t x_t^T\Big)\Big(\sum_t x_t x_t^T\Big)^{-1}$$
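To make the formula A = (Σ y_t x_t^T)(Σ x_t x_t^T)^-1 concrete, here is a small numeric sketch for the special case of scalar y_t and two-dimensional x_t, using plain arrays instead of the large-vector library. The `Fit` name and the closed-form 2x2 inverse are choices made for this example only.

```csharp
using System;
using System.Linq;

public static class RegressionSketch
{
    // Solves A = (sum_t y_t x_t^T) (sum_t x_t x_t^T)^{-1} for scalar y_t and 2-D x_t.
    public static double[] Fit(double[][] xs, double[] ys)
    {
        // "Map" step: per-sample products; "Reduce" step: their sums.
        double s00 = xs.Sum(x => x[0] * x[0]);
        double s01 = xs.Sum(x => x[0] * x[1]);
        double s11 = xs.Sum(x => x[1] * x[1]);
        double yx0 = xs.Zip(ys, (x, y) => y * x[0]).Sum();
        double yx1 = xs.Zip(ys, (x, y) => y * x[1]).Sum();

        // Closed-form inverse of the symmetric 2x2 matrix [[s00, s01], [s01, s11]].
        double det = s00 * s11 - s01 * s01;
        if (Math.Abs(det) < 1e-12)
            throw new InvalidOperationException("sum of x x^T is singular");

        // Row vector (yx0, yx1) times the inverse.
        return new[]
        {
            (yx0 * s11 - yx1 * s01) / det,
            (yx1 * s00 - yx0 * s01) / det,
        };
    }
}
```

For the samples x = (1,0), (0,1), (1,1) with y = 2, 3, 5, the sums are Σ x x^T = [[2,1],[1,2]] and Σ y x^T = (7, 8), which gives A = (2, 3), matching y = 2·x1 + 3·x2.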
Expectation Maximization (Gaussians)
• 160 lines
• 3 iterations shown
Conclusions
• Dryad = distributed execution environment
• Application-independent (semantics oblivious)
• Supports rich software ecosystem
  – Relational algebra, Map-reduce, LINQ
• DryadLINQ = Compiles LINQ to Dryad
• C# objects and declarative programming
• .Net and Visual Studio for parallel programming
Backup Slides
Software Stack
[Figure labels: Windows Server (x4), Cluster Services, Distributed Filesystem, CIFS/NTFS, Dryad, Job queueing and monitoring, Distributed Shell, PSQL, DryadLINQ, SSIS, Scope, Perl, SQL server, C++, C#, legacy code (sed, awk, grep, etc.), Vectors, Machine Learning]
Very Large Vector Library
PartitionedVector<T>
Scalar<T>
DryadLINQ
• Declarative programming
• Integration with Visual Studio
• Integration with .Net
• Type safety
• Automatic serialization
• Job graph optimizations
  – static
  – dynamic
• Conciseness
Sort & Map-Reduce in DryadLINQ
Dryad and Map-Reduce
• Many similarities
Dryad:
• Execution layer
• Job = arbitrary DAG
• Plug-in policies
• Program=graph gen.
• Complex ( features)
• New (< 2 years)
• Still growing
• Internal
Map-Reduce:
• Exe + app. model
• Map+sort+reduce
• Few policies
• Program=map+reduce
• Simple
• Mature (> 4 years)
• Widely deployed
• Hadoop
PLINQ
public static IEnumerable<TSource> DryadSort<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer,
    bool isDescending)
{
    // Sort in parallel on the local machine, in the requested direction.
    var parallel = source.AsParallel();
    return isDescending
        ? parallel.OrderByDescending(keySelector, comparer)
        : parallel.OrderBy(keySelector, comparer);
}
Query histogram computation
• Input: log file (n partitions)
• Extract queries from log partitions
• Re-partition by hash of query (k buckets)
• Compute histogram within each bucket
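The four steps above can be sketched on a single machine as follows. This is an illustration only: the tab-separated log format, the `ExtractQuery` helper, and the character-sum `Bucket` function are assumptions, chosen so the example is deterministic (`string.GetHashCode` is randomized per process on .NET Core).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QueryHistogramSketch
{
    // Hypothetical log format: "<timestamp>\t<query>".
    public static string ExtractQuery(string logLine) => logLine.Split('\t')[1];

    // Deterministic stand-in for a hash function: sum of character codes mod k.
    public static int Bucket(string query, int k) => query.Sum(ch => (int)ch) % k;

    public static List<Dictionary<string, int>> Run(
        IEnumerable<IEnumerable<string>> partitions, int k)
    {
        // Step 2: extract queries from each of the n log partitions.
        var queries = partitions.SelectMany(p => p.Select(line => ExtractQuery(line)));

        // Step 3: re-partition by hash of the query into k buckets.
        var buckets = queries.ToLookup(q => Bucket(q, k));

        // Step 4: compute a histogram inside each bucket.
        return Enumerable.Range(0, k)
            .Select(b => buckets[b].GroupBy(q => q).ToDictionary(g => g.Key, g => g.Count()))
            .ToList();
    }
}
```

Hashing guarantees that all occurrences of one query land in the same bucket, so each bucket's histogram is complete for the queries it holds and no cross-bucket merge is needed.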
Naïve histogram topology
[Figure: n Q vertices feed k R vertices; insets show the operators that make up each Q and each R.]
Legend: P = parse lines, D = hash distribute, S = quicksort, C = count occurrences, MS = merge sort
Efficient histogram topology
[Figure: Q', T, and R stages, with widths labeled n and k; insets show the operators that make up each Q', each T, and each R.]
Legend: P = parse lines, D = hash distribute, S = quicksort, C = count occurrences, MS = merge sort, M = non-deterministic merge
Final histogram refinement
[Figure: the refined Q', T, R topology; numeric labels 450, 217, 450, 10,405, and 99,713; data-volume labels 33.4 GB, 118 GB, 154 GB, and 10.2 TB.]
• 1,800 computers
• 43,171 vertices
• 11,072 processes
• 11.5 minutes
Data Distribution (Group By)
[Figure: Source and Dest vertices; labels m, n, and m x n.]
Range-Distribution Manager
[Figure: static and dynamic plans with vertices S, D, T, and Hist; range labels [0-?) and [?-100) in the static plan, and [0-100), [0-30), and [30-100) in the dynamic plan.]
Goal: Declarative Programming
X
T
S
X X
S S
T T T
X
static dynamic
Staging
1. Build
2. Send .exe
3. Start JM
4. Query cluster resources
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution
[Figure labels: JM code, vertex code, cluster services]
SkyServer Query 18
[Figure: query plan over inputs U, N, and L, with vertices labeled X, D, M, S, Y, and H and stage widths n and 4n.]
select distinct U.ObjID
into results
from photoPrimary U, neighbors N, photoPrimary L
where U.ObjID = N.ObjID
  and L.ObjID = N.NeighborObjID
  and U.ObjID < L.ObjID
  and abs((U.u-U.g)-(L.u-L.g))<0.05
  and abs((U.g-U.r)-(L.g-L.r))<0.05
  and abs((U.r-U.i)-(L.r-L.i))<0.05
  and abs((U.i-U.z)-(L.i-L.z))<0.05
SkyServer Q18 Performance
[Figure: speed-up (times, axis 0.0 to 16.0) versus number of computers (axis 0 to 10) for Dryad In-Memory, Dryad Two-pass, and SQLServer 2005.]