Machine Learning in DryadLINQ
1
Machine Learning in DryadLINQ
Kannan Achan, Mihai Budiu
MSR-SVC, 1/30/2008
2
Goal
3
The Software Stack
Layers, bottom to top:
• Windows Server (three boxes in the diagram)
• Cluster Services
• Distributed Filesystem: Cosmos
• Dryad
• DryadLINQ
• Large Vector
• Machine learning / Data analysis
4
Dryad
5
Dryad Jobs
[Figure: a job graph. Vertices (processes) labeled R, X, and M are grouped into stages and connected by channels; the graph reads from input files and writes to output files.]
6
LINQ and C#
7
LINQ
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
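For readers more at home in Python, the query reads as a filter followed by a projection. The sketch below is only an analogy, not DryadLINQ code: `Record`, `is_legal`, and `hash_key` are hypothetical stand-ins for the slide's `Collection<T>` element type, `IsLegal`, and `Hash`.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical element type with the two fields the query touches.
    key: int
    value: str


def is_legal(key: int) -> bool:
    # Hypothetical predicate: keep non-negative keys only.
    return key >= 0


def hash_key(key: int) -> str:
    # Hypothetical hash: map each key to one of four named buckets.
    return f"bucket-{key % 4}"


def run_query(collection):
    # where IsLegal(c.key) select (Hash(c.key), c.value)
    return [(hash_key(c.key), c.value) for c in collection if is_legal(c.key)]


collection = [Record(5, "a"), Record(-1, "b"), Record(8, "c")]
results = run_query(collection)
```

The record with key -1 is filtered out; the other two are projected to (hash, value) pairs.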
8
DryadLINQ = LINQ + Dryad
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
[Figure: the C# query is turned into a query plan (a Dryad job) whose vertices run C# vertex code; the job reads the data in collection and produces results.]
9
Recall: The Software Stack
Bottom to top: Windows Server, Cluster Services, Distributed Filesystem (Cosmos), Dryad, DryadLINQ, Large Vector, Machine learning / Data analysis
10
Very Large Vector Library
• PartitionedVector<T>: elements of type T, spread over several partitions
• Scalar<T>: a single element of type T
11
Operations on Large Vectors: Map 1
f : T → U, applied to every element
f preserves partitioning
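One rough way to picture Map 1 is to model a partitioned vector as a list of lists. That model and the name `map_partitioned` are assumptions made for illustration, not the Large Vector library's API.

```python
def map_partitioned(partitions, f):
    # Apply f to every element, one partition at a time.
    # The output keeps the input's partition layout.
    return [[f(x) for x in part] for part in partitions]


vector = [[1, 2], [3], [4, 5, 6]]
squares = map_partitioned(vector, lambda x: x * x)
```

Because each partition is processed on its own, no data has to move between partitions.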
12
Map 2 (Pairwise)
f : (T, U) → V, applied to corresponding elements of two vectors
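A pairwise map can be sketched the same way, again assuming a list-of-lists model of a partitioned vector. `map_pairwise` is a hypothetical name, and the requirement that both inputs share a partition layout is an assumption of this sketch.

```python
def map_pairwise(left, right, f):
    # Corresponding elements must sit in corresponding partitions,
    # so both vectors need the same partition layout.
    if [len(p) for p in left] != [len(p) for p in right]:
        raise ValueError("vectors are partitioned differently")
    return [[f(t, u) for t, u in zip(lp, rp)] for lp, rp in zip(left, right)]


ts = [[1, 2], [3]]
us = [[10, 20], [30]]
vs = map_pairwise(ts, us, lambda t, u: t + u)
```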
13
Map 3 (Vector-Scalar)
f : (T, U) → V, applied to each element of a vector together with a single scalar
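In the same list-of-lists model (an assumption, as is the name `map_with_scalar`), a vector-scalar map hands one shared scalar to f alongside every element:

```python
def map_with_scalar(partitions, scalar, f):
    # Conceptually the scalar is broadcast to every partition,
    # then f combines it with each element in turn.
    return [[f(x, scalar) for x in part] for part in partitions]


vector = [[1.0, 2.0], [3.0]]
scaled = map_with_scalar(vector, 0.5, lambda x, s: x * s)
```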
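14
Reduce (Fold)
f : (U, U) → U
[Figure: f folds each partition of U elements down to a single U; a final f combines the per-partition results into one U.]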
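A two-stage fold can be sketched as below, assuming a list-of-lists model of a partitioned vector and a non-empty input. `fold_partitioned` is a hypothetical name. The two-stage result matches a sequential fold only when f is associative.

```python
from functools import reduce


def fold_partitioned(partitions, f):
    # Stage 1: fold each non-empty partition independently.
    partials = [reduce(f, part) for part in partitions if part]
    # Stage 2: fold the per-partition results into one value.
    return reduce(f, partials)


vector = [[1, 2], [3], [4, 5, 6]]
total = fold_partitioned(vector, lambda a, b: a + b)
largest = fold_partitioned(vector, max)
```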
15
Linear Algebra
[Garbled equation: the element types T, U, V are set to vector and matrix types with dimensions m and n.]
16
Linear Regression
• Data: $x_t \in \mathbb{R}^n,\ y_t \in \mathbb{R}^m,\ t \in \{1, \dots, n\}$
• Find: $A \in \mathbb{R}^{m \times n}$
• S.t.: $A x_t \approx y_t$
17
Analytic Solution
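$A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$
[Figure: dataflow over partitions X[0], X[1], X[2] and Y[0], Y[1], Y[2]. Map: each partition computes X×X^T and Y×X^T. Reduce: a Σ vertex sums each set of products. The X×X^T sum is inverted ([ ]^-1) and multiplied (*) with the Y×X^T sum to give A.]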
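The closed form follows from ordinary least squares. This derivation assumes the objective is the sum of squared residuals and that $\sum_t x_t x_t^T$ is invertible; the slide states only the result.

```latex
\begin{aligned}
E(A) &= \sum_t \lVert A x_t - y_t \rVert^2 \\
\frac{\partial E}{\partial A} &= 2 \sum_t (A x_t - y_t)\, x_t^T = 0 \\
A \sum_t x_t x_t^T &= \sum_t y_t x_t^T \\
A &= \Bigl(\sum_t y_t x_t^T\Bigr) \Bigl(\sum_t x_t x_t^T\Bigr)^{-1}
\end{aligned}
```

Both sums are sums of per-sample outer products, which is why they fit a map step followed by a reduce step.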
18
Linear Regression Code
$A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$
Matrices xx = x.PairwiseOuterProduct(x);
OneMatrix xxs = xx.Sum();
Matrices yx = y.PairwiseOuterProduct(x);
OneMatrix yxs = yx.Sum();
OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));
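The same six steps can be tried on a single machine with a pure-Python sketch. The variable names `xx`, `xxs`, `yx`, `yxs`, and `xxinv` mirror the slide. Everything else is an assumption for illustration: the helper functions, the list-of-lists partition model, and the Gauss-Jordan inverse. None of it is the Large Vector library's API.

```python
from functools import reduce


def outer(u, v):
    # Outer product u v^T as a list of rows.
    return [[a * b for b in v] for a in u]


def mat_add(p, q):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(p, q)]


def mat_mul(p, q):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*q)] for row in p]


def inverse(m):
    # Gauss-Jordan elimination with partial pivoting on [m | I].
    n = len(m)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit(x_parts, y_parts):
    # Map: one outer product per sample, partition by partition.
    xx = [outer(x, x) for part in x_parts for x in part]
    yx = [outer(y, x) for yp, xp in zip(y_parts, x_parts) for y, x in zip(yp, xp)]
    # Reduce: sum the products.
    xxs = reduce(mat_add, xx)
    yxs = reduce(mat_add, yx)
    # Invert one sum and multiply it into the other.
    xxinv = inverse(xxs)
    return mat_mul(yxs, xxinv)


# Noise-free data generated from a known matrix, split into two partitions.
true_a = [[2.0, 0.0], [1.0, -1.0]]
x_parts = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 1.0]]]
y_parts = [[[sum(a * v for a, v in zip(row, x)) for row in true_a] for x in part]
           for part in x_parts]
A = fit(x_parts, y_parts)
```

With noise-free data the recovered `A` matches `true_a` up to floating-point error.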
19
Expectation Maximization
• 160 lines
• 3 iterations shown
20
Understanding Botnet Traffic using EM
• 3 GB data
• 15 clusters
• 60 computers
• 50 iterations
• 9000 processes
• 50 minutes
21
Conclusions
• Dryad simplifies programming large clusters
• DryadLINQ = declarative programming for Dryad jobs
• The Large Vector library provides simple mathematical primitives on top of DryadLINQ
• Matlab-style coding for writing distributed numeric computations
[Figure: the software stack, bottom to top: Win (three boxes), Cluster Services, Distributed Filesystem, Dryad, DryadLINQ, Large Vector, ML / Data analysis.]
22
Backup Slides
23
Chaining
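$A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$
[Figure: the linear regression dataflow over partitions X[0], X[1], X[2] and Y[0], Y[1], Y[2], with six extra Σ vertices, one per X×X^T or Y×X^T vertex, ahead of the two final Σ vertices, the inverse ([ ]^-1), the multiply (*), and the result A.]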
24
EM Structure
[Figure labels: E stage; Input size; π; σ; μ; All parameters.]
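The labels π, σ, and μ suggest a Gaussian mixture model, though the transcript does not say so. Under that assumption, one EM iteration for a one-dimensional mixture can be sketched as follows. All names, the toy data, and the floor on σ are illustrative choices, not taken from the talk's 160-line program.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def em_step(data, pi, mu, sigma):
    k = len(pi)
    # E stage: responsibility of each component for each data point.
    resp = []
    for x in data:
        w = [pi[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(k)]
        total = sum(w)
        resp.append([wj / total for wj in w])
    # M stage: re-estimate all parameters from the responsibilities.
    new_pi, new_mu, new_sigma = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        m = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - m) ** 2 for r, x in zip(resp, data)) / nj
        new_pi.append(nj / len(data))
        new_mu.append(m)
        # Floor sigma so a component cannot collapse onto a single point.
        new_sigma.append(max(math.sqrt(var), 1e-3))
    return new_pi, new_mu, new_sigma


# Toy data: two well-separated groups around 0 and 10.
data = [0.0, 0.2, -0.2, 10.0, 10.2, 9.8]
pi, mu, sigma = [0.5, 0.5], [min(data), max(data)], [1.0, 1.0]
for _ in range(3):
    pi, mu, sigma = em_step(data, pi, mu, sigma)
```

The E stage works point by point and the M stage is a set of sums over points, so both stages match the map and reduce operations on partitioned vectors.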