[Paper Study] DisCo: Distributed CoClustering with Map-Reduce, 2008 ICDE
DisCo: Distributed Co-Clustering with Map-Reduce
2008 IEEE International Conference on Data Engineering (ICDE)
Tzu-Li Tai, Tse-En Liu
Kai-Wei Chan, He-Chuan Hoh
National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory

Paper authors: S. Papadimitriou, J. Sun
IBM T.J. Watson Research Center, NY, USA
Agenda
A. Motivation
B. Background: Co-Clustering + MapReduce
C. Proposed Distributed Co-Clustering Process
D. Implementation Details
E. Experimental Evaluation
F. Conclusions
G. Discussion
Motivation
Fast Growth in Volume of Data
• Google processes 20 petabytes of data per day
• Amazon and eBay handle petabytes of transactional data every day
Highly variant structure of data
• Data sources naturally generate data in impure forms
• Unstructured, semi-structured
Motivation
Problems with Big Data mining for DBMSs
• Significant preprocessing costs for the majority of data mining tasks
• DBMSs lack performance for large amounts of data
Motivation
Why distributed processing can solve the issues:
• MapReduce is agnostic to the schema or form of the input data
• Many preprocessing tasks are naturally expressible with MapReduce
• Highly scalable with commodity machines
Motivation
Contributions of this paper:
• Presents the whole process for distributed data mining
• Specifically, focuses on the Co-Clustering mining task, and designs a distributed co-clustering method using MapReduce
Background: Co-Clustering
• Also named biclustering, or two-mode clustering
• Input format: a matrix of 𝑚 rows and 𝑛 columns
• Output: co-clusters (sub-matrices) whose rows exhibit similar behavior across a subset of columns
Background: Co-Clustering

Why Co-Clustering?

Student A: 0 1 0 1 1
Student B: 1 0 1 0 0
Student C: 0 1 0 1 1
Student D: 1 0 1 0 0

Traditional clustering groups the rows only: {A, C} and {B, D}. We can only tell that students A & C (and B & D) have similar scores.
Background: Co-Clustering

Why Co-Clustering?

Student A: 0 1 0 1 1
Student B: 1 0 1 0 0
Student C: 0 1 0 1 1
Student D: 1 0 1 0 0

Co-clustering permutes rows and columns to reveal the block structure:

Student D: 1 1 0 0 0
Student B: 1 1 0 0 0
Student C: 0 0 1 1 1
Student A: 0 0 1 1 1

Cluster 1 (B & D): good at Science + Math
Cluster 2 (A & C): good at English + Chinese + Social Studies

Co-clustering finds rows that have similar properties for a subset of selected columns.
Background: Co-Clustering

Another Co-Clustering Example: Animal Data (figure)
Background: MapReduce

The MapReduce Paradigm

Map tasks consume input (k1, v1) pairs and emit intermediate (k2, v2) pairs; the framework groups all values sharing a key into (k2, [v2]) pairs, which Reduce tasks consume to emit final (k3, v3) pairs.
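The paradigm above can be sketched with a minimal in-memory simulation (a toy stand-in for a real MapReduce runtime, not the paper's implementation; the word-count usage example and all names here are illustrative):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Minimal in-memory sketch of the MapReduce paradigm:
    map: (k1, v1) -> [(k2, v2)], shuffle groups values by k2,
    reduce: (k2, [v2]) -> [(k3, v3)]."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    out = []
    for k2, vs in groups.items():
        out.extend(reducer(k2, vs))
    return out

# Classic word count as a usage example.
docs = [(1, "a b a"), (2, "b c")]
counts = run_mapreduce(
    docs,
    mapper=lambda _, text: [(w, 1) for w in text.split()],
    reducer=lambda w, ones: [(w, sum(ones))],
)
```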
Distributed Co-Clustering Process
Mining Network Logs to Co-Cluster Communication Behavior
Distributed Co-Clustering Process
The Preprocessing Process

1. HDFS: raw network logs
2. MapReduce job: extract SrcIP + DstIP and build the adjacency matrix (rows = SrcIP, columns = DstIP, 0/1 entries)
3. HDFS: adjacency matrix
4. MapReduce job: build adjacency list
5. HDFS: adjacency list
6. MapReduce job: build transpose adjacency list
7. HDFS: transpose adjacency list
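As a rough sketch of what the preprocessing jobs compute (written sequentially here for clarity; the whitespace-separated "src dst" log-line format is an assumption, not taken from the paper):

```python
from collections import defaultdict

def build_adjacency(log_lines):
    """Extract (SrcIP, DstIP) pairs from raw log lines and build both
    the adjacency list (SrcIP -> set of DstIPs) and its transpose
    (DstIP -> set of SrcIPs), as the preprocessing pipeline does."""
    adj, adj_t = defaultdict(set), defaultdict(set)
    for line in log_lines:
        src, dst = line.split()[:2]  # assumed "src dst ..." format
        adj[src].add(dst)            # adjacency list
        adj_t[dst].add(src)          # transpose adjacency list
    return adj, adj_t

# Hypothetical log lines for illustration.
logs = ["10.0.0.1 10.0.0.2", "10.0.0.1 10.0.0.3", "10.0.0.4 10.0.0.2"]
adj, adj_t = build_adjacency(logs)
```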
Distributed Co-Clustering Process
Co-Clustering (Generalized Algorithm)

Input matrix:
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0

Goal: co-cluster into 2 × 2 = 4 sub-matrices
⇒ row labels: 1 or 2 (k = 2)
⇒ column labels: 1 or 2 (l = 2)

Random initialization:
r = (1, 1, 1, 2), i.e. r(1) = r(2) = r(3) = 1, r(4) = 2
c = (1, 1, 1, 2, 2)
G = [[g11, g12], [g21, g22]] = [[4, 4], [2, 0]]
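Given the labels, the co-cluster count matrix G can be recomputed directly; a small sketch using the slide's 1-based labels and the adjacency-list representation (row → set of nonzero columns):

```python
def group_matrix(adj, r, c, k, l):
    """Compute the k-by-l co-cluster count matrix G, where G[i][j] is
    the number of 1-entries falling in row cluster i+1, column cluster
    j+1, from adjacency lists and 1-based label maps r and c."""
    G = [[0] * l for _ in range(k)]
    for row, cols in adj.items():
        for col in cols:
            G[r[row] - 1][c[col] - 1] += 1
    return G

# The 4x5 slide example with its random initialization.
adj = {1: {2, 4, 5}, 2: {1, 3}, 3: {2, 4, 5}, 4: {1, 3}}
r = {1: 1, 2: 1, 3: 1, 4: 2}
c = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}
G = group_matrix(adj, r, c, 2, 2)  # matches the slide: [[4, 4], [2, 0]]
```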
Distributed Co-Clustering Process
Co-Clustering (Generalized Algorithm)

Fix column labels and iterate through the rows. Moving row 2 from row cluster 1 to 2 lowers the cost:

r(2): 1 → 2
r = (1, 2, 1, 2)
c = (1, 1, 1, 2, 2)
G = [[g11, g12], [g21, g22]] = [[2, 4], [4, 0]]

Permuted matrix (rows reordered 1, 3, 2, 4):
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0
Distributed Co-Clustering Process
Co-Clustering (Generalized Algorithm)

Fix row labels and iterate through the columns. Moving column 2 from column cluster 1 to 2 lowers the cost:

c(2): 1 → 2
r = (1, 2, 1, 2)
c = (1, 2, 1, 2, 2)
G = [[g11, g12], [g21, g22]] = [[0, 4], [4, 0]]

Permuted matrix:
0 0 1 1 1
0 0 1 1 1
1 1 0 0 0
1 1 0 0 0
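The row sweep above can be sketched as follows (the column sweep is symmetric). The cost function here is a per-block Bernoulli code length, chosen purely for illustration; the paper treats the objective as pluggable, so both the cost and the helper names are assumptions:

```python
import math

def bernoulli_cost(G, rsize, csize):
    """One possible objective: total code length of all blocks under a
    per-block Bernoulli model, weighted by block size."""
    total = 0.0
    for i, row in enumerate(G):
        for j, g in enumerate(row):
            n = rsize[i] * csize[j]
            if n == 0:
                continue
            p = g / n
            if 0 < p < 1:
                total -= n * (p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return total

def sweep_rows(adj, r, c, G, k, l, cost):
    """One row sweep: with column labels fixed, tentatively move each
    row to every other row cluster and keep any move that lowers cost."""
    rsize = [0] * k
    for row in adj:
        rsize[r[row] - 1] += 1
    csize = [0] * l
    for col in c:
        csize[c[col] - 1] += 1
    for row, cols in adj.items():
        g = [0] * l                     # the row's per-column-cluster counts
        for col in cols:
            g[c[col] - 1] += 1
        for lab in range(1, k + 1):
            old = r[row]
            if lab == old:
                continue
            before = cost(G, rsize, csize)
            for j in range(l):          # tentatively move the row's counts
                G[old - 1][j] -= g[j]
                G[lab - 1][j] += g[j]
            rsize[old - 1] -= 1
            rsize[lab - 1] += 1
            if cost(G, rsize, csize) < before:
                r[row] = lab            # keep the move
            else:
                for j in range(l):      # undo it
                    G[old - 1][j] += g[j]
                    G[lab - 1][j] -= g[j]
                rsize[old - 1] += 1
                rsize[lab - 1] -= 1
    return r, G

# The slide example: starting from the random initialization,
# one sweep moves only row 2, reproducing r = (1,2,1,2), G = [[2,4],[4,0]].
adj = {1: {2, 4, 5}, 2: {1, 3}, 3: {2, 4, 5}, 4: {1, 3}}
r = {1: 1, 2: 1, 3: 1, 4: 2}
c = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}
G = [[4, 4], [2, 0]]
r, G = sweep_rows(adj, r, c, G, 2, 2, bernoulli_cost)
```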
Distributed Co-Clustering Process
Co-Clustering with MapReduce

A MapReduce job converts the matrix into adjacency-list records, one per row (row → nonzero columns):

1 -> 2, 4, 5
2 -> 1, 3
3 -> 2, 4, 5
4 -> 1, 3
Distributed Co-Clustering Process
Co-Clustering with MapReduce

r, c, G are randomly initialized based on parameters k, l, and attached to every adjacency-list record:

1 -> 2, 4, 5 with (r, c, G)
2 -> 1, 3 with (r, c, G)
3 -> 2, 4, 5 with (r, c, G)
4 -> 1, 3 with (r, c, G)

r = (1, 1, 1, 2)
c = (1, 1, 1, 2, 2)
G = [[4, 4], [2, 0]]
Distributed Co-Clustering Process
Mapper inputs (key_in, value_in), one per row:
(1, {2, 4, 5})
(2, {1, 3})
(3, {2, 4, 5})
(4, {1, 3})

Mapper function, for each K-V input (k, v):
1. Calculate g_k (from v and c)
2. Change the row label r(k) if doing so results in a lower cost (a function of G)
3. Emit (r(k), (g_k, k))

Example for row 1:
v = {2, 4, 5}, c = (1, 1, 1, 2, 2) ⇒ g_1 = (1, 2)
Setting r(1) = 2 would raise the cost ⇒ r(1) stays 1
⇒ emit (r(k), (g_k, k)) = (1, ((1, 2), 1))
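The three mapper steps can be sketched as below; `choose_label` stands in for the cost-based label decision (step 2) and is a hypothetical parameter, not the paper's API:

```python
def row_mapper(row, cols, c, choose_label):
    """Row-iteration mapper sketch: compute the row's per-column-cluster
    counts g_k, pick the row label via the (hypothetical) choose_label
    routine, and emit (r(k), (g_k, k)) as on the slide."""
    l = max(c.values())
    g = [0] * l
    for col in cols:
        g[c[col] - 1] += 1
    label = choose_label(row, g)
    return [(label, (tuple(g), row))]

c = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}
# Reproducing the slide's row-1 example: the row keeps label 1
# because moving it would raise the cost.
out = row_mapper(1, {2, 4, 5}, c, choose_label=lambda row, g: 1)
# out == [(1, ((1, 2), 1))]
```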
Distributed Co-Clustering Process
Example for row 2 (same mapper function):
v = {1, 3}, c = (1, 1, 1, 2, 2) ⇒ g_2 = (2, 0)
Setting r(2) = 2 lowers the cost ⇒ r(2) = 2
⇒ emit (r(k), (g_k, k)) = (2, ((2, 0), 2))
Distributed Co-Clustering Process
All intermediate mapper outputs (key_intermediate, value_intermediate):
(1, ((1, 2), 1))
(2, ((2, 0), 2))
(1, ((1, 2), 3))
(2, ((2, 0), 4))

After the shuffle and sort phase, each reducer receives (key_r-in, [Value]_r-in):
(1, [((1, 2), 1), ((1, 2), 3)])
(2, [((2, 0), 2), ((2, 0), 4)])
Distributed Co-Clustering Process
Reducer inputs (key_r-in, [Value]_r-in):
(1, [((1, 2), 1), ((1, 2), 3)])
(2, [((2, 0), 2), ((2, 0), 4)])

Reducer function, for each K-V input (k, [V]) with elements (g, I) ∈ V:
1. Accumulate all g ∈ V into g_k
2. I_k = union of all I
3. Emit (k, (g_k, I_k))

Example for key 1:
g_1 = (1, 2) + (1, 2) = (2, 4)
I_1 = {1, 3}
⇒ emit (1, ((2, 4), {1, 3}))
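The reducer steps can be sketched directly, reproducing the slide's key-1 example:

```python
def row_reducer(label, values):
    """Row-iteration reducer sketch: sum the partial g vectors of all
    rows that ended up with this label, union their row ids, and emit
    (label, (g_label, I_label))."""
    g_sum = None
    rows = set()
    for g, row in values:
        if g_sum is None:
            g_sum = list(g)
        else:
            g_sum = [a + b for a, b in zip(g_sum, g)]
        rows.add(row)
    return (label, (tuple(g_sum), rows))

# Slide example: rows 1 and 3 both mapped to label 1 with g = (1, 2).
out = row_reducer(1, [((1, 2), 1), ((1, 2), 3)])
# out == (1, ((2, 4), {1, 3}))
```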
Distributed Co-Clustering Process
Reducer outputs (key_result, value_result):
(1, ((2, 4), {1, 3}))
(2, ((4, 0), {2, 4}))

Sync results:
r = (1, 2, 1, 2)
c = (1, 1, 1, 2, 2)
G = [[2, 4], [4, 0]]

Permuted matrix:
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0
Distributed Co-Clustering Process
The full process:

Preprocessing: HDFS → MapReduce job (build transpose adjacency list) → HDFS

Co-Clustering:
1. Randomly initialize r, c, G given k, l
2. MapReduce job: fix columns, row iteration, then sync results ⇒ synced r, c, G with the best r permutation
3. MapReduce job: fix rows, column iteration, then sync results ⇒ final co-clustering result with the best r, c permutations
Implementation Details
Tuning the number of Reduce Tasks
• The number of reduce tasks is related to the number of intermediate keys during the shuffle and sort phase
• For the co-clustering row-iteration/column-iteration jobs, the number of intermediate keys is either 𝑘 or 𝑙
Implementation Details
(Row-iteration MapReduce diagram repeated from the co-clustering process above.)

k = 2 (row-iterate) ⇒ only 2 intermediate keys
Implementation Details
Tuning the number of Reduce Tasks
• So, for the row-iteration/column-iteration jobs, 1 reduce task is enough
• However, some preprocessing tasks, such as graph construction, produce a large number of intermediate keys and need many more reduce tasks
Implementation Details
The Preprocessing Process (repeated from above): the graph-construction job emits one (SrcIP, [DstIP]) record per source IP, so the number of intermediate keys equals the number of distinct source IPs, far more than k or l
Experimental Evaluation
Environment
• 39 nodes in four different blade enclosures, connected by Gigabit Ethernet
• Blade server:
  - CPU: two dual-core Intel Xeon 2.66 GHz
  - Memory: 8 GB
  - OS: Red Hat Enterprise Linux
• Hadoop Distributed File System (HDFS) capacity: 2.4 TB
Experimental Evaluation
Datasets
Experimental Evaluation
Preprocessing ISS Data
• Optimal values for each setting:
  - Number of map tasks: 6
  - Number of reduce tasks: 5
  - Input split size: 256 MB
Experimental Evaluation
Co-Clustering TREC Data
• Beyond 25 nodes, each iteration takes roughly 20 ± 2 seconds, which is better than what a single machine with 48 GB of RAM can achieve
Conclusion
• The authors share lessons learned from data mining experience with vast quantities of data, particularly in the context of co-clustering, and recommend a distributed approach
• Designed a general MapReduce approach for co-clustering algorithms
• Showed that the MapReduce co-clustering framework scales well on real-world large datasets (ISS, TREC)
Discussion
• Necessity of the global r, c, G sync action
• Questionable scalability of DisCo
Discussion
Necessity of the global r, c, G sync action

Every row (or column) iteration MapReduce job must end with a global sync of r, c, G across all records before the next iteration job can start (pipeline repeated from the co-clustering process above); this global barrier between jobs is a potential bottleneck.
Discussion
Questionable Scalability of DisCo
• For row-iteration jobs (or column-iteration jobs), the number of intermediate keys is fixed to be 𝑘 (or 𝑙)
• This implies that for a given 𝑘 and 𝑙, as the input matrix gets larger, the reducer size* will increase dramatically
• Since a single reducer (a key plus all its associated values) is sent to one reduce task, the memory capacity of a computing node becomes a severe bottleneck for overall performance
*Reference: Upper and Lower Bounds on the Cost of a Map-Reduce Computation, VLDB 2013
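A back-of-the-envelope sketch of this concern (all numbers are purely illustrative, and a uniform label distribution is assumed):

```python
def reducer_payload(m, l, k):
    """Rough per-reducer load in a row-iteration job: with only k
    intermediate keys, each reduce task receives on the order of m/k
    row records, each carrying an (l+1)-element value (the g vector
    plus the row id). The load grows linearly with m for fixed k."""
    rows_per_reducer = m / k
    return rows_per_reducer * (l + 1)

small = reducer_payload(1_000, 10, 2)          # modest
large = reducer_payload(1_000_000_000, 10, 2)  # a million times bigger
```

Adding reduce tasks cannot help here, because the number of useful reducers is capped at k; only the number of rows per reducer scales with the input.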