Tree representation in map reduce world
![Page 1: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/1.jpg)
Tree Representation in MapReduce World
IPL weekly-seminar
Yu Liu@NII
2011-11-22
![Page 2: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/2.jpg)
Distributed File System of MapReduce
• A GFS/HDFS cluster consists of a single master (namenode) and multiple chunkservers (datanodes) and is accessed by multiple clients.
• The master maintains all filesystem metadata.
• Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the datanodes.
![Page 3: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/3.jpg)
![Page 4: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/4.jpg)
Distributed File System of MapReduce
• Architecture of GFS/HDFS
– Files are divided into fixed-size chunks
– Each chunk is identified by an immutable and globally unique 64 bit integer (chunk handle)
– Each chunk is replicated on multiple chunkservers
– Chunks of a file are placed as evenly as possible across the cluster.
(The Google File System, SOSP03)
![Page 5: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/5.jpg)
Apache HDFS: http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html#Introduction
![Page 6: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/6.jpg)
Inputs and Outputs of MapReduce
• The MapReduce framework operates exclusively on <key, value> pairs.
• Each pair is called a record.
• Applications specify the input/output locations and supply map and reduce functions and other job parameters, which together comprise the job configuration. The job client then submits the job to the framework.
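The record model above can be illustrated with a minimal in-memory sketch in Python. It is not the Hadoop API: `run_job`, `count_map` and `count_reduce` are hypothetical names, and the "framework" here is just a loop that maps, groups by key, and reduces.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Simulate one job in memory: map every record, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):          # sorted only to make the result stable
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output

def count_map(offset, line):
    # Input record: <byte offset, line of text>; output records: <word, 1>.
    for word in line.split():
        yield word, 1

def count_reduce(word, counts):
    # All values that share a key arrive together.
    yield word, sum(counts)

records = [(0, "a tree is a list"), (17, "a list is flat")]
result = run_job(records, count_map, count_reduce)
```

Every stage consumes and produces <key, value> records, which is why a tree has to be flattened into a list of records before a job can process it.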
![Page 7: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/7.jpg)
Tree Data Structure inside MapReduce
• Currently, GFS/HDFS favors flat data structures and files; structured files such as XML are not supported.
• We already know how to represent a file which contains a large list in HDFS (EuroPar2011)
• Tree representation is still a problem.
![Page 8: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/8.jpg)
How to Represent a Tree in MapReduce
• Suppose we can represent the tree by a list such that:
– When the list is split into arbitrary contiguous sublists, each sublist represents a subtree
– After any tree contraction operations on each subtree, the concatenated sublists still form a tree
• Then such a list is what we want.
![Page 9: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/9.jpg)
Tree Representation: Balanced Parenthesis
• Balanced Parentheses (BP) for an ordered tree (Munro and Raman, 2001)
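BP: ( ( ( ) ( ) ( ) ) ( ( ) ( ) ) )
[Figure: the ordered tree with root 1; node 1 has children 2 and 6, node 2 has children 3, 4, 5, and node 6 has children 7, 8. The same tree is shown again, labeled "Outer-planar sequence".]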
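As a minimal sketch of the encoding, the Python below writes one "(" when a node is entered and one ")" when it is left, in depth-first order. The nested `(id, children)` tuple used for the tree and the name `to_bp` are assumptions made for illustration.

```python
# The example tree: 1 has children 2 and 6; 2 has 3, 4, 5; 6 has 7, 8.
TREE = (1, [(2, [(3, []), (4, []), (5, [])]),
            (6, [(7, []), (8, [])])])

def to_bp(node):
    """Return the balanced-parentheses string of an ordered tree."""
    _node_id, children = node
    return "(" + "".join(to_bp(child) for child in children) + ")"

bp = to_bp(TREE)
```

Each node contributes exactly two brackets, so a tree of 8 nodes yields a sequence of 16 brackets.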
![Page 10: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/10.jpg)
BP Can Be A Solution
• A tree node can be represented by a pair of parentheses:
• node = ( ‘(’ , ‘)’ )
• We want to represent a list of nodes, so the nodes should be sortable
• data HalfNode = HalfNode {lr::Char, id::Int, index::Int}
– E.g.: left1 : HalfNode {lr=‘L’, id=1, index=0},
right1 : HalfNode {lr=‘R’, id=1, index=15}
• data Node = Node {left::HalfNode, right::HalfNode}
– E.g.: the root ① : Node {left=left1, right=right1}
![Page 11: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/11.jpg)
Parenthesis / HalfNode
• For simplicity we define
data HalfNode = (Bool,Int,Int)
leftPar: (False, _,_)
rightPar: (True, _,_)
so that a node can be expressed by two HalfNodes,
E.g.: the root ① : { (False, 1 , 0) , (True, 1, 15) }
the node ②: {(False, 2 , 1) , (True, 2, 8) }
the node ⑦: {(False, 7 , 10) , (True, 7, 11) }
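A possible way to produce such HalfNodes, sketched in Python with the same `(Bool, Int, Int)` tuples: walk the tree depth-first and number the brackets in the order they are written. The tree encoding and the name `half_nodes` are assumptions, not part of the slides.

```python
TREE = (1, [(2, [(3, []), (4, []), (5, [])]),
            (6, [(7, []), (8, [])])])

def half_nodes(tree):
    """Return (is_right, node_id, index) tuples in bracket order."""
    out = []

    def visit(node):
        node_id, children = node
        out.append((False, node_id, len(out)))   # left parenthesis
        for child in children:
            visit(child)
        out.append((True, node_id, len(out)))    # right parenthesis

    visit(tree)
    return out

hn = half_nodes(TREE)
```

The index of a HalfNode is simply its position in the BP sequence, so the root receives the first and the last index.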
![Page 12: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/12.jpg)
Comparable HalfNode
• A set of HalfNodes can be sorted by index to get a BP sequence
– data HalfNode = (Bool, Int, Int)
– We know whether each bracket is left or right
• { (False, 1, 0), (False, 2, 1), (False, 3, 2), (True, 3, 3), (False, 4, 4), (True, 4, 5),
(False, 5, 6), (True, 5, 7), (True, 2, 8), (False, 6, 9), (False, 7, 10), (True, 7, 11),
(False, 8, 12), (True, 8, 13), (True, 6, 14), (True, 1, 15) }
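Because the index is part of every HalfNode, the order in which the records arrive does not matter. The Python sketch below shuffles the set, sorts it by index, and prints the brackets again; `to_bp_string` is a hypothetical helper name.

```python
import random

HALF_NODES = [
    (False, 1, 0), (False, 2, 1), (False, 3, 2), (True, 3, 3),
    (False, 4, 4), (True, 4, 5), (False, 5, 6), (True, 5, 7),
    (True, 2, 8), (False, 6, 9), (False, 7, 10), (True, 7, 11),
    (False, 8, 12), (True, 8, 13), (True, 6, 14), (True, 1, 15),
]

def to_bp_string(half_nodes):
    """Sort HalfNodes by index and render them as parentheses."""
    ordered = sorted(half_nodes, key=lambda h: h[2])
    return "".join(")" if is_right else "(" for is_right, _id, _index in ordered)

shuffled = HALF_NODES[:]
random.Random(0).shuffle(shuffled)   # simulate records arriving in any order
restored = to_bp_string(shuffled)
```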
![Page 13: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/13.jpg)
Sub Trees
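• A subsequence indicates a subtree:
( ( ( ) ( ) ( ) | ) ( ( ) ( ) ) )
Opening brackets left of the split: nodes 1 2 3 4 5
[Figure: the full tree; root 1 has children 2 and 6, node 2 has children 3, 4, 5, and node 6 has children 7, 8]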
![Page 14: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/14.jpg)
Sub Trees
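• A subsequence indicates a subtree:
( ( ( ) ( ) ( ) | ) ( ( ) ( ) ) )
Left sublist: ( ( ( ) ( ) ( ) with the opening brackets of nodes 1 2 3 4 5; after contraction: ( ( with nodes 1 2
[Figure: the partial tree of the left sublist, nodes 1 to 5 with dummy nodes d standing in for the part outside the sublist, and the contracted tree with nodes 1, 2 and dummy nodes d]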
![Page 15: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/15.jpg)
Sub Trees
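• A subsequence indicates a subtree:
( ( ( ) ( ) ( ) | ) ( ( ) ( ) ) )
Right sublist: ) ( ( ) ( ) ) ) with brackets of nodes 2 6 7 8
[Figure: the tree with its root replaced by a dummy node d; below it nodes 2 and 6, then 3, 4, 5, 7, 8]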
![Page 16: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/16.jpg)
Sub Trees
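• A subsequence indicates a subtree:
( ( ( ) ( ) ( ) | ) ( ( ) ( ) ) )
Right sublist: ) ( ( ) ( ) ) ) with brackets of nodes 2 6 7 8; after contraction: ) ( ) ) with nodes 2 6
[Figure: the partial tree of the right sublist with nodes 1, 6, 7, 8, and the contracted tree with nodes 1, 6]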
![Page 17: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/17.jpg)
Bottom-up Tree contraction
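• When we concatenate the two contracted sublists:
( ( ) ( ) ) with nodes 1 2 2 6
[Figure: the contracted left tree (1, 2 and dummy nodes d) and the contracted right tree (1, 6) combine into the tree with root 1 and children 2 and 6]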
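The contraction in this example can be reproduced with a small Python sketch. It assumes that one contraction step removes every leaf whose two HalfNodes lie next to each other inside the same sublist; the names `rake` and `render` are hypothetical, and this is only one possible reading of the slides.

```python
SEQ = [
    (False, 1, 0), (False, 2, 1), (False, 3, 2), (True, 3, 3),
    (False, 4, 4), (True, 4, 5), (False, 5, 6), (True, 5, 7),
    (True, 2, 8), (False, 6, 9), (False, 7, 10), (True, 7, 11),
    (False, 8, 12), (True, 8, 13), (True, 6, 14), (True, 1, 15),
]

def rake(sublist):
    """Drop each leaf whose "(" and ")" are adjacent inside this sublist."""
    out = []
    i = 0
    while i < len(sublist):
        cur = sublist[i]
        nxt = sublist[i + 1] if i + 1 < len(sublist) else None
        if nxt is not None and not cur[0] and nxt[0] and cur[1] == nxt[1]:
            i += 2                      # skip the matched pair of one leaf
        else:
            out.append(cur)
            i += 1
    return out

def render(half_nodes):
    return "".join(")" if is_right else "(" for is_right, _id, _index in half_nodes)

left, right = SEQ[:8], SEQ[8:]          # split after the eighth bracket
merged = rake(left) + rake(right)       # contract each part, then concatenate
```

Here `rake(left)` keeps the two opening brackets of nodes 1 and 2, `rake(right)` keeps `) ( ) )`, and their concatenation is again a balanced sequence. For this particular split it equals the result of contracting the whole sequence at once.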
![Page 18: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/18.jpg)
Sub Trees
• { (False, 1, 0), (False, 2, 1), (False, 3, 2),
(True, 3, 3), (False, 4, 4), (True, 4, 5),
(False, 5, 6), (True, 5, 7), (True, 2, 8),
(False, 6, 9), (False, 7, 10), (True, 7, 11),
(False, 8, 12), (True, 8, 13), (True, 6, 14),
(True, 1, 15) }
![Page 19: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/19.jpg)
Splitting and Grouping
• We can split a list and group the elements of each sublist in MapReduce.
– We extend data HalfNode = (Bool, Int, (Int, Int))
– Here (Int, Int) is (index / d, index)
• For the BP-MR sequence, this means we can split a tree by the number of brackets
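A minimal Python sketch of this grouping, under the assumption that the first component of the pair is the integer quotient of the index by a sublist length `d`, so that it can serve as a grouping key. The names `with_group_key` and `split_into_groups` are hypothetical.

```python
from itertools import groupby

HALF_NODES = [
    (False, 1, 0), (False, 2, 1), (False, 3, 2), (True, 3, 3),
    (False, 4, 4), (True, 4, 5), (False, 5, 6), (True, 5, 7),
    (True, 2, 8), (False, 6, 9), (False, 7, 10), (True, 7, 11),
    (False, 8, 12), (True, 8, 13), (True, 6, 14), (True, 1, 15),
]

def with_group_key(half_nodes, d):
    """Extend each HalfNode to (Bool, Int, (index // d, index))."""
    return [(is_right, node_id, (index // d, index))
            for is_right, node_id, index in half_nodes]

def split_into_groups(keyed):
    """Group extended HalfNodes by the first component of their key pair."""
    ordered = sorted(keyed, key=lambda h: h[2])
    return {group: list(items)
            for group, items in groupby(ordered, key=lambda h: h[2][0])}

groups = split_into_groups(with_group_key(HALF_NODES, d=8))
```

With `d=8` the 16 brackets fall into two groups of 8, which is the split used in the subtree examples.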
![Page 20: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/20.jpg)
Practical Data
• Real data are attached to the left half-node
– data HalfNode = ((Int, Int), Bool, Int, Map)
– For right half-nodes, the Map is always empty/null
![Page 21: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/21.jpg)
Bottom-up Build a Tree
• A list of items as input
• Make a sparse list of “leaves”, e.g.:
[ ((0, 100), False, 100, data1), ((0, 101), True, 100, null), ((0, 200), False, 200, data2), ((0, 201), True, 200, null) .. ]
( 100 ) ( 200 ) ( 300 ) ( 400 ) ….
• Insert parents: ( ( 100 ) ( 200 ) ) ( ( 300 ) ( 400 ) ) ….
Parent labels in the figure: 50, 250
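The leaf step can be sketched in Python as follows. The spacing of 100 between leaves is taken from the example; the function name `make_leaves`, the `step` parameter and the use of `None` for an empty Map are assumptions. Inserting the parents is left out, because the slides do not give their right-hand indices.

```python
def make_leaves(items, step=100):
    """Turn each item into a leaf: a left HalfNode with data, a right one without."""
    leaves = []
    for k, item in enumerate(items, start=1):
        node_id = k * step
        leaves.append(((0, node_id), False, node_id, item))      # "(" carries the data
        leaves.append(((0, node_id + 1), True, node_id, None))   # ")" carries none
    return leaves

leaves = make_leaves(["data1", "data2"])
```

Leaving gaps between the leaf indices means a parent can later take an unused index before its first child, and another one after its last child, without renumbering the leaves.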
![Page 22: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/22.jpg)
Examples
• XML
– An XML file is essentially a BP representation
An example of an XML file:
<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget </body>
</note>
![Page 23: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/23.jpg)
Examples
• An XML file can easily be transformed to BP-MR
– Operations:
• Query
– by XPath
– by id / index
• Parallel parsing?
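A sequential sketch of the XML-to-BP transformation in Python, using the standard library's `xml.etree.ElementTree`. Using the tag name in place of a numeric id, and the name `xml_half_nodes`, are assumptions; parallel parsing is not attempted here.

```python
import xml.etree.ElementTree as ET

XML = """<?xml version="1.0"?><note>
<to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget </body>
</note>"""

def xml_half_nodes(root):
    """Emit (is_right, tag, index): one pair of HalfNodes per XML element."""
    out = []

    def visit(elem):
        out.append((False, elem.tag, len(out)))   # opening tag acts as "("
        for child in elem:
            visit(child)
        out.append((True, elem.tag, len(out)))    # closing tag acts as ")"

    visit(root)
    return out

nodes = xml_half_nodes(ET.fromstring(XML))
bp = "".join(")" if is_right else "(" for is_right, _tag, _index in nodes)
```

Opening and closing tags map one-to-one onto left and right parentheses, which is the sense in which an XML file already is a BP sequence.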
![Page 24: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/24.jpg)
Hierarchical Clustering
• This work is not finished
• Clustering algorithms usually fall into two categories: hierarchical and partitioning
• The more popular hierarchical agglomerative clustering (HAC) algorithms use a bottom-up approach to merge items into a hierarchy of clusters.
![Page 25: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/25.jpg)
Hierarchical Clustering
![Page 26: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/26.jpg)
Hierarchical Clustering
• Average-link is one of the most popular algorithms for hierarchical clustering
• Average link: the distance between two clusters is the average distance over all pairs of points that have one point in each cluster
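The average-link distance can be written down directly. The Python sketch below assumes points are coordinate tuples and uses Euclidean distance; the function name `average_link` is hypothetical.

```python
from itertools import product
from math import dist

def average_link(cluster_a, cluster_b):
    """Average Euclidean distance over all pairs with one point in each cluster."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

a = [(0, 0), (0, 4)]
b = [(3, 0)]
d_ab = average_link(a, b)   # the two pair distances are 3 and 5
```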
![Page 27: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/27.jpg)
GTA Algorithm for Hierarchical Clustering
Currently only for the first merge step
• Initial data are a set of items
• map makeNode items
where makeNode item = ((0,0), False, 1, item), ((0,0), True, 1, item)
• Input is a BP-MR sequence, but only the left halves
• Generate: all possible bags
• Test: keep only pairs
• Aggregate: the minimum-distance pair
• Post-process: a new HalfNode pair that is the parent of the aggregate’s result
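The four phases of the first merge step can be spelled out naively in Python. This is only an illustration of the generate/test/aggregate structure, not the fused GTA implementation: it enumerates every subset, so it is exponential in the number of items. The Euclidean distance, the `parent_id` value and the parent's payload are assumptions.

```python
from itertools import chain, combinations
from math import dist

def make_node(item):
    # As on the slide: a left and a right HalfNode for one item.
    return (((0, 0), False, 1, item), ((0, 0), True, 1, item))

def generate(lefts):
    """Generate: every bag (here: every subset) of the left HalfNodes."""
    return chain.from_iterable(
        combinations(lefts, k) for k in range(len(lefts) + 1))

def first_merge(items, parent_id=2):
    lefts = [make_node(item)[0] for item in items]               # left halves only
    pairs = [bag for bag in generate(lefts) if len(bag) == 2]    # Test
    best = min(pairs, key=lambda p: dist(p[0][3], p[1][3]))      # Aggregate
    merged = (best[0][3], best[1][3])
    # Post-process: a hypothetical parent HalfNode pair for the merged items.
    parent = (((0, 0), False, parent_id, merged),
              ((0, 0), True, parent_id, None))
    return merged, parent

merged, parent = first_merge([(0, 0), (5, 5), (1, 0), (9, 9)])
```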
![Page 28: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/28.jpg)
Problems
• Insertion is hard
– Appending to the tail is easy, but inserting anywhere else is difficult
• Generating BP-MR sequences in parallel
– Idea: first generate the skeleton of a tree
![Page 29: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/29.jpg)
Skeletons of a Tree
• For example
a = 1000, b = 2000, c= 3000 …
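index_a = 1000, index_b = 2000, index_c = 3000 …
index_a’ = 8000, index_b’ = 4000, index_c’ = 6000 …
[Figure: tree with root 1; node 1 has children a and e, node a has children b, c, d, and node e has children f, g]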
![Page 30: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/30.jpg)
End
• Thanks
![Page 31: Tree representation in map reduce world](https://reader031.fdocuments.in/reader031/viewer/2022030314/589bcece1a28ab92618b5423/html5/thumbnails/31.jpg)