Giraph at Hadoop Summit 2014
![Page 1: Giraph at Hadoop Summit 2014](https://reader035.fdocuments.in/reader035/viewer/2022062613/541f2f387bef0ab16e8b4657/html5/thumbnails/1.jpg)
Apache Giraph
Large-scale Graph Processing on Hadoop
Claudio Martella @claudiomartella
Hadoop Summit @ Amsterdam - 3 April 2014
Graphs are simple
A computer network
A social network
A semantic network
A map
Graphs are huge
•Google’s index contains 50B pages
•Facebook has around 1.1B users
•Google+ has around 570M users
•Twitter has around 530M users
VERY rough estimates!
Graphs aren’t easy
Graphs are nasty.
Each vertex depends on its neighbours, recursively.
Recursive problems are nicely solved iteratively.
PageRank in MapReduce
•Record: < v_i, pr, [ v_j, ..., v_k ] >
•Mapper: emits < v_j, pr / #neighbours >
•Reducer: sums the partial values
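The record, mapper and reducer above can be sketched in plain Python as a single simulated job. This is only an illustration under stated assumptions: the function names, the in-memory shuffle, the damping factor of 0.85 and the three-vertex graph are all hypothetical and not taken from the slides, and every vertex is assumed to have its own record.

```python
from collections import defaultdict


def pagerank_map(record):
    """Mapper: emit the adjacency list plus one rank share per neighbour."""
    vertex, rank, neighbours = record
    # The structure must travel with the values so the reducer can rebuild the record.
    yield vertex, ("structure", neighbours)
    for neighbour in neighbours:
        yield neighbour, ("rank", rank / len(neighbours))


def pagerank_reduce(vertex, values, damping, num_vertices):
    """Reducer: sum the partial values and rebuild < v_i, pr, [neighbours] >."""
    neighbours, total = [], 0.0
    for kind, payload in values:
        if kind == "structure":
            neighbours = payload
        else:
            total += payload
    # Damping is an assumption; the slide only says the reducer sums the values.
    rank = (1.0 - damping) / num_vertices + damping * total
    return vertex, rank, neighbours


def pagerank_iteration(records, damping=0.85):
    """One full job: map, shuffle (group by key), reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in pagerank_map(record):
            shuffled[key].append(value)
    return [pagerank_reduce(vertex, values, damping, len(records))
            for vertex, values in sorted(shuffled.items())]


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a.
web = [("a", 1 / 3, ["b", "c"]), ("b", 1 / 3, ["c"]), ("c", 1 / 3, ["a"])]
for _ in range(20):  # each loop turn stands for one complete MapReduce job
    web = pagerank_iteration(web)
ranks = {vertex: rank for vertex, rank, _ in web}
```

Note that the adjacency list is re-emitted on every iteration, and that each turn of the loop corresponds to a separate job with its own input, shuffle and output.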
MapReduce dataflow
Drawbacks
•Each job is executed N times
•Job bootstrap
•Mappers send PR values and structure
•Extensive IO at input, shuffle & sort, output
Timeline
•Inspired by Google Pregel (2010)
•Donated to ASF by Yahoo! in 2011
•Top-level project in 2012
•1.0 release in January 2013
•1.1 release expected within days (2014)
Plays well with Hadoop
Vertex-centric API
BSP machine
BSP & Giraph
Advantages
•No locks: message-based communication
•No semaphores: global synchronization
•Iteration isolation: massively parallelizable
Architecture
Giraph job lifetime
Designed for iterations
•Stateful (in-memory)
•Only intermediate values (messages) sent
•Hits the disk at input, output, checkpoint
•Can go out-of-core
A bunch of other things
•Combiners (minimise messages)
•Aggregators (global aggregations)
•MasterCompute (executed on master)
•WorkerContext (executed per worker)
•PartitionContext (executed per partition)
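As a rough illustration of the first two items, the sketch below shows what a min-combiner and a sum aggregator do conceptually. It is plain Python, not the Giraph API: `combine_min`, `SumAggregator` and the sample data are hypothetical names and values chosen for this example.

```python
def combine_min(messages_by_target):
    """Min-combiner: collapse all messages bound for a vertex into the smallest one."""
    return {target: [min(messages)]
            for target, messages in messages_by_target.items() if messages}


class SumAggregator:
    """Global sum: each worker folds in local values, the master merges the partials."""

    def __init__(self):
        self.value = 0

    def aggregate(self, amount):
        self.value += amount

    def merge(self, other):
        self.value += other.value


# Combiner: six outgoing messages shrink to one per target vertex.
outgoing = {"v1": [7.0, 3.0, 9.0], "v2": [4.0], "v3": [2.0, 2.5]}
combined = combine_min(outgoing)
sent_before = sum(len(messages) for messages in outgoing.values())
sent_after = sum(len(messages) for messages in combined.values())

# Aggregator: two simulated workers produce partial sums that are merged globally.
worker_partials = []
for local_values in ([1, 2, 3], [10, 20]):
    partial = SumAggregator()
    for value in local_values:
        partial.aggregate(value)
    worker_partials.append(partial)

global_sum = SumAggregator()
for partial in worker_partials:
    global_sum.merge(partial)
```

A combiner of this kind is only safe when the vertex program needs just the combined value, as is the case for a minimum in shortest-path computations.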
Shortest Paths
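A minimal sketch of single-source shortest paths written in the vertex-centric, superstep-by-superstep style. It is a single-process Python simulation and not Giraph code: `compute`, `run_bsp` and the four-vertex graph are hypothetical, and edge weights are assumed to be non-negative.

```python
import math


def compute(vertex_id, value, out_edges, messages, superstep, source):
    """Vertex program: return (new_value, messages_to_send)."""
    candidate = 0.0 if (superstep == 0 and vertex_id == source) else math.inf
    candidate = min([candidate, *messages])
    if candidate < value:
        # A shorter path was found: store it and tell the neighbours.
        return candidate, [(target, candidate + weight) for target, weight in out_edges]
    return value, []  # nothing improved, so the vertex stays quiet


def run_bsp(graph, source):
    """Run supersteps until no messages are in flight."""
    values = {vertex: math.inf for vertex in graph}
    inbox = {vertex: [] for vertex in graph}
    superstep = 0
    while superstep == 0 or any(inbox.values()):
        outbox = {vertex: [] for vertex in graph}
        for vertex in graph:
            # After superstep 0, only vertices that received messages are active.
            if superstep > 0 and not inbox[vertex]:
                continue
            values[vertex], sent = compute(
                vertex, values[vertex], graph[vertex], inbox[vertex], superstep, source)
            for target, message in sent:
                outbox[target].append(message)
        inbox = outbox  # global barrier: messages become visible in the next superstep
        superstep += 1
    return values


# Hypothetical weighted graph: vertex -> list of (target, weight).
roads = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0), ("d", 6.0)],
    "c": [("d", 1.0)],
    "d": [],
}
distances = run_bsp(roads, "a")
```

Each vertex reads only its own state and the messages delivered at the previous barrier, which is why no locks are needed inside a superstep.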
Composable API
Checkpointing
No SPoFs
Giraph scales
ref: https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
Giraph is fast
• 100x faster than MapReduce (PageRank)
• jobs run within minutes
• given you have resources ;-)
Serialised objects
Primitive types
•Autoboxing is expensive
•Objects overhead (JVM)
•Use primitive types on your own
•Use primitive types-based libs (e.g. fastutil)
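The slide is about the JVM, but the cost of boxing can be illustrated by analogy in Python, where a `list` of floats stores a pointer to one object per element while `array.array("d")` stores raw 8-byte doubles. The sketch below assumes CPython (where `sys.getsizeof` is available); the element count is arbitrary and the numbers say nothing about actual JVM overheads.

```python
import sys
from array import array

count = 100_000

# "Boxed": a list of pointers, each referring to a separate float object.
boxed = [float(i) for i in range(count)]
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)

# "Primitive": one contiguous buffer of raw doubles.
primitive = array("d", boxed)
primitive_bytes = sys.getsizeof(primitive)

print(f"boxed:     {boxed_bytes / count:.1f} bytes per element")
print(f"primitive: {primitive_bytes / count:.1f} bytes per element")
```

The same reasoning motivates primitive-keyed collections on the JVM: fewer objects means less memory per element and less work for the garbage collector.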
Sharded aggregators
Many stores with Gora
And graph databases
Current and next steps
•Out-of-core graph and messages
•Jython interface
•Remove Writable from < I V E M >
•Partitioned supernodes
•More documentation
Giraph in Action
• Published by Manning
• MEAP now
• Complete Q3 2014 (well...)
• Part 1: Graphs and Algorithms
• Part 2: Giraph Basic Topics
• Part 3: Giraph Advanced Topics
• http://www.manning.com/martella
Okapi
• Apache Mahout for graphs
•Graph-based recommenders: ALS, SGD, SVD++, etc.
•Graph analytics: Graph partitioning, Community Detection, K-Core, etc.
Thank you
Claudio Martella @claudiomartella
http://giraph.apache.org
Some figures gently borrowed from Nitay Joffe: http://www.slideshare.net/nitayj/20130910-giraph-at-london-hadoop-users-group