Apache Giraph
-
Upload
ahmet-emre-aladag -
Category
Career
-
view
1.219 -
download
1
description
Transcript of Apache Giraph
![Page 1: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/1.jpg)
A Distributed Graph-Processing Library
Ahmet Emre Aladağ - AGMLab26.08.2013
![Page 2: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/2.jpg)
● Library for large-scale graph processing.● Runs on Apache Hadoop with Map Jobs● Bulk Synchronous Parallel (BSP) model
What is Giraph?
1incoming messages
outgoing messages
0.2
0.53
0.320.16
0.12
0.34
Vertex computation
![Page 3: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/3.jpg)
Uses
● PageRank-variant iterative algorithms● Graph clustering
○ Label propagation○ Max Clique○ Triangle Closure○ Finding related people, groups, interests.
● Shortest-Path○ Single source, s-t, all to all
● Finding Connected Components
![Page 4: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/4.jpg)
Alternatives
● Map-Reduce jobs on Hadoop○ Not a good fit for graph algorithms: overhead.
● Google Pregel○ Requires its own infrastructure○ Not available○ Master is single point of failure.
● Message Passing Interface (MPI)○ Not fault-tolerant○ Too generic
![Page 5: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/5.jpg)
How Giraph differs
● You can use a Hadoop cluster, no need for special infrastructure.
● Easy deployment with Amazon EMR● Dynamic resource management● Graph oriented API● Open Source● Fault Tolerant, no SPOF except Hadoop
namenode and jobtracker● Jython Support
![Page 6: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/6.jpg)
Layers
![Page 7: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/7.jpg)
Mechanism
InputFormat/Reader
Input
Computation OutputFormat/Writer
Output
● Accumulo● HBase● HCatalog● HDFS● Hive● Neo4j etc.
● Accumulo● HBase● HCatalog● HDFS● Hive● Neo4j etc.● GraphViz
Adjacency matrix, id-value pairs, JSON
![Page 8: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/8.jpg)
InputFormat
● VertexInputFormat1;3.42;6.13;2.7
● EdgeInputFormat1;22;31;3
1 2 3
3.4 6.1 2.7
1 2 3
![Page 9: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/9.jpg)
Computation● Superstep barriers.● Send/Receive messages from neighbors● Update value.● Vote to halt or wake up.
Single-Source Shortest Path Example
![Page 10: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/10.jpg)
Shortest-Path Computation Code
Note: old API
![Page 11: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/11.jpg)
Ex: Finding the maximum value
![Page 12: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/12.jpg)
Aggregators
● Shared variables among the workers.● Each vertex computation can add/multiply a
value to aggregators. ● Examples:
○ Holding the min/max value among all vertices○ Holding sum of the vertex values.○ Holding average value of vertex values.○ Holding sum of mean square errors and stdev.
1 2 30.2 0.6
0.45
1.25
Computation at Iteration k
![Page 13: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/13.jpg)
MasterCompute Class
● Master’s compute() always runs before the slaves (like pre-superstep)
○ In compute: aggregate vertex values: sum of values○ In MasterCompute: average=sum/N
● Aggregators are registered here.
● You can set values to aggregators.
![Page 14: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/14.jpg)
Worker Context
● Allows for the execution of user code on a per-worker basis.
● There's one WorkerContext per worker.
● Methods for Pre/post superstep/application operations.
![Page 15: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/15.jpg)
Flexible Edge/Vertex Input
● Read edges/vertices from different sources.● Multiple input resources
![Page 16: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/16.jpg)
Parallel Computing
● More map jobs (workers) = parallel computing
● To overcome slowest worker problem, multithreading is applied on input/computation/output
● Linear speedup in CPU-bound applications such as k-means clustering due to multithreading
● Take a set of entrie machines & use multithreading to maximize resource utilization.
![Page 17: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/17.jpg)
Memory Optimization
● Vertices and edges are stored as serialized byte arrays.
● Used FastUtil-based Java primitives.
![Page 18: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/18.jpg)
Sharded Aggregators
● Each aggregator is randomly assigned to one of the workers.
● The assigned worker is in charge of gathering the values of its aggregators from all workers, performing the aggregation, and distributing the final values to other workers.
● Aggregation responsibilities are balanced across all workers rather than bottlenecked by the master.
![Page 19: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/19.jpg)
Performance● PageRank on 1 trillion edges with 200 commodity
machines: 4 minutes/iteration.● K-Means on 1 billion input vectors x 100 features into
10.000 centroids: 10 minutes.● Linear Scalability
![Page 20: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/20.jpg)
Currently
● Version 1.0, on the way to 1.1● Changing rapidly: backwards-incompatible
changes● Documentation not mature yet.● More algorithms to be contributed.● More data sources to be ported.● http://giraph.apache.org for more info
![Page 21: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/21.jpg)
References
Giraph: Large-scale graph processing infrastructure on Hadoop, 2011
Scaling Apache Giraph to a trillion edges, Avery Ching, Facebook, 2013
Scaling Apache Giraph, Nitay Joffe, Facebook, 2013.
Giraph: http://giraph.apache.org
![Page 22: Apache Giraph](https://reader035.fdocuments.in/reader035/viewer/2022081720/53ecb5b48d7f7289708b5879/html5/thumbnails/22.jpg)
Questions
?