Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!


The genesis of Hadoop was in analyzing massive amounts of data with a MapReduce framework. SQL-on-Hadoop followed shortly after that, paving the way to the whole schema-on-read notion. Discovering graph relationships in your data is the next logical step. Apache Giraph (modeled on Google’s Pregel) lets you apply the power of the BSP approach to unstructured data. In this talk we will focus on practical advice: how to get up and running with Apache Giraph, how to start analyzing simple data sets with the built-in algorithms, and finally how to implement your own graph processing applications using the APIs provided by the project. We will then dive into how Giraph integrates with the Hadoop ecosystem (Hive, HBase, Accumulo, etc.) and will also provide a whirlwind tour of the Giraph architecture.

Transcript of Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

Who’s this guy?

Roman Shaposhnik

• ASF junkie
• VP of Apache Incubator, former VP of Apache Bigtop
• Hadoop/Sqoop/Giraph committer
• contributor across the Hadoop ecosystem

• Used to be root@Cloudera
• Used to be a PHB at Yahoo!
• Used to be a UNIX hacker at Sun Microsystems

Giraph in Action (MEAP)

http://manning.com/martella/

What’s this all about?

Agenda

• A brief history of Hadoop-based bigdata management
• Extracting graph relationships from unstructured data
• A case for iterative and explorative workloads
• Bulk Synchronous Parallel to the rescue!
• Apache Giraph: a Hadoop-based BSP graph analysis framework
• Giraph application development
• Demos! Code! Lots of it!

On day one Doug created HDFS/MR

Google papers

• GFS (file system)
  • distributed
  • replicated
  • non-POSIX

• MapReduce (computational framework)
  • distributed
  • batch-oriented (long jobs; final results)
  • data-gravity aware
  • designed for “embarrassingly parallel” algorithms

One size doesn’t fit all

• Key-value approach (see the word-count sketch below)
  • map is how we get the keys
  • shuffle is how we sort the keys
  • reduce is how we get to see all the values for a key

• Pipeline approach
  • intermediate results in a pipeline need to be flushed to HDFS
  • a very particular “API” for working with your data
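To make that key-value contract concrete, here is a minimal word-count sketch against the classic org.apache.hadoop.mapreduce API; the class names are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: emit (word, 1) for every word -- this is "how we get the keys"
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    for (String word : line.toString().split("\\s+")) {
      ctx.write(new Text(word), ONE);
    }
  }
}

// reduce: the shuffle has already grouped values by key;
// only here do we finally "see all the values for a key"
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    ctx.write(word, new IntWritable(sum));
  }
}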

It’s not about the size of your data;it’s about what you do with it!

Graph relationships

• Entities in your data: tuples
  • customer data
  • product data
  • interaction data

• Connections between entities: graphs
  • the social network of my customers
  • clustering of customers vs. products

Challenges

• Data is dynamic
  • no way of doing “schema on write”

• Combinatorial explosion of datasets
  • relationships grow exponentially

• Algorithms become
  • explorative
  • iterative

Graph databases

• Plenty available
  • Neo4J, Titan, etc.

• Benefits
  • tightly integrated systems with few moving parts
  • high performance on known data sets

• Shortcomings
  • don’t integrate with HDFS
  • combine storage and computational layers
  • a sea of APIs

Enter Apache Giraph

Key insights

• Keep state in memory for as long as needed
• Leverage HDFS as a repository for unstructured data
• Allow for maximum parallelism (shared nothing)
• Allow for arbitrary communications
• Leverage the BSP approach

Bulk Synchronous Parallel

[Diagram: BSP execution over time — rounds of local processing followed by communications, separated by barriers #1, #2, and #3]

BSP applied to graphs

[Diagram: a small example graph with vertices @rhatr, @TheASF, and @c0sin]

Think like a vertex (each power maps onto an API call; see the sketch below):
• I know my local state
• I know my neighbours
• I can send messages to vertices
• I can declare that I am done
• I can mutate graph topology
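A schematic sketch of those five powers; the class name and the self-edge mutation are purely illustrative:

import org.apache.giraph.edge.Edge;
import org.apache.giraph.edge.EdgeFactory;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

public class VertexPowers extends
    BasicComputation<IntWritable, IntWritable, NullWritable, IntWritable> {

  @Override
  public void compute(Vertex<IntWritable, IntWritable, NullWritable> vertex,
      Iterable<IntWritable> messages) {
    IntWritable myState = vertex.getValue();              // I know my local state
    for (Edge<IntWritable, NullWritable> e : vertex.getEdges()) {  // I know my neighbours
      sendMessage(e.getTargetVertexId(), myState);        // I can send messages to vertices
    }
    if (getSuperstep() == 0) {
      // I can mutate graph topology (an illustrative self-edge)
      vertex.addEdge(EdgeFactory.create(vertex.getId(), NullWritable.get()));
    }
    vertex.voteToHalt();                                  // I can declare that I am done
  }
}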

Bulk Synchronous Parallel

[Diagram: supersteps over time — within superstep #1 each vertex processes individually and sends messages; messages are delivered before superstep #2; the computation ends when all vertices are “done”]

Giraph “Hello World”

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

public class GiraphHelloWorld extends
    BasicComputation<IntWritable, IntWritable, NullWritable, NullWritable> {

  @Override
  public void compute(Vertex<IntWritable, IntWritable, NullWritable> vertex,
      Iterable<NullWritable> messages) {
    // every vertex prints itself and its outgoing edges
    System.out.println("Hello world from the: " + vertex.getId() + " : ");
    for (Edge<IntWritable, NullWritable> e : vertex.getEdges()) {
      System.out.println("  " + e.getTargetVertexId());
    }
    System.out.println("");
    // without this the computation would never terminate
    vertex.voteToHalt();
  }
}

Mighty four of Giraph API

BasicComputation<
    IntWritable,   // VertexID    -- vertex ref
    IntWritable,   // VertexData  -- a vertex datum
    NullWritable,  // EdgeData    -- an edge label datum
    NullWritable>  // MessageData -- message payload
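For a concrete parameterization, a shortest-paths style computation could be typed like this; the skeleton class is hypothetical, with Writable choices mirroring Giraph’s bundled shortest-paths example:

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class ShortestPathsTypes extends BasicComputation<
    LongWritable,     // VertexID    -- 64-bit vertex ids
    DoubleWritable,   // VertexData  -- best known distance from the source
    FloatWritable,    // EdgeData    -- edge weight
    DoubleWritable> { // MessageData -- candidate distance via a neighbour

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) {
    // algorithm body elided -- only the type plumbing matters here
    vertex.voteToHalt();
  }
}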

On circles and arrows

• You don’t even need a graph to begin with!
  • well, OK, you need at least one node

• Dynamic extraction of relationships
  • EdgeInputFormat
  • VertexInputFormat

• Full integration with the Hadoop ecosystem
  • HBase/Accumulo, Gora, Hive/HCatalog

Anatomy of Giraph run

[Diagram: anatomy of a Giraph run — adjacency-list input read from HDFS through an InputFormat, processed by mappers/reducers holding vertices 1–3 (@rhatr, @TheASF, @c0sin), and written back to HDFS through an OutputFormat]

Anatomy of Giraph run (mappers or YARN containers)

[Diagram: the same run with vertices 1–3 hosted in mappers or YARN containers, fed by an InputFormat and drained by an OutputFormat]

A vertex view

[Diagram: a single vertex — its VertexID and VertexData, EdgeData on its outgoing edges, MessageData flowing in and out]

Turning Twitter into Facebook

[Diagram: turning the directed Twitter follower graph (@rhatr, @TheASF, @c0sin) into an undirected Facebook-style friendship graph (rhatr, TheASF, c0sin)]

Ping thy neighbours

public void compute(Vertex<Text, DoubleWritable, DoubleWritable> vertex,
    Iterable<Text> ms) {
  if (getSuperstep() == 0) {
    // superstep 0: tell every neighbour who I am
    sendMessageToAllEdges(vertex, vertex.getId());
  } else {
    // later supersteps: for every sender I don't already point at,
    // add a reverse edge labeled with the SYNTHETIC_EDGE constant
    for (Text m : ms) {
      if (vertex.getEdgeValue(m) == null) {
        vertex.addEdge(EdgeFactory.create(m, SYNTHETIC_EDGE));
      }
    }
  }
  vertex.voteToHalt();
}

Demo time!

But I don’t have a cluster!
• Hadoop in pseudo-distributed mode
  • all Hadoop services on the same host (different JVMs)
• Hadoop-as-a-Service
  • Amazon’s EMR, etc.
• Hadoop in local mode

Prerequisites

• Apache Hadoop 1.2.1
• Apache Giraph 1.1.0-SNAPSHOT
• Apache Maven 3.x
• JDK 7+

Setting things up

$ curl hadoop.tar.gz | tar xzvf -
$ git clone git://git.apache.org/giraph.git ; cd giraph
$ mvn -Phadoop_1 package
$ tar xzvf *dist*/*.tar.gz

$ export HADOOP_HOME=/Users/shapor/dist/hadoop-1.2.1
$ export GIRAPH_HOME=/Users/shapor/dist/
$ export HADOOP_CONF_DIR=$GIRAPH_HOME/conf
$ PATH=$HADOOP_HOME/bin:$GIRAPH_HOME/bin:$PATH

Setting project up (maven)

<dependencies>
  <dependency>
    <groupId>org.apache.giraph</groupId>
    <artifactId>giraph-core</artifactId>
    <version>1.1.0-SNAPSHOT</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
  </dependency>
</dependencies>

Running it

$ mvn package
$ giraph target/*.jar giraph.GiraphHelloWorld \
    -vip src/main/resources/1 \
    -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat \
    -w 1 \
    -ca giraph.SplitMasterWorker=false,giraph.logLevel=error
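For reference, -vip points at a plain-text vertex input file. With IntIntNullTextInputFormat each line is, assuming the usual layout of that format, a vertex id followed by the ids of its neighbours, whitespace-separated; a three-vertex triangle could be seeded as:

1 2 3
2 1 3
3 1 2

-w 1 asks for a single worker, and giraph.SplitMasterWorker=false lets the master and the worker share one process, which is what you want on a laptop-sized setup.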

Testing it

public void testNumberOfVertices() throws Exception {
  GiraphConfiguration conf = new GiraphConfiguration();
  conf.setComputationClass(GiraphHelloWorld.class);
  conf.setVertexInputFormatClass(
      TextDoubleDoubleAdjacencyListVertexInputFormat.class);
  Iterable<String> results = InternalVertexRunner.run(conf, graphSeed);
}
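The snippet above elides the seed data and any assertions. A fuller sketch might look like the following; graphSeed and the assertion are hypothetical, and it assumes a GiraphHelloWorld variant typed over Text/DoubleWritable/DoubleWritable so that it matches the adjacency-list input format (one line per vertex: id, vertex value, then neighbour/edge-weight pairs, tab-separated):

import org.apache.giraph.conf.GiraphConfiguration;
import org.apache.giraph.io.formats.TextDoubleDoubleAdjacencyListVertexInputFormat;
import org.apache.giraph.utils.InternalVertexRunner;
import org.junit.Test;
import static org.junit.Assert.assertNotNull;

public class GiraphHelloWorldTest {

  @Test
  public void testNumberOfVertices() throws Exception {
    // a single vertex "1" with value 0.0 and weighted edges to "2" and "3"
    String[] graphSeed = { "1\t0.0\t2\t1.0\t3\t2.0" };

    GiraphConfiguration conf = new GiraphConfiguration();
    conf.setComputationClass(GiraphHelloWorld.class);
    conf.setVertexInputFormatClass(
        TextDoubleDoubleAdjacencyListVertexInputFormat.class);

    // the whole BSP computation runs in-process -- no cluster required
    Iterable<String> results = InternalVertexRunner.run(conf, graphSeed);
    assertNotNull(results);
  }
}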

Simplified view (mappers or YARN containers)

[Diagram: simplified view — the same three vertices hosted in mappers or YARN containers, between an InputFormat and an OutputFormat]

Master and master compute

[Diagram: workers hosting vertices 1–3, an ActiveMaster running compute(), a StandbyMaster, and ZooKeeper coordinating master election]

Master compute

• Runs before slave compute()
• Has a global view
• A place for aggregator manipulation (see the sketch below)
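A minimal sketch of a master compute class, assuming the stock DefaultMasterCompute base class and the built-in LongSumAggregator; the "sum" name and the halting condition are illustrative:

import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.LongWritable;

public class MyMasterCompute extends DefaultMasterCompute {

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    // registered once, before superstep 0
    registerAggregator("sum", LongSumAggregator.class);
  }

  @Override
  public void compute() {
    // runs once per superstep, before the workers' compute();
    // this is the place with a global view of aggregated values
    LongWritable total = getAggregatedValue("sum");
    if (getSuperstep() > 0 && total.get() == 0) {
      haltComputation();  // a global termination decision
    }
  }
}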

Aggregators

• “Shared variables”
• Each vertex can push values to an aggregator (sketched after the diagram)
• Master compute has full control

[Diagram: vertices 1–3 (@rhatr, @TheASF, @c0sin) pushing values into min: and sum: aggregators held by the ActiveMaster]
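On the worker side, any vertex can push into an aggregator during compute(). A hypothetical computation, assuming a master compute has registered a "sum" aggregator as in the earlier sketch:

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class SumValues extends
    BasicComputation<LongWritable, LongWritable, NullWritable, NullWritable> {

  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<NullWritable> messages) {
    // push this vertex's value into the shared "sum" aggregator
    aggregate("sum", new LongWritable(vertex.getValue().get()));
    // workers read the value aggregated during the *previous* superstep
    LongWritable lastTotal = getAggregatedValue("sum");
    if (getSuperstep() > 2 || lastTotal.get() > 100) {  // illustrative cutoff
      vertex.voteToHalt();
    }
  }
}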

Questions?