Big Graph Data

76
AURELIUS THINKAURELIUS.COM BIG GRAPH DATA Understanding a Complex World Matthias Broecheler, CTO @mbroecheler November XIII, MMXII

description

The problems we are faced with in the 21st century require efficient analysis of ever more complex systems. This presentation outlines how such problems can be better understood and effectively solved if they are modeled as graphs or networks. We present two tools for to help solve such problems at scale: Titan, which is a real-time distributed graph database based on Apache Cassandra and Hbase and Faunus, which is a batch analytics framework for graphs based on Apache Hadoop. We discuss their current development status as of November 2012 and illustrate an example application for the GitHub coding network.

Transcript of Big Graph Data

Page 1: Big Graph Data

AURELIUS THINKAURELIUS.COM

BIG GRAPH DATA Understanding a Complex World

Matthias Broecheler, CTO @mbroecheler November XIII, MMXII

Page 2: Big Graph Data

AURELIUS THINKAURELIUS.COM

I Graph Foundation

Page 3: Big Graph Data

Graph

name: Jupiter type: god

name: Pluto type: god

name: Neptune type: god

name: Hercules type: demigod

name: Cerberus type: monster

name: Alcmene type: god

name: Saturn type: titan

Vertex Property

Page 4: Big Graph Data

Graph

name: Jupiter type: god

name: Pluto type: god

name: Neptune type: god

name: Hercules type: demigod

name: Cerberus type: monster

name: Alcmene type: god

name: Saturn type: titan

father father

mother brother

brother battled

pet

time:12

Edge

Edge Property

Edge Type

Page 5: Big Graph Data

Path

name: Jupiter type: god

name: Pluto type: god

name: Neptune type: god

name: Hercules type: demigod

name: Cerberus type: monster

name: Alcmene type: god

name: Saturn type: titan

father father

mother brother

brother battled

pet

time:12

Page 6: Big Graph Data

Degree

name: Jupiter type: god

name: Pluto type: god

name: Neptune type: god

name: Hercules type: demigod

name: Cerberus type: monster

name: Alcmene type: god

name: Saturn type: titan

father father

mother brother

brother battled

pet

time:12

Page 7: Big Graph Data

AURELIUS THINKAURELIUS.COM

I Connected World

Page 8: Big Graph Data

HEALTH

Page 9: Big Graph Data

HEALTH

Page 10: Big Graph Data

HEALTH

Page 11: Big Graph Data

HEALTH

Page 12: Big Graph Data

ECONOMY

Page 13: Big Graph Data

ECONOMY

Page 14: Big Graph Data

ECONOMY

Page 15: Big Graph Data

ECONOMY

Page 16: Big Graph Data

Social Systems

Page 17: Big Graph Data

Social Systems

Page 18: Big Graph Data

Social Systems

Page 19: Big Graph Data

Social Systems

Page 20: Big Graph Data
Page 21: Big Graph Data
Page 22: Big Graph Data
Page 23: Big Graph Data

AURELIUS THINKAURELIUS.COM

III Titan Graph Database

Page 24: Big Graph Data

  Numerous Concurrent Users   Many Short Transactions

  read/write

  Real-time Traversals (OLTP)   High Availability   Dynamic Scalability   Variable Consistency Model

  ACID or eventual consistency

 Real-time Big Graph Data

Titan Features

Page 25: Big Graph Data

Storage Backends

Partitionability

Availability Consistency

Page 26: Big Graph Data

Titan Features

I.  Data Management

II. Vertex-Centric Indices

Page 27: Big Graph Data

Titan Features

III.  Graph Partitioning

IV.  Edge Compression

Page 28: Big Graph Data

Titan Ecosystem

  Native Blueprints Implementation

  Gremlin Query Language

  Rexster Server   any Titan graph can be

exposed as a REST endpoint

Generic Graph API

Dataflow Processing

TraversalLanguage

Object-GraphMapper

GraphAlgorithms

GraphServer

Page 29: Big Graph Data

AURELIUS THINKAURELIUS.COM

IV Github Network

Page 30: Big Graph Data

Setup

$ ./titan-0.1.0/bin/gremlin.sh! ! ! !\,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> g = TitanFactory.open('/tmp/titan')!==>titangraph[local:/tmp/titan]!

Page 31: Big Graph Data

Titan Storage Model

  Adjacency list in one column family

  Row key = vertex id

  Each property and edge in one column   Denormalized, i.e. stored twice

  Direction and label/key as column prefix   Use slice predicate for quick retrieval

5

5

Page 32: Big Graph Data

USER

REPOSITORY

COMMENT

ISSUE COMMIT

PAGE

edited

pushed

in

opened

to on on

on

created

Page 33: Big Graph Data

Defining Property Keys

gremlin> g.makeType().name(‘username’).!! ! ! dataType(String.class).!! ! ! functional().!! ! ! indexed().unique().!! ! ! makePropertyKey()!

gremlin> g.makeType().name(‘time’).!! ! ! dataType(Long.class).!! ! ! functional().makePropertyKey()!

Page 34: Big Graph Data

Defining Edge Labels

gremlin> g.makeType().name(‘on’).!! ! ! makeEdgeLabel()!

gremlin> g.makeType().name(‘pushed’).!! ! ! primaryKey(time).!! ! ! makeEdgeLabel()!

gremlin> g.makeType().name(‘in’).!! ! ! unidirected().!! ! ! makeEdgeLabel()!

Page 35: Big Graph Data

Create & Retrieve

gremlin> v = g.addVertex([username: ‘okram’])!==>v[4]!gremlin> v.map!==>{username=okram}!gremlin> g.V('username','okram')!==>v[4]!

Page 36: Big Graph Data

Titan Locking

  Locking ensures consistency when it is needed

  Titan uses time stamped consistent reads and writes on separate CFs for locking

  Uses   Property uniqueness: .unique()

  Functional edges: .functional()

  Global ID management

5

9

name : Hercules

name : Hercules

name : Jupiter

name : Pluto

father

father

x

Page 37: Big Graph Data

Titan Indexing

  Vertices can be retrieved by property key + value

  Titan maintains index in a separate column family as graph is updated

  Only need to define a property key as .index()

5

9

name : Hercules

name : Jupiter

Page 38: Big Graph Data

Basic Queries

gremlin> v.out(‘pushed’)!gremlin> v.out(‘pushed’).out(‘to’).name!gremlin> v.out(‘pushed’).out(‘to’).dedup.name!gremlin> v.out(‘pushed’).out(‘to’).dedup.!! ! ! name.sort{it}!

gremlin> v.outE(‘pushed’).has(‘time’,T.gt,1000).inV!

Page 39: Big Graph Data

Basic Queries

gremlin> v.out(‘pushed’)!gremlin> v.out(‘pushed’).out(‘to’).name!gremlin> v.out(‘pushed’).out(‘to’).dedup.name!gremlin> v.out(‘pushed’).out(‘to’).dedup.!! ! ! name.sort{it}!

gremlin> v.outE(‘pushed’).has(‘time’,T.gt,1000).inV!

Query Optimization

Page 40: Big Graph Data

Vertex-Centric Indices

  Sort and index edges per vertex by primary key   Primary key can be composite

  Enables efficient focused traversals   Only retrieve edges that matter

  Uses push down predicates for quick, index-driven retrieval

Page 41: Big Graph Data

v

time: 1

fought fought father

mother

battled battled battled

battled

time: 3 time: 5

time: 9 v.query()!

Page 42: Big Graph Data

v

time: 1

father

mother

battled battled battled

battled

time: 3 time: 5

time: 9 v.query()! .direction(OUT)!

Page 43: Big Graph Data

v

time: 1

battled battled battled

battled

time: 3 time: 5

time: 9 v.query()! .direction(OUT)! .labels(‘battled’)!

Page 44: Big Graph Data

v

time: 1

battled battled

time: 3

v.query()! .direction(OUT)! .labels(‘battled’)! .has(‘time,T.lt,5)!

Page 45: Big Graph Data

Recommendation Engine

gremlin> v.out('pushed').out('to')[0..9].!! ! ! in('to').in('pushed')[0..500].!! ! ! except([v]).name.!! ! ! groupCount.cap.next().sort{-it.value}[0..4]!

Page 46: Big Graph Data

Recommendation Engine

gremlin> v.out('pushed').out('to')[0..9].!! ! ! in('to').in('pushed')[0..500].!! ! ! except([v]).name.!! ! ! groupCount.cap.next().sort{-it.value}[0..4]!

v = g.V(‘username’,’okram’):!==>lvca=175!==>spmallette=56!==>sgomezvillamor=36!==>mbroecheler=33!==>joshsh=20!

Page 47: Big Graph Data

Recommendation Engine

gremlin> v.out('pushed').out('to')[0..9].!! ! ! in('to').in('pushed')[0..500].!! ! ! except([v]).name.!! ! ! groupCount.cap.next().sort{-it.value}[0..4]!

v = g.V(‘username’,’torvalds’):!==>iksaif=90!==>rjwysocki=22!==>kernel-digger=20!==>giuseppecalderaro=16!==>groeck=15!

Page 48: Big Graph Data

Titan Embedding

  Rexster RexPro   lightweight Gremlin

Server

  based on Grizzly

  Titan Gremlin Engine

  Embedded Storage Backend   in-JVM method calls

Page 49: Big Graph Data

Graph Partitioning

Goal: Vertex Co-location   Titan maintains multiple

ID Pools

  Ordered Partitioner in Storage Backend

  Dynamically determines optimal partition and allocates corresponding IDs

ID Pool

Page 50: Big Graph Data

What’s coming

  Full-text indexing   external index system integration

  Bulk Loading   integration with storage backend

utilities and Hadoop ingestion

  240 Billion Edge Benchmark   performance analysis and improvements

across the entire stack

Page 51: Big Graph Data

AURELIUS THINKAURELIUS.COM

V Faunus Graph Analytics

Page 52: Big Graph Data

  Hadoop-based Graph Computing Framework

  Graph Analytics

  Breadth-first Traversals

  Global Graph Computations

 Batch Big Graph Data

Faunus Features

Page 53: Big Graph Data

Faunus Architecture

g._()!

Page 54: Big Graph Data

Faunus Work Flow

hdfs://user/ubuntu/

output/job-0/

output/job-1/

output/job-2/ { graph*

sideeffect*

g.V.out .out .count()

Compressed HDFS Graphs   stored in sequence files   variable length encoding   prefix compression

Page 55: Big Graph Data

Faunus Setup

$ bin/gremlin.sh !

\,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> g = FaunusFactory.open('bin/titan-hbase.properties')!==>faunusgraph[titanhbaseinputformat]!gremlin> g.getProperties()!==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat!==>faunus.graph.output.format=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat!==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat!==>faunus.output.location=dbpedia!==>faunus.output.location.overwrite=true!

gremlin> g._() !12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)!12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.IdentityMap.Map]!12/11/09 15:17:50 INFO mapred.JobClient: Running job: job_201211081058_0003!

Page 56: Big Graph Data

Graph Analytics

gremlin> g.E.has('label',’followed').keep.!! ! !V.sideEffect('{it.degree = it.outE.count()}').!! ! !degree.groupCount!

gremlin> g.E.has('label','pushed').keep.!! ! !V.sideEffect('{it.degree = it.outE.count()}').!! ! !degree.groupCount!

Page 57: Big Graph Data

Follow Degree Distribution

Page 58: Big Graph Data

Follow Degree Distribution

P(k) ~ k-γ

γ = 2.2

Page 59: Big Graph Data

Pushed Degree Distribution

Page 60: Big Graph Data

Global Recommendations

gremlin> g.E.has('label','pushed','to').keep.!! ! !V.out('pushed').out('to').!! ! !in('to').in('pushed').!! ! !sideEffect('{it.score =it.pathCounter}').!! ! !score.order(F.decr,'name')!

# Top 5:!Jippi ! ! ! !60892182927!garbear ! ! !30095282886!FakeHeal ! ! !30038040349!brianchandotcom !24684133382!nyarla ! ! !15230275746!

Page 61: Big Graph Data

What’s coming

  Faunus 0.1

  Bulk Loading   loaded graph into Titan

  loading derivations into Titan

  Extending Gremlin Support   currently only a subset is of

Gremlin implemented

  Operational Tools

Page 62: Big Graph Data

I Graph = Relationship Centric

Page 63: Big Graph Data

II Graph = Agile Data Model

Page 64: Big Graph Data

III Graph = Algebraic Data Model

Page 65: Big Graph Data

Aurelius Graph Cluster

Stores a massive-scale property graph allowing real-time traversals and updates

Batch processing of large graphs with Hadoop

Runs global graph algorithms on large, compressed,

in-memory graphs

Map/Reduce Load & Compress

Analysis results back into Titan

Apache 2

Page 66: Big Graph Data

The Graph Landscape Sp

eed

of T

rave

rsal

/Pro

cess

Size of Graph Illustration only, not to scale

Page 67: Big Graph Data

TINKERPOP.COM

Page 68: Big Graph Data

AURELIUS THINKAURELIUS.COM

Thanks!

Vadas Gintautas @vadasg

Marko Rodriguez @twarko

Stephen Mallette @spmallette

Daniel LaRocque

Page 69: Big Graph Data

AURELIUS THINKAURELIUS.COM

Page 70: Big Graph Data

AURELIUS THINKAURELIUS.COM

XVX Benchmark Results

Page 71: Big Graph Data

AURELIUS THINKAURELIUS.COM

XVX - I Titan Performance Evaluation on Twitter-like Benchmark

Page 72: Big Graph Data

Twitter Benchmark

  1.47 billion followship edges and 41.7 million users   Loaded into Titan using BatchGraph

  Twitter in 2009, crawled by Kwak et. al

  4 Transaction Types   Create Account (1%)

  Publish tweet (15%)

  Read stream (76%)

  Recommendation (8%)   Follow recommended user (30%)

Kwak, H., Lee, C., Park, H., Moon, S., “What is Twitter, a Social Network or a News Media?,” World Wide Web Conference, 2010.

Page 73: Big Graph Data

Benchmark Setup

  6 cc1.4xl Cassandra nodes   in one placement group

  Cassandra 1.10

  40 m1.small worker machines   repeatedly running transactions

  simulating servers handling user requests

  EC2 cost: $11/hour

Page 74: Big Graph Data

Benchmark Results

Transaction Type Number of tx Mean tx time Std of tx time

Create account 379,019 115.15 ms 5.88 ms

Publish tweet 7,580,995 18.45 ms 6.34 ms

Read stream 37,936,184 6.29 ms 1.62 ms

Recommendation 3,793,863 67.65 ms 13.89 ms

Total 49,690,061

Runtime 2.3 hours 5,900 tx/sec

Page 75: Big Graph Data

Peak Load Results

Transaction Type Number of tx Mean tx time Std of tx time

Create account 374,860 172.74 ms 10.52 ms

Publish tweet 7,517,667 70.07 ms 19.43 ms

Read stream 37,618,648 24.40 ms 3.18 ms

Recommendation 3,758,266 229.83 ms 29.08 ms

Total 49,269,441

Runtime 1.3 hours 10,200 tx/sec

Page 76: Big Graph Data

Benchmark Conclusion

Titan   can   handle   10s   of   thousands   of   concurrent  users  with   short   response  5mes  even   for   complex  traversals   on   a   simulated   social   networking  applica5on  based  on  real-­‐world  network  data  with  billions  of  edges  and  millions  of  users  in  a  standard  EC2  deployment.  

For  more  informa5on  on  the  benchmark:  hDp://thinkaurelius.com/2012/08/06/5tan-­‐provides-­‐real-­‐5me-­‐big-­‐graph-­‐data/