Big Graph Data

Post on 15-Jan-2015

5.922 views 2 download

Tags:

description

The problems we are faced with in the 21st century require efficient analysis of ever more complex systems. This presentation outlines how such problems can be better understood and effectively solved if they are modeled as graphs or networks. We present two tools for to help solve such problems at scale: Titan, which is a real-time distributed graph database based on Apache Cassandra and Hbase and Faunus, which is a batch analytics framework for graphs based on Apache Hadoop. We discuss their current development status as of November 2012 and illustrate an example application for the GitHub coding network.

Transcript of Big Graph Data

AURELIUS THINKAURELIUS.COM

BIG GRAPH DATA Understanding a Complex World

Matthias Broecheler, CTO @mbroecheler November XIII, MMXII

AURELIUS THINKAURELIUS.COM

I Graph Foundation

Graph

name: Jupiter type: god

name: Pluto type: god

name: Neptune type: god

name: Hercules type: demigod

name: Cerberus type: monster

name: Alcmene type: god

name: Saturn type: titan

Vertex Property

Graph

name: Jupiter type: god

name: Pluto type: god

name: Neptune type: god

name: Hercules type: demigod

name: Cerberus type: monster

name: Alcmene type: god

name: Saturn type: titan

father father

mother brother

brother battled

pet

time:12

Edge

Edge Property

Edge Type

Path

name: Jupiter type: god

name: Pluto type: god

name: Neptune type: god

name: Hercules type: demigod

name: Cerberus type: monster

name: Alcmene type: god

name: Saturn type: titan

father father

mother brother

brother battled

pet

time:12

Degree

name: Jupiter type: god

name: Pluto type: god

name: Neptune type: god

name: Hercules type: demigod

name: Cerberus type: monster

name: Alcmene type: god

name: Saturn type: titan

father father

mother brother

brother battled

pet

time:12

AURELIUS THINKAURELIUS.COM

I Connected World

HEALTH

HEALTH

HEALTH

HEALTH

ECONOMY

ECONOMY

ECONOMY

ECONOMY

Social Systems

Social Systems

Social Systems

Social Systems

AURELIUS THINKAURELIUS.COM

III Titan Graph Database

  Numerous Concurrent Users   Many Short Transactions

  read/write

  Real-time Traversals (OLTP)   High Availability   Dynamic Scalability   Variable Consistency Model

  ACID or eventual consistency

 Real-time Big Graph Data

Titan Features

Storage Backends

Partitionability

Availability Consistency

Titan Features

I.  Data Management

II. Vertex-Centric Indices

Titan Features

III.  Graph Partitioning

IV.  Edge Compression

Titan Ecosystem

  Native Blueprints Implementation

  Gremlin Query Language

  Rexster Server   any Titan graph can be

exposed as a REST endpoint

Generic Graph API

Dataflow Processing

TraversalLanguage

Object-GraphMapper

GraphAlgorithms

GraphServer

AURELIUS THINKAURELIUS.COM

IV Github Network

Setup

$ ./titan-0.1.0/bin/gremlin.sh! ! ! !\,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> g = TitanFactory.open('/tmp/titan')!==>titangraph[local:/tmp/titan]!

Titan Storage Model

  Adjacency list in one column family

  Row key = vertex id

  Each property and edge in one column   Denormalized, i.e. stored twice

  Direction and label/key as column prefix   Use slice predicate for quick retrieval

5

5

USER

REPOSITORY

COMMENT

ISSUE COMMIT

PAGE

edited

pushed

in

opened

to on on

on

created

Defining Property Keys

gremlin> g.makeType().name(‘username’).!! ! ! dataType(String.class).!! ! ! functional().!! ! ! indexed().unique().!! ! ! makePropertyKey()!

gremlin> g.makeType().name(‘time’).!! ! ! dataType(Long.class).!! ! ! functional().makePropertyKey()!

Defining Edge Labels

gremlin> g.makeType().name(‘on’).!! ! ! makeEdgeLabel()!

gremlin> g.makeType().name(‘pushed’).!! ! ! primaryKey(time).!! ! ! makeEdgeLabel()!

gremlin> g.makeType().name(‘in’).!! ! ! unidirected().!! ! ! makeEdgeLabel()!

Create & Retrieve

gremlin> v = g.addVertex([username: ‘okram’])!==>v[4]!gremlin> v.map!==>{username=okram}!gremlin> g.V('username','okram')!==>v[4]!

Titan Locking

  Locking ensures consistency when it is needed

  Titan uses time stamped consistent reads and writes on separate CFs for locking

  Uses   Property uniqueness: .unique()

  Functional edges: .functional()

  Global ID management

5

9

name : Hercules

name : Hercules

name : Jupiter

name : Pluto

father

father

x

Titan Indexing

  Vertices can be retrieved by property key + value

  Titan maintains index in a separate column family as graph is updated

  Only need to define a property key as .index()

5

9

name : Hercules

name : Jupiter

Basic Queries

gremlin> v.out(‘pushed’)!gremlin> v.out(‘pushed’).out(‘to’).name!gremlin> v.out(‘pushed’).out(‘to’).dedup.name!gremlin> v.out(‘pushed’).out(‘to’).dedup.!! ! ! name.sort{it}!

gremlin> v.outE(‘pushed’).has(‘time’,T.gt,1000).inV!

Basic Queries

gremlin> v.out(‘pushed’)!gremlin> v.out(‘pushed’).out(‘to’).name!gremlin> v.out(‘pushed’).out(‘to’).dedup.name!gremlin> v.out(‘pushed’).out(‘to’).dedup.!! ! ! name.sort{it}!

gremlin> v.outE(‘pushed’).has(‘time’,T.gt,1000).inV!

Query Optimization

Vertex-Centric Indices

  Sort and index edges per vertex by primary key   Primary key can be composite

  Enables efficient focused traversals   Only retrieve edges that matter

  Uses push down predicates for quick, index-driven retrieval

v

time: 1

fought fought father

mother

battled battled battled

battled

time: 3 time: 5

time: 9 v.query()!

v

time: 1

father

mother

battled battled battled

battled

time: 3 time: 5

time: 9 v.query()! .direction(OUT)!

v

time: 1

battled battled battled

battled

time: 3 time: 5

time: 9 v.query()! .direction(OUT)! .labels(‘battled’)!

v

time: 1

battled battled

time: 3

v.query()! .direction(OUT)! .labels(‘battled’)! .has(‘time,T.lt,5)!

Recommendation Engine

gremlin> v.out('pushed').out('to')[0..9].!! ! ! in('to').in('pushed')[0..500].!! ! ! except([v]).name.!! ! ! groupCount.cap.next().sort{-it.value}[0..4]!

Recommendation Engine

gremlin> v.out('pushed').out('to')[0..9].!! ! ! in('to').in('pushed')[0..500].!! ! ! except([v]).name.!! ! ! groupCount.cap.next().sort{-it.value}[0..4]!

v = g.V(‘username’,’okram’):!==>lvca=175!==>spmallette=56!==>sgomezvillamor=36!==>mbroecheler=33!==>joshsh=20!

Recommendation Engine

gremlin> v.out('pushed').out('to')[0..9].!! ! ! in('to').in('pushed')[0..500].!! ! ! except([v]).name.!! ! ! groupCount.cap.next().sort{-it.value}[0..4]!

v = g.V(‘username’,’torvalds’):!==>iksaif=90!==>rjwysocki=22!==>kernel-digger=20!==>giuseppecalderaro=16!==>groeck=15!

Titan Embedding

  Rexster RexPro   lightweight Gremlin

Server

  based on Grizzly

  Titan Gremlin Engine

  Embedded Storage Backend   in-JVM method calls

Graph Partitioning

Goal: Vertex Co-location   Titan maintains multiple

ID Pools

  Ordered Partitioner in Storage Backend

  Dynamically determines optimal partition and allocates corresponding IDs

ID Pool

What’s coming

  Full-text indexing   external index system integration

  Bulk Loading   integration with storage backend

utilities and Hadoop ingestion

  240 Billion Edge Benchmark   performance analysis and improvements

across the entire stack

AURELIUS THINKAURELIUS.COM

V Faunus Graph Analytics

  Hadoop-based Graph Computing Framework

  Graph Analytics

  Breadth-first Traversals

  Global Graph Computations

 Batch Big Graph Data

Faunus Features

Faunus Architecture

g._()!

Faunus Work Flow

hdfs://user/ubuntu/

output/job-0/

output/job-1/

output/job-2/ { graph*

sideeffect*

g.V.out .out .count()

Compressed HDFS Graphs   stored in sequence files   variable length encoding   prefix compression

Faunus Setup

$ bin/gremlin.sh !

\,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> g = FaunusFactory.open('bin/titan-hbase.properties')!==>faunusgraph[titanhbaseinputformat]!gremlin> g.getProperties()!==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat!==>faunus.graph.output.format=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat!==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat!==>faunus.output.location=dbpedia!==>faunus.output.location.overwrite=true!

gremlin> g._() !12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)!12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.IdentityMap.Map]!12/11/09 15:17:50 INFO mapred.JobClient: Running job: job_201211081058_0003!

Graph Analytics

gremlin> g.E.has('label',’followed').keep.!! ! !V.sideEffect('{it.degree = it.outE.count()}').!! ! !degree.groupCount!

gremlin> g.E.has('label','pushed').keep.!! ! !V.sideEffect('{it.degree = it.outE.count()}').!! ! !degree.groupCount!

Follow Degree Distribution

Follow Degree Distribution

P(k) ~ k-γ

γ = 2.2

Pushed Degree Distribution

Global Recommendations

gremlin> g.E.has('label','pushed','to').keep.!! ! !V.out('pushed').out('to').!! ! !in('to').in('pushed').!! ! !sideEffect('{it.score =it.pathCounter}').!! ! !score.order(F.decr,'name')!

# Top 5:!Jippi ! ! ! !60892182927!garbear ! ! !30095282886!FakeHeal ! ! !30038040349!brianchandotcom !24684133382!nyarla ! ! !15230275746!

What’s coming

  Faunus 0.1

  Bulk Loading   loaded graph into Titan

  loading derivations into Titan

  Extending Gremlin Support   currently only a subset is of

Gremlin implemented

  Operational Tools

I Graph = Relationship Centric

II Graph = Agile Data Model

III Graph = Algebraic Data Model

Aurelius Graph Cluster

Stores a massive-scale property graph allowing real-time traversals and updates

Batch processing of large graphs with Hadoop

Runs global graph algorithms on large, compressed,

in-memory graphs

Map/Reduce Load & Compress

Analysis results back into Titan

Apache 2

The Graph Landscape Sp

eed

of T

rave

rsal

/Pro

cess

Size of Graph Illustration only, not to scale

TINKERPOP.COM

AURELIUS THINKAURELIUS.COM

Thanks!

Vadas Gintautas @vadasg

Marko Rodriguez @twarko

Stephen Mallette @spmallette

Daniel LaRocque

AURELIUS THINKAURELIUS.COM

AURELIUS THINKAURELIUS.COM

XVX Benchmark Results

AURELIUS THINKAURELIUS.COM

XVX - I Titan Performance Evaluation on Twitter-like Benchmark

Twitter Benchmark

  1.47 billion followship edges and 41.7 million users   Loaded into Titan using BatchGraph

  Twitter in 2009, crawled by Kwak et. al

  4 Transaction Types   Create Account (1%)

  Publish tweet (15%)

  Read stream (76%)

  Recommendation (8%)   Follow recommended user (30%)

Kwak, H., Lee, C., Park, H., Moon, S., “What is Twitter, a Social Network or a News Media?,” World Wide Web Conference, 2010.

Benchmark Setup

  6 cc1.4xl Cassandra nodes   in one placement group

  Cassandra 1.10

  40 m1.small worker machines   repeatedly running transactions

  simulating servers handling user requests

  EC2 cost: $11/hour

Benchmark Results

Transaction Type Number of tx Mean tx time Std of tx time

Create account 379,019 115.15 ms 5.88 ms

Publish tweet 7,580,995 18.45 ms 6.34 ms

Read stream 37,936,184 6.29 ms 1.62 ms

Recommendation 3,793,863 67.65 ms 13.89 ms

Total 49,690,061

Runtime 2.3 hours 5,900 tx/sec

Peak Load Results

Transaction Type Number of tx Mean tx time Std of tx time

Create account 374,860 172.74 ms 10.52 ms

Publish tweet 7,517,667 70.07 ms 19.43 ms

Read stream 37,618,648 24.40 ms 3.18 ms

Recommendation 3,758,266 229.83 ms 29.08 ms

Total 49,269,441

Runtime 1.3 hours 10,200 tx/sec

Benchmark Conclusion

Titan   can   handle   10s   of   thousands   of   concurrent  users  with   short   response  5mes  even   for   complex  traversals   on   a   simulated   social   networking  applica5on  based  on  real-­‐world  network  data  with  billions  of  edges  and  millions  of  users  in  a  standard  EC2  deployment.  

For  more  informa5on  on  the  benchmark:  hDp://thinkaurelius.com/2012/08/06/5tan-­‐provides-­‐real-­‐5me-­‐big-­‐graph-­‐data/