Big Graph Data

AURELIUS THINKAURELIUS.COM

BIG GRAPH DATA Understanding a Complex World

Matthias Broecheler, CTO @mbroecheler November XIII, MMXII

I Graph Foundation

Vertex Property

father father

mother brother

brother battled

time:12

Edge Property

Edge Type

father father

mother brother

brother battled

time:12

Degree

father father

mother brother

brother battled

time:12

I Connected World

HEALTH

ECONOMY

Social Systems

III Titan Graph Database

  Numerous Concurrent Users   Many Short Transactions

  read/write

  Real-time Traversals (OLTP)   High Availability   Dynamic Scalability   Variable Consistency Model

  ACID or eventual consistency

 Real-time Big Graph Data

Titan Features

Storage Backends

Partitionability

Availability Consistency

Titan Features

I.  Data Management

II. Vertex-Centric Indices

Titan Features

III.  Graph Partitioning

IV.  Edge Compression

Titan Ecosystem

  Native Blueprints Implementation

  Gremlin Query Language

  Rexster Server   any Titan graph can be

exposed as a REST endpoint

Generic Graph API

Dataflow Processing

TraversalLanguage

Object-GraphMapper

GraphAlgorithms

GraphServer

IV Github Network

$ ./titan-0.1.0/bin/gremlin.sh! ! ! !\,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> g = TitanFactory.open('/tmp/titan')!==>titangraph[local:/tmp/titan]!

Titan Storage Model

  Adjacency list in one column family

  Row key = vertex id

  Each property and edge in one column   Denormalized, i.e. stored twice

  Direction and label/key as column prefix   Use slice predicate for quick retrieval

REPOSITORY

COMMENT

ISSUE COMMIT

edited

pushed

opened

to on on

created

Defining Property Keys

gremlin> g.makeType().name(‘username’).!! ! ! dataType(String.class).!! ! ! functional().!! ! ! indexed().unique().!! ! ! makePropertyKey()!

gremlin> g.makeType().name(‘time’).!! ! ! dataType(Long.class).!! ! ! functional().makePropertyKey()!

Defining Edge Labels

gremlin> g.makeType().name(‘on’).!! ! ! makeEdgeLabel()!

gremlin> g.makeType().name(‘pushed’).!! ! ! primaryKey(time).!! ! ! makeEdgeLabel()!

gremlin> g.makeType().name(‘in’).!! ! ! unidirected().!! ! ! makeEdgeLabel()!

Create & Retrieve

gremlin> v = g.addVertex([username: ‘okram’])!==>v[4]!gremlin> v.map!==>{username=okram}!gremlin> g.V('username','okram')!==>v[4]!

Titan Locking

  Locking ensures consistency when it is needed

  Titan uses time stamped consistent reads and writes on separate CFs for locking

  Uses   Property uniqueness: .unique()

  Functional edges: .functional()

  Global ID management

name : Hercules

name : Jupiter

name : Pluto

father

Titan Indexing

  Vertices can be retrieved by property key + value

  Titan maintains index in a separate column family as graph is updated

  Only need to define a property key as .index()

name : Hercules

name : Jupiter

Basic Queries

gremlin> v.out(‘pushed’)!gremlin> v.out(‘pushed’).out(‘to’).name!gremlin> v.out(‘pushed’).out(‘to’).dedup.name!gremlin> v.out(‘pushed’).out(‘to’).dedup.!! ! ! name.sort{it}!

gremlin> v.outE(‘pushed’).has(‘time’,T.gt,1000).inV!

Basic Queries

gremlin> v.out(‘pushed’)!gremlin> v.out(‘pushed’).out(‘to’).name!gremlin> v.out(‘pushed’).out(‘to’).dedup.name!gremlin> v.out(‘pushed’).out(‘to’).dedup.!! ! ! name.sort{it}!

gremlin> v.outE(‘pushed’).has(‘time’,T.gt,1000).inV!

Query Optimization

Vertex-Centric Indices

  Sort and index edges per vertex by primary key   Primary key can be composite

  Enables efficient focused traversals   Only retrieve edges that matter

  Uses push down predicates for quick, index-driven retrieval

time: 1

fought fought father

mother

battled battled battled

battled

time: 3 time: 5

time: 9 v.query()!

time: 1

father

mother

battled

time: 3 time: 5

time: 9 v.query()! .direction(OUT)!

time: 1

battled

time: 3 time: 5

time: 9 v.query()! .direction(OUT)! .labels(‘battled’)!

time: 1

battled battled

time: 3

v.query()! .direction(OUT)! .labels(‘battled’)! .has(‘time,T.lt,5)!

Recommendation Engine

gremlin> v.out('pushed').out('to')[0..9].!! ! ! in('to').in('pushed')[0..500].!! ! ! except([v]).name.!! ! ! groupCount.cap.next().sort{-it.value}[0..4]!

v = g.V(‘username’,’okram’):!==>lvca=175!==>spmallette=56!==>sgomezvillamor=36!==>mbroecheler=33!==>joshsh=20!

v = g.V(‘username’,’torvalds’):!==>iksaif=90!==>rjwysocki=22!==>kernel-digger=20!==>giuseppecalderaro=16!==>groeck=15!

Titan Embedding

  Rexster RexPro   lightweight Gremlin

Server

  based on Grizzly

  Titan Gremlin Engine

  Embedded Storage Backend   in-JVM method calls

Graph Partitioning

Goal: Vertex Co-location   Titan maintains multiple

ID Pools

  Ordered Partitioner in Storage Backend

  Dynamically determines optimal partition and allocates corresponding IDs

ID Pool

What’s coming

  Full-text indexing   external index system integration

  Bulk Loading   integration with storage backend

utilities and Hadoop ingestion

  240 Billion Edge Benchmark   performance analysis and improvements

across the entire stack

V Faunus Graph Analytics

  Hadoop-based Graph Computing Framework

  Graph Analytics

  Breadth-first Traversals

  Global Graph Computations

 Batch Big Graph Data

Faunus Features

Faunus Architecture

g._()!

Faunus Work Flow

hdfs://user/ubuntu/

output/job-0/

output/job-1/

output/job-2/ { graph*

sideeffect*

g.V.out .out .count()

Compressed HDFS Graphs   stored in sequence files   variable length encoding   prefix compression

Faunus Setup

$ bin/gremlin.sh !

\,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> g = FaunusFactory.open('bin/titan-hbase.properties')!==>faunusgraph[titanhbaseinputformat]!gremlin> g.getProperties()!==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat!==>faunus.graph.output.format=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat!==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat!==>faunus.output.location=dbpedia!==>faunus.output.location.overwrite=true!

gremlin> g._() !12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)!12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.IdentityMap.Map]!12/11/09 15:17:50 INFO mapred.JobClient: Running job: job_201211081058_0003!

Graph Analytics

gremlin> g.E.has('label',’followed').keep.!! ! !V.sideEffect('{it.degree = it.outE.count()}').!! ! !degree.groupCount!

gremlin> g.E.has('label','pushed').keep.!! ! !V.sideEffect('{it.degree = it.outE.count()}').!! ! !degree.groupCount!

Follow Degree Distribution

P(k) ~ k-γ

γ = 2.2

Pushed Degree Distribution

Global Recommendations

gremlin> g.E.has('label','pushed','to').keep.!! ! !V.out('pushed').out('to').!! ! !in('to').in('pushed').!! ! !sideEffect('{it.score =it.pathCounter}').!! ! !score.order(F.decr,'name')!

# Top 5:!Jippi ! ! ! !60892182927!garbear ! ! !30095282886!FakeHeal ! ! !30038040349!brianchandotcom !24684133382!nyarla ! ! !15230275746!

What’s coming

  Faunus 0.1

  Bulk Loading   loaded graph into Titan

  loading derivations into Titan

  Extending Gremlin Support   currently only a subset is of

Gremlin implemented

  Operational Tools

I Graph = Relationship Centric

II Graph = Agile Data Model

III Graph = Algebraic Data Model

Aurelius Graph Cluster

Stores a massive-scale property graph allowing real-time traversals and updates

Batch processing of large graphs with Hadoop

Runs global graph algorithms on large, compressed,

in-memory graphs

Map/Reduce Load & Compress

Analysis results back into Titan

Apache 2

The Graph Landscape Sp

Size of Graph Illustration only, not to scale

TINKERPOP.COM

Thanks!

Vadas Gintautas @vadasg

Marko Rodriguez @twarko

Stephen Mallette @spmallette

Daniel LaRocque

XVX Benchmark Results

XVX - I Titan Performance Evaluation on Twitter-like Benchmark

Twitter Benchmark

  1.47 billion followship edges and 41.7 million users   Loaded into Titan using BatchGraph

  Twitter in 2009, crawled by Kwak et. al

  4 Transaction Types   Create Account (1%)

  Publish tweet (15%)

  Read stream (76%)

  Recommendation (8%)   Follow recommended user (30%)

Kwak, H., Lee, C., Park, H., Moon, S., “What is Twitter, a Social Network or a News Media?,” World Wide Web Conference, 2010.

Benchmark Setup

  6 cc1.4xl Cassandra nodes   in one placement group

  Cassandra 1.10

  40 m1.small worker machines   repeatedly running transactions

  simulating servers handling user requests

  EC2 cost: $11/hour

Benchmark Results

Transaction Type Number of tx Mean tx time Std of tx time

Create account 379,019 115.15 ms 5.88 ms

Publish tweet 7,580,995 18.45 ms 6.34 ms

Read stream 37,936,184 6.29 ms 1.62 ms

Recommendation 3,793,863 67.65 ms 13.89 ms

Total 49,690,061

Runtime 2.3 hours 5,900 tx/sec

Peak Load Results

Transaction Type Number of tx Mean tx time Std of tx time

Create account 374,860 172.74 ms 10.52 ms

Publish tweet 7,517,667 70.07 ms 19.43 ms

Read stream 37,618,648 24.40 ms 3.18 ms

Recommendation 3,758,266 229.83 ms 29.08 ms

Total 49,269,441

Runtime 1.3 hours 10,200 tx/sec

Benchmark Conclusion

Titan can handle 10s of thousands of concurrent users with short response 5mes even for complex traversals on a simulated social networking applica5on based on real-‐world network data with billions of edges and millions of users in a standard EC2 deployment.

For more informa5on on the benchmark: hDp://thinkaurelius.com/2012/08/06/5tan-‐provides-‐real-‐5me-‐big-‐graph-‐data/

Big Graph Data

Technology

Transcript of Big Graph Data

Graphalytics: A Big Data Benchmark for Graph-Processing Platforms › fileadmin › user_upload › ... · 2015-06-01 · Graphalytics: A Big Data Benchmark for Graph-Processing Platforms

Titan: The Rise of Big Graph Data

Build Serverless Big Data and Graph Visualization Applications · Title: Build Serverless Big Data and Graph Visualization Applications Author: Tom Sawyer Software Subject: Oracle

Oracle Big Data Spatial and Graph - An Overview · 2015-07-22 · Title: Oracle Big Data Spatial and Graph - An Overview Author: Oracle Subject: Customer Overview Presentation Keywords:

Big Data Facebook Data from Graph API - csuohio.educis.csuohio.edu/~sschung/cis612/BigData_Mongodb_Presentation... · Big Data Facebook Wall Data using Graph API Presented by: Prashant

Big Data Appliance for Graph Analytics

Big Data,Neo4j Developer,Neo4j Graph database,Casandra Developer

Mining Big Data with RDF Graph Technology · Conference Session: Mining Big Data with RDF Graph Technology: Discovering What You Didn’t Know Moscone South – 200 3:15pm-4:15pm

Graph Based Processing of Big Images · 2020. 1. 3. · Big Data 4Vs of Big Data Volume, Variety, Velocity, Variability Big Data is mostly represented in blue color! Big Data ≠

Reality Distortion…€¦ · 01 AI-Powered Enterprise Knowledge Graph. Big Data Store. Graph Data Model. Intelligent Algorithms ...

Big Data & Social Graph: illycaffe Case Study

Visual Analytics Sandbox: A big data platform for ... · –Virtual big data environment for stream processing, graph analytics –Leverage stream processing, graph mining & visualization

Oracle Big Data Spatial and Graph - An Overviewdownload.oracle.com/otndocs/products/bigdata-spatialandgraph/...Title: Oracle Big Data Spatial and Graph - An Overview Author: Oracle

When Graph Meets Big Data: Opportunities and …hpgdmp.bsc.es/system/files/uploads/Xia_GraphInBigData...2016/11/13 · When Graph Meets Big Data: Opportunities and Challenges Yinglong

Graph Databases: Connecting the Dots in Big Data

Oracle big data spatial and graph

Oracle Big Data Spatial and Graph - Property Graph · platforms, Apache HBase and Oracle NoSQL Database, respectively running on the Oracle Big Data Appliance engineered system and

Mining Big Data with RDF Graph Technology

oracle big data property graph fo perf wpdownload.oracle.com/otndocs/products/bigdata... · source technologies for Oracle Big Data Spatial and Graph, such as Apache Hadoop, Lucene,

Big-Data Graph Knowledge Bases for Cyber Resilienceceur-ws.org/Vol-2040/paper2.pdf · Big-Data Graph Knowledge Bases for ... CyGraph leverages NoSQL graph database technology to capture