Handling Billions of Edges in a Graph Database

58
www.arangodb.com Handling Billions Of Edges in a Graph Database Michael Hackstein @mchacki New Technology

Transcript of Handling Billions of Edges in a Graph Database

Page 1: Handling Billions of Edges in a Graph Database

www.arangodb.com

Handling Billions Of Edges in a Graph Database

Michael Hackstein @mchacki

New Technology

Page 2: Handling Billions of Edges in a Graph Database

Michael Hackstein

‣ ArangoDB Core Team ‣ Web Frontend ‣ Graph visualisation ‣ Graph features ‣ SmartGraphs

‣ Host of cologne.js

‣ Master’s Degree(spec. Databases and Information Systems)

2

Page 3: Handling Billions of Edges in a Graph Database

What are Graph Databases

‣ Schema-free Objects (Vertices)‣ Relations between them (Edges)‣ Edges have a direction

3

{ name: "alice", age: 32 }

{ name: "bob", age: 35, size: 1,73m }

{ name: "fishing" }

{ name: "reading" }

{ name: "dancing" }

mar

ried

hobby

hobby

hobby

hobby

Page 4: Handling Billions of Edges in a Graph Database

What are Graph Databases

‣ Schema-free Objects (Vertices)‣ Relations between them (Edges)‣ Edges have a direction

‣ Edges can be queried in both directions

‣ Easily query a range of edges (2 to 5)

‣ Undefined number of edges (1 to *)‣ Shortest Path between two vertices

3

{ name: "alice", age: 32 }

{ name: "bob", age: 35, size: 1,73m }

{ name: "fishing" }

{ name: "reading" }

{ name: "dancing" }

mar

ried

hobby

hobby

hobby

hobby

Page 5: Handling Billions of Edges in a Graph Database

Typical Graph Queries

4

Page 6: Handling Billions of Edges in a Graph Database

Typical Graph Queries

‣ Give me all friends of Alice

4

Page 7: Handling Billions of Edges in a Graph Database

Typical Graph Queries

‣ Give me all friends of Alice‣ Give me all friends-of-friends of Alice

4

Page 8: Handling Billions of Edges in a Graph Database

Typical Graph Queries

‣ Give me all friends of Alice‣ Give me all friends-of-friends of Alice‣ What is the linking path between Alice and Bob

4

Page 9: Handling Billions of Edges in a Graph Database

Typical Graph Queries

‣ Give me all friends of Alice‣ Give me all friends-of-friends of Alice‣ What is the linking path between Alice and Bob‣ Which Trainstations can I reach if I am allowed to drive a distance of 6 stations on my

ticket

4

Page 10: Handling Billions of Edges in a Graph Database

Typical Graph Queries

‣ Give me all friends of Alice‣ Give me all friends-of-friends of Alice‣ What is the linking path between Alice and Bob‣ Which Trainstations can I reach if I am allowed to drive a distance of 6 stations on my

ticket‣ Pattern Matching:

4

Page 11: Handling Billions of Edges in a Graph Database

Typical Graph Queries

‣ Give me all friends of Alice‣ Give me all friends-of-friends of Alice‣ What is the linking path between Alice and Bob‣ Which Trainstations can I reach if I am allowed to drive a distance of 6 stations on my

ticket‣ Pattern Matching:‣ Give me all users that share two hobbies with Alice

4

Page 12: Handling Billions of Edges in a Graph Database

Typical Graph Queries

‣ Give me all friends of Alice‣ Give me all friends-of-friends of Alice‣ What is the linking path between Alice and Bob‣ Which Trainstations can I reach if I am allowed to drive a distance of 6 stations on my

ticket‣ Pattern Matching:‣ Give me all users that share two hobbies with Alice‣ Give me all products that at least one of my friends has bought together with the

products I already own, ordered by how many friends have bought it and the products rating, but only 20 of them.

4

Page 13: Handling Billions of Edges in a Graph Database

Non-Typical Graph Queries

5

Page 14: Handling Billions of Edges in a Graph Database

Non-Typical Graph Queries

‣ Give me all users which have an age attribute between 21 and 35.

5

Page 15: Handling Billions of Edges in a Graph Database

Non-Typical Graph Queries

‣ Give me all users which have an age attribute between 21 and 35.‣ Give me the age distribution of all users

5

Page 16: Handling Billions of Edges in a Graph Database

Non-Typical Graph Queries

‣ Give me all users which have an age attribute between 21 and 35.‣ Give me the age distribution of all users‣ Group all users by their name

5

Page 17: Handling Billions of Edges in a Graph Database

Traversal

6

Iterate down two edges with some filters

Page 18: Handling Billions of Edges in a Graph Database

Traversal

‣ We first pick a start vertex (S)

6

S

Iterate down two edges with some filters

Page 19: Handling Billions of Edges in a Graph Database

Traversal

‣ We first pick a start vertex (S)‣ We collect all edges on S

6

S

A BC

Iterate down two edges with some filters

Page 20: Handling Billions of Edges in a Graph Database

Traversal

‣ We first pick a start vertex (S)‣ We collect all edges on S‣ We apply filters on edges

6

S

A BC

Iterate down two edges with some filters

Page 21: Handling Billions of Edges in a Graph Database

Traversal

‣ We first pick a start vertex (S)‣ We collect all edges on S‣ We apply filters on edges‣ We iterate down one of the new vertices (A)

6

S

A BC

DE

Iterate down two edges with some filters

Page 22: Handling Billions of Edges in a Graph Database

Traversal

‣ We first pick a start vertex (S)‣ We collect all edges on S‣ We apply filters on edges‣ We iterate down one of the new vertices (A)‣ We apply filters on edges

6

S

A BC

DE

Iterate down two edges with some filters

Page 23: Handling Billions of Edges in a Graph Database

Traversal

‣ We first pick a start vertex (S)‣ We collect all edges on S‣ We apply filters on edges‣ We iterate down one of the new vertices (A)‣ We apply filters on edges‣ The next vertex (E) is in desired depth. Return the path

S -> A -> E

6

S

A BC

DE

Iterate down two edges with some filters

Page 24: Handling Billions of Edges in a Graph Database

Traversal

‣ We first pick a start vertex (S)‣ We collect all edges on S‣ We apply filters on edges‣ We iterate down one of the new vertices (A)‣ We apply filters on edges‣ The next vertex (E) is in desired depth. Return the path

S -> A -> E‣ Go back to the next unfinished vertex (B)

6

S

A BC

DE

Iterate down two edges with some filters

Page 25: Handling Billions of Edges in a Graph Database

Traversal

‣ We first pick a start vertex (S)‣ We collect all edges on S‣ We apply filters on edges‣ We iterate down one of the new vertices (A)‣ We apply filters on edges‣ The next vertex (E) is in desired depth. Return the path

S -> A -> E‣ Go back to the next unfinished vertex (B)‣ We iterate down on (B)

6

S

A BC

DE

Iterate down two edges with some filters

F

Page 26: Handling Billions of Edges in a Graph Database

Traversal

‣ We first pick a start vertex (S)‣ We collect all edges on S‣ We apply filters on edges‣ We iterate down one of the new vertices (A)‣ We apply filters on edges‣ The next vertex (E) is in desired depth. Return the path

S -> A -> E‣ Go back to the next unfinished vertex (B)‣ We iterate down on (B)‣ We apply filters on edges

6

S

A BC

DE

Iterate down two edges with some filters

F

Page 27: Handling Billions of Edges in a Graph Database

Traversal

‣ We first pick a start vertex (S)‣ We collect all edges on S‣ We apply filters on edges‣ We iterate down one of the new vertices (A)‣ We apply filters on edges‣ The next vertex (E) is in desired depth. Return the path

S -> A -> E‣ Go back to the next unfinished vertex (B)‣ We iterate down on (B)‣ We apply filters on edges‣ The next vertex (F) is in desired depth. Return the path

S -> B -> F

6

S

A BC

DE

Iterate down two edges with some filters

F

Page 28: Handling Billions of Edges in a Graph Database

Traversal - Complexity

‣ Once: ‣ Find the start vertex

‣ For every depth: ‣ Find all connected edges ‣ Filter non-matching edges ‣ Find connected vertices ‣ Filter non-matching vertices

7

Depends on indexes: Hash:

Edge-Index or Index-Free: Linear in edges: Depends on indexes: Hash: Linear in vertices:

Only one pass:

O

1

1 n n * 1 n 3n

Page 29: Handling Billions of Edges in a Graph Database

Traversal - Complexity

‣ Linear sounds evil? ‣ NOT linear in All Edges O(E) ‣ Only Linear in relevant Edges n < E

‣ Traversals solely scale with their result size. ‣ They are not effected at all by total amount of data ‣ BUT: Every depth increases the exponent: O(3 * n ) ‣ "7 degrees of separation": 3*n < E < 3*n

8

d

6 7

Page 30: Handling Billions of Edges in a Graph Database

‣ MULTI-MODEL database ‣ Stores Documents and Graphs

‣ Query language AQL ‣ Document Queries ‣ Graph Queries ‣ Joins ‣ All can be combined in the same statement

‣ ACID support including Multi Collection Transactions

9

Page 31: Handling Billions of Edges in a Graph Database

AQL

10

FOR user IN users RETURN user

Page 32: Handling Billions of Edges in a Graph Database

AQL

11

FOR user IN users FILTER user.name == "alice" RETURN user

Page 33: Handling Billions of Edges in a Graph Database

AQL

12

FOR user IN users FILTER user.name == "alice" FOR product IN OUTBOUND user has_bought RETURN product

Alice TVhas_bought

Page 34: Handling Billions of Edges in a Graph Database

AQL

13

FOR user IN users FILTER user.name == "alice" FOR recommendation, action, path IN 3 ANY user has_bought FILTER path.vertices[2].age <= user.age + 5 AND path.vertices[2].age >= user.age - 5 FILTER recommendation.price < 25 LIMIT 10 RETURN recommendation

Alice TVhas_bought Bob Playstationhas_boughthas_bought

alice.age - 5 <= bob.age && bob.age <= alice.age + 5 playstation.price < 25

Page 35: Handling Billions of Edges in a Graph Database

14

Demo Time Querying basics

Page 36: Handling Billions of Edges in a Graph Database

First Boost - Vertex Centric Indices

‣ Remember Complexity? O(3 * n ) ‣ Filtering of non-matching edges is linear for every depth

‣ Index all edges based on their vertices and arbitrary other attributes ‣ Find initial set of edges in identical time ‣ Less / No post-filtering required ‣ This decreases the n

15

d

Page 37: Handling Billions of Edges in a Graph Database

16

Demo Time Vertex-Centric Indices

Page 38: Handling Billions of Edges in a Graph Database

Scaling

‣ Vertex-Centric Indexes help with super-nodes ‣ But what if the graph is too large for one machine?

‣ Distribute graph on several machines (sharding) ‣ How to query it now? ‣ No global view of the graph possible any more ‣ What about edges between servers?

17

Page 39: Handling Billions of Edges in a Graph Database

18

First let's do the cluster thingy

Page 40: Handling Billions of Edges in a Graph Database

19

Page 41: Handling Billions of Edges in a Graph Database

20

Marathon

Page 42: Handling Billions of Edges in a Graph Database

21

Demo Time DC / OS

Page 43: Handling Billions of Edges in a Graph Database

Is Mesosphere required?

‣ ArangoDB can run clusters without it ‣ Setup Requires manual effort (can be scripted): ‣ Configure IP addresses ‣ Correct startup ordering

‣ This works: ‣ Automatic Failover (Follower takes over if leader dies) ‣ Rebalancing of shards ‣ Everything inside of ArangoDB

‣ This is based on Mesos: ‣ Complete self healing ‣ Automatic restart of ArangoDBs (on new machines)

➡ We suggest you have someone on call

22

Page 44: Handling Billions of Edges in a Graph Database

23

Now distribute the graph

Page 45: Handling Billions of Edges in a Graph Database

Dangers of Sharding

‣ Only parts of the graph on every machine ‣ Neighboring vertices may be on different machines ‣ Even edges could be on other machines than their vertices

‣ Queries need to be executed in a distributed way ‣ Result needs to be merged locally

24

Page 46: Handling Billions of Edges in a Graph Database

Random Distribution

‣ Disadvantages: ‣ Neighbors on different machines ‣ Probably edges on other machines than their

vertices ‣ A lot of network overhead is required for

querying

25

‣ Advantages: ‣ every server takes an equal portion of

data ‣ easy to realize ‣ no knowledge about data required ‣ always works

Page 47: Handling Billions of Edges in a Graph Database

Random Distribution

‣ Disadvantages: ‣ Neighbors on different machines ‣ Probably edges on other machines than their

vertices ‣ A lot of network overhead is required for

querying

25

‣ Advantages: ‣ every server takes an equal portion of

data ‣ easy to realize ‣ no knowledge about data required ‣ always works

Page 48: Handling Billions of Edges in a Graph Database

Index-Free Adjacency

26

‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to find edges ‣ How to shard this?

Page 49: Handling Billions of Edges in a Graph Database

Index-Free Adjacency

26

‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to find edges ‣ How to shard this?

Page 50: Handling Billions of Edges in a Graph Database

Index-Free Adjacency

26

‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to find edges ‣ How to shard this?

Page 51: Handling Billions of Edges in a Graph Database

Index-Free Adjacency

26

‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to find edges ‣ How to shard this?

????

Page 52: Handling Billions of Edges in a Graph Database

Index-Free Adjacency

‣ ArangoDB uses an hash-based EdgeIndex (O(1) - lookup) ‣ The vertex is independent of it's edges ‣ It can be stored on a different machine

26

‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to find edges ‣ How to shard this?

????

Page 53: Handling Billions of Edges in a Graph Database

Domain Based Distribution

‣ Many Graphs have a natural distribution ‣ By country/region for People ‣ By tags for Blogs ‣ By category for Products

‣ Most edges in same group ‣ Rare edges between groups

27

Page 54: Handling Billions of Edges in a Graph Database

Domain Based Distribution

‣ Many Graphs have a natural distribution ‣ By country/region for People ‣ By tags for Blogs ‣ By category for Products

‣ Most edges in same group ‣ Rare edges between groups

27

Page 55: Handling Billions of Edges in a Graph Database

Domain Based Distribution

‣ Many Graphs have a natural distribution ‣ By country/region for People ‣ By tags for Blogs ‣ By category for Products

‣ Most edges in same group ‣ Rare edges between groups

27

uses Domain Knowledgefor short-cuts

Page 56: Handling Billions of Edges in a Graph Database

28

Sneak Preview

SmartGraphs

Page 57: Handling Billions of Edges in a Graph Database

Benchmark Comparison

Source: https://www.arangodb.com/2015/10/benchmark-postgresql-mongodb-arangodb/

Page 58: Handling Billions of Edges in a Graph Database

Thank you

‣ Further questions? ‣ Follow us on twitter: @arangodb ‣ Join our slack: slack.arangodb.com

‣ Follow me on twitter/github: @mchacki ‣ Write me a mail: [email protected]

30