[263] s2graph large-scale-graph-database-with-hbase-2

S2Graph : A large-scale graph database with Hbase daumkakao

Transcript of [263] s2graph large-scale-graph-database-with-hbase-2

Page 1: [263] s2graph large-scale-graph-database-with-hbase-2

S2Graph : A large-scale graph database

with Hbase

daumkakao

Page 2: [263] s2graph large-scale-graph-database-with-hbase-2

2

Reference

1. HBase Conference 2015
   1. http://www.slideshare.net/HBaseCon/use-cases-session-5
   2. https://vimeo.com/128203919

2. Deview 2015

3. Apache Con BigData Europe
   1. http://sched.co/3ztM

Page 3: [263] s2graph large-scale-graph-database-with-hbase-2

3

Our Social Graph

[Diagram: a social graph whose edges carry a label and properties: Message, Write (length), Read, Coupon (price), Present (price: 3), Friend, Group (size: 6), Emoticon, Eat (rating), View (count), Play (level: 6), Style (share: 3), Advertise, Search (keyword), Listen (count), Like (count: 7), Comment, plus affinity edges.]

Page 4: [263] s2graph large-scale-graph-database-with-hbase-2

4

Our Social Graph

[Diagram: the same graph with example values.]

Edge labels and properties: Message (length: 9), Write (length: 3), Friend, Play (level: 6), Style (share: 3), Advertise (ctr: 0.32), Search (keyword: "HBase"), Listen (count: 6), Comment (length: 15), plus affinity edges with weights from 1 to 9.

Target vertices: Message ID : 201, Ad ID : 603, Music ID : 603, Item ID : 13, Post ID : 97, Game ID : 1984

Page 5: [263] s2graph large-scale-graph-database-with-hbase-2

5

Technical Challenges

1. Large social graph constantly changing

a. Scale

More than:
- social network: 10 billion edges, 200 million vertices, 50 million updates on existing edges
- user activities: over 1 billion new edges per day

Page 6: [263] s2graph large-scale-graph-database-with-hbase-2

6

Technical Challenges (cont)

2. Low latency for breadth first search traversal on connected data.

a. performance requirement

- peak graph-traversing queries per second: 20,000
- response time: 100ms

Page 7: [263] s2graph large-scale-graph-database-with-hbase-2

7

Technical Challenges (cont)

3. Realtime update capabilities for viral effects

[Diagram: Person A (Post) -> Person B (Comment) -> Person C (Sharing) -> Person D (Mention), with each hop labeled "Fast".]

Page 8: [263] s2graph large-scale-graph-database-with-hbase-2

8

Technical Challenges (cont)

4. Support for Dynamic Ranking logic

a. Push strategy: Hard to change data ranking logic dynamically.

b. Pull strategy: Enables user to try out various data ranking logics.

Page 9: [263] s2graph large-scale-graph-database-with-hbase-2

9

Before

Each app server should know each DB’s sharding logic. Highly inter-connected architecture

Friend relationship, SNS feeds, Blog user activities, Messaging

Messaging App

SNS App

Blog App

Page 10: [263] s2graph large-scale-graph-database-with-hbase-2

10

After

SNS App

Blog App

Messaging App

S2Graph DB

stateless app servers

Page 11: [263] s2graph large-scale-graph-database-with-hbase-2

What is S2Graph?


Page 12: [263] s2graph large-scale-graph-database-with-hbase-2

12

What is S2Graph?

Storage-as-a-Service + Graph API = Realtime Breadth First Search

Page 13: [263] s2graph large-scale-graph-database-with-hbase-2

13

Example: Messenger Data Model

Participates

Chat Room

Message 1

Message 1

Message 1

Contains

Recent messages in my chat rooms.

SELECT b.* FROM user_chat_rooms a, chat_room_messages b WHERE a.user_id = 1 AND a.chat_room_id = b.chat_room_id AND b.created_at >= yesterday

Page 14: [263] s2graph large-scale-graph-database-with-hbase-2

14

Example: Messenger Data Model

Participates

Chat Room

Message 1

Message 1

Message 1

Contains

Recent messages in my chat rooms.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_chat_rooms", "direction": "out", "limit": 100}],
    [{"label": "chat_room_messages", "direction": "out", "limit": 10, "where": "created_at >= yesterday"}]
  ]
}
'
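The request body above follows a regular shape: one list of source vertices plus one list per traversal step. As a minimal sketch, the same body can be assembled programmatically. The helper name `build_query` is hypothetical and not part of S2Graph; only the field names come from the slide.

```python
import json


def build_query(user_id, steps):
    """Assemble a getEdges request body shaped like the slide's example.

    `build_query` is a hypothetical helper; the keys mirror the curl body.
    """
    return {
        "srcVertices": [
            {"serviceName": "s2graph", "columnName": "user_id", "id": user_id}
        ],
        # Each step is a list of query params; here one param per step.
        "steps": [[step] for step in steps],
    }


body = build_query(
    1,
    [
        {"label": "user_chat_rooms", "direction": "out", "limit": 100},
        {
            "label": "chat_room_messages",
            "direction": "out",
            "limit": 10,
            "where": "created_at >= yesterday",
        },
    ],
)

# Serializing with json.dumps guarantees straight quotes and valid JSON.
payload = json.dumps(body, indent=2)
print(payload)
```

Generating the payload this way avoids the hand-editing mistakes (curly quotes, comments, trailing commas) that make a JSON body unparseable.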

Page 15: [263] s2graph large-scale-graph-database-with-hbase-2

15

Example: News Feed (cont)

Friends

Post 1

Post 2

Post 3

create/like/share posts

Posts that my friends interacted with.

SELECT a.*, b.* FROM friends a, user_posts b WHERE a.user_id = b.user_id AND b.updated_at >= yesterday AND b.action_type IN ('create', 'like', 'share')

Page 16: [263] s2graph large-scale-graph-database-with-hbase-2

16

Example: News Feed (cont)

Friends

Post 1

Post 2

Post 3

create/like/share posts

Posts that my friends interacted with.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "friends", "direction": "out", "limit": 100}],
    [{"label": "user_posts", "direction": "out", "limit": 10, "where": "created_at >= yesterday"}]
  ]
}
'

Page 17: [263] s2graph large-scale-graph-database-with-hbase-2

17

Example: Recommendation(User-based CF) (cont)

Similar Users

Product 1

Product 2

Product 3

user-product interaction (click/buy/like/share)

Products that similar users interacted with recently.

SELECT a.*, b.* FROM similar_users a, user_products b WHERE a.sim_user_id = b.user_id AND b.updated_at >= yesterday

Batch

Page 18: [263] s2graph large-scale-graph-database-with-hbase-2

18

Example: Recommendation(User-based CF) (cont)

Products that similar users interacted with recently.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "filterOut": {
    "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
    "steps": [[{"label": "user_products_interact"}]]
  },
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "similar_users", "direction": "out", "limit": 100, "where": "similarity > 0.2"}],
    [{"label": "user_products_interact", "direction": "out", "limit": 10, "where": "created_at >= yesterday and price >= 1000"}]
  ]
}
'

Similar Users

Product 1

Product 2

Product 3

user-product interaction(click/buy/like/share)

Batch
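The `filterOut` block in the query above reads as a second traversal whose results are subtracted from the main one, so that products the user already interacted with are not recommended again. That reading is an assumption drawn from the slide, not from S2Graph documentation. A minimal in-memory sketch of the idea, with hypothetical vertex names and a hypothetical `traverse` helper:

```python
def traverse(graph, start, labels):
    """Follow one edge label per step from `start`; return the final frontier."""
    frontier = {start}
    for label in labels:
        frontier = {
            dst for src in frontier for dst in graph.get((src, label), [])
        }
    return frontier


# Toy adjacency keyed by (source vertex, edge label); all data is made up.
graph = {
    ("user:1", "similar_users"): ["user:2", "user:3"],
    ("user:1", "user_products_interact"): ["product:A"],
    ("user:2", "user_products_interact"): ["product:A", "product:B"],
    ("user:3", "user_products_interact"): ["product:C"],
}

# Main query: similar users, then the products they interacted with.
candidates = traverse(graph, "user:1", ["similar_users", "user_products_interact"])
# filterOut query: products user 1 already interacted with.
already_seen = traverse(graph, "user:1", ["user_products_interact"])

recommended = sorted(candidates - already_seen)
print(recommended)
```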

Page 19: [263] s2graph large-scale-graph-database-with-hbase-2

19

Example: Recommendation(Item-based CF) (cont)

Similar Products

Product 1

Product2

Product 3

user-product interaction(click/buy/like/share)

Product 1

Product 1

Product 1

Products that are similar to what I have been interested in.

SELECT a.*, b.* FROM similar_ a, user_products b WHERE a.sim_user_id = b.user_id AND b.updated_at >= yesterday

Batch

Page 20: [263] s2graph large-scale-graph-database-with-hbase-2

20

Example: Recommendation(Item-based CF) (cont)

Products that are similar to what I have been interested in.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "similar_products", "direction": "out", "limit": 10, "where": "similarity > 0.2"}]
  ]
}
'

Similar Products

Product 1

Product2

Product 3

user-product interaction(click/buy/like/share)

Product 1

Product 1

Product 1

Batch

Page 21: [263] s2graph large-scale-graph-database-with-hbase-2

21

Example: Recommendation(Content + Most popular) (cont)

TopK(k=1) product per timeUnit(day)

Product1

Product2

Product 3

user-product interaction(click/buy/like/share)

Daily top product per category among the products that I liked.

SELECT c.* FROM user_products a, product_categories b, category_daily_top_products c WHERE a.user_id = 1 AND a.product_id = b.product_id AND b.category_id = c.category_id AND c.time BETWEEN yesterday AND today

Category1

Category2

Product10

Product20

Product20

Today

Product10 Yesterday

Today

Yesterday

Page 22: [263] s2graph large-scale-graph-database-with-hbase-2

22

Example: Recommendation(Content + Most popular) (cont)

Daily top product per category among the products that I liked.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "product_cates", "direction": "out", "limit": 3}],
    [{"label": "category_products_topK", "direction": "out", "limit": 10}]
  ]
}
'

TopK(k=1) product per timeUnit(day)

Product1

Product2

Product 3

user-product interaction(click/buy/like/share)

Category1

Category2

Product10

Product20

Product20

Today

Product10 Yesterday

Today

Yesterday

Page 23: [263] s2graph large-scale-graph-database-with-hbase-2

23

Example: Recommendation(Spreading Activation) (cont)

Product 1

Product2

Product 3

user-product interaction(click/buy/like/share)

Products interacted with by users who interacted with the products that I interacted with.

SELECT b.product_id, count(*) FROM user_products a, user_products b WHERE a.user_id = 1 AND a.product_id = b.product_id GROUP BY b.product_id

Page 24: [263] s2graph large-scale-graph-database-with-hbase-2

24

Example: Recommendation(Spreading Activation) (cont)

Product 1

Product2

Product 3

user-product interaction(click/buy/like/share)

Products interacted with by users who interacted with the products that I interacted with.

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "user_products_interact", "direction": "in", "limit": 10, "where": "created_at >= today"}],
    [{"label": "user_products_interact", "direction": "out", "limit": 10, "where": "created_at >= 1 hour ago"}]
  ]
}
'
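The three steps above (out, in, out over the same label) can be illustrated on a toy data set. This is a sketch of the spreading-activation idea only: the data and the function name are hypothetical, the `where` and `limit` clauses are ignored, and skipping the starting user on the way back is a choice made here rather than something the slide states.

```python
from collections import Counter

# user -> products the user interacted with (made-up data)
interactions = {
    "u1": ["p1", "p2"],
    "u2": ["p1", "p3"],
    "u3": ["p2", "p3", "p4"],
}


def spreading_activation(interactions, me):
    """Count how many user -> product -> user -> product paths reach each product."""
    # Invert the mapping so the "in" step can go from product to users.
    users_of = {}
    for user, products in interactions.items():
        for product in products:
            users_of.setdefault(product, []).append(user)

    scores = Counter()
    for product in interactions[me]:          # step 1: out
        for other in users_of[product]:       # step 2: in
            if other == me:
                continue                      # do not walk back through myself
            for candidate in interactions[other]:  # step 3: out
                scores[candidate] += 1
    return scores


scores = spreading_activation(interactions, "u1")
print(scores.most_common())
```

Products reached along more paths score higher, which mirrors the `count(*) ... GROUP BY b.product_id` in the SQL form of the query.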

Page 25: [263] s2graph large-scale-graph-database-with-hbase-2

25

Realization

1. These examples resemble graphs.
2. An object is a Vertex, a relationship is an Edge.
3. Necessary API: breadth first search on a large-scale graph.

Page 26: [263] s2graph large-scale-graph-database-with-hbase-2

26

S2Graph API: Vertex

Vertex:
1. insert, delete, getVertex
2. vertex id: whatever the user provides (string/int/long)

ID    | 1231-123
Prop1 | Val1
Prop2 | Val2
…     | …

Page 27: [263] s2graph large-scale-graph-database-with-hbase-2

27

S2Graph API: Edge

Edges:

1. Insert, delete, update, getEdge(like CRUD in RDBMS)

2. Edge reference: (from, to, label, direction)

3. Multiple props on edge.

4. All edges are ordered (details follow).

Edge Reference | 1,101,"friend","out"
Prop1          | Val1
Prop2          | Val2
…              | …

Page 28: [263] s2graph large-scale-graph-database-with-hbase-2

28

S2Graph API: Query

Query: getEdges, countEdges, removeEdges

class Query { // Define breadth first search
  List[VertexId] startVertices;
  List[Step] steps;
}
class Step { // Define one breadth
  List[QueryParam] queryParams;
}
class QueryParam { // Define each edge to traverse for the current breadth
  String label;
  String direction;
  Map options;
}

QueryParam

Step1 Step2

Query
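To make the Query / Step / QueryParam structure concrete, here is a minimal in-memory sketch of how such a query could drive a breadth first search. It is not S2Graph's implementation: the `run` function, the adjacency dictionary, and the use of a `"limit"` key inside `options` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class QueryParam:
    label: str
    direction: str = "out"
    options: Dict[str, int] = field(default_factory=dict)


@dataclass
class Step:
    query_params: List[QueryParam]


@dataclass
class Query:
    start_vertices: List[int]
    steps: List[Step]


# Toy storage: (vertex, label, direction) -> ordered neighbor ids.
Adjacency = Dict[Tuple[int, str, str], List[int]]


def run(query: Query, edges: Adjacency) -> List[int]:
    """Expand the frontier once per step, applying each param's limit."""
    frontier = list(query.start_vertices)
    for step in query.steps:
        next_frontier: List[int] = []
        for vertex in frontier:
            for param in step.query_params:
                neighbors = edges.get((vertex, param.label, param.direction), [])
                limit = param.options.get("limit", len(neighbors))
                next_frontier.extend(neighbors[:limit])
        # Drop duplicates while keeping first-seen order.
        frontier = list(dict.fromkeys(next_frontier))
    return frontier


edges = {
    (1, "friend", "out"): [101, 102, 103],
    (101, "friend", "out"): [201, 202],
    (102, "friend", "out"): [202, 203],
    (103, "friend", "out"): [204],
}
query = Query(
    start_vertices=[1],
    steps=[
        Step([QueryParam("friend", "out", {"limit": 2})]),
        Step([QueryParam("friend", "out", {"limit": 10})]),
    ],
)
result = run(query, edges)
print(result)
```

With a limit of 2 on the first step, only vertices 101 and 102 are expanded, so 204 (reachable only through 103) never appears in the result.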

Page 29: [263] s2graph large-scale-graph-database-with-hbase-2

29

S2Graph API: indices

Row key         | Degree | Q1    | Q2    | Q3
1-friend-out-PK | 3      | c-103 | b-102 | a-101

[Diagram: vertex 1 with "friend" out-edges to 101 (Name: a), 102 (Name: b), and 103 (Name: c); edges are ordered (DESC).]

Indices:
1. addIndex, createIndex
2. Automatically keep edges ordered for multiple indices.
3. Support int/long/float/string data types.

class Index { // define how to order edges.
  String indexName;
  List[Prop] indexProps;
}
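The row in the indices slide (degree 3, then c-103, b-102, a-101) suggests that edges are kept sorted by the indexed property at write time, so a read with a limit needs no sorting. A minimal sketch of that idea follows, using Python's `bisect` on an in-memory list. The `OrderedEdges` class is hypothetical and says nothing about how S2Graph lays out HBase rows.

```python
import bisect


class OrderedEdges:
    """Keep (index value, target id) pairs sorted on every insert."""

    def __init__(self):
        self._keys = []  # ascending order; read in reverse for DESC

    def insert(self, value, target):
        # insort places the new pair at its sorted position.
        bisect.insort(self._keys, (value, target))

    @property
    def degree(self):
        return len(self._keys)

    def top(self, limit):
        """Return the first `limit` edges in descending index order."""
        ordered = [f"{value}-{target}" for value, target in reversed(self._keys)]
        return ordered[:limit]


friends = OrderedEdges()
# Insert in arbitrary order; the structure keeps them sorted by Name.
for name, target in [("a", 101), ("c", 103), ("b", 102)]:
    friends.insert(name, target)

print(friends.degree, friends.top(3))
```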

Page 30: [263] s2graph large-scale-graph-database-with-hbase-2

30

What is S2Graph

Storage-as-a-Service + Graph API = Realtime Breadth First Search

S2Graph is Not

- It does not support global computation (unlike Apache Giraph or GraphX).
- It does not support graph algorithms such as PageRank or shortest path.

Page 31: [263] s2graph large-scale-graph-database-with-hbase-2

31

Why S2Graph: Push vs Pull. Feeds with Push

1. Only the timestamp can be used for scoring.
2. Hard to change the scoring function dynamically.

[Diagram: a Post/Like is written (fan-out) to each friend's Feed Queue.]

Write: # of friends
Read: O(1) for friends
Storage: AVG(# of friends) * total user activity
Query: O(1)

Page 32: [263] s2graph large-scale-graph-database-with-hbase-2

32

Why S2Graph: Push vs Pull. Feeds with Pull

1. Different weights for different action types: Like = 0.8, Click = 0.1, …
2. The client can change scoring dynamically.

[Diagram: a Post/Like is written once; Friends read it at query time.]

Write: O(1)
Read: None
Storage: total user activity
Query: O(1) for friends + O(# of friends)
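Because the pull model scores activities at read time, the weights can change per request without rewriting any stored data. A minimal sketch under that reading: the Like = 0.8 and Click = 0.1 weights come from the slide, while the `rank` function and the sample activities are hypothetical.

```python
def rank(activities, weights):
    """Sum per-action weights for each post and return posts by score, highest first."""
    scores = {}
    for post, action in activities:
        scores[post] = scores.get(post, 0.0) + weights.get(action, 0.0)
    # Sort by descending score, breaking ties by post id for stable output.
    return sorted(scores, key=lambda post: (-scores[post], post))


# Friends' recent activities, fetched at query time (made-up data).
activities = [
    ("post1", "click"),
    ("post2", "like"),
    ("post1", "click"),
    ("post3", "click"),
]

like_heavy = rank(activities, {"like": 0.8, "click": 0.1})
click_heavy = rank(activities, {"like": 0.1, "click": 0.8})
print(like_heavy)
print(click_heavy)
```

The same stored activities yield two different feeds, which is the flexibility the push model lacks once items have been fanned out in timestamp order.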

Page 33: [263] s2graph large-scale-graph-database-with-hbase-2

33

Why S2Graph: S2Graph Supports Pull + Push

Pull >> push only if

1. fast response time: 10 ~ 100ms
2. throughput: 10K ~ 20K QPS

S2Graph provides linear scalability on

1. number of machines.
2. BFS search space (how many edges a single query will traverse).

More detail in the benchmark section later.

Page 34: [263] s2graph large-scale-graph-database-with-hbase-2

34

Why S2Graph: Simplify Data Flow

[Diagram: S2Graph (Write API + Query DSL; open sourced) produces a WAL log consumed by Apache Spark (batch computing layer) jobs: User/Item Similarity, TopK Counter, and others. Their output is loaded back through the S2Graph Bulk Loader (will be open sourced soon).]

Page 35: [263] s2graph large-scale-graph-database-with-hbase-2

35

Why S2Graph: Built in A/B test

1. Register a query template: each query template has an impressionId.
2. Insert click/impression events into S2Graph as edge inserts.
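Once clicks and impressions are stored as edges tagged with an impressionId, comparing query templates reduces to counting edges per template. A minimal sketch of that comparison follows. The event tuples, template names, and the `ctr_by_template` function are hypothetical, and the slide does not say how S2Graph itself aggregates these edges.

```python
from collections import Counter

# Each event is an edge: (user, impression_id of the query template, event type).
events = [
    ("u1", "template-A", "impression"),
    ("u2", "template-A", "impression"),
    ("u3", "template-A", "impression"),
    ("u4", "template-A", "impression"),
    ("u2", "template-A", "click"),
    ("u5", "template-B", "impression"),
    ("u6", "template-B", "impression"),
    ("u6", "template-B", "click"),
]


def ctr_by_template(events):
    """Return click-through rate (clicks / impressions) per impression id."""
    impressions = Counter()
    clicks = Counter()
    for _user, impression_id, kind in events:
        if kind == "impression":
            impressions[impression_id] += 1
        elif kind == "click":
            clicks[impression_id] += 1
    return {key: clicks[key] / count for key, count in impressions.items()}


ctr = ctr_by_template(events)
print(ctr)
```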

Page 36: [263] s2graph large-scale-graph-database-with-hbase-2

36

Why S2Graph: Just Insert Edge

S2Graph

1. user activity history.
2. friends feed.
3. user-item based collaborative filtering.
4. topK ranking (most popular, segmented most popular).

And many more. Just think of your service as a graph model.

Page 37: [263] s2graph large-scale-graph-database-with-hbase-2

S2Graph Internal


Page 38: [263] s2graph large-scale-graph-database-with-hbase-2

38

Detail: previous talk on HBaseCon 2015

1. https://vimeo.com/128203919
2. http://www.slideshare.net/HBaseCon/use-cases-session-5

Page 39: [263] s2graph large-scale-graph-database-with-hbase-2

Benchmarks


Page 40: [263] s2graph large-scale-graph-database-with-hbase-2

40

HBase Table Configuration

1. setDurability(Durability.ASYNC_WAL)

2. setCompressionType(Compression.Algorithm.LZ4)

3. setBloomFilterType(BloomType.ROW)

4. setDataBlockEncoding(DataBlockEncoding.FAST_DIFF)

5. setBlockSize(32768)

6. setBlockCacheEnabled(true)

7. Pre-split by (Integer.MAX_VALUE / regionCount), with regionCount = 120 at table creation (on 20 region servers).
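The pre-split rule in item 7 amounts to cutting the non-negative 32-bit integer range into equal slices. A small sketch of the arithmetic, assuming the boundaries are plain integer multiples of the slice width; how the boundaries are encoded into HBase row-key bytes is not stated on the slide, and `split_points` is a hypothetical name.

```python
INT_MAX = 2**31 - 1  # value of Integer.MAX_VALUE


def split_points(region_count):
    """Return the region_count - 1 boundaries that divide [0, INT_MAX] evenly."""
    step = INT_MAX // region_count
    return [step * i for i in range(1, region_count)]


splits = split_points(120)
# 120 regions need 119 boundaries between them.
print(len(splits), splits[0], splits[-1])
```

With 120 regions on 20 region servers, each server starts out hosting about 6 regions.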

Page 41: [263] s2graph large-scale-graph-database-with-hbase-2

41

HBase Cluster Configuration

• each machine: 8core, 32G memory, SSD

• hfile.block.cache.size: 0.6

• hbase.hregion.memstore.flush.size: 128MB

• otherwise use default value from CDH 5.3.1

• s2graph rest server: 4core, 16G memory

Page 42: [263] s2graph large-scale-graph-database-with-hbase-2

42

Performance

1. Total # of edges: 100,000,000,000 (100,000,000 rows x 1,000 columns)
2. Test environment
a. Zookeeper servers: 3
b. HBase Master servers: 2
c. HBase Region servers: 20
d. App server: 4 cores, 16GB RAM
e. Write traffic: 5K / second

Page 43: [263] s2graph large-scale-graph-database-with-hbase-2

43

Performance

2. Linear scalability

- Benchmark query: src.out("friend").limit(100).out("friend").limit(10)
- Total concurrency: 20 * # of app servers

# of app servers | QPS (queries per second) | Latency (ms)
1                | 464                      | 43
2                | 885                      | 45
4                | 1,763                    | 45
8                | 3,491                    | 46
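The linear-scalability figures can be sanity-checked with Little's law: with a fixed number of in-flight requests, throughput is roughly concurrency divided by latency. The sketch below applies that to the slide's numbers (QPS of 464, 885, 1,763, 3,491 at latencies of 43, 45, 45, 46 ms for 1, 2, 4, 8 app servers, with 20 concurrent requests per server). The check is an illustration added here, not part of the original benchmark.

```python
CONCURRENCY_PER_SERVER = 20  # "Total concurrency: 20 * # of app server"

# (app servers, reported QPS, reported latency in ms) from the slide
results = [
    (1, 464, 43),
    (2, 885, 45),
    (4, 1763, 45),
    (8, 3491, 46),
]

relative_errors = []
for servers, reported_qps, latency_ms in results:
    # Little's law: throughput = in-flight requests / time per request.
    predicted_qps = CONCURRENCY_PER_SERVER * servers / (latency_ms / 1000)
    error = abs(predicted_qps - reported_qps) / reported_qps
    relative_errors.append(error)
    print(f"{servers} server(s): predicted {predicted_qps:.0f}, reported {reported_qps}")
```

Latency stays nearly flat while QPS doubles with each doubling of app servers, which is what linear scaling looks like under a fixed per-server concurrency.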

Page 44: [263] s2graph large-scale-graph-database-with-hbase-2

Performance

3. Varying width of traverse (tested with a single server)

- Benchmark query: src.out("friend").limit(x).out("friend").limit(10)
- Total concurrency = 20 * 1 (# of app servers)

Limit on first step | QPS   | Latency (ms)
20                  | 1,821 | 11
40                  | 1,023 | 19
80                  | 570   | 35
200                 | 237   | 84
400                 | 122   | 164
800                 | 61    | 327

Page 45: [263] s2graph large-scale-graph-database-with-hbase-2

45

Performance

4. Different query path (different I/O pattern)

- All queries touch 1,000 edges.
- Each step's limit is on the x axis.
- Performance can be estimated from a given query's search space.

Limits on path         | QPS   | Latency (ms)
10 -> 100              | 695   | 14
100 -> 10              | 435.3 | 23
10 -> 10 -> 10         | 274.4 | 36
2 -> 5 -> 10 -> 10     | 292.1 | 34
2 -> 5 -> 2 -> 5 -> 10 | 307.5 | 32

Page 46: [263] s2graph large-scale-graph-database-with-hbase-2

46

Performance

5. Write throughput per operation on a single app server

Insert operation

[Chart: latency (y axis, 0 to 5) against requests per second (x-axis ticks: 8000, 16000, 800000).]

Page 47: [263] s2graph large-scale-graph-database-with-hbase-2

47

Performance

6. Write throughput per operation on a single app server

Update (increment/update/delete) operation

[Chart: latency (y axis, 0 to 8) against requests per second (x-axis ticks: 2000, 4000, 6000).]

Page 48: [263] s2graph large-scale-graph-database-with-hbase-2

48

Stats

1. HBase cluster per IDC (2 IDC)
- 3 Zookeeper servers
- 2 HBase Masters
- (20 + 40) HBase Slaves

2. App servers per IDC
- 10 servers for write only
- 30 servers for query only

3. Real traffic
- read: 10K ~ 20K requests per second
  - now mostly 2-step queries with limit 100 on the first step.
- write: over 5K ~ 10K requests per second

Page 49: [263] s2graph large-scale-graph-database-with-hbase-2

49

Page 50: [263] s2graph large-scale-graph-database-with-hbase-2

50

Through S2Graph !

Page 51: [263] s2graph large-scale-graph-database-with-hbase-2

51

Now Available as Open Source
- https://github.com/daumkakao/s2graph
- Finding contributors and mentors

Contact
- Doyoung Yoon : [email protected]