Social Hash - USENIX · Summary Assignment problems are common in distributed systems design...

Post on 01-Jun-2020

1 views 0 download

Transcript of Social Hash - USENIX · Summary Assignment problems are common in distributed systems design...

Social Hash:an Assignment Framework for Optimizing Distributed Systems Operations on Social Networks

Alon Shalita, Brian Karrer, Igor Kabiljo, Arun Sharma, Alessandro Presta, Aaron Adcock, Herald Kllapi, and Michael Stumm

March 2016

Assignment ProblemFront-end clusters

Cache

Point-of-Presence

(PoP)Cache

Cache

Alon’s HTTP requests

Igor’s HTTP requests

Assignment ProblemFront-end clusters

Cache

Cache

Cache

Point-of-Presence

(PoP)

Alon’s HTTP requests

Igor’s HTTP requests

Assignment Problem OptimizationFront-end clusters

Cache

Cache

Cache

Point-of-Presence

(PoP)

Solution RequirementsFront-end clusters

Cache

Cache

Cache

Balanced

Point-of-Presence

(PoP)

Solution RequirementsFront-end clusters

Cache

Cache

Cache

Adaptive

Point-of-Presence

(PoP)

Alon’s HTTP requests

Solution RequirementsFront-end clusters

Cache

Cache

Cache

Stable

Point-of-Presence

(PoP)

Alon’s HTTP requests

Solution RequirementsFront-end clusters

Cache

Cache

Cache

Fast decision

Point-of-Presence

(PoP)

Alon’s HTTP requests

Social Hash framework

Social Hash framework

!

Components!(e.g.,!compute!clusters!or!storage!subsystems)!

Groups!

Objects!(e.g.,!data!records!or!HTTP!requests)!

static!!

assignm

ent!

dynamic!!

assignm

ent!

Static assignment

▪ Goal: assign similar objects sent to the same group

▪ Data access pattern -> represent as graph -> graph partitioning

▪ Large-scale optimization: slow, time-consuming

Dynamic assignment

▪ Goal: adapt to maintain load balance by altering group -> component

▪ hardware changes

▪ dynamic workload

▪ addition and removal of objects

▪ Two-level framework separates optimization from adaptation

▪ Slow optimization -> static

▪ Fast adaptation -> dynamic

▪ Group-to-component ratio controls tradeoff

Social Hash framework!!!!!!!!!!!!!!!!!!!!!! !

Social!Hash!Tbl! Assignment!Tbl!

group!g"

Lookup!

Request!

Missing!key!!assignment!

key"

failed!

group!

c"

Graph!Partitioning!

graph!specifications! Dynamic!!

Assignment!

monitoring!info!operator!console!

key"

HTTP Request Routing

Social Hash for Facebook’s web routing

▪ Objects: HTTP request identified by user, Components: front-end clusters

▪ PoP: Dynamic assignment by hash ringFront-end clusters

Cache

Cache

Point-of-Presence

(PoP)

Edge locality for Facebook’s web routing

●●

●●

●●

●●

●●

0.00

0.25

0.50

0.75

1.00

10 100 1,000 10,000 100,000Number of groups

Edge

loca

lity

▪ Production routing: 21k groups for 10’s of front-end clusters

▪ Over half of friendships are within groups

▪ Updated on a weekly basis (~1% movement)

Live traffic experiment: TAO miss rate

−30

−20

−10

0

10

Sun

Mon

Tue

Wed Thu Fri

Sat

Sun

Mon

Tue

Wed Thu Fri

Sat

Day

Perc

enta

ge c

hang

e in

TAO

miss

rate

(%)▪ Orange: traffic shifts

▪ Red: duration of test

▪ Green: updated Social Hash table

Storage Sharding

Assignment Problem 2: Storage sharding

Arun’s query

Objects: data recordsComponents: storage machines

Assignment Problem 2: Storage sharding

Arun’s query

Objects: data recordsComponents: storage machines

Static assignment

▪ Minimize fanout through bipartite graph partitioning

▪ Graph contains recent queries and data records

▪ edge => query accesses data record

▪ Dotted: edge locality optimization

▪ Solid: fanout optimization●

10

20

30

40

50

3 10 30 100 300 1,000 3,000 10,000Number of groups

Aver

age

fano

ut

Storage sharding deployment

▪ Graph database with thousands of storage servers

▪ Group-to-component ratio of 8

▪ Static assignment every few months

▪ Results:

▪ Average latencies decreased by over 50%

▪ CPU utilization decreased by over 50%

Summary

▪ Assignment problems are common in distributed systems design

▪ Proposed Social Hash framework for solving assignment problems

▪ Two-level design optimizes performance with graph partitioning

▪ Two Facebook integrations in production for over a year

▪ HTTP Request Routing: > 25% reduction in TAO miss rate

▪ Storage Sharding: Latency and CPU utilization reduced by over 50%