Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform

Swaroop Jagadish

http://www.linkedin.com/in/swaroopjagadish

LinkedIn Confidential ©2013 All Rights Reserved



Outline

LinkedIn Data Ecosystem

Espresso: Design Points

Data Model and API

Architecture

Deep Dive: Fault Tolerance

Deep Dive: Secondary Indexing

Espresso In Production

Future work


The World’s Largest Professional Network

225M+ Members Worldwide

2 new Members Per Second

100M+ Monthly Unique Visitors

2M+ Company Pages

Connecting Talent with Opportunity. At scale…


LinkedIn Data Ecosystem


Espresso: Key Design Points

Source-of-truth

– Master-Slave, Timeline consistent

– Query-after-write

– Backup/Restore

– High Availability

Horizontally Scalable

Rich functionality

– Hierarchical data model

– Document oriented

– Transactions within a hierarchy

– Secondary Indexes


Espresso: Key Design Points

Agility

– No “pause the world” operations

– “On the fly” Schema Evolution

– Elasticity

Integration with the data ecosystem

– Change stream with freshness in O(seconds)

– ETL to Hadoop

– Bulk import

Modular and Pluggable

– Off-the-shelf: MySQL, Lucene, Avro


Data Model and API


Application View


key

value

REST API: /mailbox/msg_meta/bob/2

Partitioning


/mailbox/msg_meta/bob/2

MemberId is the partitioning key
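A minimal sketch of how a resource path could be routed by its partitioning key. The slide only says that the member id (here `bob`) is the partitioning key; the hash function, the partition count of 12, and the helper names `parse_resource` and `partition_for` are assumptions made for illustration.

```python
import hashlib

NUM_PARTITIONS = 12  # assumption: same partition count as the 12-partition diagrams in this talk


def parse_resource(path):
    """Split /<database>/<table>/<partition key>[/<subkey>...] into its parts."""
    parts = [p for p in path.split("/") if p]
    if len(parts) < 3:
        raise ValueError(f"expected /database/table/key[/subkey], got {path!r}")
    return {
        "database": parts[0],
        "table": parts[1],
        "partition_key": parts[2],
        "subkeys": parts[3:],
    }


def partition_for(partition_key, num_partitions=NUM_PARTITIONS):
    """Map a partition key to a partition id with a stable hash (SHA-256 is illustrative)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


resource = parse_resource("/mailbox/msg_meta/bob/2")
partition = partition_for(resource["partition_key"])
```

Because only the member id is hashed, every document under `/mailbox/msg_meta/bob/` lands in the same partition, which is what makes per-mailbox transactions and local secondary indexes possible.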

Document based data model

Richer than a plain key-value store

Hierarchical keys

Values are rich documents and may contain nested types


from : { name : "Chris", email : "[email protected]" }
subject : "Go Giants!"
body : "World Series 2012! w00t!"
unread : true

Messages

mailboxID : String
messageID : long

from : { name : String, email : String }
subject : String
body : String
unread : boolean

REST based API

• Secondary Index query – GET /MailboxDB/MessageMeta/bob/?query="+isUnread:true +isInbox:true"&start=0&count=15

• Partial updates – POST /MailboxDB/MessageMeta/bob/1
Content-Type: application/json
Content-Length: 21

{"unread" : "false"}

• Conditional operations – Get a message, only if recently updated
GET /MailboxDB/MessageMeta/bob/1
If-Modified-Since: Wed, 31 Oct 2012 02:54:12 GMT
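The three request shapes above can be sketched with Python's standard library. The requests are only constructed, not sent; `BASE_URL` is a hypothetical router address and the helper names are illustrative, not part of Espresso.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://espresso.example.com"  # hypothetical router address, not from the talk


def index_query(mailbox, query, start=0, count=15):
    """GET with a query string; quote() percent-encodes '+', ':' and spaces."""
    url = (f"{BASE_URL}/MailboxDB/MessageMeta/{mailbox}/"
           f"?query={quote(query)}&start={start}&count={count}")
    return Request(url, method="GET")


def partial_update(mailbox, message_id, fields):
    """POST only the fields that change, as a JSON body."""
    body = json.dumps(fields).encode("utf-8")
    return Request(f"{BASE_URL}/MailboxDB/MessageMeta/{mailbox}/{message_id}",
                   data=body, headers={"Content-Type": "application/json"}, method="POST")


def conditional_get(mailbox, message_id, since):
    """GET that asks for the document only if it changed after `since`."""
    return Request(f"{BASE_URL}/MailboxDB/MessageMeta/{mailbox}/{message_id}",
                   headers={"If-Modified-Since": since}, method="GET")


query_req = index_query("bob", "+isUnread:true +isInbox:true")
update_req = partial_update("bob", 1, {"unread": "false"})
cond_req = conditional_get("bob", 1, "Wed, 31 Oct 2012 02:54:12 GMT")
```

Percent-encoding the query matters here: an unencoded `+` in a URL query string is commonly decoded as a space, which would silently drop the "required term" markers.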


Transactional writes within a hierarchy

MessageCounter (before)

mboxId    value
George    { "numUnread": 2 }

Message

mboxId    msgId    value                       etag
George    0        {…, "unread": false, …}     7abf8091
George    1        {…, "unread": true, …}      b648bc5f
George    2        {…, "unread": true, …}      4fde8701

1. Read, record etags

/Message/George/0 {…, "unread": false, …} 7abf8091

2. Prepare after-image

/Message/George/0 {…, "unread": true, …}
/MessageCounter/George {…, "numUnread": "+1", …}

3. Update

MessageCounter (after)

mboxId    value
George    { "numUnread": 3 }
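The read / prepare / update sequence above resembles optimistic concurrency control. Below is a minimal in-memory sketch of that idea; the `Store` class, the content-hash etag, and `ConflictError` are all assumptions for illustration, since the talk does not describe how Espresso computes etags or reports conflicts.

```python
import hashlib
import json


class ConflictError(Exception):
    """Raised when a document changed between the read and the update."""


def etag_of(doc):
    # Illustrative etag: a short content hash. The real etag format is not described in the talk.
    return hashlib.sha256(json.dumps(doc, sort_keys=True).encode("utf-8")).hexdigest()[:8]


class Store:
    def __init__(self, rows):
        self.rows = dict(rows)  # resource key -> document

    def read(self, key):
        doc = self.rows[key]
        return dict(doc), etag_of(doc)

    def update(self, expected_etags, after_images):
        """Apply all after-images, or none if any recorded etag is stale."""
        for key, etag in expected_etags.items():
            if etag_of(self.rows[key]) != etag:
                raise ConflictError(key)
        self.rows.update(after_images)


def mark_unread(store, mailbox, message_id):
    msg_key, counter_key = f"/Message/{mailbox}/{message_id}", f"/MessageCounter/{mailbox}"
    # 1. Read, record etags
    msg, msg_etag = store.read(msg_key)
    counter, counter_etag = store.read(counter_key)
    # 2. Prepare after-image
    msg["unread"] = True
    counter["numUnread"] += 1
    # 3. Update both documents together
    store.update({msg_key: msg_etag, counter_key: counter_etag},
                 {msg_key: msg, counter_key: counter})


store = Store({
    "/Message/George/0": {"unread": False},
    "/MessageCounter/George": {"numUnread": 2},
})
mark_unread(store, "George", 0)
```

Both documents share the mailbox id `George`, so they live in the same partition and can be committed together; a stale etag causes the whole update to be rejected rather than half-applied.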

Espresso Architecture


Cluster Management and Fault Tolerance


Generic Cluster Manager: Apache Helix

Generic cluster management

– State model + constraints

– Ideal state of distribution of partitions across the cluster

– Migrate cluster from current state to ideal state

• More Info

• SoCC 2012

• http://helix.incubator.apache.org
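The three ideas above (a state model, an ideal state, and a migration from current to ideal) can be sketched in plain Python. This does not use the Helix API; the OFFLINE / SLAVE / MASTER chain, the round-robin placement, and all function names are assumptions chosen to keep the example small.

```python
ORDER = ["OFFLINE", "SLAVE", "MASTER"]  # replicas move one step at a time along this chain


def ideal_state(partitions, nodes, replicas=2):
    """Round-robin placement: the first replica is MASTER, the others SLAVE on other nodes."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    state = {}
    for i, partition in enumerate(partitions):
        state[partition] = {
            nodes[(i + r) % len(nodes)]: ("MASTER" if r == 0 else "SLAVE")
            for r in range(replicas)
        }
    return state


def steps(src, dst):
    """Single-step transitions that take one replica from state src to state dst."""
    i, j = ORDER.index(src), ORDER.index(dst)
    direction = 1 if j > i else -1
    return [(ORDER[k], ORDER[k + direction]) for k in range(i, j, direction)]


def transitions(current, ideal):
    """Transitions that migrate `current` to `ideal`; demotions are listed first."""
    demote, promote = [], []
    for partition in sorted(set(current) | set(ideal)):
        cur, want = current.get(partition, {}), ideal.get(partition, {})
        for node in sorted(set(cur) | set(want)):
            for a, b in steps(cur.get(node, "OFFLINE"), want.get(node, "OFFLINE")):
                target = promote if ORDER.index(b) > ORDER.index(a) else demote
                target.append((partition, node, a, b))
    return demote + promote


ideal = ideal_state(["P1", "P2", "P3"], ["N1", "N2", "N3"])
plan = transitions({"P1": {"N1": "MASTER", "N2": "SLAVE"}},
                   {"P1": {"N1": "SLAVE", "N2": "MASTER"}})
```

Listing demotions before promotions is a simple way to avoid a moment with two masters for one partition; a real controller would enforce such constraints per transition rather than by global ordering.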


Espresso Partition Layout: Master, Slave

3 Storage Engine nodes, 2-way replication

Apache Helix

Partition: P1, Node: 1 … Partition: P12, Node: 3

Database

Node: 1, M: P1 – Active …, S: P5 – Active …

Cluster

Node      Master            Slave
Node 1    P1 P2 P3 P4       P5 P6 P9 P10
Node 2    P5 P6 P7 P8       P1 P2 P11 P12
Node 3    P9 P10 P11 P12    P3 P4 P7 P8

Legend: Master, Slave, Offline

Cluster Management

Cluster Expansion

Node Failover

Cluster Expansion

Initial State with 3 Storage Nodes. Step 1: Compute new Ideal state

Helix

Partition: P1, Node: 1 … Partition: P12, Node: 3

Database

Node: 1, M: P1 – Active …, S: P5 – Active …

Cluster

Node      Master            Slave
Node 1    P1 P2 P3 P4       P5 P6 P9 P10
Node 2    P5 P6 P7 P8       P1 P2 P11 P12
Node 3    P9 P10 P11 P12    P3 P4 P7 P8
Node 4    (none)            (none)

Legend: Master, Slave, Offline

Cluster Expansion

Step 2: Bootstrap new node’s partitions by restoring from backups

Helix

Partition: P1, Node: 1 … Partition: P12, Node: 3

Database

Node: 1, M: P1 – Active …, S: P5 – Active …

Cluster

Node      Master            Slave
Node 1    P1 P2 P3 P4       P5 P6 P9 P10
Node 2    P5 P6 P7 P8       P1 P2 P11 P12
Node 3    P9 P10 P11 P12    P3 P4 P7 P8

Node 4 (restored from Snapshots): P4 P8 P12, P7 P9 P1

Legend: Master, Slave, Offline

Cluster Expansion

Step 3: Catch up from live replication stream

Helix

Partition: P1, Node: 1 … Partition: P12, Node: 3

Database

Node: 1, M: P1 – Active …, S: P5 – Active …

Cluster

Node      Master            Slave
Node 1    P1 P2 P3 P4       P5 P6 P9 P10
Node 2    P5 P6 P7 P8       P1 P2 P11 P12
Node 3    P9 P10 P11 P12    P3 P4 P7 P8

Node 4 (restored from Snapshots, catching up): P4 P8 P12, P7 P9 P1

Legend: Master, Slave, Offline

Cluster Expansion

Step 4: Migrate masters and slaves to rebalance

Helix

Partition: P1, Node: 1 … Partition: P12, Node: 3

Database

Node: 1, M: P1 – Active …, S: P5 – Active …

Cluster

Node      Master        Slave
Node 1    P1 P2 P3      P5 P6 P10
Node 2    P5 P6 P7      P2 P11 P12
Node 3    P9 P10 P11    P3 P4 P8
Node 4    P4 P8 P12     P7 P9 P1

Legend: Master, Slave, Offline

Cluster Expansion

Partitions are balanced. Router starts sending traffic to new node

Helix

Partition: P1, Node: 1 … Partition: P12, Node: 3

Database

Node: 1, M: P1 – Active …, S: P5 – Active …

Cluster

Node      Master        Slave
Node 1    P1 P2 P3      P5 P6 P10
Node 2    P5 P6 P7      P2 P11 P12
Node 3    P9 P10 P11    P3 P4 P8
Node 4    P4 P8 P12     P1 P7 P9

Legend: Master, Slave, Offline
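The expansion walk-through can be checked with a small diff between the three-node and four-node layouts shown on the slides. The partition sets below are copied from the diagrams; the function names `expansion_plan` and `is_balanced` are illustrative and are not Helix or Espresso APIs.

```python
BEFORE = {
    "Node 1": {"master": {1, 2, 3, 4}, "slave": {5, 6, 9, 10}},
    "Node 2": {"master": {5, 6, 7, 8}, "slave": {1, 2, 11, 12}},
    "Node 3": {"master": {9, 10, 11, 12}, "slave": {3, 4, 7, 8}},
}
AFTER = {
    "Node 1": {"master": {1, 2, 3}, "slave": {5, 6, 10}},
    "Node 2": {"master": {5, 6, 7}, "slave": {2, 11, 12}},
    "Node 3": {"master": {9, 10, 11}, "slave": {3, 4, 8}},
    "Node 4": {"master": {4, 8, 12}, "slave": {1, 7, 9}},
}


def replicas(layout, node):
    """All partitions a node hosts, regardless of role."""
    roles = layout.get(node, {"master": set(), "slave": set()})
    return roles["master"] | roles["slave"]


def expansion_plan(before, after):
    """Per node: partitions to bootstrap (restore + catch up) and partitions to drop."""
    plan = {}
    for node in after:
        old, new = replicas(before, node), replicas(after, node)
        plan[node] = {"bootstrap": sorted(new - old), "drop": sorted(old - new)}
    return plan


def is_balanced(layout, num_partitions=12):
    """Every partition has one master and one slave, and all nodes carry the same load."""
    everyone = list(range(1, num_partitions + 1))
    masters = sorted(p for roles in layout.values() for p in roles["master"])
    slaves = sorted(p for roles in layout.values() for p in roles["slave"])
    sizes = {(len(r["master"]), len(r["slave"])) for r in layout.values()}
    return masters == everyone and slaves == everyone and len(sizes) == 1


plan = expansion_plan(BEFORE, AFTER)
```

With these layouts only Node 4 has anything to bootstrap (six partitions), and each existing node gives up exactly two replicas, which is why the existing nodes can keep serving traffic during the expansion.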

Node Failover

• During failure or planned maintenance

Node      Master        Slave
Node 1    P1 P2 P3      P10 P5 P6
Node 2    P5 P6 P7      P12 P2 P11
Node 3    P9 P10 P11    P8 P3 P4
Node 4    P4 P8 P12     P7 P9 P1

Helix

Partition: P1, Node: 1 … Partition: P12, Node: 4

Database

Cluster

Node: 4, M: P4 – Active …, S: P7 – Active …

Node Failover

• Step 1: Detect Node failure

Node      Master        Slave
Node 1    P1 P2 P3      P10 P5 P6
Node 2    P5 P6 P7      P12 P2 P11
Node 3    P9 P10 P11    P8 P3 P4
Node 4    P4 P8 P12     P7 P9 P1

Helix

Partition: P1, Node: 1 … Partition: P12, Node: 4

Database

Cluster

Node: 4, M: P4 – Active …, S: P7 – Active …

Node Failover

• Step 2: Compute new ideal state for promoting slaves to master

Node 1: P1 P2 P3 / P5 P6
Node 2: P5 P6 P7 / P12 P2
Node 3: P10 P11 / P8 P3 P4
Node 4: P4 P8 P12 / P7 P9 P1

Partitions drawn separately in the diagram: P11 P10, P9

Helix

Partition: P1, Node: 1 … Partition: P12, Node: 4

Database

Cluster

Node: 4, M: P4 – Active …, S: P7 – Active …
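A minimal sketch of slave promotion, using the balanced four-node layout from the slides. It assumes, purely for illustration, that Node 3 is the node that fails; the function name `fail_over` and the dictionary layout are not from Espresso or Helix.

```python
LAYOUT = {
    "Node 1": {"master": {1, 2, 3}, "slave": {5, 6, 10}},
    "Node 2": {"master": {5, 6, 7}, "slave": {2, 11, 12}},
    "Node 3": {"master": {9, 10, 11}, "slave": {3, 4, 8}},
    "Node 4": {"master": {4, 8, 12}, "slave": {1, 7, 9}},
}


def fail_over(layout, failed):
    """Return the surviving layout and a map of promoted partition -> new master node."""
    # Copy the role sets so the input layout is left untouched.
    survivors = {
        node: {"master": set(roles["master"]), "slave": set(roles["slave"])}
        for node, roles in layout.items() if node != failed
    }
    promoted = {}
    for partition in sorted(layout[failed]["master"]):
        for node, roles in survivors.items():
            if partition in roles["slave"]:
                roles["slave"].remove(partition)   # SLAVE -> MASTER on the surviving replica
                roles["master"].add(partition)
                promoted[partition] = node
                break
    return survivors, promoted


survivors, promoted = fail_over(LAYOUT, "Node 3")
```

Under that assumption the masters for P9, P10 and P11 move to Node 4, Node 1 and Node 2 respectively, so the extra load of the failed node is spread over all survivors instead of landing on one machine.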

Failover Performance


Secondary indexing


Espresso Secondary Indexing

• Local Secondary Index Requirements

• Read after write

• Consistent with primary data under failure

• Rich query support: match, prefix, range, text search

• Cost-to-serve proportional to working set

• Pluggable Index Implementations

• MySQL B-Tree

• Inverted index using Apache Lucene with MySQL backing store

• Inverted index using Prefix Index

• Fastbit based bitmap index

Lucene based implementation

• Requires entire index to be memory-resident to support low latency query response times

• For the Mailbox application, we have two options

Optimizations for Lucene based implementation

• Concurrent transactions on the same Lucene index lead to inconsistency

• Need to acquire a lock

• Opening an index repeatedly is expensive

• Group commit to amortize index opening cost

[Figure: Request 1 through Request 5 grouped into a single index write]
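The group commit idea can be sketched as follows: writers queue their changes, and one flush applies the whole queue while holding the index lock, so the index is opened once per batch instead of once per request. `CountingIndex` is a stand-in that only counts opens; it is not the Lucene API, and the class and method names are hypothetical.

```python
import threading


class CountingIndex:
    """Stand-in for a Lucene index: counts how many times it is opened for writing."""

    def __init__(self):
        self.opens = 0
        self.documents = []

    def open_write_close(self, batch):
        self.opens += 1  # the expensive step that group commit amortizes
        self.documents.extend(batch)


class GroupCommitter:
    def __init__(self, index):
        self.index = index
        self.pending = []
        self.queue_lock = threading.Lock()  # protects the pending list
        self.index_lock = threading.Lock()  # one writer on the index at a time

    def submit(self, document):
        with self.queue_lock:
            self.pending.append(document)

    def flush(self):
        """Write everything queued so far with a single index open; return the batch size."""
        with self.index_lock:
            with self.queue_lock:
                batch, self.pending = self.pending, []
            if batch:
                self.index.open_write_close(batch)
            return len(batch)


index = CountingIndex()
committer = GroupCommitter(index)
for n in range(1, 6):
    committer.submit(f"Request {n}")
flushed = committer.flush()
```

Five requests cost one index open here. The trade-off is added latency for the first request in a batch, which waits until the flush runs.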

Optimizations for Lucene based implementation

High value users of the site accumulate large mailboxes

– Query performance degrades with a large index

Performance shouldn’t get worse with more usage!

Time Partitioned Indexes: Partition index into buckets based on created time
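A minimal sketch of a time-partitioned index. Documents are bucketed by creation time and a query walks buckets from newest to oldest, stopping once it has enough hits. The bucket width, the newest-first scan order, and the class name are assumptions; the talk only states that the index is partitioned into buckets by created time.

```python
from collections import defaultdict

BUCKET_SECONDS = 30 * 24 * 3600  # assumption: roughly monthly buckets


class TimePartitionedIndex:
    def __init__(self, bucket_seconds=BUCKET_SECONDS):
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(list)  # bucket id -> [(created, document)]

    def add(self, created, document):
        self.buckets[created // self.bucket_seconds].append((created, document))

    def query(self, predicate, count):
        """Return up to `count` matches, newest first, plus the number of buckets scanned."""
        hits, scanned = [], 0
        for bucket_id in sorted(self.buckets, reverse=True):
            scanned += 1
            entries = sorted(self.buckets[bucket_id], key=lambda e: e[0], reverse=True)
            hits.extend(doc for _, doc in entries if predicate(doc))
            if len(hits) >= count:
                break  # older buckets are never touched
        return hits[:count], scanned


idx = TimePartitionedIndex(bucket_seconds=100)
for t in range(0, 1000, 10):
    idx.add(t, {"id": t, "unread": t % 20 == 0})
hits, scanned = idx.query(lambda doc: doc["unread"], count=3)
```

For a typical "first page of the inbox" query only the newest bucket is scanned, so the cost depends on the recent working set rather than on the total mailbox size.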

Espresso in Production



Unified Social Content Platform – social activity aggregation

High Read:Write ratio



InMail - Allows members to communicate with each other

Large storage footprint

Low latency requirement for secondary index queries involving text search and relational predicates


Performance

Average Failover Latency with 1024 partitions is around 300ms

Primary Data Reads and Writes

For Single Storage Node on SSD

Average row size = 1KB

Operation    Average Latency    Average Throughput
Reads        ~3ms               40,000 per second
Writes       ~6ms               20,000 per second

Performance

Partition-key level Secondary Index using Lucene

One Index per Mailbox use-case

Base data on SAS, Indexes on SSDs

Average throughput per index = ~1000 per second (after the group commit and partitioned index optimizations)

Operation                                 Average Latency
Queries (average of 5 indexed fields)     ~20ms
Writes (around 30 indexed fields)         ~20ms

Durability and Consistency

Within a Data Center

Across Data Centers


Within a Data Center

– Write latency vs Durability

Asynchronous replication

– May lead to data loss

– Tooling can mitigate some of this

Semi-synchronous replication

– Wait for at least one relay to acknowledge

– During failover, slaves wait for catchup

Consistency over availability

Helix selects slave with least replication lag to take over mastership

Failover time is ~300ms in practice
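A minimal sketch of the selection rule above: pick the slave that has applied the most replicated events, then work out what it still has to apply before accepting writes. Representing the relay as a Python list and lag as an event count are assumptions for illustration; the talk does not describe how lag is measured.

```python
def choose_new_master(applied, relay_log):
    """applied: slave node -> number of relay events it has applied so far."""
    if not applied:
        raise ValueError("no slave available for this partition")
    best = max(applied, key=applied.get)   # least lag = most events applied
    backlog = relay_log[applied[best]:]    # must be applied before taking writes
    return best, backlog


relay_log = ["e1", "e2", "e3", "e4"]
new_master, backlog = choose_new_master({"Node 1": 2, "Node 2": 3}, relay_log)
```

Waiting for the backlog before promotion is what "consistency over availability" means in practice: the partition is briefly unavailable for writes, but acknowledged writes held by the relay are not lost.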


Across data centers

– Asynchronous replication

– Stale reads possible

– Active-active: Conflict resolution via last-writer-wins
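A minimal sketch of last-writer-wins conflict resolution between two data centers. The `Version` fields and the tie-break on data center name are assumptions; the talk only names the policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: dict
    timestamp: int  # commit time, e.g. in milliseconds (assumed to be comparable across sites)
    origin: str     # data center id, used only to break exact timestamp ties


def resolve(a, b):
    """Keep the version with the later timestamp; ties go to the larger origin id."""
    return max(a, b, key=lambda v: (v.timestamp, v.origin))


east = Version({"unread": True}, 100, "dc-east")
west = Version({"unread": False}, 105, "dc-west")
winner = resolve(east, west)
```

The deterministic tie-break matters: both data centers must pick the same winner independently, otherwise the replicas diverge. Last-writer-wins also silently discards the losing write, which is the price of its simplicity.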

Lessons learned

Dealing with transient failures

Planned upgrades

Slave reads

Storage Devices

– SSDs vs SAS disks

Scaling Cluster Management


Future work

Coprocessors

– Synchronous, Asynchronous

Richer query processing

– Group-by, Aggregation


Key Takeaways

Espresso is a timeline consistent, document-oriented distributed database

Feature rich: Secondary indexing, transactions over related documents, seamless integration with the data ecosystem

In production since June 2012 serving several key use-cases


Questions?
