Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
![Page 1: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/1.jpg)
On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform
Swaroop Jagadish
http://www.linkedin.com/in/swaroopjagadish
LinkedIn Confidential ©2013 All Rights Reserved
![Page 2: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/2.jpg)
Outline
LinkedIn Data Ecosystem
Espresso: Design Points
Data Model and API
Architecture
Deep Dive: Fault Tolerance
Deep Dive: Secondary Indexing
Espresso In Production
Future work
![Page 3: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/3.jpg)
The World’s Largest Professional Network
225M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
2M+ Company Pages
Connecting Talent with Opportunity. At scale…
![Page 4: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/4.jpg)
LinkedIn Data Ecosystem
![Page 5: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/5.jpg)
Espresso: Key Design Points
Source-of-truth
– Master-Slave, Timeline consistent
– Query-after-write
– Backup/Restore
– High Availability
Horizontally Scalable
Rich functionality
– Hierarchical data model
– Document oriented
– Transactions within a hierarchy
– Secondary Indexes
![Page 6: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/6.jpg)
Espresso: Key Design Points
Agility
– No “pause the world” operations
– “On the fly” Schema Evolution
– Elasticity
Integration with the data ecosystem
– Change stream with freshness in O(seconds)
– ETL to Hadoop
– Bulk import
Modular and Pluggable
– Off-the-shelf: MySQL, Lucene, Avro
![Page 7: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/7.jpg)
Data Model and API
![Page 8: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/8.jpg)
Application View
[Figure: application view of a record as a key and a value]
REST API: /mailbox/msg_meta/bob/2
![Page 9: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/9.jpg)
Partitioning
/mailbox/msg_meta/bob/2
MemberId is the partitioning key
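The slide only states that the member ID (the `bob` segment of the URI) is the partitioning key. As a minimal sketch of how a router could turn a resource URI into a partition number, assuming hash partitioning: the MD5 hash, the 12-partition count, and the name `partition_for` are illustrative assumptions, not details from the talk.

```python
import hashlib

NUM_PARTITIONS = 12  # assumption: same partition count as the cluster diagrams in this deck


def partition_for(uri: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map /<database>/<table>/<partition key>/... to a partition number."""
    parts = uri.strip("/").split("/")
    if len(parts) < 3:
        raise ValueError(f"URI has no partitioning key: {uri!r}")
    partition_key = parts[2]  # e.g. "bob", the member ID
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    # Hash choice is hypothetical; any stable hash of the key would do.
    return int.from_bytes(digest[:8], "big") % num_partitions


# Every document under the same member ID lands in the same partition.
same = partition_for("/mailbox/msg_meta/bob/2") == partition_for("/mailbox/msg_meta/bob/99")
print(same)
```

Because only the member ID is hashed, all of one member's messages are co-located, which is what makes the per-hierarchy transactions and local secondary indexes described later possible.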
![Page 10: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/10.jpg)
Document based data model
Richer than a plain key-value store
Hierarchical keys
Values are rich documents and may contain nested types
Messages
Key:
mailboxID : String
messageID : long
Value schema:
from : { name : String, email : String }
subject : String
body : String
unread : boolean
Example value:
from : { name : "Chris", email : "[email protected]" }
subject : "Go Giants!"
body : "World Series 2012! w00t!"
unread : true
![Page 11: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/11.jpg)
REST based API
• Secondary Index query
– GET /MailboxDB/MessageMeta/bob/?query="+isUnread:true +isInbox:true"&start=0&count=15
• Partial updates
– POST /MailboxDB/MessageMeta/bob/1
Content-Type: application/json
Content-Length: 21
{"unread" : "false"}
• Conditional operations
– Get a message, only if recently updated
GET /MailboxDB/MessageMeta/bob/1
If-Modified-Since: Wed, 31 Oct 2012 02:54:12 GMT
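The three request shapes above can be built with the Python standard library. This is a sketch that only constructs the requests and never sends them; the host `espresso.example.com` and the helper names are hypothetical, while the paths, headers, and query string are the ones shown on the slide.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE = "http://espresso.example.com"  # hypothetical router address


def index_query(mailbox: str, query: str, start: int = 0, count: int = 15) -> Request:
    """Secondary index query scoped to one mailbox."""
    params = urlencode({"query": query, "start": start, "count": count})
    return Request(f"{BASE}/MailboxDB/MessageMeta/{quote(mailbox)}/?{params}", method="GET")


def partial_update(mailbox: str, msg_id: int, fields: dict) -> Request:
    """Partial update: send only the fields that change."""
    return Request(
        f"{BASE}/MailboxDB/MessageMeta/{quote(mailbox)}/{msg_id}",
        data=json.dumps(fields).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def conditional_get(mailbox: str, msg_id: int, since_http_date: str) -> Request:
    """Fetch a message only if it changed after the given HTTP date."""
    return Request(
        f"{BASE}/MailboxDB/MessageMeta/{quote(mailbox)}/{msg_id}",
        headers={"If-Modified-Since": since_http_date},
        method="GET",
    )


q = index_query("bob", "+isUnread:true +isInbox:true")
u = partial_update("bob", 1, {"unread": "false"})
c = conditional_get("bob", 1, "Wed, 31 Oct 2012 02:54:12 GMT")
print(q.full_url)
```

Note that `urlencode` percent-encodes the `+` and `:` characters in the query, which a real client would need so that `+` is not read as a space.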
![Page 12: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/12.jpg)
Transactional writes within a hierarchy
MessageCounter (before)
mboxId | value
George | { "numUnread": 2 }
Message
mboxId | msgId | value | etag
George | 0 | {…, "unread": false, …} | 7abf8091
George | 1 | {…, "unread": true, …} | b648bc5f
George | 2 | {…, "unread": true, …} | 4fde8701
1. Read, record etags
/Message/George/0 {…, "unread": false, …} 7abf8091
2. Prepare after-image
/Message/George/0 {…, "unread": true, …}
/MessageCounter/George {…, "numUnread": "+1", …}
3. Update
MessageCounter (after)
mboxId | value
George | { "numUnread": 3 }
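The read / prepare / update sequence on this slide is an optimistic, etag-checked write. A minimal in-memory sketch of that pattern follows; the `Store` class, the way etags are derived (a truncated MD5 of the document), and the single-process "atomic" commit are all assumptions made for illustration, not Espresso's implementation.

```python
import hashlib
import json


def etag_of(doc: dict) -> str:
    # Hypothetical etag: a short hash of the document contents.
    return hashlib.md5(json.dumps(doc, sort_keys=True).encode()).hexdigest()[:8]


class Store:
    """In-memory stand-in for one partition; keys are resource URIs."""

    def __init__(self, docs):
        self.docs = dict(docs)

    def read(self, key):
        doc = self.docs[key]
        return dict(doc), etag_of(doc)

    def commit(self, expected_etags: dict, after_images: dict) -> bool:
        """Apply all after-images, or none if any recorded etag is stale."""
        for key, etag in expected_etags.items():
            if etag_of(self.docs[key]) != etag:
                return False
        self.docs.update(after_images)
        return True


store = Store({
    "/Message/George/0": {"unread": False},
    "/MessageCounter/George": {"numUnread": 2},
})

# 1. Read, record etags
msg, msg_etag = store.read("/Message/George/0")
counter, counter_etag = store.read("/MessageCounter/George")
# 2. Prepare after-image
msg["unread"] = True
counter["numUnread"] += 1
# 3. Update both documents together
ok = store.commit(
    {"/Message/George/0": msg_etag, "/MessageCounter/George": counter_etag},
    {"/Message/George/0": msg, "/MessageCounter/George": counter},
)
print(ok, store.docs["/MessageCounter/George"])
```

Both documents share the partitioning key `George`, so they live in the same partition and can be committed together; a second commit that reuses the old etags would be rejected.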
![Page 13: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/13.jpg)
Espresso Architecture
![Page 14: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/14.jpg)
![Page 15: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/15.jpg)
![Page 16: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/16.jpg)
![Page 17: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/17.jpg)
![Page 18: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/18.jpg)
![Page 19: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/19.jpg)
![Page 20: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/20.jpg)
Cluster Management and Fault Tolerance
![Page 21: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/21.jpg)
Generic Cluster Manager: Apache Helix
Generic cluster management
– State model + constraints
– Ideal state of distribution of partitions across the cluster
– Migrate cluster from current state to ideal state
• More Info
• SoCC 2012
• http://helix.incubator.apache.org
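To make "migrate cluster from current state to ideal state" concrete, here is a small sketch that diffs two replica-state maps and emits single-step transitions under a master-slave state model. It does not use the Helix API; `NEXT_STEP` and `plan_transitions` are hypothetical names, and the only constraint modelled is that demotions are scheduled before promotions so a partition never has two masters.

```python
# Legal single-step moves in a master-slave state model, keyed by (current, target).
NEXT_STEP = {
    ("OFFLINE", "SLAVE"): "SLAVE",
    ("OFFLINE", "MASTER"): "SLAVE",   # must pass through SLAVE
    ("SLAVE", "MASTER"): "MASTER",
    ("SLAVE", "OFFLINE"): "OFFLINE",
    ("MASTER", "SLAVE"): "SLAVE",
    ("MASTER", "OFFLINE"): "SLAVE",   # must pass through SLAVE
}


def plan_transitions(current, ideal):
    """Return (partition, node, from_state, to_state) steps from current to ideal.

    Both arguments map (partition, node) -> state; missing entries mean OFFLINE.
    """
    steps = []
    for key in sorted(set(current) | set(ideal)):
        state = current.get(key, "OFFLINE")
        target = ideal.get(key, "OFFLINE")
        while state != target:
            nxt = NEXT_STEP[(state, target)]
            steps.append((*key, state, nxt))
            state = nxt
    # Stable sort: every step that does not create a master runs first.
    steps.sort(key=lambda step: step[3] == "MASTER")
    return steps


current = {("P1", "node1"): "MASTER", ("P1", "node2"): "SLAVE"}
ideal = {("P1", "node1"): "SLAVE", ("P1", "node2"): "MASTER"}
plan = plan_transitions(current, ideal)
print(plan)
```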
![Page 22: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/22.jpg)
Espresso Partition Layout: Master, Slave
3 Storage Engine nodes, 2-way replication
[Figure: Apache Helix partition map (Partition: P1, Node: 1 … Partition: P12, Node: 3) and database view of Node 1 (M: P1 – Active … S: P5 – Active …)]
Cluster
Node 1: Master P1 P2 P3 P4; Slave P5 P6 P9 P10
Node 2: Master P5 P6 P7 P8; Slave P1 P2 P11 P12
Node 3: Master P9 P10 P11 P12; Slave P3 P4 P7 P8
Legend: Master, Slave, Offline
![Page 23: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/23.jpg)
Cluster Management
Cluster Expansion
Node Failover
![Page 24: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/24.jpg)
Cluster Expansion
Initial state with 3 Storage Nodes. Step 1: Compute new ideal state
[Figure: Helix partition map (Partition: P1, Node: 1 … Partition: P12, Node: 3) and database view of Node 1 (M: P1 – Active … S: P5 – Active …)]
Cluster
Node 1: Master P1 P2 P3 P4; Slave P5 P6 P9 P10
Node 2: Master P5 P6 P7 P8; Slave P1 P2 P11 P12
Node 3: Master P9 P10 P11 P12; Slave P3 P4 P7 P8
Node 4: no partitions yet
Legend: Master, Slave, Offline
![Page 25: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/25.jpg)
Cluster Expansion
Step 2: Bootstrap new node’s partitions by restoring from backups
[Figure: Helix partition map and Node 1 database view, unchanged from Step 1]
Cluster
Node 1: Master P1 P2 P3 P4; Slave P5 P6 P9 P10
Node 2: Master P5 P6 P7 P8; Slave P1 P2 P11 P12
Node 3: Master P9 P10 P11 P12; Slave P3 P4 P7 P8
Node 4: P4 P8 P12 and P7 P9 P1, restored from Snapshots
Legend: Master, Slave, Offline
![Page 26: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/26.jpg)
Cluster Expansion
Step 3: Catch up from live replication stream
[Figure: Helix partition map and Node 1 database view, unchanged from Step 1]
Cluster
Node 1: Master P1 P2 P3 P4; Slave P5 P6 P9 P10
Node 2: Master P5 P6 P7 P8; Slave P1 P2 P11 P12
Node 3: Master P9 P10 P11 P12; Slave P3 P4 P7 P8
Node 4: P4 P8 P12 and P7 P9 P1, restored from Snapshots and catching up
Legend: Master, Slave, Offline
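Steps 2 and 3 amount to "restore a snapshot, then replay everything committed after it". The sketch below shows that idea on plain dictionaries; the per-change commit sequence number (`scn`) and the snapshot layout are assumptions made for illustration and are not described on the slides.

```python
def bootstrap_replica(snapshot, change_log):
    """Restore a partition from a snapshot, then replay newer changes.

    snapshot:   {"scn": int, "rows": dict}, a backup taken at commit number scn
    change_log: iterable of (scn, key, value) in commit order; value None = delete
    """
    rows = dict(snapshot["rows"])       # Step 2: restore from backup
    applied_scn = snapshot["scn"]
    for scn, key, value in change_log:  # Step 3: catch up from the live stream
        if scn <= applied_scn:
            continue                    # already reflected in the snapshot
        if value is None:
            rows.pop(key, None)
        else:
            rows[key] = value
        applied_scn = scn
    return rows, applied_scn


snapshot = {"scn": 10, "rows": {"a": 1, "b": 2}}
log = [(9, "a", 0), (11, "b", 3), (12, "a", None), (13, "c", 4)]
rows, scn = bootstrap_replica(snapshot, log)
print(rows, scn)
```

Skipping entries at or below the snapshot's sequence number is what lets the new node start from a slightly stale backup without applying any change twice.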
![Page 27: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/27.jpg)
Cluster Expansion
Step 4: Migrate masters and slaves to rebalance
[Figure: Helix partition map and Node 1 database view, unchanged from Step 1]
Cluster
Node 1: Master P1 P2 P3; Slave P5 P6 P10
Node 2: Master P5 P6 P7; Slave P2 P11 P12
Node 3: Master P9 P10 P11; Slave P3 P4 P8
Node 4: Master P4 P8 P12; Slave P7 P9 P1
Legend: Master, Slave, Offline
![Page 28: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/28.jpg)
Cluster Expansion
Partitions are balanced. Router starts sending traffic to new node
[Figure: Helix partition map and Node 1 database view, unchanged from Step 1]
Cluster
Node 1: Master P1 P2 P3; Slave P5 P6 P10
Node 2: Master P5 P6 P7; Slave P2 P11 P12
Node 3: Master P9 P10 P11; Slave P3 P4 P8
Node 4: Master P4 P8 P12; Slave P1 P7 P9
Legend: Master, Slave, Offline
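The rebalancing in Step 4 can be approximated by repeatedly taking one partition from the most loaded node until the new node holds its fair share. This is a hypothetical greedy sketch, not Helix's rebalancer; it handles one role at a time and deliberately ignores the extra constraint that a slave must not land on the node that masters the same partition.

```python
def move_to_new_node(layout, new_node):
    """Move partitions from the most loaded nodes to new_node until balanced.

    layout maps node -> list of partitions held in one role (e.g. masters).
    """
    layout = {node: list(parts) for node, parts in layout.items()}
    layout[new_node] = []
    total = sum(len(parts) for parts in layout.values())
    target = total // len(layout)
    while len(layout[new_node]) < target:
        # Ties go to the first node in insertion order.
        donor = max((n for n in layout if n != new_node), key=lambda n: len(layout[n]))
        layout[new_node].append(layout[donor].pop())
    return layout


masters = {
    "Node 1": ["P1", "P2", "P3", "P4"],
    "Node 2": ["P5", "P6", "P7", "P8"],
    "Node 3": ["P9", "P10", "P11", "P12"],
}
balanced = move_to_new_node(masters, "Node 4")
print(balanced)
```

For the master layout in this deck the sketch moves P4, P8 and P12 to Node 4, the same three masters the slides show there, while moving only three of the twelve partitions.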
![Page 29: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/29.jpg)
Node Failover
• During failure or planned maintenance
[Figure: Helix partition map (Partition: P1, Node: 1 … Partition: P12, Node: 4) and database view of Node 4 (M: P4 – Active … S: P7 – Active …)]
Cluster
Node 1: Master P1 P2 P3; Slave P5 P6 P10
Node 2: Master P5 P6 P7; Slave P2 P11 P12
Node 3: Master P9 P10 P11; Slave P3 P4 P8
Node 4: Master P4 P8 P12; Slave P1 P7 P9
![Page 30: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/30.jpg)
Node Failover
• Step 1: Detect Node failure
[Figure: Helix partition map (Partition: P1, Node: 1 … Partition: P12, Node: 4) and database view of Node 4 (M: P4 – Active … S: P7 – Active …)]
Cluster
Node 1: Master P1 P2 P3; Slave P5 P6 P10
Node 2: Master P5 P6 P7; Slave P2 P11 P12
Node 3: Master P9 P10 P11; Slave P3 P4 P8
Node 4: Master P4 P8 P12; Slave P1 P7 P9
![Page 31: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/31.jpg)
Node Failover
• Step 2: Compute new ideal state for promoting slaves to master
[Figure: the four-node cluster layout with Helix partition map and Node 4 database view; partitions P9, P10 and P11 are shown apart from the node boxes as the slaves being promoted to master]
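A sketch of the promotion step, using the four-node layout from the earlier slides. It assumes Node 3 is the node that fails (the slide text does not name it) and that, with 2-way replication, each partition has exactly one surviving slave; `fail_over` is a hypothetical helper, not a Helix call.

```python
def fail_over(masters, slaves, failed_node):
    """Return new (masters, slaves) maps after promoting slaves of the failed node's masters."""
    new_masters = {n: list(p) for n, p in masters.items() if n != failed_node}
    new_slaves = {n: list(p) for n, p in slaves.items() if n != failed_node}
    for partition in masters[failed_node]:
        # Find the surviving node that holds this partition's slave replica.
        holder = next(n for n, parts in new_slaves.items() if partition in parts)
        new_slaves[holder].remove(partition)
        new_masters[holder].append(partition)
    return new_masters, new_slaves


masters = {
    "Node 1": ["P1", "P2", "P3"],
    "Node 2": ["P5", "P6", "P7"],
    "Node 3": ["P9", "P10", "P11"],
    "Node 4": ["P4", "P8", "P12"],
}
slaves = {
    "Node 1": ["P5", "P6", "P10"],
    "Node 2": ["P2", "P11", "P12"],
    "Node 3": ["P3", "P4", "P8"],
    "Node 4": ["P7", "P9", "P1"],
}
new_masters, new_slaves = fail_over(masters, slaves, "Node 3")
print(new_masters)
```

After the promotion, partitions that had their slave on the failed node (P3, P4, P8 here) are left with a master only until the node returns or a new slave is bootstrapped.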
![Page 32: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/32.jpg)
Failover Performance
![Page 33: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/33.jpg)
Secondary indexing
![Page 34: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/34.jpg)
Espresso Secondary Indexing
• Local Secondary Index Requirements
• Read after write
• Consistent with primary data under failure
• Rich query support: match, prefix, range, text search
• Cost-to-serve proportional to working set
• Pluggable Index Implementations
• MySQL B-Tree
• Inverted index using Apache Lucene with MySQL backing store
• Inverted index using Prefix Index
• Fastbit based bitmap index
![Page 35: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/35.jpg)
Lucene based implementation
• Requires entire index to be memory-resident to support low latency query response times
• For the Mailbox application, we have two options
![Page 36: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/36.jpg)
Optimizations for Lucene based implementation
• Concurrent transactions on the same Lucene index lead to inconsistency
• Need to acquire a lock
• Opening an index repeatedly is expensive
• Group commit to amortize index opening cost
[Figure: Requests 1 to 5 grouped into a single index write]
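Group commit can be sketched without any Lucene calls: writers queue their updates, and one flush takes the lock, opens the index once, and applies the whole batch. The `GroupCommitter` class and the dictionary standing in for an index are assumptions for illustration only.

```python
import threading


class GroupCommitter:
    """Batch index writes so one lock acquisition and one index open serve many requests."""

    def __init__(self, open_index):
        self._open_index = open_index  # callable returning a writable index (a dict here)
        self._lock = threading.Lock()  # serialises writers on the same index
        self._pending = []
        self.opens = 0                 # how many times the index was opened

    def submit(self, doc_id, fields):
        with self._lock:
            self._pending.append((doc_id, fields))

    def flush(self):
        """Apply every queued write with a single index open; return the batch size."""
        with self._lock:
            if not self._pending:
                return 0
            index = self._open_index()  # the expensive step, paid once per batch
            self.opens += 1
            for doc_id, fields in self._pending:
                index[doc_id] = fields
            count = len(self._pending)
            self._pending.clear()
            return count


index = {}
committer = GroupCommitter(lambda: index)
for request_id in range(1, 6):  # Requests 1 to 5 from the slide
    committer.submit(request_id, {"unread": True})
flushed = committer.flush()
print(flushed, committer.opens)
```

Five requests cost one index open instead of five, which is the amortization the slide refers to; the trade-off is that each write waits for the next flush.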
![Page 37: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/37.jpg)
Optimizations for Lucene based implementation
High value users of the site accumulate large mailboxes
– Query performance degrades with a large index
Performance shouldn’t get worse with more usage!
Time Partitioned Indexes: Partition index into buckets based on created time
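A minimal sketch of a time-partitioned index, assuming queries want the newest matches first so that recent buckets are searched before older ones. The bucket width, the class name, and the linear scan inside each bucket are illustrative assumptions; the real implementation uses Lucene indexes per bucket.

```python
from collections import defaultdict

BUCKET_SECONDS = 30 * 24 * 3600  # assumption: roughly one bucket per month


class TimePartitionedIndex:
    def __init__(self):
        # bucket number -> list of (created, doc_id, fields)
        self.buckets = defaultdict(list)

    def add(self, doc_id, created, fields):
        self.buckets[created // BUCKET_SECONDS].append((created, doc_id, fields))

    def query(self, predicate, count):
        """Return up to count matching doc IDs, newest first, plus buckets scanned."""
        hits, scanned = [], 0
        for bucket in sorted(self.buckets, reverse=True):
            scanned += 1
            entries = sorted(self.buckets[bucket], key=lambda e: e[0], reverse=True)
            for _created, doc_id, fields in entries:
                if predicate(fields):
                    hits.append(doc_id)
                    if len(hits) == count:
                        return hits, scanned  # older buckets are never opened
        return hits, scanned


idx = TimePartitionedIndex()
for i in range(6):  # one message in each of six consecutive buckets
    idx.add(doc_id=i, created=i * BUCKET_SECONDS + 5, fields={"unread": True})
hits, scanned = idx.query(lambda f: f["unread"], count=2)
print(hits, scanned)
```

The query touches only as many buckets as it needs to fill the page, so its cost follows the recent working set rather than the total mailbox size.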
![Page 38: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/38.jpg)
Espresso in Production
![Page 39: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/39.jpg)
Espresso in Production
Unified Social Content Platform – social activity aggregation
High Read:Write ratio
![Page 40: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/40.jpg)
Espresso in Production
InMail – allows members to communicate with each other
Large storage footprint
Low latency requirement for secondary index queries involving text search and relational predicates
![Page 41: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/41.jpg)
Performance
Average Failover Latency with 1024 partitions is around 300ms
Primary Data Reads and Writes
For Single Storage Node on SSD
Average row size = 1KB
Operation | Average Latency | Average Throughput
Reads | ~3ms | 40,000 per second
Writes | ~6ms | 20,000 per second
![Page 42: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/42.jpg)
Performance
Partition-key level Secondary Index using Lucene
One Index per Mailbox use-case
Base data on SAS, Indexes on SSDs
Average throughput per index = ~1000 per second (after the group commit and partitioned index optimizations)
Operation | Average Latency
Queries (average of 5 indexed fields) | ~20ms
Writes (around 30 indexed fields) | ~20ms
![Page 43: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/43.jpg)
Durability and Consistency
Within a Data Center
Across Data Centers
![Page 44: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/44.jpg)
Durability and Consistency
Within a Data Center
– Write latency vs Durability
Asynchronous replication
– May lead to data loss
– Tooling can mitigate some of this
Semi-synchronous replication
– Wait for at least one relay to acknowledge
– During failover, slaves wait for catchup
Consistency over availability
Helix selects slave with least replication lag to take over mastership
Failover time is ~300ms in practice
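The two failover rules on this slide (pick the least-lagged slave, and have it wait for catch-up before serving) can be sketched as follows. Measuring lag as a difference in commit sequence numbers, and the function names, are assumptions for illustration.

```python
def choose_new_master(replicas, master_scn):
    """Pick the slave with the least replication lag.

    replicas maps node -> last applied commit number; lag = master_scn - applied.
    """
    node = min(replicas, key=lambda n: master_scn - replicas[n])
    return node, master_scn - replicas[node]


def ready_to_serve(applied_scn, last_acked_scn):
    """A promoted slave accepts writes only after applying every change that a
    relay acknowledged: consistency is preferred over availability."""
    return applied_scn >= last_acked_scn


node, lag = choose_new_master({"node2": 95, "node4": 99}, master_scn=100)
print(node, lag, ready_to_serve(99, 100), ready_to_serve(100, 100))
```

With semi-synchronous replication the acknowledged changes are still available from a relay, so the chosen slave can close its remaining lag before it starts taking writes.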
![Page 45: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/45.jpg)
Durability and Consistency
Across data centers
– Asynchronous replication
– Stale reads possible
– Active-active: Conflict resolution via last-writer-wins
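Last-writer-wins resolution can be sketched in a few lines. The version tuple layout and the tie-break on data center name are assumptions; the slide only states that conflicts are resolved by last-writer-wins.

```python
def resolve(local, remote):
    """Last-writer-wins: keep the version with the newer timestamp.

    Versions are (timestamp_ms, datacenter, value). Ties break on the data
    center name so that both sides pick the same winner.
    """
    return max(local, remote, key=lambda v: (v[0], v[1]))


a = (100, "dc1", {"unread": True})
b = (105, "dc2", {"unread": False})
winner = resolve(a, b)
print(winner)
```

Because the comparison is symmetric, both data centers converge on the same value regardless of the order in which they see the two writes; the losing write is silently discarded, which is the usual cost of this policy.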
![Page 46: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/46.jpg)
Lessons learned
Dealing with transient failures
Planned upgrades
Slave reads
Storage Devices
– SSDs vs SAS disks
Scaling Cluster Management
![Page 47: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/47.jpg)
Future work
Coprocessors
– Synchronous, Asynchronous
Richer query processing
– Group-by, Aggregation
![Page 48: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/48.jpg)
Key Takeaways
Espresso is a timeline consistent, document-oriented distributed database
Feature rich: Secondary indexing, transactions over related documents, seamless integration with the data ecosystem
In production since June 2012 serving several key use-cases
![Page 49: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052410/55501f38b4c90535638b53db/html5/thumbnails/49.jpg)
Questions?