Harvesting the Power of Samza in LinkedIn's Feed

Harvesting the Power of Samza in News Feed: Providing fresh and relevant content to hundreds of millions of members


Page 1: Harvesting the Power of Samza in LinkedIn's Feed


Page 2: Harvesting the Power of Samza in LinkedIn's Feed

Prerequisites: A Few Things Mentioned Here

1. Samza
2. RocksDB (a key-value store)
3. SerDe (Serializer/Deserializer)
4. Kafka (a distributed messaging system)
5. Java

Page 3: Harvesting the Power of Samza in LinkedIn's Feed

The Challenge

Page 4: Harvesting the Power of Samza in LinkedIn's Feed

Relevant content is a great way to stay informed about your professional interests; fresh, relevant content is even better!

How do we keep track of what hundreds of millions of members viewed on their News Feeds?

Page 5: Harvesting the Power of Samza in LinkedIn's Feed

Tracking

Page 6: Harvesting the Power of Samza in LinkedIn's Feed

Scale: News Feed is the Landing Page for Most Members

Source: investors.linkedin.com | 1 as of quarter end | 2 monthly average during the quarter

Page 7: Harvesting the Power of Samza in LinkedIn's Feed

Client-Side Tracking

• Lightweight events that track what the member viewed
• Tiny payload (bandwidth-friendly)
• Events end up in a Kafka topic

Page 8: Harvesting the Power of Samza in LinkedIn's Feed

Server-Side Tracking

• Events that have more data about served feeds
• Rich payload
• Events end up in a Kafka topic

Page 9: Harvesting the Power of Samza in LinkedIn's Feed

Improving Member Experience Using Samza (Overview)

1. Join input streams: A stream-stream join task buffers events from both streams; matches are sent to an output Kafka stream.
2. Purge stale events: A custom TTL mechanism reaps stale events every n seconds.
3. Consume output stream: Convert the rich data about impressions into machine learning features used for ranking items in the News Feed.

Page 10: Harvesting the Power of Samza in LinkedIn's Feed

1. Join

Page 11: Harvesting the Power of Samza in LinkedIn's Feed

Overview

[Diagram: Client Events feed the Process Client Events step and Server Events feed the Process Server Events step; both steps produce Output Events.]

Page 12: Harvesting the Power of Samza in LinkedIn's Feed

Client-Side Events Processor Overview

Is the ID in the server-side events store?
• Yes: match events, then output to Kafka
• No: store (ID, const.)

Page 13: Harvesting the Power of Samza in LinkedIn's Feed

Client-Side Events Processor: Optimizations

• Initial capacity of the matches map (event, matched IDs) is determined by a metric (GC-friendly)
• Initial capacity of each value set is equal to |IDs|
• An empty byte array is stored in RocksDB as a dummy value for each ID (it passes through the no-op byte array SerDe), so the store acts as a set
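
The last bullet, using a key-value store as a set, can be sketched roughly as follows. This is only an illustration: a TreeMap stands in for the RocksDB-backed store, and the class name IdSetSketch and its methods are hypothetical, not part of Samza.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a key-value store used as a set of IDs.
// A TreeMap stands in for the RocksDB-backed store (an assumption for illustration).
class IdSetSketch {
    // One shared zero-length array: every ID maps to the same dummy value,
    // so no per-entry value has to be allocated or serialized.
    private static final byte[] EMPTY = new byte[0];

    private final Map<String, byte[]> store = new TreeMap<>();

    // Adding an ID twice simply overwrites the same dummy value.
    void add(String id) {
        store.put(id, EMPTY);
    }

    // Membership is just "does the key have a value?".
    boolean contains(String id) {
        return store.get(id) != null;
    }

    boolean remove(String id) {
        return store.remove(id) != null;
    }

    int size() {
        return store.size();
    }
}
```

Only the keys carry information here; the value exists because a key-value store needs one.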

Page 14: Harvesting the Power of Samza in LinkedIn's Feed

Server-Side Events Processor Overview

Is the ID in the client-side events store?
• Yes: match events, then output to Kafka
• No: store (ID, event)
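
Taken together, the two processors form a symmetric stream-stream join. A minimal sketch of that flow is below, assuming in-memory collections in place of the RocksDB stores and a list in place of the output Kafka stream. JoinSketch and its method names are hypothetical, and eviction of matched or stale entries is not modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the two processors. In-memory collections stand in
// for the RocksDB-backed stores; a list stands in for the output Kafka stream.
class JoinSketch {
    private final Set<String> clientStore = new HashSet<>();          // IDs seen on the client side
    private final Map<String, String> serverStore = new HashMap<>();  // ID -> server-side event
    final List<String> output = new ArrayList<>();

    // Client-side ID arrives: match against the server-side store, else buffer the ID.
    void onClientId(String id) {
        String serverEvent = serverStore.get(id);
        if (serverEvent != null) {
            output.add(id + ":" + serverEvent);
        } else {
            clientStore.add(id);
        }
    }

    // Server-side ID arrives: match against the client-side store, else buffer (ID, event).
    void onServerId(String id, String serverEvent) {
        if (clientStore.contains(id)) {
            output.add(id + ":" + serverEvent);
        } else {
            serverStore.put(id, serverEvent);
        }
    }
}
```

Whichever side arrives second finds its counterpart already buffered, so a match is emitted regardless of arrival order.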

Page 15: Harvesting the Power of Samza in LinkedIn's Feed

Event Anatomy

• Header (shared event data)
• List of payloads (one for each item)
• Each payload has a join key (ID)

[Diagram: one event with its Shared Event Data (e.g. member ID) and eight payloads with IDs 123, 456, 789, 012, 345, 678, 901, 234]

Page 16: Harvesting the Power of Samza in LinkedIn's Feed

Server-Side Events Storage

[Diagram: the server-side event (Shared Event Data, e.g. member ID, with payload IDs 123, 456, 789, 012, 345, 678, 901, 234) is stored once under UUID 1A67343FE…83B, and each of the eight IDs is stored as a key pointing to that same UUID.]

Page 17: Harvesting the Power of Samza in LinkedIn's Feed

Server-Side Events Storage: ManyKeysToOneValueStore<K, V>

• Space-efficient
• Insertion is transactional
• Rolling back a transaction is best-effort
• Requires an additional lookup (but it’s worth it)
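
The slides do not show the store's code, so the following is only a sketch of how a many-keys-to-one-value layout could look. It assumes two HashMaps in place of the RocksDB stores, a random UUID as the shared handle, and a hypothetical class name ManyKeysToOneValueSketch; the rollback path is illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch; the real ManyKeysToOneValueStore<K, V> is not shown in the slides.
// HashMaps stand in for the RocksDB-backed stores.
class ManyKeysToOneValueSketch<K, V> {
    private final Map<K, UUID> keyToHandle = new HashMap<>();   // many keys -> one handle
    private final Map<UUID, V> handleToValue = new HashMap<>(); // one handle -> one value

    // Stores the value once and points every key at it.
    void putAll(List<K> keys, V value) {
        UUID handle = UUID.randomUUID();
        handleToValue.put(handle, value);
        try {
            for (K key : keys) {
                keyToHandle.put(key, handle);
            }
        } catch (RuntimeException e) {
            // Best-effort rollback: undo whatever was written, then rethrow.
            for (K key : keys) {
                keyToHandle.remove(key, handle);
            }
            handleToValue.remove(handle);
            throw e;
        }
    }

    // Two lookups: key -> handle, then handle -> value.
    V get(K key) {
        UUID handle = keyToHandle.get(key);
        return handle == null ? null : handleToValue.get(handle);
    }

    int storedValues() {
        return handleToValue.size();
    }
}
```

The value is written once however many keys point at it, which is where the space saving comes from; the cost is the second lookup on every read.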

Page 18: Harvesting the Power of Samza in LinkedIn's Feed

Event Matching

Client-Side Event: IDs 789, 012, 345, 678, 901, 234, 123, 456
Server-Side Event A: IDs 111, 456, 906, 678, 901, 431, 746
Server-Side Event B: IDs 234, 012, 123, 100, 313, 345, 333
Output Event A: IDs 901, 456, 678
Output Event B: IDs 012, 123, 345, 234
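
The matching step in this example can be sketched as grouping a client-side event's IDs by the server-side event each one belongs to. The class EventMatchingSketch and the use of plain strings for events are assumptions for illustration; a Map stands in for the server-side events store.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group a client-side event's IDs by the server-side event they match.
class EventMatchingSketch {
    // idToServerEvent plays the role of the server-side events store (ID -> event).
    static Map<String, List<String>> match(List<String> clientIds,
                                           Map<String, String> idToServerEvent) {
        Map<String, List<String>> matches = new LinkedHashMap<>();
        for (String id : clientIds) {
            String serverEvent = idToServerEvent.get(id);
            if (serverEvent != null) {
                // One output event per server-side event, carrying only the matched IDs.
                matches.computeIfAbsent(serverEvent, e -> new ArrayList<>()).add(id);
            }
            // Unmatched IDs would be buffered in the client-side store (not shown).
        }
        return matches;
    }
}
```

With the IDs from the slide, event A collects 678, 901, and 456, and event B collects 012, 345, 234, and 123, which are the same sets as the two output events; ID 789 matches neither.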

Page 19: Harvesting the Power of Samza in LinkedIn's Feed

[SAMZA-647] Key-Value Store Contributions to Samza

• The access pattern is getAll(List<K>)
• RocksDB supports multiGet, which is faster than get
• Added that support to Samza’s KeyValueStore
• Perf test results confirm RocksDB’s speedup (with caching disabled)
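
The shape of that contribution can be illustrated with a small sketch: an interface whose getAll falls back to one get per key, and a store that overrides it with a single batched read. The names KvStoreSketch and CountingStoreSketch are hypothetical, the store is in-memory, and the roundTrips counter is only a stand-in for the cost of a call into the storage engine; this is not Samza's or RocksDB's actual code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a key-value store interface; not Samza's actual source.
interface KvStoreSketch<K, V> {
    V get(K key);

    // Naive fallback: one get per key. A store backed by an engine with a
    // batched read can override this with a single call.
    default Map<K, V> getAll(List<K> keys) {
        Map<K, V> result = new HashMap<>(keys.size());
        for (K key : keys) {
            result.put(key, get(key));
        }
        return result;
    }
}

// In-memory store that counts calls into the "engine" to show what batching saves.
class CountingStoreSketch implements KvStoreSketch<String, String> {
    private final Map<String, String> data = new HashMap<>();
    int roundTrips = 0;

    void put(String key, String value) {
        data.put(key, value);
    }

    @Override
    public String get(String key) {
        roundTrips++;
        return data.get(key);
    }

    // Batched override: one round trip regardless of the number of keys.
    @Override
    public Map<String, String> getAll(List<String> keys) {
        roundTrips++;
        Map<String, String> result = new HashMap<>(keys.size());
        for (String key : keys) {
            result.put(key, data.get(key));
        }
        return result;
    }
}
```

In this sketch a missing key maps to null in the result, so callers can tell matched IDs from unmatched ones in one pass.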

Page 20: Harvesting the Power of Samza in LinkedIn's Feed

2. TTL

Page 21: Harvesting the Power of Samza in LinkedIn's Feed

Custom TTL Mechanism

• Records the timestamp of when an event was stored
• The “death row” store: the key is the timestamp and the value is an ID
• Because the key is a timestamp, collisions occur

[Flowchart: Generate timestamp. Bucket is free: put(timestamp, ID). Bucket is taken: branch on Attempts <= max or Attempts > max.]

Page 22: Harvesting the Power of Samza in LinkedIn's Feed

Linear Probing Timestamper

TTL calculation is not mission-critical (currentTimeMillis() is not very precise anyway); events get deleted in the next window

Keeping it simple and stupid works
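
A linear-probing timestamper along these lines might look like the sketch below. It assumes a TreeMap in place of the "death row" store and probes one millisecond at a time. The class name, the one-millisecond step, and the choice to return an empty result once the attempts are exhausted are all assumptions, since the slides do not say what happens past the maximum.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.TreeMap;

// Hypothetical sketch of a linear-probing timestamper. A TreeMap stands in for
// the "death row" store (timestamp -> ID); the give-up policy is an assumption.
class LinearProbingTimestamperSketch {
    private final TreeMap<Long, String> deathRow = new TreeMap<>();
    private final int maxAttempts;

    LinearProbingTimestamperSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Tries now, now + 1, now + 2, ... until a free bucket is found.
    // Returns the timestamp used, or empty if every probed bucket was taken.
    OptionalLong record(String id, long nowMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            long candidate = nowMillis + attempt;
            if (!deathRow.containsKey(candidate)) {
                deathRow.put(candidate, id);
                return OptionalLong.of(candidate);
            }
        }
        return OptionalLong.empty();
    }

    Map<Long, String> deathRow() {
        return deathRow;
    }
}
```

Probing forward only pushes an entry's recorded time a few milliseconds later, so the worst case is that it gets reaped one window later than it otherwise would.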

Page 23: Harvesting the Power of Samza in LinkedIn's Feed

Reapers

Every n seconds:
    Get death rows (t < now – TTL)
    For each entry in death row:
        Remove from core stores
        Remove from death row

Page 24: Harvesting the Power of Samza in LinkedIn's Feed

Reapers: Optimizations

• Keys (timestamps) are stored in order
• A range query (0, now – TTL) is much faster than a full scan (testing all values)
• Even though TTL is on the order of minutes/hours, reaping stale events happens every 10 seconds (the window method is blocking)
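
The range-query idea can be sketched with a sorted map: because timestamps are ordered, everything older than now – TTL is a contiguous prefix of the keys. The sketch below assumes a TreeMap in place of the ordered "death row" store and a HashMap in place of a core store; ReaperSketch and its fields are hypothetical names, and wiring it into a periodic window callback is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of a reaper. A TreeMap stands in for the ordered
// "death row" store and a HashMap for a core store.
class ReaperSketch {
    final NavigableMap<Long, String> deathRow = new TreeMap<>(); // timestamp -> ID
    final Map<String, String> coreStore = new HashMap<>();       // ID -> event

    // Removes every entry stored before (now - TTL); returns the reaped IDs.
    List<String> reap(long nowMillis, long ttlMillis) {
        // Range query over the sorted keys: only stale entries are visited.
        NavigableMap<Long, String> stale = deathRow.headMap(nowMillis - ttlMillis, false);
        List<String> reaped = new ArrayList<>(stale.values());
        for (String id : reaped) {
            coreStore.remove(id);
        }
        // The view is backed by deathRow, so clearing it deletes the stale rows.
        stale.clear();
        return reaped;
    }
}
```

Each call touches only the stale prefix, so a short reaping interval keeps every pass small even when the TTL itself is long.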

Page 25: Harvesting the Power of Samza in LinkedIn's Feed

Stats


Page 26: Harvesting the Power of Samza in LinkedIn's Feed

RocksDB Get All vs. Get Performance: [SAMZA-647] getAll is 23% Faster

Page 27: Harvesting the Power of Samza in LinkedIn's Feed

Timestamp Collision Resolution Metrics


Page 28: Harvesting the Power of Samza in LinkedIn's Feed

The Most Important Metric


Page 29: Harvesting the Power of Samza in LinkedIn's Feed

Billions of messages handled by the job every day

Page 30: Harvesting the Power of Samza in LinkedIn's Feed

Find out more:

©2015 LinkedIn Corporation. All Rights Reserved.

blog.linkedin.com | linkedin.com/in/elgeish