#TwitterRealTime - Real time processing @twitter

63

Transcript of #TwitterRealTime - Real time processing @twitter

Page 1: #TwitterRealTime - Real time processing @twitter
Page 2: #TwitterRealTime - Real time processing @twitter
Page 3: #TwitterRealTime - Real time processing @twitter

TWITTER IS REAL TIME

Page 4: #TwitterRealTime - Real time processing @twitter

WHAT IS REAL TIME?

Page 5: #TwitterRealTime - Real time processing @twitter

REAL TIME PIPELINE

Page 6: #TwitterRealTime - Real time processing @twitter

REAL TIME COMPONENTS

Page 7: #TwitterRealTime - Real time processing @twitter

REAL TIME USE CASES

ETL BI PRODUCT SAFETY TRENDS

ML MEDIA OPS ADS

Page 8: #TwitterRealTime - Real time processing @twitter

20 PB

2 Trillion Events/Day

100 mse2e

latency

400 Real Time Jobs

DLOG & HERON are

Open Sourced

Page 9: #TwitterRealTime - Real time processing @twitter
Page 10: #TwitterRealTime - Real time processing @twitter
Page 11: #TwitterRealTime - Real time processing @twitter
Page 12: #TwitterRealTime - Real time processing @twitter
Page 13: #TwitterRealTime - Real time processing @twitter

WE ARE HIRING!

Messaging

Data Infrastructure

Core Services

Search Infrastructure

Traffic

Real Time Compute

Compute PlatformPlatform Engineering

Kernel

#LoveWhereYouWorkLearn more at careers.twitter.com

Hadoop

Core Data Libraries

Data Applications

Core Metrics

Page 14: #TwitterRealTime - Real time processing @twitter
Page 15: #TwitterRealTime - Real time processing @twitter

- Easy operations

- Small technology portfolio

- Quick development Iteration

- Diverse use cases

Page 16: #TwitterRealTime - Real time processing @twitter

Bookkeeper

WriteProxy

ReadProxy clie

nt

client

Page 17: #TwitterRealTime - Real time processing @twitter

Bookkeeper

WriteProxy

ReadProxy

PublisherSubscriber

Read Write

DistributedLog

Metadata

Self Serve

Page 18: #TwitterRealTime - Real time processing @twitter

20 PB

2 Trillion Events

100 mse2e

latency

Page 19: #TwitterRealTime - Real time processing @twitter

- Event

A discrete, self-contained, piece of data

- Stream

A persistent, unordered collection of events with a time

- Partition

A portion of a stream with a proportional amount of that the overall capacity

- Subscriber

A collection of processes collectively consuming a copy of the stream

Page 20: #TwitterRealTime - Real time processing @twitter

Bookkeeper

WriteProxy

ReadProxy

PublisherSubscriber

Read Write

DistributedLog

Metadata

Self Serve

Page 21: #TwitterRealTime - Real time processing @twitter

Flow Control

Stream Configuration

Partition Ownership

DistributedLog

(E => Future[Unit])

Offset Tracking

Offset Store

Metadata

DL Read Proxy

Page 22: #TwitterRealTime - Real time processing @twitter

@DistributedLoghttp://distributedlog.io

Leigh Stewart <@l4stewar>, Sijie Guo <@sijieg>, Franck Cuny <@franckcuny>, Jordan Bull <@jordangbull>, Mahak Patidar <@mahakp>, Philip Su <@philipsu522>, Yiming Zang <@zang_yiming>

Messaging Alumni: David Helder, Aniruddha Laud, Robin Dhamankar

Page 23: #TwitterRealTime - Real time processing @twitter
Page 24: #TwitterRealTime - Real time processing @twitter
Page 25: #TwitterRealTime - Real time processing @twitter

STORM/HERON TERMINOLOGY- TOPOLOGY

Directed acyclic graph

Vertices=computation, and edges=streams of data tuples

- SPOUTS

Sources of data tuples for the topology

Examples - Kafka/Distributed Log/MySQL/Postgres

- BOLTS

Process incoming tuples and emit outgoing tuples

Examples - filtering/aggregation/join/arbitrary function

Page 26: #TwitterRealTime - Real time processing @twitter

STORM/HERON TOPOLOGY

BOLT 1

BOLT 2

BOLT 3

BOLT 4

BOLT 5

SPOUT 1

SPOUT 2

Page 27: #TwitterRealTime - Real time processing @twitter

WHY HERON?

● SCALABILITY and PERFORMANCE PREDICTABILITY

● IMPROVE DEVELOPER PRODUCTIVITY

● EASE OF MANAGEABILITY

Page 28: #TwitterRealTime - Real time processing @twitter

TOPOLOGY ARCHITECTURE

TopologyMaster

ZKCLUSTER

Stream Manager

I1 I2 I3 I4

Stream Manager

I1 I2 I3 I4

Logical Plan, Physical Plan and Execution State

Sync Physical Plan

CONTAINER CONTAINER

Metrics Manager

Metrics Manager

Page 29: #TwitterRealTime - Real time processing @twitter

HERON ARCHITECTURE

Topology 1

TOPOLOGYSUBMISSION

Scheduler

Topology 2

Topology 3

Topology N

Page 30: #TwitterRealTime - Real time processing @twitter
Page 31: #TwitterRealTime - Real time processing @twitter

HERON SAMPLE TOPOLOGIES

Page 32: #TwitterRealTime - Real time processing @twitter

Large amount of data produced every day

Large cluster Several hundred topologies deployed

Several million messages every second

HERON @TWITTER

1 stage 10 stages

3x reduction in cores and memory

Heron has been in production for 2 years

Page 33: #TwitterRealTime - Real time processing @twitter
Page 34: #TwitterRealTime - Real time processing @twitter

STRAGGLERS

Stragglers are the norm in a multi-tenant distributed systems

● BAD/SLOW HOST

● EXECUTION SKEW

● INADEQUATE PROVISIONING

Page 35: #TwitterRealTime - Real time processing @twitter

APPROACHES TO HANDLE STRAGGLERS

d

● SENDERS TO STRAGGLER DROP DATA

● SENDERS SLOW DOWN TO THE SPEED OF STRAGGLER

● DETECT STRAGGLERS AND RESCHEDULE THEM

Page 36: #TwitterRealTime - Real time processing @twitter

S1 B2

B3

SLOW DOWN SENDERS STRATEGY

Stream Manager

Stream Manager

Stream Manager

Stream Manager

S1 B2

B3 B4

S1 B2

B3

S1 B2

B3 B4

B4

S1 S1

S1S1

Page 37: #TwitterRealTime - Real time processing @twitter

BACK PRESSURE IN PRACTICE

● IN MOST SCENARIOS BACK PRESSURE RECOVERS

Without any manual intervention

● SOMETIMES USER PREFER DROPPING OF DATA

Care about only latest data

● SUSTAINED BACK PRESSURE

Irrecoverable GC cycles

Bad or faulty host

Page 38: #TwitterRealTime - Real time processing @twitter

ENVIRONMENT'S SUPPORTED

STORM API

PRE- 1.0.0

POST 1.0.0

SUMMINGBIRD FOR HERON

Page 39: #TwitterRealTime - Real time processing @twitter

CURIOUS TO LEARN MORE…

Page 40: #TwitterRealTime - Real time processing @twitter

INTERESTED IN HERON?

CONTRIBUTIONS ARE WELCOME!https://github.com/twitter/heron

http://heronstreaming.io

HERON IS OPEN SOURCED

FOLLOW US @HERONSTREAMING

Page 41: #TwitterRealTime - Real time processing @twitter
Page 42: #TwitterRealTime - Real time processing @twitter
Page 43: #TwitterRealTime - Real time processing @twitter

● 100K+ Advertisers, $2B+ revenue/year

● 300M+ Users

● Impressions/Engagements

○ Tens of billions of events daily

Page 44: #TwitterRealTime - Real time processing @twitter

Use Heron & EventBus:

● Prediction

● Serving

● Analytics

Page 45: #TwitterRealTime - Real time processing @twitter
Page 46: #TwitterRealTime - Real time processing @twitter

● Online learning: models require real-time data○ On-going training for existing ads

■ CTR, conversions, RTs, Likes○ On-going training for user data

■ Interests change, targeting must stay relevant○ New ads arrive constantly

● Consumes 150 GB/second from EventBus streams

Page 47: #TwitterRealTime - Real time processing @twitter

Ad Server● Reads Prediction models● Finalizes Ad selection● Writes 56GB/second to EventBus

○ Served impressions○ Spend events

Callback Service● Receives engagements from clients● Writes engagements to EventBus

○ Consumed by Prediction and Analytics

Page 48: #TwitterRealTime - Real time processing @twitter

Advertiser Dashboard keeps advertisers informed in real-time

For Ads:● Impressions● Engagements● Spend rate● Uniques

For Users:● Geolocation● Gender● Age● Followers● Keywords● Interests

Page 49: #TwitterRealTime - Real time processing @twitter

Offline layer (hours)● Engagement log● Billing pipeline● 14TB/hour

Online layer (seconds)● Heron topologies read 1M events/sec

From EventBus, provide real-time analytics

Advertiser Dashboard● Ad-hoc queries for desired time range● View performance of ads in real-time

http://tech.lalitbhatt.net/2015/03/big-and-fast-data-lambda-architecture.html

Page 50: #TwitterRealTime - Real time processing @twitter

(~6 hrs)

Page 51: #TwitterRealTime - Real time processing @twitter

#RealTime processing helps us scale our Ads business:

● Prediction - Online learning○ Ads○ Users

● Analytics - Advertisers get real-time visibility into ad performance

This enables us to provide high ROI for Advertisers.

Image Credits:http://images.clipartpanda.com/cycle-clipart-bike_red.pnghttp://sweetclipart.com/multisite/sweetclipart/files/motor_scooter_blue.pnghttp://www.clipartkid.com/images/152/clipart-car-car-clip-art-mHtTUp-clipart.png

Page 52: #TwitterRealTime - Real time processing @twitter
Page 53: #TwitterRealTime - Real time processing @twitter

Observation

● Anti-Spam Team fights spammy content, engagements, behaviors in Twitter● Spam campaign comes in large batch ● Despite of randomized tweaks, enough similarity among spammy entities are preserved

Requirement

● Real-time : a competition game with spammers i.e. “detect” vs “mutate”● Generic : need to support all common feature representations

Page 54: #TwitterRealTime - Real time processing @twitter

Crest is a generic online similarity clustering system

● Inputs are a stream of entities

● Similar Clustering system groups similar entities together ( according to predefined

similarity metric)

● outputs are the clusters and the cluster entity members.

“Built on top of Heron“ https://github.com/twitter/heron

Page 55: #TwitterRealTime - Real time processing @twitter
Page 56: #TwitterRealTime - Real time processing @twitter

● Locality sensitive hashing

probabilistic similarity-preserving random projection method

Entity1 => hashValue1 (010010001110010100101001000011)

Entity2 => hashValue2 (000111001110010101100110100100)

Sim(Entity1, Entity2) ~ Sim(hash1, hash2)

● No “Pair-wise” similarity calculation

Similarity match based on “signature band”

Page 57: #TwitterRealTime - Real time processing @twitter

Similarity match based on “signature band” collision

Cut signatures into bands:

01001 00011 10010 10010 10010 00011 ( 30 sigs = 6 bands * 5sigs/band)

Two entities become similarly candidates, if they collide on at least one band.

(i.e. need to match all signatures within some band)

Page 58: #TwitterRealTime - Real time processing @twitter

1. Given entity features, calculate signatures, and cut into bands2. Match with all existing clusters in cluster store, which collide with at least one band3. Find the closest cluster

Incoming Entity: 01001 00011 10010 10010 10010 00011

Known Cluster1: 01011 00011 01010 10111 11110 10011

Known Cluster2: 01101 01011 01000 10010 10010 01111

Page 59: #TwitterRealTime - Real time processing @twitter

1. Count for each band signatures

2. Use Count-Min Sketch to find the hot signatures

3. Send entities with hot signatures for clustering

Page 60: #TwitterRealTime - Real time processing @twitter

1. Group entities by band signatures2. Run in-memory clustering algorithm when the group is big enough3. Save the cluster in cluster key-value store

Page 61: #TwitterRealTime - Real time processing @twitter

1. Real-time : streamline data processing flow 2. Scalability : flexible grouping and shuffling ( Application / Signature ) 3. Maintenance : separated bolts for system optimizations ( Memory, GC, CPU, etc )

Page 62: #TwitterRealTime - Real time processing @twitter

● Crest : similarity clustering system , based on locality-sensitive hashing

● Detect spam in real-time , built on top of heron topology● Generic interface, clustering “everything” happening in Twitter

Page 63: #TwitterRealTime - Real time processing @twitter