Building a high scale machine learning pipeline with Apache Spark and Kafka https://www.flickr.com/photos/sanjayaprime/5013115478 Bedő Dániel


Page 1: Building a high scale machine learning pipeline with ... · (Spark Streaming) Listen for events Insert or update Redis Calculate Score and store in MySQL Serving Layer Lambda Architecture.

Building a high scale machine learning pipeline

with Apache Spark and Kafka

https://www.flickr.com/photos/sanjayaprime/5013115478 Bedő Dániel

Page 2:

• biggest community-driven question & answer website in Germany

• 20 million questions, 70 million answers

• similar to Quora, Yahoo Answers

Page 3:

Google Update Impact

Page 4:

Ordering of answers

Page 5:

supervised machine learning

determine the type of the training data

gather a training set

find a representation of the data

pick a learning algorithm

run the training algorithm

evaluate the accuracy
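The six steps above can be walked through on a toy regression problem. This is only a sketch: the training examples, the single "answer length" feature, and the choice of ordinary least squares are assumptions made for illustration and are not taken from the talk.

```python
# Steps 1-2: decide what one training example looks like and gather a set.
# Each example is (answer length in words, helpfulness label). Invented data.
examples = [(10, 0.15), (20, 0.2), (40, 0.3), (60, 0.4), (80, 0.5), (100, 0.6)]


def to_feature(length):
    # Step 3: find a representation; here the length is scaled to 0..1.
    return length / 100.0


def fit_line(xs, ys):
    # Steps 4-5: pick a learning algorithm (ordinary least squares for
    # y = a * x + b) and run it on the training data.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    a = covariance / variance
    b = mean_y - a * mean_x
    return a, b


def mean_squared_error(model, xs, ys):
    # Step 6: evaluate the accuracy on examples the model has not seen.
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train, test = examples[:4], examples[4:]
model = fit_line([to_feature(l) for l, _ in train], [y for _, y in train])
error = mean_squared_error(
    model, [to_feature(l) for l, _ in test], [y for _, y in test]
)
print("slope, intercept:", model, "test error:", error)
```

Holding back part of the data for the evaluation step is what makes the error figure meaningful; measuring it on the training examples would overstate the accuracy.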

Page 6:

Regression Prototype

Page 7:

Identify the problems

• Model not complex enough

• Similar inputs, different outputs?

• Not enough training data

Page 8:

The old ETL pipeline

Page 9:

ETL v2

Spark

Page 10:

Spark ecosystem

• API: Scala, Python, Java, R

• Libraries: Spark Streaming, Spark SQL, MLlib, GraphX

Page 11:

Kafka

[Diagram: a Kafka cluster with Broker 1, Broker 2 and Broker 3, labelled alongside Topic 1 Partition 0, Partition 1 and Partition 2; two producers and three consumers connect to the cluster.]

Page 12:

Kafka topic

• scale

• parallelism

[Figure: a producer appends messages to a topic split into partition 0, partition 1 and partition 2; the three offset rows shown run 0-9, 0-7 and 0-8.]
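How a partitioned topic gives scale and parallelism can be sketched with an in-memory stand-in. The `Topic` class and the message keys are hypothetical, and `zlib.crc32` is only a placeholder hash; real Kafka clients use their own partitioner. The point illustrated is that each partition is an independent append-only log with its own offsets, and that messages with the same key land in the same partition.

```python
import zlib


class Topic:
    """In-memory stand-in for a Kafka topic (illustration only)."""

    def __init__(self, num_partitions):
        # One append-only list per partition.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Placeholder hash; chosen here because it is deterministic.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        partition = self.partition_for(key)
        self.partitions[partition].append(value)
        offset = len(self.partitions[partition]) - 1
        return partition, offset


topic = Topic(num_partitions=3)
first = topic.send("question-42", "created")
second = topic.send("question-42", "answered")
print(first, second)
```

Because ordering is only tracked per partition, separate consumers can read different partitions at the same time without coordinating with each other.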

Page 13:

Parquet

id  cc  votes
1   DE  2
2   DE  3
3   AT  1
4   DE  2

SELECT votes FROM logs WHERE cc = 'AT'

push-down filters
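The query above can be used to sketch why a columnar layout with push-down filters reads less data. The structure below is a simplified stand-in and not the actual Parquet file format: the split into two row groups and the `stats` field are assumptions made for illustration. Only the `cc` and `votes` columns are touched, and a row group is skipped entirely when its stored min/max range for `cc` cannot contain the requested value.

```python
# Column-oriented layout: each row group stores one list per column plus
# (min, max) statistics for the cc column. Layout invented for illustration.
row_groups = [
    {
        "stats": {"cc": ("DE", "DE")},
        "columns": {"id": [1, 2], "cc": ["DE", "DE"], "votes": [2, 3]},
    },
    {
        "stats": {"cc": ("AT", "DE")},
        "columns": {"id": [3, 4], "cc": ["AT", "DE"], "votes": [1, 2]},
    },
]


def select_votes_where_cc(groups, wanted_cc):
    votes_found = []
    groups_read = 0
    for group in groups:
        low, high = group["stats"]["cc"]
        # Push-down filter: skip the row group without reading its columns.
        if not (low <= wanted_cc <= high):
            continue
        groups_read += 1
        # Column projection: the id column is never read.
        cc_column = group["columns"]["cc"]
        votes_column = group["columns"]["votes"]
        votes_found.extend(
            v for c, v in zip(cc_column, votes_column) if c == wanted_cc
        )
    return votes_found, groups_read


votes, groups_read = select_votes_where_cc(row_groups, "AT")
print(votes, groups_read)
```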

Page 14:

ETL v2

[Architecture diagram. Components shown: Services, RabbitMQ, Tracking, Kafka, Spark Cluster, HDFS (Tracking), Streaming Worker, MySQL Master, MySQL Read Slave, Redis Master, Redis Read Slave, ElasticSearch.]

Page 15:

Project Moria

Page 16:

Clean training data?

Page 17:

Project Angmar

• tried lots of different supervised learning methods

• feature engineering - most crucial part

• analyse the domain, chart everything

Page 18:

Features

Content

• length

• syntactic complexity

• number of links

• probability of deletion

Social

• votes

• most helpful answer

• number of comments

• answered by expert

Author

• gained votes

• credibility score

• role

• deleted answer ratio

• number of answers

• number of comments

• reported answer ratio

Page 19:

The structure of the network

[Figure: network diagram with stages labelled Answer vector, AV normalized, and Output.]
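The flow from answer vector to normalized vector to output can be sketched as follows. Everything concrete here is an assumption: the three features, their value ranges, min-max normalization, the single hidden layer, the sigmoid activation and the weights are invented for illustration, since the transcript does not give the actual network.

```python
import math


def min_max_normalize(vector, lows, highs):
    # Scale each raw feature into the range 0..1.
    return [(v - lo) / (hi - lo) for v, lo, hi in zip(vector, lows, highs)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, output_weights):
    # One hidden layer followed by a single output neuron.
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)))
        for row in hidden_weights
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))


# Hypothetical raw answer vector: length, votes, number of links.
answer_vector = [120, 4, 1]
normalized = min_max_normalize(answer_vector, lows=[0, 0, 0], highs=[200, 10, 5])

# Invented weights; a trained network would supply these.
hidden_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
output_weights = [1.0, -1.0]
score = forward(normalized, hidden_weights, output_weights)
print(normalized, score)
```

Normalizing first keeps a feature with a large raw range, such as length, from dominating features that only take small values.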

Page 20:

The Result

Page 21:

Lambda Architecture

Batch Layer (Spark Batch)

• Calculate features for all answers

• Insert features in Redis

• Calculate Score and store in MySQL

Speed Layer (Spark Streaming)

• Listen for events

• Insert or update Redis

• Calculate Score and store in MySQL

Serving Layer
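The division of work between the layers above can be sketched with plain dictionaries standing in for Redis and MySQL. The function names, the event shape and the scoring formula are hypothetical; in particular `score` is a placeholder for the trained model, and using the serving layer to order answers by score is an assumption.

```python
feature_store = {}  # stands in for Redis: answer id -> feature dict
score_table = {}    # stands in for MySQL: answer id -> score


def score(features):
    # Placeholder for the trained model.
    return features["votes"] + 2 * features["comments"]


def batch_layer(all_answers):
    # Recompute features and scores for every answer.
    for answer_id, features in all_answers.items():
        feature_store[answer_id] = dict(features)
        score_table[answer_id] = score(features)


def speed_layer(event):
    # Apply one incoming event, then rescore only the affected answer.
    features = feature_store.setdefault(
        event["answer_id"], {"votes": 0, "comments": 0}
    )
    features[event["field"]] += event["delta"]
    score_table[event["answer_id"]] = score(features)


def serving_layer(answer_ids):
    # Read precomputed scores to order answers, best first.
    return sorted(answer_ids, key=lambda a: score_table.get(a, 0), reverse=True)


batch_layer({
    "a1": {"votes": 3, "comments": 0},
    "a2": {"votes": 1, "comments": 0},
})
order_before = serving_layer(["a1", "a2"])
speed_layer({"answer_id": "a2", "field": "votes", "delta": 5})
order_after = serving_layer(["a1", "a2"])
print(order_before, order_after)
```

Both layers write to the same two stores, so the serving side never needs to know whether a score came from the bulk recomputation or from a single event.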

Page 22:

Back pressure

• Bulk jobs insert too fast

• MySQL: sendQueue size, threads connected

• ElasticSearch: load on the instance creating the new index
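One way to apply back pressure to a bulk job is to check a load metric of the target store before every batch and wait while it is too high. The sketch below assumes that approach; `throttled_bulk_insert`, `FakeStore` and the load numbers are hypothetical, and in practice `read_load` would query something like the metrics named above.

```python
import time


def throttled_bulk_insert(rows, write_batch, read_load, max_load,
                          batch_size=100, pause_seconds=0.5, sleep=time.sleep):
    """Write rows in batches, pausing while the target reports high load."""
    pauses = 0
    for start in range(0, len(rows), batch_size):
        while read_load() > max_load:
            sleep(pause_seconds)
            pauses += 1
        write_batch(rows[start:start + batch_size])
    return pauses


class FakeStore:
    """Toy target whose load rises on writes and drains while waiting."""

    def __init__(self):
        self.rows = []
        self.load = 0

    def write_batch(self, batch):
        self.rows.extend(batch)
        self.load += 2

    def cool_down(self, _seconds):
        # Used in place of time.sleep so the example runs instantly.
        self.load -= 1


store = FakeStore()
pauses = throttled_bulk_insert(
    list(range(10)), store.write_batch, lambda: store.load,
    max_load=2, batch_size=2, sleep=store.cool_down,
)
print("rows written:", len(store.rows), "pauses:", pauses)
```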

Page 23:

Debugging the network

Change individual features (+1)
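Changing one feature at a time and watching the output is a simple sensitivity check. A minimal sketch follows, assuming the model can be called as a function of a feature dictionary; `toy_model` and its coefficients are invented stand-ins for the real network.

```python
def sensitivity(model, features, step=1):
    """Return how the model output moves when each feature is raised by step."""
    baseline = model(features)
    deltas = {}
    for name in features:
        changed = dict(features)  # copy so only one feature differs
        changed[name] += step
        deltas[name] = model(changed) - baseline
    return deltas


def toy_model(f):
    # Invented linear model standing in for the trained network.
    return 0.5 * f["votes"] - 2.0 * f["links"] + 0.0 * f["comments"]


deltas = sensitivity(toy_model, {"votes": 3, "links": 1, "comments": 4})
print(deltas)
```

A feature whose delta stays at zero has no influence on the score at that input, and a delta with an unexpected sign is a hint to inspect the training data for that feature.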

Page 24:

real world test (deleted vs non-deleted)

[Chart legend: deleted, non-deleted]

Page 25:

Switching models

[Chart: number of questions per score range, Old score vs. New score]
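Comparing how many items fall into each score range under the old and the new model can be sketched as a pair of histograms. The scores, the 0-100 scale and the bucket width of 10 are invented for illustration.

```python
from collections import Counter


def score_histogram(scores, width=10):
    """Count scores per range; each key is the lower bound of its range."""
    counts = Counter((s // width) * width for s in scores)
    return dict(sorted(counts.items()))


# Hypothetical integer scores for the same five questions.
old_scores = [5, 12, 18, 55, 91]
new_scores = [45, 52, 58, 61, 95]

old_hist = score_histogram(old_scores)
new_hist = score_histogram(new_scores)
print("old:", old_hist)
print("new:", new_hist)
```

If the two distributions differ a lot, any threshold that was tuned against the old score has to be revisited before the new model goes live.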

Page 26:

Page 27:

Insights

[Chart legend: Men, Women]

Page 28:

Learnings

• If your use case is complex, you need a complex model

• If you have a complex model, you need lots of data

• If you have lots of data, you need an ETL pipeline that can process huge amounts of data fast

• Think about your use case first, then design the pipeline

Page 29:

Questions?

You can ask them on gutefrage too :)