Building a high scale machine learning pipeline
with Apache Spark and Kafka
https://www.flickr.com/photos/sanjayaprime/5013115478
Bedő Dániel
• biggest community-driven question & answer website in Germany
• 20 million questions, 70 million answers
• similar to Quora, Yahoo Answers
Google Update Impact
Ordering of answers
Supervised machine learning
• determine the type of the training data
• gather a training set
• find a representation of the data
• pick a learning algorithm
• run the training algorithm
• evaluate the accuracy
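The steps above can be sketched end to end with a toy regression. Everything in this sketch is an assumption made for illustration: the synthetic data, the single input feature, the 150/50 split, and the choice of ordinary least squares are not taken from the talk.

```python
import random


def make_dataset(n, seed=0):
    """Gather a (synthetic) training set: y = 2x + 1 plus a little noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Run the training algorithm: closed-form least squares for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


def mean_squared_error(xs, ys, a, b):
    """Evaluate the accuracy on data the model has not seen."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs, ys = make_dataset(200)
# Hold out the last 50 examples for evaluation.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

slope, intercept = fit_line(train_x, train_y)
test_mse = mean_squared_error(test_x, test_y, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MSE={test_mse:.4f}")
```

Evaluating on held-out examples rather than on the training set is what makes the last step meaningful.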
Regression Prototype
Identify the problems
• Model not complex enough
• Similar inputs, different outputs?
• Not enough training data
The old ETL pipeline
ETL v2: Spark
Spark ecosystem
• API: Scala, Python, Java, R
• Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
Kafka
[Diagram: two producers write to a Kafka cluster and three consumers read from it. The cluster has three brokers: Broker 1 holds Topic 1 Partition 0, Broker 2 holds Topic 1 Partition 1, Broker 3 holds Topic 1 Partition 2.]
Kafka topic
• scale
• parallelism
[Diagram: a producer appends to a topic split into partition 0, partition 1 and partition 2. Each partition is an ordered log with its own offsets, numbered from 0.]
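How a topic with several partitions gives scale and parallelism can be illustrated with a small in-memory model. This is a sketch, not the Kafka client API: the `Topic` class, the key names, and the use of CRC32 as the partitioning hash are all assumptions chosen to keep the example self-contained.

```python
import zlib

NUM_PARTITIONS = 3


def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Map a message key to a partition.

    CRC32 is only a deterministic stand-in for whatever hash a real
    client uses; the point is that equal keys land in the same partition.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Topic:
    """Toy topic: one append-only list per partition."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a message and return (partition, offset)."""
        partition = choose_partition(key, len(self.partitions))
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1  # offsets are per partition


topic = Topic()
# Hypothetical events keyed by question, so that all events for one
# question stay ordered within a single partition.
placements = [
    (f"question-{i % 5}", topic.append(f"question-{i % 5}", f"event-{i}"))
    for i in range(20)
]

for number, log in enumerate(topic.partitions):
    print(f"partition {number}: {len(log)} messages")
```

Because each partition is an independent log, separate consumers can read different partitions at the same time, while ordering is only guaranteed within one partition.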
Parquet
id  cc  votes
1   DE  2
2   DE  3
3   AT  1
4   DE  2
SELECT votes FROM logs WHERE cc = 'AT'
• push-down filters
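The query above benefits from a columnar layout in two ways: only the `cc` and `votes` columns need to be read, and whole chunks of rows can be skipped when their statistics show that no row can match. The sketch below is a toy model of that idea using the table from the slide. It is not the Parquet file format or any Parquet library API; the row-group size of 2 and all function names are assumptions.

```python
# The rows from the slide: (id, cc, votes).
ROWS = [(1, "DE", 2), (2, "DE", 3), (3, "AT", 1), (4, "DE", 2)]


def build_row_groups(rows, group_size):
    """Store each chunk of rows column by column, with min/max statistics."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {
            "id": [row[0] for row in chunk],
            "cc": [row[1] for row in chunk],
            "votes": [row[2] for row in chunk],
        }
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def select_votes_where_cc(groups, wanted_cc):
    """SELECT votes FROM logs WHERE cc = wanted_cc."""
    result = []
    groups_read = 0
    for group in groups:
        low, high = group["stats"]["cc"]
        if not (low <= wanted_cc <= high):
            continue  # filter pushed down: the group cannot contain a match
        groups_read += 1
        # Column projection: the id column is never touched.
        cc_column = group["columns"]["cc"]
        votes_column = group["columns"]["votes"]
        result.extend(v for cc, v in zip(cc_column, votes_column) if cc == wanted_cc)
    return result, groups_read


groups = build_row_groups(ROWS, group_size=2)
votes, groups_read = select_votes_where_cc(groups, "AT")
print(votes, groups_read)  # prints: [1] 1
```

Here the first row group contains only `DE`, so its statistics rule it out and only one of the two groups is scanned.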
ETL v2
[Architecture diagram. Components shown: Services, Tracking, RabbitMQ, Kafka, Spark Cluster, Streaming Worker, HDFS (Tracking), MySQL Master, MySQL Read Slave, Redis Master, Redis Read Slave, ElasticSearch.]
Project Moria
Clean training data?
Project Angmar
• tried lots of different supervised learning methods
• feature engineering - most crucial part
• analyse the domain, chart everything
Features
Content
• length
• syntactic complexity
• number of links
• probability of deletion
Social
• votes
• most helpful answer
• number of comments
• answered by expert
Author
• gained votes
• credibility score
• role
• deleted answer ratio
• number of answers
• number of comments
• reported answer ratio
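One way to turn the feature groups above into model input is to flatten them into a fixed-order numeric vector. The sketch below assumes a nested dictionary per answer; the field names, the example values, and the choice to encode missing values as 0.0 are hypothetical and not taken from the talk.

```python
# Fixed feature order, grouped as on the slide. Names are hypothetical.
FEATURE_ORDER = [
    ("content", ["length", "syntactic_complexity", "number_of_links",
                 "probability_of_deletion"]),
    ("social", ["votes", "most_helpful_answer", "number_of_comments",
                "answered_by_expert"]),
    ("author", ["gained_votes", "credibility_score", "role",
                "deleted_answer_ratio", "number_of_answers",
                "number_of_comments", "reported_answer_ratio"]),
]


def to_vector(answer):
    """Flatten one answer into a list of floats in FEATURE_ORDER."""
    vector = []
    for group, names in FEATURE_ORDER:
        values = answer.get(group, {})
        for name in names:
            # Booleans become 1.0 / 0.0; missing features default to 0.0.
            vector.append(float(values.get(name, 0.0)))
    return vector


example_answer = {
    "content": {"length": 420, "syntactic_complexity": 0.6, "number_of_links": 1},
    "social": {"votes": 3, "most_helpful_answer": True, "number_of_comments": 2},
    "author": {"gained_votes": 150, "role": 1, "number_of_answers": 40},
}

vector = to_vector(example_answer)
print(len(vector), vector[:4])
```

Keeping the order in one place matters because the trained model only sees positions, not names.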
The structure of the network
[Diagram: neural network. Stages from input to output: Answer vector, AV normalized, hidden layers, Output.]
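The flow from answer vector to normalized vector to output can be sketched as a tiny feed-forward pass. The slide does not state the normalization scheme, layer sizes, activation, or weights, so min-max scaling, one hidden layer of three sigmoid units, and every number below are assumptions for illustration only.

```python
import math


def min_max_normalize(vector, lows, highs):
    """Scale each feature into [0, 1] using assumed per-feature bounds."""
    normalized = []
    for value, low, high in zip(vector, lows, highs):
        scaled = (value - low) / (high - low) if high > low else 0.0
        normalized.append(min(1.0, max(0.0, scaled)))  # clamp outliers
    return normalized


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer with a sigmoid activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def score(vector, lows, highs, hidden_w, hidden_b, out_w, out_b):
    normalized = min_max_normalize(vector, lows, highs)
    hidden = dense(normalized, hidden_w, hidden_b)
    return dense(hidden, [out_w], [out_b])[0]


# Arbitrary example values, not taken from the slide.
answer_vector = [12.0, 3.0, 1.0, 2.0]
lows = [0.0, 0.0, 0.0, 0.0]
highs = [50.0, 10.0, 5.0, 4.0]
hidden_w = [[0.5, -0.2, 0.1, 0.3], [0.4, 0.4, -0.6, 0.2], [-0.3, 0.8, 0.5, -0.1]]
hidden_b = [0.0, 0.1, -0.1]
out_w = [0.7, -0.5, 0.9]
out_b = 0.05

normalized = min_max_normalize(answer_vector, lows, highs)
output = score(answer_vector, lows, highs, hidden_w, hidden_b, out_w, out_b)
print(normalized, round(output, 3))
```

In a real pipeline the weights would come from training; here they only make the forward pass executable.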
The Result
Lambda Architecture
Batch Layer (Spark Batch)
• Calculate features for all answers
• Insert features in Redis
• Calculate Score and store in MySQL
Speed Layer (Spark Streaming)
• Listen for events
• Insert or update Redis
• Calculate Score and store in MySQL
Serving Layer
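The division of work between the batch and speed layers can be sketched with in-memory stand-ins. This is a simplified model under stated assumptions: the two dictionaries stand in for Redis and MySQL, `compute_score` is a placeholder for the trained model, the event shape is hypothetical, and the serving layer is assumed to order answers by stored score.

```python
feature_store = {}  # stand-in for Redis: answer id -> features
score_store = {}    # stand-in for MySQL: answer id -> score


def compute_score(features):
    """Placeholder for the trained model; the weights are made up."""
    return features.get("votes", 0) + 2 * features.get("most_helpful", 0)


def batch_layer(all_answers):
    """Recompute features and scores for every answer."""
    for answer_id, features in all_answers.items():
        feature_store[answer_id] = dict(features)
        score_store[answer_id] = compute_score(features)


def speed_layer(event):
    """Apply one incoming event on top of the stored features."""
    features = feature_store.setdefault(event["answer_id"], {})
    name = event["feature"]
    features[name] = features.get(name, 0) + event["delta"]
    score_store[event["answer_id"]] = compute_score(features)


def serving_layer(answer_ids):
    """Return the answers ordered by their current score, best first."""
    return sorted(answer_ids, key=lambda a: score_store.get(a, 0), reverse=True)


batch_layer({"a1": {"votes": 2}, "a2": {"votes": 3}})
order_after_batch = serving_layer(["a1", "a2"])

# Two new votes for a1 arrive between batch runs.
speed_layer({"answer_id": "a1", "feature": "votes", "delta": 2})
order_after_event = serving_layer(["a1", "a2"])

print(order_after_batch, order_after_event)
```

Both layers write to the same stores, so the serving layer does not need to know which of the two produced a score.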
Back pressure
• Bulk jobs insert too fast
• MySQL: sendQueue size, threads connected
• ElasticSearch: load on the instance creating the new index
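A bulk job can apply back pressure to itself by checking a load signal from the target system before each batch and waiting while it is too high. The sketch below assumes a caller-supplied `current_load` function (for example, something derived from the number of connected threads); the threshold, delays, and function names are hypothetical.

```python
import time


def throttled_bulk_insert(batches, insert, current_load, max_load,
                          sleep=time.sleep, base_delay=0.1, max_delay=5.0):
    """Insert batches, pausing with exponential backoff while load is high.

    Returns the number of times the job had to wait.
    """
    waits = 0
    for batch in batches:
        delay = base_delay
        while current_load() > max_load:
            sleep(delay)
            waits += 1
            delay = min(delay * 2, max_delay)
        insert(batch)
    return waits


# Simulated load readings: the target is overloaded before the second batch.
loads = iter([10, 90, 90, 10, 10])
inserted = []
slept = []

waits = throttled_bulk_insert(
    batches=[[1, 2], [3, 4], [5]],
    insert=inserted.append,
    current_load=lambda: next(loads),
    max_load=50,
    sleep=slept.append,  # record the delays instead of really sleeping
)
print(waits, slept, inserted)
```

Passing `sleep` and `current_load` in as parameters keeps the throttle independent of any particular database and makes it easy to test.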
Debugging the network
• Change individual features (+1)
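Changing one feature at a time and watching the output is a simple way to see what a model reacts to. In this sketch the scoring function is a made-up linear stand-in, not the network from the talk, and the step of +1 is applied to raw feature values.

```python
def sensitivity(score_fn, vector, step=1.0):
    """Return the change in score when each feature alone is increased by step."""
    base = score_fn(vector)
    deltas = []
    for index in range(len(vector)):
        changed = list(vector)
        changed[index] += step
        deltas.append(score_fn(changed) - base)
    return deltas


def toy_score(vector):
    """Hypothetical model: rewards feature 0, punishes feature 1, ignores feature 2."""
    return 0.5 * vector[0] - 2.0 * vector[1] + 0.0 * vector[2]


deltas = sensitivity(toy_score, [3.0, 1.0, 7.0])
print(deltas)
```

A delta near zero suggests the model ignores that feature around this input; a delta with an unexpected sign is a hint that something in the training data or the feature pipeline deserves a closer look.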
Real world test (deleted vs non-deleted)
[Chart legend: deleted, non-deleted]
Switching models
[Chart: number of questions per score range. Series: old score, new score.]
Insights
[Chart legend: Men, Women]
Learnings
• If your use case is complex, you need a complex model
• If you have a complex model, you need lots of data
• If you have lots of data, you need an ETL pipeline that can process huge amounts of data fast
• Think about your use case first, then design the pipeline
Questions?
You can ask them on gutefrage too :)