Building a high scale machine learning pipeline with ... · (Spark Streaming) Listen for events...

Building a high scale machine learning pipeline

with Apache Spark and Kafka

https://www.flickr.com/photos/sanjayaprime/5013115478Bedő Dániel

• biggest community-driven question & answer website in Germany

• 20 million questions, 70 million answers

• similar to Quora, Yahoo Answers

Google Update Impact

Ordering of answers

supervised machine learning

determine the type of the training data

gather a training set

find a representation of the data

pick a learning algorithm

run the training algorithm

evaluate the accuracy

Regression Prototype

Identify the problems

• Model not complex enough

• Similar inputs, different outputs?

• Not enough training data

The old ETL pipeline

ETL v2Spark

Spark ecosystem

Scala Python Java R

Spark Streaming

Spark SQL MLLib GraphX

Producer

Consumer

Topic 1 Partition 0

Broker 1

Topic 1 Partition 1

Broker 2

Topic 1 Partition 2

Broker 3

Kafka Cluster

Kafka topic

• scale

• parallelism

0 1 2 3 4 5 6 7 8 9

0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7 8

Producer

partition 0

partition 1

partition 2

Parquet

id cc votes

1 DE 2

2 DE 3

3 AT 1

4 DE 2

id cc votes

1 DE 2

2 DE 3

3 AT 1

4 DE 2

id cc votes

1 DE 2

2 DE 3

3 AT 1

4 DE 2

SELECT votes FROM logs WHERE cc = ‘AT’

push-down filters

Rabbit MQServices

Tracking

Spark Cluster

KafkaKafka

HDFS(Tracking)

g Worker

MySQL Read Slave

MySQL Master

Redis Master

Redis Read Slave

ElasticSearch

ETL v2

Project Moria

Clean training data?

Project Angmar

• tried lots of different supervised learning methods

• feature engineering - most crucial part

• analyse the domain, chart everything

FeaturesContent

length

syntactic complexity

number of links

probability of deletion

Social

most helpful answer

number of comments

answered by expert

Author

gained votes

credibility score

deleted answer ratio

number of answers

number of comments

reported answer ratio

The structure of the network

Answer vector

AV normalized

Output

The Result

Calculate features for all answers

Batch Layer (Spark Batch)

Insert features in Redis

Calculate Score and store in MySQL

Speed Layer (Spark Streaming)

Listen for events

Insert or update Redis

Calculate Score and store in MySQL

Serving Layer

Lambda Architecture

Back pressure

• Bulk jobs insert too fast

• MySQL: sendQueue size, threads connected

• ElasticSearch: load on the instance creating the new index

Debugging the network

+1Change individual features

real world test (deleted vs non-deleted)

deletednon-deleted

Switching models

Amount of questions for a

score range

Old Score

New score

Insights

MenWomen

Learnings• If your use case is complex, you need a complex

• If you have a complex model, you need lots of data

• If you have lots of data, you need an ETL pipeline that can process huge amounts of data fast

• Think about your use case first, then design the pipeline

Questions?You can ask them on gutefrage too :)

Building a high scale machine learning pipeline with ... · (Spark Streaming) Listen for events...

Documents

Transcript of Building a high scale machine learning pipeline with ... · (Spark Streaming) Listen for events...

MySQL Redis Plugin

Managing IoT and Time Series Data with Amazon ElastiCache ...œˆ… · Amazon Kinesis and AWS Lambda are serverless Amazon Kinesis + Amazon ElastiCache for Redis ensure order among

Amazon · Amazon ElastiCache for Redis ElastiCache for Redis User Guide Table of Contents What Is ElastiCache for Redis

Apache Traffic Server & Lua - … · Lua Uses •mod_lua •ngx_lua •Redis •MySQL Proxy •some games/game engines(e.g World of Warcraft)

Redis in Kenshoo microservices - Redis Labs · Redis in Kenshoo microservices Jesque based delayed priority queue system Presented by Dima Vizelman. A Redis Use-Case Performance Aggregation

SRV314 securing serverless · Amazon API Gateway AWS Lambda Amazon RDS (Aurora MySQL) 3rdparty Client Amazon Cognito authentication AWS Lambda (Custom authorizer) Verify access token

One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apache Spark

Pivotal Cloud Cache 1 · Redis for PCF Redis Yes Yes (shared-VM plan). Only recommended for test environments. MySQL for PCF MySQL Yes No Pivotal Cloud Cache (PCC) Pivotal GemFire

Redis / Redis Labs Overview and then busting 4 myths … · Redis / Redis Labs Overview and then busting 4 myths about Redis Yiftach Shoolman Co-Founder & CTO Home of Redis

Mike (San Francisco, CA) - GroupTalentmrooney.github.io/about/portfolio.pdfPython/Django, jQuery, AWS, MySQL, Redis, Solr. Big fan of Jenkins, Continuous Integration, and Continuous

YOUR ORGANIZATION IS KILLING YOUR SOFTWARE · 2017-08-23 · ROUTING PRESENTATION LOGIC Flock T-Bird MySQL Monorail. STORAGE & RETRIEVAL ROUTING PRESENTATION LOGIC Redis Memcache

Redis Day Keynote Salvatore Sanfillipo Redis Labs

Redis begins

Back your app with MySQL and Redis on Cloud Foundry

Redis : Play buzz uses Redis

Redis for Security Data : SecurityScorecard JVM Redis Usage

Partners for success · Oracle DB MongoDB Cassandra MySQL MySQL Cluster Riak PostgreSQL Voldemort Redis. Data Model Relational Object Column-Family Oracle DB Riak Cassandra MySQL

ARC202 Accenture Cloud Platform Serverless Journey · PDF fileServerless Architecture Lambda execution with SNS, Redis, ... ElasticSearch) •Pay attention to limits (e.g. CloudWatch

Mesos Go Stateful - events.static.linuxfound.org · Analysis of Different Stateful Workload MySql Kafka ETCD PostgreSql Redis ... Mesos Go Stateful Mesos Go Framework Executor Executor

Redis Presentation