[2C2]PredictionIO

An Open Source Machine Learning Server for Developers @PredictionIO #PredictionIO Simon Chan [email protected]

DEVIEW 2014 [2C2]PredictionIO

Page 1: [2C2]PredictionIO

An Open Source Machine Learning Server

for Developers

@PredictionIO #PredictionIO

Simon Chan [email protected]

Page 2: [2C2]PredictionIO

• Simon Chan - CEO of PredictionIO

• A small team of Data Scientists and Engineers

• Mainly based in Silicon Valley, also London and Hong Kong

Thank you for having me here today!

Page 3: [2C2]PredictionIO

Top GitHub Open Source

• Over 5000 developers engaged

• Powering over 200 applications

Page 4: [2C2]PredictionIO

Talk Focus:

• Machine Learning - A (Very) Brief Review

• Challenges We Face When Building PredictionIO

Page 5: [2C2]PredictionIO

Machine Learning is Simple?

Page 6: [2C2]PredictionIO

I am going to give an example that will make you… HUNGRY!

Page 7: [2C2]PredictionIO

FOOD Club – Menu

Page 8: [2C2]PredictionIO
Page 9: [2C2]PredictionIO

# Using PredictionIO
import predictionio

# Collect Data
cli = predictionio.EventClient("<my_app_id>")
cli.record_user_action_on_item("buy", "John", "BulgogiA")

# Predict top preferences
eng = predictionio.EngineClient("<my_engine_url>")
rec = eng.send_query({"uid" : "John", "n" : 5})

Coding time….

Page 10: [2C2]PredictionIO

The Magic Behind: Engine

1. Data Sourcing and Preparation

2. Algorithm

3. Serving

4. Evaluation
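
The four components above can be pictured as a small pipeline. What follows is a minimal Python sketch, not PredictionIO's actual engine API: the function names (`prepare_data`, `train`, `serve`, `evaluate`), the sample events, and the popularity-count "algorithm" are all hypothetical, chosen only to show how the stages hand data to one another.

```python
from collections import Counter

# Hypothetical event log: (action, user, item). Values are illustrative only.
EVENTS = [
    ("buy", "John", "BulgogiA"),
    ("buy", "Mary", "BulgogiA"),
    ("buy", "Mary", "BibimbapB"),
    ("view", "John", "KimchiC"),
]

def prepare_data(events, actions=("buy",)):
    """1. Data sourcing and preparation: keep only the actions of interest."""
    return [(user, item) for action, user, item in events if action in actions]

def train(pairs):
    """2. Algorithm: a trivial popularity model (number of actions per item)."""
    return Counter(item for _, item in pairs)

def serve(model, query):
    """3. Serving: answer a query with the n most popular items."""
    return [item for item, _ in model.most_common(query["n"])]

def evaluate(model, held_out_pairs, n):
    """4. Evaluation: fraction of held-out actions whose item is in the top n."""
    if not held_out_pairs:
        return 0.0
    top = set(serve(model, {"n": n}))
    hits = sum(1 for _, item in held_out_pairs if item in top)
    return hits / float(len(held_out_pairs))

model = train(prepare_data(EVENTS))
recommendations = serve(model, {"uid": "John", "n": 1})
hit_rate = evaluate(model, [("John", "BulgogiA"), ("John", "KimchiC")], n=1)
```

A real engine would swap the popularity count for a proper recommendation algorithm, but the hand-off between the four stages would keep the same shape.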

Page 11: [2C2]PredictionIO
Page 12: [2C2]PredictionIO
Page 13: [2C2]PredictionIO

Challenges and Solutions

Page 14: [2C2]PredictionIO

Architectural Challenge 1

Workflow Co-ordination on a Distributed Cluster

Page 15: [2C2]PredictionIO

Needs:

• Support multiple distributed engines

• Support multiple algorithms to execute in parallel

How to coordinate the workflow when you have more pending tasks than processing units?

Page 16: [2C2]PredictionIO

Attempt #1

Use a database system to store tasks, and have a pool of workers pull tasks from it.

• Inefficient: the database becomes a bottleneck and potentially a single point of failure.
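
To make the bottleneck concrete, here is a minimal Python sketch of the database-backed task queue idea. It is an illustration under assumptions, not the code the team wrote: the `tasks` table, the task names, and the use of an in-memory SQLite database with two worker threads are all hypothetical. Note how every claim and every status update has to go through the single shared database.

```python
import sqlite3
import threading

# One shared database holds the task queue (hypothetical schema).
db = sqlite3.connect(":memory:", check_same_thread=False)
db_lock = threading.Lock()  # every worker serializes on this one resource
db.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT, status TEXT)")
db.executemany(
    "INSERT INTO tasks (name, status) VALUES (?, 'pending')",
    [("train-algo-%d" % i,) for i in range(6)],
)
db.commit()

completed = []

def claim_task():
    """Atomically claim one pending task, or return None when none remain."""
    with db_lock:
        row = db.execute(
            "SELECT id, name FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        db.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],))
        db.commit()
        return row

def worker():
    """Pull tasks from the database until the queue is empty."""
    while True:
        task = claim_task()
        if task is None:
            return
        # ... the real work (e.g. training one algorithm) would happen here ...
        with db_lock:
            db.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task[0],))
            db.commit()
            completed.append(task[1])

workers = [threading.Thread(target=worker) for _ in range(2)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

Adding more workers does not help once they all wait on the same lock, and if the database goes down no task can be claimed at all.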

Page 17: [2C2]PredictionIO

Attempt #2

Use an Akka cluster.

Akka is a toolkit and runtime for building highly concurrent, distributed, and fault-tolerant event-driven applications on the JVM.

• Fundamentally the same problem as above.

• Need to build a management suite on top.

Page 18: [2C2]PredictionIO

Solution

Apache Spark: directed acyclic graph (DAG) scheduling

Adapts to many different infrastructures: Apache Spark standalone cluster, Apache Hadoop 2 YARN, Apache Mesos.

Source: http://upload.wikimedia.org/wikipedia/commons/3/39/Directed_acyclic_graph_3.svg
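
The idea behind DAG scheduling with more pending tasks than processing units can be sketched in a few lines of Python. This is a simplified illustration (a topological ordering run in waves), not how Apache Spark's scheduler is implemented; the `schedule` function, the task names, and the `slots` parameter are hypothetical.

```python
from collections import deque

def schedule(dag, slots):
    """Return a list of waves; each wave holds at most `slots` runnable tasks.

    `dag` maps each task to the list of tasks it depends on.
    """
    remaining = {task: set(deps) for task, deps in dag.items()}
    ready = deque(sorted(t for t, deps in remaining.items() if not deps))
    done = set()
    waves = []
    while ready:
        # Run as many ready tasks as there are free processing units.
        wave = [ready.popleft() for _ in range(min(slots, len(ready)))]
        waves.append(wave)
        done.update(wave)
        for task in wave:
            del remaining[task]
        # A task becomes ready once all of its dependencies have finished.
        newly_ready = sorted(
            t for t, deps in remaining.items() if deps <= done and t not in ready
        )
        ready.extend(newly_ready)
    if remaining:
        raise ValueError("cycle detected: the workflow is not a DAG")
    return waves

# Hypothetical workflow: three algorithms train in parallel, then get evaluated.
workflow = {
    "read_events": [],
    "prepare": ["read_events"],
    "train_als": ["prepare"],
    "train_knn": ["prepare"],
    "train_pop": ["prepare"],
    "evaluate": ["train_als", "train_knn", "train_pop"],
}
waves = schedule(workflow, slots=2)
```

With two slots, the three training tasks are split across two waves, and `evaluate` only runs after all of them have finished.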

Page 19: [2C2]PredictionIO

Solution Source Code:

http://github.com/predictionio

Page 20: [2C2]PredictionIO

Architectural Challenge 2

Distributed In-memory Model Retrieval

Page 21: [2C2]PredictionIO

Needs:

• Engines produce models that are distributed across a cluster. This requires a way to serve these distributed in-memory models to queries in real time.

Page 22: [2C2]PredictionIO

Solution

All PredictionIO engine instances are launched inside a “SparkContext”.

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.

Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png

Page 23: [2C2]PredictionIO

• When an engine is local to a single machine, it loads the model into its memory.

• When an engine is distributed, SparkContext automatically loads the model across the cluster.

Page 24: [2C2]PredictionIO

Conceptual Code for the Solution

val sc = new SparkContext(conf)
...
val model =
  if (model_is_distributed) {
    if (model_is_persisted) {
      // Reload a previously trained model from HDFS
      sc.objectFile(model_on_HDFS)
    } else {
      // No persisted model yet: train a new one
      engine.algo.train()
    }
  } else {
    ...
  }

Page 25: [2C2]PredictionIO

PredictionIO 0.8

Page 26: [2C2]PredictionIO

Built-in Engines:

•Item Recommendation

•Item Rank

•Item Similarity

Page 27: [2C2]PredictionIO

Create an Engine Instance Project….

$ pio instance io.prediction.engines.itemrec

$ cd io.prediction.engines.itemrec

$ pio register

Page 28: [2C2]PredictionIO

Collect Event Data….

cli = predictionio.EventClient("<app_id>")

cli.record_user_action_on_item("like", "John", "bulgogi_12")

cli.record_user_action_on_item("view", "John", "bimbimbap_13")

Page 29: [2C2]PredictionIO

Configure the Engine Instance settings

in params/datasource.json

{

"appId": <app_id>,

"actions": [

"view", "like", ...

], ...

}
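
As a rough illustration, the same settings file could be generated from Python. This sketch covers only the two fields visible on the slide (`appId` and `actions`); the remaining fields are elided there and are not filled in here. The `write_datasource_params` helper, the temporary directory, and the app ID value `1` are hypothetical.

```python
import json
import os
import tempfile

def write_datasource_params(directory, app_id, actions):
    """Write a datasource.json containing only the fields shown on the slide."""
    params = {"appId": app_id, "actions": list(actions)}
    path = os.path.join(directory, "datasource.json")
    with open(path, "w") as f:
        json.dump(params, f, indent=2)
    return path

# A temporary directory stands in for the project's params/ folder.
params_dir = tempfile.mkdtemp()
path = write_datasource_params(params_dir, app_id=1, actions=["view", "like"])

# Read the file back to confirm it is valid JSON.
with open(path) as f:
    loaded = json.load(f)
```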

Page 30: [2C2]PredictionIO

Train the Data Model

$ pio train

Deploy the Engine Instance

$ pio deploy

Page 31: [2C2]PredictionIO

Retrieve Prediction Results

from predictionio import EngineClient

client = EngineClient(url="http://localhost:8000")

prediction = client.send_query({"uid": "John", "n": 3})

print prediction

Output

{u'items': [{u'272': 9.929327011108398}, {u'313': 9.92607593536377}, {u'347': 9.92170524597168}]}
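
The result is a list of single-entry dictionaries, each mapping an item ID to its score. A small sketch of how an application might flatten it into a ranked list follows; the `ranked_items` helper is hypothetical, and the response is copied from the output above with the `u''` prefixes dropped.

```python
# Response copied from the slide's output.
prediction = {
    "items": [
        {"272": 9.929327011108398},
        {"313": 9.92607593536377},
        {"347": 9.92170524597168},
    ]
}

def ranked_items(prediction):
    """Flatten the response into (item_id, score) pairs, highest score first."""
    pairs = [
        (item_id, score)
        for entry in prediction["items"]
        for item_id, score in entry.items()
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

top = ranked_items(prediction)
top_ids = [item_id for item_id, _ in top]
```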

Page 32: [2C2]PredictionIO

You can also….

• Change algorithm

• Tune algorithm parameters

• Compare and evaluate algorithms

• Add custom business logic
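
One common way to compare recommendation algorithms is precision at k: the share of the top k recommended items that the user actually liked. The sketch below is a generic illustration, not PredictionIO's built-in evaluation; the algorithm names, their outputs, and the set of relevant items are made up for the example.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are in the relevant set."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / float(len(top_k))

# Items the user actually liked (hypothetical ground truth).
relevant = {"bulgogi_12", "bimbimbap_13"}

# Hypothetical top-3 output of two candidate algorithms for the same user.
candidates = {
    "algo_a": ["bulgogi_12", "kimchi_7", "bimbimbap_13"],
    "algo_b": ["kimchi_7", "japchae_3", "bulgogi_12"],
}

scores = {name: precision_at_k(items, relevant, k=3) for name, items in candidates.items()}
best = max(scores, key=scores.get)
```

In practice the score would be averaged over many users and a held-out set of events before choosing an algorithm.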

Page 33: [2C2]PredictionIO

SDKs for:

• Python

• Ruby

• PHP

• Java / Android

• Scala

• Node.js

• iOS

• Meteor

• more….

Page 34: [2C2]PredictionIO

Also,

build your own Engine!

Page 35: [2C2]PredictionIO

Applications of Machine Learning

Speech Recognition

Personal Newsfeed

SPAM Filtering

Recommendation

Driverless Car

Churn Prediction

Ad Targeting

Fraud Detection

Page 36: [2C2]PredictionIO

- @PredictionIO

- prediction.io - Newsletters

- github.com/predictionio

Thank you!

Korean Documentation (Beta)!

http://docs.prediction.io/kr