[2C2]PredictionIO

An Open Source Machine Learning Server

for Developers

@PredictionIO #PredictionIO

Simon Chan

• Simon Chan - CEO of PredictionIO

• A small team of Data Scientists and Engineers

• Mainly based in Silicon Valley, also London and Hong Kong

Thank you for having me here today!

Top GitHub Open Source

• Over 5000 developers engaged

• Powering over 200 applications

Talk Focus:

• Machine Learning - A (Very) Brief Review

• Challenges We Face When Building PredictionIO

Machine Learning is Simple?

I am going to give an example that will make you… HUNGRY!

FOOD Club – Menu

# Using PredictionIO
import predictionio

# Collect Data
cli = predictionio.EventClient("<my_app_id>")
cli.record_user_action_on_item("buy", "John", "BulgogiA")

# Predict top preferences
eng = predictionio.EngineClient("<my_engine_url>")
rec = eng.send_query({"uid": "John", "n": 5})

Coding time….

The Magic Behind: Engine

1. Data Sourcing and Preparation

2. Algorithm

3. Serving

4. Evaluation
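
The four components above can be pictured as a small pipeline. This is a toy sketch in plain Python, not PredictionIO's engine API: every function name, the sample events, and the popularity-count "algorithm" are illustrative assumptions.

```python
from collections import Counter

# Hypothetical event log: (action, user, item) tuples.
EVENTS = [
    ("buy", "John", "bulgogi"),
    ("buy", "Mary", "bulgogi"),
    ("buy", "Mary", "bibimbap"),
    ("view", "John", "kimchi"),
]

def prepare(events, actions=("buy",)):
    """1. Data sourcing and preparation: keep only the actions of interest."""
    return [(user, item) for action, user, item in events if action in actions]

def train(pairs):
    """2. Algorithm: a stand-in model that counts item popularity."""
    return Counter(item for _, item in pairs)

def serve(model, pairs, uid, n):
    """3. Serving: top-n popular items the user has not acted on yet."""
    seen = {item for user, item in pairs if user == uid}
    ranked = [item for item, _ in model.most_common() if item not in seen]
    return ranked[:n]

def evaluate(model, pairs, held_out, n=1):
    """4. Evaluation: fraction of held-out (user, item) pairs that get recommended."""
    hits = sum(item in serve(model, pairs, user, n) for user, item in held_out)
    return hits / len(held_out)

pairs = prepare(EVENTS)
model = train(pairs)
print(serve(model, pairs, "John", 2))
print(evaluate(model, pairs, [("John", "bibimbap")], n=2))
```

Each stage only depends on the output of the previous one, which is what makes it possible to swap the algorithm or the evaluation metric without touching the rest.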

Challenges and Solutions

Architectural Challenge 1

Workflow Co-ordination on a Distributed Cluster

Needs:

• Support multiple distributed engines

• Support multiple algorithms to execute in parallel

How to coordinate the workflow when you have more pending tasks than processing units?

Attempt #1

Use a database system to store tasks, and have a pool of workers pull tasks from it.

• Inefficient: the database becomes a bottleneck and a potential single point of failure.
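
To make the bottleneck concrete, here is a minimal sketch of the database-backed task queue, assuming an in-memory SQLite table and two workers simulated in a round-robin loop. The table layout, task names, and worker names are hypothetical.

```python
import sqlite3

def make_queue(task_names):
    """Create an in-memory task table with every task marked pending."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT, status TEXT, worker TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks (name, status) VALUES (?, 'pending')",
        [(name,) for name in task_names],
    )
    conn.commit()
    return conn

def claim_task(conn, worker):
    """Claim the oldest pending task for `worker`; return its name or None."""
    with conn:  # one transaction per claim: every worker serializes on the database
        row = conn.execute(
            "SELECT id, name FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tasks SET status = 'running', worker = ? WHERE id = ?",
            (worker, row[0]),
        )
        return row[1]

conn = make_queue(["train_algo_a", "train_algo_b", "train_algo_c"])
workers = ["w1", "w2"]
assignments = []
turn = 0
while True:
    worker = workers[turn % len(workers)]
    task = claim_task(conn, worker)
    if task is None:
        break
    assignments.append((worker, task))
    turn += 1
print(assignments)
```

Every claim is a round trip to the same table, so adding workers adds contention on the database instead of adding throughput, and losing the database stops all scheduling.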

Attempt #2

Use an Akka cluster.

Akka is a toolkit and runtime for building highly concurrent, distributed, and fault tolerant event-driven applications on the JVM.

• Fundamentally the same problem as above.

• Requires building a management suite on top.

Solution

Apache Spark: directed acyclic graph (DAG) scheduling

Adapts to many different infrastructures: Apache Spark standalone cluster, Apache Hadoop 2 YARN, Apache Mesos.

Source: http://upload.wikimedia.org/wikipedia/commons/3/39/Directed_acyclic_graph_3.svg
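
The idea of DAG scheduling with more pending tasks than processing units can be sketched in a few lines. This is a simplified illustration based on Kahn's topological ordering, not Spark's actual scheduler; the `schedule` function and the task names are hypothetical.

```python
from collections import deque

def schedule(dag, units):
    """Group tasks into rounds of at most `units` tasks, respecting dependencies.

    `dag` maps each task to the list of tasks it depends on.
    """
    indegree = {task: len(deps) for task, deps in dag.items()}
    dependents = {task: [] for task in dag}
    for task, deps in dag.items():
        for dep in deps:
            dependents[dep].append(task)

    # Tasks with no unmet dependencies are ready to run.
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    rounds = []
    while ready:
        batch = [ready.popleft() for _ in range(min(units, len(ready)))]
        rounds.append(batch)
        for task in batch:
            for nxt in sorted(dependents[task]):
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)
    if sum(len(batch) for batch in rounds) != len(dag):
        raise ValueError("dependency cycle detected")
    return rounds

# Hypothetical engine workflow: three algorithms train in parallel after preparation.
dag = {
    "read_events": [],
    "prepare": ["read_events"],
    "algo_a": ["prepare"],
    "algo_b": ["prepare"],
    "algo_c": ["prepare"],
    "serve": ["algo_a", "algo_b", "algo_c"],
}
for batch in schedule(dag, units=2):
    print(batch)
```

With two processing units the three algorithms need two rounds; with three units they fit in one. The dependency graph itself decides what may run next, so no central task table has to be polled.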

Solution Source Code:

http://github.com/predictionio

Architectural Challenge 2

Distributed In-memory Model Retrieval

Needs:

• Engines produce models that are distributed across a cluster. We need a way to serve these distributed in-memory models to queries in real time.

Solution

All PredictionIO engine instances are launched inside a “SparkContext”.

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.

Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png

• When an engine is local to a single machine, it loads the model into its own memory.

• When an engine is distributed, SparkContext will automatically load the model across the cluster.

Conceptual Code for the Solution

val sc = new SparkContext(conf)
...
val model =
  if (model_is_distributed) {
    if (model_is_persisted) {
      // Reload the persisted model as an RDD
      sc.objectFile(model_on_HDFS)
    } else {
      // No persisted model: train again
      engine.algo.train()
    }
  } else {
    ...
  }
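
The "load if persisted, otherwise train" branch can be tried out on a single machine. The sketch below is a local analogue in Python using `pickle` in place of Spark and HDFS; `load_or_train` and `toy_train` are hypothetical names, and writing the model back to disk after training is an added assumption that the conceptual code does not show.

```python
import os
import pickle
import tempfile

def load_or_train(path, train):
    """Return (model, source): load a persisted model if present, else train and persist."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f), "loaded"
    model = train()
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model, "trained"

def toy_train():
    # Stand-in for engine.algo.train(): a fixed item -> score table.
    return {"bulgogi": 0.9, "bibimbap": 0.7}

model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
first = load_or_train(model_path, toy_train)
second = load_or_train(model_path, toy_train)
print(first[1], second[1])
```

The first call trains and the second call reuses the stored model, which is the behaviour the persisted branch is meant to provide.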

PredictionIO 0.8

Built-in Engines:

• Item Recommendation

• Item Rank

• Item Similarity

Create an Engine Instance Project….

$ pio instance io.prediction.engines.itemrec

$ cd io.prediction.engines.itemrec

$ pio register

Collect Event Data….

cli = predictionio.EventClient("<app_id>")

cli.record_user_action_on_item("like", "John", "bulgogi_12")

cli.record_user_action_on_item("view", "John", "bimbimbap_13")

Configure the Engine Instance settings

in params/datasource.json

{

"appId": <app_id>,

"actions": [

"view", "like", ...

], ...

}

Train the Data Model

$ pio train

Deploy the Engine Instance

$ pio deploy

Retrieve Prediction Results

from predictionio import EngineClient

client = EngineClient(url="http://localhost:8000")

prediction = client.send_query({"uid": "John", "n": 3})

print(prediction)

Output

{u'items': [{u'272': 9.929327011108398}, {u'313': 9.92607593536377}, {u'347': 9.92170524597168}]}
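
The result is a list of single-entry dictionaries mapping an item ID to its score. A small helper, sketched here under the assumption that every response has this same shape, flattens it into ranked pairs; `ranked_items` is a hypothetical name, not part of the SDK.

```python
# Response copied from the output above (Python 2 `u` prefixes dropped).
prediction = {"items": [{"272": 9.929327011108398},
                        {"313": 9.92607593536377},
                        {"347": 9.92170524597168}]}

def ranked_items(prediction):
    """Flatten [{item_id: score}, ...] into (item_id, score) pairs, best first."""
    pairs = [(item_id, score)
             for entry in prediction["items"]
             for item_id, score in entry.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

for item_id, score in ranked_items(prediction):
    print(item_id, round(score, 3))
```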

You can also….

• Change algorithm

• Tune algorithm parameter

• Compare and evaluate algorithm

• Add custom business logic

SDKs for:

• Python

• Ruby

• PHP

• Java / Android

• Scala

• Node.js

• iOS

• Meteor

• more….

Also,

build your own Engine!

Applications

of

Machine Learning

Speech Recognition

Personal Newsfeed

SPAM Filtering

Recommendation

Driverless Car

Churn Prediction

Ad Targeting

Fraud Detection


- @PredictionIO

- prediction.io - Newsletters

- github.com/predictionio

Thank you!

Korean Documentation (Beta)!

http://docs.prediction.io/kr
