[2C2]PredictionIO
Uploaded by naver-d2
An Open Source Machine Learning Server
for Developers
@PredictionIO #PredictionIO
Simon Chan
• Simon Chan - CEO of PredictionIO
• A small team of Data Scientists and Engineers
• Mainly based in Silicon Valley, also London and Hong Kong
Thank you for having me here today!
Top Github Open Source
• Over 5000 developers engaged
• Powering over 200 applications
Talk Focus:
• Machine Learning - A (Very) Brief Review
• Challenges We Face When Building PredictionIO
Machine Learning is Simple?
I am going to give an example that will make you… HUNGRY!
FOOD Club – Menu
# Using PredictionIO
# Collect Data
cli = predictionio.EventClient("<my_app_id>")
cli.record_user_action_on_item("buy", "John", "BulgogiA")
# Predict top preferences
eng = predictionio.EngineClient("<my_engine_url>")
rec = eng.send_query({"uid": "John", "n": 5})
Coding time….
The Magic Behind: Engine
1. Data Sourcing and Preparation
2. Algorithm
3. Serving
4. Evaluation
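The four components can be pictured as four small functions chained together. The sketch below is only an illustration in plain Python: the event tuples, the function names and the popularity-count "algorithm" are all assumptions made for this example, not PredictionIO's actual engine API.

```python
from collections import Counter

# Hypothetical event log: (user, action, item). Not PredictionIO's storage format.
EVENTS = [
    ("John", "buy", "bulgogi"),
    ("Mary", "buy", "bulgogi"),
    ("Mary", "buy", "bibimbap"),
    ("Ann", "buy", "kimchi"),
]

def prepare(events):
    """1. Data sourcing and preparation: keep (user, item) pairs for 'buy' events."""
    return [(user, item) for user, action, item in events if action == "buy"]

def train(pairs):
    """2. Algorithm: a stand-in 'model' that counts how often each item was bought."""
    return Counter(item for _, item in pairs)

def serve(model, pairs, uid, n):
    """3. Serving: recommend the n most popular items the user has not bought yet."""
    seen = {item for user, item in pairs if user == uid}
    ranked = [item for item, _ in model.most_common() if item not in seen]
    return ranked[:n]

def evaluate(events, uid, held_out_item, n):
    """4. Evaluation: hide one purchase and check whether it is recommended back."""
    remaining = [e for e in events if e != (uid, "buy", held_out_item)]
    pairs = prepare(remaining)
    return held_out_item in serve(train(pairs), pairs, uid, n)

pairs = prepare(EVENTS)
model = train(pairs)
recommendations = serve(model, pairs, "John", 2)
hit = evaluate(EVENTS, "Mary", "bulgogi", 1)
```

Keeping the stages separate is what lets one stage (for example the algorithm) be swapped without touching the others.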
Challenges and Solutions
Architectural Challenge 1
Workflow Co-ordination on a Distributed Cluster
Needs:
• Support multiple distributed engines
• Support multiple algorithms executing in parallel
How to coordinate the workflow when you have more pending tasks than processing units?
Attempt #1
Use a database system to store tasks, and have a pool of workers pull tasks from it.
• Inefficient: the database becomes a bottleneck and potentially a single point of failure.
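A minimal sketch of this first attempt, assuming an in-memory SQLite table as the task store and four threads as the worker pool (both are stand-ins chosen for illustration, not what PredictionIO used). Every worker has to go through the same lock and the same database to claim a task, which is where the contention comes from.

```python
import sqlite3
import threading

# One shared in-memory database plays the role of the central task store.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)")
db.executemany("INSERT INTO tasks (id, status) VALUES (?, 'pending')",
               [(i,) for i in range(20)])
db.commit()

db_lock = threading.Lock()   # every claim is serialized through this one lock
completed = []               # (worker_id, task_id) pairs

def claim_task():
    """Atomically pick one pending task and mark it as running."""
    with db_lock:
        row = db.execute(
            "SELECT id FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        db.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],))
        db.commit()
        return row[0]

def worker(worker_id):
    """Keep pulling tasks until the store has no pending task left."""
    while True:
        task_id = claim_task()
        if task_id is None:
            return
        completed.append((worker_id, task_id))  # the real work would happen here

threads = [threading.Thread(target=worker, args=(w,)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Adding more workers does not help here, because all of them still wait on the single store.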
Attempt #2
Use an Akka cluster.
Akka is a toolkit and runtime for building highly concurrent, distributed, and fault tolerant event-driven applications on the JVM.
• Fundamentally the same problem as above.
• Need to build a management suite on top.
Solution
Apache Spark: directed acyclic graph (DAG) scheduling
Adapts to many different infrastructures: Apache Spark standalone cluster, Apache Hadoop 2 YARN, Apache Mesos.
Source: http://upload.wikimedia.org/wikipedia/commons/3/39/Directed_acyclic_graph_3.svg
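To show the idea of DAG scheduling with more pending tasks than processing units, here is a small pure-Python sketch. It does not use Spark; the task names and the two-worker limit are made up for the example, and `graphlib` from the standard library (Python 3.9+) stands in for Spark's scheduler.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
DAG = {
    "prepare": {"read_events"},
    "train_algo_a": {"prepare"},
    "train_algo_b": {"prepare"},
    "evaluate": {"train_algo_a", "train_algo_b"},
}

def run_dag(dag, max_workers=2):
    """Run tasks in dependency order with a bounded number of workers."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()          # raises CycleError if the graph is not acyclic
    finished = []             # task names in completion order
    running = {}              # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose dependencies are all done.
            for task in sorter.get_ready():
                # The "work" here just returns the task name.
                running[pool.submit(lambda name=task: name)] = task
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                task = running.pop(future)
                finished.append(future.result())
                sorter.done(task)   # unlocks the tasks that depend on it
    return finished

order = run_dag(DAG)
```

The two training tasks may finish in either order, but `evaluate` never starts before both are done.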
Solution Source Code:
http://github.com/predictionio
Architectural Challenge 2
Distributed In-memory Model Retrieval
Needs:
• Engines produce models that are distributed across a cluster. We need a way to serve these distributed in-memory models to queries in real time.
Solution
All PredictionIO engine instances are launched inside a “SparkContext”.
A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png
• When an engine is local to a single machine, it loads the model into its own memory.
• When an engine is distributed, the SparkContext automatically loads the model across the cluster.
Conceptual Code for the Solution
val sc = new SparkContext(conf)
...
val model =
  if (model_is_distributed) {
    if (model_is_persisted) {
      // Reload a previously saved distributed model
      sc.objectFile(model_on_HDFS)
    } else {
      // No saved model: train it again
      engine.algo.train()
    }
  } else {
    ...
  }
PredictionIO 0.8
Built-in Engines:
•Item Recommendation
•Item Rank
•Item Similarity
Create an Engine Instance Project….
$ pio instance io.prediction.engines.itemrec
$ cd io.prediction.engines.itemrec
$ pio register
Collect Event Data….
cli = predictionio.EventClient("<app_id>")
cli.record_user_action_on_item("like", "John", "bulgogi_12")
cli.record_user_action_on_item("view", "John", "bimbimbap_13")
Configure the Engine Instance settings
in params/datasource.json
{
"appId": <app_id>,
"actions": [
"view", "like", ...
], ...
}
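If you prefer to generate this file from a script, a sketch could look like the one below. It writes only the two keys shown above. The app ID `1`, the temporary directory and the helper name are assumptions for illustration, and the integer check is only a guess based on the unquoted placeholder.

```python
import json
import os
import tempfile

def write_datasource_params(directory, app_id, actions):
    """Write a datasource.json holding only the two keys shown on the slide."""
    if not isinstance(app_id, int):
        raise TypeError("app_id must be an integer")
    path = os.path.join(directory, "datasource.json")
    with open(path, "w") as f:
        json.dump({"appId": app_id, "actions": list(actions)}, f, indent=2)
    return path

# Hypothetical values: app ID 1 and a throwaway directory instead of params/.
params_path = write_datasource_params(tempfile.mkdtemp(), 1, ["view", "like"])
with open(params_path) as f:
    loaded = json.load(f)
```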
Train the Data Model
$ pio train
Deploy the Engine Instance
$ pio deploy
Retrieve Prediction Results
from predictionio import EngineClient
client = EngineClient(url="http://localhost:8000")
prediction = client.send_query({"uid": "John", "n": 3})
print prediction
Output
{u'items': [{u'272': 9.929327011108398}, {u'313': 9.92607593536377}, {u'347': 9.92170524597168}]}
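Each entry in `items` is a one-key dictionary of item ID to score, so an application usually flattens it first. A small sketch, using the response above as sample data (the helper name `ranked_items` is made up for this example):

```python
# Response copied from the output above (Python 3 shows it without the u'' prefixes).
prediction = {"items": [{"272": 9.929327011108398},
                        {"313": 9.92607593536377},
                        {"347": 9.92170524597168}]}

def ranked_items(response):
    """Flatten [{item_id: score}, ...] into (item_id, score) pairs, best first."""
    pairs = [(item_id, score)
             for entry in response["items"]
             for item_id, score in entry.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

top = ranked_items(prediction)
top_ids = [item_id for item_id, _ in top]
```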
You can also….
• Change algorithm
• Tune algorithm parameters
• Compare and evaluate algorithms
• Add custom business logic
SDKs for:
• Python
• Ruby
• PHP
• Java / Android
• Scala
• Node.js
• iOS
• Meteor
• more….
Also,
build your own Engine!
Applications of Machine Learning
Speech Recognition
Personal Newsfeed
SPAM Filtering
Recommendation
Driverless Car
Churn Prediction
Ad Targeting
Fraud Detection
- @PredictionIO
- prediction.io - Newsletters
- github.com/predictionio
Thank you!
Korean Documentation (Beta)!
http://docs.prediction.io/kr