Data By the Bay 2016 - May 17, 2016

Using for NLP
Michelle Casbon
Text By the Bay
May 17, 2016
San Francisco

The construction of predictive models, trained on features extracted from raw text

Turn text into numbers, do some math, and turn it back into text.

NLP in the wild
• Data ingestion
• Interactive Voice Response
• SMS prioritization
• Multilingual news
• Release feedback
• Intent to purchase

Prediction

Math to the rescue

$$\ln\frac{p}{1-p} = a + BX + \varepsilon$$

$$\frac{p}{1-p} = e^{a + BX + \varepsilon}$$

$$p = \frac{1}{1 + e^{-a - BX}}$$
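The logistic regression equations above describe one relationship in three forms: log-odds, odds, and probability. A minimal plain-Python sketch of that relationship follows, with the error term omitted. The values of `a`, `B`, and `X` are illustrative assumptions and do not come from the talk.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1): ln[p / (1 - p)]."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: p = 1 / [1 + e^(-z)]."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(a, B, X):
    """p = 1 / [1 + e^(-a - BX)] for intercept a, weights B, features X."""
    z = a + sum(b * x for b, x in zip(B, X))
    return sigmoid(z)

# Illustrative values only; none of these numbers come from the talk.
a = -1.0
B = [0.5, 0.25, 0.1]
X = [1.0, 0.0, 3.0]
p = predict_probability(a, B, X)

# Applying logit to p recovers the linear term a + BX.
recovered = logit(p)
```

Because `logit` and `sigmoid` are inverses, `recovered` equals `a + BX` up to floating-point error, which is the step from the first equation to the third.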

MLlib to the rescue

Training Data

pipeline.fit(training)

[1.0, 3.0, 7.0, …]

IdiML to the rescue
https://github.com/g-c-k/idiml

IdiML
• Feature extraction
• Model training
• Prediction

[1.0, [1.0, 0.0, 3.0]]

Feature Extraction

Training

Prediction

[1.0, 0.0, 3.0]

Lorem ipsum dolor sit amet, consectetur adipiscing elit

PROFIT

Featurization

Extract Content

Tokenize

Bigrams

Trigrams

Feature Lookup

[1.0, 0.0, 3.0]

Vector
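The featurization stages listed above (tokenize, bigrams, trigrams, feature lookup, vector) can be sketched in plain Python. This is an illustration under stated assumptions, not IdiML's implementation: the regex tokenizer, the count-based vector, and every function name here are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text):
    """Unigrams, bigrams, and trigrams of the tokenized text."""
    tokens = tokenize(text)
    return tokens + ngrams(tokens, 2) + ngrams(tokens, 3)

def build_lookup(documents):
    """Assign each distinct feature a vector index, in first-seen order."""
    lookup = {}
    for doc in documents:
        for feature in extract_features(doc):
            lookup.setdefault(feature, len(lookup))
    return lookup

def to_vector(text, lookup):
    """Count occurrences of each known feature; unknown features are skipped."""
    vector = [0.0] * len(lookup)
    for feature in extract_features(text):
        index = lookup.get(feature)
        if index is not None:
            vector[index] += 1.0
    return vector

lookup = build_lookup(["the cat sat", "the dog sat"])
vector = to_vector("the cat sat on the cat", lookup)
```

The lookup is built once from the training documents and reused unchanged at prediction time, so a given feature always lands at the same vector index.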


Model Training

LogisticRegressionWithLBFGS

[1.0, [1.0, 0.0, 3.0]]

LabeledPoint

Model Storage

[1.0, 0.0, 3.0]

Vector

Add classification

LogisticRegressionModel
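The training step above takes labeled vectors and produces a logistic regression model. The slide names MLlib's LogisticRegressionWithLBFGS; the sketch below swaps in plain batch gradient descent so that it runs without Spark. The training data, learning rate, and iteration count are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(labeled_points, learning_rate=0.1, iterations=1000):
    """Fit an intercept and weights by batch gradient descent on log loss.

    This stands in for L-BFGS, which is the optimizer named on the slide.
    """
    n_features = len(labeled_points[0][1])
    m = len(labeled_points)
    intercept = 0.0
    weights = [0.0] * n_features
    for _ in range(iterations):
        grad_b = 0.0
        grad_w = [0.0] * n_features
        for label, features in labeled_points:
            z = intercept + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label
            grad_b += error
            for j, x in enumerate(features):
                grad_w[j] += error * x
        intercept -= learning_rate * grad_b / m
        weights = [w - learning_rate * g / m for w, g in zip(weights, grad_w)]
    return intercept, weights

# Each example mirrors the [label, [features]] shape of a LabeledPoint.
# The values are made up for illustration.
training = [
    (1.0, [1.0, 0.0, 3.0]),
    (1.0, [2.0, 0.0, 2.0]),
    (0.0, [0.0, 1.0, 0.0]),
    (0.0, [0.0, 2.0, 1.0]),
]
intercept, weights = train(training)
```

The returned intercept and weights are what a model storage step would persist for later lookup.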

Prediction

Extract Content

Tokenize

Bigrams

Trigrams

Feature Lookup

[0.0, 1.0, 4.0]

Vector

Model Lookup

Predict

New document

[0.0, 1.0, 4.0]

Vector

Classification Lookup


PROFIT
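The prediction flow above (vector, model lookup, predict, classification lookup) is the "turn it back into text" half of the process. A minimal plain-Python sketch follows. The model store, its weights, the model name, and the class labels are all hypothetical and are not taken from IdiML.

```python
import math

# Hypothetical stored models, keyed by name. The weights are made up for
# illustration; a real deployment would load a trained model from storage.
MODEL_STORE = {
    "intent-to-purchase": {"intercept": -1.0, "weights": [0.8, -0.5, 0.6]},
}

# Hypothetical mapping from numeric class to a human-readable label.
CLASSIFICATIONS = {0.0: "no intent", 1.0: "intent to purchase"}

def predict(vector, model_name, threshold=0.5):
    """Return (label, probability) for a feature vector."""
    model = MODEL_STORE[model_name]                       # model lookup
    z = model["intercept"] + sum(
        w * x for w, x in zip(model["weights"], vector)
    )
    probability = 1.0 / (1.0 + math.exp(-z))              # predict
    numeric_class = 1.0 if probability >= threshold else 0.0
    return CLASSIFICATIONS[numeric_class], probability    # classification lookup

label, probability = predict([0.0, 1.0, 4.0], "intent-to-purchase")
```

The input vector here matches the `[0.0, 1.0, 4.0]` shown on the slide; producing it from a new document would reuse the same featurization steps as training.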

What makes it so great?

Single object

Flexibility
• Deployment environment
• Device
• Logging framework

Standardization for developers

Core functionality

Custom ML

…

REST API

IdiML persistence layer

Version Control

Hyperparameter Tuning

Performance… if you have small data

Task                   Time in µs
Vector prediction      300
DataFrame prediction   7800

DataFrames are slow ...

Performance

Computing power to process the entire Twitter feed in real-time

from this: to this:

What’s next for IdiML?
• Support more statistical models
• Expand automated hyperparameter tuning across the full training pipeline
• Support more options for featurization
• Generic external touchpoints

Summary
• Flexibility, speed, woot!
• Continuous stream processing, woot!
• Multi-language support, woot!
• Scala & MLlib, woot!

Michelle Casbon

@texasmichelle

https://github.com/g-c-k/idiml
