Data By the Bay 2016 - May 17, 2016

Using for NLP Michelle Casbon Text By the Bay May 17, 2016 San Francisco

Transcript of Data By the Bay 2016 - May 17, 2016

Page 1: Data By the Bay 2016 - May 17, 2016

Using for NLP
Michelle Casbon
Text By the Bay
May 17, 2016
San Francisco

Page 2: Data By the Bay 2016 - May 17, 2016

The construction of predictive models, trained on features extracted from raw text

Page 3: Data By the Bay 2016 - May 17, 2016

Turn text into numbers, do some math, and turn it back into text.

Page 4: Data By the Bay 2016 - May 17, 2016

NLP in the wild
• Data ingestion
• Interactive Voice Response
• SMS prioritization
• Multilingual news
• Release feedback
• Intent to purchase

Page 5: Data By the Bay 2016 - May 17, 2016

Prediction

Page 6: Data By the Bay 2016 - May 17, 2016

Math to the rescue

$$\ln\frac{p}{1-p} = a + BX + \varepsilon$$

$$\frac{p}{1-p} = e^{a + BX + \varepsilon}$$

$$p = \frac{1}{1 + e^{-a - BX}}$$
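The three lines above describe one relationship read in both directions: the log-odds are linear in the features, and inverting that gives the probability. A minimal sketch in plain Python (standard library only, not MLlib); the values chosen for `a`, `B`, and `X` are made up for illustration and do not come from the talk.

```python
import math

def logit(p):
    """Log-odds of a probability p in the open interval (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps a linear score z = a + BX to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept, coefficients, and feature vector (illustrative only).
a = -1.0
B = [0.5, 0.0, 0.25]
X = [1.0, 0.0, 3.0]

z = a + sum(b * x for b, x in zip(B, X))  # linear score a + BX
p = sigmoid(z)                            # probability of the positive class
```

Applying `logit` to `p` recovers `z`, which is the first equation with the error term left out.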

Page 7: Data By the Bay 2016 - May 17, 2016

MLlib to the rescue

Page 8: Data By the Bay 2016 - May 17, 2016

Training Data
pipeline.fit(training)

Page 9: Data By the Bay 2016 - May 17, 2016

[1.0, 3.0, 7.0, …]

Page 10: Data By the Bay 2016 - May 17, 2016

IdiML to the rescue
https://github.com/g-c-k/idiml

Page 11: Data By the Bay 2016 - May 17, 2016

IdiML
• Feature extraction
• Model training
• Prediction

[Diagram: "Lorem ipsum dolor sit amet, consectetur adipiscing elit" → Feature Extraction → [1.0, 0.0, 3.0] → Training on [1.0, [1.0, 0.0, 3.0]] → Prediction → PROFIT]

Page 12: Data By the Bay 2016 - May 17, 2016

Featurization

[Diagram: "Lorem ipsum dolor sit amet, consectetur adipiscing elit" → Extract Content → Tokenize → Bigrams, Trigrams → Feature Lookup → Vector [1.0, 0.0, 3.0]]
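The featurization steps on this slide (tokenize, build bigrams and trigrams, look each feature up, emit a vector) can be sketched in a few lines of plain Python. This is an illustration, not IdiML's implementation: the function names, the letter-only tokenizer, and the use of raw counts as vector values are all assumptions.

```python
import re

def tokenize(text):
    """Lowercase the text and split on anything that is not a letter a-z."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text):
    """Unigrams, bigrams, and trigrams of the text."""
    tokens = tokenize(text)
    return tokens + ngrams(tokens, 2) + ngrams(tokens, 3)

def build_lookup(documents):
    """Give each distinct feature a vector index, in order of first appearance."""
    lookup = {}
    for doc in documents:
        for feature in extract_features(doc):
            lookup.setdefault(feature, len(lookup))
    return lookup

def vectorize(text, lookup):
    """Count known features; features missing from the lookup are ignored."""
    vector = [0.0] * len(lookup)
    for feature in extract_features(text):
        index = lookup.get(feature)
        if index is not None:
            vector[index] += 1.0
    return vector

docs = ["Lorem ipsum dolor sit amet", "consectetur adipiscing elit"]
lookup = build_lookup(docs)
vector = vectorize("lorem ipsum lorem", lookup)
```

Here the lookup is built once from the training documents and reused unchanged for every later document, so the same feature always lands at the same vector position.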

Page 13: Data By the Bay 2016 - May 17, 2016

Model Training

[Diagram: Vector [1.0, 0.0, 3.0] → Add classification → LabeledPoint [1.0, [1.0, 0.0, 3.0]] → LogisticRegressionWithLBFGS → LogisticRegressionModel → Model Storage]
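To make the training step concrete, here is a toy stand-in in plain Python. It is not MLlib code: plain `(label, vector)` tuples play the role of `LabeledPoint`, batch gradient descent is swapped in for L-BFGS, and a dict with a hypothetical key plays the role of model storage. The training rows are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(a, B, x):
    """Linear score a + BX for one feature vector."""
    return a + sum(b * xi for b, xi in zip(B, x))

def train(labeled_points, steps=2000, lr=0.1):
    """Fit intercept a and weights B by batch gradient descent on log-loss."""
    n_features = len(labeled_points[0][1])
    m = len(labeled_points)
    a = 0.0
    B = [0.0] * n_features
    for _ in range(steps):
        grad_a = 0.0
        grad_B = [0.0] * n_features
        for label, x in labeled_points:
            err = sigmoid(score(a, B, x)) - label  # prediction error
            grad_a += err
            for j, xi in enumerate(x):
                grad_B[j] += err * xi
        a -= lr * grad_a / m
        B = [b - lr * g / m for b, g in zip(B, grad_B)]
    return a, B

# (label, vector) pairs, the same shape as [1.0, [1.0, 0.0, 3.0]] on the slide.
training = [
    (1.0, [1.0, 0.0, 3.0]),
    (1.0, [2.0, 0.0, 2.0]),
    (0.0, [0.0, 1.0, 4.0]),
    (0.0, [0.0, 2.0, 1.0]),
]

a, B = train(training)
model_store = {"example-model": (a, B)}  # hypothetical stand-in for model storage
```

The stored pair `(a, B)` is all that prediction needs later, which is why the trained model can be kept separately from the training data.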

Page 14: Data By the Bay 2016 - May 17, 2016

Prediction

[Diagram: New document ("Lorem ipsum dolor sit amet, consectetur adipiscing elit") → Extract Content → Tokenize → Bigrams, Trigrams → Feature Lookup → Vector [0.0, 1.0, 4.0] → Model Lookup → Predict → Classification Lookup → PROFIT]
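The last three steps (model lookup, predict, classification lookup) turn the vector back into text. A minimal sketch in plain Python, assuming a logistic regression model stored as an intercept and weights; the model id, the parameter values, and the class names are hypothetical and only there to make the example run.

```python
import math

# Hypothetical stored artifacts; in practice these would come out of training.
model_store = {
    "example-model": {"intercept": -0.5, "weights": [1.2, -0.8, 0.1]},
}
class_names = {0: "negative", 1: "positive"}  # hypothetical classification lookup

def predict(model_id, vector, threshold=0.5):
    """Look up the model, score the vector, and map the result to a class name."""
    model = model_store[model_id]                      # model lookup
    z = model["intercept"] + sum(
        w * x for w, x in zip(model["weights"], vector)
    )
    p = 1.0 / (1.0 + math.exp(-z))                     # predict
    label = 1 if p >= threshold else 0
    return class_names[label], p                       # classification lookup

name, probability = predict("example-model", [0.0, 1.0, 4.0])
```

The input here is the `[0.0, 1.0, 4.0]` vector from the slide; with these made-up weights it scores below the threshold and maps to the negative class.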

Page 15: Data By the Bay 2016 - May 17, 2016

What makes it so great?

Page 16: Data By the Bay 2016 - May 17, 2016

Single object

Page 17: Data By the Bay 2016 - May 17, 2016

Flexibility
• Deployment environment
• Device
• Logging framework

Page 18: Data By the Bay 2016 - May 17, 2016

Standardization for developers

[Diagram: Core functionality · Custom ML · REST API · IdiML persistence layer]

Page 19: Data By the Bay 2016 - May 17, 2016

Version Control

Page 20: Data By the Bay 2016 - May 17, 2016

Hyperparameter Tuning

Page 21: Data By the Bay 2016 - May 17, 2016

Performance… if you have small data

Task                    Time in µs
Vector prediction       300
DataFrame prediction    7800

DataFrames are slow ...
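A quick back-of-the-envelope reading of those two numbers, assuming each figure is the time for a single prediction and that predictions run one after another on a single thread (the slide does not say either way):

```python
# Timings from the slide, in microseconds.
vector_us = 300
dataframe_us = 7800

slowdown = dataframe_us / vector_us              # how many times slower DataFrames are
vector_per_sec = 1_000_000 / vector_us           # sequential predictions per second
dataframe_per_sec = 1_000_000 / dataframe_us

print(f"DataFrame prediction is {slowdown:.0f}x slower")
print(f"about {vector_per_sec:.0f} vs {dataframe_per_sec:.0f} predictions per second")
```

Under those assumptions the Vector path is 26 times faster, roughly 3,333 predictions per second against roughly 128.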

Page 22: Data By the Bay 2016 - May 17, 2016

Performance

Page 23: Data By the Bay 2016 - May 17, 2016

Computing power to process the entire Twitter feed in real-time

from this: to this:

Page 24: Data By the Bay 2016 - May 17, 2016

What’s next for IdiML?
• Support more statistical models
• Expand automated hyperparameter tuning across the full training pipeline
• Support more options for featurization
• Generic external touchpoints

Page 25: Data By the Bay 2016 - May 17, 2016

Summary
• Flexibility, speed, woot!
• Continuous stream processing, woot!
• Multi-language support, woot!
• Scala & MLlib, woot!

Page 26: Data By the Bay 2016 - May 17, 2016

Michelle Casbon

@texasmichelle

https://github.com/g-c-k/idiml