Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

MLconf ATL! Sept 23rd, 2016. Chris Fregly, Research Scientist @ PipelineIO

Transcript of Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Page 1: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

MLconf ATL! Sept 23rd, 2016

Chris Fregly, Research Scientist @ PipelineIO

Page 2: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Who am I?

Chris Fregly, Research Scientist @ PipelineIO, San Francisco

Previously, Engineer @ Netflix, Databricks, and IBM Spark

Contributor @ Apache Spark, Committer @ Netflix OSS

Founder @ Advanced Spark and TensorFlow Meetup

Author @ Advanced Spark (advancedspark.com)

Page 3: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Advanced Spark and TensorFlow Meetup

Page 4: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

ATL Spark Meetup (9/22)

http://www.slideshare.net/cfregly/atlanta-spark-user-meetup-09-22-2016

Page 5: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

ATL Hadoop Meetup (9/21)

http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016

Page 6: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Page 7: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Confession #1

I Failed Linguistics in College!

Chose Pass/Fail Option

(90 (mid-term) + 70 (final)) / 2 = 80 = C+

How did a C+ turn into an F?

ZERO (0) CLASS PARTICIPATION?!

Page 8: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Confession #2

I Hated Statistics in College

2 Degrees: Mechanical + Manufacturing Engineering

Approximations were Bad!

I Wasn’t a Fluffy Physics Major

Though, I Kinda Wish I Was!

Page 9: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Wait… Please Don’t Leave!

I’m Older and Wiser Now

Approximate is the New Exact

Computational Linguistics and NLP are My Jam!

Page 10: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Agenda

TensorFlow + Neural Nets

NLP Fundamentals

NLP Models

Page 11: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

What is TensorFlow?

General Purpose Numerical Computation Engine

Happens to be good for neural nets!

Tooling

TensorBoard (port 6006 == `goog`)

DAG-based like Spark!

Computation graph is logical plan

Stored in Protobufs

TF converts logical -> physical plan

Lots of Libraries

TFLearn (TensorFlow’s Scikit-learn Impl)

TensorFlow Serving (Prediction Layer)

Distributed and GPU-Optimized

Page 12: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

What are Neural Networks?

Like All ML, Goal is to Minimize Loss (Error)

Error relative to known outcome of labeled data

Mostly Supervised Learning Classification

Labeled training data

Training Steps

Step 1: Randomly Guess Input Weights

Step 2: Calculate Error Against Labeled Data

Step 3: Determine Gradient Value, +/- Direction

Step 4: Back-propagate Gradient to Update Each Input Weight

Step 5: Repeat from Step 2 with New Weights until Convergence

Activation Function
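The training steps above can be sketched in plain Python for a single neuron with a sigmoid activation. This is a minimal illustration, not code from the talk: the OR dataset, the log-loss choice, the learning rate, and names such as `train_neuron` are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, learning_rate=0.5, epochs=2000, seed=0):
    """Fit one two-input sigmoid neuron with full-batch gradient descent."""
    rng = random.Random(seed)
    # Step 1: randomly guess the input weights (plus a bias term).
    w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    b = rng.uniform(-1, 1)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (x1, x2), label in samples:
            # Step 2: calculate the error against the labeled data.
            pred = sigmoid(w[0] * x1 + w[1] * x2 + b)
            # Step 3: gradient value and sign. With log loss (an assumption
            # of this sketch), the gradient at the neuron's input is simply
            # prediction minus label.
            delta = pred - label
            # Step 4: propagate the gradient back to each input weight.
            grad_w[0] += delta * x1
            grad_w[1] += delta * x2
            grad_b += delta
        # Step 5: apply the update and repeat with the new weights.
        w[0] -= learning_rate * grad_w[0]
        w[1] -= learning_rate * grad_w[1]
        b -= learning_rate * grad_b
    return w, b


# Hypothetical labeled data: logical OR of two binary inputs.
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_neuron(OR_DATA)
predictions = [round(sigmoid(w[0] * x1 + w[1] * x2 + b)) for (x1, x2), _ in OR_DATA]
```

A fixed number of epochs stands in for a real convergence check here; in practice one would usually stop when the loss stops improving.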

Page 13: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Activation Functions

Goal: Learn and Train a Model on Input Data

Non-Linear Functions Find Non-Linear Fit of Input Data

Common Activation Functions

Sigmoid Function (sigmoid): output range (0, 1)

Hyperbolic Tangent (tanh): output range (-1, 1)
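As a quick illustration of the two activation functions named above, the sketch below evaluates both with the standard library only. The sample inputs are arbitrary values chosen for the example.

```python
import math


def sigmoid(z):
    """Squash any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squash any real input into the open interval (-1, 1)."""
    return math.tanh(z)


for z in (-4.0, 0.0, 4.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  tanh={tanh(z):+.4f}")
```

The two are closely related: tanh(z) equals 2 * sigmoid(2z) - 1, so tanh is a rescaled sigmoid centered on zero.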

Page 14: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Back Propagation

http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html

Gradients Calculated by Comparing to Known Label

Use Gradients to Adjust Input Weights

Chain Rule
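To make the chain rule concrete, here is a small sketch for one weight feeding a sigmoid neuron with a squared-error loss. The loss choice and the numbers are assumptions for illustration; the finite-difference function is only there to check the chain-rule result.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, target):
    """Squared error of a one-weight sigmoid neuron."""
    return 0.5 * (sigmoid(w * x) - target) ** 2


def grad_chain_rule(w, x, target):
    """dLoss/dw as a product of three local derivatives."""
    y = sigmoid(w * x)
    dloss_dy = y - target   # compare the output to the known label
    dy_dz = y * (1.0 - y)   # derivative of the sigmoid
    dz_dw = x               # derivative of z = w * x with respect to w
    return dloss_dy * dy_dz * dz_dw


def grad_numeric(w, x, target, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (loss(w + eps, x, target) - loss(w - eps, x, target)) / (2.0 * eps)


analytic = grad_chain_rule(0.3, 2.0, 1.0)
numeric = grad_numeric(0.3, 2.0, 1.0)
```

The negative sign of the gradient here says the weight should increase to move the output toward the label.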

Page 15: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Loss/Error Optimizers

Gradient Descent

Batch (entire dataset)

Per-record (don’t do this!)

Mini-batch (empirically 16 -> 512)

Stochastic (approximation)

Momentum (optimization)

AdaGrad

SGD with adaptive learning rates per feature

Set initial learning rate

More likely to incorrectly converge on local minima

http://www.slideshare.net/cfregly/gradient-descent-back-propagation-and-auto-differentiation-advanced-spark-and-tensorflow-meetup-08042016
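The optimizer variants listed above can be compared on a toy problem. The sketch below fits a single weight to noise-free data generated from y = 3x, using shuffled mini-batches of 16 records. The dataset, the learning rates, the 0.9 momentum factor, and the function names are all assumptions for the example. With only one weight, AdaGrad's per-feature accumulator reduces to a single running sum.

```python
import random


def minibatches(data, batch_size, rng):
    """Shuffle the data, then yield it in chunks of batch_size records."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


def mean_gradient(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over one mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def fit(data, optimizer, lr, batch_size=16, epochs=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    velocity = 0.0      # momentum state
    grad_sq_sum = 0.0   # AdaGrad state: running sum of squared gradients
    for _ in range(epochs):
        for batch in minibatches(data, batch_size, rng):
            g = mean_gradient(w, batch)
            if optimizer == "sgd":
                w -= lr * g
            elif optimizer == "momentum":
                velocity = 0.9 * velocity - lr * g
                w += velocity
            elif optimizer == "adagrad":
                grad_sq_sum += g * g
                # The effective step shrinks as squared gradients accumulate.
                w -= lr * g / (grad_sq_sum ** 0.5 + 1e-8)
            else:
                raise ValueError(f"unknown optimizer: {optimizer}")
    return w


# Hypothetical dataset: 64 points on the line y = 3x.
DATA = [(k / 64.0, 3.0 * k / 64.0) for k in range(1, 65)]
w_sgd = fit(DATA, "sgd", lr=0.1)
w_momentum = fit(DATA, "momentum", lr=0.1)
w_adagrad = fit(DATA, "adagrad", lr=1.0)
```

AdaGrad only needs an initial learning rate because it scales every later step by the gradient history, which is also why its steps can become very small late in training.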

Page 16: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

The Math

Linear Algebra

Matrix Multiplication

Very Parallelizable

Calculus

Derivatives

Chain Rule

Page 17: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Convolutional Neural Networks

Feed-forward

Do not form a cycle

Apply Many Layers (aka Filters) to Input

Each Layer/Filter Picks up on Features

Features not necessarily human-grokkable

Examples of Human-grokkable Filters

3 color filters: RGB

Moving AVG for time series

Brute Force

Try Diff numLayers & layerSizes
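The moving-average example above is easy to show as a one-dimensional filter slid across a time series. This is a minimal sketch with made-up data; as in most CNN libraries, the kernel is applied without flipping (strictly speaking a cross-correlation). In a trained network the filter values would be learned rather than hand-picked.

```python
def convolve_valid(signal, kernel):
    """Slide the kernel across the signal, keeping only full overlaps."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# A human-grokkable filter: 3-point moving average.
smoothed = convolve_valid(series, [1 / 3, 1 / 3, 1 / 3])

# Another hand-written filter: difference between neighbors.
changes = convolve_valid(series, [-1.0, 1.0])
```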

Page 18: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

CNN Use Case: Stitch Fix

Stitch Fix Also Uses NLP to Analyze Return/Reject Comments

Stitch Fix @ Strata Conf SF 2016: Using Deep Learning to Create New Clothing Styles!

Page 19: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Recurrent Neural Networks

Forms a Cycle (vs. Feed-forward)

Maintains State over Time

Keep track of context

Learns sequential patterns

Decay over time

Use Cases

Speech

Text/NLP Prediction
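A minimal sketch of the recurrent state update may help here: each step mixes the new input with the previous hidden state, which is the cycle that a feed-forward network lacks. The scalar weights below are arbitrary values chosen for illustration, not learned parameters.

```python
import math


def rnn_step(x, h_prev, w_x=0.8, w_h=0.5, b=0.0):
    """New hidden state from the current input and the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, h0=0.0):
    """Feed a sequence through the cell, returning the state after each step."""
    states = []
    h = h0
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states


# One pulse followed by silence: the state remembers it, but the memory decays.
states = run_rnn([1.0, 0.0, 0.0, 0.0])

# The same final input yields a different state depending on earlier context.
after_one = run_rnn([1.0, 1.0])[-1]
after_zero = run_rnn([0.0, 1.0])[-1]
```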

Page 20: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

RNN Sequences

Input: Image; Output: Classification

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Input: Image; Output: Text (Captions)

Input: Text; Output: Class (Sentiment)

Input: Text (English); Output: Text (Spanish)

Input Layer

Hidden Layer

Output Layer

Page 21: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Character-based RNNs

Tokens are Characters vs. Words/Phrases

Microsoft trains every 3 characters

Fewer Combinations of Possible Neighbors

Only 26 alpha character tokens vs. millions of word tokens

Preserves state between 1st and 2nd ‘l’, which improves prediction

Page 22: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Long Short Term Memory (LSTM)

More Complex State Update Function than Vanilla RNN

Page 23: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

LSTM State Update

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Cell State

Forget Gate Layer (Sigmoid)

Input Gate Layer (Sigmoid)

Candidate Gate Layer (tanh)

Output Layer
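The gate layers labeled above combine into a single state update. The sketch below follows the standard LSTM equations in scalar form so it runs without any libraries. The parameter values are hypothetical, hand-picked so that the forget gate stays open and the input gate stays closed; they are not learned.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; params maps gate name -> (w_h, w_x, bias)."""
    def layer(name, squash):
        w_h, w_x, bias = params[name]
        return squash(w_h * h_prev + w_x * x + bias)

    forget = layer("forget", sigmoid)           # how much old cell state to keep
    write = layer("input", sigmoid)             # how much new information to admit
    candidate = layer("candidate", math.tanh)   # the proposed new information
    c = forget * c_prev + write * candidate     # updated cell state
    output = layer("output", sigmoid)           # how much of the cell to expose
    h = output * math.tanh(c)                   # new hidden state
    return h, c


# Hypothetical parameters: remember everything, admit almost nothing new.
PARAMS = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (0.5, 1.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=PARAMS)
```

With these settings the cell state passes through almost unchanged, which is the mechanism that lets an LSTM hold context longer than a vanilla RNN.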

Page 24: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Transfer Learning

Page 25: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Agenda

TensorFlow + Neural Nets

NLP Fundamentals

NLP Models

Page 26: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Use Cases

Document Summary

TextRank: TF/IDF + PageRank

Article Classification and Similarity

LDA: calculate top `k` topic distribution

Machine Translation

word2vec: compare word embedding vectors

Must Convert Text to Numbers!

Page 27: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Core Concepts

Corpus

Collection of text, e.g. documents, articles, genetic codes

Embeddings

Tokens represented/embedded in vector space

Learned, hidden features (~PCA, SVD)

Similar tokens cluster together, analogies cluster apart

k-skip-gram

Skip k neighbors when defining tokens

n-gram

Treat n consecutive tokens as a single token

Composable: 1-skip, bi-gram (every other word)
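The n-gram and k-skip-gram definitions above, and the way they compose, can be sketched in a few lines. This follows the slide's reading, where exactly k neighbors are skipped between members; some references define a k-skip-n-gram as allowing up to k skips instead. The function names are made up for the example.

```python
def ngrams(tokens, n):
    """Treat every run of n consecutive tokens as a single token."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_ngrams(tokens, n, k):
    """n-grams whose members are separated by exactly k skipped tokens."""
    stride = k + 1
    span = (n - 1) * stride + 1   # how many positions one gram covers
    return [
        tuple(tokens[i:i + span:stride])
        for i in range(len(tokens) - span + 1)
    ]


tokens = "approximate is the new exact".split()
bigrams = ngrams(tokens, 2)
# 1-skip bi-grams pair every other word.
one_skip_bigrams = skip_ngrams(tokens, n=2, k=1)
```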

Page 28: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Parsers and POS Taggers

Describe grammatical sentence structure

Requires context of entire sentence

Helps reason about sentence

80% obvious, simple token neighbors

Major bottleneck in NLP pipeline!

Page 29: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Pre-trained Parsers and Taggers

Penn Treebank

Parser and Part-of-Speech Tagger

Human-annotated (!)

Trained on 4.5 million words

Parsey McParseface

Trained by SyntaxNet

Page 30: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Feature Engineering

Lower-case

Preserve proper nouns using caret (`^`)

“MLconf” => “^m^lconf”

“Varsity” => “^varsity”

Encode Common N-grams (Phrases)

Create a single token using underscore (`_`)

“Senior Developer” => “senior_developer”

Stemming and Lemmatization

Try to avoid: let the neural network figure this out

Can preserve part of speech (POS) using “_noun”, “_verb”

“banking” => “banking_verb”
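The three encodings above are simple string transforms. The sketch below reproduces the slide's examples; the function names are hypothetical, the phrase list is assumed to be supplied by the caller, and deciding which words are proper nouns or which POS tag applies is left to an upstream tagger.

```python
import re


def mark_capitals(text):
    """Lower-case text, prefixing each originally upper-case letter with '^'."""
    return re.sub(r"[A-Z]", lambda m: "^" + m.group(0).lower(), text)


def encode_phrases(text, phrases):
    """Lower-case text and join each known phrase into one underscore token."""
    text = text.lower()
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text


def tag_pos(token, pos):
    """Append a part-of-speech suffix such as '_noun' or '_verb'."""
    return f"{token}_{pos}"


print(mark_capitals("MLconf"))
print(encode_phrases("Senior Developer", ["senior developer"]))
print(tag_pos("banking", "verb"))
```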

Page 31: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Agenda

TensorFlow + Neural Nets

NLP Fundamentals

NLP Models

Page 32: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Count-based Models

Goal: Convert Text to Vector of Neighbor Co-occurrences

Bag of Words (BOW)

Simple hashmap with word counts

Loses neighbor context

Term Frequency / Inverse Document Frequency (TF/IDF)

Normalizes based on token frequency

GloVe

Matrix factorization on co-occurrence matrix

Highly parallelizable, reduce dimensions, capture global co-occurrence stats

Log smoothing of probability ratios

Stores word vector diffs for fast analogy lookups
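The first two count-based models above fit in a short sketch. The bag of words is literally a hashmap of counts, and the TF/IDF shown is one common unsmoothed variant (term frequency times the log of inverse document frequency); other weightings exist. The three documents are made up for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Word counts only; word order (neighbor context) is discarded."""
    return Counter(text.lower().split())


def tf_idf(docs):
    """Per-document scores: term frequency * log(N / document frequency)."""
    bags = [bag_of_words(doc) for doc in docs]
    n_docs = len(bags)
    # Each document contributes at most 1 to a token's document frequency.
    doc_freq = Counter(token for bag in bags for token in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in bag.items()
        })
    return scores


DOCS = ["the cat sat", "the dog sat", "the dog barked"]
scores = tf_idf(DOCS)
```

A token that appears in every document, like "the" here, scores zero, while a token unique to one document scores highest.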

Page 33: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Neural-based Predictive Models

Goal: Predict Text using Learned Embedding Vectors

word2vec

Shallow neural network

Local: nearby words predict each other

Fixed word embedding vector size (e.g. 300)

Optimizer: Mini-batch Stochastic Gradient Descent (SGD)

SyntaxNet

Deep(er) neural network

Global(er)

Not a Recurrent Neural Net (RNN)!

Can combine with BOW-based models (e.g. word2vec CBOW)

Page 34: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

word2vec

CBOW word2vec

Predict target word from source context

A single source context is an observation

Loses useful distribution information

Good for small datasets

Skip-gram word2vec (Inverse of CBOW)

Predict source context words from target word

Each (source context, target word) tuple is observation

Better for large datasets
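The difference between the two word2vec variants shows up in how training examples are generated from the same text. The sketch below only builds the examples, not the network; the window size of 1 and the function names are assumptions for illustration.

```python
def cbow_examples(tokens, window=1):
    """CBOW: one (context words, target word) observation per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((tuple(context), target))
    return examples


def skipgram_examples(tokens, window=1):
    """Skip-gram: one (target word, context word) pair per neighbor."""
    return [
        (target, neighbor)
        for context, target in cbow_examples(tokens, window)
        for neighbor in context
    ]


tokens = "the quick brown fox".split()
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
```

CBOW collapses each whole context into a single observation, while skip-gram emits a separate pair for every neighbor, so the same text yields more training examples.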

Page 35: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

word2vec Libraries

gensim

Python only

Most popular

Spark ML

Python + Java/Scala

Supports only synonyms

Page 36: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

*2vec

lda2vec

LDA (global) + word2vec (local)

From Chris Moody @ Stitch Fix

like2vec

Embedding-based Recommender

Page 37: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

word2vec vs. GloVe

Both are Fundamentally Similar

Capture local co-occurrence statistics (neighbors)

Capture distance between embedding vectors (analogies)

GloVe

Count-based

Also captures global co-occurrence statistics

Requires upfront pass through entire dataset

Page 38: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

SyntaxNet POS Tagging

Determine coarse-grained grammatical role of each word

Multiple contexts, multiple roles

Neural Net Inputs: stack, buffer

Results: POS probability distro

Already Tagged

Page 39: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

SyntaxNet Dependency Parser

Determine fine-grained roles using grammatical relationships

“Transition-based”, Incremental Dependency Parser

Globally Normalized using Beam Search with Early Update

Parsey McParseface: Pre-trained Parser/Tagger avail in 40 langs

Fine-grained

Coarse-grained
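To show what "transition-based" and "incremental" mean, here is a minimal sketch of an arc-standard transition system, one common choice assumed for this illustration. The parser keeps a stack and a buffer and builds the dependency tree one action at a time. In a real parser a neural network scores the next action from the stack and buffer contents; in this sketch the action sequence and the sentence are hand-written.

```python
def parse(words, transitions):
    """Apply arc-standard transitions; arcs are (head, dependent) pairs."""
    stack = []
    buffer = list(words)
    arcs = []
    for action in transitions:
        if action == "SHIFT":
            # Move the next unread word onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # The word below the top becomes a dependent of the top word.
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT-ARC":
            # The top word becomes a dependent of the word below it.
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {action}")
    return arcs, stack, buffer


SENTENCE = ["I", "booked", "a", "ticket"]
ACTIONS = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "SHIFT", "LEFT-ARC", "RIGHT-ARC"]
arcs, stack, buffer = parse(SENTENCE, ACTIONS)
```

Beam search fits on top of this loop by keeping several candidate action sequences alive at each step instead of committing to a single one.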

Page 40: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

SyntaxNet Use Case: Nutrition

Nutrition and Health Startup in SF (Stealth)

Using Google’s SyntaxNet

Rate Recipes and Menus by Nutritional Value

Correct

Incorrect

Page 41: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Model Validation

Unsupervised Learning Requires Validation

Google has Published Analogy Tests for Model Validation

Thanks, Google!
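An analogy test asks questions of the form "a is to b as c is to ?" and checks whether vector arithmetic on the embeddings finds the expected word. The sketch below uses tiny hand-made two-dimensional vectors purely for illustration; real embeddings are learned and have hundreds of dimensions, and the function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Word closest to vec(b) - vec(a) + vec(c), excluding the three inputs."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], query))


def accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    hits = sum(analogy(vectors, a, b, c) == expected for a, b, c, expected in questions)
    return hits / len(questions)


# Toy vectors (made up): first axis ~ royalty, second axis ~ gender.
TOY_VECTORS = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}
QUESTIONS = [("man", "king", "woman", "queen"), ("king", "man", "queen", "woman")]
score = accuracy(TOY_VECTORS, QUESTIONS)
```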

Page 42: Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

Thank You, Atlanta!

Chris Fregly, Research Scientist @ PipelineIO

All Source Code, Demos, and Docker Images @ pipeline.io

Join the Global Meetup for all Slides and Videos @ advancedspark.com