
Transcript of Atlanta MLconf Machine Learning Conference 09-23-2016

Page 1: Atlanta MLconf Machine Learning Conference 09-23-2016

MLconf ATL! Sept 23rd, 2016

Chris Fregly, Research Scientist @ PipelineIO

Page 2: Atlanta MLconf Machine Learning Conference 09-23-2016

Who am I?

Chris Fregly, Research Scientist @ PipelineIO, San Francisco

Previously, Engineer @ Netflix, Databricks, and IBM Spark

Contributor @ Apache Spark, Committer @ Netflix OSS

Founder @ Advanced Spark and TensorFlow Meetup

Author @ Advanced Spark (advancedspark.com)

Page 3: Atlanta MLconf Machine Learning Conference 09-23-2016

Advanced Spark and TensorFlow Meetup

Page 4: Atlanta MLconf Machine Learning Conference 09-23-2016

ATL Spark Meetup (9/22)

http://www.slideshare.net/cfregly/atlanta-spark-user-meetup-09-22-2016

Page 5: Atlanta MLconf Machine Learning Conference 09-23-2016

ATL Hadoop Meetup (9/21)

http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016

Page 6: Atlanta MLconf Machine Learning Conference 09-23-2016
Page 7: Atlanta MLconf Machine Learning Conference 09-23-2016

Confession #1

I Failed Linguistics in College! Chose Pass/Fail Option

(90 (mid-term) + 70 (final)) / 2 = 80 = C+
How did a C+ turn into an F?

ZERO (0) CLASS PARTICIPATION?!

Page 8: Atlanta MLconf Machine Learning Conference 09-23-2016

Confession #2

I Hated Statistics in College

2 Degrees: Mechanical + Manufacturing Engineering
Approximations were Bad!

I Wasn’t a Fluffy Physics Major. Though, I Kinda Wish I Was!

Page 9: Atlanta MLconf Machine Learning Conference 09-23-2016

Wait… Please Don’t Leave! I’m Older and Wiser Now

Approximate is the New Exact

Computational Linguistics and NLP are My Jam!

Page 10: Atlanta MLconf Machine Learning Conference 09-23-2016

Agenda

TensorFlow + Neural Nets

NLP Fundamentals

NLP Models

Page 11: Atlanta MLconf Machine Learning Conference 09-23-2016

What is TensorFlow?
General-Purpose Numerical Computation Engine

Happens to be good for neural nets!

Tooling: TensorBoard (port 6006 == `goog`)

DAG-based like Spark! The computation graph is the logical plan

Stored as Protobufs

TF converts logical -> physical plan

Lots of Libraries: TFLearn (TensorFlow's scikit-learn impl)

TensorFlow Serving (Prediction Layer)

Distributed and GPU-Optimized
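
A minimal sketch of the DAG idea, assuming the TensorFlow 1.x API that was current at the time of this talk: the Python code only declares the logical graph; the session compiles and runs the physical plan, and the written graph can be inspected in TensorBoard.

```python
# Sketch using the TensorFlow 1.x API (current at the time of this talk).
# Building ops only declares the logical computation graph (the DAG);
# nothing runs until a Session executes the physical plan.
import tensorflow as tf

a = tf.placeholder(tf.float32, shape=[2, 3], name="a")
b = tf.placeholder(tf.float32, shape=[3, 1], name="b")
c = tf.matmul(a, b, name="c")          # a node in the DAG, not yet computed

with tf.Session() as sess:
    result = sess.run(c, feed_dict={a: [[1, 2, 3], [4, 5, 6]],
                                    b: [[1], [0], [2]]})
    print(result)                      # [[ 7.] [16.]]

# Writing the graph lets TensorBoard (default port 6006) visualize the DAG:
tf.summary.FileWriter("/tmp/tf_logs", tf.get_default_graph())
```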

Page 12: Atlanta MLconf Machine Learning Conference 09-23-2016

What are Neural Networks?
Like All ML, the Goal is to Minimize Loss (Error)

Error relative to known outcome of labeled data

Mostly Supervised Learning Classification: Labeled training data

Training Steps
Step 1: Randomly Guess Input Weights

Step 2: Calculate Error Against Labeled Data

Step 3: Determine Gradient Value, +/- Direction

Step 4: Back-propagate Gradient to Update Each Input Weight

Step 5: Repeat from Step 2 with the New Weights until Convergence

Activation Function
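
A toy illustration of the five training steps above, sketched in plain NumPy (my own example, not from the slides), for a single linear weight trained with squared-error loss:

```python
import numpy as np

# Labeled training data: y = 2x (the "known outcome")
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = np.random.randn()                      # Step 1: randomly guess the input weight
learning_rate = 0.01

for step in range(200):
    y_hat = w * x                          # forward pass
    error = y_hat - y                      # Step 2: calculate error against labeled data
    gradient = np.mean(2 * error * x)      # Step 3: gradient value and +/- direction
    w -= learning_rate * gradient          # Step 4: back-propagate gradient to update the weight
                                           # Step 5: repeat with the new weight until convergence
print(w)                                   # converges toward 2.0
```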

Page 13: Atlanta MLconf Machine Learning Conference 09-23-2016

Activation Functions
Goal: Learn and Train a Model on Input Data

Non-Linear Functions Find Non-Linear Fit of Input Data

Common Activation Functions
Sigmoid Function (sigmoid): range (0, 1)
Hyperbolic Tangent (tanh): range (-1, 1)
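
For reference, the two activation functions named above in a short NumPy sketch (not from the slides): sigmoid squashes any input into (0, 1), tanh into (-1, 1).

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real input into (-1, 1)
    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # approx [0.12, 0.5, 0.88]
print(tanh(z))      # approx [-0.96, 0.0, 0.96]
```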

Page 14: Atlanta MLconf Machine Learning Conference 09-23-2016

Back Propagation

http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html

Gradients Calculated by Comparing to Known Label

Use Gradients to Adjust Input Weights

Chain Rule
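
A tiny worked example of the chain rule in back propagation (my own illustration, not from the slides): the gradient of the loss with respect to a weight is the product of the local derivatives along the path.

```python
import numpy as np

# Forward pass through one sigmoid neuron: p = sigmoid(w * x), loss = (p - label)^2
x, w, label = 1.5, 0.8, 1.0
z = w * x
p = 1.0 / (1.0 + np.exp(-z))

# Backward pass: the chain rule multiplies the local derivatives
dloss_dp = 2 * (p - label)        # d(loss)/dp
dp_dz    = p * (1 - p)            # d(sigmoid)/dz
dz_dw    = x                      # d(w*x)/dw
dloss_dw = dloss_dp * dp_dz * dz_dw
print(dloss_dw)                   # gradient used to adjust the input weight w
```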

Page 15: Atlanta MLconf Machine Learning Conference 09-23-2016

Loss/Error Optimizers
Gradient Descent

Batch (entire dataset)
Per-record (don’t do this!)
Mini-batch (empirically 16 -> 512)
Stochastic (approximation)
Momentum (optimization)

AdaGrad
SGD with adaptive learning rates per feature
Set initial learning rate
More likely to incorrectly converge on local minima

http://www.slideshare.net/cfregly/gradient-descent-back-propagation-and-auto-differentiation-advanced-spark-and-tensorflow-meetup-08042016
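
In TensorFlow terms, the variants above map to different optimizer objects. A hedged sketch, assuming the `tf.train` optimizer classes of the 1.x API and a toy loss defined inline:

```python
# Hedged sketch of optimizer variants using the TensorFlow 1.x tf.train API.
import tensorflow as tf

w = tf.Variable(5.0)
loss = tf.square(w - 2.0)                      # toy loss with its minimum at w = 2

sgd      = tf.train.GradientDescentOptimizer(learning_rate=0.1)
momentum = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
adagrad  = tf.train.AdagradOptimizer(learning_rate=0.1)   # adaptive per-feature rates

train_op = sgd.minimize(loss)                  # swap in momentum/adagrad to compare

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):                       # a mini-batch loop would feed batches here
        sess.run(train_op)
    print(sess.run(w))                         # approaches 2.0
```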

Page 16: Atlanta MLconf Machine Learning Conference 09-23-2016

The Math
Linear Algebra

Matrix Multiplication: Very Parallelizable

Calculus: Derivatives, Chain Rule

Page 17: Atlanta MLconf Machine Learning Conference 09-23-2016

Convolutional Neural Networks
Feed-forward

Do not form a cycle

Apply Many Layers (aka. Filters) to Input

Each Layer/Filter Picks Up on Features
Features are not necessarily human-grokkable

Examples of Human-grokkable Filters
3 color filters: RGB
Moving average for a time series

Brute Force: Try different numLayers & layerSizes
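
The moving-average filter mentioned above is a good intuition pump: sliding a small averaging kernel over a 1-D series is the same filter operation a CNN layer applies, except the CNN learns its kernel values. A small illustration (not from the slides):

```python
import numpy as np

series = np.array([1.0, 2.0, 6.0, 2.0, 1.0, 7.0, 3.0])
kernel = np.ones(3) / 3.0                      # a human-grokkable "filter": 3-point moving average

smoothed = np.convolve(series, kernel, mode="valid")
print(smoothed)                                # [3.   3.33 3.   3.33 3.67]

# A CNN layer does the same sliding-window operation, but *learns* its kernel values,
# and the learned features are not necessarily human-grokkable.
```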

Page 18: Atlanta MLconf Machine Learning Conference 09-23-2016

CNN Use Case: Stitch Fix

Stitch Fix Also Uses NLP to Analyze Return/Reject Comments

Stitch Fix @ Strata Conf SF 2016: Using Deep Learning to Create New Clothing Styles!

Page 19: Atlanta MLconf Machine Learning Conference 09-23-2016

Recurrent Neural Networks
Forms a Cycle (vs. Feed-forward)

Maintains State over Time: Keeps track of context

Learns sequential patterns

Decay over time

Use Cases: Speech

Text/NLP Prediction
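
A minimal NumPy sketch (my own, with illustrative weight shapes) of how an RNN maintains state: each step's hidden state depends on both the current input and the previous hidden state.

```python
import numpy as np

hidden_size, input_size = 4, 3
Wxh = np.random.randn(hidden_size, input_size) * 0.1   # input -> hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden (the cycle)
b   = np.zeros(hidden_size)

h = np.zeros(hidden_size)                 # state carried over time ("context")
for x in np.random.randn(5, input_size):  # a sequence of 5 input vectors
    h = np.tanh(Wxh @ x + Whh @ h + b)    # new state mixes current input with prior context
print(h)
```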

Page 20: Atlanta MLconf Machine Learning Conference 09-23-2016

RNN Sequences

Input: Image -> Output: Classification

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Input: Image -> Output: Text (Captions)

Input: Text -> Output: Class (Sentiment)

Input: Text (English) -> Output: Text (Spanish)

(Diagram: Input Layer -> Hidden Layer -> Output Layer)

Page 21: Atlanta MLconf Machine Learning Conference 09-23-2016

Character-based RNNs
Tokens are Characters vs. Words/Phrases

Microsoft trains every 3 characters

Fewer Combinations of Possible Neighbors: Only 26 alpha character tokens vs. millions of word tokens

Preserving state between the 1st and 2nd ‘l’ improves the prediction

Page 22: Atlanta MLconf Machine Learning Conference 09-23-2016

Long Short Term Memory (LSTM)

More Complex State Update Function than a Vanilla RNN

Page 23: Atlanta MLconf Machine Learning Conference 09-23-2016

LSTM State Update

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Cell State

Forget Gate Layer (Sigmoid)

Input Gate Layer (Sigmoid)

Candidate Gate Layer (tanh)

Output Layer
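
A NumPy sketch of one LSTM state update, following the gate structure above (and the colah.github.io post linked on the slide); the weight shapes and helper names are my own illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W, b):
    # W/b hold parameters for the four gates: forget (f), input (i), candidate (c), output (o)
    hx = np.concatenate([h_prev, x])
    f       = sigmoid(W["f"] @ hx + b["f"])   # forget gate layer (sigmoid)
    i       = sigmoid(W["i"] @ hx + b["i"])   # input gate layer (sigmoid)
    C_tilde = np.tanh(W["c"] @ hx + b["c"])   # candidate gate layer (tanh)
    o       = sigmoid(W["o"] @ hx + b["o"])   # output gate layer (sigmoid)
    C = f * C_prev + i * C_tilde              # new cell state
    h = o * np.tanh(C)                        # new hidden state / output
    return h, C

hidden, inp = 4, 3
W = {k: np.random.randn(hidden, hidden + inp) * 0.1 for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h, C = lstm_step(np.random.randn(inp), np.zeros(hidden), np.zeros(hidden), W, b)
print(h, C)
```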

Page 24: Atlanta MLconf Machine Learning Conference 09-23-2016

Transfer Learning

Page 25: Atlanta MLconf Machine Learning Conference 09-23-2016

Agenda

TensorFlow + Neural Nets

NLP Fundamentals

NLP Models

Page 26: Atlanta MLconf Machine Learning Conference 09-23-2016

Use Cases
Document Summary

TextRank: TF/IDF + PageRank

Article Classification and Similarity
LDA: calculate top `k` topic distribution

Machine Translation
word2vec: compare word embedding vectors

Must Convert Text to Numbers!

Page 27: Atlanta MLconf Machine Learning Conference 09-23-2016

Core Concepts

Corpus: Collection of text, e.g. documents, articles, genetic codes

Embeddings
Tokens represented/embedded in a vector space
Learned, hidden features (~PCA, SVD)
Similar tokens cluster together, analogies cluster apart

k-skip-gram: Skip `k` neighbors when defining tokens

n-gram: Treat `n` consecutive tokens as a single token

Composable: 1-skip bi-gram (every other word)
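
A quick illustration (not from the slides) of the n-gram and k-skip-gram ideas on a toy sentence, including the composed 1-skip bi-gram ("every other word"):

```python
tokens = "the quick brown fox jumps".split()

# n-gram: treat n consecutive tokens as a single token (here, bi-grams)
bigrams = [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
print(bigrams)
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]

# 1-skip bi-gram: pair each token with the token two positions away (skip one neighbor)
skip_bigrams = [(tokens[i], tokens[i + 2]) for i in range(len(tokens) - 2)]
print(skip_bigrams)
# [('the', 'brown'), ('quick', 'fox'), ('brown', 'jumps')]
```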

Page 28: Atlanta MLconf Machine Learning Conference 09-23-2016

Parsers and POS Taggers

Describe grammatical sentence structure

Requires context of entire sentence

Helps reason about sentence

80% obvious, simple token neighbors

Major bottleneck in NLP pipeline!

Page 29: Atlanta MLconf Machine Learning Conference 09-23-2016

Pre-trained Parsers and Taggers
Penn Treebank

Parser and Part-of-Speech Tagger
Human-annotated (!)
Trained on 4.5 million words

Parsey McParseface: Trained by SyntaxNet

Page 30: Atlanta MLconf Machine Learning Conference 09-23-2016

Feature Engineering
Lower-case

Preserve proper nouns using a caret (`^`)
“MLconf” => “^m^lconf”
“Varsity” => “^varsity”

Encode Common N-grams (Phrases)
Create a single token using an underscore (`_`)
“Senior Developer” => “senior_developer”

Stemming and Lemmatization
Try to avoid: let the neural network figure this out
Can preserve part of speech (POS) using “_noun”, “_verb”
“banking” => “banking_verb”
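
A small Python sketch of the encodings above (the helper names are my own): caret-marking capital letters so text can be lower-cased without losing proper nouns, and joining known phrases with underscores.

```python
import re

def encode_case(text):
    # Mark each capital letter with a caret, then lower-case it,
    # so "MLconf" survives lower-casing as "^m^lconf"
    return re.sub(r"[A-Z]", lambda m: "^" + m.group(0).lower(), text)

def encode_phrase(text, phrase):
    # Turn a known multi-word phrase into a single underscore-joined token
    return text.replace(phrase, phrase.lower().replace(" ", "_"))

print(encode_case("MLconf"))                       # ^m^lconf
print(encode_case("Varsity"))                      # ^varsity
print(encode_phrase("Senior Developer wanted", "Senior Developer"))
# senior_developer wanted
```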

Page 31: Atlanta MLconf Machine Learning Conference 09-23-2016

Agenda

TensorFlow + Neural Nets

NLP Fundamentals

NLP Models

Page 32: Atlanta MLconf Machine Learning Conference 09-23-2016

Count-based Models
Goal: Convert Text to a Vector of Neighbor Co-occurrences

Bag of Words (BOW)
Simple hashmap with word counts
Loses neighbor context

Term Frequency / Inverse Document Frequency (TF/IDF)
Normalizes based on token frequency

GloVe
Matrix factorization on the co-occurrence matrix

Highly parallelizable; reduces dimensions; captures global co-occurrence stats
Log smoothing of probability ratios

Stores word vector diffs for fast analogy lookups
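
A bare-bones sketch (my own) of the first two count-based ideas: a bag-of-words hashmap, and a TF/IDF reweighting that down-weights tokens common across the corpus.

```python
import math
from collections import Counter

docs = ["spark is fast", "spark streaming is fast", "tensorflow is popular"]
tokenized = [d.split() for d in docs]

# Bag of Words: a simple hashmap of word counts (neighbor context is lost)
bow = [Counter(tokens) for tokens in tokenized]
print(bow[0])   # Counter({'spark': 1, 'is': 1, 'fast': 1})

# TF/IDF: normalize term frequency by how many documents contain the term
def tf_idf(term, tokens, corpus):
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log(len(corpus) / df)

print(tf_idf("spark", tokenized[0], tokenized))   # > 0: somewhat distinctive
print(tf_idf("is", tokenized[0], tokenized))      # 0.0: appears in every document
```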

Page 33: Atlanta MLconf Machine Learning Conference 09-23-2016

Neural-based Predictive Models
Goal: Predict Text using Learned Embedding Vectors

word2vec
Shallow neural network
Local: nearby words predict each other
Fixed word embedding vector size (e.g. 300)
Optimizer: Mini-batch Stochastic Gradient Descent (SGD)

SyntaxNet
Deep(er) neural network
Global(er)
Not a Recurrent Neural Net (RNN)!
Can combine with BOW-based models (e.g. word2vec CBOW)

Page 34: Atlanta MLconf Machine Learning Conference 09-23-2016

word2vec

CBOW word2vec
Predict the target word from the source context
A single source context is an observation
Loses useful distribution information
Good for small datasets

Skip-gram word2vec (Inverse of CBOW)
Predict source context words from the target word
Each (source context, target word) tuple is an observation
Better for large datasets
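
A hedged gensim sketch of the two variants, assuming a pre-4.0 gensim API (later releases renamed `size` to `vector_size`): `sg=0` trains CBOW, `sg=1` trains skip-gram.

```python
from gensim.models import Word2Vec

sentences = [["spark", "is", "fast"],
             ["tensorflow", "trains", "neural", "nets"],
             ["spark", "streaming", "is", "fast"]]

# sg=0 -> CBOW (predict target from context), sg=1 -> skip-gram (predict context from target)
cbow     = Word2Vec(sentences, size=100, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, size=100, window=2, min_count=1, sg=1)

print(skipgram.wv.most_similar("spark", topn=3))
```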

Page 35: Atlanta MLconf Machine Learning Conference 09-23-2016

word2vec Libraries

gensim: Python only; most popular

Spark ML: Python + Java/Scala; supports only synonyms

Page 36: Atlanta MLconf Machine Learning Conference 09-23-2016

*2vec

lda2vec: LDA (global) + word2vec (local); from Chris Moody @ Stitch Fix

like2vec: Embedding-based recommender

Page 37: Atlanta MLconf Machine Learning Conference 09-23-2016

word2vec vs. GloVe
Both are Fundamentally Similar

Capture local co-occurrence statistics (neighbors)
Capture distances between embedding vectors (analogies)

GloVe
Count-based
Also captures global co-occurrence statistics
Requires an upfront pass through the entire dataset

Page 38: Atlanta MLconf Machine Learning Conference 09-23-2016

SyntaxNet POS Tagging
Determine the coarse-grained grammatical role of each word
Multiple contexts, multiple roles

Neural Net Inputs: stack, buffer

Results: POS probability distribution

Already Tagged

Page 39: Atlanta MLconf Machine Learning Conference 09-23-2016

SyntaxNet Dependency Parser
Determine fine-grained roles using grammatical relationships
“Transition-based”, Incremental Dependency Parser

Globally Normalized using Beam Search with Early Update

Parsey McParseface: Pre-trained parser/tagger available in 40 languages

(Diagram: fine-grained vs. coarse-grained roles)

Page 40: Atlanta MLconf Machine Learning Conference 09-23-2016

SyntaxNet Use Case: Nutrition
Nutrition and Health Startup in SF (Stealth)

Using Google’s SyntaxNet

Rate Recipes and Menus by Nutritional Value

(Diagram: correct vs. incorrect parses)

Page 41: Atlanta MLconf Machine Learning Conference 09-23-2016

Model Validation
Unsupervised Learning Requires Validation

Google has Published Analogy Tests for Model Validation

Thanks, Google!
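
The published analogy questions (e.g., king - man + woman ≈ queen) can be checked against a trained model. A hedged gensim-style sketch, assuming pre-trained vectors are available at a hypothetical local path:

```python
from gensim.models import KeyedVectors

# Hypothetical path to pre-trained embeddings in word2vec text format
vectors = KeyedVectors.load_word2vec_format("embeddings.txt", binary=False)

# Analogy check: king - man + woman should land near queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Google's published analogy test set (questions-words.txt) can be run in bulk;
# gensim ships an accuracy/evaluation helper for exactly this kind of validation.
```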

Page 42: Atlanta MLconf Machine Learning Conference 09-23-2016

Thank You, Atlanta!
Chris Fregly, Research Scientist @ PipelineIO

All Source Code, Demos, and Docker Images @ pipeline.io

Join the Global Meetup for all Slides and Videos @ advancedspark.com