Atlanta MLconf Machine Learning Conference 09-23-2016


MLconf ATL! Sept 23rd, 2016

Chris Fregly, Research Scientist @ PipelineIO

Who am I?

Chris Fregly, Research Scientist @ PipelineIO, San Francisco

Previously, Engineer @ Netflix, Databricks, and IBM Spark

Contributor @ Apache Spark, Committer @ Netflix OSS

Founder @ Advanced Spark and TensorFlow Meetup

Author @ Advanced Spark (advancedspark.com)

Advanced Spark and Tensorflow Meetup

ATL Spark Meetup (9/22)

http://www.slideshare.net/cfregly/atlanta-spark-user-meetup-09-22-2016

ATL Hadoop Meetup (9/21)

http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016

Confession #1

I Failed Linguistics in College! Chose the Pass/Fail Option

(90 (mid-term) + 70 (final)) / 2 = 80 = C+. How did a C+ turn into an F?

ZERO (0) CLASS PARTICIPATION?!

Confession #2

I Hated Statistics in College

2 Degrees: Mechanical + Manufacturing Engineering. Approximations were Bad!

I Wasn’t a Fluffy Physics Major. Though, I Kinda Wish I Was!

Wait… Please Don’t Leave! I’m Older and Wiser Now

Approximate is the New Exact

Computational Linguistics and NLP are My Jam!

Agenda

Tensorflow + Neural Nets

NLP Fundamentals

NLP Models

What is Tensorflow? A General-Purpose Numerical Computation Engine

Happens to be good for neural nets!

Tooling: Tensorboard (port 6006 == `goog`)

DAG-based like Spark! The computation graph is the logical plan

Stored in Protobufs

TF converts logical -> physical plan (see the sketch below)

Lots of Libraries: TFLearn (Tensorflow’s Scikit-learn Impl)

Tensorflow Serving (Prediction Layer)

Distributed and GPU-Optimized
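
A minimal sketch (not from the talk) of the logical-plan / physical-plan flow, using the TensorFlow 1.x-style graph API; the op names and log directory below are illustrative:

```python
import tensorflow as tf

# Build the logical plan: a DAG of ops stored as Protobufs (nothing executes yet).
a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
c = a * b  # adds a multiply node to the computation graph

# Write the graph so Tensorboard (port 6006) can visualize it.
tf.summary.FileWriter("/tmp/tf_logs", tf.get_default_graph())

# Running the graph in a Session is where TF converts the logical plan
# into a physical plan and executes it (on CPU or GPU).
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 3.0, b: 4.0}))  # 12.0
```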

What are Neural Networks? Like All ML, the Goal is to Minimize Loss (Error)

Error relative to known outcome of labeled data

Mostly Supervised Learning Classification: Labeled training data

Training Steps. Step 1: Randomly Guess Input Weights

Step 2: Calculate Error Against Labeled Data

Step 3: Determine Gradient Value, +/- Direction

Step 4: Back-propagate Gradient to Update Each Input Weight

Step 5: Repeat Step 1 with New Weights until Convergence
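
A toy NumPy illustration of Steps 1-5, fitting a single weight to made-up labeled data (y = 3x) with gradient descent:

```python
import numpy as np

# Labeled training data for the made-up target y = 3x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 6.0, 9.0, 12.0])

w = np.random.randn()   # Step 1: randomly guess the input weight
learning_rate = 0.01

for step in range(200):
    y_pred = w * x
    error = np.mean((y_pred - y) ** 2)        # Step 2: error vs. labeled data
    gradient = np.mean(2 * (y_pred - y) * x)  # Step 3: gradient value and +/- direction
    w -= learning_rate * gradient             # Step 4: back-propagate to update the weight
    # Step 5: repeat with the new weight until the error converges

print(round(w, 3))  # approaches 3.0
```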


Activation Functions. Goal: Learn and Train a Model on Input Data

Non-Linear Functions Find Non-Linear Fit of Input Data

Common Activation Functions:

Sigmoid Function (sigmoid): range (0, 1)

Hyperbolic Tangent (tanh): range (-1, 1)
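
A quick NumPy sketch of the two activation functions and their output ranges:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real input into the range (-1, 1)
    return np.tanh(z)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))  # ~[0.007, 0.5, 0.993]
print(tanh(z))     # ~[-0.9999, 0.0, 0.9999]
```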

Back Propagation

http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html

Gradients Calculated by Comparing to Known Label

Use Gradients to Adjust Input Weights

Chain Rule
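
A small worked example (values made up) of the chain rule behind back propagation, with a numerical check of the analytic gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, x, y = 0.5, 2.0, 1.0          # weight, input, known label
p = sigmoid(w * x)               # forward pass: prediction
loss = (p - y) ** 2              # squared-error loss

# Chain rule: dLoss/dw = dLoss/dp * dp/dz * dz/dw
grad_analytic = 2 * (p - y) * p * (1 - p) * x

# Numerical gradient as a sanity check
eps = 1e-6
loss_plus = (sigmoid((w + eps) * x) - y) ** 2
grad_numeric = (loss_plus - loss) / eps

print(grad_analytic, grad_numeric)  # the two values should closely agree
```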

Loss/Error Optimizers: Gradient Descent

Batch (entire dataset)
Per-record (don’t do this!)
Mini-batch (empirically 16 -> 512)
Stochastic (approximation)
Momentum (optimization)

AdaGrad: SGD with adaptive learning rates per feature (see the sketch below)
Set the initial learning rate
More likely to incorrectly converge on local minima

http://www.slideshare.net/cfregly/gradient-descent-back-propagation-and-auto-differentiation-advanced-spark-and-tensorflow-meetup-08042016
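
A rough sketch (not from the talk) of mini-batch SGD with an AdaGrad-style per-feature adaptive learning rate; the data, batch size, and learning rate below are made up:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 3)               # 1000 examples, 3 features (synthetic)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
grad_history = np.zeros(3)                 # AdaGrad: accumulated squared gradients per feature
base_lr, batch_size, eps = 0.1, 32, 1e-8   # mini-batch size in the empirical 16 -> 512 range

for step in range(500):
    idx = np.random.choice(len(X), batch_size, replace=False)  # stochastic mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size                # gradient on the mini-batch
    grad_history += grad ** 2
    w -= base_lr * grad / (np.sqrt(grad_history) + eps)         # per-feature adaptive step

print(w)  # should approach [1.0, -2.0, 0.5]
```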

The Math: Linear Algebra

Matrix Multiplication: Very Parallelizable

Calculus: Derivatives, Chain Rule

Convolutional Neural Networks: Feed-forward

Do not form a cycle

Apply Many Layers (aka. Filters) to Input

Each Layer/Filter Picks up on Features. Features are not necessarily human-grokkable

Examples of Human-grokkable Filters: 3 color filters (RGB); Moving AVG for time series (see the sketch below)

Brute Force: Try Different numLayers & layerSizes
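
As a quick illustration of the moving-average filter mentioned above, here is a sketch expressing it as a 1-D convolution over a made-up time series:

```python
import numpy as np

series = np.array([1.0, 2.0, 3.0, 10.0, 3.0, 2.0, 1.0])  # made-up time series
window = np.ones(3) / 3.0                                 # 3-point moving-average filter

# Sliding the filter across the input is exactly what a 1-D convolutional layer does
smoothed = np.convolve(series, window, mode="valid")
print(smoothed)  # [2.0, 5.0, 5.33, 5.0, 2.0]
```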

CNN Use Case: Stitch Fix

Stitch Fix Also Uses NLP to Analyze Return/Reject Comments

StitchFix at Strata Conf SF 2016: Using Deep Learning to Create New Clothing Styles!

Recurrent Neural Networks: Form a Cycle (vs. Feed-forward)

Maintains State over Time: Keeps track of context

Learns sequential patterns

Decay over time

Use Cases: Speech

Text/NLP Prediction

RNN Sequences

Input: Image -> Output: Classification

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Input: Image -> Output: Text (Captions)

Input: Text -> Output: Class (Sentiment)

Input: Text (English) -> Output: Text (Spanish)

(Diagram: Input Layer -> Hidden Layer -> Output Layer)

Character-based RNNs: Tokens are Characters vs. Words/Phrases

Microsoft trains every 3 characters

Fewer Combinations of Possible Neighbors: Only 26 alpha character tokens vs. millions of word tokens

Preserving state between the 1st and 2nd ‘l’ improves prediction

Long Short Term Memory (LSTM)

More Complex State Update Function than a Vanilla RNN

LSTM State Update

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Cell State

Forget Gate Layer (sigmoid)

Input Gate Layer (sigmoid)

Candidate Gate Layer (tanh)

Output Layer
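
A NumPy sketch (not the talk's code) of the state update those gates perform for one LSTM step; the weight shapes are illustrative and bias terms are omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o):
    z = np.concatenate([h_prev, x_t])    # each gate sees the previous hidden state + current input
    f_t = sigmoid(W_f @ z)               # forget gate: what to drop from the cell state
    i_t = sigmoid(W_i @ z)               # input gate: what to write
    c_tilde = np.tanh(W_c @ z)           # candidate gate: proposed new content
    c_t = f_t * c_prev + i_t * c_tilde   # cell state update
    o_t = sigmoid(W_o @ z)               # output gate
    h_t = o_t * np.tanh(c_t)             # new hidden state / output
    return h_t, c_t

# Tiny example: input size 2, hidden size 3
rng = np.random.RandomState(0)
W_f, W_i, W_c, W_o = (rng.randn(3, 5) for _ in range(4))
h, c = np.zeros(3), np.zeros(3)
h, c = lstm_step(rng.randn(2), h, c, W_f, W_i, W_c, W_o)
print(h, c)
```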

Transfer Learning

Agenda

Tensorflow + Neural Nets

NLP Fundamentals

NLP Models

Use Cases: Document Summary

TextRank: TF/IDF + PageRank

Article Classification and Similarity. LDA: calculate the top `k` topic distribution

Machine Translation. word2vec: compare word embedding vectors

Must Convert Text to Numbers!

Core Concepts: Corpus

Collection of text, e.g. documents, articles, genetic codes

Embeddings: Tokens represented/embedded in a vector space
Learned, hidden features (~PCA, SVD)
Similar tokens cluster together, analogies cluster apart

k-skip-gram: Skip `k` neighbors when defining tokens

n-gram: Treat `n` consecutive tokens as a single token

Composable: 1-skip bi-gram (every other word)
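
A small Python sketch of the n-gram and k-skip-gram ideas; the helper functions and sentence are illustrative, not from any library:

```python
def ngrams(tokens, n):
    # Treat n consecutive tokens as a single token
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, k):
    # Pairs of tokens that skip up to k neighbors (k-skip-bi-grams)
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))        # [('the','quick'), ('quick','brown'), ('brown','fox')]
print(skip_bigrams(tokens, 1))  # also includes ('the','brown'), i.e. every other word
```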

Parsers and POS Taggers

Describe grammatical sentence structure

Requires context of entire sentence

Helps reason about sentence

~80% of cases are obvious from simple token neighbors

Major bottleneck in NLP pipeline!

Pre-trained Parsers and Taggers: Penn Treebank

Parser and Part-of-Speech Tagger
Human-annotated (!)
Trained on 4.5 million words

Parsey McParseface: Trained by SyntaxNet

Feature Engineering: Lower-case

Preserve proper nouns using a caret (`^`): “MLconf” => “^m^lconf”, “Varsity” => “^varsity”

Encode Common N-grams (Phrases): Create a single token using an underscore (`_`): “Senior Developer” => “senior_developer”

Stemming and Lemmatization: Try to avoid; let the neural network figure this out
Can preserve part of speech (POS) using “_noun”, “_verb”: “banking” => “banking_verb”
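
A rough Python sketch of the conventions above; the helper names (`encode_proper_nouns`, `encode_phrase`) are made up for illustration:

```python
import re

def encode_proper_nouns(token):
    # Preserve capitalization with a caret before lower-casing: "MLconf" => "^m^lconf"
    return re.sub(r"[A-Z]", lambda m: "^" + m.group(0).lower(), token)

def encode_phrase(phrase):
    # Join a common n-gram into a single token: "Senior Developer" => "senior_developer"
    return "_".join(t.lower() for t in phrase.split())

print(encode_proper_nouns("MLconf"))      # ^m^lconf
print(encode_proper_nouns("Varsity"))     # ^varsity
print(encode_phrase("Senior Developer"))  # senior_developer
print("banking" + "_verb")                # banking_verb (POS preserved as a suffix)
```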

Agenda

Tensorflow + Neural Nets

NLP Fundamentals

NLP Models

Count-based Models. Goal: Convert Text to a Vector of Neighbor Co-occurrences

Bag of Words (BOW): Simple hashmap with word counts; loses neighbor context

Term Frequency / Inverse Document Frequency (TF/IDF): Normalizes based on token frequency

GloVe: Matrix factorization on the co-occurrence matrix

Highly parallelizable; reduces dimensions; captures global co-occurrence stats; log smoothing of probability ratios

Stores word vector diffs for fast analogy lookups
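
A minimal bag-of-words and TF-IDF sketch in plain Python on a toy corpus (a real pipeline would use a library such as scikit-learn or Spark ML):

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
docs = [doc.split() for doc in corpus]

# Bag of Words: a simple hashmap of word counts (neighbor context is lost)
bow = [Counter(doc) for doc in docs]

# TF/IDF: normalize term frequency by how many documents contain the token
def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

print(bow[0]["the"])                 # 2
print(tf_idf("cat", docs[0], docs))  # higher: "cat" appears in only one document
print(tf_idf("sat", docs[0], docs))  # lower: "sat" appears in two of three documents
```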

Neural-based Predictive Models. Goal: Predict Text using Learned Embedding Vectors

word2vec: Shallow neural network
Local: nearby words predict each other
Fixed word embedding vector size (e.g. 300)
Optimizer: Mini-batch Stochastic Gradient Descent (SGD)

SyntaxNet: Deep(er) neural network
Global(er)
Not a Recurrent Neural Net (RNN)!
Can combine with BOW-based models (e.g. word2vec CBOW)

word2vec

CBOW word2vec: Predict the target word from the source context
A single source context is one observation
Loses useful distribution information
Good for small datasets

Skip-gram word2vec (Inverse of CBOW): Predict source context words from the target word
Each (source context, target word) tuple is one observation
Better for large datasets

word2vec Libraries

gensim: Python only; most popular

Spark ML: Python + Java/Scala; supports only synonyms
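
A hedged gensim sketch of training skip-gram word2vec on a toy corpus; parameter names follow the gensim releases of that era (`size` was later renamed `vector_size`):

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens (made up for illustration)
sentences = [
    ["machine", "learning", "with", "tensorflow"],
    ["deep", "learning", "for", "nlp"],
    ["word", "embeddings", "capture", "word", "context"],
]

# sg=1 selects skip-gram (sg=0 would be CBOW); size is the embedding dimension
model = Word2Vec(sentences, size=100, window=2, sg=1, min_count=1)

print(model.wv["learning"][:5])                   # first few embedding dimensions
print(model.wv.most_similar("learning", topn=2))  # nearest tokens in the vector space
```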

*2vec

lda2vec: LDA (global) + word2vec (local); from Chris Moody @ Stitch Fix

like2vec: Embedding-based Recommender

word2vec vs. GloVe: Both are Fundamentally Similar

Capture local co-occurrence statistics (neighbors)
Capture distances between embedding vectors (analogies)

GloVe: Count-based
Also captures global co-occurrence statistics
Requires an upfront pass through the entire dataset

SyntaxNet POS Tagging: Determine the coarse-grained grammatical role of each word
Multiple contexts, multiple roles

Neural Net Inputs: stack, buffer

Results: POS probability distribution


SyntaxNet Dependency Parser: Determine fine-grained roles using grammatical relationships
“Transition-based”, Incremental Dependency Parser

Globally Normalized using Beam Search with Early Update

Parsey McParseface: Pre-trained Parser/Tagger available in 40 languages

(Diagram: fine-grained vs. coarse-grained roles)

SyntaxNet Use Case: Nutrition. A Nutrition and Health Startup in SF (Stealth)

Using Google’s SyntaxNet

Rate Recipes and Menus by Nutritional Value

(Examples of correct and incorrect parses)

Model Validation: Unsupervised Learning Requires Validation

Google has Published Analogy Tests for Model Validation

Thanks, Google!
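
A hedged sketch of the analogy-style check (the classic king - man + woman ~ queen test from the published analogy set), assuming `model` is a gensim word2vec model trained on a large corpus:

```python
# `model` is assumed to be a word2vec model trained on a large corpus;
# a toy model like the earlier sketch will not contain these words.
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # a well-trained model should rank "queen" at or near the top
```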

Thank You, Atlanta! Chris Fregly, Research Scientist @ PipelineIO

All Source Code, Demos, and Docker Images @ pipeline.io

Join the Global Meetup for all Slides and Videos @ advancedspark.com