Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016


Transcript of Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

  • MLconf ATL! Sept 23rd, 2016

    Chris Fregly, Research Scientist @ PipelineIO

  • Who am I?

    Chris Fregly, Research Scientist @ PipelineIO, San Francisco

    Previously, Engineer @ Netflix, Databricks, and IBM Spark

    Contributor @ Apache Spark, Committer @ Netflix OSS

    Founder @ Advanced Spark and TensorFlow Meetup

    Author @ Advanced Spark (

  • Advanced Spark and Tensorflow Meetup

  • ATL Spark Meetup (9/22)

  • ATL Hadoop Meetup (9/21)

  • Confession #1

    I Failed Linguistics in College!

    Chose Pass/Fail Option

    (90 (mid-term) + 70 (final)) / 2 = 80 = C+

    How did a C+ turn into an F?


  • Confession #2

    I Hated Statistics in College

    2 Degrees: Mechanical + Manufacturing Engineering

    Approximations were Bad!

    I Wasn't a Fluffy Physics Major

    Though, I Kinda Wish I Was!

  • Wait... Please Don't Leave!

    I'm Older and Wiser Now

    Approximate is the New Exact

    Computational Linguistics and NLP are My Jam!

  • Agenda

    Tensorflow + Neural Nets

    NLP Fundamentals

    NLP Models

  • What is Tensorflow?

    General Purpose Numerical Computation Engine

    Happens to be good for neural nets!

    Tooling: Tensorboard (port 6006 == `goog`)

    DAG-based like Spark!

    Computation graph is logical plan

    Stored in Protobufs

    TF converts logical -> physical plan

    Lots of Libraries

    TFLearn (Tensorflow's Scikit-learn Impl)

    Tensorflow Serving (Prediction Layer)

    Distributed and GPU-Optimized

  • What are Neural Networks?

    Like All ML, Goal is to Minimize Loss (Error)

    Error relative to known outcome of labeled data

    Mostly Supervised Learning Classification

    Labeled training data

    Training Steps:

    Step 1: Randomly Guess Input Weights

    Step 2: Calculate Error Against Labeled Data

    Step 3: Determine Gradient Value, +/- Direction

    Step 4: Back-propagate Gradient to Update Each Input Weight

    Step 5: Repeat from Step 2 with New Weights until Convergence
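
The training steps on this slide can be sketched for a single sigmoid neuron. This is only an illustration, not code from the talk: the toy OR dataset, the learning rate, the squared-error loss, and the names `train` and `predict` are all assumptions chosen to keep the example small.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Output of one sigmoid neuron for input vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(data, epochs=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly guess the input weights.
    weights = [rng.uniform(-1, 1) for _ in data[0][0]]
    bias = 0.0
    for _ in range(epochs):
        for x, label in data:
            # Step 2: calculate the error against the labeled data.
            out = predict(weights, bias, x)
            error = out - label
            # Step 3: gradient of the squared error w.r.t. the pre-activation;
            # its sign gives the +/- direction of the update.
            grad = error * out * (1.0 - out)
            # Step 4: propagate the gradient back to each input weight.
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
        # Step 5: loop again with the updated weights.
    return weights, bias


# Hypothetical toy dataset: logical OR.
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(OR_DATA)
for x, label in OR_DATA:
    print(x, label, round(predict(weights, bias, x), 3))
```

A fixed epoch count stands in for a real convergence check here; in practice one would stop when the loss stops improving.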


  • Activation Functions

    Goal: Learn and Train a Model on Input Data

    Non-Linear Functions Find Non-Linear Fit of Input Data

    Common Activation Functions:

    Sigmoid Function (sigmoid): output range (0, 1)

    Hyperbolic Tangent (tanh): output range (-1, 1)
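
As a quick check of the two output ranges named on this slide, the functions can be written directly with the standard library. The sample inputs are arbitrary.

```python
import math


def sigmoid(z):
    """Squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any real input into the open interval (-1, 1)."""
    return math.tanh(z)


for z in (-4.0, 0.0, 4.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  tanh={tanh(z):+.3f}")
```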

  • Back Propagation

    Gradients Calculated by Comparing to Known Label

    Use Gradients to Adjust Input Weights

    Chain Rule
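
To make the chain rule concrete, the sketch below computes the gradient for one weight of one sigmoid neuron as a product of three local derivatives, then compares it with a finite-difference estimate. The squared-error loss and the specific numbers are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, label):
    # Squared error of a single sigmoid neuron with one weight.
    return 0.5 * (sigmoid(w * x) - label) ** 2


def chain_rule_gradient(w, x, label):
    z = w * x
    a = sigmoid(z)
    dloss_da = a - label      # compare the output to the known label
    da_dz = a * (1.0 - a)     # derivative of the sigmoid
    dz_dw = x                 # derivative of w * x with respect to w
    return dloss_da * da_dz * dz_dw


def numeric_gradient(w, x, label, eps=1e-6):
    # Central finite difference, used only as a sanity check.
    return (loss(w + eps, x, label) - loss(w - eps, x, label)) / (2 * eps)


w, x, label = 0.3, 2.0, 1.0
analytic = chain_rule_gradient(w, x, label)
numeric = numeric_gradient(w, x, label)
print(analytic, numeric)
```

The gradient is negative in this case because the output is below the label, so a gradient-descent step increases the weight.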

  • Loss/Error Optimizers

    Gradient Descent: Batch (entire dataset); Per-record (don't do this!); Mini-batch (empirically 16 -> 512); Stochastic (approximation); Momentum (optimization)

    AdaGrad: SGD with adaptive learning rates per feature; Set initial learning rate; More likely to incorrectly converge on local minima
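
The momentum and AdaGrad update rules can be compared on a one-dimensional toy loss. This is a sketch under stated assumptions: the loss `(w - 3)^2`, the hyperparameters, and the function names are made up for illustration, and the full gradient is used at every step rather than mini-batches.

```python
import math


def grad(w):
    # Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)


def sgd_momentum(w=0.0, lr=0.1, momentum=0.9, steps=200):
    velocity = 0.0
    for _ in range(steps):
        # Keep a decaying running sum of past gradients.
        velocity = momentum * velocity - lr * grad(w)
        w += velocity
    return w


def adagrad(w=0.0, lr=1.0, steps=200, eps=1e-8):
    sum_sq = 0.0
    for _ in range(steps):
        g = grad(w)
        # Accumulate squared gradients; the effective learning rate
        # for this parameter shrinks as the sum grows.
        sum_sq += g * g
        w -= lr * g / (math.sqrt(sum_sq) + eps)
    return w


print(sgd_momentum(), adagrad())
```

With several parameters, AdaGrad keeps one accumulator per parameter, which is what gives each feature its own learning rate.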

  • The Math

    Linear Algebra: Matrix Multiplication (Very Parallelizable)

    Calculus: Derivatives, Chain Rule

  • Convolutional Neural Networks

    Feed-forward

    Do not form a cycle

    Apply Many Layers (aka Filters) to Input

    Each Layer/Filter Picks up on Features; Features not necessarily human-grokkable

    Examples of Human-grokkable Filters: 3 color filters (RGB); Moving AVG for time series

    Brute Force: Try Diff numLayers & layerSizes
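
The moving-average filter mentioned on this slide is a convenient human-readable example of a convolution filter. The sketch below slides a small kernel over a time series. The `convolve1d` helper, the sample series, and the second "edge detector" kernel are illustrative assumptions. A CNN learns its kernel values instead of having them written by hand.

```python
def convolve1d(signal, kernel):
    """Slide the kernel across the signal (no padding) and sum the products."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


# Two hand-written filters: a 3-point moving average and a simple edge detector.
moving_average = [1 / 3, 1 / 3, 1 / 3]
edge_detector = [-1.0, 0.0, 1.0]

series = [1.0, 2.0, 3.0, 10.0, 3.0, 2.0, 1.0]
smoothed = convolve1d(series, moving_average)
edges = convolve1d(series, edge_detector)
print(smoothed)
print(edges)
```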

  • CNN Use Case: Stitch Fix

    Stitch Fix Also Uses NLP to Analyze Return/Reject Comments

    Stitch Fix @ Strata Conf SF 2016: Using Deep Learning to Create New Clothing Styles!

  • Recurrent Neural Networks

    Forms a Cycle (vs. Feed-forward)

    Maintains State over Time: Keep track of context

    Learns sequential patterns

    Decay over time

    Use Cases: Speech; Text/NLP Prediction
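
The "cycle" and "state over time" can be shown with a one-unit vanilla RNN: each step feeds the previous hidden state back in. The weights below are arbitrary, hypothetical values, not trained ones.

```python
import math


def rnn_step(x, h_prev, w_x=0.8, w_h=0.5, b=0.0):
    """One vanilla RNN update: the new state mixes the input with the old state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(sequence):
    h = 0.0  # initial hidden state
    history = []
    for x in sequence:
        h = rnn_step(x, h)  # the state is carried from one step to the next
        history.append(h)
    return history


# A single impulse followed by silence: the state remembers it, but the memory decays.
history = run_rnn([1.0, 0.0, 0.0, 0.0])
print([round(h, 3) for h in history])
```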

  • RNN Sequences

    Input: Image; Output: Classification

    Input: Image; Output: Text (Captions)

    Input: Text; Output: Class (Sentiment)

    Input: Text (English); Output: Text (Spanish)




  • Character-based RNNs

    Tokens are Characters vs. Words/Phrases

    Microsoft trains every 3 characters

    Fewer Combinations of Possible Neighbors: Only 26 alpha character tokens vs. millions of word tokens

    Preserving state between the 1st and 2nd `l` improves prediction

  • Long Short Term Memory (LSTM)

    More Complex State Update


    Vanilla RNN

  • LSTM State Update

    Cell State

    Forget Gate Layer (Sigmoid)

    Input Gate Layer (Sigmoid)

    Candidate Gate Layer (tanh)
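
The gates listed on this slide can be combined into a single scalar LSTM step. This is a sketch of the commonly described LSTM formulation, not code from the talk. The `params` layout and weight values are hypothetical, and the output gate is added as an assumption because the slide does not list it.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps gate name -> (input weight, recurrent weight, bias)."""

    def gate(name, activation):
        w_x, w_h, b = p[name]
        return activation(w_x * x + w_h * h_prev + b)

    forget = gate("forget", sigmoid)          # how much of the old cell state to keep
    write = gate("input", sigmoid)            # how much of the candidate to write
    candidate = gate("candidate", math.tanh)  # proposed new cell content
    output = gate("output", sigmoid)          # assumption: not listed on the slide

    c = forget * c_prev + write * candidate   # cell state update
    h = output * math.tanh(c)                 # hidden state exposed to the next step
    return h, c


# Hypothetical, untrained weights shared by all gates for brevity.
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
states = []
for x in [1.0, 0.5, -1.0]:
    h, c = lstm_step(x, h, c, params)
    states.append((h, c))
print(states)
```

Compared with a vanilla RNN, which overwrites its state with one `tanh` at every step, the cell state here is changed only through the forget and input gates.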


  • Transfer Learning

  • Agenda

    Tensorflow + Neural Nets

    NLP Fundamentals

    NLP Models

  • Use Cases

    Document Summary

    TextRank: TF/IDF + PageRank

    Article Classification and Similarity

    LDA: calculate top `k` topic distribution

    Machine Translation

    word2vec: compare word embedding vectors

    Must Convert Text to Numbers!

  • Core Concepts

    Corpus: Collection of text, e.g. documents, articles, genetic codes

    Embeddings: Tokens represented/embedded in vector space; Learned, hidden features (~PCA, SVD); Similar tokens cluster together, analogies cluster apart

    k-skip-gram: Skip k neighbors when defining tokens

    n-gram: Treat n consecutive tokens as a single token

    Composable: 1-skip, bi-gram (every other word)
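
The n-gram and skip-gram definitions above, including the composed "1-skip, bi-gram" case, can be sketched in a few lines. The helper names are hypothetical. Note one assumption: `skip_ngrams` skips exactly `k` tokens between members, matching the slide's "every other word" reading, whereas some definitions allow up to `k` skips.

```python
def ngrams(tokens, n):
    """Treat every run of n consecutive tokens as a single token (tuple)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_ngrams(tokens, n, k):
    """n-grams whose members are exactly k tokens apart (k=0 gives plain n-grams)."""
    stride = k + 1
    span = (n - 1) * stride + 1  # how many original tokens one gram covers
    return [tuple(tokens[i:i + span:stride]) for i in range(len(tokens) - span + 1)]


tokens = "approximate is the new exact".split()
bigrams = ngrams(tokens, 2)
one_skip_bigrams = skip_ngrams(tokens, 2, 1)
print(bigrams)
print(one_skip_bigrams)
```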

  • Parsers and POS Taggers

    Describe grammatical sentence structure

    Requires context of entire sentence

    Helps reason about sentence

    80% obvious, simple token neighbors

    Major bottleneck in NLP pipeline!

  • Pre-trained Parsers and Taggers

    Penn Treebank: Parser and Part-of-Speech Tagger; Human-annotated (!); Trained on 4.5 million words

    Parsey McParseface: Trained by SyntaxNet

  • Feature Engineering

    Lower-case: Preserve proper nouns using caret (`^`), e.g. MLconf => ^m^lconf, Varsity => ^varsity

    Encode Common N-grams (Phrases): Create a single token using underscore (`_`), e.g. Senior Developer => senior_developer

    Stemming and Lemmatization: Try to avoid (let the neural network figure this out); Can preserve part of speech (POS) using _noun, _verb, e.g. banking => banking_verb
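
The caret and underscore encodings on this slide could be implemented roughly as follows. The function names and the phrase list are hypothetical. The slide gives only the input/output examples, which this sketch reproduces.

```python
import re


def mark_case(token):
    """Lower-case a token, prefixing each originally upper-case letter with '^'."""
    return "".join("^" + ch.lower() if ch.isupper() else ch for ch in token)


def join_phrases(text, phrases):
    """Replace each known phrase with a single underscore-joined, lower-cased token."""
    for phrase in phrases:
        pattern = re.compile(re.escape(phrase), flags=re.IGNORECASE)
        text = pattern.sub(phrase.lower().replace(" ", "_"), text)
    return text


print(mark_case("MLconf"))
print(mark_case("Varsity"))
print(join_phrases("Hiring a Senior Developer", ["senior developer"]))
```

The list of common phrases is assumed to come from an earlier pass over the corpus, for example by counting frequent bi-grams.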

  • Agenda

    Tensorflow + Neural Nets

    NLP Fundamentals

    NLP Models

  • Count-based Models

    Goal: Convert Text to Vector of Neighbor Co-occurrences

    Bag of Words (BOW): Simple hashmap with word counts; Loses neighbor context

    Term Frequency / Inverse Document Frequency (TF/IDF): Normalizes based on token frequency

    GloVe: Matrix factorization on co-occurrence matrix

    Highly parallelizable, reduce dimensions, capture global co-occurrence stats

    Log smoothing of probability ratios

    Stores word vector diffs for fast analogy lookups
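
Bag of Words and TF/IDF are small enough to sketch with the standard library. The three documents are made up, and the TF/IDF formula used here (relative term frequency times unsmoothed `log(N / df)`) is only one of several common variants.

```python
import math
from collections import Counter


def bag_of_words(doc):
    """A simple hashmap of word counts; word order is discarded."""
    return Counter(doc.lower().split())


def tf_idf(docs):
    bags = [bag_of_words(doc) for doc in docs]
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for bag in bags for token in bag)
    n_docs = len(docs)
    return [
        {
            token: (count / sum(bag.values())) * math.log(n_docs / doc_freq[token])
            for token, count in bag.items()
        }
        for bag in bags
    ]


docs = ["spark is fast", "tensorflow is flexible", "spark and tensorflow"]
scores = tf_idf(docs)
print(scores[0])
```

In the first document, "fast" scores higher than "spark" or "is" because it appears in no other document.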

  • Neural-based Predictive Models

    Goal: Predict Text using Learned Embedding Vectors

    word2vec: Shallow neural network; Local: nearby words predict each other; Fixed word embedding vector size (e.g. 300); Optimizer: Mini-batch Stochastic Gradient Descent (SGD)

    SyntaxNet: Deep(er) neural network; Global(er); Not a Recurrent Neural Net (RNN)!; Can combine with BOW-based models (e.g. word2vec CBOW)

  • word2vec

    CBOW word2vec: Predict target word from source context; A single source context is an observation; Loses useful distribution information; Good for small datasets

    Skip-gram word2vec (Inverse of CBOW): Predict source context words from target word; Each (source context, target word) tuple is an observation; Better for large datasets
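
The difference between the two variants shows up in how training observations are generated from the same sentence. The sketch below builds only the observations, not the neural network. The window size and helper names are assumptions.

```python
def context_windows(tokens, window=1):
    """Yield (context words, target word) for each position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


def cbow_examples(tokens, window=1):
    # CBOW: the whole context predicts the target; one observation per position.
    return [(tuple(context), target) for context, target in context_windows(tokens, window)]


def skipgram_examples(tokens, window=1):
    # Skip-gram: the target predicts each context word; one observation per pair.
    return [
        (target, word)
        for context, target in context_windows(tokens, window)
        for word in context
    ]


tokens = "approximate is the new exact".split()
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
print(cbow)
print(skipgram)
```

Skip-gram produces more observations from the same text (8 versus 5 here).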

  • word2vec Libraries

    gensim: Python only; Most popular

    Spark ML: Python + Java/Scala; Supports only synonyms

  • *2vec

    lda2vec: LDA (global) + word2vec (local); From Chris Moody @ Stitch Fix

    like2vec: Embedding-based Recommender

  • word2vec vs. GloVe

    Both are Fundamentally Similar

    Capture local co-occurrence statistics (neighbors); Capture distance between embedding vectors (analogies)

    GloVe: Count-based; Also captures global co-occurrence statistics; Requires upfront pass through entire dataset

  • SyntaxNet POS Tagging

    Determine coarse-grained grammatical role of each word

    Multiple contexts, multiple roles

    Neural Net Inputs: stack, buffer

    Results: POS probability distro

    Already Tagged

  • SyntaxNet Dependency Parser

    Determine fine-grained roles using grammatical relationships

    Transition-based, Incremental Dependency Parser

    Globally Normalized using Beam Search with Early Update

    Parsey McParseface: Pre-trained Parser/Tagger available in 40 languages
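
"Transition-based" means the parse is built by applying a sequence of small actions to a stack and a buffer. The sketch below applies a hand-written action sequence from a generic arc-standard-style system. It is an illustration of the idea and may differ from SyntaxNet's internals. In a real parser a neural network scores the next action from stack and buffer features, and beam search keeps several candidate sequences alive.

```python
def parse(words, transitions):
    """Apply arc-standard-style transitions; return a list of (head, dependent) arcs."""
    stack, buffer, arcs = [], list(words), []
    for action in transitions:
        if action == "SHIFT":
            # Move the next word from the buffer onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT_ARC":
            # Second-from-top becomes a dependent of the top of the stack.
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT_ARC":
            # Top of the stack becomes a dependent of the word beneath it.
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {action}")
    return arcs


# Hypothetical sentence and hand-chosen transitions (a trained model would predict these).
words = ["Chris", "loves", "Spark"]
transitions = ["SHIFT", "SHIFT", "LEFT_ARC", "SHIFT", "RIGHT_ARC"]
arcs = parse(words, transitions)
print(arcs)
```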



  • SyntaxNet Use Case: Nutrition

    Nutrition and Health Startup in SF (Stealth)

    Using Google's SyntaxNet

    Rate Recipes and Menus by Nutritional Value



  • Model Validation

    Unsupervised Learning Requires Validation

    Google has Published Analogy Tests for Model Validation

    Thanks, Google!
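
An analogy test asks "a is to b as c is to ?" and checks whether the nearest embedding to `b - a + c` is the expected word. The sketch below shows the mechanics on hand-made 2-D vectors. The vectors, the question list, and the function names are all invented for illustration and are not Google's published test set.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(embeddings, a, b, c):
    """Solve 'a is to b as c is to ?' via the nearest vector to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


def accuracy(embeddings, questions):
    correct = sum(analogy(embeddings, a, b, c) == expected for a, b, c, expected in questions)
    return correct / len(questions)


# Made-up vectors: axis 0 loosely means "royalty", axis 1 loosely means "gender".
toy_embeddings = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}

questions = [("man", "king", "woman", "queen"), ("king", "man", "queen", "woman")]
score = accuracy(toy_embeddings, questions)
print(score)
```

With trained embeddings, the same accuracy figure over a large question set gives a single number for comparing models.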

  • Thank You, Atlanta!

    Chris Fregly, Research Scientist @ PipelineIO

    All Source Code, Demos, and Docker Images @

    Join the Global Meetup for all Slides and [email protected]